Notebook

Final Project v1.7

Description and note

Task 1. Choose one specific type of product to analyze. (Product type information is already merged in the dataframe merged_df, we can just filter with the columns ‘title’ and ‘description’) Tool link (additional reference): https://colab.research.google.com/drive/12r4KJVbNqjjhiZ6aeiaG809x4-Tg5fm8?usp=sharing#scrollTo=V06gw2d93Q5D 2. Consider how to utilize different scores (1-5) to extract ideas on product features ie: high score have what kind of features & low score have what kind of feature 3. High and low score specific analysis

Load the data

In [1]:

import pandas as pd
import json

In [2]:

# 2
def load_json_file(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            json_line = json.loads(line.strip())
            data.append(json_line)
    return data

file_path = ''

data_list = load_json_file(file_path)
df = pd.DataFrame(data_list)

# Replace the file path and name with your own
file_path = ''

# Load the data into a list of dictionaries
data_list = load_json_file(file_path)

# Convert the list of dictionaries into a pandas DataFrame
meta_df = pd.DataFrame(data_list)

In [3]:

df.head(10)

Out[3]:

	overall	verified	reviewTime	reviewerID	asin	style	reviewerName	reviewText	summary	unixReviewTime	vote	image
0	5.0	True	01 5, 2018	A2HOI48JK8838M	B00004U9V2	{‘Size:’: ‘ 0.9 oz.’}	DB	This handcream has a beautiful fragrance. It d…	Beautiful Fragrance	1515110400	NaN	NaN
1	5.0	True	04 5, 2017	A1YIPEY7HX73S7	B00004U9V2	{‘Size:’: ‘ 3.5 oz.’}	Ajaey	wonderful hand lotion, for seriously dry skin,…	wonderful hand lotion	1491350400	NaN	NaN
2	5.0	True	03 27, 2017	A2QCGHIJ2TCLVP	B00004U9V2	{‘Size:’: ‘ 250 g’}	D. Jones	Best hand cream around. Silky, thick, soaks i…	Best hand cream around	1490572800	NaN	NaN
3	5.0	True	03 20, 2017	A2R4UNHFJBA6PY	B00004U9V2	{‘Size:’: ‘ 3.5 oz.’}	Amazon Customer	Thanks!!	Five Stars	1489968000	NaN	NaN
4	5.0	True	02 28, 2017	A2QCGHIJ2TCLVP	B00004U9V2	{‘Size:’: ‘ 0.9 oz.’}	D. Jones	Great hand lotion. Soaks right in and leaves …	Great hand lotion!	1488240000	NaN	NaN
5	5.0	True	02 25, 2017	A1606LA683WZZU	B00004U9V2	{‘Size:’: ‘ 250 g’}	Amr	Great product. Doesn’t leave you hands feeling…	Five Stars	1487980800	NaN	NaN
6	5.0	True	02 25, 2017	A1606LA683WZZU	B00004U9V2	{‘Size:’: ‘ 3.5 oz.’}	Amr	Great product. Doesn’t leave you hands feeling…	Five Stars	1487980800	NaN	NaN
7	5.0	True	01 30, 2017	A1606LA683WZZU	B00004U9V2	{‘Size:’: ‘ 0.9 oz.’}	Amr	Just as described. Arrived on time.	Five Stars	1485734400	NaN	NaN
8	4.0	False	01 24, 2017	A1YY53NQXFKMRN	B00004U9V2	{‘Size:’: ‘ 3.5 oz.’}	Trixie	Nice lightweight hand cream for the summer.	Smells good, absorbs quickly	1485216000	NaN	NaN
9	5.0	True	12 1, 2016	A3R0NQ9E53JHYQ	B00004U9V2	{‘Size:’: ‘ 250 g’}	T. Hooth	Best hand cream ever.	Five Stars	1480550400	NaN	NaN

In [4]:

meta_df.head(10)

Out[4]:

	category	description	title	also_buy	feature	rank	also_view	details	main_cat	price	asin	imageURL	imageURLHighRes
0	[]	[After a long day of handling thorny situation…	Crabtree & Evelyn – Gardener’s Ultra-Moist…	[B00GHX7H0A, B00FRERO7G, B00R68QXCS, B000Z65AZ…	[]	4,324 in Beauty & Personal Care (	[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN…	{‘ Product Dimensions: ‘: ‘2.2 x 2.2 …	Luxury Beauty	$30.00	B00004U9V2	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
1	[]	[If you haven’t experienced the pleasures of b…	AHAVA Bath Salts	[]	[]	1,633,549 in Beauty & Personal Care (	[]	{‘ Product Dimensions: ‘: ‘3 x 3.5 x …	Luxury Beauty		B0000531EN	[]	[]
2	[]	[Rich, black mineral mud, harvested from the b…	AHAVA Dead Sea Mineral Mud, 8.5 oz, Pack of 4	[]	[]	1,806,710 in Beauty & Personal Care (	[]	{‘ Product Dimensions: ‘: ‘5.1 x 3 x …	Luxury Beauty		B0000532JH	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
3	[]	[This liquid soap with convenient pump dispens…	Crabtree & Evelyn Hand Soap, Gardeners, 10…	[]	[]	[]	[B00004U9V2, B00GHX7H0A, B00FRERO7G, B00R68QXC…	{‘ Product Dimensions: ‘: ‘2.6 x 2.6 …	Luxury Beauty	$15.99	B00005A77F	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
4	[]	[Remember why you love your favorite blanket? …	Soy Milk Hand Crme	[B000NZT6KM, B001BY229Q, B008J724QY, B0009YGKJ…	[]	42,464 in Beauty & Personal Care (	[]	{‘ Product Dimensions: ‘: ‘7.2 x 2.2 …	Luxury Beauty	$18.00	B00005NDTD	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
5	[]	[Winter, summer, spring or fall, this soothing…	AHAVA Dermud Enriched Intensive Foot Cream, 4….	[]	[]	1,527,650 in Beauty & Personal Care (	[]	{‘ Product Dimensions: ‘: ‘2.5 x 2.3 …	Luxury Beauty		B00005R7ZZ	[]	[]
6	[]	[Highly concentrated formula created to rejuve…	AHAVA Dermud Intensive Nourishing Hand Cream, …	[]	[]	1,538,330 in Beauty & Personal Care (	[]	{‘ Product Dimensions: ‘: ‘2.5 x 2.3 …	Luxury Beauty		B00005R7ZY	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
7	[]	[<P><STRONG>Please note: Due to product improv…	Supersmile Powdered Mouthrinse	[B0010Y3M2S, B00005V50B, B00NNZWXEK, B001AC3VI…	[]	122,723 in Beauty & Personal Care (	[B07CHTPD6W, B07D72B2VX, B07CLR4T96, B07B9XZ3K…	{‘ Product Dimensions: ‘: ‘5.8 x 2.8 …	Luxury Beauty	$21.73	B00005V50C	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
8	[]	[Created by Dr. Irwin Smigel, world-renowned “…	Supersmile Professional Teeth Whitening Toothp…	[B00NNZWXEK, B0057MMSWY, B00TZJDY4Q, B001ABYRZ…	[]	5,522 in Beauty & Personal Care (	[B00TZJDY4Q, B07CHTPD6W, B076GZSV93, B00KAC7LE…	{‘ Product Dimensions: ‘: ‘1.8 x 1.4 …	Luxury Beauty	$23.00	B00005V50B	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
9	[]	[Naturally stimulating essential oils make our…	Archipelago Morning Mint Body Lotion ,18 Fl Oz	[B001IJOYJA, B008J720A4, B001IJQR68, B008J722A…	[]	20,146 in Beauty & Personal Care (	[B001JB55SQ, B00J0A448K, B001IJQR68, B008J722A…	{‘ Product Dimensions: ‘: ‘2.6 x 2.6 …	Luxury Beauty	$25.00	B000066SYB	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…

In [5]:

# Merge the DataFrames on the 'asin' column
merged_df = df.merge(meta_df, on='asin', how='left')

# Display the first few rows of the merged DataFrame
merged_df

Out[5]:

	overall	verified	reviewTime	reviewerID	asin	style	reviewerName	reviewText	summary	unixReviewTime	…	feature	rank	also_view	details	main_cat	similar_item	date	price	imageURL	imageURLHighRes
0	5.0	True	01 5, 2018	A2HOI48JK8838M	B00004U9V2	{‘Size:’: ‘ 0.9 oz.’}	DB	This handcream has a beautiful fragrance. It d…	Beautiful Fragrance	1515110400	…	[]	4,324 in Beauty & Personal Care (	[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN…	{‘ Product Dimensions: ‘: ‘2.2 x 2.2 …	Luxury Beauty			$30.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
1	5.0	True	01 5, 2018	A2HOI48JK8838M	B00004U9V2	{‘Size:’: ‘ 0.9 oz.’}	DB	This handcream has a beautiful fragrance. It d…	Beautiful Fragrance	1515110400	…	[]	4,324 in Beauty & Personal Care (	[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN…	{‘ Product Dimensions: ‘: ‘2.2 x 2.2 …	Luxury Beauty			$30.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
2	5.0	True	04 5, 2017	A1YIPEY7HX73S7	B00004U9V2	{‘Size:’: ‘ 3.5 oz.’}	Ajaey	wonderful hand lotion, for seriously dry skin,…	wonderful hand lotion	1491350400	…	[]	4,324 in Beauty & Personal Care (	[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN…	{‘ Product Dimensions: ‘: ‘2.2 x 2.2 …	Luxury Beauty			$30.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
3	5.0	True	04 5, 2017	A1YIPEY7HX73S7	B00004U9V2	{‘Size:’: ‘ 3.5 oz.’}	Ajaey	wonderful hand lotion, for seriously dry skin,…	wonderful hand lotion	1491350400	…	[]	4,324 in Beauty & Personal Care (	[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN…	{‘ Product Dimensions: ‘: ‘2.2 x 2.2 …	Luxury Beauty			$30.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
4	5.0	True	03 27, 2017	A2QCGHIJ2TCLVP	B00004U9V2	{‘Size:’: ‘ 250 g’}	D. Jones	Best hand cream around. Silky, thick, soaks i…	Best hand cream around	1490572800	…	[]	4,324 in Beauty & Personal Care (	[B00FRERO7G, B00GHX7H0A, B07GFHJRMX, B00TJ3NBN…	{‘ Product Dimensions: ‘: ‘2.2 x 2.2 …	Luxury Beauty			$30.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
35853	4.0	False	09 3, 2017	A2CF66KIQ3RKX3	B01GOZ61O8	NaN	Vivian Deliz	I like to use moisturizers and sunscreens that…	Works great as a moisturizer and sunscreen	1504396800	…	[]	60,938 in Beauty & Personal Care (	[B00YHMQDC6, B01GKH6FTQ, B00YHPQLO8, B00J9POUB…	{‘ Product Dimensions: ‘: ‘5.8 x 1 x …	Luxury Beauty			$49.99	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35854	4.0	False	09 3, 2017	A1LKOIZXPQ9VG0	B01GOZ61O8	NaN	Elisa 20	I wouldn’t be able to afford this if not asked…	Nice skin care product and sunscreen if you do…	1504396800	…	[]	60,938 in Beauty & Personal Care (	[B00YHMQDC6, B01GKH6FTQ, B00YHPQLO8, B00J9POUB…	{‘ Product Dimensions: ‘: ‘5.8 x 1 x …	Luxury Beauty			$49.99	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35855	1.0	True	08 25, 2017	AV2RWORXTFRJU	B01H353HUY	NaN	Gapeachmama	Did nothing	One Star	1503619200	…	[]	40,994 in Beauty & Personal Care (	[B00UZFYSTY, B01N9PTAFL, B01N0Z13T1, B01LYMSEH…	{‘ Product Dimensions: ‘: ‘2 x 2 x 5….	Luxury Beauty			$58.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35856	5.0	False	07 8, 2017	A22S7D0LP8GRDH	B01H353HUY	NaN	Jacob and Kiki Hantla	I love the Oribe bright blonde radiance spray….	No more brass!	1499472000	…	[]	40,994 in Beauty & Personal Care (	[B00UZFYSTY, B01N9PTAFL, B01N0Z13T1, B01LYMSEH…	{‘ Product Dimensions: ‘: ‘2 x 2 x 5….	Luxury Beauty			$58.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35857	5.0	True	07 9, 2018	AAF5D1LTFGB7L	B01HGSJPMW	NaN	Libby Johnson	I love all of the Elemis products.	Five Stars	1531094400	…	[]	13,211 in Beauty & Personal Care (	[B0714LK2WT, B078K2QSTR, B00175W3HK, B00DZP5SJ…	{‘ Product Dimensions: ‘: ‘2.2 x 1.5 …	Luxury Beauty			$55.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…

35858 rows × 30 columns

Filtering product type

In [6]:

# keywords filtering function
def filter_dataset_by_keywords(dataframe, keywords, columns):
    keywords = keywords.lower().split()
    mask = []

    for _, row in dataframe.iterrows():
        keyword_found = False

        for column in columns:
            text = str(row[column]).lower()
            if all(keyword in text for keyword in keywords):
                keyword_found = True
                break

        mask.append(keyword_found)

    return dataframe[mask]

#Change the ‘hand cream’ with any product name you want to search

apply the function¶

###################### change the ‘hand cream’ to try differet product！！##################

filtered_df = filter_dataset_by_keywords(merged_df, ”, [‘description’, ‘title’])¶

filtered_df = filter_dataset_by_keywords(merged_df, ‘shampoo’, [‘description’,’title’])

check how many rows are there¶

for instance: (660, 37) means there are 660 products with the name in it¶

print(filtered_df.shape)

beauty_products = [ ‘Hand Crme’, ‘hand cream’, ‘lipstick’, ‘eyeliner’, ‘mascara’, ‘foundation’, ‘moisturizer’, ‘face mask’, ‘nail polish’, ‘shampoo’, ‘conditioner’, ‘eyeshadow’, ‘blush’, ‘concealer’, ‘bronzer’, ‘highlighter’, ‘primer’, ‘makeup remover’, ‘face wash’, ‘face serum’, ‘body lotion’, ‘body wash’, ‘sunscreen’, ‘hair mask’, ‘hair serum’, ‘hair spray’, ‘hair oil’, ‘hair gel’, ‘hair mousse’, ‘hair color’, ‘hair dye’, ‘perfume’, ‘cologne’, ‘fragrance’, ‘deodorant’, ‘antiperspirant’, ‘bath salts’, ‘bath bombs’, ‘body scrub’, ‘exfoliator’, ‘toner’, ‘face oil’, ‘eye cream’, ‘lip balm’, ‘lip gloss’, ‘lip liner’, ‘makeup brushes’, ‘makeup sponge’, ‘beauty tools’, ‘tweezers’, ‘eyelash curler’, ‘nail clippers’, ‘nail files’, ‘nail care’, ‘cuticle care’, ‘beard oil’, ‘beard balm’, ‘beard comb’, ‘shaving cream’, ‘shaving soap’, ‘razor’, ‘aftershave’, ‘hair removal’, ‘waxing’, ‘tanning lotion’, ‘tanning spray’, ‘teeth whitening’, ‘toothpaste’, ‘mouthwash’, ‘dental floss’, ‘skin care sets’, ‘makeup sets’, ‘hair care sets’, ‘fragrance sets’, ‘bath & body sets’ ]

create an empty dataframe to store the product name and the number of reviews¶

product_review_counts = pd.DataFrame(columns=[‘product’, ‘review_count’])

for product in beauty_products: filtered_df = filter_dataset_by_keywords(merged_df, product, [‘description’, ‘title’]) product_review_counts = product_review_counts.append({‘product’: product, ‘review_count’: len(filtered_df)}, ignore_index=True)

sort the dataframe in descending order of the number of reviews¶

product_review_counts = product_review_counts.sort_values(by=’review_count’, ascending=False)

product_review_counts

In [7]:

# Export the list
# product_review_counts.to_csv('beauty_product_counts.csv', index=False)

Test filter the data

In [8]:

# change keywords here
merged_df = filter_dataset_by_keywords(merged_df, 'foundation', ['title'])

In [9]:

merged_df

Out[9]:

	overall	verified	reviewTime	reviewerID	asin	style	reviewerName	reviewText	summary	unixReviewTime	…	feature	rank	also_view	details	main_cat	similar_item	date	price	imageURL	imageURLHighRes
718	3.0	True	04 9, 2018	A1KSC91G9AIY2Z	B00014GT8W	{‘Color:’: ‘ 30W Yellow Beige: For light skin …	RYW	Good foundation. High coverage but be prepared…	Good coverage. Requires patient effort to blen…	1523232000	…	[]	25,522 in Beauty & Personal Care (	[B07J6SWRND, B076LW35CN, B0749Z1PSS, B0748G28V…	{‘ Product Dimensions: ‘: ‘2.3 x 2.3 …	Luxury Beauty			$39.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
719	3.0	True	04 9, 2018	A1KSC91G9AIY2Z	B00014GT8W	{‘Color:’: ‘ 30W Yellow Beige: For light skin …	RYW	Good foundation. High coverage but be prepared…	Good coverage. Requires patient effort to blen…	1523232000	…	[]	25,522 in Beauty & Personal Care (	[B07J6SWRND, B076LW35CN, B0749Z1PSS, B0748G28V…	{‘ Product Dimensions: ‘: ‘2.3 x 2.3 …	Luxury Beauty			$39.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
720	4.0	False	03 13, 2018	A1WKQ94M45D8MG	B00014GT8W	{‘Color:’: ‘ 35W Warm Beige: For medium skin t…	Denise Crawford	I have light skin with some imperfections. As …	Full coverage,	1520899200	…	[]	25,522 in Beauty & Personal Care (	[B07J6SWRND, B076LW35CN, B0749Z1PSS, B0748G28V…	{‘ Product Dimensions: ‘: ‘2.3 x 2.3 …	Luxury Beauty			$39.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
721	4.0	False	03 13, 2018	A1WKQ94M45D8MG	B00014GT8W	{‘Color:’: ‘ 35W Warm Beige: For medium skin t…	Denise Crawford	I have light skin with some imperfections. As …	Full coverage,	1520899200	…	[]	25,522 in Beauty & Personal Care (	[B07J6SWRND, B076LW35CN, B0749Z1PSS, B0748G28V…	{‘ Product Dimensions: ‘: ‘2.3 x 2.3 …	Luxury Beauty			$39.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
722	4.0	False	02 28, 2018	AFICF7DKHTQ87	B00014GT8W	{‘Color:’: ‘ 40W Caramel Beige: For medium ski…	Amazon Customer	My wife likes the smoothness of the product as…	SPF30 makeup	1519776000	…	[]	25,522 in Beauty & Personal Care (	[B07J6SWRND, B076LW35CN, B0749Z1PSS, B0748G28V…	{‘ Product Dimensions: ‘: ‘2.3 x 2.3 …	Luxury Beauty			$39.00	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
35648	5.0	False	08 1, 2016	AAZ5OJ2OOJ2DK	B01BMBL56S	{‘Color:’: ‘ Natural’}	Krisilou	I’ve tried many Bliss skin care products in th…	Smooth and even coverage	1470009600	…	[]	666,773 in Beauty & Personal Care (	[B01BTOD520, B01MXRO1ZF, B01BTO8LEW, B01BTOC16…	{‘ Item Weight: ‘: ‘2.88 ounces’, ‘Sh…	Luxury Beauty			$34.99	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35649	3.0	False	07 24, 2016	A1ZO9D554VQO9F	B01BMBL56S	{‘Color:’: ‘ Natural’}	Jadecat	Well, I received the “Natural” color of this p…	Not what I expected, probably best if your ski…	1469318400	…	[]	666,773 in Beauty & Personal Care (	[B01BTOD520, B01MXRO1ZF, B01BTO8LEW, B01BTOC16…	{‘ Item Weight: ‘: ‘2.88 ounces’, ‘Sh…	Luxury Beauty			$34.99	[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35664	3.0	False	08 9, 2016	A3JLOIXFM75QNV	B01BMBNUQQ	{‘Color:’: ‘ Buff’}	Valerya Couto	I’m not a huge fan of foundation and use it sp…	Mediocre for the price	1470700800	…	[]	975,868 in Beauty & Personal Care (	[B01BTOD520, B01BTO8836, B01MXRO1ZF, B01BTOBHI…	{‘Shipping Weight:’: ‘4 ounces’, ‘ASIN:’: ‘B01…	Luxury Beauty				[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35665	3.0	False	07 18, 2016	A9P07NJ7UV0M	B01BMBNUQQ	{‘Color:’: ‘ Buff’}	Meryl K. Evans, Digital Marketing Pro	I could not figure out how to use the dropper …	Nice color and lightness, but bottle design ma…	1468800000	…	[]	975,868 in Beauty & Personal Care (	[B01BTOD520, B01BTO8836, B01MXRO1ZF, B01BTOBHI…	{‘Shipping Weight:’: ‘4 ounces’, ‘ASIN:’: ‘B01…	Luxury Beauty				[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…
35666	3.0	False	07 17, 2016	A2X2WTEVCZ5L8N	B01BMBNUQQ	{‘Color:’: ‘ Buff’}	Sandy Kay	I thought from the name and the description of…	Foundation and sun protection combined	1468713600	…	[]	975,868 in Beauty & Personal Care (	[B01BTOD520, B01BTO8836, B01MXRO1ZF, B01BTOBHI…	{‘Shipping Weight:’: ‘4 ounces’, ‘ASIN:’: ‘B01…	Luxury Beauty				[https://images-na.ssl-images-amazon.com/image…	[https://images-na.ssl-images-amazon.com/image…

1741 rows × 30 columns

In [10]:

import pandas as pd
import matplotlib.pyplot as plt

In [11]:

import re

def extract_volume(details):
    # Convert the details dictionary to a string
    details_str = str(details)

    # Extract volume in ounces
    ounces_pattern = r"(\d+(\.\d+)?)\s*ounces?"
    ounces_match = re.search(ounces_pattern, details_str, flags=re.IGNORECASE)
    if ounces_match:
        volume = float(ounces_match.group(1))
        return volume

    # If no match found, return None
    return None

In [12]:

merged_df['volume'] = merged_df['details'].apply(extract_volume)

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/2753941139.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['volume'] = merged_df['details'].apply(extract_volume)

In [13]:

merged_df['volume']

Out[13]:

718      2.08
719      2.08
720      2.08
721      2.08
722      2.08
         ... 
35648    2.88
35649    2.88
35664    4.00
35665    4.00
35666    4.00
Name: volume, Length: 1741, dtype: float64

In [61]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Remove rows with missing volume or price information
cleaned_df = merged_df.dropna(subset=['volume', 'price'])

# Calculate the frequency of each (volume, price) pair
frequency = cleaned_df.groupby(['volume', 'price']).size().reset_index(name='frequency')

# Create a scatter plot with point sizes based on frequency
plt.scatter(frequency['volume'], frequency['price'], s=frequency['frequency']*10, alpha=0.5)

# Set the labels for x and y axes
plt.xlabel('Volume (ounces)')
plt.ylabel('Price ($)')

# Set the title of the plot
plt.title('Scatter Plot of Volume vs. Price with Frequency Indicated by Point Size')

# Show the plot
plt.show()

No description has been provided for this image

In [62]:

import matplotlib.pyplot as plt

# Remove rows with missing volume or price information
cleaned_df = merged_df.dropna(subset=['volume', 'price'])

# Create a scatter plot
plt.scatter(cleaned_df['volume'], cleaned_df['price'])

# Set the labels for x and y axes
plt.xlabel('Volume (ounces)')
plt.ylabel('Price ($)')

# Set the title of the plot
plt.title('Scatter Plot of Volume vs. Price')

# Show the plot
plt.show()

In [16]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming the dataset is already in a pandas DataFrame named merged_df
# If not, load the dataset into a DataFrame first
# merged_df = pd.read_csv('your_dataset.csv')

# Clean the price data by removing the '$' sign and converting the column to a numeric type
merged_df['price'] = merged_df['price'].str.replace('$', '')
merged_df['price'] = pd.to_numeric(merged_df['price'], errors='coerce')

# Drop rows with NaN values in the 'price' column (Optional)
merged_df = merged_df.dropna(subset=['price'])

# Create a histogram for the distribution of prices
sns.histplot(data=merged_df, x='price', bins=30, kde=True)

# Set the chart title and labels
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')

# Display the chart
plt.show()

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/1696552794.py:10: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  merged_df['price'] = merged_df['price'].str.replace('$', '')
/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/1696552794.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['price'] = merged_df['price'].str.replace('$', '')
/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/1696552794.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['price'] = pd.to_numeric(merged_df['price'], errors='coerce')

Import the packages

In [17]:

# Install the packages, remove if not necessary!
!pip install textblob
!pip install scikit-learn
!pip install gensim
!pip install nltk

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

Requirement already satisfied: textblob in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (0.17.1)
Requirement already satisfied: nltk>=3.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from textblob) (3.7)
Requirement already satisfied: joblib in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (1.2.0)
Requirement already satisfied: regex>=2021.8.3 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (2022.3.15)
Requirement already satisfied: click in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (8.0.4)
Requirement already satisfied: tqdm in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk>=3.1->textblob) (4.64.0)

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip
Requirement already satisfied: scikit-learn in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (1.2.2)
Requirement already satisfied: joblib>=1.1.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: numpy>=1.17.3 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn) (1.21.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: scipy>=1.3.2 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn) (1.7.3)

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip
Requirement already satisfied: gensim in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (4.1.2)
Requirement already satisfied: numpy>=1.17.0 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim) (1.21.5)
Requirement already satisfied: scipy>=0.18.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim) (1.7.3)
Requirement already satisfied: smart-open>=1.8.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim) (5.1.0)

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip
Requirement already satisfied: nltk in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (3.7)
Requirement already satisfied: tqdm in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk) (4.64.0)
Requirement already satisfied: regex>=2021.8.3 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk) (2022.3.15)
Requirement already satisfied: click in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk) (8.0.4)
Requirement already satisfied: joblib in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from nltk) (1.2.0)

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lailongying/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lailongying/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lailongying/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

Out[17]:

True

In [18]:

import re
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from gensim.corpora import Dictionary
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

Preprocessing

In [19]:

# 3
ENGLISH_STOP_WORDS = stopwords.words('english')

def preprocess(text):
    text = str(text)  # Convert non-string values to strings
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = " ".join(text.split())
    text = text.split()
    text = [x for x in text if x not in ENGLISH_STOP_WORDS]
    # remove stop-words
    text = [x for x in text if x not in ['work well', 'well','great','good','like','product','look','make','realli']] # remove topic related stop-words
    text = " ".join(text)
   # stemmer_ps = PorterStemmer()
    #text = [stemmer_ps.stem(word) for word in text.split()]
    #text = " ".join(text)
    #lemmatizer = WordNetLemmatizer()
    #text = [lemmatizer.lemmatize(word) for word in text.split()]
    #text = " ".join(text)
    return text

In [20]:

import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/lailongying/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Out[20]:

True

In [21]:

# Preprocess the text data
merged_df['cleaned_reviewText'] = merged_df['reviewText'].apply(preprocess)
merged_df['cleaned_summary'] = merged_df['summary'].apply(preprocess)

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/2423404023.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['cleaned_reviewText'] = merged_df['reviewText'].apply(preprocess)
/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/2423404023.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['cleaned_summary'] = merged_df['summary'].apply(preprocess)

Sentiment Analysis

In [22]:

# Sentiment Analysis
merged_df['reviewText_sentiment'] = merged_df['cleaned_reviewText'].apply(lambda x: TextBlob(x).sentiment.polarity)
merged_df['summary_sentiment'] = merged_df['cleaned_summary'].apply(lambda x: TextBlob(x).sentiment.polarity)

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/3911972301.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['reviewText_sentiment'] = merged_df['cleaned_reviewText'].apply(lambda x: TextBlob(x).sentiment.polarity)
/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/3911972301.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['summary_sentiment'] = merged_df['cleaned_summary'].apply(lambda x: TextBlob(x).sentiment.polarity)

Tokenization

In [23]:

# Tokenization
merged_df['tokens_reviewText'] = merged_df['cleaned_reviewText'].apply(word_tokenize)
merged_df['tokens_summary'] = merged_df['cleaned_summary'].apply(word_tokenize)

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/1506086681.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['tokens_reviewText'] = merged_df['cleaned_reviewText'].apply(word_tokenize)
/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/1506086681.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['tokens_summary'] = merged_df['cleaned_summary'].apply(word_tokenize)

N-grams

In [24]:

# Bigrams and Trigrams
bigram_phraser = Phrases(merged_df['tokens_reviewText'], min_count=5)
trigram_phraser = Phrases(bigram_phraser[merged_df['tokens_reviewText']], min_count=5)

bigrams = [bigram_phraser[line] for line in merged_df['tokens_reviewText']]
trigrams = [trigram_phraser[bigram_phraser[line]] for line in merged_df['tokens_reviewText']]

LDA modeling

In [25]:

# Topic Modeling with LDA
dictionary = Dictionary(trigrams)

Modify this part! for low and high end in the LDA model

In [26]:

# modify the parameter
dictionary.filter_extremes(no_below=10, no_above=0.1)

In [27]:

corpus = [dictionary.doc2bow(text) for text in trigrams]

lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=15)
merged_df['topic_distribution'] = merged_df['tokens_reviewText'].apply(lambda x: lda_model.get_document_topics(dictionary.doc2bow(x)))

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/2219660351.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_df['topic_distribution'] = merged_df['tokens_reviewText'].apply(lambda x: lda_model.get_document_topics(dictionary.doc2bow(x)))

In [28]:

# Print the topics
for topic_idx, topic in lda_model.print_topics(num_topics=10, num_words= 10):
    print(f"Topic #{topic_idx + 1}:")
    print(topic)
    print("\n")

Topic #1:
0.014*"setting_powder" + 0.013*"sunscreen" + 0.012*"smooth_liquid_camo_medium" + 0.011*"without" + 0.009*"want" + 0.008*"since" + 0.008*"line" + 0.008*"yet" + 0.008*"spf" + 0.008*"stick"


Topic #2:
0.014*"sponge" + 0.014*"blends" + 0.009*"applying" + 0.009*"overall" + 0.007*"little_goes_long_way" + 0.007*"lighter" + 0.007*"line" + 0.007*"quality" + 0.006*"job" + 0.006*"imperfections"


Topic #3:
0.013*"brand" + 0.009*"easy" + 0.008*"comes" + 0.008*"covered" + 0.008*"goes" + 0.007*"stick" + 0.007*"felt" + 0.007*"redness" + 0.007*"setting_powder" + 0.006*"looking"


Topic #4:
0.009*"pretty" + 0.009*"give" + 0.008*"tried" + 0.008*"put" + 0.008*"could" + 0.008*"makes" + 0.007*"find" + 0.007*"may" + 0.006*"lines" + 0.006*"foundations"


Topic #5:
0.012*"concealer" + 0.009*"stick" + 0.008*"way" + 0.007*"definitely" + 0.006*"youre" + 0.006*"colors" + 0.006*"hard" + 0.006*"body" + 0.006*"take" + 0.006*"natural"


Topic #6:
0.010*"colors" + 0.008*"right" + 0.008*"darker" + 0.008*"match" + 0.008*"said" + 0.008*"natural" + 0.007*"shades" + 0.007*"lighter" + 0.007*"find" + 0.006*"way"


Topic #7:
0.013*"summer" + 0.011*"cream" + 0.011*"medium" + 0.010*"looked" + 0.009*"say" + 0.009*"tan" + 0.008*"darker" + 0.008*"easily" + 0.007*"happy" + 0.007*"pretty"


Topic #8:
0.011*"applying" + 0.010*"imperfections" + 0.010*"sponge" + 0.010*"oily" + 0.010*"wear" + 0.009*"want" + 0.009*"prefer" + 0.008*"foundations" + 0.008*"made" + 0.007*"finish"


Topic #9:
0.010*"easily" + 0.010*"sponge" + 0.009*"dry" + 0.008*"liquid_foundation" + 0.007*"cream" + 0.007*"feels" + 0.007*"way" + 0.006*"spf" + 0.006*"seems" + 0.006*"pretty"


Topic #10:
0.010*"easy_apply" + 0.008*"sunscreen" + 0.007*"since" + 0.007*"find" + 0.007*"easily" + 0.006*"without" + 0.006*"needed" + 0.006*"lot" + 0.006*"set" + 0.005*"spf"

In [29]:

# Create a DataFrame with topic distribution as columns
topic_df = pd.DataFrame([{f'topic_{i}': topic_prob for i, topic_prob in row} for row in merged_df['topic_distribution']])
topic_df.fillna(0, inplace=True)

LDA visualization

In [30]:

!pip install pyLDAvis
!pip install gensim

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.0-py3-none-any.whl (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 5.1 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: pandas>=1.3.4 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (1.4.2)
Requirement already satisfied: numexpr in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (2.8.1)
Requirement already satisfied: setuptools in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (61.2.0)
Collecting funcy
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Requirement already satisfied: joblib>=1.2.0 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (1.2.0)
Requirement already satisfied: gensim in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (4.1.2)
Requirement already satisfied: scikit-learn>=1.0.0 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (1.2.2)
Requirement already satisfied: jinja2 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (2.11.3)
Collecting numpy>=1.22.0
  Downloading numpy-1.24.3-cp39-cp39-macosx_10_9_x86_64.whl (19.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.8/19.8 MB 4.1 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: scipy in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pyLDAvis) (1.7.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pandas>=1.3.4->pyLDAvis) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from pandas>=1.3.4->pyLDAvis) (2021.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=1.0.0->pyLDAvis) (2.2.0)
  Downloading numpy-1.22.4-cp39-cp39-macosx_10_15_x86_64.whl (17.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.7/17.7 MB 2.3 MB/s eta 0:00:0000:0100:01m
Requirement already satisfied: smart-open>=1.8.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim->pyLDAvis) (5.1.0)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from jinja2->pyLDAvis) (2.0.1)
Requirement already satisfied: packaging in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from numexpr->pyLDAvis) (21.3)
Requirement already satisfied: six>=1.5 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas>=1.3.4->pyLDAvis) (1.16.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from packaging->numexpr->pyLDAvis) (3.0.4)
Installing collected packages: funcy, numpy, pyLDAvis
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.5
    Uninstalling numpy-1.21.5:
      Successfully uninstalled numpy-1.21.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.5.0 requires daal==2021.4.0, which is not installed.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.22.4 which is incompatible.
Successfully installed funcy-2.0 numpy-1.22.4 pyLDAvis-3.4.0

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip
Requirement already satisfied: gensim in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (4.1.2)
Requirement already satisfied: smart-open>=1.8.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim) (5.1.0)
Requirement already satisfied: numpy>=1.17.0 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim) (1.22.4)
Requirement already satisfied: scipy>=0.18.1 in /Users/lailongying/opt/anaconda3/lib/python3.9/site-packages (from gensim) (1.7.3)

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: pip install --upgrade pip

In [31]:

import gensim
import gensim.corpora as corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [32]:

pyLDAvis.enable_notebook()
corpus = [dictionary.doc2bow(text) for text in trigrams]
lda_visualization = gensimvis.prepare(lda_model, corpus, dictionary=lda_model.id2word)
pyLDAvis.display(lda_visualization)

/Users/lailongying/opt/anaconda3/lib/python3.9/site-packages/pyLDAvis/_prepare.py:243: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
  default_term_info = default_term_info.sort_values(

Out[32]:

Top rating analysis

In [33]:

# top rating df
high_rated_df = merged_df[merged_df['overall'] >= 4]

In [34]:

high_rated_df.shape

Out[34]:

(1258, 38)

In [35]:

# most common n-grams
from collections import Counter

def get_top_n_grams(tokenized_texts, n, top_n):
    n_grams = Counter()

    for text in tokenized_texts:
        n_grams.update(zip(*[text[i:] for i in range(n)]))

    return n_grams.most_common(top_n)

In [36]:

# top 20 n-grams
top_words_high_rated = get_top_n_grams(high_rated_df['tokens_reviewText'], 1, 20)
top_bigrams_high_rated = get_top_n_grams(high_rated_df['tokens_reviewText'], 2, 20)
top_trigrams_high_rated = get_top_n_grams(high_rated_df['tokens_reviewText'], 3, 20)

LDA model for high rated products

In [37]:

# Bigrams and Trigrams
bigram_phraser = Phrases(high_rated_df['tokens_reviewText'], min_count=5)
trigram_phraser = Phrases(bigram_phraser[high_rated_df['tokens_reviewText']], min_count=5)

bigrams_high = [bigram_phraser[line] for line in high_rated_df['tokens_reviewText']]
trigrams_high = [trigram_phraser[bigram_phraser[line]] for line in high_rated_df['tokens_reviewText']]

In [38]:

# Topic Modeling with LDA
dictionary_high = Dictionary(trigrams_high)

In [39]:

# modify the parameter
dictionary_high.filter_extremes(no_below=10, no_above=0.1)

In [40]:

corpus = [dictionary_high.doc2bow(text) for text in trigrams]

lda_model_high = LdaModel(corpus=corpus, id2word=dictionary_high, num_topics=10, passes=15)
high_rated_df['topic_distribution'] = high_rated_df['tokens_reviewText'].apply(lambda x: lda_model_high.get_document_topics(dictionary_high.doc2bow(x)))

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/3522050477.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  high_rated_df['topic_distribution'] = high_rated_df['tokens_reviewText'].apply(lambda x: lda_model_high.get_document_topics(dictionary_high.doc2bow(x)))

In [41]:

# Print the topics
for topic_idx, topic in lda_model_high.print_topics(num_topics=10, num_words= 10):
    print(f"Topic #{topic_idx + 1}:")
    print(topic)
    print("\n")

Topic #1:
0.010*"blemishes" + 0.010*"natural" + 0.009*"blends" + 0.008*"brand" + 0.008*"try" + 0.008*"lot" + 0.007*"concealer" + 0.007*"isnt" + 0.007*"slightly" + 0.007*"matte"


Topic #2:
0.012*"stick" + 0.010*"online" + 0.010*"pretty" + 0.009*"setting_powder" + 0.009*"ive" + 0.009*"small" + 0.008*"concealer" + 0.008*"body" + 0.008*"still" + 0.008*"hard"


Topic #3:
0.013*"stick" + 0.012*"blend" + 0.012*"professional" + 0.012*"smooth_liquid_camo_medium" + 0.012*"job" + 0.011*"since" + 0.009*"darker" + 0.009*"scar" + 0.009*"spf" + 0.009*"water"


Topic #4:
0.015*"didnt" + 0.012*"blend" + 0.010*"line" + 0.009*"spf" + 0.009*"made" + 0.008*"imperfections" + 0.007*"enough" + 0.007*"liquid_foundation" + 0.007*"never" + 0.007*"got"


Topic #5:
0.015*"want" + 0.013*"may" + 0.011*"covering" + 0.010*"setting_powder" + 0.009*"go" + 0.008*"try" + 0.008*"said" + 0.008*"see" + 0.007*"tone" + 0.007*"colors"


Topic #6:
0.008*"see" + 0.008*"go" + 0.008*"medium" + 0.008*"different" + 0.007*"find" + 0.007*"could" + 0.006*"primer" + 0.006*"ive" + 0.006*"price" + 0.006*"still"


Topic #7:
0.016*"sunscreen" + 0.013*"way" + 0.011*"looks" + 0.011*"beauty_blender" + 0.011*"put" + 0.010*"pretty" + 0.009*"blend" + 0.008*"creme" + 0.008*"areas" + 0.008*"found"


Topic #8:
0.014*"find" + 0.011*"blend" + 0.008*"natural" + 0.008*"makes" + 0.008*"old" + 0.008*"looks" + 0.007*"yellow" + 0.007*"time" + 0.007*"oily_skin" + 0.007*"way"


Topic #9:
0.010*"feels" + 0.009*"worked" + 0.009*"sunscreen" + 0.009*"felt" + 0.009*"easily" + 0.009*"natural" + 0.008*"easy" + 0.008*"stuff" + 0.008*"blends" + 0.008*"sensitive_skin"


Topic #10:
0.011*"didnt" + 0.008*"darker" + 0.008*"cream" + 0.008*"applying" + 0.008*"wear" + 0.008*"dry" + 0.008*"looked" + 0.008*"foundations" + 0.007*"easily" + 0.007*"pretty"

LDA visualization for high rated products

In [42]:

pyLDAvis.enable_notebook()
corpus = [dictionary_high.doc2bow(text) for text in trigrams]
lda_visualization = gensimvis.prepare(lda_model_high, corpus, dictionary=lda_model_high.id2word)
pyLDAvis.display(lda_visualization)

/Users/lailongying/opt/anaconda3/lib/python3.9/site-packages/pyLDAvis/_prepare.py:243: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
  default_term_info = default_term_info.sort_values(

Out[42]:

Low rating analysis

In [43]:

low_rated_df = merged_df[merged_df['overall'] < 4]

In [44]:

low_rated_df.shape

Out[44]:

(317, 38)

In [45]:

# most common n-grams
from collections import Counter

def get_top_n_grams(tokenized_texts, n, top_n):
    n_grams = Counter()

    for text in tokenized_texts:
        n_grams.update(zip(*[text[i:] for i in range(n)]))

    return n_grams.most_common(top_n)

In [46]:

# top 20 n-grams
top_words_low_rated = get_top_n_grams(low_rated_df['tokens_reviewText'], 1, 20)
top_bigrams_low_rated = get_top_n_grams(low_rated_df['tokens_reviewText'], 2, 20)
top_trigrams_low_rated = get_top_n_grams(low_rated_df['tokens_reviewText'], 3, 20)

LDA model for low rated products

In [47]:

# Bigrams and Trigrams
bigram_phraser = Phrases(low_rated_df['tokens_reviewText'], min_count=5)
trigram_phraser = Phrases(bigram_phraser[low_rated_df['tokens_reviewText']], min_count=5)

bigrams_low = [bigram_phraser[line] for line in low_rated_df['tokens_reviewText']]
trigrams_low = [trigram_phraser[bigram_phraser[line]] for line in low_rated_df['tokens_reviewText']]

In [48]:

# Topic Modeling with LDA
dictionary_low = Dictionary(trigrams_low)

In [49]:

# modify the parameter
dictionary_low.filter_extremes(no_below=10, no_above=0.1)

In [50]:

corpus = [dictionary_low.doc2bow(text) for text in trigrams]

lda_model_low = LdaModel(corpus=corpus, id2word=dictionary_low, num_topics=10, passes=15)
low_rated_df['topic_distribution'] = low_rated_df['tokens_reviewText'].apply(lambda x: lda_model_low.get_document_topics(dictionary_low.doc2bow(x)))

/var/folders/0d/rtr5srz94xxb4_5xm7x2vqbh0000gn/T/ipykernel_66253/4039827119.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  low_rated_df['topic_distribution'] = low_rated_df['tokens_reviewText'].apply(lambda x: lda_model_low.get_document_topics(dictionary_low.doc2bow(x)))

In [51]:

# Print the topics
for topic_idx, topic in lda_model_low.print_topics(num_topics=10, num_words= 10):
    print(f"Topic #{topic_idx + 1}:")
    print(topic)
    print("\n")

Topic #1:
0.054*"sunscreen" + 0.034*"want" + 0.028*"set" + 0.026*"concealer" + 0.024*"recommend" + 0.022*"full_coverage" + 0.020*"pretty" + 0.019*"sure" + 0.019*"love" + 0.018*"going"


Topic #2:
0.030*"wear" + 0.022*"thats" + 0.021*"enough" + 0.020*"cream" + 0.018*"spf" + 0.017*"made" + 0.017*"give" + 0.015*"complexion" + 0.015*"actually" + 0.015*"stays"


Topic #3:
0.027*"best" + 0.025*"full_coverage" + 0.024*"without" + 0.023*"job" + 0.022*"sponge" + 0.020*"ive" + 0.019*"blends" + 0.019*"covers" + 0.018*"spots" + 0.016*"almost"


Topic #4:
0.061*"works" + 0.051*"stick" + 0.037*"said" + 0.027*"lot" + 0.024*"liquid_foundation" + 0.022*"tone" + 0.022*"goes" + 0.018*"easy_apply" + 0.017*"hand" + 0.017*"area"


Topic #5:
0.037*"best" + 0.035*"covers" + 0.034*"cream" + 0.029*"pretty" + 0.027*"full_coverage" + 0.023*"covering" + 0.021*"dermablend_products" + 0.021*"want" + 0.019*"imperfections" + 0.017*"online"


Topic #6:
0.032*"different" + 0.031*"primer" + 0.025*"foundations" + 0.020*"take" + 0.020*"oily" + 0.019*"full_coverage" + 0.017*"brush" + 0.017*"colors" + 0.017*"looking" + 0.015*"applying"


Topic #7:
0.041*"setting_powder" + 0.033*"beige" + 0.027*"concealer" + 0.021*"cover_creme" + 0.020*"easy_apply" + 0.020*"covers" + 0.019*"colors" + 0.018*"cream" + 0.018*"many" + 0.017*"spf"


Topic #8:
0.047*"love" + 0.029*"brand" + 0.021*"line" + 0.021*"red" + 0.017*"never" + 0.016*"got" + 0.016*"fine" + 0.015*"pretty" + 0.015*"without" + 0.014*"always"


Topic #9:
0.131*"brush" + 0.044*"sponge" + 0.029*"compact" + 0.018*"quality" + 0.017*"lot" + 0.017*"foundations" + 0.016*"want" + 0.013*"say" + 0.013*"thought" + 0.012*"liquid"


Topic #10:
0.023*"feels" + 0.023*"sensitive_skin" + 0.020*"blends" + 0.018*"easy" + 0.016*"covers" + 0.016*"water" + 0.015*"greasy" + 0.015*"overall" + 0.015*"texture" + 0.014*"matte"

LDA visualization for low rated products

In [52]:

pyLDAvis.enable_notebook()
corpus = [dictionary_low.doc2bow(text) for text in trigrams]
lda_visualization = gensimvis.prepare(lda_model_low, corpus, dictionary=lda_model_low.id2word)
pyLDAvis.display(lda_visualization)

/Users/lailongying/opt/anaconda3/lib/python3.9/site-packages/pyLDAvis/_prepare.py:243: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.
  default_term_info = default_term_info.sort_values(

Out[52]:

Conclusion

In [53]:

# lda analysis
def display_top_words(lda_model, num_topics, num_words):
    for topic_idx, topic in lda_model.print_topics(num_topics=num_topics, num_words=num_words):
        print(f"Topic #{topic_idx + 1}:")
        print(topic)
        print("\n")

display_top_words(lda_model, 10, 10)

Topic #1:
0.014*"setting_powder" + 0.013*"sunscreen" + 0.012*"smooth_liquid_camo_medium" + 0.011*"without" + 0.009*"want" + 0.008*"since" + 0.008*"line" + 0.008*"yet" + 0.008*"spf" + 0.008*"stick"


Topic #2:
0.014*"sponge" + 0.014*"blends" + 0.009*"applying" + 0.009*"overall" + 0.007*"little_goes_long_way" + 0.007*"lighter" + 0.007*"line" + 0.007*"quality" + 0.006*"job" + 0.006*"imperfections"


Topic #3:
0.013*"brand" + 0.009*"easy" + 0.008*"comes" + 0.008*"covered" + 0.008*"goes" + 0.007*"stick" + 0.007*"felt" + 0.007*"redness" + 0.007*"setting_powder" + 0.006*"looking"


Topic #4:
0.009*"pretty" + 0.009*"give" + 0.008*"tried" + 0.008*"put" + 0.008*"could" + 0.008*"makes" + 0.007*"find" + 0.007*"may" + 0.006*"lines" + 0.006*"foundations"


Topic #5:
0.012*"concealer" + 0.009*"stick" + 0.008*"way" + 0.007*"definitely" + 0.006*"youre" + 0.006*"colors" + 0.006*"hard" + 0.006*"body" + 0.006*"take" + 0.006*"natural"


Topic #6:
0.010*"colors" + 0.008*"right" + 0.008*"darker" + 0.008*"match" + 0.008*"said" + 0.008*"natural" + 0.007*"shades" + 0.007*"lighter" + 0.007*"find" + 0.006*"way"


Topic #7:
0.013*"summer" + 0.011*"cream" + 0.011*"medium" + 0.010*"looked" + 0.009*"say" + 0.009*"tan" + 0.008*"darker" + 0.008*"easily" + 0.007*"happy" + 0.007*"pretty"


Topic #8:
0.011*"applying" + 0.010*"imperfections" + 0.010*"sponge" + 0.010*"oily" + 0.010*"wear" + 0.009*"want" + 0.009*"prefer" + 0.008*"foundations" + 0.008*"made" + 0.007*"finish"


Topic #9:
0.010*"easily" + 0.010*"sponge" + 0.009*"dry" + 0.008*"liquid_foundation" + 0.007*"cream" + 0.007*"feels" + 0.007*"way" + 0.006*"spf" + 0.006*"seems" + 0.006*"pretty"


Topic #10:
0.010*"easy_apply" + 0.008*"sunscreen" + 0.007*"since" + 0.007*"find" + 0.007*"easily" + 0.006*"without" + 0.006*"needed" + 0.006*"lot" + 0.006*"set" + 0.005*"spf"

In [54]:

display_top_words(lda_model_high, 10, 10)

Topic #1:
0.010*"blemishes" + 0.010*"natural" + 0.009*"blends" + 0.008*"brand" + 0.008*"try" + 0.008*"lot" + 0.007*"concealer" + 0.007*"isnt" + 0.007*"slightly" + 0.007*"matte"


Topic #2:
0.012*"stick" + 0.010*"online" + 0.010*"pretty" + 0.009*"setting_powder" + 0.009*"ive" + 0.009*"small" + 0.008*"concealer" + 0.008*"body" + 0.008*"still" + 0.008*"hard"


Topic #3:
0.013*"stick" + 0.012*"blend" + 0.012*"professional" + 0.012*"smooth_liquid_camo_medium" + 0.012*"job" + 0.011*"since" + 0.009*"darker" + 0.009*"scar" + 0.009*"spf" + 0.009*"water"


Topic #4:
0.015*"didnt" + 0.012*"blend" + 0.010*"line" + 0.009*"spf" + 0.009*"made" + 0.008*"imperfections" + 0.007*"enough" + 0.007*"liquid_foundation" + 0.007*"never" + 0.007*"got"


Topic #5:
0.015*"want" + 0.013*"may" + 0.011*"covering" + 0.010*"setting_powder" + 0.009*"go" + 0.008*"try" + 0.008*"said" + 0.008*"see" + 0.007*"tone" + 0.007*"colors"


Topic #6:
0.008*"see" + 0.008*"go" + 0.008*"medium" + 0.008*"different" + 0.007*"find" + 0.007*"could" + 0.006*"primer" + 0.006*"ive" + 0.006*"price" + 0.006*"still"


Topic #7:
0.016*"sunscreen" + 0.013*"way" + 0.011*"looks" + 0.011*"beauty_blender" + 0.011*"put" + 0.010*"pretty" + 0.009*"blend" + 0.008*"creme" + 0.008*"areas" + 0.008*"found"


Topic #8:
0.014*"find" + 0.011*"blend" + 0.008*"natural" + 0.008*"makes" + 0.008*"old" + 0.008*"looks" + 0.007*"yellow" + 0.007*"time" + 0.007*"oily_skin" + 0.007*"way"


Topic #9:
0.010*"feels" + 0.009*"worked" + 0.009*"sunscreen" + 0.009*"felt" + 0.009*"easily" + 0.009*"natural" + 0.008*"easy" + 0.008*"stuff" + 0.008*"blends" + 0.008*"sensitive_skin"


Topic #10:
0.011*"didnt" + 0.008*"darker" + 0.008*"cream" + 0.008*"applying" + 0.008*"wear" + 0.008*"dry" + 0.008*"looked" + 0.008*"foundations" + 0.007*"easily" + 0.007*"pretty"

In [55]:

display_top_words(lda_model_low, 10, 10)

Topic #1:
0.054*"sunscreen" + 0.034*"want" + 0.028*"set" + 0.026*"concealer" + 0.024*"recommend" + 0.022*"full_coverage" + 0.020*"pretty" + 0.019*"sure" + 0.019*"love" + 0.018*"going"


Topic #2:
0.030*"wear" + 0.022*"thats" + 0.021*"enough" + 0.020*"cream" + 0.018*"spf" + 0.017*"made" + 0.017*"give" + 0.015*"complexion" + 0.015*"actually" + 0.015*"stays"


Topic #3:
0.027*"best" + 0.025*"full_coverage" + 0.024*"without" + 0.023*"job" + 0.022*"sponge" + 0.020*"ive" + 0.019*"blends" + 0.019*"covers" + 0.018*"spots" + 0.016*"almost"


Topic #4:
0.061*"works" + 0.051*"stick" + 0.037*"said" + 0.027*"lot" + 0.024*"liquid_foundation" + 0.022*"tone" + 0.022*"goes" + 0.018*"easy_apply" + 0.017*"hand" + 0.017*"area"


Topic #5:
0.037*"best" + 0.035*"covers" + 0.034*"cream" + 0.029*"pretty" + 0.027*"full_coverage" + 0.023*"covering" + 0.021*"dermablend_products" + 0.021*"want" + 0.019*"imperfections" + 0.017*"online"


Topic #6:
0.032*"different" + 0.031*"primer" + 0.025*"foundations" + 0.020*"take" + 0.020*"oily" + 0.019*"full_coverage" + 0.017*"brush" + 0.017*"colors" + 0.017*"looking" + 0.015*"applying"


Topic #7:
0.041*"setting_powder" + 0.033*"beige" + 0.027*"concealer" + 0.021*"cover_creme" + 0.020*"easy_apply" + 0.020*"covers" + 0.019*"colors" + 0.018*"cream" + 0.018*"many" + 0.017*"spf"


Topic #8:
0.047*"love" + 0.029*"brand" + 0.021*"line" + 0.021*"red" + 0.017*"never" + 0.016*"got" + 0.016*"fine" + 0.015*"pretty" + 0.015*"without" + 0.014*"always"


Topic #9:
0.131*"brush" + 0.044*"sponge" + 0.029*"compact" + 0.018*"quality" + 0.017*"lot" + 0.017*"foundations" + 0.016*"want" + 0.013*"say" + 0.013*"thought" + 0.012*"liquid"


Topic #10:
0.023*"feels" + 0.023*"sensitive_skin" + 0.020*"blends" + 0.018*"easy" + 0.016*"covers" + 0.016*"water" + 0.015*"greasy" + 0.015*"overall" + 0.015*"texture" + 0.014*"matte"

In [56]:

# sentiment analysis
rating_sentiment_summary = merged_df.groupby('overall').agg({'reviewText_sentiment': 'mean', 'summary_sentiment': 'mean'}).reset_index()
print(rating_sentiment_summary)

   overall  reviewText_sentiment  summary_sentiment
0      1.0              0.023442          -0.147807
1      2.0              0.087539           0.059064
2      3.0              0.103638           0.069082
3      4.0              0.179245           0.192631
4      5.0              0.229236           0.312825

In [57]:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.barplot(data=rating_sentiment_summary, x='overall', y='reviewText_sentiment', ax=axes[0])
axes[0].set_title('Average Review Text Sentiment per Rating')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Average Sentiment Score')

sns.barplot(data=rating_sentiment_summary, x='overall', y='summary_sentiment', ax=axes[1])
axes[1].set_title('Average Summary Sentiment per Rating')
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Average Sentiment Score')

plt.show()

/Users/lailongying/opt/anaconda3/lib/python3.9/site-packages/seaborn/rcmod.py:400: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if LooseVersion(mpl.__version__) >= "3.0":
/Users/lailongying/opt/anaconda3/lib/python3.9/site-packages/setuptools/_distutils/version.py:351: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  other = LooseVersion(other)

In [58]:

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.scatterplot(data=merged_df, x='reviewText_sentiment', y='overall', ax=axes[0], alpha=0.2)
axes[0].set_title('Review Text Sentiment vs Rating')
axes[0].set_xlabel('Review Text Sentiment Score')
axes[0].set_ylabel('Rating')

sns.scatterplot(data=merged_df, x='summary_sentiment', y='overall', ax=axes[1], alpha=0.2)
axes[1].set_title('Summary Sentiment vs Rating')
axes[1].set_xlabel('Summary Sentiment Score')
axes[1].set_ylabel('Rating')

plt.show()

In [59]:

# top 20 n-grams
def plot_top_ngrams(top_ngrams, title, n):
    labels, values = zip(*top_ngrams)
    labels = [" ".join(label) for label in labels]
    index = list(range(n))
    
    plt.figure(figsize=(10, 5))
    plt.bar(index, values)
    plt.xlabel('Phrases')
    plt.ylabel('Frequency')
    plt.xticks(index, labels, rotation=45)
    plt.title(title)
    plt.show()

plot_top_ngrams(top_words_high_rated, 'Top 20 Words in High-Rated Products', 20)
plot_top_ngrams(top_bigrams_high_rated, 'Top 20 Bigrams in High-Rated Products', 20)
plot_top_ngrams(top_trigrams_high_rated, 'Top 20 Trigrams in High-Rated Products', 20)

In [60]:

# top 20 n-grams
def plot_top_ngrams(top_ngrams, title, n):
    labels, values = zip(*top_ngrams)
    labels = [" ".join(label) for label in labels]
    index = list(range(n))
    
    plt.figure(figsize=(10, 5))
    plt.bar(index, values)
    plt.xlabel('Phrases')
    plt.ylabel('Frequency')
    plt.xticks(index, labels, rotation=45)
    plt.title(title)
    plt.show()

plot_top_ngrams(top_words_low_rated, 'Top 20 Words in Low-Rated Products', 20)
plot_top_ngrams(top_bigrams_low_rated, 'Top 20 Bigrams in Low-Rated Products', 20)
plot_top_ngrams(top_trigrams_low_rated, 'Top 20 Trigrams in Low-Rated Products', 20)