Topic: Sephora Beauty Website’s Rating and Product Analysis
Dataset: Sephora Website Data
Owner: Raghad Alharbi
Source: Kaggle.com
Data dictionary can be found at: https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website
The variables included in this entire data set are the following: id, brand, category, name, size, rating, number of reviews, love, price, value price, URL, marketing flags, ingredients, online only, exclusive, limited edition, limited time offer.
—————— Loading Libraries ———————
# We begin by loading the packages we will need for this project.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr) # to wrangle our data
library(ggplot2) # for plotting our data
library(RColorBrewer) #adding some color
## Warning: package 'RColorBrewer' was built under R version 4.1.3
———— Reading data……. ————–
library(readr)
sephora_website_dataset <- read_csv("sephora_website_dataset.csv")
## Rows: 9168 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): brand, category, name, size, URL, MarketingFlags_content, options,...
## dbl (10): id, rating, number_of_reviews, love, price, value_price, online_on...
## lgl (1): MarketingFlags
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(sephora_website_dataset)
———- Battle of the Doctor Brands: Moisturizer Edition ——————
I selected 8 variables from this data set, and then used the filter() function from the dplyr package to keep five doctor-founded brands sold at Sephora. These brands carry many different products, such as face washes, face serums, face masks, sunscreen, and other specialty products. I decided to keep only moisturizers rated 3.5 stars or higher, and my code returned 21 rows and 8 variables.
BOD <- sephora_website_dataset %>%
select(brand, category, rating, number_of_reviews, price, name, love, ingredients) %>%
filter(brand %in% c("Dr Roebuck's", "Dr. Barbara Sturm", "Dr. Brandt Skincare", "Dr. Dennis Gross Skincare", "Dr. Jart+")) %>%
filter(category %in% c("Moisturizers")) %>%
group_by(name, ingredients) %>%
filter(rating >= 3.5) # keep only products rated 3.5 stars or higher
BOD
## # A tibble: 21 x 8
## # Groups: name, ingredients [21]
## brand category rating number_of_revie~ price name love ingredients
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 Dr Roebuck's Moistur~ 4 236 45 No W~ 5300 "-Hyaluron~
## 2 Dr Roebuck's Moistur~ 4 3 45 Stok~ 272 "-WildBerr~
## 3 Dr. Barbara S~ Moistur~ 4 237 215 Face~ 1700 "-Purslane~
## 4 Dr. Barbara S~ Moistur~ 4.5 5 230 Face~ 757 "-Skullcap~
## 5 Dr. Barbara S~ Moistur~ 4.5 3 215 Clar~ 1000 "-Complex ~
## 6 Dr. Barbara S~ Moistur~ 4.5 3 205 Face~ 555 "-Purslane~
## 7 Dr. Barbara S~ Moistur~ 5 1 230 Brig~ 235 "-Extract ~
## 8 Dr. Barbara S~ Moistur~ 5 3 215 Dark~ 355 "-Extracts~
## 9 Dr. Barbara S~ Moistur~ 4 1 230 Dark~ 166 "-Extracts~
## 10 Dr. Brandt Sk~ Moistur~ 4 16 72 Hyal~ 1700 "-Multi-hy~
## # ... with 11 more rows
This was one of my favorite visualizations to plot. Customer ratings are on the X-axis and price is on the Y-axis. Dot color identifies the brand, as shown in the legend to the right. Below that, a second legend shows that dot size reflects the number of customers who left a review for the product. Dr. Jart+ had a 4.5-star rating, received the most customer reviews, and was priced at about $50.00 in USD. Its competitor Dr. Barbara Sturm earned ratings between 4 and 5 stars but had very few customer reviews, and its moisturizers were priced above $200.00 in USD.
BODplot <- ggplot(BOD, aes(x=rating, y=price, size = number_of_reviews, color=brand)) +
geom_point(alpha=0.9)+
scale_size(range = c(.1, 9), name="Customers Who Left a Review") +
labs(title= "Battle of the Doctors: Moisturizer Edition")+
ylab("Price (in USD)") +
xlab("Ratings out of 5 Stars")
BODplot
——– High End Luxury Brands: Perfume Edition ————
I enjoy a good perfume, and I’ve always noticed that higher-end brands use good quality ingredients and that their perfumes last a lot longer than those from mid-luxury brands like Victoria’s Secret. I decided to focus on three well-known luxury brands sold on the Sephora website: Gucci, HERMÈS, and Maison Margiela. I did this using the select() and filter() functions from the dplyr package. I selected 7 variables: brand, size, love, price, category, name, and rating. I excluded perfumes priced at $100.00 USD or less and moved on to plotting my data in the next step.
Lux <- sephora_website_dataset %>%
select(brand, size, love, price, category, name, rating) %>%
filter(brand %in% c("HERMÈS", "Gucci", "Maison Margiela")) %>%
filter(category %in% c("Perfume")) %>% # keep only Perfume products
filter(price > 100) %>% # keep perfumes priced above $100.00 (unquoted, so the comparison is numeric)
arrange(name)
head(Lux)
## # A tibble: 6 x 7
## brand size love price category name rating
## <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 Maison Margiela 3.4 oz/ 100 mL 3700 130 Perfume 'REPLICA' Springti~ 4
## 2 Maison Margiela 0.34 oz/10 mL 973 30 Perfume ’REPLICA’ At The B~ 4
## 3 Maison Margiela 3.4 oz/ 100 mL 37000 130 Perfume ’REPLICA’ Beach Wa~ 4
## 4 Maison Margiela 3.4 oz/ 100 mL 41900 130 Perfume ’REPLICA’ By The F~ 4.5
## 5 Maison Margiela 3.4 oz/ 100 mL 2600 180 Perfume ’REPLICA’ Fantasie~ 3.5
## 6 Maison Margiela 3.4 oz/ 100 mL 13700 130 Perfume ’REPLICA’ Flower M~ 4
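The output above includes a $30 product even though the goal was to keep perfumes above $100. One likely explanation is a price threshold written in quotes, which makes R compare text instead of numbers. A minimal sketch with made-up prices shows the difference:

```r
# Made-up prices for illustration only; they are not taken from the dataset.
prices <- c(10, 30, 130, 180)

# Quoted threshold: R converts the prices to text and compares them
# character by character, so "30" sorts after "100" because "3" follows "1".
text_comparison <- prices > "100"

# Unquoted threshold: an ordinary numeric comparison.
numeric_comparison <- prices > 100

data.frame(prices, text_comparison, numeric_comparison)
```

Only the numeric comparison drops the $30 entry, so thresholds on numeric columns such as price and rating are safer left unquoted.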
My chart is very simple to read. On the X-axis you can see the customer ratings for these three luxury brand perfumes. The Y-axis measures the number of “love button” clicks the product received on the Sephora website. I found it interesting that more people were willing to click the love button than to leave a star rating. I attribute this to the love button being a faster way to “review” a product, but I worry about its integrity with all the “internet bots” or spammers fishing around. In my opinion, star reviews are more personal and are more often followed up with a written review of the product. The number of customers who left a review was available as a variable within this dataset, but I did not use it.
Luxp1 <- Lux %>%
ggplot(aes(x=rating, y=love)) +
labs(title= "High End Luxury Brands: Perfume Edition")+
xlab("Customer Ratings out of 5 Stars")+
ylab("Love Button Reactions") +
theme_minimal(base_size = 15) +
geom_point(aes(color= brand)) + scale_color_brewer(palette="Set1")
Luxp1
———– Statistical Analysis on Fenty Beauty by Rihanna ——————–
Fenty Beauty by Rihanna is Black-owned, internationally popular, and celebrates diversity worldwide. Rihanna also offers a brand that is affordable for the everyday consumer while providing a high-quality product. I wanted to keep all her products within this dataset but look at some of her most popular items based on ratings. I selected 9 variables from this data set: brand, category, rating, number of reviews, price, name, love button reactions, value price, and size. Then I used the filter() function from the dplyr package to keep ratings greater than or equal to 4.5 stars. My code returned 39 rows and 9 variables.
Fenty <- sephora_website_dataset %>%
select(brand, category, rating, number_of_reviews, price, name, love, value_price, size) %>% # 9 variables selected
filter(brand %in% c("FENTY BEAUTY by Rihanna"), rating >= 4.5) %>% # numeric comparison, so 4.5 is left unquoted
arrange(price) #price is now from low $ to high $
Fenty
## # A tibble: 39 x 9
## brand category rating number_of_revie~ price name love value_price size
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 FENTY ~ Sponges~ 4.5 466 16 Prec~ 55300 16 no s~
## 2 FENTY ~ Lipstick 4.5 2000 18 Matt~ 349300 18 0.06~
## 3 FENTY ~ Lip Bal~ 4.5 522 18 Pro ~ 47900 18 no s~
## 4 FENTY ~ Lip Glo~ 4.5 10000 19 Glos~ 553300 19 0.3 ~
## 5 FENTY ~ Lipstick 4.5 177 20 Pout~ 39000 20 0.1 ~
## 6 FENTY ~ Face Se~ 5 1000 23 Bomb~ 156900 23 no s~
## 7 FENTY ~ Bronzer 4.5 89 24 Lil'~ 35200 24 no s~
## 8 FENTY ~ Face Br~ 4.5 84 24 Port~ 28700 24 no s~
## 9 FENTY ~ Eye Bru~ 5 47 24 Tape~ 19800 24 no s~
## 10 FENTY ~ Eye Bru~ 5 2 24 Prec~ 3500 24 no s~
## # ... with 29 more rows
Frequency Table & One Sample T Testing
To begin the statistical analysis of the price variable, I first created a frequency table in this chunk. It shows that the most common prices were $32.00, $34.00, and $36.00 USD, with 4 to 5 products at each. I then ran a one-sample t-test using the t.test() function. A one-sample t-test is used to test the statistical difference between a sample mean and an assumed or hypothesized value of the population mean. Because I did not supply a hypothesized mean, t.test() compared the sample mean against its default of 0, so the most useful parts of the output are the “mean of x”, or average price, of about $33.26 USD and the 95 percent confidence interval of roughly $28.39 to $38.12 USD.
table(Fenty$price)
##
## 16 18 19 20 23 24 25 26 28 29 30 32 34 36 39 42 45 50 54 69 99
## 1 2 1 1 1 4 3 1 2 2 1 4 4 5 1 1 1 1 1 1 1
t.test(Fenty$price)
##
## One Sample t-test
##
## data: Fenty$price
## t = 13.843, df = 38, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 28.39300 38.11982
## sample estimates:
## mean of x
## 33.25641
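The call above tests against R's default hypothesized mean of 0, which no price list would plausibly have. A minimal sketch of a more informative test follows. It rebuilds the 39 prices from the frequency table above and compares them with a hypothesized mean of $30, a value picked purely for illustration and not taken from the dataset.

```r
# Rebuild the 39 Fenty prices from the frequency table printed above.
price_values <- c(16, 18, 19, 20, 23, 24, 25, 26, 28, 29, 30,
                  32, 34, 36, 39, 42, 45, 50, 54, 69, 99)
price_counts <- c(1, 2, 1, 1, 1, 4, 3, 1, 2, 2, 1,
                  4, 4, 5, 1, 1, 1, 1, 1, 1, 1)
fenty_prices <- rep(price_values, times = price_counts)

# Assumption: $30 is a hypothetical benchmark chosen only for this example.
hypothesized_mean <- 30

# Two-sided one-sample t-test of H0: the mean price equals hypothesized_mean.
fenty_test <- t.test(fenty_prices, mu = hypothesized_mean)
fenty_test
```

The sample mean and confidence interval match the original output; only the t statistic and p-value change, because they now measure distance from $30 rather than from 0.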
histFenty <- hist(Fenty$price, main = "Price Distribution of Fenty Beauty Products (Ratings 4.5-5.0)", xlab = "Price (in USD)") # a basic R histogram using the hist() function
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plotFenty <- plot_ly(x = Fenty$price,
type = "histogram") %>%
layout(title = "Frequency: Price (in USD) of Fenty by Rihanna Products (ratings 4.5-5.0)",
xaxis = list(title = "Price (in USD)",
zeroline = FALSE),
yaxis = list(title = "Frequency",
zeroline = FALSE))
plotFenty
1.) The topic of the data, any variables included, what kind of variables they are, where the data came from and how you cleaned it up (be detailed and specific, using proper terminology where appropriate). Be sure to explain why you chose this topic and dataset – what meaning does it have for you?
Once I loaded the dplyr package, I was able to use several of its functions to clean my dataset. I chose the variables I wanted to focus on with the select() function, then narrowed the rows with the filter() function. I also grouped certain variables using the group_by() function.
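As a small illustration of these verbs working together, the sketch below re-enters the first ten moisturizer rows printed earlier (brand, rating, and price only) and adds a summarise() step, which the original pipeline did not include, so that the grouping produces one line per brand. Treat it as an example of the pattern, not as the analysis itself.

```r
library(dplyr)

# First ten rows of the BOD output shown earlier, re-entered by hand.
bod_sample <- data.frame(
  brand = c(rep("Dr Roebuck's", 2), rep("Dr. Barbara Sturm", 7),
            "Dr. Brandt Skincare"),
  rating = c(4, 4, 4, 4.5, 4.5, 4.5, 5, 5, 4, 4),
  price = c(45, 45, 215, 230, 215, 205, 230, 215, 230, 72),
  stringsAsFactors = FALSE
)

# filter() keeps rows, group_by() defines the groups,
# and summarise() collapses each group to a single row.
brand_summary <- bod_sample %>%
  filter(rating >= 3.5) %>%
  group_by(brand) %>%
  summarise(
    products = n(),
    mean_price = mean(price),
    mean_rating = mean(rating),
    .groups = "drop"
  )
brand_summary
```

Grouping on its own does not change the rows returned; it only matters once a step such as summarise() uses the groups.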
2.) Incorporate background research about this topic. This background information will include information you find in an article, website, or book. Please source this background information within the essay or if you have multiple sources, include a bibliography. I am not particular about the format of this bibliography. If you need help finding articles, I am happy to help you and/or show you how to search the MC Library Database.
The data dictionary was very useful in defining my variables. I located it on the Kaggle website, but some variables, such as “love”, needed further definition. The dictionary defines “love” as “The number of people loving the product”. But how is this measured? If you visit the Sephora product URLs provided in this dataset, you will find a heart-shaped button that represents the “love” variable. This button works much like the like button on many social platforms.
3.) What the visualization represents, any interesting patterns or surprises that arise within the visualization, and anything that could have been shown that you could not get to work or that you wished you could have included.
I would have loved to include more interactivity within my charts. I would like users to be able to hover over a bubble or dot and learn the name and ingredients of each product. I attempted to combine the variables using the paste() function, but it turned out to be a complete disaster.
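One way to get that hover behavior, sketched here under assumptions and not as a tested fix for the original attempt, is to build a single label column with paste0() and pass it to plotly's text argument. The demo rows and the 80-character cutoff are hypothetical; the column names mirror the BOD tibble.

```r
# Hypothetical stand-in rows; the real BOD tibble uses the same column names.
demo <- data.frame(
  name = c("Product A", "Product B"),
  rating = c(4.5, 4.0),
  price = c(48, 215),
  ingredients = c("Water, Glycerin", "Water, Purslane Extract"),
  stringsAsFactors = FALSE
)

# Build one hover label per row; <br> is the line break plotly understands.
# Ingredient lists are long, so keep only the first 80 characters.
demo$hover_text <- paste0(
  demo$name, "<br>Ingredients: ", substr(demo$ingredients, 1, 80)
)

# Draw the scatter plot only if plotly is installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  hover_plot <- plotly::plot_ly(
    demo,
    x = ~rating, y = ~price,
    type = "scatter", mode = "markers",
    text = ~hover_text, hoverinfo = "text"
  )
}
```

Printing hover_plot displays the chart, and setting hoverinfo to "text" shows only the custom label when the pointer is over a dot.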