Topic: Sephora Beauty Website’s Rating and Product Analysis

Dataset: Sephora Website Data

Owner: Raghad Alharbi

Source: Kaggle.com

Data dictionary can be found at: https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website

The variables included in this entire data set are the following: id, brand, category, name, size, rating, number of reviews, love, price, value price, URL, marketing flags, ingredients, online only, exclusive, limited edition, limited time offer.

—————— Loading Libraries ———————

#We will begin by loading the packages we need for this project.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr) #to wrangle our data
library(ggplot2) #for plotting our data
library(RColorBrewer) #adding some color
## Warning: package 'RColorBrewer' was built under R version 4.1.3

———— Reading data ————–

library(readr)
sephora_website_dataset <- read_csv("sephora_website_dataset.csv")
## Rows: 9168 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): brand, category, name, size, URL, MarketingFlags_content, options,...
## dbl (10): id, rating, number_of_reviews, love, price, value_price, online_on...
## lgl  (1): MarketingFlags
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(sephora_website_dataset)

———- Battle of the Doctor Brands: Moisturizer Edition ——————

I selected 8 variables from this data set and then used the filter() function from the dplyr package to keep only the Doctor-owned brands at Sephora that are named in the code below. These brands carry many different products, such as face washes, face serums, face masks, sunscreens, and other specialty products. I decided to keep only moisturizers rated 3.5 stars or higher, and my code returned 21 observations of 8 variables.

BOD <- sephora_website_dataset %>%
  select(brand, category, rating, number_of_reviews, price, name, love, ingredients) %>%
  filter(brand %in% c("Dr Roebuck's", "Dr. Barbara Sturm", "Dr. Brandt Skincare", "Dr. Dennis Gross Skincare", "Dr. Jart+")) %>%
  filter(category %in% c("Moisturizers")) %>%
  group_by(name, ingredients) %>%
  filter(rating >= 3.5) #keep only products rated 3.5 stars or higher
BOD
## # A tibble: 21 x 8
## # Groups:   name, ingredients [21]
##    brand          category rating number_of_revie~ price name   love ingredients
##    <chr>          <chr>     <dbl>            <dbl> <dbl> <chr> <dbl> <chr>      
##  1 Dr Roebuck's   Moistur~    4                236    45 No W~  5300 "-Hyaluron~
##  2 Dr Roebuck's   Moistur~    4                  3    45 Stok~   272 "-WildBerr~
##  3 Dr. Barbara S~ Moistur~    4                237   215 Face~  1700 "-Purslane~
##  4 Dr. Barbara S~ Moistur~    4.5                5   230 Face~   757 "-Skullcap~
##  5 Dr. Barbara S~ Moistur~    4.5                3   215 Clar~  1000 "-Complex ~
##  6 Dr. Barbara S~ Moistur~    4.5                3   205 Face~   555 "-Purslane~
##  7 Dr. Barbara S~ Moistur~    5                  1   230 Brig~   235 "-Extract ~
##  8 Dr. Barbara S~ Moistur~    5                  3   215 Dark~   355 "-Extracts~
##  9 Dr. Barbara S~ Moistur~    4                  1   230 Dark~   166 "-Extracts~
## 10 Dr. Brandt Sk~ Moistur~    4                 16    72 Hyal~  1700 "-Multi-hy~
## # ... with 11 more rows

This was one of my favorite visualizations to plot. Customer ratings are on the X-axis and price is on the Y-axis. Dot color identifies the doctor brand, as shown in the legend to the right. Below that, a second legend shows that dot size is scaled by the number of customers who left a review for each product. Dr. Jart+ had a 4.5-star rating, received the most customer reviews, and was priced at about $50.00 in USD. Its competitor Dr. Barbara Sturm received ratings of 4 to 5 stars but drew few customer reviews for most of its products and was priced above $200.00 in USD.

BODplot <- ggplot(BOD, aes(x = rating, y = price, size = number_of_reviews, color = brand)) +
  geom_point(alpha = 0.9) +
  scale_size(range = c(.1, 9), name = "Customers Who Left a Review") + #dot size reflects the number of reviews
  labs(title = "Battle of the Doctors: Moisturizer Edition") +
  ylab("Price (in USD)") +
  xlab("Ratings out of 5 Stars")

BODplot

——– High End Luxury Brands: Perfume Edition ————

I enjoy a good perfume, and I’ve always noticed that higher-end brands use good-quality ingredients and that their perfumes last a lot longer than those of mid-luxury brands like Victoria’s Secret. I decided to focus on three well-known luxury perfume brands sold on the Sephora website: Gucci, HERMÈS, and Maison Margiela. I was able to do this using the select() and filter() functions from the dplyr package. I selected 7 variables: brand, size, love, price, category, name, and rating. I kept only perfumes priced above $100.00 in USD and moved on to plotting my data in the next step.

Lux <- sephora_website_dataset %>%
  select(brand, size, love, price, category, name, rating) %>%
  filter(brand %in% c("HERMÈS", "Gucci", "Maison Margiela")) %>%
  filter(category %in% c("Perfume")) %>% #keep only Perfume products
  filter(price > 100) %>% #keep perfumes priced above $100.00; the threshold is unquoted so prices are compared as numbers
  arrange(name)
  
head(Lux, 5)
## # A tibble: 5 x 7
##   brand           size            love price category name                rating
##   <chr>           <chr>          <dbl> <dbl> <chr>    <chr>                <dbl>
## 1 Maison Margiela 3.4 oz/ 100 mL  3700   130 Perfume  'REPLICA' Springti~    4  
## 2 Maison Margiela 3.4 oz/ 100 mL 37000   130 Perfume  ’REPLICA’ Beach Wa~    4  
## 3 Maison Margiela 3.4 oz/ 100 mL 41900   130 Perfume  ’REPLICA’ By The F~    4.5
## 4 Maison Margiela 3.4 oz/ 100 mL  2600   180 Perfume  ’REPLICA’ Fantasie~    3.5
## 5 Maison Margiela 3.4 oz/ 100 mL 13700   130 Perfume  ’REPLICA’ Flower M~    4
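
One detail worth checking in a price filter is whether the threshold is written as a number or as quoted text. When a numeric column is compared with a quoted value, R converts the numbers to text and compares them character by character. The sketch below uses made-up prices (an assumption for illustration, not values taken from the dataset) to show the difference.

```r
# Hypothetical prices for illustration only; these are not rows from the dataset.
prices <- c(30, 99, 130, 180)

# Quoted threshold: the numbers are converted to text and compared as strings,
# so "30" sorts after "100" because "3" comes after "1".
as_text <- prices > "100"

# Unquoted threshold: an ordinary numeric comparison.
as_number <- prices > 100

print(data.frame(prices, as_text, as_number))
```

With the unquoted threshold only the prices 130 and 180 pass, which is the behavior a "greater than $100.00" filter is meant to have.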

My chart is very simple to read. On the X-axis you can see the customer ratings for these three luxury perfume brands. The Y-axis measures the number of “love button” clicks the product received on the Sephora website. I found it interesting that more people were willing to click the love button than to leave a star rating. I attribute this to the love button being a faster way to “review” a product, but I worry about its integrity with all the “internet bots” or spammers fishing around. In my opinion, star reviews are more personal and are more often followed up with a written review of the product. The number of customers who left a review was available as a variable in this dataset, but I did not use it.

Luxp1 <- Lux %>%
  ggplot(aes(x=rating, y=love)) +
  labs(title= "High End Luxury Brands: Perfume Edition")+
  xlab("Customer Ratings out of 5 Stars")+
  ylab("Love Button Reactions") +
  theme_minimal(base_size = 15) +
  geom_point(aes(color= brand)) + scale_color_brewer(palette="Set1")

Luxp1

———– Statistical Analysis on Fenty Beauty by Rihanna ——————–

Fenty Beauty by Rihanna is Black-owned, internationally popular, and celebrates diversity worldwide. Rihanna also provides a brand that is affordable for the everyday consumer while providing a high-quality product. I wanted to keep all her products within this dataset but look at some of her most popular items based on ratings. I selected 9 variables from this data set: brand, category, rating, number of reviews, price, name, love button reactions, value price, and size. Then I used the filter() function from the dplyr package to keep ratings greater than or equal to 4.5 stars. My code returned 39 observations of 9 variables.

Fenty <- sephora_website_dataset %>%
  select(brand, category, rating, number_of_reviews, price, name, love, value_price, size) %>% #9 variables selected
  filter(brand %in% c("FENTY BEAUTY by Rihanna"), rating >= 4.5) %>% #compare rating as a number, not as text
  arrange(price) #price is now sorted from low $ to high $
Fenty
## # A tibble: 39 x 9
##    brand   category rating number_of_revie~ price name    love value_price size 
##    <chr>   <chr>     <dbl>            <dbl> <dbl> <chr>  <dbl>       <dbl> <chr>
##  1 FENTY ~ Sponges~    4.5              466    16 Prec~  55300          16 no s~
##  2 FENTY ~ Lipstick    4.5             2000    18 Matt~ 349300          18 0.06~
##  3 FENTY ~ Lip Bal~    4.5              522    18 Pro ~  47900          18 no s~
##  4 FENTY ~ Lip Glo~    4.5            10000    19 Glos~ 553300          19 0.3 ~
##  5 FENTY ~ Lipstick    4.5              177    20 Pout~  39000          20 0.1 ~
##  6 FENTY ~ Face Se~    5               1000    23 Bomb~ 156900          23 no s~
##  7 FENTY ~ Bronzer     4.5               89    24 Lil'~  35200          24 no s~
##  8 FENTY ~ Face Br~    4.5               84    24 Port~  28700          24 no s~
##  9 FENTY ~ Eye Bru~    5                 47    24 Tape~  19800          24 no s~
## 10 FENTY ~ Eye Bru~    5                  2    24 Prec~   3500          24 no s~
## # ... with 29 more rows

Frequency Table & One-Sample t-Test

To begin the statistical analysis of the variable price, I first created a frequency table, where we can see that the $32.00, $34.00, and $36.00 USD price points each had four or five products. I then ran a one-sample t-test using the t.test() function. A one-sample t-test is used to test the statistical difference between a sample mean and an assumed or hypothesized value of the population mean. Because I did not set the mu argument, t.test() compared the sample mean against a hypothesized mean of 0, which is why the p-value is so small. The code returned the “mean of x”, or average price, of about $33.26 USD, with a 95 percent confidence interval of roughly $28.39 to $38.12.
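
To make the test statistic less of a black box, here is a minimal sketch of the one-sample formula t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)), checked against t.test(). The prices and the $30 benchmark are made-up assumptions for illustration; they are not the Fenty values.

```r
# Hypothetical prices for illustration only; these are not the Fenty prices.
prices <- c(18, 24, 24, 32, 34, 36, 36, 45)

# Hypothesized mean; 0 is the default that t.test() uses when mu is not set.
mu0 <- 0

n <- length(prices)
t_by_hand <- (mean(prices) - mu0) / (sd(prices) / sqrt(n))

# t.test() should report the same statistic, with n - 1 degrees of freedom.
result <- t.test(prices, mu = mu0)

print(t_by_hand)
print(unname(result$statistic))
print(unname(result$parameter))  # degrees of freedom

# Testing against a more meaningful value, here a hypothetical $30 benchmark.
print(t.test(prices, mu = 30)$p.value)
```

Setting mu to a value of real interest, rather than leaving the default of 0, turns the test into a question such as "is the average price different from $30?".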

table(Fenty$price)
## 
## 16 18 19 20 23 24 25 26 28 29 30 32 34 36 39 42 45 50 54 69 99 
##  1  2  1  1  1  4  3  1  2  2  1  4  4  5  1  1  1  1  1  1  1
t.test(Fenty$price)
## 
##  One Sample t-test
## 
## data:  Fenty$price
## t = 13.843, df = 38, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  28.39300 38.11982
## sample estimates:
## mean of x 
##  33.25641
histFenty <- hist(Fenty$price, main = "Price of Fenty Beauty by Rihanna Products (ratings 4.5-5.0)", xlab = "Price (in USD)") #this is a basic R histogram of price using the hist() function

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plotFenty <- plot_ly(x = Fenty$price,
              type = "histogram") %>% 
  layout(title = "Frequency: Price (in USD) of Fenty by Rihanna Products (ratings 4.5-5.0)",
         xaxis = list(title = "Price (in USD)",
                      zeroline = FALSE),
         yaxis = list(title = "Frequency",
                      zeroline = FALSE))

plotFenty

1.) The topic of the data, any variables included, what kind of variables they are, where the data came from and how you cleaned it up (be detailed and specific, using proper terminology where appropriate). Be sure to explain why you chose this topic and dataset – what meaning does it have for you?

Once I loaded the dplyr package, I was able to use various functions to clean my dataset. I selected the variables I wanted to focus on using the select() function. I then filtered the rows using the filter() function. I grouped certain variables using the group_by() function.
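
The three steps described here can be sketched on a tiny made-up table. The rows and the per-brand summary are assumptions for illustration and are not part of the original analysis; the point is that group_by() mainly pays off when it is followed by a summarise() step.

```r
library(dplyr)

# Hypothetical rows for illustration only; not taken from the dataset.
toy <- tibble(
  brand    = c("Brand A", "Brand A", "Brand B", "Brand B"),
  category = c("Moisturizers", "Moisturizers", "Moisturizers", "Perfume"),
  rating   = c(4.5, 3.0, 4.0, 5.0),
  price    = c(45, 60, 215, 130)
)

summary_tbl <- toy %>%
  select(brand, category, rating, price) %>%              # keep only the columns of interest
  filter(category == "Moisturizers", rating >= 3.5) %>%   # keep rows that meet both conditions
  group_by(brand) %>%                                     # one group per brand
  summarise(products = n(), mean_price = mean(price), .groups = "drop")

print(summary_tbl)
```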

2.) Incorporate background research about this topic. This background information will include information you find in an article, website, or book. Please source this background information within the essay or if you have multiple sources, include a bibliography. I am not particular about the format of this bibliography. If you need help finding articles, I am happy to help you and/or show you how to search the MC Library Database.

The data dictionary was very useful in defining my variables. I was able to locate this on the Kaggle website but still needed to further define some variables like “love”. The “love” variable is defined as “The number of people loving the product”. But how is this measured? If you visit the Sephora product URLs provided in this data set, you will find a heart-shaped button representing the “love” variable. This button is similar to the like button on many social platforms.

3.) What the visualization represents, any interesting patterns or surprises that arise within the visualization, and anything that could have been shown that you could not get to work or that you wished you could have included.

I would have loved to include more interactivity within my charts. I would like users to be able to hover over a bubble or dot and learn the name and ingredients of each product. I made an attempt to combine the variables using the paste() function, but it turned out to be a complete disaster.
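
One possible way to approach the hover idea is sketched below. It is not the code from my attempt; it uses paste0() (a variant of paste() with no separator) to build one label per product and passes it to plot_ly() through the text and hoverinfo arguments. The two products are made up for illustration, and the ingredient text is shortened with substr() because real ingredient lists are long.

```r
# Hypothetical products for illustration only; not rows from the dataset.
products <- data.frame(
  name        = c("Cream A", "Gel B"),
  ingredients = c("water, glycerin", "water, aloe"),
  rating      = c(4.5, 4.0),
  price       = c(45, 72),
  stringsAsFactors = FALSE
)

# Shorten long ingredient lists so the hover box stays readable.
short_ingredients <- substr(products$ingredients, 1, 40)

# Build one hover label per product; <br> starts a new line in plotly hover text.
products$hover <- paste0(
  "Name: ", products$name,
  "<br>Ingredients: ", short_ingredients
)
print(products$hover)

# Build the interactive scatter plot only if plotly is installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(
    data = products,
    x = ~rating, y = ~price,
    type = "scatter", mode = "markers",
    text = ~hover, hoverinfo = "text"
  )
  print(class(p))
}
```

Typing p at the console (or leaving it as the last line of a chunk) would display the chart, and hovering over a dot should then show the product name and its ingredients.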