Introduction

The beauty industry is one of the largest global markets, generating billions of dollars each year. Beauty product assessment is crucial for brands, beauty professionals, customers and retailers, as customer feedback can reveal important insights into product performance. In this project we analyse the Sephora Products and Skin Care Reviews dataset, a free dataset available on Kaggle.com with over 8,000 product records and about 1 million user reviews of over 2,000 products. Our main goal is to extract and analyse key information from the dataset, such as:

- The most popular brands and products by number of reviews
- The emotional tone of reviews
- The brands and products with the most positive or negative reviews
- The most common words in positive and negative reviews

Together, these analyses give a picture of customer perception and can support quality control decisions and marketing strategies.

Most Reviewed Brands

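This visualization ranks the 15 most reviewed brands in the dataset. The plot shows how customer engagement is distributed across the brands behind the roughly 2,000 reviewed products. Because the preprocessing step below drops neutral (3-star) reviews, the counts cover positive and negative reviews only.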

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(stringr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
#set up working directory + load dataset
setwd("/Users/anacarolinadesouzadossantos/Documents/Mestrado/Quarto semestre/QL/Project")


# Combine all files into one data frame (ChatGPT assisted)
allfiles <- c(
  "reviews_0_250.csv",
  "reviews_250_500.csv",
  "reviews_500_750.csv",
  "reviews_750_1000.csv",
  "reviews_1000_1500.csv",
  "reviews_1500_end.csv"
)

reviews <- allfiles %>%
  map_dfr(~ read_csv(.x, show_col_types = FALSE) %>%
            clean_names() %>%
            select(-author_id))
## New names:
## • `` -> `...1`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## New names:
## New names:
## New names:
## • `` -> `...1`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## New names:
## • `` -> `...1`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## New names:
## • `` -> `...1`
# Establish sentiment values (basic preprocessing)
reviews2 <- reviews %>%
  filter(!is.na(review_text), review_text != "") %>%
  mutate(
    sentiment = case_when(
      rating >= 4 ~ "positive",
      rating <= 2 ~ "negative",
      TRUE ~ "neutral"
    ),
    review_length_words = str_count(review_text, "\\w+")
  ) %>%
  filter(sentiment != "neutral")

top_brands <- reviews2 %>%
  count(brand_name, sort = TRUE) %>%
  slice_head(n = 15)

# Top brands by review count
ggplot(top_brands, aes(x = reorder(brand_name, n), y = n)) +
  geom_col(fill = "pink") +
  coord_flip() +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) +
  labs(
    title = "Most Reviewed Brands",
    x = "Brand",
    y = "Number of Reviews"
  ) +
  theme_minimal(base_size = 12)
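
To put a number on how concentrated the reviews are among the top brands, one option is to add each brand's share of all reviews and a running total. The sketch below is illustrative only: `toy_reviews` and its brand names are made-up stand-ins for the real `reviews2` table, and the column names `share` and `cumulative_share` are assumptions, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for reviews2 (one row per review)
toy_reviews <- data.frame(
  brand_name = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D")
)

brand_share <- toy_reviews %>%
  count(brand_name, sort = TRUE) %>%      # reviews per brand, largest first
  mutate(
    share = n / sum(n),                   # fraction of all reviews
    cumulative_share = cumsum(share)      # running total down the ranking
  )

brand_share
```

Reading `cumulative_share` at row 15 of the real table would tell how much of the review volume the plotted brands account for.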

Most Reviewed Products

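# Assumption: top_products is not defined anywhere earlier in this report, so it is
# built here the same way as top_brands (top 15 by review count)
top_products <- reviews2 %>%
  count(product_name, sort = TRUE) %>%
  slice_head(n = 15)

# Top products by review count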
ggplot(top_products, aes(x = n, y = reorder(product_name, n))) +
  geom_col(fill = "lightblue") +
  labs(
    title = "Most Reviewed Products",
    x = "Number of Reviews",
    y = "Product"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.y = element_text(size = 9),
    plot.title = element_text(face = "bold")
  )

This figure applies the same ranking to individual products instead of brands, highlighting the most reviewed products in the dataset. A high level of consumer interaction means these products have a large body of review text, which makes them particularly valuable for sentiment and text-based analysis.

Sentiment Distribution

# Rebuild reviews2, this time keeping neutral reviews and capitalising the labels
reviews2 <- reviews %>%
  filter(!is.na(review_text), review_text != "") %>%
  mutate(
    sentiment = case_when(
      rating >= 4 ~ "Positive",
      rating <= 2 ~ "Negative",
      TRUE ~ "Neutral"
    )
  )

ggplot(reviews2, aes(x = sentiment)) +
  geom_bar(fill = "lightgreen") +
  labs(
    title = "Sentiment Distribution of Reviews",
    x = "Sentiment",
    y = "Number of Reviews"
  ) +
  theme_minimal(base_size = 12)

This visualization illustrates the overall distribution of sentiment in the dataset, categorized as Positive (rating of 4 or 5), Neutral (rating of 3) and Negative (rating of 1 or 2). The distribution reveals a strong predominance of positive reviews, suggesting high overall customer satisfaction. Although negative and neutral reviews occur less frequently, they can be useful indicators for future improvement of product features.
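
The labelling rule and the resulting class shares can be checked on a small example. This is a minimal sketch on made-up ratings (`toy_reviews` is hypothetical); it reuses the same `case_when()` thresholds as the code above and adds a `share` column, which is not part of the original analysis.

```r
library(dplyr)

# Hypothetical ratings standing in for the real reviews table
toy_reviews <- data.frame(rating = c(5, 5, 4, 3, 2, 1, 5, 4))

sentiment_share <- toy_reviews %>%
  mutate(
    sentiment = case_when(
      rating >= 4 ~ "Positive",   # 4 and 5 stars
      rating <= 2 ~ "Negative",   # 1 and 2 stars
      TRUE ~ "Neutral"            # everything else, i.e. 3 stars
    )
  ) %>%
  count(sentiment) %>%            # number of reviews per class
  mutate(share = n / sum(n))      # proportion of all reviews

sentiment_share
```

One detail worth keeping in mind: with this rule, a review whose rating is missing would also fall through to "Neutral", because `case_when()` treats an `NA` condition as not matched.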

Brands With the Most Positive Reviews

top_positive_brands <- reviews2 %>%
  filter(sentiment == "Positive") %>%
  count(brand_name, sort = TRUE) %>%
  slice_head(n = 5)

top_negative_brands <- reviews2 %>%
  filter(sentiment == "Negative") %>%
  count(brand_name, sort = TRUE) %>%
  slice_head(n = 5)

ggplot(top_positive_brands,
       aes(x = n, y = reorder(brand_name, n))) +
  geom_segment(aes(x = 0, xend = n, yend = brand_name),
               color = "pink") +
  geom_point(size = 4, color = "red") +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Brands with the Most Positive Reviews",
    x = "Number of Positive Reviews",
    y = "Brand"
  ) +
  theme_minimal(base_size = 12)

This visualization highlights the five brands that received the largest number of positive reviews. These brands show strong customer satisfaction and engagement, suggesting their products are generally well received and offer a good user experience. The insights provided in this figure could help us better understand what pleases customers the most and which chemical properties of products affect customer satisfaction.
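
Raw counts favour brands that simply have many reviews, so a complementary view is the share of each brand's reviews that are positive. The sketch below is an illustration on made-up data, not part of the original analysis: `toy_reviews`, the brand names and the `min_reviews` threshold are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data: Brand A has the most positive reviews in absolute
# terms, Brand B has the highest positive share
toy_reviews <- data.frame(
  brand_name = c(rep("Brand A", 10), rep("Brand B", 4), rep("Brand C", 2)),
  sentiment  = c(rep("Positive", 6), rep("Negative", 4),
                 rep("Positive", 4),
                 "Positive", "Negative")
)

min_reviews <- 3  # hypothetical cut-off to skip brands with very few reviews

brand_positive_share <- toy_reviews %>%
  group_by(brand_name) %>%
  summarise(
    n_reviews = n(),
    n_positive = sum(sentiment == "Positive"),
    positive_share = n_positive / n_reviews,
    .groups = "drop"
  ) %>%
  filter(n_reviews >= min_reviews) %>%
  arrange(desc(positive_share))

brand_positive_share
```

Here Brand B ranks first by share even though Brand A has more positive reviews in total, which is the kind of difference the count-based plot cannot show.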

Brands With the Most Negative Reviews

ggplot(top_negative_brands,
       aes(x = n, y = reorder(brand_name, n))) +
  geom_segment(aes(x = 0, xend = n, yend = brand_name),
               color = "darkblue") +
  geom_point(size = 4, color = "blue") +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Brands with the Most Negative Reviews",
    x = "Number of Negative Reviews",
    y = "Brand"
  ) +
  theme_minimal(base_size = 12)

This visualization shows the five brands with the highest number of negative reviews. A high volume of negative reviews may indicate customer dissatisfaction with the product and brand, although widely reviewed brands also accumulate more negative reviews in absolute terms. Such insights can be useful for retailers seeking to prioritize products that better meet customer expectations and improve overall consumer satisfaction.

Positive vs Negative Vocabulary Plot

#(Chat GPT Assisted) 
set.seed(123)

reviews_sample <- reviews2 %>%
  filter(sentiment %in% c("Positive", "Negative")) %>%
  group_by(sentiment) %>%
  slice_sample(n = 20000) %>%  
  ungroup()


sentiment_words <- reviews_sample %>%
  unnest_tokens(word, review_text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "^[a-z]+$"))


sentiment_words %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_head(n = 15)
## # A tibble: 30 × 3
## # Groups:   sentiment [2]
##    sentiment word         n
##    <chr>     <chr>    <int>
##  1 Negative  skin     18023
##  2 Negative  product  12135
##  3 Negative  dry       3950
##  4 Negative  love      3006
##  5 Negative  feel      2762
##  6 Negative  smell     2625
##  7 Negative  products  2486
##  8 Negative  acne      2444
##  9 Negative  oily      2363
## 10 Negative  cream     2225
## # ℹ 20 more rows
sentiment_words %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 15) %>%
  ungroup() %>%
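  # reorder_within() / scale_x_reordered() (tidytext) sort the bars separately
  # inside each facet; plain reorder() uses one ordering for both panels
  ggplot(aes(x = reorder_within(word, n, sentiment), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~ sentiment, scales = "free") +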
  labs(
    title = "Most Frequent Words in Positive vs Negative Reviews",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()

# number of rows and columns
dim(reviews2)
## [1] 1299520      19

This visualization compares the most frequent words used by customers in positive and negative reviews, based on a random sample of up to 20,000 reviews per sentiment. Generic terms such as "skin" and "product" top the negative list, and "love" appears among the top negative words too, so raw frequency alone does not fully separate the two groups. Still, the negative list features concern-related words such as "dry", "smell", "acne" and "oily", while positive reviews emphasize product effectiveness, quality and user satisfaction. Looking at the most used words can show what bothers customers the most and what pleases them the most.
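
Because some words are frequent in both groups, a relative measure can make the contrast easier to read. The sketch below computes a log ratio of each word's relative frequency in positive versus negative reviews. It is a swapped-in technique for illustration, not the method used above: `toy_counts` holds made-up numbers in the shape produced by `count(sentiment, word)`, and `eps` is a hypothetical smoothing constant that avoids division by zero.

```r
library(dplyr)

# Hypothetical word counts per sentiment (same shape as count(sentiment, word))
toy_counts <- data.frame(
  sentiment = c("Positive", "Positive", "Positive",
                "Negative", "Negative", "Negative"),
  word      = c("skin", "love", "dry", "skin", "love", "dry"),
  n         = c(50, 40, 10, 50, 10, 40)
)

eps <- 1e-6  # hypothetical smoothing constant

word_ratio <- toy_counts %>%
  group_by(sentiment) %>%
  mutate(freq = n / sum(n)) %>%            # relative frequency within each sentiment
  group_by(word) %>%
  summarise(
    positive_freq = sum(freq[sentiment == "Positive"]),
    negative_freq = sum(freq[sentiment == "Negative"]),
    .groups = "drop"
  ) %>%
  mutate(log_ratio = log2((positive_freq + eps) / (negative_freq + eps))) %>%
  arrange(desc(log_ratio))                 # most "positive" words first

word_ratio
```

Words with a log ratio near zero (here "skin") are used about equally in both groups, while large positive or negative values point to words that are characteristic of one sentiment.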

Conclusion

After careful analysis we can conclude that this dataset is rich in customer sentiment, providing key insights into product and brand performance. The insights shown here can be used to improve product properties and to inform marketing.