Business Analytics Report on Beauty Cosmetics Products

Author

Aashita Gloria Noah

Published

March 3, 2025

Data Acquisition

The following dataset used for analysis in this project has been taken from Kaggle. The dataset used in this project contains information on 15,000 beauty and cosmetic products, including details about brands, categories, pricing, ratings, skin types, packaging, and country of origin.

This dataset provides valuable insights for businesses in the cosmetics industry, helping them understand market trends, consumer preferences, and pricing dynamics. By leveraging this analysis, businesses can identify high-demand categories, customer segmentation, and perform product bundling to increase profits from beauty products sold in the evolving beauty market.

Detailed Column Descriptions:
- Product_Name: Name of the product.
- Brand: Brand of the product.
- Category: Type of cosmetic product (e.g., Blush, Makeup Remover, Highlighter).
- Usage_Frequency: How often the product is used (e.g., Daily, Weekly, Occasional).
- Price_USD: Price of the product in USD.
- Rating: Customer rating of the product.
- Number_of_Reviews: Number of reviews received.
- Product_Size: Size of the product (e.g., 30ml, 100ml).
- Skin_Type: Suitable skin type (e.g., Sensitive, Dry, Normal, Oily).
- Gender_Target: Target gender (e.g., Male, Female, Unisex).
- Packaging_Type: Type of packaging (e.g., Tube, Bottle, Compact).
- Main_Ingredient: Primary ingredient (e.g., Retinol, Aloe Vera).
- Cruelty_Free: Whether the product is cruelty-free (True/False).
- Country_of_Origin: The country where the product is made.

Introduction

This report explores data from the most_used_beauty_cosmetics_products_extended.csv dataset to analyze consumer behavior in the beauty industry. The objective is to extract key insights and provide actionable recommendations for businesses in this sector.

Data Acquisition & Preparation

# Load necessary libraries
library(tidyverse)
library(DBI)
library(RSQLite)

Load and Inspect Data

# Load dataset
data <- read.csv("most_used_beauty_cosmetics_products_extended.csv", stringsAsFactors = FALSE)

# View first few rows
head(data)

      Product_Name             Brand       Category Usage_Frequency Price_USD
1  Ultra Face Mask    Drunk Elephant          Blush          Weekly     67.85
2   Ultra Lipstick     Laura Mercier Makeup Remover      Occasional    116.43
3      Ultra Serum    Natasha Denona    Highlighter           Daily     90.84
4     Divine Serum       Ilia Beauty      Face Mask      Occasional     55.17
5 Super Foundation Charlotte Tilbury    Highlighter      Occasional    140.56
6 Super Eye Shadow   Danessa Myricks     Foundation          Weekly    135.82
  Rating Number_of_Reviews Product_Size Skin_Type Gender_Target Packaging_Type
1    1.4               686         30ml Sensitive        Female           Tube
2    4.2              5483        250ml       Dry        Unisex         Bottle
3    1.6              5039        100ml Sensitive          Male        Compact
4    3.2              6202        250ml    Normal          Male           Tube
5    1.7               297        100ml      Oily        Female        Compact
6    3.2              9405        150ml Sensitive        Female        Compact
  Main_Ingredient Cruelty_Free Country_of_Origin
1         Retinol        False         Australia
2     Shea Butter        False                UK
3       Aloe Vera         True             Italy
4        Glycerin         True       South Korea
5        Glycerin        False           Germany
6        Glycerin         True                UK

# Check data structure
str(data)

'data.frame':   15000 obs. of  14 variables:
 $ Product_Name     : chr  "Ultra Face Mask" "Ultra Lipstick" "Ultra Serum" "Divine Serum" ...
 $ Brand            : chr  "Drunk Elephant" "Laura Mercier" "Natasha Denona" "Ilia Beauty" ...
 $ Category         : chr  "Blush" "Makeup Remover" "Highlighter" "Face Mask" ...
 $ Usage_Frequency  : chr  "Weekly" "Occasional" "Daily" "Occasional" ...
 $ Price_USD        : num  67.8 116.4 90.8 55.2 140.6 ...
 $ Rating           : num  1.4 4.2 1.6 3.2 1.7 3.2 2.5 4.3 3.3 4.4 ...
 $ Number_of_Reviews: int  686 5483 5039 6202 297 9405 2423 8032 2468 4127 ...
 $ Product_Size     : chr  "30ml" "250ml" "100ml" "250ml" ...
 $ Skin_Type        : chr  "Sensitive" "Dry" "Sensitive" "Normal" ...
 $ Gender_Target    : chr  "Female" "Unisex" "Male" "Male" ...
 $ Packaging_Type   : chr  "Tube" "Bottle" "Compact" "Tube" ...
 $ Main_Ingredient  : chr  "Retinol" "Shea Butter" "Aloe Vera" "Glycerin" ...
 $ Cruelty_Free     : chr  "False" "False" "True" "True" ...
 $ Country_of_Origin: chr  "Australia" "UK" "Italy" "South Korea" ...

Data Cleaning

# Remove missing values
data <- na.omit(data)

# Standardize column names
data <- janitor::clean_names(data)

# Convert categorical variables to factors (if needed)
data <- data %>% mutate(across(where(is.character), as.factor))

# Summary statistics
summary(data)

              product_name                brand              category    
 Super Setting Spray:  154   Milk Makeup     :  426   Serum      :  710  
 Magic Face Oil     :  151   Make Up For Ever:  414   Mascara    :  674  
 Magic Lip Liner    :  149   Kiehl’s         :  411   Face Oil   :  671  
 Perfect Lip Liner  :  147   NARS            :  400   Face Mask  :  653  
 Divine Exfoliator  :  146   E.l.f.          :  399   Highlighter:  653  
 Ultra Eye Shadow   :  144   Farsali         :  399   Moisturizer:  638  
 (Other)            :14109   (Other)         :12551   (Other)    :11001  
   usage_frequency   price_usd          rating      number_of_reviews
 Daily     :3668   Min.   : 10.00   Min.   :1.000   Min.   :   52    
 Monthly   :3781   1st Qu.: 45.48   1st Qu.:2.000   1st Qu.: 2562    
 Occasional:3794   Median : 80.04   Median :3.000   Median : 5002    
 Weekly    :3757   Mean   : 80.13   Mean   :3.002   Mean   : 5014    
                   3rd Qu.:114.76   3rd Qu.:4.000   3rd Qu.: 7497    
                   Max.   :149.99   Max.   :5.000   Max.   :10000    
                                                                     
 product_size       skin_type    gender_target packaging_type
 100ml:2551   Combination:3060   Female:5002   Bottle :2532  
 150ml:2487   Dry        :2965   Male  :5017   Compact:2454  
 200ml:2423   Normal     :2991   Unisex:4981   Jar    :2567  
 250ml:2498   Oily       :3009                 Spray  :2528  
 30ml :2540   Sensitive  :2975                 Stick  :2529  
 50ml :2501                                    Tube   :2390  
                                                             
        main_ingredient cruelty_free   country_of_origin
 Aloe Vera      :2091   False:7592   Italy      :1942   
 Glycerin       :2174   True :7408   USA        :1931   
 Hyaluronic Acid:2090                Australia  :1873   
 Retinol        :2180                France     :1863   
 Salicylic Acid :2149                Japan      :1861   
 Shea Butter    :2148                South Korea:1851   
 Vitamin C      :2168                (Other)    :3679

Exploratory Data Analysis (EDA)

# Load necessary library
library(ggplot2)

# Distribution of a key variable
ggplot(data, aes(x = category)) + 
  geom_bar(fill = "steelblue") + 
  theme_minimal() + 
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, margin = margin(t = 5))
  ) + 
  labs(
    title = "Distribution of Beauty Product Categories",
    x = "Category",
    y = "Count"
  )

Note

Analysis: The price of beauty products varies widely, with the majority of products priced under $100 USD. There are outliers with extremely high prices, particularly for luxury or premium versions of products.

Actionable insight: Customer segmentation: The beauty and cosmetic business can perform customer segmentation for customers who are willing to pay a premimum price for certain products, spending them specific marketing campagins

options(repos = c(CRAN = "https://cloud.r-project.org/"))

# Load necessary libraries
library(dplyr)
library(reticulate)

# Load the dataset
df <- read.csv("most_used_beauty_cosmetics_products_extended.csv", stringsAsFactors = FALSE)

# Step 1: Create Popularity Groups Based on Number of Reviews
df <- df %>%
  mutate(Popularity = case_when(
    Number_of_Reviews < 100 ~ "Low Popularity",
    Number_of_Reviews >= 100 & Number_of_Reviews < 500 ~ "Moderately Popular",
    Number_of_Reviews >= 500 ~ "Highly Popular"
  ))

# Step 2: Count Products by Category and Country
product_popularity_by_country <- df %>%
  group_by(Country_of_Origin, Category, Popularity) %>%
  summarise(ProductCount = n(), .groups = "drop") %>%
  arrange(desc(ProductCount))

# Display results
print(product_popularity_by_country)

# A tibble: 441 × 4
   Country_of_Origin Category    Popularity     ProductCount
   <chr>             <chr>       <chr>                 <int>
 1 Germany           Eyeliner    Highly Popular           98
 2 France            Highlighter Highly Popular           97
 3 Australia         Face Oil    Highly Popular           95
 4 Australia         Face Mask   Highly Popular           94
 5 Japan             Contour     Highly Popular           94
 6 South Korea       Bronzer     Highly Popular           93
 7 Australia         Serum       Highly Popular           92
 8 Italy             Face Oil    Highly Popular           92
 9 South Korea       Mascara     Highly Popular           90
10 UK                Bronzer     Highly Popular           90
# ℹ 431 more rows

# Step 3: Pass Data to Python
r_product_popularity <- product_popularity_by_country  # Send data to Python

# Price vs Rating Scatterplot
ggplot(data, aes(x = price_usd, y = rating)) +
  geom_point(alpha = 0.5, color = "darkred") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  theme_minimal() +
  labs(title = "Price vs. Rating of Beauty Products",
       x = "Price (USD)",
       y = "Rating")

Note

Analysis: Price and Rating isn’t essentially correlated

install.packages("reticulate")


The downloaded binary packages are in
    /var/folders/5c/zb28zyvd5l5485fszkpz_9000000gn/T//RtmpCzxB36/downloaded_packages

library(reticulate)

py_config()

python:         /Users/lucienalazard/.virtualenvs/r-reticulate/bin/python
libpython:      /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/config-3.13-darwin/libpython3.13.dylib
pythonhome:     /Users/lucienalazard/.virtualenvs/r-reticulate:/Users/lucienalazard/.virtualenvs/r-reticulate
version:        3.13.2 (v3.13.2:4f8bb3947cf, Feb  4 2025, 11:51:10) [Clang 15.0.0 (clang-1500.3.9.4)]
numpy:          /Users/lucienalazard/.virtualenvs/r-reticulate/lib/python3.13/site-packages/numpy
numpy_version:  2.2.3

NOTE: Python version was forced by VIRTUAL_ENV

reticulate::py_install(c("pandas", "matplotlib"), envname = "r-reticulate")

Using virtual environment 'r-reticulate' ...

+ /Users/lucienalazard/.virtualenvs/r-reticulate/bin/python -m pip install --upgrade --no-user pandas matplotlib

import pandas as pd

# Load dataset
df_py = pd.read_csv("most_used_beauty_cosmetics_products_extended.csv")

# Show basic statistics
df_py.describe()

          Price_USD        Rating  Number_of_Reviews
count  15000.000000  15000.000000       15000.000000
mean      80.134108      3.002327        5014.231333
std       40.402983      1.168029        2855.665464
min       10.000000      1.000000          52.000000
25%       45.480000      2.000000        2562.000000
50%       80.040000      3.000000        5002.000000
75%      114.760000      4.000000        7497.000000
max      149.990000      5.000000       10000.000000

Business Analytics

Analytical Approach

We analyze the correlation between price and rating, and identify the most frequently reviewed product categories.

# Correlation analysis
correlation <- cor(data$price_usd, data$rating, use = "complete.obs")
correlation

[1] -0.005232067

# Top 5 most reviewed products
top_products <- data %>%
  group_by(product_name) %>%
  summarise(total_reviews = sum(number_of_reviews)) %>%
  arrange(desc(total_reviews)) %>%
  head(5)
top_products

# A tibble: 5 × 2
  product_name         total_reviews
  <fct>                        <int>
1 Magic Face Oil              802679
2 Divine Exfoliator           769811
3 Ultra Eye Shadow            758601
4 Divine Setting Spray        743083
5 Magic Bronzer               742807

Integration with Python

We will use Python within Quarto to generate a histogram of prices.

Using Functions & Error Handling

# Function to calculate average rating per category
average_rating_per_category <- function(data) {
  tryCatch({
    data %>%
      group_by(category) %>%
      summarise(avg_rating = mean(rating, na.rm = TRUE))
  }, error = function(e) {
    message("Error in function: ", e)
  })
}

# Execute function
average_rating_per_category(data)

# A tibble: 24 × 2
   category   avg_rating
   <fct>           <dbl>
 1 BB Cream         3.06
 2 Blush            2.98
 3 Bronzer          2.98
 4 CC Cream         3.07
 5 Cleanser         3.01
 6 Concealer        3.00
 7 Contour          3.13
 8 Exfoliator       3.05
 9 Eye Shadow       2.95
10 Eyeliner         2.91
# ℹ 14 more rows

Visualization & Interpretation

# Average rating by category
avg_rating <- average_rating_per_category(data)
ggplot(avg_rating, aes(x = reorder(category, avg_rating), y = avg_rating)) +
  geom_col(fill = "skyblue") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Average Rating by Product Category",
       x = "Product Category",
       y = "Average Rating")

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(rworldmap)

Loading required package: sp

### Welcome to rworldmap ###

For a short introduction type :      vignette('rworldmap')

# Load dataset
data <- read.csv("most_used_beauty_cosmetics_products_extended.csv")

# Summarize product count by country
country_data <- data %>%
  group_by(Country_of_Origin) %>%
  summarise(Product_Count = n())

# Get world map data
world_map <- map_data("world")

# Merge dataset with world map
merged_data <- world_map %>%
  left_join(country_data, by = c("region" = "Country_of_Origin"))

# Plot the world map
ggplot(merged_data, aes(x = long, y = lat, group = group, fill = Product_Count)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "grey90") +
  theme_minimal() +
  labs(title = "Distribution of Beauty Products by Country of Origin",
       fill = "Product Count") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank())

Conclusion & Business Recommendations

Key Insights:

Product bundling for profit optimization: The dataset shows a wide variety of product categories, with skin care products leading in popularity. Categories like hair care, makeup, and fragrance also have a significant presence.From the top sold Beauty product category visualization, we can create bundle offers (Beauty kit) by identifying products that are frequently bought together. This will help the beauty and cosmetic business to recommend product bundles based on customer preferences by pushing them to buy more products.
Customer segmentation: The price of beauty products varies widely, with the majority of products priced under $100 USD. There are outliers with extremely high prices, particularly for luxury or premium versions of products.The beauty and cosmetic business can perform customer segmentation for customers who are willing to pay a premium price for certain products, spending them specific marketing campaigns
Price vs. Rating: Higher prices do not always correlate with better customer ratings.Therefore, increasing product prices may not essentially increase product rating.

Recommendations:

For Businesses: Companies can adopt a bundling strategy using a similarity score among products of the same brand to appeal the customers to make more purchases
For Consumers: Price alone does not guarantee product quality; consumers should rely on ratings and reviews.
For Marketing: Brands should come up with a loyalty program that offers discounts, coupons, or points to frequent consumers can categorize them into different consumer categories and offer relevant products as per their budget
Future Work: Further analysis on customer sentiments from reviews can provide deeper insights into consumer preferences.

Code Quality & Reproducibility

All code has been written using modular functions, appropriate error handling, and clear documentation. The project is fully reproducible.