# Load necessary libraries
library(tidyverse)
library(DBI)
library(RSQLite)
Business Analytics Report on Beauty Cosmetics Products
The following dataset used for analysis in this project has been taken from Kaggle. The dataset used in this project contains information on 15,000 beauty and cosmetic products, including details about brands, categories, pricing, ratings, skin types, packaging, and country of origin.
This dataset provides valuable insights for businesses in the cosmetics industry, helping them understand market trends, consumer preferences, and pricing dynamics. By leveraging this analysis, businesses can identify high-demand categories, customer segmentation, and perform product bundling to increase profits from beauty products sold in the evolving beauty market.
Detailed Column Descriptions:
- Product_Name: Name of the product.
- Brand: Brand of the product.
- Category: Type of cosmetic product (e.g., Blush, Makeup Remover, Highlighter).
- Usage_Frequency: How often the product is used (e.g., Daily, Weekly, Occasional).
- Price_USD: Price of the product in USD.
- Rating: Customer rating of the product.
- Number_of_Reviews: Number of reviews received.
- Product_Size: Size of the product (e.g., 30ml, 100ml).
- Skin_Type: Suitable skin type (e.g., Sensitive, Dry, Normal, Oily).
- Gender_Target: Target gender (e.g., Male, Female, Unisex).
- Packaging_Type: Type of packaging (e.g., Tube, Bottle, Compact).
- Main_Ingredient: Primary ingredient (e.g., Retinol, Aloe Vera).
- Cruelty_Free: Whether the product is cruelty-free (True/False).
- Country_of_Origin: The country where the product is made.
Introduction
This report explores data from the most_used_beauty_cosmetics_products_extended.csv dataset to analyze consumer behavior in the beauty industry. The objective is to extract key insights and provide actionable recommendations for businesses in this sector.
Data Acquisition & Preparation
Load and Inspect Data
# Load dataset
<- read.csv("most_used_beauty_cosmetics_products_extended.csv", stringsAsFactors = FALSE)
data
# View first few rows
head(data)
Product_Name Brand Category Usage_Frequency Price_USD
1 Ultra Face Mask Drunk Elephant Blush Weekly 67.85
2 Ultra Lipstick Laura Mercier Makeup Remover Occasional 116.43
3 Ultra Serum Natasha Denona Highlighter Daily 90.84
4 Divine Serum Ilia Beauty Face Mask Occasional 55.17
5 Super Foundation Charlotte Tilbury Highlighter Occasional 140.56
6 Super Eye Shadow Danessa Myricks Foundation Weekly 135.82
Rating Number_of_Reviews Product_Size Skin_Type Gender_Target Packaging_Type
1 1.4 686 30ml Sensitive Female Tube
2 4.2 5483 250ml Dry Unisex Bottle
3 1.6 5039 100ml Sensitive Male Compact
4 3.2 6202 250ml Normal Male Tube
5 1.7 297 100ml Oily Female Compact
6 3.2 9405 150ml Sensitive Female Compact
Main_Ingredient Cruelty_Free Country_of_Origin
1 Retinol False Australia
2 Shea Butter False UK
3 Aloe Vera True Italy
4 Glycerin True South Korea
5 Glycerin False Germany
6 Glycerin True UK
# Check data structure
str(data)
'data.frame': 15000 obs. of 14 variables:
$ Product_Name : chr "Ultra Face Mask" "Ultra Lipstick" "Ultra Serum" "Divine Serum" ...
$ Brand : chr "Drunk Elephant" "Laura Mercier" "Natasha Denona" "Ilia Beauty" ...
$ Category : chr "Blush" "Makeup Remover" "Highlighter" "Face Mask" ...
$ Usage_Frequency : chr "Weekly" "Occasional" "Daily" "Occasional" ...
$ Price_USD : num 67.8 116.4 90.8 55.2 140.6 ...
$ Rating : num 1.4 4.2 1.6 3.2 1.7 3.2 2.5 4.3 3.3 4.4 ...
$ Number_of_Reviews: int 686 5483 5039 6202 297 9405 2423 8032 2468 4127 ...
$ Product_Size : chr "30ml" "250ml" "100ml" "250ml" ...
$ Skin_Type : chr "Sensitive" "Dry" "Sensitive" "Normal" ...
$ Gender_Target : chr "Female" "Unisex" "Male" "Male" ...
$ Packaging_Type : chr "Tube" "Bottle" "Compact" "Tube" ...
$ Main_Ingredient : chr "Retinol" "Shea Butter" "Aloe Vera" "Glycerin" ...
$ Cruelty_Free : chr "False" "False" "True" "True" ...
$ Country_of_Origin: chr "Australia" "UK" "Italy" "South Korea" ...
Data Cleaning
# Remove missing values
<- na.omit(data)
data
# Standardize column names
<- janitor::clean_names(data)
data
# Convert categorical variables to factors (if needed)
<- data %>% mutate(across(where(is.character), as.factor))
data
# Summary statistics
summary(data)
product_name brand category
Super Setting Spray: 154 Milk Makeup : 426 Serum : 710
Magic Face Oil : 151 Make Up For Ever: 414 Mascara : 674
Magic Lip Liner : 149 Kiehl’s : 411 Face Oil : 671
Perfect Lip Liner : 147 NARS : 400 Face Mask : 653
Divine Exfoliator : 146 E.l.f. : 399 Highlighter: 653
Ultra Eye Shadow : 144 Farsali : 399 Moisturizer: 638
(Other) :14109 (Other) :12551 (Other) :11001
usage_frequency price_usd rating number_of_reviews
Daily :3668 Min. : 10.00 Min. :1.000 Min. : 52
Monthly :3781 1st Qu.: 45.48 1st Qu.:2.000 1st Qu.: 2562
Occasional:3794 Median : 80.04 Median :3.000 Median : 5002
Weekly :3757 Mean : 80.13 Mean :3.002 Mean : 5014
3rd Qu.:114.76 3rd Qu.:4.000 3rd Qu.: 7497
Max. :149.99 Max. :5.000 Max. :10000
product_size skin_type gender_target packaging_type
100ml:2551 Combination:3060 Female:5002 Bottle :2532
150ml:2487 Dry :2965 Male :5017 Compact:2454
200ml:2423 Normal :2991 Unisex:4981 Jar :2567
250ml:2498 Oily :3009 Spray :2528
30ml :2540 Sensitive :2975 Stick :2529
50ml :2501 Tube :2390
main_ingredient cruelty_free country_of_origin
Aloe Vera :2091 False:7592 Italy :1942
Glycerin :2174 True :7408 USA :1931
Hyaluronic Acid:2090 Australia :1873
Retinol :2180 France :1863
Salicylic Acid :2149 Japan :1861
Shea Butter :2148 South Korea:1851
Vitamin C :2168 (Other) :3679
Exploratory Data Analysis (EDA)
# Load necessary library
library(ggplot2)
# Distribution of a key variable
ggplot(data, aes(x = category)) +
geom_bar(fill = "steelblue") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, margin = margin(t = 5))
+
) labs(
title = "Distribution of Beauty Product Categories",
x = "Category",
y = "Count"
)
Analysis: The price of beauty products varies widely, with the majority of products priced under $100 USD. There are outliers with extremely high prices, particularly for luxury or premium versions of products.
Actionable insight: Customer segmentation: The beauty and cosmetic business can perform customer segmentation for customers who are willing to pay a premimum price for certain products, spending them specific marketing campagins
options(repos = c(CRAN = "https://cloud.r-project.org/"))
# Load necessary libraries
library(dplyr)
library(reticulate)
# Load the dataset
<- read.csv("most_used_beauty_cosmetics_products_extended.csv", stringsAsFactors = FALSE)
df
# Step 1: Create Popularity Groups Based on Number of Reviews
<- df %>%
df mutate(Popularity = case_when(
< 100 ~ "Low Popularity",
Number_of_Reviews >= 100 & Number_of_Reviews < 500 ~ "Moderately Popular",
Number_of_Reviews >= 500 ~ "Highly Popular"
Number_of_Reviews
))
# Step 2: Count Products by Category and Country
<- df %>%
product_popularity_by_country group_by(Country_of_Origin, Category, Popularity) %>%
summarise(ProductCount = n(), .groups = "drop") %>%
arrange(desc(ProductCount))
# Display results
print(product_popularity_by_country)
# A tibble: 441 × 4
Country_of_Origin Category Popularity ProductCount
<chr> <chr> <chr> <int>
1 Germany Eyeliner Highly Popular 98
2 France Highlighter Highly Popular 97
3 Australia Face Oil Highly Popular 95
4 Australia Face Mask Highly Popular 94
5 Japan Contour Highly Popular 94
6 South Korea Bronzer Highly Popular 93
7 Australia Serum Highly Popular 92
8 Italy Face Oil Highly Popular 92
9 South Korea Mascara Highly Popular 90
10 UK Bronzer Highly Popular 90
# ℹ 431 more rows
# Step 3: Pass Data to Python
<- product_popularity_by_country # Send data to Python r_product_popularity
# Price vs Rating Scatterplot
ggplot(data, aes(x = price_usd, y = rating)) +
geom_point(alpha = 0.5, color = "darkred") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
theme_minimal() +
labs(title = "Price vs. Rating of Beauty Products",
x = "Price (USD)",
y = "Rating")
Analysis: Price and Rating isn’t essentially correlated
install.packages("reticulate")
The downloaded binary packages are in
/var/folders/5c/zb28zyvd5l5485fszkpz_9000000gn/T//RtmpCzxB36/downloaded_packages
library(reticulate)
py_config()
python: /Users/lucienalazard/.virtualenvs/r-reticulate/bin/python
libpython: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/config-3.13-darwin/libpython3.13.dylib
pythonhome: /Users/lucienalazard/.virtualenvs/r-reticulate:/Users/lucienalazard/.virtualenvs/r-reticulate
version: 3.13.2 (v3.13.2:4f8bb3947cf, Feb 4 2025, 11:51:10) [Clang 15.0.0 (clang-1500.3.9.4)]
numpy: /Users/lucienalazard/.virtualenvs/r-reticulate/lib/python3.13/site-packages/numpy
numpy_version: 2.2.3
NOTE: Python version was forced by VIRTUAL_ENV
::py_install(c("pandas", "matplotlib"), envname = "r-reticulate") reticulate
Using virtual environment 'r-reticulate' ...
+ /Users/lucienalazard/.virtualenvs/r-reticulate/bin/python -m pip install --upgrade --no-user pandas matplotlib
import pandas as pd
# Load dataset
= pd.read_csv("most_used_beauty_cosmetics_products_extended.csv")
df_py
# Show basic statistics
df_py.describe()
Price_USD Rating Number_of_Reviews
count 15000.000000 15000.000000 15000.000000
mean 80.134108 3.002327 5014.231333
std 40.402983 1.168029 2855.665464
min 10.000000 1.000000 52.000000
25% 45.480000 2.000000 2562.000000
50% 80.040000 3.000000 5002.000000
75% 114.760000 4.000000 7497.000000
max 149.990000 5.000000 10000.000000
Business Analytics
Analytical Approach
We analyze the correlation between price and rating, and identify the most frequently reviewed product categories.
# Correlation analysis
<- cor(data$price_usd, data$rating, use = "complete.obs")
correlation correlation
[1] -0.005232067
# Top 5 most reviewed products
<- data %>%
top_products group_by(product_name) %>%
summarise(total_reviews = sum(number_of_reviews)) %>%
arrange(desc(total_reviews)) %>%
head(5)
top_products
# A tibble: 5 × 2
product_name total_reviews
<fct> <int>
1 Magic Face Oil 802679
2 Divine Exfoliator 769811
3 Ultra Eye Shadow 758601
4 Divine Setting Spray 743083
5 Magic Bronzer 742807
Integration with Python
We will use Python within Quarto to generate a histogram of prices.
Using Functions & Error Handling
# Function to calculate average rating per category
<- function(data) {
average_rating_per_category tryCatch({
%>%
data group_by(category) %>%
summarise(avg_rating = mean(rating, na.rm = TRUE))
error = function(e) {
}, message("Error in function: ", e)
})
}
# Execute function
average_rating_per_category(data)
# A tibble: 24 × 2
category avg_rating
<fct> <dbl>
1 BB Cream 3.06
2 Blush 2.98
3 Bronzer 2.98
4 CC Cream 3.07
5 Cleanser 3.01
6 Concealer 3.00
7 Contour 3.13
8 Exfoliator 3.05
9 Eye Shadow 2.95
10 Eyeliner 2.91
# ℹ 14 more rows
Visualization & Interpretation
# Average rating by category
<- average_rating_per_category(data)
avg_rating ggplot(avg_rating, aes(x = reorder(category, avg_rating), y = avg_rating)) +
geom_col(fill = "skyblue") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
coord_flip() +
theme_minimal() +
labs(title = "Average Rating by Product Category",
x = "Product Category",
y = "Average Rating")
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(rworldmap)
Loading required package: sp
### Welcome to rworldmap ###
For a short introduction type : vignette('rworldmap')
# Load dataset
<- read.csv("most_used_beauty_cosmetics_products_extended.csv")
data
# Summarize product count by country
<- data %>%
country_data group_by(Country_of_Origin) %>%
summarise(Product_Count = n())
# Get world map data
<- map_data("world")
world_map
# Merge dataset with world map
<- world_map %>%
merged_data left_join(country_data, by = c("region" = "Country_of_Origin"))
# Plot the world map
ggplot(merged_data, aes(x = long, y = lat, group = group, fill = Product_Count)) +
geom_polygon(color = "black") +
scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "grey90") +
theme_minimal() +
labs(title = "Distribution of Beauty Products by Country of Origin",
fill = "Product Count") +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())
Conclusion & Business Recommendations
Key Insights:
- Product bundling for profit optimization: The dataset shows a wide variety of product categories, with skin care products leading in popularity. Categories like hair care, makeup, and fragrance also have a significant presence.From the top sold Beauty product category visualization, we can create bundle offers (Beauty kit) by identifying products that are frequently bought together. This will help the beauty and cosmetic business to recommend product bundles based on customer preferences by pushing them to buy more products.
- Customer segmentation: The price of beauty products varies widely, with the majority of products priced under $100 USD. There are outliers with extremely high prices, particularly for luxury or premium versions of products.The beauty and cosmetic business can perform customer segmentation for customers who are willing to pay a premium price for certain products, spending them specific marketing campaigns
- Price vs. Rating: Higher prices do not always correlate with better customer ratings.Therefore, increasing product prices may not essentially increase product rating.
Recommendations:
For Businesses: Companies can adopt a bundling strategy using a similarity score among products of the same brand to appeal the customers to make more purchases
For Consumers: Price alone does not guarantee product quality; consumers should rely on ratings and reviews.
For Marketing: Brands should come up with a loyalty program that offers discounts, coupons, or points to frequent consumers can categorize them into different consumer categories and offer relevant products as per their budget
Future Work: Further analysis on customer sentiments from reviews can provide deeper insights into consumer preferences.
Code Quality & Reproducibility
All code has been written using modular functions, appropriate error handling, and clear documentation. The project is fully reproducible.