library(tidyverse)
library(knitr)
library(tidygraph)
library(ggraph)
This project looks at trends in the use of skincare ingredients by product type. We want to know which ingredients are used most often in five standard skincare categories (Cleanser, Toner, Moisturiser, SPF, and Serum) and which pairs of ingredients tend to occur together.
The motivation for this analysis is to better understand trends in product formulation within the skincare industry. Our guiding questions are: Which ingredients are most common in each product category? Which ingredients tend to be combined with one another? The answers could support product development or consumer education.
skincare <- read_csv("skincare_products.csv") %>%
  filter(!is.na(ingredients)) %>%
  mutate(
    # Clean product types into 5 groups + Other category
    product_type_group = case_when(
      str_detect(tolower(product_type), "cleanser") ~ "Cleanser",
      str_detect(tolower(product_type), "toner") ~ "Toner",
      str_detect(tolower(product_type), "moisturiser|moisturizing|moisturizer") ~ "Moisturiser",
      str_detect(tolower(product_type), "spf|sunscreen|sun protection") ~ "SPF",
      str_detect(tolower(product_type), "serum") ~ "Serum",
      TRUE ~ "Other"
    ),
    # Split ingredients into lists
    ingredient_list = strsplit(ingredients, ",\\s*")
  ) %>%
  filter(map_int(ingredient_list, length) >= 1) # Keep products with at least 1 ingredient
## Rows: 1138 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): product_name, product_url, product_type, ingredients, price
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset was collected from Kaggle, a public data-sharing platform. It includes information on skincare products and their listed ingredients. Each row represents one product, and the ingredients column lists the components used. This project treats the ingredients as nodes in a network and their co-occurrence within a product as undirected edges.
We use a dataset of skincare products and their ingredients, with
one row per product and an ingredients column that lists the
ingredients used. For the network analysis, each ingredient is a
node, and the co-occurrence of two ingredients within a product
forms an edge between them.
The data was gathered from publicly accessible skincare product
catalogs, and we treat it as an undirected co-occurrence
network. The file, skincare_products.csv, is in the same directory
as this report.
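To make the node and edge definitions concrete, here is a minimal sketch on three made-up products. The `toy` tibble and its ingredient strings are hypothetical and are not rows from skincare_products.csv. Each product with at least two ingredients contributes one undirected edge per unordered ingredient pair.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: three made-up products, not taken from the real file
toy <- tibble(
  product = c("A", "B", "C"),
  ingredients = c("glycerin, aqua, citric acid", "glycerin, aqua", "aqua")
)

toy_edges <- toy %>%
  mutate(ings = strsplit(tolower(ingredients), ",\\s*")) %>%
  # A product with a single ingredient cannot form a pair, so it adds no edges
  filter(lengths(ings) >= 2) %>%
  # Sorting before combn() gives each undirected pair one canonical order
  mutate(pairs = map(ings, ~ combn(sort(unique(.x)), 2, simplify = FALSE))) %>%
  unnest(pairs) %>%
  mutate(from = map_chr(pairs, 1), to = map_chr(pairs, 2)) %>%
  count(from, to, name = "co_occurrence")

toy_edges
```

In this toy case the pair aqua and glycerin occurs in products A and B, so its edge weight is 2, while product C adds nothing to the network.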
top3_freq <- skincare %>%
  unnest(ingredient_list) %>%
  mutate(ingredient_list = tolower(str_trim(ingredient_list))) %>%
  group_by(product_type_group, ingredient_list) %>%
  summarise(frequency = n(), .groups = "drop") %>%
  group_by(product_type_group) %>%
  slice_max(order_by = frequency, n = 3) %>%
  arrange(product_type_group, desc(frequency))
kable(top3_freq,
      col.names = c("Product Type", "Ingredient", "Frequency"),
      caption = "Top 3 Most Common Ingredients and Frequencies by Product Type Group")
| Product Type | Ingredient | Frequency |
|---|---|---|
| Cleanser | glycerin | 84 |
| Cleanser | phenoxyethanol | 61 |
| Cleanser | citric acid | 51 |
| Moisturiser | glycerin | 96 |
| Moisturiser | phenoxyethanol | 70 |
| Moisturiser | dimethicone | 58 |
| Other | glycerin | 433 |
| Other | phenoxyethanol | 313 |
| Other | limonene | 250 |
| Serum | glycerin | 81 |
| Serum | phenoxyethanol | 70 |
| Serum | sodium hyaluronate | 53 |
| Serum | xanthan gum | 53 |
| Toner | glycerin | 57 |
| Toner | citric acid | 34 |
| Toner | phenoxyethanol | 34 |
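The Serum group shows four rows rather than three because `slice_max()` keeps ties by default (`with_ties = TRUE`), and sodium hyaluronate and xanthan gum both have a frequency of 53. The sketch below reproduces that behaviour on hypothetical counts; the `toy_freq` tibble and its placeholder ingredient names are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical counts that mimic the tie seen in the Serum rows
toy_freq <- tibble(
  group = "Serum",
  ingredient = c("a", "b", "c", "d", "e"),
  frequency = c(81, 70, 53, 53, 40)
)

# Default behaviour: rows tied at the cutoff are all kept
with_ties_kept <- toy_freq %>%
  slice_max(order_by = frequency, n = 3)

# with_ties = FALSE returns exactly n rows and drops one of the tied rows
exactly_three <- toy_freq %>%
  slice_max(order_by = frequency, n = 3, with_ties = FALSE)

nrow(with_ties_kept) # 4
nrow(exactly_three)  # 3
```

Keeping ties is arguably the more honest choice for a report, since dropping one of two equally frequent ingredients would be arbitrary.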
# Keep only products with 2 or more ingredients
skincare_pairs <- skincare %>%
  filter(map_int(ingredient_list, length) >= 2)
# Create all unordered ingredient pairs within each product
ingredient_pairs <- skincare_pairs %>%
  rowwise() %>%
  mutate(
    # Normalise, de-duplicate, and sort so each pair has one canonical order
    unique_ings = list(sort(unique(tolower(str_trim(ingredient_list)))))
  ) %>%
  mutate(
    pairs = list(if (length(unique_ings) >= 2) combn(unique_ings, 2, simplify = FALSE) else NULL)
  ) %>%
  unnest(pairs) %>%
  mutate(
    from = map_chr(pairs, 1),
    to = map_chr(pairs, 2)
  ) %>%
  select(from, to)
# Count how many products contain each pair and keep the 10 most frequent
pair_counts <- ingredient_pairs %>%
  group_by(from, to) %>%
  summarise(co_occurrence = n(), .groups = "drop") %>%
  arrange(desc(co_occurrence)) %>%
  slice_head(n = 10)
# Show the table, or a message if no pairs were found
if (nrow(pair_counts) == 0) {
  cat("## No ingredient pairs found to display.\n")
} else {
  kable(pair_counts,
        col.names = c("Ingredient 1", "Ingredient 2", "Co-occurrence Count"),
        caption = "Top 10 Most Common Ingredient Pairs Co-occurring in Skincare Products")
}
| Ingredient 1 | Ingredient 2 | Co-occurrence Count |
|---|---|---|
| glycerin | phenoxyethanol | 456 |
| citric acid | glycerin | 301 |
| disodium edta | glycerin | 281 |
| butylene glycol | glycerin | 268 |
| limonene | linalool | 251 |
| disodium edta | phenoxyethanol | 242 |
| glycerin | sodium hyaluronate | 240 |
| glycerin | sodium hydroxide | 238 |
| glycerin | limonene | 223 |
| glycerin | xanthan gum | 223 |
These pairs reveal formulation tendencies across products. For example, glycerin and phenoxyethanol form the most frequent pair (456 co-occurrences), which suggests they may belong to a common base formulation. Examining the leading pairs gives clues to typical ingredient combinations, which can be useful to cosmetic science researchers and to consumers.
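Raw counts are easier to interpret as a share of products. The sketch below converts three of the counts from the table into shares. It assumes a denominator of 1,138, the row count reported by `read_csv()`; the actual number of products after rows with missing ingredients are dropped may be smaller, so these shares are lower bounds. The names `top_pairs`, `n_products`, and `pair_share` are introduced here for illustration.

```r
library(dplyr)

# Counts copied from the pairs table above
top_pairs <- tibble(
  from = c("glycerin", "citric acid", "limonene"),
  to = c("phenoxyethanol", "glycerin", "linalool"),
  co_occurrence = c(456, 301, 251)
)

# ASSUMPTION: raw row count from the read_csv() message, before NA filtering
n_products <- 1138

pair_share <- top_pairs %>%
  mutate(share = co_occurrence / n_products)

pair_share
```

Under this assumption, even the most frequent pair is present in well under half of the products.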
g_top10 <- as_tbl_graph(pair_counts, directed = FALSE) %>%
  mutate(degree = centrality_degree())
ggraph(g_top10, layout = "fr") +
  geom_edge_link(aes(width = co_occurrence), color = "gray70", alpha = 0.6) +
  geom_node_point(aes(size = degree), color = "hotpink") +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  ggtitle("Top 10 Ingredient Co-occurrence Network")
The Top 10 Ingredient Co-occurrence Network visualizes the ten most frequent ingredient pairings in the dataset. Node size represents degree centrality, which here counts how many of those ten pairs an ingredient belongs to, because the graph is built from the top-10 table only. A high degree means an ingredient is paired with many other ingredients in this subset, whereas a low degree means it is paired with only a few.
Glycerin is the largest and most central node: it appears in 8 of the 10 pairs. It acts as the hub of the network, which is consistent with its position as the most frequent ingredient in every product type group.
Edge thickness corresponds to the co-occurrence count, that is, the number of products in which both ingredients of a pair appear. The thickest edge connects glycerin and phenoxyethanol, with 456 co-occurrences. Against the 1,138 rows in the raw file, that is roughly 40% of products, so the pairing is very common but not universal.
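The degree values behind the node sizes can be checked by hand. The sketch below re-enters the ten pairs from the table and counts how often each ingredient appears as an endpoint, which equals its degree in a simple undirected graph. It uses plain dplyr and tidyr rather than `centrality_degree()`, so it is a cross-check under that assumption and not the exact call used for the plot; `top10` and `node_degree` are names introduced for illustration.

```r
library(dplyr)
library(tidyr)

# The ten edges from the pairs table above
top10 <- tibble(
  from = c("glycerin", "citric acid", "disodium edta", "butylene glycol",
           "limonene", "disodium edta", "glycerin", "glycerin",
           "glycerin", "glycerin"),
  to = c("phenoxyethanol", "glycerin", "glycerin", "glycerin",
         "linalool", "phenoxyethanol", "sodium hyaluronate", "sodium hydroxide",
         "limonene", "xanthan gum")
)

# Each edge contributes 1 to the degree of both of its endpoints
node_degree <- top10 %>%
  pivot_longer(c(from, to), values_to = "ingredient") %>%
  count(ingredient, name = "degree", sort = TRUE)

node_degree
```

Because only ten edges are included, these degrees describe the top-10 subgraph; degrees in the full co-occurrence network would be much larger for every ingredient.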
This analysis highlights ingredient use patterns across categories of skincare products. Glycerin is the most frequent ingredient in every product type group, with phenoxyethanol second or tied for second in each, while other ingredients are more category-specific, such as dimethicone in moisturisers and sodium hyaluronate in serums.
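One limitation is that ingredient concentrations are not in the dataset, so the analysis cannot tell how prominent or potent an ingredient is in a product. Another is that the "Other" category is broad, which can hide patterns specific to the product types it absorbs. In addition, no SPF rows appear in the frequency table, which suggests that no product was assigned to that group, either because no product_type value matched the SPF pattern or because such products matched an earlier case_when branch.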
Future work could include ingredient concentrations or product ratings as additional variables to evaluate how ingredients relate to product efficacy or consumer preference.