Introduction 💧
Data Loading and Preparation 🛠️
Top Ingredients by Product Type Group 🔬
Top Ingredient Pairs Co-occurring in Skincare Products 🔗
Network Visualization 🎨
Conclusion 📝

library(tidyverse)
library(knitr)
library(tidygraph)
library(ggraph)

Introduction 💧

This project examines how skincare ingredients are used across product types. We aim to identify which ingredients appear most often in five standard skincare categories (Cleanser, Toner, Moisturiser, SPF, and Serum) and which pairs of ingredients tend to occur together.

The motivation for this analysis is to better understand product formulation trends in the skincare industry. Two questions guide the work: Which ingredients are most common in each product category? Which ingredients tend to be combined with one another? The answers could support product development or consumer education.

Data Loading and Preparation 🛠️

skincare <- read_csv("skincare_products.csv") %>%
  filter(!is.na(ingredients)) %>%
  mutate(
    # Clean product types into 5 groups + Other category
    product_type_group = case_when(
      str_detect(tolower(product_type), "cleanser") ~ "Cleanser",
      str_detect(tolower(product_type), "toner") ~ "Toner",
      str_detect(tolower(product_type), "moisturiser|moisturizing|moisturizer") ~ "Moisturiser",
      str_detect(tolower(product_type), "spf|sunscreen|sun protection") ~ "SPF",
      str_detect(tolower(product_type), "serum") ~ "Serum",
      TRUE ~ "Other"
    ),
    # Split ingredients into lists
    ingredient_list = strsplit(ingredients, ",\\s*")
  ) %>%
  filter(map_int(ingredient_list, length) >= 1)  # Keep products with at least 1 ingredient

## Rows: 1138 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): product_name, product_url, product_type, ingredients, price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The dataset was downloaded from Kaggle, a public data-sharing platform, and was gathered from publicly accessible skincare product catalogs. It is stored in the same directory as this report under the filename skincare_products.csv. Each row represents one product, and the ingredients column holds that product's comma-separated ingredient list. For the network analysis, each ingredient is a node, and two ingredients are joined by an undirected edge whenever they co-occur in the same product.
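To make the node and edge definition concrete, here is a minimal sketch of how a single product's ingredient string turns into undirected edges. The ingredient string is hypothetical and is not taken from the dataset; the cleaning steps (split on commas, trim, lower-case, de-duplicate) follow the same idea as the analysis code but use base R only.

```r
# Hypothetical ingredient string for one product (not from skincare_products.csv)
ingredients <- "Glycerin, Aqua, Phenoxyethanol, glycerin"

# Split on commas, trim whitespace, lower-case, then drop repeats and sort
ings <- sort(unique(tolower(trimws(strsplit(ingredients, ",\\s*")[[1]]))))

# Every unordered pair of distinct ingredients becomes one undirected edge
edges <- t(combn(ings, 2))
colnames(edges) <- c("from", "to")
edges
```

A product with n distinct ingredients contributes n(n - 1)/2 edges, so the three distinct ingredients here yield three edges. Sorting before pairing keeps each pair in a single orientation, so "a, b" and "b, a" are never counted as different edges.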

Top Ingredients by Product Type Group 🔬

top3_freq <- skincare %>%
  unnest(ingredient_list) %>%
  mutate(ingredient_list = tolower(str_trim(ingredient_list))) %>%
  group_by(product_type_group, ingredient_list) %>%
  summarise(frequency = n(), .groups = "drop") %>%
  group_by(product_type_group) %>%
  slice_max(order_by = frequency, n = 3) %>%
  arrange(product_type_group, desc(frequency))

kable(top3_freq,
      col.names = c("Product Type", "Ingredient", "Frequency"),
      caption = "Top 3 Most Common Ingredients and Frequencies by Product Type Group")

Top 3 Most Common Ingredients and Frequencies by Product Type Group
Product Type	Ingredient	Frequency
Cleanser	glycerin	84
Cleanser	phenoxyethanol	61
Cleanser	citric acid	51
Moisturiser	glycerin	96
Moisturiser	phenoxyethanol	70
Moisturiser	dimethicone	58
Other	glycerin	433
Other	phenoxyethanol	313
Other	limonene	250
Serum	glycerin	81
Serum	phenoxyethanol	70
Serum	sodium hyaluronate	53
Serum	xanthan gum	53
Toner	glycerin	57
Toner	citric acid	34
Toner	phenoxyethanol	34
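The Serum group contributes four rows to a table captioned "Top 3" because sodium hyaluronate and xanthan gum are tied at 53, and slice_max() keeps ties by default. The sketch below reproduces this with the Serum counts copied from the table; the data frame and variable names are illustrative only.

```r
library(dplyr)

# Serum counts copied from the table above
serum <- data.frame(
  ingredient = c("glycerin", "phenoxyethanol", "sodium hyaluronate", "xanthan gum"),
  frequency  = c(81, 70, 53, 53)
)

# Default (with_ties = TRUE): both ingredients tied for third place are kept
top_with_ties <- slice_max(serum, order_by = frequency, n = 3)
nrow(top_with_ties)  # 4

# with_ties = FALSE returns exactly n rows and drops one of the tied rows
top_strict <- slice_max(serum, order_by = frequency, n = 3, with_ties = FALSE)
nrow(top_strict)     # 3
```

Keeping ties, as the analysis does, avoids dropping one of two equally frequent ingredients arbitrarily, at the cost of an occasional extra row.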

Top Ingredient Pairs Co-occurring in Skincare Products 🔗

# Filter for products with 2 or more ingredients
skincare_pairs <- skincare %>%
  filter(map_int(ingredient_list, length) >= 2)

# Build every unordered pair of distinct ingredients within each product
ingredient_pairs <- skincare_pairs %>%
  rowwise() %>%
  mutate(
    unique_ings = list(sort(unique(tolower(str_trim(ingredient_list)))))
  ) %>%
  mutate(
    pairs = list(if(length(unique_ings) >= 2) combn(unique_ings, 2, simplify = FALSE) else NULL)
  ) %>%
  unnest(pairs) %>%
  mutate(
    from = map_chr(pairs, 1),
    to = map_chr(pairs, 2)
  ) %>%
  select(from, to)

# Count how many products contain each pair and keep the 10 most frequent
pair_counts <- ingredient_pairs %>%
  group_by(from, to) %>%
  summarise(co_occurrence = n(), .groups = "drop") %>%
  arrange(desc(co_occurrence)) %>%
  slice_head(n = 10)

# Showing the table or error message if there is nothing found
if(nrow(pair_counts) == 0) {
  cat("## No ingredient pairs found to display.\n")
} else {
  kable(pair_counts,
        col.names = c("Ingredient 1", "Ingredient 2", "Co-occurrence Count"),
        caption = "Top 10 Most Common Ingredient Pairs Co-occurring in Skincare Products")
}

Top 10 Most Common Ingredient Pairs Co-occurring in Skincare Products
Ingredient 1	Ingredient 2	Co-occurrence Count
glycerin	phenoxyethanol	456
citric acid	glycerin	301
disodium edta	glycerin	281
butylene glycol	glycerin	268
limonene	linalool	251
disodium edta	phenoxyethanol	242
glycerin	sodium hyaluronate	240
glycerin	sodium hydroxide	238
glycerin	limonene	223
glycerin	xanthan gum	223

These pairs reveal formulation tendencies across products. For example, glycerin and phenoxyethanol co-occur in 456 products, far more than any other pair, which suggests they belong to a common formulation base. Glycerin appears in eight of the ten leading pairs. Examining the leading pairs gives clues about typical ingredient combinations, which can be useful to cosmetic science researchers and to consumers.

Network Visualization 🎨

g_top10 <- as_tbl_graph(pair_counts, directed = FALSE) %>%
  mutate(degree = centrality_degree())

ggraph(g_top10, layout = "fr") +
  geom_edge_link(aes(width = co_occurrence), color = "gray70", alpha = 0.6) +
  geom_node_point(aes(size = degree), color = "hotpink") +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  ggtitle("Top 10 Ingredient Co-occurrence Network")

The Top 10 Ingredient Co-occurrence Network visualizes the ten most frequent ingredient pairings in the dataset. Node size represents degree centrality. Because the graph is built from the top 10 pairs only, an ingredient's degree is the number of those pairs in which it appears, not the number of distinct partners it has across all products. Glycerin is the largest and most central node: it appears in eight of the ten pairs, which indicates that it is combined with many other common ingredients and is a staple of skincare formulations.

Edge thickness corresponds to the co-occurrence count, that is, the number of products in which both ingredients appear. The thickest edge joins glycerin and phenoxyethanol, which appear together in 456 products. This makes them the most common pairing in the dataset, but it does not mean they are nearly always paired: the file has 1,138 rows, so the pair occurs in about 40% of them.
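The degree values behind the node sizes can be checked by hand from the top-10 pair table. The sketch below copies that table into a data frame and counts how often each ingredient appears as an endpoint; for an undirected graph with no repeated edges this should agree with what centrality_degree() reports, though it uses base R only and the object names are illustrative.

```r
# Top-10 pairs copied from the co-occurrence table above
pairs <- data.frame(
  from = c("glycerin", "citric acid", "disodium edta", "butylene glycol",
           "limonene", "disodium edta", "glycerin", "glycerin",
           "glycerin", "glycerin"),
  to   = c("phenoxyethanol", "glycerin", "glycerin", "glycerin",
           "linalool", "phenoxyethanol", "sodium hyaluronate",
           "sodium hydroxide", "limonene", "xanthan gum"),
  co_occurrence = c(456, 301, 281, 268, 251, 242, 240, 238, 223, 223)
)

# Degree of a node = number of edges in which it is either endpoint
degree <- sort(table(c(pairs$from, pairs$to)), decreasing = TRUE)
degree[["glycerin"]]  # 8

# The heaviest edge is the pair with the largest co-occurrence count
pairs[which.max(pairs$co_occurrence), c("from", "to")]
```

Glycerin has degree 8, while phenoxyethanol, disodium edta, and limonene each have degree 2 and the remaining six ingredients have degree 1.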

Conclusion 📝

This analysis highlights ingredient usage patterns across categories of skincare products. Glycerin and phenoxyethanol rank among the three most common ingredients in every product type group in the table, while other ingredients stand out in particular groups, such as dimethicone in moisturisers and sodium hyaluronate and xanthan gum in serums.

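One limitation is that the dataset does not include ingredient concentrations, so an ingredient that dominates a formula counts the same as one present in trace amounts. Another limitation is that the “Other” category is broad, which can hide more specific patterns. In addition, no SPF rows appear in the frequency table, which suggests that no product type in the data matched the SPF patterns.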

Future work could add ingredient concentrations or product ratings as variables to examine how ingredients relate to product efficacy or consumer preference.

Skincare Ingredient Analysis by Product Type