library(tidyverse)
library(knitr)
library(tidygraph)
library(ggraph)

Introduction 💧

This project examines how skincare ingredients are used across product types. We want to identify which ingredients appear most often in five standard skincare categories (Cleanser, Toner, Moisturiser, SPF, and Serum) and which pairs of ingredients tend to occur together.

The motivation for this analysis is to better understand formulation trends in the skincare industry. Our guiding questions are: Which ingredients are most characteristic of each product category? Which ingredients tend to be combined with one another? The answers could support product development or consumer education.

Data Loading and Preparation 🛠️

skincare <- read_csv("skincare_products.csv") %>%
  filter(!is.na(ingredients)) %>%
  mutate(
    # Clean product types into 5 groups + Other category
    product_type_group = case_when(
      str_detect(tolower(product_type), "cleanser") ~ "Cleanser",
      str_detect(tolower(product_type), "toner") ~ "Toner",
      str_detect(tolower(product_type), "moisturiser|moisturizing|moisturizer") ~ "Moisturiser",
      str_detect(tolower(product_type), "spf|sunscreen|sun protection") ~ "SPF",
      str_detect(tolower(product_type), "serum") ~ "Serum",
      TRUE ~ "Other"
    ),
    # Split ingredients into lists
    ingredient_list = strsplit(ingredients, ",\\s*")
  ) %>%
  filter(map_int(ingredient_list, length) >= 1)  # Keep products with at least 1 ingredient
## Rows: 1138 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): product_name, product_url, product_type, ingredients, price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The dataset comes from Kaggle, a public data-sharing platform, and was gathered from publicly accessible skincare product catalogs. You can find it in the same directory under the filename skincare_products.csv. Each row represents one product, and the ingredients column lists the components that product contains. For the network analysis, each ingredient is a node, and two ingredients that appear in the same product are joined by an undirected edge.
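As a rough illustration of how one product turns into nodes and edges, here is a minimal base-R sketch. The ingredient string is made up for the example (it is not a row from the dataset), and the cleaning steps mirror the ones used later in this report rather than adding anything new.

```r
# Hypothetical ingredient string for a single product (not from the dataset)
ingredients <- "Glycerin, Aqua ,glycerin, Phenoxyethanol"

# Split on commas, trim whitespace, lowercase, and drop duplicates: these are the nodes
nodes <- sort(unique(tolower(trimws(strsplit(ingredients, ",\\s*")[[1]]))))

# Every unordered pair of nodes within the product becomes one undirected edge
edges <- as.data.frame(t(combn(nodes, 2)), stringsAsFactors = FALSE)
colnames(edges) <- c("from", "to")

print(edges)
```

A product with n distinct ingredients contributes n(n - 1)/2 edges, so the three distinct ingredients here yield three edges.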

Top Ingredients by Product Type Group 🔬

top3_freq <- skincare %>%
  unnest(ingredient_list) %>%
  mutate(ingredient_list = tolower(str_trim(ingredient_list))) %>%
  group_by(product_type_group, ingredient_list) %>%
  summarise(frequency = n(), .groups = "drop") %>%
  group_by(product_type_group) %>%
  slice_max(order_by = frequency, n = 3) %>%
  arrange(product_type_group, desc(frequency))

kable(top3_freq,
      col.names = c("Product Type", "Ingredient", "Frequency"),
      caption = "Top 3 Most Common Ingredients and Frequencies by Product Type Group")
Table: Top 3 Most Common Ingredients and Frequencies by Product Type Group

| Product Type | Ingredient         | Frequency |
|:-------------|:-------------------|----------:|
| Cleanser     | glycerin           |        84 |
| Cleanser     | phenoxyethanol     |        61 |
| Cleanser     | citric acid        |        51 |
| Moisturiser  | glycerin           |        96 |
| Moisturiser  | phenoxyethanol     |        70 |
| Moisturiser  | dimethicone        |        58 |
| Other        | glycerin           |       433 |
| Other        | phenoxyethanol     |       313 |
| Other        | limonene           |       250 |
| Serum        | glycerin           |        81 |
| Serum        | phenoxyethanol     |        70 |
| Serum        | sodium hyaluronate |        53 |
| Serum        | xanthan gum        |        53 |
| Toner        | glycerin           |        57 |
| Toner        | citric acid        |        34 |
| Toner        | phenoxyethanol     |        34 |
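The Serum group has four rows even though the table is titled "Top 3". This is because `slice_max()` keeps tied rows by default (`with_ties = TRUE`), and sodium hyaluronate and xanthan gum tie at 53. The sketch below shows the behaviour on a small made-up data frame; the names `toy`, `ties_kept`, and `ties_dropped` are illustrative only.

```r
library(dplyr)

# Toy counts (hypothetical, not from the dataset) with a tie for third place
toy <- data.frame(
  ingredient = c("a", "b", "c", "d", "e"),
  frequency  = c(10, 8, 5, 5, 2),
  stringsAsFactors = FALSE
)

# Default behaviour: rows tied at the cutoff are all kept
ties_kept <- slice_max(toy, order_by = frequency, n = 3)

# with_ties = FALSE returns exactly n rows
ties_dropped <- slice_max(toy, order_by = frequency, n = 3, with_ties = FALSE)

nrow(ties_kept)     # 4, because "c" and "d" both have frequency 5
nrow(ties_dropped)  # 3
```

Keeping ties, as this report does, avoids choosing arbitrarily between equally frequent ingredients, at the cost of an occasional extra row.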

Top Ingredient Pairs Co-occurring in Skincare Products 🔗

# Filter for products with 2 or more ingredients
skincare_pairs <- skincare %>%
  filter(map_int(ingredient_list, length) >= 2)

# Build every unordered pair of distinct ingredients within each product
ingredient_pairs <- skincare_pairs %>%
  rowwise() %>%
  mutate(
    unique_ings = list(sort(unique(tolower(str_trim(ingredient_list)))))
  ) %>%
  mutate(
    pairs = list(if(length(unique_ings) >= 2) combn(unique_ings, 2, simplify = FALSE) else NULL)
  ) %>%
  unnest(pairs) %>%
  mutate(
    from = map_chr(pairs, 1),
    to = map_chr(pairs, 2)
  ) %>%
  select(from, to)

# Count how many products contain each pair and keep the ten most frequent
pair_counts <- ingredient_pairs %>%
  group_by(from, to) %>%
  summarise(co_occurrence = n(), .groups = "drop") %>%
  arrange(desc(co_occurrence)) %>%
  slice_head(n = 10)

# Show the table, or a message if no pairs were found
if(nrow(pair_counts) == 0) {
  cat("## No ingredient pairs found to display.\n")
} else {
  kable(pair_counts,
        col.names = c("Ingredient 1", "Ingredient 2", "Co-occurrence Count"),
        caption = "Top 10 Most Common Ingredient Pairs Co-occurring in Skincare Products")
}
Table: Top 10 Most Common Ingredient Pairs Co-occurring in Skincare Products

| Ingredient 1    | Ingredient 2       | Co-occurrence Count |
|:----------------|:-------------------|--------------------:|
| glycerin        | phenoxyethanol     |                 456 |
| citric acid     | glycerin           |                 301 |
| disodium edta   | glycerin           |                 281 |
| butylene glycol | glycerin           |                 268 |
| limonene        | linalool           |                 251 |
| disodium edta   | phenoxyethanol     |                 242 |
| glycerin        | sodium hyaluronate |                 240 |
| glycerin        | sodium hydroxide   |                 238 |
| glycerin        | limonene           |                 223 |
| glycerin        | xanthan gum        |                 223 |

These pairs reveal formulation tendencies across different products. For example, glycerin and phenoxyethanol appear together more often than any other pair, which suggests they belong to a common base formulation. Examining the leading pairs gives clues about the ingredient combinations typically used in skincare chemistry, which can help researchers in cosmetic science as well as consumers.

Network Visualization 🎨

g_top10 <- as_tbl_graph(pair_counts, directed = FALSE) %>%
  mutate(degree = centrality_degree())

ggraph(g_top10, layout = "fr") +
  geom_edge_link(aes(width = co_occurrence), color = "gray70", alpha = 0.6) +
  geom_node_point(aes(size = degree), color = "hotpink") +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  ggtitle("Top 10 Ingredient Co-occurrence Network") 

The Top 10 Ingredient Co-occurrence Network visualizes the ten most frequent ingredient pairs in the dataset. Node size represents degree centrality. Because the graph is built from these ten pairs only, an ingredient's degree is the number of top-10 pairs it belongs to, not the number of partners it has across all products. Glycerin is the largest and most central node: it appears in 8 of the 10 pairs, which suggests it is a key ingredient across skincare products. Phenoxyethanol, disodium EDTA, and limonene each appear in two pairs, and every other ingredient appears in one.

Edge thickness corresponds to the co-occurrence count, that is, the number of products in which both ingredients appear. The thickest edge joins glycerin and phenoxyethanol, which appear together in 456 products, more than any other pair.
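To make the node sizes concrete, the degrees can be recomputed by hand from the ten pairs in the table. This is a base-R sketch that re-enters the table values manually. It counts the edges touching each node, which should agree with what `centrality_degree()` reports for this simple undirected graph, though that equivalence is assumed here rather than checked against tidygraph.

```r
# The ten pairs and counts, copied by hand from the co-occurrence table
pairs <- data.frame(
  from = c("glycerin", "citric acid", "disodium edta", "butylene glycol",
           "limonene", "disodium edta", "glycerin", "glycerin",
           "glycerin", "glycerin"),
  to = c("phenoxyethanol", "glycerin", "glycerin", "glycerin",
         "linalool", "phenoxyethanol", "sodium hyaluronate",
         "sodium hydroxide", "limonene", "xanthan gum"),
  co_occurrence = c(456, 301, 281, 268, 251, 242, 240, 238, 223, 223),
  stringsAsFactors = FALSE
)

# Degree of a node = number of pairs (edges) in which it appears
degree <- sort(table(c(pairs$from, pairs$to)), decreasing = TRUE)

print(degree)
```

Glycerin comes out with degree 8, while phenoxyethanol, disodium edta, and limonene have degree 2 and the remaining six ingredients have degree 1.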

Conclusion 📝

This analysis highlights ingredient-use patterns across categories of skincare products. Glycerin and phenoxyethanol rank among the three most common ingredients in every category shown, while other ingredients are characteristic of particular product types, such as sodium hyaluronate and xanthan gum in serums or dimethicone in moisturisers.

One limitation is that the dataset does not include ingredient concentrations, so we cannot tell how prominent or potent an ingredient is in a given product. Another is that the "Other" category is broad, which can hide specific patterns. In addition, no SPF rows appear in the frequency table, which means no product type in the data matched that group.

Future studies could add ingredient concentrations or product ratings as variables to examine how ingredients relate to product efficacy or consumer preference.