Cosmetics datasets

I have difficulty finding skincare products that work well with my skin. More often than not, a product does not work, and I end up spending more money to find another one. Although the dataset is named “Cosmetics,” a closer look shows that it covers skincare products. We will use this dataset as a tool to find some products for me.

The dataset is from:

https://www.kaggle.com/datasets/kingabzpro/cosmetics-datasets

Metadata

There are 1472 observations of 11 variables.
Label (chr) Type of product
Brand (chr) Brand of product
Name (chr) Name of Cosmetic
Price (int) Price in USD
Rank (num) Ranking
Ingredients (chr) Ingredients
Combination (int) For combination (dry and oily) skin. 1: Yes, 0: No
Dry (int) For dry skin. 1: Yes, 0: No
Normal (int) For normal skin. 1: Yes, 0: No
Oily (int) For oily skin. 1: Yes, 0: No
Sensitive (int) For sensitive skin. 1: Yes, 0: No
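
As a quick sanity check on how the skin-type columns are coded, a sketch like the following tabulates the values found in each flag column. The helper name `check_flags` and the toy rows are made up for illustration and are not part of the original analysis; with the real data, the same call would be `check_flags(df)` once `df` has been read in.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: count how often each value occurs in every
# skin-type flag column (column names taken from the metadata above).
check_flags <- function(data) {
  data %>%
    select(Combination, Dry, Normal, Oily, Sensitive) %>%
    pivot_longer(everything(), names_to = "skin_type", values_to = "flag") %>%
    count(skin_type, flag)
}

# Toy rows (NOT from the dataset), only so the sketch runs on its own.
toy <- tibble(
  Combination = c(1, 1, 0),
  Dry         = c(1, 0, 0),
  Normal      = c(1, 1, 0),
  Oily        = c(1, 1, 0),
  Sensitive   = c(1, 0, 0)
)

check_flags(toy)
```

Any value other than 0 or 1 in the `flag` column would point to a coding problem worth looking at before going further.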

We are going to investigate:

  1. What are the 10 most popular brands (by number of products) in the dataset?
  2. What is the average price for each?
  3. Is there a relationship between the price and the ranking of the products?

#install.packages("tidyverse")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
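# dplyr, tibble, and ggplot2 are already attached by library(tidyverse) above;
# loading them again is harmless but redundant
library(dplyr)
library(tibble)
library(ggplot2)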
library(RColorBrewer)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.3
## Loading required package: viridisLite
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
df <- read_csv("cosmetics.csv")
## Rows: 1472 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Label, Brand, Name, Ingredients
## dbl (7): Price, Rank, Combination, Dry, Normal, Oily, Sensitive
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read_csv() already returns a tibble, so this conversion is a no-op
df <- as_tibble(df)
df
## # A tibble: 1,472 x 11
##    Label      Brand Name  Price  Rank Ingredients Combination   Dry Normal  Oily
##    <chr>      <chr> <chr> <dbl> <dbl> <chr>             <dbl> <dbl>  <dbl> <dbl>
##  1 Moisturiz~ LA M~ Crèm~   175   4.1 Algae (Sea~           1     1      1     1
##  2 Moisturiz~ SK-II Faci~   179   4.1 Galactomyc~           1     1      1     1
##  3 Moisturiz~ DRUN~ Prot~    68   4.4 Water, Dic~           1     1      1     1
##  4 Moisturiz~ LA M~ The ~   175   3.8 Algae (Sea~           1     1      1     1
##  5 Moisturiz~ IT C~ Your~    38   4.1 Water, Sna~           1     1      1     1
##  6 Moisturiz~ TATC~ The ~    68   4.2 Water, Sac~           1     0      1     1
##  7 Moisturiz~ DRUN~ Lala~    60   4.2 Water, Gly~           1     1      1     1
##  8 Moisturiz~ DRUN~ Virg~    72   4.4 100% Unref~           1     1      1     1
##  9 Moisturiz~ KIEH~ Ultr~    29   4.4 Water, Gly~           1     1      1     1
## 10 Moisturiz~ LA M~ Litt~   325   5   Algae (Sea~           0     0      0     0
## # ... with 1,462 more rows, and 1 more variable: Sensitive <dbl>
# Total number of missing values across all columns
sum(is.na(df))
## [1] 0

Proportion of Skincare Products by Type

# Count products per type and convert the counts to percentages
df_type <- df %>%
  group_by(Label) %>%
  summarize(cnt = n()) %>%
  mutate(Percent = round(cnt / sum(cnt), 5) * 100) %>%
  arrange(desc(Percent))

df_type
## # A tibble: 6 x 3
##   Label         cnt Percent
##   <chr>       <int>   <dbl>
## 1 Moisturizer   298    20.2
## 2 Cleanser      281    19.1
## 3 Face Mask     266    18.1
## 4 Treatment     248    16.8
## 5 Eye cream     209    14.2
## 6 Sun protect   170    11.5
# Pie chart: a single stacked bar wrapped into polar coordinates.
# theme_void() replaces any earlier complete theme, so only it is kept.
ggplot(df_type, aes(x = "", y = Percent, fill = Label)) +
  geom_col(color = "black") +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  geom_text(aes(label = Percent),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  guides(fill = guide_legend(title = "Type of skincare")) +
  theme_void() +
  ggtitle("Proportion of Skincare Products by Type")

10 Most Popular Skincare Brands

# Count products per brand and convert the counts to percentages
df_brand <- df %>%
  group_by(Brand) %>%
  summarize(count = n()) %>%
  mutate(Percent = round(count / sum(count), 5) * 100) %>%
  arrange(desc(Percent))

top_df_brand <- head(df_brand, 10)
top_df_brand
## # A tibble: 10 x 3
##    Brand              count Percent
##    <chr>              <int>   <dbl>
##  1 CLINIQUE              79    5.37
##  2 SEPHORA COLLECTION    66    4.48
##  3 SHISEIDO              63    4.28
##  4 ORIGINS               54    3.67
##  5 MURAD                 47    3.19
##  6 KIEHL'S SINCE 1851    46    3.12
##  7 PETER THOMAS ROTH     46    3.12
##  8 FRESH                 44    2.99
##  9 DR. JART+             41    2.78
## 10 KATE SOMERVILLE       35    2.38
# Bar chart of the 10 brands with the most products, made interactive with plotly
p.bar <- ggplot(top_df_brand, aes(x = Brand, y = count, fill = Brand)) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
        legend.position = "none") +
  ggtitle("10 Most Popular Skincare Brands")
ggplotly(p.bar, tooltip = c("Brand", "count"))
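
The second question asks for the average price of each of the top 10. The code behind the price figures quoted in the summary is not shown, so the following is only a sketch of one way to get them, not necessarily the method that was used. The helper name `price_by_top_brands`, its `n_brands` parameter, and the toy rows are assumptions made for illustration; with the real data the call would be `price_by_top_brands(df)`.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: mean, minimum, and maximum price for the
# n_brands brands that have the most products.
price_by_top_brands <- function(data, n_brands = 10) {
  data %>%
    group_by(Brand) %>%
    summarize(
      count     = n(),
      avg_price = mean(Price),
      min_price = min(Price),
      max_price = max(Price)
    ) %>%
    arrange(desc(count)) %>%
    head(n_brands)
}

# Toy rows (NOT from the dataset), only so the sketch runs on its own.
toy <- tibble(
  Brand = c("A", "A", "A", "B", "B", "C"),
  Price = c(10, 20, 30, 50, 70, 5)
)

price_by_top_brands(toy, n_brands = 2)
```

Keeping the minimum and maximum next to the mean makes it easier to see when one very expensive product is pulling a brand's average up.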


Summary

Moisturizers are the largest category in the dataset, at about 20% of all products. My EDA was about finding skincare products that I should try, in terms of popularity and pricing, and it focused on the top 10 brands by number of products. It suggests that the first brand I should try is CLINIQUE, whose average price is $32.59. The others, in order, are SEPHORA COLLECTION, SHISEIDO, ORIGINS, MURAD, KIEHL’S SINCE 1851, PETER THOMAS ROTH, FRESH, DR. JART+, and KATE SOMERVILLE.

Prices for these ten brands range from $3 to $300. The cheapest product is from SEPHORA COLLECTION and the most expensive is from SHISEIDO. Overall, however, the average price is $46.53.

Last but not least, the relationship between price and ranking is too difficult to identify product by product, so I created a second visualization of the overall relationship between the Price and Rank of the products. As the ranking gets higher, the price seems to be slightly lower, which I find interesting. This suggests that I can find good products that do not have to be expensive. I would love to explore the flagship product of each brand when I get a chance.
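
The code for the Price and Rank visualization is not included above, so here is a minimal sketch of how such a plot could be drawn, assuming a scatter plot with a linear trend line is close to what was intended. The helper name `plot_price_vs_rank` and the toy rows are made up for illustration; with the real data the call would be `plot_price_vs_rank(df)`.

```r
library(ggplot2)
library(tibble)

# Hypothetical helper: scatter plot of Price against Rank with a
# straight-line (linear model) trend drawn on top.
plot_price_vs_rank <- function(data) {
  ggplot(data, aes(x = Rank, y = Price)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    theme_classic() +
    ggtitle("Price vs. Rank")
}

# Toy rows (NOT from the dataset), only so the sketch runs on its own.
toy <- tibble(
  Rank  = c(1, 2, 3, 4, 5),
  Price = c(50, 40, 30, 20, 10)
)

p <- plot_price_vs_rank(toy)
p

# Spearman's rank correlation gives a single number for the direction
# of the relationship (negative means price falls as rank rises).
rho <- cor(toy$Rank, toy$Price, method = "spearman")
rho
```

A coefficient close to zero on the real data would mean the downward trend is weak, which would be worth checking before concluding that better-ranked products tend to be cheaper.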