Project3b

Cosmetics datasets

I have difficulty finding skincare products that work well with my skin. More often than not, a product does not work, and I end up spending money again on another one. Although the dataset is named “Cosmetics,” a closer look shows that it actually covers skincare products. We will use this dataset as a tool to find some products for me.

Dataset is from:

https://www.kaggle.com/datasets/kingabzpro/cosmetics-datasets

Metadata

There are 1472 observations of 11 variables.
Label (chr) Type of product
Brand (chr) Brand of product
Name (chr) Name of Cosmetic
Price (int) Price in USD
Rank (num) Ranking
Ingredients (chr) Ingredients
Combination (int) For combination (dry and oily) skin. 1: Yes, 0: No
Dry (int) For dry skin. 1: Yes, 0: No
Normal (int) For normal skin. 1: Yes, 0: No
Oily (int) For oily skin. 1: Yes, 0: No
Sensitive (int) For sensitive skin. 1: Yes, 0: No
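
The data preview further down shows only 0 and 1 in the skin-type columns, so it can be worth confirming the encoding before relying on it. A minimal sketch in base R, using a small made-up data frame (`flags` is a hypothetical stand-in, not the real data):

```r
# Hypothetical stand-in for the five skin-type columns of the dataset
flags <- data.frame(
  Combination = c(1, 1, 0),
  Dry         = c(1, 0, 0),
  Normal      = c(1, 1, 0),
  Oily        = c(1, 1, 0),
  Sensitive   = c(1, 0, 0)
)

# TRUE for every column whose values are all either 0 or 1
is_binary <- vapply(flags, function(col) all(col %in% c(0, 1)), logical(1))
is_binary
```

On the real data, the same check would be applied to the five skin-type columns of `df` once it has been loaded.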

We are going to investigate:

What are the 10 most popular brands in the dataset (by number of products listed)?
What is the average price for each?
Is there any relationship between the Price and Rank of the products?

#install.packages("tidyverse")
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(tibble)
library(ggplot2)
library(RColorBrewer)
library(hrbrthemes)

## Warning: package 'hrbrthemes' was built under R version 4.1.3

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(viridis)

## Warning: package 'viridis' was built under R version 4.1.3

## Loading required package: viridisLite

library(plotly)

## Warning: package 'plotly' was built under R version 4.1.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

df <- read_csv("cosmetics.csv")

## Rows: 1472 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Label, Brand, Name, Ingredients
## dbl (7): Price, Rank, Combination, Dry, Normal, Oily, Sensitive
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

df <- as_tibble(df)
df

## # A tibble: 1,472 x 11
##    Label      Brand Name  Price  Rank Ingredients Combination   Dry Normal  Oily
##    <chr>      <chr> <chr> <dbl> <dbl> <chr>             <dbl> <dbl>  <dbl> <dbl>
##  1 Moisturiz~ LA M~ Crèm~   175   4.1 Algae (Sea~           1     1      1     1
##  2 Moisturiz~ SK-II Faci~   179   4.1 Galactomyc~           1     1      1     1
##  3 Moisturiz~ DRUN~ Prot~    68   4.4 Water, Dic~           1     1      1     1
##  4 Moisturiz~ LA M~ The ~   175   3.8 Algae (Sea~           1     1      1     1
##  5 Moisturiz~ IT C~ Your~    38   4.1 Water, Sna~           1     1      1     1
##  6 Moisturiz~ TATC~ The ~    68   4.2 Water, Sac~           1     0      1     1
##  7 Moisturiz~ DRUN~ Lala~    60   4.2 Water, Gly~           1     1      1     1
##  8 Moisturiz~ DRUN~ Virg~    72   4.4 100% Unref~           1     1      1     1
##  9 Moisturiz~ KIEH~ Ultr~    29   4.4 Water, Gly~           1     1      1     1
## 10 Moisturiz~ LA M~ Litt~   325   5   Algae (Sea~           0     0      0     0
## # ... with 1,462 more rows, and 1 more variable: Sensitive <dbl>

sum(is.na(df))

## [1] 0
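
`sum(is.na(df))` only gives a grand total. If it were ever non-zero, a per-column count would show where the gaps are. A small sketch on made-up data (the `toy` data frame is hypothetical):

```r
# Hypothetical data with two missing values, for illustration only
toy <- data.frame(
  Price = c(29, NA, 68),
  Rank  = c(4.4, 4.1, NA),
  Brand = c("A", "B", "C")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

For the real data the call would be `colSums(is.na(df))`.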

Proportion of Skincare Products by Type

df_type <- df %>% 
      group_by(Label) %>%
      summarize(cnt = n()) %>%
      mutate(Percent = (round(cnt / sum(cnt), 5)) *100) %>%
      arrange(desc(Percent))

df_type

## # A tibble: 6 x 3
##   Label         cnt Percent
##   <chr>       <int>   <dbl>
## 1 Moisturizer   298    20.2
## 2 Cleanser      281    19.1
## 3 Face Mask     266    18.1
## 4 Treatment     248    16.8
## 5 Eye cream     209    14.2
## 6 Sun protect   170    11.5

ggplot(df_type, aes(x = "", y = Percent, fill = Label)) +
  geom_col(color = "black") +
  geom_text(aes(label = Percent),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  guides(fill = guide_legend(title = "Type of skincare")) +
  theme_void() +
  ggtitle("Proportion of Skincare Products by Type")
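
The `Percent` values can carry up to three decimals, which may crowd the pie labels. One possible way to shorten them, sketched here with the counts from the table above, is to format them with `sprintf()`; the names `cnt` and `pct_label` are only for illustration:

```r
# Product counts per type, copied from the df_type output above
cnt <- c(Moisturizer = 298, Cleanser = 281, `Face Mask` = 266,
         Treatment = 248, `Eye cream` = 209, `Sun protect` = 170)

# One decimal place plus a percent sign; sprintf() drops names, so restore them
pct_label <- sprintf("%.1f%%", 100 * cnt / sum(cnt))
names(pct_label) <- names(cnt)
pct_label
```

In the plot this would correspond to `geom_text(aes(label = sprintf("%.1f%%", Percent)), position = position_stack(vjust = 0.5))`.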

df_brand <- df %>% 
      group_by(Brand) %>%
      summarize(count = n()) %>%
      mutate(Percent = (round(count / sum(count), 5)) *100) %>%
      arrange(desc(Percent))

top_df_brand <- head(df_brand, 10)
top_df_brand

## # A tibble: 10 x 3
##    Brand              count Percent
##    <chr>              <int>   <dbl>
##  1 CLINIQUE              79    5.37
##  2 SEPHORA COLLECTION    66    4.48
##  3 SHISEIDO              63    4.28
##  4 ORIGINS               54    3.67
##  5 MURAD                 47    3.19
##  6 KIEHL'S SINCE 1851    46    3.12
##  7 PETER THOMAS ROTH     46    3.12
##  8 FRESH                 44    2.99
##  9 DR. JART+             41    2.78
## 10 KATE SOMERVILLE       35    2.38

# Bar chart of the ten brands with the most products in the dataset
p.bar <- ggplot(top_df_brand, aes(x = Brand, y = count, fill = Brand)) +
  geom_col(width = 0.5) +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
        legend.position = "none") +
  ggtitle("10 Most Popular Skincare Brands")
ggplotly(p.bar, tooltip = c("Brand", "count"))

Price Range for the 10 Most Popular Brands

# Keep only the products that belong to the top 10 brands

top_10_df <- df %>% filter(Brand %in% top_df_brand$Brand)
top_10_df

## # A tibble: 521 x 11
##    Label      Brand Name  Price  Rank Ingredients Combination   Dry Normal  Oily
##    <chr>      <chr> <chr> <dbl> <dbl> <chr>             <dbl> <dbl>  <dbl> <dbl>
##  1 Moisturiz~ KIEH~ Ultr~    29   4.4 Water, Gly~           1     1      1     1
##  2 Moisturiz~ FRESH Lotu~    45   4.3 Water, Gly~           0     0      0     0
##  3 Moisturiz~ KIEH~ Midn~    47   4.4 Caprylic/C~           1     1      1     1
##  4 Moisturiz~ CLIN~ Mois~    39   4.4 Water , Di~           1     1      1     1
##  5 Moisturiz~ FRESH Rose~    40   4.4 Water, Gly~           0     0      0     0
##  6 Moisturiz~ SHIS~ Bio-~    78   4.6 Water, Gly~           0     0      0     0
##  7 Moisturiz~ FRESH Blac~    92   4.1 Water, Gly~           1     1      1     0
##  8 Moisturiz~ ORIG~ Dr. ~    34   4.4 Water, But~           1     1      1     1
##  9 Moisturiz~ CLIN~ Dram~    28   3.9 Water , Mi~           1     1      0     0
## 10 Moisturiz~ FRESH Blac~    68   4.4 Water, Sac~           0     0      0     0
## # ... with 511 more rows, and 1 more variable: Sensitive <dbl>

top_10_df_summary <- top_10_df %>%      # Summary by group using dplyr
  group_by(Brand) %>%
  summarize(mean = mean(Price),
            median = median(Price),
            min = min(Price),
            max = max(Price),
            sd = sd(Price),
            count = n())
top_10_df_summary

## # A tibble: 10 x 7
##    Brand               mean median   min   max    sd count
##    <chr>              <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
##  1 CLINIQUE           32.6    28       9    80 14.2     79
##  2 DR. JART+          23.1    12       6    52 16.5     41
##  3 FRESH              75.2    50      15   290 61.9     44
##  4 KATE SOMERVILLE    63.7    58      24   125 26.9     35
##  5 KIEHL'S SINCE 1851 38.6    35      16    84 16.4     46
##  6 MURAD              54.9    52      22    90 19.7     47
##  7 ORIGINS            35.6    33.5     7    63 15.5     54
##  8 PETER THOMAS ROTH  60.0    52      30   150 27.0     46
##  9 SEPHORA COLLECTION  9.68    6       3    60  8.80    66
## 10 SHISEIDO           71.9    60      10   300 60.9     63

# Unweighted mean of the ten brand means (each brand counts equally)
mean_10 <- top_10_df_summary %>%
  summarize(mean = mean(mean))
mean_10

## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  46.5
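
Note that this figure averages the ten brand means, so a brand with 35 products weighs as much as one with 79. The sketch below contrasts it with a count-weighted mean, using the rounded means and counts from the summary table above (so the weighted result is only approximate); the vector names are illustrative:

```r
# Brand means and product counts, copied (rounded) from the summary table
brand_mean  <- c(32.6, 23.1, 75.2, 63.7, 38.6, 54.9, 35.6, 60.0, 9.68, 71.9)
brand_count <- c(79, 41, 44, 35, 46, 47, 54, 46, 66, 63)

# Mean of the brand means: every brand has the same weight
unweighted <- mean(brand_mean)

# Weighted by product count: approximates the mean price over all 521 products
weighted <- weighted.mean(brand_mean, w = brand_count)

c(unweighted = unweighted, weighted = weighted)
```

With these rounded inputs the weighted mean comes out at roughly $44.7, a little below the unweighted value. On the real data, `mean(top_10_df$Price)` would give the exact per-product average.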

# Box plot of prices per brand; outliers are hidden to keep the boxes readable
p.box <- top_10_df %>%
  ggplot(aes(x = Brand, y = Price, fill = Brand)) +
  geom_boxplot(outlier.shape = NA) +
  scale_fill_viridis(discrete = TRUE, alpha = 0.6, option = "A") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(size = 11),
        axis.text.x = element_text(size = 7, angle = 25, hjust = 1)) +
  ggtitle("Price range of the 10 most popular skincare brands") +
  xlab("")

ggplotly(p.box)

top_10_df %>% group_by(Brand) %>%
      summarize(cnt = n())

## # A tibble: 10 x 2
##    Brand                cnt
##    <chr>              <int>
##  1 CLINIQUE              79
##  2 DR. JART+             41
##  3 FRESH                 44
##  4 KATE SOMERVILLE       35
##  5 KIEHL'S SINCE 1851    46
##  6 MURAD                 47
##  7 ORIGINS               54
##  8 PETER THOMAS ROTH     46
##  9 SEPHORA COLLECTION    66
## 10 SHISEIDO              63

# Scatter plot with one linear trend line per brand
p <- ggplot(top_10_df, aes(x = Rank, y = Price, color = Brand)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  ggtitle("Relationship between Price and Rank of Skincare Products") +
  scale_color_viridis(discrete = TRUE, option = "A") +
  theme_light()
ggplotly(p)

## `geom_smooth()` using formula 'y ~ x'

# Scatter plot with a single linear trend line across all ten brands
p.dot <- ggplot(top_10_df, aes(Rank, Price)) +
  geom_point(aes(color = Brand)) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  ggtitle("Overall Relationship between Price and Rank of Skincare Products") +
  scale_color_viridis(discrete = TRUE, option = "A") +
  theme_light()

ggplotly(p.dot)

## `geom_smooth()` using formula 'y ~ x'

Summary

In general, moisturizers are the most common product type in the dataset, at about 20% of all products. My EDA was about finding skincare brands worth trying in terms of popularity and pricing, so it focused on the 10 brands with the most products. It suggests that the first brand I should try is CLINIQUE, with an average price of $32.59. The other brands are, in order, SEPHORA COLLECTION, SHISEIDO, ORIGINS, MURAD, KIEHL’S SINCE 1851, PETER THOMAS ROTH, FRESH, DR. JART+, and KATE SOMERVILLE.

Prices across these ten brands range from $3 to $300. The cheapest single product is from SEPHORA COLLECTION and the most expensive is from SHISEIDO. However, the mean of the ten brand averages is only $46.53.

Last but not least, the per-brand plot is too cluttered to identify the relationship for each brand, so I created a second visualization for the overall relationship between Price and Rank. As Rank gets higher, the price seems to be slightly lower, which I find interesting. This suggests that good products do not have to be expensive. I would love to explore the flagship products of each brand when I get a chance.
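
The trend seen in the plot could also be checked numerically. The sketch below uses invented numbers (the `toy` data frame is not taken from the dataset) to show a Spearman rank correlation and the slope of a simple linear fit, both in base R:

```r
# Invented Rank/Price pairs, for illustration only
toy <- data.frame(
  Rank  = c(3.8, 4.1, 4.2, 4.4, 4.6, 5.0),
  Price = c(175, 68, 60, 38, 45, 29)
)

# Rank correlation: a negative value means Price tends to fall as Rank rises
rho <- cor(toy$Rank, toy$Price, method = "spearman")

# Slope of Price on Rank, the same kind of line geom_smooth(method = "lm") draws
fit   <- lm(Price ~ Rank, data = toy)
slope <- coef(fit)[["Rank"]]

c(rho = rho, slope = slope)
```

With the real data, `cor.test(top_10_df$Rank, top_10_df$Price, method = "spearman")` would additionally report a p-value, which helps judge whether a slight downward trend is more than noise.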