I have difficulty finding the right skincare products for my skin. More often than not, the products did not work, and I wasted money repeatedly trying to find another one. Even though the dataset is named “Cosmetics,” a closer look shows that it actually covers skincare products. We will use this dataset as a tool to find some products for me.
Dataset is from:
https://www.kaggle.com/datasets/kingabzpro/cosmetics-datasets
There are 1472 observations of 11 variables.
Label (chr) Type of product
Brand (chr) Brand of product
Name (chr) Name of Cosmetic
Price (int) Price in USD
Rank (num) Ranking
Ingredients (chr) Ingredients
Combination (int) For combination (dry and oily) skin. 1: Yes 0: No
Dry (int) For dry skin. 1: Yes 0: No
Normal (int) For normal skin. 1: Yes 0: No
Oily (int) For oily skin. 1: Yes 0: No
Sensitive (int) For sensitive skin. 1: Yes 0: No
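The skin-type flags listed above could be used to narrow the table down to products for a particular skin type. The following is only a minimal sketch: it assumes that 1 marks a product as suitable (as the printed rows further down suggest), and it runs on a small made-up tibble (`toy`, with hypothetical product names) rather than on the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows that mimic some of the dataset's columns; not real data.
toy <- tribble(
  ~Label,        ~Name,       ~Price, ~Rank, ~Dry, ~Oily, ~Sensitive,
  "Moisturizer", "Product A", 40,     4.4,   1,    0,     1,
  "Moisturizer", "Product B", 25,     4.1,   0,    1,     0,
  "Cleanser",    "Product C", 18,     4.6,   1,    1,     1
)

# Keep products flagged for both dry and sensitive skin,
# sorted by highest Rank first and then by lowest Price.
dry_sensitive <- toy %>%
  filter(Dry == 1, Sensitive == 1) %>%
  arrange(desc(Rank), Price)

dry_sensitive$Name
```

The same `filter()` call should work on the full table once it is loaded, with whichever flags match the reader's own skin type.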
#install.packages("tidyverse")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(tibble)
library(ggplot2)
library(RColorBrewer)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.3
## Loading required package: viridisLite
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
df <- read_csv("cosmetics.csv")
## Rows: 1472 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Label, Brand, Name, Ingredients
## dbl (7): Price, Rank, Combination, Dry, Normal, Oily, Sensitive
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- as_tibble(df)
df
## # A tibble: 1,472 x 11
## Label Brand Name Price Rank Ingredients Combination Dry Normal Oily
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Moisturiz~ LA M~ Crèm~ 175 4.1 Algae (Sea~ 1 1 1 1
## 2 Moisturiz~ SK-II Faci~ 179 4.1 Galactomyc~ 1 1 1 1
## 3 Moisturiz~ DRUN~ Prot~ 68 4.4 Water, Dic~ 1 1 1 1
## 4 Moisturiz~ LA M~ The ~ 175 3.8 Algae (Sea~ 1 1 1 1
## 5 Moisturiz~ IT C~ Your~ 38 4.1 Water, Sna~ 1 1 1 1
## 6 Moisturiz~ TATC~ The ~ 68 4.2 Water, Sac~ 1 0 1 1
## 7 Moisturiz~ DRUN~ Lala~ 60 4.2 Water, Gly~ 1 1 1 1
## 8 Moisturiz~ DRUN~ Virg~ 72 4.4 100% Unref~ 1 1 1 1
## 9 Moisturiz~ KIEH~ Ultr~ 29 4.4 Water, Gly~ 1 1 1 1
## 10 Moisturiz~ LA M~ Litt~ 325 5 Algae (Sea~ 0 0 0 0
## # ... with 1,462 more rows, and 1 more variable: Sensitive <dbl>
sum(is.na(df))
## [1] 0
df_type <- df %>%
group_by(Label) %>%
summarize(cnt = n()) %>%
mutate(Percent = (round(cnt / sum(cnt), 5)) *100) %>%
arrange(desc(Percent))
df_type
## # A tibble: 6 x 3
## Label cnt Percent
## <chr> <int> <dbl>
## 1 Moisturizer 298 20.2
## 2 Cleanser 281 19.1
## 3 Face Mask 266 18.1
## 4 Treatment 248 16.8
## 5 Eye cream 209 14.2
## 6 Sun protect 170 11.5
ggplot(df_type, aes(x = "", y = Percent, fill = Label)) +
geom_col(color = "black") +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_minimal() +
geom_text(aes(label = Percent),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y")+
guides(fill = guide_legend(title = "Type of skincare"))+
theme_void() +
ggtitle("Proportion of Skincare Products by Type")
df_brand <- df %>%
group_by(Brand) %>%
summarize(count = n()) %>%
mutate(Percent = (round(count / sum(count), 5)) *100) %>%
arrange(desc(Percent))
top_df_brand <- head(df_brand, 10)
top_df_brand
## # A tibble: 10 x 3
## Brand count Percent
## <chr> <int> <dbl>
## 1 CLINIQUE 79 5.37
## 2 SEPHORA COLLECTION 66 4.48
## 3 SHISEIDO 63 4.28
## 4 ORIGINS 54 3.67
## 5 MURAD 47 3.19
## 6 KIEHL'S SINCE 1851 46 3.12
## 7 PETER THOMAS ROTH 46 3.12
## 8 FRESH 44 2.99
## 9 DR. JART+ 41 2.78
## 10 KATE SOMERVILLE 35 2.38
p.bar <- ggplot(top_df_brand, aes(x=Brand, y=count, fill=Brand)) +
geom_bar(stat = "identity", width=0.5) +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))+
ggtitle("10 Most Popular Skincare Brands")+
theme(legend.position="none")
ggplotly(p.bar, tooltip = c("Brand", "count"))
# Keep only the products that belong to the 10 brands in top_df_brand
top_10_df <- df %>% filter(Brand %in% top_df_brand$Brand)
top_10_df
## # A tibble: 521 x 11
## Label Brand Name Price Rank Ingredients Combination Dry Normal Oily
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Moisturiz~ KIEH~ Ultr~ 29 4.4 Water, Gly~ 1 1 1 1
## 2 Moisturiz~ FRESH Lotu~ 45 4.3 Water, Gly~ 0 0 0 0
## 3 Moisturiz~ KIEH~ Midn~ 47 4.4 Caprylic/C~ 1 1 1 1
## 4 Moisturiz~ CLIN~ Mois~ 39 4.4 Water , Di~ 1 1 1 1
## 5 Moisturiz~ FRESH Rose~ 40 4.4 Water, Gly~ 0 0 0 0
## 6 Moisturiz~ SHIS~ Bio-~ 78 4.6 Water, Gly~ 0 0 0 0
## 7 Moisturiz~ FRESH Blac~ 92 4.1 Water, Gly~ 1 1 1 0
## 8 Moisturiz~ ORIG~ Dr. ~ 34 4.4 Water, But~ 1 1 1 1
## 9 Moisturiz~ CLIN~ Dram~ 28 3.9 Water , Mi~ 1 1 0 0
## 10 Moisturiz~ FRESH Blac~ 68 4.4 Water, Sac~ 0 0 0 0
## # ... with 511 more rows, and 1 more variable: Sensitive <dbl>
top_10_df_sammary <-top_10_df %>% # Summary by group using dplyr
group_by(Brand) %>%
summarize(mean = mean(Price),
median = median(Price),
min = min(Price),
max = max(Price),
sd=sd(Price),
count = n())
top_10_df_sammary
## # A tibble: 10 x 7
## Brand mean median min max sd count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 CLINIQUE 32.6 28 9 80 14.2 79
## 2 DR. JART+ 23.1 12 6 52 16.5 41
## 3 FRESH 75.2 50 15 290 61.9 44
## 4 KATE SOMERVILLE 63.7 58 24 125 26.9 35
## 5 KIEHL'S SINCE 1851 38.6 35 16 84 16.4 46
## 6 MURAD 54.9 52 22 90 19.7 47
## 7 ORIGINS 35.6 33.5 7 63 15.5 54
## 8 PETER THOMAS ROTH 60.0 52 30 150 27.0 46
## 9 SEPHORA COLLECTION 9.68 6 3 60 8.80 66
## 10 SHISEIDO 71.9 60 10 300 60.9 63
mean_10 <- top_10_df_sammary %>%
summarize(mean = mean(mean))
mean_10
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 46.5
p.box <- top_10_df %>%
ggplot( aes(x=Brand, y=Price, fill=Brand)) +
geom_boxplot(outlier.shape = NA) +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
theme(legend.position="none",plot.title = element_text(size=11)) +
ggtitle("Price Range of the 10 Most Popular Skincare Brands") +
xlab("") +
theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))
ggplotly(p.box)
top_10_df %>% group_by(Brand) %>%
summarize(cnt = n())
## # A tibble: 10 x 2
## Brand cnt
## <chr> <int>
## 1 CLINIQUE 79
## 2 DR. JART+ 41
## 3 FRESH 44
## 4 KATE SOMERVILLE 35
## 5 KIEHL'S SINCE 1851 46
## 6 MURAD 47
## 7 ORIGINS 54
## 8 PETER THOMAS ROTH 46
## 9 SEPHORA COLLECTION 66
## 10 SHISEIDO 63
p <- ggplot(top_10_df, aes(x=Rank, y=Price, color=Brand)) +
geom_point(aes(color = Brand)) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
ggtitle("Relationship between Price and Rank of Skincare Products")+
scale_color_viridis(discrete = TRUE, option = "A")+
scale_fill_viridis(discrete = TRUE) +
theme_light()
ggplotly(p)
## `geom_smooth()` using formula 'y ~ x'
p.dot <- ggplot(top_10_df, aes(Rank, Price))+
geom_point(aes(color = Brand)) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
ggtitle("Overall Relationship between Price and Rank of Skincare Products")+
scale_color_viridis(discrete = TRUE, option = "A")+
scale_fill_viridis(discrete = TRUE) +
theme_light()
ggplotly(p.dot)
## `geom_smooth()` using formula 'y ~ x'
Moisturizers are the largest category in the dataset (20.2% of products), ahead of every other type of skincare. My EDA was about finding skincare products that I should try, in terms of popularity and pricing, and it focused on the 10 brands with the most products. It suggests that the first brand I should try is CLINIQUE, which has the most products (79) and an average price of $32.59. The other brands, in order, are SEPHORA COLLECTION, SHISEIDO, ORIGINS, MURAD, KIEHL’S SINCE 1851, PETER THOMAS ROTH, FRESH, DR. JART+, and KATE SOMERVILLE.
The prices for these ten brands range from $3 to $300. The cheapest product comes from SEPHORA COLLECTION and the most expensive from SHISEIDO. The mean of the ten brand averages, however, is $46.53.
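Note that the $46.53 figure comes from averaging the ten per-brand means (`mean_10`), so every brand counts equally no matter how many products it has. The sketch below illustrates how that can differ from averaging over all products directly. It uses a made-up two-brand tibble (`toy`, hypothetical values), not the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy prices: brand X has three cheap products,
# brand Y has a single expensive one.
toy <- tibble(
  Brand = c("X", "X", "X", "Y"),
  Price = c(10, 10, 10, 50)
)

# Mean of the per-brand means: every brand counts equally.
mean_of_brand_means <- toy %>%
  group_by(Brand) %>%
  summarize(mean = mean(Price)) %>%
  summarize(mean = mean(mean)) %>%
  pull(mean)

# Pooled mean: every product counts equally.
pooled_mean <- mean(toy$Price)

c(mean_of_brand_means = mean_of_brand_means, pooled_mean = pooled_mean)
```

On the real data, `mean(top_10_df$Price)` would give the pooled version, which may be the more natural answer to "what does a typical product from these brands cost?".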
Last but not least, the first scatter plot, with one trend line per brand, is too cluttered to read each brand's relationship, so I created a second visualization for the overall relationship between Price and Rank. As the rank of a product gets higher, the price seems to be slightly lower, which I find interesting. It suggests that I can find good products that do not have to be expensive. I would love to explore the flagship product of each brand when I get a chance.
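The direction of the Price and Rank relationship could also be checked numerically rather than only by eye. A minimal sketch, using a made-up data frame (`toy`, hypothetical values) in place of `top_10_df`, with a Pearson correlation and the slope of the same kind of linear fit that `geom_smooth(method = lm)` draws:

```r
# Hypothetical toy data standing in for top_10_df; not real values.
toy <- data.frame(
  Rank  = c(3.8, 4.0, 4.2, 4.4, 4.6),
  Price = c(60, 55, 48, 45, 40)
)

# Pearson correlation between Rank and Price.
r <- cor(toy$Rank, toy$Price)

# Slope of a simple linear regression of Price on Rank.
fit <- lm(Price ~ Rank, data = toy)
slope <- unname(coef(fit)["Rank"])

c(correlation = r, slope = slope)
```

A negative correlation and slope would agree with the downward trend line; running `summary(fit)` on the real data would additionally show whether the slope is distinguishable from zero.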