Dataset: Makeup Shades Dataset. Source: https://www.kaggle.com/shivamb/makeup-shades-dataset
For my final project, I am using the Makeup Shades Dataset. The original author collected a list of beauty brands in the US, Nigeria, India, and Japan that several sources considered to be “best sellers” in their home countries. The author visited each brand’s website during May 2018, found the liquid foundation line that (at the time of sampling) had the largest number of shades available, and recorded the hex color value for each of the colored swatches shown for the product. Then, using Adobe Photoshop, they extracted the lightness value of each color (using the CIE Lab color model).
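To make the color columns more concrete, here is a minimal sketch of how a single hex value can be converted to hue, saturation, value, and CIE Lab lightness in base R. It uses the first Maybelline swatch shown in the dataset. The original author used Adobe Photoshop for the lightness values, so the Lab conversion here (with R's default white point, which is an assumption) may differ slightly from the recorded “L”.

```r
# Convert one hex swatch to HSV and CIE Lab lightness using base R (grDevices).
hex <- "f3cfb3"  # first Maybelline "Fit Me" shade listed in the dataset

rgb_mat <- grDevices::col2rgb(paste0("#", hex))  # 3 x 1 matrix: red, green, blue (0-255)
hsv_mat <- grDevices::rgb2hsv(rgb_mat)           # rows h, s, v, each scaled to [0, 1]

H <- round(hsv_mat["h", 1] * 360)  # hue in degrees
S <- round(hsv_mat["s", 1], 2)     # saturation
V <- round(hsv_mat["v", 1], 2)     # value

# Lab conversion; R's default reference white is an assumption and may not match Photoshop
lab <- grDevices::convertColor(t(rgb_mat) / 255, from = "sRGB", to = "Lab")
L <- lab[1, 1]                     # first coordinate is lightness

c(H = H, S = S, V = V, L = round(L))
```

The H, S, and V values computed this way can be compared with the first row of the dataset (26, 0.26, 0.95).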
I will review this dataset to explore:
The following variables are included in the dataset: “brand” (categorical variable), “brand_short” (categorical variable), “product” (categorical variable), “product_short” (categorical variable), “hex” (categorical variable), “H” (continuous variable), “S” (continuous variable), “V” (continuous variable), “L” (integer/discrete variable), and “group” (integer/discrete variable). Variables that I would like to define further include the following:
I cleaned this dataset before loading it because of the number of symbols it contained, and I didn’t want any errors when loading. I removed symbols from the makeup “brand” column and verified the “hex” values in the original dataset (some of the hex values showed up as expressions).
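A cleaning step like this could also be scripted. The sketch below is only an illustration, not the procedure actually used: it flags hex values that are not exactly six hexadecimal digits and strips punctuation from brand names. The damaged hex values and the raw brand spellings are hypothetical inputs made up for the example.

```r
# Flag hex codes that are not exactly six hexadecimal digits.
# The last two values are hypothetical examples of damaged entries.
hex <- c("f3cfb3", "ffe3c2", "1.23E+47", "#ffd3be")

is_valid_hex <- grepl("^[0-9A-Fa-f]{6}$", hex)
bad_hex <- hex[!is_valid_hex]  # values to re-check against the original source
bad_hex

# Strip punctuation from brand names (illustrative raw spellings)
brand_raw <- c("L'Oreal", "Elsa's Pro", "Black Up")
brand_clean <- gsub("[[:punct:]]", "", brand_raw)
brand_clean
```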
I decided to use this dataset for my final project because I have always been really into beauty products and makeup. Over the last few years, the range of foundation shades has really expanded. These new options have been really great for women of color. I thought this dataset would offer an interesting look inside the major brands that have been a part of this increase.
Sin (2021) recently completed her dissertation, “Colorism Toward the Black, Indigenous and People of Color (BIPOC) Community in the Makeup Industry”. She shares her history with makeup as a woman of color, including using a lighter shade. Her study found that consumers are becoming more aware of, and want, more variety, starting a movement that has resulted, and can continue to result, in companies increasing their shade offerings. Fashion Network published an article on December 3, 2018, a few months after this dataset was collected (Ahssen and Driver, 2018). There was a 28% increase in new foundation products from August 2017 to July 2018, sparked by the trend for a more natural or “second skin” look. During this time, color is where brands began innovating: more than 330 new shades were launched between August 2017 and July 2018, around 100 more than in the previous year. It is more evident than ever that BIPOC not only set beauty trends but that our dollars are a large part of the market share, yet BIPOC are still excluded from advertising and marketing. At this time, it is important that brands understand the importance of diversity and inclusion and, more specifically, that their products meet the needs of the people (Brown, 2021).
Sources:
Ahssen, S. and Driver, R. (2018). Sales of foundation boosted by expanded shade offering. Fashion Network. https://us.fashionnetwork.com/news/Sales-of-foundation-boosted-by-expanded-shade-offering,1041805.html
Brown, D. (2021). What Diversity Looks like in Foundation and the Beauty Industry? Essence. https://www.essence.com/beauty/what-diversity-looks-like-in-foundation-and-the-beauty-industry/
Sin, P. P. (2021). Colorism Toward the Black, Indigenous and People of Color (BIPOC) Community in the Makeup Industry (Doctoral dissertation). https://cache.kzoo.edu/handle/10920/39131
library(readr)
shades <- read_csv("shades.csv")
## Rows: 625 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): brand, brand_short, product, product_short, hex
## dbl (5): H, S, V, L, group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I ran an initial overall summary of the dataset in RStudio.
str(shades)
## spec_tbl_df [625 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ brand : chr [1:625] "Maybelline" "Maybelline" "Maybelline" "Maybelline" ...
## $ brand_short : chr [1:625] "mb" "mb" "mb" "mb" ...
## $ product : chr [1:625] "Fit Me" "Fit Me" "Fit Me" "Fit Me" ...
## $ product_short: chr [1:625] "fmf" "fmf" "fmf" "fmf" ...
## $ hex : chr [1:625] "f3cfb3" "ffe3c2" "ffe0cd" "ffd3be" ...
## $ H : num [1:625] 26 32 23 19 18 20 28 24 26 20 ...
## $ S : num [1:625] 0.26 0.24 0.2 0.25 0.3 0.29 0.31 0.33 0.38 0.38 ...
## $ V : num [1:625] 0.95 1 1 1 0.74 0.92 0.98 0.89 0.89 0.7 ...
## $ L : num [1:625] 86 92 91 88 65 80 87 77 77 60 ...
## $ group : num [1:625] 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. brand = col_character(),
## .. brand_short = col_character(),
## .. product = col_character(),
## .. product_short = col_character(),
## .. hex = col_character(),
## .. H = col_double(),
## .. S = col_double(),
## .. V = col_double(),
## .. L = col_double(),
## .. group = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(shades)
## brand brand_short product product_short
## Length:625 Length:625 Length:625 Length:625
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## hex H S V
## Length:625 Min. : 4.00 Min. :0.1000 Min. :0.2000
## Class :character 1st Qu.:23.00 1st Qu.:0.3500 1st Qu.:0.6900
## Mode :character Median :26.00 Median :0.4400 Median :0.8400
## Mean :25.31 Mean :0.4595 Mean :0.7795
## 3rd Qu.:29.00 3rd Qu.:0.5600 3rd Qu.:0.9100
## Max. :45.00 Max. :1.0000 Max. :1.0000
## NA's :12 NA's :12 NA's :12
## L group
## Min. :11.00 Min. :0.000
## 1st Qu.:55.00 1st Qu.:2.000
## Median :71.00 Median :3.000
## Mean :65.92 Mean :3.472
## 3rd Qu.:79.00 3rd Qu.:5.000
## Max. :95.00 Max. :7.000
##
library("table1")
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
Next, I ran a cross-tabulation of brand by group to get a better view of which brands fall into each group category and how many shades each one contributes.
## cross tabulation brand * group
table(shades$brand,shades$group)
##
## 0 1 2 3 4 5 6 7
## Addiction 0 0 0 0 0 0 17 0
## bareMinerals 0 0 29 0 0 0 0 0
## Beauty Bakerie 0 0 0 30 0 0 0 0
## Bharat and Doris 0 0 0 0 0 0 0 7
## Black Opal 0 0 0 12 0 0 0 0
## Black Up 0 0 0 18 0 0 0 0
## Blue Heaven 0 0 0 0 0 0 0 2
## Bobbi Brown 0 0 0 0 30 0 0 0
## Colorbar 0 0 0 0 0 0 0 3
## Covergirl Olay 0 0 12 0 0 0 0 0
## Dior 0 0 0 0 0 0 6 0
## Elsas Pro 0 0 0 0 0 11 0 0
## Estee Lauder 0 0 42 0 0 0 0 0
## Fenty 40 0 0 0 0 0 0 0
## Hegai and Ester 0 0 0 0 0 10 0 0
## House of Tara 0 0 0 0 0 11 0 0
## Iman 0 0 0 8 0 0 0 0
## IPSA 0 0 0 0 0 0 6 0
## Kate 0 0 0 0 0 0 6 0
## Kuddy 0 0 0 0 0 5 0 0
## Lakme 0 0 0 0 0 0 0 4
## Lancome 0 0 0 0 40 0 0 0
## Laws of Nature 0 0 0 17 0 0 0 0
## LOreal 0 0 22 0 0 0 0 14
## Lotus Herbals 0 0 0 0 0 0 0 4
## MAC 0 0 0 0 42 0 0 0
## Make Up For Ever 0 40 0 0 0 0 0 0
## Maybelline 0 0 40 0 0 0 0 14
## NARS 0 0 0 0 0 0 13 0
## Nykaa 0 0 0 0 0 0 0 5
## Olivia 0 0 0 0 0 0 0 4
## Revlon 0 0 22 0 0 0 0 0
## RMK 0 0 0 0 0 0 9 0
## Shiseido 0 0 0 0 0 0 6 0
## Shu Uemera 0 0 0 0 0 0 11 0
## Trim and Prissy 0 0 0 0 0 13 0 0
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
To analyze the data further, I used histograms to look at how the brands’ shades are distributed across the “group” variable.
country <- shades$group
hist(country,
main="Best Selling Make-up Brands by Group",
xlab="Group",
border="red",
col="orange"
)
country <- shades$group
hist(country,
main="Best Selling Make-up Brands by Group",
xlab="Group",
border="red",
col="orange",
prob = TRUE)
lines(density(country))
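Because “group” is a categorical code rather than a measurement, a bar chart of counts may be easier to read than a histogram with a density curve. The sketch below is one possible alternative, not part of the original analysis. The counts are the column totals of the brand-by-group cross-tabulation; with the data loaded, `table(shades$group)` would give the same numbers.

```r
# Shade counts per group: column totals of the brand-by-group cross-tabulation
group_counts <- c("0" = 40, "1" = 40, "2" = 167, "3" = 85,
                  "4" = 112, "5" = 50, "6" = 74, "7" = 57)

# One bar per group code, with no binning or density smoothing involved
barplot(group_counts,
        main = "Number of Foundation Shades by Group",
        xlab = "Group",
        ylab = "Shades",
        col = "orange",
        border = "red")
```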
Ranked by the number of shades recorded, the country groups are the US (167 shades), Japan (74), India (57), and Nigeria (50). I also need to consider looking into the larger categories that do not represent a specific country, for example groups “3” and “4”. These are 3: BIPOC-recommended Brands with BIPOC Founders and 4: BIPOC-recommended Brands with White Founders. “BIPOC” stands for Black, Indigenous, (and) People of Color.
I decided to filter the dataset to focus specifically on the beauty brands that are the best sellers in the following four countries: the US, Nigeria, Japan, and India.
shades_country <- shades %>%
  select(brand, brand_short, product, product_short, hex, H, S, V, L, group) %>%
  # group is numeric: 2 = US, 5 = Nigeria, 6 = Japan, 7 = India
  filter(group %in% c(2, 5, 6, 7))
shades_country <- shades_country %>%
  # replace the numeric group codes with country names
  mutate(group = case_when(
    group == 2 ~ "United States",
    group == 5 ~ "Nigeria",
    group == 6 ~ "Japan",
    group == 7 ~ "India"
  ))
shades_country <- shades_country %>%
mutate(group = factor(group, levels= rev(c("United States", "Nigeria", "Japan", "India"))))
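As a small self-contained illustration of the recoding and level-ordering steps, the sketch below uses a named lookup vector instead of `case_when()`. This is an alternative approach, not the code used in the analysis; the `codes` vector is a made-up example. It also shows that a code outside the four country groups ends up as `NA`, and that reversing the level order puts India first.

```r
# Named lookup vector: numeric group code -> country label
group_labels <- c("2" = "United States", "5" = "Nigeria", "6" = "Japan", "7" = "India")

codes <- c(2, 5, 6, 7, 3)  # illustrative codes; 3 is not a country group, so it has no label
labels <- unname(group_labels[as.character(codes)])

# Same level order as the factor() call in the analysis: reversed, so India comes first
country <- factor(labels, levels = rev(c("United States", "Nigeria", "Japan", "India")))
levels(country)
```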
Conduct additional analysis by plotting the variables Hue (H), Saturation (S), Value (V), and Lightness (L) against each other to visualize the correlation between the variables.
pairs(~H + S + V + L, data = shades_country)
Plot the variables Hue (H), Saturation (S), Value (V), and Lightness (L) against each other, coloring the points by the factor variable “group” (country), to visualize the correlation between the variables.
H - The bottom three charts show no correlation.
S - The top chart shows no correlation, and the bottom two charts show a moderate negative correlation.
V - The first top chart shows a weak positive correlation, the second top chart shows a weak negative correlation, and the bottom chart shows a strong positive correlation.
L - The first top chart shows a weak positive correlation, the second top chart shows a weak negative correlation, and the third top chart shows a strong positive correlation.
pairs(~H + S + V + L, col = factor(shades_country$group), pch = 19, data = shades_country)
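The visual impressions from the scatterplot matrix could be backed up with numeric correlation coefficients. The sketch below is a minimal example that uses only the first ten rows of H, S, V, and L as printed by `str(shades)`. Those rows are all Maybelline shades, so the numbers are not representative of the full dataset. With the data loaded, the same call could be applied to `shades_country[, c("H", "S", "V", "L")]`.

```r
# First ten rows of H, S, V, and L as printed by str(shades) (all Maybelline "Fit Me")
sample_shades <- data.frame(
  H = c(26, 32, 23, 19, 18, 20, 28, 24, 26, 20),
  S = c(0.26, 0.24, 0.20, 0.25, 0.30, 0.29, 0.31, 0.33, 0.38, 0.38),
  V = c(0.95, 1.00, 1.00, 1.00, 0.74, 0.92, 0.98, 0.89, 0.89, 0.70),
  L = c(86, 92, 91, 88, 65, 80, 87, 77, 77, 60)
)

# Pearson correlations; "complete.obs" drops any rows with missing values
cor_matrix <- cor(sample_shades, use = "complete.obs")
round(cor_matrix, 2)
```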
library("ggplot2")
Scatter plot to see if there is a correlation between brands and the variables Hue and Saturation.
Scatter plot to see if there is a correlation between brands and the variables Value and Lightness.
ggplot(shades_country, aes(x = log(H), y = log(S))) +
geom_point(aes(color = factor(brand)))
## Warning: Removed 12 rows containing missing values (geom_point).
ggplot(shades_country, aes(x = log(V), y = log(L))) +
geom_point(aes(color = factor(brand)))
## Warning: Removed 12 rows containing missing values (geom_point).
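The warnings about removed rows line up with the 12 missing values that `summary(shades)` reports for H, S, and V. One way to locate such rows is sketched below on a small toy data frame. The first three rows come from the `str()` output, while the fourth row is hypothetical and exists only to show the mechanics.

```r
# Toy data: three rows from the str() output plus one hypothetical row with missing HSV
toy <- data.frame(
  brand = c("Maybelline", "Maybelline", "Maybelline", "Hypothetical Brand"),
  H = c(26, 32, 23, NA),
  S = c(0.26, 0.24, 0.20, NA),
  V = c(0.95, 1.00, 1.00, NA),
  L = c(86, 92, 91, 50)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Rows that would be dropped from a plot of H, S, or V
incomplete_rows <- toy[!complete.cases(toy[, c("H", "S", "V")]), ]
incomplete_rows
```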
Link to Tableau Dashboards:
https://public.tableau.com/views/DATA110_FinalProject/MakeupBrandsbyGroup?:language=en-US&:display_count=n&:origin=viz_share_link
The visualizations for this project were created in Tableau. I added a column with the “group” names to the dataset to assist with creating data visualizations in Tableau. Separate dashboards were created for “Makeup Brands by Group”, “Hex Count by Makeup Brand”, and “HSVL by Makeup Brand”. The dashboards were created to combine multiple worksheet topics and to continue to explore the following topics:
In the “Makeup Brands by Group” dashboard I used a treemap and a horizontal bar chart. In the treemap, the US and overall views show United States makeup brands as the best sellers, followed by BIPOC-recommended Brands with White Founders.
The “Hex Count by Makeup Brand” dashboard is used to show the color range within each brand. The brand with the largest number of colors (hex values) is Maybelline. I created a bubble chart to show the brands with the widest selection. The other top brands include MAC, Estee Lauder, Fenty, Lancome, and Make Up For Ever. I also created a bar chart that shows the breakout by the “Group” and “Brand” variables to further show the impact of the best-selling countries.
The final dashboard that I created was Hue, Saturation, Value, and Lightness by Makeup Brand. Although we saw a positive correlation for “Value” earlier, the bar chart does not show a large difference across brands. However, “Hue” and “Lightness” appear to be a large part of what beauty brands focus on with foundation shade colors. This seems consistent with the expansion of colors.
One thing I wish this dataset provided was multiple years of data. This would have allowed for a comparison of how the beauty industry has changed and would have shown the increase in foundation colors over time.