Source: Sheglam Skinfluencer Foundation

Introduction…

For my final project, I am using the Diversity of Makeup Shades data set. This data set includes a list of beauty brands in the US, Nigeria, India, and Japan that were considered by several sources to be “best sellers” in their home countries. The original author visited each brand’s website during May 2018, found the liquid foundation line that (at the time of sampling) had the largest number of shades available, and recorded the hex color value for each of the colored swatches shown for the product. Then, using Adobe Photoshop, they extracted the lightness value of each color (using the CIE Lab color model).
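
To make the lightness step concrete, here is a minimal sketch of how a hex swatch can be converted to an approximate CIE Lab lightness value in base R. This is not the original author's method (they used Adobe Photoshop). The helper name `hex_to_lightness` is hypothetical, and small differences from the recorded L values should be expected because the reference white and rounding may differ.

```r
# Hypothetical helper: approximate CIE Lab lightness (L*) for hex colours.
# Uses only base R (grDevices); this is NOT the Photoshop workflow used for the data set.
hex_to_lightness <- function(hex) {
  # The data set stores hex codes without a leading "#", so add it if missing
  hex <- ifelse(startsWith(hex, "#"), hex, paste0("#", hex))
  rgb <- t(grDevices::col2rgb(hex)) / 255   # one row per colour, scaled to 0-1
  lab <- grDevices::convertColor(rgb, from = "sRGB", to = "Lab")
  lab[, 1]                                  # first column is lightness
}

# First two swatches shown in the str() output (recorded L values: 86 and 92)
swatches <- c("f3cfb3", "ffe3c2")
lightness <- hex_to_lightness(swatches)
round(lightness)
```

The computed values should land close to, but not necessarily exactly on, the L values stored in the data set.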

I will be focusing on:

The information for the data set was collected from different websites and people. Below is the list for all the sources.

| Variable | Variable Type |
| --- | --- |
| Brand | Categorical Variable |
| Brand_short | Categorical Variable |
| Product | Categorical Variable |
| Product_short | Categorical Variable |
| Hex (the hexadecimal color code for a particular shade) | Categorical Variable |
| H (Hue: the origin of the color) | Continuous Variable |
| S (Saturation: the intensity and purity of a color) | Continuous Variable |
| V (Value) | Continuous Variable |
| L (Lightness) | Integer / Discrete Variable |
| Group (which country the foundation is from) | Integer / Discrete Variable |

I would like to define the “Group” Variable further. Group in this data set shows us where the foundation came from.
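
The group codes can be turned into readable labels with a factor. The sketch below uses the group descriptions as they are named elsewhere in this report (codes 5 to 7 follow the country mapping used in the later `mutate()` step). The names `group_labels` and `codes` are illustrative only and are not part of the data set.

```r
# Group codes and their descriptions, as named in this report
group_labels <- c(
  "0" = "Fenty Beauty's PRO FILT'R Foundation Only",
  "1" = "Make Up For Ever's Ultra HD Foundation Only",
  "2" = "US Best Sellers",
  "3" = "BIPOC-recommended Brands with BIPOC Founders",
  "4" = "BIPOC-recommended Brands with White Founders",
  "5" = "Nigerian Best Sellers",
  "6" = "Japanese Best Sellers",
  "7" = "Indian Best Sellers"
)

# Toy vector of codes standing in for Diversity_Shades$group
codes <- c(2, 2, 7, 0, 5)

# Attach the labels to the numeric codes
group_factor <- factor(codes,
                       levels = as.numeric(names(group_labels)),
                       labels = unname(group_labels))
as.character(group_factor)
```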

I decided to use this data set for my final project because I absolutely love makeup. Over the last few years, the range of foundation shades has really expanded. When I first got to the USA, there were not many choices for shades that matched my skin tone, but these new options have been really great for women of color. I thought it would be fun to explore this.

Incorporate background research about this topic.

Creating shades for all skin tones is a fundamental aspect of developing an inclusive and diverse cosmetics line. Historically, the beauty industry has been criticized for lacking options for darker skin tones. However, more than 330 new shades were launched between August 2017 and July 2018, around 100 more than in the previous year. It’s more evident than ever that BIPOC not only set beauty trends but that our dollars are a large part of the market share, yet BIPOC are still excluded from advertising and marketing. At this time, it is important that brands understand the importance of diversity and inclusion and, more specifically, that their products meet the needs of the people (Brown, 2021).

Sources:

Brown, D. (2021). What Diversity Looks like in Foundation and the Beauty Industry? Essence. https://www.essence.com/beauty/what-diversity-looks-like-in-foundation-and-the-beauty-industry/

Hale Cosmeceuticals Inc. (n.d.). Incorporating inclusivity and diversity in the Beauty Industry. https://www.halecosmeceuticals.com/blog/incorporating-inclusivity-and-diversity-in-the-beauty-industry

Load in the Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(ggplot2)
library(RColorBrewer)

Load in the Data Set.

Diversity_Shades <- readr::read_csv("Shades.csv")
## Rows: 625 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): brand, brand_short, product, product_short, hex
## dbl (5): H, S, V, L, group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary and Structure of the Data Set. Using RStudio, I checked the summary and the structure of the data set.

summary(Diversity_Shades)
##     brand           brand_short          product          product_short     
##  Length:625         Length:625         Length:625         Length:625        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      hex                  H               S                V         
##  Length:625         Min.   : 4.00   Min.   :0.1000   Min.   :0.2000  
##  Class :character   1st Qu.:23.00   1st Qu.:0.3500   1st Qu.:0.6900  
##  Mode  :character   Median :26.00   Median :0.4400   Median :0.8400  
##                     Mean   :25.31   Mean   :0.4595   Mean   :0.7795  
##                     3rd Qu.:29.00   3rd Qu.:0.5600   3rd Qu.:0.9100  
##                     Max.   :45.00   Max.   :1.0000   Max.   :1.0000  
##                     NA's   :12      NA's   :12       NA's   :12      
##        L             group      
##  Min.   :11.00   Min.   :0.000  
##  1st Qu.:55.00   1st Qu.:2.000  
##  Median :71.00   Median :3.000  
##  Mean   :65.92   Mean   :3.472  
##  3rd Qu.:79.00   3rd Qu.:5.000  
##  Max.   :95.00   Max.   :7.000  
## 
str(Diversity_Shades)
## spc_tbl_ [625 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ brand        : chr [1:625] "Maybelline" "Maybelline" "Maybelline" "Maybelline" ...
##  $ brand_short  : chr [1:625] "mb" "mb" "mb" "mb" ...
##  $ product      : chr [1:625] "Fit Me" "Fit Me" "Fit Me" "Fit Me" ...
##  $ product_short: chr [1:625] "fmf" "fmf" "fmf" "fmf" ...
##  $ hex          : chr [1:625] "f3cfb3" "ffe3c2" "ffe0cd" "ffd3be" ...
##  $ H            : num [1:625] 26 32 23 19 18 20 28 24 26 20 ...
##  $ S            : num [1:625] 0.26 0.24 0.2 0.25 0.3 0.29 0.31 0.33 0.38 0.38 ...
##  $ V            : num [1:625] 0.95 1 1 1 0.74 0.92 0.98 0.89 0.89 0.7 ...
##  $ L            : num [1:625] 86 92 91 88 65 80 87 77 77 60 ...
##  $ group        : num [1:625] 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   brand = col_character(),
##   ..   brand_short = col_character(),
##   ..   product = col_character(),
##   ..   product_short = col_character(),
##   ..   hex = col_character(),
##   ..   H = col_double(),
##   ..   S = col_double(),
##   ..   V = col_double(),
##   ..   L = col_double(),
##   ..   group = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
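
The summary shows 12 missing values in each of H, S, and V. A quick way to confirm this kind of count per column is sketched below on a small toy data frame. `toy_shades` and its values are made up for illustration and are not the real data.

```r
# Toy data frame standing in for Diversity_Shades (values are made up)
toy_shades <- data.frame(
  hex = c("f3cfb3", "ffe3c2", "ffffff", "c68e6f"),
  H   = c(26, 32, NA, 21),
  S   = c(0.26, 0.24, NA, 0.44),
  V   = c(0.95, 1.00, NA, 0.78),
  L   = c(86, 92, 100, 64)
)

# Number of missing values in every column
na_counts <- colSums(is.na(toy_shades))
na_counts

# Rows with at least one missing value; lm() and ggplot drop such rows
incomplete_rows <- toy_shades[!complete.cases(toy_shades), ]
incomplete_rows
```

Running the same two calls on `Diversity_Shades` would show which shades lack HSV values before any modeling is done.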

Scatter smooth of Best Selling Make-up Brands by Group

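# With a single vector, scatter.smooth() plots each row's position (x) against its group code (y)
Best_MP <- Diversity_Shades$group
scatter.smooth(Best_MP,
     main = "Best Selling Make-up Brands by Group",
     xlab = "Row index",
     ylab = "Group",
     col = "red"
)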

From the visualization above, we can see that Group 0 (“Fenty Beauty’s PRO FILT’R Foundation Only”) and Group 1 (“Make Up For Ever’s Ultra HD Foundation Only”) have a lower frequency than the other groups. Group 7 (“Indian Best Sellers”) has the highest frequency of all the groups.
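
Because group is a code rather than a measurement, a frequency table or bar chart shows the group sizes more directly than a smoothed scatter plot. A minimal sketch on a toy vector follows. The counts are made up and are not the real group sizes.

```r
# Toy group codes standing in for Diversity_Shades$group (made-up values)
toy_group <- c(0, 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 7)

# Count how many shades fall in each group
group_counts <- table(toy_group)
group_counts

# Bar chart of the counts, one bar per group code
barplot(group_counts,
        main = "Number of Shades by Group (toy data)",
        xlab = "Group",
        ylab = "Number of shades",
        col  = "red")
```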

Histogram of Best Selling Make-up Brands by Group

Best_MP <- Diversity_Shades$group
hist(Best_MP,
     main = "Best Selling Make-up Brands by Group",
     xlab = "Group",
     ylab = "Density",
     border = "black",
     col = "red",
     prob = TRUE)

lines(density(Best_MP))

According to the histogram, the countries rank in this order: US, Japan, India, and Nigeria. I also need to consider the larger categories that do not represent a specific country, for example groups 3 and 4. These are 3: BIPOC-recommended Brands with BIPOC Founders and 4: BIPOC-recommended Brands with White Founders. “BIPOC” stands for Black, Indigenous, and People of Color.

I want to keep only the brands in the US Best Sellers group.

US_Best <- Diversity_Shades %>%
    # Keep group 2 (US Best Sellers) rows for the six US best-selling brands
    filter(group == 2,
           brand %in% c("Estée Lauder", "Revlon", "L'Oréal",
                        "Maybelline", "bareMinerals", "Covergirl + Olay"))

Let’s take a look at the Lightness Value of US-Best Sellers Makeup Brands.

US <- US_Best %>%
    # US_Best already holds only the six US best-selling brands, so no further filtering is needed
    ggplot(aes(x = brand, y = L)) +
    # One vertical line per brand, spanning its darkest to its lightest shade
    geom_line() +
    ggtitle("Lightness of the Foundation by US Best Sellers") +
    ylab("Lightness of the Foundation") +
    xlab("US Best Sellers") +
    theme_bw() +
    # "0.3" is not a valid legend position; hide the legend instead
    theme(legend.position = "none", axis.title = element_text())

US

The six makeup brands shown in the visualization above are the ones in the US Best Sellers category. L’Oréal and Maybelline have similar lightness values from the darkest to the lightest shade, but Covergirl + Olay covers only about a quarter of the range of the others.
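
To put numbers on how wide each brand's lightness range is, the minimum, maximum, and spread of L can be summarized per brand. This is a minimal sketch on toy rows. `toy_us` and its lightness values are made up, and the column names `min_L`, `max_L`, and `spread_L` are illustrative.

```r
library(dplyr)

# Toy rows standing in for US_Best (lightness values are made up)
toy_us <- tibble(
  brand = c("Maybelline", "Maybelline", "Maybelline", "Revlon", "Revlon"),
  L     = c(86, 60, 25, 80, 55)
)

# Number of shades, darkest shade, lightest shade, and spread for each brand
lightness_range <- toy_us %>%
  group_by(brand) %>%
  summarise(
    shades   = n(),
    min_L    = min(L),
    max_L    = max(L),
    spread_L = max_L - min_L
  ) %>%
  arrange(desc(spread_L))

lightness_range
```

Applying the same pipeline to `US_Best` would give a table that can be read alongside the plot.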

Filter group by countries (US, Nigeria, Japan, and India). I decided to filter the data set to focus specifically on the beauty brands that were listed as best sellers in the following four countries: US, Nigeria, Japan, and India.

Best_Countries <- Diversity_Shades %>%
  # Keep all columns; group is numeric, so compare against numbers
  filter(group %in% c(2, 5, 6, 7))

Mutate group to country name (US, Nigeria, Japan, and India).

Best_Countries <- Best_Countries %>%
  mutate(group = case_when(
    group %in% 2 ~ "United States",
    group %in% 5 ~ "Nigeria",
    group %in% 6 ~ "Japan",
    group %in% 7 ~ "India",
  ))

Best_Countries <- Best_Countries %>%
  mutate(group = factor(group, levels= rev(c("United States", "Nigeria", "Japan", "India"))))

Statistical Analysis.

Scatter plot matrix.

Plot the variables Hue (H), Saturation (S), Value (V), and Lightness (L) against each other, specifying factor variable “group” = country, to visualize the correlation between the variables.

H - The bottom three charts show no correlation.

S - The top chart shows no correlation, and the bottom two charts show a moderate negative correlation.

V - The first top chart shows a weak positive correlation, the second top chart shows a weak negative correlation, and the bottom chart shows a strong positive correlation.

L - The first top chart shows a weak positive correlation, the second top chart shows a weak negative correlation, and the third top chart shows a strong positive correlation.

pairs(~H + S + V + L, col = factor(Best_Countries$group), pch = 23, data = Best_Countries)
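
The visual impressions from the scatter plot matrix can be checked with a correlation matrix. The sketch below runs on a small toy data frame. `toy` and its values are made up, so the resulting numbers are not the correlations of the real data.

```r
# Toy data standing in for Best_Countries (values are made up)
toy <- data.frame(
  H = c(26, 32, 23, NA, 18, 20),
  S = c(0.26, 0.24, 0.20, 0.25, 0.30, 0.29),
  V = c(0.95, 1.00, 1.00, 1.00, 0.74, 0.92),
  L = c(86, 92, 91, 88, 65, 80)
)

# Pearson correlations, using every pair of rows where both values are present
cor_matrix <- cor(toy[, c("H", "S", "V", "L")], use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

The `use = "pairwise.complete.obs"` argument matters here because H, S, and V contain missing values in the real data set.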

Linear model and scatter plot to see if there is a correlation between the variables Hue and Saturation across brands.

lm_model <- lm(H ~ S, data = Best_Countries)

summary(lm_model)
## 
## Call:
## lm(formula = H ~ S, data = Best_Countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4023  -2.5088   0.5279   3.5389  16.4655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.7475     0.9337  27.577   <2e-16 ***
## S            -0.7344     1.9656  -0.374    0.709    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.969 on 334 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.0004178,  Adjusted R-squared:  -0.002575 
## F-statistic: 0.1396 on 1 and 334 DF,  p-value: 0.7089
plot(lm_model)

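The equation for my model is: Hue = 25.7475 − 0.7344 × Saturation + ϵ

Adjusted R-squared: -0.002575
p-value: 0.7089

With a p-value well above 0.05 and an adjusted R-squared close to zero, the model shows no significant linear relationship between Hue and Saturation.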
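
As a worked example of what the fitted coefficients imply, the sketch below plugs a few saturation values into the intercept and slope reported by `summary(lm_model)`. The helper `predict_hue` is hypothetical, and the three saturation values are the minimum, median, and maximum from the summary of the full data set.

```r
# Coefficients copied from the summary(lm_model) output
intercept <- 25.7475
slope_S   <- -0.7344

# Hypothetical helper: predicted hue for a given saturation
predict_hue <- function(saturation) {
  intercept + slope_S * saturation
}

# Minimum, median, and maximum saturation from the summary of the full data set
saturation_values <- c(0.10, 0.44, 1.00)
predicted_hue <- predict_hue(saturation_values)
round(predicted_hue, 2)
```

Across the whole observed range of saturation, the predicted hue changes by less than one unit, which is small next to the residual standard error of 4.969.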

ggplot(Best_Countries, aes(x = log(H), y = log(S))) +
    geom_point(aes(color = factor(brand)))
## Warning: Removed 12 rows containing missing values (`geom_point()`).

Plot one or more various visualizations we have discussed throughout the course, which may or may not include GIS information. You could also use Tableau for your visualization, but be sure to include the link to your Tableau visualization in your Markdown File. During your exploration, keep a running commentary in the Markdown text area of what you are doing and why you are doing it.

Link to Tableau Dashboards:

https://public.tableau.com/app/profile/janithri.pannala/viz/FinalPR_DATA_110/MakeupBrandsbyGroup

Essay…

The final visualization for this project was created in Tableau Public. I changed the data set a bit for the visualization by replacing the numbers in the “Group” category with the actual names of the countries used in the data set. I thought it would be helpful for better understanding. I created three different dashboards, thinking that I would be able to combine them into one, but I was not able to do that. So, I used the navigation box to move from the main view to the other two dashboards. These dashboards were created to include visuals for “Makeup Brands by Group”, “Lightness by Makeup Brand”, and “HSVL by Makeup Brand”.

In the “Makeup Brands by Group” dashboard, I used a treemap and a horizontal bar chart. In the treemap, we can see that the United States best sellers make up the largest group, followed by BIPOC (Black, Indigenous, and People of Color)-recommended Brands with White Founders.

The “Lightness by Makeup Brand” dashboard is used to show the color lightness of the foundations within each brand. The brands with the widest range of colors are Estée Lauder and Maybelline. I created a bubble chart to show the selection. The top brands include MAC, Estée Lauder, Fenty, Lancôme, and Bobbi Brown. I also created a side-by-side circle chart that shows the brand breakout by the “Group” and “Brand” variables to further show the impact of the best-selling countries.

The final dashboard that I created was Hue, Saturation, Value, and Lightness by Makeup Brand. Also, we saw a positive correlation for “Value” earlier. However, “Hue” and “Lightness” make up a large part of what beauty brands focus on with foundation shade colors.

One thing I wish this data set provided was multiple years of data. I also think that, instead of providing only a hex code, they could have provided a hex count for more calculations.