
Introduction

The Cosmetify Index Reports have been published since 2019 and show trends in the beauty industry each year. This article takes a closer look at the data presented in the 2021 report from the online beauty discovery platform Cosmetify.

library(readr)
library(knitr)

data <- read_csv("https://raw.githubusercontent.com/wmotkowska/Cosmetify_Index_USL_Analysis/main/cosmetify_data_csv.csv", 
    col_types = cols(score = col_double()))
kable(head(data), 'html', table.attr='id="cosm_table"', align = 'c', caption='Cosmetify Index Data - top brands', col.names=c('Rank', 'Brand', 'Instagram followers', 'Instagram Hashtags', 'Instagram Engagement Rate', 'Annual Web Searches', 'Yearly change in searches', 'Overall score'))
Cosmetify Index Data - top brands

Rank  Brand                    Instagram  Instagram  Instagram        Annual Web  Yearly change  Overall
                               followers  Hashtags   Engagement Rate  Searches    in searches    score
1     Huda Beauty              49500000   29500000   0.0003            7000000    -0.301         6.54
2     M·A·C Cosmetics          24200000   20600000   0.0003            8700000     0.009         4.98
3     Anastasia Beverly Hills  19500000   25700000   0.0004            3000000    -0.189         4.96
4     Avon                       161000    6800000   0.0012           37300000    -0.165         4.88
5     The Body Shop             2300000    1500000   0.0014           38000000     0.154         4.49
6     Yves Rocher                882000    1200000   0.0031           33100000     0.064         4.36


On the data for the ranked hottest beauty brands, with 100 observed brands and 6 variables, I conduct the following unsupervised methods:

  • clustering (K-means and Partitioning Around Medoids),

  • dimension reduction (Principal Components Analysis).

I assume there will be two types of cosmetic brands in this kind of data: new brands that have been marketed largely online and have influential owners, and brands marketed largely outside of the online world that have been in use for decades.

In my study I will use unsupervised learning methods to group the data, check whether my assumptions are right, and see whether there are other groups not visible at first sight.

In this article the abbreviation ‘ig’ stands for Instagram.

Used libraries:

library(cluster)
library(factoextra)
library(fpc)
library(gridExtra)
library(lattice)
library(corrplot)
library(clusterSim)
library(psych)
library(ggfortify)
library(wesanderson)
library(pca3d)
library(jtools)
library(lmtest)
library(tseries)

Used colors:

pal1 <- wes_palette("Royal2")
pal2 <- wes_palette("Rushmore")

Ranking

The score variable is calculated from the other variables in the dataset. This means it is strongly dependent on them and should not be used as a distinct variable in the methods applied in this article.

The position variable is in turn dependent on the score variable. In this analysis, position serves as a reference describing how brands compare to each other while we compare the attributes.

data1 <- data[,3:8]
data2 <- scale(data1)
data2 <- as.data.frame(data2)
rank <- lm(score~ig_follow+ig_hash+ig_engagement_rate+annual_searches+year_search_change, data2)
resettest(rank, power = 2:3, type = c("fitted"))
summ(rank)
## MODEL INFO:
## Observations: 100
## Dependent Variable: score
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(5,94) = 130152.67, p = 0.00
## R² = 1.00
## Adj. R² = 1.00 
## 
## Standard errors: OLS
## --------------------------------------------------------
##                             Est.   S.E.    t val.      p
## ------------------------ ------- ------ --------- ------
## (Intercept)                -0.00   0.00     -0.00   1.00
## ig_follow                   0.37   0.00    168.58   0.00
## ig_hash                     0.44   0.00    202.88   0.00
## ig_engagement_rate          0.35   0.00    252.11   0.00
## annual_searches             0.52   0.00    407.38   0.00
## year_search_change         -0.32   0.00   -225.86   0.00
## --------------------------------------------------------

From the regression on standardized variables I conclude that the score is computed from standardized values with weights between 0 and 1 in absolute value. We cannot rule out that the intercept is 0. The R² of 1 means the model explains one hundred percent of the variance of the score variable, so this is most likely the actual formula used for the ranking. This simple discovery means that any company can put its data into the formula and calculate its score.
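As an illustration, here is a minimal sketch of that calculation, assuming the rounded coefficients above are the true weights. Since the regression was fitted on standardized variables, a new brand’s raw numbers (hypothetical values below) are standardized against the dataset first, and the predicted standardized score is then mapped back to the original scale.

weights <- c(ig_follow = 0.37, ig_hash = 0.44, ig_engagement_rate = 0.35,
             annual_searches = 0.52, year_search_change = -0.32)
# hypothetical raw numbers for a new brand, for illustration only
new_brand <- c(ig_follow = 5e6, ig_hash = 2e6, ig_engagement_rate = 0.001,
               annual_searches = 1e7, year_search_change = 0.05)
mu    <- sapply(data1[names(weights)], mean)
sigma <- sapply(data1[names(weights)], sd)
score_std <- sum(weights * (new_brand - mu) / sigma)  # predicted standardized score
score_std * sd(data1$score) + mean(data1$score)       # back on the original score scale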

Clustering

Exploring the data

First let us get an understanding of how the data behave when we try to find potential clusters. This section starts with intuitive exploration concerning the attributes, and later I will try to find the optimal number of clusters and the best model. Here I use the K-means method (requested via eclust with hc_metric="manhattan"; note that this metric argument applies to hierarchical methods, while stats::kmeans itself partitions on Euclidean distance). I have scaled the data so that all attributes are on the same scale.

attribute_value_org <- as.matrix(data[,3:7])
attribute_value <- scale(attribute_value_org)
km1<-eclust(attribute_value, "kmeans", hc_metric="manhattan", k=2)
km2<-eclust(attribute_value, "kmeans", hc_metric="manhattan", k=4)
km3<-eclust(attribute_value, "kmeans", hc_metric="manhattan", k=5)
sil<-silhouette(km3$cluster, dist(attribute_value))

plot1 <- fviz_cluster(km1, main="K-means k=2", ggtheme=theme_classic(), palette = pal1)
plot2 <- fviz_cluster(km2, main="K-means k=4", ggtheme=theme_classic(), palette = pal1)
plot3 <- fviz_cluster(km3, main="K-means k=5", ggtheme=theme_classic(), palette = pal1)
plot6 <- fviz_silhouette(sil, ggtheme=theme_classic(), main="K-means k=5", palette = pal1)
grid.arrange(plot1, plot2, ncol=2) 

grid.arrange(plot3, plot6, ncol=2)


The first plot shows that K-means with two clusters results in two groups: better-ranked brands and lower-ranked ones. Silhouettes are negative for more than 10 observations.
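To see which brands these are, one could inspect the negative silhouette widths directly; a minimal sketch for the k=2 solution (the brand name is the second column of the dataset):

sil2 <- silhouette(km1$cluster, dist(attribute_value))
neg <- which(sil2[, "sil_width"] < 0)  # observations placed in a poorly fitting cluster
data[neg, 2]                           # their brand names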

Taking 4 clusters makes all silhouettes positive. The 10 best brands fall mostly into two clusters (olive green and yellow), which may indicate that different attributes (marketing strategies) drive their fame. The PCA section will show that this is indeed the case.

At first glance there might be some outliers (e.g. observation 94). However, this brand’s behaviour can be explained by a different marketing strategy at play, and it should not be omitted from the interpretation of groups. Brands ranked 7 and 94 (olive green) have the biggest Instagram engagement rate. Despite being so distant from each other in the ranking, they seem to follow the same marketing strategy of engaging in conversation with customers. Most brands ranked lower than 25 fall into one cluster.

To learn more about the data, I need to find the model that describes it best.

K-means and PAM

After the initial exploration and data preparation, I am going to apply the unsupervised learning methods and focus on their proper implementation. First let us look at two methods: K-means and Partitioning Around Medoids (PAM). Both group observations based on similar characteristics; one builds clusters around means and the other around medoids. Before using these clustering tools, we need to check whether the observations are clusterable and how many groups we should assume.

A statistical test (the Hopkins statistic) shows that the data are clusterable.

hops <- get_clust_tendency(attribute_value, 20, graph=TRUE, gradient=list(low="blue", mid="white", high="black"))
## Hopkins' statistic (interpretation for get_clust_tendency): 0.836976 data can be clustered


Next I will try to find the optimal number of clusters. The highest silhouette for both K-means and PAM is at 3 clusters, and the automatic tools also suggest 3. However, as noted before, to have all silhouettes positive I need to take 4 clusters. Looking at the gap statistic (clusGap), it starts to stabilize around 5 clusters. The silhouette for 4 clusters is higher than for 5 in K-means, but the Calinski-Harabasz index says the opposite, and the preference between 4 and 5 also flips in PAM (see the figures below). That is why I decide to examine 5 clusters.

ch1 <- round(calinhara(attribute_value, km2$cluster),digits=2) 
ch2 <- round(calinhara(attribute_value, km3$cluster),digits=2) 
## Calinski-Harabasz index for k-means with k=4: 47.24
## Calinski-Harabasz index for k-means with k=5: 51.63
pamk.best<-pamk(attribute_value, krange=2:10,criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.01, diss=inherits(attribute_value, "dist"), critout=FALSE) # fpc::pamk()
## Number of clusters estimated by optimum average silhouette width: 3
plot1 <- fviz_nbclust(attribute_value, FUNcluster=kmeans, linecolor = pal1[1])
plot2 <- fviz_nbclust(attribute_value, FUNcluster=cluster::pam, linecolor = pal1[2])
grid.arrange(plot1, plot2, ncol=2, top="K-means / PAM")

plot1 <- fviz_nbclust(attribute_value, FUNcluster=kmeans, method="gap_stat", linecolor = pal1[3])+ theme_classic()
plot2 <- fviz_nbclust(attribute_value, FUNcluster=cluster::pam, method="gap_stat", linecolor = pal1[4])+ theme_classic()
grid.arrange(plot1, plot2, ncol=2, top="K-means / PAM")

pam2<-eclust(attribute_value, "pam", hc_metric="manhattan", k=5)
par(mfrow=c(1,2))
fviz_cluster(km3, main="K-means k=5", ggtheme=theme_classic(), palette = pal1)

fviz_cluster(pam2, main="PAM k=5", ggtheme=theme_classic(), palette = pal1)


If we look closely, only observation 17 changes its cluster when we change the method. Based on this, I conclude that both methods give almost identical results and the grouping has been done correctly. For the next section I will use the K-means results for k=5.
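This can also be checked without eyeballing the plots, by cross-tabulating the two assignments (cluster labels are arbitrary, so agreement shows up as one dominant cell per row):

table(kmeans = km3$cluster, pam = pam2$cluster)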

xxc<-as.data.frame(cbind(attribute_value, km3$cluster))
colnames(xxc)[6]<-"clust"

# boxplots of each standardized attribute across the five clusters
par(mfrow=c(2,3))
cols <- c(pal1[2], pal2[2], pal1[3], pal2[5], pal2[1])
for (i in 1:5) {
  boxplot(xxc[,i] ~ xxc$clust, col=cols[i], xlab='clusters', ylab=colnames(xxc)[i])
}


Boxplots for each variable across the five clusters show that the values differ between clusters, which suggests the grouping is meaningful. Variables that are correlated display similar patterns in the plots (I will discuss this correlation in the next section).
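The visual impression can be quantified with per-cluster means of the standardized attributes; a minimal sketch:

aggregate(xxc[, 1:5], by = list(cluster = xxc$clust), FUN = mean)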

The first cluster has the biggest Instagram following and number of hashtags. Its marketing strategy is focused on Instagram promotion. For example, Anastasia Beverly Hills and Huda Beauty were hugely promoted by beauty gurus on Instagram and YouTube, hence the massive number of hashtags compared to other brands. These makeup artists are either famous and paid to promote the brand, or are trying to rise to fame and therefore use popular products promoted by the best in the industry.

The second cluster is the one where engagement and the yearly change in searches are high. These brands market their products by engaging in conversation with customers, checking what they would like to see, and promoting care for people (and perhaps the planet). They are the talk of the community, hence the increase in searches. For example, Florence by Mills is owned by Millie Bobby Brown, a teen-series star who promotes her beauty line in live talks on Instagram. That is still online marketing, but more personal than an edited picture posted with hashtags. The marketing strategy, as well as the products, are more natural than in the previous group.

The fourth cluster has the biggest annual searches. These are the brands with traditional marketing, whose customers have to search the name on Google to buy, rather than enter the site via a hyperlink in an Instagram bio. These brands might have assessed that their customers are older or more traditional. Their cosmetics might be basic skincare products rather than colorful eyeshadow palettes. Their marketing is therefore focused on promoting the good name of the company and product quality in everyday situations, whether in a TV commercial, a newspaper, or a poster. The customer then remembers the company name and searches for it on the internet when in need of skincare or makeup products.

The third cluster has almost all variables at low values. These are mostly the brands below the 25th position, whose marketing strategies are not well established. The analysis showed no reason to split this group further, and PCA showed no distinct direction for these observations.

The fifth cluster has a better online following than the third cluster, though its values are still lower than in the other clusters. The brands in this cluster rank higher than those in the third.

In conclusion, this analysis shows that brands that put a lot of work into one chosen marketing strategy tend to rank better. The marketing team should evaluate what the brand’s customers are like, what will catch their interest, and on which platforms they are best reached, and then plan the marketing accordingly.

Dimension reduction

In this section I analyse the correlations between the attributes. To do so I apply an unsupervised dimension reduction method, Principal Components Analysis. It will help assess similarities between the attributes and how they correspond to the obtained clusters. First I explore the data. There are only 5 variables, so I assume that some reduction was done beforehand; my question is whether any further reduction is possible. In conclusion I propose how the attribute groups may correspond to the brand groups.

matrix <- data.matrix(attribute_value, rownames.force = NA)
cor_matrix <- cor(matrix, method="pearson")
corrplot(cor_matrix, method = "number", number.cex = 0.75, tl.col=pal1[3])


As assumed, the two variables ig following and ig hashtags are highly correlated. Ig engagement and the year-to-year change in name searches are also fairly strongly correlated. However, I cannot see any direct cause for this, and it seems that a hidden variable lies behind it (perhaps the rapid growth of the company).

PCA

First I conduct PCA in 2 dimensions. The data have been normalized. The vectors show the behaviour of the attributes and the relationships between them: vectors close to each other are positively correlated, vectors opposite each other are negatively correlated, and orthogonal vectors are uncorrelated. The observations are also visualized on the same plane.

attribute_value.n<-data.Normalization(attribute_value_org, type="n1", normalization="column")
attribute_value.n.pca1<-prcomp(attribute_value.n, center=FALSE, scale.=FALSE)
plot1 <- fviz_pca_var(attribute_value.n.pca1)
plot2 <- fviz_pca_ind(attribute_value.n.pca1)
grid.arrange(plot1, plot2, ncol=2)

summary(attribute_value.n.pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.4254 1.1246 0.9762 0.7574 0.42055
## Proportion of Variance 0.4063 0.2530 0.1906 0.1147 0.03537
## Cumulative Proportion  0.4063 0.6593 0.8499 0.9646 1.00000


The first two PCs explain 66% of the variance. Adding a third component improves this by 19 percentage points, to 85%, and four components raise the proportion to 96%. It might be good to consider a representation with 3 components.
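A scree plot makes this cut-off easier to see; a minimal sketch using factoextra:

fviz_eig(attribute_value.n.pca1, addlabels = TRUE, ncp = 5)  # % of variance per component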

var<-get_pca_var(attribute_value.n.pca1)
a<-fviz_contrib(attribute_value.n.pca1, "var", axes=1, xtickslab.rt=90, fill = pal2[3], color = pal2[3])
b<-fviz_contrib(attribute_value.n.pca1, "var", axes=2, xtickslab.rt=90, fill = pal2[3], color = pal2[3])
c<-fviz_contrib(attribute_value.n.pca1, "var", axes=3, xtickslab.rt=90, fill = pal2[3], color = pal2[3])
grid.arrange(a,b,c, ncol=2, nrow=2)


From the contribution plots we can see that the variables ig follow and ig hash contribute to the first dimension, while ig engagement and the yearly change in searches contribute to the second. Interestingly, the annual search rate contributes nothing to the first dimension; this variable has almost no correlation with Instagram following and hashtag number. A rotated PCA with three factors and varimax rotation shows that the third factor is explained mostly by annual searches.

attribute_value.n.pca2<-principal(attribute_value.n, nfactors=3, rotate="varimax")
attribute_value.n.pca2
attribute_value.n.pca2$complexity
##          ig_follow            ig_hash ig_engagement_rate    annual_searches 
##           1.042948           1.039049           1.030907           1.007688 
## year_search_change 
##           1.105364
attribute_value.n.pca2$uniquenesses
##          ig_follow            ig_hash ig_engagement_rate    annual_searches 
##         0.08902607         0.09178232         0.26054248         0.01682452 
## year_search_change 
##         0.29233919

The quality measures of the PCA with 3 factors show that the model is a good fit. The complexity of the variables is low, around 1, which is desirable (each variable loads mainly on a single factor). Uniqueness, the share of a variable’s variance not explained by the factors, is also low and does not exceed 0.3.
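The complementary communalities (one minus uniqueness) can be read off the same psych object; a quick check:

attribute_value.n.pca2$communality  # variance of each variable explained by the 3 factors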

print(loadings(attribute_value.n.pca2), digits=3, cutoff=0.4, sort=TRUE)
## 
## Loadings:
##                    RC1    RC2    RC3   
## ig_follow           0.944              
## ig_hash             0.944              
## ig_engagement_rate         0.853       
## year_search_change         0.820       
## annual_searches                   0.990
## 
##                  RC1   RC2   RC3
## SS loadings    1.804 1.428 1.017
## Proportion Var 0.361 0.286 0.203
## Cumulative Var 0.361 0.647 0.850

We have three principal components which we can call:

  • Instagram success - Instagram following and number of hashtags,

  • Rapid growth - growth in annual searches of the name as well as engagement of clients on Instagram,

  • Traditional searching - annual search rate.

fviz_pca_biplot(attribute_value.n.pca1, repel = TRUE, col.var = 'darkgreen', col.ind = 'lightgreen')


The 3D plot shows the directions of the three components; however, they can also be observed on the 2D biplot, which is easier to read. The observations can be represented on the same two-dimensional plane as the attribute vectors, and the position of an observation indicates its values along those vectors.
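The 3D view itself is not reproduced in the code above; a minimal sketch of how such a view could be generated with the pca3d package loaded earlier (an interactive rgl window, with assumed settings):

# sketch: interactive 3D biplot of the first three components
pca3d(attribute_value.n.pca1, components = 1:3, biplot = TRUE, show.labels = TRUE)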

Observations 1, 2 and 3 have the highest ig following and hashtag numbers; they lie farthest from the center in the direction of the vectors for ig following and hashtags.

Observations 7 and 94 have the biggest ig engagement rate and yearly change in searches.

The cluster with observations 4 to 9, 14 and 25 is the one most distant from the center in the direction of annual searches. However, these points are not as distant as those in the two clusters described before.

The other cluster, with the remaining observations from the first quarter of the data (ranks 1-25, excluding those described above), is positioned in the direction of Instagram success, but not as far out as the first cluster.

The last cluster, which contains most of the observations between 25 and 100, is centered around the middle of the plane, with some observations scoring below average on the components.

Those interpretations are easier to read from the graph than from the table of coordinates.

Discussion

Main attributes

The three main attribute groups described earlier allow for a quick estimate of where a brand sits on the plane. However, to compare brands closely we need all 5 variables to assess them correctly in terms of ranking and marketing strategy. In conclusion, I propose that we cannot drop any of these 5 attributes.

As an example of why no variable can be skipped, consider ig hashtags and ig following. If we ranked only by ig following, it would be difficult to distinguish between observations 1, 2, 3 and, for example, 10, as they have similar following but differ hugely in the number of hashtags. This is because the following mostly indicates whether a brand is promoted by influential people, while the hashtags indicate whether the brand pursues other online promotion strategies.
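This is easy to verify in the data; a quick look at those rows (brand, Instagram followers and Instagram hashtags are columns 2 to 4):

data[c(1, 2, 3, 10), 2:4]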

I conclude that the authors have already reduced the unneeded variables.

Brand types

After evaluating the data and analyzing the attributes, I conclude that the clustering method has distinguished four main categories depending on the brand’s marketing strategy:

  • (huge success, green dots) influencer brands with massive online marketing (more than 20M hashtags) (ex. Anastasia Beverly Hills). Best one: Huda Beauty (ranked #1).

  • (big success, navy blue dots) brands with a wide customer group and traditional marketing (highly searched for on Google to shop) (ex. The Body Shop). Best one: Avon (ranked #4).

  • (big success, red dots) brands with one of the following: a focus on one sector of cosmetics highly praised online (ex. Too Faced), a couple of hot products highly praised online (ex. CeraVe), or an influential founder (ex. Kylie Cosmetics). Best one: Kylie Cosmetics (ranked #10).

  • (very few, purple dots) brands that care about communication with their customers. Best one: Florence by Mills (ranked #7).

The brands below the 25th position in the ranking are harder to interpret. We can say their marketing strategies are spread across the four above, but with smaller success in comparison.

autoplot(attribute_value.n.pca1, data=km3, colour="cluster", loadings=TRUE, loadings.colour=pal1[2], loadings.label=TRUE, loadings.label.size=3, loadings.label.colour=pal1[2]) + theme_void() + scale_color_manual(values=c(pal1[5], "purple", pal1[4], pal2[4], pal2[5]))

The best marketing strategy for a brand is to evaluate what its customers are like and push marketing in the corresponding direction. A brand has to decide clearly on its image and market accordingly.

Of the three directions described in this paper, the most successful are Instagram success and traditional marketing. Engagement with customers is a fresh and natural take on marketing strategy, but so far only one brand has had big success following it.

As can be seen in the industry, there are other directions not analysed in this dataset. For example, luxury brands such as Chanel and Dior are marketed as something a person may desire to obtain. Their goal is not to be as common as products pushed through Instagram success or traditional marketing tactics. They use similar marketing tools (Instagram promotion, posters, TV commercials) but with a different approach built on uniqueness, mystery and luxury.

Question about hidden variables

A hidden variable is an attribute that drives the results but is not included in the model, either because it cannot be observed or because it has simply been overlooked. Dimension reduction cannot be done properly without checking whether the earlier omission of variables was justified.

If we look closely at other reports from Cosmetify, we can infer that the researchers had already cut the data down to Instagram, dropping other online platforms. The influence on other platforms is much smaller compared to Instagram followers and scarcely changes the overall follower count. However, some brands have a higher combined following while having a lower IG following. This raises the question of whether this reduction influenced the hottest-brands ranking only slightly or was actually harmful to some companies. See: Influential beauty brands.

It also looks like some very impactful variables, such as eco-friendliness and diversity, were omitted from the ranking.

Moreover, as discussed in the data exploration, there seems to be additional hidden data influencing the dependence between Instagram engagement and the yearly search change; as mentioned, perhaps this hidden variable is the rapid growth of the company. I also considered whether people like the posts more when they have freshly discovered the company’s ig while searching, but the correlation between searches and the other Instagram variables is small and negative, which disproves this.
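That last claim can be read directly off the correlation matrix computed earlier:

cor_matrix["annual_searches", c("ig_follow", "ig_hash", "ig_engagement_rate")]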

Summary

This paper intended to take a closer look at the data from the Cosmetify report and find additional insights using unsupervised learning methods. Firstly, I found a grouping of brands based on their marketing strategy using K-means and PAM clustering. Secondly, I attempted to reduce the number of attributes used in the Cosmetify ranking with PCA. Both results were discussed and brought some interesting insights. The methods used were taught by dr hab. prof. ucz. Katarzyna Kopczewska from WNE UW as part of the classes on Unsupervised Learning.

The paper was very interesting to prepare. In future research I might try causal learning methods to build a model that correctly accounts for hidden variables. I would also be very interested in researching more data on the cosmetics industry.