The dataset was taken from Kaggle: https://www.kaggle.com/datasets/venessagreen/comparing-cosmetics-by-ingredients
Description: The original dataset contains 1472 observations of cosmetics products with 11 variables: Name, Brand, Label, Ingredients, Price, Rank, and five skin-type indicators (Combination, Dry, Normal, Oily, Sensitive). It covers different product types, such as moisturizers, cleansers, and masks. I wanted to work with a smaller number of observations (my computer cannot handle more), and it is more sensible to compare products within one type, so I chose sunscreens.
The purpose is to find similar products based on categorical (factor) and numeric features.
library(dplyr)
library(Matrix)
library(ggplot2)
library(readr)
library(psych)
library(smacof)
library(cluster)
library(plotly)
cosmetics = read_csv("cosmetics.csv")
I want to reduce the data and focus only on sun protection products. This type has the fewest products, so it is easier to visualize.
cosmetics$Label = as.factor(cosmetics$Label)
# barplot
ggplot(cosmetics) +
  aes(x = Label, fill = Label) +
  geom_bar() +
  theme(legend.position = "none")
I also leave Ingredients out of the analysis, because this variable contains too many distinct ingredients. For now my purpose is to investigate products by their rank, price, and the skin types they are suited for.
data = cosmetics %>% filter(Label=="Sun protect") %>% dplyr::select(-Label,-Ingredients)
There are no NAs or duplicates.
# Check for missing values
sum(is.na(data))
## [1] 0
# remove any duplicate rows
data = distinct(data)
Convert the categorical variables to factors.
data$Brand = as.factor(data$Brand)
data$Combination = as.factor(data$Combination)
data$Dry = as.factor(data$Dry)
data$Normal = as.factor(data$Normal)
data$Oily = as.factor(data$Oily)
data$Sensitive = as.factor(data$Sensitive)
There are only two numeric variables, Price and Rank, so I start by plotting them against each other.
ggplot(data, aes(x = Rank, y = Price)) +
  geom_point()
There are 6 outliers: three with Rank 0 or 1 and three with Price above 150, so it is better to remove them. I also assume that products with Rank = 0 have not been reviewed yet.
data = data %>% filter(Rank>1 & Price<150)
ggplot(data, aes(x = Rank, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm")
summary(data)
##              Brand         Name               Price             Rank
##  COOLA          :18   Length:164         Min.   : 18.00   Min.   :3.100
##  CLINIQUE       :16   Class :character   1st Qu.: 32.00   1st Qu.:3.800
##  SUPERGOOP!     :13   Mode  :character   Median : 38.00   Median :4.100
##  SHISEIDO       :10                      Mean   : 42.60   Mean   :4.121
##  LAURA MERCIER  : 7                      3rd Qu.: 48.25   3rd Qu.:4.400
##  MDSOLARSCIENCES: 6                      Max.   :100.00   Max.   :5.000
##  (Other)        :94
##  Combination  Dry     Normal  Oily    Sensitive
##  0:73         0:80    0:74    0:84    0:91
##  1:91         1:84    1:90    1:80    1:73
##
##
##
##
##
The variables Combination, Dry, Normal, Oily and Sensitive each contain an almost equal number of 0 and 1 values.
data %>% dplyr::select(Rank) %>% boxplot(main="Rank")
The distribution of Rank looks fairly normal, but there are problems with Price.
data %>% dplyr::select(Price) %>% boxplot(main="Price")
describe(data$Price)
##    vars   n mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 164 42.6 16.43     38   40.36 10.38  18 100    82 1.38     1.87 1.28
With a skew of 1.38 and a kurtosis of 1.87, the distribution of Price is right-skewed and has heavier tails than a normal distribution. One way to deal with such skewed data is to transform the variable so that it is closer to normal, for instance by taking the natural logarithm of Price.
data$Price_log <- log(data$Price)
data %>% dplyr::select(Price_log) %>% boxplot(main="Price_log")
Now the distribution looks closer to normal. I remove Price, as the new Price_log will be used in the MDS.
data1 = data %>% dplyr::select(-Price)
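The effect of the log transform on skewness can be illustrated on simulated data. This is only a sketch: the prices below are drawn from a log-normal distribution (an assumption, these are not the sunscreen data), and `skewness()` is a small helper defined here, not a function from psych.

```r
# Simulated, right-skewed "prices" (log-normal); NOT the sunscreen data.
set.seed(42)
price <- rlnorm(164, meanlog = log(38), sdlog = 0.4)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(price)       # expected to be clearly positive
skew_log <- skewness(log(price))  # expected to be much closer to 0

round(c(raw = skew_raw, log = skew_log), 2)
```

On real data the log transform will not remove skew completely, so it is worth comparing both values (or both boxplots) before deciding to use the transformed variable.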
Gower's distance (also called Gower's coefficient or similarity) is appropriate when a dataset has attributes of mixed types, as in my case: numeric and factor.
names = data$Name
# drop Name from data1 (not from data), so that Price stays removed
data1 = data1 %>% dplyr::select(-Name)
dissimilarity <- data1 %>% daisy(metric = "gower")
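To make the Gower dissimilarity concrete, here is a minimal sketch on a three-row toy data frame (hypothetical values, not rows from the dataset). It assumes the usual definition for data without missing values: a numeric column contributes the absolute difference divided by the column's range, a factor column contributes 0 for equal levels and 1 otherwise, and the result is the mean over all columns. `gower_pair()` is a helper written for this illustration.

```r
library(cluster)

# Hypothetical toy products (made-up values)
toy <- data.frame(
  Rank      = c(4.0, 4.5, 3.0),
  Price_log = log(c(20, 40, 80)),
  Oily      = factor(c(1, 1, 0))
)

# Gower dissimilarity between rows i and j, computed by hand
gower_pair <- function(df, i, j) {
  parts <- sapply(df, function(col) {
    if (is.numeric(col)) {
      abs(col[i] - col[j]) / diff(range(col))  # range-scaled difference
    } else {
      as.numeric(col[i] != col[j])             # simple mismatch for factors
    }
  })
  mean(parts)
}

d <- as.matrix(daisy(toy, metric = "gower"))

manual_12 <- gower_pair(toy, 1, 2)
manual_13 <- gower_pair(toy, 1, 3)
c(manual = manual_12, daisy = d[1, 2])
```

For rows 1 and 2 the three contributions are 1/3 (Rank), 1/2 (Price_log) and 0 (Oily), so the dissimilarity is their mean, about 0.278. Every column gets the same weight, which means a factor with many levels, such as Brand, counts as much as Price_log.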
# perform MDS on the dissimilarity matrix
# use a separate name so the result does not mask the mds() function
mds_ratio = mds(dissimilarity)
mds_ratio
##
## Call:
## mds(delta = dissimilarity)
##
## Model: Symmetric SMACOF
## Number of objects: 164
## Stress-1 value: 0.123
## Number of iterations: 69
mds1 = mds(dissimilarity, type = 'ordinal')
mds1
##
## Call:
## mds(delta = dissimilarity, type = "ordinal")
##
## Model: Symmetric SMACOF
## Number of objects: 164
## Stress-1 value: 0.078
## Number of iterations: 42
The Stress-1 values of both models are acceptable (0.123 for the default ratio model and 0.078 for the ordinal one), but the ordinal model fits better, so I use it from here on.
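Beyond the single Stress-1 number, smacof offers a few diagnostics for comparing fits. The sketch below uses `kinshipdelta`, a small dissimilarity matrix that ships with smacof, as a stand-in (an assumption made to keep the example self-contained; it is not the sunscreen data). Ordinal MDS only has to preserve the rank order of the dissimilarities, so its stress is usually lower than that of the ratio model.

```r
library(smacof)

# Example dissimilarities shipped with smacof (not the sunscreen data)
data("kinshipdelta", package = "smacof")

fit_ratio   <- mds(kinshipdelta, type = "ratio")
fit_ordinal <- mds(kinshipdelta, type = "ordinal")

# Stress-1 of both models side by side
c(ratio = fit_ratio$stress, ordinal = fit_ordinal$stress)

# Shepard diagram: observed dissimilarities against fitted distances
plot(fit_ordinal, plot.type = "Shepard")

# Stress per point: objects with the largest contribution to the misfit
head(sort(fit_ordinal$spp, decreasing = TRUE))
```

The same calls can be applied to the fitted objects in this analysis to see which products are represented worst in the two-dimensional configuration.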
plot(mds1)
This default plot is not informative, so I build an interactive one instead.
mds_df = data.frame(mds1$conf)
mds_df$Name = data$Name
mds_df$Rank=data$Rank
mds_df$Price=data$Price
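p_mds <- plot_ly(data = mds_df,
                 x = ~D1, y = ~D2,
                 # hover text; <br> starts a new line
                 text = ~paste('Product: ', Name,
                               '<br> Rating: ', Rank,
                               '<br> Price: ', Price),
                 type = "scatter",
                 mode = "markers",  # set explicitly so plotly does not have to guess
                 color = ~Price,
                 size = ~Rank)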
# add a title and axis labels
p_mds <- layout(p_mds, title = "Interactive Scatter Plot", xaxis = list(title = "Dimension 1"), yaxis = list(title = "Dimension 2"))
p_mds
The size of a dot indicates the rating (the bigger the dot, the higher the product's rating), and the color indicates Price.
#htmlwidgets::saveWidget(widget = p_mds, file = "p_mds.html", selfcontained = TRUE)
I am satisfied with the MDS result:
Products that are close to each other are indeed similar.
Dimension 2 appears to reflect Price: cheaper products lie below 0 and more expensive ones above.
There appear to be 4 clusters on the plot, which I assume are based on skin type.
From the previous point I assume that Dimension 1 reflects skin type.
Using MDS to find similar products is useful for choosing an appropriate product for yourself, especially if you need to take price into account, which I think is crucial for most people.
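The interpretation of the dimensions above is based on looking at the plot. One way to check such a reading numerically is to relate each dimension back to the original features. The sketch below does this on simulated toy products (hypothetical data, not the sunscreens), so the printed numbers say nothing about the real configuration; only the approach carries over.

```r
library(cluster)
library(smacof)

# Hypothetical toy products (simulated, NOT the sunscreen data)
set.seed(1)
n <- 60
toy <- data.frame(
  Rank      = round(runif(n, 3, 5), 1),
  Price_log = log(runif(n, 18, 100)),
  Oily      = factor(rbinom(n, 1, 0.5)),
  Dry       = factor(rbinom(n, 1, 0.5))
)

# Same pipeline as in the analysis: Gower dissimilarity, then ordinal MDS
fit  <- mds(daisy(toy, metric = "gower"), type = "ordinal")
conf <- as.data.frame(fit$conf)  # columns D1 and D2

# Rank correlation of each dimension with the numeric features
dim_cor <- cor(conf, toy[, c("Rank", "Price_log")], method = "spearman")
round(dim_cor, 2)

# Mean position on Dimension 1 for each level of a skin-type factor
aggregate(conf$D1, by = list(Oily = toy$Oily), FUN = mean)
```

Applied to the real data, a strong correlation between D2 and Price_log would support the reading of Dimension 2 as price, and clearly separated group means on D1 would support the skin-type reading.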