Dimensionality reduction techniques as preparation for DBSCAN clustering
Introduction
Living in a world full of distractions has forced the human brain to work as a kind of filter that extracts only the important information and ignores the rest. While such an approach is prone to mistakes and delusions, it has turned out to be a very efficient way of processing the numerous signals registered by the human senses. We may use this biological solution as an inspiration for a method of analysing big (in terms of the number of variables) datasets. Perhaps it is possible to discard some of the variables included in the data and still obtain meaningful results in the end. What is more, perhaps it is achievable to limit the number of variables to only two or three. Thanks to such a reduction, we would be able to present the modified dataset in graphical form.
The purpose of this work is to present two popular methods of dimensionality reduction, i.e. principal component analysis (PCA) and multidimensional scaling (MDS). Their application should lead to a simpler dataset. We may then use such modified data as the input to the density-based spatial clustering of applications with noise (DBSCAN) algorithm.
library("corrplot")
library("ggplot2")
library("factoextra")
library("gridExtra")
library("fpc")Dataset
In order to present both the PCA and MDS algorithms in an effective way, we use a dataset devoted to dimensionality reduction purposes. It was probably artificially designed to be useful from the point of view of such methods. The dataset can be found under the following link. As its name suggests, the data focus on the somewhat trivial topic of pizza.
data_raw <- read.csv("Pizza.csv", sep = ",")
head(data_raw)
## brand id mois prot fat ash sodium carb cal
## 1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 A 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 A 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
tail(data_raw)
## brand id mois prot fat ash sodium carb cal
## 295 J 34043 44.07 10.96 18.39 2.56 0.66 24.02 3.05
## 296 J 34044 44.91 11.07 17.00 2.49 0.66 25.36 2.91
## 297 J 24069 43.15 11.79 18.46 2.43 0.67 24.17 3.10
## 298 J 34039 44.55 11.01 16.03 2.43 0.64 25.98 2.92
## 299 J 14044 47.60 10.43 15.18 2.32 0.56 24.47 2.76
## 300 J 14045 46.84 9.91 15.50 2.27 0.57 25.48 2.81
The dataset consists of exactly 300 observations described by 9 variables. Two of them are not used in the further analysis: id is just a number assigned to distinguish observations, and brand serves as a categorical variable grouping pizzas into 6 different types.
The remaining variables describe the amount of water (mois), protein (prot), fat (fat), ash (ash), sodium (sodium), carbohydrates (carb), and kilocalories (cal) per 100 g of a given product.
data <- data_raw[, 3:9] # Extracting data we will work on later
head(data)
## mois prot fat ash sodium carb cal
## 1 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 31.14 20.23 42.31 4.92 1.65 1.40 4.67
Analysis
Before we start the main part of our analysis, i.e. applying the dimension reduction algorithms, we may calculate some basic descriptive statistics in order to see what we should expect.
summary(data_raw)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337
## Max. :47.20 Max. :5.430 Max. :1.7900 Max. :48.640
## cal
## Min. :2.180
## 1st Qu.:2.910
## Median :3.215
## Mean :3.271
## 3rd Qu.:3.520
## Max. :5.080
correlation <- cor(data, method = "pearson")
corrplot(correlation)
Multidimensional scaling
We may start with the method based on comparing distances in two spaces that differ in the number of dimensions. The initial space is constructed using 7 variables; then this number is reduced and another space is built in which the number of dimensions (variables) is lower. The point of multidimensional scaling is to transform the data from one space to the other while preserving the initial distances between observations as well as possible.
The number of dimensions to be included in the final space is chosen by the user of the MDS algorithm. We are going to use only two dimensions, for two reasons. Firstly, it will be possible to visualize the results. The second reason will be discussed after applying PCA to the same dataset.
mds <- cmdscale(dist(data), k = 2)
summary(mds)
## V1 V2
## Min. :-25.3765 Min. :-28.8343
## 1st Qu.:-22.1295 1st Qu.: -4.3765
## Median : 0.3719 Median : 0.9267
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 21.0257 3rd Qu.: 5.9692
## Max. : 29.4927 Max. : 18.5784
mds_plot <- data.frame(mds)
ggplot(data = mds_plot, aes(x = X1, y = X2) ) + geom_point() + labs(x = "First dimension", y = "Second dimension")We may see that MDS prevented, in that case, serious distortions because the initial number of types has been preserved which is positive news when it comes to our plans of applying DBSCAN.
Principal component analysis
Now it is time to use another approach to dimensionality reduction, i.e. the PCA method. Its goal is to find directions (vectors) that span a space preserving the maximum variance. Compared with MDS, this algorithm is a linear approach to the problem of dimension reduction, which might be considered a slight disadvantage in some cases; on the other hand, it does not look for a solution iteratively, so PCA is not prone to getting stuck in a local minimum.
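To make the "directions of maximum variance" idea concrete: PCA can be obtained from an eigen-decomposition of the covariance matrix of the scaled data (i.e. the correlation matrix), where the eigenvectors give the directions and the eigenvalues the variance captured along them. The short sketch below only illustrates this equivalence and is not required for the analysis itself.
# Eigen-decomposition of the correlation matrix of the scaled data
eig <- eigen(cov(scale(data)))
eig$values # variances along the principal directions (in decreasing order)
eig$vectors[, 1] # first principal direction (equal to the prcomp loadings up to sign)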
data_scale <- scale(data) # Standardizing data (zero mean, unit variance)
tail(data_scale)
## mois prot fat ash sodium carb
## [295,] 0.3315124 -0.3751041 -0.2049469 -0.0576766 -0.025380865 0.06407383
## [296,] 0.4194430 -0.3580084 -0.3598102 -0.1128067 -0.025380865 0.13839555
## [297,] 0.2352074 -0.2461097 -0.1971480 -0.1600611 0.001620055 0.07239342
## [298,] 0.3817584 -0.3673333 -0.4678803 -0.1600611 -0.079382707 0.17278321
## [299,] 0.7010303 -0.4574739 -0.5625808 -0.2466941 -0.295390073 0.08903261
## [300,] 0.6214741 -0.5382896 -0.5269289 -0.2860728 -0.268389152 0.14505123
## cal
## [295,] -0.3564319
## [296,] -0.5822259
## [297,] -0.2757912
## [298,] -0.5660978
## [299,] -0.8241480
## [300,] -0.7435073
pca <- prcomp(data_scale)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.042 1.5134 0.64387 0.3085 0.16636 0.01837 0.003085
## Proportion of Variance 0.596 0.3272 0.05922 0.0136 0.00395 0.00005 0.000000
## Cumulative Proportion 0.596 0.9232 0.98240 0.9960 0.99995 1.00000 1.000000
data_plot <- data.frame(pca$x[, c(1,2)])
ggplot(data = data_plot, aes(x = PC1, y = PC2)) + geom_point() + labs(x = "First component", y = "Second component")
fviz_eig(pca, main = "Explained variance plot")
The above graph presents the percentage of variance that can be explained by constructing a space with a given number of eigenvectors (horizontal axis). It turns out that more than 92% of the variance observed in the dataset can be explained with only two components. This is, by the way, the second argument for why we decided to use only two dimensions during the MDS analysis.
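The proportions shown in the scree plot can also be computed directly from the standard deviations stored in the prcomp object, which makes the "more than 92% with two components" statement easy to verify:
# Cumulative share of variance explained by consecutive components
explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(explained)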
We may go one step further (or back) and compare which variables constitute principal components in the most significant way.
scores <- abs(pca$rotation[, 1])
names <- names(sort(scores, decreasing = TRUE))
pca$rotation[names, 1]
## ash fat sodium carb prot cal
## 0.47188953 0.44666592 0.43570289 -0.42491371 0.37876090 0.24448730
## mois
## 0.06470937
A few variables (or, to be precise, all of them but one) make an important contribution to the variability explained by the PCA method.
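It is worth remembering that the component scores are simply linear combinations of the scaled variables, with the loadings above acting as weights. A quick check of this (again, not part of the original analysis) could look as follows:
# Scores = scaled data multiplied by the loadings matrix
scores_manual <- data_scale %*% pca$rotation
all.equal(unname(scores_manual), unname(pca$x))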
We may now limit this analysis to only two components.
var <- get_pca_var(pca)
p1 <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt = 90, title = "First component")
p2 <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt = 90, title = "Second component")
grid.arrange(p1, p2, top = 'Contribution to the first two principal components')
While five variables contribute significantly to the first principal component, more than 70% of the second component stems from only two variables.
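The contribution percentages plotted above can also be recovered by hand: for a prcomp object the loadings of each component form a unit vector, so the squared loadings (multiplied by 100) give each variable's share of that component. Assuming factoextra follows this standard definition, the following one-liner should reproduce the first bar chart:
# Contribution (in %) of each variable to the first principal component
round(100 * pca$rotation[, 1]^2, 1)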
Density-based spatial clustering of applications with noise
The dataset has been modified by applying the dimension reduction algorithms and now seems suitable for clustering, so we have the opportunity to present what the DBSCAN algorithm is capable of. This approach to clustering works best when there are no more than two or three variables, and our data have been reduced to such a number of dimensions thanks to the PCA and MDS methods, so it is perfectly fine to use them for DBSCAN clustering. We will use the data obtained during the MDS analysis because they seem to preserve the initial distinction of pizza types better. Unfortunately, with real-life data we would not have such external knowledge, so it could be far more difficult to decide which algorithm distorted the input more profoundly.
data_clustering <- mds_plot
model <- dbscan(data_clustering, eps = 2.8, MinPts = 5)
fviz_cluster(model, data_clustering, geom = "point", xlab = "First component", ylab = "Second component")
Using the DBSCAN algorithm we obtained 6 clusters; additionally, a few observations were classified as noise. The clustering results are controlled here by two parameters. The first one, eps, defines the radius of the neighbourhood around a given point. If the number of points in that neighbourhood is greater than or equal to MinPts (the second parameter), the point is classified as a core point. All observations that can be reached through a path of core points are grouped into one cluster.
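The value eps = 2.8 was presumably chosen by inspecting the data. A common heuristic (not used in the original analysis) is to look at the distance from each point to its MinPts-th nearest neighbour and pick eps near the "elbow" of the sorted curve, as sketched below:
# Distance from every point to its 5th nearest neighbour (MinPts = 5)
dist_matrix <- as.matrix(dist(data_clustering))
knn_dist <- apply(dist_matrix, 1, function(row) sort(row)[6]) # sort(row)[1] is the point itself
plot(sort(knn_dist), type = "l", xlab = "Points sorted by distance", ylab = "Distance to 5th nearest neighbour")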
Summary
Two popular algorithms used to reduce the number of dimensions (variables) in a dataset have been presented. Both MDS and PCA proved to be useful in such applications, although in this case their results differed. As the next step, the DBSCAN clustering algorithm was applied in order to demonstrate its usefulness by grouping the data based on their initial characteristics, reduced by the aforementioned methods to only two variables.