Dimensionality reduction techniques as preparation for DBSCAN clustering
Introduction
Living in a world full of distractions has forced the human brain to work as a kind of filter that extracts only the important information and ignores the rest. While such an approach is prone to mistakes and delusions, it has turned out to be a very efficient way of processing the numerous signals registered by the human senses. We may use this biological solution as an inspiration for a method of analysing big (in terms of the number of variables) datasets. Perhaps it is possible to discard some of the variables included in the data and still obtain meaningful results in the end. What is more, perhaps it is achievable to limit the number of variables to only two or three. Thanks to such a reduction, we would be able to present the modified dataset in graphical form.
The purpose of this work is to present two popular methods of dimensionality reduction, i.e. principal component analysis (PCA) and multidimensional scaling (MDS). Their application should lead to a simpler dataset. We may then use such modified data as the input to the density-based spatial clustering of applications with noise (DBSCAN) algorithm.
library("corrplot")
library("ggplot2")
library("factoextra")
library("gridExtra")
library("fpc")Dataset
In order to present both the PCA and MDS algorithms in an effective way, we use a dataset devoted to dimensionality reduction purposes. It was probably artificially designed to be useful from the point of view of such methods. The dataset can be found under the following link. As its name suggests, the data focus on the somewhat trivial topic of pizza.
data_raw <- read.csv("Pizza.csv", sep = ",")
head(data_raw)
## brand id mois prot fat ash sodium carb cal
## 1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 A 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 A 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
tail(data_raw)
## brand id mois prot fat ash sodium carb cal
## 295 J 34043 44.07 10.96 18.39 2.56 0.66 24.02 3.05
## 296 J 34044 44.91 11.07 17.00 2.49 0.66 25.36 2.91
## 297 J 24069 43.15 11.79 18.46 2.43 0.67 24.17 3.10
## 298 J 34039 44.55 11.01 16.03 2.43 0.64 25.98 2.92
## 299 J 14044 47.60 10.43 15.18 2.32 0.56 24.47 2.76
## 300 J 14045 46.84 9.91 15.50 2.27 0.57 25.48 2.81
The dataset consists of exactly 300 observations described by 9 variables. Two of them are not used in the further analysis: id is just a number assigned to distinguish observations, and brand serves as a categorical variable grouping pizzas into 6 different types.
The remaining variables describe the amount of water (mois), protein (prot), fat (fat), ash (ash), sodium (sodium), carbohydrates (carb), and kilocalories (cal) per 100 g of a given product.
data <- data_raw[, 3:9] # Extracting data we will work on later
head(data)
## mois prot fat ash sodium carb cal
## 1 27.82 21.43 44.87 5.11 1.77 0.77 4.93
## 2 28.49 21.26 43.89 5.34 1.79 1.02 4.84
## 3 28.35 19.99 45.78 5.08 1.63 0.80 4.95
## 4 30.55 20.15 43.13 4.79 1.61 1.38 4.74
## 5 30.49 21.28 41.65 4.82 1.64 1.76 4.67
## 6 31.14 20.23 42.31 4.92 1.65 1.40 4.67
Analysis
Before we start the main part of our analysis, i.e. applying the dimension reduction algorithms, we may calculate some basic descriptive statistics in order to see what we should expect.
summary(data_raw)
## brand id mois prot
## Length:300 Min. :14003 Min. :25.00 Min. : 6.98
## Class :character 1st Qu.:14094 1st Qu.:30.90 1st Qu.: 8.06
## Mode :character Median :24021 Median :43.30 Median :10.44
## Mean :20841 Mean :40.90 Mean :13.37
## 3rd Qu.:24110 3rd Qu.:49.12 3rd Qu.:20.02
## Max. :34045 Max. :57.22 Max. :28.48
## fat ash sodium carb
## Min. : 4.38 Min. :1.170 Min. :0.2500 Min. : 0.510
## 1st Qu.:14.77 1st Qu.:1.450 1st Qu.:0.4500 1st Qu.: 3.467
## Median :17.14 Median :2.225 Median :0.4900 Median :23.245
## Mean :20.23 Mean :2.633 Mean :0.6694 Mean :22.865
## 3rd Qu.:21.43 3rd Qu.:3.592 3rd Qu.:0.7025 3rd Qu.:41.337
## Max. :47.20 Max. :5.430 Max. :1.7900 Max. :48.640
## cal
## Min. :2.180
## 1st Qu.:2.910
## Median :3.215
## Mean :3.271
## 3rd Qu.:3.520
## Max. :5.080
correlation <- cor(data, method = "pearson")
corrplot(correlation)
Multidimensional scaling
We may start with the method based on comparing distances in two spaces that differ in the number of dimensions. The initial space is constructed using 7 variables; then this number is reduced and another space is built in which the number of dimensions (variables) is lower. The point of multidimensional scaling is to transform the data from one space to the other while preserving the initial distances between observations as well as possible.
The number of dimensions to be included in the final space is chosen by the user of the MDS algorithm. We are going to use only two dimensions, for two reasons. Firstly, it will be possible to visualize the results. The second reason will be discussed after applying PCA to the same dataset.
mds <- cmdscale(dist(data), k = 2)
summary(mds)
## V1 V2
## Min. :-25.3765 Min. :-28.8343
## 1st Qu.:-22.1295 1st Qu.: -4.3765
## Median : 0.3719 Median : 0.9267
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 21.0257 3rd Qu.: 5.9692
## Max. : 29.4927 Max. : 18.5784
mds_plot <- data.frame(mds)
ggplot(data = mds_plot, aes(x = X1, y = X2) ) + geom_point() + labs(x = "First dimension", y = "Second dimension")We may see that MDS prevented, in that case, serious distortions because the initial number of types has been preserved which is positive news when it comes to our plans of applying DBSCAN.
Principal component analysis
Now it is time to use another approach to dimensionality reduction, i.e. the PCA method. Its goal is to find directions (vectors) that span a space preserving the maximum variance. Compared with MDS, this algorithm is a linear approach to the problem of dimension reduction, which might be considered a slight disadvantage in some cases; on the other hand, it does not look for a solution iteratively, so PCA is not prone to getting stuck in a local minimum.
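To make the "directions of maximum variance" idea concrete: PCA can be obtained from an eigen-decomposition of the covariance matrix of the scaled data (i.e. the correlation matrix), where the eigenvectors give the directions and the eigenvalues the variance captured along them. The short sketch below only illustrates this equivalence and is not required for the analysis itself.
# Eigen-decomposition of the correlation matrix of the scaled data
eig <- eigen(cov(scale(data)))
eig$values # variances along the principal directions (in decreasing order)
eig$vectors[, 1] # first principal direction (equal to the prcomp loadings up to sign)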
data_scale <- scale(data) # Standardizing data (zero mean, unit variance)
tail(data_scale)
## mois prot fat ash sodium carb
## [295,] 0.3315124 -0.3751041 -0.2049469 -0.0576766 -0.025380865 0.06407383
## [296,] 0.4194430 -0.3580084 -0.3598102 -0.1128067 -0.025380865 0.13839555
## [297,] 0.2352074 -0.2461097 -0.1971480 -0.1600611 0.001620055 0.07239342
## [298,] 0.3817584 -0.3673333 -0.4678803 -0.1600611 -0.079382707 0.17278321
## [299,] 0.7010303 -0.4574739 -0.5625808 -0.2466941 -0.295390073 0.08903261
## [300,] 0.6214741 -0.5382896 -0.5269289 -0.2860728 -0.268389152 0.14505123
## cal
## [295,] -0.3564319
## [296,] -0.5822259
## [297,] -0.2757912
## [298,] -0.5660978
## [299,] -0.8241480
## [300,] -0.7435073
pca <- prcomp(data_scale)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.042 1.5134 0.64387 0.3085 0.16636 0.01837 0.003085
## Proportion of Variance 0.596 0.3272 0.05922 0.0136 0.00395 0.00005 0.000000
## Cumulative Proportion 0.596 0.9232 0.98240 0.9960 0.99995 1.00000 1.000000
data_plot <- data.frame(pca$x[, c(1,2)])
ggplot(data = data_plot, aes(x = PC1, y = PC2)) + geom_point() + labs(x = "First component", y = "Second component")
fviz_eig(pca, main = "Explained variance plot")
The above graph presents the percentage of variance that can be explained by constructing a space with a given number of eigenvectors (horizontal axis). It turns out that more than 92% of the variance observed in the dataset can be explained with only two components. This is, by the way, the second argument for why we decided to use only two dimensions during the MDS analysis.
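The proportions shown in the scree plot can also be computed directly from the standard deviations stored in the prcomp object, which makes the "more than 92% with two components" statement easy to verify:
# Cumulative share of variance explained by consecutive components
explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(explained)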
We may go one step further (or back) and compare which variables constitute principal components in the most significant way.
scores <- abs(pca$rotation[, 1])
names <- names(sort(scores, decreasing = TRUE))
pca$rotation[names, 1]
## ash fat sodium carb prot cal
## 0.47188953 0.44666592 0.43570289 -0.42491371 0.37876090 0.24448730
## mois
## 0.06470937
A few variables (or, to be precise, all of them but one) make an important contribution to the variability explained by the PCA method.
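It is worth remembering that the component scores are simply linear combinations of the scaled variables, with the loadings above acting as weights. A quick check of this (again, not part of the original analysis) could look as follows:
# Scores = scaled data multiplied by the loadings matrix
scores_manual <- data_scale %*% pca$rotation
all.equal(unname(scores_manual), unname(pca$x))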
We may now limit this analysis to only two components.
var <- get_pca_var(pca)
p1 <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt = 90, title = "First component")
p2 <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt = 90, title = "Second component")
grid.arrange(p1, p2, top = 'Contribution to the first two principal components')
While five variables contribute significantly to the first principal component, more than 70% of the second component stems from only two variables.
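The contribution percentages plotted above can also be recovered by hand: for a prcomp object the loadings of each component form a unit vector, so the squared loadings (multiplied by 100) give each variable's share of that component. Assuming factoextra follows this standard definition, the following one-liner should reproduce the first bar chart:
# Contribution (in %) of each variable to the first principal component
round(100 * pca$rotation[, 1]^2, 1)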
Density-based spatial clustering of applications with noise
The dataset has been modified by applying the dimension reduction algorithms and now seems suitable for clustering, so we have the opportunity to present what the DBSCAN algorithm is capable of. This approach to clustering works best when there are no more than two or three variables, and our data have been reduced to such a number of dimensions thanks to the PCA and MDS methods, so it is perfectly fine to use them for DBSCAN clustering. We will use the data obtained during the MDS analysis because they seem to preserve the initial distinction of pizza types better. Unfortunately, with real-life data we would not have such external knowledge, so it could be far more difficult to decide which algorithm distorted the input more profoundly.
data_clustering <- mds_plot
model <- dbscan(data_clustering, eps = 2.8, MinPts = 5)
fviz_cluster(model, data_clustering, geom = "point", xlab = "First component", ylab = "Second component")
Using the DBSCAN algorithm we obtained 6 clusters; additionally, a few observations were classified as noise. The clustering results are controlled here by two parameters. The first one, eps, defines the radius of the neighbourhood around a given point. If the number of points in that neighbourhood is greater than or equal to MinPts (the second parameter), the point is classified as a core point. All observations that can be reached through a path of core points are grouped into one cluster.
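The value eps = 2.8 was presumably chosen by inspecting the data. A common heuristic (not used in the original analysis) is to look at the distance from each point to its MinPts-th nearest neighbour and pick eps near the "elbow" of the sorted curve, as sketched below:
# Distance from every point to its 5th nearest neighbour (MinPts = 5)
dist_matrix <- as.matrix(dist(data_clustering))
knn_dist <- apply(dist_matrix, 1, function(row) sort(row)[6]) # sort(row)[1] is the point itself
plot(sort(knn_dist), type = "l", xlab = "Points sorted by distance", ylab = "Distance to 5th nearest neighbour")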
Summary
Two popular algorithms used to reduce the number of dimensions (variables) in a dataset have been presented. Both MDS and PCA proved to be useful in such applications, although in this case their results differed. As the next step, the DBSCAN clustering algorithm was applied in order to demonstrate its usefulness by grouping the data based on their initial characteristics, reduced by the aforementioned methods to only two variables.