library(readxl)
library(tidyverse)
library(plotly)
library(dbscan)
library(fpc)
library(cluster)
library(factoextra)
library(NbClust)

Data Preparation

Data Loading

The data come from an Excel file with several columns that are not needed for this analysis, so we'll drop them on import:

Angles_Mav_ <- read_excel("data/Compared_Angles_Mav_2018-02-21.xlsx",
                          col_types = c("skip", "skip", "skip", 
                                        "skip", "skip", "numeric", "numeric", 
                                        "numeric", "skip", "skip", "skip", 
                                        "skip", "skip", "skip", "skip"))
Angles_Mav_ <- na.exclude(Angles_Mav_)  # drop rows with missing values
head(Angles_Mav_, 5)

Conversion from Cylindrical (Polar) Coordinates to Cartesian Coordinates

In order to plot the values we first need to convert them to Cartesian (x, y, z) points:

n <- nrow(Angles_Mav_)
x <- vector(mode = "numeric", length = n)
y <- vector(mode = "numeric", length = n)
z <- vector(mode = "numeric", length = n)
for (i in 1:n) {
  radius <- as.numeric(Angles_Mav_$`Distance (Rho)`[i]) # distance rho (radius)
  theta  <- as.numeric(Angles_Mav_$`Angles (Theta)`[i]) # angle theta (cos/sin expect radians; use theta * pi / 180 if the file stores degrees)
  x[i] <- radius * cos(theta)
  y[i] <- radius * sin(theta)
  z[i] <- Angles_Mav_$`Altitude (z)`[i]                 # altitude is already Cartesian
}

# Assemble the coordinates into the data frame used in the clustering steps below
# (assumed layout: a point index in column 1, so the coordinates sit in columns 2:4)
Angles_Mav_xyz <- tibble(point = 1:n, x = x, y = y, z = z)
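The text above promises a plot of the points; a minimal interactive 3D scatter sketch with plotly (already loaded), based on the Angles_Mav_xyz data frame assembled here, not part of the original output:

# 3D scatter of the converted points (sketch)
plot_ly(Angles_Mav_xyz, x = ~x, y = ~y, z = ~z,
        type = "scatter3d", mode = "markers",
        marker = list(size = 3))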

Clustering

Defining the optimal number of clusters

# Elbow method: total within sum of squares as a function of k,
# computed on the scaled coordinate columns
fviz_nbclust(scale(Angles_Mav_xyz[, 2:4]), kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")
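The dashed line marks the bend in the curve around k = 4, which is the number of clusters used below. As a cross-check (not part of the original analysis), the NbClust package loaded at the top can aggregate around 30 validity indices and report the cluster count preferred by the majority of them:

# Hypothetical cross-check: NbClust prints the majority-rule best number of clusters
set.seed(123)
nb <- NbClust(scale(Angles_Mav_xyz[, 2:4]), distance = "euclidean",
              min.nc = 2, max.nc = 8, method = "kmeans")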

1 - Basic K-means

We'll begin with a simple approach, considering 4 clusters and unscaled data.

k_vect <- Angles_Mav_xyz %>%
  select(y, x, z) %>%
  kmeans(., 4, nstart = 25)
## K-means clustering with 4 clusters of sizes: 
##  2 45 36 24 
##  
## Within cluster sum of squares by cluster: 
##  145298.9 717819.3 1158916 846057.7 
##  
## Total within-cluster sum of squares: 
##  2868092

2 - Scaled K-means

Generally, variables are scaled to have a) mean zero and b) standard deviation one. Standardizing the data before clustering is a widely used preprocessing step.

scale_angles <- scale(Angles_Mav_xyz[, 2:4])
set.seed(123)
km_ang <- kmeans(scale_angles, 4, nstart = 25)
## K-means clustering with 4 clusters of sizes: 
##  38 22 25 22 
##  
## Within cluster sum of squares by cluster: 
##  12.11293 25.7764 41.90451 27.90894 
##  
## Total within-cluster sum of squares: 
##  107.7028
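A quick visual check of this solution (a sketch, not shown in the original): factoextra can project the scaled points onto the first two principal components and draw the four clusters:

# 2D projection of the scaled 4-cluster solution
fviz_cluster(km_ang, data = scale_angles,
             geom = "point", ellipse.type = "convex",
             main = "Scaled K-means (k = 4)")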

The total within-cluster sum of squares decreases dramatically when the data are scaled.
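This can be read directly off the two fitted objects: tot.withinss is the component of a kmeans result holding the total within-cluster sum of squares (keep in mind the two numbers are on different scales, since standardizing changes the units):

# total within-cluster sum of squares, unscaled vs. scaled fit
c(unscaled = k_vect$tot.withinss, scaled = km_ang$tot.withinss)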

3 - DBSCAN - Density-Based Spatial Clustering of Applications with Noise

Density-based clustering builds clusters from regions of high point density: a cluster is a region that is denser than the remainder of the dataset. The algorithm is controlled by two parameters: a) eps (epsilon): the radius of the neighborhood around a data point p; b) minPts: the minimum number of data points a neighborhood must contain to be considered dense.

#DBSCAN input must be a matrix.
angles_matrix <- as.matrix(Angles_Mav_xyz[,2:4])
angles_matrix_sca <- scale(angles_matrix)

# this plot helps us choose the value of epsilon
kNNdistplot(angles_matrix_sca, k = 25) # the "knee" of the curve suggests an eps between 1 and 2
abline(h = 1.5, col = "red")

The plot suggests that the epsilon distance (eps) should be between 1 and 2.

set.seed(1234)
db <-  dbscan::dbscan(angles_matrix_sca, eps = 1.6, minPts = 10)
print(db)
## DBSCAN clustering for 107 objects.
## Parameters: eps = 1.6, minPts = 10
## The clustering contains 2 cluster(s) and 4 noise points.
## 
##  0  1  2 
##  4 38 65 
## 
## Available fields: cluster, eps, minPts

The density-based method suggests that the number of clusters should be 2. Cluster "0" contains all the values labelled as "noise".

Clusters visualised through Principal Components

hullplot(angles_matrix_sca, db$cluster)

## Total within-cluster sum of squares: 
##  206.1225
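hullplot() itself does not compute the value reported above, and the exact code used is not shown; a comparable total within-cluster sum of squares could be derived from the DBSCAN labels along these lines (a sketch; noise points, labelled 0, are excluded):

# Hypothetical helper: sum over clusters of squared distances to the cluster centroid
tot_withinss <- function(data, labels) {
  sum(sapply(setdiff(unique(labels), 0), function(cl) {
    pts <- data[labels == cl, , drop = FALSE]
    sum(scale(pts, scale = FALSE)^2)  # squared deviations from the column means
  }))
}
tot_withinss(angles_matrix_sca, db$cluster)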

4 - K-Medoids Clustering (PAM - Partitioning Around Medoids)

In k-medoids clustering, each cluster is represented by one of the data points in the cluster; these points are called cluster medoids. K-medoids is a robust alternative to k-means: the algorithm is less sensitive to noise and outliers because it uses medoids as cluster centers instead of means.

pam.res <- pam(angles_matrix_sca, 3) 
## Total within-cluster sum of squares: 
##  153.5372
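Again, the code behind the reported value is not shown; the same hypothetical helper defined above could be applied to the PAM labels (PAM assigns every point to a cluster, so nothing is excluded), and the medoid coordinates are available in the fitted object:

tot_withinss(angles_matrix_sca, pam.res$clustering)
pam.res$medoids  # coordinates of the three medoid points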

Summary

Summary Table

Method                       Clusters      Total within sum of squares
Basic K-means (unscaled)     4             2868092
Scaled K-means               4             107.7028
DBSCAN (scaled, eps = 1.6)   2 (+ noise)   206.1225
PAM / K-medoids (scaled)     3             153.5372

Conclusion

Based on the total within sum of squares metric, the best approach appears to be scaled K-means, clustering the data into 4 groups. Despite this numerical conclusion, inspecting the corresponding plots gives a better understanding of the situation and helps to confirm (or question) the accuracy of the chosen approach.