library(readxl)      # read the Excel source file
library(tidyverse)   # data wrangling and plotting
library(plotly)      # interactive 3-D plots
library(dbscan)      # density-based clustering (DBSCAN)
library(fpc)         # clustering statistics
library(cluster)     # pam() for k-medoids
library(factoextra)  # clustering visualisation helpers
library(NbClust)     # choosing the number of clusters
The data come from an Excel file with several columns that are useless for this analysis, so we’ll drop them:
Angles_Mav_ <- read_excel("data/Compared_Angles_Mav_2018-02-21.xlsx",
col_types = c("skip", "skip", "skip",
"skip", "skip", "numeric", "numeric",
"numeric", "skip", "skip", "skip",
"skip", "skip", "skip", "skip"))
# Drop rows containing missing values
Angles_Mav_ <- na.exclude(Angles_Mav_)
head(Angles_Mav_, 5)
To plot the values we first need to convert them from cylindrical coordinates (rho, theta, z) to Cartesian x, y, z points, which we’ll then plot:
x <- vector(mode = "numeric", length = 0)
y <- vector(mode = "numeric", length = 0)
z <- vector(mode = "numeric", length = 0)
for (i in 1:nrow(Angles_Mav_)) {
  radius <- as.numeric(Angles_Mav_$`Distance (Rho)`[i])  # distance rho (radius)
  theta <- as.numeric(Angles_Mav_$`Angles (Theta)`[i])   # angle theta (cos/sin expect radians)
  altitude <- Angles_Mav_$`Altitude (z)`[i]              # altitude
  x[i] <- radius * cos(theta)
  y[i] <- radius * sin(theta)
  z[i] <- altitude
}
# Assemble the points into a data frame; an id column comes first so that
# x, y, z occupy columns 2:4, as the indexing below assumes
Angles_Mav_xyz <- data.frame(id = seq_along(x), x = x, y = y, z = z)
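The plotting call itself isn’t shown above; as a sketch only, a 3-D scatter of these points with plotly (already loaded) might look like this:
# Interactive 3-D scatter of the converted points (illustrative, not the original call)
plot_ly(Angles_Mav_xyz, x = ~x, y = ~y, z = ~z,
        type = "scatter3d", mode = "markers",
        marker = list(size = 3))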
To pick the number of clusters, the elbow method plots the total within-cluster sum of squares (WSS) against k; the bend around k = 4 suggests four clusters.
# Elbow method on the scaled coordinates
fviz_nbclust(scale(Angles_Mav_xyz[, 2:4]), kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")
We’ll begin with a simple approach, using 4 clusters and unscaled data:
# k-means on the raw (unscaled) coordinates
k_vect <- Angles_Mav_xyz %>%
  select(y, x, z) %>%
  kmeans(., 4, nstart = 25)
## K-means clustering with 4 clusters of sizes:
## 2 45 36 24
##
## Within cluster sum of squares by cluster:
## 145298.9 717819.3 1158916 846057.7
##
## Within sum of squares:
## 2868092
Variables are generally scaled to have (a) a standard deviation of one and (b) a mean of zero. Standardizing the data in this way is a widely used step before clustering.
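As a toy illustration (not part of the original analysis), scale() centers each column to mean zero and rescales it to unit standard deviation, i.e. (x - mean(x)) / sd(x):
# Toy matrix: after scale(), every column has mean ~0 and sd 1
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
s <- scale(m)
round(colMeans(s), 10)  # ~0 for each column
apply(s, 2, sd)         # 1 for each column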
scale_angles <- scale(Angles_Mav_xyz[, 2:4])
set.seed(123)
km_ang <- kmeans(scale_angles, 4, nstart = 25)
## K-means clustering with 4 clusters of sizes:
## 38 22 25 22
##
## Within cluster sum of squares by cluster:
## 12.11293 25.7764 41.90451 27.90894
##
## Within sum of squares:
## 107.7028
The within sum of squares drops dramatically when the data are scaled, though note that the raw and scaled values are not directly comparable, since scaling changes the units.
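For reference, both totals can be read directly off the fitted kmeans objects:
k_vect$tot.withinss  # unscaled fit
km_ang$tot.withinss  # scaled fit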
Density-based clustering constructs clusters based on a density measure: clusters are regions of higher density than the remainder of the dataset. The algorithm takes two parameters: (a) epsilon, the radius of the neighborhood around a data point p, and (b) minPts, the minimum number of data points required within that neighborhood to form a cluster.
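As a toy sketch of that neighborhood test (random points, not the flight data), a point is a core point when at least minPts points, itself included, lie within distance epsilon:
# Count neighbors within eps of every point and flag core points
set.seed(1)
pts <- matrix(rnorm(40), ncol = 2)
eps <- 0.8; minPts <- 4
d <- as.matrix(dist(pts))            # pairwise distances
core <- rowSums(d <= eps) >= minPts  # counts include the point itself
table(core)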
# DBSCAN input must be a matrix
angles_matrix <- as.matrix(Angles_Mav_xyz[, 2:4])
angles_matrix_sca <- scale(angles_matrix)
# This plot helps us choose the value of epsilon
kNNdistplot(angles_matrix_sca, k = 25)  # the knee suggests eps between 1 and 2
abline(h = 1.5, col = "red")
The plot suggests that the epsilon distance should be between 1 and 2.
set.seed(1234)
db <- dbscan::dbscan(angles_matrix_sca, eps = 1.6, minPts = 10)
print(db)
## DBSCAN clustering for 107 objects.
## Parameters: eps = 1.6, minPts = 10
## The clustering contains 2 cluster(s) and 4 noise points.
##
## 0 1 2
## 4 38 65
##
## Available fields: cluster, eps, minPts
The density-based method suggests that the number of clusters should be 2. Cluster “0” contains all the points labeled as noise.
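As a quick check (not shown in the original output), the noise points can be pulled out by their label:
# Label 0 marks noise; inspect those rows
noise_idx <- which(db$cluster == 0)
angles_matrix_sca[noise_idx, ]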
Clusters visualized through principal components:
hullplot(angles_matrix_sca, db$cluster)  # convex hulls drawn around each cluster
## Within sum of squares:
## 206.1225
In k-medoids clustering, each cluster is represented by one of the data points in the cluster, called the cluster medoid. K-medoids is a robust alternative to k-means: the algorithm is less sensitive to noise and outliers because it uses medoids as cluster centers instead of the means used in k-means.
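The choice of 3 clusters below is one option; a common way to pick k for PAM, sketched here as an aside, is the average silhouette width via fviz_nbclust (already loaded from factoextra):
# Average silhouette width for a range of k; the peak is a candidate k for pam()
fviz_nbclust(angles_matrix_sca, pam, method = "silhouette")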
pam.res <- pam(angles_matrix_sca, 3)  # k-medoids (PAM) with 3 clusters
## Within sum of squares:
## 153.5372
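The within-sum-of-squares figures quoted for DBSCAN and PAM come from code not shown above; one way to reproduce them, sketched here with a helper of our own naming (wss_for), is to sum squared distances of points to their cluster centroids:
# Total within sum of squares for an arbitrary cluster assignment
wss_for <- function(data, clusters) {
  sum(sapply(unique(clusters), function(cl) {
    pts <- data[clusters == cl, , drop = FALSE]
    sum(rowSums(sweep(pts, 2, colMeans(pts))^2))
  }))
}
wss_for(angles_matrix_sca, db$cluster)          # DBSCAN labels (noise as its own group)
wss_for(angles_matrix_sca, pam.res$clustering)  # PAM labels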
Based on the within sum of squares metric, we can assume that the best approach is k-means on scaled data with 4 clusters. Despite this numerical conclusion, inspecting the corresponding plots may give a better understanding of the situation and either confirm or refute the accuracy of the approach.