We’ll work with the data set cars.

d <- cars
summary(d)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Since the variables aren't on the same scale, we'll standardize* them.

*Z = (x - mean) / sd

d$speed <- as.numeric(scale(d$speed)) # scale() returns a one-column matrix, so keep plain numeric columns
d$dist <- as.numeric(scale(d$dist))
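
As a quick sanity check, scale() with its defaults computes exactly the z-score from the footnote above (a minimal sketch; z_manual is my own throwaway name):

z_manual <- (cars$speed - mean(cars$speed)) / sd(cars$speed)
all.equal(d$speed, z_manual)
## [1] TRUE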

We determine the number of clusters we need using the elbow method.

The "elbow" is the point where adding more clusters stops producing a sharp drop in the within-cluster sum of squares (WCSS); that point marks a reasonable number of clusters for this dataset.

library("NbClust")
library("factoextra")

fviz_nbclust(d, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) + # mark the chosen k
  labs(subtitle = "Elbow method")
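
For intuition, here is a minimal base-R sketch of what fviz_nbclust() computes under the hood; the nstart = 25 and the 1:10 range are my own choices:

set.seed(123) # k-means uses random starts
wcss <- sapply(1:10, function(k) kmeans(d, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wcss, type = "b", xlab = "Number of clusters k", ylab = "WCSS")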

In this case we'll choose k = 3.

kmedias <- kmeans(d, centers = 3, iter.max = 10)
d$cluster <- kmedias$cluster
centroides <- kmedias$centers
d$cluster <- as.factor(d$cluster) # without this conversion ggplot treats cluster as continuous and scale_color_manual() fails
library("ggplot2")
ggplot(d, aes(speed, dist, color = cluster)) +
  geom_point() +
  geom_point(data = as.data.frame(centroides), aes(speed, dist), color = "black", size = 3, shape = 17) +
  scale_color_manual(values = c("red", "green", "blue")) + 
  theme_minimal()

Here we use two popular measures of goodness of fit. The Calinski-Harabasz (CH) index should be high and the Davies-Bouldin (DB) index should be low. Neither index has a fixed cutoff.

For more information on how these indices are calculated, see Clustering evaluation.

# Calinski-Harabasz Index
library("fpc")
d$cluster <- NULL # important: the index uses every column of the data frame, so drop the cluster labels first
round(calinhara(d, kmedias$cluster), digits = 2)
## [1] 71.78
# Davies-Bouldin Index
library("clusterSim")
index.DB(d, kmedias$cluster, d = NULL, centrotypes = "centroids", p = 2, q = 2)
## $DB
## [1] 0.8477413
## 
## $r
## [1] 0.8743056 0.7946129 0.8743056
## 
## $R
##           [,1]      [,2]      [,3]
## [1,]       Inf 0.7946129 0.8743056
## [2,] 0.7946129       Inf 0.4762805
## [3,] 0.8743056 0.4762805       Inf
## 
## $d
##          1        2        3
## 1 0.000000 1.536303 1.715331
## 2 1.536303 0.000000 3.173807
## 3 1.715331 3.173807 0.000000
## 
## $S
## [1] 0.6044337 0.6163326 0.8952900
## 
## $centers
##            [,1]       [,2]
## [1,]  0.1765121 -0.1119453
## [2,] -1.1158088 -0.9426887
## [3,]  1.0881681  1.3410670

For more information about the output values of the second index, see ?index.DB or the clusterSim documentation.
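
A hedged way to put both indices to work (my own addition, not in the original analysis) is to compute them for several values of k and look for a high CH together with a low DB:

set.seed(123)
for (k in 2:5) {
  km <- kmeans(d, centers = k, nstart = 25)
  cat("k =", k,
      " CH =", round(calinhara(d, km$cluster), 2),
      " DB =", round(index.DB(d, km$cluster, centrotypes = "centroids")$DB, 2), "\n")
}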

Disclaimer: the k-means model we've trained is just that, a model; it won't fit other data perfectly. In this case, if you repeat the elbow method on the new d1 dataset (see the check right after d1 is created below), you can see that our model is not a good fit for it.

This RPub just shows some things I've learned here and there, and this final part shows how I use an already trained model on new data.

The new dataset:

speed <- c(-0.714421298, 1.570101525, 0.492395036, 1.029811955, -1.060355655, 1.829127471,
           -0.476726416, 0.278957735, 0.576229126, -0.119376407, -0.854093485, 
           -0.98089661,0.550578976, 1.383766532, -1.592552216, -0.964708427, -1.196658413, 
           0.857113853,0.728057245, -1.224530189, -0.677890455, 1.07981807, 1.403805889, 
           1.886441524,0.953076804)

dist <- c(2.643086907, 1.092038324, -0.454501317, 1.017435336, 0.823089231, 0.867627793, 
          2.210679941, 2.708608868, 1.341273448, 0.96710234, -0.060334785, -0.186007158,
          1.188867714, 1.312420774, 2.387067609, -0.21882152, 2.719395448, 2.368862934, 
          1.512240851, -0.339202671, 2.120518645, 2.442190096, 1.002244867, -0.437064364, 
          1.385796414)

d1 <- data.frame(speed, dist)
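
As mentioned in the disclaimer, repeating the elbow method on this new data suggests that the 3-cluster model is not a natural fit for d1:

fviz_nbclust(d1, kmeans, method = "wss") +
  labs(subtitle = "Elbow method on d1")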

Here, finally, we can save the predicted clusters using our model.

library ("clue")
kmedias1 <- kmedias #I create a new object because i don't want to smash the previous one
kmedias1 <- as.cl_partition(kmedias1) #the next function just work with cl_partition class
d1$cluster <- cl_predict(kmedias1,newdata= d1)
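
Under the hood, cl_predict() on a k-means partition assigns each new point to its nearest centroid. A minimal sketch of that rule, assuming Euclidean distance, just to double-check the result (nearest is my own throwaway name):

nearest <- apply(d1[, c("speed", "dist")], 1, function(p) {
  which.min(colSums((t(centroides) - p)^2)) # squared distance to each centroid row
})
all(nearest == as.integer(d1$cluster))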