Introduction

This paper is my project for the Unsupervised Learning class at Warsaw University (Faculty of Economics).

To pass the class, we need to write three papers, each explaining one of the topics from the three main chapters of the course (Clustering, Dimension reduction and Association rules).

I hope this paper will be helpful in explaining K-means clustering to random passersby :).

Clustering

Unfortunately, the data that we work with does not always have a clear, understandable structure; this is where unsupervised learning comes to the rescue.

Clustering is a statistical technique used in unsupervised learning when we need a clearer structure in our data. It groups observations according to their similarity in selected features.

The technique can be applied to market segmentation, image segmentation, image compression, etc. The first method that I want to talk about is K-means.

It is considered one of the most popular methods, probably because it is the easiest to apply :)

In general, the K-means algorithm works this way (a short R sketch follows the list):

• A number K of clusters is chosen
• K random points are taken as the initial centroids of the clusters
• Each observation is assigned to the closest centroid (measured by Euclidean distance, i.e. the square root of the sum of squared coordinate differences), forming a cluster
• The mean of each newly formed cluster is calculated and used as that cluster's centroid in the next iteration
• The previous two steps are repeated until the centroids remain unchanged
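To make the steps concrete, here is a minimal, illustrative implementation in base R. The function my_kmeans is a hypothetical helper written only for this sketch; for real work you would use the built-in kmeans function, which the rest of this paper relies on.

# Minimal K-means sketch following the steps above (illustration only;
# assumes data has at least two columns and no cluster becomes empty).
my_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # Step 2: take K random observations as the initial centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # Step 3: assign each observation to the closest centroid
    # (Euclidean distances between all centroids and observations)
    d <- as.matrix(dist(rbind(centroids, data)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # Step 4: recompute each centroid as the mean of its cluster
    new_centroids <- t(sapply(1:k, function(j)
      colMeans(data[cluster == j, , drop = FALSE])))
    # Step 5: stop when the centroids no longer change
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}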

Application

To show how K-means works in R, I will use a dataset I created for my bachelor's thesis on the profitability of the EXPO exhibitions and the key indicators that determine their success. The first step is loading the data, making sure that the variables are of the correct classes, and loading the right package. I will be using the eclust function from the factoextra package.

expo <- read.csv2("EXPO.csv", stringsAsFactors = F)  # semicolon-separated CSV
expo$type <- as.character(expo$type)   # exhibition type as character
expo$ratio <- as.numeric(expo$ratio)   # income/expenditure ratio as numeric
summary(expo)
##      type              visitors         participants        ratio       
##  Length:35          Min.   : 1330000   Min.   : 10.00   Min.   :0.2226  
##  Class :character   1st Qu.: 6700000   1st Qu.: 29.50   1st Qu.:0.6633  
##  Mode  :character   Median :16156626   Median : 36.00   Median :1.0052  
##                     Mean   :21735494   Mean   : 52.17   Mean   :0.8964  
##                     3rd Qu.:33125148   3rd Qu.: 58.00   3rd Qu.:1.1335  
##                     Max.   :73080000   Max.   :192.00   Max.   :1.5863  
##      profit            id           
##  Min.   :0.0000   Length:35         
##  1st Qu.:0.0000   Class :character  
##  Median :1.0000   Mode  :character  
##  Mean   :0.5429                     
##  3rd Qu.:1.0000                     
##  Max.   :1.0000
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

The dataset consists of 35 observations and 6 variables: id is the year and the place where the exhibition took place; type is a discrete variable where 1 = World Exhibition, 2 = International Specialized Exhibition and 3 = Specialized Exhibition; visitors is the number of tickets sold to the exhibition; and participants is the number of countries that took part in the event.
ratio is the ratio of the income to the expenditures of the exhibition, which also defines the last, binary variable profit: if ratio is less than 1, profit equals 0.
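Under that rule, profit could be reconstructed directly from ratio with a one-liner (shown here only to make the definition concrete):

# profit = 1 when the exhibition at least breaks even (ratio >= 1), else 0
expo$profit <- as.integer(expo$ratio >= 1)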

To use K-means clustering, I am going to create a new data frame “xx” that includes the variables visitors and ratio. I also divided the number of visitors by a million, to have a better scale on the graph.

xx <- expo[, c(2, 4)]                     # keep visitors and ratio
xx$visitors <- expo$visitors / 1000000    # visitors in millions
plot(xx, main = "Number of Visitors & Profitability",
     xlab = "visitors (mln)", ylab = "income/expenditure", col = "blue")

The xx data frame looks as follows:

##     visitors    ratio
## 1   6.039205 1.204203
## 2   5.162330 0.282187
## 3   6.100000 1.001722
## 4  15.000000 1.136182
## 5   7.255000 0.222569
## 6   9.000000 0.600000
## 7  16.156626 0.481770
## 8   1.330000 0.280000
## 9   2.000000 0.235294
## 10 32.250297 1.207317
## 11 27.529400 1.014558
## 12  6.000000 1.228070
## 13 50.860801 1.059488
## 14 19.694855 0.962963
## 15  7.000000 1.005197
## 16 10.000000 0.846154
## 17 13.000000 0.985915
## 18 11.000000 0.848485
## 19 18.876438 1.050729
## 20 38.872000 1.016041
## 21 20.000000 1.281218
## 22 34.000000 1.150861
## 23 44.955997 0.856000
## 24 42.000000 1.016356
## 25 50.306648 0.512242
## 26  6.400000 0.955128
## 27 64.210770 1.586345
## 28  5.600000 1.007143
## 29  7.335279 0.714286
## 30 22.111578 0.612219
## 31 41.814571 1.130757
## 32 18.100000 0.352941
## 33 22.049544 1.050691
## 34  5.650941 1.230444
## 35 73.080000 1.250000
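A side note: K-means relies on Euclidean distances, so a variable with a much larger range can dominate the clustering. Dividing visitors by a million helps, but the two columns still live on different scales (roughly 1–73 versus 0.2–1.6). Standardizing both columns is a common optional alternative, not used in the analysis below:

# Optional: rescale both variables to mean 0 and sd 1 so that neither
# dominates the distance computation
xx_scaled <- scale(xx)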

Now that we have a general idea of how our data looks, we can try to cluster it. In order to compare visually which number of clusters is the best fit for this case, I'm running the K-means clustering for k = 2, 3, 4, 5:

km2 <- eclust(xx, "kmeans", hc_metric = "euclidean", k = 2, graph = TRUE)

km3 <- eclust(xx, "kmeans", hc_metric = "euclidean", k = 3, graph = TRUE)

km4 <- eclust(xx, "kmeans", hc_metric = "euclidean", k = 4, graph = TRUE)

km5 <- eclust(xx, "kmeans", hc_metric = "euclidean", k = 5, graph = TRUE)

To compare which of the clusterings does the best job at categorizing the data, we can look at the details of each clustering:

km2
## K-means clustering with 2 clusters of sizes 10, 25
## 
## Cluster means:
##   visitors     ratio
## 1 47.23511 1.0785407
## 2 11.53565 0.8236027
## 
## Clustering vector:
##  [1] 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 1 1 1 1 2 1 2 2 2 1 2 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 1511.039 1239.666
##  (between_SS / total_SS =  76.8 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
## [11] "silinfo"      "nbclust"      "data"
km3
## K-means clustering with 3 clusters of sizes 8, 16, 11
## 
## Cluster means:
##    visitors     ratio
## 1 50.762598 1.0534036
## 2  6.804547 0.7904248
## 3 22.342613 0.9364954
## 
## Clustering vector:
##  [1] 2 2 2 3 2 2 3 2 2 3 3 2 1 3 2 2 2 2 3 1 3 3 1 1 1 2 1 2 2 3 1 3 3 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 1011.7538  133.9227  396.7552
##  (between_SS / total_SS =  87.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
## [11] "silinfo"      "nbclust"      "data"
km4
## K-means clustering with 4 clusters of sizes 2, 9, 8, 16
## 
## Cluster means:
##    visitors     ratio
## 1 68.645385 1.4181725
## 2 19.946493 0.8825857
## 3 41.882539 0.9936327
## 4  6.804547 0.7904248
## 
## Clustering vector:
##  [1] 4 4 4 2 4 4 2 4 4 3 2 4 3 2 4 4 4 4 2 3 2 3 3 3 3 4 1 4 4 2 3 2 2 4 1
## 
## Within cluster sum of squares by cluster:
## [1]  39.38818 110.88042 325.36243 133.92268
##  (between_SS / total_SS =  94.9 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
## [11] "silinfo"      "nbclust"      "data"
km5
## K-means clustering with 5 clusters of sizes 2, 5, 7, 13, 8
## 
## Cluster means:
##    visitors     ratio
## 1 68.645385 1.4181725
## 2 13.031325 0.8597012
## 3 21.194545 0.9036170
## 4  5.759443 0.7666341
## 5 41.882539 0.9936327
## 
## Clustering vector:
##  [1] 4 4 4 2 4 4 2 4 4 5 3 4 5 3 4 2 2 2 3 5 3 5 5 5 5 4 1 4 4 3 5 3 3 4 1
## 
## Within cluster sum of squares by cluster:
## [1]  39.38818  27.19489  60.91804  53.47508 325.36243
##  (between_SS / total_SS =  95.7 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "clust_plot"  
## [11] "silinfo"      "nbclust"      "data"

There are also a couple of methods that can help us determine the optimal number of clusters. For that we will need the fviz_nbclust function from the factoextra package. We can try to determine the optimal number of clusters with the Silhouette, Elbow and Gap statistic methods.

fviz_nbclust(xx, kmeans, method = "silhouette", print.summary = TRUE)

fviz_nbclust(xx, kmeans, method = "wss")

fviz_nbclust(xx, kmeans, method = "gap_stat", k.max = 5, print.summary = TRUE)

From this we can see that the silhouette and elbow (total within-cluster sum of squares) methods show that the optimal number of clusters is 2 (in the elbow method we take the point where the marginal gain in the % of variance explained by the clusters drops when moving to K+1). The Gap statistic suggests that the optimal number of clusters is 1. Very generally, the optimal k in this method is the smallest k for which Gap(k) ≥ Gap(k+1) − s(k+1), where s(k+1) is the standard error of the gap estimate.
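That decision rule can also be applied programmatically with the cluster package, which fviz_nbclust uses under the hood. A short sketch; the K.max and B (bootstrap samples) values here are illustrative choices, not taken from the analysis above:

library(cluster)
# Compute the gap statistic for k = 1..5 with 100 bootstrap reference samples
gap <- clusGap(as.matrix(xx), kmeans, K.max = 5, B = 100)
# Pick k by Tibshirani's rule: smallest k with Gap(k) >= Gap(k+1) - s(k+1)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")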