K means Clustering

Clustering is the use of unsupervised techniques for grouping similar objects, Usually supervised methods use labelled objects and unsupervised methods use unlabelled objects. Clustering looks for hidden structure in the data, similarities based on attributes, It is often used for exploratory analysis and no predictions are made.

K-means Clustering

Given a collection of objects each with n measurable attributes and a chosen value k of the number of clusters, the algorithm identifies the k clusters of objects based on the objects proximity to the centres of the k groups. The algorithm is iterative with the centres adjusted to the mean of each cluster’s n-dimensional vector of attributes.

For this purpose of Week-3 Assignment I am using ‘USArrest’, Crime data analysis data which is available in R packages.

K-means clustering (MacQueen, 1967), in which, each cluster is represented by the center or means of the data points belonging to the cluster. K-medoids clustering or PAM (Partitioning Around Medoids, Kaufman & Rousseeuw, 1990), in which, each cluster is represented by one of the objects in the cluster. We’ll describe also a variant of PAM named CLARA (Clustering Large Applications) which is used for analyzing large data sets.

This particular dataset, named USArrests, contains the number of arrests for murder, assault, and rape for each of the 50 states in 1973. It also contains the percentage of people in the state who live in an urban area.

Loading Data

data("USArrests")

rawdf <- na.omit(USArrests)

head(rawdf)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Basic Statics

Before applying K-means we can scale data using scale() function

desc_stats <- data.frame(
  Min = apply(rawdf, 2, min), # minimum
  Med = apply(rawdf, 2, median), # median
  Mean = apply(rawdf, 2, mean), # mean
  SD = apply(rawdf, 2, sd), # Standard deviation
  Max = apply(rawdf, 2, max) # Maximum
  )
desc_stats <- round(desc_stats, 1)
head(desc_stats)
##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

K means

set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
km.res
## K-means clustering with 4 clusters of sizes 13, 16, 13, 8
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 3  0.6950701  1.0394414  0.7226370  1.27693964
## 4  1.4118898  0.8743346 -0.8145211  0.01927104
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              4              3              3              4              3 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              3              2              2              3              4 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              1              3              2              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              1              4              1              3 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              3              1              4              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              3              1              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              3              3              4              1              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              4 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              4              3              2              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              1              1              2 
## 
## Within cluster sum of squares by cluster:
## [1] 11.952463 16.212213 19.922437  8.316061
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
aggregate(USArrests, by=list(cluster=km.res$cluster), mean)
##   cluster   Murder   Assault UrbanPop     Rape
## 1       1  3.60000  78.53846 52.07692 12.17692
## 2       2  5.65625 138.87500 73.87500 18.78125
## 3       3 10.81538 257.38462 76.00000 33.19231
## 4       4 13.93750 243.62500 53.75000 21.41250

Including Plots

library("factoextra")
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_cluster(km.res, data = df,
             palette = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"),
             ggtheme = theme_minimal(),
             main = "Partitioning Clustering Plot"
             )

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Cluster Validation

library(cluster)
library(HSAUR)
## Loading required package: tools
data(pottery)
km    <- kmeans(pottery,3)
dissE <- daisy(pottery) 
dE2   <- dissE^2
sk2   <- silhouette(km$cl, dE2)
plot(sk2)