K-means

load libraries needed

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(cluster)

load and prepare dataset

df <- USArrests

remove rows with missing values

df <- na.omit(df)

scale each variable to have mean of 0 and standard deviation of 1

df <- scale(df)

head(df)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

**Start the clustering

Since we don’t know beforehand how many clusters is optimal, we’ll create two different plots that can help us decide**

fviz_nbclust(df, kmeans, method = "wss")

For this plot it appear that there is a bit of an elbow or “bend” at k = 4

Clusters Another way to determine the optimal number of clusters is to use a metric known as the gap statistics

calculate gap statistic based on number of clusters

gap_stat <- clusGap(df,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

fviz_gap_stat(gap_stat)

We can observe from the plot that the gap statistic is greatest at k = 4 clusters, which corresponds to the elbow strategy we employed earlier.

Perform K-Means Clustering with Optimal K

perform k-means clustering with k = 4 clusters

set.seed(1)
km <- kmeans(df, centers = 4, nstart = 25)
km

## K-means clustering with 4 clusters of sizes 13, 13, 16, 8
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  1.4118898  0.8743346 -0.8145211  0.01927104
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              4              2              2              4              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              4 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              1              2              3              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              1              4              1              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              1              4              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              2              1              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              4              1              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              4 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              4              2              3              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              1              1              3 
## 
## Within cluster sum of squares by cluster:
## [1] 11.952463 19.922437 16.212213  8.316061
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

16states were assigned to the first cluster
13 states were assigned to the second cluster
states were assigned to the third cluster
states were assigned to the fourth cluster

plot results of final k-means model

fviz_cluster(km, data = df)

find means of each cluster

aggregate(USArrests, by=list(cluster=km$cluster), mean)

The mean number of murders per 100,000 citizens among the states in cluster 1 is 3.6.
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 78.6.
The mean percentage of residents living in an urban area among the states in cluster 1 is 52.1%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 12.2.

add cluster assignment to original data

final_data <- cbind(USArrests, cluster = km$cluster)
head(final_data)

K-means

Duque, Don Gerard H. BS-Math

2022-06-20