load libraries needed
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
load and prepare dataset
df <- USArrests
remove rows with missing values
df <- na.omit(df)
scale each variable to have mean of 0 and standard deviation of 1
df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
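As a quick sanity check on the scaling step, the sketch below confirms that every column of the scaled matrix has a mean of 0 and a standard deviation of 1. The name `df_check` is introduced here only for illustration and is not part of the original code.

```r
# Illustrative object (hypothetical name); built the same way as df above
df_check <- scale(na.omit(USArrests))

# column means should be 0, up to floating-point error
round(colMeans(df_check), 10)

# column standard deviations should all be 1
apply(df_check, 2, sd)

# scale() keeps the original means and standard deviations as attributes,
# which is useful later for converting results back to the original units
attr(df_check, "scaled:center")
attr(df_check, "scaled:scale")
```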
**Start the clustering**
Since we don’t know beforehand how many clusters are optimal, we’ll create two different plots that can help us decide.
fviz_nbclust(df, kmeans, method = "wss")
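With `method = "wss"`, the plot shows the total within-cluster sum of squares for each candidate number of clusters, and the usual reading is to look for the "elbow" where adding another cluster stops reducing it by much. A minimal base R sketch of roughly the same computation follows. The range of 1 to 10 clusters, the seed, `nstart = 25`, and the names `df_scaled` and `wss` are assumptions made for illustration.

```r
# Illustrative names; df_scaled mirrors the scaled df above
df_scaled <- scale(na.omit(USArrests))

set.seed(1)

# total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss
})

# plot the values and look for the bend in the curve
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```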
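Another way to determine the optimal number of clusters is to use a metric known as the gap statistic, which compares the within-cluster variation for each number of clusters with the value expected for data that has no cluster structure.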
calculate gap statistic based on number of clusters
gap_stat <- clusGap(df,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)
fviz_gap_stat(gap_stat)
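To read a suggested number of clusters off the gap statistic numerically rather than from the plot, one option is `maxSE()` from the cluster package, which applies a selection rule to the gap values and their standard errors. The sketch below is only an illustration: the seed and the names `df_scaled`, `gap_demo`, `k_first`, and `k_global` are assumptions, and different rules can suggest different values of k.

```r
library(cluster)

# Illustrative names; df_scaled mirrors the scaled df above
df_scaled <- scale(na.omit(USArrests))

# clusGap() uses random reference data sets, so fix the seed for repeatability
set.seed(1)
gap_demo <- clusGap(df_scaled, FUNcluster = kmeans, nstart = 25,
                    K.max = 10, B = 50)

# one row per k, with columns logW, E.logW, gap and SE.sim
gap_demo$Tab

# smallest k whose gap is within one standard error of the first local maximum
k_first <- maxSE(f = gap_demo$Tab[, "gap"],
                 SE.f = gap_demo$Tab[, "SE.sim"],
                 method = "firstSEmax")

# k with the largest gap value overall
k_global <- maxSE(f = gap_demo$Tab[, "gap"],
                  SE.f = gap_demo$Tab[, "SE.sim"],
                  method = "globalmax")

c(firstSEmax = k_first, globalmax = k_global)
```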
Perform K-Means Clustering with Optimal K
perform k-means clustering with k = 4 clusters
set.seed(1)
km <- kmeans(df, centers = 4, nstart = 25)
km
## K-means clustering with 4 clusters of sizes 13, 13, 16, 8
##
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  1.4118898  0.8743346 -0.8145211  0.01927104
##
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California
##              4              2              2              4              2
##       Colorado    Connecticut       Delaware        Florida        Georgia
##              2              3              3              2              4
##         Hawaii          Idaho       Illinois        Indiana           Iowa
##              3              1              2              3              1
##         Kansas       Kentucky      Louisiana          Maine       Maryland
##              3              1              4              1              2
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
##              3              2              1              4              2
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey
##              1              1              2              1              3
##     New Mexico       New York North Carolina   North Dakota           Ohio
##              2              2              4              1              3
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
##              3              3              3              3              4
##   South Dakota      Tennessee          Texas           Utah        Vermont
##              1              4              2              3              1
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming
##              3              3              1              1              3
##
## Within cluster sum of squares by cluster:
## [1] 11.952463 19.922437 16.212213 8.316061
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
13 states were assigned to the first cluster
13 states were assigned to the second cluster
16 states were assigned to the third cluster
8 states were assigned to the fourth cluster
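The cluster sizes can also be pulled directly from the fitted object instead of being read off the printout. A small sketch follows, refitting the model under the same settings; the names `df_scaled` and `km_check` are illustrative. Cluster labels are arbitrary, so the order of the sizes may differ between R versions or seeds even when the partition itself is the same.

```r
# Illustrative names; same settings as the model above
df_scaled <- scale(na.omit(USArrests))

set.seed(1)
km_check <- kmeans(df_scaled, centers = 4, nstart = 25)

# number of states in each cluster
km_check$size
table(km_check$cluster)

# share of total variation explained by the clustering,
# the between_SS / total_SS figure in the printout
km_check$betweenss / km_check$totss
```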
plot results of final k-means model
fviz_cluster(km, data = df)
find means of each cluster
aggregate(USArrests, by=list(cluster=km$cluster), mean)
The mean number of murders per 100,000 citizens among the states in cluster 1 is 3.6.
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 78.5.
The mean percentage of residents living in an urban area among the states in cluster 1 is 52.1%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 12.2.
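These means are in the original units because `aggregate()` is applied to `USArrests` rather than to the scaled matrix. An equivalent route, sketched below, is to undo the standardization on the fitted centers using the attributes that `scale()` attaches to its result. The names `df_scaled`, `km_check`, `centers_orig`, and `agg` are illustrative and not part of the original code.

```r
# Illustrative names; same settings as the model above
df_scaled <- scale(na.omit(USArrests))

set.seed(1)
km_check <- kmeans(df_scaled, centers = 4, nstart = 25)

# means and standard deviations that scale() used for each variable
ctr <- attr(df_scaled, "scaled:center")
scl <- attr(df_scaled, "scaled:scale")

# original value = scaled value * sd + mean, applied column by column
centers_orig <- sweep(sweep(km_check$centers, 2, scl, "*"), 2, ctr, "+")
round(centers_orig, 1)

# compare with the per-cluster means computed on the unscaled data
agg <- aggregate(USArrests, by = list(cluster = km_check$cluster), mean)
all.equal(unname(as.matrix(agg[, -1])), unname(centers_orig))
```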
add cluster assignment to original data
final_data <- cbind(USArrests, cluster = km$cluster)
head(final_data)