This is a simple educational exercise that introduces K-Means clustering in RStudio. I follow the exercise outlined on a page at WikiBooks.org, which explains this important technique and then provides a simple Case Study using country data on ‘Per capita income’, ‘Literacy’, ‘Infant mortality’, and ‘Life expectancy’.
I copied the data from the web page into a comma-separated file named ‘data.txt’, the file name used in the Case Study. Below I follow the given code and reproduce the result myself in RStudio.
# import data (assume that all data in "data.txt" is stored as comma separated values)
x <- read.csv("data.txt", header=TRUE, row.names=1)
# show first 10 rows
head(x, n=10)
## Per.capita.income Literacy Infant.mortality Life.expectancy
## Brazil 10326 90.0 23.60 75.4
## Germany 39650 99.0 4.08 79.4
## Mozambique 830 38.7 95.90 42.1
## Australia 43163 99.0 4.57 81.2
## China 5300 90.9 23.00 73.0
## Argentina 13308 97.2 13.40 75.3
## United Kingdom 34105 99.0 5.01 79.4
## South Africa 10600 82.4 44.80 49.3
## Zambia 1000 68.0 92.70 42.4
## Namibia 5249 85.0 42.30 52.9
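For readers who do not have the original file, the import step can be tried on an inline copy of the ten rows printed above. This is a minimal sketch, not the Case Study's own code: the header name `Country` for the first column and the variable names `csv_text` and `x10` are assumptions made for illustration. It shows how `row.names = 1` turns the first column into row names, and how `read.csv` converts spaces in the header into dots.

```r
# Inline copy of the ten rows shown by head(); the "Country" header is an assumption.
csv_text <- "Country,Per capita income,Literacy,Infant mortality,Life expectancy
Brazil,10326,90.0,23.60,75.4
Germany,39650,99.0,4.08,79.4
Mozambique,830,38.7,95.90,42.1
Australia,43163,99.0,4.57,81.2
China,5300,90.9,23.00,73.0
Argentina,13308,97.2,13.40,75.3
United Kingdom,34105,99.0,5.01,79.4
South Africa,10600,82.4,44.80,49.3
Zambia,1000,68.0,92.70,42.4
Namibia,5249,85.0,42.30,52.9"

# row.names = 1 uses the first column (country names) as row names,
# leaving only the four numeric variables as data columns
x10 <- read.csv(text = csv_text, header = TRUE, row.names = 1)

# inspect the structure: 10 rows, 4 numeric columns with dotted names
str(x10)
```

Keeping the country names as row names rather than as a data column matters here, because `kmeans` expects purely numeric input.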
# run K-Means with 3 clusters and at most 15 iterations
km <- kmeans(x, centers = 3, iter.max = 15)
# print components of km
print(km)
## K-means clustering with 3 clusters of sizes 6, 11, 2
##
## Cluster means:
## Per.capita.income Literacy Infant.mortality Life.expectancy
## 1 37122.5 98.50000 4.233333 80.50000
## 2 6363.0 77.43636 45.732727 62.12727
## 3 23245.0 99.05000 7.220000 76.50000
##
## Clustering vector:
## Brazil Germany Mozambique Australia China
## 2 1 2 1 2
## Argentina United Kingdom South Africa Zambia Namibia
## 2 1 2 2 2
## Georgia Pakistan India Turkey Sweden
## 2 2 2 2 1
## Lithuania Greece Italy Japan
## 3 1 3 1
##
## Within cluster sum of squares by cluster:
## [1] 66842392 211664558 24710478
## (between_SS / total_SS = 92.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
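The components listed above can be read directly from the fitted object, and the percentage in the printed summary is `betweenss` divided by `totss`. The following is a minimal sketch under stated assumptions: it uses only the ten rows shown by `head()` earlier rather than the full table, and it adds a seed and an `nstart` value that are not part of the Case Study. The names `x10` and `km10` are illustrative. Since `kmeans` picks its starting centers at random, cluster numbering and sizes can differ between runs unless a seed is set.

```r
# Ten rows taken from the head() output; not the full data set
x10 <- data.frame(
  Per.capita.income = c(10326, 39650, 830, 43163, 5300, 13308, 34105, 10600, 1000, 5249),
  Literacy          = c(90.0, 99.0, 38.7, 99.0, 90.9, 97.2, 99.0, 82.4, 68.0, 85.0),
  Infant.mortality  = c(23.60, 4.08, 95.90, 4.57, 23.00, 13.40, 5.01, 44.80, 92.70, 42.30),
  Life.expectancy   = c(75.4, 79.4, 42.1, 81.2, 73.0, 75.3, 79.4, 49.3, 42.4, 52.9),
  row.names = c("Brazil", "Germany", "Mozambique", "Australia", "China", "Argentina",
                "United Kingdom", "South Africa", "Zambia", "Namibia")
)

# Assumption: an arbitrary seed so that reruns give the same assignment
set.seed(42)

# nstart = 10 (an assumption) tries ten random starts and keeps the best one
km10 <- kmeans(x10, centers = 3, iter.max = 15, nstart = 10)

km10$size      # number of countries in each cluster
km10$cluster   # cluster label for each country
km10$centers   # mean of each variable within each cluster

# share of total variance explained by the clustering,
# the quantity printed as between_SS / total_SS
ratio <- km10$betweenss / km10$totss
round(100 * ratio, 1)
```

The total sum of squares splits into the within-cluster and between-cluster parts, so `km10$totss` equals `km10$tot.withinss + km10$betweenss`.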
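# plot clusters on two of the four variables (plot(x) on all four columns draws a scatterplot matrix, onto which points() cannot overlay the centers)
plot(x[, c("Per.capita.income", "Life.expectancy")], col = km$cluster)
# plot the three cluster centers on the same two variables
points(km$centers[, c("Per.capita.income", "Life.expectancy")], col = 1:3, pch = 8)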
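The within-cluster sums of squares printed earlier run into the tens of millions, which suggests that per capita income, the variable with by far the largest numeric range, dominates the Euclidean distances that K-Means minimizes. One common variation, which is not part of the Case Study, is to standardize each column before clustering. The following is a minimal sketch assuming only the ten rows shown by `head()`; the names `x10`, `x_scaled`, and `km_scaled`, the seed, and the `nstart` value are all illustrative assumptions.

```r
# Ten rows taken from the head() output; not the full data set
x10 <- data.frame(
  Per.capita.income = c(10326, 39650, 830, 43163, 5300, 13308, 34105, 10600, 1000, 5249),
  Literacy          = c(90.0, 99.0, 38.7, 99.0, 90.9, 97.2, 99.0, 82.4, 68.0, 85.0),
  Infant.mortality  = c(23.60, 4.08, 95.90, 4.57, 23.00, 13.40, 5.01, 44.80, 92.70, 42.30),
  Life.expectancy   = c(75.4, 79.4, 42.1, 81.2, 73.0, 75.3, 79.4, 49.3, 42.4, 52.9),
  row.names = c("Brazil", "Germany", "Mozambique", "Australia", "China", "Argentina",
                "United Kingdom", "South Africa", "Zambia", "Namibia")
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so every variable contributes on a comparable scale
x_scaled <- scale(x10)

# Assumption: arbitrary seed and nstart, chosen only for reproducibility
set.seed(42)
km_scaled <- kmeans(x_scaled, centers = 3, iter.max = 15, nstart = 10)

# cluster label per country, now based on all four standardized variables
km_scaled$cluster
```

The resulting clusters may differ from the unscaled ones, and the centers in `km_scaled$centers` are expressed in standardized units rather than in the original units.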