This file was created whilst doing the course on Statistical Learning on the Stanford-Lagunita website. It constitutes the 10th part of the course, on Unsupervised Learning techniques. The basic idea of Unsupervised Learning, as opposed to Supervised Learning, is that there is no Y-variable. In supervised learning we are interested in predicting a predetermined response (Y) from a set of features; in Unsupervised Learning there is no Y, and we are in a sense looking for patterns in the data which can help us classify or cluster observations together.
Principal Component Analysis (PCA) is often used to pre-process data for ML algorithms such as K-means, and to identify and understand structure or sub-groups within the data. This is done by finding linear combinations of the variables that have the highest variance and are mutually uncorrelated.
This note is focused on the R-aspects of the course, rather than the theoretical details, so let’s get to it. First, import a dataset:
# Load the USArrests dataset that ships with base R (the datasets package)
data("USArrests")
# Inspect the data
dimnames(USArrests)
## [[1]]
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
##
## [[2]]
## [1] "Murder" "Assault" "UrbanPop" "Rape"
# Standardize the variables for use in Principal Component Analysis
pca1 = prcomp(USArrests, scale = TRUE)
# Show the principal components (standard deviations and loadings)
pca1
## Standard deviations:
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation:
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
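Before plotting, it is worth verifying the claim above that the principal components are uncorrelated and ordered by the amount of variance they capture. The lines below are a minimal check on the pca1 object just fitted (not part of the original course code):
# Proportion of variance explained by each component (squared standard deviations, normalized)
pve = pca1$sdev^2 / sum(pca1$sdev^2)
round(pve, 3)
# The principal component scores in pca1$x are uncorrelated (off-diagonal correlations ~ 0)
round(cor(pca1$x), 3)
With the standard deviations reported above, the first two components account for roughly 87% of the total variance, which is why a two-dimensional biplot summarizes the data well.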
# In order to visualize the assigned principal components, a biplot is useful
biplot(pca1, scale = 0, cex = 0.65)
# We see that the 1st principal component (PC1) mainly reflects the overall level of crime,
# whilst PC2 is primarily concerned with the degree of urbanization (the UrbanPop variable).
# High-crime states appear on the left side of the PC1 axis, whilst lower-crime states are on the right.
# Regarding urban population, observations lower on the PC2 axis are heavily urbanized states, and vice versa.
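One way to check this reading of the biplot is to inspect the scores of a few individual states directly (the states chosen here are arbitrary examples, not part of the course code). High-crime states should have strongly negative PC1 scores, and heavily urbanized states strongly negative PC2 scores, matching the comments above:
# PC1 and PC2 scores for a handful of states
round(pca1$x[c("Florida", "North Dakota", "California", "Mississippi"), 1:2], 2)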
Having outlined the basic ideas of PCA (geometrically, the first principal component is the line or hyperplane closest to the data - it minimizes the perpendicular distances to the observations, rather than the vertical distances minimized in OLS regression), we’re ready to take a brief look at Clustering - in particular K-means Clustering. Here the idea is to partition the observations into K clusters so that the total within-cluster variation - the sum of squared distances from each observation to its cluster centroid - is as small as possible.
The K-means algorithm exists in multiple specifications; the most commonly used (and the default in R’s kmeans function) is the Hartigan-Wong algorithm (1979). The number of clusters, K, is predetermined by the researcher and should be chosen with care (one common heuristic for choosing it is sketched below). In broad terms, the algorithm starts from an initial assignment of observations to K clusters and then repeatedly updates cluster centroids and reassigns observations so as to reduce the total within-cluster sum of squares, stopping when no reassignment improves the fit.
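The sketch below illustrates one such heuristic, the so-called elbow plot: run K-means for a range of values of K and plot the total within-cluster sum of squares, looking for an “elbow” after which additional clusters give little improvement. It is not part of the course code and simply reuses the already-loaded USArrests data for illustration.
# Total within-cluster sum of squares for K = 1, ..., 10 on the scaled USArrests data
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
wss = sapply(1:10, function(k) kmeans(scale(USArrests), centers = k, nstart = 15)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")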
The R-code below produces a dataset with 4 assigned clusters - this is the “true” clustering.
set.seed(101)
usml = matrix(rnorm(100*2), 100, 2)
usmlmean = matrix(rnorm(8, sd = 4), 4, 2)
which = sample(1:4, 100, replace = TRUE)
usml = usml + usmlmean[which, ]
plot(usml, col = which, pch = 19)
Now we would like to see how the K-means algorithm performs in categorizing the data, given that we know the true number of clusters.
# Fit K-means with K = 4 and 15 random starts (avoid naming the result "kmeans", which would mask the function)
km.out = kmeans(usml, 4, nstart = 15)
km.out
## K-means clustering with 4 clusters of sizes 21, 30, 32, 17
##
## Cluster means:
## [,1] [,2]
## 1 -3.1068542 1.1213302
## 2 1.7226318 -0.2584919
## 3 -5.5818142 3.3684991
## 4 -0.6148368 4.8861032
##
## Clustering vector:
## [1] 2 3 3 4 1 1 4 3 2 3 2 1 1 3 1 1 2 3 3 2 2 3 1 3 1 1 2 2 3 1 1 4 3 1 3
## [36] 3 1 2 2 3 2 2 3 3 1 3 1 3 4 2 1 2 2 4 3 3 2 2 3 2 1 2 3 4 2 4 3 4 4 2
## [71] 2 4 3 2 3 4 4 2 2 1 2 4 4 3 3 2 3 3 1 2 3 2 4 4 4 2 3 3 1 1
##
## Within cluster sum of squares by cluster:
## [1] 30.82790 54.48008 71.98228 21.04952
## (between_SS / total_SS = 87.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# Plot the fitted clusters as open triangles, coloured by the K-means assignment
plot(usml, col = km.out$cluster, cex = 2, pch = 2, lwd = 2)
# Overlay the true clusters as filled dots; the colour vector is then remapped so that
# the true group colours line up with the colours chosen by the K-means labelling
points(usml, col = which, pch = 19)
points(usml, col = c(4, 3, 2, 1)[which], pch = 19)
We see that the algorithm does rather well - it misclassified only 2 of the 100 observations: one observation that was actually “black”, and one that was “blue”.
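A quick way to quantify this (again just a sketch using the objects defined above) is to cross-tabulate the true group labels against the fitted cluster labels. Since K-means cluster numbers are arbitrary, a good result shows up as one dominant count in each row:
# Rows: true group labels from the simulation; columns: fitted K-means cluster labels
table(true = which, fitted = km.out$cluster)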
….