This case study uses the well-known USArrests dataset in R. For each of the n = 50 US states in 1973, this dataset hosts the information about p = 4 numeric variables:
| Variable | Description |
|---|---|
| Murder | The number of arrests for murder per 100,000 residents. |
| Assault | The number of arrests for assault per 100,000 residents. |
| Rape | The number of arrests for rape per 100,000 residents. |
| UrbanPop | The percent (from 0 to 100) of the population residing in urban areas. |
One would expect that the three crime-related variables are correlated at least to some extent and it may make sense to use PCA to combine them into a single variable measuring overall crime levels and reduce the dimension of the data from 4 to 2. We can then understand and visualize the data much more easily.
# CHUNK 1
### Load USArrests data
data(USArrests)
### View summary
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
# CHUNK 2
library(ggplot2)
#### names(USArrests) extracts the column names of the USArrests data
for (i in names(USArrests)) {
plot <- ggplot(USArrests, aes(x = USArrests[, i])) + geom_histogram() + xlab(i)
print(plot)}
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# CHUNK 3
### Create a correlation matrix
cor(USArrests)
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
# CHUNK 5
### To incorporate mean-centering and scaling, we set center = TRUE and scale. = TRUE
### By default, center = TRUE and scale. = FALSE
PCA <- prcomp(USArrests, center = TRUE, scale = TRUE)
### PC loadings
PCA$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
### PC scores
PCA$x
## PC1 PC2 PC3 PC4
## Alabama -0.97566045 1.12200121 -0.43980366 0.154696581
## Alaska -1.93053788 1.06242692 2.01950027 -0.434175454
## Arizona -1.74544285 -0.73845954 0.05423025 -0.826264240
## Arkansas 0.13999894 1.10854226 0.11342217 -0.180973554
## California -2.49861285 -1.52742672 0.59254100 -0.338559240
## Colorado -1.49934074 -0.97762966 1.08400162 0.001450164
## Connecticut 1.34499236 -1.07798362 -0.63679250 -0.117278736
## Delaware -0.04722981 -0.32208890 -0.71141032 -0.873113315
## Florida -2.98275967 0.03883425 -0.57103206 -0.095317042
## Georgia -1.62280742 1.26608838 -0.33901818 1.065974459
## Hawaii 0.90348448 -1.55467609 0.05027151 0.893733198
## Idaho 1.62331903 0.20885253 0.25719021 -0.494087852
## Illinois -1.36505197 -0.67498834 -0.67068647 -0.120794916
## Indiana 0.50038122 -0.15003926 0.22576277 0.420397595
## Iowa 2.23099579 -0.10300828 0.16291036 0.017379470
## Kansas 0.78887206 -0.26744941 0.02529648 0.204421034
## Kentucky 0.74331256 0.94880748 -0.02808429 0.663817237
## Louisiana -1.54909076 0.86230011 -0.77560598 0.450157791
## Maine 2.37274014 0.37260865 -0.06502225 -0.327138529
## Maryland -1.74564663 0.42335704 -0.15566968 -0.553450589
## Massachusetts 0.48128007 -1.45967706 -0.60337172 -0.177793902
## Michigan -2.08725025 -0.15383500 0.38100046 0.101343128
## Minnesota 1.67566951 -0.62590670 0.15153200 0.066640316
## Mississippi -0.98647919 2.36973712 -0.73336290 0.213342049
## Missouri -0.68978426 -0.26070794 0.37365033 0.223554811
## Montana 1.17353751 0.53147851 0.24440796 0.122498555
## Nebraska 1.25291625 -0.19200440 0.17380930 0.015733156
## Nevada -2.84550542 -0.76780502 1.15168793 0.311354436
## New Hampshire 2.35995585 -0.01790055 0.03648498 -0.032804291
## New Jersey -0.17974128 -1.43493745 -0.75677041 0.240936580
## New Mexico -1.96012351 0.14141308 0.18184598 -0.336121113
## New York -1.66566662 -0.81491072 -0.63661186 -0.013348844
## North Carolina -1.11208808 2.20561081 -0.85489245 -0.944789648
## North Dakota 2.96215223 0.59309738 0.29824930 -0.251434626
## Ohio 0.22369436 -0.73477837 -0.03082616 0.469152817
## Oklahoma 0.30864928 -0.28496113 -0.01515592 0.010228476
## Oregon -0.05852787 -0.53596999 0.93038718 -0.235390872
## Pennsylvania 0.87948680 -0.56536050 -0.39660218 0.355452378
## Rhode Island 0.85509072 -1.47698328 -1.35617705 -0.607402746
## South Carolina -1.30744986 1.91397297 -0.29751723 -0.130145378
## South Dakota 1.96779669 0.81506822 0.38538073 -0.108470512
## Tennessee -0.98969377 0.85160534 0.18619262 0.646302674
## Texas -1.34151838 -0.40833518 -0.48712332 0.636731051
## Utah 0.54503180 -1.45671524 0.29077592 -0.081486749
## Vermont 2.77325613 1.38819435 0.83280797 -0.143433697
## Virginia 0.09536670 0.19772785 0.01159482 0.209246429
## Washington 0.21472339 -0.96037394 0.61859067 -0.218628161
## West Virginia 2.08739306 1.41052627 0.10372163 0.130583080
## Wisconsin 2.05881199 -0.60512507 -0.13746933 0.182253407
## Wyoming 0.62310061 0.31778662 -0.23824049 -0.164976866
The first PC is defined by \[Z_1= -0.5358995(Murder) - 0.5831836(Assault)-0.2781909(UrbanPop) - 0.5434321 (Rape),\] Of course we have \(\phi_{11}^2+\phi_{21}^2+\phi_{31}^2+\phi_{41}^2=1\)
The second PC is defined by \[Z_2= 0.4181809(Murder) + 0.1879856(Assault)-0.8728062(UrbanPop) -0.1673186(Rape)\]
*PC1: The first principal component attaches the same weight of around 0.55 to the crime-related variables, which are positively correlation with each other (shown previously). The UrbanPop variable has a much smaller weight compared to the other variables. The more negative the PC1 score is for a state, the higher the crime rate is for that state. That is, PC1 can be used to interpret the measure of overall crime rate in a state.
*PC2: The second principal component has a very high negative weight on the Urban Population variable. This is contrasted with the high weights on the other 3 variables, the crime-related variables. That is, PC2 can be seen as a measure of the urbanization level of a state. The more negative PC2 is, the more urbanized the state is.
What is a biplot? A biplot overlays the scores and loading vectors of the first two PCs in a single graph
As soon as the PCs have been computed, we can plot their scores against one another to produce a low-dimensional view of the data. In particular, plotting the scores of the first two PCs against each other allows us to visualize the data in a two-dimensional scatterplot and inspect which observations and variables are similar to each other.
# CHUNK 6
################################################################################
#### cex argument indicates the amount by which plotting symbols should be scaled
#### cex = 0.6 means 40% smaller
#### scale = 0 ensures that the arrows are scaled to represent the loadings
################################################################################
### Generate a biplot
biplot(PCA, scale=0, cex = 0.5)
################################################################################
#### Bottom axis:PC1 score
#### Left axis:PC2 score
#### Top axis:loadings on PC1 (tallies in red)
#### Right axis:loadings on PC2 (tallies in red)
################################################################################
For instance, Murder is represented by the arrow heading to the north west and passing through (—0.5358995,0.4181809).
A biplot is a very helpful graphical tool as it reveals a lot of useful information about the variables as well as observations.
The coordinate of Murder, Assault, and Rape on the top axis are highly similar to each other. The coordinate of UrbanPop has a lower value than the values of the crime-related variables. So we can say that the first PC is a measure of a state’s crime level. The coordinate of UrbanPop on the right axis is far in the negative direction while the other variables are not close to UrbanPop. We can say that PC2 is dominated by the UrbanPop variable.
Florida: Florida has a greatly negative PC1 score and a PC2 score of basically 0. This suggests that Florida has a high level of crime and an average urbanization level. California: California has a greatly negative PC1 score and a fairly negative PC2 score. This suggests that California has a high level of crime and somewhat high urbanization level. *Virginia: Virginia has both a PC1 and PC2 score of virtually 0. This suggests that Virginia has an average level of crime and average level of urbanization.
# CHUNK 7
### Get PCA results using summary()
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
### Draw scree plot.
screeplot(PCA, npcs = 4)
### try to use argument type = "lines"
screeplot(PCA, type = "lines")
The first PC explains about 62% of the variability in the original dataset. The second PC explains about 25% of the variablity in the original dataset. The cumulative proportion of variance explained by both PC1 and PC2 is 87%. The first PC being dominated by the three crime-related variables and the first two PCs explaining most of the variability in the original dataset, I recommend using the first two PCs.
# CHUNK 8
### Define new features
USArrests$PC1 <- PCA$x[,1]
USArrests$PC2 <- PCA$x[,2]
### View the data using head()
head(USArrests)
## Murder Assault UrbanPop Rape PC1 PC2
## Alabama 13.2 236 58 21.2 -0.9756604 1.1220012
## Alaska 10.0 263 48 44.5 -1.9305379 1.0624269
## Arizona 8.1 294 80 31.0 -1.7454429 -0.7384595
## Arkansas 8.8 190 50 19.5 0.1399989 1.1085423
## California 9.0 276 91 40.6 -2.4986128 -1.5274267
## Colorado 7.9 204 78 38.7 -1.4993407 -0.9776297
It is importance to drop the variables that contribute to the new feature. Failure to do so will result in a duplication of information in the data and raise the issue of multicollinearity when a model is fitted. Using the second feature generation method as an example, we can delete the three crime-related variables
# CHUNK 9
### Deleting the original variables
USArrests[,c("Murder","Assault","UrbanPop","Rape")] <- NULL
head(USArrests)
## PC1 PC2
## Alabama -0.9756604 1.1220012
## Alaska -1.9305379 1.0624269
## Arizona -1.7454429 -0.7384595
## Arkansas 0.1399989 1.1085423
## California -2.4986128 -1.5274267
## Colorado -1.4993407 -0.9776297
# CHUNK 10
### Load USArrests data
data(USArrests)
### Create PC1 and PC2 variable in the data.
USArrests$PC1 <- PCA$x[,1]
USArrests$PC2 <- PCA$x[,2]
### View summary
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
## PC1 PC2
## Min. :-2.9828 Min. :-1.5547
## 1st Qu.:-1.3592 1st Qu.:-0.7198
## Median : 0.1774 Median :-0.1519
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1060 3rd Qu.: 0.7596
## Max. : 2.9622 Max. : 2.3697
First take out the four original variables to which cluster analysis is applied and scale them
# CHUNK 11
### Take out the original variables to which cluster analysis is applied
clusterdata <- USArrests[,c("Murder","Assault","UrbanPop","Rape")]
head(clusterdata)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
### Scale these variables
clusterdata <- as.data.frame(scale(clusterdata))
### Check dataset using head()
head(clusterdata)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
# CHUNK 12
### Perform K-means clustering from K=1 to K=10
################################################################################
#### ) uses random initial cluster centers
#### centers: specifies k, the number of clusters used
#### nstart: controls the number of random selection of initial cluster centers and only the round with the best result will be reported.(default = 1)
################################################################################
set.seed(1)
km1 <- kmeans(clusterdata, centers = 1, nstart = 20)
km2 <- kmeans(clusterdata, centers = 2, nstart = 20)
km3 <- kmeans(clusterdata, centers = 3, nstart = 20)
km4 <- kmeans(clusterdata, centers = 4, nstart = 20)
km5 <- kmeans(clusterdata, centers = 5, nstart = 20)
km6 <- kmeans(clusterdata, centers = 6, nstart = 20)
km7 <- kmeans(clusterdata, centers = 7, nstart = 20)
km8 <- kmeans(clusterdata, centers = 8, nstart = 20)
km9 <- kmeans(clusterdata, centers = 9, nstart = 20)
km10 <- kmeans(clusterdata, centers = 10, nstart = 20)
# CHUNK 13
### Make dataframe with two columns(k, within cluster variation)
elbow <- data.frame(k = 1:10, within_var = c(km1$tot.withinss,
km2$tot.withinss,
km3$tot.withinss,
km4$tot.withinss,
km5$tot.withinss,
km6$tot.withinss,
km7$tot.withinss,
km8$tot.withinss,
km9$tot.withinss,
km10$tot.withinss))
### Create an elbow plot
library(ggplot2)
ggplot(data = elbow, aes(x = k, y = within_var)) + geom_point() + geom_line() + ggtitle("Elbow Plot")
# CHUNK 14
### Type the name of final clustering
km4
## K-means clustering with 4 clusters of sizes 13, 8, 13, 16
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 0.6950701 1.0394414 0.7226370 1.27693964
## 2 1.4118898 0.8743346 -0.8145211 0.01927104
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4 -0.4894375 -0.3826001 0.5758298 -0.26165379
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 2 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 4 4 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 4 3 1 4 3
## Kansas Kentucky Louisiana Maine Maryland
## 4 3 2 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 4 1 3 2 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 4
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 2 3 4
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 4 4 4 4 2
## South Dakota Tennessee Texas Utah Vermont
## 3 2 1 4 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 4 4 3 3 4
##
## Within cluster sum of squares by cluster:
## [1] 19.922437 8.316061 11.952463 16.212213
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
To visualize, first extract the vector of group assignments and convert it to a factor.
# CHUNK 15
### create group variable which defines group assignments and convert it to a factor
USArrests$Group <- as.factor(km4$cluster)
head(USArrests)
## Murder Assault UrbanPop Rape PC1 PC2 Group
## Alabama 13.2 236 58 21.2 -0.9756604 1.1220012 2
## Alaska 10.0 263 48 44.5 -1.9305379 1.0624269 1
## Arizona 8.1 294 80 31.0 -1.7454429 -0.7384595 1
## Arkansas 8.8 190 50 19.5 0.1399989 1.1085423 2
## California 9.0 276 91 40.6 -2.4986128 -1.5274267 1
## Colorado 7.9 204 78 38.7 -1.4993407 -0.9776297 1
# CHUNK 16
### Make a scatterplot for the first two PCs colored by the group assignments
ggplot(data = USArrests, aes(x = PC1, y = PC2, col = Group, label = row.names(USArrests))) + geom_point() + geom_text(vjust = 1)
*Red Cluster: PC1 scores are greatly negative, indicating a high crime rate level.
*Green Cluster: PC1 scores are in the range of about -2 and 0. This indicates a moderately high crime rate. Based on the positive PC2 scores, the states in this cluster have a low urbanization level.
*Purple Cluster: PC1 scores fall between 0 and 1. This represents a moderately low crime rate level. PC2 scores are spread out from -1 to 2. This suggests an average urbanization level.
*Blue Cluster: PC1 scores are between 1 and 3. This indicates a low crime rate level.
# CHUNK 17
### Do a hierarchical cluster analysis with complete, single, average, and centroid linkage
hc.complete <- hclust(dist(clusterdata), method = "complete")
hc.single <- hclust(dist(clusterdata), method = "single")
hc.average <- hclust(dist(clusterdata), method = "average")
hc.centroid <- hclust(dist(clusterdata), method = "centroid")
### Draw dendrogram of the hierarchical cluster
plot(hc.complete, cex = 0.5)
plot(hc.single, cex = 0.5)
plot(hc.average, cex = 0.5)
plot(hc.centroid, cex = 0.5)
The complete and average linkage yield more balanced clusters. In these dendrograms, a horizontal cut will tend to produce clusters with a similar number of observations. The dendrograms using single and centroid linkage have extended, trailing clusters. Moreover, some observations take up a single cluster until fusing with other observations later in the process.
In what follows, we will use the complete linkage.
# CHUNK 18
### Create clusters variable which defines cluster assignments and convert it to a factor
### Use the cutree(meaning to cut a tree) function, to assign each observation to one of clusters
USArrests$Cluster <- as.factor(cutree(hc.complete, k = 4))
### Make a scatterplot for the first two PCs colored by the cluster assignments
ggplot(data = USArrests, aes(x = PC1, y = PC2, col = Cluster, label = row.names(USArrests))) + geom_point() + geom_text(vjust = 1)
*Comparing the results of k-means clustering and hierarchical clustering with complete linkage produce very similarly distributed sets of observations.
*Red Cluster: PC1 scores are greatly negative, indicating a high crime rate level.
*Green Cluster: PC1 scores are in the range of about -2 and 0. This indicates a moderately high crime rate. Based on the positive PC2 scores, the states in this cluster have a low urbanization level.
*Purple Cluster: PC1 scores fall between 0 and 1. This represents a moderately low crime rate level. PC2 scores are spread out from -1 to 2. This suggests an average urbanization level.
*Blue Cluster: PC1 scores are between 1 and 3. This indicates a low crime rate level.
[Comment 1: Explain results]
Murder, Assault, and Rape are positively correlated with one another, with all pairwise correlations greater that 0.5. These correlations suggest that PCA may be an effective technique to compress the three crime-related variables into a single variable while retaining most of the information in the original data set. *There seems to not be a strong linear relationship between UrbanPop and the 3 crime-related variables.