This case study uses the well-known USArrests dataset in R. For each of the n = 50 US states in 1973, this dataset hosts the information about p = 4 numeric variables:

Variable Description
Murder The number of arrests for murder per 100,000 residents.
Assault The number of arrests for assault per 100,000 residents.
Rape The number of arrests for rape per 100,000 residents.
UrbanPop The percent (from 0 to 100) of the population residing in urban areas.

One would expect that the three crime-related variables are correlated at least to some extent and it may make sense to use PCA to combine them into a single variable measuring overall crime levels and reduce the dimension of the data from 4 to 2. We can then understand and visualize the data much more easily.

TASK 1: Explore the data

# CHUNK 1

### Load USArrests data
data(USArrests)
### View summary 
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

Univariate data exploration. Draw histograms

# CHUNK 2
library(ggplot2)

#### names(USArrests) extracts the column names of the USArrests data
 
for (i in names(USArrests)) {
  plot <- ggplot(USArrests, aes(x = USArrests[, i])) + geom_histogram() + xlab(i)
  print(plot)}
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Bivariate data exploration. Create a correlation matrix

# CHUNK 3

### Create a correlation matrix
cor(USArrests)
##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000
[Comment 1: Explain results]

Murder, Assault, and Rape are positively correlated with one another, with all pairwise correlations greater that 0.5. These correlations suggest that PCA may be an effective technique to compress the three crime-related variables into a single variable while retaining most of the information in the original data set. *There seems to not be a strong linear relationship between UrbanPop and the 3 crime-related variables.


Case Study 1: Principal Components Analysis

TASK 2: Conduct a principal components analysis

Implementing a PCA

# CHUNK 5

### To incorporate mean-centering and scaling, we set center = TRUE and scale. = TRUE 
### By default, center = TRUE and scale. = FALSE
PCA <- prcomp(USArrests, center = TRUE, scale = TRUE)

### PC loadings
PCA$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
### PC scores
PCA$x
##                        PC1         PC2         PC3          PC4
## Alabama        -0.97566045  1.12200121 -0.43980366  0.154696581
## Alaska         -1.93053788  1.06242692  2.01950027 -0.434175454
## Arizona        -1.74544285 -0.73845954  0.05423025 -0.826264240
## Arkansas        0.13999894  1.10854226  0.11342217 -0.180973554
## California     -2.49861285 -1.52742672  0.59254100 -0.338559240
## Colorado       -1.49934074 -0.97762966  1.08400162  0.001450164
## Connecticut     1.34499236 -1.07798362 -0.63679250 -0.117278736
## Delaware       -0.04722981 -0.32208890 -0.71141032 -0.873113315
## Florida        -2.98275967  0.03883425 -0.57103206 -0.095317042
## Georgia        -1.62280742  1.26608838 -0.33901818  1.065974459
## Hawaii          0.90348448 -1.55467609  0.05027151  0.893733198
## Idaho           1.62331903  0.20885253  0.25719021 -0.494087852
## Illinois       -1.36505197 -0.67498834 -0.67068647 -0.120794916
## Indiana         0.50038122 -0.15003926  0.22576277  0.420397595
## Iowa            2.23099579 -0.10300828  0.16291036  0.017379470
## Kansas          0.78887206 -0.26744941  0.02529648  0.204421034
## Kentucky        0.74331256  0.94880748 -0.02808429  0.663817237
## Louisiana      -1.54909076  0.86230011 -0.77560598  0.450157791
## Maine           2.37274014  0.37260865 -0.06502225 -0.327138529
## Maryland       -1.74564663  0.42335704 -0.15566968 -0.553450589
## Massachusetts   0.48128007 -1.45967706 -0.60337172 -0.177793902
## Michigan       -2.08725025 -0.15383500  0.38100046  0.101343128
## Minnesota       1.67566951 -0.62590670  0.15153200  0.066640316
## Mississippi    -0.98647919  2.36973712 -0.73336290  0.213342049
## Missouri       -0.68978426 -0.26070794  0.37365033  0.223554811
## Montana         1.17353751  0.53147851  0.24440796  0.122498555
## Nebraska        1.25291625 -0.19200440  0.17380930  0.015733156
## Nevada         -2.84550542 -0.76780502  1.15168793  0.311354436
## New Hampshire   2.35995585 -0.01790055  0.03648498 -0.032804291
## New Jersey     -0.17974128 -1.43493745 -0.75677041  0.240936580
## New Mexico     -1.96012351  0.14141308  0.18184598 -0.336121113
## New York       -1.66566662 -0.81491072 -0.63661186 -0.013348844
## North Carolina -1.11208808  2.20561081 -0.85489245 -0.944789648
## North Dakota    2.96215223  0.59309738  0.29824930 -0.251434626
## Ohio            0.22369436 -0.73477837 -0.03082616  0.469152817
## Oklahoma        0.30864928 -0.28496113 -0.01515592  0.010228476
## Oregon         -0.05852787 -0.53596999  0.93038718 -0.235390872
## Pennsylvania    0.87948680 -0.56536050 -0.39660218  0.355452378
## Rhode Island    0.85509072 -1.47698328 -1.35617705 -0.607402746
## South Carolina -1.30744986  1.91397297 -0.29751723 -0.130145378
## South Dakota    1.96779669  0.81506822  0.38538073 -0.108470512
## Tennessee      -0.98969377  0.85160534  0.18619262  0.646302674
## Texas          -1.34151838 -0.40833518 -0.48712332  0.636731051
## Utah            0.54503180 -1.45671524  0.29077592 -0.081486749
## Vermont         2.77325613  1.38819435  0.83280797 -0.143433697
## Virginia        0.09536670  0.19772785  0.01159482  0.209246429
## Washington      0.21472339 -0.96037394  0.61859067 -0.218628161
## West Virginia   2.08739306  1.41052627  0.10372163  0.130583080
## Wisconsin       2.05881199 -0.60512507 -0.13746933  0.182253407
## Wyoming         0.62310061  0.31778662 -0.23824049 -0.164976866

The first PC is defined by \[Z_1= -0.5358995(Murder) - 0.5831836(Assault)-0.2781909(UrbanPop) - 0.5434321 (Rape),\] Of course we have \(\phi_{11}^2+\phi_{21}^2+\phi_{31}^2+\phi_{41}^2=1\)

The second PC is defined by \[Z_2= 0.4181809(Murder) + 0.1879856(Assault)-0.8728062(UrbanPop) -0.1673186(Rape)\]

[Comment 2: Interpret the first and second PC]

*PC1: The first principal component attaches the same weight of around 0.55 to the crime-related variables, which are positively correlation with each other (shown previously). The UrbanPop variable has a much smaller weight compared to the other variables. The more negative the PC1 score is for a state, the higher the crime rate is for that state. That is, PC1 can be used to interpret the measure of overall crime rate in a state.

*PC2: The second principal component has a very high negative weight on the Urban Population variable. This is contrasted with the high weights on the other 3 variables, the crime-related variables. That is, PC2 can be seen as a measure of the urbanization level of a state. The more negative PC2 is, the more urbanized the state is.

TASK 3: Generate and interpret a biplot

What is a biplot? A biplot overlays the scores and loading vectors of the first two PCs in a single graph

As soon as the PCs have been computed, we can plot their scores against one another to produce a low-dimensional view of the data. In particular, plotting the scores of the first two PCs against each other allows us to visualize the data in a two-dimensional scatterplot and inspect which observations and variables are similar to each other.

# CHUNK 6
################################################################################
#### cex argument indicates the amount by which plotting symbols should be scaled
#### cex = 0.6 means 40% smaller
#### scale = 0 ensures that the arrows are scaled to represent the loadings
################################################################################

### Generate a biplot
biplot(PCA, scale=0, cex = 0.5)

################################################################################
#### Bottom axis:PC1 score
#### Left axis:PC2 score
#### Top axis:loadings on PC1 (tallies in red)
#### Right axis:loadings on PC2 (tallies in red)
################################################################################

For instance, Murder is represented by the arrow heading to the north west and passing through (—0.5358995,0.4181809).

A biplot is a very helpful graphical tool as it reveals a lot of useful information about the variables as well as observations.

[Comment 3: Use the biplot to interpret the PCs]

The coordinate of Murder, Assault, and Rape on the top axis are highly similar to each other. The coordinate of UrbanPop has a lower value than the values of the crime-related variables. So we can say that the first PC is a measure of a state’s crime level. The coordinate of UrbanPop on the right axis is far in the negative direction while the other variables are not close to UrbanPop. We can say that PC2 is dominated by the UrbanPop variable.

[Comment 4: Use the biplot to interpret the florida, california, and virginia]

Florida: Florida has a greatly negative PC1 score and a PC2 score of basically 0. This suggests that Florida has a high level of crime and an average urbanization level. California: California has a greatly negative PC1 score and a fairly negative PC2 score. This suggests that California has a high level of crime and somewhat high urbanization level. *Virginia: Virginia has both a PC1 and PC2 score of virtually 0. This suggests that Virginia has an average level of crime and average level of urbanization.

[Comment 5: Which variables are more correlated with each other?]

*In the biplot, we can see that the three crime-related variables are quite close to each other. This suggests that these three variables are positively correlated with each other. The UrbanPop variable looks to be separated from the crime-related variables. This indicates that it is less correlated with the crime-related variables.

TASK 4: Use observations from the PCA to generate a new feature

# CHUNK 7
### Get PCA results using summary()
summary(PCA)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
### Draw scree plot. 
screeplot(PCA, npcs = 4)

### try to use argument type = "lines" 
screeplot(PCA, type = "lines")

[Comment 6: Interpret results and recommend the number of PCs]

The first PC explains about 62% of the variability in the original dataset. The second PC explains about 25% of the variablity in the original dataset. The cumulative proportion of variance explained by both PC1 and PC2 is 87%. The first PC being dominated by the three crime-related variables and the first two PCs explaining most of the variability in the original dataset, I recommend using the first two PCs.

Feature generation based on PC

# CHUNK 8

### Define new features
USArrests$PC1 <- PCA$x[,1]
USArrests$PC2 <- PCA$x[,2]
### View the data using head()
head(USArrests)
##            Murder Assault UrbanPop Rape        PC1        PC2
## Alabama      13.2     236       58 21.2 -0.9756604  1.1220012
## Alaska       10.0     263       48 44.5 -1.9305379  1.0624269
## Arizona       8.1     294       80 31.0 -1.7454429 -0.7384595
## Arkansas      8.8     190       50 19.5  0.1399989  1.1085423
## California    9.0     276       91 40.6 -2.4986128 -1.5274267
## Colorado      7.9     204       78 38.7 -1.4993407 -0.9776297

Deleting the original variables

It is importance to drop the variables that contribute to the new feature. Failure to do so will result in a duplication of information in the data and raise the issue of multicollinearity when a model is fitted. Using the second feature generation method as an example, we can delete the three crime-related variables

# CHUNK 9

### Deleting the original variables
USArrests[,c("Murder","Assault","UrbanPop","Rape")] <- NULL
head(USArrests)
##                   PC1        PC2
## Alabama    -0.9756604  1.1220012
## Alaska     -1.9305379  1.0624269
## Arizona    -1.7454429 -0.7384595
## Arkansas    0.1399989  1.1085423
## California -2.4986128 -1.5274267
## Colorado   -1.4993407 -0.9776297

Case Study 2: Cluster Analysis

Recreate the results we obtained in PCA case study.

# CHUNK 10

### Load USArrests data
data(USArrests)

### Create PC1 and PC2 variable in the data. 
USArrests$PC1 <- PCA$x[,1]
USArrests$PC2 <- PCA$x[,2]
### View summary
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00  
##       PC1               PC2         
##  Min.   :-2.9828   Min.   :-1.5547  
##  1st Qu.:-1.3592   1st Qu.:-0.7198  
##  Median : 0.1774   Median :-0.1519  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.1060   3rd Qu.: 0.7596  
##  Max.   : 2.9622   Max.   : 2.3697

TASK 6: Perform a k-means cluster analysis

First take out the four original variables to which cluster analysis is applied and scale them

# CHUNK 11

### Take out the original variables to which cluster analysis is applied
clusterdata <- USArrests[,c("Murder","Assault","UrbanPop","Rape")]
head(clusterdata)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
### Scale these variables
clusterdata <- as.data.frame(scale(clusterdata))

### Check dataset using head()
head(clusterdata)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
# CHUNK 12

### Perform K-means clustering from K=1 to K=10
################################################################################
####  ) uses random initial cluster centers
#### centers: specifies k, the number of clusters used
#### nstart: controls the number of random selection of initial cluster centers and only the round with the best result will be reported.(default = 1)
################################################################################
set.seed(1)
km1 <- kmeans(clusterdata, centers = 1, nstart = 20)
km2 <- kmeans(clusterdata, centers = 2, nstart = 20)
km3 <- kmeans(clusterdata, centers = 3, nstart = 20)
km4 <- kmeans(clusterdata, centers = 4, nstart = 20)
km5 <- kmeans(clusterdata, centers = 5, nstart = 20)
km6 <- kmeans(clusterdata, centers = 6, nstart = 20)
km7 <- kmeans(clusterdata, centers = 7, nstart = 20)
km8 <- kmeans(clusterdata, centers = 8, nstart = 20)
km9 <- kmeans(clusterdata, centers = 9, nstart = 20)
km10 <- kmeans(clusterdata, centers = 10, nstart = 20)

Choosing the number of clusters. Create an elbow plot.

# CHUNK 13

### Make dataframe with two columns(k, within cluster variation) 
elbow <- data.frame(k = 1:10, within_var = c(km1$tot.withinss,
                                             km2$tot.withinss,
                                             km3$tot.withinss,
                                             km4$tot.withinss,
                                             km5$tot.withinss,
                                             km6$tot.withinss,
                                             km7$tot.withinss,
                                             km8$tot.withinss,
                                             km9$tot.withinss,
                                             km10$tot.withinss))
### Create an elbow plot
library(ggplot2)
ggplot(data = elbow, aes(x = k, y = within_var)) + geom_point() + geom_line() + ggtitle("Elbow Plot")

[Comment 7: Determine K ]
  • The elbow point looks to be at k = 4. After k = 4, the elbow plot looks to be flat. This flat line indicates that there is no significant reduction in within cluster variation using 5 or more clusters.
# CHUNK 14

### Type the name of final clustering 
km4
## K-means clustering with 4 clusters of sizes 13, 8, 13, 16
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  0.6950701  1.0394414  0.7226370  1.27693964
## 2  1.4118898  0.8743346 -0.8145211  0.01927104
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              4              4              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              4              3              1              4              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              4              3              2              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              4              1              3              2              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              4 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              2              3              4 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              4              4              4              4              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              1              4              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              4              4              3              3              4 
## 
## Within cluster sum of squares by cluster:
## [1] 19.922437  8.316061 11.952463 16.212213
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualizing the resulting cluster assignments.

To visualize, first extract the vector of group assignments and convert it to a factor.

# CHUNK 15

### create group variable which defines group assignments and convert it to a factor
USArrests$Group <- as.factor(km4$cluster)
head(USArrests)
##            Murder Assault UrbanPop Rape        PC1        PC2 Group
## Alabama      13.2     236       58 21.2 -0.9756604  1.1220012     2
## Alaska       10.0     263       48 44.5 -1.9305379  1.0624269     1
## Arizona       8.1     294       80 31.0 -1.7454429 -0.7384595     1
## Arkansas      8.8     190       50 19.5  0.1399989  1.1085423     2
## California    9.0     276       91 40.6 -2.4986128 -1.5274267     1
## Colorado      7.9     204       78 38.7 -1.4993407 -0.9776297     1
# CHUNK 16

### Make a scatterplot for the first two PCs colored by the group assignments
ggplot(data = USArrests, aes(x = PC1, y = PC2, col = Group, label = row.names(USArrests))) + geom_point() + geom_text(vjust = 1)

[Comment 8: Explain results ]

*Red Cluster: PC1 scores are greatly negative, indicating a high crime rate level.

*Green Cluster: PC1 scores are in the range of about -2 and 0. This indicates a moderately high crime rate. Based on the positive PC2 scores, the states in this cluster have a low urbanization level.

*Purple Cluster: PC1 scores fall between 0 and 1. This represents a moderately low crime rate level. PC2 scores are spread out from -1 to 2. This suggests an average urbanization level.

*Blue Cluster: PC1 scores are between 1 and 3. This indicates a low crime rate level.

TASK 7: Perform a hierarchical cluster analysis

# CHUNK 17

### Do a hierarchical cluster analysis with complete, single, average, and centroid linkage 
hc.complete <- hclust(dist(clusterdata), method = "complete")
hc.single <- hclust(dist(clusterdata), method = "single")
hc.average <- hclust(dist(clusterdata), method = "average")
hc.centroid <- hclust(dist(clusterdata), method = "centroid")

### Draw dendrogram of the hierarchical cluster
plot(hc.complete, cex = 0.5)

plot(hc.single, cex = 0.5)

plot(hc.average, cex = 0.5)

plot(hc.centroid, cex = 0.5)

[Comment 9: Describe dendrogram based on linkages]

The complete and average linkage yield more balanced clusters. In these dendrograms, a horizontal cut will tend to produce clusters with a similar number of observations. The dendrograms using single and centroid linkage have extended, trailing clusters. Moreover, some observations take up a single cluster until fusing with other observations later in the process.

In what follows, we will use the complete linkage.

# CHUNK 18

### Create clusters variable which defines cluster assignments and convert it to a factor
### Use the cutree(meaning to cut a tree) function, to assign each observation to one of clusters
USArrests$Cluster <- as.factor(cutree(hc.complete, k = 4))

### Make a scatterplot for the first two PCs colored by the cluster assignments
ggplot(data = USArrests, aes(x = PC1, y = PC2, col = Cluster, label = row.names(USArrests))) + geom_point() + geom_text(vjust = 1)

[Comment 10: Explain results ]

*Comparing the results of k-means clustering and hierarchical clustering with complete linkage produce very similarly distributed sets of observations.

*Red Cluster: PC1 scores are greatly negative, indicating a high crime rate level.

*Green Cluster: PC1 scores are in the range of about -2 and 0. This indicates a moderately high crime rate. Based on the positive PC2 scores, the states in this cluster have a low urbanization level.

*Purple Cluster: PC1 scores fall between 0 and 1. This represents a moderately low crime rate level. PC2 scores are spread out from -1 to 2. This suggests an average urbanization level.

*Blue Cluster: PC1 scores are between 1 and 3. This indicates a low crime rate level.