Understanding USArrests data using PCA

Data Exploration

Data structure

The dataset contains 50 observations (one per US state) and four variables, all numeric.

library(dplyr)  # provides glimpse()
glimpse(USArrests)

## Observations: 50
## Variables: 4
## $ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4,...
## $ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46,...
## $ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 6...
## $ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9,...

Data summary

The summary of each variable is shown below.

summary(USArrests)

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

Mean

We notice that the variables have vastly different means.

apply(USArrests, 2, mean)

##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

Variance

The variables also have vastly different variances. They are not measured on comparable scales: UrbanPop is the percentage of each state's population living in urban areas, whereas Rape is the number of arrests for rape per 100,000 residents.

apply(USArrests, 2, var)

##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916
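Differences in scale like these are usually removed by standardizing each variable. The lines below are a minimal sketch of that step using base R's scale(); the name usa.scaled is introduced here for illustration and is not part of the original analysis.

```r
# Standardize each column to mean 0 and standard deviation 1.
# `usa.scaled` is an illustrative name, not an object from the analysis.
usa.scaled <- scale(USArrests)

# After scaling, every column mean is numerically zero ...
round(colMeans(usa.scaled), 10)

# ... and every column variance is 1, so no variable dominates by scale alone.
apply(usa.scaled, 2, var)
```

Once the variables are standardized, each one contributes the same amount of variance, so PCA is no longer driven by Assault simply because it has the largest raw variance.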

Checking for missing values

There are no missing (NA) values in the dataset.

colSums(is.na(USArrests))

##   Murder  Assault UrbanPop     Rape 
##        0        0        0        0

Checking for correlation

The correlation plot shows that the three crime variables (Murder, Assault and Rape) are positively correlated with one another.

library(corrplot)  # provides corrplot()
corrplot(cor(USArrests), order = "hclust")
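The same correlations can also be read as numbers rather than from a plot. This is a small sketch using only base R; cor.mat and crime are names introduced here for illustration.

```r
# Pairwise Pearson correlations, rounded for readability.
# `cor.mat` and `crime` are illustrative names.
cor.mat <- round(cor(USArrests), 2)
cor.mat

# Restrict the matrix to the three crime variables.
crime <- c("Murder", "Assault", "Rape")
cor.mat[crime, crime]
```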

Principal Components

We compute principal components for the four variables: uncorrelated linear combinations of the variables that successively capture as much of the variance in the dataset as possible. Because the variables have very different variances, we need to scale them to unit variance before computing the principal components.

The rotation matrix contains the principal component loading vectors, one per column.

pca.res <- prcomp(USArrests, scale. = TRUE)  # the argument is named `scale.`
pca.res$rotation

##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
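To make the meaning of the loadings concrete, the sketch below checks three standard properties: the loading vectors are orthonormal, they match the eigenvectors of the correlation matrix up to sign, and the scores are the standardized data projected onto them. The names pca.check, eig and scores.manual are introduced here for illustration only.

```r
# Illustrative check; none of these objects belong to the original analysis.
pca.check <- prcomp(USArrests, scale. = TRUE)
eig <- eigen(cor(USArrests))

# Loading vectors are orthonormal: t(V) %*% V is the identity matrix.
round(crossprod(pca.check$rotation), 10)

# Up to sign, the loadings equal the eigenvectors of the correlation matrix,
# so this maximum absolute difference is numerically zero.
max(abs(abs(pca.check$rotation) - abs(eig$vectors)))

# Scores are the standardized data projected onto the loading vectors.
scores.manual <- scale(USArrests) %*% pca.check$rotation
max(abs(scores.manual - pca.check$x))
```

The sign of each loading vector is arbitrary, which is why the comparison with the eigenvectors uses absolute values.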

Checking the contribution of each principal component:

The amount of variance explained by each principal component:

pca.var <- pca.res$sdev^2
pca.var

## [1] 2.4802416 0.9897652 0.3565632 0.1734301
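As a quick sanity check on these values, the sketch below confirms that the component variances are the eigenvalues of the correlation matrix and that they add up to the number of standardized variables (four). pc.var and total.var are illustrative names.

```r
# Variances of the principal components of the standardized data.
# `pc.var` and `total.var` are illustrative names.
pc.var <- prcomp(USArrests, scale. = TRUE)$sdev^2

# Each standardized variable has variance 1, so the total is 4.
total.var <- sum(pc.var)
total.var

# The component variances are the eigenvalues of the correlation matrix.
all.equal(pc.var, eigen(cor(USArrests))$values)
```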

The percentage of variance explained by each principal component:

The first principal component explains 62% of the variance, the second 25%, the third 9%, and the last the remaining 4%.
Hence the first two principal components together explain about 87% of the variance.

var.ratio <- pca.var / sum(pca.var)
var.ratio

## [1] 0.62006039 0.24744129 0.08914080 0.04335752
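The same proportions are also available from summary() on a prcomp object, which can serve as a cross-check on the manual calculation. This is a sketch; pca.sum and prop are names introduced here.

```r
# summary() on a prcomp object reports the standard deviation, the proportion
# of variance and the cumulative proportion for each component.
# `pca.sum` and `prop` are illustrative names.
pca.sum <- summary(prcomp(USArrests, scale. = TRUE))
pca.sum$importance

# Extract only the row holding the proportion of variance explained.
prop <- pca.sum$importance["Proportion of Variance", ]
prop
```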

Choosing the number of required principal components:

We typically decide on the number of principal components required to visualize the data by examining a scree plot. We choose the smallest number of principal components needed to explain a sizable share of the variation in the data. To do so, we look for the point after which the proportion of variance explained by each additional component drops off. This is also called the elbow point.

We see that a fair amount of variance is explained by the first two principal components, and that there is an elbow after the second component.

plot(var.ratio, xlab = "Principal Component", ylab = "Proportion of Variance Explained", ylim = c(0, 1), type = "b")

We already saw that the third principal component explains less than 10% of the variance and the last one even less (about 4%). Hence we decide to go with two principal components.

plot(cumsum(var.ratio), xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", ylim = c(0, 1), type = "b")
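Reading the elbow from a plot is a judgment call. One complementary rule of thumb is to keep the smallest number of components whose cumulative share of variance reaches a chosen cutoff. The sketch below assumes a cutoff of 85%, which is picked only for illustration and does not come from the analysis; var.share, threshold and n.comp are likewise illustrative names.

```r
# Proportion of variance explained by each component.
# `var.share`, `threshold` and `n.comp` are illustrative names.
var.share <- prcomp(USArrests, scale. = TRUE)$sdev^2
var.share <- var.share / sum(var.share)

# Assumed cutoff, chosen for illustration only.
threshold <- 0.85

# Smallest number of components whose cumulative share reaches the cutoff.
n.comp <- which(cumsum(var.share) >= threshold)[1]
n.comp
```

With this cutoff the rule selects two components, which agrees with the elbow in the scree plot. A different cutoff could give a different answer.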

Creating 2 principal components:

Examining the loadings of the first two principal components, we see that:

The first loading vector places approximately equal weight on Assault, Murder, and Rape, with much less weight on UrbanPop. Hence this component roughly corresponds to a measure of overall rates of serious crimes.
The second loading vector places most of its weight on UrbanPop and much less weight on the other three features. Hence, this component roughly corresponds to the level of urbanization of the state.

pca.res <- prcomp(USArrests, scale. = TRUE, rank. = 2)  # keep only the first two components
pca.res$rotation

##                 PC1        PC2
## Murder   -0.5358995  0.4181809
## Assault  -0.5831836  0.1879856
## UrbanPop -0.2781909 -0.8728062
## Rape     -0.5434321 -0.1673186
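One way to see what is lost by keeping only two components is to rebuild the standardized data from them and measure how much of the total variation survives. The sketch below assumes R 3.4.0 or later for the rank. argument; z, pca2, z.hat and retained are illustrative names.

```r
# Standardized data and a rank-2 PCA fit.
# `z`, `pca2`, `z.hat` and `retained` are illustrative names.
z <- scale(USArrests)
pca2 <- prcomp(USArrests, scale. = TRUE, rank. = 2)

# Rank-2 approximation of the standardized data: scores times loadings.
z.hat <- pca2$x %*% t(pca2$rotation)

# Share of the total sum of squares retained by the approximation.
retained <- 1 - sum((z - z.hat)^2) / sum(z^2)
retained
```

The retained share equals the cumulative proportion of variance explained by the first two components, roughly 87%.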

The biplot below maps the 50 states onto the first two principal components; the loading vectors of the four variables are drawn as arrows. The signs of the loadings and scores are flipped before plotting so that positive values correspond to higher crime and higher urbanization. This does not change the fit, because the sign of a principal component is arbitrary.

States with large positive scores on the first component, such as California, Nevada and Florida, have high crime rates, while states with large negative scores, such as North Dakota, have low crime rates.
California also has a high score on the second component, indicating a high level of urbanization, while the opposite is true for states like Mississippi.
States close to zero on both components, such as Indiana, have approximately average levels of both crime and urbanization.

pca.res$rotation <- -pca.res$rotation
pca.res$x <- -pca.res$x
biplot(pca.res, scale = 0)
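Flipping the signs of both the loadings and the scores is harmless because the two sign changes cancel. A minimal sketch of that check, with pca.a, recon.before and recon.after as illustrative names:

```r
# Illustrative check that a joint sign flip leaves the PCA fit unchanged.
pca.a <- prcomp(USArrests, scale. = TRUE)

# Standardized data rebuilt from scores and loadings, before and after the flip.
recon.before <- pca.a$x %*% t(pca.a$rotation)
recon.after <- (-pca.a$x) %*% t(-pca.a$rotation)
all.equal(recon.before, recon.after)

# The variance of each score vector is unaffected by the flip as well.
all.equal(apply(pca.a$x, 2, var), apply(-pca.a$x, 2, var))
```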

Checking the principal component scores for all 50 states:

pca.res$x

##                        PC1         PC2
## Alabama         0.97566045 -1.12200121
## Alaska          1.93053788 -1.06242692
## Arizona         1.74544285  0.73845954
## Arkansas       -0.13999894 -1.10854226
## California      2.49861285  1.52742672
## Colorado        1.49934074  0.97762966
## Connecticut    -1.34499236  1.07798362
## Delaware        0.04722981  0.32208890
## Florida         2.98275967 -0.03883425
## Georgia         1.62280742 -1.26608838
## Hawaii         -0.90348448  1.55467609
## Idaho          -1.62331903 -0.20885253
## Illinois        1.36505197  0.67498834
## Indiana        -0.50038122  0.15003926
## Iowa           -2.23099579  0.10300828
## Kansas         -0.78887206  0.26744941
## Kentucky       -0.74331256 -0.94880748
## Louisiana       1.54909076 -0.86230011
## Maine          -2.37274014 -0.37260865
## Maryland        1.74564663 -0.42335704
## Massachusetts  -0.48128007  1.45967706
## Michigan        2.08725025  0.15383500
## Minnesota      -1.67566951  0.62590670
## Mississippi     0.98647919 -2.36973712
## Missouri        0.68978426  0.26070794
## Montana        -1.17353751 -0.53147851
## Nebraska       -1.25291625  0.19200440
## Nevada          2.84550542  0.76780502
## New Hampshire  -2.35995585  0.01790055
## New Jersey      0.17974128  1.43493745
## New Mexico      1.96012351 -0.14141308
## New York        1.66566662  0.81491072
## North Carolina  1.11208808 -2.20561081
## North Dakota   -2.96215223 -0.59309738
## Ohio           -0.22369436  0.73477837
## Oklahoma       -0.30864928  0.28496113
## Oregon          0.05852787  0.53596999
## Pennsylvania   -0.87948680  0.56536050
## Rhode Island   -0.85509072  1.47698328
## South Carolina  1.30744986 -1.91397297
## South Dakota   -1.96779669 -0.81506822
## Tennessee       0.98969377 -0.85160534
## Texas           1.34151838  0.40833518
## Utah           -0.54503180  1.45671524
## Vermont        -2.77325613 -1.38819435
## Virginia       -0.09536670 -0.19772785
## Washington     -0.21472339  0.96037394
## West Virginia  -2.08739306 -1.41052627
## Wisconsin      -2.05881199  0.60512507
## Wyoming        -0.62310061 -0.31778662
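Scanning a 50-row table for extremes is error-prone, so the states at either end of the first component can be pulled out directly. In this sketch the sign is oriented from the Murder loading instead of being hard-coded, since the sign returned by prcomp() can differ between platforms. pca.or, flip and pc1 are illustrative names.

```r
# Illustrative ranking of states by their first principal component score.
pca.or <- prcomp(USArrests, scale. = TRUE, rank. = 2)

# Orient PC1 so that a higher score means more crime (positive Murder loading).
flip <- sign(pca.or$rotation["Murder", "PC1"])
pc1 <- flip * pca.or$x[, "PC1"]

# Three states with the highest and the lowest scores on the first component.
names(head(sort(pc1, decreasing = TRUE), 3))
names(head(sort(pc1), 3))
```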