This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas.
A data frame with 50 observations on 4 variables.
The following packages have been used for the analysis:
library(ggplot2)
library(dplyr)
library(corrplot)
library(corrr)
library(DT)
All four variables are numeric: Murder and Rape are stored as doubles, Assault and UrbanPop as integers.
glimpse(USArrests)
## Observations: 50
## Variables: 4
## $ Murder <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4,...
## $ Assault <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46,...
## $ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 6...
## $ Rape <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9,...
The summary of each variable is shown below.
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
We notice that the variables have vastly different means.
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
The variables also have vastly different variances. They are not measured on comparable scales: UrbanPop is the percentage of each state's population living in an urban area, whereas Rape is the number of arrests for rape per 100,000 residents.
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
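One common way to put such variables on a comparable footing is to standardize them. The sketch below is an illustration rather than part of the original analysis; it assumes base R's `scale()` with its defaults (center and scale each column), and the name `scaled_arrests` is introduced here only for the example.

```r
# Standardize each column to mean 0 and standard deviation 1.
# `scaled_arrests` is a hypothetical name used only in this sketch.
scaled_arrests <- scale(USArrests)

# Column means are zero up to floating-point error.
round(colMeans(scaled_arrests), 10)

# Every column now has the same variance, so no variable
# dominates purely because of its units.
apply(scaled_arrests, 2, var)
```

After standardization, Assault no longer outweighs the other variables simply because its raw values are larger.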
There are no missing (NA) values in the dataset.
colSums(is.na(USArrests))
## Murder Assault UrbanPop Rape
## 0 0 0 0
Checking for correlation between the variables, we notice that the three crime variables are positively correlated with each other.
corrplot(cor(USArrests), order = "hclust")
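The plot can be complemented by the numeric correlation matrix. This is a minimal sketch using base R's `cor()`; the names `cor_mat` and `crime_vars` are introduced here for illustration only.

```r
# Pearson correlation matrix of the four variables.
# `cor_mat` and `crime_vars` are hypothetical names for this sketch.
cor_mat <- cor(USArrests)
round(cor_mat, 2)

# Restrict to the pairwise correlations among the crime variables.
crime_vars <- c("Murder", "Assault", "Rape")
round(cor_mat[crime_vars, crime_vars], 2)
```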
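We now compute principal components for the four variables. The components are uncorrelated linear combinations of the original variables, so they summarize the variance in the dataset without the redundancy introduced by the correlations between variables. Because the variables have very different variances, we need to scale them to unit variance before computing the principal components.
The rotation matrix contains the principal component loading vectors.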
pca.res <- prcomp(USArrests, scale. = TRUE)
pca.res$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
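Two properties of the loadings can be checked directly: the loading vectors are orthonormal, and the scores are the standardized data multiplied by the loadings. The sketch below refits the model under the hypothetical name `pca_check` so it does not touch `pca.res`; `manual_scores` is likewise an illustrative name.

```r
# Refit under a separate, hypothetical name so pca.res is left untouched.
pca_check <- prcomp(USArrests, scale. = TRUE)

# Loadings are orthonormal: t(R) %*% R is the 4 x 4 identity matrix.
round(crossprod(pca_check$rotation), 10)

# Scores are the standardized data projected onto the loadings.
manual_scores <- scale(USArrests) %*% pca_check$rotation
all.equal(manual_scores, pca_check$x, check.attributes = FALSE)
```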
The amount of variance explained by each principal component:
pca.var <- pca.res$sdev^2
pca.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
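Because the variables were scaled, these variances should coincide with the eigenvalues of the correlation matrix, and they should sum to 4, the number of standardized variables. A minimal sketch of that check, with `eig_vals` and `pca_eig` as names introduced only for this example:

```r
# Eigenvalues of the correlation matrix (returned in decreasing order).
# `eig_vals` and `pca_eig` are hypothetical names for this sketch.
eig_vals <- eigen(cor(USArrests))$values

# Variances of the principal components from a scaled PCA.
pca_eig <- prcomp(USArrests, scale. = TRUE)
all.equal(eig_vals, pca_eig$sdev^2)

# Total variance equals the number of standardized variables.
sum(eig_vals)
```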
The proportion of variance explained by each principal component:
About 62% of the variance is explained by the first principal component, 25% by the second, 9% by the third, and the remaining 4% by the last.
Hence a large proportion of the variance (about 87%) is explained by the first two principal components.
var.ratio <- pca.var / sum(pca.var)
var.ratio
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
We typically decide on the number of principal components required to visualize the data by examining a scree plot. We choose the smallest number of principal components required to explain a sizable amount of the variation in the data. To do so, we look for the point after which the proportion of variance explained by each additional component drops off. This is also called the elbow point.
We see that a fair amount of variance is explained by the first two principal components, and that there is an elbow after the second component.
plot(var.ratio, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
We already saw that the third principal component explains less than 10% of the variance and that the last is almost negligible. Hence we decide to go with two principal components.
plot(cumsum(var.ratio), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
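The visual choice can also be expressed as a simple rule: keep the smallest number of components whose cumulative proportion of variance reaches a chosen threshold. The sketch below is only an illustration; the 0.85 threshold is an assumption, not something fixed by the analysis, and `pca_pick`, `var_ratio_pick`, `threshold`, and `n_components` are hypothetical names.

```r
# Refit under a separate, hypothetical name.
pca_pick <- prcomp(USArrests, scale. = TRUE)
var_ratio_pick <- pca_pick$sdev^2 / sum(pca_pick$sdev^2)

# Assumed cutoff for "a sizable amount" of variance; adjust as needed.
threshold <- 0.85

# Index of the first component at which the cumulative proportion
# of variance reaches the threshold.
n_components <- which(cumsum(var_ratio_pick) >= threshold)[1]
n_components
```

With this threshold the rule selects two components, since the first two together explain about 87% of the variance. A stricter cutoff such as 0.90 would keep a third.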
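Refitting with only the first two principal components gives the loadings below. PC1 places roughly equal weight on the three crime variables and much less on UrbanPop, while PC2 is dominated by UrbanPop.
pca.res <- prcomp(USArrests, scale. = TRUE, rank. = 2)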
pca.res$rotation
## PC1 PC2
## Murder -0.5358995 0.4181809
## Assault -0.5831836 0.1879856
## UrbanPop -0.2781909 -0.8728062
## Rape -0.5434321 -0.1673186
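The biplot below shows the 50 states plotted by their scores on the first two principal components, together with the loading vectors of the four variables. Principal components are only unique up to a sign, so we flip the signs of both the loadings and the scores before plotting; the scores printed afterwards are the flipped ones.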
pca.res$rotation <- -pca.res$rotation
pca.res$x <- -pca.res$x
biplot(pca.res, scale = 0)
pca.res$x
## PC1 PC2
## Alabama 0.97566045 -1.12200121
## Alaska 1.93053788 -1.06242692
## Arizona 1.74544285 0.73845954
## Arkansas -0.13999894 -1.10854226
## California 2.49861285 1.52742672
## Colorado 1.49934074 0.97762966
## Connecticut -1.34499236 1.07798362
## Delaware 0.04722981 0.32208890
## Florida 2.98275967 -0.03883425
## Georgia 1.62280742 -1.26608838
## Hawaii -0.90348448 1.55467609
## Idaho -1.62331903 -0.20885253
## Illinois 1.36505197 0.67498834
## Indiana -0.50038122 0.15003926
## Iowa -2.23099579 0.10300828
## Kansas -0.78887206 0.26744941
## Kentucky -0.74331256 -0.94880748
## Louisiana 1.54909076 -0.86230011
## Maine -2.37274014 -0.37260865
## Maryland 1.74564663 -0.42335704
## Massachusetts -0.48128007 1.45967706
## Michigan 2.08725025 0.15383500
## Minnesota -1.67566951 0.62590670
## Mississippi 0.98647919 -2.36973712
## Missouri 0.68978426 0.26070794
## Montana -1.17353751 -0.53147851
## Nebraska -1.25291625 0.19200440
## Nevada 2.84550542 0.76780502
## New Hampshire -2.35995585 0.01790055
## New Jersey 0.17974128 1.43493745
## New Mexico 1.96012351 -0.14141308
## New York 1.66566662 0.81491072
## North Carolina 1.11208808 -2.20561081
## North Dakota -2.96215223 -0.59309738
## Ohio -0.22369436 0.73477837
## Oklahoma -0.30864928 0.28496113
## Oregon 0.05852787 0.53596999
## Pennsylvania -0.87948680 0.56536050
## Rhode Island -0.85509072 1.47698328
## South Carolina 1.30744986 -1.91397297
## South Dakota -1.96779669 -0.81506822
## Tennessee 0.98969377 -0.85160534
## Texas 1.34151838 0.40833518
## Utah -0.54503180 1.45671524
## Vermont -2.77325613 -1.38819435
## Virginia -0.09536670 -0.19772785
## Washington -0.21472339 0.96037394
## West Virginia -2.08739306 -1.41052627
## Wisconsin -2.05881199 0.60512507
## Wyoming -0.62310061 -0.31778662
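Flipping the signs of the loadings and the scores together does not change what the components represent: the rank-2 approximation of the standardized data, scores times transposed loadings, is identical either way. A minimal sketch of that check, refitting under the hypothetical name `pca_sign` (the `rank.` argument assumes R 3.4.0 or later):

```r
# Refit under a separate, hypothetical name, keeping two components.
pca_sign <- prcomp(USArrests, scale. = TRUE, rank. = 2)

# Rank-2 approximation of the standardized data, before and after
# flipping the signs of both the scores and the loadings.
approx_before <- pca_sign$x %*% t(pca_sign$rotation)
approx_after <- (-pca_sign$x) %*% t(-pca_sign$rotation)

all.equal(approx_before, approx_after)
```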