We will use the DescTools, caret, and factoextra packages in the lesson that follows.
First, we load the packages.
library(caret)
library(DescTools)
library(factoextra)
The built-in USArrests data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas. We would like to analyze the data, but there are high correlations among the variables.
data("USArrests")
We use the Abstract() function from the DescTools package to obtain information about the dataframe.
Abstract(x = USArrests)
## ------------------------------------------------------------------------------
## USArrests
##
## data frame: 50 obs. of 4 variables
## 50 complete cases (100.0%)
##
## Nr  ColName   Class    NAs  Levels
## 1   Murder    numeric  .
## 2   Assault   integer  .
## 3   UrbanPop  integer  .
## 4   Rape      numeric  .
We can use the summary() function to obtain basic summary statistics.
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
We can use the pairs() function to visually assess relationships among our variables.
pairs(USArrests,
      upper.panel = NULL) # show only the lower panel
We can use the cor() function to obtain a correlation matrix. We use the round() function and round the matrix values to one decimal place (digits = 1).
round(cor(x = USArrests),
      digits = 1) # rounded to 1 decimal place
##          Murder Assault UrbanPop Rape
## Murder      1.0     0.8      0.1  0.6
## Assault     0.8     1.0      0.3  0.7
## UrbanPop    0.1     0.3      1.0  0.4
## Rape        0.6     0.7      0.4  1.0
The prcomp() function in the stats package performs PCA. By default, the function will center the data. To scale the data so that it is standardized, we set the scale. argument to TRUE (scale. = TRUE).
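Scaling matters here because the variables are measured on very different scales; a quick check of the column variances (base R, using only the data already loaded) makes this clear:
apply(X = USArrests, MARGIN = 2, FUN = var) # Assault's variance dwarfs the others
Without standardization, Assault would dominate the first PC purely because of its units.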
pc <- prcomp(x = USArrests,
             scale. = TRUE)
The summary() function provides information about the variance explained by the PCs.
summary(pc)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
The importance component of the summary() output gives us information about how much variance is explained.
var_info <- summary(pc)
var_info$importance
##                             PC1       PC2       PC3       PC4
## Standard deviation     1.574878 0.9948694 0.5971291 0.4164494
## Proportion of Variance 0.620060 0.2474400 0.0891400 0.0433600
## Cumulative Proportion  0.620060 0.8675000 0.9566400 1.0000000
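These values can also be computed directly from the pc object, since each PC's variance is the square of its standard deviation; a minimal check (the object name pve is ours):
pve <- pc$sdev^2 / sum(pc$sdev^2) # proportion of variance explained per PC
round(pve, digits = 4) # matches the Proportion of Variance row (up to rounding)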
We can use a Scree Plot to visualize the proportion of variance explained by our PCs, either using the fviz_eig() function from the factoextra package or the screeplot() function from the stats package. Based on the output, we have an elbow point at 3.
fviz_eig(X = pc)
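For comparison, here is the base R alternative mentioned above, a minimal sketch using screeplot() from the stats package:
screeplot(x = pc,
          type = "lines",
          main = "Scree Plot")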
Another way to choose the number of PCs is to create a Scree Plot that includes the cumulative proportion of variance explained. Here, we can choose a cutoff point for the minimum proportion of variance that we would like to retain. We can then choose the minimum number of PCs that meets that criterion.
barplot(var_info$importance[2, ], # plot PVE
        names.arg = 1:4, # 4 PCs
        ylim = c(0, 1.1),
        main = "Cumulative Variance Explained",
        xlab = "Principal Component",
        ylab = "Proportion of Variance Explained")
lines(var_info$importance[3, ], # add cumulative PVE
      type = "o",
      pch = 16)
abline(h = .8, col = "red", lty = 2) # add cutoff at .8
PC loading information is available for each of the variables and each of the PCs. It is contained in the rotation component of an object created using the prcomp() function. We can focus on the first 2 PCs by subsetting the column dimension.
pc$rotation[ ,1:2] # Loadings for 1st 2 PCs
##                 PC1        PC2
## Murder   -0.5358995  0.4181809
## Assault  -0.5831836  0.1879856
## UrbanPop -0.2781909 -0.8728062
## Rape     -0.5434321 -0.1673186
Based on the output, Murder, Assault, and Rape all have large loadings of similar size on the first PC, while UrbanPop dominates the second. Based on this, we would describe the first PC as ‘Crime’ and the second PC as ‘Population’.
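As an aside, the loading vectors are orthonormal (each has unit length and they are mutually perpendicular), which we can verify:
round(crossprod(pc$rotation), digits = 10) # t(rotation) %*% rotation = identity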
We can create a Loading plot for our first 2 PCs using the fviz_pca_var() function in the factoextra package.
fviz_pca_var(X = pc)
For each of our PCs, we have a PC score for each of our observations. These scores are stored in the x component of an object created using the prcomp() function.
head(pc$x[, 1:2])
##                   PC1        PC2
## Alabama    -0.9756604  1.1220012
## Alaska     -1.9305379  1.0624269
## Arizona    -1.7454429 -0.7384595
## Arkansas    0.1399989  1.1085423
## California -2.4986128 -1.5274267
## Colorado   -1.4993407 -0.9776297
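These scores are simply the standardized data projected onto the loadings; a minimal check (the object name scores_manual is ours):
scores_manual <- scale(USArrests) %*% pc$rotation # standardize, then project
head(scores_manual[, 1:2]) # matches head(pc$x[, 1:2])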
We can plot the PC scores for the first 2 PCs using the fviz_pca_ind() function in the factoextra package.
fviz_pca_ind(X = pc,
             labelsize = 2)
We can use a biplot to view the variables and observations together for the first 2 PCs. We use the biplot() function to create the plot.
biplot(x = pc,
       scale = 0,
       cex = .5) # reduce size of text
Data was collected on the nutritional information and consumer ratings of 77 breakfast cereal types. A supermarket would like to perform PCA on the 12 numeric variables to reduce the dimensionality. Then, they would like to predict the rating variable using Linear Regression.
The data is available in the “Cereals.csv” file. We import the data as a dataframe named cereal. The row names are in the first column, so we set row.names = 1.
cereal <- read.csv(file = "Cereals.csv",
                   row.names = 1)
We can first obtain information about our cereal dataframe.
Abstract(x = cereal)
## ------------------------------------------------------------------------------
## cereal
##
## data frame: 77 obs. of 15 variables
## 74 complete cases (96.1%)
##
## Nr  ColName   Class      NAs       Levels
## 1   mfr       character  .
## 2   type      character  .
## 3   calories  integer    .
## 4   protein   integer    .
## 5   fat       integer    .
## 6   sodium    integer    .
## 7   fiber     numeric    .
## 8   carbo     numeric    1 (1.3%)
## 9   sugars    integer    1 (1.3%)
## 10  potass    integer    2 (2.6%)
## 11  vitamins  integer    .
## 12  shelf     integer    .
## 13  weight    numeric    .
## 14  cups      numeric    .
## 15  rating    numeric    .
For convenience, we can create vectors of our categorical (fac) and numerical (num) variable names. We exclude the rating variable from our num vector, since we will not include it in the PCA.
fac <- c("mfr", "type")
num <- names(cereal)[!names(cereal) %in% c(fac, "rating")]
We can obtain summary statistic information using the summary() function.
summary(cereal)
##      mfr                type              calories        protein
##  Length:77          Length:77          Min.   : 50.0   Min.   :1.000
##  Class :character   Class :character   1st Qu.:100.0   1st Qu.:2.000
##  Mode  :character   Mode  :character   Median :110.0   Median :3.000
##                                        Mean   :106.9   Mean   :2.545
##                                        3rd Qu.:110.0   3rd Qu.:3.000
##                                        Max.   :160.0   Max.   :6.000
##
##       fat            sodium          fiber            carbo
##  Min.   :0.000   Min.   :  0.0   Min.   : 0.000   Min.   : 5.0
##  1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000   1st Qu.:12.0
##  Median :1.000   Median :180.0   Median : 2.000   Median :14.5
##  Mean   :1.013   Mean   :159.7   Mean   : 2.152   Mean   :14.8
##  3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000   3rd Qu.:17.0
##  Max.   :5.000   Max.   :320.0   Max.   :14.000   Max.   :23.0
##                                                   NA's   :1
##      sugars           potass          vitamins         shelf
##  Min.   : 0.000   Min.   : 15.00   Min.   :  0.00   Min.   :1.000
##  1st Qu.: 3.000   1st Qu.: 42.50   1st Qu.: 25.00   1st Qu.:1.000
##  Median : 7.000   Median : 90.00   Median : 25.00   Median :2.000
##  Mean   : 7.026   Mean   : 98.67   Mean   : 28.25   Mean   :2.208
##  3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00   3rd Qu.:3.000
##  Max.   :15.000   Max.   :330.00   Max.   :100.00   Max.   :3.000
##  NA's   :1        NA's   :2
##      weight          cups           rating
##  Min.   :0.50   Min.   :0.250   Min.   :18.04
##  1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17
##  Median :1.00   Median :0.750   Median :40.40
##  Mean   :1.03   Mean   :0.821   Mean   :42.67
##  3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83
##  Max.   :1.50   Max.   :1.500   Max.   :93.70
##
We can use the PlotMiss() function to visualize information about missing data.
PlotMiss(x = cereal)
Prior to performing PCA, we can replace the NA values with the column medians using the caret package.
pp <- preProcess(x = cereal,
                 method = "medianImpute")
cereal <- predict(object = pp,
                  newdata = cereal)
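We can confirm that no missing values remain after imputation:
sum(is.na(cereal)) # should now be 0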
Since we have many variables, we use the symnum() function to view a symbolic version of our correlation matrix and check for high correlations among our variables.
symnum(cor(x = cereal[ ,num]),
       corr = TRUE)
## cl pr ft sd fb cr sg pt v sh w cp
## calories 1
## protein 1
## fat . 1
## sodium . 1
## fiber . 1
## carbo . 1
## sugars . . . 1
## potass . * . 1
## vitamins . 1
## shelf . 1
## weight , . . . . 1
## cups . . . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
Based on the legend, the '*' in the potass row flags a very strong correlation (above 0.9) between potass and fiber.
We use the prcomp() function to perform PCA on our numeric variables.
pc2 <- prcomp(x = cereal[ ,num],
              scale. = TRUE) # scale
The summary() function provides information about the variance explained by the PCs.
summary(pc2)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8135 1.6278 1.3215 1.04043 0.98411 0.83250 0.81251
## Proportion of Variance 0.2741 0.2208 0.1455 0.09021 0.08071 0.05775 0.05501
## Cumulative Proportion  0.2741 0.4949 0.6404 0.73063 0.81134 0.86909 0.92411
##                           PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.6609 0.54789 0.30296 0.23927 0.15709
## Proportion of Variance 0.0364 0.02502 0.00765 0.00477 0.00206
## Cumulative Proportion  0.9605 0.98552 0.99317 0.99794 1.00000
The importance component of the summary() output gives us information about how much variance is explained.
var_info2 <- summary(pc2)
var_info2$importance
##                             PC1      PC2      PC3      PC4       PC5       PC6
## Standard deviation     1.813539 1.627814 1.321523 1.040427 0.9841076 0.8324993
## Proportion of Variance 0.274080 0.220810 0.145540 0.090210 0.0807100 0.0577500
## Cumulative Proportion  0.274080 0.494890 0.640430 0.730630 0.8113400 0.8690900
##                              PC7     PC8       PC9      PC10      PC11
## Standard deviation     0.8125136 0.66090 0.5478901 0.3029585 0.2392712
## Proportion of Variance 0.0550100 0.03640 0.0250200 0.0076500 0.0047700
## Cumulative Proportion  0.9241100 0.96051 0.9855200 0.9931700 0.9979400
##                             PC12
## Standard deviation     0.1570939
## Proportion of Variance 0.0020600
## Cumulative Proportion  1.0000000
We can use a Scree Plot to visualize the proportion of variance explained by our PCs, either using the fviz_eig() function from the factoextra package or the screeplot() function from the stats package. Based on the Scree Plot, we have an elbow point at 4, suggesting we should retain 4 PCs.
fviz_eig(X = pc2)
We can also create a Scree Plot that includes the cumulative proportion of variance explained. Here, we can choose a cutoff point for the minimum proportion of variance that we would like to retain. We can then choose the minimum number of PCs that meets that criterion.
barplot(var_info2$importance[2, ], # plot PVE
        names.arg = 1:12, # 12 PCs
        ylim = c(0, 1.1),
        main = "Cumulative Variance Explained",
        xlab = "Principal Component",
        ylab = "Proportion of Variance Explained")
lines(var_info2$importance[3, ], # add cumulative PVE
      type = "o",
      pch = 16)
abline(h = .8, col = "red", lty = 2) # add cutoff at .8
Based on the output, we can explain approximately 81% of the variance by retaining the first five PCs (PC1 through PC5).
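A quick way to find the smallest number of PCs meeting a given cutoff, here 80%:
which(var_info2$importance[3, ] >= 0.8)[1] # first PC at or above 80% cumulative PVE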
PC loading information is available for each of the variables and each of the PCs. It is contained in the rotation component of an object created using the prcomp() function. We can focus on the first 5 PCs by subsetting the column dimension.
pc2$rotation[ ,1:5] # Loadings for 1st 5 PCs
##                  PC1         PC2         PC3         PC4         PC5
## calories  0.04453048 -0.56612303  0.05649216 -0.26918354  0.08019453
## protein   0.31814974  0.08245802 -0.31214924 -0.46234688  0.17684923
## fat       0.22198380 -0.24054462  0.29729134 -0.35158848  0.45689426
## sodium   -0.04818453 -0.31006185 -0.33249151  0.12199200 -0.31904935
## fiber     0.45710646  0.20873834 -0.18216653  0.06793186 -0.23649828
## carbo    -0.27613247 -0.15246418 -0.52237589 -0.20978935  0.13921704
## sugars    0.09961866 -0.37583984  0.48444020  0.15010799 -0.30853389
## potass    0.50191850  0.08287566 -0.14764210 -0.02704716 -0.15860250
## vitamins  0.02078775 -0.30802816 -0.32568194  0.52762016  0.19613194
## shelf     0.29603622 -0.07784817 -0.01676226  0.45779825  0.58030569
## weight    0.26286896 -0.43744710 -0.15546873 -0.09345034 -0.27795633
## cups     -0.37548375 -0.10706400 -0.08588833 -0.08181433  0.06004537
Based on the output, potass and fiber have the largest loadings on the first PC, while calories and weight dominate the second.
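Although the regression itself is not shown here, the supermarket's planned Linear Regression of rating on the retained PC scores might look like the following sketch; the object names pc_df and fit are our own, and we assume the first 5 PCs are retained:
pc_df <- as.data.frame(pc2$x[, 1:5]) # scores for the retained PCs
pc_df$rating <- cereal$rating # add the response variable
fit <- lm(formula = rating ~ ., data = pc_df) # regress rating on PC scores
summary(fit)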