Principal component analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a dataset with a large number of correlated variables into a new dataset with fewer uncorrelated variables called principal components. The goal is to retain as much of the original variability in the data as possible.
Some key applications of PCA include:

- Reducing the number of variables in a dataset while preserving most of the information
- Visualizing high-dimensional data in two or three dimensions
- Removing multicollinearity by producing uncorrelated components
- Engineering new features for downstream models
In this article, we will walk through an example of conducting PCA in R on a built-in dataset. We will visualize the results to explore the structure of the data.
For this analysis, we will use the USArrests dataset built into R. This dataset contains statistics on violent crime rates in each of the 50 US states in 1973. Let's load it and inspect the structure:
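The original code chunk isn't preserved in the text; the following calls reproduce the output shown below:

```r
data("USArrests")  # attach the built-in dataset
head(USArrests)    # first six rows
```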
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
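The structure shown next comes from str():

```r
str(USArrests)  # class, dimensions, and a preview of each column
```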
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
The output shows there are 50 observations (one for each state) and 4 numeric variables:

- Murder: murder arrests per 100,000 residents
- Assault: assault arrests per 100,000 residents
- UrbanPop: percent of the population living in urban areas
- Rape: rape arrests per 100,000 residents
Before running PCA, we need to check that the key assumptions are met:

- The variables are numeric and measured on comparable scales (or will be standardized)
- The variables are linearly correlated with one another
- The variables are approximately normally distributed
Let’s verify these one by one:
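First, the correlations. The original code chunk isn't shown, but cor() reproduces the matrix below:

```r
cor(USArrests)  # pairwise Pearson correlations between the four variables
```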
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
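Next, normality. The data: newX[, i] labels in the output suggest the tests were applied column-wise with apply(); a sketch:

```r
# Run a Shapiro-Wilk normality test on each column
apply(USArrests, 2, shapiro.test)
```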
## $Murder
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.95703, p-value = 0.06674
##
##
## $Assault
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.95181, p-value = 0.04052
##
##
## $UrbanPop
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.97714, p-value = 0.4385
##
##
## $Rape
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.94674, p-value = 0.0251
The correlations and Shapiro-Wilk tests above, together with histograms (not shown), suggest the assumptions are reasonably met. The tests do flag mild departures from normality for Assault (p = 0.041) and Rape (p = 0.025), but PCA does not strictly require normality, so we can proceed.
Because the variables are measured on different scales, we standardize them before conducting PCA. We can now run the analysis using the prcomp() function in R, specifying scale = TRUE so that the variables are scaled first.
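The original call isn't preserved in the text. Judging by the 3 x 3 rotation matrix printed further below, the model appears to have been fit on the three crime variables only, with UrbanPop excluded; a sketch under that assumption:

```r
# Assumption: UrbanPop was excluded, since the rotation printed below is 3 x 3
pca <- prcomp(USArrests[, c("Murder", "Assault", "Rape")], scale = TRUE)
```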
This performs the PCA and stores the results in a model object called pca.
The summary() function displays useful information about the results.
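Printing the pca object as well reproduces the rotation matrix that follows the summary table below:

```r
summary(pca)  # variance explained by each component
pca           # printing the object shows standard deviations and loadings
```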
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.5358 0.6768 0.42822
## Proportion of Variance 0.7862 0.1527 0.06112
## Cumulative Proportion 0.7862 0.9389 1.00000
## Standard deviations (1, .., p=3):
## [1] 1.5357670 0.6767949 0.4282154
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## Murder -0.5826006 -0.5339532 0.6127565
## Assault -0.6079818 -0.2140236 -0.7645600
## Rape -0.5393836 0.8179779 0.1999436
This includes the standard deviations (square roots of the eigenvalues), the proportion of variance explained by each principal component, and the cumulative proportion of variance explained.
We can also visualize the eigenvalues to see how much variance is explained by each component. This “scree plot” can tell us how many components are meaningful to retain:
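The plotting code isn't included in the text; a base-R sketch of a scree plot:

```r
# Eigenvalues are the squared standard deviations of the components
plot(pca$sdev^2, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot")
```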
The eigenvalues taper off after the first few components, with the first two capturing the majority of the variance.
Similarly, we can plot the cumulative proportion of variance explained:
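A matching sketch for the cumulative plot:

```r
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b", ylim = c(0, 1),
     xlab = "Number of components",
     ylab = "Cumulative proportion of variance explained")
```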
This shows that the first two components explain about 94% of the variance. The remaining component adds little additional information.
The factoextra package provides additional functions to generate nicer plots of PCA results. For example, we can plot each state on the first two principal components:
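A sketch using factoextra (assuming the package is installed):

```r
library(factoextra)

# Plot the states on the first two components;
# repel = TRUE spreads the labels to reduce overlap
fviz_pca_ind(pca, repel = TRUE)
```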
We can also label each point with the state name, color by region, add ellipses to highlight clustering, and more.
PCA provides an accessible visualization of high dimensional data. It can reveal interesting patterns, like:

- Clusters of similar observations
- Outliers that sit far from the rest of the data
- Which variables drive the variation between observations
The principal components extracted from PCA can also be used as features in subsequent modeling, rather than the original correlated variables. This is a common application of PCA for feature engineering.
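For example, the scores are stored in pca$x and can be pulled out directly:

```r
# Extract the first two component scores as uncorrelated features
pc_scores <- as.data.frame(pca$x[, 1:2])
head(pc_scores)
```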
PCA has some limitations to be aware of:

- It only captures linear relationships between variables
- The components are combinations of the original variables and can be hard to interpret
- Results are sensitive to the scaling of the variables and to outliers
- Maximizing variance does not guarantee the components are the most useful features for a given task
For these reasons, PCA may not be appropriate for all datasets and is best combined with other techniques.
In this article we walked through conducting PCA in R, from loading the data and checking assumptions to interpreting the results and producing enhanced visualizations. PCA is a key technique for exploring the structure of high-dimensional datasets and extracting new features for modeling. With the right context and careful interpretation, it can reveal interesting insights into your data.