Principal component analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a dataset with a large number of correlated variables into a new dataset with fewer uncorrelated variables called principal components. The goal is to retain as much of the original variability in the data as possible.
Some key applications of PCA include:

- Reducing the number of variables in a dataset while preserving most of the information
- Visualizing high-dimensional data in two or three dimensions
- Removing multicollinearity by producing uncorrelated components
- Engineering new features for downstream models
In this article, we will walk through an example of conducting PCA in R on a built-in dataset. We will visualize the results to explore the structure of the data.
For this analysis, we will use the USArrests dataset built into R. This dataset contains statistics on violent crime rates in each of the 50 US states in 1973. Let's load it and inspect the structure:
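The original code chunk isn't preserved in the text; the following calls reproduce the output shown below:

```r
data("USArrests")  # attach the built-in dataset
head(USArrests)    # first six rows
```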
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
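The structure shown next comes from str():

```r
str(USArrests)  # class, dimensions, and a preview of each column
```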
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
The output shows there are 50 observations (one for each state) and 4 numeric variables:

- Murder: murder arrests per 100,000 residents
- Assault: assault arrests per 100,000 residents
- UrbanPop: percent of the population living in urban areas
- Rape: rape arrests per 100,000 residents
Before running PCA, we need to check that the key assumptions are met:

- The variables are numeric and measured on comparable scales (or will be standardized)
- The variables are linearly correlated with one another
- The variables are approximately normally distributed
Let’s verify these one by one:
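First, the correlations. The original code chunk isn't shown, but cor() reproduces the matrix below:

```r
cor(USArrests)  # pairwise Pearson correlations between the four variables
```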
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
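Next, normality. The data: newX[, i] labels in the output suggest the tests were applied column-wise with apply(); a sketch:

```r
# Run a Shapiro-Wilk normality test on each column
apply(USArrests, 2, shapiro.test)
```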
## $Murder
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.95703, p-value = 0.06674
##
##
## $Assault
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.95181, p-value = 0.04052
##
##
## $UrbanPop
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.97714, p-value = 0.4385
##
##
## $Rape
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.94674, p-value = 0.0251
The correlations and Shapiro-Wilk tests above, together with histograms (not shown), suggest the assumptions are reasonably met. The tests do flag mild departures from normality for Assault (p = 0.041) and Rape (p = 0.025), but PCA does not strictly require normality, so we can proceed.
Because the variables are measured on different scales, we standardize them before conducting PCA. We can now run the analysis using the prcomp() function in R, specifying scale = TRUE so that the variables are scaled first.
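The original call isn't preserved in the text. Judging by the 3 x 3 rotation matrix printed further below, the model appears to have been fit on the three crime variables only, with UrbanPop excluded; a sketch under that assumption:

```r
# Assumption: UrbanPop was excluded, since the rotation printed below is 3 x 3
pca <- prcomp(USArrests[, c("Murder", "Assault", "Rape")], scale = TRUE)
```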
This performs the PCA and stores the results in a model object called pca.
The summary() function displays useful information about the results.
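Printing the pca object as well reproduces the rotation matrix that follows the summary table below:

```r
summary(pca)  # variance explained by each component
pca           # printing the object shows standard deviations and loadings
```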
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.5358 0.6768 0.42822
## Proportion of Variance 0.7862 0.1527 0.06112
## Cumulative Proportion 0.7862 0.9389 1.00000
## Standard deviations (1, .., p=3):
## [1] 1.5357670 0.6767949 0.4282154
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## Murder -0.5826006 -0.5339532 0.6127565
## Assault -0.6079818 -0.2140236 -0.7645600
## Rape -0.5393836 0.8179779 0.1999436
This includes the standard deviations (square roots of the eigenvalues), the proportion of variance explained by each principal component, and the cumulative proportion of variance explained.
We can also visualize the eigenvalues to see how much variance is explained by each component. This “scree plot” can tell us how many components are meaningful to retain:
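The plotting code isn't included in the text; a base-R sketch of a scree plot:

```r
# Eigenvalues are the squared standard deviations of the components
plot(pca$sdev^2, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot")
```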
The eigenvalues taper off after the first few components, with the first two capturing the majority of the variance.
Similarly, we can plot the cumulative proportion of variance explained:
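A matching sketch for the cumulative plot:

```r
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b", ylim = c(0, 1),
     xlab = "Number of components",
     ylab = "Cumulative proportion of variance explained")
```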
This shows that the first two components explain about 94% of the variance. The remaining component adds little additional information.
The factoextra package provides additional functions to generate nicer plots of PCA results. For example, we can plot each state on the first two principal components:
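A sketch using factoextra (assuming the package is installed):

```r
library(factoextra)

# Plot the states on the first two components;
# repel = TRUE spreads the labels to reduce overlap
fviz_pca_ind(pca, repel = TRUE)
```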
We can also label each point with the state name, color by region, add ellipses to highlight clustering, and more.
PCA provides an accessible visualization of high dimensional data. It can reveal interesting patterns, like:

- Clusters of similar observations
- Outliers that sit far from the rest of the data
- Which variables drive the variation between observations
The principal components extracted from PCA can also be used as features in subsequent modeling, rather than the original correlated variables. This is a common application of PCA for feature engineering.
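For example, the scores are stored in pca$x and can be pulled out directly:

```r
# Extract the first two component scores as uncorrelated features
pc_scores <- as.data.frame(pca$x[, 1:2])
head(pc_scores)
```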
PCA has some limitations to be aware of:

- It only captures linear relationships between variables
- The components are combinations of the original variables and can be hard to interpret
- Results are sensitive to the scaling of the variables and to outliers
- Maximizing variance does not guarantee the components are the most useful features for a given task
For these reasons, PCA may not be appropriate for all datasets and is best combined with other techniques.
In this article we walked through conducting PCA in R, from loading the data and checking assumptions to interpreting the results and producing enhanced visualizations. PCA is a key technique for exploring the structure of high-dimensional datasets and extracting new features for modeling. With the right context and careful interpretation, it can reveal interesting insights into your data.