Problem

A data set often contains multiple variables or dimensions, determining which combination of variables to focus on in a large complex data set can become problematic and unclear. Standard visualisation and correlation techniques applied across many dimensions can lead to confusing outcomes that don’t convey meaningful insight and are cluttered by noise.

By using dimension reduction techniques such as principal component analysis it allows you to isolate linear combinations of multiple variables that explain the majority of variance in the data, referred to as the principal components.

What is Principal Component Analysis

Principal component analysis is a statistical procedure used in exploratory data analysis that transforms multiple variables that are likely correlated into a set of uncorrelated linear combinations of those variables.

The Data

The R data set ‘longley’ will be used in this example, it is a macroeconomic data set with 7 variables and 16 observations.

# Load longley data set
data("longley")
longley
##      GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
## 1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
## 1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
## 1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
## 1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
## 1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
## 1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
## 1953         99.0 365.385      187.0        354.7    115.094 1953   64.989
## 1954        100.0 363.112      357.8        335.0    116.219 1954   63.761
## 1955        101.2 397.469      290.4        304.8    117.388 1955   66.019
## 1956        104.6 419.180      282.2        285.7    118.734 1956   67.857
## 1957        108.4 442.769      293.6        279.8    120.445 1957   68.169
## 1958        110.8 444.546      468.1        263.7    121.950 1958   66.513
## 1959        112.6 482.704      381.3        255.2    123.366 1959   68.655
## 1960        114.2 502.601      393.1        251.4    125.368 1960   69.564
## 1961        115.7 518.173      480.6        257.2    127.852 1961   69.331
## 1962        116.9 554.894      400.7        282.7    130.081 1962   70.551

The data set contains variables that we do not want to run the analysis on, the following variables will be removed; GNP.deflator and Year. The result will be saved into the object longley.PCA.

#Remove two columns
longley.PCA <- longley[-c(1,6)]
longley.PCA
##          GNP Unemployed Armed.Forces Population Employed
## 1947 234.289      235.6        159.0    107.608   60.323
## 1948 259.426      232.5        145.6    108.632   61.122
## 1949 258.054      368.2        161.6    109.773   60.171
## 1950 284.599      335.1        165.0    110.929   61.187
## 1951 328.975      209.9        309.9    112.075   63.221
## 1952 346.999      193.2        359.4    113.270   63.639
## 1953 365.385      187.0        354.7    115.094   64.989
## 1954 363.112      357.8        335.0    116.219   63.761
## 1955 397.469      290.4        304.8    117.388   66.019
## 1956 419.180      282.2        285.7    118.734   67.857
## 1957 442.769      293.6        279.8    120.445   68.169
## 1958 444.546      468.1        263.7    121.950   66.513
## 1959 482.704      381.3        255.2    123.366   68.655
## 1960 502.601      393.1        251.4    125.368   69.564
## 1961 518.173      480.6        257.2    127.852   69.331
## 1962 554.894      400.7        282.7    130.081   70.551

Looking at this data set it is difficult to determine how each year differs across the remaining 5 variables as a whole, principal component analysis will be used to reduce the dimensions of the data and allow for a more interpretable analysis.

The Analysis

Principal component analysis is based around the following assumptions:

Means centering and standardisation are extremely important in principal component analysis due to its sensitivity to variables on different scales. Means centering is a process of subtracting the mean from each observation in a relevant variable and standardisation refers to the division of each observation by the standard deviation.

In determining the direction of greatest variance eigenvectors are used, in simplified terms if we look at two variables on a scatter plot, the eigenvector will be a linear line through the direction of greatest variance. For example, if we analyse the relationship between Unemployment and Population as below, we can determine the eigenvectors.

#Scatter plot of two dimensions
library(ggplot2)
ggplot(longley.PCA,aes(Population,Unemployed)) + geom_point()

To find the eigenvector or the line of greatest variance of these two variables the data needs to be scaled and centered as follows.

#Means centering and standardisation of two variables.
x <- matrix(c(longley.PCA$Population,longley.PCA$Unemployed),ncol = 2)
longley.scaled <- scale(x)
longley.scaled
##               [,1]       [,2]
##  [1,] -1.411135233 -0.8960348
##  [2,] -1.263926342 -0.9292089
##  [3,] -1.099897684  0.5229601
##  [4,] -0.933712647  0.1687464
##  [5,] -0.768965196 -1.1710587
##  [6,] -0.597173570 -1.3497707
##  [7,] -0.334957732 -1.4161189
##  [8,] -0.173229213  0.4116664
##  [9,] -0.005175313 -0.3096025
## [10,]  0.188323875 -0.3973534
## [11,]  0.434294982 -0.2753583
## [12,]  0.650651800  1.5920219
## [13,]  0.854214095  0.6631474
## [14,]  1.142018979  0.7894229
## [15,]  1.499115547  1.7257883
## [16,]  1.819553652  0.8707530
## attr(,"scaled:center")
## [1] 117.4240 319.3313
## attr(,"scaled:scale")
## [1]  6.956102 93.446425

The eigenvectors and values are then calculated off the scaled data.

#Calculation of eigenvectors and principal components 1 & 2.
longley.cov <- cov(longley.scaled)
longley.eigen <- eigen(longley.cov)
longley.eigen
## $values
## [1] 1.6865515 0.3134485
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068
#Slope of vectors
slope1 <- longley.eigen$vectors[1,1]/longley.eigen$vectors[2,1]
slope2 <- longley.eigen$vectors[1,2]/longley.eigen$vectors[2,2]

If we then plot the two eigenvectors onto the scatter plot, we can see they cut through two directions of the greatest variance and represent principle component one (largest variance, blue line) and principal component two (second largest variance, green line).

#Plotting of eigenvectors on scaled data of two variables to demonstrate direction of variances for illustrative purposes
scaled.df <- as.data.frame(longley.scaled)
p <- ggplot(scaled.df,aes(V1,V2)) + geom_point()
p + geom_abline(slope = slope1,col="blue") + geom_abline(slope = slope2,col="green")

This was an illustrative example of the calculation method of eigenvectors for two variables, this process does not need to be done manually as above and can be applied to the whole data set through the R function ‘prcomp( )’.

# Calculation of principal components for the whole data set
longley.PC <- prcomp(longley.PCA,scale = TRUE)
longley.PC
## Standard deviations:
## [1] 1.88399943 1.08882674 0.50115614 0.11091345 0.03928371
## 
## Rotation:
##                    PC1         PC2        PC3         PC4          PC5
## GNP          0.5278599  0.03392182  0.1786603 -0.22631424 -0.798170175
## Unemployed   0.3519851 -0.61503723 -0.6651187  0.23531386 -0.008957406
## Armed.Forces 0.2354739  0.77911377 -0.5766000  0.05603713  0.043886189
## Population   0.5272243 -0.06530038  0.1071890 -0.63539332  0.550051136
## Employed     0.5138648  0.09641636  0.4263104  0.69752736  0.241582049

To determine the principal components of greatest variance for this data set a visualisation of variance by principal component can be used. This shows the majority of variance in the data set can be accounted for through the first two principal components.

#Plot of principal component variance
plot(longley.PC, type = 'l')

A bi-plot can be used to interpret how each variable responds as it moves on the scale of principal components, as an observation moves to the right on principal component one, this is generally associated with higher Employment, GNP and Population as indicated by the red arrows, moving to the left would reduce these three variables. As an observation moves up principal component two this is generally associated with higher Armed Forces and lower Unemployment, conversely moving down principal component two would be associated with higher Unemployment and lower armed Armed Forces.

#Create bi-plot
biplot(longley.PC,scale=0)

The bi-plot can become increasingly cluttered as the data set gets larger, the principal component scores can by assigned to each observation from the original data set such that we can re-plot the data in a cleaner way.

#Amend PC scores to original data set
longley.output <- cbind(longley.PCA,longley.PC$x[,1:2])
longley.output
##          GNP Unemployed Armed.Forces Population Employed         PC1
## 1947 234.289      235.6        159.0    107.608   60.323 -2.94881516
## 1948 259.426      232.5        145.6    108.632   61.122 -2.67781650
## 1949 258.054      368.2        161.6    109.773   60.171 -2.17249141
## 1950 284.599      335.1        165.0    110.929   61.187 -1.90841556
## 1951 328.975      209.9        309.9    112.075   63.221 -1.26957903
## 1952 346.999      193.2        359.4    113.270   63.639 -0.91753863
## 1953 365.385      187.0        354.7    115.094   64.989 -0.52337610
## 1954 363.112      357.8        335.0    116.219   63.761 -0.05316329
## 1955 397.469      290.4        304.8    117.388   66.019  0.19222427
## 1956 419.180      282.2        285.7    118.734   67.857  0.58296139
## 1957 442.769      293.6        279.8    120.445   68.169  0.90654643
## 1958 444.546      468.1        263.7    121.950   66.513  1.39056281
## 1959 482.704      381.3        255.2    123.366   68.655  1.65823528
## 1960 502.601      393.1        251.4    125.368   69.564  2.08023315
## 1961 518.173      480.6        257.2    127.852   69.331  2.66632153
## 1962 554.894      400.7        282.7    130.081   70.551  2.99411082
##             PC2
## 1947 -0.6844452
## 1948 -0.7931595
## 1949 -1.5444580
## 1950 -1.2624386
## 1951  1.2440406
## 1952  1.9145391
## 1953  1.9289415
## 1954  0.5391853
## 1955  0.7074315
## 1956  0.5928022
## 1957  0.4522715
## 1958 -0.9354686
## 1959 -0.4008019
## 1960 -0.5080567
## 1961 -1.0434234
## 1962 -0.2069597
#Create scatter plot
ggplot(longley.output,aes(PC1,PC2)) + geom_point() + geom_text(aes(label=row.names(longley.output)),nudge_y = 0.15, size = 3)

To further confirm how the linear combination variables of principal component one and two relate back to the original variables, we can view the correlations between them.

#Correlation between PCs and original variables.
cor(longley.PCA,longley.output[,6:7])
##                    PC1         PC2
## GNP          0.9944877  0.03693498
## Unemployed   0.6631398 -0.66966898
## Armed.Forces 0.4436327  0.84831990
## Population   0.9932903 -0.07110081
## Employed     0.9681209  0.10498071

These combinations of correlation along each principal component align with the biplot, this allows us to easily compare the years in the data set for variance across each variable on the one chart above. From here we are able to cluster the observations into distinct categories, visual interpretation of the chart reveals there is likely three clusters.

#Determine clusters using kmeans
cluster <- kmeans(longley.output[,6:7],3)
cluster
## K-means clustering with 3 clusters of sizes 5, 7, 4
## 
## Cluster means:
##          PC1        PC2
## 1  2.1578927 -0.6189421
## 2 -0.1545607  1.0541731
## 3 -2.4268847 -1.0711253
## 
## Clustering vector:
## 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 
##    3    3    3    3    2    2    2    2    2    2    2    1    1    1    1 
## 1962 
##    1 
## 
## Within cluster sum of squares by cluster:
## [1] 2.312228 6.263944 1.156335
##  (between_SS / total_SS =  86.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Re-plot the chart to include clustering.

#Amend clustering vector to output data
longley.cluster <- cbind(longley.output,cluster$cluster)
longley.cluster
##          GNP Unemployed Armed.Forces Population Employed         PC1
## 1947 234.289      235.6        159.0    107.608   60.323 -2.94881516
## 1948 259.426      232.5        145.6    108.632   61.122 -2.67781650
## 1949 258.054      368.2        161.6    109.773   60.171 -2.17249141
## 1950 284.599      335.1        165.0    110.929   61.187 -1.90841556
## 1951 328.975      209.9        309.9    112.075   63.221 -1.26957903
## 1952 346.999      193.2        359.4    113.270   63.639 -0.91753863
## 1953 365.385      187.0        354.7    115.094   64.989 -0.52337610
## 1954 363.112      357.8        335.0    116.219   63.761 -0.05316329
## 1955 397.469      290.4        304.8    117.388   66.019  0.19222427
## 1956 419.180      282.2        285.7    118.734   67.857  0.58296139
## 1957 442.769      293.6        279.8    120.445   68.169  0.90654643
## 1958 444.546      468.1        263.7    121.950   66.513  1.39056281
## 1959 482.704      381.3        255.2    123.366   68.655  1.65823528
## 1960 502.601      393.1        251.4    125.368   69.564  2.08023315
## 1961 518.173      480.6        257.2    127.852   69.331  2.66632153
## 1962 554.894      400.7        282.7    130.081   70.551  2.99411082
##             PC2 cluster$cluster
## 1947 -0.6844452               3
## 1948 -0.7931595               3
## 1949 -1.5444580               3
## 1950 -1.2624386               3
## 1951  1.2440406               2
## 1952  1.9145391               2
## 1953  1.9289415               2
## 1954  0.5391853               2
## 1955  0.7074315               2
## 1956  0.5928022               2
## 1957  0.4522715               2
## 1958 -0.9354686               1
## 1959 -0.4008019               1
## 1960 -0.5080567               1
## 1961 -1.0434234               1
## 1962 -0.2069597               1
#Re-plot scatter plot with clustering
ggplot(longley.cluster,aes(PC1,PC2)) + geom_point(colour=cluster$cluster) + geom_text(aes(label=row.names(longley.cluster)),nudge_y = 0.15, size = 3)

The chart above shows three very distinct clusters as follows:

Conclusion

This is an example of how principal component analysis can be used to analyse observations on a two-dimensional axis representing linear combinations of multiple variables. This method of analysis is extremely valuable when dealing with large multivariate data sets.