Principal component analysis (PCA) allows us to summarize the variation (information) in a data set described by multiple variables. Each variable can be considered a different dimension, and with more than 3 variables it becomes very difficult to visualize the resulting multi-dimensional hyperspace.
The goal of principal component analysis is to transform the initial variables into a new set of variables that explain the variation in the data. These new variables correspond to linear combinations of the originals and are called principal components.
PCA reduces the dimensionality of multivariate data to two or three dimensions that can be visualized graphically, with minimal loss of information.
WHY DIMENSION REDUCTION?
Anything can go wrong in high dimensions
Big Data => Data compression
Visualization
CAUTIONS
Information loss
Some methods, such as PCA, are linear and can miss nonlinear structure
Garbage in, garbage out
VARIANCE: Measures the variability or “spread” of the data; it is the square of the standard deviation.
COVARIANCE: Measures the strength of the relationship between two variables (it can be positive, negative, or zero for no relationship).
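A quick sketch in base R makes the distinction concrete (the vectors x and y below are made up purely for illustration):
# Toy example: variance is the squared standard deviation;
# covariance measures how two variables move together.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 9, 12)
var(x)        # variance of x
sd(x)^2       # identical: variance = (standard deviation)^2
cov(x, y)     # positive: x and y increase together
cov(x, -y)    # negative: the relationship is reversed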
Imagine a data set where observations are ratings of communities on 9 different criteria:
Housing, Climate, Health, Crime, Transportation, Education, Arts, Recreation, Economics
Can we construct, say, 4 new features from the existing ones that would still contain nearly as much information?
PCA: geometrically, a linear transformation from one coordinate system to another
The basis vectors of the new subspace are called principal components (PCs), constructed as linear combinations of the variables of the original data
New variables created by PCA should represent the same amount of information as the original variables (i.e. same total variance, just a change of basis)
Data with m features (variables) can produce m different principal components.
Reduce the dimension by eliminating the later PCs that have low variance.
This is a projection of m-dimensional data onto a lower k-dimensional subspace (k < m).
How do we compress the data while losing the least amount of information?
PC1 is chosen to be the direction in which the observations vary the most (highest variance), as the sketch below illustrates.
Why? Similar observations don’t reveal much information and pattern, do they?
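A minimal sketch of this projection idea, using base R’s prcomp() on simulated data (the data and the choice k = 2 are arbitrary, for illustration only):
# Project 4-dimensional data onto its first k = 2 principal components
set.seed(1)
X <- matrix(rnorm(100 * 4), ncol = 4)           # 100 observations, 4 variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # PCs ordered by variance
apply(pca$x, 2, var)      # variances decrease: PC1 largest, PC4 smallest
k <- 2
scores_k <- pca$x[, 1:k]  # the data re-expressed in the first k PCs
summary(pca)              # proportion of variance retained by each PC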
Another interpretation
PCA finds a set of components (or factors, or scales) that, taken together, best explain the total variance within a dataset.
Each component is constructed by adding together a number of previously measured variables, each with its own weight. These variables are selected and their weights are calculated in such a way that the resulting components or scales explain as much of the overall variance in the data matrix as possible.
In a second step, each of these components is “rotated” so that it is now uncorrelated with any other dimension. What this yields is a small number of new variables or features which summarize most of the variance contained in the larger number of original variables. So PCA achieves dimension reduction by reducing the number of variables.
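Both properties can be checked directly; here is a small sketch with simulated data and base R’s prcomp() (separate from the decathlon example below):
set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3)
pca <- prcomp(X, scale. = TRUE)
# Each component score is a weighted sum of the standardized variables,
# with weights given by the rotation (loadings) matrix:
manual_pc1 <- scale(X) %*% pca$rotation[, 1]
all.equal(as.numeric(manual_pc1), pca$x[, 1])   # TRUE
# The resulting components are mutually uncorrelated:
round(cor(pca$x), 10)                           # identity matrix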
An eigenvector of a linear transformation matrix X is a vector v that, when multiplied by the transformation, gives back the vector itself scaled by a scalar (its corresponding eigenvalue):
Xv=\lambda v
If the data has been centered to have mean zero, Cov(X) is given conveniently by
Cov(X)= \frac{1}{n-1}(X^{T}X)
Cov(X) reveals the shape (spread and orientation) of the data.
Suppose we represent Cov(X) with a vector: it should point in the direction of the largest spread of the data, and its magnitude should equal the spread (variance) in this direction.
It turns out that the largest eigenvector of Cov(X) points in exactly this direction, and its corresponding eigenvalue equals the variance in that direction.
PCA:
Eigendecomposition of X^{T}X (i.e. of Cov(X) for centered data) & linear transformation of X
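This equivalence is easy to verify on simulated data (a sketch; eigenvector signs are arbitrary, so absolute values are compared):
set.seed(3)
X <- scale(matrix(rnorm(40 * 3), ncol = 3))  # centered (and scaled) data
C <- cov(X)                                  # equals t(X) %*% X / (n - 1)
e <- eigen(C)
e$values                   # eigenvalues = variances along the PCs
prcomp(X)$sdev^2           # the same values from prcomp()
abs(e$vectors) - abs(prcomp(X)$rotation)     # ~ 0: same directions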
We will use the following packages
library(FactoMineR)
library(factoextra)
library(corrplot)
Let’s use the decathlon2 dataset from the package factoextra:
data(decathlon2)
head(decathlon2[, 1:6])
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69
CLAY 10.76 7.40 14.26 1.86 49.37 14.05
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38
# Extract only active individuals and variables for principal component analysis:
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69
CLAY 10.76 7.40 14.26 1.86 49.37 14.05
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38
Before the principal component analysis, we can perform some exploratory data analysis, such as computing descriptive statistics, the correlation matrix, and a scatter plot matrix.
decathlon2.active_stats <- data.frame(
Min = apply(decathlon2.active, 2, min), # minimum
Q1 = apply(decathlon2.active, 2, quantile, 1/4), # First quartile
Med = apply(decathlon2.active, 2, median), # median
Mean = apply(decathlon2.active, 2, mean), # mean
Q3 = apply(decathlon2.active, 2, quantile, 3/4), # Third quartile
Max = apply(decathlon2.active, 2, max) # Maximum
)
decathlon2.active_stats <- round(decathlon2.active_stats, 1)
head(decathlon2.active_stats)
              Min   Q1  Med Mean   Q3  Max
X100m 10.4 10.8 11.0 11.0 11.2 11.6
Long.jump 6.8 7.2 7.3 7.3 7.5 8.0
Shot.put 12.7 14.2 14.7 14.6 15.1 16.4
High.jump 1.9 1.9 2.0 2.0 2.1 2.1
X400m 46.8 49.0 49.4 49.4 50.0 51.2
X110m.hurdle 14.0 14.2 14.4 14.5 14.9 15.7
The correlation between variables can be calculated as follows:
cor.mat <- round(cor(decathlon2.active),2)
head(cor.mat[, 1:6])
              X100m Long.jump Shot.put High.jump X400m X110m.hurdle
X100m 1.00 -0.76 -0.45 -0.40 0.59 0.73
Long.jump -0.76 1.00 0.44 0.34 -0.51 -0.59
Shot.put -0.45 0.44 1.00 0.53 -0.31 -0.38
High.jump -0.40 0.34 0.53 1.00 -0.37 -0.25
X400m 0.59 -0.51 -0.31 -0.37 1.00 0.58
X110m.hurdle 0.73 -0.59 -0.38 -0.25 0.58 1.00
Visualize the correlation matrix using a correlogram: the package corrplot is required.
corrplot(cor.mat, type="upper", order="hclust",
tl.col="black", tl.srt=45)Make a scatter plot matrix showing the correlation coefficients between variables and the significance levels : the package PerformanceAnalytics is required.
# install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")Loading required package: xts
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
################################### WARNING ###################################
# We noticed you have dplyr installed. The dplyr lag() function breaks how #
# base R's lag() function is supposed to work, which breaks lag(my_xts). #
# #
# If you call library(dplyr) later in this session, then calls to lag(my_xts) #
# that you enter or source() into this session won't work correctly. #
# #
# All package code is unaffected because it is protected by the R namespace #
# mechanism. #
# #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
# #
# You can use stats::lag() to make sure you're not using dplyr::lag(), or you #
# can add conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
# dplyr from breaking base R's lag() function. #
################################### WARNING ###################################
Attaching package: 'PerformanceAnalytics'
The following object is masked from 'package:graphics':
legend
chart.Correlation(decathlon2.active[, 1:6], histogram=TRUE, pch=19)
The function PCA() [in the FactoMineR package] can be used. A simplified format is:
PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)
X : a data frame. Rows are individuals and columns are numeric variables
scale.unit : a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to a common scale prevents some variables from becoming dominant simply because of their large measurement units.
ncp : number of dimensions kept in the final results.
graph : a logical value. If TRUE a graph is displayed.
res.pca <- PCA(decathlon2.active, graph = FALSE)
print(res.pca)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 23 individuals, described by 10 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
The proportion of variance retained by the principal components can be extracted as follows:
eigenvalues <- res.pca$eig
head(eigenvalues[, 1:2])
       eigenvalue percentage of variance
comp 1 4.1242133 41.242133
comp 2 1.8385309 18.385309
comp 3 1.2391403 12.391403
comp 4 0.8194402 8.194402
comp 5 0.7015528 7.015528
comp 6 0.4228828 4.228828
Eigenvalues correspond to the amount of the variation explained by each principal component (PC). Eigenvalues are large for the first PC and small for the subsequent PCs.
A PC with an eigenvalue > 1 accounts for more variance than a single original variable does in standardized data. This is commonly used as a cutoff for determining the number of PCs to retain.
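Applied to the eigenvalues extracted above, this cutoff would keep the first three components:
# Kaiser criterion: count PCs with eigenvalue > 1
sum(eigenvalues[, 1] > 1)   # 3 (comp 1-3 in the table above)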
A scree plot is a graph of the eigenvalues/variances associated with the components. Make one using base graphics:
barplot(eigenvalues[, 2], names.arg=1:nrow(eigenvalues),
main = "Variances",
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eigenvalues), eigenvalues[, 2],
type="b", pch=19, col = "red")~60% of the informations (variances) contained in the data are retained by the first two principal components.
Make the scree plot using the package factoextra:
fviz_screeplot(res.pca, ncp=10)
The function plot.PCA() can be used. A simplified format is:
plot.PCA(x, axes = c(1,2), choix = c("ind", "var"))
x : An object of class PCA
axes : A numeric vector of length 2 specifying the component to plot
choix : The graph to be plotted. Possible values are “ind” for the individuals and “var” for the variables
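For example, using the res.pca object created above:
# Variables graph on the first two components
plot.PCA(res.pca, axes = c(1, 2), choix = "var")
# Individuals graph on the same components
plot.PCA(res.pca, axes = c(1, 2), choix = "ind")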
The contribution of a variable (in percentage) to a given principal component is: (var.cos2 * 100) / (total cos2 of the component)
head(res.pca$var$contrib)
                  Dim.1     Dim.2     Dim.3      Dim.4     Dim.5
X100m 17.544293 1.7505098 7.338659 0.13755240 5.389252
Long.jump 15.293168 4.2904162 2.930094 1.62485936 7.748815
Shot.put 13.060137 0.3967224 21.620432 2.01407269 8.824401
High.jump 9.024811 11.7715838 8.792888 2.54987951 23.115504
X400m 11.935544 4.5799296 6.487636 22.65090599 1.539012
X110m.hurdle 14.157544 0.0332933 16.261261 0.03483735 7.166193
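As a sanity check, the contributions for Dim.1 can be recomputed from the cos2 values using the formula above:
# Recompute variable contributions to the first component from cos2
manual_contrib <- res.pca$var$cos2[, 1] * 100 / sum(res.pca$var$cos2[, 1])
round(manual_contrib - res.pca$var$contrib[, 1], 10)   # ~ 0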
The function fviz_pca_var() is used to visualize variables:
# Default plot
fviz_pca_var(res.pca)
Change the theme:
# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
  theme_minimal()
Control variable colors using their contribution:
# Possible values for the argument col.var are :
# "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, col.var="contrib")Change the color gradient for better interpretation of contributions:
fviz_pca_var(res.pca, col.var="contrib")+
scale_color_gradient2(low="blue", mid="white",
high="red", midpoint=55)+theme_bw()This is helpful to highlight the most important variables in the determination of the principal components.
It’s also possible to automatically control the transparency of variables by their contributions:
# Control the transparency of variables using their contribution
# Possible values for the argument alpha.var are :
# "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, alpha.var="contrib")+
theme_minimal()