Principal component analysis (PCA) allows us to summarize the variation (information) in a data set described by multiple variables. Each variable can be considered a different dimension, and with more than 3 variables it becomes very difficult to visualize the resulting multi-dimensional hyperspace.
The goal of principal component analysis is to transform the initial variables into a new set of variables that explain the variation in the data. These new variables correspond to linear combinations of the originals and are called principal components.
PCA reduces the dimensionality of multivariate data to two or three dimensions that can be visualized graphically, with minimal loss of information.
WHY DIMENSION REDUCTION?
Anything can go wrong in high dimensions
Big Data => Data compression
Visualization
CAUTIONS
Information loss
Some methods, such as PCA, are linear and can miss nonlinear structure
Garbage in, garbage out
VARIANCE: Measures the variability or “spread” of the data; it is the square of the standard deviation.
COVARIANCE: Measures the strength of the relationship between two variables (it can be positive, negative, or zero for no relationship).
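A quick sketch in base R makes the distinction concrete (the vectors x and y below are made up purely for illustration):
# Toy example: variance is the squared standard deviation;
# covariance measures how two variables move together.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 9, 12)
var(x)        # variance of x
sd(x)^2       # identical: variance = (standard deviation)^2
cov(x, y)     # positive: x and y increase together
cov(x, -y)    # negative: the relationship is reversed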
Imagine a data set where observations are ratings of communities on 9 different criteria:
Housing, Climate, Health, Crime, Transportation, Education, Arts, Recreation, Economics
Can we construct, say, 4 new features from the existing ones that would still contain nearly as much information?
PCA: geometrically, a linear transformation from one coordinate system to another
The basis vectors of the new subspace are called principal components (PCs), constructed as linear combinations of the variables of the original data
New variables created by PCA should represent the same amount of information as the original variables (i.e. same total variance, just a change of basis)
Data with m features (variables) can produce m different principal components.
Reduce the dimension by eliminating the later PCs that have low variance.
This is a projection of m-dimensional data onto a lower k-dimensional subspace (k < m).
How do we compress the data while losing the least amount of information?
PC1 is chosen to be the direction in which the observations vary the most (highest variance), as the sketch below illustrates.
Why? Similar observations don’t reveal much information and pattern, do they?
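A minimal sketch of this projection idea, using base R’s prcomp() on simulated data (the data and the choice k = 2 are arbitrary, for illustration only):
# Project 4-dimensional data onto its first k = 2 principal components
set.seed(1)
X <- matrix(rnorm(100 * 4), ncol = 4)           # 100 observations, 4 variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # PCs ordered by variance
apply(pca$x, 2, var)      # variances decrease: PC1 largest, PC4 smallest
k <- 2
scores_k <- pca$x[, 1:k]  # the data re-expressed in the first k PCs
summary(pca)              # proportion of variance retained by each PC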
Another interpretation
PCA finds a set of components (or factors, or scales) that, taken together, best explain the total variance within a dataset.
Each component is constructed by adding together a number of previously measured variables, each with its own weight. These variables are selected and their weights are calculated in such a way that the resulting components or scales explain as much of the overall variance in the data matrix as possible.
In a second step, each of these components is “rotated” so that it is now uncorrelated with any other dimension. What this yields is a small number of new variables or features which summarize most of the variance contained in the larger number of original variables. So PCA achieves dimension reduction by reducing the number of variables.
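Both properties can be checked directly; here is a small sketch with simulated data and base R’s prcomp() (separate from the decathlon example below):
set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3)
pca <- prcomp(X, scale. = TRUE)
# Each component score is a weighted sum of the standardized variables,
# with weights given by the rotation (loadings) matrix:
manual_pc1 <- scale(X) %*% pca$rotation[, 1]
all.equal(as.numeric(manual_pc1), pca$x[, 1])   # TRUE
# The resulting components are mutually uncorrelated:
round(cor(pca$x), 10)                           # identity matrix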
An eigenvector of a linear transformation matrix X is a vector v that, when multiplied by the transformation, gives back the vector itself scaled by a scalar (its corresponding eigenvalue):
Xv=\lambda v
If the data has been centered to have mean zero, Cov(X) is given conveniently by
Cov(X)= \frac{1}{n-1}(X^{T}X)
Cov(X) reveals the shape (spread and orientation) of the data.
Suppose we represent Cov(X) with a vector: it should point in the direction of the largest spread of the data, and its magnitude should equal the spread (variance) in this direction.
It turns out that the largest eigenvector of Cov(X) points in exactly this direction, and its corresponding eigenvalue equals the variance in that direction.
PCA:
Eigendecomposition of X^{T}X (i.e. of Cov(X) for centered data) & linear transformation of X
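This equivalence is easy to verify on simulated data (a sketch; eigenvector signs are arbitrary, so absolute values are compared):
set.seed(3)
X <- scale(matrix(rnorm(40 * 3), ncol = 3))  # centered (and scaled) data
C <- cov(X)                                  # equals t(X) %*% X / (n - 1)
e <- eigen(C)
e$values                   # eigenvalues = variances along the PCs
prcomp(X)$sdev^2           # the same values from prcomp()
abs(e$vectors) - abs(prcomp(X)$rotation)     # ~ 0: same directions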
We will use the following packages
library(FactoMineR)
library(factoextra)
library(corrplot)
Let’s use the decathlon2 dataset from the package factoextra:
data(decathlon2)
head(decathlon2[, 1:6])
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69
CLAY 10.76 7.40 14.26 1.86 49.37 14.05
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38
# Extract only active individuals and variables for principal component analysis:
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6])
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69
CLAY 10.76 7.40 14.26 1.86 49.37 14.05
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38
Before the principal component analysis, we can perform some exploratory data analysis, such as computing descriptive statistics, the correlation matrix, and a scatter plot matrix.
decathlon2.active_stats <- data.frame(
Min = apply(decathlon2.active, 2, min), # minimum
Q1 = apply(decathlon2.active, 2, quantile, 1/4), # First quartile
Med = apply(decathlon2.active, 2, median), # median
Mean = apply(decathlon2.active, 2, mean), # mean
Q3 = apply(decathlon2.active, 2, quantile, 3/4), # Third quartile
Max = apply(decathlon2.active, 2, max) # Maximum
)
decathlon2.active_stats <- round(decathlon2.active_stats, 1)
head(decathlon2.active_stats)
              Min   Q1  Med Mean   Q3  Max
X100m 10.4 10.8 11.0 11.0 11.2 11.6
Long.jump 6.8 7.2 7.3 7.3 7.5 8.0
Shot.put 12.7 14.2 14.7 14.6 15.1 16.4
High.jump 1.9 1.9 2.0 2.0 2.1 2.1
X400m 46.8 49.0 49.4 49.4 50.0 51.2
X110m.hurdle 14.0 14.2 14.4 14.5 14.9 15.7
The correlation between variables can be calculated as follows:
cor.mat <- round(cor(decathlon2.active),2)
head(cor.mat[, 1:6])
              X100m Long.jump Shot.put High.jump X400m X110m.hurdle
X100m 1.00 -0.76 -0.45 -0.40 0.59 0.73
Long.jump -0.76 1.00 0.44 0.34 -0.51 -0.59
Shot.put -0.45 0.44 1.00 0.53 -0.31 -0.38
High.jump -0.40 0.34 0.53 1.00 -0.37 -0.25
X400m 0.59 -0.51 -0.31 -0.37 1.00 0.58
X110m.hurdle 0.73 -0.59 -0.38 -0.25 0.58 1.00
Visualize the correlation matrix using a correlogram: the package corrplot is required.
corrplot(cor.mat, type="upper", order="hclust",
tl.col="black", tl.srt=45)Make a scatter plot matrix showing the correlation coefficients between variables and the significance levels : the package PerformanceAnalytics is required.
# install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")Loading required package: xts
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
################################### WARNING ###################################
# We noticed you have dplyr installed. The dplyr lag() function breaks how #
# base R's lag() function is supposed to work, which breaks lag(my_xts). #
# #
# If you call library(dplyr) later in this session, then calls to lag(my_xts) #
# that you enter or source() into this session won't work correctly. #
# #
# All package code is unaffected because it is protected by the R namespace #
# mechanism. #
# #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
# #
# You can use stats::lag() to make sure you're not using dplyr::lag(), or you #
# can add conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
# dplyr from breaking base R's lag() function. #
################################### WARNING ###################################
Attaching package: 'PerformanceAnalytics'
The following object is masked from 'package:graphics':
legend
chart.Correlation(decathlon2.active[, 1:6], histogram=TRUE, pch=19)
The function PCA() [in the FactoMineR package] can be used. A simplified format is:
PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)
X : a data frame. Rows are individuals and columns are numeric variables
scale.unit : a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to a common scale prevents some variables from becoming dominant simply because of their large measurement units.
ncp : number of dimensions kept in the final results.
graph : a logical value. If TRUE a graph is displayed.
res.pca <- PCA(decathlon2.active, graph = FALSE)
print(res.pca)
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 23 individuals, described by 10 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
The proportion of variance retained by the principal components can be extracted as follows:
eigenvalues <- res.pca$eig
head(eigenvalues[, 1:2])
       eigenvalue percentage of variance
comp 1 4.1242133 41.242133
comp 2 1.8385309 18.385309
comp 3 1.2391403 12.391403
comp 4 0.8194402 8.194402
comp 5 0.7015528 7.015528
comp 6 0.4228828 4.228828
Eigenvalues correspond to the amount of the variation explained by each principal component (PC). Eigenvalues are large for the first PC and small for the subsequent PCs.
A PC with an eigenvalue > 1 accounts for more variance than a single original variable does in standardized data. This is commonly used as a cutoff for determining the number of PCs to retain.
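Applied to the eigenvalues extracted above, this cutoff would keep the first three components:
# Kaiser criterion: count PCs with eigenvalue > 1
sum(eigenvalues[, 1] > 1)   # 3 (comp 1-3 in the table above)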
A scree plot is a graph of the eigenvalues/variances associated with the components. Make one using base graphics:
barplot(eigenvalues[, 2], names.arg=1:nrow(eigenvalues),
main = "Variances",
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eigenvalues), eigenvalues[, 2],
type="b", pch=19, col = "red")~60% of the informations (variances) contained in the data are retained by the first two principal components.
Make the scree plot using the package factoextra:
fviz_screeplot(res.pca, ncp=10)
The function plot.PCA() can be used. A simplified format is:
plot.PCA(x, axes = c(1,2), choix = c("ind", "var"))
x : An object of class PCA
axes : A numeric vector of length 2 specifying the component to plot
choix : The graph to be plotted. Possible values are “ind” for the individuals and “var” for the variables
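For example, using the res.pca object created above:
# Variables graph on the first two components
plot.PCA(res.pca, axes = c(1, 2), choix = "var")
# Individuals graph on the same components
plot.PCA(res.pca, axes = c(1, 2), choix = "ind")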
The contribution of a variable (in percentage) to a given principal component is: (var.cos2 * 100) / (total cos2 of the component)
head(res.pca$var$contrib)
                  Dim.1     Dim.2     Dim.3      Dim.4     Dim.5
X100m 17.544293 1.7505098 7.338659 0.13755240 5.389252
Long.jump 15.293168 4.2904162 2.930094 1.62485936 7.748815
Shot.put 13.060137 0.3967224 21.620432 2.01407269 8.824401
High.jump 9.024811 11.7715838 8.792888 2.54987951 23.115504
X400m 11.935544 4.5799296 6.487636 22.65090599 1.539012
X110m.hurdle 14.157544 0.0332933 16.261261 0.03483735 7.166193
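As a sanity check, the contributions for Dim.1 can be recomputed from the cos2 values using the formula above:
# Recompute variable contributions to the first component from cos2
manual_contrib <- res.pca$var$cos2[, 1] * 100 / sum(res.pca$var$cos2[, 1])
round(manual_contrib - res.pca$var$contrib[, 1], 10)   # ~ 0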
The function fviz_pca_var() is used to visualize variables:
# Default plot
fviz_pca_var(res.pca)
Change the theme:
# Change color and theme
fviz_pca_var(res.pca, col.var="steelblue")+
  theme_minimal()
Control variable colors using their contribution:
# Possible values for the argument col.var are :
# "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, col.var="contrib")Change the color gradient for better interpretation of contributions:
fviz_pca_var(res.pca, col.var="contrib")+
scale_color_gradient2(low="blue", mid="white",
high="red", midpoint=55)+theme_bw()This is helpful to highlight the most important variables in the determination of the principal components.
It’s also possible to automatically control the transparency of variables by their contributions:
# Control the transparency of variables using their contribution
# Possible values for the argument alpha.var are :
# "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, alpha.var="contrib")+
theme_minimal()