Principal Component Analysis (PCA) is a mathematical technique based on the eigenvalue decomposition or singular value decomposition of a data matrix (or of its correlation/covariance matrix). It represents a complicated data set (a set of variables) in a simpler form, in which the information carried by the original variables is summarized by fewer components that capture the majority of the variance in the original data. This is known as a variable reduction strategy, and it is also commonly used to form indices in the behavioral/social sciences. PCA is closely related to factor analysis; indeed, the initial solution of a factor analysis is often identical to the PCA solution. PCA preserves the orthogonality of the components, while factor analysis may relax this constraint.
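To make this connection concrete, here is a minimal sketch on simulated data (the objects X, fit and eig below are illustrative only): the variances of the components returned by prcomp() are exactly the eigenvalues of the covariance matrix.
# Minimal sketch on simulated data: the component variances from prcomp()
# are the eigenvalues of the covariance matrix.
set.seed(123)
X <- matrix(rnorm(200 * 3), ncol = 3)
X[, 2] <- X[, 2] + 0.5 * X[, 1]   # induce some correlation between variables
fit <- prcomp(X)                  # PCA on the centred data
eig <- eigen(cov(X))$values       # eigenvalues of the covariance matrix
all.equal(fit$sdev^2, eig)        # TRUE: both routes give the same variances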
The dataset that this project will operate on contains data on 721 Pokemon (800 rows, since some Pokemon appear in several forms), including their number, name, first and second type, generation, a Legendary flag, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed. It was provided on Kaggle (https://www.kaggle.com/abcsds/pokemon).
library(dplyr)  # for the pipe and select()

data <- read.csv('Pokemon.csv')
data_pca <- data %>% select(HP, Attack, Defense, Speed, Sp..Atk, Sp..Def)
summary(data_pca)
## HP Attack Defense Speed Sp..Atk Sp..Def
## Min. : 1.00 Min. : 5 Min. : 5.00 Min. : 5.00 Min. : 10.00 Min. : 20.0
## 1st Qu.: 50.00 1st Qu.: 55 1st Qu.: 50.00 1st Qu.: 45.00 1st Qu.: 49.75 1st Qu.: 50.0
## Median : 65.00 Median : 75 Median : 70.00 Median : 65.00 Median : 65.00 Median : 70.0
## Mean : 69.26 Mean : 79 Mean : 73.84 Mean : 68.28 Mean : 72.82 Mean : 71.9
## 3rd Qu.: 80.00 3rd Qu.:100 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.: 95.00 3rd Qu.: 90.0
## Max. :255.00 Max. :190 Max. :230.00 Max. :180.00 Max. :194.00 Max. :230.0
head(data_pca)
## HP Attack Defense Speed Sp..Atk Sp..Def
## 1 45 49 49 45 65 65
## 2 60 62 63 60 80 80
## 3 80 82 83 80 100 100
## 4 80 100 123 80 122 120
## 5 39 52 43 65 60 50
## 6 58 64 58 80 80 65
pairs(data_pca)
We may proceed with PCA in two ways: by solving the eigendecomposition of the correlation matrix by hand, or by using R's built-in prcomp() function on the (standardized) raw data.
I want to show that we obtain the same results with both of these methods.
The raw data has already been imported, so a correlation matrix is computed from it; this correlation matrix will then be used to run the PCA.
cor_mat<-cor(data_pca)
The first step is to check whether the data can be reduced at all. Reduction relies on the variables being correlated with each other, which is what allows them to be combined; if they were all orthogonal, there would be no way to combine them into factors or components. One basic test is Bartlett's test of sphericity. Its null hypothesis is that the correlation matrix is an identity matrix, i.e. that it has ones on the diagonal and zeroes on all off-diagonal entries. The test statistic follows a chi-square distribution, and to proceed we want to see a statistically significant result.
library(REdaS)  # provides bart_spher()

bart_spher(data_pca)
## Bartlett's Test of Sphericity
##
## Call: bart_spher(x = data_pca)
##
## X2 = 1320.297
## df = 15
## p-value < 2.22e-16
The p-value is below 0.001, so the result is statistically significant and PCA can be performed. This agrees with the correlation matrix itself, which shows that the variables are positively correlated with each other.
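As a sanity check, the same statistic can be computed by hand from the standard formula for Bartlett's test; it should reproduce the X2 and df reported above.
# Bartlett's test statistic by hand: X2 = -(n - 1 - (2p + 5)/6) * ln|R|,
# with df = p(p - 1)/2
n <- nrow(data_pca)
p <- ncol(data_pca)
X2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(cor(data_pca)))
df <- p * (p - 1) / 2
c(X2 = X2, df = df, p.value = pchisq(X2, df, lower.tail = FALSE))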
The next step is to obtain the eigenvalues and eigenvectors of the correlation matrix. We create an a x a matrix of zeroes (where a is the number of analyzed variables) and fill its diagonal with the eigenvalues. The matrix is diagonal because the principal components are uncorrelated with one another, so all off-diagonal entries are zero.
e<-eigen(cor_mat)
L<-e$values
Vm<-matrix(0,nrow=ncol(data_pca),ncol=ncol(data_pca))
diag(Vm)<-L
Vm
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2.71144 0.000000 0.0000000 0.0000000 0.0000000 0.000000
## [2,] 0.00000 1.093521 0.0000000 0.0000000 0.0000000 0.000000
## [3,] 0.00000 0.000000 0.7787452 0.0000000 0.0000000 0.000000
## [4,] 0.00000 0.000000 0.0000000 0.7206653 0.0000000 0.000000
## [5,] 0.00000 0.000000 0.0000000 0.0000000 0.4285402 0.000000
## [6,] 0.00000 0.000000 0.0000000 0.0000000 0.0000000 0.267088
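As a quick check, the eigenvectors together with this diagonal eigenvalue matrix reconstruct the original correlation matrix.
# Spectral decomposition check: R = V %*% Vm %*% t(V)
V <- e$vectors
all.equal(cor_mat, V %*% Vm %*% t(V), check.attributes = FALSE)   # TRUE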
If we want to see how much of the variance is explained by each principal component, we divide each eigenvalue by the sum of all eigenvalues. For a correlation matrix this sum equals the number of variables (the trace of the matrix), so dividing by length(L) gives the same result.
L/length(L)
## [1] 0.45190665 0.18225358 0.12979086 0.12011089 0.07142337 0.04451466
The first component accounts for the most variance (2.711/6 ≈ 0.45). The first component accounts for 45% of the variance, the second for 18%, the third for 13%, the fourth for 12%, the fifth for 7%, and the sixth for 4%.
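A quick check of the shortcut used above: the eigenvalues of a correlation matrix sum to the number of variables, so dividing by sum(L) and by length(L) gives identical proportions.
# Eigenvalues of a correlation matrix sum to the number of variables
sum(L)                               # = 6 (up to numerical precision)
all.equal(L / sum(L), L / length(L)) # TRUE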
To compute component scores, PC1 = a1*z1 + a2*z2 + ... + a6*z6, we need the weights (the eigenvectors) and the standardized values of the original variables (z1, z2, etc.). We can have as many component scores as we have variables (though not all will be useful). To compute the PCA scores, we multiply the matrix of standardized values by the matrix of eigenvectors:
norm_dat<-scale(data_pca)
pca.scores<- norm_dat %*% e$vectors
colnames(pca.scores)<-c('pca1','pca2','pca3','pca4','pca5','pca6')
head(pca.scores)
## pca1 pca2 pca3 pca4 pca5 pca6
## [1,] 1.5554017 -0.02146869 0.6660872 0.18406113 -0.403554706 0.30281472
## [2,] 0.3626397 -0.05023711 0.6674959 0.26908616 -0.225647228 0.19436515
## [3,] -1.2793512 -0.06268101 0.6235239 0.33118417 -0.001544326 0.06813435
## [4,] -2.6192779 0.70382271 0.9949151 -0.19919599 -0.309976280 -0.08732570
## [5,] 1.7571845 -0.70573748 0.4111965 -0.26843393 -0.168771440 0.06932488
## [6,] 0.4353607 -0.74735577 0.4059051 -0.04938257 -0.061009227 -0.13969565
# The second approach: R's built-in prcomp() on the standardized raw data
pca <- prcomp(data_pca, center = TRUE, scale = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion 0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
cat('\n')
cat(L/length(L))
## 0.4519067 0.1822536 0.1297909 0.1201109 0.07142337 0.04451466
It is now clear that the proportions of explained variance are equal to the proportions calculated earlier from the eigenvalues. Automated techniques are very convenient in R: there are numerous packages implementing these statistical measures, so we do not have to work through the whole mathematical background as presented above.
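As a further check that the two routes agree, we can compare the component scores themselves; they match up to the arbitrary sign of each eigenvector.
# Manual scores vs. prcomp() scores: identical up to the sign of each column
all.equal(abs(pca.scores), abs(pca$x), check.attributes = FALSE)   # TRUE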
Several methods have been suggested for deciding how many components/factors to retain, some better than others. Kaiser's rule suggests keeping components with eigenvalues > 1, but this is not always a good criterion (the reasoning being that if an eigenvalue is below 1, the component accounts for less variance than one of the original variables). Others suggest using the percentage of variance accounted for (as long as it is "meaningful"). One of the better recognized ways to decide empirically how many factors to retain is to run a parallel analysis (Horn, 1965).
From the hornpa documentation: the program generates a specified number of random datasets with the same number of variables as entered by the user. Because even correlation matrices built from pure random noise have eigenvalues, these eigenvalues are computed for each random dataset and collected, and their mean and a specified percentile (95th by default) are calculated. The output table shows how large eigenvalues can be merely as a result of using randomly generated data. If the user's own dataset has an actual eigenvalue greater than the corresponding generated eigenvalue (which is based on random noise), that supports retaining that factor. In other words, if the i-th eigenvalue from the actual data is larger than the chosen percentile of the i-th eigenvalue from the randomly generated data, there is empirical support to retain that factor.
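Before calling the hornpa package, here is a minimal sketch of that idea (with fewer repetitions for speed; the numbers will be close to, but not identical to, hornpa's table because the random draws differ).
# Parallel analysis by hand: eigenvalues of correlation matrices computed
# from purely random data with the same dimensions as our dataset (800 x 6)
set.seed(1)
rand_eigs <- replicate(200, {
  rand <- matrix(rnorm(800 * 6), ncol = 6)
  eigen(cor(rand))$values
})
apply(rand_eigs, 1, quantile, probs = 0.95)  # 95th percentile per component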
We specify the number of variables in the original dataset (k = 6), the sample size (size = 800), the number of repetitions (reps = 500), and an optional seed so the results are reproducible (seed = 1111).
library(hornpa)  # parallel analysis

hornpa(k = 6, size = 800, reps = 500, seed = 1111)
##
## Parallel Analysis Results
##
## Method: pca
## Number of variables: 6
## Sample size: 800
## Number of correlation matrices: 500
## Seed: 1111
## Percentile: 0.95
##
## Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component.
##
## Component Mean 0.95
## 1 1.116 1.163
## 2 1.062 1.096
## 3 1.018 1.044
## 4 0.979 1.003
## 5 0.938 0.966
## 6 0.888 0.924
cat('\n')
cat(e$values)
## 2.71144 1.093521 0.7787452 0.7206653 0.4285402 0.267088
Our first component's eigenvalue is higher than the 95th percentile (2.71 > 1.163). Strictly speaking the retained components should end there, because the second eigenvalue is already just below its 95th percentile (1.094 < 1.096), but since the margin is so small I will keep the second component as well.
Moreover, by the variance-contribution criterion the second component still explains about 18% of the variance. Using the so-called "elbow method", the sharpest drop comes after the first component and the scree curve is nearly flat from the third component onwards, which again supports retaining two or three components. This relationship is shown in the plot below.
To sum up: to decide how many components to keep and which variables contribute to which principal components, we may rely on three things: the scree plot, the variable correlation circle, and the rotation (loadings) matrix shown below.
library(factoextra)  # fviz_* helper plots

fviz_eig(pca)
fviz_pca_var(pca, col.var = "steelblue")
pca$rotation[,1:2]
## PC1 PC2
## HP 0.3898858 0.08483455
## Attack 0.4392537 -0.01182493
## Defense 0.3637473 0.62878867
## Speed 0.3354405 -0.66846305
## Sp..Atk 0.4571623 -0.30541446
## Sp..Def 0.4485704 0.23909670
The plot shows that all of the variables (Speed, Attack, HP, Defense, Special Attack, and Special Defense) are positively correlated with the first principal component. On the other hand, only Defense (regular and Special) is positively correlated with the second PC; the remaining variables are either negatively correlated with it or hardly correlated at all. These conclusions are confirmed by the rotation matrix from the PCA.
Let us see, then, how the principal components are built.
library(gridExtra)  # grid.arrange()
library(corrplot)   # corrplot()

var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes = 1)
b <- fviz_contrib(pca, "var", axes = 2)
grid.arrange(a, b, top = 'Contribution to the Principal Components')
corrplot(var$contrib, is.corr = FALSE)
It turns out that the first dimension is driven mainly by the Special stats (Special Attack and Special Defense) and by regular Attack, while the second dimension is driven by Speed and Defense.
Apart from the simple variable plot, we can use a more elaborate biplot, shown further below, which draws ellipses around the Pokemon generations. We could also include the 3rd PC and plot 3D charts (e.g. with pca3d from the pca3d package).
We can also create a nice representation of the variables' quality of representation on the circle plot. If a variable is perfectly represented by the first two principal components, the sum of its cos2 values (the quality of representation of the variable on each component; red on the plot) will be equal, or close, to one. It therefore seems that Defense and Speed are very well explained by the first two PCs, whereas Attack and HP are not explained as well. Perhaps the later PCs (3 and onwards) explain them better? As the contribution plot above suggested, PC4 picks up HP and PC3 picks up Attack, but the two-dimensional nature of screens and paper does not allow for such plots here.
fviz_pca_var(pca, col.var = "cos2", gradient.cols = c("#00AFBB","#E7B800","#FC4E07"), repel = T)
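The underlying cos2 values can also be inspected directly; summing them over the first two components quantifies how well each variable is represented in the 2D plot.
# cos2 of each variable on the first two PCs, and their sum per variable
round(var$cos2[, 1:2], 3)
rowSums(var$cos2[, 1:2])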
Let us now revisit the conclusions drawn from the contribution analysis and examine the biplot.
library(ggbiplot)  # PCA biplot built on ggplot2

ggbiplot(pca, obs.scale = 1, var.scale = 1, groups = as.factor(data$Generation), ellipse = TRUE, circle = TRUE)
The observations are spread with more or less the same variance along each variable, with some outliers along the second principal component: there are observations with low Defense and mediocre statistics overall, and on the other hand some observations (Legendary Pokemon) with very high defensive statistics. The generation of a Pokemon does not seem to affect the analysis at all.
Moreover, Pokemon lying on the same side as a given variable's arrow have high values of that variable, while those on the opposite side have low values.
We can also see how well individual Pokemon are represented on the principal components. As seen below, most of the best-represented observations (highest cos2) lie far away from the centre of the plot.
fviz_pca_ind(pca, col.ind = "cos2", label="none", gradient.cols = c("#00AFBB","#E7B800","#FC4E07"))
Let us run a quick k-means clustering on the variables' principal component coordinates to see how the clustering maps onto the circle plot.
kmd<-kmeans(var$coord, centers = 3, nstart = 50)
grp<-as.factor(kmd$cluster)
fviz_pca_var(pca, col.var = grp, palette = "Dark2", legend.title = "Clusters", repel = T)
It turns out that Defense and Special Defense are grouped together, whereas Special Attack and Attack are not. This may suggest that Defense is a more coherent, and perhaps more important, attribute than Attack in Pokemon.
I would also like to present hierarchical clustering on principal components (HCPC). It can be summarized as follows: first, PCA is computed and only the first few principal components are retained; then hierarchical clustering is performed on these component scores, and the resulting partition can optionally be consolidated with k-means.
This approach is useful when we have a large data set of continuous variables: principal component analysis reduces the dimension of the data before the hierarchical clustering, so instead of clustering on only two variables we can cluster on a summary of the whole dataset.
Moreover, when the data set contains categorical variables, a correspondence analysis can be used (instead of PCA) to transform the categorical variables into a few continuous principal components, which can then be used as the input of the cluster analysis.
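As an aside, here is a minimal sketch of that categorical route using Multiple Correspondence Analysis on the two type columns; the column names Type.1 and Type.2 are assumed from read.csv() on the Kaggle file, and this is for illustration only.
# Minimal sketch: MCA on the (categorical) type columns; the resulting
# continuous coordinates could serve as input to a cluster analysis.
# Column names Type.1 / Type.2 are assumed from read.csv() on the Kaggle file.
library(FactoMineR)
types <- data.frame(lapply(data[, c("Type.1", "Type.2")], factor))
mca <- MCA(types, ncp = 3, graph = FALSE)
head(mca$ind$coord)   # continuous coordinates per Pokemon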
Let's test the PCA-based hierarchical clustering on our Pokemon dataset.
library(FactoMineR)  # PCA() and HCPC()

pca2 <- PCA(data_pca, ncp = 3, graph = FALSE)
hcpc <- HCPC(pca2, graph = FALSE)
#fviz_dend(hcpc, cex = 0.8, palette = "Dark2", rect = T, rect_fill = T, rect_border = "Dark2", show_labels = F) #takes too long
fviz_cluster(hcpc, palette = "Dark2", show.clust.cent = T, main = "Factor map", labelsize = 1)
The graph shows the clustering on the same factor map as discussed before, and three distinct groups of Pokemon are now clearly visible. Cluster 1 contains very strong, defensively oriented Pokemon, while clusters 2 and 3 contain mediocre and weak Pokemon, split into defensive and speedy ones (clusters 3 and 2, respectively).
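This reading can be backed up with the cluster description that HCPC itself returns: the element below lists, for each cluster, the variables whose within-cluster means differ most from the overall means.
# Which stats characterise each cluster (significantly higher/lower means)
hcpc$desc.var$quanti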
Image compression is a frequently encountered application of this dimension reduction technique. Principal component analysis can reduce the dimensions of the matrix representing an image, and projecting back from those reduced dimensions reconstructs an image that retains most of its quality while taking up fewer kilobytes. We will use PCA to compress a photo of my faculty and show that as the number of principal components used in the reconstruction increases, the quality of the image, compared to the original, improves.
Here is the original picture:
The code below splits the photo into three matrices, one for each colour channel: red, green, and blue.
PCA is then performed on each colour matrix separately; the results are combined into a list, and the image is reconstructed from the scores and rotation matrices.
library(jpeg)  # readJPEG() and writeJPEG()

wne <- readJPEG('rekrutacja-wne-uw.jpg')
r <- wne[,,1]
g <- wne[,,2]
b <- wne[,,3]
wne.r.pca <- prcomp(r, center = FALSE)
wne.g.pca <- prcomp(g, center = FALSE)
wne.b.pca <- prcomp(b, center = FALSE)
rgb.pca <- list(wne.r.pca, wne.g.pca, wne.b.pca)
for (i in seq.int(3, round(nrow(wne) - 10), by = 40)) {
  # Reconstruct each colour channel from its first i principal components
  pca.img <- sapply(rgb.pca, function(j) {
    compressed.img <- j$x[, 1:i] %*% t(j$rotation[, 1:i])
  }, simplify = 'array')
  writeJPEG(pca.img, paste('wne_compressed_', round(i, 0), '_components.jpg', sep = ''))
}
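As a rough guide to what is being kept, we can look at the cumulative proportion of (uncentred) variance captured by the component counts examined below; the indices in k assume the image has at least 483 components, which it must, since the 483-component file is produced by the loop above.
# Cumulative proportion of variance captured by the first k components
# in each colour channel
k <- c(3, 43, 123, 483)
vars <- sapply(rgb.pca, function(p) summary(p)$importance["Cumulative Proportion", k])
dimnames(vars) <- list(paste(k, "PCs"), c("red", "green", "blue"))
round(vars, 4)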
Let us see what the image looks like with only three PCs:
It is blurry and does not resemble the original picture.
But using more than 40 PCs gives us the result below:
It already starts to look good. Using 123 principal components we obtain the picture below:
The image is sharp, with only some distortions in the clouds. Adding more PCs will improve the quality of the compressed picture, but it will also increase the size of the compressed file.
The difference between the original and the compressed picture becomes imperceptible at around ~450 PCs.
Below is a comparison of file sizes depending on the number of principal components used.
3 components: 40kB
43 components: 84kB
123 components: 91kB
483 components: 95kB
original file: 123kB
Therefore we managed to compress the picture to about 77% of its original size (95 kB vs. 123 kB) without a visible loss of quality.
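These sizes can also be checked programmatically; a minimal sketch, assuming the compressed files produced above are still in the working directory.
# File sizes (in kB) of selected compressed images and of the original photo
files <- paste0('wne_compressed_', c(3, 43, 123, 483), '_components.jpg')
round(file.info(c(files, 'rekrutacja-wne-uw.jpg'))$size / 1024)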