Dawn Daras MS

Principal Component Analysis Using R

In our other tutorials we covered, first, how to describe data and, second, how to demonstrate relationships and prediction. Now we are going to explore dimension reduction. What is “dimension reduction”? In short, many datasets, particularly with big data, are messy. They often contain numerous independent/predictor variables for the dependent variable. To make sense of this data, how does one reduce it to the most important predictors, both quantitatively and qualitatively: those that have the strongest correlation and make the most business sense? We could simply eliminate features, but instead we will be extracting features. In essence, principal component analysis transforms the original variables into a set of new, uncorrelated variables.

This is our Principal Component Analysis (PCA). There are three assumptions underlying PCA:

- The correlation matrix is not an identity matrix, i.e., the variables are not completely uncorrelated (tested with Bartlett’s test for sphericity).
- Sampling adequacy, or an appropriate number of observations relative to the number of variables.
- The independent variables are correlated with each other.

We are going to use the “Fake Bills” dataset. The columns are as follows:

- length, the length of the banknote in mm
- height_left, the height of the left side of the banknote in mm
- height_right, the height of the right side of the banknote in mm
- diagonal, the diagonal of the banknote in mm
- margin_low, the lower side margin in mm
- margin_up, the upper side margin in mm
- is_genuine

In this dataset, the DV/dependent variable is “is_genuine”, meaning that the bill is not fake. We are going to reduce the set of variables to the ones that hold the most information.

Must be the money!
library(corrr)
library(ggcorrplot)
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(FactoMineR)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(vembedr)
df <- read.csv("/cloud/project/fakebills.csv", header=TRUE, stringsAsFactors=FALSE)
head(df)
##   is_genuine diagonal height_left height_right margin_low margin_up length
## 1          1   171.81      104.86       104.95       4.52      2.89 112.83
## 2          1   171.46      103.36       103.66       3.77      2.99 113.09
## 3          1   172.69      104.48       103.50       4.40      2.94 113.16
## 4          1   171.36      103.91       103.94       3.62      3.01 113.51
## 5          1   171.73      104.28       103.46       4.04      3.48 112.54
## 6          1   172.17      103.74       104.08       4.42      2.95 112.81

Before conducting the PCA, we remove the dependent variable (DV) “is_genuine” from the dataset; otherwise it becomes a factor used to explain itself in the components that are created.

#Here I removed the "is_genuine" dependent variable, otherwise it becomes a factor to explain itself
df <- df[,-1]

We’ll run some quick summary statistics on the dataset before we dive into PCA.

#Let's look at the summary statistics and the distributions of the variables in Fake Bills
summary(df)
##     diagonal      height_left     height_right     margin_low   
##  Min.   :171.0   Min.   :103.1   Min.   :102.8   Min.   :2.980  
##  1st Qu.:171.8   1st Qu.:103.8   1st Qu.:103.7   1st Qu.:4.015  
##  Median :172.0   Median :104.0   Median :103.9   Median :4.310  
##  Mean   :172.0   Mean   :104.0   Mean   :103.9   Mean   :4.486  
##  3rd Qu.:172.2   3rd Qu.:104.2   3rd Qu.:104.2   3rd Qu.:4.870  
##  Max.   :173.0   Max.   :104.9   Max.   :105.0   Max.   :6.900  
##                                                  NA's   :37     
##    margin_up         length     
##  Min.   :2.270   Min.   :109.5  
##  1st Qu.:2.990   1st Qu.:112.0  
##  Median :3.140   Median :113.0  
##  Mean   :3.151   Mean   :112.7  
##  3rd Qu.:3.310   3rd Qu.:113.3  
##  Max.   :3.910   Max.   :114.4  
## 
library(repr)
options(repr.plot.width=16, repr.plot.height=16)
#Plot a histogram for each of the numeric variables in the dataset
a <- colnames(df)
par(mfrow = c(2, 3))
for (i in 1:length(a)){
  hist(df[[a[i]]], main = paste("Hist. of", a[i], sep = " "), xlab = a[i])
}

#We are also going to look at the data types.
#PCA only works with numerical data
str(df)
## 'data.frame':    1500 obs. of  6 variables:
##  $ diagonal    : num  172 171 173 171 172 ...
##  $ height_left : num  105 103 104 104 104 ...
##  $ height_right: num  105 104 104 104 103 ...
##  $ margin_low  : num  4.52 3.77 4.4 3.62 4.04 4.42 4.58 3.98 4 4.04 ...
##  $ margin_up   : num  2.89 2.99 2.94 3.01 3.48 2.95 3.26 2.92 3.25 3.25 ...
##  $ length      : num  113 113 113 114 113 ...

Now let’s talk about the five basic steps of Principal Component Analysis (PCA). Remember that the goal of PCA is to measure the contribution of each variable: it takes highly dimensional data and captures the most important information in it.

PCA Roadmap:

1) Data normalization - because the variables within a dataset can be on different scales, they must be normalized, or put on the same scale, so that their contributions can be measured equally. For example, if you are looking at salary and job satisfaction scores, those will be on different scales and must be normalized before their respective contributions can be compared.

2) Covariance matrix - a covariance matrix of the normalized data is computed.

3) Eigenvectors and eigenvalues - to build a basic understanding of PCA, it helps to picture the data as points in multi-dimensional space; the relationships between variables do not lie on a flat plane. An eigenvector represents a direction, such as vertical or 90 degrees, and an eigenvalue represents the amount of variance in that direction. Each eigenvector has an eigenvalue. In this step we compute the eigenvalues and eigenvectors to identify the principal components (a minimal sketch of steps 1-3 follows below).

4) Selecting principal components - we do this based on the variation explained by each component and on scree plots.

5) Data transformation according to the new principal components - the data is re-expressed along the new principal component axes.

We also need to check for missing data, because missing values will bias the results when we normalize. The colSums() function combined with is.na() returns the number of missing values in each column.
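Before we run that check, here is a minimal sketch of steps 1-3 on a small simulated matrix (a toy illustration only, separate from the Fake Bills workflow, using base R's scale(), cov(), and eigen()):

set.seed(42)
toy <- matrix(rnorm(20), ncol = 2)   # 10 observations, 2 variables
toy_scaled <- scale(toy)             # Step 1: normalization (mean 0, sd 1)
cov_mat <- cov(toy_scaled)           # Step 2: covariance matrix of the normalized data
eig <- eigen(cov_mat)                # Step 3: eigenvalues and eigenvectors
eig$values                           # variance captured along each direction
eig$vectors                          # the directions, i.e., the principal components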

colSums(is.na(df))
##     diagonal  height_left height_right   margin_low    margin_up       length 
##            0            0            0           37            0            0
count(df)
##      n
## 1 1500

Because there are 37 observations missing in the “margin_low” column, we are going to delete those rows; for PCA in this instance, imputation would not be appropriate. After removing them, running the count() function from dplyr tells us that we still have 1,463 observations.

df <- na.omit(df)
count(df)
##      n
## 1 1463
#In our next step we are putting all variables on the same scale through normalization
data_normalized <- scale(df)
head(data_normalized)
##     diagonal height_left height_right  margin_low  margin_up     length
## 1 -0.4884266   2.7658616    3.1726813  0.05126870 -1.1365937  0.1777815
## 2 -1.6342497  -2.2407234   -0.8065749 -1.07856821 -0.7045647  0.4755295
## 3  2.3925002   1.4975267   -1.3001261 -0.12950521 -0.9205792  0.5556924
## 4 -1.9616278  -0.4049756    0.0571396 -1.30453559 -0.6181589  0.9565070
## 5 -0.7503290   0.8299821   -1.4235139 -0.67182692  1.4123777 -0.1543220
## 6  0.6901344  -0.9723885    0.4889969 -0.09937622 -0.8773763  0.1548778
#calculating a correlation matrix
corr_matrix <- cor(data_normalized)
corr_matrix
##                 diagonal height_left height_right margin_low   margin_up
## diagonal      1.00000000  0.01826548  -0.01942814 -0.1115341 -0.05914674
## height_left   0.01826548  1.00000000   0.23513162  0.3026430  0.24381218
## height_right -0.01942814  0.23513162   1.00000000  0.3910851  0.30686721
## margin_low   -0.11153408  0.30264296   0.39108515  1.0000000  0.43160607
## margin_up    -0.05914674  0.24381218   0.30686721  0.4316061  1.00000000
## length        0.10075779 -0.31434436  -0.40427216 -0.6667528 -0.52113891
##                  length
## diagonal      0.1007578
## height_left  -0.3143444
## height_right -0.4042722
## margin_low   -0.6667528
## margin_up    -0.5211389
## length        1.0000000

What we can see is that the most highly correlated pair is margin_low and length (strongly negative, -0.67), followed by margin_up and length (-0.52), height_right and length (-0.40), and height_left and length (mildly negative, -0.31).
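As a quick numeric check of that reading (an optional step, not required for the PCA itself), we can sort the correlations of every variable with length:

#correlations with length, from most negative to most positive
sort(corr_matrix[, "length"])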

#visualization of the correlation matrix
library(repr)
options(repr.plot.width=12, repr.plot.height=12)
library(corrplot)
corrplot(corr_matrix, method = 'square', order = 'FPC', type = 'lower', diag = FALSE)

Bartlett’s Test for Sphericity

To test the first assumption, we perform Bartlett’s test for sphericity. The null hypothesis for Bartlett’s test is that the correlation matrix is an identity matrix, i.e., that there is no correlation between the variables; if that were true, PCA would be inappropriate, since it depends on correlation between the variables. Our significance level is .05: a p-value below .05 means we can reject the null hypothesis and proceed.

#We can reject the null and proceed
cortest.bartlett(corr_matrix, n = nrow(data_normalized))
## $chisq
## [1] 1896.66
## 
## $p.value
## [1] 0
## 
## $df
## [1] 15

Now we will check whether we have an adequate sample to run PCA using the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. We calculate this with the KMO() function from the psych library.

#Generally, KMO values of .80 and above are considered more than adequate sampling
#Another rule of thumb is 20 obs per variable, which we more than meet
library(psych)
KMO(data_normalized)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data_normalized)
## Overall MSA =  0.78
## MSA for each item = 
##     diagonal  height_left height_right   margin_low    margin_up       length 
##         0.70         0.88         0.88         0.74         0.83         0.71

Applying Our PCA

#Now we can proceed with our PCA
#princomp runs the PCA and summary shows the result
data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4    Comp.5
## Standard deviation     0.9619396 0.34795246 0.31873961 0.28336861 0.2200564
## Proportion of Variance 0.7247717 0.09482993 0.07957519 0.06289398 0.0379292
## Cumulative Proportion  0.7247717 0.81960163 0.89917682 0.96207080 1.0000000
##                        Comp.6
## Standard deviation          0
## Proportion of Variance      0
## Cumulative Proportion       1

We can see that, effectively, five principal components have been generated (the sixth carries zero variance). Looking at the Proportion of Variance row, the first component explains about 72.5% of the total variance, Component 2 explains an additional 9.5%, and Component 3 explains a further 8%. This means the first component represents the majority of the information in the data, then the second, with the remaining components contributing less and less.
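If you want these proportions as numbers rather than reading them off the summary table, they can be recomputed from the component standard deviations stored in the princomp object (a small optional check):

#cumulative proportion of variance from the stored component standard deviations
round(cumsum(data.pca$sdev^2) / sum(data.pca$sdev^2), 4)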

#loadings
data.pca$loadings[, 1:4]
##                  Comp.1      Comp.2      Comp.3      Comp.4
## diagonal      0.1941426  0.96196923  0.01891581  0.04367662
## height_left  -0.2700698  0.02729285 -0.91708721  0.12562547
## height_right -0.3533459 -0.01959323  0.32401220  0.77820769
## margin_low   -0.5076471  0.04044728  0.07269246 -0.04384266
## margin_up    -0.4217331  0.03099468  0.21944983 -0.61115732
## length        0.5735445 -0.26625128 -0.01291994  0.03560807

The loading matrix shows that the first principal component has a high positive loading for length, but high negative loadings for margin_low and height_right. This suggests that bills with a longer length will, overall, have a lower margin and a lower height on the right. The second principal component has only one loading that is very high and positive, diagonal. The third component shows a high negative loading for the height on the left side of the bill alongside a moderate positive loading for the height on the right. The last component, 4, shows a high positive loading for the height on the right side and a high negative loading for margin_up. Remember that these components are predictors of is_genuine for the bills. A quick numeric summary of these loadings appears below; after that, we will produce a couple of visualizations to gain further insight into the components.
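This small optional check (not part of the original workflow) asks which variable has the largest absolute loading on each of the first four components:

#variable with the largest absolute loading on each of the first four components
load_mat <- data.pca$loadings[, 1:4]
rownames(load_mat)[apply(abs(load_mat), 2, which.max)]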

The first is a scree plot. This plot shows the eigenvalues in a downward curve, from highest to lowest. The first component can be considered the most significant, since it contains about 72% of the total information in the data. After that, the amount of information contributed by each remaining component drops off steeply.

#scree plot
library(repr)
options(repr.plot.width=14, repr.plot.height=14)
fviz_eig(data.pca, addlabels = TRUE)

Biplot of the Attributes

With the biplot, we can visualize the similarities and dissimilarities between the variables and see the impact of each attribute on each of the principal components.

# Graph of the variables
library(repr)
options(repr.plot.width=14, repr.plot.height=14)
fviz_pca_var(data.pca, col.var = "black")

First, all the variables that are grouped together are positively correlated with each other, such as margin_low, margin_up, and height_right. Variables that are negatively correlated are displayed on opposite sides of the biplot’s origin.

Contribution of Each Variable

In this next visualization, we determine how much each variable is represented in a given component. This quality of representation is called cos2 (the squared cosine), and it is computed using the fviz_cos2 function. A low value means that the variable is not well represented by that component; a high value means the variable is well represented on that component.

library(repr)
options(repr.plot.width=14, repr.plot.height=14)
fviz_cos2(data.pca, choice = "var", axes = 1:2)

From the illustration above, length and margin_low are the variables with the highest cos2, hence contributing the most to PC1 and PC2.
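The underlying cos2 values can also be inspected directly with factoextra's get_pca_var() accessor (an optional step, not shown above):

#cos2 values of each variable on the first two dimensions
var_info <- get_pca_var(data.pca)
round(var_info$cos2[, 1:2], 3)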

embed_url("https://youtu.be/TSffz_bl6zo?si=hr2zhpRJqDIC3D61")