We are going to explore dimension reduction. What is “dimension reduction?”

In short, many datasets, particularly with BIG DATA, are messy. They often contain numerous independent/predictor variables for the dependent variable. To make sense of this data, how does one reduce it to the most important predictors, both quantitatively and qualitatively - those that have the strongest correlation and make the most business sense?

We could simply eliminate features; instead, we will be extracting features.

In essence, principal component analysis transforms the original variables into a set of new, uncorrelated variables.

This technique is Principal Component Analysis (PCA).

There are three assumptions underlying PCA:

  1. The correlation matrix is not an identity matrix - in other words, the variables are not completely uncorrelated
  2. Sampling adequacy, or an appropriate number of observations relative to the number of variables
  3. The independent variables are correlated to each other

We are going to use the “Fake Bills” dataset.

The columns are as follows:

  - length, the length of the banknote in mm
  - height_left, the height of the left side of the banknote in mm
  - height_right, the height of the right side of the banknote in mm
  - diagonal, the diagonal of the banknote in mm
  - margin_low, the lower side margin in mm
  - margin_up, the upper side margin in mm
  - is_genuine, whether the bill is genuine

In this dataset, the DV/dependent variable is “is_genuine” - meaning that the bill is not fake. We are going to reduce the set of variables to the ones which hold the most information.

Install (if needed) and load the libraries below to conduct PCA
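
If any of these packages are not yet installed, a one-time install of the same package names loaded below would look something like this:

# one-time installation of the packages used in this walkthrough
install.packages(c("corrr", "ggcorrplot", "corrplot", "FactoMineR",
                   "psych", "dplyr", "factoextra", "repr"))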

library(corrr)
library(ggcorrplot)
library(corrplot)
library(FactoMineR)
library(psych)
library(dplyr)
library(factoextra)
library(repr)

Bringing in the data

df <- read.csv("/cloud/project/fakebills.csv", header=TRUE, stringsAsFactors=FALSE)

head(df,3)
##   is_genuine diagonal height_left height_right margin_low margin_up length
## 1          1   171.81      104.86       104.95       4.52      2.89 112.83
## 2          1   171.46      103.36       103.66       3.77      2.99 113.09
## 3          1   172.69      104.48       103.50       4.40      2.94 113.16

Before conducting the PCA, we remove the dependent variable (DV) “is_genuine” from the dataset; otherwise it becomes a factor used to explain itself in the components that are created.

df <- df[,-1]

Exploratory Data Analysis (EDA)

We’ll run some quick summary statistics on the dataset before we dive into PCA

summary(df)
##     diagonal      height_left     height_right     margin_low   
##  Min.   :171.0   Min.   :103.1   Min.   :102.8   Min.   :2.980  
##  1st Qu.:171.8   1st Qu.:103.8   1st Qu.:103.7   1st Qu.:4.015  
##  Median :172.0   Median :104.0   Median :103.9   Median :4.310  
##  Mean   :172.0   Mean   :104.0   Mean   :103.9   Mean   :4.486  
##  3rd Qu.:172.2   3rd Qu.:104.2   3rd Qu.:104.2   3rd Qu.:4.870  
##  Max.   :173.0   Max.   :104.9   Max.   :105.0   Max.   :6.900  
##                                                  NA's   :37     
##    margin_up         length     
##  Min.   :2.270   Min.   :109.5  
##  1st Qu.:2.990   1st Qu.:112.0  
##  Median :3.140   Median :113.0  
##  Mean   :3.151   Mean   :112.7  
##  3rd Qu.:3.310   3rd Qu.:113.3  
##  Max.   :3.910   Max.   :114.4  
## 

We are also going to look at the data types; PCA only works with numerical data

str(df)
## 'data.frame':    1500 obs. of  6 variables:
##  $ diagonal    : num  172 171 173 171 172 ...
##  $ height_left : num  105 103 104 104 104 ...
##  $ height_right: num  105 104 104 104 103 ...
##  $ margin_low  : num  4.52 3.77 4.4 3.62 4.04 4.42 4.58 3.98 4 4.04 ...
##  $ margin_up   : num  2.89 2.99 2.94 3.01 3.48 2.95 3.26 2.92 3.25 3.25 ...
##  $ length      : num  113 113 113 114 113 ...

Now let’s talk about the five basic steps of Principal Component Analysis (PCA)

Remember that the goal of PCA is to measure the contribution of each variable - it takes highly dimensional data and captures the most important information from it

PCA Roadmap:

  1. Data Normalization - because the variables within a dataset can be on different scales, they must be normalized, or put on the same scale, so that their contributions can be measured equally. For example, if you are looking at salary and job satisfaction scores, those will be on different scales and must be normalized to measure their respective contributions within the dataset

  2. Covariance Matrix - a covariance matrix of the normalized data is computed

  3. Eigenvectors and Eigenvalues - to get a basic understanding of PCA, it helps to picture the data as points in a multi-dimensional space, not on a flat plane. An eigenvector represents a direction in that space, such as vertical or 90 degrees, and its eigenvalue represents the amount of variance along that direction. Each eigenvector has an eigenvalue. In Step 3 we compute the eigenvalues and eigenvectors to identify the principal components

  4. Selecting principal components - we do this based upon how much variation each principal component explains, often with the help of a scree plot

  5. Data transformation according to the new principal components - the data is re-expressed along the new principal component axes (a minimal sketch of all five steps follows this list)
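
Before turning to the packaged functions, here is a minimal base-R sketch of these five steps, assuming a generic numeric data frame called num_df (a placeholder name, not part of our dataset):

# 1. Normalize: put every variable on the same scale
X <- scale(num_df)

# 2. Compute the covariance matrix of the normalized data
#    (for standardized variables this equals the correlation matrix)
S <- cov(X)

# 3. Eigenvectors give directions, eigenvalues give the variance per direction
eig <- eigen(S)

# 4. Select components, e.g. keep enough to explain roughly 90% of the variance
prop_var <- eig$values / sum(eig$values)
k <- which(cumsum(prop_var) >= 0.90)[1]

# 5. Transform: project the normalized data onto the first k component axes
scores <- X %*% eig$vectors[, 1:k]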

We are going to check for missing values, because missing data will bias the results when we normalize. The colSums() function combined with is.na() returns the number of missing values in each column

colSums(is.na(df))
##     diagonal  height_left height_right   margin_low    margin_up       length 
##            0            0            0           37            0            0
count(df)
##      n
## 1 1500

Because there are 37 observations that are missing in the “margin_low” column, we are going to delete those rows.

For PCA in this instance, imputation would not be appropriate

Running the count function from dplyr tells us that we still have 1463 observations

df <- na.omit(df)
count(df)
##      n
## 1 1463

Normalization

In our next step we are putting all variables on the same scale through normalization

data_normalized <- scale(df)
head(data_normalized)
##     diagonal height_left height_right  margin_low  margin_up     length
## 1 -0.4884266   2.7658616    3.1726813  0.05126870 -1.1365937  0.1777815
## 2 -1.6342497  -2.2407234   -0.8065749 -1.07856821 -0.7045647  0.4755295
## 3  2.3925002   1.4975267   -1.3001261 -0.12950521 -0.9205792  0.5556924
## 4 -1.9616278  -0.4049756    0.0571396 -1.30453559 -0.6181589  0.9565070
## 5 -0.7503290   0.8299821   -1.4235139 -0.67182692  1.4123777 -0.1543220
## 6  0.6901344  -0.9723885    0.4889969 -0.09937622 -0.8773763  0.1548778

Calculating a correlation matrix

corr_matrix <- cor(data_normalized)
corr_matrix
##                 diagonal height_left height_right margin_low   margin_up
## diagonal      1.00000000  0.01826548  -0.01942814 -0.1115341 -0.05914674
## height_left   0.01826548  1.00000000   0.23513162  0.3026430  0.24381218
## height_right -0.01942814  0.23513162   1.00000000  0.3910851  0.30686721
## margin_low   -0.11153408  0.30264296   0.39108515  1.0000000  0.43160607
## margin_up    -0.05914674  0.24381218   0.30686721  0.4316061  1.00000000
## length        0.10075779 -0.31434436  -0.40427216 -0.6667528 -0.52113891
##                  length
## diagonal      0.1007578
## height_left  -0.3143444
## height_right -0.4042722
## margin_low   -0.6667528
## margin_up    -0.5211389
## length        1.0000000

What we can see is that the most highly correlated variable pairs all involve length: margin_low to length (strongly negative), margin_up to length (also negative), height_right to length (negative), and, more mildly, height_left to length (negative).
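
To read these pairings off directly rather than scanning the matrix by eye, one option is the corrr package we loaded earlier - a rough sketch (the exact output formatting may differ):

# list each pair of variables once, sorted from most negative to most positive correlation
correlate(as.data.frame(data_normalized)) %>%
  shave() %>%                 # keep only the lower triangle
  stretch(na.rm = TRUE) %>%   # one row per variable pair
  arrange(r)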

Visualization of the correlation matrix

library(repr)
options(repr.plot.width=12, repr.plot.height=12)
library(corrplot)
corrplot(corr_matrix, method = 'square', order = 'FPC', type = 'lower', diag = FALSE)

Bartlett’s Test for Sphericity

To test for the first assumption, we perform Bartlett’s Test for Sphericity. The null hypothesis for Bartlett’s test is that the correlation matrix is an identity matrix, meaning there is no correlation between the variables, which would render PCA pointless as it depends on correlation between the variables. Our significance level is .05; we reject the null if the p-value falls below .05.

cortest.bartlett(corr_matrix, n = nrow(data_normalized))
## $chisq
## [1] 1896.66
## 
## $p.value
## [1] 0
## 
## $df
## [1] 15

Because the p-value is effectively 0, well below .05, we can reject the null and proceed

Kaiser-Meyer-Olkin (KMO)

Now we will check whether we have an adequate sample to run PCA using the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy

We calculate this with the KMO() function from the psych library

Generally, KMO values of .80 and above are considered more than adequate sampling. Another rule of thumb is 20 observations per variable, which we more than meet
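
As a quick check on that second rule of thumb, we can compute the ratio of observations to variables directly:

# observations per variable: 1463 rows / 6 columns, roughly 244, far above the ~20 guideline
nrow(df) / ncol(df)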

library(psych)
KMO(data_normalized)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data_normalized)
## Overall MSA =  0.78
## MSA for each item = 
##     diagonal  height_left height_right   margin_low    margin_up       length 
##         0.70         0.88         0.88         0.74         0.83         0.71

Applying Our PCA

data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4    Comp.5
## Standard deviation     0.9619396 0.34795246 0.31873961 0.28336861 0.2200564
## Proportion of Variance 0.7247717 0.09482993 0.07957519 0.06289398 0.0379292
## Cumulative Proportion  0.7247717 0.81960163 0.89917682 0.96207080 1.0000000
##                        Comp.6
## Standard deviation          0
## Proportion of Variance      0
## Cumulative Proportion       1

We can see that, effectively, 5 principal components have been generated (the sixth has zero variance). Looking at the Proportion of Variance row, the first component explains about 72.5% of the total variance. Component 2 explains an additional 9.5%, and Component 3 adds almost another 8%. This means that the first component represents the majority of the information in the data, then the second, and the remaining components contribute less and less.
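
If you want to pull these proportions out of the princomp object programmatically (for example, to pick the number of components in code), a small sketch using the stored standard deviations:

# proportion and cumulative proportion of variance explained by each component
prop_var <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(prop_var, 3)
round(cumsum(prop_var), 3)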

Examine the Loadings

data.pca$loadings[, 1:4]
##                  Comp.1      Comp.2      Comp.3      Comp.4
## diagonal      0.1941426  0.96196923  0.01891581  0.04367662
## height_left  -0.2700698  0.02729285 -0.91708721  0.12562547
## height_right -0.3533459 -0.01959323  0.32401220  0.77820769
## margin_low   -0.5076471  0.04044728  0.07269246 -0.04384266
## margin_up    -0.4217331  0.03099468  0.21944983 -0.61115732
## length        0.5735445 -0.26625128 -0.01291994  0.03560807

The loading matrix shows that the first principal component has a high positive loading for length, but high negative loadings for margin_low and height_right. This suggests that bills with a longer length will, overall, have a lower bottom margin and a lower height on the right.

The second principal component has only one variable with a very high, positive loading: diagonal, indicating bills with a large diagonal.

The third component shows a high negative loading for the height on the left side of the bill, paired with a moderate positive loading for the height on the right side.

The last component, Component 4, shows a high positive loading for the height on the right side and a high negative loading for margin_up.

Remember that these loadings/components are predictors of is_genuine for the bills
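
If we want to carry the leading components forward as predictors of is_genuine, one way is to project the normalized data onto the retained loadings - a minimal sketch, assuming we keep the first two components:

# score each (normalized) bill on the first two principal component directions
pc_scores <- as.data.frame(data_normalized %*% data.pca$loadings[, 1:2])
names(pc_scores) <- c("PC1", "PC2")
head(pc_scores)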

Visualizations

Now we will conduct a couple of visualizations to help gain insight into the components

Scree plot

The first is a scree plot. This plot shows the eigenvalues in a downward curve, from highest to lowest. The first component can be considered the most significant, since it contains about 72.5% of the total information in the data. After that, the amount of information contributed by each remaining component drops off steeply

library(repr)
options(repr.plot.width=14, repr.plot.height=14)
fviz_eig(data.pca, addlabels = TRUE)

Biplot of the attributes

With the biplot, we visualize the similarities and dissimilarities between the variables, and demonstrate the impact of each attribute on each of the principal components.

library(repr)
options(repr.plot.width=14, repr.plot.height=14)
fviz_pca_var(data.pca, col.var = "purple")

First, all the variables that are grouped together are positively correlated to each other, such as margin_low, margin_up, and height_right.

Then, variables that are negatively correlated are displayed on opposite sides of the biplot’s origin

Contribution of Each Variable

In this next visualization we are determining how much each variable is represented in a given component. This quality of representation is called the cos2, which corresponds to the squared cosine, and it is plotted using the fviz_cos2 function.

A low value means that the variable is not perfectly represented by that component. A high value, on the other hand, means a good representation of the variable on that component

library(repr)
options(repr.plot.width=14, repr.plot.height=14)
fviz_cos2(data.pca, choice = "var", axes = 1:2)

From the illustration above, length and margin_low are the variables with the highest cos2, hence they contribute the most to PC1 and PC2.
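
To see the numbers behind this plot, the factoextra helper get_pca_var() can pull the cos2 values for each variable on each component - a brief sketch:

# cos2 (squared cosine) of each variable on the first two components
var_info <- get_pca_var(data.pca)
round(var_info$cos2[, 1:2], 3)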