Principal Component Analysis (PCA) is one of the most popular multivariate analysis methods. The goal of PCA is to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

Step 1: Data Normalization

Start by normalizing the predictors: subtract the mean from each data point and divide by the standard deviation. Normalization matters because the original predictors can be on different scales, and a variable measured on a larger scale contributes disproportionately to the total variance. Without it, variables with high variance receive large loadings, and the principal components end up depending mainly on those variables.

data <- read.csv("/Users/N-mickey/Desktop/Anly 699/Data/data copy.csv", header = TRUE, encoding = "UTF-8")

# Standardize the first nine columns (mean 0, standard deviation 1)
X_std <- data.frame(scale(data[, 1:9]))
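
As a quick check of what scale() does, the sketch below standardizes a small made-up data frame by hand and compares the result with scale(). The data frame toy and its columns a and b are hypothetical stand-ins, not the data used above.

```r
# Hypothetical toy data (not the dataset used above): two columns on very different scales
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(100, 300, 200, 500, 400))

# Manual standardization: subtract the column mean, divide by the column standard deviation
manual_std <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() centers and scales each column by default
scaled_std <- data.frame(scale(toy))

# Largest absolute difference between the two results (should be at floating-point noise level)
max_diff <- max(abs(as.matrix(manual_std) - as.matrix(scaled_std)))
max_diff

# After standardization every column has mean 0 and standard deviation 1
colMeans(scaled_std)
sapply(scaled_std, sd)
```

Because both columns end up with unit variance, neither one can dominate the principal components simply because of its units.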

Step 2: Calculate Covariance Matrix

dim(X_std)
## [1] 14401     9
cov(X_std)
##               StockCode          DSRI           GMI           AQI           SGI
## StockCode  1.0000000000  0.0098910099  1.313361e-02  8.859223e-03 -0.0007642167
## DSRI       0.0098910099  1.0000000000 -2.160700e-04 -2.493442e-04  0.0226318502
## GMI        0.0131336133 -0.0002160700  1.000000e+00  3.728691e-05  0.0004723872
## AQI        0.0088592233 -0.0002493442  3.728691e-05  1.000000e+00 -0.0021475199
## SGI       -0.0007642167  0.0226318502  4.723872e-04 -2.147520e-03  1.0000000000
## DEPI      -0.0086995213 -0.0001966322  8.619139e-04  3.120418e-05 -0.0018860864
## SGAI       0.0012658101  0.0011953471  1.085463e-04 -4.510580e-03 -0.0261002265
## LVGI      -0.0248252294  0.0169863131 -2.486182e-03 -1.427836e-02  0.1368746151
## TATA      -0.0068486982 -0.0004119543  1.403282e-02 -9.061456e-03  0.0216623010
##                    DEPI          SGAI         LVGI          TATA
## StockCode -8.699521e-03  0.0012658101 -0.024825229 -0.0068486982
## DSRI      -1.966322e-04  0.0011953471  0.016986313 -0.0004119543
## GMI        8.619139e-04  0.0001085463 -0.002486182  0.0140328172
## AQI        3.120418e-05 -0.0045105803 -0.014278361 -0.0090614557
## SGI       -1.886086e-03 -0.0261002265  0.136874615  0.0216623010
## DEPI       1.000000e+00  0.0085104715 -0.011414114  0.0111867625
## SGAI       8.510471e-03  1.0000000000  0.050230731 -0.0261440826
## LVGI      -1.141411e-02  0.0502307306  1.000000000 -0.0100029978
## TATA       1.118676e-02 -0.0261440826 -0.010002998  1.0000000000
cov_mat <- cov(X_std)
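
The diagonal of the matrix above is all ones because the variables were standardized first; in that case the covariance matrix is the same as the correlation matrix of the original variables. The sketch below illustrates this on R's built-in USArrests data, which is only a stand-in for the data used above.

```r
# Stand-in data: the built-in USArrests data set (not the data used above)
raw <- USArrests
std <- data.frame(scale(raw))

cov_std <- cov(std)   # covariance matrix of the standardized variables
cor_raw <- cor(raw)   # correlation matrix of the original variables

# The two matrices agree up to floating-point error
same_matrix <- isTRUE(all.equal(cov_std, cor_raw))
same_matrix

# Each standardized variable has unit variance, so the diagonal is all ones
diag(cov_std)
```

One consequence is that the total variance (the sum of the diagonal) equals the number of variables, which is the denominator used later for the proportion of variance explained.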

Step 3: Calculate Eigenvectors from Covariance Matrix

To decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues. The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data; those are the ones that can be dropped. The common approach is to rank the eigenvalues from highest to lowest and choose the top k eigenvectors.

# Decompose once and reuse the result
eig <- eigen(cov_mat)
eig_vals <- eig$values
eig_vecs <- eig$vectors
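
The ranking and selection described above can be sketched as follows. This is a minimal illustration on the built-in USArrests data rather than the data used here, and the 80% cumulative-variance threshold is an assumed rule of thumb, not a value taken from this analysis.

```r
# Stand-in data: built-in USArrests (a substitute for X_std above)
X <- scale(USArrests)

# For a symmetric matrix, eigen() returns the eigenvalues in decreasing order
eig <- eigen(cov(X))
eig_vals <- eig$values
eig_vecs <- eig$vectors

# Share of total variance carried by each eigenvector, and the running total
prop_var <- eig_vals / sum(eig_vals)
cum_var  <- cumsum(prop_var)

# Assumed rule of thumb: keep the smallest k that explains at least 80% of the variance
threshold <- 0.80
k <- which(cum_var >= threshold)[1]

# Project the standardized data onto the top k eigenvectors
W <- eig_vecs[, 1:k, drop = FALSE]
scores <- X %*% W

k
dim(scores)
```

The matrix scores holds the data expressed in the reduced k-dimensional subspace; other selection rules (for example a scree plot) would only change how k is chosen.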

Step 4: Select Eigenvectors with the largest Eigenvalues

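The printed output and summary display the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix), the loadings of each variable on each component, and the proportion of variance explained by each component.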

pcaobj <- prcomp(X_std)

print(pcaobj)
## Standard deviations (1, .., p=9):
## [1] 1.0714531 1.0220914 1.0112477 1.0045628 1.0001237 0.9975552 0.9880671
## [8] 0.9815276 0.9167945
## 
## Rotation (n x k) = (9 x 9):
##                   PC1         PC2         PC3         PC4         PC5
## StockCode  0.11410426  0.02565993  0.67866484  0.21565170 -0.04187868
## DSRI      -0.17527069  0.04764878  0.37149851  0.10310366  0.35553506
## GMI        0.01701054  0.21720823  0.24857581  0.65631451 -0.30310800
## AQI        0.09004749 -0.01737532  0.36576407 -0.22523656  0.58376079
## SGI       -0.66257805  0.24949637  0.09424299 -0.08467917  0.06180508
## DEPI       0.04694402  0.05808495 -0.38248019  0.48049468  0.65841426
## SGAI      -0.11564635 -0.67667830 -0.06999270  0.42572830 -0.01132605
## LVGI      -0.70149833 -0.15588115 -0.02212459  0.02834132 -0.03993341
## TATA      -0.03427958  0.63385731 -0.21276346  0.20319996 -0.01173558
##                    PC6         PC7         PC8          PC9
## StockCode  0.114493997 -0.66640495  0.09159504 -0.109005749
## DSRI       0.674408845  0.48506073  0.04028530 -0.015132457
## GMI       -0.353744799  0.39368210 -0.29661792  0.003864027
## AQI       -0.613517012  0.20823353  0.21156053 -0.042140771
## SGI       -0.108752422 -0.18129957 -0.12935383  0.646096491
## DEPI       0.026153102 -0.29327651 -0.30614269 -0.057692160
## SGAI      -0.047280412  0.03048644  0.49363472  0.309225573
## LVGI      -0.125347139 -0.04578801  0.01045148 -0.680291168
## TATA       0.008179927 -0.01916039  0.70922998 -0.083140791
summary(pcaobj)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     1.0715 1.0221 1.0112 1.0046 1.0001 0.9976 0.9881 0.9815
## Proportion of Variance 0.1276 0.1161 0.1136 0.1121 0.1111 0.1106 0.1085 0.1070
## Cumulative Proportion  0.1276 0.2436 0.3573 0.4694 0.5805 0.6911 0.7996 0.9066
##                            PC9
## Standard deviation     0.91679
## Proportion of Variance 0.09339
## Cumulative Proportion  1.00000
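
To connect the prcomp() output with the eigendecomposition from Step 3, the sketch below checks on the built-in USArrests data (a stand-in, not the data used above) that the squared standard deviations are the eigenvalues, that the rotation matrix holds the eigenvectors up to sign, and how the proportion rows of the summary are derived.

```r
# Stand-in data: built-in USArrests (not the data used above)
X <- scale(USArrests)
pca <- prcomp(X)
eig <- eigen(cov(X))

# Squared standard deviations from prcomp() equal the eigenvalues of the covariance matrix
sdev_matches <- isTRUE(all.equal(pca$sdev^2, eig$values))

# Loadings match the eigenvectors up to the sign of each column
rotation_matches <- isTRUE(all.equal(abs(unname(pca$rotation)), abs(eig$vectors)))

sdev_matches
rotation_matches

# "Proportion of Variance" and "Cumulative Proportion" as reported by summary()
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)
round(rbind(prop_var, cum_var), 4)
```

The same arithmetic applies to the summary above: with nine standardized variables the total variance is 9, so PC1 explains 1.0715^2 / 9, or about 0.1276.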

In the biplot, all variables are presented as vectors: the direction and length of each vector show how that variable contributes to the first two principal components.

biplot(pcaobj, scale = 0, cex = 1.3)

library(factoextra)

var <- get_pca_var(pcaobj)
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
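
The cos2 values plotted above measure how well each component represents each variable. The sketch below reproduces that quantity in base R, assuming get_pca_var() derives cos2 for a prcomp object as the squared variable coordinates (loading times component standard deviation); that assumption reflects how factoextra is generally understood to work and is not stated in the output above. The USArrests data is again only a stand-in.

```r
# Stand-in data: built-in USArrests (not the data used above)
pca <- prcomp(scale(USArrests))

# Variable coordinates: each column of loadings multiplied by that component's standard deviation
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: squared coordinates, i.e. the quality of representation of each variable on each component
var_cos2 <- var_coord^2
round(var_cos2, 3)

# For standardized variables, each row sums to 1 across all components
row_totals <- rowSums(var_cos2)
row_totals
```

A variable with a high cos2 on a component is well represented by it; a variable whose cos2 is spread evenly across many components is not captured well by any single one.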