Principal Component Analysis (PCA) is one of the most popular multivariate analysis methods. The goal of PCA is to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing its dimensionality without losing important information.
Start by standardizing the predictors: subtract the mean from each variable and divide by its standard deviation. This step matters because the original predictors can be on different scales, and a variable measured on a larger scale contributes disproportionately to the total variance. Without standardization, variables with high variance receive large loadings, and the leading principal components end up depending mostly on those variables.
# Read the raw data, then center and scale the first nine columns
data <- read.csv("/Users/N-mickey/Desktop/Anly 699/Data/data copy.csv", header = TRUE, encoding = "UTF8")
X_std <- data.frame(scale(data[, 1:9]))
dim(X_std)
## [1] 14401 9
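As a minimal sketch of what the standardization step does, the example below uses a small synthetic data frame (the `toy` data and its column names are made up for illustration and are not the data set used here). It checks that a manual z-score agrees with `scale()`, and that the covariance matrix of standardized data is the correlation matrix of the raw data, which is why the diagonal of `cov(X_std)` is all ones.

```r
# Synthetic columns on very different scales (hypothetical, for illustration only)
set.seed(1)
toy <- data.frame(
  a = rnorm(100, mean = 50,  sd = 10),
  b = rnorm(100, mean = 0.5, sd = 0.01),
  c = runif(100, min = 0, max = 1000)
)

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual_std <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() centers and scales by default, so it should give the same result
scaled_std <- data.frame(scale(toy))

max_diff  <- max(abs(as.matrix(manual_std) - as.matrix(scaled_std)))
col_means <- colMeans(scaled_std)      # each close to 0
col_sds   <- sapply(scaled_std, sd)    # each close to 1

# Covariance of standardized data vs. correlation of the raw data
cov_vs_cor <- max(abs(cov(scaled_std) - cor(toy)))
```

Both `max_diff` and `cov_vs_cor` should be zero up to floating-point error.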
cov(X_std)
## StockCode DSRI GMI AQI SGI
## StockCode 1.0000000000 0.0098910099 1.313361e-02 8.859223e-03 -0.0007642167
## DSRI 0.0098910099 1.0000000000 -2.160700e-04 -2.493442e-04 0.0226318502
## GMI 0.0131336133 -0.0002160700 1.000000e+00 3.728691e-05 0.0004723872
## AQI 0.0088592233 -0.0002493442 3.728691e-05 1.000000e+00 -0.0021475199
## SGI -0.0007642167 0.0226318502 4.723872e-04 -2.147520e-03 1.0000000000
## DEPI -0.0086995213 -0.0001966322 8.619139e-04 3.120418e-05 -0.0018860864
## SGAI 0.0012658101 0.0011953471 1.085463e-04 -4.510580e-03 -0.0261002265
## LVGI -0.0248252294 0.0169863131 -2.486182e-03 -1.427836e-02 0.1368746151
## TATA -0.0068486982 -0.0004119543 1.403282e-02 -9.061456e-03 0.0216623010
## DEPI SGAI LVGI TATA
## StockCode -8.699521e-03 0.0012658101 -0.024825229 -0.0068486982
## DSRI -1.966322e-04 0.0011953471 0.016986313 -0.0004119543
## GMI 8.619139e-04 0.0001085463 -0.002486182 0.0140328172
## AQI 3.120418e-05 -0.0045105803 -0.014278361 -0.0090614557
## SGI -1.886086e-03 -0.0261002265 0.136874615 0.0216623010
## DEPI 1.000000e+00 0.0085104715 -0.011414114 0.0111867625
## SGAI 8.510471e-03 1.0000000000 0.050230731 -0.0261440826
## LVGI -1.141411e-02 0.0502307306 1.000000000 -0.0100029978
## TATA 1.118676e-02 -0.0261440826 -0.010002998 1.0000000000
cov_mat <- cov(X_std)
To decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues. The eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data; those are the ones that can be dropped. The common approach is to rank the eigenvalues from highest to lowest and keep the top k eigenvectors.
eig <- eigen(cov_mat)      # decompose once; eigenvalues come back in decreasing order
eig_vals <- eig$values
eig_vecs <- eig$vectors
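The ranking-and-selection step can be sketched as follows. This is an illustration on synthetic data, not the analysis above: the `toy` matrix, the `toy_*` names, and the 80% `threshold` are all assumptions chosen for the example. It computes the share of variance carried by each eigenvalue, picks the smallest k whose cumulative share reaches the threshold, and projects the data onto the retained eigenvectors.

```r
# Synthetic standardized data (hypothetical, for illustration only)
set.seed(42)
toy <- scale(matrix(rnorm(200 * 5), ncol = 5))

toy_eig  <- eigen(cov(toy), symmetric = TRUE)
toy_vals <- toy_eig$values    # already sorted from largest to smallest
toy_vecs <- toy_eig$vectors

# Share of total variance per component, and the running total
var_explained <- toy_vals / sum(toy_vals)
cum_explained <- cumsum(var_explained)

# Smallest k whose cumulative share reaches the (assumed) threshold
threshold <- 0.80
k <- which(cum_explained >= threshold)[1]

# Keep the top k eigenvectors and project the data onto them
W <- toy_vecs[, 1:k, drop = FALSE]
projected <- toy %*% W
```

The threshold is a judgment call. When the eigenvalues are all close to one another, as they are for nearly uncorrelated variables, most components are needed to reach any high threshold, and the reduction in dimension is small.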
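Printing the fitted object displays the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix) together with the rotation matrix of loadings; summary() adds the proportion of variance explained by each component.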
pcaobj <- prcomp(X_std)
print(pcaobj)
## Standard deviations (1, .., p=9):
## [1] 1.0714531 1.0220914 1.0112477 1.0045628 1.0001237 0.9975552 0.9880671
## [8] 0.9815276 0.9167945
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## StockCode 0.11410426 0.02565993 0.67866484 0.21565170 -0.04187868
## DSRI -0.17527069 0.04764878 0.37149851 0.10310366 0.35553506
## GMI 0.01701054 0.21720823 0.24857581 0.65631451 -0.30310800
## AQI 0.09004749 -0.01737532 0.36576407 -0.22523656 0.58376079
## SGI -0.66257805 0.24949637 0.09424299 -0.08467917 0.06180508
## DEPI 0.04694402 0.05808495 -0.38248019 0.48049468 0.65841426
## SGAI -0.11564635 -0.67667830 -0.06999270 0.42572830 -0.01132605
## LVGI -0.70149833 -0.15588115 -0.02212459 0.02834132 -0.03993341
## TATA -0.03427958 0.63385731 -0.21276346 0.20319996 -0.01173558
## PC6 PC7 PC8 PC9
## StockCode 0.114493997 -0.66640495 0.09159504 -0.109005749
## DSRI 0.674408845 0.48506073 0.04028530 -0.015132457
## GMI -0.353744799 0.39368210 -0.29661792 0.003864027
## AQI -0.613517012 0.20823353 0.21156053 -0.042140771
## SGI -0.108752422 -0.18129957 -0.12935383 0.646096491
## DEPI 0.026153102 -0.29327651 -0.30614269 -0.057692160
## SGAI -0.047280412 0.03048644 0.49363472 0.309225573
## LVGI -0.125347139 -0.04578801 0.01045148 -0.680291168
## TATA 0.008179927 -0.01916039 0.70922998 -0.083140791
summary(pcaobj)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.0715 1.0221 1.0112 1.0046 1.0001 0.9976 0.9881 0.9815
## Proportion of Variance 0.1276 0.1161 0.1136 0.1121 0.1111 0.1106 0.1085 0.1070
## Cumulative Proportion 0.1276 0.2436 0.3573 0.4694 0.5805 0.6911 0.7996 0.9066
## PC9
## Standard deviation 0.91679
## Proportion of Variance 0.09339
## Cumulative Proportion 1.00000
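The link between the two approaches can be checked directly. The sketch below uses synthetic data (the `toy` matrix and `toy_*` names are hypothetical) to confirm that the squared standard deviations reported by `prcomp()` match the eigenvalues of the covariance matrix, and that the "Proportion of Variance" row is each squared standard deviation divided by their sum.

```r
# Synthetic standardized data (hypothetical, for illustration only)
set.seed(7)
toy <- scale(matrix(rnorm(300 * 4), ncol = 4))

toy_pca <- prcomp(toy)
toy_eig <- eigen(cov(toy), symmetric = TRUE)

# Squared component standard deviations are the eigenvalues
sdev_sq  <- toy_pca$sdev^2
eig_diff <- max(abs(sdev_sq - toy_eig$values))

# Proportion of variance computed by hand
prop_var <- sdev_sq / sum(sdev_sq)

# The same row as reported by summary() (rounded to 5 decimals internally)
reported  <- summary(toy_pca)$importance["Proportion of Variance", ]
prop_diff <- max(abs(prop_var - reported))
```

The same arithmetic applies to the output above: with nine standardized variables the total variance is 9, so PC1 explains about 1.0715^2 / 9 ≈ 0.1276 of it, which matches the reported value.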
In the biplot, every variable is drawn as a vector: the direction and length of each vector show how that variable contributes to the first two principal components.
biplot(pcaobj,scale=0, cex=1.3)
library(factoextra)
var <- get_pca_var(pcaobj)
library("corrplot")
## corrplot 0.84 loaded
corrplot(var$cos2, is.corr=FALSE)
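To make the plotted quantity concrete, here is a base-R sketch of cos2 under the usual definition: the squared correlation between a variable and a component. It runs on synthetic data (the `toy` matrix and its `v1` to `v4` column names are hypothetical) and assumes standardized input; it is not meant as a description of how `get_pca_var()` is implemented internally.

```r
# Synthetic standardized data (hypothetical, for illustration only)
set.seed(3)
toy <- scale(matrix(rnorm(250 * 4), ncol = 4,
                    dimnames = list(NULL, c("v1", "v2", "v3", "v4"))))
toy_pca <- prcomp(toy)

# Variable coordinates: each loading multiplied by its component's standard deviation
coord <- sweep(toy_pca$rotation, 2, toy_pca$sdev, "*")
cos2_manual <- coord^2

# Direct route: squared correlation between each variable and each component score
cos2_direct <- cor(toy, toy_pca$x)^2

cos2_diff  <- max(abs(cos2_manual - cos2_direct))
row_totals <- rowSums(cos2_manual)   # each variable's cos2 values sum to 1
```

Because each row sums to 1 across all components, a large cos2 value for a variable on one component means that component captures most of that variable's variance.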