Since PCA is designed for continuous variables I will omit nominal variables and keep only the continuous ones. Moreover, I will remove target variable. Let’s take a look at some basic descriptive statistics and dimension of the dataset.
## age cigsPerDay totChol sysBP
## Min. :32.00 Min. : 0.000 Min. :113.0 Min. : 83.5
## 1st Qu.:42.00 1st Qu.: 0.000 1st Qu.:206.0 1st Qu.:117.0
## Median :49.00 Median : 0.000 Median :234.0 Median :128.0
## Mean :49.56 Mean : 9.022 Mean :236.9 Mean :132.4
## 3rd Qu.:56.00 3rd Qu.:20.000 3rd Qu.:263.2 3rd Qu.:144.0
## Max. :70.00 Max. :70.000 Max. :600.0 Max. :295.0
## diaBP BMI heartRate glucose
## Min. : 48.00 Min. :15.54 Min. : 44.00 Min. : 40.00
## 1st Qu.: 75.00 1st Qu.:23.08 1st Qu.: 68.00 1st Qu.: 71.00
## Median : 82.00 Median :25.38 Median : 75.00 Median : 78.00
## Mean : 82.91 Mean :25.78 Mean : 75.73 Mean : 81.86
## 3rd Qu.: 90.00 3rd Qu.:28.04 3rd Qu.: 82.00 3rd Qu.: 87.00
## Max. :142.50 Max. :56.80 Max. :143.00 Max. :394.00
## [1] 3656 8
Final dataset that will be used for PCA has 3800 observations and 8 variables. Since PCA algorithm maximizes variance, it is important to do the normalization. Otherwise, the results would be biased because of natural value discrepancies of some variables (e.g totChol would be considered as more important than BMI, totChol is simply greater than BMI because of its nature).
preproc1 <- preProcess(cardio, method=c("center", "scale"))
cardio.s <- predict(preproc1, cardio)
summary(cardio.s)
## age cigsPerDay totChol sysBP
## Min. :-2.05083 Min. :-0.757 Min. :-2.80915 Min. :-2.2120
## 1st Qu.:-0.88276 1st Qu.:-0.757 1st Qu.:-0.70013 1st Qu.:-0.6956
## Median :-0.06511 Median :-0.757 Median :-0.06516 Median :-0.1977
## Mean : 0.00000 Mean : 0.000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.75254 3rd Qu.: 0.921 3rd Qu.: 0.59817 3rd Qu.: 0.5265
## Max. : 2.38783 Max. : 5.116 Max. : 8.23488 Max. : 7.3614
## diaBP BMI heartRate glucose
## Min. :-2.91546 Min. :-2.51953 Min. :-2.64798 Min. :-1.7506
## 1st Qu.:-0.66073 1st Qu.:-0.66509 1st Qu.:-0.64513 1st Qu.:-0.4540
## Median :-0.07616 Median :-0.09941 Median :-0.06097 Median :-0.1613
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.59190 3rd Qu.: 0.55481 3rd Qu.: 0.52319 3rd Qu.: 0.2151
## Max. : 4.97610 Max. : 7.62825 Max. : 5.61376 Max. :13.0549
To dive deeper into the analysis before proceeding with PCA, I am using a pairwise correlation plot. The ggpairs() plot is useful for visualizing distribution and correlation between each variable. It can be observed that almost every correlation is significant.
Another way of visualizing pairwise correlation is corrplot(). As It can be noticed, some features (like systolic and diastilic blood pressure) are highly correlated with each other.
Finally, let’s proceed with PCA.
pca<-prcomp(cardio.s, center=FALSE, scale.=FALSE)
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## age 0.3492564 -0.47265344 -0.16211020 0.243202088 -0.06950582
## cigsPerDay -0.1312331 0.63008631 -0.08530298 0.366523550 -0.60838869
## totChol 0.2702664 -0.17143122 -0.31469833 0.677137946 -0.02796975
## sysBP 0.5643796 0.09983590 0.13804910 -0.018836247 -0.04109148
## diaBP 0.5315404 0.23033551 0.27201028 -0.067373301 -0.03325014
## BMI 0.3601777 0.08845929 0.27250469 -0.255509909 -0.18101516
## heartRate 0.1815720 0.52325495 -0.44034099 0.001518421 0.65734890
## glucose 0.1549128 -0.07455782 -0.71147909 -0.527082393 -0.39570406
## PC6 PC7 PC8
## age -0.32354252 -0.6550804 -0.174787375
## cigsPerDay -0.13053810 -0.2386936 0.001178434
## totChol 0.43969411 0.3819581 0.003364339
## sysBP -0.33095501 0.1642550 0.716865826
## diaBP -0.20233545 0.3072132 -0.670415139
## BMI 0.72998467 -0.3917871 0.061331788
## heartRate 0.03904090 -0.2557804 -0.016673185
## glucose 0.00874433 0.1665011 -0.045210743
The loading matrix above shows loadings of a variable. The loadings can be interpreted as the importance of a variable in the composite variable. The higher the value, the higher the correlation between factor and variable, and the more important the variable in the given factor. However, the meaning of components are hard to interpret. I will focus on the interpretation of the the meaning of certain components later on.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.5545 1.0776 1.0100 0.9860 0.9154 0.88064 0.79406
## Proportion of Variance 0.3021 0.1452 0.1275 0.1215 0.1047 0.09694 0.07882
## Cumulative Proportion 0.3021 0.4472 0.5747 0.6963 0.8010 0.89795 0.97676
## PC8
## Standard deviation 0.43118
## Proportion of Variance 0.02324
## Cumulative Proportion 1.00000
First components explain a small fraction of variance (1st PC - 30.06%, 2nd PC - 14.59%, 3rd PC - 12.77%). First 5 PCs explain over 80% of variance. Plot shows the proportion of variance of each component.
In order to analyze the components and try to grasp the meaning of them, one can use plots that are shown below.
fviz_pca_var(pca, col.var="steelblue")
This plot is the visualization of loadings of the variable for the first two components (exact same values as for the loading matrix above). It shows the relationships between all variables. It can be interpreted as follow: Positively correlated variables are grouped together. Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants). The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.
Another way of visualizing the loading matrix is presented below. Let’s take a look at the first 5 PCs.
var<-get_pca_var(pca)
a<-fviz_contrib(pca, "var", axes=1)
b<-fviz_contrib(pca, "var", axes=2)
c<-fviz_contrib(pca, "var", axes=3)
d<-fviz_contrib(pca, "var", axes=4)
e<-fviz_contrib(pca, "var", axes=5)
grid.arrange(a,b,c,d,e,top='Contribution to the first five Principal Components')
In the first Principal Component above the threshold are sysBP, diaBP and BMI. Second - cigsPerDay, heartRate, age. Third - glucose, heartRate. Fourth - totChol, glucose, cigsPerDay. Fifth - heartRate, cigsPerDay and glucose.
On the interactive 3D plot below, one can observe how the original variables behave relative to the first three principal components for each observation, and color by target variable.
pca3d(pca, group=as.factor(target), biplot=TRUE, biplot.vars=3, legend="topleft")
rglwidget()
One can observe that positive (heart disease) target variables are slightly shifted in the same direction as diastolic and systolic blood preasure and BMI. It may suggest that these three variables result in higher chance of being ill.
As I have mentioned before, the meaning of PCs are hard to interpret. In order to solve this issue one can use a varimax rotation. A varimax rotation is used to simplify the expression of a particular sub-space in terms of just a few major items each. Varimax is so called because it maximizes the sum of the variances of the squared loadings (squared correlations between variables and factors). It simplifies the interpretation of factors by minimizing the number of variables necessary to explain a given factor.
pca.varimax<-principal(cardio.s, nfactors=5, rotate="varimax")
print(loadings(pca.varimax), digits=3, cutoff=0.4, sort=TRUE)
##
## Loadings:
## RC1 RC4 RC2 RC5 RC3
## sysBP 0.832
## diaBP 0.885
## BMI 0.686
## age 0.683
## totChol 0.852
## cigsPerDay 0.973
## heartRate 0.971
## glucose 0.986
##
## RC1 RC4 RC2 RC5 RC3
## SS loadings 2.016 1.303 1.046 1.031 1.011
## Proportion Var 0.252 0.163 0.131 0.129 0.126
## Cumulative Var 0.252 0.415 0.546 0.675 0.801
Let’s print only the significant loadings. Because I have no medical background, I can only guess why such features are grouped together. As if sysBP, diaBP and BMI creates RC1 I could make a guess that it may be linked with of obesity and cardiac output. I would interpret RC4 as getting old, RC2, RC5 and RC3 are self explanatory.
In this paper, I have used PCA on the chosen medical dataset. PCA is a powerful dimensionality reduction tool that helps with reducing the complexity of a model. It compresses a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information. As a result of the analysis, I have managed to reduce the number of variables from 8 to 5 with maintaining 80.1% of variance.