Introduction

In this paper I will perform dimension reduction using Principal component analysis (PCA).

The PCA will be conducted on publically available dataset that could be find on the Kaggle website. The data comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD).The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are both demographic, behavioral and medical risk factors.
Variables:
  • Demographic:
    • Sex: male or female (Nominal)
    • Age: Age of the patient; (Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
  • Behavioral
    • Current Smoker: whether or not the patient is a current smoker (Nominal)
    • Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be considered continuous as one can have any number of cigarettes, even half a cigarette)
  • Medical (history)
    • BP Meds: whether or not the patient was on blood pressure medication (Nominal)
    • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
    • Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
    • Diabetes: whether or not the patient had diabetes (Nominal)
  • Medical (current)
    • Tot Chol: total cholesterol level (Continuous)
    • Sys BP: systolic blood pressure (Continuous)
    • Dia BP: diastolic blood pressure (Continuous)
    • BMI: Body Mass Index (Continuous)
    • Heart Rate: heart rate (Continuous - In medical research, variables such as heart rate though in fact discrete, yet are considered continuous because of large number of possible values.)
    • Glucose: glucose level (Continuous)
  • Predict variable (desired target)
    • 10 year risk of coronary heart disease CHD (binary: “1”, means “Yes”, “0” means “No”)

Data Preparation

Since PCA is designed for continuous variables I will omit nominal variables and keep only the continuous ones. Moreover, I will remove target variable.
Let’s take a look at some basic descriptive statistics and dimension of the dataset.

##       age          cigsPerDay        totChol          sysBP      
##  Min.   :32.00   Min.   : 0.000   Min.   :113.0   Min.   : 83.5  
##  1st Qu.:42.00   1st Qu.: 0.000   1st Qu.:206.0   1st Qu.:117.0  
##  Median :49.00   Median : 0.000   Median :234.0   Median :128.0  
##  Mean   :49.56   Mean   : 9.022   Mean   :236.9   Mean   :132.4  
##  3rd Qu.:56.00   3rd Qu.:20.000   3rd Qu.:263.2   3rd Qu.:144.0  
##  Max.   :70.00   Max.   :70.000   Max.   :600.0   Max.   :295.0  
##      diaBP             BMI          heartRate         glucose      
##  Min.   : 48.00   Min.   :15.54   Min.   : 44.00   Min.   : 40.00  
##  1st Qu.: 75.00   1st Qu.:23.08   1st Qu.: 68.00   1st Qu.: 71.00  
##  Median : 82.00   Median :25.38   Median : 75.00   Median : 78.00  
##  Mean   : 82.91   Mean   :25.78   Mean   : 75.73   Mean   : 81.86  
##  3rd Qu.: 90.00   3rd Qu.:28.04   3rd Qu.: 82.00   3rd Qu.: 87.00  
##  Max.   :142.50   Max.   :56.80   Max.   :143.00   Max.   :394.00
## [1] 3656    8

Final dataset that will be used for PCA has 3800 observations and 8 variables.

Since PCA algorithm maximizes variance, it is important to do the normalization. Otherwise, the results would be biased because of natural value discrepancies of some variables (e.g totChol would be considered as more important than BMI, totChol is simply greater than BMI because of its nature).

preproc1 <- preProcess(cardio, method=c("center", "scale"))
cardio.s <- predict(preproc1, cardio)
summary(cardio.s)
##       age             cigsPerDay        totChol             sysBP        
##  Min.   :-2.05083   Min.   :-0.757   Min.   :-2.80915   Min.   :-2.2120  
##  1st Qu.:-0.88276   1st Qu.:-0.757   1st Qu.:-0.70013   1st Qu.:-0.6956  
##  Median :-0.06511   Median :-0.757   Median :-0.06516   Median :-0.1977  
##  Mean   : 0.00000   Mean   : 0.000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.75254   3rd Qu.: 0.921   3rd Qu.: 0.59817   3rd Qu.: 0.5265  
##  Max.   : 2.38783   Max.   : 5.116   Max.   : 8.23488   Max.   : 7.3614  
##      diaBP               BMI             heartRate           glucose       
##  Min.   :-2.91546   Min.   :-2.51953   Min.   :-2.64798   Min.   :-1.7506  
##  1st Qu.:-0.66073   1st Qu.:-0.66509   1st Qu.:-0.64513   1st Qu.:-0.4540  
##  Median :-0.07616   Median :-0.09941   Median :-0.06097   Median :-0.1613  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.59190   3rd Qu.: 0.55481   3rd Qu.: 0.52319   3rd Qu.: 0.2151  
##  Max.   : 4.97610   Max.   : 7.62825   Max.   : 5.61376   Max.   :13.0549

Principal component analysis (PCA)

To dive deeper into the analysis before proceeding with PCA, I am using a pairwise correlation plot. The ggpairs() plot is useful for visualizing distribution and correlation between each variable.
It can be observed that almost every correlation is significant.

Another way of visualizing pairwise correlation is corrplot(). As It can be noticed, some features (like systolic and diastilic blood pressure) are highly correlated with each other.

Finally, let’s proceed with PCA.

pca<-prcomp(cardio.s, center=FALSE, scale.=FALSE)
pca$rotation
##                   PC1         PC2         PC3          PC4         PC5
## age         0.3492564 -0.47265344 -0.16211020  0.243202088 -0.06950582
## cigsPerDay -0.1312331  0.63008631 -0.08530298  0.366523550 -0.60838869
## totChol     0.2702664 -0.17143122 -0.31469833  0.677137946 -0.02796975
## sysBP       0.5643796  0.09983590  0.13804910 -0.018836247 -0.04109148
## diaBP       0.5315404  0.23033551  0.27201028 -0.067373301 -0.03325014
## BMI         0.3601777  0.08845929  0.27250469 -0.255509909 -0.18101516
## heartRate   0.1815720  0.52325495 -0.44034099  0.001518421  0.65734890
## glucose     0.1549128 -0.07455782 -0.71147909 -0.527082393 -0.39570406
##                    PC6        PC7          PC8
## age        -0.32354252 -0.6550804 -0.174787375
## cigsPerDay -0.13053810 -0.2386936  0.001178434
## totChol     0.43969411  0.3819581  0.003364339
## sysBP      -0.33095501  0.1642550  0.716865826
## diaBP      -0.20233545  0.3072132 -0.670415139
## BMI         0.72998467 -0.3917871  0.061331788
## heartRate   0.03904090 -0.2557804 -0.016673185
## glucose     0.00874433  0.1665011 -0.045210743

The loading matrix above shows loadings of a variable. The loadings can be interpreted as the importance of a variable in the composite variable. The higher the value, the higher the correlation between factor and variable, and the more important the variable in the given factor.
However, the meaning of components are hard to interpret. I will focus on the interpretation of the the meaning of certain components later on.

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.5545 1.0776 1.0100 0.9860 0.9154 0.88064 0.79406
## Proportion of Variance 0.3021 0.1452 0.1275 0.1215 0.1047 0.09694 0.07882
## Cumulative Proportion  0.3021 0.4472 0.5747 0.6963 0.8010 0.89795 0.97676
##                            PC8
## Standard deviation     0.43118
## Proportion of Variance 0.02324
## Cumulative Proportion  1.00000

First components explain a small fraction of variance (1st PC - 30.06%, 2nd PC - 14.59%, 3rd PC - 12.77%).
First 5 PCs explain over 80% of variance.
Plot shows the proportion of variance of each component.

Analysis of components

In order to analyze the components and try to grasp the meaning of them, one can use plots that are shown below.

fviz_pca_var(pca, col.var="steelblue")

This plot is the visualization of loadings of the variable for the first two components (exact same values as for the loading matrix above). It shows the relationships between all variables.
It can be interpreted as follow: Positively correlated variables are grouped together. Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.

Another way of visualizing the loading matrix is presented below. Let’s take a look at the first 5 PCs.

var<-get_pca_var(pca)
a<-fviz_contrib(pca, "var", axes=1)
b<-fviz_contrib(pca, "var", axes=2)
c<-fviz_contrib(pca, "var", axes=3)
d<-fviz_contrib(pca, "var", axes=4)
e<-fviz_contrib(pca, "var", axes=5)
grid.arrange(a,b,c,d,e,top='Contribution to the first five Principal Components')

In the first Principal Component above the threshold are sysBP, diaBP and BMI. Second - cigsPerDay, heartRate, age. Third - glucose, heartRate. Fourth - totChol, glucose, cigsPerDay. Fifth - heartRate, cigsPerDay and glucose.

On the interactive 3D plot below, one can observe how the original variables behave relative to the first three principal components for each observation, and color by target variable.

pca3d(pca, group=as.factor(target), biplot=TRUE, biplot.vars=3, legend="topleft")
rglwidget()

One can observe that positive (heart disease) target variables are slightly shifted in the same direction as diastolic and systolic blood preasure and BMI. It may suggest that these three variables result in higher chance of being ill.

Varimax rotation

As I have mentioned before, the meaning of PCs are hard to interpret. In order to solve this issue one can use a varimax rotation.
A varimax rotation is used to simplify the expression of a particular sub-space in terms of just a few major items each. Varimax is so called because it maximizes the sum of the variances of the squared loadings (squared correlations between variables and factors). It simplifies the interpretation of factors by minimizing the number of variables necessary to explain a given factor.

pca.varimax<-principal(cardio.s, nfactors=5, rotate="varimax")
print(loadings(pca.varimax), digits=3, cutoff=0.4, sort=TRUE)
## 
## Loadings:
##            RC1    RC4    RC2    RC5    RC3   
## sysBP       0.832                            
## diaBP       0.885                            
## BMI         0.686                            
## age                0.683                     
## totChol            0.852                     
## cigsPerDay                0.973              
## heartRate                        0.971       
## glucose                                 0.986
## 
##                  RC1   RC4   RC2   RC5   RC3
## SS loadings    2.016 1.303 1.046 1.031 1.011
## Proportion Var 0.252 0.163 0.131 0.129 0.126
## Cumulative Var 0.252 0.415 0.546 0.675 0.801

Let’s print only the significant loadings. Because I have no medical background, I can only guess why such features are grouped together. As if sysBP, diaBP and BMI creates RC1 I could make a guess that it may be linked with of obesity and cardiac output. I would interpret RC4 as getting old, RC2, RC5 and RC3 are self explanatory.

Conclusion

In this paper, I have used PCA on the chosen medical dataset. PCA is a powerful dimensionality reduction tool that helps with reducing the complexity of a model. It compresses a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information.
As a result of the analysis, I have managed to reduce the number of variables from 8 to 5 with maintaining 80.1% of variance.