Introduction

The goal of this paper is to examine the capabilities of PCA (principal component analysis) for dimensionality reduction on the Cardiovascular Disease dataset from Kaggle (https://www.kaggle.com/yassinehamdaoui1/cardiovascular-disease). To assess PCA's performance we will perform a supervised learning task on the dataset created with dimensionality reduction.

First, let's define the variables used in the analysis:

sbp -> systolic blood pressure

tobacco -> cumulative tobacco (kg)

ldl -> low density lipoprotein cholesterol

adiposity -> a numeric vector

famhist -> family history of heart disease, a factor with levels “Absent” and “Present”

typea -> type-A behavior

obesity -> a numeric vector

alcohol -> current alcohol consumption

age -> age at onset

chd -> response, coronary heart disease

Data loading and transformations

In order to apply dimensionality reduction algorithms we must be sure that all of the variables are numerical. To prepare the dataset, the ID and target variables are removed and the famhist variable is binarized.
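A minimal sketch of this preparation, assuming the Kaggle file is saved locally as cardiovascular.txt with a semicolon separator (both the file name and the separator are assumptions; adjust to the downloaded file) and that df_ and df_pca are the object names used in the chunks below:

# Assumed setup: these packages are used later in the analysis
library(factoextra)    # fviz_eig, fviz_pca_var, fviz_contrib, get_pca_var
library(gridExtra)     # grid.arrange
library(pROC)          # assumed source of auc()
library(randomForest)  # randomForest

# File name and separator are assumptions; adjust to the downloaded Kaggle file
df_ <- read.csv("cardiovascular.txt", sep = ";")

# Binarize famhist: "Present" -> 1, "Absent" -> 0
df_$famhist_bin <- ifelse(df_$famhist == "Present", 1, 0)

# Keep only the numeric predictors (the ID column, the original famhist factor
# and the target chd are left out)
df_pca <- df_[, c("sbp", "tobacco", "ldl", "adiposity", "typea",
                  "obesity", "alcohol", "age", "famhist_bin")]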

To check that there are no obvious outliers, a basic summary of the data can be calculated.
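The summary below can be produced with a single call on the prepared predictor frame (the name df_pca is assumed from the later chunks):

summary(df_pca)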

##       sbp           tobacco             ldl           adiposity    
##  Min.   :101.0   Min.   : 0.0000   Min.   : 0.980   Min.   : 6.74  
##  1st Qu.:124.0   1st Qu.: 0.0525   1st Qu.: 3.283   1st Qu.:19.77  
##  Median :134.0   Median : 2.0000   Median : 4.340   Median :26.11  
##  Mean   :138.3   Mean   : 3.6356   Mean   : 4.740   Mean   :25.41  
##  3rd Qu.:148.0   3rd Qu.: 5.5000   3rd Qu.: 5.790   3rd Qu.:31.23  
##  Max.   :218.0   Max.   :31.2000   Max.   :15.330   Max.   :42.49  
##      typea         obesity         alcohol            age       
##  Min.   :13.0   Min.   :14.70   Min.   :  0.00   Min.   :15.00  
##  1st Qu.:47.0   1st Qu.:22.98   1st Qu.:  0.51   1st Qu.:31.00  
##  Median :53.0   Median :25.80   Median :  7.51   Median :45.00  
##  Mean   :53.1   Mean   :26.04   Mean   : 17.04   Mean   :42.82  
##  3rd Qu.:60.0   3rd Qu.:28.50   3rd Qu.: 23.89   3rd Qu.:55.00  
##  Max.   :78.0   Max.   :46.58   Max.   :147.19   Max.   :64.00  
##   famhist_bin    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4156  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

The next obvious step is checking correlations. Before that, one needs to check the variable distributions in order to use the proper measure. The general scheme is as follows:

For normally distributed variables: Pearson

For others: Kendall's tau

# Shapiro-Wilk normality test applied to every column of df_pca
pval = apply(df_pca,2,shapiro.test)
print('All the variables have normal distribution: ')
## [1] "All the variables have normal distribution: "
# Count how many variables reject normality at the 5% significance level
bad=0
for (x in pval){
  if(x$p.value < 0.05){
    bad=bad+1
  }
}
if (bad>0){
  print(FALSE)
  print('Number of variables without normality: ')
  print(bad)
} else {
  print(TRUE)
}
## [1] FALSE
## [1] "Number of variables without normality: "
## [1] 9

Since no variable has a normal distribution, the proper way to measure correlation is Kendall's tau.
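A minimal sketch of how this can be computed in R (the rounding is only for readability; this chunk is not part of the original output):

# Kendall's tau correlation matrix of the prepared predictors
kendall_cor <- cor(df_pca, method = "kendall")
round(kendall_cor, 2)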

The only non-trivial correlations visible are adiposity with age and tobacco with age. Both are easy to explain: age usually correlates with weight, and tobacco is becoming less popular among young people, which may produce these mild interactions.

Principal Component Analysis

Principal component analysis is a simple way to reduce the dimensionality of a given dataset without losing much of the information stored inside.

Intuition

The idea behind PCA is to introduce a set of orthogonal vectors which will form the new coordinate system. This set of vectors must be orthogonal (as we are operating in a Euclidean coordinate system) and must explain as much of the variance in the data as possible with as few vectors as possible. It can be imagined as the answer to the question: “How should I place a coordinate system over the point cloud so that rejecting k-n of the k dimensions (n<k) results in the smallest loss of information?”

For the first vector we find the direction (an eigenvector of the data's covariance matrix) along which the data has the highest variance. In the illustration from the source below, the first principal component is fitted to a 2-dimensional dataset; the variance is shown by the spread of the red dots along the black line. After the first line is chosen (the black line aligns with the pink lines), we choose the orthogonal vector to the first one that describes the highest remaining variance, and so on. In that example there is only one more available vector, as the data is only 2-dimensional. Source: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
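As a small, self-contained illustration of this intuition (a synthetic sketch, not part of the original analysis), one can generate a correlated 2-dimensional cloud and inspect the direction of the first principal component:

set.seed(42)
x <- rnorm(200)
# y is strongly driven by x, so the cloud is elongated along one direction
toy <- data.frame(x = x, y = 0.8 * x + rnorm(200, sd = 0.3))
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)
toy_pca$rotation   # first column: the direction capturing the most variance
summary(toy_pca)   # PC1 should explain the large majority of the variance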

Formal definition

“PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.” Source: https://en.wikipedia.org/wiki/Principal_component_analysis

The first weight vector must satisfy:

$$\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\|=1} \|X\mathbf{w}\|^{2} = \arg\max_{\|\mathbf{w}\|=1} \frac{\mathbf{w}^{T}X^{T}X\mathbf{w}}{\mathbf{w}^{T}\mathbf{w}}$$

And for every k > 1, given the ‘rest’ of the dataset (with the part already explained by the first k-1 components subtracted):

$$\hat{X}_{k} = X - \sum_{s=1}^{k-1} X\mathbf{w}_{(s)}\mathbf{w}_{(s)}^{T}, \qquad \mathbf{w}_{(k)} = \arg\max_{\|\mathbf{w}\|=1} \|\hat{X}_{k}\mathbf{w}\|^{2}$$

Here X is the data matrix, the w are the weight (loading) vectors and t = Xw are the component scores. In a mathematical sense, PCA is just a rotation of the coordinate system, possibly followed by flattening the data onto the first few axes. For more information please visit https://en.wikipedia.org/wiki/Principal_component_analysis.

Choosing the right number of principal components

There are multiple ways to decide how many principal components should be used. In business settings with a high number of variables, choosing even half of the components often reduces the dimensionality at the cost of only 5-10% of explained variance. The stopping rules themselves are discussed in scientific papers (https://www.researchgate.net/publication/222533061_How_Many_Principal_Components_Stopping_Rules_for_Determining_the_Number_of_Non-Trivial_Axes_Revisited). Unfortunately the topic is still largely theoretical and, at the end of the day, there is no single best way to determine the number of components. What can be done is a decision based on how much loss of explained variance we ‘tolerate’, given the distribution of each variable (in the paper above the datasets were mainly generated from an N(0, 1) distribution). For example, if we suspect that part of the variance is due to noise or data inaccuracies, as is widely seen in business data, we can treat a bigger share of the variance as being just noise.

In this case I will choose the number of components so that they explain at least 70% of the variance, which gives 5 components (even though components that each explain under 10% of the variance are sometimes not recommended). What is the reason for this threshold? Intuition.

# PCA on the centered and standardized predictors; the scree plot shows the
# variance explained by each component
pca <- prcomp(df_pca, center=TRUE, scale=TRUE)
fviz_eig(pca)

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3   PC4     PC5     PC6     PC7
## Standard deviation     1.6985 1.0941 1.0372 0.972 0.91426 0.87213 0.81606
## Proportion of Variance 0.3205 0.1330 0.1195 0.105 0.09287 0.08451 0.07399
## Cumulative Proportion  0.3205 0.4536 0.5731 0.678 0.77092 0.85543 0.92942
##                            PC8     PC9
## Standard deviation     0.67850 0.41812
## Proportion of Variance 0.05115 0.01942
## Cumulative Proportion  0.98058 1.00000
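To make the “at least 70% of variance” rule explicit, the cumulative explained variance (the last row of the summary above) can be computed directly from the component standard deviations; a short sketch:

explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per PC
cum_explained <- cumsum(explained)
cum_explained
which(cum_explained >= 0.70)[1]             # smallest number of PCs reaching 70% (5 here)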

Description of the generated components

fviz_pca_var(pca, col.var = "steelblue")

pca$rotation[,1:5]
##                    PC1          PC2          PC3          PC4        PC5
## sbp         -0.3237513  0.238299948 -0.125094760 -0.196356344  0.2149068
## tobacco     -0.3018344  0.458509121  0.068151638  0.005356255 -0.6235827
## ldl         -0.3339222 -0.363913504  0.003104188  0.140291992 -0.2422249
## adiposity   -0.5162964 -0.187617138 -0.081873893 -0.135036139  0.1188945
## typea        0.0183216 -0.282181554  0.792293077 -0.209846012 -0.3209846
## obesity     -0.4014734 -0.391905478  0.040190206 -0.305463202  0.2833830
## alcohol     -0.1214213  0.543015265  0.459133400 -0.258467158  0.4190170
## age         -0.4600810  0.193042956 -0.135298204  0.157278763 -0.1999289
## famhist_bin -0.1951431  0.001340897  0.338439106  0.833434999  0.3053965

In order to break down each variable's contribution to each principal component, we can use the fviz_contrib function.

var <- get_pca_var(pca)
a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
grid.arrange(a,b,c,d,e,top='Contribution to the Principal Components')

Let's try to manually assign ‘titles’ to each principal component.

Dim1 -> Being obese and old, which is logical (or the opposite).

Dim2 -> A tendency towards drugs of different kinds (alcohol, tobacco) (or the opposite).

Dim3 -> A type-A personality is someone ambitious, stressed and competitive; it may easily go together with alcohol, which is also intuitive (or the opposite).

Dim4 -> Almost exclusively heart disease in the family (famhist_bin) (or the opposite).

Dim5 -> Drugs again.

We can of course try to assign such ‘classes’ to each principal component, but by definition the components do not have a direct meaning. To assess the difference between ill and healthy people, let's map the values for ill (1) and healthy (0) people onto the first two components.
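A minimal sketch of such a map over the first two components (colours and labels are arbitrary choices; df_$chd is the 0/1 response used in the modelling chunks below):

plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(df_$chd == 1, "red", "darkgreen"), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Observations in the PC1/PC2 plane by CHD status")
legend("topright", legend = c("healthy (0)", "ill (1)"),
       col = c("darkgreen", "red"), pch = 19)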

What may be worrying is that the variance for ill people is higher than for healthy people. In the end this can cause some imperfections in predictions, suggesting a sizable irreducible error. What one can also see is that most of the ill people are located near low values of PC1, which (given the negative PC1 loadings above) means high obesity and age; other dependencies may not be so clearly visible, as we are looking at only 2 dimensions.
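One quick way to verify the variance observation numerically (a sketch, not part of the original output):

apply(pca$x[df_$chd == 1, 1:2], 2, var)   # variance of PC1/PC2 scores, ill group
apply(pca$x[df_$chd == 0, 1:2], 2, var)   # variance of PC1/PC2 scores, healthy group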

Prediction using supervised learning

For this part we will perform supervised learning using both the unreduced dataset and the reduced one, to see if the reduction helps in any way.

The two selected algorithms are logistic regression and random forest.

Logistic Regression

The prediction will be performed using bootstrapped train/test splits (repeated random subsampling) in order to minimize the risk of overfitting to one specific subsample. The choice of these two algorithms is motivated by their different character, which may potentially lead to them exhibiting different behaviour on the tested data.

smp_size <- floor(0.75 * nrow(pca$x))

# Logistic regression on the first five principal components, chd as the target
model_data = data.frame(subset(pca$x, select=c(1, 2, 3, 4, 5)))
model_data$target = df_$chd
auc_list = c()
for (i in 1:1000){
  # Random 75/25 train/test split in every iteration
  train_ind <- sample(seq_len(nrow(model_data)), size = smp_size)
  train <- model_data[train_ind, ]
  test <- model_data[-train_ind, ]
  model <- glm(target ~.,family=binomial(link='logit'),data=train)
  # Column 6 is the target, so it is excluded from the predictors
  fitted.results <- predict(model,newdata=subset(test, select=-c(6)),type='response')
  auc_score = suppressMessages(auc(test$target, fitted.results))
  auc_list = append(auc_list, auc_score)
}
mean(auc_list)
## [1] 0.7760175
smp_size <- floor(0.75 * nrow(pca$x))

# Same procedure on the full (unreduced) set of predictors
model_data = df_pca
model_data$target = df_$chd
auc_list_nopca = c()
for (i in 1:1000){
  train_ind <- sample(seq_len(nrow(model_data)), size = smp_size)
  train <- model_data[train_ind, ]
  test <- model_data[-train_ind, ]
  model <- glm(target ~.,family=binomial(link='logit'),data=train)
  fitted.results <- predict(model,newdata=subset(test, select=-c(10)),type='response')
  auc_score = suppressMessages(auc(test$target, fitted.results))
  auc_list_nopca = append(auc_list_nopca, auc_score)
}
mean(auc_list_nopca)
## [1] 0.7745282

The bootstrapped (1000 repetitions) mean AUC values were: 0.7760 for the dataset with reduced dimensionality and 0.7745 for the dataset without dimensionality reduction.

The AUC for the reduced dataset is slightly higher; the question to ask is whether that difference is statistically significant. To test this hypothesis we will use a t-test for two unpaired samples.

Given that the AUC samples are approximately normally distributed, a Welch two-sample t-test can be applied; a quick check of this assumption is sketched below.
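This check is not part of the original output; note also that with 1000 resamples the t-test is fairly robust to mild departures from normality:

shapiro.test(auc_list)
shapiro.test(auc_list_nopca)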

t.test(auc_list, auc_list_nopca, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  auc_list and auc_list_nopca
## t = 0.8676, df = 1996.8, p-value = 0.3857
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.001877209  0.004855882
## sample estimates:
## mean of x mean of y 
## 0.7760175 0.7745282

We cannot reject the null hypothesis that the two means are equal, which means that PCA successfully reduced the dataset's dimensionality without losing predictive information.

Random Forest

The same procedure is applied for random forests. The hyperparameters below (maxnodes, nodesize) were tuned in order to limit severe overfitting to each training sample, which stabilized the achieved AUC a bit.

smp_size <- floor(0.75 * nrow(pca$x))

# Random forest on the first five principal components, chd as the target
model_data = data.frame(subset(pca$x, select=c(1, 2, 3, 4, 5)))
model_data$target = df_$chd
auc_list_rf = c()
#auc_list_train = c()
for (i in 1:1000){
  train_ind <- sample(seq_len(nrow(model_data)), size = smp_size)
  train <- model_data[train_ind, ]
  test <- model_data[-train_ind, ]
  model <- suppressWarnings(randomForest(target ~., data=train, importance=TRUE, proximity=TRUE, maxnodes=6, nodesize=10))
  fitted.results <- predict(model,newdata=subset(test, select=-c(6)),type='response')
  #fitted.results.train = predict(model,newdata=subset(train, select=-c(6)),type='response')
  auc_score = suppressMessages(auc(test$target, fitted.results))
  auc_list_rf = append(auc_list_rf, auc_score)
  
  #auc_score_train = auc(train$target, fitted.results.train)
  #auc_list_train = append(auc_list_train, auc_score_train)
}
mean(auc_list_rf)
## [1] 0.7573705
smp_size <- floor(0.75 * nrow(pca$x))

# Same procedure on the full (unreduced) set of predictors
model_data = df_pca
model_data$target = df_$chd
auc_list_rf_nopca = c()
#auc_list_train = c()
for (i in 1:1000){
  train_ind <- sample(seq_len(nrow(model_data)), size = smp_size)
  train <- model_data[train_ind, ]
  test <- model_data[-train_ind, ]
  model <- suppressWarnings(randomForest(target ~., data=train, importance=TRUE, proximity=TRUE, maxnodes=6, nodesize=10))
  fitted.results <- predict(model,newdata=subset(test, select=-c(10)),type='response')
  #fitted.results.train = predict(model,newdata=subset(train, select=-c(6)),type='response')
  auc_score = suppressMessages(auc(test$target, fitted.results))
  auc_list_rf_nopca = append(auc_list_rf_nopca, auc_score)
  
  #auc_score_train = auc(train$target, fitted.results.train)
  #auc_list_train = append(auc_list_train, auc_score_train)
}
mean(auc_list_rf_nopca)
## [1] 0.7558212

Testing the equality of the two sample means using a t-test:

t.test(auc_list_rf, auc_list_rf_nopca, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  auc_list_rf and auc_list_rf_nopca
## t = 0.83658, df = 1993, p-value = 0.4029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.002082675  0.005181319
## sample estimates:
## mean of x mean of y 
## 0.7573705 0.7558212

It is the same story as for logistic regression: the difference between the samples is not statistically significant. This means that the analyzed dataset can be reduced without a loss of predictive power.

Conclusions

PCA is a powerful tool for reducing dataset dimensionality. In this paper I showed that it can be useful on real data and can even provide slightly higher (not statistically significantly, but consistently in these experiments) predictive power than the regular dataset. It can be applied even more widely on huge datasets in order to reduce computation time and storage space.