Dimension Reduction in cardiovascular disease dataset

The aim of this article is to apply PCA and MCA to reduce the dimensionality of medical data. PCA (principal component analysis) is a statistical method applied to high-dimensional continuous data in order to reduce the number of dimensions (variables) in a dataset. It makes it possible to visualize the data on a 2-dimensional plot and to uncover dependencies between sets of variables. MCA (multiple correspondence analysis) serves the same purpose for categorical variables.

Dataset analysis

The dataset contains information on a group of patients suffering from cardiovascular disease and a group of healthy patients. It is taken from the Kaggle platform: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

df <- read.csv("cardio.csv", sep = ";")
dim(df)

## [1] 70000    13

There are 13 columns and 70,000 rows in the dataset.

colnames(df)

##  [1] "id"          "age"         "gender"      "height"      "weight"     
##  [6] "ap_hi"       "ap_lo"       "cholesterol" "gluc"        "smoke"      
## [11] "alco"        "active"      "cardio"

Names of columns correspond to:

id - id of a patient

age - age of a patient given in days

gender - 1 for women, 2 for men

height - height in cm

weight - weight in kg

ap_hi - systolic blood pressure

ap_lo - diastolic blood pressure

cholesterol - cholesterol level: 1 - normal, 2 - above normal, 3 - well above normal

gluc - glucose level: 1 - normal, 2 - above normal, 3 - well above normal

smoke - 1 - patient smokes, 0 - patient does not smoke

alco - 1 - patient drinks alcoholic beverages, 0 - patient does not drink alcoholic beverages

active - 1 - patient is active, 0 - patient is not active

cardio - 1 - patient suffers from cardiovascular disease, 0 - patient does not suffer from cardiovascular disease

head(df)

##   id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1  0 18393      2    168     62   110    80           1    1     0    0      1
## 2  1 20228      1    156     85   140    90           3    1     0    0      1
## 3  2 18857      1    165     64   130    70           3    1     0    0      0
## 4  3 17623      2    169     82   150   100           1    1     0    0      1
## 5  4 17474      1    156     56   100    60           1    1     0    0      0
## 6  8 21914      1    151     67   120    80           2    2     0    0      0
##   cardio
## 1      0
## 2      1
## 3      1
## 4      1
## 5      0
## 6      0

The dataset contains both numerical and categorical variables. To make the analysis more accurate, I will separate them and analyze the numerical variables with PCA and the categorical variables with MCA. I will omit the id variable, as it carries no information about the patient.

num_df <- df[, c(2, 4:7)]
cat_df <- df[, c(3, 8:13)]

Age, height, weight, systolic and diastolic blood pressure are numerical variables, while gender, cholesterol, glucose, smoking, alcohol, activity and whether the person is healthy or not are categorical variables.
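Selecting columns by position works, but it silently breaks if the column order ever changes. A minimal sketch of the same split done by name is shown below. The `toy` data frame is a hypothetical stand-in built from the first three rows printed by `head(df)`, and `num_cols` and `cat_cols` are names introduced only for this illustration.

```r
# Toy stand-in with the same column layout as the dataset
# (values copied from the first three rows of head(df)).
toy <- data.frame(
  id = c(0, 1, 2), age = c(18393, 20228, 18857), gender = c(2, 1, 1),
  height = c(168, 156, 165), weight = c(62, 85, 64),
  ap_hi = c(110, 140, 130), ap_lo = c(80, 90, 70),
  cholesterol = c(1, 3, 3), gluc = c(1, 1, 1), smoke = c(0, 0, 0),
  alco = c(0, 0, 0), active = c(1, 1, 0), cardio = c(0, 1, 1)
)

# Name the numerical columns explicitly; everything else except id is categorical.
num_cols <- c("age", "height", "weight", "ap_hi", "ap_lo")
cat_cols <- setdiff(names(toy), c("id", num_cols))

toy_num <- toy[, num_cols]
toy_cat <- toy[, cat_cols]

names(toy_num)
names(toy_cat)
```

On this layout the name-based split selects the same columns as the positional indices `c(2, 4:7)` and `c(3, 8:13)`.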

As age is given in days, I will convert it to years to make it more readable.

num_df$age <- num_df$age / 365

Let’s check the basic statistics for the numerical variables.

summary(num_df)

##       age            height          weight           ap_hi        
##  Min.   :29.58   Min.   : 55.0   Min.   : 10.00   Min.   : -150.0  
##  1st Qu.:48.39   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:  120.0  
##  Median :53.98   Median :165.0   Median : 72.00   Median :  120.0  
##  Mean   :53.34   Mean   :164.4   Mean   : 74.21   Mean   :  128.8  
##  3rd Qu.:58.43   3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:  140.0  
##  Max.   :64.97   Max.   :250.0   Max.   :200.00   Max.   :16020.0  
##      ap_lo         
##  Min.   :  -70.00  
##  1st Qu.:   80.00  
##  Median :   80.00  
##  Mean   :   96.63  
##  3rd Qu.:   90.00  
##  Max.   :11000.00

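The average examined person is 53 years old, 164 cm tall and weighs 74 kg. Their mean systolic blood pressure is about 129 and their mean diastolic blood pressure is about 97. These means should be read with caution: both blood pressure columns contain implausible values (negative minima and maxima of 16020 and 11000), which pull the means away from the medians of 120 and 80.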
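The summary shows negative and five-digit blood pressure readings, which cannot be real measurements. The sketch below shows one possible way to screen such rows. It is not part of the analysis in this article, the `bp` data frame is made up, and the bounds are assumptions chosen for illustration rather than clinical guidance.

```r
# Hypothetical readings, including the kinds of extremes seen in the summary.
bp <- data.frame(
  ap_hi = c(110, 140, -150, 16020, 120),
  ap_lo = c(80, 90, 70, 80, 11000)
)

# Assumed plausibility bounds (illustrative only), plus the requirement
# that systolic pressure exceeds diastolic pressure.
plausible <- bp$ap_hi >= 60 & bp$ap_hi <= 250 &
  bp$ap_lo >= 30 & bp$ap_lo <= 200 &
  bp$ap_hi > bp$ap_lo

bp_clean <- bp[plausible, ]

sapply(bp, median)    # medians of the raw columns stay at plausible values
colMeans(bp)          # raw means are dominated by the extreme readings
colMeans(bp_clean)    # means after screening
```

Any such filter changes the sample, so the thresholds would need to be justified before using it in a real analysis.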

Now, let’s check categorical variables.

library(ggplot2)
library(reshape2)

plot.df <- melt(cat_df, id.vars = NULL)
ggplot(data = plot.df) +
  geom_bar(aes(x = value)) + 
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  facet_wrap(~ variable, scales = "free", ncol = 4)

Most of the patients are women. Only a minority report elevated glucose or cholesterol levels. Most participants do not smoke, do not drink alcoholic beverages and are physically active. However, almost half of them suffer from cardiovascular disease.
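Statements such as "most" or "almost half" can be backed with exact shares. A minimal sketch using base R's `table()` and `prop.table()` follows. The `cardio` vector here is made up for illustration and is not the real column.

```r
# Made-up vector, coded like the dataset's cardio variable (0 = healthy, 1 = sick).
cardio <- c(0, 1, 1, 1, 0, 0, 1, 0)

counts <- table(cardio)                        # absolute frequencies
shares <- round(100 * prop.table(counts), 1)   # percentages per category
shares
```

Applying the same two calls to each column of `cat_df` would give the percentages behind the bar charts.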

PCA for numerical variables

First, let’s check correlations between variables.

cor <- cor(num_df, method = "pearson")
print(cor, digits = 1)

##          age height weight ap_hi ap_lo
## age     1.00 -0.082   0.05 0.021 0.018
## height -0.08  1.000   0.29 0.005 0.006
## weight  0.05  0.291   1.00 0.031 0.044
## ap_hi   0.02  0.005   0.03 1.000 0.016
## ap_lo   0.02  0.006   0.04 0.016 1.000

library(corrplot)

## corrplot 0.84 loaded

corrplot(cor)

The results show a moderate positive correlation between height and weight (0.29). All other pairs, including systolic and diastolic blood pressure (0.016), are close to zero.
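Pearson correlation is sensitive to extreme values, which may be one reason the two blood pressure columns look almost unrelated here. The sketch below illustrates the effect on simulated data only: `hi` and `lo` are generated values, not the real readings, and the single appended outlier mimics the maximum seen in the summary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated, strongly related readings (not the real data).
hi <- rnorm(1000, mean = 128, sd = 15)
lo <- 0.6 * hi + rnorm(1000, mean = 0, sd = 5)
r_clean <- cor(hi, lo)

# Append one implausible reading similar to the maximum in the summary.
hi_out <- c(hi, 16020)
lo_out <- c(lo, 80)
r_pearson  <- cor(hi_out, lo_out)                       # collapses towards 0
r_spearman <- cor(hi_out, lo_out, method = "spearman")  # rank-based, stays high

c(clean = r_clean, pearson = r_pearson, spearman = r_spearman)
```

Whether this explains the near-zero value in the real data is an assumption that would need to be checked on the dataset itself.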

In order to run PCA, I will use the prcomp function from base R and the factoextra library for visualization.

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

pca <- prcomp(num_df, center = TRUE, scale. = TRUE)
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.1395 1.0285 0.9919 0.9905 0.8239
## Proportion of Variance 0.2597 0.2115 0.1968 0.1962 0.1358
## Cumulative Proportion  0.2597 0.4713 0.6680 0.8642 1.0000

fviz_eig(pca)

The scree plot shows the percentage of variance explained by each component. PC1 explains about 26% of the variance, and four components are needed to explain 86% of it. The variance is spread almost evenly across the components, which reflects the weak correlations between the variables.
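The proportions in the summary follow directly from the standard deviations: each component's share is its variance divided by the total variance. The sketch below reproduces that arithmetic from the printed values. The 80% threshold used at the end is an assumption for illustration, not a rule from this analysis.

```r
# Standard deviations copied from the summary(pca) output above.
sdev <- c(PC1 = 1.1395, PC2 = 1.0285, PC3 = 0.9919, PC4 = 0.9905, PC5 = 0.8239)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var  <- cumsum(prop_var)       # cumulative proportion
round(rbind(prop_var, cum_var), 4)

# Smallest number of components reaching an assumed 80% threshold.
n_comp <- which(cum_var >= 0.80)[1]
n_comp
```

Because the variables were standardized, the variances sum to roughly 5, the number of variables.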

fviz_pca_var(pca, col.var = "darkred")

The plot shows that the variables fall into three groups. The first group is weight and height, which confirms the correlation results. The second group is systolic and diastolic blood pressure, and the third is age.

Let’s check the variable contributions to each of the first 3 dimensions, which together explain about 67% of the variance.

library(gridExtra)  # provides grid.arrange()
PC1 <- fviz_contrib(pca, choice = "var", axes = 1)
PC2 <- fviz_contrib(pca, choice = "var", axes = 2)
PC3 <- fviz_contrib(pca, choice = "var", axes = 3)
grid.arrange(PC1, PC2, PC3)

The most important component is dominated by height and weight. The second one is dominated by age. The third one is dominated by systolic and diastolic blood pressure.
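The contributions can also be computed by hand from the loadings. The sketch below assumes that the percentage contribution of a variable to a component equals its squared loading times 100, which should correspond to what `fviz_contrib` plots for a `prcomp` object. The data frame `sim` is simulated and only stands in for the real data.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated stand-in for the numerical data (not the real dataset).
sim <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
sim$b <- sim$b + 0.5 * sim$a   # make two of the variables related

sim_pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix has unit length, so the squared
# loadings of a component sum to 1; times 100 gives percentages.
contrib <- 100 * sim_pca$rotation^2
round(contrib, 1)
colSums(contrib)   # each component's contributions add up to 100
```

With equal contributions every variable would get 100 divided by the number of variables, so a variable above that level contributes more than average to the component.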

MCA for categorical variables

In order to run MCA, we need to transform the numeric codes (0/1 and 1/2/3) into category labels stored as character values.

cat_df$gender <- ifelse(cat_df$gender== 1, "female", "male")
cat_df$smoke <- ifelse(cat_df$smoke==0, "Doesn't smoke", "Smoke")
cat_df$alco <- ifelse(cat_df$alco==0, "Doesn't drink", "Drinks")
cat_df$active <- ifelse(cat_df$active==0, "Is not active", "Is active")
cat_df$cholesterol <- ifelse(cat_df$cholesterol==1, "c_normal", ifelse(cat_df$cholesterol==2,
                                                          "c_above", "c_well_above"))
cat_df$gluc <- ifelse(cat_df$gluc==1, "g_normal", ifelse(cat_df$gluc==2,
                                                        "g_above", "g_well_above"))
cat_df$cardio <- ifelse(cat_df$cardio ==0, "healthy", "sick")
head(cat_df)

##   gender  cholesterol     gluc         smoke          alco        active
## 1   male     c_normal g_normal Doesn't smoke Doesn't drink     Is active
## 2 female c_well_above g_normal Doesn't smoke Doesn't drink     Is active
## 3 female c_well_above g_normal Doesn't smoke Doesn't drink Is not active
## 4   male     c_normal g_normal Doesn't smoke Doesn't drink     Is active
## 5 female     c_normal g_normal Doesn't smoke Doesn't drink Is not active
## 6 female      c_above  g_above Doesn't smoke Doesn't drink Is not active
##    cardio
## 1 healthy
## 2    sick
## 3    sick
## 4    sick
## 5 healthy
## 6 healthy
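Nested `ifelse` calls become hard to read once there are more than two categories. As a possible alternative, the sketch below recodes a three-level variable with `factor()`. The vector `chol_code` is made up and uses the same 1/2/3 scheme as the cholesterol column.

```r
# Made-up codes in the same 1/2/3 scheme as the cholesterol column.
chol_code <- c(1, 3, 3, 1, 2)

# Map each code to its label in one step, then convert to character.
chol_label <- as.character(
  factor(chol_code,
         levels = c(1, 2, 3),
         labels = c("c_normal", "c_above", "c_well_above"))
)
chol_label
```

One difference in behavior: an unexpected code becomes `NA` here, whereas the nested `ifelse` version would label it as the last category.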

Now, we can run MCA on categorical data.

library(FactoMineR)
mca <- MCA(cat_df, graph = FALSE)
fviz_screeplot(mca, addlabels = TRUE)

The scree plot shows the percentage of variance explained by each dimension, just like in PCA.
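The percentages in an MCA scree plot are typically much lower than in PCA. A rough way to see why, assuming the MCA is computed on the indicator matrix, is to count categories: with J categories spread over Q variables there are at most J - Q dimensions and the total inertia is J / Q - 1. The sketch below applies this to the seven categorical variables used here. The variable names in the code are introduced only for this illustration.

```r
# Number of categories per variable, as listed in the column description.
n_levels <- c(gender = 2, cholesterol = 3, gluc = 3, smoke = 2,
              alco = 2, active = 2, cardio = 2)

n_vars <- length(n_levels)   # Q = 7 variables
n_cats <- sum(n_levels)      # J = 16 categories

max_dims      <- n_cats - n_vars       # at most 9 dimensions
total_inertia <- n_cats / n_vars - 1   # 9/7, about 1.29
avg_share     <- 100 / max_dims        # average share per dimension, in %

c(max_dims = max_dims, total_inertia = total_inertia, avg_share = avg_share)
```

Under these assumptions a dimension explains about 11% of the inertia on average, so two dimensions cannot be expected to capture most of it.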

Let’s have a look at the contribution plot, which shows how much each variable category contributes to the first 2 dimensions.

fviz_contrib(mca, choice = "var", axes = 1:2)

High levels of cholesterol and glucose, smoking, drinking alcohol and being a man have the biggest contributions to the result.

fviz_mca_var(mca, col.var = "contrib",
             gradient.cols = c("darkgreen", "yellow", "darkred"), 
             repel = TRUE, 
             ggtheme = theme_minimal()
)

The plot shows the results of MCA. Categories with the biggest contributions are shown in red, and those with the lowest contributions in green. Two groups of categories stand out from the others: high levels of glucose and cholesterol form the first group, and drinking and smoking form the second. Opposite categories lie on opposite sides of the plot, e.g. “healthy” lies opposite “sick”.

Conclusions

Although the percentage of explained variance in 2-dimensional space is not high enough to assume that all points have been assigned correctly, the results seem reasonable in both analyses. Height is related to weight in most cases, as is diastolic to systolic blood pressure. High levels of glucose and cholesterol may correspond to people with health issues. Drinking alcohol and smoking also often occur together.