Mateusz Tomczak

432561

Introduction

This project takes an inspiration from the results of the paper by D.Chicco and G.Jurman titled Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. In this article authors conclude that machine learning can be uset to predict heart failure using only two variables. The goal of this project is to verify, if same results can be acquired using dimension reduction techniques - we will try to reduce the dimensionality of the data to just 2 components to see how much information we can retain and if this information will be meaningful in defining the patient survivability. Dataset used contains 299 observations and 13 features listed below, along with the descriptions provided in the mentioned paper:

DEATH_EVENT variable will be removed from the dataset and used to assess the results of the dimension reduction. Dataset also contains patient_id column, which is redundant in the analysis and will be removed.

During the analysis for the dimension reduction we will use PCA and MDS.

Libraries used

library(corrplot)
library(factoextra)
library(psych)
library(pdp)
library(gridExtra)
library(smacof)
library(StatMatch)
library(ape)

Data preparation

data <- read.csv('data/heart-failure/heart_failure_ehrapy_prepared.csv')
head(data)
dim(data)
## [1] 299  14

Checking if there are no missing values in the dataset

data[!complete.cases(data)]
options(width = 100)
str(data)
## 'data.frame':    299 obs. of  14 variables:
##  $ patient_id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

Assign the DEATH_EVENT to other variable and remove it along with patient_id column from the data

death_event <- data$DEATH_EVENT
data <- data[,!(names(data) %in% c("patient_id", "DEATH_EVENT"))]

We need to convert all data to numeric for the analysis.

data[,c("anaemia","creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "serum_sodium","sex","smoking","time")] <- lapply(data[,c("anaemia","creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "serum_sodium","sex","smoking","time")], as.numeric)
options(width = 100)
str(data)
## 'data.frame':    299 obs. of  12 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : num  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: num  582 7861 146 111 160 ...
##  $ diabetes                : num  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : num  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : num  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : num  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : num  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : num  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : num  4 6 7 7 8 8 10 10 10 10 ...

Below some statistics are show describing the data by columns. Those statistics include minimum value, 1st quantile, median, mean, 3rd quantile and maximum value

options(width = 100)
summary(data)
##       age           anaemia       creatinine_phosphokinase    diabetes      ejection_fraction
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000   Min.   :14.00    
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000   1st Qu.:30.00    
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000   Median :38.00    
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181   Mean   :38.08    
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000   3rd Qu.:45.00    
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000   Max.   :80.00    
##  high_blood_pressure   platelets      serum_creatinine  serum_sodium        sex        
##  Min.   :0.0000      Min.   : 25100   Min.   :0.500    Min.   :113.0   Min.   :0.0000  
##  1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900    1st Qu.:134.0   1st Qu.:0.0000  
##  Median :0.0000      Median :262000   Median :1.100    Median :137.0   Median :1.0000  
##  Mean   :0.3512      Mean   :263358   Mean   :1.394    Mean   :136.6   Mean   :0.6488  
##  3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400    3rd Qu.:140.0   3rd Qu.:1.0000  
##  Max.   :1.0000      Max.   :850000   Max.   :9.400    Max.   :148.0   Max.   :1.0000  
##     smoking            time      
##  Min.   :0.0000   Min.   :  4.0  
##  1st Qu.:0.0000   1st Qu.: 73.0  
##  Median :0.0000   Median :115.0  
##  Mean   :0.3211   Mean   :130.3  
##  3rd Qu.:1.0000   3rd Qu.:203.0  
##  Max.   :1.0000   Max.   :285.0

Data analysis

As a first step we will check the relationships between the variables and their correlation

pairs(data)

As we can see the variables do not display linear relationships or high correlations between each other, we can confirm this with correlation matrix

corr_matrix <- cor(data, method = c("spearman"))
corrplot(corr_matrix, type = "lower", tl.cex = 0.9, order="alphabet", tl.col = "black")

The correlation plot confirms that there are no high correlations present.

To test if we can use PCA on our data we can use Kaiser-Meyer-Olkin (KMO) test. It determines if PCA is useful for a giver variable or not.

options(width = 100)
KMO(corr_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = corr_matrix)
## Overall MSA =  0.56
## MSA for each item = 
##                      age                  anaemia creatinine_phosphokinase                 diabetes 
##                     0.56                     0.56                     0.57                     0.67 
##        ejection_fraction      high_blood_pressure                platelets         serum_creatinine 
##                     0.54                     0.50                     0.52                     0.55 
##             serum_sodium                      sex                  smoking                     time 
##                     0.62                     0.56                     0.53                     0.63

Overall MSA for the KMO test equals 0.56. In the best case scenario this value should be close to 1, but we can assume that value above 0.5 is adequate and we can acknowlegde the test as satisfied.

Low correlations and low MSA for KMO test point to the conclusion that the MDS may be the better choice in this case. However, we can perform PCA anyway.

PCA

Principle Component Analysis (PCA) is a dimensionality-reduction method often used to reduce the dimensionality of multivariate data. It works by looking for r-dimentional projection of data, that preserves variance in the best manner. PCA can enable the visualisation of multivariate data on the 2D or 3D space and at the same time it minimizes the information loss by creating new uncorrelated variables that are correlated with some of the original data features.

To avoid a situation where some features can dominate over other we need to standarize the data.

options(width = 80)
pca <- prcomp(data, center = TRUE, scale = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.2896 1.2566 1.1261 1.05638 1.01483 0.99442 0.93987
## Proportion of Variance 0.1386 0.1316 0.1057 0.09299 0.08582 0.08241 0.07361
## Cumulative Proportion  0.1386 0.2702 0.3759 0.46885 0.55467 0.63708 0.71069
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.91940 0.87408 0.84132 0.80250 0.71457
## Proportion of Variance 0.07044 0.06367 0.05898 0.05367 0.04255
## Cumulative Proportion  0.78113 0.84480 0.90378 0.95745 1.00000
eig_pca <- fviz_eig(pca, choice='eigenvalue', addlabels = TRUE)
var_pca <- fviz_screeplot(pca, addlabels = TRUE)
grid.arrange(eig_pca, var_pca, ncol=2)

As we can see, not much variance is explained by the first few principal components. The most important 1PC explains only 13.86% of data variance. If we set our target of 70% or more explained variance, the optimal number include at least 7 principal components.

Based on those findings we can assume that reduction of the data to 2 dimension using PCA does not retain enough information from the dataset. However, we will continue with the analysis to see what the results will be.

Loading plots

Loading plots show how strongly different variables contributes to the given principal component.

PC1 and PC2

As we can see, PC1 is highly influenced by the sex and smoking variables. PC2 is highly influenced by age and time variables.

fviz_pca_var(pca, axes=c(1,2))

PC2 and PC3

In case of PC3 we can see that it is highly influenced by serum_sodium and ejection_fraction variables

fviz_pca_var(pca, axes=c(2,3))

We can verify the findings from loading plots by plotting bar plot of each variable influence on PC1, PC2 and PC3

pca_var <- get_pca_var(pca)

PC1 <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt=45)
PC2 <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt=45)
PC3 <- fviz_contrib(pca, "var", axes = 3, xtickslab.rt=45)
grid.arrange(PC1, PC2, PC3, nrow=3)

As we can see, the most contirbuting variables match those visible on the loading plots.

Using scatterplot we can verify if first two principal components group the data into somewhat visible clusters depending on death_event variable.

fviz_pca_ind(pca, col.ind=death_event, geom = "point")

As we can see the PCA does not retain enough information within first two principal components that would make it possible to group the data based on the death_event variable.

MDS

Multidimensional Scaling (MDS) is a non-linear technique. In case of this method we need to supply data in the form of dissimilarities between pairs of objects. It reduces the dimensionality of the data while keeping the distances between the pairs, which allows for a reasonable preservation of patterns and clusters. However, the coordinates of new points no longer retain any information and become dimensionless.

dist <- dist(data)
mds <- cmdscale(dist, k=2)
plot(mds)

plot(mds, col=factor(death_event), pch=19)

As we can see, the MDS also does not provide enough information when the data is reduced only to two dimensions.

We can also check the goodness of fit of the MDS using STRESS funciton. We will compare empirical and theoretical STRESS functions as ratio, where counter is an empirical stress and the nominator is a theorietical stress.

Based on Kruskal (1964), values above or euqal to 0.2 indicate poor goodness of fit, while 0 indicates perfect goodness of fit.

stressvec<-randomstress(n=299, ndim=2, nrep=1) # creating theoretical stress vector
dist_gower<-gower.dist(data) # to conduct the test we will user Gower distance, which measures how different the two records are
mds_ratio <- mds(dist_gower, ndim=2)
ratio<- mds_ratio$stress/ mean(stressvec)
ratio
## [1] 0.4717684

Ratio value equal to around 0.47 indicates rather poor fit of the MDS.

We can plot the results of the MDS performed with distance matrix instead of the correlation matrix

plot(mds_ratio$conf[,1:2], col=factor(death_event), pch=19)

Conclusions

The conducted analisis shows that the results obtained by D.Chicco and G.Jurman (2020) cannot be reasonably replicated using dimension reduction techniques. Both PCA and MDS did not provide results that would make it remotely possible to extract meaningful information from the data reduced to 2 dimensions, or group the reduced data by death_event variable. In the mentioned article authors used only two vaariables - serum_creatinine and ejection_fraction. But the analysis conducted showed that those two variables were not that important in explaining the data variance.

In case of serum_creatinine variable, it was contributing in around 16% to the second component of the PCA. Ejection_fraction was contirbuting around 20% to the third component. But overall we can see that the few first principal components were not able to explain much data variation, which can be attributed to the low correlations between original variables.

The Multidimensional Scaling did not yield reasonable results as well. Analising the scatterplot of the data reduced to 2 dimensions MDS shows that no meaningful information about patients’ possibility of death was retained. Test for goodness of fit of the MDS using, instead of correlation data, Gower distance and STRESS function points to poor goodness of fit. Thus, we can conclude that the results of the analysis conducted by D.Chicco and G.Jurman (2020) cannot be replicated using dimension reduction methods.

References

Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5

Heart failure clinical records. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5Z89R.