Mateusz Tomczak
432561
This project takes inspiration from the results of the paper by D. Chicco and G. Jurman titled Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. In this article the authors conclude that machine learning can be used to predict heart failure survival using only two variables. The goal of this project is to verify whether similar results can be obtained using dimension reduction techniques - we will try to reduce the dimensionality of the data to just 2 components to see how much information we can retain and whether this information is meaningful in determining patient survival. The dataset contains 299 observations and the 13 features listed below, along with the descriptions provided in the mentioned paper:
age - age of the patient
anaemia - decrease of red blood cells or hemoglobin in the patient
creatinine_phosphokinase - patient's level of the CPK enzyme in the blood
diabetes - whether the patient has diabetes
ejection_fraction - percentage of blood leaving the heart at each contraction
high_blood_pressure - whether the patient has high blood pressure
platelets - platelets in the blood
serum_creatinine - level of creatinine in the blood
serum_sodium - level of sodium in the blood
sex - sex of the patient
smoking - whether the patient smokes
time - follow-up period
DEATH_EVENT - whether the patient died during the follow-up period
The DEATH_EVENT variable will be removed from the dataset and used to assess the results of the dimension reduction. The dataset also contains a patient_id column, which is redundant for the analysis and will be removed as well.
For dimension reduction we will use PCA and MDS.
library(corrplot)    # correlation matrix plots
library(factoextra)  # PCA extraction and visualisation
library(psych)       # KMO test
library(pdp)         # partial dependence plots
library(gridExtra)   # arranging plots in a grid
library(smacof)      # MDS fitting and random stress baseline
library(StatMatch)   # Gower distance
library(ape)         # phylogenetic and multivariate tools
data <- read.csv('data/heart-failure/heart_failure_ehrapy_prepared.csv')
head(data)
dim(data)
## [1] 299 14
Checking that there are no missing values in the dataset (note the trailing comma - we want rows with incomplete cases, not columns):
data[!complete.cases(data), ]
options(width = 100)
str(data)
## 'data.frame': 299 obs. of 14 variables:
## $ patient_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
Assign DEATH_EVENT to a separate variable and remove it, along with the patient_id column, from the data:
death_event <- data$DEATH_EVENT
data <- data[,!(names(data) %in% c("patient_id", "DEATH_EVENT"))]
We need to convert all data to numeric for the analysis.
data[,c("anaemia","creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "serum_sodium","sex","smoking","time")] <- lapply(data[,c("anaemia","creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "serum_sodium","sex","smoking","time")], as.numeric)
options(width = 100)
str(data)
## 'data.frame': 299 obs. of 12 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : num 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: num 582 7861 146 111 160 ...
## $ diabetes : num 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : num 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : num 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : num 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : num 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : num 0 0 1 0 0 1 0 1 0 1 ...
## $ time : num 4 6 7 7 8 8 10 10 10 10 ...
Below, some statistics describing the data by column are shown. These include the minimum value, 1st quartile, median, mean, 3rd quartile and maximum value.
options(width = 100)
summary(data)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000 Min. :14.00
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000 1st Qu.:30.00
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000 Median :38.00
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181 Mean :38.08
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000 3rd Qu.:45.00
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000 Max. :80.00
## high_blood_pressure platelets serum_creatinine serum_sodium sex
## Min. :0.0000 Min. : 25100 Min. :0.500 Min. :113.0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900 1st Qu.:134.0 1st Qu.:0.0000
## Median :0.0000 Median :262000 Median :1.100 Median :137.0 Median :1.0000
## Mean :0.3512 Mean :263358 Mean :1.394 Mean :136.6 Mean :0.6488
## 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400 3rd Qu.:140.0 3rd Qu.:1.0000
## Max. :1.0000 Max. :850000 Max. :9.400 Max. :148.0 Max. :1.0000
## smoking time
## Min. :0.0000 Min. : 4.0
## 1st Qu.:0.0000 1st Qu.: 73.0
## Median :0.0000 Median :115.0
## Mean :0.3211 Mean :130.3
## 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :1.0000 Max. :285.0
As a first step we will check the relationships between the variables and their correlations.
pairs(data)
As we can see, the variables do not display clear linear relationships or high pairwise correlations. We can confirm this with a correlation matrix.
corr_matrix <- cor(data, method = "spearman")
corrplot(corr_matrix, type = "lower", tl.cex = 0.9, order = "alphabet", tl.col = "black")
The correlation plot confirms that there are no high correlations present.
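To back this up numerically, here is a minimal sketch that finds the strongest pairwise association in the Spearman matrix computed above (no particular pair or value is assumed; the code simply reports whatever the data yields):
abs_corr <- abs(corr_matrix)
diag(abs_corr) <- 0                        # ignore self-correlations on the diagonal
idx <- which(abs_corr == max(abs_corr), arr.ind = TRUE)[1, ]
rownames(corr_matrix)[idx]                 # the most strongly related pair of variables
corr_matrix[idx[1], idx[2]]                # and its Spearman correlation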
To test whether we can use PCA on our data we can use the Kaiser-Meyer-Olkin (KMO) test. It determines whether PCA is useful for a given variable or not.
options(width = 100)
KMO(corr_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = corr_matrix)
## Overall MSA = 0.56
## MSA for each item =
## age anaemia creatinine_phosphokinase diabetes
## 0.56 0.56 0.57 0.67
## ejection_fraction high_blood_pressure platelets serum_creatinine
## 0.54 0.50 0.52 0.55
## serum_sodium sex smoking time
## 0.62 0.56 0.53 0.63
The overall MSA for the KMO test equals 0.56. Ideally this value should be close to 1, but values above 0.5 are commonly considered adequate, so we can regard the test as satisfied; the per-item values can also be checked directly, as in the sketch below.
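A quick sketch, using the MSAi element that psych::KMO() returns alongside the overall MSA, to list any variables whose individual adequacy falls below the common 0.5 rule of thumb:
kmo <- KMO(corr_matrix)
round(kmo$MSAi[kmo$MSAi < 0.5], 2)  # variables below the 0.5 adequacy threshold (may be empty)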
The low correlations and the low overall MSA from the KMO test suggest that MDS may be the better choice in this case. However, we will perform PCA anyway.
Principal Component Analysis (PCA) is a dimensionality-reduction method often used for multivariate data. It works by looking for the r-dimensional projection of the data that preserves the most variance. PCA enables the visualisation of multivariate data in 2D or 3D space while minimizing the information loss, by creating new, uncorrelated variables that are linear combinations of the original features.
To avoid a situation where some features dominate over others, we need to standardize the data.
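Before calling prcomp(), here is a minimal from-scratch sketch of what it computes: standardize the data, eigendecompose the resulting correlation matrix, and project the observations onto the eigenvectors. The square roots of the eigenvalues should match the PC standard deviations reported below.
X <- scale(data)              # center and standardize every column
eig <- eigen(cov(X))          # covariance of standardized data = correlation matrix
scores <- X %*% eig$vectors   # observation coordinates in the new basis
sqrt(eig$values)              # should match the PC standard deviations from prcomp()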
options(width = 80)
pca <- prcomp(data, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.2896 1.2566 1.1261 1.05638 1.01483 0.99442 0.93987
## Proportion of Variance 0.1386 0.1316 0.1057 0.09299 0.08582 0.08241 0.07361
## Cumulative Proportion 0.1386 0.2702 0.3759 0.46885 0.55467 0.63708 0.71069
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.91940 0.87408 0.84132 0.80250 0.71457
## Proportion of Variance 0.07044 0.06367 0.05898 0.05367 0.04255
## Cumulative Proportion 0.78113 0.84480 0.90378 0.95745 1.00000
eig_pca <- fviz_eig(pca, choice='eigenvalue', addlabels = TRUE)
var_pca <- fviz_screeplot(pca, addlabels = TRUE)
grid.arrange(eig_pca, var_pca, ncol=2)
As we can see, not much variance is explained by the first few principal components. The most important component, PC1, explains only 13.86% of the variance. If we set a target of at least 70% explained variance, we need at least 7 principal components, as the short check below confirms.
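A short check computed directly from the explained-variance proportions stored in the pca object (no new assumptions):
pve <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance explained by each PC
which(cumsum(pve) >= 0.70)[1]        # smallest number of PCs reaching 70%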
Based on these findings we can assume that reducing the data to 2 dimensions using PCA does not retain enough information from the dataset. However, we will continue with the analysis to see what the results will be.
Loading plots show how strongly the different variables contribute to a given principal component.
As we can see, PC1 is highly influenced by the sex and smoking variables, while PC2 is highly influenced by the age and time variables.
fviz_pca_var(pca, axes=c(1,2))
In the case of PC3, we can see that it is highly influenced by the serum_sodium and ejection_fraction variables.
fviz_pca_var(pca, axes=c(2,3))
We can verify the findings from the loading plots with bar plots of each variable's contribution to PC1, PC2 and PC3.
PC1 <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt = 45)
PC2 <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt = 45)
PC3 <- fviz_contrib(pca, "var", axes = 3, xtickslab.rt = 45)
grid.arrange(PC1, PC2, PC3, nrow=3)
As we can see, the most contributing variables match those visible in the loading plots.
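These contribution percentages can also be recovered by hand. Assuming factoextra defines a variable's contribution as its squared loading rescaled to percentages (the loading vectors have unit length, so the squared entries of one rotation column sum to 1), the following sketch should reproduce the PC1 bar plot:
round(sort(100 * pca$rotation[, 1]^2, decreasing = TRUE), 1)  # % contribution of each variable to PC1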
Using a scatterplot we can check whether the first two principal components group the data into visible clusters with respect to the death_event variable.
fviz_pca_ind(pca, col.ind=death_event, geom = "point")
As we can see, PCA does not retain enough information within the first two principal components to group the data based on the death_event variable.
Multidimensional Scaling (MDS) is a non-linear technique. For this method we need to supply the data in the form of dissimilarities between pairs of objects. It reduces the dimensionality of the data while approximately preserving the pairwise distances, which allows for a reasonable preservation of patterns and clusters. However, the coordinates of the new points carry no direct interpretation - the axes themselves are dimensionless.
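As a minimal sketch of what cmdscale() does below: classical MDS double-centers the squared distance matrix and takes the leading eigenvectors, scaled by the square roots of their eigenvalues.
D2 <- as.matrix(dist(data))^2              # squared Euclidean distances
n <- nrow(D2)
J <- diag(n) - matrix(1 / n, n, n)         # centering matrix
B <- -0.5 * J %*% D2 %*% J                 # double-centered Gram matrix
eig_b <- eigen(B)
coords <- eig_b$vectors[, 1:2] %*% diag(sqrt(eig_b$values[1:2]))  # 2D configuration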
d <- dist(data)  # Euclidean distance matrix on the (unscaled) data
mds <- cmdscale(d, k = 2)  # classical MDS down to two dimensions
plot(mds)
plot(mds, col = factor(death_event), pch = 19)
As we can see, MDS also does not provide enough information when the data is reduced to only two dimensions.
We can also check the goodness of fit of the MDS using the STRESS function. We will compare the empirical and theoretical STRESS values as a ratio, where the numerator is the empirical stress and the denominator is the theoretical (random) stress.
Based on Kruskal (1964), values greater than or equal to 0.2 indicate a poor goodness of fit, while 0 indicates a perfect fit.
stressvec <- randomstress(n = 299, ndim = 2, nrep = 1)  # theoretical stress from one random configuration
dist_gower <- gower.dist(data)  # Gower distance measures how different two records are
mds_ratio <- mds(dist_gower, ndim = 2)
ratio <- mds_ratio$stress / mean(stressvec)
ratio
## [1] 0.4717684
A ratio of around 0.47 indicates a rather poor fit of the MDS.
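Note that the baseline above was estimated from a single random configuration (nrep = 1), so the denominator is noisy. A sketch of a more stable variant, assuming the smacof default of 100 replications, averages many random-stress values; the resulting ratio will differ slightly from the one reported above.
stressvec100 <- randomstress(n = 299, ndim = 2, nrep = 100)  # 100 random configurations
mds_ratio$stress / mean(stressvec100)                        # more stable version of the ratio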
We can also plot the results of the MDS performed on the Gower distance matrix instead of the Euclidean one.
plot(mds_ratio$conf[,1:2], col=factor(death_event), pch=19)
The conducted analysis shows that the results obtained by D. Chicco and G. Jurman (2020) cannot be reasonably replicated using dimension reduction techniques. Neither PCA nor MDS produced results that would make it possible to extract meaningful information from the data reduced to 2 dimensions, or to group the reduced data by the death_event variable. In the mentioned article the authors used only two variables - serum_creatinine and ejection_fraction - but our analysis showed that those two variables were not particularly important in explaining the data variance.
The serum_creatinine variable contributed around 16% to the second principal component, and ejection_fraction contributed around 20% to the third. Overall, the first few principal components were not able to explain much of the data variation, which can be attributed to the low correlations between the original variables.
Multidimensional Scaling did not yield reasonable results either. The scatterplot of the data reduced to 2 dimensions by MDS shows that no meaningful information about the patients' risk of death was retained, and the goodness-of-fit test based on the Gower distance and the STRESS function points to a poor fit. Thus, we can conclude that the results of the analysis conducted by D. Chicco and G. Jurman (2020) cannot be replicated using dimension reduction methods.
Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5
Heart failure clinical records. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5Z89R.