Luka Pascal Cavazza, 7. January 2023

IMB; Multivariate Analysis;

HOMEWORK 1

Heart faliure prediction

So for my first MvA homework I decided that I will look at the world happiness report for 2022. Considering we’ve just entered 2023 I figured it would be nice to start the year with some positivity. Of course this dataset didn’t have the things I needed for the HW after taking a good look at it. So, a couple of hours searching later I came up with this. If I can’t look at what makes people happy, I can at least maybe learn how not to die.

The basic idea is to see what are the features that can be used to predict heart faliure.

My unit of observation are the people that had heart faliure, the sample size is 299.

Description of the variables:

  • age: The age of the person
  • anaemia: Decrease of red blood cells or hemoglobin (boolean)
  • creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
  • diabetes: If the patient has diabetes (boolean)
  • ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage)
  • high_blood_pressure: If the patient has hypertension (boolean)
  • platelets: Platelets in the blood (kiloplatelets/mL)
  • serum_creatinine: Level of serum creatinine in the blood (mg/dL)
  • serum_sodium: Level of serum sodium in the blood (mEq/L)
  • sex: Woman or man (binary)
  • smoking: If the patient smokes or not (boolean)
  • time: Follow-up period (days)
  • DEATH_EVENT: If the patient deceased during the follow-up period (boolean)

Source of dataset: https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data

mydata <- read.csv("~/Desktop/FAKS/IMB/Multivariate Analysis/HW1/heart_failure_clinical_records_dataset.csv")
head(mydata)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

As already stated above, the main goal of this data analysis is to understand what are the most important factors when it comes to heart faliure. So let’s take a look.

First let’s see if there are any duplicates or values with NA within our data.

Checking for duplicates or values with NA

sum(duplicated(mydata))
## [1] 0
sum(is.na(mydata))
## [1] 0

This is a nice surprise, no duplicates or NA’s were found. Let’s see what we’re dealing with in a little more detail.

Descriptive statistics

summary(mydata)
##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000  
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000  
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000  
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181  
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500   
##  1st Qu.:30.00     1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900   
##  Median :38.00     Median :0.0000      Median :262000   Median :1.100   
##  Mean   :38.08     Mean   :0.3512      Mean   :263358   Mean   :1.394   
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400   
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :9.400   
##   serum_sodium        sex            smoking            time      
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.0  
##  1st Qu.:134.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 73.0  
##  Median :137.0   Median :1.0000   Median :0.0000   Median :115.0  
##  Mean   :136.6   Mean   :0.6488   Mean   :0.3211   Mean   :130.3  
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:203.0  
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0  
##   DEATH_EVENT    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3211  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Looking at this I have come to a realization that I have to transform the gender values. I had to dig a bit deeper, but I managed to find the corresponding values for all of the variables.

Sex - Female = 0, Male = 1 Diabetes - 0 = No, 1 = Yes Anaemia - 0 = No, 1 = Yes High_blood_pressure - 0 = No, 1 = Yes Smoking - 0 = No, 1 = Yes DEATH_EVENT - 0 = No, 1 = Yes

Later I’ll probably also have to change some of the other variables, but for now I’ll stick to gender.

#adding a new factor column for gender variable in mydata (if I would be changing from categorical the concept is the same)
mydata$GenderFactor <- factor(mydata$sex, 
                 levels = c(0, 1),
                 labels = c("Female", "Male")
                 )

Works like intended.

Now let’s take a look at some descriptive statistics.

#for this we'll need the library(pastecs)
round(stat.desc(mydata[ ,c(-14)])) #rounding the data to the nearest whole number and removing the column for GenderFactor as it gives back NA
##                age anaemia creatinine_phosphokinase diabetes ejection_fraction
## nbr.val        299     299                      299      299               299
## nbr.null         0     170                        0      174                 0
## nbr.na           0       0                        0        0                 0
## min             40       0                       23        0                14
## max             95       1                     7861        1                80
## range           55       1                     7838        1                66
## sum          18189     129                   173970      125             11387
## median          60       0                      250        0                38
## mean            61       0                      582        0                38
## SE.mean          1       0                       56        0                 1
## CI.mean.0.95     1       0                      110        0                 1
## var            141       0                   941459        0               140
## std.dev         12       0                      970        0                12
## coef.var         0       1                        2        1                 0
##              high_blood_pressure  platelets serum_creatinine serum_sodium sex
## nbr.val                      299        299              299          299 299
## nbr.null                     194          0                0            0 105
## nbr.na                         0          0                0            0   0
## min                            0      25100                0          113   0
## max                            1     850000                9          148   1
## range                          1     824900                9           35   1
## sum                          105   78744051              417        40851 194
## median                         0     262000                1          137   1
## mean                           0     263358                1          137   1
## SE.mean                        0       5656                0            0   0
## CI.mean.0.95                   0      11131                0            1   0
## var                            0 9565668749                1           19   0
## std.dev                        0      97804                1            4   0
## coef.var                       1          0                1            0   1
##              smoking  time DEATH_EVENT
## nbr.val          299   299         299
## nbr.null         203     0         203
## nbr.na             0     0           0
## min                0     4           0
## max                1   285           1
## range              1   281           1
## sum               96 38948          96
## median             0   115           0
## mean               0   130           0
## SE.mean            0     4           0
## CI.mean.0.95       0     9           0
## var                0  6024           0
## std.dev            0    78           0
## coef.var           1     1           1

Because of the type of data that I’m dealing with, this doesn’t tell me much. It would make more sense to look at the visualization of data. But let’s look at age for example. The average age is 61 years, the youngest person was 40 years old and the oldest was 95 years old. Range is 55 years. Median is 60 years old so 1/2 (50%) of all observations are younger and 50% are older.

boxplot(mydata$age)

We see that data is fine, we have no outliers which is great. But what if we look at creatine phosphokinase for example?

Cleaning the data

boxplot(mydata$creatinine_phosphokinase)

As expected there are quite a few outliers. I’ll remove them.

# for this I'll use library(dplyr)
data = mydata %>%  #creating a new object
  filter(creatinine_phosphokinase <= 1300) #filtering out all the values that are above 1300
boxplot(data$creatinine_phosphokinase) 

Let’s look at death events together with age.

Visual presentation

#for this I'll use the library(ggplot2)
ggplot(data, aes(DEATH_EVENT, age)) +
  geom_boxplot(aes(fill=factor(DEATH_EVENT))) + 
  theme(axis.text.x = element_text(angle=0, vjust=0)) + 
  labs(title="Death events grouped by age", 
       subtitle="0 = no, 1 = yes",
       x="Death Event",
       y="age")

As expected we can see that on average people that died were older in comparison to those who survived. In the case of people that survived we have one “outlier”. Boxplots make it easier to see the quartile ranges, the box is showing the values from the first to the third quartile, the vertical line connects the minimum with the maximum and lastly the horizontal line shows us the median value.

Let’s look at another histogram for age.

ggplot(data, aes(x = age)) + 
  geom_histogram(aes(y=..density..), binwidth = 5, colour = "blue", position = "identity", alpha = 0.5) +
  geom_density(alpha = 0.64)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

Above we can see the distribution of age among all observations. It’s quite normally distributed, but we could say that it’s skewed slightly to the right.

library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:car':
## 
##     some
map_df(data, ~sum(is.na(.x))) #we need this if we want a dataframe in map, otherwise it returns it as a list
## # A tibble: 1 × 14
##     age anaemia creatini…¹ diabe…² eject…³ high_…⁴ plate…⁵ serum…⁶ serum…⁷   sex
##   <int>   <int>      <int>   <int>   <int>   <int>   <int>   <int>   <int> <int>
## 1     0       0          0       0       0       0       0       0       0     0
## # … with 4 more variables: smoking <int>, time <int>, DEATH_EVENT <int>,
## #   GenderFactor <int>, and abbreviated variable names
## #   ¹​creatinine_phosphokinase, ²​diabetes, ³​ejection_fraction,
## #   ⁴​high_blood_pressure, ⁵​platelets, ⁶​serum_creatinine, ⁷​serum_sodium
options(repr.plot.width=30, repr.plot.height=30)

data1 <- map2(.x = mydata, .y = names(mydata), .f = ~ggplot(mydata, aes(x = .x)) +
     geom_histogram(fill = "navy", col = "black", alpha = 30, bins = 30, stat = "count") +
     labs(x = .y) +
     theme_bw())
## Warning in geom_histogram(fill = "navy", col = "black", alpha = 30, bins = 30, : Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
ggpubr::ggarrange(plotlist = data1)

After a quick Google search I managed to put this together but it doesn’t tell me much, now I’m realising I didn’t pick out a very good dataset (everything that I wanted to know is in binary, should’ve thought about this before). Lost way too much time, but the binary variables are presented quite nicely.

I’ll try something a bit more simple.

#for this I'll need library(car)

scatterplot(y= data$serum_sodium,
            x= data$ejection_fraction,
            ylab = "Level of serum sodium in the blood (mEq/L)",
            xlab = "Percentage of blood leaving the heart at each contraction (percentage)",
            smooth = FALSE)

This again doesn’t tell me much.

I just remembered that I could compare men and women.

data$DeathFactor <- factor(data$DEATH_EVENT, #transforming
                           levels = c(0,1),
                           labels = c("no death", "death"))

ggplot(data, aes(x=age, fill=GenderFactor)) + #plotting the graph
  geom_histogram(position = "dodge", binwidth = 50, colour="gray") +
  facet_wrap(~DeathFactor, ncol=1)

Everything looks good.

But let’s see if there is any differences when we look at other things than just age.

ggplot(data, aes(x=serum_sodium, fill=GenderFactor)) + #plotting the graph
  geom_histogram(position = "dodge", binwidth = 50, colour="gray") +
  facet_wrap(~DeathFactor, ncol=1)

We can see that some woman that had lower levels of sodium died, but no men. Other deaths were connected to higher levels of sodium.

I think this is it for my first homework since I’m getting quite agitated with this dataset to be honest. I hope I managed to show what are some of the things that we learned in class. If there is one thing I’ll keep in mind for HW2 it’s to look for a better dataset.