So for my first MvA homework I decided that I will look at the world happiness report for 2022. Considering we’ve just entered 2023 I figured it would be nice to start the year with some positivity. Of course this dataset didn’t have the things I needed for the HW after taking a good look at it. So, a couple of hours searching later I came up with this. If I can’t look at what makes people happy, I can at least maybe learn how not to die.
The basic idea is to see what are the features that can be used to predict heart faliure.
My unit of observation are the people that had heart faliure, the sample size is 299.
Description of the variables:
Source of dataset: https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data
mydata <- read.csv("~/Desktop/FAKS/IMB/Multivariate Analysis/HW1/heart_failure_clinical_records_dataset.csv")
head(mydata)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## DEATH_EVENT
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
As already stated above, the main goal of this data analysis is to understand what are the most important factors when it comes to heart faliure. So let’s take a look.
First let’s see if there are any duplicates or values with NA within our data.
sum(duplicated(mydata))
## [1] 0
sum(is.na(mydata))
## [1] 0
This is a nice surprise, no duplicates or NA’s were found. Let’s see what we’re dealing with in a little more detail.
summary(mydata)
## age anaemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :262000 Median :1.100
## Mean :38.08 Mean :0.3512 Mean :263358 Mean :1.394
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :9.400
## serum_sodium sex smoking time
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 4.0
## 1st Qu.:134.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 73.0
## Median :137.0 Median :1.0000 Median :0.0000 Median :115.0
## Mean :136.6 Mean :0.6488 Mean :0.3211 Mean :130.3
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0
## DEATH_EVENT
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3211
## 3rd Qu.:1.0000
## Max. :1.0000
Looking at this I have come to a realization that I have to transform the gender values. I had to dig a bit deeper, but I managed to find the corresponding values for all of the variables.
Sex - Female = 0, Male = 1 Diabetes - 0 = No, 1 = Yes Anaemia - 0 = No, 1 = Yes High_blood_pressure - 0 = No, 1 = Yes Smoking - 0 = No, 1 = Yes DEATH_EVENT - 0 = No, 1 = Yes
Later I’ll probably also have to change some of the other variables, but for now I’ll stick to gender.
#adding a new factor column for gender variable in mydata (if I would be changing from categorical the concept is the same)
mydata$GenderFactor <- factor(mydata$sex,
levels = c(0, 1),
labels = c("Female", "Male")
)
Works like intended.
Now let’s take a look at some descriptive statistics.
#for this we'll need the library(pastecs)
round(stat.desc(mydata[ ,c(-14)])) #rounding the data to the nearest whole number and removing the column for GenderFactor as it gives back NA
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## nbr.val 299 299 299 299 299
## nbr.null 0 170 0 174 0
## nbr.na 0 0 0 0 0
## min 40 0 23 0 14
## max 95 1 7861 1 80
## range 55 1 7838 1 66
## sum 18189 129 173970 125 11387
## median 60 0 250 0 38
## mean 61 0 582 0 38
## SE.mean 1 0 56 0 1
## CI.mean.0.95 1 0 110 0 1
## var 141 0 941459 0 140
## std.dev 12 0 970 0 12
## coef.var 0 1 2 1 0
## high_blood_pressure platelets serum_creatinine serum_sodium sex
## nbr.val 299 299 299 299 299
## nbr.null 194 0 0 0 105
## nbr.na 0 0 0 0 0
## min 0 25100 0 113 0
## max 1 850000 9 148 1
## range 1 824900 9 35 1
## sum 105 78744051 417 40851 194
## median 0 262000 1 137 1
## mean 0 263358 1 137 1
## SE.mean 0 5656 0 0 0
## CI.mean.0.95 0 11131 0 1 0
## var 0 9565668749 1 19 0
## std.dev 0 97804 1 4 0
## coef.var 1 0 1 0 1
## smoking time DEATH_EVENT
## nbr.val 299 299 299
## nbr.null 203 0 203
## nbr.na 0 0 0
## min 0 4 0
## max 1 285 1
## range 1 281 1
## sum 96 38948 96
## median 0 115 0
## mean 0 130 0
## SE.mean 0 4 0
## CI.mean.0.95 0 9 0
## var 0 6024 0
## std.dev 0 78 0
## coef.var 1 1 1
Because of the type of data that I’m dealing with, this doesn’t tell me much. It would make more sense to look at the visualization of data. But let’s look at age for example. The average age is 61 years, the youngest person was 40 years old and the oldest was 95 years old. Range is 55 years. Median is 60 years old so 1/2 (50%) of all observations are younger and 50% are older.
boxplot(mydata$age)
We see that data is fine, we have no outliers which is great. But what
if we look at creatine phosphokinase for example?
boxplot(mydata$creatinine_phosphokinase)
As expected there are quite a few outliers. I’ll remove them.
# for this I'll use library(dplyr)
data = mydata %>% #creating a new object
filter(creatinine_phosphokinase <= 1300) #filtering out all the values that are above 1300
boxplot(data$creatinine_phosphokinase)
Let’s look at death events together with age.
#for this I'll use the library(ggplot2)
ggplot(data, aes(DEATH_EVENT, age)) +
geom_boxplot(aes(fill=factor(DEATH_EVENT))) +
theme(axis.text.x = element_text(angle=0, vjust=0)) +
labs(title="Death events grouped by age",
subtitle="0 = no, 1 = yes",
x="Death Event",
y="age")
As expected we can see that on average people that died were older in
comparison to those who survived. In the case of people that survived we
have one “outlier”. Boxplots make it easier to see the quartile ranges,
the box is showing the values from the first to the third quartile, the
vertical line connects the minimum with the maximum and lastly the
horizontal line shows us the median value.
Let’s look at another histogram for age.
ggplot(data, aes(x = age)) +
geom_histogram(aes(y=..density..), binwidth = 5, colour = "blue", position = "identity", alpha = 0.5) +
geom_density(alpha = 0.64)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
Above we can see the distribution of age among all observations. It’s
quite normally distributed, but we could say that it’s skewed slightly
to the right.
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:car':
##
## some
map_df(data, ~sum(is.na(.x))) #we need this if we want a dataframe in map, otherwise it returns it as a list
## # A tibble: 1 × 14
## age anaemia creatini…¹ diabe…² eject…³ high_…⁴ plate…⁵ serum…⁶ serum…⁷ sex
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0
## # … with 4 more variables: smoking <int>, time <int>, DEATH_EVENT <int>,
## # GenderFactor <int>, and abbreviated variable names
## # ¹creatinine_phosphokinase, ²diabetes, ³ejection_fraction,
## # ⁴high_blood_pressure, ⁵platelets, ⁶serum_creatinine, ⁷serum_sodium
options(repr.plot.width=30, repr.plot.height=30)
data1 <- map2(.x = mydata, .y = names(mydata), .f = ~ggplot(mydata, aes(x = .x)) +
geom_histogram(fill = "navy", col = "black", alpha = 30, bins = 30, stat = "count") +
labs(x = .y) +
theme_bw())
## Warning in geom_histogram(fill = "navy", col = "black", alpha = 30, bins = 30, : Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
ggpubr::ggarrange(plotlist = data1)
After a quick Google search I managed to put this together but it
doesn’t tell me much, now I’m realising I didn’t pick out a very good
dataset (everything that I wanted to know is in binary, should’ve
thought about this before). Lost way too much time, but the binary
variables are presented quite nicely.
I’ll try something a bit more simple.
#for this I'll need library(car)
scatterplot(y= data$serum_sodium,
x= data$ejection_fraction,
ylab = "Level of serum sodium in the blood (mEq/L)",
xlab = "Percentage of blood leaving the heart at each contraction (percentage)",
smooth = FALSE)
This again doesn’t tell me much.
I just remembered that I could compare men and women.
data$DeathFactor <- factor(data$DEATH_EVENT, #transforming
levels = c(0,1),
labels = c("no death", "death"))
ggplot(data, aes(x=age, fill=GenderFactor)) + #plotting the graph
geom_histogram(position = "dodge", binwidth = 50, colour="gray") +
facet_wrap(~DeathFactor, ncol=1)
Everything looks good.
But let’s see if there is any differences when we look at other things than just age.
ggplot(data, aes(x=serum_sodium, fill=GenderFactor)) + #plotting the graph
geom_histogram(position = "dodge", binwidth = 50, colour="gray") +
facet_wrap(~DeathFactor, ncol=1)
We can see that some woman that had lower levels of sodium died, but no men. Other deaths were connected to higher levels of sodium.
I think this is it for my first homework since I’m getting quite agitated with this dataset to be honest. I hope I managed to show what are some of the things that we learned in class. If there is one thing I’ll keep in mind for HW2 it’s to look for a better dataset.