R Markdown

Importing the data into R and displaying it with the head() function

The unit of observation is a patient; the sample size is 299.

Variables description with units of measurement:

  • age: Age of the patient (years)
  • anaemia: Decrease of red blood cells or hemoglobin (Boolean)
  • creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
  • diabetes: If the patient has diabetes (Boolean)
  • ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage)
  • high_blood_pressure: If the patient has hypertension (Boolean)
  • platelets: Platelets in the blood (kiloplatelets/mL)
  • serum_creatinine: Level of serum creatinine in the blood (mg/dL)
  • serum_sodium: Level of serum sodium in the blood (mEq/L)
  • sex: Woman or man (binary; 0 = female, 1 = male)
  • smoking: If the patient smokes or not (Boolean)
  • time: Follow-up period (days)
  • DEATH_EVENT: If the patient died during the follow-up period (Boolean)
library(dplyr)  # provides %>% and mutate()

mydata <- read.csv("podatki_hw.csv")

# Recode the 0/1 indicator columns as labelled factors (original columns are kept)
mydata <- mydata %>%
  mutate(anaemia_factor = factor(anaemia, levels = c(0, 1), labels = c("No", "Yes")),
         diabetes_factor = factor(diabetes, levels = c(0, 1), labels = c("No", "Yes")),
         sex_factor = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
         smoking_factor = factor(smoking, levels = c(0, 1), labels = c("No", "Yes")),
         high_blood_pressure_factor = factor(high_blood_pressure,
                                             levels = c(0, 1), labels = c("No", "Yes")),
         DEATH_EVENT_factor = factor(DEATH_EVENT, levels = c(0, 1), labels = c("No", "Yes")))
  

head(mydata)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine
## 1  75       0                      582        0                20                   1    265000              1.9
## 2  55       0                     7861        0                38                   0    263358              1.1
## 3  65       0                      146        0                20                   0    162000              1.3
## 4  50       1                      111        0                20                   0    210000              1.9
## 5  65       1                      160        1                20                   0    327000              2.7
## 6  90       1                       47        0                40                   1    204000              2.1
##   serum_sodium sex smoking time DEATH_EVENT anaemia_factor diabetes_factor sex_factor smoking_factor
## 1          130   1       0    4           1             No              No       Male             No
## 2          136   1       0    6           1             No              No       Male             No
## 3          129   1       1    7           1             No              No       Male            Yes
## 4          137   1       0    7           1            Yes              No       Male             No
## 5          116   0       0    8           1            Yes             Yes     Female             No
## 6          132   1       1    8           1            Yes              No       Male            Yes
##   high_blood_pressure_factor DEATH_EVENT_factor
## 1                        Yes                Yes
## 2                         No                Yes
## 3                         No                Yes
## 4                         No                Yes
## 5                         No                Yes
## 6                        Yes                Yes
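The recoding above repeats the same factor() call for every No/Yes indicator. As a minimal sketch, assuming dplyr 1.0 or later, the same idea can be written once with across(). The data frame `toy` and the vector `yes_no_cols` are hypothetical names used only for illustration, and the values are made up.

```r
library(dplyr)

# Toy stand-in for mydata with two of the 0/1 indicator columns (made-up values)
toy <- data.frame(anaemia = c(0, 1, 1), smoking = c(1, 0, 0))

# Columns that share the No/Yes coding (sex is excluded: it uses Female/Male labels)
yes_no_cols <- c("anaemia", "smoking")

# Add one <name>_factor column per indicator, keeping the original 0/1 columns
toy <- toy %>%
  mutate(across(all_of(yes_no_cols),
                ~ factor(.x, levels = c(0, 1), labels = c("No", "Yes")),
                .names = "{.col}_factor"))

toy
```

The `.names` pattern reproduces the `_factor` suffix used in the document, so downstream code that refers to `anaemia_factor` or `smoking_factor` would not need to change.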

The main goal of the data analysis:

The main goal of the analysis is to find out whether and how different factors lead to heart failure and to death due to heart failure. Most cardiovascular diseases can be prevented by addressing risk factors such as diabetes and hypertension (high blood pressure), and risk also rises with age. Early detection is therefore very important if patients are to avoid a potential heart attack.

The authors of this data set claim that the survival of patients with heart failure can be predicted from serum creatinine and ejection fraction alone.

Some more manipulations

In this sample, a larger share of patients with anaemia died during follow-up (35.7%) than of patients without it (29.4%).

 mydata %>% dplyr::group_by(anaemia_factor) %>%  dplyr::summarise(Deaths_average = mean(DEATH_EVENT))
## # A tibble: 2 × 2
##   anaemia_factor Deaths_average
##   <fct>                   <dbl>
## 1 No                      0.294
## 2 Yes                     0.357

On average, patients with heart failure in this sample are about 60 years old (mean 60.83, median 60).

Normal creatinine phosphokinase (CPK) values are 10 to 120 micrograms per liter. In our sample CPK is clearly elevated: the mean is 581.84 mcg/L and the median is 250 mcg/L. The values also vary widely, with a standard deviation of 970.29.

The ejection fraction of a healthy person should be around 60%. As the table below shows, patients with heart failure in our sample have a mean ejection fraction of only about 38%.

A normal platelet count in adults ranges from 150,000 to 450,000 platelets per microliter of blood. Looking at the platelet counts in our sample, most patients had normal values.

Increased creatinine levels during hospitalization are a marker of poor cardiac output, leading to diminished renal blood flow and reduced ability to tolerate inpatient heart failure treatment. The typical range for serum creatinine is 0.74 to 1.35 mg/dL for adult men and 0.59 to 1.04 mg/dL for adult women.

round(stat.desc(mydata[,c(1,3,5,7,8)]),2)
##                   age creatinine_phosphokinase ejection_fraction    platelets serum_creatinine
## nbr.val        299.00                   299.00            299.00 2.990000e+02           299.00
## nbr.null         0.00                     0.00              0.00 0.000000e+00             0.00
## nbr.na           0.00                     0.00              0.00 0.000000e+00             0.00
## min             40.00                    23.00             14.00 2.510000e+04             0.50
## max             95.00                  7861.00             80.00 8.500000e+05             9.40
## range           55.00                  7838.00             66.00 8.249000e+05             8.90
## sum          18189.33                173970.00          11387.00 7.874405e+07           416.77
## median          60.00                   250.00             38.00 2.620000e+05             1.10
## mean            60.83                   581.84             38.08 2.633580e+05             1.39
## SE.mean          0.69                    56.11              0.68 5.656170e+03             0.06
## CI.mean.0.95     1.35                   110.43              1.35 1.113109e+04             0.12
## var            141.49                941458.57            140.06 9.565669e+09             1.07
## std.dev         11.89                   970.29             11.83 9.780424e+04             1.03
## coef.var         0.20                     1.67              0.31 3.700000e-01             0.74
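The comparisons with reference ranges above can also be made per patient rather than only through the mean. The sketch below computes the share of values inside a reference interval. `share_in_range` is a hypothetical helper, and the six CPK values are the ones visible in the head() output, not the whole sample.

```r
# Share of values inside the closed reference interval [lower, upper]
share_in_range <- function(x, lower, upper) {
  mean(x >= lower & x <= upper, na.rm = TRUE)
}

# creatinine_phosphokinase for the first six patients shown by head(mydata)
cpk_first_six <- c(582, 7861, 146, 111, 160, 47)

# Reference range for CPK quoted in the text: 10 to 120 mcg/L
share_cpk_normal <- share_in_range(cpk_first_six, lower = 10, upper = 120)
share_cpk_normal
```

Applied to the full column, for example `share_in_range(mydata$creatinine_phosphokinase, 10, 120)`, the same call would give the proportion of all 299 patients within the normal range.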

Patients in the sample are on average 60.83 years old. The median age is 60 years, and the standard deviation is 11.89 years.

round(mean(mydata[ , 1]), 2)
## [1] 60.83
round(median(mydata[ , 1]), 2)
## [1] 60
round(sd(mydata[ , 1]), 2)
## [1] 11.89
# mydata %>%  summary()
# get_summary_stats(mydata)

Looking more closely at creatinine levels, the means for both men (1.40 mg/dL) and women (1.38 mg/dL) are above the recommended ranges. The maximum values (9.4 and 9 mg/dL) are especially critical. There is almost no difference between men’s and women’s values.

mydata%>% group_by(sex_factor) %>% dplyr::summarise(Mean=mean(serum_creatinine), Max=max(serum_creatinine), Min=min(serum_creatinine), Median=median(serum_creatinine), Std=sd(serum_creatinine))
## # A tibble: 2 × 6
##   sex_factor  Mean   Max   Min Median   Std
##   <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 Female      1.38   9     0.5    1   1.12 
## 2 Male        1.40   9.4   0.6    1.1 0.989

Looking more closely at ejection fraction, women have a higher ejection fraction than men: both the mean (40.5% versus 36.8%) and the median (38% versus 35%) are higher for women. Even so, these values are roughly 20 percentage points lower than they should be. The summary does not by itself show whether this difference translates into lower mortality for women. The standard deviations are similar for both sexes (12.7 and 11.1). The maximum value for females (80%) is notably high. Note that the Min column in the output below is computed from serum_creatinine rather than ejection_fraction, so 0.5 and 0.6 are creatinine minima and not ejection fractions; the lowest ejection fraction in the whole sample is 14% (see the descriptive statistics above).

mydata%>% group_by(sex_factor) %>% dplyr::summarise(Mean=mean(ejection_fraction), Max=max(ejection_fraction), Min=min(serum_creatinine), Median=median(ejection_fraction), Std=sd(ejection_fraction))
## # A tibble: 2 × 6
##   sex_factor  Mean   Max   Min Median   Std
##   <fct>      <dbl> <int> <dbl>  <dbl> <dbl>
## 1 Female      40.5    80   0.5     38  12.7
## 2 Male        36.8    62   0.6     35  11.1
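The summarise() call above takes Min from serum_creatinine while every other column uses ejection_fraction. A minimal sketch of the same summary with all columns drawn from ejection_fraction is shown here on a made-up data frame `toy` (a hypothetical stand-in for mydata), so no output for the real data is implied.

```r
library(dplyr)

# Toy stand-in for mydata; values are invented for illustration only
toy <- data.frame(
  sex_factor        = factor(c("Female", "Female", "Male", "Male")),
  ejection_fraction = c(25, 60, 20, 38),
  serum_creatinine  = c(0.9, 1.2, 1.1, 2.0)
)

ef_by_sex <- toy %>%
  group_by(sex_factor) %>%
  summarise(Mean   = mean(ejection_fraction),
            Max    = max(ejection_fraction),
            Min    = min(ejection_fraction),  # same variable as the other columns
            Median = median(ejection_fraction),
            Std    = sd(ejection_fraction))

ef_by_sex
```

Running the same pipeline on mydata would give per-sex minimum ejection fractions that can be compared directly with the overall minimum reported by stat.desc().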

Next we look at how many men with serum creatinine outside the recommended range (0.74 to 1.35 mg/dL) died during follow-up. About 42% of men with creatinine outside the range died, while only 26% of men with levels inside the range died. Note that the first filter selects values both below and above the range, not only elevated ones.

mydata %>%  filter(!(0.74<serum_creatinine & serum_creatinine < 1.35) , sex_factor == "Male") %>% 
  select(DEATH_EVENT) %>%  pull() %>%  mean() 
## [1] 0.4225352
mydata %>%  filter( 0.74< serum_creatinine & serum_creatinine < 1.35 , sex_factor == "Male") %>% 
  select(DEATH_EVENT) %>%  pull() %>%  mean() 
## [1] 0.2601626
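The first filter above groups values below and above the reference range together. The sketch below separates them into three bands and computes the death rate in each. `creatinine_band` is a hypothetical helper, the limits are the ones for adult men quoted in the text, and the data are made up.

```r
# Classify serum creatinine against the reference range for adult men.
# Strict inequalities mirror the filter above: "within" means lower < x < upper.
creatinine_band <- function(x, lower = 0.74, upper = 1.35) {
  band <- ifelse(x <= lower, "below", ifelse(x >= upper, "above", "within"))
  factor(band, levels = c("below", "within", "above"))
}

# Toy data (invented values, not taken from podatki_hw.csv)
toy <- data.frame(
  serum_creatinine = c(0.6, 0.9, 1.1, 1.9, 2.7, 1.3),
  DEATH_EVENT      = c(0,   0,   1,   1,   1,   0)
)
toy$band <- creatinine_band(toy$serum_creatinine)

# The mean of a 0/1 indicator is the share of patients who died in each band
death_rate_by_band <- tapply(toy$DEATH_EVENT, toy$band, mean)
death_rate_by_band
```

On the real data the same helper could be applied after filtering to `sex_factor == "Male"`, which would show how much of the 42% comes from elevated values and how much from low ones.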

In this sample, the share of patients who died is 28 percentage points lower among patients younger than 45 than among patients aged 45 or older.

mydata %>%  filter(age >= 45 ) %>% 
  select(DEATH_EVENT) %>%  pull() %>%  mean() - mydata %>%  filter( age < 45 ) %>% 
  select(DEATH_EVENT) %>%  pull() %>%  mean()
## [1] 0.2825227
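The difference above is a difference between two death rates, so it is measured in percentage points. A minimal base R sketch of the same calculation, which also keeps the two group rates visible, is given below on made-up data; `toy` is a hypothetical stand-in for mydata.

```r
# Toy data (invented values): age in years and a 0/1 death indicator
toy <- data.frame(
  age         = c(40, 42, 50, 65, 70, 90),
  DEATH_EVENT = c(0,  0,  0,  1,  1,  1)
)

# Death rate in each age group; names are "FALSE" (45 or older) and "TRUE" (under 45)
rates <- tapply(toy$DEATH_EVENT, toy$age < 45, mean)

# Difference between the two rates, expressed in percentage points
diff_pp <- 100 * (rates[["FALSE"]] - rates[["TRUE"]])

rates
diff_pp
```

Reporting both rates next to the difference makes it clear that the result compares proportions and is not a relative change in risk.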

More sample statistics

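Patients with heart failure who smoke have a slightly lower percentage of blood leaving the heart at each contraction (36.9% versus 38.6% for non-smokers; normal is around 60%). The 95% confidence intervals of the two groups overlap, so the difference is small relative to the sampling uncertainty.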

Statistics <- summarySE(mydata, 
              measurevar="ejection_fraction", 
              groupvars=c("smoking_factor"), 
              conf.interval=0.95)
Statistics
##   smoking_factor   N ejection_fraction       sd       se       ci
## 1             No 203          38.63054 12.15520 0.853128 1.682179
## 2            Yes  96          36.92708 11.09978 1.132867 2.249025
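The se and ci columns above can be reproduced by hand, assuming summarySE() uses the usual formulas: the standard error is sd divided by the square root of N, and ci is the half-width of a t-based confidence interval. The helper `mean_ci` below is a hypothetical illustration of those formulas, not the package's own code.

```r
# Mean with standard error and t-based confidence-interval half-width
mean_ci <- function(x, conf = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                           # standard error of the mean
  ci <- qt(1 - (1 - conf) / 2, df = n - 1) * se   # half-width of the interval
  c(N = n, mean = mean(x), sd = sd(x), se = se, ci = ci)
}

# Small made-up vector, only to show the call
res <- mean_ci(c(1, 2, 3, 4, 5))
res

# Check against the printed row for non-smokers: sd = 12.15520, N = 203
se_no_smoking <- 12.15520 / sqrt(203)
ci_no_smoking <- qt(0.975, df = 203 - 1) * se_no_smoking
c(se = se_no_smoking, ci = ci_no_smoking)
```

The interval for a group is its mean plus or minus ci, which is how the overlap between smokers and non-smokers can be read from the table.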

Graphs

As expected, patients aged around 50, 60, and 70 are the most numerous, and the majority of patients are men.

ggplot(mydata , aes(x=age, color = sex_factor)) +
  geom_histogram(fill="white", binwidth = 1, position="dodge")+
   ylab("Number")+
   xlab("Age")+
  theme_classic() + 
  labs(colour="Gender", title= "Number of heart failures by gender and age")

As expected, the lower the percentage of blood leaving the heart at each contraction, the more likely a patient with heart failure is to die. There is also a difference between men and women.

ggplot(mydata, aes(x=DEATH_EVENT_factor, y= ejection_fraction, fill=sex_factor))+
  geom_boxplot() + scale_fill_brewer(palette="Dark2") + theme_classic() +
  labs(title="Percentage of blood leaving the heart at each contraction by \n death event and gender",x="Death event", y = "Percentage of blood leaving the heart at each contraction", fill="Gender")

As expected, patients who died tended to have elevated creatinine levels, while patients with normal creatinine levels died less often. The graph is consistent with the authors’ hypothesis that death events can be predicted from patients’ creatinine levels.

ggplot(mydata, aes(x=age, y= serum_creatinine, color=DEATH_EVENT_factor, shape=DEATH_EVENT_factor))+
  geom_point() + scale_color_brewer(palette="Dark2") + theme_classic() +
  geom_smooth(formula = y ~ x, method=lm, se=FALSE)+
  geom_hline(yintercept=1.34, linetype="dashed", color = "black") +
  annotate("text", x = 90, y = 1.52, label = "Max normal level") +
  labs(title="Levels of serum creatinine by age and event of death",x="Age", y = "Levels of serum creatinine", color="Death of patient", shape="Death of patient")

The panel in the first row and second column shows a positive relationship between age and serum creatinine level.

scatterplotMatrix(mydata[ , c(1,8,5)], 
                  smooth = FALSE) 
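A scatterplot matrix shows the direction of a relationship but not its strength. As a small sketch, a correlation coefficient can be computed with base R's cor(). The six value pairs below are the age and serum_creatinine entries visible in the head() output, so the result describes only those six rows and not the full sample.

```r
# age and serum_creatinine for the first six patients shown by head(mydata)
age_first_six        <- c(75, 55, 65, 50, 65, 90)
creatinine_first_six <- c(1.9, 1.1, 1.3, 1.9, 2.7, 2.1)

# Pearson correlation measures linear association
r_pearson <- cor(age_first_six, creatinine_first_six, method = "pearson")

# Spearman correlation uses ranks and is less sensitive to extreme values
r_spearman <- cor(age_first_six, creatinine_first_six, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

For the whole data set the call would be `cor(mydata$age, mydata$serum_creatinine)`. Because serum creatinine has a few very large values, the rank-based version may be the more robust summary.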

The majority of patients have normal creatinine levels, but as seen before, those with higher creatinine levels have higher mortality. Most of the elevated values lie only slightly above the upper limit of the normal range.

ggplot(mydata, aes(x=serum_creatinine)) +
  geom_histogram(binwidth = 0.1, colour="gray") +
  facet_wrap(~sex_factor, ncol = 1) + 
  ylab("Frequency")+
  theme_bw() +
  xlim(0, 5)