Load Packages

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(ggplot2)
library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(RColorBrewer)

Abstract

Stroke is one type of cardiovascular disease. There are two types of stroke: ischaemic stroke and hemorrhagic stroke. According to the World Health Organization, stroke is the second leading cause of death and the third leading cause of disability. The latest global mortality estimates show that 10.2% of total deaths were caused by stroke in the year 2016 and 9.9% for the year 2000 (Global Health Estimates 2016). There are 15 million people who suffer from stroke worldwide every year. Among these 15 million people, 5 million died and another 5 million became permanently disabled (Stroke, Cerebrovascular accident). The risk factors for stroke are high blood pressure, high cholesterol, overweight & obesity, tobacco, diabetes, physical inactivity, unhealthy diet, harmful use of alcohol, kidney disease, etc (Cardiovascular Disease Inforgraphic).

The dataset I’m using is extracted from kaggle: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset. The object of this analysis is to explore how each risk factor impacts the risk of stroke and the possible relationship among factors. The analysis will carry out by answering some general questions about the topic and variables in the dataset. The first part will be exploratory data analysis based on visualizations and the second part will be logistic regression analysis.

Import Dataset

setwd("~/School/MC/DATA 110/Datasets")
data <- read_csv("healthcare-dataset-stroke-data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   id = col_double(),
##   gender = col_character(),
##   age = col_double(),
##   hypertension = col_double(),
##   heart_disease = col_double(),
##   ever_married = col_character(),
##   work_type = col_character(),
##   Residence_type = col_character(),
##   avg_glucose_level = col_double(),
##   bmi = col_character(),
##   smoking_status = col_character(),
##   stroke = col_double()
## )

Variables in this Dataset

id: unique identifier for each subject

gender: male, female, or other

age: 0-82 years old

hypertension: 0 if the subject doesn’t have hypertension, 1 if the subject has hypertension

heart_disease: 0 if the subject doesn’t have any heart diesease, 1 if the subject has a heart disease

ever_married: Yes or No

work_type: children, govt_job, never_worked, private, or self-employed

residence_type: urban or rural

avg_glucose_level: average glucose level in blood; range: 55-272

bmi: body mass index; range: 10-98

smoking_status: formerly smoked, never smoked, smokes, unknown

stroke: 0 if subject never had a stroke, 1 if the subject suffered a stroke before

Analysis

# Use mutate to change 0 values to "No" and 1 values to "Yes" so that visualizations are easier to read
df <- data %>%
  mutate(hypertension = ifelse(hypertension == 0, "No","Yes")) %>%
  mutate(heart_disease = ifelse(heart_disease == 0, "No","Yes")) %>%
  mutate(stroke = ifelse(stroke == 0, "No","Yes"))
df$bmi <- as.numeric(df$bmi)

## Warning: NAs introduced by coercion

See the summary of the dataset

summary(df)

##        id           gender               age        hypertension      
##  Min.   :   67   Length:5110        Min.   : 0.08   Length:5110       
##  1st Qu.:17741   Class :character   1st Qu.:25.00   Class :character  
##  Median :36932   Mode  :character   Median :45.00   Mode  :character  
##  Mean   :36518                      Mean   :43.23                     
##  3rd Qu.:54682                      3rd Qu.:61.00                     
##  Max.   :72940                      Max.   :82.00                     
##                                                                       
##  heart_disease      ever_married        work_type         Residence_type    
##  Length:5110        Length:5110        Length:5110        Length:5110       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  avg_glucose_level      bmi        smoking_status        stroke         
##  Min.   : 55.12    Min.   :10.30   Length:5110        Length:5110       
##  1st Qu.: 77.25    1st Qu.:23.50   Class :character   Class :character  
##  Median : 91.89    Median :28.10   Mode  :character   Mode  :character  
##  Mean   :106.15    Mean   :28.89                                        
##  3rd Qu.:114.09    3rd Qu.:33.10                                        
##  Max.   :271.74    Max.   :97.60                                        
##                    NA's   :201

Age Density Analysis

df %>%
  ggplot(aes(x = age)) +
  geom_density() +
  geom_vline(aes(xintercept=mean(age)), color = "blue", linetype = "dashed") +
  labs(title = "Age Density", x = "Age", y = "Density") +
  theme_classic()

df %>%
  ggplot(aes(x = age, fill = gender)) +
  geom_density(alpha = 0.4) +
  geom_vline(aes(xintercept=mean(age)), color = "blue", linetype = "dashed") +
  labs(title = "Age Density", x = "Age", y = "Density") +
  theme_classic()

## Warning: Groups with fewer than two data points have been dropped.

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf

This study involved 5,110 people including 2994 females and 2115 males. The density plot shows the age distribution. Overall, more observations are concentrated between 40 and 60 years old. There are differences in the age distribution for female and male subjects. The age distribution of female subjects is closer to normal distribution while the age distribution of male subjects is slightly skewed to the right.

Analysis Based on Visualization

1. Do female or male have more stroke?

df_1 <- df %>%
  group_by(gender, stroke) %>%
  count(stroke)
df_1

## # A tibble: 5 x 3
## # Groups:   gender, stroke [5]
##   gender stroke     n
##   <chr>  <chr>  <int>
## 1 Female No      2853
## 2 Female Yes      141
## 3 Male   No      2007
## 4 Male   Yes      108
## 5 Other  No         1

df_1 %>%
ggplot(mapping = aes(x = gender, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Percentage of Female Group vs. Male Group", 
       x = "Gender", 
       y = "Percentage") +
  scale_fill_brewer(palette = "Paired")

In this dataset, the total number of female subjects is more than the total number of male subjects. The percentage of subjects who previously suffered a stroke in the male group is slightly higher than the female group. There’s only one observation in the Other gender group and that subject does not have stroke experience. Therefore, no conclusion can be drawn for the Other group.

According to Harvard Health Publishing, women have a higher risk of suffering stroke. Stroke events happen more among women than men each year. This can be explained by many factors. Women have longer average life span and the risk of experiencing stroke increases when people become older. In addition, reproductive health, pregnancy, childbirth, and imbalanced hormones are all considered risk factors (Heart attack and stroke: Men vs. women). In this dataset, we can’t see strong evidence to support that women have a higher risk of suffering stroke.

2. Which age group of people are more likely to get stroke?

df_2 <- df %>%
  group_by(age, stroke) %>%
  count(stroke)
df_2

## # A tibble: 148 x 3
## # Groups:   age, stroke [148]
##      age stroke     n
##    <dbl> <chr>  <int>
##  1  0.08 No         2
##  2  0.16 No         3
##  3  0.24 No         5
##  4  0.32 No         5
##  5  0.4  No         2
##  6  0.48 No         3
##  7  0.56 No         5
##  8  0.64 No         4
##  9  0.72 No         5
## 10  0.8  No         4
## # ... with 138 more rows

df_2 %>%
  #filter(stroke == "Yes") %>%
  ggplot(mapping = aes(x = age, y = n, color = stroke)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "loess", span = 0.3) +
  labs(title = "Age Distribution of Stoke Event",
       x = "Age", 
       y = "Number of Observations") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

# 
df_2 %>%
  filter(stroke == "Yes") %>%
  ggplot(mapping = aes(x = age, y = n)) +
  geom_point(alpha=0.5) +
  geom_smooth(method = "loess", span = 0.3) +
  labs(title = "Stoke Event Distribution by Age",
       x = "Age", 
       y = "Number of Observations") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

Though the majority of subjects didn’t experience stroke, it can happen at any age. Among subjects who have experienced a stroke, age varies from 1 to 80 years old. As we can see from the above plot, the number of observations at each age remains under 5 before 55 years old. There are two peaks around the age of 60 and 80. According to an online publication from NIH, the risk of stroke doubles for each decade between the age of 55 and 85. Although people consider stroke as a disease of aging, the risk of stroke in childhood can be very high during the perinatal period (Brain Basics: Preventing Stroke).

3. How does hypertension influence the likelihood of getting a stroke?

df_3 <- df %>%
  group_by(hypertension, stroke) %>%
  count(stroke)
df_3

## # A tibble: 4 x 3
## # Groups:   hypertension, stroke [4]
##   hypertension stroke     n
##   <chr>        <chr>  <int>
## 1 No           No      4429
## 2 No           Yes      183
## 3 Yes          No       432
## 4 Yes          Yes       66

df_3

## # A tibble: 4 x 3
## # Groups:   hypertension, stroke [4]
##   hypertension stroke     n
##   <chr>        <chr>  <int>
## 1 No           No      4429
## 2 No           Yes      183
## 3 Yes          No       432
## 4 Yes          Yes       66

df_3 %>%
ggplot(mapping = aes(x = hypertension, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Percentage of subjects with Hypertension vs. without Hypertension", 
       x = "Hypertension", 
       y = "Percentage") +
  scale_fill_brewer(palette = "Paired")

The number of subjects without hypertension is way more than subjects with hypertension in this dataset. The percentage of subjects getting a stroke with hypertension is higher than the percentage of subjects getting a stroke without hypertension. Therefore, this dataset proved the fact that hypertension is a significant risk factor for stroke. In fact, it is the most important risk factor for stroke. Hypertension tends to run in the family. If one’s family member has hypertension, one has a higher risk of having it. However, hypertension can be prevented or controlled by medicine and a healthy lifestyle.

4. Is there a relationship between hypertension and age?

df_4 <- df %>% 
  group_by(age, hypertension) %>%
  count(hypertension)
df_4

## # A tibble: 164 x 3
## # Groups:   age, hypertension [164]
##      age hypertension     n
##    <dbl> <chr>        <int>
##  1  0.08 No               2
##  2  0.16 No               3
##  3  0.24 No               5
##  4  0.32 No               5
##  5  0.4  No               2
##  6  0.48 No               3
##  7  0.56 No               5
##  8  0.64 No               4
##  9  0.72 No               5
## 10  0.8  No               4
## # ... with 154 more rows

df_4 %>%
  filter(hypertension == "Yes") %>%
  ggplot(mapping = aes(x = age, y = n )) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", span = 0.3) +
  labs(title = "With Hypertension vs. Without Hypertension", 
       subtitle = "Age Distribution", 
       x = "Age", 
       y = "Number of Observation") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

Hypertension can be developed at any age though the elderly are more likely to develop hypertension. Hypertension is age-related because the vascular system changes as we get older. Our arteries get stiffer, therefore blood pressure goes up. As we can see from the plot above, it’s very obvious that the number of observations with hypertension jumps up to the next level after the age of 50.

5. Do subjects previously diagnosed heart disease have higher risk of getting a stroke?

df_5 <- df %>%
  group_by(heart_disease, stroke) %>%
  count(stroke)
df_5

## # A tibble: 4 x 3
## # Groups:   heart_disease, stroke [4]
##   heart_disease stroke     n
##   <chr>         <chr>  <int>
## 1 No            No      4632
## 2 No            Yes      202
## 3 Yes           No       229
## 4 Yes           Yes       47

df_5 %>%
ggplot(mapping = aes(x = heart_disease, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Percentage of subjects with Heart Disease vs. without Heart Disease", 
       x = "Heart Disease", 
       y = "Percentage") +
  scale_fill_brewer(palette = "Paired")

Subjects previously diagnosed with heart disease have a higher risk of experiencing stroke than subjects without heart disease. According to NIH’s online publication, heart diseases can cause blood clots that may break loose or block vessels in or leading to the brain.

6. Do married subjects have higher risk of getting a stroke?

df_6 <- df %>%
  group_by(ever_married, stroke) %>%
  count(stroke)
df_6

## # A tibble: 4 x 3
## # Groups:   ever_married, stroke [4]
##   ever_married stroke     n
##   <chr>        <chr>  <int>
## 1 No           No      1728
## 2 No           Yes       29
## 3 Yes          No      3133
## 4 Yes          Yes      220

df_6 %>%
ggplot(mapping = aes(x = ever_married, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Married subjects vs. Unmarried subjects", 
       x = "Ever Married", 
       y = "Percentage") +
  scale_fill_brewer(palette = "Paired")

Based on the chart above, it seems that married subjects have a higher risk of getting a stroke. However, let’s look at the age distribution of married subjects and unmarried subjects.

df %>%
  ggplot(mapping = aes(x = ever_married, y = age)) +
  geom_boxplot() +
  labs(title = "Age Distribution - Married vs. Unmarried", 
       x = "Ever Married", 
       y = "Age") +
  theme_minimal()

df %>%
  ggplot(mapping = aes(x = ever_married, y = age, fill = stroke)) +
  geom_boxplot() +
  labs(title = "Age Distribution - Married vs. Unmarried", 
       x = "Ever Married", 
       y = "Age") +
  theme_minimal() +
  scale_fill_brewer(palette = "Paired")

As we see in the first plot, the overall median age of unmarried subjects is significantly younger than married subjects in this dataset, though there are outliers. In the second plot, the age distribution of subjects who experienced stroke is very similar no matter the subject is ever married or not. The median age is around 70. There’s no strong evidence from this dataset to support that marital status has a direct impact on the risk of getting a stroke.

7. Which work type has higher risk of getting a stroke?

df_7 <- df %>%
  group_by(work_type, stroke) %>%
  count(stroke)
df_7

## # A tibble: 9 x 3
## # Groups:   work_type, stroke [9]
##   work_type     stroke     n
##   <chr>         <chr>  <int>
## 1 children      No       685
## 2 children      Yes        2
## 3 Govt_job      No       624
## 4 Govt_job      Yes       33
## 5 Never_worked  No        22
## 6 Private       No      2776
## 7 Private       Yes      149
## 8 Self-employed No       754
## 9 Self-employed Yes       65

df_7 %>%
ggplot(mapping = aes(x = work_type, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Which Work Type Has Higher Risk of Getting a Stroke?", 
       x = "Work Type", 
       y = "Percentage") + 
  theme_minimal() + 
  scale_fill_brewer(palette = "Paired")

df %>%
  ggplot(mapping = aes(x = work_type, y = age)) +
  geom_boxplot() +
  labs(title = "Age Distribution of Work Type", 
       x = "Work Type", 
       y = "Age") +
  theme_minimal()

For children and subjects who never worked, the median age is under 20 years old and their risk of getting stroke is low because of the age. For government job, private, and self-employed groups, their median age is around 50, 45, and 61 respectively. The higher percentage could result from aging. There’s no strong evidence to support that work type has an impact on the risk of getting a stroke.

8. Which residence type has higher risk of getting stroke?

df_8 <- df %>%
  group_by(Residence_type, stroke) %>%                                      
  count(stroke)
df_8

## # A tibble: 4 x 3
## # Groups:   Residence_type, stroke [4]
##   Residence_type stroke     n
##   <chr>          <chr>  <int>
## 1 Rural          No      2400
## 2 Rural          Yes      114
## 3 Urban          No      2461
## 4 Urban          Yes      135

df_8 %>%
ggplot(mapping = aes(x = Residence_type, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Does residence type have a great impact on risk of getting stroke?", 
       x = "Residence Type", 
       y = "Percentage") +
  theme_minimal() +
  scale_fill_brewer(palette = "Paired")

It seems that residence type does not have an impact on the risk of getting a stroke.

9. Do subjects with higher average glucose level have higher risk of getting a stroke?

df %>%
  ggplot(mapping = aes(x = stroke, y = avg_glucose_level, fill = stroke)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of Average Glucose Level", 
       x = "Stroke", 
       y = "Average Glucose Level") +
  scale_fill_brewer(palette = "Paired")

High glucose level is an indicator of diabetes. According to NIH’s online publication, in terms of stroke and cardiovascular disease, having diabetes is the equivalent of aging 15 years old. Diabetes not only affects our body’s ability to control the level of blood sugar but also causes destructive changes in the blood vessels throughout the body. If the glucose level is high at the time of the stroke, the damage to our body is more severe and extensive than when the glucose level is well-controlled. From the boxplot, we can see that people who experienced stroke have a higher average glucose level than people who never experienced a stroke.

cols <- brewer.pal(2, "Paired")

## Warning in brewer.pal(2, "Paired"): minimal value for n is 3, returning requested palette with 3 different levels

This is another visualization of average glucose level. According to CDC, A normal blood pressure level is less than 120/80 mmHg.

highchart() %>%
  hc_add_series(data = df,
                   type = "point",
                   hcaes(x = age,
                   y = avg_glucose_level, 
                   group = stroke)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Age")) %>%
  hc_yAxis(title = list(text="Average Glucose Level")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
  hc_legend(align = "right", 
            verticalAlign = "top")

10. Does bmi have impact on the risk of getting a stroke?

df_10 <- df %>%
  group_by(bmi, stroke, gender) %>%
  count(stroke)
df_10

## # A tibble: 883 x 4
## # Groups:   bmi, stroke, gender [883]
##      bmi stroke gender     n
##    <dbl> <chr>  <chr>  <int>
##  1  10.3 No     Female     1
##  2  11.3 No     Female     1
##  3  11.5 No     Male       1
##  4  12   No     Female     1
##  5  12.3 No     Female     1
##  6  12.8 No     Female     1
##  7  13   No     Male       1
##  8  13.2 No     Female     1
##  9  13.3 No     Female     1
## 10  13.4 No     Male       1
## # ... with 873 more rows

df_10 %>%
  ggplot(mapping = aes(x = stroke, y = bmi, fill = stroke)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of BMI", 
       x = "Stroke", 
       y = "BMI") +
  scale_fill_brewer(palette = "Paired")

## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

df_10 %>%
  ggplot(mapping = aes(x = stroke, y = bmi, fill = stroke, color = gender)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of BMI", 
       x = "Stroke", 
       y = "BMI") +
  scale_fill_brewer(palette = "Paired")

## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

According to CDC, Body Mass Index (BMI) is a person’s weight in kilograms divided by the square of height in meters. A high BMI can be an indicator of high body fatness and may lead to health problems. Some studies have suggested that a high body mass index (BMI) may increase the risk of stroke. From the boxplot above, subjects who experience a stroke have lower BMI than subjects who didn’t experienced stroke. we can’t find strong evidence to prove that BMI has a direct impact on the risk of stroke.

11. Does smoking increase the likelihood of getting a stroke?

df_11 <- df %>%
  group_by(smoking_status, stroke) %>%
  count(stroke)
df_11

## # A tibble: 8 x 3
## # Groups:   smoking_status, stroke [8]
##   smoking_status  stroke     n
##   <chr>           <chr>  <int>
## 1 formerly smoked No       815
## 2 formerly smoked Yes       70
## 3 never smoked    No      1802
## 4 never smoked    Yes       90
## 5 smokes          No       747
## 6 smokes          Yes       42
## 7 Unknown         No      1497
## 8 Unknown         Yes       47

df_11 %>%
ggplot(mapping = aes(x = smoking_status, y = n, fill = stroke) ) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Does the smoking status impact the risk of stroke? ", 
       x = "Smoking Status", 
       y = "Percentage") +
  theme_minimal() +
  scale_fill_brewer(palette = "Paired")

We can see that smoking status does impact the risk of getting a stroke. Smoking causes the buildup of fatty substances (atherosclerosis) in the carotid artery, which is the main artery to supply blood to the brain. In addition, nicotine can cause high blood pressure. The carbon monoxide produced from smoking reduces the amount of oxygen in the blood supply to the brain. Smoking also thickens the blood and makes it easier to clot.

Logistic Regression Analysis

library(DataExplorer)

Prepare the data

df_logistic <- df %>%
  mutate(stroke = ifelse(stroke == "Yes", 1, 0)) %>%
  mutate(hypertension = ifelse(hypertension == "Yes", 1, 0)) %>%     # Converting stroke, hypertension, heart disease into 0s and 1s
  mutate(heart_disease = ifelse(heart_disease == "Yes", 1, 0)) %>%
  filter(gender != "Other")                                      # filter out the only "Other" observation in gender group

# set categorical variables as factor
df_logistic$gender <- as.factor(df_logistic$gender)
df_logistic$hypertension <- as.factor(df_logistic$hypertension)
df_logistic$heart_disease <- as.factor(df_logistic$heart_disease)
df_logistic$smoking_status <- as.factor(df_logistic$smoking_status)
str(df_logistic)

## spec_tbl_df [5,109 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id               : num [1:5109] 9046 51676 31112 60182 1665 ...
##  $ gender           : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 1 1 ...
##  $ age              : num [1:5109] 67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
##  $ heart_disease    : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...
##  $ ever_married     : chr [1:5109] "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr [1:5109] "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr [1:5109] "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num [1:5109] 229 202 106 171 174 ...
##  $ bmi              : num [1:5109] 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
##  $ smoking_status   : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
##  $ stroke           : num [1:5109] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   hypertension = col_double(),
##   ..   heart_disease = col_double(),
##   ..   ever_married = col_character(),
##   ..   work_type = col_character(),
##   ..   Residence_type = col_character(),
##   ..   avg_glucose_level = col_double(),
##   ..   bmi = col_character(),
##   ..   smoking_status = col_character(),
##   ..   stroke = col_double()
##   .. )

Fitting a generalized linear model with numeric variables only

fit1 <- glm(stroke ~ age + avg_glucose_level + bmi , data = df_logistic, family = "binomial")
summary(fit1)

## 
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + bmi, family = "binomial", 
##     data = df_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9845  -0.3029  -0.1626  -0.0752   3.6180  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.042725   0.538866 -14.925  < 2e-16 ***
## age                0.072264   0.005532  13.063  < 2e-16 ***
## avg_glucose_level  0.005455   0.001263   4.320 1.56e-05 ***
## bmi                0.005555   0.011558   0.481    0.631    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1728.3  on 4907  degrees of freedom
## Residual deviance: 1387.7  on 4904  degrees of freedom
##   (201 observations deleted due to missingness)
## AIC: 1395.7
## 
## Number of Fisher Scoring iterations: 7

The coefficient is positive for both age and average glucose level. For every one unit change in age, the log odds of getting stroke increases by 0.073. For every one unit change in average glucose level, the log odds of getting stroke increases by 0.005. Both age and average glucose level are statistically significant.

The coefficient for BMI is positive 0.009. This means people with higher BMI have a higher risk of getting a stroke. However, BMI is not statistically significant. We will exclude BMI from the logistic regression.

Fitting a generalized linear model with categorical variables only

fit2 <- glm(stroke ~ factor(gender) + factor(hypertension) + factor(heart_disease) + factor(smoking_status), family = "binomial", data = df_logistic)
summary(fit2)

## 
## Call:
## glm(formula = stroke ~ factor(gender) + factor(hypertension) + 
##     factor(heart_disease) + factor(smoking_status), family = "binomial", 
##     data = df_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9883  -0.2760  -0.2680  -0.2338   2.7013  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -2.8514     0.1514 -18.838  < 2e-16 ***
## factor(gender)Male                  -0.0357     0.1357  -0.263 0.792498    
## factor(hypertension)1                1.0861     0.1588   6.838 8.02e-12 ***
## factor(heart_disease)1               1.3027     0.1839   7.083 1.41e-12 ***
## factor(smoking_status)never smoked  -0.4575     0.1694  -2.701 0.006922 ** 
## factor(smoking_status)smokes        -0.3971     0.2061  -1.927 0.053958 .  
## factor(smoking_status)Unknown       -0.7350     0.1992  -3.689 0.000225 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1990.3  on 5108  degrees of freedom
## Residual deviance: 1868.6  on 5102  degrees of freedom
## AIC: 1882.6
## 
## Number of Fisher Scoring iterations: 6

Looking at the coefficients in this model, the gender factor is -0.029 for the male group. This means that males are less likely to suffer stroke compare to females. However, this factor is not statistically significant in this model. We will exclude this factor from the logistic regression model.

Hypertension 1 is associated with having hypertension and heart_disease 1 is associated with having heart disease. The coefficients for hypertension and heart disease are positive 1.115 and 1.183 respectively. This means that people who are having hypertension and heart disease are more likely to get stroke comparing to people who don’t have these diseases. These two factors are both statistically significant.

The smoking status factor is a four-level factor. Comparing to people who formerly smoked, ones whose smoking status are never smoked, smokes, or unknown are less likely to suffer a stroke. The smoking status is also statistically significant.

Fitting a generalized linear model with categorical variables and numeric variables (excluding statistically insignificant variables)

fit3 <- glm(stroke ~ age + avg_glucose_level + factor(hypertension) + factor(heart_disease) + factor(smoking_status), family = "binomial", data = df_logistic)
summary(fit3)

## 
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + factor(hypertension) + 
##     factor(heart_disease) + factor(smoking_status), family = "binomial", 
##     data = df_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0808  -0.3241  -0.1709  -0.0806   3.7744  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -7.461481   0.391538 -19.057  < 2e-16 ***
## age                                 0.069551   0.005227  13.307  < 2e-16 ***
## avg_glucose_level                   0.004130   0.001164   3.548 0.000388 ***
## factor(hypertension)1               0.395962   0.163653   2.420 0.015541 *  
## factor(heart_disease)1              0.300253   0.188853   1.590 0.111862    
## factor(smoking_status)never smoked -0.205325   0.173458  -1.184 0.236524    
## factor(smoking_status)smokes        0.112176   0.214171   0.524 0.600439    
## factor(smoking_status)Unknown      -0.043141   0.205822  -0.210 0.833976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1990.3  on 5108  degrees of freedom
## Residual deviance: 1588.6  on 5101  degrees of freedom
## AIC: 1604.6
## 
## Number of Fisher Scoring iterations: 7

Combining both numeric and categorical to the model, we still get positive coefficients for age, average glucose level, and hypertension. And they are all considered statistically significant. For heart disease variable, they appear less significant in this model than in the previous model. Three smoking status variables are now statistically insignificant in this model. So, we are going to remove them.

Fitting a generalized linear model with significant variables (including numeric and categorical)

fit4 <- glm(stroke ~ age + avg_glucose_level  + factor(hypertension) + factor(heart_disease), family = "binomial", data = df_logistic)
summary(fit4)

## 
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + factor(hypertension) + 
##     factor(heart_disease), family = "binomial", data = df_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0587  -0.3215  -0.1732  -0.0828   3.7706  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.488996   0.357890 -20.925  < 2e-16 ***
## age                     0.068920   0.005140  13.408  < 2e-16 ***
## avg_glucose_level       0.004121   0.001162   3.547  0.00039 ***
## factor(hypertension)1   0.381396   0.162599   2.346  0.01899 *  
## factor(heart_disease)1  0.329972   0.187724   1.758  0.07879 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1990.3  on 5108  degrees of freedom
## Residual deviance: 1591.4  on 5104  degrees of freedom
## AIC: 1601.4
## 
## Number of Fisher Scoring iterations: 7

This is the final logistic model I run for this project. We can conclude that the above four variables are the most significant variables in predicting the likelihood of getting a stroke.

Generate confidence intervals for model parameters

confint.default(fit4)

##                               2.5 %       97.5 %
## (Intercept)            -8.190447267 -6.787544552
## age                     0.058844853  0.078994570
## avg_glucose_level       0.001843696  0.006398262
## factor(hypertension)1   0.062709201  0.700083786
## factor(heart_disease)1 -0.037959554  0.697904046

Create a null model / intercept only model

fitNull <- glm(stroke~1, family = "binomial", data = df_logistic)
summary(fitNull)

## 
## Call:
## glm(formula = stroke ~ 1, family = "binomial", data = df_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3161  -0.3161  -0.3161  -0.3161   2.4582  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.97134    0.06497  -45.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1990.3  on 5108  degrees of freedom
## Residual deviance: 1990.3  on 5108  degrees of freedom
## AIC: 1992.3
## 
## Number of Fisher Scoring iterations: 5

To Evaluate the Overall Fit of the Model

Test if the full model containing predictor variables provides significant improvement of fit over the null model using annova function

anova(fitNull, fit4, test = "LRT")

## Analysis of Deviance Table
## 
## Model 1: stroke ~ 1
## Model 2: stroke ~ age + avg_glucose_level + factor(hypertension) + factor(heart_disease)
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      5108     1990.3                          
## 2      5104     1591.5  4   398.83 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p value shows that the full model provides a significant improvement of fit over the null model.

Correlation Plot

Correlation Plot Containing Numeric Variables Only

df_numeric <- df %>%
  select(age, avg_glucose_level, bmi, stroke) %>%
  filter(bmi != "NA") %>%
  mutate(stroke = ifelse(stroke == "Yes", 1, 0))
df_numeric$stroke <- as.factor(df_numeric$stroke)
str(df_numeric)

## tibble [4,909 x 4] (S3: tbl_df/tbl/data.frame)
##  $ age              : num [1:4909] 67 80 49 79 81 74 69 78 81 61 ...
##  $ avg_glucose_level: num [1:4909] 229 106 171 174 186 ...
##  $ bmi              : num [1:4909] 36.6 32.5 34.4 24 29 27.4 22.8 24.2 29.7 36.8 ...
##  $ stroke           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

plot_correlation(df_numeric)

Correlation Plot Containing Categorical Variables Only

df_character <- df %>%
  select(age, hypertension, heart_disease, smoking_status, stroke)
df_character

## # A tibble: 5,110 x 5
##      age hypertension heart_disease smoking_status  stroke
##    <dbl> <chr>        <chr>         <chr>           <chr> 
##  1    67 No           Yes           formerly smoked Yes   
##  2    61 No           No            never smoked    Yes   
##  3    80 No           Yes           never smoked    Yes   
##  4    49 No           No            smokes          Yes   
##  5    79 Yes          No            never smoked    Yes   
##  6    81 No           No            formerly smoked Yes   
##  7    74 Yes          Yes           never smoked    Yes   
##  8    69 No           No            never smoked    Yes   
##  9    59 No           No            Unknown         Yes   
## 10    78 No           No            Unknown         Yes   
## # ... with 5,100 more rows

plot_correlation(df_character)

Conclusion

Based on the exploratory analysis and logistic regression analysis of this dataset, the following conclusion can be drawn:

The risk of having stroke increases when people get older.
People with higher average glucose level have higher risk of getting a stroke.
People previously diagnosed with hypertension have higher risk of getting a stroke.
People previously diagnosed with heart disease have higher risk of getting a stroke.
Gender, marital status, work type, residence type, and BMI are not statistically significant risk factors based on this dataset.
Smoking status does have an impact on the risk of stroke, but the significance is not as high as other significant factors. People having smoking habit have higher risk of getting a stroke.

Stroke is the second leading cause of death. There are nearly 5.17 million people died of stroke globally in 2020 and that constitute of 9.9% of total death. Stroke is also associated with long-term disability. Stroke can affect many functions of our body such as speaking, swallowing, memory, mobility, etc. Patients who experienced stroke can also develop emotional problems and personality changes. Stroke can adversely affect the life quality of ourselves and our family, therefore, it’s significantly important to spread awareness on stroke and educate people on the causes of stroke, and ultimately help people to transit to a healthy lifestyle with lower risk of having a stroke. Fortunately, this disease is preventable and treatable. Except for age, all the risk factors that we studied in this analysis are treatable with medicines or exercise, or both. Knowing the sense of stroke and the risk factors associated with stroke is the first step. This project has helped me learned a lot and raised my awareness and I hope it can help more people.

References

Body Mass Index (BMI). 17 September 2020. May 2021. https://www.cdc.gov/healthyweight/assessing/bmi/index.html#:~:text=Body%20Mass%20Index%20(BMI)%20is,or%20health%20of%20an%20individual..

Brain Basics: Preventing Stroke. 16 April 2020. May 2021. https://www.ninds.nih.gov/Disorders/subject-Caregiver-Education/Preventing-Stroke#:~:text=Stroke%20occurs%20in%20all%20age,occur%20in%20childhood%20or%20adolescence..

Cardiovascular Disease Inforgraphic. 29 September 2020. May 2021. https://www.world-heart-federation.org/resources/cardiovascular-disease-infographic/.

Global Health Estimates 2016: Deaths by Cause, Age, Sex, by Country and by Region, 2000-2016. Geneva, World Health Organization; 2018. https://www.who.int/healthinfo/global_burden_disease/estimates/en/.

Heart attack and stroke: Men vs. women. 09 April 2014. May 2021. https://www.health.harvard.edu/heart-health/heart-attack-and-stroke-men-vs-women

Stroke, Cerebrovascular accident. n.d. May 2021. http://www.emro.who.int/health-topics/stroke-cerebrovascular-accident/index.html.

Data Analysis on Stroke Predictions Dataset

Yujie Deng

4/28/2021