library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(RColorBrewer)
Stroke is one type of cardiovascular disease. There are two types of stroke: ischaemic stroke and hemorrhagic stroke. According to the World Health Organization, stroke is the second leading cause of death and the third leading cause of disability. The latest global mortality estimates show that 10.2% of total deaths were caused by stroke in the year 2016 and 9.9% for the year 2000 (Global Health Estimates 2016). There are 15 million people who suffer from stroke worldwide every year. Among these 15 million people, 5 million died and another 5 million became permanently disabled (Stroke, Cerebrovascular accident). The risk factors for stroke are high blood pressure, high cholesterol, overweight & obesity, tobacco, diabetes, physical inactivity, unhealthy diet, harmful use of alcohol, kidney disease, etc (Cardiovascular Disease Inforgraphic).
The dataset I’m using is extracted from kaggle: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset. The object of this analysis is to explore how each risk factor impacts the risk of stroke and the possible relationship among factors. The analysis will carry out by answering some general questions about the topic and variables in the dataset. The first part will be exploratory data analysis based on visualizations and the second part will be logistic regression analysis.
setwd("~/School/MC/DATA 110/Datasets")
data <- read_csv("healthcare-dataset-stroke-data.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## id = col_double(),
## gender = col_character(),
## age = col_double(),
## hypertension = col_double(),
## heart_disease = col_double(),
## ever_married = col_character(),
## work_type = col_character(),
## Residence_type = col_character(),
## avg_glucose_level = col_double(),
## bmi = col_character(),
## smoking_status = col_character(),
## stroke = col_double()
## )
id: unique identifier for each subject
gender: male, female, or other
age: 0-82 years old
hypertension: 0 if the subject doesn’t have hypertension, 1 if the subject has hypertension
heart_disease: 0 if the subject doesn’t have any heart diesease, 1 if the subject has a heart disease
ever_married: Yes or No
work_type: children, govt_job, never_worked, private, or self-employed
residence_type: urban or rural
avg_glucose_level: average glucose level in blood; range: 55-272
bmi: body mass index; range: 10-98
smoking_status: formerly smoked, never smoked, smokes, unknown
stroke: 0 if subject never had a stroke, 1 if the subject suffered a stroke before
# Use mutate to change 0 values to "No" and 1 values to "Yes" so that visualizations are easier to read
df <- data %>%
mutate(hypertension = ifelse(hypertension == 0, "No","Yes")) %>%
mutate(heart_disease = ifelse(heart_disease == 0, "No","Yes")) %>%
mutate(stroke = ifelse(stroke == 0, "No","Yes"))
df$bmi <- as.numeric(df$bmi)
## Warning: NAs introduced by coercion
summary(df)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Length:5110
## 1st Qu.:17741 Class :character 1st Qu.:25.00 Class :character
## Median :36932 Mode :character Median :45.00 Mode :character
## Mean :36518 Mean :43.23
## 3rd Qu.:54682 3rd Qu.:61.00
## Max. :72940 Max. :82.00
##
## heart_disease ever_married work_type Residence_type
## Length:5110 Length:5110 Length:5110 Length:5110
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Min. :10.30 Length:5110 Length:5110
## 1st Qu.: 77.25 1st Qu.:23.50 Class :character Class :character
## Median : 91.89 Median :28.10 Mode :character Mode :character
## Mean :106.15 Mean :28.89
## 3rd Qu.:114.09 3rd Qu.:33.10
## Max. :271.74 Max. :97.60
## NA's :201
df %>%
ggplot(aes(x = age)) +
geom_density() +
geom_vline(aes(xintercept=mean(age)), color = "blue", linetype = "dashed") +
labs(title = "Age Density", x = "Age", y = "Density") +
theme_classic()
df %>%
ggplot(aes(x = age, fill = gender)) +
geom_density(alpha = 0.4) +
geom_vline(aes(xintercept=mean(age)), color = "blue", linetype = "dashed") +
labs(title = "Age Density", x = "Age", y = "Density") +
theme_classic()
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf
This study involved 5,110 people including 2994 females and 2115 males. The density plot shows the age distribution. Overall, more observations are concentrated between 40 and 60 years old. There are differences in the age distribution for female and male subjects. The age distribution of female subjects is closer to normal distribution while the age distribution of male subjects is slightly skewed to the right.
df_1 <- df %>%
group_by(gender, stroke) %>%
count(stroke)
df_1
## # A tibble: 5 x 3
## # Groups: gender, stroke [5]
## gender stroke n
## <chr> <chr> <int>
## 1 Female No 2853
## 2 Female Yes 141
## 3 Male No 2007
## 4 Male Yes 108
## 5 Other No 1
df_1 %>%
ggplot(mapping = aes(x = gender, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Percentage of Female Group vs. Male Group",
x = "Gender",
y = "Percentage") +
scale_fill_brewer(palette = "Paired")
In this dataset, the total number of female subjects is more than the total number of male subjects. The percentage of subjects who previously suffered a stroke in the male group is slightly higher than the female group. There’s only one observation in the Other gender group and that subject does not have stroke experience. Therefore, no conclusion can be drawn for the Other group.
According to Harvard Health Publishing, women have a higher risk of suffering stroke. Stroke events happen more among women than men each year. This can be explained by many factors. Women have longer average life span and the risk of experiencing stroke increases when people become older. In addition, reproductive health, pregnancy, childbirth, and imbalanced hormones are all considered risk factors (Heart attack and stroke: Men vs. women). In this dataset, we can’t see strong evidence to support that women have a higher risk of suffering stroke.
df_2 <- df %>%
group_by(age, stroke) %>%
count(stroke)
df_2
## # A tibble: 148 x 3
## # Groups: age, stroke [148]
## age stroke n
## <dbl> <chr> <int>
## 1 0.08 No 2
## 2 0.16 No 3
## 3 0.24 No 5
## 4 0.32 No 5
## 5 0.4 No 2
## 6 0.48 No 3
## 7 0.56 No 5
## 8 0.64 No 4
## 9 0.72 No 5
## 10 0.8 No 4
## # ... with 138 more rows
df_2 %>%
#filter(stroke == "Yes") %>%
ggplot(mapping = aes(x = age, y = n, color = stroke)) +
geom_point(alpha=0.5) +
geom_smooth(method = "loess", span = 0.3) +
labs(title = "Age Distribution of Stoke Event",
x = "Age",
y = "Number of Observations") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
#
df_2 %>%
filter(stroke == "Yes") %>%
ggplot(mapping = aes(x = age, y = n)) +
geom_point(alpha=0.5) +
geom_smooth(method = "loess", span = 0.3) +
labs(title = "Stoke Event Distribution by Age",
x = "Age",
y = "Number of Observations") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Though the majority of subjects didn’t experience stroke, it can happen at any age. Among subjects who have experienced a stroke, age varies from 1 to 80 years old. As we can see from the above plot, the number of observations at each age remains under 5 before 55 years old. There are two peaks around the age of 60 and 80. According to an online publication from NIH, the risk of stroke doubles for each decade between the age of 55 and 85. Although people consider stroke as a disease of aging, the risk of stroke in childhood can be very high during the perinatal period (Brain Basics: Preventing Stroke).
df_3 <- df %>%
group_by(hypertension, stroke) %>%
count(stroke)
df_3
## # A tibble: 4 x 3
## # Groups: hypertension, stroke [4]
## hypertension stroke n
## <chr> <chr> <int>
## 1 No No 4429
## 2 No Yes 183
## 3 Yes No 432
## 4 Yes Yes 66
df_3
## # A tibble: 4 x 3
## # Groups: hypertension, stroke [4]
## hypertension stroke n
## <chr> <chr> <int>
## 1 No No 4429
## 2 No Yes 183
## 3 Yes No 432
## 4 Yes Yes 66
df_3 %>%
ggplot(mapping = aes(x = hypertension, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Percentage of subjects with Hypertension vs. without Hypertension",
x = "Hypertension",
y = "Percentage") +
scale_fill_brewer(palette = "Paired")
The number of subjects without hypertension is way more than subjects with hypertension in this dataset. The percentage of subjects getting a stroke with hypertension is higher than the percentage of subjects getting a stroke without hypertension. Therefore, this dataset proved the fact that hypertension is a significant risk factor for stroke. In fact, it is the most important risk factor for stroke. Hypertension tends to run in the family. If one’s family member has hypertension, one has a higher risk of having it. However, hypertension can be prevented or controlled by medicine and a healthy lifestyle.
df_4 <- df %>%
group_by(age, hypertension) %>%
count(hypertension)
df_4
## # A tibble: 164 x 3
## # Groups: age, hypertension [164]
## age hypertension n
## <dbl> <chr> <int>
## 1 0.08 No 2
## 2 0.16 No 3
## 3 0.24 No 5
## 4 0.32 No 5
## 5 0.4 No 2
## 6 0.48 No 3
## 7 0.56 No 5
## 8 0.64 No 4
## 9 0.72 No 5
## 10 0.8 No 4
## # ... with 154 more rows
df_4 %>%
filter(hypertension == "Yes") %>%
ggplot(mapping = aes(x = age, y = n )) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", span = 0.3) +
labs(title = "With Hypertension vs. Without Hypertension",
subtitle = "Age Distribution",
x = "Age",
y = "Number of Observation") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Hypertension can be developed at any age though the elderly are more likely to develop hypertension. Hypertension is age-related because the vascular system changes as we get older. Our arteries get stiffer, therefore blood pressure goes up. As we can see from the plot above, it’s very obvious that the number of observations with hypertension jumps up to the next level after the age of 50.
df_5 <- df %>%
group_by(heart_disease, stroke) %>%
count(stroke)
df_5
## # A tibble: 4 x 3
## # Groups: heart_disease, stroke [4]
## heart_disease stroke n
## <chr> <chr> <int>
## 1 No No 4632
## 2 No Yes 202
## 3 Yes No 229
## 4 Yes Yes 47
df_5 %>%
ggplot(mapping = aes(x = heart_disease, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Percentage of subjects with Heart Disease vs. without Heart Disease",
x = "Heart Disease",
y = "Percentage") +
scale_fill_brewer(palette = "Paired")
Subjects previously diagnosed with heart disease have a higher risk of experiencing stroke than subjects without heart disease. According to NIH’s online publication, heart diseases can cause blood clots that may break loose or block vessels in or leading to the brain.
df_6 <- df %>%
group_by(ever_married, stroke) %>%
count(stroke)
df_6
## # A tibble: 4 x 3
## # Groups: ever_married, stroke [4]
## ever_married stroke n
## <chr> <chr> <int>
## 1 No No 1728
## 2 No Yes 29
## 3 Yes No 3133
## 4 Yes Yes 220
df_6 %>%
ggplot(mapping = aes(x = ever_married, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Married subjects vs. Unmarried subjects",
x = "Ever Married",
y = "Percentage") +
scale_fill_brewer(palette = "Paired")
Based on the chart above, it seems that married subjects have a higher risk of getting a stroke. However, let’s look at the age distribution of married subjects and unmarried subjects.
df %>%
ggplot(mapping = aes(x = ever_married, y = age)) +
geom_boxplot() +
labs(title = "Age Distribution - Married vs. Unmarried",
x = "Ever Married",
y = "Age") +
theme_minimal()
df %>%
ggplot(mapping = aes(x = ever_married, y = age, fill = stroke)) +
geom_boxplot() +
labs(title = "Age Distribution - Married vs. Unmarried",
x = "Ever Married",
y = "Age") +
theme_minimal() +
scale_fill_brewer(palette = "Paired")
As we see in the first plot, the overall median age of unmarried subjects is significantly younger than married subjects in this dataset, though there are outliers. In the second plot, the age distribution of subjects who experienced stroke is very similar no matter the subject is ever married or not. The median age is around 70. There’s no strong evidence from this dataset to support that marital status has a direct impact on the risk of getting a stroke.
df_7 <- df %>%
group_by(work_type, stroke) %>%
count(stroke)
df_7
## # A tibble: 9 x 3
## # Groups: work_type, stroke [9]
## work_type stroke n
## <chr> <chr> <int>
## 1 children No 685
## 2 children Yes 2
## 3 Govt_job No 624
## 4 Govt_job Yes 33
## 5 Never_worked No 22
## 6 Private No 2776
## 7 Private Yes 149
## 8 Self-employed No 754
## 9 Self-employed Yes 65
df_7 %>%
ggplot(mapping = aes(x = work_type, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Which Work Type Has Higher Risk of Getting a Stroke?",
x = "Work Type",
y = "Percentage") +
theme_minimal() +
scale_fill_brewer(palette = "Paired")
df %>%
ggplot(mapping = aes(x = work_type, y = age)) +
geom_boxplot() +
labs(title = "Age Distribution of Work Type",
x = "Work Type",
y = "Age") +
theme_minimal()
For children and subjects who never worked, the median age is under 20 years old and their risk of getting stroke is low because of the age. For government job, private, and self-employed groups, their median age is around 50, 45, and 61 respectively. The higher percentage could result from aging. There’s no strong evidence to support that work type has an impact on the risk of getting a stroke.
df_8 <- df %>%
group_by(Residence_type, stroke) %>%
count(stroke)
df_8
## # A tibble: 4 x 3
## # Groups: Residence_type, stroke [4]
## Residence_type stroke n
## <chr> <chr> <int>
## 1 Rural No 2400
## 2 Rural Yes 114
## 3 Urban No 2461
## 4 Urban Yes 135
df_8 %>%
ggplot(mapping = aes(x = Residence_type, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Does residence type have a great impact on risk of getting stroke?",
x = "Residence Type",
y = "Percentage") +
theme_minimal() +
scale_fill_brewer(palette = "Paired")
It seems that residence type does not have an impact on the risk of getting a stroke.
df %>%
ggplot(mapping = aes(x = stroke, y = avg_glucose_level, fill = stroke)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of Average Glucose Level",
x = "Stroke",
y = "Average Glucose Level") +
scale_fill_brewer(palette = "Paired")
High glucose level is an indicator of diabetes. According to NIH’s online publication, in terms of stroke and cardiovascular disease, having diabetes is the equivalent of aging 15 years old. Diabetes not only affects our body’s ability to control the level of blood sugar but also causes destructive changes in the blood vessels throughout the body. If the glucose level is high at the time of the stroke, the damage to our body is more severe and extensive than when the glucose level is well-controlled. From the boxplot, we can see that people who experienced stroke have a higher average glucose level than people who never experienced a stroke.
cols <- brewer.pal(2, "Paired")
## Warning in brewer.pal(2, "Paired"): minimal value for n is 3, returning requested palette with 3 different levels
This is another visualization of average glucose level. According to CDC, A normal blood pressure level is less than 120/80 mmHg.
highchart() %>%
hc_add_series(data = df,
type = "point",
hcaes(x = age,
y = avg_glucose_level,
group = stroke)) %>%
hc_colors(cols) %>%
hc_xAxis(title = list(text="Age")) %>%
hc_yAxis(title = list(text="Average Glucose Level")) %>%
hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
hc_legend(align = "right",
verticalAlign = "top")
df_10 <- df %>%
group_by(bmi, stroke, gender) %>%
count(stroke)
df_10
## # A tibble: 883 x 4
## # Groups: bmi, stroke, gender [883]
## bmi stroke gender n
## <dbl> <chr> <chr> <int>
## 1 10.3 No Female 1
## 2 11.3 No Female 1
## 3 11.5 No Male 1
## 4 12 No Female 1
## 5 12.3 No Female 1
## 6 12.8 No Female 1
## 7 13 No Male 1
## 8 13.2 No Female 1
## 9 13.3 No Female 1
## 10 13.4 No Male 1
## # ... with 873 more rows
df_10 %>%
ggplot(mapping = aes(x = stroke, y = bmi, fill = stroke)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of BMI",
x = "Stroke",
y = "BMI") +
scale_fill_brewer(palette = "Paired")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
df_10 %>%
ggplot(mapping = aes(x = stroke, y = bmi, fill = stroke, color = gender)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of BMI",
x = "Stroke",
y = "BMI") +
scale_fill_brewer(palette = "Paired")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
According to CDC, Body Mass Index (BMI) is a person’s weight in kilograms divided by the square of height in meters. A high BMI can be an indicator of high body fatness and may lead to health problems. Some studies have suggested that a high body mass index (BMI) may increase the risk of stroke. From the boxplot above, subjects who experience a stroke have lower BMI than subjects who didn’t experienced stroke. we can’t find strong evidence to prove that BMI has a direct impact on the risk of stroke.
df_11 <- df %>%
group_by(smoking_status, stroke) %>%
count(stroke)
df_11
## # A tibble: 8 x 3
## # Groups: smoking_status, stroke [8]
## smoking_status stroke n
## <chr> <chr> <int>
## 1 formerly smoked No 815
## 2 formerly smoked Yes 70
## 3 never smoked No 1802
## 4 never smoked Yes 90
## 5 smokes No 747
## 6 smokes Yes 42
## 7 Unknown No 1497
## 8 Unknown Yes 47
df_11 %>%
ggplot(mapping = aes(x = smoking_status, y = n, fill = stroke) ) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Does the smoking status impact the risk of stroke? ",
x = "Smoking Status",
y = "Percentage") +
theme_minimal() +
scale_fill_brewer(palette = "Paired")
We can see that smoking status does impact the risk of getting a stroke. Smoking causes the buildup of fatty substances (atherosclerosis) in the carotid artery, which is the main artery to supply blood to the brain. In addition, nicotine can cause high blood pressure. The carbon monoxide produced from smoking reduces the amount of oxygen in the blood supply to the brain. Smoking also thickens the blood and makes it easier to clot.
library(DataExplorer)
df_logistic <- df %>%
mutate(stroke = ifelse(stroke == "Yes", 1, 0)) %>%
mutate(hypertension = ifelse(hypertension == "Yes", 1, 0)) %>% # Converting stroke, hypertension, heart disease into 0s and 1s
mutate(heart_disease = ifelse(heart_disease == "Yes", 1, 0)) %>%
filter(gender != "Other") # filter out the only "Other" observation in gender group
# set categorical variables as factor
df_logistic$gender <- as.factor(df_logistic$gender)
df_logistic$hypertension <- as.factor(df_logistic$hypertension)
df_logistic$heart_disease <- as.factor(df_logistic$heart_disease)
df_logistic$smoking_status <- as.factor(df_logistic$smoking_status)
str(df_logistic)
## spec_tbl_df [5,109 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:5109] 9046 51676 31112 60182 1665 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 1 1 ...
## $ age : num [1:5109] 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 1 ...
## $ heart_disease : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 1 1 1 ...
## $ ever_married : chr [1:5109] "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr [1:5109] "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr [1:5109] "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num [1:5109] 229 202 106 171 174 ...
## $ bmi : num [1:5109] 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
## $ smoking_status : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
## $ stroke : num [1:5109] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. gender = col_character(),
## .. age = col_double(),
## .. hypertension = col_double(),
## .. heart_disease = col_double(),
## .. ever_married = col_character(),
## .. work_type = col_character(),
## .. Residence_type = col_character(),
## .. avg_glucose_level = col_double(),
## .. bmi = col_character(),
## .. smoking_status = col_character(),
## .. stroke = col_double()
## .. )
fit1 <- glm(stroke ~ age + avg_glucose_level + bmi , data = df_logistic, family = "binomial")
summary(fit1)
##
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + bmi, family = "binomial",
## data = df_logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9845 -0.3029 -0.1626 -0.0752 3.6180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.042725 0.538866 -14.925 < 2e-16 ***
## age 0.072264 0.005532 13.063 < 2e-16 ***
## avg_glucose_level 0.005455 0.001263 4.320 1.56e-05 ***
## bmi 0.005555 0.011558 0.481 0.631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1728.3 on 4907 degrees of freedom
## Residual deviance: 1387.7 on 4904 degrees of freedom
## (201 observations deleted due to missingness)
## AIC: 1395.7
##
## Number of Fisher Scoring iterations: 7
The coefficient is positive for both age and average glucose level. For every one unit change in age, the log odds of getting stroke increases by 0.073. For every one unit change in average glucose level, the log odds of getting stroke increases by 0.005. Both age and average glucose level are statistically significant.
The coefficient for BMI is positive 0.009. This means people with higher BMI have a higher risk of getting a stroke. However, BMI is not statistically significant. We will exclude BMI from the logistic regression.
fit2 <- glm(stroke ~ factor(gender) + factor(hypertension) + factor(heart_disease) + factor(smoking_status), family = "binomial", data = df_logistic)
summary(fit2)
##
## Call:
## glm(formula = stroke ~ factor(gender) + factor(hypertension) +
## factor(heart_disease) + factor(smoking_status), family = "binomial",
## data = df_logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9883 -0.2760 -0.2680 -0.2338 2.7013
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8514 0.1514 -18.838 < 2e-16 ***
## factor(gender)Male -0.0357 0.1357 -0.263 0.792498
## factor(hypertension)1 1.0861 0.1588 6.838 8.02e-12 ***
## factor(heart_disease)1 1.3027 0.1839 7.083 1.41e-12 ***
## factor(smoking_status)never smoked -0.4575 0.1694 -2.701 0.006922 **
## factor(smoking_status)smokes -0.3971 0.2061 -1.927 0.053958 .
## factor(smoking_status)Unknown -0.7350 0.1992 -3.689 0.000225 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1990.3 on 5108 degrees of freedom
## Residual deviance: 1868.6 on 5102 degrees of freedom
## AIC: 1882.6
##
## Number of Fisher Scoring iterations: 6
Looking at the coefficients in this model, the gender factor is -0.029 for the male group. This means that males are less likely to suffer stroke compare to females. However, this factor is not statistically significant in this model. We will exclude this factor from the logistic regression model.
Hypertension 1 is associated with having hypertension and heart_disease 1 is associated with having heart disease. The coefficients for hypertension and heart disease are positive 1.115 and 1.183 respectively. This means that people who are having hypertension and heart disease are more likely to get stroke comparing to people who don’t have these diseases. These two factors are both statistically significant.
The smoking status factor is a four-level factor. Comparing to people who formerly smoked, ones whose smoking status are never smoked, smokes, or unknown are less likely to suffer a stroke. The smoking status is also statistically significant.
fit3 <- glm(stroke ~ age + avg_glucose_level + factor(hypertension) + factor(heart_disease) + factor(smoking_status), family = "binomial", data = df_logistic)
summary(fit3)
##
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + factor(hypertension) +
## factor(heart_disease) + factor(smoking_status), family = "binomial",
## data = df_logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0808 -0.3241 -0.1709 -0.0806 3.7744
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.461481 0.391538 -19.057 < 2e-16 ***
## age 0.069551 0.005227 13.307 < 2e-16 ***
## avg_glucose_level 0.004130 0.001164 3.548 0.000388 ***
## factor(hypertension)1 0.395962 0.163653 2.420 0.015541 *
## factor(heart_disease)1 0.300253 0.188853 1.590 0.111862
## factor(smoking_status)never smoked -0.205325 0.173458 -1.184 0.236524
## factor(smoking_status)smokes 0.112176 0.214171 0.524 0.600439
## factor(smoking_status)Unknown -0.043141 0.205822 -0.210 0.833976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1990.3 on 5108 degrees of freedom
## Residual deviance: 1588.6 on 5101 degrees of freedom
## AIC: 1604.6
##
## Number of Fisher Scoring iterations: 7
Combining both numeric and categorical to the model, we still get positive coefficients for age, average glucose level, and hypertension. And they are all considered statistically significant. For heart disease variable, they appear less significant in this model than in the previous model. Three smoking status variables are now statistically insignificant in this model. So, we are going to remove them.
fit4 <- glm(stroke ~ age + avg_glucose_level + factor(hypertension) + factor(heart_disease), family = "binomial", data = df_logistic)
summary(fit4)
##
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + factor(hypertension) +
## factor(heart_disease), family = "binomial", data = df_logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0587 -0.3215 -0.1732 -0.0828 3.7706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.488996 0.357890 -20.925 < 2e-16 ***
## age 0.068920 0.005140 13.408 < 2e-16 ***
## avg_glucose_level 0.004121 0.001162 3.547 0.00039 ***
## factor(hypertension)1 0.381396 0.162599 2.346 0.01899 *
## factor(heart_disease)1 0.329972 0.187724 1.758 0.07879 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1990.3 on 5108 degrees of freedom
## Residual deviance: 1591.4 on 5104 degrees of freedom
## AIC: 1601.4
##
## Number of Fisher Scoring iterations: 7
This is the final logistic model I run for this project. We can conclude that the above four variables are the most significant variables in predicting the likelihood of getting a stroke.
confint.default(fit4)
## 2.5 % 97.5 %
## (Intercept) -8.190447267 -6.787544552
## age 0.058844853 0.078994570
## avg_glucose_level 0.001843696 0.006398262
## factor(hypertension)1 0.062709201 0.700083786
## factor(heart_disease)1 -0.037959554 0.697904046
fitNull <- glm(stroke~1, family = "binomial", data = df_logistic)
summary(fitNull)
##
## Call:
## glm(formula = stroke ~ 1, family = "binomial", data = df_logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3161 -0.3161 -0.3161 -0.3161 2.4582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.97134 0.06497 -45.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1990.3 on 5108 degrees of freedom
## Residual deviance: 1990.3 on 5108 degrees of freedom
## AIC: 1992.3
##
## Number of Fisher Scoring iterations: 5
anova(fitNull, fit4, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: stroke ~ 1
## Model 2: stroke ~ age + avg_glucose_level + factor(hypertension) + factor(heart_disease)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 5108 1990.3
## 2 5104 1591.5 4 398.83 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p value shows that the full model provides a significant improvement of fit over the null model.
df_numeric <- df %>%
select(age, avg_glucose_level, bmi, stroke) %>%
filter(bmi != "NA") %>%
mutate(stroke = ifelse(stroke == "Yes", 1, 0))
df_numeric$stroke <- as.factor(df_numeric$stroke)
str(df_numeric)
## tibble [4,909 x 4] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:4909] 67 80 49 79 81 74 69 78 81 61 ...
## $ avg_glucose_level: num [1:4909] 229 106 171 174 186 ...
## $ bmi : num [1:4909] 36.6 32.5 34.4 24 29 27.4 22.8 24.2 29.7 36.8 ...
## $ stroke : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
plot_correlation(df_numeric)
df_character <- df %>%
select(age, hypertension, heart_disease, smoking_status, stroke)
df_character
## # A tibble: 5,110 x 5
## age hypertension heart_disease smoking_status stroke
## <dbl> <chr> <chr> <chr> <chr>
## 1 67 No Yes formerly smoked Yes
## 2 61 No No never smoked Yes
## 3 80 No Yes never smoked Yes
## 4 49 No No smokes Yes
## 5 79 Yes No never smoked Yes
## 6 81 No No formerly smoked Yes
## 7 74 Yes Yes never smoked Yes
## 8 69 No No never smoked Yes
## 9 59 No No Unknown Yes
## 10 78 No No Unknown Yes
## # ... with 5,100 more rows
plot_correlation(df_character)
Based on the exploratory analysis and logistic regression analysis of this dataset, the following conclusion can be drawn:
The risk of having stroke increases when people get older.
People with higher average glucose level have higher risk of getting a stroke.
People previously diagnosed with hypertension have higher risk of getting a stroke.
People previously diagnosed with heart disease have higher risk of getting a stroke.
Gender, marital status, work type, residence type, and BMI are not statistically significant risk factors based on this dataset.
Smoking status does have an impact on the risk of stroke, but the significance is not as high as other significant factors. People having smoking habit have higher risk of getting a stroke.
Stroke is the second leading cause of death. There are nearly 5.17 million people died of stroke globally in 2020 and that constitute of 9.9% of total death. Stroke is also associated with long-term disability. Stroke can affect many functions of our body such as speaking, swallowing, memory, mobility, etc. Patients who experienced stroke can also develop emotional problems and personality changes. Stroke can adversely affect the life quality of ourselves and our family, therefore, it’s significantly important to spread awareness on stroke and educate people on the causes of stroke, and ultimately help people to transit to a healthy lifestyle with lower risk of having a stroke. Fortunately, this disease is preventable and treatable. Except for age, all the risk factors that we studied in this analysis are treatable with medicines or exercise, or both. Knowing the sense of stroke and the risk factors associated with stroke is the first step. This project has helped me learned a lot and raised my awareness and I hope it can help more people.
Body Mass Index (BMI). 17 September 2020. May 2021. https://www.cdc.gov/healthyweight/assessing/bmi/index.html#:~:text=Body%20Mass%20Index%20(BMI)%20is,or%20health%20of%20an%20individual..
Brain Basics: Preventing Stroke. 16 April 2020. May 2021. https://www.ninds.nih.gov/Disorders/subject-Caregiver-Education/Preventing-Stroke#:~:text=Stroke%20occurs%20in%20all%20age,occur%20in%20childhood%20or%20adolescence..
Cardiovascular Disease Inforgraphic. 29 September 2020. May 2021. https://www.world-heart-federation.org/resources/cardiovascular-disease-infographic/.
Global Health Estimates 2016: Deaths by Cause, Age, Sex, by Country and by Region, 2000-2016. Geneva, World Health Organization; 2018. https://www.who.int/healthinfo/global_burden_disease/estimates/en/.
Heart attack and stroke: Men vs. women. 09 April 2014. May 2021. https://www.health.harvard.edu/heart-health/heart-attack-and-stroke-men-vs-women
Stroke, Cerebrovascular accident. n.d. May 2021. http://www.emro.who.int/health-topics/stroke-cerebrovascular-accident/index.html.