mydata <- read.table("./healthcare-dataset-stroke-data.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
The unit of observation is an individual patient who either had or did not have a stroke.
The data set contains 5110 observations of 12 variables.
Description of data:
id: unique identifier
gender: “Male”, “Female” or “Other”
age: age of the patient
hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
ever_married: “No” or “Yes”
work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
Residence_type: “Rural” or “Urban”
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
stroke: 1 if the patient had a stroke or 0 if not
*Note: “Unknown” in smoking_status means that the information is unavailable for this patient
The data set was published by FEDESORIANO on Kaggle.com.
mydata <- mydata[mydata$bmi != "N/A", ]
mydata <- mydata[mydata$smoking_status != "Unknown", ]
mydata <- mydata[mydata$gender != "Other", ]
head(mydata)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## 7 53882 Male 74 1 1 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
## 7 Rural 70.09 27.4 never smoked 1
I removed all units with a BMI of N/A, a smoking status of Unknown, or a gender of Other.
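The three filters can also be written as logical masks, which makes it easy to see how many rows each rule drops. The following is only a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from the stroke data.

```r
# Hypothetical toy data that mimics the three columns being filtered
toy <- data.frame(
  bmi            = c("36.6", "N/A", "32.5", "24", "29"),
  smoking_status = c("smokes", "never smoked", "Unknown",
                     "never smoked", "formerly smoked"),
  gender         = c("Male", "Female", "Male", "Other", "Female"),
  stringsAsFactors = FALSE
)

# One logical vector per exclusion rule
drop_bmi     <- toy$bmi == "N/A"
drop_smoking <- toy$smoking_status == "Unknown"
drop_gender  <- toy$gender == "Other"

# Number of rows flagged by each rule
dropped <- c(bmi = sum(drop_bmi),
             smoking = sum(drop_smoking),
             gender = sum(drop_gender))
print(dropped)

# Keep only the rows that pass all three rules
toy_clean <- toy[!(drop_bmi | drop_smoking | drop_gender), ]
print(nrow(toy_clean))
```

Applied to the real data, the same pattern would report how much of the original sample each exclusion costs before the rows are removed.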
mydata$gender <- factor(mydata$gender)
mydata$ever_married <- factor(mydata$ever_married)
mydata$work_type <- factor(mydata$work_type)
mydata$Residence_type <- factor(mydata$Residence_type)
mydata$bmi <- as.numeric(as.character(mydata$bmi))
mydata$smoking_status <- factor(mydata$smoking_status)
mydata$stroke <- factor(mydata$stroke)
mydata$hypertension <- factor(mydata$hypertension)
mydata$heart_disease <- factor(mydata$heart_disease)
summary(mydata[-1])
## gender age hypertension heart_disease ever_married
## Female:2086 Min. :10.00 0:3017 0:3219 No : 826
## Male :1339 1st Qu.:34.00 1: 408 1: 206 Yes:2599
## Median :50.00
## Mean :48.65
## 3rd Qu.:63.00
## Max. :82.00
## work_type Residence_type avg_glucose_level bmi
## children : 68 Rural:1680 Min. : 55.12 Min. :11.50
## Govt_job : 514 Urban:1745 1st Qu.: 77.23 1st Qu.:25.30
## Never_worked : 14 Median : 92.35 Median :29.10
## Private :2200 Mean :108.31 Mean :30.29
## Self-employed: 629 3rd Qu.:116.20 3rd Qu.:34.10
## Max. :271.74 Max. :92.00
## smoking_status stroke
## formerly smoked: 836 0:3245
## never smoked :1852 1: 180
## smokes : 737
##
##
##
In the previous chunk I converted the categorical variables to factors. The mean age is 48.65 years, the median average glucose level is 92.35, and the majority of participants are female (2086 of 3425).
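When many columns need the same conversion, the repeated `factor()` calls can be collapsed into one step. This is a sketch on a hypothetical toy data frame (`toy` and `cat_cols` are illustrative names); the column names only mirror some of those in the stroke data.

```r
# Hypothetical toy data; not the stroke data
toy <- data.frame(
  gender       = c("Male", "Female", "Female"),
  ever_married = c("Yes", "No", "Yes"),
  stroke       = c(1, 0, 0),
  age          = c(67, 61, 49),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categorical
cat_cols <- c("gender", "ever_married", "stroke")

# Convert all of them to factors in one step; age stays numeric
toy[cat_cols] <- lapply(toy[cat_cols], factor)

str(toy)
```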
mydata$stroke <- factor(mydata$stroke,
levels = c(0, 1),
labels = c("Not stroke", "Stroke"))
mydata$heart_disease <- factor(mydata$heart_disease,
levels = c(0, 1),
labels = c("No disease", "Disease"))
mydata$hypertension <- factor(mydata$hypertension,
levels = c(0, 1),
labels = c("Not hypertensive", "Hypertensive"))
head(mydata,5)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 Not hypertensive Disease Yes Private
## 3 31112 Male 80 Not hypertensive Disease Yes Private
## 4 60182 Female 49 Not hypertensive No disease Yes Private
## 5 1665 Female 79 Hypertensive No disease Yes Self-employed
## 6 56669 Male 81 Not hypertensive No disease Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked Stroke
## 3 Rural 105.92 32.5 never smoked Stroke
## 4 Urban 171.23 34.4 smokes Stroke
## 5 Rural 174.12 24.0 never smoked Stroke
## 6 Urban 186.21 29.0 formerly smoked Stroke
I replaced the 0/1 codes with descriptive labels.
library(psych)
# Columns 3, 9 and 10 are age, avg_glucose_level and bmi.
# describe() is used because no grouping variable is needed.
describe(mydata[, c(3, 9, 10)])
## vars n mean sd median trimmed mad min max
## age 1 3425 48.65 18.85 50.00 48.77 22.24 10.00 82.00
## avg_glucose_level 2 3425 108.31 47.71 92.35 100.01 27.15 55.12 271.74
## bmi 3 3425 30.29 7.30 29.10 29.65 6.38 11.50 92.00
## range skew kurtosis se
## age 72.00 -0.06 -0.96 0.32
## avg_glucose_level 216.62 1.47 1.20 0.82
## bmi 80.50 1.20 3.43 0.12
The minimum age is 10 and the maximum is 82. The mean BMI is 30.29 with a standard deviation of 7.30.
1.1 Research question 1: Is there a difference in Body Mass Index (BMI) between people who had a stroke and people who did not?
1.1.1 Analysis
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = bmi)) +
geom_histogram(binwidth = 5, colour = "orange", fill = "red") +
ylab("Frequency") +
xlab("Distribution of BMI")
shapiro.test(mydata$bmi)
##
## Shapiro-Wilk normality test
##
## data: mydata$bmi
## W = 0.9396, p-value < 2.2e-16
First, we check whether BMI is normally distributed. Based on the histogram and the Shapiro-Wilk test (p < 0.001), we reject the null hypothesis of normality: BMI is not normally distributed.
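Since the comparison that follows is between two groups, normality can also be checked separately within each group. The sketch below uses simulated, right-skewed values rather than the stroke data; `sim`, its group sizes, and the log-normal parameters are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated, right-skewed "BMI-like" values; NOT the stroke data
sim <- data.frame(
  bmi   = rlnorm(400, meanlog = log(29), sdlog = 0.5),
  group = rep(c("Not stroke", "Stroke"), times = c(300, 100))
)

# Run the Shapiro-Wilk test separately in each group
# and collect the p-values
p_by_group <- tapply(sim$bmi, sim$group,
                     function(x) shapiro.test(x)$p.value)
print(p_by_group)
```

A small p-value in either group would point towards a non-parametric comparison.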
wilcox.test(mydata$bmi ~ mydata$stroke,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$bmi by mydata$stroke
## W = 274899, p-value = 0.1841
## alternative hypothesis: true location shift is not equal to 0
First, we run the non-parametric Wilcoxon rank sum test. We cannot reject the null hypothesis (p = 0.1841): there is no significant difference in BMI between people who had a stroke and those who did not.
t.test(mydata$bmi ~ mydata$stroke,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydata$bmi by mydata$stroke
## t = -0.77736, df = 206.84, p-value = 0.4378
## alternative hypothesis: true difference in means between group Not stroke and group Stroke is not equal to 0
## 95 percent confidence interval:
## -1.3410707 0.5825756
## sample estimates:
## mean in group Not stroke mean in group Stroke
## 30.27242 30.65167
Next, I ran the parametric Welch t-test. Again, we cannot reject the null hypothesis (p = 0.4378): there is no significant difference in mean BMI between people who had a stroke and those who did not.
Because BMI is not normally distributed, we rely on the results of the Wilcoxon rank sum test.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydata$bmi ~ mydata$stroke,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.06 | [-0.14, 0.03]
interpret_rank_biserial(0.06, rules = "funder2019")
## [1] "very small"
## (Rules: funder2019)
The difference in BMI between people who had a stroke and those who did not is very small (rank-biserial r = -0.06, 95% CI [-0.14, 0.03]).
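The rank-biserial correlation can be recovered by hand from the Wilcoxon statistic and the two group sizes, which shows where the value of -0.06 comes from. This sketch assumes the usual relation r = 2W / (n1 * n2) - 1; the variable names are illustrative, and the numbers are copied from the output above.

```r
W     <- 274899  # Wilcoxon statistic reported by wilcox.test()
n_no  <- 3245    # size of the "Not stroke" group (from summary())
n_yes <- 180     # size of the "Stroke" group (from summary())

# Share of all (not stroke, stroke) pairs in which the first value
# is larger, rescaled from [0, 1] to [-1, 1]
r_rb <- 2 * W / (n_no * n_yes) - 1

print(round(r_rb, 2))
```

A value close to 0 means that a randomly chosen person from one group is about as likely to have the higher BMI as a randomly chosen person from the other group.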
1.1.2 Conclusion
Based on the statistical analysis, there is no statistically significant difference in BMI between people who had a stroke and people who did not (p = 0.1841), and the effect size is very small, indicating a negligible practical difference.
2.1 Research question 2: Is there a correlation between Body Mass Index (BMI) and average glucose level?
2.1.1 Analysis
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata[ , c(9,10)], smooth=FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(9,10)]),
type = "pearson")
## avg_glucose_level bmi
## avg_glucose_level 1.00 0.16
## bmi 0.16 1.00
##
## n= 3425
##
##
## P
## avg_glucose_level bmi
## avg_glucose_level 0
## bmi 0
cor(mydata$bmi,mydata$avg_glucose_level,
method="pearson")
## [1] 0.1566746
I computed the correlation matrix; the Pearson correlation coefficient is 0.1567. The p-value is printed as 0 because it is rounded, so we reject the null hypothesis of no correlation at p < 0.001.
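Because `rcorr()` prints the p-value rounded to 0, it can help to compute it from the correlation coefficient and the sample size. This is a sketch using the standard t transformation of a Pearson correlation; the variable names are illustrative, and the inputs are copied from the output above.

```r
r <- 0.1566746  # Pearson correlation reported by cor()
n <- 3425       # number of observations reported by rcorr()

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

print(t_stat)
print(p_val)
```

On the data frame itself, `cor.test(mydata$bmi, mydata$avg_glucose_level)` reports the same test along with a confidence interval.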
2.1.2 Conclusion
There is a statistically significant correlation between Body Mass Index (BMI) and average glucose level (p < 0.001). The correlation coefficient is 0.1567, which indicates a weak positive correlation.
3.1 Research question 3: Is there any association between gender and having a heart disease?
3.1.1 Analysis
H0: There is no association between gender and heart disease.
H1: There is an association between gender and heart disease.
results2 <- chisq.test(mydata$gender, mydata$heart_disease,
correct = TRUE)
results2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$gender and mydata$heart_disease
## X-squared = 34.646, df = 1, p-value = 3.955e-09
We reject H0 at p < 0.001. There is an association between gender and having a heart disease.
addmargins(results2$observed,2)
## mydata$heart_disease
## mydata$gender No disease Disease Sum
## Female 2001 85 2086
## Male 1218 121 1339
round(results2$expected,2)
## mydata$heart_disease
## mydata$gender No disease Disease
## Female 1960.54 125.46
## Male 1258.46 80.54
The assumption of the test is met, as all expected frequencies are greater than 5.
# Pearson residuals, using the full element name instead of the partial match $res
round(results2$residuals, 2)
## mydata$heart_disease
## mydata$gender No disease Disease
## Female 0.91 -3.61
## Male -1.14 4.51
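The expected frequencies and the residuals can be reproduced directly from the observed table, which makes explicit what `chisq.test()` returns. This sketch copies the observed counts from the output above; `obs`, `expd` and `pearson_res` are illustrative names.

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(2001,  85,
                1218, 121),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              heart_disease = c("No disease", "Disease")))

# Expected count = row total * column total / grand total
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residual = (observed - expected) / sqrt(expected)
pearson_res <- (obs - expd) / sqrt(expd)

print(round(expd, 2))
print(round(pearson_res, 2))
```

These are Pearson residuals; `chisq.test()` also returns standardized residuals in the `stdres` element.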
oddsratio(mydata$gender,mydata$heart_disease)
## Odds ratio | 95% CI
## -------------------------
## 2.34 | [1.76, 3.11]
interpret_oddsratio(2.34)
## [1] "small"
## (Rules: chen2010)
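To see which group the odds ratio refers to, it can be recomputed from the observed counts. The sketch below also builds a Wald interval on the log scale; that this is the interval `oddsratio()` uses is an assumption, although it matches the reported interval to two decimals here. All variable names are illustrative.

```r
# Observed counts copied from the contingency table above
f_no <- 2001; f_yes <- 85    # Female: no disease, disease
m_no <- 1218; m_yes <- 121   # Male:   no disease, disease

# Odds of heart disease within each gender
odds_female <- f_yes / f_no
odds_male   <- m_yes / m_no

# Odds ratio, male relative to female
odds_ratio <- odds_male / odds_female

# Wald 95% confidence interval, computed on the log scale
se_log_or <- sqrt(1 / f_no + 1 / f_yes + 1 / m_no + 1 / m_yes)
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

print(round(odds_ratio, 2))
print(round(ci, 2))
```

The result shows that the reported 2.34 is the odds of heart disease for men relative to women.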
3.1.2 Conclusion
More men than expected have a heart disease, and fewer women than expected; both residuals are significant at alpha = 0.001. The odds of heart disease are 2.34 times higher for men than for women (95% CI [1.76, 3.11]). The effect size is small, meaning that there is a small association between gender and having a heart disease.