Introduction

Stroke occurs when blood flow to the brain is interrupted, often leading to permanent brain damage
The World Stroke Organization reports that stroke is the third leading cause of death and disability combined worldwide (Feigin et al. 2025)
It imposes a significant healthcare burden and reduces quality of life
Risk factors are of many types, including biological (e.g. hypertension), lifestyle (e.g. smoking) and demographic (e.g. age)
Understanding risk factors helps to develop effective prevention strategies and enables early clinical intervention

Problem Statement

Research question 1: Does the blood glucose level differ between stroke and non-stroke patients?
We will check whether the difference in mean average glucose level between stroke and non-stroke patients is statistically significant
- Test method: Two-sample t-test
Research question 2: Is there an association between residence type and stroke occurrence?
We will check whether a categorical association exists between residence type (urban versus rural) and stroke occurrence
- Test method: Chi-square Test of Association

Data

Dataset Source: The analysis uses the open Stroke Prediction Dataset (fedesoriano 2021) from Kaggle
Dataset Context: Compiled from healthcare records, it focuses on predicting stroke risk based on demographic, clinical, and lifestyle factors
Dataset Size: Comprises 5,110 observations (rows) and 12 features (columns), representing individual patient profiles
The data is imbalanced, with a minority of stroke cases (approximately 4.9%, or 249 instances)
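As a quick arithmetic check of the imbalance figure, the stroke share can be recomputed from the two counts quoted above. This is only a sketch; the variable names are illustrative and not part of the original analysis.

```r
# Counts quoted in the dataset description
n_total  <- 5110
n_stroke <- 249

# Share of stroke cases, as a percentage rounded to one decimal place
pct_stroke <- round(100 * n_stroke / n_total, 1)
pct_stroke
```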

Data Cont.

For this assignment, we will focus on avg_glucose_level, Residence_type and stroke
- avg_glucose_level: numerical feature representing the average glucose level in blood
- Residence_type: categorical feature representing where the patient lives, ‘Rural’ or ‘Urban’
- stroke: binary feature stored as a number, 1 = patient had a stroke; 0 = patient did not
Residence_type is converted into a factor with levels ‘Rural’ and ‘Urban’
stroke is converted into a factor with labels ‘No’ and ‘Yes’ to make it more descriptive
There are no missing values in avg_glucose_level, Residence_type or stroke

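library(readr)  # read_csv()
library(dplyr)  # %>%, filter(), group_by(), summarise()
library(car)    # qqPlot(), leveneTest()

dat_stroke <- read_csv("stroke_dataset.csv")

# Keep only the three variables used in this analysis
stroke_ds <- dat_stroke[, c("Residence_type", "avg_glucose_level", "stroke")]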

Check data frame structure

str(as.data.frame(stroke_ds))

## 'data.frame':    5110 obs. of  3 variables:
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ stroke           : num  1 1 1 1 1 1 1 1 1 1 ...

Data Cont.

Check missing values

colSums(is.na(stroke_ds))

##    Residence_type avg_glucose_level            stroke 
##                 0                 0                 0

Convert to factors

stroke_ds$Residence_type <- factor(stroke_ds$Residence_type,
                             levels = c("Rural","Urban"))
levels(stroke_ds$Residence_type)

## [1] "Rural" "Urban"

stroke_ds$stroke <- factor(stroke_ds$stroke,
                     levels = c(0,1),labels = c("No","Yes")) 
levels(stroke_ds$stroke)

## [1] "No"  "Yes"

Descriptive Statistics and Visualisation

The cross-tabulation and clustered bar chart show similar stroke/non-stroke proportions among patients living in rural and urban areas

tab1 <- table(stroke_ds$stroke,stroke_ds$Residence_type) %>%
  prop.table(margin = 2)
knitr::kable(tab1)

	Rural	Urban
No	0.9546539	0.9479969
Yes	0.0453461	0.0520031

barplot(tab1,
        main = "Stroke History by Residence Type", ylab = "Proportion within residence type",
        ylim = c(0, 1), legend.text = rownames(tab1), beside = TRUE,
        args.legend = list(x = "topright", horiz = TRUE, title = "Had stroke"),  # must be a list, not c()
        xlab = "Residence type")
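To put a number on "similar", the column proportions can be rebuilt from the observed counts reported in the chi-square section, and the gap between the two stroke proportions computed. This is a small sketch; `counts`, `props` and `gap` are illustrative names, not objects from the original analysis.

```r
# Observed counts (rows: stroke status, columns: residence type)
counts <- matrix(c(2400, 114, 2461, 135), nrow = 2,
                 dimnames = list(stroke = c("No", "Yes"),
                                 residence = c("Rural", "Urban")))

# Proportions within each residence type (columns sum to 1)
props <- prop.table(counts, margin = 2)

# Difference in stroke proportion, Urban minus Rural, in percentage points
gap <- props["Yes", "Urban"] - props["Yes", "Rural"]
round(100 * gap, 2)
```

The gap is well under one percentage point, which is consistent with the visual impression from the bar chart.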

Descriptive Statistics Cont.

Summary statistics of the data show:
- for stroke patients, the mean average glucose level is 132.54
- for non-stroke patients, the mean average glucose level is 104.80

stroke_ds %>% group_by(stroke) %>% summarise(Min = min(avg_glucose_level,na.rm = TRUE),
                                      Q1 = quantile(avg_glucose_level,probs = .25,na.rm = TRUE),
                                      Median = median(avg_glucose_level, na.rm = TRUE),
                                      Q3 = quantile(avg_glucose_level,probs = .75,na.rm = TRUE),
                                      Max = max(avg_glucose_level,na.rm = TRUE),
                                      Mean = mean(avg_glucose_level, na.rm = TRUE),
                                      SD = sd(avg_glucose_level, na.rm = TRUE),
                                      n = n(),
                                      Missing = sum(is.na(avg_glucose_level))) -> tab2

knitr::kable(tab2)

stroke	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
No	55.12	77.12	91.47	112.83	267.76	104.7955	43.84607	4861	0
Yes	56.11	79.79	105.22	196.71	271.74	132.5447	61.92106	249	0

Descriptive Statistics Cont.

The side-by-side box plot shows that the average glucose level distributions are right-skewed
Stroke patients had a higher average glucose level compared to non-stroke patients
Outliers are observed in the average glucose level of non-stroke patients
- not removed, as they are not isolated and appear plausible

# na.rm is not a boxplot() argument, so it is dropped; axis labels name the plotted variables
boxplot(avg_glucose_level ~ stroke, data = stroke_ds,
        main = "Box Plot of Average Glucose Level by Stroke",
        ylab = "Had stroke", xlab = "Average glucose level",
        horizontal = TRUE, col = "darkorange2")

Hypothesis Testing and Confidence Interval

Using a two-sample t-test, we will check whether the difference in mean average glucose level between stroke and non-stroke patients is statistically significant
Assumptions: 1) average glucose level is normally distributed in each group 2) homogeneity of variances
Before the test, we will check the normality and homogeneity of variance assumptions
Check normality using Q-Q plots

# QQ plot for stroke = Yes
stroke_yes <- stroke_ds %>% filter(stroke == "Yes")
stroke_yes$avg_glucose_level %>% qqPlot(dist="norm")

## [1] 194 136

Hypothesis Testing and CI Cont.

# QQ plot for stroke = no
stroke_no <- stroke_ds %>% filter(stroke == "No")
stroke_no$avg_glucose_level %>% qqPlot(dist="norm")

## [1]  959 2840

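The Q-Q plots show that many points in both groups fall outside the 95% confidence envelope
- average glucose level does not follow a normal distribution in either group
The sample is large, with n > 30 in both groups (n = 4861 and n = 249)
- by the central limit theorem (CLT), the sampling distribution of the mean is approximately normal, so the t-test can still be applied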
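The CLT argument can be illustrated with a small simulation. The sketch below does not use the stroke data: it assumes a right-skewed log-normal population as a stand-in (the `meanlog` and `sdlog` values are arbitrary choices), and `skew()` is a hypothetical helper defined here for a rough moment-based skewness.

```r
set.seed(1324)

# Rough moment-based skewness (hypothetical helper, not from a package)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Assumed right-skewed stand-in population (not the stroke data)
population <- rlnorm(100000, meanlog = 4.6, sdlog = 0.4)

# Means of 2000 repeated samples of size 249 (the smaller group size)
sample_means <- replicate(2000, mean(sample(population, size = 249)))

# Skewness of the raw values versus skewness of the sample means
c(raw = skew(population), means = skew(sample_means))
```

Under these assumptions the raw values are clearly skewed while the sample means are close to symmetric, which is the behaviour the t-test relies on.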

Hypothesis Testing and CI Cont.

Use Levene’s test to check the assumption that the population variances are homogeneous

leveneTest(avg_glucose_level ~ stroke, data = stroke_ds)

Null hypothesis of Levene’s test H0: \(\sigma_1^2 = \sigma_2^2\)
Alternative hypothesis of Levene’s test HA: \(\sigma_1^2 \ne \sigma_2^2\)
Since p = 4.63e-22 < 0.05, H0 of Levene’s test is rejected
- population variances are not homogeneous
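As an informal companion to Levene’s test (not a replacement for it), the ratio of the two sample variances can be computed from the standard deviations in the summary table. This is a sketch only; `var_ratio` is an illustrative name.

```r
# Group standard deviations as reported in the summary table
sd_no  <- 43.84607   # non-stroke patients
sd_yes <- 61.92106   # stroke patients

# Ratio of sample variances, stroke group over non-stroke group
var_ratio <- (sd_yes / sd_no)^2
var_ratio
```

A ratio of roughly two points in the same direction as the test result: the spread of glucose levels is noticeably larger among stroke patients.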

Hypothesis Testing and CI Cont.

Use the two-sample Welch t-test for unequal variances, where \(\mu_1\) is the population mean average glucose level of non-stroke patients and \(\mu_2\) that of stroke patients
H0: The mean average glucose level of stroke patients is equal to that of non-stroke patients \[H_0: \mu_1 - \mu_2 = 0\]
HA: The mean average glucose level of stroke patients differs from that of non-stroke patients \[H_A: \mu_1 - \mu_2 \ne 0\]

results <- t.test(
  avg_glucose_level ~ stroke,
  data = stroke_ds,
  var.equal = FALSE,
  alternative = "two.sided"
)
results

## 
##  Welch Two Sample t-test
## 
## data:  avg_glucose_level by stroke
## t = -6.9824, df = 260.89, p-value = 2.401e-11
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -35.57474 -19.92371
## sample estimates:
##  mean in group No mean in group Yes 
##          104.7955          132.5447
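The Welch statistic, its Welch-Satterthwaite degrees of freedom and the confidence interval can be reproduced by hand from the summary statistics in the descriptive table. The sketch below is a cross-check under that assumption, not the method used to produce the output above; all variable names are illustrative.

```r
# Summary statistics copied from the descriptive table
m_no  <- 104.7955; s_no  <- 43.84607; n_no  <- 4861
m_yes <- 132.5447; s_yes <- 61.92106; n_yes <- 249

v_no  <- s_no^2  / n_no    # squared standard error, non-stroke group
v_yes <- s_yes^2 / n_yes   # squared standard error, stroke group
se    <- sqrt(v_no + v_yes)

# Welch t statistic for the difference (No minus Yes)
t_stat <- (m_no - m_yes) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_no + v_yes)^2 / (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_val <- 2 * pt(-abs(t_stat), df_welch)
ci    <- (m_no - m_yes) + c(-1, 1) * qt(0.975, df_welch) * se

round(c(t = t_stat, df = df_welch, lower = ci[1], upper = ci[2]), 2)
```

The values should agree with the `t.test()` output up to rounding of the inputs.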

Hypothesis Testing and CI Cont.

Use the p-value and the 95% CI for the difference in means to make a decision about the null hypothesis

results$p.value

## [1] 2.401437e-11

results$conf.int

## [1] -35.57474 -19.92371
## attr(,"conf.level")
## [1] 0.95

Results Summary:
- Estimated difference between means (non-stroke minus stroke): 104.7955 - 132.5447 = -27.75
- p-value = 2.401e-11, significance level \(\alpha\)=0.05, p<\(\alpha\)
- t(df=260.89) = -6.98
- 95% CI for the difference in means [-35.57, -19.92] does not capture the H0 value of 0
Reject H0. The Welch two-sample t-test result was statistically significant
Stroke patients and non-stroke patients have a statistically significant difference in mean average glucose level

Categorical association

Use the Chi-square test of association with the hypotheses below

Null Hypothesis H0: there is no association between residence type and stroke occurrence

Alternative Hypothesis HA: there is an association between residence type and stroke occurrence

Assumption: no more than 25% of cells have expected counts below 5

chi2 <- chisq.test(table(stroke_ds$stroke,stroke_ds$Residence_type))

Observed counts

chi2$observed

##      
##       Rural Urban
##   No   2400  2461
##   Yes   114   135

Expected counts

chi2$expected

##      
##           Rural     Urban
##   No  2391.4978 2469.5022
##   Yes  122.5022  126.4978

Expected counts are greater than 5 in all cells, so the assumption is met
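The expected counts follow from the observed table as (row total × column total) / grand total. A minimal sketch of that calculation, with the observed counts typed in by hand and illustrative object names:

```r
# Observed counts (rows: stroke status, columns: residence type)
observed <- matrix(c(2400, 114, 2461, 135), nrow = 2,
                   dimnames = list(stroke = c("No", "Yes"),
                                   residence = c("Rural", "Urban")))

# Expected count per cell under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Share of cells with an expected count below 5
mean(expected < 5)
```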

Categorical association Cont.

Results of Chi-square test

chi2

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(stroke_ds$stroke, stroke_ds$Residence_type)
## X-squared = 1.0816, df = 1, p-value = 0.2983

df = (r-1)(c-1) = (2-1)(2-1) = 1, X-squared = 1.0816
p-value = 0.2983, \(\alpha\)=0.05, p>\(\alpha\)
Fail to reject H0. The Chi-square test of association was not statistically significant.
There is no statistically significant evidence of an association between residence type and stroke
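The reported statistic includes Yates’ continuity correction, which subtracts 0.5 from each absolute difference between observed and expected counts before squaring. The sketch below recomputes it by hand as a cross-check; `x2` and `p` are illustrative names.

```r
# Observed counts, filled column by column (Rural, then Urban)
observed <- matrix(c(2400, 114, 2461, 135), nrow = 2)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates-corrected chi-square statistic and its p-value on 1 degree of freedom
x2 <- sum((abs(observed - expected) - 0.5)^2 / expected)
p  <- pchisq(x2, df = 1, lower.tail = FALSE)

round(c(X2 = x2, p = p), 4)
```

The result should match the `chisq.test()` output up to rounding.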

Discussion

We used the Stroke Prediction Dataset from Kaggle
The primary focus was on the features average glucose level, residence type and stroke
Summary statistics show that the mean average glucose levels of stroke and non-stroke patients are 132.54 and 104.80 respectively
The clustered bar chart shows similar proportions of stroke/non-stroke patients living in rural and urban areas
A two-sample Welch t-test was used to test whether the mean average glucose levels of stroke and non-stroke patients differ significantly
- Although the average glucose level in both groups showed evidence of non-normality on the normal Q-Q plots, the central limit theorem justified applying the t-test because of the large sample size in each group
- Levene’s test of homogeneity of variance indicated that the assumption of equal variance was violated, with p=4.63e-22 < 0.05
- Welch’s test assuming unequal variance found a statistically significant difference between the mean average glucose levels of stroke and non-stroke patients, with t(df=260.89) = -6.98, p<0.001, 95% CI for the difference in means [-35.57, -19.92]
- The results suggest that stroke patients have a significantly higher mean average blood glucose level than non-stroke patients

Discussion Cont.

A Chi-square test of association was used to test for a statistically significant association between residence type and stroke
- The test found no statistically significant association, with χ2(df=1) = 1.08, p = 0.298 > 0.05
- This suggests there is no evidence that living in a rural or urban area is related to the chance of having a stroke
Limitations:
- the dataset is imbalanced, with stroke cases in the minority (249 of 5,110)
- estimates for the stroke subgroup therefore carry higher uncertainty
Future work:
- use resampling techniques such as SMOTE to make the dataset more balanced before analysis
- perform regression analysis to investigate how factors such as bmi, glucose level and age relate to stroke

References

Feigin, V. L., Brainin, M., Norrving, B., Martins, S. O., Pandian, J., Lindsay, P., Grupper, M. F., & Rautalin, I. (2025). World Stroke Organization: Global Stroke Fact Sheet 2025. International Journal of Stroke, 20(2), 132–144. https://doi.org/10.1177/17474930241308142
fedesoriano. (2021, January 26). Stroke Prediction Dataset. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
Xie, Y., Allaire, J. J., & Grolemund, G. (2023, December 30). R Markdown: The Definitive Guide. https://bookdown.org/yihui/rmarkdown/

MATH1324 Applied Analytics Assignment 2

Statistical Analysis Project of Stroke Data
