Homework assignment 1 at course MVA

Data import

mydata <- read.table("./healthcare-dataset-stroke-data.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1

Data description

The unit of observation is an individual who either had or did not have a stroke.

The data set contains 5110 observations of 12 variables.

Description of data:

id: unique identifier
gender: “Male”, “Female” or “Other”
age: age of the patient
hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
ever_married: “No” or “Yes”
work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
Residence_type: “Rural” or “Urban”
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
stroke: 1 if the patient had a stroke or 0 if not

*Note: “Unknown” in smoking_status means that the information is unavailable for this patient

The data set was published by FEDESORIANO on Kaggle.com.

mydata <- mydata[mydata$bmi != "N/A", ]
mydata <- mydata[mydata$smoking_status != "Unknown", ]
mydata <- mydata[mydata$gender != "Other", ]
head(mydata)

##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
## 7 53882   Male  74            1             1          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1
## 7          Rural             70.09 27.4    never smoked      1

I removed units with a bmi of N/A, a smoking_status of Unknown, or a gender of Other.
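The filtering step can be illustrated on a small toy data frame. This is only a sketch: the rows and the names `toy`, `keep` and `toy_clean` are made up for illustration and are not part of the stroke data set.

```r
# Toy data (hypothetical values, not taken from the stroke data set)
toy <- data.frame(
  gender = c("Male", "Female", "Other", "Female"),
  bmi = c("36.6", "N/A", "22.1", "24"),
  smoking_status = c("smokes", "never smoked", "never smoked", "Unknown"),
  stringsAsFactors = FALSE
)

# Keep only rows that pass all three conditions at once
keep <- toy$bmi != "N/A" &
  toy$smoking_status != "Unknown" &
  toy$gender != "Other"
toy_clean <- toy[keep, ]

# bmi is stored as text because of the "N/A" marker; convert it once the marker is gone
toy_clean$bmi <- as.numeric(toy_clean$bmi)
nrow(toy_clean)
```

Combining the conditions with `&` gives the same rows as filtering three times in a row, as done above.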

mydata$gender <- factor(mydata$gender)
mydata$ever_married <- factor(mydata$ever_married)
mydata$work_type <- factor(mydata$work_type)
mydata$Residence_type <- factor(mydata$Residence_type)
mydata$bmi <- as.numeric(as.character(mydata$bmi))
mydata$smoking_status <- factor(mydata$smoking_status)
mydata$stroke <- factor(mydata$stroke)
mydata$hypertension <- factor(mydata$hypertension)
mydata$heart_disease <- factor(mydata$heart_disease)

summary(mydata[-1])

##     gender          age        hypertension heart_disease ever_married
##  Female:2086   Min.   :10.00   0:3017       0:3219        No : 826    
##  Male  :1339   1st Qu.:34.00   1: 408       1: 206        Yes:2599    
##                Median :50.00                                          
##                Mean   :48.65                                          
##                3rd Qu.:63.00                                          
##                Max.   :82.00                                          
##          work_type    Residence_type avg_glucose_level      bmi       
##  children     :  68   Rural:1680     Min.   : 55.12    Min.   :11.50  
##  Govt_job     : 514   Urban:1745     1st Qu.: 77.23    1st Qu.:25.30  
##  Never_worked :  14                  Median : 92.35    Median :29.10  
##  Private      :2200                  Mean   :108.31    Mean   :30.29  
##  Self-employed: 629                  3rd Qu.:116.20    3rd Qu.:34.10  
##                                      Max.   :271.74    Max.   :92.00  
##          smoking_status stroke  
##  formerly smoked: 836   0:3245  
##  never smoked   :1852   1: 180  
##  smokes         : 737           
##                                 
##                                 
##

In the previous chunk I converted the categorical variables to factors. The mean age is 48.65 years and the median glucose level is 92.35. The majority of participants are female (2086 of 3425).

mydata$stroke <- factor(mydata$stroke, 
                             levels = c(0, 1), 
                             labels = c("Not stroke", "Stroke"))
mydata$heart_disease <- factor(mydata$heart_disease, 
                             levels = c(0, 1), 
                             labels = c("No disease", "Disease"))
mydata$hypertension <- factor(mydata$hypertension, 
                             levels = c(0, 1), 
                             labels = c("Not hypertensive", "Hypertensive"))

head(mydata,5)

##      id gender age     hypertension heart_disease ever_married     work_type
## 1  9046   Male  67 Not hypertensive       Disease          Yes       Private
## 3 31112   Male  80 Not hypertensive       Disease          Yes       Private
## 4 60182 Female  49 Not hypertensive    No disease          Yes       Private
## 5  1665 Female  79     Hypertensive    No disease          Yes Self-employed
## 6 56669   Male  81 Not hypertensive    No disease          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked Stroke
## 3          Rural            105.92 32.5    never smoked Stroke
## 4          Urban            171.23 34.4          smokes Stroke
## 5          Rural            174.12 24.0    never smoked Stroke
## 6          Urban            186.21 29.0 formerly smoked Stroke

I replaced the 0/1 codes with descriptive labels.

library(psych)
describeBy(mydata[, c(3, 9,10)])

## Warning in describeBy(mydata[, c(3, 9, 10)]): no grouping variable requested

##                   vars    n   mean    sd median trimmed   mad   min    max
## age                  1 3425  48.65 18.85  50.00   48.77 22.24 10.00  82.00
## avg_glucose_level    2 3425 108.31 47.71  92.35  100.01 27.15 55.12 271.74
## bmi                  3 3425  30.29  7.30  29.10   29.65  6.38 11.50  92.00
##                    range  skew kurtosis   se
## age                72.00 -0.06    -0.96 0.32
## avg_glucose_level 216.62  1.47     1.20 0.82
## bmi                80.50  1.20     3.43 0.12

The minimum age is 10 and the maximum is 82. The mean BMI is 30.29, with a standard deviation of 7.30.

1.1 Research question 1: Is there a difference in Body Mass Index (BMI) between people who had a stroke and people who did not?

1.1.1 Analysis

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = bmi)) +
  geom_histogram(binwidth = 5, colour = "orange", fill = "red") +
  ylab("Frequency") + 
  xlab("Distribution of BMI")

shapiro.test(mydata$bmi)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$bmi
## W = 0.9396, p-value < 2.2e-16

First we check whether BMI is normally distributed. Based on the histogram and the Shapiro-Wilk test (p < 0.001), we reject the null hypothesis of normality: Body Mass Index is not normally distributed.
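The test above uses all BMI values together. As a complementary sketch, normality can also be checked separately within each group, which is the form the t-test assumption takes. The example uses simulated values, and the names `sim` and `p_by_group` are assumptions made for illustration.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated, right-skewed values standing in for BMI (hypothetical data)
sim <- data.frame(
  bmi = rlnorm(300, meanlog = log(29), sdlog = 0.25),
  group = factor(rep(c("Not stroke", "Stroke"), times = c(250, 50)))
)

# Shapiro-Wilk p-value computed within each group
p_by_group <- tapply(sim$bmi, sim$group, function(x) shapiro.test(x)$p.value)
p_by_group
```

On the real data the same call would use `mydata$bmi` and `mydata$stroke` in place of the simulated columns.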

wilcox.test(mydata$bmi ~ mydata$stroke,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata$bmi by mydata$stroke
## W = 274899, p-value = 0.1841
## alternative hypothesis: true location shift is not equal to 0

First, we run the non-parametric Wilcoxon rank sum test. We cannot reject the null hypothesis (p = 0.1841): there is no significant difference in BMI between people who had a stroke and those who did not.

t.test(mydata$bmi ~ mydata$stroke, 
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata$bmi by mydata$stroke
## t = -0.77736, df = 206.84, p-value = 0.4378
## alternative hypothesis: true difference in means between group Not stroke and group Stroke is not equal to 0
## 95 percent confidence interval:
##  -1.3410707  0.5825756
## sample estimates:
## mean in group Not stroke     mean in group Stroke 
##                 30.27242                 30.65167

Then I ran a parametric Welch t-test. Here too we cannot reject the null hypothesis (p = 0.4378): there is no significant difference in mean BMI between people who had a stroke and those who did not.

Because BMI is not normally distributed, we rely on the results of the Wilcoxon rank sum test.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(mydata$bmi ~ mydata$stroke,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.06             | [-0.14, 0.03]

interpret_rank_biserial(0.06, rules = "funder2019")

## [1] "very small"
## (Rules: funder2019)

The effect size is very small: BMI differs very little between individuals who had a stroke and those who did not.
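The rank-biserial value can be checked by hand from the test statistic, as a rough sketch. It assumes that W is the statistic for the first group ("Not stroke") and takes the group sizes from the summary output above (3245 and 180); the variable names are chosen for illustration.

```r
# Values copied from the output above
W  <- 274899   # Wilcoxon rank sum statistic
n1 <- 3245     # size of the "Not stroke" group
n2 <- 180      # size of the "Stroke" group

# Rank-biserial correlation: W / (n1 * n2) is the share of pairs in which
# the first group has the higher value; rescale it from [0, 1] to [-1, 1]
r_rb <- 2 * W / (n1 * n2) - 1
round(r_rb, 2)
```

The result rounds to -0.06, in line with the value reported by `effectsize()`.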

1.1.2 Conclusion

Based on the statistical analysis, there is no significant difference in BMI between people who had a stroke and people who did not (p = 0.1841), and the effect size is very small (r = -0.06), indicating a negligible practical difference.

2.1 Research question 2: Is there a correlation between Body Mass Index (BMI) and average glucose level?

2.1.1 Analysis

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydata[ , c(9,10)], smooth=FALSE)

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ , c(9,10)]), 
      type = "pearson")

##                   avg_glucose_level  bmi
## avg_glucose_level              1.00 0.16
## bmi                            0.16 1.00
## 
## n= 3425 
## 
## 
## P
##                   avg_glucose_level bmi
## avg_glucose_level                    0 
## bmi                0

cor(mydata$bmi,mydata$avg_glucose_level,
    method="pearson")

## [1] 0.1566746

I computed the correlation matrix and found a Pearson correlation coefficient of 0.1567. The p-value is printed as 0 because it is rounded; it is below 0.001, so we reject the null hypothesis of no correlation.
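Since the p-value is printed as 0, it can be recovered from the coefficient and the sample size. The following is a minimal sketch using the usual t-transformation of a Pearson correlation; the variable names are chosen for illustration.

```r
# Values copied from the output above
r <- 0.1566746   # Pearson correlation coefficient
n <- 3425        # number of observations

# Under the null hypothesis of zero correlation, this statistic follows
# a t-distribution with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
c(t = t_stat, p = p_value)
```

Calling `cor.test(mydata$bmi, mydata$avg_glucose_level)` should report the same test together with a confidence interval.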

2.1.2 Conclusion

There is a statistically significant correlation between Body Mass Index (BMI) and average glucose level (p < 0.001). The correlation coefficient is 0.1567, so the correlation is positive but weak.

3.1 Research question 3: Is there any association between gender and having a heart disease?

3.1.1 Analysis

H0: There is no association between gender and heart disease.

H1: There is an association between gender and heart disease.

results2 <- chisq.test(mydata$gender, mydata$heart_disease, 
                      correct = TRUE)

results2

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$gender and mydata$heart_disease
## X-squared = 34.646, df = 1, p-value = 3.955e-09

We reject H0 (p < 0.001). There is an association between gender and having a heart disease.

addmargins(results2$observed,2)

##              mydata$heart_disease
## mydata$gender No disease Disease  Sum
##        Female       2001      85 2086
##        Male         1218     121 1339

round(results2$expected,2)

##              mydata$heart_disease
## mydata$gender No disease Disease
##        Female    1960.54  125.46
##        Male      1258.46   80.54

The assumption of the test is met, as all expected frequencies are greater than 5.
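The expected frequencies can be reproduced from the observed table, since each one equals row total times column total divided by the grand total. A small sketch, with the counts copied from the observed table above and the object names chosen for illustration:

```r
# Observed counts copied from the table above
observed <- matrix(
  c(2001, 85,
    1218, 121),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"),
                  heart_disease = c("No disease", "Disease"))
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Check the rule of thumb that every expected count exceeds 5
all(expected > 5)
```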

round(results2$res,2)

##              mydata$heart_disease
## mydata$gender No disease Disease
##        Female       0.91   -3.61
##        Male        -1.14    4.51

oddsratio(mydata$gender,mydata$heart_disease)

## Odds ratio |       95% CI
## -------------------------
## 2.34       | [1.76, 3.11]

interpret_oddsratio(2.34)

## [1] "small"
## (Rules: chen2010)
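The odds ratio can also be checked by hand from the observed counts. This sketch assumes the ratio compares the odds of heart disease for men with those for women, which is the orientation that reproduces the reported value, and it uses a Wald interval on the log scale; `effectsize` may compute its interval differently.

```r
# Observed counts copied from the table above
male_disease   <- 121;  male_healthy   <- 1218
female_disease <- 85;   female_healthy <- 2001

# Odds of heart disease in each group, and their ratio
odds_male   <- male_disease / male_healthy
odds_female <- female_disease / female_healthy
or <- odds_male / odds_female

# Wald 95% confidence interval, built on the log odds ratio
se_log_or <- sqrt(1 / male_disease + 1 / male_healthy +
                  1 / female_disease + 1 / female_healthy)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(odds_ratio = or, lower = ci[1], upper = ci[2]), 2)
```

The odds of heart disease for men are therefore a bit more than twice those for women in this sample.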

3.1.2 Conclusion

Men have heart disease more often than expected under independence, and women less often than expected; both deviations are significant at alpha = 0.001. The effect size is small (odds ratio 2.34), meaning that the association between gender and having a heart disease is weak.

Nik Potokar

2025-01-10