mydata <- read.csv("./healthcare-dataset-stroke-data.csv")
head(mydata)
## id gender age ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
## 1 9046 Male 67 Yes Private Urban 228.69 36.6 formerly smoked 1
## 2 51676 Female 61 Yes Self-employed Rural 202.21 N/A never smoked 1
## 3 31112 Male 80 Yes Private Rural 105.92 32.5 never smoked 1
## 4 60182 Female 49 Yes Private Urban 171.23 34.4 smokes 1
## 5 1665 Female 79 Yes Self-employed Rural 174.12 24 never smoked 1
## 6 56669 Male 81 Yes Private Urban 186.21 29 formerly smoked 1
id: unique identifier for each individual in the data set.
gender: gender of the individual, categories: (“Male” or “Female”).
age: the age of the patient.
ever_married: whether the individual has ever been married, categories: (“No” or “Yes”).
work_type: the type of work the individual is engaged in, categories: (“children,” “Govt_job,” “Never_worked,” “Private,” or “Self-employed”).
Residence_type: the type of residence of the individual, categories (“Rural” and “Urban”).
avg_glucose_level: the average glucose level in the blood of the individual.
bmi: body mass index, offering a numerical assessment of the individual’s body weight in relation to their height.
smoking_status: smoking habits of the individual, categories (“formerly smoked,” “never smoked,” “smokes,” or “Unknown”).
stroke: whether the patient has experienced a stroke. (1 - the patient had a stroke, 0 - the patient has not had a stroke).
library(tidyr) #Provides drop_na()
mydata$bmi[mydata$bmi == "N/A"] <- NA #"N/A" strings mark missing BMI values
mydata$smoking_status[mydata$smoking_status == "Unknown"] <- NA #Treating "Unknown" as missing
mydata <- drop_na(mydata) #Dropping units with NA
head(mydata)
## id gender age ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
## 1 9046 Male 67 Yes Private Urban 228.69 36.6 formerly smoked 1
## 2 31112 Male 80 Yes Private Rural 105.92 32.5 never smoked 1
## 3 60182 Female 49 Yes Private Urban 171.23 34.4 smokes 1
## 4 1665 Female 79 Yes Self-employed Rural 174.12 24 never smoked 1
## 5 56669 Male 81 Yes Private Urban 186.21 29 formerly smoked 1
## 6 53882 Male 74 Yes Private Rural 70.09 27.4 never smoked 1
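As an alternative sketch, the missing-value codes can be declared while reading the file, so that bmi is read as numeric straight away. The toy CSV below is made up for illustration and only mimics a few columns of the stroke data; the object names are hypothetical.

```r
# Toy CSV standing in for the stroke data (values are made up for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,bmi,smoking_status,stroke",
  "1,36.6,formerly smoked,1",
  "2,N/A,never smoked,1",
  "3,32.5,Unknown,0",
  "4,24,smokes,0"
), tmp)

# na.strings lists the text codes that read.csv should turn into NA
toy <- read.csv(tmp, na.strings = c("N/A", "Unknown"))

# Count missing values per column before dropping anything
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only complete rows (base R counterpart of tidyr::drop_na)
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```

Counting the missing values first shows how many units each recoding step removes from the analysis.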
mydata$strokeF <- factor(mydata$stroke,
levels = c(0, 1),
labels = c("No", "Yes")) #Creating a factor
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 300), ] #Choosing random sample of 300 units
head(mydata)
## id gender age ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke strokeF
## 1017 42902 Male 35 Yes Private Rural 102.34 34.3 never smoked 0 No
## 679 47924 Male 24 No Private Urban 59.28 43.2 never smoked 0 No
## 2177 4280 Female 51 Yes Govt_job Rural 105.52 30.8 never smoked 0 No
## 930 67309 Male 47 Yes Private Rural 86.37 39.2 smokes 0 No
## 1533 5863 Female 71 Yes Private Urban 240.81 27.4 never smoked 0 No
## 471 54338 Female 58 Yes Govt_job Rural 77.46 27.6 never smoked 0 No
mydata$bmi <- as.numeric(mydata$bmi)
mydata$gender <- as.factor(mydata$gender)
mydata$ever_married <- as.factor(mydata$ever_married)
mydata$work_type <- as.factor(mydata$work_type)
mydata$Residence_type <- as.factor(mydata$Residence_type)
mydata$smoking_status <- as.factor(mydata$smoking_status)
mydata$stroke <- as.factor(mydata$stroke)
mydata$strokeF <- as.factor(mydata$strokeF)
summary(mydata) #Descriptive statistics
## id             gender      age            ever_married  work_type          Residence_type  avg_glucose_level
## Min.   :  187  Female:187  Min.   :10.00  No : 70       children     : 12  Rural:146       Min.   : 55.58
## 1st Qu.:20663  Male  :113  1st Qu.:32.75  Yes:230       Govt_job     : 44  Urban:154       1st Qu.: 77.22
## Median :40061              Median :50.00                Never_worked :  1                  Median : 92.31
## Mean   :37982              Mean   :48.06                Private      :203                  Mean   :108.39
## 3rd Qu.:55171              3rd Qu.:63.00                Self-employed: 40                  3rd Qu.:117.64
## Max.   :72340              Max.   :82.00                                                   Max.   :271.74
## bmi            smoking_status       stroke  strokeF
## Min.   :11.50  formerly smoked: 59  0:286   No :286
## 1st Qu.:26.00  never smoked   :171  1: 14   Yes: 14
## Median :29.50  smokes         : 70
## Mean   :30.55
## 3rd Qu.:34.35
## Max.   :60.20
Min: The minimum value for average glucose level is 55.58.
1st Qu (First Quartile): 25% of the data falls below this value and 75% falls above this value. The 1st quartile for average glucose level is 77.22.
Median: 50% of the data falls below this value. The median for average glucose level is 92.31.
Mean: The average glucose level is 108.39.
3rd Qu (Third Quartile): 75% of the data falls below this value and 25% above this value. The 3rd quartile for average glucose level is 117.64.
Max: The maximum value for average glucose level is 271.74.
The sample contains 187 female and 113 male patients.
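To make the quartile definitions concrete, here is a minimal sketch on a made-up, right-skewed vector (the values are illustrative, not taken from the stroke data). It also shows why the mean can exceed the median, as it does for average glucose level.

```r
# Made-up, right-skewed values for illustration only
x <- c(55, 70, 80, 90, 100, 120, 270)

# Quartiles: 25%, 50% (median) and 75% of the data fall below these values
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# A single large value pulls the mean above the median
print(c(mean = mean(x), median = median(x)))
```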
Assumptions
Both variables are numeric - we are analyzing age and average glucose level, so this assumption is met.
Errors are normally distributed - with n = 300 the sample is large, so we assume normality.
There is a linear relationship between the variables - this is checked with the scatterplot below.
library(car) #Creating a scatterplot to check linear relationship
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
scatterplot(mydata$age, mydata$avg_glucose_level,
xlab = "Age",
ylab = "Average glucose level",
boxplots = FALSE,
smooth = FALSE)
For educational purposes, we will assume a linear relationship between the variables and continue with the Pearson correlation coefficient.
library(Hmisc) #Correlation - Pearson
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c("age", "avg_glucose_level")]), #Selecting columns by name rather than position
type = "pearson")
## age avg_glucose_level
## age 1.00 0.29
## avg_glucose_level 0.29 1.00
##
## n= 300
##
##
## P
##                   age avg_glucose_level
## age                    0
## avg_glucose_level  0
H0: rho(age, average glucose level) = 0
H1: rho(age, average glucose level) != 0
We reject H0 at p < 0.001. In this sample, age and average glucose level are correlated.
The linear relationship between age and average glucose level is positive and weak (r = 0.29).
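The printed p-value of 0 is a rounded value. As a rough check, the test statistic can be recomputed from the reported coefficient and sample size using t = r * sqrt(n - 2) / sqrt(1 - r^2). This sketch uses the rounded r = 0.29, so the result is only approximate.

```r
r <- 0.29  # rounded Pearson coefficient reported above
n <- 300   # sample size reported above

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p_value))
```

On the full data, cor.test(mydata$age, mydata$avg_glucose_level) would report the same test with the unrounded coefficient.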
mydata2 <- read.csv("./students_adaptability_level_online_education.csv")
head(mydata2)
## Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1 Boy University Non Government Mid Wifi 4G Tab High
## 2 Girl University Non Government Mid Mobile Data 4G Mobile High
## 3 Girl College Government Mid Wifi 4G Mobile High
## 4 Girl School Non Government Mid Mobile Data 4G Mobile High
## 5 Girl School Non Government Poor Mobile Data 3G Mobile Low
## 6 Boy School Non Government Poor Mobile Data 3G Mobile Low
Gender: gender of the individual, categories: (“Boy” or “Girl”).
Education.Level: the level of education of the individual, categories: (“University”, “College”, “School”).
Institution.Type: whether the institution is public or private, categories: (“Non Government”, “Government”).
FiNAncial.Condition: the financial condition of the individual, categories: (“Mid”, “Poor”, “Rich”).
Internet.Type: the type of internet the individual is using, categories: (“Wifi”, “Mobile Data”).
Network.Type: the type of network the individual is using, categories: (“2G”, “3G”, “4G”).
Device: the type of device the individual is using, categories: (“Tab”, “Mobile”, “Computer”).
Adaptivity.Level: level of adaptability to online education of the individual, categories: (“Low”, “High”).
library(tidyr)
mydata2 <- drop_na(mydata2) #Dropping units with NA
head(mydata2)
## Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1 Boy University Non Government Mid Wifi 4G Tab High
## 2 Girl University Non Government Mid Mobile Data 4G Mobile High
## 3 Girl College Government Mid Wifi 4G Mobile High
## 4 Girl School Non Government Mid Mobile Data 4G Mobile High
## 5 Girl School Non Government Poor Mobile Data 3G Mobile Low
## 6 Boy School Non Government Poor Mobile Data 3G Mobile Low
set.seed(1)
mydata2 <- mydata2[sample(nrow(mydata2), 350), ] #Choosing random sample of 350 units
head(mydata2)
## Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1017 Boy School Non Government Mid Mobile Data 3G Mobile Low
## 679 Boy University Government Poor Mobile Data 4G Mobile Low
## 129 Boy University Government Mid Wifi 4G Computer Low
## 930 Girl School Non Government Mid Mobile Data 3G Mobile High
## 471 Girl School Non Government Mid Wifi 4G Mobile Low
## 299 Girl College Non Government Mid Wifi 4G Mobile Low
mydata2$Gender <- as.factor(mydata2$Gender)
mydata2$Education.Level <- as.factor(mydata2$Education.Level)
mydata2$Institution.Type <- as.factor(mydata2$Institution.Type)
mydata2$FiNAncial.Condition <- as.factor(mydata2$FiNAncial.Condition)
mydata2$Internet.Type <- as.factor(mydata2$Internet.Type)
mydata2$Network.Type <- as.factor(mydata2$Network.Type)
mydata2$Device <- as.factor(mydata2$Device)
mydata2$Adaptivity.Level <- as.factor(mydata2$Adaptivity.Level)
summary(mydata2) #Descriptive statistics
## Gender    Education.Level  Institution.Type    FiNAncial.Condition  Internet.Type    Network.Type  Device
## Boy :188  College   : 73   Government    :121  Mid :265             Mobile Data:211  2G:  6        Computer: 36
## Girl:162  School    :151   Non Government:229  Poor: 66             Wifi       :139  3G:127        Mobile  :304
##           University:126                       Rich: 19                              4G:217        Tab     : 10
## Adaptivity.Level
## High:195
## Low :155
##
The sample contains 162 girls and 188 boys.
In the sample, 121 students attend government-owned institutions and 229 attend non-government-owned institutions.
In the sample, 195 students have a high adaptability level to online education and 155 have a low adaptability level.
Assumptions:
Observations must be independent - this assumption is met.
All expected frequencies must be greater than 5 - this will be checked later.
In larger contingency tables (at least one categorical variable has more than two categories), up to 20% of the expected frequencies can be between 1 and 5, but this reduces the power of the test.
If conditions 2 and 3 are not met, or if any expected frequency is less than 1, only Fisher’s exact probability test of independence (a nonparametric test) should be used.
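The rules above can be sketched as a small helper that looks at the expected frequencies of a contingency table and suggests a test. The function name choose_test and both toy tables are hypothetical and made up for illustration; this is one reading of the rules, not a standard R function.

```r
# Hypothetical helper: suggests "chisq" or "fisher" from the expected frequencies
choose_test <- function(tab) {
  # Expected frequency = row total * column total / grand total
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

  # Any expected frequency below 1: Fisher's exact test only
  if (any(expected < 1)) return("fisher")

  small_share <- mean(expected <= 5)   # share of cells that are not above 5
  larger_table <- any(dim(tab) > 2)    # more than two categories somewhere

  if (small_share == 0) return("chisq")
  if (larger_table && small_share <= 0.20) return("chisq")
  "fisher"
}

# Made-up 2x2 tables for illustration
toy_large_counts <- matrix(c(30, 50, 40, 60), nrow = 2)
toy_small_counts <- matrix(c(1, 3, 2, 1), nrow = 2)

print(choose_test(toy_large_counts))
print(choose_test(toy_small_counts))
```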
results <- chisq.test(mydata2$Institution.Type, mydata2$Adaptivity.Level,
correct = TRUE) #Pearson Chisquare test
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata2$Institution.Type and mydata2$Adaptivity.Level
## X-squared = 24.585, df = 1, p-value = 7.11e-07
H0: There is no association between institution type and adaptability level.
H1: There is an association between institution type and adaptability level.
We reject the null hypothesis at p < 0.001 and conclude that there is an association between the two categorical variables.
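To show where the test statistic comes from, the following sketch recomputes the Yates-corrected chi-squared value by hand from the observed counts of this sample (45, 76, 150 and 79). The object names are chosen for illustration.

```r
# Observed counts of this sample (rows: institution type, columns: adaptivity level)
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))

# Expected frequency under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each absolute deviation
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

print(c(chi_sq = chi_sq, dof = dof, p = p_value))
```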
addmargins(results$observed) #Observed frequencies of data
## mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low Sum
## Government 45 76 121
## Non Government 150 79 229
## Sum 195 155 350
round(results$expected, 2) #Expected frequencies of data
## mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low
## Government 67.41 53.59
## Non Government 127.59 101.41
The second assumption is met, since all expected frequencies are above 5 (the smallest is 53.59).
round(results$residuals, 2) #Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
## mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low
## Government -2.73 3.06
## Non Government 1.98 -2.23
There are fewer students than expected in the category Government and High (alpha = 0.01).
There are more students than expected in the category Government and Low (alpha = 0.01).
There are more students than expected in the category Non Government and High (alpha = 0.05).
There are fewer students than expected in the category Non Government and Low (alpha = 0.05).
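The residuals and the alpha levels quoted above can be reproduced by hand. This sketch assumes the usual normal cut-offs of about 1.96 (alpha = 0.05) and 2.58 (alpha = 0.01) and uses the observed counts of this sample.

```r
# Observed counts of this sample
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual: (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)

# Flag cells whose residual exceeds the two-sided normal cut-offs
flag_05 <- abs(pearson_res) > qnorm(0.975)  # about 1.96
flag_01 <- abs(pearson_res) > qnorm(0.995)  # about 2.58

print(round(pearson_res, 2))
print(flag_05)
print(flag_01)
```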
addmargins(round(prop.table(results$observed), 3)) #Proportion table 1
## mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low Sum
## Government 0.129 0.217 0.346
## Non Government 0.429 0.226 0.655
## Sum 0.558 0.443 1.001
Explanation of the number 0.217 (Government, Low) - Of all 350 students, 21.7% attend an institution (school/college/university) owned by the government and have a low adaptability level to online education.
Explanation of the number 0.429 (Non Government, High) - Of all 350 students, 42.9% attend an institution (school/college/university) not owned by the government and have a high adaptability level to online education.
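The total of 1.001 in proportion table 1 is a rounding artifact: the cells are rounded before the margins are added. A minimal sketch of the difference, using the observed counts of this sample:

```r
# Observed counts of this sample, stored as a table
observed <- as.table(matrix(c(45, 150, 76, 79), nrow = 2,
                            dimnames = list(Institution = c("Government", "Non Government"),
                                            Adaptivity = c("High", "Low"))))

# Rounding first: the margins are sums of already rounded cells
rounded_first <- addmargins(round(prop.table(observed), 3))

# Rounding last: the margins are computed from exact proportions
rounded_last <- round(addmargins(prop.table(observed)), 3)

print(rounded_first)
print(rounded_last)
```

Rounding as the last step keeps the grand total at exactly 1.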
addmargins(round(prop.table(results$observed, 1), 3), 2) #Proportion table 2
## mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low Sum
## Government 0.372 0.628 1.000
## Non Government 0.655 0.345 1.000
Explanation of the number 0.628 (Government, Low) - Of all students attending an institution (school/college/university) owned by the government, 62.8% have a low adaptability level to online education.
Explanation of the number 0.655 (Non Government, High) - Of all students attending an institution (school/college/university) not owned by the government, 65.5% have a high adaptability level to online education.
addmargins(round(prop.table(results$observed, 2), 3), 1) #Proportion table 3
## mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low
## Government 0.231 0.490
## Non Government 0.769 0.510
## Sum 1.000 1.000
Explanation of the number 0.490 (Government, Low) - Of all students with a low adaptability level to online education, 49% study at an institution (school/college/university) owned by the government.
Explanation of the number 0.769 (Non Government, High) - Of all students with a high adaptability level to online education, 76.9% study at an institution (school/college/university) not owned by the government.
library(effectsize)
effectsize::cramers_v(mydata2$Institution.Type, mydata2$Adaptivity.Level)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.27 | [0.18, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.27)
## [1] "medium"
## (Rules: funder2019)
Based on the sample data, we found an association between the type of institution students attend (government owned or non-government owned) and their level of adaptability to online education (low or high), at p < 0.001.
The effect size is medium (Cramér's V = 0.27), indicating a medium-sized association between institution ownership and the adaptability level to online education.
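For reference, the unadjusted Cramér's V can be computed by hand as sqrt(chi-squared / (n * (k - 1))), where k is the smaller of the number of rows and columns. This is a sketch with the observed counts of this sample; the effectsize output above is labelled "adj.", so it reports a bias-adjusted version that can differ slightly from this value.

```r
# Observed counts of this sample
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))

# Chi-squared statistic without continuity correction
chi_sq <- unname(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)           # number of students
k <- min(dim(observed)) - 1  # smaller table dimension minus 1

# Unadjusted Cramér's V
cramers_v <- sqrt(chi_sq / (n * k))
print(round(cramers_v, 2))
```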
Next, I run Fisher’s exact probability test (a nonparametric test) to show the steps, even though the assumptions of the chi-squared test were met.
fisher.test(mydata2$Institution.Type, mydata2$Adaptivity.Level)
##
## Fisher's Exact Test for Count Data
##
## data: mydata2$Institution.Type and mydata2$Adaptivity.Level
## p-value = 5.308e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1918893 0.5056299
## sample estimates:
## odds ratio
## 0.312948
interpret_oddsratio(0.31)
## [1] "small"
## (Rules: chen2010)
H0: The odds ratio is equal to 1.
H1: The odds ratio is not equal to 1.
Based on the sample data, we reject the null hypothesis at p < 0.001 and conclude that the odds ratio is not equal to 1: the odds of a high adaptability level to online education differ between students of government-owned and non-government-owned institutions. The estimated odds ratio of 0.31 means that the odds of high adaptability in government-owned institutions are about 0.31 times those in non-government-owned institutions.
The effect size is small, indicating a small effect of institution ownership (government owned or not) on the adaptability level to online education.
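To make the odds ratio tangible, this sketch computes the sample odds ratio directly from the observed counts of this sample. Note that fisher.test reports a conditional maximum likelihood estimate, which is why its value (0.3129) differs slightly from the simple cross-product ratio below.

```r
# Odds of high vs low adaptability within each institution type (observed counts)
odds_government <- 45 / 76
odds_non_government <- 150 / 79

# Sample odds ratio: Government relative to Non Government
odds_ratio <- odds_government / odds_non_government

# The reciprocal expresses the same effect from the other group's side
inverse_ratio <- 1 / odds_ratio

print(round(c(odds_ratio = odds_ratio, inverse = inverse_ratio), 3))
```

Read from the other side, the odds of high adaptability are roughly three times higher for students of non-government-owned institutions.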