Ilina Maksimovska

RQ1: Is there a correlation between patients’ age and their average glucose level based on the dataset?

RQ2: Is there an association between the type of institution attended by students (Government-owned or Non-Government-owned) and their level of adaptability to online education (Low or High) based on the dataset?

mydata <- read.csv("./healthcare-dataset-stroke-data.csv")

head(mydata)
##      id gender age ever_married     work_type Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1  9046   Male  67          Yes       Private          Urban            228.69 36.6 formerly smoked      1
## 2 51676 Female  61          Yes Self-employed          Rural            202.21  N/A    never smoked      1
## 3 31112   Male  80          Yes       Private          Rural            105.92 32.5    never smoked      1
## 4 60182 Female  49          Yes       Private          Urban            171.23 34.4          smokes      1
## 5  1665 Female  79          Yes Self-employed          Rural            174.12   24    never smoked      1
## 6 56669   Male  81          Yes       Private          Urban            186.21   29 formerly smoked      1

Explanation of the dataset:

Explanation of the variables in the data set:

library(tidyr) #Dropping units with NA
mydata$bmi[mydata$bmi == "N/A"] <- NA
mydata$smoking_status[mydata$smoking_status == "Unknown"] <- NA
mydata <- drop_na(mydata)
head(mydata)
##      id gender age ever_married     work_type Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1  9046   Male  67          Yes       Private          Urban            228.69 36.6 formerly smoked      1
## 2 31112   Male  80          Yes       Private          Rural            105.92 32.5    never smoked      1
## 3 60182 Female  49          Yes       Private          Urban            171.23 34.4          smokes      1
## 4  1665 Female  79          Yes Self-employed          Rural            174.12   24    never smoked      1
## 5 56669   Male  81          Yes       Private          Urban            186.21   29 formerly smoked      1
## 6 53882   Male  74          Yes       Private          Rural             70.09 27.4    never smoked      1
mydata$strokeF <- factor(mydata$stroke,
                         levels = c(0, 1),
                         labels = c("No", "Yes")) #Creating a factor
set.seed(1) 
mydata <- mydata[sample(nrow(mydata), 300), ] #Choosing random sample of 300 units
head(mydata)
##         id gender age ever_married work_type Residence_type avg_glucose_level  bmi smoking_status stroke strokeF
## 1017 42902   Male  35          Yes   Private          Rural            102.34 34.3   never smoked      0      No
## 679  47924   Male  24           No   Private          Urban             59.28 43.2   never smoked      0      No
## 2177  4280 Female  51          Yes  Govt_job          Rural            105.52 30.8   never smoked      0      No
## 930  67309   Male  47          Yes   Private          Rural             86.37 39.2         smokes      0      No
## 1533  5863 Female  71          Yes   Private          Urban            240.81 27.4   never smoked      0      No
## 471  54338 Female  58          Yes  Govt_job          Rural             77.46 27.6   never smoked      0      No
mydata$bmi <- as.numeric(mydata$bmi)
mydata$gender <- as.factor(mydata$gender)
mydata$ever_married <- as.factor(mydata$ever_married)
mydata$work_type <- as.factor(mydata$work_type)
mydata$Residence_type <- as.factor(mydata$Residence_type)
mydata$smoking_status <- as.factor(mydata$smoking_status)
mydata$stroke <- as.factor(mydata$stroke)
mydata$strokeF <- as.factor(mydata$strokeF)
summary(mydata) #Descriptive statistics
##        id           gender         age        ever_married         work_type   Residence_type avg_glucose_level
##  Min.   :  187   Female:187   Min.   :10.00   No : 70      children     : 12   Rural:146      Min.   : 55.58   
##  1st Qu.:20663   Male  :113   1st Qu.:32.75   Yes:230      Govt_job     : 44   Urban:154      1st Qu.: 77.22   
##  Median :40061                Median :50.00                Never_worked :  1                  Median : 92.31   
##  Mean   :37982                Mean   :48.06                Private      :203                  Mean   :108.39   
##  3rd Qu.:55171                3rd Qu.:63.00                Self-employed: 40                  3rd Qu.:117.64   
##  Max.   :72340                Max.   :82.00                                                   Max.   :271.74   
##       bmi                smoking_status stroke  strokeF  
##  Min.   :11.50   formerly smoked: 59    0:286   No :286  
##  1st Qu.:26.00   never smoked   :171    1: 14   Yes: 14  
##  Median :29.50   smokes         : 70                     
##  Mean   :30.55                                           
##  3rd Qu.:34.35                                           
##  Max.   :60.20

Descriptive statistics of the dataset

Correlation analysis

Assumptions

library(car) #Creating a scatterplot to check linear relationship 
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
scatterplot(mydata$age, mydata$avg_glucose_level, 
     xlab = "Age",
     ylab = "Average glucose level",
     boxplots = FALSE,
     smooth = FALSE)

For educational purposes, we will assume a linear relationship between the two variables and continue with the Pearson correlation coefficient.

library(Hmisc) #Correlation - Pearson
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[, c("age", "avg_glucose_level")]), #Selecting columns by name instead of position
      type = "pearson")
##                    age avg_glucose_level
## age               1.00              0.29
## avg_glucose_level 0.29              1.00
## 
## n= 300 
## 
## 
## P
##                   age avg_glucose_level
## age                    0               
## avg_glucose_level  0

H0: rho(age, average glucose level) = 0

H1: rho(age, average glucose level) != 0

We reject H0 at p < 0.001. There is a statistically significant correlation between Age and Average Glucose Level based on the dataset.

The linear relationship between Age and Average Glucose Level is positive and weak (r = 0.29).
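
The rcorr() output prints the p-value rounded to 0. As a rough check, the test statistic can be recomputed from the reported values. This is only a sketch: it assumes the rounded r = 0.29 and n = 300 from the output above, so the result is approximate, and the names r, n, t_stat and p_value are illustrative.

```r
# Values taken from the rcorr() output above (r is rounded to two decimals)
r <- 0.29
n <- 300

# Test statistic for H0: rho = 0; it follows a t distribution with n - 2 df
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 2)   # about 5.23
p_value < 0.001    # TRUE, consistent with rejecting H0
```

On the full data, cor.test(mydata$age, mydata$avg_glucose_level) would report the unrounded p-value together with a confidence interval for the correlation.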

Association between two categorical variables

mydata2 <- read.csv("./students_adaptability_level_online_education.csv")

head(mydata2)
##   Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1    Boy      University   Non Government                 Mid          Wifi           4G    Tab             High
## 2   Girl      University   Non Government                 Mid   Mobile Data           4G Mobile             High
## 3   Girl         College       Government                 Mid          Wifi           4G Mobile             High
## 4   Girl          School   Non Government                 Mid   Mobile Data           4G Mobile             High
## 5   Girl          School   Non Government                Poor   Mobile Data           3G Mobile              Low
## 6    Boy          School   Non Government                Poor   Mobile Data           3G Mobile              Low

Explanation of the dataset:

Explanation of the variables in the data set:

library(tidyr)
mydata2 <- drop_na(mydata2) #Dropping units with NA
head(mydata2)
##   Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1    Boy      University   Non Government                 Mid          Wifi           4G    Tab             High
## 2   Girl      University   Non Government                 Mid   Mobile Data           4G Mobile             High
## 3   Girl         College       Government                 Mid          Wifi           4G Mobile             High
## 4   Girl          School   Non Government                 Mid   Mobile Data           4G Mobile             High
## 5   Girl          School   Non Government                Poor   Mobile Data           3G Mobile              Low
## 6    Boy          School   Non Government                Poor   Mobile Data           3G Mobile              Low
set.seed(1) 
mydata2 <- mydata2[sample(nrow(mydata2), 350), ] #Choosing random sample of 350 units
head(mydata2)
##      Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type   Device Adaptivity.Level
## 1017    Boy          School   Non Government                 Mid   Mobile Data           3G   Mobile              Low
## 679     Boy      University       Government                Poor   Mobile Data           4G   Mobile              Low
## 129     Boy      University       Government                 Mid          Wifi           4G Computer              Low
## 930    Girl          School   Non Government                 Mid   Mobile Data           3G   Mobile             High
## 471    Girl          School   Non Government                 Mid          Wifi           4G   Mobile              Low
## 299    Girl         College   Non Government                 Mid          Wifi           4G   Mobile              Low
mydata2$Gender <- as.factor(mydata2$Gender)
mydata2$Education.Level <- as.factor(mydata2$Education.Level)
mydata2$Institution.Type <- as.factor(mydata2$Institution.Type)
mydata2$FiNAncial.Condition <- as.factor(mydata2$FiNAncial.Condition)
mydata2$Internet.Type <- as.factor(mydata2$Internet.Type)
mydata2$Network.Type <- as.factor(mydata2$Network.Type)
mydata2$Device <- as.factor(mydata2$Device)
mydata2$Adaptivity.Level <- as.factor(mydata2$Adaptivity.Level)
summary(mydata2) #Descriptive statistics
##   Gender      Education.Level       Institution.Type FiNAncial.Condition     Internet.Type Network.Type      Device   
##  Boy :188   College   : 73    Government    :121     Mid :265            Mobile Data:211   2G:  6       Computer: 36  
##  Girl:162   School    :151    Non Government:229     Poor: 66            Wifi       :139   3G:127       Mobile  :304  
##             University:126                           Rich: 19                              4G:217       Tab     : 10  
##  Adaptivity.Level
##  High:195        
##  Low :155        
## 

Descriptive statistics of the dataset:

Pearson Chi2 test

Assumptions:

If conditions 2 and 3 are not met, or if any of the expected frequencies is less than 1, only Fisher's Exact Probability Test of Independence (a nonparametric test) should be used.

results <- chisq.test(mydata2$Institution.Type, mydata2$Adaptivity.Level,
                      correct = TRUE) #Pearson Chisquare test

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata2$Institution.Type and mydata2$Adaptivity.Level
## X-squared = 24.585, df = 1, p-value = 7.11e-07

H0: There is no association between institution type and adaptability level.

H1: There is an association between institution type and adaptability level.

We reject the null hypothesis at p < 0.001 and conclude that there is an association between the two categorical variables.

addmargins(results$observed) #Observed frequencies of data
##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low Sum
##           Government       45  76 121
##           Non Government  150  79 229
##           Sum             195 155 350
round(results$expected, 2) #Expected frequencies of data
##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type   High    Low
##           Government      67.41  53.59
##           Non Government 127.59 101.41

The second assumption is met since all expected frequencies are above 5.
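
The expected frequencies and the test statistic can be reproduced by hand from the observed table. The sketch below is an illustration, not part of the original analysis: it re-enters the observed counts from the output above, and the object names observed, expected and x2_yates are assumptions.

```r
# Observed frequencies copied from the addmargins() output above
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))

# Expected frequency of a cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Chi-squared statistic with Yates' continuity correction:
# 0.5 is subtracted from every |observed - expected| before squaring
# (all differences here are larger than 0.5)
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
round(x2_yates, 3)   # 24.585, as reported by chisq.test()
```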

round(results$residuals, 2) #Pearson residuals: (observed - expected) / sqrt(expected)
##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low
##           Government     -2.73  3.06
##           Non Government  1.98 -2.23

There are fewer students than expected in the category Government and High (alpha = 0.01).

There are more students than expected in the category Government and Low (alpha = 0.01).

There are more students than expected in the category Non Government and High (alpha = 0.05).

There are fewer students than expected in the category Non Government and Low (alpha = 0.05).
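
The residuals printed above equal (observed - expected) / sqrt(expected), and the alpha levels come from comparing their absolute values with standard normal critical values (1.96 and 2.58). A minimal sketch of that comparison, assuming the observed counts from the output above; the names observed, expected and pearson_res are illustrative.

```r
# Observed frequencies copied from the addmargins() output above
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals, one per cell
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# Cells whose residual exceeds the two-sided critical value
abs(pearson_res) > qnorm(0.975)   # alpha = 0.05, critical value about 1.96
abs(pearson_res) > qnorm(0.995)   # alpha = 0.01, critical value about 2.58
```

Only the two Government cells exceed 2.58, while all four cells exceed 1.96, which matches the alpha levels quoted in the interpretation.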

round(addmargins(prop.table(results$observed)), 3) #Proportion table 1 (rounding after adding margins so the total is exactly 1)
##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low   Sum
##           Government     0.129 0.217 0.346
##           Non Government 0.429 0.226 0.654
##           Sum            0.557 0.443 1.000

Explanation of the number 0.217 (Government, Low) - Out of the 350 students, 21.7% attend an institution (school/college/university) owned by the government and have a low level of adaptability to online education.

Explanation of the number 0.429 (Non Government, High) - Out of the 350 students, 42.9% attend an institution (school/college/university) not owned by the government and have a high level of adaptability to online education.

addmargins(round(prop.table(results$observed, 1), 3), 2) #Proportion table 2
##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low   Sum
##           Government     0.372 0.628 1.000
##           Non Government 0.655 0.345 1.000

Explanation of the number 0.628 (Government, Low) - Out of all students who attend an institution (school/college/university) owned by the government, 62.8% have a low level of adaptability to online education.

Explanation of the number 0.655 (Non Government, High) - Out of all students who attend an institution (school/college/university) not owned by the government, 65.5% have a high level of adaptability to online education.

addmargins(round(prop.table(results$observed, 2), 3), 1) #Proportion table 3
##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low
##           Government     0.231 0.490
##           Non Government 0.769 0.510
##           Sum            1.000 1.000

Explanation of the number 0.490 (Government, Low) - Out of all students who have a low level of adaptability to online education, 49% study at an institution (school/college/university) owned by the government.

Explanation of the number 0.769 (Non Government, High) - Out of all students who have a high level of adaptability to online education, 76.9% study at an institution (school/college/university) not owned by the government.

library(effectsize)
effectsize::cramers_v(mydata2$Institution.Type, mydata2$Adaptivity.Level)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.27              | [0.18, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.27)
## [1] "medium"
## (Rules: funder2019)
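
For reference, the unadjusted Cramér's V can be computed directly from the chi-squared statistic without continuity correction. This is a sketch under stated assumptions: it re-enters the observed counts from the output above, and it does not apply the small-sample bias adjustment that cramers_v() reports as "(adj.)", so the value may differ slightly. The names observed, x2, n, k and v are illustrative.

```r
# Observed frequencies copied from the addmargins() output above
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))

# Pearson chi-squared statistic without Yates' correction
x2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)        # sample size
k <- min(dim(observed))   # smaller of the number of rows and columns

# Unadjusted Cramer's V
v <- sqrt(x2 / (n * (k - 1)))
round(v, 3)
```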

Based on the sample data, we found that there is an association between the type of institution students attend (Government-owned or Non-Government-owned) and their level of adaptability to online education (Low or High) at p < 0.001.

The effect size is medium (Cramér's V = 0.27), indicating a medium-strength association between the type of institution (owned or not owned by the government) and the level of adaptability to online education.

Next, I will run Fisher's exact probability test (a nonparametric test) in order to show the steps, even though the assumptions of the chi-squared test were met.

Fisher’s exact probability test

fisher.test(mydata2$Institution.Type, mydata2$Adaptivity.Level) 
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata2$Institution.Type and mydata2$Adaptivity.Level
## p-value = 5.308e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1918893 0.5056299
## sample estimates:
## odds ratio 
##   0.312948
interpret_oddsratio(0.31)
## [1] "small"
## (Rules: chen2010)

H0: Odds ratio is equal to 1.

H1: Odds ratio is not equal to 1.

Based on the sample data, we reject the null hypothesis at p < 0.001 and conclude that the odds ratio is not equal to 1. The odds of high (rather than low) adaptability to online education are about 0.31 times as large for students at government-owned institutions as for students at non-government-owned institutions.

The effect size is small, indicating a small effect of the type of institution (owned or not owned by the government) on the level of adaptability to online education.
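
The direction of the odds ratio can be checked with the sample (cross-product) odds ratio. The sketch below assumes the observed counts from the chi-squared output; the names observed, odds_gov, odds_non and sample_or are illustrative. Note that fisher.test() reports a conditional maximum likelihood estimate, so its value (0.313) is close to, but not identical with, the sample odds ratio.

```r
# Observed frequencies copied from the addmargins() output above
observed <- matrix(c(45, 150, 76, 79), nrow = 2,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity = c("High", "Low")))

# Odds of High versus Low adaptability within each institution type
odds_gov <- observed["Government", "High"] / observed["Government", "Low"]
odds_non <- observed["Non Government", "High"] / observed["Non Government", "Low"]

# Sample odds ratio: Government relative to Non Government
sample_or <- odds_gov / odds_non
round(sample_or, 3)
```

A value below 1 means the odds of high adaptability are lower in the first row (Government) than in the second row (Non Government).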