Assignment 2

Ilina Maksimovska

RQ1: Is there a correlation between patients’ age and their average glucose level based on the dataset?

RQ2: Is there an association between the type of institution attended by students (Government-owned or Non-Government-owned) and their level of adaptability to online education (Low or High) based on the dataset?

mydata <- read.csv("./healthcare-dataset-stroke-data.csv")

head(mydata)

##      id gender age ever_married     work_type Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1  9046   Male  67          Yes       Private          Urban            228.69 36.6 formerly smoked      1
## 2 51676 Female  61          Yes Self-employed          Rural            202.21  N/A    never smoked      1
## 3 31112   Male  80          Yes       Private          Rural            105.92 32.5    never smoked      1
## 4 60182 Female  49          Yes       Private          Urban            171.23 34.4          smokes      1
## 5  1665 Female  79          Yes Self-employed          Rural            174.12   24    never smoked      1
## 6 56669   Male  81          Yes       Private          Urban            186.21   29 formerly smoked      1

Explanation of the dataset:

Unit of observation: one patient
Sample size: 5109 observations
Number of variables: 10 in the file as loaded above
Source: Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?resource=download

Explanation of the variables in the data set:

id: unique identifier for each individual in the data set.
gender: gender of the individual, categories: (“Male” or “Female”).
age: the age of the patient.
ever_married: whether the individual has ever been married, categories: (“No” or “Yes”).
work_type: the type of work the individual is engaged in, categories: (“children,” “Govt_job,” “Never_worked,” “Private,” or “Self-employed”).
Residence_type: the type of residence of the individual, categories (“Rural” and “Urban”).
avg_glucose_level: the average glucose level in the blood of the individual.
bmi: body mass index, offering a numerical assessment of the individual’s body weight in relation to their height.
smoking_status: smoking habits of the individual, categories (“formerly smoked,” “never smoked,” “smokes,” or “Unknown”).
stroke: whether the patient has experienced a stroke. (1 - the patient had a stroke, 0 - the patient has not had a stroke).

library(tidyr)
mydata$bmi[mydata$bmi == "N/A"] <- NA #Recoding "N/A" as missing
mydata$smoking_status[mydata$smoking_status == "Unknown"] <- NA #Treating "Unknown" as missing
mydata <- drop_na(mydata) #Dropping units with NA
head(mydata)

##      id gender age ever_married     work_type Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1  9046   Male  67          Yes       Private          Urban            228.69 36.6 formerly smoked      1
## 2 31112   Male  80          Yes       Private          Rural            105.92 32.5    never smoked      1
## 3 60182 Female  49          Yes       Private          Urban            171.23 34.4          smokes      1
## 4  1665 Female  79          Yes Self-employed          Rural            174.12   24    never smoked      1
## 5 56669   Male  81          Yes       Private          Urban            186.21   29 formerly smoked      1
## 6 53882   Male  74          Yes       Private          Rural             70.09 27.4    never smoked      1
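The recoding step above can also be done while reading the file. The following is a minimal sketch on a made-up four-row CSV (the values and the object names `csv_text`, `toy` and `toy_complete` are illustrative assumptions, not part of the stroke data). It uses the `na.strings` argument of `read.csv()`, so `bmi` is read as numeric straight away.

```r
# Toy CSV standing in for the stroke file; the values are illustrative only
csv_text <- "id,bmi,smoking_status
1,36.6,formerly smoked
2,N/A,never smoked
3,32.5,Unknown
4,24,smokes"

# na.strings turns the listed tokens into NA while the file is read
toy <- read.csv(text = csv_text, na.strings = c("N/A", "Unknown"))

str(toy$bmi)                  # bmi is parsed as numeric, with one NA
toy_complete <- na.omit(toy)  # base R counterpart of tidyr::drop_na()
nrow(toy_complete)            # number of complete rows that remain
```

With this approach the later `as.numeric()` conversion of `bmi` would no longer be needed.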

mydata$strokeF <- factor(mydata$stroke,
                         levels = c(0, 1),
                         labels = c("No", "Yes")) #Creating a factor

set.seed(1) 
mydata <- mydata[sample(nrow(mydata), 300), ] #Choosing random sample of 300 units
head(mydata)

##         id gender age ever_married work_type Residence_type avg_glucose_level  bmi smoking_status stroke strokeF
## 1017 42902   Male  35          Yes   Private          Rural            102.34 34.3   never smoked      0      No
## 679  47924   Male  24           No   Private          Urban             59.28 43.2   never smoked      0      No
## 2177  4280 Female  51          Yes  Govt_job          Rural            105.52 30.8   never smoked      0      No
## 930  67309   Male  47          Yes   Private          Rural             86.37 39.2         smokes      0      No
## 1533  5863 Female  71          Yes   Private          Urban            240.81 27.4   never smoked      0      No
## 471  54338 Female  58          Yes  Govt_job          Rural             77.46 27.6   never smoked      0      No

mydata$bmi <- as.numeric(mydata$bmi)
mydata$gender <- as.factor(mydata$gender)
mydata$ever_married <- as.factor(mydata$ever_married)
mydata$work_type <- as.factor(mydata$work_type)
mydata$Residence_type <- as.factor(mydata$Residence_type)
mydata$smoking_status <- as.factor(mydata$smoking_status)
mydata$stroke <- as.factor(mydata$stroke)
mydata$strokeF <- as.factor(mydata$strokeF)
summary(mydata) #Descriptive statistics

##        id           gender         age        ever_married         work_type   Residence_type avg_glucose_level
##  Min.   :  187   Female:187   Min.   :10.00   No : 70      children     : 12   Rural:146      Min.   : 55.58   
##  1st Qu.:20663   Male  :113   1st Qu.:32.75   Yes:230      Govt_job     : 44   Urban:154      1st Qu.: 77.22   
##  Median :40061                Median :50.00                Never_worked :  1                  Median : 92.31   
##  Mean   :37982                Mean   :48.06                Private      :203                  Mean   :108.39   
##  3rd Qu.:55171                3rd Qu.:63.00                Self-employed: 40                  3rd Qu.:117.64   
##  Max.   :72340                Max.   :82.00                                                   Max.   :271.74   
##       bmi                smoking_status stroke  strokeF  
##  Min.   :11.50   formerly smoked: 59    0:286   No :286  
##  1st Qu.:26.00   never smoked   :171    1: 14   Yes: 14  
##  Median :29.50   smokes         : 70                     
##  Mean   :30.55                                           
##  3rd Qu.:34.35                                           
##  Max.   :60.20

Descriptive statistics of the dataset

Min: The minimum average glucose level in the sample is 55.58.
1st Qu (First Quartile): 25% of patients have an average glucose level of 77.22 or lower, and 75% have a higher level.
Median: 50% of patients have an average glucose level of 92.31 or lower.
Mean: The mean of the average glucose level is 108.39.
3rd Qu (Third Quartile): 75% of patients have an average glucose level of 117.64 or lower, and 25% have a higher level.
Max: The maximum average glucose level in the sample is 271.74.
The sample contains 187 female and 113 male patients.

Correlation analysis

Assumptions

Both variables are numeric - since we are analyzing age and the average glucose level, this assumption is met.
Errors are normally distributed - the sample is large (300 units), so we assume normality.
There is a linear relationship between the variables - will be checked with a scatterplot.

library(car) #Creating a scatterplot to check linear relationship

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

scatterplot(mydata$age, mydata$avg_glucose_level, 
     xlab = "Age",
     ylab = "Average glucose level",
     boxplots = FALSE,
     smooth = FALSE)

For educational purposes, we will assume a linear relationship between the variables and continue with the Pearson correlation coefficient.

library(Hmisc) #Correlation - Pearson

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[, c("age", "avg_glucose_level")]), #Columns 3 and 7
      type = "pearson")

##                    age avg_glucose_level
## age               1.00              0.29
## avg_glucose_level 0.29              1.00
## 
## n= 300 
## 
## 
## P
##                   age avg_glucose_level
## age                    0               
## avg_glucose_level  0

H0: rho(age, average glucose level) = 0

H1: rho(age, average glucose level) != 0

We reject H0 at p < 0.001. There is a correlation between age and average glucose level based on the dataset.

The linear relationship between age and average glucose level is positive and weak (r = 0.29).
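The P matrix printed by rcorr() shows the p-value rounded to 0. As a rough check, the test statistic can be recomputed by hand from the reported values. This is a sketch that assumes the rounded r = 0.29 and n = 300 from the output above, so the result is approximate. The names `t_stat` and `p_value` are illustrative.

```r
r <- 0.29   # Pearson correlation as printed by rcorr() (rounded)
n <- 300    # sample size as printed by rcorr()

# Test statistic for H0: rho = 0; it follows a t distribution with n - 2 df
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 2)   # about 5.23
p_value < 0.001    # TRUE, consistent with rejecting H0 at p < 0.001
```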

Association between two categorical variables

mydata2 <- read.csv("./students_adaptability_level_online_education.csv")

head(mydata2)

##   Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1    Boy      University   Non Government                 Mid          Wifi           4G    Tab             High
## 2   Girl      University   Non Government                 Mid   Mobile Data           4G Mobile             High
## 3   Girl         College       Government                 Mid          Wifi           4G Mobile             High
## 4   Girl          School   Non Government                 Mid   Mobile Data           4G Mobile             High
## 5   Girl          School   Non Government                Poor   Mobile Data           3G Mobile              Low
## 6    Boy          School   Non Government                Poor   Mobile Data           3G Mobile              Low

Explanation of the dataset:

Sample size: 1205
Unit of observation: a student
Number of variables: 8
Source: Suzan, Md. M. H. (2022, April 16). Students adaptability level in online education. Kaggle. https://www.kaggle.com/datasets/mdmahmudulhasansuzan/students-adaptability-level-in-online-education

Explanation of the variables in the data set:

Gender: gender of the individual, categories: (“Boy” or “Girl”).
Education.Level: the level of education of the individual, categories: (“University”, “College”, “School”).
Institution.Type: whether the institution is public or private, categories: (“Non Government”, “Government”)
FiNAncial.Condition: the financial condition of the individual, categories: (“Mid”, “Poor”, “Rich”).
Internet.Type: the type of internet the individual is using, categories: (“Wifi”, “Mobile Data”).
Network.Type: the type of network the individual is using, categories: (“2G”, “3G”, “4G”).
Device: the type of device the individual is using, categories: (“Tab”, “Mobile”, “Computer”).
Adaptivity.Level: level of adaptability to online education by the individual, categories: (“Low”, “High”).

library(tidyr)
mydata2 <- drop_na(mydata2) #Dropping units with NA
head(mydata2)

##   Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type Device Adaptivity.Level
## 1    Boy      University   Non Government                 Mid          Wifi           4G    Tab             High
## 2   Girl      University   Non Government                 Mid   Mobile Data           4G Mobile             High
## 3   Girl         College       Government                 Mid          Wifi           4G Mobile             High
## 4   Girl          School   Non Government                 Mid   Mobile Data           4G Mobile             High
## 5   Girl          School   Non Government                Poor   Mobile Data           3G Mobile              Low
## 6    Boy          School   Non Government                Poor   Mobile Data           3G Mobile              Low

set.seed(1) 
mydata2 <- mydata2[sample(nrow(mydata2), 350), ] #Choosing random sample of 350 units
head(mydata2)

##      Gender Education.Level Institution.Type FiNAncial.Condition Internet.Type Network.Type   Device Adaptivity.Level
## 1017    Boy          School   Non Government                 Mid   Mobile Data           3G   Mobile              Low
## 679     Boy      University       Government                Poor   Mobile Data           4G   Mobile              Low
## 129     Boy      University       Government                 Mid          Wifi           4G Computer              Low
## 930    Girl          School   Non Government                 Mid   Mobile Data           3G   Mobile             High
## 471    Girl          School   Non Government                 Mid          Wifi           4G   Mobile              Low
## 299    Girl         College   Non Government                 Mid          Wifi           4G   Mobile              Low

mydata2$Gender <- as.factor(mydata2$Gender)
mydata2$Education.Level <- as.factor(mydata2$Education.Level)
mydata2$Institution.Type <- as.factor(mydata2$Institution.Type)
mydata2$FiNAncial.Condition <- as.factor(mydata2$FiNAncial.Condition)
mydata2$Internet.Type <- as.factor(mydata2$Internet.Type)
mydata2$Network.Type <- as.factor(mydata2$Network.Type)
mydata2$Device <- as.factor(mydata2$Device)
mydata2$Adaptivity.Level <- as.factor(mydata2$Adaptivity.Level)
summary(mydata2) #Descriptive statistics

##   Gender      Education.Level       Institution.Type FiNAncial.Condition     Internet.Type Network.Type      Device   
##  Boy :188   College   : 73    Government    :121     Mid :265            Mobile Data:211   2G:  6       Computer: 36  
##  Girl:162   School    :151    Non Government:229     Poor: 66            Wifi       :139   3G:127       Mobile  :304  
##             University:126                           Rich: 19                              4G:217       Tab     : 10  
##  Adaptivity.Level
##  High:195        
##  Low :155        
##

Descriptive statistics of the dataset:

The sample contains 162 girls and 188 boys.
In the sample, 121 students attend government-owned institutions and 229 attend non-government-owned institutions.
In the sample, 195 students have a high level of adaptability to online education and 155 have a low level.

Pearson Chi2 test

Assumptions:

Observations must be independent - this assumption is met.
All expected frequencies should be greater than 5 - will be checked later.
In larger contingency tables (at least one categorical variable has more than two categories), up to 20% of the expected frequencies can be between 1 and 5, but this reduces the power of the test.

If conditions 2 and 3 are not met, or if any expected frequency is less than 1, Fisher’s exact probability test of independence (a nonparametric test) should be used instead.

results <- chisq.test(mydata2$Institution.Type, mydata2$Adaptivity.Level,
                      correct = TRUE) #Pearson Chisquare test

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata2$Institution.Type and mydata2$Adaptivity.Level
## X-squared = 24.585, df = 1, p-value = 7.11e-07

H0: There is no association between the two categorical variables.

H1: There is an association between the two categorical variables.

We reject the null hypothesis at p < 0.001 and conclude that there is an association between the two categorical variables.

addmargins(results$observed) #Observed frequencies of data

##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type High Low Sum
##           Government       45  76 121
##           Non Government  150  79 229
##           Sum             195 155 350

round(results$expected, 2) #Expected frequencies of data

##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type   High    Low
##           Government      67.41  53.59
##           Non Government 127.59 101.41

The second assumption is met since all expected frequencies are above 5.
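The expected frequencies can also be reproduced by hand, which shows what the assumption check is based on. This sketch rebuilds the observed table from the counts printed above. The object names `observed` and `expected` are illustrative.

```r
# Observed counts copied from the table of observed frequencies above
observed <- matrix(c(45, 76,
                     150, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity  = c("High", "Low")))

# Expected count in each cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 2)   # matches results$expected
all(expected > 5)    # TRUE when every expected frequency is above 5
```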

round(results$residuals, 2) #Pearson residuals: (observed - expected) / sqrt(expected)

##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low
##           Government     -2.73  3.06
##           Non Government  1.98 -2.23

There are fewer students than expected in the category Government and High (alpha = 0.01).

There are more students than expected in the category Government and Low (alpha = 0.01).

There are more students than expected in the category Non Government and High (alpha = 0.05).

There are fewer students than expected in the category Non Government and Low (alpha = 0.05).
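The residuals printed above are (observed - expected) / sqrt(expected), which R stores in the `residuals` component of the test object and calls Pearson residuals. R's `stdres` component uses a different scaling. The sketch below rebuilds the table from the printed counts and compares the residuals with the normal critical values behind the alpha levels quoted above. The names `observed`, `test` and `pearson_res` are illustrative.

```r
# Observed counts copied from the table of observed frequencies above
observed <- matrix(c(45, 76,
                     150, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity  = c("High", "Low")))

test <- chisq.test(observed, correct = TRUE)

# Pearson residuals computed by hand; same values as test$residuals
pearson_res <- (test$observed - test$expected) / sqrt(test$expected)
round(pearson_res, 2)

# Cells beyond the two-sided critical values for alpha = 0.05 and alpha = 0.01
abs(pearson_res) > qnorm(0.975)   # about 1.96
abs(pearson_res) > qnorm(0.995)   # about 2.58

# R's "standardized residuals" additionally adjust for the row and column totals
round(test$stdres, 2)
```

In a 2x2 table all four values in `stdres` have the same absolute size, so the Pearson residuals are more useful for comparing cells here.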

addmargins(round(prop.table(results$observed), 3)) #Proportion table 1

##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low   Sum
##           Government     0.129 0.217 0.346
##           Non Government 0.429 0.226 0.655
##           Sum            0.558 0.443 1.001

Explanation of the number 0.217 (Government, Low) - Of the 350 students, 21.7% attend an institution (school/college/university) owned by the government and have a low level of adaptability to online education.

Explanation of the number 0.429 (Non Government, High) - Of the 350 students, 42.9% attend an institution (school/college/university) not owned by the government and have a high level of adaptability to online education.
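The totals 0.655, 0.558 and 1.001 in the table above come from adding proportions that had already been rounded. A possible alternative, sketched here on the observed counts printed earlier, is to add the margins first and round last. The names `observed` and `prop_tab` are illustrative.

```r
# Observed counts copied from the table of observed frequencies above
observed <- matrix(c(45, 76,
                     150, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity  = c("High", "Low")))

# Add the margins to the exact proportions, then round once at the end
prop_tab <- round(addmargins(prop.table(observed)), 3)
prop_tab
```

The cell proportions are unchanged, and the grand total is then exactly 1.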

addmargins(round(prop.table(results$observed, 1), 3), 2) #Proportion table 2

##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low   Sum
##           Government     0.372 0.628 1.000
##           Non Government 0.655 0.345 1.000

Explanation of the number 0.628 (Government, Low) - Of all students attending an institution (school/college/university) owned by the government, 62.8% have a low level of adaptability to online education.

Explanation of the number 0.655 (Non Government, High) - Of all students attending an institution (school/college/university) not owned by the government, 65.5% have a high level of adaptability to online education.

addmargins(round(prop.table(results$observed, 2), 3), 1) #Proportion table 3

##                         mydata2$Adaptivity.Level
## mydata2$Institution.Type  High   Low
##           Government     0.231 0.490
##           Non Government 0.769 0.510
##           Sum            1.000 1.000

Explanation of the number 0.490 (Government, Low) - Of all students with a low level of adaptability to online education, 49% study at an institution (school/college/university) owned by the government.

Explanation of the number 0.769 (Non Government, High) - Of all students with a high level of adaptability to online education, 76.9% study at an institution (school/college/university) not owned by the government.

library(effectsize)
effectsize::cramers_v(mydata2$Institution.Type, mydata2$Adaptivity.Level)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.27              | [0.18, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.27)

## [1] "medium"
## (Rules: funder2019)

Based on the sample data, we found an association between the type of institution students attend (Government owned or Non Government owned) and their level of adaptability to online education (Low or High) at p < 0.001.

The effect size is medium (Cramer's V = 0.27), indicating a medium-strength association between the level of adaptability to online education and whether the student studies at an institution owned by the government or not.
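To show where this value comes from, Cramer's V can be approximated by hand from the chi-square statistic. This is a sketch of the unadjusted formula, sqrt(chi-square / (n * (min(rows, columns) - 1))), applied to the observed counts printed earlier. It is not the bias-adjusted estimator that effectsize reports, so small differences are possible. The names `observed`, `chi2`, `k` and `v` are illustrative.

```r
# Observed counts copied from the table of observed frequencies above
observed <- matrix(c(45, 76,
                     150, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity  = c("High", "Low")))

# Chi-square statistic without the continuity correction
chi2 <- as.numeric(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)           # sample size
k <- min(dim(observed)) - 1  # smaller table dimension minus one

# Unadjusted Cramer's V
v <- sqrt(chi2 / (n * k))
round(v, 2)
```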

Next, I run Fisher’s exact probability test (a nonparametric test) to show the steps, even though the assumptions of the chi-square test were met.

Fisher’s exact probability test

fisher.test(mydata2$Institution.Type, mydata2$Adaptivity.Level)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata2$Institution.Type and mydata2$Adaptivity.Level
## p-value = 5.308e-07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1918893 0.5056299
## sample estimates:
## odds ratio 
##   0.312948

interpret_oddsratio(0.31)

## [1] "small"
## (Rules: chen2010)

H0: Odds ratio is equal to 1.

H1: Odds ratio is not equal to 1.

Based on the sample data, we reject the null hypothesis at p < 0.001 and conclude that the odds ratio is not equal to 1. The estimated odds ratio of 0.31 means that the odds of a high level of adaptability to online education for students at government-owned institutions are about 0.31 times the odds for students at non-government-owned institutions.

The effect size is small, indicating a small association between the level of adaptability to online education and whether the student attends an institution owned by the government or not.
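The odds ratio can be checked against the observed counts printed earlier. The sketch below computes the sample (cross-product) odds ratio. fisher.test() reports the conditional maximum likelihood estimate instead, so its value (0.3129) differs slightly from the one computed here. The names `observed`, `odds_gov`, `odds_non` and `or_sample` are illustrative.

```r
# Observed counts copied from the table of observed frequencies above
observed <- matrix(c(45, 76,
                     150, 79),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Institution = c("Government", "Non Government"),
                                   Adaptivity  = c("High", "Low")))

# Odds of High versus Low adaptability within each institution type
odds_gov <- observed["Government", "High"] / observed["Government", "Low"]
odds_non <- observed["Non Government", "High"] / observed["Non Government", "Low"]

# Sample odds ratio: Government relative to Non Government
or_sample <- odds_gov / odds_non
round(or_sample, 3)

# The same effect expressed from the Non Government side
round(1 / or_sample, 2)
```

Reading the ratio in the other direction, the odds of high adaptability are roughly three times higher for students at non-government-owned institutions.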

2024-01-16
