StudentsPerformance <- read.table("./StudentsPerformance.csv", header=TRUE, sep=",", dec=".")
head(StudentsPerformance)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
The unit of observation: an individual student.
The sample size: 1000
Gender: Indicates whether the student is male or female.
Race/Ethnicity: Identifies the student’s racial or ethnic background.
Parental Level of Education: Specifies the highest education level attained by the student’s parents.
Lunch: Describes the type of lunch program the student is enrolled in, reflecting socio-economic status.
Test Preparation Course: Indicates whether the student completed a course preparing for standardized tests.
Math Score: Represents the student’s achievement in the math portion of the standardized test.
Reading Score: Reflects the student’s performance in the reading section of the standardized test.
Writing Score: Represents the student’s achievement in the writing portion of the standardized test.
The dataset for this homework comes from Kaggle: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?resource=download
#Forming factors out of the categorical variables that I will use
StudentsPerformance$genderF <- factor(StudentsPerformance$gender,
levels = c("female", "male"),
labels = c("Female", "Male"))
StudentsPerformance$test.preparation.courseF <- factor(StudentsPerformance$test.preparation.course,
levels = c("none", "completed"),
labels = c("None", "Completed"))
head(StudentsPerformance, 6)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score genderF
## 1 none 72 72 74 Female
## 2 completed 69 90 88 Female
## 3 none 90 95 93 Female
## 4 none 47 57 44 Male
## 5 none 76 78 75 Male
## 6 none 71 83 78 Female
## test.preparation.courseF
## 1 None
## 2 Completed
## 3 None
## 4 None
## 5 None
## 6 None
#Removing all unnecessary variables
mydata1 <- StudentsPerformance[, !(names(StudentsPerformance) %in% c("race.ethnicity", "parental.level.of.education", "lunch", "writing.score"))]
head(mydata1)
## gender test.preparation.course math.score reading.score genderF
## 1 female none 72 72 Female
## 2 female completed 69 90 Female
## 3 female none 90 95 Female
## 4 male none 47 57 Male
## 5 male none 76 78 Male
## 6 female none 71 83 Female
## test.preparation.courseF
## 1 None
## 2 Completed
## 3 None
## 4 None
## 5 None
## 6 None
#Showing some descriptive statistics
summary(mydata1)
## gender test.preparation.course math.score reading.score
## Length:1000 Length:1000 Min. : 0.00 Min. : 17.00
## Class :character Class :character 1st Qu.: 57.00 1st Qu.: 59.00
## Mode :character Mode :character Median : 66.00 Median : 70.00
## Mean : 66.09 Mean : 69.17
## 3rd Qu.: 77.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00
## genderF test.preparation.courseF
## Female:518 None :642
## Male :482 Completed:358
##
##
##
##
math.score: The minimum score is 0, the 1st quartile is 57 (25% of students scored 57 or less), the median (middle value) is 66, the mean (average) is 66.09, the 3rd quartile is 77 (75% of students scored 77 or less), and the maximum score is 100.
reading.score: The minimum score is 17, the 1st quartile is 59, the median is 70, the mean (average) is 69.17, the 3rd quartile is 79, and the maximum score is 100.
Out of 1000 students, 518 are female and 482 are male. 358 students completed the test preparation course, while 642 did not.
My research question (RQ1) concerns the association between two categorical variables: gender and test.preparation.course. The appropriate statistical test is therefore the Chi-Square Test of Independence.
Assumptions:
The observations must be independent of each other. This means that there is no relationship between the observations in each category or group. MET
All expected frequencies are greater than 1. MET
At most 20% of the expected frequencies may be between 1 and 5. MET
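The two expected-frequency assumptions can be checked directly. The sketch below is only an illustration: it rebuilds the 2x2 table from the observed counts reported later in this document (the object names `observed` and `expected` are my own) and computes the expected frequencies under independence as row total times column total divided by n.

```r
# Observed counts, copied from the addmargins(results$observed) output
observed <- matrix(c(334, 308, 184, 174), nrow = 2,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))

# Expected frequency under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Assumption checks
all(expected > 1)            # every expected frequency exceeds 1
mean(expected < 5) <= 0.2    # at most 20% of the cells fall below 5
```

With the smallest expected frequency above 170, both conditions hold comfortably for this table.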
results <- chisq.test(mydata1$genderF, mydata1$test.preparation.courseF,
correct = TRUE) #Correction because of 2x2 table
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata1$genderF and mydata1$test.preparation.courseF
## X-squared = 0.015529, df = 1, p-value = 0.9008
Hypotheses for Pearson’s Chi-squared test:
H0: There is no association between gender and the test preparation course.
H1: There is an association between gender and the test preparation course.
Based on the sample data, we cannot reject the null hypothesis (p = 0.901). The test did not provide sufficient evidence to conclude that there is an association between the gender of the student and their participation in a test preparation course. The data are consistent with gender and the test preparation course being independent of each other.
#Checking empirical frequencies
addmargins(results$observed)
##
## mydata1$genderF None Completed Sum
## Female 334 184 518
## Male 308 174 482
## Sum 642 358 1000
#Checking expected frequencies
addmargins(round(results$expected, 2))
##
## mydata1$genderF None Completed Sum
## Female 332.56 185.44 518
## Male 309.44 172.56 482
## Sum 642.00 358.00 1000
#Checking standardized (Pearson) residuals
round(results$residuals, 2)
##
## mydata1$genderF None Completed
## Female 0.08 -0.11
## Male -0.08 0.11
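There are no significant discrepancies, because all residuals are smaller than 1.96 in absolute value.
For the purpose of practice, I will explain two of them.
Female, None: The residual is 0.08. Slightly more females did not complete the test preparation course than expected under independence (334 observed vs. 332.56 expected), but the difference is not statistically significant.
Male, Completed: The residual is 0.11. Slightly more males completed the test preparation course than expected under independence (174 observed vs. 172.56 expected), but the difference is not statistically significant.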
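To make the residuals less of a black box, here is a minimal sketch that recomputes them from the observed counts shown above (the names `observed`, `fit` and `pearson` are my own). In R, `chisq.test()` stores (observed - expected) / sqrt(expected) in its `residuals` element, which some textbooks call standardized residuals, and a margin-adjusted version in `stdres`.

```r
# Observed counts, copied from the addmargins(results$observed) output
observed <- matrix(c(334, 308, 184, 174), nrow = 2,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))
fit <- chisq.test(observed, correct = TRUE)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (fit$observed - fit$expected) / sqrt(fit$expected)
round(pearson, 2)

# chisq.test() stores the same values in its 'residuals' element
all.equal(pearson, fit$residuals, check.attributes = FALSE)

# Adjusted standardized residuals, which also account for the margins
round(fit$stdres, 2)

# Both versions stay well inside the +/-1.96 band
all(abs(fit$residuals) < 1.96) && all(abs(fit$stdres) < 1.96)
```

The residuals are expressed in standard-deviation-like units, not in numbers of students, which is why they are compared with 1.96 rather than read as frequency differences.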
addmargins(round(prop.table(results$observed), 3))
##
## mydata1$genderF None Completed Sum
## Female 0.334 0.184 0.518
## Male 0.308 0.174 0.482
## Sum 0.642 0.358 1.000
addmargins(round(prop.table(results$observed, 1), 3), 2)
##
## mydata1$genderF None Completed Sum
## Female 0.645 0.355 1.000
## Male 0.639 0.361 1.000
addmargins(round(prop.table(results$observed, 2), 3), 1)
##
## mydata1$genderF None Completed
## Female 0.520 0.514
## Male 0.480 0.486
## Sum 1.000 1.000
library(effectsize)
effectsize::cramers_v(mydata1$genderF, mydata1$test.preparation.courseF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)
There is a tiny effect.
Based on the sample data, we cannot reject H0 (p = 0.901), so we found no evidence of an association between gender and test preparation course. Additionally, the effect size is tiny (Cramér’s V = 0.00), which is consistent with this conclusion.
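As a cross-check on the effect size, the following sketch computes the plain (unadjusted) Cramér’s V by hand from the observed counts. It is an illustration under my own naming (`observed`, `chi2`, `cramers_v`). The `effectsize` output above reports a bias-adjusted V, so the two numbers do not have to match exactly.

```r
# Observed counts, copied from the addmargins(results$observed) output
observed <- matrix(c(334, 308, 184, 174), nrow = 2,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))

# Uncorrected Pearson chi-squared statistic (no Yates correction)
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

n <- sum(observed)        # sample size
k <- min(dim(observed))   # smaller of the number of rows and columns

# Unadjusted Cramer's V: sqrt(chi2 / (n * (k - 1)))
cramers_v <- sqrt(chi2 / (n * (k - 1)))
cramers_v
```

For a 2x2 table k - 1 equals 1, so V reduces to sqrt(chi2 / n), and with a chi-squared statistic this small it stays very close to zero.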
Even though the assumptions of the Chi-squared test were met, for educational purposes I will also show Fisher’s exact probability test, which is typically used when the expected frequencies are too small.
fisher.test(mydata1$genderF, mydata1$test.preparation.courseF)
##
## Fisher's Exact Test for Count Data
##
## data: mydata1$genderF and mydata1$test.preparation.courseF
## p-value = 0.895
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7849638 1.3394293
## sample estimates:
## odds ratio
## 1.025463
H0: Odds ratio is equal to 1.
H1: Odds ratio is not equal to 1.
We cannot reject H0 (p = 0.895), so we cannot conclude that participation in the test preparation course differs by gender.
interpret_oddsratio(1.03)
## [1] "very small"
## (Rules: chen2010)
The odds ratio (OR) of 1.03 is very close to 1, so any difference between females and males in the odds of participating in the test preparation course is negligible.
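To show where an odds ratio of about 1.03 comes from, this sketch computes the sample odds ratio by hand from the observed counts (the object names are my own). Note that `fisher.test()` reports the conditional maximum likelihood estimate, so its value is close to, but not identical with, the simple cross-product ratio below.

```r
# Observed counts, copied from the addmargins(results$observed) output
observed <- matrix(c(334, 308, 184, 174), nrow = 2,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))

# Odds of "None" versus "Completed" within each gender
odds_female <- observed["Female", "None"] / observed["Female", "Completed"]
odds_male   <- observed["Male", "None"]   / observed["Male", "Completed"]

# Sample odds ratio: odds for females relative to odds for males
odds_ratio <- odds_female / odds_male
round(odds_ratio, 3)
```

Read this way, the odds of not completing the course are about 1.03 times as high for females as for males, which is practically no difference.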
Correlation analysis assumptions:
Variables must be numeric. (this assumption is met)
Errors are normally distributed. (since we have a big enough sample, we don’t check)
Linear relationship between variables.
mydata2 <- mydata1[sample(nrow(mydata1), 200), ]
head(mydata2)
## gender test.preparation.course math.score reading.score genderF
## 841 female none 39 52 Female
## 885 female none 51 51 Female
## 423 female completed 47 58 Female
## 984 female completed 78 87 Female
## 283 female none 73 79 Female
## 519 female completed 66 78 Female
## test.preparation.courseF
## 841 None
## 885 None
## 423 Completed
## 984 Completed
## 283 None
## 519 Completed
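Because `sample()` draws different rows on every run, the subsample and every statistic computed from it change each time the document is knitted. A minimal sketch of one way to make the draw reproducible is shown below. The data frame `toy` is a hypothetical stand-in for mydata1, and the seed value is arbitrary.

```r
# Hypothetical stand-in for mydata1: 1000 rows with an id and a made-up score
toy <- data.frame(id = 1:1000, score = (1:1000) / 10)

set.seed(2024)                          # arbitrary seed, chosen for illustration
sub_a <- toy[sample(nrow(toy), 200), ]

set.seed(2024)                          # same seed, so the same rows are drawn
sub_b <- toy[sample(nrow(toy), 200), ]

identical(sub_a, sub_b)                 # TRUE: the subsample is reproducible
```

Calling `set.seed()` once before `sample()` would keep the printed output and the written interpretation in agreement across runs.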
library(car)
## Loading required package: carData
#Checking scatterplot
scatterplotMatrix(mydata2[, c(3, 4)], smooth = FALSE)
From the scatter plot, we can see that there is a linear relationship between math scores and reading scores, so accordingly, we can say that the third assumption is met.
scatterplot(mydata2$math.score ~ mydata2$reading.score,
smooth = FALSE,
boxplots = FALSE,
ylab = "math.score",
xlab = "reading.score")
#Checking normality with Shapiro-Wilk test
shapiro.test(mydata2$math.score)
##
## Shapiro-Wilk normality test
##
## data: mydata2$math.score
## W = 0.98801, p-value = 0.09032
shapiro.test(mydata2$reading.score)
##
## Shapiro-Wilk normality test
##
## data: mydata2$reading.score
## W = 0.98735, p-value = 0.07207
H0: Variable is normally distributed.
H1: Variable is not normally distributed.
Math Score: The p-value in the output above is 0.090 (> 0.05). Therefore, we cannot reject the null hypothesis that the math scores are normally distributed.
Reading Score: The p-value in the output above is 0.072 (> 0.05). Therefore, we cannot reject the null hypothesis that the reading scores are normally distributed.
Both p-values are only slightly above 0.05, and because mydata2 is a random subsample drawn without a fixed seed, they change from run to run. Spearman’s correlation, which does not assume normality, is therefore a cautious choice for further analyzing the relationship between math.score and reading.score.
#Checking Pearson correlation for educational purposes
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
cor(mydata2$math.score, mydata2$reading.score,
method= "pearson")
## [1] 0.830166
cor.test(mydata2$math.score, mydata2$reading.score,
method = "pearson",
exact = FALSE)
##
## Pearson's product-moment correlation
##
## data: mydata2$math.score and mydata2$reading.score
## t = 20.953, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7814284 0.8688362
## sample estimates:
## cor
## 0.830166
#Checking Spearman correlation - the CORRECT one
cor(mydata2$math.score, mydata2$reading.score,
method= "spearman")
## [1] 0.8120235
cor.test(mydata2$math.score, mydata2$reading.score,
method = "spearman",
exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: mydata2$math.score and mydata2$reading.score
## S = 250629, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8120235
H0: There is no correlation.
H1: There is a correlation.
We reject H0 at p<0.001. We can conclude that there is a correlation between math scores and reading scores.
The relationship between math scores and reading scores is positive and strong (Spearman correlation coefficient is 0.812).
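For intuition about what Spearman’s coefficient measures, the sketch below shows that it equals Pearson’s correlation computed on the ranks of the data. The data here are simulated purely for illustration (the variables `x` and `y` and their parameters are made up and are not the student scores).

```r
set.seed(1)                                # arbitrary seed for the simulated data
x <- rnorm(200, mean = 66, sd = 15)        # made-up "math-like" scores
y <- 0.8 * x + rnorm(200, mean = 0, sd = 8)  # made-up, positively related scores

# Spearman's rho as computed by cor()
rho_spearman <- cor(x, y, method = "spearman")

# Pearson correlation applied to the ranks of the same data
rho_on_ranks <- cor(rank(x), rank(y), method = "pearson")

all.equal(rho_spearman, rho_on_ranks)      # TRUE: the two are the same quantity
```

Because it works on ranks, Spearman’s rho describes how monotonic the relationship is, and it is less sensitive to outliers and non-normal distributions than Pearson’s coefficient.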