Students’ performance in exams

RQ1: Is there an association between the gender of the student and their participation in a test preparation course?

RQ2: Is there a correlation between the math score and the reading score of a student?

StudentsPerformance <- read.table("./StudentsPerformance.csv", header=TRUE, sep=",", dec=".")
head(StudentsPerformance)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78
Description of data set

The unit of observation: an individual student.

The sample size: 1000

Description of all variables:
  • Gender: Indicates whether the student is male or female.

  • Race/Ethnicity: Identifies the student’s racial or ethnic background.

  • Parental Level of Education: Specifies the highest education level attained by the student’s parents.

  • Lunch: Describes the type of lunch program the student is enrolled in, reflecting socio-economic status.

  • Test Preparation Course: Indicates whether the student completed a course preparing for standardized tests.

  • Math Score: Represents the student’s achievement in the math portion of the standardized test.

  • Reading Score: Reflects the student’s performance in the reading section of the standardized test.

  • Writing Score: Represents the student’s achievement in the writing portion of the standardized test.

Source

The dataset for this homework was found on Kaggle: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?resource=download

Data processing
#Forming factors out of the categorical variables that I will use 

StudentsPerformance$genderF <- factor(StudentsPerformance$gender, 
                 levels = c("female", "male"),
                 labels = c("Female", "Male"))
StudentsPerformance$test.preparation.courseF <- factor(StudentsPerformance$test.preparation.course, 
                 levels = c("none", "completed"),
                 labels = c("None", "Completed"))

head(StudentsPerformance, 6)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score genderF
## 1                    none         72            72            74  Female
## 2               completed         69            90            88  Female
## 3                    none         90            95            93  Female
## 4                    none         47            57            44    Male
## 5                    none         76            78            75    Male
## 6                    none         71            83            78  Female
##   test.preparation.courseF
## 1                     None
## 2                Completed
## 3                     None
## 4                     None
## 5                     None
## 6                     None
#Removing all unnecessary variables
mydata1 <- StudentsPerformance[, !(names(StudentsPerformance) %in% c("race.ethnicity", "parental.level.of.education", "lunch", "writing.score"))]
head(mydata1)
##   gender test.preparation.course math.score reading.score genderF
## 1 female                    none         72            72  Female
## 2 female               completed         69            90  Female
## 3 female                    none         90            95  Female
## 4   male                    none         47            57    Male
## 5   male                    none         76            78    Male
## 6 female                    none         71            83  Female
##   test.preparation.courseF
## 1                     None
## 2                Completed
## 3                     None
## 4                     None
## 5                     None
## 6                     None
#Showing some descriptive statistics
summary(mydata1)
##     gender          test.preparation.course   math.score     reading.score   
##  Length:1000        Length:1000             Min.   :  0.00   Min.   : 17.00  
##  Class :character   Class :character        1st Qu.: 57.00   1st Qu.: 59.00  
##  Mode  :character   Mode  :character        Median : 66.00   Median : 70.00  
##                                             Mean   : 66.09   Mean   : 69.17  
##                                             3rd Qu.: 77.00   3rd Qu.: 79.00  
##                                             Max.   :100.00   Max.   :100.00  
##    genderF    test.preparation.courseF
##  Female:518   None     :642           
##  Male  :482   Completed:358           
##                                       
##                                       
##                                       
## 
  • math.score: Minimum score is 0, the 1st quartile is 57 (meaning 25% of scores are below 57), the median (middle value) is 66, the mean (average) is 66.09, the 3rd quartile is 77 (meaning 75% of scores are below 77), and the maximum value is 100.

  • reading.score: The minimum score is 17, the 1st quartile is 59, the median is 70, the mean (average) is 69.17, the 3rd quartile is 79, and the maximum value is 100.

Out of 1000 students, there are 518 females and 482 males. 358 students have completed a test preparation course, while 642 have not.

RQ1: Is there an association between the gender of the student and their participation in a test preparation course?

Given my research question (RQ1), I am clearly interested in examining the association between two categorical variables: gender and test.preparation.course. Since that is the case, the appropriate statistical test for this would be the Chi-Square Test of Independence.

Chi-Square Test

Assumptions:

  1. The observations must be independent of each other. This means that there is no relationship between the observations in each category or group. MET

  2. All expected frequencies are greater than 1. MET

  3. At most 20% of the expected frequencies are between 1 and 5. MET
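
The two frequency assumptions can be checked directly from the expected counts. The following is a minimal sketch, assuming the observed counts reported in the frequency table of this document (334, 184, 308, 174); the object names `observed`, `expected`, `all_above_1` and `share_below_5` are illustrative and not part of the original analysis.

```r
# Observed 2x2 table, typed in from the counts reported in this document
observed <- matrix(c(334, 184,
                     308, 174),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Assumption 2: every expected count is greater than 1
all_above_1 <- all(expected > 1)

# Assumption 3: share of expected counts below 5 (should be at most 20%)
share_below_5 <- mean(expected < 5)

round(expected, 2)
all_above_1
share_below_5 <= 0.20
```

With a sample of 1000 and four cells, the smallest expected count is well above 5, so both conditions hold comfortably.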

results <- chisq.test(mydata1$genderF, mydata1$test.preparation.courseF,
                      correct = TRUE) #Correction because of 2x2 table

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata1$genderF and mydata1$test.preparation.courseF
## X-squared = 0.015529, df = 1, p-value = 0.9008

Hypotheses for Pearson’s Chi-squared test:

H0: There is no association between gender and the test preparation course.

H1: There is an association between gender and the test preparation course.

Based on the sample data, we cannot reject the null hypothesis (p-value = 0.901). The test did not provide sufficient evidence to conclude that there is an association between the gender of the student and their participation in a test preparation course. The data are therefore consistent with gender and test preparation course being independent, although failing to reject H0 does not prove independence.
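
The reported test statistic can be reproduced by hand, which makes the effect of the continuity correction visible. This is a sketch under the assumption that the observed counts are the ones shown in the frequency table of this document; the names `yates`, `x2` and `p_value` are illustrative.

```r
# Observed counts (rows: Female, Male; columns: None, Completed)
observed <- matrix(c(334, 184,
                     308, 174),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction subtracts 0.5 from each |observed - expected|,
# capped at the largest |observed - expected| so it never overshoots zero
yates <- min(0.5, abs(observed - expected))
x2 <- sum((abs(observed - expected) - yates)^2 / expected)

# p-value from the chi-square distribution with (2 - 1) * (2 - 1) = 1 df
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

round(x2, 6)
round(p_value, 4)
```

The result should agree with the X-squared and p-value printed by chisq.test() above.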

#Checking empirical frequencies
addmargins(results$observed)
##                
## mydata1$genderF None Completed  Sum
##          Female  334       184  518
##          Male    308       174  482
##          Sum     642       358 1000
#Checking expected frequencies
addmargins(round(results$expected, 2))
##                
## mydata1$genderF   None Completed  Sum
##          Female 332.56    185.44  518
##          Male   309.44    172.56  482
##          Sum    642.00    358.00 1000
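#Checking Pearson residuals (results$res partially matches results$residuals;
#adjusted standardized residuals are stored in results$stdres)
round(results$residuals, 2)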
##                
## mydata1$genderF  None Completed
##          Female  0.08     -0.11
##          Male   -0.08      0.11

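There are no significant discrepancies, because all residuals are below 1.96 in absolute value.

For the purpose of practice, I will explain two of them.

  • Female, None: The residual is 0.08. Slightly more females than expected under independence did not complete the test preparation course (334 observed vs. 332.56 expected), but the difference is not statistically significant.

  • Male, Completed: The residual is 0.11. Slightly more males than expected under independence completed the test preparation course (174 observed vs. 172.56 expected), but the difference is not statistically significant.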
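
The residuals printed above match Pearson residuals, (observed - expected) / sqrt(expected). The sketch below recomputes them and also the adjusted standardized residuals, which are the ones usually compared against 1.96. It assumes the observed counts from the frequency table; the names `pearson`, `adjusted`, `row_share` and `col_share` are illustrative.

```r
# Observed counts (rows: Female, Male; columns: None, Completed)
observed <- matrix(c(334, 184,
                     308, 174),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals: (O - E) / sqrt(E)
pearson <- (observed - expected) / sqrt(expected)

# Adjusted standardized residuals: Pearson residuals divided by
# sqrt((1 - row share) * (1 - column share))
row_share <- rowSums(observed) / n
col_share <- colSums(observed) / n
adjusted <- pearson / sqrt(outer(1 - row_share, 1 - col_share))

round(pearson, 2)
round(adjusted, 2)
```

Both versions stay far below 1.96 in absolute value here, so the conclusion does not depend on which residual is used.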

Proportion tables

addmargins(round(prop.table(results$observed), 3))
##                
## mydata1$genderF  None Completed   Sum
##          Female 0.334     0.184 0.518
##          Male   0.308     0.174 0.482
##          Sum    0.642     0.358 1.000
Explanation of number 0.184 (Female/Completed):
  • Out of 1000 students, 18.4% are female and have completed a test preparation course.
addmargins(round(prop.table(results$observed, 1), 3), 2)
##                
## mydata1$genderF  None Completed   Sum
##          Female 0.645     0.355 1.000
##          Male   0.639     0.361 1.000
Explanation of number 0.639 (Male/None):
  • Out of all males, 63.9% have not completed a test preparation course.
Explanation of number 0.355 (Female/Completed):
  • Out of all females, 35.5% have completed a test preparation course.
addmargins(round(prop.table(results$observed, 2), 3), 1)
##                
## mydata1$genderF  None Completed
##          Female 0.520     0.514
##          Male   0.480     0.486
##          Sum    1.000     1.000
Explanation of number 0.486 (Male/Completed):
  • Out of all students who have completed a test preparation course, 48.6% are males.

Effect size

library(effectsize)
effectsize::cramers_v(mydata1$genderF, mydata1$test.preparation.courseF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)

There is a tiny effect.

Conclusion:

Based on the sample data we cannot reject H0 (p = 0.901), so we found no evidence of an association between gender and test preparation course. Additionally, the effect size is tiny (Cramer’s V = 0.00), which is consistent with the absence of an association between the variables.
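
To see where a value of 0.00 can come from, Cramer’s V can be computed by hand from the uncorrected chi-square statistic. The sketch below also applies one common small-sample bias correction (Bergsma, 2013); that this is the same adjustment as the "adj." in the output above is an assumption, not something confirmed here. The names `v_raw`, `v_adj`, `phi2_adj`, `r_adj` and `c_adj` are illustrative.

```r
# Observed counts (rows: Female, Male; columns: None, Completed)
observed <- matrix(c(334, 184,
                     308, 174),
                   nrow = 2, byrow = TRUE)
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Uncorrected Pearson chi-square (no Yates correction)
x2 <- unname(chisq.test(observed, correct = FALSE)$statistic)

# Classical Cramer's V: sqrt(chi-square / (n * min(r - 1, k - 1)))
v_raw <- sqrt(x2 / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed: Bergsma, 2013); phi^2 is shrunk
# towards zero and truncated at zero
phi2_adj <- max(0, x2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj <- r - (r - 1)^2 / (n - 1)
c_adj <- k - (k - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(r_adj - 1, c_adj - 1))

round(v_raw, 3)
round(v_adj, 3)
```

In this table the uncorrected V is already below 0.01, and the correction truncates it to exactly zero.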

Odds ratio

Even though the assumptions of the chi-square test were met, for educational purposes I will also show Fisher’s exact test, which does not rely on the large-sample approximation and reports an odds ratio.

fisher.test(mydata1$genderF, mydata1$test.preparation.courseF)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mydata1$genderF and mydata1$test.preparation.courseF
## p-value = 0.895
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7849638 1.3394293
## sample estimates:
## odds ratio 
##   1.025463

Hypothesis

H0: Odds ratio is equal to 1.

H1: Odds ratio is not equal to 1.

We cannot reject H0 (p-value = 0.895), so we cannot conclude that the odds of completing a test preparation course differ between females and males.

interpret_oddsratio(1.03)
## [1] "very small"
## (Rules: chen2010)

The odds ratio (OR) of 1.03 is very close to 1, and its 95% confidence interval (0.78 to 1.34) includes 1, so any difference between females and males in the odds of completing a test preparation course is negligible.
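
For comparison, the sample odds ratio can be computed directly from the 2x2 table. This is a sketch assuming the observed counts shown earlier; the names `or_sample`, `se_log_or` and `wald_ci` are illustrative. Note that fisher.test() reports a conditional maximum likelihood estimate, so its value differs slightly from the simple cross-product ratio, and the Wald interval below is only an approximation to the exact interval.

```r
# Observed counts (rows: Female, Male; columns: None, Completed)
observed <- matrix(c(334, 184,
                     308, 174),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("Female", "Male"),
                                   course = c("None", "Completed")))

# Cross-product ratio: odds of "None" for females divided by
# odds of "None" for males
or_sample <- (observed[1, 1] * observed[2, 2]) /
             (observed[1, 2] * observed[2, 1])

# Approximate (Wald) 95% confidence interval on the log-odds scale
se_log_or <- sqrt(sum(1 / observed))
wald_ci <- exp(log(or_sample) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(or_sample, 3)
round(wald_ci, 3)
```

Both the estimate and the interval lead to the same reading as the exact test: the interval contains 1.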

RQ2: Is there a correlation between the math score and the reading score of a student?

Correlation analysis assumptions:

  1. Variables must be numeric. (this assumption is met)

  2. Errors are normally distributed. (checked below with the Shapiro-Wilk test on a random sample of 200 students)

  3. Linear relationship between variables.

mydata2 <- mydata1[sample(nrow(mydata1), 200), ]

head(mydata2)
##     gender test.preparation.course math.score reading.score genderF
## 841 female                    none         39            52  Female
## 885 female                    none         51            51  Female
## 423 female               completed         47            58  Female
## 984 female               completed         78            87  Female
## 283 female                    none         73            79  Female
## 519 female               completed         66            78  Female
##     test.preparation.courseF
## 841                     None
## 885                     None
## 423                Completed
## 984                Completed
## 283                     None
## 519                Completed
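
Because sample() draws different rows on every run, all results computed from mydata2 change each time the document is re-knit. A minimal sketch of making such a draw reproducible is shown below; the data frame `demo` and the seed 123 are made up for illustration and are not part of the original analysis.

```r
# Hypothetical stand-in data frame; the real analysis would use mydata1
demo <- data.frame(id = 1:1000)

set.seed(123)  # arbitrary seed, chosen only for illustration
first_draw <- demo[sample(nrow(demo), 200), , drop = FALSE]

set.seed(123)  # same seed, so the same rows are drawn again
second_draw <- demo[sample(nrow(demo), 200), , drop = FALSE]

identical(first_draw, second_draw)
```

Setting the seed once before the sampling step would keep the printed output and the written interpretation in agreement across runs.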
library(car)
## Loading required package: carData
#Checking scatterplot
scatterplotMatrix(mydata2[, c(3, 4)], smooth = FALSE)

From the scatter plot, we can see that there is a linear relationship between math scores and reading scores, so accordingly, we can say that the third assumption is met.

scatterplot(mydata2$math.score ~ mydata2$reading.score,
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "math.score",
            xlab = "reading.score")

#Checking normality with Shapiro-Wilk test
shapiro.test(mydata2$math.score)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$math.score
## W = 0.98801, p-value = 0.09032
shapiro.test(mydata2$reading.score)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata2$reading.score
## W = 0.98735, p-value = 0.07207

Hypothesis for both variables

H0: Variable is normally distributed.

H1: Variable is not normally distributed.

Math Score: The p-value is 0.090 (> 0.05). Therefore, we cannot reject the null hypothesis; this sample gives no evidence that the math scores deviate from a normal distribution.

Reading Score: The p-value is 0.072 (> 0.05). Therefore, we cannot reject the null hypothesis for the reading scores either.

Both p-values are close to the 5% threshold, and because mydata2 is drawn at random without a fixed seed, they change from run to run. Spearman’s correlation, which does not assume normality, is therefore the safer choice for analyzing the relationship between math.score and reading.score.

Spearman correlation coefficient

#Checking Pearson correlation for educational purposes
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
cor(mydata2$math.score, mydata2$reading.score, 
      method= "pearson")
## [1] 0.830166
cor.test(mydata2$math.score, mydata2$reading.score,
         method = "pearson",
         exact = FALSE)
## 
##  Pearson's product-moment correlation
## 
## data:  mydata2$math.score and mydata2$reading.score
## t = 20.953, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7814284 0.8688362
## sample estimates:
##      cor 
## 0.830166
#Checking Spearman correlation - the CORRECT one
cor(mydata2$math.score, mydata2$reading.score, 
      method= "spearman")
## [1] 0.8120235
cor.test(mydata2$math.score, mydata2$reading.score,
         method = "spearman",
         exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  mydata2$math.score and mydata2$reading.score
## S = 250629, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8120235

Hypothesis

H0: There is no correlation.

H1: There is a correlation.

We reject H0 at p<0.001. We can conclude that there is a correlation between math scores and reading scores.

The relationship between math score and reading score is positive and strong (Spearman correlation coefficient is 0.812).
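
Spearman’s coefficient is Pearson’s coefficient computed on ranks, which is why it captures monotonic rather than strictly linear relationships. The sketch below illustrates this with a few made-up scores; the vectors `math` and `reading` are hypothetical and are not taken from the data set.

```r
# Hypothetical scores, made up for illustration only
math    <- c(72, 69, 90, 47, 76, 71, 88, 40)
reading <- c(72, 90, 95, 57, 78, 83, 95, 43)

# Built-in Spearman correlation
rho_builtin <- cor(math, reading, method = "spearman")

# Same value from Pearson correlation of the ranks
# (rank() gives tied values their average rank)
rho_by_rank <- cor(rank(math), rank(reading), method = "pearson")

all.equal(rho_builtin, rho_by_rank)
```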