mydata <- read.table("./Student_performance_data _.csv", header = TRUE, sep = ",", dec = ".")
head(mydata)
## StudentID Age Gender Ethnicity ParentalEducation StudyTimeWeekly Absences
## 1 1001 17 1 0 2 19.833723 7
## 2 1002 18 0 0 1 15.408756 0
## 3 1003 15 0 2 3 4.210570 26
## 4 1004 17 1 0 3 10.028829 14
## 5 1005 17 1 0 2 4.672495 17
## 6 1006 18 0 0 1 8.191219 0
## Tutoring ParentalSupport Extracurricular Sports Music Volunteering GPA
## 1 1 2 0 0 1 0 2.9291956
## 2 0 1 0 0 0 0 3.0429148
## 3 0 2 0 0 0 0 0.1126023
## 4 0 3 1 0 0 0 2.0542181
## 5 1 3 0 0 0 0 1.2880612
## 6 0 1 1 0 0 0 3.0841836
## GradeClass
## 1 2
## 2 1
## 3 4
## 4 3
## 5 4
## 6 1
The unit of observation is one student. The data has 2392 observations and 15 variables.
The dataset was obtained from Kaggle.com on 15.01.2025.
#mydata <- mydata[c(-10, -11, -12, -13)]
#head(mydata)
mydata$Gender <- factor(mydata$Gender)
mydata$Ethnicity <- factor(mydata$Ethnicity)
In this step I converted the categorical variables Gender and Ethnicity to factors.
library(psych)
describe(mydata)
## vars n mean sd median trimmed mad min max
## StudentID 1 2392 2196.50 690.66 2196.50 2196.50 886.59 1001 3392.00
## Age 2 2392 16.47 1.12 16.00 16.46 1.48 15 18.00
## Gender* 3 2392 1.51 0.50 2.00 1.51 0.00 1 2.00
## Ethnicity* 4 2392 1.88 1.03 1.00 1.73 0.00 1 4.00
## ParentalEducation 5 2392 1.75 1.00 2.00 1.75 1.48 0 4.00
## StudyTimeWeekly 6 2392 9.77 5.65 9.71 9.73 6.97 0 19.98
## Absences 7 2392 14.54 8.47 15.00 14.57 10.38 0 29.00
## Tutoring 8 2392 0.30 0.46 0.00 0.25 0.00 0 1.00
## ParentalSupport 9 2392 2.12 1.12 2.00 2.14 1.48 0 4.00
## Extracurricular 10 2392 0.38 0.49 0.00 0.35 0.00 0 1.00
## Sports 11 2392 0.30 0.46 0.00 0.25 0.00 0 1.00
## Music 12 2392 0.20 0.40 0.00 0.12 0.00 0 1.00
## Volunteering 13 2392 0.16 0.36 0.00 0.07 0.00 0 1.00
## GPA 14 2392 1.91 0.92 1.89 1.90 1.07 0 4.00
## GradeClass 15 2392 2.98 1.23 4.00 3.16 0.00 0 4.00
## range skew kurtosis se
## StudentID 2391.00 0.00 -1.20 14.12
## Age 3.00 0.04 -1.37 0.02
## Gender* 1.00 -0.04 -2.00 0.01
## Ethnicity* 3.00 0.76 -0.77 0.02
## ParentalEducation 4.00 0.22 -0.29 0.02
## StudyTimeWeekly 19.98 0.05 -1.14 0.12
## Absences 29.00 -0.03 -1.18 0.17
## Tutoring 1.00 0.86 -1.25 0.01
## ParentalSupport 4.00 -0.17 -0.73 0.02
## Extracurricular 1.00 0.48 -1.77 0.01
## Sports 1.00 0.85 -1.27 0.01
## Music 1.00 1.52 0.32 0.01
## Volunteering 1.00 1.88 1.54 0.01
## GPA 4.00 0.01 -0.87 0.02
## GradeClass 4.00 -0.90 -0.42 0.03
mean_tutoring <- mean(mydata$GPA[mydata$Tutoring == 1], na.rm = TRUE)
mean_no_tutoring <- mean(mydata$GPA[mydata$Tutoring == 0], na.rm = TRUE)
mean_tutoring
## [1] 2.108325
mean_no_tutoring
## [1] 1.818968
Interpretation: The mean GPA of students who received tutoring (2.11) is higher than that of students who did not receive tutoring (1.82).
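The two group means can also be obtained in a single call. The sketch below uses a small toy data frame with made-up values (not taken from the dataset) just to show the pattern; `toy` and `group_means` are illustrative names.

```r
# Toy data: hypothetical values, only for illustrating the pattern
toy <- data.frame(
  GPA      = c(2.0, 3.0, 1.0, 2.0),
  Tutoring = c(1, 1, 0, 0)
)

# Mean GPA per tutoring group; the result is named by group ("0", "1")
group_means <- tapply(toy$GPA, toy$Tutoring, mean, na.rm = TRUE)
print(group_means)
```

Applied to the real data, the same call would be `tapply(mydata$GPA, mydata$Tutoring, mean, na.rm = TRUE)`.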
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = GPA, fill = as.factor(Tutoring))) +
geom_histogram(position = "identity", alpha = 0.9, binwidth = 0.1, colour = "black") +
facet_wrap(~ Tutoring, scales = "free") +
labs(title = "GPA Distribution by Tutoring Group", x = "GPA", y = "Frequency", fill = "Tutoring") +
theme_minimal()
Interpretation: The red (left) graph shows the GPA distribution of students who did not receive tutoring. It is slightly right-skewed, which means these students are concentrated at somewhat lower GPA values. The blue (right) graph shows the GPA distribution of students who did receive tutoring. It is slightly left-skewed, which means these students are concentrated at somewhat higher GPA values.
shapiro.test(mydata$GPA[mydata$Tutoring == 1])
##
## Shapiro-Wilk normality test
##
## data: mydata$GPA[mydata$Tutoring == 1]
## W = 0.98201, p-value = 9.628e-08
shapiro.test(mydata$GPA[mydata$Tutoring == 0])
##
## Shapiro-Wilk normality test
##
## data: mydata$GPA[mydata$Tutoring == 0]
## W = 0.98076, p-value = 3.466e-14
Interpretation: We performed the Shapiro-Wilk normality test to check whether GPA is normally distributed in the population, separately for students who received tutoring and for those who did not. The hypotheses are the same for both groups. H0: GPA is normally distributed. H1: GPA is not normally distributed. Based on the results, we reject the null hypothesis for both groups (p < 0.001 for students with tutoring and p < 0.001 for students without tutoring).
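The two Shapiro-Wilk calls can be written as one step by splitting the variable by group. This is a minimal sketch on simulated data (the means, standard deviation and group sizes are arbitrary assumptions, not estimates from the dataset).

```r
# Simulated data: arbitrary parameters, only to demonstrate the pattern
set.seed(1)
sim <- data.frame(
  GPA      = c(rnorm(100, mean = 1.8, sd = 0.9), rnorm(60, mean = 2.1, sd = 0.9)),
  Tutoring = rep(c(0, 1), times = c(100, 60))
)

# Run the Shapiro-Wilk test within each group and keep the p-values
p_values <- sapply(split(sim$GPA, sim$Tutoring),
                   function(g) shapiro.test(g)$p.value)
print(p_values)
```

With the real data, `split(mydata$GPA, mydata$Tutoring)` would take the place of the simulated split.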
t.test(GPA ~ Tutoring, data = mydata, var.equal = FALSE, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: GPA by Tutoring
## t = -7.1725, df = 1366.6, p-value = 1.203e-12
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.3684970 -0.2102165
## sample estimates:
## mean in group 0 mean in group 1
## 1.818968 2.108325
print(t.test)
## function (x, ...)
## UseMethod("t.test")
## <bytecode: 0x119270190>
## <environment: namespace:stats>
Interpretation: The independent-samples t-test with Welch correction indicates that the mean GPA of the group that did not receive tutoring (1.82) is lower than the mean GPA of the group that did (2.11), with t = -7.17 and a 95% confidence interval for the difference of [-0.37, -0.21]. The p-value (p < 0.001) shows that the difference is statistically significant. However, because the Shapiro-Wilk tests rejected normality of GPA in both groups, we also perform the non-parametric Wilcoxon rank-sum test.
wilcox.test(GPA ~ Tutoring, data = mydata,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: GPA by Tutoring
## W = 499984, p-value = 3.918e-11
## alternative hypothesis: true location shift is not equal to 0
Interpretation: H0: There is no difference in the GPA distribution between the two groups. H1: There is a difference in the GPA distribution between the two groups. We reject the null hypothesis at p < 0.001. This test confirms that there is a statistically significant difference in the GPA distribution between the two groups of students.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
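# The grouping variable must enter through the formula, so that GPA is compared across the two Tutoring groups
effectsize::rank_biserial(GPA ~ Tutoring, data = mydata)
From the Wilcoxon statistic above (W = 499984) and the group sizes (1671 students without tutoring, 721 with tutoring), the rank-biserial correlation is 2W / (n0 * n1) - 1 ≈ -0.17. The negative sign means that GPA tends to be lower in the group without tutoring. The effect size is small (Funder & Ozer, 2019).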
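A rank-biserial value can be cross-checked by hand from the Wilcoxon statistic and the group sizes. The sketch below assumes that the W reported by `wilcox.test(GPA ~ Tutoring, ...)` counts the pairs in which a group 0 student has a higher GPA than a group 1 student (ties counted as one half), and it takes the group sizes 1671 and 721 from the Tutoring column totals of the contingency table in this report. Note that `wilcox.test(x, y)` with two vectors treats them as two independent samples, so the grouping variable should enter through a formula (`GPA ~ Tutoring`) rather than as the first vector.

```r
# Values taken from the Wilcoxon output and the Tutoring group sizes in this report
W  <- 499984   # Wilcoxon rank-sum statistic for GPA ~ Tutoring
n0 <- 1671     # students without tutoring
n1 <- 721      # students with tutoring

# Rank-biserial correlation: share of favourable pairs minus share of unfavourable pairs
r_rb <- 2 * W / (n0 * n1) - 1
round(r_rb, 2)
```

The result is about -0.17, a small effect by the Funder & Ozer (2019) benchmarks.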
Answer to the research question: Yes, there is a statistically significant difference in GPA between students who receive tutoring and those who do not. Students who receive tutoring have the higher average GPA (2.11 vs. 1.82).
Before running the correlation test, I will check whether the two variables are normally distributed with the Shapiro-Wilk test.
shapiro.studytime <- shapiro.test(mydata$StudyTimeWeekly)
print(shapiro.studytime)
##
## Shapiro-Wilk normality test
##
## data: mydata$StudyTimeWeekly
## W = 0.95999, p-value < 2.2e-16
shapiro.absences <- shapiro.test(mydata$Absences)
print(shapiro.absences)
##
## Shapiro-Wilk normality test
##
## data: mydata$Absences
## W = 0.95568, p-value < 2.2e-16
Based on the tests, we can conclude that neither variable follows the normal distribution (p < 0.001 for both). This means I will use the Spearman correlation test.
library(ggplot2)
ggplot(mydata, aes(x = StudyTimeWeekly, y = Absences)) +
geom_point(color = "blue", alpha = 0.6) +
labs(title = "Scatterplot of Study Time vs. Absences",
x = "Study Time Weekly (hours)",
y = "Number of Absences") +
theme_minimal()
Based on the scatterplot, there is no visible relationship between the number of absences and weekly study time. Now I will test this with the Spearman correlation test.
cor(mydata$StudyTimeWeekly, mydata$Absences,
method = "spearman",
use = "complete.obs")
## [1] 0.009183532
# cor.test() has no "use" argument, so none is passed here
cor.test(mydata$StudyTimeWeekly, mydata$Absences,
method = "spearman",
exact = FALSE)
##
## Spearman's rank correlation rho
##
## data: mydata$StudyTimeWeekly and mydata$Absences
## S = 2260088346, p-value = 0.6535
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.009183532
Interpretation: H0: There is no correlation between study time and absences. H1: There is a correlation between study time and absences. Based on Spearman's rank correlation test, we cannot reject the null hypothesis (p = 0.654). The correlation coefficient (rho = 0.009) is positive but practically zero. Answer to the research question: There is no statistically significant correlation between weekly study time and the number of absences.
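Spearman's rho is the Pearson correlation computed on the ranks of the two variables, which is why it does not require normality. The sketch below checks this on simulated data; the ranges only loosely mimic study time and absences and are assumptions for illustration.

```r
# Simulated data: arbitrary values, only to illustrate the equivalence
set.seed(42)
x <- runif(50, min = 0, max = 20)        # stand-in for weekly study time
y <- sample(0:29, 50, replace = TRUE)    # stand-in for number of absences

rho_builtin <- cor(x, y, method = "spearman")
rho_manual  <- cor(rank(x), rank(y))     # Pearson correlation of the ranks

all.equal(rho_builtin, rho_manual)
```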
table(mydata$ParentalSupport, mydata$Tutoring)
##
## 0 1
## 0 151 61
## 1 338 151
## 2 514 226
## 3 491 206
## 4 177 77
chi_square <- chisq.test(mydata$ParentalSupport, mydata$Tutoring,
correct = TRUE)
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$ParentalSupport and mydata$Tutoring
## X-squared = 0.48818, df = 4, p-value = 0.9746
Interpretation: I tested for an association with the Chi-Square test. H0: There is no association between parental support and tutoring. H1: There is an association between parental support and tutoring. Based on the p-value (0.975), we cannot reject the null hypothesis.
addmargins(chi_square$observed)
## mydata$Tutoring
## mydata$ParentalSupport 0 1 Sum
## 0 151 61 212
## 1 338 151 489
## 2 514 226 740
## 3 491 206 697
## 4 177 77 254
## Sum 1671 721 2392
addmargins(round(chi_square$expected, 2))
## mydata$Tutoring
## mydata$ParentalSupport 0 1 Sum
## 0 148.10 63.90 212
## 1 341.60 147.40 489
## 2 516.95 223.05 740
## 3 486.91 210.09 697
## 4 177.44 76.56 254
## Sum 1671.00 721.00 2392
The last two tables show the observed and the expected frequencies. If there were an association between the two variables, comparing them would show which combinations contribute most to it. Here the observed frequencies are very close to the expected ones, which again shows that there is no association.
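The expected frequencies and the test statistic can be reproduced by hand from the observed table printed above. This is a minimal sketch: each expected count is row total times column total divided by the grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The object names are illustrative.

```r
# Observed counts copied from the contingency table in this report
observed <- matrix(
  c(151,  61,
    338, 151,
    514, 226,
    491, 206,
    177,  77),
  ncol = 2, byrow = TRUE,
  dimnames = list(ParentalSupport = as.character(0:4),
                  Tutoring = as.character(0:1))
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic, degrees of freedom and p-value
x_squared <- sum((observed - expected)^2 / expected)
df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value   <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(expected, 2)
round(c(X_squared = x_squared, df = df, p = p_value), 4)
```

The values agree with the `chisq.test()` output (X-squared = 0.48818, df = 4, p-value = 0.9746).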
residuals <- chi_square$stdres
print(residuals)
## mydata$Tutoring
## mydata$ParentalSupport 0 1
## 0 0.45487076 -0.45487076
## 1 -0.39829907 0.39829907
## 2 -0.28419482 0.28419482
## 3 0.40112924 -0.40112924
## 4 -0.06348824 0.06348824
Standardized residuals measure how much the observed frequencies deviate from the expected ones. In our case none of them indicates a strong deviation (all are smaller than 2 in absolute value).
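The standardized residuals can be recomputed from the observed table as (observed - expected) divided by the square root of expected * (1 - row proportion) * (1 - column proportion). The sketch below uses the counts from this report; the object names are illustrative.

```r
# Observed counts copied from the contingency table in this report
observed <- matrix(
  c(151,  61,
    338, 151,
    514, 226,
    491, 206,
    177,  77),
  ncol = 2, byrow = TRUE,
  dimnames = list(ParentalSupport = as.character(0:4),
                  Tutoring = as.character(0:1))
)

n        <- sum(observed)
row_prop <- rowSums(observed) / n
col_prop <- colSums(observed) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Standardized residual for every cell
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

round(std_res, 4)
any(abs(std_res) > 1.96)   # FALSE: no cell deviates notably from independence
```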
effectsize::cramers_v(mydata$ParentalSupport, mydata$Tutoring)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
The effect size is tiny (Funder & Ozer, 2019), which indicates no meaningful association. Answer to the research question: Based on the Chi-Square test, there is no statistically significant association between parental support and tutoring.
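To see why the adjusted Cramér's V is reported as exactly 0.00, it can be recomputed from the chi-square statistic. The sketch below assumes that the "(adj.)" value uses a Bergsma-style bias correction; this is my reading of the output and is not confirmed by the source. The unadjusted V is sqrt(X² / (N * min(rows - 1, columns - 1))).

```r
# Values taken from the chi-square output in this report
x_squared <- 0.48818
n         <- 2392
n_rows    <- 5   # levels of ParentalSupport
n_cols    <- 2   # levels of Tutoring

# Unadjusted Cramer's V
v_raw <- sqrt(x_squared / (n * min(n_rows - 1, n_cols - 1)))

# Bias-corrected version (assumed correction, see note above)
phi2_adj <- max(0, x_squared / n - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 3)
```

The unadjusted value is about 0.014, and the correction truncates it to 0 because X² / N is smaller than the bias term.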