mydata <- read.table("./Student_performance_data _.csv", header = TRUE, sep = ",", dec = ".")
head(mydata)
##   StudentID Age Gender Ethnicity ParentalEducation StudyTimeWeekly Absences
## 1      1001  17      1         0                 2       19.833723        7
## 2      1002  18      0         0                 1       15.408756        0
## 3      1003  15      0         2                 3        4.210570       26
## 4      1004  17      1         0                 3       10.028829       14
## 5      1005  17      1         0                 2        4.672495       17
## 6      1006  18      0         0                 1        8.191219        0
##   Tutoring ParentalSupport Extracurricular Sports Music Volunteering       GPA
## 1        1               2               0      0     1            0 2.9291956
## 2        0               1               0      0     0            0 3.0429148
## 3        0               2               0      0     0            0 0.1126023
## 4        0               3               1      0     0            0 2.0542181
## 5        1               3               0      0     0            0 1.2880612
## 6        0               1               1      0     0            0 3.0841836
##   GradeClass
## 1          2
## 2          1
## 3          4
## 4          3
## 5          4
## 6          1

Data description

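The unit of observation is one student. The data set has 2392 observations and 15 variables. The 11 variables listed here are described; Extracurricular, Sports, Music, and Volunteering are not used in the analysis.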

  1. StudentID: ID number of the student.
  2. Age: The age of the student in years.
  3. Gender of the student: 0 (Male), 1 (Female).
  4. Ethnicity of the student: 0 (Caucasian), 1 (African American), 2 (Asian), 3 (Other)
  5. Parental Education: 0 (None), 1 (High School), 2 (Some University), 3 (Bachelor’s degree), 4 (Higher)
  6. Study time Weekly: Weekly study time in hours, ranging from 0 to 20.
  7. Absences: Number of absences during the school year, ranging from 0 to 30.
  8. Tutoring: 0 (No), 1 (Yes).
  9. Parental support: 0 (None), 1 (Low), 2 (Moderate), 3 (High), 4 (Very High).
  10. GPA: Grade Point Average on a scale from 0 to 4.0 (the observed values span this full range), influenced by study habits and parental involvement.
  11. Grade class: Classification of students’ grades based on GPA: 0 (A, GPA >= 3.5), 1 (B, 3.0 <= GPA < 3.5), 2 (C, 2.5 <= GPA < 3.0), 3 (D, 2.0 <= GPA < 2.5), 4 (F, GPA < 2.0).
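
The GPA thresholds in item 11 can be expressed with `cut()`. The sketch below is only an illustration: `grade_class` is a hypothetical helper that is not part of the original analysis, and it assumes the intervals are closed on the left, as the `>=` signs in the description suggest.

```r
# Hypothetical helper (not part of the original analysis): map GPA to the
# GradeClass codes listed above (0 = A, 1 = B, 2 = C, 3 = D, 4 = F).
grade_class <- function(gpa) {
  classes <- cut(gpa,
                 breaks = c(-Inf, 2.0, 2.5, 3.0, 3.5, Inf),
                 labels = c(4, 3, 2, 1, 0),
                 right = FALSE)  # intervals are [lower, upper), so 3.5 maps to A
  as.integer(as.character(classes))
}

# GPA values of the six students printed by head(mydata)
gpa_head <- c(2.9291956, 3.0429148, 0.1126023, 2.0542181, 1.2880612, 3.0841836)
grade_class(gpa_head)
```

Comparing the result with the GradeClass column printed by head(mydata) is a quick consistency check on the coding.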

The dataset was obtained from Kaggle.com on 15.01.2025.

Data manipulation

#mydata <- mydata[c(-10, -11, -12, -13)]
#head(mydata)
mydata$Gender <- factor(mydata$Gender)
mydata$Ethnicity <- factor(mydata$Ethnicity)

In this section I converted the categorical variables Gender and Ethnicity to factors.
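
Factors are easier to read in tables and plots when the numeric codes carry labels. A minimal sketch, using toy vectors in place of mydata$Gender and mydata$Ethnicity and taking the labels from the data description (the vector names are illustrative only):

```r
# Toy codes standing in for mydata$Gender and mydata$Ethnicity
gender_codes <- c(1, 0, 0, 1, 1, 0)
ethnicity_codes <- c(0, 0, 2, 0, 0, 0)

# levels = the codes in the data, labels = their meaning from the description
gender <- factor(gender_codes, levels = c(0, 1), labels = c("Male", "Female"))
ethnicity <- factor(ethnicity_codes, levels = 0:3,
                    labels = c("Caucasian", "African American", "Asian", "Other"))

table(gender)      # counts per labelled category
levels(ethnicity)  # all four categories are kept, even if unobserved
```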

library(psych)
describe(mydata)
##                   vars    n    mean     sd  median trimmed    mad  min     max
## StudentID            1 2392 2196.50 690.66 2196.50 2196.50 886.59 1001 3392.00
## Age                  2 2392   16.47   1.12   16.00   16.46   1.48   15   18.00
## Gender*              3 2392    1.51   0.50    2.00    1.51   0.00    1    2.00
## Ethnicity*           4 2392    1.88   1.03    1.00    1.73   0.00    1    4.00
## ParentalEducation    5 2392    1.75   1.00    2.00    1.75   1.48    0    4.00
## StudyTimeWeekly      6 2392    9.77   5.65    9.71    9.73   6.97    0   19.98
## Absences             7 2392   14.54   8.47   15.00   14.57  10.38    0   29.00
## Tutoring             8 2392    0.30   0.46    0.00    0.25   0.00    0    1.00
## ParentalSupport      9 2392    2.12   1.12    2.00    2.14   1.48    0    4.00
## Extracurricular     10 2392    0.38   0.49    0.00    0.35   0.00    0    1.00
## Sports              11 2392    0.30   0.46    0.00    0.25   0.00    0    1.00
## Music               12 2392    0.20   0.40    0.00    0.12   0.00    0    1.00
## Volunteering        13 2392    0.16   0.36    0.00    0.07   0.00    0    1.00
## GPA                 14 2392    1.91   0.92    1.89    1.90   1.07    0    4.00
## GradeClass          15 2392    2.98   1.23    4.00    3.16   0.00    0    4.00
##                     range  skew kurtosis    se
## StudentID         2391.00  0.00    -1.20 14.12
## Age                  3.00  0.04    -1.37  0.02
## Gender*              1.00 -0.04    -2.00  0.01
## Ethnicity*           3.00  0.76    -0.77  0.02
## ParentalEducation    4.00  0.22    -0.29  0.02
## StudyTimeWeekly     19.98  0.05    -1.14  0.12
## Absences            29.00 -0.03    -1.18  0.17
## Tutoring             1.00  0.86    -1.25  0.01
## ParentalSupport      4.00 -0.17    -0.73  0.02
## Extracurricular      1.00  0.48    -1.77  0.01
## Sports               1.00  0.85    -1.27  0.01
## Music                1.00  1.52     0.32  0.01
## Volunteering         1.00  1.88     1.54  0.01
## GPA                  4.00  0.01    -0.87  0.02
## GradeClass           4.00 -0.90    -0.42  0.03

RQ1: “Is there a significant difference in the mean of the GPA between students who receive tutoring and those who don’t?”

mean_tutoring <- mean(mydata$GPA[mydata$Tutoring == 1], na.rm = TRUE)

mean_no_tutoring <- mean(mydata$GPA[mydata$Tutoring == 0], na.rm = TRUE)

mean_tutoring
## [1] 2.108325
mean_no_tutoring
## [1] 1.818968

Interpretation: The mean GPA of students who received tutoring (2.11) is higher than that of students who did not (1.82).
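
Both group means can also be obtained in one call with `tapply()`. The sketch below uses only the six rows printed by head(mydata) as toy data, so its numbers differ from the full-sample means above.

```r
# Toy data: GPA and Tutoring of the six students shown by head(mydata)
gpa <- c(2.9291956, 3.0429148, 0.1126023, 2.0542181, 1.2880612, 3.0841836)
tutoring <- c(1, 0, 0, 0, 1, 0)

# Mean GPA per tutoring group; the names of the result are the group codes
group_means <- tapply(gpa, tutoring, mean, na.rm = TRUE)
group_means

# Difference in means (tutoring minus no tutoring)
unname(group_means["1"] - group_means["0"])
```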

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = GPA, fill = as.factor(Tutoring))) +
  geom_histogram(position = "identity", alpha = 0.9, binwidth = 0.1, colour = "black") +
  facet_wrap(~ Tutoring, scales = "free") +
  labs(title = "GPA Distribution by Tutoring Group", x = "GPA", y = "Frequency", fill = "Tutoring") +
  theme_minimal()

Interpretation: The red (left) panel shows the GPA distribution of students who did not receive tutoring. It is slightly skewed to the right, which indicates somewhat lower GPAs. The blue (right) panel shows the GPA distribution of students who did receive tutoring. It is slightly skewed to the left, which indicates somewhat higher GPAs.

shapiro.test(mydata$GPA[mydata$Tutoring == 1])
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$GPA[mydata$Tutoring == 1]
## W = 0.98201, p-value = 9.628e-08
shapiro.test(mydata$GPA[mydata$Tutoring == 0])
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$GPA[mydata$Tutoring == 0]
## W = 0.98076, p-value = 3.466e-14

Interpretation: We performed the Shapiro-Wilk test to check whether GPA is normally distributed in each group, students who received tutoring and students who did not. The hypotheses are the same for both groups. H0: GPA is normally distributed. H1: GPA is not normally distributed. We reject the null hypothesis for both groups (p < 0.001 in each case).
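
The two Shapiro-Wilk calls can be collapsed into one by applying the test per group. A small sketch on simulated data (the data frame `toy` and its values are made up for illustration and do not reproduce the results above):

```r
set.seed(1)

# Simulated stand-in for mydata: one clearly non-normal group, one normal group
toy <- data.frame(
  GPA = c(runif(200, min = 0, max = 4), rnorm(100, mean = 2, sd = 0.5)),
  Tutoring = rep(c(0, 1), times = c(200, 100))
)

# One p-value per tutoring group; shapiro.test() accepts 3 to 5000 observations
p_values <- tapply(toy$GPA, toy$Tutoring, function(v) shapiro.test(v)$p.value)
p_values

# TRUE means normality is rejected at the 5% level for that group
p_values < 0.05
```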

t.test(GPA ~ Tutoring, data = mydata, var.equal = FALSE, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  GPA by Tutoring
## t = -7.1725, df = 1366.6, p-value = 1.203e-12
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.3684970 -0.2102165
## sample estimates:
## mean in group 0 mean in group 1 
##        1.818968        2.108325

Interpretation: The Welch two-sample t-test shows that the mean GPA is lower in the group that did not receive tutoring (1.82) than in the group that did (2.11), with t = -7.17 and df = 1366.6. The difference is statistically significant (p < 0.001). However, because the Shapiro-Wilk tests rejected normality of GPA in both groups, I also perform the non-parametric Wilcoxon rank-sum test.
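
To make the Welch correction concrete, the statistic and its degrees of freedom can be computed by hand and compared with `t.test()`. This is a sketch on simulated data; the vectors `x` and `y` are illustrative and unrelated to the GPA values above.

```r
set.seed(42)

# Two simulated groups with different sizes and variances
x <- rnorm(60, mean = 1.8, sd = 0.9)
y <- rnorm(40, mean = 2.1, sd = 0.8)

# Per-group variance of the mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
welch_df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

welch <- t.test(x, y, var.equal = FALSE)
all.equal(t_stat, unname(welch$statistic))
all.equal(welch_df, unname(welch$parameter))
```

The fractional degrees of freedom in the output above (1366.6) come from this approximation.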

wilcox.test(GPA ~ Tutoring, data = mydata,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  GPA by Tutoring
## W = 499984, p-value = 3.918e-11
## alternative hypothesis: true location shift is not equal to 0

Interpretation: H0: There is no difference in the GPA distribution between the two groups. H1: There is a difference in the GPA distribution between the two groups. We reject the null hypothesis at p < 0.001. This test confirms that there is a statistically significant difference in the GPA distribution between the two groups of students.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
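effectsize::rank_biserial(GPA ~ Tutoring, data = mydata)

The effect size is computed for the same comparison as the test, GPA by tutoring group. The rank-biserial correlation follows from the Wilcoxon statistic above and the group sizes (1671 students without and 721 with tutoring): r = 2W / (n0 * n1) - 1 = 2 * 499984 / (1671 * 721) - 1 ≈ -0.17. The negative sign means that students without tutoring tend to have lower GPAs. The effect size is small (Funder & Ozer, 2019).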
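
The rank-biserial correlation is tied directly to the W statistic reported by `wilcox.test()`: r = 2W / (n1 * n2) - 1, where n1 and n2 are the group sizes. The sketch below checks this on simulated data against the pairwise definition (share of pairs where the first group is higher minus share where it is lower); the vectors are illustrative only.

```r
set.seed(7)

# Two simulated groups (continuous values, so no ties)
x <- rnorm(30, mean = 1.8)
y <- rnorm(20, mean = 2.1)

# W counts the (x, y) pairs in which x exceeds y
w <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic
r_from_w <- unname(2 * w / (length(x) * length(y)) - 1)

# Pairwise definition of the rank-biserial correlation
diffs <- outer(x, y, "-")
r_pairwise <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)

all.equal(r_from_w, r_pairwise)
```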

Answer to the research question: Yes. Both the Welch t-test and the Wilcoxon rank-sum test show a statistically significant difference in GPA between the two groups (p < 0.001); students who receive tutoring have the higher mean GPA (2.11 vs. 1.82).

RQ2: “Is there a significant correlation between the time students spend studying weekly and their number of absences?”

Before running the correlation test, I will check with the Shapiro-Wilk test whether the two variables are normally distributed.

shapiro.studytime <- shapiro.test(mydata$StudyTimeWeekly)
print(shapiro.studytime)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StudyTimeWeekly
## W = 0.95999, p-value < 2.2e-16
shapiro.absences <- shapiro.test(mydata$Absences)
print(shapiro.absences)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Absences
## W = 0.95568, p-value < 2.2e-16

Based on the tests, neither variable follows a normal distribution (p < 0.001), so I will use the Spearman correlation test.

library(ggplot2)
ggplot(mydata, aes(x = StudyTimeWeekly, y = Absences)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(title = "Scatterplot of Study Time vs. Absences",
       x = "Study Time Weekly (hours)",
       y = "Number of Absences") +
  theme_minimal()

Based on the scatterplot, there is no visible relationship between weekly study time and the number of absences. I will now test this with the Spearman correlation test.

cor(mydata$StudyTimeWeekly, mydata$Absences, 
         method = "spearman",
         use = "complete.obs")
## [1] 0.009183532
# cor.test() drops incomplete pairs on its own, so no use argument is needed
cor.test(mydata$StudyTimeWeekly, mydata$Absences,
         method = "spearman",
         exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  mydata$StudyTimeWeekly and mydata$Absences
## S = 2260088346, p-value = 0.6535
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## 0.009183532

Interpretation: H0: There is no correlation between weekly study time and absences. H1: There is a correlation between weekly study time and absences. Based on Spearman's rank correlation test, we cannot reject the null hypothesis (p = 0.654). The correlation coefficient (rho = 0.009) is positive but practically zero. Answer to the research question: There is no statistically significant correlation between weekly study time and the number of absences.
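
Spearman's rho is the Pearson correlation computed on the ranks of the two variables, which is why it does not require normality. A short sketch on simulated data (the vectors are illustrative and do not reproduce the result above):

```r
set.seed(3)

# Simulated stand-ins for StudyTimeWeekly and Absences
study_time <- runif(50, min = 0, max = 20)
absences <- sample(0:29, size = 50, replace = TRUE)

# Spearman's rho directly ...
rho <- cor(study_time, absences, method = "spearman")

# ... and as a Pearson correlation of the ranks (ties receive average ranks)
rho_ranks <- cor(rank(study_time), rank(absences), method = "pearson")

all.equal(rho, rho_ranks)
```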

RQ3: “Is there an association between parental support and tutoring?”

table(mydata$ParentalSupport, mydata$Tutoring)
##    
##       0   1
##   0 151  61
##   1 338 151
##   2 514 226
##   3 491 206
##   4 177  77
chi_square <- chisq.test(mydata$ParentalSupport, mydata$Tutoring,
                         correct = TRUE)
chi_square
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$ParentalSupport and mydata$Tutoring
## X-squared = 0.48818, df = 4, p-value = 0.9746

Interpretation: I tested for an association with the chi-square test of independence. H0: There is no association between parental support and tutoring. H1: There is an association between parental support and tutoring. Based on the p-value (0.975), we cannot reject the null hypothesis.

addmargins(chi_square$observed)
##                       mydata$Tutoring
## mydata$ParentalSupport    0    1  Sum
##                    0    151   61  212
##                    1    338  151  489
##                    2    514  226  740
##                    3    491  206  697
##                    4    177   77  254
##                    Sum 1671  721 2392
addmargins(round(chi_square$expected, 2))
##                       mydata$Tutoring
## mydata$ParentalSupport       0      1  Sum
##                    0    148.10  63.90  212
##                    1    341.60 147.40  489
##                    2    516.95 223.05  740
##                    3    486.91 210.09  697
##                    4    177.44  76.56  254
##                    Sum 1671.00 721.00 2392

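The two tables above show the observed and the expected frequencies. If there were an association between the two variables, comparing them would show which combinations contribute most to it. Here the observed frequencies are close to the expected ones in every cell, which again indicates no association. All expected frequencies are also well above 5, so the assumption of the chi-square test is met.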
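
The expected frequencies are the products of the row and column totals divided by the sample size, and the test statistic sums the squared deviations relative to them. The sketch below rebuilds both from the observed table printed above; the object names are illustrative.

```r
# Observed counts copied from the table above (rows: ParentalSupport 0-4)
observed <- matrix(c(151,  61,
                     338, 151,
                     514, 226,
                     491, 206,
                     177,  77),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(ParentalSupport = as.character(0:4),
                                   Tutoring = c("0", "1")))

# Expected count per cell = row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic
x_squared <- sum((observed - expected)^2 / expected)

test <- chisq.test(observed)
all.equal(expected, test$expected, check.attributes = FALSE)
all.equal(x_squared, unname(test$statistic))
```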

residuals <- chi_square$stdres
print(residuals)
##                       mydata$Tutoring
## mydata$ParentalSupport           0           1
##                      0  0.45487076 -0.45487076
##                      1 -0.39829907  0.39829907
##                      2 -0.28419482  0.28419482
##                      3  0.40112924 -0.40112924
##                      4 -0.06348824  0.06348824

Standardized residuals measure how much the observed frequencies deviate from the expected ones. In our case none is larger than 2 in absolute value, so no cell deviates notably from what we would expect under independence.
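
A standardized residual divides the difference between observed and expected count by its estimated standard error, which depends on the row and column proportions. A sketch that recomputes them from the observed table and compares the result with `chisq.test()` (object names are illustrative):

```r
# Observed counts copied from the table above (rows: ParentalSupport 0-4)
observed <- matrix(c(151,  61,
                     338, 151,
                     514, 226,
                     491, 206,
                     177,  77),
                   ncol = 2, byrow = TRUE)

n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n
row_prop <- rowSums(observed) / n
col_prop <- colSums(observed) / n

# (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion))
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

all.equal(std_res, chisq.test(observed)$stdres, check.attributes = FALSE)
max(abs(std_res))  # largest deviation in absolute value
```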

effectsize::cramers_v(mydata$ParentalSupport, mydata$Tutoring)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

The effect size is tiny (Funder & Ozer, 2019), which indicates no meaningful association. Answer to the research question: Based on the chi-square test, there is no statistically significant association between parental support and tutoring.
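
Cramér's V can be recovered from the chi-square statistic, the sample size, and the table dimensions. The sketch below computes the unadjusted value and a bias-corrected value; that this particular correction is the one behind the "(adj.)" label in the output above is an assumption, not something the output confirms.

```r
# Values taken from the chi-square test above (5 x 2 table)
x_squared <- 0.48818
n <- 2392
n_rows <- 5
n_cols <- 2

# Unadjusted Cramér's V
v_plain <- sqrt(x_squared / (n * min(n_rows - 1, n_cols - 1)))

# Bias-corrected version (assumed form of the adjustment):
# shrink phi^2 and the table dimensions, truncating phi^2 at zero
phi2_adj <- max(0, x_squared / n - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

v_plain
v_adj
```

With a chi-square statistic this far below its degrees of freedom, the corrected value is truncated at zero, which is consistent with the 0.00 reported above.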