mydata <- read.table("./student-mat.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
school: GP (Gabriel Pereira), MS (Mousinho da Silveira).
sex: F (Female), M (Male).
address: U (Urban), R (Rural).
famsize: LE3 (≤3 members), GT3 (>3 members).
Pstatus: T (Living together), A (Apart).
Medu: 0 (none) to 4 (higher education).
Fedu: 0 (none) to 4 (higher education).
Mjob: teacher, health, services, at_home, other.
Fjob: teacher, health, services, at_home, other.
reason: home, reputation, course, other.
guardian: mother, father, other.
traveltime: 1 (<15 min), 2 (15–30 min), 3 (30 min–1 hour), 4 (>1 hour).
studytime: 1 (<2 hours), 2 (2–5 hours), 3 (5–10 hours), 4 (>10 hours).
schoolsup: yes, no.
famsup: yes, no.
paid: yes, no.
activities: yes, no.
nursery: yes, no.
higher: yes, no.
internet: yes, no.
romantic: yes, no.
famrel: 1 (very bad) to 5 (excellent).
freetime: 1 (very low) to 5 (very high).
goout: 1 (very low) to 5 (very high).
Dalc: 1 (very low) to 5 (very high).
Walc: 1 (very low) to 5 (very high).
health: 1 (very bad) to 5 (very good).
Source of the data is Kaggle.com from contributor Vikram Amin.
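The coded values described here can be attached to the data as factor labels, which makes later tables easier to read. The following is a minimal sketch on a small hypothetical data frame (toy stands in for mydata and only three columns are shown), assuming the codes follow the descriptions given in this report.

```r
# Hypothetical miniature of the data set; in the report this would be mydata.
toy <- data.frame(
  school    = c("GP", "MS", "GP"),
  famsize   = c("GT3", "LE3", "GT3"),
  studytime = c(2, 3, 1)
)

# Nominal variable: attach readable labels to the two school codes.
toy$school <- factor(toy$school,
                     levels = c("GP", "MS"),
                     labels = c("Gabriel Pereira", "Mousinho da Silveira"))

# Ordinal variable: studytime is coded 1 to 4, so the order is kept.
toy$studytime <- factor(toy$studytime,
                        levels = 1:4,
                        labels = c("<2 hours", "2-5 hours",
                                   "5-10 hours", ">10 hours"),
                        ordered = TRUE)

str(toy)
```

Declaring studytime as an ordered factor is a design choice: it keeps the ranking of the categories, which matters if the variable is later used in rank-based methods.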
RESEARCH QUESTION 1: Is there a difference between the grade for the first school period (G1) and the final grade (G3)?
library(psych)
psych::describe(mydata[ , c("G1", "G3")])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## G1 1 395 10.91 3.32 11 10.80 4.45 3 19 16 0.24 -0.71 0.17
## G3 2 395 10.42 4.58 11 10.84 4.45 0 20 20 -0.73 0.37 0.23
Description of statistical data: G1 is the grade for the first school period; G3 is the final grade.
Mean: The mean for G1 (10.91) is slightly higher than for G3 (10.42), showing a small decline in average performance from the first school period to the final grade.
Variability: G3 has a larger standard deviation (4.58 vs. 3.32) and range (20 vs. 16) than G1, indicating greater variability in final grades. G3 spans the full range from 0 to 20, while G1 ranges from 3 to 19.
Skewness and distribution: G1 is slightly positively skewed (0.24), while G3 is moderately negatively skewed (-0.73). The negative skew of G3 reflects a longer left tail: the mean (10.42) lies below the median (11) because a smaller group of students has very low final grades.
Kurtosis: G1 has negative excess kurtosis (-0.71), meaning a flatter distribution with lighter tails than a normal distribution, while G3 has slightly positive excess kurtosis (0.37), meaning a somewhat more peaked distribution with heavier tails.
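To make the skewness and kurtosis values easier to interpret, the sketch below computes both from central moments in base R. It assumes the formulas match the default used by psych (type = 3, with excess kurtosis); the helper names skew_type3 and kurt_type3 and the grades vector are hypothetical and only serve as illustration.

```r
# Sample skewness and excess kurtosis from central moments.
# Assumption: this mirrors psych's default (type = 3), which divides by
# n * sd^3 and n * sd^4, with sd() being the usual n - 1 standard deviation.
skew_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}

kurt_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum((x - mean(x))^4) / (n * sd(x)^4) - 3   # subtract 3: excess kurtosis
}

# Hypothetical grades with a few very low values (long left tail).
grades <- c(0, 0, 8, 10, 11, 11, 12, 12, 13, 14, 15)

skew_type3(grades)   # negative: the tail points toward low grades
kurt_type3(grades)
```

A negative skewness, as for G3, arises when a few very low values pull the mean below the bulk of the data.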
mydata$Difference <- mydata$G1 - mydata$G3
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = Difference)) +
geom_histogram(position = "identity", binwidth = 3, colour = "black") +
ylab("Frequency") +
xlab("Difference")
Because Difference is computed as G1 - G3, positive values mean that the final grade is lower than the first-period grade (a decline), and negative values mean an improvement. Most differences are concentrated around 0: the highest bar is centered at 0 with a frequency of almost 250 students, so for the majority of students the first-period grade (G1) and the final grade (G3) were very similar. With a bin width of 3, this bar also contains differences of about one grade point in either direction, not only exact zeros. The distribution appears skewed to the right, so the longer tail lies on the side of declines (G3 < G1): a small number of students finished with a final grade well below their first-period grade.
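Since Difference is defined as G1 - G3, the sign convention is easy to mix up. A minimal sketch of how the direction of change could be tabulated is given below; the vectors g1 and g3 are hypothetical values, not taken from the data set.

```r
# Hypothetical first-period and final grades for eight students.
g1 <- c(10, 12, 8, 15, 11, 9, 14, 7)
g3 <- c(10, 11, 9, 15, 0, 9, 13, 8)

difference <- g1 - g3   # same sign convention as mydata$Difference

# Positive difference = final grade lower than first-period grade.
direction <- factor(sign(difference),
                    levels = c(-1, 0, 1),
                    labels = c("improved", "unchanged", "declined"))

table(direction)
```

Applied to mydata$Difference, the same table would show directly how many students improved, stayed the same, or declined, which the histogram with a bin width of 3 cannot resolve.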
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggqqplot(mydata$Difference)
The Q-Q plot shows that Difference is approximately normal in the middle range but has heavier tails, which points to outliers or a non-normal distribution in the extreme values.
shapiro.test(mydata$Difference)
##
## Shapiro-Wilk normality test
##
## data: mydata$Difference
## W = 0.83277, p-value < 2.2e-16
H0: Differences are normally distributed.
H1: Differences are not normally distributed.
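We reject H0 at p < 0.001: the differences are not normally distributed, so the normality assumption of the paired t-test is not met. The paired t-test is reported below together with the non-parametric Wilcoxon signed rank test.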
t.test(mydata$G1, mydata$G3,
paired = TRUE,
alternative = "two.sided")
##
## Paired t-test
##
## data: mydata$G1 and mydata$G3
## t = 3.5517, df = 394, p-value = 0.0004291
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.2204052 0.7669366
## sample estimates:
## mean difference
## 0.4936709
The paired t-test indicates a statistically significant decline in grades from the first school period to the final grade (t = 3.55, df = 394, p < 0.001). The 95% confidence interval for the mean difference (0.22 to 0.77) does not include 0. Although the difference is statistically significant, the average drop is small: about 0.49 grade points. Because the differences are not normally distributed, this result should be read with caution.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(mydata$Difference)
## Cohen's d | 95% CI
## ------------------------
## 0.18 | [0.08, 0.28]
interpret_cohens_d(0.18, rules = "sawilowsky2009")
## [1] "very small"
## (Rules: sawilowsky2009)
The decline in grades from G1 to G3, though statistically significant in the paired t-test, corresponds to a very small effect (d = 0.18) and therefore has little practical importance.
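The link between the t statistic, the confidence interval and Cohen's d can be checked by hand. The sketch below uses only the numbers printed in the t-test output and assumes that Cohen's d for paired data is the mean difference divided by the standard deviation of the differences.

```r
# Values copied from the paired t-test output.
t_stat    <- 3.5517
n         <- 395
mean_diff <- 0.4936709

# For paired data, t = mean_diff / (sd_diff / sqrt(n)) and
# d = mean_diff / sd_diff, so d = t / sqrt(n).
d       <- t_stat / sqrt(n)
sd_diff <- mean_diff / d        # implied standard deviation of the differences
se      <- sd_diff / sqrt(n)    # standard error of the mean difference

# 95% confidence interval for the mean difference.
ci <- mean_diff + c(-1, 1) * qt(0.975, df = n - 1) * se

round(d, 2)
round(ci, 2)
```

The rounded results agree with the reported d of 0.18 and the reported interval of about 0.22 to 0.77. They also show why a large sample can give a small p-value for a small effect: the t statistic grows with sqrt(n), while d does not.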
wilcox.test(mydata$G1, mydata$G3,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon signed rank test
##
## data: mydata$G1 and mydata$G3
## V = 24375, p-value = 0.3151
## alternative hypothesis: true location shift is not equal to 0
We fail to reject the null hypothesis (H0), meaning there is no significant difference between the first school period grades (G1) and the final grades (G3) based on the Wilcoxon signed rank test.
library(effectsize)
effectsize(wilcox.test(mydata$G1, mydata$G3,
paired = TRUE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.07 | [-0.05, 0.18]
The rank biserial correlation confirms that the effect (difference between G1 and G3) is practically negligible. Even though the paired t-test suggested a statistically significant difference, the effect size from this non-parametric test shows that the difference has very little practical importance.
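For readers unfamiliar with the statistic V and the rank biserial correlation, the sketch below computes both by hand on hypothetical paired scores (x and y are made-up values). It assumes the matched-pairs form of the rank biserial correlation, that is, the difference between the positive and negative rank sums divided by their total.

```r
# Hypothetical paired scores for six students.
x <- c(10, 12, 9, 14, 11, 8)
y <- c(9, 12, 11, 10, 10, 5)

d <- x - y
d <- d[d != 0]          # zero differences are dropped
r <- rank(abs(d))       # rank the absolute differences (ties get mean ranks)

v_pos <- sum(r[d > 0])  # V: sum of the ranks of the positive differences
v_neg <- sum(r[d < 0])  # sum of the ranks of the negative differences

# Matched-pairs rank biserial correlation.
r_rb <- (v_pos - v_neg) / (v_pos + v_neg)

v_pos
r_rb
```

A value of r_rb near 0, as in the report (0.07), means that the rank sums of declines and improvements are nearly balanced.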
RESEARCH QUESTION 2: Is there a correlation between Grade for the first school period (G1) and Grade for the second school period (G2)?
H0: There is no correlation between Grade for the first school period (G1) and Grade for the second school period (G2).
H1: There is a correlation between Grade for the first school period (G1) and Grade for the second school period (G2).
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(mydata[ ,c(31,32)])
The Pearson correlation coefficient shown in the scatterplot matrix is 0.852 (p < 0.001).
#install.packages("Hmisc")
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ ,c(31,32)]),
type = "pearson")
## G1 G2
## G1 1.00 0.85
## G2 0.85 1.00
##
## n= 395
##
##
## P
## G1 G2
## G1 0
## G2 0
The p-value is lower than 0.001, so we reject H0. There is a strong positive correlation between G1 and G2 (r = 0.85).
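The p-value printed as 0 by rcorr is a rounded value. As a rough check, the test statistic for a Pearson correlation can be reconstructed from r and n alone; the sketch below uses the reported r = 0.852 and n = 395 and the standard t transformation of the correlation coefficient.

```r
# Values reported in the output.
r <- 0.852
n <- 395

# Under H0 (no correlation), t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

t_stat
p_value < 0.001
```

The same test is available directly through cor.test(mydata$G1, mydata$G2), which also reports a confidence interval for r.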
RESEARCH QUESTION 3: Does attendance of nursery school (“nursery”) vary depending on the aspiration to pursue higher education (“higher”)?
H0: There is no association between nursery and higher.
H1: There is an association between nursery and higher.
results <- chisq.test(mydata$higher, mydata$nursery,
correct = TRUE)
## Warning in chisq.test(mydata$higher, mydata$nursery, correct = TRUE):
## Chi-squared approximation may be incorrect
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$higher and mydata$nursery
## X-squared = 0.6321, df = 1, p-value = 0.4266
Since P-value>0.05, we do not reject the null hypothesis (H0). There is no significant association between nursery school attendance and aspiration to pursue higher education in this dataset.
addmargins(results$observed)
## mydata$nursery
## mydata$higher no yes Sum
## no 6 14 20
## yes 75 300 375
## Sum 81 314 395
A higher proportion of students who attended nursery school aspire to pursue higher education (300/314 = 95.5%) compared to those who did not attend nursery (75/81 = 92.6%). However, this difference is very small. Based on these observed frequencies and the results of the chi-squared test (p=0.4266), we can conclude that there is no statistically significant association between nursery attendance and aspiration to pursue higher education.
round(results$expected, 2)
## mydata$nursery
## mydata$higher no yes
## no 4.1 15.9
## yes 76.9 298.1
For students who did not attend nursery and do not aspire to higher education: Observed: 6; Expected: 4.1 (slightly higher observed frequency than expected).
For students who attended nursery and aspire to higher education: Observed: 300; Expected: 298.1 (very close match).
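The differences between observed and expected frequencies are small, indicating little deviation from what would be expected under H0. This aligns with the chi-squared test result showing no significant association. One expected frequency (4.1) is below 5, which is why R warns that the chi-squared approximation may be incorrect; the p-value should therefore be treated as approximate.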
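Because one expected count is below 5, an exact test is a common cross-check on the chi-squared result. The sketch below rebuilds the contingency table from the observed counts reported here, recomputes the expected counts and the Yates-corrected statistic, and runs Fisher's exact test as an alternative. This is a suggested check, not part of the original analysis.

```r
# Observed counts copied from the contingency table (filled column by column).
observed <- matrix(c(6, 75, 14, 300), nrow = 2,
                   dimnames = list(higher  = c("no", "yes"),
                                   nursery = c("no", "yes")))

# Expected counts under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# Chi-squared test with Yates' correction; the warning about the
# approximation is suppressed here because it is already discussed.
chi <- suppressWarnings(chisq.test(observed, correct = TRUE))
chi$statistic

# Fisher's exact test does not rely on the large-sample approximation.
fisher <- fisher.test(observed)
fisher$p.value
```

If the exact p-value is also above 0.05, the conclusion of no significant association does not depend on the approximation.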
round(results$res, 2)
## mydata$nursery
## mydata$higher no yes
## no 0.94 -0.48
## yes -0.22 0.11
For the category of students who did not attend nursery and do not aspire to higher education, the residual is 0.94, suggesting a slightly higher observed frequency (6) than the expected frequency (4.1). Conversely, for students who attended nursery and do not aspire to higher education, the residual is -0.48, meaning the observed frequency (14) is slightly lower than the expected frequency (15.9). Among students aspiring to higher education, those who did not attend nursery have a residual of -0.22, showing a slightly lower observed frequency (75) than expected (76.9), while those who attended nursery have a residual of 0.11, indicating a slightly higher observed frequency (300) than expected (298.1).
All residuals fall within the range of -2 to 2, which is generally considered insignificant. This supports the conclusion from the chi-squared test that there is no significant association between nursery attendance and aspiration to pursue higher education. The observed frequencies are very close to the expected frequencies, indicating no meaningful deviation from what would be expected under the null hypothesis of independence.
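The residuals returned by chisq.test are Pearson residuals, (observed - expected) / sqrt(expected). A minimal sketch that reproduces them from the observed counts reported in this section:

```r
# Observed counts copied from the contingency table (filled column by column).
observed <- matrix(c(6, 75, 14, 300), nrow = 2,
                   dimnames = list(higher  = c("no", "yes"),
                                   nursery = c("no", "yes")))

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: standardized deviation of each cell from independence.
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# Their squared sum is the chi-squared statistic without Yates' correction.
sum(pearson_res^2)
```

The squared residuals show how much each cell contributes to the uncorrected chi-squared statistic; here the largest contribution comes from the cell with the smallest expected count.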
addmargins(round(prop.table(results$observed), 3))
## mydata$nursery
## mydata$higher no yes Sum
## no 0.015 0.035 0.050
## yes 0.190 0.759 0.949
## Sum 0.205 0.794 0.999
This table shows each cell as a proportion of all 395 students. 1.5% of the students neither attended nursery nor aspire to higher education, and 3.5% attended nursery but do not aspire to higher education. 19.0% did not attend nursery but aspire to higher education, and 75.9% both attended nursery and aspire to higher education. In terms of marginal proportions, 20.5% of the students did not attend nursery, while 79.4% attended. Similarly, 5.0% of the students do not aspire to higher education, while 94.9% do. The totals add up to 0.999 rather than 1 because the cell proportions were rounded before the margins were added.
These proportions confirm that a large majority of students attended nursery and aspire to higher education. Only about 5% of the students (20 students) do not aspire to higher education, so the comparison rests on a small group. Together with the chi-squared test results and the residual analysis, the proportions indicate no significant relationship between nursery attendance and aspiration for higher education. They align closely with what would be expected if the two variables were independent.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$nursery
## mydata$higher no yes Sum
## no 0.3 0.7 1.0
## yes 0.2 0.8 1.0
This table shows the row-wise proportions of nursery attendance (“nursery”) for students grouped by their aspiration to pursue higher education (“higher”). For students who do not aspire to higher education, 30% did not attend nursery while 70% did. Among students who do aspire to higher education, 20% did not attend nursery and 80% attended nursery. These proportions indicate that students who aspire to higher education attended nursery somewhat more often (80%) than students who do not (70%). However, this difference is not statistically significant, as demonstrated by the earlier chi-squared test results. The observed variation in nursery attendance between the two groups does not provide evidence of a meaningful association between nursery attendance and aspiration for higher education.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$nursery
## mydata$higher no yes
## no 0.074 0.045
## yes 0.926 0.955
## Sum 1.000 1.000
This table shows the column-wise proportions of students based on their nursery attendance (“nursery”), divided into those who aspire to pursue higher education (“higher”) and those who do not. For students who did not attend nursery, 7.4% do not aspire to higher education while 92.6% do. For students who attended nursery, 4.5% do not aspire to higher education while 95.5% do. These proportions demonstrate that the vast majority of students, regardless of nursery attendance, aspire to pursue higher education. The differences between the two groups (7.4% vs. 4.5% for “no higher” and 92.6% vs. 95.5% for “yes higher”) are minor and not statistically significant, as previously established by the chi-squared test. These results suggest no meaningful relationship between nursery attendance and aspirations for higher education.
library(effectsize)
effectsize::cramers_v(mydata$higher, mydata$nursery)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.02 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.02)
## [1] "tiny"
## (Rules: funder2019)
The bias-adjusted value of Cramer’s V is 0.02. The reported 95% confidence interval of [0.00, 1.00] is one-sided: its upper bound is fixed at 1.00, so only the lower bound (0.00) is informative. According to the funder2019 rules, this value is classified as a “tiny” effect size.
This means that the association between nursery attendance (“nursery”) and aspiration to pursue higher education (“higher”) is extremely weak and practically negligible. The lower bound of the confidence interval is 0, which is consistent with no association. These results align with the chi-squared test, which found no significant association between the two variables.
Overall, the data shows that any observed differences between the groups are minimal and do not indicate a strong or significant relationship between nursery attendance and aspirations for higher education.
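The adjusted Cramer's V can be reproduced by hand from the observed counts. The sketch below assumes that the adjustment is the small-sample bias correction of Bergsma (2013) applied to the chi-squared statistic without Yates' correction; this is an assumption about what effectsize does internally, although the result agrees with the reported 0.02.

```r
# Observed counts copied from the contingency table (filled column by column).
observed <- matrix(c(6, 75, 14, 300), nrow = 2)
n <- sum(observed)
r <- nrow(observed)
k <- ncol(observed)

# Chi-squared statistic without Yates' correction.
chi2 <- unname(suppressWarnings(chisq.test(observed, correct = FALSE))$statistic)

# Classical Cramer's V.
v_raw <- sqrt(chi2 / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to follow Bergsma, 2013).
phi2_adj <- max(0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 2)
```

The correction shrinks the estimate toward 0 because the uncorrected V is biased upward in finite samples; with a table this close to independence, both versions are small.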