Homework 1

Importing data

mydata <- read.table("./student-mat.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

Explaining data

Unit of observation: an individual student (detailing their personal, academic, and social characteristics). Below is the definition of all 33 variables in the dataset, along with their units of measurement:

school:
- Definition: The school the student attends.
- Categories: GP (Gabriel Pereira), MS (Mousinho da Silveira).
- Unit: Categorical.
sex:
- Definition: Gender of the student.
- Categories: F (Female), M (Male).
- Unit: Categorical.
age:
- Definition: Age of the student.
- Unit: Numeric (years).
address:
- Definition: Type of home address.
- Categories: U (Urban), R (Rural).
- Unit: Categorical.
famsize:
- Definition: Family size.
- Categories: LE3 (≤3 members), GT3 (>3 members).
- Unit: Categorical.
Pstatus:
- Definition: Parent’s cohabitation status.
- Categories: T (Living together), A (Apart).
- Unit: Categorical.
Medu:
- Definition: Mother’s education level.
- Scale: 0 (none) to 4 (higher education).
- Unit: Ordinal.
Fedu:
- Definition: Father’s education level.
- Scale: 0 (none) to 4 (higher education).
- Unit: Ordinal.
Mjob:
- Definition: Mother’s job type.
- Categories: teacher, health, services, at_home, other.
- Unit: Categorical.
Fjob:
- Definition: Father’s job type.
- Categories: teacher, health, services, at_home, other.
- Unit: Categorical.
reason:
- Definition: Reason for choosing the school.
- Categories: home, reputation, course, other.
- Unit: Categorical.
guardian:
- Definition: Legal guardian of the student.
- Categories: mother, father, other.
- Unit: Categorical.
traveltime:
- Definition: Home-to-school travel time.
- Scale: 1 (<15 min), 2 (15–30 min), 3 (30 min–1 hour), 4 (>1 hour).
- Unit: Ordinal.
studytime:
- Definition: Weekly study time.
- Scale: 1 (<2 hours), 2 (2–5 hours), 3 (5–10 hours), 4 (>10 hours).
- Unit: Ordinal.
failures:
- Definition: Number of past class failures.
- Unit: Numeric (count).
schoolsup:
- Definition: Extra educational support from school.
- Categories: yes, no.
- Unit: Categorical.
famsup:
- Definition: Family educational support.
- Categories: yes, no.
- Unit: Categorical.
paid:
- Definition: Extra paid classes in the subject.
- Categories: yes, no.
- Unit: Categorical.
activities:
- Definition: Participation in extracurricular activities.
- Categories: yes, no.
- Unit: Categorical.
nursery:
- Definition: Attended nursery school.
- Categories: yes, no.
- Unit: Categorical.
higher:
- Definition: Aspiration to pursue higher education.
- Categories: yes, no.
- Unit: Categorical.
internet:
- Definition: Internet access at home.
- Categories: yes, no.
- Unit: Categorical.
romantic:
- Definition: In a romantic relationship.
- Categories: yes, no.
- Unit: Categorical.
famrel:
- Definition: Quality of family relationships.
- Scale: 1 (very bad) to 5 (excellent).
- Unit: Ordinal.
freetime:
- Definition: Amount of free time after school.
- Scale: 1 (very low) to 5 (very high).
- Unit: Ordinal.
goout:
- Definition: Frequency of going out with friends.
- Scale: 1 (very low) to 5 (very high).
- Unit: Ordinal.
Dalc:
- Definition: Workday alcohol consumption.
- Scale: 1 (very low) to 5 (very high).
- Unit: Ordinal.
Walc:
- Definition: Weekend alcohol consumption.
- Scale: 1 (very low) to 5 (very high).
- Unit: Ordinal.
health:
- Definition: Current health status.
- Scale: 1 (very bad) to 5 (very good).
- Unit: Ordinal.
absences:
- Definition: Number of school absences.
- Unit: Numeric (count).
G1:
- Definition: Grade for the first school period.
- Unit: Numeric (score: 0–20).
G2:
- Definition: Grade for the second school period.
- Unit: Numeric (score: 0–20).
G3:
- Definition: Final grade.
- Unit: Numeric (score: 0–20).
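
The categorical and ordinal codings listed above can be made explicit in R with factors. The following is a minimal sketch on a small illustrative data frame (the rows in `toy` are made up and are not taken from the file); it only shows one possible way to encode a nominal and an ordinal variable.

```r
# Toy rows using the coding described above (illustrative values only).
toy <- data.frame(
  sex       = c("F", "M", "F"),
  studytime = c(2, 3, 1)
)

# Nominal variable: unordered factor with readable labels.
toy$sex <- factor(toy$sex, levels = c("F", "M"),
                  labels = c("Female", "Male"))

# Ordinal variable: ordered factor, so comparisons such as < are meaningful.
toy$studytime <- factor(toy$studytime, levels = 1:4,
                        labels = c("<2 h", "2-5 h", "5-10 h", ">10 h"),
                        ordered = TRUE)

str(toy)
```

Ordered factors keep the ranking of the categories without pretending that the distances between them are equal.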

The source of the data is Kaggle.com (contributor: Vikram Amin).

RESEARCH QUESTION 1: Is there a difference between students' grade for the first school period (G1) and their final grade (G3)?

library(psych)
describe(mydata[ , c("G1", "G3")])

##    vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## G1    1 395 10.91 3.32     11   10.80 4.45   3  19    16  0.24    -0.71 0.17
## G3    2 395 10.42 4.58     11   10.84 4.45   0  20    20 -0.73     0.37 0.23

Description of statistical data: G1: Grade for the first school period. G3: Final grade

Mean: The mean for G1 (10.91) is slightly higher than G3 (10.42), showing a small decline in average performance from the first school period to the final grade.

Variability: G3 has a larger standard deviation (4.58 vs. 3.32) and range (20 vs. 16) than G1, indicating greater variability in final grades. Part of the extra spread comes from the lower end of the scale: the minimum of G3 is 0, whereas the minimum of G1 is 3.

Skewness and distribution: G1 is slightly positively skewed (0.24), while G3 is moderately negatively skewed (-0.73). For G3 this means that most students are concentrated at medium to high grades, with a longer tail of very low final grades.

Kurtosis: G1 has negative excess kurtosis (-0.71), i.e. a flatter distribution with lighter tails than the normal distribution. G3 has slightly positive excess kurtosis (0.37), i.e. a somewhat sharper peak and heavier tails, and is closer to the normal value of 0.
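
To make the skewness and kurtosis values more concrete, here is a minimal sketch of the plain moment-based formulas. The helper `shape_stats` and the vector `x` are hypothetical illustrations; `psych` may apply a slightly different estimator, so results need not match its output exactly.

```r
# Moment-based skewness and excess kurtosis (no small-sample correction).
shape_stats <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m4 <- mean(d^4)   # fourth central moment
  c(skew = m3 / m2^1.5, kurtosis = m4 / m2^2 - 3)
}

# Illustrative vector with a long left tail (not the real grades).
x <- c(0, 0, 8, 10, 11, 11, 12, 13, 14, 15)
shape_stats(x)
```

A few very low values pull the skewness below zero, which is the same pattern described for G3.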

mydata$Difference <- mydata$G1 - mydata$G3

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x = Difference)) +
  geom_histogram(position = "identity", binwidth = 3, colour = "black") +
  ylab("Frequency") +
  xlab("Difference")

The distribution appears skewed to the right. Because Difference is computed as G1 - G3, positive values correspond to declines (G3 < G1) and negative values to improvements (G3 > G1). Most differences are concentrated around 0, meaning that for the majority of students the grade for the first school period (G1) and the final grade (G3) were very similar. The highest bar is centered at 0, with a frequency of almost 250 students. The longer right tail shows that some students experienced comparatively large declines in their final grade, while improvements, although present, tend to be smaller in magnitude.
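
Since Difference is defined as G1 - G3, the sign convention is easy to mix up. A small sketch for tabulating the direction of change is shown below; the vectors `g1` and `g3` are made-up values used only to illustrate the idea.

```r
# Illustrative grades (not taken from the dataset).
g1 <- c(10, 12, 8, 15, 9)
g3 <- c(10, 11, 9, 15, 0)

difference <- g1 - g3   # positive = decline, negative = improvement

direction <- factor(sign(difference), levels = c(-1, 0, 1),
                    labels = c("improved (G3 > G1)",
                               "unchanged",
                               "declined (G3 < G1)"))
table(direction)
```

Applied to the real data, the same table would show directly how many students improved, stayed the same, or declined.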

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.2

ggqqplot(mydata$Difference)

The Q-Q plot of Difference shows approximate normality in the middle range but heavier tails, indicating outliers or non-normality in the extreme values.

shapiro.test(mydata$Difference)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Difference
## W = 0.83277, p-value < 2.2e-16

H0: Differences are normally distributed.

H1: Differences are not normally distributed.

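We reject H0 at p < 0.001: the differences are not normally distributed. The normality assumption of the paired t-test is therefore violated, so the Wilcoxon signed rank test is the more appropriate test; the paired t-test is reported for comparison.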
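
The decision rule used here (check normality of the differences, then pick the test) can be sketched as follows. The vectors `g1` and `g3` are illustrative values, and the 0.05 threshold is an assumption of this sketch.

```r
# Illustrative paired grades (not the real data).
g1 <- c(10, 12, 8, 15, 9, 11, 13, 7, 14, 10, 12, 9)
g3 <- c(10, 11, 9, 15, 0, 11, 12, 8, 14, 0, 13, 9)

d  <- g1 - g3
sw <- shapiro.test(d)   # H0: differences are normally distributed

# Non-normal differences -> Wilcoxon signed rank test, otherwise paired t-test.
chosen <- if (sw$p.value < 0.05) {
  wilcox.test(g1, g3, paired = TRUE, exact = FALSE)
} else {
  t.test(g1, g3, paired = TRUE)
}
chosen$method
```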

t.test(mydata$G1, mydata$G3,
       paired = TRUE,
       alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  mydata$G1 and mydata$G3
## t = 3.5517, df = 394, p-value = 0.0004291
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.2204052 0.7669366
## sample estimates:
## mean difference 
##       0.4936709

The paired t-test indicates a statistically significant mean decline from the first school period to the final grade (t = 3.55, df = 394, p < 0.001). The 95% confidence interval for the mean difference G1 - G3 is [0.22, 0.77] and does not include 0. The average drop of 0.49 points on the 0–20 scale is small but measurable. Because the Shapiro-Wilk test rejected normality of the differences, this result should be interpreted with caution.
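
As a check on how the confidence interval arises, it can be reconstructed from the reported mean difference, t statistic, and degrees of freedom. This is a sketch based only on the rounded numbers printed above, so the last digits may differ slightly.

```r
# Values copied from the paired t-test output.
mean_diff <- 0.4936709
t_stat    <- 3.5517
df        <- 394

se <- mean_diff / t_stat                          # standard error of the mean difference
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se   # two-sided 95% interval
round(ci, 3)
```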

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

cohens_d(mydata$Difference)

## Cohen's d |       95% CI
## ------------------------
## 0.18      | [0.08, 0.28]

interpret_cohens_d(0.18, rules = "sawilowsky2009")

## [1] "very small"
## (Rules: sawilowsky2009)

Cohen's d = 0.18 (95% CI [0.08, 0.28]) is classified as very small under the Sawilowsky (2009) rules. The decline from G1 to G3, though statistically significant in the t-test, therefore has minimal practical importance.
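
For paired data, Cohen's d is the mean of the differences divided by their standard deviation, which equals t divided by the square root of n. The sketch below recovers the reported value from the t-test output; it uses only the rounded numbers printed above.

```r
# Values copied from the paired t-test output.
t_stat <- 3.5517
n      <- 394 + 1        # df + 1 paired observations

d <- t_stat / sqrt(n)    # same as mean(Difference) / sd(Difference)
round(d, 2)
```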

wilcox.test(mydata$G1, mydata$G3,
            paired = TRUE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon signed rank test
## 
## data:  mydata$G1 and mydata$G3
## V = 24375, p-value = 0.3151
## alternative hypothesis: true location shift is not equal to 0

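With p = 0.3151 > 0.05, we fail to reject the null hypothesis (H0): the Wilcoxon signed rank test shows no significant location shift between the first school period grades (G1) and the final grades (G3). Because the differences are not normally distributed, this non-parametric result is the more appropriate basis for the conclusion.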

library(effectsize)
effectsize(wilcox.test(mydata$G1, mydata$G3,
                       paired = TRUE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## 0.07              | [-0.05, 0.18]

The rank-biserial correlation (r = 0.07, 95% CI [-0.05, 0.18]) confirms that the difference between G1 and G3 is practically negligible, and its confidence interval includes 0. Even though the paired t-test suggested a statistically significant mean difference, the non-parametric test, which is the appropriate one given the non-normal differences, gives no evidence of a meaningful difference between G1 and G3.
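
One common definition of the matched-pairs rank-biserial correlation is the share of the rank sum held by positive differences minus the share held by negative differences. The helper `rank_biserial_paired` below is a hypothetical base-R sketch of that definition on made-up data; `effectsize` may treat ties and zero differences differently.

```r
# Matched-pairs rank-biserial correlation (simple difference formula).
rank_biserial_paired <- function(x, y) {
  d <- x - y
  d <- d[d != 0]          # zero differences carry no direction
  r <- rank(abs(d))       # ranks of absolute differences, ties averaged
  (sum(r[d > 0]) - sum(r[d < 0])) / sum(r)
}

# Illustrative paired grades (not the real data).
g1 <- c(10, 12, 8, 15, 9)
g3 <- c(10, 11, 9, 15, 0)
rank_biserial_paired(g1, g3)
```

Values near 0 mean that positive and negative differences carry similar rank weight; values near 1 or -1 mean that one direction dominates.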

RESEARCH QUESTION 2: Is there a correlation between Grade for the first school period (G1) and Grade for the second school period (G2)?

H0: There is no correlation between the grade for the first school period (G1) and the grade for the second school period (G2).

H1: There is a correlation between the grade for the first school period (G1) and the grade for the second school period (G2).

library(GGally)

## Warning: package 'GGally' was built under R version 4.4.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(mydata[ ,c(31,32)])

The Pearson correlation coefficient shown in the matrix is 0.852 (p < 0.001).

#install.packages("Hmisc")
library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.4.2

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ ,c(31,32)]), 
      type = "pearson")

##      G1   G2
## G1 1.00 0.85
## G2 0.85 1.00
## 
## n= 395 
## 
## 
## P
##    G1 G2
## G1     0
## G2  0

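The p-value is lower than 0.001 (it is printed as 0 because of rounding), so we reject H0. There is a strong, positive, statistically significant linear correlation between G1 and G2 (r = 0.85).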
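
The p-value is displayed as 0 only because of rounding. A short sketch of the usual t-test for a Pearson correlation, using the reported r and n, shows how small it is; this is a reconstruction from the rounded coefficient, not the output of `rcorr` itself.

```r
# Values reported above.
r <- 0.852
n <- 395

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat
p_value < 0.001
```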

RESEARCH QUESTION 3: Is attendance of nursery school (“nursery”) associated with the aspiration to pursue higher education (“higher”)?

H0: There is no association between nursery and higher.

H1: There is an association between nursery and higher.

results <- chisq.test(mydata$higher, mydata$nursery, 
                      correct = TRUE)

## Warning in chisq.test(mydata$higher, mydata$nursery, correct = TRUE):
## Chi-squared approximation may be incorrect

results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$higher and mydata$nursery
## X-squared = 0.6321, df = 1, p-value = 0.4266

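Since p = 0.4266 > 0.05, we do not reject the null hypothesis (H0). There is no significant association between nursery school attendance and aspiration to pursue higher education in this dataset. Note that R warns that the chi-squared approximation may be incorrect, so this p-value should be read with some caution.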

addmargins(results$observed)

##              mydata$nursery
## mydata$higher  no yes Sum
##           no    6  14  20
##           yes  75 300 375
##           Sum  81 314 395

A higher proportion of students who attended nursery school aspire to pursue higher education (300/314 = 95.5%) compared to those who did not attend nursery (75/81 = 92.6%). However, this difference is very small. Based on these observed frequencies and the results of the chi-squared test (p=0.4266), we can conclude that there is no statistically significant association between nursery attendance and aspiration to pursue higher education.

round(results$expected, 2)

##              mydata$nursery
## mydata$higher   no   yes
##           no   4.1  15.9
##           yes 76.9 298.1

For students who did not attend nursery and do not aspire to higher education: Observed: 6; Expected: 4.1 (slightly higher observed frequency than expected).

For students who attended nursery and aspire to higher education: Observed: 300; Expected: 298.1 (very close match).

The differences between observed and expected frequencies are small, indicating little deviation from what would be expected under H0. This aligns with the chi-squared test result showing no significant association.
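
One expected frequency (4.1) is below 5, which is the usual reason for R's "Chi-squared approximation may be incorrect" warning. Fisher's exact test is a common alternative in that situation. The sketch below rebuilds the observed table from the counts reported in this document and is meant as a cross-check, not as a replacement for the analysis above.

```r
# Observed counts as reported in the contingency table.
observed <- matrix(c(6, 75, 14, 300), nrow = 2,
                   dimnames = list(higher  = c("no", "yes"),
                                   nursery = c("no", "yes")))

# Expected counts under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)
any(expected < 5)       # TRUE signals that the approximation is doubtful

# Exact test of independence for the 2x2 table.
fisher_result <- fisher.test(observed)
fisher_result$p.value
```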

round(results$residuals, 2)

##              mydata$nursery
## mydata$higher    no   yes
##           no   0.94 -0.48
##           yes -0.22  0.11

For the category of students who did not attend nursery and do not aspire to higher education, the residual is 0.94, suggesting a slightly higher observed frequency (6) than the expected frequency (4.1). Conversely, for students who attended nursery and do not aspire to higher education, the residual is -0.48, meaning the observed frequency (14) is slightly lower than the expected frequency (15.9). Among students aspiring to higher education, those who did not attend nursery have a residual of -0.22, showing a slightly lower observed frequency (75) than expected (76.9), while those who attended nursery have a residual of 0.11, indicating a slightly higher observed frequency (300) than expected (298.1).

All Pearson residuals lie well within the range of -2 to 2 (the conventional threshold is about ±1.96 at the 5% level), so no single cell deviates significantly from its expected frequency. This supports the conclusion from the chi-squared test: the observed frequencies are close to those expected under the null hypothesis of independence.
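
The residuals discussed above are Pearson residuals, computed per cell as (observed - expected) / sqrt(expected). A minimal sketch that reproduces them from the reported counts:

```r
# Observed counts as reported in the contingency table.
observed <- matrix(c(6, 75, 14, 300), nrow = 2,
                   dimnames = list(higher  = c("no", "yes"),
                                   nursery = c("no", "yes")))

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual for each cell.
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)
```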

addmargins(round(prop.table(results$observed), 3))

##              mydata$nursery
## mydata$higher    no   yes   Sum
##           no  0.015 0.035 0.050
##           yes 0.190 0.759 0.949
##           Sum 0.205 0.794 0.999

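From the table, 1.5% of all students neither attended nursery nor aspire to higher education, while 3.5% attended nursery but do not aspire to higher education. A further 19.0% did not attend nursery but aspire to higher education, and 75.9% attended nursery and aspire to higher education. In terms of marginal proportions, 20.5% of the students did not attend nursery (81/395), while 79.5% attended (314/395). Similarly, 5.1% of the students do not aspire to higher education (20/395), while 94.9% do (375/395). The marginal sums printed in the table differ slightly from these values, and the total appears as 0.999, because the cell proportions were rounded before the margins were added.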

These proportions confirm that a large majority of students attended nursery and aspire to higher education. However, the small proportion of students who do not aspire to higher education, combined with the chi-squared test results and residuals analysis, indicates no significant relationship between nursery attendance and aspiration for higher education. The proportions align closely with the expected independence of the variables under the null hypothesis.

addmargins(round(prop.table(results$observed, 1), 3), 2)

##              mydata$nursery
## mydata$higher  no yes Sum
##           no  0.3 0.7 1.0
##           yes 0.2 0.8 1.0

This table shows the row-wise proportions of nursery attendance (“nursery”) for students grouped by their aspiration to pursue higher education (“higher”). For students who do not aspire to higher education, 30% did not attend nursery while 70% did. Among students who do aspire to higher education, 20% did not attend nursery and 80% attended nursery. These proportions indicate that students who aspire to higher education were somewhat more likely to have attended nursery (80%) than those who do not aspire to it (70%). However, this difference is not statistically significant, as demonstrated by the earlier chi-squared test results. The observed variation in nursery attendance between the two groups does not provide evidence of a meaningful association between nursery attendance and aspiration for higher education.

addmargins(round(prop.table(results$observed, 2), 3), 1)

##              mydata$nursery
## mydata$higher    no   yes
##           no  0.074 0.045
##           yes 0.926 0.955
##           Sum 1.000 1.000

This table shows the column-wise proportions of students based on their nursery attendance (“nursery”), divided into those who aspire to pursue higher education (“higher”) and those who do not. For students who did not attend nursery, 7.4% do not aspire to higher education while 92.6% do. For students who attended nursery, 4.5% do not aspire to higher education while 95.5% do. These proportions demonstrate that the vast majority of students, regardless of nursery attendance, aspire to pursue higher education. The differences between the two groups (7.4% vs. 4.5% for “no higher” and 92.6% vs. 95.5% for “yes higher”) are minor and not statistically significant, as previously established by the chi-squared test. These results suggest no meaningful relationship between nursery attendance and aspirations for higher education.

library(effectsize)
effectsize::cramers_v(mydata$higher, mydata$nursery)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.02              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.02)

## [1] "tiny"
## (Rules: funder2019)

The (bias-adjusted) value of Cramer’s V here is 0.02, with a one-sided 95% confidence interval of [0.00, 1.00]. According to the Funder and Ozer (2019) guidelines (rule set funder2019), this value is classified as a “tiny” effect size.

This means that the association between nursery attendance (“nursery”) and aspiration to pursue higher education (“higher”) is extremely weak and practically negligible. The confidence interval is one-sided (upper bound fixed at 1.00) and its lower bound is 0.00, so an effect of zero cannot be ruled out. These results align with the chi-squared test, which found no significant association between the two variables.
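
Cramer's V is derived from the uncorrected chi-squared statistic as sqrt(chi2 / (n * min(rows - 1, cols - 1))). The sketch below computes both this plain version and a bias-corrected version from the reported counts. It assumes that the "adj." value printed by `effectsize` follows the Bergsma (2013) small-sample correction, which is an assumption of this example rather than something stated in the output.

```r
# Observed counts as reported in the contingency table.
observed <- matrix(c(6, 75, 14, 300), nrow = 2)
n <- sum(observed)
k <- nrow(observed)
l <- ncol(observed)

# Uncorrected Pearson chi-squared (no Yates correction).
chi2 <- unname(suppressWarnings(chisq.test(observed, correct = FALSE))$statistic)

# Plain Cramer's V.
v <- sqrt(chi2 / (n * min(k - 1, l - 1)))

# Bias-corrected version (assumed to match the "adj." output).
phi2_adj <- max(0, chi2 / n - (k - 1) * (l - 1) / (n - 1))
k_adj    <- k - (k - 1)^2 / (n - 1)
l_adj    <- l - (l - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(k_adj - 1, l_adj - 1))

round(c(v = v, v_adj = v_adj), 2)
```

Both versions are far below the thresholds usually associated with even a small effect, which is consistent with the "tiny" classification.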

Overall, the data shows that any observed differences between the groups are minimal and do not indicate a strong or significant relationship between nursery attendance and aspirations for higher education.

Jernej Košorok

2025-01-13