I decided to use data on Portuguese school students. The dataset is described here:
In general, the dataset contains socio-demographic and educational characteristics of students, together with their math grades.
In our analysis we will try to predict the final grade in math (var G3) for students based on some factors.
We are interested in the extent to which time spent studying predicts academic achievement.
We expect that studying more typically results in higher grades and will be a solid predictor of academic achievement. We also expect additional paid classes to raise grades, as they build skill and motivation.
We also want to control for 1) sex, as girls are often reported to share a stronger academic culture and to study better in general, and 2) mother’s education, as it is one of the main factors predicting students’ academic motivation in the sociology of education.
Let us proceed to modeling.
library(dplyr)
library(ggplot2)
library(car)
library(MASS)
data = read.csv2('student-mat.csv')
data = data %>% filter(G3 > 0)
mlr_model = lm(G3 ~ studytime + paid + Medu + sex, data)
plot(mlr_model, which = 1)
Although the residuals are fairly scattered, they show no clear systematic pattern around zero, so the linearity assumption looks acceptable.
library(lmtest)
hist(residuals(mlr_model), breaks = 10, main = "Histogram of Residuals", col = "pink")
plot(mlr_model, which = 2)
shapiro.test(residuals(mlr_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(mlr_model)
## W = 0.99439, p-value = 0.2152
All three checks (histogram, Q-Q plot and Shapiro-Wilk test) indicate that the residuals are approximately normal; even Shapiro-Wilk does not reject normality (p ≈ 0.22), w(°o°)w.
library(lmtest)
dwtest(mlr_model)
##
## Durbin-Watson test
##
## data: mlr_model
## DW = 2.0423, p-value = 0.6449
## alternative hypothesis: true autocorrelation is greater than 0
The Durbin-Watson test gives no evidence of autocorrelation in the residuals (DW = 2.04, p = 0.64), so we can treat them as independent and proceed.
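To make the statistic less of a black box, here is a minimal sketch of how the Durbin-Watson value can be computed by hand. It uses simulated data (an assumption, not the student dataset), and the names `fit`, `dw` and `r1` are illustrative only.

```r
# Durbin-Watson statistic by hand, on simulated data (not the student data)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)   # errors are independent by construction
fit <- lm(y ~ x)

e <- residuals(fit)
# DW = sum of squared successive differences / sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
# Lag-1 autocorrelation of the residuals; DW is roughly 2 * (1 - r1)
r1 <- sum(e[-1] * e[-n]) / sum(e^2)
print(c(DW = dw, approx = 2 * (1 - r1)))
```

Values near 2 correspond to a lag-1 autocorrelation near 0. Because the statistic compares each residual with the one in the next row, it depends on the row order of the data, which is worth keeping in mind for cross-sectional data like ours.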
plot(mlr_model, which = 3)
bptest(mlr_model)
##
## studentized Breusch-Pagan test
##
## data: mlr_model
## BP = 10.533, df = 4, p-value = 0.03235
Unfortunately, the Breusch-Pagan test rejects homoscedasticity (p = 0.03), so we are dealing with heteroscedasticity: the variance of the residuals grows as the fitted grades grow.
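One common response to heteroscedasticity is to keep the OLS coefficients but report heteroscedasticity-consistent (robust) standard errors. The sketch below computes HC3 standard errors in base R on simulated data whose error spread grows with the predictor. The data and all object names are assumptions for illustration; this is not what was done in the analysis above.

```r
# HC3 robust standard errors by hand, on simulated heteroscedastic data
set.seed(42)
n <- 300
x <- runif(n, 1, 4)
y <- 8 + 0.6 * x + rnorm(n, sd = 0.5 * x)   # error sd grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)
XtX_inv <- solve(crossprod(X))

# Sandwich estimator: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
meat <- crossprod(X * (e / (1 - h)))
vcov_hc3 <- XtX_inv %*% meat %*% XtX_inv

se_table <- cbind(classical = sqrt(diag(vcov(fit))),
                  HC3       = sqrt(diag(vcov_hc3)))
print(se_table)
```

In practice the same covariance matrix is available as `sandwich::vcovHC(fit, type = "HC3")`, which can be passed to `lmtest::coeftest()` to get robust tests for each coefficient.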
cor(data$studytime, data$absences,use = "complete.obs")
## [1] -0.07454126
vif(mlr_model)
## studytime paid Medu sex
## 1.116075 1.057491 1.034153 1.107766
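The variance inflation factors of all four predictors are close to 1, so there is no sign of multicollinearity in the model. (The pairwise correlation above is between studytime and absences, a variable that is not in the model; it is also close to zero.)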
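For readers who want to see what a variance inflation factor measures, here is a small sketch that computes it from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors. The data are simulated and the variable names (`x1`, `x2`, `x3`) are hypothetical.

```r
# VIF from its definition, on simulated data with mildly correlated predictors
set.seed(7)
n  <- 250
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)   # mildly correlated with x1
x3 <- rnorm(n)
d  <- data.frame(x1, x2, x3)

predictors <- c("x1", "x2", "x3")
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # R^2 from regressing predictor p on all the other predictors
  r2 <- summary(lm(reformulate(others, response = p), data = d))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 3))
```

A VIF of 1 means a predictor is not linearly related to the others at all. For numeric and two-level predictors such as ours, this is the same quantity that `car::vif()` reports.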
library(sjPlot)
# studytime + paid + Medu + sex
tab_model(mlr_model)
Dependent variable: G3

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 8.49 | 7.21 – 9.78 | <0.001 |
| studytime | 0.64 | 0.23 – 1.06 | 0.002 |
| paid [yes] | -0.44 | -1.11 – 0.23 | 0.193 |
| Medu | 0.54 | 0.24 – 0.85 | <0.001 |
| sex [M] | 0.85 | 0.17 – 1.54 | 0.015 |
| Observations | 357 | | |
| R² / R² adjusted | 0.072 / 0.061 | | |
Our model is fairly poor: it explains only about 6% of the variance in grades (adjusted R² = 0.061). Study time is, unfortunately, coded as ordered categories (1 to 4) rather than hours, and we enter it as a numeric variable, so a one-category increase in study time is associated with a 0.64-point higher final grade, all other things being equal (p = 0.002). Mother’s education is also a significant predictor: each additional level is associated with a 0.54-point higher expected grade. Boys score on average 0.85 points higher than girls (even though girls generally do better at school, boys usually perform better in math in high school, so our result does not contradict existing knowledge). In our data, students who took additional paid classes did not perform better than those who did not (p ≈ 0.19).
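To make the coefficients concrete, the following sketch plugs the rounded estimates from the regression table above into the model equation for two hypothetical students. The helper `predict_g3` and the student profiles are illustrative assumptions; with the fitted object available, `predict(mlr_model, newdata = ...)` would do the same using unrounded coefficients.

```r
# Rounded coefficients copied from the regression table above
b <- c(intercept = 8.49, studytime = 0.64, paid_yes = -0.44,
       Medu = 0.54, sex_M = 0.85)

# Hypothetical helper: expected G3 for given predictor values
predict_g3 <- function(studytime, paid_yes, Medu, sex_M) {
  unname(b["intercept"] + b["studytime"] * studytime +
         b["paid_yes"] * paid_yes + b["Medu"] * Medu + b["sex_M"] * sex_M)
}

# Student A: female, studytime category 2, no paid classes, Medu level 3
grade_a <- predict_g3(studytime = 2, paid_yes = 0, Medu = 3, sex_M = 0)
# Student B: identical, but one studytime category higher
grade_b <- predict_g3(studytime = 3, paid_yes = 0, Medu = 3, sex_M = 0)

print(c(A = grade_a, B = grade_b, difference = grade_b - grade_a))
```

Student A's expected grade is 8.49 + 0.64 × 2 + 0.54 × 3 = 11.39, and student B's is higher by exactly the studytime coefficient, 0.64. This is what "all other things being equal" means in practice.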
Various studies suggest that girls are better at organizing their studying. Let us check whether the effect of paid classes differs for girls and boys, assuming that girls will try to get more benefit from these classes than boys.
mlr_model_int = lm(G3 ~ studytime + paid*sex + Medu, data)
tab_model(mlr_model_int)
Dependent variable: G3

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 8.55 | 7.24 – 9.86 | <0.001 |
| studytime | 0.65 | 0.23 – 1.06 | 0.002 |
| paid [yes] | -0.60 | -1.53 – 0.33 | 0.204 |
| sex [M] | 0.70 | -0.23 – 1.63 | 0.139 |
| Medu | 0.55 | 0.24 – 0.85 | <0.001 |
| paid [yes] × sex [M] | 0.32 | -1.00 – 1.64 | 0.631 |
| Observations | 357 | | |
| R² / R² adjusted | 0.072 / 0.059 | | |
In the interaction model, only study time and mother’s education remain statistically significant.
The interaction term is far from significant (p = 0.63), so in our data there is no evidence that the association between paid classes and the final math grade differs by gender. The main effect of sex is no longer significant, but the number of observations has not changed (357): in a model with an interaction, this coefficient describes the gender gap only among students without paid classes, so it is estimated less precisely. Adjusted R² did not increase (0.059 vs. 0.061), so this model does not seem to be better, even without a formal statistical comparison.
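If a formal comparison of the two nested models is wanted, a partial F-test via `anova()` is one option. The sketch below uses simulated data that only mimics the structure of our variables (the data-generating values are assumptions, with no true interaction built in); `m_add` and `m_int` are illustrative names.

```r
# Comparing an additive model with an interaction model on simulated data
set.seed(3)
n <- 357
d <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  paid      = factor(sample(c("no", "yes"), n, replace = TRUE)),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE)),
  Medu      = sample(0:4, n, replace = TRUE)
)
# Assumed data-generating process: additive effects only, no interaction
d$G3 <- 8.5 + 0.6 * d$studytime + 0.5 * d$Medu +
  0.8 * (d$sex == "M") + rnorm(n, sd = 3)

m_add <- lm(G3 ~ studytime + paid + Medu + sex, data = d)
m_int <- lm(G3 ~ studytime + paid * sex + Medu, data = d)

cmp <- anova(m_add, m_int)   # partial F-test for the added interaction term
print(cmp)
```

Because the two models differ by a single coefficient, the F statistic equals the squared t statistic of the interaction term, so the test gives the same p-value as the interaction row of the coefficient table.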
In general, our model predicts that:
- more time dedicated to studying is associated with a higher final math grade;
- boys are expected to get a higher grade than girls;
- mother’s education is a valuable predictor: the higher the mother’s education, the higher the grade.

However, our model explains very little variance, and we found heteroscedasticity. This does not bias the coefficient estimates themselves, but it makes their standard errors, and therefore the p-values and confidence intervals, unreliable, and predictions are least precise for students with the highest expected grades.