Data and aims of regression

I decided to use data on Portuguese school students. The dataset is described here:

https://archive.ics.uci.edu/dataset/320/student+performance

Generally, the dataset contains socio-demographic and educational characteristics of students, along with their math grades.

In our analysis we will try to predict students’ final math grade (variable G3) from a few of these factors.

Aims

We are interested in the extent to which time spent studying predicts academic achievement.

We expect that studying more typically results in higher grades and is a solid factor contributing to academic achievement. We also expect special paid classes to raise grades, as they should boost both skill and motivation.

We also want to control for 1) sex, as girls are often reported to share a stronger academic culture and to do better at school in general, and 2) mother’s education, as it is one of the main predictors of students’ academic motivation in the sociology of education.

Regression

Let us proceed to modeling.

Linearity

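library(dplyr)    # filter() and the %>% pipe
library(ggplot2)
library(car)      # vif()
library(MASS)

# read.csv2() expects the semicolon-separated format of the UCI file
data = read.csv2('student-mat.csv')
# Keep only students with a non-zero final grade
data = data %>% filter(G3 > 0)
mlr_model = lm(G3 ~ studytime + paid + Medu + sex, data)

# Residuals vs fitted values
plot(mlr_model, which = 1)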

Although the residuals are somewhat scattered, the overall pattern in the residuals-vs-fitted plot looks linear.

Normality of residuals

library(lmtest)

hist(residuals(mlr_model), breaks = 10, main = "Histogram of Residuals", col = "pink")

plot(mlr_model, which = 2)

shapiro.test(residuals(mlr_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mlr_model)
## W = 0.99439, p-value = 0.2152

All three checks (histogram, Q-Q plot and Shapiro-Wilk test) indicate that the residuals are approximately normal; even Shapiro-Wilk does not reject normality (p = 0.22), w(°o°)w.

Independence of residuals

library(lmtest)

dwtest(mlr_model)
## 
##  Durbin-Watson test
## 
## data:  mlr_model
## DW = 2.0423, p-value = 0.6449
## alternative hypothesis: true autocorrelation is greater than 0

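The Durbin-Watson statistic is close to 2 and the test finds no evidence of positive autocorrelation (p = 0.64), so we treat the residuals as independent and proceed.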

Homoscedasticity

plot(mlr_model, which = 3)

bptest(mlr_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  mlr_model
## BP = 10.533, df = 4, p-value = 0.03235

Unfortunately, the Breusch-Pagan test rejects constant variance (p = 0.03), so we have heteroscedasticity. The variance of the residuals grows as the grades grow.
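One common response to heteroscedasticity is to keep the OLS coefficients but report heteroscedasticity-consistent (robust) standard errors. Below is a minimal sketch of the HC3 estimator in base R. It runs on simulated stand-in data, not on the student data, and `hc3_vcov` is a hypothetical helper name introduced only for this illustration.

```r
# Simulated stand-in data (NOT the student data): the noise grows with x
set.seed(1)
n <- 200
x <- runif(n, 1, 4)
y <- 8 + 0.6 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Hypothetical helper: HC3 heteroscedasticity-consistent covariance matrix
hc3_vcov <- function(model) {
  X <- model.matrix(model)
  e <- residuals(model)
  h <- hatvalues(model)
  bread <- solve(crossprod(X))           # (X'X)^-1
  meat <- crossprod(X * (e / (1 - h)))   # sum of x_i x_i' * e_i^2 / (1 - h_i)^2
  bread %*% meat %*% bread
}

# Compare classical and robust standard errors side by side
classic_se <- sqrt(diag(vcov(fit)))
robust_se <- sqrt(diag(hc3_vcov(fit)))
print(cbind(classic_se, robust_se))
```

For the model in this report the same helper could be called as `hc3_vcov(mlr_model)`. The `sandwich` package (not loaded in this report, so an assumption here) offers a packaged equivalent in `vcovHC(mlr_model, type = "HC3")`, which can be passed to `lmtest::coeftest()` through its `vcov.` argument.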

Multicollinearity

cor(data$studytime, data$absences, use = "complete.obs")
## [1] -0.07454126
vif(mlr_model)
## studytime      paid      Medu       sex 
##  1.116075  1.057491  1.034153  1.107766

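Study time and absences are practically uncorrelated, although absences is not one of the model’s predictors, so the relevant check is the variance-inflation factors. All of them are close to 1, so no multicollinearity can be traced.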
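To make the variance-inflation factor less of a black box, here is a minimal sketch of how it can be computed by hand for one predictor: regress that predictor on the others and take 1 / (1 - R^2). The data frame `sim` is simulated stand-in data that only borrows the column names of the model; it is not the student data.

```r
# Simulated stand-in data with the model's column names (NOT the real data)
set.seed(2)
n <- 300
sim <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  paid = factor(sample(c("no", "yes"), n, replace = TRUE)),
  Medu = sample(0:4, n, replace = TRUE),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Auxiliary regression: studytime explained by the other predictors
aux <- lm(studytime ~ paid + Medu + sex, data = sim)

# VIF = 1 / (1 - R^2) of the auxiliary regression
vif_studytime <- 1 / (1 - summary(aux)$r.squared)
print(vif_studytime)
```

A value near 1 means the other predictors explain almost none of the variation in study time. For terms with a single coefficient, as in this model, the result should agree with what `car::vif()` reports.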

Multiple model

library(sjPlot)
# studytime + paid + Medu + sex
tab_model(mlr_model)
                      G3
Predictors            Estimates   CI              p
(Intercept)           8.49        7.21 – 9.78     <0.001
studytime             0.64        0.23 – 1.06     0.002
paid [yes]            -0.44       -1.11 – 0.23    0.193
Medu                  0.54        0.24 – 0.85     <0.001
sex [M]               0.85        0.17 – 1.54     0.015
Observations          357
R2 / R2 adjusted      0.072 / 0.061

Our model is fairly weak: it explains only about 7% of the variance in grades (adjusted R2 = 0.061). A one-level increase in study time (it is coded as ordered levels 1 to 4 rather than in hours, unfortunately) is associated with a 0.64-point higher expected grade, all other things being equal, and this effect is statistically significant (p = 0.002). Mother’s education is also a significant factor: each additional level raises a student’s expected grade by 0.54 points. We can also see that boys score 0.85 points higher than girls on average (p = 0.015). This runs against our initial expectation, but even though girls generally do better at school, boys are often reported to perform better in math in high school, so our results do not contradict existing knowledge. In our data, students who took additional paid classes did not score significantly differently from those who did not (estimate -0.44, p ≈ 0.19).
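As a worked example of how these coefficients combine, the sketch below computes the expected grade for two student profiles. It uses the rounded estimates from the table, so results are approximate; `predict_g3` and the two profiles are hypothetical and serve only as an illustration.

```r
# Rounded coefficients copied from the model table
coefs <- c(intercept = 8.49, studytime = 0.64, paid_yes = -0.44,
           Medu = 0.54, sex_M = 0.85)

# Hypothetical helper: expected G3 for one student profile
# (paid_yes and sex_M are 0/1 indicators)
predict_g3 <- function(studytime, paid_yes, Medu, sex_M) {
  unname(coefs["intercept"] +
           coefs["studytime"] * studytime +
           coefs["paid_yes"] * paid_yes +
           coefs["Medu"] * Medu +
           coefs["sex_M"] * sex_M)
}

# Girl, study-time level 2, mother's education level 3, no paid classes
girl <- predict_g3(studytime = 2, paid_yes = 0, Medu = 3, sex_M = 0)
# Same profile, but a boy
boy <- predict_g3(studytime = 2, paid_yes = 0, Medu = 3, sex_M = 1)

print(c(girl = girl, boy = boy))  # 11.39 and 12.24
```

The two predictions differ by exactly the `sex [M]` coefficient, 0.85, which is what "all other things being equal" means in practice.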

Interaction

Some research suggests that girls are better at organizing their studying. Let us check whether the effect of paid classes differs between girls and boys, assuming that girls will get more benefit from these classes than boys.

mlr_model_int = lm(G3 ~ studytime + paid*sex + Medu, data)

tab_model(mlr_model_int)
                      G3
Predictors            Estimates   CI              p
(Intercept)           8.55        7.24 – 9.86     <0.001
studytime             0.65        0.23 – 1.06     0.002
paid [yes]            -0.60       -1.53 – 0.33    0.204
sex [M]               0.70        -0.23 – 1.63    0.139
Medu                  0.55        0.24 – 0.85     <0.001
paid [yes] × sex [M]  0.32        -1.00 – 1.64    0.631
Observations          357
R2 / R2 adjusted      0.072 / 0.059

In the interaction model, only study time and mother’s education remain statistically significant.

The interaction term is not significant (p = 0.631), so in our data the association between paid classes and the final math grade does not differ between girls and boys. The main effect of sex is no longer significant, but its meaning has changed: it now describes the difference between boys and girls among students without paid classes only, and it is estimated less precisely (its confidence interval is wider). The adjusted R2 did not grow (0.059 vs 0.061), so this model does not seem to be better, which we can say even without comparing the models statistically.
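If a formal comparison of the two models is wanted, a partial F-test on the nested models is one option. The sketch below shows the mechanics on simulated stand-in data that only borrows the column names used in this report; it is not the student data, and the simulated outcome deliberately contains no interaction.

```r
# Simulated stand-in data (NOT the student data); no true interaction
set.seed(3)
n <- 300
sim <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  paid = factor(sample(c("no", "yes"), n, replace = TRUE)),
  Medu = sample(0:4, n, replace = TRUE),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$G3 <- 8.5 + 0.6 * sim$studytime + 0.5 * sim$Medu + rnorm(n, sd = 3)

# Main-effects model and the same model with a paid-by-sex interaction
m_main <- lm(G3 ~ studytime + paid + Medu + sex, data = sim)
m_int <- lm(G3 ~ studytime + paid * sex + Medu, data = sim)

# Partial F-test: does the extra paid:sex term reduce residual variance?
cmp <- anova(m_main, m_int)
print(cmp)
```

With the models from this report the call would be `anova(mlr_model, mlr_model_int)`. Because only one coefficient is added, the F-test is equivalent to the t-test on the interaction term, so its p-value should match the 0.631 shown in the table.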

Conclusions

Generally, our model predicts that:

  1. more time dedicated to studying is associated with a higher final math grade

  2. boys are expected to get a higher grade than girls

  3. mother’s education is a valuable factor: the higher the mother’s education, the higher the grade

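But our model explains very little variance, and we found heteroscedasticity. It does not bias the coefficient estimates themselves, but it makes the standard errors, and hence the p-values and confidence intervals, unreliable, and predictions are least precise for students with the highest expected grades.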