ggplot(sp, aes(x = G3)) +
geom_density() +
labs(title = "G3 Distribution")
full_model = lm(G3 ~ school + sex + age + famsize + traveltime + studytime, data = sp)
final_model = step(full_model, direction = "both")
## Start: AIC=1428.54
## G3 ~ school + sex + age + famsize + traveltime + studytime
##
## Df Sum of Sq RSS AIC
## - traveltime 1 12.65 5751.4 1428.0
## <none> 5738.8 1428.5
## - famsize 1 28.05 5766.8 1429.7
## - age 1 50.22 5789.0 1432.2
## - sex 1 89.90 5828.7 1436.6
## - studytime 1 222.29 5961.1 1451.2
## - school 1 383.36 6122.1 1468.5
##
## Step: AIC=1427.97
## G3 ~ school + sex + age + famsize + studytime
##
## Df Sum of Sq RSS AIC
## <none> 5751.4 1428.0
## + traveltime 1 12.65 5738.8 1428.5
## - famsize 1 28.00 5779.4 1429.1
## - age 1 51.00 5802.4 1431.7
## - sex 1 94.27 5845.7 1436.5
## - studytime 1 224.01 5975.5 1450.8
## - school 1 446.94 6198.4 1474.5
My chosen response variable, term 3 grades (G3), is fairly normally distributed. I chose to do a simple linear model since a glm may bring in too much complexity. In the future, I may consider doing a zero-inflated model or a Hurdle model to capture the additional zeroes or could try more complex models to make comparisons to this simple model.
For my predictors, I chose school, sex, age, famsize, studytime and traveltime. I ran stepwise selection which ended up dropping traveltime based on AIC. In the future, I may consider other selection methods like ridge/lasso regression to broaden my scope.
gg_resfitted(final_model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The residuals vs. fitted values shows us the variation in the errors. There isn’t an obvious increase or decrease here which means that our model meets the homoscedasticity assumption.
residuals = resid(final_model)
df = data.frame(residuals, school = sp$school, sex = sp$sex, age = sp$age, famsize = sp$famsize, studytime = sp$studytime)
# Residuals vs. school
ggplot(df, aes(x = school, y = residuals)) +
geom_boxplot() +
geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
geom_smooth(se = F) +
labs(
title = "Residuals vs. school"
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Residuals vs. sex
ggplot(df, aes(x = sex, y = residuals)) +
geom_boxplot() +
geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
geom_smooth(se = F) +
labs(
title = "Residuals vs. sex"
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Residuals vs age
ggplot(df, aes(x = age, y = residuals)) +
geom_jitter() +
geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
geom_smooth(se = F) +
labs(
title = "Residuals vs. age"
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Residuals vs famsize
ggplot(df, aes(x = famsize, y = residuals)) +
geom_boxplot() +
geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
geom_smooth(se = F) +
labs(
title = "Residuals vs. famsize"
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Residuals vs studytime
ggplot(df, aes(x = studytime, y = residuals)) +
geom_jitter() +
geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
geom_smooth(se = F) +
labs(
title = "Residuals vs. studytime"
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I produced a residuals vs. predictor plot for each predictor. For continuous predictors, I did a normal scatterplot. For binary predictors, I did boxplots. Each of the boxes in the boxplots are fairly similar for each group and has a median around zero which indicates that none of these predictors violate the homoscedasticity assumption. We also see that the two loess smoothers associated with the continuous predictor’s scatterplots closely matches the horizontal zero line and the scatters are roughly normally distributed. The model meets the linearity assumption. We may want to look into some of the points that may be considered outliers because the linear model may not be doing the best job at capturing those points.
ggcorr(select(sp,
age,
studytime), label = TRUE) +
labs(title='Correlation Heatmap')
vif(final_model)
## school sex age famsize studytime
## 1.041160 1.070789 1.009036 1.010905 1.071730
We can use the Correlation Heatmap to compare the two continuous variables to check for collinearity. However, we want to consider all the variables in the model and check for multicollinearity. For this reason, I chose to check the Varaiance Inflation Factor. When I run the function, every single variable has a VIF of close to 1 which is well below the threshold of problems. Luckily, we don’t need to worry about multicollinearity for our model.
gg_reshist(final_model)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Ideally, we would want to see a completely normal distribution for our histogram. For the most part, our residuals are pretty symmetric but there are a few residuals that are well away from the rest of the distribution. This likely means that there are points from our original data that the model isn’t effectively picking up on. This is why it is probably good to test out other models (including ones that account for several zeroes) in the future.
gg_qqplot(final_model)
Here, we are looking at the standardized residuals against the theoretical quantiles. The standardized residuals closely follow the red line for the majority of the graph. However, in the bottom left, there are several points that deviate from the line. These are the observations that we saw in our histogram that are separated from the rest of the distribution. As mentioned a few times, if we take into account these outliers in something like a hurdle model, we may get better model performance and better diagnostic performance.
The primary issues with the model is the inability to capture the zeroes and potential outliers at the bottom end of our response’s distribution. We may get nicer performance out of a model that accounts for the zeroes/outliers but we will have to weigh complexity vs. interpretability.
summary(final_model)
##
## Call:
## lm(formula = G3 ~ school + sex + age + famsize + studytime, data = sp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4005 -1.6318 0.0538 1.8274 9.1007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.17521 1.66202 9.131 < 2e-16 ***
## schoolMS -1.77738 0.25144 -7.069 4.09e-12 ***
## sexM -0.80192 0.24701 -3.246 0.00123 **
## age -0.23134 0.09688 -2.388 0.01724 *
## famsizeLE3 0.45759 0.25861 1.769 0.07730 .
## studytime 0.73379 0.14663 5.004 7.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.991 on 643 degrees of freedom
## Multiple R-squared: 0.1496, Adjusted R-squared: 0.143
## F-statistic: 22.62 on 5 and 643 DF, p-value: < 2.2e-16
For the continuous age coefficient estimate, which is -0.23134, we could interpret it as: When holding all other variables constant, for every 1 unit (year) increase in age, the model predicts that a student’s G3 grade will decrease by about 0.23134 units. This means that the data suggest that older students don’t perform quite as well. This could be due to several factors including lower sample size for older students or lower motivation.
For the binary sex coefficient estimate, which is -0.80192, we could interpret it as: When holding all other predictors constant on average, the model predicts that male student’s G3 grades will be 0.80192 points lower compared to female students. The data suggests that females G3 grades are quite a bit higher than male grades.