Expanding the model from last week

My linear model from last week:

model = lm(G3 ~ studytime, data = sp)
summary(model)
## 
## Call:
## lm(formula = G3 ~ studytime, data = sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9735  -1.9463   0.0265   2.0265   7.0265 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.0278     0.3115  32.191  < 2e-16 ***
## studytime     0.9728     0.1483   6.562 1.09e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.131 on 647 degrees of freedom
## Multiple R-squared:  0.06239,    Adjusted R-squared:  0.06095 
## F-statistic: 43.06 on 1 and 647 DF,  p-value: 1.091e-10

Include 1-3 more variables into your regression model:

Try out either an interaction or binary term to start:

model1 = lm(G3 ~ studytime + traveltime, data = sp)
summary(model1)
## 
## Call:
## lm(formula = G3 ~ studytime + traveltime, data = sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2460  -1.7087  -0.1913   1.8087   6.7540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.8379     0.4139  26.183  < 2e-16 ***
## studytime     0.9453     0.1477   6.401 2.97e-10 ***
## traveltime   -0.4826     0.1636  -2.950   0.0033 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.112 on 646 degrees of freedom
## Multiple R-squared:  0.07485,    Adjusted R-squared:  0.07199 
## F-statistic: 26.13 on 2 and 646 DF,  p-value: 1.219e-11
AIC(model, model1)
##        df      AIC
## model   3 3327.115
## model1  4 3320.433

My original linear model from last week had term 3 grades ‘G3’ as the response and ‘studytime’ as the only predictor. To start this assignment, I chose to add ‘traveltime’ to my model. To do this, I had ‘studytime*travelime’ as my sole predictor which includes both the main effects plus the interaction effect between the two predictors. The summary output for this model shows that traveltime and the interaction term aren’t very close to being significant. If we judge the model based on criteria like R-squared or AIC, there is a modest improvement in the model with the interaction term. However, if we drop the interaction, the model fit improves and both main effects become signficant. For this reason, I will keep ‘studytime’ in as a new predictor but will not include the interaction term.

cor(sp$studytime, sp$traveltime)
## [1] -0.0631539

Fortunately, there is very low correlation between these two predictors so we don’t need to worry about collinearity/multicollinearity.

Consider adding other integer or continuous variables:

model2 = lm(G3 ~ studytime + traveltime + age, data = sp)
summary(model2)
## 
## Call:
## lm(formula = G3 ~ studytime + traveltime + age, data = sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1690  -1.6347  -0.0234   1.8881   7.0981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.29259    1.71687   8.907  < 2e-16 ***
## studytime    0.94288    0.14699   6.415 2.73e-10 ***
## traveltime  -0.46781    0.16295  -2.871  0.00423 ** 
## age         -0.26715    0.09995  -2.673  0.00771 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.098 on 645 degrees of freedom
## Multiple R-squared:  0.08499,    Adjusted R-squared:  0.08073 
## F-statistic: 19.97 on 3 and 645 DF,  p-value: 2.184e-12
full_model = lm(G3 ~ studytime*traveltime*age, data = sp)
step = step(full_model, direction = "both")
## Start:  AIC=1459.93
## G3 ~ studytime * traveltime * age
## 
##                            Df Sum of Sq    RSS    AIC
## - studytime:traveltime:age  1   0.63187 6005.2 1458.0
## <none>                                  6004.6 1459.9
## 
## Step:  AIC=1458
## G3 ~ studytime + traveltime + age + studytime:traveltime + studytime:age + 
##     traveltime:age
## 
##                            Df Sum of Sq    RSS    AIC
## - studytime:traveltime      1     1.232 6006.5 1456.1
## - traveltime:age            1    13.104 6018.3 1457.4
## <none>                                  6005.2 1458.0
## + studytime:traveltime:age  1     0.632 6004.6 1459.9
## - studytime:age             1   161.344 6166.6 1473.2
## 
## Step:  AIC=1456.13
## G3 ~ studytime + traveltime + age + studytime:age + traveltime:age
## 
##                        Df Sum of Sq    RSS    AIC
## - traveltime:age        1    12.648 6019.1 1455.5
## <none>                              6006.5 1456.1
## + studytime:traveltime  1     1.232 6005.2 1458.0
## - studytime:age         1   163.961 6170.4 1471.6
## 
## Step:  AIC=1455.49
## G3 ~ studytime + traveltime + age + studytime:age
## 
##                        Df Sum of Sq    RSS    AIC
## <none>                              6019.1 1455.5
## + traveltime:age        1    12.648 6006.5 1456.1
## + studytime:traveltime  1     0.776 6018.3 1457.4
## - traveltime            1    70.774 6089.9 1461.1
## - studytime:age         1   169.356 6188.5 1471.5
ggcorr(select(sp,
              studytime,
              traveltime,
              age), label = TRUE) +
  labs(title='Correlation Heatmap')

In addition to the existing ‘studytime’ variable and the ‘traveltime’ variable I added above, I decided to add ‘age’ to the model. Instead of going through the possible combinations of main effects and interaction terms, I decided to run stepwise selection on a full model (which includes all main effects and interaction effects). This ended up telling me that a model with the three main effects and 1 interaction term between ‘studytime’ and ‘age’ had good performance according to AIC. We also see that the R-squared has jumped 71% compared to our 1 term model and 45% compared to our 2 term model. Every term in this model is also significant and there is no correlation between the three predictors. This will be my final model that I will run diagnostic plots on in the next section.

# Final model
model_final = lm(formula = G3 ~ studytime + traveltime + age + studytime:age, data = sp)

summary(model_final)
## 
## Call:
## lm(formula = G3 ~ studytime + traveltime + age + studytime:age, 
##     data = sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1711  -1.6403  -0.0477   1.8896   7.0635 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    31.1252     4.0872   7.615 9.42e-14 ***
## studytime      -7.2615     1.9328  -3.757 0.000188 ***
## traveltime     -0.4429     0.1609  -2.752 0.006094 ** 
## age            -1.2211     0.2448  -4.987 7.90e-07 ***
## studytime:age   0.4932     0.1159   4.257 2.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.057 on 644 degrees of freedom
## Multiple R-squared:   0.11,  Adjusted R-squared:  0.1045 
## F-statistic:  19.9 on 4 and 644 DF,  p-value: 1.822e-15

Evaluate the Model

Residuals vs. Fitted Values

gg_resfitted(model_final) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The residuals vs. fitted values plot above can give us an indication of the variation in the errors. Fortunately, there is no clear indication of an increasing or decreasing trend meaning that the homoscedasticity assumption is met.

Residuals vs. X-values

residuals = resid(model_final)
df = data.frame(residuals, studytime = sp$studytime, traveltime = sp$traveltime, age = sp$age)

# Residuals vs. studytime
ggplot(df, aes(x = studytime, y = residuals)) + 
  geom_jitter() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. studytime"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Residuals vs. traveltime
ggplot(df, aes(x = traveltime, y = residuals)) + 
  geom_jitter() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. traveltime"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Residuals vs Age
ggplot(df, aes(x = age, y = residuals)) + 
  geom_jitter() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. age"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

I had some issues with the package used in class for these diagnostic plots so I did them manually here. These plots show the residuals against each predictor. The linearity assumption is strongly met because the blue loess smoother closely matches the red horizontal and straight zero line in each plot. In each plot, we can also see that the errors are normally distributed across the prediction line. There isn’t any clear non-normal trend like the example in class. This shows that our model meets the linearity assumption and the errors meet the constant variance assumption.

Correlation Heatmap

ggcorr(select(sp,
              studytime,
              traveltime,
              age), label = TRUE) +
  labs(title='Correlation Heatmap')

I’ve copied my correlation heatmap from above. As we can see, there is almost no correlation between any of the predictors in our model. Because of this, we don’t need to worry about issues of collinearity/multicollinearity in our model.

Residual Histogram

gg_reshist(model_final)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The plot above shows the distribution of the residuals of our model. In an ideal scenario, we want to see a shape that roughly resembles a normal distribution. The bulk of our residuals are fairly normally distributed but the tails are noteworthy. The left tail extends further out than the right. This deviation from perfect normality indicates that we might still be missing something in our model (which makes sense because we aren’t likely to get the best performance from a linear model with just 4 terms).

QQ-Plot

gg_qqplot(model_final)

The plot above shows the standardized residuals against the theoretical quantiles. We can see that our standardized residuals closely follow the red line for most of the plot with some deviations at the tails (especially the left one). This matches our histogram above. The deviations at the tails could indicate the presence of a few outliers that need further investigation. We do have a large sample size which helps a bit here.