Week 11 Data Dive

Build an lm or glm as you like

Use whatever response and predictors

ggplot(sp, aes(x = G3)) + 
  geom_density() +
  labs(title = "G3 Distribution")

full_model = lm(G3 ~ school + sex + age + famsize + traveltime + studytime, data = sp)
final_model = step(full_model, direction = "both")

## Start:  AIC=1428.54
## G3 ~ school + sex + age + famsize + traveltime + studytime
## 
##              Df Sum of Sq    RSS    AIC
## - traveltime  1     12.65 5751.4 1428.0
## <none>                    5738.8 1428.5
## - famsize     1     28.05 5766.8 1429.7
## - age         1     50.22 5789.0 1432.2
## - sex         1     89.90 5828.7 1436.6
## - studytime   1    222.29 5961.1 1451.2
## - school      1    383.36 6122.1 1468.5
## 
## Step:  AIC=1427.97
## G3 ~ school + sex + age + famsize + studytime
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    5751.4 1428.0
## + traveltime  1     12.65 5738.8 1428.5
## - famsize     1     28.00 5779.4 1429.1
## - age         1     51.00 5802.4 1431.7
## - sex         1     94.27 5845.7 1436.5
## - studytime   1    224.01 5975.5 1450.8
## - school      1    446.94 6198.4 1474.5

My chosen response variable, term 3 grades (G3), is fairly normally distributed. I chose to do a simple linear model since a glm may bring in too much complexity. In the future, I may consider doing a zero-inflated model or a Hurdle model to capture the additional zeroes or could try more complex models to make comparisons to this simple model.

For my predictors, I chose school, sex, age, famsize, studytime and traveltime. I ran stepwise selection which ended up dropping traveltime based on AIC. In the future, I may consider other selection methods like ridge/lasso regression to broaden my scope.

Use tools from previous weeks to diagnose the model

Residuals vs. Fitted Values

gg_resfitted(final_model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The residuals vs. fitted values shows us the variation in the errors. The residuals don’t show a clear funnel shape or non-constant trend. The residuals also appear to be randomly scattered around the zero line across fitted values. There isn’t any strong evidence of heteroscedasticity.

Residuals vs. X-values

residuals = resid(final_model)
df = data.frame(residuals, school = sp$school, sex = sp$sex, age = sp$age, famsize = sp$famsize, studytime = sp$studytime)

# Residuals vs. school
ggplot(df, aes(x = school, y = residuals)) + 
  geom_boxplot() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. school"
  )

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Residuals vs. sex
ggplot(df, aes(x = sex, y = residuals)) + 
  geom_boxplot() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. sex"
  )

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Residuals vs age
ggplot(df, aes(x = age, y = residuals)) + 
  geom_jitter() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. age"
  )

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Residuals vs famsize
ggplot(df, aes(x = famsize, y = residuals)) + 
  geom_boxplot() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. famsize"
  )

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Residuals vs studytime
ggplot(df, aes(x = studytime, y = residuals)) + 
  geom_jitter() +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  geom_smooth(se = F) +
  labs(
    title = "Residuals vs. studytime"
  )

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

I produced a residuals vs. predictor plot for each predictor. For continuous predictors, I did a normal scatterplot. For binary predictors, I did boxplots. For the boxplots, we can see that the variability of the residuals seems to be similar across groups. This is indicated by the comparable IQRs and whisker lengths. Because of this, there isn’t any strong evidence of heteroscedasticity. For the continuous predictors (scatterplots), the residuals appear to be randomly placed around the zero line. We can also see that, for the most part, the loess smoothers cloesly match the zero line. This supports the linearity assumption.

Correlation Heatmap and VIF

ggcorr(select(sp,
              age,
              studytime), label = TRUE) +
  labs(title='Correlation Heatmap')

vif(final_model)

##    school       sex       age   famsize studytime 
##  1.041160  1.070789  1.009036  1.010905  1.071730

We can use the Correlation Heatmap to compare the two continuous variables to check for collinearity. However, we want to consider all the variables in the model and check for multicollinearity. For this reason, I chose to check the Varaiance Inflation Factor. When I run the function, every single variable has a VIF of close to 1 which is well below the threshold of problems. Luckily, we don’t need to worry about multicollinearity for our model.

Residual Histogram

gg_reshist(final_model)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Ideally, we would want to see a plot that is approximately symmetrical and doesn’t show major skew on either side. For the most part, our residuals are pretty symmetric with a few deviations in the tails. Overall, we don’t have massive deviations from normality that affect inference.

QQ-Plot

gg_qqplot(final_model)

Here, we are looking at the standardized residuals against the theoretical quantiles. The standardized residuals closely follow the red line for the majority of the graph. However, in the bottom left, there are some points that deviate from the line. Given the sample size and the fact that the majority of our points follow the line, our residuals are close enough to being normally distributed.

Highlight any issues with the model

The primary issues with the model is the inability to capture the zeroes and potential outliers at the bottom end of our response’s distribution. We may get nicer performance out of a model that accounts for the zeroes/outliers but we will have to weigh complexity vs. interpretability.

Interpret at least one of the coefficients

summary(final_model)

## 
## Call:
## lm(formula = G3 ~ school + sex + age + famsize + studytime, data = sp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4005  -1.6318   0.0538   1.8274   9.1007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.17521    1.66202   9.131  < 2e-16 ***
## schoolMS    -1.77738    0.25144  -7.069 4.09e-12 ***
## sexM        -0.80192    0.24701  -3.246  0.00123 ** 
## age         -0.23134    0.09688  -2.388  0.01724 *  
## famsizeLE3   0.45759    0.25861   1.769  0.07730 .  
## studytime    0.73379    0.14663   5.004 7.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.991 on 643 degrees of freedom
## Multiple R-squared:  0.1496, Adjusted R-squared:  0.143 
## F-statistic: 22.62 on 5 and 643 DF,  p-value: < 2.2e-16

For the continuous age coefficient estimate, which is -0.23134, we could interpret it as: When holding all other variables constant, for every 1 unit (year) increase in age, the model predicts that a student’s G3 grade will decrease by about 0.23134 units. This means that the data suggest that older students don’t perform quite as well. This could be due to several factors including lower sample size for older students or lower motivation.

For the binary sex coefficient estimate, which is -0.80192, we could interpret it as: When holding all other predictors constant on average, the model predicts that male student’s G3 grades will be 0.80192 points lower compared to female students. The data suggests that females G3 grades are quite a bit higher than male grades.