Build a logistic regression model

full_model <- glm(G3Pass ~ studytime*traveltime*age, data = sp,
             family = binomial(link = 'logit'))
step = step(full_model, direction = "both")

## Start:  AIC=543.51
## G3Pass ~ studytime * traveltime * age
## 
##                            Df Deviance    AIC
## - studytime:traveltime:age  1   527.90 541.90
## <none>                          527.51 543.51
## 
## Step:  AIC=541.9
## G3Pass ~ studytime + traveltime + age + studytime:traveltime + 
##     studytime:age + traveltime:age
## 
##                            Df Deviance    AIC
## - traveltime:age            1   527.99 539.99
## - studytime:age             1   528.24 540.24
## - studytime:traveltime      1   528.48 540.48
## <none>                          527.90 541.90
## + studytime:traveltime:age  1   527.51 543.51
## 
## Step:  AIC=539.99
## G3Pass ~ studytime + traveltime + age + studytime:traveltime + 
##     studytime:age
## 
##                        Df Deviance    AIC
## - studytime:age         1   528.35 538.35
## - studytime:traveltime  1   528.58 538.58
## <none>                      527.99 539.99
## + traveltime:age        1   527.90 541.90
## 
## Step:  AIC=538.35
## G3Pass ~ studytime + traveltime + age + studytime:traveltime
## 
##                        Df Deviance    AIC
## - studytime:traveltime  1   529.00 537.00
## <none>                      528.35 538.35
## + studytime:age         1   527.99 539.99
## + traveltime:age        1   528.24 540.24
## - age                   1   536.34 544.34
## 
## Step:  AIC=537
## G3Pass ~ studytime + traveltime + age
## 
##                        Df Deviance    AIC
## - traveltime            1   530.09 536.09
## <none>                      529.00 537.00
## + studytime:traveltime  1   528.35 538.35
## + studytime:age         1   528.58 538.58
## + traveltime:age        1   528.91 538.91
## - age                   1   536.94 542.94
## - studytime             1   548.10 554.10
## 
## Step:  AIC=536.09
## G3Pass ~ studytime + age
## 
##                 Df Deviance    AIC
## <none>               530.09 536.09
## + traveltime     1   529.00 537.00
## + studytime:age  1   529.67 537.67
## - age            1   538.25 542.25
## - studytime      1   550.00 554.00

final_model = step
final_model$coefficients

## (Intercept)   studytime         age 
##   4.8185675   0.6636091  -0.2549431
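
The AIC column in the stepwise trace above can be checked by hand. For a logistic model fit to individual 0/1 outcomes, the AIC that R reports works out to the residual deviance plus twice the number of estimated coefficients. The sketch below redoes that arithmetic for the starting and final models. `aic_from_deviance` is a hypothetical helper written only for this illustration, and the deviances are copied from the trace.

```r
# Hypothetical helper: AIC from residual deviance and number of coefficients.
# Assumes ungrouped 0/1 outcomes, where -2 * log-likelihood equals the deviance.
aic_from_deviance <- function(deviance, n_coef) {
  deviance + 2 * n_coef
}

# Full model: intercept + 3 main effects + 3 two-way + 1 three-way = 8 coefficients
aic_full <- aic_from_deviance(527.51, 8)

# Final model: intercept + studytime + age = 3 coefficients
aic_final <- aic_from_deviance(530.09, 3)

print(c(full = aic_full, final = aic_final))
```

Dropping five terms costs only 2.58 in deviance (530.09 versus 527.51) but saves 10 in penalty, which is why the smaller model has the lower AIC.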

Interpret the coefficients and explain what they mean

As with the linear model I fit in Week 9, I started from the same three predictors in this week’s logistic model: studytime, traveltime and age, including all of their interactions. I then used stepwise selection in both directions to find the combination of terms that minimized AIC. The model with just the ‘studytime’ and ‘age’ main effects had the lowest AIC (536.09), so this is the model I used. It has three coefficients:

The first is the intercept, 4.8186. This is the log-odds of G3Pass when both predictors are equal to zero. Because an age of zero is far outside any realistic student age, the intercept is an extrapolation that anchors the model rather than a quantity worth interpreting on its own.

The second is the ‘studytime’ coefficient, 0.6636. Holding age constant, a 1-unit increase in studytime increases the log-odds of passing by 0.6636. Converted to an odds ratio (e^0.6636 = 1.942), this means that a 1-unit increase in studytime increases the odds of a student passing by 94.2%, or roughly doubles them. Overall, students in this dataset who study more tend to have a higher chance of achieving a passing grade.

The final coefficient, ‘age’, is -0.2549. Holding studytime constant, a 1-unit increase in age decreases the log-odds of passing by 0.2549. Converted to an odds ratio (e^-0.2549 = 0.775), this means that a 1-unit increase in age decreases the odds of a student passing by about 22.5%. Overall, older students in this dataset have a lower chance of passing than younger students. There are a few possible reasons for this. One could be lower motivation among older students. Another could be a difference in how much data exists for older students versus younger students. In future analyses, we would want to investigate the cause of this trend.
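
The conversions described above can be reproduced directly from the fitted coefficients. This is a minimal sketch that hard-codes the coefficients from the output above; the student used for the predicted probability (studytime = 2, age = 16) is a hypothetical example chosen only for illustration.

```r
# Coefficients copied from the final_model output above (log-odds scale)
b <- c(intercept = 4.8185675, studytime = 0.6636091, age = -0.2549431)

# Odds ratios: multiplicative change in the odds per 1-unit increase
odds_ratio <- exp(b[c("studytime", "age")])

# Percent change in the odds per 1-unit increase
pct_change <- (odds_ratio - 1) * 100

print(round(cbind(odds_ratio, pct_change), 3))

# Predicted probability for a hypothetical student (illustrative values only)
student <- c(studytime = 2, age = 16)
log_odds <- b[["intercept"]] + sum(b[names(student)] * student)
p_pass <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
print(round(p_pass, 3))
```

Odds ratios describe a multiplicative change in the odds, not in the probability, so `plogis()` is needed whenever a statement about the chance of passing for a specific student is wanted.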

Using the SE for at least one coefficient, build a CI and translate its meaning

coef = coef(final_model)
se = summary(final_model)$coefficients[, 'Std. Error']
lower = coef - 1.96 * se
upper = coef + 1.96 * se

cbind(lower, upper)

##                  lower       upper
## (Intercept)  1.8195297  7.81760533
## studytime    0.3517962  0.97542193
## age         -0.4300590 -0.07982718

exp(cbind(lower, upper))

##                 lower        upper
## (Intercept) 6.1689563 2483.9500454
## studytime   1.4216187    2.6522861
## age         0.6504707    0.9232759

Above, I computed 95% Wald confidence intervals (estimate ± 1.96 × SE) for each coefficient on both the log-odds and odds scales. For this question, I want to specifically interpret the confidence intervals for the ‘studytime’ variable, which are [0.3518, 0.9754] on the log-odds scale and [1.4216, 2.6523] on the odds-ratio scale.

A 95% confidence level means that, if sampling were repeated many times, about 95% of the intervals produced this way would contain the true population coefficient. Here, we are 95% confident that the true log-odds coefficient of studytime lies between 0.3518 and 0.9754. Exponentiating the endpoints, we are 95% confident that the true odds ratio of studytime lies between 1.4216 and 2.6523. In other words, each additional unit of study time multiplies the odds of passing by somewhere between 1.42 and 2.65, holding age constant. Because the whole interval lies above 1, the data support a positive association between study time and passing.
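
As a cross-check on the manual calculation, base R's `confint.default()` also returns Wald intervals for a fitted model. The sketch below compares the two approaches on simulated data, since the `sp` data are not reproduced here. The data frame `sim`, its sample size, and the coefficients used to generate it are all assumptions made for illustration.

```r
# Simulated stand-in for the real data (hypothetical values, not the sp dataset)
set.seed(1)
n <- 500
sim <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  age       = sample(15:22, n, replace = TRUE)
)
sim$G3Pass <- rbinom(n, 1, plogis(4.8 + 0.66 * sim$studytime - 0.25 * sim$age))

fit <- glm(G3Pass ~ studytime + age, data = sim, family = binomial(link = "logit"))

# Manual Wald interval: estimate +/- z * SE, using the exact normal quantile
z   <- qnorm(0.975)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
manual <- cbind(lower = est - z * se, upper = est + z * se)

# Built-in Wald interval on the log-odds scale
builtin <- confint.default(fit, level = 0.95)

print(all.equal(unname(manual), unname(builtin)))

# Exponentiate to get intervals on the odds-ratio scale
print(exp(builtin))
```

Using `qnorm(0.975)` instead of the rounded 1.96 changes the endpoints only in the third or fourth decimal place.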

Diagnostic Plots

Cost Plot

y = sp$G3Pass
p_hat = final_model$fitted.values
cost = -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

ggplot() +
  geom_boxplot(mapping = aes(x = as_factor(sp$G3Pass), 
                             y = cost), 
               orientation = 'h') +
  labs(title = "Cost Diagnostic Plot", x = "Pass", y= 'Cost') +
  scale_y_log10()

Above is a ‘Cost’ diagnostic plot, as seen in the lab, showing the cross-entropy cost of each observation grouped by G3Pass outcome. Students who passed have much lower cost, so the model fits them well. Students who did not pass have much higher cost, which means the model tends to assign them a high predicted probability of passing and is less accurate for failures. The two levels are clearly separated in the plot, but the class imbalance could be causing this difference.
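
A small worked example shows why the cost differs so sharply between outcomes. The sketch below applies the same cross-entropy formula to a single hypothetical prediction of 0.9; `cost_one` is a helper name introduced only for this illustration.

```r
# Cross-entropy cost for a single observation
# y is the observed outcome (1 = pass, 0 = fail); p_hat is the predicted P(pass)
cost_one <- function(y, p_hat) {
  -(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Hypothetical prediction: the model gives a 90% chance of passing
p_hat <- 0.9
cost_if_pass <- cost_one(1, p_hat)  # equals -log(0.9)
cost_if_fail <- cost_one(0, p_hat)  # equals -log(0.1)

print(c(pass = cost_if_pass, fail = cost_if_fail))
```

The same confident prediction is cheap when the student passes and expensive when the student fails, so a model that predicts high pass probabilities for nearly everyone will show high cost concentrated in the failing group.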

Cook’s Distance

plot(final_model, which = 4, id.n = 3)

The Cook’s Distance plot shows us which observations have the strongest influence on the model. Observations 82, 503 and 524 have the highest influence and warrant further investigation in future analyses.
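
The values behind the plot can also be extracted with `cooks.distance()`, which makes it easier to look at the flagged rows. The sketch below uses simulated data as a stand-in for `sp`, so the data frame `sim` and its generating coefficients are assumptions. The 4/n cutoff is a common rule of thumb rather than a formal test.

```r
# Simulated stand-in for the real data (hypothetical values, not the sp dataset)
set.seed(42)
n <- 300
sim <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  age       = sample(15:22, n, replace = TRUE)
)
sim$G3Pass <- rbinom(n, 1, plogis(4.8 + 0.66 * sim$studytime - 0.25 * sim$age))

fit <- glm(G3Pass ~ studytime + age, data = sim, family = binomial(link = "logit"))

# Cook's distance for every observation, named by row
cd <- cooks.distance(fit)

# The three most influential observations, matching id.n = 3 in the plot call
top3 <- head(sort(cd, decreasing = TRUE), 3)
print(top3)

# Rule-of-thumb flag (an assumption, not a formal test): distance above 4 / n
flagged <- which(cd > 4 / n)
print(length(flagged))

# Inspect the predictor values and outcome for the top three rows
print(sim[names(top3), ])
```

Looking at the flagged rows directly shows whether they are unusual predictor combinations or simply outcomes the model predicted poorly.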

Week 10 Data Dive

2026-03-30

Select an interesting binary column