sp$G3Pass = ifelse(sp$G3 >= 10, 1, 0)
For this assignment, I created a new binary variable from the G3 grades called ‘G3Pass’. A student with a G3Pass value of 1 is considered to have a passing grade (a G3 value of 10 or above). A student with a value of 0 is considered to have a failing grade (a G3 value of 9 or below).
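As a quick illustration of the recoding rule, the sketch below applies the same ifelse() call to a few made-up grades and tabulates the result. The vector g3_example is hypothetical and is not taken from sp.

```r
# Hypothetical G3 grades, chosen to sit on both sides of the cutoff
g3_example <- c(0, 9, 10, 15, 20)

# Same rule as above: 1 = pass (G3 >= 10), 0 = fail (G3 <= 9)
g3_pass_example <- ifelse(g3_example >= 10, 1, 0)
g3_pass_example

# Counts per class; on the real data this shows how balanced the outcome is
table(g3_pass_example)
```

Running table(sp$G3Pass) on the real data would give the actual split between passes and failures.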
full_model <- glm(G3Pass ~ studytime*traveltime*age, data = sp,
family = binomial(link = 'logit'))
step = step(full_model, direction = "both")
## Start: AIC=543.51
## G3Pass ~ studytime * traveltime * age
##
## Df Deviance AIC
## - studytime:traveltime:age 1 527.90 541.90
## <none> 527.51 543.51
##
## Step: AIC=541.9
## G3Pass ~ studytime + traveltime + age + studytime:traveltime +
## studytime:age + traveltime:age
##
## Df Deviance AIC
## - traveltime:age 1 527.99 539.99
## - studytime:age 1 528.24 540.24
## - studytime:traveltime 1 528.48 540.48
## <none> 527.90 541.90
## + studytime:traveltime:age 1 527.51 543.51
##
## Step: AIC=539.99
## G3Pass ~ studytime + traveltime + age + studytime:traveltime +
## studytime:age
##
## Df Deviance AIC
## - studytime:age 1 528.35 538.35
## - studytime:traveltime 1 528.58 538.58
## <none> 527.99 539.99
## + traveltime:age 1 527.90 541.90
##
## Step: AIC=538.35
## G3Pass ~ studytime + traveltime + age + studytime:traveltime
##
## Df Deviance AIC
## - studytime:traveltime 1 529.00 537.00
## <none> 528.35 538.35
## + studytime:age 1 527.99 539.99
## + traveltime:age 1 528.24 540.24
## - age 1 536.34 544.34
##
## Step: AIC=537
## G3Pass ~ studytime + traveltime + age
##
## Df Deviance AIC
## - traveltime 1 530.09 536.09
## <none> 529.00 537.00
## + studytime:traveltime 1 528.35 538.35
## + studytime:age 1 528.58 538.58
## + traveltime:age 1 528.91 538.91
## - age 1 536.94 542.94
## - studytime 1 548.10 554.10
##
## Step: AIC=536.09
## G3Pass ~ studytime + age
##
## Df Deviance AIC
## <none> 530.09 536.09
## + traveltime 1 529.00 537.00
## + studytime:age 1 529.67 537.67
## - age 1 538.25 542.25
## - studytime 1 550.00 554.00
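The AIC column in the trace above can be checked by hand. For a logistic model fit to a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2k, where k is the number of estimated coefficients. The sketch below recomputes the AIC of the starting and final models from the deviances printed in the trace. The helper aic_from_deviance is a hypothetical name used only for illustration.

```r
# Hypothetical helper: AIC of a binary logistic model from its residual deviance
aic_from_deviance <- function(deviance, n_coef) deviance + 2 * n_coef

# Full model: intercept + 3 main effects + 3 two-way terms + 1 three-way term
aic_full <- aic_from_deviance(527.51, 8)

# Final model: intercept + studytime + age
aic_final <- aic_from_deviance(530.09, 3)

c(full = aic_full, final = aic_final)
```

Both values match the trace (543.51 and 536.09). Dropping five terms costs only about 2.6 units of deviance, which is why the smaller model wins on AIC.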
final_model = step
final_model$coefficients
## (Intercept) studytime age
## 4.8185675 0.6636091 -0.2549431
Similar to the linear model I fit in Week 9, I tried the same three predictors in this week’s logistic model: studytime, traveltime and age. I started from the full three-way interaction model and used stepwise selection in both directions to find the combination of terms that minimized AIC. A logistic model with just the ‘studytime’ and ‘age’ main effects produced the lowest AIC (536.09, down from 543.51 for the full model), so this is the model I used. There are three coefficients in this model:
The first is the intercept, 4.8186. This is the log-odds of passing when both predictors are equal to zero. Since no student has an age of zero, this value is an extrapolation and mainly serves as the baseline for the other two coefficients.
The second is the ‘studytime’ coefficient, 0.6636. This means that, holding age constant, a 1-unit increase in studytime increases the log-odds of passing by 0.6636. Converted to an odds ratio (e^0.6636 = 1.942), a 1-unit increase in studytime increases the odds of a student passing by 94.2%, or roughly doubles them. Overall, students in this dataset who study more tend to have a higher chance of achieving a passing grade.
The final coefficient, ‘age’, is -0.2549. This means that, holding studytime constant, a 1-unit increase in age decreases the log-odds of passing by 0.2549. Converted to an odds ratio (e^-0.2549 = 0.775), a 1-unit increase in age decreases the odds of a student passing by about 22.5%. Overall, older students in this dataset have a lower chance of passing than younger students. There are a few reasons this could happen. One could be lower motivation among older students. It could also reflect differences in how much data exists for older versus younger students. In future analyses, we would want to investigate the cause of this trend.
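To make the coefficients more concrete, the sketch below converts them to odds ratios and then to a predicted probability of passing. The coefficient values are copied from the output above. The student profile (studytime = 2, age = 17) is a hypothetical example chosen for illustration.

```r
# Coefficients copied from final_model$coefficients above
b <- c(intercept = 4.8185675, studytime = 0.6636091, age = -0.2549431)

# Odds ratios for a 1-unit increase in each predictor
odds_ratios <- exp(b[c("studytime", "age")])
round(odds_ratios, 3)

# Hypothetical student: studytime = 2, age = 17
log_odds <- unname(b["intercept"] + b["studytime"] * 2 + b["age"] * 17)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_pass <- plogis(log_odds)
round(p_pass, 2)
```

For this profile the predicted probability of passing comes out at roughly 0.86. With the fitted model in memory, predict(final_model, newdata = data.frame(studytime = 2, age = 17), type = "response") should give the same value.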
coef = coef(final_model)
se = summary(final_model)$coefficients[, 'Std. Error']
lower = coef - 1.96 * se
upper = coef + 1.96 * se
cbind(lower, upper)
## lower upper
## (Intercept) 1.8195297 7.81760533
## studytime 0.3517962 0.97542193
## age -0.4300590 -0.07982718
exp(cbind(lower, upper))
## lower upper
## (Intercept) 6.1689563 2483.9500454
## studytime 1.4216187 2.6522861
## age 0.6504707 0.9232759
Above, I computed 95% Wald confidence intervals (estimate ± 1.96 standard errors) for each coefficient on both the log-odds and odds scales. For this question, I want to specifically interpret the confidence intervals for the ‘studytime’ variable, which are [0.3518, 0.9754] for the log-odds and [1.4216, 2.6523] for the odds ratio.
A 95% confidence interval means that, if sampling were repeated many times, 95% of the intervals produced would contain the true population coefficient. In this case, we are 95% confident that the true log-odds coefficient of studytime lies between 0.3518 and 0.9754. Exponentiating the endpoints, we are 95% confident that the true odds ratio of studytime lies between 1.4216 and 2.6523. In other words, each additional unit of study time multiplies the odds of passing by between 1.42 and 2.65. Because the interval excludes an odds ratio of 1 (a log-odds of 0), the positive effect of studytime is statistically significant at the 5% level.
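The manual Wald calculation can be cross-checked against confint.default() from base R, which builds the same kind of interval from the coefficient estimates and their standard errors. The sketch below does this on simulated data, because sp is not available here. The data frame toy, its coefficients and the seed are all assumptions made for illustration, not results from the assignment.

```r
set.seed(1)

# Simulated stand-in for sp (hypothetical values, not the real data)
n <- 200
toy <- data.frame(studytime = sample(1:4, n, replace = TRUE),
                  age       = sample(15:22, n, replace = TRUE))
toy$pass <- rbinom(n, 1, plogis(4.8 + 0.66 * toy$studytime - 0.25 * toy$age))

toy_fit <- glm(pass ~ studytime + age, data = toy, family = binomial)

# Manual Wald interval: estimate +/- z * standard error
est <- coef(toy_fit)
se  <- sqrt(diag(vcov(toy_fit)))
z   <- qnorm(0.975)  # about 1.96
manual <- cbind(lower = est - z * se, upper = est + z * se)

# Built-in Wald interval
builtin <- confint.default(toy_fit, level = 0.95)

# The two agree up to floating-point error
all.equal(unname(manual), unname(builtin))

# Odds-ratio scale: exponentiate the endpoints
exp(builtin)
```

On the real model, confint.default(final_model) should reproduce the log-odds table above up to the rounding of 1.96.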
y = sp$G3Pass
p_hat = final_model$fitted.values
cost = -(y * log(p_hat) + (1 - y) * log(1 - p_hat))
# One box per outcome (0 = fail, 1 = pass); cost is shown on a log scale
ggplot() +
  geom_boxplot(mapping = aes(x = as_factor(sp$G3Pass),
                             y = cost),
               orientation = 'x') +
  labs(title = "Cost Diagnostic Plot", x = "Pass", y = 'Cost') +
  scale_y_log10()
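Above is a ‘Cost’ diagnostic plot, as seen in the lab. It shows the per-student cost (each student's contribution to the negative log-likelihood) for each G3Pass outcome. Students who passed have much lower cost, so the model fits them well. Students who did not pass have much higher cost, which shows that the model is less accurate for failures. This gap is what class imbalance would produce: if most students pass, the fitted probabilities are pulled toward 1, which keeps the cost low for passes and raises it for failures. The model therefore separates the two levels in terms of cost, but this alone does not show that it predicts failures well.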
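A small worked example shows how the cost behaves. The sketch below applies the same formula to four made-up students. The function name log_loss and the example values are hypothetical.

```r
# Per-observation cost: negative log-likelihood of a Bernoulli outcome
log_loss <- function(y, p_hat) -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# Hypothetical outcomes and fitted probabilities of passing
y_example <- c(1, 1, 0, 0)
p_example <- c(0.9, 0.6, 0.9, 0.2)

round(log_loss(y_example, p_example), 3)
# 0.105 0.511 2.303 0.223
```

A student who passed with a fitted probability of 0.9 costs only about 0.105, while a student who failed with the same fitted probability costs about 2.303. If the model gives most students a high probability of passing, failures will therefore carry high cost.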
plot(final_model, which = 4, id.n = 3)
The Cook’s distance plot shows which observations have the strongest influence on the fitted model. Observations 82, 503 and 524 have the largest Cook’s distances and warrant further investigation in future analyses.
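One possible way to follow up is to extract the Cook’s distances directly and look at the rows behind the largest values. The sketch below does this on simulated data, since sp is not available here. The data frame toy, its coefficients and the seed are assumptions made for illustration only.

```r
set.seed(2)

# Simulated stand-in for sp (hypothetical values, not the real data)
n <- 300
toy <- data.frame(studytime = sample(1:4, n, replace = TRUE),
                  age       = sample(15:22, n, replace = TRUE))
toy$pass <- rbinom(n, 1, plogis(4.8 + 0.66 * toy$studytime - 0.25 * toy$age))

toy_fit <- glm(pass ~ studytime + age, data = toy, family = binomial)

# Cook's distance for every observation, named by row
cd <- cooks.distance(toy_fit)

# The three most influential observations, largest first
top3 <- head(sort(cd, decreasing = TRUE), 3)
top3

# Inspect the underlying rows to see what makes them unusual
toy[names(top3), ]
```

On the real model, the same idea would be cooks.distance(final_model), and the flagged rows could be inspected with sp[c(82, 503, 524), c("studytime", "age", "G3", "G3Pass")].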