Greetings everyone! This week we will be learning how to estimate and interpret a logistic regression model. The logistic regression model is one example of a generalized linear model; its fundamental feature is that it is designed for binary outcomes.
Prompt: What factors influence whether or not someone is accepted into graduate school? In this HW, we will examine the extent to which a student’s gpa, gre, and the rank of their undergraduate school (its prestige) predict whether or not they are accepted. Gpa and gre are both continuous predictors, whereas rank is a 1–4 categorical predictor, with 1 being the most prestigious and 4 the least.
Let’s load the data along with a few packages that we will need.
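A minimal sketch of the setup, assuming the classic UCLA graduate admissions dataset (which these descriptives match); swap in your own file path if your copy lives elsewhere.
#packages: performance gives us check_model(), ggplot2 handles the plotting
library(performance)
library(ggplot2)
#assumed data source: the UCLA admissions example dataset
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")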
1. Let’s begin by looking at some descriptive statistics. What are the means and SDs for admit, gre, and gpa?
#there are a few ways to go about this, but just for novelty let's try the summary and sapply functions
summary(df)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(df, sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
Answer: Mean for admit = .32 (SD = .47); mean for gre = 587.7 (SD = 115.52); mean for gpa = 3.39 (SD = .38).
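If you’d rather see the means and SDs side by side, here is one (of many) base R options:
#means and SDs for the variables of interest in one table
sapply(df[, c("admit", "gre", "gpa")], function(x) c(mean = mean(x), sd = sd(x)))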
2. Let’s begin by regressing admit onto our three predictors, but for now we will ignore the fact that our DV is actually binary and instead fit it as a linear model. We can use the check_model function from the performance package to give us some insight into whether we are meeting our model assumptions and whether there is cause for concern. In particular, pay attention to the normality of our residuals and to homogeneity of variance (i.e., homoscedasticity). Are we concerned??
#remember that we will want to ensure our categorical predictor is being treated as a factor and not continuous.
df$rank <- factor(df$rank)
#next, let's estimate the model with lm(), treating admit as if it were continuous
myLinearModel <- lm(admit ~ gre + gpa + rank, data = df)
summary(myLinearModel)
##
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7022 -0.3288 -0.1922 0.4952 0.9093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2589102 0.2159904 -1.199 0.2314
## gre 0.0004296 0.0002107 2.038 0.0422 *
## gpa 0.1555350 0.0639618 2.432 0.0155 *
## rank2 -0.1623653 0.0677145 -2.398 0.0170 *
## rank3 -0.2905705 0.0702453 -4.137 4.31e-05 ***
## rank4 -0.3230264 0.0793164 -4.073 5.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4449 on 394 degrees of freedom
## Multiple R-squared: 0.1004, Adjusted R-squared: 0.08898
## F-statistic: 8.795 on 5 and 394 DF, p-value: 6.333e-08
check_model(myLinearModel)
## Loading required namespace: qqplotr
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 400 rows containing missing values (geom_text_repel).
Answer: Wow, yes we are concerned. The residual variance is not spread equally around the reference line, so homogeneity of variance (homoscedasticity) is violated, and the residuals are not normally distributed. These are all red flags. Also, how amazing is this package!
3. Now let’s run this model as a logistic regression. To do so, we will use the glm (generalized linear model) function. What’s nice is that most of the syntax is the same as with a linear model; the only difference is that now we specify family = "binomial". By default, the coefficients will be provided as logits (i.e., log odds). Interpret the regression coefficients (sans the intercept) and include each one’s confidence interval.
mylogit <- glm(admit ~ gre + gpa + rank, data = df, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
confint(mylogit) #compute confidence interval for each coefficient.
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre 0.0001375921 0.004435874
## gpa 0.1602959439 1.464142727
## rank2 -1.3008888002 -0.056745722
## rank3 -2.0276713127 -0.670372346
## rank4 -2.4000265384 -0.753542605
Answer: Holding the other predictors constant, each additional point on the gre increases the log odds of admission by 0.002, 95% CI [0.0001, 0.0044], and each additional point of gpa increases the log odds of admission by 0.80, 95% CI [0.16, 1.46]. Relative to attending a rank 1 (most prestigious) institution, attending a rank 2 institution decreases the log odds of admission by 0.68, 95% CI [-1.30, -0.06]; a rank 3 institution by 1.34, 95% CI [-2.03, -0.67]; and a rank 4 institution by 1.55, 95% CI [-2.40, -0.75].
4. The p values associated with each coefficient are, by default, calculated using a Wald test. What if we wanted a chi-square test for each variable instead? Does this change anything, and if so, what?
anova(mylogit, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: admit
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 399 499.98
## gre 1 13.9204 398 486.06 0.0001907 ***
## gpa 1 5.7122 397 480.34 0.0168478 *
## rank 3 21.8265 394 458.52 7.088e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: Many things have changed. For one, all of our p values are different. That is because the chi-square here is a likelihood-ratio test comparing the deviances of nested models as each term is added sequentially: χ2 = (-2LLreduced) - (-2LLfull). It is also why rank is now tested as a single 3-df variable rather than as k - 1 separate pseudovariates.
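One caveat: because anova() adds terms sequentially, each chi-square depends on the order in which the variables enter the model. To test a variable’s contribution over and above all the others, you can compare nested models directly; a quick sketch (the reduced-model name is my own):
#fit the model without rank, then compare the nested models with a likelihood-ratio (chi-square) test
mylogit_norank <- glm(admit ~ gre + gpa, data = df, family = "binomial")
anova(mylogit_norank, mylogit, test = "Chisq")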
5. Interpreting coefficients in terms of log odds is not always the most intuitive. Sometimes it’s easier to interpret them in terms of odds ratios. How would you interpret the coefficients after we convert them to odds ratios?
exp(coef(mylogit))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
#remember, to go from log odds to odds ratios we simply exponentiate the coefficients. I am also including confidence intervals by binding coef() and confint() together and wrapping them in exp().
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
Answer: Holding the other predictors constant, each additional point on the gre multiplies the odds of admission by 1.002, 95% CI [1.0001, 1.0044], and each additional point of gpa multiplies the odds by 2.23, 95% CI [1.17, 4.32] (i.e., more than doubles them). Relative to a rank 1 institution, the odds of admission are multiplied by 0.51 for rank 2, by 0.26 for rank 3, and by 0.21 for rank 4; in other words, students from the least prestigious institutions have roughly one fifth the odds of admission of students from the most prestigious ones.
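As an aside, if you have the broom package installed, tidy() can produce the same odds-ratio table in a single call; a sketch, assuming broom is available:
#odds ratios with profile confidence intervals in one tidy data frame
library(broom)
tidy(mylogit, exponentiate = TRUE, conf.int = TRUE)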
6. Okay, so we know Dan likes to interpret logistic regression coefficients in terms of probabilities. However, in order to do so, we need to look at these probabilities at particular values of our predictors. So, let’s start by predicting admission rates at each level of rank, while holding gre and gpa constant. How would you interpret these predicted probabilities?
#first we need to create a testing df that holds gre and gpa at their means and includes each level of rank.
newdata1 <- with(df, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
#here we are asking for predictions from our initial regression model object, evaluated at the values in our new df.
#we are specifying that we want probabilities by using type = "response".
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
## gre gpa rank rankP
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
Answer: The predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1), and 0.18 for students from the lowest ranked institutions (rank=4), holding gre and gpa at their means.
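You can sanity-check these predictions by hand: a predicted probability is just the inverse logit (plogis in R) of the linear predictor. A sketch for rank = 1, which is the reference level (so all rank dummies are 0):
#rebuild the rank = 1 prediction by hand: logit = b0 + b_gre*mean(gre) + b_gpa*mean(gpa)
b <- coef(mylogit)
plogis(b["(Intercept)"] + b["gre"] * mean(df$gre) + b["gpa"] * mean(df$gpa))
#this should reproduce the rankP value for rank 1 (~0.52)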
7. What about if we look at probabilities at different levels of gre and rank? Instead of tediously trying to interpret all the different combinations, it may be helpful to plot the predicted probabilities to get a visual sense. What does this plot indicate?
#the first thing we will do is create 100 values of gre between 200 and 800 (repeated for each of the 4 ranks) and save them to a new df.
newdata2 <- with(df, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
#next we will generate predictions from the original model object using the new values we created above.
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
se = TRUE))
#this time we predict on the link (logit) scale and request standard errors, so we can build 95% confidence intervals and then convert everything back to probabilities with plogis().
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
#here we will use ggplot to plot this data with 95% CI
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
size = 1)
Answer: It looks like GRE score matters more for students from institutions ranked 1 and 2 than for those ranked 3 and 4: their curves are steeper. (Since the model contains no interaction, this reflects the nonlinearity of the logistic curve; the same logit slope produces larger probability changes near .50.) And at every GRE score, students from higher-ranked institutions have a higher predicted probability of admission, so university prestige carries a lot of weight over and above GRE. Not super duper surprising.
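If you want to put numbers on that visual impression, you could compare the predicted probabilities at the extremes of the gre axis for each rank; a quick sketch (the df name is my own):
#predicted probability of admission at gre = 200 vs. gre = 800, by rank
extremes <- expand.grid(gre = c(200, 800), gpa = mean(df$gpa), rank = factor(1:4))
extremes$prob <- predict(mylogit, newdata = extremes, type = "response")
extremes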