Greetings everyone! This week we will be learning how to estimate and interpret a logistic regression model. The logistic regression model is one example of a generalized linear model. A fundamental feature of logistic regression models is that they are focused on binary outcome.

Prompt What factors infuence whether or not someone is accepted into graduate school? In this HW, we will examine the extent to which a student’s gpa, gre and rank of school (the prestige of the school) predict whether or not they are accepted. Gpa and gre are both continuous predictors, whereas rank is 1-4 categorical predictor with 1 being the most prestigious and 4 being the least.

Let’s load the data along with a few packages that will we will need.

1. Let’s begin by looking at some descriptive statistics. What are the means and SD for our admit, gre and gpa?

#there are a few ways to go about this, but just for novelty let's try the summary and sapply functions
summary(df)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
sapply(df, sd)
##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602

Answer: Mean for admit= .31; mean for gre = 587.7; mean for gpa = 3.39

2. Let’s begin by regressing admit onto our three predictors, but for now we will ignore the fact that our dv is actually binary and instead treat it as a linear model. We can use the check_model function from the performance package to give us some insights into if we are meeting our model assumptions and if there is cause for concern. In particular, pay attention to our residuals, homogeneity of variance and homoscedasticity. Are we concerned??

#remember that we will want to ensure our categorical predictor is being treated as a factor and not continuous. 
df$rank <- factor(df$rank)

#next, let's estimate the model as a 
myLinearModel <- lm(admit ~ gre + gpa + rank, data = df)
summary(myLinearModel)
## 
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7022 -0.3288 -0.1922  0.4952  0.9093 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.2589102  0.2159904  -1.199   0.2314    
## gre          0.0004296  0.0002107   2.038   0.0422 *  
## gpa          0.1555350  0.0639618   2.432   0.0155 *  
## rank2       -0.1623653  0.0677145  -2.398   0.0170 *  
## rank3       -0.2905705  0.0702453  -4.137 4.31e-05 ***
## rank4       -0.3230264  0.0793164  -4.073 5.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4449 on 394 degrees of freedom
## Multiple R-squared:  0.1004, Adjusted R-squared:  0.08898 
## F-statistic: 8.795 on 5 and 394 DF,  p-value: 6.333e-08
check_model(myLinearModel)
## Loading required namespace: qqplotr
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 400 rows containing missing values (geom_text_repel).

Answer: Wow, yes we are concerned. Homogoneity of variance is not spread equally around the line, nor is homoscedasticity. Finally, residuals are not normally distributed. These are all red flags. Also how amazing is this package!

2. Now let’s run this model as a logistic regression. To do so, we will be using the glm (generalized linear model) function. What’s nice is that most of the syntax is the same as with a linear model, the only difference is now we are going to specify the family as equal to binomial". By default, the coefficients will be provided in logit (i.e. log odds). Interpret the regression coefficients (sans the intercept) and include each one’s confidence interval

mylogit <- glm(admit ~ gre + gpa + rank, data = df, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
confint(mylogit) #compute confidence interval for each coefficient. 
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605

Answer:

  1. For every one unit change in gre, the log odds of admission (versus non-admission) increases by .002
  2. For a one unit increase in gpa, the log odds of being admitted to graduate school increases by .804
  3. The indicator variables for rank have a slightly different interpretation. Remember that R automatically dummy codes factors and will set the lowest value as the reference group. So for instance, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -.675

3. The p values associated with each coefficient are by default calculated using a Wald test. What if we wanted to get a chi-square test for each variable instead? Does this change anything, and if so what?

anova(mylogit, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: admit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   399     499.98              
## gre   1  13.9204       398     486.06 0.0001907 ***
## gpa   1   5.7122       397     480.34 0.0168478 *  
## rank  3  21.8265       394     458.52 7.088e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: Many things have changed. For one, all of our p values are different. This is mainly because the chi-square is testing each variable’s unique contribution, remember (-2LLNull) - (-2LLModel) = χ2. Therefore, rank is being treated as 1 variable rather than k-1 pseudovariates.

4. Interpreting coeffients in terms of log odds is not always the most intuitive. Sometimes it’s easier to interpret them in terms of odds ratios. How would you interpret the coefficients after we make them odds ratios?

exp(coef(mylogit))
## (Intercept)         gre         gpa       rank2       rank3       rank4 
##   0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375
#remember, to go from log odds to odds we simply exponentiate the coefficients. I am also including confidence intervals by wrapping in confit.  
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre         1.0022670 1.000137602 1.0044457
## gpa         2.2345448 1.173858216 4.3238349
## rank2       0.5089310 0.272289674 0.9448343
## rank3       0.2617923 0.131641717 0.5115181
## rank4       0.2119375 0.090715546 0.4706961

Answer:

  1. for a one unit increase in gre, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 1.0023
  2. for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23
  3. The odds of being admitted to gratuate school increases by a factor of .5089 when you attended an undergraduate institution with a rank of 2 versus 1. Likewise, the odds increase by a factor of .26 when you attended a rank of 3 versus 1 and .21 when you attended a rank of 4 versus 1.

5. Okay, so we know Dan likes to interpret logistic regression coefficients in terms of probabilities. However, in order to do so, we need to look at these probabilites at different values. So, let’s start by predicting admit rates at different rank levels, while holding gre and gpa constant. How would you interpret these predicted probabilities?

#first we need to create a testing df where we sample the mean for gpa and gre and then each level of rank. 
newdata1 <- with(df, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4))) 

#What we are doing here we are saying we want the predictions to be based on the output or object of our initial regression model with values from our new df. 
#we are specifying that we want probabilities by using type = "response". 
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
##     gre    gpa rank     rankP
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684

Answer: The predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1), and 0.18 for students from the lowest ranked institutions (rank=4), holding gre and gpa at their means.

6. What about if we look at probabilities at different levels of gre and rank? Instead of tediously trying to intepret all different levels, it may be helpful to plot predicted probabilities to get a visual sense. What does this plot indicate?

#the first thing we will do is create 100 values of gre between 200 and 800 save to a new df. 
newdata2 <- with(df, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
    4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))

#next we will predict admit rates with the original model object from our new values that we generated above. 
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
    se = TRUE))

#The only difference or thing we are adding is that we are also going to get confidence intervals. 
newdata3 <- within(newdata3, {
    PredictedProb <- plogis(fit)
    LL <- plogis(fit - (1.96 * se.fit))
    UL <- plogis(fit + (1.96 * se.fit))
})

#here we will use ggplot to plot this data with 95% CI
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
    ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
    size = 1)

Answer: It looks like GRE score differentially influences the chance of admittance based on the rank of your undergrad institution. Specifically, the influence of GRE scores is more influential for students who went to undergrad universities that ranked 1 and 2 compared to those who ranked 3 and 4. So university prestige is more important than GRE. Not super duper surprising.