Welcome to week 8 everyone! This week we will be learning how to estimate and interpret a logistic regression model. Logistic regression is one example of a generalized linear model. A fundamental feature of logistic regression models is that the dependent variable is binary.

Prompt: What factors influence whether or not someone is accepted into graduate school? In this HW, we will examine the extent to which a student’s GPA, GRE score, and the rank of the school they attended for undergrad (the prestige of the school) predict whether or not they are accepted. GPA and GRE are both continuous predictors, whereas rank is a 1–4 categorical predictor, with 1 being the most prestigious and 4 being the least.

Let’s begin as we usually do by loading the data along with a few packages.
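Here is a minimal setup sketch. The URL is an assumption on my part: the descriptives below match the UCLA graduate admissions dataset, so substitute your own data source if yours differs.

#performance gives us check_model(); ggplot2 is used for plotting later
library(performance)
library(ggplot2)

#assumed data source: the UCLA graduate admissions dataset
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")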

1. Let’s begin by looking at some descriptive statistics. What are the means and SDs for our three variables of interest?

#there are a few ways to go about this, but just for novelty let's try the summary and sapply functions
summary(df)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
sapply(df, sd)
##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602

Answer: GRE mean = 587.7, SD = 115.52; GPA mean = 3.39, SD = 0.38; rank mean = 2.49, SD = 0.94.
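As an aside, dplyr can return the means and SDs in a single pass (a sketch, assuming dplyr is installed; not part of the original solution):

#one-pass means and SDs for the three variables of interest
library(dplyr)
df %>% summarise(across(c(gre, gpa, rank), list(mean = mean, sd = sd)))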

2. Next, let’s estimate a regression model by regressing admit onto our three predictors. For now, we will ignore the fact that our DV is binary and estimate it as a linear model. We can use the check_model function from the performance package to see whether we are violating any assumptions when we fit it this way. In particular, pay attention to the normality of the residuals and to homogeneity of variance (i.e., homoscedasticity). Should we be concerned?

#remember that we will want to ensure our categorical predictor is being treated as a factor and not continuous. 
df$rank <- factor(df$rank)

#next, let's estimate the model as a linear model using lm()
myLinearModel <- lm(admit ~ gre + gpa + rank, data = df)
summary(myLinearModel)
## 
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7022 -0.3288 -0.1922  0.4952  0.9093 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.2589102  0.2159904  -1.199   0.2314    
## gre          0.0004296  0.0002107   2.038   0.0422 *  
## gpa          0.1555350  0.0639618   2.432   0.0155 *  
## rank2       -0.1623653  0.0677145  -2.398   0.0170 *  
## rank3       -0.2905705  0.0702453  -4.137 4.31e-05 ***
## rank4       -0.3230264  0.0793164  -4.073 5.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4449 on 394 degrees of freedom
## Multiple R-squared:  0.1004, Adjusted R-squared:  0.08898 
## F-statistic: 8.795 on 5 and 394 DF,  p-value: 6.333e-08
check_model(myLinearModel)
## Loading required namespace: qqplotr
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 400 rows containing missing values (geom_text).

Answer: Yes, we should be concerned. The residuals are clearly non-normal (with a binary outcome they split into two clusters rather than following a normal curve), the homogeneity-of-variance reference line is curved rather than flat, and the spread of the residuals changes across fitted values (heteroscedasticity). All of these violate the assumptions of the linear model.
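To see the residual problem directly, here is a quick base-R sketch (not part of the original assignment): because admit is 0/1, every residual is either 0 minus the fitted value or 1 minus the fitted value, so the residuals fall into two parallel bands rather than a normal cloud.

#residuals vs. fitted values: two bands, one per observed outcome (0 or 1)
plot(fitted(myLinearModel), resid(myLinearModel),
     xlab = "Fitted values", ylab = "Residuals")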

3. Now let’s run this model as a logistic regression. To do so, we will be using the glm (generalized linear model) function. What’s nice is that most of the syntax is the same as for a linear model; the only difference is that we now specify the family (in this case, binomial). By default, the coefficients are provided in logits (i.e., log odds). Interpret the regression coefficients (excluding the intercept) and include each one’s confidence interval.

mylogit <- glm(admit ~ gre + gpa + rank, data = df, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4
confint(mylogit) #compute confidence interval for each coefficient. 
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre          0.0001375921  0.004435874
## gpa          0.1602959439  1.464142727
## rank2       -1.3008888002 -0.056745722
## rank3       -2.0276713127 -0.670372346
## rank4       -2.4000265384 -0.753542605

Answer: For every one-point increase in GRE, the log odds of admission increase by 0.002 (95% CI [0.0001, 0.004]); this looks tiny, but GRE is measured on a scale spanning hundreds of points, so it is not negligible. For every one-point increase in GPA, the log odds of admission increase by 0.80 (95% CI [0.16, 1.46]). Relative to attending a rank-1 school, attending a rank-2, rank-3, or rank-4 school decreases the log odds of admission by 0.68, 1.34, and 1.55, respectively. None of these confidence intervals include zero.
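Note that confint() on a glm object computes profile-likelihood intervals (hence the "Waiting for profiling to be done..." message). If you want the classical Wald intervals (estimate ± 1.96 × SE) for comparison, confint.default() provides them:

#Wald (normal-approximation) CIs, for comparison with the profile CIs above
confint.default(mylogit)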

4. The p-values associated with each coefficient are by default calculated using a Wald test. What if we wanted to get a chi-square test for each variable instead? Does this change anything, and if so, what?

#sequential likelihood-ratio chi-square tests, one per term
anova(mylogit, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: admit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   399     499.98              
## gre   1  13.9204       398     486.06 0.0001907 ***
## gpa   1   5.7122       397     480.34 0.0168478 *  
## rank  3  21.8265       394     458.52 7.088e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: The substantive conclusions do not change: all three predictors remain significant. What the chi-square (likelihood-ratio) table adds is a single omnibus test for rank as a whole (deviance = 21.83 on 3 df, p < .001), whereas the Wald tests in the summary output only tested each rank dummy code separately against the rank-1 reference group.
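One caveat worth knowing: anova() adds the terms sequentially (first to last), so each test depends on the order of the predictors in the formula. An order-independent alternative is drop1(), which tests each term as if it were entered last:

#likelihood-ratio chi-square test for each term, given all other terms
drop1(mylogit, test = "Chisq")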

5. Interpreting coefficients in terms of log odds is not always the most intuitive approach. Sometimes it’s easier to interpret them in terms of odds ratios. How would you interpret the coefficients after converting them to odds ratios?

#remember, to go from log odds to odds we simply exponentiate the coefficients. I am also including confidence intervals by binding coef() and confint() together with cbind().
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre         1.0022670 1.000137602 1.0044457
## gpa         2.2345448 1.173858216 4.3238349
## rank2       0.5089310 0.272289674 0.9448343
## rank3       0.2617923 0.131641717 0.5115181
## rank4       0.2119375 0.090715546 0.4706961

Answer: For each one-point increase in GPA, the odds of admission are multiplied by 2.23 (i.e., they more than double), and for each one-point increase in GRE the odds are multiplied by 1.002 (small per point, but again, GRE spans hundreds of points). Relative to a rank-1 school, the odds of admission are roughly halved at a rank-2 school (OR = 0.51) and cut to about a quarter at rank-3 (OR = 0.26) and rank-4 (OR = 0.21) schools.
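If you prefer a single tidy table, the broom package can exponentiate the coefficients and their CIs in one call (a sketch, assuming broom is installed; not part of the original solution):

#odds ratios with profile CIs, returned as one data frame
library(broom)
tidy(mylogit, exponentiate = TRUE, conf.int = TRUE)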

6. Okay, so we know Dan likes to interpret logistic regression coefficients in terms of probabilities, as they are more intuitive. Because predicted probabilities are not constant and change as a function of the x values, we need to predict them at different values of x. So, let’s start by predicting admission probabilities at each level of rank, while holding gre and gpa constant. We are going to do this by generating a test df and using the predict function. How would you interpret these predicted probabilities?

#first we need to create a testing df where we set gpa and gre to their means and include each level of rank.
newdata1 <- with(df, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4))) 

#here we ask for predictions from our fitted model object, evaluated at the values in our new df.
#we are specifying that we want probabilities by using type = "response".
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
##     gre    gpa rank     rankP
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684

Answer: Holding gre and gpa at their means, the predicted probability of admission is about 0.52 for a student from a rank-1 school, 0.35 for rank 2, 0.22 for rank 3, and 0.18 for rank 4. In other words, the more prestigious the undergraduate institution, the higher the predicted probability of admission.
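As a sanity check, the rank-1 prediction can be rebuilt by hand from the coefficients (rank 1 is the reference level, so no rank dummy enters the linear predictor):

#plogis() converts the linear predictor (log odds) to a probability
b <- coef(mylogit)
plogis(b["(Intercept)"] + b["gre"] * mean(df$gre) + b["gpa"] * mean(df$gpa))
#returns ~0.52, matching the first row of newdata1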

7. What if we want to look at the probability of being admitted at different levels of gre and rank? To do so, we can apply similar logic. However, because gre is continuous, it would be tedious to interpret every possible value. Therefore, it may be helpful to plot the predicted estimates and interpret them visually. What does this plot indicate?

#the first thing we will do is create 100 values of gre between 200 and 800 (repeated for each level of rank) and save them to a new df.
newdata2 <- with(df, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
    4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))

#next we will generate predictions on the log-odds scale (type = "link"), along with standard errors, from the original model object using the new values we created above.
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
    se = TRUE))

#we build the 95% CIs on the link (log-odds) scale and then convert everything to probabilities with plogis, which keeps the interval endpoints between 0 and 1.
newdata3 <- within(newdata3, {
    PredictedProb <- plogis(fit)
    LL <- plogis(fit - (1.96 * se.fit))
    UL <- plogis(fit + (1.96 * se.fit))
})

#here we will use ggplot to plot this data with 95% CI
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
    ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
    size = 1)

Answer: At every level of rank, the predicted probability of admission increases with gre. At any given gre score, students from more prestigious schools (lower rank numbers) have a higher predicted probability of admission, and the curves fan out as gre increases, so the rank differences are larger in probability terms at higher gre scores.
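As a final aside, the ggeffects package can produce essentially the same figure with less bookkeeping (a sketch, assuming ggeffects is installed; not part of the original solution):

#predicted probabilities across gre by rank, with 95% CIs
library(ggeffects)
plot(ggpredict(mylogit, terms = c("gre [all]", "rank")))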