Assignment 4

Use the data set of your own choice (choose carefully!) and conduct a set of logistic regression models. More specifically:

Begin with a simple model (e.g., with only one independent variable) and gradually build it up by adding more and more variables; Include some interaction terms in some of your models; Conduct likelihood ratio test to compare model fit; Also look at both AIC and BIC; Plot a few interesting figures using the visreg package (consult its document). Write about what you did, why you did it, and what your results show (better be an interesting story there).

Also, to avoid unnecessary problems and confusions, post a short description of the data you choose to use; if somebody has already taken that data, you will have to choose a different one. Post your questions on the BBS.

The due date is the midnight of October 14.

Happy crunching!

Data Selection

I used UCLA dataset on graduate school admissions, downloaded from https://stats.idre.ucla.edu/stat/data/binary.csv.

This data set has one binary variable “admit” which is either true or false. There are three predictor variables: gre, gpa and rank. GRE and GPA are continuous variables, while rank can have a vaulue from 1 to 4 (1 being highest prestige, 4 lowest). The variable rank should be treated as a categorical variable, not a continuous integer.

check out the data

library(readr)
gradData <- read_csv("/Users/meredithpowers/Desktop/ucla-grad.csv")
## Parsed with column specification:
## cols(
##   admit = col_integer(),
##   gre = col_integer(),
##   gpa = col_double(),
##   rank = col_integer()
## )
names(gradData)
## [1] "admit" "gre"   "gpa"   "rank"
## make rank into a categorical variable
gradData$rank <- factor(gradData$rank)

head(gradData)
## # A tibble: 6 x 4
##   admit   gre   gpa rank 
##   <int> <int> <dbl> <fct>
## 1     0   380  3.61 3    
## 2     1   660  3.67 3    
## 3     1   800  4    1    
## 4     1   640  3.19 4    
## 5     0   520  2.93 4    
## 6     1   760  3    2

I want to look at some descriptive statistics.

summary(gradData)
##      admit             gre             gpa        rank   
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   1: 61  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   2:151  
##  Median :0.0000   Median :580.0   Median :3.395   3:121  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   4: 67  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670          
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000
xtabs(~rank + admit, data = gradData)
##     admit
## rank  0  1
##    1 28 33
##    2 97 54
##    3 93 28
##    4 55 12

First I want to do some simple analysis and look at how GRE scores affects admission status.

m1 <- glm(gre ~ factor(admit), data = gradData)
summary(m1)
## 
## Call:
## glm(formula = gre ~ factor(admit), data = gradData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -353.19   -73.19     1.10    81.10   226.81  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      573.19       6.88  83.311  < 2e-16 ***
## factor(admit)1    45.71      12.21   3.744 0.000208 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 12922.55)
## 
##     Null deviance: 5324284  on 399  degrees of freedom
## Residual deviance: 5143173  on 398  degrees of freedom
## AIC: 4925.8
## 
## Number of Fisher Scoring iterations: 2

GRE scores are statistically significant: as GRE score increase above 573.19, the chances of admission also increase.

I also want to look at GRE scores and rank:

m2 <- glm(gre ~ factor(rank), data = gradData)
summary(m2)
## 
## Call:
## glm(formula = gre ~ factor(rank), data = gradData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -354.88   -74.88     3.97    83.97   229.85  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     611.80      14.73  41.548   <2e-16 ***
## factor(rank)2   -15.78      17.45  -0.904   0.3664    
## factor(rank)3   -36.93      18.06  -2.045   0.0415 *  
## factor(rank)4   -41.65      20.35  -2.047   0.0414 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13226.87)
## 
##     Null deviance: 5324284  on 399  degrees of freedom
## Residual deviance: 5237839  on 396  degrees of freedom
## AIC: 4937.1
## 
## Number of Fisher Scoring iterations: 2

This is also significant – the interpretation is a little hard to follow even though it makes sense. Because rank1 is the most prestigious and rank4 the least, it makes sense that the relationship is negative: as GRE scores increase above 611.80, the odds of admission into a lower-prestige school drops, because the odds of being admitted to a higher prestige school is higher. The results are only stastically significant for the highest(rank1) and lowest (rank3 and rank4) – maybe there are other variables at play for a rank2 grad school.

I want to quickly look at the affect of GPA and GRE on admissions before moving on the the interaction.

m3 <- glm(admit ~ gre + gpa, family = binomial, data = gradData)
summary(m3)
## 
## Call:
## glm(formula = admit ~ gre + gpa, family = binomial, data = gradData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2730  -0.8988  -0.7206   1.3013   2.0620  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.949378   1.075093  -4.604 4.15e-06 ***
## gre          0.002691   0.001057   2.544   0.0109 *  
## gpa          0.754687   0.319586   2.361   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 480.34  on 397  degrees of freedom
## AIC: 486.34
## 
## Number of Fisher Scoring iterations: 4

As expected, both GRE and GPA have a positive effect on admissions – as they go up, so do the odds of admission.

m4 <- glm(admit ~ gre*gpa, family = binomial, data = gradData)
summary(m4)
## 
## Call:
## glm(formula = admit ~ gre * gpa, family = binomial, data = gradData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1324  -0.9387  -0.7099   1.2969   2.3519  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -13.818700   5.874816  -2.352   0.0187 *
## gre           0.017464   0.009606   1.818   0.0691 .
## gpa           3.367918   1.722685   1.955   0.0506 .
## gre:gpa      -0.004327   0.002787  -1.552   0.1205  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 477.85  on 396  degrees of freedom
## AIC: 485.85
## 
## Number of Fisher Scoring iterations: 4

The results look like there is not much of a significant interaction effect between GPA and GRE on admission status, even though individually there is more of an affect.

Now I want to know which of these models are the most appropriate, best fitting models:

#Place regression models into a table to assess fit
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(m1, m2, m3, m4),doctype = FALSE)
## 
## ======================================================================
##                 Model 1         Model 2         Model 3      Model 4  
## ----------------------------------------------------------------------
## (Intercept)         573.19 ***      611.80 ***    -4.95 ***   -13.82 *
##                      (6.88)         (14.73)       (1.08)       (5.87) 
## factor(admit)1       45.71 ***                                        
##                     (12.21)                                           
## factor(rank)2                       -15.78                            
##                                     (17.45)                           
## factor(rank)3                       -36.93 *                          
##                                     (18.06)                           
## factor(rank)4                       -41.65 *                          
##                                     (20.35)                           
## gre                                                0.00 *       0.02  
##                                                   (0.00)       (0.01) 
## gpa                                                0.75 *       3.37  
##                                                   (0.32)       (1.72) 
## gre:gpa                                                        -0.00  
##                                                                (0.00) 
## ----------------------------------------------------------------------
## AIC                4925.84         4937.13       486.34       485.85  
## BIC                4937.81         4957.09       498.32       501.81  
## Log Likelihood    -2459.92        -2463.57      -240.17      -238.92  
## Deviance        5143173.14      5237839.18       480.34       477.85  
## Num. obs.           400             400          400          400     
## ======================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Model 1 and 2 were about comparing GRE scores – what rank of school and what admissions status were more likely as GRE scores increased. Model 1 has the lower AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values, which indicate better fit. Admission status is more closely related to GRE scores than rank.

Model 3 and 4 are about the effects GPA and GRE scores have on admissions. (Kind of the reverse variable as Model 1 and 2.). Model 4 has the lowest AIC (Akaike Information Criterion) and Model 3 has the lowest BIC (Bayesian Information Criterion). It’s kind of a wash, though the difference between the BIC values are greater, so perhaps Model 3 is a better fit that 4, suggesting that the interaction of GRE scores and GPA values aren’t more significant that the values alone.

Plots

Let’s look at GRE scores (mean scores) for both categories 0f admit (0=no, 1=yes)

library(visreg)
visreg(m1, "gre", by = "admit")

And also for school rank:

library(visreg)
visreg(m2, "gre", by = "rank")