Use the data set of your own choice (choose carefully!) and conduct a set of logistic regression models. More specifically:
Begin with a simple model (e.g., with only one independent variable) and gradually build it up by adding more and more variables; Include some interaction terms in some of your models; Conduct likelihood ratio test to compare model fit; Also look at both AIC and BIC; Plot a few interesting figures using the visreg package (consult its document). Write about what you did, why you did it, and what your results show (better be an interesting story there).
Also, to avoid unnecessary problems and confusions, post a short description of the data you choose to use; if somebody has already taken that data, you will have to choose a different one. Post your questions on the BBS.
The due date is the midnight of October 14.
Happy crunching!
I used UCLA dataset on graduate school admissions, downloaded from https://stats.idre.ucla.edu/stat/data/binary.csv.
This data set has one binary variable “admit” which is either true or false. There are three predictor variables: gre, gpa and rank. GRE and GPA are continuous variables, while rank can have a vaulue from 1 to 4 (1 being highest prestige, 4 lowest). The variable rank should be treated as a categorical variable, not a continuous integer.
library(readr)
gradData <- read_csv("/Users/meredithpowers/Desktop/ucla-grad.csv")
## Parsed with column specification:
## cols(
## admit = col_integer(),
## gre = col_integer(),
## gpa = col_double(),
## rank = col_integer()
## )
names(gradData)
## [1] "admit" "gre" "gpa" "rank"
## make rank into a categorical variable
gradData$rank <- factor(gradData$rank)
head(gradData)
## # A tibble: 6 x 4
## admit gre gpa rank
## <int> <int> <dbl> <fct>
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3 2
I want to look at some descriptive statistics.
summary(gradData)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 1: 61
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 2:151
## Median :0.0000 Median :580.0 Median :3.395 3:121
## Mean :0.3175 Mean :587.7 Mean :3.390 4: 67
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :1.0000 Max. :800.0 Max. :4.000
xtabs(~rank + admit, data = gradData)
## admit
## rank 0 1
## 1 28 33
## 2 97 54
## 3 93 28
## 4 55 12
First I want to do some simple analysis and look at how GRE scores affects admission status.
m1 <- glm(gre ~ factor(admit), data = gradData)
summary(m1)
##
## Call:
## glm(formula = gre ~ factor(admit), data = gradData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -353.19 -73.19 1.10 81.10 226.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 573.19 6.88 83.311 < 2e-16 ***
## factor(admit)1 45.71 12.21 3.744 0.000208 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 12922.55)
##
## Null deviance: 5324284 on 399 degrees of freedom
## Residual deviance: 5143173 on 398 degrees of freedom
## AIC: 4925.8
##
## Number of Fisher Scoring iterations: 2
GRE scores are statistically significant: as GRE score increase above 573.19, the chances of admission also increase.
I also want to look at GRE scores and rank:
m2 <- glm(gre ~ factor(rank), data = gradData)
summary(m2)
##
## Call:
## glm(formula = gre ~ factor(rank), data = gradData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -354.88 -74.88 3.97 83.97 229.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 611.80 14.73 41.548 <2e-16 ***
## factor(rank)2 -15.78 17.45 -0.904 0.3664
## factor(rank)3 -36.93 18.06 -2.045 0.0415 *
## factor(rank)4 -41.65 20.35 -2.047 0.0414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 13226.87)
##
## Null deviance: 5324284 on 399 degrees of freedom
## Residual deviance: 5237839 on 396 degrees of freedom
## AIC: 4937.1
##
## Number of Fisher Scoring iterations: 2
This is also significant – the interpretation is a little hard to follow even though it makes sense. Because rank1 is the most prestigious and rank4 the least, it makes sense that the relationship is negative: as GRE scores increase above 611.80, the odds of admission into a lower-prestige school drops, because the odds of being admitted to a higher prestige school is higher. The results are only stastically significant for the highest(rank1) and lowest (rank3 and rank4) – maybe there are other variables at play for a rank2 grad school.
I want to quickly look at the affect of GPA and GRE on admissions before moving on the the interaction.
m3 <- glm(admit ~ gre + gpa, family = binomial, data = gradData)
summary(m3)
##
## Call:
## glm(formula = admit ~ gre + gpa, family = binomial, data = gradData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2730 -0.8988 -0.7206 1.3013 2.0620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
## gre 0.002691 0.001057 2.544 0.0109 *
## gpa 0.754687 0.319586 2.361 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 480.34 on 397 degrees of freedom
## AIC: 486.34
##
## Number of Fisher Scoring iterations: 4
As expected, both GRE and GPA have a positive effect on admissions – as they go up, so do the odds of admission.
m4 <- glm(admit ~ gre*gpa, family = binomial, data = gradData)
summary(m4)
##
## Call:
## glm(formula = admit ~ gre * gpa, family = binomial, data = gradData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1324 -0.9387 -0.7099 1.2969 2.3519
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.818700 5.874816 -2.352 0.0187 *
## gre 0.017464 0.009606 1.818 0.0691 .
## gpa 3.367918 1.722685 1.955 0.0506 .
## gre:gpa -0.004327 0.002787 -1.552 0.1205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 477.85 on 396 degrees of freedom
## AIC: 485.85
##
## Number of Fisher Scoring iterations: 4
The results look like there is not much of a significant interaction effect between GPA and GRE on admission status, even though individually there is more of an affect.
Now I want to know which of these models are the most appropriate, best fitting models:
#Place regression models into a table to assess fit
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(m1, m2, m3, m4),doctype = FALSE)
##
## ======================================================================
## Model 1 Model 2 Model 3 Model 4
## ----------------------------------------------------------------------
## (Intercept) 573.19 *** 611.80 *** -4.95 *** -13.82 *
## (6.88) (14.73) (1.08) (5.87)
## factor(admit)1 45.71 ***
## (12.21)
## factor(rank)2 -15.78
## (17.45)
## factor(rank)3 -36.93 *
## (18.06)
## factor(rank)4 -41.65 *
## (20.35)
## gre 0.00 * 0.02
## (0.00) (0.01)
## gpa 0.75 * 3.37
## (0.32) (1.72)
## gre:gpa -0.00
## (0.00)
## ----------------------------------------------------------------------
## AIC 4925.84 4937.13 486.34 485.85
## BIC 4937.81 4957.09 498.32 501.81
## Log Likelihood -2459.92 -2463.57 -240.17 -238.92
## Deviance 5143173.14 5237839.18 480.34 477.85
## Num. obs. 400 400 400 400
## ======================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Model 1 and 2 were about comparing GRE scores – what rank of school and what admissions status were more likely as GRE scores increased. Model 1 has the lower AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values, which indicate better fit. Admission status is more closely related to GRE scores than rank.
Model 3 and 4 are about the effects GPA and GRE scores have on admissions. (Kind of the reverse variable as Model 1 and 2.). Model 4 has the lowest AIC (Akaike Information Criterion) and Model 3 has the lowest BIC (Bayesian Information Criterion). It’s kind of a wash, though the difference between the BIC values are greater, so perhaps Model 3 is a better fit that 4, suggesting that the interaction of GRE scores and GPA values aren’t more significant that the values alone.
Let’s look at GRE scores (mean scores) for both categories 0f admit (0=no, 1=yes)
library(visreg)
visreg(m1, "gre", by = "admit")
And also for school rank:
library(visreg)
visreg(m2, "gre", by = "rank")