This is to practice logit model, using data name “binary.csv”, which is provided by UCLA. The data is about the admission of US’s universities. The dependent variable is binary variable, which is = 1 if the student is selected and = 0 if not. We are going to estimate the probility of a student being accepted depend on their gpa and gre and the university’s rank.

vandata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(vandata)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

This dataset has a binary variable (outcome, dependent) called “admit”. The independent variables is gre, gpa and rank. We will treat the variables “gre” and “gpa” as continuous. The variable “rank” takes the values from 1 to 4. The university with a rank of 1 have the highest prestige while those with a rank of 4 have the lowest.

summary(vandata)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
cor(vandata)
##            admit        gre         gpa        rank
## admit  1.0000000  0.1844343  0.17821225 -0.24251318
## gre    0.1844343  1.0000000  0.38426588 -0.12344707
## gpa    0.1782123  0.3842659  1.00000000 -0.05746077
## rank  -0.2425132 -0.1234471 -0.05746077  1.00000000
#Logit model
vanlogit <- glm(admit~gre+gpa+factor(rank), data = vandata, family = "binomial")
summary(vanlogit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + factor(rank), family = "binomial", 
##     data = vandata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.989979   1.139951  -3.500 0.000465 ***
## gre            0.002264   0.001094   2.070 0.038465 *  
## gpa            0.804038   0.331819   2.423 0.015388 *  
## factor(rank)2 -0.675443   0.316490  -2.134 0.032829 *  
## factor(rank)3 -1.340204   0.345306  -3.881 0.000104 ***
## factor(rank)4 -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

The results show that if “gre” increses by 1 point, the probility that student will be admitted incresed by 0.0023*100% = 0.23%. By contrast, if “Rank” of the university raise 1 level, decrese 56% of the probility that the student is admitted, holding others constant. It does mean that the probility of one student to be admitted in rank 2 university is 56% higher than that probility in rank 1 university, holding other factors constant.

## Graph - Just for fun - no idea
newdata1 <- with(vandata,
  data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1$rankP <- predict(vanlogit, newdata = newdata1, type = "response")
newdata2 <- with(vandata,
  data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
  gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
newdata3 <- cbind(newdata2, predict(vanlogit, newdata = newdata2, type="link", se=TRUE))
newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})
library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = .2) +
  geom_line(aes(colour = rank), size=1)