This is to practice logit model, using data name “binary.csv”, which is provided by UCLA. The data is about the admission of US’s universities. The dependent variable is binary variable, which is = 1 if the student is selected and = 0 if not. We are going to estimate the probility of a student being accepted depend on their gpa and gre and the university’s rank.
vandata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(vandata)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
This dataset has a binary variable (outcome, dependent) called “admit”. The independent variables is gre, gpa and rank. We will treat the variables “gre” and “gpa” as continuous. The variable “rank” takes the values from 1 to 4. The university with a rank of 1 have the highest prestige while those with a rank of 4 have the lowest.
summary(vandata)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
cor(vandata)
## admit gre gpa rank
## admit 1.0000000 0.1844343 0.17821225 -0.24251318
## gre 0.1844343 1.0000000 0.38426588 -0.12344707
## gpa 0.1782123 0.3842659 1.00000000 -0.05746077
## rank -0.2425132 -0.1234471 -0.05746077 1.00000000
#Logit model
vanlogit <- glm(admit~gre+gpa+factor(rank), data = vandata, family = "binomial")
summary(vanlogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + factor(rank), family = "binomial",
## data = vandata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## factor(rank)2 -0.675443 0.316490 -2.134 0.032829 *
## factor(rank)3 -1.340204 0.345306 -3.881 0.000104 ***
## factor(rank)4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
The results show that if “gre” increses by 1 point, the probility that student will be admitted incresed by 0.0023*100% = 0.23%. By contrast, if “Rank” of the university raise 1 level, decrese 56% of the probility that the student is admitted, holding others constant. It does mean that the probility of one student to be admitted in rank 2 university is 56% higher than that probility in rank 1 university, holding other factors constant.
## Graph - Just for fun - no idea
newdata1 <- with(vandata,
data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1$rankP <- predict(vanlogit, newdata = newdata1, type = "response")
newdata2 <- with(vandata,
data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100), 4),
gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
newdata3 <- cbind(newdata2, predict(vanlogit, newdata = newdata2, type="link", se=TRUE))
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
library(ggplot2)
ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = .2) +
geom_line(aes(colour = rank), size=1)