library(aod)
library(ggplot2)
Example: A researcher is interested in how variables such as GRE (Graduate Record Exam score), GPA (grade point average), and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is binary.
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
## view the first few rows of the data
head(mydata)
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
summary(mydata)
admit gre gpa rank
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
Interpretation:
Because admit is a 0/1 variable, its mean is the proportion of admitted applicants. A mean of \(0.3175\) tells us that about 32% of the applicants were admitted; most students in the sample were not admitted.
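We can confirm this proportion directly (a quick sanity check, not part of the original output):
## the mean of a 0/1 variable is the share of 1s
mean(mydata$admit)
prop.table(table(mydata$admit))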
sapply(mydata, sd)
admit gre gpa rank
0.4660867 115.5165364 0.3805668 0.9444602
Here, the standard deviation of admit (\(0.466\)) is what we expect for a binary variable with mean \(0.3175\): for a 0/1 variable the standard deviation is roughly \(\sqrt{p(1-p)}\) and can never exceed 0.5.
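As a small sketch of this relationship (not part of the original output):
## for a 0/1 variable the sd is approximately sqrt(p * (1 - p));
## with p = 0.3175 this gives about 0.465, close to the 0.466 above
p <- mean(mydata$admit)
sqrt(p * (1 - p))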
## two-way contingency table of categorical outcome and predictors we want
## to make sure there are not 0 cells
xtabs(~admit + rank, data = mydata)
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
Notice that there are no empty cells: every combination of admit and rank is observed, so the model can estimate an effect for each level of rank.
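If we want to check for empty cells programmatically rather than by eye, one small convenience check (not from the original walkthrough) is:
## TRUE would flag at least one empty admit-by-rank cell
any(xtabs(~admit + rank, data = mydata) == 0)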
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
View the model summary
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
Interpretation/Observation:
All of the predictors are statistically significant at the 0.05 level. The coefficients for gre and gpa are positive, so higher scores increase the log-odds of admission; the coefficients for rank2, rank3, and rank4 are negative and increasingly so, meaning that, relative to rank 1 institutions, attending a less prestigious institution lowers the log-odds of admission.
We can use the confint function to obtain confidence intervals for the coefficient estimates. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.
## CIs using profiled log-likelihood
confint(mylogit)
2.5 % 97.5 %
(Intercept) -6.2716202334 -1.792547080
gre 0.0001375921 0.004435874
gpa 0.1602959439 1.464142727
rank2 -1.3008888002 -0.056745722
rank3 -2.0276713127 -0.670372346
rank4 -2.4000265384 -0.753542605
## CIs using standard errors
confint.default(mylogit)
2.5 % 97.5 %
(Intercept) -6.2242418514 -1.755716295
gre 0.0001202298 0.004408622
gpa 0.1536836760 1.454391423
rank2 -1.2957512650 -0.055134591
rank3 -2.0169920597 -0.663415773
rank4 -2.3703986294 -0.732528724
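These standard-error-based intervals are simply estimate ± z × SE. As a sketch of what confint.default computes under the hood:
## Wald CIs by hand: coefficient +/- qnorm(0.975) * standard error
est <- coef(mylogit)
se <- sqrt(diag(vcov(mylogit)))
cbind(`2.5 %` = est - qnorm(0.975) * se, `97.5 %` = est + qnorm(0.975) * se)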
We can test for an overall effect of rank using the wald.test function from the aod package. wald.test refers to the coefficients by their order in the model: b supplies the coefficients, Sigma supplies the variance-covariance matrix of the coefficient estimates, and Terms tells R which coefficients to test. In this case, terms \(4\), \(5\), and \(6\) are the three coefficients for the levels of rank.
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
Wald test:
----------
Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011
Interpretation:
The Wald test checks whether a set of coefficients is jointly significant. Terms \(4{:}6\) in the summary table above are the three rank dummy variables, so we are testing whether rank as a whole matters for admission. Using \(\alpha = 0.05\), the chi-squared statistic of \(20.9\) on three degrees of freedom has a p-value of \(0.00011\), indicating that the overall effect of rank is statistically significant.
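Under the hood, the Wald statistic for a set of coefficients is the quadratic form \(b^{\top} V^{-1} b\) restricted to those terms. A sketch reproducing the 20.9 above:
## Wald chi-square for the rank terms computed by hand
b <- coef(mylogit)[4:6]
V <- vcov(mylogit)[4:6, 4:6]
as.numeric(t(b) %*% solve(V) %*% b) ## about 20.9 on df = 3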
We can also test additional hypotheses about the differences in the
coefficients for the different levels of rank. We test that the
coefficient for rank 2 is equal to the coefficient for rank 3. The first
line of code below creates a vector l that defines the test we want to
perform.
In this case, we want to test the difference (subtraction) of the terms
for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To
contrast these two terms, we multiply one of them by \(1\) and the other by \(-1\). The other terms in the model are not involved in the test, so they are multiplied by \(0\). We use \(L = l\) to tell R that we wish to base the test on the vector \(l\).
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
Wald test:
----------
Chi-squared test:
X2 = 5.5, df = 1, P(> X2) = 0.019
Interpretation: Observe that the chi-squared test statistic of \(5.5\) with \(1\) degree of freedom is associated with a p-value of \(0.019\), indicating that the difference between the coefficient for rank 2 and the coefficient for rank 3 is statistically significant.
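Equivalently, this one-degree-of-freedom statistic is \((l b)^2 / (l V l^{\top})\); a hand computation reproducing the 5.5 above:
## contrast test by hand using the vector l defined earlier
b <- coef(mylogit)
V <- vcov(mylogit)
as.numeric((l %*% b)^2 / (l %*% V %*% t(l))) ## about 5.5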
## odds ratios only
exp(coef(mylogit))
(Intercept) gre gpa rank2 rank3 rank4
0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
OR 2.5 % 97.5 %
(Intercept) 0.0185001 0.001889165 0.1665354
gre 1.0022670 1.000137602 1.0044457
gpa 2.2345448 1.173858216 4.3238349
rank2 0.5089310 0.272289674 0.9448343
rank3 0.2617923 0.131641717 0.5115181
rank4 0.2119375 0.090715546 0.4706961
Interpretation:
From the table, for every one-unit increase in gre the odds of admission are multiplied by \(1.0022670\), and for a one-unit increase in gpa the odds of being admitted to graduate school (versus not being admitted) increase by a factor of \(2.2345448\). The rank odds ratios are all relative to rank 1 and decrease with rank: students from rank 2 institutions have higher odds of admission than those from rank 3 or rank 4.
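Because the gre coefficient applies per single test point, its odds ratio looks small; it is often easier to read over a larger step. For example, the odds ratio for a hypothetical 100-point increase in GRE (a rescaling for illustration, not in the original output):
## odds ratio for a 100-point GRE increase: exp(100 * beta_gre)
exp(100 * coef(mylogit)["gre"]) ## about 1.25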
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
## view data frame
newdata1
gre gpa rank
1 587.7 3.3899 1
2 587.7 3.3899 2
3 587.7 3.3899 3
4 587.7 3.3899 4
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
gre gpa rank rankP
1 587.7 3.3899 1 0.5166016
2 587.7 3.3899 2 0.3522846
3 587.7 3.3899 3 0.2186120
4 587.7 3.3899 4 0.1846684
Interpretation:
In the output above, the predicted probability of being accepted into a graduate program is \(0.5166016\) for students from the highest-prestige undergraduate institutions (rank 1) and \(0.1846684\) for students from the lowest-ranked institutions (rank 4), holding gre and gpa at their means.
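These predictions are just the inverse logit of the linear predictor. As a check, the rank 1 probability can be reproduced by hand (a small sketch using the fitted coefficients):
## linear predictor for rank 1 at mean gre and gpa, then inverse logit
eta <- sum(coef(mylogit) * c(1, mean(mydata$gre), mean(mydata$gpa), 0, 0, 0))
plogis(eta) ## about 0.5166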
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
se = TRUE))
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
## view first few rows of final dataset
head(newdata3)
gre gpa rank fit se.fit residual.scale UL LL
1 200.0000 3.3899 1 -0.8114870 0.5147714 1 0.5492064 0.1393812
2 206.0606 3.3899 1 -0.7977632 0.5090986 1 0.5498513 0.1423880
3 212.1212 3.3899 1 -0.7840394 0.5034491 1 0.5505074 0.1454429
4 218.1818 3.3899 1 -0.7703156 0.4978239 1 0.5511750 0.1485460
5 224.2424 3.3899 1 -0.7565919 0.4922237 1 0.5518545 0.1516973
6 230.3030 3.3899 1 -0.7428681 0.4866494 1 0.5525464 0.1548966
PredictedProb
1 0.3075737
2 0.3105042
3 0.3134499
4 0.3164108
5 0.3193867
6 0.3223773
Here we see the predicted probabilities for the first 6 of the 400 rows (100 gre values for each of the four ranks; these six are from rank 1), holding gpa at its mean and varying gre and rank.
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
size = 1)
Interpretation:
The plot shows how the predicted probability of admission rises with GRE score within every rank. At any given gre, students from higher-ranked (more prestigious) institutions have a higher predicted probability of admission, and the separation between the rank curves is larger than the gain from increasing gre alone, suggesting that institution prestige matters more than GRE score over this range. The shaded ribbons show the 95% confidence intervals, giving a visual sense of the uncertainty in the predictions.
Now, to test the significance of our model, we want to test whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model).
with(mylogit, null.deviance - deviance)
[1] 41.45903
The degrees of freedom for the difference between the two models is equal to the number of predictor variables in the model, and can be obtained using:
with(mylogit, df.null - df.residual)
[1] 5
Finally, the p-value can be obtained using:
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 7.578194e-08
The chi-square of 41.46 with 5 degrees of freedom and an associated p-value of less than 0.001 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a likelihood ratio test (the residual deviance is \(-2\) times the log likelihood). To see the model's log likelihood, we type:
logLik(mylogit)
'log Lik.' -229.2587 (df=6)
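The same likelihood ratio test can be run in one step by fitting the intercept-only model explicitly and comparing the two fits with anova (a convenience alternative, not part of the original walkthrough):
## likelihood ratio test via anova: null model vs. fitted model
nullmod <- glm(admit ~ 1, data = mydata, family = "binomial")
anova(nullmod, mylogit, test = "Chisq") ## same chi-square of 41.46 on 5 df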
In summary, we fit a logistic regression model predicting admission to graduate school from GRE score, GPA, and undergraduate institution prestige; computed odds ratios, predicted probabilities, and confidence intervals; and presented the results graphically.