library(aod)
## Warning: package 'aod' was built under R version 4.3.2
library(ggplot2)
library(rmarkdown)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
summary(mydata)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(mydata,sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
Let us produce a table of categorical outcome and predictors we want, in other words, we need to produce a table for our dependent(outcome) variable, and our choice of predictors. Our categorical outcome will be the admit, then we can choose, the predictors (rank, gre,gpa). Suppose we choose Rank,
xtabs(~admit+rank,data=mydata)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
We can see that it produces a table for the categorical outcome (admit) and predictor (rank)
Let us convert our rank to a factor so it will become a categorical variable
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
gre and gpa are statistically significant, the same also for rank 2, rank 3 and rank 4.
For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
Let’s use the confint function to obtain confidence
intervals for the coefficient estimates.
confint(mylogit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre 0.0001375921 0.004435874
## gpa 0.1602959439 1.464142727
## rank2 -1.3008888002 -0.056745722
## rank3 -2.0276713127 -0.670372346
## rank4 -2.4000265384 -0.753542605
confint.default(mylogit)
## 2.5 % 97.5 %
## (Intercept) -6.2242418514 -1.755716295
## gre 0.0001202298 0.004408622
## gpa 0.1536836760 1.454391423
## rank2 -1.2957512650 -0.055134591
## rank3 -2.0169920597 -0.663415773
## rank4 -2.3703986294 -0.732528724
We can test for an overall effect of rank using
the wald.test function of the aod library. The
order in which the coefficients are given in the table of coefficients
is the same as the order of the terms in the model. This is important
because the wald.test function refers to the coefficients
by their order in the model. We use
the wald.test function. b supplies the
coefficients, while Sigma supplies the variance covariance
matrix of the error terms, finally Terms tells R which
terms in the model are to be tested, in this case, terms 4, 5, and 6,
are the three terms for the levels of rank.
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
as we have observed, the p-value is 0.00011, in other words, the overall effect of the rank is statistically significant.
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 5.5, df = 1, P(> X2) = 0.019
Notice that the chi-square test statistic of 5.5 with 1 d.o.f. has a p-value of 0.019. This indicates that the difference between rank 2 and rank 3 are statistically significant.
exp(coef(mylogit))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
Now we can say that for a one unit increase in gpa, the
odds of being admitted to graduate school (versus not being admitted)
increase by a factor of 2.23.
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata1
## gre gpa rank
## 1 587.7 3.3899 1
## 2 587.7 3.3899 2
## 3 587.7 3.3899 3
## 4 587.7 3.3899 4
The newdata1$rankP, where P=1,2,3,4., tells R that we want to create a new variable in the data set.
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
## gre gpa rank rankP
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
In the above output we see that the predicted probability of being
accepted into a graduate program is 0.52 for students from the highest
prestige undergraduate institutions (rank=1), and 0.18 for
students from the lowest ranked institutions (rank=4),
holding gre and gpa at their means. We can do
something very similar to create a table of predicted probabilities
varying the value of gre and rank. We are
going to plot these, so we will create 100 values
of gre between 200 and 800, at each value of rank (i.e., 1,
2, 3, and 4).
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
se = TRUE))
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
head(newdata3)
## gre gpa rank fit se.fit residual.scale UL LL
## 1 200.0000 3.3899 1 -0.8114870 0.5147714 1 0.5492064 0.1393812
## 2 206.0606 3.3899 1 -0.7977632 0.5090986 1 0.5498513 0.1423880
## 3 212.1212 3.3899 1 -0.7840394 0.5034491 1 0.5505074 0.1454429
## 4 218.1818 3.3899 1 -0.7703156 0.4978239 1 0.5511750 0.1485460
## 5 224.2424 3.3899 1 -0.7565919 0.4922237 1 0.5518545 0.1516973
## 6 230.3030 3.3899 1 -0.7428681 0.4866494 1 0.5525464 0.1548966
## PredictedProb
## 1 0.3075737
## 2 0.3105042
## 3 0.3134499
## 4 0.3164108
## 5 0.3193867
## 6 0.3223773
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We make a plot with the predicted probabilities, and 95% confidence intervals.
We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The output produced by summary(mylogit) included indices of fit (shown below the coefficients), including the null and deviance residuals and the AIC. One measure of model fit is the significance of the overall model. This test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance for the model with predictors and the null model. The test statistic is distributed chi-squared with degrees of freedom equal to the differences in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model). To find the difference in deviance for the two models (i.e., the test statistic) we can use the command:
with(mylogit, null.deviance - deviance)
## [1] 41.45903
The d.o.f. for the difference of the two models is equal to the number of predictor variables in the mode
with(mylogit, df.null - df.residual)
## [1] 5
Obtaining the p-value, we have
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 7.578194e-08
The chi-square of 41.46 with 5 degrees of freedom and an associated p-value of less than 0.001 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a likelihood ratio test (the deviance residual is -2*log likelihood). To see the model's log likelihood, we type:
logLik(mylogit)
## 'log Lik.' -229.2587 (df=6)