library(aod)
library(ggplot2)
Example: A researcher is interested in how variables such as GRE (Graduate Record Exam score), GPA (grade point average), and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is binary.
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
## view the first few rows of the data
head(mydata)
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
summary(mydata)
admit gre gpa rank
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
Interpretation:
Because admit is a 0/1 variable, its mean is the proportion of admitted applicants. A mean of \(0.3175\) tells us that about 32% of the applicants were admitted; most students in the sample were not admitted.
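We can confirm this proportion directly (a quick sanity check, not part of the original output):
## the mean of a 0/1 variable is the share of 1s
mean(mydata$admit)
prop.table(table(mydata$admit))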
sapply(mydata, sd)
admit gre gpa rank
0.4660867 115.5165364 0.3805668 0.9444602
Here, the standard deviation of admit (\(0.466\)) is what we expect for a binary variable with mean \(0.3175\): for a 0/1 variable the standard deviation is roughly \(\sqrt{p(1-p)}\) and can never exceed 0.5.
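As a small sketch of this relationship (not part of the original output):
## for a 0/1 variable the sd is approximately sqrt(p * (1 - p));
## with p = 0.3175 this gives about 0.465, close to the 0.466 above
p <- mean(mydata$admit)
sqrt(p * (1 - p))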
## two-way contingency table of categorical outcome and predictors we want
## to make sure there are not 0 cells
xtabs(~admit + rank, data = mydata)
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
Notice that there are no empty cells: every combination of admit and rank is observed, so the model can estimate an effect for each level of rank.
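If we want to check for empty cells programmatically rather than by eye, one small convenience check (not from the original walkthrough) is:
## TRUE would flag at least one empty admit-by-rank cell
any(xtabs(~admit + rank, data = mydata) == 0)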
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
View the model summary
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
Interpretation/Observation:
All of the predictors are statistically significant at the 0.05 level. The coefficients for gre and gpa are positive, so higher scores increase the log-odds of admission; the coefficients for rank2, rank3, and rank4 are negative and increasingly so, meaning that, relative to rank 1 institutions, attending a less prestigious institution lowers the log-odds of admission.
We can use the confint function to obtain confidence intervals for the coefficient estimates. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.
## CIs using profiled log-likelihood
confint(mylogit)
2.5 % 97.5 %
(Intercept) -6.2716202334 -1.792547080
gre 0.0001375921 0.004435874
gpa 0.1602959439 1.464142727
rank2 -1.3008888002 -0.056745722
rank3 -2.0276713127 -0.670372346
rank4 -2.4000265384 -0.753542605
## CIs using standard errors
confint.default(mylogit)
2.5 % 97.5 %
(Intercept) -6.2242418514 -1.755716295
gre 0.0001202298 0.004408622
gpa 0.1536836760 1.454391423
rank2 -1.2957512650 -0.055134591
rank3 -2.0169920597 -0.663415773
rank4 -2.3703986294 -0.732528724
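These standard-error-based intervals are simply estimate ± z × SE. As a sketch of what confint.default computes under the hood:
## Wald CIs by hand: coefficient +/- qnorm(0.975) * standard error
est <- coef(mylogit)
se <- sqrt(diag(vcov(mylogit)))
cbind(`2.5 %` = est - qnorm(0.975) * se, `97.5 %` = est + qnorm(0.975) * se)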
We can test for an overall effect of rank using the wald.test function from the aod package. wald.test refers to the coefficients by their order in the model: b supplies the coefficients, Sigma supplies the variance-covariance matrix of the coefficient estimates, and Terms tells R which coefficients to test. In this case, terms \(4\), \(5\), and \(6\) are the three coefficients for the levels of rank.
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
Wald test:
----------
Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011
Interpretation:
The Wald test checks whether a set of coefficients is jointly significant. Terms \(4{:}6\) in the summary table above are the three rank dummy variables, so we are testing whether rank as a whole matters for admission. Using \(\alpha = 0.05\), the chi-squared statistic of \(20.9\) on three degrees of freedom has a p-value of \(0.00011\), indicating that the overall effect of rank is statistically significant.
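Under the hood, the Wald statistic for a set of coefficients is the quadratic form \(b^{\top} V^{-1} b\) restricted to those terms. A sketch reproducing the 20.9 above:
## Wald chi-square for the rank terms computed by hand
b <- coef(mylogit)[4:6]
V <- vcov(mylogit)[4:6, 4:6]
as.numeric(t(b) %*% solve(V) %*% b) ## about 20.9 on df = 3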
We can also test additional hypotheses about the differences in the
coefficients for the different levels of rank. We test that the
coefficient for rank 2 is equal to the coefficient for rank 3. The first
line of code below creates a vector l that defines the test we want to
perform.
In this case, we want to test the difference (subtraction) of the terms
for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To
contrast these two terms, we multiply one of them by \(1\) and the other by \(-1\). The other terms in the model are not involved in the test, so they are multiplied by \(0\). We use \(L = l\) to tell R that we wish to base the test on the vector \(l\).
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
Wald test:
----------
Chi-squared test:
X2 = 5.5, df = 1, P(> X2) = 0.019
Interpretation: Observe that the chi-squared test statistic of \(5.5\) with \(1\) degree of freedom is associated with a p-value of \(0.019\), indicating that the difference between the coefficient for rank 2 and the coefficient for rank 3 is statistically significant.
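Equivalently, this one-degree-of-freedom statistic is \((l b)^2 / (l V l^{\top})\); a hand computation reproducing the 5.5 above:
## contrast test by hand using the vector l defined earlier
b <- coef(mylogit)
V <- vcov(mylogit)
as.numeric((l %*% b)^2 / (l %*% V %*% t(l))) ## about 5.5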
## odds ratios only
exp(coef(mylogit))
(Intercept) gre gpa rank2 rank3 rank4
0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
OR 2.5 % 97.5 %
(Intercept) 0.0185001 0.001889165 0.1665354
gre 1.0022670 1.000137602 1.0044457
gpa 2.2345448 1.173858216 4.3238349
rank2 0.5089310 0.272289674 0.9448343
rank3 0.2617923 0.131641717 0.5115181
rank4 0.2119375 0.090715546 0.4706961
Interpretation:
From the table, for every one-unit increase in gre the odds of admission are multiplied by \(1.0022670\), and for a one-unit increase in gpa the odds of being admitted to graduate school (versus not being admitted) increase by a factor of \(2.2345448\). The rank odds ratios are all relative to rank 1 and decrease with rank: students from rank 2 institutions have higher odds of admission than those from rank 3 or rank 4.
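Because the gre coefficient applies per single test point, its odds ratio looks small; it is often easier to read over a larger step. For example, the odds ratio for a hypothetical 100-point increase in GRE (a rescaling for illustration, not in the original output):
## odds ratio for a 100-point GRE increase: exp(100 * beta_gre)
exp(100 * coef(mylogit)["gre"]) ## about 1.25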
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
## view data frame
newdata1
gre gpa rank
1 587.7 3.3899 1
2 587.7 3.3899 2
3 587.7 3.3899 3
4 587.7 3.3899 4
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
gre gpa rank rankP
1 587.7 3.3899 1 0.5166016
2 587.7 3.3899 2 0.3522846
3 587.7 3.3899 3 0.2186120
4 587.7 3.3899 4 0.1846684
Interpretation:
In the output above, the predicted probability of being accepted into a graduate program is \(0.5166016\) for students from the highest-prestige undergraduate institutions (rank 1) and \(0.1846684\) for students from the lowest-ranked institutions (rank 4), holding gre and gpa at their means.
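These predictions are just the inverse logit of the linear predictor. As a check, the rank 1 probability can be reproduced by hand (a small sketch using the fitted coefficients):
## linear predictor for rank 1 at mean gre and gpa, then inverse logit
eta <- sum(coef(mylogit) * c(1, mean(mydata$gre), mean(mydata$gpa), 0, 0, 0))
plogis(eta) ## about 0.5166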
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
se = TRUE))
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
## view first few rows of final dataset
head(newdata3)
gre gpa rank fit se.fit residual.scale UL LL
1 200.0000 3.3899 1 -0.8114870 0.5147714 1 0.5492064 0.1393812
2 206.0606 3.3899 1 -0.7977632 0.5090986 1 0.5498513 0.1423880
3 212.1212 3.3899 1 -0.7840394 0.5034491 1 0.5505074 0.1454429
4 218.1818 3.3899 1 -0.7703156 0.4978239 1 0.5511750 0.1485460
5 224.2424 3.3899 1 -0.7565919 0.4922237 1 0.5518545 0.1516973
6 230.3030 3.3899 1 -0.7428681 0.4866494 1 0.5525464 0.1548966
PredictedProb
1 0.3075737
2 0.3105042
3 0.3134499
4 0.3164108
5 0.3193867
6 0.3223773
Here we see the predicted probabilities for the first 6 of the 400 rows (100 gre values for each of the four ranks; these six are from rank 1), holding gpa at its mean and varying gre and rank.
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
size = 1)
Interpretation:
The plot shows how the predicted probability of admission rises with GRE score within every rank. At any given gre, students from higher-ranked (more prestigious) institutions have a higher predicted probability of admission, and the separation between the rank curves is larger than the gain from increasing gre alone, suggesting that institution prestige matters more than GRE score over this range. The shaded ribbons show the 95% confidence intervals, giving a visual sense of the uncertainty in the predictions.
Now, to test the significance of our model, we want to test whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model).
with(mylogit, null.deviance - deviance)
[1] 41.45903
The degrees of freedom for the difference between the two models is equal to the number of predictor variables in the model, and can be obtained using:
with(mylogit, df.null - df.residual)
[1] 5
Finally, the p-value can be obtained using:
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 7.578194e-08
The chi-square of 41.46 with 5 degrees of freedom and an associated p-value of less than 0.001 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a likelihood ratio test (the residual deviance is \(-2\) times the log likelihood). To see the model's log likelihood, we type:
logLik(mylogit)
'log Lik.' -229.2587 (df=6)
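The same likelihood ratio test can be run in one step by fitting the intercept-only model explicitly and comparing the two fits with anova (a convenience alternative, not part of the original walkthrough):
## likelihood ratio test via anova: null model vs. fitted model
nullmod <- glm(admit ~ 1, data = mydata, family = "binomial")
anova(nullmod, mylogit, test = "Chisq") ## same chi-square of 41.46 on 5 df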
In summary, we fit a logistic regression model predicting admission to graduate school from GRE score, GPA, and undergraduate institution prestige; computed odds ratios, predicted probabilities, and confidence intervals; and presented the results graphically.