library(aod)
library(ggplot2)
library(rmarkdown)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
## view the first few rows of the data
head(mydata)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2
summary(mydata)
     admit            gre             gpa            rank
 Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000
 Median :0.0000   Median :580.0   Median :3.395   Median :2.000
 Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485
 3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000
 Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
sapply(mydata, sd)
      admit         gre         gpa        rank
  0.4660867 115.5165364   0.3805668   0.9444602
## two-way contingency table of categorical outcome and predictors we want
## to make sure there are not 0 cells
xtabs(~admit + rank, data = mydata)
     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.
Logistic regression, the focus of this page.
Probit regression. Probit analysis will produce results similar to logistic regression. The choice of probit versus logit depends largely on individual preferences.
OLS regression. When used with a binary response variable, this model is known as a linear probability model and can be used as a way to describe conditional probabilities. However, the errors (i.e., residuals) from the linear probability model violate the homoskedasticity and normality-of-errors assumptions of OLS regression, resulting in invalid standard errors and hypothesis tests. For a more thorough discussion of these and other problems with the linear probability model, see Long (1997, pp. 38-40). A sketch of such a model appears after this list.
Two-group discriminant function analysis. A multivariate method for dichotomous outcome variables.
Hotelling's T². The 0/1 outcome is turned into the grouping variable, and the former predictors are turned into outcome variables. This will produce an overall test of significance but will not give individual coefficients for each variable, and it is unclear to what extent each "predictor" is adjusted for the impact of the other "predictors."
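As an illustration of the linear probability model mentioned above, here is a minimal sketch fitting one with lm() on the same data (rank is wrapped in factor() because it has not yet been converted at this point in the walkthrough):
## linear probability model: OLS on the 0/1 outcome (illustrative only;
## its residuals violate the homoskedasticity and normality assumptions)
lpm <- lm(admit ~ gre + gpa + factor(rank), data = mydata)
summary(lpm)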
The code below estimates a logistic regression model using the glm
(generalized linear model) function.
First, we convert rank to a factor to indicate that rank should be
treated as a categorical variable.
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *
gpa          0.804038   0.331819   2.423 0.015388 *
rank2       -0.675443   0.316490  -2.134 0.032829 *
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
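If you need the coefficient table as a matrix for reporting or further computation, one option (a sketch, not shown in the output above) is:
## estimates, standard errors, z values, and p-values as a matrix
coef(summary(mylogit))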
## CIs using profiled log-likelihood
confint(mylogit)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -6.2716202334 -1.792547080
gre 0.0001375921 0.004435874
gpa 0.1602959439 1.464142727
rank2 -1.3008888002 -0.056745722
rank3 -2.0276713127 -0.670372346
rank4 -2.4000265384 -0.753542605
## CIs using standard errors
confint.default(mylogit)
2.5 % 97.5 %
(Intercept) -6.2242418514 -1.755716295
gre 0.0001202298 0.004408622
gpa 0.1536836760 1.454391423
rank2 -1.2957512650 -0.055134591
rank3 -2.0169920597 -0.663415773
rank4 -2.3703986294 -0.732528724
We can test for an overall effect of rank using the wald.test function of the aod library. The Terms argument specifies which coefficients to test; here, terms 4 through 6 are the three rank coefficients.
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
Wald test:
----------
Chi-squared test:
X2 = 20.9, df = 3, P(> X2) = 0.00011
We can also test whether the coefficients for rank = 2 and rank = 3 are equal. The vector l defines the contrast: 1 times the rank2 coefficient minus 1 times the rank3 coefficient.
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
Wald test:
----------
Chi-squared test:
X2 = 5.5, df = 1, P(> X2) = 0.019
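As a check, the Wald chi-squared statistic for this contrast can be computed by hand from the coefficient vector and its covariance matrix; a minimal sketch using only objects already defined above:
## Wald statistic for a single contrast l: (l'b)^2 / (l' V l), with 1 df
b <- coef(mylogit)
V <- vcov(mylogit)
l <- c(0, 0, 0, 1, -1, 0)
w <- as.numeric((l %*% b)^2 / (l %*% V %*% l))
w
pchisq(w, df = 1, lower.tail = FALSE)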
## odds ratios only
exp(coef(mylogit))
(Intercept) gre gpa rank2 rank3 rank4
0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 0.0185001 0.001889165 0.1665354
gre 1.0022670 1.000137602 1.0044457
gpa 2.2345448 1.173858216 4.3238349
rank2 0.5089310 0.272289674 0.9448343
rank3 0.2617923 0.131641717 0.5115181
rank4 0.2119375 0.090715546 0.4706961
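Because gre is measured on a scale of hundreds of points, the per-unit odds ratio of about 1.002 is hard to read. One common aid, shown here as a sketch, is to rescale the coefficient, for example to a 100-point increase in gre:
## odds ratio for a 100-point increase in gre (roughly 1.25 here)
exp(coef(mylogit)["gre"] * 100)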
To compute predicted probabilities, we first create a data frame with gre and gpa held at their means and rank taking each of its four values.
newdata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
## view data frame
paged_table(newdata1)
## add predicted probabilities of admission at each rank
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
paged_table(newdata1)
## create 100 values of gre for each level of rank, holding gpa at its mean
newdata2 <- with(mydata, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
    4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
## get predictions on the link (log-odds) scale along with standard errors,
## then convert the fitted values and 95% confidence limits to probabilities
newdata3 <- cbind(newdata2, predict(mylogit, newdata = newdata2, type = "link",
    se = TRUE))
newdata3 <- within(newdata3, {
    PredictedProb <- plogis(fit)
    LL <- plogis(fit - (1.96 * se.fit))
    UL <- plogis(fit + (1.96 * se.fit))
})
## view first few rows of final dataset
head(newdata3)
       gre    gpa rank        fit    se.fit residual.scale        UL        LL PredictedProb
1 200.0000 3.3899    1 -0.8114870 0.5147714              1 0.5492064 0.1393812     0.3075737
2 206.0606 3.3899    1 -0.7977632 0.5090986              1 0.5498513 0.1423880     0.3105042
3 212.1212 3.3899    1 -0.7840394 0.5034491              1 0.5505074 0.1454429     0.3134499
4 218.1818 3.3899    1 -0.7703156 0.4978239              1 0.5511750 0.1485460     0.3164108
5 224.2424 3.3899    1 -0.7565919 0.4922237              1 0.5518545 0.1516973     0.3193867
6 230.3030 3.3899    1 -0.7428681 0.4866494              1 0.5525464 0.1548966     0.3223773
ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
    ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
    linewidth = 1)
## likelihood-ratio test: difference in deviance between the null and fitted models
with(mylogit, null.deviance - deviance)
[1] 41.45903
## degrees of freedom for the test: difference in df between the two models
with(mylogit, df.null - df.residual)
[1] 5
## p-value of the chi-squared test
with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
[1] 7.578194e-08
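The same likelihood-ratio test can be obtained by fitting an intercept-only model and comparing the two fits with anova(); a sketch (nullmod is a name introduced here for illustration):
## intercept-only (null) model for comparison
nullmod <- glm(admit ~ 1, data = mydata, family = "binomial")
## likelihood-ratio (chi-squared) test of the fitted model against the null
anova(nullmod, mylogit, test = "Chisq")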
## the model's log likelihood
logLik(mylogit)
'log Lik.' -229.2587 (df=6)
In analyzing the data using binary logistic regression, the first step is to describe the data. In this step, you run your data with the binary response; in this example, there are three predictor variables. To describe the data, use the summary() function, and to get the standard deviations, use sapply() to apply the sd function to each variable in the dataset.

Next, there is a list of methods we might consider for this analysis. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations: (1) logistic regression; (2) probit regression, which will produce results similar to logistic regression; (3) OLS regression, known in this setting as a linear probability model, which can be used as a way to describe conditional probabilities; (4) two-group discriminant function analysis, a multivariate method for dichotomous outcome variables; and (5) Hotelling's T², which will produce an overall test of significance but will not give individual coefficients for each variable.

Then, to use the logit model, the goal is to estimate a logistic regression model using the generalized linear model function, glm(). Before fitting, we must convert one of the predictor variables to a factor to indicate that it should be treated as categorical. To produce the output from our regression, use the summary() function on the model. In addition, we can use the confint() function to obtain confidence intervals for the coefficient estimates. Also, the wald.test() function of the aod library gives us the overall effect of the categorical variable in the dataset, and we can test additional hypotheses about differences in the coefficients for the different levels of that variable. Further, you can exponentiate the coefficients with the exp() function and interpret them as odds ratios. Lastly, the output produced by summary(mylogit) includes indices of fit, the null and residual deviances, and the AIC.
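Putting these steps together, here is a condensed sketch of the full workflow using the same data and model as above:
library(aod)

## read and describe the data
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
summary(mydata)
sapply(mydata, sd)
xtabs(~admit + rank, data = mydata)

## fit and summarize the logit model
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)

## confidence intervals and odds ratios
confint(mylogit)
exp(cbind(OR = coef(mylogit), confint(mylogit)))

## overall effect of rank (terms 4:6 are rank2, rank3, rank4)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)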