Logistic Regression in R

In this overview, we will be covering basic logistic regression, but we will also cover ordinal logistic regression and multinomial logistic regression. Very importantly, the data in this section were created by the wonderful people over at the UCLA Institute for Digital Research and Education (IDRE). They also have great overviews of how to do various data analyses in R (and other languages), including logistic, ordinal, and multinomial regression. My only reasons for recreating overviews here is to align them with the tutorials you have gone through thus far and ensure their alignment with the curriculum at Seton Hall University.

Logistic regression

As a reminder, logistic regression is a method for modeling binary outcome variables. Let’s read in data that contains some of that.

# Read in data
ldata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

Notice something cool here - you didn’t have to download that dataset. R can read in data straght from a URL if it’s loaded online. Cool, right? Notice that we’re pretty much done with the api data for now. I named this object ldata.

These are (fake? I think?) data on graduate school admissions. The admit variable indicates whether someone was admitted. The gre variable is the GRE score. The gpa is undergraduate GPA. And rank is the rank of the undergraduate institution (with values 1 through 4).

Just to practice, can you run some summary statistics on the data?

% admitted:
Mean GRE:
GRE standard deviation:

Here’s an important sidenote - can you get a frequency table for the rank variable? What happens if we just use the summary() command?

summary(ldata$rank)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   2.000   2.485   3.000   4.000

This isn’t what we want, is it? The reason for this is because R can’t tell, by itself, that this is meant to be a categorical variable because it uses numbers instead of words. So it just treats it like a continuous variable. To make sure that R treats this variable like a categorical variable, you have to transform the variable into what we call a “factor.” To do this, just use the factor() function.

# Re-write the rank variable as a factor
ldata$rank <- factor(ldata$rank)

Here, I over-wrote the rank variable with a new version of it that is a factor. Now, let’s try that summary again.

summary(ldata$rank)

  1   2   3   4 
 61 151 121  67

This is what we wanted. Back to logistic regression.

Logistic regression in R is treated as a “generalized linear model.” The short explanation is that there are a variety of different types of regression. You know of two important ones, ordinary least squares and logistic. They function similarly, but assume different things about the shape of the outcome variable. There is a series of types of models in R, including Poisson (for count data), that can all be run pretty similarly, but just assume different things about the outcome variable. They can all be run with the glm() command the exact same way, and you just need to specify what the outcome variable looks like using the family argument. The “family” for logistic regression is called “binomial.”

So let’s say we wanted to run the following logistic regression model.

\[log(\frac{Pr(Admit)}{NotAdmit}) = \beta_0 + \beta_1(gre) + \beta_2(gpa) + \beta_3(rank) + \varepsilon\]

The way we use the glm() command is identical to the way we use the lm() command for regression, except now we have an extra argument.

# Run the logistic regression model
model <- glm(admit ~ gre + gpa + rank, data=ldata, family="binomial")

Like before when we were doing ordinary least squares regression, this doesn’t display anything because we stored the results in an object, which I called model. Let’s see the results using the summary() command!

summary(model)


Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = ldata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4

You don’t have to worry about the deviance and AIC things - they’re related to the method of estimation. (Good to know in general, but for the purpose of right now, not so much. Also, I’ve had a lot of caffeine so sorry if this tutorial and the last one are more hyped up then they need to be.) The results are displayed, for the most part, exactly the same as they are in ordinary least squares. Here, though, we have to do a little bit of extra work because the regression coefficients are returned in logged odds units, as is generally the case with logistic regression.

Remember, to interpret logistic regression results, we want to first transform the coefficients to odds ratios by raising e - or Euler’s constant - to the coefficients. Luckily, that’s pretty easy in R.

First, we can pull the coefficients themselves by using the coef() command.

# Store coefficients in another object
coefs <- coef(model)
# Show the coefficients, just for fun
coefs

 (Intercept)          gre          gpa        rank2        rank3 
-3.989979073  0.002264426  0.804037549 -0.675442928 -1.340203916 
       rank4 
-1.551463677

Second, we raise e to the coefficients using the exp() command.

# Raise e to the coefficients
exp(coefs)

(Intercept)         gre         gpa       rank2       rank3       rank4 
  0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375

If you want, you could even further transform them for direct use in the verbal interpretation of odds. For example, we can take the odds ratio, subtract 1, and multiply by 100 to get the percent change in the odds for a one unit increase in the independent variable.

# Do the full transformation, all in one line
(exp(coefs)-1)*100

(Intercept)         gre         gpa       rank2       rank3       rank4 
-98.1499899   0.2266992 123.4544824 -49.1069049 -73.8207721 -78.8062461

(It is important to note that published articles will normally report the odds ratio, not the full transformation.)

Let’s interpret some coefficients.

gre = 1.0023: For a one point increase in GRE score, we expect to see a 0.23% increase in the odds of being admitted to graduate school. With a z-value of 2.07 and an associated p-value of 0.04, this coefficient is statistically significant at the 5% level.
gpa = 2.23:
rank2 = 0.51: Being from a rank 2 school (as compared to being a rank 1 school) is associated with a 49% decrease in the odds of being admitted to graduate school. With a z-value of -2.134 and an associated p-value of 0.03, this coefficient is statistically significant at the 5% level.
rank3 = 0.26:
rank4 = 0.21:

In order to get a fit statistic, like the $R^2$ in ordinary least squares regression, we need to do a little bit more. First of all, for logistic regression, there are many different fit statistics that have been suggested, and there are a variety of different functions out there for calculating all of them. Here, we will use McFadden’s Pseudo $R^2$. The function, pR2(), is included in the pscl package by Simon Jackman at Stanford University, which you can download using the install.packages() command.

Let’s try it out. Remember that you have to load the pscl package first.

# Load the package
library(pscl)

Classes and Methods for R developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University
Simon Jackman
hurdle and zeroinfl functions by Achim Zeileis

# Get the McFadden pR2 from our logistic model
pR2(model)

          llh       llhNull            G2      McFadden          r2ML 
-229.25874624 -249.98825878   41.45902508    0.08292194    0.09845702 
         r2CU 
   0.13799580

The value under “McFadden” if the pseudo-$R^2$.

Ordinal logistic regression

As you know, the ordinal logistic regression is just an extension of logistic regression for ordered outcomes with more than two categories. For this experience, we will again use data from IDRE. This time, though, the data source is a Stata data file. Two packages have been developed out in the world with functions to read Stata data, foreign and haven. I happen to favor the haven package, so download it using install.packages() if you don’t have it already.

# Load library
library(haven)
# Read in data
odata <- read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")

There are four variables here. The apply variable is an ordered categorical variable with responses to a survey about whether a student feels they are “Unlikely” (0), “Somewhat likely” (1), or “Very likely” (2) to apply to graduate school. The pared variable is a binary variable indicating if at least one parent has attended graduate school. The public variable is a binary variable indicating if the undergraduate institution is public (as opposed to private), and the gpa variable is the undergraduate GPA. Notice that the order of the values in apply correspond to the order of the variable.

Ordinal logistic regression uses the polr() command, which stands for “proportional odds logistic regression.” It is in the MASS package (which again, you should install if you don’t have it). The syntax is the same as the lm() command. Note that the outcome variable is required to be formatted as a factor in order to run the model.

# Load the library
library(MASS)
# Turn the apply variable into a factor
odata$apply <- factor(odata$apply)
# Run the ordinal logistic regression model
model <- polr(apply ~ pared + public + gpa, data=odata)

Let’s see the output.

summary(model)


Re-fitting to get Hessian

Call:
polr(formula = apply ~ pared + public + gpa, data = odata)

Coefficients:
          Value Std. Error t value
pared   1.04769     0.2658  3.9418
public -0.05879     0.2979 -0.1974
gpa     0.61594     0.2606  2.3632

Intercepts:
    Value   Std. Error t value
0|1  2.2039  0.7795     2.8272
1|2  4.2994  0.8043     5.3453

Residual Deviance: 717.0249 
AIC: 727.0249

As is pretty much always the case in logistic-type regressions, the coefficients are output as logged odds. We can pull out those coefficients, like before, with the coef() command.

coefs <- coef(model)

We can do the same transformation from before, using the same code, to turn these into interpretable odds ratios. What code would you run to get the odds ratios?

Run:

    pared    public       gpa 
2.8510579 0.9429088 1.8513972

And what about the full transformation to the percent change in odds?

Run:

     pared     public        gpa 
185.105785  -5.709121  85.139715

But wait, why doesn’t it report p-values? I don’t know, again, I didn’t make it, but you can figure out the p-values yourself. Remember, to figure out the p-value for a particular t-value in a regression:

Take the absolute value of the t-value.
Calculate the degrees of freedom as $N-K$ or the total number of observations in the model minus the total number of independent variables.
Find the upper tail probability, on a t-distribution with $N-K$ degrees of freedom, associated with that t-value. (We can do this with the pt() function, using the lower.tail=FALSE argument.)
Double it.

We can do this all in one line in R. For example, take the t-value of 3.9418 that is reported in the summary() results. (Note that the abs() function takes the absolute value.)

# Find the p-value for a t-value of 3.9418
pt(3.9418, 400-3, lower.tail=FALSE)*2

[1] 9.556751e-05

That all said, with degrees of freedom over 100, we can also just say that any t-value over 2 is statistically significant at least at the 5% level. #RuleOfThumb #LazyStatistics

Let’s interpret some results:

pared = 1.85: Having a parent who went to graduate school, as opposed to not having a parent that went to graduate school, is associated with increased odds of being in a higher category of reported likelihood of going to graduate school of 185%. The t-value is above 2, and therefore is statistically significant at the 5% level.

…Wait, what? That’s a not-particularly-useful mouthful. I’m not even going to interpret the rest of the coefficients this way because that’s so messy and not really useful. This is why when it comes to ordinal logistic regression, people often just report the direction and general magnitude of the direction. (People will often also report predicted marginal probabilities, but we will save that discussion for later.)

Let’s try again, but more generally speaking:

pared = 2.85: Having a parent who went to graduate school, as opposed to not having a parent that went to graduate school, is associated with much higher reported likelihood of going to graduate school. The t-value is above 2, and therefore is statistically significant at the 5% level.
public = 0.94:
gpa = 1.85:

The last thing to talk about when it comes to ordinal logistic regression is the parallel regression assumption, a.k.a. the proportional odds assumption. In short, we assume that the coefficients from the following two regressions are the same.

\[log(\frac{Pr(apply=2)}{Pr(apply=0|1)}) = \beta_0 + \beta_1 (pared) + \beta_2 (public) + \beta_3 (gpa) + \varepsilon\]

\[log(\frac{Pr(apply=1|2)}{Pr(apply=0)}) = \beta_0 + \beta_1 (pared) + \beta_2 (public) + \beta_3 (gpa) + \varepsilon\]

We can test this assumption using a Brant test. This is available in the brant package.

# Load the library
library(brant)
# Run the Brant test on the model
brant(model)

-------------------------------------------- 
Test for    X2  df  probability 
-------------------------------------------- 
Omnibus     4.34    3   0.23
pared       0.13    1   0.72
public      3.44    1   0.06
gpa     0.18    1   0.67
-------------------------------------------- 

H0: Parallel Regression Assumption holds

Each of the probabilities refers to the null hypothesis that the parallel regression assumption holds, or that the value of the coefficients don’t differ across different cutpoints in the outcome variable. Here, we look at the result of the “Omnibus” check. Here, we see a value of 0.23, leading us to fail to reject the null hypothesis that the parallel regression assumption holds. In other words, we do not have evidence that we have not met the assumption. Hooray for us! Ordinal logistic regression was an okay thing for us to do! If we had gotten a lower probability value, like 0.05, we would have had to instead run multinomial logistic regression. Good thing we’re talking about that next anyway.

Multinomial logistic regression

We’ve done a lot of regression today. I would suggest you stop after this one and take a break.

But anyway, here we are, at multinomial logistic regression. This is the regression you use when you have an unordered categorical variable outcome or an ordered categorical variable outcome that failed the Brant test.

Let’s load in the data from IDRE.

# Read in data
mdata <- read_dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")

Note that if you already loaded the haven package before, you don’t have to do it again to use the read_dta() function. There are a bunch of variables in this dataset, but we only care about three: prog, ses, and write. The first, prog is a categorical variable indicating what type of program a student is in: “General” (1), “Academic” (2), or “Vocational” (3). The ses variable is a categorical variable indicating someone’s socioeconomic class: “Low” (1), “Middle” (2), and “High” (3). The write variable is their score on a writing test.

To begin, let’s format our categorical variables as factor variables. This should generally be a first step if you have categorical data that is set up as numbers.

# Format categorical variables
mdata$prog <- factor(mdata$prog)
mdata$ses <- factor(mdata$ses)

Remember, multinomial logistic regression reports the odds of being in the different outcome categories in reference to some base group. So while we want to model the outcome prog on the independent variables ses and write, because there are three levels to prog, it will report two different sets of regression results corresponding to the following two models.

\[log(\frac{Pr(prog=2)}{Pr(prog=1)}) = \beta_0 + \beta_1 (ses) + \beta_2 (write) + \varepsilon\]

\[log(\frac{Pr(prog=3)}{Pr(prog=1)}) = \beta_0 + \beta_1 (ses) + \beta_2 (write) + \varepsilon\]

Note here that we have chosen the “General” (1) group as our base group. We did not have to choose this group. We just picked one for the purpose of discussion. And by “we,” I mean “I” because I’m the one writing this thing.

So first, you have to set your base group. You do this using the relevel() function and the ref argument, which stands for “reference group.”

# Set the reference group for prog to be 1
mdata$prog <- relevel(mdata$prog, ref=1)

Now, we can go ahead and run our model using the multinom() command, which is in the nnet package. It has the same structure as basically all of the other modeling functions we’ve seen so far.

# Load the package
library(nnet)
# Run the model
model <- multinom(prog ~ ses + write, data=mdata)

# weights:  15 (8 variable)
initial  value 219.722458 
iter  10 value 179.985215
final  value 179.981726 
converged

Here, even though we stored the results in an object, it still printed out some information about the estimation process. Thanks, R.

Let’s check out a summary of the results.

summary(model)

Call:
multinom(formula = prog ~ ses + write, data = mdata)

Coefficients:
  (Intercept)      ses2      ses3       write
2   -2.851973 0.5332914 1.1628257  0.05792480
3    2.366097 0.8246384 0.1802176 -0.05567514

Std. Errors:
  (Intercept)      ses2      ses3      write
2    1.166437 0.4437319 0.5142215 0.02141092
3    1.174251 0.4901237 0.6484508 0.02333135

Residual Deviance: 359.9635 
AIC: 375.9635

Okay, so I will the the first to admit that the way this function displays the results of the model is strange and slightly unwieldly. But that’s why I’m here, so we can go through this process of growth and reconciliation together.

First, note that it’s broadly broken up into two sections: “Coefficients” and “Std. Errors.” These sections are exactly what they sound like: the coefficients and standard errors for the respective models. See the “2” and “3” on the sides? These refer to the separate models we showed above, where the “2” model has the “2” in the numerator of the odds and the “3” model has the “3” in the numerator of the odds. In other words, the numbers in the rows beginning with 2 are for the models comparing the probability of being in an “Academic” track versus a “General” track. The numbers in the rows beginning with 3 are for models comparing the probability of being in a “Vocational” track versus a “General” track.

Were I to take those formulas from above and substitute in the actual numbers, I would get the following.

\[\widehat{log(\frac{Pr(prog=2)}{Pr(prog=1)})} = -2.85 + 0.53 (sesMiddle) + 1.16 (sesHigh) + 0.06 (write)\]

\[\widehat{log(\frac{Pr(prog=3)}{Pr(prog=1)})} = 2.37 + 0.82 (sesMiddle) + 0.18 (sesHigh) -0.06 (write)\]

See where I got those numbers from? They came from the “Coefficient” section of the results. Note that to make it more interpretable, I renamed ses2 and ses3 to sesMiddle and sesHigh.

We can pull out the coefficients, like before, with the coef command.

coefs <- coef(model)

Note that we can use the exact same method from binary logistic regression and ordinal logistic regression to turn those coefficients into odds ratios. Go ahead and do it.

Run:

  (Intercept)     ses2     ses3     write
2  0.05773033 1.704533 3.198960 1.0596353
3 10.65572511 2.281056 1.197478 0.9458464

Or, alternatively, do the full transformation.

Run:

  (Intercept)      ses2     ses3     write
2   -94.22697  70.45334 219.8960  5.963531
3   965.57251 128.10558  19.7478 -5.415365

Can you interpret the coefficients? Let’s forget statistical significance for a second. The coefficients are interpreted exactly the same as they way they are in binary logistic regression.

Interpretations for the General vs. Academic model:

sesMiddle = 1.70:
sesHigh = 3.20:
write = 1.06:

Interpretations for the General vs. Vocational model:

sesMiddle = 2.28:
sesHigh = 1.20:
write = 0.95:

How do we determine statistical significance of the coefficients, though, when they only provide us the point estimate and the standard errors?

First, want another #RuleOfThumb #LazyStatistics trick? Well, because we know that the z-value (or t-value) is generally equal to the point estimate over the standard error, and we know that a value over 2 when you have large degrees of freedom generally means statistical significance at the 95% level, we can conclude the following: If the point estimate is more than two times the standard error, the coefficient is statistically significant.

But let’s say you actually wanted the z-values and the p-values. To get the z-values, you divide the coefficients by the standard errors. We already know how to get the coefficients by themselves. What about the standard errors? The reasoning for why the code below works is a little more than I feel like getting into at 9pm on a Monday, but know that it works. Basically, the summary() version of the results has a lot of parts to it, and you can refer to those parts using the $ symbol, just like how you refer to variables in a dataset. The standard errors part is labeled standard.errors.

summary(model)$standard.errors

  (Intercept)      ses2      ses3      write
2    1.166437 0.4437319 0.5142215 0.02141092
3    1.174251 0.4901237 0.6484508 0.02333135

See? It worked. Note that we could have also gotten the coefficients using a similar notation: summary(model)$coefficients

So to get the z-values, we simply divide one by the other.

# Calculate z-values
zvalues <- summary(model)$coefficients / summary(model)$standard.errors
# Show z-values
zvalues

  (Intercept)     ses2      ses3     write
2   -2.445028 1.201832 2.2613324  2.705386
3    2.014984 1.682511 0.2779203 -2.386280

Then you can calculate the p-values, if you want, using the same kind of code that we used when looking at ordinal logistic regression, except this time, we use the pnorm() function instead of the pt() function because the z-value is related to a normal distribution, not a t distribution. Remember that the normal distribution doesn’t depend on degrees of freedom.

pnorm(abs(zvalues), lower.tail=FALSE)*2

  (Intercept)       ses2       ses3      write
2  0.01448407 0.22942845 0.02373868 0.00682251
3  0.04390634 0.09246981 0.78107356 0.01701977

And look, there are your p-values. Now you can interpret the statistical significance of the coefficients.

In terms of model fit, the field is still evolving on how to do this well for multinomial logistic regression. For now, you can just run the logistic models separately and report their fit statistics.

Next steps

Next, the basics of multilevel models: http://rpubs.com/rslbliss/r_mlm_ws

Logistic Regression in R

Richard Blissett

2017-11-26