Hi everyone!

As you’ve done R work in Dr. Witesman’s class this semester, you may have noticed that she uses a slightly different flavor of R (i.e. no knitting R Markdown files or making RStudio Projects, using summary(), etc.) that you might not feel as comfortable with. This document shows how to get the results from logistic regression and ordered logistic regression into the same format you’re familiar with from our class.

I’ll use a dataset you’re familiar with from last semester: the 2016 General Social Survey (GSS), a biennial nationally representative survey with a comprehensive set of questions about all sort of trends in American life. I’ve shrunk it down for you so it only includes a handful of variables.

# MASS needs to come before tidyverse because MASS has its own select() function
# that will overwrite the one you get from tidyverse. If you load MASS after
# tidyverse you won't be able to use select() to choose columns
library(MASS)        # For polr()
library(tidyverse)   # For dplyr, ggplot2, and family
library(broom)       # For converting models to nice clean data frames
library(moderndive)  # ModernDive stuff

# Data from the 2012 GSS
# This readRDS(gzcon(url())) incantation will read this file directly from the internet
gss <- readRDS(gzcon(url("https://statsf18.classes.andrewheiss.com/data/gss.rds")))

Quick reference summary

Use tidy() and glance() if you want output similar to ModernDive.

Linear regression

Use this when your Y / outcome / dependent variable is numeric / continuous.

library(tidyverse)  # For dplyr, ggplot2, and friends
library(broom)      # Convert models to nice clean data frames

model_linear <- lm(y ~ x1 + x2 + x3,
                   data = whatever)

tidy(model_linear, conf.int = TRUE)  # This is the same as get_regression_table()
glance(model_linear)  # This is the same as get_regression_summaries()

Logistic regression

Use this when your Y / outcome / dependent variable is binary.

library(tidyverse)  # For dplyr, ggplot2, and friends
library(broom)      # Convert models to nice clean data frames with tidy() and glance()
library(DescTools)  # For PseudoR2()

model_logit <- glm(y ~ x1 + x2 + x3, data = whatever,
                   family = binomial(link = "logit"))

tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)
glance(model_logit)
PseudoR2(model_logit)

Ordered logistic regression

Use this when your Y / outcome / dependent variable is categorical and the order of the categories matters.

library(MASS)       # ¡¡¡Make sure you run this before library(tidyverse)!!!
library(tidyverse)  # For dplyr, ggplot2, and friends
library(broom)      # Convert models to nice clean data frames with tidy() and glance()
library(lmtest)     # For p-values
library(DescTools)  # For PseudoR2()

model_ordered_logit <- polr(y ~ x1 + x2 + x3, 
                            data = whatever, Hess = TRUE)

tidy(model_ordered_logit, conf.int = TRUE, exponentiate = TRUE) %>% 
  filter(coefficient_type == "coefficient")  # Ignore cutpoints

tidy(coeftest(model_ordered_logit))  # For p-values
glance(model_ordered_logit)
PseudoR2(model_ordered_logit)

Linear regression

In our class, we covered linear regression. You use this when your Y variable (or outcome variable, or dependent variable) is numeric.

Here, we’ll predict the number of kids someone has (childs) based on level of education (educ), whether they were born in the US (born), marital status (marital2; this is binary married/not married), and whether they pray once a week (pray2; this is binary less than once a week/at least once a week).

After making the model, we showed the results of the model with get_regression_table() from the ModernDive package, which shows you the coefficient, p-value, standard errors, and everything else.

We also used get_regression_summaries() to get the \(R^2\) for the model.

model_linear <- lm(childs ~ educ + born + marital2 + pray2, 
                   data = gss)

get_regression_table(model_linear)

## # A tibble: 5 x 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept             3.56      0.169     21.1    0        3.23     3.89 
## 2 educ                 -0.122     0.01     -12.6    0       -0.141   -0.103
## 3 bornYes              -0.163     0.086     -1.89   0.058   -0.332    0.006
## 4 marital2Not marri…   -0.62      0.058    -10.7    0       -0.733   -0.507
## 5 pray2At least onc…    0.546     0.066      8.24   0        0.416    0.676

get_regression_summaries(model_linear)

## # A tibble: 1 x 8
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1     0.114         0.113  2.22  1.49  1.49      90.3       0     5

Following the template from our class, we can interpret all these numbers like this:

Education is negatively associated with the number of kids someone has. An additional year of education is associated with 0.122 fewer kids, on average. This is statistically significant (p < 0.001).
Immigration status is associated with the number of kids someone has. Those born in the US have an average of 0.163 fewer kids than immigrants, though this difference is not statistically significant (p = 0.058).
Marital status is associated with kids too. Those who are not married have an average of 0.620 fewer kids than those who are married (p < 0.001).
Prayer is also associated with kids. Those who pray at least once a week have an average of 0.546 more kids than those who don’t (p < 0.001).
This model explains 11.3% of the variation in the number of kids people have, so it’s not that great

In Dr. Witesman’s class, you’ve been using the summary() function instead of get_regression_table(), like so:

summary(model_linear)

## 
## Call:
## lm(formula = childs ~ educ + born + marital2 + pray2, data = gss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9725 -1.1374 -0.1635  0.9378  5.5778 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.562498   0.168872  21.096  < 2e-16 ***
## educ                      -0.121541   0.009684 -12.551  < 2e-16 ***
## bornYes                   -0.163304   0.086254  -1.893   0.0584 .  
## marital2Not married       -0.619726   0.057708 -10.739  < 2e-16 ***
## pray2At least once a week  0.545644   0.066248   8.236 2.69e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.492 on 2801 degrees of freedom
##   (61 observations deleted due to missingness)
## Multiple R-squared:  0.1142, Adjusted R-squared:  0.1129 
## F-statistic: 90.27 on 4 and 2801 DF,  p-value: < 2.2e-16

This returns a big hairy block of text that contains the same information as get_regression_table() (minus the confidence intervals), and you can make the same interpretations with this.

An alternative way of getting these deatils out of the models is to use a function named tidy() from the broom package. tidy() takes any model object and converts it to a nice data frame. Behind the scenes, get_regression_table() is actually using tidy().

For example, we can convert model_linear to a data frame like so (make sure you’ve loaded the boom package with library(broom)):

tidy(model_linear)

## # A tibble: 5 x 5
##   term                      estimate std.error statistic  p.value
##   <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                  3.56    0.169       21.1  8.33e-92
## 2 educ                        -0.122   0.00968    -12.6  3.43e-35
## 3 bornYes                     -0.163   0.0863      -1.89 5.84e- 2
## 4 marital2Not married         -0.620   0.0577     -10.7  2.16e-26
## 5 pray2At least once a week    0.546   0.0662       8.24 2.69e-16

This looks almost identical to get_regression_table()!. The only thing it’s missing is confidence intervals. You can turn those on inside tidy():

tidy(model_linear, conf.int = TRUE)

## # A tibble: 5 x 7
##   term             estimate std.error statistic  p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)         3.56    0.169       21.1  8.33e-92    3.23    3.89   
## 2 educ               -0.122   0.00968    -12.6  3.43e-35   -0.141  -0.103  
## 3 bornYes            -0.163   0.0863      -1.89 5.84e- 2   -0.332   0.00582
## 4 marital2Not mar…   -0.620   0.0577     -10.7  2.16e-26   -0.733  -0.507  
## 5 pray2At least o…    0.546   0.0662       8.24 2.69e-16    0.416   0.676

And voila: the same output you get from get_regression_table().

You can get the regression summary statistics using glance() instead of get_regression_summaries():

glance(model_linear)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl>  <dbl>
## 1     0.114         0.113  1.49      90.3 2.88e-72     5 -5101. 10215.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>

So, to translate the ModernDive commands to broom-based commands, follow this translation:

get_regression_table() = tidy(..., conf.int = TRUE)
get_regression_summaries() = glance(...)

It’s a good idea to switch away from get_regression_table() and gang and use tidy(), because the other types of models (logistic, ordered logistic, etc.) don’t actually work with get_regression_table(). They do however work perfectly with tidy() and glance().

Logistic regression

Let’s see how to use tidy() and glance() with logistic regression. Here our outcome variable / dependent variable / Y is a binary variable showing who each respondent voted for in the 2012 presidential election. We’ll predict who people voted for based on the number of children they have (childs), marital status (marital2; this is binary married/not married), whether they pray once a week (pray2; this is binary less than once a week/at least once a week), and whether they identify as conservative or not (conservative; this is binary conservative/not conservative):

model_logit <- glm(pres12 ~ childs + marital2 + pray2 + conservative, 
                   family = binomial(link = "logit"), data = gss)

In Dr. Witesman’s class, you’ve been running these commands to see the model details:

summary(model_logit)

## 
## Call:
## glm(formula = pres12 ~ childs + marital2 + pray2 + conservative, 
##     family = binomial(link = "logit"), data = gss)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7711  -0.6415  -0.4929   0.7245   2.2490  
## 
## Coefficients:
##                              Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)                   1.11036    0.19834   5.598 0.0000000217 ***
## childs                       -0.04374    0.04237  -1.032        0.302    
## marital2Not married          -0.65748    0.12832  -5.124 0.0000002994 ***
## pray2At least once a week     0.22439    0.16121   1.392        0.164    
## conservativeNot conservative -2.59250    0.12978 -19.977      < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2116.6  on 1624  degrees of freedom
## Residual deviance: 1560.1  on 1620  degrees of freedom
##   (1242 observations deleted due to missingness)
## AIC: 1570.1
## 
## Number of Fisher Scoring iterations: 4

As you’ve learned, these values are log odds and are fairly uninterpretable on their own. To make them more interpretable, you need to unlog them by rasing \(e\) to the power of each coefficient (e.g. \(e^{-0.04374}\), \(e^{-0.65748}\), etc.). You’ve done this with this command:

exp(coef(model_logit))

##                  (Intercept)                       childs 
##                   3.03543824                   0.95720053 
##          marital2Not married    pray2At least once a week 
##                   0.51815593                   1.25156097 
## conservativeNot conservative 
##                   0.07483271

You can then interpret these coefficients like you’ve learned in your new class.

Rather than use a combination of summary() and exp(coef()) to get these numbers, you can use tidy() to get all the results into one table that looks identical to what you worked with last semester. You can even exponentiate the coefficients at the same time:

tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)

## # A tibble: 5 x 7
##   term             estimate std.error statistic  p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)        3.04      0.198       5.60 2.17e- 8   2.06      4.49  
## 2 childs             0.957     0.0424     -1.03 3.02e- 1   0.881     1.04  
## 3 marital2Not mar…   0.518     0.128      -5.12 2.99e- 7   0.402     0.666 
## 4 pray2At least o…   1.25      0.161       1.39 1.64e- 1   0.914     1.72  
## 5 conservativeNot…   0.0748    0.130     -20.0  8.77e-89   0.0578    0.0962

Here you have estimates, p-values, and confidence intervals all in one nice clean table. Now you can interpret the coefficients:

The likelihood of voting for Romney decreases by 4.3% (1 - 0.957) for each additional child, on average. This is not statistically significant, though (p = 0.3)
People who are not married are 48.2% (1 - 0.0.518) less likely to vote for Romney, on average. This is statistically significant (p < 0.001).
Those who pray at least once a week are 25% more likely to vote for Romney. This is not statistically significant (p = 0.164).
Those who are not conservative are 92.5% (1 - 0.0748) less likely to vote for Romney. This is statistically significant (p < 0.001).

You can get model details with glance(). However, R won’t add the pseudo \(R^2\) here, since many in the R community feel like it’s not a super useful number:

glance(model_logit)

## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1         2117.    1624  -780. 1570. 1597.    1560.        1620

To get the pseudo \(R^2\) you’ll need to use the PseudoR2() function like Dr. Witesman showed you:

library(DescTools)
PseudoR2(model_logit)

##  McFadden 
## 0.2820392

Ordered logistic regression

As you’ve learned in Dr. Witesman’s class, you use ordered logistic regression when your Y variable / outcome variable / dependent variable is categorical with categories where the order matters (e.g. “Disagree” > “Neutral” > “Agree”). You can’t use the glm() function for these models—instead you use the polr() function from the MASS library.

(Important sidenote: if you ever use MASS and tidyverse at the same time, make sure you run library(MASS) before library(tidyverse). MASS provides a function named select() that does… something… and if you load the MASS library after tidyverse, MASS’s select() will overwrite the select() you normally use for selecting columns like gss %>% select(marital, childs), and your code will break and you’ll be super sad.)

In this model, we’ll predict political views (polviews_ordered; factor with seven levels from “Extremely conservative” to “Extremely liberal”) with whether they were born in the US (born), their sex (sex), the number of children they have (childs), and whether they pray once a week (pray2; this is binary less than once a week/at least once a week). Note the Hess = TRUE. This is an underlying mathy thing (a Hessian matrix) that ordered logistic regression uses to run the model. If you leave this off (by default it’s Hess = TRUE), your model will still run, but it’ll run twice behind the scenes to get the Hessian. It’s weird. Just go with it.

model_ordered <- polr(polviews_ordered ~ born + sex + childs + pray2, 
                      data = gss, Hess = TRUE)

In Dr. Witesman’s code, you then run summary() and exp(coef()) and PseudoR2() to extract the relevant information from the model. You also have to run coeftest() to calculate the p-values for the different coefficients (since the author of MASS feels like p-values are harmful and has purposely not included them).

# In Dr. Witesman's code it says to run library(AER), but loading AER actually
# loads a bunch of other packages for you. coeftest() lives inside the lmtest
# library, so you can just run library(lmtest) if you want
library(lmtest)  # For getting p-values from polr objects

summary(model_ordered)  # Get coefficients

## Call:
## polr(formula = polviews_ordered ~ born + sex + childs + pray2, 
##     data = gss, Hess = TRUE)
## 
## Coefficients:
##                             Value Std. Error t value
## bornYes                   -0.3802    0.10726  -3.545
## sexMale                   -0.3652    0.07172  -5.092
## childs                    -0.1123    0.02250  -4.990
## pray2At least once a week -0.9614    0.08505 -11.305
## 
## Intercepts:
##                                   Value    Std. Error t value 
## Extrmly conservative|Conservative  -4.6454   0.1673   -27.7639
## Conservative|Slghtly conservative  -2.8955   0.1432   -20.2205
## Slghtly conservative|Moderate      -2.1462   0.1387   -15.4712
## Moderate|Slightly liberal          -0.4806   0.1325    -3.6264
## Slightly liberal|Liberal            0.1892   0.1331     1.4213
## Liberal|Extremely liberal           1.6546   0.1490    11.1077
## 
## Residual Deviance: 9133.784 
## AIC: 9153.784 
## (158 observations deleted due to missingness)

exp(coef(model_ordered))  # Exponentiate coefficients

##                   bornYes                   sexMale 
##                 0.6837064                 0.6940391 
##                    childs pray2At least once a week 
##                 0.8937991                 0.3823466

coeftest(model_ordered)  # Get p-values for coefficients

## 
## t test of coefficients:
## 
##                            Estimate Std. Error  t value     Pr(>|t|)    
## bornYes                   -0.380227   0.107264  -3.5448    0.0003996 ***
## sexMale                   -0.365227   0.071721  -5.0923 0.0000003781 ***
## childs                    -0.112274   0.022498  -4.9904 0.0000006407 ***
## pray2At least once a week -0.961428   0.085047 -11.3047    < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PseudoR2(model_ordered)  # Get pseudo R2

##   McFadden 
## 0.03831948

All those commands spit out a ton of output and it can look intimidating and hard to interpret. Fortunately, just like we did with linear regression and logistic regression, we can use tidy() to convert all this to a nice data frame with confidence intervals and with exponentiated coefficients:

tidy(model_ordered, conf.int = TRUE, exponentiate = TRUE)

## # A tibble: 10 x 7
##    term    estimate std.error statistic conf.low conf.high coefficient_type
##    <chr>      <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>           
##  1 bornYes  0.684      0.107      -3.54    0.554     0.844 coefficient     
##  2 childs   0.894      0.0225     -4.99    0.855     0.934 coefficient     
##  3 Conser…  0.0553     0.143     -20.2    NA        NA     zeta            
##  4 Extrml…  0.00961    0.167     -27.8    NA        NA     zeta            
##  5 Libera…  5.23       0.149      11.1    NA        NA     zeta            
##  6 Modera…  0.618      0.133      -3.63   NA        NA     zeta            
##  7 pray2A…  0.382      0.0850    -11.3     0.324     0.452 coefficient     
##  8 sexMale  0.694      0.0717     -5.09    0.603     0.799 coefficient     
##  9 Slghtl…  0.117      0.139     -15.5    NA        NA     zeta            
## 10 Slight…  1.21       0.133       1.42   NA        NA     zeta

The only thing this table is missing is p-values. We can convert the output of coeftest() to a nice data frame with tidy() as well:

tidy(coeftest(model_ordered))

## # A tibble: 4 x 5
##   term                      estimate std.error statistic  p.value
##   <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
## 1 bornYes                     -0.380    0.107      -3.54 4.00e- 4
## 2 sexMale                     -0.365    0.0717     -5.09 3.78e- 7
## 3 childs                      -0.112    0.0225     -4.99 6.41e- 7
## 4 pray2At least once a week   -0.961    0.0850    -11.3  5.52e-29

However, for whatever reason, you can’t exponentiate the values in tidy(coeftest(...)) or calculate confidence intervals. This means you still have to run both commands, but the output is a lot cleaner and easier to navigate (especially in an R Markdown file):

tidy(model_ordered, conf.int = TRUE, exponentiate = TRUE)

## # A tibble: 10 x 7
##    term    estimate std.error statistic conf.low conf.high coefficient_type
##    <chr>      <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>           
##  1 bornYes  0.684      0.107      -3.54    0.554     0.844 coefficient     
##  2 childs   0.894      0.0225     -4.99    0.855     0.934 coefficient     
##  3 Conser…  0.0553     0.143     -20.2    NA        NA     zeta            
##  4 Extrml…  0.00961    0.167     -27.8    NA        NA     zeta            
##  5 Libera…  5.23       0.149      11.1    NA        NA     zeta            
##  6 Modera…  0.618      0.133      -3.63   NA        NA     zeta            
##  7 pray2A…  0.382      0.0850    -11.3     0.324     0.452 coefficient     
##  8 sexMale  0.694      0.0717     -5.09    0.603     0.799 coefficient     
##  9 Slghtl…  0.117      0.139     -15.5    NA        NA     zeta            
## 10 Slight…  1.21       0.133       1.42   NA        NA     zeta

tidy(coeftest(model_ordered))

## # A tibble: 4 x 5
##   term                      estimate std.error statistic  p.value
##   <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
## 1 bornYes                     -0.380    0.107      -3.54 4.00e- 4
## 2 sexMale                     -0.365    0.0717     -5.09 3.78e- 7
## 3 childs                      -0.112    0.0225     -4.99 6.41e- 7
## 4 pray2At least once a week   -0.961    0.0850    -11.3  5.52e-29

Because these are regular data frames, you can use things like filter() on them to look at specific coefficients, or to get rid of the cutpoints (e.g. Liberal|Extremely liberal, etc.). Notice how there’s a column named coefficient_type. If you filter that to only look at coefficients, you’ll get rid of the cutpoints:

tidy(model_ordered, conf.int = TRUE, exponentiate = TRUE) %>% 
  filter(coefficient_type == "coefficient")

## # A tibble: 4 x 7
##   term     estimate std.error statistic conf.low conf.high coefficient_type
##   <chr>       <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>           
## 1 bornYes     0.684    0.107      -3.54    0.554     0.844 coefficient     
## 2 childs      0.894    0.0225     -4.99    0.855     0.934 coefficient     
## 3 pray2At…    0.382    0.0850    -11.3     0.324     0.452 coefficient     
## 4 sexMale     0.694    0.0717     -5.09    0.603     0.799 coefficient

# This filtering is slightly different, since there's no coefficient_type
# column. Instead, we remove any rows where the | character is found.
# We can also exponentiate the estimate column by hand
tidy(coeftest(model_ordered)) %>% 
  filter(!str_detect(term, "\\|")) %>% 
  mutate(estimate = exp(estimate))

## # A tibble: 4 x 5
##   term                      estimate std.error statistic  p.value
##   <chr>                        <dbl>     <dbl>     <dbl>    <dbl>
## 1 bornYes                      0.684    0.107      -3.54 4.00e- 4
## 2 sexMale                      0.694    0.0717     -5.09 3.78e- 7
## 3 childs                       0.894    0.0225     -4.99 6.41e- 7
## 4 pray2At least once a week    0.382    0.0850    -11.3  5.52e-29

Phew. Here’s how to interpret these things:

People born in the US are 32% (1 - 0.684) less likely to report being extremely liberal (the final category) vs. the other six categories, and this is statistically significant (p < 0.001).
As the number of children increases by one, the odds of someone being extremely liberal vs. the other six categories decreases by 11% (1 - 0.89). This is statistically significant (p < 0.001).
Those who pray at least once a week are 62% (1 - 0.38) less likely to be extremely liberal vs. the other six categories. This is statistically significant (p < 0.001).
Males are 31% (1 - 0.69) less likely to report being extremely liberal compared to the other six categories. This is statistically significant (p < 0.001).

How to translate ModernDive regression code into regular regression code

Andrew Heiss

Quick reference summary

Linear regression

Logistic regression

Ordered logistic regression

Linear regression

Logistic regression

Ordered logistic regression