Hi everyone!
As you’ve done R work in Dr. Witesman’s class this semester, you may have noticed that she uses a slightly different flavor of R (i.e. no knitting R Markdown files or making RStudio Projects, using summary()
, etc.) that you might not feel as comfortable with. This document shows how to get the results from logistic regression and ordered logistic regression into the same format you’re familiar with from our class.
I’ll use a dataset you’re familiar with from last semester: the 2016 General Social Survey (GSS), a biennial nationally representative survey with a comprehensive set of questions about all sort of trends in American life. I’ve shrunk it down for you so it only includes a handful of variables.
# MASS needs to come before tidyverse because MASS has its own select() function
# that will overwrite the one you get from tidyverse. If you load MASS after
# tidyverse you won't be able to use select() to choose columns
library(MASS) # For polr()
library(tidyverse) # For dplyr, ggplot2, and family
library(broom) # For converting models to nice clean data frames
library(moderndive) # ModernDive stuff
# Data from the 2012 GSS
# This readRDS(gzcon(url())) incantation will read this file directly from the internet
gss <- readRDS(gzcon(url("https://statsf18.classes.andrewheiss.com/data/gss.rds")))
Use tidy()
and glance()
if you want output similar to ModernDive.
Use this when your Y / outcome / dependent variable is numeric / continuous.
library(tidyverse) # For dplyr, ggplot2, and friends
library(broom) # Convert models to nice clean data frames
model_linear <- lm(y ~ x1 + x2 + x3,
data = whatever)
tidy(model_linear, conf.int = TRUE) # This is the same as get_regression_table()
glance(model_linear) # This is the same as get_regression_summaries()
Use this when your Y / outcome / dependent variable is binary.
library(tidyverse) # For dplyr, ggplot2, and friends
library(broom) # Convert models to nice clean data frames with tidy() and glance()
library(DescTools) # For PseudoR2()
model_logit <- glm(y ~ x1 + x2 + x3, data = whatever,
family = binomial(link = "logit"))
tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)
glance(model_logit)
PseudoR2(model_logit)
Use this when your Y / outcome / dependent variable is categorical and the order of the categories matters.
library(MASS) # ¡¡¡Make sure you run this before library(tidyverse)!!!
library(tidyverse) # For dplyr, ggplot2, and friends
library(broom) # Convert models to nice clean data frames with tidy() and glance()
library(lmtest) # For p-values
library(DescTools) # For PseudoR2()
model_ordered_logit <- polr(y ~ x1 + x2 + x3,
data = whatever, Hess = TRUE)
tidy(model_ordered_logit, conf.int = TRUE, exponentiate = TRUE) %>%
filter(coefficient_type == "coefficient") # Ignore cutpoints
tidy(coeftest(model_ordered_logit)) # For p-values
glance(model_ordered_logit)
PseudoR2(model_ordered_logit)
In our class, we covered linear regression. You use this when your Y variable (or outcome variable, or dependent variable) is numeric.
Here, we’ll predict the number of kids someone has (childs
) based on level of education (educ
), whether they were born in the US (born
), marital status (marital2
; this is binary married/not married), and whether they pray once a week (pray2
; this is binary less than once a week/at least once a week).
After making the model, we showed the results of the model with get_regression_table()
from the ModernDive package, which shows you the coefficient, p-value, standard errors, and everything else.
We also used get_regression_summaries()
to get the \(R^2\) for the model.
model_linear <- lm(childs ~ educ + born + marital2 + pray2,
data = gss)
get_regression_table(model_linear)
## # A tibble: 5 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 3.56 0.169 21.1 0 3.23 3.89
## 2 educ -0.122 0.01 -12.6 0 -0.141 -0.103
## 3 bornYes -0.163 0.086 -1.89 0.058 -0.332 0.006
## 4 marital2Not marri… -0.62 0.058 -10.7 0 -0.733 -0.507
## 5 pray2At least onc… 0.546 0.066 8.24 0 0.416 0.676
get_regression_summaries(model_linear)
## # A tibble: 1 x 8
## r_squared adj_r_squared mse rmse sigma statistic p_value df
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.114 0.113 2.22 1.49 1.49 90.3 0 5
Following the template from our class, we can interpret all these numbers like this:
In Dr. Witesman’s class, you’ve been using the summary()
function instead of get_regression_table()
, like so:
summary(model_linear)
##
## Call:
## lm(formula = childs ~ educ + born + marital2 + pray2, data = gss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9725 -1.1374 -0.1635 0.9378 5.5778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.562498 0.168872 21.096 < 2e-16 ***
## educ -0.121541 0.009684 -12.551 < 2e-16 ***
## bornYes -0.163304 0.086254 -1.893 0.0584 .
## marital2Not married -0.619726 0.057708 -10.739 < 2e-16 ***
## pray2At least once a week 0.545644 0.066248 8.236 2.69e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.492 on 2801 degrees of freedom
## (61 observations deleted due to missingness)
## Multiple R-squared: 0.1142, Adjusted R-squared: 0.1129
## F-statistic: 90.27 on 4 and 2801 DF, p-value: < 2.2e-16
This returns a big hairy block of text that contains the same information as get_regression_table()
(minus the confidence intervals), and you can make the same interpretations with this.
An alternative way of getting these deatils out of the models is to use a function named tidy()
from the broom package. tidy()
takes any model object and converts it to a nice data frame. Behind the scenes, get_regression_table()
is actually using tidy()
.
For example, we can convert model_linear
to a data frame like so (make sure you’ve loaded the boom package with library(broom)
):
tidy(model_linear)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.56 0.169 21.1 8.33e-92
## 2 educ -0.122 0.00968 -12.6 3.43e-35
## 3 bornYes -0.163 0.0863 -1.89 5.84e- 2
## 4 marital2Not married -0.620 0.0577 -10.7 2.16e-26
## 5 pray2At least once a week 0.546 0.0662 8.24 2.69e-16
This looks almost identical to get_regression_table()
!. The only thing it’s missing is confidence intervals. You can turn those on inside tidy()
:
tidy(model_linear, conf.int = TRUE)
## # A tibble: 5 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.56 0.169 21.1 8.33e-92 3.23 3.89
## 2 educ -0.122 0.00968 -12.6 3.43e-35 -0.141 -0.103
## 3 bornYes -0.163 0.0863 -1.89 5.84e- 2 -0.332 0.00582
## 4 marital2Not mar… -0.620 0.0577 -10.7 2.16e-26 -0.733 -0.507
## 5 pray2At least o… 0.546 0.0662 8.24 2.69e-16 0.416 0.676
And voila: the same output you get from get_regression_table()
.
You can get the regression summary statistics using glance()
instead of get_regression_summaries()
:
glance(model_linear)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.114 0.113 1.49 90.3 2.88e-72 5 -5101. 10215.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
So, to translate the ModernDive commands to broom-based commands, follow this translation:
get_regression_table()
= tidy(..., conf.int = TRUE)
get_regression_summaries()
= glance(...)
It’s a good idea to switch away from get_regression_table()
and gang and use tidy()
, because the other types of models (logistic, ordered logistic, etc.) don’t actually work with get_regression_table()
. They do however work perfectly with tidy()
and glance()
.
Let’s see how to use tidy()
and glance()
with logistic regression. Here our outcome variable / dependent variable / Y is a binary variable showing who each respondent voted for in the 2012 presidential election. We’ll predict who people voted for based on the number of children they have (childs
), marital status (marital2
; this is binary married/not married), whether they pray once a week (pray2
; this is binary less than once a week/at least once a week), and whether they identify as conservative or not (conservative
; this is binary conservative/not conservative):
model_logit <- glm(pres12 ~ childs + marital2 + pray2 + conservative,
family = binomial(link = "logit"), data = gss)
In Dr. Witesman’s class, you’ve been running these commands to see the model details:
summary(model_logit)
##
## Call:
## glm(formula = pres12 ~ childs + marital2 + pray2 + conservative,
## family = binomial(link = "logit"), data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7711 -0.6415 -0.4929 0.7245 2.2490
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.11036 0.19834 5.598 0.0000000217 ***
## childs -0.04374 0.04237 -1.032 0.302
## marital2Not married -0.65748 0.12832 -5.124 0.0000002994 ***
## pray2At least once a week 0.22439 0.16121 1.392 0.164
## conservativeNot conservative -2.59250 0.12978 -19.977 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2116.6 on 1624 degrees of freedom
## Residual deviance: 1560.1 on 1620 degrees of freedom
## (1242 observations deleted due to missingness)
## AIC: 1570.1
##
## Number of Fisher Scoring iterations: 4
As you’ve learned, these values are log odds and are fairly uninterpretable on their own. To make them more interpretable, you need to unlog them by rasing \(e\) to the power of each coefficient (e.g. \(e^{-0.04374}\), \(e^{-0.65748}\), etc.). You’ve done this with this command:
exp(coef(model_logit))
## (Intercept) childs
## 3.03543824 0.95720053
## marital2Not married pray2At least once a week
## 0.51815593 1.25156097
## conservativeNot conservative
## 0.07483271
You can then interpret these coefficients like you’ve learned in your new class.
Rather than use a combination of summary()
and exp(coef())
to get these numbers, you can use tidy()
to get all the results into one table that looks identical to what you worked with last semester. You can even exponentiate the coefficients at the same time:
tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 5 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.04 0.198 5.60 2.17e- 8 2.06 4.49
## 2 childs 0.957 0.0424 -1.03 3.02e- 1 0.881 1.04
## 3 marital2Not mar… 0.518 0.128 -5.12 2.99e- 7 0.402 0.666
## 4 pray2At least o… 1.25 0.161 1.39 1.64e- 1 0.914 1.72
## 5 conservativeNot… 0.0748 0.130 -20.0 8.77e-89 0.0578 0.0962
Here you have estimates, p-values, and confidence intervals all in one nice clean table. Now you can interpret the coefficients:
You can get model details with glance()
. However, R won’t add the pseudo \(R^2\) here, since many in the R community feel like it’s not a super useful number:
glance(model_logit)
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 2117. 1624 -780. 1570. 1597. 1560. 1620
To get the pseudo \(R^2\) you’ll need to use the PseudoR2()
function like Dr. Witesman showed you:
library(DescTools)
PseudoR2(model_logit)
## McFadden
## 0.2820392
As you’ve learned in Dr. Witesman’s class, you use ordered logistic regression when your Y variable / outcome variable / dependent variable is categorical with categories where the order matters (e.g. “Disagree” > “Neutral” > “Agree”). You can’t use the glm()
function for these models—instead you use the polr()
function from the MASS library.
(Important sidenote: if you ever use MASS and tidyverse at the same time, make sure you run library(MASS)
before library(tidyverse)
. MASS provides a function named select()
that does… something… and if you load the MASS library after tidyverse, MASS’s select()
will overwrite the select()
you normally use for selecting columns like gss %>% select(marital, childs)
, and your code will break and you’ll be super sad.)
In this model, we’ll predict political views (polviews_ordered
; factor with seven levels from “Extremely conservative” to “Extremely liberal”) with whether they were born in the US (born
), their sex (sex
), the number of children they have (childs
), and whether they pray once a week (pray2
; this is binary less than once a week/at least once a week). Note the Hess = TRUE
. This is an underlying mathy thing (a Hessian matrix) that ordered logistic regression uses to run the model. If you leave this off (by default it’s Hess = TRUE
), your model will still run, but it’ll run twice behind the scenes to get the Hessian. It’s weird. Just go with it.
model_ordered <- polr(polviews_ordered ~ born + sex + childs + pray2,
data = gss, Hess = TRUE)
In Dr. Witesman’s code, you then run summary()
and exp(coef())
and PseudoR2()
to extract the relevant information from the model. You also have to run coeftest()
to calculate the p-values for the different coefficients (since the author of MASS feels like p-values are harmful and has purposely not included them).
# In Dr. Witesman's code it says to run library(AER), but loading AER actually
# loads a bunch of other packages for you. coeftest() lives inside the lmtest
# library, so you can just run library(lmtest) if you want
library(lmtest) # For getting p-values from polr objects
summary(model_ordered) # Get coefficients
## Call:
## polr(formula = polviews_ordered ~ born + sex + childs + pray2,
## data = gss, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## bornYes -0.3802 0.10726 -3.545
## sexMale -0.3652 0.07172 -5.092
## childs -0.1123 0.02250 -4.990
## pray2At least once a week -0.9614 0.08505 -11.305
##
## Intercepts:
## Value Std. Error t value
## Extrmly conservative|Conservative -4.6454 0.1673 -27.7639
## Conservative|Slghtly conservative -2.8955 0.1432 -20.2205
## Slghtly conservative|Moderate -2.1462 0.1387 -15.4712
## Moderate|Slightly liberal -0.4806 0.1325 -3.6264
## Slightly liberal|Liberal 0.1892 0.1331 1.4213
## Liberal|Extremely liberal 1.6546 0.1490 11.1077
##
## Residual Deviance: 9133.784
## AIC: 9153.784
## (158 observations deleted due to missingness)
exp(coef(model_ordered)) # Exponentiate coefficients
## bornYes sexMale
## 0.6837064 0.6940391
## childs pray2At least once a week
## 0.8937991 0.3823466
coeftest(model_ordered) # Get p-values for coefficients
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## bornYes -0.380227 0.107264 -3.5448 0.0003996 ***
## sexMale -0.365227 0.071721 -5.0923 0.0000003781 ***
## childs -0.112274 0.022498 -4.9904 0.0000006407 ***
## pray2At least once a week -0.961428 0.085047 -11.3047 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
PseudoR2(model_ordered) # Get pseudo R2
## McFadden
## 0.03831948
All those commands spit out a ton of output and it can look intimidating and hard to interpret. Fortunately, just like we did with linear regression and logistic regression, we can use tidy()
to convert all this to a nice data frame with confidence intervals and with exponentiated coefficients:
tidy(model_ordered, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 10 x 7
## term estimate std.error statistic conf.low conf.high coefficient_type
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 bornYes 0.684 0.107 -3.54 0.554 0.844 coefficient
## 2 childs 0.894 0.0225 -4.99 0.855 0.934 coefficient
## 3 Conser… 0.0553 0.143 -20.2 NA NA zeta
## 4 Extrml… 0.00961 0.167 -27.8 NA NA zeta
## 5 Libera… 5.23 0.149 11.1 NA NA zeta
## 6 Modera… 0.618 0.133 -3.63 NA NA zeta
## 7 pray2A… 0.382 0.0850 -11.3 0.324 0.452 coefficient
## 8 sexMale 0.694 0.0717 -5.09 0.603 0.799 coefficient
## 9 Slghtl… 0.117 0.139 -15.5 NA NA zeta
## 10 Slight… 1.21 0.133 1.42 NA NA zeta
The only thing this table is missing is p-values. We can convert the output of coeftest()
to a nice data frame with tidy()
as well:
tidy(coeftest(model_ordered))
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bornYes -0.380 0.107 -3.54 4.00e- 4
## 2 sexMale -0.365 0.0717 -5.09 3.78e- 7
## 3 childs -0.112 0.0225 -4.99 6.41e- 7
## 4 pray2At least once a week -0.961 0.0850 -11.3 5.52e-29
However, for whatever reason, you can’t exponentiate the values in tidy(coeftest(...))
or calculate confidence intervals. This means you still have to run both commands, but the output is a lot cleaner and easier to navigate (especially in an R Markdown file):
tidy(model_ordered, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 10 x 7
## term estimate std.error statistic conf.low conf.high coefficient_type
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 bornYes 0.684 0.107 -3.54 0.554 0.844 coefficient
## 2 childs 0.894 0.0225 -4.99 0.855 0.934 coefficient
## 3 Conser… 0.0553 0.143 -20.2 NA NA zeta
## 4 Extrml… 0.00961 0.167 -27.8 NA NA zeta
## 5 Libera… 5.23 0.149 11.1 NA NA zeta
## 6 Modera… 0.618 0.133 -3.63 NA NA zeta
## 7 pray2A… 0.382 0.0850 -11.3 0.324 0.452 coefficient
## 8 sexMale 0.694 0.0717 -5.09 0.603 0.799 coefficient
## 9 Slghtl… 0.117 0.139 -15.5 NA NA zeta
## 10 Slight… 1.21 0.133 1.42 NA NA zeta
tidy(coeftest(model_ordered))
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bornYes -0.380 0.107 -3.54 4.00e- 4
## 2 sexMale -0.365 0.0717 -5.09 3.78e- 7
## 3 childs -0.112 0.0225 -4.99 6.41e- 7
## 4 pray2At least once a week -0.961 0.0850 -11.3 5.52e-29
Because these are regular data frames, you can use things like filter()
on them to look at specific coefficients, or to get rid of the cutpoints (e.g. Liberal|Extremely liberal
, etc.). Notice how there’s a column named coefficient_type
. If you filter that to only look at coefficients, you’ll get rid of the cutpoints:
tidy(model_ordered, conf.int = TRUE, exponentiate = TRUE) %>%
filter(coefficient_type == "coefficient")
## # A tibble: 4 x 7
## term estimate std.error statistic conf.low conf.high coefficient_type
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 bornYes 0.684 0.107 -3.54 0.554 0.844 coefficient
## 2 childs 0.894 0.0225 -4.99 0.855 0.934 coefficient
## 3 pray2At… 0.382 0.0850 -11.3 0.324 0.452 coefficient
## 4 sexMale 0.694 0.0717 -5.09 0.603 0.799 coefficient
# This filtering is slightly different, since there's no coefficient_type
# column. Instead, we remove any rows where the | character is found.
# We can also exponentiate the estimate column by hand
tidy(coeftest(model_ordered)) %>%
filter(!str_detect(term, "\\|")) %>%
mutate(estimate = exp(estimate))
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bornYes 0.684 0.107 -3.54 4.00e- 4
## 2 sexMale 0.694 0.0717 -5.09 3.78e- 7
## 3 childs 0.894 0.0225 -4.99 6.41e- 7
## 4 pray2At least once a week 0.382 0.0850 -11.3 5.52e-29
Phew. Here’s how to interpret these things: