Introduction

This chapter is designed to teach you how to fit, interpret, and visualize regression models in R. We’ll cover:

  • Ordinary Least Squares (OLS): For predicting continuous outcomes like wages.
  • Logistic Regression (Logit): For predicting binary outcomes like policy support.
  • Multilevel Models: For fitting data with a multilevel or hierarchical structure (e.g., students nested within schools).

For each model, we’ll:

  1. Fit it with the appropriate estimator (lm(), glm(), or lmer())
  2. Build a clean, readable table of regression results with modelsummary::modelsummary()
  3. Draw a coefficient plot with modelsummary::modelplot() to show the size and significance of effects
  4. Draw a marginal effects plot with marginaleffects::plot_predictions() to show how predictions change with key variables

Both packages (modelsummary and marginaleffects) are written by the amazing Vincent Arel-Bundock, a political scientist at the University of Montreal.

Why This Matters

Interpreting regression models starts with traditional tables. But visualizations are often clearer and should supplement or even replace tables. Coefficient plots highlight what’s significant, while conditional/marginal effects plots show real-world impacts—like how much wages increase with age or how ideology shifts policy support.

Datasets

We’ll use:

  • Wage (from ISLR): Data on wages, education, job class, and age.
  • health22: Survey data on support for Medicaid expansion, with party affiliation and ideology.
  • movies: Movie ratings data from Rotten Tomatoes.
  • imm10.dta: From Kreft and de Leeuw’s Introducing Multilevel Modeling.

Let’s peek at them:

data(Wage)
glimpse(Wage)  # See Wage variables
## Rows: 3,000
## Columns: 11
## $ year       <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
## $ age        <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
## $ maritl     <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
## $ race       <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
## $ education  <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
## $ region     <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
## $ jobclass   <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
## $ health     <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
## $ logwage    <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,…
## $ wage       <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.115…
Wage |> tabyl(education, health_ins) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits=1)
##           education 1. Yes 2. No
##        1. < HS Grad   6.0% 15.7%
##          2. HS Grad  29.4% 39.1%
##     3. Some College  22.4% 20.0%
##     4. College Grad  25.4% 17.0%
##  5. Advanced Degree  16.9%  8.2%
modelsummary::datasummary_skim(Wage)
Numeric variables (the inline Histogram column is omitted here):

             Unique  Missing Pct.    Mean     SD     Min  Median     Max
year              7             0  2005.8    2.0  2003.0  2006.0  2009.0
age              61             0    42.4   11.5    18.0    42.0    80.0
logwage         508             0     4.7    0.4     3.0     4.7     5.8
wage            508             0   111.7   41.7    20.1   104.9   318.3

Categorical variables:

                                        N      %
maritl      1. Never Married          648   21.6
            2. Married               2074   69.1
            3. Widowed                 19    0.6
            4. Divorced               204    6.8
            5. Separated               55    1.8
race        1. White                 2480   82.7
            2. Black                  293    9.8
            3. Asian                  190    6.3
            4. Other                   37    1.2
education   1. < HS Grad              268    8.9
            2. HS Grad                971   32.4
            3. Some College           650   21.7
            4. College Grad           685   22.8
            5. Advanced Degree        426   14.2
region      1. New England              0    0.0
            2. Middle Atlantic       3000  100.0
            3. East North Central       0    0.0
            4. West North Central       0    0.0
            5. South Atlantic           0    0.0
            6. East South Central       0    0.0
            7. West South Central       0    0.0
            8. Mountain                 0    0.0
            9. Pacific                  0    0.0
jobclass    1. Industrial            1544   51.5
            2. Information           1456   48.5
health      1. <=Good                 858   28.6
            2. >=Very Good           2142   71.4
health_ins  1. Yes                   2083   69.4
            2. No                     917   30.6

OLS Models: Predicting Wages

Let’s predict wages using OLS with education, jobclass, and age. OLS assumes a linear, additive relationship—e.g., each additional year of age changes wages by a fixed amount.

Fitting the OLS Model

First, a quick look at the bivariate relationship between wage and age:

with(Wage, cor.test(wage, age))
## 
##  Pearson's product-moment correlation
## 
## data:  wage and age
## t = 10.923, df = 2998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1609777 0.2298147
## sample estimates:
##       cor 
## 0.1956372
fit_ols <- lm(wage ~ education + jobclass + age, data = Wage)

This model estimates how each predictor affects wages, assuming other factors are held constant.
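To make “held constant” concrete, here’s a minimal sketch (the grid values are illustrative): we predict wages for two hypothetical workers who differ only in education, with age and jobclass fixed.

# Two workers identical except for education (illustrative values)
new_workers <- data.frame(
  education = factor(c("2. HS Grad", "4. College Grad"), levels = levels(Wage$education)),
  jobclass  = factor("1. Industrial", levels = levels(Wage$jobclass)),
  age       = 40)
predict(fit_ols, newdata = new_workers)  # the gap reflects the education coefficients alone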

Summarizing the OLS Model

Here’s the base R summary of the model. It’s a bit cluttered and uses scientific notation, which can be hard to read.

summary(fit_ols)
## 
## Call:
## lm(formula = wage ~ education + jobclass + age, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.656  -20.043   -3.488   14.908  217.504 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 59.58000    3.24874  18.339  < 2e-16 ***
## education2. HS Grad         11.20088    2.47733   4.521 6.38e-06 ***
## education3. Some College    23.33036    2.61811   8.911  < 2e-16 ***
## education4. College Grad    38.38622    2.62045  14.649  < 2e-16 ***
## education5. Advanced Degree 62.91152    2.87492  21.883  < 2e-16 ***
## jobclass2. Information       4.51067    1.38091   3.266   0.0011 ** 
## age                          0.55537    0.05725   9.701  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.89 on 2993 degrees of freedom
## Multiple R-squared:  0.2619, Adjusted R-squared:  0.2605 
## F-statistic:   177 on 6 and 2993 DF,  p-value: < 2.2e-16

Let’s try an alternative display of the model using display() from Gelman’s arm package. It’s a lot cleaner: no scientific notation, fewer significant digits, etc.

# arm::display(fit_ols)

Stargazer is nice because it produces text and LaTeX output with a minimum of fuss, works well with multiple models, and looks pretty good. The problem is that its tables don’t play well with R Markdown. So I use it just to get quick output when I’m playing around with models, or when I’m working in LaTeX documents (which I’m transitioning away from, unless coauthors insist on it). It also appears to be an abandoned project (see the strange warning it prints on load), but it still works.

stargazer(fit_ols, type="text",digits=1)
## 
## =======================================================
##                                 Dependent variable:    
##                             ---------------------------
##                                        wage            
## -------------------------------------------------------
## education2. HS Grad                   11.2***          
##                                        (2.5)           
##                                                        
## education3. Some College              23.3***          
##                                        (2.6)           
##                                                        
## education4. College Grad              38.4***          
##                                        (2.6)           
##                                                        
## education5. Advanced Degree           62.9***          
##                                        (2.9)           
##                                                        
## jobclass2. Information                4.5***           
##                                        (1.4)           
##                                                        
## age                                   0.6***           
##                                        (0.1)           
##                                                        
## Constant                              59.6***          
##                                        (3.2)           
##                                                        
## -------------------------------------------------------
## Observations                           3,000           
## R2                                      0.3            
## Adjusted R2                             0.3            
## Residual Std. Error              35.9 (df = 2993)      
## F Statistic                   177.0*** (df = 6; 2993)  
## =======================================================
## Note:                       *p<0.1; **p<0.05; ***p<0.01

The Workhorse

We use the modelsummary() function from the modelsummary package to present regression results in a concise, professional format. It generates easy-to-read tables from fitted models, making interpretation straightforward.

Several helpful options enhance the readability and interpretability of our results:

  • stars = TRUE: Adds significance markers (asterisks) to regression coefficients to indicate statistical significance levels.

  • coef_omit: Specifies patterns to omit certain coefficients from the displayed output. In this case, we’re omitting standard deviations (sd), intercepts (Intercept), and other less substantively interesting terms. This helps focus the table on key explanatory variables of interest.

  • gof_omit: Removes goodness-of-fit statistics we aren’t discussing directly (the Bayesian Information Criterion (BIC), Akaike Information Criterion (AIC), Root Mean Squared Error (RMSE), Intraclass Correlation Coefficient (ICC), Restricted Maximum Likelihood (REML) criterion, and standard deviations), which keeps the table focused.

Additionally, factor variables—categorical predictors such as jobclass or education—are automatically managed by R. The software internally creates dummy variables for these factors, thus eliminating the need for manual dummy coding. Each factor level is compared to a baseline (reference) category, and coefficients on these dummy variables indicate how much the outcome (in this case, wages) changes relative to the baseline category.
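You can see this reference-category coding directly; a quick check in base R (nothing here is required for modelsummary, it just shows what R does with factors):

levels(Wage$education)                        # the first level is the baseline category
head(model.matrix(~ education, data = Wage))  # the dummy columns R builds behind the scenes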

Here’s how the code might look:

modelsummary(fit_ols,
             stars = TRUE,
             coef_rename = coef_rename,
             column_labels="Wages",
             coef_omit = "sd|SD|Intercept",
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")
                                 (1)
Education2. HS Grad           11.201***
                              (2.477)
Education3. Some College      23.330***
                              (2.618)
Education4. College Grad      38.386***
                              (2.620)
Education5. Advanced Degree   62.912***
                              (2.875)
Jobclass2. Information         4.511**
                              (1.381)
Age                            0.555***
                              (0.057)
Num.Obs.                    3000
R2                             0.262
R2 Adj.                        0.260
Log.Lik.                  -14994.296
F                            177.031
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Interpreting the Regression Table

  • Coefficients: Indicate the estimated change in wages associated with a one-unit increase in a predictor, holding the other predictors fixed. A positive coefficient means higher values of the predictor are associated with higher wages; a negative coefficient suggests the opposite. (A quick check tying these numbers back to the model object follows this list.)

  • Significance Stars: Represent the statistical significance level of each coefficient, helping quickly identify which predictors are robustly related to wages.

  • R-squared: Provides a measure of how effectively the regression model explains variation in wages. Values range from 0 to 1, with higher values indicating better explanatory power.
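To tie the table back to the fitted object, a quick check in base R; these reproduce the age coefficient, its confidence interval, and the R-squared reported above.

coef(fit_ols)["age"]        # estimated wage change per additional year of age
confint(fit_ols)["age", ]   # 95% confidence interval for that estimate
summary(fit_ols)$r.squared  # proportion of wage variation explained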

Coefficient Plot for OLS

Instead of a table, let’s visualize the coefficients. It will be easier to see significant predictors.

modelplot(fit_ols) +
  labs(title = "Coefficient Plot for Wage Prediction Model") +
  theme_bw()

Reading the Plot

  • Dots: The estimated effect size.
  • Lines: 95% confidence intervals. If an interval doesn’t cross zero, the effect is statistically significant at the 5% level.
  • Teaching Tip: This plot quickly shows which predictors (like education levels) have the biggest impact on wages.

Conditional/Marginal Effects for OLS

Let’s see how wages change with age, split by jobclass. We do this with plot_predictions() from the marginaleffects package.

The predictions displayed by plot_predictions() with the condition argument are “conditional” in the sense that they are computed at the values of the predictors in a constructed grid of “representative” values. Plots like these are often described as showing “marginal effects.”

This produces a ggplot object, so we can customize it with standard ggplot2 functions like labs() and theme_bw().

plot_predictions(fit_ols, condition = c("age", "jobclass")) +
  labs(title = "Predicted Wages by Age and Job Class",
       x = "Age", y = "Predicted Wage") +
  theme_bw()

What’s Happening Here?

  • Lines: Show predicted wages as age increases, one per job class.
  • Slope: The two lines are parallel: the model is additive, so age has the same estimated effect in both job classes, and the job-class coefficient simply shifts the line by a constant amount (the quick check below confirms this).
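Here is that check, a minimal sketch using avg_slopes() from marginaleffects (assuming a reasonably recent version of the package): it computes the average slope of age separately by job class, and in this additive model the two estimates coincide.

avg_slopes(fit_ols, variables = "age", by = "jobclass")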

This also works with transformed predictors, as the mtcars example below shows.

data(mtcars)

mpg_model_reg <- lm(mpg ~ hp, data = mtcars)
modelsummary(mpg_model_reg)
               (1)
(Intercept)   30.099
              (1.634)
hp            -0.068
              (0.010)
Num.Obs.      32
R2             0.602
R2 Adj.        0.589
AIC          181.2
BIC          185.6
Log.Lik.     -87.619
F             45.460
RMSE           3.74
mpg_model <- lm(mpg ~ log(hp), data = mtcars)
modelsummary(mpg_model)
               (1)
(Intercept)   72.640
              (6.004)
log(hp)      -10.764
              (1.224)
Num.Obs.      32
R2             0.720
R2 Adj.        0.711
AIC          170.0
BIC          174.4
Log.Lik.     -81.987
F             77.301
RMSE           3.14
plot_predictions(mpg_model_reg, condition="hp") +
  labs(title = "Predicted MPG by Horsepower",
       x = "Horsepower", y = "Predicted MPG") +
  theme_bw()

# x-axis on the original hp scale, predictions from the log(hp) model
plot_predictions(mpg_model, condition="hp") +
  labs(title = "Predicted MPG by Horsepower",
       x = "Horsepower", y = "Predicted MPG") +
  theme_bw()

Logit Models: Predicting Medicaid Expansion Support

Let’s switch to a binary outcome: support for Medicaid expansion. We’ll use the health22 dataset from my own research.

load("Data/survey/health22.Rdata")
glimpse(health22)  # See health22 variables
## Rows: 1,000
## Columns: 5
## $ party            <chr> "X", "D", "R", "R", "D", "X", "R", "D", "X", "X", "D"…
## $ st               <chr> "WI", "IL", "MI", "OR", "TX", "TN", "WI", "IL", "NE",…
## $ ideo7            <dbl> 4, 5, 6, 5, 2, 4, 4, 4, 3, 4, 6, 5, 7, 6, 4, 3, 6, 1,…
## $ medicaid_2       <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ prolifeprochoice <dbl> NA, 1, 0, 0, 0, NA, 0, 1, 1, 1, 0, 0, 1, 0, NA, 0, 0,…

Now, we’ll predict a yes/no outcome: support for Medicaid expansion (medicaid_2) using party and ideo7 (ideology, 1=liberal, 7=conservative).

Fitting Multiple Logit Models

We often want to fit multiple models, to show how each predictor works alone and together.

We’ll fit four versions:

  1. Just party.
  2. Just ideo7.
  3. Both together.
  4. An interaction between party and ideo7.

First, a quick reminder of how R lists work, since we’ll store the models in one:

z <- list(a = 1:3, b = c("x", "y"))
z
## $a
## [1] 1 2 3
## 
## $b
## [1] "x" "y"
med.glm <- list()
med.glm[["Party"]] <- glm(data = health22, medicaid_2 ~ party, family = binomial)
med.glm[["Ideology"]] <- glm(data = health22, medicaid_2 ~ ideo7, family = binomial)
med.glm[["Both"]] <- glm(data = health22, medicaid_2 ~ party + ideo7, family = binomial)
med.glm[["Interaction"]] <- glm(data = health22, medicaid_2 ~ party * ideo7, family = binomial)

Notice that I created a list of models. This is a great way to keep track of multiple models, especially when you want to summarize them together. I could have put each fit into a separate object, but this is cleaner.
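One payoff of the list approach: functions like sapply() and lapply() can operate on every model at once. A small sketch:

sapply(med.glm, AIC)    # compare fit across all four models
lapply(med.glm, coef)   # all four coefficient vectors in one go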

Summarizing Logit Models

We rename coefficients for clarity (e.g., partyR becomes “Republican”) and apply consistent options:

modelsummary(med.glm, 
             stars = TRUE,
             coef_rename = c("partyR" = "Republican", "partyX" = "Independent", "ideo7" = "Self Reported Ideology"),
             coef_omit = "sd|SD|Intercept",
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")
                                      Party      Ideology   Both       Interaction
Republican                           -1.415***              -0.879***  -0.123
                                     (0.172)                (0.208)    (0.588)
Independent                          -0.748***              -0.531*     0.896
                                     (0.207)                (0.213)    (0.764)
Self Reported Ideology                          -0.341***   -0.223***  -0.125+
                                                (0.042)     (0.051)    (0.070)
Republican:Self Reported Ideology                                      -0.176
                                                                       (0.116)
Independent:Self Reported Ideology                                     -0.361*
                                                                       (0.179)
Num.Obs.                              966        966         966        966
Log.Lik.                             -526.629   -526.448    -517.141   -514.426
F                                     33.830     65.343      26.983     17.788
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

You can save tables in PNG or Word format, too!

modelsummary(med.glm,output="Tables/medicaid_model.png",stars=T,
             coef_rename = c("partyR"="Republican","partyX"="Independent","ideo7"="Self Reported Ideology"),
              gof_omit = "BIC|AIC|SD|ICC|RMSE|REML",
              coef_omit = "sd|SD|Intercept")

modelsummary(med.glm,output="Tables/medicaid_model.docx",stars=T,
             coef_rename = c("partyR"="Republican","partyX"="Independent","ideo7"="Self Reported Ideology"),
              gof_omit = "BIC|AIC|SD|ICC|RMSE|REML",
              coef_omit = "sd|SD|Intercept")

Decoding the Output

  • Coefficients: These are on the log-odds scale. A negative coefficient means that group (or higher values of that predictor) is less likely to support Medicaid expansion. (See the conversion to odds ratios below.)
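If you want something more interpretable than log-odds but less involved than predicted probabilities, exponentiating the coefficients gives odds ratios (values below 1 mean lower odds of support). A minimal sketch for the “Both” model:

exp(coef(med.glm[["Both"]]))     # odds ratios
exp(confint(med.glm[["Both"]]))  # profile-likelihood CIs on the odds-ratio scale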

Coefficient Plot for Logit Models

We can visualize the coefficients for all four models in one plot. This is a great way to see how estimates shift across specifications.

modelplot(med.glm) +
  labs(title = "Coefficient Plot for Medicaid Expansion Support Models") +
  geom_vline(xintercept=0) +
  theme_bw()

What to Notice

  • Across Models: See how coefficients shift when we add predictors.

Marginal Effects for Logit

The biggest issue with the standard regression output is that it’s in log-odds, which is completely unintuitive. We want to know how likely someone is to support Medicaid expansion based on their party and ideology. Marginal effects plots for logit models show this.

For the “Both” model, let’s plot probabilities by ideology and party:

plot_predictions(med.glm[["Both"]], condition = c("ideo7", "party")) +
  labs(title = "Predicted Probability of Supporting Medicaid Expansion",
       x = "Ideology (1=Liberal, 7=Conservative)",
       y = "Probability of Supporting Medicaid Expansion") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_bw()

We clearly see that there is an independent effect of party and ideology on support for Medicaid expansion.

Now let’s try the interaction model.

plot_predictions(med.glm[["Interaction"]], condition = c("ideo7", "party")) +
  labs(title = "Predicted Probability of Supporting Medicaid Expansion",
       x = "Ideology (1=Liberal, 7=Conservative)",
       y = "Probability of Supporting Medicaid Expansion") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_bw()

Here, partisanship and ideology interact. The effect of ideology on support for Medicaid expansion is much stronger for Independents and Republicans than Democrats.
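We can put a number on that claim with avg_slopes() from marginaleffects (a sketch, assuming a recent version of the package): it averages the slope of ideo7 separately within each party.

avg_slopes(med.glm[["Interaction"]], variables = "ideo7", by = "party")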

Movie Example: Predicting Certified Fresh Movies

Let’s model whether movies are “certified fresh” (rating >= 8) with a fun dataset.

# Show a table of the data
#load("Data/bundock/movies.Rdata")
dat <- read_csv(file="Data/bundock/movies.csv")
## Rows: 58788 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): title, mpaa
## dbl (23): rownames, year, length, budget, rating, votes, r1, r2, r3, r4, r5,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(dat)
## Rows: 58,788
## Columns: 25
## $ rownames    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ title       <chr> "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$40,0…
## $ year        <dbl> 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, 1917…
## $ length      <dbl> 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10, 10…
## $ budget      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ rating      <dbl> 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4, 5.9…
## $ votes       <dbl> 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11, 12…
## $ r1          <dbl> 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4…
## $ r2          <dbl> 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0, 0.…
## $ r3          <dbl> 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5…
## $ r4          <dbl> 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 14.5,…
## $ r5          <dbl> 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5, 24.…
## $ r6          <dbl> 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44.5, 4…
## $ r7          <dbl> 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5, 24…
## $ r8          <dbl> 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, 4.5,…
## $ r9          <dbl> 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 14.5,…
## $ r10         <dbl> 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4.5, …
## $ mpaa        <chr> NA, NA, NA, NA, NA, NA, "R", NA, NA, NA, NA, NA, NA, NA, "…
## $ Action      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0…
## $ Animation   <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ Comedy      <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1…
## $ Drama       <dbl> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Documentary <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ Romance     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Short       <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1…
# show a table of certified fresh
dat |>
    group_by(rating) |>
    summarize(n = n()) |>
    arrange(desc(n))
## # A tibble: 91 × 2
##    rating     n
##     <dbl> <int>
##  1    6    1733
##  2    6.3  1729
##  3    6.2  1702
##  4    6.4  1632
##  5    6.1  1619
##  6    6.5  1595
##  7    5.7  1545
##  8    6.6  1538
##  9    6.8  1528
## 10    6.7  1489
## # ℹ 81 more rows

Preparing the Movies Dataset

We load and prep it like this:

movies <- dat |>
    mutate(style = case_when(Action == 1 ~ "Action",
                             Comedy == 1 ~ "Comedy",
                             Drama == 1 ~ "Drama",
                             TRUE ~ "Other"),
           style = factor(style),
           certified_fresh = rating >= 8) |>
    dplyr::filter(length < 240)

This creates a binary outcome (certified_fresh) and a genre variable (style).
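A quick sanity check that the recode behaved as expected:

movies |> count(style, certified_fresh)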

Fitting the Logit Model with Interaction

We include length, style, and their interaction:

mod <- glm(certified_fresh ~ length * style, data = movies, family = binomial)

Interaction 101

The * means length’s effect on certification varies by genre. Maybe long dramas do better than long comedies.
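For reference, the * is shorthand for main effects plus the interaction; the spelled-out formula below fits the identical model (mod_expanded is just an illustrative name).

mod_expanded <- glm(certified_fresh ~ length + style + length:style,
                    data = movies, family = binomial)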

Summarizing the Movie Model

modelsummary(mod, 
             stars = TRUE, 
             coef_omit = "sd|SD|Intercept", 
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")
                        (1)
length                  0.009***
                       (0.003)
styleComedy             2.284***
                       (0.302)
styleDrama              2.002***
                       (0.312)
styleOther              2.663***
                       (0.300)
length × styleComedy   -0.019***
                       (0.003)
length × styleDrama    -0.011***
                       (0.003)
length × styleOther    -0.017***
                       (0.003)
Num.Obs.            58662
Log.Lik.           -16065.412
F                      89.724
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Coefficient Plot for Movies

We could draw a coefficient plot here as before, but we leave it commented out and go straight to the marginal effects plot, which is easier to interpret with an interaction on the log-odds scale.

 # modelplot(mod) +
 #  labs(title = "Coefficient Plot for Movie Certification Model") +
 #  theme_bw()

Marginal Effects for Movies

plot_predictions(mod, condition = c("length", "style")) +
  labs(title = "Certified Fresh Movies by Length and Genre",
       x = "Length (minutes)",
       y = "Probability of Certified Fresh") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_bw()

The marginal effects plot shows how the probability of being certified fresh changes with length for each genre. Because of the interaction, the slopes can differ across genres (a quick check below puts numbers on this).
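Here is that check, again with avg_slopes() (assuming a recent marginaleffects version): the average effect of one additional minute of length, computed separately by genre.

avg_slopes(mod, variables = "length", by = "style")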

Multilevel Models

Multilevel (hierarchical) linear models extend simple regression to data with nested structures. In multilevel data, observations within the same cluster—such as students within a school—are correlated. Ordinary least squares (OLS) assumes independent errors; violating this assumption by ignoring clustering leads to underestimated standard errors.

Multilevel models address this by including random effects to model cluster-level variation. A random intercept model allows each cluster its own baseline outcome level, capturing between-cluster heterogeneity. A random slopes model additionally permits the effect of a predictor to vary across clusters, reflecting differential relationships between the predictor and outcome in each cluster. Fixed effects estimate the average relationships across the population, while random effects capture deviations at the cluster level.
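To preview what fitting these models looks like, here is a hedged sketch using lme4 and the imm10.dta data listed in the introduction. The variable names (math for the test-score outcome, homework for hours of homework, schid for the school identifier) and the file path are assumptions for illustration.

library(haven)  # read Stata files
library(lme4)   # multilevel models

imm10 <- read_dta("Data/imm10.dta")  # illustrative path

# Random intercepts: each school gets its own baseline math score
fit_ri <- lmer(math ~ homework + (1 | schid), data = imm10)

# Random slopes: the homework effect is also allowed to vary across schools
fit_rs <- lmer(math ~ homework + (1 + homework | schid), data = imm10)

modelsummary(list("Random intercepts" = fit_ri, "Random slopes" = fit_rs),
             stars = TRUE,
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")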