This chapter is designed to teach you how to fit, interpret, and visualize regression models in R. We'll cover OLS regression, logistic regression, and multilevel models.
For each model, we'll use:

- `modelsummary::modelsummary()` to present results in a clean table.
- `modelsummary::modelplot()` to show the size and significance of effects.
- `marginaleffects::plot_predictions()` to show how predictions change with key variables.

Both the `modelsummary` and `marginaleffects` packages are written by the amazing Vincent Arel-Bundock, a political scientist at the University of Montreal.
Interpreting regression models starts with traditional tables. But visualizations are often clearer and should supplement or even replace tables. Coefficient plots highlight what’s significant, while conditional/marginal effects plots show real-world impacts—like how much wages increase with age or how ideology shifts policy support.
We'll use:

- `Wage` (from `ISLR`): data on wages, education, job class, and age.
- `health22`: survey data on support for Medicaid expansion, with party affiliation and ideology.
- `movies`: movie ratings from Rotten Tomatoes.
- `imm10.dta`: from Kreft and de Leeuw's *Introducing Multilevel Modeling*.
Let’s peek at them:
data(Wage)
glimpse(Wage) # See Wage variables
## Rows: 3,000
## Columns: 11
## $ year <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
## $ age <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
## $ maritl <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
## $ race <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
## $ education <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
## $ region <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
## $ jobclass <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
## $ health <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
## $ logwage <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,…
## $ wage <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.115…
Wage |>
  tabyl(education, health_ins) |>
  adorn_percentages("col") |>
  adorn_pct_formatting(digits = 1)
## education 1. Yes 2. No
## 1. < HS Grad 6.0% 15.7%
## 2. HS Grad 29.4% 39.1%
## 3. Some College 22.4% 20.0%
## 4. College Grad 25.4% 17.0%
## 5. Advanced Degree 16.9% 8.2%
modelsummary::datasummary_skim(Wage)
| | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| year | 7 | 0 | 2005.8 | 2.0 | 2003.0 | 2006.0 | 2009.0 |
| age | 61 | 0 | 42.4 | 11.5 | 18.0 | 42.0 | 80.0 |
| logwage | 508 | 0 | 4.7 | 0.4 | 3.0 | 4.7 | 5.8 |
| wage | 508 | 0 | 111.7 | 41.7 | 20.1 | 104.9 | 318.3 |

| | | N | % |
|---|---|---|---|
| maritl | 1. Never Married | 648 | 21.6 |
| | 2. Married | 2074 | 69.1 |
| | 3. Widowed | 19 | 0.6 |
| | 4. Divorced | 204 | 6.8 |
| | 5. Separated | 55 | 1.8 |
| race | 1. White | 2480 | 82.7 |
| | 2. Black | 293 | 9.8 |
| | 3. Asian | 190 | 6.3 |
| | 4. Other | 37 | 1.2 |
| education | 1. < HS Grad | 268 | 8.9 |
| | 2. HS Grad | 971 | 32.4 |
| | 3. Some College | 650 | 21.7 |
| | 4. College Grad | 685 | 22.8 |
| | 5. Advanced Degree | 426 | 14.2 |
| region | 1. New England | 0 | 0.0 |
| | 2. Middle Atlantic | 3000 | 100.0 |
| | 3. East North Central | 0 | 0.0 |
| | 4. West North Central | 0 | 0.0 |
| | 5. South Atlantic | 0 | 0.0 |
| | 6. East South Central | 0 | 0.0 |
| | 7. West South Central | 0 | 0.0 |
| | 8. Mountain | 0 | 0.0 |
| | 9. Pacific | 0 | 0.0 |
| jobclass | 1. Industrial | 1544 | 51.5 |
| | 2. Information | 1456 | 48.5 |
| health | 1. <=Good | 858 | 28.6 |
| | 2. >=Very Good | 2142 | 71.4 |
| health_ins | 1. Yes | 2083 | 69.4 |
| | 2. No | 917 | 30.6 |
Let's predict wages using OLS with `education`, `jobclass`, and `age`. OLS assumes a linear relationship—e.g., more education increases wages by a fixed amount.
with(Wage,cor.test(wage,age))
##
## Pearson's product-moment correlation
##
## data: wage and age
## t = 10.923, df = 2998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1609777 0.2298147
## sample estimates:
## cor
## 0.1956372
fit_ols <- lm(wage ~ education + jobclass + age, data = Wage)
This model estimates how each predictor affects wages, assuming other factors are held constant.
Here’s the base R summary of the model. It’s a bit cluttered and uses scientific notation, which can be hard to read.
summary(fit_ols)
##
## Call:
## lm(formula = wage ~ education + jobclass + age, data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106.656 -20.043 -3.488 14.908 217.504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.58000 3.24874 18.339 < 2e-16 ***
## education2. HS Grad 11.20088 2.47733 4.521 6.38e-06 ***
## education3. Some College 23.33036 2.61811 8.911 < 2e-16 ***
## education4. College Grad 38.38622 2.62045 14.649 < 2e-16 ***
## education5. Advanced Degree 62.91152 2.87492 21.883 < 2e-16 ***
## jobclass2. Information 4.51067 1.38091 3.266 0.0011 **
## age 0.55537 0.05725 9.701 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.89 on 2993 degrees of freedom
## Multiple R-squared: 0.2619, Adjusted R-squared: 0.2605
## F-statistic: 177 on 6 and 2993 DF, p-value: < 2.2e-16
Let's try an alternative display of the model from Gelman's `arm` package: `display()`. It's a lot better (no scientific notation, fewer significant digits, etc.).
# arm::display(fit_ols)
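If you prefer the coefficients as a data frame (handy for custom plots), `broom::tidy()` is another option. A quick sketch, assuming the `broom` package is installed:

library(broom)

# Coefficients, standard errors, and confidence intervals as a data frame
tidy(fit_ols, conf.int = TRUE)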
Stargazer is nice because it produces text and LaTeX output with a minimum of fuss, works well with multiple models, and looks pretty good. The problem is that these tables don't really work well with R Markdown. So I use it just to get some quick output when I'm playing around with models, or when I work with LaTeX documents (which I'm transitioning away from, unless coauthors insist on it). Moreover, it appears to be an abandoned project (see the strange warning on top). Still works, though.
stargazer(fit_ols, type="text",digits=1)
##
## =======================================================
## Dependent variable:
## ---------------------------
## wage
## -------------------------------------------------------
## education2. HS Grad 11.2***
## (2.5)
##
## education3. Some College 23.3***
## (2.6)
##
## education4. College Grad 38.4***
## (2.6)
##
## education5. Advanced Degree 62.9***
## (2.9)
##
## jobclass2. Information 4.5***
## (1.4)
##
## age 0.6***
## (0.1)
##
## Constant 59.6***
## (3.2)
##
## -------------------------------------------------------
## Observations 3,000
## R2 0.3
## Adjusted R2 0.3
## Residual Std. Error 35.9 (df = 2993)
## F Statistic 177.0*** (df = 6; 2993)
## =======================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We use the `modelsummary()` function from the `modelsummary` package to present regression results in a concise, professional format. It generates easy-to-read tables from statistical models, making interpretation straightforward.
Several helpful options enhance the readability and interpretability of the results:

- `stars = TRUE`: adds asterisks to regression coefficients to indicate statistical significance levels.
- `coef_omit`: omits coefficients matching a pattern from the displayed output. Here we omit standard deviations (`sd`) and intercepts (`Intercept`), which focuses the table on the key explanatory variables of interest.
- `gof_omit`: removes goodness-of-fit statistics we aren't directly discussing: the Bayesian Information Criterion (BIC), Akaike Information Criterion (AIC), Root Mean Squared Error (RMSE), Intraclass Correlation Coefficient (ICC), Restricted Maximum Likelihood (REML), and standard deviations. This simplifies the presented output.
Additionally, factor variables (categorical predictors such as `jobclass` or `education`) are handled automatically by R. The software internally creates dummy variables for these factors, eliminating the need for manual dummy coding. Each factor level is compared to a baseline (reference) category, and the coefficient on each dummy variable indicates how much the outcome (here, wages) changes relative to that baseline.
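To see the dummy coding R constructs under the hood, you can inspect the design matrix directly. A quick sketch using base R's `model.matrix()`:

# The dummy columns R creates for the education factor; the first
# level ("1. < HS Grad") is the reference, absorbed into the intercept
head(model.matrix(~ education, data = Wage))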
Here’s how the code might look:
modelsummary(fit_ols,
stars = TRUE,
coef_rename = coef_rename,
column_labels="Wages",
coef_omit = "sd|SD|Intercept",
gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")
| | (1) |
|---|---|
| Education2. HS Grad | 11.201*** |
| | (2.477) |
| Education3. Some College | 23.330*** |
| | (2.618) |
| Education4. College Grad | 38.386*** |
| | (2.620) |
| Education5. Advanced Degree | 62.912*** |
| | (2.875) |
| Jobclass2. Information | 4.511** |
| | (1.381) |
| Age | 0.555*** |
| | (0.057) |
| Num.Obs. | 3000 |
| R2 | 0.262 |
| R2 Adj. | 0.260 |
| Log.Lik. | -14994.296 |
| F | 177.031 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
- Coefficients: the estimated change in wages associated with a one-unit increase in a predictor. A positive coefficient means higher values of the predictor are associated with higher wages; a negative coefficient suggests the opposite.
- Significance stars: the statistical significance level of each coefficient, helping quickly identify which predictors are robustly related to wages.
- R-squared: a measure of how effectively the model explains variation in wages. Values range from 0 to 1, with higher values indicating better explanatory power.
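If you want these fit statistics programmatically rather than reading them off a table, one option is `broom::glance()`. A quick sketch, assuming `broom` is installed:

# One-row data frame of fit statistics (R-squared, adjusted R-squared, F, ...)
broom::glance(fit_ols)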
Instead of a table, let’s visualize the coefficients. It will be easier to see significant predictors.
modelplot(fit_ols) +
labs(title = "Coefficient Plot for Wage Prediction Model") +
theme_bw()
Let's see how wages change with `age`, split by `jobclass`. We do this with `plot_predictions()` from the `marginaleffects` package.
The predictions displayed by `plot_predictions()` with the `condition` argument are "conditional" in the sense that they are conditional on the values of the predictors in a constructed grid of "representative" values. This is oftentimes described as a "marginal effect."
This produces a ggplot object, so we can customize it with standard ggplot2 functions like `labs()` and `theme_bw()`.
plot_predictions(fit_ols, condition = c("age", "jobclass")) +
labs(title = "Predicted Wages by Age and Job Class",
x = "Age", y = "Predicted Wage") +
theme_bw()
This works even with transformed predictors, such as `log(hp)` in the `mtcars` examples below.
data(mtcars)
mpg_model_reg <- lm(mpg ~ hp, data = mtcars)
modelsummary(mpg_model_reg)
| | (1) |
|---|---|
| (Intercept) | 30.099 |
| | (1.634) |
| hp | -0.068 |
| | (0.010) |
| Num.Obs. | 32 |
| R2 | 0.602 |
| R2 Adj. | 0.589 |
| AIC | 181.2 |
| BIC | 185.6 |
| Log.Lik. | -87.619 |
| F | 45.460 |
| RMSE | 3.74 |
mpg_model <- lm(mpg ~ log(hp), data = mtcars)
modelsummary(mpg_model)
| | (1) |
|---|---|
| (Intercept) | 72.640 |
| | (6.004) |
| log(hp) | -10.764 |
| | (1.224) |
| Num.Obs. | 32 |
| R2 | 0.720 |
| R2 Adj. | 0.711 |
| AIC | 170.0 |
| BIC | 174.4 |
| Log.Lik. | -81.987 |
| F | 77.301 |
| RMSE | 3.14 |
plot_predictions(mpg_model_reg, condition="hp") +
labs(title = "Predicted MPG by Horsepower",
x = "Horsepower", y = "Predicted MPG") +
theme_bw()
# x-axis shows hp, but predictions come from the log(hp) model
plot_predictions(mpg_model, condition="hp") +
labs(title = "Predicted MPG by Horsepower",
x = "Horsepower", y = "Predicted MPG") +
theme_bw()
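To compare the two specifications side by side, you can pass both models to `modelsummary()` as a named list. A quick sketch:

# Side-by-side comparison of the linear and log specifications
modelsummary(list("Linear hp" = mpg_model_reg, "Log hp" = mpg_model),
             stars = TRUE,
             gof_omit = "BIC|AIC|RMSE")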
Let's switch to a binary outcome: support for Medicaid expansion. We'll use the `health22` dataset from my own research.
load("Data/survey/health22.Rdata")
glimpse(health22) # See health22 variables
## Rows: 1,000
## Columns: 5
## $ party <chr> "X", "D", "R", "R", "D", "X", "R", "D", "X", "X", "D"…
## $ st <chr> "WI", "IL", "MI", "OR", "TX", "TN", "WI", "IL", "NE",…
## $ ideo7 <dbl> 4, 5, 6, 5, 2, 4, 4, 4, 3, 4, 6, 5, 7, 6, 4, 3, 6, 1,…
## $ medicaid_2 <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ prolifeprochoice <dbl> NA, 1, 0, 0, 0, NA, 0, 1, 1, 1, 0, 0, 1, 0, NA, 0, 0,…
Now we'll predict a yes/no outcome: support for Medicaid expansion (`medicaid_2`) using `party` and `ideo7` (ideology, 1 = liberal, 7 = conservative).

We often want to fit multiple models, to show how each predictor works alone and together. We'll try four versions: (1) just `party`; (2) just `ideo7`; (3) both together; and (4) an interaction between the two.
A quick aside: an R list can hold anything, including fitted model objects. For example:

z <- list(a = 1:3, b = c("x", "y"))
z
## $a
## [1] 1 2 3
##
## $b
## [1] "x" "y"
med.glm <- list()
med.glm[["Party"]] <- glm(data = health22, medicaid_2 ~ party, family = binomial)
med.glm[["Ideology"]] <- glm(data = health22, medicaid_2 ~ ideo7, family = binomial)
med.glm[["Both"]] <- glm(data = health22, medicaid_2 ~ party + ideo7, family = binomial)
med.glm[["Interaction"]] <- glm(data = health22, medicaid_2 ~ party * ideo7, family = binomial)
Notice that I created a list of models. This is a great way to keep track of multiple models, especially when you want to summarize them together. I could have put each fit into a separate object, but this is cleaner.
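As a sketch of an even more programmatic version of the same idea, you could keep the formulas themselves in a named list and fit them all with `lapply()`:

# Fit the same four models in one pass over a named list of formulas
forms <- list(
  Party       = medicaid_2 ~ party,
  Ideology    = medicaid_2 ~ ideo7,
  Both        = medicaid_2 ~ party + ideo7,
  Interaction = medicaid_2 ~ party * ideo7
)
med.glm <- lapply(forms, glm, data = health22, family = binomial)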
We rename coefficients for clarity (e.g., `partyR` becomes "Republican") and apply consistent options:
modelsummary(med.glm,
stars = TRUE,
coef_rename = c("partyR" = "Republican", "partyX" = "Independent", "ideo7" = "Self Reported Ideology"),
coef_omit = "sd|SD|Intercept",
gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")
| | Party | Ideology | Both | Interaction |
|---|---|---|---|---|
| Republican | -1.415*** | | -0.879*** | -0.123 |
| | (0.172) | | (0.208) | (0.588) |
| Independent | -0.748*** | | -0.531* | 0.896 |
| | (0.207) | | (0.213) | (0.764) |
| Self Reported Ideology | | -0.341*** | -0.223*** | -0.125+ |
| | | (0.042) | (0.051) | (0.070) |
| Republican:Self Reported Ideology | | | | -0.176 |
| | | | | (0.116) |
| Independent:Self Reported Ideology | | | | -0.361* |
| | | | | (0.179) |
| Num.Obs. | 966 | 966 | 966 | 966 |
| Log.Lik. | -526.629 | -526.448 | -517.141 | -514.426 |
| F | 33.830 | 65.343 | 26.983 | 17.788 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
You can save tables in PNG or Word format, too!
modelsummary(med.glm, output = "Tables/medicaid_model.png", stars = TRUE,
             coef_rename = c("partyR" = "Republican", "partyX" = "Independent", "ideo7" = "Self Reported Ideology"),
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML",
             coef_omit = "sd|SD|Intercept")

modelsummary(med.glm, output = "Tables/medicaid_model.docx", stars = TRUE,
             coef_rename = c("partyR" = "Republican", "partyX" = "Independent", "ideo7" = "Self Reported Ideology"),
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML",
             coef_omit = "sd|SD|Intercept")
We can visualize the coefficients for all four models in one plot. This is a great way to see how the estimates compare across models.
modelplot(med.glm) +
labs(title = "Coefficient Plot for Medicaid Expansion Support Models") +
geom_vline(xintercept=0) +
theme_bw()
The biggest issue with the standard regression output is that it’s in log-odds, which is completely unintuitive. We want to know how likely someone is to support Medicaid expansion based on their party and ideology. Marginal effects plots for logit models show this.
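One intermediate option, before moving to predicted probabilities, is to report odds ratios instead of log-odds; `modelsummary()` supports this through its `exponentiate` argument. A quick sketch:

# Exponentiated coefficients: odds ratios instead of log-odds
modelsummary(med.glm, exponentiate = TRUE, stars = TRUE,
             coef_omit = "sd|SD|Intercept",
             gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")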
For the “Both” model, let’s plot probabilities by ideology and party:
plot_predictions(med.glm[["Both"]], condition = c("ideo7", "party")) +
labs(title = "Predicted Probability of Supporting Medicaid Expansion",
x = "Ideology (1=Liberal, 7=Conservative)",
y = "Probability of Supporting Medicaid Expansion") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw()
We clearly see that there is an independent effect of party and ideology on support for Medicaid expansion.
Now let’s try the interaction model.
plot_predictions(med.glm[["Interaction"]], condition = c("ideo7", "party")) +
labs(title = "Predicted Probability of Supporting Medicaid Expansion",
x = "Ideology (1=Liberal, 7=Conservative)",
y = "Probability of Supporting Medicaid Expansion") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw()
Here, partisanship and ideology interact. The effect of ideology on support for Medicaid expansion is much stronger for Independents and Republicans than Democrats.
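To put numbers on those differing slopes, we can average the marginal effect of ideology within each party. A quick sketch with `avg_slopes()` from the `marginaleffects` package:

# Average marginal effect (slope) of ideology on support, by party
marginaleffects::avg_slopes(med.glm[["Interaction"]], variables = "ideo7", by = "party")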
Let’s model whether movies are “certified fresh” (rating >= 8) with a fun dataset.
# Load the movies data
#load("Data/bundock/movies.Rdata")
dat <- read_csv(file="Data/bundock/movies.csv")
## Rows: 58788 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): title, mpaa
## dbl (23): rownames, year, length, budget, rating, votes, r1, r2, r3, r4, r5,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(dat)
## Rows: 58,788
## Columns: 25
## $ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ title <chr> "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$40,0…
## $ year <dbl> 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, 1917…
## $ length <dbl> 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10, 10…
## $ budget <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ rating <dbl> 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4, 5.9…
## $ votes <dbl> 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11, 12…
## $ r1 <dbl> 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4…
## $ r2 <dbl> 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0, 0.…
## $ r3 <dbl> 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5…
## $ r4 <dbl> 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 14.5,…
## $ r5 <dbl> 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5, 24.…
## $ r6 <dbl> 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44.5, 4…
## $ r7 <dbl> 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5, 24…
## $ r8 <dbl> 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, 4.5,…
## $ r9 <dbl> 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 14.5,…
## $ r10 <dbl> 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4.5, …
## $ mpaa <chr> NA, NA, NA, NA, NA, NA, "R", NA, NA, NA, NA, NA, NA, NA, "…
## $ Action <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0…
## $ Animation <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ Comedy <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1…
## $ Drama <dbl> 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Documentary <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ Romance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Short <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1…
# show the distribution of ratings (certified fresh will be rating >= 8)
dat |>
group_by(rating) |>
summarize(n = n()) |>
arrange(desc(n))
## # A tibble: 91 × 2
## rating n
## <dbl> <int>
## 1 6 1733
## 2 6.3 1729
## 3 6.2 1702
## 4 6.4 1632
## 5 6.1 1619
## 6 6.5 1595
## 7 5.7 1545
## 8 6.6 1538
## 9 6.8 1528
## 10 6.7 1489
## # ℹ 81 more rows
We prep it like this:
# load("Data/bundock/movies.Rdata")
#
# glimpse(dat)
movies <- dat |>
mutate(style = case_when(Action == 1 ~ "Action",
Comedy == 1 ~ "Comedy",
Drama == 1 ~ "Drama",
TRUE ~ "Other"),
style = factor(style),
certified_fresh = rating >= 8) |>
dplyr::filter(length < 240)
This creates a binary outcome (`certified_fresh`) and a genre variable (`style`).

We include `length`, `style`, and their interaction:
mod <- glm(certified_fresh ~ length * style, data = movies, family = binomial)
The `*` means length's effect on certification varies by genre. Maybe long dramas do better than long comedies.
modelsummary(mod,
stars = TRUE,
coef_omit = "sd|SD|Intercept",
gof_omit = "BIC|AIC|SD|ICC|RMSE|REML")
| | (1) |
|---|---|
| length | 0.009*** |
| | (0.003) |
| styleComedy | 2.284*** |
| | (0.302) |
| styleDrama | 2.002*** |
| | (0.312) |
| styleOther | 2.663*** |
| | (0.300) |
| length × styleComedy | -0.019*** |
| | (0.003) |
| length × styleDrama | -0.011*** |
| | (0.003) |
| length × styleOther | -0.017*** |
| | (0.003) |
| Num.Obs. | 58662 |
| Log.Lik. | -16065.412 |
| F | 89.724 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
# modelplot(mod) +
# labs(title = "Coefficient Plot for Movie Certification Model") +
# theme_bw()
plot_predictions(mod, condition = c("length", "style")) +
labs(title = "Certified Fresh Movies by Length and Genre",
x = "Length (minutes)",
y = "Probability of Certified Fresh") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw()
The marginal effects plot shows how the probability of being certified fresh changes with length, for the different categories of movies. Notice the slopes can be different with this interaction effect.
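A quick sketch with `avg_slopes()` again quantifies this:

# Average marginal effect of length on P(certified fresh), by genre
marginaleffects::avg_slopes(mod, variables = "length", by = "style")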
Multilevel (hierarchical) linear models extend simple regression to data with nested structures. In multilevel data, observations within the same cluster—such as students within a school—are correlated. Ordinary least squares (OLS) assumes independent errors; violating this assumption by ignoring clustering leads to underestimated standard errors.
Multilevel models address this by including random effects to model cluster-level variation. A random intercept model allows each cluster its own baseline outcome level, capturing between-cluster heterogeneity. A random slopes model additionally permits the effect of a predictor to vary across clusters, reflecting differential relationships between the predictor and outcome in each cluster. Fixed effects estimate the average relationships across the population, while random effects capture deviations at the cluster level.
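Here is a minimal sketch of both specifications using `lme4`, assuming the `imm10.dta` data mentioned above contains `math` (outcome), `homework` (predictor), and `schid` (school identifier), as in Kreft and de Leeuw's example; the file path is also an assumption:

library(lme4)
library(haven)

imm10 <- read_dta("Data/imm10.dta")  # assumed location of the file

# Random intercepts: each school gets its own baseline math score
fit_ri <- lmer(math ~ homework + (1 | schid), data = imm10)

# Random slopes: the homework effect is also allowed to vary by school
fit_rs <- lmer(math ~ homework + (1 + homework | schid), data = imm10)

modelsummary(list("Random intercepts" = fit_ri, "Random slopes" = fit_rs))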