For the main problem of this chapter – divorce rate, in relation to marriate rate and median age at marriage, what if we started with a classical linear multivariate regression model?
If I were writing something longer, I’d take it step-by-step, perhaps starting with median age at marriage and then adding marriage rate.
library(rethinking)
library(ggplot2)
library(dplyr)
library(ggrepel)
library(ggplot2)
library(car)
library(effects)
data(WaffleDivorce, package = "rethinking")
d <- WaffleDivorce
I note in the predictors that we have SEs for both
Marriage and Divorce. I would want to use
these somehow.
str(d)
## 'data.frame': 50 obs. of 13 variables:
## $ Location : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Loc : Factor w/ 50 levels "AK","AL","AR",..: 2 1 4 3 5 6 7 9 8 10 ...
## $ Population : num 4.78 0.71 6.33 2.92 37.25 ...
## $ MedianAgeMarriage: num 25.3 25.2 25.8 24.3 26.8 25.7 27.6 26.6 29.7 26.4 ...
## $ Marriage : num 20.2 26 20.3 26.4 19.1 23.5 17.1 23.1 17.7 17 ...
## $ Marriage.SE : num 1.27 2.93 0.98 1.7 0.39 1.24 1.06 2.89 2.53 0.58 ...
## $ Divorce : num 12.7 12.5 10.8 13.5 8 11.6 6.7 8.9 6.3 8.5 ...
## $ Divorce.SE : num 0.79 2.05 0.74 1.22 0.24 0.94 0.77 1.39 1.89 0.32 ...
## $ WaffleHouses : int 128 0 18 41 0 11 0 3 0 133 ...
## $ South : int 1 0 0 1 0 0 0 0 0 1 ...
## $ Slaves1860 : int 435080 0 0 111115 0 0 0 1798 0 61745 ...
## $ Population1860 : int 964201 0 0 435450 379994 34277 460147 112216 75080 140424 ...
## $ PropSlaves1860 : num 0.45 0 0 0.26 0 0 0 0.016 0 0.44 ...
set.seed(12345) # make analysis reproducible
Use car::scatterplot() to give regression line, loess
smooth, and point identification. The id method allows to
use a function to determine which points to label, so I fit the linear
model first. The relationship looks nonlinear
mod <- lm(Divorce ~ Marriage, data = d)
notables <-
car::scatterplot(Divorce ~ Marriage, data = d,
id = list(n=5, method = abs(residuals(mod)), labels = d$Loc),
boxplots = FALSE,
regLine = TRUE, # wish: show the confidence band!
pch = 16, cex = 1.3,
xlab = "Marriage Rate",
ylab = "Divorce Rate")
Which points were identified?
notables
## AL DC ID ME ND
## 1 9 13 20 34
ggplot(data=d, aes(x = Marriage, y = Divorce)) +
stat_smooth(method = "lm", formula = y~x,
color = "firebrick4", fill = "firebrick",
alpha = 1/5, size = 2) +
stat_smooth(method = "loess", se = FALSE) +
geom_point(size = 2, color = "firebrick4", alpha = 1/2) +
geom_text_repel(data = d |> filter(Loc %in% names(notables)),
aes(label = Loc),
size = 5) +
ylab("Divorce rate") +
xlab("Marriage rate") +
coord_cartesian(ylim = c(5, 15)) +
theme_bw(base_size = 16) +
theme(panel.grid = element_blank())
The relationship looks reasonably linear. One point
(IDaho) is a large outlier!
mod <- lm(Divorce ~ MedianAgeMarriage, data = d)
notables <-
car::scatterplot(Divorce ~ MedianAgeMarriage, data = d,
id = list(n=5, method = "mahal", labels = d$Loc),
boxplots = FALSE,
regLine = TRUE, # wish: show the confidence band!
pch = 16, cex = 1.3,
xlab = "Marriage Rate",
ylab = "Divorce Rate")
notables
## AR DC ID ME UT
## 4 9 13 20 44
ggplot(data=d, aes(x = MedianAgeMarriage, y = Divorce)) +
stat_smooth(method = "lm", formula = y~x,
color = "firebrick4", fill = "firebrick",
alpha = 1/5, size = 2) +
stat_smooth(method = "loess", se = FALSE) +
geom_point(size = 2, color = "firebrick4", alpha = 1/2) +
geom_text_repel(data = d |> filter(Loc %in% names(notables)),
aes(label = Loc),
size = 5) +
ylab("Divorce rate") +
xlab("Median Age at Marriage") +
coord_cartesian(ylim = c(5, 15)) +
theme_bw(base_size = 16) +
theme(panel.grid = element_blank())
First, standardize the variables as McElrath did. I’m not sure
why.
I prefer to interpret a model in the scale of the raw variables, not
standardized \(\beta\)
coefficients.
d <-
d |>
mutate(div = scale(Divorce),
mar = scale(Marriage),
age = scale(MedianAgeMarriage))
The type I tests show signif effects of both variables, but the type
II tests show only an effect of age
div.mod1 <- lm(div ~ mar + age, data = d)
anova(div.mod1) # type I tests
## Analysis of Variance Table
##
## Response: div
## Df Sum Sq Mean Sq F value Pr(>F)
## mar 1 6.84 6.84 10.3 0.00238 **
## age 1 10.96 10.96 16.5 0.00018 ***
## Residuals 47 31.19 0.66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(div.mod1) # type II tests
## Anova Table (Type II tests)
##
## Response: div
## Sum Sq Df F value Pr(>F)
## mar 0.33 1 0.5 0.48359
## age 10.96 1 16.5 0.00018 ***
## Residuals 31.19 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmtest::coeftest(div.mod1)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97e-16 1.15e-01 0.00 1.00000
## mar -1.19e-01 1.68e-01 -0.71 0.48359
## age -6.83e-01 1.68e-01 -4.06 0.00018 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the fitted marginal relation for each variable, controlling for
the other. Both effects are negative in the full model; the effect of
age at marriage is obviously much stronger.
div.eff <- allEffects(div.mod1)
plot(div.eff)
Show the partial relation of Y to each X, controlling (adjusting)
both Y and X for the other variable in the model. The
slope of line in each plot is the slope in the full model containing
both mar and age.
avPlots(div.mod1,
ellipse=list(levels = 0.68),
id = list(labels = d$Loc))
Looks OK
car::qqPlot(div.mod1, id = list(labels = d$Loc))
## ID ME
## 13 20
notables <-influencePlot(div.mod1, id = list(labels = d$Loc))
Useful to examine the variable values to understand why cases are
influential. (I wish this was easier with the car
functions. )
d |> select(Loc, div:age) |>
filter(Loc %in% rownames(notables)) |>
cbind(notables) |>
arrange(desc(CookD))
## Loc div mar age StudRes Hat CookD
## ID ID -1.0918 1.4971 -2.2949 -3.66987 0.1285 0.523332
## ME ME 1.8190 -1.7415 0.2782 2.48900 0.1225 0.259642
## WY WY 0.3361 2.7873 -1.4908 -0.47499 0.1900 0.017937
## DC DC -1.8607 -0.6356 2.9317 0.09445 0.2883 0.001231
Fit the Bayesian model for the multiple regression,
m5.3, p. 133
m5.3 <- quap(
alist(
div ~ dnorm( mu , sigma ) , # model for divorce rate
mu <- a + bM*mar + bA*age , # multiple regression model for mu
a ~ dnorm( 0 , 0.2 ) , # prior for intercept
bM ~ dnorm( 0 , 0.5 ) , # prior for coef of marriage
bA ~ dnorm( 0 , 0.5 ) , # prior for coef of age
sigma ~ dexp( 1 )
) , data = d )
(p5.3 <- precis( m5.3 ))
## mean sd 5.5% 94.5%
## a -1.4e-06 0.097 -0.16 0.16
## bM -6.5e-02 0.151 -0.31 0.18
## bA -6.1e-01 0.151 -0.85 -0.37
## sigma 7.9e-01 0.078 0.66 0.91
the posterior means for the coefficients
round(p5.3[1:3, "mean"], 3)
## [1] 0.000 -0.065 -0.613
OLS coefficients
round(coef(div.mod1), 3)
## (Intercept) mar age
## 0.000 -0.119 -0.683
These differences are rather larger than I expected.