Overview

By the end of these workshops and your own reading, you should be able to

  • Describe what the likelihood of a model is
  • Use unexplained variance to assess model fit
  • Describe the limitations of \(R^2\) as a model fit statistic
  • Use the likelihood ratio test (F-test) to compare nested models

Why do we need model comparison?

“All models are wrong, but some are useful”

George E.P. Box

  • Models are statistical representations of a hypothesis.
  • We want to compare the usefulness of alternative models.
  • Can my model accurately predict future outcomes?
  • Parsimony: As complex as necessary, as simple as possible.

Ockham’s razor: the principle of parsimony

  • Is it worth assuming a more complex model?
  • Philosopher William of Ockham (1285-1347/49):

“Plurality should not be posited without necessity.”

  • Precedence to simplicity: of two competing theories, the simpler explanation is to be preferred.
  • We do not assume a more complex theory unless there is sufficient evidence for it.

Statistical models

  • Model comparisons are based on how well a model can capture data.
  • In any statistical analysis, we assume our data are drawn from some probability distribution with a set of parameters.
  • This is what we mean by a statistical model: a probabilistic generative model of the statistical population, such as

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2), \mu_i = \beta_0 + \beta_1 \cdot x_i \]

  • We aim to find the best model of the population in our set of candidate models.
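
To make this notation concrete, here is a minimal simulation sketch of such a generative model; the values of beta_0, beta_1, sigma and the predictor x are made up purely for illustration.

# Simulate data from y_i ~ N(mu_i, sigma^2) with mu_i = beta_0 + beta_1 * x_i
# (all parameter values and x are made-up illustration values)
set.seed(1)
n      <- 100
x      <- runif(n, 18, 80)                 # a hypothetical predictor (e.g. age)
beta_0 <- 6
beta_1 <- 0.005
sigma  <- 0.25
mu     <- beta_0 + beta_1 * x              # mu_i = beta_0 + beta_1 * x_i
y      <- rnorm(n, mean = mu, sd = sigma)  # y_i ~ N(mu_i, sigma^2)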

Model evaluation via likelihood

  • Are the data compatible with the model? That is, what is the probability of the data according to the model?
  • If the probability of observing the data is higher under model \(\mathcal{M}_0\) than under \(\mathcal{M}_1\), which model is more compatible with the data, and therefore which model is better?
  • The probability of the data according to the model is the model’s likelihood.

What’s the difference between probability and likelihood?

Probability

  • of a hypothetical outcome given a particular model (hypothesis) such as a fair coin, i.e. \(\mathcal{P}(\text{data}|\theta)\)

Likelihood

  • of a parameter value, model or hypothesis after having seen some data, i.e. \(\mathcal{L}(\theta|\text{data})\)
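
A minimal sketch of the distinction in R (the numeric values are made up for illustration): the same dnorm() call is read either as the probability of a hypothetical outcome for fixed parameters, or as the likelihood of candidate parameter values for a fixed observation.

# Probability: fix the parameters, ask how probable a hypothetical outcome is
dnorm(6.5, mean = 6.4, sd = 0.25)        # P(data | theta)
# Likelihood: fix the observed data point, evaluate candidate parameter values
mu_grid <- seq(6.0, 7.0, by = 0.1)
dnorm(6.5, mean = mu_grid, sd = 0.25)    # L(theta | data), one value per candidate mu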

Probabilistic model

A possible model of each observed outcome \(y_i\) is

\[ y_i \sim N(\mu, \sigma^2)\\ \text{for } i \in 1\dots n, \]

We are modelling \(y\) as normally distributed with two unknowns, the mean \(\mu\) and the variance \(\sigma^2\).

Likelihood

Probability of the observed values, \(y_1, y_2, y_3\dots y_n\), assuming values for \(\mu\) and \(\sigma^2\) (based on a particular model).

\[ P(y_1\dots y_n\mid\mu, \sigma^2) \]

If all \(y\)s are conditionally independent of one another, the joint probability is the product:

\[ \prod_{i=1}^n P(y_i\mid \mu, \sigma^2) \]

Likelihood

Replace \(\mu\) and \(\sigma^2\) with their maximum likelihood estimates \(\hat\mu\) and \(\hat\sigma^2\):

\[ \prod_{i=1}^n P(y_i\mid \hat\mu, \hat\sigma^2) \]

We use the logarithm because, for large \(n\), the product of many probabilities (each \(0 < p < 1\)) becomes a very small number and causes numeric underflow:

\[ \sum_{i=1}^n \log P(y_i\mid \hat\mu, \hat\sigma^2) \]

The logarithm turns the product \(\prod\) into a sum \(\sum\), because log(.5) + log(.2) is the same as log(.5 * .2), as the sketch below demonstrates.
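
A toy illustration of both points (these probabilities are made up and unrelated to the blomkvist data):

# Multiplying many small probabilities underflows, summing their logs does not
p <- rep(0.01, 400)
prod(p)              # underflows to 0 in double precision
sum(log(p))          # = 400 * log(0.01), perfectly representable
log(0.5) + log(0.2)  # identical to ...
log(0.5 * 0.2)       # ... the log of the product, log(0.1)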

Likelihood

model_0 <- lm(log_rt ~ 1, data = blomkvist)
mu_hat <- coef(model_0) # MLE of mu
sigma_hat <- sigma(model_0) # MLE of sigma
# log probability of an observed outcome for given MLEs.
logp <- dnorm(blomkvist$log_rt, mean = mu_hat, sd = sigma_hat, log = TRUE)
sum(logp) # sum of log probabilities is the log likelihood
[1] -17.6
logLik(model_0)
'log Lik.' -17.6 (df=2)

Work through exercise script: exercises/log_likelihood_1.R

\(R^2\): coefficient of determination

  • Ability of a model to predict the outcome variable.
  • Proportion of the variability in the outcome attributed to the predictor(s).
  • Routinely used as measure of model fit.

\(R^2\) is the ratio of ESS (explained sum of squares) and TSS (total sum of squares):

\[ R^2 = \frac{\text{ESS}}{\text{TSS}} \]

To compute these, we first need to work out the RSS (residual sum of squares), which is based on the residuals.

Work through exercise script: exercises/rsquared.R

Residuals

Residuals

  • Residuals are the unexplained (residual) variance: the error in the model's predictions.
  • A residual is the difference between the observed (\(y\)) and the predicted rt (\(\hat{\mu}\)):

\[ \epsilon = y - \hat{\mu} \]

  • The closer the residuals are to 0, the lower the prediction error.

Residuals

mutate(blomkvist, 
       mu_hat = predict(model),
       resid_1 = log_rt - mu_hat,
       resid_2 = residuals(model)) %>% 
  select(log_rt, mu_hat, resid_1, resid_2)  
# A tibble: 266 × 4
  log_rt mu_hat resid_1 resid_2
   <dbl>  <dbl>   <dbl>   <dbl>
1   6.55   6.58 -0.0286 -0.0286
2   6.15   6.28 -0.131  -0.131 
3   6.46   6.51 -0.0500 -0.0500
4   6.56   6.72 -0.153  -0.153 
5   6.41   6.50 -0.0890 -0.0890
# ℹ 261 more rows

Work through exercise script: exercises/residuals.R

Residual sum of squares

The sum of squared residuals when using the maximum likelihood estimators is

\[ \begin{aligned} \text{RSS} &= \sum_{i=1}^n \left( y_i - (\hat\beta_0+\hat\beta_1\cdot x_i) \right)^2,\\ &= \sum_{i=1}^n \left( y_i-\hat{\mu}_i \right)^2 \end{aligned} \]

RSS is a measure of how much variance in the data is not explained by the model; or the model’s lack of fit.

Residual sum of squares

# RSS: total of unexplained variance
(rss <- sum(residuals(model)^2))
[1] 9.2

Explained sum of squares

The ESS is the sum of squared differences between the predicted values and the sample mean.

\[ \text{ESS} = \sum_{i=1}^n(\hat\mu_i-\bar{y})^2 \]

Predicted mean value \(\hat\mu_i\) for each observation \(i\):

\[ \hat\mu_i = \hat\beta_0 + \hat\beta_1 \cdot x_{i} \]

(mu_hat <- predict(model))[1:10]
   1    2    3    4    5    6    7    8    9   10 
6.58 6.28 6.51 6.72 6.50 6.44 6.22 6.39 6.57 6.38 

\(\bar{y}\) is the mean of the sample.

(y_bar <- mean(blomkvist$log_rt))
[1] 6.42
# ESS: explained sum of squares
(ess <- sum( (mu_hat - y_bar)^2 ))
[1] 8.58

\(R^2\): coefficient of determination

\(R^2\) is calculated on the basis of the TSS, which is the sum of the ESS and the RSS:

\[ \underbrace{\sum_{i=1}^n(y_i-\bar{y})^2}_\text{TSS} = \underbrace{\sum_{i=1}^n(\hat\mu_i-\bar{y})^2}_\text{ESS} + \underbrace{\sum_{i=1}^n(y_i-\hat\mu_i)^2}_\text{RSS} \]

(tss <- ess + rss)
[1] 17.8
ess / tss
[1] 0.482
summary(model)$r.sq
[1] 0.482
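
As a cross-check, the TSS can also be computed directly from the observed data, using the objects defined above:

# TSS computed directly from the observed outcomes
sum((blomkvist$log_rt - y_bar)^2)  # same value as ess + rss (17.8)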

Adjusted \(R^2\)

  • The value of \(R^2\) necessarily increases, or does not decrease, as we add more predictors to the model, even if the true values of the coefficients for these predictors are zero.
  • Overfitting: Complex models explain more variance but may make overly optimistic predictions (Gelman, Hill, and Vehtari 2020).
  • Example: brain size and body mass for seven primate species (from McElreath 2016).

Adjusted \(R^2\)

Adjusted \(R^2\)

To overcome this spurious increase in \(R^2\), the following adjustment is applied.

\[ \begin{aligned} R^2_\text{Adj} &= 1 - (1 - R^2) \cdot \underbrace{\frac{n-1}{n-K-1}}_{\text{penalty}}\\ \end{aligned} \]

  • A large number of predictors \(K\) reduces \(R^2_\text{Adj}\).
  • \(K\) has only a small effect when \(n\) is large.

Adjusted \(R^2\)

\(R^2\) and Adjusted \(R^2\) with lm

# Specify models
model_1 <- lm(log_rt ~ sex, data = blomkvist)
model_2 <- lm(log_rt ~ sex + age, data = blomkvist)

From the lm objects, \(R^2\) and \(R_\text{Adj}^2\) can be obtained using summary

# Generate model summary
summary_model_1 <- summary(model_1)
summary_model_2 <- summary(model_2)

Because \(n \gg K\), the penalty term is (very) close to 1.0 and the adjustment is minimal.

Do exercise exercises/adjrsquared.R

summary_model_1$r.squared
[1] 0.0316
summary_model_2$r.squared
[1] 0.48
summary_model_1$adj.r.squared
[1] 0.0279
summary_model_2$adj.r.squared
[1] 0.476
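
As a check, the adjustment formula above can be applied by hand to the \(R^2\) of model_2 (which has \(K = 2\) predictors, sex and age):

n  <- nrow(blomkvist)
K  <- 2                                # predictors in model_2: sex and age
r2 <- summary_model_2$r.squared
1 - (1 - r2) * (n - 1) / (n - K - 1)   # matches summary_model_2$adj.r.squared (0.476)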

Model comparison

  • Under the null hypothesis, the following two models are identical:

\[ \begin{aligned} \mathcal{M}_0:&\ \text{logrt}_i \sim N(\mu_i, \sigma^2), \mu_i = \beta_0\\ \mathcal{M}_1:&\ \text{logrt}_i \sim N(\mu_i, \sigma^2), \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_{i} \end{aligned} \]

\(\text{ for } i \in 1\dots n\)

  • Hence, \(R^2=0\) holds only if \(\beta_\text{sex}=0\).
  • Under the null hypothesis, adding \(\text{sex}\) in \(\mathcal{M}_1\) does not explain any variance in the data that model \(\mathcal{M}_0\) isn't already accounting for.
  • \(\mathcal{M}_0\) is nested in \(\mathcal{M}_1\): all the \(K_0\) predictors in \(\mathcal{M}_0\) are also present in \(\mathcal{M}_1\).

Model comparison

We test the null hypothesis that \(R^2=0\), i.e. that model_2 explains no more variance than model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037

Model comparison

We test the null hypothesis that \(R^2=0\), i.e. that model_2 explains no more variance than model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037
  • The RSS column shows the TSS (the RSS of model_1) and the RSS of model_2.
  • The TSS is identical to the RSS of a model with no predictors, here model_1.
  • Sum of Sq is the ESS, i.e. the difference between the two RSS values:
mu_hat <- predict(model_2)
y_bar <- mean(blomkvist$log_rt)
sumdiff <- mu_hat - y_bar
(ess <- sum(sumdiff^2))
[1] 0.561

Model comparison

We test the null hypothesis that \(R^2=0\), i.e. that model_2 explains no more variance than model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037
  • Df is the number of added predictors (\(K\)); the bottom value of Res.Df is the residual degrees of freedom, \(n - K - 1\):
K <- 1
n <- nrow(blomkvist)
c(K, n - K - 1)
[1]   1 264
  • The F-value is the ratio of the ESS and the RSS, each divided by its degrees of freedom:
rss <- sum(residuals(model_2)^2)
(f_stat <- (ess/K) / (rss/(n - K - 1)))
[1] 8.6

Model comparison

We test the null hypothesis that \(R^2=0\), i.e. that model_2 explains no more variance than model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037
  • Pr(>F): the probability of obtaining an F statistic greater than or equal to the observed value (8.6) in an F distribution with \(K\) and \(n - K - 1\) degrees of freedom:
pf(f_stat, K, n-K-1, lower.tail = FALSE)
[1] 0.00366
  • Model comparison can be reported as “Adding the predictor sex significantly improved the model fit; F(1, 264) = 8.6, p<0.05.”

F ratio

Complete exercise exercises/modelcomparison.R

\[ \text{F} = \underbrace{\frac{\text{RSS}_0 - \text{RSS}_1}{\text{RSS}_1}}_\text{effect size} \cdot \underbrace{\frac{\text{df}_1}{\text{df}_0 -\text{df}_1}}_\text{sample size} = \frac{(\text{RSS}_0 - \text{RSS}_1)/(\text{df}_0 - \text{df}_1)}{\text{RSS}_1/\text{df}_1} \]

where \(\text{df}_1 = n - K_1 - 1\), \(K_1\) is the number of predictors in \(\mathcal{M}_1\), and \(\text{df}_0\) is defined analogously for \(\mathcal{M}_0\).
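
The same F statistic reported by anova() can be computed directly from the two models' RSS and residual degrees of freedom:

rss_0 <- sum(residuals(model_1)^2)  # RSS of the intercept-only model (= TSS)
rss_1 <- sum(residuals(model_2)^2)  # RSS of the model including sex
df_0  <- df.residual(model_1)       # n - K_0 - 1 = 265
df_1  <- df.residual(model_2)       # n - K_1 - 1 = 264
((rss_0 - rss_1) / (df_0 - df_1)) / (rss_1 / df_1)  # F = 8.6, as in anova()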

Deviance

\[ \text{D} = -2 \cdot \log \hat{\mathcal{L}} \]

where \(\hat{\mathcal{L}}\) is the likelihood of the model evaluated at the MLEs.

  • Lower deviance indicates better model fit
deviance(model_2)
[1] 17.2
  • For a normal linear model, R's deviance() returns the RSS, which ranks models in the same order as \(-2 \cdot \log \hat{\mathcal{L}}\).
sum(residuals(model_2)^2)
[1] 17.2

Akaike and Bayesian Information Criterion

  • Estimators of prediction error: relative quality of statistical models for data.
  • Like \(R^2\), the log likelihood will always increase (and the deviance decrease) when adding predictors, even if the true values of their coefficients are zero.
  • AIC and BIC prevent overfitting by penalising models with more parameters: they trade off model fit (measured by \(\hat{\mathcal{L}}\)) against the number of parameters (\(K\)) in the model.

\[ \text{AIC} = 2 \cdot K - 2 \cdot \text{log}(\hat{\mathcal{L}})\\ \text{BIC} = \text{log}(n) \cdot K - 2\cdot \text{log}(\hat{\mathcal{L}})\\ \]

ll <- as.vector(logLik(model_2))
K <- 3 # beta_0, beta_sex, sigma^2
(2 * K) - 2 * ll # AIC
[1] 32.7
AIC(model_2)
[1] 32.7
n <- nrow(blomkvist) 
log(n) * K - 2 * ll # BIC
[1] 43.4
BIC(model_2)
[1] 43.4

Akaike and Bayesian Information Criterion

  • Estimators of prediction error: relative quality of statistical models for data.
  • Like \(R^2\), the log likelihood will always increase (and the deviance decrease) when adding predictors, even if the true values of their coefficients are zero.
  • AIC and BIC prevent overfitting by penalising models with more parameters: they trade off model fit (measured by \(\hat{\mathcal{L}}\)) against the number of parameters (\(K\)) in the model.

\[ \text{AIC} = 2 \cdot K - 2 \cdot \text{log}(\hat{\mathcal{L}})\\ \text{BIC} = \text{log}(n) \cdot K - 2\cdot \text{log}(\hat{\mathcal{L}})\\ \]

  • AIC chooses models with better predictive ability; favours more complex models as the sample size increases.
  • AIC is therefore useful when fit to data is more important than generalisations.
  • BIC has a stronger penalty for model complexity, which is important to prevent overfitting and for parsimony.

Comparing multiple nested models

… including the varying-intercepts model (model_3) and the varying-intercepts and varying-slopes model (model_4).

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
model_3 <- lm(log_rt ~ sex + age, data = blomkvist)
model_4 <- lm(log_rt ~ sex * age, data = blomkvist)
# Compare models
anova(model_1, model_2, model_3, model_4)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
Model 3: log_rt ~ sex + age
Model 4: log_rt ~ sex * age
  Res.Df   RSS Df Sum of Sq      F  Pr(>F)
1    265 17.78                            
2    264 17.22  1      0.56  15.97 8.4e-05
3    263  9.25  1      7.97 226.81 < 2e-16
4    262  9.20  1      0.05   1.37    0.24

Check exercises/comparing_multiple_nested_models.R

Watch out

  • Model comparisons do not tell us which model is true or “best” in an absolute sense (i.e. “all models are wrong”).
  • Failing to find evidence for an alternative model does not mean that the simpler model is the best, nor does it mean that the assumptions of the more complex model are not theoretically relevant.

The end

  • Comparing models with nested sets of predictors.
  • Evaluation is based on the likelihood of the data under the model.
  • Penalising more complex models by taking the number of parameters into account.
  • Other comparison techniques exist for models that are not nested within each other, e.g. different probability models (Poisson vs negative binomial).
  • Cross-validation (e.g. “LOO”: Leave-one-out CV, k-fold CV, Monte Carlo CV).

References

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.