Overview

By the end of these workshops and your own reading, you should be able to

  • Describe what the likelihood of a model is
  • Use unexplained variance to assess model fit
  • Describe the limitations of \(R^2\) as a model fit statistic
  • Use the likelihood ratio test (F-test) to compare nested models

Download the exercises from the Week 3 folder on NOW and move them into your R project directory.

Why do we need model comparison?

“All models are wrong, but some are useful”

George E.P. Box

  • Models are statistical representations of a hypothesis.
  • We want to compare the usefulness of alternative models.
  • Can my model accurately predict future outcomes?
  • Parsimony: As complex as necessary, as simple as possible.

Ockham’s razor: the principle of parsimony

  • Is it worth assuming a more complex model?
  • Philosopher William of Ockham (1285-1347/49):

“Plurality should not be posited without necessity.”

  • Precedence is given to simplicity: of two competing theories, the simpler explanation is to be preferred.
  • We do not assume a more complex theory if there is insufficient evidence or reason to do so.

Ockham’s razor: examples

A psychologist is studying why a child is struggling in school. Two hypotheses are proposed:

  1. The child has a specific learning disability.
  2. The child is experiencing stress at home, affecting concentration.

Both could explain the behavior; which is more parsimonious?

  • The second hypothesis requires fewer assumptions (e.g. no need for neurological testing, just observable stressors).
  • Ockham’s razor favours the second hypothesis until further evidence suggests otherwise.



  • Another example: Skinner explained language acquisition through reinforcement and conditioning.
  • This was a more parsimonious explanation compared to Chomsky’s innate “language acquisition device.”
  • Skinner’s model was simpler and based on observable behaviour, though later evidence favoured Chomsky’s more complex model.

Statistical models

  • Model comparisons are based on how well a model can capture data.
  • In any statistical analysis, we assume our data are drawn from some probability distribution with a set of parameters.
  • This is what we mean by a statistical model: a probabilistic generative model of the statistical population, such as

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2), \mu_i = \beta_0 + \beta_1 \cdot x_i \]

  • We aim to find the best model of the population in our set of candidate models.

Model evaluation via likelihood

Probability is the chance of obtaining a hypothetical outcome given a particular model (hypothesis), such as a fair coin, i.e. \(\mathcal{P}(\text{data}|\theta)\); p-values work the same way, i.e. the probability of the data given the null hypothesis.

Likelihood is the plausibility of a parameter value, model, or hypothesis after having seen the data, i.e. \(\mathcal{L}(\theta|\text{data})\).

  • Likelihood tells us the compatibility of model and data.
  • If data are more likely under model \(\mathcal{M}_0\) than under model \(\mathcal{M}_1\), model \(\mathcal{M}_0\) is more compatible with the data.
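
As a minimal illustration of the distinction, a hypothetical coin example (not taken from the exercises): probability fixes the model and asks about hypothetical data, whereas likelihood fixes the data and asks about candidate parameter values.

# Probability: fix the model (a fair coin, theta = 0.5) and ask how probable
# a hypothetical outcome is, e.g. 7 heads out of 10 flips.
dbinom(7, size = 10, prob = 0.5)
# Likelihood: fix the observed data (7 heads out of 10 flips) and evaluate a
# range of candidate values of theta; the data are most compatible with
# values of theta near 0.7.
theta <- seq(0.1, 0.9, by = 0.1)
dbinom(7, size = 10, prob = theta)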

Probabilistic model

A possible model of each observed outcome \(y_i\) is

\[ y_i \sim \mathcal{N}(\mu, \sigma^2)\\ \text{for } i \in 1\dots n, \]

We are modelling \(y\) as normally distributed with two unknowns, the mean \(\mu\) and the variance \(\sigma^2\).

Probability

The probability of the observed values, \(y_1, y_2, y_3\dots y_n\), given assumed values for \(\mu\) and \(\sigma^2\):

\[ \mathcal{P}(y_1\dots y_n\mid\mu, \sigma^2) \]

The joint probability is their product (assuming all \(y\)s are conditionally independent of one another).

\[ \prod_{i=1}^n \mathcal{P}(y_i\mid \mu, \sigma^2) \]

Likelihood

Replace \(\mu\) and \(\sigma^2\) with their maximum likelihood estimates \(\hat\mu\) and \(\hat\sigma^2\):

\[ \prod_{i=1}^n \mathcal{P}(y_i\mid \hat\mu, \hat\sigma^2) \]

We use the logarithm because, for large \(n\), the product will be a very small number, since probabilities lie between 0 and 1 (\(0 < p < 1\)):

\[ \sum_{i=1}^n \log \mathcal{P}(y_i\mid \hat\mu, \hat\sigma^2) \]

The logarithm turns the product \(\prod\) into a sum \(\sum\) because log(.5) + log(.2) is the same as log(.5 * .2).
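
A quick numerical check of both points (with made-up values):

# The log of a product equals the sum of the logs.
log(0.5 * 0.2)
log(0.5) + log(0.2)
# For large n, the product of many probabilities underflows to zero,
# whereas the sum of their logs remains a usable (large negative) number.
set.seed(1)
p <- runif(1000) # 1000 made-up probabilities between 0 and 1
prod(p) # numerically 0 (underflow)
sum(log(p))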

Likelihood

model_0 <- lm(log_rt ~ 1, data = blomkvist)
mu_hat <- coef(model_0) # MLE of mu
sigma_hat <- sigma(model_0) # MLE of sigma
# log probability of an observed outcome for given MLEs.
logp <- dnorm(blomkvist$log_rt, mean = mu_hat, sd = sigma_hat, log = TRUE)
sum(logp) # sum of log probabilities is the log likelihood
[1] -17.6
logLik(model_0)
'log Lik.' -17.6 (df=2)
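
As a quick check that \(\hat\mu\) is indeed the maximum likelihood estimate, the summed log probability evaluated away from mu_hat should be lower (a sketch reusing mu_hat and sigma_hat from above; the shift of 0.5 is arbitrary):

# Log likelihood at the MLE vs at a shifted mean:
# the MLE gives the larger (less negative) value.
sum(dnorm(blomkvist$log_rt, mean = mu_hat, sd = sigma_hat, log = TRUE))
sum(dnorm(blomkvist$log_rt, mean = mu_hat + 0.5, sd = sigma_hat, log = TRUE))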

Work through exercise script: log_likelihood_1.R

\(R^2\): coefficient of determination

  • Ability of a model to predict the outcome variable.
  • Proportion of variance in the outcome attributed to the predictor(s).
  • Routinely used as measure of model fit.

\(R^2\) is the ratio of ESS (explained sum of squares) and TSS (total sum of squares):

\[ R^2 = \frac{\text{ESS}}{\text{TSS}} \]

We need to work out the RSS (residual sum of squares) first, which is based on the residuals.

Residuals

  • Residuals are the unexplained (residual) variance: error in the modelling results.
  • Distance between the observed outcome (\(y\)) and the model prediction (\(\hat{\mu}\))

\[ \epsilon = y - \hat{\mu} \]

  • The closer the residuals are to 0, the lower the prediction error.
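
For example, the residuals of the intercept-only model model_0 from above can be computed by hand and compared against R's residuals() (a minimal sketch):

# Residuals by hand: observed outcome minus model prediction ...
epsilon <- blomkvist$log_rt - predict(model_0)
# ... which matches the residuals extracted from the fitted model.
head(epsilon)
head(residuals(model_0))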

Residuals

Work through exercise script: residuals.R

Residual sum of squares

The RSS when using the maximum likelihood estimators is

\[ \begin{aligned} \text{RSS}&=\sum_{i=1}^n \left( y_i - (\hat\beta_0+\hat\beta_1\cdot x_i) \right)^2,\\ &=\sum_{i=1}^n \left( y_i-\hat{\mu}_i \right)^2 \end{aligned} \]

RSS is a measure of how much variance in the data is not explained by the model.

Residual sum of squares

# Model for this section: log_rt predicted by sex and age
# (this is the model whose R^2 is reported below)
model <- lm(log_rt ~ sex + age, data = blomkvist)
# RSS: total of unexplained variance
(rss <- sum(residuals(model)^2))
[1] 9.2

Explained sum of squares

ESS is the sum of squared differences between the predicted values and the sample mean.

\[ \text{ESS} = \sum_{i=1}^n(\hat\mu_i-\bar{y})^2 \]

Predicted mean value \(\hat\mu_i\) for each observation \(i\):

\[ \hat\mu_i = \hat\beta_0 + \hat\beta_1 \cdot x_{i} \]

(mu_hat <- predict(model))[1:10]
   1    2    3    4    5    6    7    8    9   10 
6.58 6.28 6.51 6.72 6.50 6.44 6.22 6.39 6.57 6.38 

\(\bar{y}\) is the mean of the sample.

(y_bar <- mean(blomkvist$log_rt))
[1] 6.42
# ESS: explained sum of squares
(ess <- sum( (mu_hat - y_bar)^2 ))
[1] 8.58

\(R^2\): coefficient of determination

\(R^2\) is calculated on the basis of the TSS, which is the sum of ESS and RSS:

\[ \underbrace{\sum_{i=1}^n(y_i-\bar{y})^2}_\text{TSS} = \underbrace{\sum_{i=1}^n(\hat\mu_i-\bar{y})^2}_\text{ESS} + \underbrace{\sum_{i=1}^n(y_i-\hat\mu_i)^2}_\text{RSS} \]

(tss <- ess + rss)
[1] 17.8
ess / tss
[1] 0.482
summary(model)$r.sq
[1] 0.482
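
As a check on this decomposition, the TSS can also be computed directly from the observed outcomes (reusing y_bar from above); it matches ess + rss up to rounding:

# TSS computed directly from the data
sum((blomkvist$log_rt - y_bar)^2)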

Work through exercise script: rsquared.R

Adjusted \(R^2\)

  • The value of \(R^2\) necessarily increases, or does not decrease, as we add more predictors to the model, even if the true values of the coefficients for these predictors are zero.
  • Overfitting: complex models explain more variance but may make overly optimistic predictions (Gelman, Hill, and Vehtari 2020).
  • Example: brain size and body mass for seven primate species (from McElreath 2016).

Adjusted \(R^2\)

To overcome this spurious increase in \(R^2\), the following adjustment is applied.

\[ \begin{aligned} R^2_\text{Adj} &= 1 - (1 - R^2) \cdot \underbrace{\frac{n-1}{n-K-1}}_{\text{penalty}}\\ \end{aligned} \]

  • A large number of parameters \(K\) reduces \(R^2_\text{Adj}\).
  • \(K\) has only a small effect for large \(n\).


\(R^2\) and Adjusted \(R^2\) with lm

# Specify models
model_1 <- lm(log_rt ~ sex, data = blomkvist)
model_2 <- lm(log_rt ~ sex + age, data = blomkvist)

From the lm objects, \(R^2\) and \(R_\text{Adj}^2\) can be obtained using summary

# Generate model summary
summary_model_1 <- summary(model_1)
summary_model_2 <- summary(model_2)

Because \(n \gg K\), the penalty term is (very) close to 1.0 and the adjustment is minimal.

Do exercise adjrsquared.R

summary_model_1$r.squared
[1] 0.0316
summary_model_2$r.squared
[1] 0.48
summary_model_1$adj.r.squared
[1] 0.0279
summary_model_2$adj.r.squared
[1] 0.476
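
The adjustment can also be reproduced by hand from \(R^2\), \(n\) and \(K\) (a sketch for model_2, which has \(K = 2\) predictors, sex and age):

n <- nrow(blomkvist)
K <- 2 # number of predictors in model_2 (sex and age)
r2 <- summary_model_2$r.squared
1 - (1 - r2) * (n - 1) / (n - K - 1) # should match adj.r.squared above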

Model comparison

Under the null hypothesis, the following two models are identical:

\[ \begin{aligned} \mathcal{M}_0:&\ \text{logrt}_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \beta_0\\ \mathcal{M}_1:&\ \text{logrt}_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_{i} \end{aligned} \]

  • Hence, \(R^2=0\) is only true if \(\beta_\text{sex}=0\).
  • Adding \(\text{sex}\) in \(\mathcal{M}_1\) does not explain any variance in the data that model \(\mathcal{M}_0\) isn’t already accounting for.
  • \(\mathcal{M}_0\) is nested in \(\mathcal{M}_1\): all the \(K_0\) predictors in \(\mathcal{M}_0\) are also present in \(\mathcal{M}_1\).

Model comparison

Null hypothesis that \(R^2=0\) for model_2 and model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037

Model comparison

Null hypothesis that \(R^2=0\) for model_2 and model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037
  • The RSS column shows the TSS (row 1, model_1) and the RSS of model_2 (row 2).
  • The RSS of model_1 is the TSS: in a model with no predictors, the RSS is identical to the TSS.
  • Sum of Sq is the ESS, or equivalently the difference between the two RSS values:
mu_hat <- predict(model_2)
y_bar <- mean(blomkvist$log_rt)
sumdiff <- mu_hat - y_bar
(ess <- sum(sumdiff^2))
[1] 0.561
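
Equivalently, Sum of Sq is the difference between the residual sums of squares of model_1 (which equals the TSS) and model_2:

# Difference of the two RSS values gives the Sum of Sq column (0.561).
sum(residuals(model_1)^2) - sum(residuals(model_2)^2)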

Model comparison using likelihood-ratio test

Null hypothesis that \(R^2=0\) for model_2 and model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037

Df and the bottom value of Res.Df are the degrees of freedom of the F-test:

K <- 1
n <- nrow(blomkvist)
c(K, n - K - 1)
[1]   1 264

The F-value is the ratio of ess and rss, each divided by its degrees of freedom:

rss <- sum(residuals(model_2)^2)
(f_stat <- (ess/K) / (rss/(n - K - 1)))
[1] 8.6

Model comparison using likelihood-ratio test

Null hypothesis that \(R^2=0\) for model_2 and model_1.

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
# Compare models
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
  Res.Df  RSS Df Sum of Sq   F Pr(>F)
1    265 17.8                        
2    264 17.2  1     0.561 8.6 0.0037

Pr(>F): probability of getting an F statistic greater than or equal to 8.601 in an F distribution with \(K\) and \(n - K - 1\) degrees of freedom.

pf(f_stat, K, n-K-1, lower.tail = FALSE)
[1] 0.00366

Model comparison can be reported as “Adding the predictor sex significantly improved the model fit; F(1, 264) = 8.6, p = .004.”

F ratio

Complete exercise modelcomparison.R

\[ \text{F} = \underbrace{\frac{\text{RSS}_0 - \text{RSS}_1}{\text{RSS}_1}}_\text{effect size} \cdot \underbrace{\frac{\text{df}_1}{\text{df}_0 -\text{df}_1}}_\text{sample size} = \frac{(\text{RSS}_0 - \text{RSS}_1)/(\text{df}_0 - \text{df}_1)}{\text{RSS}_1/\text{df}_1} \]

where \(\text{df}_1 = n - K_1 - 1\), \(K_1\) is the number of predictors in \(\mathcal{M}_1\), and \(\text{df}_0 = n - K_0 - 1\) analogously.
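
A sketch of the same F ratio computed from this general formula, using the residual degrees of freedom reported by R (with model_1 and model_2 as defined on the previous slides):

rss_0 <- sum(residuals(model_1)^2) # simpler model (intercept only)
rss_1 <- sum(residuals(model_2)^2) # model including sex
df_0 <- df.residual(model_1) # 265
df_1 <- df.residual(model_2) # 264
((rss_0 - rss_1) / (df_0 - df_1)) / (rss_1 / df_1) # F = 8.6, as in anova()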

Deviance

\[ \text{deviance} = -2 \cdot \log \hat{\mathcal{L}} \]

where \(\hat{\mathcal{L}}\) is the likelihood of the model evaluated at the maximum likelihood estimates.

Lower deviance indicates better model fit.

deviance(model_2)
[1] 17.2

For linear models, R's deviance() returns the RSS (a monotonic function of \(-2 \cdot \log \hat{\mathcal{L}}\)); for generalised linear models, the deviance generalises the RSS.

sum(residuals(model_2)^2)
[1] 17.2

Akaike and Bayesian Information Criterion

  • Estimators of prediction error: relative quality of statistical models for data.
  • Similar to \(R^2\), the log likelihood will always increase (and hence the deviance decrease) when adding predictors, even if the true value of the coefficient is zero.
  • Prevent overfitting by penalising models with more parameters: trade-off model fit (measured by \(\hat{\mathcal{L}}\)) and the number of parameters (\(K\)) in the model.

\[ \text{AIC} = 2 \cdot K - 2 \cdot \text{log}(\hat{\mathcal{L}})\\ \text{BIC} = \text{log}(n) \cdot K - 2\cdot \text{log}(\hat{\mathcal{L}})\\ \]

ll <- as.vector(logLik(model_2))
K <- 3 # beta_0, beta_sex, sigma^2
(2 * K) - 2 * ll # AIC
[1] 32.7
AIC(model_2)
[1] 32.7
n <- nrow(blomkvist) 
log(n) * K - 2 * ll # BIC
[1] 43.4
BIC(model_2)
[1] 43.4

Akaike and Bayesian Information Criterion


  • AIC chooses models with better predictive ability; favours more complex models as the sample size increases.
  • AIC is therefore useful when fit to data is more important than generalisations.
  • BIC has a stronger penalty for model complexity, which is important to prevent overfitting and for parsimony.

Comparing multiple nested models

… including the varying-intercepts model (model_3) and the varying-intercepts and varying-slopes model (model_4): comparing_multiple_nested_models.R

# Specify models
model_1 <- lm(log_rt ~ 1, data = blomkvist)
model_2 <- lm(log_rt ~ sex, data = blomkvist)
model_3 <- lm(log_rt ~ sex + age, data = blomkvist)
model_4 <- lm(log_rt ~ sex * age, data = blomkvist)
# Compare models
anova(model_1, model_2, model_3, model_4)
Analysis of Variance Table

Model 1: log_rt ~ 1
Model 2: log_rt ~ sex
Model 3: log_rt ~ sex + age
Model 4: log_rt ~ sex * age
  Res.Df   RSS Df Sum of Sq      F  Pr(>F)
1    265 17.78                            
2    264 17.22  1      0.56  15.97 8.4e-05
3    263  9.25  1      7.97 226.81 < 2e-16
4    262  9.20  1      0.05   1.37    0.24
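
The same candidate models can also be compared using the information criteria introduced above, which do not require the models to be nested (a sketch; output not shown):

# Information-criterion comparison of the candidate models;
# lower AIC / BIC indicates a better trade-off of fit and complexity.
AIC(model_1, model_2, model_3, model_4)
BIC(model_1, model_2, model_3, model_4)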

Watch out

  • Model comparisons don’t tell us which model is the best but which model is worse (i.e. “all models are wrong”).
  • Failing to find evidence for an alternative model does not mean that the simpler model is the best.
  • It also doesn’t mean that the assumption of the more complex model is not theoretically relevant,
  • or that both models are equally good.

The end

  • Comparing models with nested sets of predictors.
  • Evaluation is based on likelihood of data under the model.
  • Penalising more complex models by considering number of parameters.
  • Other comparison techniques exist for models that are not nested within each other, e.g. different probability models (Poisson vs negative binomial).
  • Cross-validation (e.g. “LOO”: Leave-one-out CV, k-fold CV, Monte Carlo CV).
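
For illustration only, a minimal leave-one-out cross-validation sketch comparing the out-of-sample prediction error of two of the models above (the helper name loo_mse is made up):

# Leave-one-out CV: refit the model n times, each time leaving out one
# observation, and predict the held-out log_rt.
loo_mse <- function(formula, data) {
  errors <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$log_rt[i] - predict(fit, newdata = data[i, ])
  })
  mean(errors^2) # mean squared prediction error; lower is better
}
loo_mse(log_rt ~ sex, data = blomkvist)
loo_mse(log_rt ~ sex + age, data = blomkvist)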

Reminders

Next week

  • Thom Baguley’s session on generalized regression models for binomial and count data.

Formative assessment

  • Data-analysis report using linear regression to address a psychologically relevant question.
  • One reproducible data analysis, hence 4 pages max (written in R Markdown).
  • Submit a single zip archive via Dropbox by 24th October 2025, 2pm.

References

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.