Load required packages.

library( tidyverse )
library( tidybayes )
# library( rethinking )
library( brms )
library( gridExtra )

Set up workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn off scientific notation of large numbers, set a standard theme for plotting.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
rm(list=ls())
set.seed(42)
options(scipen=10000)

## Warning in options(scipen = 10000): invalid 'scipen' 10000, used 9999

theme_set(theme_bw())

1 Concepts of the day: (1) Gaussian distribution and (2) regression to the mean

1.1 Teaching case

The Hand of God World Cup 1986: In an aerial battle with Peter Shilton, Maradona (1.66 meters) takes a hand to help. “It was a bit of Maradona’s head and a bit of God’s hand,” he said later.

In this session, we ask: can we use a player’s performance in aerial duels to predict his height?.

Let’s load some player data

d_player <- read.csv("players.csv") %>%                                     # load data
  filter( competition == "Premier League" & season == "2013") %>%           # filter for Premier League and Season
  select( name, height, duelAerialPerc )                                    # select some variables

head( d_player )

##             name height duelAerialPerc
## 1 Álvaro Negredo    186      0.3649635
## 2   Aaron Lennon    165      0.1666667
## 3   Aaron Ramsey    177      0.3750000
## 4 Abdoulaye Faye    188      0.6111111
## 5   Adam Johnson    175      0.3809524
## 6   Adam Lallana    173      0.3050847

Plot height and duelAerialPerc histograms

d_player %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_histogram()                       # As histogram

1.2 Gaussian distribution

First, let’s get back to our penalty kick problem.

We make the following assumptions:

The probability of scoring (i.e., goal) is 50% (like a coin flip)
Each trial is independent of the previous

We assume a player (red dot) has 4 trials (see layers below), where in each trial the goal probability is 50%. Right = goal; left = miss. Some rare players will end scoring 4 goals, some no goal but the central tendency is somewhere in the middle.

The idea was introduced by Francis Galton – the cousin of Charles Darwin – and is also known as Galton board.

1.3 Regression to the mean

What does it mean? Values – discrete like goals or continuous like height – fluctuate around the mean.

In the case of continuous variables such as height, we may observe extreme values; however, there is the tendency that they will be lower next observation.

Natural fluctuations tend to produce Gaussian – normal – distributions, that is, they regress to the mean (assuming observations are independent).

Excursus: “The Sports Illustrated Curse” In many cases, when an athlete appeared on the cover of Sports Illustrated, their performance got worse. What would be an explanation?

Back to normal (Gaussian distribution). They can be described by two parameters: it’s mean mu and its standard deviation sigma (or variance sigma^2)

Excursus: in this course, we will use various distributions. You can find a wonderful summery on distributions here: https://ben18785.shinyapps.io/distribution-zoo/

2 Linear regression

Teaching Case:

Purpose: Use football data to study how players’ heights can be modeled and predicted.
Data (Likelihood): The observed heights of Premier League players — our sample from the study population.
Prior: What we already know (or believe) about typical player heights before seeing the data.
Posterior: What we learn about player heights after combining prior beliefs with the observed data.

2.1 Model language

This is the general syntax shared by many models (regression, ANOVA, ANCOVA, t-test).

Recipe

Select the variables of interest. Some are observed (data), others are unobserved (parameters).
Define each variable either deterministically (as a function of other variables) or probabilistically (by a distribution).
The combination of variables and their distributions defines a joint generative model that can be used to
- simulate hypothetical observations and
- analyze real data.

A simple linear model

This is a very simple linear model, with an outcome y being normally distribute with mean mu and standard deviation sigma. It has one predictor, x. (red = variables, black = parameters)

\[ \begin{align*} \text{Data:} \quad & \color{red}{y_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \beta \times \color{red}{x_i} \\ \text{Priors:} \quad & \beta \sim \text{Normal}(0, 10) \\ & \sigma \sim \text{Exponential}(1) \end{align*} \]

We first start with a height model without predictor. We define heights as normally distributed with a mean and a standard deviation (this is the likelihood):

\[ h_i \sim \text{Normal}(\mu_i, \sigma) \]

Let’s add some priors to the model (red = variable; black = parameters):

\[ \begin{align*} \text{Data:} \quad & \color{red}{h_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Priors:} \quad & \mu_i \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 50) \end{align*} \] \[ \begin{align*} \text{Data:} \quad & h_i \sim \text{Normal}(\mu_i, \sigma) \\ \text{Priors:} \quad & \mu_i \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 50) \end{align*} \] Explanation

Data model (likelihood):
$h_i$ are the observed data (heights). They’re assumed to come from a Normal distribution with unknown mean $ _i $ and standard deviation $ $.
Priors:
These describe our beliefs before seeing data:
- $ _i $: typical height is around 178 cm, but could plausibly vary ±20 cm.
- $ $: population spread could be anywhere from 0 cm to 50 cm (flat prior).

Let’s first plot the priors for mu and sigma to get an idea of them.

# mu ~ Normal(178, 20)
ggplot(data.frame(x = c(100, 250)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 178, sd = 20)) +
  labs(x = "cm", y = "Density",
       subtitle = expression(mu %~% Normal(178,~20))) +
  geom_vline(xintercept = 178, linetype = 2, alpha = 0.6)

# sigma ~ Uniform(0, 50)  (plot only on its support)
ggplot(data.frame(x = c(0, 50)), aes(x)) +
  stat_function(fun = dunif, args = list(min = 0, max = 50)) +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(x = "cm", y = "Density",
       subtitle = expression(sigma %~% Uniform(0,~50)))

2.2 Prior predictive simulation

…means simulating both priors at once (the full model).

Why? Checking the model before adding data. Do priors make sense? Before the model has seen the data, what does it believe?

Reminder: this is our model: \[ \begin{align*} \text{Data:} \quad & \color{red}{h_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Priors:} \quad & \mu_i \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 50) \end{align*} \]

In R, rnorm or runif are functions that genertate random values from a given distribution (check ?rnorm)

# Simulate from priors
n= 10000    # number of simulations

df <- tibble( sample_mu       = rnorm( n, mean = 178,       sd = 20 ),
              sample_sigma    = runif( n, min = 0,         max = 50 )) %>% 
              mutate(prior_h  = rnorm( n, mean = sample_mu, sd = sample_sigma ))
df

## # A tibble: 10,000 × 3
##    sample_mu sample_sigma prior_h
##        <dbl>        <dbl>   <dbl>
##  1      205.        43.9     161.
##  2      167.        22.8     150.
##  3      185.        21.2     159.
##  4      191.        46.9     143.
##  5      186.        19.3     219.
##  6      176.        22.2     243.
##  7      208.        17.3     251.
##  8      176.         9.67    181.
##  9      218.        38.0     263.
## 10      177.        22.9     186.
## # ℹ 9,990 more rows

# Inspect summary statistics (sanity check)
df %>% summarise(
  mean_mu = mean(sample_mu),
  mean_sigma = mean(sample_sigma),
  mean_h = mean(prior_h)
)

## # A tibble: 1 × 3
##   mean_mu mean_sigma mean_h
##     <dbl>      <dbl>  <dbl>
## 1    178.       24.9   178.

# Plot prior predictive distribution 
df %>%
  ggplot( aes( x = prior_h )) +
  geom_density( fill = "black", size = 0 ) +
  scale_y_continuous( NULL, breaks = NULL ) +
  labs( subtitle = expression(paste( "Prior predictive distribution for ", italic(h[i]))),
        x = "cm")

Exercise:

What happens if you chose a wider prior for the model’s standard deviation — for example, sigma~Uniform(0,100) cm?
What happens if you chose a narrower prior for the model’s mean — for example, mu~Normal(178,10) cm?

2.3 Sampling for the posterior (last example with grid approximation)

… means we randomly draw samples from the posterior (here, we have two parameters mu and sd; in Session 1 we had only 1: p)

We want to compute the posterior distribution of two parameters in our model:

\[ h_i \sim \text{Normal}(\mu, \sigma) \]

where:

$\mu$ = average height
$\sigma$ = standard deviation (spread of heights)

Our goal is to find out:

Which combinations of ($\mu$, $\sigma$) are most plausible, given our observed height data and the priors?

Grid approximation: Compute posterior for many combinations of mu and sigma.

Doing some preparations (you don’t need to understand the code). Note: We will never use this for practical data analysis but it helps better understanding what exactly we’re doing with Bayesian estimation.

n <- 200

# Create some height data
d_grid <-
  crossing(mu    = seq(from = 180, to = 183, length.out = n),
           sigma = seq(from = 5.5, to = 8.5, length.out = n))

head(d_grid)

## # A tibble: 6 × 2
##      mu sigma
##   <dbl> <dbl>
## 1   180  5.5 
## 2   180  5.52
## 3   180  5.53
## 4   180  5.55
## 5   180  5.56
## 6   180  5.58

# Helper function
grid_function <- function( mu, sigma ){
  dnorm( d_player$height, mean = mu, sd = sigma, log = TRUE ) %>% 
    sum()
}

# Complete the tibble (e.g., create priors and posterior)
d_grid <-
  d_grid %>% 
  mutate( log_likelihood = map2( mu, sigma, grid_function )) %>% 
  unnest( cols = c(log_likelihood )) %>% 
  mutate( prior_mu       = dnorm(mu,    mean = 178, sd  = 20, log = TRUE ),
          prior_sigma    = dunif(sigma, min  = 0,   max = 50, log = TRUE )) %>% 
  mutate( product        = log_likelihood + prior_mu + prior_sigma ) %>% 
  mutate( posterior    = exp(product - max(product )))

d_grid

## # A tibble: 40,000 × 7
##       mu sigma log_likelihood prior_mu prior_sigma product posterior
##    <dbl> <dbl>          <dbl>    <dbl>       <dbl>   <dbl>     <dbl>
##  1   180  5.5          -1311.    -3.92       -3.91  -1319.  1.92e-14
##  2   180  5.52         -1310.    -3.92       -3.91  -1318.  3.28e-14
##  3   180  5.53         -1310.    -3.92       -3.91  -1318.  5.57e-14
##  4   180  5.55         -1309.    -3.92       -3.91  -1317.  9.34e-14
##  5   180  5.56         -1309.    -3.92       -3.91  -1317.  1.55e-13
##  6   180  5.58         -1308.    -3.92       -3.91  -1316.  2.55e-13
##  7   180  5.59         -1308.    -3.92       -3.91  -1316.  4.16e-13
##  8   180  5.61         -1307.    -3.92       -3.91  -1315.  6.71e-13
##  9   180  5.62         -1307.    -3.92       -3.91  -1315.  1.07e-12
## 10   180  5.64         -1307.    -3.92       -3.91  -1314.  1.70e-12
## # ℹ 39,990 more rows

# Draw samples from posterior
d_grid_samples <- 
  d_grid %>% 
  sample_n( size = 1000, replace = T, weight = posterior )

# Plot mu and sigma separately 
d_grid_samples %>% 
  select(mu, sigma) %>% 
  gather() %>% 

  ggplot(aes(x = value)) + 
  geom_density(fill = "grey33", size = 0) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(NULL) +
  theme(panel.grid = element_blank()) +
  facet_wrap(~key, scales = "free")

# Plot mu and sigma together
d_grid_samples %>% 
  ggplot( aes( x = mu, y = sigma )) + 
  geom_point( alpha = 0.3 ) +
  scale_fill_viridis_c() +
  labs( x = expression(mu[samples]),
        y = expression(sigma[samples]))

d_grid %>%
  ggplot(aes(mu, sigma, z = posterior)) +
  geom_contour_filled(bins = 10) +
  labs(x = expression(mu), y = expression(sigma))

- The filled contour plot shows where the posterior probability mass lies — the “hill” where the most plausible combinations of $\mu$ and $\sigma$ live.

The scatter plot of posterior samples shows how those parameters vary together — for example, smaller $\sigma$ might require slightly larger $\mu$ to explain the same data.
The marginal densities show the posterior for each parameter individually.

Exercise:

What happens if you were to change the prior of the mean, say to 1,780 or 17,800 cm?

2.4 Adding a predictor variable

Let’s assume, we now add the variable “percentage of aerials won” as predictor of height.

d_player %>%
  ggplot( aes(x = duelAerialPerc, y = height )) +
  geom_point( shape = 1, size = 1 )

From now on, we fit the models with brm() from the brms package, which is very common in scientific practice.

Reminder, this was our model:

\[ \begin{align*} \text{Data:} \quad & h_i \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \mu \\ \text{Priors:} \quad& \mu \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 100) \end{align*} \]

This is our new model with a predictor, i.e., the percentage of aerials won:

\[ \begin{align*} \text{Data:} \quad & \color{red}{h_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \color{red}{\alpha} + \color{red}{\beta} \times \color{red}{\mathrm{duelAerial}_i} \\ \text{Priors:} \quad& \alpha \sim \text{Normal}(178, 20) \\ & \beta \sim \text{Normal}(0, 10) \\ & \sigma \sim \text{Uniform}(0, 100) \end{align*} \]

Lets translate the statistical model into brms code (the engine that makes our model go).

In a first step, we sample only from the priors (prior predictive check).

# Prior predictive check (sample from priors only)
m4.3a <- 
  brm( family = gaussian,                                      # h ~ Normal( mu, sigma )
       height ~ 1 + duelAerialPerc,                            # mu = a + b*x_i 
       prior = c(prior( normal( 178, 20 ), class = Intercept), # a ~ Normal( 178, 20 )
                 prior( normal( 0, 10), class = b),            # b ~ Normal( 0, 10 )
                 prior( uniform( 0, 100 ), class = sigma )),   # sigma ~ Uniform( 0, 100 )
       sample_prior = "only", # sample only from priors
       data = d_player, 
       iter = 28000, warmup = 27000, chains = 4, cores = 4.    # MCMC settings
     )

# Only for prior predictive check
conditional_effects( m4.3a, spaghetti = T, ndraws = 200 ) %>% 
  plot( points = F, point_args = c(alpha = 1/2, size = 1), line_args = c(colour = "black") )

pp_check( m4.3a )

# plot( m4.3a )

Exercise:

What happens if you change sigma to 50?

Let’s add data (uncomment sample_prior = “only”)

m4.3b <- 
  brm( family = gaussian,                                      # h ~ Normal( mu, sigma )
       height ~ 1 + duelAerialPerc,                            # mu = a + bx_i 
       prior = c(prior( normal( 178, 20 ), class = Intercept), # a ~ Normal( 178, 20 )
                 prior( normal( 0, 10 ), class = b),           # b ~ Normal( 0, 10 )
                 prior( uniform(0, 100), class = sigma )),     # sigma ~ Uniform( 0, 100 )
       # sample_prior = "only", # sample only from priors
       data = d_player,
       iter = 28000, warmup = 27000, chains = 4, cores = 4.    # MCMC settings
     )

pp_check( m4.3b )

# plot( m4.3b )

Let’s check the summary of the posterior distribution

posterior_summary( m4.3b )[1:3, ]

##                    Estimate Est.Error       Q2.5      Q97.5
## b_Intercept      174.521355 0.8390793 172.891066 176.141410
## b_duelAerialPerc  15.671251 1.7172059  12.303982  19.043173
## sigma              5.945388 0.2162514   5.552194   6.394807

2.5 Checking correlations among parameters

Why check correlations among parameters?

Sometimes, the intercept ($\alpha$) and slope ($\beta$) move together (are correlated).
When one goes up and the other goes down, the model can still fit the data almost the same way.

If that happens: - The model has a harder time deciding which values of $\alpha$ and $\beta$ are best.
- The estimates become less stable and take longer to compute.

pairs( m4.3b )

alpha (b_Intercept) and beta (b_duelAerialPerc) are highly correlated. This means these two parameters carry the same information. Not a problem here but can cause issues in more complex models.

How centering and z-scoring help

Centering the predictor (subtracting its mean) moves the zero point to the average value of the data.
Then the intercept means “the expected outcome when the predictor is average.”
Centering typically reduces the $\alpha$–$\beta$ correlation.
Z-scoring goes one step further: center and divide by the standard deviation.
This puts predictors on a similar scale, so coefficients are easier to compare across models.

d_player <- 
  d_player %>%
  mutate( aerials_c = duelAerialPerc - mean( duelAerialPerc ))

head(d_player, n=10)

##                name height duelAerialPerc   aerials_c
## 1    Álvaro Negredo    186      0.3649635 -0.09289577
## 2      Aaron Lennon    165      0.1666667 -0.29119261
## 3      Aaron Ramsey    177      0.3750000 -0.08285928
## 4    Abdoulaye Faye    188      0.6111111  0.15325183
## 5      Adam Johnson    175      0.3809524 -0.07690690
## 6      Adam Lallana    173      0.3050847 -0.15277453
## 7      Adel Taarabt    180      0.3571429 -0.10071642
## 8  Adlène Guédioura    183      0.2857143 -0.17214499
## 9   Adrian Mariappa    180      0.6190476  0.16118834
## 10 Ahmed Elmohamady    180      0.5034014  0.04554208

Rerun our model with the centered variable

m4.3c <- 
  brm( family = gaussian,                                      # h ~ Normal( mu, sigma )
       height ~ 1 + aerials_c,                                 # mu = a + bx_i 
       prior = c(prior( normal( 178, 20 ), class = Intercept), # a ~ Normal( 178, 20 )
                 prior( normal( 0, 10 ), class = b),           # b ~ Normal( 0, 10 )
                 prior( uniform(0, 100), class = sigma )),     # sigma ~ Uniform( 0, 100 )
       # sample_prior = "only", # sample only from priors
       data = d_player, 
       iter = 28000, warmup = 27000, chains = 4, cores = 4.    # MCMC settings
     )

# Plot correlations
pairs( m4.3c )

Let’s check the summary of the posterior distribution

mean(d_player$height)

## [1] 181.6933

posterior_summary( m4.3c )[1:3, ]

##               Estimate Est.Error      Q2.5      Q97.5
## b_Intercept 181.691233 0.3007712 181.10591 182.297491
## b_aerials_c  15.701673 1.7312709  12.29597  19.122064
## sigma         5.950094 0.2135221   5.54392   6.381074

Centering reduced the correlation between alpha and beta. Note that $\beta$ and $\sigma$ are unchanged but the estimate for alpha (i.e., the intercept). When variables are standardized, the intercept represents the mean of the outcome variable!

2.6 Visualizing uncertainty in regression

Now that we have estimated our model, we can visualize what it has learned.

First, we plot the posterior inference — the regression line based on the most likely values of $\alpha$ (intercept) and $\beta$ (slope) — against the actual data.

# Mean line from posterior (point estimates)
d_player %>%
  ggplot(aes(x = duelAerialPerc, y = height)) +
  geom_abline(intercept = fixef(m4.3b)[1], 
              slope     = fixef(m4.3b)[2]) +
  geom_point(color = "royalblue")

fixef(m4.3b)

##                 Estimate Est.Error      Q2.5     Q97.5
## Intercept      174.52136 0.8390793 172.89107 176.14141
## duelAerialPerc  15.67125 1.7172059  12.30398  19.04317

Next, we add an uncertainty band around the mean (the credible interval). You don’t need to understand the code yet — the idea is that the model gives us not just one line, but a range of plausible lines given what the data tell us.

We can use fitted() from the brms package to compute the model’s expected (mean) height for a range of predictor values.

# Here, we define a sequence of values for aerial duels for which we want to compute predictions.
aerials_seq <- 
  tibble( duelAerialPerc = seq(from = 0, to = 1, by = 0.1 ))
aerials_seq

## # A tibble: 11 × 1
##    duelAerialPerc
##             <dbl>
##  1            0  
##  2            0.1
##  3            0.2
##  4            0.3
##  5            0.4
##  6            0.5
##  7            0.6
##  8            0.7
##  9            0.8
## 10            0.9
## 11            1

# Get fitted values for each new data point
mu_summary <-
  fitted( m4.3b, 
         newdata = aerials_seq ) %>%
  as_tibble() %>%
  bind_cols( aerials_seq ) %>%
  select( duelAerialPerc, Estimate, Q2.5, Q97.5)

mu_summary

## # A tibble: 11 × 4
##    duelAerialPerc Estimate  Q2.5 Q97.5
##             <dbl>    <dbl> <dbl> <dbl>
##  1            0       175.  173.  176.
##  2            0.1     176.  175.  177.
##  3            0.2     178.  177.  179.
##  4            0.3     179.  178.  180.
##  5            0.4     181.  180.  181.
##  6            0.5     182.  182.  183.
##  7            0.6     184.  183.  185.
##  8            0.7     185.  184.  186.
##  9            0.8     187.  186.  188.
## 10            0.9     189.  187.  190.
## 11            1       190.  188.  192.

Now we can plot the credible interval around the mean:

# Plot mean ± 95% credible interval
d_player %>%
  ggplot( aes( x = duelAerialPerc, y = height )) +
  geom_smooth( data = mu_summary,
               aes( y = Estimate, ymin = Q2.5, ymax = Q97.5 ),
               stat = "identity" ) +
  geom_point( )

So far, this shows uncertainty about the average height at each level of aerial-duel performance. But real players also differ randomly around that mean — that’s captured by the model’s $\sigma$.

\[ h_i \sim \text{Normal}(\mu_i = \alpha + \beta \times x_i, \sigma) \] To visualize that, we can draw a prediction interval that shows where new players’ heights are likely to fall.

fitted() → focuses on the mean prediction (what the model expects on average): credible interval
predict() → includes observation-level noise ($\sigma$) and thus gives a wider interval: posterior prediction interval

# Here, we define a sequence of values for aerial duels for which we want to compute predictions.
aerials_seq <- 
  tibble( duelAerialPerc = seq(from = 0, to = 1, by = 0.1 ))

# Get predicted value for each new data point
pred_aerial <-
  predict( m4.3b,
           newdata = aerials_seq ) %>%
  as_tibble() %>%
  bind_cols( aerials_seq ) %>%
  select( duelAerialPerc, Estimate, Q2.5, Q97.5)
  
pred_aerial

## # A tibble: 11 × 4
##    duelAerialPerc Estimate  Q2.5 Q97.5
##             <dbl>    <dbl> <dbl> <dbl>
##  1            0       175.  163.  187.
##  2            0.1     176.  164.  187.
##  3            0.2     178.  166.  189.
##  4            0.3     179.  167.  191.
##  5            0.4     181.  169.  193.
##  6            0.5     182.  171.  194.
##  7            0.6     184.  173.  196.
##  8            0.7     185.  174.  197.
##  9            0.8     187.  175.  199.
## 10            0.9     189.  177.  201.
## 11            1       190.  178.  202.

Finally, we can plot both together: the inner (narrow) band for the mean and the outer (wider) band for possible new data.

# Plot: prediction interval (gray band) + credible interval for mean (line+ribbon)
d_player %>%
  ggplot( aes( x = duelAerialPerc )) +
  geom_ribbon( data = pred_aerial, 
               aes( ymin = Q2.5, ymax = Q97.5 ),
               fill = "grey83" ) +
  geom_smooth( data = mu_summary,
               aes( y = Estimate, ymin = Q2.5, ymax = Q97.5 ),
               stat = "identity" ) +
  geom_point( aes( y = height ) )

There is one line of code in brms that does it at once ;-)

# Prior predictive check
conditional_effects( m4.3a, spaghetti = T, ndraws = 200 ) %>% 
  plot( points = T, point_args = c(alpha = 1/2, size = 1), line_args = c(colour = "black") )

# Posterior predictive check
conditional_effects( m4.3b ) %>% plot( points = T )

3 Let’s experiment!

Predict a player’s marketvalue from age.

Write out a statistical model.
Define some suitable priors.
Translate the model into brm code.
Do a prior predictive check.
Run the model with data.

3.1 Playground

Use the following data.

d_player <- read.csv("players.csv") %>%                                     # load data
  filter( competition == "Premier League" & season == "2013") %>%           # filter for Premier League and Season
  select( name, age, marketvalue )                                          # select some variables

head( d_player )

##             name age marketvalue
## 1 Álvaro Negredo  28    27000000
## 2   Aaron Lennon  26    15000000
## 3   Aaron Ramsey  23    20000000
## 4 Abdoulaye Faye  35      250000
## 5   Adam Johnson  26     7500000
## 6   Adam Lallana  25     7500000

3.2 Solution

Plot the marketvalue and age (scatter plot).

# Your code here

Write out a statistical model. We expect the mean marketvalue (mu) to be 8,000,000 € with a broad standard deviation (sigma) = 50,000,000 €. Let’s further assume a broad range of possible means of 30,000,000 €. Finally, we assume that age does not matter (mean = 0) but with a range of +- 2,000,000 €.

\[ Model \]

Model in brm code.

# Your code here

Let’s check the summary of the posterior distribution

# Your code here

Is this a good model?

4 Polynomial regression

Let’s assume a player’s market value increases until a certain age and then decreases. How do we model this?

A common polynomial regression adds a quadratic term (u-shaped):

\[ \begin{align*} mv_i & \sim \text{Normal}( \mu_i, \sigma ) \\ \text{1st order (line):} &\ \mu_i = \alpha + \beta_1 x_i \\ \text{2nd order (parabola):} &\ \mu_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 \end{align*} \]

m4.5 <- 
  brm( family = gaussian,                                                 # mv ~ Normal( mu, sigma )
       marketvalue ~ 1 + age + I(age^2),                                  # mu = a + b1*age_i + b2*age^2_i 
       prior = c(prior( normal( 8000000, 30000000 ), class = Intercept),   # a ~ Normal( 8.000.000, 30.000.000 )
                 prior( normal( 0, 2000000 ), class = b),                 # b ~ Normal( 0, 2.000.000 )
                 prior( uniform(0, 50000000), class = sigma )),           # sigma ~ Uniform( 0, 50.000.000 )
       #sample_prior = "only",
       data = d_player,
       iter = 28000, warmup = 27000, chains = 4, cores = 4 
     )

# Plot the model
pp_check( m4.5 )

# plot( m4.5 )

(The function I() stands for the “inhibit interpretation” operator, which tells R to treat age^2 literally as a squared term, rather than as part of a formula.)

Let’s check the summary of the posterior distribution

posterior_summary( m4.5 )[1:4, ]

##                Estimate   Est.Error         Q2.5       Q97.5
## b_Intercept -17560754.5 14658642.90 -45956224.08 10981104.05
## b_age         2375568.1  1091336.75    268623.98  4555878.69
## b_IageE2       -52146.1    20166.48    -92162.82   -13039.49
## sigma         8582295.6   312925.68   8006954.89  9204241.58

Plot the polynomial.

conditional_effects( m4.5 ) %>% plot( points = TRUE )

Now, let’s remove outliers, log the marketvalue, standardize age and age2, and then rerun the model.

(Why do we log and how can we interpret it: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/)

# Load and wrangle data
d_player <- d_player %>%                            
  # remove outlier, standardize vars, log marketvalue
  filter( marketvalue <= 20000000) %>%
  mutate( age_s = ( age - mean( age )) / sd(age),
          age2 = age^2,
          age2_s = ( age2 - mean( age2 )) / sd(age2),
          mv_log = log( marketvalue )
          )

head( d_player )

##               name age marketvalue      age_s age2     age2_s   mv_log
## 1     Aaron Lennon  26    15000000 -0.1598652  676 -0.2249829 16.52356
## 2   Abdoulaye Faye  35      250000  2.1932911 1225  2.4168203 12.42922
## 3     Adam Johnson  26     7500000 -0.1598652  676 -0.2249829 15.83041
## 4     Adam Lallana  25     7500000 -0.4213270  625 -0.4703963 15.83041
## 5     Adel Taarabt  24     6000000 -0.6827889  576 -0.7061857 15.60727
## 6 Adlène Guédioura  28     2500000  0.3630584  784  0.2947161 14.73180

Plot a histogram of log(marketvalue)

d_player %>%
  ggplot( aes( x= mv_log )) + 
  geom_histogram()

Let’s rerun the model

m4.5_s <- 
  brm( family = gaussian,                                                 # h ~ Normal( mu, sigma )
       mv_log ~ 1 + age_s + I(age_s^2),                                   # mu = a + bx_i 
       prior = c(prior( normal( 8000000, 30000000 ), class = Intercept),  # a ~ Normal( 8.000.000, 30.000.000 )
                 prior( normal( 0, 1 ), class = b),                       # b ~ Normal( 0, 1 )
                 prior( uniform(0, 50000000), class = sigma )),           # sigma ~ Uniform( 0, 50.000.000 )
       #sample_prior = "only",
       data = d_player,
       iter = 28000, warmup = 27000, chains = 4, cores = 4 
    )

# Plot the model
pp_check( m4.5_s ) # Posterior Predictive Check

#plot( m4.5_s ) # Plot estimates

# Plot posterior
conditional_effects( m4.5_s ) %>% plot( points = TRUE )

5 Multiple Regression

Teaching Case:

Purpose: Extend our height model by adding a second predictor — age.
Question: How does age relate to height, after accounting for aerial performance?
Idea: We now allow two predictors to jointly explain variation in player height.

Reload player data with age

d_player <- read.csv("players.csv") %>%                                     # load data
  filter( competition == "Premier League" & season == "2013") %>%           # filter for Premier League and Season
  select( name, height, duelAerialPerc, age )                               # select some variables

head( d_player )

##             name height duelAerialPerc age
## 1 Álvaro Negredo    186      0.3649635  28
## 2   Aaron Lennon    165      0.1666667  26
## 3   Aaron Ramsey    177      0.3750000  23
## 4 Abdoulaye Faye    188      0.6111111  35
## 5   Adam Johnson    175      0.3809524  26
## 6   Adam Lallana    173      0.3050847  25

5.1 Model setup

Previously, we modeled height as a function of aerial duels alone.
Now we expand our model to include age as an additional predictor.

\[ \begin{align*} \text{Data:} \quad & h_i \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \alpha + \beta_1 \cdot \text{duelAerialPerc}_i + \beta_2 \cdot \text{age}_i \\ \text{Priors:} \quad & \alpha \sim \text{Normal}(178, 20) \\ & \beta_1 \sim \text{Normal}(0, 10) \\ & \beta_2 \sim \text{Normal}(0, 10) \\ & \sigma \sim \text{Uniform}(0, 100) \end{align*} \]

Interpretation

$h_i$ – observed player height (data)
$\mu_i$ – expected height given aerial ability and age
$\alpha$ – average height when both predictors are zero (i.e., baseline)
$\beta_1$ – expected change in height per unit change in aerial-duel performance
$\beta_2$ – expected change in height per additional year of age
$\sigma$ – residual variation in height not explained by predictors

5.2 Estimation in `brms`

We use the same brm() function as before, now with two predictors.

m4.6 <- 
  brm(
    family = gaussian,
    height ~ 1 + duelAerialPerc + age,                     # two predictors
    prior = c(
      prior(normal(178, 20), class = Intercept),           # α
      prior(normal(0, 10), class = b),                     # βs (shared prior)
      prior(uniform(0, 100), class = sigma)                # σ
    ),
    data = d_player,
    iter = 4000, chains = 4, cores = 4
  )

summary(m4.6)

5.3 Visualizing multiple regression results

We can visualize how predicted height changes with aerial performance for different ages.

conditional_effects(m4.6, effects = "duelAerialPerc:age") %>% 
  plot(points = TRUE)

5.4 Why standardizing matters

Unstandardized predictors can produce unstable estimates and slow convergence because the variables are on very different scales:

Variable	Typical range	Units
Height	160–200	cm
Duel Aerial %	0–100	percentage pts
Age	17–40	years

This scale mismatch makes it harder for the sampler to find good combinations of $(\alpha, \beta_1, \beta_2)$, can inflate uncertainty, and makes priors like Normal(0, 10) mean very different things across predictors.

Centering (subtract the mean) makes 0 correspond to an average player, so the intercept $\alpha$ becomes the expected height at average aerial performance and average age.

Standardizing (z-scoring: center & divide by SD) puts predictors on a common scale (1 unit = 1 SD), which:

improves computational stability and convergence,
makes priors interpretable (e.g., Normal(0, 2) ≈ ±4 cm per 1 SD change),
and allows direct comparison of effect sizes across predictors.

d_player <- d_player %>%
  mutate(
    duelAerial_z = as.numeric(scale(duelAerialPerc)),
    age_z        = as.numeric(scale(age))
  )

Next, we refit the same model with z-standardized predictors, so slopes are interpreted as cm per 1 SD change in each predictor and the sampler works on a better-scaled parameter space.

m4.6b <- brm(
  family = gaussian(),
  height ~ 1 + duelAerial_z + age_z,
  prior = c(
    prior(normal(178, 12), class = Intercept),   # mean height at average predictors
    prior(normal(0, 2),    class = b),           # ~±4 cm per 1 SD change (weakly-informative)
    prior(student_t(3, 0, 10), class = sigma)    # half-Student-t for residual SD
  ),
  data   = d_player,
  iter   = 4000, warmup = 2000, chains = 4, cores = 4
)

Inspect the results

summary(m4.6b)

##  Family: gaussian 
##   Links: mu = identity 
## Formula: height ~ 1 + duelAerial_z + age_z 
##    Data: d_player (Number of observations: 388) 
##   Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
##          total post-warmup draws = 8000
## 
## Regression Coefficients:
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      181.70      0.30   181.13   182.30 1.00     8758     6397
## duelAerial_z     2.70      0.31     2.10     3.31 1.00     8046     6129
## age_z            0.26      0.31    -0.35     0.86 1.00     8780     5900
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     5.94      0.21     5.53     6.37 1.00     9193     6457
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

conditional_effects(m4.6b, effects = "duelAerial_z:age_z") %>% 
  plot(points = TRUE)

Discussion

Adding age allows us to separate its effect from aerial ability — we can now ask:
“Among players with the same aerial ability, are older players taller?”
This introduces the concept of partial effects — each predictor’s influence holding the others constant.
If the two predictors are correlated, Bayesian priors help regularize their estimates and prevent overfitting.

In-Class Exercise: see worksheet

Bayesian Data Analytics - Week 02: Linear Regression

Markus Weinmann

1 Concepts of the day: (1) Gaussian distribution and (2) regression to the mean

1.1 Teaching case

1.2 Gaussian distribution

1.3 Regression to the mean

2 Linear regression

2.1 Model language

2.2 Prior predictive simulation

2.3 Sampling for the posterior (last example with grid approximation)

2.4 Adding a predictor variable

2.5 Checking correlations among parameters

2.6 Visualizing uncertainty in regression

3 Let’s experiment!

3.1 Playground

3.2 Solution

4 Polynomial regression

5 Multiple Regression

5.1 Model setup

5.2 Estimation in `brms`

5.3 Visualizing multiple regression results

5.4 Why standardizing matters

Bayesian Data Analytics - Week 02: Linear Regression

Markus Weinmann

1 Concepts of the day: (1) Gaussian distribution and (2) regression to the mean

1.1 Teaching case

1.2 Gaussian distribution

1.3 Regression to the mean

2 Linear regression

2.1 Model language

2.2 Prior predictive simulation

2.3 Sampling for the posterior (last example with grid approximation)

2.4 Adding a predictor variable

2.5 Checking correlations among parameters

2.6 Visualizing uncertainty in regression

3 Let’s experiment!

3.1 Playground

3.2 Solution

4 Polynomial regression

5 Multiple Regression

5.1 Model setup

5.2 Estimation in brms

5.3 Visualizing multiple regression results

5.4 Why standardizing matters

5.2 Estimation in `brms`