Load required packages.
library( tidyverse )
library( tidybayes )
# library( rethinking )
library( brms )
library( gridExtra )
Set up workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn off scientific notation of large numbers, set a standard theme for plotting.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
rm(list=ls())
set.seed(42)
options(scipen=10000)
## Warning in options(scipen = 10000): invalid 'scipen' 10000, used 9999
theme_set(theme_bw())
The Hand of God World Cup 1986: In an aerial battle with Peter Shilton, Maradona (1.66 meters) takes a hand to help. “It was a bit of Maradona’s head and a bit of God’s hand,” he said later.
In this session, we ask: can we use a player’s performance in aerial duels to predict his height?.
Let’s load some player data
d_player <- read.csv("players.csv") %>% # load data
filter( competition == "Premier League" & season == "2013") %>% # filter for Premier League and Season
select( name, height, duelAerialPerc ) # select some variables
head( d_player )
## name height duelAerialPerc
## 1 Álvaro Negredo 186 0.3649635
## 2 Aaron Lennon 165 0.1666667
## 3 Aaron Ramsey 177 0.3750000
## 4 Abdoulaye Faye 188 0.6111111
## 5 Adam Johnson 175 0.3809524
## 6 Adam Lallana 173 0.3050847
Plot height and duelAerialPerc histograms
d_player %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram() # As histogram
First, let’s get back to our penalty kick problem.
We make the following assumptions:
We assume a player (red dot) has 4 trials (see layers below), where in each trial the goal probability is 50%. Right = goal; left = miss. Some rare players will end scoring 4 goals, some no goal but the central tendency is somewhere in the middle.
The idea was introduced by
Francis Galton – the cousin of Charles Darwin – and is also known as
Galton board.
What does it mean? Values – discrete like goals or continuous like height – fluctuate around the mean.
In the case of continuous variables such as height, we may observe extreme values; however, there is the tendency that they will be lower next observation.
Natural fluctuations tend to produce Gaussian – normal – distributions, that is, they regress to the mean (assuming observations are independent).
See also: https://theconversation.com/regression-to-the-mean-or-why-perfection-rarely-lasts-74694
Excursus: “The Sports Illustrated Curse” In many cases, when an athlete appeared on the cover of Sports Illustrated, their performance got worse. What would be an explanation?
Back to normal (Gaussian distribution). They can be described by two parameters: it’s mean mu and its standard deviation sigma (or variance sigma^2)
Excursus: in this course, we will use various distributions. You can find a wonderful summery on distributions here: https://ben18785.shinyapps.io/distribution-zoo/
Teaching Case:
This is the general syntax shared by many models (regression, ANOVA, ANCOVA, t-test).
Recipe
Select the variables of interest. Some are observed (data), others are unobserved (parameters).
Define each variable either deterministically (as a function of other variables) or probabilistically (by a distribution).
The combination of variables and their distributions defines a joint generative model that can be used to
A simple linear model
This is a very simple linear model, with an outcome y being normally distribute with mean mu and standard deviation sigma. It has one predictor, x. (red = variables, black = parameters)
\[ \begin{align*} \text{Data:} \quad & \color{red}{y_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \beta \times \color{red}{x_i} \\ \text{Priors:} \quad & \beta \sim \text{Normal}(0, 10) \\ & \sigma \sim \text{Exponential}(1) \end{align*} \]
We first start with a height model without predictor. We define heights as normally distributed with a mean and a standard deviation (this is the likelihood):
\[ h_i \sim \text{Normal}(\mu_i, \sigma) \]
Let’s add some priors to the model (red = variable; black = parameters):
\[ \begin{align*} \text{Data:} \quad & \color{red}{h_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Priors:} \quad & \mu_i \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 50) \end{align*} \] \[ \begin{align*} \text{Data:} \quad & h_i \sim \text{Normal}(\mu_i, \sigma) \\ \text{Priors:} \quad & \mu_i \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 50) \end{align*} \] Explanation
Data model (likelihood):
\(h_i\) are the observed data
(heights). They’re assumed to come from a Normal distribution with
unknown mean $ _i $ and standard deviation $ $.
Priors:
These describe our beliefs before seeing data:
Let’s first plot the priors for mu and sigma to get an idea of them.
# mu ~ Normal(178, 20)
ggplot(data.frame(x = c(100, 250)), aes(x)) +
stat_function(fun = dnorm, args = list(mean = 178, sd = 20)) +
labs(x = "cm", y = "Density",
subtitle = expression(mu %~% Normal(178,~20))) +
geom_vline(xintercept = 178, linetype = 2, alpha = 0.6)
# sigma ~ Uniform(0, 50) (plot only on its support)
ggplot(data.frame(x = c(0, 50)), aes(x)) +
stat_function(fun = dunif, args = list(min = 0, max = 50)) +
scale_y_continuous(NULL, breaks = NULL) +
labs(x = "cm", y = "Density",
subtitle = expression(sigma %~% Uniform(0,~50)))
…means simulating both priors at once (the full model).
Why? Checking the model before adding data. Do priors make sense? Before the model has seen the data, what does it believe?
Reminder: this is our model: \[ \begin{align*} \text{Data:} \quad & \color{red}{h_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Priors:} \quad & \mu_i \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 50) \end{align*} \]
In R, rnorm or runif are functions that genertate random values from a given distribution (check ?rnorm)
# Simulate from priors
n= 10000 # number of simulations
df <- tibble( sample_mu = rnorm( n, mean = 178, sd = 20 ),
sample_sigma = runif( n, min = 0, max = 50 )) %>%
mutate(prior_h = rnorm( n, mean = sample_mu, sd = sample_sigma ))
df
## # A tibble: 10,000 × 3
## sample_mu sample_sigma prior_h
## <dbl> <dbl> <dbl>
## 1 205. 43.9 161.
## 2 167. 22.8 150.
## 3 185. 21.2 159.
## 4 191. 46.9 143.
## 5 186. 19.3 219.
## 6 176. 22.2 243.
## 7 208. 17.3 251.
## 8 176. 9.67 181.
## 9 218. 38.0 263.
## 10 177. 22.9 186.
## # ℹ 9,990 more rows
# Inspect summary statistics (sanity check)
df %>% summarise(
mean_mu = mean(sample_mu),
mean_sigma = mean(sample_sigma),
mean_h = mean(prior_h)
)
## # A tibble: 1 × 3
## mean_mu mean_sigma mean_h
## <dbl> <dbl> <dbl>
## 1 178. 24.9 178.
# Plot prior predictive distribution
df %>%
ggplot( aes( x = prior_h )) +
geom_density( fill = "black", size = 0 ) +
scale_y_continuous( NULL, breaks = NULL ) +
labs( subtitle = expression(paste( "Prior predictive distribution for ", italic(h[i]))),
x = "cm")
Exercise:
What happens if you chose a wider prior for the model’s standard deviation — for example, sigma~Uniform(0,100) cm?
What happens if you chose a narrower prior for the model’s mean — for example, mu~Normal(178,10) cm?
… means we randomly draw samples from the posterior (here, we have two parameters mu and sd; in Session 1 we had only 1: p)
We want to compute the posterior distribution of two parameters in our model:
\[ h_i \sim \text{Normal}(\mu, \sigma) \]
where:
Our goal is to find out:
Which combinations of (\(\mu\), \(\sigma\)) are most plausible, given our observed height data and the priors?
Grid approximation: Compute posterior for many combinations of mu and sigma.
Doing some preparations (you don’t need to understand the code). Note: We will never use this for practical data analysis but it helps better understanding what exactly we’re doing with Bayesian estimation.
n <- 200
# Create some height data
d_grid <-
crossing(mu = seq(from = 180, to = 183, length.out = n),
sigma = seq(from = 5.5, to = 8.5, length.out = n))
head(d_grid)
## # A tibble: 6 × 2
## mu sigma
## <dbl> <dbl>
## 1 180 5.5
## 2 180 5.52
## 3 180 5.53
## 4 180 5.55
## 5 180 5.56
## 6 180 5.58
# Helper function
grid_function <- function( mu, sigma ){
dnorm( d_player$height, mean = mu, sd = sigma, log = TRUE ) %>%
sum()
}
# Complete the tibble (e.g., create priors and posterior)
d_grid <-
d_grid %>%
mutate( log_likelihood = map2( mu, sigma, grid_function )) %>%
unnest( cols = c(log_likelihood )) %>%
mutate( prior_mu = dnorm(mu, mean = 178, sd = 20, log = TRUE ),
prior_sigma = dunif(sigma, min = 0, max = 50, log = TRUE )) %>%
mutate( product = log_likelihood + prior_mu + prior_sigma ) %>%
mutate( posterior = exp(product - max(product )))
d_grid
## # A tibble: 40,000 × 7
## mu sigma log_likelihood prior_mu prior_sigma product posterior
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 180 5.5 -1311. -3.92 -3.91 -1319. 1.92e-14
## 2 180 5.52 -1310. -3.92 -3.91 -1318. 3.28e-14
## 3 180 5.53 -1310. -3.92 -3.91 -1318. 5.57e-14
## 4 180 5.55 -1309. -3.92 -3.91 -1317. 9.34e-14
## 5 180 5.56 -1309. -3.92 -3.91 -1317. 1.55e-13
## 6 180 5.58 -1308. -3.92 -3.91 -1316. 2.55e-13
## 7 180 5.59 -1308. -3.92 -3.91 -1316. 4.16e-13
## 8 180 5.61 -1307. -3.92 -3.91 -1315. 6.71e-13
## 9 180 5.62 -1307. -3.92 -3.91 -1315. 1.07e-12
## 10 180 5.64 -1307. -3.92 -3.91 -1314. 1.70e-12
## # ℹ 39,990 more rows
# Draw samples from posterior
d_grid_samples <-
d_grid %>%
sample_n( size = 1000, replace = T, weight = posterior )
# Plot mu and sigma separately
d_grid_samples %>%
select(mu, sigma) %>%
gather() %>%
ggplot(aes(x = value)) +
geom_density(fill = "grey33", size = 0) +
scale_y_continuous(NULL, breaks = NULL) +
xlab(NULL) +
theme(panel.grid = element_blank()) +
facet_wrap(~key, scales = "free")
# Plot mu and sigma together
d_grid_samples %>%
ggplot( aes( x = mu, y = sigma )) +
geom_point( alpha = 0.3 ) +
scale_fill_viridis_c() +
labs( x = expression(mu[samples]),
y = expression(sigma[samples]))
d_grid %>%
ggplot(aes(mu, sigma, z = posterior)) +
geom_contour_filled(bins = 10) +
labs(x = expression(mu), y = expression(sigma))
- The filled contour plot shows where the posterior probability mass
lies — the “hill” where the most plausible combinations of \(\mu\) and \(\sigma\) live.
The scatter plot of posterior samples shows how those parameters vary together — for example, smaller \(\sigma\) might require slightly larger \(\mu\) to explain the same data.
The marginal densities show the posterior for each parameter individually.
Exercise:
Let’s assume, we now add the variable “percentage of aerials won” as predictor of height.
d_player %>%
ggplot( aes(x = duelAerialPerc, y = height )) +
geom_point( shape = 1, size = 1 )
From now on, we fit the models with brm() from the brms package, which is very common in scientific practice.
Reminder, this was our model:
\[ \begin{align*} \text{Data:} \quad & h_i \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \mu \\ \text{Priors:} \quad& \mu \sim \text{Normal}(178, 20) \\ & \sigma \sim \text{Uniform}(0, 100) \end{align*} \]
This is our new model with a predictor, i.e., the percentage of aerials won:
\[ \begin{align*} \text{Data:} \quad & \color{red}{h_i} \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \color{red}{\alpha} + \color{red}{\beta} \times \color{red}{\mathrm{duelAerial}_i} \\ \text{Priors:} \quad& \alpha \sim \text{Normal}(178, 20) \\ & \beta \sim \text{Normal}(0, 10) \\ & \sigma \sim \text{Uniform}(0, 100) \end{align*} \]
Lets translate the statistical model into brms code (the engine that makes our model go).
In a first step, we sample only from the priors (prior predictive check).
# Prior predictive check (sample from priors only)
m4.3a <-
brm( family = gaussian, # h ~ Normal( mu, sigma )
height ~ 1 + duelAerialPerc, # mu = a + b*x_i
prior = c(prior( normal( 178, 20 ), class = Intercept), # a ~ Normal( 178, 20 )
prior( normal( 0, 10), class = b), # b ~ Normal( 0, 10 )
prior( uniform( 0, 100 ), class = sigma )), # sigma ~ Uniform( 0, 100 )
sample_prior = "only", # sample only from priors
data = d_player,
iter = 28000, warmup = 27000, chains = 4, cores = 4. # MCMC settings
)
# Only for prior predictive check
conditional_effects( m4.3a, spaghetti = T, ndraws = 200 ) %>%
plot( points = F, point_args = c(alpha = 1/2, size = 1), line_args = c(colour = "black") )
pp_check( m4.3a )
# plot( m4.3a )
Exercise:
Let’s add data (uncomment sample_prior = “only”)
m4.3b <-
brm( family = gaussian, # h ~ Normal( mu, sigma )
height ~ 1 + duelAerialPerc, # mu = a + bx_i
prior = c(prior( normal( 178, 20 ), class = Intercept), # a ~ Normal( 178, 20 )
prior( normal( 0, 10 ), class = b), # b ~ Normal( 0, 10 )
prior( uniform(0, 100), class = sigma )), # sigma ~ Uniform( 0, 100 )
# sample_prior = "only", # sample only from priors
data = d_player,
iter = 28000, warmup = 27000, chains = 4, cores = 4. # MCMC settings
)
pp_check( m4.3b )
# plot( m4.3b )
Let’s check the summary of the posterior distribution
posterior_summary( m4.3b )[1:3, ]
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 174.521355 0.8390793 172.891066 176.141410
## b_duelAerialPerc 15.671251 1.7172059 12.303982 19.043173
## sigma 5.945388 0.2162514 5.552194 6.394807
Why check correlations among parameters?
Sometimes, the intercept (\(\alpha\)) and slope (\(\beta\)) move together (are
correlated).
When one goes up and the other goes down, the model can still fit the
data almost the same way.
If that happens: - The model has a harder time deciding which values
of \(\alpha\) and \(\beta\) are best.
- The estimates become less stable and take longer to compute.
pairs( m4.3b )
alpha (b_Intercept) and beta (b_duelAerialPerc) are highly correlated. This means these two parameters carry the same information. Not a problem here but can cause issues in more complex models.
How centering and z-scoring help
Centering the predictor (subtracting its mean)
moves the zero point to the average value of the data.
Then the intercept means “the expected outcome when the predictor is
average.”
Centering typically reduces the \(\alpha\)–\(\beta\) correlation.
Z-scoring goes one step further: center
and divide by the standard deviation.
This puts predictors on a similar scale, so coefficients are easier to
compare across models.
d_player <-
d_player %>%
mutate( aerials_c = duelAerialPerc - mean( duelAerialPerc ))
head(d_player, n=10)
## name height duelAerialPerc aerials_c
## 1 Álvaro Negredo 186 0.3649635 -0.09289577
## 2 Aaron Lennon 165 0.1666667 -0.29119261
## 3 Aaron Ramsey 177 0.3750000 -0.08285928
## 4 Abdoulaye Faye 188 0.6111111 0.15325183
## 5 Adam Johnson 175 0.3809524 -0.07690690
## 6 Adam Lallana 173 0.3050847 -0.15277453
## 7 Adel Taarabt 180 0.3571429 -0.10071642
## 8 Adlène Guédioura 183 0.2857143 -0.17214499
## 9 Adrian Mariappa 180 0.6190476 0.16118834
## 10 Ahmed Elmohamady 180 0.5034014 0.04554208
Rerun our model with the centered variable
m4.3c <-
brm( family = gaussian, # h ~ Normal( mu, sigma )
height ~ 1 + aerials_c, # mu = a + bx_i
prior = c(prior( normal( 178, 20 ), class = Intercept), # a ~ Normal( 178, 20 )
prior( normal( 0, 10 ), class = b), # b ~ Normal( 0, 10 )
prior( uniform(0, 100), class = sigma )), # sigma ~ Uniform( 0, 100 )
# sample_prior = "only", # sample only from priors
data = d_player,
iter = 28000, warmup = 27000, chains = 4, cores = 4. # MCMC settings
)
# Plot correlations
pairs( m4.3c )
Let’s check the summary of the posterior distribution
mean(d_player$height)
## [1] 181.6933
posterior_summary( m4.3c )[1:3, ]
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept 181.691233 0.3007712 181.10591 182.297491
## b_aerials_c 15.701673 1.7312709 12.29597 19.122064
## sigma 5.950094 0.2135221 5.54392 6.381074
Centering reduced the correlation between alpha and beta. Note that \(\beta\) and \(\sigma\) are unchanged but the estimate for alpha (i.e., the intercept). When variables are standardized, the intercept represents the mean of the outcome variable!
Now that we have estimated our model, we can visualize what it has learned.
First, we plot the posterior inference — the regression line based on the most likely values of \(\alpha\) (intercept) and \(\beta\) (slope) — against the actual data.
# Mean line from posterior (point estimates)
d_player %>%
ggplot(aes(x = duelAerialPerc, y = height)) +
geom_abline(intercept = fixef(m4.3b)[1],
slope = fixef(m4.3b)[2]) +
geom_point(color = "royalblue")
fixef(m4.3b)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 174.52136 0.8390793 172.89107 176.14141
## duelAerialPerc 15.67125 1.7172059 12.30398 19.04317
Next, we add an uncertainty band around the mean (the credible interval). You don’t need to understand the code yet — the idea is that the model gives us not just one line, but a range of plausible lines given what the data tell us.
We can use fitted() from the brms package to compute the model’s expected (mean) height for a range of predictor values.
# Here, we define a sequence of values for aerial duels for which we want to compute predictions.
aerials_seq <-
tibble( duelAerialPerc = seq(from = 0, to = 1, by = 0.1 ))
aerials_seq
## # A tibble: 11 × 1
## duelAerialPerc
## <dbl>
## 1 0
## 2 0.1
## 3 0.2
## 4 0.3
## 5 0.4
## 6 0.5
## 7 0.6
## 8 0.7
## 9 0.8
## 10 0.9
## 11 1
# Get fitted values for each new data point
mu_summary <-
fitted( m4.3b,
newdata = aerials_seq ) %>%
as_tibble() %>%
bind_cols( aerials_seq ) %>%
select( duelAerialPerc, Estimate, Q2.5, Q97.5)
mu_summary
## # A tibble: 11 × 4
## duelAerialPerc Estimate Q2.5 Q97.5
## <dbl> <dbl> <dbl> <dbl>
## 1 0 175. 173. 176.
## 2 0.1 176. 175. 177.
## 3 0.2 178. 177. 179.
## 4 0.3 179. 178. 180.
## 5 0.4 181. 180. 181.
## 6 0.5 182. 182. 183.
## 7 0.6 184. 183. 185.
## 8 0.7 185. 184. 186.
## 9 0.8 187. 186. 188.
## 10 0.9 189. 187. 190.
## 11 1 190. 188. 192.
Now we can plot the credible interval around the mean:
# Plot mean ± 95% credible interval
d_player %>%
ggplot( aes( x = duelAerialPerc, y = height )) +
geom_smooth( data = mu_summary,
aes( y = Estimate, ymin = Q2.5, ymax = Q97.5 ),
stat = "identity" ) +
geom_point( )
So far, this shows uncertainty about the average height at each level of aerial-duel performance. But real players also differ randomly around that mean — that’s captured by the model’s \(\sigma\).
\[ h_i \sim \text{Normal}(\mu_i = \alpha + \beta \times x_i, \sigma) \] To visualize that, we can draw a prediction interval that shows where new players’ heights are likely to fall.
# Here, we define a sequence of values for aerial duels for which we want to compute predictions.
aerials_seq <-
tibble( duelAerialPerc = seq(from = 0, to = 1, by = 0.1 ))
# Get predicted value for each new data point
pred_aerial <-
predict( m4.3b,
newdata = aerials_seq ) %>%
as_tibble() %>%
bind_cols( aerials_seq ) %>%
select( duelAerialPerc, Estimate, Q2.5, Q97.5)
pred_aerial
## # A tibble: 11 × 4
## duelAerialPerc Estimate Q2.5 Q97.5
## <dbl> <dbl> <dbl> <dbl>
## 1 0 175. 163. 187.
## 2 0.1 176. 164. 187.
## 3 0.2 178. 166. 189.
## 4 0.3 179. 167. 191.
## 5 0.4 181. 169. 193.
## 6 0.5 182. 171. 194.
## 7 0.6 184. 173. 196.
## 8 0.7 185. 174. 197.
## 9 0.8 187. 175. 199.
## 10 0.9 189. 177. 201.
## 11 1 190. 178. 202.
Finally, we can plot both together: the inner (narrow) band for the mean and the outer (wider) band for possible new data.
# Plot: prediction interval (gray band) + credible interval for mean (line+ribbon)
d_player %>%
ggplot( aes( x = duelAerialPerc )) +
geom_ribbon( data = pred_aerial,
aes( ymin = Q2.5, ymax = Q97.5 ),
fill = "grey83" ) +
geom_smooth( data = mu_summary,
aes( y = Estimate, ymin = Q2.5, ymax = Q97.5 ),
stat = "identity" ) +
geom_point( aes( y = height ) )
There is one line of code in brms that does it at once ;-)
# Prior predictive check
conditional_effects( m4.3a, spaghetti = T, ndraws = 200 ) %>%
plot( points = T, point_args = c(alpha = 1/2, size = 1), line_args = c(colour = "black") )
# Posterior predictive check
conditional_effects( m4.3b ) %>% plot( points = T )
Predict a player’s marketvalue from age.
Use the following data.
d_player <- read.csv("players.csv") %>% # load data
filter( competition == "Premier League" & season == "2013") %>% # filter for Premier League and Season
select( name, age, marketvalue ) # select some variables
head( d_player )
## name age marketvalue
## 1 Álvaro Negredo 28 27000000
## 2 Aaron Lennon 26 15000000
## 3 Aaron Ramsey 23 20000000
## 4 Abdoulaye Faye 35 250000
## 5 Adam Johnson 26 7500000
## 6 Adam Lallana 25 7500000
Plot the marketvalue and age (scatter plot).
# Your code here
Write out a statistical model. We expect the mean marketvalue (mu) to be 8,000,000 € with a broad standard deviation (sigma) = 50,000,000 €. Let’s further assume a broad range of possible means of 30,000,000 €. Finally, we assume that age does not matter (mean = 0) but with a range of +- 2,000,000 €.
\[ Model \]
Model in brm code.
# Your code here
Let’s check the summary of the posterior distribution
# Your code here
Is this a good model?
Let’s assume a player’s market value increases until a certain age and then decreases. How do we model this?
A common polynomial regression adds a quadratic term (u-shaped):
\[ \begin{align*} mv_i & \sim \text{Normal}( \mu_i, \sigma ) \\ \text{1st order (line):} &\ \mu_i = \alpha + \beta_1 x_i \\ \text{2nd order (parabola):} &\ \mu_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 \end{align*} \]
m4.5 <-
brm( family = gaussian, # mv ~ Normal( mu, sigma )
marketvalue ~ 1 + age + I(age^2), # mu = a + b1*age_i + b2*age^2_i
prior = c(prior( normal( 8000000, 30000000 ), class = Intercept), # a ~ Normal( 8.000.000, 30.000.000 )
prior( normal( 0, 2000000 ), class = b), # b ~ Normal( 0, 2.000.000 )
prior( uniform(0, 50000000), class = sigma )), # sigma ~ Uniform( 0, 50.000.000 )
#sample_prior = "only",
data = d_player,
iter = 28000, warmup = 27000, chains = 4, cores = 4
)
# Plot the model
pp_check( m4.5 )
# plot( m4.5 )
(The function I() stands for the “inhibit interpretation” operator, which tells R to treat age^2 literally as a squared term, rather than as part of a formula.)
Let’s check the summary of the posterior distribution
posterior_summary( m4.5 )[1:4, ]
## Estimate Est.Error Q2.5 Q97.5
## b_Intercept -17560754.5 14658642.90 -45956224.08 10981104.05
## b_age 2375568.1 1091336.75 268623.98 4555878.69
## b_IageE2 -52146.1 20166.48 -92162.82 -13039.49
## sigma 8582295.6 312925.68 8006954.89 9204241.58
Plot the polynomial.
conditional_effects( m4.5 ) %>% plot( points = TRUE )
Now, let’s remove outliers, log the marketvalue, standardize age and age2, and then rerun the model.
(Why do we log and how can we interpret it: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/)
# Load and wrangle data
d_player <- d_player %>%
# remove outlier, standardize vars, log marketvalue
filter( marketvalue <= 20000000) %>%
mutate( age_s = ( age - mean( age )) / sd(age),
age2 = age^2,
age2_s = ( age2 - mean( age2 )) / sd(age2),
mv_log = log( marketvalue )
)
head( d_player )
## name age marketvalue age_s age2 age2_s mv_log
## 1 Aaron Lennon 26 15000000 -0.1598652 676 -0.2249829 16.52356
## 2 Abdoulaye Faye 35 250000 2.1932911 1225 2.4168203 12.42922
## 3 Adam Johnson 26 7500000 -0.1598652 676 -0.2249829 15.83041
## 4 Adam Lallana 25 7500000 -0.4213270 625 -0.4703963 15.83041
## 5 Adel Taarabt 24 6000000 -0.6827889 576 -0.7061857 15.60727
## 6 Adlène Guédioura 28 2500000 0.3630584 784 0.2947161 14.73180
Plot a histogram of log(marketvalue)
d_player %>%
ggplot( aes( x= mv_log )) +
geom_histogram()
Let’s rerun the model
m4.5_s <-
brm( family = gaussian, # h ~ Normal( mu, sigma )
mv_log ~ 1 + age_s + I(age_s^2), # mu = a + bx_i
prior = c(prior( normal( 8000000, 30000000 ), class = Intercept), # a ~ Normal( 8.000.000, 30.000.000 )
prior( normal( 0, 1 ), class = b), # b ~ Normal( 0, 1 )
prior( uniform(0, 50000000), class = sigma )), # sigma ~ Uniform( 0, 50.000.000 )
#sample_prior = "only",
data = d_player,
iter = 28000, warmup = 27000, chains = 4, cores = 4
)
# Plot the model
pp_check( m4.5_s ) # Posterior Predictive Check
#plot( m4.5_s ) # Plot estimates
# Plot posterior
conditional_effects( m4.5_s ) %>% plot( points = TRUE )
Teaching Case:
Reload player data with age
d_player <- read.csv("players.csv") %>% # load data
filter( competition == "Premier League" & season == "2013") %>% # filter for Premier League and Season
select( name, height, duelAerialPerc, age ) # select some variables
head( d_player )
## name height duelAerialPerc age
## 1 Álvaro Negredo 186 0.3649635 28
## 2 Aaron Lennon 165 0.1666667 26
## 3 Aaron Ramsey 177 0.3750000 23
## 4 Abdoulaye Faye 188 0.6111111 35
## 5 Adam Johnson 175 0.3809524 26
## 6 Adam Lallana 173 0.3050847 25
Previously, we modeled height as a function of aerial duels
alone.
Now we expand our model to include age as an additional predictor.
\[ \begin{align*} \text{Data:} \quad & h_i \sim \text{Normal}(\mu_i, \sigma) \\ \text{Model:} \quad & \mu_i = \alpha + \beta_1 \cdot \text{duelAerialPerc}_i + \beta_2 \cdot \text{age}_i \\ \text{Priors:} \quad & \alpha \sim \text{Normal}(178, 20) \\ & \beta_1 \sim \text{Normal}(0, 10) \\ & \beta_2 \sim \text{Normal}(0, 10) \\ & \sigma \sim \text{Uniform}(0, 100) \end{align*} \]
Interpretation
brmsWe use the same brm() function as before, now with two
predictors.
m4.6 <-
brm(
family = gaussian,
height ~ 1 + duelAerialPerc + age, # two predictors
prior = c(
prior(normal(178, 20), class = Intercept), # α
prior(normal(0, 10), class = b), # βs (shared prior)
prior(uniform(0, 100), class = sigma) # σ
),
data = d_player,
iter = 4000, chains = 4, cores = 4
)
summary(m4.6)
We can visualize how predicted height changes with aerial performance for different ages.
conditional_effects(m4.6, effects = "duelAerialPerc:age") %>%
plot(points = TRUE)
Unstandardized predictors can produce unstable estimates and slow convergence because the variables are on very different scales:
| Variable | Typical range | Units |
|---|---|---|
| Height | 160–200 | cm |
| Duel Aerial % | 0–100 | percentage pts |
| Age | 17–40 | years |
This scale mismatch makes it harder for the sampler to find good
combinations of \((\alpha, \beta_1,
\beta_2)\), can inflate uncertainty, and makes priors like
Normal(0, 10) mean very different things across
predictors.
Centering (subtract the mean) makes 0 correspond to an average player, so the intercept \(\alpha\) becomes the expected height at average aerial performance and average age.
Standardizing (z-scoring: center & divide by SD) puts predictors on a common scale (1 unit = 1 SD), which:
Normal(0, 2) ≈ ±4 cm per 1 SD change),d_player <- d_player %>%
mutate(
duelAerial_z = as.numeric(scale(duelAerialPerc)),
age_z = as.numeric(scale(age))
)
Next, we refit the same model with z-standardized predictors, so slopes are interpreted as cm per 1 SD change in each predictor and the sampler works on a better-scaled parameter space.
m4.6b <- brm(
family = gaussian(),
height ~ 1 + duelAerial_z + age_z,
prior = c(
prior(normal(178, 12), class = Intercept), # mean height at average predictors
prior(normal(0, 2), class = b), # ~±4 cm per 1 SD change (weakly-informative)
prior(student_t(3, 0, 10), class = sigma) # half-Student-t for residual SD
),
data = d_player,
iter = 4000, warmup = 2000, chains = 4, cores = 4
)
Inspect the results
summary(m4.6b)
## Family: gaussian
## Links: mu = identity
## Formula: height ~ 1 + duelAerial_z + age_z
## Data: d_player (Number of observations: 388)
## Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
## total post-warmup draws = 8000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 181.70 0.30 181.13 182.30 1.00 8758 6397
## duelAerial_z 2.70 0.31 2.10 3.31 1.00 8046 6129
## age_z 0.26 0.31 -0.35 0.86 1.00 8780 5900
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 5.94 0.21 5.53 6.37 1.00 9193 6457
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
conditional_effects(m4.6b, effects = "duelAerial_z:age_z") %>%
plot(points = TRUE)
Discussion
In-Class Exercise: see worksheet