Overview

By the end of these workshops and your own studies, you should be able to

  • Formally describe a normal linear model
  • Implement normal models (continuous, categorical predictors)
  • Conceptually understand maximum likelihood estimation (how lm determines model coefficients)
  • Interpret model coefficients and quantities related to null hypothesis testing (confidence interval, t-value, p-value)
  • Predict hypothetical outcomes given a statistical model

We will work towards these goals using a combination of theoretical descriptions, practical exercises, implementations, and simulations.

Regression models

  • Often introduced as fitting lines to points.
  • This is a limited perspective that makes more complex regression models, such as generalised linear models, hard to understand.
  • Regression extends to generalised and multilevel linear models, and subsumes multiple / simple linear regression, AN(C)OVA, and MAN(C)OVA.
  • Backbone of statistical modelling: machine learning, time series models, path analysis, structural equation models, factor analysis, signal detection models

Regression models

  • A model of how the probability distribution of one variable, known as the outcome variable, varies as a function of other variables, known as the explanatory or predictor variables.
  • The most basic type of regression model is the normal linear model.
  • In normal linear models, we assume that the outcome variable is normally distributed and that its mean varies linearly with changes in a set of predictor variables.
  • By understanding the normal linear model thoroughly, we can see how it can be extended to deal with data and problems beyond those that it is designed for.

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu, \sigma^2) \end{aligned} \]

  • \(y\): observed outcome variable
  • Each observation \(y_i\) is a sample (indexed by \(i\))
  • Lower case letters indicate observed variables.
  • “\(\sim\)”: “is distributed as”
  • \(\mathcal{N}\): univariate normal probability model; i.e. the underlying process that describes the nature of our data.
  • Population parameters are represented as Greek letters, i.e. mean \(\mu\) and variance \(\sigma^2\) (which parameters there are depends on the probability model and its parametrisation).

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} \\ \end{aligned} \]

The value of \(\mu_i\) can be decomposed using a linear function.

  • The contribution of \(\beta_1\) depends on the explanatory variable \(x_{1i}\)
  • \(\beta_0\) is “constant”
  • If \(x_{1i}\) takes on 0, \(\mu_i = \beta_0\)
  • If \(x_{1i}\) takes on 1, \(\mu_i = \beta_0 + \beta_1\)
  • \(\beta_1\) is therefore the difference in \(\mu_i\) between \(x_{1i}=1\) and \(x_{1i}=0\).
  • The value of \(\mu_i\) is a deterministic (“\(=\)”) linear function of the values of one or more predictor variables \(x\) and the unknowns \(\beta_0\) (intercept) and \(\beta_1\) (slope).

Normal linear model (example)

A model of how the mean of the distribution of reaction time values (rt) varies as a function of age and sex.

\[ \begin{align} \text{rt}_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_\text{age} \cdot \text{age}_i + \beta_\text{sex} \cdot \text{sex}_i \end{align} \]

For \(i \in 1\dots n\), where \(n\) is the total number of observations.

  • Write down (on paper) a formal model description for a hypothetical normal linear model of blood pressure as a function of (i.e. predicted by) nutrition, age, and level of physical activity (a possible description is sketched below).
  • We will then write down the model description in RMarkdown.
  • Adapt the model description to your own dataset.
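
One possible formal description, assuming all three predictors enter the linear function additively (how nutrition and physical activity are quantified is up to you):

\[ \begin{align} \text{bp}_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_\text{nutrition} \cdot \text{nutrition}_i + \beta_\text{age} \cdot \text{age}_i + \beta_\text{activity} \cdot \text{activity}_i \end{align} \]

for \(i \in 1\dots n\).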

Simulation of normally distributed data

  • We can use lm to easily implement such a model.
model <- lm(rt ~ age + sex, data = data)
  • How do we know that lm returns accurate estimates for our model parameters?
  • We can simulate data that exactly meet model assumptions (normal distribution, equal variance, independence etc.) with known model parameters.

Simulation of normally distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Generate data (try n = 10, 100, 1000 and compare)
y <- rnorm(n = 10, mean = 550, sd = 10)
y <- rnorm(n = 100, mean = 550, sd = 10)
y <- rnorm(n = 1000, mean = 550, sd = 10)

The mean itself can be decomposed into a sum of parameters:

# Decompose the mean
beta_0 <- 500
beta_1 <- 50
# Generate data
y <- rnorm(n = 1000, mean = beta_0 + beta_1, sd = 10)

Simulation of normally distributed data

# Load the tidyverse (for tibble, pivot_longer, glimpse)
library(tidyverse)
# Set parameter values
n <- 1000
beta_0 <- 500 
beta_1 <- 50
sigma <- 100
# Random data for group 1
group_1 <- rnorm(n = n/2, mean = beta_0, sd = sigma)
# Random data for group 2
group_2 <- rnorm(n = n/2, mean = beta_0 + beta_1, sd = sigma)
# Generate data
sim_data <- tibble(group_1 = group_1,
                   group_2 = group_2) 

# Preview data
glimpse(sim_data)
Rows: 500
Columns: 2
$ group_1 <dbl> 441, 627, 473, 491, 568, 472, 625, 377, 496, 522, 575, 428, 43…
$ group_2 <dbl> 611, 593, 477, 657, 587, 598, 570, 595, 619, 567, 644, 601, 32…
# Change data format
sim_data <- pivot_longer(sim_data, cols = c(group_1, group_2), names_to = "x", values_to = "y")

# Preview data
glimpse(sim_data)
Rows: 1,000
Columns: 2
$ x <chr> "group_1", "group_2", "group_1", "group_2", "group_1", "group_2", "g…
$ y <dbl> 441, 611, 627, 593, 473, 477, 491, 657, 568, 587, 472, 598, 625, 570…

Simulation of normally distributed data

# As a reminder, these are the parameter values
beta_0 <- 500 
beta_1 <- 50
sigma <- 100
# Normal linear model of simulated data
model <- lm(y ~ x, data = sim_data)

coef(model) # Model coefficients
(Intercept)    xgroup_2 
      500.1        45.7 
sigma(model) # Standard deviation
[1] 98

Simulation: exercise

Work through the scripts

  • exercises/normal_model_simulation_1.R
  • exercises/normal_model_simulation_2.R

You will need to complete the code in those scripts. Then observe how changing the number of observations affects the model estimates.

Real data

  • In real life we don’t know the true parameter values.
  • We also don’t know the true process that generates the data.
  • For the simulated data, we know the data-generating process:
    • the data follow a normal distribution, because rnorm samples normally distributed values
    • both groups have the same (equal) variance, because the value of sigma was the same for both.
  • We can use linear models to estimate parameter values from the data (i.e. we make assumptions about the process that generates the data).
  • Maximum likelihood estimation asks: for which set of parameter values are the observed data most likely?

Real data

  • Data may appear non-normally distributed in small samples.
  • That’s fine if we have reason to believe that a normal distribution underlies the process that generates the data.
  • The data above are by-participant means, meaning the central limit theorem applies.
  • Yet, reaction times are notoriously skewed, because they are bounded at zero.

Real data

  • The data are roughly normally distributed (there is some positive skew).
  • We assume that the logarithm of the rt, denoted \(\text{logrt}_1, \text{logrt}_2\dots \text{logrt}_n\), is sampled from a normal distribution with a fixed and unknown mean \(\mu\) and standard deviation \(\sigma\).

\[ \text{logrt}_i \sim \mathcal{N}(\mu, \sigma^2) \text{, for } i \in 1\dots n \]

Real data

Predictor variable sex is denoted as \(\text{sex}_1, \text{sex}_2, \dots \text{sex}_n\), \(\text{ for } i \in 1\dots n\)

\[ \text{logrt}_i \sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_i \]

where \(\text{sex}_i\) takes on \(0\) for females and \(1\) for males.

\[ \text{sex}_i = \left\{ \begin{array}{ll} 0, \text{ if sex}_i = \texttt{female}\\ 1, \text{ if sex}_i = \texttt{male}\\ \end{array} \right. \]

We assume the variance is the same for males and females.

  • Because \(\mu_i = \beta_0 + \beta_\text{sex} \cdot 0 = \beta_0\), \(\beta_0\) is the average logrt for females.
  • Because \(\mu_i = \beta_0 + \beta_\text{sex} \cdot 1 = \beta_0 + \beta_\text{sex}\) is the average logrt for males, \(\beta_\text{sex}\) is the difference in average logrt between males and females (see the check below).
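
We can sanity-check this interpretation against the group means; a minimal sketch, assuming the blomkvist data with a log_rt column are loaded (as on the later slides):

# Group means of log_rt should match beta_0 and beta_0 + beta_sex
blomkvist %>%
  group_by(sex) %>%
  summarise(mean_log_rt = mean(log_rt))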

Real data: exercise

  • Complete exercise scripts
    • exercises/normal_model_blomkvist_1.R
    • exercises/normal_model_blomkvist_2.R
  • Make sure you understand how the model coefficients relate to the visualisation.

Compare simulated and real data

# Set parameter values
beta_0 <- 500 
beta_1 <- 50
sigma <- 100
# Inspect data
glimpse(data, width = 50)
Rows: 1,000
Columns: 2
$ x <chr> "group_1", "group_2", "group_1", "grou…
$ y <dbl> 366, 488, 586, 458, 626, 615, 432, 685…
# Specify model
model <- lm(y ~ x, data = data)
# Model coefficients
coef(model)
(Intercept)    xgroup_2 
      508.5        38.2 
# Load and transform data
blomkvist <- read_csv("../data/blomkvist.csv") %>% 
  mutate(log_rt = log(rt_hand_d)) %>% 
  select(sex, log_rt)
# Inspect data
glimpse(blomkvist, width = 50)
Rows: 267
Columns: 2
$ sex    <chr> "male", "female", "female", "fema…
$ log_rt <dbl> 6.55, 6.15, 6.46, 6.56, 6.41, 6.2…
# Specify model
model <- lm(log_rt ~ sex, data = blomkvist)
# Model coefficients
coef(model)
(Intercept)     sexmale 
     6.4602     -0.0943 

Interpreting coefficients on the log scale

The slope coefficient is no longer additive (linear scale) but multiplicative (exponential scale).

# Specify model
model <- lm(log_rt ~ sex, data = blomkvist)

# Model coefficients
(coefs <- as.vector(coef(model)))
[1]  6.4602 -0.0943

I.e. on the original msec scale, rt for males is \(e^{\beta_\text{sex}}\) times that for females; here \(e^{-0.0943} \approx 0.91\).

# Converting coefficients to msecs
intercept <- coefs[1]
slope <- coefs[2]

The exponential function exp back-transforms (un-logs) the coefficients.

exp(intercept) # female group in msecs
[1] 639
exp(intercept + slope) # male group in msecs
[1] 582
exp(intercept + slope) - exp(intercept) # in msecs
[1] -57.5
  • Complete the exercise scripts:
    • exercises/interpreting_coefficients_1.R
    • exercises/interpreting_coefficients_2.R

Maximum likelihood estimation

If we assume the model

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{, for } i \in 1\dots n\\ \mu_i = \beta_0 + \beta_1 \cdot x_{1i} \]

We estimate the values of two unknowns (\(\beta_0\) and \(\sigma\)) plus one unknown per predictor coefficient (here one, \(\beta_1\)) using maximum likelihood estimation. The maximum likelihood estimates are denoted by

\[ \hat\beta_0,\hat\beta_1,\hat\sigma. \]

MLEs are the set of values that make the observed data most probable.

Maximum likelihood estimation

The maximum likelihood values of the coefficients \(\hat\beta_0,\hat\beta_1\) are those values of \(\beta_0,\beta_1\) that minimize the residual sum of squares

\[ \text{RSS} = \sum_{i=1}^n \mid y_i - \mu_i\mid^2, \]

where \(\mu_i\) is (as before)

\[ \mu_i = \beta_0 + \beta_1 \cdot x_{1i} \]

For models with small sets of unknowns, the maximum likelihood estimates can be found by grid approximation: see script exercises/mle.R and the sketch below.
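
A minimal sketch of grid approximation, using the simulated sim_data from earlier (the exercise script may differ):

# Evaluate the RSS over a grid of candidate beta_0 and beta_1 values
grid <- expand.grid(beta_0 = seq(400, 600, by = 1),
                    beta_1 = seq(0, 100, by = 1))
x <- as.integer(sim_data$x == "group_2") # 0/1 dummy coding the two groups
rss <- apply(grid, 1, function(b) sum((sim_data$y - (b[1] + b[2] * x))^2))
grid[which.min(rss), ] # candidate values minimizing the RSS (close to 500 and 50)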

Fitting a normal linear model using lm

  • For larger sets of parameters we can use maximum likelihood estimation as implemented in lm.
model <- lm(rt ~ sex + age, data = blomkvist)
  • We’ll continue with rt (instead of logrt) for simplicity.
  • The maximum likelihood estimates of \(\beta_0\), \(\beta_\text{sex}\) and \(\beta_\text{age}\) are as follows:
(coefs <- coef(model))
(Intercept)     sexmale         age 
     338.64      -58.93        5.75 

Fitting a normal linear model using lm

Once we have \(\hat\mu\), the maximum likelihood estimate of \(\sigma\), denoted as \(\hat\sigma\), is:

\[ \hat\sigma = \sqrt{\frac{1}{n-K-1} \cdot \sum_{i=1}^n\mid y_i-\hat\mu_i\mid^2} \]

  • \(\sum_{i=1}^n\mid y_i-\hat\mu_i\mid^2\) is the RSS (\(y_i\): observed data; \(\hat\mu_i\): model prediction).
  • \(K\) is the number of predictors (here 2: sex and age)
  • \(n\) is the number of observations
  • Complete script exercises/calculating_sigma.R (a sketch of the calculation follows)
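
A minimal sketch of that calculation, assuming the model lm(rt ~ sex + age, data = blomkvist) fitted above:

# Manual calculation of sigma-hat from the residuals
n <- nrow(blomkvist)
K <- 2 # number of predictors: sex and age
rss <- sum(residuals(model)^2) # residual sum of squares
sqrt(rss / (n - K - 1)) # should match sigma(model)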

Before we move on…

  1. Construct a theoretical model for your own data.
  2. Implement your theoretical model using lm().
  3. Work out the maximum likelihood estimates obtained by lm(), but instead of the summary() function, use the tidy() function from the broom package and run
library(broom)
tidy(my_model, conf.int = TRUE)

Hypothesis testing: t-values

If \(\beta_1\) is the hypothesized true value of the coefficient, and \(\hat\beta_1\) is the estimated coefficient, then

\[ \frac{\hat\beta_1 - \beta_1}{\hat{\text{se}}_1} \sim t_{n-K-1}, \]

which tells us that if the true value of the coefficient was some hypothesized value \(\beta_1\), then we expect the t-statistic to have a certain range of values with high probability.

Hypothesis testing: t-values

We will test a hypothesis about the coefficient for sex in the model above.

# Extract the coefficients table from summary
coefs <- summary(model)$coefficients
# MLE for sex
(beta_sex <- coefs['sexmale', 'Estimate'])
[1] -58.9
# Standard error
(se_sex <- coefs['sexmale', 'Std. Error'])
[1] 18.5

If we hypothesize that the true value of the coefficient is exactly 10, then our t-statistic is

(t <- (beta_sex - 10)/se_sex)
[1] -3.73

\[ \frac{\hat\beta_\text{sex} - \beta_\text{sex}}{\hat{\text{se}}_\text{sex}} = \frac{-58.9 - 10}{18.5} = -3.73 \]

p-values

  • The p-value for the hypothesis is the probability of getting a value as extreme as, or more extreme than, the absolute value of the observed t-statistic in the t-distribution.
  • This is the area under the curve of the t-distribution beyond the absolute value of the t-statistic, counting both tails (hence the multiplication by 2 below).
n <- nrow(blomkvist)
K <- 2
pt(abs(t), df = n - K - 1, lower.tail = FALSE) * 2
[1] 0.000233

Hypothesis testing: t-values

Null hypothesis significance testing: we hypothesize the true value of the coefficient is exactly 0.0, then our t-statistic is

\[ \frac{\hat\beta_k - \beta_k}{\hat{\text{se}}_k} = \frac{-58.9 - 0.0}{18.5} = \frac{-58.9}{18.5} = -3.19 \]

(t <- (beta_sex - 0.0)/se_sex)
[1] -3.19
n <- nrow(blomkvist)
K <- 2
pt(abs(t), df = n - K - 1, lower.tail = FALSE) * 2
[1] 0.00159

p-values

  • Work through script exercises/t_values.R
  • Calculate t and p-value for \(\hat\beta_\text{age}\) testing the null hypothesis that the true value \(\beta_\text{age}\) is 0.0.
  • Verify your calculation using
summary(model)$coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   338.64     27.359   12.38 4.91e-28
sexmale       -58.93     18.473   -3.19 1.59e-03
age             5.75      0.439   13.09 1.72e-30

t-distribution

[Figure slides: the t-distribution with the observed t-statistic and the shaded tail areas]

Confidence intervals

  • The set of all hypotheses (hypothetical values) that we do not rule out at the \(\alpha = 0.05\) level of significance is known as the 95% confidence interval.
  • The formula for the confidence interval is

\[ 95\% \text{ CI} = \hat\beta_1 \pm \tau_\nu \cdot \hat{\text{se}}_1 \]

  • \(\tau_\nu\) is the value in a t-distribution (with \(\nu\) degrees of freedom) below which lies 97.5% of the area under the curve, so that 95% of the area lies within \(\pm\tau_\nu\): i.e. \(\tau\approx1.96\) for large dfs.
  • \(\tau\) standard errors above and below \(\hat\beta_1\) give the upper and lower bounds of the confidence interval.

Confidence intervals

(coefs <- coef(summary(model)))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   338.64     27.359   12.38 4.91e-28
sexmale       -58.93     18.473   -3.19 1.59e-03
age             5.75      0.439   13.09 1.72e-30
beta_sex <- coefs[2,1]
se_sex <- coefs[2,2]

We can calculate \(\tau\) using qt

(tau <- qt(.975, df = n - K - 1))
[1] 1.97

We then obtain the confidence intervals as follows

beta_sex + c(-1, 1) * se_sex * tau
[1] -95.3 -22.6
confint(model, parm = 'sexmale',  level = .95)
        2.5 % 97.5 %
sexmale -95.3  -22.6

Open and complete script exercises/confidence_interval.R

Confidence intervals: a simulation

  • Work through script exercises/confidence_interval_simulation.R (a minimal sketch of such a simulation follows below).
  • Change the number of simulations to, one at a time, n_sims = 20, 50, 100, 500, 1000.
  • Notice how often the CI does not contain the true parameter value \(\mu\).
  • What percentage of the simulated CIs contain \(\mu\)?
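
A minimal sketch of such a simulation, assuming the object names (n_sims, sims, mu_in_ci) used in the output on the next slide; the exercise script may differ:

# Simulate many experiments; record whether each 95% CI contains mu
library(tidyverse)
n_sims <- 1000 # number of simulated experiments
n <- 100       # observations per experiment
mu <- 550      # true population mean
sigma <- 10
sims <- map_dfr(1:n_sims, function(sim) {
  y <- rnorm(n, mean = mu, sd = sigma)
  ci <- confint(lm(y ~ 1)) # 95% CI for the mean
  tibble(mu_in_ci = if_else(ci[1] <= mu & mu <= ci[2], "yes", "no"))
})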

Confidence intervals: a simulation

count(sims, mu_in_ci) %>% 
  mutate(prop = n / n_sims)
# A tibble: 2 × 3
  mu_in_ci     n  prop
  <chr>    <int> <dbl>
1 no          51 0.051
2 yes        949 0.949
  • As the number of experiments approaches \(\infty\) and we calculate the 95% CI for each experiment, 95% of these CIs contain \(\mu\).
  • Note though, 5% of the time, the confidence interval does not contain \(\mu\)!
  • What does this mean for real (as opposed to simulated) experiments?

Varying intercepts model

When we include sex as an explanatory variable in addition to a continuous predictor variable (e.g. age), the model is as follows.

\[ \text{rt}_i \sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_{i} + \beta_\text{age} \cdot \text{age}_{i} \]

With sexes denoted by \(\text{sex}_{1},\text{sex}_{2}\dots \text{sex}_{i}\dots \text{sex}_{n}\) and age by \(\text{age}_{1},\text{age}_{2}\dots \text{age}_{i}\dots \text{age}_{n}\) \(\text{ for } i \in 1\dots n\).

Varying intercepts model

To implement this model using lm we would do the following.

model_2 <- lm(rt ~ sex + age, data = blomkvist)
coef(model_2)
(Intercept)     sexmale         age 
     338.64      -58.93        5.75 
  • Note that sex is dichotomous categorical and age is continuous.
  • If sex changes by one unit, when age is held constant, then the average value of the corresponding distribution of rt changes by \(\beta_\text{sex}\).
  • Conversely, if age changes by one unit, when sex is held constant, the average value of the distribution of rt changes by \(\beta_\text{age}\).

Varying intercepts model

Given that \(\text{sex}_{i}\) takes the value 0 when sex is female and 1 when sex is male, this model can be written as follows.

\[ \text{rt}_i \sim \mathcal{N}(\mu_i, \sigma^2) \\ \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_{i} + \beta_\text{age} \cdot \text{age}_{i}\\ \mu_i = \left\{ \begin{array}{ll} \beta_0 + \beta_\text{age} \cdot \text{age}_{i}, \text{ if sex}_i = 0 \texttt{ [female]}\\ (\beta_0 + \beta_\text{sex}) + \beta_\text{age} \cdot \text{age}_{i}, \text{ if sex}_i = 1\texttt{ [male]} \end{array} \right. \]

This is a varying intercepts model.

Assumption: the age slope does not vary by sex!

[Figure slides: fixed intercept model (simple linear regression) vs. varying intercepts model]

Varying intercepts and varying slopes model

The varying-intercepts model is written as follows. For \(i \in 1\dots n\)

\[ \begin{aligned} \text{rt}_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_\text{sex} \cdot \text{sex}_{i} + \beta_\text{age} \cdot \text{age}_{i}\\ \end{aligned} \]

The following is a varying intercepts and varying slopes model:

\[ \begin{aligned} \text{rt}_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_\text{sex} \cdot \text{sex}_{i} + \beta_\text{age} \cdot \text{age}_{i} + \underbrace{\beta_\text{sex,age} \cdot \text{sex}_{i} \cdot \text{age}_{i}}_\text{interaction} \end{aligned} \]

We effectively have a third predictor that is the product \(\text{sex}_{i} \cdot \text{age}_{i}\), i.e. the interaction (see the sketch below).
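
A minimal sketch showing that the interaction is literally a product predictor, assuming blomkvist contains rt, sex, and age; the columns sex_male and sex_age are hypothetical helpers created for illustration:

# Construct the product predictor by hand
blomkvist_int <- blomkvist %>%
  mutate(sex_male = as.integer(sex == "male"),
         sex_age = sex_male * age)
# Coefficient values match lm(rt ~ sex * age, data = blomkvist)
coef(lm(rt ~ sex_male + age + sex_age, data = blomkvist_int))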

Varying intercepts and varying slopes model

Using sex and age as predictors, we can fit a varying-intercepts and varying-slopes model as follows:

model_3 <- lm(rt ~ sex + age + sex:age, data = blomkvist)

Or slightly shorter:

model_3 <- lm(rt ~ sex * age, data = blomkvist)

Assumption: the age slope does vary by sex!

In comparison

[Figure slide: the fitted models compared]

Varying intercepts and varying slopes model

model_3 <- lm(rt ~ sex * age, data = blomkvist)
(betas <- coef(model_3))
(Intercept)     sexmale         age sexmale:age 
     308.46       17.15        6.28       -1.36 

From the varying-intercepts and varying-slopes model, the intercept term represents the intercept for females; the intercept for males is

# males
betas[1] + betas[2]
[1] 326

The slope terms are as follows:

# age slope for females
betas[3]
[1] 6.28
# age slope for males
betas[3] + betas[4]
[1] 4.93
# sex difference for age slopes
betas[4]
[1] -1.36

Varying intercepts and varying slopes model

Work through scripts

  • exercises/varying_intercepts_model.R
  • exercises/varying_slopes_model.R

Polychotomous predictors

Categorical predictor variables with more than two levels (unlike dichotomous variables such as sex).

Using smoker as a single categorical predictor variable, the linear model would be as follows.

\[ \text{rt}_i \sim N(\mu_i, \sigma^2), \mu_i = \beta_0 + \beta_\text{smoker[1]} \cdot \text{smoker}_{1i} + \beta_\text{smoker[2]} \cdot \text{smoker}_{2i}, \text{ for } i \in 1\dots n. \]

where \(\text{smoker}_{1i}\), \(\text{smoker}_{2i}\) are as follows.

\[ \text{smoker}_{1i}, \text{smoker}_{2i} = \left\{ \begin{array}{ll} 0,0 \text{ if smoker}_i = \texttt{former}\\ 1,0 \text{ if smoker}_i = \texttt{no}\\ 0,1 \text{ if smoker}_i = \texttt{yes}\\ \end{array} \right. \]

The number of coefficients for a polychotomous predictor is one fewer than its number of levels (see the design-matrix sketch below).

# Categories of smoker
levels(blomkvist$smoker) #  3 levels
[1] "former" "no"     "yes"   

Polychotomous predictors

Using lm, we would simply do as follows.

model_smoker <- lm(rt ~ smoker, data = blomkvist)
(betas <- coef(model_smoker))
(Intercept)    smokerno   smokeryes 
      652.5       -18.0       -19.1 

Default (treatment) contrasts used by lm for a polychotomous predictor:

contrasts(blomkvist$smoker)
       no yes
former  0   0
no      1   0
yes     0   1

The intercept term is the predicted average rt when smoker is former (i.e. the row with only 0s).

The predicted mean of the rt distribution when smoker is no is

\[ \hat\beta_0 + \hat\beta_1 \cdot 1 + \hat\beta_2 \cdot 0 \]

betas[1] + betas[2] * 1 + betas[3] * 0
[1] 635

When smoker is yes, the predicted mean of the rt distribution is

\[ \hat\beta_0 + \hat\beta_1 \cdot 0 + \hat\beta_2 \cdot 1 \]

betas[1] + betas[2] * 0 + betas[3] * 1
[1] 633

Exercise: work through script exercises/polychotomous_predictor.R

Contrast coding

Change the contrast coding so that the intercept represents the average across groups and to obtain the desired comparisons (for discussion see Schad et al. 2020; Brehm and Alday 2022).

coef(model_smoker)
(Intercept)    smokerno   smokeryes 
      652.5       -18.0       -19.1 
# Default: treatment contrast
contrasts(blomkvist$smoker)
       no yes
former  0   0
no      1   0
yes     0   1
contr.treatment(3, base = 2) # change base level
       former yes
former      1   0
no          0   0
yes         0   1
contr.helmert(3) / 2 # Helmert contrast
       former vs no former / no vs yes
former         -0.5               -0.5
no              0.5               -0.5
yes             0.0                1.0
contr.sum(3) / 2 # sum contrast
       former vs yes no vs yes
former           0.5       0.0
no               0.0       0.5
yes             -0.5      -0.5
# Re-fit the model with the sum contrast applied
contrasts(blomkvist$smoker) <- contr.sum(3) / 2
model <- lm(rt ~ smoker, data = blomkvist)
coef(model)
        (Intercept) smokerformer vs yes     smokerno vs yes 
              640.2                24.7               -11.2 
coef(model_smoker) # model with treatment contrasts
(Intercept)    smokerno   smokeryes 
      652.5       -18.0       -19.1 

Exercise: work through script exercises/contrast_coding.R

Predictions

Using \(\hat\beta_0,\hat\beta_1\) (etc.) and \(\hat\sigma^2\), for any new vector of predictor variables \(x_{1i'}\), the corresponding \(y_{i'}\) is

\[ y_{i'} \sim \mathcal{N}(\hat\mu_{i'}, \hat\sigma^2)\\ \hat\mu_{i'} = \hat\beta_0 + \hat\beta_1 \cdot x_{1i'} \]

Predictions

model <- lm(rt ~ sex + age, data = blomkvist)
coef(model)
(Intercept)     sexmale         age 
     338.64      -58.93        5.75 

We can predict rts for a hypothetical woman of age 95.

\[ \begin{aligned} \hat\mu_i &= \hat\beta_0 + \hat\beta_\text{sex} \cdot \text{sex}_{i} + \hat\beta_\text{age} \cdot \text{age}_{i}\\ &=338.64 + -58.93 \cdot \text{sex}_{i} + 5.75 \cdot \text{age}_{i}\\ &=338.64 + -58.93 \cdot 0 + 5.75 \cdot 95\\ &=338.64 + 5.75 \cdot 95\\ &=885 \end{aligned} \]

To calculate \(\hat\mu_{i'}\) for any vector of predictor variables, we use predict:

newdata <- tibble(sex = "female", age = 95)
predict(model, newdata = newdata) 
  1 
885 

Prediction confidence intervals

  • We can calculate confidence intervals for the predicted values of \(\mu_{i'}\).
  • The confidence interval for \(\mu_{i'}\) is as follows:

\[ \hat\mu_{i'} \pm \tau_\nu \cdot \hat{\text{se}}_{\mu_{i'}} \]

where \(\hat{\text{se}}_{\mu_{i'}}\) is the standard error term that corresponds to \(\hat\mu_{i'}\).

  • We can obtain these by using interval = 'confidence' in the predict function.
predict(model, interval = 'confidence', newdata = newdata)
  fit lwr upr
1 885 845 925

Prediction confidence intervals

Work through scripts

  • exercises/predictions.R
  • exercises/predictions_with_cis.R

Recap

  • The normal linear model assumes a single normally distributed data-generating process.
  • MLE provides estimates for the unknown population parameters.
  • We can specify models with continuous and / or categorical predictors.
  • Combinations of these give varying-intercepts and / or varying-slopes models.

Homework

  • For the formative assessment

Using a data-set of your own choice, perform and report a linear regression analysis to address a psychologically relevant question.

  • Find a suitable data set you can use for the formative assessment.
  • Add section headers (what do you need to include?).
  • Construct (and implement using lm()) at least two models of which the model with more predictor variables contains all predictor variables of the simpler model.
  • Read lecture notes on NOW (also Andrews 2021).

References

Andrews, Mark. 2021. Doing Data Science in R: An Introduction for Social Scientists. SAGE Publications Ltd.

Brehm, Laurel, and Phillip M. Alday. 2022. “Contrast Coding Choices in a Decade of Mixed Models.” Journal of Memory and Language 125: 104334.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110: 104038.