PSYC30815: Normal linear models

Overview

By the end of these workshops and your own studies, you should be able to

Formally describe a normal linear model
Implement normal models (continuous, categorical predictors)
Conceptually understand maximum likelihood estimation (how lm determines model coefficients)
Interpret model coefficients and quantities related to null hypothesis testing (confidence interval, t-value, p-value)
Predict hypothetical outcomes given a statistical model

Using a combination of theoretical descriptions, practical exercises, implementations, and simulations.

Before we start \(\dots\)

Go to the NOW learning room of psyc30815 and download exercises.zip from the week 2 unit.

Open the folder and copy the R-scripts into the “exercises” folder inside the R-projects folder from last week.

If you don’t have an “exercises” folder, create on, or move the “exercises” folder in the zip directory over to your R-project.

What are normal linear models?

Often introduced as fitting lines to points.
Limited perspective that makes more complex regression models, like generalised linear models, hard to understand.
Extension to generalised and multilevel linear models (multiple / simple linear regressions, AN(C)OVAs, MAN(C)OVAs)
Backbone of statistical modelling: machine learning, time series models, path analysis, structural equation models, factor analysis, signal detection models

What are normal linear models?

A model of how the probability distribution of one variable, known as the outcome variable, varies as a function of other variables, known as the explanatory or predictor variables.
The most basic type of regression models is the normal linear model.
In normal models, we assume that the outcome variable is normally distributed and that its mean varies linearly with changes in a set of predictor variables.
By understanding the normal linear model thoroughly, we can see how it can be extended to deal with data and problems beyond those that it is designed for.

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu, \sigma^2) \end{aligned} \]

\(y\): observed outcome variable
Each observation \(y_i\) is a sample (indexed by \(i\))
Lower case letters indicate observed and known variables.
“\(\sim\)”: “is proportional to” / “is a function of”
\(\mathcal{N}\): univariate normal distributed probability model; i.e. the underlying process that describes the nature of our data.
Population parameters are represented as Greek letters; i.e. mean \(\mu\) and variance \(\sigma^2\).

Simulation of normal distributed data

rnorm generates unimodal, symmetrically distributed values.
r in rnorm = random; norm = normal

# Generate data
y <- rnorm(n = 10, mean = 550, sd = 10)

Simulation of normal distributed data

rnorm generates unimodal, symmetrically distributed values.
r in rnorm = random; norm = normal

# Generate data
y <- rnorm(n = 100, mean = 550, sd = 10)

Simulation of normal distributed data

rnorm generates unimodal, symmetrically distributed values.
r in rnorm = random; norm = normal

# Generate data
y <- rnorm(n = 1000, mean = 550, sd = 10)

Simulation: exercise

Work through the scripts

normal_model_simulation_1.R

You will need to complete the code in those scripts; i.e. replace “---” with the correct information.

Then observe how changing the number of observations N affects the model estimates.

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} \\ \end{aligned} \]

Value of \(\mu_i\) can be decomposed using a linear regression equation.

\(\mu_i\) is a deterministic (“\(=\)”) function of one or more predictors \(x\) and the unknowns \(\beta_0\) (intercept) and \(\beta_1\) (slope).
\(\beta_0\) is “constant”
\(\beta_1\) directly depends on explanatory variable \(x_{1i}\)
If \(x_{1i}\) takes on 0, \(\mu_i = \beta_0\)
If \(x_{1i}\) takes on 1, \(\mu_i = \beta_0 + \beta_1\)
\(\beta_1\) is difference between \(x_{1i}=1\) and \(x_{1i}=0\).

Simulation of normal distributed data

We can use lm to implement such a model.

model <- lm(y ~ x, data = data)

How do we know that lm returns accurate estimates for our model parameters?
We can simulate data that exactly meet model assumptions (normal distribution, equal variance, independence etc.) with known model parameters.

Simulation of normal distributed data

rnorm generates unimodal, symmetrically distributed values.
r in rnorm = random; norm = normal

# Generate data
y <- rnorm(n = 1000, mean = 550, sd = 10)

Simulation of normal distributed data

rnorm generates unimodal, symmetrically distributed values.
r in rnorm = random; norm = normal

# Decompose the mean
beta_0 = 500
beta_1 = 50
# Generate data
y <- rnorm(n = 1000, mean = beta_0 + beta_1, sd = 10)

Simulation of normal distributed data

# Set parameter values
n <- 1000
beta_0 <- 500 
beta_1 <- 50
sigma <- 100

# Random data for group 1
group_1 <- rnorm(n = n/2, mean = beta_0, sd = sigma)
# Random data for group 2
group_2 <- rnorm(n = n/2, mean = beta_0 + beta_1, sd = sigma)

# Generate data
sim_data <- tibble(group_1 = group_1,
                   group_2 = group_2) 

# Preview data
glimpse(sim_data)

Rows: 500
Columns: 2
$ group_1 <dbl> 467, 477, 459, 590, 392, 640, 559, 558, 528, 557, 591, 514, 48…
$ group_2 <dbl> 416, 642, 547, 668, 610, 566, 388, 613, 511, 542, 498, 584, 64…

# Change data format
sim_data <- pivot_longer(sim_data, cols = c(group_1, group_2), names_to = "x", values_to = "y")

# Preview data
glimpse(sim_data)

Rows: 1,000
Columns: 2
$ x <chr> "group_1", "group_2", "group_1", "group_2", "group_1", "group_2", "g…
$ y <dbl> 467, 416, 477, 642, 459, 547, 590, 668, 392, 610, 640, 566, 559, 388…

Simulation of normal distributed data

# As a reminder, these are the parameter values
beta_0 <- 500 
beta_1 <- 50
sigma <- 100

# Normal linear model of simulated data
model <- lm(y ~ x, data = sim_data)

coef(model) # Model coefficients

(Intercept)    xgroup_2 
      503.7        43.5

sigma(model) # Standard deviation

[1] 99.4

Simulation: exercise

Work through the scripts

normal_model_simulation_2.R

You will need to complete the code in those scripts; i.e. replace “---” with the correct information.

Then observe how changing the number of observations N affects the model estimates.

Normal linear model (example)

A model of how the mean of the distribution of reaction times (rt) varies as a function of age and sex.

\[ \begin{align} \text{rt}_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_\text{age} \cdot \text{age}_i + \beta_\text{sex} \cdot \text{sex}_i \end{align} \]

For \(i \in 1\dots n\) where \(n\) is total number of observations.

Writing down (on paper) a formal model description for a hypothetical normal linear model of blood pressure; state that we assume that blood pressure varies by nutrition, age, and levels of physical activity.
We will then write down the model description in RMarkdown.
Adapt the model description to your own dataset.

Real data

In real life we don’t know the underlying process that generates the data or the true parameter values.
For the simulated data we know these:
- the process we used was a normal distribution because rnorm samples normally distributed data.
- we determined the the parameters of the mean \(\beta_0\) and \(\beta_1\).
- both groups share the same variance \(\sigma^2\).
We can use linear models to estimate parameter values from the data (i.e. we make assumptions about the process that generates the data).
Maximum likelihood estimation: for which set of hypothetical parameter values are the data most likely.

Real data

Data may appear non-normal distributed for small samples.
That’s fine if have reasons to believe the data come from a process that is normal distributed.
Above data are by-participant means, meaning the central limit theorem applies.
Yet, reaction times are zero bound, so the process that generates these data is not continuous.
(Also the psychological distance between equal units of time might not be linear.)

Real data

Log-normal distributions are zero-bound.
We assume that the \(rt\) values are sampled from a log-normal distribution with a fixed and unknown mean \(\mu\) and standard deviation \(\sigma\).

\[ \log(\text{rt}_i) \sim \mathcal{N}(\mu_i, \sigma^2) \]

Real data: exercise

Complete exercise scripts
- normal_model_blomkvist_1.R
- normal_model_blomkvist_2.R
Make sure you understand how the model coefficients relate to the visualisation.

Real data

Predictor variable sex is denoted as \(\text{sex}_1, \text{sex}_2, \dots \text{sex}_n\), \(\text{ for } i \in 1\dots n\)

\[ \log(\text{rt}_i) \sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_i \]

where \(\text{sex}_i\) takes on \(0\) for females and \(1\) for males.

\[ \text{sex}_i = \left\{ \begin{array}{ll} 0, \text{ if sex}_i = \texttt{female}\\ 1, \text{ if sex}_i = \texttt{male}\\ \end{array} \right. \]

variance is the same for males and females.

Real data

Predictor variable sex is denoted as \(\text{sex}_1, \text{sex}_2, \dots \text{sex}_n\), \(\text{ for } i \in 1\dots n\)

\[ \log(\text{rt}_i) \sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_\text{sex} \cdot \text{sex}_i \]

where \(\text{sex}_i\) takes on \(0\) for females and \(1\) for males.

\[ \text{sex}_i = \left\{ \begin{array}{ll} 0, \text{ if sex}_i = \texttt{female}\\ 1, \text{ if sex}_i = \texttt{male}\\ \end{array} \right. \]

\(\beta_0\) is the average \(\log(\text{rt})\) for females because \(\mu_i = \beta_0 + \beta_\text{sex} \cdot 0 = \beta_0\).
Because \(\mu_i = \beta_0 + \beta_\text{sex} \cdot 1 = \beta_0 + \beta_\text{sex}\) is the average \(\log(\text{rt})\) for females, \(\beta_\text{sex}\) is the difference in the averages \(\log(\text{rt})\) between males and females.

Compare simulated and real data

# Set parameter values
beta_0 <- 500 
beta_1 <- 50
sigma <- 100

# Inspect data
glimpse(data, width = 50)

Rows: 1,000
Columns: 2
$ x <chr> "group_1", "group_2", "group_1", "grou…
$ y <dbl> 469, 504, 494, 409, 471, 583, 506, 552…

# Specify model
model <- lm(y ~ x, data = data)

# Model coefficients
coef(model)

(Intercept)    xgroup_2 
      505.6        48.4

# Load and transform data
blomkvist <- read_csv("../data/blomkvist.csv") %>% 
  mutate(log_rt = log(rt_hand_d)) %>% 
  select(sex, log_rt)

# Inspect data
glimpse(blomkvist, width = 50)

Rows: 267
Columns: 2
$ sex    <chr> "male", "female", "female", "fema…
$ log_rt <dbl> 6.55, 6.15, 6.46, 6.56, 6.41, 6.2…

# Specify model
model <- lm(log_rt ~ sex, data = blomkvist)

# Model coefficients
coef(model)

(Intercept)     sexmale 
     6.4602     -0.0943

Interpreting coeffcients on the log scale

Slope coef is not additive (linear) but multiplicative (exponential scale)

# Specify model
model <- lm(log_rt ~ sex, data = blomkvist)

# Model coefficients
(coefs <- as.vector(coef(model)))

[1]  6.4602 -0.0943

I.e. logrt is \(\beta_1\) times longer than \(\beta_0\)

# Converting coefficients to msecs
intercept <- coefs[1]
slope <- coefs[2]

Exponentional function exp un-logs coefficients.

exp(intercept) # female group in msecs

[1] 639

exp(intercept + slope) # male group in msecs

[1] 582

exp(intercept + slope) - exp(intercept) # in msecs

[1] -57.5

interpreting_coefficients_1.R
interpreting_coefficients_2.R
interpreting_coefficients_with_emmeans.R

Maximum likelihood estimation

If we assume the model

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{, for } i \in 1\dots n\\ \mu_i = \beta_0 + \beta_1 \cdot x_{1i} \]

We estimate values of 2 unknowns plus the number of predictor coefficients (here 1) using maximum likelihood estimation of the unknowns, which we can denote by

\[ \hat\beta_0,\hat\beta_1,\hat\sigma, \]

MLEs are the set of values that make the observed data most probable.

Maximum likelihood estimation

The maximum likelihood values of coefficients \(\hat\beta_0,\hat\beta_1\) are those values of \(\beta_0,\beta_1\) variables that minimize residual sum of squares

\[ \text{RSS} = \sum_{i=1}^n \mid y_i - \mu_i\mid^2, \]

where \(\mu_i\) is

\[ \mu_i = \beta_0 + \beta_1 \cdot x_{1i} \]

This can be simulated using grid approximation for models with small sets of unknowns: see script mle.R

Fitting a normal linear model using `lm`

For larger sets of parameters we can use maximum likelihood estimation as implemented in lm.

model <- lm(rt ~ sex + age, data = blomkvist)

We’ll continue with rt (instead logrt) data for simplicity.

Maximum likelihood estimators of \(\beta_0\), \(\beta_\text{sex}\) and \(\beta_\text{age}\) are as follow:

(coefs <- coef(model))

(Intercept)     sexmale         age 
     338.64      -58.93        5.75

Fitting a normal linear model using `lm`

Once we have \(\hat\mu\), the maximum likelihood estimate of \(\sigma\), denoted as \(\hat\sigma\), is:

\[ \hat\sigma = \sqrt{\frac{1}{n-K-1} \cdot \sum_{i=1}^n\mid y_i-\hat\mu_i\mid^2} \]

\(\sum_{i=1}^n\mid y_i-\hat\mu_i\mid^2\) is the RSS (\(y_i\): observed data; \(\hat\mu_i\): model prediction).
\(K\) is the number of predictors
\(n\) is the number of observations
Complete script calculating_sigma.R

Before we move on…

Construct a theoretical model for your own data (formal description and a draft of verbal description).
Implement your theoretical model using lm().
Work out the maximum likelihood estimators obtained by lm() but instead of summary(), use of the broom package and run

tidy(my_model, conf.int = TRUE)

Hypothesis testing: t-values

If \(\beta_1\) is the hypothesized true value of the coefficient, and \(\hat\beta_1\) is the estimated coefficient, then

\[ \frac{\hat\beta_1 - \beta_1}{\hat{\text{se}}_1} \sim t_{n-K-1}, \]

which tells us that if the true value of the coefficient was some hypothesized value \(\beta_1\), then we expect the t-statistic to have a certain range of values with high probability.

Hypothesis testing: t-values

We will test a hypothesis about the coefficient for sex in the model above.

# Extract the coefficients table from summary
coefs <- summary(model)$coefficients
# MLE for sex
(beta_sex <- coefs['sexmale', 'Estimate'])

[1] -58.9

# Standard error
(se_sex <- coefs['sexmale', 'Std. Error'])

[1] 18.5

If we hypothesize the true value of the coefficient is exactly 10, then our t-statistic is

(t <- (beta_sex - 10)/se_sex)

[1] -3.73

\[ \frac{\hat\beta_\text{sex} - \beta_\text{sex}}{\hat{\text{se}}_\text{sex}} = \frac{-58.9 - 10}{18.5} = -3.74 \]

Null hypothesis significance testing: t-values

We hypothesize the true value of the coefficient is exactly 0.0, then our t-statistic is

\[ \frac{\hat\beta_k - \beta_k}{\hat{\text{se}}_k} = \frac{-58.9 - 0.0}{18.5} = \frac{-58.9}{18.5} = -3.19 \]

(t <- (beta_sex - 0.0)/se_sex)

[1] -3.19

Null hypothesis significance testing: p-values

n <- nrow(blomkvist)
K <- 2

Probability of a value smaller than or equal to absolute t.

pt(abs(t), df = n - K - 1)

[1] 0.999

Null hypothesis significance testing: p-values

n <- nrow(blomkvist)
K <- 2

Probability of a value larger than or equal to absolute t.

1 - pt(abs(t), df = n - K - 1)

[1] 0.000797

Null hypothesis significance testing: p-values

n <- nrow(blomkvist)
K <- 2

Probability of a value more extreme or equal to t.

2 * (1 - pt(abs(t), df = n - K - 1))

[1] 0.00159

Null hypothesis significance testing: p-values

If we hypothesize that the true value of the coefficient is exactly 0.0, then the probability of obtaining a value more extreme or equal to the t-value we obtained is:

2 * (1 - pt(abs(t), df = n - K - 1))

[1] 0.00159

The null hypothesis has a very low probability, lower than what would be unsurprising (\(\alpha > .05\)), and can be rejected.

p-values

The p-value for the hypothesis is the probability of getting a value as or more extreme than the absolute value of t-statistic in the t-distribution.
This is the area under the curve in the t-distribution that is as or more extreme than the absolute value of t-statistic.

n <- nrow(blomkvist)
K <- 2
pt(abs(t), df = n - K - 1, lower.tail = FALSE) * 2

[1] 0.00159

p-values

Work through script exercise/t_values.R
Calculate t and p-value for \(\hat\beta_\text{age}\) testing the null hypothesis that the true value \(\beta_\text{age}\) is 0.0.
Verify your calculation using

summary(model)$coefficients

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   338.64     27.359   12.38 4.91e-28
sexmale       -58.93     18.473   -3.19 1.59e-03
age             5.75      0.439   13.09 1.72e-30