Regression models

  • Often introduced as fitting lines to points.
  • Limited perspective that makes more complex regression models, like generalised linear models, hard to understand.
  • Backbone of statistical modelling
  • For multiple / simple linear regressions, t-tests, ANOVAs, ANCOVAs, MANCOVAs, time series models
  • Basis for path analysis, structural equation models, factor analysis
  • Extension to generalised linear models: logistic regression for categorical data and count models such as Poisson and negative binomial
  • Generalised further to multilevel, a.k.a. hierarchical / mixed-effects models
  • Even nonlinear models: linear combination of nonlinear basis functions

Regression models

  • Simply put, a regression model is a model of how the probability distribution of one variable, known as the outcome variable, varies as a function of other variables, known as the explanatory or predictor variables.
  • The most common and most basic type of regression model is the normal linear model.
  • In normal linear models, we assume that the outcome variable is normally distributed and that its mean varies linearly with changes in a set of predictor variables.
  • By understanding the normal linear model thoroughly, we can see how it can be extended to deal with data and problems beyond those that it is designed for.

Normal linear model

We assume \(n\) independent observations of pairs

\[ (y_1, \overrightarrow{x_1}), (y_2, \overrightarrow{x_2}) \dots (y_i, \overrightarrow{x_i}) \dots (y_n, \overrightarrow{x_n}). \]

\(y_i\) is the observed value of a univariate outcome variable.

We use probabilistic models to predict / explain / understand the outcome variable.

\[ \overrightarrow{x_i} = [x_{1i}, x_{2i}, \dots, x_{ki}, \dots, x_{Ki}] \] is the set of \(K\) values used to explain \(y_i\); that is, \(\overrightarrow{x_i}\) is a vector of \(K\) predictor / explanatory variables.

Normal linear model

Indexing the predictors by \(k = 1 \dots K\) allows us to abbreviate the model equation.

\[ \begin{aligned} \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i} + \dots + \beta_k \cdot x_{ki} + \dots + \beta_K \cdot x_{Ki}\\ &= \beta_0 + \sum_{k=1}^K \beta_k \cdot x_{ki} \end{aligned} \]

Normal linear model

\[ \mu_i = \beta_0 + \sum_{k=1}^K \beta_k \cdot x_{ki}\\ y_i \sim N(\mu_i, \sigma^2)\\ \text{for } i \in 1\dots n \]

  • \(N(\mu_i, \sigma^2)\) denotes a univariate normal distribution with mean \(\mu_i\) and variance \(\sigma^2\).
  • Each observation \(y_i\) is a sample from a normal distribution with an unknown mean \(\mu_i\) and standard deviation \(\sigma\).
  • The value of \(\mu_i\) is a deterministic (i.e. “\(=\)”) linear function of the values of the \(K\) predictor variables.
  • Probabilistic model (i.e. “\(\sim\)”) of \(y_1 \dots y_n\) conditional on \(\overrightarrow{x_1}\dots\overrightarrow{x_n}\), \(\overrightarrow{\beta} = \beta_0, \beta_1\dots\beta_K\), and \(\sigma\)
  • Which ones do we know and which ones must be inferred using statistical inference? (A simulation sketch of the model follows below.)
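To make the two lines of the model concrete, here is a minimal simulation sketch (not part of the lecture code; the parameter values are arbitrary) with \(K = 2\) predictors:

library(tibble) # assumed loaded elsewhere in the lecture code
set.seed(123)
n <- 1000
beta_0 <- 2; beta_1 <- 3; beta_2 <- -1; sigma <- 0.5  # arbitrary "true" values
sim <- tibble(x_1 = rnorm(n),
              x_2 = rnorm(n),
              mu  = beta_0 + beta_1 * x_1 + beta_2 * x_2,  # deterministic part ("=")
              y   = rnorm(n, mean = mu, sd = sigma))       # probabilistic part ("~")
fit <- lm(y ~ x_1 + x_2, data = sim)
coef(fit)   # estimates should be close to 2, 3, -1
sigma(fit)  # should be close to 0.5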

Normal linear model

\[ \mu_i = \beta_0 + \sum_{k=1}^K \beta_k \cdot x_{ki}\\ y_i \sim N(\mu_i, \sigma^2)\\ \text{for } i \in 1\dots n \]

  • For every hypothetically possible value of the \(K\) predictor variables, i.e. \(\overrightarrow{x_{i'}}\), there is a corresponding mean \(\mu_{i'}\), i.e. \(\mu_{i'} = \beta_0 + \sum_{k=1}^K \beta_k \cdot x_{ki'}\)
  • If we change \(x_{ki'}\) by \(\Delta_k\), then \(\mu_{i'}\) changes by \(\beta_k\Delta_k\).

Simulation of normally distributed data

  • We can use lm to easily implement such a model.
  • How do we know that lm returns accurate estimates for our model parameters?
  • We can simulate data that exactly meet model assumptions (normal distribution, equal variance, independence etc.) with known model parameters.
  • In simulations, we know that the model assumptions are met and we know the true parameter values, so we can evaluate whether the normal model does a sensible job in estimating those parameter values.

Simulation of normally distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Generate data
x <- rnorm(n = 10, mean = 500, sd = 10)

Simulation of normally distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Generate data
x <- rnorm(n = 100, mean = 500, sd = 10)

Simulation of normally distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Generate data
x <- rnorm(n = 1000, mean = 500, sd = 10)

Simulation of normally distributed data

# Set parameter values
n <- 1000
beta_0 <- 600 
beta_1 <- 30
sigma <- 50
# Generate data
sim_data <- tibble(beta_0 = rnorm(n = n/2, mean = beta_0, sd = sigma),
                   beta_1 = rnorm(n = n/2, mean = beta_0 + beta_1, sd = sigma)) 
# Preview data
glimpse(sim_data)
Rows: 500
Columns: 2
$ beta_0 <dbl> 638, 624, 590, 669, 578, 625, 519, 637, 629, 549, 583, 626, 643…
$ beta_1 <dbl> 679, 643, 697, 715, 592, 584, 523, 687, 656, 569, 647, 678, 650…

Simulation of normally distributed data

# Set parameter values
n <- 1000
beta_0 <- 600 
beta_1 <- 30
sigma <- 50
# Change data format
sim_data <- pivot_longer(sim_data, cols = c(beta_0, beta_1), names_to = "mu", values_to = "outcome")
# Preview data
glimpse(sim_data)
Rows: 1,000
Columns: 2
$ mu      <chr> "beta_0", "beta_1", "beta_0", "beta_1", "beta_0", "beta_1", "b…
$ outcome <dbl> 638, 679, 624, 643, 590, 697, 669, 715, 578, 592, 625, 584, 51…

Simulation of normally distributed data

# Set parameter values
n <- 1000
beta_0 <- 600 
beta_1 <- 30
sigma <- 50
# Preview data
glimpse(sim_data)
Rows: 1,000
Columns: 2
$ mu      <chr> "beta_0", "beta_1", "beta_0", "beta_1", "beta_0", "beta_1", "b…
$ outcome <dbl> 638, 679, 624, 643, 590, 697, 669, 715, 578, 592, 625, 584, 51…
# Normal linear model of simulated data
model <- lm(outcome ~ mu, data = sim_data)

Simulation of normally distributed data

# Set parameter values
n <- 1000
beta_0 <- 600 
beta_1 <- 30
sigma <- 50
# Model coefficients
coef(model)
(Intercept)    mubeta_1 
      598.1        35.4 
# Standard deviation
sigma(model)
[1] 60.4

Simulation: equality of variance

# Set parameter values
n <- 1000
beta_0 <- 600 
beta_1 <- 30
sigma <- 150
# Model coefficients
coef(model)
(Intercept)    mubeta_1 
      602.2        32.7 
# Standard deviation
sigma(model)
[1] 172

Simulation: exercise

  • Work through the scripts
    • exercises/normal_model_simulation_1.R
    • exercises/normal_model_simulation_2.R
  • You will need to complete the scripts.
  • Then observe how changing the number of observations affects the estimates.

Real data

  • In real life we don’t know the true parameter values as we do in simulations.
  • We can use linear models to estimate parameter values from the data.
  • Maximum likelihood estimation: for which set of parameter values are the data most likely?
  • We also don’t know what underlying process generates our real data.
  • For the simulated data we know that the process is a normal distribution because rnorm samples normally distributed data.

Real data

  • Sometimes even simulated data don’t appear normally distributed because the sample is too small.
blomkvist <- read_csv("../data/blomkvist.csv") %>% 
  select(id, sex, rt = rt_hand_d)
Rows: 267
Columns: 3
$ id  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1…
$ sex <chr> "male", "female", "female", "female"…
$ rt  <dbl> 702, 471, 639, 708, 607, 542, 571, 5…
  • Observed data are by-participant means, meaning the central limit theorem applies.
  • Yet, reaction times are notoriously skewed (because they have a lower but no upper bound; see Baayen 2008).

Real data

Sometimes even simulated data don’t appear normally distributed because the sample is too small.

blomkvist <- read_csv("../data/blomkvist.csv") %>% 
  select(id, sex, rt = rt_hand_d) %>% 
  mutate(log_rt = log(rt))
Rows: 267
Columns: 4
$ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
$ sex    <chr> "male", "female", "female", "fema…
$ rt     <dbl> 702, 471, 639, 708, 607, 542, 571…
$ log_rt <dbl> 6.55, 6.15, 6.46, 6.56, 6.41, 6.2…
  • Observed data are by-participant means, meaning the central limit theorem applies.
  • Yet, reaction times are notoriously skewed (because they have a lower but no upper bound; see Baayen 2008).

Real data

  • Data are unimodal and roughly normally distributed (there is some positive skew).
  • Variance is roughly the same in both groups.
  • There seem to be more long rts for males (?).
  • We assume that the \(\text{log}(rt)\) values, which we denote \(y_1, y_2\dots y_n\), are sampled from a normal distribution with a fixed and unknown mean \(\mu\) and standard deviation \(\sigma\).

\[ y_i \sim N(\mu_i, \sigma^2), \text{ for } i \in 1\dots n. \]

Real data

Predictor variable sex is denoted as \(x_1, x_2, \dots x_n\).

\[ \mu_i = \beta_0 + \beta_1 \cdot x_i \]

where \(x_i\) takes on \(0\) for females and \(1\) for males (because “f” is before “m” in the alphabet).

\[ x_i = \left\{ \begin{array}{ll} 0, \text{ if sex}_i = \texttt{female}\\ 1, \text{ if sex}_i = \texttt{male}\\ \end{array} \right. \]

For females \(\mu_i = \beta_0 + \beta_1 \cdot 0 = \beta_0\) and for males \(\mu_i = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1\), so \(\beta_1\) gives the difference between the male and female means of the (log) rt distributions.
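A quick check of this interpretation (a sketch, assuming the blomkvist data and log_rt column defined above): compare the model coefficients with the per-group means of log_rt.

# Fit the model and compare with the raw group means
model <- lm(log_rt ~ sex, data = blomkvist)
coef(model)  # intercept = female mean of log rt; sexmale = male minus female
blomkvist %>% 
  group_by(sex) %>% 
  summarise(mean_log_rt = mean(log_rt, na.rm = TRUE))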

Real data: exercise

  • Complete exercise scripts
    • exercises/normal_model_blomkvist_1.R
    • exercises/normal_model_blomkvist_2.R
  • Make sure you understand how the model coefficients relate to the visualisation.

Compare simulated and real data

# Set parameter values
n <- 100
beta_0 <- 600 
beta_1 <- 30
sigma <- 50
# Specify model
model <- lm(outcome ~ mu, data = sim_data)
# Model coefficients
coef(model)
(Intercept)    mubeta_1 
      602.2        32.7 
# Load and transform data
blomkvist <- read_csv("../data/blomkvist.csv") %>% 
  select(id, sex, rt = rt_hand_d) %>% 
  mutate(log_rt = log(rt))
# Specify model
model <- lm(log_rt ~ sex, data = blomkvist)
# Model coefficients
coef(model)
(Intercept)     sexmale 
     6.4602     -0.0943 

Interpreting coefficients

Log transformation: on the original rt scale the slope no longer acts additively but multiplicatively (after exponentiating).

# Specify model
model <- lm(log_rt ~ sex, data = blomkvist)
# Model coefficients
(coefs <- as.vector(coef(model)))
[1]  6.4602 -0.0943
  • I.e. on the msec scale the male mean is not \(\beta_0 + \beta_1\) but \(e^{\beta_0} \cdot e^{\beta_1}\); rts are \(e^{\beta_1}\) times shorter / longer.
  • The intercept \(\beta_0\) is the average log rt for females and the slope \(\beta_1\) is the difference for males.
# Converting coefficients to msecs
intercept <- coefs[1]
slope <- coefs[2]

The exponential function exp can be used to un-log the coefficients.

exp(intercept) # female group in msecs
[1] 639
exp(intercept + slope) # male group in msecs
[1] 582
exp(intercept + slope) - exp(intercept) 
[1] -57.5

Interpreting coefficients

According to the model, for any given sex \(x'\), the corresponding distribution of log rts is normal with mean \(\mu' = \beta_0 + \beta_1 \cdot x'\) and standard deviation \(\sigma\).

coef(model)
(Intercept)     sexmale 
     6.4602     -0.0943 
exp(intercept) # female group in msecs
[1] 639
  • If sex changes by \(\Delta_x\), the mean of the corresponding normal distribution over log rts changes by exactly \(\beta_1\Delta_x\).
  • This fact entails that if sex changes by exactly \(\Delta_x=1\), the mean of the corresponding normal distribution over log rts changes by exactly \(\beta_1\).
exp(intercept + slope) - exp(intercept) 
[1] -57.5

Complete exercises/interpreting_coefficients_1.R

Contrast coding

Change the contrast coding so that the intercept represents the population average and the slope shows the difference between groups.

coef(model)
(Intercept)     sexmale 
     6.4602     -0.0943 
# Default: treatment contrast
contrasts(blomkvist$sex)
       male
female    0
male      1
# Sum contrast
contrasts(blomkvist$sex) <- c(-.5, .5)
colnames(contrasts(blomkvist$sex)) <- ": female vs male"
contrasts(blomkvist$sex)
       : female vs male
female             -0.5
male                0.5
# Re-fit model with sum contrast
model <- lm(log_rt ~ sex, data = blomkvist)
coef(model)
        (Intercept) sex: female vs male 
             6.4130             -0.0943 
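A quick sanity check, using the coefficients shown above: with the sum contrast, the intercept should equal the average of the two group means.

# Group means implied by the treatment-coded model above
female_mean <- 6.4602
male_mean <- 6.4602 - 0.0943
# Their average reproduces the sum-contrast intercept (about 6.413)
(female_mean + male_mean) / 2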

Complete exercises/interpreting_coefficients_2.R

Normal linear model of rts by sex and age

  • We may also model how the distribution of rts varies as a function of either, or both, sex and age.
  • In this figure, we see the density of rts for the different sexes and the different terciles of age.
  • Denoting sex by \(x_{11},x_{12}\dots x_{1i}\dots x_{1n}\) and age by \(x_{21},x_{22}\dots x_{2i}\dots x_{2n}\), the model is now

\[ y_i \sim N(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i}. \]

Normal linear model of rts by sex and age

  • For any given combination of sex and age, we have a normal distribution over rts.
  • If the sex variable changes by one unit, when age is held constant, then the average value of the corresponding distribution of rts changes by \(\beta_1\).
  • Conversely, if the age variable changes by one unit, when sex is held constant, then the average value of the corresponding distribution of rts changes by \(\beta_2\).

\[ y_i \sim N(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i}. \]

Model fitting

If we assume the model

\[ y_i \sim N(\mu_i, \sigma^2) \text{, for } i \in 1\dots n\\ \mu_i = \beta_0 + \sum_{k=1}^K \beta_k \cdot x_{ki} \]

we face the problem of inferring the values of \(K+2\) unknowns \(\beta_0, \beta_1\dots\beta_K\), and \(\sigma\).

  • In general in statistics, maximum likelihood estimation is a method by which we estimate the values of the unknown variables in a statistical model.
  • In the case of the normal linear model, the maximum likelihood estimators of the \(K + 2\) unknowns, which we can denote by

\[ \hat\beta_0,\hat\beta_1\dots\hat\beta_K,\hat\sigma, \] are the set of values of the \(K+2\) unknown variables that make the observed data most probable.
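To make “most probable” concrete, here is a minimal sketch (not the lecture’s method) that maximises the normal log-likelihood numerically with optim for an intercept-only model; the true values 600 and 50 are arbitrary.

# Simulate data with known parameters
set.seed(1)
y <- rnorm(100, mean = 600, sd = 50)
# Negative log-likelihood of an intercept-only normal model:
# par[1] is the mean, par[2] is log(sigma), so that sigma stays positive
neg_loglik <- function(par) {
  -sum(dnorm(y, mean = par[1], sd = exp(par[2]), log = TRUE))
}
# Minimising the negative log-likelihood is the same as maximising the likelihood
fit <- optim(c(mean(y), log(sd(y))), neg_loglik)
fit$par[1]       # maximum likelihood estimate of the mean, close to 600
exp(fit$par[2])  # maximum likelihood estimate of sigma, close to 50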

Model fitting

The maximum likelihood values of the \(K+1\) coefficients, \(\hat\beta_0,\hat\beta_1\dots\hat\beta_K\), are those values of the \(\beta_0,\beta_1\dots\beta_K\) variables that minimize the residual sum of squares

\[ \text{RSS} = \sum_{i=1}^n \mid y_i - \mu_i\mid^2, \]

where \(\mu_i\) is defined above.

# Define population parameter
mu <- 100
# Generate random data from normal distribution
y <- rnorm(10, mean = mu, sd = 10)
# Propose values for population parameter
mu_grid <- seq(0, 500, 10)
# Check which parameter value returns lowest RSS
rss <- c()
for(mu in mu_grid){
  # calculate rss for mu
  rss_mu <- sum((y - mu) ^ 2)
  rss <- c(rss, rss_mu)
}
# Look for the smallest rss value
idx <- which.min(rss)
# Find corresponding mu value
mu_grid[idx]
[1] 100

See script exercises/mle.R for example.

Fitting a normal linear model using lm

  • lm uses maximum likelihood estimation and can handle larger sets of parameters.
  • Let’s continue with the Blomkvist et al. (2017) rt data.
  • For simplicity of explanation we will use the untransformed rts.
model <- lm(rt ~ sex + age, data = blomkvist)
  • Maximum likelihood estimates of \(\beta_0\) (intercept), \(\beta_1\) (coefficient of sex), and \(\beta_2\) (coefficient of age) are as follows:
(coefs <- coef(model))
(Intercept)     sexmale         age 
     338.64      -58.93        5.75 

Fitting a normal linear model using lm

Once we have \(\hat\mu_1 \dots \hat\mu_n\), the maximum likelihood estimate of \(\sigma\), denoted \(\hat\sigma\), is:

\[ \hat\sigma = \sqrt{\frac{1}{n-K-1}\sum_{i=1}^n\mid y_i-\hat\mu_i\mid^2}. \]

Maximum likelihood estimate of \(\sigma\) is

sigma(model)
[1] 147
mutate(blomkvist, 
       sex = recode(sex, female = 0, male = 1),
       mu = coefs[1] + coefs[2]*sex + coefs[3]*age,
       squared_diffs = (rt - mu)^2) %>% 
summarise(sum_of_squares = sum(squared_diffs),
          K = 2,
          N = n(),
          sigma_2 = 1 / (N-K-1) * sum_of_squares,
          sigma = sqrt(sigma_2))
# A tibble: 1 × 5
  sum_of_squares     K     N sigma_2 sigma
           <dbl> <dbl> <int>   <dbl> <dbl>
1       5662477.     2   266  21530.  147.

Complete exercises/calculating_sigma.R

Hypothesis testing and confidence intervals

If \(\beta_k\) is the hypothesized true value of the coefficient, then

\[ \frac{\hat\beta_k - \beta_k}{\hat{\text{se}}_k} \sim t_{n-K-1}, \] which tells us that if the true value of the coefficient was some hypothesized value \(\beta_k\), then we expect the t-statistic to have a certain range of values with high probability.

Hypothesis testing and confidence intervals

We will test a hypothesis about the coefficient for sex in the model above.

# Extract the coefficients table from summary
coefs <- summary(model)$coefficients
# MLE for sex
(beta_sex <- coefs['sexmale', 'Estimate'])
[1] -58.9
# Standard error
(se_sex <- coefs['sexmale', 'Std. Error'])
[1] 18.5

If we hypothesize the true value of the coefficient is exactly 1.0, then our t-statistic is

(t <- (beta_sex - 1.0)/se_sex)
[1] -3.24

\[ \frac{\hat\beta_k - \beta_k}{\hat{\text{se}}_k} = \frac{-58.9 - 1.0}{18.5} = -3.24 \]

p-values

  • The p-value for the hypothesis is the probability of getting a value as extreme as or more extreme than the absolute value of the t-statistic in the t-distribution.
  • This is the area under the curve of the t-distribution that is as extreme as or more extreme than the absolute value of the t-statistic.
  • It tells us how far into the tails of the distribution the t-statistic is.
n <- nrow(blomkvist)
K <- 2
pt(abs(t), df = n - K - 1, lower.tail = FALSE) * 2
[1] 0.00133

Hypothesis testing and confidence intervals

Null hypothesis significance testing: if we hypothesize that the true value of the coefficient is exactly 0.0, then our t-statistic is

\[ \frac{\hat\beta_k - \beta_k}{\hat{\text{se}}_k} = \frac{-58.9 - 0.0}{18.5} = \frac{-58.9}{18.5} = -3.19 \]

(t <- (beta_sex - 0.0)/se_sex)
[1] -3.19
n <- nrow(blomkvist)
K <- 2
pt(abs(t), df = n - K - 1, lower.tail = FALSE) * 2
[1] 0.00159

p-values

  • Work through script exercises/t_values.R
  • Calculate t and p-value for \(\hat\beta_\text{age}\) testing the null hypothesis that the true value \(\beta_\text{age}\) is 0.0.
  • Verify your calculation using
summary(model)$coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   338.64     27.359   12.38 4.91e-28
sexmale       -58.93     18.473   -3.19 1.59e-03
age             5.75      0.439   13.09 1.72e-30

t-distribution

Confidence intervals

  • The set of all hypotheses that we do not rule out at the \(\alpha < 0.05\) level of significance is known as the 0.95 (or 95%) confidence interval.
  • Likewise, the set of all hypotheses that we do not rule out at the \(\alpha < 0.01\) level of significance is known as the 0.99 confidence interval, and so on.
  • In a normal linear model, the formula for the confidence interval on coefficient \(k\) is

\[ \hat\beta_k \pm \tau_{(1-\epsilon,\nu)} \cdot \hat{\text{se}}_k, \] where \(\tau_{(1-\epsilon,\nu)}\) is the value in a t-distribution with \(\nu\) degrees of freedom below which lies \(1-\epsilon\) of the area under the curve.

  • \(\tau\): how many standard errors above and below (‘\(\pm\)’) \(\hat\beta_k\) are the lower and upper bound of the confidence interval.

Confidence intervals

  • We can calculate \(\tau\) using qt
(tau <- qt(.975, df = n - K - 1))
[1] 1.97

We then obtain the confidence intervals as follows

beta_sex + c(-1, 1) * se_sex * tau
[1] -95.3 -22.6
coef(model) 
(Intercept)     sexmale         age 
     338.64      -58.93        5.75 

Verification:

confint(model, parm = 'sexmale',  level = .95)
        2.5 % 97.5 %
sexmale -95.3  -22.6

Open and complete script exercises/confidence_interval.R

Confidence intervals: a simulation

# Set simulation parameters
mu <- 600 # True parameter value
n <- 1000 # Number of observations per experiment
n_sims <- 10 # Number of hypothetical experiments
# Empty data frame for results
sims <- tibble()
# Run n_sims simulations
for(i in 1:n_sims){
  # Simulate normal distributed data
  y <- rnorm(n, mean = mu, sd = 100)
  # Get MLEs
  m <- lm(y ~ 1)
  # Extract CI
  cis <- confint(m) %>% as.vector()
  # Store results
  sims <- tibble(sim_id = i, 
                 lower = cis[1], 
                 upper = cis[2]) %>% 
    bind_rows(sims)
}
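The mu_in_ci variable used on the next slides is not created in this snippet; one way the exercise script might add it (an assumption, not shown here) is:

# Flag, for each simulated experiment, whether the 95% CI contains the true mu
sims <- mutate(sims, mu_in_ci = if_else(lower <= mu & mu <= upper, "yes", "no"))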

Confidence intervals: a simulation

  • Complete and run script
    • exercises/confidence_interval_simulation.R
  • Change the number of simulations, one at a time, to n_sims = 20, 50, 100, 500, 1000.
  • Notice how often the CI does not contain the true parameter value \(\mu\).
  • What percentage of simulations contain \(\mu\) (it’s okay to guess)?

Confidence intervals: a simulation

count(sims, mu_in_ci) %>% 
  mutate(prop = n / n_sims)
# A tibble: 2 × 3
  mu_in_ci     n  prop
  <chr>    <int> <dbl>
1 no          51 0.051
2 yes        949 0.949
  • When our number of experiments approaches \(\infty\) and we calculate a 95% CI for each experiment, 95% of these CIs will contain \(\mu\).
  • Note though, 5% of the time, the confidence interval does not contain \(\mu\)!
  • What does this mean for real (as opposed to simulated) experiments?

Predictions

Using \(\hat\beta_0,\hat\beta_1\dots\hat\beta_K\) and \(\hat\sigma^2\), for any new vector of predictor variables \(\overrightarrow{x}_{i'}\) the corresponding \(y_{i'}\) is now

\[ y_{i'} \sim N(\hat\mu_{i'}, \hat\sigma^2)\\ \hat\mu_{i'} = \hat\beta_0 + \sum_{k=1}^K \hat\beta_k \cdot x_{ki'} \]

coef(model)
(Intercept)     sexmale         age 
     338.64      -58.93        5.75 

As \(K\) is 2, we can rewrite, then simplify, and predict the rt for a female aged 95.

\[ \begin{aligned} \hat\mu_{i'} &= \hat\beta_0 + \hat\beta_1 \cdot x_{1i'} + \hat\beta_2 \cdot x_{2i'}\\ &=338.64 + -58.93 \cdot x_{1i'} + 5.75 \cdot x_{2i'}\\ &=338.64 + -58.93 \cdot 0 + 5.75 \cdot 95\\ &=338.64 + 5.75 \cdot 95\\ &=885 \end{aligned} \]
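The same arithmetic can be checked directly from the fitted coefficients (a quick sketch, using the rt ~ sex + age model fitted above):

# mu-hat for a female aged 95: intercept + 0 * sexmale + 95 * age
unname(coef(model)["(Intercept)"] + coef(model)["age"] * 95)  # approximately 885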

To calculate \(\hat\mu_{i'}\) for any given new vector of predictor variables \(\overrightarrow{x}_{i'}\) use predict.

newdata <- tibble(sex = "female", age = 95)
predict(model, newdata = newdata) 
  1 
885 

Prediction confidence intervals

  • We can calculate confidence intervals for the predicted values of \(\mu_{i'}\).
  • The confidence interval for \(\mu_{i'}\) is as follows:

\[ \hat\mu_{i'} \pm \tau_{(1-\epsilon,\nu)} \cdot \hat{\text{se}}_{\mu_{i'}} \]

where \(\hat{\text{se}}_{\mu_{i'}}\) is the standard error term that corresponds to \(\hat\mu_{i'}\).

  • We can obtain the confidence intervals on \(\hat\mu_{i'}\) by using the option interval = 'confidence' in the predict function.
predict(model, interval = 'confidence', newdata = newdata)
  fit lwr upr
1 885 845 925

Prediction confidence intervals: exercise

  • Work through scripts
    • exercises/predictions.R
    • exercises/predictions_with_cis.R

Interactions and varying intercepts model

When we include sex as an explanatory variable in addition to a continuous predictor variable, the model is as follows.

\[ y_i \sim N(\mu_i, \sigma^2), \mu_i = \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i}, \text{ for } i \in 1\dots n. \]

To implement this model using lm we would do the following.

model_2 <- lm(log_rt ~ sex + age, data = blomkvist)
coef(model_2)
(Intercept)     sexmale         age 
    5.98258    -0.08724     0.00845 

Interactions and varying intercepts model

Given that \(x_{1i}\) takes the value 0 when sex is female and 1 when sex is male, this model can be written as follows.

\[ y_i \sim N(\mu_i, \sigma^2) \\ \mu_i = \left\{ \begin{array}{ll} \beta_0 + \beta_2 \cdot x_{2i}, \text{ if sex}_i = \texttt{female}\\ (\beta_0 + \beta_1) + \beta_2 \cdot x_{2i}, \text{ if sex}_i = \texttt{male} \end{array} \right. \] This is a varying intercepts model.
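The two intercepts can be read off the model_2 coefficients shown above (a quick sketch; age is the continuous predictor \(x_{2i}\)):

betas_2 <- coef(model_2)
betas_2[1]               # intercept for females
betas_2[1] + betas_2[2]  # intercept for males, shifted by the sexmale coefficient
betas_2[3]               # slope for age, shared by both groups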

Interactions and varying-slopes model

The varying-intercepts model is written as follows. For \(i \in 1\dots n\)

\[ \begin{aligned} y_i &\sim N(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i}\\ \end{aligned} \]

The following is a varying intercepts and varying slopes model:

\[ \begin{aligned} y_i &\sim N(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i} + \underbrace{\beta_3 \cdot x_{1i} \cdot x_{2i}}_\text{interaction} \end{aligned} \] We effectively have a third predictor \(x_{1i} \cdot x_{2i}\) that is the product of \(x_{1i}\) and \(x_{2i}\).
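One way to see that the interaction term really is just a product predictor (a sketch, assuming the blomkvist data loaded above):

# The last column of the design matrix is the element-wise product
# of the sexmale indicator and age
head(model.matrix(~ sex * age, data = blomkvist), n = 3)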

Interactions and varying-slopes model

Using sex and age as predictors, we can do the following to perform a varying-intercepts and varying-slopes model:

model_3 <- lm(log_rt ~ sex + age + sex:age, data = blomkvist)

Or slightly shorter:

model_3 <- lm(log_rt ~ sex * age, data = blomkvist)

Interactions and varying-slopes model

For simplicity, let’s start with the untransformed rts.

model_3 <- lm(rt ~ sex * age, data = blomkvist)

From the varying-intercepts and varying-slopes model, the intercept terms for female and male are

(betas <- coef(model_3))
(Intercept)     sexmale         age sexmale:age 
     308.46       17.15        6.28       -1.36 
# females
betas[1]
[1] 308
# males
betas[1] + betas[2]
[1] 326

The slope terms are as follows:

# females
betas[3]
[1] 6.28
# males
betas[3] + betas[4]
[1] 4.93

Interactions and varying-slopes model

model_3 <- lm(log_rt ~ sex * age, data = blomkvist)

From the varying-intercepts and varying-slopes model, the intercept terms for female and male are

(betas <- coef(model_3))
(Intercept)     sexmale         age sexmale:age 
    5.95274    -0.01201     0.00898    -0.00134 
# females
exp(betas[1]) 
[1] 385
# males
exp(betas[1] + betas[2])
[1] 380

The slope terms are as follows:

# females
exp(betas[1] + betas[3]) - exp(betas[1])
[1] 3.47
# males
exp(betas[1] + betas[3] + betas[4]) -
  exp(betas[1])
[1] 2.95

Interactions and varying-slopes model: exercise

  • Work through scripts
    • exercises/varying_intercepts_model.R
    • exercises/varying_slopes_model.R

Polychotomous predictors

We can also use categorical predictor variables that have more than two levels (unlike dichotomous variables such as sex); these can be referred to as polychotomous predictor variables.

# Load and pre-process data
blomkvist <- read_csv("../data/blomkvist.csv") %>% 
  select(id, sex, age, rt = rt_hand_d, smoker) %>% 
  mutate(log_rt = log(rt)) %>% 
  drop_na()
# Categories of smoker
unique(blomkvist$smoker)
[1] "former" "no"     "yes"   

Using smoker as our single categorical predictor variable, the linear model would be as follows.

\[ y_i \sim N(\mu_i, \sigma^2), \mu_i = \beta_0 + \beta_1 \cdot x_{1i} + \beta_2 \cdot x_{2i}, \text{ for } i \in 1\dots n. \]

where \(x_{1i}\), \(x_{2i}\) are as follows.

\[ x_{1i}, x_{2i} = \left\{ \begin{array}{ll} 0,0 \text{ if smoker}_i = \texttt{former}\\ 1,0 \text{ if smoker}_i = \texttt{no}\\ 0,1 \text{ if smoker}_i = \texttt{yes}\\ \end{array} \right. \]
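R’s default treatment coding for the three levels can be inspected directly (a quick check; smoker is read in as character, so it is converted to a factor first):

# Rows are the levels of smoker; columns are the two dummy (0/1) predictors
contrasts(factor(blomkvist$smoker))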

Polychotomous predictors

Using lm, we would simply do as follows.

model_smoker <- lm(log_rt ~ smoker, data = blomkvist)
(betas <- coef(model_smoker))
(Intercept)    smokerno   smokeryes 
     6.4556     -0.0432     -0.0444 

The intercept term is the predicted average of the distribution of rt when smoker is former.

exp(betas[1])
[1] 636

Polychotomous predictors

Using lm, we would simply do as follows.

model_smoker <- lm(log_rt ~ smoker, data = blomkvist)
(betas <- coef(model_smoker))
(Intercept)    smokerno   smokeryes 
     6.4556     -0.0432     -0.0444 

The predicted mean of the rt distribution when smoker is no is

\[ \hat\beta_0 + \hat\beta_1 \cdot 1 + \hat\beta_2 \cdot 0 \]

exp(betas[1] + betas[2] * 1 + betas[3] * 0)
[1] 609

Polychotomous predictors

Using lm, we would simply do as follows.

model_smoker <- lm(log_rt ~ smoker, data = blomkvist)
(betas <- coef(model_smoker))
(Intercept)    smokerno   smokeryes 
     6.4556     -0.0432     -0.0444 

When smoker is yes, the predicted mean of the rt distribution is

\[ \hat\beta_0 + \hat\beta_1 \cdot 0 + \hat\beta_2 \cdot 1 \]

exp(betas[1] + betas[2] * 0 + betas[3] * 1)
[1] 609

Exercise: work through script exercises/polychotomous_predictor.R

Recap

  • The normal model assumes that data come from a single process that generates normally distributed data.
  • Log-normal transformation is often used to correct positive skew.
  • MLE provides estimates for unknown population parameters.
  • We can specify models with continuous and categorical predictors and combinations of them.
  • Model specifications using varying intercepts and varying slopes.
  • Read lecture notes on NOW (Andrews 2021).
  • Model comparisons: does age have a stronger effect for females than for males?

Homework

Using a data-set of your own choice, perform and report a linear regression analysis to address a set of theoretical questions of your choice that relate to your chosen data-set.

  • Familiarize yourself with the demo for the formative assessment.
  • Find a suitable data set you can use for the formative assessment.

References

Andrews, Mark. 2021. Doing Data Science in R: An Introduction for Social Scientists. SAGE Publications Ltd.

Baayen, R. Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Blomkvist, Andreas W., Fredrik Eika, Martin T. Rahbek, Karin D. Eikhof, Mette D. Hansen, Malene Søndergaard, Jesper Ryg, Stig Andersen, and Martin G. Jørgensen. 2017. “Reference Data on Reaction Time and Aging Using the Nintendo Wii Balance Board: A Cross-Sectional Study of 354 Subjects from 20 to 99 Years of Age.” PLoS One 12 (12): e0189598.