Normal linear model

Each observation \(y_i\) is assumed to come from

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

where the mean \(\mu_i\) is

\[ \mu_i = \beta_0 + \beta_1 \times x_i \]

Subscript \(i\) means that every observation \(y_i\) depends on a corresponding \(x_i\).
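
A minimal sketch, with made-up values for the intercept, slope, and \(\sigma\), of how data arise under this model:

# Sketch: simulate from the normal linear model (illustrative values, not course data)
set.seed(1)
x <- 1:10                          # predictor values
beta_0 <- 300; beta_1 <- 5         # intercept and slope (assumed for illustration)
sigma <- 50                        # residual SD, the same for every observation
mu <- beta_0 + beta_1 * x          # mean of each y_i
y <- rnorm(length(x), mean = mu, sd = sigma)   # one draw per observation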

Normal linear model

Given a linear relationship, we can calculate the predicted mean of the distribution over rt:

\[ \text{rt}_i = \text{intercept} + \text{slope} \times \text{age}_i \]

Use R as a calculator, but think first (a sketch for the first question follows the list):

  1. What is the mean of the distribution over rt when the intercept is 300, the slope is 0.5, and the participant is 10 years old?
  2. If the mean of the distribution over rt is 600, the intercept is 300, and the slope is 5, what is the corresponding value of the predictor for this rt?
  3. If the mean of the distribution over rt is 750 and the intercept is 400, what is the slope coefficient for a 40-year-old participant?
  4. What is the intercept for a mean of 750 in the distribution over rt, for an age of 60 and a slope of 6.24 (the answer is easier than it might appear)?
  5. What is the corresponding standard deviation for ages 15, 25, 35 if \(\sigma^2\) is 10?
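
A minimal sketch for the first question (the remaining questions just rearrange the same equation):

# Sketch for question 1: mean rt for a 10-year-old with intercept 300 and slope 0.5
intercept <- 300
slope <- 0.5
age <- 10
intercept + slope * age   # predicted mean of the distribution over rt: 305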

Why is \(\sigma^2\) the same for all ages?

In the assumed model

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

where the mean \(\mu_i\) is

\[ \mu_i = \beta_0 + \beta_1 \times x_i \]

\(\mu_i\), and therefore \(y_i\), depends on \(i\), but \(\sigma^2\) does not.

Why is \(\sigma^2\) the same for all ages?

Figure: Predicted rt with SDs across age groups (n = 5).

But what about the standard error?

Figure: Predicted rt with SEs across age groups (n = 5).

Why is \(\sigma^2\) the same for all ages?

  • The variance \(\sigma^2\) is the unexplained variance in the data.
  • The standard error is adjusted by the leverage to account for sampling variability (i.e. the certainty of the model prediction).
  • Leverage is a measure of how much a value influences the model fit: values far from the means of the predictor variables have high leverage.

\[ \text{SE}(\hat{y}_i) = \sqrt{\hat\sigma^2 \times h_i} \]

where \(\hat\sigma^2\) is the mean squared error (the residual variance, sigma(model)^2 in R) and \(h_i\) is the leverage, the penalty for sampling variability, which we get in R using hatvalues(model).

  • The same applies to the confidence interval, because the CI depends on the SE: \(\hat{y}_i \pm \tau \times \text{SE}(\hat{y}_i)\).
  • Check exercise script leverage.R (a short sketch follows).
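
A minimal sketch of this calculation, assuming the fitted model m_1 <- lm(rt ~ age, data = blomkvist) shown later in these slides:

# Sketch: prediction SEs from the residual variance and the leverage
h  <- hatvalues(m_1)            # leverage of each observation
se <- sqrt(sigma(m_1)^2 * h)    # SE of each fitted value
head(se)
# predict(m_1, se.fit = TRUE)$se.fit returns the same values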

Null hypothesis test of slope coefficient

We want to know if the outcome variable is related to a predictor. Therefore, we ask: is the change in the outcome that is due to a predictor (i.e. the slope) different from zero?

The t-value is the difference between the estimate of the slope coefficient and the hypothesised value, divided by the standard error of the slope coefficient.

\[ \frac{\hat\beta - \beta_H}{\text{SE}} \sim \text{t}_\text{df} \]

The p-value is the probability of observing such a t-value, or anything more extreme, in a t-distribution with the given number of degrees of freedom.

The confidence interval is \(\hat\beta \pm \tau \times \text{SE}\), where \(\tau\) is the value of a t-distribution with df degrees of freedom such that \(\pm\tau\) contains 95% of the area under the curve (for 95% CIs).
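
A minimal sketch of these quantities, assuming m_1 <- lm(rt ~ age, data = blomkvist) as in the output below:

# Sketch: slope t-value and 95% CI by hand
est <- coef(summary(m_1))["age", "Estimate"]    # slope estimate
se  <- coef(summary(m_1))["age", "Std. Error"]  # its standard error
(est - 0) / se                                  # t-value for the hypothesis beta_H = 0
tau <- qt(0.975, df = df.residual(m_1))         # +/- tau contains 95% of the t-distribution
c(est - tau * se, est + tau * se)               # same as confint(m_1)["age", ]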

Report results as: est. = \(\dots\), 95% CI [\(\dots\), \(\dots\)], t = \(\dots\), p < / = \(\dots\)

Check exercise script linearmodel.R

\(R^2\): coefficient of determination

\[ \underbrace{\sum_{i=1}^n(y_i-\bar{y})^2}_\text{TSS} = \underbrace{\sum_{i=1}^n(\hat\mu_i-\bar{y})^2}_\text{ESS} + \underbrace{\sum_{i=1}^n(y_i-\hat\mu_i)^2}_\text{RSS} \]

\(R^2\): coefficient of determination

\[ R^2 = \frac{\text{ESS}}{\text{TSS}} \]

where

\[ \text{ESS} = \sum_{i=1}^n(\hat\mu_i-\bar{y})^2 \] and

\[ \text{TSS} = \text{ESS} + \text{RSS} \]

\(R^2\): coefficient of determination

\[ \text{RSS} = \sum_{i=1}^n(y_i-\hat\mu_i)^2 \]

where \(y_i-\hat\mu_i\) are the residuals of the model; see script 1_residuals.R.

Then complete the calculation for 2_rsquared.R.
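
A minimal sketch of this calculation (it anticipates part of 2_rsquared.R), assuming m_1 <- lm(rt ~ age, data = blomkvist):

y   <- model.frame(m_1)$rt        # outcome values used in the fit
rss <- sum(residuals(m_1)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
1 - rss / tss                     # = ESS / TSS; compare summary(m_1)$r.squared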

Adjusted \(R^2\)

To overcome the spurious increase in \(R^2\) that comes from adding more predictors, the following adjustment is applied.

\[ R^2_\text{Adj} = 1 - (1 - R^2) \cdot \underbrace{\frac{n-1}{n-K-1}}_\text{penalty} \]

  • A large number of predictors \(K\) reduces \(R^2_\text{Adj}\).
  • \(K\) has only a small effect when \(n\) is large (see the sketch below).
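
A minimal sketch of the adjustment, assuming m_1 <- lm(rt ~ age, data = blomkvist):

r2 <- summary(m_1)$r.squared
n  <- nobs(m_1)                         # number of observations
K  <- 1                                 # number of predictors (only age)
1 - (1 - r2) * (n - 1) / (n - K - 1)    # compare summary(m_1)$adj.r.squared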

Complete exercise script 3_adjrsquared.R

F-test of model fit

# Fit rt as a normal linear model with age as the predictor.
m_1 <- lm(rt ~ age, data = blomkvist)
summary(m_1)
Call:
lm(formula = rt ~ age, data = blomkvist)

Residuals:
   Min     1Q Median     3Q    Max 
-371.3  -93.8  -23.0   60.9  838.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  314.260     26.723    11.8   <2e-16 ***
age            5.778      0.447    12.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 149 on 264 degrees of freedom
Multiple R-squared:  0.388, Adjusted R-squared:  0.386 
F-statistic:  167 on 1 and 264 DF,  p-value: <2e-16

F-test of model fit

Go through exercises script 4_ftest.R.

You will need the following equation for the F-value:

\[ \text{F} = \underbrace{\frac{\text{RSS}_0 - \text{RSS}_1}{\text{RSS}_1}}_\text{effect size} \cdot \underbrace{\frac{\text{df}_1}{\text{df}_0 -\text{df}_1}}_\text{sample size} = \frac{(\text{RSS}_0 - \text{RSS}_1)/(\text{df}_0 - \text{df}_1)}{\text{RSS}_1/\text{df}_1} \]

where subscripts refer to the two models in the exercises script.

Complete this statement: “Model comparisons showed that age did / did not have a significant effect on the model fit (F(\(\text{df}_1\), \(\text{df}_2\)) = \(\dots\), p < \(\dots\)).”

where \(\text{df}_2 = n - K - 1\), and \(K\) and \(\text{df}_1\) are the number of predictors in the more complex model.
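
A minimal sketch of this comparison (it anticipates part of 4_ftest.R), assuming the blomkvist data are loaded:

m_0 <- lm(rt ~ 1, data = blomkvist)     # intercept-only model (model 0)
m_1 <- lm(rt ~ age, data = blomkvist)   # model with age as predictor (model 1)
rss_0 <- sum(residuals(m_0)^2); df_0 <- df.residual(m_0)
rss_1 <- sum(residuals(m_1)^2); df_1 <- df.residual(m_1)
((rss_0 - rss_1) / (df_0 - df_1)) / (rss_1 / df_1)   # the F-value
anova(m_0, m_1)                         # same F-value, dfs, and p-value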

All inferential statistics

  • Summary for model coefficients and F-test
summary(model)
  • Confidence intervals for model coefficients
confint(model)

MySay Student Feedback

  • Our current courses are informed by previous student feedback.
  • We rely on your feedback to help us optimise our modules and courses.
  • Please take a moment to provide some constructive feedback.
Go to www.ntu.ac.uk/mysay or scan the QR code.