Normal linear model

Each observation \(y_i\) is assumed to come from

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

where \(\mu_i\) is

\[ \mu_i = \beta_0 + \beta_1 \times x_i \]

The subscript \(i\) indicates that each observation \(y_i\) has its own mean \(\mu_i\), which depends on the corresponding predictor value \(x_i\).

Normal linear model

Assuming a linear relationship, we can calculate the predicted mean of the distribution over reaction times rt using the linear regression equation:

\[ \text{rt}_i = \text{intercept} + \text{slope} \times \text{age}_i \]

Use R as a calculator, when necessary, but think first (a sketch for the first question follows the list):

  1. What’s the mean of the distribution over rt when the intercept is 300 msecs, the slope is 0.5 msecs per year, and the participant is 10 years old?
  2. If the mean of the distribution over rt is 600 msecs, the intercept is 300 msecs, and the slope is 5 msecs per year, what is the corresponding value of the predictor for this rt?
  3. If the mean of the distribution over rt is 750 msecs and the intercept is 400 msecs, what is the slope coefficient for a 40-year-old participant?
  4. What is the intercept for a mean predicted rt of 750 msecs, for an age of 0 years and a slope coefficient of 6.24 msecs per year?
  5. What is the standard deviation of the predicted rt values for ages 15, 25, and 35 if we obtained a standard deviation value of \(\sigma\) = 10?
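For instance, the first question is just the regression equation evaluated in R (a minimal sketch; the remaining questions rearrange the same equation):

# Question 1: mean of the distribution over rt
intercept <- 300   # msecs
slope     <- 0.5   # msecs per year
age       <- 10    # years
intercept + slope * age  # 305 msecs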

Why is \(\sigma\) the same for all ages?

In the assumed model

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

where \(\mu_i\) is

\[ \mu_i = \beta_0 + \beta_1 \times x_i \]

the values of \(\mu_i\) and \(x_i\) depend on \(i\), but the variance \(\sigma^2\) does not. The variance \(\sigma^2\) is constant and, hence, so is the standard deviation \(\sigma\).
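A minimal simulation sketch (with made-up parameter values) makes this concrete: the mean of the distribution shifts with age, but the spread around it does not.

# Hypothetical parameter values, for illustration only
b_0   <- 300  # intercept (msecs)
b_1   <- 5    # slope (msecs per year)
sigma <- 10   # residual standard deviation (msecs)

ages <- c(15, 25, 35)
mu   <- b_0 + b_1 * ages                    # means differ across ages
rnorm(length(ages), mean = mu, sd = sigma)  # same sigma for every age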

Why is \(\sigma^2\) the same for all ages?

Figure: Predicted rt with SDs across age (in years).

Null hypothesis test of slope coefficient

We want to know whether the outcome variable is related to a predictor. Therefore, we ask: is the change in the outcome per unit change in the predictor (i.e. the slope) different from zero?

The t-value is the difference between the estimate of the slope coefficient and the hypothesized value, divided by the standard error of the slope coefficient.

\[ \frac{\hat\beta_1 - \beta_H}{\text{SE}_1} \sim \text{t}_\text{df} \]

\(\hat\beta_1\): estimated slope coefficient (change in the outcome variable per unit change in the predictor)

\(\beta_H\): hypothesized change in the outcome variable; typically 0 when an outcome is hypothesized to be unrelated to the predictor.

\(\text{SE}_1\): standard error of the slope estimate

Null hypothesis test of slope coefficient

On the basis of this t-value (the difference between the slope coefficient and 0 in units of standard errors), we can calculate the probability of observing such a t-value, or anything more extreme, in a t-distribution with a given number of degrees of freedom.

That’s our p-value, the probability of our t-value or anything more extreme, if the null hypothesis is true.

The confidence interval contains all hypothetical values of our slope effect that can’t be ruled out: \(\hat\beta_1 \pm \tau \times \text{SE}_1\), where \(\tau\) is the value of the t-distribution that contains 95% of the area under the curve (for 95% CIs).

Report results as:

est. = \(\dots\), 95% CI [\(\dots\), \(\dots\)], t = \(\dots\), p < / = \(\dots\)
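As a sketch, these quantities can be computed by hand; the numbers below are the age coefficient estimates from the blomkvist model shown in the F-test section.

# Estimates taken from the model summary below (age coefficient)
beta_hat <- 5.778   # estimated slope
se_1     <- 0.447   # its standard error
df       <- 264     # residual degrees of freedom

t_value <- (beta_hat - 0) / se_1                        # beta_H = 0
p_value <- 2 * pt(abs(t_value), df, lower.tail = FALSE) # two-tailed p
tau     <- qt(0.975, df)                                # critical value
beta_hat + c(-1, 1) * tau * se_1                        # 95% CI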

Complete exercise script 1_linearmodel.R

\(R^2\): coefficient of determination

The proportion of variance in the data explained by the model predictor(s).

\[ R^2 = \frac{\text{ESS}}{\text{TSS}} \]

where

\[ \text{ESS} = \sum_{i=1}^n(\hat\mu_i-\bar{y})^2 \] and

\[ \text{TSS} = \text{ESS} + \text{RSS} \]

Total sum of squares = explained sum of squares + residual sum of squares

\(R^2\): coefficient of determination

\[ \text{RSS} = \sum_{i=1}^n(y_i-\hat\mu_i)^2 \]

\(y_i-\hat\mu_i\) are the residuals of the model; see script 2_residuals.R.

Then complete the calculation for 3_rsquared.R.
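A minimal sketch of this calculation (assuming the m_1 model fitted to the blomkvist data in the F-test section below):

# Observed outcomes and fitted means of the model
y      <- model.response(model.frame(m_1))  # observed rt values
mu_hat <- fitted(m_1)                       # predicted means
rss <- sum((y - mu_hat)^2)                  # residual sum of squares
ess <- sum((mu_hat - mean(y))^2)            # explained sum of squares
ess / (ess + rss)                           # R^2, matches summary(m_1)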

Adjusted \(R^2\)

Adding predictors to a model can only increase \(R^2\), even when they are unrelated to the outcome. To overcome this spurious increase in \(R^2\), the following adjustment is applied.

\[ R^2_\text{Adj} = 1 - (1 - R^2) \cdot \underbrace{\frac{n-1}{n-K-1}}_\text{penalty} \]

  • A large number of parameters \(K\) reduces \(R^2_\text{Adj}\).
  • \(K\) has a small effect for large \(n\).
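A minimal sketch of the adjustment, using the values from the blomkvist model below (n = 266 observations, K = 1 predictor, multiple \(R^2\) = 0.388):

n  <- 266
K  <- 1
r2 <- 0.388
1 - (1 - r2) * (n - 1) / (n - K - 1)  # ~0.386, the adjusted R-squared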

Complete exercise script 4_adjrsquared.R

F-test of model fit

# Fit the rt as a normal model with age as predictor.
m_1 <- lm(rt ~ age, data = blomkvist)
summary(m_1)

Call:
lm(formula = rt ~ age, data = blomkvist)

Residuals:
   Min     1Q Median     3Q    Max 
-371.3  -93.8  -23.0   60.9  838.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  314.260     26.723    11.8   <2e-16 ***
age            5.778      0.447    12.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 149 on 264 degrees of freedom
Multiple R-squared:  0.388, Adjusted R-squared:  0.386 
F-statistic:  167 on 1 and 264 DF,  p-value: <2e-16

F-test of model fit

Go through exercises script 5_ftest.R.

You will need this equation for the F-value:

\[ \text{F} = \underbrace{\frac{\text{RSS}_0 - \text{RSS}_1}{\text{RSS}_1}}_\text{effect size} \cdot \underbrace{\frac{\text{df}_1}{\text{df}_0 -\text{df}_1}}_\text{sample size} = \frac{(\text{RSS}_0 - \text{RSS}_1)/(\text{df}_0 - \text{df}_1)}{\text{RSS}_1/\text{df}_1} \]

where the subscripts refer to the two models (hypotheses) in the exercise script.

Complete this statement: “Model comparisons showed that age did / did not have a significant effect on the model fit (F(\(\text{df}_1\), \(\text{df}_2\)) = \(\dots\), p < \(\dots\)).”

where \(\text{df}_2\) is \(n - K - 1\), and \(K\) and \(\text{df}_1\) are both the number of predictors in the more complex model.
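A minimal sketch of this comparison (assuming, as in the exercise script, a null model without a predictor and a comparison model that adds age):

# Null model (intercept only) vs. model with age as predictor
m_0 <- lm(rt ~ 1,   data = blomkvist)
m_1 <- lm(rt ~ age, data = blomkvist)

rss_0 <- sum(residuals(m_0)^2); df_0 <- df.residual(m_0)
rss_1 <- sum(residuals(m_1)^2); df_1 <- df.residual(m_1)

f_value <- ((rss_0 - rss_1) / (df_0 - df_1)) / (rss_1 / df_1)
p_value <- pf(f_value, df_0 - df_1, df_1, lower.tail = FALSE)

anova(m_0, m_1)  # the same F-test, done by R

With a single predictor, this reproduces the F-statistic reported by summary(m_1) above.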

Inferential summary statistics

Summary for model coefficients and F-test

summary(model)

Confidence intervals for model coefficients

confint(model)

Also try

library(broom)
tidy(model, conf.int = TRUE)
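Model-level statistics (\(R^2\), adjusted \(R^2\), the F-test) can be collected in the same tidy format with broom’s glance():

glance(model)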