Linear Regression

Author

Shota Momma

Linear regression with one continuous predictor

Finally, we will dip our toes into a statistical method that everyone uses nowadays in linguistics: linear regression. We will start by considering the simplest case where there are only two continuous variables: average reaction times (across many people) of words in a lexical decision task and the frequency of words.

Suppose that you got the following data (this is real data created by taking 30 random samples from the English Lexicon Project database):

suppressPackageStartupMessages(library("tidyverse"))
df <- read_csv("/Users/shota/Documents/Teaching/Fall2024/Statistics/data/LexDec.csv", show_col_types = FALSE)
df$logf <- log(df$Frequency)
df

# A tibble: 30 × 5
   Word       Length Frequency    RT  logf
   <chr>       <dbl>     <dbl> <dbl> <dbl>
 1 make            4    520909  621. 13.2 
 2 wrist           5      4355  616.  8.38
 3 trial           5     19700  602.  9.89
 4 soul            4     32352  588. 10.4 
 5 sale            4     95391  598. 11.5 
 6 retire          6      2307  762.  7.74
 7 reservoir       9      1635  846.  7.40
 8 relaxation     10      1658  703.  7.41
 9 relative        8     18302  664.  9.81
10 rack            4     11655  622.  9.36
# … with 20 more rows

One of the most robust effects in psycholinguistics is the word frequency effect: more frequent words tend to be recognized faster, so the reaction times of more frequent words tend to be shorter in lexical decision tasks. So let’s test whether this is true in our data set. When we want to test if a variable affects another variable, we want to essentially understand how they are related to each other. So let’s mathematically formalize a relation between them, using a regression model.

\[ y_i = b_0 + b_1x_i + \epsilon_i \]

where $y_i$ is the dependent variable (or the outcome variable), $b_0$ is an intercept, $x_i$ is the independent variable (or the predictor variable), and $b_1$ is a coefficient or slope. $\epsilon_i$ is an error that cannot be explained by the regression model, also called residuals. Note that each symbol comes with a subscript. This means that we are talking about individual observations.

What a simple linear regression does is to find a line that minimizes the difference between predictions and observations, that is, minimizes the total amount of residuals. The line can be expressed using the following notation:

\[ \hat{y_i} = b_0 + b_1x_i \]

Thus, “to find a line that minimizes residuals” is to find values of $b_0$ and $b_1$ such that residuals are minimized.

The following graph might be useful for visually understanding what residual means:Here’s some visual aid that visualizes what residual is:

Least square estimates - finding the slope

How do we find $b_0$ and $b_1$ such that the sum of residuals is minimized? Let me walk you through how to do it using least square estimates. First, let me give you a formula for calculating the residual-minimizing slope $b_1$:

\[ b_1 = \frac{SS_{xy}}{SS_{xx}} =\frac{\sum_i^n (x_i - \bar{x}) (y_i - \bar{y})}{\sum_i^n(x_i-\bar{x})^2} \]

SS stands for the sum of squares, which is essentially the value that indicates how much dispersion there is in the data. $SS_{xx}$ is the sum of squared differences between x and the mean of x. $SS_{xx}$ expresses how spread out x is from the mean of x (the dispersion along the x-axis). $SS_{xy}$ is the sum of the products of the difference between x and the mean of x and y and the mean of y. This value is closely related to (sample) covariance. If you divide $SS_{xy}$ by (n-1), you get the sample covariance between x and y (denoted as cov(x,y)). Covariance expresses to what extent two variables change in the same direction (it’s closely related to correlation, which is just a normalized version of covariance):

$SS_{xy}$ can be positive or negative. If x and y change in the same direction with respect to the mean of x and y, $(x_i-\bar{x})(y_i-\bar{y})$ would be positive. In contrast, if x and y change in the opposite direction, $(x_i-\bar{x})(y_i-\bar{y})$ would be negative. Thus, $SS_{xy}$ expresses to what extent x and y covary in the same vs. the opposite direction.

A visualization might help. Let’s take ($\bar{x}, \bar{y}$), the intersection of the red and blue line in the figure below, as the reference point:

In this graph, most data points are clustered around either the lower right or the upper left regions. When x is bigger than the mean of x, y tends to be smaller than the mean of y (i.e., negative covariance). This means that ($x_i - \bar{x})(y_i - \bar{y}$) associated with most data points would be negative. Because calculating $b_1$ involves summing all ($x_i - \bar{x})(y_i - \bar{y}$), which would be negative for most data points, $b_1$ would be negative (the denominator of the equation for $b_1$, $SS_{xx}$ cannot be negative, so if the numerator is negative $b_1$ would necessarily be negative).

To illustrate the point further, consider an alternative scenario where data points are spread across all four regions (upper right, lower right, upper left and lower left, taking $(\bar{x}, \bar{y})$ at the center).

In this case, taking the intersection between the mean of x and mean of y as the reference point, x and y do not covary systematically. When x and y do not covary systematically, $SS_{xy}$ would not move away from 0 much because $(x_i - \hat{x})(y_i - \hat{y})$ across data points are both positive and negative, so they cancel each other out.

Let’s calculate $b_1$ in our data:

SS_xx <- sum( (df$logf - x_bar)^2)
SS_xy <- sum( (df$RT - y_bar)*(df$logf - x_bar))
b1 <- SS_xy/SS_xx
b1

[1] -19.53011

We got about -19.53 as the value of $b_1$. This means that 1 unit change in the log-frequency decreases the reaction time by 19.53 milliseconds.

Least square estimates - finding the intercept

We figured out the value of $b_1$. Now, how do we figure out $b_0$, that is, the intercept? We can exploit the fact that the regression line must go through the intersection of the mean of x and the mean of y, that is, the intersection of the red and the blue lines in the figures above. Now, recall the formula for the regression:

\[ \hat{y_i} = b_0 + b_1x_i \]

Because the line has to go through the intersection between the mean of x (9.44157) and the mean of y (664.5942), and given $b_1$ = 19.53011, the following equation holds:

\[ 664.5942 = b_0 + (-19.53011)(9.44157) \]

If we solve this equation:

\[ b_0 = 664.5942 + (19.53011)(9.44157) \]

b0 <- y_bar - (b1 *x_bar)
b0

[1] 848.9892

We got 848.9892. This means that when the log frequency is assumed to be 0, the reaction time is predicted to be about 849 milliseconds.

Predicting the outcome variable

Putting everything together, we arrive at the following linear equation for our data set:

\[ \hat{y_i} = 848.9892 + (-19.53011)*x_i \]

We calculated $b_0$ and $b_1$ by hand, but as always, R has a function lm, which basically does what we did above (and more).

model <- lm(RT ~ logf,data = df)

If you store the output of the lm function and put it into the summary() function, you get whole bunch of information associated with this regression model:

summary(model)


Call:
lm(formula = RT ~ logf, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-146.17  -51.56  -12.65   55.94  141.25 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  848.989     75.491  11.246 6.78e-12 ***
logf         -19.530      7.866  -2.483   0.0193 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 74.22 on 28 degrees of freedom
Multiple R-squared:  0.1804,    Adjusted R-squared:  0.1512 
F-statistic: 6.165 on 1 and 28 DF,  p-value: 0.01929

I know this is overwhelming if you haven’t dealt with this kind of stuff before. So let’s first focus on the section that says “coefficients,” which can be extracted by doing:

summary(model)$coef

             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 848.98919  75.491479 11.246159 6.775062e-12
logf        -19.53011   7.865765 -2.482926 1.929132e-02

The column named ‘Estimate’ corresponds to coefficient estimates of each term ($b_0$ = Intercept and $b_1$ = logf). The good news is that R also spits out the same values for $b_0$ and $b_1$!

Hypothesis testing in regression

As you might have noticed from the summary output, figuring out $b_0$ and $b_1$ is not the only thing R is doing. Notice that R spit out t-values, standard errors, and p-values (denoted as Pr(>|t|)). You might guess that lm function actually did a significance testing, in fact, a t-test for us! Specifically, it is asking whether $b_x$ is significantly different from 0, that is, it’s testing the following null and alternative hypotheses:

$H_0$: $b$ = 0

$H_1$: $b$ ≠ 0

Remember that, when we test if the sample mean significantly deviates from the population mean under the null, the t-value is calculated as follows, where SE was simply $\frac{\sigma}{\sqrt{n}}$.

\[ t = \frac{\bar{X} - \mu}{SE} \]

However, when we do one sample t-tests on coefficients estimated by regression models, the t-value is calculated based on coefficient estimates, $b_x$ instead of sample means $\bar{X}$:

\[ t = \frac{\hat{b} - \mu}{SE} \]

Because the null hypothesis is $\mu = 0$:

\[ t = \frac{\hat{b}}{SE} \]

Unfortunately, calculating $b_0$ and $b_1$ is a bit… complicated. But I think it is worth taking some time going through the procedure.

Let me first deal with SE for $b_1$. To calculate it, we need the following formula:

\[ SE(\hat{b_1}) = \sqrt{\frac{1}{(n-2)} \sum_i^n({y_i - \hat{y_i}})^2 \frac{1}{{\sum_i^n(x_i - \bar{x})^2}}} \]

To explain this formula, I first need to explain Residual Sum of Squares (RSS), which is just sum of squared residuals.

\[ RSS = \sum_i^n(y_i - \hat{y})^2 \]

RSS expresses how much variability there is in residuals. Now, if we divide RSS with the appropriate degrees of freedom and take the square root of its result, we get Residual Standard Errors (RSE):

\[ RSE = \sqrt{\frac{1}{(n-2)} \sum_i^n{(y_i - \hat{y_i})^2}} \]

(n-2) is the degrees of freedom. The degrees of freedom for calculating RSE is (n - p), where n is the sample size and the p is the number of parameters already used to calculate residuals. In the current example, we’ve already fixed two parameters, $b_0$ and $b_1$, so p is 2 in the current example (later, we will talk about multiple regressions with more than two parameters). RSE is the standard deviation of errors. The lower the RSE, the better the fit. Visually:

Note that in these graphs the slope didn’t change. The only thing that changed was the residual variability.

As the figure above shows, RSE express how well the model fits the data. Higher the RSE, the worse the fit. Let’s calculate RSE in our data:

n = length(df$RT)
df$y_hat <- b0 + b1*df$logf
df$error <- df$RT - df$y_hat
RSE <- sqrt(sum((df$error)^2)/(n-2))

RSE is the numerator of the formula for calculating the SE for $b_1$. The formula can now be rewritten as:

\[ SE(\hat{b_1}) = \frac{RSE}{\sqrt{\sum_i^n({x - \bar{x}})^2}} \]

The denominator of this formula is actually just the square-rooted version of what we already computed above: $SS_{xx}$ (sum of squared difference between each x and the mean of x):

\[ SE(\hat{b}) = \frac{RSE}{\sqrt{SS_{xx}}} \]

This says that the standard error of a slope term (here, $b_1$) is the ratio between the dispersion of residuals and the dispersion of x from its mean. Intuitively, when residuals are more variable (when the fit is worse), SE($\hat{b}$) is larger. Keeping the residual variation constant, when x is more variable, SE($\hat{b}$) is smaller.

Let’s calculate the $SE(\hat{b})$:

SE_b = RSE/sqrt(SS_xx)

We now have both $SE(\hat{b_1})$ and $\hat{b_1}$, so we can compute the t-value associated with $\hat{b_1}$.

t <- b1/SE_b
t

[1] -2.482926

Remember that t-values follow the t-distribution with a given degree of freedom. Here, the degrees of freedom is (n - 2) = 28 (-2 because to calculate t we fixed two parameters, b0 and b1). To compute the p-value, we calculate the cumulative probability of t-values greater than 2.4829 or smaller than -2.4829:

deg = 28
pt(t, deg) * 2

[1] 0.01929132

This matches with the p-value the $lm$ function spit out.

What about the SE for $b_0$? Here’s the formula for calculating SE for $b_0$.

\[ SE(b_0) = \sqrt{RSE * (\frac{1}{n} + \frac{\bar{x}^2}{\sum_i^2 (x_i - \bar{x})^2} )} \]

SE_b0 <- sqrt(RSE^2 * (1/n + (x_bar^2)/sum((df$logf - x_bar)^2)))

Apologies for the opaqueness, but SE for $b_0$ becomes larger when the RSE is large, i.e., when the fit is poor. It also becomes larger when the mean of the predictor is further away from 0. But it becomes smaller when the dispersion of x is small, or when n increases.

Just like in the case of b1, once we have SE for $b_0$, it’s easy to compute the associated t-value and p-value:

t_b0 <- b0/SE_b0
p_b0 <- pt(t_b0, deg, lower.tail = FALSE)*2

t_b0

[1] 11.24616

p_b0

[1] 6.775062e-12

This means that the intercept is significantly different from 0 (not surprising at all, as reaction times should of course be greater than 0).

Assessing the overall fit: variance explained

One more thing that R gives you when you fit a linear model is $R^2$. This value tells us the proportion of variance of the outcome variable accounted for by the predictor variables (the higher the $R^2$, the better the fit). The calculation, at this point, is relatively easy:

\[ R^2 = 1 - \frac{RSS}{SS_{yy}} \]

Remember, RSS is residual sum of squares (this is sometimes denoted as $SS_{res}$ in this context). Remember that RSS is larger when the fit is poorer.

\[ RSS = \sum_i^n(y_i - \hat{y_i})^2 \]

$SS_{yy}$ is the sum of the difference between y and the mean of y, and can be thought of as the total amount of variance in the outcome variable (the dispersion y).

\[ SS_{yy} = \sum_i^n{(y_i - \bar{y})^2} \]

As you can see in the equation, $R^2$ value would be larger when RSS decreases or when the dispersion of y ($SS_{yy}$) increases. Let’s calculate the value and check it against the R output.

RSS <- sum((df$RT - df$y_hat)^2)
SS_yy <- sum((df$RT - y_bar)^2)
1 - (RSS/SS_yy)

[1] 0.180446

This corresponds to the value of “multiple R-squared” in the summary output:

summary(model)$r.squared

[1] 0.180446

Another value that R gives is the “adjusted” R-squared, which is, well…, an adjusted version of R squared. The idea is that adding more predictors (we haven’t done that, but we will, when we talk about multiple regression) always improve the R squared values. We have to penalize increasing the number of predictors and there is a way to penalize it. I am not gonna go too much into depth here, but you can see Navarro’s textbook (p. 464-).

Centering predictors

The intercept in our model tells us the predicted RT when the log frequency is set to 0. Sometimes, we want the model to tell if the predicted reading time when the log frequency is set to the average value (in the sample). To achieve this, we have two options: (a) center the predictor (subtract the mean of x from each x) (b) standardize the predictor (subtract the mean of x from each x, and then divide the difference by the standard deviation of x). Either case, the mean of the predictor value is (very close to) 0. Thus, the intercept should reflects the predicted RT when the log frequency is set to the average value. These two options makes a difference in how to interpret the slope though. When you do centering, the slope is going to reflect the same thing as before: the amount of predicted change due to the log frequency increasing by 1. When you do standardizing, the slope is going to reflect the amount of predicted change due to the log frequency increasing by 1 standard deviation. In R, both centering and standardizing can be done by using the scale function. If you set “scale = F”, it does centering. If you set “scale = T,” it does standardizing. Compare and contrast both $b_0$ and $b_1$ of each model!

df$logf_c <- scale(df$logf, scale = F) # this just subtract the mean frequency
df$logf_z <- scale(df$logf, scale = T) # this calculate the z-score

model_c <- lm(RT ~ logf_c,data = df)
model_z <- lm(RT ~ logf_z,data = df)

summary(model_c)


Call:
lm(formula = RT ~ logf_c, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-146.17  -51.56  -12.65   55.94  141.25 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  664.594     13.552  49.042   <2e-16 ***
logf_c       -19.530      7.866  -2.483   0.0193 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 74.22 on 28 degrees of freedom
Multiple R-squared:  0.1804,    Adjusted R-squared:  0.1512 
F-statistic: 6.165 on 1 and 28 DF,  p-value: 0.01929

summary(model_z)


Call:
lm(formula = RT ~ logf_z, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-146.17  -51.56  -12.65   55.94  141.25 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   664.59      13.55  49.042   <2e-16 ***
logf_z        -34.22      13.78  -2.483   0.0193 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 74.22 on 28 degrees of freedom
Multiple R-squared:  0.1804,    Adjusted R-squared:  0.1512 
F-statistic: 6.165 on 1 and 28 DF,  p-value: 0.01929

As you can see,

Assumptions of linear regression

Just like any other statistical models, linear regression makes various assumptions. Here are some that are mentioned very often. Note that the normality assumption is about residuals, not the outcome.

Normality: The residuals are assumed to be normally distributed (note, it’s not about x or y, it’s about residuals).
Linearity: the relationship between x and y is assumed to be linear.
Homogeneity of variance: Each residual is assumed to be drawn from the same normal distribution with the mean of 0 and a fixed standard deviation.
Independence of errors. Residuals and the outcome variable should not be correlated