Let’s run a linear regression predicting scores on a writing test based on scores on a previous reading test. The predictor variable and outcome variable are distributed as follows:
hist(df$reading_score)
hist(df$writing_score)
Here I run the regression:
summary(lm(writing_score ~ reading_score, data = df))
##
## Call:
## lm(formula = writing_score ~ reading_score, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9573 -2.9573 0.0363 3.1026 15.0557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.667554 0.693792 -0.962 0.336
## reading_score 0.993531 0.009814 101.233 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.529 on 998 degrees of freedom
## Multiple R-squared: 0.9113, Adjusted R-squared: 0.9112
## F-statistic: 1.025e+04 on 1 and 998 DF, p-value: < 2.2e-16
We can see that the intercept is -0.6675536 and that the slope is 0.9935311. This means that for each 1-point increase in the predictor variable, the outcome variable increases by almost exactly the same amount (about 0.99 points). But how are these two parameters actually estimated behind the scenes?
The slope parameter \(\beta_{1}\) is calculated as the sum of the products of the deviations of the predictor and outcome variables from their respective means, divided by the sum of squared deviations \(SS\) for the predictor variable:
\[\beta_{1} = \frac {\Sigma (x_{i} - \overline {x}) (y_{i} - \overline {y})} {\Sigma (x_{i} - \overline {x})^2}\]
You may notice that the numerator is the same as the numerator of the covariance formula \(cov(x,y) = \frac {\Sigma (x_{i} - \overline {x}) (y_{i} - \overline {y})} {N-1}\), but the denominator is the predictor’s \(SS\) instead of the degrees of freedom. So the slope expresses the total shared deviation between the two variables as a proportion of the deviation in the predictor variable.
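Since the same \(N-1\) appears in both the sample covariance and the sample variance, the slope can equivalently be computed as \(cov(x,y) / var(x)\). A quick check in R (the object name beta_1_cov is mine):
# slope as the covariance of x and y divided by the variance of x
beta_1_cov <- cov(df$reading_score, df$writing_score) / var(df$reading_score)
# identical to the lm() slope estimate (0.9935311)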
The formula can also be written as the product of Pearson’s correlation coefficient for the two variables and the ratio of the standard deviation of variable y to the standard deviation of variable x:
\[\beta_{1} = r \frac {s_{y}} {s_{x}}\]
Let’s try to compute the slope using the deviations formula above and see whether it matches the output of the lm() function.
beta_1 <-
sum ((df$reading_score - mean(df$reading_score)) * (df$writing_score - mean(df$writing_score))) /
sum((df$reading_score - mean(df$reading_score)) ^ 2)
beta_1
## [1] 0.9935311
We can see that the slope estimate is identical to the estimate provided in the lm() output above.
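We can also cross-check the second form of the formula with cor() and sd() (the object name beta_1_alt is mine):
# slope as Pearson's r times the ratio of the standard deviations
beta_1_alt <- cor(df$reading_score, df$writing_score) * (sd(df$writing_score) / sd(df$reading_score))
# again 0.9935311, identical to the estimate above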
Let’s compute the intercept now. The intercept is the difference between the mean of the outcome variable and the product of the mean of the predictor variable and the slope of the regression line:
\[\beta_{0} = \overline {y} - \beta_{1} \overline {x}\]
In R code, this would be:
beta_0 <- mean(df$writing_score) - (beta_1 * mean(df$reading_score))
beta_0
## [1] -0.6675536
Again, you can see that our intercept estimate is identical to the one provided in the lm() output.
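As a further sanity check, both estimates can be pulled directly from a fitted model object with coef() (the object name fit is mine):
fit <- lm(writing_score ~ reading_score, data = df)
# named vector with the intercept and the reading_score slope, same values as above
coef(fit)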
Since these two values are estimates of unobserved population parameters based on observed sample data, we can quantify how “confident” we are in these estimates. This is the same thing we do for the mean: we quantify how well our sample mean \(\overline {x}\) approximates the population mean \(\mu\) by calculating the standard error of the mean (SEM), which is the standard deviation of the (unobserved) sampling distribution of the sample means. The smaller the SEM, the better our sample mean represents the population mean.
In regression, we can follow the same logic to estimate how well the linear regression model based on our sample data reflects the “true” linear relationship in the population. In order to do this, we start by calculating the so-called residuals, that is, the deviations of the individual data points from our model, the regression line:
residuals <- df$writing_score - (beta_0 + beta_1 * df$reading_score)
Here I am simply calculating the difference between each observed value of the outcome variable and the corresponding value predicted by our regression model \(Y_{i} = \beta_{0} + \beta_{1}X_{i}\). In other words, I am quantifying the error \(\epsilon_{i}\) in the model.
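The same values can also be extracted from the fitted model object with resid() (reusing the fit object from the earlier sketch):
# residuals extracted directly from the fitted lm object
head(resid(fit))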
We can see that the residuals are approximately normally distributed, which is a good thing, because one of the four assumptions of linear regression is normality of the residuals:
hist(residuals)
Once we have the residuals, we can easily calculate the so-called standard error of the regression, which is a measure of how well our regression line modeled on the sample data approximates the “true” regression line in the population. This is done by taking the square root of the sum of squared residuals divided by the degrees of freedom \(df\). Notice that this formula is essentially the formula for the sample standard deviation, with the only difference that the degrees of freedom are computed not as \(N-1\), but as \(N-2\), because we estimate two parameters (the intercept and the slope).
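In formula form, with \(\hat{y}_{i}\) denoting the value predicted by the regression line:
\[s = \sqrt{\frac {\Sigma (y_{i} - \hat{y}_{i})^2} {N-2}}\]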
s = sqrt(sum(residuals^2)/nrow(df)-2)
s
## [1] 4.297774
This value is close to, but not identical to, the Residual standard error of 4.529 reported in the lm() output: in the line above, the \(-2\) is applied after the division by nrow(df) rather than inside the denominator. Dividing the sum of squared residuals by \(N-2\) as a whole, as in the sketch below, reproduces the lm() value. Either way, this value is a bit hard to interpret on its own, which is why we normally compute the \(R^2\) for the model, which is much easier to interpret.
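Here is the same computation with the denominator grouped as \(N-2\) (the object name s_check is mine):
# standard error of the regression: square root of RSS / (N - 2)
s_check <- sqrt(sum(residuals ^ 2) / (nrow(df) - 2))
# s_check matches the Residual standard error (4.529) in the summary(lm()) output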
This value for the standard error of the regression can be used to derive two other measures: the standard error of the intercept and the standard error of the slope. Again, we can use these values to quantify how precise our estimates of the intercept and the slope are.
The standard error of the intercept \(SE_{b_{0}}\) can be approximated as the standard error of the regression \(s\) divided by the square root of \(n\), times the square root of 1 plus \(\overline {x}\) squared over the predictor’s \(SS\) divided by \(N-2\) (roughly the variance of \(x\)):
\[SE_{b_{0}} = \frac {s} {\sqrt{n}} \times \sqrt{ 1+ \frac {\overline {x}^2} { \frac {\Sigma (x_{i}-\overline{x})^2} {N-2} }} \]
Let’s do it in R:
se_intercept <-
(s / sqrt(nrow(df))) * sqrt(1 + (mean(df$reading_score) ^ 2 / (sum((
df$reading_score - mean(df$reading_score)
) ^ 2) / nrow(df)-2 )))
se_intercept
## [1] 0.6613513
You can see that this value is also quite close to the one reported in the lm() output (0.693792). It is not identical because the formula above is an approximation and it inherits the \(s\) computed earlier; the exact formula behind the lm() standard errors is shown in the sketch below.
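A minimal check with the exact formula \(SE_{b_{0}} = s \sqrt{ \frac {1} {n} + \frac {\overline {x}^2} {\Sigma (x_{i} - \overline {x})^2}}\), reusing the s_check value from the earlier sketch (the object name se_b0_check is mine):
# exact standard error of the intercept
se_b0_check <- s_check * sqrt(1 / nrow(df) +
  mean(df$reading_score) ^ 2 / sum((df$reading_score - mean(df$reading_score)) ^ 2))
# se_b0_check matches the 0.693792 reported in the summary(lm()) output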
The standard error of the slope \(SE_{b_{1}}\) can be approximated as the standard error of the regression \(s\) divided by the square root of \(n\), times 1 over the square root of the predictor’s \(SS\) divided by \(N-2\) (roughly the standard deviation of \(x\)):
\[SE_{b_{1}} = \frac {s} {\sqrt{n}} \times \frac {1} { \sqrt{\frac {\Sigma (x_{i}-\overline{x})^2} {N-2}}}\]
Let’s do this in R:
se_slope <-
(s / sqrt(nrow(df))) *
(1/sqrt((sum((
df$reading_score - mean(df$reading_score)
) ^ 2) / nrow(df)-1 )))
se_slope
## [1] 0.009335216
Also in this case, our \(SE_{b_{1}}\) estimate is very similar to the 0.009814 provided by the lm() output. The small difference again comes from the approximate denominator and from the \(s\) computed earlier; the exact formula, \(SE_{b_{1}} = \frac {s} {\sqrt{\Sigma (x_{i} - \overline {x})^2}}\), reproduces the lm() value, as shown below.
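A minimal check with the exact formula, again reusing s_check (the object name se_b1_check is mine):
# exact standard error of the slope
se_b1_check <- s_check / sqrt(sum((df$reading_score - mean(df$reading_score)) ^ 2))
# se_b1_check matches the 0.009814 reported in the summary(lm()) output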