Linear regression is a widely used technique to model the association between a dependent variable and one or more independent variables. In the simplest case, which is what will be explored in this post, there is one independent variable and one dependent variable. The linear regression model in this simple form is \(y = ax + b\), with slope \(a\) and intercept \(b\). Ordinary Least Squares is the most common and straightforward approach to fitting a linear regression model. In the Ordinary Least Squares (OLS) setting, the goal when fitting the linear regression model is to keep the difference between the actual observed data and the fitted model as small as possible.
The goal of this post is to examine in-depth the calculations and interpretations of the linear regression model. Although regression comes in many flavors and can be easily done with statistical programs such as R, it is beneficial to know the assumptions and calculations of the model to better gauge its appropriateness in a particular situation or explain the results.
To begin, load the ggplot2 package and the cars dataset.
library(ggplot2)
data("cars")
The cars dataset contains 50 observations of the speed and stopping distances of cars. Given the first law of motion, it is reasonable to expect the stopping distance of the cars will increase as speed increases. The association between the two variables can be modeled by linear regression.
Plotting a scatterplot of the speed and stopping distances gives an excellent visualization of the association between the two variables. A linear regression line can also be drawn with the method=lm argument in geom_smooth() to illustrate how the regression model would fit the data.
ggplot(cars, aes(x=speed, y=dist)) +
geom_point(color='#2980B9', size = 4) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='#2C3E50')
The Ordinary Least Squares approach mentioned previously attempts to find the optimal line that minimizes the distance between the fitted and observed values. The fitted line is drawn nicely through the data points and thus provides good motivation to model the association between the variables with a linear regression model.
To expand on the point that visualizing the regression line over the data helps justify a linear fit, below is an example of a linear regression model fit to clearly non-linear data.
# Simulate data with a cubic (non-linear) trend plus random noise
x <- -100:100
y <- 0.001 * x^3
noise <- rnorm(length(x), mean = 50, sd = 100)
noisy <- y + noise
df <- data.frame(x, noisy)
ggplot(df, aes(x = x, y = noisy)) + geom_point(color='#2980B9', size = 2) +
geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='#2C3E50', size = 1.25) + ylab('y')
The data points clearly show a non-linear relationship, and thus a linear regression model would be ill-suited for modeling this data. Plotting a scatterplot or another visualization of the data to examine the association between the variables plays a significant role in determining what approach should be taken when modeling the data.
Building a linear regression model in R is done with the lm() function and is rather straightforward. As so much literal and virtual ink has been devoted to exploring the linear regression model and the lm() function, this will only be touched on briefly.
Use the lm() function to create a linear model. The variable on the left-hand side of the '~' is the response (dependent) variable, and the independent variable, speed, is on the right-hand side of the '~'. A summary of the linear regression model can be printed with the summary() call.
cars.lm <- lm(dist ~ speed, data = cars)
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The summary() function outputs the residuals, the coefficients, and other statistics of the linear model. The residuals are the differences between the observed responses and the fitted responses; they are used to find the sum of squared residuals, also known as the SSE, or error sum of squares. The \(R^2\), or coefficient of determination, is 0.65, which means roughly 65% of the variation in stopping distance is explained by speed. As seen in the first graph above, the linear regression line is indeed close to most of the data points.
The p-value of the speed coefficient reported by the summary() function is well below 0.05; therefore, it can be concluded that speed has a statistically significant effect on the stopping distance of the cars.
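As an aside, these values do not have to be read off the printed summary; the coefficient table can be pulled out of the summary object directly. A quick sketch using the cars.lm model fit above:
# Extract the coefficient table (estimates, standard errors, t-values and p-values)
coef(summary(cars.lm))
# The p-value of the speed coefficient by itself
coef(summary(cars.lm))['speed', 'Pr(>|t|)']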
Before building the model by hand, some definitions from least squares estimation are needed. As mentioned above, the Ordinary Least Squares approach finds the estimates of \(\beta_0\) and \(\beta_1\) that best fit the data. The fitted equation is defined as \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\). Therefore, for each observed response \(y_i\), with a corresponding predictor value \(x_i\), the goal is to obtain a fitted value \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\) such that the sum of the squared distances between the observed and fitted responses is minimized. This sum is called the SSE, the error sum of squares, and is defined as:
\[ SSE = \sum^n_{i=1}(y_i - \hat{y}_i)^2 \]
From the SSE, the values of \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) can be derived. These values should be familiar from a previous post on Pearson’s correlation coefficient and are defined as:
\[ S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} \] \[ S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} \] \[ S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} \]
which gives the estimates of \(\beta_0\) and \(\beta_1\):
\[ \hat{\beta}_1 = \frac{\sum^n_{i=1} (x - \bar{x})(y - \bar{y})}{\sum^n_{i=1} (x - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \] \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \frac{\sum^n_{i=1}y_i}{n} - \hat{\beta}_1 \frac{\sum^n_{i=1} x_i}{n} \]
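As a quick sanity check of these formulas before building the full function later in the post, the estimates can be computed directly from the cars data loaded earlier and compared against the coefficients reported by lm(). A minimal sketch (the names sxx, syy, sxy, b0 and b1 are introduced here for illustration and reused in the later snippets; note that x and y are reassigned to the cars columns):
x <- cars$speed
y <- cars$dist
n <- length(x)
# Sums of squares Sxx, Syy and Sxy
sxx <- sum(x^2) - sum(x)^2 / n
syy <- sum(y^2) - sum(y)^2 / n
sxy <- sum(x * y) - sum(x) * sum(y) / n
# Coefficient estimates from the formulas above
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
# Should match the coefficients from lm()
coef(cars.lm)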
The \(SSE\) can then be written in terms of these quantities.
\[ SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \]
The Mean Square Error is then the \(SSE\) divided by \(n - 2\) since there are two estimated parameters.
\[ MSE = \frac{SSE}{n - 2} = \frac{\sum^n_{i=1} (y_i - \hat{y}_i)^2}{n - 2} = \frac{S_{yy} - \frac{S_{xy}^2}{S_{xx}}}{n - 2} \]
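Continuing with the quantities computed in the previous snippet, the SSE and MSE for the cars data can be checked against the residual standard error reported by summary(), which is \(\sqrt{MSE}\):
# Error sum of squares and mean square error from the formulas above
sse <- syy - sxy^2 / sxx
mse <- sse / (n - 2)
# The residual standard error reported by summary() is sqrt(MSE)
sqrt(mse)
summary(cars.lm)$sigma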
The standard errors of \(\beta_0\) and \(\beta_1\) are defined as:
\[ se(\beta_1) = \frac{\sqrt{MSE}}{\sqrt{S_{xx}}} \] \[ se(\beta_0) = \frac{\sqrt{MSE}}{\sqrt{n}} \sqrt{1 + \frac{\bar{x}^2}{S_{xx} / n}} \]
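These standard errors can also be verified against the Std. Error column of the summary output, again reusing the quantities from the earlier snippets:
# Standard errors of the slope and intercept
b1.se <- sqrt(mse) / sqrt(sxx)
b0.se <- sqrt(mse) / sqrt(n) * sqrt(1 + mean(x)^2 / (sxx / n))
c(b0.se, b1.se)
# Compare with the Std. Error column reported by summary()
coef(summary(cars.lm))[, 'Std. Error']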
The t-values of \(\beta_1\) and \(\beta_0\) can then be found with the standard errors.
\[ t = \frac{\beta_1 - \beta}{se(\beta_1)} \]
In the case of linear regression, the \(\beta\) that is subtracted from the coefficients is set to 0.
\[ t = \frac{\beta_1}{se(\beta_1)} \]
The t-value of \(\beta_0\) can also be found with the same equation by replacing \(\beta_1\) with \(\beta_0\). With the t-values of the coefficients, the p-values can then be computed from the t distribution with \(n - 2\) degrees of freedom, since there are two parameters in the model. The p-value is multiplied by two because the test is two-tailed, i.e., the coefficient could depart from zero in either direction.
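With the standard errors in hand, the t-values and two-sided p-values of the cars coefficients follow directly (continuing the snippets above):
# t-values of the coefficients under the null hypothesis beta = 0
b0.t <- b0 / b0.se
b1.t <- b1 / b1.se
# Two-sided p-values from the t distribution with n - 2 degrees of freedom
b0.p <- 2 * pt(abs(b0.t), df = n - 2, lower.tail = FALSE)
b1.p <- 2 * pt(abs(b1.t), df = n - 2, lower.tail = FALSE)
c(b0.p, b1.p)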
The coefficient of determination, \(r^2\), represents the proportion of the variance in the dependent variable (stopping distance in this case) that is predictable from the independent variable (speed). In the context of a linear regression model, \(r^2\) is a measure of how well the model fits the data. In the simple linear regression setting, the coefficient of determination is designated \(r^2\), whereas in multiple regression it is represented as \(R^2\) (Wikipedia).
\(r^2\) can be defined as:
\[ r^2 = \frac{S_{yy} - SSE}{S_{yy}} \]
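Computed from the sums of squares above, the coefficient of determination should agree with the Multiple R-squared value reported by summary():
# Coefficient of determination
r2 <- (syy - sse) / syy
r2
summary(cars.lm)$r.squared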
The adjusted \(R^2\), often written as \(\bar{R}^2\), takes the number of predictors in the model into account. The adjusted \(R^2\) increases only when an added variable improves the model more than would be expected by chance, and it decreases when the added variable does not. The adjusted \(R^2\) is defined as:
\[ \bar{R}^2 = R^2 - (1 - R^2) \frac{p - 1}{n - p} \]
Where \(p\) is the total number of parameters in the model, including the intercept (\(p = 2\) in simple linear regression).
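For the cars model there are two parameters, the intercept and the slope, so the adjusted R-squared can be checked as follows (continuing the snippets above):
# Adjusted R-squared with p = 2 parameters (intercept and slope)
p <- 2
r2.adj <- r2 - (1 - r2) * (p - 1) / (n - p)
r2.adj
summary(cars.lm)$adj.r.squared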
From a previous post on ANOVA, the F-statistic is the ratio \(\frac{MSR}{MSE}\), where \(MSR\) is the regression mean square (the regression sum of squares divided by its degrees of freedom), as this ratio is known to follow the F distribution. The numerator has 1 degree of freedom while the denominator has \(n - 2\) degrees of freedom.
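The F-statistic and its p-value can be reproduced in the same way; here the MSR is computed from fitted values obtained with the hand-computed coefficients from the earlier snippets:
# Fitted values from the hand-computed coefficients
y.hat <- b0 + b1 * x
# Regression mean square (1 degree of freedom) and the F-statistic
msr <- sum((y.hat - mean(y))^2) / 1
f <- msr / mse
f
# p-value from the F distribution with 1 and n - 2 degrees of freedom
pf(f, 1, n - 2, lower.tail = FALSE)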
The function below collects all the equations and steps above to perform simple linear regression. There is no practical reason to use it over the standard lm() function; however, it may provide some further understanding of how linear regression is performed in R.
simple.linear.coef <- function(x, y) {
  # Sample size; x and y are assumed to have the same length
  n <- length(x)
  # Sums of squares Sxx, Syy and Sxy
  sxx <- sum(x^2) - sum(x)^2 / n
  syy <- sum(y^2) - sum(y)^2 / n
  sxy <- sum(x * y) - (sum(x) * sum(y)) / n
  # Coefficients beta0 and beta1
  b1 <- sxy / sxx
  b0 <- mean(y) - b1 * mean(x)
  # Error sum of squares (SSE) and mean square error (MSE)
  sse <- syy - sxy^2 / sxx
  mse <- sse / (n - 2)
  # Standard errors of beta0 and beta1
  b1.err <- sqrt(mse) / sqrt(sxx)
  b0.err <- sqrt(mse) / sqrt(n) * sqrt(1 + (mean(x)^2 / (sum((x - mean(x))^2) / n)))
  # t-values of beta0 and beta1 under the null hypothesis beta = 0
  b0.t <- (b0 - 0) / b0.err
  b1.t <- (b1 - 0) / b1.err
  # Two-sided p-values of beta0 and beta1
  b0.p <- 2 * pt(abs(b0.t), df = n - 2, lower.tail = FALSE)
  b1.p <- 2 * pt(abs(b1.t), df = n - 2, lower.tail = FALSE)
  # Coefficient of determination R-squared
  r2 <- (syy - sse) / syy
  # Adjusted R-squared with two estimated parameters
  r2.adj <- r2 - (1 - r2) * ((2 - 1) / (n - 2))
  rsquare <- paste('Multiple R-squared: ', round(r2, 4), ', Adjusted R-squared: ', round(r2.adj, 4))
  coeffs <- data.frame(cbind(c(b0, b1), c(b0.err, b1.err), c(b0.t, b1.t), c(b0.p, b1.p)))
  colnames(coeffs) <- c('Estimate', 'Std. Error', 't value', 'Pr(>|t|)')
  rownames(coeffs) <- c('Intercept', 'x1')
  # Fit the line to the data with beta0 and beta1 found above
  fitted <- x * b1 + b0
  # The F-statistic: regression mean square over error mean square
  msr <- sum((fitted - mean(y))^2) / 1
  mse2 <- sum((y - fitted)^2) / (n - 2)
  f <- msr / mse2
  # p-value of the F-statistic
  p <- pf(f, 1, n - 2, lower.tail = FALSE)
  f.stat <- paste('F-statistic: ', round(f, 2), ' on 1 and ', n - 2, ' DF, p-value: ', format(p, digits = 3, scientific = TRUE))
  # Calculate summary statistics of the residuals
  resd <- y - fitted
  min.res <- round(min(resd), 3)
  max.res <- round(max(resd), 3)
  q1.q3 <- quantile(resd, probs = c(.25, .75))
  med <- round(median(resd), 3)
  residual <- data.frame(cbind(min.res, round(q1.q3[1], 3), med, round(q1.q3[2], 3), max.res))
  colnames(residual) <- c('Min', 'Q1', 'Median', 'Q3', 'Max')
  resdi <- paste('Residual standard error: ', round(sqrt(mse2), 2), ' on ', n - 2, ' degrees of freedom')
  regres <- list('Residuals' = residual, 'Coefficients' = coeffs, resdi, rsquare, f.stat)
  return(regres)
}
The results can be verified against the lm() function by using the summary() call.
simple.linear.coef(cars$speed, cars$dist)
## $Residuals
## Min Q1 Median Q3 Max
## 25% -29.069 -9.525 -2.272 9.215 43.201
##
## $Coefficients
## Estimate Std. Error t value Pr(>|t|)
## Intercept -17.579095 6.7584402 -2.601058 1.231882e-02
## x1 3.932409 0.4155128 9.463990 1.489836e-12
##
## [[3]]
## [1] "Residual standard error: 15.38 on 48 degrees of freedom"
##
## [[4]]
## [1] "Multiple R-squared: 0.6511 , Adjusted R-squared: 0.6438"
##
## [[5]]
## [1] "F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12"
summary(cars.lm)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
In this post, linear regression was explored in more depth using the single explanatory variable case. Linear regression is a powerful tool for finding relationships between variables and how they affect one another. Several assumptions must hold before a linear model can be fitted, such as normality, independence, and equal variance of the errors. Another drawback of linear regression is that it is sensitive to outliers, since the goal is to minimize the squared distance between the fitted line and the observed data.
The function above is also available as a Gist.
Winner, L. (2009). Applied Statistical Methods. University of Florida
Simple linear regression (2016). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Simple_linear_regression
Ordinary least squares (2016). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Ordinary_least_squares