Linear regression is a widely used technique for modeling the association between a dependent variable and one or more independent variables. In the simplest case, which is what will be explored in this post, there is one independent variable and one dependent variable. The simple linear regression model is \(y = ax + b\), with slope \(a\) and intercept \(b\). Ordinary Least Squares (OLS) is the most common and straightforward approach to fitting a linear regression model. In the OLS setting, the goal when fitting the model is to make the sum of squared differences between the observed data and the fitted values as small as possible.

The goal of this post is to examine in depth the calculations and interpretations of the linear regression model. Although regression comes in many flavors and can be easily done with statistical programs such as R, it is beneficial to know the assumptions and calculations behind the model to better gauge its appropriateness in a particular situation and to explain the results.

To begin, load the ggplot2 package and the cars dataset.

library(ggplot2)
data("cars")

The cars dataset contains 50 observations of the speed and stopping distances of cars. Given the first law of motion, it is reasonable to expect the stopping distance of the cars to increase as speed increases. The association between the two variables can be modeled by linear regression.
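Before modeling, a quick look at the data with base R's head() and str() shows the two variables in the data frame.

head(cars)   # first few rows: speed (mph) and dist (ft)
str(cars)    # 50 observations of two numeric variables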

Visualizing the Variables’ Association

Plotting a scatterplot of speed against stopping distance gives an excellent visualization of the association between the two variables. A linear regression line can also be drawn with the method=lm argument of geom_smooth() to illustrate how the regression model would fit the data.

ggplot(cars, aes(x=speed, y=dist)) + 
  geom_point(color='#2980B9', size = 4) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='#2C3E50')

The Ordinary Least Squares approach mentioned previously finds the line that minimizes the sum of squared differences between the fitted and observed values. The fitted line passes nicely through the data points and thus provides good motivation for modeling the association between the variables with a linear regression model.

Aside: Non-Linear Associations of Data

To expand on the point that visualizing the regression line over the data helps justify a linear model, below is an example of a linear regression model fit to clearly non-linear data.

x <- -100:100
y <- 0.001 * x^3

# Set a seed so the simulated noise is reproducible
set.seed(1)
noise <- rnorm(length(x), mean = 50, sd = 100)
noisy <- noise + y

df <- data.frame(x = x, noisy = noisy)

ggplot(df, aes(x = x, y = noisy)) + geom_point(color='#2980B9', size = 2) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE, color='#2C3E50', size = 1.25) + ylab('y')

The data points clearly show a non-linear relationship, and thus a linear regression model would be ill-suited for this data. Plotting a scatterplot or other visualization of the data therefore plays an important role in determining what approach should be taken to modeling it.

Fitting a Linear Regression Model in R

Building a linear regression model in R is done with the lm() function and is rather straightforward. As so much literal and virtual ink has been devoted to exploring the linear regression model and the lm() function, this will be touched on briefly.

Use the lm() function to create a linear model. The variable on the left-hand side of the ‘~’ is the response (dependent) variable, and the independent variable speed is on the right-hand side of the ‘~’. A summary of the linear regression model can be printed with the summary() call.

cars.lm <- lm(dist ~ speed, data = cars)
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The summary() function outputs the residuals, coefficients and other statistics of the linear model. The residuals are the differences between the fitted responses and the observed responses. These differences are used to find the SSE, the sum of squared errors, also known as the sum of squared residuals. The \(R^2\), or coefficient of determination, is 0.65, meaning about 65% of the variance in stopping distance is explained by speed, so the fitted line is reasonably close to the observed responses. As seen in the first graph above, the linear regression line is indeed close to most of the data points.

The p-value of the speed coefficient reported by summary() is well below 0.05; therefore it can be concluded that speed has a statistically significant effect on the stopping distance of the cars.
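As a quick aside, the quantities reported by summary() can also be extracted from the fitted model object directly. The short sketch below pulls out the residuals, their sum of squares, the coefficient table and the \(R^2\) values; the variable name res is only illustrative.

# Residuals and their sum of squares (the SSE)
res <- resid(cars.lm)
sum(res^2)
# Coefficient table: estimates, standard errors, t values and p-values
coef(summary(cars.lm))
# R-squared and adjusted R-squared
summary(cars.lm)$r.squared
summary(cars.lm)$adj.r.squared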

Manually Calculating the Linear Regression Model

Before fitting the model, some definitions used in Least Squares estimation must be made. As mentioned above, the Ordinary Least Squares approach finds the estimates of \(\beta_0\) and \(\beta_1\) that best fit the data. The fitted equation is defined as \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\). Therefore, for each observed response \(y_i\), with a corresponding predictor value \(x_i\), the goal is to obtain fitted values \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\) that minimize the sum of the squared distances between the observed and fitted responses. This sum is called the SSE, the error sum of squares, and is defined as:

\[ SSE = \sum^n_{i=1}(y_i - \hat{y}_i)^2 \]

From the SSE, the values of \(S_{xy}\), \(S_{xx}\) and \(S_{yy}\) can be derived. These values should be familiar from a previous post on Pearson’s correlation coefficient and are defined as:

\[ S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} \] \[ S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n} \] \[ S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} \]

which gives the estimates of \(\beta_0\) and \(\beta_1\):

\[ \hat{\beta}_1 = \frac{\sum^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{\sum^n_{i=1} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \] \[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \frac{\sum^n_{i=1}y_i}{n} - \hat{\beta}_1 \frac{\sum^n_{i=1} x_i}{n} \]
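To make these estimates concrete, the quantities above can be computed directly on the cars data; the short sketch below should reproduce (up to rounding) the coefficients reported by lm() earlier, roughly -17.579 and 3.932. The variable names are only illustrative.

x <- cars$speed
y <- cars$dist
n <- length(x)

# The S quantities defined above
sxx <- sum(x^2) - sum(x)^2 / n
syy <- sum(y^2) - sum(y)^2 / n
sxy <- sum(x * y) - sum(x) * sum(y) / n

# Slope and intercept estimates
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)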

The \(SSE\) can then be written in terms of the quantities above.

\[ SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}} \]

The Mean Square Error is then the \(SSE\) divided by \(n - 2\) since there are two estimated parameters.

\[ MSE = \frac{SSE}{n - 2} = \frac{\sum^n_{i=1} (y_i - \hat{y}_i)^2}{n - 2} = \frac{S_{yy} - \frac{S_{xy}^2}{S_{xx}}}{n - 2} \]
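Continuing with the sxx, sxy, syy and n values computed in the sketch above, the SSE and MSE follow directly; the square root of the MSE is the residual standard error reported by summary(), 15.38 here.

sse <- syy - sxy^2 / sxx   # error sum of squares
mse <- sse / (n - 2)       # mean square error
sqrt(mse)                  # residual standard error, approximately 15.38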

The standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are defined as:

\[ se(\hat{\beta}_1) = \frac{\sqrt{MSE}}{\sqrt{S_{xx}}} \] \[ se(\hat{\beta}_0) = \sqrt{MSE \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)} = \frac{\sqrt{MSE}}{\sqrt{n}} \sqrt{1 + \frac{\bar{x}^2}{S_{xx} / n}} \]
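Using the mse value from the previous sketch, the standard errors of the two coefficients can be computed as below; they should be close to the 0.4155 and 6.7584 shown in the summary() output.

b1.err <- sqrt(mse) / sqrt(sxx)                   # se of the slope
b0.err <- sqrt(mse * (1 / n + mean(x)^2 / sxx))   # se of the intercept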

The t-values of \(\hat{\beta}_1\) and \(\hat{\beta}_0\) can then be found with the standard errors.

\[ t = \frac{\hat{\beta}_1 - \beta}{se(\hat{\beta}_1)} \]

Here \(\beta\) is the hypothesized value of the coefficient; in linear regression the null hypothesis is that the coefficient is 0, so the \(\beta\) subtracted from the estimate is set to 0.

\[ t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} \]

The t-value of \(\hat{\beta}_0\) can be found with the same equation by replacing \(\hat{\beta}_1\) with \(\hat{\beta}_0\). With the t-values of the coefficients, the p-values can then be computed on \(n - 2\) degrees of freedom, since two parameters are estimated in the model. The p-value is multiplied by two as it is a two-tailed test, i.e., the t-statistic can be either positive or negative.
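With the estimates and standard errors from the sketches above, the t-values and two-tailed p-values can be computed with pt(); the results should match the coefficient table printed by summary().

b0.t <- b0 / b0.err   # approximately -2.601
b1.t <- b1 / b1.err   # approximately 9.464
# Two-tailed p-values on n - 2 degrees of freedom
2 * pt(abs(b0.t), df = n - 2, lower.tail = FALSE)
2 * pt(abs(b1.t), df = n - 2, lower.tail = FALSE)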

\(R^2\), the Coefficient of Determination

The coefficient of determination, \(r^2\), represents the proportion of the variance in the dependent variable (stopping distance in this case) that is predictable from the independent variable (speed). In the context of a linear regression model, \(r^2\) is a measure of how well the model fits the data. In the simple linear regression setting, the coefficient of determination is designated \(r^2\), whereas in multiple regression it is written \(R^2\) (Wikipedia).

\(r^2\) can be defined as:

\[ r^2 = \frac{S_{yy} - SSE}{S_{yy}} \]
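Reusing the syy and sse values from the earlier sketches, a quick check of the coefficient of determination gives the 0.6511 reported by summary().

r2 <- (syy - sse) / syy   # approximately 0.6511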

The adjusted \(R^2\), often written as \(\bar{R}^2\), takes into account the number of predictors in the model. Unlike \(R^2\), the adjusted \(R^2\) increases only when an added variable improves the model more than would be expected by chance, and it decreases when the added variable does not. The adjusted \(R^2\) is defined as:

\[ \bar{R}^2 = R^2 - (1 - R^2) \frac{p - 1}{n - p} \]

Where \(p\) is the number of estimated parameters in the model, including the intercept (so \(p = 2\) in the simple linear regression setting).
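For the cars model there are two estimated parameters (the intercept and the slope), so \(p = 2\); plugging into the formula with the r2 and n values from the sketches above should give the 0.6438 reported by summary().

p <- 2   # intercept and slope
r2 - (1 - r2) * (p - 1) / (n - p)   # adjusted R-squared, approximately 0.6438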

The F-Statistic

From a previous post on ANOVA, the F-statistic is the ratio \(\frac{MSR}{MSE}\), where \(MSR = \frac{\sum^n_{i=1} (\hat{y}_i - \bar{y})^2}{1}\) is the regression mean square; under the null hypothesis that the slope is 0, this ratio follows the F distribution. The numerator has 1 degree of freedom while the denominator has \(n - 2\) degrees of freedom.
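The sketch below computes the F-statistic from the fitted values implied by the hand-calculated coefficients; with a single predictor the F-statistic is simply the square of the slope's t-value, and both numbers should reproduce the 89.57 and 1.49e-12 in the summary() output. The variable names are again illustrative.

fitted.vals <- b0 + b1 * x                        # fitted responses
msr <- sum((fitted.vals - mean(y))^2) / 1         # regression mean square
f <- msr / mse                                    # approximately 89.57
pf(f, df1 = 1, df2 = n - 2, lower.tail = FALSE)   # p-value, approximately 1.49e-12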

A Function for Performing Simple Linear Regression

The function below collects all the equations and steps above to perform simple linear regression. There is no practical reason to use this over the standard lm() function; however, it may provide some further understanding of how linear regression is performed in R.

simple.linear.coef <- function(x, y) {
  # Find length of x to get sample size. Assuming x and y have the same sample size.
  n <- length(x)
  # Calculate the error statistics Sxx, Syy, and Sxy
  sxx <- sum(x^2) - sum(x)^2 / n
  syy <- sum(y^2) - sum(y)^2 / n
  sxy <- sum(x * y) - (sum(x) * sum(y)) / n
  # Coefficients beta0 and beta1
  b1 <- sxy / sxx
  b0 <- mean(y) - b1 * mean(x)
  # Error sum of squares (SSE) and mean square error (MSE)
  sse <- syy - sxy^2 / sxx
  mse <- sse / (n - 2)
  # Standard error beta0 and beta1
  b1.err <- sqrt(mse) / sqrt(sxx)
  b0.err <- sqrt(mse) / sqrt(n) * sqrt(1 + (mean(x)^2 / (sum((x - mean(x))^2) / n)))
  # beta0 and beta1 t-values
  b0.t <- (b0 - 0) / b0.err
  b1.t <- (b1 - 0) / b1.err
  # Two-tailed p-values of beta0 and beta1
  b0.p <- 2 * pt(abs(b0.t), df = n - 2, lower.tail = FALSE)
  b1.p <- 2 * pt(abs(b1.t), df = n - 2, lower.tail = FALSE)
  # Coefficient of determination R-squared
  r2 <- (syy - sse) / syy
  # Adjusted R-squared (p = 2 estimated parameters: intercept and slope)
  r2.adj <- r2 - (1 - r2) * ((2 - 1) / (length(y) - 2))
  
  rsquare <- paste('Multiple R-squared: ', round(r2, 4), ', Adjusted R-squared: ', round(r2.adj, 4))
  
  coeffs <- data.frame(cbind(c(b0, b1), c(b0.err, b1.err), c(b0.t, b1.t), c(b0.p, b1.p)))
  colnames(coeffs) <- c('Estimate', 'Std. Error', 't value', 'Pr(>|t|)')
  rownames(coeffs) <- c('Intercept', 'x1')
  
  # Fit the line to the data with beta0 and beta1 found above
  fitted <- x * b1 + b0
  
  # The F-Statistic
  msr <- sum((fitted - mean(y))^2) / 1
  mse2 <- sum((y - fitted)^2) / (length(y) - 2)
  f <- msr / mse2
  # p-value
  p <- pf(f, 1, length(y) - 2, lower.tail = FALSE)
  
  f.stat <- paste('F-statistic: ', round(f, 2), ' on 1 and ', n - 2, ' DF, p-value: ', format(p, digits = 3, scientific = TRUE))
  # Calculate and find summary statistics of the residuals
  resd <- y - fitted
  min.res <- round(min(resd), 3)
  max.res <- round(max(resd), 3)
  q1.q3 <- quantile(resd, probs = c(.25, .75))
  med <- round(median(resd), 3)
  residual <- data.frame(cbind(min.res, round(q1.q3[1], 3), med, round(q1.q3[2], 3), max.res))
  colnames(residual) <- c('Min', 'Q1', 'Median', 'Q3', 'Max')
  resdi <- paste('Residual standard error: ', round(sqrt(mse2), 2), ' on ', n - 2, ' degrees of freedom')
  regres <- list('Residuals'=residual, 'Coefficients'=coeffs, resdi, rsquare, f.stat)
  
  return(regres)
}

The results can be verified against the lm() function by using the summary() call.

simple.linear.coef(cars$speed, cars$dist)
## $Residuals
##         Min     Q1 Median    Q3    Max
## 25% -29.069 -9.525 -2.272 9.215 43.201
## 
## $Coefficients
##             Estimate Std. Error   t value     Pr(>|t|)
## Intercept -17.579095  6.7584402 -2.601058 1.231882e-02
## x1          3.932409  0.4155128  9.463990 1.489836e-12
## 
## [[3]]
## [1] "Residual standard error:  15.38  on  48  degrees of freedom"
## 
## [[4]]
## [1] "Multiple R-squared:  0.6511 , Adjusted R-squared:  0.6438"
## 
## [[5]]
## [1] "F-statistic:  89.57  on 1 and  48  DF, p-value:  1.49e-12"
summary(cars.lm)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Summary

In this post, linear regression was explored in more depth using the single explanatory variable case. Linear regression is a powerful tool for finding relationships between variables and how they affect one another. Several assumptions must hold before a linear model is fitted, such as normality, independence and equal variance of the errors. Another drawback of linear regression is that it is sensitive to outliers, since the goal is to minimize the squared distance between the fitted line and the observed data.

The function above is also available as a Gist.

References

Winner, L. (2009). Applied Statistical Methods. University of Florida

Simple linear regression (2016). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Simple_linear_regression

Ordinary least squares (2016). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Ordinary_least_squares