Linear regression is a classical technique for predicting a quantitative response \(y\) from one or more predictor variables \(x_j\).
Simple linear regression is an approach for predicting a quantitative response \(y\) on the basis of a single predictor variable \(x\). It assumes that there is approximately a linear relationship between \(x\) and \(y\). If such a relationship exists, it can be written as \[ y = \beta_0 + \beta_1x + \varepsilon,\] where \(y\) represents the response variable, \(\beta_0\) represents the \(y\)-intercept, \(\beta_1\) is the coefficient representing the slope of the regression line and \(\varepsilon\) represents a random error term with mean 0.
We build a model to attempt to capture this relationship, given by \[\hat{y} = b_0 + b_1x,\] where \(b_0\) estimates \(\beta_0\) and \(b_1\) estimates \(\beta_1\). The variable \(\hat{y}\) is a point estimate of the expected value of \(y\) given \(x\), \(E(y|x)\).
Let \((x_i,y_i)\) be a point in the data set. A residual is the difference between the actual value \(y_i\) and the predicted value \(\hat{y}_i\), \[e_i = y_i - \hat{y}_i.\] Thus, in a linear regression, we attempt to find \[\min \sum_{i=1}^n e^2_i = \min \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^n (y_i - b_0 - b_1x_i)^2.\] The diagram below shows the residuals (vertical lines) for an example relating money spent on TV advertising to total sales.
Mathematically, finding the minimum sum of squared residuals requires finding the values of \(b_0\) and \(b_1\) that minimize \(\sum (y_i - \hat{y}_i)^2\), since \[S = \sum_{i=1}^n e^2_i = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - b_0 - b_1x_i)^2.\] Thus, in order to minimize \(S\), we find values of \(b_0\) and \(b_1\) for which \(\partial S/ \partial b_0 = 0\) and \(\partial S/ \partial b_1 = 0\). Finding these partial derivatives and solving the resulting equations, we see that \[b_0 = \bar{y} - b_1 \bar{x},\] for the (still unknown) value of \(b_1\). Substituting this expression for \(b_0\) into the \(\partial S/ \partial b_1\) equation, we find (after some algebra) that \[b_1 = \dfrac{cov(x,y)}{var(x)} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}.\]
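As a concrete illustration of these formulas, here is a minimal Python sketch (the function name `fit_simple_ols` is illustrative, not from the text) that computes \(b_1\) and \(b_0\) directly from the covariance/variance form:

```python
# Minimal sketch of the closed-form least-squares estimates:
# b1 = cov(x, y) / var(x) and b0 = ybar - b1 * xbar.
def fit_simple_ols(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum (x - xbar)(y - ybar)
    sxx = sum((xi - x_bar) ** 2 for xi in x)                        # sum (x - xbar)^2
    b1 = sxy / sxx             # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1
```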
Consider the following data set:
| y | x |
|---|---|
| 20 | 30 |
| 50 | 65 |
| 35 | 40 |
| 30 | 20 |
| 55 | 55 |
Note that \(\bar{x} = 42\) and \(\bar{y} = 38\). The calculations for the regression are below:
| y | x | \(y-\bar{y}\) | \(x-\bar{x}\) | \((x-\bar{x})(y-\bar{y})\) | \((x-\bar{x})^2\) |
|---|---|---|---|---|---|
| 20 | 30 | -18 | -12 | 216 | 144 |
| 50 | 65 | 12 | 23 | 276 | 529 |
| 35 | 40 | -3 | -2 | 6 | 4 |
| 30 | 20 | -8 | -22 | 176 | 484 |
| 55 | 55 | 17 | 13 | 221 | 169 |
Summing column 5 and column 6, we see \(\sum(x-\bar{x})(y-\bar{y}) = 895\) and \(\sum(x-\bar{x})^2 = 1330\). So, \[b_1 = \dfrac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2} = \dfrac{895}{1330} = 0.672932.\] Furthermore, \[b_0 = \bar{y} - b_1 \bar{x} = 38 - \dfrac{895}{1330}(42) = 9.736842.\] Thus, our regression equation is \[\hat{y} = 9.74 + 0.67x.\]
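As a sanity check, these values can be reproduced with the `fit_simple_ols` sketch above (the SciPy cross-check is optional and assumes SciPy is installed):

```python
# Verifying the worked example with the fit_simple_ols sketch defined earlier.
x = [30, 65, 40, 20, 55]
y = [20, 50, 35, 30, 55]

b0, b1 = fit_simple_ols(x, y)
print(round(b1, 6), round(b0, 6))   # 0.672932 9.736842

# An equivalent check with SciPy, if available:
# from scipy.stats import linregress
# result = linregress(x, y)   # result.slope ~ 0.6729, result.intercept ~ 9.7368
```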
Our model is \[\hat{y} = 9.74 + 0.67x.\] Thus, if the value of \(x\) is 0, we predict a value of 9.74 for \(y\). Be careful about forecasting a \(y\)-value outside the rough region covered by our data; the analysis does not always hold outside this “experimental region”.
For every one-unit increase in \(x\), the predicted \(y\)-value increases by 0.67 units.
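For example, at \(x = 50\) the fitted model predicts \(\hat{y} = 9.74 + 0.67(50) \approx 43.2\).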
In our analysis of the model, we will also want to define the following terms:
RSS (residual sum of squares): \[RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2.\]
TSS (total sum of squares): \[TSS = \sum_{i=1}^n (y_i - \bar{y})^2.\]
We will want to examine each of the following statistics:
Residual standard error: This measures the quality of our regression fit. It is roughly the average amount by which the response variable deviates from the true regression line. Here, \[RSE = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2} = \sqrt{\dfrac{RSS}{n-2}}.\]
Multiple \(R\)-squared: This is one of our most important metrics for measuring regression model fit. \(R^2\) measures the strength of the linear relationship between our predictor variable and our response / target variable, and it is always between 0 and 1. A value near 0 indicates a regression that explains little of the variance in the response variable, while a value close to 1 indicates a regression that explains most of the observed variance. If we perform a multiple regression, we will find that the \(R^2\) increases as we add predictor variables. We can calculate \(R^2\) using the formula \[R^2 = 1 - \dfrac{RSS}{TSS} = 1 - \dfrac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}. \]
\(F\)-statistic: This is a good indicator of whether there is a relationship between \(y\) and \(x\). It tests whether at least one predictor's coefficient is non-zero, which becomes increasingly important as we add variables to the regression. We compute the \(F\)-statistic as \[F = \dfrac{(TSS - RSS)/p}{RSS/(n-p-1)}, \] where \(p\) is the number of predictor variables in the model. The further our \(F\)-statistic is above 1, the stronger the evidence of a relationship. The \(F\)-statistic is most relevant in a multiple regression model.
The key indicators of model fit are the \(R^2\) and the \(F\)-statistic. \(R^2\) measures how well a model fits the data: if \(R^2\) is close to 1, a large proportion of the variation in \(y\) can be explained by \(x\). The \(F\)-statistic measures the overall significance of the model; a large \(F\)-statistic corresponds to a statistically significant \(p\)-value (\(p < 0.05\)).
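To make these definitions concrete, here is a minimal Python sketch (the function name `fit_statistics` is illustrative) that computes RSS, TSS, RSE, \(R^2\) and the \(F\)-statistic from the observed and fitted values:

```python
import math

# Sketch of the fit statistics defined above, for a model with p predictor
# variables; for simple linear regression p = 1.
def fit_statistics(y, y_hat, p=1):
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rse = math.sqrt(rss / (n - p - 1))                     # residual standard error (n - 2 when p = 1)
    r2 = 1 - rss / tss                                     # R-squared
    f = ((tss - rss) / p) / (rss / (n - p - 1))            # F-statistic
    return rss, tss, rse, r2, f
```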
Recall the data for our example, where \(\bar{x} = 42\) and \(\bar{y} = 38\), and recall that we found \(\hat{y} = 9.74 + 0.67x\). The calculations for RSS and TSS are below.
In our example, we find RSS = 227.7255639 and TSS = 830.
| y | x | \(y-\bar{y}\) | \(x-\bar{x}\) | \((x-\bar{x})(y-\bar{y})\) | \((x-\bar{x})^2\) | \((y_i-\hat{y}_i)^2\) | \((y_i-\bar{y})^2\) |
|---|---|---|---|---|---|---|---|
| 20 | 30 | -18 | -12 | 216 | 144 | 98.50169 | 324 |
| 50 | 65 | 12 | 23 | 276 | 529 | 12.09246 | 144 |
| 35 | 40 | -3 | -2 | 6 | 4 | 2.73612 | 9 |
| 30 | 20 | -8 | -22 | 176 | 484 | 46.30147 | 64 |
| 55 | 55 | 17 | 13 | 221 | 169 | 68.09382 | 289 |
We want to calculate the RSE: \[RSE = \sqrt{\dfrac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2}= \sqrt{\dfrac{RSS}{n-2}} = \sqrt{\dfrac{227.7255639}{3}} = 8.71,\] the \(R^2\): \[R^2 = 1 - \dfrac{RSS}{TSS} = 1 - \dfrac{227.7255639}{830} = 1 - 0.274368 = 0.725632,\] and the \(F\)-statistic: \[F = \dfrac{(TSS - RSS)/p}{RSS/(n-p-1)} = \dfrac{602.2744361}{75.9085213} = 7.934214.\] We can conclude from the RSE that, on average, the model's predictions differ from the observed values by about 8.71 units. From the \(R^2\), we see that our model explains roughly 72.56% of the variability in \(y\). Finally, the \(F\)-statistic of 7.93 is well above 1, but with 1 and 3 degrees of freedom it falls short of the 5% critical value of about 10.13 (the corresponding \(p\)-value is roughly 0.07). With only five observations, the evidence that \(x\) predicts \(y\) is suggestive but not conclusive at the 5% level.
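These numbers can be reproduced with the `fit_simple_ols` and `fit_statistics` sketches above; the SciPy line for the \(p\)-value is optional and assumes SciPy is installed:

```python
# Checking the worked example with the earlier sketches (fit_simple_ols, fit_statistics).
x = [30, 65, 40, 20, 55]
y = [20, 50, 35, 30, 55]

b0, b1 = fit_simple_ols(x, y)
y_hat = [b0 + b1 * xi for xi in x]

rss, tss, rse, r2, f = fit_statistics(y, y_hat, p=1)
print(round(rss, 4), tss)                          # 227.7256 830.0
print(round(rse, 2), round(r2, 4), round(f, 2))    # 8.71 0.7256 7.93

# The p-value requires an F distribution with (1, n - 2) degrees of freedom, e.g.
# from scipy.stats import f as f_dist; f_dist.sf(f, 1, len(y) - 2)  -> about 0.067
```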
The assumptions that must be met for linear regression to be valid depend on the purposes for which it will be used.
Any application of linear regression makes two assumptions:

1. The data used in fitting the model are representative of the population.
2. The true underlying relationship between \(x\) and \(y\) is linear.

To estimate the standard error of the predictions \(\hat{y}_i\), you must also assume that:

3. The variance of the residuals is constant (homoscedastic, not heteroscedastic).

For linear regression to provide the best linear unbiased estimator of the true \(y\), assumptions 1 – 3 must be true, and you must also assume that:

4. The residuals are independent of one another.

To make probabilistic statements, such as hypothesis tests involving \(b_0\), \(b_1\) or \(R^2\), or to construct confidence intervals, assumptions 1 – 4 must be true, and you must also assume that:

5. The residuals are normally distributed.
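Assumptions 3 – 5 are usually checked informally from the residuals, while independence (assumption 4) is typically judged from how the data were collected. The following is a minimal, illustrative sketch (assuming matplotlib and SciPy are installed) applied to the residuals of the worked example:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Rough diagnostic sketch for assumptions 3-5 using the worked example above
# (with only five observations these checks are illustrative at best).
x = [30, 65, 40, 20, 55]
y = [20, 50, 35, 30, 55]
y_hat = [9.736842 + 0.672932 * xi for xi in x]        # fitted values from the example
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# Assumption 3 (constant variance): residuals vs. fitted values should show no
# funnel shape or systematic trend.
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Assumption 5 (normality): a Shapiro-Wilk test (or a Q-Q plot) of the residuals;
# a small p-value would suggest the residuals are not normally distributed.
print(stats.shapiro(residuals))
```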
As with any statistical method, there are common mistakes that are often made with regression.
Correlation is not causation. Regression assumes that \(x\) causes \(y\); it cannot prove that \(x\) causes \(y\).
Overlooking hidden variables. Hidden variables that are correlated with both \(x\) and \(y\) can obscure, or even distort, the dependence of \(y\) on \(x\).
Overlooking serial correlation. Strong serial correlation can cause you to seriously underestimate the uncertainties in your regression results (since successive measurements are not independent, the true number of degrees of freedom is much smaller than \(n\) suggests). In time-series data, it can also produce spurious but impressive-looking trends.
Overlooking uncertainty in \(x\). Linear regression assumes that \(x\) is known precisely, and only \(y\) is uncertain. If there are significant uncertainties in \(x\), the regression slope will be lower than it would have been otherwise. The regression line will still be an unbiased estimator of the value of \(y\) that is likely to accompany a given \(x\) measurement, but it will be a biased estimator of the \(y\) values that would arise if \(x\) could be controlled precisely.
Helsel, D. R. and Hirsch, R. M. (1992). Statistical Methods in Water Resources. Elsevier.
Kirchner, J. (1996, 2001). Data Analysis Toolkit #10: Simple linear regression. http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory, 28, 128–137.
Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth edition. Springer.