Least-squares regression traces the mean of \(Y\) conditional on \(X\); that is, we are interested in understanding how the mean of \(Y\) varies as the values of the predictors change.


Why analyse residuals?

We study the residuals to check whether the model’s assumptions hold.

Model Assumptions that can be checked using the model’s residuals

  1. Normality of errors

    • QQ Plot

    • Shapiro-Wilk Test

  2. Constant Error Variance (Homoskedasticity)

    • Residuals Plot

     

  3. Linearity and Zero-conditional mean

    • Component-Plus-Residual Plots

Other Model Assumptions

  1. No collinearity in X’s

    • Correlation Plots

    • Variance Inflation factor

  2. Random / i.i.d sample

    • Study design: repeated measurements

     

  3. Correlation in the errors

    • Durbin-Watson test

Other Diagnostics

Effect of individual X’s on the model


Model

Given the linear model mod = lm(Price ~ Size + Lot, data=df)

Data Sample
Price ($)   Lot (sq. feet)   Size (sq. feet)
208,500     8,450            856
181,500     9,600            1,262
223,500     11,250           920
140,000     9,550            961
250,000     14,260           1,145
143,000     14,115           796
143,000 14,115 796
Summary Linear Regression Model
term          estimate   std.error   statistic   p.value
(Intercept)   35073.04   5223.19     6.71        0
Size          118.93     4.46        26.65       0
Lot           0.72       0.17        4.17        0

r.squared   adj.r.squared   sigma     statistic   p.value   df   df.residual   nobs
0.37        0.37            62872.2   436.2       0         2    1457          1460
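The tables above can be produced in R; the broom package is one way to obtain this tidy layout (an assumption, since the code used to generate them is not shown here):

    # Fit the model and summarise it (assumes df contains Price, Size and Lot)
    mod <- lm(Price ~ Size + Lot, data = df)
    summary(mod)        # coefficients, R-squared, F-statistic

    # Tidy tables in the layout shown above
    library(broom)
    tidy(mod)           # term, estimate, std.error, statistic, p.value
    glance(mod)         # r.squared, adj.r.squared, sigma, ...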


Studying the Variables

Figure: Box Plots

Figure: Pair Plots

Model Assumptions that can be checked using the model’s residuals


Normality of errors

The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central limit theorem ensures that, under very broad conditions, inference based on the least squares estimator is approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?

  • Although the validity of least-squares estimation is robust (the levels of tests and the coverage of confidence intervals are approximately correct in large samples even when the assumption of normality is violated), its efficiency is not: the least-squares estimator is the most efficient unbiased estimator only when the errors are normal. For some types of error distributions, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly.

  • Non-normal errors also compromise the interpretation of the least-squares fit. The fit is a conditional mean (of \(Y\) given the \(X\)s), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the response to produce a symmetric error distribution.

A multimodal error distribution suggests the omission of one or more discrete explanatory variables that divide the data naturally into groups. An examination of the distribution of the residuals may, therefore, motivate respecification of the model.

If your residuals are non-normal (not symmetric and bell-shaped), this is telling you that something is wrong with your model. Common causes are:

  • Outliers (unusual observations that distort the results). Report your results with and without the outliers: if there is no difference, the outliers are not distorting your conclusions/interpretations; if there is a difference, seek the advice of someone with domain-specific knowledge (for example, for medical data, a doctor who knows the field) who can tell you whether the unusual observation is valid or realistic.
  • The data are counts, in which case modelling them with an appropriate distribution using generalised linear models (covered in advanced predictive analytics) would be better.
  • A misspecified model: you may need extra terms (e.g. higher-order terms or interactions, also covered in advanced predictive analytics), or you may need to transform some of your variables (covered in week 10).

Non-normality generally arises as a consequence of a misspecified model. Although in theory, with a large sample, it should not be much of a concern for CIs, PIs and hypothesis tests, it is usually a symptom of a more serious problem with your model, so you should examine the residuals in detail to get to the bottom of it.

Consequence of non-normally distributed residuals (as per M. Carey)

  • We cannot trust the F-tests, confidence intervals (CIs) or prediction intervals (PIs) from this model.

  • Must modify the model somehow i.e. must make alternative appropriate assumptions.

QQ Plot

aka normal probability plot of the residuals

  • The QQ plot shows some deviation from the straight-line pattern: too many values deviate in the tails, indicating a distribution with heavier tails than a normal distribution.
Figure: QQ Plot
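A minimal sketch of how such a plot can be drawn in base R, assuming mod as defined above:

    # Normal Q-Q plot of the model residuals with a reference line
    qqnorm(resid(mod))
    qqline(resid(mod), col = "blue")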

Shapiro-Wilk Test
statistic p.value method
0.92 0 Shapiro-Wilk normality test


Shapiro-Wilk Test

  • \(H_0:\) the residuals are normally distributed.
    • The p-value of the Shapiro-Wilk test is < 0.05 (effectively 0), so we reject \(H_0\) and conclude that the residuals are not normally distributed.
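The test can be run directly on the residuals in base R (a sketch; note that shapiro.test() accepts at most 5,000 observations):

    # Shapiro-Wilk test of normality on the model residuals
    shapiro.test(resid(mod))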

Constant Error Variance \(Var(\epsilon_i)=\sigma^2\)

Consequence of non-constant error variance

  • Although the least-squares estimator is unbiased and consistent even when the error variance is not constant, the efficiency of the least-squares estimator is impaired, and the usual formulas for coefficient standard errors are inaccurate—the degree of the problem depending on the degree to which error variances differ, the sample size, and the configuration of the X-values in the regression.

Residual Plots

In the plot on the right, each point is one observation (a house): the prediction made by the model is on the x-axis, and the accuracy of that prediction is on the y-axis. The distance from the line at 0 is how far off the prediction was for that value.

Since…

Residual = Observed – Predicted

  • positive values for the residual (on the y-axis) mean the prediction was too low,

  • and negative values mean the prediction was too high; 0 means the guess was exactly correct.

Figure: Residuals Plot (left) and Residuals vs. Fitted (right)

left plot shows

  • constant variance i.e. constant band about the mean line
  • there are outliers
  • there is more variability above the zero mean line. Interpretation: a residual is the difference between the actual price and the predicted price (\(Y-\hat{Y}\)). If that difference is positive, the model underestimates the house’s price. In this case the model appears to underestimate house prices more often than it overestimates them.

right plot shows

  • This plot of residuals versus fits shows that the residual variance (vertical spread) increases as the fitted values (predicted values of sale price) increase. This violates the assumption of constant error variance.
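A sketch of the residuals vs. fitted plot in base R, assuming mod as above:

    # Residuals vs. fitted values; a constant band around 0 suggests constant variance
    plot(fitted(mod), resid(mod),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, col = "blue")

    # plot.lm provides the same diagnostic directly
    plot(mod, which = 1)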

Linearity

Consequence of non-linearity

  • Non-linearity results in biased/inconsistent estimates.

Component + Residual Plots

  • Use a component-plus-residual plot to examine if the relationship between the response and the predictor variables is linear or not.
Figure: Component + Residual Plots

  • the pink (unbroken) line is a spline fit that tracks the general trend of the data
  • we want the pink line to be similar to the blue (broken) line, which represents the fitted linear relationship
  • if they are similar, a linear relationship is valid because the general trend of the data is also linear
  • in the right plot we can see a non-linear relationship between Lot and Price: the trend of the data does not match the linear relationship represented by the blue (broken) line.
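Component-plus-residual plots can be drawn with the car package (a sketch, assuming car is installed):

    library(car)
    # One component-plus-residual (partial-residual) plot per predictor in mod
    crPlots(mod)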


Other Model Assumptions

Multicollinearity

Consequence of multicollinearity

  • Multicollinearity affects a model’s ability to explain but not to predict.

  • The coefficients do not have a reliable interpretation.

  • There is another estimator (ridge regression) that can improve on the least squares error in case of strong multi-collinearity.

There are two ways to check if the predictor variables are highly correlated with one another:

  • Correlation plot
  • Variance Inflation factor (VIF).

Correlation Plot

  • The correlation between Size and Lot is about 0.3, which is weak, so there is no multicollinearity concern.
Figure: Corr. Plot
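A sketch of how to check the predictor correlation; the corrplot package is one option for the plot (an assumption, since the package actually used is not named here):

    # Correlation between the two predictors
    cor(df$Size, df$Lot)

    # Optional visualisation of the correlation matrix
    library(corrplot)
    corrplot(cor(df[, c("Price", "Size", "Lot")]), method = "number")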

Variance Inflation factor (VIF)

  • Both VIFs are approximately 1, indicating no multicollinearity problem for a regression including both predictors.
  • The size of the lot and the size of the house are not strongly correlated in this instance.
  • i.e. they are not telling us the same information; each contributes different information that affects the price.
Variance Inflation Factor
predictor   VIF
Size        1.098521
Lot         1.098521
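The VIFs above can be computed with the car package (a sketch):

    library(car)
    vif(mod)   # values close to 1 indicate no multicollinearity problem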


Independence

Violations of this assumption are also described as non-independent, paired, dependent, or correlated observations.

Consequences of non-independence

  • results in biased estimates

  • Must modify the model somehow i.e. must make alternative appropriate assumptions.

  • Violations imply non-constant variance of the errors (heteroskedasticity).

  • Outliers from different distributions can cause inefficiency/bias.

Correlation in the errors


  • Serial (or auto) correlation in the errors (i.e., correlation between consecutive errors or errors separated by some other number of periods) means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly misspecified model.

  • With time series data, it is highly likely that the value of a variable observed in the current time period will be similar to its value in the previous period, or even the period before that, and so on.

  • Therefore when fitting a regression model to time series data, it is common to find autocorrelation in the residuals.

  • Autocorrelation does not only occur in time series data. Other causes include observations that are grouped in some way, or other underlying factors that cause observations to be dependent on one another.

Durbin-Watson test

Test for serial, or auto, correlation in the errors.

  • \(H_0\): correlation between the errors is zero (no auto-corr.)
  • \(H_A\): correlation between the errors is not equal to zero.

The Durbin-Watson statistic will always have a value between 0 and 4

  • A value of 2.0 means no autocorrelation was detected in the sample.
    • Values below 2 indicate positive autocorrelation.
    • Values above 2 indicate negative autocorrelation.
Summary Durbin-Watson Test
statistic p.value autocorrelation method alternative
1.96 0.44 0.02 Durbin-Watson Test two.sided
  • The p-value (0.44) is not < 0.05, so there is no evidence against \(H_0\).
  • Therefore we fail to reject \(H_0\): there is no evidence that the errors are auto-correlated.
  • The best estimate of the auto-correlation is 0.02, which is very small, so the auto-correlation between any two consecutive errors is very weak.
  • The observations can be treated as independent.
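A sketch of the test in R; car::durbinWatsonTest() is one implementation (lmtest::dwtest() is another):

    library(car)
    # Durbin-Watson test for serial correlation in the model residuals
    durbinWatsonTest(mod)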


Other Diagnostics

Added Variable Plots (aka partial-regression plot)

Added-variable plots examine the effect of a particular predictor variable (Size or Lot) on the response variable Price while holding all other predictor variables constant.

Note that in this case “all other predictor variables” will be just one of the two.

Figure: Added Variable Plots

  • left plot
    • What is the effect that Size has on the model?
    • the slope indicates that adding Size does have a significant impact on the model.
    • note that it would not be significant if the blue line was flat. In that case, no matter what the size, the price would be the same.
    • 3 observations exert substantial leverage on the Size coefficient.
    • the numbers labelling the outlying points are the house IDs
    • Y-axis: Price given all the other variables (in this case Lot)
    • X-axis: the horizontal variable in this added-variable plot is the residual from the regression of Size on Lot, so values far from 0 in this direction correspond to houses whose Size is unusually high or low given their Lot.
  • right plot
    • the interpretation is similar to the above, with the roles of Size and Lot swapped
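Added-variable plots are also available in the car package (a sketch):

    library(car)
    # One added-variable (partial-regression) plot per predictor;
    # by default a few of the most extreme observations are labelled with their row IDs
    avPlots(mod)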


Appendix

Model Selection

The model selection problem is often stated as a variable selection problem. We have a response \(y\) and a set of predictors \(x = \{x_1, \dots, x_m\}\), and we wish to divide \(x\) into two groups, \(x = (x_A, x_I)\), the active and inactive predictors, such that the distribution of \(y \mid x_A\) is the same as the distribution of \(y \mid (x_A, x_I)\), i.e. all the information about \(y\) is contained in the active predictors.

Procedure:

  • Fit a sequence of subset models which differ only in the elements of x that are used to define the regressors.
  • If we have \(m\) predictors, then there are \(2^m - 1\) possible subset models.
  • For example, if \(m = 10\), \(2^m - 1\) = 1,023, while if \(m = 20\), \(2^m - 1\) is slightly more than 1 million.
  • Select the subset model that optimizes some criterion of model quality.

Strategies

  • “backward”, in which we start with all the regressors in the model and continue to remove terms until removing another term makes the criterion of interest worse;
  • “forward”, in which we start with no regressors and continue to add terms until adding another term makes the criterion of interest worse.
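A sketch of both strategies using step() in base R, with AIC as the criterion (an assumption; any criterion of model quality could be used):

    # Backward: start from the full model and drop terms while AIC improves
    step(mod, direction = "backward")

    # Forward: start from the intercept-only model and add terms from mod's formula
    null_mod <- lm(Price ~ 1, data = df)
    step(null_mod, scope = formula(mod), direction = "forward")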

A Reminder about Transformations

You generally use a transformation (a) to make the relationship between the response and the predictor variable more linear, or (b) to make the residuals of a regression model approximately normally distributed. Does the transformation do either of these things?
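As an illustration (not part of the notes), a log transformation of a right-skewed response such as Price is a common first attempt; check it against the two criteria above:

    # Refit with a log-transformed response and re-examine the residuals
    mod_log <- lm(log(Price) ~ Size + Lot, data = df)
    qqnorm(resid(mod_log)); qqline(resid(mod_log))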

From Continuous to Categorical

“Do these findings suggest it might be better to bin the data for ‘Fare’ and make it a categorical variable and then use it in the model (and see what happens)?”

This is a good idea if you must use a linear regression model and you cannot include the continuous predictor variable without violating the model’s assumptions: creating a categorical (binned) version of the variable lets you keep that information in the model without violating the assumptions.
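A sketch of binning with cut(); the variable Fare and the quartile cut points are illustrative only, not part of the house-price data above:

    # Bin a continuous variable into quartile-based categories (hypothetical example)
    df$FareBand <- cut(df$Fare,
                       breaks = quantile(df$Fare, probs = seq(0, 1, 0.25), na.rm = TRUE),
                       include.lowest = TRUE)
    # FareBand can then be used as a categorical predictor in lm()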

Misc.