Least-squares regression traces the mean of \(Y\) conditional on \(X\); that is, we are interested in understanding how the mean of \(Y\) varies as the values of the predictors change.


Why analyse residuals?

We study the residuals to check whether the model’s assumptions hold.

Model Assumptions that can be checked using the model’s residuals

  1. Normality of errors

    • QQ Plot

    • Shapiro-Wilk Test

  2. Constant Error Variance (Homoskedasticity)

    • Residuals Plot

     

  3. Linearity and Zero-conditional mean

    • Component-Plus-Residual Plots

Other Model Assumptions

  1. No collinearity in X’s

    • Correlation Plots

    • Variance Inflation factor

  2. Random / i.i.d sample

    • Study design: repeated measurements

     

  3. Correlation in the errors

    • Durbin-Watson test

Other Diagnostics

Effect of individual X’s on the model


Model

Given the linear model mod = lm(Price ~ Size + Lot, data=df)

Data Sample
Price ($)   Lot (sq. feet)   Size (sq. feet)
208,500     8,450            856
181,500     9,600            1,262
223,500     11,250           920
140,000     9,550            961
250,000     14,260           1,145
143,000     14,115           796
143,000 14,115 796
Summary Linear Regression Model
term          estimate   std.error   statistic   p.value
(Intercept)   35073.04   5223.19     6.71        0
Size          118.93     4.46        26.65       0
Lot           0.72       0.17        4.17        0

r.squared   adj.r.squared   sigma     statistic   p.value   df   df.residual   nobs
0.37        0.37            62872.2   436.2       0         2    1457          1460
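The tables above can be produced in R; the broom package is one way to obtain this tidy layout (an assumption, since the code used to generate them is not shown here):

    # Fit the model and summarise it (assumes df contains Price, Size and Lot)
    mod <- lm(Price ~ Size + Lot, data = df)
    summary(mod)        # coefficients, R-squared, F-statistic

    # Tidy tables in the layout shown above
    library(broom)
    tidy(mod)           # term, estimate, std.error, statistic, p.value
    glance(mod)         # r.squared, adj.r.squared, sigma, ...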


Studying the Variables

Figure: Box Plots

Figure: Pair Plots

Model Assumptions that can be checked using the model’s residuals


Normality of errors

The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central limit theorem ensures that, under very broad conditions, inference based on the least squares estimator is approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?

  • Although the validity of least-squares estimation is robust (the levels of tests and the coverage of confidence intervals are approximately correct in large samples even when the assumption of normality is violated), its efficiency is not: the least-squares estimator is the most efficient unbiased estimator only when the errors are normal. For some types of error distributions, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly.

  • Non-normal errors also compromise the interpretation of the least-squares fit. The fit is a conditional mean (of \(Y\) given the \(X\)s), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the response to produce a symmetric error distribution.

A multimodal error distribution suggests the omission of one or more discrete explanatory variables that divide the data naturally into groups. An examination of the distribution of the residuals may, therefore, motivate respecification of the model.

If your residuals are non-normal (not symmetric and bell-shaped), this is telling you that something is wrong with your model. Common causes are:

  • Outliers (unusual observations that distort the results). Report your results with and without the outliers: if there is no difference, the outliers are not distorting your conclusions/interpretations; if there is a difference, seek the advice of someone with domain-specific knowledge (for example, for medical data, a doctor who knows the field) who can tell you whether the unusual observation is valid or realistic.
  • The data are counts, in which case modelling them with an appropriate distribution using generalised linear models (covered in advanced predictive analytics) would be better.
  • A misspecified model: you may need extra terms (e.g. higher-order terms or interactions, also covered in advanced predictive analytics), or you may need to transform some of your variables (covered in week 10).

Non-normality generally arises as a consequence of a misspecified model. Although in theory, with a large sample, it should not be much of a concern for CIs, PIs and hypothesis tests, it is usually a symptom of a more serious problem with your model, so you should examine the residuals in detail to get to the bottom of it.

Consequence of non-normally distributed residuals (as per M. Carey)

  • We cannot trust the F-tests, confidence intervals (CIs) or prediction intervals (PIs) from this model.

  • Must modify the model somehow i.e. must make alternative appropriate assumptions.

QQ Plot

aka normal probability plot of the residuals

  • The QQ plot shows some deviation from the straight-line pattern: too many values deviate in the tails, indicating a distribution with heavier tails than a normal distribution.
Figure: QQ Plot
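A minimal sketch of how such a plot can be drawn in base R, assuming mod as defined above:

    # Normal Q-Q plot of the model residuals with a reference line
    qqnorm(resid(mod))
    qqline(resid(mod), col = "blue")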

Shapiro-Wilk Test
statistic p.value method
0.92 0 Shapiro-Wilk normality test


Shapiro-Wilk Test

  • \(H_0:\) the residuals are normally distributed.
    • The p-value of the Shapiro-Wilk test is < 0.05 (effectively 0), so we reject \(H_0\) and conclude that the residuals are not normally distributed.
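The test can be run directly on the residuals in base R (a sketch; note that shapiro.test() accepts at most 5,000 observations):

    # Shapiro-Wilk test of normality on the model residuals
    shapiro.test(resid(mod))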

Constant Error Variance \(Var(\epsilon_i)=\sigma^2\)

Consequence of non-constant error variance

  • Although the least-squares estimator is unbiased and consistent even when the error variance is not constant, the efficiency of the least-squares estimator is impaired, and the usual formulas for coefficient standard errors are inaccurate—the degree of the problem depending on the degree to which error variances differ, the sample size, and the configuration of the X-values in the regression.

Residual Plots

In the plot on the right, each point is one observation (a house): the prediction made by the model is on the x-axis, and the accuracy of that prediction is on the y-axis. The distance from the line at 0 is how far off the prediction was for that value.

Since…

Residual = Observed – Predicted

  • positive values for the residual (on the y-axis) mean the prediction was too low,

  • and negative values mean the prediction was too high; 0 means the guess was exactly correct.

Figure: Residuals Plot (left) and Residuals vs. Fitted (right)

left plot shows

  • constant variance i.e. constant band about the mean line
  • there are outliers
  • there is more variability above the zero mean line. Interpretation: a residual is the difference between the actual price and the predicted price (\(Y-\hat{Y}\)). If that difference is positive, the model underestimates the house’s price. In this case the model appears to underestimate house prices more often than it overestimates them.

right plot shows

  • This plot of residuals versus fits shows that the residual variance (vertical spread) increases as the fitted values (predicted values of sale price) increase. This violates the assumption of constant error variance.
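A sketch of the residuals vs. fitted plot in base R, assuming mod as above:

    # Residuals vs. fitted values; a constant band around 0 suggests constant variance
    plot(fitted(mod), resid(mod),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, col = "blue")

    # plot.lm provides the same diagnostic directly
    plot(mod, which = 1)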

Linearity

Consequence of non-linearity

  • Non-linearity results in biased/inconsistent estimates.

Component + Residual Plots

  • Use a component-plus-residual plot to examine if the relationship between the response and the predictor variables is linear or not.
Figure: Component + Residual Plots

  • the pink (unbroken) line is a spline fit that tracks the general trend of the data
  • we want the pink line to be similar to the blue (broken) line, which represents the fitted linear relationship
  • if they are similar, a linear relationship is valid because the general trend of the data is also linear
  • in the right plot we can see a non-linear relationship between Lot and Price: the trend of the data does not match the linear relationship represented by the blue (broken) line.
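Component-plus-residual plots can be drawn with the car package (a sketch, assuming car is installed):

    library(car)
    # One component-plus-residual (partial-residual) plot per predictor in mod
    crPlots(mod)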


Other Model Assumptions

Multicollinearity

Consequence of multicollinearity

  • Multicollinearity affects a model’s ability to explain but not to predict.

  • The coefficients do not have a reliable interpretation.

  • There is another estimator (ridge regression) that can improve on the least squares error in case of strong multi-collinearity.

There are two ways to check if the predictor variables are highly correlated with one another:

  • Correlation plot
  • Variance Inflation factor (VIF).

Correlation Plot

  • The correlation between Size and Lot is about 0.3, which is weak, so there is no multicollinearity concern.
Figure: Corr. Plot
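A sketch of how to check the predictor correlation; the corrplot package is one option for the plot (an assumption, since the package actually used is not named here):

    # Correlation between the two predictors
    cor(df$Size, df$Lot)

    # Optional visualisation of the correlation matrix
    library(corrplot)
    corrplot(cor(df[, c("Price", "Size", "Lot")]), method = "number")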

Variance Inflation factor (VIF)

  • Both VIFs are approximately 1, indicating no multicollinearity problem for a regression including both predictors.
  • The size of the lot and the size of the house are not strongly correlated in this instance.
  • i.e. they are not telling us the same information; each contributes different information that affects the price.
Variance Inflation Factor
predictor   VIF
Size        1.098521
Lot         1.098521
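The VIFs above can be computed with the car package (a sketch):

    library(car)
    vif(mod)   # values close to 1 indicate no multicollinearity problem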


Independence

Violations of this assumption are also described as non-independent, paired, dependent, or correlated observations.

Consequences of non-independence

  • results in biased estimates

  • Must modify the model somehow i.e. must make alternative appropriate assumptions.

  • Violations imply non-constant variance of the errors (heteroskedasticity).

  • Outliers from different distributions can cause inefficiency/bias.

Correlation in the errors


  • Serial (or auto) correlation in the errors (i.e., correlation between consecutive errors or errors separated by some other number of periods) means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly misspecified model.

  • With time series data, it is highly likely that the value of a variable observed in the current time period will be similar to its value in the previous period, or even the period before that, and so on.

  • Therefore when fitting a regression model to time series data, it is common to find autocorrelation in the residuals.

  • Autocorrelation does not only occur in time series data. Other causes include observations that are grouped in some way, or other underlying factors that cause observations to be dependent on one another.

Durbin-Watson test

Test for serial, or auto, correlation in the errors.

  • \(H_0\): correlation between the errors is zero (no auto-corr.)
  • \(H_A\): correlation between the errors is not equal to zero.

The Durbin-Watson statistic will always have a value between 0 and 4

  • A value of 2.0 means no autocorrelation was detected in the sample.
    • Values below 2 indicate positive autocorrelation.
    • Values above 2 indicate negative autocorrelation.
Summary Durbin-Watson Test
statistic p.value autocorrelation method alternative
1.96 0.44 0.02 Durbin-Watson Test two.sided
  • The p-value (0.44) is not < 0.05, so there is no evidence against \(H_0\).
  • Therefore we fail to reject \(H_0\): there is no evidence that the errors are auto-correlated.
  • The best estimate of the auto-correlation is 0.02, which is very small, so the auto-correlation between any two consecutive errors is very weak.
  • The observations can be treated as independent.
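A sketch of the test in R; car::durbinWatsonTest() is one implementation (lmtest::dwtest() is another):

    library(car)
    # Durbin-Watson test for serial correlation in the model residuals
    durbinWatsonTest(mod)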


Other Diagnostics

Added Variable Plots (aka partial-regression plot)

Added-variable plots examine the effect of a particular predictor variable (Size or Lot) on the response variable Price while holding all other predictor variables constant.

Note that in this case “all other predictor variables” will be just one of the two.

Figure: Added Variable Plots

  • left plot
    • What is the effect that Size has on the model?
    • the slope indicates that adding Size does have a significant impact on the model.
    • note that it would not be significant if the blue line was flat. In that case, no matter what the size, the price would be the same.
    • 3 observations exert substantial leverage on the Size coefficient.
    • the numbers labelling the outlying points are the house IDs
    • Y-axis: Price given all the other variables (in this case Lot)
    • X-axis: the horizontal variable in this added-variable plot is the residual from the regression of Size on Lot, so values far from 0 in this direction correspond to houses whose Size is unusually high or low given their Lot.
  • right plot
    • the interpretation is similar to the above, with the roles of Size and Lot swapped
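Added-variable plots are also available in the car package (a sketch):

    library(car)
    # One added-variable (partial-regression) plot per predictor;
    # by default a few of the most extreme observations are labelled with their row IDs
    avPlots(mod)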


Appendix

Model Selection

The model selection problem is often stated as a variable selection problem. We have a response \(y\) and a set of predictors \(x = \{x_1, \dots, x_m\}\), and we wish to divide \(x\) into two groups, \(x = (x_A, x_I)\), the active and inactive predictors, such that the distribution of \(y \mid x_A\) is the same as the distribution of \(y \mid (x_A, x_I)\), i.e. all the information about \(y\) is contained in the active predictors.

Procedure:

  • Fit a sequence of subset models which differ only in the elements of x that are used to define the regressors.
  • If we have \(m\) predictors, then there are \(2^m - 1\) possible subset models.
  • For example, if \(m = 10\), \(2^m - 1\) = 1,023, while if \(m = 20\), \(2^m - 1\) is slightly more than 1 million.
  • Select the subset model that optimizes some criterion of model quality.

Strategies

  • “backward”, in which we start with all the regressors in the model and continue to remove terms until removing another term makes the criterion of interest worse;
  • “forward”, in which we start with no regressors and continue to add terms until adding another term makes the criterion of interest worse.
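A sketch of both strategies using step() in base R, with AIC as the criterion (an assumption; any criterion of model quality could be used):

    # Backward: start from the full model and drop terms while AIC improves
    step(mod, direction = "backward")

    # Forward: start from the intercept-only model and add terms from mod's formula
    null_mod <- lm(Price ~ 1, data = df)
    step(null_mod, scope = formula(mod), direction = "forward")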

A Reminder about Transformations

You generally use a transformation (a) to make the relationship between the response and the predictor variable more linear, or (b) to make the residuals of a regression model approximately normally distributed. Does the transformation do either of these things?
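As an illustration (not part of the notes), a log transformation of a right-skewed response such as Price is a common first attempt; check it against the two criteria above:

    # Refit with a log-transformed response and re-examine the residuals
    mod_log <- lm(log(Price) ~ Size + Lot, data = df)
    qqnorm(resid(mod_log)); qqline(resid(mod_log))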

From Continuous to Categorical

“Do these findings suggest it might be better to bin the data for ‘Fare’ and make it a categorical variable and then use it in the model (and see what happens)?”

This is a good idea if you must use a linear regression model and you cannot include the continuous predictor variable without violating the model’s assumptions: creating a categorical (binned) version of the variable lets you keep that information in the model without violating the assumptions.
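A sketch of binning with cut(); the variable Fare and the quartile cut points are illustrative only, not part of the house-price data above:

    # Bin a continuous variable into quartile-based categories (hypothetical example)
    df$FareBand <- cut(df$Fare,
                       breaks = quantile(df$Fare, probs = seq(0, 1, 0.25), na.rm = TRUE),
                       include.lowest = TRUE)
    # FareBand can then be used as a categorical predictor in lm()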

Misc.