Least-squares regression traces the mean of \(Y\) conditional on \(X\); that is, we are interested in how the mean of \(Y\) varies as the values of the predictors change.
We study the residuals to check the following:
- Normality of errors
  - QQ plot
  - Shapiro-Wilk test
- Constant error variance (homoskedasticity)
- Linearity and zero conditional mean
- No collinearity in the X’s
  - Correlation plots
  - Variance inflation factor
- Random / i.i.d. sample
- Correlation in the errors
- Effect of individual X’s on the model
Given the linear model `mod = lm(Price ~ Size + Lot, data = df)`:
| Price ($) | Lot (sq.feet) | Size (sq.feet) |
|---|---|---|
| 208,500 | 8,450 | 856 |
| 181,500 | 9,600 | 1,262 |
| 223,500 | 11,250 | 920 |
| 140,000 | 9,550 | 961 |
| 250,000 | 14,260 | 1,145 |
| 143,000 | 14,115 | 796 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 35073.04 | 5223.19 | 6.71 | 0 |
| Size | 118.93 | 4.46 | 26.65 | 0 |
| Lot | 0.72 | 0.17 | 4.17 | 0 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
|---|---|---|---|---|---|---|---|
| 0.37 | 0.37 | 62872.2 | 436.2 | 0 | 2 | 1457 | 1460 |
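As a rough worked example of reading these tables, the sketch below plugs a hypothetical house into the fitted equation using the rounded estimates above. The house (1,000 sq. ft of living space on a 10,000 sq. ft lot) is an illustrative assumption, not a row of the data. The sketch also recomputes the t statistic for Size as estimate / std.error and checks the residual degrees of freedom.

```r
# Rounded estimates copied from the coefficient table
b0     <- 35073.04
b_size <- 118.93
b_lot  <- 0.72

# Hypothetical house (illustrative values only)
size <- 1000    # sq. ft
lot  <- 10000   # sq. ft

# Predicted mean price: intercept + slope * Size + slope * Lot
pred <- b0 + b_size * size + b_lot * lot
pred            # roughly 161,203 dollars

# t statistic = estimate / std.error; about 26.7 from the rounded values,
# close to the tabulated 26.65 (presumably computed from unrounded inputs)
t_size <- b_size / 4.46

# Residual degrees of freedom = observations - estimated coefficients
df_resid <- 1460 - 3
```

The r.squared value of 0.37 says that Size and Lot together account for about 37% of the variation in Price.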
Box Plots
Pair Plots
The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central limit theorem ensures that, under very broad conditions, inference based on the least squares estimator is approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?
Although the validity of least-squares estimation is robust (the levels of tests and the coverage of confidence intervals are approximately correct in large samples even when the assumption of normality is violated), the efficiency of least squares is not robust: statistical theory assures us that the least-squares estimator is the most efficient unbiased estimator only when the errors are normal. For some types of error distributions, however, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly.
A highly skewed error distribution compromises the interpretation of the least-squares fit. This fit is a conditional mean (of Y given the Xs), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the response to produce a symmetric error distribution.
A multimodal error distribution suggests the omission of one or more discrete explanatory variables that divide the data naturally into groups. An examination of the distribution of the residuals may, therefore, motivate respecification of the model.
If your residuals are non-normal (not symmetric and bell-shaped), this tells you that something is wrong with your model. One possibility is that you have outliers (unusual observations that are distorting your results). In that case I would report my results with and without the outliers. If there is no difference, the outliers are not making much of an impact and you can be confident that they are not distorting your conclusions or interpretations. If there is a difference in the reported results, you need to seek the advice of someone with domain-specific knowledge who can tell you whether the unusual observation is valid or realistic (for example, if it is medical data, a doctor who knows the field). Another possibility is that your data are counts, in which case modelling them with the appropriate distribution using generalised linear models (covered in advanced predictive analytics) would be better. A third possibility is that you have misspecified the model: you need extra terms (e.g. higher-order terms or interactions, also covered in advanced predictive analytics), or you need to transform some of your variables (I’ll cover this in week 10). Non-normality generally arises as a consequence of a misspecified model. While in theory a large sample means it should not be much of a concern for the CIs, PIs and hypothesis tests, it is usually a symptom of a more serious problem with your model, so you should examine the residuals in detail to get to the bottom of it.
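The "report with and without the outliers" advice can be sketched as follows. This is a minimal illustration on simulated data with one planted outlier; the variable names and the |studentized residual| > 3 cut-off are assumptions made for the example, not part of the course material.

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)
y[1] <- y[1] + 25                      # plant one gross outlier
dat <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = dat)

# Flag observations with large studentized residuals
# (|t| > 3 is a common rule of thumb, used here as an assumption)
flagged <- which(abs(rstudent(fit_all)) > 3)

# Refit without the flagged rows (setdiff is safe even if nothing is flagged)
keep <- setdiff(seq_len(n), flagged)
fit_trim <- lm(y ~ x, data = dat[keep, ])

# Coefficients side by side: with vs. without the flagged observations
round(cbind(with_outliers    = coef(fit_all),
            without_outliers = coef(fit_trim)), 3)
```

If the two columns agree closely, the flagged points are not driving the conclusions; if they differ, that is the cue to consult someone with domain knowledge.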
Consequence of non-normally distributed residuals (as per M. Carey)
We cannot trust the F-test, CIs and PIs for this model.
We must modify the model somehow, i.e. make alternative, appropriate assumptions.
QQ Plot
Also known as a normal probability plot of the residuals.
QQ Plot
| statistic | p.value | method |
|---|---|---|
| 0.92 | 0 | Shapiro-Wilk normality test |
Shapiro-Wilk Test
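A minimal sketch of both normality checks, using base R only. The data are simulated (one model with normal errors, one with right-skewed errors) because the housing data are not reproduced here, and all object names are illustrative assumptions.

```r
set.seed(42)
n <- 500
x <- runif(n)

# Two simulated responses: normal errors vs. right-skewed errors
y_norm <- 1 + 2 * x + rnorm(n)
y_skew <- 1 + 2 * x + (rexp(n) - 1)    # centred exponential errors

mod_norm <- lm(y_norm ~ x)
mod_skew <- lm(y_skew ~ x)

# Normal QQ plots of the studentized residuals, side by side
op <- par(mfrow = c(1, 2))
qqnorm(rstudent(mod_norm), main = "Normal errors")
qqline(rstudent(mod_norm))
qqnorm(rstudent(mod_skew), main = "Skewed errors")
qqline(rstudent(mod_skew))
par(op)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw_norm <- shapiro.test(resid(mod_norm))
sw_skew <- shapiro.test(resid(mod_skew))
sw_skew$p.value   # very small for the skewed errors
```

Points that bend away from the reference line in the QQ plot, together with a small p-value, indicate non-normal residuals. In the table above, W = 0.92 with a p-value that rounds to 0, so normality is rejected for the housing model.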
Consequence of non-constant error variance
Residual Plots
In the plot on the right, each point is one observation (here, one house), with the model’s prediction on the x-axis and the residual (the prediction error) on the y-axis. The distance from the line at 0 shows how far off the prediction was for that observation.
Since
Residual = Observed - Predicted
positive residuals (on the y-axis) mean the prediction was too low, negative residuals mean the prediction was too high, and 0 means the prediction was exactly correct.
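The residual definition and a residuals vs. fitted plot can be sketched as below. The data frame is a simulated stand-in with the same column names as the housing data (the coefficients and noise level are assumptions chosen only to make the example run).

```r
set.seed(1)
n <- 200
df_sim <- data.frame(Size = runif(n, 500, 2500),
                     Lot  = runif(n, 5000, 15000))
df_sim$Price <- 35000 + 120 * df_sim$Size + 0.7 * df_sim$Lot +
  rnorm(n, sd = 60000)

mod_sim <- lm(Price ~ Size + Lot, data = df_sim)

# Residual = Observed - Predicted
res <- df_sim$Price - fitted(mod_sim)

# Residuals vs. fitted values, with a reference line at 0
plot(fitted(mod_sim), res,
     xlab = "Fitted Price", ylab = "Residual")
abline(h = 0, lty = 2)
```

Under constant error variance the points should form a band of roughly even width around the dashed line; a funnel shape suggests heteroskedasticity and a curved pattern suggests non-linearity.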
Residuals Plot (left) and Residuals vs. Fitted (right)
left plot shows
right plot shows
Consequence of non-linearity
Component + Residual Plots
Component + Residual Plots
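A component + residual (partial residual) plot for one predictor can be built by hand, as in this sketch: add the fitted component for that predictor back onto the residuals and plot the result against the predictor. The data are simulated stand-ins with assumed coefficients. If the `car` package is installed, `car::crPlots(mod_sim)` draws a comparable display, though whether the original figure was produced that way is an assumption.

```r
set.seed(1)
n <- 200
df_sim <- data.frame(Size = runif(n, 500, 2500),
                     Lot  = runif(n, 5000, 15000))
df_sim$Price <- 35000 + 120 * df_sim$Size + 0.7 * df_sim$Lot +
  rnorm(n, sd = 60000)
mod_sim <- lm(Price ~ Size + Lot, data = df_sim)

# Partial residual for Size: residual + (Size coefficient * Size)
pr_size <- resid(mod_sim) + coef(mod_sim)["Size"] * df_sim$Size

plot(df_sim$Size, pr_size,
     xlab = "Size", ylab = "Component + residual (Price)")
pr_fit <- lm(pr_size ~ df_sim$Size)
abline(pr_fit, lty = 2)                           # linear fit
lines(lowess(df_sim$Size, pr_size), col = "red")  # smooth, to reveal curvature
```

The dashed line has the same slope as the Size coefficient in the full model, so a smooth curve that departs from it is a sign that the effect of Size is not linear.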
Consequence of multicollinearity
Multicollinearity affects a model’s ability to explain but not to predict.
The coefficients do not have a reliable interpretation.
There is another estimator (ridge regression) that can improve on least squares, in terms of mean squared error, when multicollinearity is strong.
There are two ways to check if the predictor variables are highly correlated with one another:
Correlation Plot
Corr. Plot
Variance Inflation factor (VIF)
| predictor | VIF |
|---|---|
| Size | 1.098521 |
| Lot | 1.098521 |
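The VIF for predictor \(j\) is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data; `vif_manual` is a hypothetical helper written for this example, and `car::vif()` should report the same values if that package is installed.

```r
set.seed(1)
n <- 300
Size  <- runif(n, 500, 2500)
Lot   <- 5000 + 2 * Size + rnorm(n, sd = 3000)   # mildly correlated with Size
Price <- 35000 + 120 * Size + 0.7 * Lot + rnorm(n, sd = 60000)
df_sim  <- data.frame(Price, Size, Lot)
mod_sim <- lm(Price ~ Size + Lot, data = df_sim)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2)
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(mod_sim)
vifs
```

With only two predictors both VIFs are identical and equal \(1 / (1 - r^2)\), where \(r\) is the correlation between them. That is why Size and Lot share the value 1.098521 in the table, which corresponds to \(r^2 \approx 0.09\), a weak correlation of about 0.30 in absolute value.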
aka non-independent, paired, dependent, or correlated observations
Consequences of non-independence
Results in biased estimates.
We must modify the model somehow, i.e. make alternative, appropriate assumptions.
Violations imply non-constant variance of the errors (heteroskedasticity).
Outliers from different distributions can cause inefficiency or bias.
Serial (or auto) correlation in the errors (i.e., correlation between consecutive errors or errors separated by some other number of periods) means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly misspecified model.
With time series data, it is highly likely that the value of a variable observed in the current time period will be similar to its value in the previous period, or even the period before that, and so on.
Therefore when fitting a regression model to time series data, it is common to find autocorrelation in the residuals.
Autocorrelation does not only occur in time series data. It can also arise when observations are grouped in some way, or when other underlying factors cause observations to depend on one another.
Durbin-Watson test
Test for serial, or auto, correlation in the errors.
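The Durbin-Watson statistic always takes a value between 0 and 4. A value near 2 indicates no first-order autocorrelation; values below 2 point to positive autocorrelation and values above 2 to negative autocorrelation. Here the statistic of 1.96 (p = 0.44) gives no evidence of autocorrelation in the errors.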
| statistic | p.value | autocorrelation | method | alternative |
|---|---|---|---|---|
| 1.96 | 0.44 | 0.02 | Durbin-Watson Test | two.sided |
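The statistic itself is simple to compute: the sum of squared differences between consecutive residuals divided by the sum of squared residuals, which is approximately \(2(1 - r_1)\) for lag-1 autocorrelation \(r_1\). The sketch below uses simulated data, and `dw_stat` and `lag1_cor` are hypothetical helpers for illustration. The column names in the table above look like output from `car::durbinWatsonTest`, but that is an assumption.

```r
# Hypothetical helper: Durbin-Watson statistic from a fitted model
dw_stat <- function(model) {
  e <- resid(model)
  sum(diff(e)^2) / sum(e^2)
}

# Hypothetical helper: lag-1 autocorrelation of the residuals
lag1_cor <- function(model) {
  e <- resid(model)
  m <- length(e)
  sum(e[-1] * e[-m]) / sum(e^2)
}

set.seed(1)
n <- 500
x <- runif(n)
e_iid <- rnorm(n)                                       # independent errors
e_ar1 <- as.numeric(arima.sim(list(ar = 0.8), n = n))   # AR(1) errors

mod_iid <- lm(I(1 + 2 * x + e_iid) ~ x)
mod_ar1 <- lm(I(1 + 2 * x + e_ar1) ~ x)

dw_iid <- dw_stat(mod_iid)   # close to 2
dw_ar1 <- dw_stat(mod_ar1)   # well below 2: positive autocorrelation
c(iid = dw_iid, ar1 = dw_ar1)
```

The same approximation reproduces the table: with an estimated autocorrelation of 0.02, \(2(1 - 0.02) = 1.96\).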
Added Variable Plots (aka partial-regression plot)
Examine the effect of a particular predictor variable (Size or Lot) on the response variable Price while holding all other predictor variables constant.
Note that in this case “all other predictor variables” will be just one of the two.
Added Variable Plots
In the added-variable plot for Size, the horizontal axis shows the residuals from the regression of Size on Lot, and thus values far from 0 in this direction are for houses whose Size is unusually high or low given their Lot.
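An added-variable plot for Size can be constructed by hand, as sketched here on simulated stand-in data (the coefficients and noise level are assumptions): regress Price on Lot, regress Size on Lot, and plot one set of residuals against the other. `car::avPlots(mod_sim)` gives a packaged version if `car` is available.

```r
set.seed(1)
n <- 200
df_sim <- data.frame(Size = runif(n, 500, 2500),
                     Lot  = runif(n, 5000, 15000))
df_sim$Price <- 35000 + 120 * df_sim$Size + 0.7 * df_sim$Lot +
  rnorm(n, sd = 60000)
mod_sim <- lm(Price ~ Size + Lot, data = df_sim)

# Remove the part of Price and of Size that Lot explains
e_price <- resid(lm(Price ~ Lot, data = df_sim))   # vertical axis
e_size  <- resid(lm(Size ~ Lot, data = df_sim))    # horizontal axis

plot(e_size, e_price,
     xlab = "Size | Lot", ylab = "Price | Lot")
av_fit <- lm(e_price ~ e_size)
abline(av_fit)

# The slope of this line equals the Size coefficient in the full model
c(av_slope = unname(coef(av_fit)[2]),
  full_model = unname(coef(mod_sim)["Size"]))
```

Because the slope matches the multiple-regression coefficient, points far from the rest in the horizontal direction show which houses have the most leverage on the Size estimate.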
The model selection problem is often stated as a variable selection problem. We have a response \(y\) and a set of predictors \(x = \{x_1, \ldots, x_m\}\), and we wish to divide \(x\) into two groups, \(x = (x_A, x_I)\), the active and inactive predictors, such that the distribution of \(y \mid x_A\) is the same as the distribution of \(y \mid (x_A, x_I)\); that is, all the information about \(y\) is in the active predictors.
Procedure:
Strategies
You generally use a transformation to (a) make the relationship between the response and the predictor variable more linear or (b) make the residuals of a regression model approximately normally distributed. Does the transformation do either of these things?
“Do these findings suggest it might be better to bin the data for "Fare" and make it a categorical variable and then use it in the model (and see what happens)?”
This is a good idea if you must use a linear regression model. If you cannot get the continuous predictor variable into the model without violating the model’s assumptions, creating a categorical variable lets you still use the information without violating those assumptions.
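Binning a continuous predictor can be sketched with `cut()`. The example below uses a simulated, right-skewed variable named `fare` and a made-up response; the quartile break points are an assumption chosen for illustration, not a recommendation from the course.

```r
set.seed(1)
n <- 300
fare <- rexp(n, rate = 1 / 30)           # hypothetical skewed predictor
y    <- 10 + 5 * log1p(fare) + rnorm(n)  # hypothetical non-linear response

# Cut the predictor into quartile bins and treat it as a factor
breaks   <- quantile(fare, probs = seq(0, 1, by = 0.25))
fare_bin <- cut(fare, breaks = breaks, include.lowest = TRUE)
table(fare_bin)

# The model now estimates one mean shift per bin, relative to the first bin
mod_bin <- lm(y ~ fare_bin)
coef(mod_bin)
```

Binning discards the within-bin variation in the predictor, so it is worth comparing the residual plots of the binned model against those of a transformed continuous version before settling on it.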
plot(df$Size, rstudent(mod))