
Musings on the Assumptions of the Linear Regression Model

CLT in the context of residuals

“When the sample size is large, the central limit theorem implies that violations of normality of the residuals have only limited effect on the accuracy of the estimates.”¹

we take repeated samples from the population and make predictions using some linear model
we note that for each sample and corresponding predictions, the residuals are likely to be non-normally distributed, i.e. sometimes the predictions are accurate and often they are not (hence the residuals)
for each sample we compute the mean residual of the model (i.e. the mean error made by the model for a particular sample)
the CLT tells us that the mean residual will be approximately normally distributed across samples (e.g. over 100 sample predictions)
the fact that the mean residual will be approximately normally distributed allows us to affirm that the parameters are likely to be accurate (assuming no other violations, for argument's sake)

In other words, is this the same as saying that, under the above scenario, no matter what the distribution of the residuals in a particular prediction sample, the expected value of the betas will be equal to the true population betas?
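
The scenario above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the reference book's method: the true line (alpha = 1, beta = 2), the shifted exponential errors, the sample sizes and the helper names `ols_fit` and `simulate_slopes` are all hypothetical choices made here. It tracks the estimated slope across samples rather than the mean residual, because the mean residual of a least-squares fit with an intercept is zero in every sample.

```python
import random
import statistics


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate_slopes(n_obs=200, n_samples=2000, alpha=1.0, beta=2.0, seed=0):
    """Draw repeated samples with skewed errors and return the fitted slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        xs = [rng.uniform(0, 10) for _ in range(n_obs)]
        # Exponential(1) errors shifted to mean 0: right-skewed, clearly not normal.
        ys = [alpha + beta * x + (rng.expovariate(1.0) - 1.0) for x in xs]
        slopes.append(ols_fit(xs, ys)[1])
    return slopes


slopes = simulate_slopes()
mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)
# Share of estimates within one standard deviation of their mean;
# roughly 0.68 would be expected if the estimates were normally distributed.
within_1sd = sum(abs(s - mean_slope) <= sd_slope for s in slopes) / len(slopes)
print(f"mean slope: {mean_slope:.3f}  sd: {sd_slope:.4f}  within 1 sd: {within_1sd:.3f}")
```

If the reasoning holds, the average fitted slope should sit close to the assumed true value of 2 even though the errors in each sample are skewed, and the spread of the slopes should look roughly bell-shaped.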

Lastly, in general, when we talk about the “distribution of the residuals”, are we actually referring to the fact that some predictions will be close to accurate and others less so?

If all predictions were accurate, what would that distribution then look like?

Constant Error Variance

The reference book (p. 302) says:

"Because the least-squares residuals have unequal variances even when the assumption of constant error variance is correct, it is preferable to …"

The error variance assumption relates to the true but unknown regression line or hyperplane.
The assumption of constant error variance idealises the model: the errors have the same spread around the regression hyperplane at every value of the explanatory variables.
This assumption is hardly ever met unless the explanatory variables capture to a near-perfect degree the information needed to explain the response variable.
The residuals of the estimated regression line or hyperplane act as proxies for the true errors and, assuming that the model is sound, the estimated and true regression lines or hyperplanes should be nearly identical.
In that case the variance of the residuals can be used as a proxy for the variance of the errors.
By the way, this is also the reason why the mean of the residuals should be equal to zero.
Least-squares residuals have unequal variances even when the assumption of constant error variance is correct because the fit is pulled towards each observation by a different amount: \(Var(e_i)=\sigma^2(1-h_i)\), where \(h_i\) is the leverage (hat-value) of observation \(i\), so residuals at high-leverage points vary less.
So really what the residuals plot tells us is whether the candidate true model is likely to be correct.
If there is no constant variance (heteroscedasticity) then we don’t have a good model to start with, i.e. the candidate model is incorrect.
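
The quoted statement that least-squares residuals have unequal variances even under constant error variance can be checked numerically. The sketch below is a hedged illustration, not taken from the reference book: the design points, the true line, `sigma` and the helper names `leverages` and `residuals` are assumptions made here. It compares the simulated variance of each residual with \(\sigma^2(1-h_i)\), using the simple-regression hat-value \(h_i = 1/n + (x_i-\bar{x})^2/S_{xx}\).

```python
import random
import statistics


def leverages(xs):
    """Hat-values h_i = 1/n + (x_i - x_bar)^2 / Sxx for simple linear regression."""
    n = len(xs)
    x_bar = statistics.fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


def residuals(xs, ys):
    """Least-squares residuals from regressing ys on xs (with an intercept)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(1)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]  # fixed design; x = 10 is a high-leverage point
sigma = 1.0                            # the same error standard deviation everywhere
n_reps = 20000

# Collect the residual at each design point over many simulated data sets.
per_point = [[] for _ in xs]
for _ in range(n_reps):
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, sigma) for x in xs]
    for store, e in zip(per_point, residuals(xs, ys)):
        store.append(e)

# Each residual has expectation 0, so its variance is its mean square.
empirical_var = [statistics.pvariance(r, mu=0.0) for r in per_point]
theoretical_var = [sigma ** 2 * (1 - h) for h in leverages(xs)]
for x, emp, theo in zip(xs, empirical_var, theoretical_var):
    print(f"x = {x:4.1f}  empirical Var(e) = {emp:.3f}  sigma^2 (1 - h) = {theo:.3f}")
```

The errors are generated with one common variance, yet the residual at the outlying x value should come out with a visibly smaller variance than the residuals near the centre of the data.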

Constant variance in the error implies that the variance of the response variable does not depend on the explanatory variables.
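
One way to spell out this step, as a sketch assuming the simple model \(Y=\alpha+\beta X+\epsilon\) with \(Var(\epsilon)=\sigma^2\) (the symbols \(\alpha\), \(\beta\) and \(\sigma^2\) are notation introduced here for illustration):

\[
Var(Y \mid X = x) = Var(\alpha + \beta x + \epsilon \mid X = x) = Var(\epsilon \mid X = x) = \sigma^2
\]

Once \(x\) is fixed, \(\alpha+\beta x\) is a constant, so all the variability of Y at that \(x\) comes from the error, and the right-hand side does not involve \(x\).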

\(E(\epsilon)=0\)

The above point about the errors being spread evenly around the regression line or hyperplane in the case of the perfect true and assumed model is well captured by the assumption \(E(\epsilon)=0\):

“The assumption that the average error, \(E(\epsilon)\), is everywhere 0 implies that the specified regression surface accurately reflects the dependency of the conditional average value of Y on the Xs.”

“Violating the assumption of linearity implies that the model fails to capture the systematic pattern of relationship between the response and explanatory variables”

In other words, \(E(\epsilon)=0\) states that the model is correctly formulated, that is, all relevant X’s are included and the model is indeed linear.
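
A small simulated example of what a violation looks like, offered as a hedged sketch: the quadratic true relationship, the sample size, the bin edges and the helper names `fit_line` and `mean_resid` are all assumptions chosen here for illustration. A straight line is fitted to curved data, and the average residual is then computed within ranges of X.

```python
import random
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(2)
xs = [rng.uniform(-3, 3) for _ in range(3000)]
# The true relationship is quadratic (a hypothetical choice),
# but the model fitted below is a straight line.
ys = [1.0 + 0.5 * x + x ** 2 + rng.gauss(0.0, 1.0) for x in xs]

intercept, slope = fit_line(xs, ys)
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def mean_resid(lo, hi):
    """Average residual among observations with lo <= x < hi."""
    return statistics.fmean(e for x, e in zip(xs, resid) if lo <= x < hi)


overall = statistics.fmean(resid)
left, middle, right = mean_resid(-3, -1), mean_resid(-1, 1), mean_resid(1, 3)
print(f"overall mean residual: {overall:.6f}")
print(f"mean residual by range of x: left {left:.2f}, middle {middle:.2f}, right {right:.2f}")
```

The overall mean residual is zero by construction, so it says nothing about the assumption. The average residual within each range of X is the quantity that reveals whether the error is "everywhere 0": here it should be positive at both ends and negative in the middle.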

\(Cov(\epsilon_i, \epsilon_j)=0\)

\(Cov(\epsilon_i, \epsilon_j)=0\) states that the Y’s are uncorrelated with each other, which usually holds in a random sample (the observations would typically be correlated in a time series or when repeated measurements are made on a single plant or animal).
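
The contrast between a random sample and a time series can be sketched with simulated errors. This is a minimal illustration under assumptions made here: the AR(1) process with `rho = 0.8` stands in for "a time series", and the helper name `lag1_autocorr` is hypothetical.

```python
import random
import statistics


def lag1_autocorr(values):
    """Sample correlation between each value and the one that follows it."""
    mean = statistics.fmean(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


rng = random.Random(3)
n = 5000

# Independent errors, as in a random sample.
independent = [rng.gauss(0.0, 1.0) for _ in range(n)]

# AR(1) errors: each error carries over part of the previous one.
rho = 0.8  # hypothetical strength of the serial dependence
ar1 = [rng.gauss(0.0, 1.0)]
for _ in range(n - 1):
    ar1.append(rho * ar1[-1] + rng.gauss(0.0, 1.0))

print(f"lag-1 autocorrelation, independent errors: {lag1_autocorr(independent):.3f}")
print(f"lag-1 autocorrelation, AR(1) errors:       {lag1_autocorr(ar1):.3f}")
```

The first figure should be near zero and the second near the assumed `rho`, which is the kind of dependence the assumption \(Cov(\epsilon_i, \epsilon_j)=0\) rules out.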