Classic Linear Regression Model

The classic linear regression model is one method of estimating the relationship between a dependent variable and an independent variable.

Ordinary Least Squares (OLS) is a method for estimating the unknown parameters in a linear regression model. It does so by minimizing the sum of the squared differences between the observed responses in a dataset and the responses predicted by the linear approximation of the data.
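In symbols, for the simple two-variable case, OLS chooses the coefficient estimates that minimize the sum of squared residuals:

\[(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0,\, \beta_1}{\arg\min} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^{2}\]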

Efficient OLS

The OLS estimator provides unbiased estimates when the assumptions of the classic linear regression model hold.

OLS is the Best Linear Unbiased Estimator (BLUE). This means that out of all possible linear unbiased estimators, OLS gives the most precise estimates.

If the errors are additionally assumed to be normally distributed, OLS is also the Best Unbiased Estimator (BUE), so it even beats non-linear estimators.

Assumptions of the Classic Linear Regression Model

To ensure the model estimates a reliable relationship between our variables, our data must satisfy the basic assumptions of the classic linear regression model.

One of these assumptions is that the error terms are homoskedastic.

What is Homoskedasticity?

Homoskedasticity concerns the variance of the error term. When the variance of the error terms is constant, they are said to be homoskedastic.

\[var(u_i \mid x_i) = \sigma^{2}\]

This means that the variance of \(u_i\) does not depend on the value of \(x_i\).

Heteroskedastic Error Terms

When the variance is not constant, the error terms are said to be heteroskedastic.
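In that case the conditional variance carries its own subscript, because it changes from observation to observation:

\[var(u_i \mid x_i) = \sigma_i^{2}\]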

Larger errors become more likely as the value of the independent variable increases.

Any time a high value of an independent variable is a necessary but not sufficient condition for an observation to have a high value of the dependent variable, heteroskedasticity is likely.

Example of Heteroskedasticity

Annual family income is the independent variable.

Annual family expenditures on vacations is the dependent variable.

Families with low incomes will spend relatively little on vacations, and the variations in expenditures across such families will be small because they all have less discretionary income.

Families with large incomes have more discretionary income. The mean amount spent on vacations will be higher, but there will also be greater variability among such families, resulting in heteroskedasticity.
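A small simulation in R makes this pattern visible. This is only an illustrative sketch; the variable names and the data-generating process are invented for the example:

# Simulate vacation spending whose spread grows with income
set.seed(42)
income <- runif(500, min = 20, max = 200)                    # annual income, in thousands
vacation <- 0.05 * income + rnorm(500, sd = 0.02 * income)   # error spread grows with income
plot(income, vacation, xlab = "Annual family income", ylab = "Vacation expenditures")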

Other Explanations for Heteroskedasticity

  • Measurement Error
  • Model Misspecification (such as using \(X\) when the model requires \(X^2\)).
  • Subpopulation Differences
  • Other interaction effects

Consequences of Heteroskedasticity

  • Heteroskedasticity does not bias the OLS parameter estimates, but the estimates are no longer BLUE: among all linear unbiased estimators, OLS no longer provides the estimate with the smallest variance.

  • When heteroskedasticity is present, OLS gives equal weight to all observations, even though some observations contain less information than others.

  • With estimation methods other than OLS, heteroskedasticity can even produce biased and misleading parameter estimates.

Correcting Heteroskedasticity in R

Using R, we can quickly and easily correct for heteroskedasticity in our model by installing some additional packages and computing heteroskedasticity-robust standard errors.

The packages that contain the functions we need for this are called sandwich and lmtest.

Installing Required Packages

install.packages("sandwich")

install.packages("lmtest")

library(sandwich)

library(lmtest)

These commands install the required packages and load them into the current R session.

Estimating the Variance-Covariance Matrix

Now construct a heteroskedasticity-consistent variance-covariance matrix for the fitted model's coefficient estimates.

vcovHC(FitLinReg, omega = NULL, type = "HC4")

The first argument is the fitted regression model. Keep omega as NULL (its default). The type argument selects which heteroskedasticity-consistent estimator to use; the types run from HC0 to HC4, and HC4 is the most recent of these.
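As a minimal, self-contained sketch, the model FitLinReg below is fitted on R's built-in mtcars data purely for illustration:

# Fit an illustrative linear model on built-in data
FitLinReg <- lm(mpg ~ wt + hp, data = mtcars)

# Heteroskedasticity-consistent variance-covariance matrix of the coefficients
vcovHC(FitLinReg, omega = NULL, type = "HC4")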

The Variance-Covariance Matrix

The model's coefficients are listed in both the rows and the columns. Each diagonal element is the heteroskedasticity-consistent variance of a coefficient estimate, and each off-diagonal element is the covariance between a pair of coefficient estimates. Note that this matrix corrects for heteroskedasticity rather than detecting it; a formal check can be run with the Breusch-Pagan test, available as bptest() in the lmtest package.
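The square roots of the diagonal entries are the robust standard errors themselves:

# Robust standard errors from the diagonal of the robust matrix
sqrt(diag(vcovHC(FitLinReg, omega = NULL, type = "HC4")))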

Finally, to carry these robust standard errors through to the coefficient tests, use the coeftest() function in R.

coeftest(FitLinReg, vcov. = vcovHC(FitLinReg, omega = NULL, type = "HC4"), df = Inf)

'df' stands for degrees of freedom; setting it to Inf tells coeftest() to use the normal distribution rather than the t distribution, which is appropriate when the number of observations is large.
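To see the effect of the correction, compare the classical and robust coefficient tables side by side (again using the illustrative FitLinReg from above):

# Classical standard errors, assuming homoskedasticity
summary(FitLinReg)

# Heteroskedasticity-robust standard errors
coeftest(FitLinReg, vcov. = vcovHC(FitLinReg, omega = NULL, type = "HC4"), df = Inf)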

Coefficient Estimates with Heteroskedasticity-Robust Standard Errors

This has fixed the standard errors in your regression!