Any time you conduct a statistical analysis, you make specific assumptions about the data. For OLS regression, there are several specific assumptions that should be evaluated. Overall, assumption violations in OLS cause two general types of issues:
- Bias (drawing erroneous conclusions about the relationship between some X and some Y)
- Inefficiency (standard errors are incorrectly sized, causing issues in statistical significance calculations)
In this tutorial, we will cover evaluating the following OLS assumptions:
- Linearity between X and Y
- Homoscedasticity
- Outliers
- Multicollinearity
- Normally distributed error term
Depending on the assumption, evaluation will be done through a variety of techniques, including visual inspection and specific statistical tests.
We will work with the 2015 California real-estate data that we have used this summer semester.
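If you are starting from a fresh R session, load the packages used throughout this tutorial and read in the data first. The file name below is only a placeholder, so substitute whatever name your copy of the 2015 California real-estate data uses.

# Packages used throughout this tutorial; install any missing ones with
# install.packages() before loading.
library(ggplot2)    # scatterplots and loess smoothing
library(car)        # vif()
library(lmtest)     # bptest() and coeftest()
library(sandwich)   # vcovHC() robust variance estimators
library(stargazer)  # side-by-side model tables

# Placeholder file name -- substitute the name of your copy of the data.
data <- read.csv("california_real_estate_2015.csv")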
We start by estimating a linear regression and saving the results. By saving the results, we can then evaluate the various assumptions listed above.
lm1 <- lm(price ~ bedrooms + sqft + bathrooms, data = data)  # baseline OLS model saved for assumption checks
summary(lm1)
##
## Call:
## lm(formula = price ~ bedrooms + sqft + bathrooms, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1085432 -236078 -98988 132725 5210441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84869.34 59998.04 -1.415 0.158
## bedrooms -13343.86 24914.51 -0.536 0.592
## sqft 114.93 23.31 4.930 1.10e-06 ***
## bathrooms 167924.73 30392.88 5.525 5.16e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 430200 on 531 degrees of freedom
## Multiple R-squared: 0.2896, Adjusted R-squared: 0.2856
## F-statistic: 72.16 on 3 and 531 DF, p-value: < 2.2e-16
We start our model assumption evaluations by examining linearity between X and Y, as well as multicollinearity, before turning our attention to the other assumptions.
One of the primary assumptions in linear regression is that there is in fact a linear relationship between some continuous or ordinal X and some Y. To understand the relationship, we will use a LOWESS (also called LOESS) curve, which stands for Locally Weighted Scatterplot Smoother. This technique is powerful for revealing the actual relationship between two variables, including whether they are linearly or non-linearly related.
To visually assess the relationship between some X and some Y, we will use the loess smoothing function along with ggplot2. Here we will look at the relationship between sqft and price.
b <- ggplot(data, aes(x = sqft, y = price))
b + geom_point(size = 2) +
  theme_bw(base_size = 20) + theme_minimal() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = NULL)) +
  geom_smooth(method = "loess", se = FALSE, span = .8)
## `geom_smooth()` using formula = 'y ~ x'
We produce a scatterplot between the two variables of interest and then apply the loess smoothing function to graph the localized relationship between the x and y. Here, we see a linear relationship initially before a downward shift at extreme values of sqft. This graph is indicative of some outlier issues in the relationship (more on testing for outliers later in this tutorial). However, the vast majority of the data does indicate a linear relationship between x and y, so outside of dealing with outliers the relationship is mostly linear between the two variables.
Multicollinearity exists when your independent variables are very highly correlated with each other. When this occurs, your standard errors become inefficient, which can lead to increased Type II error rates. There is a straightforward test to conduct after you estimate your OLS model to evaluate the degree of collinearity in your IVs. The Variance Inflation Factor (VIF) test evaluates the impact of correlations between your IVs on your overall model. The standard cut-point for flagging problematic variables is a VIF of 10 or greater, though some researchers use a more conservative cut-point of 5. At or above the cut-point, some of your IVs are so highly correlated that the standard errors are inefficient, potentially leading to increased Type II error rates.
We use the vif function from the car package to evaluate the Variance Inflation Factor of lm1.
vif(lm1)
## bedrooms sqft bathrooms
## 2.413488 2.350538 2.638573
Each independent variable has its own Variance Inflation Factor. For this basic model, all of our VIFs are < 3, indicating that none of the three independent variables are so highly correlated as to be problematic in the model. Because of this, we conclude that it is fine to keep all three IVs in the model.
If you do have a high VIF score for a variable or variables, you can either remove the offending variable or consider combining multiple IVs to create an index or composite variable. This should only be done with theoretical and empirical justification that the IVs you want to combine are measuring the same underlying concept.
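As a purely hypothetical illustration (the VIFs above are low, so this model does not need it), one common way to build a composite is to standardize the overlapping IVs and average them. The size_index name below is made up for the example.

# Hypothetical illustration only -- not needed for lm1.
# If, say, bedrooms and bathrooms were highly collinear, we could standardize
# each, average them into a single composite, refit, and re-check the VIFs.
data$size_index <- as.numeric(scale(data$bedrooms) + scale(data$bathrooms)) / 2
lm_index <- lm(price ~ size_index + sqft, data = data)
vif(lm_index)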
We continue our assumption evaluations by using the plot function along with the lm1 OLS model we created above. There are two ways to use this approach. First, simply using plot(lm1) will return all of the assumption evaluation plots. Or, you can use plot(lm1, which = value from 1 to 6) to plot only one of the 6 plots this general approach produces. We will look at each individual plot in turn and review what to look for to understand whether that assumption is violated or not.
plot(lm1, which=1:6)
The first assumption we will evaluate is whether the residuals demonstrate homoskedasticity or heteroskedasticity. This assumption deals with constant variance of the residuals regardless of the value of X. Ideally, at all points of X, the residuals will be spread around zero at roughly the same rate. Heteroskedastic residuals are an efficiency problem, not a bias problem. Meaning, the model standard errors might be incorrectly sized, potentially causing either Type I or Type II errors depending on the results. It does not, however, introduce bias into the estimates of how some X influences some Y.
We will assess heteroskedasticity both visually and with two different statistical tests:
- Breusch-Pagan
- White's test
First, we start with two separate visual displays of the residuals. Plot 1 from plot(lm1) shows the residuals (the difference between the observed y and the predicted y) plotted against the predicted value of y. While technically heteroskedasticity deals with constant variance at each point of an X, in a multivariate model with many x's, graphing the residuals against the predicted values is a great approximation of the variance at each point. The second visualization we will use is the Scale-Location plot (plot 3 from the assumption plots).
plot(lm1, which=1) #Graphs the first of the 6 plots
plot(lm1, which=3) #Graphs the third of the 6 plots
By using which = 1 in the code, we tell R to display only the first of the 6 possible assumption plots. Plots 1 and 3 allow you to visually review your model for heteroskedasticity as well as linearity between your x's and y. What you are looking for in both visualizations is for the red line to run mostly straight across, centered on the light gray dashed line at 0. You also do not want to see any type of curvilinear trend in the red line, which would indicate a non-linear relationship between the x's and y.
In the first plot, the red line shows some slight deviations from the gray dashed line on the right-hand side of the graph. This indicates a small increase in variance at extreme predictions. Examining the second plot, we see more deviation of the red line from a straight line. Both of these plots indicate some potential heteroskedasticity in the error term.
To more formally evaluate the presence of heteroskedasticity, there are two separate statistical tests you can use. This is not always the case for OLS assumptions, as not every assumption has a formal test.
We first look at the Breusch-Pagan test before turning to White's test. The null hypothesis of this test is that you have constant variance, so a significant Breusch-Pagan test is evidence of heteroskedasticity in your residuals (i.e., you reject the null hypothesis that you have constant variance).
We use the lmtest package to run the Breusch-Pagan test, so it must be installed and loaded on your machine.
bp <- bptest(lm1)  # Breusch-Pagan test from the lmtest package
bp
##
## studentized Breusch-Pagan test
##
## data: lm1
## BP = 5.3734, df = 3, p-value = 0.1464
The results of the BP test give you a few different statistics. The key one to evaluate is the p-value. Remember that the null hypothesis is constant variance, so a non-significant p-value means you fail to reject the null. Here we have a p-value of .1464, which is greater than the standard .05 alpha level, indicating we fail to reject the null hypothesis of constant variance.
This test allows us to verify our visual inspection and indicates that we do not have to worry about this assumption being violated in the model.
Next, we will run White's test. This test does a few more things than simply testing for heteroskedasticity. It also allows you to test whether different types of transformations could provide a better-fitting model. Here we will test if including a squared term of any of the IVs would help with the heteroskedasticity. You can also test cubes (^3, not a transformation that happens much in the social sciences), logs, or interactions (more on these in the next two weeks when we cover those concepts specifically).
Code for White's test starts the same as the Breusch-Pagan test, but there are some differences so that R knows which specific test to run. As with the Breusch-Pagan test, the null hypothesis is that you have constant variance, so a significant p-value would mean you reject that hypothesis in favor of the alternative (that there is non-constant variance).
# White-style test: the auxiliary regression uses the fitted values and their square
white_test <- bptest(lm1, ~ fitted(lm1) + I(fitted(lm1)^2))
white_test
##
## studentized Breusch-Pagan test
##
## data: lm1
## BP = 6.3851, df = 2, p-value = 0.04107
Unlike the Breusch-Pagan test, here we see a p-value < .05, indicating there might be issues with heteroskedasticity in the model. Generally, since the BP test had a large p-value and White's test has a borderline significant p-value, you would not worry as much about heteroskedasticity as you would if both tests were significant with smaller p-values. Later in this tutorial, we will cover the use of robust standard errors, which minimize the impact of heteroskedasticity.
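The bptest call above used the fitted values and their square, which is a compact approximation of White's test. If you instead want the textbook specification that uses the individual IVs, their squares, and their cross-products, a sketch (assuming the same variable names as in lm1) looks like this:

# Fuller White specification: squared residuals regressed on the IVs,
# their squares, and their pairwise interactions.
white_full <- bptest(lm1,
                     ~ bedrooms * sqft + bedrooms * bathrooms + sqft * bathrooms +
                       I(bedrooms^2) + I(sqft^2) + I(bathrooms^2),
                     data = data)
white_full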
OLS assumes that the residuals are normally distributed without skewness. When this assumption is violated, you are not introducing bias into your model but rather inefficiency. Just like with heteroskedasticity, a non-normally distributed error term can lead to incorrectly sized standard errors, leading to Type I or Type II errors.
We will review this assumption both visually and with a statistical test (Shapiro-Wilk’s test for Normality).
plot(lm1, which=2) #Graphs the second of the 6 plots
For the visual review of this assumption, we use which=2 because it is the second assumption plot produced by the plot command. When reviewing this plot, you want the individual dots to be close to the dotted gray line at all points on the x-axis. Here, we see some extreme deviations at the extreme right-hand side of the graph. This indicates an issue with normally distributed residuals at extreme values. One approach to solving this is to transform the DV in some way (which we cover in Week 6).
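As a preview of that kind of fix (covered fully in Week 6), one option is to log the DV and refit the model. This is only a sketch, assuming price is strictly positive, not a recommendation specific to these data.

# Sketch of one possible remedy: refit with a log-transformed DV
# and re-check the Q-Q plot of the new residuals.
lm_log <- lm(log(price) ~ bedrooms + sqft + bathrooms, data = data)
plot(lm_log, which = 2)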
We can also use the Shapiro-Wilk's test of normality to empirically test whether the deviations violate the assumption of normality. The null hypothesis in this test is that you have normally distributed residuals, so a significant p-value (less than .05) would mean you reject the null hypothesis of a normal distribution in favor of the alternative hypothesis. We use the shapiro.test function from the stats package for this test. To conduct this test, you first need to save the residuals from your linear model using residuals<-residuals(model name). Then you use the newly created residuals vector in the actual test.
residuals <- residuals(lm1)
shapiro_test <- shapiro.test(residuals)
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.72053, p-value < 2.2e-16
The Shapiro-Wilk's results give you two values: the test statistic (W) and the associated p-value. Here, the p-value is tiny, as evidenced by the e-16 part of the value, which is scientific notation. Basically, there are too many zeros at the start of the p-value for R to display them effectively, so it switches to scientific notation. Importantly, this test validates the visual inspection we did previously, as we reject the null hypothesis of normally distributed residuals. As mentioned above, we can attempt to solve this by transforming our DV in some way (more on this in Week 6).
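If you would rather see the p-value written out as a plain decimal instead of scientific notation, format() can do that, though a value this small will still print as essentially zero:

# Print the Shapiro-Wilk p-value without scientific notation.
format(shapiro_test$p.value, scientific = FALSE)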
Outliers are cases with values that are extreme and far from the majority of the units. However, not all outliers are influential in shaping the estimated relationship between some x and some y. We will look at two approaches - leverage and Cook's Distance - to understanding outliers in your model and how influential they are on your estimates. These tests help you determine whether or not you should take action to deal with the offending observations.
Leverage is a measure of how far an observation's values on the x's are from the mean values of the x's themselves. It takes on values between 0 and 1, with values closer to 1 indicating a stronger potential to be influential. However, we will eventually combine leverage with Cook's Distance to measure how influential the case actually is. Meaning, a high leverage value by itself is not sufficient to establish that an observation is causing issues with the model estimates. Plot 5 graphs the residuals versus the leverage values for the OLS model.
plot(lm1, which=5)
The graph itself is somewhat confusing but you are looking for all values to be within the dotted gray lines on the plot. Here, that is exactly what we see. There are some extreme cases but from a leverage perspective they do not seem to be too influential. As mentioned above, we do need to use Cook’s Distance to better evaluate the influence of the extreme cases.
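If you prefer to inspect the leverage values numerically rather than read them off the plot, hatvalues() returns them directly; the 2*(k+1)/n cut-off below is a common rule of thumb, not a hard rule.

# Extract the leverage (hat) values and flag cases above the common
# rule-of-thumb cut-off of 2 * (number of coefficients) / n.
lev <- hatvalues(lm1)
cutoff <- 2 * length(coef(lm1)) / nobs(lm1)
head(sort(lev, decreasing = TRUE))  # the largest leverage values
which(lev > cutoff)                 # row numbers exceeding the cut-off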
Plots 4 and 6 produce different Cook's Distance graphs for evaluation. Plot 4 graphs the Cook's Distance value on the y-axis and the individual unit row on the x-axis. The plot will flag the worst offending observations. If you have any Cook's Distance value >= 1, that case should definitely be removed before the model is re-estimated.
Plot 6 plots Cook's Distance on the y-axis along with the leverage value for each observation on the x-axis. This is the best plot for evaluating any individual cases that are potentially causing issues with the model estimation. For this plot, you are looking for cases in the upper right-hand side of the plot. These cases have high leverage and high Cook's Distance values, which indicates they are outliers that are causing issues in your model.
plot(lm1, which=4)
plot(lm1, which=6)
Plot 4 flags three cases with extremely high Cook's Distance values that could potentially be problematic: rows 428, 430, and 521. Next, we look at Plot 6, which plots Cook's Distance against the leverage values. In this plot, you are looking for cases in the upper right-hand part of the plot. Here, we once again see two of the same observations as in Plot 4: 428 and 430. Combining the information from these two plots indicates that we should remove these two cases and re-estimate the model without them.
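As a numerical check on what the plots flagged, you can pull the Cook's Distance values directly and list the largest ones:

# List the observations with the largest Cook's Distance values
# (rows 428, 430, and 521 should appear at the top).
cd <- cooks.distance(lm1)
head(sort(cd, decreasing = TRUE))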
Removing these cases from your analysis takes a few steps. First, we create a vector with the specific rows we want to exclude. Next, we remove those problematic cases from the dataset before re-estimating the model. You should always compare the original model with the new model to see whether there actually is a change in the estimated relationship between the x's and y. If the beta weights remain close to each other, you would use the full model without removing the outliers. If there is a large shift in the beta weights, then you should rely on the model that has removed the influential outliers.
problematic_obs <- c(428, 430)              # rows flagged by the Cook's Distance plots
cleaned_data <- data[-problematic_obs, ]    # drop the flagged rows
cleaned_model <- lm(price ~ bedrooms+sqft+bathrooms, data=cleaned_data)
stargazer(lm1, cleaned_model, digits=3, type="text")  # compare original and cleaned models
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## price
## (1) (2)
## -------------------------------------------------------------------
## bedrooms -13,343.860 -26,839.460
## (24,914.510) (24,934.410)
##
## sqft 114.927*** 170.285***
## (23.310) (27.794)
##
## bathrooms 167,924.700*** 128,892.500***
## (30,392.880) (32,368.430)
##
## Constant -84,869.340 -61,244.140
## (59,998.040) (59,794.610)
##
## -------------------------------------------------------------------
## Observations 535 533
## R2 0.290 0.307
## Adjusted R2 0.286 0.303
## Residual Std. Error 430,243.900 (df = 531) 425,648.100 (df = 529)
## F Statistic 72.156*** (df = 3; 531) 78.131*** (df = 3; 529)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Here, we estimated the new model with the two offending cases removed and then compared the new model with the original. Evaluating the beta weights first, we see changes in how each X influences the Y. For instance, the coefficient on the sqft variable goes from 115 to 170, while the impact of the number of bathrooms changes from 168K to 129K. Both of these coefficients are different enough from the initial model to justify using the model with the outliers removed.
Theoretically, the reason this should be done is that the outliers are so different from the vast majority of the cases that they are not necessarily representative of how the X's influence the Y for most cases. We can also evaluate the model fit statistics to understand whether the model performs better with the outliers removed. In this instance, we see that the R2, adjusted R2, and residual standard error all indicate a better-fitting model with the outliers removed. Taken as a whole, we would conclude that removal of the outliers is warranted for this model.
If you conclude that you have issues with heteroskedasticity in your OLS model, one approach to "fixing" them is to use robust standard errors. Robust SEs change how the standard errors are calculated, with slight differences in the calculation based on which version you choose.
To use robust standard error adjustments, we must load the lmtest package into R. We will focus on two types of robust standard errors here:
- Method 1 (HC1): uses the squared residuals to penalize unequal variance
- Method 3 (HC3): uses the leverage values for each individual case to adjust the size of the SE
Other variants (such as HC2) and adjustments for autocorrelation are beyond the scope of this course.
To apply robust standard errors, first estimate the regression model without any changes and save the model. Then we will use that saved model to reestimate the standard errors with each of the two types of robust adjustments.
We save these new estimated models as robust1 and robust3 to then compare to the original model. The code to apply robust standard errors is straightforward. We feed the original model name into the code at two different places, specify the specific robust standard error we want to use (HC1 or HC3), and keep the rest of the code the same. We need the lmtest and sandwich packages for the robust standard errors to be estimated. We then use stargazer to compare across the three models.
#Template: new_name <- coeftest(model_name, vcov = vcovHC(model_name, type = "HC#"))
robust1 <- coeftest(lm1, vcov = vcovHC(lm1, type = "HC1"))
robust3 <- coeftest(lm1, vcov = vcovHC(lm1, type = "HC3"))
stargazer(lm1, robust1, robust3, type="text")
##
## =========================================================================
## Dependent variable:
## -----------------------------------------------------
## price
## OLS coefficient
## test
## (1) (2) (3)
## -------------------------------------------------------------------------
## bedrooms -13,343.860 -13,343.860 -13,343.860
## (24,914.510) (24,186.850) (25,127.380)
##
## sqft 114.927*** 114.927*** 114.927***
## (23.310) (36.861) (41.453)
##
## bathrooms 167,924.700*** 167,924.700*** 167,924.700***
## (30,392.880) (35,655.250) (38,785.110)
##
## Constant -84,869.340 -84,869.340 -84,869.340
## (59,998.040) (61,867.510) (63,863.060)
##
## -------------------------------------------------------------------------
## Observations 535
## R2 0.290
## Adjusted R2 0.286
## Residual Std. Error 430,243.900 (df = 531)
## F Statistic 72.156*** (df = 3; 531)
## =========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The first thing to note is that the coefficients do not change with the application of robust standard errors. That is by design: we are not changing the estimate of the relationship between the IV and DV. Rather, we are changing our certainty level, i.e., the size of the standard error of the estimate.
Using robust standard errors does not have to increase the size of the standard errors of the estimates. In fact, for the bedrooms variable, the robust standard error is smaller than in the unadjusted model for Model 2, while it is larger for Model 3. For the other two variables, the standard errors are larger in the adjusted models compared to the unadjusted one.
We want to be conservative in what we consider a significant finding, which is why we want to evaluate the residuals of our regression models and apply robust standard errors if necessary.
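Finally, if you want your reported table to show the robust standard errors in place of the defaults, one option (a sketch reusing the sandwich estimator from above) is to pass them to stargazer through the se argument:

# Print HC3 robust standard errors in the stargazer table by supplying
# them through the se argument for the second column.
robust_se_hc3 <- sqrt(diag(vcovHC(lm1, type = "HC3")))
stargazer(lm1, lm1,
          se = list(NULL, robust_se_hc3),
          column.labels = c("Default SEs", "HC3 robust SEs"),
          type = "text")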