Introduction

Any time you conduct a statistical analysis, you make specific assumptions about the data. For OLS regression, there are several specific assumptions that should be evaluated. Overall, assumption violations in OLS cause two general types of issues:

- Bias (drawing erroneous conclusions about the relationship between some X and some Y)
- Inefficiency (standard errors are incorrectly sized, causing issues in statistical significance calculations)

In this tutorial, we will cover evaluating the following specific types of OLS assumptions:

- Linearity Between X and Y
- Homoscedasticity
- Outliers
- Multicollinearity
- Normally Distributed Error Term

Depending on the specific assumption, evaluation will be done through a variety of techniques, including visual inspection and specific statistical tests.

We will work with the 2015 California Real-Estate data that we have used this summer semester.

OLS Model

We start by estimating a linear regression and saving the results. By saving the results, we can then evaluate the various assumptions listed above.

lm1<-lm(price ~ bedrooms+sqft+bathrooms, data)
summary(lm1)
## 
## Call:
## lm(formula = price ~ bedrooms + sqft + bathrooms, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1085432  -236078   -98988   132725  5210441 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -84869.34   59998.04  -1.415    0.158    
## bedrooms    -13343.86   24914.51  -0.536    0.592    
## sqft           114.93      23.31   4.930 1.10e-06 ***
## bathrooms   167924.73   30392.88   5.525 5.16e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 430200 on 531 degrees of freedom
## Multiple R-squared:  0.2896, Adjusted R-squared:  0.2856 
## F-statistic: 72.16 on 3 and 531 DF,  p-value: < 2.2e-16

Evaluating Model Assumptions

We start our model assumption evaluations by examining linearity between X and Y and multicollinearity before turning our attention to the other assumptions.

Linearity Between X and Y

One of the primary assumptions in linear regression is that there is in fact a linear relationship between some continuous or ordinal X and some Y. To understand the relationship, we will use a LOWESS curve (you will also see it called LOESS), which stands for Locally Weighted Scatterplot Smoother. This technique is powerful at revealing the actual relationship between two variables, including whether they are linearly or non-linearly related.

To visually assess the relationship between some X and some Y, we will use the loess smoothing function along with ggplot2. Here we will look at the relationship between sqft and price.

b <- ggplot(data, aes(x = sqft, y =price))

b + geom_point(size = 2) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(title = NULL)) +
  geom_smooth(method = "loess", se = FALSE, span = .8)
## `geom_smooth()` using formula = 'y ~ x'

We produce a scatterplot between the two variables of interest and then apply the loess smoothing function to graph the localized relationship between the x and y. Here, we see a linear relationship initially before a downward shift at extreme values of sqft. This graph is indicative of some outlier issues in the relationship (more on testing for outliers later in this tutorial). However, the vast majority of the data does indicate a linear relationship between X and Y, so outside of dealing with the outliers the assumption appears to be satisfied.
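To make the comparison to a straight line explicit, one option is to overlay a linear fit on the same scatterplot alongside the loess curve. This is just a sketch reusing the ggplot object b from above, with arbitrary color choices.

b + geom_point(size = 2) +
  theme_minimal(base_size = 20) +
  geom_smooth(method = "loess", se = FALSE, span = .8, color = "red") +       # local (loess) fit
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "blue") # straight-line fit for comparison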

Multicollinearity

Multicollinearity exists when your independent variables are highly correlated with each other. When this occurs, your standard errors become inefficient, which can lead to increased Type II error rates. There is a straightforward test to conduct after you estimate your OLS model to evaluate the degree of collinearity in your IVs. The Variance Inflation Factor (VIF) test evaluates the impact of correlations between your IVs on your overall model. The standard cut-point for flagging problematic variables is a VIF of 10 or greater, though some researchers use a more conservative cut-point of 5. A score above the cut-point indicates that some of your IVs are so highly correlated that the standard errors become inefficient, potentially leading to increased Type II error rates.

We use the vif function from the car package to evaluate the Variance Inflation Factor of each predictor in lm1.

vif(lm1)
##  bedrooms      sqft bathrooms 
##  2.413488  2.350538  2.638573

Each independent variable has its own Variance Inflation Factor. For this basic model, all of our VIFs are < 3, indicating that none of the three independent variables is so highly correlated with the others as to be problematic in the model. Because of this, we conclude that it is fine to keep all three IVs in the model.

If you do have a high VIF score for a variable or variables, you can either remove the offending variable or consider combining multiple IVs to create an index or composite variable. This should only be done with theoretical and empirical justification that the IVs you want to combine are measuring the same underlying concept.
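For example, if one of the predictors did have a problematic VIF (none does here, so this is purely hypothetical), one option would be to drop it and re-check the VIFs of the reduced model:

# Hypothetical sketch: drop a high-VIF predictor (here, bathrooms) and re-check the VIFs
lm_reduced <- lm(price ~ bedrooms + sqft, data = data)
vif(lm_reduced)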

Using Plot to Review Other Assumptions

We continue our assumption evaluations by using the plot function along with the lm1 OLS model we created above. There are two ways to use this approach. First, simply calling plot(lm1) returns the default set of assumption evaluation plots. Or, you can use plot(lm1, which = a value from 1 to 6) to display only one of the six plots the general approach uses. We will look at each individual plot in turn and review what to look for to understand if that assumption is violated or not.

plot(lm1, which=1:6)
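If you want all six diagnostic plots on a single page rather than one at a time, you can adjust the plotting layout first; a small sketch using base graphics:

par(mfrow = c(2, 3))    # arrange plots in a 2 x 3 grid
plot(lm1, which = 1:6)  # all six diagnostic plots on one page
par(mfrow = c(1, 1))    # reset the plotting layout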

Heteroskedasticity

The first assumption we will evaluate is whether the residuals demonstrate hetero- or homoskedasticity. This assumption deals with constant variance of the residuals regardless of the value of X. Ideally, at all points of X, the residuals will miss by roughly the same amount. Heteroskedastic residuals are an efficiency problem, not a bias problem. Meaning, the model standard errors might be incorrectly sized, potentially causing either Type I or Type II errors depending on the results. It does not, however, introduce bias into the estimates of how some X influences some Y.

We will assess heteroskedasticity both visually and with two different statistical tests:

- Breusch-Pagan test
- White's test

First, we start with two separate visual displays of the residuals. Plot 1 from the plot(lm1) results shows the residuals (the difference between the observed y and the predicted y) plotted against the predicted values of y. While technically heteroskedasticity deals with constant variance at each point of an X, in a multivariate model with many X's, graphing the residuals against the predicted values is a good approximation of the variance at each point. The second visualization we will use is the Scale-Location plot (plot 3 from the assumption plots).

plot(lm1, which=1) #Graphs the first of the 6 plots 

plot(lm1, which=3) #Graphs the third of the 6 plots 

By using which = 1 in the code, we tell R to display only the first of the six possible assumption plots. Plots 1 and 3 allow you to visually review your model for heteroskedasticity as well as linearity between your X's and Y. What you are looking for in both visualizations is for the red line to run mostly straight across, centered on the light gray dashed line at 0. You also do not want to see any type of curvilinear trend in the red line, which would indicate a non-linear relationship between the X's and Y.

In the first plot, the red line shows some slight deviations from the gray dotted line on the right-hand side of the graph. This indicates a small increase in variance at extreme predictions. Examining the second plot, we see more deviation of the red line from a straight line. Both of these plots indicate some potential for heteroskedasticity in the error term.
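Because heteroskedasticity is formally about the variance of the residuals at each value of an X, it can also help to plot the residuals against an individual predictor. Here is a small sketch using sqft and base graphics; the choice of predictor is just for illustration, and it assumes there are no missing values so the residuals line up with the rows of data.

# Residuals against a single predictor (sqft) as an additional visual check
plot(data$sqft, resid(lm1),
     xlab = "sqft", ylab = "Residuals")
abline(h = 0, lty = 2)   # dashed reference line at zero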

To more formally evaluate the presence of heteroskedasticity, there are two separate statistical tests you can use. Not every OLS assumption has formal tests like this.

We first look at the Breusch-Pagan test before turning to White's test. The null hypothesis of this test is that you have constant variance, so a significant Breusch-Pagan test is evidence of heteroskedasticity in your residuals (i.e., you reject the null hypothesis that you have constant variance).

We use the lmtest package to run the Breusch-Pagan test so it must be installed and loaded on your machine for this.

bp<-bptest(lm1)
bp
## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 5.3734, df = 3, p-value = 0.1464

The results of the BP test give you a few different statistics. The key one to evaluate is the p-value. Remember the null hypothesis is constant variance, so a non-significant p-value means you fail to reject the null. Here we have a p-value of .1464, which is greater than the standard .05 alpha level, indicating we fail to reject the null hypothesis of constant variance.

This test complements our visual inspection and indicates that we do not have to worry about this assumption being violated in the model.
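If you want to pull the test statistic or p-value out of the saved bp object (for example, to report them in text), bptest returns a standard htest list whose pieces can be extracted directly; a minimal sketch:

bp$statistic   # the BP test statistic
bp$p.value     # the associated p-value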

Next, we will run White's test. This test does a few more things than simply testing for heteroskedasticity. It also allows you to test whether different types of transformations could provide a better-fitting model. Here we will test whether including a squared term (via the squared fitted values) would help with the heteroskedasticity. You can also include cubes (^3, though not a transformation that happens much in the social sciences), logs, or interactions (more on these in the next two weeks when we cover those concepts specifically).

Code for White's test starts the same as for the Breusch-Pagan test, but there are some differences so that R knows which specific test to run. As with the Breusch-Pagan test, the null hypothesis is that you have constant variance, so a significant p-value would mean you reject that hypothesis in favor of the alternative (that there is non-constant variance).

white_test <- bptest(lm1, ~ fitted(lm1) + I(fitted(lm1)^2))
white_test
## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 6.3851, df = 2, p-value = 0.04107

Unlike the Breusch-Pagan test, here we do see a p-value < .05, indicating there might be issues with heteroskedasticity in the model. Generally, since the BP test had a large p-value and White's test has a borderline significant p-value, you would not worry as much about heteroskedasticity as you would if both tests were significant with smaller p-values. Later in this tutorial, we will cover the use of robust standard errors, which minimize the impact of heteroskedasticity.

Normality of Error Term

OLS assumes that the residuals are normally distributed without skewness. When this assumption is violated, you are not introducing bias into your model but rather inefficiency. Just as with heteroskedasticity, a non-normally distributed error term can lead to incorrectly sized standard errors, resulting in Type I or Type II errors.

We will review this assumption both visually and with a statistical test (Shapiro-Wilk’s test for Normality).

plot(lm1, which=2) #Graphs the second of the 6 plots (the normal Q-Q plot)

For the visual review of this assumption, we use which=2 because it is the second assumption plot produced with the plot command. When reviewing this plot, you want the individual dots to be close to the dotted gray line at all points on the x-axis. Here, we see some large deviations at the extreme right-hand side of the graph. This indicates an issue with normally distributed residuals at extreme values. One approach to solving this is to transform the DV in some way (which we cover in Week 6).
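A histogram of the residuals is another quick visual check for skewness in the error term; this is a small sketch using base graphics on the residuals from lm1.

# Histogram of the residuals as an additional visual check for skewness
hist(resid(lm1), breaks = 50,
     main = "Histogram of lm1 residuals", xlab = "Residuals")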

We can also use the Shapiro-Wilk test of normality to empirically test whether the deviations violate the assumption of normality. The null hypothesis of this test is that you have normally distributed residuals, so a significant p-value (less than .05) would mean you reject the null hypothesis of a normal distribution in favor of the alternative. We use the shapiro.test function from the stats package for this test. To conduct this test, you first need to save the residuals from your linear model using residuals <- residuals(model_name). Then you use the newly created residuals vector in the actual test.

residuals <- residuals(lm1)
shapiro_test <- shapiro.test(residuals)
shapiro_test
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.72053, p-value < 2.2e-16

The Shapiro-Wilk results give you two values: the test statistic (W) and the associated p-value. Here, the p-value is tiny, as evidenced by the e-16 part of the value, which is scientific notation. Basically, there are too many zeros at the start of the p-value for R to display them effectively, so it switches to scientific notation. Importantly, this test confirms what we saw visually, as we reject the null hypothesis of normally distributed residuals. As mentioned above, we can attempt to solve this by transforming our DV in some way (more on this in Week 6).
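If you want to see the p-value written out in full rather than in scientific notation, you can format the stored value from the saved test object; a small sketch (the displayed value will simply be a long string of zeros followed by the leading digits):

format(shapiro_test$p.value, scientific = FALSE)   # print the p-value without scientific notation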

Outliers

Outliers are cases with extreme values that sit far from the majority of the units. However, not all outliers are influential in shaping the estimated relationship between some X and some Y. We will look at two approaches - leverage and Cook's Distance - to understand the outliers in your model and how influential they are on your estimates. These tests help you determine whether or not you should take action to deal with the offending observations.

Leverage & Cook’s Distance

Leverage is a measure of how far an observation's values on the X's are from the mean values of the X's themselves. It takes on values between 0 and 1, with values closer to 1 indicating a stronger potential to be influential. However, we will eventually combine leverage with Cook's Distance to measure how influential the observation actually is. Meaning, a high leverage value is not sufficient to establish that an observation is causing issues with the model estimates. Plot 5 graphs the residuals versus leverage values for the OLS model.

plot(lm1, which=5) 

The graph itself is somewhat confusing, but you are looking for all values to fall within the dotted gray lines on the plot. Here, that is exactly what we see. There are some extreme cases, but from a leverage perspective they do not seem to be too influential. As mentioned above, we still need to use Cook's Distance to better evaluate the influence of the extreme cases.
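You can also pull the leverage (hat) values out of the model directly and compare them to a common rule of thumb of 2(k + 1)/n, where k is the number of predictors; this is a sketch under that assumed cut-off, not a hard rule.

lev <- hatvalues(lm1)                        # leverage (hat) value for each observation
cutoff <- 2 * length(coef(lm1)) / nobs(lm1)  # 2 * (number of coefficients, incl. intercept) / n
which(lev > cutoff)                          # rows with relatively high leverage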

Plots 4 and 6 produce different Cook's Distance graphs for evaluation. Plot 4 graphs the Cook's Distance value on the y-axis and the individual row number on the x-axis. The plot will flag the worst offending observations. If any case has a Cook's Distance value >= 1, that case should definitely be removed before the model is re-estimated.

Plot 6 plots Cook's Distance on the y-axis along with the leverage value for each observation on the x-axis. This is the best plot to evaluate any individual cases that are potentially causing issues with the model estimation. For this plot, you are looking for cases in the upper right-hand side of the plot itself. These cases have high leverage and Cook's Distance values, which indicates they are outliers that are causing issues in your model.

plot(lm1, which=4)

plot(lm1, which=6) 

Plot 4 flags three cases with extremely high Cook's Distance values that could potentially be problematic: rows 428, 430, and 521. Next, we look at Plot 6, which plots Cook's Distance against the leverage values. In this plot, you are looking for cases in the upper right-hand part of the plot. Here, we once again see two of the same observations as in Plot 4: 428 and 430. Combining the information from these two plots indicates that we should remove these two cases and re-estimate the model without them.
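You can also list the largest Cook's Distance values numerically rather than reading them off the plots; a short sketch:

cd <- cooks.distance(lm1)             # Cook's Distance for each observation
head(sort(cd, decreasing = TRUE), 5)  # the five most influential rows, by distance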

Removing these cases from your analysis takes a few steps. First, we create a vector with the specific rows we want to exclude. Next, we remove those problematic cases from the dataset before re-estimating the model. You should always compare the original model with the new model to see whether there actually is a change in the estimated relationship between the X's and Y. If the beta weights remain close to each other, you would use the full model without removing the outliers. If there is a large shift in the beta weights, then you should rely on the model with the influential outliers removed.

problematic_obs <- c(428, 430)
cleaned_data <- data[-problematic_obs, ]

cleaned_model <- lm(price ~ bedrooms+sqft+bathrooms, data=cleaned_data)

stargazer(lm1, cleaned_model, digits=3, type="text")
## 
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                          price                     
##                               (1)                     (2)          
## -------------------------------------------------------------------
## bedrooms                  -13,343.860             -26,839.460      
##                          (24,914.510)            (24,934.410)      
##                                                                    
## sqft                      114.927***              170.285***       
##                            (23.310)                (27.794)        
##                                                                    
## bathrooms               167,924.700***          128,892.500***     
##                          (30,392.880)            (32,368.430)      
##                                                                    
## Constant                  -84,869.340             -61,244.140      
##                          (59,998.040)            (59,794.610)      
##                                                                    
## -------------------------------------------------------------------
## Observations                  535                     533          
## R2                           0.290                   0.307         
## Adjusted R2                  0.286                   0.303         
## Residual Std. Error 430,243.900 (df = 531)  425,648.100 (df = 529) 
## F Statistic         72.156*** (df = 3; 531) 78.131*** (df = 3; 529)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

Here, we estimate the new model with the two offending cases removed and then compare the new model with the original. Evaluating the beta weights first, we see changes in how each X influences the Y. For instance, the coefficient on the sqft variable goes from 115 to 170, while the impact of the number of bathrooms changes from 168K to 129K. Both of these coefficients are different enough from the initial model to justify using the model with the outliers removed.

Theoretically, the reason this should be done is that the outliers are so different from the vast majority of the cases that they are not necessarily representative of how the X's influence the Y for most cases. We can also evaluate the model fit statistics to understand whether the model performs better with the removal of the outliers. In this instance, we see that the R2, Adjusted R2, and residual standard error all indicate a better-fitting model with the outliers removed. Taken as a whole, we would conclude that removal of the outliers is warranted for this model.

Robust Standard Errors

If you conclude that you have issues with heteroskedasticity in your OLS model, one approach to "fixing" it is to use robust standard errors. Robust SEs change how the standard errors are calculated, with slight differences in the calculation depending on which version you choose.

To use robust standard error adjustments, we must load the lmtest and sandwich packages into R. We will focus on two types of robust standard errors here:

- Method 1 ("HC1"): uses the squared residuals to penalize unequal variance
- Method 3 ("HC3"): uses the leverage value for each individual case to adjust the size of the SE

Other adjustments exist as well (for example, estimators that also account for autocorrelation), but those are beyond the scope of this course.

To apply robust standard errors, first estimate the regression model without any changes and save the model. Then we will use that saved model to reestimate the standard errors with each of the two types of robust adjustments.

We save these newly estimated results with the names robust1 and robust3 and then compare them to the original model. The code to apply robust standard errors is straightforward: we feed the original model name into the code at two different places, specify the specific robust standard error we want to use (HC1 or HC3), and keep the rest of the code the same.

We need the lmtest and sandwich packages to estimate the robust standard errors. We then use stargazer to compare across the three models.

# Template: new_name <- coeftest(model_name, vcov = vcovHC(model_name, type = "HC#"))

robust1<-coeftest(lm1, vcov = 
                vcovHC(lm1, type="HC1"))

robust3<-coeftest(lm1, vcov = 
                vcovHC(lm1, type="HC3"))

stargazer(lm1, robust1, robust3, type="text")
## 
## =========================================================================
##                                      Dependent variable:                 
##                     -----------------------------------------------------
##                              price                                       
##                               OLS                    coefficient         
##                                                         test             
##                               (1)                (2)            (3)      
## -------------------------------------------------------------------------
## bedrooms                  -13,343.860        -13,343.860    -13,343.860  
##                          (24,914.510)        (24,186.850)   (25,127.380) 
##                                                                          
## sqft                      114.927***          114.927***     114.927***  
##                            (23.310)            (36.861)       (41.453)   
##                                                                          
## bathrooms               167,924.700***      167,924.700*** 167,924.700***
##                          (30,392.880)        (35,655.250)   (38,785.110) 
##                                                                          
## Constant                  -84,869.340        -84,869.340    -84,869.340  
##                          (59,998.040)        (61,867.510)   (63,863.060) 
##                                                                          
## -------------------------------------------------------------------------
## Observations                  535                                        
## R2                           0.290                                       
## Adjusted R2                  0.286                                       
## Residual Std. Error 430,243.900 (df = 531)                               
## F Statistic         72.156*** (df = 3; 531)                              
## =========================================================================
## Note:                                         *p<0.1; **p<0.05; ***p<0.01

The first thing to note is that the coefficients do not change with the application of robust standard errors. That is expected: we are not changing the estimate of the relationship between the IVs and DV. Rather, we are changing our level of certainty, i.e., the size of the standard error of each estimate.

Using robust standard errors does not have to increase the size of the standard errors. In fact, for the bedrooms variable, the robust standard error in Model 2 is smaller than in the unadjusted model, while the one in Model 3 is larger. For the other two variables, the standard errors are larger in both adjusted models compared to the unadjusted one.
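If you want the standard errors themselves side by side rather than reading them out of the stargazer table, you can take the square root of the diagonal of each variance-covariance matrix; a small sketch using the same sandwich functions as above:

# Compare the OLS, HC1, and HC3 standard errors directly
se_compare <- cbind(
  OLS = sqrt(diag(vcov(lm1))),
  HC1 = sqrt(diag(vcovHC(lm1, type = "HC1"))),
  HC3 = sqrt(diag(vcovHC(lm1, type = "HC3")))
)
round(se_compare, 2)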

We want to be conservative in what we consider a significant finding, which is why we evaluate the residuals of our regression models and apply robust standard errors if necessary.