OLS Assumptions

There are several assumptions underpinning linear regression that should be tested. In this tutorial, we are going to look at well-behaved residuals by simulating relationships among normally distributed variables (note that the normality assumption in OLS concerns the DV and its errors, not the IVs, but here we use normal distributions for both). We will then compare these well-behaved residuals to an identical model where the variables are ordinal - like they are in a lot of survey research - instead of normally distributed.
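Throughout the tutorial, a handful of R packages do the heavy lifting. A minimal setup chunk, assuming these packages are already installed, looks like this:

library(skimr)     # skim() data summaries
library(faux)      # rnorm_multi() for simulating correlated normal variables
library(lmtest)    # coeftest() for re-testing coefficients
library(sandwich)  # vcovHC() heteroskedasticity-robust variance estimators
library(stargazer) # side-by-side regression tables
library(jtools)    # export_summs() and plot_summs() for model comparison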

Ordinal Variables in Analysis

We will start with a set of correlated ordinal data called pop, which consists of 3 variables used in these models:

  • y = the dependent variable of interest, ranging from 1-4
  • x = an independent variable of interest, ranging from 1-4
  • x2 = an independent variable of interest, ranging from 1-5
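As a mental model (not the process that actually generated pop), ordinal survey items like these can be thought of as coarse, binned versions of an underlying continuous score. A small hypothetical sketch of that idea:

# Hypothetical illustration only: binning a latent continuous score into a 1-4 ordinal item
latent <- rnorm(1000, mean = 2.2, sd = 1.4)  # latent continuous "attitude"
ordinal <- cut(latent, breaks = c(-Inf, 1.5, 2.5, 3.5, Inf), labels = FALSE)  # integer codes 1-4
table(ordinal)  # distribution across the 4 ordered categories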

First, let’s look at the means, standard deviations, and correlations between these three existing variables so we can match them when we simulate data drawn from a Gaussian distribution.

skim(pop)
Data summary
Name pop
Number of rows 2602
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
y 0 1 2.21 1.37 1 1 1 4 4 ▇▁▁▂▅
x 0 1 2.86 1.14 1 2 3 4 4 ▃▃▁▃▇
x2 0 1 1.96 1.01 1 1 2 2 5 ▇▇▃▂▁
cor(pop)
##            y         x        x2
## y  1.0000000 0.4095337 0.3449629
## x  0.4095337 1.0000000 0.2622368
## x2 0.3449629 0.2622368 1.0000000

Simulating from a Multivariate Normal Distribution

Now that we know the means, standard deviations, and how these three variables are interrelated, we can use the rnorm_multi function from the faux package. This package allows us to simulate a multivariate distribution - i.e., a distribution where the value of one variable is at least partially conditional on the values of one or more other variables - with pre-specified means, standard deviations, and correlations between items.

This simulation will create a new dataset that has our three original variables, with known population parameters that match our existing ordinal data. In other words, the primary difference between the two samples is whether the variables are drawn from a Gaussian or an ordinal distribution. We will be able to use this setup to investigate the difference between running OLS on a 4-point survey scale versus on a normally distributed variable.

set.seed(2011)
n<-7188
pop2 <- rnorm_multi(n = n, # Number of cases to simulate
                   mu = c(2.17, 2.82, 1.97), # Means for the 3 variables
                   sd = c(1.36, 1.15, 1.04), # Standard deviations for the 3 variables
                   r = c(0.4186282, .3456132, .2551263), # Pearson correlations between items - order (1-2, 1-3, 2-3)
                   varnames = c("y", "x", "x2"), # New names for the variables
                   empirical = FALSE) # Randomly simulating the data rather than forcing the values to be exact

We now have two population datasets that we can use to compare the performance of OLS residuals depending on the structure of the DV. Also remember that the simulated population was built to match the ordinal data's means, SDs, and correlations (approximately, since empirical = FALSE lets the simulated values vary randomly around those targets), which means the betas from the OLS models are roughly known to us in advance.

skim(pop)
Data summary
Name pop
Number of rows 2602
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
y 0 1 2.21 1.37 1 1 1 4 4 ▇▁▁▂▅
x 0 1 2.86 1.14 1 2 3 4 4 ▃▃▁▃▇
x2 0 1 1.96 1.01 1 1 2 2 5 ▇▇▃▂▁
skim(pop2)
Data summary
Name pop2
Number of rows 7188
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
y 0 1 2.16 1.37 -2.32 1.24 2.19 3.08 6.94 ▁▅▇▃▁
x 0 1 2.82 1.15 -1.26 2.04 2.81 3.61 6.96 ▁▃▇▃▁
x2 0 1 1.95 1.05 -1.86 1.25 1.95 2.65 6.03 ▁▃▇▂▁

Testing the residuals of our OLS models allows us to check several assumptions about the error term. First, that our error variance is constant - i.e., homoskedastic - across values of Y. This tells us how consistently the model predicts Y across the range of Y. Non-constant error variance does not bias our beta estimates, but it can make the standard errors the wrong size.

Next, we can test whether or not our error term is normally distributed - which is not the same thing as having constant variance. Normally distributed errors give us more confidence in significance tests of our beta weights, and appropriately sized standard errors are critical to those tests. Once again, violating this assumption will not lead to biased coefficients but rather to incorrectly sized standard errors.

Next, we draw a random sample of 600 cases from each data set before running a linear regression on each for comparison. This simulates the fact that we typically only get sample data instead of having access to data on the entire population.

set.seed(1944)
sample_size <- 600
sample_ord <- pop[sample(nrow(pop), sample_size, replace = TRUE), ]    # Random sample of 600 rows from the ordinal population
sample_norm <- pop2[sample(nrow(pop2), sample_size, replace = TRUE), ] # Random sample of 600 rows from the simulated normal population

cor(sample_norm)
##            y        x        x2
## y  1.0000000 0.504405 0.3971664
## x  0.5044050 1.000000 0.3394970
## x2 0.3971664 0.339497 1.0000000
cor(sample_ord)
##            y         x        x2
## y  1.0000000 0.4290973 0.3250993
## x  0.4290973 1.0000000 0.2349794
## x2 0.3250993 0.2349794 1.0000000

Estimating Linear Regression on Continuous and Ordinal DV

Let’s first estimate a linear regression on our multivariate normally distributed simulated data, where we set the correlations between the variables. We use the lm function to estimate our linear regression on the simulated sample sample_norm, predicting y as a function of x and x2.

Because we know the underlying relationships between y, x, and x2, we know roughly what coefficients we should observe in this model. The coefficient on x should be roughly equal to .41, while the coefficient on x2 should be roughly equal to .35.

We also estimate the ordinal linear regression for comparison.

lm_norm<-lm(y ~ x + x2, data=sample_norm) #estimates the OLS model 
summary(lm_norm) #Returns estimates of regression model 
## 
## Call:
## lm(formula = y ~ x + x2, data = sample_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3240 -0.7701 -0.0010  0.7193  3.1247 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.11103    0.12886   0.862    0.389    
## x            0.49929    0.04314  11.575  < 2e-16 ***
## x2           0.31521    0.04455   7.076 4.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.124 on 597 degrees of freedom
## Multiple R-squared:  0.3121, Adjusted R-squared:  0.3098 
## F-statistic: 135.4 on 2 and 597 DF,  p-value: < 2.2e-16
lm_ord<-lm(y ~ x + x2, data=sample_ord) #estimates the OLS model 
summary(lm_ord) #Returns estimates of regression model 
## 
## Call:
## lm(formula = y ~ x + x2, data = sample_ord)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72697 -0.94108 -0.01582  1.05892  2.98418 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22238    0.15347   1.449    0.148    
## x            0.46263    0.04557  10.153  < 2e-16 ***
## x2           0.33082    0.05125   6.456 2.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.197 on 597 degrees of freedom
## Multiple R-squared:  0.2374, Adjusted R-squared:  0.2348 
## F-statistic:  92.9 on 2 and 597 DF,  p-value: < 2.2e-16

Results show us roughly what we expected to see in terms of coefficient size, with x being close to .41 and .35 falling within the 95% CI for the coefficient on x2. With only 600 cases, you should not expect the coefficients to be exact. Let’s briefly interpret this regression output. The intercept is the value of y when the independent variables are all equal to zero. Without a true zero for either IV, the intercept is meaningless here, so we ignore it.
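If you want to check those expectations against the model directly, the confint function returns 95% confidence intervals for each coefficient (shown here for the normal-DV model):

confint(lm_norm)  # 95% confidence intervals for the intercept, x, and x2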

Looking at the independent variables, both show a significant relationship with y, indicated in several ways. First, the p-value is well below .05, as indicated by ***. Next, the t-value is >= 1.96, which is the standard cut point for treating an effect as significant. Finally, the estimate itself is more than twice the size of its standard error, which also indicates a significant relationship.
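These three checks are mechanically linked. As a quick sketch, you can reproduce the t values by hand from the coefficient table of the normal-DV model:

coefs <- summary(lm_norm)$coefficients        # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"] / coefs[, "Std. Error"]   # estimate divided by SE reproduces the t values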

For x in the normal-DV model, the interpretation is that for every 1 unit increase in x, y increases on average by about .50 (the estimated coefficient), holding x2 constant. For x2, the interpretation is that for every 1 unit increase in x2, y increases on average by about .32, holding x constant.
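One way to see the "holding constant" logic is to predict y at two values of x that are one unit apart while fixing x2; the difference between the predictions equals the coefficient on x. A small sketch with arbitrary example values:

newdat <- data.frame(x = c(2, 3), x2 = 2)  # x increases by 1 unit, x2 held at 2
preds <- predict(lm_norm, newdata = newdat)
diff(preds)  # difference in predicted y equals the estimated coefficient on x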

Overall, the F-statistic tells us that our model does better than chance at predicting y - which we already knew, since we checked the significance of the individual items. It is rare to impossible to have a significant IV and a non-significant F-statistic, so the F-statistic usually does not give us unique information and does not need to be a focus of interpretation.

Residuals for Normally Distributed DV

Now that we have estimated and interpreted the model, let’s review the model residuals. Remember, a residual is simply the distance between the predicted value of y and the actual value of y. Larger residuals mean a bigger miss in predicting y for that case.
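Because the model object stores its residuals, we can pull them out and confirm the definition (actual y minus predicted y) directly:

res <- resid(lm_norm)  # residuals stored in the saved model
head(res)
head(sample_norm$y - fitted(lm_norm))  # same values: actual y minus predicted y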

The easiest way to review residuals is to plot them. After you save your OLS model, you can easily plot the residuals using the plot function. Without changes to the code, plot(model_name) cycles through a series of diagnostic plots (four by default) with different looks at your residuals (and influential data points) that you have to page through to evaluate. We will simplify this by using the which argument to select the specific residual plots we want and the par function to place them side by side.

plot(lm_norm) # Cycles through the default regression diagnostic plots for the normal-DV model

plot(lm_ord) # Cycles through the default regression diagnostic plots for the ordinal-DV model

par(mfrow = c(1,2)); plot(lm_norm, which = 3) # Homoskedasticity check for the normal DV; par sets up 1 row and 2 columns for side-by-side comparison
plot(lm_ord, which = 3) # Adds the same homoskedasticity residual graph for the ordinal DV

Here we plot checks for heteroskedasticity comparing the Gaussian distributed DV on the left and the ordinally distributed DV on the right. The model with normally distributed variables has a more consistent and random error variance across all levels of the DV. The model with the ordinal dependent variable shows a clear pattern in the residuals indicating non-constant error variance in the model.
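If you want a formal test to accompany the visual judgment, the Breusch-Pagan test from the lmtest package tests the null hypothesis of constant error variance; a small p-value suggests heteroskedasticity:

bptest(lm_norm)  # Breusch-Pagan test for the model with the normally distributed DV
bptest(lm_ord)   # Breusch-Pagan test for the model with the ordinal DV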

par(mfrow = c(1,2)); plot(lm_norm, which = 2) # Q-Q plot to check for normally distributed residuals; 1 row and 2 columns for side-by-side comparison
plot(lm_ord, which = 2) # Adds the same Q-Q residual graph for the ordinal DV
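The Q-Q plots are the visual check of this assumption; as a numeric complement, a Shapiro-Wilk test can be run on the residuals, where a small p-value suggests the residuals depart from normality:

shapiro.test(resid(lm_norm))  # normality test on residuals from the normal-DV model
shapiro.test(resid(lm_ord))   # normality test on residuals from the ordinal-DV model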

With this pattern of results indicating clear heteroskedasticity, we should be concerned that our standard errors for the coefficients are not being accurately estimated. To adjust for this, we can use robust standard errors in our regression models.

Using Robust Standard Errors

To use robust standard error adjustments, we must load the lmtest and sandwich packages into R. We will focus on two types of robust standard errors here:

  • Method 1 ("HC1"): uses squared residuals to penalize unequal variance
  • Method 3 ("HC3"): uses the leverage value of each individual case to adjust the size of the SE

Method 2 ("HC2") is a closely related leverage-based adjustment. Autocorrelation is a separate problem that these estimators do not address; we will learn how to account for it using multilevel models instead.

We will look at two approaches to estimating robust standard errors. First using the sandwich package and then using the jtools package.

Approach 1

To apply robust standard errors, first estimate the regression model without any changes and save the model. Then we will use that saved model to reestimate the standard errors with each of the two types of robust adjustments.

We save these newly estimated models under the names robust1 and robust3 and then compare them to the original model. The code to apply robust standard errors is straightforward: we feed the original model name into the code at two different places, specify which robust standard error we want to use (HC1 or HC3), and keep the rest of the code the same.

We then use stargazer to compare across the three models.

lm_ord<-lm(y ~ x + x2, data=sample_ord) #estimates the OLS model 
summary(lm_ord) #Returns estimates of regression model 
## 
## Call:
## lm(formula = y ~ x + x2, data = sample_ord)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72697 -0.94108 -0.01582  1.05892  2.98418 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22238    0.15347   1.449    0.148    
## x            0.46263    0.04557  10.153  < 2e-16 ***
## x2           0.33082    0.05125   6.456 2.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.197 on 597 degrees of freedom
## Multiple R-squared:  0.2374, Adjusted R-squared:  0.2348 
## F-statistic:  92.9 on 2 and 597 DF,  p-value: < 2.2e-16
#new_name <- coeftest(model_name, vcov =
#                vcovHC(model_name, type = "HC#"))

robust1<-coeftest(lm_ord, vcov = 
                vcovHC(lm_ord, type="HC1"))

robust3<-coeftest(lm_ord, vcov = 
                vcovHC(lm_ord, type="HC3"))

stargazer(lm_norm, lm_ord, robust1, robust3, type="text")
## 
## =====================================================================
##                                         Dependent variable:          
##                                --------------------------------------
##                                         y                            
##                                        OLS             coefficient   
##                                                           test       
##                                   (1)        (2)      (3)      (4)   
## ---------------------------------------------------------------------
## x                               0.499***  0.463***  0.463*** 0.463***
##                                 (0.043)    (0.046)  (0.045)  (0.045) 
##                                                                      
## x2                              0.315***  0.331***  0.331*** 0.331***
##                                 (0.045)    (0.051)  (0.059)  (0.059) 
##                                                                      
## Constant                         0.111      0.222    0.222*   0.222* 
##                                 (0.129)    (0.153)  (0.132)  (0.133) 
##                                                                      
## ---------------------------------------------------------------------
## Observations                      600        600                     
## R2                               0.312      0.237                    
## Adjusted R2                      0.310      0.235                    
## Residual Std. Error (df = 597)   1.124      1.197                    
## F Statistic (df = 2; 597)      135.439*** 92.904***                  
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Approach 2 - JTools

The second approach you can use to estimate robust standard errors is the jtools package. This approach is more straightforward and easier to implement than approach 1 but, as you will see, returns the same results (as it should). Here, we use the export_summs function from the jtools package and explicitly stipulate which robust standard error form we want to use with the ordinal linear model.

lm_ord<-lm(y ~ x + x2, data=sample_ord) #estimates the OLS model 

export_summs(lm_ord, robust ="HC1", digits=3)
              Model 1  
(Intercept)   0.222    
              (0.132)  
x             0.463 ***
              (0.045)  
x2            0.331 ***
              (0.059)  
N             600      
R2            0.237    
Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.
export_summs(lm_ord, robust = "HC3", digits=3)
              Model 1  
(Intercept)   0.222    
              (0.133)  
x             0.463 ***
              (0.045)  
x2            0.331 ***
              (0.059)  
N             600      
R2            0.237    
Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05.
export_summs(lm_norm, lm_ord, lm_ord, lm_ord, robust = list(FALSE, FALSE, "HC1", "HC3"),
           model.names = c("REG SE Norm" , "Reg SE Ord", "HC1 SE", "HC3 SE"), digits=3)
              REG SE Norm   Reg SE Ord   HC1 SE      HC3 SE   
(Intercept)   0.111         0.222        0.222       0.222    
              (0.129)       (0.153)      (0.132)     (0.133)  
x             0.499 ***     0.463 ***    0.463 ***   0.463 ***
              (0.043)       (0.046)      (0.045)     (0.045)  
x2            0.315 ***     0.331 ***    0.331 ***   0.331 ***
              (0.045)       (0.051)      (0.059)     (0.059)  
N             600           600          600         600      
R2            0.312         0.237        0.237       0.237    
*** p < 0.001; ** p < 0.01; * p < 0.05.

The first thing to note is that the coefficients do not change with the application of robust standard errors. That is by design: we are not changing the estimate of the relationship between the IV and DV, but rather our certainty about it, i.e., the standard error of the estimate.

Using robust standard errors does not have to increase the size of the standard errors of the estimates. In fact, for the x variable the robust standard errors are actually slightly smaller than in the unadjusted model, while for the x2 variable the standard error is slightly larger in the adjusted models than in the unadjusted one.
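You can see those differences directly by pulling the standard errors out of the regular and robust variance-covariance matrices:

sqrt(diag(vcov(lm_ord)))                  # regular OLS standard errors
sqrt(diag(vcovHC(lm_ord, type = "HC1")))  # HC1 robust standard errors
sqrt(diag(vcovHC(lm_ord, type = "HC3")))  # HC3 robust standard errors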

In this instance, the conclusions do not change across model types and the overall differences in standard error size are small, so the robust standard error adjustments are not strictly needed. This will not always be the case. If the robust errors were much larger and led to a loss of significance - provided the model truly has heteroskedasticity in its residuals - then we would report the model using the robust standard errors.

We want to be conservative in what we consider significant, which is why we want to evaluate the residuals of our regression models and apply robust standard errors if necessary.

plot_summs(lm_norm, lm_ord, robust = list(FALSE, FALSE),
           model.names = c("REG SE Norm" , "Reg SE Ord"))

plot_summs(lm_norm, lm_ord, lm_ord, lm_ord, robust = list(FALSE, FALSE, "HC1", "HC3"),
           model.names = c("REG SE Norm" , "Reg SE Ord", "HC1 SE", "HC3 SE"))