There are several assumptions underpinning linear regression that should be tested. In this tutorial, we are going to look at well-behaved residuals by simulating a relationship between normally distributed variables (note that OLS assumes a normally distributed DV, not IVs, but here we use normal distributions for both). We will then compare these well-behaved residuals to an identical model where we force the variables to be ordinal - like they are in a lot of survey research - instead of normally distributed.
We will start with an existing set of multivariate ordinal data called `pop`, which consists of the 3 variables used in these models.
First, let's look at the means, standard deviations, and correlations among these three existing variables so we can match them in our simulated data drawn from a Gaussian distribution.
skim(pop)
Name | pop |
Number of rows | 2602 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | 2.21 | 1.37 | 1 | 1 | 1 | 4 | 4 | ▇▁▁▂▅ |
x | 0 | 1 | 2.86 | 1.14 | 1 | 2 | 3 | 4 | 4 | ▃▃▁▃▇ |
x2 | 0 | 1 | 1.96 | 1.01 | 1 | 1 | 2 | 2 | 5 | ▇▇▃▂▁ |
cor(pop)
## y x x2
## y 1.0000000 0.4095337 0.3449629
## x 0.4095337 1.0000000 0.2622368
## x2 0.3449629 0.2622368 1.0000000
Now that we know the means, standard deviations, and how these three variables are interrelated, we can use the `rnorm_multi` function from the `faux` package. This package allows us to simulate a multivariate distribution - i.e. a distribution where the value of one variable is at least partially conditional on the value of one or more additional variables - with pre-specified means, standard deviations, and correlations between items.
This simulation will create a new dataset containing our three original variables, with known population parameters that match our existing ordinal data. In other words, the primary difference between the two samples is whether the variables are drawn from a Gaussian or an ordered distribution. We will use this setup to investigate how OLS behaves on a 4-point survey scale versus a normally distributed variable.
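As a side note, you do not have to type the target values by hand; the short optional sketch below (not part of the original workflow, with illustrative object names) pulls them straight from `pop`. The `rnorm_multi` call that follows simply uses the rounded values.

target_mu <- colMeans(pop) # means of y, x, x2
target_sd <- apply(pop, 2, sd) # standard deviations of y, x, x2
target_r <- cor(pop)[lower.tri(cor(pop))] # correlations in the order (1-2, 1-3, 2-3)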
set.seed(2011)
n <- 7188
pop2 <- rnorm_multi(n = n, # Number of cases to simulate
                    mu = c(2.17, 2.82, 1.97), # Means for the 3 variables
                    sd = c(1.36, 1.15, 1.04), # Standard deviations for the 3 variables
                    r = c(0.4186282, .3456132, .2551263), # Pearson correlations between items - order (1-2, 1-3, 2-3)
                    varnames = c("y", "x", "x2"), # New names for the variables
                    empirical = FALSE) # Randomly simulate the data rather than forcing the values to be exact
We now have two population datasets that we can use to compare the performance of OLS residuals depending on the structure of the DV. Also remember that the populations were created to share (approximately) the same means, SDs, and correlations between variables, which means the betas from the OLS models are effectively known to us in advance.
skim(pop)
Name | pop |
Number of rows | 2602 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | 2.21 | 1.37 | 1 | 1 | 1 | 4 | 4 | ▇▁▁▂▅ |
x | 0 | 1 | 2.86 | 1.14 | 1 | 2 | 3 | 4 | 4 | ▃▃▁▃▇ |
x2 | 0 | 1 | 1.96 | 1.01 | 1 | 1 | 2 | 2 | 5 | ▇▇▃▂▁ |
skim(pop2)
Name | pop2 |
Number of rows | 7188 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
y | 0 | 1 | 2.16 | 1.37 | -2.32 | 1.24 | 2.19 | 3.08 | 6.94 | ▁▅▇▃▁ |
x | 0 | 1 | 2.82 | 1.15 | -1.26 | 2.04 | 2.81 | 3.61 | 6.96 | ▁▃▇▃▁ |
x2 | 0 | 1 | 1.95 | 1.05 | -1.86 | 1.25 | 1.95 | 2.65 | 6.03 | ▁▃▇▂▁ |
Testing the residuals of our OLS models allows us to test several assumptions about the error term. First, that our error variance is constant - i.e. homoskedastic - across values of Y. This measures how consistently the model predicts Y across the range of Y. Non-constant error variance does not bias our beta estimates, but it can make the standard errors the wrong size.
Next, we can test whether or not our error term is normally distributed - which is not the same thing as having constant variance. Normally distributed errors give us more confidence when significance-testing our beta weights, and appropriately sized standard errors are critical to this. Once again, violating this assumption will not lead to biased coefficients but rather to incorrectly sized standard errors.
Next, we draw the same-sized random sample of 600 from both datasets before running a linear regression for comparison. This simulates the fact that we typically only get sample data instead of having access to data on the entire population.
set.seed(1944)
sample_size <- 600
sample_ord <- pop[sample(nrow(pop), sample_size, replace = TRUE), ]
sample_norm <- pop2[sample(nrow(pop2), sample_size, replace = TRUE), ]
cor(sample_norm)
## y x x2
## y 1.0000000 0.504405 0.3971664
## x 0.5044050 1.000000 0.3394970
## x2 0.3971664 0.339497 1.0000000
cor(sample_ord)
## y x x2
## y 1.0000000 0.4290973 0.3250993
## x 0.4290973 1.0000000 0.2349794
## x2 0.3250993 0.2349794 1.0000000
Let's first estimate a linear regression on a sample from our multivariate normally distributed simulated data, where we set the correlations between the variables ourselves. We use the `lm` function to estimate our linear regression on `sample_norm`, predicting `y` as a function of `x` and `x2`.
Because we know the underlying relationships between `y`, `x`, and `x2`, we know roughly what coefficients we should observe in this model. The coefficient on `x` should be roughly .41, while the coefficient on `x2` should be roughly .35; the optional sketch below shows how these values follow from the parameters we fed to `rnorm_multi`.
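This sketch is illustrative only (not part of the original workflow): it solves for the population slopes implied by the correlations and standard deviations specified above, which work out to roughly .42 for `x` and .33 for `x2`.

r_xy <- c(0.4186282, 0.3456132) # cor(y, x) and cor(y, x2) from the rnorm_multi call
R_xx <- matrix(c(1, 0.2551263, 0.2551263, 1), nrow = 2) # correlation matrix of x and x2
sds <- c(y = 1.36, x = 1.15, x2 = 1.04) # specified standard deviations
beta_std <- solve(R_xx, r_xy) # standardized partial slopes
beta_raw <- beta_std * sds["y"] / sds[c("x", "x2")] # rescale to the raw units
round(beta_raw, 2) # roughly 0.42 for x and 0.33 for x2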
We also estimate the ordinal linear regression for comparison.
lm_norm <- lm(y ~ x + x2, data = sample_norm) # estimates the OLS model
summary(lm_norm) # Returns estimates of regression model
##
## Call:
## lm(formula = y ~ x + x2, data = sample_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3240 -0.7701 -0.0010 0.7193 3.1247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.11103 0.12886 0.862 0.389
## x 0.49929 0.04314 11.575 < 2e-16 ***
## x2 0.31521 0.04455 7.076 4.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.124 on 597 degrees of freedom
## Multiple R-squared: 0.3121, Adjusted R-squared: 0.3098
## F-statistic: 135.4 on 2 and 597 DF, p-value: < 2.2e-16
lm_ord <- lm(y ~ x + x2, data = sample_ord) # estimates the OLS model
summary(lm_ord) # Returns estimates of regression model
##
## Call:
## lm(formula = y ~ x + x2, data = sample_ord)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72697 -0.94108 -0.01582 1.05892 2.98418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22238 0.15347 1.449 0.148
## x 0.46263 0.04557 10.153 < 2e-16 ***
## x2 0.33082 0.05125 6.456 2.23e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.197 on 597 degrees of freedom
## Multiple R-squared: 0.2374, Adjusted R-squared: 0.2348
## F-statistic: 92.9 on 2 and 597 DF, p-value: < 2.2e-16
Results show us what we expected to see in terms of coefficient size, with `x` being close to .41 and `x2` being within the 95% CI of .35. With only 600 cases, you should not expect the coefficients to be exact. Let's briefly interpret this regression output. The intercept is the value of `y` when the independent variables are all equal to zero. Without a true zero for either IV, the intercept is meaningless here, so we ignore it.
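(If you did want an interpretable intercept, one common option - shown here as an illustrative sketch with made-up variable names - is to mean-center the IVs, so the intercept becomes the expected value of `y` for a case at the average of `x` and `x2`.)

# Sketch: mean-centering the IVs in a copy of the data; names are illustrative.
sn_centered <- transform(sample_norm,
                         x_c = x - mean(x),
                         x2_c = x2 - mean(x2))
coef(lm(y ~ x_c + x2_c, data = sn_centered)) # slopes unchanged; intercept now equals the mean of y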
Looking at the independent variables, both show a significant relationship with `y`, as indicated in several ways. First, the p-value is below .05 (the *** actually marks p < .001). Next, the t-value is at or above 1.96, the standard cut point for treating an estimate as significant. Finally, the estimate itself is more than twice the size of its standard error, which also indicates a significant relationship.
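A quick way to see the same conclusion is to ask for 95% confidence intervals: an interval that excludes zero corresponds to p < .05 for that coefficient.

# 95% confidence intervals for the coefficients; intervals that exclude zero
# indicate significance at the .05 level.
confint(lm_norm, level = 0.95)
confint(lm_ord, level = 0.95)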
For `x`, the interpretation is that for every 1-unit increase in `x`, `y` increases on average by the value of its coefficient (about .50 in the normal-DV model, .46 in the ordinal one), holding `x2` constant. For `x2`, the interpretation is that for every 1-unit increase in `x2`, `y` increases on average by about .32 (about .33 in the ordinal model), holding `x` constant.
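To make the "holding the other variable constant" idea concrete, you can compare predictions for two hypothetical cases that differ by exactly one unit on `x` (the values below are made up purely for illustration).

# Sketch: two hypothetical cases that differ only by 1 unit on x, with x2 held fixed.
nd <- data.frame(x = c(2, 3), x2 = c(2, 2))
diff(predict(lm_norm, newdata = nd)) # equals the coefficient on x from lm_norm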
Overall, the F-statistic tells us that our model does better than chance at predicting `y`, which we already knew since we checked the significance of the individual predictors. It is rare, verging on impossible, to have a significant IV and a non-significant F-statistic, so the F-statistic usually does not give us unique information and does not need much attention in interpretation.
Now that we have estimated and interpreted the model, let's review the model residuals. Remember, a residual is simply the distance between the predicted value of `y` and the actual value of `y`. Larger residuals mean a bigger miss in predicting `y` for that case.
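For example, you can recover the residuals by hand as the observed `y` minus the fitted `y` (the object name below is just illustrative).

# Residuals by hand: observed y minus the model's fitted (predicted) y.
by_hand <- sample_norm$y - fitted(lm_norm)
all.equal(unname(by_hand), unname(resid(lm_norm))) # should be TRUE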
The easiest way to review residuals is to plot them. After you save your OLS model, you can easily plot the residuals using the `plot` function. Without changes to the code, `plot(model_name)` will cycle through several diagnostic charts (four by default, up to six via the `which` argument) with different looks at your residuals (and influential data points) that you have to step through to evaluate. We will simplify this by using the `par` function and the `which` argument to graph the specific residual plots we want to evaluate.
plot(lm_norm) # Cycles through the default regression diagnostic plots
plot(lm_ord) # Same diagnostic plots for the ordinal-DV model
par(mfrow = c(1,2)); plot(lm_norm, which = 3) # Homoskedasticity check (scale-location plot); 1 row, 2 columns to compare side-by-side
plot(lm_ord, which = 3) # Adds the same homoskedasticity plot for the ordinal DV
Here we plot checks for heteroskedasticity comparing the Gaussian distributed DV on the left and the ordinally distributed DV on the right. The model with normally distributed variables has a more consistent and random error variance across all levels of the DV. The model with the ordinal dependent variable shows a clear pattern in the residuals indicating non-constant error variance in the model.
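If you prefer a formal test to go along with the plots, one option (an optional sketch, not part of the original workflow) is the Breusch-Pagan test from the `lmtest` package, where a small p-value is evidence of heteroskedasticity.

# Optional sketch: Breusch-Pagan test for non-constant error variance.
library(lmtest)
bptest(lm_norm) # normal DV
bptest(lm_ord) # ordinal DV; given the patterned plot, we would expect stronger evidence here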
par(mfrow = c(1,2)); plot(lm_norm, which = 2) # Normal Q-Q plot of the residuals; 1 row, 2 columns to compare side-by-side
plot(lm_ord, which = 2) # Adds the same normality check for the ordinal DV
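These Q-Q plots check whether the residuals are approximately normally distributed. As an optional complement, you can run a formal test such as Shapiro-Wilk on the residuals (fine here, since n = 600 is well under that test's 5,000-case limit).

# Optional sketch: Shapiro-Wilk normality test on the residuals.
# A small p-value indicates a departure from normality.
shapiro.test(resid(lm_norm))
shapiro.test(resid(lm_ord))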
With this pattern of results indicating clear heteroskedasticity, we should be concerned that our standard errors for the coefficients are not being accurately estimated. To adjust for this, we can use robust standard errors in our regression models.
To use robust standard error adjustments, we must load the `lmtest` package into R. We will focus on two types of robust standard errors here:

- Method 1 ("HC1"): uses squared residuals to penalize unequal variance
- Method 3 ("HC3"): uses leverage values for each individual case to adjust the size of the SE

Other adjustments deal with autocorrelation, which we will learn how to account for using multilevel models instead.
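To see how the flavors compare before picking one, you can pull the standard errors straight from the variance-covariance matrices supplied by the `sandwich` package (the same `vcovHC` function used with `coeftest` below); this is an optional sketch.

# Optional sketch: classical ("const"), HC1, and HC3 standard errors for the
# ordinal-DV model, extracted directly from vcovHC().
library(sandwich)
sapply(c("const", "HC1", "HC3"),
       function(type) sqrt(diag(vcovHC(lm_ord, type = type))))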
We will look at two approaches to estimating robust standard errors: first using the `sandwich` package and then using the `jtools` package.
Approach 1
To apply robust standard errors, first estimate the regression model without any changes and save the model. Then we will use that saved model to reestimate the standard errors with each of the two types of robust adjustments.
We save these new estimated models with the names `robust1` and `robust3` to then compare to the original model. The code to apply robust standard errors is straightforward: we feed the original model name into the code at two different places, specify which robust standard error we want to use (`HC1` or `HC3`), and keep the rest of the code the same. We then use `stargazer` to compare across the models.
lm_ord <- lm(y ~ x + x2, data = sample_ord) # estimates the OLS model
summary(lm_ord) # Returns estimates of regression model
##
## Call:
## lm(formula = y ~ x + x2, data = sample_ord)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72697 -0.94108 -0.01582 1.05892 2.98418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.22238 0.15347 1.449 0.148
## x 0.46263 0.04557 10.153 < 2e-16 ***
## x2 0.33082 0.05125 6.456 2.23e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.197 on 597 degrees of freedom
## Multiple R-squared: 0.2374, Adjusted R-squared: 0.2348
## F-statistic: 92.9 on 2 and 597 DF, p-value: < 2.2e-16
# new_name <- coeftest(model_name, vcov = vcovHC(model_name, type = "HC#"))
robust1 <- coeftest(lm_ord, vcov = vcovHC(lm_ord, type = "HC1"))
robust3 <- coeftest(lm_ord, vcov = vcovHC(lm_ord, type = "HC3"))
stargazer(lm_norm, lm_ord, robust1, robust3, type="text")
##
## =====================================================================
## Dependent variable:
## --------------------------------------
## y
## OLS coefficient
## test
## (1) (2) (3) (4)
## ---------------------------------------------------------------------
## x 0.499*** 0.463*** 0.463*** 0.463***
## (0.043) (0.046) (0.045) (0.045)
##
## x2 0.315*** 0.331*** 0.331*** 0.331***
## (0.045) (0.051) (0.059) (0.059)
##
## Constant 0.111 0.222 0.222* 0.222*
## (0.129) (0.153) (0.132) (0.133)
##
## ---------------------------------------------------------------------
## Observations 600 600
## R2 0.312 0.237
## Adjusted R2 0.310 0.235
## Residual Std. Error (df = 597) 1.124 1.197
## F Statistic (df = 2; 597) 135.439*** 92.904***
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Approach 2 - JTools
The second approach you can use to estimate robust standard errors is the `jtools` package. This approach is more straightforward and easier to implement than approach 1 but, as you will see, returns the same results (as it should). Here, we use the `export_summs` function from the `jtools` package and explicitly stipulate which robust standard error form we want to use for the ordinal linear model.
lm_ord <- lm(y ~ x + x2, data = sample_ord) # estimates the OLS model

export_summs(lm_ord, robust = "HC1", digits = 3)
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
Model 1 | |
---|---|
(Intercept) | 0.222 |
(0.132) | |
x | 0.463 *** |
(0.045) | |
x2 | 0.331 *** |
(0.059) | |
N | 600 |
R2 | 0.237 |
Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05. |
export_summs(lm_ord, robust = "HC3", digits=3)
Model 1 | |
---|---|
(Intercept) | 0.222 |
(0.133) | |
x | 0.463 *** |
(0.045) | |
x2 | 0.331 *** |
(0.059) | |
N | 600 |
R2 | 0.237 |
Standard errors are heteroskedasticity robust. *** p < 0.001; ** p < 0.01; * p < 0.05. |
export_summs(lm_norm, lm_ord, lm_ord, lm_ord, robust = list(FALSE, FALSE, "HC1", "HC3"),
model.names = c("REG SE Norm" , "Reg SE Ord", "HC1 SE", "HC3 SE"), digits=3)
REG SE Norm | Reg SE Ord | HC1 SE | HC3 SE | |
---|---|---|---|---|
(Intercept) | 0.111 | 0.222 | 0.222 | 0.222 |
(0.129) | (0.153) | (0.132) | (0.133) | |
x | 0.499 *** | 0.463 *** | 0.463 *** | 0.463 *** |
(0.043) | (0.046) | (0.045) | (0.045) | |
x2 | 0.315 *** | 0.331 *** | 0.331 *** | 0.331 *** |
(0.045) | (0.051) | (0.059) | (0.059) | |
N | 600 | 600 | 600 | 600 |
R2 | 0.312 | 0.237 | 0.237 | 0.237 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
The first thing to note is that the coefficients do not change with the application of robust standard errors. That is correct: we are not changing the estimate of the relationship between the IV and DV. Rather, we are changing our certainty about it, i.e. the standard error of the estimate.
Using robust standard errors does not have to increase the size of the standard errors of the estimates. In fact, for the `x` variable, the robust standard errors are actually slightly smaller than in the unadjusted model. For the `x2` variable, the standard error is slightly larger in the adjusted models compared to the unadjusted one.
In this instance, the conclusions do not change between model types and the overall difference in standard error size is small, so the robust standard error adjustments are not strictly needed. This will not always be the case. If the robust errors were much larger and led to a loss of significance, and the model truly had heteroskedasticity in its residuals, then we would report the model using the robust standard errors.
We want to be conservative in what we consider significant, which is why we want to evaluate the residuals of our regression models and apply robust standard errors if necessary.
plot_summs(lm_norm, lm_ord, robust = list(FALSE, FALSE),
           model.names = c("REG SE Norm", "Reg SE Ord")) # Coefficient plot for the two unadjusted models
plot_summs(lm_norm, lm_ord, lm_ord, lm_ord, robust = list(FALSE, FALSE, "HC1", "HC3"),
           model.names = c("REG SE Norm", "Reg SE Ord", "HC1 SE", "HC3 SE")) # Adds the HC1 and HC3 robust-SE versions for comparison