For my code-through, I have decided to explain the process of checking a linear regression model against the Ordinary Least Squares (OLS) assumptions. This is an important step in any data analysis because it can reveal aspects of the model that are not discovered by simply running a regression and reading off the p-values, R-squared, and so on. By running simple diagnostics in R, we can get a better picture of how well the model fits the data.

Set Up

Now that the data have been extracted and cleaned, we can set up our regression analysis. We will use the home affordability ratio as the dependent variable, as this is a subject we have been exploring throughout the class. Some combination of the other variables can be used as independent and control variables. Results can be displayed with the ‘stargazer’ package.
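A minimal sketch of how this model might be fit and displayed, assuming the cleaned tract-level data live in a data frame called `spokane` (that name is an assumption; the variables are the ones reported in the table below):

```r
# Load stargazer for formatted regression tables
library(stargazer)

# Fit the OLS model; `spokane` is a hypothetical data frame holding the cleaned data
model <- lm(home_affordability_ratio ~ pct_bach + pct_poverty +
              pct_white + pct_married + pct_foreign,
            data = spokane)

# Display the results as a text table
stargazer(model, type = "text",
          title = "Effects of Resident Demographics on Home Affordability in Spokane County, WA")
```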

Effects of Resident Demographics on Home Affordability in Spokane County, WA
=============================================================
                              Dependent variable:
                         ---------------------------
                          home_affordability_ratio
-------------------------------------------------------------
pct_bach                        0.192*** (0.038)
pct_poverty                     0.143*** (0.027)
pct_white                       0.071**  (0.027)
pct_married                    -0.058**  (0.025)
pct_foreign                     0.048    (0.056)
Constant                       -4.722**  (2.336)
-------------------------------------------------------------
Observations                    104
R2                              0.464
Adjusted R2                     0.437
Residual Std. Error             1.400 (df = 98)
F Statistic                     16.958*** (df = 5; 98)
=============================================================
Note:                           *p<0.1; **p<0.05; ***p<0.01

OLS Assumptions

The results show a fairly strong model, with two independent variables (pct_bach and pct_poverty) significant at the 1% level and two more (pct_white and pct_married) significant at the 5% level. The adjusted R-squared of 0.437 indicates a moderately successful explanation of variance. Many scholars (myself included) would normally stop at this point and call it a day. However, as we have learned in several statistics courses, this would skip a crucial step: checking the quality of the model. The regression was completed under the five key OLS assumptions:

Assumption I: The linear regression model is “linear in parameters”

By checking this assumption we are making sure that the model is linear in its coefficients (the betas): each beta enters the equation only by multiplying its corresponding independent variable, with the terms added together.
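Written out for this model, the estimating equation takes the standard additive form (ε is the usual error term):

home_affordability_ratio = β0 + β1(pct_bach) + β2(pct_poverty) + β3(pct_white) + β4(pct_married) + β5(pct_foreign) + ε

Because every beta appears only as a multiplier on its own variable, the model is linear in its parameters even if the variables themselves are percentages or other transformations.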

Assumption II: There is a random sampling of observations

This ensures that the sample used in the regression comes from a truly random sample of the study population, with a greater number of observations than parameters, and with independent variables that are fixed in repeated sampling rather than themselves determined by the dependent variable. Also inside this assumption is the randomness of the error terms.

Assumption III: The conditional mean should be zero

This checks that there is no relationship between the independent variables and the error terms, which should have a mean of zero.
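A quick sanity check of this in R (a minimal sketch, assuming the fitted model object is called `model` as in the setup sketch above):

```r
# The residuals should average out to essentially zero
mean(residuals(model))

# And they should be essentially uncorrelated with each independent variable;
# shown here for pct_bach, and the same check applies to the other regressors
cor(residuals(model), model.frame(model)$pct_bach)
```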

Assumption IV: There is no multi-collinearity (or perfect collinearity)

This assumption simply means that there is no exact linear relationship among the independent variables. The more independent variation each regressor contributes, the more precise the OLS estimates turn out.
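A common numeric check for this in R is the variance inflation factor from the ‘car’ package; a minimal sketch (the rule-of-thumb cutoff mentioned in the comment is mine, not part of the analysis above):

```r
library(car)

# Variance inflation factors for each independent variable;
# values creeping above roughly 5-10 are usually read as a multicollinearity warning
vif(model)
```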

Assumption V: There is homoskedasticity and no autocorrelation

Finally, this assumption requires that the variances of the error terms do not depend on the independent variables, and that the error terms are ideally equal in variance and not correlated with each other.
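Formal tests for both properties are available in the ‘lmtest’ package; a minimal sketch (these tests supplement, and are not part of, the output above):

```r
library(lmtest)

# Breusch-Pagan test: a small p-value suggests heteroskedasticity
bptest(model)

# Durbin-Watson test: a statistic far from 2 suggests autocorrelated errors
dwtest(model)
```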

Diagnostic Tools in R

Now, back to our regression. Our results can be checked against these OLS assumptions using simple diagnostic tools in R.

Residuals vs. Fitted

The first of these is the “Residuals vs Fitted” plot, which shows how the residual values vary with the fitted values. If the model is truly linear, the residuals should scatter evenly around zero with no obvious pattern, satisfying OLS assumption I.
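In base R this is the first plot produced by calling `plot()` on a fitted `lm` object:

```r
# Residuals vs Fitted is the first of the default lm diagnostic plots
plot(model, which = 1)
```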

The above plot shows residuals spread fairly evenly around zero across the fitted values, with the exception of two outliers at the far right end of the distribution. Thus the linearity assumption is satisfied.

Normal Q-Q

This plot shows the distribution of residuals across the model. Of particular interest is whether the residuals follow a normal distribution, which shows up as points falling along a straight line. Many deviations from normality can be seen as a violation of OLS assumptions II and III.
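This is the second of the default `lm` diagnostic plots:

```r
# Normal Q-Q plot of the standardized residuals
plot(model, which = 2)
```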

The above plot indicates an approximately normal distribution of residual values, as shown by how tightly the points hug the line.

Scale-Location

This plot illustrates how evenly the residuals are spread along the range of fitted values, and thus a model’s satisfaction of OLS assumption V. An ideal model would produce a plot with a random spread of residuals, indicated by a roughly straight, horizontal red line.
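This is the third of the default `lm` diagnostic plots:

```r
# Scale-Location plot of the square root of the standardized residuals
plot(model, which = 3)
```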

The above Scale-Location plot shows standardized residuals that are lower and more widely spaced in the middle of the range, while both ends are higher and more tightly packed. This is what causes the red line to be sloping and non-linear. This could be an indication of heteroskedasticity and thus a violation of OLS assumption V.

Residuals vs. Leverage

This plot can be used to determine the influence of extreme values on the model. A point that falls outside the limits of “Cook’s distance” (past the red dashed hyperbolic lines in the top- and bottom-right portions of the plot) is considered influential to the regression results, meaning that excluding it would alter those results.
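This panel is produced with:

```r
# Residuals vs Leverage, with Cook's distance contours
plot(model, which = 5)
```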

The above plot shows one single extreme point in the model, point 28; excluding it from the model would alter the regression results.

Looking at them all together shows us the overall fit of our model against the OLS assumptions.
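One common way to produce all four panels at once:

```r
# Arrange the four default diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # reset the plotting layout
```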

Other plots, such as correlation plots from the “corrplot” package, are useful in tandem with these four diagnostic figures; a short sketch of one is included below. I hope this demonstration has achieved its goal of communicating the importance of checking OLS assumptions on a linear regression, and of providing a clear path to performing these checks in R.
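As a closing example, here is one way such a correlation plot might be produced, again assuming the hypothetical `spokane` data frame from the setup sketch:

```r
library(corrplot)

# Correlation matrix of the independent variables used in the model
ind_vars <- spokane[, c("pct_bach", "pct_poverty", "pct_white",
                        "pct_married", "pct_foreign")]
corrplot(cor(ind_vars, use = "complete.obs"), method = "circle")
```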