Table of contents

  1. Introduction
  2. Data
    1. Step 1
    2. Step 2
    3. Step 3
  3. Conclusion
  4. References

Introduction

Adjustment is the idea of putting regressors into a linear model to investigate the role of a third variable in the relationship between another two. It matters because a third variable can often distort, or confound, the relationship between the other two (Caffo, 2015).

With confounding variables, the problem is one of omission: an important variable is not included in the regression equation, and naïve interpretation of the equation coefficients can lead to invalid conclusions (Bruce and Bruce, 2017). But how do we determine whether confounding is present? Many researchers use the 10% rule of thumb to answer that question. The rule compares the parameter estimate for the predictor of interest in the unadjusted, or crude, model (simple linear regression) with the estimate in the adjusted model (multiple linear regression). If the estimate changes by more than 10%, confounding is considered to be present.
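Written out as a formula, the rule amounts to the comparison below. The symbols are notation introduced here for illustration and are not taken from the cited sources: the crude estimate is the coefficient from the simple regression, and the adjusted estimate is the coefficient on the same predictor from the multiple regression. The form follows the calculation used in Step 3 of this vignette, and confounding is flagged when the magnitude of the result exceeds 10.

```latex
\[
\text{Percentage change}
  = \frac{\hat{\beta}_{\text{crude}} - \hat{\beta}_{\text{adjusted}}}
         {\hat{\beta}_{\text{crude}}} \times 100
\]
```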

This vignette assesses confounding in the Ames Housing dataset, taken from the Kaggle competition “House Prices: Advanced Regression Techniques”.

Data

Using the example of house sale price and its association with lot size in square feet, let's walk through the steps of the 10% rule of thumb. The first six rows of the variables used are shown below.

##   SalePrice LotArea OverallQual Total.Bath TotRmsAbvGrd GarageCars GarageArea
## 1    208500    8450           7          3            8          2        548
## 2    181500    9600           6          3            6          2        460
## 3    223500   11250           7          3            6          2        608
## 4    140000    9550           7          1            7          3        642
## 5    250000   14260           8          3            9          3        836
## 6    143000   14115           5          2            5          2        480

Step 1: Find the parameter estimate for Lot size in square feet from a simple linear regression

In this step we run a simple linear regression model in which the dependent variable is the house sale price and the independent variable is the lot size in square feet.

The parameter estimate is the coefficient of the variable Lot size in square feet in the summary output of the regression function:

## 
## Call:
## lm(formula = SalePrice ~ LotArea, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -275668  -48169  -17725   31248  553356 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.588e+05  2.915e+03   54.49   <2e-16 ***
## LotArea     2.100e+00  2.011e-01   10.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared:  0.06961,    Adjusted R-squared:  0.06898 
## F-statistic: 109.1 on 1 and 1458 DF,  p-value: < 2.2e-16

We found a parameter estimate of 2.1 for lot size, which means that for every one-square-foot increase in lot size, we expect the sale price to increase by 2.1 USD, on average.
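The chunk that produced this output is not shown, but the `Call:` line suggests it looked roughly like the sketch below. To keep the example self-contained, `data` is rebuilt here from only the six preview rows printed earlier, so the estimates it prints will not match the full-data output above (which was fitted on 1,460 rows). The name `crude_estimate` is introduced here for illustration.

```r
# Six preview rows copied from the table in the Data section.
# ASSUMPTION: this is a stand-in for the full training data, so the
# fitted numbers will differ from the output reported in the vignette.
data <- data.frame(
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000),
  LotArea   = c(8450, 9600, 11250, 9550, 14260, 14115)
)

# Crude (unadjusted) model: sale price regressed on lot area only
model1 <- lm(SalePrice ~ LotArea, data = data)
summary(model1)

# Pull out the slope by name rather than by position
crude_estimate <- coef(model1)["LotArea"]
crude_estimate
```

Extracting the coefficient by name keeps the code correct even if the order of terms in the formula changes later.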

Step 2: Find the parameter estimate for Lot size in square feet from a multiple linear regression which adjusts for potential confounders

In this step we run a multiple linear regression model in which the dependent variable is the house sale price and the independent variables are the lot size in square feet, overall quality, number of bathrooms, total rooms above grade, garage capacity in cars and garage size in square feet.

## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + Total.Bath + 
##     +TotRmsAbvGrd + GarageCars + GarageArea, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303510  -23297   -2932   16927  367941 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.253e+05  5.701e+03 -21.977  < 2e-16 ***
## LotArea       1.035e+00  1.119e-01   9.245  < 2e-16 ***
## OverallQual   3.137e+04  1.069e+03  29.333  < 2e-16 ***
## Total.Bath    6.304e+03  1.784e+03   3.533 0.000424 ***
## TotRmsAbvGrd  7.870e+03  8.440e+02   9.324  < 2e-16 ***
## GarageCars    7.406e+03  3.270e+03   2.264 0.023691 *  
## GarageArea    5.699e+01  1.097e+01   5.194 2.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41520 on 1453 degrees of freedom
## Multiple R-squared:  0.728,  Adjusted R-squared:  0.7269 
## F-statistic: 648.1 on 6 and 1453 DF,  p-value: < 2.2e-16

From this analysis, we found a parameter estimate for lot size of 1.035, which means that for every one-square-foot increase in lot size, we expect the sale price to increase by 1.035 USD, on average, after adjusting for overall quality, number of bathrooms, total rooms above grade, garage capacity in cars and garage size in square feet.
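Why does the estimate shrink once other variables enter the model? A small simulation can make the mechanism visible. The sketch below is not run on the housing data: the variables `lot`, `quality` and `price`, the sample size, the seed and the true coefficients are all invented for illustration. Because `quality` drives both `lot` and `price`, leaving it out should inflate the crude slope for `lot`, while the adjusted model should recover something close to the true value of 1.

```r
set.seed(42)
n <- 500

# ASSUMPTION: simulated data with a known structure, not the housing data
quality <- rnorm(n)                                # the confounder
lot     <- 0.8 * quality + rnorm(n)                # exposure depends on the confounder
price   <- 1.0 * lot + 2.0 * quality + rnorm(n)    # true effect of lot is 1.0
sim     <- data.frame(price, lot, quality)

# Crude model omits the confounder; adjusted model includes it
crude_fit    <- lm(price ~ lot, data = sim)
adjusted_fit <- lm(price ~ lot + quality, data = sim)

crude_slope    <- unname(coef(crude_fit)["lot"])
adjusted_slope <- unname(coef(adjusted_fit)["lot"])

# The crude slope absorbs part of the effect of quality
c(crude = crude_slope, adjusted = adjusted_slope)
```

The same logic is one plausible reading of the housing result: larger lots may tend to come with higher quality and bigger garages, so the crude slope would pick up part of their effect on price.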

Step 3: Calculate the percentage change in the parameter estimate and determine whether confounding is present

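# Percentage change from the crude (model1) to the adjusted (model2) LotArea estimate
Percentage_Change <- (coef(model1)["LotArea"] - coef(model2)["LotArea"]) / coef(model1)["LotArea"] * 100

# The same calculation using the rounded estimates reported above
Percentage_Change <- (2.100 - 1.035) / 2.100 * 100

Percentage_Change
## [1] 50.71429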
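The three steps can also be wrapped into a single helper. The sketch below is one possible arrangement, not the code behind this vignette: the function name `confounding_check`, its arguments and the default `threshold` are hypothetical. It is demonstrated on R's built-in `mtcars` data purely so that the example runs on its own.

```r
# Hypothetical helper: fits the crude and adjusted models and applies the 10% rule.
# Note: rows with missing covariates are dropped only from the adjusted fit,
# so the data should be complete for a fair comparison.
confounding_check <- function(data, outcome, exposure, covariates, threshold = 10) {
  # Step 1: crude model, outcome ~ exposure
  crude_fit <- lm(reformulate(exposure, response = outcome), data = data)
  # Step 2: adjusted model, outcome ~ exposure + covariates
  adjusted_fit <- lm(reformulate(c(exposure, covariates), response = outcome), data = data)

  crude    <- unname(coef(crude_fit)[exposure])
  adjusted <- unname(coef(adjusted_fit)[exposure])

  # Step 3: percentage change relative to the crude estimate
  pct_change <- (crude - adjusted) / crude * 100

  list(crude = crude, adjusted = adjusted, pct_change = pct_change,
       exceeds_threshold = abs(pct_change) > threshold)
}

# Stand-in example on built-in data: is the mpg ~ wt slope sensitive to hp?
result <- confounding_check(mtcars, "mpg", "wt", "hp")
result
```

With the full housing data loaded as `data`, the equivalent call would pass `"SalePrice"` as the outcome, `"LotArea"` as the exposure, and the five remaining column names as covariates.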

Conclusion

Since the percentage change is about 51%, well above the 10% threshold, the rule of thumb indicates that the association between sale price and lot size is confounded by overall quality, number of bathrooms, total rooms above grade, garage capacity in cars and garage size in square feet.

In addition, adding those variables to the model increases R-squared from 0.07 to 0.73. The adjusted model as a whole therefore explains about 73% of the variance in sale price, compared with about 7% for lot size alone.

Since confounding is present, we should present the results from the adjusted analysis.

References

Caffo, B. (2015), Regression Models for Data Science in R, Victoria: Leanpub.

Bruce, P. and Bruce, A. (2017), Practical Statistics for Data Scientists, Sebastopol, CA: O’Reilly.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2015), An Introduction to Statistical Learning with Applications in R, New York, NY: Springer-Verlag.