Adjustment is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two. It matters because a third variable can often distort, or confound, the relationship between two others (Caffo, 2015).
With confounding variables, the problem is one of omission: an important variable is not included in the regression equation. Naïve interpretation of the equation coefficients can lead to invalid conclusions (Bruce and Bruce, 2017). But how do we determine whether confounding is present? Many researchers use the 10% rule of thumb to answer that question. The rule checks whether the parameter estimate for your predictor of interest changes by more than 10% from the unadjusted, or crude, estimate (from simple linear regression) to the adjusted estimate (from multiple linear regression).
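Written out, the rule compares the two estimates as follows. The symbols are notation introduced here for illustration and do not come from the cited sources. Taking absolute values is an assumption, made so that a change in either direction counts:

```latex
\[
\%\,\text{change}
  \;=\;
  \frac{\left|\hat{\beta}_{\text{crude}} - \hat{\beta}_{\text{adjusted}}\right|}
       {\left|\hat{\beta}_{\text{crude}}\right|} \times 100,
\qquad
\text{confounding is suspected when } \%\,\text{change} > 10 .
\]
```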
This vignette assesses confounding in the Ames Housing dataset used in the Kaggle competition “House Prices: Advanced Regression Techniques”.
Using the example of house sale price and its association with lot size in square feet, let's walk through the steps of the 10% rule of thumb.
##   SalePrice LotArea OverallQual Total.Bath TotRmsAbvGrd GarageCars GarageArea
## 1    208500    8450           7          3            8          2        548
## 2    181500    9600           6          3            6          2        460
## 3    223500   11250           7          3            6          2        608
## 4    140000    9550           7          1            7          3        642
## 5    250000   14260           8          3            9          3        836
## 6    143000   14115           5          2            5          2        480
In this step we are going to run a simple linear regression model, where the dependent variable is the house sale price and the independent variable is the lot size in square feet.
The parameter estimate is the coefficient of LotArea (lot size in square feet) in the summary output of the regression function:
##
## Call:
## lm(formula = SalePrice ~ LotArea, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275668 -48169 -17725 31248 553356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.588e+05 2.915e+03 54.49 <2e-16 ***
## LotArea 2.100e+00 2.011e-01 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.06898
## F-statistic: 109.1 on 1 and 1458 DF, p-value: < 2.2e-16
We found a parameter estimate of 2.1 for lot size, which can be interpreted to mean that for every additional square foot of lot size, we expect the house price to increase by 2.1 USD, on average.
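The crude model can be sketched in R as below. The formula is the one shown in the Call line of the output above. Because the full data set is not bundled here, the sketch uses only the six rows printed earlier, stored in a hypothetical data frame named `toy`. Its estimate will therefore not match the 2.1 obtained from all 1460 observations.

```r
# The six rows printed earlier in this vignette (only the two columns
# needed for the crude model). This is a toy subset for illustration, so
# the slope it produces will differ from the full-data estimate of 2.1.
toy <- data.frame(
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000),
  LotArea   = c(8450, 9600, 11250, 9550, 14260, 14115)
)

# Crude (unadjusted) model: sale price regressed on lot area only.
crude_fit <- lm(SalePrice ~ LotArea, data = toy)

# The parameter estimate of interest is the LotArea slope.
crude_estimate <- coef(crude_fit)[["LotArea"]]
print(crude_estimate)
```

Replacing `toy` with the full data frame and calling `summary(crude_fit)` would give a coefficient table like the one shown above.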
In this step we are going to run a multiple linear regression model, where the dependent variable is the house sale price and the independent variables are the lot size in square feet, overall quality, number of bathrooms, total rooms above grade, garage capacity in cars and garage size in square feet.
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + Total.Bath +
## +TotRmsAbvGrd + GarageCars + GarageArea, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303510 -23297 -2932 16927 367941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.253e+05 5.701e+03 -21.977 < 2e-16 ***
## LotArea 1.035e+00 1.119e-01 9.245 < 2e-16 ***
## OverallQual 3.137e+04 1.069e+03 29.333 < 2e-16 ***
## Total.Bath 6.304e+03 1.784e+03 3.533 0.000424 ***
## TotRmsAbvGrd 7.870e+03 8.440e+02 9.324 < 2e-16 ***
## GarageCars 7.406e+03 3.270e+03 2.264 0.023691 *
## GarageArea 5.699e+01 1.097e+01 5.194 2.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41520 on 1453 degrees of freedom
## Multiple R-squared: 0.728, Adjusted R-squared: 0.7269
## F-statistic: 648.1 on 6 and 1453 DF, p-value: < 2.2e-16
From this analysis, we found a parameter estimate for lot size of 1.035, which can be interpreted to mean that for every additional square foot of lot size, we expect the house price to increase by 1.035 USD on average, adjusting for overall quality, number of bathrooms, total rooms above grade, garage capacity in cars and garage size in square feet.
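A minimal sketch of the adjusted fit follows, using the formula from the Call line of the output above. The helper names `fit_adjusted` and `lot_area_estimate` and the argument name `housing` are hypothetical, introduced only for illustration. `housing` is assumed to be a data frame containing the seven columns used in this vignette.

```r
# Adjusted model: same outcome and predictor of interest as the crude
# model, plus the five candidate confounders from the output above.
# `housing` is an assumed data frame holding all seven columns.
fit_adjusted <- function(housing) {
  lm(
    SalePrice ~ LotArea + OverallQual + Total.Bath + TotRmsAbvGrd +
      GarageCars + GarageArea,
    data = housing
  )
}

# Extract only the LotArea slope, the estimate that the 10% rule compares.
lot_area_estimate <- function(fit) {
  coef(fit)[["LotArea"]]
}
```

Calling `lot_area_estimate(fit_adjusted(data))` on the vignette's data frame would return the LotArea coefficient reported in the summary table.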
The estimate for lot size drops from 2.100 in the crude model to 1.035 in the adjusted model, a change of (2.100 - 1.035) / 2.100 ≈ 0.51, or 51%. Since this is greater than 10%, the rule of thumb indicates that the association between house price and lot size is confounded by the added variables: overall quality, number of bathrooms, total rooms above grade, garage capacity in cars and garage size in square feet.
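The arithmetic behind that percentage can be checked with a few lines of R. The two coefficients are copied from the summary outputs above. The variable names are illustrative, and the use of absolute values is an assumption so that a change in either direction is treated the same way.

```r
# LotArea coefficients as reported in the two summary outputs above.
crude_estimate    <- 2.100
adjusted_estimate <- 1.035

# Percentage change from the crude to the adjusted estimate,
# measured relative to the crude estimate.
percent_change <- abs(crude_estimate - adjusted_estimate) /
  abs(crude_estimate) * 100

# 10% rule of thumb: flag confounding when the change exceeds 10%.
confounded <- percent_change > 10

print(round(percent_change, 1))  # 50.7, which rounds to the 51% quoted
print(confounded)                # TRUE
```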
Adding those variables to the model also raises R-squared from 0.07 to 0.728, which means that the adjusted model explains about 73% of the variance in the house sale price, compared with about 7% for lot size alone.
Since confounding is present, we should present the results from the adjusted analysis.
Caffo, B. (2015), Regression Models for Data Science in R. Victoria: Leanpub.
Bruce, P. and Bruce, A. (2017), Practical Statistics for Data Scientists. Sebastopol, CA: O’Reilly.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2015), An Introduction to Statistical Learning with Applications in R. New York, NY: Springer-Verlag.