Introduction

Adjustment is the idea of adding regressors to a linear model to investigate the role of a third variable in the relationship between two others (Caffo 2019). A confounder is an extraneous variable whose presence affects the variables being studied, so that the results do not reflect the actual relationship between them (Pourhoseingholi, Baghestani, and Vahedi 2012). How can we identify confounding? According to Lee (2014), a common rule of thumb is that if the parameter estimate for the predictor of interest changes by more than 10% from its unadjusted value once a third variable is added, that variable is treated as a confounder of the relationship between the predictor and the target.
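The 10% rule can be expressed as a small calculation on two fitted models. The sketch below is illustrative only: `change_in_estimate()` is a hypothetical helper name, and the models are fitted on R's built-in `mtcars` data as a stand-in, not on the car dataset analysed in this vignette.

```r
# Hypothetical helper: percent change in a coefficient after adjustment,
# relative to the unadjusted estimate.
change_in_estimate <- function(unadjusted, adjusted) {
  100 * (adjusted - unadjusted) / unadjusted
}

# Stand-in data: built-in mtcars, NOT the vignette's car dataset.
unadjusted_fit <- lm(mpg ~ wt, data = mtcars)
adjusted_fit   <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Compare the coefficient of the predictor of interest (wt) in both models.
cie <- change_in_estimate(
  unadjusted = coef(unadjusted_fit)[["wt"]],
  adjusted   = coef(adjusted_fit)[["wt"]]
)

# Apply the 10% rule of thumb to the absolute percent change.
confounding_flag <- abs(cie) > 10
print(round(cie, 1))
print(confounding_flag)
```

The sign of the result depends on the direction of the change, so the threshold is applied to the absolute value.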

Fig. 1 The principle of confounding Murray and Duggan (2011)

This vignette assesses confounding in the car dataset taken from Kaggle. The dataset consists of 205 records of cars and their attributes, such as the number of engine cylinders, fuel type, number of doors, fuel consumption, and price. We will assess the relationship between city-driving fuel consumption in miles per gallon (city.mpg) and the car’s price (price).

model <- lm(price ~ city.mpg, data = car, na.action = na.omit)
summary(model)
## 
## Call:
## lm(formula = price ~ city.mpg, data = car, na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9210  -3212  -1717   1787  22697 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34595.60    1656.78   20.88   <2e-16 ***
## city.mpg     -849.45      63.77  -13.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5793 on 199 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.4714, Adjusted R-squared:  0.4687 
## F-statistic: 177.5 on 1 and 199 DF,  p-value: < 2.2e-16

The model above shows that price is negatively associated with city.mpg: for every one-unit (1 mile per gallon) increase in city.mpg, we expect the price to decrease by about 849 units, on average.
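To make the slope interpretation concrete, the sketch below plugs the coefficients reported in the output above into the fitted line. The helper name `predicted_price()` and the two mpg values (25 and 26) are arbitrary choices for illustration.

```r
# Coefficients copied from the summary() output above.
intercept <- 34595.60
slope     <- -849.45

# Hypothetical helper for the fitted line: price = intercept + slope * city.mpg.
predicted_price <- function(city_mpg) intercept + slope * city_mpg

# Compare two cars that differ by exactly 1 mpg (25 and 26 are arbitrary).
difference <- predicted_price(26) - predicted_price(25)
print(difference)
```

Whatever pair of values one mpg apart is chosen, the predicted difference equals the slope, which is what "decrease by about 849 units per mpg" means.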

Case 1: Number of Cylinders

In this case, we will examine the effect of the number of cylinders (i.e., four, six, and eight) on the relationship between fuel consumption and price.

The plot above shows that the downward slope for four-cylinder engines is less steep than for the other engines. So it may be incorrect to conclude that an increase in city.mpg is associated with the same change in price for all engine types.
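One way to check whether slopes really differ between groups, rather than judging by eye, is to compare a common-slope model with one that includes an interaction term. This is a sketch of that general idea, not a step taken in this vignette; it uses the built-in `mtcars` data as a stand-in, and the names `cars_demo`, `common_slope`, and `separate_slope` are hypothetical.

```r
# Stand-in data: built-in mtcars (mpg, wt, cyl), NOT the vignette's car dataset.
cars_demo <- transform(mtcars, cyl = factor(cyl))

# One shared slope for every group vs. a separate slope per group.
common_slope   <- lm(mpg ~ wt + cyl, data = cars_demo)
separate_slope <- lm(mpg ~ wt * cyl, data = cars_demo)

# F-test for whether allowing per-group slopes improves the fit.
comparison <- anova(common_slope, separate_slope)
print(comparison)
```

A small p-value in the second row would suggest that the slope varies by group; otherwise a common slope may be an acceptable simplification.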

Let’s check the model and its coefficients.

## 
## Call:
## lm(formula = price ~ city.mpg + num.of.cylinders, data = car, 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10842.4  -2098.3   -870.9   1983.5  16455.3 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           46675.18    2203.55  21.182  < 2e-16 ***
## city.mpg               -518.35      57.53  -9.010 2.88e-16 ***
## num.of.cylindersfour -22386.56    2165.81 -10.336  < 2e-16 ***
## num.of.cylinderssix  -13521.95    2198.16  -6.151 4.80e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4055 on 181 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.7211 
## F-statistic: 159.6 on 3 and 181 DF,  p-value: < 2.2e-16

We can see that the coefficient of city.mpg changes from -849.45 to -518.35 after we add the number of cylinders to the model, a 39% reduction in magnitude (change-in-estimate, CIE). The number of cylinders therefore has an impact on the relationship between fuel consumption and car price.

Case 2: Fuel Type

In this case, we will examine the effect of fuel type (i.e., gas, diesel) on the relationship between fuel consumption and price.

The plot above shows that the downward slopes differ slightly between gas and diesel cars. Let’s look at the model and its coefficients.

## 
## Call:
## lm(formula = price ~ city.mpg + fuel.type, data = car, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7030.7 -2728.6  -713.7   837.9 21898.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44770.37    2464.41  18.167  < 2e-16 ***
## city.mpg      -990.62      65.05 -15.228  < 2e-16 ***
## fuel.typegas -7400.16    1423.44  -5.199 5.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5119 on 182 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.5603, Adjusted R-squared:  0.5555 
## F-statistic:   116 on 2 and 182 DF,  p-value: < 2.2e-16

We can see that the coefficient of city.mpg changes from -849.45 to -990.62 after we add fuel type to the model, a 17% increase in magnitude (change-in-estimate, CIE).

Case 3: Number of Doors

In this case, we will examine the effect of the number of doors (i.e., two, four) on the relationship between fuel consumption and price.

The plot above indicates that .. Let’s see the model and coefficient.

## 
## Call:
## lm(formula = price ~ city.mpg + num.of.doors, data = car, na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7316  -2960  -1449   1320  22513 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     35409.01    1755.63  20.169   <2e-16 ***
## city.mpg         -887.85      66.02 -13.449   <2e-16 ***
## num.of.doorstwo   -92.15     818.94  -0.113    0.911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5474 on 180 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.5018, Adjusted R-squared:  0.4963 
## F-statistic: 90.65 on 2 and 180 DF,  p-value: < 2.2e-16

Conclusion

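A change of more than 10% in the city.mpg coefficient indicates that the association between car price and city-driving fuel consumption (in mpg) is confounded by the number of cylinders and by fuel type. Adding the number of doors changes the coefficient by only about 4.5% (from -849.45 to -887.85), which is below the threshold. In addition, adding the number of cylinders or fuel type raises the adjusted R-squared from 0.47 to 0.72 and 0.56, respectively.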
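The three comparisons can be collected in one small table. The sketch below takes the city.mpg coefficients from the model outputs in this vignette at face value; the object names (`adjusted`, `cie_table`) are illustrative.

```r
# city.mpg coefficients copied from the summary() outputs in this vignette.
unadjusted <- -849.45
adjusted <- c(num.of.cylinders = -518.35,
              fuel.type        = -990.62,
              num.of.doors     = -887.85)

# Absolute percent change relative to the unadjusted estimate.
cie_percent <- abs(100 * (adjusted - unadjusted) / unadjusted)

# One row per adjustment variable, flagged against the 10% rule of thumb.
cie_table <- data.frame(
  adjusted_for  = names(adjusted),
  cie_percent   = round(cie_percent, 1),
  exceeds_10pct = cie_percent > 10,
  row.names     = NULL
)
print(cie_table)
```

With these coefficients the changes come out at roughly 39%, 16.6%, and 4.5%, so only the number of cylinders and fuel type cross the 10% threshold.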

Since confounding is present, we should present the results from the adjusted analysis.

References

Caffo, Brian. 2019. Regression Models for Data Science in R. https://leanpub.com/regmods/.

Lee, Paul H. 2014. “Should We Adjust for a Confounder If Empirical and Theoretical Criteria Yield Contradictory Results? A Simulation Study.” Scientific Reports 4 (6085). https://doi.org/10.1038/srep06085.

Murray, Kantahyanee W., and Anne Duggan. 2011. “Understanding Confounding in Research.” Pediatrics in Review 2010 (31). https://doi.org/10.1542/pir.31-3-124.

Pourhoseingholi, Mohamad Amin, Ahmad Reza Baghestani, and Mohsen Vahedi. 2012. “How to Control Confounding Effects by Statistical Analysis.” https://doi.org/10.22037/ghfbb.v5i2.246.