Adjusting models with confounding variables

Sachintha Mohotti

2020-05-31

Introduction

Modelling multivariate data is rarely as simple as a single cause-and-effect relationship. In addition to the dependent variable and the main independent variable, other variables can influence the outcome and distort the relationship being modelled. These variables are called confounding variables, and the model needs to be adjusted to account for their effects. To show the effect of confounding variables on a model, I have used the fuel economy data from the ‘AppliedPredictiveModeling’ package.

Data Understanding

For this example, I’m going to plot Fuel Economy against Engine Displacement, and compare the effects of two confounding variables, Transmission System and Number of Cylinders in the engine.

In this graph, Engine Displacement appears to explain most of the variation in Fuel Economy. However, the points follow a hyperbolic curve rather than a straight line, which suggests that there are other variables at play.
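One rough way to check whether the curvature is real is to compare a straight-line fit against a fit on the reciprocal of Engine Displacement. The sketch below is written in Python, and the numbers are invented purely for illustration (they are not taken from the data set); `fit_line` and `rss` are hypothetical helper names, not functions from any package.

```python
# Toy illustration only: these values are made up, not the real data.
displacement = [1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0]        # litres
fuel_economy = [45.0, 37.5, 33.0, 30.0, 26.25, 24.0, 22.5]  # mpg


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares of the fitted line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Straight line in displacement.
b0, b1 = fit_line(displacement, fuel_economy)
rss_linear = rss(displacement, fuel_economy, b0, b1)

# Straight line in 1 / displacement, i.e. a hyperbola in displacement.
reciprocal = [1.0 / d for d in displacement]
c0, c1 = fit_line(reciprocal, fuel_economy)
rss_reciprocal = rss(reciprocal, fuel_economy, c0, c1)

print("RSS, linear fit:    ", round(rss_linear, 3))
print("RSS, reciprocal fit:", round(rss_reciprocal, 3))
```

A much smaller residual sum of squares for the reciprocal fit means the straight line is missing systematic structure. On its own this does not say whether the cause is a non-linear relationship or an omitted variable, which is why the grouping variables are examined next.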

Effect of Transmission System

Cars with four-wheel drive are designed for off-road driving, unlike two-wheel-drive cars, which are designed for city roads. Their fuel economy is likely to be different, so I grouped the data according to Transmission System.

As this plot shows, the lines of best fit for vehicles with different transmission systems are very different. This indicates that there is variation that is explained by Transmission System but not by Engine Displacement, and vice versa. Therefore, both variables should be considered when modelling this data.
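The comparison of group-wise lines of best fit can be sketched by fitting one line per group and checking the result against a single pooled line. This is a minimal Python sketch with invented values, not the actual data; the group labels `two_wheel` and `four_wheel` and the helpers `fit_line` and `rss` are assumptions made for the example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares of the fitted line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up (displacement, fuel economy) values for each group.
groups = {
    "two_wheel": ([1.6, 2.0, 2.4, 3.0, 3.5], [40.0, 37.0, 34.0, 30.0, 27.0]),
    "four_wheel": ([2.5, 3.5, 4.0, 5.0, 6.0], [26.0, 23.0, 22.0, 19.0, 17.0]),
}

# One line of best fit per group.
fits = {name: fit_line(x, y) for name, (x, y) in groups.items()}
rss_grouped = sum(rss(x, y, *fits[name]) for name, (x, y) in groups.items())

# A single line fitted to all points, ignoring the grouping.
all_x = [xi for x, _ in groups.values() for xi in x]
all_y = [yi for _, y in groups.values() for yi in y]
pooled = fit_line(all_x, all_y)
rss_pooled = rss(all_x, all_y, *pooled)

for name, (intercept, slope) in fits.items():
    print(name, "intercept:", round(intercept, 2), "slope:", round(slope, 2))
print("RSS grouped:", round(rss_grouped, 2), "RSS pooled:", round(rss_pooled, 2))
```

Separate lines can never fit worse than the pooled one, so what matters is how large the drop in residual sum of squares is and how far apart the slopes and intercepts are.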

Effect of Number of Cylinders

Similarly, the number of cylinders in the car’s engine is potentially another variable that affects Fuel Economy. For simplicity, I’m only going to consider four-, six- and eight-cylinder engines, as these are the most common in civilian vehicles.

Although this plot shows that the slope of the curve is different for each number of cylinders, there is very little overlap between the groups of data: each cylinder count occupies its own range of Engine Displacement. This means that Engine Displacement and Number of Cylinders explain largely the same variation in the data, so it is unnecessary to use both variables when modelling it.
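A quick numerical check of this overlap argument is to measure how strongly the two predictors are related to each other. The sketch below, in Python with invented values rather than the real data, computes the Engine Displacement range for each cylinder count, the Pearson correlation between the two predictors, and the variance inflation factor, which for a model with only two predictors is 1 / (1 - r^2). The helper name `pearson` is hypothetical.

```python
from math import sqrt

# Made-up values for illustration; not taken from the data set.
cylinders = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8]
displacement = [1.6, 1.8, 2.0, 2.4, 3.0, 3.2, 3.5, 3.7, 4.6, 5.0, 5.7, 6.2]

# Smallest and largest displacement observed for each cylinder count.
ranges = {}
for cyl, disp in zip(cylinders, displacement):
    low, high = ranges.get(cyl, (disp, disp))
    ranges[cyl] = (min(low, disp), max(high, disp))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


r = pearson(cylinders, displacement)
vif = 1.0 / (1.0 - r ** 2)  # valid as written only for two predictors

for cyl in sorted(ranges):
    print(cyl, "cylinders: displacement range", ranges[cyl])
print("correlation:", round(r, 3), "VIF:", round(vif, 1))
```

Non-overlapping ranges together with a correlation close to 1 (and hence a large variance inflation factor) indicate that the second predictor adds little information that the first does not already carry.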

Conclusion

When modelling multivariate data, the effects of all variables need to be considered. However, not all variables have a significant effect on the model.

References

[1] Caffo, B. Regression Models for Data Science in R.

[2] Kuhn, M., & Johnson, K. (2019, May 2). AppliedPredictiveModeling: Functions and Data Sets for ‘Applied Predictive Modeling’ version 1.1-7 from CRAN. Retrieved from https://rdrr.io/cran/AppliedPredictiveModeling/