Introduction

“Adjustment, is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two. Since it is often the case that a third variable can distort, or confound, the relationship between two others” (Caffo, 2019). Confounding variables are variables that are correlated, positively or negatively, with both the dependent variable and the independent variable (Elwood, 1988). A confounding variable may affect the variables being studied, so the results may not reflect the actual relationship between them.
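The effect of a confounder can be sketched with simulated data. The example below is only an illustration under stated assumptions: the variables x, y and z and all of the numbers are made up and have nothing to do with the life expectancy data. Here z drives both x and y, while x has no direct effect on y.

```r
# Synthetic illustration only: simulated numbers, not the Kaggle data.
set.seed(42)
n <- 500
z <- rnorm(n)                # hypothetical confounder
x <- 0.8 * z + rnorm(n)      # x depends on z
y <- 2.0 * z + rnorm(n)      # y depends on z only; x has no direct effect on y

unadjusted <- lm(y ~ x)      # omits the confounder
adjusted   <- lm(y ~ x + z)  # adds the confounder as a regressor

# Slope on x without and with adjustment for z
coef(unadjusted)["x"]        # clearly positive, driven entirely by z
coef(adjusted)["x"]          # close to zero once z is in the model
```

In this simulation the unadjusted slope on x is large even though x does nothing to y, and it shrinks towards zero once z is added to the model.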

In this example, I use life expectancy data from Kaggle to illustrate the impact of third variables. The data include the life expectancy of 193 countries and predictor variables such as immunisation-related factors, mortality factors and socio-economic factors. The data were collected from various sources, such as the United Nations and the WHO, for the period 2000 to 2015.

Only the 2015 data are used for the analysis, and not all variables are included.

## 
## Call:
## lm(formula = life_expectancy ~ schooling, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8986  -2.8210   0.6186   3.8186  30.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.10889    0.43676  100.99   <2e-16 ***
## schooling    2.10345    0.03506   59.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.172 on 2766 degrees of freedom
##   (170 observations deleted due to missingness)
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5653 
## F-statistic:  3599 on 1 and 2766 DF,  p-value: < 2.2e-16

The above plot shows the relationship between life expectancy and years of schooling for 2015. (The regression summary above was fit on df and reports 2766 residual degrees of freedom, so it appears to cover all years rather than 2015 alone.) Both variables are closely linked to a country's developed or developing status. At first glance, it seems clear that schooling increases life expectancy: the schooling coefficient has a very low p-value, and schooling is a good predictor of life expectancy in the model. However, it is difficult to conclude that schooling actually increases life expectancy. Schooling is closely linked to each country’s socio-economic status, and other variables must be held constant before concluding that schooling actually increases life expectancy.
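What "holding another variable constant" means can be sketched as follows. This is a minimal illustration on simulated data, not a reanalysis of the Kaggle data: the 0/1 status indicator and every number below are assumptions made for the example. The schooling coefficient from a model that includes status is the same as the slope obtained after removing from both schooling and life expectancy the part that status explains.

```r
# Synthetic illustration only: simulated numbers, not the Kaggle data.
set.seed(1)
n <- 300
status    <- rbinom(n, 1, 0.3)                   # hypothetical indicator, 1 = developed
schooling <- 10 + 4 * status + rnorm(n, sd = 2)  # developed countries get more schooling
life_expectancy <- 55 + 1 * schooling + 6 * status + rnorm(n, sd = 3)

unadjusted <- lm(life_expectancy ~ schooling)           # ignores status
full       <- lm(life_expectancy ~ schooling + status)  # adjusts for status

# Strip out what status explains, then regress residuals on residuals
res_y   <- resid(lm(life_expectancy ~ status))
res_x   <- resid(lm(schooling ~ status))
partial <- lm(res_y ~ res_x - 1)

coef(unadjusted)["schooling"]  # inflated, because status is left out
coef(full)["schooling"]        # adjusted slope
coef(partial)["res_x"]         # matches the adjusted slope
```

The residual-on-residual slope matches the adjusted coefficient, which is one way to read the phrase "with status held constant".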

## 
## Call:
## lm(formula = life_expectancy ~ bmi, data = df %>% filter(year == 
##     2015))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0342  -4.4353   0.4605   4.7700  23.6582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.11095    1.18638  53.196  < 2e-16 ***
## bmi          0.20177    0.02499   8.074 9.67e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.936 on 179 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2669, Adjusted R-squared:  0.2628 
## F-statistic: 65.18 on 1 and 179 DF,  p-value: 9.667e-14

The above plot shows the relationship between BMI and life expectancy for 2015. The BMI coefficient has a very low p-value (9.67e-14), and BMI appears to be positively correlated with a country's life expectancy.

The above plot shows the relationship between BMI and life expectancy by developed or developing status. In developed countries, life expectancy tends to decrease as BMI increases. In developing countries, by contrast, countries with a high BMI tend to have a higher life expectancy. Status has a large effect where BMI is low, but its effect shrinks as BMI increases. The difference in slopes between the two groups was significant, so it may be incorrect to conclude that a high BMI is linked to longer life expectancy, even though countries with a high BMI tend to have longer life expectancy overall.
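A difference in slopes of this kind is usually examined with an interaction term. The sketch below shows one way to do that on simulated data. It is not the model behind the plot: the column name status, the toy data frame and all of the numbers are assumptions made for illustration.

```r
# Synthetic illustration only: simulated numbers, not the Kaggle data.
set.seed(7)
n <- 200
status <- factor(sample(c("Developing", "Developed"), n,
                        replace = TRUE, prob = c(0.8, 0.2)))
bmi <- runif(n, 15, 70)

# Made-up group-specific lines: negative slope for developed, positive for developing
slope     <- ifelse(status == "Developed", -0.05, 0.20)
intercept <- ifelse(status == "Developed", 82, 60)
life_expectancy <- intercept + slope * bmi + rnorm(n, sd = 3)
toy <- data.frame(life_expectancy, bmi, status)

# bmi * status expands to bmi + status + bmi:status
fit <- lm(life_expectancy ~ bmi * status, data = toy)
summary(fit)$coefficients

# "Developed" is the reference level, so:
slope_developed  <- coef(fit)[["bmi"]]
slope_developing <- coef(fit)[["bmi"]] + coef(fit)[["bmi:statusDeveloping"]]
```

The bmi:statusDeveloping row of the coefficient table estimates the difference in slopes between the two groups, and its p-value is the test of whether that difference is significant.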

Conclusion

Modelling multivariate relationships is not an easy process. The relationships between the independent variables must be taken into account in the model. When looking at the impact of one variable, all of the other predictors in the model should be held constant.

References

Caffo, B. 2019, Regression models for data science in R, electronic book, viewed 27 May 2020.

Elwood, J.M. (ed.) 1988, Causal relationships in medicine, Oxford University Press, Oxford, p. 332.