When a linear model shows a correlation between a predictor variable and an outcome variable, we should be aware of other factors that can influence the association between the two. For example, a study might easily conclude that a heavy drinking habit causes reduced longevity. In reality, many other factors, such as physical exercise or eating habits, could also explain the reduced longevity. Those factors are referred to as third variables.
Caffo (2019) describes adjustment as putting additional regressors into a linear model to investigate the role of a third variable in the relationship between two other variables. Adjustment can help identify the impact of a third variable that distorts, or confounds, the relationship between the two others.
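The idea of adjustment can be sketched with simulated data. This is an illustration only, not the placement data: the names `x`, `y`, `z` and all effect sizes are assumptions. Here `z` drives both the predictor and the outcome, so the slope of `x` is inflated when `z` is omitted and moves back toward the true value of 1 once `z` is added as a regressor.

```r
# Simulated illustration of adjustment (hypothetical data, not the placement data).
set.seed(42)
n <- 1000
z <- rnorm(n)                      # third variable (confounder)
x <- 0.8 * z + rnorm(n)            # predictor depends on z
y <- 1 * x + 2 * z + rnorm(n)      # outcome depends on x (true slope 1) and on z

fit_unadjusted <- lm(y ~ x)        # z omitted
fit_adjusted   <- lm(y ~ x + z)    # z added as a regressor

slope_unadjusted <- unname(coef(fit_unadjusted)["x"])
slope_adjusted   <- unname(coef(fit_adjusted)["x"])

# Compare the slope of x with and without adjustment
print(c(unadjusted = slope_unadjusted, adjusted = slope_adjusted))
```

Because the true slope is known in a simulation, the gap between the two estimates shows directly how much of the unadjusted association is due to `z`.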
This article demonstrates how a confounding variable behaves in a linear model. The Campus Recruitment data is downloaded from [Kaggle](https://www.kaggle.com/benroshan/factors-affecting-campus-placement). The data set consists of students’ campus job placement data after completion of their MBA course at Jain University, Bangalore. The variables of interest are:
1. whether the student received a job offer on campus
2. salary offered
3. MBA academic score percentage
4. specialisation subjects in the MBA course
5. undergraduate degree percentage
The following plot illustrates the distribution of students’ job placements. The y-axis is the salary. A salary of 0 indicates students who did not receive a job offer.
model_salary_mbap <- lm(salary ~ mba_p, data = campus_placement_data)
summary(model_salary_mbap)$coef
##               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -32350.132 112601.365 -0.2872979 0.77416356
## mba_p         3710.006   1800.195  2.0608907 0.04052816
The model estimates the coefficient of mba_p as 3710.006. When specialisation is added as a third variable in the model below, the coefficient falls to 2969.583. The relative change is about 0.20, i.e. roughly 20%. According to Lee (2014), the presence of confounding can be detected using the change-in-estimate (CIE) criterion with a 10% cutoff. An unbiased estimate can be obtained by using the CIE criterion to decide which variables to adjust for.
model_salary_mbap_spec <- lm(salary ~ mba_p + specialisation, data = campus_placement_data)
summary(model_salary_mbap_spec)$coef
##                        Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)           50025.653 110758.691  0.4516635 6.519732e-01
## mba_p                  2969.583   1749.703  1.6971924 9.112782e-02
## specialisationMkt&HR -82070.144  20504.947 -4.0024558 8.666221e-05
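The change in estimate can be reproduced from the two reported mba_p coefficients. This is a minimal sketch: the helper name `change_in_estimate` is hypothetical, and using the unadjusted coefficient as the denominator is an assumption about how the percentage was computed.

```r
# Relative change in estimate (CIE) between an unadjusted and an adjusted coefficient.
# Assumption: the unadjusted estimate is used as the denominator.
change_in_estimate <- function(unadjusted, adjusted) {
  abs(unadjusted - adjusted) / abs(unadjusted)
}

# mba_p coefficients copied from the two model summaries
coef_unadjusted <- 3710.006   # salary ~ mba_p
coef_adjusted   <- 2969.583   # salary ~ mba_p + specialisation

cie <- change_in_estimate(coef_unadjusted, coef_adjusted)
exceeds_cutoff <- cie > 0.10  # compare against the 10% cutoff

print(round(cie, 3))
print(exceeds_cutoff)
```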
So the variable specialisation is considered a confounding variable. The plot addresses question 2: how students’ MBA specialisation affects the association between the MBA score and the outcome variable, salary.
model_salary_mbap_degreet <- lm(salary ~ mba_p + degree_t, data = campus_placement_data)
summary(model_salary_mbap_degreet)$coef
##                    Estimate Std. Error    t value   Pr(>|t|)
## (Intercept)      -17727.677 112782.966 -0.1571840 0.87525015
## mba_p              3454.722   1811.381  1.9072304 0.05784895
## degree_tOthers   -66849.760  47962.209 -1.3938007 0.16484400
## degree_tSci&Tech  17114.049  23853.934  0.7174518 0.47388886
When degree_t is added to the model, the coefficient of mba_p changes only slightly, from 3710.006 to 3454.722. The change in estimate is about 7%, which is below the suggested 10% cutoff. This indicates that the estimated relationship between MBA score and salary does not change much regardless of the presence of degree_t, so degree_t is not considered a confounding variable. The p-values of its coefficients are also not significant. The variable can be discarded from the model.
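The same check can be applied to several candidate third variables at once, working from fitted models rather than copied numbers. The sketch below is an assumption-laden illustration: `screen_confounders` is a hypothetical helper, and the `toy` data frame with columns `pay`, `score`, `group` and `field` is simulated, not the placement data. It assumes a numeric predictor and uses the unadjusted coefficient as the denominator.

```r
# Hypothetical helper: change in estimate (CIE) of one numeric predictor
# for each candidate third variable, flagged against a cutoff.
screen_confounders <- function(data, outcome, predictor, candidates, cutoff = 0.10) {
  base_fit  <- lm(reformulate(predictor, response = outcome), data = data)
  base_coef <- coef(base_fit)[[predictor]]
  rows <- lapply(candidates, function(candidate) {
    adj_fit  <- lm(reformulate(c(predictor, candidate), response = outcome), data = data)
    adj_coef <- coef(adj_fit)[[predictor]]
    cie <- abs(base_coef - adj_coef) / abs(base_coef)
    data.frame(candidate = candidate, unadjusted = base_coef,
               adjusted = adj_coef, cie = cie, flagged = cie > cutoff)
  })
  do.call(rbind, rows)
}

# Simulated data (assumption): `group` affects both score and pay,
# while `field` is unrelated to either.
set.seed(1)
n <- 500
group <- factor(sample(c("A", "B"), n, replace = TRUE))
field <- factor(sample(c("X", "Y", "Z"), n, replace = TRUE))
score <- 60 + 5 * (group == "A") + rnorm(n, sd = 5)
pay   <- 1000 * score + 40000 * (group == "A") + rnorm(n, sd = 20000)
toy   <- data.frame(pay, score, group, field)

result <- screen_confounders(toy, "pay", "score", c("group", "field"))
print(result)
```

With the placement data, a call along the lines of `screen_confounders(campus_placement_data, "salary", "mba_p", c("specialisation", "degree_t"))` would be the analogous use, assuming those column names.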
The following plot illustrates the effect of degree_t in the model.
These examples are simple investigations of situations where adding a certain variable reveals its impact on the relationship between the predictor and the outcome variable. Decisions to include or exclude such third variables should be backed by thorough statistical evidence, in part to avoid multicollinearity in multivariate modelling. Note also that whether to add an adjustment variable to the model may depend on the research objective. If the goal is accurate prediction, adding it may be a good idea; however, it may not be appropriate for decision-making models, as confounding can introduce bias.
Caffo, Brian. 2019. Regression Models for Data Science in R. https://leanpub.com/regmods/.
Lee, Paul H. 2014. https://www.nature.com/articles/srep06085.