Introduction

Most datasets we find in the real world are multivariate. Besides the exposure and the outcome, they contain confounders: variables that are associated with both the exposure and the outcome without being part of the causal relationship between them. Confounders make a simple cause-and-effect relationship much more complicated.

Figure 1

In many situations, we need a strategy for dealing with the influence of confounders. Commonly used statistical methods include Stratified Analysis, Multivariable Risk Adjustment, Propensity Score Analysis (PSA), and Instrumental Variable Analysis (IVA). In this post, we adjust by adding one continuous variable to the regression model and see how it changes the apparent relationship between another variable and the outcome.

Data Preparation

The dataset we use is Amazon Books, which a third-party website harvested from Amazon.com. It consists of 13 variables and 325 observations. As part of data preparation, rows with NA values are removed, and variables are excluded or renamed.

##   price cover page height width thick weight
## 1 12.95     P  304    7.8   5.5   0.8   11.2
## 2 15.00     P  273    8.4   5.5   0.7    7.2
## 3  1.50     P   96    8.3   5.2   0.3    4.0
## 4 15.99     P  672    8.8   6.0   1.6   28.8
## 5 30.50     P  720    8.0   5.2   1.4   22.4
## 6 28.95     H  460    8.9   6.3   1.7   32.0
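The preparation steps described above could look like the following minimal sketch. It runs on a tiny stand-in data frame, not the real file, and the raw column names (`Amazon.Price`, `Hard.Paper`, `NumPages`, `Publisher`) are hypothetical placeholders for whatever the harvested data actually uses.

```r
# Toy stand-in for the raw data; the raw column names are hypothetical.
raw <- data.frame(
  Amazon.Price = c(12.95, 15.00, NA, 28.95),
  Hard.Paper   = c("P", "P", "P", "H"),
  NumPages     = c(304, 273, 96, 460),
  Thick        = c(0.8, 0.7, 0.3, NA),
  Publisher    = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Keep only the variables of interest and give them short names.
df1 <- raw[, c("Amazon.Price", "Hard.Paper", "NumPages", "Thick")]
names(df1) <- c("price", "cover", "page", "thick")

# Drop every row that contains a missing value.
df1 <- na.omit(df1)

# Treat cover as a factor with "H" (hardcover) as the reference level,
# so that lm() reports the paperback effect as "coverP".
df1$cover <- factor(df1$cover, levels = c("H", "P"))

str(df1)
```

Setting the factor levels explicitly is a design choice: it makes the reference group visible in the code instead of relying on alphabetical ordering.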

We’re interested in the relationship between the binary group variable Book Cover and the outcome Book Price. Because a third variable might distort, or confound, that relationship, we fit the model with and without the third variable. We then use plots and the change in the Cover coefficient to judge whether each attribute confounds the relationship between Cover and Price.

Example 1: Number of Pages

In this example, the third variable is Page.

The horizontal parallel lines are the average prices of hardcover and paperback books, ignoring the number of pages; the distance between them is the marginal effect of group status. The fitted lines of the model that includes the number of pages are parallel as well. Comparing the distance between the horizontal lines with the distance between the intercepts of the fitted lines, we see that the two effects are about the same. So whether or not Page is included, the estimated relationship between Cover and Price doesn’t change much.

In addition, the number of pages is linearly related to book price and unrelated to book cover, while book cover is related to book price and determines the intercept of the relationship between page and price.
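A plot of this kind, with horizontal group means and parallel fitted lines, could be drawn roughly as follows. This is only a sketch on simulated data (the data frame `sim` and its generating numbers are made up for illustration, they are not the Amazon data), using base R graphics.

```r
set.seed(1)

# Simulated stand-in data: page count is generated independently of cover.
n     <- 100
cover <- factor(sample(c("H", "P"), n, replace = TRUE), levels = c("H", "P"))
page  <- round(runif(n, 100, 700))
price <- 17 + 0.01 * page - 5 * (cover == "P") + rnorm(n, sd = 5)
sim   <- data.frame(price, cover, page)

fit_marginal <- lm(price ~ cover, data = sim)         # ignores page
fit_adjusted <- lm(price ~ cover + page, data = sim)  # includes page

cols <- c(H = "blue", P = "red")
plot(price ~ page, data = sim, col = cols[as.character(sim$cover)], pch = 19)

# Dashed horizontal lines: average price per cover type (marginal model).
b0 <- coef(fit_marginal)
abline(h = b0["(Intercept)"], col = "blue", lty = 2)
abline(h = b0["(Intercept)"] + b0["coverP"], col = "red", lty = 2)

# Solid parallel lines: fitted lines from the model that includes page.
b <- coef(fit_adjusted)
abline(a = b["(Intercept)"], b = b["page"], col = "blue")
abline(a = b["(Intercept)"] + b["coverP"], b = b["page"], col = "red")
```

The vertical gap between the dashed lines is the `coverP` coefficient of the marginal model, and the gap between the solid lines is the `coverP` coefficient of the adjusted model.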

Let’s check the models and their coefficients:

fit1 <- lm(price~cover, data = df1)
summary(fit1)$coef
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 20.20346  0.7373678 27.399435 4.099864e-84
## coverP      -5.06890  0.8542352 -5.933846 8.053733e-09
fit2 <- lm(price~cover+page, data = df1)
summary(fit2)$coef
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 16.95377021 1.026519937 16.515773 3.821748e-44
## coverP      -5.18317932 0.829760486 -6.246597 1.417656e-09
## page         0.01011961 0.002290955  4.417201 1.393394e-05
Percentage_Change
##   coverP 
## 2.254516
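The code that produces `Percentage_Change` is not shown above. One calculation that reproduces the printed value is the absolute relative change of the `coverP` coefficient between the two models. The sketch below assumes that definition and uses the rounded coefficients copied from the summaries; the names `b_crude` and `b_adjusted` are introduced here for illustration.

```r
# coverP estimates copied from the two coefficient tables above.
b_crude    <- -5.06890      # price ~ cover
b_adjusted <- -5.18317932   # price ~ cover + page

# Absolute change of the cover coefficient, relative to the crude estimate,
# expressed as a percentage (assumed definition of Percentage_Change).
Percentage_Change <- abs((b_adjusted - b_crude) / b_crude) * 100
round(Percentage_Change, 2)
```

With the fitted models at hand, the same quantity can be taken directly from `coef(fit1)["coverP"]` and `coef(fit2)["coverP"]` instead of typing the numbers in.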

What we see from the models is consistent with what we observed in the plots: the estimate doesn’t change much after the number of pages is included. The percentage change of the coefficient is about 2.3%, below the 10% cutoff. In this case, we can say that Page doesn’t have a confounding effect on the relationship between Cover and Price. Based on the plot and the model estimates, the relationship between the outcome and the regressors can be written as: \[Price = \beta_0 + \beta_1 Page + \beta_2 Cover + \epsilon\]

Example 2: Thickness

In this example, the third variable is Thick.

Unlike the number of pages, the thickness of the book is clearly related to both the group variable and the outcome, as the plot shows. When thickness is taken into account, the intercept of the fitted line for paperbacks lies above that for hardcovers, but the hardcover slope is larger, so the ordering of the two prices reverses quickly as thickness increases. Thick really does affect the relationship between Cover and Price and changes the estimate.

In this example, the group variable, the outcome, and the third variable are all correlated. The thickness of the book is linearly related to its price and, together with the cover of the book, determines the intercept of the fitted line.
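To see why a third variable that is correlated with both the group variable and the outcome shifts the estimate, it can help to simulate the situation. The sketch below uses made-up data (the numbers in the data-generating step are arbitrary assumptions, not estimates from the Amazon data): thickness depends on cover, and price depends on both.

```r
set.seed(42)

n     <- 500
cover <- factor(sample(c("H", "P"), n, replace = TRUE), levels = c("H", "P"))

# Thickness depends on cover: simulated hardcovers are thicker on average.
thick <- rnorm(n, mean = ifelse(cover == "H", 1.1, 0.8), sd = 0.3)

# Price depends on both cover and thickness.
price <- 15 - 4 * (cover == "P") + 4.5 * thick + rnorm(n, sd = 5)
sim   <- data.frame(price, cover, thick)

# Cover coefficient without and with the third variable in the model.
crude    <- coef(lm(price ~ cover, data = sim))["coverP"]
adjusted <- coef(lm(price ~ cover + thick, data = sim))["coverP"]

c(crude = unname(crude), adjusted = unname(adjusted))
```

In this setup the crude estimate absorbs part of the thickness effect, because paperbacks are both cheaper and thinner, so the crude coefficient is more negative than the adjusted one.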

Let’s check the models and their coefficients:

fit1 <- lm(price~cover, data = df1)
summary(fit1)$coef
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 20.20346  0.7373678 27.399435 4.099864e-84
## coverP      -5.06890  0.8542352 -5.933846 8.053733e-09
fit2 <- lm(price~cover+thick, data = df1)
summary(fit2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 15.388216  1.3662338 11.263238 8.165005e-25
## coverP      -3.938195  0.8759675 -4.495823 9.878734e-06
## thick        4.444842  1.0726554  4.143775 4.437209e-05
Percentage_Change
##   coverP 
## 22.30672

In this case, the effects are correlated but point in opposite directions: the coefficient of thick is positive, while the coefficient of coverP is negative. The magnitude of the cover estimate drops by about 22% (from -5.07 to -3.94) after the thickness of the book is included, which exceeds the 10% cutoff. Including thickness weakens the significance of book cover a little, but both terms remain highly significant according to their p-values. We can say that Thick has a confounding effect on the relationship between Cover and Price.

Conclusions and Final Thoughts

The examples above show that adjustment helps us investigate how including or excluding a third variable changes the estimate of the relationship we are interested in. However, adjustment is only one method, and in practice more analysis is needed before and after it. For instance, if we swap the roles of Thick and Cover in the second example, we can equally conclude that Cover is a confounding variable for the relationship between Thick and Price. Which variable counts as the confounder depends on which relationship with the outcome we are interested in.

I hope this post gives you a brief idea of the adjustment method and helps when you work with multivariable regression.

Reference: Regression Models for Data Science in R