Adjusting for confounding and regression
Learning outcomes
Our aim is to:
- understand the definition of confounding, and the circumstances under which it is likely to occur.
- adjust for confounding through the use of restriction, stratification and regression.
We will continue analysing the cot-death data, trying to find out how we can prevent cases of cot-death.
Hypotheses
The main hypotheses were related to:
- infant sleeping position,
- maternal smoking and
- bed-sharing.
These were the main issues that were believed to lead to cot-death.
The file is available at the following link.
This is a case-control study that considers the association between various risk factors and cot death.
Today we will be thinking about confounding, which is a very common issue in observational studies, and can be dealt with using statistical methods. In fact, most of the statistical work in analytical epidemiology will involve trying to deal with confounding. Remember, here, we are correcting for measured confounders. Unmeasured confounding is still a potential problem.
🤓 What is one way to deal with unmeasured confounding?
Sometimes, when we are considering the association between an exposure and a disease, it is possible that there is a third variable that distorts the relationship between them and gives us inaccurate results. This is a problem for all observational study designs, since only trials deal with it through randomisation.
🤓 Confounding is defined as a third variable which is a . This means that we have to think about the likely causal relationships between variables. 🤓 For example, for the relationship between bed-sharing and cot-death, is ethnicity likely to be a common cause of both?
How could we look for evidence that ethnicity is a cause of both of these factors?
🤓 One could...
Upload the data to iNZight lite.
Crude association
Let us consider the crude or unadjusted relationship between bed-sharing and cot-death.
Select Case_status as the first variable and Bedshare as the second variable.
Is there a statistical association? 🤓
The \(P\)-value is less than 0.05.
🤓 What is the odds ratio? (to three significant figures).
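If you would like to check iNZight's numbers outside the GUI, here is a minimal sketch in Python. The file name cot_death.csv is hypothetical, and the exact odds ratio depends on how the rows and columns of the table are ordered and coded, so treat this as a template rather than the definitive calculation.

```python
# Minimal sketch: crude odds ratio for bed-sharing and cot-death.
# "cot_death.csv" is a hypothetical file name; column names follow the tutorial.
import pandas as pd
from statsmodels.stats.contingency_tables import Table2x2

df = pd.read_csv("cot_death.csv")

# 2x2 table: exposure (Bedshare) in rows, outcome (Case_status) in columns.
# Check the row/column ordering so the odds ratio compares exposed with unexposed.
table = pd.crosstab(df["Bedshare"], df["Case_status"])
print(table)

t22 = Table2x2(table.values)
print("Crude odds ratio:", round(t22.oddsratio, 2))
print("95% CI:", t22.oddsratio_confint())
print("P-value:", round(t22.oddsratio_pvalue(), 4))
```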
Potential objections
You want to get this published, get the Nobel prize and save lives by telling all Mums not to bed-share.
Hang on a moment!
Suppose then that someone criticises your finding by saying "it is all just explained by ethnicity!" How do you respond to that?
🤓
We need to remove the potential problem posed by the third variable. How? 🤔
🤓
Restriction
One way of adjusting for another variable is restriction. Since the study includes Māori, Pacific and European mothers, we could simply limit our study to Māori, and then recalculate the measure of association. This means that the subjects in our analysis cannot differ by ethnicity (they will all be Māori), and so the association cannot be 'confounded' by ethnicity. Neat, eh?
Go to Dataset --> Filter Dataset --> levels of categorical variable --> select Ethnic --> levels to include is Maori.
🤓 Now calculate the odds ratio for bed-sharing and cot-death for Māori (\(\text{odds ratio}_{\text{Māori}}\)): . Is this stratum-specific odds ratio statistically significant? Compare this to our original result (odds ratio: 2.62).
Go to Dataset --> Restore data.
🤓 Select Pacific only (\(\text{odds ratio}_{\text{Pacific}}\)?: ; significant?: ), then European only (\(\text{odds ratio}_{\text{European}}\)?: ; significant?: ). What do you notice about the measure of association? Does the stratified odds ratio change from the crude or univariable odds ratio?
🤓
How can you respond now to your critic, who says that the cot-death association with bed sharing is all related to ethnicity?
🤓
Finally, go to Dataset --> Restore data to include all ethnic groups in our analysis.
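If you want to check the restriction approach outside iNZight, here is a minimal sketch in Python. It assumes the same hypothetical cot_death.csv, and that the Ethnic column uses the level labels Maori, Pacific and European; adjust these to match the actual data.

```python
# Minimal sketch: restriction, i.e. stratum-specific odds ratios by ethnicity.
import pandas as pd
from statsmodels.stats.contingency_tables import Table2x2

df = pd.read_csv("cot_death.csv")          # hypothetical file name

for ethnicity in ["Maori", "Pacific", "European"]:
    subset = df[df["Ethnic"] == ethnicity]             # restrict to one stratum
    table = pd.crosstab(subset["Bedshare"], subset["Case_status"])
    t22 = Table2x2(table.values)
    print(ethnicity,
          "OR =", round(t22.oddsratio, 2),
          "95% CI =", t22.oddsratio_confint(),
          "p =", round(t22.oddsratio_pvalue(), 4))
```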
Mantel-Haenszel stratification
This time, we are going to go the full monty and split the data into all three ethnicity strata, to estimate an odds ratio that is a weighted average of the stratum-specific odds ratios for each ethnicity that we estimated above.
Select Case_status as the first variable and Bedshare as the second. This time add in Ethnic as the third variable. Check the Epidemiology output box.
Carefully inspect the output. What is happening here?
The Mantel-Haenszel method (see link for a worked example of the calculation) takes a weighted average of all three stratum-specific (layer-specific) odds ratios. That is, it divides the population by ethnicity into Māori, Pacific and European, and the odds ratio for each stratum is then calculated. We end up with three odds ratios, which is great, but we want to collapse them into one for simplicity.
Luckily, clever Mantel and Haenszel figured this one out for us. 🤓 The trick is to take the weighted average of the three odds ratios: one for the association between bed-sharing and cot-death for Māori, then for Pacific, and finally for European (🤓 notice these are the same as for our stratified analysis). The odds ratio for the association between bed-sharing and cot-death, adjusted for ethnicity, is: and statistically significant (\(P\) = ). Since there is more information for European (there are more participants in this group), the average of the three odds ratios is weighted more heavily towards this one. Don't worry about the details of how this is done; worry more about the principle (taking the weighted average of stratum-specific odds ratios), which is important to understand.
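For the curious, the pooled estimate can be reproduced outside the GUI; here is a minimal sketch using statsmodels' Mantel-Haenszel routines (same hypothetical cot_death.csv as before).

```python
# Minimal sketch: Mantel-Haenszel odds ratio for Bedshare and Case_status,
# adjusted (stratified) by Ethnic. "cot_death.csv" is a hypothetical file name.
import pandas as pd
from statsmodels.stats.contingency_tables import StratifiedTable

df = pd.read_csv("cot_death.csv")

# Builds one 2x2 table per ethnicity stratum and pools them with MH weights
st = StratifiedTable.from_data("Bedshare", "Case_status", "Ethnic", df)

print("MH pooled odds ratio:", round(st.oddsratio_pooled, 2))
print("95% CI:", st.oddsratio_pooled_confint())
print("Test that the pooled OR = 1:", st.test_null_odds())
```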
Is there confounding?
We can show some evidence of confounding when there is a substantial difference between the 'crude' and 'adjusted' measures of association.
What is a substantial difference?
Well, again, this is a bit of a judgement call. Epidemiologists use a rule of thumb of 10%. To calculate it, take the difference between the crude and adjusted measures of association, divide by the crude measure to get the proportional difference, then multiply by 100 to get the percentage difference! Easy peasy.
\[ \text{% difference} = 100*\frac{(\text{odds ratio}_{\text{crude}} - \text{odds ratio}_{\text{adjusted}} )}{\text{odds ratio}_{\text{crude}}} \]
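As a quick sanity check, the rule of thumb is easy to code up; the odds ratios below are placeholders for illustration, not answers.

```python
# Minimal sketch: percentage difference between crude and adjusted odds ratios.
def pct_difference(or_crude: float, or_adjusted: float) -> float:
    """100 * (crude - adjusted) / crude."""
    return 100 * (or_crude - or_adjusted) / or_crude

# Placeholder values for illustration only
print(round(pct_difference(2.0, 1.7), 1))   # 15.0 -> more than a 10% change
```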
Have a go
Consider whether the relationship between Mother_smoke and Case_status is potentially confounded by Bedshare.
- Step 1. Crude association (odds ratio) between Mother_smoke and Case_status is:
- Step 2. Adjusted association between Mother_smoke and Case_status, after adjusting for Bedshare, is:
- Step 3. Percentage difference is:
\[\begin{align} \text{% difference} &= 100 * \frac{ ( \text{odds ratio}_{\text{crude}} - \text{odds ratio}_{\text{adjusted}} ) } {\text{odds ratio}_{\text{crude}} } \\ &= 100 * \frac{4.36 - 4.06}{4.36} \end{align}\]
🤓 = %
Is this less than or greater than a 10% change?
So, based on the 10% rule of thumb, we can conclude there is no evidence of confounding.
Examine in more detail
However, if we look closely at the stratum-specific odds ratios we see:
- Mother_smoke - Case_status \(\text{odds ratio}_{\text{Sleeping apart}}\) =
- Mother_smoke - Case_status \(\text{odds ratio}_{\text{Bed-sharing}}\) =
Wow! The odds ratio for Mother_smoke and Case_status in bed-sharers is much higher than for children sleeping apart from their parents.
This is considered evidence for effect modification, since the two stratum-specific odds ratios are so different.
Please note: there is a formal statistical test for whether the stratum-specific odds ratios are likely to come from the same common odds ratio, so that their results may be pooled (the \(\chi^2\) heterogeneity test), but iNZight lite unfortunately doesn't implement this. See page 183 of Kirkwood and Sterne for more details.
Just note here that the heterogeneity test is significant, and for that reason, you should not report a summary adjusted odds ratio, but rather stratum-specific odds ratios.
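If you do want such a test, statsmodels offers a closely related homogeneity test (a Breslow-Day style test) via test_equal_odds(); here is a minimal sketch, again assuming the hypothetical cot_death.csv.

```python
# Minimal sketch: test whether the stratum-specific odds ratios for
# Mother_smoke and Case_status (stratified by Bedshare) share a common value.
import pandas as pd
from statsmodels.stats.contingency_tables import StratifiedTable

df = pd.read_csv("cot_death.csv")          # hypothetical file name

st = StratifiedTable.from_data("Mother_smoke", "Case_status", "Bedshare", df)
homogeneity = st.test_equal_odds()         # H0: all stratum odds ratios are equal
print("Chi-squared:", round(homogeneity.statistic, 2),
      "p-value:", round(homogeneity.pvalue, 4))
# A small p-value argues for reporting stratum-specific rather than pooled odds ratios.
```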
Why do you think the odds ratio for Mother_smoke - Case_status is so high for bed-sharers?
In this way, effect-modification can help stimulate our thinking about possible biological mechanisms, where several risk factors may act together to cause disease.
Scatter plots and linear regression
For continuous variables, we can sometimes visualise the relationship between one variable and another directly, in a type of plot called a scatter plot. Analysing the variables on their continuous scale gives greater statistical power than we would have if we categorised them in a binary manner to make two-by-two tables and risk ratios or odds ratios.
In the last tutorial, before the break, we compared birth weights by case status using a \(t\)-test. What if, however, both the exposure and the outcome are continuous? Maybe we want to see whether there is a relationship between gestational age (Gestation - the exposure) and birth weight (Birth_wt - the outcome). That is, if a baby stays longer in their mother's womb, we expect that it will be, on average, a bigger child. Let's try selecting these variables in iNZight.
🤓 Which variable should be on the vertical or \(y\) axis and which on the horizontal or \(x\) axis?
🤓 In the first variable, add the outcome
🤓 In the second, add the exposure
Click Add to plot --> Trend Lines and Curves. Click on the linear option, then also smoother.
🤓 How would you summarise the relationship?
🤓 How can we describe the slope of the straight line?
🤓 How about the \(y\)-intercept?
🤓 What is the advantage of the smoother over the linear regression line?
🤓 In this instance, what information does the smoother convey?
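If you would like to reproduce the plot outside iNZight, here is a minimal sketch (same hypothetical cot_death.csv; variable names as used in this tutorial).

```python
# Minimal sketch: scatter plot of Birth_wt against Gestation with a linear
# trend line and a lowess smoother. "cot_death.csv" is a hypothetical file name.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.read_csv("cot_death.csv").dropna(subset=["Gestation", "Birth_wt"])
x, y = df["Gestation"], df["Birth_wt"]     # exposure on x, outcome on y

plt.scatter(x, y, alpha=0.4)

# Linear trend line (least squares)
slope, intercept = np.polyfit(x, y, 1)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, intercept + slope * xs, label="linear")

# Lowess smoother, analogous to iNZight's "smoother" option
smoothed = lowess(y, x)
plt.plot(smoothed[:, 0], smoothed[:, 1], label="smoother")

plt.xlabel("Gestation (weeks)")
plt.ylabel("Birth weight (g)")
plt.legend()
plt.show()
```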
Linear regression modelling
We are going on to the Advanced
menu. Hold on to your
hats 🎩!
🤓 Go to Model Fitting. The outcome or \(y\) variable should be and the predictor variable should be . Then click FIT MODEL. We get some crazy scientific notation, like 1.916e+02. Remember this just means \(1.916 * 10^2\) or 192.
The Estimate for Gestation is the slope, or \(m\) from high school maths, in the formula \(y = mx + c\), where:
- \(y\) is the outcome,
- \(x\) is the exposure,
- \(m\) is the slope, and
- \(c\) is the \(y\)-intercept, or the value of \(y\) when \(x = 0\).
The slope is the key measure of association. In epidemiology, we call this the "\(\beta\) coefficient", which iNZight calls the Estimate - just to confuse you 😵.
A positive \(\beta\) coefficient means a positive association: the higher \(x\) gets, the higher \(y\) becomes. Here, the greater the Gestation, the greater the Birth_wt. A slope of zero means 'no association', and is the regression equivalent of the null hypothesis. The Estimate is the software name for the slope and means the average change in \(y\) given a one-unit change in \(x\).
🤓 Here, therefore, the baby puts on grams for every extra week (a one-unit increase) of gestation. The 95% confidence interval has the same interpretation as usual, and the \(P\)-value likewise (a slope of zero represents the null hypothesis).
🤓 Are the two variables associated?
🤓 Is the relationship statistically significant?
Also, this slope is described as a crude \(\beta\) coefficient, because there is only one independent variable (Gestation) in the model.
🤓 What does the \(R^2\) value tell us?
The \(R^2\) is an indicator of model fit. A value close to 1 indicates good fit, whereas a value close to 0 is poor.
🤓 What is the \(R^2\) for this model? . This means that, compared to a simple mean, the model reduces the residual variation by ~ %.
What about the residual standard error? 🤓 (Note: this is a misnomer; it should really be called the residual standard deviation.) It is unchanged by the sample size. The value for this model is: . Smaller values indicate better model fit than larger values.
🤓 Residual standard error is particularly useful for
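The same crude model can be fitted outside iNZight; here is a minimal sketch with statsmodels (hypothetical cot_death.csv again), in which the Gestation coefficient is the slope, or Estimate, described above.

```python
# Minimal sketch: crude linear regression of Birth_wt on Gestation.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cot_death.csv")          # hypothetical file name

crude = smf.ols("Birth_wt ~ Gestation", data=df).fit()
print(crude.summary())                                     # full output, like FIT MODEL
print("Slope (grams per extra week):", round(crude.params["Gestation"], 1))
print("95% CI:", crude.conf_int().loc["Gestation"].values)
print("R-squared:", round(crude.rsquared, 3))
print("Residual standard error:", round(crude.mse_resid ** 0.5, 1))
```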
If we now include a confounder, such as maternal age (Mother_age), we get an adjusted \(\beta\) coefficient. After adjusting for Mother_age, the slope of the line between Gestation and Birth_wt is , compared to the crude model's slope of 192. The residual standard error for the age-adjusted model is grams, which is slightly than the crude (470), and indicates model fit.
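Here is a minimal sketch of the age-adjusted model, for comparison with the crude one (same hypothetical file).

```python
# Minimal sketch: regression of Birth_wt on Gestation, adjusted for Mother_age.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cot_death.csv")          # hypothetical file name

adjusted = smf.ols("Birth_wt ~ Gestation + Mother_age", data=df).fit()
print("Adjusted slope for Gestation:", round(adjusted.params["Gestation"], 1))
print("Residual standard error:", round(adjusted.mse_resid ** 0.5, 1))
# Compare both numbers with the crude model to judge confounding and model fit.
```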
Some homework
Does maternal age (Mother_age) influence baby's birth weight (Birth_wt)?
The crude \(\beta\) coefficient is and means that, as mothers age by year, their babies get on average 15 grams heavier. This association statistically significant, as indicated by the \(P\)-value.
Check with both plots and model.
How much of the variation is explained by the model? %
What is the residual standard error, in grams?
See whether the relationship between Gestation and Birth_wt varies by case status (add Case_status as another Variable of interest, including Mother_age, in the Model interface) and also observe the plot.
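Outside iNZight, the equivalent model simply adds Case_status as another predictor; here is a minimal sketch (how Case_status is coded in the file is an assumption, so treat this as a template).

```python
# Minimal sketch: does Birth_wt differ by case status, given Gestation and Mother_age?
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cot_death.csv")          # hypothetical file name

# Case_status entered as a categorical predictor alongside Gestation and Mother_age
model = smf.ols("Birth_wt ~ Gestation + Mother_age + C(Case_status)", data=df).fit()
print(model.params)                        # the Case_status coefficient compares cases with controls
print("Residual standard error:", round(model.mse_resid ** 0.5, 1))
```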
In the plot, on average, cases are than controls, although the slope looks between the two plots.
The adjusted \(\beta\) coefficient (slope) for Case is , indicating that cases are on average than controls, after adjusting for . This association statistically significant.
The residual standard error is grams, which is than the equivalent statistic in the crude model. This indicates model fit.
Well done!😅 We have taken baby steps in understanding regression and other methods of adjusting for confounding. 🥳