Adjusting for confounding and regression
Learning outcomes
Our aim is to:
- understand the definition of confounding, and the circumstances under which it is likely to occur.
- adjust for confounding through the use of restriction, stratification and regression.
We will continue analysing the cot-death data, trying to find out how we can prevent cases of cot-death.
Hypotheses
The main hypotheses were related to:
- infant sleeping position,
- maternal smoking and
- bed-sharing.
These were the main issues that were believed to lead to cot-death.
The file is available at the following link.
This is a case-control study that considers the association between various risk factors and cot death.
Today we will be thinking about confounding, which is a very common issue in observational studies, and can be dealt with using statistical methods. In fact, most of the statistical work in analytical epidemiology will involve trying to deal with confounding. Remember, here, we are correcting for measured confounders. Unmeasured confounding is still a potential problem.
🤓 What is one way to deal with unmeasured confounding?
Sometimes, when we are considering the association between an exposure and a disease, it is possible that there is a third variable that distorts the relationship between them and gives us inaccurate results. This is a problem for all observational study designs, since only trials deal with it through randomisation.
🤓 Confounding is defined as a third variable which is a . This means that we have to think about the likely causal relationships between variables. 🤓 For example, for the relationship between bed-sharing and cot-death, is ethnicity likely to be a common cause of both?
How could we look for evidence that ethnicity is a cause of both of these factors?
🤓 One could...
Upload the data to iNZight lite.
Crude association
Let us consider the crude or unadjusted relationship between bed-sharing and cot-death.
Select Case_status as the first variable and Bedshare as the second variable.
Is there a statistical association? 🤓
The \(P\)-value is less than 0.05.
🤓 What is the odds ratio? (to three significant figures).
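If you would like to check iNZight's numbers outside the GUI, here is a minimal sketch in Python. The file name cot_death.csv is hypothetical, and the exact odds ratio depends on how the rows and columns of the table are ordered and coded, so treat this as a template rather than the definitive calculation.

```python
# Minimal sketch: crude odds ratio for bed-sharing and cot-death.
# "cot_death.csv" is a hypothetical file name; column names follow the tutorial.
import pandas as pd
from statsmodels.stats.contingency_tables import Table2x2

df = pd.read_csv("cot_death.csv")

# 2x2 table: exposure (Bedshare) in rows, outcome (Case_status) in columns.
# Check the row/column ordering so the odds ratio compares exposed with unexposed.
table = pd.crosstab(df["Bedshare"], df["Case_status"])
print(table)

t22 = Table2x2(table.values)
print("Crude odds ratio:", round(t22.oddsratio, 2))
print("95% CI:", t22.oddsratio_confint())
print("P-value:", round(t22.oddsratio_pvalue(), 4))
```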
Potential objections
You want to get this published, get the Nobel prize and save lives by telling all Mums not to bed-share.
Hang on a moment!
Suppose then that someone criticises your finding by saying "it is all just explained by ethnicity!" How do you respond to that?
🤓
We need to remove the potential problem posed by the third variable. How? 🤔
🤓
Restriction
One way of adjusting for another variable is restriction. Since the study includes Māori, Pacific and European mothers, we could simply limit our study to Māori, and then recalculate the measure of association. This means that the subjects in our analysis cannot differ by ethnicity (they will all be Māori), and so the association cannot be 'confounded' by ethnicity. Neat, eh?
Go to Dataset --> Filter Dataset --> levels of categorical variable --> select Ethnic --> levels to include is Maori.
🤓 Now calculate the odds ratio for bed-sharing and cot-death for Māori (\(\text{odds ratio}_{\text{Māori}}\)): . Is this stratum-specific odds ratio statistically significant? Compare this to our original result (odds ratio: 2.62).
Go to Dataset --> Restore data.
🤓 Select Pacific only (\(\text{odds ratio}_{\text{Pacific}}\)?: ; significant?: ), then European only (\(\text{odds ratio}_{\text{European}}\)?: ; significant?: ). What do you notice about the measure of association? Does the stratified odds ratio change from the crude or univariable odds ratio?
🤓
How can you respond now to your critic, who says that the cot-death association with bed sharing is all related to ethnicity?
🤓
Finally, go to Dataset --> Restore data to include all ethnic groups in our analysis.
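If you want to check the restriction approach outside iNZight, here is a minimal sketch in Python. It assumes the same hypothetical cot_death.csv, and that the Ethnic column uses the level labels Maori, Pacific and European; adjust these to match the actual data.

```python
# Minimal sketch: restriction, i.e. stratum-specific odds ratios by ethnicity.
import pandas as pd
from statsmodels.stats.contingency_tables import Table2x2

df = pd.read_csv("cot_death.csv")          # hypothetical file name

for ethnicity in ["Maori", "Pacific", "European"]:
    subset = df[df["Ethnic"] == ethnicity]             # restrict to one stratum
    table = pd.crosstab(subset["Bedshare"], subset["Case_status"])
    t22 = Table2x2(table.values)
    print(ethnicity,
          "OR =", round(t22.oddsratio, 2),
          "95% CI =", t22.oddsratio_confint(),
          "p =", round(t22.oddsratio_pvalue(), 4))
```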
Mantel-Haenszel stratification
This time, we are going to go the full monty and split the data into all three ethnicity strata, to estimate an odds ratio that is a weighted average of the stratum-specific odds ratios for each ethnicity that we estimated above.
Select Case_status as the first variable and Bedshare as the second. This time add in Ethnic as the third variable. Check the Epidemiology output box.
Carefully inspect the output. What is happening here?
The Mantel-Haenszel method (see link for a worked example of the calculation) takes a weighted average of all three stratum-specific (layer-specific) odds ratios. That is, it divides the population by ethnicity into Māori, Pacific and European, and the odds ratio for each stratum is then calculated. We end up with three odds ratios, which is great, but we want to collapse them into one for simplicity.
Luckily, clever Mantel and Haenszel figured this one out for us. 🤓 The trick is to take the weighted average of the three odds ratios: one for the association between bed-sharing and cot-death for Māori, then for Pacific, and finally for European (🤓 notice these are the same as for our stratified analysis). The odds ratio for the association between bed-sharing and cot-death, adjusted for ethnicity, is: and statistically significant (\(P\) = ). Since there is more information for European (there are more participants in this group), the average of the three odds ratios is weighted more heavily towards this one. Don't worry about the details of how this is done; worry more about the principle (taking the weighted average of stratum-specific odds ratios), which is important to understand.
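For the curious, the pooled estimate can be reproduced outside the GUI; here is a minimal sketch using statsmodels' Mantel-Haenszel routines (same hypothetical cot_death.csv as before).

```python
# Minimal sketch: Mantel-Haenszel odds ratio for Bedshare and Case_status,
# adjusted (stratified) by Ethnic. "cot_death.csv" is a hypothetical file name.
import pandas as pd
from statsmodels.stats.contingency_tables import StratifiedTable

df = pd.read_csv("cot_death.csv")

# Builds one 2x2 table per ethnicity stratum and pools them with MH weights
st = StratifiedTable.from_data("Bedshare", "Case_status", "Ethnic", df)

print("MH pooled odds ratio:", round(st.oddsratio_pooled, 2))
print("95% CI:", st.oddsratio_pooled_confint())
print("Test that the pooled OR = 1:", st.test_null_odds())
```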
Is there confounding?
We can show some evidence of confounding when there is a substantial difference between the 'crude' and 'adjusted' measures of association.
What is a substantial difference?
Well, again, this is a bit of a judgement call. Epidemiologists use a rule of thumb of 10%. To calculate it, take the difference between the crude and adjusted measures of association, divide by the crude measure to get the proportional difference, then multiply by 100 to get the percentage difference! Easy peasy.
\[ \text{% difference} = 100*\frac{(\text{odds ratio}_{\text{crude}} - \text{odds ratio}_{\text{adjusted}} )}{\text{odds ratio}_{\text{crude}}} \]
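As a quick sanity check, the rule of thumb is easy to code up; the odds ratios below are placeholders for illustration, not answers.

```python
# Minimal sketch: percentage difference between crude and adjusted odds ratios.
def pct_difference(or_crude: float, or_adjusted: float) -> float:
    """100 * (crude - adjusted) / crude."""
    return 100 * (or_crude - or_adjusted) / or_crude

# Placeholder values for illustration only
print(round(pct_difference(2.0, 1.7), 1))   # 15.0 -> more than a 10% change
```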
Have a go
Consider whether the relationship between Mother_smoke and Case_status is potentially confounded by Bedshare.
- Step 1. Crude association (odds ratio) between Mother_smoke and Case_status is:
- Step 2. Adjusted association between Mother_smoke and Case_status, after adjusting for Bedshare, is:
- Step 3. Percentage difference is:
\[\begin{align} \text{% difference} &= 100 * \frac{ ( \text{odds ratio}_{\text{crude}} - \text{odds ratio}_{\text{adjusted}} ) } {\text{odds ratio}_{\text{crude}} } \\ &= 100 * \frac{4.36 - 4.06}{4.36} \end{align}\]
🤓 = %
Is this less than or greater than a 10% change?
So, based on the 10% rule of thumb, we can conclude there is no evidence of confounding.
Examine in more detail
However, if we look closely at the stratum-specific odds ratios we see:
- Mother_smoke - Case_status \(\text{odds ratio}_{\text{Sleeping apart}}\) =
- Mother_smoke - Case_status \(\text{odds ratio}_{\text{Bed-sharing}}\) =
Wow! The odds ratio for Mother_smoke and Case_status in bed-sharers is much higher than for children sleeping apart from their parents.
This is considered evidence for effect modification, since the two stratum-specific odds ratios are so different.
Please note: there is a formal statistical test for whether the stratum-specific odds ratios are likely to come from the same common odds ratio, so that their results may be pooled (the \(\chi^2\) heterogeneity test), but iNZight lite unfortunately doesn't implement this. See page 183 of Kirkwood and Sterne for more details.
Just note here that the heterogeneity test is significant, and for that reason, you should not report a summary adjusted odds ratio, but rather stratum-specific odds ratios.
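If you do want such a test, statsmodels offers a closely related homogeneity test (a Breslow-Day style test) via test_equal_odds(); here is a minimal sketch, again assuming the hypothetical cot_death.csv.

```python
# Minimal sketch: test whether the stratum-specific odds ratios for
# Mother_smoke and Case_status (stratified by Bedshare) share a common value.
import pandas as pd
from statsmodels.stats.contingency_tables import StratifiedTable

df = pd.read_csv("cot_death.csv")          # hypothetical file name

st = StratifiedTable.from_data("Mother_smoke", "Case_status", "Bedshare", df)
homogeneity = st.test_equal_odds()         # H0: all stratum odds ratios are equal
print("Chi-squared:", round(homogeneity.statistic, 2),
      "p-value:", round(homogeneity.pvalue, 4))
# A small p-value argues for reporting stratum-specific rather than pooled odds ratios.
```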
Why do you think the odds ratio for Mother_smoke - Case_status is so high for bed-sharers?
In this way, effect-modification can help stimulate our thinking about possible biological mechanisms, where several risk factors may act together to cause disease.
Scatter plots and linear regression
For continuous variables, we can sometimes visualise the relationship between one variable and another directly, in a type of plot called a scatter plot. Analysing the variables on their continuous scale gives greater statistical power than we would have if we categorised them in a binary manner to make two-by-two tables and risk ratios or odds ratios.
In the last tutorial, before the break, we compared birth weights by case status using a \(t\)-test. What if, however, both the exposure and the outcome are continuous? Maybe we want to see whether there is a relationship between gestational age (Gestation - the exposure) and birth weight (Birth_wt - the outcome). That is, if a baby stays longer in their mother's womb, we expect that it will be, on average, a bigger child. Let's try selecting these variables in iNZight.
🤓 Which variable should be on the vertical or \(y\) axis and which on the horizontal or \(x\) axis?
🤓 In the first variable, add the outcome
🤓 In the second, add the exposure
Click Add to plot --> Trend Lines and Curves. Click on the linear option, then also smoother.
🤓 How would you summarise the relationship?
🤓 How can we describe the slope of the straight line?
🤓 How about the \(y\)-intercept?
🤓 What is the advantage of the smoother over the linear regression line?
🤓 In this instance, what information does the smoother convey?
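If you would like to reproduce the plot outside iNZight, here is a minimal sketch (same hypothetical cot_death.csv; variable names as used in this tutorial).

```python
# Minimal sketch: scatter plot of Birth_wt against Gestation with a linear
# trend line and a lowess smoother. "cot_death.csv" is a hypothetical file name.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.read_csv("cot_death.csv").dropna(subset=["Gestation", "Birth_wt"])
x, y = df["Gestation"], df["Birth_wt"]     # exposure on x, outcome on y

plt.scatter(x, y, alpha=0.4)

# Linear trend line (least squares)
slope, intercept = np.polyfit(x, y, 1)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, intercept + slope * xs, label="linear")

# Lowess smoother, analogous to iNZight's "smoother" option
smoothed = lowess(y, x)
plt.plot(smoothed[:, 0], smoothed[:, 1], label="smoother")

plt.xlabel("Gestation (weeks)")
plt.ylabel("Birth weight (g)")
plt.legend()
plt.show()
```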
Linear regression modelling
We are going on to the Advanced
menu. Hold on to your
hats 🎩!
🤓 Go to Model Fitting. The outcome or \(y\) variable should be and the predictor variable should be . Then click FIT MODEL. We get some crazy scientific notation, like 1.916e+02. Remember this just means \(1.916 * 10^2\) or 192.
The Estimate for Gestation is the slope, or \(m\) from high school maths, in the formula \(y = mx + c\), where:
- \(y\) is the outcome,
- \(x\) is the exposure,
- \(m\) is the slope, and
- \(c\) is the \(y\)-intercept, or the value of \(y\) when \(x = 0\).
The slope is the key measure of association. In epidemiology, we call this the "\(\beta\) coefficient", which iNZight calls the Estimate - just to confuse you 😵.
A positive \(\beta\) coefficient means a positive association: the higher \(x\) gets, the higher \(y\) becomes. Here, the greater the Gestation, the greater the Birth_wt. A slope of zero means 'no association', and is the regression equivalent of the null hypothesis. The Estimate is the software name for the slope and means the average change in \(y\) given a one-unit change in \(x\).
🤓 Here, therefore, the baby puts on grams for every extra week (a one-unit increase) of gestation. The 95% confidence interval has the same interpretation as usual, and the \(P\)-value likewise (a slope of zero represents the null hypothesis).
🤓 Are the two variables associated?
🤓 Is the relationship statistically significant?
Also, this slope is described as a crude \(\beta\) coefficient, because there is only one independent variable (Gestation) in the model.
🤓 What does the \(R^2\) value tell us?
The \(R^2\) is an indicator of model fit. A value close to 1 indicates good fit, whereas a value close to 0 is poor.
🤓 What is the \(R^2\) for this model? . This means that, compared to a simple mean, the model reduces the residual variation by ~ %.
What about the residual standard error? 🤓 (Note: this is a misnomer; it should really be called the residual standard deviation.) It is unchanged by the sample size. The value for this model is: . Smaller values indicate better model fit than larger values.
🤓 Residual standard error is particularly useful for
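The same crude model can be fitted outside iNZight; here is a minimal sketch with statsmodels (hypothetical cot_death.csv again), in which the Gestation coefficient is the slope, or Estimate, described above.

```python
# Minimal sketch: crude linear regression of Birth_wt on Gestation.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cot_death.csv")          # hypothetical file name

crude = smf.ols("Birth_wt ~ Gestation", data=df).fit()
print(crude.summary())                                     # full output, like FIT MODEL
print("Slope (grams per extra week):", round(crude.params["Gestation"], 1))
print("95% CI:", crude.conf_int().loc["Gestation"].values)
print("R-squared:", round(crude.rsquared, 3))
print("Residual standard error:", round(crude.mse_resid ** 0.5, 1))
```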
If we now include a confounder, such as maternal age (Mother_age), we get an adjusted \(\beta\) coefficient. After adjusting for Mother_age, the slope of the line between Gestation and Birth_wt is , compared to the crude model's slope of 192. The residual standard error for the age-adjusted model is grams, which is slightly than the crude (470), and indicates model fit.
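Here is a minimal sketch of the age-adjusted model, for comparison with the crude one (same hypothetical file).

```python
# Minimal sketch: regression of Birth_wt on Gestation, adjusted for Mother_age.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cot_death.csv")          # hypothetical file name

adjusted = smf.ols("Birth_wt ~ Gestation + Mother_age", data=df).fit()
print("Adjusted slope for Gestation:", round(adjusted.params["Gestation"], 1))
print("Residual standard error:", round(adjusted.mse_resid ** 0.5, 1))
# Compare both numbers with the crude model to judge confounding and model fit.
```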
Some homework
Does maternal age (Mother_age) influence baby's birth weight (Birth_wt)?
The crude \(\beta\) coefficient is and means that, as mothers age by year, their babies get on average 15 grams heavier. This association statistically significant, as indicated by the \(P\)-value.
Check with both plots and model.
How much of the variation is explained by the model? %
What is the residual standard error, in grams?
See whether the relationship between Gestation and Birth_wt varies by case status (add Case_status as another Variable of interest, including Mother_age, in the Model interface) and also observe the plot.
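Outside iNZight, the equivalent model simply adds Case_status as another predictor; here is a minimal sketch (how Case_status is coded in the file is an assumption, so treat this as a template).

```python
# Minimal sketch: does Birth_wt differ by case status, given Gestation and Mother_age?
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cot_death.csv")          # hypothetical file name

# Case_status entered as a categorical predictor alongside Gestation and Mother_age
model = smf.ols("Birth_wt ~ Gestation + Mother_age + C(Case_status)", data=df).fit()
print(model.params)                        # the Case_status coefficient compares cases with controls
print("Residual standard error:", round(model.mse_resid ** 0.5, 1))
```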
In the plot, on average, cases are than controls, although the slope looks between the two plots.
The adjusted \(\beta\) coefficient (slope) for Case is , indicating that cases are on average than controls, after adjusting for . This association statistically significant.
The residual standard error is grams, which is than the equivalent statistic in the crude model. This indicates model fit.
Well done!😅 We have taken baby steps in understanding regression and other methods of adjusting for confounding. 🥳