Analysing a case-control study

Photo by Garrett Jackson on Unsplash

Aims

  • Gain confidence using iNZight lite to analyse a case-control study.
  • Reinforce the assessment of categorical associations, this time focusing on odds ratios and on exposures with several categories.
  • Calculate a population attributable risk, and explain its meaning and underlying assumptions.

We will use this as a chance to both develop our analytical skills and revise some of the material we've been discussing.


Cot-death

This is an otherwise unexplained death, usually during sleep, of an infant aged less than 12 months. The cause is not known, but at the time of the study there were several theories, relating to suffocation caused by infant sleep position or bed-sharing, or to exposure to tobacco smoke, usually from the parents.

Case-control study

The full story of cot-death is largely one of an iatrogenic (medically caused) tragedy and I encourage you to read the paper which is linked below the plot.

study

Source: Neonatology 2018;113:162--169 DOI: 10.1159/000481880

This is a case-control study that was carried out in the 1980s to address the problem of an epidemic of cot-death in New Zealand.

The main hypothesised causes of cot-death were:

  • infant sleeping position,
  • maternal smoking, and
  • bed-sharing.

Cases were those that came to medical attention due to their child's death. Controls were sampled from the community at the same time as the cases were recruited.

🤓 The main reason the researchers chose a case-control design was that cot-death is a ____. Case-control studies are particularly efficient for studying rare ____.

The data are provided in a .csv file available in Canvas.

The meaning of the variables should be self-explanatory.

We will be using iNZight lite.

We will assume that we've covered the data checking side of it, but for revision, can you remember what the four important items to check are?

As well as looking for missing values and ranges, what is the other important issue to look for in a dataset?

As a reminder, the issues to consider are:

  • Duplicates (check in Excel)
  • Ranges
    • Are any variables out of range or biologically implausible?
  • Missing values
    • Take a note as they may affect calculations down the track.
    • They should be consistently coded.
    • More than 15% is generally considered problematic.
  • Consistent coding of categorical variables.
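For anyone who prefers to script these checks rather than use Excel, the list above can be sketched in Python as follows. This is only a minimal illustration: the example records, the plausible range for Birth_wt and the missing-value codes are all made up for the demonstration and are not taken from the study file.

```python
from collections import Counter

def check_rows(rows, ranges, missing_codes=("", "NA")):
    """Run the four basic data checks on a list of records (dicts).

    ranges: {column: (low, high)} plausible limits for numeric columns.
    Returns (duplicate count, out-of-range values, missing fractions,
    distinct levels of the remaining columns).
    """
    # Duplicates: identical records appearing more than once
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    n_duplicates = sum(c - 1 for c in counts.values())

    out_of_range = []
    missing = Counter()
    levels = {}
    for i, r in enumerate(rows):
        for col, value in r.items():
            if value in missing_codes:
                missing[col] += 1          # missing values, by column
            elif col in ranges:
                low, high = ranges[col]
                if not low <= float(value) <= high:
                    out_of_range.append((i, col, value))
            else:
                # Categorical column: collect levels to spot inconsistent coding
                levels.setdefault(col, set()).add(value)
    missing_fraction = {col: n / len(rows) for col, n in missing.items()}
    return n_duplicates, out_of_range, missing_fraction, levels

# Made-up records for illustration only (NOT the study data)
rows = [
    {"Birth_wt": "3400", "Mother_smoke": "Yes"},
    {"Birth_wt": "3400", "Mother_smoke": "Yes"},   # exact duplicate
    {"Birth_wt": "34000", "Mother_smoke": "yes"},  # implausible weight, inconsistent coding
    {"Birth_wt": "NA", "Mother_smoke": "No"},      # missing value
]
n_duplicates, out_of_range, missing_fraction, levels = check_rows(
    rows, ranges={"Birth_wt": (500, 6000)}  # assumed plausible range in grams
)
# Columns exceeding the 15% rule of thumb for missingness
too_many_missing = [col for col, f in missing_fraction.items() if f > 0.15]
```

Inspecting `levels` makes inconsistent coding easy to spot: here "Yes" and "yes" would show up as two separate categories.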

Prevalence of maternal smoking

🤓 Overall, the prevalence is: ____% [two significant figures].

🤓 In cases, ____% [two significant figures].

🤓 In controls, ____% [two significant figures].

🤓 This indicates that maternal smoking ____ associated with cot-death, because the prevalence of smoking is higher in cases of cot-death than in controls.

Smoking and cot-death

Photo by Anastasia Vityukova on Unsplash

Use iNZight lite to determine the magnitude and direction of the association between maternal smoking and cot-death.

First select the (outcome) variable:

Case_status

followed by:

Mother_smoke

as the second variable (exposure).

Check and interpret the bar plot. By clicking on the Add to Plot button, you can experiment with different types of plot. You can see that the percentage of cases is much higher in the smoking group than in the non-smoking group.

Check the Summary tab and inspect the raw numbers and percentages to verify the nature and direction of the association.

Check the Epidemiology options in the Inference tab (check the Epidemiology options box).

🤓 Which measure of association is usually reported for a case-control study?

Interpret the findings.

🤓 Please select the most appropriate description:

Our hypothesis is that 'mothers who smoke are 4.4 times more likely to have a cot-death baby than non-smokers'. What the data estimate directly is rather 'mothers of cot-death babies are 4.4 times more likely to be smokers (than non-smokers) than mothers of controls'. Thanks to the use of odds ratios, the two estimates are equal, so you could use either interpretation, but the latter is consistent with how the data were collected: cases and controls were sampled, and then exposure was assessed. In fact, if you switch 'Case_status' and 'Mother_smoke', you'll notice that the odds ratio is identical! This is not so for the relative risk.
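The symmetry of the odds ratio can be checked by hand. The sketch below uses made-up counts (not the study data) and two small helper functions written for this illustration; it computes both measures with exposure as the row variable, then again with the table transposed.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    """Ratio of row proportions: a/(a+b) divided by c/(c+d)."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts for illustration only (NOT the study data):
#               case  control
# smoker         60      30
# non-smoker     40     110
a, b, c, d = 60, 30, 40, 110

# Exposure as rows, outcome as columns
or_exposure_rows = odds_ratio(a, b, c, d)
rr_exposure_rows = risk_ratio(a, b, c, d)

# Swap the two variables: outcome as rows, exposure as columns (transposed table)
or_outcome_rows = odds_ratio(a, c, b, d)
rr_outcome_rows = risk_ratio(a, c, b, d)

print(or_exposure_rows, or_outcome_rows)  # the two odds ratios agree
print(rr_exposure_rows, rr_outcome_rows)  # the two ratios of proportions differ
```

Transposing the table swaps b and c, which leaves the cross-product ad/bc unchanged but changes the row totals used by the ratio of proportions.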

How is it interpreted?

We are interested in the interpretation 'mothers who smoke are 4.4 times more likely to have a baby die of cot-death than mothers who do not smoke'. What the data estimate directly is rather 'mothers of cot-death babies are 4.4 times more likely to be smokers (than non-smokers) than mothers of controls'. Thanks to the use of odds ratios, the two estimates are equal, so you could use either interpretation, but the latter is consistent with how the data were collected: cases and controls were sampled, and then exposure was assessed.

Warning

Remember that case-control studies should use odds ratios rather than relative risks as measures of association. For simplicity, we usually say that odds ratios from case-control studies are interpreted as risk ratios. The gory reality, however, is that depending on the design, the odds ratio sometimes has a different interpretation, for example as a rate ratio. See here for details.

Remember, the first step is to consider the nature of the two groups being compared. An example is: Mothers of babies who died of cot-death [cases] were 4.37 times [odds ratio interpreted as relative risk] more likely to smoke [exposure] than mothers of healthy babies [controls]. Note: This is somewhat different from how risk ratios are presented from cross-sectional and cohort studies, where the groups being compared are defined by the exposure in our hypothesis. Usually we are interested in the risk of disease given exposure, but in a case-control study, it is the risk of exposure given disease (case or control status).

Could chance be an explanation of this finding?

Examine the \(P\)-value. Is it less than 0.05?
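To see roughly where a P-value and confidence interval for a 2x2 table come from, here is a minimal sketch using the standard log (Woolf) method for the odds ratio interval and a Pearson chi-square test without continuity correction. The counts are made up, and iNZight may use a different test or interval method, so its output need not match this sketch exactly.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for [[a, b], [c, d]] with an approximate 95% CI (log method)."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper

def chi_square_p(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction) and P-value."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts for illustration only (NOT the study data)
estimate, lower, upper = odds_ratio_ci(60, 30, 40, 110)
chi2, p_value = chi_square_p(60, 30, 40, 110)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}, P = {p_value:.2g}")
```

A confidence interval that excludes 1 and a P-value below 0.05 tell the same story: chance alone is an unlikely explanation for the association.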

Revision

🤓 What is a P-value?

🤓 Which statement about a 95% confidence interval of the mean is true?

Other explanations

What additional information might you like to consider before assuming that there is a causal association here?

Population benefits

Your boss is excited by these findings and convinced that you may have the key to solving the cot-death epidemic 😆. If we 'took away' smoking from this population, what changes in incidence of cot-death would we expect to see?

🤓 Which of the following is the most appropriate statistic to answer this question?

Population attributable risk

The proportion of the cases in the population that may be prevented if the exposure is removed. It is derived from a measure of association (relative risk or odds ratio) and the prevalence of exposure in the population. A causal assumption is inherent in the calculation: i.e. that the exposure causes the outcome.

The formula is: \[ \text{Population proportion attributable risk} = \frac{\text{prevalence}_\text{exposure} (\text{RR} - 1)}{1 + \text{prevalence}_\text{exposure} (\text{RR} - 1)} \] You will need a calculator or Microsoft Excel to work this out, since unfortunately, iNZight doesn't perform this calculation!

Remember, \(\text{prevalence}_{\text{exposure}}\) here is likely to be closer to that estimated from controls rather than from cases!

The correct answer here is: \[ \begin{align} \text{Population proportion attributable risk} &= \frac{0.311*(4.37 - 1)}{1 + 0.311*(4.37 - 1)} \\ &= \frac{1.05}{2.05} \\ &= 0.51 \text{ or } 51\% \\ \end{align} \]

Here, the prevalence of exposure is taken from the controls, since this is close to the population prevalence, and the odds ratio is used as the estimate of the relative risk.
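If you would rather script the calculation than use a calculator or Excel, the formula above can be written as a short Python function. The function name is just a label chosen for this sketch; the inputs are the control-group prevalence (0.311) and odds ratio (4.37) used in the worked answer.

```python
def population_attributable_risk(prevalence_exposure, relative_risk):
    """Population attributable risk, as a proportion.

    prevalence_exposure: proportion of the population exposed (0 to 1)
    relative_risk: relative risk, or an odds ratio used as its estimate
    """
    excess = prevalence_exposure * (relative_risk - 1)
    return excess / (1 + excess)

par_smoking = population_attributable_risk(0.311, 4.37)
print(f"{par_smoking:.2f}")  # 0.51
```

The same function can be reused for any other binary exposure, as long as the prevalence is entered as a proportion rather than a percentage.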

Warning

For cohort or cross-sectional studies, the prevalence in a population attributable risk calculation is generally taken as the unconditional or overall prevalence of exposure. In a case-control study, the best assessment of the overall prevalence of exposure is from the control group, although, technically, it could be thought of as a weighted average of the prevalence in the controls and the cases. The weights would be proportions of the cases and controls in the study population.
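The weighted average described above can be written out as follows, where \(w\) is a symbol introduced here (not in the original notes) for the proportion of the population who are cases:

\[ \text{prevalence}_\text{exposure} = w \times \text{prevalence}_\text{cases} + (1 - w) \times \text{prevalence}_\text{controls} \]

When the outcome is rare, \(w\) is close to 0, so the overall prevalence is approximately the prevalence in controls. This is why the control group is used in the calculation.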

Assumptions

Remember that we have made a hidden assumption here: we are treating maternal smoking as causal, when the analysis only provides evidence of association. What additional steps do we need to take to move from association to causation?


Sleeping position and cot-death

bliss!

Select

Case_status

as the outcome and

Sleep_position

as the second variable.

Interpret the barplot. What does it show?

In what direction do you expect to see the association?

It is important to visualise the direction of association so that you don't misinterpret inference information later.

You might need to reorder the Sleep_position variable to improve the interpretation of the graph. It makes sense here to go from low risk to high risk.

In that case, I suggest Back first, then Side, then Other, then Front_face_down and finally, Front_face_to_side.

Use "Manipulate Variables" --> "Categorical variables" --> "Reorder levels"

Click

"Inference"

Check

"Epidemiology Options"

box if it is available.

Interpret the output. Focus on the direction and strength of association.

🤓 Which sleeping position is highest risk?

🤓 Which exposure category is the software selecting as the comparison?

How could we change that? Verify that your changes make more sensible output and comparison categories.

Change the exposure category to compare Front_face_down and Front_face_to_side with every other category.

Hint: you may need to use the following functions...

"Manipulate Variables" --> "Categorical variables" --> "Collapse levels"

and

"Manipulate Variables" --> "Categorical variables" --> "Reorder levels"

Once you have your binary variable (sleeping on the front vs. any other position), estimate the association between this variable and Case_status.
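What "Collapse levels" does to the variable can be sketched in a few lines of Python. The level names are those of Sleep_position given earlier; the new labels "Front" and "Not_front" are made up for this illustration, and you can name the collapsed levels however you like in iNZight.

```python
# Levels to be merged into the exposed category
FRONT_LEVELS = {"Front_face_down", "Front_face_to_side"}

def collapse_sleep_position(position):
    """Map a five-level sleep position onto a binary front / not-front variable."""
    return "Front" if position in FRONT_LEVELS else "Not_front"

positions = ["Back", "Side", "Other", "Front_face_down", "Front_face_to_side"]
collapsed = [collapse_sleep_position(p) for p in positions]
print(collapsed)
```

Reordering the levels afterwards only changes which category the software treats as the comparison group; it does not change the data.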

Check the Epidemiology options in the Inference tab.

🤓 Interpret the findings.

🤓 Could chance be an explanation of this finding?

The \(P\)-value is extremely small (< 0.001), indicating that chance is an unlikely explanation for the association between sleeping position and cot-death.

What additional information might you like to consider before assuming that there is a causal association here?

🤓 Possible other explanations for an association include all of the following except:

A type-2 error is not a possible explanation, because a type-2 error is a false negative and here we have found a positive association. For a type-2 error, we would need a null or non-significant result.

Population benefits

Your boss is (again!) convinced that they may have the key to solving the cot-death epidemic. If we 'took away' sleeping face down from this population, what changes in incidence of cot-death would we expect to see?

First we need to decide what the prevalence of sleeping on the front in controls is.

The exposure prevalence to use is the prevalence in controls, so first select Case_status and then your binary variable for sleeping position.

The relevant odds ratio is 3.9, so the calculation is: \[ \begin{align} \text{Population attributable risk} &= \frac{\text{prevalence}_\text{exposure} (\text{OR} - 1)}{1 + \text{prevalence}_\text{exposure} (\text{OR} - 1)} \\ &= \frac{0.32*(3.9 - 1)}{1 + 0.32*(3.9 - 1)} \\ &= \frac{0.32*(2.9)}{1 + 0.32*(2.9)}\\ &= \frac{0.93}{1.93} \end{align} \] So, the answer to two significant figures, as a percentage, is: ____%.

🤓 The assumption underlying this calculation is:

Wow 😮. We can potentially prevent half of cot deaths just by telling parents to sleep their infants on their backs rather than their fronts. This is very powerful stuff! 💪

Birth weight and cot-death

Check the nature of the association between birth weight in grams (Birth_wt) and Case_status.

What does the plot show? Experiment with different types of plots.

🤓 The boxplot shows that the distribution of Birth_wt by Case_status is ____.

🤓 The statistical test for association between these variables is a ____ test.

🤓 Which group has higher birth weights?

🤓 A description of the statistical test result: the difference in birth weight between the controls and cases is ____ grams [to two significant figures or round to the nearest 10], with ____ being heavier on average.

🤓 Is the difference likely to be due to chance (\(P\)-value)?

🤓 You wonder whether these differences could be attributed to maternal smoking. Add in a third variable (Mother_smoke), and now check Inference. What is the average difference (between cases and controls) in the smoking (____) and non-smoking (____) groups [to two significant figures]? What is your conclusion? Smoking is a likely ____ of the relationship between birth weight and cot-death, since it is a likely common cause of both exposure and outcome.
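Adding a third variable amounts to repeating the comparison within each level of that variable. The sketch below shows the idea with deliberately artificial records (not the study data, and with assumed category labels "Case", "Control", "Yes" and "No"), constructed so that the overall difference disappears entirely once the comparison is made within smoking groups.

```python
from statistics import mean

def mean_difference(rows):
    """Mean Birth_wt in controls minus mean Birth_wt in cases."""
    controls = [r["Birth_wt"] for r in rows if r["Case_status"] == "Control"]
    cases = [r["Birth_wt"] for r in rows if r["Case_status"] == "Case"]
    return mean(controls) - mean(cases)

def stratified_differences(rows, stratum_key):
    """Repeat the comparison separately within each level of a third variable."""
    strata = sorted({r[stratum_key] for r in rows})
    return {s: mean_difference([r for r in rows if r[stratum_key] == s])
            for s in strata}

# Artificial records for illustration only (NOT the study data)
rows = [
    {"Case_status": "Control", "Mother_smoke": "No", "Birth_wt": 3500},
    {"Case_status": "Control", "Mother_smoke": "No", "Birth_wt": 3500},
    {"Case_status": "Control", "Mother_smoke": "No", "Birth_wt": 3500},
    {"Case_status": "Control", "Mother_smoke": "Yes", "Birth_wt": 3200},
    {"Case_status": "Case", "Mother_smoke": "No", "Birth_wt": 3500},
    {"Case_status": "Case", "Mother_smoke": "Yes", "Birth_wt": 3200},
    {"Case_status": "Case", "Mother_smoke": "Yes", "Birth_wt": 3200},
    {"Case_status": "Case", "Mother_smoke": "Yes", "Birth_wt": 3200},
]

crude = mean_difference(rows)                               # all records together
by_smoking = stratified_differences(rows, "Mother_smoke")   # within each smoking group
print(crude, by_smoking)
```

In real data the within-group differences will rarely vanish completely; the question is whether they are noticeably smaller than the overall difference.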


Practice

Ethnic group and cot-death

🤓 Which ethnic group is at highest risk of cot-death?

🤓 What could this be due to?

The incorrect responses here are possible, but not as likely as maternal smoking status.

🤓 How could you investigate this hypothesis further?

Include maternal smoking as a potential confounder (3rd variable).

What happens to the association between Maori ethnicity and cot-death status?


🤓 This indicates that smoking is likely to be a ____ of the relationship between ethnicity and cot-death. This is because cigarette smoking is likely to be affected by ethnicity, rather than the other way around.


Mother's age and cot-death

Attempt to answer the following questions using iNZight with age as a continuous variable.

🤓 Mothers who are ____ have babies at the highest risk of cot-death.

🤓 The distribution of maternal age is ____.

🤓 The statistical test for association between these variables is a ____ test.

🤓 A description of the statistical test result: the difference in maternal age between the controls and cases is ____ years [to two significant figures], with ____ being older on average.