We are investigating the association between cigarette smoking and dental caries.
This is a cross-sectional study, so we have to interpret our findings with caution, being mindful of the limits of such a study design. For example, since the exposure and outcome, and in fact all information, are collected at the same time, we have to be cautious about the possibility of reverse causation. Does the outcome lead to the exposure, rather than the converse?
Please see tutorial 1 for a list of the meanings of each variable. They are pretty straightforward.
We have a belief or hypothesis that people who do not smoke have fewer dental caries, and less tartar or plaque build-up.
We will be revising some of the lecture concepts this week by exploring these data.
We will be using iNZight lite.
A version of the spreadsheet is available here.
We will assume that we’ve covered the data-checking side of it, but for revision, can you remember what the four important items to check are?
(Data ribbon –> Remove duplicates, then save the new version.)
Upload the spreadsheet into iNZight using File –> Import Dataset. Navigate to where your file is located.
First, we need to convert these variables to categorical, or else they will be thought to be numbers or numeric. We were taught how to do this in the last tutorial.
As a refresher, go to Manipulate variables –> Convert to categorical.
First, select our categorical variables, smoking.cat and dental_caries.cat, in the Visualise tab.
Interpret the barplot. What does it show? Hint: focus on the right of the barplot, which shows the prevalence of dental caries in smokers (light blue) and non-smokers (orange).
The width of the bars indicates the relative proportions of smokers and non-smokers. Which group is more numerous?
The height of the bars indicates the proportion with dental caries in each group.
In this tutorial, we will focus on the Inference tab. Make sure you check the Epidemiology options –> Show Output check box.
This will calculate odds ratios and relative risks for the association between the two variables.
You should see something like the output shown below:
You can see that the proportion of smokers with caries is 0.26 or 26%, compared to the proportion of non-smokers with caries (0.19 or 19%). The risk ratio, our measure of association, is simply the ratio of the two:
\(\begin{align} \text{Risk ratio} &= \frac{\text{risk of rotten teeth in smokers}}{\text{risk of rotten teeth in non-smokers}}\\ &= \frac{0.26}{0.19} \\ &\approx 1.38 \\ &= \frac{1.38}{1}\\ \end{align}\)
This means that the risk of rotten teeth (caries) is (1.38 - 1) × 100 = 38% greater in the group that smokes, compared to the non-smokers.
Another way of stating this is that “smokers [exposed] are 1.38 [risk ratio] times more likely than non-smokers [unexposed] to have rotten teeth [outcome]”.
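As a quick check of the arithmetic above, here is a minimal Python sketch. The function names are made up for illustration and are not part of iNZight. It uses the rounded proportions quoted in the text, so it returns about 1.37 rather than 1.38; the small gap is presumably because the software works from unrounded proportions.

```python
def risk_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of the risk in the exposed group to the risk in the unexposed group."""
    return risk_exposed / risk_unexposed


def percent_increase(ratio: float) -> float:
    """Relative increase in risk as a percentage: (ratio - 1) * 100."""
    return (ratio - 1) * 100


# Rounded proportions quoted in the text (smokers, then non-smokers).
rr = risk_ratio(0.26, 0.19)
print(f"Risk ratio: {rr:.2f}")
print(f"Relative increase in risk: {percent_increase(rr):.0f}%")
```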
This is shown in the output at the bottom of the screen. The P-value reads 0.000, but this is a subtle error. The computer has rounded the value down. It should read < 0.0001. A P-value is the probability (long-run frequency) of getting the results we did, or more extreme, if there truly was no difference in caries between smokers and non-smokers (that is, if the null hypothesis were true, or caries and smoking status were independent). To interpret the statistic, we have to imagine doing the study over and over again; the statistic gives us the chance of such a result or a more extreme one. A low value like this means “our results are very unlikely to be due to chance”.
Note that the 95% confidence interval is interpreted as a range of values compatible with the true value. Technically, it is defined as the interval that, if the study were repeated over and over and the interval calculated each time, would contain the true value 19 times out of 20. This is a bit of a mouthful, which is why I prefer the former definition.
Note: the 95% confidence interval does not cross the null value (relative risk or odds ratio = 1; risk difference = 0), so this also indicates that the association is statistically significant.
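To make the idea of an interval that does or does not cross the null value concrete, here is a rough Python sketch of one common way to compute a 95% confidence interval for a risk ratio (a Wald-type interval on the log scale). The counts are hypothetical, chosen only to reproduce risks of 26% and 19%; they are not the survey data, and iNZight may use a different method.

```python
import math


def risk_ratio_ci(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Risk ratio with a Wald-type confidence interval computed on the log scale."""
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    # Standard error of log(RR) from the four cell counts.
    se_log_rr = math.sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# HYPOTHETICAL counts (260/1000 smokers, 190/1000 non-smokers), not the survey data.
rr, lower, upper = risk_ratio_ci(260, 1000, 190, 1000)
excludes_null = lower > 1 or upper < 1

print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print("Interval excludes the null value of 1:", excludes_null)
```

With these made-up counts the whole interval sits above 1, which is the pattern described in the note above.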
You may wonder why epidemiologists deal in both odds ratios and risk ratios.
Odds ratios are similar to risk ratios. An odds is simply a risk divided by its complement, the risk of something happening divided by the risk of the event not happening.
\(\text{Odds} = \frac{\text{Risk}}{1-\text{Risk}}\)
So, here, the odds ratio is the ratio of the two odds of the event (caries) in the exposed (smokers) and the unexposed (non-smokers):
\(\begin{align} \text{Odds ratio} &= \frac{\text{Odds of rotten teeth in smokers}}{\text{Odds of rotten teeth in non-smokers}}\\ &= \frac{ \left( \frac{0.26}{1-0.26}\right)} {\left(\frac{0.19}{1-0.19} \right) }\\ &\approx \frac{0.35}{0.23}\\ &\approx 1.51\\ \end{align}\)
The odds ratio is very similar to a risk ratio and the definition is as set out above for risk ratios, but where you read risk, insert odds.
To complicate matters, in case-control studies, where cases are sampled rather than a population, the odds ratio is considered an approximation of the relative risk! The odds ratio and relative risk are very similar if the disease or outcome in question is rare.
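The claim that odds ratios and risk ratios converge for rare outcomes can be checked numerically. The sketch below is illustrative only: the helper names are made up, the caries figures are the rounded proportions from the text, and the rare-outcome risks are hypothetical. Small differences from the figures quoted above come from rounding.

```python
def odds(risk: float) -> float:
    """Convert a risk (a probability) to odds: risk / (1 - risk)."""
    return risk / (1 - risk)


def odds_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Odds of the outcome in the exposed divided by the odds in the unexposed."""
    return odds(risk_exposed) / odds(risk_unexposed)


# Rounded proportions from the text: caries is common, so OR and RR differ.
print(f"Caries: RR = {0.26 / 0.19:.2f}, OR = {odds_ratio(0.26, 0.19):.2f}")

# HYPOTHETICAL rare outcome (risks of 0.2% and 0.1%): OR and RR nearly agree.
rare_rr = 0.002 / 0.001
rare_or = odds_ratio(0.002, 0.001)
print(f"Rare outcome: RR = {rare_rr:.3f}, OR = {rare_or:.3f}")
```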
Warning
Remember that whenever you encounter a risk ratio or odds ratio, you must first think of the two exposure groups that are being compared. The statistic relates the relative risk or odds of the outcome in the two groups. Here it is smokers and non-smokers.
Relative measures give an assessment of the strength of association between exposure and disease, and the higher the number, the more likely it is to be causal. However, this does not convey how important the association is.
Risk ratios are generally considered simpler to understand than odds ratios and more intuitive. Odds ratios have some nice mathematical properties, however. They are reversible and also are not bounded by the prevalence in the unexposed group, as risk ratios are.
For example, if a disease is especially common and half the unexposed population suffer from it, then even if all the exposed have the disease, the maximum the relative risk could ever be is 2. See the calculation below.
\(\begin{align} \text{Max. risk ratio} &= \frac{1}{0.5} \\ &= 2 \\ \end{align}\)
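The ceiling on the risk ratio, and the absence of one for the odds ratio, can be sketched in a few lines of Python. The function names and the example risks are assumptions made for illustration.

```python
def max_risk_ratio(risk_unexposed: float) -> float:
    """Largest possible risk ratio: every exposed person has the outcome (risk = 1)."""
    return 1 / risk_unexposed


def odds_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of the odds in the exposed to the odds in the unexposed."""
    return (risk_exposed / (1 - risk_exposed)) / (risk_unexposed / (1 - risk_unexposed))


# The more common the outcome in the unexposed, the lower the ceiling on the risk ratio.
for baseline in (0.5, 0.25, 0.1):
    print(f"Unexposed risk {baseline}: maximum risk ratio = {max_risk_ratio(baseline):.0f}")

# The odds ratio has no such ceiling: it keeps growing as the exposed risk nears 1.
for exposed in (0.9, 0.99, 0.999):
    print(f"Exposed risk {exposed}, unexposed risk 0.5: OR = {odds_ratio(exposed, 0.5):.0f}")
```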
The absolute risk difference is the difference between the risk of disease in the two exposure groups. Instead of a measure of association, it gives an indication of the importance of the association at a population level. Here the calculation is:
\(\begin{align} \text{Risk difference} &= \text{Risk of rotten teeth in smokers} - \text{Risk of rotten teeth in non-smokers}\\ &= 0.26 - 0.19\\ &= 0.07\\ \end{align}\)
This gives an indication of how much an individual’s risk of disease changes if they change their exposure status (here quit smoking), with the crucial assumption being that association is causal.
This can be a bit difficult to interpret. However, if we take the reciprocal of the risk difference, we get the number needed to treat, which is easier to interpret. Here, it is:
\(\begin{align} \text{Number needed to treat} &= \frac{1}{\text{risk difference}}\\ &= \frac{1}{0.07}\\ &= 14.3\\ \end{align}\)
This means that for every \(\approx\) 14 people we get to quit smoking, we will prevent one case of dental caries.
This gives an indication of the clinical relevance of our finding. It is a number that is relatively easy to understand for clinicians, and is useful in conveying the importance of a finding to a non-scientifically literate audience.
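The risk difference and number needed to treat calculations above can be reproduced with a short Python sketch. The function names are illustrative, the inputs are the rounded proportions from the text, and, as in the text, the result is only meaningful if the association is assumed to be causal.

```python
def risk_difference(risk_exposed: float, risk_unexposed: float) -> float:
    """Absolute difference in risk between the exposed and unexposed groups."""
    return risk_exposed - risk_unexposed


def number_needed_to_treat(rd: float) -> float:
    """Reciprocal of the absolute risk difference."""
    if rd == 0:
        raise ValueError("A risk difference of zero has no finite NNT.")
    return 1 / abs(rd)


# Rounded proportions quoted in the text (smokers, then non-smokers).
rd = risk_difference(0.26, 0.19)
nnt = number_needed_to_treat(rd)

print(f"Risk difference: {rd:.2f}")
print(f"Number needed to treat: {nnt:.1f}")
```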
Warning
Whenever you are using a number needed to treat (NNT) statistic, you are implicitly making a causal assumption. If you have a very large NNT it is likely that you are dealing with a very weak association that is unlikely to be causal.
We’ve now covered measures of association between categorical variables. Let’s move on and look at measures of association between a categorical and a continuous variable.
Let’s consider the relationship between cigarette smoking and waist circumference. Are smokers, on average, thinner or fatter than non-smokers?
To conduct the analysis, you’ll need to tell iNZight that smoking is a categorical rather than continuous variable. Use the Manipulate variables –> Convert to categorical trick to create smoking.cat.
Select waist.cm and smoking.cat.
Interpret the boxplot. You should see something like the picture below.
You can see that the distribution of waist circumference is approximately symmetric, as indicated by the boxplot, with the median line centrally placed in the box.
Which group has a larger waist? Yes, it is the smokers, and the median in the smokers is about 85 cm whereas the median in the non-smoking group is about 80 cm.
We can tell just from the plot that, on average, smokers have a larger waist than non-smokers, by about 5 cm.
You can see that for continuous variables, a measure of association with categorical variables is a mean difference.
The Summary tab will give you the numeric summary of the means and the Inference tab gives you the mean difference. If you click the Two sample t-test option, you’ll see that the difference is statistically significant, since the P-value is less than 0.05. The t-value tells you that the observed mean difference is about 13 standard errors from the null value of zero, which is way beyond the usual threshold of 1.96 standard errors.
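To show where a t-value comes from, here is a minimal Python sketch that divides a mean difference by its standard error. It uses the unequal-variance (Welch) form, which is an assumption and may not be the exact variant iNZight reports. The waist measurements are hypothetical and far fewer than in the survey, so the t-value here is much smaller than 13.

```python
import math
import statistics


def two_sample_t(group_a, group_b):
    """Return the mean difference, its standard error, and the Welch t-statistic."""
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Standard error of the difference between two independent means.
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return diff, se, diff / se


# HYPOTHETICAL waist measurements in cm, not the survey data.
smokers = [88, 84, 91, 79, 86, 83]
non_smokers = [80, 77, 84, 75, 82, 79]

diff, se, t = two_sample_t(smokers, non_smokers)
print(f"Mean difference: {diff:.2f} cm")
print(f"Standard error: {se:.2f} cm")
print(f"t-value: {t:.2f} standard errors from zero")
```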
Consider the nature of the association between gender and tartar.
Which variable should be the first variable in iNZight and which should be the second variable? How do you choose which one is which?
Interpret the plot. Focus on the right hand side.
Which gender is most prevalent in the survey? Males or females?
Which gender has the higher prevalence of tartar?
Comparing the relative heights of the bars on the right-hand side of the plot, what do you expect the relative risk to be?
Check this against what you find in the Inference tab for the relative risk. Remember, you will need to check the Epidemiology output tab.
Consider the nature of the association between tartar and dental_caries.
Remember to convert dental_caries to a categorical variable.
Interpret the plot. What proportion of the population have tartar?
From the plot, what do you expect the risk ratio to be?
Check this against what you find in the Inference tab for the relative risk. Remember, you will need to check the Epidemiology output tab.
Interpret the relative risk and explain this to an educated audience.
Interpret the risk difference. Calculate a number needed to treat from this statistic. Please explain this result to an educated audience. What assumption is entailed in this calculation?
## Categorical and continuous variables

### Is cigarette smoking associated with HDL (“good”) cholesterol?
Select HDL and smoking.cat. Describe the plot. Hint: use the Interactive plot to see values.
Remember, higher HDL is associated with lower risk of CVD. Check the Inference tab and select Two sample t-test.
Interpret the test.
Is smoking status associated with HDL?
What is the magnitude of the association?
The plot shows that males are more prevalent than females in the study. It also shows that the risk of tartar is about 10% higher in men than in women. Formally, the relative risk is 1.13.
Men [exposed] are 1.13 [risk ratio] times more likely than women [unexposed] to have tartar [outcome]; that is, there is a 13% increase in risk of tartar comparing men to women.
The \(P\)-value is 0.001, so the difference in prevalence of tartar is unlikely to be due to chance.
The 95% confidence interval of 1.12 to 1.52 means that the true value is likely to be within a 12% to 52% increase in risk comparing men to women. This is a range of values compatible with the true value.
Subjects with tartar [exposed] are 1.97 (almost 2x) [risk ratio] times more likely than those with no tartar [unexposed] to have dental caries [outcome] or there is a 97% (~100%) increase in risk of caries comparing the tartar to the no tartar group.
The 95% confidence interval of 1.92 to 2.81 means that the true value is likely to be within a 92% to 181% increase in risk comparing those with tartar to those without. This is a range of values compatible with the true value.
Subjects with tartar have an absolute risk of caries that is 13 percentage points higher than that of subjects with no tartar (risk difference = 0.13). If it were possible to reverse tartar, doing so would prevent one case of caries for every 8 (1/0.13 = 7.7) people who had their tartar removed, assuming the association is causal. The 95% CI (range of values compatible with the true value) is 6.25 to 10.
The association between tartar and caries is unlikely to be due to chance (\(P\) < 0.001).
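A confidence interval for a number needed to treat is usually obtained by taking reciprocals of the risk-difference bounds, which swaps their order. The Python sketch below illustrates this. The bounds of 0.10 and 0.16 are back-calculated from the interval of 6.25 to 10 quoted above; they are an assumption, not values reported in the iNZight output.

```python
def nnt_interval(rd_lower: float, rd_upper: float):
    """Invert positive risk-difference bounds to get NNT bounds (order is swapped)."""
    if rd_lower <= 0 or rd_upper <= 0:
        raise ValueError("Both bounds must be positive for a simple NNT interval.")
    return 1 / rd_upper, 1 / rd_lower


# ASSUMED risk-difference bounds, back-calculated from the quoted NNT interval.
nnt_low, nnt_high = nnt_interval(0.10, 0.16)
point_estimate = 1 / 0.13

print(f"NNT = {point_estimate:.1f}, 95% CI {nnt_low:.2f} to {nnt_high:.0f}")
```

The simple inversion only works when both risk-difference bounds have the same sign, which is why the function refuses intervals that reach zero.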
The distributions shown in the boxplot are slightly right-skewed. Non-smokers have a median HDL of 57 mg/dL and smokers 52 mg/dL, a median difference of 5 mg/dL. A two-sample \(t\)-test shows that the mean difference (5.2 mg/dL) is unlikely to be due to chance (\(P\) < 0.001). The 95% CI for the difference is 4.2 to 6.2 mg/dL. This is a range of values compatible with the true value.