Exploring inferential statistics.

How can we avoid dental procedures?

Tutorial aims

Gain confidence using iNZight lite to explore hypothesis testing for both categorical and continuous variables.

We are investigating the association between cigarette smoking and dental caries.

This is a cross-sectional study, so we have to interpret our findings with caution, being mindful of the limits of such a study design. For example, since the exposure and outcome, and in fact all information, are collected at the same time, we have to be cautious about the possibility of reverse causation. Does the outcome lead to the exposure, rather than the converse?

Please see tutorial 1 for a list of the meanings of each variable. They are pretty straightforward.

We have a belief or hypothesis that people who do not smoke have fewer dental caries, and less tartar or plaque build-up.

We will be revising some of the lecture concepts this week by exploring these data.

We will be using iNZight lite.

A version of the spreadsheet is available here.

We will assume that we’ve covered the data checking side of it, but for revision, can you remember what the ~~three~~ four important items to check are?

Duplicates (check in Excel, Data ribbon –> Remove duplicates, then save new version)
Ranges
- Are any variables out of range? (again, simple in Excel, using )
Missing values
- Take a note as they may affect calculations down the track.
- Make sure that they are consistently coded. We want to avoid the situation of having both blank cells and “I don’t know”. Generally, choose a generic code (usually a blank cell) and stick with that.
- More than 15% is generally considered problematic.
Consistent coding of variables.
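The four checks above can also be sketched in code. This is a minimal illustration only: the rows, the column names (`id`, `smoking`, `waist_cm`), the plausible waist range and the allowed smoking codes are all assumptions made up for the example, not values from the tutorial spreadsheet.

```python
# Hypothetical rows standing in for the spreadsheet. Column names, values,
# the waist range and the allowed codes are assumptions for illustration.
columns = ["id", "smoking", "waist_cm"]
rows = [
    {"id": 1, "smoking": 1, "waist_cm": 85.0},
    {"id": 2, "smoking": 0, "waist_cm": 80.0},
    {"id": 2, "smoking": 0, "waist_cm": 80.0},               # exact duplicate
    {"id": 3, "smoking": None, "waist_cm": 820.0},            # missing value, waist out of range
    {"id": 4, "smoking": "I don't know", "waist_cm": 92.5},   # inconsistent code
]

# 1. Duplicates: flag any row identical to one already seen.
seen = set()
duplicate_ids = []
for row in rows:
    key = tuple(row[c] for c in columns)
    if key in seen:
        duplicate_ids.append(row["id"])
    seen.add(key)

# 2. Ranges: flag waist values outside an assumed plausible range (cm).
WAIST_RANGE = (40.0, 200.0)
out_of_range_ids = [
    row["id"]
    for row in rows
    if row["waist_cm"] is not None
    and not (WAIST_RANGE[0] <= row["waist_cm"] <= WAIST_RANGE[1])
]

# 3. Missing values: fraction missing per column, flagged above 15%.
missing_fraction = {
    c: sum(row[c] is None for row in rows) / len(rows) for c in columns
}
problem_columns = [c for c, frac in missing_fraction.items() if frac > 0.15]

# 4. Consistent coding: anything other than 0, 1 or blank is unexpected.
ALLOWED_SMOKING_CODES = {0, 1, None}
unexpected_codes = sorted(
    {str(row["smoking"]) for row in rows
     if row["smoking"] not in ALLOWED_SMOKING_CODES}
)

print("Duplicate ids:", duplicate_ids)
print("Out-of-range ids:", out_of_range_ids)
print("Columns with >15% missing:", problem_columns)
print("Unexpected smoking codes:", unexpected_codes)
```

In practice the same checks are done by eye in Excel or iNZight; the sketch just makes each rule explicit.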

Revision

Upload the spreadsheet

Upload the spreadsheet into iNZight using File –> Import Dataset

Navigate to where your file is located.

Categorical associations

Is smoking associated with dental decay?

First, we need to convert these variables to categorical, or else they will be thought to be numbers or numeric. We were taught how to do this in the last tutorial.

As a refresher, go to Manipulate variables –> Convert to categorical.

First select our categorical variables in the Visualise tab: the outcome, dental_caries.cat, first and the exposure, smoking.cat, second.

Categorical association

Interpret the barplot. What does it show? Hint, focus on the right of the barplot which shows the prevalence of dental caries in smokers (light blue) and non-smokers (orange).

The width of the bars indicates the relative proportions of smokers and non-smokers. Which group is more numerous?
The height of the bars indicates the proportion of each group with that caries status.

In this tutorial, we will focus on the Inference tab. Make sure you check the Epidemiology options –> Show Output check box.

This will calculate odds ratios and relative risks for the association between the two variables.

You should see something like the output shown below:

Measure of association

You can see that the proportion of smokers with caries is 0.26 or 26%, compared to the proportion of non-smokers with caries (0.19 or 19%). The risk ratio or measure of association is simply the ratio of the two:

\(\begin{align} \text{Risk ratio} &= \frac{\text{risk of rotten teeth in smokers}}{\text{risk of rotten teeth in non-smokers}}\\ &= \frac{0.26}{0.19} \\ &\approx 1.38 \\ &= \frac{1.38}{1}\\ \end{align}\)

This means that the risk of rotten teeth (caries) is (1.38 - 1) * 100 = 38% greater in the group that smokes, compared to the non-smokers.

Another way of stating this is that “smokers [exposed] are 1.38 [risk ratio] times more likely than non-smokers [unexposed] to have rotten teeth [outcome]”.

The risk ratio is shown in the output at the bottom of the screen. The P-value reads 0.000, but this is a subtle error: a P-value is never exactly zero, and the computer has simply rounded the value down. It should be reported as < 0.001. A P-value is the probability (long-run frequency) of getting the results we did, or more extreme, if there truly were no difference in caries between smokers and non-smokers (that is, the null hypothesis is true, or caries and smoking status are independent). To interpret the statistic, we have to imagine doing the study over and over again; the P-value gives us the chance of such a result, or one more extreme, if the null hypothesis were true. A low value like this means “our results are very unlikely to be due to chance”.

Note that the 95% confidence interval is interpreted as a range of values compatible with the true value. Technically, it is defined as the interval that, if the study were repeated over and over and the interval calculated each time, would contain the true value 19 times out of 20. This is a bit of a mouthful, which is why I prefer the former definition.

Note: the 95% confidence interval does not cross the null value (relative risk or odds ratio = 1; risk difference = 0), so this also indicates that the association is statistically significant.
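To see where the risk ratio and its confidence interval come from, here is a rough sketch of the arithmetic. The group sizes are hypothetical, chosen only so that the proportions match the 0.26 and 0.19 quoted above (the real counts are not given in this tutorial), and the interval uses the common large-sample approximation on the log scale, which may not be exactly what iNZight does. With the rounded proportions the ratio works out a little under the 1.38 that iNZight reports, presumably because the software uses unrounded values.

```python
import math

# Hypothetical counts, chosen only so the proportions match the tutorial
# (0.26 and 0.19). The real group sizes are not given in the text.
smokers_with_caries, smokers_total = 260, 1000
nonsmokers_with_caries, nonsmokers_total = 190, 1000

risk_smokers = smokers_with_caries / smokers_total
risk_nonsmokers = nonsmokers_with_caries / nonsmokers_total

# Risk ratio and the percentage increase in risk it implies.
risk_ratio = risk_smokers / risk_nonsmokers
percent_increase = (risk_ratio - 1) * 100

# Large-sample standard error of log(risk ratio).
se_log_rr = math.sqrt(
    1 / smokers_with_caries - 1 / smokers_total
    + 1 / nonsmokers_with_caries - 1 / nonsmokers_total
)

# 95% confidence interval, built on the log scale and transformed back.
z = 1.96
ci_lower = math.exp(math.log(risk_ratio) - z * se_log_rr)
ci_upper = math.exp(math.log(risk_ratio) + z * se_log_rr)

print(f"Risk ratio: {risk_ratio:.2f} ({percent_increase:.0f}% increase)")
print(f"95% CI: {ci_lower:.2f} to {ci_upper:.2f}")
print("CI excludes the null value of 1:", ci_lower > 1 or ci_upper < 1)
```

The width of the interval depends heavily on the made-up group sizes, so only the method, not the numbers, carries over to the real data.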

Risk ratio and odds ratios

Odds ratios vs. risk ratios vs. risk differences

You may wonder why epidemiologists deal in both odds ratios and risk ratios.

Why odds ratios?

Odds ratios are similar to risk ratios. An odds is simply a risk divided by its complement, the risk of something happening divided by the risk of the event not happening.

\(\text{Odds} = \frac{\text{Risk}}{1 - \text{Risk}}\)

So, here, the odds ratio is the ratio of the two odds of the event (caries) in the exposed (smokers) and the unexposed (non-smokers):

\(\begin{align} \text{Odds ratio} &= \frac{\text{Odds of rotten teeth in smokers}}{\text{Odds of rotten teeth in non-smokers}}\\ &= \frac{ \left( \frac{0.26}{1-0.26}\right)} {\left(\frac{0.19}{1-0.19} \right) }\\ &= \frac{0.35}{0.23}\\ &= 1.51\\ \end{align}\)

The odds ratio is very similar to a risk ratio and the definition is as set out above for risk ratios, but where you read risk, insert odds.
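The odds ratio calculation can be checked with a few lines of code. This is only a sketch using the rounded proportions from this tutorial (0.26 and 0.19); the helper name `odds` is made up for the example. Without intermediate rounding the result is close to 1.50, slightly different from the 1.51 quoted, which presumably reflects unrounded proportions in the software output.

```python
def odds(risk):
    """Convert a risk (probability) into odds: risk / (1 - risk)."""
    return risk / (1 - risk)

# Rounded proportions with caries, taken from the tutorial text.
risk_smokers = 0.26
risk_nonsmokers = 0.19

odds_smokers = odds(risk_smokers)        # about 0.35
odds_nonsmokers = odds(risk_nonsmokers)  # about 0.23
odds_ratio = odds_smokers / odds_nonsmokers

print(f"Odds in smokers:     {odds_smokers:.2f}")
print(f"Odds in non-smokers: {odds_nonsmokers:.2f}")
print(f"Odds ratio:          {odds_ratio:.2f}")
```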

To complicate matters, in case-control studies, where cases are sampled rather than a population, the odds ratio is considered an approximation of the relative risk! The odds ratio and relative risk are very similar if the disease or outcome in question is rare.
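The claim that the two measures agree for rare outcomes can be illustrated numerically. In this sketch the risk ratio is held fixed at 1.38 (the value from this tutorial) while the baseline risk in the unexposed group is varied; the list of baseline risks is an arbitrary choice for illustration.

```python
def odds(risk):
    """Convert a risk (probability) into odds."""
    return risk / (1 - risk)

RISK_RATIO = 1.38  # held fixed; value taken from the tutorial

# Arbitrary baseline risks in the unexposed group, from rare to common.
baseline_risks = [0.001, 0.01, 0.1, 0.19, 0.5]

gaps = []
for baseline in baseline_risks:
    risk_exposed = baseline * RISK_RATIO
    odds_ratio = odds(risk_exposed) / odds(baseline)
    gaps.append(odds_ratio - RISK_RATIO)
    print(f"baseline risk {baseline:>5}: RR = {RISK_RATIO:.2f}, OR = {odds_ratio:.2f}")
```

As the outcome becomes more common, the odds ratio moves further from the risk ratio, which is why the approximation is only recommended for rare outcomes.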

Warning

Remember that whenever you encounter a risk ratio or odds ratio, you must first think of the two exposure groups that are being compared. The statistic relates the relative risk or odds of the outcome in the two groups. Here it is smokers and non-smokers.

Relative measures give an assessment of the strength of association between exposure and disease, and the higher the number, the more likely it is to be causal. However, this does not convey how important the association is.

Advantages of odds ratios and risk ratios

Risk ratios are generally considered simpler to understand than odds ratios and more intuitive. Odds ratios have some nice mathematical properties, however. They are reversible and also are not bounded by the prevalence in the unexposed group, as risk ratios are.

For example, if a disease is especially common and half the unexposed population suffer from it, then even if all the exposed have the disease, the maximum the relative risk could ever be is 2. See the calculation below.

\(\begin{align} \text{Max. risk ratio} &= \frac{1}{0.5} \\ &= 2 \\ \end{align}\)
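Both properties can be checked with a short sketch. It takes "reversible" to mean that if we count the absence of the outcome instead of its presence, the odds ratio simply becomes its reciprocal, whereas the risk ratio does not; that reading, and the helper names, are assumptions made for this example.

```python
import math

def max_risk_ratio(risk_unexposed):
    """Largest possible risk ratio: everyone exposed has the outcome."""
    return 1.0 / risk_unexposed

def odds(risk):
    """Convert a risk (probability) into odds."""
    return risk / (1 - risk)

# Ceiling on the risk ratio when half the unexposed have the disease.
print("Max risk ratio at baseline 0.5:", max_risk_ratio(0.5))

# Rounded proportions with caries, taken from the tutorial text.
risk_smokers, risk_nonsmokers = 0.26, 0.19

# Outcome = caries.
rr_caries = risk_smokers / risk_nonsmokers
or_caries = odds(risk_smokers) / odds(risk_nonsmokers)

# Outcome = no caries (the same data, counted the other way round).
rr_no_caries = (1 - risk_smokers) / (1 - risk_nonsmokers)
or_no_caries = odds(1 - risk_smokers) / odds(1 - risk_nonsmokers)

print(f"OR for caries {or_caries:.3f}, 1 / OR for no caries {1 / or_no_caries:.3f}")
print(f"RR for caries {rr_caries:.3f}, 1 / RR for no caries {1 / rr_no_caries:.3f}")
```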

Absolute risk difference

The absolute risk difference is the difference between the risk of disease in the two exposure groups. Instead of a measure of association, it gives an indication of the importance of the association at a population level. Here the calculation is:

\(\begin{align} \text{Risk difference} &= \text{Risk of rotten teeth in smokers} - \text{Risk of rotten teeth in non-smokers}\\ &= 0.26 - 0.19\\ &= 0.07\\ \end{align}\)

This gives an indication of how much an individual’s risk of disease changes if they change their exposure status (here quit smoking), with the crucial assumption being that the association is causal.

Clinical significance: number needed to treat

This could be a bit difficult to interpret. However, if we take the reciprocal of the risk difference, we get the number needed to treat, which is easier to interpret. Here, it is:

\(\begin{align} \text{Number needed to treat} &= \frac{1}{\text{risk difference}}\\ &= \frac{1}{0.07}\\ &= 14.3\\ \end{align}\)

This means that for every \(\approx\) 14 people we get to quit smoking, we will prevent one case of dental caries.

This gives an indication of the clinical relevance of our finding. It is a number that is relatively easy to understand for clinicians, and is useful in conveying the importance of a finding to a non-scientifically literate audience.
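The risk difference and number needed to treat are simple enough to compute by hand, but a short sketch makes the chain of steps explicit. It uses the rounded proportions from this tutorial and, like the NNT itself, assumes the association is causal.

```python
# Rounded proportions with caries, taken from the tutorial text.
risk_smokers = 0.26
risk_nonsmokers = 0.19

# Absolute risk difference between the two exposure groups.
risk_difference = risk_smokers - risk_nonsmokers

# Number needed to treat: the reciprocal of the risk difference.
# Interpreting it this way assumes the association is causal.
number_needed_to_treat = 1 / risk_difference

print(f"Risk difference: {risk_difference:.2f}")
print(f"Number needed to treat: {number_needed_to_treat:.1f}")
```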

Warning

Whenever you are using a number needed to treat (NNT) statistic, you are implicitly making a causal assumption. If you have a very large NNT it is likely that you are dealing with a very weak association that is unlikely to be causal.

So, we’ve now covered measures of association between categorical variables. Let’s move on and look at measures of association between a categorical and a continuous variable.

Categorical and continuous variable association

Let’s consider the relationship between cigarette smoking and waist circumference. Are smokers, on average, thinner or fatter than non-smokers?

To conduct the analysis, you’ll need to tell iNZight that smoking is a categorical rather than continuous variable.

Use the Manipulate variables –> Convert to categorical trick to create smoking.cat.

Select waist.cm and smoking.cat.

Interpret the boxplot. You should see something like the picture below

Continuous and categorical variables

You can see that the distribution of waist circumference is approximately symmetric, as indicated by the boxplot, with the median line centrally placed in the box.

Which group has a larger waist? Yes, it is the smokers, and the median in the smokers is about 85 cm whereas the median in the non-smoking group is about 80 cm.

We can tell just from the plot, that on average smokers have a wider waist by about 5 cm compared to non-smokers.

You can see that for continuous variables, a measure of association with categorical variables is a mean difference.

The Summary tab will give you the numeric summary of the means and the Inference tab gives you the mean difference.

If you click the Two sample t-test option, you’ll see that the difference is highly statistically significant, since the P-value is less than 0.05. The t-value tells you that the observed mean difference is about 13 standard errors away from zero (the value expected if there were no difference), which is way beyond the usual threshold of about 1.96 standard errors.
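To make the t-value less mysterious, here is a rough sketch of how it is built: the mean difference divided by its standard error. The waist measurements are invented for illustration and are not the tutorial data, and the unequal-variance (Welch) form of the standard error is an assumption; iNZight may use a different variant. The 1.96 threshold is a large-sample value, so with groups this small the exact cut-off would be somewhat larger.

```python
import math
import statistics

# Hypothetical waist circumferences (cm), invented for illustration only.
smokers = [88.0, 84.5, 91.0, 79.5, 86.0, 90.5, 83.0, 87.5]
non_smokers = [80.0, 78.5, 83.0, 76.0, 81.5, 79.0, 82.0, 77.5]

# Measure of association: the difference in group means.
mean_difference = statistics.mean(smokers) - statistics.mean(non_smokers)

# Standard error of the difference (unequal-variance form).
se_difference = math.sqrt(
    statistics.variance(smokers) / len(smokers)
    + statistics.variance(non_smokers) / len(non_smokers)
)

# t-value: how many standard errors the difference lies from zero.
t_value = mean_difference / se_difference

print(f"Mean difference: {mean_difference:.2f} cm")
print(f"Standard error:  {se_difference:.2f} cm")
print(f"t-value:         {t_value:.2f}")
```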

Some practice!

Two categorical variables

Is gender associated with tartar build up on the teeth?

Consider the nature of the association between gender and tartar.

Which variable should be the first variable in iNZight and which should be the second variable? How do you choose which one is which?

Interpret the plot. Focus on the right-hand side.

Which gender is most prevalent in the survey? Males or females?

Which gender has the highest prevalence of tartar?

Comparing the relative heights of the bars on the right-hand side of the plot, what do you expect the relative risk to be?

Check this against what you find in the Inference tab for the relative risk. Remember, you will need to check the Epidemiology output tab.

Does tartar lead to dental caries?

Consider the nature of the association between tartar and dental_caries.

Remember to convert dental_caries to a categorical variable.

Interpret the plot. What proportion of the population have tartar?

From the plot, what do you expect the risk ratio to be?

Check this against what you find in the Inference tab for the relative risk. Remember, you will need to check the Epidemiology output tab.

Interpret the relative risk and explain this to an educated audience.

Interpret the risk difference. Calculate a number needed to treat from this statistic. Please explain this result to an educated audience. What assumption is entailed in this calculation?

Categorical and continuous variables

Is cigarette smoking associated with HDL (“good”) cholesterol?

Select HDL and smoking.cat. Describe the plot. Hint: use the Interactive plot to see values. Remember, higher HDL is associated with lower risk of CVD. Check the Inference tab and select Two sample t-test.

Interpret the test.

Is smoking status associated with HDL?

What is the magnitude of the association?

Answers

Gender and tartar.

The plot shows that males are more prevalent than females in the study. It also shows that the risk of tartar is about 10% higher in men than in women. Formally, the relative risk is 1.13.

Interpretation of RR

Men [exposed] are 1.13 [risk ratio] times more likely than women [unexposed] to have tartar [outcome]; equivalently, there is a 13% increase in risk of tartar comparing men to women.

The \(P\)-value is 0.001, so the difference in prevalence of tartar is unlikely to be due to chance.

The 95% confidence interval of 1.12 to 1.52 means that the true value is likely to be within a 12% to 52% increase in risk comparing men to women. They are a range of values compatible with the true value.

Does tartar lead to dental caries?

Subjects with tartar [exposed] are 1.97 (almost 2x) [risk ratio] times more likely than those with no tartar [unexposed] to have dental caries [outcome] or there is a 97% (~100%) increase in risk of caries comparing the tartar to the no tartar group.

The 95% confidence interval of 1.92 to 2.81 means that the true value is likely to be within a 92% to 181% increase in risk comparing the tartar to the no tartar group. This is a range of values compatible with the true value.

Subjects with tartar have an absolute risk of caries that is 13 percentage points higher than subjects with no tartar (risk difference = 0.13). If it were possible to reverse tartar, and assuming the association is causal, doing so would prevent one case of caries for every 8 (1/0.13 = 7.7) people who had their tartar removed. The 95% CI for this number needed to treat (range of values compatible with the true value) is 6.25 to 10.

The association between tartar and caries is unlikely to be due to chance (\(P\) < 0.001).
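The number needed to treat and its interval quoted above can be linked back to the risk difference with a little arithmetic. In this sketch the risk-difference limits are not taken from the iNZight output; they are inferred here by taking reciprocals of the quoted interval of 6.25 to 10, so treat them as an assumption.

```python
# Risk difference for tartar vs no tartar, from the answer above.
risk_difference = 0.13
number_needed_to_treat = 1 / risk_difference

# Interval for the number needed to treat, as quoted in the answer.
nnt_lower, nnt_upper = 6.25, 10

# Implied risk-difference limits (inferred by taking reciprocals).
# Note the order flips: the smallest NNT matches the largest difference.
rd_upper = 1 / nnt_lower
rd_lower = 1 / nnt_upper

print(f"Number needed to treat: {number_needed_to_treat:.1f}")
print(f"Implied risk difference interval: {rd_lower:.2f} to {rd_upper:.2f}")
```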

Smoking and HDL

The distribution of HDL shown in the boxplot is slightly right-skewed. Non-smokers have a median HDL of 57 mg/dL and smokers 52 mg/dL, a median difference of 5 mg/dL. A two-sample \(t\)-test shows that the mean difference (5.2 mg/dL) is unlikely to be due to chance (\(P\) < 0.001). The 95% CI for the difference is 4.2 to 6.2. This is a range of values compatible with the true value.

Tutorial 2 for POPLHLTH 216: Quantitative methods for Health

Simon Thornley

22 July, 2024
