Primary Resources (link):
- Full Lectures (Links to Youtube)
- Lecture Slides: 2. PH250b.14 Measures of association-1.pdf
- Readings: Ch8_Ch9_J.Ahern.pdf
- Kaufman and Poole 2010
Big Fat Surprise Pages: TBD
Please ask any questions you have here. Jack and the GSIs will be able to answer your questions on the discussion board, and also use discussion questions to better explain material in class.
If you see a question you know the answer to, please answer it. Some of the best learning happens when trying to teach material to others.
For causation we need a counterfactual
* Depending on the study design (appropriate comparison group) and the measure chosen, associations estimate the average difference in outcome between exposed and unexposed groups
* Causal association
* Random error (chance)
* Systematic error (bias)
“…is a statistically significant relative risk of 1.2 or 1.3 relevant from a public health, legal, or policy perspective? In particular, how important is such a risk when found only in a single study?”
With regard to Measures of Association, our goal is to ascertain associations between exposures and outcomes and, ultimately, effects of exposures on outcomes. Specifically, measures of association are used in order to quantify the strength and direction of associations between exposures (A) and outcomes (Y).
We’ll come back to strength later when we discuss different scales, but for now here’s some directionality terminology:
##### Terms:
* Risk Factor: The exposure increases the risk of outcome/disease
* Protective Effect: The exposure decreases the risk of outcome/disease
* Null Effect: There is no association between exposure and outcome/disease
For example, smoking is a risk factor for lung cancer, exercise has a protective effect on obesity, and vaccines have a null effect on autism.
Since we cannot conduct the “ideal” counterfactual study, epidemiologists examine the associations between exposures and outcomes by comparing groups with different exposures. However, it is important to note that observing an association does not by itself mean that there is an effect; causality can only be inferred.
There are different possible reasons for an observed association:
- An actual causal relation
- Random error (chance)
- Systematic error (bias)
As discussed in class, there are two different scales to measure the strength and direction of an association, using ratios (relative scales) or differences (absolute scales).
Relative scales (ratios) are asymmetrical around the null value of 1.
An RR of 0.25 is the same magnitude as an RR of 4, but it represents a decrease rather than an increase in risk or rate.
If smokers have twice the rate of disease as non-smokers (RR = 2), then non-smokers have half the rate of disease as smokers (RR = 0.5).
The strength of a relative measure is easy to read on the “positive” side of the null (above 1); for values between 0 and 1, take the inverse to express the strength on the same footing.
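Absolute scales (differences) are symmetrical around the null value of 0.
Magnitudes are equivalent in strength (and mirrored exactly) but opposite in direction: a difference of +0.1 is as strong as a difference of -0.1.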
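The asymmetry of the ratio scale can be made concrete with a short R sketch. Taking the natural log of the ratio is not part of the notes above; it is a common device, used here only to show that reciprocal RRs are mirror images once the null sits at 0.

```r
# Reciprocal RRs are equally strong but look lopsided around the null of 1
rr <- c(0.25, 4)

rr - 1     # raw distance from the null: -0.75 and 3
log(rr)    # log scale: about -1.386 and 1.386, mirrored around 0
```

On the difference scale no such transformation is needed, because the null is already 0.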
When discussing Measures of Association, we often use a Contingency Table (otherwise referred to as a 2x2 table).
Risks are organized into tables comparing the numbers of diseased to not diseased across levels of exposure.
. | Disease | No Disease | Total |
---|---|---|---|
Exposed | a | b | a+b |
Unexposed | c | d | c+d |
Total | a+c | b+d | a+b+c+d |
Rates are organized into tables comparing numbers of diseased to person-time (For person-time, we will be using the notation of \(PT_e\) for exposed person time and \(PT_u\) for unexposed person time.)
. | Disease | Person Time |
---|---|---|
Exposed | a | \(PT_e\) |
Unexposed | c | \(PT_u\) |
Total | a+c | \(PT_e+PT_u\) |
(Note the differences.)
2x2 tables are a hallmark of epidemiology, and by the end of the course you will be adept not only at organizing your existing data into them, but also at creating them on your own. Below we’ve given some examples, done by hand and calculated in the statistical programs R and Stata:
1600 people are surveyed to see if they smoke and if they have heart disease. 25% are smokers, and of those who smoke, 75% have heart disease. Among non-smokers, 25% have heart disease. Fill out the 2x2 table.
. | Heart Disease | No Disease | Total |
---|---|---|---|
Smoker | 300 | 100 | 400 |
Non-smoker | 300 | 900 | 1200 |
Total | 600 | 1000 | 1600 |
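As a check on the hand calculation, here is a minimal R sketch that rebuilds the same table from the percentages in the problem. The variable names are made up for illustration.

```r
# Inputs taken from the problem statement
n_total  <- 1600
p_smoker <- 0.25   # proportion who smoke
p_hd_smk <- 0.75   # heart disease among smokers
p_hd_non <- 0.25   # heart disease among non-smokers

n_smk <- n_total * p_smoker
n_non <- n_total - n_smk

# Rows = exposure, columns = disease status
tab <- matrix(
  c(n_smk * p_hd_smk, n_smk * (1 - p_hd_smk),
    n_non * p_hd_non, n_non * (1 - p_hd_non)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Smoker", "Non-smoker"),
                  c("Heart Disease", "No Disease"))
)

# Add the marginal totals
tab <- cbind(tab, Total = rowSums(tab))
tab <- rbind(tab, Total = colSums(tab))
tab
```

The printed matrix should match the table filled in by hand above.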
suppressMessages(library(epiR))
## Warning: package 'epiR' was built under R version 3.2.5
## Warning: package 'survival' was built under R version 3.2.5
suppressMessages(library(survival))
#Create exposure vector, with 1=exposed and 0=unexposed
Exposed= c(0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,0,1,0)
#Create disease vector, with 1=diseased and 0=not diseased
Diseased=c(0,0,0,0,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,1,1,1,1,1)
print("put together data in 2x2 table")
## [1] "put together data in 2x2 table"
table(Exposed,Diseased)
## Diseased
## Exposed 0 1
## 0 17 3
## 1 3 9
print("Analyze 2x2 table")
## [1] "Analyze 2x2 table"
epi.2by2(table(-Exposed,-Diseased), method = "cross.sectional")
## Outcome + Outcome - Total Prevalence *
## Exposed + 9 3 12 75.0
## Exposed - 3 17 20 15.0
## Total 12 20 32 37.5
## Odds
## Exposed + 3.000
## Exposed - 0.176
## Total 0.600
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Prevalence ratio 5.00 (1.68, 14.92)
## Odds ratio 17.00 (2.83, 102.10)
## Attrib prevalence * 60.00 (30.93, 89.07)
## Attrib prevalence in population * 22.50 (-0.44, 45.44)
## Attrib fraction in exposed (%) 80.00 (40.32, 93.30)
## Attrib fraction in population (%) 60.00 (5.22, 83.12)
## -------------------------------------------------------------------
## X2 test statistic: 11.52 p-value: < 0.001
## Wald confidence limits
## * Outcomes per 100 population units
Notice how the OR is much higher than the prevalence ratio.
Now let’s examine how the OR and PR compare when the disease is much rarer.
Exposed= c(0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
#Create disease vector, with 1=diseased and 0=not diseased
Diseased=c(0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)
print("A 2x2 table of a rarer disease:")
## [1] "A 2x2 table of a rarer disease:"
epi.2by2(table(-Exposed,-Diseased), method = "cross.sectional")
## Outcome + Outcome - Total Prevalence *
## Exposed + 4 22 26 15.38
## Exposed - 1 28 29 3.45
## Total 5 50 55 9.09
## Odds
## Exposed + 0.1818
## Exposed - 0.0357
## Total 0.1000
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Prevalence ratio 4.46 (0.53, 37.41)
## Odds ratio 5.09 (0.53, 48.85)
## Attrib prevalence * 11.94 (-3.44, 27.31)
## Attrib prevalence in population * 5.64 (-4.45, 15.73)
## Attrib fraction in exposed (%) 77.59 (-87.93, 97.33)
## Attrib fraction in population (%) 62.07 (-109.70, 93.14)
## -------------------------------------------------------------------
## X2 test statistic: 2.363 p-value: 0.124
## Wald confidence limits
## * Outcomes per 100 population units
The OR is much closer to the PR. Remember, when a disease is rare, \(OR \approx PR\)
To Be Coded
The relative risk (RR) provides information about the relative association between an exposure and a disease by comparing the risk or rate of disease in the exposed to that in the unexposed.
\[ RR=\frac {R_{exposed}}{R_{unexposed}}\]
In this instance R can indicate either a risk or a rate. Specific RRs are the cumulative incidence ratio (CIR) and the incidence density ratio (IDR).
Going back to our 2x2 tables, CIR would be calculated as \[\frac {a/(a+b)}{c/(c+d)}\] while IDR would be calculated as \[\frac {a/PT_e}{c/PT_u}\] where \(PT_e\) and \(PT_u\) are the exposed and unexposed person-time.
If you want the cumulative incidence or incidence density in one group only, drop the ratio and compute the numerator (exposed) or the denominator (unexposed) alone. Make sure to differentiate when you are and are not asked for a ratio.
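A small R sketch of the two ratios follows. The helper names `cir()` and `idr()` are hypothetical, and the inputs are borrowed from the practice answers later in this module (9/91 vs. 16/84 for the intervention trial; 15 deaths over 728 person-years vs. 31 deaths over 1988 person-years for the shelter data).

```r
# Cumulative incidence ratio from the four cells of a 2x2 table
cir <- function(exp_dis, exp_nodis, unexp_dis, unexp_nodis) {
  (exp_dis / (exp_dis + exp_nodis)) / (unexp_dis / (unexp_dis + unexp_nodis))
}

# Incidence density ratio from case counts and person-time
idr <- function(exp_dis, pt_exp, unexp_dis, pt_unexp) {
  (exp_dis / pt_exp) / (unexp_dis / pt_unexp)
}

cir(9, 91, 16, 84)        # 0.5625
idr(15, 728, 31, 1988)    # about 1.32
```

Note that the only difference between the two is the denominator: people at risk for the CIR, person-time for the IDR.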
The prevalence ratio (PR) provides information about the relative association between exposures and diseases using prevalence as the measure of disease.
\[ PR=\frac {Prevalence_{exposed}}{Prevalence_{unexposed}}\]
Odds ratios provide information about the relative association between an exposure and a disease which is analogous to RR.
Formula= \[\frac {P_{exposed}/(1-P_{exposed})}{P_{unexposed}/(1-P_{unexposed})}\]
In a 2x2 table, that would be: \[\frac {a/b}{c/d}\]
Which if you remember basic algebra reduces down to \[\frac{a \times d}{b \times c}\]
However you must make sure that your 2x2 table is oriented correctly before just going through this calculation.
Odds ratios are the relative measure that is always a little tricky. Many find it easiest to think of odds as the probability of an event happening divided by the probability of it not happening. This definition doesn’t always click the first time you think it through, so make sure to spend some time with this logic until you fully understand it.
While odds ratios may seem counterintuitive compared to risk ratios, there are many times that we do not actually have the information necessary for RRs based on study design alone (more on that in study design!). Odds ratios are often used in place of risk ratios in study designs like case-control studies. Under certain control-sampling schemes in case-control studies (covered later in the course), the odds ratio can be interpreted directly as a risk or rate ratio; otherwise it only approximates the RR when the disease is rare.
Odds ratios are also extremely useful if a disease is rare. When a disease is rare (in both the exposed and the unexposed), the odds ratio approximates the risk ratio: \(OR\approx RR\)
Look at the denominators of the OR and RR to understand this. When a disease is rare, \(a\) and \(c\) are very small relative to \(b\) and \(d\).
Therefore \(a+b \approx b\) and \(c+d \approx d\), so \(\frac{a/(a+b)}{c/(c+d)} \approx \frac{a/b}{c/d}\).
Risk ratios are conservative compared to odds ratios: the odds ratio is always at least as far from the null value of 1 as the risk ratio calculated from the same data (larger when the RR is above 1, smaller when it is below 1).
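The rare-disease approximation can be checked against the two R examples earlier in this module. The sketch below recomputes the prevalence ratio and odds ratio directly from the cell counts shown in the `epi.2by2` output. The function name `ratio_measures()` is made up for illustration.

```r
# Ratio of proportions (PR/RR) and odds ratio from the four cells of a 2x2 table
ratio_measures <- function(exp_dis, exp_nodis, unexp_dis, unexp_nodis) {
  c(PR = (exp_dis / (exp_dis + exp_nodis)) /
         (unexp_dis / (unexp_dis + unexp_nodis)),
    OR = (exp_dis * unexp_nodis) / (exp_nodis * unexp_dis))
}

common <- ratio_measures(9, 3, 3, 17)    # overall prevalence 37.5%
rare   <- ratio_measures(4, 22, 1, 28)   # overall prevalence 9.09%

round(common, 2)   # PR 5.00, OR 17.00
round(rare, 2)     # PR 4.46, OR 5.09
```

In both tables the OR is farther from 1 than the PR, and the gap shrinks as the outcome gets rarer.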
For a breakdown of formulas, look at slide 45.
When discussing absolute measures, “risk” is often used to refer to risk or rate, and with AR in particular, we get information about the absolute association between an exposure and a disease, or the excess risk/rate of disease in the exposed in comparison to the unexposed. For a clear picture, refer back to the Szklo Figure 3-1 in the MoA lecture (Slide 53).
AR is calculated \(R_{exposed}-R_{unexposed}\) where R is risk/rate. Back to our 2x2 table, we would calculate \[ \frac{a}{a+b} - \frac{c}{c+d} \]
AR percent (AR%) is just like AR, except that instead of expressing the excess incidence in decimal terms, it expresses the excess as a percentage of the incidence in the exposed population. It tells you the percentage of disease incidence in the exposed that is in excess of the incidence in the unexposed, or the percentage of all disease incidence among the exposed that is associated with the exposure (because you can have disease incidence that is not associated with the exposure).
AR% would be calculated \[AR\% = \frac{R_{exposed} -R_{unexposed}}{R_{exposed}} \times 100 = \frac{AR}{R_{exposed}} \times 100\]
Clinically, AR% is analogous to the efficacy of an intervention in comparison to a control treatment. (In this case your exposure would be protective.)
\[PAR= R_{total} - R_{unexposed} = AR \times Prevalence_{exposure}\]
Just like AR, this is excess risk/rate of disease, except we’re broadening to the total population compared to the unexposed. If we believe our association to be causal, we can use PAR to estimate the impact of an exposure on population health.
Because \(Prevalence_{exposure}\) can be at most 1, PAR will never be larger than AR in a given population.
Population Attributable Risk Percent (PAR%)
Population attributable risk percent provides information about the excess incidence in the total population as a percentage of incidence in the total population.
It’s calculated \[PAR\% = \frac{R_{total} - R_{unexposed}}{R_{total}} \times 100 = \frac{PAR}{R_{total}} \times 100\]
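To see the four absolute measures side by side, here is a minimal R sketch applied to the smoking and heart disease table from the hand example (300/100 among smokers, 300/900 among non-smokers). Those data come from a survey, so "risk" here is really prevalence, and the numbers serve only to illustrate the arithmetic. The function name `abs_measures()` is hypothetical.

```r
abs_measures <- function(exp_dis, exp_nodis, unexp_dis, unexp_nodis) {
  n_e <- exp_dis + exp_nodis
  n_u <- unexp_dis + unexp_nodis
  r_e <- exp_dis / n_e                         # risk in the exposed
  r_u <- unexp_dis / n_u                       # risk in the unexposed
  r_t <- (exp_dis + unexp_dis) / (n_e + n_u)   # risk in the total population
  p_e <- n_e / (n_e + n_u)                     # prevalence of exposure
  ar  <- r_e - r_u
  par <- r_t - r_u
  c(AR = ar, AR_pct = 100 * ar / r_e,
    PAR = par, PAR_pct = 100 * par / r_t,
    AR_x_Pe = ar * p_e)                        # should equal PAR
}

m <- abs_measures(300, 100, 300, 900)
round(m, 3)
# AR 0.500, AR_pct 66.667, PAR 0.125, PAR_pct 33.333, AR_x_Pe 0.125
```

The last two entries confirm the identity \(PAR = AR \times Prevalence_{exposure}\), and PAR (0.125) is smaller than AR (0.5) because only a quarter of the population is exposed.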
So far we have only examined what are called “crude” measures of association:
- Compared exposed and unexposed populations without considering other variables that may differ between the populations.
- Later in the course we will discuss how to deal analytically with other variables that may be different between the exposed and unexposed and that thus make the populations not exchangeable (to be discussed in confounding)
\[RR=\frac{R_e}{R_u} \]
\[PR=\frac{Prev_e}{Prev_u} \]
\[CIR=\frac{CI_e}{CI_u} \]
\[IDR=\frac{ID_e}{ID_u} \]
\[OR=\frac{P_e/(1-P_e)}{P_u/(1-P_u)} \]
\[OR=\frac{Odds(dis)_e}{Odds(dis)_u} \]
\[AR=R_e - R_u \]
\[AR\%=\frac{R_e - R_u}{R_e} \times 100 \]
Other formulations (using the OR as an approximation of the RR, with \(P_e\) here denoting the prevalence of exposure):
\[AR\%=\frac{OR - 1}{OR} \times 100 \]
\[AR=OR \times R_u - \frac{R_t}{OR \times P_e + (1-P_e)}\]
\[PAR=R_{total} - R_u \]
\[PAR=AR \times P_e \]
\[PAR\%=\frac{R_{total} - R_u}{R_{total}} \times 100 \]
These are optional practice questions to see if you understand the module material prior to taking the module completion quiz.
Table: Baseline prevalence of HIV infection by race, 1984
The San Francisco Men’s Health study proceeded to screen additional men and follow the seronegative men enrolled at baseline for 30 months. At the end of 30 months, each man returned to the clinic and was tested again for HIV infection. Of 414 white men seronegative at baseline, 18 tested positive for HIV at the 30-month follow-up. Of the 15 black men seronegative at baseline, 8 tested positive for HIV at the follow-up. Seven percent of white men and 20% of black men did not return for follow-up.
At the time, it was not possible to estimate when seroconversion occurred among those testing positive, as the latent period of the virus was unknown. (Note: 2x2 tables (also called contingency tables) represent data available at follow-up. Papers usually present a flow chart to show retention. Remember to keep in mind the assumptions we are making in calculating measures of association from a 2X2 table)
a. Draw a contingency table to represent these data
b. What specific measure of absolute risk is appropriate for these data?
c. What is the absolute risk for incident infection for black men compared to white men?
d. What specific measure of relative risk (RR) is appropriate for these data?
e. What is the RR for incident infection for black men compared to white men?
f. Suppose all men lost to follow-up were actually HIV-positive by 30 months. What would the RR be? If this scenario were true, what assumption would have been violated in calculating the RR in part e?
g. Calculate the RR first as if 1 black man had seroconverted, then if 3 black men had seroconverted. What can you conclude about studies with small subgroup sizes?
Toschke AM, Rückinger S, Böhler E, Von Kries R (2007) Adjusted population attributable fractions and preventable potential of risk factors for childhood obesity. Public Health Nutr 10: 902-906.
OBJECTIVE: A number of individual risk factors for childhood obesity have been identified, but only some of these are amenable to prevention. To assess the amount of cases in a general population attributable to these risk factors, adjusted population-attributable fractions were estimated.
DESIGN: Cross-sectional study.
SETTING: Obligatory school entry examination in 2001/2002 in six Bavarian communities (Germany).
SUBJECTS: 5472 children at age 5-6 years.
MEASURES: Anthropometric measures were ascertained by public health nurses, and measures concerning sociodemographics, lifestyle and child behaviour such as child’s daily meal frequency were obtained with self-administered parental questionnaires. Obesity was defined according to sex- and age-specific body mass index cut-off points proposed by the International Obesity Task Force. Adjusted population-attributable fractions were calculated based on logistic regression.
RESULTS: A combination of the risk factors low meal frequency, decreased physical activity, watching television >1 h day\(^{-1}\), formula feeding and smoking in pregnancy accounted for 48.2% of obese children. This combination yielded a maximal achievable prevalence reduction of 1.5% for obesity (3.2% observed prevalence).
CONCLUSIONS: A modification of five known risk factors for childhood overweight and obesity could reasonably lower obesity prevalences at school entry. These risk factors should be particularly considered in decision making on preventive measures.
“Smoking in the offspring’s first trimester was reported by n=1248 mothers (22.8%).”
. | Overweight | Not overweight | Total |
---|---|---|---|
Smoking during pregnancy | . | . | . |
No smoking during pregnancy | . | . | . |
Total | . | . | . |
Calculate the AR and the AR% from the table you filled in. Interpret these measures.
The table below presents adjusted population-attributable risk fractions (PARF, a synonym for PAR%). Write the formula for the PAR% (i.e. PARF) and define each quantity if you were to calculate the crude population attributable risk of overweight for the exposure watching more than one hour of television per day.
b) What measure of relative risk is appropriate for these data?
Incidence density ratio (IDR)
\[IDR = \frac{ID_{women}}{ID_{men}} = \frac{15/728}{31/1988} = 1.32\]
The rate of mortality among women was 1.32 times the rate of mortality among men in homeless shelters in New York City between 1987 and 1994.
Between 1987 and 1994, there was an excess of 0.0050 deaths per person-year (5.0 deaths per 1000 person-years) among women living in homeless shelters compared to men in New York City.
What relative measure of disease is appropriate for these data?
Prevalence ratio
Using the white population as the reference group, calculate the RR of HIV infection at baseline by race.
See above
Interpret the relative measure you calculated for the Asian population.
Asian men have a prevalence of HIV that is 0.75 times the prevalence of HIV in whites, or 25% lower. (1-0.75 = 0.25)
a. Draw a contingency table to represent this data
The censored individuals are not included in the 2x2 table because we did not measure their outcomes. As a result, we would not know where to put those data and the numbers in the marginal columns would not add up.
b. What specific measure of absolute risk is appropriate for these data?
Cumulative incidence difference (or attributable risk). We are not able to reliably estimate person-time due to the uncertainty around the timing of seroconversion. There are a few options for calculating the cumulative incidence difference. The simple cumulative method gives (8/15)-(18/414)=0.49 greater risk among black men; however, we know we don’t have complete follow-up. We don’t know the time of seroconversion, so we can’t use Kaplan-Meier, and we have no evidence on whether the rate was fairly constant, making the density method less tenable. That leaves the actuarial method.
c. What is the absolute risk for incident infection for black men compared to white men?
Using the actuarial method, the CID is 8/(15-3/2) - 18/(414-29/2) = 0.548 greater risk of seroconversion among black men over 30 months than among white men.
Kaplan-Meier is also mathematically possible but because there’s only one interval, it will use only those at risk at the end of the interval in the denominator (385, 12) and give you higher estimates than actuarial. Actuarial is a nice middle of the road option that takes into account some of the missing person time based on withdrawal, so we’ll use that for most of the rest of the problem.
d. What specific measure of relative risk (RR) is appropriate for these data?
Cumulative incidence ratio (CIR) - also using the actuarial method to estimate CI
e. What is the RR for incident infection for black men compared to white men?
CIR = (8/(15-3/2))/(18/(414-29/2)) = 13.15
f. Suppose all men lost to follow-up were actually HIV-positive by 30 months. What would the RR be? If this scenario were true, what assumption would have been violated in calculating the RR in part e?
CIR = (11/15)/(47/414) = 6.46. The assumption violated would be the independence of censoring and the outcome/survival.
Since we are now able to use complete follow-up, the simple CI method is appropriate just for this part of the question, to underscore the points we’re making above.
g. Calculate the RR first as if 1 black man had seroconverted, then if 3 black men had seroconverted. What can you conclude about studies with small subgroup sizes?
CIR = (1/(15-3/2))/(18/(414-29/2)) = 1.64
CIR = (3/(15-3/2))/(18/(414-29/2)) = 4.93
The effect estimates calculated from small sample sizes are unstable; that is, they can change substantially with a small absolute change in the cell values.
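The actuarial calculations in parts e and g can be reproduced with a short R sketch. The helper name `actuarial_ci()` is made up for illustration; the losses to follow-up (29 white men, 3 black men) are the counts used in the answers above.

```r
# Actuarial cumulative incidence: subtract half of the withdrawals
# from the number at risk at the start of the interval
actuarial_ci <- function(events, n_start, n_lost) {
  events / (n_start - n_lost / 2)
}

ci_white <- actuarial_ci(18, 414, 29)

# CIR for black vs. white men if 1, 3, or 8 black men had seroconverted
black_events <- c(1, 3, 8)
cir <- actuarial_ci(black_events, 15, 3) / ci_white
round(cir, 2)   # 1.64  4.93 13.15
```

A change of only a handful of events in the 15-person subgroup moves the CIR from under 2 to over 13, which is the instability the answer describes.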
. | Death | No Death | Total |
---|---|---|---|
Intervention | 9 | 91 | 100 |
Control | 16 | 84 | 100 |
Total | 25 | 175 | 200 |
What is the probability (p) of death among the intervention group?
\[P_{intervention} = 9/100 = 0.09\]
What are the odds of death among the intervention group?
\[Odds_{intervention}=\frac{p_{intervention}}{1-p_{intervention}}= \frac{.09}{0.91} = 0.099 \]
What is the odds ratio (OR) comparing the intervention group to the control group?
\[OR = \frac{p_{intervention}/(1-p_{intervention})}{p_{control}/(1-p_{control})} = \frac{(.09/.91)}{(.16/.84)} = 0.519 \]
Alternatively, OR = (9/91)/(16/84) = (9 × 84)/(91 × 16) = 0.519
Cumulative incidence ratio: \[CIR = \frac{CI_{intervention}}{CI_{control}} = \frac{(9/100)}{(16/100)} = 0.563\] over an unknown study duration.
The OR is always farther from the null than the RR (except in instances of special sampling designs, to be covered later in the course). (0.519 is farther from 1.0 than 0.563)
Calculate the variance of the log (ln) of both the odds ratio and the RR you calculated in part d.
\[Var[\ln(OR)] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} = \frac{1}{9}+\frac{1}{91}+\frac{1}{16}+\frac{1}{84} = 0.197\]
\[Var[\ln(CIR)] = \frac{b}{a(a+b)} + \frac{d}{c(c+d)} = \frac{91}{9(9+91)} + \frac{84}{16(16+84)} = 0.154\]
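The same variances can be computed in R, and they are what a Wald confidence interval on the log scale is built from. The interval step is an addition not asked for in the question, and `wald_ci()` is a hypothetical helper written for this sketch.

```r
# Cell counts from the intervention/control table
a <- 9; b <- 91; cc <- 16; d <- 84

or_hat     <- (a * d) / (b * cc)
cir_hat    <- (a / (a + b)) / (cc / (cc + d))
var_ln_or  <- 1 / a + 1 / b + 1 / cc + 1 / d
var_ln_cir <- b / (a * (a + b)) + d / (cc * (cc + d))

round(c(var_ln_or, var_ln_cir), 3)   # 0.197 0.154

# 95% Wald interval: exponentiate ln(estimate) +/- z * sqrt(variance)
wald_ci <- function(est, var_ln, z = qnorm(0.975)) {
  exp(log(est) + c(-1, 1) * z * sqrt(var_ln))
}

wald_ci(or_hat, var_ln_or)
wald_ci(cir_hat, var_ln_cir)
```

Because the OR has the larger log-scale variance, its interval is wider than the interval for the CIR from the same table.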
OR = (188/427) / (1060/3797) = 1.58
b. Calculate the AR and the AR% from the table you filled in. Interpret these measures.
\[AR= R_{exposed} - R_{unexposed} = (188/1248) - (427/4224) = 0.050\]
Maternal smoking during pregnancy was associated with a 0.050 higher risk of being overweight as a child compared to no maternal smoking during pregnancy (5 excess cases per 100 people).
\[AR\% = \frac{R_{exposed} - R_{unexposed}}{R_{exposed}} \times 100 = \frac{(188/1248) - (427/4224)}{188/1248} \times 100 = 32.89\%\]
Among the children whose mothers smoked during pregnancy, 32.89% of prevalent cases are in excess of what would be expected at the prevalence among children whose mothers did not smoke during pregnancy.
c.
\[PAR\% = \frac{R_{total} - R_{unexposed}}{R_{total}} \times 100 = \frac{PAR}{R_{total}} \times 100 = 13.0\%\]
\(R_{total}\) = Risk of being overweight among all children in the study
\(R_{exposed}\) = Risk of being overweight among children who watch more than one hour of television a day
\(R_{unexposed}\) = Risk of being overweight among children who watch one hour or less of television a day
PAR = The excess overweight children in the total population who wouldn’t have been overweight had they not watched more than 1 hour of TV a day.
AR (if you used \(AR \times P_e\)) = The excess overweight among exposed children attributable to TV watching.
Take the following quiz after watching the video lectures and self-testing through the diagnostic quizzes. Note: look through the learning objectives again and make sure you understand all of the points. The quiz is open notes and graded. The grade is to test completion of the module, not complete understanding of the material, so should not be too challenging.
Module completion quiz