library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
library(epitools)
## Warning: package 'epitools' was built under R version 4.5.2
setwd("E:/Biostat and Study Design/204/Lectures/Data")
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx')
A cohort study is a study in which a group of disease-free individuals is identified at one point in time and is followed over a period of time until some of them develop the disease. The development of disease over time is then related to other variables measured at baseline, generally called exposure variables. The study population in a prospective study is often called a cohort. The primary distinction between a cohort study and a randomized controlled trial lies in whether the researcher employs randomization to assign patients to either the treatment or control group.
Relative risk (RR) is defined as the ratio of the probability of an event (developing a disease) in exposed people to the probability of the event in nonexposed people. The relative risk is calculated using the following formula:
\[ Relative\:risk= \frac{Risk\:in\:exposed}{Risk\:in\:nonexposed}=\frac{Incidence\:in\:exposed}{Incidence\:in \:nonexposed}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}\]
The standard error (SE) of the natural logarithm of the relative risk is calculated using the following formula:
\[SE=\sqrt{\frac{1}{a}+\frac{1}{c}-\frac{1}{a+b}-\frac{1}{c+d}}\]
To calculate the 95% CI, we use the following formula:
\[ 95\%\:CI=exp(ln(RR)-1.96\times SE ),\:exp(ln(RR)+1.96\times SE )\]
If relative risk = 1, the risk in the exposed equals the risk in the nonexposed (no association); if relative risk > 1, the risk in the exposed is greater than the risk in the nonexposed (positive association; possibly causal); if relative risk < 1, the risk in the exposed is less than the risk in the nonexposed (negative association; possibly protective).
Example: You conducted a prospective cohort study of 3,000 smokers and 5,000 nonsmokers to investigate the relationship between smoking and the development of coronary heart disease (CHD) over 1 year. Calculate relative risk!
\({H_0}: RR=1\)
\({H_1}: RR\neq1\)
| | CHD | No CHD |
|---|---|---|
| Smokers | 84 | 2,916 |
| Non-Smokers | 87 | 4,913 |
\[ Relative\:risk= \frac{Risk\:in\:exposed}{Risk\:in\:nonexposed}=\frac{Incidence\:in\:exposed}{Incidence\:in \:nonexposed}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{\frac{84}{84+2916}}{\frac{87}{87+4913}}=\frac{0.028}{0.0174}=1.61\]
\[SE=\sqrt{\frac{1}{a}+\frac{1}{c}-\frac{1}{a+b}-\frac{1}{c+d}}=\sqrt{\frac{1}{84}+\frac{1}{87}-\frac{1}{84+2916}-\frac{1}{87+4913}}=0.151\]
\[ 95\%\:CI=exp(ln(RR)-1.96\times SE ),\:exp(ln(RR)+1.96\times SE )=exp(ln(1.61)-1.96\times 0.151 ),\:exp(ln(1.61)+1.96\times 0.151 )=1.20, 2.16 \]
Interpretation: Smokers have 1.61 (95% CI 1.20, 2.16) times the risk of CHD compared to non-smokers. If the relative risk was 0.6, the interpretation would be that smokers have a 40% reduction in the risk of CHD compared to non-smokers.
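The hand calculation above can be checked in base R. This is only a minimal sketch: the variable names are illustrative assumptions mapped to the cells a, b, c, d of the formula, and 1.96 is used as the normal quantile to match the CI formula.

```r
# Cell counts from the smoking/CHD table (a, b, c, d in the formulas)
exp_cases      <- 84    # a: smokers who developed CHD
exp_noncases   <- 2916  # b: smokers who did not develop CHD
unexp_cases    <- 87    # c: non-smokers who developed CHD
unexp_noncases <- 4913  # d: non-smokers who did not develop CHD

# Risk (cumulative incidence) in each group and their ratio
risk_exp   <- exp_cases / (exp_cases + exp_noncases)
risk_unexp <- unexp_cases / (unexp_cases + unexp_noncases)
rr <- risk_exp / risk_unexp

# Standard error of ln(RR)
se_log_rr <- sqrt(1 / exp_cases + 1 / unexp_cases -
                  1 / (exp_cases + exp_noncases) -
                  1 / (unexp_cases + unexp_noncases))

# 95% CI built on the log scale, then transformed back
rr_ci <- exp(log(rr) + c(-1, 1) * 1.96 * se_log_rr)

round(c(RR = rr, lower = rr_ci[1], upper = rr_ci[2]), 2)
```

The rounded values should agree with the hand calculation (1.61; 95% CI 1.20, 2.16).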
We can calculate the relative risk using the epitools package in R:
smokers_table <- matrix(c(84,87,2916,4913),ncol = 2,nrow = 2)
colnames(smokers_table) <- c('CHD','No CHD')
rownames(smokers_table) <- c('Smokers','Non-Smokers')
smokers_table
## CHD No CHD
## Smokers 84 2916
## Non-Smokers 87 4913
epitools::riskratio(smokers_table,rev='both',method = 'wald')
## $data
## No CHD CHD Total
## Non-Smokers 4913 87 5000
## Smokers 2916 84 3000
## Total 7829 171 8000
##
## $measure
## NA
## risk ratio with 95% C.I. estimate lower upper
## Non-Smokers 1.000000 NA NA
## Smokers 1.609195 1.196452 2.164325
##
## $p.value
## NA
## two-sided midp.exact fisher.exact chi.square
## Non-Smokers NA NA NA
## Smokers 0.001799736 0.001800482 0.001505872
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
Instead of comparing two measures of disease by calculating their risk ratio, we can compare risk in terms of their absolute difference. The risk difference is calculated by subtracting the cumulative risk in the unexposed group from the cumulative risk in the exposed group.
\[Risk\:difference =Risk_{exposed}-Risk_{unexposed}\]
Example: Using the example above, calculate the risk difference.
| | CHD | No CHD |
|---|---|---|
| Smokers | 84 | 2,916 |
| Non-Smokers | 87 | 4,913 |
\[ Risk\:difference= Risk\:in\:exposed-Risk\:in\:nonexposed={\frac{a}{a+b}}-{\frac{c}{c+d}}={0.028}-{0.0174}=0.0106\]
Interpretation: Smokers have about 11 additional cases of CHD per 1,000 people over 1 year compared to non-smokers.
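As a quick check, the risk difference can be computed in base R. The sketch below uses illustrative variable names (assumptions, not from the original analysis) and also rescales the result to cases per 1,000 people, the scale used in the interpretation.

```r
# Risks from the smoking/CHD table
risk_smokers     <- 84 / (84 + 2916)
risk_non_smokers <- 87 / (87 + 4913)

# Absolute difference in 1-year risk
risk_diff <- risk_smokers - risk_non_smokers

# The same difference expressed as additional cases per 1,000 people
risk_diff_per_1000 <- risk_diff * 1000

round(risk_diff, 4)
round(risk_diff_per_1000, 1)
```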
Attributable risk percent is the proportion of disease in the exposed group that can be attributed to the exposure.
\[Attributable\:risk\:percent =\frac{Risk\:in\:exposed-Risk\:in\:nonexposed}{Risk\:in\:exposed}\times{100}\]
Example: Using the example above, calculate the attributable risk percent.
| | CHD Developed | CHD Did Not Develop |
|---|---|---|
| Smokers | 84 | 2,916 |
| Non-Smokers | 87 | 4,913 |
\[Attributable\:risk\:percent= \frac{Risk\:in\:exposed-Risk\:in\:nonexposed}{Risk\:in\:exposed}\times{100}=\frac{{0.028}-{0.0174}}{0.028}\times{100}=37.9\]
Interpretation: 37.9% of the total risk for CHD among smokers may be attributable to smoking.
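The attributable risk percent can be reproduced in base R. This is a sketch with assumed variable names; the second calculation uses the algebraically equivalent form (RR - 1)/RR × 100 purely as a cross-check.

```r
# Risks from the smoking/CHD table
risk_exposed   <- 84 / (84 + 2916)
risk_unexposed <- 87 / (87 + 4913)

# Attributable risk percent from the definition
ar_percent <- (risk_exposed - risk_unexposed) / risk_exposed * 100

# Cross-check: the same quantity written in terms of the relative risk
rr <- risk_exposed / risk_unexposed
ar_percent_from_rr <- (rr - 1) / rr * 100

round(c(ar_percent, ar_percent_from_rr), 1)
```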
When interpreting the findings of clinical trials, it is important to frame results in a way that clinicians can understand and integrate into the decision-making process. The number needed to treat (NNT) is the number of patients you need to treat to prevent one additional bad outcome. NNT is calculated using the following formula:
\[NNT =\frac{1}{Risk\:in\:the\:untreated\:group -Risk\:in\:the\:treated\:group}\]
Estimates of NNT are usually rounded up to the next highest whole number to avoid overestimating efficacy.
Example: 1,500 patients with CHF were randomized to receive a new treatment plan while 1,500 were randomized to standard of care. After one year’s time, 375 patients of the standard of care group expired, while 325 patients of the new treatment group expired. Calculate NNT.
\[NNT =\frac{1}{Risk\:in\:the\:untreated\:group -Risk\:in\:the\:treated\:group}=\frac{1}{375/1500-325/1500}=30\]
Interpretation: 30 patients need to be treated with the new treatment plan to prevent one death.
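A minimal base R sketch of the NNT calculation follows. The variable names are assumptions. Because the reciprocal of a floating-point difference can land a hair above a whole number, the sketch rounds to six decimals before applying `ceiling()`; that guard is a design choice of this example, not part of the formula.

```r
# Deaths and group sizes from the CHF example
deaths_standard <- 375
n_standard      <- 1500
deaths_new      <- 325
n_new           <- 1500

# Absolute risk reduction: risk in untreated minus risk in treated
arr <- deaths_standard / n_standard - deaths_new / n_new

# Unrounded NNT
nnt_raw <- 1 / arr

# Remove floating-point noise, then round up to the next whole patient
nnt <- ceiling(round(nnt_raw, 6))
nnt
```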
The same approach can also be used to look at the risk of side effects by calculating the number needed to harm (NNH) to cause one additional person to be harmed.
\[NNH =\frac{1}{Rate\:in\:the\:treated\:group -Rate\:in\:the\:untreated\:group}\]
Estimates of NNH are usually rounded down to the next lowest whole number to avoid understating the harms.
Example: 1,500 patients with colorectal cancer were randomized to receive a new therapeutic agent while 1,500 were randomized to receive placebo. After 6 months, 25 of the placebo group developed severe diarrhea, while 375 of the treatment group developed severe diarrhea. Calculate NNH!
\[NNH =\frac{1}{Rate\:in\:the\:treated\:group -Rate\:in\:the\:untreated\:group}=\frac{1}{375/1500-25/1500}=4.3\]
Interpretation: 4 patients need to be treated with the new therapeutic agent in order for one patient to develop severe diarrhea.
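The NNH example can be sketched the same way in base R (variable names are assumptions); note that the estimate is rounded down with `floor()` rather than up.

```r
# Severe diarrhea counts and group sizes from the colorectal cancer example
events_treated <- 375
n_treated      <- 1500
events_placebo <- 25
n_placebo      <- 1500

# Absolute risk increase: rate in treated minus rate in untreated
ari <- events_treated / n_treated - events_placebo / n_placebo

# Unrounded NNH, then rounded down to avoid understating harm
nnh_raw <- 1 / ari
nnh <- floor(nnh_raw)

round(nnh_raw, 1)
nnh
```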
A case-control study is a study in which two groups of individuals are initially identified: (1) a group that has the disease under study (the cases) and (2) a group that does not have the disease under study (the controls). An attempt is then made to relate their prior health habits to their current disease status.
In a case-control study, we do not know the incidence in the exposed population or the incidence in the nonexposed population because we start with diseased people (cases) and nondiseased people (controls). Therefore, in a case-control study we cannot calculate the relative risk. An alternative solution is to calculate the odds ratio.
To better understand the concept of odds, consider the following example. If you toss a fair coin, the probability of getting heads is 50% (P). The probability of getting tails is 1-P. What are the odds of getting heads? To answer this question, we start by defining odds as the ratio of the number of ways the event can occur to the number of ways the event cannot occur.
\[Odds\:of\:heads=\frac{probability\:of\:heads}{probability\:of\:tails}=\frac{P}{1-P}=\frac{50\%}{50\%}=1\]
Keep in mind the distinction between probability and odds. In our example, the probability of getting heads when tossing a fair coin is 50% and the odds of getting heads when tossing a fair coin are 1.
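The relationship between probability and odds can be illustrated with two small helper functions. `prob_to_odds()` and `odds_to_prob()` are hypothetical names introduced only for this sketch; they are not part of any package used here.

```r
# Convert a probability to odds: P / (1 - P)
prob_to_odds <- function(p) p / (1 - p)

# Convert odds back to a probability: odds / (1 + odds)
odds_to_prob <- function(odds) odds / (1 + odds)

prob_to_odds(0.5)   # fair coin: odds of heads are 1
prob_to_odds(0.75)  # a probability of 75% corresponds to odds of 3
odds_to_prob(3)     # and odds of 3 map back to a probability of 0.75
```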
Since we cannot calculate the relative risk in a case-control study, we use the odds ratio (OR) instead. The odds ratio is calculated using the following formula:
\[ Odds\:ratio= \frac{Odds\:that\:a\:case\:was\:exposed}{Odds\:that\:a\:control\:was\:exposed}=\frac{\frac{a}{c}}{\frac{b}{d}}=\frac{ad}{bc}\]
The standard error (SE) of the natural logarithm of the odds ratio is calculated using the following formula:
\[SE=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\]
To calculate the 95% CI, we use the following formula:
\[ 95\%\:CI=exp(ln(OR)-1.96\times SE ),\:exp(ln(OR)+1.96\times SE )\]
The odds ratio is interpreted similarly to the relative risk. If the exposure is not related to the disease, the odds ratio will equal 1. If the exposure is positively related to the disease, the odds ratio will be greater than 1. If the exposure is negatively related to the disease, the odds ratio will be less than 1.
Example: An outbreak of cyclosporiasis was detected among residents of New Jersey. In a case-control study, investigators found that 21 of 30 case-patients and four of 60 controls had eaten raspberries.
\({H_0}: OR=1\)
\({H_1}: OR\neq1\)
| | Cyclosporiasis | No Cyclosporiasis |
|---|---|---|
| Ate Raspberries | 21 | 4 |
| Did not eat raspberries | 9 | 56 |
\[Odds\:ratio=\frac{\frac{a}{c}}{\frac{b}{d}}=\frac{\frac{21}{9}}{\frac{4}{56}}=32.67\]
\[SE=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}=\sqrt{\frac{1}{21}+\frac{1}{4}+\frac{1}{9}+\frac{1}{56}}=0.65\]
\[95\%\:CI=exp(ln(OR)-1.96\times SE ),\:exp(ln(OR)+1.96\times SE)=exp(ln(32.67)-1.96\times 0.65),\:exp(ln(32.67)+1.96\times 0.65)=9.08,117.5\]
Interpretation: The odds of cyclosporiasis in those who ate raspberries were 32.7 times (95% CI 9.08, 117.5) the odds in those who did not eat raspberries.
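The odds ratio and its Wald confidence interval can also be worked through by hand in base R. This sketch uses assumed variable names mapped to cells a, b, c, d and the 1.96 quantile from the CI formula, so the upper limit may differ from package output in the later decimals.

```r
# Cell counts from the cyclosporiasis table (a, b, c, d in the formulas)
cases_exposed      <- 21  # a: cases who ate raspberries
controls_exposed   <- 4   # b: controls who ate raspberries
cases_unexposed    <- 9   # c: cases who did not eat raspberries
controls_unexposed <- 56  # d: controls who did not eat raspberries

# Odds ratio as the cross-product ratio ad / bc
odds_ratio <- (cases_exposed * controls_unexposed) /
              (controls_exposed * cases_unexposed)

# Standard error of ln(OR)
se_log_or <- sqrt(1 / cases_exposed + 1 / controls_exposed +
                  1 / cases_unexposed + 1 / controls_unexposed)

# 95% CI built on the log scale, then transformed back
or_ci <- exp(log(odds_ratio) + c(-1, 1) * 1.96 * se_log_or)

round(c(OR = odds_ratio, lower = or_ci[1], upper = or_ci[2]), 2)
```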
cyclosporiasis_table <- matrix(c(21,9,4,56),ncol = 2,nrow = 2)
colnames(cyclosporiasis_table) <- c('Cyclosporiasis','No cyclosporiasis')
rownames(cyclosporiasis_table) <- c('Ate Raspberries','Did not eat raspberries')
cyclosporiasis_table
## Cyclosporiasis No cyclosporiasis
## Ate Raspberries 21 4
## Did not eat raspberries 9 56
epitools::oddsratio(cyclosporiasis_table,rev='both',method = 'wald')
## $data
## No cyclosporiasis Cyclosporiasis Total
## Did not eat raspberries 56 9 65
## Ate Raspberries 4 21 25
## Total 60 30 90
##
## $measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## Did not eat raspberries 1.00000 NA NA
## Ate Raspberries 32.66667 9.081425 117.5048
##
## $p.value
## NA
## two-sided midp.exact fisher.exact chi.square
## Did not eat raspberries NA NA NA
## Ate Raspberries 6.358611e-10 6.183017e-10 2.555681e-10
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
Example: Using data from the NHANES study, determine whether there is an association between asthma diagnosis and sex.
We start by creating a 2 × 2 table of asthma diagnosis and sex.
asthma_table <- table(Sex=NHANES_df$sex,Asthma=NHANES_df$asthma)
asthma_table
## Asthma
## Sex 0 1
## Female 781 49
## Male 769 30
Next, let’s re-arrange the table results to fit our contingency table layout.
| | Asthma | No Asthma |
|---|---|---|
| Female | 49 | 781 |
| Male | 30 | 769 |
epitools::oddsratio(asthma_table,rev='rows',method = 'wald') #use males as a reference
## $data
## Asthma
## Sex 0 1 Total
## Male 769 30 799
## Female 781 49 830
## Total 1550 79 1629
##
## $measure
## odds ratio with 95% C.I.
## Sex estimate lower upper
## Male 1.000000 NA NA
## Female 1.608237 1.010044 2.560708
##
## $p.value
## two-sided
## Sex midp.exact fisher.exact chi.square
## Male NA NA NA
## Female 0.04412095 0.04961711 0.04354632
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
Interpretation: The odds of asthma in females were 1.61 times (95% CI 1.01, 2.56) the odds in males. If the odds ratio were 0.6, the interpretation would be that the odds of asthma are 40% lower in females than in males.
Odds ratio can be used to estimate relative risk when the following conditions are satisfied:
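The third condition can be shown mathematically: when the disease is rare, \(a\) is small relative to \(b\) and \(c\) is small relative to \(d\), so that \(a+b \cong b\) and \(c+d \cong d\):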
\[ Relative\:risk=\frac{\frac{a}{a+b}}{\frac{c}{c+d}} \cong \frac{\frac{a}{b}}{\frac{c}{d}} = \frac{ad}{bc} \]
\[Odds\:ratio=\frac{\frac{a}{c}}{\frac{b}{d}}= \frac{ad}{bc}\]
Example: For the table below, calculate the relative risk and the odds ratio:
| | Developed disease | Did not develop disease |
|---|---|---|
| Exposed | 200 | 9,800 |
| Not exposed | 100 | 9,900 |
\[ Relative\:risk= \frac{Risk\:in\:exposed}{Risk\:in\:nonexposed}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{\frac{200}{200+9800}}{\frac{100}{100+9900}}=2\]
\[Odds\:ratio=\frac{\frac{a}{c}}{\frac{b}{d}}= \frac{\frac{200}{100}}{\frac{9800}{9900}}=2.02\]
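To see how closely the odds ratio tracks the relative risk, both can be computed from the same 2 × 2 table. In the sketch below, `rr_and_or()` is a hypothetical helper written for this illustration, and the second table (a common disease) uses made-up counts that are not from the source.

```r
# Relative risk and odds ratio from the four cells of a 2 x 2 table
rr_and_or <- function(exp_dis, exp_nodis, unexp_dis, unexp_nodis) {
  c(RR = (exp_dis / (exp_dis + exp_nodis)) /
         (unexp_dis / (unexp_dis + unexp_nodis)),
    OR = (exp_dis * unexp_nodis) / (exp_nodis * unexp_dis))
}

# Rare disease: the table from the example above
rare_disease <- rr_and_or(200, 9800, 100, 9900)

# Common disease: hypothetical counts chosen only for contrast
common_disease <- rr_and_or(4000, 6000, 2000, 8000)

round(rbind(rare_disease, common_disease), 2)
```

With the rare disease the two measures are nearly identical, while in the hypothetical common-disease table the odds ratio is noticeably larger than the relative risk.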