This document teaches a little more R code, building on previous tutorials, and reviews the material taught in Module 2: Measures of Association.
First, set your working directory (the folder your data is in) and load the Framingham study data.
setwd("C:/~yourworkingdirectory")
load("frmgham_recoded.Rdata")Use the head command to explore the data. (It displays the first 6 rows. The “tail” command displays the bottom 6.)
head(frmgham_recoded)
## randid sex totchol age sysbp diabp cursmoke cigpday bmi diabetes
## 1 2448 1 195 39 106.0 70.0 0 0 26.97 0
## 2 2448 1 209 52 121.0 66.0 0 0 NA 0
## 3 6238 2 250 46 121.0 81.0 0 0 28.73 0
## 4 6238 2 260 52 105.0 69.5 0 0 29.43 0
## 5 6238 2 237 58 108.0 66.0 0 0 28.50 0
## 6 9428 1 245 48 127.5 80.0 1 20 25.34 0
## bpmeds heartrte glucose educ prevchd prevap prevmi prevstrk prevhyp time
## 1 0 80 77 4 0 0 0 0 0 0
## 2 0 69 92 4 0 0 0 0 0 4628
## 3 0 95 76 2 0 0 0 0 0 0
## 4 0 80 86 2 0 0 0 0 0 2156
## 5 0 80 71 2 0 0 0 0 0 4344
## 6 0 75 70 1 0 0 0 0 0 0
## period hdlc ldlc death angina hospmi mi_fchd anychd stroke cvd hyperten
## 1 1 NA NA 0 0 1 1 1 0 1 0
## 2 3 31 178 0 0 1 1 1 0 1 0
## 3 1 NA NA 0 0 0 0 0 0 0 0
## 4 2 NA NA 0 0 0 0 0 0 0 0
## 5 3 54 141 0 0 0 0 0 0 0 0
## 6 1 NA NA 0 0 0 0 0 0 0 0
## timeap timemi timemifc timechd timestrk timecvd timedth timehyp bmi_cat
## 1 8766 6438 6438 6438 8766 6438 8766 8766 3
## 2 8766 6438 6438 6438 8766 6438 8766 8766 NA
## 3 8766 8766 8766 8766 8766 8766 8766 8766 3
## 4 8766 8766 8766 8766 8766 8766 8766 8766 3
## 5 8766 8766 8766 8766 8766 8766 8766 8766 3
## 6 8766 8766 8766 8766 8766 8766 8766 8766 3
## underwt normalwt overwt obese male timedth_yrs time_yrs
## 1 0 0 1 0 1 24 0.000000
## 2 NA NA NA NA 1 24 12.670773
## 3 0 0 1 0 0 24 0.000000
## 4 0 0 1 0 0 24 5.902806
## 5 0 0 1 0 0 24 11.893224
## 6 0 0 1 0 1 24 0.000000
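As a quick sketch (output not shown), the end of the data and its overall size can be checked the same way with base R:
tail(frmgham_recoded)   #last 6 rows
dim(frmgham_recoded)    #number of rows and columns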
To make the data easier to play with, we will only keep two variables: whether any coronary heart disease occurred (“anychd”) and hypertension (“hyperten”).
data<-subset(frmgham_recoded, select=c("anychd","hyperten"))
Let’s summarize each variable individually.
#Attaching the data allows us to just use the variable names in commands. Otherwise, you'd have to code data$anychd, etc.
attach(data)
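If you prefer not to attach the data, a sketch of the equivalent unattached forms (base R only; output not shown) is:
#Without attach(), refer to columns with data$variable or wrap the call in with()
summary(data$anychd)
with(data, table(anychd, hyperten))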
summary(anychd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2716 1.0000 1.0000
table(anychd)
## anychd
## 0 1
## 8469 3158
summary(hyperten)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.7433 1.0000 1.0000
table(hyperten)
## hyperten
## 0 1
## 2985 8642
From the minimums and maximums, we can tell that both of these variables are binary. This is good: it allows us to easily make 2x2 tables with them.
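If you want to confirm that directly rather than reading it off the summaries, a minimal sketch using base R is:
#List the distinct values of each kept variable; binary variables should show only 0 and 1
sapply(data, function(x) sort(unique(x)))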
Remember the 2x2 table format:
| . | Disease | No Disease | Total |
|---|---|---|---|
| Exposed | a | b | a+b |
| Unexposed | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |
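As a purely illustrative sketch (not part of the analysis), the same layout can be written out as a labeled matrix in R to keep the letter notation straight:
#The classic 2x2 layout with the cells labeled a, b, c, d
matrix(c("a","b","c","d"), nrow=2, byrow=TRUE,
       dimnames=list(c("Exposed","Unexposed"), c("Disease","No Disease")))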
The “table” command can be used with two variables to give us a 2x2 table.
#The "-" is used before the variable name to flip the table output in the form we are used to for the 2x2 table
#Try it without the "-" and see what happens
table(hyperten,anychd)
## anychd
## hyperten 0 1
## 0 2524 461
## 1 5945 2697
The organization of this table is different from the one we use in epidemiology. The data are coded so that positive disease or exposure equals 1, which here ends up in the bottom row / right-hand column.
We can add some code to flip the table so it mirrors the 2x2 format.
#The "-" is used before the variable name to flip the table output in the form we are used to for the 2x2 table
#Try it without the "-" and see what happens
table(-hyperten,-anychd)
##
## -1 0
## -1 2697 5945
## 0 461 2524
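An alternative to the “-” trick, if you want labeled rows and columns, is to reorder the factor levels; this is a sketch of an equivalent approach, not the method used in the rest of this tutorial:
#Put level 1 (exposed / diseased) first and attach readable labels
table(factor(hyperten, levels=c(1,0), labels=c("Hypertension","No hypertension")),
      factor(anychd, levels=c(1,0), labels=c("CHD","No CHD")))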
We can visualize the table sizes with a mosaic plot:
mosaicplot(table(-hyperten,-anychd))
Now let’s calculate the prevalence ratio (PR):
\[PR=\frac{a/(a+b)}{c/(c+d)}\]
First, let’s calculate the denominators: the total number of exposed and unexposed. This is the number of 1s (exposed) and 0s (unexposed) in “hyperten”.
#"sum" adds up all rows of data (so here, will count all the 1's)
exposed=sum(hyperten)
exposed
## [1] 8642
#"length" counts the number of rows, so length minus sum of 1's will be the numbers of 0's
no_exposed=length(hyperten)-exposed
no_exposed
## [1] 2985
Now let’s count the number of observations where the subject has both coronary heart disease and hypertension.
a=length(which(anychd==1 & hyperten==1))
a
## [1] 2697
c=length(which(anychd==1 & hyperten==0))
c
## [1] 461
Do these numbers match what we saw in the table above?
PR=(a/exposed)/(c/no_exposed)
PR
## [1] 2.020738
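As a cross-check (a sketch; the object names tab, a_check, and c_check are just for illustration), the same cell counts can be pulled out of the stored 2x2 table by indexing:
tab<-table(-hyperten,-anychd)
a_check<-tab[1,1]             #exposed and diseased
c_check<-tab[2,1]             #unexposed and diseased
(a_check/sum(tab[1,]))/(c_check/sum(tab[2,]))   #should reproduce the PR above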
Now let’s calculate the odds ratio (OR):
\[OR=\frac{a/c}{b/d}\]
or
\[OR=\frac{ad}{bc}\]
Let’s assign values to \(b\) and \(d\) like we did with \(a\) and \(c\)
b=length(which(anychd==0 & hyperten==1))
b
## [1] 5945
d=length(which(anychd==0 & hyperten==0))
d
## [1] 2524
OR=(a/b)/(c/d)
OR
## [1] 2.483805
#Using the other formula
OR=(a*d)/(b*c)
OR
## [1] 2.483805
Do the two formulas match?
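A quick way to check in R (a one-line sketch) is:
#TRUE if the two odds ratio formulas agree up to floating-point tolerance
all.equal((a/b)/(c/d), (a*d)/(b*c))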
In this study population, people with hypertension have about 2 times the prevalence of any coronary heart disease compared with people without hypertension.
People with hypertension have about 2.5 times the odds of any coronary heart disease compared with people without hypertension.
Notice how different the odds ratio and prevalence ratio are? Looking at the formulas, could a prevalence ratio ever be larger than the corresponding odds ratio? What would happen if CHD were much rarer in this population?
Let’s make the number of diseased 1/10th of what it was in both the exposed and unexposed populations.
rare_a=round(a/10)
rare_a
## [1] 270
rare_c=round(c/10)
rare_c
## [1] 46
#Prevalence ratio:
PRrare=(rare_a/(rare_a+b))/(rare_c/(rare_c+d))
PRrare
## [1] 2.427157
#odds ratio:
ORrare=(rare_a*d)/(b*rare_c)
ORrare
## [1] 2.491974
Closer? Now let’s try 1/100th of the original rate of disease.
rare_a=round(a/100)
rare_a
## [1] 27
rare_c=round(c/100)
rare_c
## [1] 5
#Prevalence ratio:
PRrare=(rare_a/(rare_a+b))/(rare_c/(rare_c+d))
PRrare
## [1] 2.286772
#odds ratio:
ORrare=(rare_a*d)/(b*rare_c)
ORrare
## [1] 2.292616
The OR and PR converge as the disease becomes rarer in both the exposed and unexposed populations.
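To see the convergence more systematically, here is a sketch (not part of the original exercise) that shrinks the diseased counts by several factors and recomputes both measures; unlike above, the counts are left unrounded just to show the trend:
#For each shrink factor k, scale down a and c and recompute PR and OR
for (k in c(1, 10, 100, 1000)) {
  ra<-a/k
  rc<-c/k
  pr<-(ra/(ra+b))/(rc/(rc+d))
  or<-(ra*d)/(b*rc)
  cat("1/", k, ": PR =", round(pr,4), " OR =", round(or,4), "\n")
}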
Now let’s calculate absolute measures. The attributable risk (risk difference) is:
\[AR = R_{exposed}-R_{unexposed}\]
where R is the risk (or rate), which, in the 2x2 notation, is:
\[AR = \frac{a}{a+b} - \frac{c}{c+d}\]
AR=a/(a+b)-c/(c+d)
AR
## [1] 0.1576417
The attributable risk percent (the attributable fraction in the exposed) is:
\[AR\% = \frac{R_{exposed} - R_{unexposed}}{R_{exposed}} \times 100 = \frac{AR}{R_{exposed}} \times 100\]
ARper=(a/(a+b)-c/(c+d))/(a/(a+b))*100
ARper
## [1] 50.51314
The population attributable risk percent (the attributable fraction in the population) is:
\[PAR\% = \frac{R_{total} - R_{unexposed}}{R_{total}} \times 100 = \frac{PAR}{R_{total}} \times 100\]
PAR=((sum(anychd)/length(anychd))-c/(c+d))/(sum(anychd)/length(anychd))*100
PAR
## [1] 43.13931
#Equivalent to:
PAR=((a+c)/(a+b+c+d)-c/(c+d))/((a+c)/(a+b+c+d))*100
PAR
## [1] 43.13931
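As a sketch of a cross-check, the two PAR expressions can be compared directly (they use the same quantities, since sum(anychd) equals a+c and length(anychd) equals a+b+c+d):
#TRUE if both ways of writing the PAR% give the same number
all.equal(((sum(anychd)/length(anychd))-c/(c+d))/(sum(anychd)/length(anychd))*100,
          ((a+c)/(a+b+c+d)-c/(c+d))/((a+c)/(a+b+c+d))*100)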
The epiR package gives us useful commands for epidemiologic analysis. It is not a default package, which means we need to install it. Once it’s installed on your machine, it doesn’t need to be installed again, but each time you open R, you need to load it from your library of packages.
install.packages("epiR", repos="http://cran.rstudio.com/")
library(epiR)
Let’s try out the epi.2by2 command on anychd and hyperten.
epi.2by2(table(-hyperten,-anychd))
## Warning in N0 * (N0 + N1) * a: NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * (N0 + N1) * a: NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Outcome + Outcome - Total Inc risk *
## Exposed + 2697 5945 8642 31.2
## Exposed - 461 2524 2985 15.4
## Total 3158 8469 11627 27.2
## Odds
## Exposed + 0.454
## Exposed - 0.183
## Total 0.373
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio 2.02 (1.85, 2.21)
## Odds ratio 2.48 (2.23, 2.77)
## Attrib risk * 15.76 (14.14, 17.39)
## Attrib risk in population * 11.72 (10.19, 13.24)
## Attrib fraction in exposed (%) 50.51 (45.88, 54.75)
## Attrib fraction in population (%) 43.14 (38.61, 47.33)
## -------------------------------------------------------------------
## X2 test statistic: 278.692 p-value: < 0.001
## Wald confidence limits
## * Outcomes per 100 population units
Did we do our calculations right? (Note that the package output reports AR per 100 people).
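One way to check is to line our hand calculations up against the package’s point estimates; here is a sketch (AR is multiplied by 100 to put it on the same per-100 scale as the epi.2by2 output):
#Our manual estimates, rounded for comparison with the epi.2by2 point estimates
round(c(PR=PR, OR=OR, AR_per_100=AR*100, AR_percent=ARper, PAR_percent=PAR), 2)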