R tutorial

This document will teach a little more R code, building on previous tutorials, and review the material taught in the 2. Measures of Association module

First, set your working directory (the folder your data is in), and load in the Framingham study data

setwd("C:/~yourworkingdirectory")
load("frmgham_recoded.Rdata")

Use the head command to explore the data. (It displays the first 6 rows. The “tail” command displays the bottom 6.)

head(frmgham_recoded)
##   randid sex totchol age sysbp diabp cursmoke cigpday   bmi diabetes
## 1   2448   1     195  39 106.0  70.0        0       0 26.97        0
## 2   2448   1     209  52 121.0  66.0        0       0    NA        0
## 3   6238   2     250  46 121.0  81.0        0       0 28.73        0
## 4   6238   2     260  52 105.0  69.5        0       0 29.43        0
## 5   6238   2     237  58 108.0  66.0        0       0 28.50        0
## 6   9428   1     245  48 127.5  80.0        1      20 25.34        0
##   bpmeds heartrte glucose educ prevchd prevap prevmi prevstrk prevhyp time
## 1      0       80      77    4       0      0      0        0       0    0
## 2      0       69      92    4       0      0      0        0       0 4628
## 3      0       95      76    2       0      0      0        0       0    0
## 4      0       80      86    2       0      0      0        0       0 2156
## 5      0       80      71    2       0      0      0        0       0 4344
## 6      0       75      70    1       0      0      0        0       0    0
##   period hdlc ldlc death angina hospmi mi_fchd anychd stroke cvd hyperten
## 1      1   NA   NA     0      0      1       1      1      0   1        0
## 2      3   31  178     0      0      1       1      1      0   1        0
## 3      1   NA   NA     0      0      0       0      0      0   0        0
## 4      2   NA   NA     0      0      0       0      0      0   0        0
## 5      3   54  141     0      0      0       0      0      0   0        0
## 6      1   NA   NA     0      0      0       0      0      0   0        0
##   timeap timemi timemifc timechd timestrk timecvd timedth timehyp bmi_cat
## 1   8766   6438     6438    6438     8766    6438    8766    8766       3
## 2   8766   6438     6438    6438     8766    6438    8766    8766      NA
## 3   8766   8766     8766    8766     8766    8766    8766    8766       3
## 4   8766   8766     8766    8766     8766    8766    8766    8766       3
## 5   8766   8766     8766    8766     8766    8766    8766    8766       3
## 6   8766   8766     8766    8766     8766    8766    8766    8766       3
##   underwt normalwt overwt obese male timedth_yrs  time_yrs
## 1       0        0      1     0    1          24  0.000000
## 2      NA       NA     NA    NA    1          24 12.670773
## 3       0        0      1     0    0          24  0.000000
## 4       0        0      1     0    0          24  5.902806
## 5       0        0      1     0    0          24 11.893224
## 6       0        0      1     0    1          24  0.000000

To make the data easier to play with, we will only keep twp variables, whether a heart attack occurred (“anychd”), and hypertension(“hyperten”).

data<-subset(data, select=c("anychd","hyperten"))

Let’s summarize the each variable individually

#Attaching the data allows us to just use the variable names in commands. Otherwise, you'd have to code data$anychd, etc.
attach(data)
summary(anychd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2716  1.0000  1.0000
table(anychd)
## anychd
##    0    1 
## 8469 3158
summary(hyperten)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.7433  1.0000  1.0000
table(hyperten)
## hyperten
##    0    1 
## 2985 8642

From the minimums and maximums, we can tell all three of these variables are binary. This is good, it allows us to easily make 2x2 tables with these three variables.

2x2 tables

Remember the 2x2 table format:

. Disease No Disease Total
Exposed a b a+b
Unexposed c d c+d
Total a+c b+d a+b+c+d

The “table” command can be used with two variables, to give us a 2x2 table

#The "-" is used before the variable name to flip the table output in the form we are used to for the 2x2 table
#Try it without the "-" and see what happens
table(hyperten,anychd)
##         anychd
## hyperten    0    1
##        0 2524  461
##        1 5945 2697

The organization of the table is different than what we use in epidemiology. The data is set up so positive disease or exposure equals 1, and here is on the bottom row/ right column.
We can add some code to flip the table format so it mirrors the 2x2 format

#The "-" is used before the variable name to flip the table output in the form we are used to for the 2x2 table
#Try it without the "-" and see what happens
table(-hyperten,-anychd)
##     
##        -1    0
##   -1 2697 5945
##   0   461 2524

We can visualize the table sizes with a mosaic plot:

mosaicplot(table(-hyperten,-anychd))

Calculating measures of association

Prevalence ratio

\[PR=\frac{a/(a+b)}{c/(c+d}\]

First, lets calculate the denominators, the total number of exposed and not exposed, This is the number of 1s (exposed) and 0s (not-exposed) of “hyperten”.

#"sum" adds up all rows of data (so here, will count all the 1's)
exposed=sum(hyperten)
exposed
## [1] 8642
#"length" counts the number of rows, so length minus sum of 1's will be the numbers of 0's
no_exposed=length(hyperten)-exposed
no_exposed
## [1] 2985

Now let’s count the number of observations where the subject both has coronary heart disease and has hypertension

a=length(which(anychd==1 & hyperten==1))
a
## [1] 2697
c=length(which(anychd==1 & hyperten==0))
c
## [1] 461

Do these numbers match what we saw in the table above?

Calculate PR

PR=(a/exposed)/(c/no_exposed)
PR
## [1] 2.020738

Odds ratio

\[OR=\frac{a/c}{b/d}\]
####or
\[OR=\frac{ad}{bc}\]

Let’s assign values to \(b\) and \(d\) like we did with \(a\) and \(c\)

b=length(which(anychd==0 & hyperten==1))
b
## [1] 5945
d=length(which(anychd==0 & hyperten==0))
d
## [1] 2524

Calculate OR

OR=(a/b)/(c/d)
OR
## [1] 2.483805
#Using the other formula
OR=(a*d)/(b*c)
OR
## [1] 2.483805

Do the two formulas match?

Interpretation

In this study population, people with hypertension have 2 times the risk of any coronary disease than people without hypertension.
People with hypertension have 2.5 times the odds of having any coronary disease than do people without hypertension.

Notice how different the odd ratio and prevalence ratio are? Looking at the formulas, could a prevalence ratio ever be larger than the corresponding odds ratio? What would happen is CHD was much rarer in this population.

Comparing RR and OR for rare diseases

Lets make the number of disease 1/10th it was in both the exposed and unexposed populations.

rare_a=round(a/10)
rare_a
## [1] 270
rare_c=round(c/10)
rare_c
## [1] 46
#Prevalence ratio:
PRrare=(rare_a/(rare_a+b))/(rare_c/(rare_c+d))
PRrare
## [1] 2.427157
#odds ratio:
ORrare=(rare_a*d)/(b*rare_c)
ORrare
## [1] 2.491974

Closer? Now lets try 1/100th the original rate of disease

rare_a=round(a/100)
rare_a
## [1] 27
rare_c=round(c/100)
rare_c
## [1] 5
#Prevalence ratio:
PRrare=(rare_a/(rare_a+b))/(rare_c/(rare_c+d))
PRrare
## [1] 2.286772
#odds ratio:
ORrare=(rare_a*d)/(b*rare_c)
ORrare
## [1] 2.292616

OR and PR converge as disease becomes rarer in both exposed and unexposed populations

Now for to calculate absolute measures:

Attributable risk

\[R_{exposed}-R_{unexposed}\] where R is risk/rate, which, in the 2x2 notation, is: \[ \frac{a}{a+b} - \frac{c}{c+d} \]

AR=a/(a+b)-c/(c+d)
AR
## [1] 0.1576417

Attributable risk Percent

\[\frac{R_{exposed} -R_{unexposed}}{R_{exposed}} \times 100 =\]
\[ \frac{AR}{R_{exposed}} \times 100\]

ARper=(a/(a+b)-c/(c+d))/(a/(a+b))*100
ARper
## [1] 50.51314

Population attributable risk

\[\frac{R_{total} - R_{unexposed}}{Rtotal} \times 100 =\]
\[\frac{PAR}{Rtotal} \times 100\]

PAR=((sum(hyperten)/length(hyperten))-c/(c+d))/(sum(hyperten)/length(hyperten))*100
PAR
## [1] 79.2217
#Equivalent to:
PAR=((a+c)/(a+b+c+d)-c/(c+d))/((a+c)/(a+b+c+d))*100
PAR
## [1] 43.13931

epiR package

The epiR package gives us usefull commands for epidemiology analysis. It is not a default package, which means we need to install it. Once it’s installed on your machine, it doesn’t need to be installed again, but each time you open R, you need to load it from your library of packages.

install.packages("epiR", repos="http://cran.rstudio.com/")
library(epiR)

Let’s try out the epi.2by2 command on anychd and hyperten

epi.2by2(table(-hyperten,-anychd))
## Warning in N0 * (N0 + N1) * a: NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow

## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * (N0 + N1) * a: NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow

## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
##              Outcome +    Outcome -      Total        Inc risk *
## Exposed +         2697         5945       8642              31.2
## Exposed -          461         2524       2985              15.4
## Total             3158         8469      11627              27.2
##                  Odds
## Exposed +       0.454
## Exposed -       0.183
## Total           0.373
## 
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio                               2.02 (1.85, 2.21)
## Odds ratio                                   2.48 (2.23, 2.77)
## Attrib risk *                                15.76 (14.14, 17.39)
## Attrib risk in population *                  11.72 (10.19, 13.24)
## Attrib fraction in exposed (%)               50.51 (45.88, 54.75)
## Attrib fraction in population (%)            43.14 (38.61, 47.33)
## -------------------------------------------------------------------
##  X2 test statistic: 278.692 p-value: < 0.001
##  Wald confidence limits
##  * Outcomes per 100 population units

Did we do our calculations right? (Note that the package output reports AR per 100 people).