We will learn about the following topics in this section.
## Load tidyverse
library(tidyverse)
## Load CSV file
framingham <- read_csv("./framingham.csv")
The two-sample t-test is commonly used to test the hypothesis regarding mean difference between two independent groups. In R, this has the following syntax if we are comparing the mean BMI across SEX.
## Welch t-test is the default
(ttest_welch <- t.test(BMI ~ SEX, data = framingham))
##
## Welch Two Sample t-test
##
## data: BMI by SEX
## t = -4.8099, df = 4403.8, p-value = 0.00000156
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8117584 -0.3416388
## sample estimates:
## mean in group 0 mean in group 1
## 25.59288 26.16958
## Student t-test
(ttest_student <- t.test(BMI ~ SEX, data = framingham, var.equal = TRUE))
##
## Two Sample t-test
##
## data: BMI by SEX
## t = -4.6471, df = 4413, p-value = 0.000003464
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8199941 -0.3334030
## sample estimates:
## mean in group 0 mean in group 1
## 25.59288 26.16958
In R, most functions do not print the result, but return a result object that contains detail information, which then is printed. To examine what types of information is held in the result object, use the names() function.
names(ttest_welch)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
Elements can be extracted as follows.
## p-value using $ operator
ttest_welch$p.value
## [1] 0.00000156033
## t-statistic using [[]] operator
ttest_welch[["statistic"]]
## t
## -4.809922
The object created by BMI ~ SEX is of the “formula” class. it roughly means, “explain LHS by RHS”. “formula” objects are extensively used in regression modeling.
class(BMI ~ SEX)
## [1] "formula"
In epidemiology, 2x2 tables are often encountered. Given the raw data, one way to construct a 2x2 table is the following using xtabs(). The formula “~ SEX + CURSMOKE” means explain the LHS (implicitly count of observations) by SEX and CURSMOKE (current smoking status).
cross_tab1 <- xtabs( ~ SEX + CURSMOKE, data = framingham)
cross_tab1
## CURSMOKE
## SEX 0 1
## 0 1484 1006
## 1 769 1175
If you already have a 2x2 table and want to manually enter it, it’s a bit clumsy…
cross_tab2 <- matrix(c(1484, 1006,
769, 1175),
## Number of columns 2
ncol = 2,
## Row-based entry
byrow = TRUE)
cross_tab2
## [,1] [,2]
## [1,] 1484 1006
## [2,] 769 1175
The \(\chi^2\) test can be conducted using the 2x2 table created above.
## Default is with continuity correction by subtraction of 0.5.
(chisq_correct <- chisq.test(cross_tab1))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cross_tab1
## X-squared = 174.63, df = 1, p-value < 2.2e-16
## Without continuity correction
(chisq_no_correct <- chisq.test(cross_tab1, correct = FALSE))
##
## Pearson's Chi-squared test
##
## data: cross_tab1
## X-squared = 175.43, df = 1, p-value < 2.2e-16
Check what elements the result object has using names().
Fisher’s exact test can use the same syntax.
fisher.test(cross_tab1)
##
## Fisher's Exact Test for Count Data
##
## data: cross_tab1
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.993100 2.548945
## sample estimates:
## odds ratio
## 2.253545
Different types of 2x2 tables exist depending on the type of assessment you are trying to do.
This type of 2x2 tables are used to assess the association between an exposure of interest (environmental exposures, drugs, etc) and the outcome of interest (cancer, etc). Both variables have to be binary. Let’s examine the association between current smoking and death during the follow up. First create a two by two table.
tab_death_cursmoke <- xtabs( ~ CURSMOKE + DEATH, data = framingham)
tab_death_cursmoke
## DEATH
## CURSMOKE 0 1
## 0 1491 762
## 1 1393 788
epiR::epi.2by2() function can be used to obtain measures that you will hear about in your epidemiology and biostatistics courses. This function takes a 2x2 table of specific format. The outcome occupies the columns, positive first and negative next. The rows are exposure status, positive first and negative next.
Where method is ‘cohort.count’, ‘case.control’, or
‘cross.sectional’ and ‘outcome = as.columns’ the required 2 by 2
table format is:
---- ---- ---- ----
Disease + Disease - Total
---- ---- ---- ----
Expose + a b a+b
Expose - c d c+d
---- ---- ---- ----
Total a+c b+d a+b+c+d
---- ---- ---- ----
library(epiR)
## Loading required package: survival
## Package epiR 0.9-82 is loaded
## Type help(epi.about) for summary information
##
## Reorder columns and rows
tab_death_cursmoke[2:1, 2:1]
## DEATH
## CURSMOKE 1 0
## 1 788 1393
## 0 762 1491
## epi.2by2()
res_2by2 <- epi.2by2(tab_death_cursmoke[2:1, 2:1], method = "cohort.count")
## Warning in N0 * (N0 + N1) * a: NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * (N0 + N1) * a: NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
## Warning in N0 * N1 * (c + a): NAs produced by integer overflow
res_2by2
## Outcome + Outcome - Total Inc risk *
## Exposed + 788 1393 2181 36.1
## Exposed - 762 1491 2253 33.8
## Total 1550 2884 4434 35.0
## Odds
## Exposed + 0.566
## Exposed - 0.511
## Total 0.537
##
## Point estimates and 95 % CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.07 (0.99, 1.16)
## Odds ratio 1.11 (0.98, 1.25)
## Attrib risk * 2.31 (-0.50, 5.12)
## Attrib risk in population * 1.14 (-1.27, 3.54)
## Attrib fraction in exposed (%) 6.39 (-1.44, 13.61)
## Attrib fraction in population (%) 3.25 (-0.78, 7.12)
## -------------------------------------------------------------------
## X2 test statistic: 2.598 p-value: 0.107
## Wald confidence limits
## * Outcomes per 100 population units
## Unformatted more detailed results
summary(res_2by2)
## $RR.strata.wald
## est lower upper
## 1 1.06826 0.9858212 1.157592
##
## $RR.strata.score
## est lower upper
## 1 1.06826 NA NA
##
## $OR.strata.wald
## est lower upper
## 1 1.106873 0.9782858 1.252362
##
## $OR.strata.score
## est lower upper
## 1 1.106873 0.9774559 1.253368
##
## $OR.strata.cfield
## est lower upper
## 1 1.106873 0.9782489 1.252407
##
## $OR.strata.mle
## est lower upper
## 1 1.106846 0.9763701 1.254817
##
## $ARisk.strata.wald
## est lower upper
## 1 2.308644 -0.4986352 5.115924
##
## $ARisk.strata.score
## est lower upper
## 1 2.308644 -0.4986573 5.114424
##
## $PARisk.strata.wald
## est lower upper
## 1 1.135578 -1.269871 3.541028
##
## $PARisk.strata.piri
## est lower upper
## 1 1.135578 -0.245687 2.516843
##
## $AFRisk.strata.wald
## est lower upper
## 1 0.06389788 -0.01438277 0.1361376
##
## $PAFRisk.strata.wald
## est lower upper
## 1 0.03248486 -0.007847079 0.0712028
##
## $chisq.strata
## test.statistic df p.value
## 1 2.597764 1 0.1070146
This type of 2x2 tables are used when you are assessing performance of diagnostic tests. Here we use an example mentioned in the epiR package. The 2x2 table is similar except that instead of an exposure, we have a test as the rows.
## epiR package's example
## Scott et al. 2008, Table 1:
## A new diagnostic test was trialled on 1586 patients. Of 744 patients
## that were disease positive, 670 tested positive. Of 842 patients that
## were disease negative, 640 tested negative. What is the likeliood
## ratio of a positive test? What is the number needed to diagnose?
dat <- matrix(c(670, 202, 74, 640), nrow = 2, byrow = TRUE)
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
## Dis+ Dis-
## Test+ 670 202
## Test- 74 640
## epi.tests()
rval <- epi.tests(dat, conf.level = 0.95)
rval
## Outcome + Outcome - Total
## Test + 670 202 872
## Test - 74 640 714
## Total 744 842 1586
##
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence 0.55 (0.52, 0.57)
## True prevalence 0.47 (0.44, 0.49)
## Sensitivity 0.90 (0.88, 0.92)
## Specificity 0.76 (0.73, 0.79)
## Positive predictive value 0.77 (0.74, 0.80)
## Negative predictive value 0.90 (0.87, 0.92)
## Positive likelihood ratio 3.75 (3.32, 4.24)
## Negative likelihood ratio 0.13 (0.11, 0.16)
## ---------------------------------------------------------
## Unformatted more detailed results
summary(rval)
## est lower upper
## aprev 0.5498108 0.5249373 0.5744996
## tprev 0.4691047 0.4443055 0.4940184
## se 0.9005376 0.8767462 0.9210923
## sp 0.7600950 0.7297765 0.7885803
## diag.acc 0.8259773 0.8064049 0.8443346
## diag.or 28.6861119 21.5181917 38.2417364
## nnd 1.5137005 1.4091004 1.6487431
## youden 0.6606326 0.6065226 0.7096726
## ppv 0.7683486 0.7388926 0.7959784
## npv 0.8963585 0.8716393 0.9177402
## plr 3.7537262 3.3206884 4.2432346
## nlr 0.1308552 0.1050643 0.1629771