Case-control studies in R

References

Load packages

## For diet data and ccwc()
library(Epi)
## For case-cohort design
library(survival)

Risk set sampling using the Epi package.

## Load data
data(diet)
head(diet)
   id        doe        dox        dob      y fail         job month energy height weight    fat fibre   energy.grp
1 102 1976-01-17 1986-12-02 1939-03-02 10.875    0      Driver     1  22.86  181.6  88.18  9.168 1.400 <=2750 KCals
2  59 1973-07-16 1982-07-05 1912-07-05  8.969    0      Driver     7  23.88  166.0  58.74  9.651 0.935 <=2750 KCals
3 126 1970-03-17 1984-03-20 1919-12-24 14.010   13   Conductor     3  24.95  152.4  49.90 11.249 1.248 <=2750 KCals
4  16 1969-05-16 1969-12-31 1906-09-17  0.627    3      Driver     5  22.24  171.2  89.40  7.578 1.557 <=2750 KCals
5 247 1968-03-16 1979-06-25 1918-07-10 11.274   13 Bank worker     3  18.54  177.8  97.07  9.147 0.991 <=2750 KCals
6 272 1969-03-16 1973-12-13 1920-03-06  4.745    3 Bank worker     3  20.31  175.3  61.01  8.536 0.765 <=2750 KCals
  chd
1   0
2   0
3   1
4   1
5   1
6   1
Diet and heart data
     The ‘diet’ data frame has 337 rows and 14 columns. The data
     concern a subsample of subjects drawn from larger cohort studies
     of the incidence of coronary heart disease (CHD). These subjects
     had all completed a 7-day weighed dietary survey while taking part
     in validation studies of dietary questionnaire methods. Upon the
     closure of the MRC Social Medicine Unit, from where these studies
     were directed, it was found that 46 CHD events had occurred in
     this group, thus allowing a serendipitous study of the
     relationship between diet and the incidence of CHD.
Format:
     This data frame contains the following columns:
               ‘id’:  subject identifier, a numeric vector.
              ‘doe’:  date of entry into follow-up study, a
                      ‘Date’ variable.
              ‘dox’:  date of exit from the follow-up study, a
                      ‘Date’ variable.
              ‘dob’:  date of birth, a‘Date’ variable.
                ‘y’:  - number of years at risk, a numeric vector.
             ‘fail’:  status on exit, a numeric vector (codes 1, 3, 11, and
                      13 represent CHD events)
              ‘job’:  occupation, a factor with levels
                      ‘Driver’
                      ‘Conductor’
                      ‘Bank worker’
            ‘month’:  month of dietary survey, a numeric vector
           ‘energy’:  total energy intake (KCal per day/100), a numeric
                      vector
           ‘height’:  (cm), a numeric vector
           ‘weight’:  (kg), a numeric vector
              ‘fat’:  fat intake (g/day), a numeric vector
            ‘fibre’:  dietary fibre intake (g/day), a numeric vector
       ‘energy.grp’:  high daily energy intake, a factor with levels
                      ‘<=2750 KCal’
                      ‘>2750 KCal’
              ‘chd’:  CHD event, a numeric vector (1=CHD event, 0=no event)
Source:
     The data are described and used extensively by Clayton and Hills,
     Statistical Models in Epidemiology, Oxford University Press,
     Oxford:1993. They were rescued from destruction by David Clayton
     and reentered from paper printouts.

ccwc() function for riskset sampling

Generate a nested case-control study:
Given the basic outcome variables for a cohort study: the time of
entry to the cohort, the time of exit and the reason for exit
("failure" or "censoring"), this function computes risk sets and
generates a matched case-control study in which each case is
compared with a set of controls randomly sampled from the
appropriate risk set. Other variables may be matched when
selecting controls.
## Set seed for the random number generator
set.seed(20140111)
## Generate a nested case-control study
dietcc <- ccwc(entry    = doe,    # Time of entry to follow-up
               exit     = dox,    # Time of exit from follow-up
               fail     = chd,    # Status on exit (1 = Fail, 0 = Censored)
               origin   = dob,    # Origin of analysis time scale
               controls = 2,      # The number of controls to be selected for each case
               data     = diet,   # data frame
               include  = energy, # List of other variables to be carried across into the case-control study
               match    = job,    # List of categorical variables on which to match cases and controls
               silent   = TRUE
               )
## Show first 10 observations
head(dietcc, 10)
   Set Map       Time Fail       job energy
1    1   3 1984-03-20    1 Conductor  24.95
2    1 290 1981-03-28    0 Conductor  30.08
3    1  96 1976-09-16    0 Conductor  28.26
4    2   9 1976-07-27    1 Conductor  23.16
5    2 253 1973-05-13    0 Conductor  32.71
6    2 137 1975-09-18    0 Conductor  26.50
7    3  14 1973-03-28    1 Conductor  24.13
8    3 101 1973-10-19    0 Conductor  28.02
9    3 332 1979-10-11    0 Conductor  31.84
10   4  26 1980-02-05    1 Conductor  19.07

The 1:2 risk set-matched dataset contains the following variables.

Set: case-control set number
Map: row number of record in input dataframe
Time: failure time of the case in this set
Fail: failure status (1=case, 0=control)
These are followed by the matching variables,
and finally by thevariables in the ‘include’ list

Analysis

It requires conditional logistic regression. By stratifying on the matched sets, the matching factors, i.e., risk set and the job type here, are conditioned on. Conditioning on the matching factors is necessary in matched case control studies. Otherwise, selection bias (collider stratification bias) will arise.

resClogit <- clogit(formula = Fail ~ scale(energy) + strata(Set), data = dietcc)
summary(resClogit)
Call:
coxph(formula = Surv(rep(1, 138L), Fail) ~ scale(energy) + strata(Set), 
    data = dietcc, method = "exact")

  n= 138, number of events= 46 

                coef exp(coef) se(coef)     z Pr(>|z|)   
scale(energy) -0.574     0.563    0.209 -2.74   0.0061 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

              exp(coef) exp(-coef) lower .95 upper .95
scale(energy)     0.563       1.78     0.374     0.849

Rsquare= 0.061   (max possible= 0.523 )
Likelihood ratio test= 8.64  on 1 df,   p=0.00329
Wald test            = 7.53  on 1 df,   p=0.00606
Score (logrank) test = 8.06  on 1 df,   p=0.00454

Assuming that the log odds of failure has a linear relationship with the energy intake, one standard deviation increase in energy intake results in a rate ratio (odds ratio is interpreted as a rate ratio in risk set sampling) of 0.563 within each risk set.

Case-cohort design using the survival package

## Load data
data(nwtco)
head(nwtco)
  seqno instit histol stage study rel edrel age in.subcohort
1     1      2      2     1     3   0  6075  25        FALSE
2     2      1      1     2     3   0  4121  50        FALSE
3     3      2      2     1     3   0  6069   9        FALSE
4     4      2      1     4     3   0  6200  28         TRUE
5     5      2      2     2     3   0  1244  55        FALSE
6     6      1      1     2     3   0  2932  32        FALSE
Data from the National Wilm's Tumor Study
     Missing data/masurement error example. Tumor histology predicts
     survival, but prediction is stronger with central lab histology
     than with the local institution determination.
Format:
     A data frame with 4028 observations on the following 9 variables.
     ‘seqno’ id number
     ‘instit’ Histology from local institution
     ‘histol’ Histology from central lab
     ‘stage’ Disease stage
     ‘study’ study
     ‘rel’ indicator for relapse
     ‘edrel’ time to relapse
     ‘age’ age in months
     ‘in.subcohort’ Included in the subcohort for the example in the paper'

The in.subcohort variable indicates if the subject has been chosen as a member of the subcohort at the beginning of the study. To simulate a case-cohort study, dataset is subset to those who relapsed (cases) and those who are in the subcohort (controls).

Prepare data

## Indicator for those relapsed OR in the subcohort (data ascertained for cases and controls only)
selccoh <- with(nwtco, rel == 1 | in.subcohort == TRUE)

## Subset to these 1154 patients. (Case-cohort dataset)
caseCohortData <- nwtco[selccoh,]

## Create factors
caseCohortData <- within(caseCohortData, {

    histol <- factor(histol, labels = c("FH","UH"))
    stage  <- factor(stage,  labels = c("I","II","III","IV"))
    age    <- age / 12 # Age in years
 })

## Check
head(caseCohortData)
   seqno instit histol stage study rel edrel   age in.subcohort
4      4      2     FH    IV     3   0  6200 2.333         TRUE
7      7      1     FH    IV     3   1   324 3.750        FALSE
11    11      1     UH    II     3   0  5570 2.000         TRUE
14    14      1     FH    II     3   0  5942 1.583         TRUE
17    17      1     FH    II     3   1   960 7.167        FALSE
22    22      1     FH    II     3   1    93 2.667        FALSE

Analysis with the cch() function

Cox regression is used to fit the model, thus the HR notation. The HR shown here is interpreted as the risk ratio.

Fits proportional hazards regression model to case-cohort data
     Returns estimates and standard errors from relative risk
     regression fit to data from case-cohort studies. A choice is
     available among the Prentice, Self-Prentice and Lin-Ying methods
     for unstratified data. For stratified data the choice is between
     Borgan I, a generalization of the Self-Prentice estimator for
     unstratified case-cohort data, and Borgan II, a generalization of
     the Lin-Ying estimator.
## Standard case-cohort analysis: simple random subcohort
## Fits proportional hazards regression model to case-cohort data
fit.ccP <- cch(Surv(edrel, rel) ~ stage + histol + age,
               data        = caseCohortData,
               subcoh      = ~ in.subcohort, # Vector of indicatorsfor subjects sampled as part of the sub-cohort
               id          = ~ seqno,     # Vector of unique identifiers
               cohort.size = 4028)        # Vector with size of each stratum original cohort
summary(fit.ccP)
Case-cohort analysis,x$method, Prentice 
 with subcohort of 668 from cohort of 4028 

Call: cch(formula = Surv(edrel, rel) ~ stage + histol + age, data = caseCohortData, 
    subcoh = ~in.subcohort, id = ~seqno, cohort.size = 4028)

Coefficients:
          Coef    HR  (95%   CI)     p
stageII  0.735 2.085 1.498 2.900 0.000
stageIII 0.597 1.817 1.293 2.552 0.001
stageIV  1.384 3.991 2.672 5.963 0.000
histolUH 1.498 4.473 3.271 6.117 0.000
age      0.043 1.044 0.997 1.094 0.068

## Self-Prentine method
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
                data        = caseCohortData,
                subcoh      = ~ in.subcohort, # Vector of indicatorsfor subjects sampled as part of the sub-cohort
                id          = ~ seqno,     # Vector of unique identifiers
                cohort.size = 4028,        # Vector with size of each stratum original cohort
                method      = "SelfPren"
                )
summary(fit.ccSP)
Case-cohort analysis,x$method, SelfPrentice 
 with subcohort of 668 from cohort of 4028 

Call: cch(formula = Surv(edrel, rel) ~ stage + histol + age, data = caseCohortData, 
    subcoh = ~in.subcohort, id = ~seqno, cohort.size = 4028, 
    method = "SelfPren")

Coefficients:
          Coef    HR  (95%   CI)     p
stageII  0.736 2.088 1.501 2.905 0.000
stageIII 0.597 1.818 1.294 2.553 0.001
stageIV  1.392 4.021 2.692 6.008 0.000
histolUH 1.506 4.507 3.295 6.163 0.000
age      0.043 1.044 0.997 1.094 0.069

Stages II, III, and IV tumors compared to stage I tumors have increased risk of relapse.