## For diet data and ccwc()
library(Epi)
## For case-cohort design
library(survival)
## Load data
data(diet)
head(diet)
id doe dox dob y fail job month energy height weight fat fibre energy.grp
1 102 1976-01-17 1986-12-02 1939-03-02 10.875 0 Driver 1 22.86 181.6 88.18 9.168 1.400 <=2750 KCals
2 59 1973-07-16 1982-07-05 1912-07-05 8.969 0 Driver 7 23.88 166.0 58.74 9.651 0.935 <=2750 KCals
3 126 1970-03-17 1984-03-20 1919-12-24 14.010 13 Conductor 3 24.95 152.4 49.90 11.249 1.248 <=2750 KCals
4 16 1969-05-16 1969-12-31 1906-09-17 0.627 3 Driver 5 22.24 171.2 89.40 7.578 1.557 <=2750 KCals
5 247 1968-03-16 1979-06-25 1918-07-10 11.274 13 Bank worker 3 18.54 177.8 97.07 9.147 0.991 <=2750 KCals
6 272 1969-03-16 1973-12-13 1920-03-06 4.745 3 Bank worker 3 20.31 175.3 61.01 8.536 0.765 <=2750 KCals
chd
1 0
2 0
3 1
4 1
5 1
6 1
Diet and heart data
The ‘diet’ data frame has 337 rows and 14 columns. The data
concern a subsample of subjects drawn from larger cohort studies
of the incidence of coronary heart disease (CHD). These subjects
had all completed a 7-day weighed dietary survey while taking part
in validation studies of dietary questionnaire methods. Upon the
closure of the MRC Social Medicine Unit, from where these studies
were directed, it was found that 46 CHD events had occurred in
this group, thus allowing a serendipitous study of the
relationship between diet and the incidence of CHD.
Format:
This data frame contains the following columns:
‘id’: subject identifier, a numeric vector.
‘doe’: date of entry into follow-up study, a
‘Date’ variable.
‘dox’: date of exit from the follow-up study, a
‘Date’ variable.
‘dob’: date of birth, a‘Date’ variable.
‘y’: - number of years at risk, a numeric vector.
‘fail’: status on exit, a numeric vector (codes 1, 3, 11, and
13 represent CHD events)
‘job’: occupation, a factor with levels
‘Driver’
‘Conductor’
‘Bank worker’
‘month’: month of dietary survey, a numeric vector
‘energy’: total energy intake (KCal per day/100), a numeric
vector
‘height’: (cm), a numeric vector
‘weight’: (kg), a numeric vector
‘fat’: fat intake (g/day), a numeric vector
‘fibre’: dietary fibre intake (g/day), a numeric vector
‘energy.grp’: high daily energy intake, a factor with levels
‘<=2750 KCal’
‘>2750 KCal’
‘chd’: CHD event, a numeric vector (1=CHD event, 0=no event)
Source:
The data are described and used extensively by Clayton and Hills,
Statistical Models in Epidemiology, Oxford University Press,
Oxford:1993. They were rescued from destruction by David Clayton
and reentered from paper printouts.
ccwc() function for riskset sampling
Generate a nested case-control study:
Given the basic outcome variables for a cohort study: the time of
entry to the cohort, the time of exit and the reason for exit
("failure" or "censoring"), this function computes risk sets and
generates a matched case-control study in which each case is
compared with a set of controls randomly sampled from the
appropriate risk set. Other variables may be matched when
selecting controls.
## Set seed for the random number generator
set.seed(20140111)
## Generate a nested case-control study
dietcc <- ccwc(entry = doe, # Time of entry to follow-up
exit = dox, # Time of exit from follow-up
fail = chd, # Status on exit (1 = Fail, 0 = Censored)
origin = dob, # Origin of analysis time scale
controls = 2, # The number of controls to be selected for each case
data = diet, # data frame
include = energy, # List of other variables to be carried across into the case-control study
match = job, # List of categorical variables on which to match cases and controls
silent = TRUE
)
## Show first 10 observations
head(dietcc, 10)
Set Map Time Fail job energy
1 1 3 1984-03-20 1 Conductor 24.95
2 1 290 1981-03-28 0 Conductor 30.08
3 1 96 1976-09-16 0 Conductor 28.26
4 2 9 1976-07-27 1 Conductor 23.16
5 2 253 1973-05-13 0 Conductor 32.71
6 2 137 1975-09-18 0 Conductor 26.50
7 3 14 1973-03-28 1 Conductor 24.13
8 3 101 1973-10-19 0 Conductor 28.02
9 3 332 1979-10-11 0 Conductor 31.84
10 4 26 1980-02-05 1 Conductor 19.07
The 1:2 risk set-matched dataset contains the following variables.
Set: case-control set number
Map: row number of record in input dataframe
Time: failure time of the case in this set
Fail: failure status (1=case, 0=control)
These are followed by the matching variables,
and finally by thevariables in the ‘include’ list
Analysis
It requires conditional logistic regression. By stratifying on the matched sets, the matching factors, i.e., risk set and the job type here, are conditioned on. Conditioning on the matching factors is necessary in matched case control studies. Otherwise, selection bias (collider stratification bias) will arise.
resClogit <- clogit(formula = Fail ~ scale(energy) + strata(Set), data = dietcc)
summary(resClogit)
Call:
coxph(formula = Surv(rep(1, 138L), Fail) ~ scale(energy) + strata(Set),
data = dietcc, method = "exact")
n= 138, number of events= 46
coef exp(coef) se(coef) z Pr(>|z|)
scale(energy) -0.574 0.563 0.209 -2.74 0.0061 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
scale(energy) 0.563 1.78 0.374 0.849
Rsquare= 0.061 (max possible= 0.523 )
Likelihood ratio test= 8.64 on 1 df, p=0.00329
Wald test = 7.53 on 1 df, p=0.00606
Score (logrank) test = 8.06 on 1 df, p=0.00454
Assuming that the log odds of failure has a linear relationship with the energy intake, one standard deviation increase in energy intake results in a rate ratio (odds ratio is interpreted as a rate ratio in risk set sampling) of 0.563 within each risk set.
## Load data
data(nwtco)
head(nwtco)
seqno instit histol stage study rel edrel age in.subcohort
1 1 2 2 1 3 0 6075 25 FALSE
2 2 1 1 2 3 0 4121 50 FALSE
3 3 2 2 1 3 0 6069 9 FALSE
4 4 2 1 4 3 0 6200 28 TRUE
5 5 2 2 2 3 0 1244 55 FALSE
6 6 1 1 2 3 0 2932 32 FALSE
Data from the National Wilm's Tumor Study
Missing data/masurement error example. Tumor histology predicts
survival, but prediction is stronger with central lab histology
than with the local institution determination.
Format:
A data frame with 4028 observations on the following 9 variables.
‘seqno’ id number
‘instit’ Histology from local institution
‘histol’ Histology from central lab
‘stage’ Disease stage
‘study’ study
‘rel’ indicator for relapse
‘edrel’ time to relapse
‘age’ age in months
‘in.subcohort’ Included in the subcohort for the example in the paper'
The in.subcohort variable indicates if the subject has been chosen as a member of the subcohort at the beginning of the study. To simulate a case-cohort study, dataset is subset to those who relapsed (cases) and those who are in the subcohort (controls).
Prepare data
## Indicator for those relapsed OR in the subcohort (data ascertained for cases and controls only)
selccoh <- with(nwtco, rel == 1 | in.subcohort == TRUE)
## Subset to these 1154 patients. (Case-cohort dataset)
caseCohortData <- nwtco[selccoh,]
## Create factors
caseCohortData <- within(caseCohortData, {
histol <- factor(histol, labels = c("FH","UH"))
stage <- factor(stage, labels = c("I","II","III","IV"))
age <- age / 12 # Age in years
})
## Check
head(caseCohortData)
seqno instit histol stage study rel edrel age in.subcohort
4 4 2 FH IV 3 0 6200 2.333 TRUE
7 7 1 FH IV 3 1 324 3.750 FALSE
11 11 1 UH II 3 0 5570 2.000 TRUE
14 14 1 FH II 3 0 5942 1.583 TRUE
17 17 1 FH II 3 1 960 7.167 FALSE
22 22 1 FH II 3 1 93 2.667 FALSE
Analysis with the cch() function
Cox regression is used to fit the model, thus the HR notation. The HR shown here is interpreted as the risk ratio.
Fits proportional hazards regression model to case-cohort data
Returns estimates and standard errors from relative risk
regression fit to data from case-cohort studies. A choice is
available among the Prentice, Self-Prentice and Lin-Ying methods
for unstratified data. For stratified data the choice is between
Borgan I, a generalization of the Self-Prentice estimator for
unstratified case-cohort data, and Borgan II, a generalization of
the Lin-Ying estimator.
## Standard case-cohort analysis: simple random subcohort
## Fits proportional hazards regression model to case-cohort data
fit.ccP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data = caseCohortData,
subcoh = ~ in.subcohort, # Vector of indicatorsfor subjects sampled as part of the sub-cohort
id = ~ seqno, # Vector of unique identifiers
cohort.size = 4028) # Vector with size of each stratum original cohort
summary(fit.ccP)
Case-cohort analysis,x$method, Prentice
with subcohort of 668 from cohort of 4028
Call: cch(formula = Surv(edrel, rel) ~ stage + histol + age, data = caseCohortData,
subcoh = ~in.subcohort, id = ~seqno, cohort.size = 4028)
Coefficients:
Coef HR (95% CI) p
stageII 0.735 2.085 1.498 2.900 0.000
stageIII 0.597 1.817 1.293 2.552 0.001
stageIV 1.384 3.991 2.672 5.963 0.000
histolUH 1.498 4.473 3.271 6.117 0.000
age 0.043 1.044 0.997 1.094 0.068
## Self-Prentine method
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data = caseCohortData,
subcoh = ~ in.subcohort, # Vector of indicatorsfor subjects sampled as part of the sub-cohort
id = ~ seqno, # Vector of unique identifiers
cohort.size = 4028, # Vector with size of each stratum original cohort
method = "SelfPren"
)
summary(fit.ccSP)
Case-cohort analysis,x$method, SelfPrentice
with subcohort of 668 from cohort of 4028
Call: cch(formula = Surv(edrel, rel) ~ stage + histol + age, data = caseCohortData,
subcoh = ~in.subcohort, id = ~seqno, cohort.size = 4028,
method = "SelfPren")
Coefficients:
Coef HR (95% CI) p
stageII 0.736 2.088 1.501 2.905 0.000
stageIII 0.597 1.818 1.294 2.553 0.001
stageIV 1.392 4.021 2.692 6.008 0.000
histolUH 1.506 4.507 3.295 6.163 0.000
age 0.043 1.044 0.997 1.094 0.069
Stages II, III, and IV tumors compared to stage I tumors have increased risk of relapse.