Introduction

In this tutorial we show examples on how to obtain risk predictions based on a longitudinal marker in a case-cohort (CCH) or a nested case-control (NCC) sample, using methods described in ``Evaluating longitudinal markers under two-phase study designs’’ by Marlena Maziarz, Tianxi Ci, Li Qi, Anna Lok and Yingye Zheng. The examples below illustrate the use of our methods to predict risk using PC\(_{GLM}\) based on a longitudinal biomarker with a survival outcome, as well as several evaluation measures of prediction and their standard error. The evaluation measures we considered are the prediction error (PE), true and false positive fractions (TPF and FPF), the area under the Receiver Operating Characteristics curve (AUC), the proportion of cases followed (PCF) and the proportion of cases needed to be followed (PNF). All code for predicting and evaluating risk, example datasets and examples on how to run the analysis are also available on github.com/marlenamaziarz/longitudinal-two-phase (right-click and open in new tab/window).

Before we get started

Load the necessary libraries

require(survival)
require(gam)
require(stats)
require(splines)
require(RCurl)   # for 'getURL'
require(repmis)  # for 'source_data'

Load the functions (one file for NCC and CCH)

The source code on GitHub contains all the functions needed for risk estimation and estimation of the measures to evaluate risk predictions, including standard errors estimated using perturbation. To load the source code:

script <- getURL("https://raw.githubusercontent.com/marlenamaziarz/longitudinal-two-phase/master/code-helper-functions.r", ssl.verifypeer = FALSE)

eval(parse(text = script))

Nested case-control analysis

Load the example dataset for NCC

The example dataset for NCC can be loaded using:

source_data("https://github.com/marlenamaziarz/longitudinal-two-phase/blob/master/dataset-ncc.rdata?raw=True")

The dataset has the following columns:

##    sub.id visit meas.time    marker      time status    t.star in.ncc ix.first
## 1       1     1         0 4.0173812  3.116595      0  3.116595  FALSE     TRUE
## 11      2     1         0 2.8730772  8.025501      0  8.025501  FALSE     TRUE
## 12      2     2         6 1.4007894  8.025501      0  2.025501  FALSE    FALSE
## 21      3     1         0 5.3736497 20.844036      0 20.844036  FALSE     TRUE
## 22      3     2         6 2.9753239 20.844036      0 14.844036  FALSE    FALSE
## 23      3     3        12 2.2411058 20.844036      0  8.844036  FALSE    FALSE
## 24      3     4        18 1.6639688 20.844036      0  2.844036  FALSE    FALSE
## 31      4     1         0 2.7390869 29.367374      1 29.367374   TRUE     TRUE
## 32      4     2         6 0.8567458 29.367374      1 23.367374   TRUE    FALSE
## 33      4     3        12 0.3968545 29.367374      1 17.367374   TRUE    FALSE

where sub.id = subject id, vist = visit number at time meas.time, marker = marker value, time = event or censoring time, status = event status (1 = event observed, 0 otherwise), t.star = time since measurement, i.e. time - meas.time, in.ncc = index of whether the subject was selected into the NCC study, ix.first = index of the first visits.

Estimate risk and evaluation measures for NCC

To predict risk and evaluation measures based on the NCC example dataset, run the following:

tau <- 6
dc.list      <- NULL
results.list <- NULL
results.mx   <- NULL

for(si.i in seq(6, 36, 6)){
    # P = pert = 500 # ncc =T/F NCC/CCH analysis ie. ncc = T = NCC, ncc = F = CCH. 
    dc <- set.data.control(si = si.i, ti = si.i + tau, P = 500, seed = 1, ncc = T) 
    results.list <- c(results.list, main(dc, d.full = d.ncc.cohort))
    dc.list <- c(dc.list, list(dc))
    # save(list = ls(), file = 'results-ncc.rdata')
}

for(i in 1:length(results.list)){
    res.out <- with(results.list[[i]], cbind(sim.stats, sim.stats.se, n.case, n.ctrl, n.cens))
    dc.out  <-  with(dc.list[[i]], c(si, ti, pred.time, P))
    res.dc.out <- cbind(matrix(rep(dc.out, each = dc$n.stats), nrow = dc$n.stats, byrow = F), round(res.out, 4))
    results.mx <- rbind(results.mx, res.dc.out)
}
colnames(results.mx) <- c('s', 't', 'tau_0', 'Pert', 'Est', 'SE', 'n.case', 'n.ctrl', 'n.cens')

save(list = ls(), file = paste('results-ncc-table-p', dc$P, '-tau', tau, '.rdata', sep = ''))

NCC results

Results of the nested case-control (NCC) analysis for conditioning time (s) = 6 and prediction time (t = s + tau) = 12. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
	s	t	tau_0	Pert	Est	SE	n.case	n.ctrl	n.cens
PE	6	12	6	500	0.035	0.0046	53	512	8
TPF(0.4)	6	12	6	500	0.13	0.053	53	512	8
FPF(0.3)	6	12	6	500	0.01	0.0049	53	512	8
AUC	6	12	6	500	0.79	0.034	53	512	8
PCF(0.2)	6	12	6	500	0.55	0.069	53	512	8
PNF(0.8)	6	12	6	500	0.44	0.07	53	512	8

Results of the nested case-control (NCC) analysis for conditioning time (s) = 36 and prediction time (t = s + tau) = 42. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
	s	t	tau_0	Pert	Est	SE	n.case	n.ctrl	n.cens
PE	36	42	6	500	0.11	0.017	19	104	22
TPF(0.4)	36	42	6	500	0.37	0.11	19	104	22
FPF(0.3)	36	42	6	500	0.2	0.041	19	104	22
AUC	36	42	6	500	0.72	0.059	19	104	22
PCF(0.2)	36	42	6	500	0.42	0.11	19	104	22
PNF(0.8)	36	42	6	500	0.53	0.071	19	104	22

Case-cohort analysis

Load the example dataset for CCH

To load the data file:

source_data("https://github.com/marlenamaziarz/longitudinal-two-phase/blob/master/dataset-cch.rdata?raw=True")

The dataset has the following columns:

##    sub.id visit meas.time   marker      time status     t.star in.cch ix.first
## 1       1     1         0 4.047375  8.020719      0  8.0207186  FALSE     TRUE
## 2       1     2         6 2.278364  8.020719      0  2.0207186  FALSE    FALSE
## 11      2     1         0 4.367408  8.070592      0  8.0705921  FALSE     TRUE
## 12      2     2         6 2.036170  8.070592      0  2.0705921  FALSE    FALSE
## 21      3     1         0 4.932912 18.897848      0 18.8978483  FALSE     TRUE
## 22      3     2         6 2.996144 18.897848      0 12.8978483  FALSE    FALSE
## 23      3     3        12 2.202109 18.897848      0  6.8978483  FALSE    FALSE
## 24      3     4        18 1.847787 18.897848      0  0.8978483  FALSE    FALSE
## 31      4     1         0 2.028736  2.094211      0  2.0942106   TRUE     TRUE
## 41      5     1         0 2.755369  9.442380      0  9.4423798  FALSE     TRUE

where sub.id = subject id, vist = visit number at time meas.time, marker = marker value, time = event or censoring time, status = event status (1 = event observed, 0 otherwise), t.star = time since measurement, i.e. time - meas.time, in.ncc = index of whether the subject was selected into the CCH study, ix.first = index of the first visits.

Estimate risk and evaluation measures for CCH

To predict risk and evaluation measures based on the CCH example dataset, run the following:

tau <- 6
dc.list      <- NULL
results.list <- NULL
results.mx   <- NULL

for(si.i in seq(6, 36, 6)){
    # P = pert = 500 # ncc =T/F NCC/CCH analysis ie. ncc = T = NCC, ncc = F = CCH. 
    dc <- set.data.control(si = si.i, ti = si.i + tau, P = 500, seed = 1, ncc = F) 
    results.list <- c(results.list, main(dc, d.full = d.cch.cohort))
    dc.list <- c(dc.list, list(dc))
    # save(list = ls(), file = 'results-cch.rdata')
}

for(i in 1:length(results.list)){
    res.out <- with(results.list[[i]], cbind(sim.stats, sim.stats.se, n.case, n.ctrl, n.cens))
    dc.out  <-  with(dc.list[[i]], c(si, ti, pred.time, P))
    res.dc.out <- cbind(matrix(rep(dc.out, each = dc$n.stats), nrow = dc$n.stats, byrow = F), round(res.out, 4))
    results.mx <- rbind(results.mx, res.dc.out)
}
colnames(results.mx) <- c('s', 't', 'tau_0', 'Pert', 'Est', 'SE', 'n.case', 'n.ctrl', 'n.cens')

save(list = ls(), file = paste('results-cch-table-p', dc$P, '-tau', tau, '.rdata', sep = ''))

Results for the CCH analysis

Results of the case-cohort (CCH) analysis for conditioning time (s) = 6 and prediction time (t = s + tau) = 12. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
	s	t	tau_0	Pert	Est	SE	n.case	n.ctrl	n.cens
PE	6	12	6	500	0.041	0.0045	74	639	124
TPF(0.4)	6	12	6	500	0.052	0.038	74	639	124
FPF(0.3)	6	12	6	500	0.0078	0.0041	74	639	124
AUC	6	12	6	500	0.81	0.025	74	639	124
PCF(0.2)	6	12	6	500	0.61	0.059	74	639	124
PNF(0.8)	6	12	6	500	0.41	0.048	74	639	124

Results of the case-cohort (CCH) analysis for conditioning time (s) = 36 and prediction time (t = s + tau) = 42. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
	s	t	tau_0	Pert	Est	SE	n.case	n.ctrl	n.cens
PE	36	42	6	500	0.098	0.017	26	84	18
TPF(0.4)	36	42	6	500	0.22	0.1	26	84	18
FPF(0.3)	36	42	6	500	0.09	0.043	26	84	18
AUC	36	42	6	500	0.7	0.067	26	84	18
PCF(0.2)	36	42	6	500	0.53	0.11	26	84	18
PNF(0.8)	36	42	6	500	0.69	0.16	26	84	18

Complete results

Complete results for NCC analysis

Results of the nested case-control (NCC) analysis for different combinations of conditioning time (s) and prediction time (t = s + tau) (s from 6 to 36 in increments of 6, and tau = 6). Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s.
	s	t	tau_0	Pert	Est	SE	n.case	n.ctrl	n.cens
PE	6	12	6	500	0.035	0.0046	53	512	8
TPF(0.4)	6	12	6	500	0.13	0.053	53	512	8
FPF(0.3)	6	12	6	500	0.01	0.0049	53	512	8
AUC	6	12	6	500	0.79	0.034	53	512	8
PCF(0.2)	6	12	6	500	0.55	0.069	53	512	8
PNF(0.8)	6	12	6	500	0.44	0.07	53	512	8
PE	12	18	6	500	0.063	0.0061	74	423	15
TPF(0.4)	12	18	6	500	0.24	0.061	74	423	15
FPF(0.3)	12	18	6	500	0.037	0.01	74	423	15
AUC	12	18	6	500	0.83	0.024	74	423	15
PCF(0.2)	12	18	6	500	0.61	0.058	74	423	15
PNF(0.8)	12	18	6	500	0.35	0.057	74	423	15
PE	18	24	6	500	0.1	0.0091	89	310	24
TPF(0.4)	18	24	6	500	0.24	0.046	89	310	24
FPF(0.3)	18	24	6	500	0.088	0.015	89	310	24
AUC	18	24	6	500	0.8	0.026	89	310	24
PCF(0.2)	18	24	6	500	0.56	0.051	89	310	24
PNF(0.8)	18	24	6	500	0.4	0.063	89	310	24
PE	24	30	6	500	0.11	0.012	57	226	27
TPF(0.4)	24	30	6	500	0.22	0.064	57	226	27
FPF(0.3)	24	30	6	500	0.12	0.024	57	226	27
AUC	24	30	6	500	0.78	0.031	57	226	27
PCF(0.2)	24	30	6	500	0.47	0.07	57	226	27
PNF(0.8)	24	30	6	500	0.46	0.052	57	226	27
PE	30	36	6	500	0.12	0.014	50	145	31
TPF(0.4)	30	36	6	500	0.31	0.069	50	145	31
FPF(0.3)	30	36	6	500	0.12	0.029	50	145	31
AUC	30	36	6	500	0.81	0.031	50	145	31
PCF(0.2)	30	36	6	500	0.47	0.061	50	145	31
PNF(0.8)	30	36	6	500	0.41	0.053	50	145	31
PE	36	42	6	500	0.11	0.017	19	104	22
TPF(0.4)	36	42	6	500	0.37	0.11	19	104	22
FPF(0.3)	36	42	6	500	0.2	0.041	19	104	22
AUC	36	42	6	500	0.72	0.059	19	104	22
PCF(0.2)	36	42	6	500	0.42	0.11	19	104	22
PNF(0.8)	36	42	6	500	0.53	0.071	19	104	22

Complete results for CCH analysis

Results of the case-cohort (CCH) analysis for different combinations of conditioning time (s) and prediction time (t = s + tau) (s from 6 to 36 in increments of 6, and tau = 6). Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s.
	s	t	tau_0	Pert	Est	SE	n.case	n.ctrl	n.cens
PE	6	12	6	500	0.041	0.0045	74	639	124
TPF(0.4)	6	12	6	500	0.052	0.038	74	639	124
FPF(0.3)	6	12	6	500	0.0078	0.0041	74	639	124
AUC	6	12	6	500	0.81	0.025	74	639	124
PCF(0.2)	6	12	6	500	0.61	0.059	74	639	124
PNF(0.8)	6	12	6	500	0.41	0.048	74	639	124
PE	12	18	6	500	0.06	0.0059	89	475	75
TPF(0.4)	12	18	6	500	0.14	0.06	89	475	75
FPF(0.3)	12	18	6	500	0.021	0.0069	89	475	75
AUC	12	18	6	500	0.83	0.024	89	475	75
PCF(0.2)	12	18	6	500	0.65	0.058	89	475	75
PNF(0.8)	12	18	6	500	0.36	0.071	89	475	75
PE	18	24	6	500	0.095	0.0082	104	326	45
TPF(0.4)	18	24	6	500	0.2	0.059	104	326	45
FPF(0.3)	18	24	6	500	0.054	0.014	104	326	45
AUC	18	24	6	500	0.77	0.027	104	326	45
PCF(0.2)	18	24	6	500	0.45	0.048	104	326	45
PNF(0.8)	18	24	6	500	0.42	0.061	104	326	45
PE	24	30	6	500	0.1	0.011	85	208	33
TPF(0.4)	24	30	6	500	0.28	0.072	85	208	33
FPF(0.3)	24	30	6	500	0.09	0.022	85	208	33
AUC	24	30	6	500	0.83	0.027	85	208	33
PCF(0.2)	24	30	6	500	0.58	0.06	85	208	33
PNF(0.8)	24	30	6	500	0.33	0.071	85	208	33
PE	30	36	6	500	0.11	0.014	54	128	26
TPF(0.4)	30	36	6	500	0.21	0.074	54	128	26
FPF(0.3)	30	36	6	500	0.095	0.029	54	128	26
AUC	30	36	6	500	0.76	0.042	54	128	26
PCF(0.2)	30	36	6	500	0.53	0.073	54	128	26
PNF(0.8)	30	36	6	500	0.56	0.07	54	128	26
PE	36	42	6	500	0.098	0.017	26	84	18
TPF(0.4)	36	42	6	500	0.22	0.1	26	84	18
FPF(0.3)	36	42	6	500	0.09	0.043	26	84	18
AUC	36	42	6	500	0.7	0.067	26	84	18
PCF(0.2)	36	42	6	500	0.53	0.11	26	84	18
PNF(0.8)	36	42	6	500	0.69	0.16	26	84	18

Risk prediction using longitudinal data in two-phase studies

February 26, 2018