Introduction

In this tutorial we show examples on how to obtain risk predictions based on a longitudinal marker in a case-cohort (CCH) or a nested case-control (NCC) sample, using methods described in ``Evaluating longitudinal markers under two-phase study designs’’ by Marlena Maziarz, Tianxi Ci, Li Qi, Anna Lok and Yingye Zheng. The examples below illustrate the use of our methods to predict risk using PC\(_{GLM}\) based on a longitudinal biomarker with a survival outcome, as well as several evaluation measures of prediction and their standard error. The evaluation measures we considered are the prediction error (PE), true and false positive fractions (TPF and FPF), the area under the Receiver Operating Characteristics curve (AUC), the proportion of cases followed (PCF) and the proportion of cases needed to be followed (PNF). All code for predicting and evaluating risk, example datasets and examples on how to run the analysis are also available on github.com/marlenamaziarz/longitudinal-two-phase (right-click and open in new tab/window).

Before we get started

Load the necessary libraries

require(survival)
require(gam)
require(stats)
require(splines)
require(RCurl)   # for 'getURL'
require(repmis)  # for 'source_data'

Load the functions (one file for NCC and CCH)

The source code on GitHub contains all the functions needed for risk estimation and estimation of the measures to evaluate risk predictions, including standard errors estimated using perturbation. To load the source code:

script <- getURL("https://raw.githubusercontent.com/marlenamaziarz/longitudinal-two-phase/master/code-helper-functions.r", ssl.verifypeer = FALSE)

eval(parse(text = script))

Nested case-control analysis

Load the example dataset for NCC

The example dataset for NCC can be loaded using:

source_data("https://github.com/marlenamaziarz/longitudinal-two-phase/blob/master/dataset-ncc.rdata?raw=True")

The dataset has the following columns:

##    sub.id visit meas.time    marker      time status    t.star in.ncc ix.first
## 1       1     1         0 4.0173812  3.116595      0  3.116595  FALSE     TRUE
## 11      2     1         0 2.8730772  8.025501      0  8.025501  FALSE     TRUE
## 12      2     2         6 1.4007894  8.025501      0  2.025501  FALSE    FALSE
## 21      3     1         0 5.3736497 20.844036      0 20.844036  FALSE     TRUE
## 22      3     2         6 2.9753239 20.844036      0 14.844036  FALSE    FALSE
## 23      3     3        12 2.2411058 20.844036      0  8.844036  FALSE    FALSE
## 24      3     4        18 1.6639688 20.844036      0  2.844036  FALSE    FALSE
## 31      4     1         0 2.7390869 29.367374      1 29.367374   TRUE     TRUE
## 32      4     2         6 0.8567458 29.367374      1 23.367374   TRUE    FALSE
## 33      4     3        12 0.3968545 29.367374      1 17.367374   TRUE    FALSE

where sub.id = subject id, vist = visit number at time meas.time, marker = marker value, time = event or censoring time, status = event status (1 = event observed, 0 otherwise), t.star = time since measurement, i.e. time - meas.time, in.ncc = index of whether the subject was selected into the NCC study, ix.first = index of the first visits.

Estimate risk and evaluation measures for NCC

To predict risk and evaluation measures based on the NCC example dataset, run the following:

tau <- 6
dc.list      <- NULL
results.list <- NULL
results.mx   <- NULL

for(si.i in seq(6, 36, 6)){
    # P = pert = 500 # ncc =T/F NCC/CCH analysis ie. ncc = T = NCC, ncc = F = CCH. 
    dc <- set.data.control(si = si.i, ti = si.i + tau, P = 500, seed = 1, ncc = T) 
    results.list <- c(results.list, main(dc, d.full = d.ncc.cohort))
    dc.list <- c(dc.list, list(dc))
    # save(list = ls(), file = 'results-ncc.rdata')
}

for(i in 1:length(results.list)){
    res.out <- with(results.list[[i]], cbind(sim.stats, sim.stats.se, n.case, n.ctrl, n.cens))
    dc.out  <-  with(dc.list[[i]], c(si, ti, pred.time, P))
    res.dc.out <- cbind(matrix(rep(dc.out, each = dc$n.stats), nrow = dc$n.stats, byrow = F), round(res.out, 4))
    results.mx <- rbind(results.mx, res.dc.out)
}
colnames(results.mx) <- c('s', 't', 'tau_0', 'Pert', 'Est', 'SE', 'n.case', 'n.ctrl', 'n.cens')

save(list = ls(), file = paste('results-ncc-table-p', dc$P, '-tau', tau, '.rdata', sep = ''))

NCC results

Results of the nested case-control (NCC) analysis for conditioning time (s) = 6 and prediction time (t = s + tau) = 12. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
  s t tau_0 Pert Est SE n.case n.ctrl n.cens
PE 6 12 6 500 0.035 0.0046 53 512 8
TPF(0.4) 6 12 6 500 0.13 0.053 53 512 8
FPF(0.3) 6 12 6 500 0.01 0.0049 53 512 8
AUC 6 12 6 500 0.79 0.034 53 512 8
PCF(0.2) 6 12 6 500 0.55 0.069 53 512 8
PNF(0.8) 6 12 6 500 0.44 0.07 53 512 8
Results of the nested case-control (NCC) analysis for conditioning time (s) = 36 and prediction time (t = s + tau) = 42. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
  s t tau_0 Pert Est SE n.case n.ctrl n.cens
PE 36 42 6 500 0.11 0.017 19 104 22
TPF(0.4) 36 42 6 500 0.37 0.11 19 104 22
FPF(0.3) 36 42 6 500 0.2 0.041 19 104 22
AUC 36 42 6 500 0.72 0.059 19 104 22
PCF(0.2) 36 42 6 500 0.42 0.11 19 104 22
PNF(0.8) 36 42 6 500 0.53 0.071 19 104 22

Case-cohort analysis

Load the example dataset for CCH

To load the data file:

source_data("https://github.com/marlenamaziarz/longitudinal-two-phase/blob/master/dataset-cch.rdata?raw=True")

The dataset has the following columns:

##    sub.id visit meas.time   marker      time status     t.star in.cch ix.first
## 1       1     1         0 4.047375  8.020719      0  8.0207186  FALSE     TRUE
## 2       1     2         6 2.278364  8.020719      0  2.0207186  FALSE    FALSE
## 11      2     1         0 4.367408  8.070592      0  8.0705921  FALSE     TRUE
## 12      2     2         6 2.036170  8.070592      0  2.0705921  FALSE    FALSE
## 21      3     1         0 4.932912 18.897848      0 18.8978483  FALSE     TRUE
## 22      3     2         6 2.996144 18.897848      0 12.8978483  FALSE    FALSE
## 23      3     3        12 2.202109 18.897848      0  6.8978483  FALSE    FALSE
## 24      3     4        18 1.847787 18.897848      0  0.8978483  FALSE    FALSE
## 31      4     1         0 2.028736  2.094211      0  2.0942106   TRUE     TRUE
## 41      5     1         0 2.755369  9.442380      0  9.4423798  FALSE     TRUE

where sub.id = subject id, vist = visit number at time meas.time, marker = marker value, time = event or censoring time, status = event status (1 = event observed, 0 otherwise), t.star = time since measurement, i.e. time - meas.time, in.ncc = index of whether the subject was selected into the CCH study, ix.first = index of the first visits.

Estimate risk and evaluation measures for CCH

To predict risk and evaluation measures based on the CCH example dataset, run the following:

tau <- 6
dc.list      <- NULL
results.list <- NULL
results.mx   <- NULL

for(si.i in seq(6, 36, 6)){
    # P = pert = 500 # ncc =T/F NCC/CCH analysis ie. ncc = T = NCC, ncc = F = CCH. 
    dc <- set.data.control(si = si.i, ti = si.i + tau, P = 500, seed = 1, ncc = F) 
    results.list <- c(results.list, main(dc, d.full = d.cch.cohort))
    dc.list <- c(dc.list, list(dc))
    # save(list = ls(), file = 'results-cch.rdata')
}

for(i in 1:length(results.list)){
    res.out <- with(results.list[[i]], cbind(sim.stats, sim.stats.se, n.case, n.ctrl, n.cens))
    dc.out  <-  with(dc.list[[i]], c(si, ti, pred.time, P))
    res.dc.out <- cbind(matrix(rep(dc.out, each = dc$n.stats), nrow = dc$n.stats, byrow = F), round(res.out, 4))
    results.mx <- rbind(results.mx, res.dc.out)
}
colnames(results.mx) <- c('s', 't', 'tau_0', 'Pert', 'Est', 'SE', 'n.case', 'n.ctrl', 'n.cens')

save(list = ls(), file = paste('results-cch-table-p', dc$P, '-tau', tau, '.rdata', sep = ''))

Results for the CCH analysis

Results of the case-cohort (CCH) analysis for conditioning time (s) = 6 and prediction time (t = s + tau) = 12. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
  s t tau_0 Pert Est SE n.case n.ctrl n.cens
PE 6 12 6 500 0.041 0.0045 74 639 124
TPF(0.4) 6 12 6 500 0.052 0.038 74 639 124
FPF(0.3) 6 12 6 500 0.0078 0.0041 74 639 124
AUC 6 12 6 500 0.81 0.025 74 639 124
PCF(0.2) 6 12 6 500 0.61 0.059 74 639 124
PNF(0.8) 6 12 6 500 0.41 0.048 74 639 124
Results of the case-cohort (CCH) analysis for conditioning time (s) = 36 and prediction time (t = s + tau) = 42. Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s. Complete results for s from 6 to 36 in increments of 6, and tau = 6 are below.
  s t tau_0 Pert Est SE n.case n.ctrl n.cens
PE 36 42 6 500 0.098 0.017 26 84 18
TPF(0.4) 36 42 6 500 0.22 0.1 26 84 18
FPF(0.3) 36 42 6 500 0.09 0.043 26 84 18
AUC 36 42 6 500 0.7 0.067 26 84 18
PCF(0.2) 36 42 6 500 0.53 0.11 26 84 18
PNF(0.8) 36 42 6 500 0.69 0.16 26 84 18

Complete results

Complete results for NCC analysis

Results of the nested case-control (NCC) analysis for different combinations of conditioning time (s) and prediction time (t = s + tau) (s from 6 to 36 in increments of 6, and tau = 6). Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s.
  s t tau_0 Pert Est SE n.case n.ctrl n.cens
PE 6 12 6 500 0.035 0.0046 53 512 8
TPF(0.4) 6 12 6 500 0.13 0.053 53 512 8
FPF(0.3) 6 12 6 500 0.01 0.0049 53 512 8
AUC 6 12 6 500 0.79 0.034 53 512 8
PCF(0.2) 6 12 6 500 0.55 0.069 53 512 8
PNF(0.8) 6 12 6 500 0.44 0.07 53 512 8
PE 12 18 6 500 0.063 0.0061 74 423 15
TPF(0.4) 12 18 6 500 0.24 0.061 74 423 15
FPF(0.3) 12 18 6 500 0.037 0.01 74 423 15
AUC 12 18 6 500 0.83 0.024 74 423 15
PCF(0.2) 12 18 6 500 0.61 0.058 74 423 15
PNF(0.8) 12 18 6 500 0.35 0.057 74 423 15
PE 18 24 6 500 0.1 0.0091 89 310 24
TPF(0.4) 18 24 6 500 0.24 0.046 89 310 24
FPF(0.3) 18 24 6 500 0.088 0.015 89 310 24
AUC 18 24 6 500 0.8 0.026 89 310 24
PCF(0.2) 18 24 6 500 0.56 0.051 89 310 24
PNF(0.8) 18 24 6 500 0.4 0.063 89 310 24
PE 24 30 6 500 0.11 0.012 57 226 27
TPF(0.4) 24 30 6 500 0.22 0.064 57 226 27
FPF(0.3) 24 30 6 500 0.12 0.024 57 226 27
AUC 24 30 6 500 0.78 0.031 57 226 27
PCF(0.2) 24 30 6 500 0.47 0.07 57 226 27
PNF(0.8) 24 30 6 500 0.46 0.052 57 226 27
PE 30 36 6 500 0.12 0.014 50 145 31
TPF(0.4) 30 36 6 500 0.31 0.069 50 145 31
FPF(0.3) 30 36 6 500 0.12 0.029 50 145 31
AUC 30 36 6 500 0.81 0.031 50 145 31
PCF(0.2) 30 36 6 500 0.47 0.061 50 145 31
PNF(0.8) 30 36 6 500 0.41 0.053 50 145 31
PE 36 42 6 500 0.11 0.017 19 104 22
TPF(0.4) 36 42 6 500 0.37 0.11 19 104 22
FPF(0.3) 36 42 6 500 0.2 0.041 19 104 22
AUC 36 42 6 500 0.72 0.059 19 104 22
PCF(0.2) 36 42 6 500 0.42 0.11 19 104 22
PNF(0.8) 36 42 6 500 0.53 0.071 19 104 22

Complete results for CCH analysis

Results of the case-cohort (CCH) analysis for different combinations of conditioning time (s) and prediction time (t = s + tau) (s from 6 to 36 in increments of 6, and tau = 6). Prediction timeframe = tau, Pert = number of perturbations used to estimate the standard error, Est = estimate, SE = standard error, n.case = number of cases in the sample at time s, n.ctrl = number of controls at s, and n.cens = number of subjects censored up to time s.
  s t tau_0 Pert Est SE n.case n.ctrl n.cens
PE 6 12 6 500 0.041 0.0045 74 639 124
TPF(0.4) 6 12 6 500 0.052 0.038 74 639 124
FPF(0.3) 6 12 6 500 0.0078 0.0041 74 639 124
AUC 6 12 6 500 0.81 0.025 74 639 124
PCF(0.2) 6 12 6 500 0.61 0.059 74 639 124
PNF(0.8) 6 12 6 500 0.41 0.048 74 639 124
PE 12 18 6 500 0.06 0.0059 89 475 75
TPF(0.4) 12 18 6 500 0.14 0.06 89 475 75
FPF(0.3) 12 18 6 500 0.021 0.0069 89 475 75
AUC 12 18 6 500 0.83 0.024 89 475 75
PCF(0.2) 12 18 6 500 0.65 0.058 89 475 75
PNF(0.8) 12 18 6 500 0.36 0.071 89 475 75
PE 18 24 6 500 0.095 0.0082 104 326 45
TPF(0.4) 18 24 6 500 0.2 0.059 104 326 45
FPF(0.3) 18 24 6 500 0.054 0.014 104 326 45
AUC 18 24 6 500 0.77 0.027 104 326 45
PCF(0.2) 18 24 6 500 0.45 0.048 104 326 45
PNF(0.8) 18 24 6 500 0.42 0.061 104 326 45
PE 24 30 6 500 0.1 0.011 85 208 33
TPF(0.4) 24 30 6 500 0.28 0.072 85 208 33
FPF(0.3) 24 30 6 500 0.09 0.022 85 208 33
AUC 24 30 6 500 0.83 0.027 85 208 33
PCF(0.2) 24 30 6 500 0.58 0.06 85 208 33
PNF(0.8) 24 30 6 500 0.33 0.071 85 208 33
PE 30 36 6 500 0.11 0.014 54 128 26
TPF(0.4) 30 36 6 500 0.21 0.074 54 128 26
FPF(0.3) 30 36 6 500 0.095 0.029 54 128 26
AUC 30 36 6 500 0.76 0.042 54 128 26
PCF(0.2) 30 36 6 500 0.53 0.073 54 128 26
PNF(0.8) 30 36 6 500 0.56 0.07 54 128 26
PE 36 42 6 500 0.098 0.017 26 84 18
TPF(0.4) 36 42 6 500 0.22 0.1 26 84 18
FPF(0.3) 36 42 6 500 0.09 0.043 26 84 18
AUC 36 42 6 500 0.7 0.067 26 84 18
PCF(0.2) 36 42 6 500 0.53 0.11 26 84 18
PNF(0.8) 36 42 6 500 0.69 0.16 26 84 18