Introduction

This tutorial provides examples of how to obtain parameter estimates and their standard errors using data from case-control studies with incident and prevalent cases. The detailed methods are described in "Inference for case-control studies with incident and prevalent cases" by Marlena Maziarz\(^*\), Yukun Liu\(^*\), Jing Qin and Ruth Pfeiffer. The manuscript is available at http://arxiv.org/abs/1803.06365 (you may need to right-click the link and open it in a new tab/window). All the code and example datasets are available at https://github.com/NCI-biostats/prevalent-case-control/.

(\(\,^*\)equal contributions).

glm.prev.cc() function specification

Input

data = a data frame with at least the following columns: backward.time, case.status (0 = controls, 1 = incident cases, 2 = prevalent cases), and the covariates listed in logistic.vars and survival.vars (see below)

logistic.vars = a vector of covariate names (as they appear in data) to be included in the logistic submodel (can be the same as survival.vars)

survival.vars = a vector of covariate names (as they appear in data) to be included in the survival submodel (can be the same as logistic.vars)

const.bh = FALSE (default) to assume a Weibull baseline hazard (2 parameters: \(\kappa_1\) = shape, \(\kappa_2\) = scale), or TRUE to assume a constant baseline hazard (one parameter). The Weibull baseline hazard is parametrized as: \[\frac{\kappa_1}{\kappa_2}\left(\frac{x}{\kappa_2}\right)^{(\kappa_1 - 1)}\qquad x\ge 0\] and 0 if \(x < 0\). If a constant baseline hazard is assumed, \(\kappa_1\) = 1.
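
As a quick check of this parametrization (an illustrative sketch, not part of the package code), the hazard above agrees with the Weibull hazard implied by R's dweibull() and pweibull() with shape = \(\kappa_1\) and scale = \(\kappa_2\):

bh <- function(x, k1, k2) ifelse(x >= 0, (k1 / k2) * (x / k2)^(k1 - 1), 0)  # baseline hazard as written above
x <- c(0.5, 1, 2, 5)
# hazard = density / survival; this returns TRUE, confirming the shape/scale parametrization
all.equal(bh(x, k1 = 1.5, k2 = 2),
          dweibull(x, shape = 1.5, scale = 2) / pweibull(x, shape = 1.5, scale = 2, lower.tail = FALSE))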

se.type = type of standard error: NULL (do not report standard errors), ‘asy’ for the analytic standard error estimate, ‘jack’ for the jackknife, or ‘boot’ for the bootstrap. ‘asy’ is very fast, but currently requires the logistic and survival covariates to be the same. ‘jack’ and ‘boot’ allow any combination of covariates, possibly overlapping or identical, in the logistic and survival submodels, but are slow, since the optimx() function is called at each iteration and each call takes roughly 3.5 seconds to converge. The jackknife needs one fit per observation and the bootstrap one fit per resample, so depending on the sample size one or the other may be faster; we recommend at least 500 bootstrap resamples.

param.vec.fitting.start.values = vector of starting values for the optimizer (currently the optimx() function from the R package of the same name). The length of this vector should be 2 intercepts + number of covariates in the logistic submodel + number of baseline hazard parameters (2 for a Weibull baseline hazard, 1 for a constant one) + number of covariates in the survival submodel. If NULL, the following start values are used: c(-2, -4, logistic-model estimates fit to the controls and incident cases [i.e. glm(…, family = binomial())], 1, rep(-0.1, length(survival.vars))) for a constant baseline hazard, and c(-2, -4, logistic-model estimates fit to the controls and incident cases, 1, 1, rep(-0.1, length(survival.vars))) for a Weibull baseline hazard. A sketch of constructing and passing a custom start vector is given at the end of the Weibull example below.

optim.method = default is ‘L-BFGS-B’ and is highly recommended. See the optimx() documentation for details on other options.

xi.a = maximum backward time; if NULL (default), it is set to 1.01 times the maximum backward time in the dataset.

print.ok = currently not used, but allows for adding functionality to show intermediate computational steps.

boot = number of bootstrap resamples, used only if se.type == ‘boot’.

Output

Returns a list with the following core components (assuming optimx() converged; if it did not converge, or if it failed entirely, NULL is returned):

ests = estimates of the 2 intercepts, the logistic submodel parameters, the baseline hazard parameters (2 for a Weibull baseline hazard, 1 for a constant one) and the survival submodel parameters

loglik = log likelihood at the estimated values.

In addition, depending on se.type, the following components are returned:

  • if se.type == NULL: ses = a vector of NAs of the same length as ests

  • if se.type == ‘asy’: ses = standard errors of ests estimated analytically

  • if se.type == ‘jack’: ses = standard errors of ests estimated using the jackknife (this took about 1.5 hrs for the example below with n = 1500). This option also returns the number of successful iterations (i.succ) and the convergence rate (conv.prop, i.e. i.succ divided by the number of individuals in the dataset)

  • if se.type == ‘boot’: ses = standard errors of ests estimated using the bootstrap (this took about 0.5 hrs for the example below with boot = 500 bootstrap resamples). The number of iterations is set by boot (default 500). In case of convergence issues, up to \(3 \times boot\) iterations are attempted; if at least boot of them converge, the SEs are reported, otherwise ses = NA is returned. The function also returns the convergence rate (conv.prop) and a quantile-based 95% confidence interval (quant.025 and quant.975).

NOTE: optimx() is called within a try() statement to catch calls to optimx() that fail. The warning message 'In if (class(optimx.out) == "try-error") ... : the condition has length > 1 and only the first element will be used' means that all went well and optimx() did not fail, so please ignore it (or turn warnings off with options(warn = -1)).

glm.prev.cc <- function(data,
                        logistic.vars,
                        survival.vars,
                        se.type = NULL, # 'asy', 'jack', 'boot', or NULL (no SEs)
                        const.bh = F,   # F = Weibull, T = constant baseline hazard
                        param.vec.fitting.start.values = NULL,
                        optim.method = 'L-BFGS-B', 
                        xi.a = NULL,
                        print.ok = F, 
                        boot = 500){    # this is only used if se.type == 'boot'

Example

Download required files from GitHub

  1. code-helper-functions.r

  2. example-dataset-prev-cc.rdata

and save them in the current directory.

Load the necessary libraries and files

require(optimx)
require(numDeriv)
source('code-helper-functions.r')
load('example-dataset-prev-cc.rdata')

The code-helper-functions.r file contains all the functions needed to obtain the estimates and their standard errors (analytic, jackknife or bootstrap).

The example dataset comprises 500 individuals in each of three groups: controls (case.status = 0), incident cases (case.status = 1) and prevalent cases (case.status = 2), for a total of n = 1500. It contains two normally distributed covariates, \(x_1\) and \(x_2\), and the backward time for the prevalent cases ranges from 0 to 25. The details of the data generation are described in Section 4.1 of the manuscript; the parameter values used to generate the data were \(\beta_1\) = 1, \(\beta_2\) = -1, \(\kappa_1\) = 1, \(\kappa_2\) = 1, \(\zeta_1\) = 1, \(\zeta_2\) = -1.
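
The summaries printed below can be reproduced with commands along the following lines (a reconstruction based on the printed output; the exact calls used to prepare this document may differ):

head(d)
summary(d)
table(d$case.status)
by(d[, c('x1', 'x2')], d$case.status, summary)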

##   backward.time case.status          x1         x2
## 1             0           0  0.50387335 -0.5811765
## 2             0           0 -0.01060546  0.3074741
## 3             0           0  1.31529673 -0.1320545
## 4             0           0 -1.38720005  1.3759074
## 5             0           0 -0.78116262 -0.2104384
## 6             0           0 -0.08643726 -1.5075302
##  backward.time      case.status       x1                x2         
##  Min.   : 0.0000   Min.   :0    Min.   :-3.3603   Min.   :-4.7230  
##  1st Qu.: 0.0000   1st Qu.:0    1st Qu.:-0.3477   1st Qu.:-1.3105  
##  Median : 0.0000   Median :1    Median : 0.4583   Median :-0.4027  
##  Mean   : 0.6788   Mean   :1    Mean   : 0.5336   Mean   :-0.4867  
##  3rd Qu.: 0.1250   3rd Qu.:2    3rd Qu.: 1.3718   3rd Qu.: 0.3529  
##  Max.   :23.3490   Max.   :2    Max.   : 4.9703   Max.   : 3.2393
## 
##   0   1   2 
## 500 500 500
## d$case.status: 0
##        x1                 x2           
##  Min.   :-3.36033   Min.   :-3.091816  
##  1st Qu.:-0.70467   1st Qu.:-0.654964  
##  Median : 0.02810   Median :-0.004502  
##  Mean   : 0.00336   Mean   : 0.042581  
##  3rd Qu.: 0.68409   3rd Qu.: 0.745188  
##  Max.   : 2.87420   Max.   : 3.239267  
## --------------------------------------------------------------------------- 
## d$case.status: 1
##        x1                x2         
##  Min.   :-1.4716   Min.   :-4.7230  
##  1st Qu.: 0.8501   1st Qu.:-2.2266  
##  Median : 1.5692   Median :-1.5095  
##  Mean   : 1.5174   Mean   :-1.4878  
##  3rd Qu.: 2.1797   3rd Qu.:-0.7261  
##  Max.   : 4.9703   Max.   : 1.8323  
## --------------------------------------------------------------------------- 
## d$case.status: 2
##        x1                 x2          
##  Min.   :-2.64157   Min.   :-3.28654  
##  1st Qu.:-0.56884   1st Qu.:-0.63427  
##  Median : 0.04445   Median :-0.05287  
##  Mean   : 0.08009   Mean   :-0.01480  
##  3rd Qu.: 0.75808   3rd Qu.: 0.70126  
##  Max.   : 2.98163   Max.   : 2.51101

Calling glm.prev.cc() assuming a Weibull baseline hazard

To obtain just the estimates of the parameters, set se.type = NULL:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   se.type = NULL)  
m.p$loglik
## [1] -1683.614
m.p$ests
##                  [,1]
## alpha    -1.477538766
## nu        0.008289241
## x1        1.004081925
## x2       -0.984974540
## k1.shape  1.051275990
## k2.scale  1.047988832
## x1        1.040241928
## x2       -1.052848203
m.p$ses
## [1] NA NA NA NA NA NA NA NA

To obtain estimates of the parameters and asymptotic standard errors, set se.type = ‘asy’:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   se.type = 'asy')  
# m.p$ests # the estimates are as shown above
m.p$ses
## [1] 0.06988703 0.10240006 0.07452473 0.07419976 0.09847570 0.14211247 0.10592505 0.11016451
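
The function does not report confidence intervals for se.type = ‘asy’, but approximate 95% Wald intervals can be formed from the estimates and their standard errors (an illustrative sketch, assuming m.p$ests is the one-column matrix and m.p$ses the numeric vector shown above):

est <- as.vector(m.p$ests)
ci <- cbind(lower = est - 1.96 * m.p$ses,   # estimate - 1.96 * SE
            upper = est + 1.96 * m.p$ses)   # estimate + 1.96 * SE
rownames(ci) <- rownames(m.p$ests)
ci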

To obtain estimates of the parameters and their jackknife standard errors, set se.type = ‘jack’:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   se.type = 'jack')  
# m.p$ests # the estimates are as shown above
m.p$ses
##      alpha         nu         x1         x2   k1.shape   k2.scale         x1         x2 
## 0.09436695 0.11849341 0.07519103 0.07486024 0.10021019 0.14240255 0.10705600 0.11368188
m.p$i.succ
## [1] 1500
m.p$conv.prop
## [1] 1

To obtain estimates of the parameters and their bootstrap standard errors, set se.type = ‘boot’ (default 500 bootstrap resamples):

set.seed(1) # this is what we used in our example
m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   se.type = 'boot')  
# m.p$ests # the estimates are as shown above
m.p$ses
##      alpha         nu         x1         x2   k1.shape   k2.scale         x1         x2 
## 0.06707147 0.10252953 0.07386802 0.07357958 0.10487613 0.14383603 0.11465843 0.11510958
m.p$quant.025
##      alpha         nu         x1         x2   k1.shape   k2.scale         x1         x2 
## -1.5934654 -0.2000256  0.8659989 -1.1164357  0.8937768  0.8106060  0.8530295 -1.3170893
m.p$quant.975
##      alpha         nu         x1         x2   k1.shape   k2.scale         x1         x2 
## -1.3301590  0.1954363  1.1493730 -0.8448190  1.2958412  1.3591515  1.2834586 -0.8828718
m.p$conv.prop
## [1] 1
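
One convenient way to collect the bootstrap results into a single table (an illustrative sketch using the components returned above):

round(cbind(est   = as.vector(m.p$ests),   # point estimates
            se    = m.p$ses,               # bootstrap standard errors
            lower = m.p$quant.025,         # percentile 95% CI, lower bound
            upper = m.p$quant.975), 3)     # percentile 95% CI, upper bound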

To run more bootstrap resamples, set boot to the desired number:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   se.type = 'boot',
                   boot = 1000)  
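
As mentioned in the Input section, a custom start vector can also be supplied via param.vec.fitting.start.values. Below is a sketch for the Weibull case with logistic.vars = survival.vars = c('x1', 'x2'), where the vector has length 2 + 2 + 2 + 2 = 8; the glm() call mirrors the default described above and the numeric values are illustrative:

start.vals <- c(-2, -4,                                  # the two intercepts
                coef(glm(case.status ~ x1 + x2,
                         data = d[d$case.status != 2, ], # controls and incident cases only
                         family = binomial()))[-1],      # logistic covariate start values
                1, 1,                                    # Weibull shape and scale
                rep(-0.1, 2))                            # survival covariate start values
m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   se.type = NULL,
                   param.vec.fitting.start.values = start.vals)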

Calling glm.prev.cc() assuming a constant baseline hazard

To obtain just the estimates of the parameters, set se.type = NULL:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   const.bh = T,
                   se.type = NULL)  
m.p$loglik
## [1] -1683.752
m.p$ests
##                [,1]
## alpha    -1.4802653
## nu        0.0546042
## x1        1.0047956
## x2       -0.9870271
## k2.scale  0.9811142
## x1        0.9904572
## x2       -1.0043673
m.p$ses
## [1] NA NA NA NA NA NA NA

To obtain estimates of the parameters and asymptotic standard errors, set se.type = ‘asy’:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   const.bh = T,
                   se.type = 'asy')  
# m.p$ests # the estimates are as shown above
m.p$ses
## [1] 0.06977315 0.04312684 0.07452871 0.07417077 0.04473766 0.05122809 0.04933346

To obtain estimates of the parameters and their jackknife standard errors, set se.type = ‘jack’:

m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   const.bh = T,
                   se.type = 'jack')  
# m.p$ests # the estimates are as shown above
m.p$ses
##      alpha         nu         x1         x2   k2.scale         x1         x2 
## 0.09339912 0.07687840 0.07516035 0.07509655 0.04500898 0.05223866 0.05038417
m.p$i.succ
## [1] 1500
m.p$conv.prop
## [1] 1

To obtain estimates of the parameters and their bootstrap standard errors, set se.type = ‘boot’ (default 500 bootstrap resamples):

set.seed(1) # this is what we used in our example
m.p <- glm.prev.cc(data = d,
                   logistic.vars = c('x1', 'x2'),
                   survival.vars = c('x1', 'x2'),
                   const.bh = T,
                   se.type = 'boot')  
# m.p$ests # the estimates are as shown above
m.p$ses
##      alpha         nu         x1         x2   k2.scale         x1         x2 
## 0.06635043 0.04422028 0.07363598 0.07335649 0.04535919 0.05270274 0.04928991
m.p$quant.025
##       alpha          nu          x1          x2    k2.scale          x1          x2 
## -1.59403815 -0.02524435  0.86739853 -1.12225491  0.89508183  0.88618926 -1.10147038
m.p$quant.975
##      alpha         nu         x1         x2   k2.scale         x1         x2 
## -1.3360627  0.1459998  1.1487357 -0.8477167  1.0645702  1.0884124 -0.9032116
m.p$conv.prop
## [1] 1
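
Since a constant baseline hazard is the special case \(\kappa_1 = 1\) of the Weibull parametrization, the two fits can be compared informally with a likelihood ratio statistic on one degree of freedom, using the log likelihoods reported above (a rough sketch; it assumes the usual chi-squared asymptotics apply here, which we have not verified for this model):

loglik.weibull  <- -1683.614               # from the Weibull baseline hazard fit above
loglik.constant <- -1683.752               # from the constant baseline hazard fit above
lrt <- 2 * (loglik.weibull - loglik.constant)
pchisq(lrt, df = 1, lower.tail = FALSE)    # ~0.6, so the simpler constant hazard is not contradicted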