Relevant package installation

library(survival)
library(Epi)
library(popEpi)
## 
## Attaching package: 'popEpi'
## The following object is masked from 'package:survival':
## 
##     Surv
# popEpi::splitMulti returns a data.frame rather than a data.table
options("popEpi.datatable" = FALSE)
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## 
##   There is a binary version available but the source version is later:
##           binary source needs_compilation
## tidyverse  1.3.2  2.0.0             FALSE
## installing the source package 'tidyverse'
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()      masks stats::filter()
## ✖ lubridate::is.Date() masks popEpi::is.Date()
## ✖ dplyr::lag()         masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
clear()

Data

This exercise follows quite closely the section on competing risks in “Epidemiology with R”, pp. 207 and 210 ff. With the major exception that we will use the function ci.Crisk, which was not available in the Epi package when the book was written. We shall use the DMlate dataset which is a random sample of Danish diabetes patients, with dates of birth, diabetes, OAD start, insulin start and death, all diagnosed after 1996-1-1. We want to look at the event “start of insulin use”, which occurs at doins, while taking death as competing event into account. This means that we want to address the question of the probability of starting Ins, while taking death into account. Essentially estimating the probability of being in each of the states DM, Ins and Dead, where Ins means “started insulin and either alive or dead after this” and Dead means “dead without starting insulin”.

Load the DMlate data from the Epi package (and for ease of calculation restrict to a random sample of 2000 persons while developing):

data(DMlate)
#set.seed(1952)
#DMlate <- DMlate[sample(1:nrow(DMlate), 2000),]
str(DMlate)
## 'data.frame':    10000 obs. of  7 variables:
##  $ sex  : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
##  $ dobth: num  1940 1939 1918 1965 1933 ...
##  $ dodm : num  1999 2003 2005 2009 2009 ...
##  $ dodth: num  NA NA NA NA NA ...
##  $ dooad: num  NA 2007 NA NA NA ...
##  $ doins: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dox  : num  2010 2010 2010 2010 2010 ...
head(DMlate)
##        sex    dobth     dodm    dodth    dooad doins      dox
## 50185    F 1940.256 1998.917       NA       NA    NA 2009.997
## 307563   M 1939.218 2003.309       NA 2007.446    NA 2009.997
## 294104   F 1918.301 2004.552       NA       NA    NA 2009.997
## 336439   F 1965.225 2009.261       NA       NA    NA 2009.997
## 245651   M 1932.877 2008.653       NA       NA    NA 2009.997
## 216824   F 1927.870 2007.886 2009.923       NA    NA 2009.923
str(DMlate)
## 'data.frame':    10000 obs. of  7 variables:
##  $ sex  : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
##  $ dobth: num  1940 1939 1918 1965 1933 ...
##  $ dodm : num  1999 2003 2005 2009 2009 ...
##  $ dodth: num  NA NA NA NA NA ...
##  $ dooad: num  NA 2007 NA NA NA ...
##  $ doins: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dox  : num  2010 2010 2010 2010 2010 ...

It is always wise to get en overview of the dates represented in the baseline dataset—preferably at a monthly level:

par(mfrow=c(2,2))
hist(DMlate$dobth, breaks = seq(1898, 2010, 1   ), col = "black")
hist(DMlate$dodm , breaks = seq(1995, 2010, 1/12), col = "black")
    abline(v = 1995:2020, col = "red")
hist(DMlate$dodth, breaks = seq(1995, 2010, 1/12), col = "black")
    abline(v = 1995:2020, col = "red")
hist(DMlate$doins, breaks = seq(1995, 2010, 1/12), col = "black")
    abline(v = 1995:2020, col = "red")

From the figure we see that there is a tendency to fewer dates of diagnosis and insulin start in the summer (because of holidays) and more deaths in the winter, precisely as expected

First we define a Lexis object with the total follow up for each person:

Ldm <- Lexis(entry = list(per = dodm,
                          age = dodm - dobth,
                          tfd = 0),
              exit = list(per = dox),
       exit.status = factor(!is.na(dodth),
                            labels = c("DM", "Dead")),
              data = DMlate)
## NOTE: entry.status has been set to "DM" for all.
## NOTE: Dropping  4  rows with duration of follow up < tol
summary(Ldm)
##     
## Transitions:
##      To
## From   DM Dead  Records:  Events: Risk time:  Persons:
##   DM 7497 2499      9996     2499   54273.27      9996
summary(Ldm$lex.dur)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.002738  2.031485  4.796715  5.429499  8.249144 14.995209

Then subdivide the follow-up at the date of insulin, using dooad:

Ldm <- sortLexis(Ldm)
Cdm <- cutLexis(Ldm,
                cut = Ldm$doins,
          timescale = "per",
          new.state = "Ins")
summary(Cdm)
##      
## Transitions:
##      To
## From    DM  Ins Dead  Records:  Events: Risk time:  Persons:
##   DM  6157 1694 2048      9899     3742   45885.49      9899
##   Ins    0 1340  451      1791      451    8387.77      1791
##   Sum 6157 3034 2499     11690     4193   54273.27      9996

In this context we are not interested in what goes on after Ins so we only keep follow-up in state DM; we can use either filter or subset:

Adm <- filter(Cdm, lex.Cst == "DM")
Adm <- subset(Cdm, lex.Cst == "DM")
summary(Adm)
##     
## Transitions:
##      To
## From   DM  Ins Dead  Records:  Events: Risk time:  Persons:
##   DM 6157 1694 2048      9899     3742   45885.49      9899
par(mfrow=c(1,2))
boxes(Cdm, boxpos = TRUE, scale.R = 80, show.BE = TRUE, cex = 0.7)
boxes(Adm, boxpos = TRUE, scale.R = 80, show.BE = TRUE, cex = 0.7)

?Lexis
## starting httpd help server ... done

As shown in figure 2.2 we now have a traditional competing risks set-up, with some 7500 DM patients starting without insulin, and where the quantity of interest is the probability of starting drug treatment, and the Ins state here means “having been on insulin treatment, disregarding subsequent death”. The other event considered is Dead which here means “dead without initiating insulin treatment”.

State probabilities

We can compute the (correct) counterpart of the survival function for this competing risks setup. The survival function we saw in the previous exercise gives the probability of being alive, and the complement is the probability of being dead.

survfit can do the corresponding calculation for the three states in the figure; the requirements are: 1) the third argument to the Surv function is a factor and 2) an id argument is given, pointing to an id variable that links together records belonging to the same person. The latter is superfluous in this case because there is only one record for each person, but even so it is required by the function survfit. Also note that the initial state (DM) must be the first level of the factor lex.Xst:

levels(Adm$lex.Xst)
## [1] "DM"   "Ins"  "Dead"
m3 <- survfit(Surv(tfd,
                   tfd + lex.dur,
                   lex.Xst) ~ 1,
                id = lex.id,
              data = Adm)
names(m3)
##  [1] "n"           "time"        "n.risk"      "n.event"     "n.censor"   
##  [6] "pstate"      "p0"          "cumhaz"      "std.err"     "sp0"        
## [11] "logse"       "transitions" "lower"       "upper"       "conf.type"  
## [16] "conf.int"    "states"      "type"        "call"
m3$states
## [1] "(s0)" "Ins"  "Dead"
head(cbind(time = m3$time, m3$pstate), 20)
##              time                                    
##  [1,] 0.002737851 0.9988888 0.0003030609 0.0008081624
##  [2,] 0.005475702 0.9982825 0.0005051424 0.0012123254
##  [3,] 0.008213552 0.9972721 0.0011113869 0.0016164884
##  [4,] 0.010951403 0.9955543 0.0024250496 0.0020206923
##  [5,] 0.013689254 0.9939374 0.0038397633 0.0022227943
##  [6,] 0.016427105 0.9916133 0.0057597319 0.0026269982
##  [7,] 0.019164956 0.9883793 0.0087915703 0.0028291207
##  [8,] 0.021902806 0.9858525 0.0108130026 0.0033344788
##  [9,] 0.024640657 0.9820102 0.0140486211 0.0039411573
## [10,] 0.027378508 0.9797855 0.0161722144 0.0040422808
## [11,] 0.030116359 0.9774582 0.0182971236 0.0042446531
## [12,] 0.032854209 0.9752321 0.0202196605 0.0045482115
## [13,] 0.035592060 0.9734104 0.0215353535 0.0050542473
## [14,] 0.038329911 0.9704742 0.0242690835 0.0052567458
## [15,] 0.041067762 0.9686513 0.0256868690 0.0056618274
## [16,] 0.043805613 0.9670301 0.0269027493 0.0060671208
## [17,] 0.046543463 0.9653073 0.0280175399 0.0066751884
## [18,] 0.049281314 0.9632802 0.0297405789 0.0069792541
## [19,] 0.052019165 0.9615571 0.0308554865 0.0075873855
## [20,] 0.054757016 0.9608476 0.0313622627 0.0077900960

Because lex.Xst is a factor, survfit will compute the Aalen-Johansen estimator of being in a given state and place the probabilities in the matrix m3\(pstate; the times, these refer to are in the vector m3\)time. These are measured in years since diabetes, because tfd is in units of years.

Explore the object m3 and compare m3$transitions to summary(adm)

names(m3)
##  [1] "n"           "time"        "n.risk"      "n.event"     "n.censor"   
##  [6] "pstate"      "p0"          "cumhaz"      "std.err"     "sp0"        
## [11] "logse"       "transitions" "lower"       "upper"       "conf.type"  
## [16] "conf.int"    "states"      "type"        "call"
m3$states
## [1] "(s0)" "Ins"  "Dead"
head(cbind(time = m3$time, m3$pstate), 20)
##              time                                    
##  [1,] 0.002737851 0.9988888 0.0003030609 0.0008081624
##  [2,] 0.005475702 0.9982825 0.0005051424 0.0012123254
##  [3,] 0.008213552 0.9972721 0.0011113869 0.0016164884
##  [4,] 0.010951403 0.9955543 0.0024250496 0.0020206923
##  [5,] 0.013689254 0.9939374 0.0038397633 0.0022227943
##  [6,] 0.016427105 0.9916133 0.0057597319 0.0026269982
##  [7,] 0.019164956 0.9883793 0.0087915703 0.0028291207
##  [8,] 0.021902806 0.9858525 0.0108130026 0.0033344788
##  [9,] 0.024640657 0.9820102 0.0140486211 0.0039411573
## [10,] 0.027378508 0.9797855 0.0161722144 0.0040422808
## [11,] 0.030116359 0.9774582 0.0182971236 0.0042446531
## [12,] 0.032854209 0.9752321 0.0202196605 0.0045482115
## [13,] 0.035592060 0.9734104 0.0215353535 0.0050542473
## [14,] 0.038329911 0.9704742 0.0242690835 0.0052567458
## [15,] 0.041067762 0.9686513 0.0256868690 0.0056618274
## [16,] 0.043805613 0.9670301 0.0269027493 0.0060671208
## [17,] 0.046543463 0.9653073 0.0280175399 0.0066751884
## [18,] 0.049281314 0.9632802 0.0297405789 0.0069792541
## [19,] 0.052019165 0.9615571 0.0308554865 0.0075873855
## [20,] 0.054757016 0.9608476 0.0313622627 0.0077900960
m3$transitions
##       to
## from    Ins Dead (censored)
##   (s0) 1694 2048       6157
##   Ins     0    0          0
##   Dead    0    0          0
summary(Adm)
##     
## Transitions:
##      To
## From   DM  Ins Dead  Records:  Events: Risk time:  Persons:
##   DM 6157 1694 2048      9899     3742   45885.49      9899

The m3$pstate contains the Aalen-Johansen probabilities of being in the Alive, having left to the Ins, resp. Dead state.

Plot the three curves in the same graph (use for example matplot). Add the confidence limits.

par(mfrow=c(1, 2))
matplot(m3$time, m3$pstate,
        type = "s", lty = 1, lwd = 4,
         col = c("ForestGreen", "red", "black"),
         xlim = c(0, 15), xaxs = "i",
         ylim = c(0, 1), yaxs = "i" )
 stackedCIF(m3, lwd = 3, xlim = c(0, 15), xaxs = "i", yaxs = "i" )
 text(rep(12, 3), c(0.85, 0.45, 0.15), levels(Cdm))
 box()

These three curves have sum 1, so basically this is a way of distributing the probabilities across states at each time. It is therefore natural to stack the probabilities, which can be done by stackedCIF:

par(mfrow = c(1, 2))
matshade(m3$time, cbind(m3$pstate,
                        m3$lower,
                        m3$upper)[, c(1, 4, 7, 2, 5, 8, 3, 6, 9)],
         plot = TRUE, lty = 1, lwd = 2,
         col = clr <- c("ForestGreen","red","black"),
         xlim=c(0,15), xaxs="i",
         ylim = c(0,1), yaxs = "i")
mat2pol(m3$pstate, perm = 3:1, x = m3$time, col = clr[3:1])
text(rep(12, 3), c(0.8, 0.5, 0.2), levels(Cdm), col = "white")

The figure shpws separate state probabilities (left) and stacked state probabilities (right). In the left panel, Alive is green, Ins is red and Dead is black.

What do you get if you replace " ~ 1" with " ~ sex" in the call to Survfit?

#levels(Adm$lex.Xst)
#m3 <- survfit(Surv(tfd,
#                   tfd + lex.dur,
#                   lex.Xst) ~ sex,
#                id = lex.id,
#              data = Adm)
#names(m3)
#m3$states
#head(cbind(time = m3$time, m3$pstate), 20)

#par(mfrow = c(1, 2))
#matshade(m3$time, cbind(m3$pstate,
#                        m3$lower,
#                        m3$upper)[, c(1, 4, 7, 2, 5, 8, 3, 6, 9)],
#         plot = TRUE, lty = 1, lwd = 2,
#         col = clr <- c("ForestGreen","red","black"),
#         xlim=c(0,15), xaxs="i",
#         ylim = c(0,1), yaxs = "i")
#mat2pol(m3$pstate, perm = 3:1, x = m3$time, col = clr[3:1])
#text(rep(12, 3), c(0.8, 0.5, 0.2), levels(Cdm), col = "white")

What not to do

A very common error is to use a partial outcome such as Ins, when there is a competing type of event, in this case Dead. If that is ignored and a traditional survival analysis is made as if Ins were the only possible event, we will have a substantial overestimate of the cumulative probability of going on drug. Here is an illustration of this erroneous approach:

m2 <- survfit(Surv(tfd,
                   tfd + lex.dur,
                   lex.Xst == "Ins" ) ~ 1,
              data = Adm)
M2 <- survfit(Surv(tfd,
                   tfd + lex.dur,
                   lex.Xst == "Dead") ~ 1,
              data = Adm)
par(mfrow = c(1,2))
mat2pol(m3$pstate, c(2,3,1), x = m3$time,
        col = c("red", "black", "transparent"),
        xlim=c(0,15), xaxs="i",
        yaxs = "i", xlab = "time since DM", ylab = "" )
  lines(m2$time, 1 - m2$surv, lwd = 3, col = "red" )
mat2pol(m3$pstate, c(3,2,1), x = m3$time, yaxs = "i",
        col = c("black","red","transparent"),
        xlim=c(0,15), xaxs="i",
        yaxs = "i", xlab = "time since DM", ylab = "" )
  lines(M2$time, 1 - M2$surv, lwd = 3, col = "black" )

The figure above shows the stacked state probabilities Alive is white, Ins is red and Dead is black. The red line in the left panel is the wrong (but often computed) “cumulative risk” of Ins, and the black line in the right panel is the wrong (but often computed) “cumulative risk” of Death. The black and the red areas in the two plots represent the correctly computed probabilities; they have the same size in both panels, only they are stacked differently.

What these means is that studies that do not consider the alternate way to exit the initial state will often overestimate survival. The degree of this is based on the probability of the alternate state occurring.

The correct survival function here is:

\[R_{Ins(t)} = \int_0^t \lambda(u)exp(-\int_0^t\lambda(s)+\mu(s)ds).du \]

The first two statements calculate the survival as if only Ins, respectively Dead were the only way of exiting the state Alive. The mat2pol (matrix to polygon) takes the columns of state probabilities from the survfit object m3 that contains the correctly modeled probabilities and plot them as coloured areas stacked; the second argument to mat2pol is the order in which they should be stacked. The lines plot the wrongly computed cumulative risks (from m2 and M2) — in order to find these we fish out the surv component from the survfit objects.

A question frequently asked in competing risk situations is what the probability of being on Ins is, and it is often (wrongly) believed to be answered by the red curve in the left panel of figure 2.4. But a highly unrealistic assumption is often forgotten: How the rate of pharmaceutical treatment would look in the absence of death (or vice versa for that matter) cannot be deduced from data where both types of risk are present. It is essentially a theological question as no data is available concerning the situation.

Modeling cause specific rates

There is nothing wrong with modeling the cause-specific event-rates, the problem lies in how you transform them into probabilities. The relevant model for a competing risks situation normally consists of separate models for each of the cause-specific rates. Not for technical or statistical reasons, but for substantial reasons; it is unlikely that rates of different types of event (Ins initiation and death, say) depend on time in the same way

Now we model the two sets of rates by parametric models; this must be based on a time-split data set:

Sdm <- splitMulti(Adm, tfd = seq(0, 20, 0.1))
summary(Adm)
##     
## Transitions:
##      To
## From   DM  Ins Dead  Records:  Events: Risk time:  Persons:
##   DM 6157 1694 2048      9899     3742   45885.49      9899
summary(Sdm)
##     
## Transitions:
##      To
## From     DM  Ins Dead  Records:  Events: Risk time:  Persons:
##   DM 460054 1694 2048    463796     3742   45885.49      9899

We will use natural splines for the effect of diabetes duration in a model using glm. The Ns requires a set of pre-specified knots for the time variable, where the specification should be (partially) guided by the location on the times of the events:

round(cbind(
with(subset(Sdm, lex.Xst == "Ins" ), quantile(tfd + lex.dur, 0:10/10)),
with(subset(Sdm, lex.Xst == "Dead"), quantile(tfd + lex.dur, 0:10/10))),
3)
##        [,1]   [,2]
## 0%    0.003  0.003
## 10%   0.030  0.272
## 20%   0.066  0.769
## 30%   0.172  1.418
## 40%   0.454  2.133
## 50%   1.823  3.076
## 60%   3.288  4.047
## 70%   4.834  5.153
## 80%   6.638  6.523
## 90%   8.663  8.598
## 100% 13.878 14.609

We see that the Ins occur earlier than Dead, so we choose the knots a bit earlier:

okn <- c(0,0.5,5,6)
dkn <- c(0,2.0,5,9)
 Ins.glm <- glm.Lexis(Sdm, ~ Ns(tfd, knots = okn), from = "DM", to = "Ins" )
## stats::glm Poisson analysis of Lexis object Sdm with log link:
## Rates for the transition:
## DM->Ins
Dead.glm <- glm.Lexis(Sdm, ~ Ns(tfd, knots = dkn), from = "DM", to = "Dead")
## stats::glm Poisson analysis of Lexis object Sdm with log link:
## Rates for the transition:
## DM->Dead

With models for the two rates out of the DM state we can derive the estimated rates from the two models for rates by time by using a prediction frame, nd:

int <- 0.01
nd <- data.frame(tfd = seq(0, 15, int))
l.glm <- ci.pred( Ins.glm, nd)
m.glm <- ci.pred(Dead.glm, nd)

Now plot the estimated rates, in this case the gam models with dotted and glm models with full lines; mortality with black and Ins rates with red:

par(mfrow = c(1,2))

matshade(nd$tfd,
         cbind(l.glm, m.glm) * 100,
         plot = TRUE,
         log = "y", ylim = c(0.5, 50),
         col = rep(c("red","black"), 2), lwd = 3,
         xlab = "Time since DM (years)",
         ylab = "Rates per 100 PY")

matshade(nd$tfd,
         cbind(l.glm, m.glm) * 100,
         plot = TRUE,
         ylim = c(0.5, 20),
         col = rep(c("red","black"), 2), lwd = 3,
         xlab = "Time since DM (years)",
         ylab = "Rates per 100 PY")

The figure shows Mortality rates (black) and rates of insukin start (red), from a glm model with natural splines.

Integrals with R

Based on these parametric models we can estimate the cumulative risks of being in each of the states, but also the expected time time spent in each state. The theory of these involves calculation of integrals of the rate functions. Integrals looks scary to many people, but they are really just areas under curves. So here is a digression showing how to calculate integrals as areas under a curve.

The key is to understand how a curve is represented in R. A curve representing the function μ is just a set of two vectors, one vector of ts and one vector y = μ(t)s. When we have a model such as the gam or glm above that estimates the mortality as a function of time (tfd), we can get a representation of the mortality as a funtion of time by first choosing the timepoints, say from 0 to 15 years in steps of 0.01 year (≈ 4 days). Then put this in a dataframe (nd, newdata) with the variable name from the model to get the function values at the chosen time points:

t <- seq(0, 15, 0.01)
nd <- data.frame(tfd = t)
mu <- ci.pred(Dead.glm, nd)[,1]
head(cbind(t, mu))
##      t         mu
## 1 0.00 0.06681677
## 2 0.01 0.06657067
## 3 0.02 0.06632549
## 4 0.03 0.06608123
## 5 0.04 0.06583789
## 6 0.05 0.06559547
plot(t, mu, type="l", lwd = 3,
     xlim = c(0, 7), xaxs = "i",
     ylim = c(0, max(mu)), yaxs = "i", main = "Mortality function and integral from 0 to 5 years")
polygon(t[c(1:501,501:1)], c(mu[1:501], rep(0, 501)),
        col = "gray", border = "transparent")

This is a representaion of the points (t, μ(t)); if we want the integral of mu over the interval [0, 5], say, M(5) = R 5 0 μ(s) ds, we are just asking for the area under the curve. Each trepresents an endpoint of an interval, but what we want in order to compute the area under the curve is the width of each interval, diff(t), multiplied by the average of the function values at the ends of each interval (this goes under the name of the ”trapezoidal formula”). So we need a small function to compute midpoints between successive values in a vector:

mid <- function(x) x[-1] - diff(x) / 2
(x <- c(1:5, 7, 10, 17))
## [1]  1  2  3  4  5  7 10 17
mid(x)
## [1]  1.5  2.5  3.5  4.5  6.0  8.5 13.5

Note that mid(x) is a vector that is 1 shorter than the vector x, just as diff(x) is. So if we want the integral over the period 0 to 5 years, we want the sum over the first 500 intervals, corresponding to the first 501 interval endpoints:

sum(diff(t[1:501]) * mid(mu[1:501]))
## [1] 0.2085188

So now we have computed \(\int_{5}^{0}\) μ(s) d(s). This is called the cumulative rate over the interval [0, 5] years, even if it is not a rate—it is dimensionless (why?). It is important to get the units right. In the modeling we entered the risk time (“person-years”) in units of 1 year, so the unit of predicted mortality function, mu, is events per 1 person-year. Therefore, the units of t must be year too; otherwise we will introduce a scaling. In pratice we will want the integral as function of μ, so for every t we want M(t) = \(\int_{t}^{0}\) μ(s) d(s). This is easily accomplished by the function cumsum:

Mu <- c(0, cumsum(diff(t) * mid(mu)))
head(cbind(t, Mu))
##      t           Mu
##   0.00 0.0000000000
## 2 0.01 0.0006669372
## 3 0.02 0.0013314180
## 4 0.03 0.0019934516
## 5 0.04 0.0026530472
## 6 0.05 0.0033102141

Note that the first value is the integral from 0 to 0, so by definition 0.

Cumulative risks from parametric models

Here is the theory where we need integration: The cumulative risk of Ins at time t is:

\[R_{ins} (t) = \int_{t}^{0} \lambda (u)S(u)du = \int_{t}^{0} \lambda (u)exp(-\int_{t}^{0} \lambda(s) + \mu (s)ds)du\]

Where λ is the rate of Ins (lam), and μ the mortality rate (mrt). A similar formula is obtained for the cumulative risk of Dead (that is “dead without insulin use”), by exchanging λ and μ. The practical calculation of these quantities are on pages 214–5 of “Epidemiology with R”.

This means that if we have estimates of λ and μ as functions of time, we can derive the cumulative risks. In practice this will be by numerical integration; compute the rates at closely spaced intervals and evaluate the integrals as sums. This is easy, but what is not so easy is to come up with confidence intervals for the cumulative risks. Confidence intervals are most conveniently produced by simulation (“parametric bootstrap” as some say):

  1. generate a random vector from the multivariate normal distribution with mean equal to the parameters of the model, and variance-covariance equal to the estimated variance-covariance of the parameter estimates (the Hessian as it is called).
  2. use this to generate a simulated set of rates (λ(t), μ(t)), evaluated a closely spaced times.
  3. use these in numerical integration to derive state probabilities at these times.
  4. repeat 1000 times, say, to obtain 1000 sets of state probabilities at these times.
  5. use these to derive confidence intervals for the state. probabilities as the 2.5 and 97.5 percentiles of the simulated state probabilities at each time point.

This machinery is implemented in the function ci.Crisk

cR <- ci.Crisk(mods = list(Ins =  Ins.glm,
                          Dead = Dead.glm),
                 nd = nd)
## NOTE: Times are assumed to be in the column tfd at equal distances of 0.01
args(ci.Crisk)
## function (mods, nd, tnam = names(nd)[1], nB = 1000, perm = length(mods):0 + 
##     1, alpha = 0.05, sim.res = "none") 
## NULL

The above arguments is an example of bootstrapping, with nB noting the number of bootstraps. Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics.

str(cR)
## List of 4
##  $ Crisk: num [1:1501, 1:3, 1:3] 1 0.997 0.994 0.991 0.988 ...
##   ..- attr(*, "dimnames")=List of 3
##   .. ..$ tfd  : chr [1:1501] "0" "0.01" "0.02" "0.03" ...
##   .. ..$ cause: chr [1:3] "Surv" "Ins" "Dead"
##   .. ..$      : chr [1:3] "50%" "2.5%" "97.5%"
##  $ Srisk: num [1:1501, 1:2, 1:3] 0 0.000666 0.001329 0.001986 0.00264 ...
##   ..- attr(*, "dimnames")=List of 3
##   .. ..$ tfd  : chr [1:1501] "0" "0.01" "0.02" "0.03" ...
##   .. ..$ cause: chr [1:2] "Dead" "Dead+Ins"
##   .. ..$      : chr [1:3] "50%" "2.5%" "97.5%"
##  $ Stime: num [1:1501, 1:3, 1:3] 0 0.00998 0.01994 0.02987 0.03976 ...
##   ..- attr(*, "dimnames")=List of 3
##   .. ..$ tfd  : chr [1:1501] "0" "0.01" "0.02" "0.03" ...
##   .. ..$ cause: chr [1:3] "Surv" "Ins" "Dead"
##   .. ..$      : chr [1:3] "50%" "2.5%" "97.5%"
##  $ time : num [1:1501] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
##  - attr(*, "int")= num 0.01

The outputs above from ci.Crisk show the different ways the data can be presented.

-Crisk Cumulative risks for the length(mods) events and the survival

-Srisk Stacked versions of the cumulative risks

-Stime Sojourn times in each states

-time Endpoints of intervals. It is just the numerical version of the names of the first dimension of the three arrays

There are 4 components of the results, the three first are simply arrays with 2 or 3 functions of time with confidence intervals.

-time, named as tnam; endpoints of intervals. Length nrow(nd).

-cause. The arrays Crisk and Stime have values “Surv” plus the names of the list mods (first argument). Srisk has length length(mod), with each level representing a cumulative sum of cumulative risks, in order indicated by the perm argument.

-Unnamed, ci.50%, ci.2.5%, ci.97.5% representing quantiles of the quantities derived from the bootstrap samples. If alpha is different from 0.05, names are of course different.

So now plot the cumulative risks of being in each of the states (the Crisk component):

matshade(as.numeric(dimnames(cR$Crisk)[[1]]),
         cbind(cR$Crisk[,1,],
               cR$Crisk[,2,],
               cR$Crisk[,3,]), plot = TRUE,
         lwd = 2, main = "Cumulative risks of being in each of the states",  col = c("limegreen","red","black"))

legend(10, 0.95, legend=c("DM", "Ins", "Dead"),
       col=c("limegreen", "red", "black"), lty=1:3)

Plot the stacked probabilities (use for example matrix 2 polygons):

str(cR$Crisk)
##  num [1:1501, 1:3, 1:3] 1 0.997 0.994 0.991 0.988 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ tfd  : chr [1:1501] "0" "0.01" "0.02" "0.03" ...
##   ..$ cause: chr [1:3] "Surv" "Ins" "Dead"
##   ..$      : chr [1:3] "50%" "2.5%" "97.5%"
mat2pol(cR$Crisk[,3:1,1], col = c("forestgreen","red","black")[3:1])

The component Srisk has the confidence limits of the stacked probabilities, add these to the plot, for example by semi-transparent shades or dotted lines. If you are really entrepreneurial, devise a function that will take the Srisk component of cR and produce a stacked plot with shaded confidence limits;here is the stacked plot:

matshade(as.numeric(dimnames(cR$Srisk)[[1]]),
         cbind(cR$Srisk[,1,],
               cR$Srisk[,2,]), plot = TRUE,
         lwd = 2, col = c("black","red"),
         ylim = 0:1, yaxs = "i")

You may want to look at adjustcolor or rgb to see how to make semi-transparent colours.

Expected life time: using simulated objects

It is not only the cumulative risks of being in different states that my be of interest, the integrals — area under the cumulative risk curves are of interest too. The cumulative risks are probabilities, so dimensionless, which means that integrals of these along the time-axis will have dimension time; they will represent the expected time spent in each of the states.

The areas between the lines (up to say 10 years) are expected sojourn times, that is: - expected years alive without insulin - expected years lost to death without insulin - expected years after insulin, including years dead after insulin

Not all of these are of direct relevance; actually only the first may be so.

They are available (with simulation-based confidence intervals) in the component of cR, Stime (Sojourn time). A relevant quantity would be the expected time alive without insulin during the first 5, 10 and 15 years (remember that the first dimension of Stime is in units of 1/100 year):

str(cR$Stime)
##  num [1:1501, 1:3, 1:3] 0 0.00998 0.01994 0.02987 0.03976 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ tfd  : chr [1:1501] "0" "0.01" "0.02" "0.03" ...
##   ..$ cause: chr [1:3] "Surv" "Ins" "Dead"
##   ..$      : chr [1:3] "50%" "2.5%" "97.5%"
round(cR$Stime[1:3 * 500 + 1,"Surv",], 1)
##     
## tfd  50% 2.5% 97.5%
##   5  4.1  4.0   4.1
##   10 6.9  6.9   7.0
##   15 8.8  8.6   8.9

We can also compute the expected fraction of the first 5, 10, 15 years spent alive without insulin therapy:

(mY <- matrix(1:3 * 5, 3, 3)) # using the recycling rule
##      [,1] [,2] [,3]
## [1,]    5    5    5
## [2,]   10   10   10
## [3,]   15   15   15
round(cR$Stime[1:3*500+1,"Surv",] / mY * 100, 1)
##     
## tfd   50% 2.5% 97.5%
##   5  81.5 80.9  82.2
##   10 69.5 68.6  70.3
##   15 58.6 57.6  59.5

This can also be shown as a function of time; how large a fraction of the first t time can a person expect to be alive, for t ranging from 0 to 15 years:

time <- as.numeric(dimnames(cR$Stime)[[1]])
matshade(time, cR$Stime[,"Surv",] /
               cbind(time,
                     time,
                     time) * 100,
         plot=TRUE,
         ylim = 0:1*100, yaxs = "i", xaxs = "i", ylab = "y axis", xlab = "time")