Executive Summary

A duration (also known as a survival time) is the amount of time that passes between when an observation enters the data and when an event occurs for that observation. For example, in medical studies, observations are often patients and durations are the length of time each patient survives. Durations are not necessarily morbid, however: it is possible to model the durations until each patient is cured. In political science, durations are used to model the length of wars prior to peace, or the endurance of peace prior to war, or in other areas such as the length of economic recessions or the tenure of a Supreme Court Justice.

Durations are difficult to forecast because duration distributions can be highly skewed, non-normal, and subject to right-censoring (data collection ends or a subject drops out before a terminating event is observed, leading to some durations that are ongoing in the dataset). In applied work, duration outcomes are often modeled using the Cox proportional hazards (CPH) model (Cox and Oakes 1984, also see https://www.r-bloggers.com/cox-proportional-hazards-model/). The CPH model is extremely flexible: it handles any kind of distribution of duration times, and it elegantly accounts for right-censoring.

The problem with CPH models is in the interpretation of these models’ results. By default, results can only be understood in terms of hazard-based quantities, which are technical and difficult to communicate to a general audience.¹ Worse, hazards can’t be used to forecast durations, so there’s no way to understand statistical effects in terms of change in duration or to generate out-of-sample predicted durations.

Together with a co-author, I developed a method to reconstruct duration forecasts from the results of a CPH model (see Kropko and Harden 2018 for details). The code to implement this technique is available for R through the coxed package on the CRAN repository (Kropko and Harden 2019). This report serves as technical documentation for the coxed package, and demonstrates how to use this package using the example of the timing of important political events that occur before and after elections.

In this report, I consider two phenomena that have been of interest to journalists, policy makers, and political scientists:

How long will it take legislators in a multiparty legislature to negotiate a governing coalition after an election?
What factors deter quality challengers from entering a U.S. House of Representatives race?

These questions have previously been addressed in published work. But key findings have been described using the inaccessible language of hazard functions. With regard to the duration of coalition negotiations, Martin and Vanberg (2003) find that negotiations terminate more quickly when the parties are more ideologically similar. They report:

[W]e find that an increase in the ideological range of the government from zero (the case of a single-party government) to 1.24 (the average range for coalition governments in our sample) decreases the odds of government formation on any given day in the bargaining process by approximately 23 percent (p. 331, emphasis added).

In this case, “odds” is a word that describes a result based on hazard rates. While this conclusion supports the notion that larger ideological differences lead to longer negotiations, this particular result conveys no information about the duration of negotiation. We cannot know from the given information if ideology exerts a large or small effect on negotiation duration. Below I replicate this finding and employ the coxed() function to determine the change in duration associated with these values of ideological spread: I find that the average negotiation for a single party government takes 25.3 days to conclude, and the average negotiation for parties with an ideological range of 1.24 takes 28.9 days to conclude, for a difference of 3.6 days (the 95% confidence interval contains differences between -0.2 days and 7.4 days).

Likewise, Box-Steffensmeier (1996) finds that when an incumbent has more money on hand for conducting a campaign (a “war chest”) challengers take longer to declare candidacy in the election. She writes:

Each $100,000 increase in an incumbent’s war chest decreases the hazard of a high quality challenger entering by 16% (p. 365, emphasis added).

While this result expresses that larger war chests delay challenger entry, it is based on the hazard function and as such does not describe the length of the delay. I replicate Box-Steffensmeier’s model, and find that an incumbent whose war chest is at the 75th percentile in the data delays the entry of a high quality challenger by 2.7 weeks, on average, compared to an incumbent whose war chest is at the 25th percentile.

Installation/Usage

The stable version of the coxed package available for download from CRAN by typing

install.packages("coxed")

The development version is available on Github and can be downloaded by typing

devtools::install_github("jkropko/coxed")

The coxed package includes two top-level functions, coxed() and sim.survdata(). The coxed() function can:

Generate predicted survival/duration times from the Cox proportional hazards model, both in-sample and for out-of-sample cases
Calculate marginal changes in expected duration between two given covariate profiles
Produce standard errors and confidence intervals for the mean or median survival time or difference in duration for each case

Primary usage of the coxed() function is

coxed(cox.model, newdata = NULL, newdata2 = NULL, bootstrap = FALSE, method = "npsf")

where:

cox.model is an object that contains the output of a CPH model estimated with the survival::coxph() or rms::cph() function
newdata and newdata2 are optional data frames; if newdata is specified, expected durations are calculated for the observations in newdata, and if newdata2 is also specified then marginal changes are calculated by subtracting the expected durations for newdata2 from the expected durations for newdata. If newdata is not specified, expected durations are calculated for the estimation sample.
bootstrap calculates bootstrapped standard errors and confidence intervals if TRUE. Additional arguments B, confidence, and level specify the number of bootstrap iterations, the type and level of the confidence interval, respectively.
method can be either “npsf” for the non-parametric step function method of calculating expected durations as described in Kropko and Harden (2018), or “gam” for the generalized additive model method. More details are provided in the examples below.

Full usage of the coxed() function is described in the help documentation that can be accessed by typing ?coxed or help("coxed").

The other top-level function in the coxed package, sim.survdata() creates simulated survival-time data that can optionally have time-varying covariates or coefficients. Usage is described on the help documentation accessible with ?sim.survdata or help("sim.survdata").

Example 1: Duration of Negotiations to Form Governing Coalitions

The Cox proportional hazards model (implemented in R as coxph() in the survival package or as cph() rms package) is one of the most frequently used estimators in duration (survival) analysis. Because it is estimated using only the observed durations’ rank ordering, typical quantities of interest used to communicate results of the Cox model come from the hazard function (e.g., hazard ratios or percentage changes in the hazard rate). These quantities are substantively vague and difficult for many audiences of research to understand. The coxed package allows researchers to calculate duration-based quantities from Cox model results, such as the expected duration (or survival time) given covariate values and marginal changes in duration for a specified change in a covariate. These duration-based quantities often match better with researchers’ substantive interests and are easily understood by most readers.

This section is a walk-through of various ways to use the coxed() function.

Before I begin, I load the coxed package,

library(coxed)

and packages from the tidyverse for managing and plotting data as I go:

library(dplyr)
library(tidyr)
library(ggplot2)

Running a Cox proportional hazards model

The following quote from Kropko and Harden (2018) sets up our first working example:

Martin and Vanberg (2003) examine the determinants of negotiation time among political parties forming a coalition government. . . . The dependent variable in Martin and Vanberg’s analysis is the number of days between the beginning and end of the bargaining period. Martin and Vanberg model this variable as a function of the Range of government, which is a measure of the ideological distance between the extreme members of the coalition, the Number of government parties in the coalition, and several other variables. They interact Number of government parties with the natural log of time because that variable violates the proportional hazards assumption. Their hypotheses predict negative coefficients on the variables of interest, indicating that increases in the ideological distance between the parties and in the number of parties correspond with a decrease in the risk of government formation, or a longer negotiation time.

The authors demonstrate support for their hypotheses by computing changes in the hazard rate based on changes to these independent variables. However, to assess what the estimated effects of Range of government and Number of government parties mean in substantive terms, I use coxed() to predict how long is each case predicted to last. I will also find answers to the following questions about duration:

How much longer will negotiations take for an ideologically polarized coalition as compared to an ideologically homogeneous one?
How much longer will negotiations take for a multiparty coalition government than for a single-party government?

First I replicate the Cox model from Martin and Vanberg (2003):

mv.surv <- Surv(martinvanberg$formdur, event = rep(1, nrow(martinvanberg)))
mv.cox <- coxph(mv.surv ~ postel + prevdef + cont + ident + rgovm + pgovno + 
                     tpgovno + minority, method = "breslow", data = martinvanberg)
summary(mv.cox)

## Call:
## coxph(formula = mv.surv ~ postel + prevdef + cont + ident + rgovm + 
##     pgovno + tpgovno + minority, data = martinvanberg, method = "breslow")
## 
##   n= 203, number of events= 203 
## 
##              coef exp(coef) se(coef)       z Pr(>|z|)    
## postel   -0.57665   0.56177  0.16862  -3.420 0.000627 ***
## prevdef  -0.10000   0.90484  0.22987  -0.435 0.663543    
## cont      1.10047   3.00556  0.23970   4.591 4.41e-06 ***
## ident     0.14579   1.15695  0.11859   1.229 0.218938    
## rgovm    -0.21312   0.80806  0.12036  -1.771 0.076625 .  
## pgovno    1.19057   3.28894  0.12405   9.598  < 2e-16 ***
## tpgovno  -0.43189   0.64928  0.03476 -12.425  < 2e-16 ***
## minority -0.42759   0.65208  0.20793  -2.056 0.039740 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## postel      0.5618     1.7801    0.4037    0.7818
## prevdef     0.9048     1.1052    0.5766    1.4198
## cont        3.0056     0.3327    1.8789    4.8079
## ident       1.1570     0.8643    0.9170    1.4597
## rgovm       0.8081     1.2375    0.6382    1.0231
## pgovno      3.2889     0.3040    2.5791    4.1942
## tpgovno     0.6493     1.5402    0.6065    0.6951
## minority    0.6521     1.5336    0.4338    0.9801
## 
## Concordance= 0.903  (se = 0.025 )
## Rsquare= 0.745   (max possible= 1 )
## Likelihood ratio test= 277.2  on 8 df,   p=<2e-16
## Wald test            = 218.1  on 8 df,   p=<2e-16
## Score (logrank) test = 279.3  on 8 df,   p=<2e-16

Next I will use the both versions of coxed() to examine expected durations and marginal changes in duration.

Using the NPSF method within the `coxed()` function

The first version of coxed() is the non-parametric step function (NPSF) approach. The NPSF method estimates a survivor function for every observation in the given data, which by necessity given the mathematical construction of the CPH model must be a step function. The expected durations are calculated from these survivor step-functions. Full details are reported in Kropko and Harden (2018).

To use the NPSF version of coxed(), specify model="npsf" in the call to the function. By default, quantities are estimated without standard errors, but to estimate SEs and confidence intervals specify bootstrap=TRUE.

Without standard errors

To see predicted durations from the Cox model, place the Cox model output as the first argument of coxed():

ed1 <- coxed(mv.cox, method="npsf")

There are a number of uses of the coxed() output. First, the predicted durations for each individual observation are stored in the exp.dur attribute:

head(ed1$exp.dur)

##     exp.dur
## 1 34.620664
## 2 31.639975
## 3 39.509093
## 4 14.717544
## 5  2.473799
## 6 47.347670

The summary() function, when applied to coxed, reports either the mean or median duration in the estimation sample, depending on the option specified with stat:

summary(ed1, stat="mean")

##  mean 
## 25.18

summary(ed1, stat="median")

## median 
## 19.121

The predicted mean duration of government negotiations is 25.18 days, and the predicted median duration is 19.12 days.

In addition to reporting the mean and median duration, the NPSF version of coxed() provides estimates of the cumulative baseline hazard function and the baseline survivor function in the data. These functions are stored as a data frame in the baseline.functions attribute.

head(ed1$baseline.functions)

##   time        cbh  survivor
## 1    1 0.01631230 0.9838200
## 2    2 0.03587341 0.9647624
## 3    3 0.04746259 0.9536462
## 4    4 0.07651484 0.9263392
## 5    5 0.09169880 0.9123799
## 6    6 0.12573075 0.8818522

I can plot these baseline functions with ggplot():

baseline <- gather(ed1$baseline.functions, cbh, survivor, key="survivefunction", value="value")
ggplot(baseline, aes(x=time, y=value)) +
     geom_line() +
     xlab("Time") +
     ylab("Function") +
     facet_wrap( ~ survivefunction, scales = "free")

With (bootstrapped) standard errors

I can calculate standard errors and confidence intervals for any of these quantities with the bootstrap=TRUE option. By default the bootstrapping procedure uses 200 iterations (to set this value to a different number, use the B argument). Here I use 30 iterations simply to ease the computational burden of compiling this vignette. For more reliable results, set B to a higher value:

ed1 <- coxed(mv.cox, method="npsf", bootstrap = TRUE, B=30)

Now every predicted duration has a standard error and a 95% confidence interval.

head(ed1$exp.dur)

##     exp.dur bootstrap.se        lb        ub
## 1 34.620664    0.8564241 32.942104 36.299224
## 2 31.639975    0.7821738 30.106943 33.173008
## 3 39.509093    1.0924546 37.367921 41.650264
## 4 14.717544    0.5718398 13.596758 15.838329
## 5  2.473799    0.3911958  1.707069  3.240529
## 6 47.347670    1.9787969 43.469299 51.226040

The mean and median also have standard errors and confidence intervals.

summary(ed1, stat="mean")

##   mean bootstrap.se     lb     ub
##  25.18        1.086 23.051 27.309

summary(ed1, stat="median")

##  median bootstrap.se     lb     ub
##  19.121        0.652 17.844 20.398

To change the confidence interval to a different level, use the level argument:

ed1 <- coxed(mv.cox, method="npsf", bootstrap = TRUE, B=30, level=.8)
summary(ed1, stat="mean")

##   mean bootstrap.se     lb     ub
##  25.18        1.296 23.519 26.841

summary(ed1, stat="median")

##  median bootstrap.se    lb     ub
##  19.121        0.851 18.03 20.212

There are different methods for calculating a bootstrapped confidence interval. The default method used by coxed() (setting the argument confidence="studentized") adds and subtracts qnorm(level - (1 - level)/2) times the bootstrapped standard error to the point estimate. The alternative approach is to take the (1-level)/2 and level + (1-level)/2 quantiles of the bootstrapped draws, which can be done by specifying confidence="empirical" (I recommend a higher number of bootstrap iterations for empirical confidence intervals):

ed1 <- coxed(mv.cox, method="npsf", bootstrap = TRUE, B=30, confidence="empirical")
summary(ed1, stat="mean")

##   mean bootstrap.se     lb     ub
##  25.18        1.379 22.311 27.354

summary(ed1, stat="median")

##  median bootstrap.se     lb     ub
##  19.121        0.906 17.786 20.748

Out of sample prediction

coxed() can be used to provide duration predictions for observations outside of the estimation sample. Suppose that I observe three new cases and place them inside a new data frame:

new.coalitions <- data.frame(postel = c(1,0,0),
                             prevdef = c(1,0,1),
                             cont = c(1,0,0),
                             ident = c(1,3,2),
                             rgovm = c(0.81, 0.62, 1.18),
                             pgovno = c(2,3,4),
                             tpgovno = c(3.58, 5.17, 10.2),
                             minority = c(0,0,1))
new.coalitions

##   postel prevdef cont ident rgovm pgovno tpgovno minority
## 1      1       1    1     1  0.81      2    3.58        0
## 2      0       0    0     3  0.62      3    5.17        0
## 3      0       1    0     2  1.18      4   10.20        1

To forecast durations for these cases along with standard errors and confidence intervals, I use the coxed() function and place new.coalitions into the newdata argument:

forecast <- coxed(mv.cox, newdata=new.coalitions, method="npsf", bootstrap=TRUE, B=30)
forecast$exp.dur

##     exp.dur bootstrap.se       lb        ub
## 1  6.758058     1.895572 3.042805 10.473311
## 2  4.805083     1.579520 1.709281  7.900886
## 3 16.489185     3.338757 9.945342 23.033028

Marginal changes in expected duration

Here I use coxed() to provide answers to the two duration-based questions I posed in the introduction. First consider “How much longer will negotiations take for an ideologically polarized coalition as compared to an ideologically homogeneous one?” To answer this question, I call coxed() and specify two new datasets, one in which rgovm=0 indicating that all political parties in the governing coalition have the same ideological position, and one in which rgovm=1.24, indicating that the parties have very different ideological positions. I use mutate() from the dplyr library to quickly create new data frames in which rgovm equals 0 or 1.24 for all cases, and set these two data frames as newdata and newdata2 inside coxed().

me <- coxed(mv.cox, method = "gam", bootstrap = TRUE, B = 30,
            newdata = dplyr::mutate(martinvanberg, rgovm = 0),
            newdata2 = dplyr::mutate(martinvanberg, rgovm = 1.24))

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 1 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 1 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 4 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 3 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

coxed() calculates expected durations for all cases under each new data frame and subtracts the durations for each case. As an overall result, I can see either the mean or the median of these differences.

summary(me, stat="mean")

##              mean bootstrap.se     lb     ub
## newdata2   28.927        3.166 22.722 35.131
## newdata    25.319        2.439 20.538 30.100
## difference  3.607        1.859 -0.037  7.252

summary(me, stat="median")

##            median bootstrap.se     lb     ub
## newdata2   22.392        2.436 17.616 27.167
## newdata    19.692        2.435 14.918 24.465
## difference  2.928        1.507 -0.025  5.882

A coalition in which the parties have ideological differences so that rgovm=1.24 will take 3.08 more days on average (with a median of 2.6 days) to conclude negotiations than a coalition in which all parties have the same position.

Next I consider “How much longer will negotiations take for a multiparty coalition government than for a single-party government?” In this case I compare coalitions with one party to coalitions with 6 parties by setting the pgovno variable to 1 and 6 and setting these two data frames as the newdata and newdata2 arguments of coxed():

me <- coxed(mv.cox, method="npsf", bootstrap = TRUE, B=30,
            newdata = dplyr::mutate(martinvanberg, pgovno=1),
            newdata2 = dplyr::mutate(martinvanberg, pgovno=6))
summary(me, stat="mean")

##               mean bootstrap.se      lb      ub
## newdata2     5.801        0.813   4.209   7.394
## newdata     56.298        6.854  42.863  69.732
## difference -50.496        7.154 -64.519 -36.474

summary(me, stat="median")

##             median bootstrap.se      lb      ub
## newdata2     0.008        0.015  -0.022   0.038
## newdata     28.508        3.239  22.160  34.856
## difference -28.500        2.903 -34.190 -22.810

A coalition of 6 parties will take 50.5 more days on average (with a median of 28.5 days) to conclude negotiations than a coalition with one party.

Using the GAM method within the `coxed()` function

The generalized additive model (GAM) method is a second implementation of the coxed method that is likely to work better when there are long periods that separate one failure from the next. GAMs are locally-weighted smoothing models, and are described in detail in Kropko and Harden (2018) and also here: https://www.r-bloggers.com/generalized-additive-models/.

I can use the GAM method for all of the same uses for which I used the NPSF method above, except for estimating the baseline functions. I can however view and plot the output from the GAM model that maps predicted ranks to duration. While the bootstrap=TRUE argument works when method="gam", these functions take somewhat longer to run. I therefore run the following examples without bootstrapping.

As before, to see predicted durations from the Cox model, place the Cox model output as the first argument of coxed(). The predicted durations for each individual observation are stored in the exp.dur attribute,

ed2 <- coxed(mv.cox, method="gam")
head(ed2$exp.dur)

##     exp.dur
## 1 48.978295
## 2 42.036276
## 3 55.440293
## 4 15.734577
## 5  1.530695
## 6 64.449942

and summary() reports either the mean or median expected duration:

summary(ed2, stat="mean")

##   mean 
## 28.034

summary(ed2, stat="median")

## median 
## 21.208

Out of sample prediction

The GAM method can also forecast durations for new data along with standard errors and confidence intervals. Here I use the coxed() function with method="gam" and place the new.coalitions I created above into the newdata argument:

forecast <- coxed(mv.cox, newdata=new.coalitions, method="gam")
forecast$exp.dur

##     exp.dur
## 1  6.614393
## 2  3.194893
## 3 18.159002

Marginal changes in expected duration

Here I again calculate the two marginal effects to better understand the substantive meaning of the Cox model. This time I employ the GAM method instead of the NPSF method. The GAM method may provide a warning that some observations have linear predictors that are greater than or less than all of the observed cases in the estimation sample. Some observations falling outside the range of the original linear predictors is to be expected when applying new data, but if it happens with too many of the new observations NPSF may be a better option for estimating these quantities.

me <- coxed(mv.cox, method="gam",
            newdata = dplyr::mutate(martinvanberg, rgovm=0),
            newdata2 = dplyr::mutate(martinvanberg, rgovm=1.24))

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 1 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 1 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

summary(me, stat="mean")

##   newdata2    newdata difference 
##     25.322     28.927      3.605

summary(me, stat="median")

##   newdata2    newdata difference 
##     19.692     22.392      2.928

me <- coxed(mv.cox, method="gam", 
            newdata = dplyr::mutate(martinvanberg, pgovno=1),
            newdata2 = dplyr::mutate(martinvanberg, pgovno=6))

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 20 observations with linear predictors less
## than all linear predictors in the estimation sample. These observations
## will all have the same predicted duration

## Warning in rank.predict(x = exp.xb2, v = exp.xb, ties.method =
## ties.method, : New data contain 74 observations with linear predictors
## greater than all linear predictors in the estimation sample. These
## observations will all have the same predicted duration

summary(me, stat="mean")

##   newdata2    newdata difference 
##     47.507      6.083    -41.424

summary(me, stat="median")

##   newdata2    newdata difference 
##     36.807      0.592    -31.934

Plotting the GAM fit

The data used by coxed() to map rankings to durations are stored in the gam.data attribute, and the output from the GAM is stored in gam.model:

summary(ed2$gam.data)

##        y              failed     rank.xb          rank.y     
##  Min.   :  1.00   Min.   :1   Min.   :  1.0   Min.   :  1.0  
##  1st Qu.:  6.00   1st Qu.:1   1st Qu.: 51.5   1st Qu.: 51.5  
##  Median : 19.00   Median :1   Median :102.0   Median :102.0  
##  Mean   : 28.03   Mean   :1   Mean   :102.0   Mean   :102.0  
##  3rd Qu.: 37.50   3rd Qu.:1   3rd Qu.:152.5   3rd Qu.:152.5  
##  Max.   :205.00   Max.   :1   Max.   :203.0   Max.   :203.0  
##     gam_fit           gam_fit_se     gam_fit_95lb      gam_fit_95ub    
##  Min.   :  0.3262   Min.   :2.645   Min.   :-10.183   Min.   :  7.497  
##  1st Qu.:  7.3341   1st Qu.:2.701   1st Qu.:  1.994   1st Qu.: 12.674  
##  Median : 21.2081   Median :2.754   Median : 16.024   Median : 26.392  
##  Mean   : 28.0345   Mean   :2.919   Mean   : 22.313   Mean   : 33.756  
##  3rd Qu.: 38.2803   3rd Qu.:2.810   3rd Qu.: 32.941   3rd Qu.: 43.620  
##  Max.   :105.3490   Max.   :5.362   Max.   : 94.840   Max.   :115.858

summary(ed2$gam.model)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(rank.xb, bs = "cr", k = k)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   28.034      1.198   23.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df     F p-value    
## s(rank.xb) 5.122  6.214 77.47  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.704   Deviance explained = 71.2%
## GCV = 300.59  Scale est. = 291.53    n = 203

The gam.data can be used to visualize the fit of the GAM:

ggplot(ed2$gam.data, aes(x=rank.xb, y=y)) +
     geom_point() +
     geom_line(aes(x=rank.xb, y=gam_fit)) +
     geom_ribbon(aes(ymin=gam_fit_95lb, ymax=gam_fit_95ub), alpha=.5) +
     xlab("Cox model LP rank (smallest to largest)") +
     ylab("Duration")

Comparing the predicted durations to the observed durations and to each other

Given that coxed() contains two alternative methods for generating expected durations, it is possible to compare the estimates. Both correlate positively the observed durations, and the GAM and NPSF durations correlate even more strongly with each other.

tester <- data.frame(y=martinvanberg$formdur, npsf=ed1$exp.dur$exp.dur, gam=ed2$exp.dur$exp.dur)
cor(tester)

##              y      npsf       gam
## y    1.0000000 0.8240072 0.8436075
## npsf 0.8240072 1.0000000 0.9119669
## gam  0.8436075 0.9119669 1.0000000

Scatterplots visualize these correlations:

pairs(tester)

Example 2: Incumbent Fundraising and Detering Quality Challengers

This example illustrates how to use coxed() to calculate expected durations and marginal changes in duration when the CPH model includes time-varying covariates. To set up this example, I quote from the online appendix to Kropko and Harden (2018):

Box-Steffensmeier (1996) examines whether U.S. House incumbents’ ability to raise campaign funds can effectively deter quality challengers from entering the race. The theoretical expectation is that as incumbents raise more money, challengers further delay their decision to run for the incumbent’s seat. She employs data on 397 House races in the 1989–1990 election cycle to test this hypothesis. The dependent variable in this analysis is the number of weeks after January 1, 1989 when a challenger entered the race. Races in which no challenger entered are coded as the number of weeks after January 1 when the state’s primary filing deadline occurred, and are treated as censored. The key independent variable is the incumbent’s War chest, or the amount of money in millions of dollars that the incumbent has in reserve at a given time. Importantly, this measure updates over the course of five Federal Election Commission (FEC) reporting periods, so it is a time-varying covariate (TVC). The theory predicts a negative coefficient on this variable, which would indicate that as the incumbent raises more money, the hazard of challenger entry declines (and the time until entry increases).

Box-Steffensmeier’s model is replicated below. Note that the Surv() function which sets up the dependent variable has two time arguments, representing the start and end of discrete intervals, which allows a covariate to take on different values across different intervals for the same observation.

bs.surv <- Surv(time = boxsteffensmeier$start, time2 = boxsteffensmeier$te, event = boxsteffensmeier$cut_hi)
bs.cox <- coxph(bs.surv ~ ec + dem + south + iv, data = boxsteffensmeier, method = "breslow")
summary(bs.cox)

## Call:
## coxph(formula = bs.surv ~ ec + dem + south + iv, data = boxsteffensmeier, 
##     method = "breslow")
## 
##   n= 1376, number of events= 40 
## 
##             coef  exp(coef)   se(coef)      z Pr(>|z|)    
## ec    -3.0305806  0.0482876  1.3961518 -2.171    0.030 *  
## dem    0.2124840  1.2367464  0.3266545  0.650    0.515    
## south -0.4266388  0.6526992  0.4240698 -1.006    0.314    
## iv    -7.2790246  0.0006899  1.6996244 -4.283 1.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##       exp(coef) exp(-coef) lower .95 upper .95
## ec    0.0482876    20.7093 3.129e-03    0.7451
## dem   1.2367464     0.8086 6.520e-01    2.3460
## south 0.6526992     1.5321 2.843e-01    1.4986
## iv    0.0006899  1449.5734 2.466e-05    0.0193
## 
## Concordance= 0.773  (se = 0.052 )
## Rsquare= 0.025   (max possible= 0.259 )
## Likelihood ratio test= 35.04  on 4 df,   p=5e-07
## Wald test            = 25.52  on 4 df,   p=4e-05
## Score (logrank) test = 26.9  on 4 df,   p=2e-05

The coxed() function automatically detects whether time-varying covariates are used in the model and it takes steps to account for this structure in predicting expected durations and in estimating marginal effects. The only additional step that the user needs to take is to specify the ID variable in the id argument, so that the function knows which intervals refer to which observations.

ed1 <- coxed(bs.cox, method="npsf", id=boxsteffensmeier$caseid)
summary(ed1, stat="mean")

##   mean 
## 83.582

Here I look directly at the effect of the war chest on the length of time until a high quality challenger enters the race. I compare the 25th and 75th percentiles in war chest variable:

me <- coxed(bs.cox, method="npsf",
            newdata = mutate(boxsteffensmeier, ec=quantile(ec, .25)),
            newdata2 = mutate(boxsteffensmeier, ec=quantile(ec, .75)),
            id=boxsteffensmeier$caseid)
summary(me, stat="mean")

##   newdata2    newdata difference 
##     82.382     85.119      2.738

summary(me, stat="median")

##   newdata2    newdata difference 
##     83.280     85.743      2.464

An incumbent whose war chest is at the 75th percentile in the data delays the entry of a high quality challenger by 2.7 weeks, on average, compared to an incumbent whose war chest is at the 25th percentile.

Conclusion

The coxed method is part of a growing movement in statistics to emphasize the substantive meaning of empirical results. Rather than relying solely on the direction of an effect and a p-value, an increasing number of statistical researchers report what their results describe in terms of real changes in the data.

Our goal in developing the coxed is to give applied researchers an easy-to-use and open-source tool to take robust and flexible results from a CPH model and express those results in terms of durations and changes in duration. These duration-based quantities are much more intuitive and accessible to a wide audience than hazard-based results. They also provide needed context: a hazard ratio can make an effect appear to be larger than it really is in terms of duration.

References

Box-Steffensmeier, J. M. (1996) “A Dynamic Analysis of The Role of War Chests in Campaign Strategy.” American Journal of Political Science 40: 352-371.
Cox, D. R.; Oakes, D. (1984). Analysis of Survival Data. New York: Chapman & Hall. ISBN 978-0412244902.
Kropko, J. and Harden, J. J. (2018) “Beyond the Hazard Ratio: Generating Expected Durations from the Cox Proportional Hazards Model.” British Journal of Political Science https://doi.org/10.1017/S000712341700045X
Kropko, J. and Harden, J. J. (2019). coxed: Duration-Based Quantities of Interest for the Cox Proportional Hazards Model. R package version 0.2.3. https://github.com/jkropko/coxed
Martin, L. W and Vanberg, G. (2003) “Wasting Time? The Impact of Ideology and Size on Delay in Coalition Formation.” British Journal of Political Science 33 323-344 https://doi.org/10.1017/S0007123403000140

Forecasting The Durations of Events Before and After Elections: The `coxed` Package in Action

Jonathan Kropko

2019-03-28

Executive Summary

Installation/Usage

Example 1: Duration of Negotiations to Form Governing Coalitions

Running a Cox proportional hazards model

Using the NPSF method within the `coxed()` function

Without standard errors

With (bootstrapped) standard errors

Out of sample prediction

Marginal changes in expected duration

Using the GAM method within the `coxed()` function

Out of sample prediction

Marginal changes in expected duration

Plotting the GAM fit

Comparing the predicted durations to the observed durations and to each other

Example 2: Incumbent Fundraising and Detering Quality Challengers

Conclusion

References

Forecasting The Durations of Events Before and After Elections: The coxed Package in Action

Jonathan Kropko

2019-03-28

Executive Summary

Installation/Usage

Example 1: Duration of Negotiations to Form Governing Coalitions

Running a Cox proportional hazards model

Using the NPSF method within the coxed() function

Without standard errors

With (bootstrapped) standard errors

Out of sample prediction

Marginal changes in expected duration

Using the GAM method within the coxed() function

Out of sample prediction

Marginal changes in expected duration

Plotting the GAM fit

Comparing the predicted durations to the observed durations and to each other

Example 2: Incumbent Fundraising and Detering Quality Challengers

Conclusion

References

Forecasting The Durations of Events Before and After Elections: The `coxed` Package in Action

Using the NPSF method within the `coxed()` function

Using the GAM method within the `coxed()` function