The aim of this study was to investigate the different effects of several factors on hospital readmission among patients diagnosed with colorectal cancer.(J.R. Gonza´lez, 2005).The main outcome is the hospital readmmision times related to colorectal cancer after surgical procedure, considering it as a potential recurrent event (colorectal cancer patients may have several readmission after discharge). The first ten observations of the data set is given as follows:
load("DataColonDukesABvsC.rda")
data<-DataColonDukesABvsC[,c(3,5,7:10)]
data$event<-1-data$event
names<-c(1,4:6)
data[,names]<-lapply(data[,names],factor)
attach(data)
data<-data[order(id),]
detach(data)
head(data, n=10)
#> id time event chemoter dukes distance
#> 1 2 489 0 1 2 1
#> 2 2 693 1 1 2 1
#> 3 3 15 0 1 2 1
#> 4 3 768 1 1 2 1
#> 5 4 163 0 2 1 1
#> 6 4 125 0 2 1 1
#> 7 4 350 0 2 1 1
#> 8 4 48 0 2 1 1
#> 9 4 1362 1 2 1 1
#> 10 5 1134 0 1 2 1
Suppose we want to predict the recurrence time which is possibly censored. A log-normal mixed effects model could be used, in which the intercept is nested within patients. Then, we may use the following code:
library(brms)
fit1 <- brm(formula = time | cens(event) ~ chemoter + dukes + distance + (1|id),
data = data, family = lognormal(),
prior = c(set_prior("normal(0,0.5)", class = "b"),
set_prior("cauchy(0,2)", class = "sd")),
warmup =1500, iter = 3000, chains = 4, control = list(adapt_delta = 0.99),thin=3)
~, cens makes up the internal function that handles censored data, and event is the variable that contains information on the censoring. On the right hand, the predictors chemoter, dukes and distance are treated as population level, only one group-level term is specified which implies that intercept of the model is supposed to vary between patients.adapt_delta from the default 0.8 to 0.99.The model fit1 is fitted using 4 chains, each with 3000 iterations of which the first 1500 are warmup to calibrate the sampler with thinning rate being 3, leading to a total of 2000 posterior samples. A summary of the fitted model is available using:
summary(fit1)
#> Family: lognormal
#> Links: mu = identity; sigma = identity
#> Formula: time | cens(event) ~ chemoter + dukes + distance + (1 | id)
#> Data: data (Number of observations: 654)
#> Samples: 4 chains, each with iter = 3000; warmup = 1500; thin = 3;
#> total post-warmup samples = 2000
#>
#> Group-Level Effects:
#> ~id (Number of levels: 327)
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sd(Intercept) 1.63 0.17 1.31 1.99 1.00 1420 1688
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 7.39 0.33 6.74 8.04 1.00 1932 1873
#> chemoter2 0.21 0.34 -0.43 0.91 1.00 1871 1696
#> dukes2 -0.61 0.33 -1.26 0.02 1.00 1917 1975
#> distance2 -0.07 0.48 -1.05 0.86 1.00 1807 1933
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma 2.20 0.11 2.00 2.43 1.00 1931 1723
#>
#> Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
From the above output, all of the Rhat values equals to 1 indicating the convergence of 4 chains. (If Rhat is considerably greater than 1, the chains have not yet converged and it is necessary to run more iterations and/or set stronger priors). The values of Bulk_ESS and Tail_ESS for each parameters (the effective sample sizes of the 5% and 95% quantiles) are all smaller than 2000, so the sample size could be considered as large enough. * To visually investigate the posterior distributions, pairs plots show univariate histograms and bivariate scatter plots for selected parameters. Especially, a positive collinearity between variables dukes and chemoter was identified through the plots.
pairs(fit1,off_diag_args = list(size = 1/3, alpha = 1/3))
Q: how to interpret the negative relationship between the parameters corresponding to intercept and chemoters?
* From the following trank plots, we can see that the histograms of 4 chains are similar to one another, largely overlapping, staying within the same range, showing a good mixing of lines:
library(ggplot2)
library(magrittr)
library(bayesplot)
post <- posterior_samples(fit1, add_chain = T)
post %>%
mcmc_rank_overlay(pars = vars(b_Intercept, b_dukes2,b_chemoter2,b_distance2,sigma)) +
coord_cartesian(ylim = c(10, NA))
mcmc_trace(post[, c(1:6)], facet_args = list(ncol = 2), size = .15)
mcmc_acf(post, vars(b_Intercept, b_dukes2,b_chemoter2,b_distance2,sigma),lags = 5)
Looking at the population-level effects, the parameter estimates of variables, chemoter, dukes and distance are suspiciously small. To test whether they equal to 0, we apply the hypothesis method:
h <- c("chemoter2=0", " dukes2=0"," distance2=0")
hyps <- hypothesis(fit1, h, alpha=0.05)
plot(hyps, ignore_prior = TRUE)
Seen from the three posterior probability plots, 0 locates inside 95% credible intervals, which may raise the question if every coefficient should be modeled as a fixed effect term at all. To answer this question, thus we fit another model without chemoter and distance terms:
fit2<- update(fit1, formula. = ~ . - chemoter- distance )
A good way to compare both models is leave-one-out cross-validation (LOO), which can be called in brms using:
loo(fit1,fit2)
#> Warning: Found 32 observations with a pareto_k > 0.7 in model 'fit1'. It is
#> recommended to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.
#> Warning: Found 24 observations with a pareto_k > 0.7 in model 'fit2'. It is
#> recommended to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.
#> Output of model 'fit1':
#>
#> Computed from 2000 by 654 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -2434.4 85.2
#> p_loo 113.8 4.8
#> looic 4868.8 170.5
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#>
#> Pareto k diagnostic values:
#> Count Pct. Min. n_eff
#> (-Inf, 0.5] (good) 540 82.6% 380
#> (0.5, 0.7] (ok) 82 12.5% 148
#> (0.7, 1] (bad) 29 4.4% 160
#> (1, Inf) (very bad) 3 0.5% 1762
#> See help('pareto-k-diagnostic') for details.
#>
#> Output of model 'fit2':
#>
#> Computed from 2000 by 654 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -2433.5 85.3
#> p_loo 112.8 4.8
#> looic 4867.1 170.5
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#>
#> Pareto k diagnostic values:
#> Count Pct. Min. n_eff
#> (-Inf, 0.5] (good) 548 83.8% 463
#> (0.5, 0.7] (ok) 82 12.5% 216
#> (0.7, 1] (bad) 23 3.5% 47
#> (1, Inf) (very bad) 1 0.2% 1796
#> See help('pareto-k-diagnostic') for details.
#>
#> Model comparisons:
#> elpd_diff se_diff
#> fit2 0.0 0.0
#> fit1 -0.9 0.9
Q: Could you please have a look at the warning messages of loo command?
First plot fitted scale parameters k for the 655 leave-one-out data sets:
yrep <- posterior_predict(fit1)
loo1 <- loo(fit1, save_psis = TRUE)
loo1
#>
#> Computed from 2000 by 654 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -2433.4 85.2
#> p_loo 112.9 4.8
#> looic 4866.8 170.4
#> ------
#> Monte Carlo SE of elpd_loo is NA.
#>
#> Pareto k diagnostic values:
#> Count Pct. Min. n_eff
#> (-Inf, 0.5] (good) 547 83.6% 305
#> (0.5, 0.7] (ok) 80 12.2% 114
#> (0.7, 1] (bad) 22 3.4% 155
#> (1, Inf) (very bad) 5 0.8% 1857
#> See help('pareto-k-diagnostic') for details.
plot(loo1)
From the above plot, 23 observations with a pareto_k > 0.7 are found in model, indicating that the variance of the raw importance ratios is infinite,the generalized central limit theorem for stable distributions holds, and the convergence of the estimate is slower. The variance of the PSIS estimate is finite but may be large. For the one with k > 1, the variance and the mean of the raw ratios distribution do not exist, therefore Monte Carlo SE of elpd_loo is NA. of the PSIS estimate is finite but may be large thus not reliable!
Now have a look at the plot of LOO-PIT values are cumulative probabilities for yi computed using the LOO marginal predictive distributions p(yi|y−i).
ppc_loo_pit_overlay(
y = na.omit(data)$time,
yrep = yrep,
lw = weights(loo1$psis_object)
)
loo should be used?
waic(fit1, fit2)
#> Output of model 'fit1':
#>
#> Computed from 2000 by 654 log-likelihood matrix
#>
#> Estimate SE
#> elpd_waic -2427.9 85.2
#> p_waic 107.4 4.5
#> waic 4855.8 170.4
#>
#> 45 (6.9%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#>
#> Output of model 'fit2':
#>
#> Computed from 2000 by 654 log-likelihood matrix
#>
#> Estimate SE
#> elpd_waic -2427.8 85.2
#> p_waic 107.0 4.5
#> waic 4855.5 170.4
#>
#> 44 (6.7%) p_waic estimates greater than 0.4. We recommend trying loo instead.
#>
#> Model comparisons:
#> elpd_diff se_diff
#> fit2 0.0 0.0
#> fit1 -0.1 0.8
Conclusion: removing variables chemoter and distance cannot improve the predictive accuracy.