I think Sean’s analysis should use survival statistics. Let me try to prove that point.
First the data.
library(dplyr)
mortality_rate = 1 / 1000
cancer_rate = 1 / 2000
effect_of_treatment = 2
n <- 50000
control_lifespan <- rweibull(n, 5, 1/mortality_rate)
treatment_lifespan <- rweibull(n, 5, 1/(mortality_rate * effect_of_treatment))
control_onset <- rexp(n, cancer_rate)
treatment_onset <- rexp(n, cancer_rate * effect_of_treatment)
data <- data.frame(
group = c(rep("control", n),
rep("treated", n)),
lifespan = c(round(control_lifespan),
round(treatment_lifespan)),
onset = c(round(control_onset),
round(treatment_onset))
) %>%
mutate(
cancer = onset < lifespan
)
head(data)
group lifespan onset cancer
1 control 998 807 TRUE
2 control 890 1890 FALSE
3 control 626 209 TRUE
4 control 1099 944 TRUE
5 control 640 989 FALSE
6 control 564 2070 FALSE
We have simulated two populations, one treated, one control. The treated population has double the mortality rate of the control and double the cancer rate. We estimate the date of cancer onset for a sample from an exponential distribution. If cancer onset occurs before death then we label the animal as having cancer at death.
Let’s show off the survival curves.
library(ggplot2)
library(scales)
ggplot(data, aes(lifespan, color=cancer)) +
stat_ecdf() +
facet_grid(group ~ .) +
scale_y_reverse() +
theme_bw()
Both controls and treated groups have approximately equal rates of cancer at death.
table(data$group, data$cancer)
FALSE TRUE
control 31736 18264
treated 31790 18210
So a logistic regression shows no effect of treatment.
model <- glm(cancer ~ group,
family = "binomial",
data=data)
round(coef(summary(model)), 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.553 0.009 -59.489 0.000
grouptreated -0.005 0.013 -0.355 0.723
But, we know from our setup that treatment doubles the cancer rate. This is not apparent in the data because treatment also doubled the overall mortality rate.
We can see the increased risk if we estimate the rate of lifetime cancer incidence divided by the the number of days at risk.
data %>%
group_by(group) %>%
summarize(rate=weighted.mean(cancer / lifespan, lifespan))
Source: local data frame [2 x 2]
group rate
1 control 0.0003979414
2 treated 0.0007927002
Now treatment risks are doubled, as expected.
We can also use regression to find this value by using a cox proportional hazards model with interval censoring. Cliff AB on Stats Exchange gave a great explanation:
with current status data, if we inspect subject i at time ci, and the event has occurred, we record their interval as (0,ci) (i.e. we know the event occurred before ci), but if it has not occurred, we say (ci,∞) (i.e. all we know is that the event has not occurred by time ci). (source)
library(icenReg)
# Add censoring information (like Cliff suggested)
data <- data %>%
mutate(left_censor = ifelse(cancer, 0, lifespan),
right_censor = ifelse(cancer, lifespan, Inf))
head(data, 3)
group lifespan onset cancer left_censor right_censor
1 control 998 807 TRUE 0 998
2 control 890 1890 FALSE 890 Inf
3 control 626 209 TRUE 0 626
# Model at will
model <- ic_par(
Surv(left_censor, right_censor, type='interval2') ~ group,
data=data
)
summary(model)
Model: Cox PH
Baseline: weibull
Call: ic_par(formula = Surv(left_censor, right_censor, type = "interval2") ~
group, data = data)
Estimate Exp(Est) Std.Error z-value p
log_shape -0.02631 0.974 0.02345 -1.122 0.2617
log_scale 7.27500 1443.000 0.01916 379.600 0.0000
grouptreated 0.67030 1.955 0.01901 35.260 0.0000
final llk = -64623.86
Iterations = 42
Lo! Treatment increases risk 2 fold. Cliff is correct. Thanks Cliff!