if (!require(pacman)) install.packages("pacman")
::p_load(tidyverse, survival, ggfortify, survminer, gridExtra,
pacman
Epi, KMsurv, gnm, cmprsk, mstate, flexsurv, splines, epitools,
eha, ctqr, scales, data.table,haven,biostat3,paletteer)library(tidyverse)
<-haven::read_dta("trinmlsh.dta") |>
out_newmutate(time_new= timeout-timein,
years_new =time_new/365.25,
smoke_status = case_when(smokenum==0~"non-smoker",
==1~"ex-smoker",
smokenum>1~"Current-smoker")) smokenum
lOADING PACKAGES AND CALLING IN DATA FOR R ANALYSIS
Question a)
- Open a log file, read the trinmlsh.dta data and then examine the variables.
<- survfit(Surv(years,death) ~ 1,
KM data = out_new)
ggsurvplot(KM, risk.table = TRUE, xlab = "Time (years)", censor = T)
Question 1b
The variables timein and timeout hold the dates of entry and exit into the study, while death is the indicator for mortality from any cause. Use stset with these variables, setting timein to be the origin (i.e. sort the records according to follow-up time). Hint: You may use the command stdescribe.
quietly use trinmlsh
stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
sts graph, survival
stdescribe, noshow
quietly graph export overal.svg, replace
data settings
## Survival-time
##
## Failure event: death!=0 & death<.
## Observed time interval: (origin, timeout]on or after: time timein
## Enter on or before: failure
## Exit for analysis: (time-origin)/365.25
## Time
## Origin: time timein
##
## --------------------------------------------------------------------------total observations
## 318
## 0 exclusions
## --------------------------------------------------------------------------
## 318 observations remaining, representingin single-record/single-failure data
## 88 failures total analysis time at risk and under observation
## 2,204.539
## At risk from t = 0
## Earliest observed entry t = 0exit t = 9.798768
## Last observed
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
##
##
## |-------------- Per subject --------------|
## Category Total Mean Min Median Max
## ------------------------------------------------------------------------------of subjects 318
## Number of records 318 1 1 1 1
## Number
##
## Entry time (first) 0 0 0 0
## Exit time (final) 6.932514 .0164271 7.678303 9.798768
##
## Subjects with gap 0 on gap 0
## Time at risk 2204.5394 6.932514 .0164271 7.678303 9.798768
## Time
##
## Failures 88 .2767296 0 0 1 ## ------------------------------------------------------------------------------
the stdescribe output shows that we have 318 subjects in the analytical sample of which 88 (\(27.67\%\)) have experienced the event of death.
In the output above, mean age at entry is 0. This is actually the same for all individuals since they have the same date for origin and the same date for enter, which also explains why the same value is presented for min, median, and max.
The mean age at exit is \(6.932514 (min: .0164271, median: 7.678303 , max: 9.798768)\) days.
Time at risk is here presented as days. We can see that the mean is \(6.932514 (min=.0164271 , median=7.678303, max=9.798768)\).
Another way of obtaining the mean time at risk is through stci, which produces the restricted mean survival time (same as mean time at risk) along with information on standard errors and confidence intervals.
Estimating the restricted mean (or average) survival time is determined by calculating the area under the survival curve, restricting the estimation to the longest follow-up time.
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
stci, rmean noshow
of Restricted
## | Number mean Std. err. [95% conf. interval]
## | subjects
## -------------+-------------------------------------------------------------
## Total | 318 8.113829(*) .1666967 7.78711 8.44055
## mean is underestimated ## (*) largest observed analysis time is censored,
- The restricted mean survival time in this example is 8.113829 (years).
Question 1c
Examine the effect of smoking on all-cause mortality using strate and stcox. Hint: You may wish to recode smokenum into a smaller number of categories, say: 0 = non smoker, 1 = ex-smoker, 2 = current smoker
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
tab smokenum
of |
## No.
## cigarettes |day | Freq. Percent Cum.
## per
## ------------+-----------------------------------
## non-smok | 140 44.16 44.16
## ex-smoke | 68 21.45 65.62
## 1-9 cigs | 29 9.15 74.76ci | 30 9.46 84.23
## 10-19 ci | 26 8.20 92.43
## 20-29
## 30+ cigs | 24 7.57 100.00
## ------------+----------------------------------- ## Total | 317 100.00
Recoding the variable
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
tab smoke3
## (80 differences between smokenum and smoke3)
##
##
##
## of |
## RECODE
## smokenum (No. |of cigarettes |
## day) | Freq. Percent Cum.
## per
## ---------------+-----------------------------------
## Non-smoker | 140 44.16 44.16
## Ex-smoker | 68 21.45 65.62
## Current-smoker | 109 34.38 100.00
## ---------------+----------------------------------- ## Total | 317 100.00
Using strate
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
strate smoke3
## (80 differences between smokenum and smoke3)
##
##
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
##
## Estimated failure ratesof records = 317
## Number
##
## +-----------------------------------------------------------------+
## | smoke3 D Y Rate Lower Upper |
## |-----------------------------------------------------------------|
## | Non-smoker 30 984.4079 0.030475 0.021308 0.043587 |
## | Ex-smoker 18 461.5934 0.038995 0.024569 0.061893 |
## | Current-smoker 40 749.0157 0.053403 0.039173 0.072804 |
## +-----------------------------------------------------------------+
## Notes: Rate = D/Y = failures/person-time.of 95% confidence intervals. ## Lower and Upper are bounds
Manually doing it in R
|>
out_new filter(!is.na(smoke_status)) |>
group_by(smoke_status) |>
summarise(years=sum(years), death=sum(death)) |>
mutate(rate=death/years)
smoke_status | years | death | rate |
---|---|---|---|
Current-smoker | 749.0157 | 40 | 0.0534034 |
ex-smoker | 461.5934 | 18 | 0.0389954 |
non-smoker | 984.4079 | 30 | 0.0304752 |
Using Survrate
<- survRate(Surv(years,death) ~ smoke_status,
KM_rate data = out_new)
KM_rate
smoke_status | tstop | event | rate | lower | upper | |
---|---|---|---|---|---|---|
smoke_status=Current-smoker | Current-smoker | 749.0157 | 40 | 0.0534034 | 0.0381522 | 0.0727203 |
smoke_status=ex-smoker | ex-smoker | 461.5934 | 18 | 0.0389954 | 0.0231111 | 0.0616295 |
smoke_status=non-smoker | non-smoker | 984.4079 | 30 | 0.0304752 | 0.0205615 | 0.0435052 |
- Based on the crude rates, there is a clear increase in all-cause mortality from non-smokers (0.03047517) to ex-smokers (0.03899536) and current smokers (0.05340342 ).
Using Cox in Stata
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
stcox i.smoke3 , efron
## (80 differences between smokenum and smoke3)
##
##
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
##
## Iteration 0: Log likelihood = -482.41209
## Iteration 1: Log likelihood = -479.70223
## Iteration 2: Log likelihood = -479.69013
## Iteration 3: Log likelihood = -479.69013estimates:
## Refining
## Iteration 0: Log likelihood = -479.69013
## for ties
## Cox regression with Efron method
## of subjects = 317 Number of obs = 317
## No. of failures = 88
## No. at risk = 2,195.0171
## Time chi2(2) = 5.44
## LR chi2 = 0.0657
## Log likelihood = -479.69013 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## smoke3 |
## Ex-smoker | 1.294778 .3862725 0.87 0.387 .7215315 2.32346s~r | 1.749641 .4226712 2.32 0.021 1.08973 2.809175
## Current- ## ------------------------------------------------------------------------------
Using Cox in R
<-out_new |>
out_newmutate(smoke_status = fct_relevel(smoke_status,"non-smoker"))
<- coxph(Surv(years,death) ~ smoke_status,
KM_cox data = out_new)
tidy(KM_cox, exponentiate = T, conf.int = T)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
smoke_statusCurrent-smoker | 1.749641 | 0.2415760 | 2.3156709 | 0.0205762 | 1.0897303 | 2.809175 |
smoke_statusex-smoker | 1.294778 | 0.2983311 | 0.8659475 | 0.3865190 | 0.7215315 | 2.323460 |
ggforest(KM_cox , as.data.frame(out_new))
Poisson regression to estimate the mortality rate ratio.
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
poisson death i.smoke3 , exp(years) irr
## (80 differences between smokenum and smoke3)
##
##
##
##
## Iteration 0: Log likelihood = -282.97543
## Iteration 1: Log likelihood = -282.9754
## Iteration 2: Log likelihood = -282.9754
## of obs = 317
## Poisson regression Number chi2(2) = 5.49
## LR chi2 = 0.0644
## Prob >
## Log likelihood = -282.9754 Pseudo R2 = 0.0096
##
## ------------------------------------------------------------------------------
## death | IRR Std. err. z P>|z| [95% conf. interval]
## -------------+----------------------------------------------------------------
## smoke3 |
## Ex-smoker | 1.279578 .3814964 0.83 0.408 .713325 2.295335s~r | 1.752358 .4232347 2.32 0.020 1.091536 2.813245
## Current-
## |_cons | .0304752 .005564 -19.12 0.000 .0213078 .0435867
## ln(years) | 1 (exposure)
##
## ------------------------------------------------------------------------------_cons estimates baseline incidence rate. ## Note:
library(gtsummary)
<- glm(death ~ smoke_status + offset(log(years)),
poisson7c family=poisson, data=out_new)
|>
poisson7c tbl_regression(exponentiate = T)
Characteristic | IRR | 95% CI | p-value |
---|---|---|---|
smoke_status | |||
non-smoker | — | — | |
Current-smoker | 1.75 | 1.09, 2.83 | 0.020 |
ex-smoker | 1.28 | 0.70, 2.27 | 0.4 |
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio |
- A cox regression model using non-smokers as the reference group indicated that current smokers had a statistically significant 75% higher mortality rate (hazard ratio \([HR] = 1.75; 95\% CI: 1.09–2.39; p = 0.02057\))
- while the increase for ex-smokers was smaller and not statistically significant (\(HR = 1.294; 95\% CI: 0.721–2.3234; p = 0.386\)).
- The results were almost similar to those from fitting a Poisson model.
Question 1d
Using the rates and rate ratios in part (c), can you see a trend in the relationship between smoking category and all-cause mortality? How would you assess this more formally? How would you present the results from your analyses in parts (c) and (d)?
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
sts graph , by(smoke3)
quietly graph export KM.svg, replace
## (80 differences between smokenum and smoke3)
##
##
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein ## Enter
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
sts test smoke3 , logrank
## (80 differences between smokenum and smoke3)
##
##
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
## of survivor functions
## Equality rank test
## Log-
##
## | Observed Expected
## smoke3 | events events
## ---------------+-------------------------
## Non-smoker | 30 39.54
## Ex-smoker | 18 18.32
## Current-smoker | 40 30.14
## ---------------+-------------------------
## Total | 88 88.00
## chi2(2) = 5.54
## chi2 = 0.0628 ## Pr>
library(paletteer)
<- survfit(Surv(years,death) ~ smoke_status,
KM_dm data = out_new)
<-ggsurvplot(KM_dm,
p2data = out_new,
palette = paletteer_c("grDevices::Set 2", 12),
size = 1.2, conf.int = FALSE,
pval=TRUE,
legend.labs = levels(out_new$smoke_status),
legend.title = "",
ggtheme = theme_minimal() +
theme(plot.title = element_text(face = "bold")),
title = "Probability of dying",
xlab = "Years",
ylab = "Probability of dying",
legend = "bottom", censor = FALSE)
p2
- Based on the crude rates, there is a clear increase in all-cause mortality from non-smokers to ex smokers and then to current smokers
- The Kaplan Meier plots also suggest a marginally significant difference in survival times between the 3 groups where
- non-smokers survival longer
- ex-smokers survival longer than current smokers but shorter than non smokers
- current smokers have a shorter survival time as compared to the other 2 groups.
Question 1e and 1f
Consider the possible confounding effect of current age on the relationship between smoking and mortality. By examining the data (but without performing any further modelling) would you expect a Cox model controlling for current age to give different results to one controlling for time in the study?
Hint: What was the age range at study entry (ageent) in each smoking group? Are there large differences in age between the groups? If follow-up time is your analysis scale, will you be comparing men whose ages are very different?
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
tabstat ageent, by(smoke3) stats(mean sd min max)
## (80 differences between smokenum and smoke3)
##
##
##
## for variables: ageent
## Summary variable: smoke3 (RECODE of smokenum (No. of cigarettes per day))
## Group
## SD Min Max
## smoke3 | Mean
## ---------------+----------------------------------------
## Non-smoker | 65.1261 3.203284 60.077 72.016
## Ex-smoker | 65.60484 3.571255 60.093 73.467
## Current-smoker | 64.71269 2.902213 60.005 73.93
## ---------------+----------------------------------------
## Total | 65.08664 3.194229 60.005 73.93 ## --------------------------------------------------------
|>
out_new group_by(smoke_status) |>
summarise(aver_age=mean(ageent),
sd_age=sd(ageent),
range_age = range(ageent))
smoke_status | aver_age | sd_age | range_age |
---|---|---|---|
non-smoker | 65.12610 | 3.203284 | 60.077 |
non-smoker | 65.12610 | 3.203284 | 72.016 |
Current-smoker | 64.71269 | 2.902213 | 60.005 |
Current-smoker | 64.71269 | 2.902213 | 73.930 |
ex-smoker | 65.60484 | 3.571255 | 60.093 |
ex-smoker | 65.60484 | 3.571255 | 73.467 |
NA | 67.24400 | NA | 67.244 |
NA | 67.24400 | NA | 67.244 |
- The mean ages are all around 65 years, with small SD (\(\approx 3\) years). The ranges overlap heavily across groups.
- Adjustment for age is important if age differs systematically across smoking groups and age is associated with the hazard (e.g. mortality, CVD, HIV progression).
- Here, the age distributions are very similar across smoking categories (differences <1 year on average).
- Because of this, adjusting for age is unlikely to meaningfully change hazard ratios for smoking status.
- using follow-up time as the analysis scale will not result in comparing men of very different actual ages, although minor age variation still warrants adjustment.
Question 1g
Now perform a Cox regression, and check the assumptions, to investigate this.
Cox model adjusting for ageent
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
stcox i.smoke3 ageent , efron
## (80 differences between smokenum and smoke3)
##
##
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
##
## Iteration 0: Log likelihood = -482.41209
## Iteration 1: Log likelihood = -478.51285
## Iteration 2: Log likelihood = -478.50359
## Iteration 3: Log likelihood = -478.50359estimates:
## Refining
## Iteration 0: Log likelihood = -478.50359
## for ties
## Cox regression with Efron method
## of subjects = 317 Number of obs = 317
## No. of failures = 88
## No. at risk = 2,195.0171
## Time chi2(3) = 7.82
## LR chi2 = 0.0499
## Log likelihood = -478.50359 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## smoke3 |
## Ex-smoker | 1.258957 .3764753 0.77 0.441 .7005992 2.262309s~r | 1.772706 .4286392 2.37 0.018 1.103613 2.847454
## Current-
## |
## ageent | 1.053675 .0355862 1.55 0.122 .9861853 1.125783 ## ------------------------------------------------------------------------------
<- coxph(Surv(years,death) ~ smoke_status+ageent,
KM_cox data = out_new)
tidy(KM_cox, exponentiate = T, conf.int = T)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
smoke_statusCurrent-smoker | 1.772706 | 0.2417994 | 2.3676953 | 0.0178993 | 1.1036128 | 2.847454 |
smoke_statusex-smoker | 1.258957 | 0.2990375 | 0.7700821 | 0.4412512 | 0.7005992 | 2.262309 |
ageent | 1.053675 | 0.0337734 | 1.5480728 | 0.1216048 | 0.9861853 | 1.125783 |
Checking assumptions
<- cox.zph(KM_cox)
cph_assump
cph_assump## chisq df p
## smoke_status 5.159 2 0.076
## ageent 0.035 1 0.852
## GLOBAL 5.225 3 0.156
ggcoxzph(cph_assump)
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
stcox i.smoke3 ageent , efron
stphtest, detail
stphtest, plot(1.smoke3) msym(oh) name(plt1)
stphtest, plot(2.smoke3) msym(oh) name(plt2)
stphtest, plot(ageent) msym(oh) name(plt3)
graph combine plt1 plt2 plt3
quietly graph export all_multii.svg, replace
## (80 differences between smokenum and smoke3)
##
##
##
##
## Failure _d: death
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
##
## Iteration 0: Log likelihood = -482.41209
## Iteration 1: Log likelihood = -478.51285
## Iteration 2: Log likelihood = -478.50359
## Iteration 3: Log likelihood = -478.50359estimates:
## Refining
## Iteration 0: Log likelihood = -478.50359
## for ties
## Cox regression with Efron method
## of subjects = 317 Number of obs = 317
## No. of failures = 88
## No. at risk = 2,195.0171
## Time chi2(3) = 7.82
## LR chi2 = 0.0499
## Log likelihood = -478.50359 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## smoke3 |
## Ex-smoker | 1.258957 .3764753 0.77 0.441 .7005992 2.262309s~r | 1.772706 .4286392 2.37 0.018 1.103613 2.847454
## Current-
## |
## ageent | 1.053675 .0355862 1.55 0.122 .9861853 1.125783
## ------------------------------------------------------------------------------
##
## of proportional-hazards assumption
## Test
## function: Analysis time
## Time
## --------------------------------------------------------chi2 df Prob>chi2
## | rho
## -------------+------------------------------------------
## 0b.smoke3 | . . 1 .
## 1.smoke3 | 0.13164 1.54 1 0.2148
## 2.smoke3 | 0.22638 4.47 1 0.0345
## ageent | 0.02115 0.04 1 0.8508
## -------------+------------------------------------------test | 4.58 3 0.2053
## Global ## --------------------------------------------------------
After adjusting for age at study entry, Age is not statistically significant \((HR = 1.05~ per~ year~ older; p = 0.12)\).
There is no evidence of PH violation overall (global \(p = 0.156\)).
White dataset
Question 1h
Now read the whitehal new.dta data. We are interested in CHD mortality, so the variables necessary to define the time-scale and the outcome are timein, timeout, and chd.
Loading the data in R
<-haven::read_dta("whitehal_new.dta") |>
out_new2mutate(fu_years = as.numeric(timeout - timein) / 365.25) |>
mutate(across(where(is.labelled), ~as_factor(.))) |>
mutate(chd = as.integer(chd))
Loading the data in Stata
quietly use whitehal_new
format timein %td
format timeout %td
quietly stset timeout, origin(time timein) enter(time timein) failure(chd)
tab grade
of work, |
## grade levels | Freq. Percent Cum.
## 2
## ---------------+-----------------------------------
## High job grade | 1,194 71.20 71.20
## Low job grade | 483 28.80 100.00
## ---------------+----------------------------------- ## Total | 1,677 100.00
Question 1i
Use a Cox regression model to estimate the effect of job grade (grade, coded: 1 = high; 2 = low) using first the follow-up time scale. You may adjust for other covariates.
<- survfit(Surv(fu_years,chd) ~ grade,
KM_dm data = out_new2)
<-ggsurvplot(KM_dm,
p2data = out_new2,
palette = paletteer_c("grDevices::Set 2", 12),
size = 1.2, conf.int = FALSE,
pval=TRUE,
legend.labs = levels(out_new2$grade),
legend.title = "",
ggtheme = theme_minimal() +
theme(plot.title = element_text(face = "bold")),
title = "Probability of dying",
xlab = "Years",
ylab = "Probability of dying",
legend = "bottom", censor = FALSE)
p2
<- coxph(Surv(fu_years,chd) ~ grade,
KM_cox data = out_new2)
tidy(KM_cox, exponentiate = T, conf.int = T)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
gradeLow job grade | 2.042109 | 0.1636785 | 4.362108 | 0.0000129 | 1.481684 | 2.814507 |
- The unadjusted Cox regression showS that participants in the low job grade had more than twice the hazard of CHD mortality compared with those in the high job grade \((HR = 2.04; 95\% CI: 1.48–2.82; p < 0.001)\).
Running full model in R
<- coxph(Surv(fu_years,chd) ~ grade + agein + sbp + chol + smok, data = out_new2)
cox_adj
summary(cox_adj)
## Call:
## coxph(formula = Surv(fu_years, chd) ~ grade + agein + sbp + chol +
## smok, data = out_new2)
##
## n= 1651, number of events= 151
## (26 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## gradeLow job grade 0.100190 1.105381 0.178687 0.561 0.575000
## agein 0.095873 1.100620 0.013930 6.882 0.00000000000588 ***
## sbp 0.016100 1.016230 0.003615 4.453 0.00000844870978 ***
## chol 0.007367 1.007394 0.001633 4.513 0.00000640522967 ***
## smokEx-smoker 0.769555 2.158804 0.332196 2.317 0.020527 *
## smok1-14 cig/day 1.050483 2.859032 0.349535 3.005 0.002653 **
## smok15-24 cig/day 1.320346 3.744718 0.348945 3.784 0.000154 ***
## smok25+ cig/day 1.265059 3.543302 0.398100 3.178 0.001484 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## gradeLow job grade 1.105 0.9047 0.7788 1.569
## agein 1.101 0.9086 1.0710 1.131
## sbp 1.016 0.9840 1.0091 1.023
## chol 1.007 0.9927 1.0042 1.011
## smokEx-smoker 2.159 0.4632 1.1258 4.140
## smok1-14 cig/day 2.859 0.3498 1.4411 5.672
## smok15-24 cig/day 3.745 0.2670 1.8897 7.421
## smok25+ cig/day 3.543 0.2822 1.6238 7.732
##
## Concordance= 0.759 (se = 0.018 )
## Likelihood ratio test= 135.4 on 8 df, p=<0.0000000000000002
## Wald test = 126.8 on 8 df, p=<0.0000000000000002
## Score (logrank) test = 136.6 on 8 df, p=<0.0000000000000002
Running full model in Stata
quietly use whitehal_new
format timein %td
format timeout %td
quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)
stcox i.grade agein sbp chol i.smok,efron
## Failure _d: chd
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
##
## Iteration 0: Log likelihood = -1094.0566
## Iteration 1: Log likelihood = -1028.4682
## Iteration 2: Log likelihood = -1026.4085
## Iteration 3: Log likelihood = -1026.3755
## Iteration 4: Log likelihood = -1026.3755estimates:
## Refining
## Iteration 0: Log likelihood = -1026.3755
##
## Cox regression with no ties
## of subjects = 1,651 Number of obs = 1,651
## No. of failures = 151
## No. at risk = 27,200.0163
## Time chi2(8) = 135.36
## LR chi2 = 0.0000
## Log likelihood = -1026.3755 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## grade |
## Low job g.. | 1.105381 .1975177 0.56 0.575 .7787771 1.568957
## agein | 1.10062 .0153318 6.88 0.000 1.070976 1.131083
## sbp | 1.01623 .0036737 4.45 0.000 1.009055 1.023456
## chol | 1.007394 .0016447 4.51 0.000 1.004176 1.010623
## |
## smok |
## Ex-smoker | 2.158804 .7171464 2.32 0.021 1.125764 4.139798
## 1-14 cig/.. | 2.859032 .999333 3.01 0.003 1.4411 5.672103
## 15-24 cig.. | 3.744718 1.306701 3.78 0.000 1.889717 7.420645day | 3.543302 1.410588 3.18 0.001 1.623846 7.731637
## 25+ cig/ ## ------------------------------------------------------------------------------
- after adjusting for age at entry, systolic blood pressure, cholesterol, and smoking, the association is insignificant (HR for low vs high = \(1.11\); equivalent to HR for high vs low = \(0.905; 95\% CI: 0.637–1.284; p = 0.575)\).
Question j): Using a Parametric Model
- Fit an appropriate Parametric model or Frailty model whichever is most suitable and distinguish how you will interpret the results for a single categorical and also a single continuous variable.
Intercept only model
We will begin by estimating intercept only parametric regression models (i.e., without covariates). But first, it’s helpful to estimate the hazard function (among all patients) using nonparametric techniques. We can do this using the kernel density estimator from the muhaz
package.
library("muhaz")
<- muhaz(out_new2$fu_years, out_new2$chd)
kernel_haz_est <- data.table(time = kernel_haz_est$est.grid,
kernel_haz est = kernel_haz_est$haz.est,
method = "Kernel density")
Then we can use flexsurv
to estimate intercept only models for a range of probability distributions. The hazard function for each fitted model is returned using summary.flexsurvreg()
.
<- c("exp", "weibull", "gompertz", "gamma",
dists "lognormal", "llogis", "gengamma")
<- c("Exponential", "Weibull (AFT)",
dists_long "Gompertz", "Gamma", "Lognormal", "Log-logistic",
"Generalized gamma")
<- vector(mode = "list", length = length(dists))
parametric_haz for (i in 1:length(dists)){
<- flexsurvreg(Surv(fu_years, chd) ~ 1, data = out_new2, dist = dists[i])
fit <- summary(fit, type = "hazard",
parametric_haz[[i]] ci = FALSE, tidy = TRUE)
$method <- dists_long[i]
parametric_haz[[i]]
}<- rbindlist(parametric_haz) parametric_haz
We can plot the hazard functions from the parametric models and compare them to the kernel density estimate.
<- rbind(kernel_haz, parametric_haz)
haz := factor(method,
haz[, method levels = c("Kernel density",
dists_long))]<- length(dists)
n_dists ggplot(haz, aes(x = time, y = est, col = method, linetype = method)) +
geom_line() +
xlab("years") + ylab("Hazard") +
scale_colour_manual(name = "",
values = c("black", rainbow(n_dists))) +
scale_linetype_manual(name = "",
values = c(1, rep_len(2:6, n_dists)))
Weibull PH
The hazard and the survival function of the Weibull distribution are given by, \(h(t) = \gamma \lambda t^{\gamma-1}\) and \(S(t) = e^{-\lambda t^\gamma}\) respectively. The hazard function of a patient with covariates \(X_1,X_2,X_3,\cdots X_p\) is given by \[\begin{equation} h(t|X_i) =\gamma \lambda t^{\gamma-1} \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})=\gamma \lambda t^{\gamma-1}\text{exp}^{(\beta^{T} X)} \end{equation}\] making use of equation . The corresponding survival function is given by: \[\begin{equation} S_i(t)=\text{exp}\{\exp(\beta^{T} x_i)\lambda t^{\gamma}\} \end{equation}\] The model is fitted using the likelihood function of the \(n\) observations and maximizing this function with respect to unknown parameters \((\beta_1,\beta_2,\beta_3,\cdots,\beta_p),\lambda\) and \(\gamma\) where the likelihood function for the hazard and survivor function is given by : \[\begin{equation} L(\alpha,\mu,\sigma)=\prod_{i=1}^{n}\{h_i(t_i)\}^{\sigma_i}S_i(t_i) \end{equation}\] Taking logarithms in equation we get \[\begin{equation} \log L(\alpha,\mu,\sigma)=\sum^{n}_{i=1}\{\sigma_i \log h_i(t_i)+\log S_i(t_i)\} \end{equation}\] Now substituting (\(\ref{gam}\)) and simplifying we get \[\begin{equation} \log L(\alpha,\mu,\sigma)=\sum^n_{i=1}[\sigma_i\{\beta` x_i+\log(\lambda\gamma)+\gamma\log (t_i)\} -\lambda\text{exp}(\beta` x_i)t^{\gamma} \end{equation}\]
The additional parameter p is called a shape parameter and determines the shape of the hazard function.
- \(p >1\) hazard increases over time
- \(p > 1\) constant hazard (exponential model)
- \(p <1\) hazard decreases over time
Cholestrol alone
quietly use whitehal_new
format timein %td
format timeout %td
quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)
streg chol, distrib(weibull)
## Failure _d: chd
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
## model:
## Fitting constant-only
## Iteration 0: Log likelihood = -616.34373
## Iteration 1: Log likelihood = -610.15638
## Iteration 2: Log likelihood = -610.04232
## Iteration 3: Log likelihood = -610.04228
## model:
## Fitting full
## Iteration 0: Log likelihood = -610.04228
## Iteration 1: Log likelihood = -603.49255
## Iteration 2: Log likelihood = -603.21873
## Iteration 3: Log likelihood = -603.21823
## Iteration 4: Log likelihood = -603.21823
##
## Weibull PH regression
## of subjects = 1,651 Number of obs = 1,651
## No. of failures = 151
## No. at risk = 27,200.0163
## Time chi2(1) = 13.65
## LR chi2 = 0.0002
## Log likelihood = -603.21823 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## chol | 1.006151 .0016079 3.84 0.000 1.003004 1.009307_cons | .0006045 .0002773 -16.16 0.000 .000246 .0014853
##
## -------------+----------------------------------------------------------------
## /ln_p | .2929458 .0788982 3.71 0.000 .1383081 .4475834
## -------------+----------------------------------------------------------------p | 1.34037 .1057528 1.148329 1.564527
## p | .7460626 .058863 .6391709 .8708303
## 1/
## ------------------------------------------------------------------------------_cons estimates baseline hazard. ## Note:
Grade alone
quietly use whitehal_new
format timein %td
format timeout %td
quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)
streg i.grade, distrib(weibull)
## Failure _d: chd
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
## model:
## Fitting constant-only
## Iteration 0: Log likelihood = -627.95275
## Iteration 1: Log likelihood = -621.65709
## Iteration 2: Log likelihood = -621.54148
## Iteration 3: Log likelihood = -621.54144
## model:
## Fitting full
## Iteration 0: Log likelihood = -621.54144
## Iteration 1: Log likelihood = -613.22222
## Iteration 2: Log likelihood = -612.60824
## Iteration 3: Log likelihood = -612.60625
## Iteration 4: Log likelihood = -612.60625
##
## Weibull PH regression
## of subjects = 1,677 Number of obs = 1,677
## No. of failures = 154
## No. at risk = 27,605.3707
## Time chi2(1) = 17.87
## LR chi2 = 0.0000
## Log likelihood = -612.60625 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## grade |
## Low job g.. | 2.038542 .333555 4.35 0.000 1.479253 2.809293_cons | .0015878 .0005093 -20.10 0.000 .0008468 .0029773
##
## -------------+----------------------------------------------------------------
## /ln_p | .305599 .0777084 3.93 0.000 .1532934 .4579047
## -------------+----------------------------------------------------------------p | 1.357438 .1054843 1.165667 1.580758
## p | .736682 .0572464 .6326078 .857878
## 1/
## ------------------------------------------------------------------------------_cons estimates baseline hazard. ## Note:
Weibull AFT
Cholestrol alone
quietly use whitehal_new
format timein %td
format timeout %td
quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)
streg chol, distrib(weibull) time
## Failure _d: chd
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
## model:
## Fitting constant-only
## Iteration 0: Log likelihood = -616.34373
## Iteration 1: Log likelihood = -610.15638
## Iteration 2: Log likelihood = -610.04232
## Iteration 3: Log likelihood = -610.04228
## model:
## Fitting full
## Iteration 0: Log likelihood = -610.04228
## Iteration 1: Log likelihood = -603.49255
## Iteration 2: Log likelihood = -603.21873
## Iteration 3: Log likelihood = -603.21823
## Iteration 4: Log likelihood = -603.21823
##
## Weibull AFT regression
## of subjects = 1,651 Number of obs = 1,651
## No. of failures = 151
## No. at risk = 27,200.0163
## Time chi2(1) = 13.65
## LR chi2 = 0.0002
## Log likelihood = -603.21823 Prob >
##
## ------------------------------------------------------------------------------
## _t | Coefficient Std. err. z P>|z| [95% conf. interval]
## -------------+----------------------------------------------------------------
## chol | -.0045749 .0012475 -3.67 0.000 -.00702 -.0021298_cons | 5.529128 .3343822 16.54 0.000 4.873751 6.184505
##
## -------------+----------------------------------------------------------------
## /ln_p | .2929458 .0788982 3.71 0.000 .1383081 .4475834
## -------------+----------------------------------------------------------------p | 1.34037 .1057528 1.148329 1.564527
## p | .7460626 .058863 .6391709 .8708303
## 1/ ## ------------------------------------------------------------------------------
Grade alone
quietly use whitehal_new
format timein %td
format timeout %td
quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)
streg i.grade, distrib(weibull) time
## Failure _d: chd
## Analysis time _t: (timeout-origin)/365.25
## Origin: time timeinon or after: time timein
## Enter
## model:
## Fitting constant-only
## Iteration 0: Log likelihood = -627.95275
## Iteration 1: Log likelihood = -621.65709
## Iteration 2: Log likelihood = -621.54148
## Iteration 3: Log likelihood = -621.54144
## model:
## Fitting full
## Iteration 0: Log likelihood = -621.54144
## Iteration 1: Log likelihood = -613.22222
## Iteration 2: Log likelihood = -612.60824
## Iteration 3: Log likelihood = -612.60625
## Iteration 4: Log likelihood = -612.60625
##
## Weibull AFT regression
## of subjects = 1,677 Number of obs = 1,677
## No. of failures = 154
## No. at risk = 27,605.3707
## Time chi2(1) = 17.87
## LR chi2 = 0.0000
## Log likelihood = -612.60625 Prob >
##
## ------------------------------------------------------------------------------
## _t | Coefficient Std. err. z P>|z| [95% conf. interval]
## -------------+----------------------------------------------------------------
## grade |
## Low job g.. | -.5246907 .1258083 -4.17 0.000 -.7712705 -.278111_cons | 4.748192 .1652015 28.74 0.000 4.424403 5.071981
##
## -------------+----------------------------------------------------------------
## /ln_p | .305599 .0777084 3.93 0.000 .1532934 .4579047
## -------------+----------------------------------------------------------------p | 1.357438 .1054843 1.165667 1.580758
## p | .736682 .0572464 .6326078 .857878
## 1/ ## ------------------------------------------------------------------------------
In R : Grade
AFT model
<- Surv(fu_years, chd)~ grade
mf <- survreg(mf, ,data = out_new2, dist="weibull")
wf summary(wf)
##
## Call:
## survreg(formula = mf, data = out_new2, dist = "weibull")
## Value Std. Error z p
## (Intercept) 4.7482 0.1652 28.74 < 0.0000000000000002
## gradeLow job grade -0.5247 0.1258 -4.17 0.000030
## Log(scale) -0.3056 0.0777 -3.93 0.000084
##
## Scale= 0.737
##
## Weibull distribution
## Loglik(model)= -937.7 Loglik(intercept only)= -946.7
## Chisq= 17.87 on 1 degrees of freedom, p= 0.000024
## Number of Newton-Raphson Iterations: 8
## n= 1677
By default R reports coefficients in the accelerated failure time metric. A simple function to convert them to a proportional hazards metric, which you can download from this website as shown below. I also work with p = 1/scale.
PH model
source("https://grodri.github.io/survival/aft2ph.R")
<- aft2ph(wf); phfit phfit
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -6.4453753 | 0.3207430 | -20.095142 | 0.0000000 |
gradeLow job grade | 0.7122351 | 0.1636243 | 4.352870 | 0.0000134 |
log.p | 0.3055990 | 0.0777084 | 3.932638 | 0.0000840 |
To obtain hazard ratios we exponentiate the coefficients. We also report the Weibull parameter p, while R uses scale = 1/p.
transmute(phfit, term, hazard.ratio = exp(estimate))
term | hazard.ratio |
---|---|
(Intercept) | 0.0015878 |
gradeLow job grade | 2.0385425 |
log.p | 1.3574379 |
exp(tail(phfit,1)$estimate)
## [1] 1.357438
In R : chol
AFT model
<- Surv(fu_years, chd)~ chol
mf <- survreg(mf, ,data = out_new2, dist="weibull")
wf summary(wf)
##
## Call:
## survreg(formula = mf, data = out_new2, dist = "weibull")
## Value Std. Error z p
## (Intercept) 5.52913 0.33438 16.54 < 0.0000000000000002
## chol -0.00457 0.00125 -3.67 0.00025
## Log(scale) -0.29295 0.07890 -3.71 0.00020
##
## Scale= 0.746
##
## Weibull distribution
## Loglik(model)= -922.1 Loglik(intercept only)= -928.9
## Chisq= 13.65 on 1 degrees of freedom, p= 0.00022
## Number of Newton-Raphson Iterations: 8
## n=1651 (26 observations deleted due to missingness)
PH model
source("https://grodri.github.io/survival/aft2ph.R")
<- aft2ph(wf); phfit phfit
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -7.4110780 | 0.4586572 | -16.158206 | 0.0000000 |
chol | 0.0061320 | 0.0015981 | 3.837157 | 0.0001245 |
log.p | 0.2929458 | 0.0788982 | 3.712958 | 0.0002049 |
transmute(phfit, term, hazard.ratio = exp(estimate))
term | hazard.ratio |
---|---|
(Intercept) | 0.0006045 |
chol | 1.0061509 |
log.p | 1.3403701 |
exp(tail(phfit,1)$estimate)
## [1] 1.34037
- The unadjusted Weibull regression like the Cox model shows that participants in the low job grade had more than twice the hazard of CHD mortality compared with those in the high job grade \((HR = 2.04, p < 0.001)\).
Comment on Output : Cholestrol
- for cholestrol , a continuous variable , For the Weibull model, the estimated scale parameter was \(\sigma = 0.746\), corresponding to a shape parameter \(p = 1/\sigma \approx 1.34\), indicating that the hazard increases over time.
- The AFT coefficient for cholesterol was \(\beta = -0.00457\) (\(p<0.001\)), suggesting that a one-unit increase in cholesterol shortens the expected survival time by about 0.46%.
- Converting to the proportional hazards metric, the corresponding coefficient is \(\gamma = -\beta/\sigma = 0.00613\), giving a hazard ratio (HR) of 1.0061 with 95% CI (1.0029, 1.0095).
- Thus, each unit increase in cholesterol is associated with a 0.6% increase in the hazard of the event.
- Notice that the Weibull output has a parameter p that the exponential distribution does not have . The hazard function for a Weibull distribution is \(h(t)=\lambda pt^{p-1}\) , if \(p=1\) then the Weibull distribution is also an exponential distribution \(h(t)=\lambda\).
- By default Stata exponentiates the coefficients to show relative risks.
- we Use the option
nohr
for no hazard ratios to obtain the coefficients. - The relationship between the hazard ratio parameter \(\beta_j\) and the AFT parameter \(\alpha_j\) is \(\beta_j = -\alpha_jp\)
Weibull PH vs AFT equivalence
Weibull models can be written in two equivalent forms: proportional hazards (PH) and accelerated failure time (AFT).
In the AFT form:
\[ \log T = \mu + \sigma \varepsilon + x^\top \gamma, \qquad \varepsilon \sim \text{EV(type I)}, \]
with \(\sigma = 1/p\) for Weibull shape parameter \(p>0\).
The survival is
\[ S(t \mid x) = \exp\!\left\{ - \exp\!\left(\tfrac{\log t - \mu - x^\top \gamma}{\sigma}\right) \right\} = \exp\!\left\{ -\, t^{1/\sigma}\, \exp\!\left(-\tfrac{\mu+x^\top \gamma}{\sigma}\right) \right\}. \]
Let \(p=1/\sigma\). Then
\[ S(t \mid x) = \exp\!\left\{-\,\Lambda_0(t)\,\exp\!\big(-p\,x^\top\gamma\big)\right\}, \qquad \Lambda_0(t)=\exp(-p\mu)\,t^p. \]
In the PH form:
\[ S(t \mid x) = \exp\!\left\{-\,\Lambda_0(t)\,\exp\!\big(x^\top \beta_{\text{PH}}\big)\right\}. \]
By comparison,
\[ \beta_{\text{PH}} = -p\,\gamma \qquad\Longleftrightarrow\qquad \gamma = -\tfrac{\beta_{\text{PH}}}{p}. \]
Adjusting for other variables
AFT model
<- Surv(fu_years, chd)~ grade + agein + sbp + chol + smok
mf <- survreg(mf, ,data = out_new2, dist="weibull")
wf summary(wf)
##
## Call:
## survreg(formula = mf, data = out_new2, dist = "weibull")
## Value Std. Error z p
## (Intercept) 11.19842 0.88133 12.71 < 0.0000000000000002
## gradeLow job grade -0.06955 0.12314 -0.56 0.57219
## agein -0.06558 0.01049 -6.25 0.0000000004
## sbp -0.01108 0.00259 -4.28 0.0000188645
## chol -0.00506 0.00118 -4.28 0.0000184859
## smokEx-smoker -0.52886 0.23249 -2.27 0.02292
## smok1-14 cig/day -0.72423 0.24713 -2.93 0.00338
## smok15-24 cig/day -0.90727 0.24848 -3.65 0.00026
## smok25+ cig/day -0.86606 0.28103 -3.08 0.00206
## Log(scale) -0.37098 0.07664 -4.84 0.0000012940
##
## Scale= 0.69
##
## Weibull distribution
## Loglik(model)= -861.7 Loglik(intercept only)= -928.9
## Chisq= 134.4 on 8 degrees of freedom, p= 0.00000000000000000000000035
## Number of Newton-Raphson Iterations: 8
## n=1651 (26 observations deleted due to missingness)
PH model
source("https://grodri.github.io/survival/aft2ph.R")
<- aft2ph(wf); phfit phfit
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -16.2283091 | 1.0505090 | -15.4480445 | 0.0000000 |
gradeLow job grade | 0.1007923 | 0.1784525 | 0.5648131 | 0.5722009 |
agein | 0.0950383 | 0.0138933 | 6.8405973 | 0.0000000 |
sbp | 0.0160579 | 0.0036097 | 4.4485489 | 0.0000086 |
chol | 0.0073287 | 0.0016288 | 4.4993631 | 0.0000068 |
smokEx-smoker | 0.7664033 | 0.3321150 | 2.3076446 | 0.0210189 |
smok1-14 cig/day | 1.0495317 | 0.3495019 | 3.0029359 | 0.0026739 |
smok15-24 cig/day | 1.3147849 | 0.3489098 | 3.7682656 | 0.0001644 |
smok25+ cig/day | 1.2550574 | 0.3978793 | 3.1543668 | 0.0016085 |
log.p | 0.3709845 | 0.0766390 | 4.8406766 | 0.0000013 |
transmute(phfit, term, hazard.ratio = exp(estimate))
term | hazard.ratio |
---|---|
(Intercept) | 0.0000001 |
gradeLow job grade | 1.1060469 |
agein | 1.0997010 |
sbp | 1.0161875 |
chol | 1.0073556 |
smokEx-smoker | 2.1520123 |
smok1-14 cig/day | 2.8563133 |
smok15-24 cig/day | 3.7239499 |
smok25+ cig/day | 3.5080398 |
log.p | 1.4491606 |
exp(tail(phfit,1)$estimate)
## [1] 1.449161
Accounting for missing data in R
- We will impute the missing covariates using the Nelson-Aalen estimate of the cumulative hazard in place of the time variable, as proposed by White and Royston (2009). To generate this, the mice package provides a function nelsonaalen:
library(mice)
$cumHaz <- nelsonaalen(out_new, years,death)
out_newplot(out_new$years,out_new$cumHaz)
The curve is approximately linear, with a slightly increasing slope over time, suggesting a constant, or slightly increasing hazard over follow-up time.
We now impute using mice. To incorporate the outcome, as proposed by White and Royston (2009) we include the event indicator dead and the Nelson Aalen cumulative hazard as covariates in the imputation models. Thus we need to tell mice not to use the Years variable:
colSums(is.na(out_new))
## ethgp ageent death cvdeath alc smokenum
## 0 0 0 0 0 1
## hdlc diabp sysbp chdstart days years
## 17 4 4 28 0 0
## bmi id timein timeout y timebth
## 35 0 0 0 0 0
## time_new years_new smoke_status cumHaz
## 0 0 1 0
library(mice)
<- make.method(out_new)
meth "hdl"] <- "norm" # linear regression
meth["diabp"] <- "norm"
meth["sysbp"] <- "norm"
meth["chdstart"] <- "norm"
meth["smoke3"] <- "polyreg" # multinomial logistic
meth[<- make.predictorMatrix(out_new)
predMat
"years"] <- 0
predMat[,<- mice(out_new, m=10, seed=73347221,
cumHazImp predictorMatrix=predMat, printFlag=FALSE)
options(scipen=999)
<-coxph(Surv(years,death)~smoke_status+ageent+sysbp+bmi+diabp+ hdlc+chdstart,data=out_new)
full_cox_m1
<- with(data = cumHazImp,
cumHazFit exp = coxph(Surv(years,death)~smoke_status+ageent+sysbp+bmi+diabp+ hdlc+chdstart))
<- pool(cumHazFit)
cumHazPooled #estimates
cbind(coef(full_cox_m1), summary(cumHazPooled)[,2])
## [,1] [,2]
## smoke_statusCurrent-smoker 0.897908020 0.599832976
## smoke_statusex-smoker 0.584612792 0.142356724
## ageent 0.035335774 0.050013250
## sysbp 0.007397350 0.010321798
## bmi -0.035497207 -0.032627229
## diabp 0.001390669 -0.003686936
## hdlc -0.360502622 -0.480521920
## chdstart 0.608962101 0.708228579
summary(cumHazPooled, conf.int = TRUE, exponentiate = TRUE)
term | estimate | std.error | statistic | df | p.value | 2.5 % | 97.5 % | conf.low | conf.high |
---|---|---|---|---|---|---|---|---|---|
smoke_statusCurrent-smoker | 1.8218145 | 0.2475278 | 2.4232952 | 77.55088 | 0.0177112 | 1.1129320 | 2.982220 | 1.1129320 | 2.982220 |
smoke_statusex-smoker | 1.1529879 | 0.3005212 | 0.4736995 | 77.66639 | 0.6370431 | 0.6338294 | 2.097380 | 0.6338294 | 2.097380 |
ageent | 1.0512850 | 0.0347711 | 1.4383583 | 77.20378 | 0.1543740 | 0.9809617 | 1.126650 | 0.9809617 | 1.126650 |
sysbp | 1.0103753 | 0.0056265 | 1.8345020 | 72.74352 | 0.0706686 | 0.9991080 | 1.021770 | 0.9991080 | 1.021770 |
bmi | 0.9678993 | 0.0300339 | -1.0863470 | 63.63170 | 0.2814237 | 0.9115275 | 1.027757 | 0.9115275 | 1.027757 |
diabp | 0.9963199 | 0.0104782 | -0.3518688 | 71.38604 | 0.7259733 | 0.9757217 | 1.017353 | 0.9757217 | 1.017353 |
hdlc | 0.6184605 | 0.3297329 | -1.4573065 | 69.51321 | 0.1495347 | 0.3203842 | 1.193859 | 0.3203842 | 1.193859 |
chdstart | 2.0303914 | 0.2909900 | 2.4338589 | 64.84843 | 0.0177028 | 1.1354803 | 3.630613 | 1.1354803 | 3.630613 |
Obtaining the standard errors
#standard errors
cbind(diag(vcov(full_cox_m1))^0.5, summary(cumHazPooled)[,3])
## [,1] [,2]
## smoke_statusCurrent-smoker 0.297974797 0.247527822
## smoke_statusex-smoker 0.346474409 0.300521191
## ageent 0.040441780 0.034771064
## sysbp 0.006666571 0.005626485
## bmi 0.032282332 0.030033891
## diabp 0.013074294 0.010478158
## hdlc 0.343395526 0.329732912
## chdstart 0.322437634 0.290989988
- Imputation generally modifies the coefficients slightly, sometimes increasing effect sizes (e.g., sex-smoker, hdlc, chdstart) and sometimes attenuating them (e.g., non-smoker). Overall, it reflects the impact of accounting for missing data on the estimated associations.
- Imputation mostly attenuates effect sizes slightly and reduces standard errors, improving precision for some variables while leaving others nearly unchanged.
Accounting for missing data in STATA
- Next we need to mi register variables. The minimum that we need do is register the variable(s) we intend to impute.
quietly use trinmlsh
quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)
recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)
label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
sts gen cumHaz = na
mi set flong
mi register imputed hdlc smoke3 diabp sysbp chdstart bmi
mi impute chained (regress) hdlc diabp sysbp chdstart (mlogit) smoke3 = death cumHaz ageent, add(10) rseed(56123236)
mi estimate, post: stcox hdlc diabp sysbp chdstart ageent i.smoke3
est store coxMI_NA
quietly stcox hdlc diabp sysbp chdstart ageent i.smoke3
est store coxCCA
est table coxCCA coxMI_NA, se
## (80 differences between smokenum and smoke3)
##
##
##
##
## m=0 obs now marked as incomplete)
## (44
##
##
## Conditional models:mlogit smoke3 diabp sysbp hdlc chdstart death cumHaz
## smoke3:
## ageentregress diabp i.smoke3 sysbp hdlc chdstart death cumHaz
## diabp:
## ageentregress sysbp i.smoke3 diabp hdlc chdstart death cumHaz
## sysbp:
## ageentregress hdlc i.smoke3 diabp sysbp chdstart death cumHaz
## hdlc:
## ageentregress chdstart i.smoke3 diabp sysbp hdlc death cumHaz
## chdstart:
## ageent
##
## Performing chained iterations ...
##
## Multivariate imputation Imputations = 10
## Chained equations added = 10m=1 through m=10 updated = 0
## Imputed:
##
## Initialization: monotone Iterations = 100in = 10
## burn-
##
## hdlc: linear regression
## diabp: linear regression
## sysbp: linear regression
## chdstart: linear regressionlogistic regression
## smoke3: multinomial
##
## ------------------------------------------------------------------m
## | Observations per
## |----------------------------------------------
## Variable | Complete Incomplete Imputed | Total
## -------------------+-----------------------------------+----------
## hdlc | 301 17 17 | 318
## diabp | 314 4 4 | 318
## sysbp | 314 4 4 | 318
## chdstart | 290 28 28 | 318
## smoke3 | 317 1 1 | 318
## ------------------------------------------------------------------m
## (Complete + Incomplete = Total; Imputed is the minimum across of the number of filled-in observations.)
##
##
## estimates Imputations = 10
## Multiple-imputation for ties Number of obs = 318
## Cox regression: Breslow method
## Average RVI = 0.0837
## Largest FMI = 0.2277sample DF: min = 186.92
## DF adjustment: Large
## avg = 70,774.52max = 241,467.48
## F test: Equal FMI F( 7, 9315.1) = 3.09
## Model VCE type: OIM Prob > F = 0.0029
## Within
##
## ------------------------------------------------------------------------------
## _t | Coefficient Std. err. t P>|t| [95% conf. interval]
## -------------+----------------------------------------------------------------
## hdlc | -.4200718 .3359779 -1.25 0.212 -1.080977 .2408333
## diabp | -.0004034 .010552 -0.04 0.970 -.0210899 .0202831
## sysbp | .0088906 .0055906 1.59 0.112 -.0020686 .0198499
## chdstart | .8168205 .3059592 2.67 0.008 .2132434 1.420397
## ageent | .0582103 .0342695 1.70 0.089 -.0089571 .1253777
## |
## smoke3 |
## Ex-smoker | .1276731 .3007495 0.42 0.671 -.461788 .7171342s~r | .6564548 .244325 2.69 0.007 .1775793 1.13533
## Current-
## ------------------------------------------------------------------------------
##
##
##
##
##
## ----------------------------------------
## Variable | coxCCA coxMI_NA
## -------------+--------------------------
## hdlc | -.40885206 -.4200718
## | .09295799 .33597787
## diabp | -.00029629 -.00040343
## | .0031699 .01055201
## sysbp | .00872926 .00889064
## | .00167964 .00559062
## chdstart | .80339295 .81682047
## | .08218443 .30595922
## ageent | .05725468 .05821029
## | .01040884 .03426947
## |
## smoke3 |
## Ex-smoker | .16038246 .12767312
## | .09116201 .30074947 s~r | .67483635 .65645484
## Current-
## | .07423566 .24432501
## ---------------------------------------- ## Legend: b/se
Part II: Poisson and Cox Regression
Question b)
- Examine the effect of child’s HIV status (chiv) using strate and stcox. What is the HR and 95% CI?
quietly use paedsurv_new
quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died) scale(365.25)
strate chiv
## Failure _d: died
## Analysis time _t: (date_out-origin)/365.25
## Origin: time date_enron or after: time date_enr
## Enter
##
## Estimated failure ratesof records = 590
## Number
##
## +------------------------------------------------------------------+
## | chiv D Y Rate Lower Upper |
## |------------------------------------------------------------------|
## | HIV exposed 16 1.2e+03 0.0131388 0.0080492 0.0214464 |
## | HIV positive 27 638.3912 0.0422938 0.0290043 0.0616724 |
## +------------------------------------------------------------------+
## Notes: Rate = D/Y = failures/person-time.of 95% confidence intervals. ## Lower and Upper are bounds
<-haven::read_dta("paedsurv_new.dta") |>
out_new3mutate(across(where(is.labelled), ~as_factor(.))) |>
mutate(
age_enrol = as.numeric(difftime(date_enr, dob, units = "days")) / 365.25,
overall_survival = as.numeric(difftime(date_out, date_enr, units = "days")), # follow-up in days
overall_survival = ifelse(is.na(overall_survival), NA_real_, pmax(overall_survival, 0)),
person_years = overall_survival / 365.25,
died = as.integer(died)
)
|>
out_new3 filter(!is.na(chiv)) |>
group_by(chiv) |>
summarise(person_years=sum(person_years), died=sum(died)) |>
mutate(rate=died/person_years)
chiv | person_years | died | rate |
---|---|---|---|
HIV exposed | 1217.7708 | 16 | 0.0131388 |
HIV positive | 638.3912 | 27 | 0.0422938 |
<- survRate(Surv(person_years,died) ~ chiv,
KM_rate data = out_new3)
KM_rate
chiv | tstop | event | rate | lower | upper | |
---|---|---|---|---|---|---|
chiv=HIV exposed | HIV exposed | 1217.7708 | 16 | 0.0131388 | 0.0075099 | 0.0213365 |
chiv=HIV positive | HIV positive | 638.3912 | 27 | 0.0422938 | 0.0278719 | 0.0615353 |
quietly use paedsurv_new
quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died)
stcox i.chiv
## Failure _d: died
## Analysis time _t: (date_out-origin)
## Origin: time date_enron or after: time date_enr
## Enter
##
## Iteration 0: Log likelihood = -262.60562
## Iteration 1: Log likelihood = -253.86693
## Iteration 2: Log likelihood = -253.75738
## Iteration 3: Log likelihood = -253.75733estimates:
## Refining
## Iteration 0: Log likelihood = -253.75733
## for ties
## Cox regression with Breslow method
## of subjects = 590 Number of obs = 590
## No. of failures = 43
## No. at risk = 677,963.191
## Time chi2(1) = 17.70
## LR chi2 = 0.0000
## Log likelihood = -253.75733 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## chiv |
## HIV posit.. | 3.671663 1.159636 4.12 0.000 1.977081 6.818692 ## ------------------------------------------------------------------------------
- A crude Cox model showed that HIV-positive children had a hazard ratio (HR) of \(3.67 (95\% CI: 1.98–6.82, p < 0.001)\) compared with HIV-exposed children. This indicates that HIV-positive children had over three times the hazard of death when compared on the same follow-up time scale.
- Consider the possible confounding effect of current age on the relationship between child’s HIV status and mortality. Before doing any analyses, would you expect a Cox model controlling for current age to give different results from the one controlling for time in the study?
quietly use paedsurv_new
quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died)
gen age_entry = (date_enr- dob) / 365.25
tabstat age_entry, by(chiv) stats(mean sd min max)
for variables: age_entry
## Summary variable: chiv (Child's HIV status)
## Group
## SD Min Max
## chiv | Mean
## -----------------+----------------------------------------
## HIV exposed | .3238306 .6747565 0 6.795346
## HIV positive | 4.331711 4.145623 .0054757 13.18275
## -----------------+----------------------------------------
## Total | 1.532988 2.978616 0 13.18275 ## ----------------------------------------------------------
- Prior to age adjustment, substantial differences were observed in the mean age at study entry across HIV groups. HIV-exposed children enrolled at a mean age of 0.32 years, while HIV-positive children enrolled much later, at 4.33 years. Because older age is associated with lower mortality risk in this cohort, adjusting for age is expected to influence the estimated effect of HIV status.
d) Hint: Remember Question 1e) from Part I. Are there large differences in age at entry between the HIV groups? You can calculate age at entry with the following command
gen age_entry = (date_enr- dob) / 365.25
- There is a substantial difference between the groups. HIV-exposed children entered the study at a mean age of \(0.32\) years (SD = \(0.67\)), whereas HIV-positive children entered at a mean age of \(4.33\) years (SD = \(4.15\)). The overall mean age at entry was \(1.53\) years. This suggests that HIV-positive children tended to be enrolled considerably later than HIV-exposed children.
Question e)
- Now perform a Cox regression controlling for current age to investigate whether age is a confounder.
quietly use paedsurv_new
quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died) scale(365.25)
gen age_entry = (date_enr- dob) / 365.25
stcox i.chiv age_entry,efron
## Failure _d: died
## Analysis time _t: (date_out-origin)/365.25
## Origin: time date_enron or after: time date_enr
## Enter
##
## Iteration 0: Log likelihood = -262.59989
## Iteration 1: Log likelihood = -252.33873
## Iteration 2: Log likelihood = -250.89864
## Iteration 3: Log likelihood = -250.89667estimates:
## Refining
## Iteration 0: Log likelihood = -250.89667
## for ties
## Cox regression with Efron method
## of subjects = 590 Number of obs = 590
## No. of failures = 43
## No. at risk = 1,856.1621
## Time chi2(2) = 23.41
## LR chi2 = 0.0000
## Log likelihood = -250.89667 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## chiv |
## HIV posit.. | 5.435908 1.864223 4.94 0.000 2.775594 10.64605
## age_entry | .8772548 .0526643 -2.18 0.029 .779876 .9867927 ## ------------------------------------------------------------------------------
- After adjusting for age at entry, the HR for HIV-positive versus HIV-exposed increased to 5.435 .
- Age itself was inversely associated with mortality i.e a unit increase in age results in the decrease in hazard of death by approximately 13%.
- This suggests a strengthening of the HIV effect after age adjustment.
Question f)
- With age as the analysis scale, use a Poisson model to assess the relationship between child’s HIV status and mortality. How do the results compare to those from using a Cox model?
<- glm(died ~ chiv+age_enrol+offset(log(person_years)), data = out_new3,
mod_poi family = poisson(link="log"))
tidy(mod_poi , exponentiate = T,conf.int = T)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.0139408 | 0.2509078 | -17.029905 | 0.0000000 | 0.0081643 | 0.0219746 |
chivHIV positive | 5.3086828 | 0.3459787 | 4.824991 | 0.0000014 | 2.6968172 | 10.5843637 |
age_enrol | 0.8497478 | 0.0640825 | -2.540721 | 0.0110624 | 0.7406717 | 0.9545660 |
- With age as the analysis scale, The estimated risk ratio for mortality comparing HIV-positive to HIV-negative is \(5.309 (95\%CI[2.697-10.584](P<0.001)\), with age being protective i.e a one year increase in age at entry is associated with \(RR=0.85(95\%CL 0.7417- 0.955,P=0.011)\).
- The results are closely identical to those from the cox model \(HR = 5.44, 95\% CI 2.78–10.65, p < 0.001; age, HR =0.88, 95\% CI 0.78–0.99)\)
Question(g)
In order to make a like-for-like comparison with the Cox model, what would be a more appropriate analysis using Poisson regression?
- The baseline hazard is constant in a Poisson regression, \(exp(\beta_0)\)
- If we add a categorical variable for time, e.g. time-since-entry in 1-year bands, then the baseline hazard is a step function of time.
- The hazard is piece-wise constant in 1-year bands.
\[\lambda =\exp(\beta_0+\beta_3t_{[1,2)}+\beta_4t_{[2,3)}+...)\exp(\beta_1X)\]
- where \(t_[1,2)\) is an indicator for time being in the interval \([1,2)\). Note that \(t_[0,1)\)% if left out from the equation (it is assumed to be the reference time band).
- In both cases, the β parameters are interpreted as log rate ratios.
- Both models are multiplicative (i.e. both assume proportional hazards).
- In Poisson regression, follow-up time is classified into bands and a separate rate parameter is estimated for each band (or smoothed!), thereby allowing for the possibility that the rate is changing with time.
- It is assumed that the rate is constant within each band, so if the rate is changing rapidly with time we may have to choose very narrow bands.
- In Cox regression, we essentially choose bands of infinitesimal width; each band is so narrow that it includes only a single event.
- hence we would split time to create 1 year age bands.
Question(h)
Use the stsplit command to create 1-year age bands of current age from 0 to 5 years, and 5-year bands for children aged over 5. Re-run the Poisson model controlling for current age. How do these results compare with those obtained from the Cox model?
- here we fit a piecewise Poisson regression model for comparison.
quietly use paedsurv_new
at entry and exit in years
* age gen age_ent = (date_enr - dob)/365.25
gen age_out = (date_out - dob)/365.25
as timescale
* survival setup with age quietly stset age_out, origin(time age_ent) enter(time age_ent) failure(died) id(idno) scale(365.25)
stsplit ageband, at(0(1)5 10(5)80) trim
streg i.ageband i.chiv age_ent, dist(exp)
none out of range)
## (no observations trimmed because new episodes generated)
## (no
##
##
## Failure _d: died
## Analysis time _t: (age_out-origin)/365.25
## Origin: time age_enton or after: time age_ent
## Enter variable: idno
## ID note: 0.ageband omitted because of collinearity.
##
##
## Iteration 0: Log likelihood = -217.08342
## Iteration 1: Log likelihood = -211.19473
## Iteration 2: Log likelihood = -205.941
## Iteration 3: Log likelihood = -205.91907
## Iteration 4: Log likelihood = -205.91906
##
## Exponential PH regression
## of subjects = 590 Number of obs = 590
## No. of failures = 43
## No. at risk = 5.0819
## Time chi2(2) = 22.33
## LR chi2 = 0.0000
## Log likelihood = -205.91906 Prob >
##
## ------------------------------------------------------------------------------ratio Std. err. z P>|z| [95% conf. interval]
## _t | Haz.
## -------------+----------------------------------------------------------------
## 0.ageband | 1 (omitted)
## |
## chiv |
## HIV posit.. | 5.308681 1.836701 4.82 0.000 2.694557 10.4589
## age_ent | .8497479 .0544654 -2.54 0.011 .7494308 .9634932_cons | 5.091878 1.277595 6.49 0.000 3.113898 8.326292
##
## ------------------------------------------------------------------------------_cons estimates baseline hazard. ## Note:
quietly use paedsurv_new
at entry and exit in years
* age gen age_ent = (date_enr - dob)/365.25
gen age_out = (date_out - dob)/365.25
gen age_entry = (date_enr- dob) / 365.25
as timescale
* survival setup with age quietly stset age_out, origin(time age_ent) enter(time age_ent) failure(died) id(idno)
stsplit ageband, at(0(1)5 10(5)80) trim
quietly stcox i.chiv age_entry , efron
estimates store Cox
quietly streg i.ageband i.chiv age_entry, dist(exp)
estimates store Poisson
/* Compare the estimates */
estimates table Cox Poisson, eform equations(1) b(%9.6f) se(%9.6f) ///
keep(1.chiv age_entry) ///
title("Hazard ratios and standard errors for Cox and Poisson models")
none out of range)
## (no observations trimmed because
## (1,440 observations (episodes) created)
##
##
##
##
##
## for Cox and Poisson models
## Hazard ratios and standard errors
##
## --------------------------------------
## Variable | Cox Poisson
## -------------+------------------------
## chiv |
## HIV posit.. | 5.435126 5.462112
## | 1.863987 1.873629
## age_entry | 0.877228 0.875707
## | 0.052661 0.052621
## -------------------------------------- ## Legend: b/se
he results from the Poisson model closely match those of the Cox model. For both approaches, HIV-positive children have a markedly higher risk of death compared with HIV-exposed children (HR \(\approx 5.4\)). The protective effect of older age at study entry is also consistent across the two models, with hazard ratios of approximately \(0.88\) per year increase. The similarity of results provides reassurance that the piecewise Poisson approach yields estimates that are effectively equivalent to the Cox model in this dataset
Question(i)
Does age confound the relationship between HIV status and mortality?
Yes. Age is a confounder of the HIV–mortality relationship, as it is linked to both HIV status and mortality. After adjusting for age, HIV-positive children still had a substantially higher risk of death (\(HR \approx 5.4\)), while age itself demonstrated an independent protective effect (\(HR \approx 0.88\)).
Question(j)
- Redo part (h) and (i) to refit another Poisson model but splitting time at every event?
- splitting at every event
<- sort(unique(out_new3$person_years[out_new3$died == 1]))
cuts <- survSplit(Surv(person_years, died) ~ ., data = out_new3, cut = cuts, episode = "tgroup")
orca_splitted head(orca_splitted, 15)
idno | sex | chiv | dob | date_enr | date_out | ill3m | stunted | cd4cat | age_enrol | overall_survival | tstart | person_years | died | tgroup |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Male | HIV exposed | 2010-01-11 | 2010-03-01 | 2010-04-03 | No | 1 | NA | 0.1341547 | 33.09961 | 0.0000000 | 0.0906218 | 0 | 1 |
1002 | Female | HIV exposed | 2010-01-11 | 2010-03-01 | 2010-04-03 | No | 1 | NA | 0.1341547 | 33.09961 | 0.0000000 | 0.0906218 | 0 | 1 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.0000000 | 0.1095140 | 0 | 1 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.1095140 | 0.1314168 | 0 | 2 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.1314168 | 0.1834360 | 0 | 3 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.1834360 | 0.1861739 | 0 | 4 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.1861739 | 0.1916496 | 0 | 5 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.1916496 | 0.2655715 | 0 | 6 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.2655715 | 0.2683094 | 0 | 7 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.2683094 | 0.3093771 | 0 | 8 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.3093771 | 0.3203285 | 0 | 9 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.3203285 | 0.3422313 | 0 | 10 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.3422313 | 0.4435318 | 0 | 11 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.4435318 | 0.4955510 | 0 | 12 |
1003 | Female | HIV positive | 2003-08-10 | 2006-05-24 | 2010-03-12 | Yes | 0 | <15% | 2.7871321 | 1388.00000 | 0.4955510 | 0.5694730 | 0 | 13 |
The gnm()
function in the gnm package fits a conditional Poisson on splitted data, where the effects of the time (as a factor variable) can be marginalized (not estimated to improve computational efficiency).
<-out_new3 |>
out_new3filter(chiv %in% names(which(table(chiv) > 0))) %>%
droplevels()
<-coxph(Surv(person_years, died) ~ chiv+age_enrol, data = out_new3)
m2
<- gnm(died ~ chiv+age_enrol, data = orca_splitted,
mod_poi family = poisson, eliminate = factor(person_years))
summary(mod_poi)
##
## Call:
## gnm(formula = died ~ chiv + age_enrol, eliminate = factor(person_years),
## family = poisson, data = orca_splitted)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.15793 -0.06739 -0.05041 -0.04569 3.44316
##
## Coefficients of interest:
## Estimate Std. Error z value Pr(>|z|)
## chivHIV positive 1.69264 0.34294 4.936 0.000000799 ***
## age_enrol -0.13091 0.06003 -2.181 0.0292 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Residual deviance: 493.5 on 18322 degrees of freedom
## AIC: 1537.5
##
## Number of iterations: 4
Compare the estimates obtained from the conditional Poisson with the cox proportional hazard model.
list(cox = ci.exp(m2), poisson = ci.exp(mod_poi))
## $cox
## exp(Est.) 2.5% 97.5%
## chivHIV positive 5.4359084 2.775594 10.6460459
## age_enrol 0.8772548 0.779876 0.9867927
##
## $poisson
## exp(Est.) 2.5% 97.5%
## chivHIV positive 5.4337910 2.7745567 10.6417305
## age_enrol 0.8772983 0.7799259 0.9868274
- Comparing the traditional Cox Hazards model with Poisson regression approach accounting for time split at each event , the results are nearly identical .
- for both models Hiv status is strongly associated with an increased risk of mortality \(RR=5.433;(95\% ~CL ~2.775- 10.642,P=0.000000799)\)
- Similarly, age at entry was protective, with an estimated hazard ratio of \(\text{HR} = 0.88 \; (95\% \; \text{CI}: 0.78\text{--}0.99)\), indicating that each additional year of age was associated with approximately a 12% lower risk of death.
TA better approach would be to flexibly model the baseline hazard by using, for instance, splines with knots \(k\).
$dur <- with(orca_splitted, person_years - tstart)
orca_splitted<- data.frame(person_years = cuts, dur = 1,
newd chiv= "HIV exposed", age_enrol = 0.15)
library(splines)
<- quantile(out_new3$person_years, 1:5/6)
k <- glm(died ~ ns(person_years, knots = k) + chiv+age_enrol,
mod_poi2s data = orca_splitted, family = poisson, offset = log(dur))
round(ci.exp(mod_poi2s), 3)
## exp(Est.) 2.5%
## (Intercept) 0.023 0.009
## ns(person_years, knots = k)1 0.068 0.006
## ns(person_years, knots = k)2 1.946 0.042
## ns(person_years, knots = k)3 1.091 0.000
## ns(person_years, knots = k)4 0.000 0.000
## ns(person_years, knots = k)5 0.000 0.000
## ns(person_years, knots = k)6 0.000 0.000
## chivHIV positive 5.447 2.782
## age_enrol 0.879 0.781
## 97.5%
## (Intercept) 0.058
## ns(person_years, knots = k)1 0.708
## ns(person_years, knots = k)2 90.628
## ns(person_years, knots = k)3 33235098.734
## ns(person_years, knots = k)4 7114190590675165331048666628824260888462646602486644202404688666064840240060260680806844462262400004280404820246068648664082848424640022428064848.000
## ns(person_years, knots = k)5 Inf
## ns(person_years, knots = k)6 Inf
## chivHIV positive 10.667
## age_enrol 0.988
<- data.frame(ci.pred(mod_poi2s, newdata = newd))
blhazs ggplot(blhazs, aes(x = newd$person_years , y = Estimate)) + geom_line() +
geom_ribbon(aes(ymin = X2.5., ymax = X97.5.), alpha = .2) +
scale_y_continuous(trans = "log", breaks = pretty_breaks()) +
theme_classic() + labs(x = "Time (years)", y = "Baseline hazard")
- again we notice that \((HR=5.447)\) Close to what we got from Poisson and Co proportional model.
Compare in Stata
quietly use paedsurv_new
at entry and exit in years
* age gen age_ent = (date_enr - dob)/365.25
gen age_out = (date_out - dob)/365.25
as timescale
* survival setup with age quietly stset age_out, origin(time age_ent) enter(time age_ent) failure(died==1) id(idno)
stsplit, at(failures) riskset(riskset)
quietly tab riskset, gen(interval)
quietly stcox i.chiv age_ent , efron
estimates store Cox
quietly streg age_ent interval* i.chiv, dist(exp)
estimates store Poisson_fine
/* Compare the estimates */
estimates table Cox Poisson_fine, eform equations(1) b(%9.6f) se(%9.6f) ///
keep(1.chiv age_ent) ///
title("Hazard ratios and standard errors for Cox and Poisson models")
## (43 failure times)
## (19,792 observations (episodes) created)
##
##
##
##
##
##
## for Cox and Poisson models
## Hazard ratios and standard errors
##
## --------------------------------------e
## Variable | Cox Poisson~
## -------------+------------------------
## chiv |
## HIV posit.. | 5.435126 5.435126
## | 1.863987 1.863987
## age_ent | 0.877228 0.877228
## | 0.052661 0.052661
## -------------------------------------- ## Legend: b/se
Question k
- Bonus: Fit an appropriate Parametric model or Frailty model whichever is most suitable and distinguish how you will interpret the results for a single categorical and also a single continuous variable.
Intercept only model
library("muhaz")
<- muhaz(out_new3$person_years, out_new3$died)
kernel_haz_est <- data.table(time = kernel_haz_est$est.grid,
kernel_haz est = kernel_haz_est$haz.est,
method = "Kernel density")
<- c("exp", "weibull", "gompertz", "gamma",
dists "lognormal", "llogis", "gengamma")
<- c("Exponential", "Weibull (AFT)",
dists_long "Gompertz", "Gamma", "Lognormal", "Log-logistic",
"Generalized gamma")
<- vector(mode = "list", length = length(dists))
parametric_haz for (i in 1:length(dists)){
<- flexsurvreg(Surv(person_years, died) ~ 1, data = out_new3, dist = dists[i])
fit <- summary(fit, type = "hazard",
parametric_haz[[i]] ci = FALSE, tidy = TRUE)
$method <- dists_long[i]
parametric_haz[[i]]
}<- rbindlist(parametric_haz) parametric_haz
We can plot the hazard functions from the parametric models and compare them to the kernel density estimate.
<- rbind(kernel_haz, parametric_haz)
haz := factor(method,
haz[, method levels = c("Kernel density",
dists_long))]<- length(dists)
n_dists ggplot(haz, aes(x = time, y = est, col = method, linetype = method)) +
geom_line() +
xlab("years") + ylab("Hazard") +
scale_colour_manual(name = "",
values = c("black", rainbow(n_dists))) +
scale_linetype_manual(name = "",
values = c(1, rep_len(2:6, n_dists)))
Fitting a Weibull AFT model
AFT model hiv status
<- Surv(person_years, died) ~ chiv
mf <- survreg(mf, ,data = out_new3, dist="weibull")
wf summary(wf)
##
## Call:
## survreg(formula = mf, data = out_new3, dist = "weibull")
## Value Std. Error z p
## (Intercept) 6.147 0.796 7.72 0.000000000000011
## chivHIV positive -1.976 0.586 -3.37 0.00075
## Log(scale) 0.466 0.139 3.36 0.00079
##
## Scale= 1.59
##
## Weibull distribution
## Loglik(model)= -190.9 Loglik(intercept only)= -198.9
## Chisq= 16.08 on 1 degrees of freedom, p= 0.000061
## Number of Newton-Raphson Iterations: 8
## n= 590
The coefficient \(\beta =- 1.976\) gives an acceleration factor of \(e^{-1.967}=0.14\) implying that HIV positive patients have 86% shorter survival times compared to HIV exposed.
AFT model for age of enrolmment
<- Surv(person_years, died) ~ age_enrol
mf <- survreg(mf, ,data = out_new3, dist="weibull")
wf summary(wf)
##
## Call:
## survreg(formula = mf, data = out_new3, dist = "weibull")
## Value Std. Error z p
## (Intercept) 5.09247 0.60213 8.46 <0.0000000000000002
## age_enrol -0.00329 0.07801 -0.04 0.9664
## Log(scale) 0.43573 0.13825 3.15 0.0016
##
## Scale= 1.55
##
## Weibull distribution
## Loglik(model)= -198.9 Loglik(intercept only)= -198.9
## Chisq= 0 on 1 degrees of freedom, p= 0.97
## Number of Newton-Raphson Iterations: 9
## n= 590
exp(-0.00329)
## [1] 0.9967154
The coefficient \(\beta =-0.00329\) gives an acceleration factor of $e^{-0.003297}= 0.9967154 $ a unit increase in age has no significant impact on survival.
Part 3
Use this paper’s approach to model the data from Part II and also include a flexible GAMM model from R2BayesX on all six models. Show the relevant equations for the models.
library(Epi)
library(survival)
library(mgcv)
library(tidyverse)
Flexible GAMM (BayesX)
structured additive Cox-type model, where the log-hazard is modeled as:
The flexible Cox proportional hazards model is:
\[ h(t \mid X) = h_0(t) \exp\Big( \beta_1 \,\text{chiv} + \beta_2 \,\text{age\_enrol} + f(t) \Big), \]
where:
- \(h(t \mid X)\): hazard at time \(t\) given covariates \(X\)
- \(h_0(t) = \exp(f(t))\): smooth baseline hazard modeled via splines (
sx(time)
)
- \(\beta_1, \beta_2\): parametric coefficients (log-hazard ratios)
- \(f(t)\): smooth function of follow-up time
Interpretation:
- \(\exp(\beta_1)\): hazard ratio for HIV positive vs negative
- \(\exp(\beta_2)\): hazard ratio per additional year of age at enrolment
- \(f(t)\): adjusts for non-linear baseline hazard over time
library(R2BayesX)
<- bayesx(died ~ chiv + age_enrol +
gam_fit sx(overall_survival, bs = "bl"),
family = "cox",
data = out_new3,
method = "REML"
)
summary(gam_fit)
## Call:
## bayesx(formula = died ~ chiv + age_enrol + sx(overall_survival,
## bs = "bl"), data = out_new3, family = "cox", method = "REML")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.2382 0.2609 -43.0723 <0.0000000000000002 ***
## chivHIV positive 1.6973 0.3432 4.9452 <0.0000000000000002 ***
## age_enrol -0.1315 0.0602 -2.1843 0.0293 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Smooth terms:
## Variance Smooth Par. df Stopped
## sx(overall_survival) 0.0005 1993.4700 1.0742 0
##
## N = 590 df = 4.07418 AIC = 875.595 BIC = 893.441
## logLik = -433.7235 method = REML family = cox
plot(gam_fit, resid = TRUE, cex.resid = 0.1)
where
\[ h(t \mid X) = \text{hazard function at time t given covariates X}. \]
\[\begin{align*} \beta_{0} &= -11.238 \quad &\text{(baseline log-hazard intercept)} \\ \beta_{\text{chiv}} &= 1.697 \quad &\text{(log-hazard effect of HIV positive vs. negative)} \\ \beta_{\text{age}} &= -0.132 \quad &\text{(log-hazard effect per unit increase in age at enrolment)} \\ f(t) & \quad &\text{smooth, non-parametric baseline hazard function.} \end{align*}\]
Poisson-type GAMM (BayesX)
The flexible Poisson model is:
\[ Y_i \sim \text{Poisson}(\mu_i), \quad \log(\mu_i) = \log(\text{exposure}_i) + \beta_1 \,\text{chiv} + \beta_2 \,\text{age\_enrol} + g(t_i), \]
where:
- \(Y_i\): observed events (0/1 or counts)
- \(\mu_i\): expected number of events
- \(\text{exposure}_i\): offset, typically log of person-years
- \(\beta_1, \beta_2\): log-incidence rate ratios
- \(g(t_i)\): smooth function of time (
sx(time)
)
Interpretation:
- \(\exp(\beta_1)\): incidence rate ratio for HIV positive vs negative
- \(\exp(\beta_2)\): incidence rate ratio per additional year of age
- \(g(t_i)\): captures non-linear variation in baseline incidence over time
<- bayesx(died ~ chiv + age_enrol + offset(log(person_years)),
bx_poisson family = "poisson", data = out_new3)
## Note: created new output directory 'C:/Users/r1953/AppData/Local/Temp/Rtmp00dbdg/bayesx1'!
summary(bx_poisson)
## Call:
## bayesx(formula = died ~ chiv + age_enrol + offset(log(person_years)),
## data = out_new3, family = "poisson")
##
## Fixed effects estimation results:
##
## Parametric coefficients:
## Mean Sd 2.5% 50% 97.5%
## (Intercept) -3.2522 0.2534 -3.7703 -3.2443 -2.8005
## chivHIV positive 1.7643 0.3518 1.0541 1.7566 2.4511
## age_enrol -0.1319 0.0598 -0.2501 -0.1326 -0.0225
##
## N = 590 burnin = 2000 method = MCMC family = poisson
## iterations = 12000 step = 10
<- bx_poisson$fixed.effects[, "Mean"]
coef_pois <- exp(coef_pois)
HR_pois data.frame(term = names(coef_pois), HR = HR_pois)
term | HR | |
---|---|---|
(Intercept) | (Intercept) | 0.0386890 |
chivHIV positive | chivHIV positive | 5.8373096 |
age_enrol | age_enrol | 0.8764348 |
the results yielded are almost the same as we got before
Across all models, HIV status consistently emerged as the strongest predictor of mortality, with HIV-positive children facing a substantially higher risk of death compared with HIV-exposed or unexposed peers. This was reflected in hazard ratios exceeding 5 in both the Cox and GAMM models. Similarly, age at enrolment demonstrated a protective effect, with older children showing lower mortality risk across all modeling approaches, whether age was included as a linear term (Cox, Poisson, Weibull) or modeled flexibly using smooth functions in the GAMM.
The flexible GAMM added insight by allowing the hazard to vary smoothly over survival time. However, the smooth term did not indicate meaningful deviations from proportionality, suggesting that the Cox model’s proportional hazards assumption was reasonable. Moreover, the GAMM’s parametric coefficients were broadly consistent with those from the Cox models, reinforcing the robustness of the findings
Comment on results
The best performing models are those that support monotonically increasing hazards (Gompertz, Weibull, gamma). the Weibull model perform the best with the Weibull modeling the increase in the slope of the hazard the most closely.