lOADING PACKAGES AND CALLING IN DATA FOR R ANALYSIS

Question a)

Note

Open a log file, read the trinmlsh.dta data and then examine the variables.

if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, survival, ggfortify, survminer, gridExtra, 
               Epi, KMsurv, gnm, cmprsk, mstate, flexsurv, splines, epitools, 
               eha, ctqr, scales, data.table,haven,biostat3,paletteer)
library(tidyverse)
out_new<-haven::read_dta("trinmlsh.dta") |> 
  mutate(time_new= timeout-timein,
         years_new =time_new/365.25,
         smoke_status = case_when(smokenum==0~"non-smoker",
                                  smokenum==1~"ex-smoker",
                                  smokenum>1~"Current-smoker"))

KM <- survfit(Surv(years,death) ~ 1, 
              data = out_new)
ggsurvplot(KM, risk.table = TRUE, xlab = "Time (years)", censor = T)

Question 1b

Note

The variables timein and timeout hold the dates of entry and exit into the study, while death is the indicator for mortality from any cause. Use stset with these variables, setting timein to be the origin (i.e. sort the records according to follow-up time). Hint: You may use the command stdescribe.

quietly use trinmlsh

stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

sts graph, survival

stdescribe, noshow

quietly graph export overal.svg, replace
## Survival-time data settings
## 
##          Failure event: death!=0 & death<.
## Observed time interval: (origin, timeout]
##      Enter on or after: time timein
##      Exit on or before: failure
##      Time for analysis: (time-origin)/365.25
##                 Origin: time timein
## 
## --------------------------------------------------------------------------
##         318  total observations
##           0  exclusions
## --------------------------------------------------------------------------
##         318  observations remaining, representing
##          88  failures in single-record/single-failure data
##   2,204.539  total analysis time at risk and under observation
##                                                 At risk from t =         0
##                                      Earliest observed entry t =         0
##                                           Last observed exit t =  9.798768
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## 
##                                    |-------------- Per subject --------------|
## Category                   Total        Mean         Min     Median        Max
## ------------------------------------------------------------------------------
## Number of subjects           318   
## Number of records            318           1           1          1          1
## 
## Entry time (first)                         0           0          0          0
## Exit time (final)                   6.932514    .0164271   7.678303   9.798768
## 
## Subjects with gap              0   
## Time on gap                    0   
## Time at risk           2204.5394    6.932514    .0164271   7.678303   9.798768
## 
## Failures                      88    .2767296           0          0          1
## ------------------------------------------------------------------------------

Note

the stdescribe output shows that we have 318 subjects in the analytical sample of which 88 ($27.67\%$) have experienced the event of death.
In the output above, mean age at entry is 0. This is actually the same for all individuals since they have the same date for origin and the same date for enter, which also explains why the same value is presented for min, median, and max.
The mean age at exit is $6.932514 (min: .0164271, median: 7.678303 , max: 9.798768)$ days.
Time at risk is here presented as days. We can see that the mean is $6.932514 (min=.0164271 , median=7.678303, max=9.798768)$.

Restricted mean survival time

Another way of obtaining the mean time at risk is through stci, which produces the restricted mean survival time (same as mean time at risk) along with information on standard errors and confidence intervals.

Estimating the restricted mean (or average) survival time is determined by calculating the area under the survival curve, restricting the estimation to the longest follow-up time.

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

stci, rmean noshow
##              | Number of  Restricted
##              |  subjects        mean      Std. err.    [95% conf. interval]
## -------------+-------------------------------------------------------------
##        Total |       318    8.113829(*)   .1666967      7.78711    8.44055
## 
## (*) largest observed analysis time is censored, mean is underestimated

Note

The restricted mean survival time in this example is 8.113829 (years).

Question 1c

Note

Examine the effect of smoking on all-cause mortality using strate and stcox. Hint: You may wish to recode smokenum into a smaller number of categories, say: 0 = non smoker, 1 = ex-smoker, 2 = current smoker

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

tab smokenum
##      No. of |
##  cigarettes |
##     per day |      Freq.     Percent        Cum.
## ------------+-----------------------------------
##    non-smok |        140       44.16       44.16
##    ex-smoke |         68       21.45       65.62
##    1-9 cigs |         29        9.15       74.76
##    10-19 ci |         30        9.46       84.23
##    20-29 ci |         26        8.20       92.43
##    30+ cigs |         24        7.57      100.00
## ------------+-----------------------------------
##       Total |        317      100.00

Recoding the variable

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

tab smoke3
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##      RECODE of |
##  smokenum (No. |
##  of cigarettes |
##       per day) |      Freq.     Percent        Cum.
## ---------------+-----------------------------------
##     Non-smoker |        140       44.16       44.16
##      Ex-smoker |         68       21.45       65.62
## Current-smoker |        109       34.38      100.00
## ---------------+-----------------------------------
##          Total |        317      100.00

Using strate

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

strate smoke3
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Estimated failure rates
## Number of records = 317
## 
##   +-----------------------------------------------------------------+
##   |         smoke3    D          Y       Rate      Lower      Upper |
##   |-----------------------------------------------------------------|
##   |     Non-smoker   30   984.4079   0.030475   0.021308   0.043587 |
##   |      Ex-smoker   18   461.5934   0.038995   0.024569   0.061893 |
##   | Current-smoker   40   749.0157   0.053403   0.039173   0.072804 |
##   +-----------------------------------------------------------------+
##    Notes: Rate = D/Y = failures/person-time.
##           Lower and Upper are bounds of 95% confidence intervals.

Manually doing it in R

out_new |>
  filter(!is.na(smoke_status)) |> 
  group_by(smoke_status) |>
  summarise(years=sum(years), death=sum(death)) |>
  mutate(rate=death/years)

smoke_status	years	death	rate
Current-smoker	749.0157	40	0.0534034
ex-smoker	461.5934	18	0.0389954
non-smoker	984.4079	30	0.0304752

Using Survrate

KM_rate <- survRate(Surv(years,death) ~ smoke_status, 
              data = out_new)
KM_rate

	smoke_status	tstop	event	rate	lower	upper
smoke_status=Current-smoker	Current-smoker	749.0157	40	0.0534034	0.0381522	0.0727203
smoke_status=ex-smoker	ex-smoker	461.5934	18	0.0389954	0.0231111	0.0616295
smoke_status=non-smoker	non-smoker	984.4079	30	0.0304752	0.0205615	0.0435052

Comment on Results from Strate

Based on the crude rates, there is a clear increase in all-cause mortality from non-smokers (0.03047517) to ex-smokers (0.03899536) and current smokers (0.05340342 ).

Using Cox in Stata

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

stcox i.smoke3 , efron
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Iteration 0:  Log likelihood = -482.41209
## Iteration 1:  Log likelihood = -479.70223
## Iteration 2:  Log likelihood = -479.69013
## Iteration 3:  Log likelihood = -479.69013
## Refining estimates:
## Iteration 0:  Log likelihood = -479.69013
## 
## Cox regression with Efron method for ties
## 
## No. of subjects =        317                            Number of obs =    317
## No. of failures =         88
## Time at risk    = 2,195.0171
##                                                         LR chi2(2)    =   5.44
## Log likelihood = -479.69013                             Prob > chi2   = 0.0657
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##       smoke3 |
##   Ex-smoker  |   1.294778   .3862725     0.87   0.387     .7215315     2.32346
## Current-s~r  |   1.749641   .4226712     2.32   0.021      1.08973    2.809175
## ------------------------------------------------------------------------------

Using Cox in R

out_new<-out_new |> 
  mutate(smoke_status = fct_relevel(smoke_status,"non-smoker"))

KM_cox <- coxph(Surv(years,death) ~ smoke_status, 
              data = out_new)
tidy(KM_cox, exponentiate = T, conf.int = T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
smoke_statusCurrent-smoker	1.749641	0.2415760	2.3156709	0.0205762	1.0897303	2.809175
smoke_statusex-smoker	1.294778	0.2983311	0.8659475	0.3865190	0.7215315	2.323460

ggforest(KM_cox , as.data.frame(out_new))

Poisson regression to estimate the mortality rate ratio.

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl
poisson death i.smoke3 , exp(years) irr
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
## Iteration 0:  Log likelihood = -282.97543  
## Iteration 1:  Log likelihood =  -282.9754  
## Iteration 2:  Log likelihood =  -282.9754  
## 
## Poisson regression                                      Number of obs =    317
##                                                         LR chi2(2)    =   5.49
##                                                         Prob > chi2   = 0.0644
## Log likelihood = -282.9754                              Pseudo R2     = 0.0096
## 
## ------------------------------------------------------------------------------
##        death |        IRR   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##       smoke3 |
##   Ex-smoker  |   1.279578   .3814964     0.83   0.408      .713325    2.295335
## Current-s~r  |   1.752358   .4232347     2.32   0.020     1.091536    2.813245
##              |
##        _cons |   .0304752    .005564   -19.12   0.000     .0213078    .0435867
##    ln(years) |          1  (exposure)
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline incidence rate.

library(gtsummary)
poisson7c <- glm(death ~ smoke_status + offset(log(years)),
                family=poisson, data=out_new)
poisson7c |> 
  tbl_regression(exponentiate = T)

Characteristic	IRR	95% CI	p-value
smoke_status
non-smoker	—	—
Current-smoker	1.75	1.09, 2.83	0.020
ex-smoker	1.28	0.70, 2.27	0.4
Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio

Comment on Results from Cox and poisson

A cox regression model using non-smokers as the reference group indicated that current smokers had a statistically significant 75% higher mortality rate (hazard ratio $[HR] = 1.75; 95\% CI: 1.09–2.39; p = 0.02057$)
while the increase for ex-smokers was smaller and not statistically significant ($HR = 1.294; 95\% CI: 0.721–2.3234; p = 0.386$).
The results were almost similar to those from fitting a Poisson model.

Question 1d

Note

Using the rates and rate ratios in part (c), can you see a trend in the relationship between smoking category and all-cause mortality? How would you assess this more formally? How would you present the results from your analyses in parts (c) and (d)?

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

sts graph , by(smoke3)

quietly graph export KM.svg, replace
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

sts test smoke3 , logrank
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Equality of survivor functions
## Log-rank test
## 
##                |  Observed       Expected
## smoke3         |    events         events
## ---------------+-------------------------
##     Non-smoker |        30          39.54
##      Ex-smoker |        18          18.32
## Current-smoker |        40          30.14
## ---------------+-------------------------
##          Total |        88          88.00
## 
##                          chi2(2) =   5.54
##                          Pr>chi2 = 0.0628

library(paletteer)
KM_dm <- survfit(Surv(years,death) ~ smoke_status, 
              data = out_new)

p2<-ggsurvplot(KM_dm, 
           data = out_new, 
                 palette = paletteer_c("grDevices::Set 2", 12), 
                 size = 1.2, conf.int = FALSE,
                  pval=TRUE,
                 legend.labs = levels(out_new$smoke_status),
                 legend.title = "",
                 ggtheme = theme_minimal() + 
             theme(plot.title = element_text(face = "bold")),
                 title = "Probability of dying",
                 xlab = "Years",
                 ylab = "Probability of dying",
                 legend = "bottom", censor = FALSE)

p2

Solution

Based on the crude rates, there is a clear increase in all-cause mortality from non-smokers to ex smokers and then to current smokers
The Kaplan Meier plots also suggest a marginally significant difference in survival times between the 3 groups where

non-smokers survival longer
ex-smokers survival longer than current smokers but shorter than non smokers
current smokers have a shorter survival time as compared to the other 2 groups.

Question 1e and 1f

Note

Consider the possible confounding effect of current age on the relationship between smoking and mortality. By examining the data (but without performing any further modelling) would you expect a Cox model controlling for current age to give different results to one controlling for time in the study?
Hint: What was the age range at study entry (ageent) in each smoking group? Are there large differences in age between the groups? If follow-up time is your analysis scale, will you be comparing men whose ages are very different?

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

tabstat ageent, by(smoke3) stats(mean sd min max)
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
## Summary for variables: ageent
## Group variable: smoke3 (RECODE of smokenum (No. of cigarettes per day))
## 
##         smoke3 |      Mean        SD       Min       Max
## ---------------+----------------------------------------
##     Non-smoker |   65.1261  3.203284    60.077    72.016
##      Ex-smoker |  65.60484  3.571255    60.093    73.467
## Current-smoker |  64.71269  2.902213    60.005     73.93
## ---------------+----------------------------------------
##          Total |  65.08664  3.194229    60.005     73.93
## --------------------------------------------------------

out_new |> 
  group_by(smoke_status) |> 
  summarise(aver_age=mean(ageent),
            sd_age=sd(ageent),
            range_age = range(ageent))

smoke_status	aver_age	sd_age	range_age
non-smoker	65.12610	3.203284	60.077
non-smoker	65.12610	3.203284	72.016
Current-smoker	64.71269	2.902213	60.005
Current-smoker	64.71269	2.902213	73.930
ex-smoker	65.60484	3.571255	60.093
ex-smoker	65.60484	3.571255	73.467
NA	67.24400	NA	67.244
NA	67.24400	NA	67.244

Note

The mean ages are all around 65 years, with small SD ($\approx 3$ years). The ranges overlap heavily across groups.
Adjustment for age is important if age differs systematically across smoking groups and age is associated with the hazard (e.g. mortality, CVD, HIV progression).
Here, the age distributions are very similar across smoking categories (differences <1 year on average).
Because of this, adjusting for age is unlikely to meaningfully change hazard ratios for smoking status.
using follow-up time as the analysis scale will not result in comparing men of very different actual ages, although minor age variation still warrants adjustment.

Question 1g

Note

Now perform a Cox regression, and check the assumptions, to investigate this.

Cox model adjusting for ageent

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

stcox i.smoke3 ageent , efron
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Iteration 0:  Log likelihood = -482.41209
## Iteration 1:  Log likelihood = -478.51285
## Iteration 2:  Log likelihood = -478.50359
## Iteration 3:  Log likelihood = -478.50359
## Refining estimates:
## Iteration 0:  Log likelihood = -478.50359
## 
## Cox regression with Efron method for ties
## 
## No. of subjects =        317                            Number of obs =    317
## No. of failures =         88
## Time at risk    = 2,195.0171
##                                                         LR chi2(3)    =   7.82
## Log likelihood = -478.50359                             Prob > chi2   = 0.0499
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##       smoke3 |
##   Ex-smoker  |   1.258957   .3764753     0.77   0.441     .7005992    2.262309
## Current-s~r  |   1.772706   .4286392     2.37   0.018     1.103613    2.847454
##              |
##       ageent |   1.053675   .0355862     1.55   0.122     .9861853    1.125783
## ------------------------------------------------------------------------------

KM_cox <- coxph(Surv(years,death) ~ smoke_status+ageent, 
              data = out_new)
tidy(KM_cox, exponentiate = T, conf.int = T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
smoke_statusCurrent-smoker	1.772706	0.2417994	2.3676953	0.0178993	1.1036128	2.847454
smoke_statusex-smoker	1.258957	0.2990375	0.7700821	0.4412512	0.7005992	2.262309
ageent	1.053675	0.0337734	1.5480728	0.1216048	0.9861853	1.125783

Checking assumptions

cph_assump <- cox.zph(KM_cox)
cph_assump
##              chisq df     p
## smoke_status 5.159  2 0.076
## ageent       0.035  1 0.852
## GLOBAL       5.225  3 0.156

ggcoxzph(cph_assump)

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl

stcox i.smoke3 ageent , efron

stphtest, detail
stphtest, plot(1.smoke3) msym(oh) name(plt1)
stphtest, plot(2.smoke3) msym(oh) name(plt2)
stphtest, plot(ageent) msym(oh) name(plt3)


graph combine plt1 plt2  plt3

quietly graph export all_multii.svg, replace
## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
##          Failure _d: death
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Iteration 0:  Log likelihood = -482.41209
## Iteration 1:  Log likelihood = -478.51285
## Iteration 2:  Log likelihood = -478.50359
## Iteration 3:  Log likelihood = -478.50359
## Refining estimates:
## Iteration 0:  Log likelihood = -478.50359
## 
## Cox regression with Efron method for ties
## 
## No. of subjects =        317                            Number of obs =    317
## No. of failures =         88
## Time at risk    = 2,195.0171
##                                                         LR chi2(3)    =   7.82
## Log likelihood = -478.50359                             Prob > chi2   = 0.0499
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##       smoke3 |
##   Ex-smoker  |   1.258957   .3764753     0.77   0.441     .7005992    2.262309
## Current-s~r  |   1.772706   .4286392     2.37   0.018     1.103613    2.847454
##              |
##       ageent |   1.053675   .0355862     1.55   0.122     .9861853    1.125783
## ------------------------------------------------------------------------------
## 
## 
## Test of proportional-hazards assumption
## 
## Time function: Analysis time
## --------------------------------------------------------
##              |        rho     chi2       df    Prob>chi2
## -------------+------------------------------------------
##    0b.smoke3 |          .        .        1           .
##     1.smoke3 |    0.13164     1.54        1       0.2148
##     2.smoke3 |    0.22638     4.47        1       0.0345
##       ageent |    0.02115     0.04        1       0.8508
## -------------+------------------------------------------
##  Global test |                4.58        3       0.2053
## --------------------------------------------------------

Note

After adjusting for age at study entry, Age is not statistically significant $(HR = 1.05~ per~ year~ older; p = 0.12)$.
There is no evidence of PH violation overall (global $p = 0.156$).

White dataset

Question 1h

Note

Now read the whitehal new.dta data. We are interested in CHD mortality, so the variables necessary to define the time-scale and the outcome are timein, timeout, and chd.

Loading the data in R

out_new2<-haven::read_dta("whitehal_new.dta") |> 
  mutate(fu_years = as.numeric(timeout - timein) / 365.25) |> 
  mutate(across(where(is.labelled), ~as_factor(.))) |> 
  mutate(chd = as.integer(chd))

Loading the data in Stata

quietly use  whitehal_new

format timein  %td
format timeout %td

quietly stset timeout, origin(time timein) enter(time timein) failure(chd)

tab grade
## grade of work, |
##       2 levels |      Freq.     Percent        Cum.
## ---------------+-----------------------------------
## High job grade |      1,194       71.20       71.20
##  Low job grade |        483       28.80      100.00
## ---------------+-----------------------------------
##          Total |      1,677      100.00

Question 1i

Note

Use a Cox regression model to estimate the effect of job grade (grade, coded: 1 = high; 2 = low) using first the follow-up time scale. You may adjust for other covariates.

KM_dm <- survfit(Surv(fu_years,chd) ~ grade, 
              data = out_new2)

p2<-ggsurvplot(KM_dm, 
           data = out_new2, 
                 palette = paletteer_c("grDevices::Set 2", 12), 
                 size = 1.2, conf.int = FALSE,
                  pval=TRUE,
                 legend.labs = levels(out_new2$grade),
                 legend.title = "",
                 ggtheme = theme_minimal() + 
             theme(plot.title = element_text(face = "bold")),
                 title = "Probability of dying",
                 xlab = "Years",
                 ylab = "Probability of dying",
                 legend = "bottom", censor = FALSE)

p2

KM_cox <- coxph(Surv(fu_years,chd) ~ grade, 
              data = out_new2)
tidy(KM_cox, exponentiate = T, conf.int = T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
gradeLow job grade	2.042109	0.1636785	4.362108	0.0000129	1.481684	2.814507

Comment on Output

The unadjusted Cox regression showS that participants in the low job grade had more than twice the hazard of CHD mortality compared with those in the high job grade $(HR = 2.04; 95\% CI: 1.48–2.82; p < 0.001)$.

Running full model in R

cox_adj <- coxph(Surv(fu_years,chd) ~ grade + agein + sbp + chol + smok, data = out_new2)

summary(cox_adj)
## Call:
## coxph(formula = Surv(fu_years, chd) ~ grade + agein + sbp + chol + 
##     smok, data = out_new2)
## 
##   n= 1651, number of events= 151 
##    (26 observations deleted due to missingness)
## 
##                        coef exp(coef) se(coef)     z         Pr(>|z|)    
## gradeLow job grade 0.100190  1.105381 0.178687 0.561         0.575000    
## agein              0.095873  1.100620 0.013930 6.882 0.00000000000588 ***
## sbp                0.016100  1.016230 0.003615 4.453 0.00000844870978 ***
## chol               0.007367  1.007394 0.001633 4.513 0.00000640522967 ***
## smokEx-smoker      0.769555  2.158804 0.332196 2.317         0.020527 *  
## smok1-14 cig/day   1.050483  2.859032 0.349535 3.005         0.002653 ** 
## smok15-24 cig/day  1.320346  3.744718 0.348945 3.784         0.000154 ***
## smok25+ cig/day    1.265059  3.543302 0.398100 3.178         0.001484 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                    exp(coef) exp(-coef) lower .95 upper .95
## gradeLow job grade     1.105     0.9047    0.7788     1.569
## agein                  1.101     0.9086    1.0710     1.131
## sbp                    1.016     0.9840    1.0091     1.023
## chol                   1.007     0.9927    1.0042     1.011
## smokEx-smoker          2.159     0.4632    1.1258     4.140
## smok1-14 cig/day       2.859     0.3498    1.4411     5.672
## smok15-24 cig/day      3.745     0.2670    1.8897     7.421
## smok25+ cig/day        3.543     0.2822    1.6238     7.732
## 
## Concordance= 0.759  (se = 0.018 )
## Likelihood ratio test= 135.4  on 8 df,   p=<0.0000000000000002
## Wald test            = 126.8  on 8 df,   p=<0.0000000000000002
## Score (logrank) test = 136.6  on 8 df,   p=<0.0000000000000002

Running full model in Stata

quietly use  whitehal_new

format timein  %td
format timeout %td

quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)
stcox i.grade agein sbp chol i.smok,efron
##          Failure _d: chd
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Iteration 0:  Log likelihood = -1094.0566
## Iteration 1:  Log likelihood = -1028.4682
## Iteration 2:  Log likelihood = -1026.4085
## Iteration 3:  Log likelihood = -1026.3755
## Iteration 4:  Log likelihood = -1026.3755
## Refining estimates:
## Iteration 0:  Log likelihood = -1026.3755
## 
## Cox regression with no ties
## 
## No. of subjects =       1,651                           Number of obs =  1,651
## No. of failures =         151
## Time at risk    = 27,200.0163
##                                                         LR chi2(8)    = 135.36
## Log likelihood = -1026.3755                             Prob > chi2   = 0.0000
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##        grade |
## Low job g..  |   1.105381   .1975177     0.56   0.575     .7787771    1.568957
##        agein |    1.10062   .0153318     6.88   0.000     1.070976    1.131083
##          sbp |    1.01623   .0036737     4.45   0.000     1.009055    1.023456
##         chol |   1.007394   .0016447     4.51   0.000     1.004176    1.010623
##              |
##         smok |
##   Ex-smoker  |   2.158804   .7171464     2.32   0.021     1.125764    4.139798
## 1-14 cig/..  |   2.859032    .999333     3.01   0.003       1.4411    5.672103
## 15-24 cig..  |   3.744718   1.306701     3.78   0.000     1.889717    7.420645
## 25+ cig/day  |   3.543302   1.410588     3.18   0.001     1.623846    7.731637
## ------------------------------------------------------------------------------

Note

after adjusting for age at entry, systolic blood pressure, cholesterol, and smoking, the association is insignificant (HR for low vs high = $1.11$; equivalent to HR for high vs low = $0.905; 95\% CI: 0.637–1.284; p = 0.575)$.

Question j): Using a Parametric Model

Note

Fit an appropriate Parametric model or Frailty model whichever is most suitable and distinguish how you will interpret the results for a single categorical and also a single continuous variable.

Intercept only model

We will begin by estimating intercept only parametric regression models (i.e., without covariates). But first, it’s helpful to estimate the hazard function (among all patients) using nonparametric techniques. We can do this using the kernel density estimator from the muhaz package.

library("muhaz")
kernel_haz_est <- muhaz(out_new2$fu_years, out_new2$chd)
kernel_haz <- data.table(time = kernel_haz_est$est.grid,
                         est = kernel_haz_est$haz.est,
                         method = "Kernel density")

Then we can use flexsurv to estimate intercept only models for a range of probability distributions. The hazard function for each fitted model is returned using summary.flexsurvreg().

dists <- c("exp", "weibull", "gompertz", "gamma", 
           "lognormal", "llogis", "gengamma")
dists_long <- c("Exponential", "Weibull (AFT)",
                "Gompertz", "Gamma", "Lognormal", "Log-logistic",
                "Generalized gamma")
parametric_haz <- vector(mode = "list", length = length(dists))
for (i in 1:length(dists)){
  fit <- flexsurvreg(Surv(fu_years, chd) ~ 1, data = out_new2, dist = dists[i]) 
  parametric_haz[[i]] <- summary(fit, type = "hazard", 
                                     ci = FALSE, tidy = TRUE)
  parametric_haz[[i]]$method <- dists_long[i]
}
parametric_haz <- rbindlist(parametric_haz)

We can plot the hazard functions from the parametric models and compare them to the kernel density estimate.

haz <- rbind(kernel_haz, parametric_haz)
haz[, method := factor(method,
                       levels = c("Kernel density",
                                  dists_long))]
n_dists <- length(dists) 
ggplot(haz, aes(x = time, y = est, col = method, linetype = method)) +
  geom_line() +
  xlab("years") + ylab("Hazard") + 
  scale_colour_manual(name = "", 
                      values = c("black", rainbow(n_dists))) +
  scale_linetype_manual(name = "",
                        values = c(1, rep_len(2:6, n_dists)))

Comment on results

Note

The best performing models are those that support monotonically increasing hazards (Gompertz, Weibull, gamma). the Weibull model perform the best with the Weibull modeling the increase in the slope of the hazard the most closely.

Weibull PH

The hazard and the survival function of the Weibull distribution are given by, $h(t) = \gamma \lambda t^{\gamma-1}$ and $S(t) = e^{-\lambda t^\gamma}$ respectively. The hazard function of a patient with covariates $X_1,X_2,X_3,\cdots X_p$ is given by \[\begin{equation} h(t|X_i) =\gamma \lambda t^{\gamma-1} \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})=\gamma \lambda t^{\gamma-1}\text{exp}^{(\beta^{T} X)} \end{equation}\] making use of equation . The corresponding survival function is given by: \[\begin{equation} S_i(t)=\text{exp}\{\exp(\beta^{T} x_i)\lambda t^{\gamma}\} \end{equation}\] The model is fitted using the likelihood function of the $n$ observations and maximizing this function with respect to unknown parameters $(\beta_1,\beta_2,\beta_3,\cdots,\beta_p),\lambda$ and $\gamma$ where the likelihood function for the hazard and survivor function is given by : \[\begin{equation} L(\alpha,\mu,\sigma)=\prod_{i=1}^{n}\{h_i(t_i)\}^{\sigma_i}S_i(t_i) \end{equation}\] Taking logarithms in equation we get \[\begin{equation} \log L(\alpha,\mu,\sigma)=\sum^{n}_{i=1}\{\sigma_i \log h_i(t_i)+\log S_i(t_i)\} \end{equation}\] Now substituting ($\ref{gam}$) and simplifying we get \[\begin{equation} \log L(\alpha,\mu,\sigma)=\sum^n_{i=1}[\sigma_i\{\beta` x_i+\log(\lambda\gamma)+\gamma\log (t_i)\} -\lambda\text{exp}(\beta` x_i)t^{\gamma} \end{equation}\]

Note

The additional parameter p is called a shape parameter and determines the shape of the hazard function.

$p >1$ hazard increases over time
$p > 1$ constant hazard (exponential model)
$p <1$ hazard decreases over time

Cholestrol alone

quietly use  whitehal_new

format timein  %td
format timeout %td

quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)

streg chol, distrib(weibull) 
##          Failure _d: chd
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Fitting constant-only model:
## Iteration 0:  Log likelihood = -616.34373
## Iteration 1:  Log likelihood = -610.15638
## Iteration 2:  Log likelihood = -610.04232
## Iteration 3:  Log likelihood = -610.04228
## 
## Fitting full model:
## Iteration 0:  Log likelihood = -610.04228  
## Iteration 1:  Log likelihood = -603.49255  
## Iteration 2:  Log likelihood = -603.21873  
## Iteration 3:  Log likelihood = -603.21823  
## Iteration 4:  Log likelihood = -603.21823  
## 
## Weibull PH regression
## 
## No. of subjects =       1,651                           Number of obs =  1,651
## No. of failures =         151
## Time at risk    = 27,200.0163
##                                                         LR chi2(1)    =  13.65
## Log likelihood = -603.21823                             Prob > chi2   = 0.0002
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##         chol |   1.006151   .0016079     3.84   0.000     1.003004    1.009307
##        _cons |   .0006045   .0002773   -16.16   0.000      .000246    .0014853
## -------------+----------------------------------------------------------------
##        /ln_p |   .2929458   .0788982     3.71   0.000     .1383081    .4475834
## -------------+----------------------------------------------------------------
##            p |    1.34037   .1057528                      1.148329    1.564527
##          1/p |   .7460626    .058863                      .6391709    .8708303
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline hazard.

Grade alone

quietly use  whitehal_new

format timein  %td
format timeout %td

quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)

streg i.grade, distrib(weibull) 
##          Failure _d: chd
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Fitting constant-only model:
## Iteration 0:  Log likelihood = -627.95275
## Iteration 1:  Log likelihood = -621.65709
## Iteration 2:  Log likelihood = -621.54148
## Iteration 3:  Log likelihood = -621.54144
## 
## Fitting full model:
## Iteration 0:  Log likelihood = -621.54144  
## Iteration 1:  Log likelihood = -613.22222  
## Iteration 2:  Log likelihood = -612.60824  
## Iteration 3:  Log likelihood = -612.60625  
## Iteration 4:  Log likelihood = -612.60625  
## 
## Weibull PH regression
## 
## No. of subjects =       1,677                           Number of obs =  1,677
## No. of failures =         154
## Time at risk    = 27,605.3707
##                                                         LR chi2(1)    =  17.87
## Log likelihood = -612.60625                             Prob > chi2   = 0.0000
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##        grade |
## Low job g..  |   2.038542    .333555     4.35   0.000     1.479253    2.809293
##        _cons |   .0015878   .0005093   -20.10   0.000     .0008468    .0029773
## -------------+----------------------------------------------------------------
##        /ln_p |    .305599   .0777084     3.93   0.000     .1532934    .4579047
## -------------+----------------------------------------------------------------
##            p |   1.357438   .1054843                      1.165667    1.580758
##          1/p |    .736682   .0572464                      .6326078     .857878
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline hazard.

Weibull AFT

Cholestrol alone

quietly use  whitehal_new

format timein  %td
format timeout %td

quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)

streg chol, distrib(weibull) time
##          Failure _d: chd
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Fitting constant-only model:
## Iteration 0:  Log likelihood = -616.34373
## Iteration 1:  Log likelihood = -610.15638
## Iteration 2:  Log likelihood = -610.04232
## Iteration 3:  Log likelihood = -610.04228
## 
## Fitting full model:
## Iteration 0:  Log likelihood = -610.04228  
## Iteration 1:  Log likelihood = -603.49255  
## Iteration 2:  Log likelihood = -603.21873  
## Iteration 3:  Log likelihood = -603.21823  
## Iteration 4:  Log likelihood = -603.21823  
## 
## Weibull AFT regression
## 
## No. of subjects =       1,651                           Number of obs =  1,651
## No. of failures =         151
## Time at risk    = 27,200.0163
##                                                         LR chi2(1)    =  13.65
## Log likelihood = -603.21823                             Prob > chi2   = 0.0002
## 
## ------------------------------------------------------------------------------
##           _t | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##         chol |  -.0045749   .0012475    -3.67   0.000      -.00702   -.0021298
##        _cons |   5.529128   .3343822    16.54   0.000     4.873751    6.184505
## -------------+----------------------------------------------------------------
##        /ln_p |   .2929458   .0788982     3.71   0.000     .1383081    .4475834
## -------------+----------------------------------------------------------------
##            p |    1.34037   .1057528                      1.148329    1.564527
##          1/p |   .7460626    .058863                      .6391709    .8708303
## ------------------------------------------------------------------------------

Grade alone

quietly use  whitehal_new

format timein  %td
format timeout %td

quietly stset timeout, origin(time timein) enter(time timein) failure(chd) scale(365.25)

streg i.grade, distrib(weibull) time
##          Failure _d: chd
##    Analysis time _t: (timeout-origin)/365.25
##              Origin: time timein
##   Enter on or after: time timein
## 
## Fitting constant-only model:
## Iteration 0:  Log likelihood = -627.95275
## Iteration 1:  Log likelihood = -621.65709
## Iteration 2:  Log likelihood = -621.54148
## Iteration 3:  Log likelihood = -621.54144
## 
## Fitting full model:
## Iteration 0:  Log likelihood = -621.54144  
## Iteration 1:  Log likelihood = -613.22222  
## Iteration 2:  Log likelihood = -612.60824  
## Iteration 3:  Log likelihood = -612.60625  
## Iteration 4:  Log likelihood = -612.60625  
## 
## Weibull AFT regression
## 
## No. of subjects =       1,677                           Number of obs =  1,677
## No. of failures =         154
## Time at risk    = 27,605.3707
##                                                         LR chi2(1)    =  17.87
## Log likelihood = -612.60625                             Prob > chi2   = 0.0000
## 
## ------------------------------------------------------------------------------
##           _t | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##        grade |
## Low job g..  |  -.5246907   .1258083    -4.17   0.000    -.7712705    -.278111
##        _cons |   4.748192   .1652015    28.74   0.000     4.424403    5.071981
## -------------+----------------------------------------------------------------
##        /ln_p |    .305599   .0777084     3.93   0.000     .1532934    .4579047
## -------------+----------------------------------------------------------------
##            p |   1.357438   .1054843                      1.165667    1.580758
##          1/p |    .736682   .0572464                      .6326078     .857878
## ------------------------------------------------------------------------------

In R : Grade

AFT model

mf <- Surv(fu_years, chd)~  grade 
wf <- survreg(mf, ,data = out_new2, dist="weibull")
summary(wf)
## 
## Call:
## survreg(formula = mf, data = out_new2, dist = "weibull")
##                      Value Std. Error     z                    p
## (Intercept)         4.7482     0.1652 28.74 < 0.0000000000000002
## gradeLow job grade -0.5247     0.1258 -4.17             0.000030
## Log(scale)         -0.3056     0.0777 -3.93             0.000084
## 
## Scale= 0.737 
## 
## Weibull distribution
## Loglik(model)= -937.7   Loglik(intercept only)= -946.7
##  Chisq= 17.87 on 1 degrees of freedom, p= 0.000024 
## Number of Newton-Raphson Iterations: 8 
## n= 1677

Note

By default R reports coefficients in the accelerated failure time metric. A simple function to convert them to a proportional hazards metric, which you can download from this website as shown below. I also work with p = 1/scale.

PH model

source("https://grodri.github.io/survival/aft2ph.R")
phfit <- aft2ph(wf); phfit

term	estimate	std.error	statistic	p.value
(Intercept)	-6.4453753	0.3207430	-20.095142	0.0000000
gradeLow job grade	0.7122351	0.1636243	4.352870	0.0000134
log.p	0.3055990	0.0777084	3.932638	0.0000840

To obtain hazard ratios we exponentiate the coefficients. We also report the Weibull parameter p, while R uses scale = 1/p.

    transmute(phfit, term, hazard.ratio = exp(estimate))

term	hazard.ratio
(Intercept)	0.0015878
gradeLow job grade	2.0385425
log.p	1.3574379

    exp(tail(phfit,1)$estimate)
## [1] 1.357438

In R : chol

AFT model

mf <- Surv(fu_years, chd)~  chol 
wf <- survreg(mf, ,data = out_new2, dist="weibull")
summary(wf)
## 
## Call:
## survreg(formula = mf, data = out_new2, dist = "weibull")
##                Value Std. Error     z                    p
## (Intercept)  5.52913    0.33438 16.54 < 0.0000000000000002
## chol        -0.00457    0.00125 -3.67              0.00025
## Log(scale)  -0.29295    0.07890 -3.71              0.00020
## 
## Scale= 0.746 
## 
## Weibull distribution
## Loglik(model)= -922.1   Loglik(intercept only)= -928.9
##  Chisq= 13.65 on 1 degrees of freedom, p= 0.00022 
## Number of Newton-Raphson Iterations: 8 
## n=1651 (26 observations deleted due to missingness)

PH model

source("https://grodri.github.io/survival/aft2ph.R")
phfit <- aft2ph(wf); phfit

term	estimate	std.error	statistic	p.value
(Intercept)	-7.4110780	0.4586572	-16.158206	0.0000000
chol	0.0061320	0.0015981	3.837157	0.0001245
log.p	0.2929458	0.0788982	3.712958	0.0002049

    transmute(phfit, term, hazard.ratio = exp(estimate))

term	hazard.ratio
(Intercept)	0.0006045
chol	1.0061509
log.p	1.3403701

    exp(tail(phfit,1)$estimate)
## [1] 1.34037

Comment on Output : Grade

The unadjusted Weibull regression like the Cox model shows that participants in the low job grade had more than twice the hazard of CHD mortality compared with those in the high job grade $(HR = 2.04, p < 0.001)$.

Comment on Output : Cholestrol

for cholestrol , a continuous variable , For the Weibull model, the estimated scale parameter was $\sigma = 0.746$, corresponding to a shape parameter $p = 1/\sigma \approx 1.34$, indicating that the hazard increases over time.
The AFT coefficient for cholesterol was $\beta = -0.00457$ ($p<0.001$), suggesting that a one-unit increase in cholesterol shortens the expected survival time by about 0.46%.
Converting to the proportional hazards metric, the corresponding coefficient is $\gamma = -\beta/\sigma = 0.00613$, giving a hazard ratio (HR) of 1.0061 with 95% CI (1.0029, 1.0095).
Thus, each unit increase in cholesterol is associated with a 0.6% increase in the hazard of the event.

Comment on Results

Notice that the Weibull output has a parameter p that the exponential distribution does not have . The hazard function for a Weibull distribution is $h(t)=\lambda pt^{p-1}$ , if $p=1$ then the Weibull distribution is also an exponential distribution $h(t)=\lambda$.
By default Stata exponentiates the coefficients to show relative risks.
we Use the option nohr for no hazard ratios to obtain the coefficients.
The relationship between the hazard ratio parameter $\beta_j$ and the AFT parameter $\alpha_j$ is $\beta_j = -\alpha_jp$

Weibull PH vs AFT equivalence

Weibull models can be written in two equivalent forms: proportional hazards (PH) and accelerated failure time (AFT).

In the AFT form:

\[ \log T = \mu + \sigma \varepsilon + x^\top \gamma, \qquad \varepsilon \sim \text{EV(type I)}, \]

with $\sigma = 1/p$ for Weibull shape parameter $p>0$.

The survival is

\[ S(t \mid x) = \exp\!\left\{ - \exp\!\left(\tfrac{\log t - \mu - x^\top \gamma}{\sigma}\right) \right\} = \exp\!\left\{ -\, t^{1/\sigma}\, \exp\!\left(-\tfrac{\mu+x^\top \gamma}{\sigma}\right) \right\}. \]

Let $p=1/\sigma$. Then

\[ S(t \mid x) = \exp\!\left\{-\,\Lambda_0(t)\,\exp\!\big(-p\,x^\top\gamma\big)\right\}, \qquad \Lambda_0(t)=\exp(-p\mu)\,t^p. \]

In the PH form:

\[ S(t \mid x) = \exp\!\left\{-\,\Lambda_0(t)\,\exp\!\big(x^\top \beta_{\text{PH}}\big)\right\}. \]

By comparison,

\[ \beta_{\text{PH}} = -p\,\gamma \qquad\Longleftrightarrow\qquad \gamma = -\tfrac{\beta_{\text{PH}}}{p}. \]

Adjusting for other variables

AFT model

mf <- Surv(fu_years, chd)~  grade + agein + sbp + chol + smok
wf <- survreg(mf, ,data = out_new2, dist="weibull")
summary(wf)
## 
## Call:
## survreg(formula = mf, data = out_new2, dist = "weibull")
##                       Value Std. Error     z                    p
## (Intercept)        11.19842    0.88133 12.71 < 0.0000000000000002
## gradeLow job grade -0.06955    0.12314 -0.56              0.57219
## agein              -0.06558    0.01049 -6.25         0.0000000004
## sbp                -0.01108    0.00259 -4.28         0.0000188645
## chol               -0.00506    0.00118 -4.28         0.0000184859
## smokEx-smoker      -0.52886    0.23249 -2.27              0.02292
## smok1-14 cig/day   -0.72423    0.24713 -2.93              0.00338
## smok15-24 cig/day  -0.90727    0.24848 -3.65              0.00026
## smok25+ cig/day    -0.86606    0.28103 -3.08              0.00206
## Log(scale)         -0.37098    0.07664 -4.84         0.0000012940
## 
## Scale= 0.69 
## 
## Weibull distribution
## Loglik(model)= -861.7   Loglik(intercept only)= -928.9
##  Chisq= 134.4 on 8 degrees of freedom, p= 0.00000000000000000000000035 
## Number of Newton-Raphson Iterations: 8 
## n=1651 (26 observations deleted due to missingness)

PH model

source("https://grodri.github.io/survival/aft2ph.R")
phfit <- aft2ph(wf); phfit

term	estimate	std.error	statistic	p.value
(Intercept)	-16.2283091	1.0505090	-15.4480445	0.0000000
gradeLow job grade	0.1007923	0.1784525	0.5648131	0.5722009
agein	0.0950383	0.0138933	6.8405973	0.0000000
sbp	0.0160579	0.0036097	4.4485489	0.0000086
chol	0.0073287	0.0016288	4.4993631	0.0000068
smokEx-smoker	0.7664033	0.3321150	2.3076446	0.0210189
smok1-14 cig/day	1.0495317	0.3495019	3.0029359	0.0026739
smok15-24 cig/day	1.3147849	0.3489098	3.7682656	0.0001644
smok25+ cig/day	1.2550574	0.3978793	3.1543668	0.0016085
log.p	0.3709845	0.0766390	4.8406766	0.0000013

    transmute(phfit, term, hazard.ratio = exp(estimate))

term	hazard.ratio
(Intercept)	0.0000001
gradeLow job grade	1.1060469
agein	1.0997010
sbp	1.0161875
chol	1.0073556
smokEx-smoker	2.1520123
smok1-14 cig/day	2.8563133
smok15-24 cig/day	3.7239499
smok25+ cig/day	3.5080398
log.p	1.4491606

    exp(tail(phfit,1)$estimate)
## [1] 1.449161

Accounting for missing data in R

Note

We will impute the missing covariates using the Nelson-Aalen estimate of the cumulative hazard in place of the time variable, as proposed by White and Royston (2009). To generate this, the mice package provides a function nelsonaalen:

library(mice)
out_new$cumHaz <- nelsonaalen(out_new, years,death)
plot(out_new$years,out_new$cumHaz)

Comments

The curve is approximately linear, with a slightly increasing slope over time, suggesting a constant, or slightly increasing hazard over follow-up time.
We now impute using mice. To incorporate the outcome, as proposed by White and Royston (2009) we include the event indicator dead and the Nelson Aalen cumulative hazard as covariates in the imputation models. Thus we need to tell mice not to use the Years variable:

colSums(is.na(out_new))
##        ethgp       ageent        death      cvdeath          alc     smokenum 
##            0            0            0            0            0            1 
##         hdlc        diabp        sysbp     chdstart         days        years 
##           17            4            4           28            0            0 
##          bmi           id       timein      timeout            y      timebth 
##           35            0            0            0            0            0 
##     time_new    years_new smoke_status       cumHaz 
##            0            0            1            0

library(mice)

meth <- make.method(out_new)
meth["hdl"]      <- "norm"     # linear regression
meth["diabp"]    <- "norm"
meth["sysbp"]    <- "norm"
meth["chdstart"] <- "norm"
meth["smoke3"]   <- "polyreg"  # multinomial logistic
predMat <- make.predictorMatrix(out_new)

predMat[,"years"] <- 0
cumHazImp <- mice(out_new, m=10, seed=73347221,
  predictorMatrix=predMat, printFlag=FALSE)

options(scipen=999)

full_cox_m1<-coxph(Surv(years,death)~smoke_status+ageent+sysbp+bmi+diabp+ hdlc+chdstart,data=out_new) 


cumHazFit <- with(data = cumHazImp, 
  exp = coxph(Surv(years,death)~smoke_status+ageent+sysbp+bmi+diabp+ hdlc+chdstart))
cumHazPooled <- pool(cumHazFit)
#estimates
cbind(coef(full_cox_m1), summary(cumHazPooled)[,2])
##                                    [,1]         [,2]
## smoke_statusCurrent-smoker  0.897908020  0.599832976
## smoke_statusex-smoker       0.584612792  0.142356724
## ageent                      0.035335774  0.050013250
## sysbp                       0.007397350  0.010321798
## bmi                        -0.035497207 -0.032627229
## diabp                       0.001390669 -0.003686936
## hdlc                       -0.360502622 -0.480521920
## chdstart                    0.608962101  0.708228579

summary(cumHazPooled, conf.int = TRUE, exponentiate = TRUE)

term	estimate	std.error	statistic	df	p.value	2.5 %	97.5 %	conf.low	conf.high
smoke_statusCurrent-smoker	1.8218145	0.2475278	2.4232952	77.55088	0.0177112	1.1129320	2.982220	1.1129320	2.982220
smoke_statusex-smoker	1.1529879	0.3005212	0.4736995	77.66639	0.6370431	0.6338294	2.097380	0.6338294	2.097380
ageent	1.0512850	0.0347711	1.4383583	77.20378	0.1543740	0.9809617	1.126650	0.9809617	1.126650
sysbp	1.0103753	0.0056265	1.8345020	72.74352	0.0706686	0.9991080	1.021770	0.9991080	1.021770
bmi	0.9678993	0.0300339	-1.0863470	63.63170	0.2814237	0.9115275	1.027757	0.9115275	1.027757
diabp	0.9963199	0.0104782	-0.3518688	71.38604	0.7259733	0.9757217	1.017353	0.9757217	1.017353
hdlc	0.6184605	0.3297329	-1.4573065	69.51321	0.1495347	0.3203842	1.193859	0.3203842	1.193859
chdstart	2.0303914	0.2909900	2.4338589	64.84843	0.0177028	1.1354803	3.630613	1.1354803	3.630613

Obtaining the standard errors

#standard errors
cbind(diag(vcov(full_cox_m1))^0.5, summary(cumHazPooled)[,3])
##                                   [,1]        [,2]
## smoke_statusCurrent-smoker 0.297974797 0.247527822
## smoke_statusex-smoker      0.346474409 0.300521191
## ageent                     0.040441780 0.034771064
## sysbp                      0.006666571 0.005626485
## bmi                        0.032282332 0.030033891
## diabp                      0.013074294 0.010478158
## hdlc                       0.343395526 0.329732912
## chdstart                   0.322437634 0.290989988

Comments

Imputation generally modifies the coefficients slightly, sometimes increasing effect sizes (e.g., sex-smoker, hdlc, chdstart) and sometimes attenuating them (e.g., non-smoker). Overall, it reflects the impact of accounting for missing data on the estimated associations.
Imputation mostly attenuates effect sizes slightly and reduces standard errors, improving precision for some variables while leaving others nearly unchanged.

Accounting for missing data in STATA

Next we need to mi register variables. The minimum that we need do is register the variable(s) we intend to impute.

quietly use trinmlsh

quietly stset timeout, origin(time timein) enter(time timein) failure(death) scale(365.25)

recode smokenum (0=0) (1=1) (2/5=2), gen(smoke3)

label define smoke3lbl 0 "Non-smoker" 1 "Ex-smoker" 2 "Current-smoker"
label values smoke3 smoke3lbl


sts gen cumHaz = na
mi set flong
mi register imputed hdlc smoke3 diabp sysbp chdstart bmi

mi impute chained (regress) hdlc diabp sysbp chdstart (mlogit) smoke3 = death cumHaz ageent, add(10) rseed(56123236)


mi estimate, post: stcox hdlc diabp sysbp chdstart ageent i.smoke3
est store coxMI_NA


quietly stcox hdlc diabp sysbp chdstart ageent i.smoke3
est store coxCCA

est table coxCCA coxMI_NA, se

## (80 differences between smokenum and smoke3)
## 
## 
## 
## 
## 
## (44 m=0 obs now marked as incomplete)
## 
## 
## Conditional models:
##             smoke3: mlogit smoke3 diabp sysbp hdlc chdstart death cumHaz
##                      ageent
##              diabp: regress diabp i.smoke3 sysbp hdlc chdstart death cumHaz
##                      ageent
##              sysbp: regress sysbp i.smoke3 diabp hdlc chdstart death cumHaz
##                      ageent
##               hdlc: regress hdlc i.smoke3 diabp sysbp chdstart death cumHaz
##                      ageent
##           chdstart: regress chdstart i.smoke3 diabp sysbp hdlc death cumHaz
##                      ageent
## 
## Performing chained iterations ...
## 
## Multivariate imputation                     Imputations =       10
## Chained equations                                 added =       10
## Imputed: m=1 through m=10                       updated =        0
## 
## Initialization: monotone                     Iterations =      100
##                                                 burn-in =       10
## 
##               hdlc: linear regression
##              diabp: linear regression
##              sysbp: linear regression
##           chdstart: linear regression
##             smoke3: multinomial logistic regression
## 
## ------------------------------------------------------------------
##                    |               Observations per m             
##                    |----------------------------------------------
##           Variable |   Complete   Incomplete   Imputed |     Total
## -------------------+-----------------------------------+----------
##               hdlc |        301           17        17 |       318
##              diabp |        314            4         4 |       318
##              sysbp |        314            4         4 |       318
##           chdstart |        290           28        28 |       318
##             smoke3 |        317            1         1 |       318
## ------------------------------------------------------------------
## (Complete + Incomplete = Total; Imputed is the minimum across m
##  of the number of filled-in observations.)
## 
## 
## Multiple-imputation estimates                   Imputations       =         10
## Cox regression: Breslow method for ties         Number of obs     =        318
##                                                 Average RVI       =     0.0837
##                                                 Largest FMI       =     0.2277
## DF adjustment:   Large sample                   DF:     min       =     186.92
##                                                         avg       =  70,774.52
##                                                         max       = 241,467.48
## Model F test:       Equal FMI                   F(   7, 9315.1)   =       3.09
## Within VCE type:          OIM                   Prob > F          =     0.0029
## 
## ------------------------------------------------------------------------------
##           _t | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##         hdlc |  -.4200718   .3359779    -1.25   0.212    -1.080977    .2408333
##        diabp |  -.0004034    .010552    -0.04   0.970    -.0210899    .0202831
##        sysbp |   .0088906   .0055906     1.59   0.112    -.0020686    .0198499
##     chdstart |   .8168205   .3059592     2.67   0.008     .2132434    1.420397
##       ageent |   .0582103   .0342695     1.70   0.089    -.0089571    .1253777
##              |
##       smoke3 |
##   Ex-smoker  |   .1276731   .3007495     0.42   0.671     -.461788    .7171342
## Current-s~r  |   .6564548    .244325     2.69   0.007     .1775793     1.13533
## ------------------------------------------------------------------------------
## 
## 
## 
## 
## 
## ----------------------------------------
##     Variable |   coxCCA      coxMI_NA   
## -------------+--------------------------
##         hdlc | -.40885206    -.4200718  
##              |  .09295799    .33597787  
##        diabp | -.00029629   -.00040343  
##              |   .0031699    .01055201  
##        sysbp |  .00872926    .00889064  
##              |  .00167964    .00559062  
##     chdstart |  .80339295    .81682047  
##              |  .08218443    .30595922  
##       ageent |  .05725468    .05821029  
##              |  .01040884    .03426947  
##              |
##       smoke3 |
##   Ex-smoker  |  .16038246    .12767312  
##              |  .09116201    .30074947  
## Current-s~r  |  .67483635    .65645484  
##              |  .07423566    .24432501  
## ----------------------------------------
##                             Legend: b/se

Part II: Poisson and Cox Regression

Question b)

Note

Examine the effect of child’s HIV status (chiv) using strate and stcox. What is the HR and 95% CI?

quietly use paedsurv_new

quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died) scale(365.25)

strate chiv

##          Failure _d: died
##    Analysis time _t: (date_out-origin)/365.25
##              Origin: time date_enr
##   Enter on or after: time date_enr
## 
## Estimated failure rates
## Number of records = 590
## 
##   +------------------------------------------------------------------+
##   |         chiv    D          Y        Rate       Lower       Upper |
##   |------------------------------------------------------------------|
##   |  HIV exposed   16    1.2e+03   0.0131388   0.0080492   0.0214464 |
##   | HIV positive   27   638.3912   0.0422938   0.0290043   0.0616724 |
##   +------------------------------------------------------------------+
##    Notes: Rate = D/Y = failures/person-time.
##           Lower and Upper are bounds of 95% confidence intervals.


out_new3<-haven::read_dta("paedsurv_new.dta") |> 
  mutate(across(where(is.labelled), ~as_factor(.))) |> 
  mutate(
    age_enrol = as.numeric(difftime(date_enr, dob, units = "days")) / 365.25,
    overall_survival = as.numeric(difftime(date_out, date_enr, units = "days")),   # follow-up in days
    overall_survival = ifelse(is.na(overall_survival), NA_real_, pmax(overall_survival, 0)),
    person_years = overall_survival / 365.25,
   
    died = as.integer(died)
  )


out_new3 |>
  filter(!is.na(chiv)) |> 
  group_by(chiv) |>
  summarise(person_years=sum(person_years), died=sum(died)) |>
  mutate(rate=died/person_years)

chiv	person_years	died	rate
HIV exposed	1217.7708	16	0.0131388
HIV positive	638.3912	27	0.0422938


KM_rate <- survRate(Surv(person_years,died) ~ chiv, 
              data = out_new3)
KM_rate

	chiv	tstop	event	rate	lower	upper
chiv=HIV exposed	HIV exposed	1217.7708	16	0.0131388	0.0075099	0.0213365
chiv=HIV positive	HIV positive	638.3912	27	0.0422938	0.0278719	0.0615353

quietly use paedsurv_new

quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died)

stcox i.chiv

##          Failure _d: died
##    Analysis time _t: (date_out-origin)
##              Origin: time date_enr
##   Enter on or after: time date_enr
## 
## Iteration 0:  Log likelihood = -262.60562
## Iteration 1:  Log likelihood = -253.86693
## Iteration 2:  Log likelihood = -253.75738
## Iteration 3:  Log likelihood = -253.75733
## Refining estimates:
## Iteration 0:  Log likelihood = -253.75733
## 
## Cox regression with Breslow method for ties
## 
## No. of subjects =         590                           Number of obs =    590
## No. of failures =          43
## Time at risk    = 677,963.191
##                                                         LR chi2(1)    =  17.70
## Log likelihood = -253.75733                             Prob > chi2   = 0.0000
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##         chiv |
## HIV posit..  |   3.671663   1.159636     4.12   0.000     1.977081    6.818692
## ------------------------------------------------------------------------------

Solution

A crude Cox model showed that HIV-positive children had a hazard ratio (HR) of $3.67 (95\% CI: 1.98–6.82, p < 0.001)$ compared with HIV-exposed children. This indicates that HIV-positive children had over three times the hazard of death when compared on the same follow-up time scale.

Note

Consider the possible confounding effect of current age on the relationship between child’s HIV status and mortality. Before doing any analyses, would you expect a Cox model controlling for current age to give different results from the one controlling for time in the study?

quietly use paedsurv_new

quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died)

gen age_entry = (date_enr- dob) / 365.25

tabstat age_entry, by(chiv) stats(mean sd min max)

## Summary for variables: age_entry
## Group variable: chiv (Child's HIV status)
## 
##             chiv |      Mean        SD       Min       Max
## -----------------+----------------------------------------
##      HIV exposed |  .3238306  .6747565         0  6.795346
##     HIV positive |  4.331711  4.145623  .0054757  13.18275
## -----------------+----------------------------------------
##            Total |  1.532988  2.978616         0  13.18275
## ----------------------------------------------------------

Comments

Prior to age adjustment, substantial differences were observed in the mean age at study entry across HIV groups. HIV-exposed children enrolled at a mean age of 0.32 years, while HIV-positive children enrolled much later, at 4.33 years. Because older age is associated with lower mortality risk in this cohort, adjusting for age is expected to influence the estimated effect of HIV status.

d) Hint: Remember Question 1e) from Part I. Are there large differences in age at entry between the HIV groups? You can calculate age at entry with the following command

gen age_entry = (date_enr- dob) / 365.25

Comments

There is a substantial difference between the groups. HIV-exposed children entered the study at a mean age of $0.32$ years (SD = $0.67$), whereas HIV-positive children entered at a mean age of $4.33$ years (SD = $4.15$). The overall mean age at entry was $1.53$ years. This suggests that HIV-positive children tended to be enrolled considerably later than HIV-exposed children.

Question e)

Now perform a Cox regression controlling for current age to investigate whether age is a confounder.

quietly use paedsurv_new

quietly stset date_out, origin(time date_enr) enter(time date_enr) failure(died) scale(365.25)

gen age_entry = (date_enr- dob) / 365.25

stcox i.chiv age_entry,efron

##          Failure _d: died
##    Analysis time _t: (date_out-origin)/365.25
##              Origin: time date_enr
##   Enter on or after: time date_enr
## 
## Iteration 0:  Log likelihood = -262.59989
## Iteration 1:  Log likelihood = -252.33873
## Iteration 2:  Log likelihood = -250.89864
## Iteration 3:  Log likelihood = -250.89667
## Refining estimates:
## Iteration 0:  Log likelihood = -250.89667
## 
## Cox regression with Efron method for ties
## 
## No. of subjects =        590                            Number of obs =    590
## No. of failures =         43
## Time at risk    = 1,856.1621
##                                                         LR chi2(2)    =  23.41
## Log likelihood = -250.89667                             Prob > chi2   = 0.0000
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##         chiv |
## HIV posit..  |   5.435908   1.864223     4.94   0.000     2.775594    10.64605
##    age_entry |   .8772548   .0526643    -2.18   0.029      .779876    .9867927
## ------------------------------------------------------------------------------

Comment

After adjusting for age at entry, the HR for HIV-positive versus HIV-exposed increased to 5.435 .
Age itself was inversely associated with mortality i.e a unit increase in age results in the decrease in hazard of death by approximately 13%.
This suggests a strengthening of the HIV effect after age adjustment.

Question f)

Note

With age as the analysis scale, use a Poisson model to assess the relationship between child’s HIV status and mortality. How do the results compare to those from using a Cox model?

mod_poi <- glm(died ~ chiv+age_enrol+offset(log(person_years)), data = out_new3, 
               family = poisson(link="log"))

tidy(mod_poi , exponentiate = T,conf.int = T)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.0139408	0.2509078	-17.029905	0.0000000	0.0081643	0.0219746
chivHIV positive	5.3086828	0.3459787	4.824991	0.0000014	2.6968172	10.5843637
age_enrol	0.8497478	0.0640825	-2.540721	0.0110624	0.7406717	0.9545660

Note

With age as the analysis scale, The estimated risk ratio for mortality comparing HIV-positive to HIV-negative is $5.309 (95\%CI[2.697-10.584](P<0.001)$, with age being protective i.e a one year increase in age at entry is associated with $RR=0.85(95\%CL 0.7417- 0.955,P=0.011)$.
The results are closely identical to those from the cox model $HR = 5.44, 95\% CI 2.78–10.65, p < 0.001; age, HR =0.88, 95\% CI 0.78–0.99)$

Question(g)

Note

In order to make a like-for-like comparison with the Cox model, what would be a more appropriate analysis using Poisson regression?

Solution

The baseline hazard is constant in a Poisson regression, $exp(\beta_0)$
If we add a categorical variable for time, e.g. time-since-entry in 1-year bands, then the baseline hazard is a step function of time.
The hazard is piece-wise constant in 1-year bands.

\[\lambda =\exp(\beta_0+\beta_3t_{[1,2)}+\beta_4t_{[2,3)}+...)\exp(\beta_1X)\]

where $t_[1,2)$ is an indicator for time being in the interval $[1,2)$. Note that $t_[0,1)$% if left out from the equation (it is assumed to be the reference time band).
In both cases, the β parameters are interpreted as log rate ratios.
Both models are multiplicative (i.e. both assume proportional hazards).
In Poisson regression, follow-up time is classified into bands and a separate rate parameter is estimated for each band (or smoothed!), thereby allowing for the possibility that the rate is changing with time.
It is assumed that the rate is constant within each band, so if the rate is changing rapidly with time we may have to choose very narrow bands.
In Cox regression, we essentially choose bands of infinitesimal width; each band is so narrow that it includes only a single event.
hence we would split time to create 1 year age bands.

Question(h)

Note

Use the stsplit command to create 1-year age bands of current age from 0 to 5 years, and 5-year bands for children aged over 5. Re-run the Poisson model controlling for current age. How do these results compare with those obtained from the Cox model?

here we fit a piecewise Poisson regression model for comparison.

quietly use paedsurv_new
* age at entry and exit in years
gen age_ent  = (date_enr  - dob)/365.25
gen age_out = (date_out - dob)/365.25

* survival setup with age as timescale
quietly stset age_out, origin(time age_ent) enter(time age_ent) failure(died) id(idno) scale(365.25)

stsplit ageband, at(0(1)5 10(5)80) trim

streg i.ageband i.chiv age_ent, dist(exp)

## (no observations trimmed because none out of range)
## (no new episodes generated)
## 
## 
##          Failure _d: died
##    Analysis time _t: (age_out-origin)/365.25
##              Origin: time age_ent
##   Enter on or after: time age_ent
##         ID variable: idno
## note: 0.ageband omitted because of collinearity.
## 
## Iteration 0:  Log likelihood = -217.08342  
## Iteration 1:  Log likelihood = -211.19473  
## Iteration 2:  Log likelihood =   -205.941  
## Iteration 3:  Log likelihood = -205.91907  
## Iteration 4:  Log likelihood = -205.91906  
## 
## Exponential PH regression
## 
## No. of subjects =    590                                Number of obs =    590
## No. of failures =     43
## Time at risk    = 5.0819
##                                                         LR chi2(2)    =  22.33
## Log likelihood = -205.91906                             Prob > chi2   = 0.0000
## 
## ------------------------------------------------------------------------------
##           _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
## -------------+----------------------------------------------------------------
##    0.ageband |          1  (omitted)
##              |
##         chiv |
## HIV posit..  |   5.308681   1.836701     4.82   0.000     2.694557     10.4589
##      age_ent |   .8497479   .0544654    -2.54   0.011     .7494308    .9634932
##        _cons |   5.091878   1.277595     6.49   0.000     3.113898    8.326292
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline hazard.

quietly use paedsurv_new
* age at entry and exit in years
gen age_ent  = (date_enr  - dob)/365.25
gen age_out = (date_out - dob)/365.25
gen age_entry = (date_enr- dob) / 365.25
* survival setup with age as timescale
quietly stset age_out, origin(time age_ent) enter(time age_ent) failure(died) id(idno)

stsplit ageband, at(0(1)5 10(5)80) trim


quietly stcox i.chiv age_entry , efron
estimates store Cox

quietly streg i.ageband i.chiv age_entry, dist(exp)
estimates store Poisson

 /* Compare the estimates */
estimates table Cox Poisson, eform equations(1) b(%9.6f) se(%9.6f) ///
keep(1.chiv age_entry) ///
title("Hazard ratios and standard errors for Cox and Poisson models")

## (no observations trimmed because none out of range)
## (1,440 observations (episodes) created)
## 
## 
## 
## 
## 
## 
## Hazard ratios and standard errors for Cox and Poisson models
## 
## --------------------------------------
##     Variable |    Cox       Poisson   
## -------------+------------------------
##         chiv |
## HIV posit..  |  5.435126    5.462112  
##              |  1.863987    1.873629  
##    age_entry |  0.877228    0.875707  
##              |  0.052661    0.052621  
## --------------------------------------
##                           Legend: b/se

Comparison of Cox and Poisson

he results from the Poisson model closely match those of the Cox model. For both approaches, HIV-positive children have a markedly higher risk of death compared with HIV-exposed children (HR $\approx 5.4$). The protective effect of older age at study entry is also consistent across the two models, with hazard ratios of approximately $0.88$ per year increase. The similarity of results provides reassurance that the piecewise Poisson approach yields estimates that are effectively equivalent to the Cox model in this dataset

Question(i)

Note

Does age confound the relationship between HIV status and mortality?

Comment

Yes. Age is a confounder of the HIV–mortality relationship, as it is linked to both HIV status and mortality. After adjusting for age, HIV-positive children still had a substantially higher risk of death ($HR \approx 5.4$), while age itself demonstrated an independent protective effect ($HR \approx 0.88$).

Question(j)

Note

Redo part (h) and (i) to refit another Poisson model but splitting time at every event?

splitting at every event

cuts <- sort(unique(out_new3$person_years[out_new3$died == 1]))
orca_splitted <- survSplit(Surv(person_years, died) ~ ., data = out_new3, cut = cuts, episode = "tgroup")
head(orca_splitted, 15)

idno	sex	chiv	dob	date_enr	date_out	ill3m	stunted	cd4cat	age_enrol	overall_survival	tstart	person_years	tgroup
1001	Male	HIV exposed	2010-01-11	2010-03-01	2010-04-03	No	1	NA	0.1341547	33.09961	0.0000000	0.0906218	1
1002	Female	HIV exposed	2010-01-11	2010-03-01	2010-04-03	No	1	NA	0.1341547	33.09961	0.0000000	0.0906218	1
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.0000000	0.1095140	1
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.1095140	0.1314168	2
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.1314168	0.1834360	3
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.1834360	0.1861739	4
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.1861739	0.1916496	5
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.1916496	0.2655715	6
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.2655715	0.2683094	7
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.2683094	0.3093771	8
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.3093771	0.3203285	9
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.3203285	0.3422313	10
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.3422313	0.4435318	11
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.4435318	0.4955510	12
1003	Female	HIV positive	2003-08-10	2006-05-24	2010-03-12	Yes	0	<15%	2.7871321	1388.00000	0.4955510	0.5694730	13

Procedure

The gnm() function in the gnm package fits a conditional Poisson on splitted data, where the effects of the time (as a factor variable) can be marginalized (not estimated to improve computational efficiency).

out_new3<-out_new3 |> 
  filter(chiv %in% names(which(table(chiv) > 0))) %>%
  droplevels()

m2<-coxph(Surv(person_years, died) ~ chiv+age_enrol, data = out_new3)

mod_poi <- gnm(died ~ chiv+age_enrol, data = orca_splitted, 
               family = poisson, eliminate = factor(person_years))
summary(mod_poi)
## 
## Call:
## gnm(formula = died ~ chiv + age_enrol, eliminate = factor(person_years), 
##     family = poisson, data = orca_splitted)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.15793  -0.06739  -0.05041  -0.04569   3.44316  
## 
## Coefficients of interest:
##                  Estimate Std. Error z value    Pr(>|z|)    
## chivHIV positive  1.69264    0.34294   4.936 0.000000799 ***
## age_enrol        -0.13091    0.06003  -2.181      0.0292 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
## Residual deviance: 493.5 on 18322 degrees of freedom
## AIC: 1537.5
## 
## Number of iterations: 4

Compare the estimates obtained from the conditional Poisson with the cox proportional hazard model.

list(cox = ci.exp(m2), poisson = ci.exp(mod_poi))
## $cox
##                  exp(Est.)     2.5%      97.5%
## chivHIV positive 5.4359084 2.775594 10.6460459
## age_enrol        0.8772548 0.779876  0.9867927
## 
## $poisson
##                  exp(Est.)      2.5%      97.5%
## chivHIV positive 5.4337910 2.7745567 10.6417305
## age_enrol        0.8772983 0.7799259  0.9868274

Note

Comparing the traditional Cox Hazards model with Poisson regression approach accounting for time split at each event , the results are nearly identical .
for both models Hiv status is strongly associated with an increased risk of mortality $RR=5.433;(95\% ~CL ~2.775- 10.642,P=0.000000799)$
Similarly, age at entry was protective, with an estimated hazard ratio of $\text{HR} = 0.88 \; (95\% \; \text{CI}: 0.78\text{--}0.99)$, indicating that each additional year of age was associated with approximately a 12% lower risk of death.

TA better approach would be to flexibly model the baseline hazard by using, for instance, splines with knots $k$.

orca_splitted$dur <- with(orca_splitted, person_years - tstart)
newd <- data.frame(person_years = cuts, dur = 1,
                   chiv= "HIV exposed", age_enrol = 0.15)

library(splines)
k <- quantile(out_new3$person_years, 1:5/6)
mod_poi2s <- glm(died ~ ns(person_years, knots = k) + chiv+age_enrol, 
                 data = orca_splitted, family = poisson, offset = log(dur))
round(ci.exp(mod_poi2s), 3)
##                              exp(Est.)  2.5%
## (Intercept)                      0.023 0.009
## ns(person_years, knots = k)1     0.068 0.006
## ns(person_years, knots = k)2     1.946 0.042
## ns(person_years, knots = k)3     1.091 0.000
## ns(person_years, knots = k)4     0.000 0.000
## ns(person_years, knots = k)5     0.000 0.000
## ns(person_years, knots = k)6     0.000 0.000
## chivHIV positive                 5.447 2.782
## age_enrol                        0.879 0.781
##                                                                                                                                                                              97.5%
## (Intercept)                                                                                                                                                                  0.058
## ns(person_years, knots = k)1                                                                                                                                                 0.708
## ns(person_years, knots = k)2                                                                                                                                                90.628
## ns(person_years, knots = k)3                                                                                                                                          33235098.734
## ns(person_years, knots = k)4 7114190590675165331048666628824260888462646602486644202404688666064840240060260680806844462262400004280404820246068648664082848424640022428064848.000
## ns(person_years, knots = k)5                                                                                                                                                   Inf
## ns(person_years, knots = k)6                                                                                                                                                   Inf
## chivHIV positive                                                                                                                                                            10.667
## age_enrol                                                                                                                                                                    0.988
blhazs <- data.frame(ci.pred(mod_poi2s, newdata = newd))
ggplot(blhazs, aes(x = newd$person_years , y = Estimate)) + geom_line() +
  geom_ribbon(aes(ymin = X2.5., ymax = X97.5.), alpha = .2) +
  scale_y_continuous(trans = "log", breaks = pretty_breaks()) +
  theme_classic() + labs(x = "Time (years)", y = "Baseline hazard")

Note

again we notice that $(HR=5.447)$ Close to what we got from Poisson and Co proportional model.

Compare in Stata


quietly use paedsurv_new
* age at entry and exit in years
gen age_ent  = (date_enr  - dob)/365.25
gen age_out = (date_out - dob)/365.25

* survival setup with age as timescale
quietly stset age_out, origin(time age_ent) enter(time age_ent) failure(died==1) id(idno)

stsplit, at(failures) riskset(riskset)

quietly tab riskset, gen(interval)

quietly stcox i.chiv age_ent , efron
estimates store Cox

quietly streg age_ent  interval* i.chiv, dist(exp)
estimates store Poisson_fine

 /* Compare the estimates */
estimates table Cox  Poisson_fine, eform equations(1) b(%9.6f) se(%9.6f) ///
keep(1.chiv  age_ent) ///
title("Hazard ratios and standard errors for Cox and Poisson models")

## (43 failure times)
## (19,792 observations (episodes) created)
## 
## 
## 
## 
## 
## 
## 
## Hazard ratios and standard errors for Cox and Poisson models
## 
## --------------------------------------
##     Variable |    Cox      Poisson~e  
## -------------+------------------------
##         chiv |
## HIV posit..  |  5.435126    5.435126  
##              |  1.863987    1.863987  
##      age_ent |  0.877228    0.877228  
##              |  0.052661    0.052661  
## --------------------------------------
##                           Legend: b/se

Question k

Note

Bonus: Fit an appropriate Parametric model or Frailty model whichever is most suitable and distinguish how you will interpret the results for a single categorical and also a single continuous variable.

Intercept only model

library("muhaz")
kernel_haz_est <- muhaz(out_new3$person_years, out_new3$died)
kernel_haz <- data.table(time = kernel_haz_est$est.grid,
                         est = kernel_haz_est$haz.est,
                         method = "Kernel density")

dists <- c("exp", "weibull", "gompertz", "gamma", 
           "lognormal", "llogis", "gengamma")
dists_long <- c("Exponential", "Weibull (AFT)",
                "Gompertz", "Gamma", "Lognormal", "Log-logistic",
                "Generalized gamma")
parametric_haz <- vector(mode = "list", length = length(dists))
for (i in 1:length(dists)){
  fit <- flexsurvreg(Surv(person_years, died) ~ 1, data = out_new3, dist = dists[i]) 
  parametric_haz[[i]] <- summary(fit, type = "hazard", 
                                     ci = FALSE, tidy = TRUE)
  parametric_haz[[i]]$method <- dists_long[i]
}
parametric_haz <- rbindlist(parametric_haz)

We can plot the hazard functions from the parametric models and compare them to the kernel density estimate.

haz <- rbind(kernel_haz, parametric_haz)
haz[, method := factor(method,
                       levels = c("Kernel density",
                                  dists_long))]
n_dists <- length(dists) 
ggplot(haz, aes(x = time, y = est, col = method, linetype = method)) +
  geom_line() +
  xlab("years") + ylab("Hazard") + 
  scale_colour_manual(name = "", 
                      values = c("black", rainbow(n_dists))) +
  scale_linetype_manual(name = "",
                        values = c(1, rep_len(2:6, n_dists)))

Fitting a Weibull AFT model

AFT model hiv status

mf <- Surv(person_years, died) ~ chiv
wf <- survreg(mf, ,data = out_new3, dist="weibull")
summary(wf)
## 
## Call:
## survreg(formula = mf, data = out_new3, dist = "weibull")
##                   Value Std. Error     z                 p
## (Intercept)       6.147      0.796  7.72 0.000000000000011
## chivHIV positive -1.976      0.586 -3.37           0.00075
## Log(scale)        0.466      0.139  3.36           0.00079
## 
## Scale= 1.59 
## 
## Weibull distribution
## Loglik(model)= -190.9   Loglik(intercept only)= -198.9
##  Chisq= 16.08 on 1 degrees of freedom, p= 0.000061 
## Number of Newton-Raphson Iterations: 8 
## n= 590

Comments

The coefficient $\beta =- 1.976$ gives an acceleration factor of $e^{-⁡1.967}=0.14$ implying that HIV positive patients have 86% shorter survival times compared to HIV exposed.

AFT model for age of enrolmment

mf <- Surv(person_years, died) ~ age_enrol
wf <- survreg(mf, ,data = out_new3, dist="weibull")
summary(wf)
## 
## Call:
## survreg(formula = mf, data = out_new3, dist = "weibull")
##                Value Std. Error     z                   p
## (Intercept)  5.09247    0.60213  8.46 <0.0000000000000002
## age_enrol   -0.00329    0.07801 -0.04              0.9664
## Log(scale)   0.43573    0.13825  3.15              0.0016
## 
## Scale= 1.55 
## 
## Weibull distribution
## Loglik(model)= -198.9   Loglik(intercept only)= -198.9
##  Chisq= 0 on 1 degrees of freedom, p= 0.97 
## Number of Newton-Raphson Iterations: 9 
## n= 590

exp(-0.00329)
## [1] 0.9967154

Comments

The coefficient $\beta =-0.00329$ gives an acceleration factor of $e^{-0.003297}= 0.9967154 $ a unit increase in age has no significant impact on survival.

Part 3

Note

Use this paper’s approach to model the data from Part II and also include a flexible GAMM model from R2BayesX on all six models. Show the relevant equations for the models.

library(Epi)
library(survival)
library(mgcv)
library(tidyverse)

Flexible GAMM (BayesX)

Cox-type GAMM (BayesX)

structured additive Cox-type model, where the log-hazard is modeled as:

The flexible Cox proportional hazards model is:

\[ h(t \mid X) = h_0(t) \exp\Big( \beta_1 \,\text{chiv} + \beta_2 \,\text{age\_enrol} + f(t) \Big), \]

where:

$h(t \mid X)$: hazard at time $t$ given covariates $X$
$h_0(t) = \exp(f(t))$: smooth baseline hazard modeled via splines (sx(time))
$\beta_1, \beta_2$: parametric coefficients (log-hazard ratios)
$f(t)$: smooth function of follow-up time

Interpretation:

$\exp(\beta_1)$: hazard ratio for HIV positive vs negative
$\exp(\beta_2)$: hazard ratio per additional year of age at enrolment
$f(t)$: adjusts for non-linear baseline hazard over time

library(R2BayesX)

gam_fit <- bayesx(died ~ chiv + age_enrol +
    sx(overall_survival, bs = "bl"),
  family = "cox",
  data   = out_new3,
  method = "REML"
)

summary(gam_fit)
## Call:
## bayesx(formula = died ~ chiv + age_enrol + sx(overall_survival, 
##     bs = "bl"), data = out_new3, family = "cox", method = "REML")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                  Estimate Std. Error  t value            Pr(>|t|)    
## (Intercept)      -11.2382     0.2609 -43.0723 <0.0000000000000002 ***
## chivHIV positive   1.6973     0.3432   4.9452 <0.0000000000000002 ***
## age_enrol         -0.1315     0.0602  -2.1843              0.0293 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Smooth terms:
##                       Variance Smooth Par.        df Stopped
## sx(overall_survival)    0.0005   1993.4700    1.0742       0
##  
## N = 590  df = 4.07418  AIC = 875.595  BIC = 893.441  
## logLik = -433.7235  method = REML  family = cox


plot(gam_fit,  resid = TRUE, cex.resid = 0.1)

where

\[ h(t \mid X) = \text{hazard function at time t given covariates X}. \]

\[\begin{align*} \beta_{0} &= -11.238 \quad &\text{(baseline log-hazard intercept)} \\ \beta_{\text{chiv}} &= 1.697 \quad &\text{(log-hazard effect of HIV positive vs. negative)} \\ \beta_{\text{age}} &= -0.132 \quad &\text{(log-hazard effect per unit increase in age at enrolment)} \\ f(t) & \quad &\text{smooth, non-parametric baseline hazard function.} \end{align*}\]

Poisson-type GAMM (BayesX)

The flexible Poisson model is:

\[ Y_i \sim \text{Poisson}(\mu_i), \quad \log(\mu_i) = \log(\text{exposure}_i) + \beta_1 \,\text{chiv} + \beta_2 \,\text{age\_enrol} + g(t_i), \]

where:

$Y_i$: observed events (0/1 or counts)
$\mu_i$: expected number of events
$\text{exposure}_i$: offset, typically log of person-years
$\beta_1, \beta_2$: log-incidence rate ratios
$g(t_i)$: smooth function of time (sx(time))

Interpretation:

$\exp(\beta_1)$: incidence rate ratio for HIV positive vs negative
$\exp(\beta_2)$: incidence rate ratio per additional year of age
$g(t_i)$: captures non-linear variation in baseline incidence over time

bx_poisson <- bayesx(died ~ chiv + age_enrol + offset(log(person_years)),
                     family = "poisson", data = out_new3)
## Note: created new output directory 'C:/Users/r1953/AppData/Local/Temp/Rtmp00dbdg/bayesx1'!

summary(bx_poisson)
## Call:
## bayesx(formula = died ~ chiv + age_enrol + offset(log(person_years)), 
##     data = out_new3, family = "poisson")
##  
## Fixed effects estimation results:
## 
## Parametric coefficients:
##                     Mean      Sd    2.5%     50%   97.5%
## (Intercept)      -3.2522  0.2534 -3.7703 -3.2443 -2.8005
## chivHIV positive  1.7643  0.3518  1.0541  1.7566  2.4511
## age_enrol        -0.1319  0.0598 -0.2501 -0.1326 -0.0225
##  
## N = 590  burnin = 2000  method = MCMC  family = poisson  
## iterations = 12000  step = 10

coef_pois <- bx_poisson$fixed.effects[, "Mean"]
HR_pois <- exp(coef_pois)
data.frame(term = names(coef_pois), HR = HR_pois)

	term	HR
(Intercept)	(Intercept)	0.0386890
chivHIV positive	chivHIV positive	5.8373096
age_enrol	age_enrol	0.8764348

Results

the results yielded are almost the same as we got before
Across all models, HIV status consistently emerged as the strongest predictor of mortality, with HIV-positive children facing a substantially higher risk of death compared with HIV-exposed or unexposed peers. This was reflected in hazard ratios exceeding 5 in both the Cox and GAMM models. Similarly, age at enrolment demonstrated a protective effect, with older children showing lower mortality risk across all modeling approaches, whether age was included as a linear term (Cox, Poisson, Weibull) or modeled flexibly using smooth functions in the GAMM.
The flexible GAMM added insight by allowing the hazard to vary smoothly over survival time. However, the smooth term did not indicate meaningful deviations from proportionality, suggesting that the Cox model’s proportional hazards assumption was reasonable. Moreover, the GAMM’s parametric coefficients were broadly consistent with those from the Cox models, reinforcing the robustness of the findings