Introduction

Background

Heart failure remains a leading cause of global cardiovascular mortality, particularly in developing nations where resource constraints complicate disease management. The Institute of Cardiology and Allied Hospital Faisalabad serves a high-risk population in Pakistan’s Punjab region, where modifiable risk factors like hypertension and diabetes are prevalent. Existing studies in South Asia have identified left ventricular systolic dysfunction as a critical prognostic factor, yet comprehensive survival analyses incorporating both clinical and biochemical parameters remain limited in this setting.

Problem Statement

Despite established protocols for heart failure management, mortality rates among New York Heart Association (NYHA) class III-IV patients in institutions exceed regional averages. Current risk prediction models often fail to account for the complex interplay between biochemical markers (serum sodium, creatinine) and clinical parameters (ejection fraction). This gap necessitates a detailed survival analysis to identify modifiable prognostic factors specific to our patient population.

Aim

The aim of the study was to estimate death rates due to heart failure for patients who were admitted to Institute of Cardiology and Allied hospital Faisalabad-Pakistan during April-December (2015).

Objectives

This main study objectives were to:

Determine survival rates among heart failure patients with left ventricular systolic dysfunction.
Develop a risk stratification model incorporating both continuous and categorical variables.

Justification

The investigation of 299 patients provides sufficient power to detect clinically significant associations, with gender distribution (35% female, 65% male) reflecting regional hospitalization patterns. By analyzing variables through both clinical records (physician notes, echo reports) and laboratory data (hematocrit, CPK), this study offers a multidimensional assessment of mortality predictors. The focus on NYHA III-IV patients addresses a critical knowledge gap regarding end-stage heart failure outcomes in resource-limited settings.

Data and Methods

Study Design

A hospital-based prospective cohort study of consecutively admitted heart failure patients between April-December 2015 was carried out. Inclusion criteria comprised: Age \(\geq\) 40 years ,Documented left ventricular systolic dysfunction (LVEF <45% on echocardiography) and New York Heart Association functional class III or IV.

Detail of data

Current study is based on 299 patients of heart failure comprising of 105 women and 194 men. All the patients were more than 40 years old, having left ventricular systolic dysfunction and falling in NYHA class III and IV. Follow up time was 4–285 days. Disease was diagnosed by cardiac echo report or notes written by physician. Age, serum sodium, serum creatinine, gender, smoking, Blood Pressure (BP), Ejection Fraction (EF), anemia, platelets, Creatinine Phosphokinase (CPK) and diabetes were considered as potential variables explaining mortality caused by CHD. Age, serum sodium , serum creatinine and CPK are continuous variables whereas EF, Gender, Smoking status, anaemia, Diabetes,Blood pressure and platelets were taken as categorical variables.

Statistical Analysis

To summarise conntinous variables , means and standard deviations were used for normally distributed variables. For categorical data , proportions and frequencies were also reported. Kaplan Meier curves were constructed for categorical variables to compare survival rates. Mortality and time to death of patients (died/survived) was modeled using a Cox proportional hazards model. Stepwise selection (sequential replacement) was used to come up with the optimal set of predictors that determine mortality . Proportional hazards assumption was assessed using schonienfeld residuals. The full and reduced Model were compared using Akaike and Bayesian Information Criterion as well as likelihood ratio test. Residuals analysis such as the coxsnell and deviance residuals were conducted to determine model fit for the final model.

All Statistical analysis were computed using R 4.4.3 (R core teams,2022) using the Rstudio interface. Additional packages such as the survival ,survminer and eha packages were used for this analysis. The detailed analysis can be found on my Rpubs Research report.

out_new<-out_new |> 
  mutate(EFraction_cat=if_else(Ejection.Fraction <= 30,"EF<=30",
                                  ifelse(between(Ejection.Fraction,30,45),"30-45","EF>45")),
         platelets_quartile=ntile(Pletelets,3),
         platelets_cat = case_when(platelets_quartile ==1 ~ "< Q1",
                                   platelets_quartile ==2 ~ "< Q2",
                                   platelets_quartile ==3 ~ "< Q3"),
         event_cat = if_else(Event == 1,"dead","alive")) |> 
   mutate(BP = ifelse(BP == 1, "Yes","No"),
         Smoking = ifelse(Smoking ==1 , "Yes","No"),
         Diabetes = ifelse(Diabetes ==1, "Yes","No"),
         Anaemia = ifelse(Anaemia ==1, "Yes","No"),
         Gender = ifelse(Gender==1 ,"M","F"))

Explanatory data analysis

Correlation Analysis

There were no significant correlations among the dependent variables hence no issues with multicollinearity would be detected in model building steps.

heart_meas <- out_new  |> select_if(is.numeric) |> 
  select(-platelets_quartile,-Event)
corr <- round(cor(heart_meas, use="complete.obs"), 2)
ggcorrplot(corr, lab = TRUE, colors = c("aquamarine", "white", "dodgerblue","blue"), 
           show.legend = T, outline.color = "gray", type = "upper", 
           tl.cex = 15, lab_size = 3.5, sig.level = 0.1,
           title = "Correlation Matrix of Heart Attack Data") +
  labs(fill = "Correlation") + 
  theme(axis.text.x = element_text(size=8,margin=margin(-2,0,0,0)),  
        axis.text.y = element_text(size=8,margin=margin(0,-2,0,0)),
        panel.grid.major=element_blank())

The plot below shows the number of patients in each Ejection Fraction Category and the majority of patients where in the 30 <EF<=45 category. For the variables , Smoking , Diabetes ,BP and Anaemia we can note that the majority of patients were not in the yes Category , hence there were more patients who did not smoke than those who smoked. There more patients who did not have anaemia and BP than those who had anaemia and BP. most importantly there were more patients who got censored than those who died.

Proportions in each categorical variable by outcome of interest (Death) are also displayed in the plots below.

It can be noted that there is some significant differences in Ejection Fraction between between people who died and those who survived. From the boxplots ,the mean age seems to be higher for patients who died as compared to those who censored. Other boxplots do not show any significant mean differences.


subset<- out_new |>
  dplyr::select(Age,Sodium,Creatinine,Ejection.Fraction,Pletelets,CPK,event_cat)# Bring in external file for visualisations
source('functions/visualisations.R')

# Use plot function
plot <- histoplotter(subset,event_cat, 
                     chart_x_axis_lbl = "Event status", 
                     chart_y_axis_lbl = 'Measures',
                     boxplot_color = 'navy', 
                     boxplot_fill = '#89CFF0', 
                     box_fill_transparency = 0.2) 

# Add extras to plot
plot + 
  ggthemes::theme_stata() +
  theme(legend.position = 'bottom')

Statistical Analysis

The table below shows chi-squared test of association and t-test for potential predictor variables and our outcome variable of interest. Prelimenary results show that Ejection fraction is the categorical variable that is significantly associated with our outcome variable at 5% level of significance(\(\chi^2=32,p<0.001\). There is no significant evidence to show association for other categorical predictors. The table also shows that there is significant difference in mean Age of patients who died and those who survived (\(p<0.001\)). There is statistically significant difference in mean Sodium and Creatinine for patients who died and those who survived (\(p=0.002\)) and (\(p<0.001\)). Mean Ejection Fraction differs significantly for patients who survived vs those who died (\(p<0.001\)) implying that the differences in the averages is not merely an outcome of chance and as can be highlighted from the categorised variable.

out_new|> 
  dplyr::select(BP,platelets_cat,EFraction_cat,Diabetes,Smoking,event_cat,Anaemia,Gender,Age,Sodium,Creatinine,Ejection.Fraction,Pletelets,CPK) |> 
  tbl_summary(
    by = event_cat,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"), 
    type=list(Smoking~"categorical",
              Diabetes~"categorical",
              BP ~ "categorical",
              Anaemia = "categorical"))|> 
  add_p(test = all_continuous() ~ "t.test",
        pvalue_fun = function(x) style_pvalue(x, digits = 2))|> 
   modify_header(statistic ~ "**Test Statistic**")|>
  bold_labels()|> 
  modify_fmt_fun(statistic ~ style_sigfig) |> 
  italicize_levels() |> 
  bold_p(t = 0.8)

Characteristic	alive N = 203¹	dead N = 96¹	Test Statistic²	p-value²
BP			1.9	0.17
No	137 / 203 (67%)	57 / 96 (59%)
Yes	66 / 203 (33%)	39 / 96 (41%)
platelets_cat			3.5	0.17
< Q1	61 / 203 (30%)	39 / 96 (41%)
< Q2	73 / 203 (36%)	27 / 96 (28%)
< Q3	69 / 203 (34%)	30 / 96 (31%)
EFraction_cat			32	<0.001
30-45	115 / 203 (57%)	31 / 96 (32%)
EF<=30	42 / 203 (21%)	51 / 96 (53%)
EF>45	46 / 203 (23%)	14 / 96 (15%)
Diabetes			0.00	0.97
No	118 / 203 (58%)	56 / 96 (58%)
Yes	85 / 203 (42%)	40 / 96 (42%)
Smoking			0.05	0.83
No	137 / 203 (67%)	66 / 96 (69%)
Yes	66 / 203 (33%)	30 / 96 (31%)
Anaemia			1.3	0.25
No	120 / 203 (59%)	50 / 96 (52%)
Yes	83 / 203 (41%)	46 / 96 (48%)
Gender			0.01	0.94
F	71 / 203 (35%)	34 / 96 (35%)
M	132 / 203 (65%)	62 / 96 (65%)
Age	59 (11)	65 (13)	-4.2	<0.001
Sodium	137.2 (4.0)	135.4 (5.0)	3.2	0.002
Creatinine	1.18 (0.65)	1.84 (1.47)	-4.2	<0.001
Ejection.Fraction	40 (11)	33 (13)	4.6	<0.001
Pletelets	266,657 (97,531)	256,381 (98,526)	0.84	0.40
CPK	540 (754)	670 (1,317)	-0.90	0.37
¹ n / N (%); Mean (SD)
² Pearson’s Chi-squared test; Welch Two Sample t-test

Kaplan Meier Curves

The plots below show the overall survival rates for the patients and here we notice that first event occured at day 4, with survival probability dropping to 99.7% ,between days 10-30 (survival drops from 96% to 88%) ,Median survival was not reached (since survival doesn’t drop below 50%).The steepness continues until day 241 with final survival at 57.6%

The plots below show that survival for \(EF \leq 30\) was lower than other two levels. Moreover, relatively small difference between the survival of patients with \(30<EF<45\) and \(EF\geq 45\) can be observed , log rank test also suggesr that the difference in the survival times was also statistically significant at 5% (\(p<0.001\)). There is significant differences in the survival times between patients with high blood pressure and those without (\(p=0.036\)) , patients with high blood pressure have lower survival time.

srvfit1 <- survfit(Surv(TIME,Event)~platelets_cat,data=out_new)
srvfit2 <- survfit(Surv(TIME,Event)~EFraction_cat,data=out_new)
srvfit3 <- survfit(Surv(TIME,Event)~BP,data=out_new)
srvfit4 <- survfit(Surv(TIME,Event)~Anaemia,data=out_new)
srvfit6 <- survfit(Surv(TIME,Event)~Diabetes,data=out_new)
srvfit7 <- survfit(Surv(TIME,Event)~Gender,data=out_new)
srvfit8 <- survfit(Surv(TIME,Event)~Smoking,data=out_new)

splots<-list()
splots[[1]]<-ggsurvplot(srvfit1, 
                 palette=c("gold","hotpink","darkblue"), 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 tables.height = 0.3,
                 pval = TRUE,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$platelets_cat),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title = "platelet category",
                 xlab = "Time",
                 ylab = "Probability of death",
                 legend = "none", censor = FALSE)

splots[[2]]<-ggsurvplot(srvfit2, 
                palette=c("gold","hotpink","darkblue"), 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 pval = TRUE,
                 tables.height = 0.3,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$EFraction_cat),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title = "Ejection fraction category",
                 xlab = "Time",
                 ylab = "Probability death",
                 legend = "none", censor = FALSE)


splots[[3]]<-ggsurvplot(srvfit3, 
                 palette = paletteer_d("ggthemes::Superfishel_Stone")[c(3,1,5)], 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 pval = TRUE,
                 tables.height = 0.3,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$BP),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title = "Blood Pressure",
                 xlab = "Time",
                 ylab = "Probability death",
                 legend = "none", censor = FALSE)


splots[[4]]<-ggsurvplot(srvfit4, 
                 palette = paletteer_d("ggthemes::Superfishel_Stone")[c(3,1,5)], 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 pval = TRUE,
                 tables.height = 0.3,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$Anaemia),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title = "Anaemia",
                 xlab = "Time",
                 ylab = "Probability death",
                 legend = "bottom", censor = FALSE)


splots[[5]]<-ggsurvplot(srvfit6, 
                 palette = paletteer_d("ggthemes::Superfishel_Stone")[c(3,1,5)], 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 pval = TRUE,
                 tables.height = 0.3,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$Diabetes),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title = "Diabetes",
                 xlab = "Time",
                 ylab = "Probability death",
                 legend = "bottom", censor = FALSE)

splots[[6]]<-ggsurvplot(srvfit7, 
                 palette = paletteer_d("ggthemes::Superfishel_Stone")[c(3,1,5)], 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 pval = TRUE,
                 tables.height = 0.3,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$Gender),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title ="By Gender",
                 xlab = "Time",
                 ylab = "Probability death",
                 legend = "bottom", censor = FALSE)
splots[[7]]<-ggsurvplot(srvfit8, 
                 palette = paletteer_d("ggthemes::Superfishel_Stone")[c(3,1,5)], 
                 size = 1.2, conf.int = TRUE,
                 risk.table = TRUE,
                 pval = TRUE,
                 tables.height = 0.3,
                 surv.median.line="hv",
                 legend.labs = levels(out_new$Smoking),
                 legend.title = "",
                 ggtheme = theme_minimal() + theme(plot.title = element_text(face = "bold")),
                 title = "Smoking status",
                 xlab = "Time",
                 ylab = "Probability death",
                 legend = "none", censor = FALSE)



res<-arrange_ggsurvplots(splots,print = TRUE,ncol = 4,nrow = 1)

Statistical Modeling

Cox proportional Hazard

The table below shows the adjusted and unadjusted Cox proportional hazards models for all the potential predictors of interest. The preliminary results here show that variables Age ,Ejection fraction, Sodium, blood pressure and CPK are statistically significant at 5% level adjusting for other predictors. The fitted univariate models for the same variables indicate that they are statistically significant in predicting mortality for heart attack patients.

full_cox_m1<-coxph(Surv(TIME,Event)~platelets_cat+Gender+Age+EFraction_cat+Smoking+Sodium+BP+Diabetes+Anaemia+CPK,data=out_new) 

cox_m1_tab<-full_cox_m1|> 
  gtsummary::tbl_regression(exponentiate=T)

uvcm_table <- tbl_uvregression(
  out_new %>% 
    select(platelets_cat,Gender,Age,EFraction_cat,Smoking,Sodium,BP,Diabetes,Anaemia,CPK,TIME,Event),
  method       = coxph,
  y            = Surv(TIME,Event),
  exponentiate = TRUE
  ) 
 # 300 predictors

fancy_table <-
  tbl_merge(
    tbls        = list(uvcm_table, cox_m1_tab),
    tab_spanner = c("Unadjusted", "Adjusted")
  )

fancy_table

Characteristic	Unadjusted				Adjusted
Characteristic	N	HR	95% CI	p-value	HR	95% CI	p-value
platelets_cat	299
< Q1		—	—		—	—
< Q2		0.70	0.43, 1.14	0.15	0.64	0.38, 1.08	0.091
< Q3		0.79	0.49, 1.27	0.3	0.78	0.47, 1.29	0.3
Gender	299
F		—	—		—	—
M		1.01	0.67, 1.54	>0.9	0.77	0.48, 1.26	0.3
Age	299	1.04	1.03, 1.06	<0.001	1.05	1.03, 1.07	<0.001
EFraction_cat	299
30-45		—	—		—	—
EF<=30		3.18	2.03, 4.97	<0.001	3.46	2.16, 5.55	<0.001
EF>45		1.30	0.69, 2.44	0.4	1.22	0.64, 2.33	0.5
Smoking	299
No		—	—		—	—
Yes		0.99	0.64, 1.53	>0.9	1.09	0.67, 1.77	0.7
Sodium	299	0.93	0.90, 0.97	<0.001	0.93	0.89, 0.97	0.001
BP	299
No		—	—		—	—
Yes		1.55	1.03, 2.33	0.037	1.73	1.13, 2.66	0.012
Diabetes	299
No		—	—		—	—
Yes		0.96	0.64, 1.44	0.8	1.08	0.70, 1.65	0.7
Anaemia	299
No		—	—		—	—
Yes		1.40	0.94, 2.09	0.10	1.33	0.86, 2.06	0.2
CPK	299	1.00	1.00, 1.00	0.3	1.00	1.00, 1.00	0.006
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

reduced_cox<-coxph(formula = Surv(TIME, Event) ~ Age + EFraction_cat + Sodium + 
    BP + Anaemia + CPK, data = out_new)

Interpretting Optimal model Coefficients

The forest plot below shows the optimal model after performing a stepwise selection (sequential replacement). The reduced model has both the least Akaike information Criterion and Bayesian information Criterion (\(957.72 ~and~ 975.67\)) as compared to the full model (\(964.11~and~994.88\)) or model with all variables indicating that the reduced model is a better fit. The likelihood ratio test comparing the full model and reduced model suggest that adding more variables to the reduced model does not improve the model efficiency since the result suggest that we reject \(H_0\) i.e (\(\chi^2 =3.612,p=0.6065\)).

The forest plot shows that for a 1 year increase in Age , the log hazard of death increase by 0.05 when adjusting for other predictors, when exponentiated this translates to a 5% increase in the hazard of death for patients with heart attack . The result is statistically significant (\(p<0.001\)).Ejection Fraction category was another significant factor when adjusting for other predictors, the hazard rate among patients with \(EF \leq 30\) was \(3.4\) folds significantly higher as compared to patients with ejection fraction between \(30 ~and~45\) (\(p<0.001\)). Patients with Ejection Fraction above 45 were \(21\%\) more risky to die as compared to patients in the baseline category however this category was not statistically significant. Serum sodium was significant with \(p-value = 0.003\) and its one unit \((meq/L)\) increase decreases the hazard of death by 6% when adjusting for other predictors. An Anemic patient had 41% more chances of death as compared to a non-anemic patient when adjusting for other covariates , however Anemia was an insignificant variable with \(p-value <0.05\) . CPK was found to be statistically non-significant since hazard ratio was 1 , even though the \(p-value\) is less than 5%. Patients with high Blood pressure have 76% increased hazard(risk) of death as compared to those who do not have high blood pressure, and this is statistically significant at 5% level of significance (\(p<0.009\)).

survminer::ggforest(reduced_cox,data=as.data.frame(out_new))

Model Diagnostics

Multicollinearity

The plot below shows that all variables had variance inflation factor within the range of 1 to 2 indicating that multicollinearity is not a problem in our final model.

plot(performance::check_collinearity(reduced_cox))

Proportional hazards assumption

The plots below show that the test of proportionality is not significant for each of the covariates except for Ejection fraction. The global test is also not statistically significant implying that our proportional hazards assumption is reasonable.

test_coxph<-cox.zph(reduced_cox)
ggcoxzph(test_coxph)

Martingale and Deviance Residuals

The plots of residuals below are roughly symmetrically distributed about zero hence there is minimal signs of outliers.


p1|p2

Coxsnell residuals

The plot of coxsnell residuals below shows that our model is not a 100% perfect fit but the cumulative hazard of the residuals follows some degree of a 45° line (unit exponential distribution) to suggest an approximately perfect fit.

coxsnellres <- out_new$Event-resid(reduced_cox,type="martingale")

## Use NA method to estimate the cumulative hazard function for residuals
fit4 <- survfit(Surv(coxsnellres,out_new$Event==1)~1)
nacumhaz <- cumsum(fit4$n.event/fit4$n.risk)


plot(fit4$time,-log(fit4$surv),
 type = "l",
 xlab="Time",
 ylab = "H(t) based on reduced model residuals",
 col='red'
 )
 lines(coxsnellres, coxsnellres, type = "l",col='blue',lty=2)

Discussion

The statistical analysis identified age, Ejection fraction, sodium and BP as the significant variables affecting the likelihood of mortality among heart failure patients. Both these variables were observed to be associated with an increased hazard of death. The findings that seem surprising are non-significance of smoking and diabetes. However, similar results concerning diabetes and smoking have been reported in other studies as well. The reason behind may be smoking and diabetes are basically causes of heart problem at initial stages. We were only concerned with patients of NYHA class III and IV which are advanced stages of heart failure. Up to these stages, these factors (diabetes and smoking) may probably be controlled by medications and hence these factors do not have significant effect on deaths due to heart failure in class III and IV.

Performance of model was checked using Akaike Information Criterion ,Bayesian Information Criterion and residuals such as coxsnell , deviance and martingale residuals and all these suggest that our final model fits the data pretty well.

References

WHO. Fact sheet on CVDs. Global Hearts. World Health Organization. 2016.
Al-Shifa IH. Cardiac Diseases in Pakistan [Internet]. 2016 . http://www.shifa.com.pk/chronic-disease-pakistan/
Pillai Harikrishnan Sivadasan Ganapathi S. Heart failure in South Asia. Curr Cardiol Rev. 2013;9: 102–111. pmid:23597297.

Appendix

General survival analysis ideas

Survival analysis deals with the study of time till a certain event of interest occurs.

Censoring

Censoring in survival analysis refers to the situation where the outcome(e.g., death, hospitalization, etc.) has not occurred for some subjects by the end of the study. This can occur for various reasons, such as loss to follow-up, withdrawal from the study, or the occurrence of another event that renders the subject no longer at risk for the event of interest. The common types of censoring include :

left sensoring
right sensoring
interval censoring

Because of censoring,we can not use the usual ordinay statistical methods because we can not compute the descriptive statistics like sample mean,standard deviation or even perform regression analysis.

Survival function

Survival analysis typically involves using T as the response variable, as it represents the time till an outcome occurs, while t refers to the actual survival time of an individual. The distribution function F(t) is used to describe chances of time of survival being less than a given value, t. In some cases, the cumulative distribution function is of interest and can be obtained as follows:

\[\begin{equation} F(t)=P(T\le t)=\int_{t}^{\infty}f(u) \end{equation}\] Interest also tends to focus on the survival or survivorship function which in this study is the chances of a patient surviving after time \(t\) and also given by the formula

\[\begin{equation} S(t) = P(T > t)=\int_{t}^{\infty} f(u) du \end{equation}\]

\(S(t)\) monotonically decreasing function with \(S(0) = 1\) and \(S(\infty) = \text{lim}_{t \rightarrow \infty} S(t) = 0\).

Suppose we have \(R (t)\) (set at risk) that has patients who are prone to dying by time t, the chances of a patient in that set to die in a given interval \([t, t + \delta t)\) is given by the following formula, where \(h(t)\) is the hazard rate and is defined as: \[\begin{equation} h(t) = \lim_{\delta t \rightarrow 0} \frac{P(t \le T < t+\delta t | T \ge t)}{\delta t} \end{equation}\] The hazard function’s shape can be any strictly positive function and it changes based on the specific survival data provided.

Relationship between h(t) and S(t)

It is given that \(f(t) = - \frac{d}{dt} S(t)\) hence the equation above can be simplified to:

\[\begin{equation} h(t)=\frac{f(t)}{S(t)} =\frac{- \frac{d}{dt} S(t)}{S(t)}= - \frac{d}{dt} [\log S(t)] \end{equation}\]

Taking integrals both sides the above reduces to: \[H(t)= -\log S(t)\]

where \(H(t)=\int_{t}^{t} h(u) du\) ,is the cumulative risk of a patient dying by time t. Since the event is death, then H(t) summarises the risk of death up to time t, assuming that death has not occurred before t. The survival function is also approximately simplified as follows. \(S(t) = e\^{-H(t)} \approx 1 -H(t) \text{ when H(t) is small. }\)

Non-parametric methods

Estimates of the hazard and survival function are often used to summarize survival data and such methods are often referred to as Non-parametric methods of estimation. More detailed information such as the median and percentiles can be taken from the estimated survival function thereby giving more profound analysis (Cleves et al.,2008, Hanagal, 2011).

Empirical Survival Function

Assuming no censoring in the data points as well as no tied observations in the data. The empirical function of survival is used to give an estimate of S(t) in the absence of censoring and is given by:

\[\begin{align*} \hat{S}(t) = \frac{no~ of~patients~ at~T \ge t} {total number of patients} \end{align*}\]

Kaplan-Meier estimate

If there are \(r\) times of deaths among patients with heart attack, such that \(r\leq n\) and \(n\) is the number of patients with observed survival times. Arranging these times in from the smallest to the largest,the \(j\)-th death time is noted by \(t_{(j)}\) , for \(j = 1, 2, \cdots , r\), such that the \(r\) death times ordred are \[ t_{1}<t_{2}< \cdots < t_{m}\] . Let \(n_j\), for \(j = 1, 2, \ldots, r\), denote the number of patients who are still alive before time \(t(j)\), including the ones that are expected to die at that time. Additionally, let \(d_j\) denote the number of patients who die at time \(t(j)\). The estimate for the product limit can be expressed in the following form:

\[\begin{equation} \hat{S}(t) = \prod_{j=1}^k \, \left(1-\frac{d_j}{r_j}\right) \end{equation}\]

To survive until time \(t\), a patient must start by surviving till \(t(1)\), and then continue to survive until each subsequent time point \(t(2), \ldots, t(r)\). It is assumed that there are no deaths between two consecutive time points \(t(j-1)\) and \(t(j)\). The conditional probability of surviving at time \(t(i)\), given that the patient has managed to survive up to \(t(i-1)\), is the complement of the proportion of the new born babies who die at time \(t(i)\), denoted by \(d_i\), to the new born babies who are alive just before \(t(i)\), denoted by \(n_i\). Therefore, the probability that is conditional of surviving at time \(t(i)\) is given by \(1 - \frac{d_i}{n_i}\). The variance of the estimator is given by

Log rank test

Given the problem of contrasting two or more survival functions,for instance survival times between male and female patients with heart attack. Let \[ t_{1}<t_{2}< \cdots < t_{r}\]

Let \(t_i\) be the unique times of death observed among the patients, found by joining all groups of interest. For group \(j\),\ let \(d_{ij}\) denote the number of deaths at time \(t_i\), and\ let \(n_{ij}\) denote the number at risk at time \(t_i\) in that group. Let \(d_i\) denote the total number of deaths at time \(t_i\) across all groups, and\

the expected number of deaths in group \(j\) at time \(t_i\) is: ,

\[E(d_{ij}) = \frac{n_{ij} \, d_i}{n_i} = n_{ij} \left(\frac{d_i}{n_i}\right) \]

At time \(t_i\), suppose there are \(d_i\) deaths and \(n_i\) individuals at risk in total, with \(d_{i0}\) and \(d_{i1}\) deaths and \(n_{i0}\) and \(n_{i1}\) individuals at risk in groups 0 and 1, respectively, such that \(d_{i0} + d_{i1} = d_i\) and \(n_{i0} + n_{i1} = n_i\). Such data can be summarized in a \(2 \times 2\) table at each death time \(t_i\).

Given that the null hypothesis is true, the number of deaths is expected to follow the hypergeometric distribution, and therefore: \[\begin{equation} E(d_{i0}) =e_{i0}= \frac{n_{i0} \, d_i}{n_i} = n_{i0} \left(\frac{d_i}{n_i}\right) \end{equation}\]

\[\begin{equation} \text{Var}(d_{i0}) = \frac{n_{i0} \, n_{i1} \, d_i(n_i-d_i)}{n_{i}^2 (n_{i}-1)} \end{equation}\] the statistic \(U_L\) is given by \[U_L=\sum^r_{i=1}(d_{i0}-e_{i0})\] so that the variance of \(U_L\) is \[Var(U_L)=\sum^r_{j=1}v_{i0}=V_L\] which then follows that : \[\frac{U_L}{\sqrt{V_L}} \sim N(0,1)\] and squaring results in \[\frac{U^2_L}{V_L}\sim \chi^2_1\]

Parametric survival distributions

Some of the most used parametric methods include the following:

Regression models in Survival analysis

Cox’s Proportional Hazards

The model is semi-parametric in nature since no assumption about a probabilty distribution is made for the survival times. The baseline model can take any form and the covariate enter the model linearly (Fox,2008). The model can thus be represented as :

\[\begin{equation} h(t|X) = h_0(t) \exp(\beta_1 X_{1} + \cdots + \beta_p X_{p})=h_0(t)\text{exp}^{(\beta^{T} X)} \end{equation}\]

In the above equation, \(h_0(t)\) is known as the baseline hazard function, which represents the hazard function for a patient when all the included variables in the model are zero. On the other hand, \(h(t|X)\) represents the hazard of a patient at time \(t\) with a covariate vector \(X=(x_1,x_2,\ldots,x_p)\) and \(\beta^T=(\beta_1,\beta_2,\ldots,\beta_p)\) is a vector of regression coefficients.

Fitting a proportional hazards model

The partial likelihood

Given that \(t(1) < t(2) < \ldots < t(r)\) denote the \(r\) ordered death times. The partial likelihood is given by :

\[\begin{equation} L(\beta) = \prod_{i=1}^{n} \left[\frac{\text{exp}^{\beta ^{T} X_i}}{ \sum_{\ell\in R(t_{(i)}}\text{exp}^{\beta ^{T} X_\ell}}\right]^{\delta_i} \end{equation}\] Maximising the logarithmic of the likelihood function is more easier and computationally efficient and this is generally given by:

\[\begin{equation} \log L(\beta)= \sum_{i=1}^{n} \delta_i \left[ \beta ^{T} X_i - \log \left(\sum_{\ell \in R(t_{(i)}}\text{exp}^{\beta ^{T} X_\ell}\right) \right] \end{equation}\]

Inference on the hazard ratios

Testing the conjecture that \(H_0 :\mathbf{\beta}=\mathbf{\beta_0}\) against \(H_a :\mathbf{\beta}\neq \mathbf{\beta_0}\) . Where \(\mathbf{\beta_0}\) is known,then

The likelihood ratio test is given by

\[T=2 \left[\log(L(\hat{\mathbf{\beta}})) - \log(L(\mathbf{\beta_0}))\right]\] Where given the null hypothesis is true \(T\) has a of \(\chi^2_{(p)}\) and we can find the \(p\)-value at tail probablity of the \(\chi^2_{(p)}\) distribution as \(P(\chi^2_{(p)}\geq T)\)

The wald test statistic has a quadratic form : \[W=(\mathbf{\hat{\beta}}-\mathbf{\beta})^{T}\mathbf{I(\hat{\beta})}(\mathbf{\hat{\beta}-\beta})\] with \[\text{s.e}(\hat{\beta}=\frac{1}{\sqrt{I(\hat{\beta})}}\]

The test statistic for the Wald test has a \(\chi^2\) distribution. To conduct a one-sided test at the \(\alpha\) level of significance, we compare the value of the test statistic, denoted as \(W\), to the chi-square value with \(1 - \alpha\) degrees of freedom.

\[\mathbf{\hat{\beta}} \sim N[\mathbf{\beta},I^{-1}(\mathbf{\hat{\beta}})]\]

According to Dobson (2002) ,the statistic

\[Z^{*}_i=\frac{\hat{\beta_i}}{\text{s.e}(\hat{\beta_i})}\sim N(0,1)\] is called the wald statistic with a null hypthothesis \[H_0:\beta_i=0\]

Residuals for Cox Regression Model

Six types of residuals are studied in this thesis: Cox-Snell, Modified Cox-Snell, Schoenfeld, Martingale, Deviance, and Score residuals.

Cox-Snell ResidualS

Cox-Snell definition of residuals given by Cox and Snell (1968) is the most common residual in survival analysis. The Cox-Snell residual for the \(ith\) individual, \(i = 1, 2, \cdots , n\), is given by

\[\begin{equation} r_{Ci}=\text{exp}(\mathbf{\hat{\beta}^{T} x_i})\hat{H}_0(t_i) \end{equation}\] where \(\hat{H}_0(t_i)\) is an estimate of the baseline cumulative hazard function at time \(t_i\). It is sometimes called the Nelson-Aalen estimator.

Martingale residuals

These are useful in determing the functional form of the covariates to be included in the model. If the test reveals that the covariate can not be included linearly then there is need for such a covariate to be transformed . The martingale residuals are represented in the following form

\[\begin{equation} r_{Mi}=\delta_i - r_{Ci} \end{equation}\]

Deviance Residuals

The skew range of Martingale residual makes it difficult to use it to detect outliers. The deviance residuals are much more symmetrically distributed about zero and are defined by \[\begin{equation} r_{Di}=\text{sgn}(r_{Mi})\left[-2\{r_{Mi}+\delta_i\log\left(\delta_i-r_{Mi}\right)\}\right]^{\frac{1}{2}} \end{equation}\]

Here, \(r_{Mi}\) represents the martingale residual for the \(i\)-th individual, and the function \(\text{sgn}(\cdot)\) denotes the sign function. Specifically, the function takes the value +1 if its argument is positive and -1 if negative.

Schoenfield residuals

Schoenfeld residuals are a useful tool for examining and testing the proportional hazard assumption, detecting leverage points, and identifying outliers. They can be calculated as follows: \[\begin{equation} r_{Sji}=\delta_i\{x_{ji}-\hat{a}_{ji}\} \end{equation}\] where \(x_{ji}\) is the value of the \(jth\) explanatory variable,\(j = 1, 2, \cdots , p\), for the \(ith\) individual in the study,

\[\begin{equation} \hat{a}_{ji}=\frac{\sum_{\ell \in R(t_i)}x_{j\ell}\text{exp}(\mathbf{\hat{\beta}^{T} x_{\ell}})}{\sum_{\ell \in R(t_i)}\text{exp}(\mathbf{\hat{\beta}^{T} x_{\ell}})} \end{equation}\]

Model Checking

Akaike Information Criterion

Contrasting between a number of possible models, which can be unnested, can also be made on the basis of Akaike’s information criterion, given by \[AIC = -2logL + 2q\] in which \(q\) is the number of unknown \(\beta\)-parameters in the model.

Bayesian Information Criterion

An alternative to the AIC statistic is the Bayesian Information Criterion, given by \[BIC = -2logL +qlogd\], where \(q\) is the number of unknown parameters in the fitted model and d is the number of uncensored observations in the data set.

Variable selection

Stepwise regression

The method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some pre- specified criterion. The aim of stepwise regression is to identify the most important predictors and to reduce the complexity of the model.There are three types of stepwise regression which are forward Selection and backward Elimination.

Backward elimination

This is the opposite of forward selection. It is a stepwise regression that begins with all of the variables in the model and then removes them one at a time until only those that are statistically significant remain. This technique can be used to reduce over-fitting and improve model accuracy.