library(survival)
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
## 
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
## 
##     myeloma
library(ggplot2)
library(tidyr)

IMPORTING THE DATASET

data(package = "survival")
data("cancer" , package = "survival")
head(cancer)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
colnames(cancer)
##  [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"  
##  [7] "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"
unique(cancer$sex)
## [1] 1 2

1. KAPLAN MEIER

# Fit Kaplan-Meier model stratified by  Sex
km_fit <- survfit(Surv(time, status) ~ sex, data = cancer )

# Plot Kaplan-Meier survival curve stratified by  sex
ggsurvplot(km_fit, data = cancer, 
           main = "Kaplan-Meier Survival Curve by  sex", 
           xlab = "Time", 
           ylab = "Survival Probability", 
           palette = c("blue", "red"), 
           legend.title = "sex", 
           legend.labs = c("Male", "Female"))

INTERPRETATION OF KAPLAN-MEIER CURVE

The males in the dataset seem to have better survival compared to females Females appear to have a lower survival probability over time compared to males in this dataset. This could indicate that females in this study group experience the event of interest at a higher rate than males

2. PERFORMING THE LOG-RANK TEST

logrank_test <- survdiff(Surv(time, status) ~ sex, data = cancer)

# Display the result of the test
logrank_test
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = cancer)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

INTERPRETATION OF LOG-RANK TEST

The p-value of 0.001 indicates that the difference in survival between males and females is statistically significant. Therefore, we can reject the null hypothesis and conclude that there is a significant difference in survival between the two groups (males and females).

The Chi-square value of 10.3 supports this finding, showing a strong association between sex and survival time in the data.

3. NELSON-AALEN

fit_na <- survfit(Surv(time, status) ~ sex, data = cancer, type = "fleming-harrington")

ggsurvplot(fit_na, data = cancer, fun = "cumhaz",
           xlab = "Time", ylab = "Cumulative Hazard",
           legend.title = "sex", legend.labs = c("Male", "Female"))

INTERPRETATION

Based on this Nelson-Aalen estimate of the cumulative hazard, males have a higher cumulative hazard compared to females in this dataset. This could indicate that males have a higher overall risk of the event than females over the time period observed.

4. COX PROPORTION HAZARD

fit_cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog , data = cancer)
fit_cox
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = cancer)
## 
##              coef exp(coef)  se(coef)      z        p
## age      0.011067  1.011128  0.009267  1.194 0.232416
## sex     -0.552612  0.575445  0.167739 -3.294 0.000986
## ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05
## 
## Likelihood ratio test=30.5  on 3 df, p=1.083e-06
## n= 227, number of events= 164 
##    (1 observation deleted due to missingness)
# Display the results
summary(fit_cox )
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = cancer)
## 
##   n= 227, number of events= 164 
##    (1 observation deleted due to missingness)
## 
##              coef exp(coef)  se(coef)      z Pr(>|z|)    
## age      0.011067  1.011128  0.009267  1.194 0.232416    
## sex     -0.552612  0.575445  0.167739 -3.294 0.000986 ***
## ph.ecog  0.463728  1.589991  0.113577  4.083 4.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## age        1.0111     0.9890    0.9929    1.0297
## sex        0.5754     1.7378    0.4142    0.7994
## ph.ecog    1.5900     0.6289    1.2727    1.9864
## 
## Concordance= 0.637  (se = 0.025 )
## Likelihood ratio test= 30.5  on 3 df,   p=1e-06
## Wald test            = 29.93  on 3 df,   p=1e-06
## Score (logrank) test = 30.5  on 3 df,   p=1e-06

INTERPRETATION

1. Coefficient (coef):

These values represent the effect of each covariate on the log hazard. A positive value increases the hazard (i.e., increases the risk of the event), while a negative value decreases the hazard. age: (0.0111) — The coefficient for age is positive, which means that as age increases, the hazard (risk of the event) slightly increases. However, the effect is small.

sex: (−0.5526) — The negative coefficient suggests that being male (1 for male and 2 for female) decreases the hazard, meaning males are at a lower risk of the event compared to females.

ph.ecog:(0.4637) — The positive coefficient indicates that worse performance status (higher ECOG score) increases the risk of the event. Higher ECOG scores are associated with poorer health status and an increased hazard.

2. Exponent of Coefficients (exp(coef)):

These are the hazard ratios (HR), which show the relative risk of the event for each unit increase in the covariate. age: 1.0111 — For each one-unit increase in age, the hazard increases by 1.1%.

sex: 0.5754 — Being male reduces the hazard by 42.46% compared to being female.

ph.ecog: 1.5899 — A one-unit increase in the ECOG performance status increases the hazard by approximately 59%

3. p-value:

This tests whether the coefficient is significantly different from zero (null hypothesis). A p-value less than 0.05 typically indicates statistical significance.

age: The p-value is 0.2324, which is not statistically significant. This means that age does not significantly affect the survival time in this model.

sex: The p-value is 0.000986, which is very significant. This suggests that sex has a significant effect on survival.

ph.ecog: The p-value is 4.45×10^−5, which is extremely significant, indicating that the performance status (ph.ecog) significantly affects survival.

4. Confidence Intervals:

age: The 95% confidence interval for age is [0.9929, 1.0297]. Since this range includes 1, it suggests that the effect of age might not be statistically significant.

sex: The 95% confidence interval for sex is [0.4142, 0.7994], which does not include 1, suggesting that being male is significantly associated with a lower risk of the event.

ph.ecog: The 95% confidence interval for ph.ecog is [1.2727, 1.9864], which does not include 1, indicating that higher performance status is significantly associated with a higher risk.

5. Test of Hypothesis:

Likelihood ratio test: The likelihood ratio test statistic is 30.5, which tests the overall significance of the model. The p-value is 1×10^−6, indicating that the model as a whole is statistically significant.

Wald test: The Wald test statistic is 29.93, which is also significant, with a p-value of 1×10^−6

Score (logrank) test: The score test statistic is 30.5, with a p-value of 1×10^−6, indicating significant differences in the hazard across the covariates.

6. Concordance Index (C-index):

The concordance index (C-index) is 0.637. The C-index measures how well the model discriminates between high- and low-risk individuals. The value of 0.637 suggests that the model has moderate discrimination.