Introduction

Generally, survival analysis is a collection of statistical procedures for data analysis for which the outcome variable of interest is time until an event occurs.

In the medical world, we typically think of survival analysis literally: tracking time until death. But it is more general than that: survival analysis models the time until any event occurs. This might be the death of a biological organism, but it could also be the time until a hardware failure in a mechanical system, time until recovery, time someone remains unemployed after losing a job, time until a ripe tomato is eaten by a grazing deer, time until someone falls asleep in a workshop, etc. Survival analysis is also known as reliability theory in engineering, duration analysis in economics, and event history analysis in sociology.

Types of events: death, disease, relapse, recovery…

The goals of survival analysis are:

Goal 1: To estimate and interpret survivor and/or hazard functions from survival data.

Goal 2: To compare survivor and/or hazard functions.

Goal 3: To assess the relationship of explanatory variables to survival time.

Setting up the data

To set up the data I use the “survival” and “tidyverse” libraries.

Data manipulation

First I check for missing values (NA) and deal with them.

## [1] TRUE
##      inst      time    status       age       sex   ph.ecog  ph.karno pat.karno 
##         1         0         0         0         0         1         1         3 
##  meal.cal   wt.loss 
##        47        14
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...
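
The code behind the output above is not shown, so the following is only a minimal sketch of how such checks and the later recoding could look. It assumes the `lung` data set from the `survival` package, the coding from its documentation (sex 1 = Male, 2 = Female; status 1 = censored, 2 = dead), and a hypothetical copy named `lung_prep`. How the missing values were actually handled is not shown either, so no imputation step is reproduced here.

```r
library(survival)  # provides the lung data set

# Work on a copy (lung_prep is a hypothetical name, not from the original analysis).
lung_prep <- survival::lung

# Missing-value checks that would print output of the kind shown above.
anyNA(lung_prep)            # TRUE if any value is missing
colSums(is.na(lung_prep))   # number of missing values per column
str(lung_prep)              # structure of the raw data

# Assumed recoding: sex as a labelled factor, ph.ecog as a categorical score.
lung_prep$sex     <- factor(lung_prep$sex, levels = c(1, 2),
                            labels = c("Male", "Female"))
lung_prep$ph.ecog <- factor(lung_prep$ph.ecog)

# Assumed event indicator: 1 = death (status 2), 0 = censored (status 1).
lung_prep$event <- as.integer(lung_prep$status == 2)
```

Treating `ph.ecog` as a factor matches the later model output, where each score gets its own coefficient relative to score 0.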

This part covers descriptive statistics.

Here is the summary table:

summary(lung[,c("age","time","ph.ecog","wt.loss","meal.cal")])
##       age             time        ph.ecog       wt.loss           meal.cal     
##  Min.   :39.00   Min.   :   5.0   0   : 63   Min.   :-24.000   Min.   :  96.0  
##  1st Qu.:56.00   1st Qu.: 166.8   1   :113   1st Qu.:  0.000   1st Qu.: 768.0  
##  Median :63.00   Median : 255.5   2   : 50   Median :  7.000   Median : 928.8  
##  Mean   :62.45   Mean   : 305.2   3   :  1   Mean   :  9.658   Mean   : 928.8  
##  3rd Qu.:69.00   3rd Qu.: 396.5   NA's:  1   3rd Qu.: 15.000   3rd Qu.:1075.0  
##  Max.   :82.00   Max.   :1022.0              Max.   : 68.000   Max.   :2600.0

The histogram below shows the distribution of survival time.

hist(lung$time,main = "Distribution of survival time (in days)",
     xlab = "Time",
     ylab = "Count",
     col = "blue")

  • Figure 1: The histogram above displays the distribution of survival time (in days) for the data set.

Survival times are concentrated in the lower-to-middle range: according to the summary table, half of the patients have times between about 167 and 397 days (median 255.5 days). Relatively few patients have very long times, so the distribution is right-skewed.

Q1 - Is there a significant difference in survival time between the gender groups?

boxplot(time ~ sex, data = lung,
        main = "Survival time by gender",
        xlab = "Gender",
        ylab = "Survival time",
        col = 4:2)

  • Figure 2: The box plot displays the distribution of survival time for male and female patients. According to the plot, females tend to have longer survival times.

To test whether the difference in mean survival time between the two groups is statistically significant, we conduct a t-test.

t.test(time ~ sex, data = lung)
## 
##  Welch Two Sample t-test
## 
## data:  time by sex
## t = -1.9843, df = 196.51, p-value = 0.04861
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -111.1266705   -0.3428947
## sample estimates:
##   mean in group Male mean in group Female 
##             283.2319             338.9667
  • According to the output above, the p-value (0.04861) is less than 0.05, which indicates that the difference in mean survival time between male and female patients is statistically significant; females have the higher mean survival time (339 vs. 283 days). Because this test treats censored follow-up times as if they were survival times, it is only a first look.

Q2 - Are there significant differences in survival time between patients with different ph.ecog scores?

anova_model <- aov(time~ph.ecog, data = lung)
summary(anova_model) #p-value - 0.0193 
##              Df  Sum Sq Mean Sq F value Pr(>F)  
## ph.ecog       3  434580  144860   3.371 0.0193 *
## Residuals   223 9582655   42972                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = time ~ ph.ecog, data = lung)
## 
## $ph.ecog
##           diff       lwr       upr     p adj
## 1-0  -37.43054 -121.8003  46.93921 0.6599478
## 2-0 -117.79302 -219.4235 -16.16256 0.0157925
## 3-0 -233.87302 -774.7016 306.95558 0.6779863
## 2-1  -80.36248 -171.5026  10.77762 0.1052341
## 3-1 -196.44248 -735.3983 342.51331 0.7814210
## 3-2 -116.08000 -658.0060 425.84604 0.9452620
  • A one-way ANOVA was performed to determine whether there are significant differences in survival time between patients in different ph.ecog groups.
  • The result shows a statistically significant difference among the groups (p = 0.019), indicating that at least one group differs from the others in mean survival time.
  • To identify which groups differ significantly from each other, a Tukey post-hoc test was conducted. The post-hoc analysis revealed that the mean survival time for group 2 is significantly lower than for group 0 (mean difference = -117.79 days, p = 0.0158).
  • No other pair of groups showed a statistically significant difference.

Q3 - What factors are associated with survival time (longer or shorter)?

According to the ANOVA in Q2, mean survival time differs across ph.ecog groups (p = 0.019), and the Tukey post-hoc test showed that the difference between ph.ecog group 2 and group 0 was statistically significant (p = 0.0158), with shorter survival in group 2. In addition, the comparison of mean survival time between sexes showed that females had a longer average survival time (338.97 days) than males (283.23 days), suggesting that sex may also be related to survival (see the Q1 conclusion).

Q4 - Can survival time be predicted from patient characteristics?

lm1 <- lm(time ~ age+sex+ph.ecog, data = lung)
summary(lm1)
## 
## Call:
## lm(formula = time ~ age + sex + ph.ecog, data = lung)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -375.0 -142.7  -54.5  100.2  732.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  357.7386   101.4369   3.527 0.000512 ***
## age           -0.4626     1.5654  -0.296 0.767856    
## sexFemale     52.3341    28.3389   1.847 0.066124 .  
## ph.ecog1     -34.3179    32.5108  -1.056 0.292310    
## ph.ecog2    -115.0029    39.9023  -2.882 0.004340 ** 
## ph.ecog3    -207.3538   208.8521  -0.993 0.321881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206.5 on 221 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.05913,    Adjusted R-squared:  0.03785 
## F-statistic: 2.778 on 5 and 221 DF,  p-value: 0.01865
  • According to the linear regression model, ph.ecog is associated with survival time: patients with ph.ecog 2 have a mean survival time about 115 days shorter than those with ph.ecog 0 (p = 0.004). The estimates for ph.ecog 1 and 3 point in the same direction but are not significant.

  • The overall model is statistically significant (p = 0.019), but it explains only about 6% of the variance (R-squared = 0.059).

  • To model the outcome more directly, I use logistic regression.

  • Logistic regression is used to model and predict a binary outcome from several explanatory variables and to estimate the effect of each variable on the odds of the outcome.

## 
## Call:
## glm(formula = Death_200 ~ age + sex + ph.ecog, family = binomial, 
##     data = lung)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.200e+00  1.128e+00  -1.064 0.287499    
## age         -1.708e-04  1.723e-02  -0.010 0.992089    
## sexFemale   -9.721e-01  3.314e-01  -2.933 0.003353 ** 
## ph.ecog1     6.827e-01  3.964e-01   1.722 0.085014 .  
## ph.ecog2     1.617e+00  4.541e-01   3.560 0.000371 ***
## ph.ecog3     1.578e+01  8.827e+02   0.018 0.985739    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 282.07  on 226  degrees of freedom
## Residual deviance: 256.58  on 221  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 268.58
## 
## Number of Fisher Scoring iterations: 13

Here I modeled the probability of death within 200 days as a function of age, sex and ph.ecog.
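
The definition of `Death_200` is not shown in the output above. The sketch below assumes it is 1 when a patient died (status 2) on or before day 200 and 0 otherwise; the data frame name `d` and the object name `logit_model` are hypothetical. Under this assumed definition, patients censored before day 200 are counted as 0, which is a limitation to keep in mind, and the sketch is not guaranteed to reproduce the numbers above.

```r
library(survival)

# Hypothetical working copy with the same recoding as the model output implies.
d <- survival::lung
d$sex     <- factor(d$sex, levels = c(1, 2), labels = c("Male", "Female"))
d$ph.ecog <- factor(d$ph.ecog)

# Assumed definition: death observed on or before day 200.
d$Death_200 <- as.integer(d$status == 2 & d$time <= 200)

# Logistic regression with the same formula as in the output above.
logit_model <- glm(Death_200 ~ age + sex + ph.ecog,
                   family = binomial, data = d)
summary(logit_model)
```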

Interpretation:

  • sexFemale: being female significantly reduces the odds of death within 200 days (p = 0.003).

  • ph.ecog 2: a strong and statistically significant predictor of death within 200 days (p < 0.001). Patients with a performance status of 2 have about 5 times the odds of dying within 200 days compared to fully active patients (ph.ecog 0).

  • ph.ecog 1: somewhat increases the odds, but the effect is not quite statistically significant (p = 0.085).

  • age: no significant effect (p = 0.99).

  • ph.ecog 3: very large estimate but not significant (p = 0.99); the huge standard error reflects that only one patient has this score.
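
The "about 5 times the odds" statement comes from exponentiating the logistic regression coefficients, which are on the log-odds scale. As a small illustration, the values below are copied by hand from the glm output above; the vector names are only for this sketch.

```r
# Log-odds coefficients copied from the glm output above.
log_odds <- c(sexFemale = -0.9721, ph.ecog1 = 0.6827, ph.ecog2 = 1.617)

# Exponentiating turns log-odds into odds ratios relative to the reference group.
odds_ratios <- exp(log_odds)
round(odds_ratios, 2)
```

This gives odds ratios of roughly 0.38 for females, 1.98 for ph.ecog 1 and 5.04 for ph.ecog 2.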

Survival analysis

survdiff(Surv(time,event) ~ sex, data = lung) #p= 0.001
## Call:
## survdiff(formula = Surv(time, event) ~ sex, data = lung)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male   138      112     91.6      4.55      10.3
## sex=Female  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001
cox_model <- coxph(Surv(time,event) ~ sex + ph.ecog, data = lung)
summary(cox_model)
## Call:
## coxph(formula = Surv(time, event) ~ sex + ph.ecog, data = lung)
## 
##   n= 227, number of events= 164 
##    (1 observation deleted due to missingness)
## 
##              coef exp(coef) se(coef)      z Pr(>|z|)    
## sexFemale -0.5449    0.5799   0.1681 -3.241  0.00119 ** 
## ph.ecog1   0.4182    1.5192   0.1994  2.097  0.03602 *  
## ph.ecog2   0.9475    2.5792   0.2248  4.216 2.49e-05 ***
## ph.ecog3   2.0485    7.7565   1.0269  1.995  0.04605 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## sexFemale    0.5799     1.7245    0.4171    0.8062
## ph.ecog1     1.5192     0.6582    1.0277    2.2459
## ph.ecog2     2.5792     0.3877    1.6602    4.0067
## ph.ecog3     7.7565     0.1289    1.0366   58.0390
## 
## Concordance= 0.642  (se = 0.025 )
## Likelihood ratio test= 29.5  on 4 df,   p=6e-06
## Wald test            = 30.7  on 4 df,   p=4e-06
## Score (logrank) test = 32.71  on 4 df,   p=1e-06

The output above shows the log-rank test and the Cox proportional hazards model.
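
The `event` variable used in `Surv(time, event)` is not defined in the code shown. A minimal sketch, assuming it marks deaths (status 2) as 1 and censored observations (status 1) as 0, together with Kaplan-Meier curves by sex, could look like this. The names `d`, `km_fit` and `logrank` are hypothetical.

```r
library(survival)

d <- survival::lung
d$sex   <- factor(d$sex, levels = c(1, 2), labels = c("Male", "Female"))
d$event <- as.integer(d$status == 2)  # assumed coding: 1 = death, 0 = censored

# Kaplan-Meier estimate of the survivor function for each sex.
km_fit <- survfit(Surv(time, event) ~ sex, data = d)
print(km_fit)  # group sizes, number of events and median survival

# Plot both curves; strata follow the factor level order (Male, Female).
plot(km_fit, col = c("blue", "red"),
     xlab = "Time (days)", ylab = "Survival probability")
legend("topright", legend = levels(d$sex), col = c("blue", "red"), lty = 1)

# Log-rank test comparing the two survivor functions.
logrank <- survdiff(Surv(time, event) ~ sex, data = d)
logrank
```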

Interpretation:

  • Survival differs significantly between males and females: females survive longer.

  • Worse performance status (higher ph.ecog) is associated with a much higher risk of death.

  • Both sex and ph.ecog are important predictors of survival in this data set.

Visualisation

library(forestmodel)
forest_model(cox_model)
## `height` was translated to `width`.

Interpretation

Sex (reference group: male):

  • Female: females have a significantly lower risk of death compared to males (about 42% lower risk, hazard ratio 0.58).

ph.ecog (reference group: ph.ecog 0):

  • ph.ecog 1 (p = 0.036): compared to patients with ph.ecog 0, those with ph.ecog 1 have a 52% higher risk of death;

  • ph.ecog 2 (p < 0.001): these patients have 2.58 times the risk of death compared to those with ph.ecog 0, which is highly significant;

  • ph.ecog 3 (p = 0.046): patients with ph.ecog 3 have 7.76 times the risk of death compared to those with ph.ecog 0. This result is statistically significant, but the confidence interval is very wide (1.04 to 58.04), likely due to the small sample size in this group.
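
The percentages quoted above follow directly from the hazard ratios, the `exp(coef)` column of the Cox output. As a small sketch, with values copied by hand from that output and vector names chosen only for illustration:

```r
# Hazard ratios copied from the exp(coef) column of the Cox model output.
hr <- c(sexFemale = 0.5799, ph.ecog1 = 1.5192,
        ph.ecog2 = 2.5792, ph.ecog3 = 7.7565)

# Percent change in hazard relative to the reference group.
pct_change <- round((hr - 1) * 100)
pct_change
```

A hazard ratio of 0.58 corresponds to a 42% lower hazard, and 1.52 to a 52% higher hazard; for the larger ratios it usually reads better to say "2.58 times the risk" than to quote a percentage.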

Conclusion

In this analysis of the lung cancer dataset, we explored the impact of various clinical variables on patient survival using Cox proportional hazards modeling.

Our results demonstrate that sex and performance status (ph.ecog) are significant predictors of overall survival.

Specifically, poorer performance status and male sex were associated with a higher risk of mortality.

Kaplan-Meier survival curves illustrated clear differences in survival probabilities between subgroups, emphasizing the prognostic value of these variables.

The forest plot provides a concise visualisation of the hazard ratios.

Overall, these findings highlight the importance of incorporating clinical variables such as performance status and sex into prognostic assessments for patients with advanced lung cancer.

The analytic approach and visual tools presented here can support clinicians in making more informed, individualized treatment decisions.