Generally, survival analysis is a collection of statistical procedures for data analysis for which the outcome variable of interest is time until an event occurs.
In the medical world, we typically think of survival analysis literally – tracking time until death. But, it’s more general than that – survival analysis models time until an event occurs (any event). This might be death of a biological organism. But it could also be the time until a hardware failure in a mechanical system, time until recovery, time someone remains unemployed after losing a job, time until a ripe tomato is eaten by a grazing deer, time until someone falls asleep in a workshop, etc. Survival analysis also goes by reliability theory in engineering, duration analysis in economics, and event history analysis in sociology.
Type of events: death, disease, relapse, recovery…
The goal of survival analysis is to:
## Setting up the data
To set up the data I use the `survival` and `tidyverse` libraries.
First I check for missing values (NA) and deal with them.
## [1] TRUE
##      inst      time    status       age       sex   ph.ecog  ph.karno pat.karno
##         1         0         0         0         0         1         1         3
##  meal.cal   wt.loss
##        47        14
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
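The code behind the output above is not shown, so the following is only a sketch of one way to produce it and prepare the data for the later steps. The imputation choices (median for `wt.loss`, mean for `meal.cal`) are an assumption inferred from the summary table further down, and the `event` column and the factor labels are assumptions based on the model output later in this report.

```r
library(survival)

# Work on a local copy of the NCCTG lung cancer data shipped with `survival`.
lung <- survival::lung

# Missing-value checks: any NA at all, the NA count per column, and the structure.
anyNA(lung)
colSums(is.na(lung))
str(lung)

# Assumed imputation (not confirmed by the original code):
# median for wt.loss, mean for meal.cal.
lung$wt.loss[is.na(lung$wt.loss)]   <- median(lung$wt.loss, na.rm = TRUE)
lung$meal.cal[is.na(lung$meal.cal)] <- mean(lung$meal.cal, na.rm = TRUE)

# Assumed recoding: labelled sex, ph.ecog as a factor,
# and a 0/1 event indicator (status == 2 means the patient died).
lung$sex     <- factor(lung$sex, levels = c(1, 2), labels = c("Male", "Female"))
lung$ph.ecog <- factor(lung$ph.ecog)
lung$event   <- as.integer(lung$status == 2)
```

The single missing `ph.ecog` value is left as NA here, which matches the "1 observation deleted due to missingness" notes in the model output.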
In this part I present descriptive statistics.
Here is a summary table of the main variables:
summary(lung[,c("age","time","ph.ecog","wt.loss","meal.cal")])
##       age             time         ph.ecog       wt.loss           meal.cal
##  Min.   :39.00   Min.   :   5.0   0   : 63   Min.   :-24.000   Min.   :  96.0
##  1st Qu.:56.00   1st Qu.: 166.8   1   :113   1st Qu.:  0.000   1st Qu.: 768.0
##  Median :63.00   Median : 255.5   2   : 50   Median :  7.000   Median : 928.8
##  Mean   :62.45   Mean   : 305.2   3   :  1   Mean   :  9.658   Mean   : 928.8
##  3rd Qu.:69.00   3rd Qu.: 396.5   NA's:  1   3rd Qu.: 15.000   3rd Qu.:1075.0
##  Max.   :82.00   Max.   :1022.0              Max.   : 68.000   Max.   :2600.0
The histogram below shows the distribution of survival time:
hist(lung$time,main = "Distribution of survival time (in days)",
xlab = "Time",
ylab = "Count",
col = "blue")
The majority of patients have survival times clustered between 200 and 600 days. The distribution suggests that survival times are concentrated around the middle range, with relatively few extreme values.
boxplot(time ~ sex, data = lung,
main = "Survival time by gender",
xlab = "Gender",
ylab = "Survival time",
col = 4:2)
To determine whether mean survival time differs between the sexes, we conduct a two-sample t-test:
t.test(time ~ sex, data = lung)
##
## Welch Two Sample t-test
##
## data: time by sex
## t = -1.9843, df = 196.51, p-value = 0.04861
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -111.1266705 -0.3428947
## sample estimates:
## mean in group Male mean in group Female
## 283.2319 338.9667
anova_model <- aov(time~ph.ecog, data = lung)
summary(anova_model) #p-value - 0.0193
## Df Sum Sq Mean Sq F value Pr(>F)
## ph.ecog 3 434580 144860 3.371 0.0193 *
## Residuals 223 9582655 42972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = time ~ ph.ecog, data = lung)
##
## $ph.ecog
## diff lwr upr p adj
## 1-0 -37.43054 -121.8003 46.93921 0.6599478
## 2-0 -117.79302 -219.4235 -16.16256 0.0157925
## 3-0 -233.87302 -774.7016 306.95558 0.6779863
## 2-1 -80.36248 -171.5026 10.77762 0.1052341
## 3-1 -196.44248 -735.3983 342.51331 0.7814210
## 3-2 -116.08000 -658.0060 425.84604 0.9452620
According to the ANOVA used in Q2, mean survival time differs significantly across ph.ecog groups (p = 0.019), with higher ph.ecog scores tending toward shorter survival. In addition, the post-hoc Tukey test showed that the difference in mean survival time between ph.ecog group 2 and group 0 was statistically significant (p = 0.0158); none of the other pairwise differences reached significance. A comparison of mean survival times between the sexes also showed that females had a longer average survival time (338.97 days) than males (283.23 days), suggesting that sex may also be related to survival (see Q1 conclusion).
lm1 <- lm(time ~ age+sex+ph.ecog, data = lung)
summary(lm1)
##
## Call:
## lm(formula = time ~ age + sex + ph.ecog, data = lung)
##
## Residuals:
## Min 1Q Median 3Q Max
## -375.0 -142.7 -54.5 100.2 732.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 357.7386 101.4369 3.527 0.000512 ***
## age -0.4626 1.5654 -0.296 0.767856
## sexFemale 52.3341 28.3389 1.847 0.066124 .
## ph.ecog1 -34.3179 32.5108 -1.056 0.292310
## ph.ecog2 -115.0029 39.9023 -2.882 0.004340 **
## ph.ecog3 -207.3538 208.8521 -0.993 0.321881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206.5 on 221 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.05913, Adjusted R-squared: 0.03785
## F-statistic: 2.778 on 5 and 221 DF, p-value: 0.01865
According to the linear regression model, ph.ecog is significantly associated with survival time: patients with a ph.ecog score of 2 have shorter survival times than patients with a score of 0 (p = 0.004).
The overall model is statistically significant (p = 0.019), but it explains only about 6% of the variance (R-squared = 0.059).
For a clearer estimate of outcomes I use logistic regression.
Logistic regression is used to model and predict a binary outcome from several explanatory variables, and to estimate the effect of each variable on the likelihood of that outcome.
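The code that builds the binary outcome is not shown in this report. As a hedged sketch, the model could be set up as follows. The `glm()` call matches the `Call:` line in the output below, but the definition of `Death_200` (died, with `status == 2`, at or before day 200) and the factor recoding are assumptions.

```r
library(survival)

# Local copy of the data with the recoding assumed throughout this report.
lung <- survival::lung
lung$sex     <- factor(lung$sex, levels = c(1, 2), labels = c("Male", "Female"))
lung$ph.ecog <- factor(lung$ph.ecog)

# Assumed outcome definition: 1 if the patient died within 200 days, otherwise 0.
lung$Death_200 <- as.integer(lung$time <= 200 & lung$status == 2)

# Logistic regression of early death on age, sex, and performance status.
logit_model <- glm(Death_200 ~ age + sex + ph.ecog,
                   family = binomial, data = lung)
summary(logit_model)
```

Because only one patient has `ph.ecog` equal to 3, `glm()` may warn about fitted probabilities of 0 or 1, and the estimate for that level should not be interpreted.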
##
## Call:
## glm(formula = Death_200 ~ age + sex + ph.ecog, family = binomial,
## data = lung)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.200e+00 1.128e+00 -1.064 0.287499
## age -1.708e-04 1.723e-02 -0.010 0.992089
## sexFemale -9.721e-01 3.314e-01 -2.933 0.003353 **
## ph.ecog1 6.827e-01 3.964e-01 1.722 0.085014 .
## ph.ecog2 1.617e+00 4.541e-01 3.560 0.000371 ***
## ph.ecog3 1.578e+01 8.827e+02 0.018 0.985739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 282.07 on 226 degrees of freedom
## Residual deviance: 256.58 on 221 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 268.58
##
## Number of Fisher Scoring iterations: 13
Here I modeled the probability of death within 200 days as a function of age, sex, and ph.ecog.
sexFemale: being female significantly reduces the odds of death within 200 days (p = 0.003).
ph.ecog 2: a strong and statistically significant predictor of death within 200 days. Patients with a performance status of 2 have about 5 times the odds of dying within 200 days compared with fully active patients (ph.ecog 0).
ph.ecog 1: somewhat increases the odds, but the effect is not quite statistically significant (p = 0.085).
age: no significant effect (p = 0.99).
ph.ecog 3: the estimate is very large but not significant (p = 0.99); only one patient has this score, so the estimate is unstable.
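The odds ratios quoted above come from exponentiating the logistic regression coefficients. A minimal sketch, using coefficient values copied from the output above (the variable names `coefs` and `odds_ratios` are only illustrative):

```r
# Log-odds coefficients copied from the logistic regression output.
coefs <- c(sexFemale = -0.9721, ph.ecog1 = 0.6827, ph.ecog2 = 1.617)

# Exponentiating a log-odds coefficient gives the odds ratio
# relative to the reference level (Male, ph.ecog 0).
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

An odds ratio below 1 (as for `sexFemale`) means lower odds of death within 200 days than in the reference group; a value above 1 means higher odds.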
survdiff(Surv(time,event) ~ sex, data = lung) #p= 0.001
## Call:
## survdiff(formula = Surv(time, event) ~ sex, data = lung)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male 138 112 91.6 4.55 10.3
## sex=Female 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
cox_model <- coxph(Surv(time,event) ~ sex + ph.ecog, data = lung)
summary(cox_model)
## Call:
## coxph(formula = Surv(time, event) ~ sex + ph.ecog, data = lung)
##
## n= 227, number of events= 164
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexFemale -0.5449 0.5799 0.1681 -3.241 0.00119 **
## ph.ecog1 0.4182 1.5192 0.1994 2.097 0.03602 *
## ph.ecog2 0.9475 2.5792 0.2248 4.216 2.49e-05 ***
## ph.ecog3 2.0485 7.7565 1.0269 1.995 0.04605 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexFemale 0.5799 1.7245 0.4171 0.8062
## ph.ecog1 1.5192 0.6582 1.0277 2.2459
## ph.ecog2 2.5792 0.3877 1.6602 4.0067
## ph.ecog3 7.7565 0.1289 1.0366 58.0390
##
## Concordance= 0.642 (se = 0.025 )
## Likelihood ratio test= 29.5 on 4 df, p=6e-06
## Wald test = 30.7 on 4 df, p=4e-06
## Score (logrank) test = 32.71 on 4 df, p=1e-06
The output above presents the log-rank test and the Cox model.
Survival differs significantly between males and females: females survive longer.
Worse performance status (higher ph.ecog) is associated with a much higher risk of death.
Both sex and ph.ecog are important predictors of survival in this dataset.
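The difference between the sexes detected by the log-rank test can also be shown with Kaplan-Meier curves. The sketch below uses `survfit()` from the `survival` package and base graphics; the recoding of `sex` and the `event` indicator are assumptions about how the data were prepared, and the colors are arbitrary.

```r
library(survival)

# Local copy of the data with the assumed recoding.
lung <- survival::lung
lung$sex   <- factor(lung$sex, levels = c(1, 2), labels = c("Male", "Female"))
lung$event <- as.integer(lung$status == 2)  # 1 = died, 0 = censored

# Kaplan-Meier estimate of the survival function, one curve per sex.
km_fit <- survfit(Surv(time, event) ~ sex, data = lung)

# Printing the fit reports the number of events and median survival per group.
print(km_fit)

plot(km_fit, col = c("blue", "red"),
     xlab = "Time (days)", ylab = "Survival probability",
     main = "Kaplan-Meier curves by sex")
legend("topright", legend = levels(lung$sex),
       col = c("blue", "red"), lty = 1)
```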
library(forestmodel)
forest_model(cox_model)
## `height` was translated to `width`.
Reference group: ph.ecog 0. Results for ph.ecog 1, 2, and 3:
ph.ecog 1 (p = 0.036): compared with patients with ph.ecog 0, those with ph.ecog 1 have a 52% higher risk of death.
ph.ecog 2 (p < 0.001): these patients have 2.58 times the risk of death of those with ph.ecog 0; this result is highly significant.
ph.ecog 3 (p = 0.046): patients with ph.ecog 3 have 7.76 times the risk of death of those with ph.ecog 0. This result is statistically significant, but the confidence interval is very wide (1.04 to 58.04), likely due to the small sample size in this group.
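To see where the wide interval for ph.ecog 3 comes from, the hazard ratio and its 95% confidence interval can be recomputed by hand from the coefficient and standard error in the Cox model output. This is a worked sketch with values copied from that output; the variable names are illustrative.

```r
# Coefficient and standard error for ph.ecog3, copied from summary(cox_model).
coef_ecog3 <- 2.0485
se_ecog3   <- 1.0269

# The hazard ratio is the exponentiated coefficient.
hr <- exp(coef_ecog3)

# Wald 95% interval: build it on the log scale, then exponentiate.
z  <- qnorm(0.975)
ci <- exp(coef_ecog3 + c(-1, 1) * z * se_ecog3)

round(c(HR = hr, lower = ci[1], upper = ci[2]), 2)
```

The large standard error on the log scale is what stretches the interval so far once it is exponentiated.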
In this analysis of the lung cancer dataset, we explored the impact of various clinical variables on patient survival using Cox proportional hazards modeling.
Our results demonstrate that factors such as sex and performance status (ph.ecog) are significant predictors of overall survival.
Specifically, poorer performance status and male sex were associated with a higher risk of mortality.
Kaplan-Meier survival curves illustrated clear differences in survival probabilities between subgroups, emphasizing the prognostic value of these variables.
The forest plot provides a concise visualization of the hazard ratios.
Overall, these findings highlight the importance of incorporating clinical variables such as performance status and sex into prognostic assessments for patients with advanced lung cancer.
The analytic approach and visual tools presented here can support clinicians in making more informed, individualized treatment decisions.