PART 1: Survival Data Creation and Censoring

1.1 Creating Survival Data in R

# Set seed for reproducibility
set.seed(123)

# Create sample survival dataset
n <- 200
surv_data <- data.frame(
  ID = 1:n,
  Age = round(rnorm(n, mean = 50, sd = 10)),
  Height_cm = round(rnorm(n, mean = 170, sd = 10)),
  Weight_kg = round(rnorm(n, mean = 70, sd = 15)),
  Treatment = sample(c("Drug_A", "Drug_B", "Placebo"), n, replace = TRUE),
  # Generate survival times from exponential distribution
  Time_to_event = rexp(n, rate = 0.05),
  # Randomly censor 30% of observations
  Censored = rbinom(n, 1, 0.3)
)

# Create event indicator (1 = event occurred, 0 = censored)
surv_data$Event <- ifelse(surv_data$Censored == 1, 0, 1)

# Adjust time for censored observations (make them shorter)
surv_data$Time <- ifelse(surv_data$Censored == 1, 
                         surv_data$Time_to_event * runif(n, 0.3, 0.8),
                         surv_data$Time_to_event)

# Ensure time is positive
surv_data$Time <- pmax(surv_data$Time, 0.1)

# Create survival object
surv_obj_full <- Surv(surv_data$Time, surv_data$Event)

# View structure
cat("First 10 observations:\n")
## First 10 observations:
print(head(surv_data, 10))
##    ID Age Height_cm Weight_kg Treatment Time_to_event Censored Event      Time
## 1   1  44       192        69    Drug_A     13.520928        0     1 13.520928
## 2   2  48       183        52    Drug_B     51.955049        0     1 51.955049
## 3   3  66       167        60    Drug_A     12.435579        1     0  7.452644
## 4   4  51       175        70    Drug_A     38.112723        0     1 38.112723
## 5   5  51       166        80    Drug_A     82.554753        0     1 82.554753
## 6   6  67       165        45    Drug_A      1.907929        0     1  1.907929
## 7   7  55       162        65    Drug_A     20.280953        0     1 20.280953
## 8   8  37       164        81    Drug_A     42.572799        0     1 42.572799
## 9   9  43       187        62   Placebo      9.321600        0     1  9.321600
## 10 10  46       169        73   Placebo     25.537471        0     1 25.537471
cat("\nDataset dimensions:", dim(surv_data))
## 
## Dataset dimensions: 200 9

1.2 Causes and Reasons for Censoring

Definition: Censoring occurs when the outcome of interest (usually the time until an event) is not fully observed for a subject(s) during the study period. The event status is incomplete, but partial information is still available.

The main causes of Censoring are:

1. Study Design Constraints

  • Fixed study duration: The study ends before all subjects experience the event.
  • Staggered entry: Subjects enrolling at different times but follow-up ends on a fixed calendar date.

2. Loss to Follow-Up

  • A Subject can move away geographically, withdraws consent, or stops responding.
  • Unrelated death (e.g., a subject dies in a car accident in a study of cancer survival).

3. Competing Risks

  • An alternative event precludes the event of interest for example In a study of heart attack risk, a patient dies from stroke first—the heart attack event is now censored.

4. Administrative Reasons

  • Funding for a study can end and a study can be terminated early.

1.3 Types of Censoring

1. Right Censoring

Happens when the event has not occurred by the end of the study, or the subject is lost to follow-up.

Example: A patient in a 5-year cancer survival study is still alive at year 5, or drops out after year 3 while still alive.

2. Left Censoring

The event occurred before the subject entered the study, but the exact time is unknown.

Example: Measuring time until HIV seroconversion—a subject may already be HIV+ at enrollment, but we don’t know when infection happened.

3. Interval Censoring

The event occurred within a known time interval, but the exact time is unknown.

Example: In periodic medical check-ups, disease recurrence is detected between the 2nd and 3rd visit, but the exact date is unknown.

1.4 Features of Survival Data

Mathematical Aspects of Survival data

Survival analysis is built on Five fundamental functions:

  • The Survival Function

  • The hazard Function

  • The Cummulative Hazard Function

  • The Probability Density Function

  • The Kaplan Meir Estimator

1. Survival Function: S(t)

The survival function represents the probability that the event has not occurred by time \(t\): It is the probability that a subject survives beyond time 𝑡

\[S(t) = P(T > t)\]

where \(T\) is the random event time, with \(S(0) = 1\) and \(S(\infty) = 0\).

Properties of the survival function

i). At time zero (the start of the study or “birth”), no one has experienced the event yet. Therefore, the probability of surviving past time 0 is 100%.

\[S(0) = 1\] ii). The event (e.g., death or failure) is inevitable. As time approaches infinity, the probability of an individual still being “alive” or “intact” drops to zero.

\[S(\infty) = 0\] iii). Since \(S(t)\) measures the probability of surviving beyond time \(t\), it can never go up. You cannot “un-experience” a death or failure event once it happens.

\[S(t) = non-increasing\]

2. Hazard Function: h(t)

The instantaneous rate of the event occurring at time \(t\), given survival up to \(t\):

\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} = -\frac{d}{dt} \log S(t)\] The Hazard function is the “failure rate”—risk of the event given the subject has made it to \(t\).

Relationship with Survival Function

\[h(t) = \frac{f(t)}{S(t)}\]

3. Cumulative Hazard Function: H(t)

The cumulative hazard is: \[H(t) = \int_0^t h(u) du = -\log S(t)\]

Where

  • \(h(u)\) is the hazard rate at time \(u\)

  • Integration Sums the Hazard continuously over time

The cumulative hazard function \(H(t)\) represents the total amount of risk accumulated by time 𝑡

  • It does not give a probability directly but measures how much exposure to risk a subject has experienced up to time 𝑡

4.Probability Density Function \(f(t)\)

Defined as:

\[f(t) = \frac{d}{dt}[1 - S(t)]\]

Represents the instantaneous probability of experiencing the event at time \(t\).

5. Kaplan-Meier Estimator

Used to estimate \(S(t)\) from observed data:

The Kaplan–Meier estimator calculates the probability of surviving past time \(S(t)\) by:

  • Breaking time into event intervals

  • Multiplying conditional survival probabilities at each event time

It answers: “Given survival so far, what is the chance of surviving the next time point?”

\[\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)\]

Where:

  • \(t_i\) = ordered event times
  • \(d_i\) = number of events at \(t_i\)
  • \(n_i\) = number at risk just before \(t_i\)
  • \(\frac{d_i}{n_i}\) = Estimated probability of failing at time \({t_i}\)
  • \(\left(1 - \frac{d_i}{n_i} \right)\) = probability of surviving past time \({t_i}\)

Graphical Aspects of survival data

# Create survival object
surv_obj <- Surv(surv_data$Time, surv_data$Event)

# Fit Kaplan-Meier
km_fit_basic <- survfit(surv_obj ~ 1)

# Plot survival curve
plot(km_fit_basic, main = "Kaplan-Meier Survival Curve",
     xlab = "Time", ylab = "Survival Probability",
     col = "blue", lwd = 2)
grid()

Graphical methods are central to survival analysis because they allow visual assessment of time-to-event patterns, censoring, and group differences.

1. Kaplan–Meier Survival Curve

What is plotted

  • X-axis: Time

  • Y-axis: Survival probability \({S}(t)\)

Key Features

  • Stepwise decreasing curve

  • Vertical drops at event times

  • Flat sections where no events occur

  • Tick marks (|) show censored observations

Interpretation

  • Steeper declines - Shows higher event rate

  • Higher curve - Shows better survival

  • Median survival time = time where \({S}(t)\) = 0.5

2. Hazard Function Plot

# Load required package
library(survival)
if (!require("muhaz")) {
  install.packages("muhaz")
  library(muhaz)
}

# surv_obj is created like Surv(time, event)
time <- surv_obj[, 1]
event <- surv_obj[, 2]

# Estimate hazard function
haz_est <- muhaz(time, event)

# Plot hazard function
plot(haz_est, 
     main = "Hazard Function Plot",
     xlab = "Time", ylab = "Hazard Rate",
     col = "green", lwd = 2)
grid()

What is plotted

  • X-axis: Time

  • Y-axis: Hazard rate \({h}(t)\)

Features

  • The plot Shows instantaneous risk

  • The plot line May be increasing, decreasing, or constant

Interpretation

  • Peaks indicate high-risk periods

  • Used in reliability and medical studies

3. Cumulative Hazard Plot

# Fit Kaplan-Meier
km_fit_basic <- survfit(surv_obj ~ 1)

# Plot cumulative hazard
plot(km_fit_basic, fun = "cumhaz", 
    main = "Cumulative Hazard Plot",
    xlab = "Time", ylab = "Cumulative Hazard",
    col = "green", lwd = 2)
grid()

What is plotted

  • X-axis: Time

  • Y-axis: Cumulative hazard H(t)

Features

  • Always increasing

  • Increases only at event times

  • Slope indicates hazard intensity

Interpretation

  • Straight line - Indicates Constant Hazard

  • Curving upward - Increasing risk

  • Flatter curve - Shows lower accumulated risk

# Fit Kaplan-Meier estimator (non-parametric)
km_fit <- survfit(surv_obj_full ~ 1, data = surv_data)

# Display summary of KM estimator
summary(km_fit)
## Call: survfit(formula = surv_obj_full ~ 1, data = surv_data)
## 
##    time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   0.100    200       1   0.9950 0.00499      0.98527       1.0000
##   0.378    197       1   0.9899 0.00707      0.97619       1.0000
##   0.506    196       1   0.9849 0.00865      0.96808       1.0000
##   0.739    195       1   0.9798 0.00997      0.96049       0.9996
##   0.858    193       1   0.9748 0.01114      0.95318       0.9969
##   0.897    192       1   0.9697 0.01218      0.94611       0.9939
##   1.024    191       1   0.9646 0.01314      0.93921       0.9907
##   1.155    189       1   0.9595 0.01402      0.93242       0.9874
##   1.382    187       1   0.9544 0.01486      0.92570       0.9839
##   1.554    186       1   0.9493 0.01564      0.91909       0.9804
##   1.765    184       1   0.9441 0.01638      0.91252       0.9768
##   1.775    183       1   0.9389 0.01709      0.90604       0.9730
##   1.908    181       1   0.9337 0.01776      0.89958       0.9692
##   2.115    179       1   0.9285 0.01841      0.89313       0.9653
##   2.266    177       1   0.9233 0.01904      0.88671       0.9614
##   2.333    176       1   0.9180 0.01964      0.88034       0.9574
##   2.353    175       1   0.9128 0.02022      0.87401       0.9533
##   2.591    173       1   0.9075 0.02078      0.86769       0.9492
##   3.286    171       1   0.9022 0.02132      0.86137       0.9450
##   3.388    170       1   0.8969 0.02185      0.85508       0.9408
##   3.655    168       1   0.8916 0.02236      0.84879       0.9365
##   4.045    167       1   0.8862 0.02286      0.84254       0.9322
##   4.145    166       1   0.8809 0.02333      0.83632       0.9278
##   4.331    165       1   0.8755 0.02379      0.83013       0.9234
##   4.362    163       1   0.8702 0.02425      0.82392       0.9190
##   4.470    162       1   0.8648 0.02469      0.81775       0.9146
##   4.534    160       1   0.8594 0.02512      0.81155       0.9101
##   4.703    159       1   0.8540 0.02553      0.80539       0.9055
##   4.913    157       1   0.8486 0.02594      0.79920       0.9010
##   5.030    156       1   0.8431 0.02634      0.79304       0.8964
##   5.100    155       1   0.8377 0.02673      0.78690       0.8917
##   5.105    154       1   0.8322 0.02710      0.78078       0.8871
##   5.223    153       1   0.8268 0.02746      0.77468       0.8824
##   5.337    152       1   0.8214 0.02782      0.76861       0.8777
##   5.567    150       1   0.8159 0.02817      0.76250       0.8730
##   5.683    148       1   0.8104 0.02851      0.75637       0.8682
##   5.698    147       1   0.8049 0.02884      0.75026       0.8634
##   5.754    146       1   0.7993 0.02917      0.74417       0.8586
##   5.789    145       1   0.7938 0.02948      0.73810       0.8538
##   6.148    143       1   0.7883 0.02980      0.73199       0.8489
##   6.175    142       1   0.7827 0.03010      0.72590       0.8440
##   6.223    141       1   0.7772 0.03039      0.71983       0.8391
##   6.326    138       1   0.7715 0.03069      0.71368       0.8341
##   6.473    137       1   0.7659 0.03098      0.70754       0.8291
##   6.767    136       1   0.7603 0.03126      0.70142       0.8241
##   6.988    135       1   0.7546 0.03153      0.69532       0.8190
##   7.053    134       1   0.7490 0.03179      0.68923       0.8140
##   7.074    133       1   0.7434 0.03205      0.68315       0.8089
##   7.303    132       1   0.7378 0.03230      0.67709       0.8039
##   8.076    130       1   0.7321 0.03254      0.67099       0.7987
##   8.213    129       1   0.7264 0.03278      0.66491       0.7936
##   8.234    128       1   0.7207 0.03301      0.65884       0.7884
##   8.521    124       1   0.7149 0.03326      0.65262       0.7832
##   8.647    122       1   0.7091 0.03350      0.64636       0.7778
##   8.860    120       1   0.7031 0.03373      0.64005       0.7725
##   8.889    119       1   0.6972 0.03396      0.63375       0.7671
##   9.322    118       1   0.6913 0.03419      0.62747       0.7617
##   9.683    116       1   0.6854 0.03441      0.62115       0.7562
##   9.811    114       1   0.6794 0.03463      0.61477       0.7507
##   9.844    113       1   0.6733 0.03484      0.60841       0.7452
##  10.941    111       1   0.6673 0.03505      0.60201       0.7396
##  11.198    110       1   0.6612 0.03525      0.59561       0.7340
##  11.287    109       1   0.6551 0.03544      0.58923       0.7284
##  11.407    108       1   0.6491 0.03563      0.58287       0.7228
##  11.495    107       1   0.6430 0.03581      0.57652       0.7172
##  11.498    106       1   0.6369 0.03598      0.57019       0.7115
##  11.685    105       1   0.6309 0.03615      0.56387       0.7059
##  12.554    101       1   0.6246 0.03633      0.55735       0.7001
##  12.675    100       1   0.6184 0.03650      0.55084       0.6942
##  12.950     99       1   0.6121 0.03666      0.54435       0.6884
##  13.163     98       1   0.6059 0.03681      0.53788       0.6825
##  13.448     96       1   0.5996 0.03697      0.53134       0.6766
##  13.521     95       1   0.5933 0.03711      0.52482       0.6707
##  13.834     94       1   0.5870 0.03725      0.51831       0.6647
##  13.947     93       1   0.5807 0.03738      0.51182       0.6587
##  14.299     92       1   0.5743 0.03750      0.50534       0.6528
##  14.463     91       1   0.5680 0.03762      0.49888       0.6468
##  14.657     90       1   0.5617 0.03773      0.49244       0.6407
##  15.023     88       1   0.5553 0.03783      0.48592       0.6347
##  15.683     87       1   0.5490 0.03793      0.47942       0.6286
##  16.020     85       1   0.5425 0.03803      0.47284       0.6224
##  16.094     84       1   0.5360 0.03812      0.46629       0.6162
##  16.596     82       1   0.5295 0.03822      0.45965       0.6100
##  16.655     81       1   0.5230 0.03830      0.45303       0.6037
##  16.777     80       1   0.5164 0.03837      0.44643       0.5974
##  16.944     79       1   0.5099 0.03844      0.43985       0.5911
##  17.373     78       1   0.5033 0.03850      0.43327       0.5848
##  18.696     75       1   0.4966 0.03857      0.42652       0.5783
##  19.343     73       1   0.4898 0.03863      0.41968       0.5717
##  19.730     71       1   0.4829 0.03870      0.41274       0.5651
##  19.771     70       1   0.4760 0.03876      0.40582       0.5584
##  19.776     69       1   0.4691 0.03881      0.39892       0.5517
##  20.089     68       1   0.4622 0.03884      0.39205       0.5450
##  20.176     67       1   0.4553 0.03887      0.38519       0.5383
##  20.281     66       1   0.4484 0.03889      0.37834       0.5315
##  20.352     65       1   0.4415 0.03890      0.37152       0.5248
##  20.438     63       1   0.4345 0.03891      0.36459       0.5179
##  21.523     61       1   0.4274 0.03892      0.35755       0.5109
##  21.589     60       1   0.4203 0.03891      0.35054       0.5039
##  22.321     59       1   0.4132 0.03890      0.34354       0.4969
##  22.367     58       1   0.4060 0.03888      0.33656       0.4899
##  22.556     57       1   0.3989 0.03884      0.32961       0.4828
##  22.843     56       1   0.3918 0.03880      0.32268       0.4757
##  23.336     54       1   0.3845 0.03875      0.31562       0.4685
##  23.764     53       1   0.3773 0.03869      0.30858       0.4613
##  24.050     52       1   0.3700 0.03862      0.30157       0.4540
##  24.480     51       1   0.3628 0.03854      0.29458       0.4468
##  25.071     50       1   0.3555 0.03845      0.28761       0.4395
##  25.089     49       1   0.3483 0.03834      0.28067       0.4321
##  25.326     48       1   0.3410 0.03822      0.27375       0.4248
##  25.537     47       1   0.3337 0.03809      0.26685       0.4174
##  25.569     46       1   0.3265 0.03795      0.25998       0.4100
##  25.868     45       1   0.3192 0.03779      0.25313       0.4026
##  27.705     44       1   0.3120 0.03762      0.24631       0.3952
##  28.213     43       1   0.3047 0.03744      0.23951       0.3877
##  28.865     42       1   0.2975 0.03725      0.23274       0.3802
##  30.041     41       1   0.2902 0.03704      0.22599       0.3727
##  31.663     38       1   0.2826 0.03684      0.21886       0.3649
##  32.568     37       1   0.2749 0.03663      0.21176       0.3570
##  33.548     36       1   0.2673 0.03640      0.20469       0.3491
##  34.210     35       1   0.2597 0.03615      0.19766       0.3411
##  35.198     34       1   0.2520 0.03589      0.19066       0.3332
##  36.260     33       1   0.2444 0.03560      0.18369       0.3252
##  36.290     32       1   0.2368 0.03530      0.17676       0.3171
##  37.357     31       1   0.2291 0.03498      0.16987       0.3090
##  38.113     30       1   0.2215 0.03463      0.16301       0.3009
##  38.354     29       1   0.2138 0.03427      0.15620       0.2928
##  38.643     28       1   0.2062 0.03389      0.14942       0.2846
##  40.764     27       1   0.1986 0.03348      0.14269       0.2763
##  41.975     25       1   0.1906 0.03307      0.13568       0.2678
##  42.224     24       1   0.1827 0.03263      0.12872       0.2593
##  42.573     23       1   0.1747 0.03217      0.12181       0.2507
##  42.624     22       1   0.1668 0.03167      0.11497       0.2420
##  43.107     21       1   0.1589 0.03114      0.10818       0.2333
##  44.726     20       1   0.1509 0.03058      0.10145       0.2245
##  45.809     19       1   0.1430 0.02999      0.09478       0.2157
##  47.939     18       1   0.1350 0.02935      0.08818       0.2068
##  48.811     17       1   0.1271 0.02868      0.08166       0.1978
##  49.835     16       1   0.1191 0.02797      0.07521       0.1887
##  51.955     15       1   0.1112 0.02721      0.06884       0.1796
##  55.912     14       1   0.1033 0.02640      0.06256       0.1704
##  61.031     11       1   0.0939 0.02561      0.05499       0.1602
##  67.452      9       1   0.0834 0.02480      0.04660       0.1494
##  69.279      8       1   0.0730 0.02379      0.03855       0.1383
##  72.872      7       1   0.0626 0.02256      0.03087       0.1269
##  75.927      6       1   0.0521 0.02108      0.02362       0.1151
##  79.877      5       1   0.0417 0.01927      0.01687       0.1032
##  82.555      4       1   0.0313 0.01704      0.01076       0.0910
##  83.337      3       1   0.0209 0.01420      0.00549       0.0792
##  83.806      2       1   0.0104 0.01024      0.00152       0.0714
##  89.176      1       1   0.0000     NaN           NA           NA
# The KM estimator is calculated as:
# S_hat(t) = product over (t_i <= t) of (1 - d_i/n_i)
# where d_i = number of events at time t_i
#       n_i = number at risk just before t_i

PART 2: Kaplan-Meier Analysis

2.1 Creating Survival Object and Fitting Kaplan-Meier

# Loading the survival library
library(survival)

# Creating the survival object
surv_object <- Surv(surv_data$Time, surv_data$Event)

# Fitting Kaplan-Meier estimator
km_fit <- survfit(surv_object ~ 1)

# Display summary
summary(km_fit)
## Call: survfit(formula = surv_object ~ 1)
## 
##    time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   0.100    200       1   0.9950 0.00499      0.98527       1.0000
##   0.378    197       1   0.9899 0.00707      0.97619       1.0000
##   0.506    196       1   0.9849 0.00865      0.96808       1.0000
##   0.739    195       1   0.9798 0.00997      0.96049       0.9996
##   0.858    193       1   0.9748 0.01114      0.95318       0.9969
##   0.897    192       1   0.9697 0.01218      0.94611       0.9939
##   1.024    191       1   0.9646 0.01314      0.93921       0.9907
##   1.155    189       1   0.9595 0.01402      0.93242       0.9874
##   1.382    187       1   0.9544 0.01486      0.92570       0.9839
##   1.554    186       1   0.9493 0.01564      0.91909       0.9804
##   1.765    184       1   0.9441 0.01638      0.91252       0.9768
##   1.775    183       1   0.9389 0.01709      0.90604       0.9730
##   1.908    181       1   0.9337 0.01776      0.89958       0.9692
##   2.115    179       1   0.9285 0.01841      0.89313       0.9653
##   2.266    177       1   0.9233 0.01904      0.88671       0.9614
##   2.333    176       1   0.9180 0.01964      0.88034       0.9574
##   2.353    175       1   0.9128 0.02022      0.87401       0.9533
##   2.591    173       1   0.9075 0.02078      0.86769       0.9492
##   3.286    171       1   0.9022 0.02132      0.86137       0.9450
##   3.388    170       1   0.8969 0.02185      0.85508       0.9408
##   3.655    168       1   0.8916 0.02236      0.84879       0.9365
##   4.045    167       1   0.8862 0.02286      0.84254       0.9322
##   4.145    166       1   0.8809 0.02333      0.83632       0.9278
##   4.331    165       1   0.8755 0.02379      0.83013       0.9234
##   4.362    163       1   0.8702 0.02425      0.82392       0.9190
##   4.470    162       1   0.8648 0.02469      0.81775       0.9146
##   4.534    160       1   0.8594 0.02512      0.81155       0.9101
##   4.703    159       1   0.8540 0.02553      0.80539       0.9055
##   4.913    157       1   0.8486 0.02594      0.79920       0.9010
##   5.030    156       1   0.8431 0.02634      0.79304       0.8964
##   5.100    155       1   0.8377 0.02673      0.78690       0.8917
##   5.105    154       1   0.8322 0.02710      0.78078       0.8871
##   5.223    153       1   0.8268 0.02746      0.77468       0.8824
##   5.337    152       1   0.8214 0.02782      0.76861       0.8777
##   5.567    150       1   0.8159 0.02817      0.76250       0.8730
##   5.683    148       1   0.8104 0.02851      0.75637       0.8682
##   5.698    147       1   0.8049 0.02884      0.75026       0.8634
##   5.754    146       1   0.7993 0.02917      0.74417       0.8586
##   5.789    145       1   0.7938 0.02948      0.73810       0.8538
##   6.148    143       1   0.7883 0.02980      0.73199       0.8489
##   6.175    142       1   0.7827 0.03010      0.72590       0.8440
##   6.223    141       1   0.7772 0.03039      0.71983       0.8391
##   6.326    138       1   0.7715 0.03069      0.71368       0.8341
##   6.473    137       1   0.7659 0.03098      0.70754       0.8291
##   6.767    136       1   0.7603 0.03126      0.70142       0.8241
##   6.988    135       1   0.7546 0.03153      0.69532       0.8190
##   7.053    134       1   0.7490 0.03179      0.68923       0.8140
##   7.074    133       1   0.7434 0.03205      0.68315       0.8089
##   7.303    132       1   0.7378 0.03230      0.67709       0.8039
##   8.076    130       1   0.7321 0.03254      0.67099       0.7987
##   8.213    129       1   0.7264 0.03278      0.66491       0.7936
##   8.234    128       1   0.7207 0.03301      0.65884       0.7884
##   8.521    124       1   0.7149 0.03326      0.65262       0.7832
##   8.647    122       1   0.7091 0.03350      0.64636       0.7778
##   8.860    120       1   0.7031 0.03373      0.64005       0.7725
##   8.889    119       1   0.6972 0.03396      0.63375       0.7671
##   9.322    118       1   0.6913 0.03419      0.62747       0.7617
##   9.683    116       1   0.6854 0.03441      0.62115       0.7562
##   9.811    114       1   0.6794 0.03463      0.61477       0.7507
##   9.844    113       1   0.6733 0.03484      0.60841       0.7452
##  10.941    111       1   0.6673 0.03505      0.60201       0.7396
##  11.198    110       1   0.6612 0.03525      0.59561       0.7340
##  11.287    109       1   0.6551 0.03544      0.58923       0.7284
##  11.407    108       1   0.6491 0.03563      0.58287       0.7228
##  11.495    107       1   0.6430 0.03581      0.57652       0.7172
##  11.498    106       1   0.6369 0.03598      0.57019       0.7115
##  11.685    105       1   0.6309 0.03615      0.56387       0.7059
##  12.554    101       1   0.6246 0.03633      0.55735       0.7001
##  12.675    100       1   0.6184 0.03650      0.55084       0.6942
##  12.950     99       1   0.6121 0.03666      0.54435       0.6884
##  13.163     98       1   0.6059 0.03681      0.53788       0.6825
##  13.448     96       1   0.5996 0.03697      0.53134       0.6766
##  13.521     95       1   0.5933 0.03711      0.52482       0.6707
##  13.834     94       1   0.5870 0.03725      0.51831       0.6647
##  13.947     93       1   0.5807 0.03738      0.51182       0.6587
##  14.299     92       1   0.5743 0.03750      0.50534       0.6528
##  14.463     91       1   0.5680 0.03762      0.49888       0.6468
##  14.657     90       1   0.5617 0.03773      0.49244       0.6407
##  15.023     88       1   0.5553 0.03783      0.48592       0.6347
##  15.683     87       1   0.5490 0.03793      0.47942       0.6286
##  16.020     85       1   0.5425 0.03803      0.47284       0.6224
##  16.094     84       1   0.5360 0.03812      0.46629       0.6162
##  16.596     82       1   0.5295 0.03822      0.45965       0.6100
##  16.655     81       1   0.5230 0.03830      0.45303       0.6037
##  16.777     80       1   0.5164 0.03837      0.44643       0.5974
##  16.944     79       1   0.5099 0.03844      0.43985       0.5911
##  17.373     78       1   0.5033 0.03850      0.43327       0.5848
##  18.696     75       1   0.4966 0.03857      0.42652       0.5783
##  19.343     73       1   0.4898 0.03863      0.41968       0.5717
##  19.730     71       1   0.4829 0.03870      0.41274       0.5651
##  19.771     70       1   0.4760 0.03876      0.40582       0.5584
##  19.776     69       1   0.4691 0.03881      0.39892       0.5517
##  20.089     68       1   0.4622 0.03884      0.39205       0.5450
##  20.176     67       1   0.4553 0.03887      0.38519       0.5383
##  20.281     66       1   0.4484 0.03889      0.37834       0.5315
##  20.352     65       1   0.4415 0.03890      0.37152       0.5248
##  20.438     63       1   0.4345 0.03891      0.36459       0.5179
##  21.523     61       1   0.4274 0.03892      0.35755       0.5109
##  21.589     60       1   0.4203 0.03891      0.35054       0.5039
##  22.321     59       1   0.4132 0.03890      0.34354       0.4969
##  22.367     58       1   0.4060 0.03888      0.33656       0.4899
##  22.556     57       1   0.3989 0.03884      0.32961       0.4828
##  22.843     56       1   0.3918 0.03880      0.32268       0.4757
##  23.336     54       1   0.3845 0.03875      0.31562       0.4685
##  23.764     53       1   0.3773 0.03869      0.30858       0.4613
##  24.050     52       1   0.3700 0.03862      0.30157       0.4540
##  24.480     51       1   0.3628 0.03854      0.29458       0.4468
##  25.071     50       1   0.3555 0.03845      0.28761       0.4395
##  25.089     49       1   0.3483 0.03834      0.28067       0.4321
##  25.326     48       1   0.3410 0.03822      0.27375       0.4248
##  25.537     47       1   0.3337 0.03809      0.26685       0.4174
##  25.569     46       1   0.3265 0.03795      0.25998       0.4100
##  25.868     45       1   0.3192 0.03779      0.25313       0.4026
##  27.705     44       1   0.3120 0.03762      0.24631       0.3952
##  28.213     43       1   0.3047 0.03744      0.23951       0.3877
##  28.865     42       1   0.2975 0.03725      0.23274       0.3802
##  30.041     41       1   0.2902 0.03704      0.22599       0.3727
##  31.663     38       1   0.2826 0.03684      0.21886       0.3649
##  32.568     37       1   0.2749 0.03663      0.21176       0.3570
##  33.548     36       1   0.2673 0.03640      0.20469       0.3491
##  34.210     35       1   0.2597 0.03615      0.19766       0.3411
##  35.198     34       1   0.2520 0.03589      0.19066       0.3332
##  36.260     33       1   0.2444 0.03560      0.18369       0.3252
##  36.290     32       1   0.2368 0.03530      0.17676       0.3171
##  37.357     31       1   0.2291 0.03498      0.16987       0.3090
##  38.113     30       1   0.2215 0.03463      0.16301       0.3009
##  38.354     29       1   0.2138 0.03427      0.15620       0.2928
##  38.643     28       1   0.2062 0.03389      0.14942       0.2846
##  40.764     27       1   0.1986 0.03348      0.14269       0.2763
##  41.975     25       1   0.1906 0.03307      0.13568       0.2678
##  42.224     24       1   0.1827 0.03263      0.12872       0.2593
##  42.573     23       1   0.1747 0.03217      0.12181       0.2507
##  42.624     22       1   0.1668 0.03167      0.11497       0.2420
##  43.107     21       1   0.1589 0.03114      0.10818       0.2333
##  44.726     20       1   0.1509 0.03058      0.10145       0.2245
##  45.809     19       1   0.1430 0.02999      0.09478       0.2157
##  47.939     18       1   0.1350 0.02935      0.08818       0.2068
##  48.811     17       1   0.1271 0.02868      0.08166       0.1978
##  49.835     16       1   0.1191 0.02797      0.07521       0.1887
##  51.955     15       1   0.1112 0.02721      0.06884       0.1796
##  55.912     14       1   0.1033 0.02640      0.06256       0.1704
##  61.031     11       1   0.0939 0.02561      0.05499       0.1602
##  67.452      9       1   0.0834 0.02480      0.04660       0.1494
##  69.279      8       1   0.0730 0.02379      0.03855       0.1383
##  72.872      7       1   0.0626 0.02256      0.03087       0.1269
##  75.927      6       1   0.0521 0.02108      0.02362       0.1151
##  79.877      5       1   0.0417 0.01927      0.01687       0.1032
##  82.555      4       1   0.0313 0.01704      0.01076       0.0910
##  83.337      3       1   0.0209 0.01420      0.00549       0.0792
##  83.806      2       1   0.0104 0.01024      0.00152       0.0714
##  89.176      1       1   0.0000     NaN           NA           NA

2.2 Kaplan-Meier Curves

Without Confidence Interval

plot(km_fit, conf.int = FALSE,
     main = "Kaplan-Meier Curve (No Confidence Interval)",
     xlab = "Time", ylab = "Survival Probability",
     col = "darkred", lwd = 2)
abline(h = seq(0, 1, 0.2), lty = 3, col = "gray")

With Confidence Interval

plot(km_fit, conf.int = TRUE,
     main = "Kaplan-Meier Curve with 95% Confidence Interval",
     xlab = "Time", ylab = "Survival Probability",
     col = "darkblue", lwd = 2)
legend("topright", legend = c("KM Estimate", "95% CI"),
       col = c("darkblue", "gray"), lwd = c(2, 2), lty = c(1, 2))

With Mark-up Time

plot(km_fit, mark.time = TRUE,
     main = "Kaplan-Meier Curve with Censoring Marks",
     xlab = "Time", ylab = "Survival Probability",
     col = "darkgreen", lwd = 2)
# Add median survival time
abline(v = median(km_fit$time), lty = 2, col = "red")
text(x = median(km_fit$time), y = 0.5, 
     paste("Median =", round(median(km_fit$time), 2)),
     pos = 4, col = "red")

2.3 Features of Survival Curve

The Kaplan-Meier survival curve (also known as the product-limit estimator) is a widely used non-parametric method to estimate the survival function \(S(t)\), which represents the probability of surviving beyond time t. It has several distinctive features that make it particularly suitable for analyzing time-to-event data with incomplete observations. The Kaplan-Meier survival curve has these features:

  • Step function: Decreases only at event times.

The curve is a decreasing step function that remains constant (horizontal) between successive event times and drops only at the exact times when events (e.g., deaths, failures, or relapses) occur. It does not decrease continuously or use sloping lines between point

  • Right-continuous: \(S(t) = \lim_{u \downarrow t} S(u)\)

The Kaplan-Meier estimator is right-continuous, meaning that at each event time t, the survival probability \(S(t)\) equals the limit of \(S(u)\) as u approaches t from the right. This convention reflects that the survival probability is defined just after the event has occurred.

  • Censoring indicators: Marks show censored observations

  • Small vertical tick marks or crosses/dots are typically plotted along the curve to indicate censored observations — individuals who are lost to follow-up, withdraw from the study, or remain event-free at the end of observation. These marks show that the subject survived at least up to that point, without causing a drop in the curve, but they reduce the number of individuals at risk for subsequent time points.

  • Plateaus: Periods with no events

The curve often shows flat, horizontal segments (plateaus) during periods when no events occur, even if some censoring happens in between. These plateaus indicate stable survival probability over those intervals and become more common with longer follow-up or smaller sample sizes.

  • Confidence bands: Show uncertainty in estimates

Pointwise or banded confidence intervals (most commonly based on Greenwood’s formula for variance) are frequently added to the curve to illustrate the uncertainty in the survival estimates. Wider bands at later times reflect greater uncertainty due to fewer patients remaining at risk (increased censoring and fewer events).

  • Median survival: Time when \(S(t) = 0.5\)

A key summary statistic is the median survival time, defined as the smallest time t where the estimated survival probability \(S(t)\) drops to 0.5 (or ≤ 0.5). It is easily read from the curve by finding where it crosses the 50% horizontal line and dropping a vertical line to the time axis. If the curve never reaches 0.5 (e.g., due to heavy censoring), the median may not be estimable.

PART 3: Analysis with Imported CSV Data

3.1 Importing and Preparing Data

# Read the CSV file (saved as 'health_data.csv')
imported_data<- read.csv(file.choose(), header = TRUE)
attach(imported_data)
View(imported_data)

# View data
cat("Dataset dimensions:", dim(imported_data), "\n")
## Dataset dimensions: 200 9
cat("First few observations:\n")
## First few observations:
print(head(imported_data))
##   ID Age Height_cm Weight_kg Treatment Time_to_event Censored Event      Time
## 1  1  44       192        69    Drug_A     13.520928        0     1 13.520928
## 2  2  48       183        52    Drug_B     51.955049        0     1 51.955049
## 3  3  66       167        60    Drug_A     12.435579        1     0  7.452644
## 4  4  51       175        70    Drug_A     38.112723        0     1 38.112723
## 5  5  51       166        80    Drug_A     82.554753        0     1 82.554753
## 6  6  67       165        45    Drug_A      1.907929        0     1  1.907929

3.2 Kaplan-Meier Analysis on Imported Data

# Create survival object
surv_obj_imported <- Surv(imported_data$Time, imported_data$Event)

# Fit Kaplan-Meier by treatment group
km_fit_group <- survfit(surv_obj_imported ~ Treatment, data = imported_data)

# Plot by group
plot(km_fit_group, col = c("red", "blue", "green"),
     main = "Kaplan-Meier Curves by Treatment Group",
     xlab = "Time", ylab = "Survival Probability",
     lwd = 2)
legend("topright", legend = levels(factor(imported_data$Treatment)),
       col = c("red", "blue", "green"), lwd = 2)

3.3 Kaplan-Meier Curves using imported data

Without Confidence Interval(imported data)

plot(km_fit_group, conf.int = FALSE,
     main = "Kaplan-Meier Curve (No Confidence Interval)",
     xlab = "Time", ylab = "Survival Probability",
     col = "darkred", lwd = 2)
abline(h = seq(0, 1, 0.2), lty = 3, col = "gray")

With Confidence Interval(Imported data)

plot(km_fit_group, conf.int = TRUE,
     main = "Kaplan-Meier Curve with 95% Confidence Interval",
     xlab = "Time", ylab = "Survival Probability",
     col = "darkblue", lwd = 2)
legend("topright", legend = c("KM Estimate", "95% CI"),
       col = c("darkblue", "gray"), lwd = c(2, 2), lty = c(1, 2))

With Mark-up Time (imported data)

plot(km_fit_group, mark.time = TRUE,
     main = "Kaplan-Meier Curve with Censoring Marks",
     xlab = "Time", ylab = "Survival Probability",
     col = "darkgreen", lwd = 2)
# Add median survival time
abline(v = median(km_fit_group$time), lty = 2, col = "red")
text(x = median(km_fit_group$time), y = 0.5, 
     paste("Median =", round(median(km_fit_group$time), 2)),
     pos = 4, col = "red")

PART 4: Cox Proportional Hazards Model

4.1 Fitting Cox PH Model

# Fit Cox proportional hazards model
cox_model_full <- coxph(surv_obj_imported ~ Age + Height_cm + Weight_kg + Treatment,
                   data = imported_data)

# Summary of the model
cox_summary <- summary(cox_model_full)
print(cox_summary)
## Call:
## coxph(formula = surv_obj_imported ~ Age + Height_cm + Weight_kg + 
##     Treatment, data = imported_data)
## 
##   n= 200, number of events= 151 
## 
##                       coef exp(coef)  se(coef)      z Pr(>|z|)
## Age              -0.011449  0.988616  0.009566 -1.197    0.231
## Height_cm        -0.008332  0.991703  0.008985 -0.927    0.354
## Weight_kg         0.004624  1.004634  0.006090  0.759    0.448
## TreatmentDrug_B   0.030196  1.030656  0.216457  0.139    0.889
## TreatmentPlacebo  0.185164  1.203416  0.199225  0.929    0.353
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## Age                 0.9886     1.0115    0.9703     1.007
## Height_cm           0.9917     1.0084    0.9744     1.009
## Weight_kg           1.0046     0.9954    0.9927     1.017
## TreatmentDrug_B     1.0307     0.9703    0.6743     1.575
## TreatmentPlacebo    1.2034     0.8310    0.8144     1.778
## 
## Concordance= 0.562  (se = 0.028 )
## Likelihood ratio test= 3.64  on 5 df,   p=0.6
## Wald test            = 3.64  on 5 df,   p=0.6
## Score (logrank) test = 3.64  on 5 df,   p=0.6

Model Overview

The Cox proportional hazards model examines how different variables affect survival time (time to event). The model fitted includes:

  • Age: Patient’s age

  • Height_cm: Height in centimeters

  • Weight_kg: Weight in kilograms

  • Treatment: Two treatment groups (Drug_B and Placebo)

Sample characteristics: n = 200 patients, 151 events observed

4.2 Interpretation of Results

The Cox proportional hazards model assumes:

\[h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]

Where:

  • \(h(t|X)\) is the hazard at time \(t\) for covariates \(X\)
  • \(h_0(t)\) is the baseline hazard function
  • \(\beta_i\) are the coefficients to be estimated

4.3. Key Interpretations

Hazard Ratios (HR): \(\exp(\beta)\)

  • HR > 1: Increased hazard (worse survival, Higher risk of event)
  • HR < 1: Decreased hazard (better survival, lower risk of event)
  • HR = 1: No effect

Variable-by-Variable Analysis

1. Age (coef = -0.011449, HR = 0.989)

  • Interpretation: For each additional year of age, the hazard decreases by approximately 1.1% (HR = 0.989)

  • Statistical significance: Not statistically significant (p = 0.231).

  • 95% CI: [0.970, 1.007] - includes 1, confirming non-significance

  • Practical meaning: - Age does not appear to meaningfully affect survival in this dataset

  • Statistical Significance: p-value < 0.05 indicates significant effect

  • Confidence Intervals: 95% CI for hazard ratios

2. Height_cm (coef = -0.008332, HR = 0.992)

  • Interpretation: For each additional centimeter of height, the hazard decreases by approximately 0.8% (HR = 0.992)

  • Statistical significance: Not statistically significant (p = 0.354)

  • 95% CI: [0.974, 1.009] - includes 1

  • Practical meaning: Height is not a significant predictor of survival

3. Weight_kg (coef = 0.004624, HR = 1.005)

  • Interpretation: For each additional kilogram of weight, the hazard increases by approximately 0.5% (HR = 1.005)

  • Statistical significance: Not statistically significant (p = 0.448)

  • 95% CI: [0.993, 1.017] - includes 1

  • Practical meaning: Weight does not significantly impact survival

4. TreatmentDrug_B (coef = 0.030196, HR = 1.031)

  • Interpretation: Patients receiving Drug B have a 3.1% higher hazard compared to the reference group (assumed to be standard treatment or control)

  • Statistical significance: Not statistically significant (p = 0.889)

  • 95% CI: [0.674, 1.575] - very wide interval including 1

  • Practical meaning: Drug B shows no significant difference from the reference treatment

5. TreatmentPlacebo (coef = 0.185164, HR = 1.203)

  • Interpretation: Patients receiving placebo have a 20.3% higher hazard compared to the reference group

  • Statistical significance: Not statistically significant (p = 0.353)

  • 95% CI: [0.814, 1.778] - includes 1

  • Practical meaning: While the hazard ratio suggests worse outcomes with placebo, this difference is not statistically significant

4.4 Model Fit Statistics

Concordance Index (C-index = 0.562)

Range: 0.5 to 1.0

Interpretation:

  • 0.5 = random prediction (no better than chance)

  • 1.0 = perfect prediction

  • The model: 0.562 indicates the model has very weak predictive ability, only slightly better than random chance

  • Practical meaning: The included variables do not effectively distinguish between patients with different survival outcomes

Likelihood Ratio Test

  • Test statistic: 3.64 on 5 degrees of freedom

  • p-value: 0.6

  • Interpretation: The model as a whole is not statistically significant. The predictors collectively do not improve upon a model with no predictors

Wald Test

  • Test statistic: 3.64 on 5 degrees of freedom

  • p-value: 0.6

  • Interpretation: Confirms the overall non-significance of the model

Score (Log-rank) Test

  • Test statistic: 3.64 on 5 degrees of freedom

  • p-value: 0.6

  • Interpretation: Another confirmation that the model lacks significant predictive power

Proportional Hazards Assumption Check

# Check proportional hazards assumption
cox_zph_full <- cox.zph(cox_model_full)
print(cox_zph_full)
##              chisq df    p
## Age       4.35e-02  1 0.83
## Height_cm 4.84e-03  1 0.94
## Weight_kg 4.01e-05  1 0.99
## Treatment 1.49e+00  2 0.47
## GLOBAL    1.75e+00  5 0.88
# Plot Schoenfeld residuals
plot(cox_zph_full, var = "Age", main = "PH Assumption Check for Age")
abline(h = 0, lty = 2, col = "red")

PART 5: Mathematical Construction of Kaplan-Meier Table

5.1 Manual Kaplan-Meier Calculation

Given example data: Time: \(2, 3^+, 6, 6, 7, 10^+, 15, 15, 16, 27, 30, 32\) where \(+\) indicates censored observations.

# Example data
time_example <- c(0, 2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, 32)
event_example <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1)  # 0=censored, 1=event

# Sort data
sorted_data <- data.frame(time = time_example, event = event_example)
sorted_data <- sorted_data[order(sorted_data$time), ]

# Manual KM calculation
n <- length(time_example)
at_risk <- n:1
events <- sorted_data$event

# Calculate survival probabilities
survival_prob <- numeric(n)
survival_prob[1] <- 1 - events[1]/at_risk[1]

for(i in 2:n) {
  survival_prob[i] <- survival_prob[i-1] * (1 - events[i]/at_risk[i])
}

# Create KM table
km_table <- data.frame(
  Time = sorted_data$time,
  n_risk = at_risk,
  n_event = events,
  Survival = round(survival_prob, 4)
)

print(km_table)
##    Time n_risk n_event Survival
## 1     0     13       1   0.9231
## 2     2     12       1   0.8462
## 3     3     11       0   0.8462
## 4     6     10       1   0.7615
## 5     6      9       1   0.6769
## 6     7      8       1   0.5923
## 7    10      7       0   0.5923
## 8    15      6       1   0.4936
## 9    15      5       1   0.3949
## 10   16      4       1   0.2962
## 11   27      3       1   0.1974
## 12   30      2       1   0.0987
## 13   32      1       1   0.0000

5.2 Survival Probability Plot

# Plot survival probabilities
plot(km_table$Time, km_table$Survival, type = "s",
     main = "Manual Kaplan-Meier Survival Curve",
     xlab = "Time", ylab = "Survival Probability",
     col = "purple", lwd = 2, ylim = c(0, 1))
points(km_table$Time, km_table$Survival, pch = 19)

# Add censored observations
censored_times <- sorted_data$time[sorted_data$event == 0]
censored_probs <- km_table$Survival[sorted_data$event == 0]
points(censored_times, censored_probs, pch = 3, col = "red", cex = 1.5)

abline(h = 0.5, lty = 2, col = "gray")
legend("topright", legend = c("Survival", "Censored"),
       pch = c(19, 3), col = c("purple", "red"))

PART 6: Multiple Linear Regression

6.1 Fitting Multiple Linear Regression Model

# Fit multiple linear regression model
# For demonstration, we'll use Time as the dependent variable
lm_model <- lm(Time ~ Age + Height_cm + Weight_kg, data = imported_data)
lm_summary <- summary(lm_model)

# Displaying results
cat("Multiple Linear Regression Results:\n")
## Multiple Linear Regression Results:
print(lm_summary)
## 
## Call:
## lm(formula = Time ~ Age + Height_cm + Weight_kg, data = imported_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.685 -13.378  -7.327   5.977  69.735 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.717462  25.836128  -0.028    0.978
## Age          0.048643   0.144121   0.338    0.736
## Height_cm    0.104590   0.136301   0.767    0.444
## Weight_kg   -0.009338   0.093865  -0.099    0.921
## 
## Residual standard error: 19.18 on 196 degrees of freedom
## Multiple R-squared:  0.00361,    Adjusted R-squared:  -0.01164 
## F-statistic: 0.2367 on 3 and 196 DF,  p-value: 0.8707
# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model)

par(mfrow = c(1, 1))

6.2 Regression Equation

The fitted regression model is:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3\]

Where:

  • \(\hat{Y}\) = Predicted Blood Pressure
  • \(X_1\) = Age
  • \(X_2\) = Height (cm)
  • \(X_3\) = Weight (kg)

Interpretation

  • R-squared: Proportion of variance explained
  • Adjusted R-squared: Adjusted for number of predictors
  • F-statistic: Overall significance of model
  • Coefficients: Effect of each predictor controlling for others

PART 7: USING REAL WORLD DATASET (CANCER DATASET)

##ABOUT THIS DATASET This data has been downloaded from Kaggle. https://www.kaggle.com/code/saychakra/survival-analysis-of-lung-cancer-patients
This section performs survival analysis on the lung cancer dataset (cancer.csv) using R.

7.1 Analysis to be done

  1. Complete Data Analysis Pipeline: From loading to final conclusions
  2. Multiple Survival Models: Kaplan-Meier, Cox PH, Random Survival Forest
  3. Comprehensive Visualizations: Survival curves, forest plots, diagnostic plots
  4. Statistical Testing: Log-rank tests, proportional hazards assumptions
  5. Model Comparison: Concordance indices and performance metrics
  6. Clinical Interpretation: Hazard ratios and practical implications

##Loading Required Libraries

# Install packages if not already installed
if(!require("survival")) install.packages("survival")
if(!require("survminer")) install.packages("survminer")
if(!require("ggplot2")) install.packages("ggplot2")
if(!require("dplyr")) install.packages("dplyr")
if(!require("tidyr")) install.packages("tidyr")
if(!require("GGally")) install.packages("GGally")
if(!require("caret")) install.packages("caret")
if(!require("randomForestSRC")) install.packages("randomForestSRC")

# Load libraries
library(survival)
library(survminer)
library(ggplot2)
library(dplyr)
library(tidyr)
library(GGally)
library(caret)
library(randomForestSRC)
library(knitr)

##. Data Loading and Exploration for the Cancer data

# Load the lung cancer dataset
# Read the CSV file (saved as 'cancer.csv')
cancer_data<- read.csv(file.choose(), header = TRUE)
attach(cancer_data)
View(cancer_data)
#cancer_data <- read.csv("cancer.csv")

# Display basic information
cat("Dataset Dimensions:", dim(cancer_data), "\n")
## Dataset Dimensions: 228 11
cat("Column Names:", names(cancer_data), "\n\n")
## Column Names: X inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
# Display first few rows
cat("First 6 rows of the dataset:\n")
## First 6 rows of the dataset:
kable(head(cancer_data), caption = "First 6 Observations")
First 6 Observations
X inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
# Structure of the data
str(cancer_data)
## 'data.frame':    228 obs. of  11 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ inst     : int  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : int  306 455 1010 210 883 1022 310 361 218 166 ...
##  $ status   : int  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : int  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : int  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : int  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : int  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: int  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : int  1175 1225 NA 1150 NA 513 384 538 825 271 ...
##  $ wt.loss  : int  NA 15 15 11 0 0 10 1 16 34 ...
# Summary statistics
cat("\nSummary Statistics:\n")
## 
## Summary Statistics:
summary(cancer_data)
##        X               inst            time            status     
##  Min.   :  1.00   Min.   : 1.00   Min.   :   5.0   Min.   :1.000  
##  1st Qu.: 57.75   1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000  
##  Median :114.50   Median :11.00   Median : 255.5   Median :2.000  
##  Mean   :114.50   Mean   :11.09   Mean   : 305.2   Mean   :1.724  
##  3rd Qu.:171.25   3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000  
##  Max.   :228.00   Max.   :33.00   Max.   :1022.0   Max.   :2.000  
##                   NA's   :1                                       
##       age             sex           ph.ecog          ph.karno     
##  Min.   :39.00   Min.   :1.000   Min.   :0.0000   Min.   : 50.00  
##  1st Qu.:56.00   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00  
##  Median :63.00   Median :1.000   Median :1.0000   Median : 80.00  
##  Mean   :62.45   Mean   :1.395   Mean   :0.9515   Mean   : 81.94  
##  3rd Qu.:69.00   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00  
##  Max.   :82.00   Max.   :2.000   Max.   :3.0000   Max.   :100.00  
##                                  NA's   :1        NA's   :1       
##    pat.karno         meal.cal         wt.loss       
##  Min.   : 30.00   Min.   :  96.0   Min.   :-24.000  
##  1st Qu.: 70.00   1st Qu.: 635.0   1st Qu.:  0.000  
##  Median : 80.00   Median : 975.0   Median :  7.000  
##  Mean   : 79.96   Mean   : 928.8   Mean   :  9.832  
##  3rd Qu.: 90.00   3rd Qu.:1150.0   3rd Qu.: 15.750  
##  Max.   :100.00   Max.   :2600.0   Max.   : 68.000  
##  NA's   :3        NA's   :47       NA's   :14

##.Data Preprocessing for the cancer data

# Create a copy for analysis
df <- cancer_data

# Rename columns for clarity
df <- df %>%
  rename(
    institution = inst,
    survival_time = time,
    age_years = age,
    performance_ecog = ph.ecog,
    physician_karno = ph.karno,
    patient_karno = pat.karno,
    calories_per_day = meal.cal,
    weight_loss = wt.loss
  )

# Convert status to proper event indicator (2 = death, 1 = censored)
# For survival analysis: 1 = event (death), 0 = censored
df$event <- ifelse(df$status == 2, 1, 0)

# Convert sex to factor (1 = male, 2 = female)
df$sex <- factor(df$sex, levels = c(1, 2), labels = c("Male", "Female"))

# Create age groups
df$age_group <- cut(df$age_years,
                    breaks = c(0, 50, 60, 70, 100),
                    labels = c("<50", "50-59", "60-69", "70+"))

# Create performance status groups
df$ecog_group <- cut(df$performance_ecog,
                     breaks = c(-1, 0, 1, 2, 5),
                     labels = c("0", "1", "2", "3+"))

# Handle missing values
# Count missing values
cat("Missing values before imputation:\n")
## Missing values before imputation:
print(colSums(is.na(df)))
##                X      institution    survival_time           status 
##                0                1                0                0 
##        age_years              sex performance_ecog  physician_karno 
##                0                0                1                1 
##    patient_karno calories_per_day      weight_loss            event 
##                3               47               14                0 
##        age_group       ecog_group 
##                0                1
# Impute missing values with median for numeric columns
numeric_cols <- c("performance_ecog", "physician_karno", 
                  "patient_karno", "calories_per_day", "weight_loss")

for(col in numeric_cols) {
  if(any(is.na(df[[col]]))) {
    df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
  }
}

# Check missing values after imputation
cat("\nMissing values after imputation:\n")
## 
## Missing values after imputation:
print(colSums(is.na(df)))
##                X      institution    survival_time           status 
##                0                1                0                0 
##        age_years              sex performance_ecog  physician_karno 
##                0                0                0                0 
##    patient_karno calories_per_day      weight_loss            event 
##                0                0                0                0 
##        age_group       ecog_group 
##                0                1
# Display cleaned data structure
cat("\nCleaned dataset dimensions:", dim(df), "\n")
## 
## Cleaned dataset dimensions: 228 14

##. Exploratory Data Analysis on Cancer Data

Distribution of Variables

# Create multiple histograms
par(mfrow = c(2, 3))
hist(df$age_years, main = "Age Distribution", xlab = "Age", col = "skyblue")
hist(df$survival_time, main = "Survival Time Distribution", 
     xlab = "Days", col = "lightgreen")
hist(df$performance_ecog, main = "ECOG Performance Status", 
     xlab = "Score", col = "lightcoral")
hist(df$physician_karno, main = "Physician Karnofsky Score", 
     xlab = "Score", col = "gold")
hist(df$weight_loss, main = "Weight Loss Distribution", 
     xlab = "Pounds", col = "plum")
par(mfrow = c(1, 1))

Categorical Variable Distributions on Cancer data

# Create a summary table for categorical variables
cat_table <- df %>%
  group_by(sex, event) %>%
  summarise(
    count = n(),
    mean_survival = mean(survival_time),
    median_survival = median(survival_time),
    .groups = 'drop'
  )

cat("Summary by Sex and Event Status:\n")
## Summary by Sex and Event Status:
kable(cat_table, caption = "Summary Statistics by Sex and Event Status")
Summary Statistics by Sex and Event Status
sex event count mean_survival median_survival
Male 0 26 372.1538 281.5
Male 1 112 262.5893 208.5
Female 0 37 357.3514 292.0
Female 1 53 326.1321 293.0
# Bar plots
p1 <- ggplot(df, aes(x = sex, fill = factor(event))) +
  geom_bar(position = "dodge") +
  labs(title = "Event Distribution by Sex",
       x = "Sex", y = "Count", fill = "Event") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"), 
                    labels = c("Censored", "Death"))

p2 <- ggplot(df, aes(x = age_group, fill = factor(event))) +
  geom_bar(position = "dodge") +
  labs(title = "Event Distribution by Age Group",
       x = "Age Group", y = "Count", fill = "Event") +
  theme_minimal() +
  scale_fill_manual(values = c("blue", "red"),
                    labels = c("Censored", "Death"))

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

Correlation Analysis on Cancer data

# Select numeric columns for correlation
numeric_df <- df %>%
  select(survival_time, event, age_years, performance_ecog,
         physician_karno, patient_karno, calories_per_day, weight_loss)

# Correlation matrix
cor_matrix <- cor(numeric_df, use = "complete.obs")
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "Correlation")

# Create correlation plot
ggplot(cor_df, aes(Var1, Var2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2)), color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1)) +
  labs(title = "Correlation Matrix", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.2 Kaplan-Meier Survival Analysis

Overall Survival

# Create survival object
surv_obj <- Surv(time = df$survival_time, event = df$event)

# Fit Kaplan-Meier estimator
km_fit <- survfit(surv_obj ~ 1)

# Plot overall survival curve
ggsurvplot(km_fit,
           data = df,
           risk.table = TRUE,
           conf.int = TRUE,
           ggtheme = theme_minimal(),
           title = "Overall Kaplan-Meier Survival Curve",
           xlab = "Time (days)",
           ylab = "Survival Probability",
           legend.title = "",
           palette = "blue")

Summary Statistics

# Get summary statistics
km_summary <- summary(km_fit)

cat("Overall Survival Statistics:\n")
## Overall Survival Statistics:
cat("Median survival time:", km_fit$median, "days\n")
## Median survival time: days
cat("Mean survival time:", mean(km_fit$time), "days\n")
## Mean survival time: 327.5 days
cat("Number of events:", sum(km_fit$n.event), "\n")
## Number of events: 165
cat("Number censored:", sum(km_fit$n.censor), "\n")
## Number censored: 63
# Survival probabilities at specific time points
time_points <- c(100, 200, 300, 400, 500)
cat("\nSurvival probabilities at specific time points:\n")
## 
## Survival probabilities at specific time points:
for(t in time_points) {
  surv_prob <- summary(km_fit, times = t)$surv
  cat(sprintf("At %d days: %.4f\n", t, surv_prob))
}
## At 100 days: 0.8640
## At 200 days: 0.6803
## At 300 days: 0.5306
## At 400 days: 0.3768
## At 500 days: 0.2933

Stratified Analysis by Sex

# Kaplan-Meier by sex
km_sex <- survfit(surv_obj ~ sex, data = df)

# Plot
ggsurvplot(km_sex,
           data = df,
           risk.table = TRUE,
           conf.int = TRUE,
           pval = TRUE,
           pval.method = TRUE,
           ggtheme = theme_minimal(),
           title = "Kaplan-Meier Survival Curves by Sex",
           xlab = "Time (days)",
           ylab = "Survival Probability",
           legend.title = "Sex",
           legend.labs = c("Male", "Female"),
           palette = c("blue", "red"))

Log-Rank Test

# Perform log-rank test
logrank_result <- survdiff(surv_obj ~ sex, data = df)

cat("Log-Rank Test Results (Sex Comparison):\n")
## Log-Rank Test Results (Sex Comparison):
print(logrank_result)
## Call:
## survdiff(formula = surv_obj ~ sex, data = df)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male   138      112     91.6      4.55      10.3
## sex=Female  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001
# Extract p-value
logrank_p <- 1 - pchisq(logrank_result$chisq, length(logrank_result$n) - 1)
cat(sprintf("\nP-value: %.4f\n", logrank_p))
## 
## P-value: 0.0013
if(logrank_p < 0.05) {
  cat("Conclusion: Significant difference in survival between sexes\n")
} else {
  cat("Conclusion: No significant difference in survival between sexes\n")
}
## Conclusion: Significant difference in survival between sexes

Stratified Analysis by Age Group

# Kaplan-Meier by age group
km_age <- survfit(surv_obj ~ age_group, data = df)

# Plot
ggsurvplot(km_age,
           data = df,
           risk.table = TRUE,
           conf.int = FALSE,
           pval = TRUE,
           pval.method = TRUE,
           ggtheme = theme_minimal(),
           title = "Kaplan-Meier Survival Curves by Age Group",
           xlab = "Time (days)",
           ylab = "Survival Probability",
           legend.title = "Age Group",
           palette = "Set1")

Stratified Analysis by Performance Status

# Kaplan-Meier by ECOG group
km_ecog <- survfit(surv_obj ~ ecog_group, data = df)

# Plot
ggsurvplot(km_ecog,
           data = df,
           risk.table = TRUE,
           conf.int = FALSE,
           pval = TRUE,
           pval.method = TRUE,
           ggtheme = theme_minimal(),
           title = "Kaplan-Meier Survival Curves by ECOG Performance Status",
           xlab = "Time (days)",
           ylab = "Survival Probability",
           legend.title = "ECOG Score",
           palette = "Dark2")

7.3 Cox Proportional Hazards Model

Model Fitting

# Prepare data for Cox model
cox_df <- df %>%
  select(survival_time, event, age_years, sex, performance_ecog,
         physician_karno, patient_karno, calories_per_day, weight_loss)

# Remove any remaining NA values
cox_df <- na.omit(cox_df)

# Fit Cox proportional hazards model
cox_model <- coxph(Surv(survival_time, event) ~ 
                     age_years + sex + performance_ecog + 
                     physician_karno + patient_karno + 
                     calories_per_day + weight_loss,
                   data = cox_df)

# Display model summary
cat("Cox Proportional Hazards Model Summary:\n\n")
## Cox Proportional Hazards Model Summary:
print(summary(cox_model))
## Call:
## coxph(formula = Surv(survival_time, event) ~ age_years + sex + 
##     performance_ecog + physician_karno + patient_karno + calories_per_day + 
##     weight_loss, data = cox_df)
## 
##   n= 228, number of events= 165 
## 
##                        coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age_years         1.239e-02  1.012e+00  9.446e-03  1.311  0.18977    
## sexFemale        -5.894e-01  5.547e-01  1.701e-01 -3.465  0.00053 ***
## performance_ecog  5.863e-01  1.797e+00  1.831e-01  3.202  0.00136 ** 
## physician_karno   1.374e-02  1.014e+00  9.384e-03  1.465  0.14300    
## patient_karno    -1.245e-02  9.876e-01  6.950e-03 -1.792  0.07320 .  
## calories_per_day  3.298e-05  1.000e+00  2.410e-04  0.137  0.89115    
## weight_loss      -1.134e-02  9.887e-01  6.588e-03 -1.722  0.08509 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## age_years           1.0125     0.9877    0.9939    1.0314
## sexFemale           0.5547     1.8028    0.3974    0.7742
## performance_ecog    1.7973     0.5564    1.2553    2.5731
## physician_karno     1.0138     0.9863    0.9954    1.0327
## patient_karno       0.9876     1.0125    0.9743    1.0012
## calories_per_day    1.0000     1.0000    0.9996    1.0005
## weight_loss         0.9887     1.0114    0.9760    1.0016
## 
## Concordance= 0.65  (se = 0.024 )
## Likelihood ratio test= 37.48  on 7 df,   p=4e-06
## Wald test            = 36.75  on 7 df,   p=5e-06
## Score (logrank) test = 37.76  on 7 df,   p=3e-06

Hazard Ratios Visualization

# Extract coefficients and confidence intervals
cox_summary <- summary(cox_model)
hazard_ratios <- data.frame(
  Variable = rownames(cox_summary$coefficients),
  HR = cox_summary$coefficients[, "exp(coef)"],
  Lower = cox_summary$conf.int[, "lower .95"],
  Upper = cox_summary$conf.int[, "upper .95"],
  p_value = cox_summary$coefficients[, "Pr(>|z|)"]
)

# Remove intercept if present
hazard_ratios <- hazard_ratios[!grepl("Intercept", hazard_ratios$Variable),]

# Create forest plot
ggplot(hazard_ratios, aes(x = HR, y = reorder(Variable, HR))) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
  scale_x_log10() +
  labs(title = "Hazard Ratios with 95% Confidence Intervals",
       x = "Hazard Ratio (log scale)",
       y = "Variable") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Model Diagnostics

# Check proportional hazards assumption
cox_ph_test <- cox.zph(cox_model)

cat("Proportional Hazards Assumption Test:\n")
## Proportional Hazards Assumption Test:
print(cox_ph_test)
##                    chisq df      p
## age_years         0.0564  1 0.8123
## sex               2.7016  1 0.1002
## performance_ecog  1.0437  1 0.3070
## physician_karno   5.1387  1 0.0234
## patient_karno     3.5380  1 0.0600
## calories_per_day  7.1009  1 0.0077
## weight_loss       0.2832  1 0.5946
## GLOBAL           15.3360  7 0.0319
# Plot Schoenfeld residuals
par(mfrow = c(3, 3))
plot(cox_ph_test, var = 1)
plot(cox_ph_test, var = 2)
plot(cox_ph_test, var = 3)
plot(cox_ph_test, var = 4)
plot(cox_ph_test, var = 5)
plot(cox_ph_test, var = 6)
plot(cox_ph_test, var = 7)
par(mfrow = c(1, 1))

# Test for influential observations
dfbeta <- residuals(cox_model, type = "dfbeta")
cat("\nInfluential Observations (Top 5 by absolute dfbeta):\n")
## 
## Influential Observations (Top 5 by absolute dfbeta):
influential_idx <- order(abs(dfbeta[,1]), decreasing = TRUE)[1:5]
print(dfbeta[influential_idx,])
##             [,1]        [,2]         [,3]          [,4]          [,5]
## 85   0.004692762  0.02665687 -0.009524579  0.0004768544  0.0001530322
## 129 -0.002399962 -0.03591095 -0.041732519 -0.0027043722  0.0008466951
## 33   0.002383475  0.01627415 -0.018684245  0.0014165391 -0.0006493623
## 89   0.001969821 -0.02054653 -0.003658999 -0.0005622521 -0.0003148215
## 6   -0.001865053  0.02644861  0.095979318  0.0078422270 -0.0010399387
##              [,6]         [,7]
## 85   9.056090e-06  0.001123855
## 129 -4.445420e-06 -0.001005656
## 33   4.602607e-06  0.001545555
## 89   2.859613e-05 -0.001473872
## 6    6.322200e-05  0.001349450

Predicted Survival Curves

# Create new data for prediction
new_data <- data.frame(
  age_years = c(50, 50, 70, 70),
  sex = factor(c("Male", "Female", "Male", "Female")),
  performance_ecog = c(0, 0, 2, 2),
  physician_karno = c(90, 90, 60, 60),
  patient_karno = c(90, 90, 60, 60),
  calories_per_day = median(df$calories_per_day),
  weight_loss = median(df$weight_loss)
)

# Predict survival curves
pred_surv <- survfit(cox_model, newdata = new_data)

# Plot predicted survival curves
plot(pred_surv, 
     col = c("blue", "red", "green", "purple"),
     lwd = 2,
     xlab = "Time (days)",
     ylab = "Survival Probability",
     main = "Predicted Survival Curves from Cox Model")
legend("topright",
       legend = c("50M ECOG0", "50F ECOG0", "70M ECOG2", "70F ECOG2"),
       col = c("blue", "red", "green", "purple"),
       lwd = 2)
grid()

7.4 Advanced Survival Machine Learning Models

Random Survival Forest

# Prepare data for Random Survival Forest
rsf_data <- df %>%
  select(survival_time, event, age_years, sex, performance_ecog,
         physician_karno, patient_karno, calories_per_day, weight_loss)

# Convert to numeric matrix (RSF requires numeric input)
rsf_data$sex <- as.numeric(rsf_data$sex) - 1  # Convert to 0/1

# Remove any NA values
rsf_data <- na.omit(rsf_data)

# Fit Random Survival Forest
set.seed(123)
rsf_model <- rfsrc(Surv(survival_time, event) ~ .,
                   data = rsf_data,
                   ntree = 500,
                   importance = TRUE,
                   splitrule = "logrank",
                   na.action = "na.omit")

# Summary
cat("Random Survival Forest Model Summary:\n")
## Random Survival Forest Model Summary:
print(rsf_model)
##                          Sample size: 228
##                     Number of deaths: 165
##                      Number of trees: 500
##            Forest terminal node size: 15
##        Average no. of terminal nodes: 12.446
## No. of variables tried at each split: 3
##               Total no. of variables: 7
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 144
##                             Analysis: RSF
##                               Family: surv
##                       Splitting rule: logrank *random*
##        Number of random split points: 10
##                           (OOB) CRPS: 139.87887465
##              (OOB) standardized CRPS: 0.15841322
##    (OOB) Requested performance error: 0.39207664

Variable Importance

# Extract variable importance
vimp <- rsf_model$importance
vimp_df <- data.frame(
  Variable = names(vimp),
  Importance = vimp
) %>%
  arrange(desc(Importance))

# Plot variable importance
ggplot(vimp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Random Survival Forest - Variable Importance",
       x = "Variable",
       y = "Importance") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

RSF Predicted Survival

# Get predicted survival from RSF
pred_rsf <- predict(rsf_model, newdata = rsf_data)

# Plot survival curves for first 4 patients
plot(pred_rsf, 
     plots.one.page = FALSE,
     show.plots = TRUE)

Model Comparison and Performance Metrics

# Calculate concordance index for Cox model
cox_cindex <- summary(cox_model)$concordance[1]

# Calculate concordance index for RSF
rsf_cindex <- rsf_model$err.rate[rsf_model$ntree]

# Create comparison table
comparison_df <- data.frame(
  Model = c("Cox Proportional Hazards", "Random Survival Forest"),
  Concordance = c(round(cox_cindex, 3), round(1 - rsf_cindex, 3)),
  AIC = c(round(AIC(cox_model), 1), NA),
  BIC = c(round(BIC(cox_model), 1), NA)
)

cat("Model Performance Comparison:\n")
## Model Performance Comparison:
kable(comparison_df, caption = "Comparison of Survival Models")
Comparison of Survival Models
Model Concordance AIC BIC
C Cox Proportional Hazards 0.650 1476.3 1498.1
Random Survival Forest 0.608 NA NA

PART 8: GEOGRAPHICAL DATA ANALYSIS (Kenya Survival data sample)

Load required Libraries

# Load Required Libraries
# Set CRAN mirror again to be safe
options(repos = c(CRAN = "https://cloud.r-project.org"))

# List of required packages
required_packages <- c("tidyverse", "sf", "leaflet", "survival", 
                       "survminer", "plotly", "viridis", "patchwork")

# Install and load packages - SIMPLE VERSION
for (pkg in required_packages) {
  # Check if package is installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  # Load the package
  library(pkg, character.only = TRUE)
}

Data Loading and Preparation

# Load the dataset
# Read the CSV file (saved as 'cancer.csv')
kenya_data<- read.csv(file.choose(), header = TRUE)
attach(kenya_data)
View(kenya_data)
#kenya_data <- read.csv("kenya_survival_dataset_47counties.csv")

# Display basic information
cat("Dataset Dimensions:", dim(kenya_data), "\n")
## Dataset Dimensions: 470 12
cat("Number of Counties:", length(unique(kenya_data$Region)), "\n")
## Number of Counties: 2
cat("Time Range:", range(kenya_data$Time), "\n")
## Time Range: 5 60
cat("Event Rate:", round(mean(kenya_data$Event) * 100, 1), "%\n")
## Event Rate: 55.7 %
# Display first few rows
cat("\nFirst 6 rows:\n")
## 
## First 6 rows:
head(kenya_data)

Data Preprocessing

# Clean and prepare the data
kenya_clean <- kenya_data %>%
  mutate(
    # Create proper event indicator (1 = event, 0 = censored)
    event_status = ifelse(Event == 1, "Event", "Censored"),
    
    # Convert Gender to factor
    Gender = factor(Gender, levels = c("M", "F"), labels = c("Male", "Female")),
    
    # Create age groups
    age_group = cut(Age,
                    breaks = c(0, 18, 35, 50, 65, 100),
                    labels = c("0-18", "19-35", "36-50", "51-65", "66+")),
    
    # Calculate BMI
    BMI = Weight_kg / ((Height_cm/100)^2),
    bmi_category = cut(BMI,
                       breaks = c(0, 18.5, 25, 30, Inf),
                       labels = c("Underweight", "Normal", "Overweight", "Obese")),
    
    # Blood pressure categories
    bp_category = cut(Blood_Pressure,
                      breaks = c(0, 90, 120, 130, 140, Inf),
                      labels = c("Low", "Normal", "Elevated", "Stage1", "Stage2")),
    
    # Education level ordering
    education_level = factor(Education_level,
                           levels = c("None", "Primary", "Secondary", "College", "University"),
                           ordered = TRUE)
  )

# Check structure
glimpse(kenya_clean)
## Rows: 470
## Columns: 18
## $ Time             <int> 43, 6, 26, 7, 18, 18, 22, 20, 33, 55, 45, 44, 30, 17,…
## $ Event            <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ Patient_ID       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ Age              <int> 32, 47, 66, 24, 26, 67, 51, 70, 82, 52, 90, 52, 31, 6…
## $ Gender           <fct> Male, Female, Male, Male, Female, Female, Female, Fem…
## $ Height_cm        <int> 147, 141, 198, 148, 192, 143, 175, 165, 146, 172, 151…
## $ Weight_kg        <int> 60, 99, 81, 78, 41, 41, 53, 64, 48, 98, 73, 76, 48, 4…
## $ Blood_Pressure   <int> 103, 100, 139, 97, 139, 85, 127, 139, 80, 84, 112, 12…
## $ Disease_category <chr> "Malaria", "Stroke", "Malaria", "HIV", "HIV", "TB", "…
## $ Region           <chr> "Rural", "Rural", "Rural", "Urban", "Urban", "Rural",…
## $ Country          <chr> "Nairobi", "Nairobi", "Nairobi", "Nairobi", "Nairobi"…
## $ Education_level  <chr> "University", "Primary", "College", "College", "Secon…
## $ event_status     <chr> "Censored", "Censored", "Event", "Censored", "Censore…
## $ age_group        <fct> 19-35, 36-50, 66+, 19-35, 19-35, 66+, 51-65, 66+, 66+…
## $ BMI              <dbl> 27.76621, 49.79629, 20.66116, 35.60993, 11.12196, 20.…
## $ bmi_category     <fct> Overweight, Obese, Normal, Obese, Underweight, Normal…
## $ bp_category      <fct> Normal, Normal, Stage1, Normal, Stage1, Low, Elevated…
## $ education_level  <ord> University, Primary, College, College, Secondary, Uni…

Interactive Map with Multiple Layers

# Create a data frame with all 47 Kenyan county coordinates
county_coords <- data.frame(
  Region = c("Nairobi", "Mombasa", "Kwale", "Kilifi", "Tana River", 
             "Lamu", "Taita-Taveta", "Garissa", "Wajir", "Mandera",
             "Marsabit", "Isiolo", "Meru", "Tharaka-Nithi", "Embu",
             "Kitui", "Machakos", "Makueni", "Nyandarua", "Nyeri",
             "Kirinyaga", "Murang'a", "Kiambu", "Turkana", "West Pokot",
             "Samburu", "Trans-Nzoia", "Uasin Gishu", "Elgeyo-Marakwet",
             "Nandi", "Baringo", "Laikipia", "Nakuru", "Narok", "Kajiado",
             "Kericho", "Bomet", "Kakamega", "Vihiga", "Bungoma",
             "Busia", "Siaya", "Kisumu", "Homa-Bay", "Migori",
             "Kisii", "Nyamira"),
  lat = c(-1.2921, -4.0435, -4.1816, -3.5107, -1.5000,
          -2.2696, -3.3960, -0.4536, 1.7500, 3.9360,
          2.3346, 0.3556, 0.0500, -0.2962, -0.5319,
          -1.3667, -1.5167, -2.2833, -0.5333, -0.4167,
          -0.5000, -0.7833, -1.0333, 3.1167, 1.8833,
          1.4500, 1.0333, 0.5167, 0.5167, 0.2000,
          0.4667, 0.0667, -0.2833, -1.0833, -2.1000,
          -0.3667, -0.7833, 0.2833, 0.1000, 0.5667,
          0.4667, 0.0667, -0.1000, -0.5333, -1.0667,
          -0.6833, -0.5667),
  lng = c(36.8219, 39.6682, 39.4603, 39.9093, 39.5000,
          40.9020, 38.3710, 39.6464, 40.0500, 41.8670,
          37.9790, 37.5808, 37.6500, 37.9452, 37.4600,
          38.0100, 37.2667, 37.8167, 36.4000, 36.9500,
          37.3333, 37.0333, 36.8667, 35.6000, 35.1333,
          36.6667, 34.9500, 35.2833, 35.5167, 35.1000,
          35.9500, 36.6500, 36.0667, 35.8667, 36.7833,
          35.2833, 35.3500, 34.7500, 34.7167, 34.5500,
          34.1167, 34.3000, 34.7500, 34.4667, 34.4667,
          34.7667, 34.9333)
)

# Merge with your county statistics
if(exists("county_stats")) {
  map_data <- county_stats %>%
    left_join(county_coords, by = "Region")
} else {
  # If county_stats doesn't exist, use just coordinates
  map_data <- county_coords %>%
    mutate(n_patients = 10, n_events = 5, event_rate = 50)  # dummy data
}

# Create the interactive map
library(leaflet)

leaflet(map_data) %>%
  addTiles() %>%
  setView(lng = 37.9062, lat = 0.0236, zoom = 6) %>%
  addCircleMarkers(
    lng = ~lng,
    lat = ~lat,
    radius = ~ifelse(exists("n_patients"), sqrt(n_patients)/3, 5),
    color = ~ifelse(exists("event_rate"), 
                   colorNumeric("Reds", event_rate)(event_rate), 
                   "blue"),
    fillOpacity = 0.7,
    stroke = TRUE,
    weight = 2,
    label = ~paste(
      "<strong>", Region, "</strong><br>",
      ifelse(exists("n_patients"), paste("Patients: ", n_patients, "<br>"), ""),
      ifelse(exists("n_events"), paste("Events: ", n_events, "<br>"), ""),
      ifelse(exists("event_rate"), paste("Event Rate: ", event_rate, "%"), "")
    ),
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) %>%
  addControl(
    "Kenya County Survival Analysis",
    position = "topright"
  ) %>%
  addMiniMap(
    toggleDisplay = TRUE,
    position = "bottomleft"
  )

References