# Set seed for reproducibility
set.seed(123)
# Create sample survival dataset
n <- 200
surv_data <- data.frame(
ID = 1:n,
Age = round(rnorm(n, mean = 50, sd = 10)),
Height_cm = round(rnorm(n, mean = 170, sd = 10)),
Weight_kg = round(rnorm(n, mean = 70, sd = 15)),
Treatment = sample(c("Drug_A", "Drug_B", "Placebo"), n, replace = TRUE),
# Generate survival times from exponential distribution
Time_to_event = rexp(n, rate = 0.05),
# Randomly censor 30% of observations
Censored = rbinom(n, 1, 0.3)
)
# Create event indicator (1 = event occurred, 0 = censored)
surv_data$Event <- ifelse(surv_data$Censored == 1, 0, 1)
# Adjust time for censored observations (make them shorter)
surv_data$Time <- ifelse(surv_data$Censored == 1,
surv_data$Time_to_event * runif(n, 0.3, 0.8),
surv_data$Time_to_event)
# Ensure time is positive
surv_data$Time <- pmax(surv_data$Time, 0.1)
# Create survival object
surv_obj_full <- Surv(surv_data$Time, surv_data$Event)
# View structure
cat("First 10 observations:\n")
## First 10 observations:
print(head(surv_data, 10))
## ID Age Height_cm Weight_kg Treatment Time_to_event Censored Event Time
## 1 1 44 192 69 Drug_A 13.520928 0 1 13.520928
## 2 2 48 183 52 Drug_B 51.955049 0 1 51.955049
## 3 3 66 167 60 Drug_A 12.435579 1 0 7.452644
## 4 4 51 175 70 Drug_A 38.112723 0 1 38.112723
## 5 5 51 166 80 Drug_A 82.554753 0 1 82.554753
## 6 6 67 165 45 Drug_A 1.907929 0 1 1.907929
## 7 7 55 162 65 Drug_A 20.280953 0 1 20.280953
## 8 8 37 164 81 Drug_A 42.572799 0 1 42.572799
## 9 9 43 187 62 Placebo 9.321600 0 1 9.321600
## 10 10 46 169 73 Placebo 25.537471 0 1 25.537471
cat("\nDataset dimensions:", dim(surv_data))
##
## Dataset dimensions: 200 9
Definition: Censoring occurs when the outcome of interest (usually the time until an event) is not fully observed for a subject(s) during the study period. The event status is incomplete, but partial information is still available.
The main causes of Censoring are:
1. Study Design Constraints
2. Loss to Follow-Up
3. Competing Risks
4. Administrative Reasons
1. Right Censoring
Happens when the event has not occurred by the end of the study, or the subject is lost to follow-up.
Example: A patient in a 5-year cancer survival study is still alive at year 5, or drops out after year 3 while still alive.
2. Left Censoring
The event occurred before the subject entered the study, but the exact time is unknown.
Example: Measuring time until HIV seroconversion—a subject may already be HIV+ at enrollment, but we don’t know when infection happened.
3. Interval Censoring
The event occurred within a known time interval, but the exact time is unknown.
Example: In periodic medical check-ups, disease recurrence is detected between the 2nd and 3rd visit, but the exact date is unknown.
Survival analysis is built on Five fundamental functions:
The Survival Function
The hazard Function
The Cummulative Hazard Function
The Probability Density Function
The Kaplan Meir Estimator
1. Survival Function: S(t)
The survival function represents the probability that the event has not occurred by time \(t\): It is the probability that a subject survives beyond time 𝑡
\[S(t) = P(T > t)\]
where \(T\) is the random event time, with \(S(0) = 1\) and \(S(\infty) = 0\).
Properties of the survival function
i). At time zero (the start of the study or “birth”), no one has experienced the event yet. Therefore, the probability of surviving past time 0 is 100%.
\[S(0) = 1\] ii). The event (e.g., death or failure) is inevitable. As time approaches infinity, the probability of an individual still being “alive” or “intact” drops to zero.
\[S(\infty) = 0\] iii). Since \(S(t)\) measures the probability of surviving beyond time \(t\), it can never go up. You cannot “un-experience” a death or failure event once it happens.
\[S(t) = non-increasing\]
2. Hazard Function: h(t)
The instantaneous rate of the event occurring at time \(t\), given survival up to \(t\):
\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} = -\frac{d}{dt} \log S(t)\] The Hazard function is the “failure rate”—risk of the event given the subject has made it to \(t\).
Relationship with Survival Function
\[h(t) = \frac{f(t)}{S(t)}\]
3. Cumulative Hazard Function: H(t)
The cumulative hazard is: \[H(t) = \int_0^t h(u) du = -\log S(t)\]
Where
\(h(u)\) is the hazard rate at time \(u\)
Integration Sums the Hazard continuously over time
The cumulative hazard function \(H(t)\) represents the total amount of risk accumulated by time 𝑡
4.Probability Density Function \(f(t)\)
Defined as:
\[f(t) = \frac{d}{dt}[1 - S(t)]\]
Represents the instantaneous probability of experiencing the event at time \(t\).
5. Kaplan-Meier Estimator
Used to estimate \(S(t)\) from observed data:
The Kaplan–Meier estimator calculates the probability of surviving past time \(S(t)\) by:
Breaking time into event intervals
Multiplying conditional survival probabilities at each event time
It answers: “Given survival so far, what is the chance of surviving the next time point?”
\[\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)\]
Where:
# Create survival object
surv_obj <- Surv(surv_data$Time, surv_data$Event)
# Fit Kaplan-Meier
km_fit_basic <- survfit(surv_obj ~ 1)
# Plot survival curve
plot(km_fit_basic, main = "Kaplan-Meier Survival Curve",
xlab = "Time", ylab = "Survival Probability",
col = "blue", lwd = 2)
grid()
Graphical methods are central to survival analysis because they allow
visual assessment of time-to-event patterns, censoring, and group
differences.
1. Kaplan–Meier Survival Curve
What is plotted
X-axis: Time
Y-axis: Survival probability \({S}(t)\)
Key Features
Stepwise decreasing curve
Vertical drops at event times
Flat sections where no events occur
Tick marks (|) show censored observations
Interpretation
Steeper declines - Shows higher event rate
Higher curve - Shows better survival
Median survival time = time where \({S}(t)\) = 0.5
2. Hazard Function Plot
# Load required package
library(survival)
if (!require("muhaz")) {
install.packages("muhaz")
library(muhaz)
}
# surv_obj is created like Surv(time, event)
time <- surv_obj[, 1]
event <- surv_obj[, 2]
# Estimate hazard function
haz_est <- muhaz(time, event)
# Plot hazard function
plot(haz_est,
main = "Hazard Function Plot",
xlab = "Time", ylab = "Hazard Rate",
col = "green", lwd = 2)
grid()
What is plotted
X-axis: Time
Y-axis: Hazard rate \({h}(t)\)
Features
The plot Shows instantaneous risk
The plot line May be increasing, decreasing, or constant
Interpretation
Peaks indicate high-risk periods
Used in reliability and medical studies
3. Cumulative Hazard Plot
# Fit Kaplan-Meier
km_fit_basic <- survfit(surv_obj ~ 1)
# Plot cumulative hazard
plot(km_fit_basic, fun = "cumhaz",
main = "Cumulative Hazard Plot",
xlab = "Time", ylab = "Cumulative Hazard",
col = "green", lwd = 2)
grid()
What is plotted
X-axis: Time
Y-axis: Cumulative hazard H(t)
Features
Always increasing
Increases only at event times
Slope indicates hazard intensity
Interpretation
Straight line - Indicates Constant Hazard
Curving upward - Increasing risk
Flatter curve - Shows lower accumulated risk
# Fit Kaplan-Meier estimator (non-parametric)
km_fit <- survfit(surv_obj_full ~ 1, data = surv_data)
# Display summary of KM estimator
summary(km_fit)
## Call: survfit(formula = surv_obj_full ~ 1, data = surv_data)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0.100 200 1 0.9950 0.00499 0.98527 1.0000
## 0.378 197 1 0.9899 0.00707 0.97619 1.0000
## 0.506 196 1 0.9849 0.00865 0.96808 1.0000
## 0.739 195 1 0.9798 0.00997 0.96049 0.9996
## 0.858 193 1 0.9748 0.01114 0.95318 0.9969
## 0.897 192 1 0.9697 0.01218 0.94611 0.9939
## 1.024 191 1 0.9646 0.01314 0.93921 0.9907
## 1.155 189 1 0.9595 0.01402 0.93242 0.9874
## 1.382 187 1 0.9544 0.01486 0.92570 0.9839
## 1.554 186 1 0.9493 0.01564 0.91909 0.9804
## 1.765 184 1 0.9441 0.01638 0.91252 0.9768
## 1.775 183 1 0.9389 0.01709 0.90604 0.9730
## 1.908 181 1 0.9337 0.01776 0.89958 0.9692
## 2.115 179 1 0.9285 0.01841 0.89313 0.9653
## 2.266 177 1 0.9233 0.01904 0.88671 0.9614
## 2.333 176 1 0.9180 0.01964 0.88034 0.9574
## 2.353 175 1 0.9128 0.02022 0.87401 0.9533
## 2.591 173 1 0.9075 0.02078 0.86769 0.9492
## 3.286 171 1 0.9022 0.02132 0.86137 0.9450
## 3.388 170 1 0.8969 0.02185 0.85508 0.9408
## 3.655 168 1 0.8916 0.02236 0.84879 0.9365
## 4.045 167 1 0.8862 0.02286 0.84254 0.9322
## 4.145 166 1 0.8809 0.02333 0.83632 0.9278
## 4.331 165 1 0.8755 0.02379 0.83013 0.9234
## 4.362 163 1 0.8702 0.02425 0.82392 0.9190
## 4.470 162 1 0.8648 0.02469 0.81775 0.9146
## 4.534 160 1 0.8594 0.02512 0.81155 0.9101
## 4.703 159 1 0.8540 0.02553 0.80539 0.9055
## 4.913 157 1 0.8486 0.02594 0.79920 0.9010
## 5.030 156 1 0.8431 0.02634 0.79304 0.8964
## 5.100 155 1 0.8377 0.02673 0.78690 0.8917
## 5.105 154 1 0.8322 0.02710 0.78078 0.8871
## 5.223 153 1 0.8268 0.02746 0.77468 0.8824
## 5.337 152 1 0.8214 0.02782 0.76861 0.8777
## 5.567 150 1 0.8159 0.02817 0.76250 0.8730
## 5.683 148 1 0.8104 0.02851 0.75637 0.8682
## 5.698 147 1 0.8049 0.02884 0.75026 0.8634
## 5.754 146 1 0.7993 0.02917 0.74417 0.8586
## 5.789 145 1 0.7938 0.02948 0.73810 0.8538
## 6.148 143 1 0.7883 0.02980 0.73199 0.8489
## 6.175 142 1 0.7827 0.03010 0.72590 0.8440
## 6.223 141 1 0.7772 0.03039 0.71983 0.8391
## 6.326 138 1 0.7715 0.03069 0.71368 0.8341
## 6.473 137 1 0.7659 0.03098 0.70754 0.8291
## 6.767 136 1 0.7603 0.03126 0.70142 0.8241
## 6.988 135 1 0.7546 0.03153 0.69532 0.8190
## 7.053 134 1 0.7490 0.03179 0.68923 0.8140
## 7.074 133 1 0.7434 0.03205 0.68315 0.8089
## 7.303 132 1 0.7378 0.03230 0.67709 0.8039
## 8.076 130 1 0.7321 0.03254 0.67099 0.7987
## 8.213 129 1 0.7264 0.03278 0.66491 0.7936
## 8.234 128 1 0.7207 0.03301 0.65884 0.7884
## 8.521 124 1 0.7149 0.03326 0.65262 0.7832
## 8.647 122 1 0.7091 0.03350 0.64636 0.7778
## 8.860 120 1 0.7031 0.03373 0.64005 0.7725
## 8.889 119 1 0.6972 0.03396 0.63375 0.7671
## 9.322 118 1 0.6913 0.03419 0.62747 0.7617
## 9.683 116 1 0.6854 0.03441 0.62115 0.7562
## 9.811 114 1 0.6794 0.03463 0.61477 0.7507
## 9.844 113 1 0.6733 0.03484 0.60841 0.7452
## 10.941 111 1 0.6673 0.03505 0.60201 0.7396
## 11.198 110 1 0.6612 0.03525 0.59561 0.7340
## 11.287 109 1 0.6551 0.03544 0.58923 0.7284
## 11.407 108 1 0.6491 0.03563 0.58287 0.7228
## 11.495 107 1 0.6430 0.03581 0.57652 0.7172
## 11.498 106 1 0.6369 0.03598 0.57019 0.7115
## 11.685 105 1 0.6309 0.03615 0.56387 0.7059
## 12.554 101 1 0.6246 0.03633 0.55735 0.7001
## 12.675 100 1 0.6184 0.03650 0.55084 0.6942
## 12.950 99 1 0.6121 0.03666 0.54435 0.6884
## 13.163 98 1 0.6059 0.03681 0.53788 0.6825
## 13.448 96 1 0.5996 0.03697 0.53134 0.6766
## 13.521 95 1 0.5933 0.03711 0.52482 0.6707
## 13.834 94 1 0.5870 0.03725 0.51831 0.6647
## 13.947 93 1 0.5807 0.03738 0.51182 0.6587
## 14.299 92 1 0.5743 0.03750 0.50534 0.6528
## 14.463 91 1 0.5680 0.03762 0.49888 0.6468
## 14.657 90 1 0.5617 0.03773 0.49244 0.6407
## 15.023 88 1 0.5553 0.03783 0.48592 0.6347
## 15.683 87 1 0.5490 0.03793 0.47942 0.6286
## 16.020 85 1 0.5425 0.03803 0.47284 0.6224
## 16.094 84 1 0.5360 0.03812 0.46629 0.6162
## 16.596 82 1 0.5295 0.03822 0.45965 0.6100
## 16.655 81 1 0.5230 0.03830 0.45303 0.6037
## 16.777 80 1 0.5164 0.03837 0.44643 0.5974
## 16.944 79 1 0.5099 0.03844 0.43985 0.5911
## 17.373 78 1 0.5033 0.03850 0.43327 0.5848
## 18.696 75 1 0.4966 0.03857 0.42652 0.5783
## 19.343 73 1 0.4898 0.03863 0.41968 0.5717
## 19.730 71 1 0.4829 0.03870 0.41274 0.5651
## 19.771 70 1 0.4760 0.03876 0.40582 0.5584
## 19.776 69 1 0.4691 0.03881 0.39892 0.5517
## 20.089 68 1 0.4622 0.03884 0.39205 0.5450
## 20.176 67 1 0.4553 0.03887 0.38519 0.5383
## 20.281 66 1 0.4484 0.03889 0.37834 0.5315
## 20.352 65 1 0.4415 0.03890 0.37152 0.5248
## 20.438 63 1 0.4345 0.03891 0.36459 0.5179
## 21.523 61 1 0.4274 0.03892 0.35755 0.5109
## 21.589 60 1 0.4203 0.03891 0.35054 0.5039
## 22.321 59 1 0.4132 0.03890 0.34354 0.4969
## 22.367 58 1 0.4060 0.03888 0.33656 0.4899
## 22.556 57 1 0.3989 0.03884 0.32961 0.4828
## 22.843 56 1 0.3918 0.03880 0.32268 0.4757
## 23.336 54 1 0.3845 0.03875 0.31562 0.4685
## 23.764 53 1 0.3773 0.03869 0.30858 0.4613
## 24.050 52 1 0.3700 0.03862 0.30157 0.4540
## 24.480 51 1 0.3628 0.03854 0.29458 0.4468
## 25.071 50 1 0.3555 0.03845 0.28761 0.4395
## 25.089 49 1 0.3483 0.03834 0.28067 0.4321
## 25.326 48 1 0.3410 0.03822 0.27375 0.4248
## 25.537 47 1 0.3337 0.03809 0.26685 0.4174
## 25.569 46 1 0.3265 0.03795 0.25998 0.4100
## 25.868 45 1 0.3192 0.03779 0.25313 0.4026
## 27.705 44 1 0.3120 0.03762 0.24631 0.3952
## 28.213 43 1 0.3047 0.03744 0.23951 0.3877
## 28.865 42 1 0.2975 0.03725 0.23274 0.3802
## 30.041 41 1 0.2902 0.03704 0.22599 0.3727
## 31.663 38 1 0.2826 0.03684 0.21886 0.3649
## 32.568 37 1 0.2749 0.03663 0.21176 0.3570
## 33.548 36 1 0.2673 0.03640 0.20469 0.3491
## 34.210 35 1 0.2597 0.03615 0.19766 0.3411
## 35.198 34 1 0.2520 0.03589 0.19066 0.3332
## 36.260 33 1 0.2444 0.03560 0.18369 0.3252
## 36.290 32 1 0.2368 0.03530 0.17676 0.3171
## 37.357 31 1 0.2291 0.03498 0.16987 0.3090
## 38.113 30 1 0.2215 0.03463 0.16301 0.3009
## 38.354 29 1 0.2138 0.03427 0.15620 0.2928
## 38.643 28 1 0.2062 0.03389 0.14942 0.2846
## 40.764 27 1 0.1986 0.03348 0.14269 0.2763
## 41.975 25 1 0.1906 0.03307 0.13568 0.2678
## 42.224 24 1 0.1827 0.03263 0.12872 0.2593
## 42.573 23 1 0.1747 0.03217 0.12181 0.2507
## 42.624 22 1 0.1668 0.03167 0.11497 0.2420
## 43.107 21 1 0.1589 0.03114 0.10818 0.2333
## 44.726 20 1 0.1509 0.03058 0.10145 0.2245
## 45.809 19 1 0.1430 0.02999 0.09478 0.2157
## 47.939 18 1 0.1350 0.02935 0.08818 0.2068
## 48.811 17 1 0.1271 0.02868 0.08166 0.1978
## 49.835 16 1 0.1191 0.02797 0.07521 0.1887
## 51.955 15 1 0.1112 0.02721 0.06884 0.1796
## 55.912 14 1 0.1033 0.02640 0.06256 0.1704
## 61.031 11 1 0.0939 0.02561 0.05499 0.1602
## 67.452 9 1 0.0834 0.02480 0.04660 0.1494
## 69.279 8 1 0.0730 0.02379 0.03855 0.1383
## 72.872 7 1 0.0626 0.02256 0.03087 0.1269
## 75.927 6 1 0.0521 0.02108 0.02362 0.1151
## 79.877 5 1 0.0417 0.01927 0.01687 0.1032
## 82.555 4 1 0.0313 0.01704 0.01076 0.0910
## 83.337 3 1 0.0209 0.01420 0.00549 0.0792
## 83.806 2 1 0.0104 0.01024 0.00152 0.0714
## 89.176 1 1 0.0000 NaN NA NA
# The KM estimator is calculated as:
# S_hat(t) = product over (t_i <= t) of (1 - d_i/n_i)
# where d_i = number of events at time t_i
# n_i = number at risk just before t_i
# Loading the survival library
library(survival)
# Creating the survival object
surv_object <- Surv(surv_data$Time, surv_data$Event)
# Fitting Kaplan-Meier estimator
km_fit <- survfit(surv_object ~ 1)
# Display summary
summary(km_fit)
## Call: survfit(formula = surv_object ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0.100 200 1 0.9950 0.00499 0.98527 1.0000
## 0.378 197 1 0.9899 0.00707 0.97619 1.0000
## 0.506 196 1 0.9849 0.00865 0.96808 1.0000
## 0.739 195 1 0.9798 0.00997 0.96049 0.9996
## 0.858 193 1 0.9748 0.01114 0.95318 0.9969
## 0.897 192 1 0.9697 0.01218 0.94611 0.9939
## 1.024 191 1 0.9646 0.01314 0.93921 0.9907
## 1.155 189 1 0.9595 0.01402 0.93242 0.9874
## 1.382 187 1 0.9544 0.01486 0.92570 0.9839
## 1.554 186 1 0.9493 0.01564 0.91909 0.9804
## 1.765 184 1 0.9441 0.01638 0.91252 0.9768
## 1.775 183 1 0.9389 0.01709 0.90604 0.9730
## 1.908 181 1 0.9337 0.01776 0.89958 0.9692
## 2.115 179 1 0.9285 0.01841 0.89313 0.9653
## 2.266 177 1 0.9233 0.01904 0.88671 0.9614
## 2.333 176 1 0.9180 0.01964 0.88034 0.9574
## 2.353 175 1 0.9128 0.02022 0.87401 0.9533
## 2.591 173 1 0.9075 0.02078 0.86769 0.9492
## 3.286 171 1 0.9022 0.02132 0.86137 0.9450
## 3.388 170 1 0.8969 0.02185 0.85508 0.9408
## 3.655 168 1 0.8916 0.02236 0.84879 0.9365
## 4.045 167 1 0.8862 0.02286 0.84254 0.9322
## 4.145 166 1 0.8809 0.02333 0.83632 0.9278
## 4.331 165 1 0.8755 0.02379 0.83013 0.9234
## 4.362 163 1 0.8702 0.02425 0.82392 0.9190
## 4.470 162 1 0.8648 0.02469 0.81775 0.9146
## 4.534 160 1 0.8594 0.02512 0.81155 0.9101
## 4.703 159 1 0.8540 0.02553 0.80539 0.9055
## 4.913 157 1 0.8486 0.02594 0.79920 0.9010
## 5.030 156 1 0.8431 0.02634 0.79304 0.8964
## 5.100 155 1 0.8377 0.02673 0.78690 0.8917
## 5.105 154 1 0.8322 0.02710 0.78078 0.8871
## 5.223 153 1 0.8268 0.02746 0.77468 0.8824
## 5.337 152 1 0.8214 0.02782 0.76861 0.8777
## 5.567 150 1 0.8159 0.02817 0.76250 0.8730
## 5.683 148 1 0.8104 0.02851 0.75637 0.8682
## 5.698 147 1 0.8049 0.02884 0.75026 0.8634
## 5.754 146 1 0.7993 0.02917 0.74417 0.8586
## 5.789 145 1 0.7938 0.02948 0.73810 0.8538
## 6.148 143 1 0.7883 0.02980 0.73199 0.8489
## 6.175 142 1 0.7827 0.03010 0.72590 0.8440
## 6.223 141 1 0.7772 0.03039 0.71983 0.8391
## 6.326 138 1 0.7715 0.03069 0.71368 0.8341
## 6.473 137 1 0.7659 0.03098 0.70754 0.8291
## 6.767 136 1 0.7603 0.03126 0.70142 0.8241
## 6.988 135 1 0.7546 0.03153 0.69532 0.8190
## 7.053 134 1 0.7490 0.03179 0.68923 0.8140
## 7.074 133 1 0.7434 0.03205 0.68315 0.8089
## 7.303 132 1 0.7378 0.03230 0.67709 0.8039
## 8.076 130 1 0.7321 0.03254 0.67099 0.7987
## 8.213 129 1 0.7264 0.03278 0.66491 0.7936
## 8.234 128 1 0.7207 0.03301 0.65884 0.7884
## 8.521 124 1 0.7149 0.03326 0.65262 0.7832
## 8.647 122 1 0.7091 0.03350 0.64636 0.7778
## 8.860 120 1 0.7031 0.03373 0.64005 0.7725
## 8.889 119 1 0.6972 0.03396 0.63375 0.7671
## 9.322 118 1 0.6913 0.03419 0.62747 0.7617
## 9.683 116 1 0.6854 0.03441 0.62115 0.7562
## 9.811 114 1 0.6794 0.03463 0.61477 0.7507
## 9.844 113 1 0.6733 0.03484 0.60841 0.7452
## 10.941 111 1 0.6673 0.03505 0.60201 0.7396
## 11.198 110 1 0.6612 0.03525 0.59561 0.7340
## 11.287 109 1 0.6551 0.03544 0.58923 0.7284
## 11.407 108 1 0.6491 0.03563 0.58287 0.7228
## 11.495 107 1 0.6430 0.03581 0.57652 0.7172
## 11.498 106 1 0.6369 0.03598 0.57019 0.7115
## 11.685 105 1 0.6309 0.03615 0.56387 0.7059
## 12.554 101 1 0.6246 0.03633 0.55735 0.7001
## 12.675 100 1 0.6184 0.03650 0.55084 0.6942
## 12.950 99 1 0.6121 0.03666 0.54435 0.6884
## 13.163 98 1 0.6059 0.03681 0.53788 0.6825
## 13.448 96 1 0.5996 0.03697 0.53134 0.6766
## 13.521 95 1 0.5933 0.03711 0.52482 0.6707
## 13.834 94 1 0.5870 0.03725 0.51831 0.6647
## 13.947 93 1 0.5807 0.03738 0.51182 0.6587
## 14.299 92 1 0.5743 0.03750 0.50534 0.6528
## 14.463 91 1 0.5680 0.03762 0.49888 0.6468
## 14.657 90 1 0.5617 0.03773 0.49244 0.6407
## 15.023 88 1 0.5553 0.03783 0.48592 0.6347
## 15.683 87 1 0.5490 0.03793 0.47942 0.6286
## 16.020 85 1 0.5425 0.03803 0.47284 0.6224
## 16.094 84 1 0.5360 0.03812 0.46629 0.6162
## 16.596 82 1 0.5295 0.03822 0.45965 0.6100
## 16.655 81 1 0.5230 0.03830 0.45303 0.6037
## 16.777 80 1 0.5164 0.03837 0.44643 0.5974
## 16.944 79 1 0.5099 0.03844 0.43985 0.5911
## 17.373 78 1 0.5033 0.03850 0.43327 0.5848
## 18.696 75 1 0.4966 0.03857 0.42652 0.5783
## 19.343 73 1 0.4898 0.03863 0.41968 0.5717
## 19.730 71 1 0.4829 0.03870 0.41274 0.5651
## 19.771 70 1 0.4760 0.03876 0.40582 0.5584
## 19.776 69 1 0.4691 0.03881 0.39892 0.5517
## 20.089 68 1 0.4622 0.03884 0.39205 0.5450
## 20.176 67 1 0.4553 0.03887 0.38519 0.5383
## 20.281 66 1 0.4484 0.03889 0.37834 0.5315
## 20.352 65 1 0.4415 0.03890 0.37152 0.5248
## 20.438 63 1 0.4345 0.03891 0.36459 0.5179
## 21.523 61 1 0.4274 0.03892 0.35755 0.5109
## 21.589 60 1 0.4203 0.03891 0.35054 0.5039
## 22.321 59 1 0.4132 0.03890 0.34354 0.4969
## 22.367 58 1 0.4060 0.03888 0.33656 0.4899
## 22.556 57 1 0.3989 0.03884 0.32961 0.4828
## 22.843 56 1 0.3918 0.03880 0.32268 0.4757
## 23.336 54 1 0.3845 0.03875 0.31562 0.4685
## 23.764 53 1 0.3773 0.03869 0.30858 0.4613
## 24.050 52 1 0.3700 0.03862 0.30157 0.4540
## 24.480 51 1 0.3628 0.03854 0.29458 0.4468
## 25.071 50 1 0.3555 0.03845 0.28761 0.4395
## 25.089 49 1 0.3483 0.03834 0.28067 0.4321
## 25.326 48 1 0.3410 0.03822 0.27375 0.4248
## 25.537 47 1 0.3337 0.03809 0.26685 0.4174
## 25.569 46 1 0.3265 0.03795 0.25998 0.4100
## 25.868 45 1 0.3192 0.03779 0.25313 0.4026
## 27.705 44 1 0.3120 0.03762 0.24631 0.3952
## 28.213 43 1 0.3047 0.03744 0.23951 0.3877
## 28.865 42 1 0.2975 0.03725 0.23274 0.3802
## 30.041 41 1 0.2902 0.03704 0.22599 0.3727
## 31.663 38 1 0.2826 0.03684 0.21886 0.3649
## 32.568 37 1 0.2749 0.03663 0.21176 0.3570
## 33.548 36 1 0.2673 0.03640 0.20469 0.3491
## 34.210 35 1 0.2597 0.03615 0.19766 0.3411
## 35.198 34 1 0.2520 0.03589 0.19066 0.3332
## 36.260 33 1 0.2444 0.03560 0.18369 0.3252
## 36.290 32 1 0.2368 0.03530 0.17676 0.3171
## 37.357 31 1 0.2291 0.03498 0.16987 0.3090
## 38.113 30 1 0.2215 0.03463 0.16301 0.3009
## 38.354 29 1 0.2138 0.03427 0.15620 0.2928
## 38.643 28 1 0.2062 0.03389 0.14942 0.2846
## 40.764 27 1 0.1986 0.03348 0.14269 0.2763
## 41.975 25 1 0.1906 0.03307 0.13568 0.2678
## 42.224 24 1 0.1827 0.03263 0.12872 0.2593
## 42.573 23 1 0.1747 0.03217 0.12181 0.2507
## 42.624 22 1 0.1668 0.03167 0.11497 0.2420
## 43.107 21 1 0.1589 0.03114 0.10818 0.2333
## 44.726 20 1 0.1509 0.03058 0.10145 0.2245
## 45.809 19 1 0.1430 0.02999 0.09478 0.2157
## 47.939 18 1 0.1350 0.02935 0.08818 0.2068
## 48.811 17 1 0.1271 0.02868 0.08166 0.1978
## 49.835 16 1 0.1191 0.02797 0.07521 0.1887
## 51.955 15 1 0.1112 0.02721 0.06884 0.1796
## 55.912 14 1 0.1033 0.02640 0.06256 0.1704
## 61.031 11 1 0.0939 0.02561 0.05499 0.1602
## 67.452 9 1 0.0834 0.02480 0.04660 0.1494
## 69.279 8 1 0.0730 0.02379 0.03855 0.1383
## 72.872 7 1 0.0626 0.02256 0.03087 0.1269
## 75.927 6 1 0.0521 0.02108 0.02362 0.1151
## 79.877 5 1 0.0417 0.01927 0.01687 0.1032
## 82.555 4 1 0.0313 0.01704 0.01076 0.0910
## 83.337 3 1 0.0209 0.01420 0.00549 0.0792
## 83.806 2 1 0.0104 0.01024 0.00152 0.0714
## 89.176 1 1 0.0000 NaN NA NA
plot(km_fit, conf.int = FALSE,
main = "Kaplan-Meier Curve (No Confidence Interval)",
xlab = "Time", ylab = "Survival Probability",
col = "darkred", lwd = 2)
abline(h = seq(0, 1, 0.2), lty = 3, col = "gray")
plot(km_fit, conf.int = TRUE,
main = "Kaplan-Meier Curve with 95% Confidence Interval",
xlab = "Time", ylab = "Survival Probability",
col = "darkblue", lwd = 2)
legend("topright", legend = c("KM Estimate", "95% CI"),
col = c("darkblue", "gray"), lwd = c(2, 2), lty = c(1, 2))
plot(km_fit, mark.time = TRUE,
main = "Kaplan-Meier Curve with Censoring Marks",
xlab = "Time", ylab = "Survival Probability",
col = "darkgreen", lwd = 2)
# Add median survival time
abline(v = median(km_fit$time), lty = 2, col = "red")
text(x = median(km_fit$time), y = 0.5,
paste("Median =", round(median(km_fit$time), 2)),
pos = 4, col = "red")
The Kaplan-Meier survival curve (also known as the product-limit estimator) is a widely used non-parametric method to estimate the survival function \(S(t)\), which represents the probability of surviving beyond time t. It has several distinctive features that make it particularly suitable for analyzing time-to-event data with incomplete observations. The Kaplan-Meier survival curve has these features:
The curve is a decreasing step function that remains constant (horizontal) between successive event times and drops only at the exact times when events (e.g., deaths, failures, or relapses) occur. It does not decrease continuously or use sloping lines between point
The Kaplan-Meier estimator is right-continuous, meaning that at each event time t, the survival probability \(S(t)\) equals the limit of \(S(u)\) as u approaches t from the right. This convention reflects that the survival probability is defined just after the event has occurred.
Censoring indicators: Marks show censored observations
Small vertical tick marks or crosses/dots are typically plotted along the curve to indicate censored observations — individuals who are lost to follow-up, withdraw from the study, or remain event-free at the end of observation. These marks show that the subject survived at least up to that point, without causing a drop in the curve, but they reduce the number of individuals at risk for subsequent time points.
Plateaus: Periods with no events
The curve often shows flat, horizontal segments (plateaus) during periods when no events occur, even if some censoring happens in between. These plateaus indicate stable survival probability over those intervals and become more common with longer follow-up or smaller sample sizes.
Pointwise or banded confidence intervals (most commonly based on Greenwood’s formula for variance) are frequently added to the curve to illustrate the uncertainty in the survival estimates. Wider bands at later times reflect greater uncertainty due to fewer patients remaining at risk (increased censoring and fewer events).
A key summary statistic is the median survival time, defined as the smallest time t where the estimated survival probability \(S(t)\) drops to 0.5 (or ≤ 0.5). It is easily read from the curve by finding where it crosses the 50% horizontal line and dropping a vertical line to the time axis. If the curve never reaches 0.5 (e.g., due to heavy censoring), the median may not be estimable.
# Read the CSV file (saved as 'health_data.csv')
imported_data<- read.csv(file.choose(), header = TRUE)
attach(imported_data)
View(imported_data)
# View data
cat("Dataset dimensions:", dim(imported_data), "\n")
## Dataset dimensions: 200 9
cat("First few observations:\n")
## First few observations:
print(head(imported_data))
## ID Age Height_cm Weight_kg Treatment Time_to_event Censored Event Time
## 1 1 44 192 69 Drug_A 13.520928 0 1 13.520928
## 2 2 48 183 52 Drug_B 51.955049 0 1 51.955049
## 3 3 66 167 60 Drug_A 12.435579 1 0 7.452644
## 4 4 51 175 70 Drug_A 38.112723 0 1 38.112723
## 5 5 51 166 80 Drug_A 82.554753 0 1 82.554753
## 6 6 67 165 45 Drug_A 1.907929 0 1 1.907929
# Create survival object
surv_obj_imported <- Surv(imported_data$Time, imported_data$Event)
# Fit Kaplan-Meier by treatment group
km_fit_group <- survfit(surv_obj_imported ~ Treatment, data = imported_data)
# Plot by group
plot(km_fit_group, col = c("red", "blue", "green"),
main = "Kaplan-Meier Curves by Treatment Group",
xlab = "Time", ylab = "Survival Probability",
lwd = 2)
legend("topright", legend = levels(factor(imported_data$Treatment)),
col = c("red", "blue", "green"), lwd = 2)
plot(km_fit_group, conf.int = FALSE,
main = "Kaplan-Meier Curve (No Confidence Interval)",
xlab = "Time", ylab = "Survival Probability",
col = "darkred", lwd = 2)
abline(h = seq(0, 1, 0.2), lty = 3, col = "gray")
plot(km_fit_group, conf.int = TRUE,
main = "Kaplan-Meier Curve with 95% Confidence Interval",
xlab = "Time", ylab = "Survival Probability",
col = "darkblue", lwd = 2)
legend("topright", legend = c("KM Estimate", "95% CI"),
col = c("darkblue", "gray"), lwd = c(2, 2), lty = c(1, 2))
plot(km_fit_group, mark.time = TRUE,
main = "Kaplan-Meier Curve with Censoring Marks",
xlab = "Time", ylab = "Survival Probability",
col = "darkgreen", lwd = 2)
# Add median survival time
abline(v = median(km_fit_group$time), lty = 2, col = "red")
text(x = median(km_fit_group$time), y = 0.5,
paste("Median =", round(median(km_fit_group$time), 2)),
pos = 4, col = "red")
# Fit Cox proportional hazards model
cox_model_full <- coxph(surv_obj_imported ~ Age + Height_cm + Weight_kg + Treatment,
data = imported_data)
# Summary of the model
cox_summary <- summary(cox_model_full)
print(cox_summary)
## Call:
## coxph(formula = surv_obj_imported ~ Age + Height_cm + Weight_kg +
## Treatment, data = imported_data)
##
## n= 200, number of events= 151
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Age -0.011449 0.988616 0.009566 -1.197 0.231
## Height_cm -0.008332 0.991703 0.008985 -0.927 0.354
## Weight_kg 0.004624 1.004634 0.006090 0.759 0.448
## TreatmentDrug_B 0.030196 1.030656 0.216457 0.139 0.889
## TreatmentPlacebo 0.185164 1.203416 0.199225 0.929 0.353
##
## exp(coef) exp(-coef) lower .95 upper .95
## Age 0.9886 1.0115 0.9703 1.007
## Height_cm 0.9917 1.0084 0.9744 1.009
## Weight_kg 1.0046 0.9954 0.9927 1.017
## TreatmentDrug_B 1.0307 0.9703 0.6743 1.575
## TreatmentPlacebo 1.2034 0.8310 0.8144 1.778
##
## Concordance= 0.562 (se = 0.028 )
## Likelihood ratio test= 3.64 on 5 df, p=0.6
## Wald test = 3.64 on 5 df, p=0.6
## Score (logrank) test = 3.64 on 5 df, p=0.6
Model Overview
The Cox proportional hazards model examines how different variables affect survival time (time to event). The model fitted includes:
Age: Patient’s age
Height_cm: Height in centimeters
Weight_kg: Weight in kilograms
Treatment: Two treatment groups (Drug_B and Placebo)
Sample characteristics: n = 200 patients, 151 events observed
The Cox proportional hazards model assumes:
\[h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]
Where:
Hazard Ratios (HR): \(\exp(\beta)\)
1. Age (coef = -0.011449, HR = 0.989)
Interpretation: For each additional year of age, the hazard decreases by approximately 1.1% (HR = 0.989)
Statistical significance: Not statistically significant (p = 0.231).
95% CI: [0.970, 1.007] - includes 1, confirming non-significance
Practical meaning: - Age does not appear to meaningfully affect survival in this dataset
Statistical Significance: p-value < 0.05 indicates significant effect
Confidence Intervals: 95% CI for hazard ratios
2. Height_cm (coef = -0.008332, HR = 0.992)
Interpretation: For each additional centimeter of height, the hazard decreases by approximately 0.8% (HR = 0.992)
Statistical significance: Not statistically significant (p = 0.354)
95% CI: [0.974, 1.009] - includes 1
Practical meaning: Height is not a significant predictor of survival
3. Weight_kg (coef = 0.004624, HR = 1.005)
Interpretation: For each additional kilogram of weight, the hazard increases by approximately 0.5% (HR = 1.005)
Statistical significance: Not statistically significant (p = 0.448)
95% CI: [0.993, 1.017] - includes 1
Practical meaning: Weight does not significantly impact survival
4. TreatmentDrug_B (coef = 0.030196, HR = 1.031)
Interpretation: Patients receiving Drug B have a 3.1% higher hazard compared to the reference group (assumed to be standard treatment or control)
Statistical significance: Not statistically significant (p = 0.889)
95% CI: [0.674, 1.575] - very wide interval including 1
Practical meaning: Drug B shows no significant difference from the reference treatment
5. TreatmentPlacebo (coef = 0.185164, HR = 1.203)
Interpretation: Patients receiving placebo have a 20.3% higher hazard compared to the reference group
Statistical significance: Not statistically significant (p = 0.353)
95% CI: [0.814, 1.778] - includes 1
Practical meaning: While the hazard ratio suggests worse outcomes with placebo, this difference is not statistically significant
Concordance Index (C-index = 0.562)
Range: 0.5 to 1.0
Interpretation:
0.5 = random prediction (no better than chance)
1.0 = perfect prediction
The model: 0.562 indicates the model has very weak predictive ability, only slightly better than random chance
Practical meaning: The included variables do not effectively distinguish between patients with different survival outcomes
Likelihood Ratio Test
Test statistic: 3.64 on 5 degrees of freedom
p-value: 0.6
Interpretation: The model as a whole is not statistically significant. The predictors collectively do not improve upon a model with no predictors
Wald Test
Test statistic: 3.64 on 5 degrees of freedom
p-value: 0.6
Interpretation: Confirms the overall non-significance of the model
Score (Log-rank) Test
Test statistic: 3.64 on 5 degrees of freedom
p-value: 0.6
Interpretation: Another confirmation that the model lacks significant predictive power
# Check proportional hazards assumption
cox_zph_full <- cox.zph(cox_model_full)
print(cox_zph_full)
## chisq df p
## Age 4.35e-02 1 0.83
## Height_cm 4.84e-03 1 0.94
## Weight_kg 4.01e-05 1 0.99
## Treatment 1.49e+00 2 0.47
## GLOBAL 1.75e+00 5 0.88
# Plot Schoenfeld residuals
plot(cox_zph_full, var = "Age", main = "PH Assumption Check for Age")
abline(h = 0, lty = 2, col = "red")
Given example data: Time: \(2, 3^+, 6, 6, 7, 10^+, 15, 15, 16, 27, 30, 32\) where \(+\) indicates censored observations.
# Example data
time_example <- c(0, 2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, 32)
event_example <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1) # 0=censored, 1=event
# Sort data
sorted_data <- data.frame(time = time_example, event = event_example)
sorted_data <- sorted_data[order(sorted_data$time), ]
# Manual KM calculation
n <- length(time_example)
at_risk <- n:1
events <- sorted_data$event
# Calculate survival probabilities
survival_prob <- numeric(n)
survival_prob[1] <- 1 - events[1]/at_risk[1]
for(i in 2:n) {
survival_prob[i] <- survival_prob[i-1] * (1 - events[i]/at_risk[i])
}
# Create KM table
km_table <- data.frame(
Time = sorted_data$time,
n_risk = at_risk,
n_event = events,
Survival = round(survival_prob, 4)
)
print(km_table)
## Time n_risk n_event Survival
## 1 0 13 1 0.9231
## 2 2 12 1 0.8462
## 3 3 11 0 0.8462
## 4 6 10 1 0.7615
## 5 6 9 1 0.6769
## 6 7 8 1 0.5923
## 7 10 7 0 0.5923
## 8 15 6 1 0.4936
## 9 15 5 1 0.3949
## 10 16 4 1 0.2962
## 11 27 3 1 0.1974
## 12 30 2 1 0.0987
## 13 32 1 1 0.0000
# Plot survival probabilities
plot(km_table$Time, km_table$Survival, type = "s",
main = "Manual Kaplan-Meier Survival Curve",
xlab = "Time", ylab = "Survival Probability",
col = "purple", lwd = 2, ylim = c(0, 1))
points(km_table$Time, km_table$Survival, pch = 19)
# Add censored observations
censored_times <- sorted_data$time[sorted_data$event == 0]
censored_probs <- km_table$Survival[sorted_data$event == 0]
points(censored_times, censored_probs, pch = 3, col = "red", cex = 1.5)
abline(h = 0.5, lty = 2, col = "gray")
legend("topright", legend = c("Survival", "Censored"),
pch = c(19, 3), col = c("purple", "red"))
# Fit multiple linear regression model
# For demonstration, we'll use Time as the dependent variable
lm_model <- lm(Time ~ Age + Height_cm + Weight_kg, data = imported_data)
lm_summary <- summary(lm_model)
# Displaying results
cat("Multiple Linear Regression Results:\n")
## Multiple Linear Regression Results:
print(lm_summary)
##
## Call:
## lm(formula = Time ~ Age + Height_cm + Weight_kg, data = imported_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.685 -13.378 -7.327 5.977 69.735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.717462 25.836128 -0.028 0.978
## Age 0.048643 0.144121 0.338 0.736
## Height_cm 0.104590 0.136301 0.767 0.444
## Weight_kg -0.009338 0.093865 -0.099 0.921
##
## Residual standard error: 19.18 on 196 degrees of freedom
## Multiple R-squared: 0.00361, Adjusted R-squared: -0.01164
## F-statistic: 0.2367 on 3 and 196 DF, p-value: 0.8707
# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model)
par(mfrow = c(1, 1))
The fitted regression model is:
\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3\]
Where:
##ABOUT THIS DATASET This data has been downloaded from Kaggle. https://www.kaggle.com/code/saychakra/survival-analysis-of-lung-cancer-patients
This section performs survival analysis on the lung cancer dataset
(cancer.csv) using R.
##Loading Required Libraries
# Install packages if not already installed
if(!require("survival")) install.packages("survival")
if(!require("survminer")) install.packages("survminer")
if(!require("ggplot2")) install.packages("ggplot2")
if(!require("dplyr")) install.packages("dplyr")
if(!require("tidyr")) install.packages("tidyr")
if(!require("GGally")) install.packages("GGally")
if(!require("caret")) install.packages("caret")
if(!require("randomForestSRC")) install.packages("randomForestSRC")
# Load libraries
library(survival)
library(survminer)
library(ggplot2)
library(dplyr)
library(tidyr)
library(GGally)
library(caret)
library(randomForestSRC)
library(knitr)
##. Data Loading and Exploration for the Cancer data
# Load the lung cancer dataset
# Read the CSV file (saved as 'cancer.csv')
cancer_data<- read.csv(file.choose(), header = TRUE)
attach(cancer_data)
View(cancer_data)
#cancer_data <- read.csv("cancer.csv")
# Display basic information
cat("Dataset Dimensions:", dim(cancer_data), "\n")
## Dataset Dimensions: 228 11
cat("Column Names:", names(cancer_data), "\n\n")
## Column Names: X inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
# Display first few rows
cat("First 6 rows of the dataset:\n")
## First 6 rows of the dataset:
kable(head(cancer_data), caption = "First 6 Observations")
| X | inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
| 2 | 3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
| 3 | 3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
| 4 | 5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
| 5 | 1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
| 6 | 12 | 1022 | 1 | 74 | 1 | 1 | 50 | 80 | 513 | 0 |
# Structure of the data
str(cancer_data)
## 'data.frame': 228 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ inst : int 3 3 3 5 1 12 7 11 1 7 ...
## $ time : int 306 455 1010 210 883 1022 310 361 218 166 ...
## $ status : int 2 2 1 2 2 1 2 2 2 2 ...
## $ age : int 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : int 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : int 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : int 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: int 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : int 1175 1225 NA 1150 NA 513 384 538 825 271 ...
## $ wt.loss : int NA 15 15 11 0 0 10 1 16 34 ...
# Summary statistics
cat("\nSummary Statistics:\n")
##
## Summary Statistics:
summary(cancer_data)
## X inst time status
## Min. : 1.00 Min. : 1.00 Min. : 5.0 Min. :1.000
## 1st Qu.: 57.75 1st Qu.: 3.00 1st Qu.: 166.8 1st Qu.:1.000
## Median :114.50 Median :11.00 Median : 255.5 Median :2.000
## Mean :114.50 Mean :11.09 Mean : 305.2 Mean :1.724
## 3rd Qu.:171.25 3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:2.000
## Max. :228.00 Max. :33.00 Max. :1022.0 Max. :2.000
## NA's :1
## age sex ph.ecog ph.karno
## Min. :39.00 Min. :1.000 Min. :0.0000 Min. : 50.00
## 1st Qu.:56.00 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 75.00
## Median :63.00 Median :1.000 Median :1.0000 Median : 80.00
## Mean :62.45 Mean :1.395 Mean :0.9515 Mean : 81.94
## 3rd Qu.:69.00 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.: 90.00
## Max. :82.00 Max. :2.000 Max. :3.0000 Max. :100.00
## NA's :1 NA's :1
## pat.karno meal.cal wt.loss
## Min. : 30.00 Min. : 96.0 Min. :-24.000
## 1st Qu.: 70.00 1st Qu.: 635.0 1st Qu.: 0.000
## Median : 80.00 Median : 975.0 Median : 7.000
## Mean : 79.96 Mean : 928.8 Mean : 9.832
## 3rd Qu.: 90.00 3rd Qu.:1150.0 3rd Qu.: 15.750
## Max. :100.00 Max. :2600.0 Max. : 68.000
## NA's :3 NA's :47 NA's :14
##.Data Preprocessing for the cancer data
# Create a copy for analysis
df <- cancer_data
# Rename columns for clarity
df <- df %>%
rename(
institution = inst,
survival_time = time,
age_years = age,
performance_ecog = ph.ecog,
physician_karno = ph.karno,
patient_karno = pat.karno,
calories_per_day = meal.cal,
weight_loss = wt.loss
)
# Convert status to proper event indicator (2 = death, 1 = censored)
# For survival analysis: 1 = event (death), 0 = censored
df$event <- ifelse(df$status == 2, 1, 0)
# Convert sex to factor (1 = male, 2 = female)
df$sex <- factor(df$sex, levels = c(1, 2), labels = c("Male", "Female"))
# Create age groups
df$age_group <- cut(df$age_years,
breaks = c(0, 50, 60, 70, 100),
labels = c("<50", "50-59", "60-69", "70+"))
# Create performance status groups
df$ecog_group <- cut(df$performance_ecog,
breaks = c(-1, 0, 1, 2, 5),
labels = c("0", "1", "2", "3+"))
# Handle missing values
# Count missing values
cat("Missing values before imputation:\n")
## Missing values before imputation:
print(colSums(is.na(df)))
## X institution survival_time status
## 0 1 0 0
## age_years sex performance_ecog physician_karno
## 0 0 1 1
## patient_karno calories_per_day weight_loss event
## 3 47 14 0
## age_group ecog_group
## 0 1
# Impute missing values with median for numeric columns
numeric_cols <- c("performance_ecog", "physician_karno",
"patient_karno", "calories_per_day", "weight_loss")
for(col in numeric_cols) {
if(any(is.na(df[[col]]))) {
df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
}
# Check missing values after imputation
cat("\nMissing values after imputation:\n")
##
## Missing values after imputation:
print(colSums(is.na(df)))
## X institution survival_time status
## 0 1 0 0
## age_years sex performance_ecog physician_karno
## 0 0 0 0
## patient_karno calories_per_day weight_loss event
## 0 0 0 0
## age_group ecog_group
## 0 1
# Display cleaned data structure
cat("\nCleaned dataset dimensions:", dim(df), "\n")
##
## Cleaned dataset dimensions: 228 14
##. Exploratory Data Analysis on Cancer Data
# Create multiple histograms
par(mfrow = c(2, 3))
hist(df$age_years, main = "Age Distribution", xlab = "Age", col = "skyblue")
hist(df$survival_time, main = "Survival Time Distribution",
xlab = "Days", col = "lightgreen")
hist(df$performance_ecog, main = "ECOG Performance Status",
xlab = "Score", col = "lightcoral")
hist(df$physician_karno, main = "Physician Karnofsky Score",
xlab = "Score", col = "gold")
hist(df$weight_loss, main = "Weight Loss Distribution",
xlab = "Pounds", col = "plum")
par(mfrow = c(1, 1))
# Create a summary table for categorical variables
cat_table <- df %>%
group_by(sex, event) %>%
summarise(
count = n(),
mean_survival = mean(survival_time),
median_survival = median(survival_time),
.groups = 'drop'
)
cat("Summary by Sex and Event Status:\n")
## Summary by Sex and Event Status:
kable(cat_table, caption = "Summary Statistics by Sex and Event Status")
| sex | event | count | mean_survival | median_survival |
|---|---|---|---|---|
| Male | 0 | 26 | 372.1538 | 281.5 |
| Male | 1 | 112 | 262.5893 | 208.5 |
| Female | 0 | 37 | 357.3514 | 292.0 |
| Female | 1 | 53 | 326.1321 | 293.0 |
# Bar plots
p1 <- ggplot(df, aes(x = sex, fill = factor(event))) +
geom_bar(position = "dodge") +
labs(title = "Event Distribution by Sex",
x = "Sex", y = "Count", fill = "Event") +
theme_minimal() +
scale_fill_manual(values = c("blue", "red"),
labels = c("Censored", "Death"))
p2 <- ggplot(df, aes(x = age_group, fill = factor(event))) +
geom_bar(position = "dodge") +
labs(title = "Event Distribution by Age Group",
x = "Age Group", y = "Count", fill = "Event") +
theme_minimal() +
scale_fill_manual(values = c("blue", "red"),
labels = c("Censored", "Death"))
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
# Select numeric columns for correlation
numeric_df <- df %>%
select(survival_time, event, age_years, performance_ecog,
physician_karno, patient_karno, calories_per_day, weight_loss)
# Correlation matrix
cor_matrix <- cor(numeric_df, use = "complete.obs")
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "Correlation")
# Create correlation plot
ggplot(cor_df, aes(Var1, Var2, fill = Correlation)) +
geom_tile() +
geom_text(aes(label = round(Correlation, 2)), color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1)) +
labs(title = "Correlation Matrix", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Create survival object
surv_obj <- Surv(time = df$survival_time, event = df$event)
# Fit Kaplan-Meier estimator
km_fit <- survfit(surv_obj ~ 1)
# Plot overall survival curve
ggsurvplot(km_fit,
data = df,
risk.table = TRUE,
conf.int = TRUE,
ggtheme = theme_minimal(),
title = "Overall Kaplan-Meier Survival Curve",
xlab = "Time (days)",
ylab = "Survival Probability",
legend.title = "",
palette = "blue")
# Get summary statistics
km_summary <- summary(km_fit)
cat("Overall Survival Statistics:\n")
## Overall Survival Statistics:
cat("Median survival time:", km_fit$median, "days\n")
## Median survival time: days
cat("Mean survival time:", mean(km_fit$time), "days\n")
## Mean survival time: 327.5 days
cat("Number of events:", sum(km_fit$n.event), "\n")
## Number of events: 165
cat("Number censored:", sum(km_fit$n.censor), "\n")
## Number censored: 63
# Survival probabilities at specific time points
time_points <- c(100, 200, 300, 400, 500)
cat("\nSurvival probabilities at specific time points:\n")
##
## Survival probabilities at specific time points:
for(t in time_points) {
surv_prob <- summary(km_fit, times = t)$surv
cat(sprintf("At %d days: %.4f\n", t, surv_prob))
}
## At 100 days: 0.8640
## At 200 days: 0.6803
## At 300 days: 0.5306
## At 400 days: 0.3768
## At 500 days: 0.2933
# Kaplan-Meier by sex
km_sex <- survfit(surv_obj ~ sex, data = df)
# Plot
ggsurvplot(km_sex,
data = df,
risk.table = TRUE,
conf.int = TRUE,
pval = TRUE,
pval.method = TRUE,
ggtheme = theme_minimal(),
title = "Kaplan-Meier Survival Curves by Sex",
xlab = "Time (days)",
ylab = "Survival Probability",
legend.title = "Sex",
legend.labs = c("Male", "Female"),
palette = c("blue", "red"))
# Perform log-rank test
logrank_result <- survdiff(surv_obj ~ sex, data = df)
cat("Log-Rank Test Results (Sex Comparison):\n")
## Log-Rank Test Results (Sex Comparison):
print(logrank_result)
## Call:
## survdiff(formula = surv_obj ~ sex, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male 138 112 91.6 4.55 10.3
## sex=Female 90 53 73.4 5.68 10.3
##
## Chisq= 10.3 on 1 degrees of freedom, p= 0.001
# Extract p-value
logrank_p <- 1 - pchisq(logrank_result$chisq, length(logrank_result$n) - 1)
cat(sprintf("\nP-value: %.4f\n", logrank_p))
##
## P-value: 0.0013
if(logrank_p < 0.05) {
cat("Conclusion: Significant difference in survival between sexes\n")
} else {
cat("Conclusion: No significant difference in survival between sexes\n")
}
## Conclusion: Significant difference in survival between sexes
# Kaplan-Meier by age group
km_age <- survfit(surv_obj ~ age_group, data = df)
# Plot
ggsurvplot(km_age,
data = df,
risk.table = TRUE,
conf.int = FALSE,
pval = TRUE,
pval.method = TRUE,
ggtheme = theme_minimal(),
title = "Kaplan-Meier Survival Curves by Age Group",
xlab = "Time (days)",
ylab = "Survival Probability",
legend.title = "Age Group",
palette = "Set1")
# Kaplan-Meier by ECOG group
km_ecog <- survfit(surv_obj ~ ecog_group, data = df)
# Plot
ggsurvplot(km_ecog,
data = df,
risk.table = TRUE,
conf.int = FALSE,
pval = TRUE,
pval.method = TRUE,
ggtheme = theme_minimal(),
title = "Kaplan-Meier Survival Curves by ECOG Performance Status",
xlab = "Time (days)",
ylab = "Survival Probability",
legend.title = "ECOG Score",
palette = "Dark2")
# Prepare data for Cox model
cox_df <- df %>%
select(survival_time, event, age_years, sex, performance_ecog,
physician_karno, patient_karno, calories_per_day, weight_loss)
# Remove any remaining NA values
cox_df <- na.omit(cox_df)
# Fit Cox proportional hazards model
cox_model <- coxph(Surv(survival_time, event) ~
age_years + sex + performance_ecog +
physician_karno + patient_karno +
calories_per_day + weight_loss,
data = cox_df)
# Display model summary
cat("Cox Proportional Hazards Model Summary:\n\n")
## Cox Proportional Hazards Model Summary:
print(summary(cox_model))
## Call:
## coxph(formula = Surv(survival_time, event) ~ age_years + sex +
## performance_ecog + physician_karno + patient_karno + calories_per_day +
## weight_loss, data = cox_df)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age_years 1.239e-02 1.012e+00 9.446e-03 1.311 0.18977
## sexFemale -5.894e-01 5.547e-01 1.701e-01 -3.465 0.00053 ***
## performance_ecog 5.863e-01 1.797e+00 1.831e-01 3.202 0.00136 **
## physician_karno 1.374e-02 1.014e+00 9.384e-03 1.465 0.14300
## patient_karno -1.245e-02 9.876e-01 6.950e-03 -1.792 0.07320 .
## calories_per_day 3.298e-05 1.000e+00 2.410e-04 0.137 0.89115
## weight_loss -1.134e-02 9.887e-01 6.588e-03 -1.722 0.08509 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age_years 1.0125 0.9877 0.9939 1.0314
## sexFemale 0.5547 1.8028 0.3974 0.7742
## performance_ecog 1.7973 0.5564 1.2553 2.5731
## physician_karno 1.0138 0.9863 0.9954 1.0327
## patient_karno 0.9876 1.0125 0.9743 1.0012
## calories_per_day 1.0000 1.0000 0.9996 1.0005
## weight_loss 0.9887 1.0114 0.9760 1.0016
##
## Concordance= 0.65 (se = 0.024 )
## Likelihood ratio test= 37.48 on 7 df, p=4e-06
## Wald test = 36.75 on 7 df, p=5e-06
## Score (logrank) test = 37.76 on 7 df, p=3e-06
# Extract coefficients and confidence intervals
cox_summary <- summary(cox_model)
hazard_ratios <- data.frame(
Variable = rownames(cox_summary$coefficients),
HR = cox_summary$coefficients[, "exp(coef)"],
Lower = cox_summary$conf.int[, "lower .95"],
Upper = cox_summary$conf.int[, "upper .95"],
p_value = cox_summary$coefficients[, "Pr(>|z|)"]
)
# Remove intercept if present
hazard_ratios <- hazard_ratios[!grepl("Intercept", hazard_ratios$Variable),]
# Create forest plot
ggplot(hazard_ratios, aes(x = HR, y = reorder(Variable, HR))) +
geom_point(size = 3) +
geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = 0.2) +
geom_vline(xintercept = 1, linetype = "dashed", color = "red") +
scale_x_log10() +
labs(title = "Hazard Ratios with 95% Confidence Intervals",
x = "Hazard Ratio (log scale)",
y = "Variable") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Check proportional hazards assumption
cox_ph_test <- cox.zph(cox_model)
cat("Proportional Hazards Assumption Test:\n")
## Proportional Hazards Assumption Test:
print(cox_ph_test)
## chisq df p
## age_years 0.0564 1 0.8123
## sex 2.7016 1 0.1002
## performance_ecog 1.0437 1 0.3070
## physician_karno 5.1387 1 0.0234
## patient_karno 3.5380 1 0.0600
## calories_per_day 7.1009 1 0.0077
## weight_loss 0.2832 1 0.5946
## GLOBAL 15.3360 7 0.0319
# Plot Schoenfeld residuals
par(mfrow = c(3, 3))
plot(cox_ph_test, var = 1)
plot(cox_ph_test, var = 2)
plot(cox_ph_test, var = 3)
plot(cox_ph_test, var = 4)
plot(cox_ph_test, var = 5)
plot(cox_ph_test, var = 6)
plot(cox_ph_test, var = 7)
par(mfrow = c(1, 1))
# Test for influential observations
dfbeta <- residuals(cox_model, type = "dfbeta")
cat("\nInfluential Observations (Top 5 by absolute dfbeta):\n")
##
## Influential Observations (Top 5 by absolute dfbeta):
influential_idx <- order(abs(dfbeta[,1]), decreasing = TRUE)[1:5]
print(dfbeta[influential_idx,])
## [,1] [,2] [,3] [,4] [,5]
## 85 0.004692762 0.02665687 -0.009524579 0.0004768544 0.0001530322
## 129 -0.002399962 -0.03591095 -0.041732519 -0.0027043722 0.0008466951
## 33 0.002383475 0.01627415 -0.018684245 0.0014165391 -0.0006493623
## 89 0.001969821 -0.02054653 -0.003658999 -0.0005622521 -0.0003148215
## 6 -0.001865053 0.02644861 0.095979318 0.0078422270 -0.0010399387
## [,6] [,7]
## 85 9.056090e-06 0.001123855
## 129 -4.445420e-06 -0.001005656
## 33 4.602607e-06 0.001545555
## 89 2.859613e-05 -0.001473872
## 6 6.322200e-05 0.001349450
# Create new data for prediction
new_data <- data.frame(
age_years = c(50, 50, 70, 70),
sex = factor(c("Male", "Female", "Male", "Female")),
performance_ecog = c(0, 0, 2, 2),
physician_karno = c(90, 90, 60, 60),
patient_karno = c(90, 90, 60, 60),
calories_per_day = median(df$calories_per_day),
weight_loss = median(df$weight_loss)
)
# Predict survival curves
pred_surv <- survfit(cox_model, newdata = new_data)
# Plot predicted survival curves
plot(pred_surv,
col = c("blue", "red", "green", "purple"),
lwd = 2,
xlab = "Time (days)",
ylab = "Survival Probability",
main = "Predicted Survival Curves from Cox Model")
legend("topright",
legend = c("50M ECOG0", "50F ECOG0", "70M ECOG2", "70F ECOG2"),
col = c("blue", "red", "green", "purple"),
lwd = 2)
grid()
# Prepare data for Random Survival Forest
rsf_data <- df %>%
select(survival_time, event, age_years, sex, performance_ecog,
physician_karno, patient_karno, calories_per_day, weight_loss)
# Convert to numeric matrix (RSF requires numeric input)
rsf_data$sex <- as.numeric(rsf_data$sex) - 1 # Convert to 0/1
# Remove any NA values
rsf_data <- na.omit(rsf_data)
# Fit Random Survival Forest
set.seed(123)
rsf_model <- rfsrc(Surv(survival_time, event) ~ .,
data = rsf_data,
ntree = 500,
importance = TRUE,
splitrule = "logrank",
na.action = "na.omit")
# Summary
cat("Random Survival Forest Model Summary:\n")
## Random Survival Forest Model Summary:
print(rsf_model)
## Sample size: 228
## Number of deaths: 165
## Number of trees: 500
## Forest terminal node size: 15
## Average no. of terminal nodes: 12.446
## No. of variables tried at each split: 3
## Total no. of variables: 7
## Resampling used to grow trees: swor
## Resample size used to grow trees: 144
## Analysis: RSF
## Family: surv
## Splitting rule: logrank *random*
## Number of random split points: 10
## (OOB) CRPS: 139.87887465
## (OOB) standardized CRPS: 0.15841322
## (OOB) Requested performance error: 0.39207664
# Extract variable importance
vimp <- rsf_model$importance
vimp_df <- data.frame(
Variable = names(vimp),
Importance = vimp
) %>%
arrange(desc(Importance))
# Plot variable importance
ggplot(vimp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Random Survival Forest - Variable Importance",
x = "Variable",
y = "Importance") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Get predicted survival from RSF
pred_rsf <- predict(rsf_model, newdata = rsf_data)
# Plot survival curves for first 4 patients
plot(pred_rsf,
plots.one.page = FALSE,
show.plots = TRUE)
# Calculate concordance index for Cox model
cox_cindex <- summary(cox_model)$concordance[1]
# Calculate concordance index for RSF
rsf_cindex <- rsf_model$err.rate[rsf_model$ntree]
# Create comparison table
comparison_df <- data.frame(
Model = c("Cox Proportional Hazards", "Random Survival Forest"),
Concordance = c(round(cox_cindex, 3), round(1 - rsf_cindex, 3)),
AIC = c(round(AIC(cox_model), 1), NA),
BIC = c(round(BIC(cox_model), 1), NA)
)
cat("Model Performance Comparison:\n")
## Model Performance Comparison:
kable(comparison_df, caption = "Comparison of Survival Models")
| Model | Concordance | AIC | BIC | |
|---|---|---|---|---|
| C | Cox Proportional Hazards | 0.650 | 1476.3 | 1498.1 |
| Random Survival Forest | 0.608 | NA | NA |
# Load Required Libraries
# Set CRAN mirror again to be safe
options(repos = c(CRAN = "https://cloud.r-project.org"))
# List of required packages
required_packages <- c("tidyverse", "sf", "leaflet", "survival",
"survminer", "plotly", "viridis", "patchwork")
# Install and load packages - SIMPLE VERSION
for (pkg in required_packages) {
# Check if package is installed
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, dependencies = TRUE)
}
# Load the package
library(pkg, character.only = TRUE)
}
# Load the dataset
# Read the CSV file (saved as 'cancer.csv')
kenya_data<- read.csv(file.choose(), header = TRUE)
attach(kenya_data)
View(kenya_data)
#kenya_data <- read.csv("kenya_survival_dataset_47counties.csv")
# Display basic information
cat("Dataset Dimensions:", dim(kenya_data), "\n")
## Dataset Dimensions: 470 12
cat("Number of Counties:", length(unique(kenya_data$Region)), "\n")
## Number of Counties: 2
cat("Time Range:", range(kenya_data$Time), "\n")
## Time Range: 5 60
cat("Event Rate:", round(mean(kenya_data$Event) * 100, 1), "%\n")
## Event Rate: 55.7 %
# Display first few rows
cat("\nFirst 6 rows:\n")
##
## First 6 rows:
head(kenya_data)
# Clean and prepare the data
kenya_clean <- kenya_data %>%
mutate(
# Create proper event indicator (1 = event, 0 = censored)
event_status = ifelse(Event == 1, "Event", "Censored"),
# Convert Gender to factor
Gender = factor(Gender, levels = c("M", "F"), labels = c("Male", "Female")),
# Create age groups
age_group = cut(Age,
breaks = c(0, 18, 35, 50, 65, 100),
labels = c("0-18", "19-35", "36-50", "51-65", "66+")),
# Calculate BMI
BMI = Weight_kg / ((Height_cm/100)^2),
bmi_category = cut(BMI,
breaks = c(0, 18.5, 25, 30, Inf),
labels = c("Underweight", "Normal", "Overweight", "Obese")),
# Blood pressure categories
bp_category = cut(Blood_Pressure,
breaks = c(0, 90, 120, 130, 140, Inf),
labels = c("Low", "Normal", "Elevated", "Stage1", "Stage2")),
# Education level ordering
education_level = factor(Education_level,
levels = c("None", "Primary", "Secondary", "College", "University"),
ordered = TRUE)
)
# Check structure
glimpse(kenya_clean)
## Rows: 470
## Columns: 18
## $ Time <int> 43, 6, 26, 7, 18, 18, 22, 20, 33, 55, 45, 44, 30, 17,…
## $ Event <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ Patient_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ Age <int> 32, 47, 66, 24, 26, 67, 51, 70, 82, 52, 90, 52, 31, 6…
## $ Gender <fct> Male, Female, Male, Male, Female, Female, Female, Fem…
## $ Height_cm <int> 147, 141, 198, 148, 192, 143, 175, 165, 146, 172, 151…
## $ Weight_kg <int> 60, 99, 81, 78, 41, 41, 53, 64, 48, 98, 73, 76, 48, 4…
## $ Blood_Pressure <int> 103, 100, 139, 97, 139, 85, 127, 139, 80, 84, 112, 12…
## $ Disease_category <chr> "Malaria", "Stroke", "Malaria", "HIV", "HIV", "TB", "…
## $ Region <chr> "Rural", "Rural", "Rural", "Urban", "Urban", "Rural",…
## $ Country <chr> "Nairobi", "Nairobi", "Nairobi", "Nairobi", "Nairobi"…
## $ Education_level <chr> "University", "Primary", "College", "College", "Secon…
## $ event_status <chr> "Censored", "Censored", "Event", "Censored", "Censore…
## $ age_group <fct> 19-35, 36-50, 66+, 19-35, 19-35, 66+, 51-65, 66+, 66+…
## $ BMI <dbl> 27.76621, 49.79629, 20.66116, 35.60993, 11.12196, 20.…
## $ bmi_category <fct> Overweight, Obese, Normal, Obese, Underweight, Normal…
## $ bp_category <fct> Normal, Normal, Stage1, Normal, Stage1, Low, Elevated…
## $ education_level <ord> University, Primary, College, College, Secondary, Uni…
# Create a data frame with all 47 Kenyan county coordinates
county_coords <- data.frame(
Region = c("Nairobi", "Mombasa", "Kwale", "Kilifi", "Tana River",
"Lamu", "Taita-Taveta", "Garissa", "Wajir", "Mandera",
"Marsabit", "Isiolo", "Meru", "Tharaka-Nithi", "Embu",
"Kitui", "Machakos", "Makueni", "Nyandarua", "Nyeri",
"Kirinyaga", "Murang'a", "Kiambu", "Turkana", "West Pokot",
"Samburu", "Trans-Nzoia", "Uasin Gishu", "Elgeyo-Marakwet",
"Nandi", "Baringo", "Laikipia", "Nakuru", "Narok", "Kajiado",
"Kericho", "Bomet", "Kakamega", "Vihiga", "Bungoma",
"Busia", "Siaya", "Kisumu", "Homa-Bay", "Migori",
"Kisii", "Nyamira"),
lat = c(-1.2921, -4.0435, -4.1816, -3.5107, -1.5000,
-2.2696, -3.3960, -0.4536, 1.7500, 3.9360,
2.3346, 0.3556, 0.0500, -0.2962, -0.5319,
-1.3667, -1.5167, -2.2833, -0.5333, -0.4167,
-0.5000, -0.7833, -1.0333, 3.1167, 1.8833,
1.4500, 1.0333, 0.5167, 0.5167, 0.2000,
0.4667, 0.0667, -0.2833, -1.0833, -2.1000,
-0.3667, -0.7833, 0.2833, 0.1000, 0.5667,
0.4667, 0.0667, -0.1000, -0.5333, -1.0667,
-0.6833, -0.5667),
lng = c(36.8219, 39.6682, 39.4603, 39.9093, 39.5000,
40.9020, 38.3710, 39.6464, 40.0500, 41.8670,
37.9790, 37.5808, 37.6500, 37.9452, 37.4600,
38.0100, 37.2667, 37.8167, 36.4000, 36.9500,
37.3333, 37.0333, 36.8667, 35.6000, 35.1333,
36.6667, 34.9500, 35.2833, 35.5167, 35.1000,
35.9500, 36.6500, 36.0667, 35.8667, 36.7833,
35.2833, 35.3500, 34.7500, 34.7167, 34.5500,
34.1167, 34.3000, 34.7500, 34.4667, 34.4667,
34.7667, 34.9333)
)
# Merge with your county statistics
if(exists("county_stats")) {
map_data <- county_stats %>%
left_join(county_coords, by = "Region")
} else {
# If county_stats doesn't exist, use just coordinates
map_data <- county_coords %>%
mutate(n_patients = 10, n_events = 5, event_rate = 50) # dummy data
}
# Create the interactive map
library(leaflet)
leaflet(map_data) %>%
addTiles() %>%
setView(lng = 37.9062, lat = 0.0236, zoom = 6) %>%
addCircleMarkers(
lng = ~lng,
lat = ~lat,
radius = ~ifelse(exists("n_patients"), sqrt(n_patients)/3, 5),
color = ~ifelse(exists("event_rate"),
colorNumeric("Reds", event_rate)(event_rate),
"blue"),
fillOpacity = 0.7,
stroke = TRUE,
weight = 2,
label = ~paste(
"<strong>", Region, "</strong><br>",
ifelse(exists("n_patients"), paste("Patients: ", n_patients, "<br>"), ""),
ifelse(exists("n_events"), paste("Events: ", n_events, "<br>"), ""),
ifelse(exists("event_rate"), paste("Event Rate: ", event_rate, "%"), "")
),
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "12px",
direction = "auto"
)
) %>%
addControl(
"Kenya County Survival Analysis",
position = "topright"
) %>%
addMiniMap(
toggleDisplay = TRUE,
position = "bottomleft"
)