Biostatistic-2,-Assignment-1--11.09.25-.knit

Load Required Packages

library(survival)
library(gtsummary)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(survminer)

## Loading required package: ggplot2

## Loading required package: ggpubr

## 
## Attaching package: 'survminer'

## The following object is masked from 'package:survival':
## 
##     myeloma

Load the ovarian Cancer dataset

# In recent versions of survival, 'ovarian' is stored in the 'cancer' data file,
# so data("ovarian") warns that the data set is not found
data(cancer, package = "survival")

head(ovarian)

##   futime fustat     age resid.ds rx ecog.ps
## 1     59      1 72.3315        2  1       1
## 2    115      1 74.4932        2  1       1
## 3    156      1 66.4658        2  1       2
## 4    421      0 53.3644        2  2       1
## 5    431      1 50.3397        2  1       1
## 6    448      0 56.4301        1  1       2

Interpretation 1 (variables), using the descriptive statistics table reported further down:

1. futime (follow-up time in days): mean 600 ± 340 days (about 1.6 years), so follow-up varies widely between patients.
2. fustat (event indicator: 1 = death, 0 = censored): 12 of 26 patients (46%) died; the other 14 (54%) were censored, that is, still alive at last follow-up.
3. age (years): mean 56 ± 10 years.
4. resid.ds (residual disease after surgery; 1 = no residual, 2 = residual present): 15 of 26 patients (58%) had residual disease.
5. rx (treatment group; 1 = standard, 2 = experimental): 13 patients (50%) in each group.
6. ecog.ps (performance status; 1 = good, 2 = poor): 14 patients (54%) had good status and 12 (46%) had poor status.

Overall insight: the patients were on average middle-aged women (mean age 56 years). Just under half died during follow-up and the rest were censored, so methods that handle censoring (survival analysis) are appropriate. A small majority had residual disease after surgery, the two treatment groups are the same size, and slightly more than half had good functional status (ECOG = 1).
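The summary figures for each variable can be checked directly against the data. The following is a minimal sketch, assuming the `ovarian` data frame that ships with the survival package; the names `ov`, `num_summary` and `event_rate` are illustrative and do not appear in the original analysis.

```r
library(survival)

# 'ovarian' is lazy-loaded with the survival package
ov <- survival::ovarian

# Minimum, median, mean and maximum of the two continuous variables
num_summary <- sapply(ov[, c("futime", "age")], function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x))
})
print(round(num_summary, 1))

# Proportion of patients who experienced the event (fustat == 1)
event_rate <- mean(ov$fustat == 1)
print(round(event_rate, 2))
```

Comparing this output with the written interpretation is a quick way to catch figures that do not match the data.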

Dataset Overview

Structure of dataset

str(ovarian)

## 'data.frame':    26 obs. of  6 variables:
##  $ futime  : num  59 115 156 421 431 448 464 475 477 563 ...
##  $ fustat  : num  1 1 1 0 1 0 1 1 0 1 ...
##  $ age     : num  72.3 74.5 66.5 53.4 50.3 ...
##  $ resid.ds: num  2 2 2 2 2 1 2 2 2 1 ...
##  $ rx      : num  1 1 1 2 1 1 2 2 1 2 ...
##  $ ecog.ps : num  1 1 2 1 1 2 2 2 1 2 ...

Interpretation 2:

1. The dataset has 26 patients (rows) and 6 variables (columns).
2. Two variables are continuous (age, futime) and four are binary categorical (fustat, resid.ds, rx, ecog.ps). All six are stored as numeric, so the categorical ones are numeric codes rather than factors.
3. It is a classic survival dataset, commonly used to illustrate Kaplan–Meier curves and Cox regression.
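Because the categorical variables are stored as numeric codes, it can help to convert them to factors before tabulating or modelling. This is a sketch only: the object name `ov_labelled` and the label text are assumptions chosen for illustration, and `fustat` is left numeric because `Surv()` expects a numeric or logical event indicator.

```r
library(survival)
library(dplyr)

# Recode the numeric 1/2 codes as labelled factors (labels are illustrative)
ov_labelled <- survival::ovarian %>%
  mutate(
    resid.ds = factor(resid.ds, levels = c(1, 2), labels = c("No", "Yes")),
    rx       = factor(rx,       levels = c(1, 2), labels = c("Group 1", "Group 2")),
    ecog.ps  = factor(ecog.ps,  levels = c(1, 2), labels = c("1", "2"))
  )

# Confirm the new variable types
str(ov_labelled)
```

With factors, summary tables and regression output show the labels instead of the codes 1 and 2.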

Descriptive statistics of all variables

ovarian %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ± {sd}",
      all_categorical() ~ "{n} ({p}%)"
    ),
    missing = "no"
  ) %>%
  bold_labels()

Characteristic	N = 26¹
futime	600 ± 340
fustat	12 (46%)
age	56 ± 10
resid.ds
1	11 (42%)
2	15 (58%)
rx
1	13 (50%)
2	13 (50%)
ecog.ps
1	14 (54%)
2	12 (46%)
¹ Mean ± SD; n (%)

Interpretation 3:

1. The cohort is small (n = 26), with a mean age of 56 ± 10 years.
2. 12 patients (46%) died during follow-up; the other 14 (54%) were censored.
3. Residual disease after surgery is frequent (15 patients, 58%), which is a negative prognostic factor.
4. The two treatment groups are the same size (13 patients each).
5. Performance status is good in 14 patients (54%) and poor in 12 (46%).

Univariate Analysis

Descriptive (Univariate Summary)

ovarian %>%
  select(age, resid.ds, rx, ecog.ps) %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ "{mean} ± {sd}",
      all_categorical() ~ "{n} ({p}%)"
    )
  ) %>%
  bold_labels()

Characteristic	N = 26¹
age	56 ± 10
resid.ds
1	11 (42%)
2	15 (58%)
rx
1	13 (50%)
2	13 (50%)
ecog.ps
1	14 (54%)
2	12 (46%)
¹ Mean ± SD; n (%)

Interpretation 4:

1. The mean age is 56 ± 10 years.
2. Residual disease after surgery is common (15 patients, 58%).
3. Patients are split evenly between the two treatment groups (13 each).
4. Slightly more than half have good performance status (54%), and a large minority have poor status (46%).
5. This table is suitable for presenting baseline characteristics before any survival or comparative analysis.

Univariate Survival Analysis (Cox Regression)

Hazard ratios for each predictor separately

Load packages broom and broom.helpers

library(broom)
library(broom.helpers)

## 
## Attaching package: 'broom.helpers'

## The following objects are masked from 'package:gtsummary':
## 
##     all_categorical, all_continuous, all_contrasts, all_dichotomous,
##     all_interaction, all_intercepts

ovarian %>%
  tbl_uvregression(
    y = Surv(futime, fustat),
    method = coxph,
    exponentiate = TRUE
  ) %>%
  bold_labels()

Characteristic	N	HR	95% CI	p-value
age	26	1.18	1.07, 1.30	0.001
resid.ds	26	3.35	0.90, 12.5	0.072
rx	26	0.55	0.17, 1.74	0.3
ecog.ps	26	1.49	0.47, 4.70	0.5
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

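Interpretation 5:

1. Age: HR 1.18 per year (95% CI 1.07, 1.30; p = 0.001). Each additional year of age is associated with roughly an 18% higher hazard of death. This is the only predictor that is significant at the 0.05 level.
2. Residual disease: HR 3.35 (95% CI 0.90, 12.5; p = 0.072). The estimate points toward worse survival with residual tumor, but the confidence interval includes 1, so the association is not statistically significant here.
3. Treatment group: HR 0.55 (95% CI 0.17, 1.74; p = 0.3). No significant difference between standard and experimental treatment in this cohort.
4. ECOG performance status: HR 1.49 (95% CI 0.47, 4.70; p = 0.5). Poor functional status trends toward worse survival but is far from significant; with only 12 events, power is low.
5. Overall: this analysis looks at each variable on its own. Only age is significantly associated with survival; residual disease is borderline. For a complete picture, the predictors should be analyzed together in a multivariable Cox model to adjust for confounding.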
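The multivariable Cox model mentioned in the overall interpretation could be sketched as follows. This is not part of the original analysis; the names `fit_mv` and `hr_table` are illustrative, and with only 12 events a model with four predictors is likely to be overfitted, so the output should be read with caution.

```r
library(survival)

# Multivariable Cox model: all four predictors entered together
fit_mv <- coxph(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps,
                data = survival::ovarian)

# Hazard ratios with 95% confidence intervals (exponentiated coefficients)
hr_table <- cbind(HR = exp(coef(fit_mv)), exp(confint(fit_mv)))
print(round(hr_table, 2))
```

Each row gives the hazard ratio for one predictor adjusted for the other three, which is what distinguishes it from the univariate table.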

Bivariate Analysis

ovarian %>%
  select(age, resid.ds, rx, ecog.ps, fustat) %>%
  tbl_summary(by = fustat,
              statistic = list(
                all_continuous() ~ "{mean} ± {sd}",
                all_categorical() ~ "{n} ({p}%)")) %>%
  add_p() %>%
  bold_labels()

Characteristic	0 N = 14¹	1 N = 12¹	p-value²
age	52 ± 8	61 ± 10	0.015
resid.ds			0.10
1	8 (57%)	3 (25%)
2	6 (43%)	9 (75%)
rx			0.4
1	6 (43%)	7 (58%)
2	8 (57%)	5 (42%)
ecog.ps			0.2
1	9 (64%)	5 (42%)
2	5 (36%)	7 (58%)
¹ Mean ± SD; n (%)
² Wilcoxon rank sum exact test; Pearson’s Chi-squared test

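Interpretation 6:

1. Age is the only variable that differs significantly by outcome: patients who died were older (61 ± 10 years) than those who were censored (52 ± 8 years), p = 0.015.
2. Residual disease was more common among patients who died (75% vs 43%), but the difference is not statistically significant (p = 0.10). Poor performance status shows the same direction (58% vs 36%, p = 0.2).
3. Treatment group does not differ significantly by outcome (p = 0.4).
4. This comparison ignores follow-up time and censoring, so it is descriptive only; the Cox and Kaplan–Meier analyses are the appropriate tests of association with survival.
5. The table is suitable for presenting baseline characteristics by outcome and for flagging factors that may predict survival.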
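With cell counts this small, Fisher's exact test is a common cross-check on the chi-squared p-value. The sketch below rebuilds the residual disease by outcome table from the counts printed above, so it runs without loading any data; the object names are illustrative.

```r
# Counts copied from the bivariate table:
# rows = resid.ds (1, 2), columns = fustat (0 = censored, 1 = died)
resid_by_status <- matrix(
  c(8, 6, 3, 9),            # filled column by column
  nrow = 2,
  dimnames = list(resid.ds = c("1", "2"), fustat = c("0", "1"))
)
print(resid_by_status)

# Fisher's exact test (two-sided) on the 2 x 2 table
fisher_res <- fisher.test(resid_by_status)
print(fisher_res$p.value)
```

In gtsummary the same idea can be applied inside `add_p()` through its `test` argument, for example `test = all_categorical() ~ "fisher.test"`.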

Kaplan–Meier Curves (Visualization)

KM survival curve by treatment group

fit1 <- survfit(Surv(futime, fustat) ~ rx, data = ovarian)

ggsurvplot(fit1, data = ovarian,
           pval = TRUE,
           risk.table = TRUE,
           surv.median.line = "hv",
           legend.title = "Treatment",
           legend.labs = c("Group 1", "Group 2"),
           palette = "Dark2")

Interpretation 7:

1. If the curves overlap and the log-rank p-value is 0.05 or higher, there is no significant difference in survival between the two treatment groups. This would be consistent with the univariate Cox result above (HR 0.55, p = 0.3).
2. If one curve drops faster, that group has the higher risk of death.
3. The risk table shows how few patients remain at later time points, where the estimates become less reliable.
4. The median survival lines let you report: "Median survival was X days in Group 1 and Y days in Group 2." If a curve never falls to 50%, the median for that group is not reached.
5. The Kaplan–Meier plot provides a visual and statistical comparison of survival between treatment groups. It complements the univariate and bivariate tables and is often included in the results section of survival studies.
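The p-value that `ggsurvplot()` prints with `pval = TRUE` comes from a log-rank test, which can also be run on its own. A minimal sketch, assuming the same `ovarian` data; `lr_rx` and `p_rx` are illustrative names.

```r
library(survival)

# Log-rank test comparing survival between the two treatment groups
lr_rx <- survdiff(Surv(futime, fustat) ~ rx, data = survival::ovarian)
print(lr_rx)

# p-value from the chi-squared statistic (groups - 1 degrees of freedom)
p_rx <- 1 - pchisq(lr_rx$chisq, df = length(lr_rx$n) - 1)
print(round(p_rx, 3))
```

Reporting the test statistic and p-value from `survdiff()` alongside the plot makes the comparison reproducible without reading numbers off the figure.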

KM survival curve by residual disease

fit2 <- survfit(Surv(futime, fustat) ~ resid.ds, data = ovarian)

ggsurvplot(fit2, data = ovarian,
           pval = TRUE,
           risk.table = TRUE,
           surv.median.line = "hv",
           legend.title = "Residual Disease",
           legend.labs = c("None", "Residual"),
           palette = "Set1")

Interpretation 8:

1. Patients without residual disease (resid.ds = 1) are expected to have a higher survival probability over time.
2. Patients with residual disease (resid.ds = 2) are expected to have poorer survival, with a curve that drops faster.
3. Median survival is expected to be longer for patients without residual disease, and may not be reached in that group if fewer than half of them die.
4. The log-rank p-value on the plot decides whether the difference is statistically significant. The univariate Cox result above (HR 3.35, 95% CI 0.90, 12.5; p = 0.072) suggests the difference is borderline rather than clearly significant in this sample.
5. The risk table shows how many patients remain in each group at later follow-up times.
6. Reporting template, to be used only if the log-rank p-value is below 0.05: "Kaplan–Meier analysis stratified by residual disease demonstrated that patients with residual disease after surgery had significantly lower survival compared to those with no residual disease. The median survival was longer in patients without residual disease, and the log-rank test indicated a statistically significant difference between the two groups (p < 0.05)." If the p-value is 0.05 or higher, report the difference as a non-significant trend instead.
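Median survival per group can be read from the fitted object rather than from the plot. The sketch below refits the residual disease model under an illustrative name (`fit_resid`) so that it is self-contained.

```r
library(survival)

# Kaplan-Meier fit stratified by residual disease
fit_resid <- survfit(Surv(futime, fustat) ~ resid.ds,
                     data = survival::ovarian)

# One row per stratum; 'median' is NA when the curve never drops to 0.5
med_tab <- summary(fit_resid)$table[, c("events", "median")]
print(med_tab)
```

An `NA` median means fewer than half of the patients in that group died during follow-up, which should be reported as "median not reached" rather than as a number.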

Summary of Results:

1. Univariate descriptive: provides the distribution of age, treatment and disease status.
2. Bivariate analysis: compares variables across survival status (died vs censored).
3. Cox regression: estimates a hazard ratio (HR) for each predictor.
4. Kaplan–Meier curves: show survival differences visually across treatment groups or residual disease status.