---
title: "Lung Cancer Dataset Analysis"
author: "NAIMUR RAHMAN"
date: "2025-09-12"
output: html_document
---
This report provides univariate and bivariate descriptive statistics of the `lung` dataset from the `survival` package, along with interpretations.
library(survival)
library(tableone)
## Load Data
# Load dataset
lung <- survival::lung
# See structure
str(lung)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
# First few rows
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
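Because `sex` and `status` are stored as numeric codes, a labelled copy can make later tables and plots easier to read. The sketch below assumes the coding used in this report (sex: 1 = male, 2 = female; status: 1 = censored, 2 = dead). The names `lung_lab`, `sex_f`, and `event` are illustrative and not part of the original analysis.

```r
library(survival)

# Work on a copy so the original data frame is left untouched
lung_lab <- survival::lung

# Assumed coding: 1 = Male, 2 = Female
lung_lab$sex_f <- factor(lung_lab$sex, levels = c(1, 2),
                         labels = c("Male", "Female"))

# Assumed coding: 1 = censored, 2 = dead; event is TRUE for deaths
lung_lab$event <- lung_lab$status == 2

# Counts per labelled level and number of missing values per column
table(lung_lab$sex_f)
colSums(is.na(lung_lab))
```

Keeping the numeric columns alongside the labelled ones means the later calls in this report still run unchanged.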
## Univariate Analysis
# Age
summary(lung$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.00 56.00 63.00 62.45 69.00 82.00
# Sex
table(lung$sex)
##
## 1 2
## 138 90
prop.table(table(lung$sex))*100
##
## 1 2
## 60.52632 39.47368
# Survival time
summary(lung$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 166.8 255.5 305.2 396.5 1022.0
# Status (1=censored, 2=dead)
table(lung$status)
##
## 1 2
## 63 165
# ECOG performance score
summary(lung$ph.ecog)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 1.0000 0.9515 1.0000 3.0000 1
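Interpretation:
1. The mean age is 62.5 years (median 63, range 39 to 82), so the cohort consists mostly of older adults.
2. About 61% of patients are male (138 of 228) and 39% are female (90 of 228).
3. The median recorded time is 255.5 days (mean 305.2). Because 63 of the 228 patients are censored, this raw median understates survival; the Kaplan–Meier estimate of median survival for this dataset is about 310 days, which reflects the poor prognosis of advanced lung cancer.
4. The median ECOG performance status is 1 (ambulatory, restricted in strenuous activity), with scores ranging from 0 to 3 and one missing value.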
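The raw summary of `time` treats censored follow-up as if it were an observed survival time. A minimal sketch of a censoring-aware estimate of median survival, using `survfit()` from the survival package, might look like this. The object names `km_all` and `km_median` are illustrative.

```r
library(survival)

# Kaplan-Meier fit for the whole cohort; status == 2 marks a death
km_all <- survfit(Surv(time, status == 2) ~ 1, data = survival::lung)

# The summary table holds the number of events, the median and its limits
km_median <- summary(km_all)$table[["median"]]
km_median

# Raw median of the recorded times, which ignores censoring
median(survival::lung$time)
```

Comparing the two values shows how much the raw median is pulled down by patients who were still alive at their last follow-up.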
## Bivariate Analysis
# Age by Sex (t-test)
t.test(age ~ sex, data=lung)
##
## Welch Two Sample t-test
##
## data: age by sex
## t = 1.8632, df = 194.72, p-value = 0.06394
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.1324347 4.6580386
## sample estimates:
## mean in group 1 mean in group 2
## 63.34058 61.07778
# Survival time by Sex (t-test)
t.test(time ~ sex, data=lung)
##
## Welch Two Sample t-test
##
## data: time by sex
## t = -1.9843, df = 196.51, p-value = 0.04861
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -111.1266705 -0.3428947
## sample estimates:
## mean in group 1 mean in group 2
## 283.2319 338.9667
# Age by ECOG score (ANOVA); ph.ecog is numeric here, so this tests a linear trend (1 df)
anova_result <- aov(age ~ ph.ecog, data=lung)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## ph.ecog 1 698 697.6 8.727 0.00347 **
## Residuals 225 17985 79.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
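The single degree of freedom for `ph.ecog` in the table above indicates that the score entered the model as a numeric predictor. If the aim is to compare mean age across ECOG categories, one option is to convert the score to a factor first. This is a sketch rather than the report's original method; `lung_aov`, `ecog_f`, and `anova_cat` are illustrative names.

```r
library(survival)

lung_aov <- survival::lung

# Treat ECOG score as categorical so each level gets its own mean
lung_aov$ecog_f <- factor(lung_aov$ph.ecog)

anova_cat <- aov(age ~ ecog_f, data = lung_aov)
summary(anova_cat)

# Group sizes: very small groups make the comparison fragile
table(lung_aov$ecog_f)
```

With four ECOG levels the factor term uses three degrees of freedom, so the F test no longer assumes that age changes linearly with the score.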
# Sex vs ECOG score (Chi-square)
chisq.test(table(lung$sex, lung$ph.ecog))
## Warning in chisq.test(table(lung$sex, lung$ph.ecog)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(lung$sex, lung$ph.ecog)
## X-squared = 1.3341, df = 3, p-value = 0.7211
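The warning above is issued when expected cell counts are small. A hedged sketch of how one might inspect the expected counts and fall back on Fisher's exact test is shown below. The names `tab`, `expected`, and `fisher_res` are illustrative.

```r
library(survival)

tab <- table(sex = survival::lung$sex, ecog = survival::lung$ph.ecog)

# Expected counts under independence; values below 5 trigger the warning
expected <- suppressWarnings(chisq.test(tab))$expected
round(expected, 1)

# Fisher's exact test does not rely on the large-sample approximation
fisher_res <- fisher.test(tab)
fisher_res$p.value
```

An alternative with the same purpose would be to merge the sparsely populated ECOG levels before running the chi-squared test.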
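Interpretation:
1. There is no significant age difference between males and females at the 5% level (mean 63.3 vs 61.1 years, p = 0.064).
2. Mean recorded time is shorter for males than for females (283 vs 339 days, p = 0.049). The result is borderline, and the t-test treats censored times as if they were observed survival times, so a log-rank test is the more appropriate comparison.
3. Age is associated with ECOG score (F = 8.73, p = 0.0035). Because `ph.ecog` is numeric, `aov()` fits a single linear trend (1 df) rather than comparing the four ECOG categories.
4. The distribution of ECOG scores does not differ significantly between males and females (X-squared = 1.33, df = 3, p = 0.72). The warning indicates that at least one expected cell count is small, so the chi-squared approximation may be unreliable.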
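A t-test on `time` ignores censoring. A minimal sketch of a censoring-aware comparison of survival by sex, using the log-rank test from `survdiff()`, is given below. The names `lr` and `lr_p` are illustrative.

```r
library(survival)

# Log-rank test comparing survival curves by sex; status == 2 marks a death
lr <- survdiff(Surv(time, status == 2) ~ sex, data = survival::lung)
lr

# p-value from the chi-squared statistic (df = number of groups - 1)
lr_p <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
lr_p
```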
library(knitr)
# Frequency of sex
sex_tab <- table(lung$sex)
kable(sex_tab, caption = "Distribution of Patients by Sex")
| Var1 | Freq |
|------|------|
| 1    | 138  |
| 2    | 90   |
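The table above covers a single variable. The report loads `tableone` but does not call it; a stratified descriptive table could be sketched as follows. The choice of variables and the names `lung_t1`, `vars`, and `tab1` are assumptions made for illustration.

```r
library(survival)
library(tableone)

lung_t1 <- survival::lung
vars <- c("age", "time", "ph.ecog", "ph.karno", "wt.loss")

# Summarise the listed variables by sex; ph.ecog is treated as categorical
tab1 <- CreateTableOne(vars = vars, strata = "sex", data = lung_t1,
                       factorVars = "ph.ecog")
print(tab1, showAllLevels = TRUE)
```

The p-values printed by `tableone` come from default tests that ignore censoring, so the row for `time` should be read as descriptive only.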
hist(lung$age,
main = "Histogram of Age",
xlab = "Age (years)",
col = "lightblue",
border = "black")
boxplot(time ~ sex, data = lung,
main = "Survival Time by Sex",
xlab = "Sex (1=Male, 2=Female)",
ylab = "Survival Time (days)",
col = c("orange", "lightgreen"))
## Bar Plot of ECOG Performance Status
barplot(table(lung$ph.ecog),
main = "ECOG Performance Status",
xlab = "ECOG Score",
ylab = "Number of Patients",
col = "purple")
## Kaplan–Meier Survival Curve
library(survival)
library(survminer) # better survival plots
## Loading required package: ggplot2
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
##
## myeloma
fit <- survfit(Surv(time, status==2) ~ sex, data = lung)
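ggsurvplot(fit, data = lung,
  pval = TRUE,        # show the log-rank p-value on the plot
  risk.table = TRUE,  # add the number-at-risk table below the curves
  conf.int = TRUE,    # draw confidence bands around each curve
  legend.title = "Sex",
  legend.labs = c("Male", "Female"),
  xlab = "Time (days)",
  ylab = "Survival probability")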
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
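The curves can be complemented with the estimated median survival in each group. The sketch below refits the same model under a separate name so it can run on its own; `fit_sex` and `km_tab` are illustrative names.

```r
library(survival)

fit_sex <- survfit(Surv(time, status == 2) ~ sex, data = survival::lung)

# One row per stratum ("sex=1", "sex=2") with events and median survival
km_tab <- summary(fit_sex)$table
km_tab[, c("events", "median")]
```

The median for each stratum is the time at which its Kaplan-Meier curve first drops to 0.5 or below.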
Interpretations:
1. The lung dataset represents patients with advanced lung cancer, mostly older adults (mean age about 62) and predominantly male (about 61%).
2. Survival outcomes are poor, with median survival under one year.
3. Performance status (ECOG) is generally good (mostly 0–1), consistent with patients being well enough to participate in a clinical trial.
4. Sex differences are notable: females show longer survival, which may have biological or treatment-related explanations.
5. These findings make the dataset well suited for practicing survival analysis (Cox regression, Kaplan–Meier, etc.).
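As a possible next step, a Cox regression could adjust the sex comparison for age and ECOG score. This is a sketch under assumptions, not part of the original analysis: the covariate set is chosen for illustration, `sex` and `ph.ecog` enter as numeric codes, and `cox_fit` and `hr` are illustrative names.

```r
library(survival)

# Cox proportional hazards model; status == 2 marks a death
cox_fit <- coxph(Surv(time, status == 2) ~ sex + age + ph.ecog,
                 data = survival::lung)

# Hazard ratios with 95% confidence intervals
hr <- exp(cbind(HR = coef(cox_fit), confint(cox_fit)))
round(hr, 2)

# Check the proportional hazards assumption via scaled Schoenfeld residuals
cox.zph(cox_fit)
```

Because `sex` is coded 1 = male and 2 = female, its hazard ratio compares females with males; a value below 1 indicates a lower hazard of death for females.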