Survival Time and Status:
The time column records follow-up time in days since the operation. It is a survival time for patients who died and a censoring time for everyone else.
The status column indicates the patient’s status at the end of the study (1 = died from melanoma, 2 = alive, 3 = died from other causes).
Analyzing the distribution of follow-up times together with status can provide insights into the overall prognosis of melanoma patients.
Demographics:
The sex column (1 = male, 0 = female) and the age column can be used to analyze the distribution of melanoma cases by sex and age.
This shows which age groups and which sex are most represented among the patients in this dataset.
Tumor Characteristics:
The thickness column indicates the thickness of the tumor in millimeters, which is a critical factor in melanoma prognosis.
The ulcer column (1 = presence of ulceration, 0 = absence) is another important prognostic factor.
Analyzing the relationship between tumor thickness, ulceration, and survival outcomes can provide valuable insights into disease severity and progression.
Temporal Trends:
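The year column indicates the year of the operation (1962 to 1977 in this dataset), which can be used to analyze trends over time.
Because the data contain only treated patients, they can show whether patient characteristics or survival changed over the years, but not whether the incidence of melanoma changed.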
Correlation Analysis:
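Correlation analysis between variables such as age, tumor thickness, ulceration, and follow-up time can reveal relationships worth testing formally.
For example, thicker tumors and the presence of ulceration might be associated with shorter survival times.
Only 57 of the 205 patients died from melanoma, so most recorded times are censored, and correlations involving time are descriptive rather than estimates of the effect on survival.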
Survival Analysis:
Kaplan-Meier survival curves can be plotted to estimate the survival function based on different factors such as tumor thickness, ulceration status, and age groups.
Cox proportional hazards models can be used to assess the impact of various factors on survival time.
Comparative Analysis:
Predictive Modeling:
Machine learning models can be built to predict survival outcomes based on patient characteristics and tumor features.
This can help in identifying high-risk patients and tailoring treatment strategies accordingly.
By conducting these analyses, you can gain a comprehensive understanding of the factors influencing melanoma prognosis and identify potential areas for further research or clinical intervention.
First, load the dataset into R and inspect its structure.
# Load necessary libraries
library(survival)
library(tidyverse)  # attaches dplyr and ggplot2, so a separate library(dplyr) is not needed
# Load the dataset
melanoma <- read.csv("melanoma.csv")
# Inspect the dataset
head(melanoma)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
str(melanoma)
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : int 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : int 1 0 0 0 1 1 1 1 1 1 ...
summary(melanoma)
## time status sex age year
## Min. : 10 Min. :1.00 Min. :0.0000 Min. : 4.00 Min. :1962
## 1st Qu.:1525 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:42.00 1st Qu.:1968
## Median :2005 Median :2.00 Median :0.0000 Median :54.00 Median :1970
## Mean :2153 Mean :1.79 Mean :0.3854 Mean :52.46 Mean :1970
## 3rd Qu.:3042 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:65.00 3rd Qu.:1972
## Max. :5565 Max. :3.00 Max. :1.0000 Max. :95.00 Max. :1977
## thickness ulcer
## Min. : 0.10 Min. :0.000
## 1st Qu.: 0.97 1st Qu.:0.000
## Median : 1.94 Median :0.000
## Mean : 2.92 Mean :0.439
## 3rd Qu.: 3.56 3rd Qu.:1.000
## Max. :17.42 Max. :1.000
Check for missing values and convert categorical variables to factors.
# Check for missing values
sum(is.na(melanoma))
## [1] 0
# Convert categorical variables to factors
melanoma$sex <- factor(melanoma$sex, levels = c(0, 1), labels = c("Female", "Male"))
melanoma$ulcer <- factor(melanoma$ulcer, levels = c(0, 1), labels = c("No Ulcer", "Ulcer"))
melanoma$status <- factor(melanoma$status, levels = c(1, 2, 3), labels = c("Died from Melanoma", "Alive", "Died from Other Causes"))
# Alternative with dplyr (run instead of the factor() calls above; yields character columns):
# melanoma <- melanoma %>%
#   mutate(ulcer = ifelse(ulcer == 1, "Ulcer", "No_Ulcer"),
#          status = case_when(status == 1 ~ "Died_of_melanoma",
#                             status == 2 ~ "Alive",
#                             TRUE ~ "Died_Other_Causes"))
# Inspect the cleaned dataset
str(melanoma)
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : Factor w/ 3 levels "Died from Melanoma",..: 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 1 2 1 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : Factor w/ 2 levels "No Ulcer","Ulcer": 2 1 1 1 2 2 2 2 2 2 ...
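Before modeling, it helps to confirm how many patients fall into each status level, since only one of the three levels counts as an event in the survival models. The sketch below is a minimal check under the assumption that `boot::melanoma` (shipped with R's recommended `boot` package) contains the same data as `melanoma.csv`; the name `mel` is used only for this illustration.

```r
# Assumption: boot::melanoma holds the same 205 rows as melanoma.csv
mel <- boot::melanoma
mel$status <- factor(mel$status, levels = c(1, 2, 3),
                     labels = c("Died from Melanoma", "Alive", "Died from Other Causes"))

# Number of patients per outcome
status_counts <- table(mel$status)
print(status_counts)

# Share of patients censored when death from melanoma is the event of interest
censored_share <- mean(mel$status != "Died from Melanoma")
print(round(censored_share, 3))
```

The count for "Died from Melanoma" should agree with the number of events reported by the Cox model.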
Perform exploratory analysis to understand the distribution of variables.
# Age distribution
ggplot(melanoma, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "red") +
  labs(title = "Age Distribution of Melanoma Patients", x = "Age", y = "Count")
# Tumor thickness distribution
ggplot(melanoma, aes(x = thickness)) +
  geom_histogram(binwidth = 1, fill = "red", color = "blue") +
  labs(title = "Distribution of Tumor Thickness", x = "Thickness (mm)", y = "Count")
# Ulceration status counts
ggplot(melanoma, aes(x = ulcer)) +
  geom_bar(fill = "green") +
  labs(title = "Ulceration Status", x = "Ulceration", y = "Count")
# Follow-up time distribution (includes censored patients)
ggplot(melanoma, aes(x = time)) +
  geom_histogram(binwidth = 100, fill = "purple", color = "black") +
  labs(title = "Follow-up Time Distribution", x = "Follow-up Time (Days)", y = "Count")
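The correlation analysis mentioned in the overview can be sketched with a rank-based correlation matrix of the numeric columns. This is an illustrative sketch, assuming `boot::melanoma` matches `melanoma.csv`. Spearman correlation is chosen here because tumor thickness is right-skewed (median 1.94 mm against a maximum of 17.42 mm); that choice is an assumption of this example, not something the analysis above prescribes.

```r
# Assumption: boot::melanoma holds the same data as melanoma.csv
mel <- boot::melanoma

# Numeric columns only; time is follow-up time and is censored for most patients
num_vars <- mel[, c("age", "thickness", "year", "time")]

# Spearman (rank) correlation is less sensitive to the skew in thickness
cor_mat <- cor(num_vars, method = "spearman")
print(round(cor_mat, 2))
```

Correlations with `time` should be read with care: patients operated on later in the study necessarily have shorter follow-up, so `year` and `time` are linked by design.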
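Perform survival analysis using Kaplan-Meier curves and Cox proportional hazards models. Death from melanoma is the event of interest; patients who were alive at the end of the study or who died from other causes are treated as censored.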
# Fit Kaplan-Meier survival model
library(survminer)
km_fit <- survfit(Surv(time, status == "Died from Melanoma") ~ 1, data = melanoma)
# Plot Kaplan-Meier curve
ggsurvplot(km_fit, data = melanoma,
           title = "Kaplan-Meier Survival Curve",
           xlab = "Time (Days)",
           ylab = "Survival Probability")
# Fit and plot Kaplan-Meier curves stratified by ulceration status
km_fit_ulcer <- survfit(Surv(time, status == "Died from Melanoma") ~ ulcer, data = melanoma)
ggsurvplot(km_fit_ulcer, data = melanoma,
           title = "Kaplan-Meier Survival Curve by Ulceration Status",
           xlab = "Time (Days)",
           ylab = "Survival Probability",
           legend.title = "Ulceration",
           legend.labs = c("No Ulcer", "Ulcer"))
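The difference between the two ulceration curves can be checked formally with a log-rank test. The sketch below uses `survdiff()` from the `survival` package and assumes that `boot::melanoma` matches `melanoma.csv`; with the document's own `melanoma` data frame the event would be written as `status == "Died from Melanoma"` instead of `status == 1`.

```r
library(survival)

# Assumption: boot::melanoma holds the same data as melanoma.csv
mel <- boot::melanoma

# Log-rank test: melanoma death (status == 1) is the event, everything else is censored
lr <- survdiff(Surv(time, status == 1) ~ ulcer, data = mel)
print(lr)

# p-value from the chi-square statistic, with (number of groups - 1) degrees of freedom
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(p_value)
```

This is an unadjusted comparison; the Cox model is what accounts for age, sex, and thickness.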
# Fit Cox model
cox_model <- coxph(Surv(time, status == "Died from Melanoma") ~ age + sex + thickness + ulcer, data = melanoma)
# Summarize the model
summary(cox_model)
## Call:
## coxph(formula = Surv(time, status == "Died from Melanoma") ~
## age + sex + thickness + ulcer, data = melanoma)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.012198 1.012273 0.008297 1.470 0.14150
## sexMale 0.432817 1.541594 0.267410 1.619 0.10554
## thickness 0.108945 1.115101 0.037734 2.887 0.00389 **
## ulcerUlcer 1.164479 3.204253 0.309751 3.759 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.012 0.9879 0.9959 1.029
## sexMale 1.542 0.6487 0.9127 2.604
## thickness 1.115 0.8968 1.0356 1.201
## ulcerUlcer 3.204 0.3121 1.7461 5.880
##
## Concordance= 0.753 (se = 0.033 )
## Likelihood ratio test= 41.62 on 4 df, p=2e-08
## Wald test = 39.42 on 4 df, p=6e-08
## Score (logrank) test = 46.67 on 4 df, p=2e-09
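The hazard ratios above are meaningful only if the proportional hazards assumption roughly holds. One common check is `cox.zph()`, which tests for a trend in the scaled Schoenfeld residuals over time. The sketch below refits the same model on `boot::melanoma`, assumed to match `melanoma.csv`; the names `mel`, `cox_fit`, and `ph_check` are illustrative.

```r
library(survival)

# Assumption: boot::melanoma holds the same data as melanoma.csv
mel <- boot::melanoma
mel$sex   <- factor(mel$sex,   levels = c(0, 1), labels = c("Female", "Male"))
mel$ulcer <- factor(mel$ulcer, levels = c(0, 1), labels = c("No Ulcer", "Ulcer"))

# Same covariates as the Cox model above; status == 1 is death from melanoma
cox_fit <- coxph(Surv(time, status == 1) ~ age + sex + thickness + ulcer, data = mel)

# One test per term plus a global test; small p-values suggest non-proportional hazards
ph_check <- cox.zph(cox_fit)
print(ph_check)
```

If a term shows a small p-value, `plot(ph_check)` can help show how its effect changes over follow-up.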
Based on the analysis, here are the insights and recommendations:
Age Distribution:
Tumor Thickness:
Ulceration Status:
Survival Analysis:
The Kaplan-Meier curve shows that survival probability decreases over time, with a steep drop in the first 1000 days.
Patients with ulcerated tumors have significantly worse survival outcomes compared to those without ulceration.
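Cox Model:
Ulceration (HR 3.20, 95% CI 1.75 to 5.88, p = 0.00017) and tumor thickness (HR 1.12 per additional millimeter, 95% CI 1.04 to 1.20, p = 0.0039) are significantly associated with death from melanoma.
Age (HR 1.01 per year, 95% CI 1.00 to 1.03, p = 0.14) and male sex (HR 1.54, 95% CI 0.91 to 2.60, p = 0.11) are not statistically significant at the 5% level.
The model's concordance is 0.753 (se = 0.033).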
Early Detection:
Targeted Treatment:
Public Awareness:
Further Research:
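As a starting point for the predictive modeling mentioned in the overview, a Cox model can be fit on one part of the data and scored on the rest. The following is a rough sketch, not a validated risk model: the 70/30 split, the seed, and the names `train`, `test`, and `c_index` are arbitrary choices for illustration, and `boot::melanoma` is assumed to match `melanoma.csv`.

```r
library(survival)

set.seed(42)  # arbitrary seed so the split is reproducible

# Assumption: boot::melanoma holds the same data as melanoma.csv
mel <- boot::melanoma
mel$event <- as.integer(mel$status == 1)  # 1 = died from melanoma, 0 = censored

# Hypothetical 70/30 train/test split
test_idx <- sample(nrow(mel), size = round(0.3 * nrow(mel)))
train <- mel[-test_idx, ]
test  <- mel[test_idx, ]

# Fit on the training set only
fit <- coxph(Surv(time, event) ~ age + sex + thickness + ulcer, data = train)

# Linear predictor as a risk score for held-out patients (higher = higher hazard)
test$risk <- predict(fit, newdata = test, type = "lp")

# Out-of-sample concordance; reverse = TRUE because higher risk means shorter survival
c_index <- concordance(Surv(time, event) ~ risk, data = test, reverse = TRUE)
print(c_index$concordance)
```

With only 57 events, a single split gives a noisy estimate; repeated splits or cross-validation would be more stable.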
Save the cleaned dataset and analysis results for future reference.
# Save cleaned dataset
write.csv(melanoma, "melanoma_cleaned.csv", row.names = FALSE)
# Save Cox model results
sink("cox_model_summary.txt")
print(summary(cox_model))  # explicit print so the output is written even when the script is sourced
sink()
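A CSV file does not preserve factor levels, and `sink()` stores only printed text. If the objects themselves are needed later, `saveRDS()` is one option. The sketch below is illustrative: it writes to `tempdir()`, the file names are hypothetical, and `boot::melanoma` is assumed to match `melanoma.csv`.

```r
library(survival)

# Assumption: boot::melanoma holds the same data as melanoma.csv
mel <- boot::melanoma
mel$ulcer <- factor(mel$ulcer, levels = c(0, 1), labels = c("No Ulcer", "Ulcer"))
cox_fit <- coxph(Surv(time, status == 1) ~ thickness + ulcer, data = mel)

# Hypothetical output location; replace with a project directory as needed
out_dir  <- tempdir()
rds_path <- file.path(out_dir, "melanoma_analysis.rds")
txt_path <- file.path(out_dir, "cox_model_summary.txt")

# Store the data and the fitted model as R objects (factor levels are preserved)
saveRDS(list(data = mel, cox = cox_fit), rds_path)

# Write the printed summary without having to open and close a sink
writeLines(capture.output(print(summary(cox_fit))), txt_path)

# Reload and confirm the column types survived the round trip
restored <- readRDS(rds_path)
print(is.factor(restored$data$ulcer))
```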