Melanoma Patient Data Analysis Insights

  1. Survival Time and Status:

    • The time column records follow-up time in days; for patients who were still alive at the end of the study it is a censored survival time.

    • The status column indicates the patient’s status at the end of the study (1 = died from melanoma, 2 = alive, 3 = died from other causes).

    • Analyzing the distribution of survival times can provide insights into the overall prognosis of melanoma patients.

  2. Demographics:

    • The sex column (1 = male, 0 = female) and age column can be used to analyze the distribution of melanoma cases by gender and age.

    • Insights into whether certain age groups or genders are more affected by melanoma can be derived.

  3. Tumor Characteristics:

    • The thickness column indicates the thickness of the tumor in millimeters, which is a critical factor in melanoma prognosis.

    • The ulcer column (1 = presence of ulceration, 0 = absence) is another important prognostic factor.

    • Analyzing the relationship between tumor thickness, ulceration, and survival outcomes can provide valuable insights into disease severity and progression.

  4. Temporal Trends:

    • The year column indicates the year of diagnosis, which can be used to analyze trends over time.

    • Because the data set contains only diagnosed patients, it can show whether survival or tumor characteristics differ by year of diagnosis, but not how melanoma incidence changed in the population.

  5. Correlation Analysis:

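    • Correlation analysis between variables such as age, tumor thickness, and ulceration can reveal relationships among the predictors. Correlations with survival time need care, because the time is censored for patients who did not die from melanoma.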

    • For example, thicker tumors and the presence of ulceration might be associated with shorter survival times.

  6. Survival Analysis:

    • Kaplan-Meier survival curves can be plotted to estimate the survival function within groups defined by ulceration status, age group, or tumor thickness category.

    • Cox proportional hazards models can be used to assess the impact of various factors on survival time.

  7. Comparative Analysis:

    • Comparing survival outcomes between different subgroups (e.g., males vs. females, different age groups, presence vs. absence of ulceration) can highlight significant differences in prognosis.

  8. Predictive Modeling:

    • Machine learning models can be built to predict survival outcomes based on patient characteristics and tumor features.

    • This can help in identifying high-risk patients and tailoring treatment strategies accordingly.

By conducting these analyses, you can gain a comprehensive understanding of the factors influencing melanoma prognosis and identify potential areas for further research or clinical intervention.

Step 1: Load the Data

First, load the dataset into R and inspect its structure.

# Load necessary libraries
library(survival)   # Surv(), survfit(), coxph()
library(tidyverse)  # attaches ggplot2 and dplyr, among others

# Load the dataset
melanoma <- read.csv("melanoma.csv")

# Inspect the dataset
head(melanoma)
##   time status sex age year thickness ulcer
## 1   10      3   1  76 1972      6.76     1
## 2   30      3   1  56 1968      0.65     0
## 3   35      2   1  41 1977      1.34     0
## 4   99      3   0  71 1968      2.90     0
## 5  185      1   1  52 1965     12.08     1
## 6  204      1   1  28 1971      4.84     1
str(melanoma)
## 'data.frame':    205 obs. of  7 variables:
##  $ time     : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ status   : int  3 3 2 3 1 1 1 3 1 1 ...
##  $ sex      : int  1 1 1 0 1 1 1 0 1 0 ...
##  $ age      : int  76 56 41 71 52 28 77 60 49 68 ...
##  $ year     : int  1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
##  $ thickness: num  6.76 0.65 1.34 2.9 12.08 ...
##  $ ulcer    : int  1 0 0 0 1 1 1 1 1 1 ...
summary(melanoma)
##       time          status          sex              age             year     
##  Min.   :  10   Min.   :1.00   Min.   :0.0000   Min.   : 4.00   Min.   :1962  
##  1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:42.00   1st Qu.:1968  
##  Median :2005   Median :2.00   Median :0.0000   Median :54.00   Median :1970  
##  Mean   :2153   Mean   :1.79   Mean   :0.3854   Mean   :52.46   Mean   :1970  
##  3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:65.00   3rd Qu.:1972  
##  Max.   :5565   Max.   :3.00   Max.   :1.0000   Max.   :95.00   Max.   :1977  
##    thickness         ulcer      
##  Min.   : 0.10   Min.   :0.000  
##  1st Qu.: 0.97   1st Qu.:0.000  
##  Median : 1.94   Median :0.000  
##  Mean   : 2.92   Mean   :0.439  
##  3rd Qu.: 3.56   3rd Qu.:1.000  
##  Max.   :17.42   Max.   :1.000
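
The status coding determines which rows count as events in the survival models later on, so it is worth tabulating before any recoding. A minimal sketch, assuming the `melanoma` data frame bundled with the boot package holds the same 205 records as melanoma.csv (the names `mel` and `melanoma_death` are illustrative only):

```r
# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma

# Patients per raw status code
# (1 = died from melanoma, 2 = alive, 3 = died from other causes)
status_counts <- table(mel$status)
print(status_counts)

# Share of each outcome
print(round(prop.table(status_counts), 3))

# Logical event indicator for death from melanoma
mel$melanoma_death <- mel$status == 1
print(sum(mel$melanoma_death))
```

With the report's own data, the same calls can be run on the `melanoma` object before `status` is converted to a factor.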

Step 2: Data Cleaning and Preparation

Check for missing values and convert categorical variables to factors.

# Check for missing values
sum(is.na(melanoma))
## [1] 0
# Convert categorical variables to factors
melanoma$sex <- factor(melanoma$sex, levels = c(0, 1), labels = c("Female", "Male"))
melanoma$ulcer <- factor(melanoma$ulcer, levels = c(0, 1), labels = c("No Ulcer", "Ulcer"))
melanoma$status <- factor(melanoma$status, levels = c(1, 2, 3), labels = c("Died from Melanoma", "Alive", "Died from Other Causes"))
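# Alternative dplyr recoding (kept commented out; apply it to the raw integer
# codes instead of the factor() calls above, not in addition to them):
# melanoma <- melanoma %>%
#   mutate(ulcer  = if_else(ulcer == 1, "Ulcer", "No_Ulcer"),
#          status = case_when(status == 1 ~ "Died_of_melanoma",
#                             status == 2 ~ "Alive",
#                             TRUE        ~ "Died_Other_Causes"))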

# Inspect the cleaned dataset
str(melanoma)
## 'data.frame':    205 obs. of  7 variables:
##  $ time     : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ status   : Factor w/ 3 levels "Died from Melanoma",..: 3 3 2 3 1 1 1 3 1 1 ...
##  $ sex      : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 1 2 1 ...
##  $ age      : int  76 56 41 71 52 28 77 60 49 68 ...
##  $ year     : int  1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
##  $ thickness: num  6.76 0.65 1.34 2.9 12.08 ...
##  $ ulcer    : Factor w/ 2 levels "No Ulcer","Ulcer": 2 1 1 1 2 2 2 2 2 2 ...
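
`factor()` silently turns any value outside `levels` into `NA`, so a recoding step can hide unexpected codes. The sketch below wraps the conversion in a check. The helper `recode_checked()` is hypothetical (not part of any package), and the data are taken from `boot::melanoma` on the assumption that it matches melanoma.csv:

```r
# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma

# Hypothetical helper: convert integer codes to a labelled factor, but stop
# if the column contains a value (or NA) outside the expected codes.
recode_checked <- function(x, codes, labels) {
  unexpected <- setdiff(unique(x), codes)
  if (length(unexpected) > 0) {
    stop("Unexpected codes: ", paste(unexpected, collapse = ", "))
  }
  factor(x, levels = codes, labels = labels)
}

mel$sex    <- recode_checked(mel$sex, c(0, 1), c("Female", "Male"))
mel$ulcer  <- recode_checked(mel$ulcer, c(0, 1), c("No Ulcer", "Ulcer"))
mel$status <- recode_checked(mel$status, c(1, 2, 3),
                             c("Died from Melanoma", "Alive", "Died from Other Causes"))

# Cross-tabulate two of the recoded columns as a quick sanity check
print(table(mel$sex, mel$ulcer))
```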

Step 3: Exploratory Data Analysis (EDA)

Perform exploratory analysis to understand the distribution of variables.

Age Distribution

ggplot(melanoma, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "red") +
  labs(title = "Age Distribution of Melanoma Patients", x = "Age", y = "Count")

Tumor Thickness Distribution

ggplot(melanoma, aes(x = thickness)) +
  geom_histogram(binwidth = 1, fill = "red", color = "blue") +
  labs(title = "Distribution of Tumor Thickness", x = "Thickness (mm)", y = "Count")

Ulceration Status

ggplot(melanoma, aes(x = ulcer)) +
  geom_bar(fill = "green") +
  labs(title = "Ulceration Status", x = "Ulceration", y = "Count")

Survival Time Distribution

ggplot(melanoma, aes(x = time)) +
  geom_histogram(binwidth = 100, fill = "purple", color = "black") +
  labs(title = "Survival Time Distribution", x = "Survival Time (Days)", y = "Count")
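
The plots can be complemented with a few numbers. The sketch below summarizes tumor thickness within each ulceration group and computes a rank correlation between age and thickness. It uses base R only and assumes `boot::melanoma` matches melanoma.csv. Spearman correlation is a choice made here because thickness is right-skewed. The `time` column is left out on purpose, since it is censored for most patients.

```r
# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma
mel$ulcer <- factor(mel$ulcer, levels = c(0, 1), labels = c("No Ulcer", "Ulcer"))

# Quartiles of tumor thickness (mm) within each ulceration group
thickness_by_ulcer <- tapply(mel$thickness, mel$ulcer, quantile,
                             probs = c(0.25, 0.5, 0.75))
print(thickness_by_ulcer)

# Spearman rank correlation between age and thickness;
# exact = FALSE avoids a warning about ties
age_thickness <- cor.test(mel$age, mel$thickness,
                          method = "spearman", exact = FALSE)
print(age_thickness$estimate)
print(age_thickness$p.value)
```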

Step 4: Survival Analysis

Perform survival analysis using Kaplan-Meier curves and Cox proportional hazards models.

Kaplan-Meier Survival Curves

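# Fit Kaplan-Meier survival model for death from melanoma.
# Patients who are alive or who died from other causes are treated as censored.
library(survminer)
km_fit <- survfit(Surv(time, status == "Died from Melanoma") ~ 1, data = melanoma)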

# Plot Kaplan-Meier curve
ggsurvplot(km_fit, data = melanoma, 
           title = "Kaplan-Meier Survival Curve",
           xlab = "Time (Days)", 
           ylab = "Survival Probability")
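
A curve is easier to report when it is read off at fixed time points. The sketch below extracts Kaplan-Meier estimates at roughly 1, 3, and 5 years (365, 1095, and 1825 days, a choice made here for illustration) together with the number at risk. It assumes `boot::melanoma` matches melanoma.csv and uses the raw code `status == 1` as the event.

```r
library(survival)

# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma
km <- survfit(Surv(time, status == 1) ~ 1, data = mel)

# Survival estimates with 95% confidence limits at selected follow-up times
km_at <- summary(km, times = c(365, 1095, 1825))
km_table <- data.frame(
  days     = km_at$time,
  at_risk  = km_at$n.risk,
  survival = round(km_at$surv, 3),
  lower    = round(km_at$lower, 3),
  upper    = round(km_at$upper, 3)
)
print(km_table)

# Median survival time; NA means the curve never drops to 0.5
print(quantile(km, probs = 0.5)$quantile)
```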

Stratified by Ulceration Status

km_fit_ulcer <- survfit(Surv(time, status == "Died from Melanoma") ~ ulcer, data = melanoma)

ggsurvplot(km_fit_ulcer, data = melanoma, 
           title = "Kaplan-Meier Survival Curve by Ulceration Status",
           xlab = "Time (Days)", 
           ylab = "Survival Probability",
           legend.title = "Ulceration",
           legend.labs = c("No Ulcer", "Ulcer"))
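
A visual gap between two Kaplan-Meier curves does not by itself establish a significant difference. One common way to test it is the log-rank test, sketched here with `survdiff()` from the survival package. The data come from `boot::melanoma`, assumed to match melanoma.csv, with the raw 0/1 `ulcer` codes.

```r
library(survival)

# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma

# Log-rank test: melanoma-specific survival by ulceration status
lr <- survdiff(Surv(time, status == 1) ~ ulcer, data = mel)
print(lr)

# p-value from the chi-squared statistic (degrees of freedom = groups - 1)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(p_value)
```

`ggsurvplot()` can also display this p-value on the plot through its `pval = TRUE` argument.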

Cox Proportional Hazards Model

# Fit Cox model
cox_model <- coxph(Surv(time, status == "Died from Melanoma") ~ age + sex + thickness + ulcer, data = melanoma)

# Summarize the model
summary(cox_model)
## Call:
## coxph(formula = Surv(time, status == "Died from Melanoma") ~ 
##     age + sex + thickness + ulcer, data = melanoma)
## 
##   n= 205, number of events= 57 
## 
##                coef exp(coef) se(coef)     z Pr(>|z|)    
## age        0.012198  1.012273 0.008297 1.470  0.14150    
## sexMale    0.432817  1.541594 0.267410 1.619  0.10554    
## thickness  0.108945  1.115101 0.037734 2.887  0.00389 ** 
## ulcerUlcer 1.164479  3.204253 0.309751 3.759  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## age            1.012     0.9879    0.9959     1.029
## sexMale        1.542     0.6487    0.9127     2.604
## thickness      1.115     0.8968    1.0356     1.201
## ulcerUlcer     3.204     0.3121    1.7461     5.880
## 
## Concordance= 0.753  (se = 0.033 )
## Likelihood ratio test= 41.62  on 4 df,   p=2e-08
## Wald test            = 39.42  on 4 df,   p=6e-08
## Score (logrank) test = 46.67  on 4 df,   p=2e-09
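
The Cox model rests on the proportional hazards assumption, which the summary above does not test. A minimal sketch of extracting a compact hazard-ratio table and running the scaled Schoenfeld residual check with `cox.zph()`, assuming `boot::melanoma` matches melanoma.csv. Because the covariates stay as raw 0/1 codes here, the rows are named `sex` and `ulcer` rather than `sexMale` and `ulcerUlcer`.

```r
library(survival)

# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma
fit <- coxph(Surv(time, status == 1) ~ age + sex + thickness + ulcer, data = mel)

# Hazard ratios with Wald 95% confidence intervals
hr_table <- round(exp(cbind(HR = coef(fit), confint(fit))), 3)
print(hr_table)

# Proportional hazards check: small p-values suggest that a covariate's
# effect changes over follow-up time
ph_check <- cox.zph(fit)
print(ph_check)
```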

Step 5: Insights and Recommendations

Based on the analysis, here are the insights and recommendations:

Insights:

  1. Age Distribution:

    • Half of the patients are between 42 and 65 years old (median 54, mean 52.5); ages range from 4 to 95 years.

  2. Tumor Thickness:

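    • Tumor thickness is right-skewed: the median is 1.94 mm and three quarters of patients have tumors of 3.56 mm or less, while the thickest measures 17.42 mm. In the Cox model, each additional millimeter is associated with an estimated 11.5% higher hazard of death from melanoma (HR 1.115, 95% CI 1.036–1.201).

  3. Ulceration Status: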

    • About 44% of patients (mean of the ulcer indicator = 0.439) have ulcerated tumors, which is a known poor prognostic factor.

  4. Survival Analysis:

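    • The Kaplan-Meier curve shows that survival probability decreases over time. Only 57 of the 205 patients died from melanoma during follow-up; all others are censored, so the later part of the curve rests on fewer patients at risk.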

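    • Patients with ulcerated tumors have worse survival outcomes than those without ulceration; in the Cox model, ulceration is associated with roughly a threefold higher hazard of death from melanoma (HR 3.20, 95% CI 1.75–5.88, p < 0.001).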

  5. Cox Model:

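    • The Cox model indicates that tumor thickness (HR 1.115, p = 0.004) and ulceration (HR 3.20, p < 0.001) are significant predictors of death from melanoma. The estimates for older age (HR 1.012 per year, p = 0.14) and male sex (HR 1.54, p = 0.11) point toward poorer outcomes, but neither is statistically significant at the 5% level and both confidence intervals include 1.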

Recommendations:

  1. Early Detection:

    • Encourage regular skin checks, especially for individuals aged 40 and above, to detect melanoma at an early stage when tumors are thinner and have a better prognosis.

  2. Targeted Treatment:

    • Patients with thicker tumors or ulceration should be prioritized for aggressive treatment and closer monitoring.

  3. Public Awareness:

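    • Raise awareness about the importance of sun protection and early detection. Age and sex were not statistically significant predictors in this model, so targeting specific groups (e.g., older adults and males) would need support from other evidence.

  4. Further Research: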

    • Investigate the role of other potential prognostic factors (e.g., genetic markers, lifestyle factors) to improve risk stratification and treatment planning.
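
One simple way to flag higher-risk patients from the fitted Cox model is to rank them by its linear predictor. The sketch below splits patients into three equally sized groups (tertiles, a choice made here for illustration) and counts melanoma deaths in each. It assumes `boot::melanoma` matches melanoma.csv. The groups and the concordance are computed on the same data used to fit the model, so they are optimistic and would need validation on separate data.

```r
library(survival)

# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma
fit <- coxph(Surv(time, status == 1) ~ age + sex + thickness + ulcer, data = mel)

# Linear predictor: higher values mean a higher estimated hazard
mel$risk_score <- predict(fit, type = "lp")

# Split patients into tertiles of the risk score
cuts <- quantile(mel$risk_score, probs = c(0, 1/3, 2/3, 1))
mel$risk_group <- cut(mel$risk_score, breaks = cuts,
                      labels = c("Low", "Medium", "High"),
                      include.lowest = TRUE)

# Melanoma deaths per risk group
deaths_by_group <- table(risk_group = mel$risk_group,
                         melanoma_death = mel$status == 1)
print(deaths_by_group)

# Apparent (in-sample) concordance and its standard error
print(summary(fit)$concordance)
```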

Step 6: Save the Results

Save the cleaned dataset and analysis results for future reference.

# Save cleaned dataset
write.csv(melanoma, "melanoma_cleaned.csv", row.names = FALSE)

# Save Cox model results
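# capture.output() records the printed summary even when the script is run
# with source(), where a bare summary() call inside sink() prints nothing
writeLines(capture.output(summary(cox_model)), "cox_model_summary.txt")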
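
A CSV file stores factor columns as plain text, so the factor levels and their order are lost when the file is read back. If the cleaned data will be reused in R, `saveRDS()` is one option that preserves column types. A minimal sketch using temporary files and `boot::melanoma` (assumed to match melanoma.csv):

```r
# Assumption: boot::melanoma matches the melanoma.csv file used in this report.
mel <- boot::melanoma
mel$ulcer <- factor(mel$ulcer, levels = c(0, 1), labels = c("No Ulcer", "Ulcer"))

# Temporary paths keep the example from overwriting real files
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(mel, csv_path, row.names = FALSE)
saveRDS(mel, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

# Compare how the ulcer column comes back from each format
print(class(from_csv$ulcer))
print(class(from_rds$ulcer))
print(levels(from_rds$ulcer))
```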