Introduction

Research Question

Is there a significant association between a patient’s age group at diagnosis and the specific anatomical site of metastasis?

Dataset Introduction & Reference

Study Name: Prostate Adenocarcinoma (MSK/DFCI, Nature Genetics 2018)

Dataset: Contains clinical information for patients with prostate cancer, including whole-exome sequencing of 1,013 prostate cancer samples.

Sample Size: 1,013 samples from 1,013 patients.

Variables: 24 variables

Study ID, Patient ID, Sample ID, Diagnosis Age, Cancer Type, Cancer Type Detailed, Data Source, Fraction Genome Altered, Fusion, Radical Prostatectomy Gleason Score for Prostate Cancer, Metastatic Site, Mutation Burden, Mutation Count, Normal Genome Coverage, Oncotree Code, Ploidy, Purity, Reviewed Gleason Category, Number of Samples Per Patient, Sample Type, Sex, Somatic Status, TMB (nonsynonymous), Tumor Genome Coverage

Key Variables: Diagnosis Age, Metastatic Site

Key Focus of this Study: This analysis tests whether age group at diagnosis (under 60 vs. 60 and over) is associated with where the cancer spreads, restricted to the top four sites: Bone, Liver, Lung, and Lymph Node.

Direct Link: https://www.cbioportal.org/study/summary?id=prad_p1000

Data Analysis

In this section, I cleaned the dataset by standardizing the naming conventions for metastatic sites (e.g., ensuring “Lymph node” is consistently labeled as “Lymph Node”) and filtering for the four sites of interest: Bone, Liver, Lung, and Lymph Node. I also created a categorical age variable to compare patients diagnosed “Under 60” with those “60 and Over.” I then perform a Chi-Squared Test of Independence to determine whether there is any relationship between age group and metastatic site. A bar chart compares the distribution of metastatic sites across the two age groups.

library(tidyverse)

# Load the dataset
df <- read_tsv("prad_p1000_clinical_data.tsv")

# Clean and prepare the data
df_clean <- df %>%
  select(`Diagnosis Age`, `Metastatic Site`) %>%
  drop_na() %>%
  mutate(Metastatic_Site_Clean = str_to_title(`Metastatic Site`)) %>%
  filter(Metastatic_Site_Clean %in% c("Bone", "Liver", "Lung", "Lymph Node"))

# Mutate to create Age Groups
df_final <- df_clean %>%
  mutate(Age_Group = if_else(`Diagnosis Age` < 60, "Under 60", "60 and Over"))

# EDA Function 1
summary(df_final)
##  Diagnosis Age   Metastatic Site    Metastatic_Site_Clean  Age_Group        
##  Min.   :40.00   Length:156         Length:156            Length:156        
##  1st Qu.:63.00   Class :character   Class :character      Class :character  
##  Median :69.00   Mode  :character   Mode  :character      Mode  :character  
##  Mean   :67.93                                                              
##  3rd Qu.:73.00                                                              
##  Max.   :86.00
# EDA Function 2
head(df_final)
## # A tibble: 6 × 4
##   `Diagnosis Age` `Metastatic Site` Metastatic_Site_Clean Age_Group  
##             <dbl> <chr>             <chr>                 <chr>      
## 1              55 Lymph node        Lymph Node            Under 60   
## 2              65 Lymph node        Lymph Node            60 and Over
## 3              75 Lymph node        Lymph Node            60 and Over
## 4              63 Liver             Liver                 60 and Over
## 5              68 Lymph node        Lymph Node            60 and Over
## 6              69 Bone              Bone                  60 and Over
# Create Visualization
ggplot(df_final, aes(x = Metastatic_Site_Clean, fill = Age_Group)) +
  geom_bar(position = "dodge") +
  labs(title = "Metastasis Site Distribution by Age Group",
       x = "Metastatic Site",
       y = "Count",
       fill = "Age Group") +
  theme_minimal()

Statistical Analysis

Hypothesis:

\(H_0\): There is no association between the site of metastasis and a patient’s age group (the two variables are independent).

\(H_a\): There is an association between the site of metastasis and a patient’s age group.

# Create the contingency table
observed_table <- table(df_final$Age_Group, df_final$Metastatic_Site_Clean)
observed_table
##              
##               Bone Liver Lung Lymph Node
##   60 and Over   64    17    1         54
##   Under 60       8     2    0         10
# Perform Chi-Squared Test
chi_result <- chisq.test(observed_table)
chi_result
## 
##  Pearson's Chi-squared test
## 
## data:  observed_table
## X-squared = 0.87514, df = 3, p-value = 0.8314
# Check expected counts and statistic
chi_result$expected
##              
##                    Bone     Liver      Lung Lymph Node
##   60 and Over 62.769231 16.564103 0.8717949  55.794872
##   Under 60     9.230769  2.435897 0.1282051   8.205128
chi_result$statistic
## X-squared 
## 0.8751354
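
Several expected counts in the table above fall below 5, so the chi-squared approximation may be unreliable here. As a sensitivity check, the following is a minimal sketch (not part of the original analysis) that rebuilds the observed counts by hand and runs Fisher's exact test and a Monte Carlo version of the chi-squared test, both from base R. The object names (`observed`, `fisher_result`, `sim_result`) and the seed are illustrative assumptions.

```r
# Rebuild the observed contingency table from the counts printed above
observed <- matrix(
  c(64, 17, 1, 54,
     8,  2, 0, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Age_Group = c("60 and Over", "Under 60"),
    Metastatic_Site = c("Bone", "Liver", "Lung", "Lymph Node")
  )
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
n_small <- sum(expected < 5)  # number of cells with expected count below 5

# Fisher's exact test does not rely on the large-sample approximation
fisher_result <- fisher.test(observed)

# Alternative: chi-squared test with a simulated (Monte Carlo) p-value
set.seed(123)  # arbitrary seed, only for reproducibility
sim_result <- chisq.test(observed, simulate.p.value = TRUE, B = 10000)

n_small
fisher_result$p.value
sim_result$p.value
```

If both p-values point the same way as the asymptotic test, the conclusion does not hinge on the large-sample approximation.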

Interpretation

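With a p-value of 0.8314, well above the conventional alpha level of 0.05, we fail to reject the null hypothesis. There is no statistically significant evidence that the anatomical site of metastasis depends on whether a patient is under 60 or 60 and over. This result should be read with caution: three of the eight expected counts are below 5 (both Lung cells and the Under 60 Liver cell), so the chi-squared approximation may be unreliable, and the Under 60 group contains only 20 patients.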
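
A non-significant p-value does not by itself say how strong an association is. One common effect-size measure for a contingency table is Cramér's V, defined as the square root of X-squared divided by n times the smaller of (rows - 1) and (columns - 1). The sketch below computes it from the statistic and sample size reported above. It is an illustrative addition, and the variable names are assumptions.

```r
# Values taken from the output above
chi_sq <- 0.8751354   # Pearson X-squared statistic
n      <- 156         # total patients in the contingency table
n_rows <- 2           # age groups
n_cols <- 4           # metastatic sites

# Cramér's V = sqrt(X^2 / (n * min(rows - 1, cols - 1)))
cramers_v <- sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))
round(cramers_v, 3)
```

A value this close to 0 is usually read as a very weak association, which is consistent with the test result.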

Conclusion

The analysis indicates that, among the 156 patients in this 1,013-sample dataset with a known diagnosis age and a metastasis recorded at one of the four sites (Bone, Liver, Lung, or Lymph Node), the location of prostate cancer metastasis does not appear to be associated with age group at diagnosis. Both younger and older patients show similar patterns of spread, with Bone and Lymph Node the most common sites in both groups.
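
The two age groups differ greatly in size (136 vs. 20 patients), so raw counts are hard to compare directly. A minimal sketch, assuming the counts from the contingency table in the Statistical Analysis section, converts them to within-group proportions with base R's `prop.table()` and lists the two most frequent sites per group. The object names are illustrative.

```r
# Observed counts copied from the contingency table
observed <- matrix(
  c(64, 17, 1, 54,
     8,  2, 0, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Age_Group = c("60 and Over", "Under 60"),
    Metastatic_Site = c("Bone", "Liver", "Lung", "Lymph Node")
  )
)

# Proportion of each metastatic site within each age group (rows sum to 1)
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)

# Two most frequent sites in each age group (one column per group)
top_two <- apply(observed, 1, function(x) names(sort(x, decreasing = TRUE))[1:2])
top_two
```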

Implications: In this dataset, the distribution of metastatic sites looks similar in both age groups, so these results give no reason to adjust the monitoring of specific metastatic sites on the basis of age group alone. Given the small number of patients under 60, this should be treated as tentative.

Future Directions: Future research could refine these age groups (for example, comparing “very early onset” under 45 vs. elderly) or incorporate genetic mutation data (such as TP53 or PTEN status) to see whether it, rather than chronological age, better predicts the site of metastasis.

References

cBioPortal for Cancer Genomics. Study Summary: Prostate Adenocarcinoma (MSK/DFCI, Nature Genetics 2018). Retrieved from https://www.cbioportal.org/study/summary?id=prad_p1000