Introduction

Recent news coverage has highlighted a trend in which people who have never smoked a cigarette in their lives, especially Asian women, are developing lung cancer. According to Yale Medicine, the cancer type involved is adenocarcinoma, a form of non-small cell lung cancer. No single cause has been identified for the development of this cancer. I examined a research data set that records the physiological characteristics of the tumor tissue and the habits of the patients, along with genomic information obtained through next-generation sequencing.

Question

Which physiological and genomic features distinguish tumor tissue histologies when compared across metrics such as sex, smoking habits, tumor characteristics, and mutated genes?

# Load the required packages.
library(pacman)
p_load(ggplot2,babynames,tidyverse,ggrepel,viridis,ggthemes,cowplot, plotly, pheatmap,dplyr,tidyr)

# Load the downloaded clinical cancer data from cbioportal.
cancerdata<-read_tsv(file = "luad_mskcc_2020_clinical_data.tsv")
## Rows: 604 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (14): Study ID, Patient ID, Sample ID, Cancer Type, Cancer Type Detailed...
## dbl (15): Age At Surgery, CT Size, Diffusion lung capacity for carbon monoxi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove the redundant columns (Study ID, Patient ID, Cancer Type, and Cancer Type Detailed), as their entries contain virtually the same information; Sample ID is kept as the key.
trim_cancer_data<-cancerdata %>% 
  select(-(`Study ID`:`Patient ID`)) %>% 
  select(-(`Cancer Type`:`Cancer Type Detailed`))
view(trim_cancer_data)
# Replace the spaces in the column names with underscores for ease of coding.
colnames(trim_cancer_data)<-gsub(" ", "_", colnames(trim_cancer_data))
colnames(trim_cancer_data)
##  [1] "Sample_ID"                                       
##  [2] "Age_At_Surgery"                                  
##  [3] "Clinical_Nodal_Status"                           
##  [4] "CT_Size"                                         
##  [5] "Diffusion_lung_capacity_for_carbon_monoxide"     
##  [6] "FEV1"                                            
##  [7] "FGA_Facets"                                      
##  [8] "Fraction_Genome_Altered"                         
##  [9] "Mutation_Count"                                  
## [10] "Oncotree_Code"                                   
## [11] "Overall_Survival_(Months)"                       
## [12] "Overall_Survival_Status"                         
## [13] "Pathologic_Stage"                                
## [14] "Pathologic_Nodal_Status"                         
## [15] "Predominant_Histologic_Subtype"                  
## [16] "Relapse_Free_Status_(Months)"                    
## [17] "Relapse_Free_Status"                             
## [18] "Number_of_Samples_Per_Patient"                   
## [19] "Sex"                                             
## [20] "Smoking_History"                                 
## [21] "Person_Cigarette_Smoking_History_Pack_Year_Value"
## [22] "Subtype_Group"                                   
## [23] "Standardized_uptake_values"                      
## [24] "Tumor_Mutation_Burden"                           
## [25] "TMB_(nonsynonymous)"
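
After this step a few names, such as `Overall_Survival_(Months)`, still contain parentheses and must be wrapped in backticks whenever they are referenced. A minimal sketch of one way to strip them, using base R only; `example_names` is a hypothetical stand-in for the real column names, and this extra cleaning step is an optional suggestion rather than part of the original analysis:

```r
# Hypothetical stand-in for the raw column names of the clinical table.
example_names <- c("Overall Survival (Months)", "TMB (nonsynonymous)", "Age At Surgery")

# First replace spaces with underscores, then drop literal parentheses.
clean_names <- gsub("[()]", "", gsub(" ", "_", example_names))

print(clean_names)
# "Overall_Survival_Months" "TMB_nonsynonymous" "Age_At_Surgery"
```

With names like these, columns can be referenced without backticks in `select()`, `mutate()`, and `aes()`.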
# Relabel the entries in the Smoking_History column so they are harder to misread.

trim_cancer_data<-trim_cancer_data %>% 
  mutate(Smoking_History = recode(Smoking_History, "Ever Smoker" = "Smoker", "Never Smoker" = "Non-Smoker"))
head(trim_cancer_data)
## # A tibble: 6 × 25
##   Sample…¹ Age_A…² Clini…³ CT_Size Diffu…⁴  FEV1 FGA_F…⁵ Fract…⁶ Mutat…⁷ Oncot…⁸
##   <chr>      <dbl> <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <chr>  
## 1 P-00002…      52 Negati…     3.4      88   108   0.190  0.0001       4 LUAD   
## 2 P-00002…      62 Negati…     2.1      61    76   0.571  0.130        2 LUAD   
## 3 P-00006…      68 Negati…     1.3      47    90   0.182  0.0303      10 LUAD   
## 4 P-00008…      64 Negati…     3.6      85    89   0.556  0.304        6 LUAD   
## 5 P-00019…      67 Negati…     1.3      NA    NA   0.427  0.0008       5 LUAD   
## 6 P-00025…      73 Negati…     3.4      96   102   0.343  0.134        2 LUAD   
## # … with 15 more variables: `Overall_Survival_(Months)` <dbl>,
## #   Overall_Survival_Status <chr>, Pathologic_Stage <dbl>,
## #   Pathologic_Nodal_Status <chr>, Predominant_Histologic_Subtype <chr>,
## #   `Relapse_Free_Status_(Months)` <dbl>, Relapse_Free_Status <chr>,
## #   Number_of_Samples_Per_Patient <dbl>, Sex <chr>, Smoking_History <chr>,
## #   Person_Cigarette_Smoking_History_Pack_Year_Value <dbl>,
## #   Subtype_Group <chr>, Standardized_uptake_values <dbl>, …
# Load in genomic data.

genedata<-read_csv(file = "data_mutations.csv")
## Rows: 4814 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): Hugo_Symbol, Center, NCBI_Build, Chromosome, Strand, Variant_Class...
## dbl  (9): Entrez_Gene_Id, Start_Position, End_Position, t_ref_count, t_alt_c...
## lgl (90): dbSNP_Val_Status, Matched_Norm_Sample_Barcode, Match_Norm_Seq_Alle...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Trim the gene data.

trim_gene_data<-genedata[, c("Hugo_Symbol","Tumor_Sample_Barcode")]
head(trim_gene_data)
## # A tibble: 6 × 2
##   Hugo_Symbol Tumor_Sample_Barcode
##   <chr>       <chr>               
## 1 MED12       P-0010976-T01-IM5   
## 2 EGFR        P-0010976-T01-IM5   
## 3 EGFR        P-0000280-T01-IM3   
## 4 CRLF2       P-0000280-T01-IM3   
## 5 CRLF2       P-0000280-T01-IM3   
## 6 EP300       P-0010841-T01-IM5
# Collapse the trimmed gene entries so that each tumor sample barcode has a single row, with its unique gene symbols combined into one comma-separated string; rename the barcode column to Sample_ID to match the clinical data.
transformed_gene_data<-trim_gene_data %>% 
  group_by(Tumor_Sample_Barcode) %>% 
  summarize(Hugo_Symbols = paste(unique(Hugo_Symbol), collapse = ", ")) %>% 
  rename(Sample_ID = Tumor_Sample_Barcode)
head(transformed_gene_data)
## # A tibble: 6 × 2
##   Sample_ID         Hugo_Symbols                                          
##   <chr>             <chr>                                                 
## 1 P-0000219-T01-IM3 BRAF, SMAD4, KDM5C, EGFR                              
## 2 P-0000280-T01-IM3 EGFR, CRLF2                                           
## 3 P-0000627-T02-IM5 KRAS, TERT, EPHA3, KDR, FAT1, FGFR4, TP53, ZRSR2, BCOR
## 4 P-0000867-T01-IM3 BRAF, PIK3CB, ROS1, KDM5A, ETV6, CTCF                 
## 5 P-0001944-T01-IM3 KRAS, FAT1, STK11, KEAP1, AR                          
## 6 P-0002534-T01-IM3 EGFR, TP53
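
Because the gene symbols are stored as one comma-separated string per sample, checking whether a sample carries a particular gene needs some care: a plain substring search for "AR" would also match "ARID1A". The sketch below is one possible approach in base R; `toy_genes` and `has_gene()` are hypothetical names introduced only for illustration, and the rows are made-up values in the same format as the table above.

```r
# Made-up rows in the same format as transformed_gene_data.
toy_genes <- data.frame(
  Sample_ID    = c("S1", "S2", "S3"),
  Hugo_Symbols = c("EGFR, TP53", "KRAS, ARID1A", "AR, STK11"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string back into symbols and test exact membership.
has_gene <- function(symbols, gene) {
  vapply(strsplit(symbols, ", ", fixed = TRUE),
         function(g) gene %in% g,
         logical(1))
}

toy_genes$has_AR <- has_gene(toy_genes$Hugo_Symbols, "AR")

# A naive substring match, shown for comparison only.
naive_match <- grepl("AR", toy_genes$Hugo_Symbols)

print(toy_genes$has_AR)  # FALSE FALSE TRUE
print(naive_match)       # FALSE TRUE TRUE (wrongly flags the ARID1A sample)
```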
# Generate a new table with mutated gene data (genomic data) and original clinical data.
combined_data<-inner_join(trim_cancer_data,transformed_gene_data, by = "Sample_ID")
head(combined_data)
## # A tibble: 6 × 26
##   Sample…¹ Age_A…² Clini…³ CT_Size Diffu…⁴  FEV1 FGA_F…⁵ Fract…⁶ Mutat…⁷ Oncot…⁸
##   <chr>      <dbl> <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <chr>  
## 1 P-00002…      52 Negati…     3.4      88   108   0.190  0.0001       4 LUAD   
## 2 P-00002…      62 Negati…     2.1      61    76   0.571  0.130        2 LUAD   
## 3 P-00006…      68 Negati…     1.3      47    90   0.182  0.0303      10 LUAD   
## 4 P-00008…      64 Negati…     3.6      85    89   0.556  0.304        6 LUAD   
## 5 P-00019…      67 Negati…     1.3      NA    NA   0.427  0.0008       5 LUAD   
## 6 P-00025…      73 Negati…     3.4      96   102   0.343  0.134        2 LUAD   
## # … with 16 more variables: `Overall_Survival_(Months)` <dbl>,
## #   Overall_Survival_Status <chr>, Pathologic_Stage <dbl>,
## #   Pathologic_Nodal_Status <chr>, Predominant_Histologic_Subtype <chr>,
## #   `Relapse_Free_Status_(Months)` <dbl>, Relapse_Free_Status <chr>,
## #   Number_of_Samples_Per_Patient <dbl>, Sex <chr>, Smoking_History <chr>,
## #   Person_Cigarette_Smoking_History_Pack_Year_Value <dbl>,
## #   Subtype_Group <chr>, Standardized_uptake_values <dbl>, …
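
An inner join silently drops any sample that appears in only one of the two tables, so it can be worth checking how many samples are lost. The following is a small sketch with base R and made-up IDs; `toy_clinical` and `toy_mutations` are hypothetical stand-ins for the clinical and gene tables, and `merge()` is used here in place of `inner_join()` so the example runs without extra packages.

```r
# Made-up stand-ins for the clinical and collapsed gene tables.
toy_clinical <- data.frame(
  Sample_ID = c("S1", "S2", "S3", "S4"),
  Sex       = c("Female", "Male", "Female", "Male"),
  stringsAsFactors = FALSE
)
toy_mutations <- data.frame(
  Sample_ID    = c("S1", "S3", "S5"),
  Hugo_Symbols = c("EGFR, TP53", "KRAS", "BRAF"),
  stringsAsFactors = FALSE
)

# merge() with default settings keeps only IDs present in both tables (an inner join).
toy_joined <- merge(toy_clinical, toy_mutations, by = "Sample_ID")

# Samples that would be dropped from each side of the join.
missing_mutations <- setdiff(toy_clinical$Sample_ID, toy_mutations$Sample_ID)
missing_clinical  <- setdiff(toy_mutations$Sample_ID, toy_clinical$Sample_ID)

print(nrow(toy_joined))    # 2
print(missing_mutations)   # "S2" "S4"
print(missing_clinical)    # "S5"
```

Running the same `setdiff()` calls on the real tables would show whether any clinical samples have no recorded mutations and are therefore absent from the plots that follow.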

A plot of how tumor histology is distributed by the number of mutations in the tumor genome, split by smoking history and sex.

TMBplot<-combined_data %>% 
  ggplot(aes(Tumor_Mutation_Burden, fill=Predominant_Histologic_Subtype))+
  facet_grid(Smoking_History~Sex)+
  geom_density(alpha = 0.5)+
  ggtitle("Mutations in tumor tissue vs tumor histology \n between smoker and non-smokers by sex.")+
  xlab("Number of mutations in tumor tissue")+
  ylab("Density")+
  xlim(0,40)+
  labs(fill = "Tissue type")+
  scale_fill_viridis_d()+
  theme_minimal()+
  theme(plot.title = element_text(size = 12, face = "bold"),
        strip.text = element_text(size = 10, face = "bold"))
TMBplot
## Warning: Removed 7 rows containing non-finite values (stat_density).

The plot shows that (a) smokers have higher tumor mutation burdens than non-smokers, which is consistent with exposure to carcinogens inhaled during smoking; (b) among non-smokers with a low tumor mutation burden, lepidic histology is predominant in females, whereas solid and micropapillary histologies are seen in males; and (c) among smokers of both sexes, a lower mutation burden is associated with lepidic histology and a higher burden with solid histology.
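
The visual comparison between smokers and non-smokers could be backed by a numeric summary. The sketch below shows one possible way to do this in base R: group medians followed by a Wilcoxon rank-sum test. The data frame `toy_tmb` contains made-up values, not results from the dataset, and the choice of test is an assumption rather than something the original analysis used.

```r
# Made-up tumor mutation burden values for illustration only.
toy_tmb <- data.frame(
  Smoking_History       = rep(c("Smoker", "Non-Smoker"), each = 3),
  Tumor_Mutation_Burden = c(8, 12, 10, 2, 4, 3),
  stringsAsFactors = FALSE
)

# Median burden per smoking group.
tmb_medians <- aggregate(Tumor_Mutation_Burden ~ Smoking_History,
                         data = toy_tmb, FUN = median)
print(tmb_medians)
# Non-Smoker median is 3, Smoker median is 10 for these toy values.

# Non-parametric comparison of the two groups (assumed choice of test).
tmb_test <- wilcox.test(Tumor_Mutation_Burden ~ Smoking_History, data = toy_tmb)
print(tmb_test$p.value)
```

On the real data the same calls would take `combined_data` in place of `toy_tmb`.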

Next, investigate whether smoking exposure (pack-year value) affects the tissue histology found in patients who smoke.

smoke_plot<-combined_data %>% 
  filter(Smoking_History == "Smoker") %>% 
  ggplot(aes(Person_Cigarette_Smoking_History_Pack_Year_Value, fill = Predominant_Histologic_Subtype))+
  geom_density(alpha = 0.5)+
  ggtitle("Distribution of tumor tissue histology by cigarette smoking history.")+
  xlab("Cigarette smoking history (pack-years)")+
  ylab("Density")+
  labs(fill = "Tissue type")+
  scale_fill_viridis_d()+ 
  facet_wrap(~Sex)+
  theme_bw()+
  theme(plot.title = element_text(size = 12, face = "bold"))
smoke_plot
## Warning: Removed 1 rows containing non-finite values (stat_density).

This plot shows how tissue histology is distributed among smokers by sex. In female smokers, solid histology becomes predominant with increasing exposure to smoke carcinogens (higher pack-year values), whereas male smokers develop micropapillary to papillary histology.
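
For readers unfamiliar with the exposure measure, a pack-year is conventionally defined as the number of packs smoked per day multiplied by the number of years smoked, with one pack taken as 20 cigarettes. A short worked sketch follows; `pack_years()` is a hypothetical helper, and whether the dataset's pack-year column was derived in exactly this way is an assumption.

```r
# Hypothetical helper: pack-years = (cigarettes per day / cigarettes per pack) * years smoked.
pack_years <- function(cigarettes_per_day, years_smoked, cigarettes_per_pack = 20) {
  (cigarettes_per_day / cigarettes_per_pack) * years_smoked
}

one_pack_30_years  <- pack_years(20, 30)  # one pack a day for 30 years
half_pack_20_years <- pack_years(10, 20)  # half a pack a day for 20 years

print(one_pack_30_years)   # 30
print(half_pack_20_years)  # 10
```

Under this definition, the same pack-year value can reflect heavy smoking over a short period or lighter smoking over a long one.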

Investigate the relationship between tumor characteristics and tissue histology.

# Use the tumor size and standardized uptake values to draw a linear trend line for each tissue type.
tumor_plot<-combined_data %>%
  ggplot(aes(CT_Size, Standardized_uptake_values))+
  geom_smooth(method = lm)+
  # Facet the plot based upon tissue type
  facet_wrap(~Predominant_Histologic_Subtype, scales = "free_x")+
  xlab("Size of the tumor")+
  ylab("Uptake of nutrients by the tumor")+
  ggtitle("Standardized Uptake Value vs Tumor Size by tissue histology.")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
        strip.text = element_text(size = 10))
tumor_plot
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 69 rows containing non-finite values (stat_smooth).

The plot above relates the metabolic activity of the tumor (the standardized amount of nutrients absorbed) to tumor size for each histology. The slope of the trend line in each panel shows how uptake changes with tumor size for that tissue type, and uptake can be independent of tumor size. This highlights the importance of tumor histology: solid histology appears to be the most metabolically active of the subtypes.
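
The slopes that `geom_smooth(method = lm)` draws can also be extracted as numbers, which makes the comparison between subtypes explicit. The sketch below fits one linear model per subtype in base R; `toy_tumors` holds made-up values chosen so the slopes are easy to verify, and none of the numbers come from the dataset.

```r
# Made-up tumor sizes and uptake values for two subtypes.
toy_tumors <- data.frame(
  Predominant_Histologic_Subtype = rep(c("Solid", "Lepidic"), each = 4),
  CT_Size                        = c(1, 2, 3, 4, 1, 2, 3, 4),
  Standardized_uptake_values     = c(3, 5, 7, 9, 1, 1.5, 2, 2.5),
  stringsAsFactors = FALSE
)

# Fit uptake ~ size separately for each subtype and keep the slope.
slopes <- sapply(
  split(toy_tumors, toy_tumors$Predominant_Histologic_Subtype),
  function(d) coef(lm(Standardized_uptake_values ~ CT_Size, data = d))[["CT_Size"]]
)

print(slopes)
# Lepidic slope is 0.5 and Solid slope is 2 for these toy values.
```

A slope near zero would indicate that uptake changes little with tumor size for that subtype.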

Now on to the distribution of tissue histology across the pathologic stages of cancer.

# Aggregate data for plot for histology of tumors observed for different cancer stages.
plotdata<-combined_data %>% 
  select(Sex,Sample_ID,Pathologic_Stage, Predominant_Histologic_Subtype) %>%
  group_by(Sex,Pathologic_Stage,Predominant_Histologic_Subtype) %>% 
  summarize(Number_of_patients = n(), .groups = "drop")
plotdata
## # A tibble: 30 × 4
##    Sex    Pathologic_Stage Predominant_Histologic_Subtype Number_of_patients
##    <chr>             <dbl> <chr>                                       <int>
##  1 Female                1 Acinar                                        189
##  2 Female                1 Lepidic                                        63
##  3 Female                1 Micropapillary                                 11
##  4 Female                1 Papillary                                      16
##  5 Female                1 Solid                                          25
##  6 Female                2 Acinar                                         35
##  7 Female                2 Lepidic                                         3
##  8 Female                2 Micropapillary                                  5
##  9 Female                2 Papillary                                       2
## 10 Female                2 Solid                                          10
## # … with 20 more rows
# Visualize the spread of tissue histology between the different pathological stages of cancer.
pathology_spread<-plotdata %>% 
  ggplot(aes(x = Pathologic_Stage, y = Number_of_patients, text = paste0("Patient Count: ", Number_of_patients), fill = Predominant_Histologic_Subtype))+
  geom_col(position = "dodge")+
  ggtitle("Distribution of tissue histology across stages of cancer.")+
  facet_wrap(~Sex)+
  xlab("Pathologic stage of cancer")+
  ylab("Number of patients")+
  labs(fill = "Tissue type")+
  scale_fill_viridis_d()+
  theme_bw()+
  theme(plot.title = element_text(size = 12))
plot_patient<-ggplotly(pathology_spread, tooltip = "text")
plot_patient

The plot above shows the spread of tissue histology across the cancer stages recorded in the dataset; acinar histology is the most common in patients of both sexes.
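
Raw counts are dominated by stage 1 because most patients were diagnosed at that stage, so within-stage proportions can make the stages easier to compare. The sketch below uses only the ten female stage 1 and stage 2 counts printed in the `plotdata` output above; the remaining rows are not shown there and are left out. The data frame name `female_counts` is hypothetical.

```r
# Female stage 1 and stage 2 counts copied from the printed plotdata output.
female_counts <- data.frame(
  Pathologic_Stage = rep(c(1, 2), each = 5),
  Predominant_Histologic_Subtype = rep(
    c("Acinar", "Lepidic", "Micropapillary", "Papillary", "Solid"), times = 2),
  Number_of_patients = c(189, 63, 11, 16, 25, 35, 3, 5, 2, 10),
  stringsAsFactors = FALSE
)

# Cross-tabulate stage by subtype, then convert each row to proportions.
count_table  <- xtabs(Number_of_patients ~ Pathologic_Stage + Predominant_Histologic_Subtype,
                      data = female_counts)
stage_totals <- rowSums(count_table)
shares       <- round(prop.table(count_table, margin = 1), 3)

print(stage_totals)  # 304 patients at stage 1, 55 at stage 2
print(shares)
# Acinar: 0.622 at stage 1 and 0.636 at stage 2
# Solid:  0.082 at stage 1 and 0.182 at stage 2
```

In these rows the acinar share is similar at both stages, while the solid share is roughly twice as large at stage 2 as at stage 1.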

Discussion and future implications.

This study offers some insights into the nature of adenocarcinoma. Some are simple and well understood, such as smoking being associated with a higher tumor mutation burden, that is, more mutations in the tumor genome. Non-smoking women and men were found to develop different tissue histologies, which can lead to tumors with varying metabolic activities. Tumor size also does not appear to correlate with the aggressiveness (standardized uptake values) of the mass, which can help physicians properly inform and console patients in moments of great distress. Histological evaluation of lung tissue, together with biomarker testing of key mutated genes (for smokers and non-smokers), can help diagnose the disease and guide treatment efforts. The data set also points to similar mutated genes underlying the various tissue histologies in patients; further genomic investigations could be performed to develop antigens that help screen blood from the site of the mass, lessening the financial and physical toll of invasive procedures.

R-Pubs Link to the HTML https://rpubs.com/VictorUPEI/1295090

References.

  1. Chapter 11 - Oncology: Positron Emission Tomography (2014). In H. A. Ziessman, J. P. O’Malley & J. H. Thrall (Eds.), Nuclear Medicine (Fourth Edition) (pp. 227–264). W.B. Saunders. https://doi.org/10.1016/B978-0-323-08299-0.00011-0
  2. Caso, R., Sanchez-Vega, F., Tan, K. S., Mastrogiacomo, B., Zhou, J., Jones, G. D., Nguyen, B., Schultz, N., Connolly, J. G., Brandt, W. S., Bott, M. J., Rocco, G., Molena, D., Isbell, J. M., Liu, Y., Mayo, M. W., Adusumilli, P. S., Travis, W. D., & Jones, D. R. (2020). The Underlying Tumor Genomics of Predominant Histologic Subtypes in Lung Adenocarcinoma. Journal of Thoracic Oncology: Official Publication of the International Association for the Study of Lung Cancer, 15(12), 1844–1856. https://doi.org/10.1016/j.jtho.2020.08.005. Dataset acquired from https://www.cbioportal.org/study/summary?id=luad_mskcc_2020
  3. Lung cancer in nonsmokers. (2024, April 18). Yale Medicine. https://www.yalemedicine.org/conditions/lung-cancer-in-nonsmokers