Tissue Histology of Lung Cancer

Question

What distinguishing features of tumor tissue histology emerge when physiological and genomic data are compared across metrics such as sex, smoking habits, tumor characteristics, and mutated genes?

# Load the required packages.
library(pacman)
p_load(ggplot2,babynames,tidyverse,ggrepel,viridis,ggthemes,cowplot, plotly, pheatmap,dplyr,tidyr)

# Load the downloaded clinical cancer data from cbioportal.
cancerdata<-read_tsv(file = "luad_mskcc_2020_clinical_data.tsv")

## Rows: 604 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (14): Study ID, Patient ID, Sample ID, Cancer Type, Cancer Type Detailed...
## dbl (15): Age At Surgery, CT Size, Diffusion lung capacity for carbon monoxi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Remove redundant columns (study ID, patient ID, cancer type, and detailed cancer type), whose entries add virtually no information beyond what is kept.
trim_cancer_data<-cancerdata %>% 
  select(-(`Study ID`:`Patient ID`)) %>% 
  select(-(`Cancer Type`:`Cancer Type Detailed`))
view(trim_cancer_data)

# Replace the spaces in column names with underscores for ease of coding.
colnames(trim_cancer_data)<-gsub(" ", "_", colnames(trim_cancer_data))
colnames(trim_cancer_data)

##  [1] "Sample_ID"                                       
##  [2] "Age_At_Surgery"                                  
##  [3] "Clinical_Nodal_Status"                           
##  [4] "CT_Size"                                         
##  [5] "Diffusion_lung_capacity_for_carbon_monoxide"     
##  [6] "FEV1"                                            
##  [7] "FGA_Facets"                                      
##  [8] "Fraction_Genome_Altered"                         
##  [9] "Mutation_Count"                                  
## [10] "Oncotree_Code"                                   
## [11] "Overall_Survival_(Months)"                       
## [12] "Overall_Survival_Status"                         
## [13] "Pathologic_Stage"                                
## [14] "Pathologic_Nodal_Status"                         
## [15] "Predominant_Histologic_Subtype"                  
## [16] "Relapse_Free_Status_(Months)"                    
## [17] "Relapse_Free_Status"                             
## [18] "Number_of_Samples_Per_Patient"                   
## [19] "Sex"                                             
## [20] "Smoking_History"                                 
## [21] "Person_Cigarette_Smoking_History_Pack_Year_Value"
## [22] "Subtype_Group"                                   
## [23] "Standardized_uptake_values"                      
## [24] "Tumor_Mutation_Burden"                           
## [25] "TMB_(nonsynonymous)"

# Recode the entries in the Smoking_History column to the clearer labels "Smoker" and "Non-Smoker".

trim_cancer_data<-trim_cancer_data %>% 
  mutate(Smoking_History = recode(Smoking_History, "Ever Smoker" = "Smoker", "Never Smoker" = "Non-Smoker"))
head(trim_cancer_data)

## # A tibble: 6 × 25
##   Sample…¹ Age_A…² Clini…³ CT_Size Diffu…⁴  FEV1 FGA_F…⁵ Fract…⁶ Mutat…⁷ Oncot…⁸
##   <chr>      <dbl> <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <chr>  
## 1 P-00002…      52 Negati…     3.4      88   108   0.190  0.0001       4 LUAD   
## 2 P-00002…      62 Negati…     2.1      61    76   0.571  0.130        2 LUAD   
## 3 P-00006…      68 Negati…     1.3      47    90   0.182  0.0303      10 LUAD   
## 4 P-00008…      64 Negati…     3.6      85    89   0.556  0.304        6 LUAD   
## 5 P-00019…      67 Negati…     1.3      NA    NA   0.427  0.0008       5 LUAD   
## 6 P-00025…      73 Negati…     3.4      96   102   0.343  0.134        2 LUAD   
## # … with 15 more variables: `Overall_Survival_(Months)` <dbl>,
## #   Overall_Survival_Status <chr>, Pathologic_Stage <dbl>,
## #   Pathologic_Nodal_Status <chr>, Predominant_Histologic_Subtype <chr>,
## #   `Relapse_Free_Status_(Months)` <dbl>, Relapse_Free_Status <chr>,
## #   Number_of_Samples_Per_Patient <dbl>, Sex <chr>, Smoking_History <chr>,
## #   Person_Cigarette_Smoking_History_Pack_Year_Value <dbl>,
## #   Subtype_Group <chr>, Standardized_uptake_values <dbl>, …

# Load in genomic data.

genedata<-read_csv(file = "data_mutations.csv")

## Rows: 4814 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): Hugo_Symbol, Center, NCBI_Build, Chromosome, Strand, Variant_Class...
## dbl  (9): Entrez_Gene_Id, Start_Position, End_Position, t_ref_count, t_alt_c...
## lgl (90): dbSNP_Val_Status, Matched_Norm_Sample_Barcode, Match_Norm_Seq_Alle...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Trim the gene data.

trim_gene_data<-genedata[, c("Hugo_Symbol","Tumor_Sample_Barcode")]
head(trim_gene_data)

## # A tibble: 6 × 2
##   Hugo_Symbol Tumor_Sample_Barcode
##   <chr>       <chr>               
## 1 MED12       P-0010976-T01-IM5   
## 2 EGFR        P-0010976-T01-IM5   
## 3 EGFR        P-0000280-T01-IM3   
## 4 CRLF2       P-0000280-T01-IM3   
## 5 CRLF2       P-0000280-T01-IM3   
## 6 EP300       P-0010841-T01-IM5

# Collapse the unique mutated gene symbols for each tumor sample into one comma-separated string, and rename the barcode column to Sample_ID so it matches the clinical data.
transformed_gene_data<-trim_gene_data %>% 
  group_by(Tumor_Sample_Barcode) %>% 
  summarize(Hugo_Symbols = paste(unique(Hugo_Symbol), collapse = ", ")) %>% 
  rename(Sample_ID = Tumor_Sample_Barcode)
head(transformed_gene_data)

## # A tibble: 6 × 2
##   Sample_ID         Hugo_Symbols                                          
##   <chr>             <chr>                                                 
## 1 P-0000219-T01-IM3 BRAF, SMAD4, KDM5C, EGFR                              
## 2 P-0000280-T01-IM3 EGFR, CRLF2                                           
## 3 P-0000627-T02-IM5 KRAS, TERT, EPHA3, KDR, FAT1, FGFR4, TP53, ZRSR2, BCOR
## 4 P-0000867-T01-IM3 BRAF, PIK3CB, ROS1, KDM5A, ETV6, CTCF                 
## 5 P-0001944-T01-IM3 KRAS, FAT1, STK11, KEAP1, AR                          
## 6 P-0002534-T01-IM3 EGFR, TP53

# Join the clinical data with the per-sample mutated gene lists (genomic data). inner_join() keeps only the samples present in both tables.
combined_data<-inner_join(trim_cancer_data,transformed_gene_data, by = "Sample_ID")
head(combined_data)

## # A tibble: 6 × 26
##   Sample…¹ Age_A…² Clini…³ CT_Size Diffu…⁴  FEV1 FGA_F…⁵ Fract…⁶ Mutat…⁷ Oncot…⁸
##   <chr>      <dbl> <chr>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl> <chr>  
## 1 P-00002…      52 Negati…     3.4      88   108   0.190  0.0001       4 LUAD   
## 2 P-00002…      62 Negati…     2.1      61    76   0.571  0.130        2 LUAD   
## 3 P-00006…      68 Negati…     1.3      47    90   0.182  0.0303      10 LUAD   
## 4 P-00008…      64 Negati…     3.6      85    89   0.556  0.304        6 LUAD   
## 5 P-00019…      67 Negati…     1.3      NA    NA   0.427  0.0008       5 LUAD   
## 6 P-00025…      73 Negati…     3.4      96   102   0.343  0.134        2 LUAD   
## # … with 16 more variables: `Overall_Survival_(Months)` <dbl>,
## #   Overall_Survival_Status <chr>, Pathologic_Stage <dbl>,
## #   Pathologic_Nodal_Status <chr>, Predominant_Histologic_Subtype <chr>,
## #   `Relapse_Free_Status_(Months)` <dbl>, Relapse_Free_Status <chr>,
## #   Number_of_Samples_Per_Patient <dbl>, Sex <chr>, Smoking_History <chr>,
## #   Person_Cigarette_Smoking_History_Pack_Year_Value <dbl>,
## #   Subtype_Group <chr>, Standardized_uptake_values <dbl>, …
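Because `inner_join()` keeps only the sample IDs found in both tables, it can be useful to list which clinical samples have no mutation record. A minimal sketch on toy data follows; the two small tibbles and their values are made up for illustration, and only the column names `Sample_ID` and `Hugo_Symbols` come from the analysis above.

```r
library(dplyr)

# Toy stand-ins for trim_cancer_data and transformed_gene_data (hypothetical values).
clinical_toy <- tibble(Sample_ID = c("S1", "S2", "S3"),
                       Sex = c("Female", "Male", "Female"))
genes_toy <- tibble(Sample_ID = c("S1", "S3"),
                    Hugo_Symbols = c("EGFR, TP53", "KRAS"))

# Samples kept by the inner join.
joined_toy <- inner_join(clinical_toy, genes_toy, by = "Sample_ID")

# Clinical samples the inner join drops because no mutation row matches.
dropped_toy <- anti_join(clinical_toy, genes_toy, by = "Sample_ID")
print(dropped_toy)
```

Running the same `anti_join()` on the real tables would show whether any clinical samples are silently excluded from the later plots.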

Plot the distribution of tumor histology against tumor mutation burden, split by smoking history and sex.

TMBplot<-combined_data %>% 
  ggplot(aes(Tumor_Mutation_Burden, fill=Predominant_Histologic_Subtype))+
  facet_grid(Smoking_History~Sex)+
  geom_density(alpha = 0.5)+
  ggtitle("Tumor mutation burden vs tumor histology \n between smokers and non-smokers by sex.")+
  xlab("Tumor mutation burden")+
  ylab("Density")+
  xlim(0,40)+
  labs(fill = "Tissue type")+
  scale_fill_viridis_d()+
  theme_minimal()+
  theme(plot.title = element_text(size = 12, face = "bold"),
        strip.text = element_text(size = 10, face = "bold"))
TMBplot

## Warning: Removed 7 rows containing non-finite values (stat_density).

The plot shows that (a) smokers carry a higher tumor mutation burden than non-smokers, consistent with exposure to carcinogens inhaled during smoking; (b) among non-smokers, at lower tumor mutation burden lepidic histology predominates in women, whereas solid and micropapillary histologies are seen in men; and (c) in smokers of both sexes, a lower mutation burden is associated with lepidic histology and a higher burden with solid histology.

Next, investigate whether smoking exposure (cigarette pack-years) affects the tissue histology found in patients who smoke.

smoke_plot<-combined_data %>% 
  filter(Smoking_History == "Smoker") %>% 
  ggplot(aes(Person_Cigarette_Smoking_History_Pack_Year_Value, fill = Predominant_Histologic_Subtype))+
  geom_density(alpha = 0.5)+
  ggtitle("Distribution of tumor tissue histology by smoking history in pack-years.")+
  xlab("Smoking history (pack-years)")+
  ylab("Density")+
  labs(fill = "Tissue type")+
  scale_fill_viridis_d()+ 
  facet_wrap(~Sex)+
  theme_bw()+
  theme(plot.title = element_text(size = 12, face = "bold"))
smoke_plot

## Warning: Removed 1 rows containing non-finite values (stat_density).

This plot shows the distribution of tissue histology among smokers by sex. In female smokers, solid histology becomes predominant with increasing cumulative exposure to smoke carcinogens (pack-years), whereas male smokers tend toward micropapillary and papillary histology.
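The column `Person_Cigarette_Smoking_History_Pack_Year_Value` is recorded in pack-years, which is conventionally defined as packs smoked per day multiplied by the number of years smoked. A short worked example follows, using a hypothetical helper `pack_years()` and made-up smoking histories rather than values from the dataset.

```r
# Hypothetical helper: pack-years = packs per day * years smoked.
pack_years <- function(packs_per_day, years_smoked) {
  packs_per_day * years_smoked
}

# Made-up smoking histories for illustration.
one_pack_20y  <- pack_years(1, 20)    # one pack a day for 20 years
half_pack_40y <- pack_years(0.5, 40)  # half a pack a day for 40 years
two_packs_10y <- pack_years(2, 10)    # two packs a day for 10 years

print(c(one_pack_20y, half_pack_40y, two_packs_10y))
```

All three hypothetical histories give 20 pack-years, so the measure reflects cumulative exposure rather than a yearly rate.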

Investigate the relationship between tumor characteristics and tissue histology.

# Plot standardized uptake values against tumor size, with a linear trend line for each tissue type.
tumor_plot<-combined_data %>%
  ggplot(aes(CT_Size, Standardized_uptake_values))+
  geom_smooth(method = "lm")+
  # Facet the plot by tissue type.
  facet_wrap(~Predominant_Histologic_Subtype, scales = "free_x")+
  xlab("Size of the tumor")+
  ylab("Uptake of nutrients by the tumor")+
  ggtitle("Standardized Uptake Value vs Tumor Size by tissue histology.")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
        strip.text = element_text(size = 10))
tumor_plot

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 69 rows containing non-finite values (stat_smooth).

The plot above relates the metabolic activity of the tumor (standardized uptake of nutrients) to tumor size for each histology. The slope of each trendline indicates how strongly uptake changes with tumor size within that tissue type; a flat line suggests uptake that is largely independent of size. This highlights the importance of tumor histology: solid histology appears to be the most metabolically active of the subtypes.
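To put numbers on the trendlines, the slope can be estimated separately for each subtype with `lm()`, which is the same linear fit that `geom_smooth(method = lm)` draws in each facet. The sketch below uses made-up toy data with hypothetical subtypes "A" and "B"; on the real data the columns would be `CT_Size`, `Standardized_uptake_values`, and `Predominant_Histologic_Subtype`.

```r
# Toy data (made-up values): tumor size and uptake for two hypothetical subtypes.
toy <- data.frame(
  subtype = rep(c("A", "B"), each = 4),
  size    = c(1, 2, 3, 4, 1, 2, 3, 4),
  uptake  = c(2, 4, 6, 8, 3, 3.5, 4, 4.5)
)

# Fit uptake ~ size within each subtype and extract the slope coefficient.
slopes <- sapply(split(toy, toy$subtype), function(d) {
  unname(coef(lm(uptake ~ size, data = d))["size"])
})
print(slopes)
```

A steeper slope means uptake rises faster with tumor size in that subtype, while the intercept (not extracted here) would capture baseline differences in uptake between subtypes.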

Next, examine the distribution of tissue histology across the pathologic stages of cancer.

# Aggregate data for plot for histology of tumors observed for different cancer stages.
plotdata<-combined_data %>% 
  select(Sex,Sample_ID,Pathologic_Stage, Predominant_Histologic_Subtype) %>%
  group_by(Sex,Pathologic_Stage,Predominant_Histologic_Subtype) %>% 
  summarize(Number_of_patients = n(), .groups = "drop")
plotdata

## # A tibble: 30 × 4
##    Sex    Pathologic_Stage Predominant_Histologic_Subtype Number_of_patients
##    <chr>             <dbl> <chr>                                       <int>
##  1 Female                1 Acinar                                        189
##  2 Female                1 Lepidic                                        63
##  3 Female                1 Micropapillary                                 11
##  4 Female                1 Papillary                                      16
##  5 Female                1 Solid                                          25
##  6 Female                2 Acinar                                         35
##  7 Female                2 Lepidic                                         3
##  8 Female                2 Micropapillary                                  5
##  9 Female                2 Papillary                                       2
## 10 Female                2 Solid                                          10
## # … with 20 more rows

# Visualize the spread of tissue histology between the different pathological stages of cancer.
pathology_spread<-plotdata %>% 
  ggplot(aes(x = Pathologic_Stage, y = Number_of_patients, text = paste0("Patient Count: ", Number_of_patients), fill = Predominant_Histologic_Subtype))+
  geom_col(position = "dodge")+
  ggtitle("Distribution of tissue histology across stages of cancer.")+
  facet_wrap(~Sex)+
  xlab("Pathologic stage of cancer")+
  ylab("Number of patients")+
  labs(fill = "Tissue type")+
  scale_fill_viridis_d()+
  theme_bw()+
  theme(plot.title = element_text(size = 12))
plot_patient<-ggplotly(pathology_spread, tooltip = "text")
plot_patient

The plot above shows the spread of tissue histology across the stages of cancer recorded in the dataset; the acinar tissue type is the most common in patients of both sexes.

Investigate links between mutated genes and cancer histology in non-smoker women.

# Aggregate data of interest to a new dataset.
mutated_gene_data<-combined_data %>% 
  select(Sample_ID,Tumor_Mutation_Burden,Hugo_Symbols,Sex, Smoking_History, Predominant_Histologic_Subtype) 
mutated_gene_data

## # A tibble: 598 × 6
##    Sample_ID         Tumor_Mutation_Burden Hugo_Symbols    Sex   Smoki…¹ Predo…²
##    <chr>                             <dbl> <chr>           <chr> <chr>   <chr>  
##  1 P-0000219-T01-IM3                   4.5 BRAF, SMAD4, K… Fema… Smoker  Solid  
##  2 P-0000280-T01-IM3                   3.3 EGFR, CRLF2     Fema… Non-Sm… Acinar 
##  3 P-0000627-T02-IM5                   9.8 KRAS, TERT, EP… Fema… Smoker  Acinar 
##  4 P-0000867-T01-IM3                   6.7 BRAF, PIK3CB, … Male  Smoker  Acinar 
##  5 P-0001944-T01-IM3                   5.6 KRAS, FAT1, ST… Fema… Smoker  Acinar 
##  6 P-0002534-T01-IM3                   2.2 EGFR, TP53      Fema… Non-Sm… Acinar 
##  7 P-0002544-T01-IM3                   4.5 KRAS, TP53, TB… Fema… Smoker  Acinar 
##  8 P-0002557-T01-IM3                   3.3 EGFR, SMAD4, F… Fema… Non-Sm… Acinar 
##  9 P-0002768-T02-IM6                   5.3 EGFR, RB1, PIK… Fema… Non-Sm… Papill…
## 10 P-0002880-T02-IM6                   1.8 EGFR, MTOR      Male  Non-Sm… Acinar 
## # … with 588 more rows, and abbreviated variable names ¹Smoking_History,
## #   ²Predominant_Histologic_Subtype

# Split the comma-separated gene symbol string into one column per gene, so genes can be counted across tissue histologies.
wider_data<-mutated_gene_data %>% 
mutate(gene = str_split(Hugo_Symbols, ", ")) %>% 
  unnest_wider(gene, names_sep = "_")
wider_data

## # A tibble: 598 × 70
##    Sample_ID   Tumor…¹ Hugo_…² Sex   Smoki…³ Predo…⁴ gene_1 gene_2 gene_3 gene_4
##    <chr>         <dbl> <chr>   <chr> <chr>   <chr>   <chr>  <chr>  <chr>  <chr> 
##  1 P-0000219-…     4.5 BRAF, … Fema… Smoker  Solid   BRAF   SMAD4  KDM5C  EGFR  
##  2 P-0000280-…     3.3 EGFR, … Fema… Non-Sm… Acinar  EGFR   CRLF2  <NA>   <NA>  
##  3 P-0000627-…     9.8 KRAS, … Fema… Smoker  Acinar  KRAS   TERT   EPHA3  KDR   
##  4 P-0000867-…     6.7 BRAF, … Male  Smoker  Acinar  BRAF   PIK3CB ROS1   KDM5A 
##  5 P-0001944-…     5.6 KRAS, … Fema… Smoker  Acinar  KRAS   FAT1   STK11  KEAP1 
##  6 P-0002534-…     2.2 EGFR, … Fema… Non-Sm… Acinar  EGFR   TP53   <NA>   <NA>  
##  7 P-0002544-…     4.5 KRAS, … Fema… Smoker  Acinar  KRAS   TP53   TBX3   GRIN2A
##  8 P-0002557-…     3.3 EGFR, … Fema… Non-Sm… Acinar  EGFR   SMAD4  FLT3   <NA>  
##  9 P-0002768-…     5.3 EGFR, … Fema… Non-Sm… Papill… EGFR   RB1    PIK3CB ARID2 
## 10 P-0002880-…     1.8 EGFR, … Male  Non-Sm… Acinar  EGFR   MTOR   <NA>   <NA>  
## # … with 588 more rows, 60 more variables: gene_5 <chr>, gene_6 <chr>,
## #   gene_7 <chr>, gene_8 <chr>, gene_9 <chr>, gene_10 <chr>, gene_11 <chr>,
## #   gene_12 <chr>, gene_13 <chr>, gene_14 <chr>, gene_15 <chr>, gene_16 <chr>,
## #   gene_17 <chr>, gene_18 <chr>, gene_19 <chr>, gene_20 <chr>, gene_21 <chr>,
## #   gene_22 <chr>, gene_23 <chr>, gene_24 <chr>, gene_25 <chr>, gene_26 <chr>,
## #   gene_27 <chr>, gene_28 <chr>, gene_29 <chr>, gene_30 <chr>, gene_31 <chr>,
## #   gene_32 <chr>, gene_33 <chr>, gene_34 <chr>, gene_35 <chr>, …

# For finding the most common gene mutation across various tissue histology based on sex and smoking history, pivot data to longer format.
useful_data<-wider_data %>% 
  pivot_longer(cols = starts_with("gene"), names_to = "Gene_type", values_to = "gene_symbol")
useful_data<-useful_data %>% 
  filter(!is.na(gene_symbol))
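The widen-then-lengthen steps above end with one row per sample and gene. As a possible shortcut, `tidyr::separate_rows()` can split the comma-separated string straight into rows. This is a sketch on made-up toy data, not the method used in this analysis; unlike the code above, it replaces the `Hugo_Symbols` column instead of keeping it and does not create a `Gene_type` column.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for mutated_gene_data (hypothetical sample IDs).
toy_genes <- tibble(
  Sample_ID = c("S1", "S2"),
  Hugo_Symbols = c("EGFR, TP53", "KRAS")
)

# One row per sample-gene pair, splitting on the ", " separator.
long_toy <- toy_genes %>%
  separate_rows(Hugo_Symbols, sep = ", ") %>%
  rename(gene_symbol = Hugo_Symbols)
print(long_toy)
```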

# Find the most common mutated gene sorted across females that do not smoke.
gene_counter_female<-useful_data %>% 
  filter(Sex == "Female" & Smoking_History =="Non-Smoker") %>% 
  group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
  count() %>% 
  arrange(-n)
gene_counter_female

## # A tibble: 178 × 5
## # Groups:   Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## #   [178]
##    Sex    Smoking_History Predominant_Histologic_Subtype gene_symbol     n
##    <chr>  <chr>           <chr>                          <chr>       <int>
##  1 Female Non-Smoker      Acinar                         EGFR           46
##  2 Female Non-Smoker      Acinar                         TP53           18
##  3 Female Non-Smoker      Lepidic                        EGFR           12
##  4 Female Non-Smoker      Papillary                      EGFR            7
##  5 Female Non-Smoker      Acinar                         KRAS            6
##  6 Female Non-Smoker      Acinar                         RBM10           6
##  7 Female Non-Smoker      Acinar                         SETD2           6
##  8 Female Non-Smoker      Acinar                         RB1             4
##  9 Female Non-Smoker      Acinar                         SMAD4           4
## 10 Female Non-Smoker      Lepidic                        RBM10           4
## # … with 168 more rows

# Tiled heatmap of the top (1) mutated gene in each tissue histology for non-smoker women.
heatmap_plot<-gene_counter_female %>%
  group_by(Predominant_Histologic_Subtype) %>% 
  slice_max(order_by = n, n = 1) %>%
  ggplot(aes(y = gene_symbol, x = Predominant_Histologic_Subtype))+
  geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
  scale_fill_gradient(low = "blue", high = "red")+
  labs(y = "Gene Symbol",
       x = "Tissue Histology",
       fill = "Gene Count")+
  ggtitle("Most common mutation found in lung tissue in non-smoker women")+
  theme_bw()+
  facet_wrap(~Sex)+
  theme(axis.text.y = element_text(hjust = 4))

## Warning: Ignoring unknown aesthetics: text

ggplotly(heatmap_plot, tooltip = "text")

The plot above shows the most commonly mutated genes in lung adenocarcinomas from non-smoker women, which were found to be EGFR and TP53.

Investigate links between mutated genes and cancer histology in non-smoker men.

# Find the most common mutated gene sorted across Males that do not smoke.
gene_counter_male<-useful_data %>% 
  filter(Sex == "Male" & Smoking_History =="Non-Smoker") %>% 
  group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
  count() %>% 
  arrange(-n)
gene_counter_male

## # A tibble: 71 × 5
## # Groups:   Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## #   [71]
##    Sex   Smoking_History Predominant_Histologic_Subtype gene_symbol     n
##    <chr> <chr>           <chr>                          <chr>       <int>
##  1 Male  Non-Smoker      Acinar                         EGFR            9
##  2 Male  Non-Smoker      Acinar                         TP53            7
##  3 Male  Non-Smoker      Acinar                         MET             4
##  4 Male  Non-Smoker      Papillary                      EGFR            4
##  5 Male  Non-Smoker      Acinar                         CTNNB1          3
##  6 Male  Non-Smoker      Acinar                         KRAS            3
##  7 Male  Non-Smoker      Solid                          EGFR            3
##  8 Male  Non-Smoker      Acinar                         CDKN2A          2
##  9 Male  Non-Smoker      Acinar                         ERBB2           2
## 10 Male  Non-Smoker      Acinar                         MTOR            2
## # … with 61 more rows

#Tiled heatmap generation, taking top (1) mutated gene across all the tissue histology for non-smoker men. 
heatmap_plot<-gene_counter_male %>%
   group_by(Predominant_Histologic_Subtype) %>% 
  slice_max(order_by = n, n = 1) %>%
  ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
           geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
           scale_fill_gradient(low = "blue", high = "red")+
           labs(y = "Gene Symbol",
                x = "Tissue Histology",
                fill = "Gene Count")+
    ggtitle("Most common mutation found across lung tissue in non-smoker men")+
  theme_bw()+
  theme(axis.text.y = element_text (hjust = 4))

## Warning: Ignoring unknown aesthetics: text

ggplotly(heatmap_plot, tooltip = "text")

The plot above shows the most frequently mutated gene in each tissue histology among non-smoker men. No micropapillary histology was observed in this group. EGFR was the most commonly mutated gene in non-smokers of both sexes across all tissue histologies.

Investigate links between mutated genes and cancer histology in female smokers.

# For finding the most common gene mutation across various tissue histology based on sex and smoking history, pivot data to longer format.
useful_data<-wider_data %>% 
  pivot_longer(cols = starts_with("gene"), names_to = "Gene_type", values_to = "gene_symbol")
useful_data<-useful_data %>% 
  filter(!is.na(gene_symbol))

# Find the most common mutated gene sorted across females who smoke.
gene_counter_female_smoke<-useful_data %>% 
  filter(Sex == "Female" & Smoking_History =="Smoker") %>% 
  group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
  count() %>% 
  arrange(-n)
gene_counter_female_smoke

## # A tibble: 925 × 5
## # Groups:   Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## #   [925]
##    Sex    Smoking_History Predominant_Histologic_Subtype gene_symbol     n
##    <chr>  <chr>           <chr>                          <chr>       <int>
##  1 Female Smoker          Acinar                         KRAS          100
##  2 Female Smoker          Acinar                         TP53           72
##  3 Female Smoker          Acinar                         EGFR           39
##  4 Female Smoker          Solid                          TP53           30
##  5 Female Smoker          Acinar                         RBM10          29
##  6 Female Smoker          Acinar                         STK11          26
##  7 Female Smoker          Lepidic                        EGFR           22
##  8 Female Smoker          Acinar                         EPHA3          21
##  9 Female Smoker          Lepidic                        KRAS           19
## 10 Female Smoker          Solid                          KRAS           19
## # … with 915 more rows

# Tiled heatmap of the top (1) mutated gene in each tissue histology for smoker women.
heatmap_plot<-gene_counter_female_smoke %>%
   group_by(Predominant_Histologic_Subtype) %>% 
  slice_max(order_by = n, n = 1) %>%
  ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
           geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
           scale_fill_gradient(low = "blue", high = "red")+
           labs(y = "Gene Symbol",
                x = "Tissue Histology",
                fill = "Gene Count")+
   ggtitle("Most common mutation found in lung tissue in smoker women")+
  theme_bw()+
  theme(axis.text.y = element_text (hjust = 4))

## Warning: Ignoring unknown aesthetics: text

ggplotly(heatmap_plot, tooltip = "text")

The plot above shows the most commonly mutated genes in lung adenocarcinomas from female smokers, which were found to be KRAS, EGFR, STK11 and TP53 across the various tissue types.

Investigate links between mutated genes and cancer histology in male smokers.

# Find the most common mutated gene sorted across Males that do smoke.
gene_counter_male_smoke<-useful_data %>% 
  filter(Sex == "Male" & Smoking_History =="Smoker") %>% 
  group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
  count() %>% 
  arrange(-n)
gene_counter_male_smoke

## # A tibble: 731 × 5
## # Groups:   Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## #   [731]
##    Sex   Smoking_History Predominant_Histologic_Subtype gene_symbol     n
##    <chr> <chr>           <chr>                          <chr>       <int>
##  1 Male  Smoker          Acinar                         KRAS           39
##  2 Male  Smoker          Acinar                         TP53           37
##  3 Male  Smoker          Acinar                         STK11          24
##  4 Male  Smoker          Solid                          TP53           17
##  5 Male  Smoker          Acinar                         EGFR           16
##  6 Male  Smoker          Acinar                         KEAP1          16
##  7 Male  Smoker          Acinar                         FAT1           14
##  8 Male  Smoker          Acinar                         ATM            13
##  9 Male  Smoker          Acinar                         RBM10          12
## 10 Male  Smoker          Papillary                      KRAS           11
## # … with 721 more rows

# Tiled heatmap of the top (1) mutated gene in each tissue histology for smoker men.
heatmap_plot<-gene_counter_male_smoke %>%
   group_by(Predominant_Histologic_Subtype) %>% 
  slice_max(order_by = n, n = 1) %>%
  ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
           geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
           scale_fill_gradient(low = "blue", high = "red")+
           labs(y = "Gene Symbol",
                x = "Tissue Histology",
                fill = "Gene Count")+
    ggtitle("Most common mutation found across lung tissue in smoker men")+
  theme_bw()+
  theme(axis.text.y = element_text (hjust = 4))

## Warning: Ignoring unknown aesthetics: text

ggplotly(heatmap_plot, tooltip = "text")

The plot above shows the most frequently mutated gene in each tissue histology among male smokers. KRAS and TP53 were the most commonly mutated genes in smokers of both sexes across the tissue histologies.
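The four gene-count steps above differ only in the sex and smoking filters, so they could be folded into one function. The sketch below is one possible refactoring, not the code used in this analysis: the helper name `top_gene_by_subtype()` and the toy rows are hypothetical, and only the column names are taken from `useful_data`.

```r
library(dplyr)

# Hypothetical helper: most frequent gene per histologic subtype
# for one sex / smoking-history group.
top_gene_by_subtype <- function(long_data, sex, smoking) {
  long_data %>%
    filter(Sex == sex, Smoking_History == smoking) %>%
    count(Predominant_Histologic_Subtype, gene_symbol) %>%
    group_by(Predominant_Histologic_Subtype) %>%
    slice_max(order_by = n, n = 1) %>%  # tied genes are all kept by default
    ungroup()
}

# Toy stand-in for useful_data (made-up rows).
toy_long <- tibble(
  Sex = "Female",
  Smoking_History = "Smoker",
  Predominant_Histologic_Subtype = c("Acinar", "Acinar", "Acinar",
                                     "Solid", "Solid", "Solid"),
  gene_symbol = c("KRAS", "KRAS", "TP53", "TP53", "TP53", "EGFR")
)

top_toy <- top_gene_by_subtype(toy_long, "Female", "Smoker")
print(top_toy)
```

Since `slice_max()` keeps ties by default, a subtype can contribute more than one gene to the heatmap when two genes share the highest count.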

Tissue Histology of Lung Cancer

Victor Varenium Singh

2025-04-08

Introduction