A trend noted in recent news is that non-smokers, people who have never smoked a cigarette in their lives (especially Asian women), are developing lung cancer. According to Yale Medicine, the cancer type involved is adenocarcinoma, a form of non-small cell lung cancer. There is no single known cause for the development of this cancer, so I looked at a research data set that brings together the physiological characteristics of the tumor tissue, the habits of the patients, and genomic information obtained through next-generation sequencing.
What features distinguish tumor tissue histology, in both the physiological and the genomic data, when compared across metrics such as sex, smoking habits, tumor characteristics, and mutated genes?
# Load the required packages.
library(pacman)
p_load(ggplot2,babynames,tidyverse,ggrepel,viridis,ggthemes,cowplot, plotly, pheatmap,dplyr,tidyr)
# Load the clinical cancer data downloaded from cBioPortal.
cancerdata<-read_tsv(file = "luad_mskcc_2020_clinical_data.tsv")
## Rows: 604 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (14): Study ID, Patient ID, Sample ID, Cancer Type, Cancer Type Detailed...
## dbl (15): Age At Surgery, CT Size, Diffusion lung capacity for carbon monoxi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop columns that add no information: Study ID and Patient ID (Sample ID is kept as the key), plus Cancer Type and Cancer Type Detailed, whose entries are virtually identical across rows.
trim_cancer_data<-cancerdata %>%
select(-(`Study ID`:`Patient ID`)) %>%
select(-(`Cancer Type`:`Cancer Type Detailed`))
view(trim_cancer_data)
# For ease of coding, replace the spaces in the column names with underscores.
colnames(trim_cancer_data)<-gsub(" ", "_", colnames(trim_cancer_data))
colnames(trim_cancer_data)
## [1] "Sample_ID"
## [2] "Age_At_Surgery"
## [3] "Clinical_Nodal_Status"
## [4] "CT_Size"
## [5] "Diffusion_lung_capacity_for_carbon_monoxide"
## [6] "FEV1"
## [7] "FGA_Facets"
## [8] "Fraction_Genome_Altered"
## [9] "Mutation_Count"
## [10] "Oncotree_Code"
## [11] "Overall_Survival_(Months)"
## [12] "Overall_Survival_Status"
## [13] "Pathologic_Stage"
## [14] "Pathologic_Nodal_Status"
## [15] "Predominant_Histologic_Subtype"
## [16] "Relapse_Free_Status_(Months)"
## [17] "Relapse_Free_Status"
## [18] "Number_of_Samples_Per_Patient"
## [19] "Sex"
## [20] "Smoking_History"
## [21] "Person_Cigarette_Smoking_History_Pack_Year_Value"
## [22] "Subtype_Group"
## [23] "Standardized_uptake_values"
## [24] "Tumor_Mutation_Burden"
## [25] "TMB_(nonsynonymous)"
# Relabel the entries in the Smoking_History column ("Ever Smoker" and "Never Smoker") to avoid confusion.
trim_cancer_data<-trim_cancer_data %>%
mutate(Smoking_History = recode(Smoking_History, "Ever Smoker" = "Smoker", "Never Smoker" = "Non-Smoker"))
head(trim_cancer_data)
## # A tibble: 6 × 25
## Sample…¹ Age_A…² Clini…³ CT_Size Diffu…⁴ FEV1 FGA_F…⁵ Fract…⁶ Mutat…⁷ Oncot…⁸
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 P-00002… 52 Negati… 3.4 88 108 0.190 0.0001 4 LUAD
## 2 P-00002… 62 Negati… 2.1 61 76 0.571 0.130 2 LUAD
## 3 P-00006… 68 Negati… 1.3 47 90 0.182 0.0303 10 LUAD
## 4 P-00008… 64 Negati… 3.6 85 89 0.556 0.304 6 LUAD
## 5 P-00019… 67 Negati… 1.3 NA NA 0.427 0.0008 5 LUAD
## 6 P-00025… 73 Negati… 3.4 96 102 0.343 0.134 2 LUAD
## # … with 15 more variables: `Overall_Survival_(Months)` <dbl>,
## # Overall_Survival_Status <chr>, Pathologic_Stage <dbl>,
## # Pathologic_Nodal_Status <chr>, Predominant_Histologic_Subtype <chr>,
## # `Relapse_Free_Status_(Months)` <dbl>, Relapse_Free_Status <chr>,
## # Number_of_Samples_Per_Patient <dbl>, Sex <chr>, Smoking_History <chr>,
## # Person_Cigarette_Smoking_History_Pack_Year_Value <dbl>,
## # Subtype_Group <chr>, Standardized_uptake_values <dbl>, …
# Load in genomic data.
genedata<-read_csv(file = "data_mutations.csv")
## Rows: 4814 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): Hugo_Symbol, Center, NCBI_Build, Chromosome, Strand, Variant_Class...
## dbl (9): Entrez_Gene_Id, Start_Position, End_Position, t_ref_count, t_alt_c...
## lgl (90): dbSNP_Val_Status, Matched_Norm_Sample_Barcode, Match_Norm_Seq_Alle...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Trim the gene data.
trim_gene_data<-genedata[, c("Hugo_Symbol","Tumor_Sample_Barcode")]
head(trim_gene_data)
## # A tibble: 6 × 2
## Hugo_Symbol Tumor_Sample_Barcode
## <chr> <chr>
## 1 MED12 P-0010976-T01-IM5
## 2 EGFR P-0010976-T01-IM5
## 3 EGFR P-0000280-T01-IM3
## 4 CRLF2 P-0000280-T01-IM3
## 5 CRLF2 P-0000280-T01-IM3
## 6 EP300 P-0010841-T01-IM5
# Collapse the trimmed gene entries so that each tumor sample barcode (sample ID) has a single row, with its unique gene symbols pasted into one comma-separated string.
transformed_gene_data<-trim_gene_data %>%
group_by(Tumor_Sample_Barcode) %>%
summarize(Hugo_Symbols = paste(unique(Hugo_Symbol), collapse = ", ")) %>%
rename(Sample_ID = Tumor_Sample_Barcode)
head(transformed_gene_data)
## # A tibble: 6 × 2
## Sample_ID Hugo_Symbols
## <chr> <chr>
## 1 P-0000219-T01-IM3 BRAF, SMAD4, KDM5C, EGFR
## 2 P-0000280-T01-IM3 EGFR, CRLF2
## 3 P-0000627-T02-IM5 KRAS, TERT, EPHA3, KDR, FAT1, FGFR4, TP53, ZRSR2, BCOR
## 4 P-0000867-T01-IM3 BRAF, PIK3CB, ROS1, KDM5A, ETV6, CTCF
## 5 P-0001944-T01-IM3 KRAS, FAT1, STK11, KEAP1, AR
## 6 P-0002534-T01-IM3 EGFR, TP53
# Generate a new table with mutated gene data (genomic data) and original clinical data.
combined_data<-inner_join(trim_cancer_data,transformed_gene_data, by = "Sample_ID")
head(combined_data)
## # A tibble: 6 × 26
## Sample…¹ Age_A…² Clini…³ CT_Size Diffu…⁴ FEV1 FGA_F…⁵ Fract…⁶ Mutat…⁷ Oncot…⁸
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 P-00002… 52 Negati… 3.4 88 108 0.190 0.0001 4 LUAD
## 2 P-00002… 62 Negati… 2.1 61 76 0.571 0.130 2 LUAD
## 3 P-00006… 68 Negati… 1.3 47 90 0.182 0.0303 10 LUAD
## 4 P-00008… 64 Negati… 3.6 85 89 0.556 0.304 6 LUAD
## 5 P-00019… 67 Negati… 1.3 NA NA 0.427 0.0008 5 LUAD
## 6 P-00025… 73 Negati… 3.4 96 102 0.343 0.134 2 LUAD
## # … with 16 more variables: `Overall_Survival_(Months)` <dbl>,
## # Overall_Survival_Status <chr>, Pathologic_Stage <dbl>,
## # Pathologic_Nodal_Status <chr>, Predominant_Histologic_Subtype <chr>,
## # `Relapse_Free_Status_(Months)` <dbl>, Relapse_Free_Status <chr>,
## # Number_of_Samples_Per_Patient <dbl>, Sex <chr>, Smoking_History <chr>,
## # Person_Cigarette_Smoking_History_Pack_Year_Value <dbl>,
## # Subtype_Group <chr>, Standardized_uptake_values <dbl>, …
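The clinical file was read with 604 rows, while the tables derived from the join report 598 rows, so a few samples appear to have no entry in the mutation file. A minimal sketch of how the samples dropped by an inner join could be listed, using toy tables with made-up IDs (the names `clinical_toy`, `genes_toy`, and `dropped_toy` are illustrative and not part of the original analysis):

```r
library(dplyr)

# Toy stand-ins for trim_cancer_data and transformed_gene_data (IDs are made up).
clinical_toy <- tibble(
  Sample_ID = c("S1", "S2", "S3"),
  Sex = c("Female", "Male", "Female")
)
genes_toy <- tibble(
  Sample_ID = c("S1", "S3"),
  Hugo_Symbols = c("EGFR, TP53", "KRAS")
)

# inner_join() keeps only the samples present in both tables.
joined_toy <- inner_join(clinical_toy, genes_toy, by = "Sample_ID")

# anti_join() lists the clinical samples that have no mutation record.
dropped_toy <- anti_join(clinical_toy, genes_toy, by = "Sample_ID")
dropped_toy
```

Running the same `anti_join()` on the real tables would show which clinical samples are missing from the genomic data before any plots are drawn.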
TMBplot<-combined_data %>%
ggplot(aes(Tumor_Mutation_Burden, fill=Predominant_Histologic_Subtype))+
facet_grid(Smoking_History~Sex)+
geom_density(alpha = 0.5)+
ggtitle("Mutations in tumor tissue vs tumor histology \n between smoker and non-smokers by sex.")+
xlab("Number of mutations in tumor tissue")+
ylab("Density")+
xlim(0,40)+
labs(fill = "Tissue type")+
scale_fill_viridis_d()+
theme_minimal()+
theme(plot.title = element_text(size = 12, face = "bold"),
strip.text = element_text(size = 10, face = "bold"))
TMBplot
## Warning: Removed 7 rows containing non-finite values (stat_density).
The plot shows that (a) smokers have a higher tumor mutation burden than non-smokers, which is consistent with exposure to carcinogens inhaled during smoking; (b) among non-smokers at lower tumor mutation burden, lepidic histology predominates in women, whereas solid and micropapillary histologies are seen in men; and (c) among smokers of both sexes, a lower mutation burden is associated with lepidic histology and a higher burden with solid histology.
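The smoker versus non-smoker difference is read off the density plot by eye. One way it could be checked numerically is a Wilcoxon rank-sum test, sketched below with base R. The ten values are copied from the first ten rows of the `mutated_gene_data` printout further down, so they only illustrate the call and are far too few for a conclusion; `tmb_toy` is an illustrative name.

```r
# Tumor mutation burden values copied from the first ten printed rows of
# mutated_gene_data; ten rows only illustrate the call, not the result.
tmb_toy <- data.frame(
  Smoking_History = c("Smoker", "Non-Smoker", "Smoker", "Smoker", "Smoker",
                      "Non-Smoker", "Smoker", "Non-Smoker", "Non-Smoker", "Non-Smoker"),
  Tumor_Mutation_Burden = c(4.5, 3.3, 9.8, 6.7, 5.6, 2.2, 4.5, 3.3, 5.3, 1.8)
)

# Median burden per group, for a quick descriptive comparison.
group_medians <- tapply(tmb_toy$Tumor_Mutation_Burden, tmb_toy$Smoking_History, median)
group_medians

# Rank-based two-sample test; exact = FALSE because the values contain ties.
tmb_test <- wilcox.test(Tumor_Mutation_Burden ~ Smoking_History,
                        data = tmb_toy, exact = FALSE)
tmb_test$p.value
```

On the full data the same call would take `combined_data` as the `data` argument.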
smoke_plot<-combined_data %>%
filter(Smoking_History == "Smoker") %>%
ggplot(aes(Person_Cigarette_Smoking_History_Pack_Year_Value, fill = Predominant_Histologic_Subtype))+
geom_density(alpha = 0.5)+
ggtitle("Distribution of tumor tissue histology by pack-years of smoking.")+
# Pack-years measure cumulative exposure (packs per day multiplied by years smoked), not packs per year.
xlab("Smoking history (pack-years)")+
ylab("Density")+
labs(fill = "Tissue type")+
scale_fill_viridis_d()+
facet_wrap(~Sex)+
theme_bw()+
theme(plot.title = element_text(size = 12, face = "bold"))
smoke_plot
## Warning: Removed 1 rows containing non-finite values (stat_density).
This plot shows the distribution of tumor histology among smokers by sex. In female smokers, solid histology becomes predominant with increasing exposure to smoke carcinogens (higher pack-years), whereas male smokers develop micropapillary to papillary histology.
# Plot standardized uptake value against tumor size, with a fitted linear trend line for each tissue type.
tumor_plot<-combined_data %>%
ggplot(aes(CT_Size, Standardized_uptake_values))+
geom_smooth(method = lm)+
# Facet the plot based upon tissue type
facet_wrap(~Predominant_Histologic_Subtype, scales = "free_x")+
xlab("Size of the tumor")+
ylab("Standardized uptake value (SUV)")+
ggtitle("Standardized Uptake Value vs Tumor Size by tissue histology.")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
strip.text = element_text(size = 10))
tumor_plot
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 69 rows containing non-finite values (stat_smooth).
The plot above relates the metabolic activity of the tumor (standardized uptake value) to tumor size for each histology. The slope of the trend line in each panel shows how strongly uptake changes with tumor size for that tissue type; a flat line means uptake is largely independent of size. This highlights the importance of tumor histology: solid histology appears to be the most metabolically active of the subtypes.
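The trend lines come from `geom_smooth(method = lm)`, which fits a separate linear model inside each facet. A minimal sketch of extracting those slopes as numbers, assuming a toy data frame with made-up values (`suv_toy` and its columns are illustrative, not the study data):

```r
# Made-up values for two histologies, used only to illustrate the fit.
suv_toy <- data.frame(
  histology = rep(c("Lepidic", "Solid"), each = 4),
  ct_size = c(1, 2, 3, 4, 1, 2, 3, 4),
  suv = c(1.0, 1.5, 2.0, 2.5, 4, 6, 8, 10)
)

# Fit suv ~ ct_size separately for each histology and keep the slope.
slopes <- sapply(
  split(suv_toy, suv_toy$histology),
  function(d) coef(lm(suv ~ ct_size, data = d))[["ct_size"]]
)
slopes
```

Applied to `combined_data` (splitting on `Predominant_Histologic_Subtype`), this would give one slope per panel, which makes the comparison between tissue types easier to state than reading it off the plot.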
# Aggregate data for plot for histology of tumors observed for different cancer stages.
plotdata<-combined_data %>%
select(Sex,Sample_ID,Pathologic_Stage, Predominant_Histologic_Subtype) %>%
group_by(Sex,Pathologic_Stage,Predominant_Histologic_Subtype) %>%
summarize(Number_of_patients = n(), .groups = "drop")
plotdata
## # A tibble: 30 × 4
## Sex Pathologic_Stage Predominant_Histologic_Subtype Number_of_patients
## <chr> <dbl> <chr> <int>
## 1 Female 1 Acinar 189
## 2 Female 1 Lepidic 63
## 3 Female 1 Micropapillary 11
## 4 Female 1 Papillary 16
## 5 Female 1 Solid 25
## 6 Female 2 Acinar 35
## 7 Female 2 Lepidic 3
## 8 Female 2 Micropapillary 5
## 9 Female 2 Papillary 2
## 10 Female 2 Solid 10
## # … with 20 more rows
# Visualize the spread of tissue histology between the different pathological stages of cancer.
pathology_spread<-plotdata %>%
ggplot(aes(x = Pathologic_Stage, y = Number_of_patients, text = paste0("Patient Count: ", Number_of_patients), fill = Predominant_Histologic_Subtype))+
geom_col(position = "dodge")+
ggtitle("Distribution of tissue histology across stages of cancer.")+
facet_wrap(~Sex)+
xlab("Pathologic stage of cancer")+
ylab("Number of patients")+
labs(fill = "Tissue type")+
scale_fill_viridis_d()+
theme_bw()+
theme(plot.title = element_text(size = 12))
plot_patient<-ggplotly(pathology_spread, tooltip = "text")
plot_patient
The plot above shows the distribution of tissue histology across the cancer stages recorded in the dataset; acinar histology is the most common in patients of both sexes.
# Aggregate data of interest to a new dataset.
mutated_gene_data<-combined_data %>%
select(Sample_ID,Tumor_Mutation_Burden,Hugo_Symbols,Sex, Smoking_History, Predominant_Histologic_Subtype)
mutated_gene_data
## # A tibble: 598 × 6
## Sample_ID Tumor_Mutation_Burden Hugo_Symbols Sex Smoki…¹ Predo…²
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 P-0000219-T01-IM3 4.5 BRAF, SMAD4, K… Fema… Smoker Solid
## 2 P-0000280-T01-IM3 3.3 EGFR, CRLF2 Fema… Non-Sm… Acinar
## 3 P-0000627-T02-IM5 9.8 KRAS, TERT, EP… Fema… Smoker Acinar
## 4 P-0000867-T01-IM3 6.7 BRAF, PIK3CB, … Male Smoker Acinar
## 5 P-0001944-T01-IM3 5.6 KRAS, FAT1, ST… Fema… Smoker Acinar
## 6 P-0002534-T01-IM3 2.2 EGFR, TP53 Fema… Non-Sm… Acinar
## 7 P-0002544-T01-IM3 4.5 KRAS, TP53, TB… Fema… Smoker Acinar
## 8 P-0002557-T01-IM3 3.3 EGFR, SMAD4, F… Fema… Non-Sm… Acinar
## 9 P-0002768-T02-IM6 5.3 EGFR, RB1, PIK… Fema… Non-Sm… Papill…
## 10 P-0002880-T02-IM6 1.8 EGFR, MTOR Male Non-Sm… Acinar
## # … with 588 more rows, and abbreviated variable names ¹Smoking_History,
## # ²Predominant_Histologic_Subtype
# Split the comma-separated gene symbols into separate columns so that genes can be counted across tissue histologies.
wider_data<-mutated_gene_data %>%
mutate(gene = str_split(Hugo_Symbols, ", ")) %>%
unnest_wider(gene, names_sep = "_")
wider_data
## # A tibble: 598 × 70
## Sample_ID Tumor…¹ Hugo_…² Sex Smoki…³ Predo…⁴ gene_1 gene_2 gene_3 gene_4
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 P-0000219-… 4.5 BRAF, … Fema… Smoker Solid BRAF SMAD4 KDM5C EGFR
## 2 P-0000280-… 3.3 EGFR, … Fema… Non-Sm… Acinar EGFR CRLF2 <NA> <NA>
## 3 P-0000627-… 9.8 KRAS, … Fema… Smoker Acinar KRAS TERT EPHA3 KDR
## 4 P-0000867-… 6.7 BRAF, … Male Smoker Acinar BRAF PIK3CB ROS1 KDM5A
## 5 P-0001944-… 5.6 KRAS, … Fema… Smoker Acinar KRAS FAT1 STK11 KEAP1
## 6 P-0002534-… 2.2 EGFR, … Fema… Non-Sm… Acinar EGFR TP53 <NA> <NA>
## 7 P-0002544-… 4.5 KRAS, … Fema… Smoker Acinar KRAS TP53 TBX3 GRIN2A
## 8 P-0002557-… 3.3 EGFR, … Fema… Non-Sm… Acinar EGFR SMAD4 FLT3 <NA>
## 9 P-0002768-… 5.3 EGFR, … Fema… Non-Sm… Papill… EGFR RB1 PIK3CB ARID2
## 10 P-0002880-… 1.8 EGFR, … Male Non-Sm… Acinar EGFR MTOR <NA> <NA>
## # … with 588 more rows, 60 more variables: gene_5 <chr>, gene_6 <chr>,
## # gene_7 <chr>, gene_8 <chr>, gene_9 <chr>, gene_10 <chr>, gene_11 <chr>,
## # gene_12 <chr>, gene_13 <chr>, gene_14 <chr>, gene_15 <chr>, gene_16 <chr>,
## # gene_17 <chr>, gene_18 <chr>, gene_19 <chr>, gene_20 <chr>, gene_21 <chr>,
## # gene_22 <chr>, gene_23 <chr>, gene_24 <chr>, gene_25 <chr>, gene_26 <chr>,
## # gene_27 <chr>, gene_28 <chr>, gene_29 <chr>, gene_30 <chr>, gene_31 <chr>,
## # gene_32 <chr>, gene_33 <chr>, gene_34 <chr>, gene_35 <chr>, …
# For finding the most common gene mutation across various tissue histology based on sex and smoking history, pivot data to longer format.
useful_data<-wider_data %>%
pivot_longer(cols = starts_with("gene"), names_to = "Gene_type", values_to = "gene_symbol")
useful_data<-useful_data %>%
filter(!is.na(gene_symbol))
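The widen-then-lengthen route above creates dozens of mostly empty `gene_*` columns before they are pivoted back. As a hedged alternative, `tidyr::separate_rows()` can go from the comma-separated string to one row per gene in a single step. The sketch below uses two rows copied from the `transformed_gene_data` printout; `genes_toy` and `long_toy` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the transformed_gene_data printout.
genes_toy <- tibble(
  Sample_ID = c("P-0000280-T01-IM3", "P-0002534-T01-IM3"),
  Hugo_Symbols = c("EGFR, CRLF2", "EGFR, TP53")
)

# One row per sample and gene; no intermediate wide table and no NA padding.
long_toy <- genes_toy %>%
  separate_rows(Hugo_Symbols, sep = ", ") %>%
  rename(gene_symbol = Hugo_Symbols)
long_toy
```

This version does not keep the original `Hugo_Symbols` string or a `Gene_type` column, so it is a substitute only if those are not needed later.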
# Find the most common mutated gene sorted across females that do not smoke.
gene_counter_female<-useful_data %>%
filter(Sex == "Female" & Smoking_History =="Non-Smoker") %>%
group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
count() %>%
arrange(-n)
gene_counter_female
## # A tibble: 178 × 5
## # Groups: Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## # [178]
## Sex Smoking_History Predominant_Histologic_Subtype gene_symbol n
## <chr> <chr> <chr> <chr> <int>
## 1 Female Non-Smoker Acinar EGFR 46
## 2 Female Non-Smoker Acinar TP53 18
## 3 Female Non-Smoker Lepidic EGFR 12
## 4 Female Non-Smoker Papillary EGFR 7
## 5 Female Non-Smoker Acinar KRAS 6
## 6 Female Non-Smoker Acinar RBM10 6
## 7 Female Non-Smoker Acinar SETD2 6
## 8 Female Non-Smoker Acinar RB1 4
## 9 Female Non-Smoker Acinar SMAD4 4
## 10 Female Non-Smoker Lepidic RBM10 4
## # … with 168 more rows
#Tiled heatmap generation, taking top (1) mutated gene across all the tissue histology for non-smoker women.
heatmap_plot<-gene_counter_female %>%
group_by(Predominant_Histologic_Subtype) %>%
slice_max(order_by = n, n = 1) %>%
ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
scale_fill_gradient(low = "blue", high = "red")+
labs(y = "Gene Symbol",
x = "Tissue Histology",
fill = "Gene Count")+
ggtitle("Most common mutation found in lung tissue in non-smoker women")+
theme_bw()+
facet_wrap(~Sex)+
theme(axis.text.y = element_text (hjust = 4))
## Warning: Ignoring unknown aesthetics: text
ggplotly(heatmap_plot, tooltip = "text")
The heatmap above shows the most commonly mutated gene for each tissue histology in non-smoking women with adenocarcinoma; the top genes were EGFR and TP53.
# Find the most common mutated gene sorted across Males that do not smoke.
gene_counter_male<-useful_data %>%
filter(Sex == "Male" & Smoking_History =="Non-Smoker") %>%
group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
count() %>%
arrange(-n)
gene_counter_male
## # A tibble: 71 × 5
## # Groups: Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## # [71]
## Sex Smoking_History Predominant_Histologic_Subtype gene_symbol n
## <chr> <chr> <chr> <chr> <int>
## 1 Male Non-Smoker Acinar EGFR 9
## 2 Male Non-Smoker Acinar TP53 7
## 3 Male Non-Smoker Acinar MET 4
## 4 Male Non-Smoker Papillary EGFR 4
## 5 Male Non-Smoker Acinar CTNNB1 3
## 6 Male Non-Smoker Acinar KRAS 3
## 7 Male Non-Smoker Solid EGFR 3
## 8 Male Non-Smoker Acinar CDKN2A 2
## 9 Male Non-Smoker Acinar ERBB2 2
## 10 Male Non-Smoker Acinar MTOR 2
## # … with 61 more rows
#Tiled heatmap generation, taking top (1) mutated gene across all the tissue histology for non-smoker men.
heatmap_plot<-gene_counter_male %>%
group_by(Predominant_Histologic_Subtype) %>%
slice_max(order_by = n, n = 1) %>%
ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
scale_fill_gradient(low = "blue", high = "red")+
labs(y = "Gene Symbol",
x = "Tissue Histology",
fill = "Gene Count")+
ggtitle("Most common mutation found across lung tissue in non-smoker men")+
theme_bw()+
theme(axis.text.y = element_text (hjust = 4))
## Warning: Ignoring unknown aesthetics: text
ggplotly(heatmap_plot, tooltip = "text")
The heatmap above shows the top mutated gene for each tissue histology in non-smoking men. No micropapillary histology was observed in this group. EGFR was the most commonly mutated gene in non-smokers of both sexes across the tissue histologies.
# The long-format table useful_data built earlier is reused here for the smoker groups.
# Find the most common mutated gene sorted across females who smoke.
gene_counter_female_smoke<-useful_data %>%
filter(Sex == "Female" & Smoking_History =="Smoker") %>%
group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
count() %>%
arrange(-n)
gene_counter_female_smoke
## # A tibble: 925 × 5
## # Groups: Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## # [925]
## Sex Smoking_History Predominant_Histologic_Subtype gene_symbol n
## <chr> <chr> <chr> <chr> <int>
## 1 Female Smoker Acinar KRAS 100
## 2 Female Smoker Acinar TP53 72
## 3 Female Smoker Acinar EGFR 39
## 4 Female Smoker Solid TP53 30
## 5 Female Smoker Acinar RBM10 29
## 6 Female Smoker Acinar STK11 26
## 7 Female Smoker Lepidic EGFR 22
## 8 Female Smoker Acinar EPHA3 21
## 9 Female Smoker Lepidic KRAS 19
## 10 Female Smoker Solid KRAS 19
## # … with 915 more rows
# Tiled heatmap generation, taking the top (1) mutated gene for each tissue histology in smoker women.
heatmap_plot<-gene_counter_female_smoke %>%
group_by(Predominant_Histologic_Subtype) %>%
slice_max(order_by = n, n = 1) %>%
ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
scale_fill_gradient(low = "blue", high = "red")+
labs(y = "Gene Symbol",
x = "Tissue Histology",
fill = "Gene Count")+
ggtitle("Most common mutation found in lung tissue in smoker women")+
theme_bw()+
theme(axis.text.y = element_text (hjust = 4))
## Warning: Ignoring unknown aesthetics: text
ggplotly(heatmap_plot, tooltip = "text")
The heatmap above shows the most commonly mutated genes in female smokers with adenocarcinoma, which were KRAS, EGFR, STK11, and TP53 across the various tissue types.
# Find the most common mutated gene sorted across Males that do smoke.
gene_counter_male_smoke<-useful_data %>%
filter(Sex == "Male" & Smoking_History =="Smoker") %>%
group_by(Sex, Smoking_History, Predominant_Histologic_Subtype,gene_symbol) %>%
count() %>%
arrange(-n)
gene_counter_male_smoke
## # A tibble: 731 × 5
## # Groups: Sex, Smoking_History, Predominant_Histologic_Subtype, gene_symbol
## # [731]
## Sex Smoking_History Predominant_Histologic_Subtype gene_symbol n
## <chr> <chr> <chr> <chr> <int>
## 1 Male Smoker Acinar KRAS 39
## 2 Male Smoker Acinar TP53 37
## 3 Male Smoker Acinar STK11 24
## 4 Male Smoker Solid TP53 17
## 5 Male Smoker Acinar EGFR 16
## 6 Male Smoker Acinar KEAP1 16
## 7 Male Smoker Acinar FAT1 14
## 8 Male Smoker Acinar ATM 13
## 9 Male Smoker Acinar RBM10 12
## 10 Male Smoker Papillary KRAS 11
## # … with 721 more rows
# Tiled heatmap generation, taking the top (1) mutated gene for each tissue histology in smoker men.
heatmap_plot<-gene_counter_male_smoke %>%
group_by(Predominant_Histologic_Subtype) %>%
slice_max(order_by = n, n = 1) %>%
ggplot(aes(y = gene_symbol,x = Predominant_Histologic_Subtype))+
geom_tile(aes(fill = n, text = paste0("Count: ", n)), color = "blue")+
scale_fill_gradient(low = "blue", high = "red")+
labs(y = "Gene Symbol",
x = "Tissue Histology",
fill = "Gene Count")+
ggtitle("Most common mutation found across lung tissue in smoker men")+
theme_bw()+
theme(axis.text.y = element_text (hjust = 4))
## Warning: Ignoring unknown aesthetics: text
ggplotly(heatmap_plot, tooltip = "text")
The heatmap above shows the top mutated gene for each tissue histology in male smokers. KRAS and TP53 were the most commonly mutated genes among smokers of both sexes across the tissue histologies.
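The four gene-count blocks above differ only in the sex and smoking filter, which is how the copied comments drifted out of sync with the code. A minimal sketch of a shared helper is shown below; `top_gene_by_histology()` is a hypothetical function, not part of the original analysis, and the toy rows are made up to show the call.

```r
library(dplyr)

# Hypothetical helper: most frequently mutated gene per histologic subtype
# for one sex and smoking group. Ties are kept (slice_max default).
top_gene_by_histology <- function(data, target_sex, target_smoking) {
  data %>%
    filter(Sex == target_sex, Smoking_History == target_smoking) %>%
    count(Predominant_Histologic_Subtype, gene_symbol, name = "n") %>%
    group_by(Predominant_Histologic_Subtype) %>%
    slice_max(order_by = n, n = 1) %>%
    ungroup()
}

# Made-up long-format rows with the same column names as useful_data.
useful_toy <- tibble(
  Sex = c("Female", "Female", "Female", "Male"),
  Smoking_History = c("Non-Smoker", "Non-Smoker", "Non-Smoker", "Smoker"),
  Predominant_Histologic_Subtype = c("Acinar", "Acinar", "Acinar", "Solid"),
  gene_symbol = c("EGFR", "EGFR", "TP53", "KRAS")
)

top_toy <- top_gene_by_histology(useful_toy, "Female", "Non-Smoker")
top_toy
```

With the real data, `top_gene_by_histology(useful_data, "Male", "Smoker")` would feed the same `geom_tile()` code used for each heatmap.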
From this study, some insights can be drawn about the nature of adenocarcinoma. Some are simple and well understood, such as the fact that smoking is associated with a higher tumor mutation burden and therefore more mutations in the tumor genome. Non-smoking women and men were found to develop different tissue histologies, which can lead to tumors with varying metabolic activity. Tumor size also did not track consistently with metabolic activity (standardized uptake values, used here as a proxy for aggressiveness), which can help physicians properly inform and console patients in moments of great distress. Histological evaluation of lung tissue, together with biomarker testing of key mutated genes (for smokers and non-smokers), can help diagnose the disease and guide treatment. The data set also suggests that similar mutated genes underlie the various tissue histologies, and further genomic investigation could support the development of antigens for screening blood drawn from the site of the mass, lessening the financial and physical toll of invasive procedures.
RPubs link to the HTML version: https://rpubs.com/VictorUPEI/1295090