Melanoma, a highly aggressive form of skin cancer originating from melanocytes, poses a significant health challenge in the United States. Despite constituting only about 1% of skin cancers, it is responsible for the majority of skin cancer-related deaths due to its propensity to metastasize if not detected and treated promptly. Annually, approximately 100,640 new cases are diagnosed, with an estimated 8,290 individuals succumbing to the disease. While recent advancements in treatment have led to a decline in death rates, demographic disparities persist. Variations in incidence rates among age groups and genders continue to evolve, with concerning increases observed in women aged 50 and older.
Moreover, melanoma disproportionately affects individuals with lighter skin tones, with white populations facing a substantially higher lifetime risk compared to Black and Hispanic individuals. The prevalence of melanoma among young adults, particularly young women, underscores the urgency for deeper research. Understanding the intricate interplay of risk factors, including genetic predisposition, environmental influences, and behavioral patterns, is imperative for developing targeted prevention strategies and enhancing early detection methods. Given the evolving epidemiological landscape and the potential impact of tailored interventions, there is a pressing need for comprehensive research initiatives to address the multifaceted challenges posed by melanoma in the United States.
The challenge in melanoma drug development lies in efficiently identifying crucial genetic markers. Pharmaceutical companies currently have to sift through about 22,000 genes and their mutations, which is time-consuming and resource-intensive. Our solution streamlines this process by pinpointing and prioritizing the most relevant genetic markers upfront, saving time and resources and offering a more effective path to targeted melanoma treatments.
The analysis will focus on examining data from 25 patients diagnosed with a variety of cancers, including head and neck, breast, lung, melanoma, and thyroid cancer. The primary objective is to uncover significant patterns and connections within this dataset that may lead to the identification of genes associated with Melanoma. While this study cannot fully resolve the issue due to the diversity of individual genetic profiles, it will play a crucial role in narrowing down genes warranting further investigation by molecular biologists in the laboratory.
Although not every cancer-related gene and mutation can be definitively isolated from the roughly 22,000 genes, the analysis is expected to identify a subset that recurs across multiple patients, providing direction for subsequent research. The methodology combines exploratory data analysis and Pareto analysis. Descriptive analytics are performed first, followed by data narrowing in Tableau. The narrowed data is then used for predictive modeling, and the predictions are cross-checked against the visuals built on the same narrowed data. The aim is to shed light on the genetic underpinnings of cancer, paving the way for more targeted interventions and treatments.
Stakeholders, primarily pharmaceutical companies focused on drug development, should be particularly interested in this business problem due to its potential to significantly enhance time and cost efficiency in the drug development process. By identifying genetic markers associated with Melanoma, these companies can expedite the discovery of targeted therapies, reducing research and development timelines and minimizing associated costs. This efficiency not only accelerates the availability of new treatments but also strengthens the competitive position of pharmaceutical companies in the market. Moreover, the ability to offer personalized treatment approaches based on genetic profiling holds the promise of more effective therapies with fewer adverse effects, aligning with the industry’s ongoing pursuit of innovative and patient-centric solutions. Thus, addressing this business problem has the potential to yield substantial benefits for pharmaceutical companies and their stakeholders, driving progress in cancer treatment and improving patient outcomes.
This section outlines the procedures used to prepare the data for analysis. Each step is described in detail and accompanied by the corresponding code.
The data utilized in this study originates from two primary sources:
Fourteen data sets were procured from the DCL Pathology molecular laboratory in Indiana; they were acquired in January 2024 through gene sequencing.
Eleven data sets were obtained from the National Cancer Institute’s GDC Data Portal, collected in 2022.
DCL Pathology Data
The following cleaning steps were applied to the 14 data sets obtained from DCL Pathology to ensure compatibility with the analysis format. These datasets serve as the primary source for the study.
National Cancer Institute Data
The 11 data sets obtained from the National Cancer Institute’s GDC Data Portal underwent meticulous cleaning to ensure compatibility with the analysis format. As these data sets serve as a secondary source for the study, the cleaning process was tailored to align the data format with that of the primary source.
readxl: This library facilitates the reading of Excel files directly into R, providing functions to import data from spreadsheets with ease. It offers robust support for various Excel file formats and enables users to extract data seamlessly for further analysis.
dplyr: dplyr is a powerful data manipulation package that provides a grammar of data manipulation, allowing users to perform a wide range of data wrangling tasks efficiently. It includes functions for filtering, selecting, summarizing, mutating, and arranging data, making it a versatile tool for data transformation and exploration.
writexl: Complementing readxl, writexl exports data frames from R to Excel (.xlsx) files. It is lightweight, requires neither Java nor Excel, and makes it straightforward to share results with collaborators or stakeholders.
tidyr: tidyr is designed for data tidying tasks, providing functions to reshape and organize messy datasets into tidy data formats suitable for analysis and visualization. It includes tools for gathering, spreading, separating, and uniting data, helping users efficiently tidy and prepare their data for analysis.
rpart: rpart implements recursive partitioning algorithms for classification and regression tasks, allowing users to create decision trees based on input variables. Decision trees are interpretable models that partition the feature space into segments based on the values of predictor variables, making them valuable for understanding the underlying structure of the data.
rpart.plot: This package extends the functionality of rpart by providing enhanced plotting capabilities specifically tailored for visualizing decision trees created with rpart. It offers customizable plotting options, including tree pruning, node labeling, and branch coloring, enabling users to create clear and informative visualizations of their decision tree models.
caret: caret (Classification And REgression Training) is a comprehensive package for machine learning that provides a unified interface for training and evaluating predictive models. It streamlines the machine learning workflow by offering standardized functions for data preprocessing, model training, tuning, and evaluation across different algorithms and methodologies.
randomForest: randomForest implements random forest algorithms for classification and regression tasks, known for their robustness and predictive accuracy. Random forests are ensemble learning methods that combine multiple decision trees to improve predictive performance and reduce overfitting, making them suitable for a wide range of predictive modeling tasks.
gbm: Short for Gradient Boosting Machine, gbm implements gradient boosting algorithms for predictive modeling. Gradient boosting is a powerful machine learning technique that builds an ensemble of weak learners sequentially, with each learner focusing on the mistakes made by its predecessors. This iterative approach results in highly accurate predictive models that excel in handling complex, high-dimensional data.
ipred: ipred (Improved Predictors) extends the functionality of traditional predictive modeling approaches by providing methods for improved predictive modeling, including bagging and bootstrapping techniques. These ensemble learning methods combine multiple models to enhance predictive performance and robustness, particularly in the presence of noisy or uncertain data.
adabag: adabag implements boosting (AdaBoost.M1 and SAMME) and bagging for classification, using classification trees as base learners. AdaBoost is an adaptive boosting algorithm that trains weak learners sequentially, increasing the weights of misclassified observations at each iteration so that later learners focus on them.
DT: DT enables the creation of interactive web-based data tables directly from R, facilitating data exploration and visualization. It allows users to create dynamic, interactive tables with features such as sorting, filtering, and pagination, making it easier to explore large datasets and share insights with collaborators or stakeholders.
neuralnet: neuralnet facilitates the creation and training of artificial neural networks, a powerful technique for complex pattern recognition tasks. Neural networks consist of interconnected nodes organized into layers, capable of learning complex patterns and relationships from data. neuralnet provides functions for building, training, and evaluating neural network models, offering flexibility and scalability for a wide range of applications.
class: class implements various classification algorithms, including k-nearest neighbors (KNN), which are used for predictive modeling and classification tasks. KNN is a non-parametric method that classifies new data points based on the majority class of their nearest neighbors in the feature space. class provides functions for training and evaluating KNN models, making it a valuable tool for classification tasks in machine learning.
library(readxl)
library(dplyr)
library(writexl)
library(tidyr)
library(rpart)
library(rpart.plot)
library(caret)
library(randomForest)
library(gbm)
library(ipred)
library(DT)
library(neuralnet)
library(class)
During this stage in RStudio, 25 initially cleaned datasets were consolidated into one, which was then exported to Excel for integration into Tableau Prep for final data refinement. The merged dataset consists of 13 variables and encompasses a total of 35,890 rows.
data1 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient1_v1_lung.xlsx")
data2 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient2_v1_lung.xlsx")
data3 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient3_v1_lung.xlsx")
data4 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient4_v1_lung.xlsx")
data5 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient5_v1_head-and_neck.xlsx")
data6 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient6_v1_Thyroid.xlsx")
data7 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient7_v1_head_and_neck.xlsx")
data8 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient8_v1_head_and_neck.xlsx")
data9 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient9_v1_melanoma.xlsx")
data10 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient10_v1_head_and_neck.xlsx")
data11 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient11_v1_breast.xlsx")
data12 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient12_v1-breast.xlsx")
data13 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient13_v1_lung.xlsx")
data14 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient14_v1_Thyroid.xlsx")
data15 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient15_v1_breast.xlsx")
data16 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient16_v1_breast.xlsx")
data17 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient17_v1_breast.xlsx")
data18 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient18_v1_head_and_neck.xlsx")
data19 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient19_v1_melanoma.xlsx")
data20 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient20_v1_melanoma.xlsx")
data21 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient21_v1_melanoma.xlsx")
data22<- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient22_v1_melanoma.xlsx")
data23 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient23_v1_thyroid.xlsx")
data24 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient24_v1_thyroid.xlsx")
data25 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient25_v1_thyroid.xlsx")
data15$`Genomic Position`<-as.character(data15$`Genomic Position`)
data16$`Genomic Position`<-as.character(data16$`Genomic Position`)
data17$`Genomic Position`<-as.character(data17$`Genomic Position`)
data18$`Genomic Position`<-as.character(data18$`Genomic Position`)
data19$`Genomic Position`<-as.character(data19$`Genomic Position`)
data20$`Genomic Position`<-as.character(data20$`Genomic Position`)
data21$`Genomic Position`<-as.character(data21$`Genomic Position`)
data22$`Genomic Position`<-as.character(data22$`Genomic Position`)
data23$`Genomic Position`<-as.character(data23$`Genomic Position`)
data24$`Genomic Position`<-as.character(data24$`Genomic Position`)
data25$`Genomic Position`<-as.character(data25$`Genomic Position`)
combined_data<-bind_rows(data1,data2,data3,data4,data5,data6,data7,data8,data9,data10,data11,data12,data13,data14,data15,data16,data17,data18,data19,data20,data21,data22,data23,data24,data25)
summary(combined_data)
## Gene Chromosome Genomic Position Reference Call
## Length:35890 Length:35890 Length:35890 Length:35890
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Alternative Call Allele Frequency Depth P-Dot Notation
## Length:35890 Min. :0.0084 Min. : 2.0 Length:35890
## Class :character 1st Qu.:0.3020 1st Qu.: 98.0 Class :character
## Mode :character Median :0.4320 Median : 229.0 Mode :character
## Mean :0.4848 Mean : 365.9
## 3rd Qu.:0.5193 3rd Qu.: 576.2
## Max. :1.0000 Max. :11707.0
## NA's :14 NA's :14
## C-Dot Notation Consequence(s) Affected Exon(s) Patient_ID
## Length:35890 Length:35890 Length:35890 Length:35890
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Cancer_type
## Length:35890
## Class :character
## Mode :character
##
##
##
##
write_xlsx(combined_data,"C:/Users/annac/Desktop/Ania Data Sets/Company data/combined_data.xlsx")
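The repeated read-and-coerce steps above could also be written as a loop over file paths. The following is a minimal sketch, not the code used in the study: `read_and_combine` and its `reader` argument are hypothetical names, the demo reads small temporary CSV files with made-up positions rather than the original workbooks, and base `rbind` stands in for `bind_rows` under the assumption that every table has the same columns.

```r
# Hypothetical helper (not part of the original analysis): read every file,
# coerce one column to character so the types agree, then stack the tables.
read_and_combine <- function(paths,
                             reader = readxl::read_xlsx,
                             char_col = "Genomic Position") {
  tables <- lapply(paths, function(path) {
    tbl <- as.data.frame(reader(path))
    if (char_col %in% names(tbl)) {
      tbl[[char_col]] <- as.character(tbl[[char_col]])
    }
    tbl
  })
  # Assumes identical column sets; dplyr::bind_rows() is more forgiving.
  do.call(rbind, tables)
}

# Demo on two temporary CSV files with toy values (assumption: CSV stands in
# for the Excel workbooks so the example runs without the original data).
csv_reader <- function(path) {
  read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
}
tmp <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(
  data.frame(Gene = c("BRAF", "PCLO"),
             `Genomic Position` = c(101, 202),
             check.names = FALSE),
  tmp[1], row.names = FALSE
)
write.csv(
  data.frame(Gene = c("XIRP2", "BRAF"),
             `Genomic Position` = c("chr2:303", "chr7:404"),
             check.names = FALSE),
  tmp[2], row.names = FALSE
)

combined <- read_and_combine(tmp, reader = csv_reader)
str(combined)
```

With the real workbooks, the default `reader` would be used and `paths` would hold the 25 file paths; the mixed numeric and character `Genomic Position` columns are the reason for the coercion step.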
The following steps outline the procedures executed in Tableau Prep to attain the definitive clean dataset, serving as the foundational basis for subsequent analysis.
The resulting clean dataset comprises 16 variables and 24,874 rows. Notably, 11,016 rows were eliminated due to null values. Given the categorical and unique nature of the data, conventional imputation methods were deemed unsuitable.
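The null-row removal was carried out in Tableau Prep, so no R code for it appears in the study. As a rough R equivalent, and only as a sketch under the assumption that "null" corresponds to `NA` after import, rows with any missing value could be dropped as follows. The helper name `drop_incomplete_rows` and the toy values are hypothetical.

```r
# Hypothetical helper: keep only rows with no missing values and report
# how many rows were removed.
drop_incomplete_rows <- function(df) {
  keep <- complete.cases(df)
  list(clean = df[keep, , drop = FALSE], n_removed = sum(!keep))
}

# Toy data, not taken from the study.
toy <- data.frame(
  Gene = c("BRAF", NA, "PCLO", "XIRP2"),
  Depth = c(120, 85, NA, 40),
  stringsAsFactors = FALSE
)
result <- drop_incomplete_rows(toy)
nrow(result$clean)
result$n_removed

# Consistency check on the row counts reported in the text:
# 35,890 combined rows minus 11,016 removed rows leaves 24,874.
35890 - 11016 == 24874
```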
The analysis commenced with an exploratory data analysis conducted using Tableau. This approach was undertaken to visualize and gain deeper insights into the dataset.
Here are the conclusions derived from the preliminary analysis:
After the exploratory analysis, the dataset of 16 variables and 24,874 rows was imported into R Markdown for predictive analytics. However, the computation proved excessively time-consuming, primarily because of the categorical data and the thousands of rows with unique values. To mitigate this, Pareto analysis was employed. Pareto analysis is a decision-making technique that ranks data entries by their contribution and separates the few that account for most of the effect from the many that account for little. It is commonly used in business contexts to identify areas of focus, and here it was used to identify the most important genes for analysis.
Specifically, the focus was narrowed down to the top eight genes per cancer type, considering their mutations and affected exons. The selection process adhered to three key criteria:
The Pareto analysis facilitated the identification of key genes associated with each cancer type. From a comprehensive pool, 8 genes were selected per cancer type, resulting in a total of 40 genes for further analysis. These genes, along with essential variables such as Gene, C-Dot Notation (mutation), Affected Exon(s), All Exon(s), cancer type, and a binary indicator for Melanoma (Y, N), were compiled into a dataset comprising 457 rows. This approach streamlined the focus on critical genes, aiding descriptive analytics to uncover patterns pertinent to melanoma etiology, while also optimizing computational efficiency for subsequent predictive analytics.
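The Pareto selection itself was performed in Tableau, and the selection criteria are not reproduced here. Purely as an illustration of the idea, the sketch below ranks genes within each cancer type by row count, adds the cumulative share, and keeps the top `k`. Ranking by row count is an assumption, and `pareto_top_genes`, the toy gene labels, and the toy cancer types are hypothetical.

```r
# Hypothetical helper: rank genes by frequency within each cancer type,
# compute the cumulative share of rows, and keep the top k genes per type.
pareto_top_genes <- function(df, k = 8) {
  per_type <- split(df, df$Cancer_type)
  ranked <- lapply(names(per_type), function(type) {
    rows <- per_type[[type]]
    counts <- sort(table(as.character(rows$Gene)), decreasing = TRUE)
    n <- as.integer(counts)
    out <- data.frame(
      Cancer_type = type,
      Gene = names(counts),
      n = n,
      cum_share = cumsum(n) / sum(n),
      stringsAsFactors = FALSE
    )
    out[seq_len(min(k, nrow(out))), ]
  })
  result <- do.call(rbind, ranked)
  rownames(result) <- NULL
  result
}

# Toy data, not taken from the study.
toy <- data.frame(
  Cancer_type = c(rep("Melanoma", 6), rep("Lung_cancer", 3)),
  Gene = c("geneA", "geneA", "geneA", "geneB", "geneB", "geneC",
           "geneD", "geneD", "geneE"),
  stringsAsFactors = FALSE
)
top <- pareto_top_genes(toy, k = 2)
top
```

The `cum_share` column shows how much of each cancer type's rows the retained genes cover, which is the quantity a Pareto chart visualizes.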
Introduction:
This segment analyzes the data points identified through the Pareto analysis. A set of Tableau charts was built to extract insights relevant to predicting melanoma. The analysis looks more closely at genes, mutations, and affected exons, the key variables for detecting patterns among melanoma patients. Reading the charts requires an understanding of the following four terms.
Gene: Genes, segments of DNA, encode the blueprint for producing specific proteins essential for cellular function. Some genes, termed driver genes or oncogenes, harbor mutations capable of instigating cancer. These mutations provide cells with a growth advantage, fostering unbridled proliferation and tumor formation.
Mutation: Mutations denote changes or alterations in the DNA sequence of a gene. Oncogenes, a subset of genes, possess the capacity to transform normal cells into cancerous entities upon mutation. These mutations may engender hyperactive proteins, fueling accelerated cell growth and division. Moreover, cancer cells frequently exhibit genomic instability, characterized by an augmented mutation rate and chromosomal irregularities.
Affected Exon: An affected exon designates a specific segment of a gene’s DNA sequence that has undergone mutation or alteration. When an exon is affected by a mutation, it signifies a change within that particular portion of the gene’s sequence. Each gene comprises its own set number of exons, and the affected exon denotes a particular locus within the gene undergoing change. Recognizing affected exons holds paramount importance in unraveling the molecular mechanisms underpinning various diseases, including cancer. By pinpointing mutated regions within genes, researchers glean invaluable insights into how these genetic alterations contribute to disease onset and progression.
All Exons: This metric is the total number of exons in a specific gene. Each gene has its own exon count.
This comprehensive elucidation of genes, mutations, affected exons and all exons serves as a fundamental backdrop for interpreting the subsequent analyses, enabling the extraction of meaningful insights into Melanoma prediction.
This analysis centers on the eight most prevalent melanoma genes, highlighting their significance due to their frequent occurrence in melanoma patients. Beyond gene identification, the study delves into the specific locations within these genes where mutations are concentrated, aiming to identify potential sites for protein alterations. Following the identification of affected exons, the top eight mutation hotspots are further dissected, revealing four distinct genes prominently involved, with PCLO exhibiting the highest mutation count across three exons. Notably, most of these mutation hotspots are clustered within the same genes, suggesting their critical role in melanoma development. Moreover, upon deeper examination of mutation types and frequencies within each gene, a pattern emerges where mutations rarely occur more than once in the same gene.
In the context of diseases like cancer, where mutations in specific genes can contribute to the development and progression of the disease, such heterogeneity can have significant implications. It may indicate that the gene is prone to accumulating various mutations, potentially resulting from exposure to different carcinogens, genomic instability, or other factors. This genetic heterogeneity underscores the complexity of melanoma genetics and highlights the challenges it poses for diagnosis and treatment. The findings emphasize the importance of comprehensive genomic analysis to unravel the full spectrum of mutations and their implications for melanoma progression and treatment strategies.
The analysis begins with the loading of the refined dataset, followed by its transformation into a structured data frame. This systematic organization of the dataset into rows and columns facilitates efficient management and comprehensive analysis.
genes_ds <-read_excel("C:/Users/annac/Desktop/Capstone/data/data_filtered_2.0.xlsx")
genes_df<-as.data.frame(genes_ds)
summary(genes_df)
## Gene C-Dot Notation Affected Exon(s) All Exon(s)
## Length:457 Length:457 Min. : 1.00 Min. : 2.0
## Class :character Class :character 1st Qu.: 3.00 1st Qu.:10.0
## Mode :character Mode :character Median : 7.00 Median :16.0
## Mean : 8.93 Mean :24.9
## 3rd Qu.:12.00 3rd Qu.:25.0
## Max. :57.00 Max. :85.0
## Cancer_type Melanoma
## Length:457 Length:457
## Class :character Class :character
## Mode :character Mode :character
##
##
##
str(genes_df)
## 'data.frame': 457 obs. of 6 variables:
## $ Gene : chr "ACVR1B" "ACVR1B" "BRAF" "BRAF" ...
## $ C-Dot Notation : chr "c.1236G>C" "c.1236G>C" "c.1919T>A" "c.1919T>A" ...
## $ Affected Exon(s): num 1 1 1 1 1 7 7 7 7 7 ...
## $ All Exon(s) : num 10 10 15 15 15 15 15 15 15 15 ...
## $ Cancer_type : chr "Breast_cancer" "Breast_cancer" "Thyroid_cancer" "Thyroid_cancer" ...
## $ Melanoma : chr "N" "N" "N" "N" ...
Next, the columns were renamed to short, lowercase, syntactically valid R names so that they can be referenced without backticks.
genes_df <- genes_df %>%
  rename(
    gene = `Gene`,
    c_dot_notation = `C-Dot Notation`,
    affected_exon = `Affected Exon(s)`,
    cancer_type = `Cancer_type`,
    all_exon = `All Exon(s)`
  )
Following that, a crucial step was taken to enhance the dataframe’s organization: converting character data types to factors. This strategic modification ensures that categorical data is suitably formatted for comprehensive analysis.
genes_df <- genes_df %>%
  mutate(
    gene = as.factor(gene),
    c_dot_notation = as.factor(c_dot_notation),
    affected_exon = as.numeric(affected_exon),
    cancer_type = as.factor(cancer_type),
    Melanoma = as.factor(Melanoma)
  )
str(genes_df)
## 'data.frame': 457 obs. of 6 variables:
## $ gene : Factor w/ 41 levels "ABL2","ACVR1B",..: 2 2 4 4 4 4 4 4 4 4 ...
## $ c_dot_notation: Factor w/ 184 levels "c.10022G>A","c.10070G>A",..: 26 26 54 54 54 49 54 43 43 54 ...
## $ affected_exon : num 1 1 1 1 1 7 7 7 7 7 ...
## $ all_exon : num 10 10 15 15 15 15 15 15 15 15 ...
## $ cancer_type : Factor w/ 6 levels "Breast_cancer",..: 1 1 6 6 6 6 5 5 5 6 ...
## $ Melanoma : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 2 2 2 1 ...
The predictive analysis begins by dividing the data into training and testing sets. To give the models as much data as possible to learn from, 90% of the dataset is allocated for training, while the remaining 10% is reserved for testing the models' predictions of the target variable. The split is made with caret's createDataPartition and is stratified by cancer_type, so each cancer type is represented proportionally in both sets.
set.seed(123)
train_index <- createDataPartition(genes_df$cancer_type, p = 0.90, list = FALSE)
train_data <- genes_df[train_index, ]
test_data <- genes_df[-train_index, ]
nrow(train_data)
## [1] 413
nrow(test_data)
## [1] 44
train_data <- as.data.frame(train_data)
test_data <- as.data.frame(test_data)
As observed, the training dataset comprises 413 rows, while the testing dataset consists of 44 rows.
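To make the idea of a stratified split concrete, the following base-R sketch samples a fixed proportion of rows within each class. It illustrates the principle only and is not caret's exact algorithm; `stratified_index` and the toy labels are hypothetical.

```r
# Hypothetical helper: draw a proportion p of the row indices within each
# class of y, so class proportions are preserved in the training set.
stratified_index <- function(y, p = 0.9, seed = 123) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index into idx rather than calling sample(idx) directly, which
    # misbehaves when a class has a single row.
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy labels, not taken from the study: 80 "N" and 40 "Y".
y <- factor(rep(c("N", "Y"), times = c(80, 40)))
train_idx <- stratified_index(y, p = 0.75)

table(y[train_idx])   # class counts in the training part
table(y[-train_idx])  # class counts in the held-out part
```

Both parts keep the original 2:1 ratio between the classes, which is the property stratification is meant to guarantee.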
A decision tree was built first to identify the variables most influential in predicting melanoma.
# The code below fits a decision tree with Melanoma as the target, predicted from gene, affected_exon, c_dot_notation, and all_exon in train_data. Because the target is binary, method is set to "class"; the complexity parameter (cp) is set to 0.0001. Predictions are then made on the test set and tabulated against the true labels, and the tree is plotted.
melanoma_rpart <- rpart(formula = Melanoma ~ gene + affected_exon + c_dot_notation + all_exon, data = train_data, method = "class", cp=0.0001)
melanoma_rpart
## n= 413
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 413 181 N (0.56174334 0.43825666)
## 2) c_dot_notation=c.1012T>C,c.10543T>G,c.10544C>T,c.10715T>G,c.1156A>G,c.1159C>T,c.1214T>C,c.1236G>C,c.1316G>A,c.131C>T,c.1371T>G,c.1432T>C,c.145C>A,c.1460G>A,c.146C>G,c.1516C>T,c.1562G>A,c.1637C>T,c.1799T>A,c.1919T>A,c.2012T>C,c.2017G>A,c.2071G>A,c.2107C>T,c.2176T>C,c.2215G>A,c.222G>T,c.2339C>T,c.2513C>A,c.265G>A,c.2666A>G,c.2726G>T,c.2734C>G,c.2789A>G,c.2854_2855delinsAT,c.2995A>G,c.29C>T,c.3002C>A,c.3038C>G,c.3131G>T,c.3215G>A,c.3223T>G,c.3310G>C,c.331C>A,c.3365C>T,c.3422G>A,c.3538T>C,c.3557T>C,c.3583A>T,c.3662C>T,c.3698G>A,c.3709G>A,c.3797A>C,c.380T>G,c.3812C>T,c.3847C>A,c.3944C>T,c.4258C>T,c.479C>T,c.4958A>G,c.4991C>G,c.49A>G,c.5200G>A,c.523G>A,c.532G>A,c.553G>C,c.5557G>A,c.610C>T,c.62A>G,c.661C>T,c.6944T>A,c.710G>A,c.760A>G,c.767A>G,c.7754T>C,c.79A>C,c.8921A>G,c.962A>T,c.98C>G,c.999G>A 237 5 N (0.97890295 0.02109705) *
## 3) c_dot_notation=c.10022G>A,c.10070G>A,c.10528C>T,c.1070G>A,c.10936G>A,c.1124C>T,c.11335G>A,c.11470C>T,c.11542G>A,c.11663C>T,c.11815G>A,c.12055C>T,c.12071C>T,c.12074C>T,c.12095G>A,c.12131G>A,c.12334G>A,c.12391G>A,c.12457C>T,c.1258T>C,c.12995C>T,c.13019C>T,c.1405C>T,c.1468C>T,c.1478G>T,c.1525G>A,c.1547C>T,c.1591C>T,c.17C>A,c.1804C>T,c.1849C>T,c.18713C>T,c.194G>A,c.2044C>T,c.2096C>T,c.2180C>T,c.2225C>A,c.2417T>C,c.2618C>T,c.2645G>A,c.2650T>A,c.2680G>A,c.2749G>A,c.28378C>T,c.2845G>A,c.3025G>A,c.3044G>A,c.3077G>A,c.3143G>A,c.33923C>T,c.340C>T,c.34504G>A,c.3470G>A,c.3487G>A,c.3659C>T,c.3836G>A,c.3883G>A,c.4328C>T,c.4652C>T,c.4942G>A,c.5117C>T,c.5233A>C,c.5255G>A,c.5312G>A,c.5603C>T,c.5632G>A,c.5657C>T,c.5668G>A,c.5780A>C,c.5900G>A,c.6343G>A,c.6533C>T,c.6557C>T,c.6562G>A,c.67363G>A,c.7027C>T,c.70460C>T,c.7102G>A,c.71585A>C,c.7234C>T,c.73330G>A,c.74875A>G,c.7796G>A,c.8090C>A,c.8236C>T,c.839C>T,c.839G>A,c.8434C>T,c.8456G>A,c.8483C>T,c.861G>A,c.8876G>A,c.8939C>T,c.8967G>A,c.9083C>T,c.9349G>A,c.9721G>A,c.9746C>T 176 0 Y (0.00000000 1.00000000) *
pred0 <- predict(melanoma_rpart, test_data, type = "class") #Predictions performed on test data
pred0
## 1 4 20 52 66 78 79 82 88 94 103 113 121 161 162 166 172 174 184 199
## N N N Y Y Y Y Y Y N N N N N N Y N Y Y Y
## 201 203 212 223 244 257 260 276 278 293 296 306 308 335 341 350 352 357 361 370
## Y Y Y Y Y Y Y N Y N N N N N N N N N N N
## 406 417 428 448
## N N N N
## Levels: N Y
table(test_data$Melanoma, pred0, dnn = c("True", "Pred")) #Predictive Matrix
## Pred
## True N Y
## N 24 0
## Y 2 18
sum(test_data$Melanoma != pred0) #Count of misclassified predictions
## [1] 2
sum(test_data$Melanoma != pred0)/nrow(test_data) #misclassification rate
## [1] 0.04545455
prp(melanoma_rpart, extra = 1)
This model splits on a single variable, c_dot_notation (the mutation), which makes it the most influential variable in the analysis. The tree assigns 237 training rows to the non-melanoma leaf, 232 of them correctly, and 176 rows to the melanoma leaf, all of them correctly. The tree also lists the specific mutations in each leaf; those in the melanoma leaf also surfaced in the descriptive analysis, which supports the hypothesis that they play a role in melanoma development.
For instance, in the final analysis of the XIRP2 gene, the mutations 839C>T, 3077G>A, 2650T>A, 1405C>T, 6557C>T, and 2044C>T were identified, and they also appear in the melanoma leaf of the decision tree, signifying their predictive relevance for melanoma. On the test set, the tree misclassified only 2 of 44 instances (both melanoma cases predicted as non-melanoma), a misclassification rate of 4.5%. This performance indicates that the model predicts melanoma cases accurately on this data.
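The misclassification rate can be complemented with per-class measures derived from the same confusion matrix. The sketch below re-enters the decision tree's test-set counts reported above (24, 0, 2, 18) and computes accuracy, sensitivity, and specificity; the helper name `classification_metrics` is hypothetical, and "Y" is treated as the positive class.

```r
# Confusion matrix for the decision tree on the test set, copied from the
# table above (rows = true class, columns = predicted class).
cm <- matrix(
  c(24, 2, 0, 18),
  nrow = 2,
  dimnames = list(True = c("N", "Y"), Pred = c("N", "Y"))
)

# Hypothetical helper: derive common metrics from a 2x2 confusion matrix.
classification_metrics <- function(cm, positive = "Y", negative = "N") {
  tp <- cm[positive, positive]  # melanoma predicted as melanoma
  fn <- cm[positive, negative]  # melanoma predicted as non-melanoma
  tn <- cm[negative, negative]  # non-melanoma predicted as non-melanoma
  fp <- cm[negative, positive]  # non-melanoma predicted as melanoma
  c(
    accuracy = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    misclassification = (fp + fn) / sum(cm)
  )
}

tree_metrics <- classification_metrics(cm)
round(tree_metrics, 4)
```

Sensitivity is 18/20 and specificity is 24/24, which shows that both errors are missed melanoma cases rather than false alarms.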
set.seed(123)
# The call below attempts a random forest that predicts Melanoma from all four predictors. It is left commented out because it fails with the error described next.
#rf_model <- randomForest(Melanoma ~ gene + affected_exon + all_exon + c_dot_notation, data = train_data, ntree = 500)
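The call fails because randomForest cannot handle categorical predictors with more than 53 categories. The values of c_dot_notation present in the training data are listed below: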
unique(train_data$c_dot_notation)
## [1] c.1236G>C c.1919T>A c.1799T>A
## [4] c.1525G>A c.331C>A c.479C>T
## [7] c.8967G>A c.5255G>A c.4652C>T
## [10] c.3044G>A c.2845G>A c.2618C>T
## [13] c.1804C>T c.1591C>T c.1124C>T
## [16] c.49A>G c.11542G>A c.8939C>T
## [19] c.8236C>T c.7027C>T c.5780A>C
## [22] c.3883G>A c.3487G>A c.3470G>A
## [25] c.11335G>A c.10936G>A c.1258T>C
## [28] c.2180C>T c.5603C>T c.5632G>A
## [31] c.6562G>A c.8434C>T c.8456G>A
## [34] c.8876G>A c.9083C>T c.11815G>A
## [37] c.12055C>T c.12457C>T c.2680G>A
## [40] c.4942G>A c.7102G>A c.7234C>T
## [43] c.1562G>A c.2096C>T c.2749G>A
## [46] c.3025G>A c.3310G>C c.767A>G
## [49] c.760A>G c.79A>C c.1460G>A
## [52] c.532G>A c.3709G>A c.10543T>G
## [55] c.10544C>T c.10715T>G c.5200G>A
## [58] c.4991C>G c.3847C>A c.3797A>C
## [61] c.3557T>C c.3538T>C c.2107C>T
## [64] c.1637C>T c.1012T>C c.1478G>T
## [67] c.2017G>A c.11663C>T c.28378C>T
## [70] c.18713C>T c.10022G>A c.9746C>T
## [73] c.33923C>T c.34504G>A c.2176T>C
## [76] c.2071G>A c.12391G>A c.12334G>A
## [79] c.12131G>A c.5900G>A c.2645G>A
## [82] c.1070G>A c.13019C>T c.9349G>A
## [85] c.2225C>A c.1468C>T c.1547C>T
## [88] c.8090C>A c.6533C>T c.12995C>T
## [91] c.12074C>T c.11470C>T c.10528C>T
## [94] c.9721G>A c.3143G>A c.194G>A
## [97] c.839G>A c.4328C>T c.5312G>A
## [100] c.5668G>A c.6343G>A c.7796G>A
## [103] c.12071C>T c.12095G>A c.10070G>A
## [106] c.3836G>A c.999G>A c.3944C>T
## [109] c.2417T>C c.3583A>T c.3659C>T
## [112] c.3365C>T c.2854_2855delinsAT c.3662C>T
## [115] c.3812C>T c.710G>A c.610C>T
## [118] c.1156A>G c.1371T>G c.2012T>C
## [121] c.380T>G c.265G>A c.67363G>A
## [124] c.73330G>A c.71585A>C c.70460C>T
## [127] c.74875A>G c.131C>T c.145C>A
## [130] c.222G>T c.3131G>T c.2044C>T
## [133] c.6557C>T c.839C>T c.1405C>T
## [136] c.2650T>A c.3077G>A c.2789A>G
## [139] c.5557G>A c.146C>G c.4258C>T
## [142] c.5233A>C c.98C>G c.523G>A
## [145] c.962A>T c.8921A>G c.3223T>G
## [148] c.8483C>T c.340C>T c.6944T>A
## [151] c.553G>C c.2726G>T c.3422G>A
## [154] c.3002C>A c.3038C>G c.62A>G
## [157] c.861G>A c.661C>T c.1432T>C
## [160] c.1316G>A c.1516C>T c.17C>A
## [163] c.2339C>T c.7754T>C c.2513C>A
## [166] c.2995A>G c.5117C>T c.4958A>G
## [169] c.3215G>A c.5657C>T c.3698G>A
## [172] c.2734C>G c.1849C>T c.2666A>G
## [175] c.2215G>A c.29C>T c.1214T>C
## [178] c.1159C>T
## 184 Levels: c.10022G>A c.10070G>A c.1012T>C c.10528C>T c.10543T>G ... c.999G>A
As shown above, c_dot_notation has 184 levels, whereas randomForest accepts categorical predictors with at most 53 levels. Consequently, the model failed to execute. Because every mutation matters and the list had already been refined from a larger pool, deleting or merging levels to fit the model was impractical.
Therefore, a second Random Forest analysis was run, this time excluding the “c_dot_notation” (mutation) variable. The aim is to evaluate how accurately the model predicts melanoma without it.
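The level limit can be checked before fitting. The following is a minimal sketch on a hypothetical toy data frame (the real train_data is not reproduced here); the names toy, max_levels, n_levels and too_many are illustrative only.

```r
# Hypothetical toy data: 60 distinct mutations spread over 3 genes
toy <- data.frame(
  gene = factor(rep(c("BRAF", "ATM", "TTN"), times = 20)),
  c_dot_notation = factor(paste0("c.", 1:60, "G>A"))
)

max_levels <- 53  # randomForest's limit for a categorical predictor

# Count the levels of every factor column
n_levels <- sapply(toy, nlevels)
n_levels

# Names of the columns that exceed the limit
too_many <- names(n_levels)[n_levels > max_levels]
too_many
```

Running the same check on the real training data would flag c_dot_notation, which is why it is left out of the formula that follows.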
set.seed(123)
rf_model <- randomForest(Melanoma ~ gene + affected_exon + all_exon, data = train_data, ntree = 500) # the argument is ntree (500 is also the default)
rf_model
##
## Call:
## randomForest(formula = Melanoma ~ gene + affected_exon + all_exon, data = train_data, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 8.96%
## Confusion matrix:
## N Y class.error
## N 221 11 0.04741379
## Y 26 155 0.14364641
rf_pred<-predict(rf_model,test_data) #Predictions performed on test data
rf_pred
## 1 4 20 52 66 78 79 82 88 94 103 113 121 161 162 166 172 174 184 199
## N N N Y Y Y Y Y Y N N N N N N N N N Y Y
## 201 203 212 223 244 257 260 276 278 293 296 306 308 335 341 350 352 357 361 370
## Y Y Y Y Y Y Y N N N N N N N N N N N N N
## 406 417 428 448
## N N N N
## Levels: N Y
table(test_data$Melanoma,rf_pred, dnn=c("True","Pred")) #Matrix Table
## Pred
## True N Y
## N 24 0
## Y 5 15
sum(test_data$Melanoma != rf_pred)
## [1] 5
sum(test_data$Melanoma != rf_pred)/nrow(test_data)
## [1] 0.1136364
rf_model$importance
## MeanDecreaseGini
## gene 110.34091
## affected_exon 30.92654
## all_exon 32.27210
After fitting the adjusted random forest, the results were marginally worse than those of the decision tree. The model misclassified 5 of the 44 test observations, an 11.4% error rate, and all 5 were melanoma cases predicted as non-melanoma. This suggests that leaving out a pivotal variable noticeably reduces the model’s accuracy. The feature importance (mean decrease in Gini) shows that gene is by far the most influential predictor, whereas affected_exon and all_exon contribute much less.
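The error rate can be recomputed directly from the confusion matrix above. This is a small sketch in which the matrix is typed in by hand from the printed table; sensitivity and specificity are added here for illustration and are not part of the original output.

```r
# Confusion matrix from the random forest test predictions (rows = truth, columns = prediction)
conf <- matrix(c(24, 5, 0, 15), nrow = 2,
               dimnames = list(True = c("N", "Y"), Pred = c("N", "Y")))

# Share of observations that fall off the diagonal
error <- 1 - sum(diag(conf)) / sum(conf)

# Share of true melanoma cases that were detected
sensitivity <- conf["Y", "Y"] / sum(conf["Y", ])

# Share of non-melanoma cases that were correctly rejected
specificity <- conf["N", "N"] / sum(conf["N", ])

round(c(error = error, sensitivity = sensitivity, specificity = specificity), 4)
```

The split into sensitivity and specificity makes it visible that every error of this model is a missed melanoma case.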
set.seed(123)
# Bagging model: nbagg = 50 builds an ensemble of 50 trees, each fitted to a bootstrap sample (rows drawn from the training data with replacement).
bag_model <- bagging(formula = Melanoma ~ gene + affected_exon + all_exon + c_dot_notation, data = train_data, nbagg = 50)
Melanoma2 <- test_data$Melanoma
bag_pred <- predict(bag_model, newdata = test_data)#Predictions
table(Melanoma2, bag_pred, dnn = c("True", "Pred")) #Confusion Matrix
## Pred
## True N Y
## N 24 0
## Y 2 18
sum(test_data$Melanoma != bag_pred) #Number of misclassifications
## [1] 2
Bagging, an ensemble method related to boosting, performed very well. The confusion matrix shows that only 2 of the 44 test observations were misclassified (4.5%), both of them melanoma cases predicted as non-melanoma. This is the same number of errors as the boosting model below.
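The bootstrap sampling behind bagging can be illustrated in a few lines of base R. This is only a sketch of the resampling step, not of the bagging() implementation itself; n is set to 413, the training-set size implied by the random forest OOB table above, and the variable names are illustrative.

```r
set.seed(123)
n <- 413  # training rows implied by the OOB confusion matrix (221 + 11 + 26 + 155)

# One bootstrap sample: n row indices drawn with replacement
idx <- sample(n, size = n, replace = TRUE)

# Fraction of distinct rows that made it into the sample
in_bag <- length(unique(idx)) / n

# Rows never drawn; these are "out of bag" for this tree
out_of_bag <- setdiff(seq_len(n), idx)

in_bag
length(out_of_bag)

# Theoretical expectation of the in-bag fraction, about 0.632 for large n
1 - (1 - 1 / n)^n
```

Each of the 50 trees sees a different resample of this kind, and the ensemble prediction is the majority vote over the trees.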
library(adabag)
set.seed(123)
# Boosting model: with boos = TRUE, each iteration draws a bootstrap sample of the training set using the current observation weights.
melanoma_boost <- boosting(Melanoma ~ gene + c_dot_notation + all_exon + affected_exon, data = train_data, boos = TRUE)
melanoma_boost$importance #Importance of the variables
## affected_exon all_exon c_dot_notation gene
## 2.364250 0.000000 96.434424 1.201326
boost_pred = predict(melanoma_boost, newdata = test_data) #Predictions
boost_pred$class
## [1] "N" "N" "N" "Y" "Y" "Y" "Y" "Y" "Y" "Y" "N" "N" "N" "N" "Y" "Y" "N" "Y" "Y"
## [20] "Y" "Y" "Y" "Y" "Y" "Y" "Y" "Y" "N" "Y" "N" "N" "N" "N" "N" "N" "N" "N" "N"
## [39] "N" "N" "N" "N" "N" "N"
boost_pred$confusion #Confusion Matrix
## Observed Class
## Predicted Class N Y
## N 23 1
## Y 1 19
boost_pred$error #Misclassification error
## [1] 0.04545455
boost_pred$error*nrow(test_data)
## [1] 2
This model was built using all four predictors: gene, c_dot_notation, affected_exon, and all_exon. On the test set the boosting model performed very well. Consistent with the findings from the Decision Tree analysis, c_dot_notation (the mutation) was again the pivotal variable, accounting for about 96% of the relative importance, while all_exon contributed nothing. The confusion matrix shows only 2 misclassified observations out of 44, one in each class, giving a misclassification error of 4.55%. This supports the use of a more sophisticated ensemble model for extracting insights from the data.
Unfortunately, the categorical variables contain so many unique values that further models could not be run on the data in this form. The dataset was therefore transferred to Python, where binary (indicator) columns were created for the problematic variables, c_dot_notation and gene, enabling further predictive analyses.
The initial step involved loading essential libraries like Pandas and Seaborn. Pandas facilitates diverse data operations including reading and writing files, data cleaning, preprocessing, statistical analysis, and manipulation. Seaborn, built on Matplotlib, offers an intuitive interface for generating visually appealing and informative statistical graphics.
After the libraries were loaded, the connection to Google Drive was established and the data was imported into Python. The code then created binary columns for two variables: gene and c_dot_notation (mutation).
The output below demonstrates the creation of binary columns. After this transformation the dataset retains its original 457 rows, and 225 binary columns have been added.
Next, the variables that did not need encoding were added back: “Melanoma,” already binary-encoded, and “Affected Exon(s)” and “All Exon(s),” both numeric. The dataset now comprises 457 rows and 228 columns. This extended dataset was exported to Excel for further predictive analytics.
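The Python code for this step is not shown in the report. As an illustration of the same idea, the sketch below builds indicator columns in base R with model.matrix. The toy rows, the gene and mutation pairings, the helper one_hot and the column prefixes are all hypothetical.

```r
# Hypothetical toy rows; pairings of genes and mutations are invented for illustration
toy <- data.frame(
  gene = c("BRAF", "ATM", "BRAF"),
  c_dot_notation = c("c.1799T>A", "c.146C>G", "c.1919T>A"),
  Melanoma = c(1, 0, 1)
)

# Turn one categorical vector into a matrix with one 0/1 column per level
one_hot <- function(x, prefix) {
  f <- factor(x)
  m <- model.matrix(~ f - 1)               # "- 1" drops the intercept so every level gets a column
  colnames(m) <- paste0(prefix, levels(f))
  m
}

# Keep the unchanged variable and append both sets of indicator columns
encoded <- cbind(
  toy["Melanoma"],
  one_hot(toy$gene, "Gene_"),
  one_hot(toy$c_dot_notation, "CDotNotation_")
)

dim(encoded)   # rows are unchanged; columns = 1 + number of gene levels + number of mutation levels
encoded
```

The row count stays the same and only the column count grows, which mirrors the 457 rows and 225 added columns reported above.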
Subsequent to this procedure, we reintegrated the dataset into R to
conduct further modeling. However, the initial challenge arose when the
program failed to interpret the data accurately, necessitating the
removal of special characters from the column headers. Presently, the
headers are represented as single contiguous words.
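One way to strip special characters from column headers is a single regular expression. The raw header strings below are assumptions about what the exported names might have looked like; only the cleaned forms appear in the report.

```r
# Hypothetical raw headers as they might arrive from the export
raw_headers <- c("Affected Exon(s)", "All Exon(s)",
                 "CDotNotation_c.2854_2855delinsAT", "Gene_BRAF")

# Remove every character that is not a letter or a digit
clean_headers <- gsub("[^A-Za-z0-9]", "", raw_headers)
clean_headers
```

Applied to a data frame, the same call would be names(df) <- gsub("[^A-Za-z0-9]", "", names(df)), which yields headers safe to use in a model formula.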
genes_ds1 <-read_excel("C:/Users/annac/Desktop/Capstone/data/genes_python.xlsx")
genes_df1<-as.data.frame(genes_ds1)
str(genes_df1) #Checking if the data was read correctly
## 'data.frame': 457 obs. of 228 variables:
## $ AffectedExons : num 1 1 1 1 1 7 7 7 7 7 ...
## $ Melanoma : num 0 0 0 0 0 0 1 1 1 0 ...
## $ AllExons : num 10 10 15 15 15 15 15 15 15 15 ...
## $ CDotNotationc10022GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc10070GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1012TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc10528CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc10543TG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc10544CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1070GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc10715TG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc10936GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1124CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc11335GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc11470CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc11542GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1156AG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1159CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc11638CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc11663CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc11815GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12055CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12071CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12074CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12095GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12131GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1214TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12334GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1236GC : num 1 1 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12391GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12457CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1258TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc12995CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc13019CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1316GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc131CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1371TG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1405CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1432TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc145CA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1460GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1468CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc146CG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1478GT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1516CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1525GA : num 0 0 0 0 0 0 0 1 1 0 ...
## $ CDotNotationc1547CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1562GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1591CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1637CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1717GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1799TA : num 0 0 0 0 0 1 0 0 0 0 ...
## $ CDotNotationc17CA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1804CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1849CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc18713CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc1919TA : num 0 0 1 1 1 0 1 0 0 1 ...
## $ CDotNotationc194GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2012TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2017GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2044CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2071GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2096CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2107CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2176TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2180CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2215GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2225CA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc222GT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2339CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2344CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2417TC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2513CA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2618CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2645GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2650TA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc265GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2666AG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2680GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2726GT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2734CG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2749GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2789AG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc28378CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2845GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc28542855delinsAT: num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc290CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc2995AG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc29CT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3002CA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3025GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3038CG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3044GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3077GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3131GT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3143GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3215GA : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3223TG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CDotNotationc3310GC : num 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
Inspecting the dataset shows that all variables are numeric and that the data was read correctly, with the expected 457 rows and 228 columns. The next step is to partition the data into training and testing sets. As before, 90% is allocated for training and 10% for testing. Given the small size of the dataset, this split leaves enough data for proper model building and training.
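To make the idea of a stratified 90/10 split concrete, here is a base-R sketch on a hypothetical outcome vector. It approximates what a stratified partition does for a two-class outcome and is not the internal algorithm of createDataPartition; y, train_idx and test_idx are illustrative names.

```r
set.seed(123)

# Hypothetical binary outcome: 60 non-melanoma and 40 melanoma rows
y <- rep(c(0, 1), times = c(60, 40))

# Sample 90% of the row indices within each class separately
# (each class here has more than one row, so sample() treats idx as a set of indices)
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.9 * length(idx)))
}))

# Everything not selected for training forms the test set
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx)
length(test_idx)
mean(y[train_idx])   # class balance in the training set matches the full data
```

Sampling within each class keeps the melanoma share the same in both partitions, which matters when the test set is as small as it is here.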
set.seed(123)
train_index1 <- createDataPartition(genes_df1$Melanoma, p = 0.90, list = FALSE)
train_data1 <- genes_df1[train_index1, ]
test_data1 <- genes_df1[-train_index1, ]
train_data1 <- as.data.frame(train_data1)
logit_model1 <- glm(Melanoma ~ ., data = train_data1, family = binomial) #Model
summary(logit_model1)
##
## Call:
## glm(formula = Melanoma ~ ., family = binomial, data = train_data1)
##
## Coefficients: (49 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.912e+15 6.283e+07 -1.100e+08 <2e-16 ***
## AffectedExons -5.235e+13 9.009e+05 -5.811e+07 <2e-16 ***
## AllExons 5.027e+14 6.192e+06 8.119e+07 <2e-16 ***
## CDotNotationc10022GA 4.399e+15 6.713e+07 6.552e+07 <2e-16 ***
## CDotNotationc10070GA -3.219e+16 4.341e+08 -7.417e+07 <2e-16 ***
## CDotNotationc1012TC -6.399e+15 7.021e+07 -9.114e+07 <2e-16 ***
## CDotNotationc10528CT 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc10543TG -1.582e+16 1.821e+08 -8.689e+07 <2e-16 ***
## CDotNotationc10544CT -1.582e+16 1.821e+08 -8.689e+07 <2e-16 ***
## CDotNotationc1070GA 2.720e+15 7.845e+07 3.466e+07 <2e-16 ***
## CDotNotationc10715TG -1.582e+16 1.821e+08 -8.689e+07 <2e-16 ***
## CDotNotationc10936GA -3.239e+16 4.495e+08 -7.206e+07 <2e-16 ***
## CDotNotationc1124CT -1.047e+15 1.238e+08 -8.454e+06 <2e-16 ***
## CDotNotationc11335GA -2.833e+16 4.470e+08 -6.337e+07 <2e-16 ***
## CDotNotationc11470CT 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc11542GA -2.865e+16 4.485e+08 -6.387e+07 <2e-16 ***
## CDotNotationc1156AG -9.600e+14 5.491e+07 -1.748e+07 <2e-16 ***
## CDotNotationc1159CT 1.570e+14 7.754e+07 2.025e+06 <2e-16 ***
## CDotNotationc11638CT 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc11663CT 4.399e+15 8.221e+07 5.351e+07 <2e-16 ***
## CDotNotationc11815GA 9.060e+15 7.750e+07 1.169e+08 <2e-16 ***
## CDotNotationc12055CT 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc12071CT -2.881e+16 4.296e+08 -6.705e+07 <2e-16 ***
## CDotNotationc12074CT -1.697e+15 9.571e+07 -1.773e+07 <2e-16 ***
## CDotNotationc12095GA -2.731e+16 4.296e+08 -6.356e+07 <2e-16 ***
## CDotNotationc12131GA 2.676e+15 8.310e+07 3.220e+07 <2e-16 ***
## CDotNotationc1214TC 1.570e+14 7.754e+07 2.025e+06 <2e-16 ***
## CDotNotationc12334GA 2.545e+15 9.573e+07 2.659e+07 <2e-16 ***
## CDotNotationc1236GC -1.062e+16 7.397e+07 -1.435e+08 <2e-16 ***
## CDotNotationc12391GA 2.676e+15 8.310e+07 3.220e+07 <2e-16 ***
## CDotNotationc12457CT 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc1258TC -2.789e+16 4.470e+08 -6.239e+07 <2e-16 ***
## CDotNotationc12995CT 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc13019CT 2.702e+15 7.453e+07 3.626e+07 <2e-16 ***
## CDotNotationc1316GA NA NA NA NA
## CDotNotationc131CT 2.252e+15 6.126e+07 3.676e+07 <2e-16 ***
## CDotNotationc1371TG -1.466e+15 5.390e+07 -2.720e+07 <2e-16 ***
## CDotNotationc1405CT -1.979e+15 1.297e+08 -1.526e+07 <2e-16 ***
## CDotNotationc1432TC -8.630e+15 1.029e+08 -8.386e+07 <2e-16 ***
## CDotNotationc145CA 1.519e+15 5.479e+07 2.772e+07 <2e-16 ***
## CDotNotationc1460GA -1.421e+14 3.558e+07 -3.994e+06 <2e-16 ***
## CDotNotationc1468CT 4.896e+14 7.603e+07 6.440e+06 <2e-16 ***
## CDotNotationc146CG -3.061e+16 3.609e+08 -8.482e+07 <2e-16 ***
## CDotNotationc1478GT 2.612e+15 7.435e+07 3.512e+07 <2e-16 ***
## CDotNotationc1516CT -8.630e+15 1.133e+08 -7.616e+07 <2e-16 ***
## CDotNotationc1525GA 4.277e+15 6.139e+07 6.967e+07 <2e-16 ***
## CDotNotationc1547CT 9.217e+14 7.453e+07 1.237e+07 <2e-16 ***
## CDotNotationc1562GA -1.135e+16 1.274e+08 -8.908e+07 <2e-16 ***
## CDotNotationc1591CT -1.047e+15 1.238e+08 -8.454e+06 <2e-16 ***
## CDotNotationc1637CT -6.430e+15 7.443e+07 -8.639e+07 <2e-16 ***
## CDotNotationc1717GA -5.006e+15 8.511e+07 -5.882e+07 <2e-16 ***
## CDotNotationc1799TA -4.739e+15 5.826e+07 -8.135e+07 <2e-16 ***
## CDotNotationc17CA -4.545e+15 1.141e+08 -3.983e+07 <2e-16 ***
## CDotNotationc1804CT -5.551e+15 1.238e+08 -4.482e+07 <2e-16 ***
## CDotNotationc1849CT 6.012e+15 7.070e+07 8.503e+07 <2e-16 ***
## CDotNotationc18713CT NA NA NA NA
## CDotNotationc1919TA -1.736e+15 5.279e+07 -3.288e+07 <2e-16 ***
## CDotNotationc194GA -2.714e+16 4.301e+08 -6.310e+07 <2e-16 ***
## CDotNotationc2012TC -4.713e+15 5.491e+07 -8.583e+07 <2e-16 ***
## CDotNotationc2017GA -4.894e+15 7.435e+07 -6.583e+07 <2e-16 ***
## CDotNotationc2044CT -1.979e+15 1.381e+08 -1.433e+07 <2e-16 ***
## CDotNotationc2071GA -3.623e+16 4.329e+08 -8.368e+07 <2e-16 ***
## CDotNotationc2096CT -6.849e+15 1.387e+08 -4.937e+07 <2e-16 ***
## CDotNotationc2107CT -2.341e+15 6.691e+07 -3.498e+07 <2e-16 ***
## CDotNotationc2176TC -3.172e+16 4.303e+08 -7.372e+07 <2e-16 ***
## CDotNotationc2180CT 4.556e+15 7.750e+07 5.879e+07 <2e-16 ***
## CDotNotationc2215GA 4.556e+15 6.127e+07 7.436e+07 <2e-16 ***
## CDotNotationc2225CA 1.576e+15 7.602e+07 2.073e+07 <2e-16 ***
## CDotNotationc222GT 1.812e+15 4.901e+07 3.697e+07 <2e-16 ***
## CDotNotationc2339CT -1.133e+16 1.432e+08 -7.915e+07 <2e-16 ***
## CDotNotationc2344CT -2.346e+15 1.387e+08 -1.691e+07 <2e-16 ***
## CDotNotationc2417TC -4.133e+14 1.087e+08 -3.802e+06 <2e-16 ***
## CDotNotationc2513CA -1.123e+16 1.431e+08 -7.848e+07 <2e-16 ***
## CDotNotationc2618CT -2.775e+15 1.133e+08 -2.449e+07 <2e-16 ***
## CDotNotationc2645GA 1.218e+15 7.845e+07 1.553e+07 <2e-16 ***
## CDotNotationc2650TA -1.979e+15 1.297e+08 -1.526e+07 <2e-16 ***
## CDotNotationc265GA -2.011e+15 8.318e+07 -2.418e+07 <2e-16 ***
## CDotNotationc2666AG 4.556e+15 6.127e+07 7.436e+07 <2e-16 ***
## CDotNotationc2680GA 9.007e+15 7.749e+07 1.162e+08 <2e-16 ***
## CDotNotationc2726GT 3.310e+15 5.655e+07 5.853e+07 <2e-16 ***
## CDotNotationc2734CG 6.012e+15 7.070e+07 8.503e+07 <2e-16 ***
## CDotNotationc2749GA NA NA NA NA
## CDotNotationc2789AG -2.995e+15 5.914e+07 -5.065e+07 <2e-16 ***
## CDotNotationc28378CT 4.399e+15 8.221e+07 5.351e+07 <2e-16 ***
## CDotNotationc2845GA -5.235e+14 1.133e+08 -4.619e+06 <2e-16 ***
## CDotNotationc28542855delinsAT -2.766e+15 5.254e+07 -5.264e+07 <2e-16 ***
## CDotNotationc290CT -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc2995AG -6.619e+15 1.430e+08 -4.629e+07 <2e-16 ***
## CDotNotationc29CT 4.451e+15 7.750e+07 5.744e+07 <2e-16 ***
## CDotNotationc3002CA NA NA NA NA
## CDotNotationc3025GA -2.346e+15 1.387e+08 -1.691e+07 <2e-16 ***
## CDotNotationc3038CG -1.194e+15 7.382e+07 -1.618e+07 <2e-16 ***
## CDotNotationc3044GA -2.775e+15 1.133e+08 -2.449e+07 <2e-16 ***
## CDotNotationc3077GA -6.483e+15 1.381e+08 -4.695e+07 <2e-16 ***
## CDotNotationc3131GT 2.859e+15 7.356e+07 3.887e+07 <2e-16 ***
## CDotNotationc3143GA -2.685e+16 4.321e+08 -6.214e+07 <2e-16 ***
## CDotNotationc3215GA -1.085e+16 1.812e+08 -5.987e+07 <2e-16 ***
## CDotNotationc3223TG -3.482e+16 4.716e+08 -7.384e+07 <2e-16 ***
## CDotNotationc3310GC -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc331CA -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc3365CT -4.672e+15 5.254e+07 -8.893e+07 <2e-16 ***
## CDotNotationc33923CT 4.399e+15 8.221e+07 5.351e+07 <2e-16 ***
## CDotNotationc340CT -3.116e+16 4.743e+08 -6.569e+07 <2e-16 ***
## CDotNotationc3422GA -1.194e+15 5.655e+07 -2.112e+07 <2e-16 ***
## CDotNotationc34504GA 4.399e+15 8.221e+07 5.351e+07 <2e-16 ***
## CDotNotationc3470GA -3.370e+16 4.524e+08 -7.450e+07 <2e-16 ***
## CDotNotationc3487GA -2.920e+16 4.524e+08 -6.454e+07 <2e-16 ***
## CDotNotationc3538TC 1.938e+15 7.397e+07 2.620e+07 <2e-16 ***
## CDotNotationc3557TC -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc3583AT -7.169e+15 1.087e+08 -6.594e+07 <2e-16 ***
## CDotNotationc3659CT -3.872e+14 1.051e+08 -3.682e+06 <2e-16 ***
## CDotNotationc3662CT -1.816e+15 5.312e+07 -3.419e+07 <2e-16 ***
## CDotNotationc3698GA 6.012e+15 8.515e+07 7.060e+07 <2e-16 ***
## CDotNotationc3709GA -1.582e+16 1.821e+08 -8.689e+07 <2e-16 ***
## CDotNotationc3797AC -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc380TG 4.194e+07 6.126e+07 6.850e-01 0.494
## CDotNotationc3812CT -4.650e+15 5.620e+07 -8.274e+07 <2e-16 ***
## CDotNotationc3836GA -2.714e+16 4.301e+08 -6.310e+07 <2e-16 ***
## CDotNotationc3847CA -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc3883GA -2.920e+16 4.524e+08 -6.454e+07 <2e-16 ***
## CDotNotationc3944CT -9.428e+15 9.989e+07 -9.438e+07 <2e-16 ***
## CDotNotationc4258CT -2.475e+16 3.582e+08 -6.910e+07 <2e-16 ***
## CDotNotationc4328CT -2.996e+16 4.288e+08 -6.986e+07 <2e-16 ***
## CDotNotationc4652CT -2.775e+15 1.133e+08 -2.449e+07 <2e-16 ***
## CDotNotationc479CT -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc4934GA 6.012e+15 8.515e+07 7.060e+07 <2e-16 ***
## CDotNotationc4942GA 9.007e+15 7.749e+07 1.162e+08 <2e-16 ***
## CDotNotationc4958AG -1.043e+16 1.744e+08 -5.982e+07 <2e-16 ***
## CDotNotationc4991CG -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc49AG 1.938e+15 4.575e+07 4.235e+07 <2e-16 ***
## CDotNotationc5117CT -2.116e+15 1.430e+08 -1.480e+07 <2e-16 ***
## CDotNotationc5200GA -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc5233AC -1.993e+16 3.577e+08 -5.571e+07 <2e-16 ***
## CDotNotationc523GA 1.570e+14 5.486e+07 2.862e+06 <2e-16 ***
## CDotNotationc5255GA -5.235e+14 1.133e+08 -4.619e+06 <2e-16 ***
## CDotNotationc5312GA -2.731e+16 4.296e+08 -6.356e+07 <2e-16 ***
## CDotNotationc532GA -2.566e+15 5.674e+07 -4.522e+07 <2e-16 ***
## CDotNotationc553GC -5.373e+15 6.552e+07 -8.200e+07 <2e-16 ***
## CDotNotationc5557GA -2.433e+16 3.576e+08 -6.803e+07 <2e-16 ***
## CDotNotationc5603CT 9.007e+15 7.749e+07 1.162e+08 <2e-16 ***
## CDotNotationc5632GA 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc5657CT -5.771e+15 1.867e+08 -3.090e+07 <2e-16 ***
## CDotNotationc5668GA -2.731e+16 4.296e+08 -6.356e+07 <2e-16 ***
## CDotNotationc5780AC -2.865e+16 4.485e+08 -6.387e+07 <2e-16 ***
## CDotNotationc5900GA 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc610CT -9.600e+14 5.491e+07 -1.748e+07 <2e-16 ***
## CDotNotationc62AG -1.508e+15 7.382e+07 -2.043e+07 <2e-16 ***
## CDotNotationc6343GA -2.711e+16 4.301e+08 -6.305e+07 <2e-16 ***
## CDotNotationc6533CT 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc6557CT -1.979e+15 1.381e+08 -1.433e+07 <2e-16 ***
## CDotNotationc6562GA 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc661CT -8.892e+15 1.138e+08 -7.816e+07 <2e-16 ***
## CDotNotationc67363GA -2.011e+15 9.576e+07 -2.100e+07 <2e-16 ***
## CDotNotationc6944TA -3.351e+16 4.726e+08 -7.091e+07 <2e-16 ***
## CDotNotationc7013CT 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc7027CT -2.865e+16 4.485e+08 -6.387e+07 <2e-16 ***
## CDotNotationc70460CT 2.493e+15 9.576e+07 2.603e+07 <2e-16 ***
## CDotNotationc7102GA 9.007e+15 7.749e+07 1.162e+08 <2e-16 ***
## CDotNotationc710GA -4.661e+15 6.132e+07 -7.600e+07 <2e-16 ***
## CDotNotationc71585AC -2.011e+15 9.576e+07 -2.100e+07 <2e-16 ***
## CDotNotationc7234CT 4.504e+15 7.749e+07 5.812e+07 <2e-16 ***
## CDotNotationc73330GA -2.011e+15 9.576e+07 -2.100e+07 <2e-16 ***
## CDotNotationc74875AG 2.493e+15 8.318e+07 2.997e+07 <2e-16 ***
## CDotNotationc760AG NA NA NA NA
## CDotNotationc767AG -2.566e+15 7.397e+07 -3.469e+07 <2e-16 ***
## CDotNotationc7754TC -1.112e+16 1.430e+08 -7.779e+07 <2e-16 ***
## CDotNotationc7796GA -2.714e+16 4.301e+08 -6.310e+07 <2e-16 ***
## CDotNotationc79AC -2.403e+15 3.478e+07 -6.907e+07 <2e-16 ***
## CDotNotationc8090CA 2.755e+15 8.310e+07 3.315e+07 <2e-16 ***
## CDotNotationc8236CT -3.260e+16 4.499e+08 -7.247e+07 <2e-16 ***
## CDotNotationc839CT -6.483e+15 1.297e+08 -4.999e+07 <2e-16 ***
## CDotNotationc839GA -2.714e+16 4.301e+08 -6.310e+07 <2e-16 ***
## CDotNotationc8434CT 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc8456GA 9.033e+15 6.126e+07 1.475e+08 <2e-16 ***
## CDotNotationc8483CT -2.849e+16 4.720e+08 -6.035e+07 <2e-16 ***
## CDotNotationc861GA 7.604e+15 7.377e+07 1.031e+08 <2e-16 ***
## CDotNotationc8876GA 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc8921AG -3.734e+16 4.719e+08 -7.912e+07 <2e-16 ***
## CDotNotationc8939CT -2.865e+16 4.485e+08 -6.387e+07 <2e-16 ***
## CDotNotationc8967GA -2.775e+15 1.133e+08 -2.449e+07 <2e-16 ***
## CDotNotationc9083CT 6.782e+15 6.126e+07 1.107e+08 <2e-16 ***
## CDotNotationc9349GA 2.720e+15 7.351e+07 3.699e+07 <2e-16 ***
## CDotNotationc962AT NA NA NA NA
## CDotNotationc9721GA 2.807e+15 9.571e+07 2.933e+07 <2e-16 ***
## CDotNotationc9746CT 4.399e+15 6.713e+07 6.553e+07 <2e-16 ***
## CDotNotationc98CG NA NA NA NA
## CDotNotationc999GA NA NA NA NA
## GeneABL2 NA NA NA NA
## GeneACVR1B NA NA NA NA
## GeneATM NA NA NA NA
## GeneBRAF NA NA NA NA
## GeneCATSPERZ NA NA NA NA
## GeneCD276 NA NA NA NA
## GeneCSMD1 NA NA NA NA
## GeneCTLA4 NA NA NA NA
## GeneCUL3 NA NA NA NA
## GeneDNAH17 NA NA NA NA
## GeneDNAH2 NA NA NA NA
## GeneDNAH9 NA NA NA NA
## GeneEGFR NA NA NA NA
## GeneERCC5 NA NA NA NA
## GeneHNF1A NA NA NA NA
## GeneIDH1 NA NA NA NA
## GeneKMT2A NA NA NA NA
## GeneMDC1 NA NA NA NA
## GeneMST1 NA NA NA NA
## GeneMUC16 NA NA NA NA
## GeneNBN NA NA NA NA
## GeneNSD1 NA NA NA NA
## GeneNUTM1 NA NA NA NA
## GenePCLO NA NA NA NA
## GenePDGFRA NA NA NA NA
## GenePKHD1L1 NA NA NA NA
## GenePRKDC NA NA NA NA
## GenePTCH1 NA NA NA NA
## GeneRANBP2 NA NA NA NA
## GeneSLX4 NA NA NA NA
## GeneSPTBN2 NA NA NA NA
## GeneTNFAIP3 NA NA NA NA
## GeneTNKS2 NA NA NA NA
## GeneTTN NA NA NA NA
## GeneVCL NA NA NA NA
## GeneWISP3 NA NA NA NA
## GeneXIRP1 NA NA NA NA
## GeneXIRP2 NA NA NA NA
## GeneZNF217 NA NA NA NA
## GeneZNF337 NA NA NA NA
## GeneZNF74 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 566.01 on 411 degrees of freedom
## Residual deviance: 360.47 on 233 degrees of freedom
## AIC: 718.47
##
## Number of Fisher Scoring iterations: 25
pred_resp <- predict(logit_model1, newdata = test_data1, type = "response") #Predictions
conf <- table(test_data1$Melanoma, (pred_resp > 0.5)*1, dnn = c("Truth", "Predicted")) #Confusion Matrix
conf
## Predicted
## Truth 0 1
## 0 22 5
## 1 0 18
misclassification_error <- 1 - sum(diag(conf)) / sum(conf) # Misclassification error
misclassification_error
## [1] 0.1111111
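As observed earlier, the c_dot_notation (mutation) indicators dominate the model. The confusion matrix shows 5 misclassifications out of 45 (11.1%), an acceptable but not optimal result compared with previous models such as bagging or boosting. Two features of the summary call for caution, however. First, the estimates are on the order of 1e+15 and the fit stopped at 25 Fisher scoring iterations, the default maximum in glm, which suggests that the algorithm did not converge (typically a sign of separation), so the reported p-values should not be read as evidence of significance. Second, many coefficients are “NA”. As the summary header states, these 49 coefficients are not defined because of singularities: the column is an exact linear combination of other columns, so its effect cannot be estimated separately. This is plausible here because a mutation generally belongs to one gene, so the gene indicators are already determined by the mutation indicators. To examine the genes on their own, another logistic regression is fitted that excludes the c_dot_notation (mutation) indicators.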
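The summary header above reports coefficients "not defined because of singularities". A toy example shows how such NA coefficients arise when one indicator column is an exact sum of others. The data and the names mutA, mutB and geneX are invented for illustration.

```r
# Hypothetical toy data: two mutation indicators that both belong to the same gene
toy <- data.frame(
  mutA = c(1, 0, 0, 1, 0, 0, 1, 0),
  mutB = c(0, 1, 0, 0, 1, 0, 0, 1),
  y    = c(1, 0, 0, 1, 1, 1, 0, 1)
)

# The gene indicator is fully determined by the mutation indicators
toy$geneX <- toy$mutA + toy$mutB

fit <- glm(y ~ mutA + mutB + geneX, data = toy, family = binomial)

# geneX is reported as NA: it adds no information beyond mutA and mutB
coef(fit)
```

In this sketch the NA says nothing about whether geneX matters; it only says that its effect cannot be separated from the columns already in the model.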
genes_to_include <- c("GeneABL2", "GeneACVR1B", "GeneATM", "GeneBRAF", "GeneCATSPERZ",
"GeneCD276", "GeneCSMD1", "GeneCTLA4", "GeneCUL3", "GeneDNAH17",
"GeneDNAH2", "GeneDNAH9", "GeneEGFR", "GeneERCC5", "GeneHNF1A",
"GeneIDH1", "GeneKMT2A", "GeneMDC1", "GeneMST1", "GeneMUC16",
"GeneNBN", "GeneNSD1", "GeneNUTM1", "GenePCLO", "GenePDGFRA",
"GenePKHD1L1", "GenePRKDC", "GenePTCH1", "GeneRANBP2", "GeneSLX4",
"GeneSPTBN2", "GeneTNFAIP3", "GeneTNKS2", "GeneTTN", "GeneVCL",
"GeneWISP3", "GeneXIRP1", "GeneXIRP2", "GeneZNF217", "GeneZNF337",
"GeneZNF74")
formula <- as.formula(paste("Melanoma ~ ", paste(c(genes_to_include, "AffectedExons", "AllExons"), collapse = " + ")))
logit_model2 <- glm(formula, data = train_data1, family = binomial)
summary(logit_model2)
##
## Call:
## glm(formula = formula, family = binomial, data = train_data1)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.309e+13 9.743e+13 -0.340 0.734
## GeneABL2 -4.471e+15 9.743e+13 -45.885 <2e-16 ***
## GeneACVR1B -4.471e+15 9.743e+13 -45.885 <2e-16 ***
## GeneATM 3.309e+13 9.743e+13 0.340 0.734
## GeneBRAF 3.309e+13 9.743e+13 0.340 0.734
## GeneCATSPERZ 3.309e+13 9.743e+13 0.340 0.734
## GeneCD276 3.309e+13 9.743e+13 0.340 0.734
## GeneCSMD1 3.309e+13 9.743e+13 0.340 0.734
## GeneCTLA4 3.309e+13 9.743e+13 0.340 0.734
## GeneCUL3 3.309e+13 9.743e+13 0.340 0.734
## GeneDNAH17 3.309e+13 9.743e+13 0.340 0.734
## GeneDNAH2 3.309e+13 9.743e+13 0.340 0.734
## GeneDNAH9 3.309e+13 9.743e+13 0.340 0.734
## GeneEGFR 3.309e+13 9.743e+13 0.340 0.734
## GeneERCC5 3.309e+13 9.743e+13 0.340 0.734
## GeneHNF1A 3.309e+13 9.743e+13 0.340 0.734
## GeneIDH1 3.309e+13 9.743e+13 0.340 0.734
## GeneKMT2A 3.309e+13 9.743e+13 0.340 0.734
## GeneMDC1 3.309e+13 9.743e+13 0.340 0.734
## GeneMST1 3.309e+13 9.743e+13 0.340 0.734
## GeneMUC16 3.309e+13 9.743e+13 0.340 0.734
## GeneNBN 3.309e+13 9.743e+13 0.340 0.734
## GeneNSD1 3.309e+13 9.743e+13 0.340 0.734
## GeneNUTM1 3.309e+13 9.743e+13 0.340 0.734
## GenePCLO 3.309e+13 9.743e+13 0.340 0.734
## GenePDGFRA 3.309e+13 9.743e+13 0.340 0.734
## GenePKHD1L1 3.309e+13 9.743e+13 0.340 0.734
## GenePRKDC 3.309e+13 9.743e+13 0.340 0.734
## GenePTCH1 3.309e+13 9.743e+13 0.340 0.734
## GeneRANBP2 3.309e+13 9.743e+13 0.340 0.734
## GeneSLX4 3.309e+13 9.743e+13 0.340 0.734
## GeneSPTBN2 3.309e+13 9.743e+13 0.340 0.734
## GeneTNFAIP3 3.309e+13 9.743e+13 0.340 0.734
## GeneTNKS2 3.309e+13 9.743e+13 0.340 0.734
## GeneTTN 3.309e+13 9.743e+13 0.340 0.734
## GeneVCL 3.309e+13 9.743e+13 0.340 0.734
## GeneWISP3 3.309e+13 9.743e+13 0.340 0.734
## GeneXIRP1 3.309e+13 9.743e+13 0.340 0.734
## GeneXIRP2 3.309e+13 9.743e+13 0.340 0.734
## GeneZNF217 3.309e+13 9.743e+13 0.340 0.734
## GeneZNF337 3.309e+13 9.743e+13 0.340 0.734
## GeneZNF74 3.309e+13 9.743e+13 0.340 0.734
## AffectedExons -4.837e-03 3.070e-02 -0.158 0.875
## AllExons NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 566.01 on 411 degrees of freedom
## Residual deviance: 120.22 on 369 degrees of freedom
## AIC: 206.22
##
## Number of Fisher Scoring iterations: 25
pred_prob <- predict(logit_model2, newdata = test_data1, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
conf_mat <- table(test_data1$Melanoma, pred_class, dnn = c("Truth", "Predicted")) #Confusion Matrix
print(conf_mat)
## Predicted
## Truth 0 1
## 0 27 0
## 1 2 16
misclassification_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat) # Misclassification error
misclassification_error
## [1] 0.04444444
The second logistic model, which excluded the ‘CDotNotation’ mutation indicators, reported only two genes, “GeneABL2” and “GeneACVR1B,” as significant; the p-values of the remaining genes exceeded 0.05. These results should be read with caution, since the very large estimates and the 25 Fisher scoring iterations again suggest that the fit did not converge. ‘AllExons’ yielded an NA value because of a singularity, presumably because the total number of exons is fixed for each gene and is therefore already determined by the gene indicators; the NA by itself says nothing about its importance. Despite the omission of the mutation variable, predictive accuracy on the test set improved: only 2 of 45 observations were misclassified, a misclassification error of 4.44%.
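The test-set results reported so far can be collected in one table for comparison. The counts below are copied by hand from the confusion matrices above; the data frame and its column names are illustrative.

```r
# Misclassified observations and test-set sizes, taken from the confusion matrices above
results <- data.frame(
  model = c("Random forest (no c_dot_notation)", "Bagging", "Boosting",
            "Logistic regression (all indicators)",
            "Logistic regression (genes + exons)"),
  errors = c(5, 2, 2, 5, 2),
  n_test = c(44, 44, 44, 45, 45)
)

# Misclassification error per model
results$error_rate <- round(results$errors / results$n_test, 4)

# Best-performing models first
results[order(results$error_rate), ]
```

With test sets of only 44 or 45 observations, a difference of one or two errors changes the rate by several percentage points, so the ranking should be treated as indicative rather than conclusive.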
genes_to_include1 <- c("GeneABL2", "GeneACVR1B", "GeneATM", "GeneBRAF", "GeneCATSPERZ",
"GeneCD276", "GeneCSMD1", "GeneCTLA4", "GeneCUL3", "GeneDNAH17",
"GeneDNAH2", "GeneDNAH9", "GeneEGFR", "GeneERCC5", "GeneHNF1A",
"GeneIDH1", "GeneKMT2A", "GeneMDC1", "GeneMST1", "GeneMUC16",
"GeneNBN", "GeneNSD1", "GeneNUTM1", "GenePCLO", "GenePDGFRA",
"GenePKHD1L1", "GenePRKDC", "GenePTCH1", "GeneRANBP2", "GeneSLX4",
"GeneSPTBN2", "GeneTNFAIP3", "GeneTNKS2", "GeneTTN", "GeneVCL",
"GeneWISP3", "GeneXIRP1", "GeneXIRP2", "GeneZNF217", "GeneZNF337",
"GeneZNF74")
formula1 <- as.formula(paste("Melanoma ~ ", paste(genes_to_include1, collapse = " + ")))
logit_model3 <- glm(formula1, data = train_data1, family = binomial)
summary(logit_model3)
##
## Call:
## glm(formula = formula1, family = binomial, data = train_data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.258e+13 6.567e+13 -0.344 0.731
## GeneABL2 -4.481e+15 6.567e+13 -68.231 <2e-16 ***
## GeneACVR1B -4.481e+15 6.567e+13 -68.231 <2e-16 ***
## GeneATM 2.258e+13 6.567e+13 0.344 0.731
## GeneBRAF 2.258e+13 6.567e+13 0.344 0.731
## GeneCATSPERZ 2.258e+13 6.567e+13 0.344 0.731
## GeneCD276 2.258e+13 6.567e+13 0.344 0.731
## GeneCSMD1 2.258e+13 6.567e+13 0.344 0.731
## GeneCTLA4 2.258e+13 6.567e+13 0.344 0.731
## GeneCUL3 2.258e+13 6.567e+13 0.344 0.731
## GeneDNAH17 2.258e+13 6.567e+13 0.344 0.731
## GeneDNAH2 2.258e+13 6.567e+13 0.344 0.731
## GeneDNAH9 2.258e+13 6.567e+13 0.344 0.731
## GeneEGFR 2.258e+13 6.567e+13 0.344 0.731
## GeneERCC5 2.258e+13 6.567e+13 0.344 0.731
## GeneHNF1A 2.258e+13 6.567e+13 0.344 0.731
## GeneIDH1 2.258e+13 6.567e+13 0.344 0.731
## GeneKMT2A 2.258e+13 6.567e+13 0.344 0.731
## GeneMDC1 2.258e+13 6.567e+13 0.344 0.731
## GeneMST1 2.258e+13 6.567e+13 0.344 0.731
## GeneMUC16 2.258e+13 6.567e+13 0.344 0.731
## GeneNBN 2.258e+13 6.567e+13 0.344 0.731
## GeneNSD1 2.258e+13 6.567e+13 0.344 0.731
## GeneNUTM1 2.258e+13 6.567e+13 0.344 0.731
## GenePCLO 2.258e+13 6.567e+13 0.344 0.731
## GenePDGFRA 2.258e+13 6.567e+13 0.344 0.731
## GenePKHD1L1 2.258e+13 6.567e+13 0.344 0.731
## GenePRKDC 2.258e+13 6.567e+13 0.344 0.731
## GenePTCH1 2.258e+13 6.567e+13 0.344 0.731
## GeneRANBP2 2.258e+13 6.567e+13 0.344 0.731
## GeneSLX4 2.258e+13 6.567e+13 0.344 0.731
## GeneSPTBN2 2.258e+13 6.567e+13 0.344 0.731
## GeneTNFAIP3 2.258e+13 6.567e+13 0.344 0.731
## GeneTNKS2 2.258e+13 6.567e+13 0.344 0.731
## GeneTTN 2.258e+13 6.567e+13 0.344 0.731
## GeneVCL 2.258e+13 6.567e+13 0.344 0.731
## GeneWISP3 2.258e+13 6.567e+13 0.344 0.731
## GeneXIRP1 2.258e+13 6.567e+13 0.344 0.731
## GeneXIRP2 2.258e+13 6.567e+13 0.344 0.731
## GeneZNF217 2.258e+13 6.567e+13 0.344 0.731
## GeneZNF337 2.258e+13 6.567e+13 0.344 0.731
## GeneZNF74 2.258e+13 6.567e+13 0.344 0.731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 566.01 on 411 degrees of freedom
## Residual deviance: 120.24 on 370 degrees of freedom
## AIC: 204.24
##
## Number of Fisher Scoring iterations: 25
pred_prob <- predict(logit_model3, newdata = test_data1, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
conf_mat <- table(test_data1$Melanoma, pred_class)
print(conf_mat)
## pred_class
## 0 1
## 0 26 1
## 1 3 15
misclassification_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat) # Misclassification error
misclassification_error
## [1] 0.08888889
In the third logistic regression, the two predictors with the highest p-values, “Affected Exons” and “All Exons,” were excluded. Apart from “GeneACVR1B” and “GeneABL2,” which were significant, all remaining genes shared a uniform p-value of 0.731, essentially unchanged from the 0.734 of the previous model. The identical estimates on the order of 1e+13 and the 25 Fisher scoring iterations (the default maximum in glm) suggest that the fit is numerically unstable, so these p-values should be read with caution. The model also predicted Melanoma less accurately: four test observations were misclassified, a misclassification rate of 8.89%. The second logistic regression therefore produced the most favorable outcome of the three, with a misclassification rate of 4.44%.
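Very large coefficients in a logistic regression often arise when a predictor separates the two classes perfectly. The sketch below illustrates this on a toy dataset (hypothetical, not the study's data) and shows one way to collect the warnings and diagnostics that `glm` provides; whether separation is the cause in the models above is an assumption that would need to be checked on the real data.

```r
# Toy data (hypothetical): x perfectly separates the two classes.
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                  y = c( 0,  0,  0, 1, 1, 1))

# Collect warnings so they can be inspected after fitting.
warn_msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warn_msgs)       # warnings raised during fitting
print(coef(fit))       # the slope grows far beyond what the data can justify
print(fit$iter)        # number of Fisher scoring iterations used
print(fit$converged)   # convergence flag reported by glm
```

The same checks (`fit$iter`, `fit$converged`, and the captured warnings) can be applied to any fitted `glm` object before its p-values are interpreted.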
set.seed(123)
library(randomForest)
rf_model1 <- randomForest(Melanoma ~ ., data = train_data1, ntree = 500) # Model; the argument is named ntree (500 trees, which is also the default)
rf_model1
##
## Call:
## randomForest(formula = Melanoma ~ ., data = train_data1, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 75
##
## Mean of squared residuals: 0.04059789
## % Var explained: 83.56
rf_pred1<-predict(rf_model1,test_data1) #Predictions
table(test_data1$Melanoma, (rf_pred1 > 0.5)*1, dnn = c("Truth", "Predicted")) # Confusion matrix at a 0.5 cutoff
## Predicted
## Truth 0 1
## 0 22 5
## 1 0 18
varImpPlot(rf_model1, n.var = 10, main = "Top 10 Variable Importance Plot", cex.axis = 0.7, las = 2)
The previous Random Forest model was limited by the categorical variable c_dot_notation, which had more categories than randomForest can handle (a maximum of 53 per categorical predictor), so a new Random Forest was fitted after transforming the data to work around this limit. Despite the refinement, predictive performance declined slightly: five data points were misclassified, all of them non-Melanoma cases predicted as Melanoma. Because Melanoma is stored as a numeric variable, the forest was fitted in regression mode, and it explained 83.56% of the variance, a respectable result. Many genes appear among the ten most influential variables. This aligns with the earlier descriptive analytics conducted in Tableau, which identified genes DNAH9, PCLO, and PKHD1L1 as highly influential factors for Melanoma.
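The text does not state which transformation was applied. One common way to get around the 53-category limit is dummy (one-hot) coding, and predictor names such as `GeneABL2` are consistent with it, but that is an assumption. The sketch below shows the idea with `model.matrix` on a hypothetical data frame `toy` whose `Gene` factor has 60 levels.

```r
# Hypothetical example: a factor with more than 53 levels.
set.seed(42)
gene_ids <- paste0("G", 1:60)
toy <- data.frame(
  Gene = factor(c(gene_ids, sample(gene_ids, 240, replace = TRUE))),
  AffectedExons = rpois(300, 3)
)

# One 0/1 column per level; "- 1" drops the intercept so no level is omitted.
dummies <- model.matrix(~ Gene - 1, data = toy)

# Numeric-only data frame that tree-based models can consume directly.
encoded <- cbind(toy["AffectedExons"], as.data.frame(dummies))
dim(encoded)
```

Each row has exactly one `Gene*` column set to 1, so the high-cardinality factor is replaced by numeric indicators at the cost of a much wider table.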
set.seed(123)
knn_gene <- knn(train = train_data1[, -2], test = test_data1[, -2], cl = train_data1[, 2], k = 5) #Model , k=5 means that the algorithm will consider the five closest data points to the point being classified.
knn_gene
## [1] 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1
## [39] 0 0 0 0 0 0 1
## Levels: 0 1
table(test_data1[,2], knn_gene, dnn = c("True", "Predicted")) #Confusion Matrix
## Predicted
## True 0 1
## 0 24 3
## 1 1 17
sum(test_data1[, 2] != knn_gene)
## [1] 4
misclassification_percentage <- sum(test_data1[, 2] != knn_gene) / length(test_data1[, 2]) * 100
misclassification_percentage
## [1] 8.888889
K-Nearest Neighbors (KNN) with k = 5 was applied to the dataset. Four of the 45 test observations were misclassified, a misclassification rate of 8.89%. This matches the third logistic regression and indicates that KNN is a viable model for predicting Melanoma on this dataset.
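To make the role of k concrete, here is a minimal base-R sketch of the k-nearest-neighbors vote. It is an illustration only, not the implementation behind `knn()`, and the function `knn_predict` and the toy data are hypothetical. It uses Euclidean distance and resolves ties by taking the first class, which is a simplification.

```r
# Minimal k-NN classifier in base R (illustrative sketch).
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(point) {
    # Euclidean distance from this test point to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, point)^2))
    nearest <- order(d)[seq_len(k)]
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]   # majority class (first one wins ties)
  })
}

# Hypothetical two-cluster data
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c(0, 0, 0, 1, 1, 1)
test_x  <- rbind(c(0.5, 0.5), c(5.5, 5.5))

toy_pred <- knn_predict(train_x, train_y, test_x, k = 3)
print(toy_pred)
```

The first test point lies inside the class-0 cluster and the second inside the class-1 cluster, so each is assigned the label of its surrounding neighbors.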
set.seed(1234567)
maxs <- apply(genes_df1, 2, max) # maximum value of each column
mins <- apply(genes_df1, 2, min) #minimum value of each column
scaled <- as.data.frame(scale(genes_df1, center = mins, scale = maxs - mins))
index <- sample(nrow(genes_df1), nrow(genes_df1)*0.90)
train_genes_df1 <- scaled[index,]
test_genes_df1 <- scaled[-index,]
nn <- neuralnet(Melanoma ~ ., data = train_genes_df1, hidden = c(1, 1), linear.output = FALSE, algorithm = 'rprop+') # Two hidden layers with one neuron each; linear.output = FALSE applies the logistic activation to the output
nn$act.fct
## function (x)
## {
## 1/(1 + exp(-x))
## }
## <bytecode: 0x0000025c839349e0>
## <environment: 0x0000025c839313f0>
## attr(,"type")
## [1] "logistic"
plot(nn)
pr_nn <- compute(nn, test_genes_df1[,1:228])
prob_nn_out <- predict(nn, test_genes_df1, type = "response")
pcut_nn<-0.5
pred_nn_out <- (prob_nn_out >= pcut_nn)*1
table(test_genes_df1$Melanoma, pred_nn_out, dnn = c("Observed", "Predicted"))
## Predicted
## Observed 0 1
## 0 29 0
## 1 0 17
The logistic neural network, suited to binary classification, was the final and most complex model fitted. It predicts the probability of belonging to one of two classes, and the dataset was min-max normalized before fitting to improve training. On its held-out set of 46 observations, which comes from a separate random split rather than the 45-observation test set used for the other models, the network classified every observation correctly. Although the plot of the network, with its 228 variables, is visually overwhelming, the outcome aligns with our expectations and objectives.
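The code above computes the minima and maxima on the full dataset before splitting. A common variant, shown here as a sketch and not as the method used in this study, derives them from the training rows only and reuses them for the test rows, so that no information from the test set enters the preprocessing. The helpers `fit_minmax` and `apply_minmax` and the data frame `toy` are hypothetical.

```r
# Learn min-max parameters from the training rows only (hypothetical helper).
fit_minmax <- function(train) {
  mins   <- apply(train, 2, min)
  ranges <- apply(train, 2, max) - mins
  ranges[ranges == 0] <- 1          # constant columns: avoid division by zero
  list(center = mins, scale = ranges)
}

# Apply previously learned parameters to any data frame with the same columns.
apply_minmax <- function(data, params) {
  as.data.frame(scale(data, center = params$center, scale = params$scale))
}

set.seed(1)
toy   <- data.frame(a = runif(50, 10, 20), b = rbinom(50, 1, 0.4), c = 7)
index <- sample(nrow(toy), floor(nrow(toy) * 0.9))

params       <- fit_minmax(toy[index, ])
train_scaled <- apply_minmax(toy[index, ], params)
test_scaled  <- apply_minmax(toy[-index, ], params)

range(train_scaled$a)
```

With this variant, test values can fall slightly outside the 0 to 1 interval when a test row is more extreme than anything seen in training, which is expected.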
As observed, ten distinct models were constructed to predict Melanoma. Most of these models yielded comparable outcomes, and the weakest had a misclassification rate of 11%, a level deemed acceptable. Logistic Regression without C_Dot_Notation, Bagging, Boosting, and Decision Tree models produced identical results. The standout performer was the Neural Network, which made no errors on its test set. Three of the top five models were complex models. Such models are valuable when the relationships between input features and the target variable are intricate, nonlinear, or high-dimensional, because they have the capacity to capture those complexities.
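Misclassification error alone hides whether a model misses Melanoma cases or raises false alarms. The sketch below recomputes error, sensitivity, and specificity from the confusion matrices printed earlier in this section for the four models evaluated on the shared 45-observation test set; the neural network is left out because it was evaluated on a different split. The helper names `make_cm` and `metrics` are illustrative.

```r
# Confusion matrices copied from the outputs above (rows = truth, columns = predicted).
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fn, fp, tp), nrow = 2,
         dimnames = list(Truth = c("0", "1"), Predicted = c("0", "1")))
}

cms <- list(
  logit_model2  = make_cm(27, 0, 2, 16),
  logit_model3  = make_cm(26, 1, 3, 15),
  random_forest = make_cm(22, 5, 0, 18),
  knn           = make_cm(24, 3, 1, 17)
)

metrics <- function(cm) {
  c(error       = 1 - sum(diag(cm)) / sum(cm),   # share of misclassified cases
    sensitivity = cm["1", "1"] / sum(cm["1", ]), # Melanoma cases detected
    specificity = cm["0", "0"] / sum(cm["0", ])) # non-Melanoma cases detected
}

comparison <- t(sapply(cms, metrics))
print(round(comparison, 3))
```

Read this way, the random forest has the highest error but misses no Melanoma case, while the second logistic regression has the lowest error and no false positives.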
Melanoma presents a significant health challenge, necessitating intensified research efforts. Genomic analysis offers valuable insights into the disease’s molecular mechanisms, facilitating the discovery of druggable targets for pharmaceutical companies. By leveraging genomic data, these companies can expedite the development of targeted therapies, offering hope to melanoma patients. Thus, genomic analysis holds immense promise in advancing the understanding of melanoma and driving innovative treatment strategies.
Melanoma represents a significant public health challenge in the United States because of its prevalence and potentially fatal consequences, which persist even though the disease is largely preventable through measures such as sunscreen application and regular skin examinations. Despite considerable efforts to raise awareness, it continues to claim many lives annually. This study was undertaken to identify underlying patterns associated with melanoma. A dataset of over 35,000 rows was compiled from 25 sources, including 14 from local sequencing laboratories in Carmel, IN, and 9 from the National Cancer Institute’s GDC Data Portal; it covers melanoma as well as lung, head and neck, thyroid, and breast cancer. The primary objective was to uncover common factors among melanoma patients that could serve as targets for drug development. Following gene identification, predictive analytics were conducted with logistic regression, decision trees, random forests, boosting, bagging, neural networks, and k-nearest neighbors (KNN), with a focus on predicting melanoma. The results of the predictive analysis underscore the promise of these models for predicting melanoma. This analysis aims to offer insights into genes and associated factors that can inform pharmaceutical companies and make drug discovery for melanoma more efficient and cost-effective.
The analysis identified several tumor suppressor genes crucial for maintaining cell health. Importantly, these genes impact cancer development broadly rather than targeting specific types. Therefore, mutations in genes like FAT1 or SPTA1 are associated with various cancer types, highlighting their pivotal role in tumorigenesis.
The analysis revealed a common trend where mutations were frequently localized to specific regions within genes. This suggests that certain areas of the genome may be more susceptible to genetic alterations, highlighting potential targets for further investigation.
The analysis revealed that genes such as MUC16, PCLO, DNAH9, PKHD1L1, CSMD1, DNAH17, and XIRP2 were most frequently observed in patients with melanoma.
PCLO, DNAH9, PKHD1L1, and XIRP2 were identified as genes with the most significant influence in the analysis. Further research into these genes by healthcare professionals could provide valuable insights into their roles in cancer biology and potential therapeutic targets.
Location 18/19 in the PCLO gene exhibited the highest frequency of mutations among all genes identified in melanoma patients. This hotspot region may harbor important genetic variations associated with melanoma development and progression.
The analysis revealed a correlation between the number of mutations within the same exon and the likelihood of cancer development. This suggests that accumulation of mutations in specific genomic regions may increase cancer susceptibility and underscores the importance of investigating these regions further.
Predictive analytics demonstrated that mutations have a substantial impact on cancer development. Understanding the genetic alterations driving cancer initiation and progression is crucial for developing targeted treatment strategies.
The insights gleaned from the analysis offer valuable guidance for pharmaceutical companies, particularly in the realm of melanoma research and drug development. By pinpointing key tumor suppressor genes, hotspot regions for mutations, and influential genetic factors specific to melanoma, the analysis enables pharmaceutical companies to refine their focus on the most promising targets for therapeutic intervention. It’s important to note that while these insights provide valuable direction, they are not definitive answers and require further research and validation in actual laboratory settings. This targeted approach streamlines research efforts and allows for more efficient allocation of resources, facilitating the development of cost-effective and highly efficacious treatments tailored to melanoma. Leveraging these insights, pharmaceutical companies can expedite the discovery and development of novel therapies for melanoma, ultimately leading to improved patient outcomes and addressing the unmet medical needs in melanoma treatment.
Limited Dataset Size: The size of the dataset may limit the generalizability of findings. Utilizing a larger dataset would enhance result robustness. Data cleaning procedures, while necessary, may further reduce the dataset’s size, potentially impacting imputation capabilities.
Potential Bias in Data Source: Analysis could be biased if data primarily consists of targeted sequencing rather than whole-genome sequencing from the National Cancer Institute’s GDC Data Portal. This limitation may affect the comprehensiveness of genomic insights.
Broader Cancer Spectrum Inclusion: Enriching the analysis with a broader spectrum of cancer types beyond melanoma would enhance reliability and contextual understanding of genetic factors influencing cancer susceptibility and progression.
Increased Melanoma Patient Cohort: Expanding the melanoma patient cohort would improve the ability to identify correlations and discern patterns within the dataset. A larger sample size enhances statistical power and facilitates more robust analyses.
Data Collection Practices: Variations in data collection practices, including sample collection methods, sequencing techniques, and data processing pipelines, can affect data quantity and quality in sequencing databases.
Prevalence of Cancer Types: Variation in cancer prevalence affects the volume of available data for analysis. Cancers with higher mortality rates or greater research focus may have more extensive genomic studies and sequencing efforts.
Singular Focus on Single Nucleotide Variants (SNVs): While SNVs are a primary focus, acknowledging other genomic alterations beyond SNVs would provide a more nuanced understanding of melanoma pathogenesis.
Multifactorial Nature of Melanoma: Melanoma is influenced by genetic, environmental, and lifestyle factors. Non-genetic determinants such as sun exposure and family history should be considered alongside genetic factors.
Demographic Representation Bias: Biases or underrepresentation in data collection may compromise model generalizability across diverse populations.
Challenges with Categorical Data: High volumes of unique categorical variables may complicate model development, requiring advanced techniques for effective analysis.
Model Performance Constraints: Computational errors may arise from a large number of categories, necessitating careful model selection and preprocessing techniques.
Validation of Predictive Models: Independent dataset validation is crucial for evaluating model accuracy and reliability, enhancing confidence in their utility for clinical decision-making and research.
For substantial progress in this study, securing a larger and more diverse dataset encompassing various cancer types is crucial. Standardizing data collection methods across all sources is essential to eliminate bias. Lastly, validating the models using external datasets would offer invaluable insights into the predictive capabilities of the analysis.