Melanoma Gene Identifier

Introduction

Problem Statement

Melanoma, a highly aggressive form of skin cancer originating from melanocytes, poses a significant health challenge in the United States. Despite constituting only about 1% of skin cancers, it is responsible for the majority of skin cancer-related deaths due to its propensity to metastasize if not detected and treated promptly. Annually, approximately 100,640 new cases are diagnosed, with an estimated 8,290 individuals succumbing to the disease. While recent advancements in treatment have led to a decline in death rates, demographic disparities persist. Variations in incidence rates among age groups and genders continue to evolve, with concerning increases observed in women aged 50 and older.

Moreover, melanoma disproportionately affects individuals with lighter skin tones, with white populations facing a substantially higher lifetime risk compared to Black and Hispanic individuals. The prevalence of melanoma among young adults, particularly young women, underscores the urgency for deeper research. Understanding the intricate interplay of risk factors, including genetic predisposition, environmental influences, and behavioral patterns, is imperative for developing targeted prevention strategies and enhancing early detection methods. Given the evolving epidemiological landscape and the potential impact of tailored interventions, there is a pressing need for comprehensive research initiatives to address the multifaceted challenges posed by melanoma in the United States.

Business Problem

The challenge in melanoma drug development lies in efficiently identifying the genetic markers that matter. Pharmaceutical companies currently have to sift through 22,000 genes and their mutations, which is time-consuming and resource-intensive. Our solution streamlines this process by pinpointing and prioritizing the most relevant genetic markers upfront, saving time and resources and offering a more effective path to targeted melanoma treatments.

Proposed approach and analytics techniques

The analysis will focus on examining data from 25 patients diagnosed with a variety of cancers, including head and neck, breast, lung, melanoma, and thyroid cancer. The primary objective is to uncover significant patterns and connections within this dataset that may lead to the identification of genes associated with Melanoma. While this study cannot fully resolve the issue due to the diversity of individual genetic profiles, it will play a crucial role in narrowing down genes warranting further investigation by molecular biologists in the laboratory.

Although not every cancer-related gene and mutation can be definitively isolated from the 22,000 genes, the study is expected to identify a subset that is common across multiple patients, providing valuable direction for subsequent research. The methodology combines exploratory data analysis and Pareto analysis. Descriptive analytics are performed first, followed by data narrowing in Tableau. The narrowed data is then used for predictive modeling, and the predictions are cross-checked against visuals built on the same narrowed data. The aim is to shed light on the genetic underpinnings of cancer, paving the way for more targeted interventions and treatments in the future.

Interested Stakeholders

Stakeholders, primarily pharmaceutical companies focused on drug development, should be particularly interested in this business problem due to its potential to significantly enhance time and cost efficiency in the drug development process. By identifying genetic markers associated with Melanoma, these companies can expedite the discovery of targeted therapies, reducing research and development timelines and minimizing associated costs. This efficiency not only accelerates the availability of new treatments but also strengthens the competitive position of pharmaceutical companies in the market. Moreover, the ability to offer personalized treatment approaches based on genetic profiling holds the promise of more effective therapies with fewer adverse effects, aligning with the industry’s ongoing pursuit of innovative and patient-centric solutions. Thus, addressing this business problem has the potential to yield substantial benefits for pharmaceutical companies and their stakeholders, driving progress in cancer treatment and improving patient outcomes.

Data Preparation

This section outlines the procedures undertaken to prepare the data for analysis. Each step is described in detail and accompanied by the corresponding code.

Original Data Source

The data utilized in this study originates from two primary sources:

Fourteen data sets were procured from the DCL Pathology molecular laboratory in Indiana, acquired in January 2024 through gene sequencing processes.
- 11 variables, around 1,100 rows each. The data is mostly categorical.
Eleven data sets were obtained from the National Cancer Institute’s GDC Data Portal, collected in 2022.
- 140 variables, with row counts varying by data set. The data comprises both categorical and numerical types.

Initial Data Cleaning in Power Query, Excel

DCL Pathology Data

The following cleaning steps were applied to the 14 data sets obtained from DCL Pathology to ensure compatibility with the analysis format. These datasets serve as the primary source for the study.

Converted the file from TSV to Excel format for easier data handling.
Removed 46 irrelevant rows to focus on important data.
Promoted row 47 to headers for clear column labeling.
Split the “P-Dot Notation” column into two for better organization.
Trimmed characters before “c” in the “C-Dot Notation” column for consistency.
Inserted a “Patient_ID” column for individual identification.
Added a “Cancer_type” column for cancer categorization.
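The Power Query steps above could be reproduced in R roughly as follows. This is a minimal sketch on an in-memory toy table, not the procedure actually used in the study: the helper name `clean_dcl`, the `;` delimiter in “P-Dot Notation”, the `TX_A:` style prefix in “C-Dot Notation”, and all sample values are assumptions made for illustration.

```r
# Toy stand-in for one raw TSV export: 46 metadata lines, then the header row.
raw_text <- paste(c(
  rep("## run metadata (ignored)", 46),
  "Gene\tC-Dot Notation\tP-Dot Notation",
  "BRAF\tTX_A:c.1799T>A\tp.Val600Glu;p.V600E",
  "GENE_X\tTX_B:c.100A>G\tp.Xaa1Yaa;p.X1Y"
), collapse = "\n")

# Hypothetical helper mirroring the cleaning steps for a single patient file.
clean_dcl <- function(raw_text, patient_id, cancer_type) {
  # Skip the 46 metadata rows so that row 47 becomes the header.
  df <- read.delim(text = raw_text, skip = 46, check.names = FALSE,
                   stringsAsFactors = FALSE)
  # Split "P-Dot Notation" into two columns at the assumed ";" delimiter.
  parts <- strsplit(df[["P-Dot Notation"]], ";", fixed = TRUE)
  df[["P-Dot Notation.1"]] <- vapply(parts, function(p) p[1], character(1))
  df[["P-Dot Notation.2"]] <- vapply(parts, function(p) p[2], character(1))
  df[["P-Dot Notation"]] <- NULL
  # Drop everything before the first "c." in "C-Dot Notation".
  df[["C-Dot Notation"]] <- sub("^.*?(c\\.)", "\\1", df[["C-Dot Notation"]],
                                perl = TRUE)
  # Tag every row with the patient and the cancer type.
  df$Patient_ID <- patient_id
  df$Cancer_type <- cancer_type
  df
}

cleaned <- clean_dcl(raw_text, patient_id = "P01", cancer_type = "melanoma")
print(cleaned)
```

Wrapping the steps in one function would let the same cleaning be applied identically to each of the 14 files, which is harder to guarantee with manual Power Query edits.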

National Cancer Institute Data

The 11 data sets obtained from the National Cancer Institute’s GDC Data Portal underwent meticulous cleaning to ensure compatibility with the analysis format. As these data sets serve as a secondary source for the study, the cleaning process was tailored to align the data format with that of the primary source.

Extracted data from MAF files to Excel format for ease of analysis.
Removed the top 7 rows to eliminate unnecessary header information.
Set the first row as headers to ensure proper column labeling.
Removed 129 unnecessary columns to streamline the dataset.
Inserted a “Patient_ID” column for individual patient identification.
Added a “Cancer_type” column for categorizing cancer types.
Calculated allele frequency by dividing “t_alt_count” by “t_depth.”
Converted allele frequency to percentage format for easier interpretation.
Changed column names to standardize terminology for consistency:

“Hugo_Symbol” was renamed to “Gene.”
“Start_Position” was changed to “Genomic Position.”
“Reference_Allele” became “Reference Call.”
“Tumor_Seq_Allele2” was updated to “Alternative Call.”
“HGVSc” was adjusted to “C-Dot Notation.”
“HGVSp” was modified to “P-Dot Notation.2.”
“Exon_Number” was renamed to “Affected Exon(s).”
“t_depth” was changed to “Depth.”
“Consequence” was adjusted to “Consequence(s).”
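The allele-frequency calculation and the renaming can be sketched in R as below. The data frame is a toy stand-in with made-up values, only a subset of the renaming is shown, and the `allele_frequency_pct` column is a hypothetical name used here to separate the numeric fraction from its percentage display.

```r
# Toy stand-in for a few MAF columns (values are illustrative only).
maf <- data.frame(
  Hugo_Symbol = c("BRAF", "GENE_X"),
  Start_Position = c(1000L, 2000L),
  t_depth = c(200L, 80L),
  t_alt_count = c(50L, 60L),
  stringsAsFactors = FALSE
)

# Allele frequency = alternate-allele reads / total reads at the position.
maf$allele_frequency <- maf$t_alt_count / maf$t_depth
# Percentage is only a display format; the numeric column stays a fraction.
maf$allele_frequency_pct <- sprintf("%.1f%%", 100 * maf$allele_frequency)

# Rename columns to the primary source's terminology (subset of the mapping).
rename_map <- c(Hugo_Symbol = "Gene",
                Start_Position = "Genomic Position",
                t_depth = "Depth",
                allele_frequency = "Allele Frequency")
names(maf)[match(names(rename_map), names(maf))] <- rename_map

print(maf)
```

Keeping the fraction numeric and formatting the percentage separately avoids turning the column into text, which would block numeric summaries later on.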

Required R-packages

readxl: This library facilitates the reading of Excel files directly into R, providing functions to import data from spreadsheets with ease. It offers robust support for various Excel file formats and enables users to extract data seamlessly for further analysis.
dplyr: dplyr is a powerful data manipulation package that provides a grammar of data manipulation, allowing users to perform a wide range of data wrangling tasks efficiently. It includes functions for filtering, selecting, summarizing, mutating, and arranging data, making it a versatile tool for data transformation and exploration.
writexl: Complementing readxl, writexl writes data frames to Excel (.xlsx) files directly from R. Its write_xlsx function offers a straightforward way to export results for sharing with collaborators or stakeholders.
tidyr: tidyr is designed for data tidying tasks, providing functions to reshape and organize messy datasets into tidy data formats suitable for analysis and visualization. It includes tools for gathering, spreading, separating, and uniting data, helping users efficiently tidy and prepare their data for analysis.
rpart: rpart implements recursive partitioning algorithms for classification and regression tasks, allowing users to create decision trees based on input variables. Decision trees are interpretable models that partition the feature space into segments based on the values of predictor variables, making them valuable for understanding the underlying structure of the data.
rpart.plot: This package extends the functionality of rpart by providing enhanced plotting capabilities specifically tailored for visualizing decision trees created with rpart. It offers customizable plotting options, including tree pruning, node labeling, and branch coloring, enabling users to create clear and informative visualizations of their decision tree models.
caret: caret (Classification And REgression Training) is a comprehensive package for machine learning that provides a unified interface for training and evaluating predictive models. It streamlines the machine learning workflow by offering standardized functions for data preprocessing, model training, tuning, and evaluation across different algorithms and methodologies.
randomForest: randomForest implements random forest algorithms for classification and regression tasks, known for their robustness and predictive accuracy. Random forests are ensemble learning methods that combine multiple decision trees to improve predictive performance and reduce overfitting, making them suitable for a wide range of predictive modeling tasks.
gbm: Short for Gradient Boosting Machine, gbm implements gradient boosting algorithms for predictive modeling. Gradient boosting is a powerful machine learning technique that builds an ensemble of weak learners sequentially, with each learner focusing on the mistakes made by its predecessors. This iterative approach results in highly accurate predictive models that excel in handling complex, high-dimensional data.
ipred: ipred (Improved Predictors) extends the functionality of traditional predictive modeling approaches by providing methods for improved predictive modeling, including bagging and bootstrapping techniques. These ensemble learning methods combine multiple models to enhance predictive performance and robustness, particularly in the presence of noisy or uncertain data.
adabag: adabag provides boosting and bagging ensemble methods, including AdaBoost, for classification tasks. AdaBoost is an adaptive boosting algorithm that trains weak learners sequentially, assigning higher weights to misclassified observations so that later learners concentrate on them.
DT: DT enables the creation of interactive web-based data tables directly from R, facilitating data exploration and visualization. It allows users to create dynamic, interactive tables with features such as sorting, filtering, and pagination, making it easier to explore large datasets and share insights with collaborators or stakeholders.
neuralnet: neuralnet facilitates the creation and training of artificial neural networks, a powerful technique for complex pattern recognition tasks. Neural networks consist of interconnected nodes organized into layers, capable of learning complex patterns and relationships from data. neuralnet provides functions for building, training, and evaluating neural network models, offering flexibility and scalability for a wide range of applications.
class: class implements various classification algorithms, including k-nearest neighbors (KNN), which are used for predictive modeling and classification tasks. KNN is a non-parametric method that classifies new data points based on the majority class of their nearest neighbors in the feature space. class provides functions for training and evaluating KNN models, making it a valuable tool for classification tasks in machine learning.

library(readxl)
library(dplyr)
library(writexl)
library(tidyr)
library(rpart)
library(rpart.plot)
library(caret)
library(randomForest)
library(gbm)
library(ipred)
library(DT)
library(neuralnet)
library(class)

Data set Union in RStudio

During this stage in RStudio, 25 initially cleaned datasets were consolidated into one, which was then exported to Excel for integration into Tableau Prep for final data refinement. The merged dataset consists of 13 variables and encompasses a total of 35,890 rows.

data1 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient1_v1_lung.xlsx")
data2 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient2_v1_lung.xlsx")
data3 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient3_v1_lung.xlsx")
data4 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient4_v1_lung.xlsx")
data5 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient5_v1_head-and_neck.xlsx")
data6 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient6_v1_Thyroid.xlsx")
data7 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient7_v1_head_and_neck.xlsx")
data8 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient8_v1_head_and_neck.xlsx")
data9 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient9_v1_melanoma.xlsx")
data10 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient10_v1_head_and_neck.xlsx")
data11 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient11_v1_breast.xlsx")
data12 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient12_v1-breast.xlsx")
data13 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient13_v1_lung.xlsx")
data14 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient14_v1_Thyroid.xlsx")
data15 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient15_v1_breast.xlsx")
data16 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient16_v1_breast.xlsx")
data17 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient17_v1_breast.xlsx")
data18 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient18_v1_head_and_neck.xlsx")
data19 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient19_v1_melanoma.xlsx")
data20 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient20_v1_melanoma.xlsx")
data21 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient21_v1_melanoma.xlsx")
data22 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient22_v1_melanoma.xlsx")
data23 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient23_v1_thyroid.xlsx")
data24 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient24_v1_thyroid.xlsx")
data25 <- read_xlsx("C:/Users/annac/Desktop/Ania Data Sets/Company data/Patient25_v1_thyroid.xlsx")

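# Convert Genomic Position to character so that its type is consistent across the tables combined with bind_rows() below.
data15$`Genomic Position`<-as.character(data15$`Genomic Position`)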
data16$`Genomic Position`<-as.character(data16$`Genomic Position`)
data17$`Genomic Position`<-as.character(data17$`Genomic Position`)
data18$`Genomic Position`<-as.character(data18$`Genomic Position`)
data19$`Genomic Position`<-as.character(data19$`Genomic Position`)
data20$`Genomic Position`<-as.character(data20$`Genomic Position`)
data21$`Genomic Position`<-as.character(data21$`Genomic Position`)
data22$`Genomic Position`<-as.character(data22$`Genomic Position`)
data23$`Genomic Position`<-as.character(data23$`Genomic Position`)
data24$`Genomic Position`<-as.character(data24$`Genomic Position`)
data25$`Genomic Position`<-as.character(data25$`Genomic Position`)



combined_data<-bind_rows(data1,data2,data3,data4,data5,data6,data7,data8,data9,data10,data11,data12,data13,data14,data15,data16,data17,data18,data19,data20,data21,data22,data23,data24,data25)
summary(combined_data)

##      Gene            Chromosome        Genomic Position   Reference Call    
##  Length:35890       Length:35890       Length:35890       Length:35890      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Alternative Call   Allele Frequency     Depth         P-Dot Notation    
##  Length:35890       Min.   :0.0084   Min.   :    2.0   Length:35890      
##  Class :character   1st Qu.:0.3020   1st Qu.:   98.0   Class :character  
##  Mode  :character   Median :0.4320   Median :  229.0   Mode  :character  
##                     Mean   :0.4848   Mean   :  365.9                     
##                     3rd Qu.:0.5193   3rd Qu.:  576.2                     
##                     Max.   :1.0000   Max.   :11707.0                     
##                     NA's   :14       NA's   :14                          
##  C-Dot Notation     Consequence(s)     Affected Exon(s)    Patient_ID       
##  Length:35890       Length:35890       Length:35890       Length:35890      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Cancer_type       
##  Length:35890      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

write_xlsx(combined_data,"C:/Users/annac/Desktop/Ania Data Sets/Company data/combined_data.xlsx")
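The union step above repeats the same call 25 times; a more compact pattern is to hold the tables in a list and harmonize them in one pass. The sketch below uses two toy data frames and base R `rbind` so that it runs on its own; the helper name `combine_patients` and the sample values are assumptions. With the real files, the list could be built with something like `lapply(list.files(dir, full.names = TRUE), readxl::read_xlsx)`.

```r
# Toy stand-ins for two patient tables whose "Genomic Position" types differ,
# the kind of mismatch the as.character() calls above resolve.
p1 <- data.frame(Gene = "BRAF", `Genomic Position` = "1000",
                 check.names = FALSE, stringsAsFactors = FALSE)
p2 <- data.frame(Gene = "GENE_X", `Genomic Position` = 2000,
                 check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: coerce the column in every table, then stack the rows.
combine_patients <- function(tables) {
  harmonized <- lapply(tables, function(df) {
    df[["Genomic Position"]] <- as.character(df[["Genomic Position"]])
    df
  })
  do.call(rbind, harmonized)
}

combined <- combine_patients(list(p1, p2))
str(combined)
```

Coercing every table, rather than only the ones known to differ, keeps the step correct if a new file arrives with a different column type.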

Final data cleaning in Tableau Prep

The following steps outline the procedures executed in Tableau Prep to attain the definitive clean dataset, serving as the foundational basis for subsequent analysis.

Null values were removed from gene, P-dot notation (1030 rows), and affected exons (9987 rows) to ensure data completeness.
Data formatting was standardized by eliminating extra spaces.
The chromosome column was converted to uppercase for consistency.
Similar cancer types, such as “head_and_neck” and “head_and_neck_cancer,” were merged for data consolidation.
The 28 distinct consequence values were grouped into 6 categories for improved organization and analysis.
A binary column was created to indicate cases of melanoma.
The Affected Exon(s) column was split into two separate columns.

The resulting clean dataset comprises 16 variables and 24,874 rows. Notably, 11,016 rows were eliminated due to null values. Given the categorical and unique nature of the data, conventional imputation methods were deemed unsuitable for use.
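For readers who prefer to stay in R, several of the Tableau Prep steps have direct equivalents. The following is a minimal sketch on a toy table with made-up values; it covers only the null removal, whitespace trimming, uppercase conversion, cancer-type merge, and binary melanoma flag, and it is not the workflow that produced the reported dataset.

```r
# Toy stand-in for the merged table (values are illustrative only).
raw <- data.frame(
  Gene = c("BRAF ", "GENE_X", NA, "GENE_Y"),
  Chromosome = c("chr7", "CHR1", "chr2", "chrx"),
  Cancer_type = c("melanoma", "head_and_neck_cancer", "breast", "head_and_neck"),
  stringsAsFactors = FALSE
)

# Drop rows with a missing gene (the report also drops missing P-dot and exon rows).
clean <- raw[!is.na(raw$Gene), ]
# Remove stray leading/trailing spaces and standardize case.
clean$Gene <- trimws(clean$Gene)
clean$Chromosome <- toupper(clean$Chromosome)
# Merge spelling variants of the same cancer type.
clean$Cancer_type[clean$Cancer_type == "head_and_neck_cancer"] <- "head_and_neck"
# Binary melanoma indicator, using the Y/N coding that appears later in the report.
clean$Melanoma <- ifelse(clean$Cancer_type == "melanoma", "Y", "N")

# Row bookkeeping reported above: merged rows minus rows kept after cleaning.
rows_removed <- 35890 - 24874
print(clean)
print(rows_removed)
```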

Data Understanding Table

Modelling

Exploratory Analysis

The analysis commenced with an exploratory data analysis conducted using Tableau. This approach was undertaken to visualize and gain deeper insights into the dataset.

Here are the conclusions derived from the preliminary analysis:

Mutation variants observed encompass synonymous, intronic, and missense mutations. Of these, missense mutations hold particular significance in disease progression, as they possess the potential to drive pathological processes, unlike other variants.
The dataset exhibits a balanced distribution between melanoma and other cancer types.
Notably, while melanoma patients showcase a higher diversity of genes compared to other cancers, the frequency of mutations per gene is markedly higher in other cancer types. This trend suggests that the accumulation of multiple mutations within the same gene may elevate the probability of functional alterations. Each mutation harbors the capacity to induce changes in protein structure or function, thus exerting significant effects on cellular processes.
A substantial portion, approximately 56%, of patients demonstrate at least three mutations per gene, underscoring the prevalence of multiple mutations within genes.
Chromosome 7 emerges as a potential hotspot for mutation and consequent cancer development, attributed to the abundance of genes and mutations localized on this chromosome.
Notably, the gene FAT1 exhibits the highest mutation frequency, with 91 mutations observed within the same exon. This observation implies that this specific exon may serve as a hotspot for protein alterations, potentially contributing to the onset of specific diseases.

Gene Selection Methodology Using Pareto Analysis

After conducting exploratory analysis, the dataset containing 16 variables and 24,874 rows was imported into R Markdown for predictive analytics. However, the computational workload proved excessively time-consuming, primarily because of the categorical data and the thousands of rows with unique values. To mitigate this issue, Pareto analysis was employed. Pareto analysis is a decision-making technique that ranks data entries by how much they contribute to the whole, separating the few that exert the most influence from the many that exert the least. Commonly used in business contexts to identify areas of focus, this technique facilitated the identification of the most crucial genes for analysis.

Specifically, the focus was narrowed down to the top eight genes per cancer type, considering their mutations and affected exons. The selection process adhered to three key criteria:

The data was filtered exclusively for missense variants.
Genes common to all five cancer types were omitted, as their significance in predicting cancer is noteworthy but not necessarily pertinent to forecasting melanoma specifically. These genes were appended to the conclusion as they hold significance, albeit not in this specific context.
Priority was given to genes with the highest occurrence count in each cancer type. In cases where multiple genes shared the highest count, preference was granted to the gene uniquely associated with the intended cancer type. For instance, MUC16 was among the top eight genes for melanoma, despite its presence in breast cancer six times. To maintain manageability, genes were limited to a maximum of four cancer types, prioritizing the type with the highest occurrence count for each gene.

The Pareto analysis facilitated the identification of key genes associated with each cancer type. From a comprehensive pool, 8 genes were selected per cancer type, resulting in a total of 40 genes for further analysis. These genes, along with essential variables such as Gene, C-Dot Notation (mutation), Affected Exon(s), All Exon(s), cancer type, and a binary indicator for Melanoma (Y, N), were compiled into a dataset comprising 457 rows. This approach streamlined the focus on critical genes, aiding descriptive analytics to uncover patterns pertinent to melanoma etiology, while also optimizing computational efficiency for subsequent predictive analytics.
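The core of the selection logic can be sketched in a few lines of R. This is a simplified illustration on toy data, assuming a hypothetical helper `select_top_genes` and placeholder gene names; it implements the missense filter, the removal of genes shared by all cancer types, and the top-N count per cancer type, but not the tie-break preference or the four-cancer-type cap described above. The report used N = 8; the toy call uses N = 1.

```r
# Toy variant table (gene names and counts are illustrative only).
variants <- data.frame(
  gene = c("G1", "G1", "G1", "G2", "G2", "G3", "G4",
           "G5", "G5", "G6", "G3", "G3", "G3"),
  cancer_type = c(rep("melanoma", 7), rep("lung", 6)),
  consequence = c(rep("missense", 6), "synonymous", rep("missense", 6)),
  stringsAsFactors = FALSE
)

select_top_genes <- function(variants, top_n) {
  # 1. Keep missense variants only.
  mis <- variants[variants$consequence == "missense", ]
  # 2. Drop genes observed in every cancer type.
  n_types <- length(unique(mis$cancer_type))
  types_per_gene <- tapply(mis$cancer_type, mis$gene,
                           function(x) length(unique(x)))
  shared <- names(types_per_gene)[types_per_gene == n_types]
  mis <- mis[!mis$gene %in% shared, ]
  # 3. Count rows per cancer type and gene, then keep the top_n per type.
  counts <- as.data.frame(table(cancer_type = mis$cancer_type, gene = mis$gene),
                          stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]
  counts <- counts[order(counts$cancer_type, -counts$Freq, counts$gene), ]
  do.call(rbind, lapply(split(counts, counts$cancer_type), head, n = top_n))
}

top <- select_top_genes(variants, top_n = 1)
print(top)
```

In the toy data, G3 appears in both cancer types and is therefore excluded, even though it has the highest count for lung.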

Descriptive Analytics

Introduction:

This segment presents an analysis of the data points identified through Pareto analysis. A set of charts was built in Tableau to extract insights relevant to predicting melanoma. The analysis focuses on genes, mutations, and affected exons, the variables essential for detecting patterns within melanoma patients. To read the charts effectively, it is necessary to understand the following four terms.

Gene: Genes, segments of DNA, encode the blueprint for producing specific proteins essential for cellular function. Some genes, termed driver genes or oncogenes, harbor mutations capable of instigating cancer. These mutations provide cells with a growth advantage, fostering unbridled proliferation and tumor formation.
Mutation: Mutations denote changes or alterations in the DNA sequence of a gene. Oncogenes, a subset of genes, possess the capacity to transform normal cells into cancerous entities upon mutation. These mutations may engender hyperactive proteins, fueling accelerated cell growth and division. Moreover, cancer cells frequently exhibit genomic instability, characterized by an augmented mutation rate and chromosomal irregularities.
Affected Exon: An affected exon designates a specific segment of a gene’s DNA sequence that has undergone mutation or alteration. When an exon is affected by a mutation, it signifies a change within that particular portion of the gene’s sequence. Each gene comprises its own set number of exons, and the affected exon denotes a particular locus within the gene undergoing change. Recognizing affected exons holds paramount importance in unraveling the molecular mechanisms underpinning various diseases, including cancer. By pinpointing mutated regions within genes, researchers glean invaluable insights into how these genetic alterations contribute to disease onset and progression.
All Exons: This metric represents the total number of exons in a specific gene. The number of exons varies from gene to gene.

This comprehensive elucidation of genes, mutations, affected exons and all exons serves as a fundamental backdrop for interpreting the subsequent analyses, enabling the extraction of meaningful insights into Melanoma prediction.

This analysis centers on the eight most prevalent melanoma genes, highlighting their significance due to their frequent occurrence in melanoma patients. Beyond gene identification, the study examines the specific locations within these genes where mutations are concentrated, aiming to identify potential sites for protein alterations. Following the identification of affected exons, the top eight mutation hotspots are examined further, revealing four distinct genes prominently involved, with PCLO exhibiting the highest mutation count across three exons. Notably, most of these mutation hotspots are clustered within the same genes, suggesting their critical role in melanoma development. Moreover, a closer look at mutation types and frequencies within each gene shows that the same mutation rarely occurs more than once in a gene; instead, each gene tends to carry many different mutations.

In the context of diseases like cancer, where mutations in specific genes can contribute to the development and progression of the disease, such heterogeneity can have significant implications. It may indicate that the gene is prone to accumulating various mutations, potentially resulting from exposure to different carcinogens, genomic instability, or other factors. This genetic heterogeneity underscores the complexity of melanoma genetics and highlights the challenges it poses for diagnosis and treatment. The findings emphasize the importance of comprehensive genomic analysis to unravel the full spectrum of mutations and their implications for melanoma progression and treatment strategies.

Predictive Analytics

Final Data Adjustments

The analysis begins with the loading of the refined dataset, followed by its transformation into a structured data frame. This systematic organization of the dataset into rows and columns facilitates efficient management and comprehensive analysis.

genes_ds <-read_excel("C:/Users/annac/Desktop/Capstone/data/data_filtered_2.0.xlsx")
genes_df<-as.data.frame(genes_ds)
summary(genes_df)

##      Gene           C-Dot Notation     Affected Exon(s)  All Exon(s)  
##  Length:457         Length:457         Min.   : 1.00    Min.   : 2.0  
##  Class :character   Class :character   1st Qu.: 3.00    1st Qu.:10.0  
##  Mode  :character   Mode  :character   Median : 7.00    Median :16.0  
##                                        Mean   : 8.93    Mean   :24.9  
##                                        3rd Qu.:12.00    3rd Qu.:25.0  
##                                        Max.   :57.00    Max.   :85.0  
##  Cancer_type          Melanoma        
##  Length:457         Length:457        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

str(genes_df)

## 'data.frame':    457 obs. of  6 variables:
##  $ Gene            : chr  "ACVR1B" "ACVR1B" "BRAF" "BRAF" ...
##  $ C-Dot Notation  : chr  "c.1236G>C" "c.1236G>C" "c.1919T>A" "c.1919T>A" ...
##  $ Affected Exon(s): num  1 1 1 1 1 7 7 7 7 7 ...
##  $ All Exon(s)     : num  10 10 15 15 15 15 15 15 15 15 ...
##  $ Cancer_type     : chr  "Breast_cancer" "Breast_cancer" "Thyroid_cancer" "Thyroid_cancer" ...
##  $ Melanoma        : chr  "N" "N" "N" "N" ...

Next, the columns were renamed to follow R naming conventions (lowercase, with no spaces or special characters), so that they can be referenced without backticks.

genes_df <- genes_df %>%
  rename(
    gene = `Gene`,
    c_dot_notation = `C-Dot Notation`,
    affected_exon = `Affected Exon(s)`,
    cancer_type = `Cancer_type`,
    all_exon = `All Exon(s)`
  )

Following that, a crucial step was taken to enhance the dataframe’s organization: converting character data types to factors. This strategic modification ensures that categorical data is suitably formatted for comprehensive analysis.

genes_df <- genes_df %>%
  mutate(
    gene = as.factor(gene),
    c_dot_notation = as.factor(c_dot_notation),
    affected_exon = as.numeric(affected_exon),
    cancer_type = as.factor(cancer_type),
    Melanoma = as.factor(Melanoma)
  )

str(genes_df)

## 'data.frame':    457 obs. of  6 variables:
##  $ gene          : Factor w/ 41 levels "ABL2","ACVR1B",..: 2 2 4 4 4 4 4 4 4 4 ...
##  $ c_dot_notation: Factor w/ 184 levels "c.10022G>A","c.10070G>A",..: 26 26 54 54 54 49 54 43 43 54 ...
##  $ affected_exon : num  1 1 1 1 1 7 7 7 7 7 ...
##  $ all_exon      : num  10 10 15 15 15 15 15 15 15 15 ...
##  $ cancer_type   : Factor w/ 6 levels "Breast_cancer",..: 1 1 6 6 6 6 5 5 5 6 ...
##  $ Melanoma      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 2 2 2 1 ...

Creating Training and Testing Sets

The predictive analysis initiates with the division of data into training and testing sets. To maximize model accuracy, a substantial portion of the dataset (90%) is allocated for training, while the remaining 10% is reserved for testing and validating the models’ predictive capabilities on the target variable.

set.seed(123)
train_index <- createDataPartition(genes_df$cancer_type, p = 0.90, list = FALSE)
train_data <- genes_df[train_index, ]
test_data <- genes_df[-train_index, ]
nrow(train_data)

## [1] 413

nrow(test_data)

## [1] 44

train_data <- as.data.frame(train_data)
test_data <- as.data.frame(test_data)

As observed, the training dataset comprises 413 rows, while the testing dataset consists of 44 rows.
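The idea behind a stratified 90/10 split can be illustrated without caret. The sketch below is a simplified base R stand-in on toy data, not a reimplementation of `createDataPartition`: the helper name `stratified_index`, the use of `round()` for the per-group sample size, and the toy group sizes are all assumptions.

```r
set.seed(123)
# Toy data: 40 melanoma rows and 60 lung rows.
toy <- data.frame(id = 1:100,
                  cancer_type = rep(c("melanoma", "lung"), times = c(40, 60)),
                  stringsAsFactors = FALSE)

# Sample a fraction p of the row indices within each stratum.
stratified_index <- function(strata, p) {
  groups <- split(seq_along(strata), strata)
  picked <- lapply(groups, function(idx) {
    idx[sample.int(length(idx), round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(toy$cancer_type, p = 0.90)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Each stratum keeps the same 90/10 proportion in both sets.
print(table(train$cancer_type))
print(table(test$cancer_type))
```

Because the sampling happens within each cancer type, small groups cannot end up entirely in the training set or entirely in the test set by chance.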

Decision Tree

A decision tree was constructed to identify the most influential variables in predicting melanoma.

# The code below fits a decision tree with Melanoma as the target variable, predicted from gene, affected_exon, c_dot_notation, and all_exon in the train_data data frame (the training subset of genes_df). Given the binary nature of the target, the method was set to "class" and the complexity parameter (cp) to 0.0001. Subsequently, predictions were made on the testing set, and the results were presented in tabular form. Additionally, a visualization of the decision tree was produced.

melanoma_rpart <- rpart(formula = Melanoma ~ gene + affected_exon + c_dot_notation + all_exon, data = train_data, method = "class", cp=0.0001)
melanoma_rpart

## n= 413 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 413 181 N (0.56174334 0.43825666)  
##   2) c_dot_notation=c.1012T>C,c.10543T>G,c.10544C>T,c.10715T>G,c.1156A>G,c.1159C>T,c.1214T>C,c.1236G>C,c.1316G>A,c.131C>T,c.1371T>G,c.1432T>C,c.145C>A,c.1460G>A,c.146C>G,c.1516C>T,c.1562G>A,c.1637C>T,c.1799T>A,c.1919T>A,c.2012T>C,c.2017G>A,c.2071G>A,c.2107C>T,c.2176T>C,c.2215G>A,c.222G>T,c.2339C>T,c.2513C>A,c.265G>A,c.2666A>G,c.2726G>T,c.2734C>G,c.2789A>G,c.2854_2855delinsAT,c.2995A>G,c.29C>T,c.3002C>A,c.3038C>G,c.3131G>T,c.3215G>A,c.3223T>G,c.3310G>C,c.331C>A,c.3365C>T,c.3422G>A,c.3538T>C,c.3557T>C,c.3583A>T,c.3662C>T,c.3698G>A,c.3709G>A,c.3797A>C,c.380T>G,c.3812C>T,c.3847C>A,c.3944C>T,c.4258C>T,c.479C>T,c.4958A>G,c.4991C>G,c.49A>G,c.5200G>A,c.523G>A,c.532G>A,c.553G>C,c.5557G>A,c.610C>T,c.62A>G,c.661C>T,c.6944T>A,c.710G>A,c.760A>G,c.767A>G,c.7754T>C,c.79A>C,c.8921A>G,c.962A>T,c.98C>G,c.999G>A 237   5 N (0.97890295 0.02109705) *
##   3) c_dot_notation=c.10022G>A,c.10070G>A,c.10528C>T,c.1070G>A,c.10936G>A,c.1124C>T,c.11335G>A,c.11470C>T,c.11542G>A,c.11663C>T,c.11815G>A,c.12055C>T,c.12071C>T,c.12074C>T,c.12095G>A,c.12131G>A,c.12334G>A,c.12391G>A,c.12457C>T,c.1258T>C,c.12995C>T,c.13019C>T,c.1405C>T,c.1468C>T,c.1478G>T,c.1525G>A,c.1547C>T,c.1591C>T,c.17C>A,c.1804C>T,c.1849C>T,c.18713C>T,c.194G>A,c.2044C>T,c.2096C>T,c.2180C>T,c.2225C>A,c.2417T>C,c.2618C>T,c.2645G>A,c.2650T>A,c.2680G>A,c.2749G>A,c.28378C>T,c.2845G>A,c.3025G>A,c.3044G>A,c.3077G>A,c.3143G>A,c.33923C>T,c.340C>T,c.34504G>A,c.3470G>A,c.3487G>A,c.3659C>T,c.3836G>A,c.3883G>A,c.4328C>T,c.4652C>T,c.4942G>A,c.5117C>T,c.5233A>C,c.5255G>A,c.5312G>A,c.5603C>T,c.5632G>A,c.5657C>T,c.5668G>A,c.5780A>C,c.5900G>A,c.6343G>A,c.6533C>T,c.6557C>T,c.6562G>A,c.67363G>A,c.7027C>T,c.70460C>T,c.7102G>A,c.71585A>C,c.7234C>T,c.73330G>A,c.74875A>G,c.7796G>A,c.8090C>A,c.8236C>T,c.839C>T,c.839G>A,c.8434C>T,c.8456G>A,c.8483C>T,c.861G>A,c.8876G>A,c.8939C>T,c.8967G>A,c.9083C>T,c.9349G>A,c.9721G>A,c.9746C>T 176   0 Y (0.00000000 1.00000000) *

pred0 <- predict(melanoma_rpart, test_data, type = "class") #Predictions performed on test data
pred0

##   1   4  20  52  66  78  79  82  88  94 103 113 121 161 162 166 172 174 184 199 
##   N   N   N   Y   Y   Y   Y   Y   Y   N   N   N   N   N   N   Y   N   Y   Y   Y 
## 201 203 212 223 244 257 260 276 278 293 296 306 308 335 341 350 352 357 361 370 
##   Y   Y   Y   Y   Y   Y   Y   N   Y   N   N   N   N   N   N   N   N   N   N   N 
## 406 417 428 448 
##   N   N   N   N 
## Levels: N Y

table(test_data$Melanoma, pred0, dnn = c("True", "Pred")) #Confusion matrix

##     Pred
## True  N  Y
##    N 24  0
##    Y  2 18

sum(test_data$Melanoma != pred0) #Count of misclassified predictions

## [1] 2

sum(test_data$Melanoma != pred0)/nrow(test_data) #misclassification rate

## [1] 0.04545455

prp(melanoma_rpart, extra = 1)

This model highlights c_dot_notation (the mutation) as the most influential variable in the analysis: it is the only variable the tree splits on. The tree assigns 237 training observations to the non-Melanoma leaf, 5 of which are actually Melanoma cases, and 176 observations to the Melanoma leaf, all of them Melanoma cases. Notably, the model singles out specific mutations that, together with the descriptive modeling, support the hypothesis regarding their pivotal role in potential cancer development.

For instance, in the final analysis of the XIRP2 gene, the mutations c.839C>T, c.3077G>A, c.2650T>A, c.1405C>T, c.6557C>T, and c.2044C>T were identified, and all of them appear on the Melanoma side of the decision tree's split, signifying their predictive relevance for Melanoma. On the testing set, only 2 of 44 observations were misclassified, a misclassification rate of 4.5%. Both errors are Melanoma cases predicted as non-Melanoma. This performance indicates that the model predicts Melanoma cases accurately on this dataset.
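
The figures quoted from the decision-tree confusion matrix can be re-derived by hand. The short Python check below is only an arithmetic illustration: the four counts are copied from the table above, and the variable names are chosen here for readability.

```python
# Counts copied from the decision-tree confusion matrix above
# (rows = true class, columns = predicted class).
true_n_pred_n, true_n_pred_y = 24, 0
true_y_pred_n, true_y_pred_y = 2, 18

total = true_n_pred_n + true_n_pred_y + true_y_pred_n + true_y_pred_y
errors = true_n_pred_y + true_y_pred_n

misclassification_rate = errors / total
# Share of true Melanoma cases that were predicted as Melanoma.
sensitivity = true_y_pred_y / (true_y_pred_y + true_y_pred_n)
# Share of true non-Melanoma cases that were predicted as non-Melanoma.
specificity = true_n_pred_n / (true_n_pred_n + true_n_pred_y)

print(f"misclassified: {errors} of {total}")
print(f"misclassification rate: {misclassification_rate:.4f}")
print(f"sensitivity: {sensitivity:.2f}")
print(f"specificity: {specificity:.2f}")
```

Reporting sensitivity and specificity alongside the overall rate shows which class the errors fall in, which the single misclassification figure does not.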

Random Forest

set.seed(123)
#Using the code below, I created a random forest model using all variables in the genes_df dataset to predict Melanoma and then displayed the results in a table.

#rf_model<-randomForest(Melanoma~gene + affected_exon + all_exon + c_dot_notation , data = train_data, ntrees=500)

The call above is commented out because it fails: c_dot_notation has too many categories, as the listing of its values shows:

unique(train_data$c_dot_notation)

##   [1] c.1236G>C           c.1919T>A           c.1799T>A          
##   [4] c.1525G>A           c.331C>A            c.479C>T           
##   [7] c.8967G>A           c.5255G>A           c.4652C>T          
##  [10] c.3044G>A           c.2845G>A           c.2618C>T          
##  [13] c.1804C>T           c.1591C>T           c.1124C>T          
##  [16] c.49A>G             c.11542G>A          c.8939C>T          
##  [19] c.8236C>T           c.7027C>T           c.5780A>C          
##  [22] c.3883G>A           c.3487G>A           c.3470G>A          
##  [25] c.11335G>A          c.10936G>A          c.1258T>C          
##  [28] c.2180C>T           c.5603C>T           c.5632G>A          
##  [31] c.6562G>A           c.8434C>T           c.8456G>A          
##  [34] c.8876G>A           c.9083C>T           c.11815G>A         
##  [37] c.12055C>T          c.12457C>T          c.2680G>A          
##  [40] c.4942G>A           c.7102G>A           c.7234C>T          
##  [43] c.1562G>A           c.2096C>T           c.2749G>A          
##  [46] c.3025G>A           c.3310G>C           c.767A>G           
##  [49] c.760A>G            c.79A>C             c.1460G>A          
##  [52] c.532G>A            c.3709G>A           c.10543T>G         
##  [55] c.10544C>T          c.10715T>G          c.5200G>A          
##  [58] c.4991C>G           c.3847C>A           c.3797A>C          
##  [61] c.3557T>C           c.3538T>C           c.2107C>T          
##  [64] c.1637C>T           c.1012T>C           c.1478G>T          
##  [67] c.2017G>A           c.11663C>T          c.28378C>T         
##  [70] c.18713C>T          c.10022G>A          c.9746C>T          
##  [73] c.33923C>T          c.34504G>A          c.2176T>C          
##  [76] c.2071G>A           c.12391G>A          c.12334G>A         
##  [79] c.12131G>A          c.5900G>A           c.2645G>A          
##  [82] c.1070G>A           c.13019C>T          c.9349G>A          
##  [85] c.2225C>A           c.1468C>T           c.1547C>T          
##  [88] c.8090C>A           c.6533C>T           c.12995C>T         
##  [91] c.12074C>T          c.11470C>T          c.10528C>T         
##  [94] c.9721G>A           c.3143G>A           c.194G>A           
##  [97] c.839G>A            c.4328C>T           c.5312G>A          
## [100] c.5668G>A           c.6343G>A           c.7796G>A          
## [103] c.12071C>T          c.12095G>A          c.10070G>A         
## [106] c.3836G>A           c.999G>A            c.3944C>T          
## [109] c.2417T>C           c.3583A>T           c.3659C>T          
## [112] c.3365C>T           c.2854_2855delinsAT c.3662C>T          
## [115] c.3812C>T           c.710G>A            c.610C>T           
## [118] c.1156A>G           c.1371T>G           c.2012T>C          
## [121] c.380T>G            c.265G>A            c.67363G>A         
## [124] c.73330G>A          c.71585A>C          c.70460C>T         
## [127] c.74875A>G          c.131C>T            c.145C>A           
## [130] c.222G>T            c.3131G>T           c.2044C>T          
## [133] c.6557C>T           c.839C>T            c.1405C>T          
## [136] c.2650T>A           c.3077G>A           c.2789A>G          
## [139] c.5557G>A           c.146C>G            c.4258C>T          
## [142] c.5233A>C           c.98C>G             c.523G>A           
## [145] c.962A>T            c.8921A>G           c.3223T>G          
## [148] c.8483C>T           c.340C>T            c.6944T>A          
## [151] c.553G>C            c.2726G>T           c.3422G>A          
## [154] c.3002C>A           c.3038C>G           c.62A>G            
## [157] c.861G>A            c.661C>T            c.1432T>C          
## [160] c.1316G>A           c.1516C>T           c.17C>A            
## [163] c.2339C>T           c.7754T>C           c.2513C>A          
## [166] c.2995A>G           c.5117C>T           c.4958A>G          
## [169] c.3215G>A           c.5657C>T           c.3698G>A          
## [172] c.2734C>G           c.1849C>T           c.2666A>G          
## [175] c.2215G>A           c.29C>T             c.1214T>C          
## [178] c.1159C>T          
## 184 Levels: c.10022G>A c.10070G>A c.1012T>C c.10528C>T c.10543T>G ... c.999G>A

As shown above, c_dot_notation has 184 levels (178 of which occur in the training data), whereas the random forest implementation accepts categorical predictors with at most 53 categories. Consequently, the model failed to execute. Because all mutations are important and had already been refined from a larger pool, deleting categories to fit the model was impractical.

Therefore, a second Random Forest analysis is conducted, this time excluding c_dot_notation (the mutation). The aim is to evaluate how accurately the model predicts Melanoma without it.
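
A quick way to anticipate this kind of failure is to count the distinct values of each categorical column before fitting. The Python sketch below illustrates the check under stated assumptions: the limit of 53 is the figure cited above, the helper name too_many_levels is hypothetical, and the toy columns only reproduce the level counts reported by str(genes_df), not the real data.

```python
# Assumption: 53 is the per-predictor category limit cited in the text above.
MAX_CATEGORIES = 53


def too_many_levels(columns, limit=MAX_CATEGORIES):
    """Return {column name: distinct value count} for columns over the limit."""
    counts = {name: len(set(values)) for name, values in columns.items()}
    return {name: n for name, n in counts.items() if n > limit}


# Hypothetical data: 184 distinct mutation labels and 41 distinct gene labels,
# matching the level counts shown by str(genes_df).
toy_columns = {
    "c_dot_notation": [f"mutation_{i}" for i in range(184)],
    "gene": [f"gene_{i}" for i in range(41)],
}
print(too_many_levels(toy_columns))  # {'c_dot_notation': 184}
```

With 41 levels, gene stays under the limit, which is why it can remain in the reduced model.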

set.seed(123)
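rf_model<-randomForest(Melanoma~gene + affected_exon + all_exon , data = train_data, ntrees=500) #Note: the randomForest argument is named ntree; "ntrees" is not matched, so the default of 500 trees applies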
rf_model

## 
## Call:
##  randomForest(formula = Melanoma ~ gene + affected_exon + all_exon,      data = train_data, ntrees = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 8.96%
## Confusion matrix:
##     N   Y class.error
## N 221  11  0.04741379
## Y  26 155  0.14364641

rf_pred<-predict(rf_model,test_data) #Predictions performed on test data
rf_pred

##   1   4  20  52  66  78  79  82  88  94 103 113 121 161 162 166 172 174 184 199 
##   N   N   N   Y   Y   Y   Y   Y   Y   N   N   N   N   N   N   N   N   N   Y   Y 
## 201 203 212 223 244 257 260 276 278 293 296 306 308 335 341 350 352 357 361 370 
##   Y   Y   Y   Y   Y   Y   Y   N   N   N   N   N   N   N   N   N   N   N   N   N 
## 406 417 428 448 
##   N   N   N   N 
## Levels: N Y

table(test_data$Melanoma,rf_pred, dnn=c("True","Pred")) #Confusion matrix

##     Pred
## True  N  Y
##    N 24  0
##    Y  5 15

sum(test_data$Melanoma != rf_pred)

## [1] 5

sum(test_data$Melanoma != rf_pred)/nrow(test_data)

## [1] 0.1136364

rf_model$importance

##               MeanDecreaseGini
## gene                 110.34091
## affected_exon         30.92654
## all_exon              32.27210

After the adjusted random forest analysis, the results were noticeably worse than those of the decision tree. The model misclassified 5 of the 44 test observations, an 11.4% error rate, and all 5 were Melanoma cases predicted as non-Melanoma. This suggests that omitting a pivotal variable (c_dot_notation) substantially reduces the accuracy of the model. The feature importance analysis showed that gene carried most of the importance (a mean decrease in Gini of about 110), whereas affected_exon and all_exon had considerably less influence (about 31 and 32).

Bagging

set.seed(123)
#Bagging model. "nbagg = 50" tells the algorithm to build an ensemble from 50 bootstrap samples. Bootstrap sampling randomly selects rows from the original dataset with replacement.
bag_model <- bagging(formula = Melanoma ~ gene + affected_exon + all_exon + c_dot_notation, data = train_data, nbagg = 50)
Melanoma2 <- test_data$Melanoma
bag_pred <- predict(bag_model, newdata = test_data)#Predictions
table(Melanoma2, bag_pred, dnn = c("True", "Pred")) #Confusion Matrix

##     Pred
## True  N  Y
##    N 24  0
##    Y  2 18

sum(test_data$Melanoma != bag_pred) #Number of misclassifications

## [1] 2

Bagging, an ensemble method that, like boosting, combines many trees, performed very well. The confusion matrix shows that only 2 of the 44 test observations were misclassified (4.5%). This is the same confusion matrix as the single decision tree and the same error count as the boosting model presented next.
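
The bootstrap sampling behind bagging can be sketched in a few lines. The Python example below is an illustration only, not the R implementation used above: it draws 50 samples with replacement from 413 stand-in row indices (mirroring nbagg = 50 and the training set size) and measures how many rows each sample leaves out. The helper name bootstrap_sample is an assumption.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one bagging round would."""
    return rng.choices(rows, k=len(rows))


rng = random.Random(123)
rows = list(range(413))  # stand-in for the 413 training rows

# Assumption: 50 rounds, mirroring nbagg = 50 in the bagging call above.
samples = [bootstrap_sample(rows, rng) for _ in range(50)]

# Rows never drawn in a round are "out-of-bag" for that round.
oob_fractions = [1 - len(set(s)) / len(rows) for s in samples]
print(f"mean out-of-bag fraction: {sum(oob_fractions) / len(oob_fractions):.3f}")
```

Because each round sees a different resample, the individual trees differ, and averaging their votes reduces the variance of a single tree.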

Boosting

library(adabag)
set.seed(123)
#Boosting model. "boos = TRUE" makes each iteration draw a bootstrap sample of the training set using the current observation weights; the weak learners trained this way are then combined to enhance the overall predictive power of the model.
melanoma_boost = boosting(Melanoma~ gene + c_dot_notation + all_exon + affected_exon, data = train_data, boos = TRUE)
melanoma_boost$importance #Importance of the variables

##  affected_exon       all_exon c_dot_notation           gene 
##       2.364250       0.000000      96.434424       1.201326

boost_pred = predict(melanoma_boost, newdata = test_data) #Predictions
boost_pred$class

##  [1] "N" "N" "N" "Y" "Y" "Y" "Y" "Y" "Y" "Y" "N" "N" "N" "N" "Y" "Y" "N" "Y" "Y"
## [20] "Y" "Y" "Y" "Y" "Y" "Y" "Y" "Y" "N" "Y" "N" "N" "N" "N" "N" "N" "N" "N" "N"
## [39] "N" "N" "N" "N" "N" "N"

boost_pred$confusion #Confusion Matrix

##                Observed Class
## Predicted Class  N  Y
##               N 23  1
##               Y  1 19

boost_pred$error #Misclassification error

## [1] 0.04545455

boost_pred$error*nrow(test_data)

## [1] 2

This model was constructed using the variables gene, c_dot_notation, affected_exon, and all_exon. On the testing set, the boosting model performed very well. Consistent with the Decision Tree analysis, c_dot_notation (the mutation) again proved pivotal, accounting for about 96% of the variable importance, while all_exon had an importance of zero. The confusion matrix shows only 2 misclassified data points, one false positive and one false negative, for a misclassification rate of 4.5%. This supports the use of a more sophisticated model for extracting insights from the provided data, although its error count on this test set equals that of the single decision tree.

Overcoming Data Constraints for Enhanced Predictive Analysis

Unfortunately, because the categorical variables contain numerous unique values, further models could not be run on the dataset as it stood. The dataset was therefore transferred to Python to work around this limitation. In Python, binary columns were engineered for the problematic variables, c_dot_notation (mutation) and gene, enabling further predictive analyses.

The initial step involved loading essential libraries like Pandas and Seaborn. Pandas facilitates diverse data operations including reading and writing files, data cleaning, preprocessing, statistical analysis, and manipulation. Seaborn, built on Matplotlib, offers an intuitive interface for generating visually appealing and informative statistical graphics.

After the libraries were loaded, the connection to Google Drive was established and the data was imported into Python. The code then created binary columns for two variables: gene and c_dot_notation (mutation).

After this transformation, the dataset retains its original size of 457 rows and gains 225 binary columns, one for each of the 41 gene levels and 184 mutation levels.

Next, the dataset was extended with the variables that did not need encoding: “Melanoma,” already binary-encoded, and “Affected Exon(s)” and “All Exon(s),” both numeric columns. The dataset now comprises 457 rows and 228 columns (the 225 binary columns plus these 3). This extended dataset was then exported to Excel for further predictive analytics.
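
The Python code for this step is not reproduced in this report, so the following is only a minimal sketch of how such binary columns can be created with Pandas. The miniature table, its gene names, exon counts, and column names are all hypothetical; only the general technique (pandas.get_dummies on the two categorical columns) is illustrated.

```python
import pandas as pd

# Hypothetical miniature of the gene table; every value here is made up.
toy = pd.DataFrame(
    {
        "gene": ["GENE_A", "GENE_A", "GENE_B"],
        "c_dot_notation": ["c.100A>G", "c.200C>T", "c.300G>A"],
        "affected_exon": [1, 7, 7],
        "all_exon": [10, 15, 15],
        "Melanoma": [0, 1, 1],
    }
)

# One 0/1 column per distinct gene and per distinct mutation;
# the numeric columns and the target pass through unchanged.
encoded = pd.get_dummies(toy, columns=["gene", "c_dot_notation"], dtype=int)

print(encoded.shape)
print(list(encoded.columns))
```

The row count is unchanged and the column count grows by the number of distinct values minus the two original columns, which is the same bookkeeping that takes the full dataset to 228 columns.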

Following this procedure, the dataset was read back into R for further modeling. The first challenge arose when the program failed to interpret the data correctly, which required removing special characters from the column headers, so that each header is now a single contiguous word.
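
How the headers were cleaned is not shown in this report. One possible approach, sketched here in Python, is to drop every character that is not a letter or digit. The raw header strings below are hypothetical reconstructions; the cleaned results happen to match header names that appear in the str(genes_df1) output, but that does not confirm this was the method used.

```python
import re


def clean_header(name):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", name)


# Hypothetical raw headers, as they might look before cleaning.
raw_headers = [
    "Affected Exon(s)",
    "All Exon(s)",
    "C Dot Notation_c.10022G>A",
    "C Dot Notation_c.2854_2855delinsAT",
]
cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)

# Stripping characters can make two headers collide, so check for duplicates.
assert len(set(cleaned)) == len(cleaned)
```

The duplicate check matters because two distinct mutations could in principle reduce to the same string once punctuation is removed.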

genes_ds1 <-read_excel("C:/Users/annac/Desktop/Capstone/data/genes_python.xlsx")
genes_df1<-as.data.frame(genes_ds1)
str(genes_df1) #Checking if the data was read correctly

## 'data.frame':    457 obs. of  228 variables:
##  $ AffectedExons                : num  1 1 1 1 1 7 7 7 7 7 ...
##  $ Melanoma                     : num  0 0 0 0 0 0 1 1 1 0 ...
##  $ AllExons                     : num  10 10 15 15 15 15 15 15 15 15 ...
##  $ CDotNotationc10022GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc10070GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1012TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc10528CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc10543TG         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc10544CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1070GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc10715TG         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc10936GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1124CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc11335GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc11470CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc11542GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1156AG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1159CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc11638CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc11663CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc11815GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12055CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12071CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12074CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12095GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12131GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1214TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12334GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1236GC          : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12391GA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12457CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1258TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc12995CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc13019CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1316GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc131CT           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1371TG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1405CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1432TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc145CA           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1460GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1468CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc146CG           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1478GT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1516CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1525GA          : num  0 0 0 0 0 0 0 1 1 0 ...
##  $ CDotNotationc1547CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1562GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1591CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1637CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1717GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1799TA          : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ CDotNotationc17CA            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1804CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1849CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc18713CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc1919TA          : num  0 0 1 1 1 0 1 0 0 1 ...
##  $ CDotNotationc194GA           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2012TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2017GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2044CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2071GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2096CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2107CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2176TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2180CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2215GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2225CA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc222GT           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2339CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2344CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2417TC          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2513CA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2618CT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2645GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2650TA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc265GA           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2666AG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2680GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2726GT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2734CG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2749GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2789AG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc28378CT         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2845GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc28542855delinsAT: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc290CT           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc2995AG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc29CT            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3002CA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3025GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3038CG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3044GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3077GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3131GT          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3143GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3215GA          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3223TG          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CDotNotationc3310GC          : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

Inspecting the dataset shows that all variables are numeric and that the data was read correctly, with the expected 457 rows and 228 columns. The next step is partitioning the data into training and testing sets. The data is split in the same proportions as before, 90% for training and 10% for testing, this time partitioned on Melanoma. Given the small dataset size, this scheme leaves enough data for proper model building and training.

set.seed(123)
train_index1 <- createDataPartition(genes_df1$Melanoma, p = 0.90, list = FALSE)
train_data1 <- genes_df1[train_index1, ]
test_data1 <- genes_df1[-train_index1, ]
train_data1 <- as.data.frame(train_data1)

Logistic Regression

logit_model1 <- glm(Melanoma ~ ., data = train_data1, family = binomial) #Model
summary(logit_model1)

## 
## Call:
## glm(formula = Melanoma ~ ., family = binomial, data = train_data1)
## 
## Coefficients: (49 not defined because of singularities)
##                                 Estimate Std. Error    z value Pr(>|z|)    
## (Intercept)                   -6.912e+15  6.283e+07 -1.100e+08   <2e-16 ***
## AffectedExons                 -5.235e+13  9.009e+05 -5.811e+07   <2e-16 ***
## AllExons                       5.027e+14  6.192e+06  8.119e+07   <2e-16 ***
## CDotNotationc10022GA           4.399e+15  6.713e+07  6.552e+07   <2e-16 ***
## CDotNotationc10070GA          -3.219e+16  4.341e+08 -7.417e+07   <2e-16 ***
## CDotNotationc1012TC           -6.399e+15  7.021e+07 -9.114e+07   <2e-16 ***
## CDotNotationc10528CT           2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc10543TG          -1.582e+16  1.821e+08 -8.689e+07   <2e-16 ***
## CDotNotationc10544CT          -1.582e+16  1.821e+08 -8.689e+07   <2e-16 ***
## CDotNotationc1070GA            2.720e+15  7.845e+07  3.466e+07   <2e-16 ***
## CDotNotationc10715TG          -1.582e+16  1.821e+08 -8.689e+07   <2e-16 ***
## CDotNotationc10936GA          -3.239e+16  4.495e+08 -7.206e+07   <2e-16 ***
## CDotNotationc1124CT           -1.047e+15  1.238e+08 -8.454e+06   <2e-16 ***
## CDotNotationc11335GA          -2.833e+16  4.470e+08 -6.337e+07   <2e-16 ***
## CDotNotationc11470CT           2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc11542GA          -2.865e+16  4.485e+08 -6.387e+07   <2e-16 ***
## CDotNotationc1156AG           -9.600e+14  5.491e+07 -1.748e+07   <2e-16 ***
## CDotNotationc1159CT            1.570e+14  7.754e+07  2.025e+06   <2e-16 ***
## CDotNotationc11638CT           2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc11663CT           4.399e+15  8.221e+07  5.351e+07   <2e-16 ***
## CDotNotationc11815GA           9.060e+15  7.750e+07  1.169e+08   <2e-16 ***
## CDotNotationc12055CT           6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc12071CT          -2.881e+16  4.296e+08 -6.705e+07   <2e-16 ***
## CDotNotationc12074CT          -1.697e+15  9.571e+07 -1.773e+07   <2e-16 ***
## CDotNotationc12095GA          -2.731e+16  4.296e+08 -6.356e+07   <2e-16 ***
## CDotNotationc12131GA           2.676e+15  8.310e+07  3.220e+07   <2e-16 ***
## CDotNotationc1214TC            1.570e+14  7.754e+07  2.025e+06   <2e-16 ***
## CDotNotationc12334GA           2.545e+15  9.573e+07  2.659e+07   <2e-16 ***
## CDotNotationc1236GC           -1.062e+16  7.397e+07 -1.435e+08   <2e-16 ***
## CDotNotationc12391GA           2.676e+15  8.310e+07  3.220e+07   <2e-16 ***
## CDotNotationc12457CT           6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc1258TC           -2.789e+16  4.470e+08 -6.239e+07   <2e-16 ***
## CDotNotationc12995CT           2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc13019CT           2.702e+15  7.453e+07  3.626e+07   <2e-16 ***
## CDotNotationc1316GA                   NA         NA         NA       NA    
## CDotNotationc131CT             2.252e+15  6.126e+07  3.676e+07   <2e-16 ***
## CDotNotationc1371TG           -1.466e+15  5.390e+07 -2.720e+07   <2e-16 ***
## CDotNotationc1405CT           -1.979e+15  1.297e+08 -1.526e+07   <2e-16 ***
## CDotNotationc1432TC           -8.630e+15  1.029e+08 -8.386e+07   <2e-16 ***
## CDotNotationc145CA             1.519e+15  5.479e+07  2.772e+07   <2e-16 ***
## CDotNotationc1460GA           -1.421e+14  3.558e+07 -3.994e+06   <2e-16 ***
## CDotNotationc1468CT            4.896e+14  7.603e+07  6.440e+06   <2e-16 ***
## CDotNotationc146CG            -3.061e+16  3.609e+08 -8.482e+07   <2e-16 ***
## CDotNotationc1478GT            2.612e+15  7.435e+07  3.512e+07   <2e-16 ***
## CDotNotationc1516CT           -8.630e+15  1.133e+08 -7.616e+07   <2e-16 ***
## CDotNotationc1525GA            4.277e+15  6.139e+07  6.967e+07   <2e-16 ***
## CDotNotationc1547CT            9.217e+14  7.453e+07  1.237e+07   <2e-16 ***
## CDotNotationc1562GA           -1.135e+16  1.274e+08 -8.908e+07   <2e-16 ***
## CDotNotationc1591CT           -1.047e+15  1.238e+08 -8.454e+06   <2e-16 ***
## CDotNotationc1637CT           -6.430e+15  7.443e+07 -8.639e+07   <2e-16 ***
## CDotNotationc1717GA           -5.006e+15  8.511e+07 -5.882e+07   <2e-16 ***
## CDotNotationc1799TA           -4.739e+15  5.826e+07 -8.135e+07   <2e-16 ***
## CDotNotationc17CA             -4.545e+15  1.141e+08 -3.983e+07   <2e-16 ***
## CDotNotationc1804CT           -5.551e+15  1.238e+08 -4.482e+07   <2e-16 ***
## CDotNotationc1849CT            6.012e+15  7.070e+07  8.503e+07   <2e-16 ***
## CDotNotationc18713CT                  NA         NA         NA       NA    
## CDotNotationc1919TA           -1.736e+15  5.279e+07 -3.288e+07   <2e-16 ***
## CDotNotationc194GA            -2.714e+16  4.301e+08 -6.310e+07   <2e-16 ***
## CDotNotationc2012TC           -4.713e+15  5.491e+07 -8.583e+07   <2e-16 ***
## CDotNotationc2017GA           -4.894e+15  7.435e+07 -6.583e+07   <2e-16 ***
## CDotNotationc2044CT           -1.979e+15  1.381e+08 -1.433e+07   <2e-16 ***
## CDotNotationc2071GA           -3.623e+16  4.329e+08 -8.368e+07   <2e-16 ***
## CDotNotationc2096CT           -6.849e+15  1.387e+08 -4.937e+07   <2e-16 ***
## CDotNotationc2107CT           -2.341e+15  6.691e+07 -3.498e+07   <2e-16 ***
## CDotNotationc2176TC           -3.172e+16  4.303e+08 -7.372e+07   <2e-16 ***
## CDotNotationc2180CT            4.556e+15  7.750e+07  5.879e+07   <2e-16 ***
## CDotNotationc2215GA            4.556e+15  6.127e+07  7.436e+07   <2e-16 ***
## CDotNotationc2225CA            1.576e+15  7.602e+07  2.073e+07   <2e-16 ***
## CDotNotationc222GT             1.812e+15  4.901e+07  3.697e+07   <2e-16 ***
## CDotNotationc2339CT           -1.133e+16  1.432e+08 -7.915e+07   <2e-16 ***
## CDotNotationc2344CT           -2.346e+15  1.387e+08 -1.691e+07   <2e-16 ***
## CDotNotationc2417TC           -4.133e+14  1.087e+08 -3.802e+06   <2e-16 ***
## CDotNotationc2513CA           -1.123e+16  1.431e+08 -7.848e+07   <2e-16 ***
## CDotNotationc2618CT           -2.775e+15  1.133e+08 -2.449e+07   <2e-16 ***
## CDotNotationc2645GA            1.218e+15  7.845e+07  1.553e+07   <2e-16 ***
## CDotNotationc2650TA           -1.979e+15  1.297e+08 -1.526e+07   <2e-16 ***
## CDotNotationc265GA            -2.011e+15  8.318e+07 -2.418e+07   <2e-16 ***
## CDotNotationc2666AG            4.556e+15  6.127e+07  7.436e+07   <2e-16 ***
## CDotNotationc2680GA            9.007e+15  7.749e+07  1.162e+08   <2e-16 ***
## CDotNotationc2726GT            3.310e+15  5.655e+07  5.853e+07   <2e-16 ***
## CDotNotationc2734CG            6.012e+15  7.070e+07  8.503e+07   <2e-16 ***
## CDotNotationc2749GA                   NA         NA         NA       NA    
## CDotNotationc2789AG           -2.995e+15  5.914e+07 -5.065e+07   <2e-16 ***
## CDotNotationc28378CT           4.399e+15  8.221e+07  5.351e+07   <2e-16 ***
## CDotNotationc2845GA           -5.235e+14  1.133e+08 -4.619e+06   <2e-16 ***
## CDotNotationc28542855delinsAT -2.766e+15  5.254e+07 -5.264e+07   <2e-16 ***
## CDotNotationc290CT            -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc2995AG           -6.619e+15  1.430e+08 -4.629e+07   <2e-16 ***
## CDotNotationc29CT              4.451e+15  7.750e+07  5.744e+07   <2e-16 ***
## CDotNotationc3002CA                   NA         NA         NA       NA    
## CDotNotationc3025GA           -2.346e+15  1.387e+08 -1.691e+07   <2e-16 ***
## CDotNotationc3038CG           -1.194e+15  7.382e+07 -1.618e+07   <2e-16 ***
## CDotNotationc3044GA           -2.775e+15  1.133e+08 -2.449e+07   <2e-16 ***
## CDotNotationc3077GA           -6.483e+15  1.381e+08 -4.695e+07   <2e-16 ***
## CDotNotationc3131GT            2.859e+15  7.356e+07  3.887e+07   <2e-16 ***
## CDotNotationc3143GA           -2.685e+16  4.321e+08 -6.214e+07   <2e-16 ***
## CDotNotationc3215GA           -1.085e+16  1.812e+08 -5.987e+07   <2e-16 ***
## CDotNotationc3223TG           -3.482e+16  4.716e+08 -7.384e+07   <2e-16 ***
## CDotNotationc3310GC           -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc331CA            -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc3365CT           -4.672e+15  5.254e+07 -8.893e+07   <2e-16 ***
## CDotNotationc33923CT           4.399e+15  8.221e+07  5.351e+07   <2e-16 ***
## CDotNotationc340CT            -3.116e+16  4.743e+08 -6.569e+07   <2e-16 ***
## CDotNotationc3422GA           -1.194e+15  5.655e+07 -2.112e+07   <2e-16 ***
## CDotNotationc34504GA           4.399e+15  8.221e+07  5.351e+07   <2e-16 ***
## CDotNotationc3470GA           -3.370e+16  4.524e+08 -7.450e+07   <2e-16 ***
## CDotNotationc3487GA           -2.920e+16  4.524e+08 -6.454e+07   <2e-16 ***
## CDotNotationc3538TC            1.938e+15  7.397e+07  2.620e+07   <2e-16 ***
## CDotNotationc3557TC           -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc3583AT           -7.169e+15  1.087e+08 -6.594e+07   <2e-16 ***
## CDotNotationc3659CT           -3.872e+14  1.051e+08 -3.682e+06   <2e-16 ***
## CDotNotationc3662CT           -1.816e+15  5.312e+07 -3.419e+07   <2e-16 ***
## CDotNotationc3698GA            6.012e+15  8.515e+07  7.060e+07   <2e-16 ***
## CDotNotationc3709GA           -1.582e+16  1.821e+08 -8.689e+07   <2e-16 ***
## CDotNotationc3797AC           -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc380TG             4.194e+07  6.126e+07  6.850e-01    0.494    
## CDotNotationc3812CT           -4.650e+15  5.620e+07 -8.274e+07   <2e-16 ***
## CDotNotationc3836GA           -2.714e+16  4.301e+08 -6.310e+07   <2e-16 ***
## CDotNotationc3847CA           -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc3883GA           -2.920e+16  4.524e+08 -6.454e+07   <2e-16 ***
## CDotNotationc3944CT           -9.428e+15  9.989e+07 -9.438e+07   <2e-16 ***
## CDotNotationc4258CT           -2.475e+16  3.582e+08 -6.910e+07   <2e-16 ***
## CDotNotationc4328CT           -2.996e+16  4.288e+08 -6.986e+07   <2e-16 ***
## CDotNotationc4652CT           -2.775e+15  1.133e+08 -2.449e+07   <2e-16 ***
## CDotNotationc479CT            -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc4934GA            6.012e+15  8.515e+07  7.060e+07   <2e-16 ***
## CDotNotationc4942GA            9.007e+15  7.749e+07  1.162e+08   <2e-16 ***
## CDotNotationc4958AG           -1.043e+16  1.744e+08 -5.982e+07   <2e-16 ***
## CDotNotationc4991CG           -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc49AG              1.938e+15  4.575e+07  4.235e+07   <2e-16 ***
## CDotNotationc5117CT           -2.116e+15  1.430e+08 -1.480e+07   <2e-16 ***
## CDotNotationc5200GA           -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc5233AC           -1.993e+16  3.577e+08 -5.571e+07   <2e-16 ***
## CDotNotationc523GA             1.570e+14  5.486e+07  2.862e+06   <2e-16 ***
## CDotNotationc5255GA           -5.235e+14  1.133e+08 -4.619e+06   <2e-16 ***
## CDotNotationc5312GA           -2.731e+16  4.296e+08 -6.356e+07   <2e-16 ***
## CDotNotationc532GA            -2.566e+15  5.674e+07 -4.522e+07   <2e-16 ***
## CDotNotationc553GC            -5.373e+15  6.552e+07 -8.200e+07   <2e-16 ***
## CDotNotationc5557GA           -2.433e+16  3.576e+08 -6.803e+07   <2e-16 ***
## CDotNotationc5603CT            9.007e+15  7.749e+07  1.162e+08   <2e-16 ***
## CDotNotationc5632GA            6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc5657CT           -5.771e+15  1.867e+08 -3.090e+07   <2e-16 ***
## CDotNotationc5668GA           -2.731e+16  4.296e+08 -6.356e+07   <2e-16 ***
## CDotNotationc5780AC           -2.865e+16  4.485e+08 -6.387e+07   <2e-16 ***
## CDotNotationc5900GA            2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc610CT            -9.600e+14  5.491e+07 -1.748e+07   <2e-16 ***
## CDotNotationc62AG             -1.508e+15  7.382e+07 -2.043e+07   <2e-16 ***
## CDotNotationc6343GA           -2.711e+16  4.301e+08 -6.305e+07   <2e-16 ***
## CDotNotationc6533CT            2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc6557CT           -1.979e+15  1.381e+08 -1.433e+07   <2e-16 ***
## CDotNotationc6562GA            6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc661CT            -8.892e+15  1.138e+08 -7.816e+07   <2e-16 ***
## CDotNotationc67363GA          -2.011e+15  9.576e+07 -2.100e+07   <2e-16 ***
## CDotNotationc6944TA           -3.351e+16  4.726e+08 -7.091e+07   <2e-16 ***
## CDotNotationc7013CT            6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc7027CT           -2.865e+16  4.485e+08 -6.387e+07   <2e-16 ***
## CDotNotationc70460CT           2.493e+15  9.576e+07  2.603e+07   <2e-16 ***
## CDotNotationc7102GA            9.007e+15  7.749e+07  1.162e+08   <2e-16 ***
## CDotNotationc710GA            -4.661e+15  6.132e+07 -7.600e+07   <2e-16 ***
## CDotNotationc71585AC          -2.011e+15  9.576e+07 -2.100e+07   <2e-16 ***
## CDotNotationc7234CT            4.504e+15  7.749e+07  5.812e+07   <2e-16 ***
## CDotNotationc73330GA          -2.011e+15  9.576e+07 -2.100e+07   <2e-16 ***
## CDotNotationc74875AG           2.493e+15  8.318e+07  2.997e+07   <2e-16 ***
## CDotNotationc760AG                    NA         NA         NA       NA    
## CDotNotationc767AG            -2.566e+15  7.397e+07 -3.469e+07   <2e-16 ***
## CDotNotationc7754TC           -1.112e+16  1.430e+08 -7.779e+07   <2e-16 ***
## CDotNotationc7796GA           -2.714e+16  4.301e+08 -6.310e+07   <2e-16 ***
## CDotNotationc79AC             -2.403e+15  3.478e+07 -6.907e+07   <2e-16 ***
## CDotNotationc8090CA            2.755e+15  8.310e+07  3.315e+07   <2e-16 ***
## CDotNotationc8236CT           -3.260e+16  4.499e+08 -7.247e+07   <2e-16 ***
## CDotNotationc839CT            -6.483e+15  1.297e+08 -4.999e+07   <2e-16 ***
## CDotNotationc839GA            -2.714e+16  4.301e+08 -6.310e+07   <2e-16 ***
## CDotNotationc8434CT            6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc8456GA            9.033e+15  6.126e+07  1.475e+08   <2e-16 ***
## CDotNotationc8483CT           -2.849e+16  4.720e+08 -6.035e+07   <2e-16 ***
## CDotNotationc861GA             7.604e+15  7.377e+07  1.031e+08   <2e-16 ***
## CDotNotationc8876GA            6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc8921AG           -3.734e+16  4.719e+08 -7.912e+07   <2e-16 ***
## CDotNotationc8939CT           -2.865e+16  4.485e+08 -6.387e+07   <2e-16 ***
## CDotNotationc8967GA           -2.775e+15  1.133e+08 -2.449e+07   <2e-16 ***
## CDotNotationc9083CT            6.782e+15  6.126e+07  1.107e+08   <2e-16 ***
## CDotNotationc9349GA            2.720e+15  7.351e+07  3.699e+07   <2e-16 ***
## CDotNotationc962AT                    NA         NA         NA       NA    
## CDotNotationc9721GA            2.807e+15  9.571e+07  2.933e+07   <2e-16 ***
## CDotNotationc9746CT            4.399e+15  6.713e+07  6.553e+07   <2e-16 ***
## CDotNotationc98CG                     NA         NA         NA       NA    
## CDotNotationc999GA                    NA         NA         NA       NA    
## GeneABL2                              NA         NA         NA       NA    
## GeneACVR1B                            NA         NA         NA       NA    
## GeneATM                               NA         NA         NA       NA    
## GeneBRAF                              NA         NA         NA       NA    
## GeneCATSPERZ                          NA         NA         NA       NA    
## GeneCD276                             NA         NA         NA       NA    
## GeneCSMD1                             NA         NA         NA       NA    
## GeneCTLA4                             NA         NA         NA       NA    
## GeneCUL3                              NA         NA         NA       NA    
## GeneDNAH17                            NA         NA         NA       NA    
## GeneDNAH2                             NA         NA         NA       NA    
## GeneDNAH9                             NA         NA         NA       NA    
## GeneEGFR                              NA         NA         NA       NA    
## GeneERCC5                             NA         NA         NA       NA    
## GeneHNF1A                             NA         NA         NA       NA    
## GeneIDH1                              NA         NA         NA       NA    
## GeneKMT2A                             NA         NA         NA       NA    
## GeneMDC1                              NA         NA         NA       NA    
## GeneMST1                              NA         NA         NA       NA    
## GeneMUC16                             NA         NA         NA       NA    
## GeneNBN                               NA         NA         NA       NA    
## GeneNSD1                              NA         NA         NA       NA    
## GeneNUTM1                             NA         NA         NA       NA    
## GenePCLO                              NA         NA         NA       NA    
## GenePDGFRA                            NA         NA         NA       NA    
## GenePKHD1L1                           NA         NA         NA       NA    
## GenePRKDC                             NA         NA         NA       NA    
## GenePTCH1                             NA         NA         NA       NA    
## GeneRANBP2                            NA         NA         NA       NA    
## GeneSLX4                              NA         NA         NA       NA    
## GeneSPTBN2                            NA         NA         NA       NA    
## GeneTNFAIP3                           NA         NA         NA       NA    
## GeneTNKS2                             NA         NA         NA       NA    
## GeneTTN                               NA         NA         NA       NA    
## GeneVCL                               NA         NA         NA       NA    
## GeneWISP3                             NA         NA         NA       NA    
## GeneXIRP1                             NA         NA         NA       NA    
## GeneXIRP2                             NA         NA         NA       NA    
## GeneZNF217                            NA         NA         NA       NA    
## GeneZNF337                            NA         NA         NA       NA    
## GeneZNF74                             NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 566.01  on 411  degrees of freedom
## Residual deviance: 360.47  on 233  degrees of freedom
## AIC: 718.47
## 
## Number of Fisher Scoring iterations: 25

pred_resp <- predict(logit_model1, newdata = test_data1, type = "response") #Predictions
conf <- table(test_data1$Melanoma, (pred_resp > 0.5)*1, dnn = c("Truth", "Predicted")) #Confusion Matrix
conf

##      Predicted
## Truth  0  1
##     0 22  5
##     1  0 18

misclassification_error <- 1 - sum(diag(conf)) / sum(conf) # Misclassification error
misclassification_error

## [1] 0.1111111

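As observed earlier, the c_dot_notation/mutation indicators appear as significant variables in the model. The confusion matrix shows 5 misclassifications out of 45 test observations (11.1%), an acceptable but not optimal result compared with previous models such as bagging and boosting. However, the summary contains numerous “NA” entries. In R’s glm output, an NA coefficient means the term could not be estimated because it is linearly dependent on other predictors (a singularity); it is not a statement about significance. Here every Gene term is NA, which suggests that the gene is already fully determined by the mutation indicators. In addition, the extremely large estimates (on the order of 1e15) and the 25 Fisher scoring iterations, which is glm’s default maximum, suggest the fit did not converge cleanly, so the reported p-values should be read with caution. To examine the gene effects on their own, another logistic regression is fitted that excludes the most influential variable, c_dot_notation/mutations.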
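
For reference, R's glm reports an NA coefficient when a term is an exact linear combination of other predictors; the summary flags these as "not defined because of singularities". The sketch below reproduces this on made-up data. The column names x1, x2, and x_sum are illustrative and are not taken from the project's dataset.

```r
# Toy data (hypothetical): x_sum is an exact linear combination of x1 and x2.
set.seed(1)
toy <- data.frame(
  x1 = rnorm(50),
  x2 = rnorm(50)
)
toy$x_sum <- toy$x1 + toy$x2
toy$y <- rbinom(50, size = 1, prob = plogis(0.8 * toy$x1 - 0.5 * toy$x2))

# Fit a logistic regression that includes the redundant column.
fit <- glm(y ~ x1 + x2 + x_sum, data = toy, family = binomial)

# The redundant term cannot be estimated and is reported as NA.
na_terms <- names(coef(fit))[is.na(coef(fit))]
print(na_terms)
```

Listing the NA terms this way for a fitted model is a quick check of which predictors are redundant given the others.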

genes_to_include <- c("GeneABL2", "GeneACVR1B", "GeneATM", "GeneBRAF", "GeneCATSPERZ",
                      "GeneCD276", "GeneCSMD1", "GeneCTLA4", "GeneCUL3", "GeneDNAH17",
                      "GeneDNAH2", "GeneDNAH9", "GeneEGFR", "GeneERCC5", "GeneHNF1A",
                      "GeneIDH1", "GeneKMT2A", "GeneMDC1", "GeneMST1", "GeneMUC16",
                      "GeneNBN", "GeneNSD1", "GeneNUTM1", "GenePCLO", "GenePDGFRA",
                      "GenePKHD1L1", "GenePRKDC", "GenePTCH1", "GeneRANBP2", "GeneSLX4",
                      "GeneSPTBN2", "GeneTNFAIP3", "GeneTNKS2", "GeneTTN", "GeneVCL",
                      "GeneWISP3", "GeneXIRP1", "GeneXIRP2", "GeneZNF217", "GeneZNF337",
                      "GeneZNF74")

formula <- as.formula(paste("Melanoma ~ ", paste(c(genes_to_include, "AffectedExons", "AllExons"), collapse = " + ")))
logit_model2 <- glm(formula, data = train_data1, family = binomial)
summary(logit_model2)

## 
## Call:
## glm(formula = formula, family = binomial, data = train_data1)
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.309e+13  9.743e+13  -0.340    0.734    
## GeneABL2      -4.471e+15  9.743e+13 -45.885   <2e-16 ***
## GeneACVR1B    -4.471e+15  9.743e+13 -45.885   <2e-16 ***
## GeneATM        3.309e+13  9.743e+13   0.340    0.734    
## GeneBRAF       3.309e+13  9.743e+13   0.340    0.734    
## GeneCATSPERZ   3.309e+13  9.743e+13   0.340    0.734    
## GeneCD276      3.309e+13  9.743e+13   0.340    0.734    
## GeneCSMD1      3.309e+13  9.743e+13   0.340    0.734    
## GeneCTLA4      3.309e+13  9.743e+13   0.340    0.734    
## GeneCUL3       3.309e+13  9.743e+13   0.340    0.734    
## GeneDNAH17     3.309e+13  9.743e+13   0.340    0.734    
## GeneDNAH2      3.309e+13  9.743e+13   0.340    0.734    
## GeneDNAH9      3.309e+13  9.743e+13   0.340    0.734    
## GeneEGFR       3.309e+13  9.743e+13   0.340    0.734    
## GeneERCC5      3.309e+13  9.743e+13   0.340    0.734    
## GeneHNF1A      3.309e+13  9.743e+13   0.340    0.734    
## GeneIDH1       3.309e+13  9.743e+13   0.340    0.734    
## GeneKMT2A      3.309e+13  9.743e+13   0.340    0.734    
## GeneMDC1       3.309e+13  9.743e+13   0.340    0.734    
## GeneMST1       3.309e+13  9.743e+13   0.340    0.734    
## GeneMUC16      3.309e+13  9.743e+13   0.340    0.734    
## GeneNBN        3.309e+13  9.743e+13   0.340    0.734    
## GeneNSD1       3.309e+13  9.743e+13   0.340    0.734    
## GeneNUTM1      3.309e+13  9.743e+13   0.340    0.734    
## GenePCLO       3.309e+13  9.743e+13   0.340    0.734    
## GenePDGFRA     3.309e+13  9.743e+13   0.340    0.734    
## GenePKHD1L1    3.309e+13  9.743e+13   0.340    0.734    
## GenePRKDC      3.309e+13  9.743e+13   0.340    0.734    
## GenePTCH1      3.309e+13  9.743e+13   0.340    0.734    
## GeneRANBP2     3.309e+13  9.743e+13   0.340    0.734    
## GeneSLX4       3.309e+13  9.743e+13   0.340    0.734    
## GeneSPTBN2     3.309e+13  9.743e+13   0.340    0.734    
## GeneTNFAIP3    3.309e+13  9.743e+13   0.340    0.734    
## GeneTNKS2      3.309e+13  9.743e+13   0.340    0.734    
## GeneTTN        3.309e+13  9.743e+13   0.340    0.734    
## GeneVCL        3.309e+13  9.743e+13   0.340    0.734    
## GeneWISP3      3.309e+13  9.743e+13   0.340    0.734    
## GeneXIRP1      3.309e+13  9.743e+13   0.340    0.734    
## GeneXIRP2      3.309e+13  9.743e+13   0.340    0.734    
## GeneZNF217     3.309e+13  9.743e+13   0.340    0.734    
## GeneZNF337     3.309e+13  9.743e+13   0.340    0.734    
## GeneZNF74      3.309e+13  9.743e+13   0.340    0.734    
## AffectedExons -4.837e-03  3.070e-02  -0.158    0.875    
## AllExons              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 566.01  on 411  degrees of freedom
## Residual deviance: 120.22  on 369  degrees of freedom
## AIC: 206.22
## 
## Number of Fisher Scoring iterations: 25

pred_prob <- predict(logit_model2, newdata = test_data1, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
conf_mat <- table(test_data1$Melanoma, pred_class, dnn = c("Truth", "Predicted")) #Confusion Matrix
print(conf_mat)

##      Predicted
## Truth  0  1
##     0 27  0
##     1  2 16

misclassification_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat) # Misclassification error
misclassification_error

## [1] 0.04444444

The second logistic model, which excluded the ‘Cdotnotation’ mutations, showed only two genes, “GeneABL2” and “GeneACVR1B,” as significant. The p-values of the remaining genes exceeded 0.05 (all 0.734), indicating a lack of significance. ‘AllExons’ returned NA: the summary reports one coefficient not defined because of singularities, meaning the term is linearly dependent on the other predictors and could not be estimated. As with the first model’s output, the very large estimates and the 25 Fisher scoring iterations suggest the fit did not converge cleanly, so the significance results should be interpreted cautiously. Despite omitting the most significant variable, the model’s predictive accuracy improved: only 2 of 45 test observations were misclassified, a misclassification error of 4.44%.
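
The misclassification error quoted above can be checked directly from the reported confusion matrix. The sketch below re-enters the four counts by hand (rows are truth, columns are predictions) and also derives sensitivity and specificity, which the original analysis does not report. The object names are illustrative.

```r
# Confusion matrix for the second logistic regression, re-entered from the
# output above: rows = truth (0, 1), columns = predicted (0, 1).
conf_mat2 <- matrix(c(27, 0,
                       2, 16),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Truth = c("0", "1"),
                                    Predicted = c("0", "1")))

# Off-diagonal cells are the errors.
misclass <- 1 - sum(diag(conf_mat2)) / sum(conf_mat2)

# Sensitivity: share of true melanoma cases predicted as melanoma.
sensitivity <- conf_mat2["1", "1"] / sum(conf_mat2["1", ])
# Specificity: share of non-melanoma cases predicted as non-melanoma.
specificity <- conf_mat2["0", "0"] / sum(conf_mat2["0", ])

round(c(misclass = misclass, sensitivity = sensitivity,
        specificity = specificity), 4)
```

Reporting sensitivity alongside the overall error makes it visible that both errors of this model are missed melanoma cases.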

genes_to_include1 <- c("GeneABL2", "GeneACVR1B", "GeneATM", "GeneBRAF", "GeneCATSPERZ",
                      "GeneCD276", "GeneCSMD1", "GeneCTLA4", "GeneCUL3", "GeneDNAH17",
                      "GeneDNAH2", "GeneDNAH9", "GeneEGFR", "GeneERCC5", "GeneHNF1A",
                      "GeneIDH1", "GeneKMT2A", "GeneMDC1", "GeneMST1", "GeneMUC16",
                      "GeneNBN", "GeneNSD1", "GeneNUTM1", "GenePCLO", "GenePDGFRA",
                      "GenePKHD1L1", "GenePRKDC", "GenePTCH1", "GeneRANBP2", "GeneSLX4",
                      "GeneSPTBN2", "GeneTNFAIP3", "GeneTNKS2", "GeneTTN", "GeneVCL",
                      "GeneWISP3", "GeneXIRP1", "GeneXIRP2", "GeneZNF217", "GeneZNF337",
                      "GeneZNF74")

formula1 <- as.formula(paste("Melanoma ~ ", paste(genes_to_include1, collapse = " + ")))
logit_model3 <- glm(formula1, data = train_data1, family = binomial)
summary(logit_model3)

## 
## Call:
## glm(formula = formula1, family = binomial, data = train_data1)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.258e+13  6.567e+13  -0.344    0.731    
## GeneABL2     -4.481e+15  6.567e+13 -68.231   <2e-16 ***
## GeneACVR1B   -4.481e+15  6.567e+13 -68.231   <2e-16 ***
## GeneATM       2.258e+13  6.567e+13   0.344    0.731    
## GeneBRAF      2.258e+13  6.567e+13   0.344    0.731    
## GeneCATSPERZ  2.258e+13  6.567e+13   0.344    0.731    
## GeneCD276     2.258e+13  6.567e+13   0.344    0.731    
## GeneCSMD1     2.258e+13  6.567e+13   0.344    0.731    
## GeneCTLA4     2.258e+13  6.567e+13   0.344    0.731    
## GeneCUL3      2.258e+13  6.567e+13   0.344    0.731    
## GeneDNAH17    2.258e+13  6.567e+13   0.344    0.731    
## GeneDNAH2     2.258e+13  6.567e+13   0.344    0.731    
## GeneDNAH9     2.258e+13  6.567e+13   0.344    0.731    
## GeneEGFR      2.258e+13  6.567e+13   0.344    0.731    
## GeneERCC5     2.258e+13  6.567e+13   0.344    0.731    
## GeneHNF1A     2.258e+13  6.567e+13   0.344    0.731    
## GeneIDH1      2.258e+13  6.567e+13   0.344    0.731    
## GeneKMT2A     2.258e+13  6.567e+13   0.344    0.731    
## GeneMDC1      2.258e+13  6.567e+13   0.344    0.731    
## GeneMST1      2.258e+13  6.567e+13   0.344    0.731    
## GeneMUC16     2.258e+13  6.567e+13   0.344    0.731    
## GeneNBN       2.258e+13  6.567e+13   0.344    0.731    
## GeneNSD1      2.258e+13  6.567e+13   0.344    0.731    
## GeneNUTM1     2.258e+13  6.567e+13   0.344    0.731    
## GenePCLO      2.258e+13  6.567e+13   0.344    0.731    
## GenePDGFRA    2.258e+13  6.567e+13   0.344    0.731    
## GenePKHD1L1   2.258e+13  6.567e+13   0.344    0.731    
## GenePRKDC     2.258e+13  6.567e+13   0.344    0.731    
## GenePTCH1     2.258e+13  6.567e+13   0.344    0.731    
## GeneRANBP2    2.258e+13  6.567e+13   0.344    0.731    
## GeneSLX4      2.258e+13  6.567e+13   0.344    0.731    
## GeneSPTBN2    2.258e+13  6.567e+13   0.344    0.731    
## GeneTNFAIP3   2.258e+13  6.567e+13   0.344    0.731    
## GeneTNKS2     2.258e+13  6.567e+13   0.344    0.731    
## GeneTTN       2.258e+13  6.567e+13   0.344    0.731    
## GeneVCL       2.258e+13  6.567e+13   0.344    0.731    
## GeneWISP3     2.258e+13  6.567e+13   0.344    0.731    
## GeneXIRP1     2.258e+13  6.567e+13   0.344    0.731    
## GeneXIRP2     2.258e+13  6.567e+13   0.344    0.731    
## GeneZNF217    2.258e+13  6.567e+13   0.344    0.731    
## GeneZNF337    2.258e+13  6.567e+13   0.344    0.731    
## GeneZNF74     2.258e+13  6.567e+13   0.344    0.731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 566.01  on 411  degrees of freedom
## Residual deviance: 120.24  on 370  degrees of freedom
## AIC: 204.24
## 
## Number of Fisher Scoring iterations: 25

pred_prob <- predict(logit_model3, newdata = test_data1, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
conf_mat <- table(test_data1$Melanoma, pred_class)
print(conf_mat)

##    pred_class
##      0  1
##   0 26  1
##   1  3 15

misclassification_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat) # Misclassification error
misclassification_error

## [1] 0.08888889

The third logistic regression dropped the two exon predictors: “Affected Exons,” which had the highest p-value in the second model (0.875), and “All Exons,” which could not be estimated (NA). Apart from “GeneACVR1B” and “GeneABL2,” which remained significant, the remaining genes showed a uniform p-value of 0.731, essentially unchanged from 0.734 in the previous model. Predictive performance declined: the misclassification rate was 8.89%, or four misclassified data points out of 45. The second logistic regression therefore produced the most favorable outcome of the three, with a misclassification rate of 4.44%.

Random forest including c_dot_notation/mutations

set.seed(123)
library(randomForest)
rf_model1 <- randomForest(Melanoma ~ ., data = train_data1, ntree = 500) # Model; ntree sets the number of trees (500)
rf_model1

## 
## Call:
##  randomForest(formula = Melanoma ~ ., data = train_data1, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 75
## 
##           Mean of squared residuals: 0.04059789
##                     % Var explained: 83.56

rf_pred1 <- predict(rf_model1, test_data1) # Predictions
table(test_data1$Melanoma, (rf_pred1 > 0.5)*1, dnn = c("Truth", "Predicted")) # Confusion matrix from the random forest predictions

##      Predicted
## Truth  0  1
##     0 22  5
##     1  0 18

varImpPlot(rf_model1, n.var = 10, main = "Top 10 Variable Importance Plot", cex.axis = 0.7, las = 2)

The previous Random Forest model was limited by the categorical variable c_dot_notation, which had more categories than randomForest can handle (at most 53 per categorical predictor), so a new Random Forest was fitted after transforming the data to work around this limit. The model explains 83.56% of the variance, a respectable result. The confusion matrix shown above, with five misclassified data points, is identical to that of the first logistic regression because it was originally computed from pred_resp rather than rf_pred1; it should be regenerated from rf_pred1 before the five misclassifications are attributed to the random forest. Among the ten most influential variables, many are genes. This aligns with the earlier descriptive analytics conducted in Tableau, which identified genes DNAH9, PCLO, and PKHD1L1 as highly influential factors for Melanoma.
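
The transformation itself is not shown in this section. One common way to work around the 53-level limit is to expand the factor into 0/1 indicator columns. The sketch below does this with base R's model.matrix on a made-up data frame; the data and the name toy_mut are assumptions for illustration only, and the authors' actual preprocessing may differ.

```r
# Hypothetical data: one high-cardinality factor and a binary outcome.
toy_mut <- data.frame(
  Melanoma = c(1, 0, 1, 0, 1, 0),
  CDotNotation = factor(c("c79AC", "c49AG", "c79AC", "c290CT", "c49AG", "c861GA"))
)

# model.matrix() expands the factor into indicator columns; "- 1" drops the
# intercept so that every level gets its own column.
dummies <- model.matrix(~ CDotNotation - 1, data = toy_mut)

# Combine the outcome with the numeric indicators; no factor columns remain,
# so the per-factor level limit no longer applies.
encoded <- data.frame(Melanoma = toy_mut$Melanoma, dummies)

dim(encoded)
colnames(encoded)
```

Each row has exactly one indicator set to 1, so the encoded table carries the same information as the original factor.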

K-Nearest Neighbors (KNN)

set.seed(123)
library(class) # provides knn()
knn_gene <- knn(train = train_data1[, -2], test = test_data1[, -2], cl = train_data1[, 2], k = 5) # Model; column 2 holds the class label, and k = 5 means the algorithm considers the five closest training points to the point being classified
knn_gene

##  [1] 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1
## [39] 0 0 0 0 0 0 1
## Levels: 0 1

table(test_data1[,2], knn_gene, dnn = c("True", "Predicted")) #Confusion Matrix

##     Predicted
## True  0  1
##    0 24  3
##    1  1 17

sum(test_data1[, 2] != knn_gene)

## [1] 4

misclassification_percentage <- sum(test_data1[, 2] != knn_gene) / length(test_data1[, 2]) * 100
misclassification_percentage

## [1] 8.888889

K-Nearest Neighbors (KNN) with k = 5 misclassified four of the 45 test observations, a misclassification rate of 8.89%. This matches the third logistic regression and indicates that KNN is a viable model for predicting Melanoma on this dataset.
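
To make the k = 5 rule concrete, the sketch below implements the nearest-neighbor vote by hand in base R on a tiny made-up dataset. It measures the Euclidean distance from a new point to every training point, keeps the five closest, and returns the majority class. The function knn_vote and the data are hypothetical; the analysis itself uses knn() as shown above.

```r
# Hypothetical helper: classify one point by majority vote among its k
# nearest training points (Euclidean distance).
knn_vote <- function(train_x, train_y, new_x, k = 5) {
  # Difference between each training row and new_x, column by column.
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows.
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most frequent label among them.
  names(which.max(table(nearest)))
}

# Made-up training data: class 0 clusters near (0, 0), class 1 near (1, 1).
train_x <- data.frame(
  f1 = c(0.0, 0.1, 0.2, 0.1, 0.9, 1.0, 0.8, 1.0),
  f2 = c(0.1, 0.0, 0.2, 0.3, 1.0, 0.9, 0.9, 0.8)
)
train_y <- c(0, 0, 0, 0, 1, 1, 1, 1)

vote_high <- knn_vote(train_x, train_y, new_x = c(0.95, 0.9), k = 5)
vote_low  <- knn_vote(train_x, train_y, new_x = c(0.05, 0.1), k = 5)
print(c(vote_high, vote_low))
```

An odd k such as 5 avoids tied votes in a two-class problem.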

Neural Network

set.seed(1234567)
library(neuralnet) # provides neuralnet(), compute(), and the predict method used below
maxs <- apply(genes_df1, 2, max) # maximum value of each column
mins <- apply(genes_df1, 2, min) # minimum value of each column
scaled <- as.data.frame(scale(genes_df1, center = mins, scale = maxs - mins)) # min-max scaling to [0, 1]
index <- sample(nrow(genes_df1), nrow(genes_df1)*0.90) # 90% of rows for training
train_genes_df1 <- scaled[index,]
test_genes_df1 <- scaled[-index,]


nn <- neuralnet(Melanoma ~ ., data=train_genes_df1, hidden=c(1,1), linear.output=FALSE, algorithm = 'rprop+') # Two hidden layers with one neuron each
nn$act.fct

## function (x) 
## {
##     1/(1 + exp(-x))
## }
## <bytecode: 0x0000025c839349e0>
## <environment: 0x0000025c839313f0>
## attr(,"type")
## [1] "logistic"

plot(nn)
pr_nn <- compute(nn, test_genes_df1[,1:228])
prob_nn_out <- predict(nn, test_genes_df1, type = "response")
pcut_nn<-0.5
pred_nn_out <- (prob_nn_out >= pcut_nn)*1 
table(test_genes_df1$Melanoma, pred_nn_out, dnn = c("Observed", "Predicted"))

##         Predicted
## Observed  0  1
##        0 29  0
##        1  0 17

The neural network, configured for binary classification with a logistic activation function, is the final model in the series. It predicts the probability of belonging to one of two classes and is the most sophisticated model type employed, although the fitted network is small, with two hidden layers of one neuron each. Before fitting, the dataset was min-max normalized to the range [0, 1] to improve performance. On its held-out set of 46 observations, which comes from a separate 90/10 split rather than the 45-observation test set used for the other models, the network classified every observation correctly. Because this test set is small and differs from the one used elsewhere, the perfect score is encouraging but not directly comparable. The network plot, which displays 228 input variables, is visually overwhelming, but the outcome aligns with our expectations and objectives.
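
The normalization step described above is min-max scaling, which maps every column to the range [0, 1] via (x - min) / (max - min). The sketch below applies the same scale() call to a small made-up data frame; toy_df is an assumption, not the project's genes_df1. Note that a column holding a single constant value has max - min = 0 and would produce NaN, so such columns would need to be dropped first.

```r
# Made-up numeric data standing in for the real feature table.
toy_df <- data.frame(
  AffectedExons = c(1, 4, 7, 10),
  GeneBRAF      = c(0, 1, 0, 1),
  Melanoma      = c(0, 1, 1, 0)
)

mins <- apply(toy_df, 2, min)  # per-column minimum
maxs <- apply(toy_df, 2, max)  # per-column maximum

# scale() subtracts `center` and divides by `scale`, column by column.
scaled_toy <- as.data.frame(scale(toy_df, center = mins, scale = maxs - mins))

scaled_toy
sapply(scaled_toy, range)
```

Columns that are already 0/1 indicators are left unchanged by this scaling.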

Comparison of the predictive models

As observed, ten distinct models were constructed for the prediction of Melanoma. Most of these models yielded comparable outcomes, with the weakest models showing a misclassification rate of 11%, a level deemed acceptable. Logistic Regression without C_Dot_Notation, Bagging, Boosting, and Decision Tree models produced identical results in this scenario. The standout performer was the Neural Network, which made no errors on its test set. Three of the top five models were complex models. Complex models are valuable where the relationships between input features and the target variable are intricate, nonlinear, or high-dimensional, because they have the capacity to capture such structure.
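
Because each test set holds only about 45 observations, the error rates being compared carry wide uncertainty. As a rough illustration, the sketch below uses base R's binom.test to put an exact 95% confidence interval around several of the error counts reported above. The counts are re-entered by hand from the confusion matrices in this section (the random forest is left out), and the intervals are an addition for context rather than part of the original analysis.

```r
# Errors and test-set sizes re-entered from the confusion matrices above.
results <- data.frame(
  model  = c("Logistic 1", "Logistic 2", "Logistic 3", "KNN", "Neural network"),
  errors = c(5, 2, 4, 4, 0),
  n      = c(45, 45, 45, 45, 46)
)

# Exact (Clopper-Pearson) 95% interval for each error rate; one row per model.
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               results$errors, results$n))

results$error_rate <- results$errors / results$n
results$ci_lower   <- ci[, 1]
results$ci_upper   <- ci[, 2]

# Models ordered from lowest to highest observed error rate.
results[order(results$error_rate), ]
```

With samples this small the intervals are likely to overlap substantially, so the ranking of models should be treated as suggestive rather than conclusive.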

Summary

Problem Statement

Melanoma presents a significant health challenge, necessitating intensified research efforts. Genomic analysis offers valuable insights into the disease’s molecular mechanisms, facilitating the discovery of druggable targets for pharmaceutical companies. By leveraging genomic data, these companies can expedite the development of targeted therapies, offering hope to melanoma patients. Thus, genomic analysis holds immense promise in advancing the understanding of melanoma and driving innovative treatment strategies.

Addressing the problem statement

Melanoma represents a significant public health challenge in the United States because of its prevalence and potentially fatal consequences, which persist even though the condition is largely preventable through measures such as sunscreen application and regular skin examinations. Despite considerable efforts to raise awareness, the disease continues to claim many lives annually. This study was undertaken to identify underlying patterns associated with melanoma. A dataset of over 35,000 rows was compiled from 25 sources, including 14 from local sequencing laboratories in Carmel, IN, and 9 from the National Cancer Institute’s GDC Data Portal, covering melanoma as well as other cancers such as lung, head and neck, thyroid, and breast cancer. The primary objective was to uncover common factors among melanoma patients that could serve as targets for drug development. Following gene identification, predictive analytics were conducted with a focus on predicting melanoma, using logistic regression, decision trees, random forests, boosting, bagging, neural networks, and k-nearest neighbors (KNN). The results of the predictive analysis underscore the promise of these models for predicting melanoma. The analysis aims to offer insights into genes and associated factors that can inform pharmaceutical companies, improving the efficiency and cost-effectiveness of drug discovery efforts against melanoma.

Insights Provided by Analysis

The analysis identified several tumor suppressor genes crucial for maintaining cell health. Importantly, these genes impact cancer development broadly rather than targeting specific types. Therefore, mutations in genes like FAT1 or SPTA1 are associated with various cancer types, highlighting their pivotal role in tumorigenesis.
The analysis revealed a common trend where mutations were frequently localized to specific regions within genes. This suggests that certain areas of the genome may be more susceptible to genetic alterations, highlighting potential targets for further investigation.
The analysis revealed that genes such as MUC16, PCLO, DNAH9, PKHD1L1, CSMD1, DNAH17, and XIRP2 were most frequently observed in patients with melanoma.
PCLO, DNAH9, PKHD1L1, and XIRP2 were identified as genes with the most significant influence in the analysis. Further research into these genes by healthcare professionals could provide valuable insights into their roles in cancer biology and potential therapeutic targets.
Location 18/19 in the PCLO gene exhibited the highest frequency of mutations among all genes identified in melanoma patients. This hotspot region may harbor important genetic variations associated with melanoma development and progression.
The analysis revealed a correlation between the number of mutations within the same exon and the likelihood of cancer development. This suggests that accumulation of mutations in specific genomic regions may increase cancer susceptibility and underscores the importance of investigating these regions further.
Predictive analytics demonstrated that mutations have a substantial impact on cancer development. Understanding the genetic alterations driving cancer initiation and progression is crucial for developing targeted treatment strategies.

Implications for stakeholders

The insights gleaned from the analysis offer valuable guidance for pharmaceutical companies, particularly in the realm of melanoma research and drug development. By pinpointing key tumor suppressor genes, hotspot regions for mutations, and influential genetic factors specific to melanoma, the analysis enables pharmaceutical companies to refine their focus on the most promising targets for therapeutic intervention. It’s important to note that while these insights provide valuable direction, they are not definitive answers and require further research and validation in actual laboratory settings. This targeted approach streamlines research efforts and allows for more efficient allocation of resources, facilitating the development of cost-effective and highly efficacious treatments tailored to melanoma. Leveraging these insights, pharmaceutical companies can expedite the discovery and development of novel therapies for melanoma, ultimately leading to improved patient outcomes and addressing the unmet medical needs in melanoma treatment.

Limitations of the analysis

Limited Dataset Size: The size of the dataset may limit the generalizability of findings. Utilizing a larger dataset would enhance result robustness. Data cleaning procedures, while necessary, may further reduce the dataset’s size, potentially impacting imputation capabilities.
Potential Bias in Data Source: Analysis could be biased if data primarily consists of targeted sequencing rather than whole-genome sequencing from the National Cancer Institute’s GDC Data Portal. This limitation may affect the comprehensiveness of genomic insights.
Broader Cancer Spectrum Inclusion: Enriching the analysis with a broader spectrum of cancer types beyond melanoma would enhance reliability and contextual understanding of genetic factors influencing cancer susceptibility and progression.
Increased Melanoma Patient Cohort: Expanding the melanoma patient cohort would improve the ability to identify correlations and discern patterns within the dataset. A larger sample size enhances statistical power and facilitates more robust analyses.
Data Collection Practices: Variations in data collection practices, including sample collection methods, sequencing techniques, and data processing pipelines, can affect data quantity and quality in sequencing databases.
Prevalence of Cancer Types: Variation in cancer prevalence affects the volume of available data for analysis. Cancers with higher mortality rates or greater research focus may have more extensive genomic studies and sequencing efforts.
Singular Focus on Single Nucleotide Variants (SNVs): The analysis concentrates on SNVs. Accounting for other genomic alterations, such as insertions and deletions, copy number variations, and structural rearrangements, would provide a more nuanced understanding of melanoma pathogenesis.
Multifactorial Nature of Melanoma: Melanoma is influenced by genetic, environmental, and lifestyle factors. Non-genetic determinants such as sun exposure and family history should be considered alongside genetic factors.
Demographic Representation Bias: Biases or underrepresentation in data collection may compromise model generalizability across diverse populations.
Challenges with Categorical Data: High volumes of unique categorical variables may complicate model development, requiring advanced techniques for effective analysis.
Model Performance Constraints: Categorical predictors with a very large number of levels can cause computational errors or exceed what some model implementations can handle, so models and preprocessing steps must be chosen with care.
Validation of Predictive Models: Independent dataset validation is crucial for evaluating model accuracy and reliability, enhancing confidence in their utility for clinical decision-making and research.
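One common preprocessing technique for the high-cardinality categorical variables noted in the list above is to lump rare levels into a single catch-all category before modelling. The sketch below is a minimal illustration of that idea, not the method used in this study; the function name `lump_rare_levels`, the `min_count` threshold, the `OTHER` label, and the toy mutation labels are all assumptions made for the example.

```python
from collections import Counter


def lump_rare_levels(values, min_count=2, other_label="OTHER"):
    """Replace levels seen fewer than min_count times with other_label."""
    frequency = Counter(values)
    return [v if frequency[v] >= min_count else other_label for v in values]


# Hypothetical toy column of mutation labels, not taken from the study data.
mutations = ["c.10A>T", "c.10A>T", "c.25G>A", "c.31C>T", "c.10A>T", "c.25G>A"]

lumped = lump_rare_levels(mutations, min_count=2)

# Show the distinct levels that remain after lumping.
print(sorted(set(lumped)))
```

In practice the level frequencies would be computed on the training set only and then applied unchanged to the testing set, so that no information from the test data leaks into preprocessing.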

Substantial progress in this study depends on securing a larger and more diverse dataset that encompasses multiple cancer types. Standardizing data collection methods across sources would help reduce bias. Finally, validating the models on external datasets would show how well their predictive performance generalizes.

Melanoma Gene Identifier

Anna Calka

2024-04-21
