Genetic variants are usually classified by labs into two main categories: Pathogenic or Benign.
Information about the type of variant is critical for doctors and researchers to understand the medical risks involved with each variant.
Problem:
An important issue with the classification of genetic variants is that different labs often provide conflicting results for the same variant. It is therefore of interest to predict whether a particular variant is likely to be classified in a conflicting manner by multiple labs.
If a variant is predicted to have non-conflicting results, doctors can have more confidence in the result of a single lab. Conversely, if a variant is predicted to have conflicting results, doctors might consult multiple labs before making a decision.
Solution:
Random Forest Classifier: The above problem can be framed as a binary classification problem, with each variant classified as Conflicting or Non-conflicting. A random forest is used to build the classifier; the model is trained and its hyperparameters are tuned to arrive at the optimal model.
The optimal model is then embedded in a Shiny app. The app allows the user to make predictions for either multiple variants or a single variant. In the case of multiple variants, the app also helps the user visualize some of the key features of the variants, segregated by whether each variant is predicted to have conflicting or non-conflicting results.
Dataset: The dataset used is a Kaggle dataset (https://www.kaggle.com/kevinarvai/clinvar-conflicting). The source of the data is ClinVar, a public repository of annotated genetic variants.
library(caret) # machine learning utilities (data splitting, confusion matrices)
library(mlbench) # benchmark datasets
library(randomForest) # random forest implementation
library(dplyr) # data manipulation
library(ggplot2) # plotting
library(pROC) # ROC curves and AUC
Load the dataset.
# load dataset
# add path to the folder containing the data (.csv file)
filepath = '/Users/archanabalan/Documents/Spring Semester/DS/Project/Clinical_variants'
# set working directory to the file path
setwd(filepath)
# Read the data and save it as a data frame
data = read.csv('clinvar_conflicting.csv', header = TRUE, sep = ',')
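A quick sanity check of the loaded data can help orient the pre-processing below (a sketch; the output is not shown here):
# Dimensions and a first look at the column types
dim(data)
str(data, list.len = 10)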
Data Pre-processing:
A number of features in the dataset are either NaN for all samples or have entries for very few samples, so they are excluded from the analysis: Distance, BAM_EDIT, SIFT, PolyPhen, Motif Name, Motif position, High_inf_pos, Motif_score_change, BLOSUM62, CLNDISDBINCL and CLNVI.
Further, some features such as CLNVC and ORIGIN remain more or less constant across the entire dataset, so they are excluded as they are unlikely to be distinguishing features for a classifier.
Other features, such as the date of recording, are descriptive metadata and are unlikely to add predictive power to the model. The features investigated for the classifier are:
CHROM, POS, cDNA_position, CDS_position, Protein_position, CADD_PHRED and CADD_RAW (the columns selected in the modeling code below; the raw dataset also includes REF, AF_ESP, AF_TGP and AF_EXAC).
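The filtering of sparse features is not shown as code; a quick per-column check of missingness (a sketch) can confirm which features to drop:
# Fraction of missing or empty entries per column, highest first
na_frac <- sapply(data, function(col) mean(is.na(col) | col == ''))
head(sort(na_frac, decreasing = TRUE), 10)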
# The feature CHROM takes values 1 - 22, plus two additional categories: X and MT.
# To use CHROM as a numerical feature, X is coded as 23 and MT as 24.
# This is achieved using the chrom_numerical function.
chrom_numerical = function(input){
  if (input == 'X'){
    output = 23
  }
  else if (input == 'MT'){
    output = 24
  }
  else{
    output = as.numeric(input)
  }
  output
}
# Convert to character first so that factor level codes are not mapped by mistake
data$CHROM = unlist(lapply(as.character(data$CHROM), chrom_numerical))
# Features cDNA_position, CDS_position and Protein_position are converted from factor to numeric.
# as.character() is applied first so that factor level codes are not used by mistake.
data$cDNA_position = as.numeric(as.character(data$cDNA_position))
data$CDS_position = as.numeric(as.character(data$CDS_position))
data$Protein_position = as.numeric(as.character(data$Protein_position))
# The class label is converted to a factor
data$CLASS = as.factor(data$CLASS)
Train-Test split
# Train-test split: 80% of the data is used for training the model and the remaining 20% for testing.
# Choosing the features of interest from the data set
data_model = data %>% select(CHROM, POS, cDNA_position, CDS_position, Protein_position, CLASS, CADD_PHRED, CADD_RAW)
set.seed(123) # arbitrary seed so the split (and later sampling) is reproducible
train_index <- createDataPartition(data_model$CLASS, p = 0.80, list = FALSE)
test_set <- data_model[-train_index, ]
train_set <- data_model[train_index, ]
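Before training, it is worth inspecting the class balance of the training set (a sketch; the counts also surface later in the OOB confusion matrix):
# Class counts and proportions in the training set
table(train_set$CLASS)
prop.table(table(train_set$CLASS))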
# Train a random forest with default hyperparameters (500 trees); rows with missing values are omitted
rf_model <- randomForest(CLASS ~ ., data = train_set, importance = TRUE, na.action = na.omit)
print(rf_model)
##
## Call:
## randomForest(formula = CLASS ~ ., data = train_set, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 26.8%
## Confusion matrix:
## 0 1 class.error
## 0 36367 1889 0.04937788
## 1 11858 1186 0.90907697
final_predictions <- predict(rf_model, test_set)
confusionMatrix(final_predictions, test_set$CLASS)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9062 2984
## 1 474 276
##
## Accuracy : 0.7298
## 95% CI : (0.722, 0.7374)
## No Information Rate : 0.7452
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0468
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.95029
## Specificity : 0.08466
## Pos Pred Value : 0.75228
## Neg Pred Value : 0.36800
## Prevalence : 0.74523
## Detection Rate : 0.70819
## Detection Prevalence : 0.94139
## Balanced Accuracy : 0.51748
##
## 'Positive' Class : 0
##
The model has an accuracy of 72.98% with a positive predictive value of 0.75. However, the high sensitivity (0.95) paired with very low specificity (0.08) shows that the model predicts class 0 for almost every variant, which the low Kappa (0.047) and balanced accuracy (0.52) confirm.
Plotting ROC and calculating AUC
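The chunk producing the AUC below is not echoed in the original; a minimal sketch using pROC, assuming class probabilities are obtained from predict() with type = "prob", might be:
# Predicted probability of class 1 on the test set
rf_probs <- predict(rf_model, test_set, type = "prob")[, 2]
# ROC curve and AUC against the true labels
roc_obj <- roc(test_set$CLASS, rf_probs)
plot(roc_obj)
auc(roc_obj)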
## Area under the curve: 0.6001
Problem of class imbalance:
The dataset is imbalanced towards one particular label (class 0): class 0 samples are almost three times as numerous as class 1 samples. To obtain a less biased classifier, the training set is undersampled: class 0 samples are reduced to 13000 so that they are comparable in number to class 1 (13148 samples).
# Undersample the majority class (class 0) in the training set so the two classes are balanced
class_0_ind = which(train_set$CLASS == 0)
class_1_ind = which(train_set$CLASS == 1)
# Randomly select 13000 samples of the class 0 label
class_0_ind_small = sample(as.numeric(class_0_ind), 13000)
# Index for the balanced set: all class 1 indices plus the undersampled class 0 indices
total_index = c(class_1_ind, class_0_ind_small)
data_small = train_set[total_index, ]
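A quick check (a sketch; its output is not shown in the original) confirms that the classes are now roughly balanced:
table(data_small$CLASS)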
# Retrain the random forest on the balanced training set
rf_model_small <- randomForest(CLASS ~ ., data = data_small, importance = TRUE, na.action = na.omit)
print(rf_model_small)
##
## Call:
## randomForest(formula = CLASS ~ ., data = data_small, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 43.63%
## Confusion matrix:
## 0 1 class.error
## 0 6739 6015 0.4716167
## 1 5240 7804 0.4017173
final_predictions_small <- predict(rf_model_small, test_set)
confusionMatrix(final_predictions_small, test_set$CLASS)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4976 1306
## 1 4560 1954
##
## Accuracy : 0.5416
## 95% CI : (0.5329, 0.5502)
## No Information Rate : 0.7452
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0912
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5218
## Specificity : 0.5994
## Pos Pred Value : 0.7921
## Neg Pred Value : 0.3000
## Prevalence : 0.7452
## Detection Rate : 0.3889
## Detection Prevalence : 0.4909
## Balanced Accuracy : 0.5606
##
## 'Positive' Class : 0
##
The model has an accuracy of 54.16% with an OOB error rate of 43.63%. Although the overall accuracy drops, the specificity rises from 0.08 to 0.60 and the balanced accuracy improves from 0.52 to 0.56.
Plotting ROC and calculating AUC
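As before, the AUC for the undersampled model can be computed with pROC (same assumptions as the earlier sketch):
rf_probs_small <- predict(rf_model_small, test_set, type = "prob")[, 2]
auc(roc(test_set$CLASS, rf_probs_small))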
## Area under the curve: 0.5941
Model comparison: The second model, trained on an undersampled dataset, shows a lower accuracy of 54.16% compared to the first model (72.98%); the OOB error rate and the AUC (0.5941 vs 0.6001) also favour the first model. Thus, the first model is chosen as the optimal model for implementation in the Shiny app, keeping in mind that its higher accuracy partly reflects the class imbalance of the test set, while the undersampled model achieves the better balanced accuracy.
The best-performing classifier is saved and embedded in the Shiny app.
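The saving step itself is not shown above; a minimal sketch using saveRDS (the file name rf_model.rds is an assumption) could be:
# Save the chosen model to disk; the Shiny app restores it at startup with readRDS('rf_model.rds')
saveRDS(rf_model, file = 'rf_model.rds')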