This notebook follows the steps in the paper "Immunosignature system for diagnosis of cancer" and re-executes some of its classification methods.
The classification code is used as written in Dataset S1 attached to the paper; only the train-test split is reproduced. After the table is imported, we pivot and transform it to produce two separate data sets: train and test.
library(data.table)
## Warning: package 'data.table' was built under R version 4.0.5
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.0.5
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.5
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Loading required package: lattice
library(stats)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(class)
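# Set the working directory to the folder holding the data (machine-specific path).
setwd("C:/Users/kttcf/OneDrive/Bioinfor/LuanVan")
# Read the normalized matrix; the Peptide column identifies each row.
data_nom <- fread("GSE52580_test1_normalized_matrix.txt.gz")
# Reshape to long format: one row per (peptide, sample) pair.
data_nom <- melt(data_nom, id = c("Peptide"))
colnames(data_nom) <- c("peptide", "cancer", "value")
data_nom <- data_nom[,c("cancer", "peptide", "value")]
head(data_nom)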
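To make the reshape step concrete, the sketch below applies the same kind of `melt` call to a tiny made-up matrix. The peptide names, sample labels, and values are hypothetical; only the column handling mirrors the code above.

```r
library(reshape2)

# Hypothetical toy matrix: one row per peptide, one column per sample.
wide <- data.frame(
  Peptide = c("pepA", "pepB"),
  `Brain cancer - train 1` = c(1.2, 0.8),
  `Brain cancer - test 1`  = c(1.1, 0.9),
  check.names = FALSE
)

# Wide -> long: one row per (peptide, sample) pair.
long <- melt(wide, id.vars = "Peptide")
colnames(long) <- c("peptide", "cancer", "value")
long <- long[, c("cancer", "peptide", "value")]
print(long)
```

Each sample column becomes a block of rows, so two peptides and two samples give four rows in the long table.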
Next, feature selection was run on the training set in Python; however, the result still does not match the one in the paper. The paper describes two steps:
1. Multiple-test-corrected ANOVA.
2. Pattern matching with ‘Expression Profile’ in GeneSpring 7.3.1 using Euclidean linkage: each disease is compared to the entire population, and peptides with high signal in the disease and low signal in non-disease samples are chosen.
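As a rough illustration of the first step (multiple-test-corrected ANOVA), the following sketch runs a one-way ANOVA per peptide and adjusts the p-values. It is written in R to match this notebook rather than the Python code mentioned above, it uses simulated data, and the Benjamini-Hochberg correction and the 0.05 cutoff are assumptions, not settings confirmed by the paper.

```r
set.seed(1)

# Hypothetical toy training data: 30 samples, 3 classes, 5 peptides.
classes <- factor(rep(c("A", "B", "C"), each = 10))
x <- matrix(rnorm(30 * 5), nrow = 30,
            dimnames = list(NULL, paste0("pep", 1:5)))
# Make pep1 informative by shifting class A upward.
x[classes == "A", "pep1"] <- x[classes == "A", "pep1"] + 3

# One-way ANOVA p-value for each peptide (column).
p_raw <- apply(x, 2, function(v) {
  summary(aov(v ~ classes))[[1]][["Pr(>F)"]][1]
})

# Multiple-testing correction; BH is an assumed choice.
p_adj <- p.adjust(p_raw, method = "BH")

# Keep peptides below an assumed significance cutoff.
selected <- names(p_adj)[p_adj < 0.05]
print(selected)
```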
The output should match the 120 peptides described in the paper, which were extracted here from the Supporting Information document. One statement in the paper remains unclear: “This was repeated for every disease until equal numbers of peptides were selected for every disease”.
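One possible reading of the quoted sentence is that the same number of top-ranked peptides is kept for each disease. The sketch below ranks peptides by the difference between the mean signal in one class and the mean signal in all other samples, then keeps a fixed number per class. This is a simplified stand-in for GeneSpring's pattern matching, not a reimplementation of it; the simulated data, the helper `pick_for_class`, and `n_per_class` are all hypothetical.

```r
set.seed(2)

# Hypothetical toy data: 30 samples, 3 classes, 12 peptides.
classes <- factor(rep(c("A", "B", "C"), each = 10))
x <- matrix(rnorm(30 * 12), nrow = 30,
            dimnames = list(NULL, paste0("pep", 1:12)))

n_per_class <- 2  # assumed number of peptides kept per class

# Rank peptides by (mean in class) - (mean in all other samples)
# and return the names of the top n_per_class.
pick_for_class <- function(cl) {
  in_class <- classes == cl
  delta <- colMeans(x[in_class, , drop = FALSE]) -
    colMeans(x[!in_class, , drop = FALSE])
  names(sort(delta, decreasing = TRUE))[seq_len(n_per_class)]
}

# One column of selected peptide names per class.
picked <- sapply(levels(classes), pick_for_class)
print(picked)
```

Under this reading, 120 peptides spread evenly over five diseases would be 24 per disease, although the same peptide could in principle be picked for more than one class.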
We filter the data to keep only the selected peptides and build the training data set from the result.
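# Selected peptides, one column per disease.
trial1 <- read.csv('trial1.csv')
colnames(trial1) <- c("Brain cancer", "Breast cancer","Esophageal cancer", "Multiple myeloma", "Pancreatic cancer")
# Stack all columns into a single long list of peptides.
trial1 <- melt(trial1, id.vars = NULL)
# Keep only rows for the selected peptides.
data_nomf <- data_nom[data_nom$peptide %in% trial1$value,]
# Training samples are those whose label contains 'train'.
trainf <- data_nomf %>% filter(grepl('train',cancer))
# Pivot to one row per sample and one column per peptide.
trainf <- dcast(trainf, cancer ~ peptide, fun.aggregate = mean, value.var = "value")
# Strip the sample suffix so only the class label remains.
trainf$cancer <- sub(" -.*", "", trainf$cancer)
trainf$cancer <- as.factor(trainf$cancer)
head(trainf)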
With the training data frame built, we apply the same steps to the test data set.
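# Test samples are those whose label contains 'test'.
testf <- data_nomf %>% filter(grepl('test',cancer))
# Pivot to one row per sample and one column per peptide.
testf <- dcast(testf, cancer ~ peptide, fun.aggregate = mean, value.var = "value")
# Strip the sample suffix so only the class label remains.
testf$cancer <- sub(" -.*", "", testf$cancer)
testf$cancer <- as.factor(testf$cancer)
head(testf)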
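Because the train and test frames are built with identical steps, the logic could be wrapped in one helper. The sketch below does this in base R with `tapply` instead of `dcast`; the function name `make_wide` and the toy data are hypothetical, and the sample-label format is assumed from the `grepl` and `sub` patterns used above.

```r
# Build a sample-by-peptide data frame from long-format data.
# long_df needs columns cancer, peptide, value; split_tag is "train" or "test".
make_wide <- function(long_df, split_tag) {
  sub_df <- long_df[grepl(split_tag, long_df$cancer), ]
  # as.character() avoids empty rows from unused factor levels.
  sample_id <- as.character(sub_df$cancer)
  peptide   <- as.character(sub_df$peptide)
  # Mean per (sample, peptide) cell, mirroring dcast(..., mean).
  m <- tapply(sub_df$value, list(sample_id, peptide), mean)
  out <- data.frame(cancer = sub(" -.*", "", rownames(m)), m,
                    check.names = FALSE, stringsAsFactors = TRUE)
  rownames(out) <- NULL
  out
}

# Hypothetical toy input with the assumed "<class> - <split> <n>" labels.
long_df <- data.frame(
  cancer = rep(c("Brain cancer - train 1", "Healthy control - train 1",
                 "Brain cancer - test 1"), each = 2),
  peptide = rep(c("pepA", "pepB"), times = 3),
  value = c(1.0, 2.0, 0.5, 0.7, 1.1, 1.9)
)

trainf_toy <- make_wide(long_df, "train")
testf_toy  <- make_wide(long_df, "test")
print(trainf_toy)
```

Using one function for both splits keeps the peptide columns and label cleaning consistent between the two frames.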
Now we apply several classification methods to the trial1 data set.
# True class labels of the test samples
act <- testf[,1]
# SVM (e1071::svm with default settings)
model<-svm(formula = cancer~.,data=trainf)
pred<-predict(model, testf)
confusionMatrix(as.factor(pred),as.factor(act))
## Confusion Matrix and Statistics
##
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                19             0                 0
##   Breast cancer                0            20                 0
##   Esophageal cancer            0             0                19
##   Healthy control              0             0                 0
##   Multiple myeloma             1             0                 1
##   Pancreatic cancer            0             0                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   0                0                 1
##   Esophageal cancer               0                0                 0
##   Healthy control                20                0                 2
##   Multiple myeloma                0               20                 0
##   Pancreatic cancer               0                0                17
##
## Overall Statistics
##
##               Accuracy : 0.9583
##                 95% CI : (0.9054, 0.9863)
##    No Information Rate : 0.1667
##    P-Value [Acc > NIR] : < 2.2e-16
##
##                  Kappa : 0.95
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       0.9500               1.0000
## Specificity                       1.0000               0.9900
## Pos Pred Value                    1.0000               0.9524
## Neg Pred Value                    0.9901               1.0000
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1583               0.1667
## Detection Prevalence              0.1583               0.1750
## Balanced Accuracy                 0.9750               0.9950
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            0.9500                 1.0000
## Specificity                            1.0000                 0.9800
## Pos Pred Value                         1.0000                 0.9091
## Neg Pred Value                         0.9901                 1.0000
## Prevalence                             0.1667                 0.1667
## Detection Rate                         0.1583                 0.1667
## Detection Prevalence                   0.1583                 0.1833
## Balanced Accuracy                      0.9750                 0.9900
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           1.0000                   0.8500
## Specificity                           0.9800                   1.0000
## Pos Pred Value                        0.9091                   1.0000
## Neg Pred Value                        1.0000                   0.9709
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1667                   0.1417
## Detection Prevalence                  0.1833                   0.1417
## Balanced Accuracy                     0.9900                   0.9250
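To show where the headline numbers come from, the sketch below re-enters the SVM confusion matrix printed above and recomputes accuracy, Cohen's kappa, and per-class sensitivity by hand in base R. The counts are copied from the output; the variable names are illustrative.

```r
classes <- c("Brain cancer", "Breast cancer", "Esophageal cancer",
             "Healthy control", "Multiple myeloma", "Pancreatic cancer")

# Rows = predictions, columns = reference, copied from the SVM output.
cm <- matrix(c(19,  0,  0,  0,  0,  0,
                0, 20,  0,  0,  0,  1,
                0,  0, 19,  0,  0,  0,
                0,  0,  0, 20,  0,  2,
                1,  0,  1,  0, 20,  0,
                0,  0,  0,  0,  0, 17),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Accuracy: correctly classified samples over all samples.
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the marginals predict by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions over reference totals.
sens <- diag(cm) / colSums(cm)
names(sens) <- classes

print(round(c(accuracy = accuracy, kappa = kappa), 4))
print(round(sens, 4))
```

Since every class has 20 test samples, the chance agreement equals the No Information Rate of 1/6, which is why kappa is simply the accuracy rescaled.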
# Naive Bayes
model<-naiveBayes(cancer~.,data=trainf)
pred<-predict(model, testf)
confusionMatrix(pred,testf[,1])
## Confusion Matrix and Statistics
##
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                20             0                 0
##   Breast cancer                0            18                 0
##   Esophageal cancer            0             0                20
##   Healthy control              0             1                 0
##   Multiple myeloma             0             0                 0
##   Pancreatic cancer            0             1                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   0                0                 1
##   Esophageal cancer               9                0                 0
##   Healthy control                11                0                 0
##   Multiple myeloma                0               20                 0
##   Pancreatic cancer               0                0                19
##
## Overall Statistics
##
##               Accuracy : 0.9
##                 95% CI : (0.8318, 0.9473)
##    No Information Rate : 0.1667
##    P-Value [Acc > NIR] : < 2.2e-16
##
##                  Kappa : 0.88
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       1.0000               0.9000
## Specificity                       1.0000               0.9900
## Pos Pred Value                    1.0000               0.9474
## Neg Pred Value                    1.0000               0.9802
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1667               0.1500
## Detection Prevalence              0.1667               0.1583
## Balanced Accuracy                 1.0000               0.9450
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            1.0000                0.55000
## Specificity                            0.9100                0.99000
## Pos Pred Value                         0.6897                0.91667
## Neg Pred Value                         1.0000                0.91667
## Prevalence                             0.1667                0.16667
## Detection Rate                         0.1667                0.09167
## Detection Prevalence                   0.2417                0.10000
## Balanced Accuracy                      0.9550                0.77000
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           1.0000                   0.9500
## Specificity                           1.0000                   0.9900
## Pos Pred Value                        1.0000                   0.9500
## Neg Pred Value                        1.0000                   0.9900
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1667                   0.1583
## Detection Prevalence                  0.1667                   0.1667
## Balanced Accuracy                     1.0000                   0.9700
# LDA (linear discriminant analysis)
model<-lda(cancer~.,trainf)
## Warning in lda.default(x, grouping, ...): variables are collinear
pred<-predict(model, testf)
confusionMatrix(pred$class, testf[,1])
## Confusion Matrix and Statistics
##
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                18             0                 0
##   Breast cancer                0            14                 0
##   Esophageal cancer            0             0                19
##   Healthy control              0             2                 0
##   Multiple myeloma             2             0                 1
##   Pancreatic cancer            0             4                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   5                1                 3
##   Esophageal cancer               0                0                 0
##   Healthy control                11                2                 4
##   Multiple myeloma                2               15                 1
##   Pancreatic cancer               2                2                12
##
## Overall Statistics
##
##               Accuracy : 0.7417
##                 95% CI : (0.6538, 0.8172)
##    No Information Rate : 0.1667
##    P-Value [Acc > NIR] : < 2.2e-16
##
##                  Kappa : 0.69
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       0.9000               0.7000
## Specificity                       1.0000               0.9100
## Pos Pred Value                    1.0000               0.6087
## Neg Pred Value                    0.9804               0.9381
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1500               0.1167
## Detection Prevalence              0.1500               0.1917
## Balanced Accuracy                 0.9500               0.8050
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            0.9500                0.55000
## Specificity                            1.0000                0.92000
## Pos Pred Value                         1.0000                0.57895
## Neg Pred Value                         0.9901                0.91089
## Prevalence                             0.1667                0.16667
## Detection Rate                         0.1583                0.09167
## Detection Prevalence                   0.1583                0.15833
## Balanced Accuracy                      0.9750                0.73500
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           0.7500                   0.6000
## Specificity                           0.9400                   0.9200
## Pos Pred Value                        0.7143                   0.6000
## Neg Pred Value                        0.9495                   0.9200
## Prevalence                            0.1250                   0.1667
## Detection Rate                        0.1250                   0.1000
## Detection Prevalence                  0.1750                   0.1667
## Balanced Accuracy                     0.8450                   0.7600
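The `variables are collinear` warning issued by `lda` above suggests that some peptide columns are (nearly) linear combinations of others, which can also happen when there are many predictors relative to the number of training samples. A rough way to look for this, sketched here on simulated data, is to compare the numerical rank of the predictor matrix with its column count. This is only a diagnostic sketch, not the exact check `lda` performs, and the toy columns are hypothetical.

```r
set.seed(3)

# Hypothetical toy predictors: 8 samples, 3 independent columns.
x <- matrix(rnorm(8 * 3), nrow = 8,
            dimnames = list(NULL, c("p1", "p2", "p3")))
# Add a column that is an exact sum of two others (perfectly collinear).
x <- cbind(x, p4 = x[, "p1"] + x[, "p2"])

# Numerical rank from a QR decomposition.
rank_x <- qr(x)$rank
cat("columns:", ncol(x), " rank:", rank_x, "\n")

# A rank below the column count signals linear dependence.
is_collinear <- rank_x < ncol(x)
print(is_collinear)
```

On the real data the same comparison could be run on the peptide columns of `trainf`; a rank deficit there would be consistent with the warning and with LDA's weaker accuracy relative to the other classifiers.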
# KNN (k = 5) on the 120 peptide columns
pred<-knn(trainf[,2:121],testf[,2:121],cl=trainf[,1],k=5)
confusionMatrix(pred, testf[,1])
## Confusion Matrix and Statistics
##
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                18             0                 0
##   Breast cancer                0            20                 0
##   Esophageal cancer            0             0                20
##   Healthy control              0             0                 0
##   Multiple myeloma             2             0                 0
##   Pancreatic cancer            0             0                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   2                0                 2
##   Esophageal cancer               0                0                 0
##   Healthy control                18                1                 3
##   Multiple myeloma                0               19                 0
##   Pancreatic cancer               0                0                15
##
## Overall Statistics
##
##               Accuracy : 0.9167
##                 95% CI : (0.8521, 0.9593)
##    No Information Rate : 0.1667
##    P-Value [Acc > NIR] : < 2.2e-16
##
##                  Kappa : 0.9
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       0.9000               1.0000
## Specificity                       1.0000               0.9600
## Pos Pred Value                    1.0000               0.8333
## Neg Pred Value                    0.9804               1.0000
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1500               0.1667
## Detection Prevalence              0.1500               0.2000
## Balanced Accuracy                 0.9500               0.9800
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            1.0000                 0.9000
## Specificity                            1.0000                 0.9600
## Pos Pred Value                         1.0000                 0.8182
## Neg Pred Value                         1.0000                 0.9796
## Prevalence                             0.1667                 0.1667
## Detection Rate                         0.1667                 0.1500
## Detection Prevalence                   0.1667                 0.1833
## Balanced Accuracy                      1.0000                 0.9300
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           0.9500                   0.7500
## Specificity                           0.9800                   1.0000
## Pos Pred Value                        0.9048                   1.0000
## Neg Pred Value                        0.9899                   0.9524
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1583                   0.1250
## Detection Prevalence                  0.1750                   0.1250
## Balanced Accuracy                     0.9650                   0.8750
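For a side-by-side view of the four classifiers, the sketch below sums the diagonal counts from each confusion matrix above and converts them to accuracies. The counts are copied from the printed output; the variable names are illustrative.

```r
# Correctly classified test samples per method (diagonal of each matrix),
# in class order: brain, breast, esophageal, healthy, myeloma, pancreatic.
correct <- c(SVM        = 19 + 20 + 19 + 20 + 20 + 17,
             NaiveBayes = 20 + 18 + 20 + 11 + 20 + 19,
             LDA        = 18 + 14 + 19 + 11 + 15 + 12,
             KNN        = 18 + 20 + 20 + 18 + 19 + 15)

n_test <- 120  # 6 classes x 20 test samples each

accuracy <- round(correct / n_test, 4)
print(sort(accuracy, decreasing = TRUE))
```

The recomputed values agree with the accuracies reported by `confusionMatrix` for each method, with SVM highest and LDA lowest on this test split.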