ImmunoSignature Data Classification

This notebook follows the steps in the paper “Immunosignature system for diagnosis of cancer” and re-runs some of its classification methods.

The classification code is taken as written from Dataset S1 attached to the paper; only the train-test split is reproduced here. After the table is imported, we pivot and transform it to produce two separate data sets: train and test.

library(data.table)

## Warning: package 'data.table' was built under R version 4.0.5

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.0.5

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.0.5

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

library(e1071)

## Warning: package 'e1071' was built under R version 4.0.5

library(caret)

## Warning: package 'caret' was built under R version 4.0.5

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.0.5

## Loading required package: lattice

library(stats)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(class)

setwd("C:/Users/kttcf/OneDrive/Bioinfor/LuanVan")
data_nom <- fread("GSE52580_test1_normalized_matrix.txt.gz")
data_nom <- melt(data_nom, id = c("Peptide"))
colnames(data_nom) <- c("peptide", "cancer", "value")
data_nom <- data_nom[,c("cancer", "peptide", "value")]
head(data_nom)

Next, we run feature selection on the training data set in Python. The result still does not match the one in the paper, which describes two steps:

1. Multiple-test-corrected ANOVA.
2. Pattern matching with ‘Expression Profile’ in GeneSpring 7.3.1, Euclidean linkage: compare each disease to the entire population and choose peptides with high signal in the disease samples and low signal in the non-disease samples.
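
Step 1 can be sketched in base R as a one-way ANOVA per peptide followed by a p-value adjustment. This is only an illustration on simulated data: the class names, the peptide names, the 0.05 cutoff, and the choice of the Benjamini-Hochberg method are all assumptions, since the correction the paper used is not stated here.

```r
# Sketch: per-peptide one-way ANOVA with multiple-testing correction.
# All data are simulated; nothing here comes from GSE52580.
set.seed(1)

n_per_class <- 10
classes <- factor(rep(c("A", "B", "C"), each = n_per_class))

# 50 hypothetical peptides: the first 5 carry a class effect, the rest are noise.
n_peptides <- 50
x <- matrix(rnorm(length(classes) * n_peptides), ncol = n_peptides)
x[classes == "A", 1:5] <- x[classes == "A", 1:5] + 3
colnames(x) <- paste0("pep", seq_len(n_peptides))

# One-way ANOVA p-value for each peptide (column).
p_raw <- apply(x, 2, function(v) {
  summary(aov(v ~ classes))[[1]][["Pr(>F)"]][1]
})

# Adjust for multiple testing; "BH" (assumed here) controls the false discovery rate.
p_adj <- p.adjust(p_raw, method = "BH")

# Keep peptides that stay significant after adjustment (0.05 is an assumed cutoff).
selected <- names(p_adj)[p_adj < 0.05]
print(selected)
```

Swapping `method = "BH"` for `"bonferroni"` or `"holm"` gives a stricter family-wise correction and may be one place where a reimplementation diverges from the paper.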

The output should match the 120 peptides described in the paper, which were extracted from the Supporting Information document. One step remains unclear: “This was repeated for every disease until equal numbers of peptides were selected for every disease”.
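
One possible reading of selecting equal numbers of peptides per disease is a one-vs-rest contrast that keeps the top `k` peptides for each class. The sketch below is a simple stand-in on simulated data, not GeneSpring's ‘Expression Profile’ matching; the score (difference of means), `k`, and all names are assumptions.

```r
# Sketch: pick the same number of peptides per disease by a one-vs-rest contrast.
# Simulated data; `k` and the scoring rule are assumptions, not the paper's method.
set.seed(2)

diseases <- c("D1", "D2", "D3")
labels <- factor(rep(diseases, each = 8))
n_peptides <- 30
x <- matrix(rnorm(length(labels) * n_peptides), ncol = n_peptides)
colnames(x) <- paste0("pep", seq_len(n_peptides))

# Plant a signal: peptides 1-3 are high in D1, 4-6 in D2, 7-9 in D3.
for (i in seq_along(diseases)) {
  cols <- (3 * i - 2):(3 * i)
  x[labels == diseases[i], cols] <- x[labels == diseases[i], cols] + 4
}

k <- 3  # peptides to keep per disease

select_for <- function(d) {
  in_d <- labels == d
  # Score: mean signal in the disease minus mean signal in all other samples.
  score <- colMeans(x[in_d, , drop = FALSE]) - colMeans(x[!in_d, , drop = FALSE])
  names(sort(score, decreasing = TRUE))[seq_len(k)]
}

# One column of selected peptide names per disease.
selected <- sapply(diseases, select_for)
print(selected)
```

The result has one column per disease, which is the same layout as the `trial1` table read in below.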

We filter the raw data to keep only the selected peptides and use the result to build the training data set.

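# Selected peptides, one column per disease
trial1 <- read.csv('trial1.csv')
colnames(trial1) <- c("Brain cancer", "Breast cancer","Esophageal cancer", "Multiple myeloma", "Pancreatic cancer")
# Stack the five columns into a single long list of peptides
trial1 <- melt(trial1, id.vars = NULL)
# Keep only the selected peptides, then only the training samples
data_nomf <- data_nom[data_nom$peptide %in% trial1$value,]
trainf <- data_nomf %>% filter(grepl('train',cancer))
# Pivot to one row per sample label and one column per peptide
trainf <- dcast(cancer ~ peptide, data = trainf, value.var = "value", mean)
# Drop everything from " -" onward so only the class name remains
trainf$cancer <- sub(" -.*", "", trainf$cancer)
trainf$cancer <- as.factor(trainf$cancer)
head(trainf)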
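
The pivot performed by `dcast(cancer ~ peptide, ..., mean)` can be reproduced on a toy table with base R, which makes it easier to see what the aggregation does. The sample and peptide names below are hypothetical.

```r
# Toy long table: sample "s1" has two measurements for "pepA".
long <- data.frame(
  cancer  = c("s1", "s1", "s1", "s2", "s2"),
  peptide = c("pepA", "pepA", "pepB", "pepA", "pepB"),
  value   = c(1, 3, 5, 2, 4)
)

# tapply() groups `value` by (cancer, peptide) and averages each cell,
# giving one row per sample and one column per peptide.
wide <- with(long, tapply(value, list(cancer, peptide), mean))
print(wide)
```

Duplicate (sample, peptide) pairs are collapsed to their mean, which is the role the `mean` argument plays in the `dcast` call.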

Once the training data frame is built, we apply the same steps to the test data set.

testf <- data_nomf %>% filter(grepl('test',cancer))
testf <- dcast(cancer ~ peptide, data = testf, value.var = "value", mean)
testf$cancer <- sub(" -.*", "", testf$cancer)
testf$cancer <- as.factor(testf$cancer)
head(testf)
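
Both blocks rely on two string operations: `grepl()` to assign a sample to the train or test split, and `sub(" -.*", "", ...)` to reduce the label to its class name. A small sketch with hypothetical labels (the real column names in the GEO matrix may look different):

```r
# Hypothetical sample labels; the real column names in the matrix may differ.
samples <- c("Brain cancer - train 1", "Brain cancer - test 1",
             "Healthy control - train 2", "Healthy control - test 2")

# grepl() flags which samples belong to each split.
is_train <- grepl("train", samples)
is_test  <- grepl("test", samples)

# sub(" -.*", "", x) deletes everything from the first " -" onward,
# leaving only the class name.
class_label <- sub(" -.*", "", samples)

print(data.frame(sample = samples, is_train, is_test, class_label))
```

Because the split is decided by substring matching, a label containing both "train" and "test" would land in both sets, so it is worth checking that `is_train` and `is_test` never overlap.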

Now we apply several classification methods to the trial1 data set.

Support Vector Machine

# SVM: fit on the training set, predict the test set, compare with the true labels
act <- testf[, 1]
model <- svm(formula = cancer ~ ., data = trainf)
pred <- predict(model, testf)
confusionMatrix(as.factor(pred), as.factor(act))

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                19             0                 0
##   Breast cancer                0            20                 0
##   Esophageal cancer            0             0                19
##   Healthy control              0             0                 0
##   Multiple myeloma             1             0                 1
##   Pancreatic cancer            0             0                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   0                0                 1
##   Esophageal cancer               0                0                 0
##   Healthy control                20                0                 2
##   Multiple myeloma                0               20                 0
##   Pancreatic cancer               0                0                17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9583          
##                  95% CI : (0.9054, 0.9863)
##     No Information Rate : 0.1667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       0.9500               1.0000
## Specificity                       1.0000               0.9900
## Pos Pred Value                    1.0000               0.9524
## Neg Pred Value                    0.9901               1.0000
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1583               0.1667
## Detection Prevalence              0.1583               0.1750
## Balanced Accuracy                 0.9750               0.9950
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            0.9500                 1.0000
## Specificity                            1.0000                 0.9800
## Pos Pred Value                         1.0000                 0.9091
## Neg Pred Value                         0.9901                 1.0000
## Prevalence                             0.1667                 0.1667
## Detection Rate                         0.1583                 0.1667
## Detection Prevalence                   0.1583                 0.1833
## Balanced Accuracy                      0.9750                 0.9900
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           1.0000                   0.8500
## Specificity                           0.9800                   1.0000
## Pos Pred Value                        0.9091                   1.0000
## Neg Pred Value                        1.0000                   0.9709
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1667                   0.1417
## Detection Prevalence                  0.1833                   0.1417
## Balanced Accuracy                     0.9900                   0.9250
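
The headline numbers in this report can be recomputed by hand from the confusion matrix, which helps when comparing against figures quoted in the paper. The sketch below copies the SVM counts from the output above and uses base R only; the variable names are illustrative.

```r
classes <- c("Brain cancer", "Breast cancer", "Esophageal cancer",
             "Healthy control", "Multiple myeloma", "Pancreatic cancer")

# Rows are predictions, columns are reference labels (copied from the SVM output).
cm <- matrix(c(19,  0,  0,  0,  0,  0,
                0, 20,  0,  0,  0,  1,
                0,  0, 19,  0,  0,  0,
                0,  0,  0, 20,  0,  2,
                1,  0,  1,  0, 20,  0,
                0,  0,  0,  0,  0, 17),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity and specificity from one-vs-rest counts.
tp <- diag(cm)
fn <- colSums(cm) - tp
fp <- rowSums(cm) - tp
tn <- n - tp - fn - fp
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(round(c(accuracy = accuracy, kappa = cohen_kappa), 4))
print(round(cbind(sensitivity, specificity), 4))
```

With six balanced classes of 20 test samples each, the chance agreement is 1/6, which is also the No Information Rate that `confusionMatrix` reports.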

Naive Bayes

#NB
model<-naiveBayes(cancer~.,data=trainf)
pred<-predict(model, testf)
confusionMatrix(pred,testf[,1])

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                20             0                 0
##   Breast cancer                0            18                 0
##   Esophageal cancer            0             0                20
##   Healthy control              0             1                 0
##   Multiple myeloma             0             0                 0
##   Pancreatic cancer            0             1                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   0                0                 1
##   Esophageal cancer               9                0                 0
##   Healthy control                11                0                 0
##   Multiple myeloma                0               20                 0
##   Pancreatic cancer               0                0                19
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8318, 0.9473)
##     No Information Rate : 0.1667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.88            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       1.0000               0.9000
## Specificity                       1.0000               0.9900
## Pos Pred Value                    1.0000               0.9474
## Neg Pred Value                    1.0000               0.9802
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1667               0.1500
## Detection Prevalence              0.1667               0.1583
## Balanced Accuracy                 1.0000               0.9450
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            1.0000                0.55000
## Specificity                            0.9100                0.99000
## Pos Pred Value                         0.6897                0.91667
## Neg Pred Value                         1.0000                0.91667
## Prevalence                             0.1667                0.16667
## Detection Rate                         0.1667                0.09167
## Detection Prevalence                   0.2417                0.10000
## Balanced Accuracy                      0.9550                0.77000
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           1.0000                   0.9500
## Specificity                           1.0000                   0.9900
## Pos Pred Value                        1.0000                   0.9500
## Neg Pred Value                        1.0000                   0.9900
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1667                   0.1583
## Detection Prevalence                  0.1667                   0.1667
## Balanced Accuracy                     1.0000                   0.9700

Linear Discriminant Analysis

#LDA
model<-lda(cancer~.,trainf)

## Warning in lda.default(x, grouping, ...): variables are collinear

pred<-predict(model, testf)
confusionMatrix(pred$class, testf[,1])

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                18             0                 0
##   Breast cancer                0            14                 0
##   Esophageal cancer            0             0                19
##   Healthy control              0             2                 0
##   Multiple myeloma             2             0                 1
##   Pancreatic cancer            0             4                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   5                1                 3
##   Esophageal cancer               0                0                 0
##   Healthy control                11                2                 4
##   Multiple myeloma                2               15                 1
##   Pancreatic cancer               2                2                12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7417          
##                  95% CI : (0.6538, 0.8172)
##     No Information Rate : 0.1667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.69            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       0.9000               0.7000
## Specificity                       1.0000               0.9100
## Pos Pred Value                    1.0000               0.6087
## Neg Pred Value                    0.9804               0.9381
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1500               0.1167
## Detection Prevalence              0.1500               0.1917
## Balanced Accuracy                 0.9500               0.8050
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            0.9500                0.55000
## Specificity                            1.0000                0.92000
## Pos Pred Value                         1.0000                0.57895
## Neg Pred Value                         0.9901                0.91089
## Prevalence                             0.1667                0.16667
## Detection Rate                         0.1583                0.09167
## Detection Prevalence                   0.1583                0.15833
## Balanced Accuracy                      0.9750                0.73500
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           0.7500                   0.6000
## Specificity                           0.9400                   0.9200
## Pos Pred Value                        0.7143                   0.6000
## Neg Pred Value                        0.9495                   0.9200
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1250                   0.1000
## Detection Prevalence                  0.1750                   0.1667
## Balanced Accuracy                     0.8450                   0.7600
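
The "variables are collinear" warning from `lda` indicates that the predictors are linearly dependent, which can happen when some peptides are near-duplicates of others or when there are more predictors than independent samples. A related check, shown here on a hypothetical matrix rather than on `trainf`, compares the numerical rank of the predictor matrix with its number of columns. This is a sketch of the idea, not the exact test `lda` runs internally.

```r
set.seed(3)

# Hypothetical design: 20 samples, 4 predictors,
# where the last column is an exact linear combination of the first two.
x <- matrix(rnorm(20 * 3), ncol = 3)
x <- cbind(x, x[, 1] + 2 * x[, 2])
colnames(x) <- paste0("pep", 1:4)

# The QR decomposition reports the numerical rank of the matrix.
rank_x <- qr(x)$rank
print(c(rank = rank_x, predictors = ncol(x)))

if (rank_x < ncol(x)) {
  message("predictors are linearly dependent")
}
```

A rank below the number of predictors may partly explain why LDA scores lower than the other classifiers here, although that is a hypothesis and not something this notebook tests.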

K-nearest neighbour

#KNN
# Column 1 is the class label; columns 2:121 hold the 120 selected peptides
pred<-knn(trainf[,2:121],testf[,2:121],cl=trainf[,1],k=5)
confusionMatrix(pred, testf[,1])

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Brain cancer Breast cancer Esophageal cancer
##   Brain cancer                18             0                 0
##   Breast cancer                0            20                 0
##   Esophageal cancer            0             0                20
##   Healthy control              0             0                 0
##   Multiple myeloma             2             0                 0
##   Pancreatic cancer            0             0                 0
##                    Reference
## Prediction          Healthy control Multiple myeloma Pancreatic cancer
##   Brain cancer                    0                0                 0
##   Breast cancer                   2                0                 2
##   Esophageal cancer               0                0                 0
##   Healthy control                18                1                 3
##   Multiple myeloma                0               19                 0
##   Pancreatic cancer               0                0                15
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.8521, 0.9593)
##     No Information Rate : 0.1667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Brain cancer Class: Breast cancer
## Sensitivity                       0.9000               1.0000
## Specificity                       1.0000               0.9600
## Pos Pred Value                    1.0000               0.8333
## Neg Pred Value                    0.9804               1.0000
## Prevalence                        0.1667               0.1667
## Detection Rate                    0.1500               0.1667
## Detection Prevalence              0.1500               0.2000
## Balanced Accuracy                 0.9500               0.9800
##                      Class: Esophageal cancer Class: Healthy control
## Sensitivity                            1.0000                 0.9000
## Specificity                            1.0000                 0.9600
## Pos Pred Value                         1.0000                 0.8182
## Neg Pred Value                         1.0000                 0.9796
## Prevalence                             0.1667                 0.1667
## Detection Rate                         0.1667                 0.1500
## Detection Prevalence                   0.1667                 0.1833
## Balanced Accuracy                      1.0000                 0.9300
##                      Class: Multiple myeloma Class: Pancreatic cancer
## Sensitivity                           0.9500                   0.7500
## Specificity                           0.9800                   1.0000
## Pos Pred Value                        0.9048                   1.0000
## Neg Pred Value                        0.9899                   0.9524
## Prevalence                            0.1667                   0.1667
## Detection Rate                        0.1583                   0.1250
## Detection Prevalence                  0.1750                   0.1250
## Balanced Accuracy                     0.9650                   0.8750
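
To make explicit what `knn(..., k=5)` does, the following base R sketch classifies each test sample by a majority vote among its five nearest training samples in Euclidean distance. The function `knn_predict` and the two-cluster toy data are hypothetical; unlike `class::knn`, which breaks ties at random, this version resolves ties by taking the first class in alphabetical order.

```r
# Minimal k-nearest-neighbour classifier (illustrative, not class::knn itself).
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  apply(test_x, 1, function(row) {
    # Euclidean distance from this test sample to every training sample.
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- train_y[order(d)[seq_len(k)]]
    # Majority vote among the k nearest neighbours.
    names(which.max(table(nearest)))
  })
}

set.seed(4)

# Two well-separated toy clusters with 20 samples each.
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 5), ncol = 2))
train_y <- rep(c("low", "high"), each = 20)

# One test point at the centre of each cluster.
test_x <- rbind(c(0, 0), c(5, 5))

pred <- knn_predict(train_x, train_y, test_x, k = 5)
print(pred)
```

Since the vote depends on raw Euclidean distance, peptides with larger numeric ranges weigh more, so scaling the columns first is a common precaution.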