Classification Project Report

Ahmed Tadde

June 30, 2015

Synopsis

This report builds a binary classifier that labels prostate tissue samples as cancerous ("Positive") or normal ("Negative"). After normalizing the gene expression matrix and holding out a quarter of the patients as a test set, a support vector machine with a radial basis function kernel is tuned by repeated 10-fold cross-validation and then evaluated on the held-out patients.

The Data

The data consist of 22,277 gene expression measurements for each of 89 patients, together with a separate file recording each patient's diagnosis, where the label "Normal prostate" marks cancer-free patients.

Preprocessing

Libraries

# For model building
library(caret)  # caret loads kernlab on demand for the svmRadial method

Loading the data

# The gene expression data:
DF <- read.table("Gene-Expression-Data.txt", sep = " ", colClasses = "numeric", 
    header = TRUE, na.strings = "NA", stringsAsFactors = FALSE)
# Checking the dimension of the data frame

dim(DF)
## [1]    89 22277
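# Since na.strings = "NA" is specified above, confirm that no missing
# values made it into the data (an added sanity check, not part of the
# original analysis)
sum(is.na(DF))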
# The patients' classification data:
class <- readLines("Patients-Classification-Data.txt", n = -1L)
# Labeling the two classes: 'Positive' <=> 'Has Cancer'; 'Negative' <=> 'No Cancer'
class <- factor(ifelse(class != "Normal prostate", "Positive", "Negative"))

# Checking the number of patients in each class
Positives <- sum(class == "Positive")
Negatives <- sum(class == "Negative")

Positives
## [1] 69
Negatives
## [1] 20
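The classes are noticeably imbalanced (69 cancer patients against 20 controls), which matters when judging raw accuracy later on. The proportions can be checked directly; this is an added check, not part of the original analysis:

# Class proportions; the majority-class share is the baseline any
# classifier must beat (compare the 'No Information Rate' reported on
# the test set further below)
prop.table(table(class))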

Preprocessing

Scaling/Normalizing the gene expression data

means <- colMeans(DF)
SD <- apply(DF, 2, sd)
# Note: these statistics are computed on all 89 samples, i.e. before the
# train/test split below, so the test patients influence the scaling
DF <- scale(DF, center = means, scale = SD)

Creating training and test datasets (for both gene expression and class data)

# The training set takes 3/4 of the data; createDataPartition samples
# within each class, preserving the Positive/Negative balance
split <- createDataPartition(y = class, p = 0.75, list = FALSE)
trainData <- DF[split, ]
trainClass <- class[split]

# The remaining 1/4 forms the test set
testData <- DF[-split, ]
testClass <- class[-split]
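As noted above, the scaling statistics were computed on the full matrix. A leakage-free variant would fit the centering and scaling on the training rows only and reuse those statistics for the test rows; a minimal sketch, assuming DF still holds the unscaled expression matrix:

# Fit the scaling on the training rows only, then apply it to the test rows
trainMeans <- colMeans(DF[split, ])
trainSD <- apply(DF[split, ], 2, sd)
trainData <- scale(DF[split, ], center = trainMeans, scale = trainSD)
testData <- scale(DF[-split, ], center = trainMeans, scale = trainSD)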

Support Vector Machine (SVM) model

Control parameters set up

control_svm <- trainControl(
    method = "repeatedcv",              # repeated k-fold cross-validation
    number = 10,                        # 10 folds
    repeats = 10,                       # repeated 10 times
    allowParallel = TRUE,
    savePredictions = TRUE,
    classProbs = TRUE,                  # class probabilities, required for ROC
    summaryFunction = twoClassSummary,  # reports ROC, sensitivity, specificity
    selectionFunction = "best",
    returnResamp = "all"
)

Training the model

model_svm <- train(trainData, trainClass, method = "svmRadial", tuneLength = 30,
    metric = "ROC", trControl = control_svm)
## Loading required package: kernlab

SVM model

Results

## Support Vector Machines with Radial Basis Function Kernel 
## 
##    67 samples
## 22277 predictors
##     2 classes: 'Negative', 'Positive' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## 
## Summary of sample sizes: 60, 60, 60, 60, 60, 61, ... 
## 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens   Spec       ROC SD     Sens SD    Spec SD  
##   0.25  0.8708333  0.610  0.8726667  0.1680811  0.4298226  0.1532572
##   0.50  0.8708333  0.620  0.8806667  0.1680811  0.4329835  0.1570763
##   1.00  0.8758333  0.615  0.8863333  0.1637176  0.4314352  0.1480964
## 
## Tuning parameter 'sigma' was held constant at a value of 3.128011e-05
## ROC was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 3.128011e-05 and C = 1.
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  3.12801149264817e-05 
## 
## Number of Support Vectors : 56 
## 
## Objective Function Value : -19.1305 
## Training error : 0.074627 
## Probability model included.
## [1] "Applying SVM Model To The Testing Data:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative        5        2
##   Positive        0       15
##                                           
##                Accuracy : 0.9091          
##                  95% CI : (0.7084, 0.9888)
##     No Information Rate : 0.7727          
##     P-Value [Acc > NIR] : 0.09444         
##                                           
##                   Kappa : 0.7732          
##  Mcnemar's Test P-Value : 0.47950         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8824          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2273          
##          Detection Rate : 0.2273          
##    Detection Prevalence : 0.3182          
##       Balanced Accuracy : 0.9412          
##                                           
##        'Positive' Class : Negative        
## 
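The code that produced this test-set evaluation is not shown in the report; a plausible minimal version, using caret's predict method and confusionMatrix, would be:

# Predict class labels for the held-out patients and compare to the truth
print("Applying SVM Model To The Testing Data:")
predictions <- predict(model_svm, newdata = testData)
confusionMatrix(predictions, testClass)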

SVM model

Discussion

1- Accuracy : 0.9091

2- Sensitivity : 1.0000

3- Specificity : 0.8824

Note that caret has treated "Negative" as the positive class here (see the last line of the confusion matrix). The sensitivity of 1.0000 therefore means that every cancer-free patient is correctly identified, so no healthy patient is wrongly told they have cancer. The specificity of 0.8824, on the other hand, means that two patients who do have cancer are misclassified as cancer-free. Given the nature of the classification problem, these missed diagnoses are the costlier kind of error.
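These headline numbers can be recomputed by hand from the confusion matrix, keeping in mind that "Negative" plays the role of the positive class:

# Counts from the confusion matrix above ("Negative" is the positive class)
TP <- 5   # cancer-free patients correctly labeled Negative
FN <- 0   # cancer-free patients wrongly labeled Positive
TN <- 15  # cancer patients correctly labeled Positive
FP <- 2   # cancer patients wrongly labeled Negative (missed diagnoses)
(TP + TN)/(TP + TN + FP + FN)  # Accuracy    = 20/22 = 0.9091
TP/(TP + FN)                   # Sensitivity = 5/5   = 1.0000
TN/(TN + FP)                   # Specificity = 15/17 = 0.8824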

Two concerns limit confidence in the current model:

1- Overfitting on the training data

2- High dimensionality (quite extreme here: n = 89 samples against p = 22,277 predictors)

Acquiring more patients' data would be very beneficial (n = 1000+). In addition, devising an efficient method of feature selection should substantially improve the accuracy, specificity, and sensitivity of the model; feature selection from gene expression data is an active research topic.
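As one illustrative first step, a univariate filter could rank genes by a two-sample t-test against the class labels and keep only the most discriminative columns before training. This is only a sketch built on the objects defined earlier (trainData, trainClass, testData); the cutoff k = 500 is an arbitrary assumption:

# Hypothetical univariate filter: rank genes by t-test p-value and keep
# the k most discriminative columns before model training
k <- 500  # arbitrary cutoff, for illustration only
pvals <- apply(trainData, 2, function(gene) t.test(gene ~ trainClass)$p.value)
keep <- order(pvals)[1:k]
filteredTrain <- trainData[, keep]
filteredTest <- testData[, keep]  # apply the same gene selection to the test set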