Ahmed Tadde
June 30, 2015
# For model building
library(caret)
# The gene expression data:
DF <- read.table("Gene-Expression-Data.txt", sep = " ", colClasses = "numeric",
    header = TRUE, na.strings = "NA", stringsAsFactors = FALSE)
# Checking the dimension of the data frame
dim(DF)
## [1] 89 22277
# The patients' classification data:
class <- readLines("Patients-Classification-Data.txt", n = -1L)
# Labeling the two classes: 'Positive' <=> 'Has Cancer'; 'Negative' <=> 'No
# Cancer'
class <- factor(ifelse(class != "Normal prostate", "Positive", "Negative"))
# Checking the number of patients in both classes
Positives <- sum(class == "Positive")
Negatives <- sum(class != "Positive")
Positives
## [1] 69
Negatives
## [1] 20
# Standardizing each gene (column) to zero mean and unit variance
means <- colMeans(DF)
SD <- apply(DF, 2, sd)
DF <- scale(DF, center = means, scale = SD)
# Training data takes 3/4 of the entire data...
split <- createDataPartition(y = class, p = 0.75, list = FALSE)
trainData <- DF[split, ]
trainClass <- class[split]
# Test data
testData <- DF[-split, ]
testClass <- class[-split]
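The scaling step above uses means and standard deviations computed from all 89 patients, so the test rows contribute to the preprocessing. A minimal sketch of one alternative is shown below: the scaling parameters are estimated on the training rows only and then reused for the test rows. It runs on small synthetic data; `X` and `train_idx` are illustrative names, not objects from this analysis, and the plain random split is a simplified stand-in for `createDataPartition`.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Synthetic stand-in for the expression matrix: 20 samples x 5 genes
X <- matrix(rnorm(100, mean = 5, sd = 2), nrow = 20, ncol = 5)

# Simple random 75/25 split (hypothetical; not stratified by class)
train_idx <- sample(nrow(X), size = 15)

# Estimate centering and scaling parameters on the training rows only
train_means <- colMeans(X[train_idx, ])
train_sds   <- apply(X[train_idx, ], 2, sd)

# Apply the same training-derived parameters to both subsets
train_scaled <- scale(X[train_idx, ],  center = train_means, scale = train_sds)
test_scaled  <- scale(X[-train_idx, ], center = train_means, scale = train_sds)

# Training columns now have mean 0 and standard deviation 1;
# test columns are only approximately standardized
round(colMeans(train_scaled), 10)
apply(train_scaled, 2, sd)
```

caret's `train()` also accepts a `preProcess = c("center", "scale")` argument, which estimates these parameters inside each resample instead of on the full data set.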
# Resampling setup: 10-fold cross-validation repeated 10 times, scored by ROC
control_svm <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  allowParallel = TRUE,
  savePredictions = TRUE,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  selectionFunction = "best",
  returnResamp = "all"
)
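# The results below cover only three values of C (0.25, 0.5, 1); a run with
# tuneLength = 30 evaluates a longer grid
model_svm <- train(trainData, trainClass, method = "svmRadial", tuneLength = 30,
    metric = "ROC", trControl = control_svm)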
## Loading required package: kernlab
## Support Vector Machines with Radial Basis Function Kernel
##
## 67 samples
## 22277 predictors
## 2 classes: 'Negative', 'Positive'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
##
## Summary of sample sizes: 60, 60, 60, 60, 60, 61, ...
##
## Resampling results across tuning parameters:
##
##   C     ROC        Sens   Spec       ROC SD     Sens SD    Spec SD
##   0.25  0.8708333  0.610  0.8726667  0.1680811  0.4298226  0.1532572
##   0.50  0.8708333  0.620  0.8806667  0.1680811  0.4329835  0.1570763
##   1.00  0.8758333  0.615  0.8863333  0.1637176  0.4314352  0.1480964
##
## Tuning parameter 'sigma' was held constant at a value of 3.128011e-05
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 3.128011e-05 and C = 1.
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 3.12801149264817e-05
##
## Number of Support Vectors : 56
##
## Objective Function Value : -19.1305
## Training error : 0.074627
## Probability model included.
## [1] "Applying SVM Model To The Testing Data:"
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Negative Positive
##   Negative        5        2
##   Positive        0       15
##
## Accuracy : 0.9091
## 95% CI : (0.7084, 0.9888)
## No Information Rate : 0.7727
## P-Value [Acc > NIR] : 0.09444
##
## Kappa : 0.7732
## Mcnemar's Test P-Value : 0.47950
##
## Sensitivity : 1.0000
## Specificity : 0.8824
## Pos Pred Value : 0.7143
## Neg Pred Value : 1.0000
## Prevalence : 0.2273
## Detection Rate : 0.2273
## Detection Prevalence : 0.3182
## Balanced Accuracy : 0.9412
##
## 'Positive' Class : Negative
##
1- Accuracy : 0.9091
2- Sensitivity : 1.0000
3- Specificity : 0.8824
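Note that the output treats 'Negative' as the 'Positive' class, so sensitivity here measures how well the model identifies patients without cancer. On the test set, all 5 cancer-free patients are correctly classified (sensitivity = 1.0000), while 15 of the 17 patients who have cancer are detected and 2 are wrongly classified as cancer-free (specificity = 0.8824). In a diagnostic setting a missed cancer is generally the more costly error, so these 2 false negatives are the main weakness to weigh against the overall accuracy of 0.9091.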
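The three summary figures can be recomputed directly from the test-set confusion matrix. The sketch below copies the four counts from the output above and computes sensitivity and specificity under both conventions: with 'Negative' as the positive class (as reported) and with 'Positive', i.e. has cancer, as the positive class. The helper `sens_spec` is a hypothetical name introduced only for this illustration.

```r
# Test-set confusion matrix copied from the output above
# (rows = predictions, columns = reference labels)
cm <- matrix(c(5, 0, 2, 15), nrow = 2,
             dimnames = list(Prediction = c("Negative", "Positive"),
                             Reference  = c("Negative", "Positive")))

# Illustrative helper: sensitivity and specificity for a chosen positive class
sens_spec <- function(cm, positive) {
  negative <- setdiff(colnames(cm), positive)
  c(sensitivity = cm[positive, positive] / sum(cm[, positive]),
    specificity = cm[negative, negative] / sum(cm[, negative]))
}

# Overall accuracy: correctly classified patients over all test patients
accuracy <- sum(diag(cm)) / sum(cm)

# Convention used in the reported output: 'Negative' is the positive class
as_reported <- sens_spec(cm, positive = "Negative")

# Clinical convention: 'Positive' (has cancer) is the positive class
clinical <- sens_spec(cm, positive = "Positive")

round(accuracy, 4)     # 0.9091
round(as_reported, 4)  # sensitivity 1.0000, specificity 0.8824
round(clinical, 4)     # sensitivity 0.8824, specificity 1.0000
```

With caret, the same switch can be requested through the `positive` argument of `confusionMatrix()`, for example `positive = "Positive"`.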
1- Overfitting on the training data
2- High dimensionality (extreme here, with n = 89 samples and p = 22277 predictors)
Acquiring data from more patients (n = 1000+) would be very beneficial. In addition, devising an efficient feature selection method would likely improve the accuracy, specificity, and sensitivity of the model. Feature selection from gene expression data is a heavily researched topic.
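As one illustration of what feature selection could look like, the sketch below applies a simple univariate filter: each gene is ranked by the absolute value of its Welch t-statistic between the two classes, and only the top `k` genes are kept. This is an assumed example on synthetic data, not the method used in this report; `welch_t`, `k`, and the simulated matrix are hypothetical.

```r
set.seed(42)  # arbitrary seed for the synthetic example

# Synthetic training data: 40 samples x 200 "genes", two balanced classes
n <- 40
p <- 200
y <- factor(rep(c("Negative", "Positive"), each = n / 2))
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Make the first 5 genes informative by shifting them in the Positive class
X[y == "Positive", 1:5] <- X[y == "Positive", 1:5] + 3

# Welch t-statistic for every column (hypothetical helper)
welch_t <- function(X, y) {
  a <- X[y == levels(y)[1], , drop = FALSE]
  b <- X[y == levels(y)[2], , drop = FALSE]
  (colMeans(a) - colMeans(b)) /
    sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
}

# Keep the k genes with the largest absolute t-statistic
k <- 10
t_stats  <- welch_t(X, y)
selected <- names(sort(abs(t_stats), decreasing = TRUE))[seq_len(k)]

# Reduced design matrix containing only the selected genes
X_reduced <- X[, selected, drop = FALSE]
dim(X_reduced)
```

To avoid an optimistic bias, a filter like this should be fitted on the training data only (ideally inside each cross-validation fold) and then applied unchanged to the test data.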