The data file can be found on Kaggle: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
require(caret)
require(mlbench)
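# load the data; stringsAsFactors = TRUE keeps diagnosis as a factor (not the default on R >= 4.0)
cancerData <- read.csv("data.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)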
set.seed(123)
# calculate correlation matrix
correlationMatrix <- cor(cancerData[,3:32])
# find attributes that are highly correlated (0.75 is a common cutoff; 0.5 is used here)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
## [1] 7 8 6 28 27 23 21 3 26 24 1 13 11 18 16 14 17 5 9 10 30 22
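The indexes printed above are column positions within `cancerData[,3:32]`, not within `cancerData` itself. To make concrete what a correlation cutoff flags, here is a minimal base R sketch on made-up data; the data frame `toy` and its columns are hypothetical, and the pairwise rule below is a simplification for illustration, not the exact procedure `findCorrelation` uses.

```r
# Toy data (made up): x2 is nearly a copy of x1, x3 is independent noise.
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.1),
                  x3 = rnorm(100))

cm <- cor(toy)

# Pairs of distinct columns whose absolute correlation exceeds the cutoff.
cutoff <- 0.5
high <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                    var2 = colnames(cm)[high[, "col"]],
                    r    = cm[high])
print(pairs)

# Dropping one column from each flagged pair leaves a less redundant set.
reduced <- toy[, setdiff(names(toy), pairs$var2), drop = FALSE]
names(reduced)
```

Note that the walkthrough only prints the flagged indexes; the feature selection that follows is done by RFE on all 30 columns.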
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm over subset sizes 3 to 30 (there are only 30 candidate predictors)
results <- rfe(cancerData[,3:32], cancerData[,2], sizes=c(3:30), rfeControl=control)
# summarize the results
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 3 0.9280 0.8454 0.04240 0.09057
## 4 0.9369 0.8652 0.03497 0.07429
## 5 0.9406 0.8735 0.04396 0.09250
## 6 0.9439 0.8799 0.04451 0.09392
## 7 0.9544 0.9023 0.03761 0.08021
## 8 0.9545 0.9028 0.03288 0.06964
## 9 0.9580 0.9100 0.03294 0.07012
## 10 0.9579 0.9103 0.03312 0.07007
## 11 0.9597 0.9138 0.03379 0.07189 *
## 12 0.9544 0.9027 0.03292 0.06982
## 13 0.9544 0.9026 0.03400 0.07207
## 14 0.9562 0.9065 0.03399 0.07215
## 15 0.9597 0.9141 0.03671 0.07774
## 16 0.9544 0.9029 0.03502 0.07420
## 17 0.9561 0.9064 0.03504 0.07433
## 18 0.9543 0.9025 0.03603 0.07648
## 19 0.9579 0.9102 0.03315 0.07026
## 20 0.9579 0.9102 0.03524 0.07468
## 21 0.9544 0.9024 0.03502 0.07434
## 22 0.9579 0.9101 0.03315 0.07023
## 23 0.9579 0.9100 0.03692 0.07846
## 24 0.9596 0.9139 0.03402 0.07211
## 25 0.9562 0.9064 0.03215 0.06812
## 26 0.9596 0.9139 0.03402 0.07211
## 27 0.9561 0.9063 0.03216 0.06813
## 28 0.9579 0.9100 0.03315 0.07026
## 29 0.9543 0.9021 0.03324 0.07083
## 30 0.9543 0.9025 0.02987 0.06316
##
## The top 5 variables (out of 11):
## area_worst, concave.points_worst, perimeter_worst, radius_worst, texture_worst
The list of chosen features
predictors(results)
## [1] "area_worst" "concave.points_worst" "perimeter_worst"
## [4] "radius_worst" "texture_worst" "concave.points_mean"
## [7] "area_se" "texture_mean" "concavity_worst"
## [10] "smoothness_worst" "area_mean"
Plot of features by accuracy
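The plot itself did not survive in this copy. With the `results` object in the session, `plot(results, type = c("g", "o"))` is the usual caret call for this kind of figure. As a stand-alone sketch, the same curve can be redrawn in base R from the accuracies printed in the RFE summary; the vector below is copied from that output, and the variable names are only illustrative.

```r
# Subset sizes and cross-validated accuracies copied from the rfe output.
sizes <- 3:30
accuracy <- c(0.9280, 0.9369, 0.9406, 0.9439, 0.9544, 0.9545, 0.9580,
              0.9579, 0.9597, 0.9544, 0.9544, 0.9562, 0.9597, 0.9544,
              0.9561, 0.9543, 0.9579, 0.9579, 0.9544, 0.9579, 0.9579,
              0.9596, 0.9562, 0.9596, 0.9561, 0.9579, 0.9543, 0.9543)

# First subset size that reaches the highest printed accuracy.
best_size <- sizes[which.max(accuracy)]

plot(sizes, accuracy, type = "o",
     xlab = "Variables", ylab = "Accuracy (10-fold cross-validation)")
abline(v = best_size, lty = 2)  # dashed line at the selected subset size
```

At the printed precision, sizes 11 and 15 tie; `which.max` returns the first, which agrees with the size RFE marked as selected.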
Subsetting the data using the selected features
features <- predictors(results)
newdata <- cancerData[, features]
newdata$diagnosis <- cancerData$diagnosis
Partition the data into training (70%) and testing (30%) sets.
inTrain <- createDataPartition(y = newdata$diagnosis, p = 0.7, list = FALSE)
training <- newdata[inTrain,]
testing <- newdata[-inTrain,]
dim(training)
## [1] 399 12
Run the training and prediction using 3 different models.
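The training code is not shown in this copy. The following is a minimal sketch of how three caret models could be fit and used for prediction. The method names (`"glm"`, `"rf"`, `"nnet"`) and the 5-fold resampling are assumptions, not the author's confirmed choices, and the data comes from caret's `twoClassSim()` simulator so that the block runs on its own.

```r
library(caret)

# Stand-in data so the sketch runs on its own; twoClassSim() simulates a
# two-class problem and names the outcome column `Class`.
set.seed(123)
toy <- twoClassSim(300)
in_train  <- createDataPartition(y = toy$Class, p = 0.7, list = FALSE)
toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# The same resampling scheme is reused for every model.
ctrl <- trainControl(method = "cv", number = 5)

# Assumed model choices: logistic regression, random forest, neural network.
fit_glm  <- train(Class ~ ., data = toy_train, method = "glm",  trControl = ctrl)
fit_rf   <- train(Class ~ ., data = toy_train, method = "rf",   trControl = ctrl)
fit_nnet <- train(Class ~ ., data = toy_train, method = "nnet", trControl = ctrl,
                  trace = FALSE)

# Class predictions on the held-out rows, one factor per model.
pred_glm  <- predict(fit_glm,  newdata = toy_test)
pred_rf   <- predict(fit_rf,   newdata = toy_test)
pred_nnet <- predict(fit_nnet, newdata = toy_test)

# Held-out accuracy of each model.
sapply(list(glm = pred_glm, rf = pred_rf, nnet = pred_nnet),
       function(p) mean(p == toy_test$Class))
```

On the cancer data, the equivalent calls would use `diagnosis ~ .` with the `training` and `testing` frames created earlier.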
View the performance of the gl model
confusionMatrix(pred1,testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 103 2
## M 4 61
##
## Accuracy : 0.9647
## 95% CI : (0.9248, 0.9869)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9248
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9626
## Specificity : 0.9683
## Pos Pred Value : 0.9810
## Neg Pred Value : 0.9385
## Prevalence : 0.6294
## Detection Rate : 0.6059
## Detection Prevalence : 0.6176
## Balanced Accuracy : 0.9654
##
## 'Positive' Class : B
##
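The statistics in this report follow directly from the four cell counts. As a worked check, the sketch below recomputes the main ones in base R from the matrix above, treating `B` as the positive class as the output states; the variable names are only illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; "B" is the positive class).
cm <- matrix(c(103, 4, 2, 61), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n                # correct / total
sensitivity <- cm["B", "B"] / sum(cm[, "B"])    # true B among reference B
specificity <- cm["M", "M"] / sum(cm[, "M"])    # true M among reference M
ppv         <- cm["B", "B"] / sum(cm["B", ])    # true B among predicted B
npv         <- cm["M", "M"] / sum(cm["M", ])    # true M among predicted M
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa_stat,
        Sensitivity = sensitivity, Specificity = specificity,
        PPV = ppv, NPV = npv, BalancedAccuracy = balanced), 4)
```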
View the performance of the rf model
confusionMatrix(pred2,testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 105 5
## M 2 58
##
## Accuracy : 0.9588
## 95% CI : (0.917, 0.9833)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9109
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9813
## Specificity : 0.9206
## Pos Pred Value : 0.9545
## Neg Pred Value : 0.9667
## Prevalence : 0.6294
## Detection Rate : 0.6176
## Detection Prevalence : 0.6471
## Balanced Accuracy : 0.9510
##
## 'Positive' Class : B
##
View the performance of the third model (pred3)
confusionMatrix(pred3,testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 101 2
## M 6 61
##
## Accuracy : 0.9529
## 95% CI : (0.9094, 0.9795)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9004
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9439
## Specificity : 0.9683
## Pos Pred Value : 0.9806
## Neg Pred Value : 0.9104
## Prevalence : 0.6294
## Detection Rate : 0.5941
## Detection Prevalence : 0.6059
## Balanced Accuracy : 0.9561
##
## 'Positive' Class : B
##
The predictions of the three models are then combined, and another model is fit on them using a neural network.
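The combining code is not shown in this copy. The sketch below illustrates the general stacking idea under stated assumptions: the base-model predictions are simulated (each agrees with the truth about 90% of the time), the helper `noisy_copy` is hypothetical, and `nnet::nnet` is called directly rather than through caret to keep dependencies small.

```r
library(nnet)  # recommended package shipped with standard R distributions

set.seed(123)
n <- 200
truth <- factor(sample(c("B", "M"), n, replace = TRUE), levels = c("B", "M"))

# Hypothetical base-model output: flips each label with probability
# error_rate, independently of the other models.
noisy_copy <- function(y, error_rate) {
  flip <- runif(length(y)) < error_rate
  out <- as.character(y)
  out[flip] <- ifelse(out[flip] == "B", "M", "B")
  factor(out, levels = levels(y))
}

# One column per base model plus the reference label.
stacked <- data.frame(pred1 = noisy_copy(truth, 0.10),
                      pred2 = noisy_copy(truth, 0.10),
                      pred3 = noisy_copy(truth, 0.10),
                      diagnosis = truth)

# Fit the combiner on one half and evaluate on the other, so accuracy is
# not measured on rows the combiner was trained on.
fit_rows <- seq_len(n) <= n / 2
comb_fit <- nnet(diagnosis ~ ., data = stacked[fit_rows, ],
                 size = 2, trace = FALSE)
comb_pred <- factor(predict(comb_fit, stacked[!fit_rows, ], type = "class"),
                    levels = levels(truth))

# Cross-tabulate combined predictions against the reference labels.
print(table(Prediction = comb_pred,
            Reference = stacked$diagnosis[!fit_rows]))
mean(comb_pred == stacked$diagnosis[!fit_rows])
```

Whether the original analysis held out separate rows for the combiner is not stated; the split here is a design choice of the sketch.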
View the performance of the ensemble model.
confusionMatrix(combPred,testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 103 2
## M 4 61
##
## Accuracy : 0.9647
## 95% CI : (0.9248, 0.9869)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9248
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9626
## Specificity : 0.9683
## Pos Pred Value : 0.9810
## Neg Pred Value : 0.9385
## Prevalence : 0.6294
## Detection Rate : 0.6059
## Detection Prevalence : 0.6176
## Balanced Accuracy : 0.9654
##
## 'Positive' Class : B
##
Comparing the performance, there is no significant difference between the ensemble model and the standalone neural network model: the ensemble's accuracy of 0.9647 is no higher than that of the best individual model.
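As an informal check of this comparison, the sketch below recomputes each model's accuracy and an exact binomial 95% interval from the correct-prediction counts in the four confusion matrices (164, 163, 162 and 164 out of 170). Overlapping intervals only suggest, and do not prove, that the differences are within noise; because all models were scored on the same test rows, a paired test would be the stricter choice.

```r
# Correct predictions per model, read off the diagonals of the four
# confusion matrices in this report; the test set has 170 rows.
correct <- c(model1 = 164, model2 = 163, model3 = 162, ensemble = 164)
n_test  <- 170

# Exact (Clopper-Pearson) 95% interval for each accuracy.
ci <- t(sapply(correct, function(k) binom.test(k, n_test)$conf.int))
colnames(ci) <- c("lower", "upper")

summary_table <- cbind(accuracy = correct / n_test, ci)
print(round(summary_table, 4))

# TRUE when every interval overlaps every other one.
all_overlap <- max(summary_table[, "lower"]) < min(summary_table[, "upper"])
all_overlap
```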