R Markdown

The data file is available on Kaggle: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

library(caret)
library(mlbench)

# load the data; diagnosis (column 2) is read as a factor, as classification requires
cancerData <- read.csv("data.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)

Feature Selection

set.seed(123)

# calculate correlation matrix
correlationMatrix <- cor(cancerData[,3:32])

# find attributes that are highly correlated (0.75 is a common cutoff; a stricter 0.5 is used here)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes
print(highlyCorrelated)
##  [1]  7  8  6 28 27 23 21  3 26 24  1 13 11 18 16 14 17  5  9 10 30 22

# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm
# columns 3:32 hold the 30 predictors, so subset sizes run from 3 to 30
results <- rfe(cancerData[,3:32], cancerData[,2], sizes=c(3:30), rfeControl=control)

# summarize the results
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          3   0.9280 0.8454    0.04240 0.09057         
##          4   0.9369 0.8652    0.03497 0.07429         
##          5   0.9406 0.8735    0.04396 0.09250         
##          6   0.9439 0.8799    0.04451 0.09392         
##          7   0.9544 0.9023    0.03761 0.08021         
##          8   0.9545 0.9028    0.03288 0.06964         
##          9   0.9580 0.9100    0.03294 0.07012         
##         10   0.9579 0.9103    0.03312 0.07007         
##         11   0.9597 0.9138    0.03379 0.07189        *
##         12   0.9544 0.9027    0.03292 0.06982         
##         13   0.9544 0.9026    0.03400 0.07207         
##         14   0.9562 0.9065    0.03399 0.07215         
##         15   0.9597 0.9141    0.03671 0.07774         
##         16   0.9544 0.9029    0.03502 0.07420         
##         17   0.9561 0.9064    0.03504 0.07433         
##         18   0.9543 0.9025    0.03603 0.07648         
##         19   0.9579 0.9102    0.03315 0.07026         
##         20   0.9579 0.9102    0.03524 0.07468         
##         21   0.9544 0.9024    0.03502 0.07434         
##         22   0.9579 0.9101    0.03315 0.07023         
##         23   0.9579 0.9100    0.03692 0.07846         
##         24   0.9596 0.9139    0.03402 0.07211         
##         25   0.9562 0.9064    0.03215 0.06812         
##         26   0.9596 0.9139    0.03402 0.07211         
##         27   0.9561 0.9063    0.03216 0.06813         
##         28   0.9579 0.9100    0.03315 0.07026         
##         29   0.9543 0.9021    0.03324 0.07083         
##         30   0.9543 0.9025    0.02987 0.06316         
## 
## The top 5 variables (out of 11):
##    area_worst, concave.points_worst, perimeter_worst, radius_worst, texture_worst

The list of chosen features

predictors(results)
##  [1] "area_worst"           "concave.points_worst" "perimeter_worst"     
##  [4] "radius_worst"         "texture_worst"        "concave.points_mean" 
##  [7] "area_se"              "texture_mean"         "concavity_worst"     
## [10] "smoothness_worst"     "area_mean"

Plot of features by accuracy
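
The plot itself is not reproduced in the text above. A minimal sketch of how it could be drawn, assuming `results` is the `rfe` object created earlier and using caret's plot method for `rfe` objects:

```r
library(caret)

# Assumes `results` is the rfe object fitted above.
# type = c("g", "o") adds a background grid and joins the points with lines,
# showing cross-validated accuracy against the number of variables.
plot(results, type = c("g", "o"))
```

Given the resampling table above, the curve should peak at 11 variables, the subset marked as selected.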

Subsetting the data using the selected features

features <- predictors(results)
newdata <- cancerData[, features]
newdata$diagnosis <- cancerData$diagnosis

Partition the data into training and testing sets, 70% and 30% respectively.

inTrain <- createDataPartition(y = newdata$diagnosis ,
                               p=0.7, list=FALSE)
training <- newdata[inTrain,]
testing <- newdata[-inTrain,]
dim(training)
## [1] 399  12

Build Prediction Model

Train three different models on the training set and predict on the testing set.

Generalized Linear Model
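
The chunk that fits this model is not shown. A minimal sketch of how `pred1` could be produced, assuming the `training` and `testing` data frames from the partition step, a factor `diagnosis` column, and caret's `"glm"` method (the name `model1` and the seed are illustrative, not from the original):

```r
library(caret)

# Assumes `training` and `testing` exist and `diagnosis` is a factor.
set.seed(123)  # illustrative seed; the original seed at this step is unknown

# For a two-level factor outcome, caret fits a logistic regression.
model1 <- train(diagnosis ~ ., data = training, method = "glm")

# Class predictions (B or M) for the held-out test set.
pred1 <- predict(model1, newdata = testing)
```

Because the seed and resampling settings are assumptions, rerunning this may not reproduce the exact counts reported below.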

View the performance of the generalized linear model

confusionMatrix(pred1,testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 103   2
##          M   4  61
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9248          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9626          
##             Specificity : 0.9683          
##          Pos Pred Value : 0.9810          
##          Neg Pred Value : 0.9385          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6059          
##    Detection Prevalence : 0.6176          
##       Balanced Accuracy : 0.9654          
##                                           
##        'Positive' Class : B               
## 

Random Forest
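
The fitting chunk is again not shown. A sketch of how `pred2` could be obtained, assuming the same `training` and `testing` data frames and caret's `"rf"` method, which needs the randomForest package installed (the name `model2` and the seed are illustrative):

```r
library(caret)

# Assumes `training` and `testing` exist and `diagnosis` is a factor.
set.seed(123)  # illustrative seed; random forests are stochastic

# caret tunes mtry over its default grid using bootstrap resampling.
model2 <- train(diagnosis ~ ., data = training, method = "rf")

# Class predictions for the held-out test set.
pred2 <- predict(model2, newdata = testing)
```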

View the performance of the rf model

confusionMatrix(pred2,testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 105   5
##          M   2  58
##                                          
##                Accuracy : 0.9588         
##                  95% CI : (0.917, 0.9833)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9109         
##  Mcnemar's Test P-Value : 0.4497         
##                                          
##             Sensitivity : 0.9813         
##             Specificity : 0.9206         
##          Pos Pred Value : 0.9545         
##          Neg Pred Value : 0.9667         
##              Prevalence : 0.6294         
##          Detection Rate : 0.6176         
##    Detection Prevalence : 0.6471         
##       Balanced Accuracy : 0.9510         
##                                          
##        'Positive' Class : B              
## 

Neural Network Model
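
A sketch of how `pred3` could be produced, assuming the same data frames and caret's `"nnet"` method, which needs the nnet package installed. The centering and scaling step is an added assumption, not something the original states; the name `model3` and the seed are illustrative:

```r
library(caret)

# Assumes `training` and `testing` exist and `diagnosis` is a factor.
set.seed(123)  # illustrative seed; weights are initialised randomly

# preProcess standardises the predictors (an assumption made here because
# the features differ widely in scale); trace = FALSE is passed through
# to nnet() to silence its iteration log.
model3 <- train(diagnosis ~ ., data = training, method = "nnet",
                preProcess = c("center", "scale"), trace = FALSE)

# Class predictions for the held-out test set.
pred3 <- predict(model3, newdata = testing)
```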

View the performance of the neural network model

confusionMatrix(pred3,testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 101   2
##          M   6  61
##                                           
##                Accuracy : 0.9529          
##                  95% CI : (0.9094, 0.9795)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9004          
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9439          
##             Specificity : 0.9683          
##          Pos Pred Value : 0.9806          
##          Neg Pred Value : 0.9104          
##              Prevalence : 0.6294          
##          Detection Rate : 0.5941          
##    Detection Prevalence : 0.6059          
##       Balanced Accuracy : 0.9561          
##                                           
##        'Positive' Class : B               
## 

Model Ensemble

The predictions of the three models are combined, and another model is fit on them using a neural network.
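
The stacking chunk is not shown either. One common way to build `combPred`, sketched here under the assumption that `pred1`, `pred2` and `pred3` are the test-set predictions of the three models (the names `predDF` and `combModFit` are illustrative):

```r
library(caret)

# Assumes pred1, pred2, pred3 are factor predictions on `testing`.
# Stack them as predictors alongside the true labels.
predDF <- data.frame(pred1, pred2, pred3, diagnosis = testing$diagnosis)

set.seed(123)  # illustrative seed
# Fit a neural network on the stacked predictions.
combModFit <- train(diagnosis ~ ., data = predDF, method = "nnet",
                    trace = FALSE)

# Combined predictions for the same observations.
combPred <- predict(combModFit, newdata = predDF)
```

Note that in this sketch the combiner is fit and evaluated on the same test-set rows, so its accuracy is likely optimistic; a separate validation split would give a fairer estimate.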

View the performance of the ensemble model.

confusionMatrix(combPred,testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 103   2
##          M   4  61
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9248          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9626          
##             Specificity : 0.9683          
##          Pos Pred Value : 0.9810          
##          Neg Pred Value : 0.9385          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6059          
##    Detection Prevalence : 0.6176          
##       Balanced Accuracy : 0.9654          
##                                           
##        'Positive' Class : B               
## 

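Comparing the results, the ensemble model (accuracy 0.9647, with the same confusion matrix as the generalized linear model) is not significantly better than the standalone neural network model (accuracy 0.9529): their 95% confidence intervals, (0.9248, 0.9869) and (0.9094, 0.9795), overlap almost entirely.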
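
As a check on the comparison, the reported accuracies can be recomputed from the confusion-matrix counts printed above. This sketch uses base R only; the counts are copied from the outputs and the object names are illustrative:

```r
# Confusion-matrix counts copied from the outputs above.
# matrix() fills column by column: each column is a reference class (B, M),
# each row a predicted class (B, M).
cm <- list(
  glm      = matrix(c(103, 4, 2, 61), nrow = 2),
  rf       = matrix(c(105, 2, 5, 58), nrow = 2),
  nnet     = matrix(c(101, 6, 2, 61), nrow = 2),
  ensemble = matrix(c(103, 4, 2, 61), nrow = 2)
)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- sapply(cm, function(m) sum(diag(m)) / sum(m))
print(round(accuracy, 4))

# Difference in correctly classified test cases, ensemble vs. neural network.
extra_correct <- sum(diag(cm$ensemble)) - sum(diag(cm$nnet))
print(extra_correct)
```

The ensemble classifies only two more of the 170 test cases correctly than the neural network, and its counts are identical to those of the generalized linear model.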