References:

Lantz, Brett. Machine Learning with R. Olton, GB: Packt Publishing, 2013. Chapter 3, p. 75. ProQuest ebrary. Web. 24 April 2017. http://proxy.library.upenn.edu:2061/lib/upenn/reader.action?docID=10794279

https://www.kaggle.com/gargmanish/d/uciml/breast-cancer-wisconsin-data/basic-machine-learning-with-cancer/notebook

https://rpubs.com/jesuscastagnetto/caret-knn-cancer-prediction

DATA SET DESCRIPTION

The dataset used here is the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository, as described in Chapter 3 ("Lazy Learning - Classification Using Nearest Neighbors") of the aforementioned book. The data set contains results of routine breast cancer screening, which allows the disease to be diagnosed and treated before it causes noticeable symptoms. The goal is to practice classification analysis: to predict which sub-population a new observation belongs to, on the basis of chosen metrics. In other words, after analyzing the cancer diagnosis dataset, we will be able to predict whether a patient's mass is benign or malignant.

Attributes: the features can be divided into three groups: means (columns 3-12), standard errors (columns 13-22), and worst values (columns 23-32). Each group contains the same 10 parameters: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

Load the data
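No loading code appears in this section; as a minimal sketch, assuming the Kaggle data.csv with an id column, a diagnosis column coded B/M, and the 30 numeric features (the filename and path are hypothetical):

library(ggplot2)    # histograms and bar charts below
library(gridExtra)  # grid.arrange
library(caret)      # train, trainControl, confusionMatrix
library(class)      # knn backend used by method="knn"
library(pander)     # formatted tables

# hypothetical path: adjust to wherever data.csv lives
dataset <- read.csv("data.csv", stringsAsFactors = FALSE)
dataset$diagnosis <- factor(dataset$diagnosis, levels = c("B", "M"),
                            labels = c("Benign", "Malignant"))
str(dataset[, 1:6])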

More EDA to exclude outliers. The boxplot statistics below report the lower and upper whisker ends (rows 1 and 5 of $stats) for each feature; these serve as the cutoffs for the filtering step that follows.

boxplot(dataset[,3:5])$stats[c(1, 5), ]

##        [,1]  [,2]   [,3]
## [1,]  6.981  9.71  43.79
## [2,] 21.750 29.97 147.30
boxplot(dataset[,6])$stats[c(1, 5), ]

## [1]  143.5 1326.0
boxplot(dataset[,7:12])$stats[c(1, 5), ]

##         [,1]    [,2]  [,3]  [,4]   [,5]    [,6]
## [1,] 0.06251 0.01938 0.000 0.000 0.1167 0.04996
## [2,] 0.13350 0.22840 0.281 0.152 0.2459 0.07871
# keep only observations inside the whisker limits found above
dataset <- dataset[which(dataset[,3] <= 21.75  & dataset[,4] <= 29.97 &
                         dataset[,5] <= 147.30 & dataset[,6] <= 1326.0 &
                         dataset[,7] >= 0.06251 & dataset[,7] <= 0.13350 &
                         dataset[,8] <= 0.22840 & dataset[,9] <= 0.2810 &
                         dataset[,10] <= 0.152  & dataset[,11] >= 0.1167 &
                         dataset[,11] <= 0.2459 & dataset[,12] >= 0.04996 &
                         dataset[,12] <= 0.07871), ]

After filtering, only 502 observations remain.
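The cutoffs above were hard-coded from the boxplot output; a hypothetical helper that derives the same whisker limits directly would be:

# TRUE for values inside the boxplot whiskers (rows 1 and 5 of the stats)
within_whiskers <- function(x) {
  w <- boxplot.stats(x)$stats[c(1, 5)]
  x >= w[1] & x <= w[2]
}
# applied to a single column, e.g.:
# dataset <- dataset[within_whiskers(dataset[, 3]), ]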

# one histogram per retained feature (the "mean" columns 3-12)
plots <- lapply(3:12, function(i) {
  ggplot(data=dataset, aes(x=dataset[[i]])) +
    geom_histogram(bins = 50) +
    xlab(names(dataset)[i])
})
grid.arrange(grobs=plots, nrow=4, widths=c(1,1,1))

Re-partitioning training data and test data
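The partitioning code below uses frqtab, which is never defined in this document; a minimal sketch consistent with the percentage tables shown afterwards:

# percentage frequency table, rounded to one decimal place
frqtab <- function(x, digits = 1) {
  round(100 * prop.table(table(x)), digits)
}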

#try a 4:1 train/test split (sequential, not randomized)
train_df <- dataset[1:401,]
train_labels <- train_df[,2]
ft_train <- frqtab(train_df$diagnosis)

test_df <- dataset[402:502,]
test_labels <- test_df[,2]
ft_test <- frqtab(test_df$diagnosis)

ft_orig <- frqtab(dataset$diagnosis)
ftcmp_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ftcmp_df) <- c("Original", "Training set", "Test set")

pander(ftcmp_df, style="rmarkdown",
             caption="Comparison of diagnosis frequencies (in %)")
Comparison of diagnosis frequencies (in %)

             Original   Training set   Test set
Benign           67.1           67.1       67.3
Malignant        32.9           32.9       32.7
a1 <- ggplot(data=train_df, aes(x=diagnosis)) + geom_bar() +
  geom_text(stat='count', aes(label=..count..), vjust=-1)

a2 <- ggplot(data=test_df, aes(x=diagnosis)) + geom_bar() +
  geom_text(stat='count', aes(label=..count..), vjust=-1)

grid.arrange(a1, a2, nrow=1)

The diagnosis frequencies in the training set, original data, and test set are nearly identical, although the test set contains slightly more benign cases.

k-Nearest Neighbors

#kNN using accuracy as metric
ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(12345)
knnFit1 <- train(diagnosis ~ ., data=train_df, method="knn",
                trControl=ctrl, metric="Accuracy", tuneLength=20,
                preProc=c("range"))
knnFit1
## k-Nearest Neighbors 
## 
## 401 samples
##  31 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## Pre-processing: re-scaling to [0, 1] (31) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 361, 361, 361, 360, 362, 361, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9609511  0.9092246
##    7  0.9668069  0.9235495
##    9  0.9659735  0.9213962
##   11  0.9593465  0.9051376
##   13  0.9568465  0.8985422
##   15  0.9560335  0.8962887
##   17  0.9526788  0.8876588
##   19  0.9501991  0.8822729
##   21  0.9502001  0.8817905
##   23  0.9518882  0.8854542
##   25  0.9510345  0.8828425
##   27  0.9501798  0.8805709
##   29  0.9518678  0.8846090
##   31  0.9527012  0.8861578
##   33  0.9493465  0.8782137
##   35  0.9485335  0.8761510
##   37  0.9485752  0.8758324
##   39  0.9469288  0.8718234
##   41  0.9469074  0.8716464
##   43  0.9469074  0.8718883
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 7.
plot(knnFit1)

#predict the diagnosis
knnPredict1 <- predict(knnFit1, newdata=test_df)
cmat1 <- confusionMatrix(knnPredict1, test_df$diagnosis, positive="Malignant")
cmat1
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Benign Malignant
##   Benign        68         2
##   Malignant      0        31
##                                           
##                Accuracy : 0.9802          
##                  95% CI : (0.9303, 0.9976)
##     No Information Rate : 0.6733          
##     P-Value [Acc > NIR] : 5.497e-15       
##                                           
##                   Kappa : 0.9543          
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9394          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9714          
##              Prevalence : 0.3267          
##          Detection Rate : 0.3069          
##    Detection Prevalence : 0.3069          
##       Balanced Accuracy : 0.9697          
##                                           
##        'Positive' Class : Malignant       
## 

The trainControl function is from caret v6.0; it controls the computational nuances of the train function.

The knn function is from class v7.3-0; it performs k-nearest neighbour classification of a test set from a training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.
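A toy illustration of that rule on hypothetical points (not the cancer data):

library(class)
train_x <- matrix(c(1, 1,
                    2, 2,
                    8, 8), ncol = 2, byrow = TRUE)
cl <- factor(c("Benign", "Benign", "Malignant"))
test_x <- matrix(c(1.5, 1.5), ncol = 2)
knn(train_x, test_x, cl, k = 3)  # all 3 neighbours vote: 2 Benign vs 1 Malignant -> Benign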

range: the data is rescaled to [0, 1]; in other words, the data is min-max normalized.
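As a quick sketch of what preProc=c("range") does to each feature:

# min-max normalization: x' = (x - min(x)) / (max(x) - min(x))
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(1, 5, 10))  # 0.0000000 0.4444444 1.0000000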

seeds: an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of NA will stop the seed from being set within the worker processes, while a value of NULL will set the seeds using a random set of integers. Alternatively, a list can be used: it should have B+1 elements, where B is the number of resamples (unless method is "boot632", in which case B is the number of resamples plus 1). The first B elements of the list should be vectors of integers of length M, where M is the number of models being evaluated; the last element only needs to be a single integer, used for the final model.
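For the setup used here (10-fold CV repeated 3 times, so B = 30 resamples, and tuneLength = 20 models per resample), a hypothetical seeds list would look like:

B <- 30; M <- 20
set.seed(1)
seeds <- vector("list", B + 1)
for (i in 1:B) seeds[[i]] <- sample.int(10000, M)  # one vector of M seeds per resample
seeds[[B + 1]] <- sample.int(10000, 1)             # a single seed for the final model
# ctrl <- trainControl(method="repeatedcv", number=10, repeats=3, seeds=seeds)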

predict is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.

The confusionMatrix function is from caret; it calculates a cross-tabulation of observed and predicted classes with associated statistics.


#kNN using kappa as metric
knnFit2 <- train(diagnosis ~ ., data=train_df, method="knn",
                trControl=ctrl, metric="Kappa", tuneLength=20,
                preProc=c("range"))
knnFit2
## k-Nearest Neighbors 
## 
## 401 samples
##  31 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## Pre-processing: re-scaling to [0, 1] (31) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 361, 361, 361, 361, 361, 361, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9676188  0.9250237
##    7  0.9667235  0.9234585
##    9  0.9708901  0.9328905
##   11  0.9600975  0.9073302
##   13  0.9559308  0.8976493
##   15  0.9525761  0.8891935
##   17  0.9542428  0.8927164
##   19  0.9517428  0.8867363
##   21  0.9467214  0.8747088
##   23  0.9475141  0.8763095
##   25  0.9500354  0.8822192
##   27  0.9500558  0.8823577
##   29  0.9484308  0.8786716
##   31  0.9484105  0.8784325
##   33  0.9450964  0.8707441
##   35  0.9475547  0.8761472
##   37  0.9475547  0.8757506
##   39  0.9467000  0.8735448
##   41  0.9442203  0.8673911
##   43  0.9450334  0.8687345
## 
## Kappa was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.
plot(knnFit2)

#predict the diagnosis
knnPredict2 <- predict(knnFit2, newdata=test_df)
cmat2 <- confusionMatrix(knnPredict2, test_df$diagnosis, positive="Malignant")
cmat2
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Benign Malignant
##   Benign        68         2
##   Malignant      0        31
##                                           
##                Accuracy : 0.9802          
##                  95% CI : (0.9303, 0.9976)
##     No Information Rate : 0.6733          
##     P-Value [Acc > NIR] : 5.497e-15       
##                                           
##                   Kappa : 0.9543          
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9394          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9714          
##              Prevalence : 0.3267          
##          Detection Rate : 0.3069          
##    Detection Prevalence : 0.3069          
##       Balanced Accuracy : 0.9697          
##                                           
##        'Positive' Class : Malignant       
## 
#kNN using ROC as metric: classProbs=TRUE and summaryFunction=twoClassSummary let caret compute ROC, Sens, and Spec
ctrl2 <- trainControl(method="repeatedcv", number=10, repeats=3, classProbs = TRUE, summaryFunction = twoClassSummary)

knnFit3 <- train(diagnosis ~ ., data=train_df, method="knn",
                trControl=ctrl2, metric="Accuracy", tuneLength=20,
                preProc=c("range"))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
knnFit3
## k-Nearest Neighbors 
## 
## 401 samples
##  31 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## Pre-processing: re-scaling to [0, 1] (31) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 361, 360, 361, 361, 362, 360, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    5  0.9838911  0.9851377  0.9194139
##    7  0.9846524  0.9839031  0.9267399
##    9  0.9869867  0.9826686  0.9194139
##   11  0.9872643  0.9814340  0.9190476
##   13  0.9876392  0.9814340  0.9091575
##   15  0.9874083  0.9863723  0.8941392
##   17  0.9883593  0.9888414  0.8811355
##   19  0.9906755  0.9876068  0.8836996
##   21  0.9908566  0.9876068  0.8787546
##   23  0.9907564  0.9888414  0.8789377
##   25  0.9908955  0.9900760  0.8663004
##   27  0.9904292  0.9913105  0.8663004
##   29  0.9901887  0.9937797  0.8637363
##   31  0.9899460  0.9913105  0.8615385
##   33  0.9898529  0.9912631  0.8589744
##   35  0.9899038  0.9950142  0.8564103
##   37  0.9898052  0.9950142  0.8589744
##   39  0.9898052  0.9950142  0.8589744
##   41  0.9891083  0.9950142  0.8564103
##   43  0.9885487  0.9962488  0.8465201
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was k = 25.
plot(knnFit3)

#predict the diagnosis
knnPredict3 <- predict(knnFit3, newdata=test_df)
cmat3 <- confusionMatrix(knnPredict3, test_df$diagnosis, positive="Malignant")
cmat3
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Benign Malignant
##   Benign        68         4
##   Malignant      0        29
##                                           
##                Accuracy : 0.9604          
##                  95% CI : (0.9017, 0.9891)
##     No Information Rate : 0.6733          
##     P-Value [Acc > NIR] : 1.094e-12       
##                                           
##                   Kappa : 0.9071          
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 0.8788          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9444          
##              Prevalence : 0.3267          
##          Detection Rate : 0.2871          
##    Detection Prevalence : 0.2871          
##       Balanced Accuracy : 0.9394          
##                                           
##        'Positive' Class : Malignant       
## 

Comparing three models

It looks like the caret tuning produced good models; let's compare the three using a small summary helper (sketched below).
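summod is not defined in this document; a minimal sketch consistent with the table below (the third column, assumed to hold the resampled metric value, is dropped via model_comp[,-3]):

# summarize a confusion matrix (cm) and a caret fit into one comparison row
summod <- function(cm, fit) {
  data.frame(k = fit$finalModel$k,
             metric = fit$metric,
             value = round(fit$results[fit$results$k == fit$finalModel$k, fit$metric], 4),
             TN = cm$table[1, 1], TP = cm$table[2, 2],
             FN = cm$table[1, 2], FP = cm$table[2, 1],
             acc  = round(unname(cm$overall["Accuracy"]), 2),
             sens = round(unname(cm$byClass["Sensitivity"]), 2),
             spec = round(unname(cm$byClass["Specificity"]), 2),
             PPV  = round(unname(cm$byClass["Pos Pred Value"]), 2),
             NPV  = round(unname(cm$byClass["Neg Pred Value"]), 2))
}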

model_comp <- as.data.frame(
    rbind(
          summod(cmat1, knnFit1),
          summod(cmat2, knnFit2),
          summod(cmat3, knnFit3)))
rownames(model_comp) <- c("Model 1", "Model 2", "Model 3")
pander(model_comp[,-3], split.tables=Inf, keep.trailing.zeros=TRUE,
       style="rmarkdown",
       caption="Model results when comparing predictions and test set")
Model results when comparing predictions and test set

            k   metric     TN   TP   FN   FP    acc   sens   spec   PPV    NPV
Model 1     7   Accuracy   68   31    2    0   0.98   0.94      1     1   0.97
Model 2     9   Kappa      68   31    2    0   0.98   0.94      1     1   0.97
Model 3    25   ROC        68   29    4    0   0.96   0.88      1     1   0.94

Using Accuracy or Kappa as the tuning metric produced better test-set results than using ROC as the metric.