References:
- https://www.kaggle.com/lbronchal/d/uciml/breast-cancer-wisconsin-data/breast-cancer-dataset-analysis/notebook
- Lantz, Brett. Machine Learning with R. Olton, GB: Packt Publishing, 2013. Chapter 3, p. 75. ProQuest ebrary. Web. 24 April 2017. http://proxy.library.upenn.edu:2061/lib/upenn/reader.action?docID=10794279
- https://www.kaggle.com/gargmanish/d/uciml/breast-cancer-wisconsin-data/basic-machine-learning-with-cancer/notebook
- https://rpubs.com/jesuscastagnetto/caret-knn-cancer-prediction
The dataset used is the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository, as described in Chapter 3 ("Lazy Learning - Classification Using Nearest Neighbors") of the aforementioned book. The dataset contains results of routine breast cancer screening, which allows the disease to be diagnosed and treated before it causes noticeable symptoms. The goal is to practice classification analysis: to predict which sub-population a new observation belongs to on the basis of the chosen metrics. In other words, after analyzing the cancer diagnosis dataset, we will be able to predict whether a patient's tumor is benign or malignant. Attributes: the data can be divided into three parts, the means (columns 3-12), the standard errors (columns 13-22), and the worst values (columns 23-32), each containing the same 10 measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
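For reference, a minimal loading sketch (the file name "data.csv" and the B/M diagnosis codes are assumptions, so adjust them to your copy of the file; the column indices used later, such as cancer[, 3:ncol(cancer)], imply an extra leading column such as the patient id in some snippets):
# load the data and relabel the diagnosis factor
cancer <- read.csv("data.csv", stringsAsFactors = FALSE)
cancer$X <- NULL  # drop the empty trailing column some Kaggle exports contain
cancer$diagnosis <- factor(cancer$diagnosis, levels = c("B", "M"),
                           labels = c("Benign", "Malignant"))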
# test the proportion of the two classes
prop.table(table(cancer$diagnosis))
##
## Benign Malignant
## 0.6274165 0.3725835
summary(cancer$diagnosis)
## Benign Malignant
## 357 212
# test the correlation between variables using a correlation map
# (uncomment to run; requires the corrplot package)
# library(corrplot)
# corr_map <- cor(cancer[, 3:ncol(cancer)])
# corrplot(corr_map)
From the EDA we know that the two classes are unbalanced: about 63% of the cases are benign and 37% are malignant. There are correlations between some of the variables, several of them strong.
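The partition code below uses a helper frqtab taken from the rpubs reference; in case it is not defined earlier in the document, here is a minimal sketch of what it is assumed to do (percentage frequencies, rounded to one decimal):
# hypothetical helper: percentage frequency table rounded to one decimal
frqtab <- function(x, digits = 1) {
  round(100 * prop.table(table(x)), digits)
}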
# partition roughly 4:1 (400 training rows, 100 test rows); no seed is set,
# so exact counts vary between runs. Note that sampling the test set
# independently from the full data means it may overlap with the training
# set; see the disjoint-split sketch below.
train_df <- sample_n(cancer, 400)
train_labels <- train_df[, 1]
ft_train <- frqtab(train_df$diagnosis)
test_df <- sample_n(cancer, 100)
test_labels <- test_df[, 1]
ft_test <- frqtab(test_df$diagnosis)
ft_orig <- frqtab(cancer$diagnosis)
ftcmp_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ftcmp_df) <- c("Original", "Training set", "Test set")
pander(ftcmp_df, style="rmarkdown",
caption="Comparison of diagnosis frequencies (in %)")
|           | Original | Training set | Test set |
|-----------|----------|--------------|----------|
| Benign    | 62.7     | 65           | 66       |
| Malignant | 37.3     | 35           | 34       |
a1 <- ggplot(data=train_df, aes(x=diagnosis)) + geom_bar() +
  geom_text(stat='count', aes(label=..count..), vjust=-1)
a2 <- ggplot(data=test_df, aes(x=diagnosis)) + geom_bar() +
  geom_text(stat='count', aes(label=..count..), vjust=-1)
grid.arrange(a1, a2, nrow=1)
The diagnosis frequencies in the original data, the training set, and the test set are roughly equivalent, although the test set contains slightly more benign cases.
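Since sample_n draws the test set independently of the training set, the two can share rows. Here is a sketch of a disjoint, stratified 4:1 split using caret's createDataPartition (an alternative, not what generated the outputs below):
# stratified 80/20 split with no train/test overlap
set.seed(2017)
in_train <- createDataPartition(cancer$diagnosis, p = 0.8, list = FALSE)
train_alt <- cancer[in_train, ]
test_alt  <- cancer[-in_train, ]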
ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
knn_model1 <- train(diagnosis ~ ., data=train_df, method="knn",
trControl=ctrl, metric="Accuracy", tuneLength=20,
preProc=c('range'))
knn_model1
## k-Nearest Neighbors
##
## 400 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## Pre-processing: re-scaling to [0, 1] (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9625000 0.9160319
## 7 0.9600000 0.9103744
## 9 0.9625000 0.9154042
## 11 0.9600000 0.9093638
## 13 0.9583333 0.9055002
## 15 0.9583333 0.9051320
## 17 0.9583333 0.9052019
## 19 0.9566667 0.9014474
## 21 0.9566667 0.9014464
## 23 0.9575000 0.9033721
## 25 0.9566667 0.9014464
## 27 0.9575000 0.9032452
## 29 0.9583333 0.9051279
## 31 0.9575000 0.9029910
## 33 0.9575000 0.9030545
## 35 0.9558333 0.8992666
## 37 0.9550000 0.8972741
## 39 0.9541667 0.8953535
## 41 0.9566667 0.9007506
## 43 0.9575000 0.9025493
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(knn_model1)
#predict the diagnosis
knn_model1_pred_test <- predict(knn_model1, newdata=test_df)
#Confusion Matrix and Statistics
cm_knn1 <- confusionMatrix(knn_model1_pred_test, test_df$diagnosis, positive="Malignant")
cm_knn1
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 4
## Malignant 0 30
##
## Accuracy : 0.96
## 95% CI : (0.9007, 0.989)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 2.698e-13
##
## Kappa : 0.9083
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 0.8824
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9429
## Prevalence : 0.3400
## Detection Rate : 0.3000
## Detection Prevalence : 0.3000
## Balanced Accuracy : 0.9412
##
## 'Positive' Class : Malignant
##
The trainControl function (caret v6.0) controls the computational nuances of the train function.
The knn function (class v7.3-0) performs k-nearest neighbor classification of a test set using a training set. For each row of the test set, the k nearest training set vectors (in Euclidean distance) are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.
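For illustration, a direct call to class::knn (a sketch; it assumes diagnosis is the first column and the remaining columns are numeric features that have already been normalized):
library(class)
knn_pred <- knn(train = train_df[, -1], test = test_df[, -1],
                cl = train_df$diagnosis, k = 9)
table(knn_pred, test_df$diagnosis)  # quick cross-tabulation of the predictions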
The range pre-processing option rescales the data to [0, 1]; in other words, the data is min-max normalized.
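Written out by hand, the same min-max rescaling looks like this (preProc = c("range") applies it to every predictor internally):
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(1, 2, 3, 4, 5))  # returns 0.00 0.25 0.50 0.75 1.00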
The seeds argument is an optional set of integers used to set the seed at each resampling iteration, which is useful when models are run in parallel. A value of NA stops the seed from being set within the worker processes, while NULL sets the seeds using a random set of integers. Alternatively, a list can be used: it should have B + 1 elements, where B is the number of resamples (unless the method is "boot632", in which case B is the number of resamples plus 1). The first B elements should be integer vectors of length M, where M is the number of models being evaluated; the last element only needs to be a single integer, for the final model.
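For example, for the 10-fold cross-validation repeated 3 times used here (B = 30 resamples) with 20 candidate values of k, a reproducible seed list could be built like this (a sketch, not part of the original run):
set.seed(2017)
seeds <- vector("list", 31)                          # B + 1 elements
for (i in 1:30) seeds[[i]] <- sample.int(10000, 20)  # one seed per candidate model
seeds[[31]] <- sample.int(10000, 1)                  # seed for the final model
ctrl_seeded <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            seeds = seeds)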
predict is a generic function for predictions from the results of various model-fitting functions; it invokes particular methods depending on the class of its first argument.
confusionMatrix (from caret) calculates a cross-tabulation of observed and predicted classes with associated statistics.
References:
- Efron (1983). "Estimating the error rate of a prediction rule: improvement on cross-validation". Journal of the American Statistical Association, 78(382):316-331.
- Efron, B., & Tibshirani, R. J. (1994). "An introduction to the bootstrap", pages 249-252. CRC Press.
- Bergstra and Bengio (2012). "Random Search for Hyper-Parameter Optimization". Journal of Machine Learning Research, 13(Feb):281-305.
- Kuhn (2014). "Futility Analysis in the Cross-Validation of Machine Learning Models". http://arxiv.org/abs/1405.6974
- Package website for sub-sampling: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
- Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge.
- Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth edition. Springer.
# note: caret metric names are case-sensitive; the valid name is "Kappa",
# so the lowercase "kappa" below triggers the fallback warning that follows
knn_model2 <- train(diagnosis ~ ., data=train_df, method="knn",
                    trControl=ctrl, metric="kappa", tuneLength=20,
                    preProc=c("range"))
## Warning in train.default(x, y, weights = w, ...): The metric "kappa" was
## not in the result set. Accuracy will be used instead.
knn_model2
## k-Nearest Neighbors
##
## 400 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## Pre-processing: re-scaling to [0, 1] (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9650000 0.9213347
## 7 0.9591667 0.9083216
## 9 0.9608333 0.9114549
## 11 0.9616667 0.9130660
## 13 0.9600000 0.9087593
## 15 0.9600000 0.9087641
## 17 0.9575000 0.9030585
## 19 0.9575000 0.9030585
## 21 0.9566667 0.9011328
## 23 0.9575000 0.9031220
## 25 0.9558333 0.8993352
## 27 0.9558333 0.8990813
## 29 0.9591667 0.9065997
## 31 0.9608333 0.9103843
## 33 0.9566667 0.9011295
## 35 0.9575000 0.9027869
## 37 0.9583333 0.9045263
## 39 0.9566667 0.9007384
## 41 0.9566667 0.9006082
## 43 0.9566667 0.9004812
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn_model2)
#predict the diagnosis
knn_model2_pred_test <- predict(knn_model2, newdata=test_df)
#Confusion Matrix and Statistics
cm_knn2 <- confusionMatrix(knn_model2_pred_test, test_df$diagnosis, positive="Malignant")
cm_knn2
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 2
## Malignant 0 32
##
## Accuracy : 0.98
## 95% CI : (0.9296, 0.9976)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 1.23e-15
##
## Kappa : 0.9548
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9412
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9706
## Prevalence : 0.3400
## Detection Rate : 0.3200
## Detection Prevalence : 0.3200
## Balanced Accuracy : 0.9706
##
## 'Positive' Class : Malignant
##
ctrl2 <- trainControl(method="repeatedcv", number=10, repeats=3, classProbs = TRUE, summaryFunction = twoClassSummary)
knn_model3 <- train(diagnosis ~ ., data=train_df, method="knn",
trControl=ctrl2, metric="ROC", tuneLength=20,
preProc=c("range"))
knn_model3
## k-Nearest Neighbors
##
## 400 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## Pre-processing: re-scaling to [0, 1] (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 360, 360, 360, 360, 360, 360, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.9808608 0.9833333 0.9119048
## 7 0.9838828 0.9820513 0.9119048
## 9 0.9891941 0.9858974 0.9071429
## 11 0.9891941 0.9884615 0.9047619
## 13 0.9888278 0.9923077 0.8976190
## 15 0.9894231 0.9935897 0.8976190
## 17 0.9889652 0.9948718 0.8904762
## 19 0.9909341 0.9948718 0.8857143
## 21 0.9908883 0.9948718 0.8928571
## 23 0.9902015 0.9948718 0.8904762
## 25 0.9898810 0.9923077 0.8904762
## 27 0.9895147 0.9897436 0.8880952
## 29 0.9893773 0.9910256 0.8880952
## 31 0.9893315 0.9923077 0.8833333
## 33 0.9892857 0.9923077 0.8833333
## 35 0.9901557 0.9897436 0.8857143
## 37 0.9899725 0.9910256 0.8857143
## 39 0.9893315 0.9910256 0.8857143
## 41 0.9893773 0.9910256 0.8809524
## 43 0.9893773 0.9935897 0.8880952
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.
plot(knn_model3)
#predict the diagnosis
knn_model3_pred_test <- predict(knn_model3, newdata=test_df)
#Confusion Matrix and Statistics
cm_knn3 <- confusionMatrix(knn_model3_pred_test, test_df$diagnosis, positive="Malignant")
cm_knn3
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 4
## Malignant 0 30
##
## Accuracy : 0.96
## 95% CI : (0.9007, 0.989)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 2.698e-13
##
## Kappa : 0.9083
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 0.8824
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9429
## Prevalence : 0.3400
## Detection Rate : 0.3000
## Detection Prevalence : 0.3000
## Balanced Accuracy : 0.9412
##
## 'Positive' Class : Malignant
##
It looks like the caret approach produces good models; let's compare the three.
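The comparison relies on a helper summod from the rpubs reference; in case it is not defined earlier in the document, here is a minimal sketch consistent with the table below (the third column, value, is dropped by model_comp[,-3]):
# hypothetical helper: one-row summary of a caret kNN fit and its test-set
# confusion matrix (positive class in row/column 2)
summod <- function(cm, fit) {
  data.frame(
    k      = fit$finalModel$k,
    metric = fit$metric,
    value  = fit$results[fit$results$k == fit$finalModel$k, fit$metric],
    TN = cm$table[1, 1], TP = cm$table[2, 2],
    FN = cm$table[1, 2], FP = cm$table[2, 1],
    acc  = unname(round(cm$overall["Accuracy"], 2)),
    sens = unname(round(cm$byClass["Sensitivity"], 2)),
    spec = unname(round(cm$byClass["Specificity"], 2)),
    PPV  = unname(round(cm$byClass["Pos Pred Value"], 2)),
    NPV  = unname(round(cm$byClass["Neg Pred Value"], 2)))
}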
model_comp <- as.data.frame(
rbind(
summod(cm_knn1, knn_model1),
summod(cm_knn2, knn_model2),
summod(cm_knn3, knn_model3)))
rownames(model_comp) <- c("Model 1", "Model 2", "Model 3")
pander(model_comp[,-3], split.tables=Inf, keep.trailing.zeros=TRUE,
style="rmarkdown",
caption="Model results when comparing predictions and test set")
|         | k  | metric   | TN | TP | FN | FP | acc  | sens | spec | PPV | NPV  |
|---------|----|----------|----|----|----|----|------|------|------|-----|------|
| Model 1 | 9  | Accuracy | 66 | 30 | 4  | 0  | 0.96 | 0.88 | 1    | 1   | 0.94 |
| Model 2 | 5  | Accuracy | 66 | 32 | 2  | 0  | 0.98 | 0.94 | 1    | 1   | 0.97 |
| Model 3 | 19 | ROC      | 66 | 30 | 4  | 0  | 0.96 | 0.88 | 1    | 1   | 0.94 |
Using Accuracy as the selection metric (models 1 and 2; the intended Kappa fell back to Accuracy because of the metric-name warning) turned out slightly better on the test set than using ROC.
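As a side check, caret's resamples function can compare the resampled performance distributions of models trained on the same data (the comparison is most meaningful when the models share resampling indices, e.g. via the seeds argument sketched above):
rs <- resamples(list(knn_acc = knn_model1, knn_kappa = knn_model2))
summary(rs)  # Accuracy and Kappa across the 30 resamples of each model
With that comparison in hand, let's see whether a random forest can do better.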
rf_model = randomForest(diagnosis ~ .,
data = train_df, mtry = 17, importance = TRUE)
rf_model
##
## Call:
## randomForest(formula = diagnosis ~ ., data = train_df, mtry = 17, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 17
##
## OOB estimate of error rate: 3%
## Confusion matrix:
## Benign Malignant class.error
## Benign 254 6 0.02307692
## Malignant 6 134 0.04285714
plot(rf_model)
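Since the forest was grown with importance = TRUE, per-variable importance measures are available, and inspecting them is a natural next step (also note that mtry = 17 is much higher than the classification default of floor(sqrt(30)) = 5):
importance(rf_model)              # MeanDecreaseAccuracy / MeanDecreaseGini per variable
varImpPlot(rf_model, n.var = 10)  # plot the ten most important variables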
#predict the diagnosis
rf_model_pred_test = predict(rf_model, test_df)
#Confusion Matrix and Statistics
cm_rf_model<- confusionMatrix(rf_model_pred_test, test_df$diagnosis, positive = "Malignant")
cm_rf_model
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 3
## Malignant 0 31
##
## Accuracy : 0.97
## 95% CI : (0.9148, 0.9938)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 2.113e-14
##
## Kappa : 0.9317
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9118
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9565
## Prevalence : 0.3400
## Detection Rate : 0.3100
## Detection Prevalence : 0.3100
## Balanced Accuracy : 0.9559
##
## 'Positive' Class : Malignant
##
If we include all factors, an accuracy of 0.97, 95% CI (0.9148, 0.9938), is achieved. However, we know from the earlier correlation map that some of the factors are redundant; eliminating them might yield better accuracy.
We saw that radius, perimeter, and area are strongly correlated with one another, while fractal_dimension_mean and smoothness_se correlate only weakly (and negatively) with them.
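Instead of picking variables by eye, caret's findCorrelation can suggest which highly correlated columns to drop (a sketch; it assumes diagnosis is the first column and the rest are numeric):
corr_mat <- cor(train_df[, -1])
drop_idx <- findCorrelation(corr_mat, cutoff = 0.90)
colnames(corr_mat)[drop_idx]  # candidate predictors to eliminate
Below, though, I eliminate variables by hand, guided by the correlation map.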
rf_model1 = randomForest(diagnosis ~
area_mean + texture_mean + smoothness_mean + compactness_mean + concavity_mean + concave.points_mean + symmetry_mean +
area_se +texture_se +compactness_se + concavity_se + concave.points_se + fractal_dimension_se +
area_worst + texture_worst + smoothness_worst + compactness_worst + concavity_worst + concave.points_worst + symmetry_worst + fractal_dimension_worst,
data = train_df, mtry = 17, importance = TRUE)
#predict the diagnosis
rf_model1_pred_test = predict(rf_model1, test_df)
#Confusion Matrix and Statistics
cm_rf_model1<- confusionMatrix(rf_model1_pred_test, test_df$diagnosis, positive = "Malignant")
cm_rf_model1
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 3
## Malignant 0 31
##
## Accuracy : 0.97
## 95% CI : (0.9148, 0.9938)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 2.113e-14
##
## Kappa : 0.9317
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9118
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9565
## Prevalence : 0.3400
## Detection Rate : 0.3100
## Detection Prevalence : 0.3100
## Balanced Accuracy : 0.9559
##
## 'Positive' Class : Malignant
##
This time I kept only area out of the correlated trio radius/perimeter/area, and eliminated fractal_dimension_mean, symmetry_se, and smoothness_se (among a few other _se columns). The accuracy remains 0.97.
rf_model2 = randomForest(diagnosis ~
perimeter_mean + texture_mean + smoothness_mean + compactness_mean + concavity_mean + concave.points_mean + symmetry_mean +
radius_worst + texture_worst + smoothness_worst + compactness_worst + concavity_worst + concave.points_worst + symmetry_worst + fractal_dimension_worst,
data = train_df, mtry = 10, importance = TRUE)
#predict the diagnosis
rf_model2_pred_test = predict(rf_model2, test_df)
#Confusion Matrix and Statistics
cm_rf_model2<- confusionMatrix(rf_model2_pred_test, test_df$diagnosis, positive = "Malignant")
cm_rf_model2
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 2
## Malignant 0 32
##
## Accuracy : 0.98
## 95% CI : (0.9296, 0.9976)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 1.23e-15
##
## Kappa : 0.9548
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9412
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9706
## Prevalence : 0.3400
## Detection Rate : 0.3200
## Detection Prevalence : 0.3200
## Balanced Accuracy : 0.9706
##
## 'Positive' Class : Malignant
##
This time I excluded all the _se factors from model 1's predictors (swapping area_mean and area_worst for perimeter_mean and radius_worst), and the accuracy improved to 0.98, 95% CI (0.9296, 0.9976), slightly better than with all factors.
rf_model3 = randomForest(diagnosis ~
radius_mean + perimeter_mean + texture_mean + area_mean + smoothness_mean + compactness_mean + concavity_mean + concave.points_mean + symmetry_mean + fractal_dimension_mean +
radius_worst + perimeter_worst + texture_worst + area_worst + smoothness_worst + compactness_worst + concavity_worst + concave.points_worst + symmetry_worst + fractal_dimension_worst,
data = train_df, mtry = 10, importance = TRUE)
#predict the diagnosis
rf_model3_pred_test = predict(rf_model3, test_df)
#Confusion Matrix and Statistics
cm_rf_model3<- confusionMatrix(rf_model3_pred_test, test_df$diagnosis, positive = "Malignant")
cm_rf_model3
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 66 2
## Malignant 0 32
##
## Accuracy : 0.98
## 95% CI : (0.9296, 0.9976)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 1.23e-15
##
## Kappa : 0.9548
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9412
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9706
## Prevalence : 0.3400
## Detection Rate : 0.3200
## Detection Prevalence : 0.3200
## Balanced Accuracy : 0.9706
##
## 'Positive' Class : Malignant
##
This time I excluded all the _se factors while keeping every mean and worst value, and obtained the same accuracy as rf_model2: 0.98, 95% CI (0.9296, 0.9976).
I conclude that, in this run, dropping the _se features (rf_model2 and rf_model3) actually gives the best random forest results; including all factors is not noticeably better.
Overall, the best kNN model (model 2) and the best random forests perform comparably on this test set, both reaching an accuracy of 0.98.