References:

- Lantz, Brett. *Machine Learning with R*. Olton, GB: Packt Publishing, 2013. Chapter 3, p. 75. ProQuest ebrary. Web. 24 April 2017. http://proxy.library.upenn.edu:2061/lib/upenn/reader.action?docID=10794279
- https://www.kaggle.com/gargmanish/d/uciml/breast-cancer-wisconsin-data/basic-machine-learning-with-cancer/notebook
- https://rpubs.com/jesuscastagnetto/caret-knn-cancer-prediction
The dataset used here is the "Breast Cancer Wisconsin (Diagnostic) Data Set" from the UCI Machine Learning Repository, as described in Chapter 3 ("Lazy Learning - Classification Using Nearest Neighbors") of the book cited above. The data are results of routine breast cancer screening, which allows the disease to be diagnosed and treated before it causes noticeable symptoms. The goal is to practice classification analysis: predicting which sub-population a new observation belongs to on the basis of chosen metrics. In other words, after analyzing the cancer diagnosis dataset, we will be able to predict whether a patient's mass is benign or malignant.

Attributes: the data can be divided into three parts: means (columns 3-12), standard errors (columns 13-22), and worst values (columns 23-32). Each part contains the same 10 measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
I randomized the order of the imported data.
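The import step itself is not shown; a minimal sketch, assuming the Kaggle CSV (data.csv) with the UCI column layout and a diagnosis column coded B/M:

# read the raw CSV, recode the diagnosis labels, and shuffle the row order
dataset <- read.csv("data.csv", stringsAsFactors = FALSE)
dataset$diagnosis <- factor(dataset$diagnosis, levels = c("B", "M"),
                            labels = c("Benign", "Malignant"))
set.seed(12345)
dataset <- dataset[sample(nrow(dataset)), ]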
ggplot(data=dataset, aes(x=diagnosis)) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)
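The frqtab helper used next is not defined in this excerpt; a minimal sketch of a compatible implementation (class frequencies as percentages, rounded to one decimal) would be:

# assumed helper (not shown in the original): class frequencies in percent
frqtab <- function(x, digits = 1) {
    round(100 * prop.table(table(x)), digits)
}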
ft_orig <- frqtab(dataset$diagnosis)
pander(ft_orig, style="rmarkdown", caption="Original diagnosis frequencies (%)")
| Benign | Malignant |
|---|---|
| 62.7 | 37.3 |
There are 357 benign cases and 212 malignant cases (62.7% and 37.3%, respectively).
features_mean <- list(dataset[, 3:12])
features_se <- list(dataset[, 13:22])
features_worst <- list(dataset[, 23:32])
# heatmap of the ten "mean" features across all patients
# (the features are on very different scales, so standardize before plotting)
library(reshape2)
heat_df <- melt(scale(as.matrix(dataset[, 3:12])), varnames = c("patient", "feature"))
ggplot(heat_df, aes(x = feature, y = patient)) +
    geom_tile(aes(fill = value), color = "white") +
    scale_fill_gradient(low = "red", high = "steelblue") +
    ylab("Patients") + xlab("Mean features") +
    theme(legend.title = element_text(size = 10),
          legend.text = element_text(size = 12),
          plot.title = element_text(size = 16),
          axis.title = element_text(size = 14, face = "bold"),
          axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(fill = "Standardized value")
# min-max normalization: rescale a numeric vector to the [0, 1] range
normalize <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}
dataset_n <- as.data.frame(lapply(dataset[3:32], normalize))  # min-max normalized features
dataset_z <- as.data.frame(lapply(dataset[3:32], scale))      # z-score standardized features
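As a quick sanity check (assuming column 1 of the transformed frames is radius_mean, per the UCI layout), min-max values should span exactly [0, 1] while z-scores center near 0:

summary(dataset_n[, 1])  # expect Min. = 0 and Max. = 1
summary(dataset_z[, 1])  # expect Mean approximately 0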
# use the first 469 rows (order already randomized) for training
train_df <- dataset[1:469, ]
train_df_n <- dataset_n[1:469, ]
train_df_z <- dataset_z[1:469, ]
train_labels <- dataset[1:469, 2]
ft_train <- frqtab(train_df$diagnosis)
# hold out the remaining 100 rows for testing
test_df <- dataset[470:569, ]
test_df_n <- dataset_n[470:569, ]
test_df_z <- dataset_z[470:569, ]
test_labels <- dataset[470:569, 2]
ft_test <- frqtab(test_df$diagnosis)
ftcmp_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ftcmp_df) <- c("Original", "Training set", "Test set")
pander(ftcmp_df, style="rmarkdown",
caption="Comparison of diagnosis frequencies (in %)")
| | Original | Training set | Test set |
|---|---|---|---|
| Benign | 62.7 | 62.9 | 62 |
| Malignant | 37.3 | 37.1 | 38 |
a1 <- ggplot(data=train_df, aes(x=diagnosis)) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)
a2 <- ggplot(data=test_df, aes(x=diagnosis)) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-1)
grid.arrange(a1, a2, nrow=1)
The diagnosis frequencies in the original data, the training set, and the test set are essentially equivalent.
Definition: Lazy learning is also known as instance-based learning or rote learning. Instance-based learners do not build a model; they are non-parametric learning methods. Nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of the most similar labeled examples. Despite the simplicity of this idea, nearest neighbor methods are extremely powerful. In general, nearest neighbor classifiers are well-suited for classification tasks where the relationships among the features and the target classes are numerous, complicated, or otherwise extremely difficult to understand, yet items of the same class tend to be fairly homogeneous. Another way of putting it: if a concept is difficult to define, but you know it when you see it, then nearest neighbors might be appropriate. On the other hand, if there is no clear distinction among the groups, the algorithm is by and large not well-suited for identifying the boundary. (A minimal sketch of the underlying distance computation follows the lists below.)

Strengths:

- Simple and effective
- Makes no assumptions about the underlying data distribution
- Fast training phase

Weaknesses:

- Does not produce a model, which limits the ability to find novel insights in relationships among features
- Slow classification phase
- Requires a large amount of memory
- Nominal features and missing data require additional processing
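For concreteness, here is a minimal sketch of the distance computation behind "most similar" (Euclidean distance; the function name is mine, not from the book):

# Euclidean distance between two numeric feature vectors
euclid_dist <- function(a, b) sqrt(sum((a - b)^2))
euclid_dist(c(0.2, 0.9), c(1.0, 0.5))  # e.g. two patients described by two normalized features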
Choosing an appropriate k: deciding how many neighbors to use for kNN determines how well the model will generalize to future data. The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. Choosing a large k reduces the variance caused by noisy data, but can bias the learner so that it runs the risk of ignoring small but important patterns. At the opposite extreme, using a single nearest neighbor allows noisy data or outliers to unduly influence the classification of examples. The best k value lies somewhere between these two extremes; the sketch below sweeps k over a range to illustrate the tradeoff.
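A minimal sketch of that sweep, using the normalized train/test objects built above (train_df_n, test_df_n, train_labels, test_labels):

# sweep k and record test-set accuracy to visualize the bias-variance tradeoff
library(class)
ks <- seq(1, 43, by = 2)
acc <- sapply(ks, function(k) {
    pred <- knn(train = train_df_n, test = test_df_n, cl = train_labels, k = k)
    mean(pred == test_labels)
})
plot(ks, acc, type = "b", xlab = "k (number of neighbors)", ylab = "Test accuracy")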
Preparing the data:

1. Transform the features to a standard range (traditionally min-max normalization or z-score standardization); if a feature is nominal, use dummy coding (see the sketch after this list).
2. Build the model on the training data.
3. Test the model on the test data.
4. Improve the model.
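All 30 predictors in this dataset are numeric, so dummy coding is not actually needed here; the snippet below is purely illustrative (the side factor is made up):

# dummy coding a nominal feature into 0/1 indicator columns
side <- factor(c("left", "right", "left"))
model.matrix(~ side - 1)  # one indicator column per level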
#kNN using accuracy as metric
ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(12345)
knnFit1 <- train(diagnosis ~ ., data=train_df[2:32], method="knn",
trControl=ctrl, metric="Accuracy", tuneLength=20,
preProc=c("range"))
knnFit1
## k-Nearest Neighbors
##
## 469 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## Pre-processing: re-scaling to [0, 1] (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 422, 423, 422, 423, 422, 422, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9701170 0.9353159
## 7 0.9708417 0.9367183
## 9 0.9708565 0.9365803
## 11 0.9672647 0.9284411
## 13 0.9644420 0.9222883
## 15 0.9672949 0.9286068
## 17 0.9637187 0.9203371
## 19 0.9601565 0.9124945
## 21 0.9580288 0.9081314
## 23 0.9572894 0.9064343
## 25 0.9572740 0.9063126
## 27 0.9565648 0.9045071
## 29 0.9558703 0.9030044
## 31 0.9544513 0.8997949
## 33 0.9572740 0.9061065
## 35 0.9565641 0.9042996
## 37 0.9572888 0.9059132
## 39 0.9537568 0.8982000
## 41 0.9544358 0.8995970
## 43 0.9544204 0.8993040
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(knnFit1)
#predict the diagnosis
knnPredict1 <- predict(knnFit1, newdata=test_df)
cmat1 <- confusionMatrix(knnPredict1, test_df$diagnosis, positive="Malignant")
cmat1
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 61 2
## Malignant 1 36
##
## Accuracy : 0.97
## 95% CI : (0.9148, 0.9938)
## No Information Rate : 0.62
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.936
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9474
## Specificity : 0.9839
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.9683
## Prevalence : 0.3800
## Detection Rate : 0.3600
## Detection Prevalence : 0.3700
## Balanced Accuracy : 0.9656
##
## 'Positive' Class : Malignant
##
A total of 36 of the 100 test predictions were true positives, with 2 false negatives and 1 false positive.
#kNN using kappa as metric
knnFit2 <- train(diagnosis ~ ., data=train_df[2:32], method="knn",
trControl=ctrl, metric="Kappa", tuneLength=20,
preProc=c("range"))
knnFit2
## k-Nearest Neighbors
##
## 469 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## Pre-processing: re-scaling to [0, 1] (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 422, 423, 423, 422, 423, 421, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9716408 0.9385976
## 7 0.9709329 0.9369079
## 9 0.9709175 0.9367775
## 11 0.9666609 0.9274058
## 13 0.9673701 0.9289355
## 15 0.9680793 0.9305838
## 17 0.9645930 0.9228064
## 19 0.9617400 0.9163455
## 21 0.9638381 0.9210691
## 23 0.9588730 0.9100547
## 25 0.9560656 0.9039821
## 27 0.9560502 0.9038837
## 29 0.9567594 0.9052654
## 31 0.9560502 0.9036095
## 33 0.9567440 0.9049750
## 35 0.9560046 0.9032095
## 37 0.9567138 0.9048273
## 39 0.9560046 0.9032095
## 41 0.9552806 0.9014806
## 43 0.9559898 0.9030771
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knnFit2)
#predict the diagnosis
knnPredict2 <- predict(knnFit2, newdata=test_df)
cmat2 <- confusionMatrix(knnPredict2, test_df$diagnosis, positive="Malignant")
cmat2
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 61 2
## Malignant 1 36
##
## Accuracy : 0.97
## 95% CI : (0.9148, 0.9938)
## No Information Rate : 0.62
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.936
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9474
## Specificity : 0.9839
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.9683
## Prevalence : 0.3800
## Detection Rate : 0.3600
## Detection Prevalence : 0.3700
## Balanced Accuracy : 0.9656
##
## 'Positive' Class : Malignant
##
Again, 36 of the 100 test predictions were true positives (2 false negatives, 1 false positive), identical to the Accuracy-tuned model.
#kNN using ROC as metric
ctrl2 <- trainControl(method="repeatedcv", number=10, repeats=3, classProbs = TRUE, summaryFunction = twoClassSummary)
knnFit3 <- train(diagnosis ~ ., data=train_df[2:32], method="knn",
trControl=ctrl2, metric="Accuracy", tuneLength=20,
preProc=c("range"))
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
knnFit3
## k-Nearest Neighbors
##
## 469 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## Pre-processing: re-scaling to [0, 1] (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 421, 422, 422, 422, 422, 422, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.9876097 0.9875479 0.9440087
## 7 0.9929557 0.9863985 0.9421569
## 9 0.9927515 0.9897318 0.9346405
## 11 0.9916450 0.9909195 0.9289760
## 13 0.9911833 0.9897701 0.9192810
## 15 0.9910838 0.9931801 0.9153595
## 17 0.9906003 0.9920307 0.9173203
## 19 0.9902015 0.9908812 0.9116558
## 21 0.9898164 0.9908812 0.9117647
## 23 0.9901742 0.9931418 0.9079521
## 25 0.9902688 0.9909195 0.8985839
## 27 0.9902669 0.9932184 0.8967320
## 29 0.9903972 0.9920690 0.8967320
## 31 0.9905412 0.9943678 0.8928105
## 33 0.9907136 0.9943678 0.8908497
## 35 0.9909107 0.9932567 0.8889978
## 37 0.9908465 0.9943678 0.8888889
## 39 0.9907447 0.9943678 0.8851852
## 41 0.9902850 0.9943295 0.8813725
## 43 0.9901287 0.9943295 0.8813725
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
plot(knnFit3)
#predict the diagnosis
knnPredict3 <- predict(knnFit3, newdata=test_df)
cmat3 <- confusionMatrix(knnPredict3, test_df$diagnosis, positive="Malignant")
cmat3
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Malignant
## Benign 59 2
## Malignant 3 36
##
## Accuracy : 0.95
## 95% CI : (0.8872, 0.9836)
## No Information Rate : 0.62
## P-Value [Acc > NIR] : 1.232e-14
##
## Kappa : 0.8944
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9474
## Specificity : 0.9516
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9672
## Prevalence : 0.3800
## Detection Rate : 0.3600
## Detection Prevalence : 0.3900
## Balanced Accuracy : 0.9495
##
## 'Positive' Class : Malignant
##
A total of 36 of the 100 test predictions were true positives, but this model produced 3 false positives, compared to 1 for the previous two models.
#kNN with the class package directly
#try min-max normalization, k=7
knnFit4 <- knn(train = train_df_n, test = test_df_n, cl = train_labels, k = 7)
#evaluating model performance
CrossTable(x=test_labels, y = knnFit4, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | knnFit4
## test_labels | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 61 | 1 | 62 |
## | 0.984 | 0.016 | 0.620 |
## | 0.984 | 0.026 | |
## | 0.610 | 0.010 | |
## -------------|-----------|-----------|-----------|
## Malignant | 1 | 37 | 38 |
## | 0.026 | 0.974 | 0.380 |
## | 0.016 | 0.974 | |
## | 0.010 | 0.370 | |
## -------------|-----------|-----------|-----------|
## Column Total | 62 | 38 | 100 |
## | 0.620 | 0.380 | |
## -------------|-----------|-----------|-----------|
##
##
A total of 37 of the 100 test predictions were true positives, with 1 false negative and 1 false positive.
#try z-score standardization, k=7
knnFit5 <- knn(train = train_df_z, test = test_df_z, cl = train_labels, k = 7)
#evaluating model performance
CrossTable(x=test_labels, y = knnFit5, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | knnFit5
## test_labels | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 62 | 0 | 62 |
## | 1.000 | 0.000 | 0.620 |
## | 0.969 | 0.000 | |
## | 0.620 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Malignant | 2 | 36 | 38 |
## | 0.053 | 0.947 | 0.380 |
## | 0.031 | 1.000 | |
## | 0.020 | 0.360 | |
## -------------|-----------|-----------|-----------|
## Column Total | 64 | 36 | 100 |
## | 0.640 | 0.360 | |
## -------------|-----------|-----------|-----------|
##
##
A total of 36 of the 100 test predictions were true positives, with 2 false negatives and no false positives.
The caret and direct knn models perform comparably on the test set (each direct knn fit made only two errors). Let's compare the three caret models in detail.
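Like frqtab, the summod helper used below is not defined in this excerpt. A compatible sketch, with field names inferred from the comparison table that follows (the caret accessors finalModel$k and results, and the confusionMatrix fields table, overall, and byClass, are standard):

# assumed helper: summarize one caret kNN fit plus its test-set confusion matrix
summod <- function(cm, fit) {
    list(k = fit$finalModel$k,
         metric = fit$metric,
         value = round(fit$results[fit$results$k == fit$finalModel$k, fit$metric], 2),
         TN = cm$table[1, 1], TP = cm$table[2, 2],
         FN = cm$table[1, 2], FP = cm$table[2, 1],
         acc = round(cm$overall["Accuracy"], 2),
         sens = round(cm$byClass["Sensitivity"], 2),
         spec = round(cm$byClass["Specificity"], 2),
         PPV = round(cm$byClass["Pos Pred Value"], 2),
         NPV = round(cm$byClass["Neg Pred Value"], 2))
}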
model_comp <- as.data.frame(
rbind(
summod(cmat1, knnFit1),
summod(cmat2, knnFit2),
summod(cmat3, knnFit3)))
rownames(model_comp) <- c("Model 1", "Model 2", "Model 3")
pander(model_comp[,-3], split.tables=Inf, keep.trailing.zeros=TRUE,
style="rmarkdown",
caption="Model results when comparing predictions and test set")
| | k | metric | TN | TP | FN | FP | acc | sens | spec | PPV | NPV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Model 1 | 9 | Accuracy | 61 | 36 | 2 | 1 | 0.97 | 0.95 | 0.98 | 0.97 | 0.97 |
| Model 2 | 5 | Kappa | 61 | 36 | 2 | 1 | 0.97 | 0.95 | 0.98 | 0.97 | 0.97 |
| Model 3 | 7 | ROC | 59 | 36 | 2 | 3 | 0.95 | 0.95 | 0.95 | 0.92 | 0.97 |
Judging by the test-set results, tuning on Accuracy or Kappa (Models 1 and 2) gave slightly better performance than tuning on ROC (Model 3), which produced three false positives instead of one.