Classification Models for Breast Cancer Detection

Identifying cancerous cells in the human body is a lengthy process, and the longer detection takes, the harder the cancer is to treat. In this project we therefore aim to develop a model that automatically detects breast cancer with high accuracy from measurements of biopsied cells from women with abnormal breast masses. We used the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository, which contains 569 observations. Our goal is to build a classification model that labels each biopsy sample as either Benign (non-cancerous) or Malignant (cancerous).

We apply four machine learning methods to this classification problem: Nearest Neighbors, Naive Bayes, Decision Trees, and Neural Networks. We measure the performance of each model and compare accuracy rates to find the best among them. The analysis is carried out in RStudio.

EXPLORING & PREPARING THE DATA

Step 1: load the data

bc <- read.csv("wisc_bc_data.csv")

Step 2: exploring and preparing the data

examine the structure of the bc data frame

str(bc)

## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

drop the id feature

bc <- bc[-1]

table of diagnosis

table(bc$diagnosis)

## 
##   B   M 
## 357 212

recode diagnosis as a factor

bc$diagnosis <- factor(bc$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

table of proportions with more informative labels

round(prop.table(table(bc$diagnosis)) * 100, digits = 1)

## 
##    Benign Malignant 
##      62.7      37.3

pie chart

pie(table(bc$diagnosis), main = "Diagnosis", col = c("Blue", "Red"))

lbls <- paste(names(table(bc$diagnosis)), round(prop.table(table(bc$diagnosis)) * 100, digits = 1), "%")

pie(table(bc$diagnosis), labels = lbls, main = "Diagnosis", col = c("Blue", "Red"))
box()

create training and test data sets

bc_train <- bc[1:469, -1]
bc_test <- bc[470:569, -1]

create labels for training and test data

bc_train_labels <- bc[1:469, 1]
bc_test_labels <- bc[470:569, 1]
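
The split above takes the first 469 rows for training and the last 100 for testing, which is sound only if the rows of the file are already in random order. If that is in doubt, a random split is a safer alternative. The sketch below shows one way to draw it; the stand-in data frame `demo`, its column names, and the seed are illustrative assumptions, not part of the original analysis.

```r
set.seed(123)  # hypothetical seed, only to make the sketch reproducible

# stand-in data frame with the same number of rows as bc (assumed, not the real data)
demo <- data.frame(
  diagnosis = factor(sample(c("Benign", "Malignant"), 569, replace = TRUE)),
  feature   = rnorm(569)
)

# draw 469 row indices at random for training; the remaining 100 form the test set
train_idx  <- sample(nrow(demo), 469)
demo_train <- demo[train_idx, ]
demo_test  <- demo[-train_idx, ]

c(train = nrow(demo_train), test = nrow(demo_test))
```

A random split changes which cases land in the test set, so results obtained this way would not match the tables in this report exactly.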

METHOD 1: K-NEAREST NEIGHBORS (K-NN)

normalize the numeric features: create a min-max normalization function

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

normalize the bc data

bc_n <- as.data.frame(lapply(bc[2:31], normalize))

confirm that normalization worked

summary(bc_n$area_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1174  0.1729  0.2169  0.2711  1.0000
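
The min-max rescaling can also be checked by hand on a small vector. The values below are made up for illustration; the function is the same as the one defined above, repeated so the snippet runs on its own.

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# hypothetical area measurements, chosen only for illustration
area <- c(100, 250, 400, 1000)

# the smallest value maps to 0, the largest to 1, the rest fall in between
normalize(area)
```

A column in which every value is identical would produce 0/0 (NaN) under this function, so it is worth confirming that no feature is constant before applying it.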

create normalized training and test data

bc_n_train <- bc_n[1:469, ]
bc_n_test <- bc_n[470:569, ]

Step 3: training a k-NN model on the data

load the “class” library

library(class)

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test,
                      cl = bc_train_labels, k = 21)

Step 4: Evaluating the model performance

# load the "gmodels" library
library(gmodels)

Create the cross tabulation of predicted vs. actual

CrossTable(x = bc_test_labels, y = bc_test_pred,
           prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.968 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         2 |        37 |        39 | 
##                |     0.051 |     0.949 |     0.390 | 
##                |     0.032 |     1.000 |           | 
##                |     0.020 |     0.370 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        63 |        37 |       100 | 
##                |     0.630 |     0.370 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

Accuracy of k-NN model:

accuracy_knn = (61+37)/(61+37+2+0)
accuracy_knn

## [1] 0.98
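
Typing the counts by hand works, but the same figure can be computed directly from the label vectors, which avoids transcription errors. The sketch below uses stand-in vectors built to reproduce the counts in the table above (61, 0, 2, 37); with the real objects, `bc_test_labels` and `bc_test_pred` would take the place of the hypothetical `actual` and `predicted`.

```r
lv <- c("Benign", "Malignant")

# stand-in vectors that reproduce the counts in the table above
actual    <- factor(rep(lv, times = c(61, 39)), levels = lv)
predicted <- factor(c(rep("Benign", 61),      # 61 benign cases, all predicted benign
                      rep("Benign", 2),       # 2 malignant cases predicted benign
                      rep("Malignant", 37)),  # 37 malignant cases predicted malignant
                    levels = lv)

confusion <- table(actual, predicted)

# accuracy is the share of predictions that match the true label
accuracy <- mean(predicted == actual)
accuracy
```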

Step 5: Improving the model performance

use the scale() function to z-score standardize a data frame

bc_z <- as.data.frame(scale(bc[-1]))

confirm that the transformation was applied correctly

summary(bc_z$area_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4532 -0.6666 -0.2949  0.0000  0.3632  5.2459
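
For reference, `scale()` with its default arguments subtracts the column mean and divides by the sample standard deviation. The sketch below checks this on a small made-up vector; the values are illustrative only.

```r
# hypothetical values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: centre on the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so convert it to a plain vector to compare
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
```

Unlike min-max normalization, z-scores have no fixed minimum or maximum, which is why the summary above shows a maximum above 5.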

create training and test datasets

bc_train <- bc_z[1:469, ]
bc_test <- bc_z[470:569, ]

re-classify test cases

bc_test_pred <- knn(train = bc_train, test = bc_test,
                      cl = bc_train_labels, k = 21)

Create the cross tabulation of predicted vs. actual

CrossTable(x = bc_test_labels, y = bc_test_pred,
           prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.924 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         5 |        34 |        39 | 
##                |     0.128 |     0.872 |     0.390 | 
##                |     0.076 |     1.000 |           | 
##                |     0.050 |     0.340 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        66 |        34 |       100 | 
##                |     0.660 |     0.340 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

Calculate accuracy of z-score method

accuracy_z = (61+34)/(61+34+5+0)
accuracy_z

## [1] 0.95

try several different values of k

bc_n_train <- bc_n[1:469, ]
bc_n_test <- bc_n[470:569, ]

start time

strt<-Sys.time()

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 1)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        58 |         3 |        61 | 
##                |     0.951 |     0.049 |     0.610 | 
##                |     0.983 |     0.073 |           | 
##                |     0.580 |     0.030 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         1 |        38 |        39 | 
##                |     0.026 |     0.974 |     0.390 | 
##                |     0.017 |     0.927 |           | 
##                |     0.010 |     0.380 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        59 |        41 |       100 | 
##                |     0.590 |     0.410 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 5)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.968 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         2 |        37 |        39 | 
##                |     0.051 |     0.949 |     0.390 | 
##                |     0.032 |     1.000 |           | 
##                |     0.020 |     0.370 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        63 |        37 |       100 | 
##                |     0.630 |     0.370 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 11)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.953 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         3 |        36 |        39 | 
##                |     0.077 |     0.923 |     0.390 | 
##                |     0.047 |     1.000 |           | 
##                |     0.030 |     0.360 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        64 |        36 |       100 | 
##                |     0.640 |     0.360 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 15)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.953 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         3 |        36 |        39 | 
##                |     0.077 |     0.923 |     0.390 | 
##                |     0.047 |     1.000 |           | 
##                |     0.030 |     0.360 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        64 |        36 |       100 | 
##                |     0.640 |     0.360 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 21)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.968 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         2 |        37 |        39 | 
##                |     0.051 |     0.949 |     0.390 | 
##                |     0.032 |     1.000 |           | 
##                |     0.020 |     0.370 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        63 |        37 |       100 | 
##                |     0.630 |     0.370 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 27)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        61 |         0 |        61 | 
##                |     1.000 |     0.000 |     0.610 | 
##                |     0.938 |     0.000 |           | 
##                |     0.610 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |         4 |        35 |        39 | 
##                |     0.103 |     0.897 |     0.390 | 
##                |     0.062 |     1.000 |           | 
##                |     0.040 |     0.350 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        65 |        35 |       100 | 
##                |     0.650 |     0.350 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

end time

print(Sys.time()-strt)

## Time difference of 0.05782104 secs
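
Reading the six tables above, accuracy on the 100 test cases was 96% at k = 1, 98% at k = 5 and k = 21, 97% at k = 11 and k = 15, and 96% at k = 27. The repeated calls can be collapsed into a loop that returns one accuracy per k. The sketch below runs on synthetic data so that it is self-contained; the data frame `x`, the labels `y`, and the seed are assumptions for illustration, and in the analysis above `bc_n_train`, `bc_n_test`, `bc_train_labels`, and `bc_test_labels` would take their place.

```r
library(class)

set.seed(42)  # hypothetical seed, for reproducible synthetic data

# synthetic stand-in: two features in [0, 1], class determined by their sum
n <- 200
x <- data.frame(f1 = runif(n), f2 = runif(n))
y <- factor(ifelse(x$f1 + x$f2 > 1, "Malignant", "Benign"))

train_idx <- 1:150
k_values  <- c(1, 5, 11, 15, 21, 27)

# fit k-NN once per k and record the share of correct test predictions
accuracy_by_k <- sapply(k_values, function(k) {
  pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
              cl = y[train_idx], k = k)
  mean(pred == y[-train_idx])
})
names(accuracy_by_k) <- k_values
accuracy_by_k
```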

METHOD 2: NAIVE BAYES

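Step 3: training a model on the data

note that bc_train and bc_test were overwritten in Method 1, Step 5, so the models below are trained and tested on the z-score standardized features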

library(e1071)

bc_classifier <- naiveBayes(bc_train, bc_train_labels)

Step 4: evaluating the model

bc_eval_pred <- predict(bc_classifier,bc_test)
head(bc_eval_pred)

## [1] Benign    Benign    Benign    Benign    Malignant Benign   
## Levels: Benign Malignant

create the cross tabulation of predicted vs. actual

CrossTable(bc_eval_pred, bc_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | actual 
##    predicted |    Benign | Malignant | Row Total | 
## -------------|-----------|-----------|-----------|
##       Benign |        60 |         4 |        64 | 
##              |     0.984 |     0.103 |           | 
## -------------|-----------|-----------|-----------|
##    Malignant |         1 |        35 |        36 | 
##              |     0.016 |     0.897 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        61 |        39 |       100 | 
##              |     0.610 |     0.390 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Comment: k-NN was 98% accurate. Naive Bayes is 95% accurate (60 + 35 correct out of 100).

Step 5: improving the model performance

bc_classifier2 <- naiveBayes(bc_train, bc_train_labels, laplace = 1)
bc_eval_pred2 <- predict(bc_classifier2, bc_test)
CrossTable(bc_eval_pred2, bc_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | actual 
##    predicted |    Benign | Malignant | Row Total | 
## -------------|-----------|-----------|-----------|
##       Benign |        60 |         4 |        64 | 
##              |     0.984 |     0.103 |           | 
## -------------|-----------|-----------|-----------|
##    Malignant |         1 |        35 |        36 | 
##              |     0.016 |     0.897 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        61 |        39 |       100 | 
##              |     0.610 |     0.390 |           | 
## -------------|-----------|-----------|-----------|
## 
##

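Comment: Accuracy is exactly the same (95%). This is expected, because Laplace smoothing adjusts the counts of categorical predictors and all 30 features here are numeric.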
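
For numeric predictors, `naiveBayes()` in e1071 models each feature with a normal distribution per class, so there are no category counts for a Laplace correction to adjust. The sketch below illustrates the idea by hand for a single feature; the toy values and the names `x`, `y`, and `x_new` are assumptions, and this is a simplified illustration rather than the package's implementation.

```r
# hypothetical one-feature training data (z-scored values), for illustration only
x <- c(-1.0, -0.8, -0.5, 0.9, 1.2, 1.5)
y <- factor(c("Benign", "Benign", "Benign",
              "Malignant", "Malignant", "Malignant"))

# per-class prior, mean, and standard deviation (all in factor-level order)
prior <- as.numeric(prop.table(table(y)))
mu    <- tapply(x, y, mean)
sigma <- tapply(x, y, sd)

# Bayes' rule with normal likelihoods for a new observation
x_new     <- 0.7
unnorm    <- prior * dnorm(x_new, mean = mu, sd = sigma)
posterior <- unnorm / sum(unnorm)
names(posterior) <- levels(y)
posterior
```

With several features, the per-feature densities would be multiplied together under the independence assumption that gives the method its name.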

METHOD 3: DECISION TREES

Step 3: Training a model on the data

build the simplest decision tree

library(C50)
bc_tree <- C5.0(bc_train, bc_train_labels)

display simple facts about the tree

bc_tree

## 
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels)
## 
## Classification Tree
## Number of samples: 469 
## Number of predictors: 30 
## 
## Tree size: 9 
## 
## Non-standard options: attempt to group attributes

display detailed information about the tree

summary(bc_tree)

## 
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 13 16:12:31 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 469 cases (31 attributes) from undefined.data
## 
## Decision tree:
## 
## points_worst > 0.4669509:
## :...area_worst > -0.1989668: Malignant (135)
## :   area_worst <= -0.1989668:
## :   :...texture_worst <= -0.3168145: Benign (6)
## :       texture_worst > -0.3168145: Malignant (10/1)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332:
##     :...concavity_worst <= -0.3326002: Benign (4/1)
##     :   concavity_worst > -0.3326002: Malignant (15)
##     area_worst <= 0.1182332:
##     :...points_worst <= -0.06246884: Benign (255/3)
##         points_worst > -0.06246884:
##         :...texture_worst > 1.227215: Malignant (4)
##             texture_worst <= 1.227215:
##             :...dimension_mean <= -0.9061305: Malignant (4)
##                 dimension_mean > -0.9061305: Benign (36/2)
## 
## 
## Evaluation on training data (469 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       9    7( 1.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     295     1    (a): class Benign
##       6   167    (b): class Malignant
## 
## 
##  Attribute usage:
## 
##  100.00% area_worst
##  100.00% points_worst
##   12.79% texture_worst
##    8.53% dimension_mean
##    4.05% concavity_worst
## 
## 
## Time: 0.0 secs

Step 4: Evaluating Tree model performance

create a factor vector of predictions on test data

bc_tree_pred <- predict(bc_tree, bc_test)

cross tabulation of predicted versus actual classes

CrossTable(bc_test_labels, bc_tree_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual class', 'predicted class'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | predicted class 
## actual class |    Benign | Malignant | Row Total | 
## -------------|-----------|-----------|-----------|
##       Benign |        60 |         1 |        61 | 
##              |     0.600 |     0.010 |           | 
## -------------|-----------|-----------|-----------|
##    Malignant |         4 |        35 |        39 | 
##              |     0.040 |     0.350 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        64 |        36 |       100 | 
## -------------|-----------|-----------|-----------|
## 
##

Comment: Accuracy of the tree is 95%
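
Accuracy alone hides which kind of error the tree makes. In a screening setting, a malignant case predicted as benign is usually the more costly mistake, so it helps to report sensitivity and specificity as well. The sketch below computes them from the counts in the cross tabulation above; the object name `tree_tab` is illustrative.

```r
# counts taken from the cross tabulation above (rows = actual, columns = predicted)
tree_tab <- matrix(c(60,  1,
                      4, 35),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))

accuracy    <- sum(diag(tree_tab)) / sum(tree_tab)
# share of malignant cases that the tree correctly flags as malignant
sensitivity <- tree_tab["Malignant", "Malignant"] / sum(tree_tab["Malignant", ])
# share of benign cases that the tree correctly labels as benign
specificity <- tree_tab["Benign", "Benign"] / sum(tree_tab["Benign", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Here the tree misses 4 of the 39 malignant cases, which the 95% accuracy figure does not show on its own.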

Step 5: Improving model performance

Boosting the accuracy of decision trees

boosted decision tree with 10 trials

bc_boost10 <- C5.0(bc_train, bc_train_labels,
                    trials = 10)

summarize the boosted tree

summary(bc_boost10)

## 
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels, trials = 10)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 13 16:12:31 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 469 cases (31 attributes) from undefined.data
## 
## -----  Trial 0:  -----
## 
## Decision tree:
## 
## points_worst > 0.4669509:
## :...area_worst > -0.1989668: Malignant (135)
## :   area_worst <= -0.1989668:
## :   :...texture_worst <= -0.3168145: Benign (6)
## :       texture_worst > -0.3168145: Malignant (10/1)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332:
##     :...concavity_worst <= -0.3326002: Benign (4/1)
##     :   concavity_worst > -0.3326002: Malignant (15)
##     area_worst <= 0.1182332:
##     :...points_worst <= -0.06246884: Benign (255/3)
##         points_worst > -0.06246884:
##         :...texture_worst > 1.227215: Malignant (4)
##             texture_worst <= 1.227215:
##             :...dimension_mean <= -0.9061305: Malignant (4)
##                 dimension_mean > -0.9061305: Benign (36/2)
## 
## -----  Trial 1:  -----
## 
## Decision tree:
## 
## area_se > -0.1144639: Malignant (177.5/12.1)
## area_se <= -0.1144639:
## :...points_mean <= 0.01522708: Benign (236.5/21.8)
##     points_mean > 0.01522708: Malignant (54.9/12.8)
## 
## -----  Trial 2:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.2094719:
## :...smoothness_worst <= -1.413279: Benign (6.8/0.6)
## :   smoothness_worst > -1.413279: Malignant (103.6/1.8)
## perimeter_worst <= 0.2094719:
## :...area_worst <= -0.5753563: Benign (132.4/0.6)
##     area_worst > -0.5753563:
##     :...texture_worst <= -0.8244404: Benign (57.2)
##         texture_worst > -0.8244404:
##         :...smoothness_mean <= -0.4458351: Benign (47.9/4.3)
##             smoothness_mean > -0.4458351:
##             :...area_se <= -0.4540915: Benign (16.9/2.3)
##                 area_se > -0.4540915: Malignant (104.2/14.2)
## 
## -----  Trial 3:  -----
## 
## Decision tree:
## 
## area_worst <= -0.3991927:
## :...area_se <= 0.1869143: Benign (176/4.8)
## :   area_se > 0.1869143: Malignant (12.6/2.4)
## area_worst > -0.3991927:
## :...perimeter_worst > 0.5903955: Malignant (63.3)
##     perimeter_worst <= 0.5903955:
##     :...smoothness_worst > 0.1546662:
##         :...concavity_worst <= -0.3326002: Benign (8.6)
##         :   concavity_worst > -0.3326002: Malignant (82.2/6.5)
##         smoothness_worst <= 0.1546662:
##         :...perimeter_se > 1.622244: Malignant (10.2)
##             perimeter_se <= 1.622244:
##             :...smoothness_se > 0.2857672: Malignant (9.5)
##                 smoothness_se <= 0.2857672:
##                 :...dimension_se <= -0.9179285: Malignant (12.5/3.7)
##                     dimension_se > -0.9179285: Benign (94/1.3)
## 
## -----  Trial 4:  -----
## 
## Decision tree:
## 
## points_worst > 0.8548878: Malignant (56.1/0.3)
## points_worst <= 0.8548878:
## :...texture_mean > 0.469736:
##     :...points_worst <= -0.4923942: Benign (29.3/7.8)
##     :   points_worst > -0.4923942: Malignant (75.9/3.7)
##     texture_mean <= 0.469736:
##     :...perimeter_worst <= 0.007106229: Benign (226.1/8.1)
##         perimeter_worst > 0.007106229:
##         :...texture_worst <= -0.9285688: Benign (33.9/1)
##             texture_worst > -0.9285688: Malignant (47.6/13.1)
## 
## -----  Trial 5:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.5903955: Malignant (49.3)
## perimeter_worst <= 0.5903955:
## :...points_worst > 0.9187833: Malignant (22)
##     points_worst <= 0.9187833:
##     :...perimeter_se > 0.6058501: Malignant (29.7/5.3)
##         perimeter_se <= 0.6058501:
##         :...texture_worst > 1.167015: Malignant (47.5/14.6)
##             texture_worst <= 1.167015:
##             :...area_worst <= -0.2481451: Benign (189)
##                 area_worst > -0.2481451:
##                 :...area_mean <= -0.1491532: Malignant (23.3/2.2)
##                     area_mean > -0.1491532:
##                     :...smoothness_se <= 0.2201556: Benign (101/10.4)
##                         smoothness_se > 0.2201556: Malignant (7.2/1.4)
## 
## -----  Trial 6:  -----
## 
## Decision tree:
## 
## area_worst > -0.02368133:
## :...concavity_worst <= -0.197429: Benign (28.8/9.6)
## :   concavity_worst > -0.197429: Malignant (78.6/0.4)
## area_worst <= -0.02368133:
## :...symmetry_worst <= -1.491504: Malignant (26.2/8)
##     symmetry_worst > -1.491504:
##     :...points_worst <= 0.3209041: Benign (255.4/10.9)
##         points_worst > 0.3209041:
##         :...texture_mean <= 0.1605082: Benign (60.3/17.2)
##             texture_mean > 0.1605082: Malignant (19.7)
## 
## -----  Trial 7:  -----
## 
## Decision tree:
## 
## area_worst > 0.1182332:
## :...symmetry_mean <= -1.14035: Benign (8.8/0.6)
## :   symmetry_mean > -1.14035: Malignant (82.6/3.5)
## area_worst <= 0.1182332:
## :...points_worst > 0.6890638: Malignant (37.8/3.4)
##     points_worst <= 0.6890638:
##     :...area_se > -0.1120459: Malignant (55.2/24.9)
##         area_se <= -0.1120459:
##         :...points_worst <= -0.3265702: Benign (149.6)
##             points_worst > -0.3265702:
##             :...texture_worst > 1.227215: Malignant (27.1/5.7)
##                 texture_worst <= 1.227215:
##                 :...concavity_se <= -0.4738517: Malignant (21.3/7.1)
##                     concavity_se > -0.4738517: Benign (86.6/1.7)
## 
## -----  Trial 8:  -----
## 
## Decision tree:
## 
## area_worst > 0.1182332:
## :...smoothness_worst <= -1.413279: Benign (10.3/1.1)
## :   smoothness_worst > -1.413279: Malignant (75.1/3.2)
## area_worst <= 0.1182332:
## :...texture_worst <= -0.7886463: Benign (101.3)
##     texture_worst > -0.7886463:
##     :...points_mean > 0.004145421: Malignant (75.6/18.1)
##         points_mean <= 0.004145421:
##         :...area_se > 0.1117346: Malignant (11)
##             area_se <= 0.1117346:
##             :...concavity_worst <= -0.3326002: Benign (107.1)
##                 concavity_worst > -0.3326002:
##                 :...compactness_se <= -0.7012516: Malignant (16.2)
##                     compactness_se > -0.7012516: Benign (72.5/3.1)
## 
## -----  Trial 9:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.3880298: Malignant (54/2.4)
## perimeter_worst <= 0.3880298:
## :...smoothness_worst > 1.945978: Malignant (20.8/1)
##     smoothness_worst <= 1.945978:
##     :...perimeter_worst > -0.004797633:
##         :...concavity_worst <= -0.26118: Benign (25.6)
##         :   concavity_worst > -0.26118: Malignant (70.4/24.5)
##         perimeter_worst <= -0.004797633:
##         :...symmetry_worst > 1.060726: Malignant (15.3/6.7)
##             symmetry_worst <= 1.060726:
##             :...compactness_mean > -0.7191631: Benign (204.5/0.8)
##                 compactness_mean <= -0.7191631:
##                 :...points_mean <= -0.5759667: Benign (57.7/3.7)
##                     points_mean > -0.5759667: Malignant (20.7/3.2)
## 
## 
## Evaluation on training data (469 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0      9    7( 1.5%)
##    1      3   40( 8.5%)
##    2      7   32( 6.8%)
##    3      9   17( 3.6%)
##    4      6   25( 5.3%)
##    5      8   34( 7.2%)
##    6      6   28( 6.0%)
##    7      8   35( 7.5%)
##    8      8   13( 2.8%)
##    9      8   29( 6.2%)
## boost              0( 0.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     296          (a): class Benign
##           173    (b): class Malignant
## 
## 
##  Attribute usage:
## 
##  100.00% area_se
##  100.00% perimeter_worst
##  100.00% area_worst
##  100.00% smoothness_worst
##  100.00% points_worst
##   80.81% texture_mean
##   79.10% texture_worst
##   78.25% concavity_worst
##   72.28% points_mean
##   71.22% perimeter_se
##   66.95% symmetry_worst
##   58.21% compactness_mean
##   28.78% symmetry_mean
##   24.95% smoothness_mean
##   19.40% smoothness_se
##   16.20% dimension_se
##   15.57% concavity_se
##   13.22% area_mean
##   10.23% compactness_se
##    8.53% dimension_mean
## 
## 
## Time: 0.1 secs

create a factor vector of predictions on test data

bc_tree_pred10 <- predict(bc_boost10, bc_test)

cross tabulation of predicted versus actual classes

CrossTable(bc_test_labels, bc_tree_pred10,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual class', 'predicted class'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | predicted class 
## actual class |    Benign | Malignant | Row Total | 
## -------------|-----------|-----------|-----------|
##       Benign |        60 |         1 |        61 | 
##              |     0.600 |     0.010 |           | 
## -------------|-----------|-----------|-----------|
##    Malignant |         1 |        38 |        39 | 
##              |     0.010 |     0.380 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        61 |        39 |       100 | 
## -------------|-----------|-----------|-----------|
## 
##

Comment: The accuracy improves to 98% (higher than Naive Bayes, equal to k-NN)
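
The test-set accuracies reported so far can be gathered in one place for comparison. The values below are copied from the tables and comments above; the vector name and labels are illustrative.

```r
# test-set accuracies reported above (100 test cases each)
accuracy <- c(
  "k-NN (min-max, k = 21)"   = 0.98,
  "k-NN (z-score, k = 21)"   = 0.95,
  "Naive Bayes"              = 0.95,
  "C5.0 tree"                = 0.95,
  "C5.0 boosted (10 trials)" = 0.98
)

# best models first
sort(accuracy, decreasing = TRUE)
```

With only 100 test cases, each misclassification moves accuracy by one percentage point, so the three-point gap between the models rests on three cases.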

boosted decision tree with 20 trials

bc_boost20 <- C5.0(bc_train, bc_train_labels,
                   trials = 20)

Examine the boosted tree

summary(bc_boost20)

## 
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels, trials = 20)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Oct 13 16:12:31 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 469 cases (31 attributes) from undefined.data
## 
## -----  Trial 0:  -----
## 
## Decision tree:
## 
## points_worst > 0.4669509:
## :...area_worst > -0.1989668: Malignant (135)
## :   area_worst <= -0.1989668:
## :   :...texture_worst <= -0.3168145: Benign (6)
## :       texture_worst > -0.3168145: Malignant (10/1)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332:
##     :...concavity_worst <= -0.3326002: Benign (4/1)
##     :   concavity_worst > -0.3326002: Malignant (15)
##     area_worst <= 0.1182332:
##     :...points_worst <= -0.06246884: Benign (255/3)
##         points_worst > -0.06246884:
##         :...texture_worst > 1.227215: Malignant (4)
##             texture_worst <= 1.227215:
##             :...dimension_mean <= -0.9061305: Malignant (4)
##                 dimension_mean > -0.9061305: Benign (36/2)
## 
## -----  Trial 1:  -----
## 
## Decision tree:
## 
## area_se > -0.1144639: Malignant (177.5/12.1)
## area_se <= -0.1144639:
## :...points_mean <= 0.01522708: Benign (236.5/21.8)
##     points_mean > 0.01522708: Malignant (54.9/12.8)
## 
## -----  Trial 2:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.2094719:
## :...smoothness_worst <= -1.413279: Benign (6.8/0.6)
## :   smoothness_worst > -1.413279: Malignant (103.6/1.8)
## perimeter_worst <= 0.2094719:
## :...area_worst <= -0.5753563: Benign (132.4/0.6)
##     area_worst > -0.5753563:
##     :...texture_worst <= -0.8244404: Benign (57.2)
##         texture_worst > -0.8244404:
##         :...smoothness_mean <= -0.4458351: Benign (47.9/4.3)
##             smoothness_mean > -0.4458351:
##             :...area_se <= -0.4540915: Benign (16.9/2.3)
##                 area_se > -0.4540915: Malignant (104.2/14.2)
## 
## -----  Trial 3:  -----
## 
## Decision tree:
## 
## area_worst <= -0.3991927:
## :...area_se <= 0.1869143: Benign (176/4.8)
## :   area_se > 0.1869143: Malignant (12.6/2.4)
## area_worst > -0.3991927:
## :...perimeter_worst > 0.5903955: Malignant (63.3)
##     perimeter_worst <= 0.5903955:
##     :...smoothness_worst > 0.1546662:
##         :...concavity_worst <= -0.3326002: Benign (8.6)
##         :   concavity_worst > -0.3326002: Malignant (82.2/6.5)
##         smoothness_worst <= 0.1546662:
##         :...perimeter_se > 1.622244: Malignant (10.2)
##             perimeter_se <= 1.622244:
##             :...smoothness_se > 0.2857672: Malignant (9.5)
##                 smoothness_se <= 0.2857672:
##                 :...dimension_se <= -0.9179285: Malignant (12.5/3.7)
##                     dimension_se > -0.9179285: Benign (94/1.3)
## 
## -----  Trial 4:  -----
## 
## Decision tree:
## 
## points_worst > 0.8548878: Malignant (56.1/0.3)
## points_worst <= 0.8548878:
## :...texture_mean > 0.469736:
##     :...points_worst <= -0.4923942: Benign (29.3/7.8)
##     :   points_worst > -0.4923942: Malignant (75.9/3.7)
##     texture_mean <= 0.469736:
##     :...perimeter_worst <= 0.007106229: Benign (226.1/8.1)
##         perimeter_worst > 0.007106229:
##         :...texture_worst <= -0.9285688: Benign (33.9/1)
##             texture_worst > -0.9285688: Malignant (47.6/13.1)
## 
## -----  Trial 5:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.5903955: Malignant (49.3)
## perimeter_worst <= 0.5903955:
## :...points_worst > 0.9187833: Malignant (22)
##     points_worst <= 0.9187833:
##     :...perimeter_se > 0.6058501: Malignant (29.7/5.3)
##         perimeter_se <= 0.6058501:
##         :...texture_worst > 1.167015: Malignant (47.5/14.6)
##             texture_worst <= 1.167015:
##             :...area_worst <= -0.2481451: Benign (189)
##                 area_worst > -0.2481451:
##                 :...area_mean <= -0.1491532: Malignant (23.3/2.2)
##                     area_mean > -0.1491532:
##                     :...smoothness_se <= 0.2201556: Benign (101/10.4)
##                         smoothness_se > 0.2201556: Malignant (7.2/1.4)
## 
## -----  Trial 6:  -----
## 
## Decision tree:
## 
## area_worst > -0.02368133:
## :...concavity_worst <= -0.197429: Benign (28.8/9.6)
## :   concavity_worst > -0.197429: Malignant (78.6/0.4)
## area_worst <= -0.02368133:
## :...symmetry_worst <= -1.491504: Malignant (26.2/8)
##     symmetry_worst > -1.491504:
##     :...points_worst <= 0.3209041: Benign (255.4/10.9)
##         points_worst > 0.3209041:
##         :...texture_mean <= 0.1605082: Benign (60.3/17.2)
##             texture_mean > 0.1605082: Malignant (19.7)
## 
## -----  Trial 7:  -----
## 
## Decision tree:
## 
## area_worst > 0.1182332:
## :...symmetry_mean <= -1.14035: Benign (8.8/0.6)
## :   symmetry_mean > -1.14035: Malignant (82.6/3.5)
## area_worst <= 0.1182332:
## :...points_worst > 0.6890638: Malignant (37.8/3.4)
##     points_worst <= 0.6890638:
##     :...area_se > -0.1120459: Malignant (55.2/24.9)
##         area_se <= -0.1120459:
##         :...points_worst <= -0.3265702: Benign (149.6)
##             points_worst > -0.3265702:
##             :...texture_worst > 1.227215: Malignant (27.1/5.7)
##                 texture_worst <= 1.227215:
##                 :...concavity_se <= -0.4738517: Malignant (21.3/7.1)
##                     concavity_se > -0.4738517: Benign (86.6/1.7)
## 
## -----  Trial 8:  -----
## 
## Decision tree:
## 
## area_worst > 0.1182332:
## :...smoothness_worst <= -1.413279: Benign (10.3/1.1)
## :   smoothness_worst > -1.413279: Malignant (75.1/3.2)
## area_worst <= 0.1182332:
## :...texture_worst <= -0.7886463: Benign (101.3)
##     texture_worst > -0.7886463:
##     :...points_mean > 0.004145421: Malignant (75.6/18.1)
##         points_mean <= 0.004145421:
##         :...area_se > 0.1117346: Malignant (11)
##             area_se <= 0.1117346:
##             :...concavity_worst <= -0.3326002: Benign (107.1)
##                 concavity_worst > -0.3326002:
##                 :...compactness_se <= -0.7012516: Malignant (16.2)
##                     compactness_se > -0.7012516: Benign (72.5/3.1)
## 
## -----  Trial 9:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.3880298: Malignant (54/2.4)
## perimeter_worst <= 0.3880298:
## :...smoothness_worst > 1.945978: Malignant (20.8/1)
##     smoothness_worst <= 1.945978:
##     :...perimeter_worst > -0.004797633:
##         :...concavity_worst <= -0.26118: Benign (25.6)
##         :   concavity_worst > -0.26118: Malignant (70.4/24.5)
##         perimeter_worst <= -0.004797633:
##         :...symmetry_worst > 1.060726: Malignant (15.3/6.7)
##             symmetry_worst <= 1.060726:
##             :...compactness_mean > -0.7191631: Benign (204.5/0.8)
##                 compactness_mean <= -0.7191631:
##                 :...points_mean <= -0.5759667: Benign (57.7/3.7)
##                     points_mean > -0.5759667: Malignant (20.7/3.2)
## 
## -----  Trial 10:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.5903955: Malignant (31.6)
## perimeter_worst <= 0.5903955:
## :...texture_worst <= 0.01997586:
##     :...area_se <= -0.3742955: Benign (130.3)
##     :   area_se > -0.3742955:
##     :   :...smoothness_worst <= 0.3429949: Benign (123.3/7.8)
##     :       smoothness_worst > 0.3429949: Malignant (16/3.4)
##     texture_worst > 0.01997586:
##     :...concavity_worst <= -0.3326002: Benign (52.2)
##         concavity_worst > -0.3326002:
##         :...symmetry_mean <= -1.041861: Benign (23.6/1.7)
##             symmetry_mean > -1.041861: Malignant (92.1/23.5)
## 
## -----  Trial 11:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.3880298: Malignant (50.1/4.4)
## perimeter_worst <= 0.3880298:
## :...texture_mean <= 0.3465099: Benign (313/30.6)
##     texture_mean > 0.3465099:
##     :...points_mean > -0.1247111: Malignant (23.7/0.9)
##         points_mean <= -0.1247111:
##         :...compactness_se <= -0.7515079: Malignant (24.9/6.3)
##             compactness_se > -0.7515079: Benign (57.3)
## 
## -----  Trial 12:  -----
## 
## Decision tree:
## 
## points_worst > 0.4669509:
## :...concavity_se <= 3.177171: Malignant (82.6/8.6)
## :   concavity_se > 3.177171: Benign (10.8)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332: Malignant (46.6/14.8)
##     area_worst <= 0.1182332:
##     :...concavity_worst <= -0.3081544: Benign (161.7/15)
##         concavity_worst > -0.3081544:
##         :...compactness_se <= -0.7492743: Malignant (8.2)
##             compactness_se > -0.7492743: Benign (159.1/23.1)
## 
## -----  Trial 13:  -----
## 
## Decision tree:
## 
## concavity_se <= -0.6603616: Benign (67.1)
## concavity_se > -0.6603616:
## :...points_worst > 0.9187833: Malignant (33)
##     points_worst <= 0.9187833:
##     :...concavity_worst <= -1.011332: Malignant (25.3/0.1)
##         concavity_worst > -1.011332:
##         :...area_worst <= -0.2716804:
##             :...smoothness_worst <= 2.532863: Benign (138.8/2.2)
##             :   smoothness_worst > 2.532863: Malignant (8.2/1.8)
##             area_worst > -0.2716804:
##             :...concavity_worst <= -0.3460215: Benign (32.8/1.7)
##                 concavity_worst > -0.3460215:
##                 :...texture_mean <= -0.8578511: Benign (25.5/1.1)
##                     texture_mean > -0.8578511:
##                     :...dimension_se <= 0.8155851: Malignant (121.2/15.3)
##                         dimension_se > 0.8155851: Benign (17.2/3.2)
## 
## -----  Trial 14:  -----
## 
## Decision tree:
## 
## area_se > -0.1612864:
## :...dimension_se <= -0.8517927: Benign (17.4)
## :   dimension_se > -0.8517927:
## :   :...perimeter_worst <= -0.6598076: Benign (10.4)
## :       perimeter_worst > -0.6598076:
## :       :...perimeter_se <= -0.2567243: Benign (8.6/0.1)
## :           perimeter_se > -0.2567243: Malignant (125/13.1)
## area_se <= -0.1612864:
## :...smoothness_worst <= 0.1721852: Benign (178.3/6.6)
##     smoothness_worst > 0.1721852:
##     :...perimeter_worst > -0.147644: Malignant (38.2/2)
##         perimeter_worst <= -0.147644:
##         :...symmetry_worst > 3.17169: Malignant (5.9)
##             symmetry_worst <= 3.17169:
##             :...texture_worst <= 1.145864: Benign (66.6)
##                 texture_worst > 1.145864: Malignant (18.8/5.9)
## 
## -----  Trial 15:  -----
## 
## Decision tree:
## 
## area_worst > -0.02368133:
## :...texture_mean <= -1.208929: Benign (10)
## :   texture_mean > -1.208929:
## :   :...dimension_worst <= -1.066139: Benign (14.8/4.5)
## :       dimension_worst > -1.066139: Malignant (85.5/2.7)
## area_worst <= -0.02368133:
## :...concavity_mean > 0.6159157: Malignant (25.8/6.8)
##     concavity_mean <= 0.6159157:
##     :...texture_worst <= -0.7886463: Benign (79.1)
##         texture_worst > -0.7886463:
##         :...dimension_se > -0.2063073: Benign (112.2/6.3)
##             dimension_se <= -0.2063073:
##             :...points_se > 0.3539323: Malignant (27.2)
##                 points_se <= 0.3539323:
##                 :...compactness_mean > 0.2605245: Malignant (16.7)
##                     compactness_mean <= 0.2605245:
##                     :...smoothness_se <= -1.252608: Malignant (7.3/0.2)
##                         smoothness_se > -1.252608: Benign (90.5/6.7)
## 
## -----  Trial 16:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.5903955: Malignant (62.4)
## perimeter_worst <= 0.5903955:
## :...points_worst > 0.9187833: Malignant (20.7)
##     points_worst <= 0.9187833:
##     :...texture_worst <= -0.8895206: Benign (87.2)
##         texture_worst > -0.8895206:
##         :...radius_worst <= -0.6039818: Benign (79.3/2.9)
##             radius_worst > -0.6039818:
##             :...smoothness_se > 0.6198201: Malignant (27.4)
##                 smoothness_se <= 0.6198201:
##                 :...symmetry_mean <= -0.9762015: Benign (34.1)
##                     symmetry_mean > -0.9762015:
##                     :...symmetry_se > 0.5936947: Benign (28.9)
##                         symmetry_se <= 0.5936947:
##                         :...area_se <= -0.4822289: Benign (13.4)
##                             area_se > -0.4822289:
##                             :...symmetry_mean <= -0.6697919: Malignant (41.5/1.8)
##                                 symmetry_mean > -0.6697919:
##                                 :...compactness_mean <= 0.3570921: Benign (50.3/13.3)
##                                     compactness_mean > 0.3570921: Malignant (23.9/0.4)
## 
## -----  Trial 17:  -----
## 
## Decision tree:
## 
## perimeter_worst > 0.2957749: Malignant (79.3/6.7)
## perimeter_worst <= 0.2957749:
## :...texture_worst <= 0.573158:
##     :...symmetry_worst > 1.151242: Malignant (18.5/5)
##     :   symmetry_worst <= 1.151242:
##     :   :...texture_mean <= 0.788264: Benign (246.6/17.8)
##     :       texture_mean > 0.788264: Malignant (17.3/5.8)
##     texture_worst > 0.573158:
##     :...perimeter_worst <= -0.6127874: Benign (22.9)
##         perimeter_worst > -0.6127874:
##         :...smoothness_worst <= -1.0629: Benign (9.3)
##             smoothness_worst > -1.0629: Malignant (74.9/6.3)
## 
## -----  Trial 18:  -----
## 
## Decision tree:
## 
## points_mean <= 0.004145421:
## :...compactness_se > -0.7515079: Benign (210.4/11.7)
## :   compactness_se <= -0.7515079:
## :   :...concavity_se <= -0.6603616: Benign (39)
## :       concavity_se > -0.6603616: Malignant (52.8/12.1)
## points_mean > 0.004145421:
## :...concavity_worst <= -0.2276268: Benign (22.5/2.3)
##     concavity_worst > -0.2276268:
##     :...area_worst <= -0.2899466: Benign (23.9/9)
##         area_worst > -0.2899466: Malignant (120.4/7.9)
## 
## -----  Trial 19:  -----
## 
## Decision tree:
## 
## concavity_worst <= -0.3081544:
## :...radius_se <= 0.5363185: Benign (188.6/2)
## :   radius_se > 0.5363185: Malignant (10/1.1)
## concavity_worst > -0.3081544:
## :...area_worst > 0.1182332: Malignant (61.5)
##     area_worst <= 0.1182332:
##     :...points_worst > 0.9187833: Malignant (23.4)
##         points_worst <= 0.9187833:
##         :...texture_mean <= 0.3465099: Benign (105.1/15.6)
##             texture_mean > 0.3465099:
##             :...points_mean <= -0.1247111: Benign (41.8/17.2)
##                 points_mean > -0.1247111: Malignant (38.6)
## 
## 
## Evaluation on training data (469 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0      9    7( 1.5%)
##    1      3   40( 8.5%)
##    2      7   32( 6.8%)
##    3      9   17( 3.6%)
##    4      6   25( 5.3%)
##    5      8   34( 7.2%)
##    6      6   28( 6.0%)
##    7      8   35( 7.5%)
##    8      8   13( 2.8%)
##    9      8   29( 6.2%)
##   10      7   28( 6.0%)
##   11      5   32( 6.8%)
##   12      6   17( 3.6%)
##   13      9   18( 3.8%)
##   14      9   23( 4.9%)
##   15     10   30( 6.4%)
##   16     11   18( 3.8%)
##   17      7   26( 5.5%)
##   18      6   34( 7.2%)
##   19      7   12( 2.6%)
## boost              0( 0.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     296          (a): class Benign
##           173    (b): class Malignant
## 
## 
##  Attribute usage:
## 
##  100.00% texture_mean
##  100.00% points_mean
##  100.00% area_se
##  100.00% concavity_se
##  100.00% perimeter_worst
##  100.00% area_worst
##  100.00% smoothness_worst
##  100.00% concavity_worst
##  100.00% points_worst
##   96.59% perimeter_se
##   80.81% texture_worst
##   80.60% dimension_se
##   69.72% symmetry_worst
##   66.95% concavity_mean
##   64.61% compactness_se
##   63.33% compactness_mean
##   58.42% symmetry_mean
##   51.17% radius_worst
##   46.91% radius_se
##   43.50% smoothness_se
##   31.98% dimension_worst
##   25.37% points_se
##   24.95% smoothness_mean
##   21.11% symmetry_se
##   13.22% area_mean
##    8.53% dimension_mean
## 
## 
## Time: 0.1 secs

Create a factor vector of predictions on test data

bc_tree_pred20 <- predict(bc_boost20, bc_test)

Cross tabulation of predicted versus actual classes

CrossTable(bc_test_labels, bc_tree_pred20,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual class', 'predicted class'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | predicted class 
## actual class |    Benign | Malignant | Row Total | 
## -------------|-----------|-----------|-----------|
##       Benign |        61 |         0 |        61 | 
##              |     0.610 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##    Malignant |         1 |        38 |        39 | 
##              |     0.010 |     0.380 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        62 |        38 |       100 | 
## -------------|-----------|-----------|-----------|
## 
##

Comment: With 20 trials the accuracy rises to 99% (99 of 100 test cases classified correctly).
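Because a missed malignant case is more costly than a false alarm, it can help to look at sensitivity and specificity alongside overall accuracy. The sketch below recomputes all three from the cell counts of the 20-trial table; the object names are illustrative and were not part of the original analysis.

```r
# Confusion matrix for the 20-trial boosted tree, copied from the table above
# (rows = actual class, columns = predicted class).
conf20 <- matrix(c(61,  0,
                    1, 38),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual    = c("Benign", "Malignant"),
                                 predicted = c("Benign", "Malignant")))

# Overall accuracy: share of all test cases on the diagonal.
accuracy20 <- sum(diag(conf20)) / sum(conf20)

# Sensitivity: share of actual malignant cases predicted as malignant.
sensitivity20 <- conf20["Malignant", "Malignant"] / sum(conf20["Malignant", ])

# Specificity: share of actual benign cases predicted as benign.
specificity20 <- conf20["Benign", "Benign"] / sum(conf20["Benign", ])

print(c(accuracy = accuracy20,
        sensitivity = sensitivity20,
        specificity = specificity20))
```

The single remaining error is a malignant case predicted as benign, so sensitivity (38 of 39) is slightly below specificity (61 of 61).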

METHOD 4: NEURAL NETWORKS ANN

library(MASS) # Needed to sample multivariate Gaussian distributions 
library(neuralnet) # The package for neural networks in R
library(readr)

Step 1: Collecting and downloading data

cancer <- read_csv("wisc_bc_data.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_integer(),
##   diagnosis = col_character()
## )

## See spec(...) for full column specifications.

Step 2: Exploring and Preparing Data

head(cancer)

## # A tibble: 6 x 32
##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
##      <int> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
## 1 87139402 B                12.3         12.4           78.8      464.
## 2  8910251 B                10.6         19.0           69.3      346.
## 3   905520 B                11.0         16.8           70.9      373.
## 4   868871 B                11.3         13.4           73        385.
## 5  9012568 B                15.2         13.2           97.6      712.
## 6   906539 B                11.6         19.0           74.2      410.
## # ... with 26 more variables: smoothness_mean <dbl>,
## #   compactness_mean <dbl>, concavity_mean <dbl>, points_mean <dbl>,
## #   symmetry_mean <dbl>, dimension_mean <dbl>, radius_se <dbl>,
## #   texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## #   smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## #   points_se <dbl>, symmetry_se <dbl>, dimension_se <dbl>,
## #   radius_worst <dbl>, texture_worst <dbl>, perimeter_worst <dbl>,
## #   area_worst <dbl>, smoothness_worst <dbl>, compactness_worst <dbl>,
## #   concavity_worst <dbl>, points_worst <dbl>, symmetry_worst <dbl>,
## #   dimension_worst <dbl>

cancer <- cancer[, -c(1)]

cancer[, 1] <- as.numeric(cancer[, 1] == "M")

# Name the first column "label" and the 30 predictor columns V2 ... V31
colnames(cancer) <- c("label", paste0("V", 2:31))

cancer[1, ]

## # A tibble: 1 x 31
##   label    V2    V3    V4    V5    V6     V7     V8    V9   V10    V11
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1     0  12.3  12.4  78.8  464. 0.103 0.0698 0.0399 0.037 0.196 0.0596
## # ... with 20 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>,
## #   V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>,
## #   V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>,
## #   V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>

Normalize the cancer data

cancer <- as.data.frame(lapply(cancer, normalize))
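The `normalize` helper is not defined in this section. The sketch below assumes it is a min-max rescaling function that maps each column to the range 0 to 1; that definition is an assumption, and the small `toy` data frame is hypothetical.

```r
# Assumed definition of `normalize`: min-max rescaling to the interval [0, 1].
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical toy data with a 0/1 label and one measurement column.
toy <- data.frame(label = c(0, 1, 0, 1),
                  V2    = c(12.3, 10.6, 11.0, 15.2))

# Apply the rescaling column by column, as done for `cancer` above.
toy_norm <- as.data.frame(lapply(toy, normalize))

print(range(toy_norm$V2))   # smallest value maps to 0, largest to 1
print(toy_norm$label)       # a 0/1 label is left unchanged by min-max scaling
```

Under this assumption, applying `normalize` to every column leaves the 0/1 `label` column unchanged, which is why it can be rescaled together with the predictors.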

Create training and test data sets

cancer_train <- cancer[1:469, ]
cancer_test <- cancer[470:569, ]

Step 3: Training an ANN model on the data

set.seed(12345) # to guarantee repeatable results
cancer.model <- neuralnet(formula = label ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + 
                            V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31,data = cancer_train)
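Typing out all 30 predictors is error-prone. As a minimal sketch, the same formula can be assembled from the column names in base R; `predictors` and `nn_formula` are illustrative names that do not appear in the original code.

```r
# Predictor names V2 ... V31, matching the renamed columns of `cancer`.
predictors <- paste0("V", 2:31)

# Build "label ~ V2 + V3 + ... + V31" as a string, then convert it to a formula.
nn_formula <- as.formula(paste("label ~", paste(predictors, collapse = " + ")))

print(nn_formula)
```

Passing `nn_formula` as the `formula` argument, for example `neuralnet(nn_formula, data = cancer_train)`, should be equivalent to the call above.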

plot(cancer.model)

Alternative plot

library(NeuralNetTools)

# plotnet
par(mar = numeric(4), family = 'serif')
plotnet(cancer.model, alpha = 0.6)

Step 4: Evaluating model performance

Obtain model results

cancer.model.results <- compute(cancer.model, cancer_test[2:31])

Obtain predicted diagnosis values

predicted_label <- cancer.model.results$net.result

Examine the correlation between predicted and actual values

cor(predicted_label, cancer_test$label)

##              [,1]
## [1,] 0.9886599505

Step 5: Improving model performance

A more complex neural network topology with 5 hidden neurons

set.seed(12345) # to guarantee repeatable results
cancer.model2 <- neuralnet(label ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31,data = cancer_train, hidden = 5, act.fct = "logistic")

Plot the network

plot(cancer.model2)

Plotnet

par(mar = numeric(4), family = 'serif')
plotnet(cancer.model2, alpha = 0.6)

Evaluate the results as we did before

model_results2 <- compute(cancer.model2, cancer_test[2:31])
predicted_label2 <- model_results2$net.result
cor(predicted_label2, cancer_test$label)

##              [,1]
## [1,] 0.9754664796

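Comment: The correlation between predicted and actual labels dropped to 0.975, lower than the 0.989 obtained with 1 hidden node. Note that this figure is a correlation, not a classification accuracy.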
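To compare the network with the other three methods on the same footing, its continuous outputs can be turned into class labels and scored as an accuracy. The sketch below uses made-up scores and a 0.5 cut-off; both the values and the cut-off are assumptions for illustration, not results from the models above.

```r
# Hypothetical network outputs and true labels (illustrative values only).
predicted_score <- c(0.02, 0.91, 0.47, 0.88, 0.10, 0.65)
actual_label    <- c(0, 1, 0, 1, 0, 0)

# Assumed decision rule: scores above 0.5 are treated as Malignant (1).
predicted_class <- as.numeric(predicted_score > 0.5)

# Accuracy = share of cases where the predicted class matches the true label.
accuracy <- mean(predicted_class == actual_label)
print(accuracy)

# Cross tabulation of actual versus predicted classes.
print(table(actual = actual_label, predicted = predicted_class))
```

With the objects from this report, the analogous step would be `as.numeric(predicted_label > 0.5)` compared against `cancer_test$label`. Since `neuralnet` uses a linear output by default, the scores need not lie strictly between 0 and 1, so the cut-off is a judgment call.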

Classification Models for Breast Cancer Detection

Chi Nguyen & Arzoo Amiri

08/10/2018
