Identifying cancerous cells in the human body is a lengthy process, and the longer cancer goes undetected, the harder it is to treat. In this project we therefore aim to develop a model that automatically detects breast cancer with high accuracy, based on measurements of biopsied cells from women with abnormal breast masses. We use the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository, which contains 569 observations. Our goal is to build a classification model that labels each biopsy as either Benign (non-cancerous) or Malignant (cancerous).
We apply four machine learning methods to this classification task: Nearest Neighbors, Naive Bayes, Decision Trees, and Neural Networks. We measure the performance of each model and compare accuracy rates to find the best one. The analysis is carried out in RStudio.
bc <- read.csv("wisc_bc_data.csv")
examine the structure of the bc data frame
str(bc)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
drop the id feature
bc <- bc[-1]
table of diagnosis
table(bc$diagnosis)
##
## B M
## 357 212
recode diagnosis as a factor
bc$diagnosis <- factor(bc$diagnosis, levels = c("B", "M"),
                       labels = c("Benign", "Malignant"))
table of proportions with more informative labels
round(prop.table(table(bc$diagnosis)) * 100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
pie chart
pie(table(bc$diagnosis), main = "Diagnosis", col = c("Blue", "Red"))
lbls <- paste(names(table(bc$diagnosis)), round(prop.table(table(bc$diagnosis)) * 100, digits = 1), "%")
pie(table(bc$diagnosis), labels = lbls, main = "Diagnosis", col = c("Blue", "Red"))
box()
create training and test data sets
bc_train <- bc[1:469, -1]
bc_test <- bc[470:569, -1]
create labels for training and test data
bc_train_labels <- bc[1:469, 1]
bc_test_labels <- bc[470:569, 1]
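The split above takes the first 469 rows for training and the last 100 for testing, which is only safe if the rows are already in random order (the source does not say either way). If they were not, a random split could be drawn instead. The following is a minimal sketch on a made-up toy data frame; the names `toy`, `train_idx`, `toy_train`, and `toy_test` and the seed are illustrative assumptions, not part of the original analysis.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Toy stand-in for the real data (hypothetical values)
toy <- data.frame(
  diagnosis = factor(rep(c("Benign", "Malignant"), times = c(12, 8))),
  feature   = rnorm(20)
)

# Draw 15 of the 20 row indices at random for training; the rest form the test set
train_idx <- sample(nrow(toy), size = 15)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```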
normalize the numeric features: create a min-max normalization function
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
normalize the bc data
bc_n <- as.data.frame(lapply(bc[2:31], normalize))
confirm that normalization worked
summary(bc_n$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
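To make the effect of min-max normalization concrete, here is a small worked example on two made-up vectors (`small` and `large` are illustrative, not taken from the dataset). It redefines the same `normalize()` function so the block runs on its own, and shows that features measured on very different scales end up on the same 0 to 1 range, which matters for a distance-based method such as k-NN.

```r
# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy vectors on very different scales (made-up values)
small <- c(1, 2, 3, 4, 5)
large <- c(10, 20, 30, 40, 50)

normalize(small)
normalize(large)

# Both vectors map to the same rescaled values
all.equal(normalize(small), normalize(large))
```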
create normalized training and test data
bc_n_train <- bc_n[1:469, ]
bc_n_test <- bc_n[470:569, ]
load the “class” library
library(class)
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test,
                    cl = bc_train_labels, k = 21)
load the “gmodels” library
library(gmodels)
Create the cross tabulation of predicted vs. actual
CrossTable(x = bc_test_labels, y = bc_test_pred,
           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## ---------------|-----------|-----------|-----------|
##
##
Accuracy of k-NN model:
accuracy_knn <- (61 + 37) / (61 + 37 + 2 + 0)
accuracy_knn
## [1] 0.98
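Rather than typing the counts into a formula by hand, accuracy can be derived from the confusion matrix itself. The sketch below rebuilds the matrix from the counts in the cross tabulation above; the helper `accuracy()` and the toy label vectors are illustrative assumptions, not part of the original code.

```r
# Confusion matrix for the min-max normalized k-NN model (k = 21),
# with counts copied from the cross tabulation above
conf_knn <- matrix(c(61, 0,
                     2, 37),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

accuracy(conf_knn)

# Equivalent shortcut when the label vectors are available
# (toy vectors, for illustration only)
actual    <- factor(c("Benign", "Malignant", "Benign", "Malignant"))
predicted <- factor(c("Benign", "Malignant", "Benign", "Benign"))
mean(predicted == actual)
```

With the real objects, the shortcut would read `mean(bc_test_pred == bc_test_labels)`.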
use the scale() function to z-score standardize a data frame
bc_z <- as.data.frame(scale(bc[-1]))
confirm that the transformation was applied correctly
summary(bc_z$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4532 -0.6666 -0.2949 0.0000 0.3632 5.2459
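As a quick check of what `scale()` does, the sketch below compares it with the z-score computed by hand on a made-up vector (`x` is illustrative only). Unlike min-max normalization, z-scores have no fixed minimum or maximum, which is why the summary above shows a maximum above 5.

```r
# Toy feature values (made up for illustration)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)

# After standardization the mean is 0 and the standard deviation is 1
round(mean(z_scale), 10)
sd(z_scale)
```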
create training and test datasets
bc_train <- bc_z[1:469, ]
bc_test <- bc_z[470:569, ]
re-classify test cases
bc_test_pred <- knn(train = bc_train, test = bc_test,
                    cl = bc_train_labels, k = 21)
Create the cross tabulation of predicted vs. actual
CrossTable(x = bc_test_labels, y = bc_test_pred,
           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.924 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 5 | 34 | 39 |
## | 0.128 | 0.872 | 0.390 |
## | 0.076 | 1.000 | |
## | 0.050 | 0.340 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 66 | 34 | 100 |
## | 0.660 | 0.340 | |
## ---------------|-----------|-----------|-----------|
##
##
Calculate accuracy of z-score method
accuracy_z <- (61 + 34) / (61 + 34 + 5 + 0)
accuracy_z
## [1] 0.95
try several different values of k
bc_n_train <- bc_n[1:469, ]
bc_n_test <- bc_n[470:569, ]
start time
strt <- Sys.time()
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 1)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 58 | 3 | 61 |
## | 0.951 | 0.049 | 0.610 |
## | 0.983 | 0.073 | |
## | 0.580 | 0.030 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 1 | 38 | 39 |
## | 0.026 | 0.974 | 0.390 |
## | 0.017 | 0.927 | |
## | 0.010 | 0.380 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 59 | 41 | 100 |
## | 0.590 | 0.410 | |
## ---------------|-----------|-----------|-----------|
##
##
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 5)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## ---------------|-----------|-----------|-----------|
##
##
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 11)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.953 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 3 | 36 | 39 |
## | 0.077 | 0.923 | 0.390 |
## | 0.047 | 1.000 | |
## | 0.030 | 0.360 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 64 | 36 | 100 |
## | 0.640 | 0.360 | |
## ---------------|-----------|-----------|-----------|
##
##
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 15)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.953 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 3 | 36 | 39 |
## | 0.077 | 0.923 | 0.390 |
## | 0.047 | 1.000 | |
## | 0.030 | 0.360 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 64 | 36 | 100 |
## | 0.640 | 0.360 | |
## ---------------|-----------|-----------|-----------|
##
##
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 21)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## ---------------|-----------|-----------|-----------|
##
##
bc_test_pred <- knn(train = bc_n_train, test = bc_n_test, cl = bc_train_labels, k = 27)
CrossTable(x = bc_test_labels, y = bc_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | bc_test_pred
## bc_test_labels | Benign | Malignant | Row Total |
## ---------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.938 | 0.000 | |
## | 0.610 | 0.000 | |
## ---------------|-----------|-----------|-----------|
## Malignant | 4 | 35 | 39 |
## | 0.103 | 0.897 | 0.390 |
## | 0.062 | 1.000 | |
## | 0.040 | 0.350 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 65 | 35 | 100 |
## | 0.650 | 0.350 | |
## ---------------|-----------|-----------|-----------|
##
##
end time
print(Sys.time() - strt)
## Time difference of 0.05782104 secs
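The six cross tabulations above are easier to compare side by side. The sketch below collects the misclassification counts read off those tables into one data frame and derives the accuracy for each k; the data frame name `k_results` and its column names are illustrative choices.

```r
# Misclassification counts copied from the six cross tabulations above
# (test set of 100 cases)
k_results <- data.frame(
  k               = c(1, 5, 11, 15, 21, 27),
  false_negatives = c(1, 2, 3, 3, 2, 4),  # malignant predicted as benign
  false_positives = c(3, 0, 0, 0, 0, 0)   # benign predicted as malignant
)

n_test <- 100
k_results$accuracy <- 1 - (k_results$false_negatives +
                           k_results$false_positives) / n_test
k_results
```

On this test set, k = 5 and k = 21 tie for the highest accuracy, and only k = 1 produces any false positives.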
library(e1071)
bc_classifier <- naiveBayes(bc_train, bc_train_labels)
bc_eval_pred <- predict(bc_classifier,bc_test)
head(bc_eval_pred)
## [1] Benign Benign Benign Benign Malignant Benign
## Levels: Benign Malignant
create the cross tabulation of predicted vs. actual
CrossTable(bc_eval_pred, bc_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | actual
## predicted | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 60 | 4 | 64 |
## | 0.984 | 0.103 | |
## -------------|-----------|-----------|-----------|
## Malignant | 1 | 35 | 36 |
## | 0.016 | 0.897 | |
## -------------|-----------|-----------|-----------|
## Column Total | 61 | 39 | 100 |
## | 0.610 | 0.390 | |
## -------------|-----------|-----------|-----------|
##
##
Comments: k-NN was 98% accurate. Naive Bayes is 95% accurate (60 + 35 = 95 of the 100 test cases classified correctly).
bc_classifier2 <- naiveBayes(bc_train, bc_train_labels, laplace = 1)
bc_eval_pred2 <- predict(bc_classifier2, bc_test)
CrossTable(bc_eval_pred2, bc_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | actual
## predicted | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 60 | 4 | 64 |
## | 0.984 | 0.103 | |
## -------------|-----------|-----------|-----------|
## Malignant | 1 | 35 | 36 |
## | 0.016 | 0.897 | |
## -------------|-----------|-----------|-----------|
## Column Total | 61 | 39 | 100 |
## | 0.610 | 0.390 | |
## -------------|-----------|-----------|-----------|
##
##
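Comment: Accuracy is exactly the same (95%). This is expected: Laplace smoothing only adjusts the counts of categorical predictors, and all 30 features here are numeric.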
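Setting `laplace = 1` changes nothing here because, for numeric predictors, naive Bayes models each feature within a class as a normal distribution, so there are no category counts to smooth. The following hand-rolled sketch illustrates that idea on made-up data. It is not the e1071 implementation, and the names `fit_gnb`, `predict_gnb`, `train_x`, `train_y`, and `new_cases` are hypothetical.

```r
# Hand-rolled Gaussian naive Bayes for numeric features.
# Illustration only; this is not the e1071 implementation.

# Toy training data (made-up values): two numeric features, two classes
train_x <- data.frame(
  area   = c(-1.0, -0.8, -0.6, -0.9, 0.9, 1.1, 0.7, 1.3),
  points = c(-0.7, -1.1, -0.5, -0.9, 1.0, 0.6, 1.2, 0.8)
)
train_y <- factor(rep(c("Benign", "Malignant"), each = 4))

# "Training": per-class prior, plus per-class mean and sd of every feature
fit_gnb <- function(x, y) {
  lapply(split(x, y), function(rows) {
    list(prior = nrow(rows) / nrow(x),
         mean  = sapply(rows, mean),
         sd    = sapply(rows, sd))
  })
}

# Prediction: log prior + sum of log Gaussian densities; highest score wins
predict_gnb <- function(model, newdata) {
  scores <- sapply(model, function(cls) {
    log_score <- rep(log(cls$prior), nrow(newdata))
    for (feature in names(newdata)) {
      log_score <- log_score +
        dnorm(newdata[[feature]], cls$mean[[feature]], cls$sd[[feature]],
              log = TRUE)
    }
    log_score
  })
  # Force a cases-by-classes matrix even when there is a single new case
  scores <- matrix(scores, nrow = nrow(newdata),
                   dimnames = list(NULL, names(model)))
  factor(colnames(scores)[max.col(scores, ties.method = "first")],
         levels = names(model))
}

model     <- fit_gnb(train_x, train_y)
new_cases <- data.frame(area = c(-0.7, 1.0), points = c(-0.8, 0.9))
pred      <- predict_gnb(model, new_cases)
pred
```

Nothing in this calculation involves counting category levels, which is the only place a Laplace correction would enter.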
build the simplest decision tree
library(C50)
bc_tree <- C5.0(bc_train, bc_train_labels)
display simple facts about the tree
bc_tree
##
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels)
##
## Classification Tree
## Number of samples: 469
## Number of predictors: 30
##
## Tree size: 9
##
## Non-standard options: attempt to group attributes
display detailed information about the tree
summary(bc_tree)
##
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 13 16:12:31 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 469 cases (31 attributes) from undefined.data
##
## Decision tree:
##
## points_worst > 0.4669509:
## :...area_worst > -0.1989668: Malignant (135)
## : area_worst <= -0.1989668:
## : :...texture_worst <= -0.3168145: Benign (6)
## : texture_worst > -0.3168145: Malignant (10/1)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332:
## :...concavity_worst <= -0.3326002: Benign (4/1)
## : concavity_worst > -0.3326002: Malignant (15)
## area_worst <= 0.1182332:
## :...points_worst <= -0.06246884: Benign (255/3)
## points_worst > -0.06246884:
## :...texture_worst > 1.227215: Malignant (4)
## texture_worst <= 1.227215:
## :...dimension_mean <= -0.9061305: Malignant (4)
## dimension_mean > -0.9061305: Benign (36/2)
##
##
## Evaluation on training data (469 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 9 7( 1.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 295 1 (a): class Benign
## 6 167 (b): class Malignant
##
##
## Attribute usage:
##
## 100.00% area_worst
## 100.00% points_worst
## 12.79% texture_worst
## 8.53% dimension_mean
## 4.05% concavity_worst
##
##
## Time: 0.0 secs
create a factor vector of predictions on test data
bc_tree_pred <- predict(bc_tree, bc_test)
cross tabulation of predicted versus actual classes
CrossTable(bc_test_labels, bc_tree_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual class', 'predicted class'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted class
## actual class | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 60 | 1 | 61 |
## | 0.600 | 0.010 | |
## -------------|-----------|-----------|-----------|
## Malignant | 4 | 35 | 39 |
## | 0.040 | 0.350 | |
## -------------|-----------|-----------|-----------|
## Column Total | 64 | 36 | 100 |
## -------------|-----------|-----------|-----------|
##
##
Comment: Accuracy of the tree is 95% (60 + 35 = 95 of the 100 test cases classified correctly).
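It is worth setting the test result next to the training result reported by `summary(bc_tree)`. The sketch below rebuilds both confusion matrices from the counts printed above and compares their error rates; the helper `error_rate()` and the matrix names are illustrative assumptions.

```r
# Counts copied from the C5.0 summary (training data) and from the
# cross tabulation above (test data)
train_cm <- matrix(c(295, 1,
                     6, 167),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))
test_cm <- matrix(c(60, 1,
                    4, 35),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual    = c("Benign", "Malignant"),
                                  predicted = c("Benign", "Malignant")))

# Error rate = misclassified cases (off the diagonal) / all cases
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Error rates in percent, rounded to one decimal place
round(100 * c(train = error_rate(train_cm), test = error_rate(test_cm)), 1)
```

The tree misclassifies 7 of 469 training cases but 5 of 100 test cases, so the training error understates the error on unseen data.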
Boosting the accuracy of decision trees
boosted decision tree with 10 trials
bc_boost10 <- C5.0(bc_train, bc_train_labels,
                   trials = 10)
summarize the boosted model
summary(bc_boost10)
##
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels, trials = 10)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 13 16:12:31 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 469 cases (31 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## points_worst > 0.4669509:
## :...area_worst > -0.1989668: Malignant (135)
## : area_worst <= -0.1989668:
## : :...texture_worst <= -0.3168145: Benign (6)
## : texture_worst > -0.3168145: Malignant (10/1)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332:
## :...concavity_worst <= -0.3326002: Benign (4/1)
## : concavity_worst > -0.3326002: Malignant (15)
## area_worst <= 0.1182332:
## :...points_worst <= -0.06246884: Benign (255/3)
## points_worst > -0.06246884:
## :...texture_worst > 1.227215: Malignant (4)
## texture_worst <= 1.227215:
## :...dimension_mean <= -0.9061305: Malignant (4)
## dimension_mean > -0.9061305: Benign (36/2)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## area_se > -0.1144639: Malignant (177.5/12.1)
## area_se <= -0.1144639:
## :...points_mean <= 0.01522708: Benign (236.5/21.8)
## points_mean > 0.01522708: Malignant (54.9/12.8)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## perimeter_worst > 0.2094719:
## :...smoothness_worst <= -1.413279: Benign (6.8/0.6)
## : smoothness_worst > -1.413279: Malignant (103.6/1.8)
## perimeter_worst <= 0.2094719:
## :...area_worst <= -0.5753563: Benign (132.4/0.6)
## area_worst > -0.5753563:
## :...texture_worst <= -0.8244404: Benign (57.2)
## texture_worst > -0.8244404:
## :...smoothness_mean <= -0.4458351: Benign (47.9/4.3)
## smoothness_mean > -0.4458351:
## :...area_se <= -0.4540915: Benign (16.9/2.3)
## area_se > -0.4540915: Malignant (104.2/14.2)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## area_worst <= -0.3991927:
## :...area_se <= 0.1869143: Benign (176/4.8)
## : area_se > 0.1869143: Malignant (12.6/2.4)
## area_worst > -0.3991927:
## :...perimeter_worst > 0.5903955: Malignant (63.3)
## perimeter_worst <= 0.5903955:
## :...smoothness_worst > 0.1546662:
## :...concavity_worst <= -0.3326002: Benign (8.6)
## : concavity_worst > -0.3326002: Malignant (82.2/6.5)
## smoothness_worst <= 0.1546662:
## :...perimeter_se > 1.622244: Malignant (10.2)
## perimeter_se <= 1.622244:
## :...smoothness_se > 0.2857672: Malignant (9.5)
## smoothness_se <= 0.2857672:
## :...dimension_se <= -0.9179285: Malignant (12.5/3.7)
## dimension_se > -0.9179285: Benign (94/1.3)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## points_worst > 0.8548878: Malignant (56.1/0.3)
## points_worst <= 0.8548878:
## :...texture_mean > 0.469736:
## :...points_worst <= -0.4923942: Benign (29.3/7.8)
## : points_worst > -0.4923942: Malignant (75.9/3.7)
## texture_mean <= 0.469736:
## :...perimeter_worst <= 0.007106229: Benign (226.1/8.1)
## perimeter_worst > 0.007106229:
## :...texture_worst <= -0.9285688: Benign (33.9/1)
## texture_worst > -0.9285688: Malignant (47.6/13.1)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## perimeter_worst > 0.5903955: Malignant (49.3)
## perimeter_worst <= 0.5903955:
## :...points_worst > 0.9187833: Malignant (22)
## points_worst <= 0.9187833:
## :...perimeter_se > 0.6058501: Malignant (29.7/5.3)
## perimeter_se <= 0.6058501:
## :...texture_worst > 1.167015: Malignant (47.5/14.6)
## texture_worst <= 1.167015:
## :...area_worst <= -0.2481451: Benign (189)
## area_worst > -0.2481451:
## :...area_mean <= -0.1491532: Malignant (23.3/2.2)
## area_mean > -0.1491532:
## :...smoothness_se <= 0.2201556: Benign (101/10.4)
## smoothness_se > 0.2201556: Malignant (7.2/1.4)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## area_worst > -0.02368133:
## :...concavity_worst <= -0.197429: Benign (28.8/9.6)
## : concavity_worst > -0.197429: Malignant (78.6/0.4)
## area_worst <= -0.02368133:
## :...symmetry_worst <= -1.491504: Malignant (26.2/8)
## symmetry_worst > -1.491504:
## :...points_worst <= 0.3209041: Benign (255.4/10.9)
## points_worst > 0.3209041:
## :...texture_mean <= 0.1605082: Benign (60.3/17.2)
## texture_mean > 0.1605082: Malignant (19.7)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## area_worst > 0.1182332:
## :...symmetry_mean <= -1.14035: Benign (8.8/0.6)
## : symmetry_mean > -1.14035: Malignant (82.6/3.5)
## area_worst <= 0.1182332:
## :...points_worst > 0.6890638: Malignant (37.8/3.4)
## points_worst <= 0.6890638:
## :...area_se > -0.1120459: Malignant (55.2/24.9)
## area_se <= -0.1120459:
## :...points_worst <= -0.3265702: Benign (149.6)
## points_worst > -0.3265702:
## :...texture_worst > 1.227215: Malignant (27.1/5.7)
## texture_worst <= 1.227215:
## :...concavity_se <= -0.4738517: Malignant (21.3/7.1)
## concavity_se > -0.4738517: Benign (86.6/1.7)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## area_worst > 0.1182332:
## :...smoothness_worst <= -1.413279: Benign (10.3/1.1)
## : smoothness_worst > -1.413279: Malignant (75.1/3.2)
## area_worst <= 0.1182332:
## :...texture_worst <= -0.7886463: Benign (101.3)
## texture_worst > -0.7886463:
## :...points_mean > 0.004145421: Malignant (75.6/18.1)
## points_mean <= 0.004145421:
## :...area_se > 0.1117346: Malignant (11)
## area_se <= 0.1117346:
## :...concavity_worst <= -0.3326002: Benign (107.1)
## concavity_worst > -0.3326002:
## :...compactness_se <= -0.7012516: Malignant (16.2)
## compactness_se > -0.7012516: Benign (72.5/3.1)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## perimeter_worst > 0.3880298: Malignant (54/2.4)
## perimeter_worst <= 0.3880298:
## :...smoothness_worst > 1.945978: Malignant (20.8/1)
## smoothness_worst <= 1.945978:
## :...perimeter_worst > -0.004797633:
## :...concavity_worst <= -0.26118: Benign (25.6)
## : concavity_worst > -0.26118: Malignant (70.4/24.5)
## perimeter_worst <= -0.004797633:
## :...symmetry_worst > 1.060726: Malignant (15.3/6.7)
## symmetry_worst <= 1.060726:
## :...compactness_mean > -0.7191631: Benign (204.5/0.8)
## compactness_mean <= -0.7191631:
## :...points_mean <= -0.5759667: Benign (57.7/3.7)
## points_mean > -0.5759667: Malignant (20.7/3.2)
##
##
## Evaluation on training data (469 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 9 7( 1.5%)
## 1 3 40( 8.5%)
## 2 7 32( 6.8%)
## 3 9 17( 3.6%)
## 4 6 25( 5.3%)
## 5 8 34( 7.2%)
## 6 6 28( 6.0%)
## 7 8 35( 7.5%)
## 8 8 13( 2.8%)
## 9 8 29( 6.2%)
## boost 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 296 (a): class Benign
## 173 (b): class Malignant
##
##
## Attribute usage:
##
## 100.00% area_se
## 100.00% perimeter_worst
## 100.00% area_worst
## 100.00% smoothness_worst
## 100.00% points_worst
## 80.81% texture_mean
## 79.10% texture_worst
## 78.25% concavity_worst
## 72.28% points_mean
## 71.22% perimeter_se
## 66.95% symmetry_worst
## 58.21% compactness_mean
## 28.78% symmetry_mean
## 24.95% smoothness_mean
## 19.40% smoothness_se
## 16.20% dimension_se
## 15.57% concavity_se
## 13.22% area_mean
## 10.23% compactness_se
## 8.53% dimension_mean
##
##
## Time: 0.1 secs
create a factor vector of predictions on test data
bc_tree_pred10 <- predict(bc_boost10, bc_test)
cross tabulation of predicted versus actual classes
CrossTable(bc_test_labels, bc_tree_pred10,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual class', 'predicted class'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted class
## actual class | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 60 | 1 | 61 |
## | 0.600 | 0.010 | |
## -------------|-----------|-----------|-----------|
## Malignant | 1 | 38 | 39 |
## | 0.010 | 0.380 | |
## -------------|-----------|-----------|-----------|
## Column Total | 61 | 39 | 100 |
## -------------|-----------|-----------|-----------|
##
##
Comment: Accuracy improves to 98% (higher than Naive Bayes, equal to k-NN).
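Accuracy alone hides which kind of error each model makes, and in cancer screening a missed malignancy is usually the costlier mistake. The sketch below gathers the counts from the cross tabulations above (treating Malignant as the positive class) and derives accuracy, sensitivity, and specificity for the models evaluated so far. The data frame `results` and its column names are illustrative choices.

```r
# Counts copied from the cross tabulations above; Malignant is the positive class
results <- data.frame(
  model = c("k-NN (min-max, k = 21)", "k-NN (z-score, k = 21)",
            "Naive Bayes", "C5.0 tree", "C5.0 boosted (10 trials)"),
  tn = c(61, 61, 60, 60, 60),  # benign correctly predicted
  fp = c(0, 0, 1, 1, 1),       # benign predicted as malignant
  fn = c(2, 5, 4, 4, 1),       # malignant predicted as benign
  tp = c(37, 34, 35, 35, 38)   # malignant correctly predicted
)

results$accuracy    <- with(results, (tp + tn) / (tp + tn + fp + fn))
results$sensitivity <- with(results, tp / (tp + fn))  # share of cancers caught
results$specificity <- with(results, tn / (tn + fp))  # share of benign cleared

print(results[, c("model", "accuracy", "sensitivity", "specificity")],
      digits = 3)
```

The boosted tree and the min-max k-NN model tie on accuracy, but the boosted tree misses only one malignant case against two for k-NN, at the cost of one false positive.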
TRIALS = 20
boosted decision tree with 20 trials
bc_boost20 <- C5.0(bc_train, bc_train_labels,
                   trials = 20)
summarize the boosted model
summary(bc_boost20)
##
## Call:
## C5.0.default(x = bc_train, y = bc_train_labels, trials = 20)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Oct 13 16:12:31 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 469 cases (31 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## points_worst > 0.4669509:
## :...area_worst > -0.1989668: Malignant (135)
## : area_worst <= -0.1989668:
## : :...texture_worst <= -0.3168145: Benign (6)
## : texture_worst > -0.3168145: Malignant (10/1)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332:
## :...concavity_worst <= -0.3326002: Benign (4/1)
## : concavity_worst > -0.3326002: Malignant (15)
## area_worst <= 0.1182332:
## :...points_worst <= -0.06246884: Benign (255/3)
## points_worst > -0.06246884:
## :...texture_worst > 1.227215: Malignant (4)
## texture_worst <= 1.227215:
## :...dimension_mean <= -0.9061305: Malignant (4)
## dimension_mean > -0.9061305: Benign (36/2)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## area_se > -0.1144639: Malignant (177.5/12.1)
## area_se <= -0.1144639:
## :...points_mean <= 0.01522708: Benign (236.5/21.8)
## points_mean > 0.01522708: Malignant (54.9/12.8)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## perimeter_worst > 0.2094719:
## :...smoothness_worst <= -1.413279: Benign (6.8/0.6)
## : smoothness_worst > -1.413279: Malignant (103.6/1.8)
## perimeter_worst <= 0.2094719:
## :...area_worst <= -0.5753563: Benign (132.4/0.6)
## area_worst > -0.5753563:
## :...texture_worst <= -0.8244404: Benign (57.2)
## texture_worst > -0.8244404:
## :...smoothness_mean <= -0.4458351: Benign (47.9/4.3)
## smoothness_mean > -0.4458351:
## :...area_se <= -0.4540915: Benign (16.9/2.3)
## area_se > -0.4540915: Malignant (104.2/14.2)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## area_worst <= -0.3991927:
## :...area_se <= 0.1869143: Benign (176/4.8)
## : area_se > 0.1869143: Malignant (12.6/2.4)
## area_worst > -0.3991927:
## :...perimeter_worst > 0.5903955: Malignant (63.3)
## perimeter_worst <= 0.5903955:
## :...smoothness_worst > 0.1546662:
## :...concavity_worst <= -0.3326002: Benign (8.6)
## : concavity_worst > -0.3326002: Malignant (82.2/6.5)
## smoothness_worst <= 0.1546662:
## :...perimeter_se > 1.622244: Malignant (10.2)
## perimeter_se <= 1.622244:
## :...smoothness_se > 0.2857672: Malignant (9.5)
## smoothness_se <= 0.2857672:
## :...dimension_se <= -0.9179285: Malignant (12.5/3.7)
## dimension_se > -0.9179285: Benign (94/1.3)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## points_worst > 0.8548878: Malignant (56.1/0.3)
## points_worst <= 0.8548878:
## :...texture_mean > 0.469736:
## :...points_worst <= -0.4923942: Benign (29.3/7.8)
## : points_worst > -0.4923942: Malignant (75.9/3.7)
## texture_mean <= 0.469736:
## :...perimeter_worst <= 0.007106229: Benign (226.1/8.1)
## perimeter_worst > 0.007106229:
## :...texture_worst <= -0.9285688: Benign (33.9/1)
## texture_worst > -0.9285688: Malignant (47.6/13.1)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## perimeter_worst > 0.5903955: Malignant (49.3)
## perimeter_worst <= 0.5903955:
## :...points_worst > 0.9187833: Malignant (22)
## points_worst <= 0.9187833:
## :...perimeter_se > 0.6058501: Malignant (29.7/5.3)
## perimeter_se <= 0.6058501:
## :...texture_worst > 1.167015: Malignant (47.5/14.6)
## texture_worst <= 1.167015:
## :...area_worst <= -0.2481451: Benign (189)
## area_worst > -0.2481451:
## :...area_mean <= -0.1491532: Malignant (23.3/2.2)
## area_mean > -0.1491532:
## :...smoothness_se <= 0.2201556: Benign (101/10.4)
## smoothness_se > 0.2201556: Malignant (7.2/1.4)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## area_worst > -0.02368133:
## :...concavity_worst <= -0.197429: Benign (28.8/9.6)
## : concavity_worst > -0.197429: Malignant (78.6/0.4)
## area_worst <= -0.02368133:
## :...symmetry_worst <= -1.491504: Malignant (26.2/8)
## symmetry_worst > -1.491504:
## :...points_worst <= 0.3209041: Benign (255.4/10.9)
## points_worst > 0.3209041:
## :...texture_mean <= 0.1605082: Benign (60.3/17.2)
## texture_mean > 0.1605082: Malignant (19.7)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## area_worst > 0.1182332:
## :...symmetry_mean <= -1.14035: Benign (8.8/0.6)
## : symmetry_mean > -1.14035: Malignant (82.6/3.5)
## area_worst <= 0.1182332:
## :...points_worst > 0.6890638: Malignant (37.8/3.4)
## points_worst <= 0.6890638:
## :...area_se > -0.1120459: Malignant (55.2/24.9)
## area_se <= -0.1120459:
## :...points_worst <= -0.3265702: Benign (149.6)
## points_worst > -0.3265702:
## :...texture_worst > 1.227215: Malignant (27.1/5.7)
## texture_worst <= 1.227215:
## :...concavity_se <= -0.4738517: Malignant (21.3/7.1)
## concavity_se > -0.4738517: Benign (86.6/1.7)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## area_worst > 0.1182332:
## :...smoothness_worst <= -1.413279: Benign (10.3/1.1)
## : smoothness_worst > -1.413279: Malignant (75.1/3.2)
## area_worst <= 0.1182332:
## :...texture_worst <= -0.7886463: Benign (101.3)
## texture_worst > -0.7886463:
## :...points_mean > 0.004145421: Malignant (75.6/18.1)
## points_mean <= 0.004145421:
## :...area_se > 0.1117346: Malignant (11)
## area_se <= 0.1117346:
## :...concavity_worst <= -0.3326002: Benign (107.1)
## concavity_worst > -0.3326002:
## :...compactness_se <= -0.7012516: Malignant (16.2)
## compactness_se > -0.7012516: Benign (72.5/3.1)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## perimeter_worst > 0.3880298: Malignant (54/2.4)
## perimeter_worst <= 0.3880298:
## :...smoothness_worst > 1.945978: Malignant (20.8/1)
## smoothness_worst <= 1.945978:
## :...perimeter_worst > -0.004797633:
## :...concavity_worst <= -0.26118: Benign (25.6)
## : concavity_worst > -0.26118: Malignant (70.4/24.5)
## perimeter_worst <= -0.004797633:
## :...symmetry_worst > 1.060726: Malignant (15.3/6.7)
## symmetry_worst <= 1.060726:
## :...compactness_mean > -0.7191631: Benign (204.5/0.8)
## compactness_mean <= -0.7191631:
## :...points_mean <= -0.5759667: Benign (57.7/3.7)
## points_mean > -0.5759667: Malignant (20.7/3.2)
##
## ----- Trial 10: -----
##
## Decision tree:
##
## perimeter_worst > 0.5903955: Malignant (31.6)
## perimeter_worst <= 0.5903955:
## :...texture_worst <= 0.01997586:
## :...area_se <= -0.3742955: Benign (130.3)
## : area_se > -0.3742955:
## : :...smoothness_worst <= 0.3429949: Benign (123.3/7.8)
## : smoothness_worst > 0.3429949: Malignant (16/3.4)
## texture_worst > 0.01997586:
## :...concavity_worst <= -0.3326002: Benign (52.2)
## concavity_worst > -0.3326002:
## :...symmetry_mean <= -1.041861: Benign (23.6/1.7)
## symmetry_mean > -1.041861: Malignant (92.1/23.5)
##
## ----- Trial 11: -----
##
## Decision tree:
##
## perimeter_worst > 0.3880298: Malignant (50.1/4.4)
## perimeter_worst <= 0.3880298:
## :...texture_mean <= 0.3465099: Benign (313/30.6)
## texture_mean > 0.3465099:
## :...points_mean > -0.1247111: Malignant (23.7/0.9)
## points_mean <= -0.1247111:
## :...compactness_se <= -0.7515079: Malignant (24.9/6.3)
## compactness_se > -0.7515079: Benign (57.3)
##
## ----- Trial 12: -----
##
## Decision tree:
##
## points_worst > 0.4669509:
## :...concavity_se <= 3.177171: Malignant (82.6/8.6)
## : concavity_se > 3.177171: Benign (10.8)
## points_worst <= 0.4669509:
## :...area_worst > 0.1182332: Malignant (46.6/14.8)
## area_worst <= 0.1182332:
## :...concavity_worst <= -0.3081544: Benign (161.7/15)
## concavity_worst > -0.3081544:
## :...compactness_se <= -0.7492743: Malignant (8.2)
## compactness_se > -0.7492743: Benign (159.1/23.1)
##
## ----- Trial 13: -----
##
## Decision tree:
##
## concavity_se <= -0.6603616: Benign (67.1)
## concavity_se > -0.6603616:
## :...points_worst > 0.9187833: Malignant (33)
## points_worst <= 0.9187833:
## :...concavity_worst <= -1.011332: Malignant (25.3/0.1)
## concavity_worst > -1.011332:
## :...area_worst <= -0.2716804:
## :...smoothness_worst <= 2.532863: Benign (138.8/2.2)
## : smoothness_worst > 2.532863: Malignant (8.2/1.8)
## area_worst > -0.2716804:
## :...concavity_worst <= -0.3460215: Benign (32.8/1.7)
## concavity_worst > -0.3460215:
## :...texture_mean <= -0.8578511: Benign (25.5/1.1)
## texture_mean > -0.8578511:
## :...dimension_se <= 0.8155851: Malignant (121.2/15.3)
## dimension_se > 0.8155851: Benign (17.2/3.2)
##
## ----- Trial 14: -----
##
## Decision tree:
##
## area_se > -0.1612864:
## :...dimension_se <= -0.8517927: Benign (17.4)
## : dimension_se > -0.8517927:
## : :...perimeter_worst <= -0.6598076: Benign (10.4)
## : perimeter_worst > -0.6598076:
## : :...perimeter_se <= -0.2567243: Benign (8.6/0.1)
## : perimeter_se > -0.2567243: Malignant (125/13.1)
## area_se <= -0.1612864:
## :...smoothness_worst <= 0.1721852: Benign (178.3/6.6)
## smoothness_worst > 0.1721852:
## :...perimeter_worst > -0.147644: Malignant (38.2/2)
## perimeter_worst <= -0.147644:
## :...symmetry_worst > 3.17169: Malignant (5.9)
## symmetry_worst <= 3.17169:
## :...texture_worst <= 1.145864: Benign (66.6)
## texture_worst > 1.145864: Malignant (18.8/5.9)
##
## ----- Trial 15: -----
##
## Decision tree:
##
## area_worst > -0.02368133:
## :...texture_mean <= -1.208929: Benign (10)
## : texture_mean > -1.208929:
## : :...dimension_worst <= -1.066139: Benign (14.8/4.5)
## : dimension_worst > -1.066139: Malignant (85.5/2.7)
## area_worst <= -0.02368133:
## :...concavity_mean > 0.6159157: Malignant (25.8/6.8)
## concavity_mean <= 0.6159157:
## :...texture_worst <= -0.7886463: Benign (79.1)
## texture_worst > -0.7886463:
## :...dimension_se > -0.2063073: Benign (112.2/6.3)
## dimension_se <= -0.2063073:
## :...points_se > 0.3539323: Malignant (27.2)
## points_se <= 0.3539323:
## :...compactness_mean > 0.2605245: Malignant (16.7)
## compactness_mean <= 0.2605245:
## :...smoothness_se <= -1.252608: Malignant (7.3/0.2)
## smoothness_se > -1.252608: Benign (90.5/6.7)
##
## ----- Trial 16: -----
##
## Decision tree:
##
## perimeter_worst > 0.5903955: Malignant (62.4)
## perimeter_worst <= 0.5903955:
## :...points_worst > 0.9187833: Malignant (20.7)
## points_worst <= 0.9187833:
## :...texture_worst <= -0.8895206: Benign (87.2)
## texture_worst > -0.8895206:
## :...radius_worst <= -0.6039818: Benign (79.3/2.9)
## radius_worst > -0.6039818:
## :...smoothness_se > 0.6198201: Malignant (27.4)
## smoothness_se <= 0.6198201:
## :...symmetry_mean <= -0.9762015: Benign (34.1)
## symmetry_mean > -0.9762015:
## :...symmetry_se > 0.5936947: Benign (28.9)
## symmetry_se <= 0.5936947:
## :...area_se <= -0.4822289: Benign (13.4)
## area_se > -0.4822289:
## :...symmetry_mean <= -0.6697919: Malignant (41.5/1.8)
## symmetry_mean > -0.6697919:
## :...compactness_mean <= 0.3570921: Benign (50.3/13.3)
## compactness_mean > 0.3570921: Malignant (23.9/0.4)
##
## ----- Trial 17: -----
##
## Decision tree:
##
## perimeter_worst > 0.2957749: Malignant (79.3/6.7)
## perimeter_worst <= 0.2957749:
## :...texture_worst <= 0.573158:
## :...symmetry_worst > 1.151242: Malignant (18.5/5)
## : symmetry_worst <= 1.151242:
## : :...texture_mean <= 0.788264: Benign (246.6/17.8)
## : texture_mean > 0.788264: Malignant (17.3/5.8)
## texture_worst > 0.573158:
## :...perimeter_worst <= -0.6127874: Benign (22.9)
## perimeter_worst > -0.6127874:
## :...smoothness_worst <= -1.0629: Benign (9.3)
## smoothness_worst > -1.0629: Malignant (74.9/6.3)
##
## ----- Trial 18: -----
##
## Decision tree:
##
## points_mean <= 0.004145421:
## :...compactness_se > -0.7515079: Benign (210.4/11.7)
## : compactness_se <= -0.7515079:
## : :...concavity_se <= -0.6603616: Benign (39)
## : concavity_se > -0.6603616: Malignant (52.8/12.1)
## points_mean > 0.004145421:
## :...concavity_worst <= -0.2276268: Benign (22.5/2.3)
## concavity_worst > -0.2276268:
## :...area_worst <= -0.2899466: Benign (23.9/9)
## area_worst > -0.2899466: Malignant (120.4/7.9)
##
## ----- Trial 19: -----
##
## Decision tree:
##
## concavity_worst <= -0.3081544:
## :...radius_se <= 0.5363185: Benign (188.6/2)
## : radius_se > 0.5363185: Malignant (10/1.1)
## concavity_worst > -0.3081544:
## :...area_worst > 0.1182332: Malignant (61.5)
## area_worst <= 0.1182332:
## :...points_worst > 0.9187833: Malignant (23.4)
## points_worst <= 0.9187833:
## :...texture_mean <= 0.3465099: Benign (105.1/15.6)
## texture_mean > 0.3465099:
## :...points_mean <= -0.1247111: Benign (41.8/17.2)
## points_mean > -0.1247111: Malignant (38.6)
##
##
## Evaluation on training data (469 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 9 7( 1.5%)
## 1 3 40( 8.5%)
## 2 7 32( 6.8%)
## 3 9 17( 3.6%)
## 4 6 25( 5.3%)
## 5 8 34( 7.2%)
## 6 6 28( 6.0%)
## 7 8 35( 7.5%)
## 8 8 13( 2.8%)
## 9 8 29( 6.2%)
## 10 7 28( 6.0%)
## 11 5 32( 6.8%)
## 12 6 17( 3.6%)
## 13 9 18( 3.8%)
## 14 9 23( 4.9%)
## 15 10 30( 6.4%)
## 16 11 18( 3.8%)
## 17 7 26( 5.5%)
## 18 6 34( 7.2%)
## 19 7 12( 2.6%)
## boost 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 296 (a): class Benign
## 173 (b): class Malignant
##
##
## Attribute usage:
##
## 100.00% texture_mean
## 100.00% points_mean
## 100.00% area_se
## 100.00% concavity_se
## 100.00% perimeter_worst
## 100.00% area_worst
## 100.00% smoothness_worst
## 100.00% concavity_worst
## 100.00% points_worst
## 96.59% perimeter_se
## 80.81% texture_worst
## 80.60% dimension_se
## 69.72% symmetry_worst
## 66.95% concavity_mean
## 64.61% compactness_se
## 63.33% compactness_mean
## 58.42% symmetry_mean
## 51.17% radius_worst
## 46.91% radius_se
## 43.50% smoothness_se
## 31.98% dimension_worst
## 25.37% points_se
## 24.95% smoothness_mean
## 21.11% symmetry_se
## 13.22% area_mean
## 8.53% dimension_mean
##
##
## Time: 0.1 secs
Create a factor vector of predictions on test data
bc_tree_pred20 <- predict(bc_boost20, bc_test)
Cross tabulation of predicted versus actual classes
CrossTable(bc_test_labels, bc_tree_pred20,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c('actual class', 'predicted class'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted class
## actual class | Benign | Malignant | Row Total |
## -------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 0.610 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Malignant | 1 | 38 | 39 |
## | 0.010 | 0.380 | |
## -------------|-----------|-----------|-----------|
## Column Total | 62 | 38 | 100 |
## -------------|-----------|-----------|-----------|
##
##
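Accuracy of the boosted model with 20 trials: 99% (99 of 100 test cases classified correctly; one malignant case was predicted as benign).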
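The accuracy figure can be recomputed directly from the cross tabulation above. The following is a minimal sketch that copies the four cell counts by hand; the names `conf_mat`, `accuracy`, `sensitivity`, and `specificity` are introduced here for illustration and are not part of the original analysis.

```r
# Counts copied from the CrossTable output above
# (rows = actual class, columns = predicted class).
conf_mat <- matrix(c(61, 0,
                      1, 38),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))

# Share of all test cases on the diagonal (correctly classified)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of actual malignant cases that were predicted malignant
sensitivity <- conf_mat["Malignant", "Malignant"] / sum(conf_mat["Malignant", ])

# Share of actual benign cases that were predicted benign
specificity <- conf_mat["Benign", "Benign"] / sum(conf_mat["Benign", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Reporting sensitivity alongside accuracy is useful for this problem, since a malignant case predicted as benign is likely the more costly kind of error.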
library(MASS) # Needed to sample multivariate Gaussian distributions
library(neuralnet) # The package for neural networks in R
library(readr)
cancer <- read_csv("wisc_bc_data.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## id = col_integer(),
## diagnosis = col_character()
## )
## See spec(...) for full column specifications.
head(cancer)
## # A tibble: 6 x 32
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 87139402 B 12.3 12.4 78.8 464.
## 2 8910251 B 10.6 19.0 69.3 346.
## 3 905520 B 11.0 16.8 70.9 373.
## 4 868871 B 11.3 13.4 73 385.
## 5 9012568 B 15.2 13.2 97.6 712.
## 6 906539 B 11.6 19.0 74.2 410.
## # ... with 26 more variables: smoothness_mean <dbl>,
## # compactness_mean <dbl>, concavity_mean <dbl>, points_mean <dbl>,
## # symmetry_mean <dbl>, dimension_mean <dbl>, radius_se <dbl>,
## # texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## # smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## # points_se <dbl>, symmetry_se <dbl>, dimension_se <dbl>,
## # radius_worst <dbl>, texture_worst <dbl>, perimeter_worst <dbl>,
## # area_worst <dbl>, smoothness_worst <dbl>, compactness_worst <dbl>,
## # concavity_worst <dbl>, points_worst <dbl>, symmetry_worst <dbl>,
## # dimension_worst <dbl>
cancer <- cancer[, -c(1)]
cancer[, 1] <- as.numeric(cancer[, 1] == "M")
colnames(cancer) <- c("label", paste0("V", 2:31))
cancer[1, ]
## # A tibble: 1 x 31
## label V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 12.3 12.4 78.8 464. 0.103 0.0698 0.0399 0.037 0.196 0.0596
## # ... with 20 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>,
## # V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>,
## # V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>,
## # V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>
Normalize the cancer data
cancer <- as.data.frame(lapply(cancer, normalize))
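The call above relies on a `normalize` function that is not defined in this part of the report. As a rough sketch, assuming it is a min-max rescaling to the range 0 to 1, it could look like the function below. The name `normalize_minmax` and the handling of constant columns are assumptions made here for illustration, not the definition used in the analysis.

```r
# Hypothetical stand-in for the normalize() helper used above:
# rescales a numeric vector so its minimum maps to 0 and its maximum to 1.
normalize_minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would give 0/0, so return zeros instead (assumed choice)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy values, not taken from the dataset
normalize_minmax(c(10, 15, 20))
```

Rescaling matters for the neural network because the raw features differ by orders of magnitude (for example, area values in the hundreds versus smoothness values near 0.1).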
Create training and test data sets
cancer_train <- cancer[1:469, ]
cancer_test <- cancer[470:569, ]
set.seed(12345) # to guarantee repeatable results
cancer.model <- neuralnet(formula = label ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
                            V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 +
                            V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31,
                          data = cancer_train)
plot(cancer.model)
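The long formula passed to `neuralnet` lists all thirty predictors by hand. A minimal sketch of building the same formula programmatically with base R is shown below; the names `predictors` and `nn_formula` are introduced here for illustration.

```r
# Predictor names V2 ... V31, matching the renamed columns
predictors <- paste0("V", 2:31)

# Join them with " + " and convert the resulting string to a formula object
nn_formula <- as.formula(paste("label ~", paste(predictors, collapse = " + ")))

nn_formula
```

Assuming the column names are as above, `nn_formula` could then be passed in place of the handwritten formula, for example `neuralnet(nn_formula, data = cancer_train)`.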
Alternative plot
library(NeuralNetTools)
# plotnet
par(mar = numeric(4), family = 'serif')
plotnet(cancer.model, alpha = 0.6)
Obtain model results
cancer.model.results <- compute(cancer.model, cancer_test[2:31])
Obtain predicted diagnosis values
predicted_label <- cancer.model.results$net.result
Examine the correlation between predicted and actual values
cor(predicted_label, cancer_test$label)
## [,1]
## [1,] 0.9886599505
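The correlation measures how closely the network outputs track the 0/1 labels, but it is not a classification accuracy. To get one, the continuous outputs can be cut at a threshold. The sketch below uses made-up toy vectors so that it runs on its own; the 0.5 cutoff and the names `toy_pred`, `toy_actual`, and `toy_class` are assumptions for illustration only.

```r
# Toy vectors for illustration only; in the analysis these would be
# predicted_label (network outputs) and cancer_test$label (0/1 labels).
toy_pred   <- c(0.02, 0.91, 0.47, 0.88, 0.10, 0.63)
toy_actual <- c(0,    1,    0,    1,    0,    0)

# Assumed cutoff of 0.5: outputs above it are labelled Malignant (1)
toy_class <- as.numeric(toy_pred > 0.5)

# Confusion table and share of cases classified correctly
table(actual = toy_actual, predicted = toy_class)
mean(toy_class == toy_actual)
```

Applied to the real objects, the same two lines would read `as.numeric(predicted_label > 0.5)` followed by a comparison with `cancer_test$label`, which gives a figure directly comparable with the accuracies reported for the other models.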
A more complex neural network topology with 5 hidden neurons
set.seed(12345) # to guarantee repeatable results
cancer.model2 <- neuralnet(label ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 +
                             V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 +
                             V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31,
                           data = cancer_train, hidden = 5, act.fct = "logistic")
Plot the network
plot(cancer.model2)
Plotnet
par(mar = numeric(4), family = 'serif')
plotnet(cancer.model2, alpha = 0.6)
Evaluate the results as we did before
model_results2 <- compute(cancer.model2, cancer_test[2:31])
predicted_label2 <- model_results2$net.result
cor(predicted_label2, cancer_test$label)
## [,1]
## [1,] 0.9754664796
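Comments: The correlation between predicted and actual labels dropped to 0.975 with 5 hidden nodes, lower than the 0.989 obtained with 1 hidden node. Note that these values are correlations, not classification accuracies.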