Deep Learning is a branch of Machine Learning based on a set of algorithms that attempt to mimic the human brain. The basic unit in a deep learning model is the neuron, a model inspired by the human neuron. In humans, the varying strengths of neurons’ output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron’s activation (feed-forward). “A multi-layer neural network consists of many layers of interconnected neural units, starting with an input layer to match the feature space, followed by multiple layers of non-linearity, and ending with a linear regression or classification layer to match the output space” (Arno Candel et al.).
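As a minimal sketch of that feed-forward step (all values below are made up for illustration, not taken from the model later in this paper), a single logistic neuron aggregates its weighted inputs and applies a non-linearity:

logistic <- function(z) 1 / (1 + exp(-z)) # The logistic activation function
inputs  <- c(0.5, -1.2, 3.0) # Signals arriving from connected upstream neurons
weights <- c(0.8, 0.1, 0.4)  # Synaptic strengths for each incoming signal
bias    <- -0.5
logistic(sum(inputs * weights) + bias) # This neuron's activation, passed forward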

Deep Learning has become a Data Science buzzword, particularly for its high predictive accuracy on complex problems such as image, speech, and text recognition. In exploring Deep Learning algorithms, I wanted to find one that satisfied the following criteria:

  1. Scalability
  2. Speed and memory efficiency
  3. Computational parallelization
  4. Fast convergence
  5. Regularization options (robustness to overfitting)
  6. Grid search for hyper-parameter optimization
  7. Adaptive learning rate

Neural networks, with their feedforward architecture, satisfy the core criteria of primary interest to me.

The objective of this paper is to demonstrate how we can achieve world-class predictive accuracy with basic Deep Learning models built on neural network algorithms. The data used is the Wisconsin Breast Cancer (Diagnostic) Data Set, which can be downloaded from Kaggle. The goal of the task is to predict whether a diagnosis is Malignant (M) or Benign (B). In this data, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image, in the 3-dimensional space described in K. P. Bennett and O. L. Mangasarian, “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34.

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3–32. Ten real-valued features are computed for each cell nucleus:
     a. radius (mean of distances from center to points on the perimeter)
     b. texture (standard deviation of gray-scale values)
     c. perimeter
     d. area
     e. smoothness (local variation in radius lengths)
     f. compactness (perimeter^2 / area - 1.0)
     g. concavity (severity of concave portions of the contour)
     h. concave points (number of concave portions of the contour)
     i. symmetry
     j. fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
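As a quick sketch of how those 30 field names arise (using the Kaggle column naming visible in the head() output below), each of the ten base measurements appears once with a _mean, _se, and _worst suffix:

base.features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave points", "symmetry",
                   "fractal_dimension")
paste0(rep(base.features, times = 3), "_",
       rep(c("mean", "se", "worst"), each = 10)) # The 30 feature names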

Malignant is a term for diseases in which abnormal cells divide without control and can invade nearby tissues. Malignant cells can also spread to other parts of the body through the blood and lymph systems. It means the cell is cancerous.

Benign means not cancerous. Benign tumors may grow larger but do not spread to other parts of the body. They are also called non-malignant.

Error Function

Another important component of a neural network that we can specify is the error function. While the squared loss is the typical choice in many machine learning settings, it has the unfortunate property that neurons whose outputs are initialized close to the extremes train very slowly. We therefore introduce a different cost function: the logarithmic loss.

Logarithmic loss keeps the training process efficient even when a neuron is saturated; in that regime it trains faster than the squared loss.
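Concretely, for a binary label y and a predicted probability ŷ, the logarithmic (cross-entropy) loss of a single observation is

$$C = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$

For a logistic output unit, the gradient of this loss with respect to the pre-activation is simply ŷ − y, which stays large even when the neuron saturates at the wrong extreme, whereas the squared-loss gradient carries an extra σ′(z) factor that vanishes there. This is the loss that neuralnet minimizes when we pass err.fct = "ce" below.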

Let’s start by loading the required libraries

library(MASS) # For sampling multivariate Gaussian distributions (not strictly needed in this analysis)
library(neuralnet) # The package for neural networks in R
library(readr) # For reading data into R

Read in the data

wbcd.data <- read_csv("breast_cancer_wincosin.csv")

View the first 6 rows of our data

head(wbcd.data)
## # A tibble: 6 x 33
##       id diagnosis radius_mean texture_mean perimeter_mean area_mean
##    <dbl> <chr>           <dbl>        <dbl>          <dbl>     <dbl>
## 1 8.42e5 M                18.0         10.4          123.      1001 
## 2 8.43e5 M                20.6         17.8          133.      1326 
## 3 8.43e7 M                19.7         21.2          130       1203 
## 4 8.43e7 M                11.4         20.4           77.6      386.
## 5 8.44e7 M                20.3         14.3          135.      1297 
## 6 8.44e5 M                12.4         15.7           82.6      477.
## # ... with 27 more variables: smoothness_mean <dbl>,
## #   compactness_mean <dbl>, concavity_mean <dbl>, `concave
## #   points_mean` <dbl>, symmetry_mean <dbl>, fractal_dimension_mean <dbl>,
## #   radius_se <dbl>, texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## #   smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## #   `concave points_se` <dbl>, symmetry_se <dbl>,
## #   fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
## #   perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
## #   compactness_worst <dbl>, concavity_worst <dbl>, `concave
## #   points_worst` <dbl>, symmetry_worst <dbl>,
## #   fractal_dimension_worst <dbl>, X33 <chr>

Remove the unneeded variables

wbcd.data <- wbcd.data[, -c(1)] # Drop the patient ID column
wbcd.data <- wbcd.data[, -c(32)] # Drop the extraneous X33 column (column 32 after the first drop)

Transform the Diagnosis variable to binary digits

wbcd.data[, 1] <- as.numeric(wbcd.data[, 1] == "M")

We converted “M (Malignant)” to 1, and “B (Benign)” to 0
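A quick check with base R's table() confirms the conversion; the full data set contains 357 benign and 212 malignant cases:

table(wbcd.data[[1]]) # Counts of 0 (benign) and 1 (malignant)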

Transform the variable names

colnames(wbcd.data) <- c("label", paste0("V", 2:31)) # Rename "diagnosis" to "label" and the 30 features to V2-V31

We renamed the dependent variable “diagnosis” to “label”, and the independent variables to “V2” through “V31”, for easier computation.

View the first row of our data

wbcd.data[1, ]
## # A tibble: 1 x 31
##   label    V2    V3    V4    V5    V6    V7    V8    V9   V10    V11   V12
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     1  18.0  10.4  123.  1001 0.118 0.278 0.300 0.147 0.242 0.0787  1.10
## # ... with 19 more variables: V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>,
## #   V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>,
## #   V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>,
## #   V29 <dbl>, V30 <dbl>, V31 <dbl>

Split the data into training and test sets in a ratio of 4:1

set.seed(4400) # Seed before sampling so the train/test split is reproducible
train.proportion <- 0.8
train.index <- sample(x = 1:nrow(wbcd.data),
                      size = floor(train.proportion * nrow(wbcd.data)),
                      replace = F)
wbcd.train.data <- wbcd.data[train.index, ]
wbcd.test.data <- wbcd.data[-train.index, ]
wbcd.test.labels <- wbcd.test.data$label # Hold out the true test labels
wbcd.test.data <- subset(wbcd.test.data, select = -c(label)) # Drop the label from the test features

Compute the formula for prediction

formula <- sprintf("%s%s", "label ~ ", paste("V", 2:31, collapse = " + ", sep = ""))
formula #Specifying the output and inputs
## [1] "label ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31"
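As an aside, base R's reformulate() builds the equivalent formula object directly, without string pasting:

reformulate(paste0("V", 2:31), response = "label") # label ~ V2 + ... + V31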

Train the models

set.seed(4400) # For identical results across all document evaluations

wbcd.first.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(5), # 1 hidden layer with 5 units
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.second.net <- neuralnet(formula, data = wbcd.train.data,
                             hidden = c(10), # 1 hidden layer with 10 units
                             linear.output = F, rep = 5,
                             err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.third.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(15), # 1 hidden layer with 15 units 
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.fourth.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(5, 5), # 2 hidden layers with 5 units each
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.fifth.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(10, 10), # 2 hidden layers with 10 units each
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.sixth.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(15, 15), # 2 hidden layers with 15 units each
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
train.scores <- sapply(list(wbcd.first.net, wbcd.second.net, wbcd.third.net,
                      wbcd.fourth.net, wbcd.fifth.net, wbcd.sixth.net),
                 function(x) {min(x$result.matrix[c("error"), ])})
cat(paste(c("Training Scores (Logarithmic Loss)\n1 Hidden Layer, 5 Hidden Units:",
            "1 Hidden Layer, 10 Hidden Units:",
            "1 Hidden Layer, 15 Hidden Units:",
            "2 Hidden Layers, 5 Hidden Units Each:",
            "2 Hidden Layers, 10 Hidden Units Each:",
            "2 Hidden Layers, 15 Hidden Units Each:"),
          train.scores, collapse = "\n"))
## Training Scores (Logarithmic Loss)
## 1 Hidden Layer, 5 Hidden Units: 68.2926463988237
## 1 Hidden Layer, 10 Hidden Units: 4.38358463981494
## 1 Hidden Layer, 15 Hidden Units: 7.63718950112194
## 2 Hidden Layers, 5 Hidden Units Each: 39.5479316254214
## 2 Hidden Layers, 10 Hidden Units Each: 2.55824488993493
## 2 Hidden Layers, 15 Hidden Units Each: 2.66256494441356

We weren’t sure which number of layers (or which number of hidden units per layer, for that matter) would be best, so we tried a few different combinations. One note: we had to raise the threshold argument from its default of 0.01, because otherwise the networks would not reach the convergence criterion. The value 2 was found experimentally to give decent results.
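The six training calls above could just as well be written as a loop over the candidate architectures; here is a sketch that trains the same six configurations:

architectures <- list(c(5), c(10), c(15), c(5, 5), c(10, 10), c(15, 15))
nets <- lapply(architectures, function(h) # One network per hidden-layer configuration
  neuralnet(formula, data = wbcd.train.data, hidden = h,
            linear.output = F, rep = 5,
            err.fct = "ce", act.fct = "logistic", threshold = 2))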

Now, let’s examine how the neural networks do on predicting the test data:

set.seed(440) # For identical results across all document evaluations

percentage.correctly.classified <- function(nn, threshold = 0.5) {
  best <- which.min(nn$result.matrix[c("error"), ]) # Repetition with the lowest training error
  net.predictions <- compute(nn, wbcd.test.data, rep = best)$net.result # Predicted probabilities
  thresholded.net.predictions <- ifelse(net.predictions > threshold, 1, 0) # Probabilities to class labels
  num.correct <- sum(as.numeric(thresholded.net.predictions == wbcd.test.labels))
  num.correct / length(wbcd.test.labels) # Fraction of the test set classified correctly
}
scores <- sapply(list(wbcd.first.net, wbcd.second.net, wbcd.third.net,
                      wbcd.fourth.net, wbcd.fifth.net, wbcd.sixth.net),
                 percentage.correctly.classified)
cat(paste(c("Test Scores (Percentage Correctly Classified)\n1 Hidden Layer, 5 Hidden Units:",
            "1 Hidden Layer, 10 Hidden Units:",
            "1 Hidden Layer, 15 Hidden Units:",
            "2 Hidden Layers, 5 Hidden Units Each:",
            "2 Hidden Layers, 10 Hidden Units Each:",
            "2 Hidden Layers, 15 Hidden Units Each:"),
          scores, collapse = "\n"))
## Test Scores (Percentage Correctly Classified)
## 1 Hidden Layer, 5 Hidden Units: 0.929824561403509
## 1 Hidden Layer, 10 Hidden Units: 0.956140350877193
## 1 Hidden Layer, 15 Hidden Units: 0.93859649122807
## 2 Hidden Layers, 5 Hidden Units Each: 0.956140350877193
## 2 Hidden Layers, 10 Hidden Units Each: 0.894736842105263
## 2 Hidden Layers, 15 Hidden Units Each: 0.929824561403509

We achieved an accuracy above 90% with most architectures; however, simply adding more nodes does not yield more accurate predictions. To improve the predictions further, we would need more data.
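Accuracy alone also hides which kind of mistake a model makes, and in a medical setting false negatives (missed malignancies) matter most. A confusion matrix separates the two error types; here is a sketch for the second network, reusing the objects defined above:

best <- which.min(wbcd.second.net$result.matrix["error", ]) # Best repetition
preds <- compute(wbcd.second.net, wbcd.test.data, rep = best)$net.result
table(predicted = ifelse(preds > 0.5, 1, 0), actual = wbcd.test.labels)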

Summary

With an accuracy above 90%, we can already use this technology to help detect cancer early, before it reaches the malignant stage. But more work needs to be done to get to 100% accuracy. Once it gets there, the technology could become ubiquitous, and people would be able to check whether they have cancer with their smartphone, straight from their house.

For more use cases of Machine Learning, contact us at Cartwheel Technologies: alinnorugo@gmail.com