Deep Learning is a branch of Machine Learning based on a set of algorithms that attempt to mimic the human brain. The basic unit in a deep learning model is the neuron, a model inspired by the human neuron. In humans, the varying strengths of neurons’ output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron’s activation (feed-forward). A multi-layer neural network consists of many layers of interconnected neural units, starting with an input layer to match the feature space, followed by multiple layers of non-linearity, and ending with a linear regression or classification layer to match the output space (Arno Candel et al.).
Deep Learning has become a Data Science buzzword, particularly for its high prediction accuracy on complex problems such as image, speech, and text recognition. In exploring Deep Learning algorithms, I wanted to find one that satisfied the criteria of primary interest to me.
Neural networks, with their feedforward architecture, satisfied those core criteria.
The objective of this paper is to demonstrate how we can achieve competitive prediction accuracy with basic Deep Learning models using a neural network algorithm. The data used is the Wisconsin Breast Cancer (Diagnostic) Data Set, which can be downloaded from Kaggle. The task is to predict whether a diagnosis is Malignant (M) or Benign (B). The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image, in the 3-dimensional space described in K. P. Bennett and O. L. Mangasarian, “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34.
Attribute Information:
Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The mean, standard error, and “worst” or largest (mean of the three largest values) of each of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
Malignant is a term for diseases in which abnormal cells divide without control and can invade nearby tissues. Malignant cells can also spread to other parts of the body through the blood and lymph systems. It means the cell is cancerous.
Benign means not cancerous. Benign tumors may grow larger but do not spread to other parts of the body. They are also called non-malignant.
Error Function
Another important component of a neural network that we can specify is the error function. While squared loss is the typical choice in many machine learning settings, it has an unfortunate property: neurons whose outputs start out close to the extremes (saturated neurons) learn very slowly, because the gradient of the squared loss becomes tiny there. We can instead use a different cost function, the logarithmic loss (cross-entropy).
Logarithmic loss keeps training efficient even when a neuron is saturated: its gradient stays large whenever the prediction is far from the true label, so learning proceeds faster than with squared loss.
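As a small illustration (this snippet is our own addition, not part of the original analysis), we can compare the two losses for a saturated, badly wrong prediction:
squared.loss <- function(y, p) (y - p)^2                        # illustrative helper (our own name)
log.loss <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p)) # binary cross-entropy
squared.loss(1, 0.01) # roughly 0.98: small loss, small gradient, slow learning
log.loss(1, 0.01)     # roughly 4.6: large loss, the gradient stays useful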
library(MASS) # Needed to sample multivariate Gaussian distributions
library(neuralnet) # The package for neural networks in R
library(readr) # The package for reading in data in R
wbcd.data <- read_csv("breast_cancer_wincosin.csv")
head(wbcd.data)
## # A tibble: 6 x 33
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 8.42e5 M 18.0 10.4 123. 1001
## 2 8.43e5 M 20.6 17.8 133. 1326
## 3 8.43e7 M 19.7 21.2 130 1203
## 4 8.43e7 M 11.4 20.4 77.6 386.
## 5 8.44e7 M 20.3 14.3 135. 1297
## 6 8.44e5 M 12.4 15.7 82.6 477.
## # ... with 27 more variables: smoothness_mean <dbl>,
## # compactness_mean <dbl>, concavity_mean <dbl>, `concave
## # points_mean` <dbl>, symmetry_mean <dbl>, fractal_dimension_mean <dbl>,
## # radius_se <dbl>, texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## # smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## # `concave points_se` <dbl>, symmetry_se <dbl>,
## # fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
## # perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
## # compactness_worst <dbl>, concavity_worst <dbl>, `concave
## # points_worst` <dbl>, symmetry_worst <dbl>,
## # fractal_dimension_worst <dbl>, X33 <chr>
wbcd.data <- wbcd.data[, -c(1)] # Don't need patient ID
wbcd.data <- wbcd.data[, -c(32)] # Don't need the X33 variable
wbcd.data[, 1] <- as.numeric(wbcd.data[, 1] == "M")
We converted “M” (Malignant) to 1 and “B” (Benign) to 0.
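A quick sanity check (our own addition; output not shown) confirms the first column now contains only 0s and 1s:
table(wbcd.data[[1]]) # counts of benign (0) and malignant (1) cases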
colnames(wbcd.data) <- c("label", paste0("V", 2:31)) # Rename "diagnosis" to "label" and the 30 features to V2 ... V31
We renamed the dependent variable “diagnosis” to “label”, and the independent variables to V2 through V31, for easier reference in the model formula.
wbcd.data[1, ]
## # A tibble: 1 x 31
## label V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 18.0 10.4 123. 1001 0.118 0.278 0.300 0.147 0.242 0.0787 1.10
## # ... with 19 more variables: V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>,
## # V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>,
## # V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>,
## # V29 <dbl>, V30 <dbl>, V31 <dbl>
train.proportion <- 0.8
train.index <- sample(x = 1:nrow(wbcd.data),
size = floor(train.proportion * nrow(wbcd.data)),
replace = F)
wbcd.train.data <- wbcd.data[train.index, ]
wbcd.test.data <- wbcd.data[-train.index, ]
wbcd.test.labels <- wbcd.test.data$label
wbcd.test.data <- subset(wbcd.test.data, select = -c(label))
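As a quick sanity check of the split (our own addition, not part of the original output; the standard version of this data set has 569 rows, so the training set should hold floor(0.8 * 569) = 455 of them):
dim(wbcd.train.data)        # training rows and columns
mean(wbcd.train.data$label) # share of malignant cases in the training set
mean(wbcd.test.labels)      # share of malignant cases in the test set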
formula <- sprintf("%s%s", "label ~ ", paste("V", 2:31, collapse = " + ", sep = ""))
formula #Specifying the output and inputs
## [1] "label ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31"
set.seed(4400) # For identical results across all document evaluations
wbcd.first.net <- neuralnet(formula, data = wbcd.train.data,
hidden = c(5), # 1 hidden layer with 5 units
linear.output = F, rep = 5,
err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.second.net <- neuralnet(formula, data = wbcd.train.data,
hidden = c(10), # 1 hidden layer with 10 units
linear.output = F, rep = 5,
err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.third.net <- neuralnet(formula, data = wbcd.train.data,
hidden = c(15), # 1 hidden layer with 15 units
linear.output = F, rep = 5,
err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.fourth.net <- neuralnet(formula, data = wbcd.train.data,
hidden = c(5, 5), # 2 hidden layers with 5 units each
linear.output = F, rep = 5,
err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.fifth.net <- neuralnet(formula, data = wbcd.train.data,
hidden = c(10, 10), # 2 hidden layers with 10 units each
linear.output = F, rep = 5,
err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.sixth.net <- neuralnet(formula, data = wbcd.train.data,
hidden = c(15, 15), # 2 hidden layers with 15 units each
linear.output = F, rep = 5,
err.fct = "ce", act.fct = "logistic", threshold = 2)
train.scores <- sapply(list(wbcd.first.net, wbcd.second.net, wbcd.third.net,
wbcd.fourth.net, wbcd.fifth.net, wbcd.sixth.net),
function(x) {min(x$result.matrix[c("error"), ])})
cat(paste(c("Training Scores (Logarithmic Loss)\n1 Hidden Layer, 5 Hidden Units:", "1 Hidden Layer, 10 Hidden Units:", "1 Hidden Layer, 15 Hidden Units:", "2 Hidden Layers, 5 Hidden Units Each:", "2 Hidden Layers, 10 Hidden Units Each:", "2 Hidden Layers, 15 Hidden Units Each:"), train.scores, collapse = "\n"))
## Training Scores (Logarithmic Loss)
## 1 Hidden Layer, 5 Hidden Units: 68.2926463988237
## 1 Hidden Layer, 10 Hidden Units: 4.38358463981494
## 1 Hidden Layer, 15 Hidden Units: 7.63718950112194
## 2 Hidden Layers, 5 Hidden Units Each: 39.5479316254214
## 2 Hidden Layers, 10 Hidden Units Each: 2.55824488993493
## 2 Hidden Layers, 15 Hidden Units Each: 2.66256494441356
We weren’t sure of the best number of layers (or the number of hidden units in each layer, for that matter), so we tried a few different combinations. One note: we had to raise the threshold from the default of 0.01, because otherwise the network would not register convergence; the value 2 was found experimentally to give decent results.
set.seed(440) # For identical results across all document evaluations
percentage.correctly.classified <- function(nn, threshold = 0.5) {
best <- which.min(nn$result.matrix[c("error"), ])
net.predictions <- compute(nn, wbcd.test.data, rep = best)$net.result
thresholded.net.predictions <- ifelse(net.predictions > threshold, 1, 0)
num.correct <- sum(as.numeric(thresholded.net.predictions == wbcd.test.labels))
num.correct / length(wbcd.test.labels)
}
scores <- sapply(list(wbcd.first.net, wbcd.second.net, wbcd.third.net,
wbcd.fourth.net, wbcd.fifth.net, wbcd.sixth.net),
percentage.correctly.classified)
cat(paste(c("Test Scores (Percentage Correctly Classified)\n1 Hidden Layer, 5 Hidden Units:", "1 Hidden Layer, 10 Hidden Units:", "1 Hidden Layer, 15 Hidden Units:", "2 Hidden Layers, 5 Hidden Units Each:", "2 Hidden Layers, 10 Hidden Units Each:", "2 Hidden Layers, 15 Hidden Units Each:"), scores, collapse = "\n"))
## Test Scores (Percentage Correctly Classified)
## 1 Hidden Layer, 5 Hidden Units: 0.929824561403509
## 1 Hidden Layer, 10 Hidden Units: 0.956140350877193
## 1 Hidden Layer, 15 Hidden Units: 0.93859649122807
## 2 Hidden Layers, 5 Hidden Units Each: 0.956140350877193
## 2 Hidden Layers, 10 Hidden Units Each: 0.894736842105263
## 2 Hidden Layers, 15 Hidden Units Each: 0.929824561403509
We got an accuracy rate greater than 90% for most configurations; however, simply adding more hidden units does not by itself give more accurate predictions. To improve on these results, we would need more data.
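Accuracy alone also hides how the errors split between false negatives and false positives, which matters in a screening setting. A confusion matrix for one of the networks can be produced along these lines (our own sketch, reusing the best repetition of the second network; output not shown):
best <- which.min(wbcd.second.net$result.matrix[c("error"), ])
net.predictions <- compute(wbcd.second.net, wbcd.test.data, rep = best)$net.result
table(predicted = ifelse(net.predictions > 0.5, 1, 0), actual = wbcd.test.labels)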
Summary
With an accuracy rate above 90%, this technology could already help detect cancer early, before it reaches an advanced stage. But more work needs to be done to push the accuracy closer to 100%. Once the accuracy is high enough, this kind of screening could become ubiquitous, and people may be able to check for signs of cancer from home using their smartphones.
For more use cases on Machine learning, contact us at Cartwheel Technologies, alinnorugo@gmail.com