Deep Learning is a branch of Machine Learning based on a set of algorithms that attempt to model high-level abstractions in data (Wikipedia). The basic unit in a deep learning model is the neuron, a unit inspired by the biological neuron. In humans, the varying strengths of neurons' output signals travel along synaptic junctions & are then aggregated as input for a connected neuron's activation. A multi-layer neural network consists of many layers of interconnected neural units, starting with an input layer to match the feature space, followed by multiple layers of non-linearity, and ending with a linear regression or classification layer to match the output space (Arno Candel et al.).
Deep Learning has become the Data Science buzzword, particularly for its high prediction accuracy on complex problems such as image, speech & text recognition. In exploring Deep Learning algorithms, I wanted to find one that satisfied a few core requirements.
The H2O package with its feedforward architecture satisfied the core features of primary interest to me.
The objective of this blog post is to demonstrate how you can achieve world-class prediction accuracy with basic Deep Learning models, in this case using the H2O package. The data I will be using is the Breast Cancer Wisconsin (Diagnostic) Data Set found here; the goal of the task is to predict whether a diagnosis is Malignant (M) or Benign (B). In this data, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image in the 3-dimensional space described in K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.
Ten real-valued features are computed for each cell nucleus:
radius (mean of distances from the center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, & field 23 is Worst Radius.
library(readr)
library(data.table)
#Load the data:
Data <- read_csv("data.csv", col_names = T)
#Target variable type coercion:
Data$diagnosis <- as.factor(Data$diagnosis)
#Remove less than useful 'X33':
Data$X33 <- NULL
#Check missing data:
Missing_data <- function(x){sum(is.na(x))/length(x)*100}
apply(Data, 2, Missing_data)
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst
## 0 0
Nearly 37% of the 569 diagnosed patients had malignant-stage breast cancer.
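For reference, here is a quick way to check that class balance in R (using the Data frame loaded above):
#Class counts & proportions of the target variable:
table(Data$diagnosis)
round(prop.table(table(Data$diagnosis))*100, 1)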
I will use a 9:1 Train-to-Test partition (the 0.9 split in the code below), trying to retain more samples for training from the already limited data set. The data will also be randomly shuffled for a fair distribution, and to enhance robustness against overfitting I will perform cross-validation during model building.
set.seed(1)
#Shuffle the rows, then split them 90/10 into Train & Test sets:
n <- nrow(Data)
shuffled <- Data[sample(n),]
train.indices <- 1:round(0.9*n)
train <- shuffled[train.indices,]
test.indices <- (round(0.9*n)+1):n
test <- shuffled[test.indices,]
library(h2o)
#start a local h2o cluster:
localH2o <- h2o.init(ip = "localhost", port = 54321, nthreads = -1)
#Set R's seed (note: H2O models also take their own seed argument for reproducibility):
set.seed(2)
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
#Start a timer so the training time can be reported after the grid is built:
timer <- proc.time()
The grid will consist of 3 different network topologies, 3 L1 regularization weights & 3 hidden dropout ratios, i.e. 27 models in total. H2O supports model tuning via grid search by allowing users to specify a set of values for each parameter argument & observe individual model performance based on metrics of choice. In this demo, AUC values will be used to analyze the grid models. The grid is built as follows:
set.seed(2)
#Set Grid parameters:
hidden_opt <- list(c(32,32,32), c(5,25,75), c(100,100,100))
l1_opt <- c(1e-5, 1e-4,1e-3)
hidden_dropout_opt <- list(c(0.5,0.5,0.5), c(0.5,0.3,0.2), c(0.1,0.2,0.8))
hyper_pars <- list(hidden = hidden_opt, hidden_dropout_ratios = hidden_dropout_opt, l1 = l1_opt)
#Building Grid models:
model_grid <- h2o.grid(
  algorithm = "deeplearning",
  activation = "RectifierWithDropout",  #a "...WithDropout" activation is required for hidden dropout
  hyper_params = hyper_pars,
  x = 3:32,                             #the 30 feature columns
  y = 2,                                #the target column (diagnosis)
  training_frame = train_h2o,
  input_dropout_ratio = 0.2,
  balance_classes = T,
  momentum_stable = 0.99,               #momentum settings only take effect when adaptive_rate = FALSE
  nesterov_accelerated_gradient = T,
  epochs = 50,
  nfolds = 10,                          #10-fold cross-validation
  variable_importances = T,
  keep_cross_validation_predictions = T)
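Since a timer was started before building the grid, the elapsed training time can be reported once the grid finishes (using the timer object defined above):
#Elapsed training time (in seconds):
proc.time() - timer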
See how each individual model within the grid performed based on ROC/AUC values; other performance metrics are also available within H2O. An ideal classifier has an AUC of 1, and setting an operational score depends on several factors including the application domain, the cost (impact) of an error/misprediction, the complexity of the problem, etc. In a problem such as cancer prediction, errors should be heavily penalized, hence exceptional model performance is required. It is imperative that performance is cross-examined with several metrics & that the trade-offs are well understood. In this demo, however, AUC will be used to check individual model performance.
#Cross-validation AUC for each model in the grid (xval = TRUE reports the CV metric):
for (model_id in model_grid@model_ids) {
  auc <- h2o.auc(h2o.getModel(model_id), xval = TRUE)
  print(sprintf('CV set auc: %f', auc))
}
## [1] "CV set auc: 0.997943"
## [1] "CV set auc: 0.998554"
## [1] "CV set auc: 0.996847"
## [1] "CV set auc: 0.996427"
## [1] "CV set auc: 0.996828"
## [1] "CV set auc: 0.997523"
## [1] "CV set auc: 0.996636"
## [1] "CV set auc: 0.995909"
## [1] "CV set auc: 0.994824"
## [1] "CV set auc: 0.997402"
## [1] "CV set auc: 0.996360"
## [1] "CV set auc: 0.996243"
## [1] "CV set auc: 0.998227"
## [1] "CV set auc: 0.997666"
## [1] "CV set auc: 0.997489"
## [1] "CV set auc: 0.994941"
## [1] "CV set auc: 0.996747"
## [1] "CV set auc: 0.997225"
## [1] "CV set auc: 0.997269"
## [1] "CV set auc: 0.996808"
## [1] "CV set auc: 0.997484"
## [1] "CV set auc: 0.997742"
## [1] "CV set auc: 0.997431"
## [1] "CV set auc: 0.996132"
## [1] "CV set auc: 0.995799"
## [1] "CV set auc: 0.996594"
## [1] "CV set auc: 0.994423"
I will emphasize using multiple performance metrics as the basis for final model selection. Notice how all 27 variations of the model from the grid have exceptional performance, with AUC values very close to 1. Knowing that, the final candidate would be the model variation that yielded the maximum value. More techniques can be employed to further optimize model performance, ranging from input data preprocessing to tuning & expanding the model parameters within the grid. Now recall that I split the data into Train & Test sets; go ahead, select the best model from the grid, check additional metrics as recommended, & run your prediction on the Test set using your final candidate, as sketched below.
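A minimal sketch of that last step, selecting the grid model with the highest AUC & scoring it on the held-out Test set (object names follow the code above):
#Sort the grid models by AUC & retrieve the best one:
sorted_grid <- h2o.getGrid(model_grid@grid_id, sort_by = "auc", decreasing = TRUE)
best_model <- h2o.getModel(sorted_grid@model_ids[[1]])
#Evaluate on the Test set (confusion matrix, AUC & other metrics):
h2o.performance(best_model, newdata = test_h2o)
#Generate class predictions for the Test set:
predictions <- h2o.predict(best_model, test_h2o)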
In this blog post, I have demonstrated how you can achieve world-class model performance using Deep Learning models. These models yield exceptionally accurate results, particularly on more complex non-linear problems such as image, speech & text recognition. Additionally, understanding how Deep Learning algorithms work is very important in determining how to optimize model performance. I hope you found this demo helpful; now go ahead & confidently model more complex processes in your data science career!
h2o.shutdown()
Good Luck!