Deep Learning is a branch of Machine Learning based on a set of algorithms that attempt to model high-level abstractions in data (Wikipedia). The basic unit in a deep learning model is the neuron, a unit inspired by the biological neuron. In humans, the varying strengths of neurons' output signals travel along synaptic junctions & are then aggregated as input for a connected neuron's activation. A multi-layer neural network consists of many layers of interconnected neural units, starting with an input layer to match the feature space, followed by multiple layers of non-linearity, and ending with a linear regression or classification layer to match the output space (Arno Candel et al.).
Deep Learning has become the Data Science buzzword, particularly for its high prediction accuracy on complex problems such as image, speech & text recognition. In exploring Deep Learning algorithms, I wanted to find one that satisfied a few core requirements.
The H2O package with its feedforward architecture satisfied the core features of primary interest to me.
The objective of this blog post is to demonstrate how you can achieve world-class prediction accuracy with basic Deep Learning models, in this case using the H2O package. The data I will be using is the Breast Cancer Wisconsin (Diagnostic) Data Set found here; the goal of the task is to predict whether a diagnosis is Malignant (M) or Benign (B). In this data, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image in the 3-dimensional space described in K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.
Ten real-valued features are computed for each cell nucleus:
radius (mean of distances from the center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, & field 23 is Worst Radius.
library(readr)
library(data.table)
#Load the data:
Data <- read_csv("data.csv", col_names = T)
#Target variable type coercion:
Data$diagnosis <- as.factor(Data$diagnosis)
#Remove less than useful 'X33':
Data$X33 <- NULL
#Check missing data:
Missing_data <- function(x){sum(is.na(x))/length(x)*100}
apply(Data, 2, Missing_data)
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst
## 0 0
Nearly 37% of the 569 diagnosed patients had malignant-stage breast cancer.
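For reference, here is a quick way to check that class balance in R (using the Data frame loaded above):
#Class counts & proportions of the target variable:
table(Data$diagnosis)
round(prop.table(table(Data$diagnosis))*100, 1)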
I will use a 9:1 Train-to-Test partition (the 0.9 split in the code below), trying to retain more samples for training from the already limited data set. The data will also be randomly shuffled for a fair distribution, and to enhance robustness against overfitting I will perform cross-validation during model building.
set.seed(1)
#Shuffle the rows, then split them 90/10 into Train & Test sets:
n <- nrow(Data)
shuffled <- Data[sample(n),]
train.indices <- 1:round(0.9*n)
train <- shuffled[train.indices,]
test.indices <- (round(0.9*n)+1):n
test <- shuffled[test.indices,]
library(h2o)
#start a local h2o cluster:
localH2o <- h2o.init(ip = "localhost", port = 54321, nthreads = -1)
#Set R's seed (note: H2O models also take their own seed argument for reproducibility):
set.seed(2)
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
#Start a timer so the training time can be reported after the grid is built:
timer <- proc.time()
The grid will consist of 3 different network topologies, 3 L1 regularization weights & 3 hidden dropout ratios, i.e. 27 models in total. H2O supports model tuning via grid search by allowing users to specify a set of values for each parameter argument & observe individual model performance based on metrics of choice. In this demo, AUC values will be used to analyze the grid models. The grid is built as follows:
set.seed(2)
#Set Grid parameters:
hidden_opt <- list(c(32,32,32), c(5,25,75), c(100,100,100))
l1_opt <- c(1e-5, 1e-4,1e-3)
hidden_dropout_opt <- list(c(0.5,0.5,0.5), c(0.5,0.3,0.2), c(0.1,0.2,0.8))
hyper_pars <- list(hidden = hidden_opt, hidden_dropout_ratios = hidden_dropout_opt, l1 = l1_opt)
#Building Grid models:
model_grid <- h2o.grid(
  algorithm = "deeplearning",
  activation = "RectifierWithDropout",  #a "...WithDropout" activation is required for hidden dropout
  hyper_params = hyper_pars,
  x = 3:32,                             #the 30 feature columns
  y = 2,                                #the target column (diagnosis)
  training_frame = train_h2o,
  input_dropout_ratio = 0.2,
  balance_classes = T,
  momentum_stable = 0.99,               #momentum settings only take effect when adaptive_rate = FALSE
  nesterov_accelerated_gradient = T,
  epochs = 50,
  nfolds = 10,                          #10-fold cross-validation
  variable_importances = T,
  keep_cross_validation_predictions = T)
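Since a timer was started before building the grid, the elapsed training time can be reported once the grid finishes (using the timer object defined above):
#Elapsed training time (in seconds):
proc.time() - timer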
See how each individual model within the grid performed based on ROC/AUC values; other performance metrics are also available within H2O. An ideal classifier has an AUC of 1, and setting an operational score depends on several factors including the application domain, the cost (impact) of an error/misprediction, the complexity of the problem, etc. In a problem such as cancer prediction, errors should be heavily penalized, hence exceptional model performance is required. It is imperative that performance is cross-examined with several metrics & that the trade-offs are well understood. In this demo, however, AUC will be used to check individual model performance.
#Cross-validation AUC for each model in the grid (xval = TRUE reports the CV metric):
for (model_id in model_grid@model_ids) {
  auc <- h2o.auc(h2o.getModel(model_id), xval = TRUE)
  print(sprintf('CV set auc: %f', auc))
}
## [1] "CV set auc: 0.997943"
## [1] "CV set auc: 0.998554"
## [1] "CV set auc: 0.996847"
## [1] "CV set auc: 0.996427"
## [1] "CV set auc: 0.996828"
## [1] "CV set auc: 0.997523"
## [1] "CV set auc: 0.996636"
## [1] "CV set auc: 0.995909"
## [1] "CV set auc: 0.994824"
## [1] "CV set auc: 0.997402"
## [1] "CV set auc: 0.996360"
## [1] "CV set auc: 0.996243"
## [1] "CV set auc: 0.998227"
## [1] "CV set auc: 0.997666"
## [1] "CV set auc: 0.997489"
## [1] "CV set auc: 0.994941"
## [1] "CV set auc: 0.996747"
## [1] "CV set auc: 0.997225"
## [1] "CV set auc: 0.997269"
## [1] "CV set auc: 0.996808"
## [1] "CV set auc: 0.997484"
## [1] "CV set auc: 0.997742"
## [1] "CV set auc: 0.997431"
## [1] "CV set auc: 0.996132"
## [1] "CV set auc: 0.995799"
## [1] "CV set auc: 0.996594"
## [1] "CV set auc: 0.994423"
I will emphasize using multiple performance metrics as the basis for final model selection. Notice how all 27 variations of the model from the grid have exceptional performance, with AUC values very close to 1. Knowing that, the final candidate would be the model variation that yielded the maximum value. More techniques can be employed to further optimize model performance, ranging from input data preprocessing to tuning & expanding the model parameters within the grid. Now recall that I split the data into Train & Test sets; go ahead, select the best model from the grid, check additional metrics as recommended, & run your prediction on the Test set using your final candidate, as sketched below.
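A minimal sketch of that last step, selecting the grid model with the highest AUC & scoring it on the held-out Test set (object names follow the code above):
#Sort the grid models by AUC & retrieve the best one:
sorted_grid <- h2o.getGrid(model_grid@grid_id, sort_by = "auc", decreasing = TRUE)
best_model <- h2o.getModel(sorted_grid@model_ids[[1]])
#Evaluate on the Test set (confusion matrix, AUC & other metrics):
h2o.performance(best_model, newdata = test_h2o)
#Generate class predictions for the Test set:
predictions <- h2o.predict(best_model, test_h2o)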
In this blog post, I have demonstrated how you can achieve world-class model performance using Deep Learning models. These models yield exceptionally accurate results, particularly on more complex non-linear problems such as image, speech & text recognition. Additionally, understanding how Deep Learning algorithms work is very important in determining how to optimize model performance. I hope you found this demo helpful; now go ahead & confidently model more complex processes in your data science career!
h2o.shutdown()
Good Luck!