Synopsis
This blueprint follows a proven, established methodology for classification modeling in conjunction with the CRISP-DM process. The modeling process is identical regardless of the programming language, since the same standard artefacts are developed and gathered; the difference between this implementation in R and one in Python, for example, is only the syntax.
CRISP-DM Workflow
The main focus of this chapter is to lay a foundational framework for modeling approaches and for decision making based on the outcomes generated by H2O. H2O provides a state-of-the-art platform for training, testing, and evaluating individual and stacked (ensembled) models in parallel, which decreases the modeling effort significantly. H2O provides implementations in R, Python, and Spark.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
This phase has four tasks:
- Select modeling techniques: Determine which algorithms to use (automatically handled by H2O).
- Generate test design: Split the data into training, test, and validation sets.
- Build the models: Train the candidate models (automatically handled by H2O).
- Assess the models: Generally, multiple models compete against each other, and model results are interpreted based on domain knowledge, the pre-defined success criteria, and the test design.
1. Libraries & Setup
2. Data
3. Processing Pipeline
3.1. ML Pre-processing
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Training data contained 1250 data points and no missing data.
##
## Operations:
##
## Zero variance filter removed EmployeeCount, Over18, StandardHours [trained]
## Variable mutation for JobLevel, StockOptionLevel [trained]
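The printed recipe above can be reproduced with the recipes package. Below is a minimal sketch, assuming (hypothetical names) the training and test data live in tibbles `train_tbl` and `test_tbl` with `Attrition` as the outcome:

```r
library(tidyverse)
library(recipes)

# Define, train, and apply the pre-processing recipe summarized above
recipe_obj <- recipe(Attrition ~ ., data = train_tbl) %>%
    # Drop zero-variance columns (EmployeeCount, Over18, StandardHours)
    step_zv(all_predictors()) %>%
    # Re-type the ordinal integer columns as factors
    step_mutate(
        JobLevel         = factor(JobLevel),
        StockOptionLevel = factor(StockOptionLevel)
    ) %>%
    prep()

train_tbl_processed <- bake(recipe_obj, new_data = train_tbl)
test_tbl_processed  <- bake(recipe_obj, new_data = test_tbl)
```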
4. Modeling Algorithms in H2O
Deep Learning: Based on a multi-layer feedforward artificial neural network trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing, and grid search enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network. The feedforward artificial neural network (ANN), also known as a deep neural network (DNN) or multi-layer perceptron (MLP), is the most common architecture and the only type supported natively in H2O.
- Grid search is a technique for improving the accuracy and precision of deep learning models in H2O (a base model sketch follows below). I have another blueprint for grid search; let me know if you are interested.
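For illustration, here is a minimal sketch of a standalone H2O deep learning model of the kind AutoML trains automatically. The object names (`train_h2o`, `x`, `y`) are assumptions:

```r
library(h2o)
h2o.init()  # start or connect to a local H2O cluster

# Assumed names: train_h2o is an H2OFrame of the processed training data
y <- "Attrition"
x <- setdiff(names(train_h2o), y)

dl_model <- h2o.deeplearning(
    x              = x,
    y              = y,
    training_frame = train_h2o,
    activation     = "Rectifier",  # tanh and maxout are also available
    hidden         = c(200, 200),  # two hidden layers of 200 neurons
    epochs         = 10,
    l1             = 1e-5          # L1 regularization
)
```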
Generalized Linear Model (GLM): Estimates regression models for outcomes following exponential-family distributions. In addition to the Gaussian (i.e. normal) distribution, these include the Poisson, binomial, and gamma distributions. Each serves a different purpose and, depending on the choice of distribution and link function, can be used either for prediction or for classification. The GLM suite includes: Gaussian regression, Poisson regression, Binomial regression (classification), Fractional binomial regression, Quasibinomial regression, Multinomial classification, Gamma regression, Ordinal regression, Negative binomial regression, and the Tweedie distribution.
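For a binary outcome such as attrition, the relevant member of the suite is binomial regression (logistic regression). A minimal sketch under the same assumed names as above:

```r
# Binomial GLM (logistic regression) with regularization-path search
glm_model <- h2o.glm(
    x              = x,
    y              = y,
    training_frame = train_h2o,
    family         = "binomial",
    lambda_search  = TRUE  # search over the lambda regularization path
)
```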
Gradient Boosting Machine (GBM): Gradient Boosting Machine (for regression and classification) is a forward-learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O's GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way: each tree is built in parallel. The current version of GBM includes improvements such as the ability to train on categorical variables (via the nbins_cats parameter), refined histogramming logic for some corner cases, and support for per-row observation weights, per-row offsets, N-fold cross-validation, and additional distribution functions (Gamma, Poisson, and Tweedie).
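A minimal GBM sketch under the same assumptions, showing the nbins_cats parameter mentioned above; the hyperparameter values are illustrative:

```r
# GBM for classification; nbins_cats controls categorical-split granularity
gbm_model <- h2o.gbm(
    x              = x,
    y              = y,
    training_frame = train_h2o,
    ntrees         = 50,
    max_depth      = 5,
    learn_rate     = 0.1,
    nbins_cats     = 1024,
    nfolds         = 5     # N-fold cross-validation
)
```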
XGBoost: A top performer and regular Kaggle winner. It implements a process called boosting to yield accurate models. Boosting refers to the ensemble-learning technique of building many models sequentially, with each new model attempting to correct the deficiencies of the previous one. In tree boosting, each new model added to the ensemble is a decision tree. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data-science problems quickly and accurately; for many problems, it is one of the best gradient boosting machine (GBM) frameworks available today.
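A minimal sketch under the same assumptions (note that H2O's XGBoost integration may not be available on every platform, e.g. Windows):

```r
# XGBoost through H2O; hyperparameter values are illustrative
xgb_model <- h2o.xgboost(
    x              = x,
    y              = y,
    training_frame = train_h2o,
    ntrees         = 50,
    max_depth      = 6,
    learn_rate     = 0.1
)
```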
Stacked Ensembles: Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. Many popular modern machine learning algorithms are in fact ensembles; for example, Random Forest and Gradient Boosting Machine (GBM) are both ensemble learners. Both bagging (e.g. Random Forest) and boosting (e.g. GBM) take a collection of weak learners (e.g. decision trees) and form a single, strong learner. Stacked Ensembles instead find the optimal combination of a collection of prediction algorithms using a process called stacking. AutoML trains two such stacks, one based on all previously trained models and one based on the best model of each family. Both are trained automatically on the collection of individual models to produce highly predictive ensembles which, in most cases, are the top performers on the AutoML Leaderboard.
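The leaderboard below was generated by an AutoML run, which trains the individual models and the two Stacked Ensembles and then ranks them. A minimal sketch of how such a run might be launched (assumed names; the time budget is illustrative):

```r
# Launch AutoML: trains GLM, GBM, XGBoost, Deep Learning, and DRF models
# plus the two Stacked Ensembles, then ranks them on the leaderboard
automl_models <- h2o.automl(
    x                = x,
    y                = y,
    training_frame   = train_h2o,
    max_runtime_secs = 300,
    nfolds           = 5
)

automl_models@leaderboard  # sorted by AUC for binary classification
```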
## model_id auc logloss
## 1 StackedEnsemble_AllModels_AutoML_20211109_085552 0.8730374 0.2941166
## 2 DeepLearning_grid__2_AutoML_20211109_085552_model_3 0.8727355 0.4062421
## 3 StackedEnsemble_BestOfFamily_AutoML_20211109_085552 0.8716787 0.2943853
## 4 DeepLearning_grid__3_AutoML_20211109_085552_model_2 0.8715278 0.3041569
## 5 GLM_1_AutoML_20211109_085552 0.8706220 0.2986907
## 6 DeepLearning_grid__1_AutoML_20211109_085552_model_2 0.8683575 0.2991487
## aucpr mean_per_class_error rmse mse
## 1 0.7056556 0.1938406 0.2929557 0.08582302
## 2 0.6018271 0.2101449 0.3284558 0.10788324
## 3 0.7011638 0.1826691 0.2931022 0.08590890
## 4 0.7194047 0.1911232 0.2951129 0.08709160
## 5 0.6891396 0.2158816 0.2985665 0.08914195
## 6 0.6619561 0.2330918 0.2991397 0.08948458
##
## [28 rows x 7 columns]
4.1. Making Predictions
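The predictions below come from the AutoML leader model applied to the hold-out test set. A minimal sketch, assuming the names introduced above; the returned H2OFrame is converted to a tibble:

```r
library(tidyverse)

# Score the test set with the leading model (assumed object names)
predictions_tbl <- h2o.predict(automl_models@leader, newdata = test_h2o) %>%
    as.data.frame() %>%
    as_tibble()

predictions_tbl
```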
## # A tibble: 220 x 3
## predict No Yes
## <fct> <dbl> <dbl>
## 1 No 0.983 0.0170
## 2 No 0.824 0.176
## 3 No 0.999 0.00139
## 4 No 0.973 0.0274
## 5 Yes 0.341 0.659
## 6 No 0.908 0.0918
## 7 No 0.960 0.0397
## 8 No 0.988 0.0123
## 9 No 0.994 0.00568
## 10 No 0.993 0.00670
## # ... with 210 more rows
5. Plotting Function
6. Assessing Performance
6.1. Classifier Summary Metrics
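The summary metrics below are read off an H2O performance object. A minimal sketch of how they might be computed (assumed object names; note that Gini = 2 x AUC - 1, consistent with the printed values):

```r
# Performance of the leader model on hold-out data (assumed names)
perf <- h2o.performance(automl_models@leader, newdata = test_h2o)

h2o.auc(perf)       # area under the ROC curve
h2o.giniCoef(perf)  # Gini coefficient = 2 * AUC - 1
h2o.logloss(perf)   # penalizes confident but wrong predictions
```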
AUC
## [1] 0.874849
Gini
## [1] 0.7496981
LogLoss
## [1] 0.2937918
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.35993685394665:
## No Yes Error Rate
## No 845 44 0.049494 =44/889
## Yes 57 119 0.323864 =57/176
## Totals 902 163 0.094836 =101/1065
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.358356000033959:
## No Yes Error Rate
## No 171 13 0.070652 =13/184
## Yes 12 24 0.333333 =12/36
## Totals 183 37 0.113636 =25/220
Core concepts of assessing performance:
Precision = True Positives (TP) / [True Positives (TP) + False Positives (FP)]
Recall = True Positives (TP) / [True Positives (TP) + False Negatives (FN)]
Accuracy = (True Positives (TP) + True Negatives (TN)) / [True Positives (TP) + True Negatives (TN) + False Negatives (FN) + False Positives (FP)]
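As a worked example, here are these formulas applied to the 220-row test confusion matrix above, taking "Yes" as the positive class (so TP = 24, FP = 13, FN = 12, TN = 171):

```r
# Counts read from the test-set confusion matrix above
TP <- 24; FP <- 13; FN <- 12; TN <- 171

TP / (TP + FP)                   # Precision: 24/37  ~= 0.649
TP / (TP + FN)                   # Recall:    24/36  ~= 0.667
(TP + TN) / (TP + TN + FN + FP)  # Accuracy: 195/220 ~= 0.886
```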
7. Performance Visualization
Precision & Recall
For medical data modeling, anything that doesn't account for false negatives is a crime, thus Recall is the go-to measure.
For the majority of problems in a business context, false negatives are less of a concern, so Precision is the better measure.
Recall indicates how susceptible a model is to false negatives, e.g. predicting that an employee will not churn when in fact they do.
Precision indicates how susceptible a model is to false positives, e.g. predicting that an employee will churn when in fact they do not.
The Precision vs Recall curve shows which models give up the fewest false negatives as we optimize the threshold.
Gain & Lift
Gain & Lift charts are a good way to show the value of using machine learning versus random guessing.
If we sampled 100 random observations, we'd expect 16 of them to be leavers (as per the attrition-rate calculation below). With the Gain approach, however, we sort observations descending by predicted probability of leaving, and the algorithm predicts the highest-probability cases correctly. We therefore benefit from using the algorithm, because the predictions are highly accurate for the high-probability cases.
- Using the gain chart we can state that our model is able to identify 75% of all employees at risk within the first 25% of the cumulative data fraction.
In summary, this gain calculation tells us that if we focus on the first quartile of the data, we GAIN the ability to target ~75% of the potential leavers using this model.
Using the Lift calculation, we can state that if we expect the global attrition rate to be 16% (220 x 0.16 ≈ 35 people), we captured 9 out of 35 (25.7%) in the first 10 samples. By using this model we beat the expectation (1.6 people in the first 10 cases) by roughly 5.6-fold (9 / 1.6 = 5.625).
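A minimal sketch of how the two tables below might be produced: the baseline attrition rate from the training data, and the test predictions bound to the actual outcomes and sorted descending by predicted probability of leaving (assumed object names):

```r
# Baseline attrition rate in the training data
train_tbl %>%
    count(Attrition) %>%
    mutate(pct = n / sum(n))

# Predictions with actuals, sorted for the gain/lift reasoning above
predictions_tbl %>%
    bind_cols(test_tbl %>% select(Attrition)) %>%
    arrange(desc(Yes))
```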
## # A tibble: 2 x 3
## Attrition n pct
## <chr> <int> <dbl>
## 1 No 1049 0.839
## 2 Yes 201 0.161
## # A tibble: 220 x 4
## predict No Yes Attrition
## <fct> <dbl> <dbl> <fct>
## 1 Yes 0.0807 0.919 Yes
## 2 Yes 0.160 0.840 Yes
## 3 Yes 0.196 0.804 Yes
## 4 Yes 0.231 0.769 Yes
## 5 Yes 0.286 0.714 Yes
## 6 Yes 0.317 0.683 Yes
## 7 Yes 0.341 0.659 Yes
## 8 Yes 0.350 0.650 Yes
## 9 Yes 0.355 0.645 Yes
## 10 Yes 0.365 0.635 No
## # ... with 210 more rows
8. Feature Explanation
This notebook uses the R implementation of Local Interpretable Model-Agnostic Explanations: the lime library, which is also available in Python.
When building complex models, it is often difficult to explain why the model should be trusted. While global measures such as accuracy are useful, they cannot be used to explain why a model made a specific prediction. lime (a port of the Python lime package) is a method for explaining the outcome of black-box models by fitting a local model around the point in question and perturbations of this point. The approach is described in more detail in Ribeiro et al. (2016) <arXiv:1602.04938>.
Using LIME we can determine which features contribute to the prediction, and by how much, for a single observation.
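A minimal sketch of how lime might be applied to the H2O leader model, reusing the assumed object names from the earlier sketches (recent lime versions include built-in support for H2O models; older versions may require defining model_type/predict_model methods):

```r
library(lime)

# Build the explainer on the processed training predictors
explainer <- lime(
    x     = train_tbl_processed %>% select(-Attrition),
    model = automl_models@leader
)

# Explain a single test observation
explanation <- explain(
    x          = test_tbl_processed %>% slice(1) %>% select(-Attrition),
    explainer  = explainer,
    n_labels   = 1,   # explain the predicted class
    n_features = 4    # show the top 4 contributing features
)

plot_features(explanation)  # visualize feature contributions
```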