Boosting

This is an opinionated demo of boosting with an application to predicting loan default. The caveat is I won’t tell you a whole lot about what boosting is in this session. I’ll defer to the next session for that. Because the goal in this session is to present something readily applicable, I’ll focus on a case study. First you see boosting at work. And it works stupendously well: xgboost, an implementation of boosting optimized for speed and predictive performance, is a perennial favourite in Kaggle competitions. Then you wonder why it works. And so you go back to dig up answers. The presentation is, by design, front-loaded: heavy on code, light on theory. It emulates end-to-end – to use the term generously – model development.

To the Batmobile case study!

Setup

I am using R. I love Python. I like Julia. And Clojure. And F# too. But for this demo, it’s R. Enough said.

I could use a full-featured framework for this. To name a few, caret (R), mlr (R) or scikit-learn (Python) would do the job. I won’t be doing that. For exposition, I use lower-level, more modular libraries/packages to peek under the hood.

I find that implementing stuff from scratch – for some notion of scratch – is a great way to make sure I understand. Or at least, that I am not fooling myself. Remember Feynman’s first principle?

The first principle is that you must not fool yourself and you are the easiest person to fool.

Here is the stuff you need to get going in R.

# Load packages
suppressPackageStartupMessages({
    packages <- c("ggplot2", "DataExplorer", "pROC", "xgboost", "data.table", 
        "broom", "recipes", "rsample", "rBayesianOptimization", "WVPlots")
    isLoaded <- sapply(packages, library, character.only = T, logical.return = T, 
        quietly = T)
    toInstall <- packages[!isLoaded]
    if (length(toInstall) > 0L) {
        install.packages(toInstall)
        sapply(toInstall, library, character.only = T, logical.return = T, quietly = T)
    }
    rm(packages, isLoaded, toInstall)
})

Data

The case study is to predict loan defaults from some variables. It is a classification problem. The data comes from here. Let’s download the data and mess about.

# Load (download) dataset
urlToData <- "https://assets.datacamp.com/production/course_1025/datasets/loan_data_ch1.rds"
savePath <- tempfile(fileext = ".rds")
download.file(urlToData, destfile = savePath, mode = "wb")
loanData <- readRDS(savePath)
# Clean up
rm(urlToData, savePath)
# Convert to data.table and default status to factor
setDT(loanData)
loanData[, loan_status := factor(loan_status)]

Salient features

So what have we learned from this exploratory exercise? I’m calling these salient features. Economists prefer stylized facts.

In increasing order of importance, we see that

  1. annual income (annual_inc) is highly skewed.
  2. employment length (emp_length) and interest rate (int_rate) have missing values.
  3. class imbalance in default status (loan_status) is quite severe.

For (1), we need to transform/normalize the variable. The log transform is typical for income.

For (2), we need to impute the missing values. This can be tricky. See Rubin (1996) for an overview. A recurring theme in this literature is that using the mean is almost always the wrong procedure. This is in line with the theorem,

It is easy to do the wrong thing.

(no such theorem exists AFAIK).

Imputation via bagged (bootstrap aggregated) trees is an effective, albeit computationally intensive, method to impute missing values. See Saar-Tsechansky and Provost (2007) for a reference. That is the method I use.

For (3), we need to address class imbalance. The majority class (no default) dominates the sample to the tune of 9 to 1. This means that in our case study, a naive classifier can easily achieve an accuracy of 90% by predicting no default on all instances. And it does so by ignoring default altogether. And yet predicting default is the goal. Presumably default is very costly, perhaps orders of magnitude more so than no default. So we want to nudge the classifier to give default the appropriate weight. The data alone tilts – some might say, biases – the balance in favour of the majority, at the expense of the minority. How do we redress this balance?

One way to do this is to give weights to observations. The idea is startlingly simple to implement, intuitive and effective (the elusive trifecta?). Weigh each observation with the inverse probability of its class. Thus, the majority and minority classes are assigned scaling factors of roughly 1.1 and 9 respectively to counter the effect of class imbalance. I implicitly assume equal importance for both classes. This 50/50 weighting scheme stops the model from overfitting to the majority class. In essence, the method penalizes errors based on the class. Errors on the minority are amplified; those on the majority are (comparatively) downweighted. This is a sneaky variant of regularization. Whereas regularization often targets the model (hyper-)parameters, here regularization applies at the observation level.
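
To make the arithmetic concrete, here is a minimal sketch of how those scaling factors arise from the observed class proportions (the pipeline later wraps the same idea in a helper function).

# Inverse-probability class weights from the observed class proportions
classProportions <- prop.table(table(loanData$loan_status))
classWeights <- 1/classProportions
classWeights  # roughly 1.1 for no default and 9 for default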

Additional resources

In some applications, class imbalance is even more severe. Fraud detection is a prime example. The needle-in-a-haystack analogy is an apt description of the problem. See Barandela et al. (2003) for a good (academic) reference on the topic. In line with the stated goal of the presentation to be code-centric, see Lemaître, Nogueira, and Aridas (2017) for a Python package that does a lot of the plumbing for you. Also, see this PyData talk for a comparison of the remedial methods.

Strategies for class imbalance

You can safely skip this section if you do not want to ponder class imbalance any further. The focus of the case study is, after all, to show the effectiveness of boosting. This is a detour, an important one, mind you, but still a detour. So I freely grant you license to skip.

Take a deep breath – it is a long list. Here are some mitigating strategies for confronting class imbalance:

  1. using cost-sensitive loss function/matrix (a variant of regularization),
  2. weighting observations by inverse frequency of class (yet another variant of regularization),
  3. undersampling the majority class (sketched below),
  4. oversampling the minority class,
  5. selecting appropriate performance metrics: accuracy can be your enemy (see Cohen’s Kappa and ROC curves),
  6. changing the fitting algorithms (e.g. zero-inflated and hurdle models),
  7. incorporating more information from the business application/domain,
  8. generating synthetic samples (e.g. SMOTE, aka the Synthetic Minority Over-sampling Technique).

In short, they lied: deep learning alone won’t solve all our problems. The joke, of course, is that I am only half joking. I have ordered the list of mitigating strategies to reflect my own preferences. Your mileage may vary. Notice that I use (2) for this case study.
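
For illustration only – strategies (3) and (4) are not used in this case study – here is a minimal sketch of undersampling and oversampling with data.table. The resampled datasets are hypothetical and discarded; they are not fed into the pipeline.

# Minimal sketch of (3) undersampling and (4) oversampling (illustration only)
set.seed(2018)
minority <- loanData[loan_status == "1"]
majority <- loanData[loan_status == "0"]
# (3) Undersample the majority class down to the size of the minority class
underSampled <- rbind(majority[sample(nrow(majority), nrow(minority))], minority)
# (4) Oversample the minority class (with replacement) up to the majority size
overSampled <- rbind(majority, minority[sample(nrow(minority), nrow(majority), 
    replace = TRUE)])
rm(minority, majority, underSampled, overSampled)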

Remarks

Again, you can safely skip this section.

First, these methods for handling class imbalance are not mutually exclusive; you can combine them as you see fit. If you do combine them, however, moderation is key. The risk of abusing these methods, and in doing so mangling/distorting the initial problem beyond recognition, is not trivial.

Second, many of these methods, but in particular (1) and (7), entail bringing more information to bear on the problem. This information ideally comes from domain experts. You should not shun this avenue. A dogmatic and sometimes dangerous position among analysts is to assume that no prior knowledge is required. Many non-parametric methods appeal because they promise to extract actionable insights from data without looping in experts. Cases in point: decision trees and neural networks. The kicker is, sometimes they do exactly that: they beat the experts at their own game. But this is not universally true. Sometimes, they do not. And you have to recognize the difference. In such situations, ignoring domain knowledge is throwing away potentially valuable insights. If you can incorporate or embed this prior (and I am intentionally using this term with all its Bayesian connotations) into the model, it can dramatically improve your predictive model.

Third, you may notice that simply correcting for class imbalance could make simple models competitive with more complex models. That is, your humble logistic regression can rival random forests, once you account for class imbalance. And so the focus on the method itself, logistic regression versus ensemble of decision trees, can be misleading. Pay attention to what matters most. It may turn out that you get the most bang for your buck with a simple hack.

Exploratory analysis

This exploratory step of looking at the data is vital. Never skip it before modeling. It isn’t the glorious part of the job but it paves the way. No shortcut: get to know your data inside out. This is how you generate plausible hypotheses. Let’s peek at the data so far.

head(loanData)

The column names are fairly self-explanatory. They are what you think they are. loan_status, for example, indicates whether the loan is in default. And so, I won’t belabor the point.

summary(loanData)
##  loan_status   loan_amnt        int_rate     grade      emp_length    
##  0:25865     Min.   :  500   Min.   : 5.42   A:9649   Min.   : 0.000  
##  1: 3227     1st Qu.: 5000   1st Qu.: 7.90   B:9329   1st Qu.: 2.000  
##              Median : 8000   Median :10.99   C:5748   Median : 4.000  
##              Mean   : 9594   Mean   :11.00   D:3231   Mean   : 6.145  
##              3rd Qu.:12250   3rd Qu.:13.47   E: 868   3rd Qu.: 8.000  
##              Max.   :35000   Max.   :23.22   F: 211   Max.   :62.000  
##                              NA's   :2776    G:  56   NA's   :809     
##   home_ownership    annual_inc           age       
##  MORTGAGE:12002   Min.   :   4000   Min.   : 20.0  
##  OTHER   :   97   1st Qu.:  40000   1st Qu.: 23.0  
##  OWN     : 2301   Median :  56424   Median : 26.0  
##  RENT    :14692   Mean   :  67169   Mean   : 27.7  
##                   3rd Qu.:  80000   3rd Qu.: 30.0  
##                   Max.   :6000000   Max.   :144.0  
## 

Missing values

Let’s look at missing data.

theme_set(theme_bw())
plot_missing(loanData)

Conditional distributions

Let’s look at categorical/discrete variables.

plot_bar(loanData)

And lastly, some boxplots for the continuous variables.

plot_boxplot(loanData, "loan_status")
## Warning: Removed 3585 rows containing non-finite values (stat_boxplot).

Preprocessing

Let’s do some of the janitorial work necessary before training the boosting model.

Training/Test split

I split the dataset into 67% training and 33% test set.

I ensure that this split reflects and maintains class imbalance with stratified sampling.

splitData <- initial_split(loanData, prop = 2/3, strata = "loan_status")
trainData <- training(splitData)
# Clean up
rm(loanData)

Design Matrix

From exploratory analysis, we know we need to transform some variables. Some call this step preprocessing; others still, feature engineering (popular in machine learning and data mining communities). Statisticians prefer the term design matrix to refer to the transformed/final dataset ready for modeling. I create a recipe to replicate the necessary transformations.

# Add steps to create design matrix
xgbRecipe <- recipe(loan_status ~ ., data = trainData) %>% step_bagimpute(emp_length, 
    int_rate) %>% step_log(annual_inc) %>% step_dummy(all_nominal())
xgbRecipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          7
## 
## Operations:
## 
## Bagged tree imputation for emp_length, int_rate
## Log transformation on annual_inc
## Dummy variables from all_nominal()

Cross-validation

For boosting, we need to tune (hyper-)parameters. I use 10-fold cross-validation repeated 10 times for this purpose. This means that we end up with a total of 100 train-validate pairs, where training is 90% of the original training dataset and the validation is the remaining 10%. Let’s peek at the first 6 of these pairs.

trainingFolds <- vfold_cv(trainData, v = 10, repeats = 10, strata = "loan_status")
setDT(trainingFolds)
# Quick peek
head(trainingFolds)

Learning curves

A mistake I sometimes see is the urge to use most of the data right away. It is the lure of Big Data and it is a trap. And I say this as a Spark user and enthusiast. Some models may require terabytes or, heaven help you, petabytes of training examples. That is Google scale. Most problems, I venture, are not Google scale. Most problems fit in RAM – yes, that means even yours. Start small instead and gradually increase your training set if and only if the predictive performance of your algorithm improves markedly in doing so.

Learning curves can inform the decision of how much data to use for training the model before the predictive performance plateaus. More data beyond that point is wasted CPU/GPU cycles, cluster headaches and network latency. Learning curves also highlight the limits of your chosen model architecture. If more data does not boost performance to the desired levels, it is time to consider changing the model and/or harvesting more/better features from the data.
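
There is no learning curve in the pipeline itself, but here is a minimal sketch of what one could look like for this dataset. It assumes a plain logistic regression on the numeric predictors as a cheap stand-in for the final model; the shape of the curve, not its level, is what matters.

# Learning-curve sketch: fit on increasing fractions of the training data and
# score each fit on the same fixed holdout (illustration only)
plotLearningCurve <- function(data, fractions = seq(0.1, 1, by = 0.1)) {
    data <- na.omit(data)  # keep the sketch simple: drop rows with missing values
    # Hold out 20% once so every training-set size is scored on the same data
    heldOutIdx <- sample(nrow(data), size = floor(0.2 * nrow(data)))
    holdout <- data[heldOutIdx, ]
    rest <- data[-heldOutIdx, ]
    aucs <- sapply(fractions, function(f) {
        idx <- sample(nrow(rest), size = floor(f * nrow(rest)))
        fit <- glm(loan_status ~ loan_amnt + int_rate + annual_inc + age + emp_length, 
            data = rest[idx, ], family = binomial())
        preds <- predict(fit, newdata = holdout, type = "response")
        as.numeric(pROC::auc(holdout$loan_status, preds))
    })
    plot(fractions * nrow(rest), aucs, type = "b", xlab = "Training set size", 
        ylab = "Holdout AUC")
}
# plotLearningCurve(trainData)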

Statisticians are formally trained to appreciate the value of samples. You can do wondrous things with samples, they teach. You do not need the whole population, they say. So, sample. And sample aggressively. This is all very stats 101 and yet, worth stating again if it is all fuzzy now.

Helper functions

This is a helper function that assigns to each observation the empirical frequency (event probability) of its class, default or no default.

assignFrequency <- function(defaults) {
    # Get frequency table
    freqTable <- as.data.frame(prop.table(table(defaults)))
    mapClassToFreq <- freqTable[["Freq"]]
    names(mapClassToFreq) <- freqTable[[1]]  # Original levels
    return(unname(mapClassToFreq[defaults]))
}

xgboost Format

xgboost is optimized to work with its own data format, the xgb.DMatrix, which can be built from sparse dgCMatrix objects. I create here a function, extractDMatrix, to convert each train-validate pair into the corresponding xgb.DMatrix objects.

extractDMatrix <- function(splitObj, recObj = xgbRecipe, colPredict = "loan_status") {
    # Use 90% for training the model
    fitData <- analysis(splitObj)
    prepObj <- prep(recObj, training = fitData, retain = TRUE, verbose = FALSE)
    # Create inverse frequency weights to correct imbalances
    frequencies <- assignFrequency(fitData[[colPredict]])
    weights <- 1/frequencies
    # Create xgboost DMatrix for training model
    xgbTrainData <- xgb.DMatrix(data = juice(prepObj, -contains(colPredict), 
        composition = "dgCMatrix"), label = juice(prepObj, contains(colPredict), 
        composition = "dgCMatrix"))
    # Use the remaining 10% for validation
    holdoutData <- assessment(splitObj)
    # Create xgboost DMatrix for validating model
    xgbValidateData <- xgb.DMatrix(data = bake(prepObj, newdata = holdoutData, 
        -contains(colPredict), composition = "dgCMatrix"), label = bake(prepObj, 
        newdata = holdoutData, contains(colPredict), composition = "dgCMatrix"))
    return(list(train = xgbTrainData, validate = xgbValidateData, weight = weights))
}

This wrapper function, extractDMatrix, also returns the weights from the inverse probability transformation to correct for class imbalance.

system.time({
    listDMatrix <- lapply(trainingFolds[["splits"]], function(x) extractDMatrix(x))
})
##    user  system elapsed 
##  447.79    1.46  488.25

I end up with a list of these xgb.DMatrix pairs, ready for model fitting. As a side note, creating these paired design matrices – all 100 of them – is a computational bottleneck. It takes about 8 minutes on my machine. It is worth caching the results. This concludes the data preparation steps.

Hyper-parameters

And now we get to the heart of xgboost. Boosting has hyper-parameters, lots of ’em. Take this reductionist definition: boosting is tuning the hyper-parameters. This is how you wring every drop of predictive performance – its value proposition – out of boosting. Some hyper-parameters, however, are specific to the implementation. Not all boosting software offers the same options. See here for a comprehensive list of xgboost parameters. Besides xgboost, other implementations of boosting include LightGBM, CatBoost, gbm (R) and scikit-learn’s gradient boosting estimators (Python).

How do we tune these hyper-parameters? We anticipated this point. I’ll use cross-validation.

Additional resources

Understanding the boosting hyper-parameters is the difference between driving stick and being on autopilot. The next session tackles these hyper-parameters in depth. For an introduction, see Chapter 8 (Tree-Based Methods) of James et al. (2013). For more academic references, see Bühlmann and Hothorn (2007), Friedman (2002), and Friedman (2001). Prefer videos? See this talk from the inimitable Trevor Hastie and this talk. A pet peeve of mine is that boosting, more than likely for historical reasons (see AdaBoost), is usually introduced jointly with (ensembles of) decision trees. But it need not be. It conflates two very different although complementary ideas. Boosting is a general idea and it can be applied to a whole slew of other algorithms. I won’t say much more than that for now.

Fixed

Let’s fix some hyper-parameters of interest.

# Fixed parameters
fixedParams <- list(booster = "gbtree", objective = "binary:logistic", nthread = parallel::detectCores() - 
    1L, eval_metric = "auc")

Tune

Let’s set the hyper-parameters to tune and give them some default values.

# Tuning parameters (set at their defaults)
# lambda: L2 regularization term on weights
# alpha: L1 regularization term on weights
tuneDefaultParams <- list(eta = 0.3, max_depth = 6L, subsample = 1, lambda = 1, 
    alpha = 0)

Bounds

Let’s determine the bounds to use for the hyper-parameter search.

# Set initial parameters
initParams <- c(fixedParams, tuneDefaultParams)
# Define the bounds of the search for Bayesian optimization
boundsParams <- list(eta = c(0.1, 0.8), max_depth = c(3L, 6L), subsample = c(0.5, 
    1), lambda = c(1, 5), alpha = c(0, 5))

Defaults

If it is any consolation, xgboost comes with good default hyper-parameters for the vast majority of problems. That is not to say you should use the default hyper-parameters as is; on the contrary, ideally you tune to your specific problem. The defaults are, however, often a good starting point.

Performance metric

What performance metric do I use for cross-validation? The easy answer is that I use the area under the receiver operating characteristic (ROC) curve, better known as the AUC. If you need a refresher, here is a quick primer on AUC. But by now, you are inoculated against easy answers, aren’t you?
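
In lieu of that primer, here is a from-scratch sketch of what AUC measures: the probability that a randomly chosen positive (a default) is scored higher than a randomly chosen negative, with ties counted as one half. The function and its toy inputs are illustrative only.

# AUC by hand: probability that a random positive outscores a random negative
# (scores are predicted probabilities, labels are the 0/1 outcomes)
aucByHand <- function(scores, labels) {
    pos <- scores[labels == 1]
    neg <- scores[labels == 0]
    # Pairwise comparisons; fine for small vectors, use pROC on real data
    mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
aucByHand(scores = c(0.9, 0.2, 0.6, 0.4), labels = c(1, 0, 1, 0))  # 1: perfect ranking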

Let’s channel Douglas Crockford (see JavaScript: The Good Parts). First stop is AUC: The Good Parts. AUC is both scale invariant and classification-threshold invariant. As opposed to accuracy, AUC can do fairly well for problems with severe class imbalance. Next stop is AUC: The Bad Parts. Scale and threshold invariance are not always desirable (you saw that coming, right?). And in fact, we can argue that because of the alleged disparities in the cost of default versus no default, we should ditch threshold invariance. See Lobo, Jiménez-Valverde, and Real (2008) for a sobering account of how even the mighty AUC can let you down in some circumstances.

If default is what we care about the most, this suggests using the true positive rate (detection rate) as perhaps the most natural performance metric.

But for now I assert, and the benchmark model will corroborate, that AUC is an adequate performance metric for our case study.

Helper functions

I create some helper functions to compute the AUC scores that we maximize during cross-validation.

# Calculate validation AUC for a single train-validate pair
getAUCFromCV <- function(listData, useInverseFrequencyWeight = TRUE, params = initParams, 
    verbose = 0) {
    # Attach inverse-frequency weights to the training DMatrix (xgb.train has
    # no weight argument; observation weights live on the DMatrix itself)
    if (useInverseFrequencyWeight) {
        setinfo(listData$train, "weight", listData$weight)
    } else {
        setinfo(listData$train, "weight", rep(1, length(listData$weight)))
    }
    # Train model
    watchlist <- list(train = listData$train, test = listData$validate)
    xgbModel <- xgb.train(params = params, data = listData$train, watchlist = watchlist, 
        verbose = verbose, callbacks = list(xgboost::cb.evaluation.log()), nrounds = 50, 
        early_stopping_rounds = 5)
    # Extract AUC on the validation fold
    testAUC <- tail(xgbModel$evaluation_log[["test_auc"]], 1)
    return(testAUC)
}
# Calculate mean AUC for all train-validate pairs
getMeanAUC <- function(listDMatrix, useInverseFrequencyWeight = TRUE, params = initParams) {
    scores <- sapply(listDMatrix, getAUCFromCV, useInverseFrequencyWeight = useInverseFrequencyWeight, 
        verbose = 0, params = params)
    meanAUC <- mean(scores)
    return(meanAUC)
}

NFL Theorem

One lesson here is to consider the trade-offs. Remember the No free lunch (NFL) theorems? There is no such thing as a panacea – not quite (not even close) what it says, but good enough for government work. Repeat it a 100x, 1e6x and internalize the message. Any time someone (on the internet or elsewhere) tries to sell you the meta-algorithm, the one ring to rule ’em all, be skeptical.

Jokes aside, the NFL theorem originally stated,

if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems

Bayesian optimization

Grid and random search are often the methods used to select the pool of hyper-parameter candidates. The selected candidates are then evaluated with cross-validation. You pluck the best one out of this fixed set.

Exhaustive grid search is computationally expensive – prohibitively so. Random search is, well, random. It explores the hyper-parameter space wastefully: you can repeatedly sample bad hyper-parameters, and nothing about the search informs the choice of better candidates. Can we do better? Is there a principled and systematic way to balance exploitation – the term the literature uses, perhaps because of the connection to the multi-armed bandit problem – and exploration of the hyper-parameter space? Bayesian optimization, to the rescue.

I randomly choose 10 initial starting points from the hyper-parameter space. I then run Bayesian optimization for 30 iterations to find the best hyper-parameters. I print the partial history of this sequential search for the last 6 iterations.

# Create wrapper function for Bayesian Optimization
maximizeAUC <- function(eta, max_depth, subsample, lambda, alpha) {
    # Create list of updated parameters
    replaceParams <- list(eta = eta, max_depth = max_depth, subsample = subsample, 
        lambda = lambda, alpha = alpha)
    updatedParams <- modifyList(initParams, replaceParams)
    # Calculate AUC with updated parameters
    scoreAUC <- getMeanAUC(listDMatrix, params = updatedParams, useInverseFrequencyWeight = TRUE)
    resultList <- list(Score = scoreAUC, Pred = 0)
    return(resultList)
}
# Run bayesian optimization
bayesSearch <- BayesianOptimization(maximizeAUC, bounds = boundsParams, init_grid_dt = as.data.table(boundsParams), 
    init_points = 10, n_iter = 30, acq = "ucb", kappa = 2.576, eps = 0, verbose = FALSE)
## 
##  Best Parameters Found: 
## Round = 38   eta = 0.2548    max_depth = 3.0000  subsample = 1.0000  lambda = 1.0000 alpha = 0.0000  Value = 0.6548
# Get optimized parameters
tunedBayesianParams <- modifyList(fixedParams, as.list(bayesSearch$Best_Par))
# Peek at history of bayesian search
tail(bayesSearch$History)

If you looked at the code, I am sure the irony did not escape you. Bayesian optimization, a method for hyper-parameter tuning, has a few hyper-parameters of its own. What can I say? No free lunch.

Additional resources

For a good overview of Bayesian optimization, see Shahriari et al. (2016) and Snoek, Larochelle, and Adams (2012). By the way, I love the title Taking the human out of the loop. For software (Python), in line with the theme of the presentation, see Martinez-Cantin (2014).

Once you delve into Bayesian optimization, you encounter Gaussian processes. At that juncture, non-parametric Bayesian models are right around the corner. The Dirichlet process is typically the gateway drug into that world.

Big ideas

If you’re keeping score, Bayesian optimization is the third big idea encountered so far (boosting and class imbalance being the other two).

Bayesian optimization is an invaluable tool to have in your arsenal. And this is going to be more and more the case because of the explosion of non-parametric models that require and are highly sensitive to hyper-parameter tuning. Deep learning, I’m looking at you. Having said tongue-in-cheek that boosting is hyper-parameter tuning, the importance of a smart strategy to tune hyper-parameters should be obvious.

Bayesian optimization does to hyper-parameter tuning what cloud computing does to sysadmin. It levels the playing field and democratizes machine learning. You need not be Geoffrey Hinton to tune hyper-parameters well. You can focus on something else.

Ask me what I am excited and learning about in my free time. Spoiler alert, it is not cryptocurrency. My focus is on Bayesian methods – say hello to Stan – and cloud computing (AWS and Azure).

Boosting rounds

… And one more thing. It turns out there is an extra boosting hyper-parameter to tune, namely the number of boosting rounds. Previously, this hyper-parameter was left floating and handled by an early stopping criterion based on … OK, OK, details. The point is, now we want to select a good value for this hyper-parameter too. I essentially piggyback on the previous step. The code details the procedure.

# Get data
listData <- extractDMatrix(splitData)
# Attach inverse-frequency weights to the training DMatrix
setinfo(listData$train, "weight", listData$weight)
# Get the iteration (boosting round) with the best AUC using 10-fold
# cross-validation and tunedBayesianParams
nRoundsBest <- xgb.cv(params = tunedBayesianParams, data = listData$train, nfold = 10, 
    verbose = 0, callbacks = list(xgboost::cb.evaluation.log()), nrounds = 100, 
    early_stopping_rounds = 5)$best_iteration
# Train model
xgbModel <- xgb.train(params = tunedBayesianParams, data = listData$train, verbose = 0, 
    nrounds = nRoundsBest)

Benchmark

Many ML algorithms fail in production because they weren’t thoroughly tested against a credible benchmark. Test, test, test – paranoia helps. The emphasis is on a credible, not naive, benchmark. This is, sometimes, a key distinction. Take this task for example: predicting default, a classification problem with class imbalance. A naive benchmark would be random guessing. This is the wrong path. We know that class imbalance a priori strongly favours the majority class. A credible benchmark is predicting the majority class. This is sometimes called the ‘null’ model or the ‘Zero Rule’ algorithm.

This null model implies a false positive rate (FPR) of 0, i.e. it classifies every actual no default correctly, with (virtually) no effort. This means that the relevant range in which to gauge the performance of the classifier is where its false positive rate (FPR) is low, close to 0. For instance, an FPR of about 11% means that the classifier labels about as many no default cases as default as there are actual defaults in the test dataset. This seems rather excessive. Granted, this is somewhat arbitrary, but I take this 11% FPR as the cutoff. In reality, this is a decision that the business needs to make. And so we should present the business with a menu of options, the so-called strategy curve, to decide.
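
To make the benchmark concrete, here is a minimal sketch of the ‘Zero Rule’ model on the held-out test set, reusing the rsample split from the preprocessing step.

# The 'Zero Rule' benchmark: always predict the majority class (no default)
testData <- testing(splitData)
nullAccuracy <- mean(testData$loan_status == "0")
nullAccuracy  # roughly 0.89, with zero defaults detected
# By construction, both its true positive rate and its false positive rate are 0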

Strategy curve

First I plot the full ROC curve for the trained model on the testing set.

# Get xgboost predictions on testing set
xgbPredicted <- predict(xgbModel, listData$validate)
actualDefaults <- getinfo(listData$validate, "label")
# Create ROC table and plot
xgbROC <- pROC::roc(response = actualDefaults, predictor = xgbPredicted)
WVPlots::ROCPlot(data.table(predict = xgbPredicted, default = actualDefaults), 
    xvar = "predict", truthVar = "default", truthTarget = 1, title = "ROC Plot for xgboost Model")

Then I create a table for the strategy curve.

# Get strategy curve (trade-off between TPR and TNR)
# TPR = sensitivities, TNR = specificities
xgbStrategyCurve <- data.table(thresholds = xgbROC$thresholds, TPR = xgbROC$sensitivities, 
    TNR = xgbROC$specificities)
# Find relevant range of the strategy curve
buffer <- round(0.1/0.9, 2)
relevantStrategyCurve <- xgbStrategyCurve[between(TNR, 1 - buffer, 1)]
relevantStrategyCurve[seq(1, .N, by = 5)]

Now the business can choose the best thresholds to classify accounts (clients) as default or no default.

Naive and better

Note that in some domains (time series, for example), naive benchmarks are tough to beat. For stock prices, the Gaussian random walk is the naive benchmark, and it has proven to be incredibly resilient. Heed that old adage about simple models: “always wrong, but seldom bested” (if anyone knows who said this, please let me know).

AUC Revisited

By the way, this is why, as alluded to before, AUC is a flawed (cross-validation) measure for our task: we are not really interested in doing better than ‘random’ at all costs. AUC is computed over all FPRs, treating low and high values as equally relevant. This is not to beat up on AUC. It is a fine performance measure. It is a matter of knowing the limits of your chosen metrics.
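
If you still want a single-number summary confined to the relevant region, pROC can compute a partial AUC. A minimal sketch, reusing the xgbROC object and the 11% FPR cutoff from the strategy curve:

# Partial AUC restricted to specificity (TNR) between 0.89 and 1, i.e. an FPR
# at or below the 11% cutoff; partial.auc.correct rescales so 0.5 means 'random'
pROC::auc(xgbROC, partial.auc = c(1, 0.89), partial.auc.focus = "specificity", 
    partial.auc.correct = TRUE)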

Epilogue: Why boosting?

Robust software comes with sane defaults. That is, most applications won’t need to modify the defaults to produce performant models. This is the Pareto principle, aka the 80/20 rule. Let’s define another loaded term: mature. A mature ML algorithm is one that requires a minimal amount of supervision/manual fiddling to be effective. A corollary of this property is that debugging (trouble-shooting) and model diagnostics should be clear, consistent and actionable. When the algorithm fails, and models invariably do, the postmortem should easily detect and report the cause of death.

My contention is that boosting is, in this narrow sense, a mature and robust ML method. For comparison, take deep learning (DL). DL often requires considerably more effort to work well out of the box. This is not to say don’t use DL. Rather, be aware that the bar to use DL effectively is relatively higher. In a section aptly titled Building Energy Results, Cook (2016) shows a figure comparing ML methods on predictive performance versus effort. However, and as always, caveat emptor. Add your favourite disclaimer. To recap, why boosting? Because it gives you superior – if not best in class – predictive performance (i.e. effective) for modest/minimal effort (i.e. mature).

Thank you

If you’ve stuck with me thus far, thank you for reading. Even more so, thank you for your indulgence. The presentation is intended to be somewhat provocative to spur conversation and encourage debate.

References

Barandela, Ricardo, José Salvador Sánchez, Vicente Garcia, and Edgar Rangel. 2003. “Strategies for Learning in Class Imbalance Problems.” Pattern Recognition 36 (3). Pergamon: 849–51.

Bühlmann, Peter, and Torsten Hothorn. 2007. “Boosting Algorithms: Regularization, Prediction and Model Fitting.” Statistical Science. JSTOR, 477–505.

Cook, Darren. 2016. “Practical Machine Learning with H2O.” O’Reilly Media.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics. JSTOR, 1189–1232.

———. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4). Elsevier: 367–78.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Lemaître, Guillaume, Fernando Nogueira, and Christos K Aridas. 2017. “Imbalanced-Learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning.” Journal of Machine Learning Research 18 (17): 1–5.

Lobo, Jorge M, Alberto Jiménez-Valverde, and Raimundo Real. 2008. “AUC: A Misleading Measure of the Performance of Predictive Distribution Models.” Global Ecology and Biogeography 17 (2). Wiley Online Library: 145–51.

Martinez-Cantin, Ruben. 2014. “Bayesopt: A Bayesian Optimization Library for Nonlinear Optimization, Experimental Design and Bandits.” The Journal of Machine Learning Research 15 (1). JMLR. org: 3735–9.

Rubin, Donald B. 1996. “Multiple Imputation After 18+ Years.” Journal of the American Statistical Association 91 (434). Taylor & Francis: 473–89.

Saar-Tsechansky, Maytal, and Foster Provost. 2007. “Handling Missing Values When Applying Classification Models.” Journal of Machine Learning Research 8 (Jul): 1623–57.

Shahriari, Bobak, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2016. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proceedings of the IEEE 104 (1). IEEE: 148–75.

Snoek, Jasper, Hugo Larochelle, and Ryan P Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems, 2951–9.