Abstract

In this analysis, I aim to predict airline passenger satisfaction, using H2o AutoML to train and compare several models and Lime to perform local interpretation of the final model.

What is H2o AutoML?

H2o AutoML, or Automatic Machine Learning, is the process of automating algorithm selection, feature generation, hyper-parameter tuning, iterative modeling and model assessment. This speeds up the process of training machine learning models.

What is Lime?

While measures like Accuracy, Precision, and Recall are useful for assessing how good a model is, they do not tell you why a model makes a specific prediction. Lime is a method for explaining the outcome of black box machine learning models.

Please note that the scope of this project ends after implementing and interpreting H2o AutoML and summarizing the findings. In a real-world setting, this would just be the first step, and further analysis would be needed to determine what the results of the model mean for the business or company in terms of financial benefits. Such analysis can include Expected Value Analysis, a framework companies use to tie machine learning to ROI. Performing Expected Value Analysis would require additional data on the financial costs and benefits of satisfied and unsatisfied passengers, which we do not have. Additionally, the initial models performed very well in terms of accuracy, so I do not perform hyper-parameter tuning in this analysis.

Before presenting a summary of findings, it may help to understand the data first.

Data Description

The dataset comes from Kaggle and was provided by an airline, so some of the data has been anonymized. See a summary of the data below -

  • Training Set - 24 features and 103,904 observations

  • Test Set - 24 features and 25,976 observations

See detailed feature description below:

  • Gender: Gender of the passengers (Female, Male)

  • Customer Type: The customer type (Loyal customer, disloyal customer)

  • Age: The actual age of the passengers

  • Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)

  • Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)

  • Flight distance: The flight distance of this journey

  • In-flight wifi service: Satisfaction level of the in-flight wifi service (0: Not Applicable; 1 - 5)

  • Departure/Arrival time convenient: Satisfaction level of departure/arrival time convenience

  • Ease of Online booking: Satisfaction level of online booking

  • Gate location: Satisfaction level of Gate location

  • Food and drink: Satisfaction level of Food and drink

  • Online boarding: Satisfaction level of online boarding

  • Seat comfort: Satisfaction level of Seat comfort

  • In-flight entertainment: Satisfaction level of in-flight entertainment

  • On-board service: Satisfaction level of On-board service

  • Leg room service: Satisfaction level of Leg room service

  • Baggage handling: Satisfaction level of baggage handling

  • Check-in service: Satisfaction level of Check-in service

  • In-flight service: Satisfaction level of in-flight service

  • Cleanliness: Satisfaction level of Cleanliness

  • Departure Delay in Minutes: Minutes delayed at departure

  • Arrival Delay in Minutes: Minutes delayed at arrival

  • Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied). This is our outcome feature.

Note: All features with “Satisfaction level of…” have a 0 - 5 score, with 5 reflecting the highest level of satisfaction. Looking at the data, it appears 0 was assigned when the passenger did not score that feature, or the airline did not offer it.


Summary of Findings

Pre-Modeling

The dataset was fairly “balanced”, with 43.3% Satisfied and 56.7% Not Satisfied passengers. A pre-modeling correlation funnel provided an initial understanding of some of the drivers of passenger satisfaction.
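
A funnel like this can be produced with the correlationfunnel package. Below is a minimal sketch, assuming the binarized outcome column comes out named satisfaction__satisfied -

# Correlation funnel sketch (correlationfunnel package)
library(correlationfunnel)

train_raw_tbl %>% 
    select(-id) %>% 
    drop_na() %>%                                                   # binarize() needs complete cases
    mutate(across(inflight.wifi.service:cleanliness, factor)) %>%   # keep the 0 - 5 scores as discrete levels
    binarize(n_bins = 4) %>%                                        # one-hot encode; remaining numerics binned into 4 bins
    correlate(target = satisfaction__satisfied) %>%                 # correlate each binary feature with the outcome
    plot_correlation_funnel()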

In the plot above, the x axis shows the correlations, while the y axis shows the features. Satisfied and Not Satisfied sit at the extreme ends (-1 and 1) because they are perfectly correlated with themselves; it is all the other features that we should focus on. From top to bottom, the chart orders features by strength of correlation with a passenger being Satisfied or Not Satisfied. For example, Class shows the strongest correlation with Satisfaction: passengers in Business class correlate more strongly with Satisfied, while passengers in Eco and Eco Plus correlate more strongly with Not Satisfied.

Look also at the 3rd feature, Online Boarding. Recall this is a satisfaction score from 0 - 5 (5 being the highest) for the online boarding process. Passengers with a score of 4 or 5 correlate with Satisfied, while passengers with scores of 1, 2 and 3 correlate with Not Satisfied. Passengers with a score of 0 show no correlation at all.

The numeric features (Age, Flight Distance, Departure Delay and Arrival Delay) are binned into 4 bins, and their correlations with the outcome are plotted on the chart as well.

The plot above provided an initial understanding of how each feature relates to the outcome. However, it’s important to note that correlation does not imply causation.

Post-Modeling

H2o AutoML produced several high performing models; however, XGBoost was the leading model and had the best overall performance in predicting whether a passenger was Satisfied or Not Satisfied. Gain and Lift calculations (using the XGBoost model) were used to assess the benefits of using a predictive model over a baseline of 56% (the proportion of Not Satisfied customers in the dataset). For example, the model was 100% accurate in predicting the outcome of the first 2,597 passengers most likely (based on probability) to be Not Satisfied, resulting in a Gain of 18% and a Lift of ~1.8 over the baseline within those first 2,597 passengers.


Finally, Lime was used to further interpret our model and understand the drivers of Satisfied vs Not Satisfied passengers. For any single passenger or set of passengers, we can visually understand why the model predicted an outcome.

For this passenger, the model correctly predicted them as Not Satisfied. The blue bars are features that support the predicted outcome, while the red bars are features that contradict it. The length of each bar represents the weight assigned to that feature in supporting or contradicting the predicted outcome. In the plot above, features like Customer Type (Disloyal), In-Flight Wifi Service (2) and Online Boarding (3) supported the predicted outcome of Not Satisfied, while features like Type of Travel (Business), Leg Room Service (1) and Departure Delay (12 minutes) contradicted it.

Using Lime in this way, the airline can run this analysis on passengers that were Not Satisfied and come up with recommendations for improving their experience, thus changing their satisfaction to Satisfied.



Detailed Analysis

Starting below, I go through the entire analysis. Please note that while I provide code for all the outputs, some outputs are created using custom functions I wrote and saved in a separate file. Those files will be made available on Github.

1.0 Libraries and Data

Here I load the data and do some brief initial cleaning, such as removing an additional index column named “x” and renaming one of the outcome levels from “neutral or dissatisfied” to “not satisfied”.
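
For reference, here is a minimal sketch of that step, assuming the Kaggle files are named train.csv and test.csv (the paths are placeholders) -

# Load & clean the train set (the test set is handled the same way)
library(tidyverse)

train_raw_tbl <- read.csv("../00_Data/train.csv") %>% 
    as_tibble() %>% 
    rename_with(tolower) %>%   # lowercase names like inflight.wifi.service
    select(-x) %>%             # drop the extra index column
    mutate(satisfaction = fct_recode(factor(satisfaction),
                                     "not satisfied" = "neutral or dissatisfied"))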

2.0 Data Quality Inspection

A detailed feature description has already been provided in the introduction above. Please refer to that section if you need to.

Checking For Missing Values

par(mar = c(4, 4, .1, .1))

p1 <- train_raw_tbl %>% 
    func_plot_missing_data()+
    func_plot_axis_text_format()+
    labs(title = "Missing Values: Train Set", y = "")
   

p2 <- test_raw_tbl %>% 
    func_plot_missing_data()+
    func_plot_axis_text_format()+
    labs(title = "Missing Values: Test Set", y = "")

p1
p2

Except for a few missing values in the Arrival Delay column, this is a complete dataset. We’ll decide on a method to impute the missing values later on.
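
A quick non-graphical check of the same thing -

# Count NAs per column, keeping only columns with missing values
train_raw_tbl %>% 
    summarise(across(everything(), ~ sum(is.na(.)))) %>% 
    gather(key = "feature", value = "n_missing") %>% 
    filter(n_missing > 0)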


3.0: Exploratory Data Analysis

3.1 Data Summary - skimr

The skim() function can be used to get a quick summary of the dataset.

train_raw_tbl %>% skim()
Data summary
Name Piped data
Number of rows 103904
Number of columns 24
_______________________
Column type frequency:
factor 5
numeric 19
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 2 Fem: 52727, Mal: 51177
customer.type 0 1 FALSE 2 Loy: 84923, dis: 18981
type.of.travel 0 1 FALSE 2 Bus: 71655, Per: 32249
class 0 1 FALSE 3 Bus: 49665, Eco: 46745, Eco: 7494
satisfaction 0 1 FALSE 2 not: 58879, sat: 45025

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 64924.21 37463.81 1 32533.75 64856.5 97368.25 129880 ▇▇▇▇▇
age 0 1 39.38 15.11 7 27.00 40.0 51.00 85 ▃▇▇▅▁
flight.distance 0 1 1189.45 997.15 31 414.00 843.0 1743.00 4983 ▇▃▂▁▁
inflight.wifi.service 0 1 2.73 1.33 0 2.00 3.0 4.00 5 ▆▇▇▆▃
departure.arrival.time.convenient 0 1 3.06 1.53 0 2.00 3.0 4.00 5 ▆▆▆▇▇
ease.of.online.booking 0 1 2.76 1.40 0 2.00 3.0 4.00 5 ▇▇▇▆▅
gate.location 0 1 2.98 1.28 0 2.00 3.0 4.00 5 ▅▆▇▇▃
food.and.drink 0 1 3.20 1.33 0 2.00 3.0 4.00 5 ▅▇▇▇▇
online.boarding 0 1 3.25 1.35 0 2.00 3.0 4.00 5 ▃▅▆▇▆
seat.comfort 0 1 3.44 1.32 0 2.00 4.0 5.00 5 ▃▃▅▇▆
inflight.entertainment 0 1 3.36 1.33 0 2.00 4.0 4.00 5 ▃▅▅▇▇
on.board.service 0 1 3.38 1.29 0 2.00 4.0 4.00 5 ▃▃▆▇▆
leg.room.service 0 1 3.35 1.32 0 2.00 4.0 4.00 5 ▃▆▆▇▇
baggage.handling 0 1 3.63 1.18 1 3.00 4.0 5.00 5 ▂▂▅▇▆
checkin.service 0 1 3.30 1.27 0 3.00 3.0 4.00 5 ▃▃▇▇▆
inflight.service 0 1 3.64 1.18 0 3.00 4.0 5.00 5 ▂▂▅▇▆
cleanliness 0 1 3.29 1.31 0 2.00 3.0 4.00 5 ▃▅▇▇▇
departure.delay.in.minutes 0 1 14.82 38.23 0 0.00 0.0 12.00 1592 ▇▁▁▁▁
arrival.delay.in.minutes 310 1 15.18 38.70 0 0.00 0.0 13.00 1584 ▇▁▁▁▁

A few things to note here. The numeric features from In-flight Wifi Service through Cleanliness are satisfaction scores from 0 - 5 and can be considered ordered categorical features; we’ll need to convert these to factors later on. Also, id is just an identifier for the passenger and will not be used for modeling.
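
Related to the note in the data description that a score of 0 means “not rated”, we can quickly count the 0s per score column while these features are still numeric -

# How many passengers gave each score feature a 0 ("not rated")?
train_raw_tbl %>% 
    summarise(across(inflight.wifi.service:cleanliness, ~ sum(. == 0))) %>% 
    gather(key = "feature", value = "n_zero") %>% 
    arrange(desc(n_zero))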

Next we go deeper and explore a few features of interest.

3.2 Categorical Features

Proportion of Outcome (Satisfaction)

train_raw_tbl %>% 
  count(satisfaction) %>% 
  ungroup() %>% 
  mutate(pct = n/sum(n),
         pct_txt = pct %>% scales::percent(accuracy = .1)) %>% 
  ggplot(aes(satisfaction, n, fill = satisfaction))+
  geom_col(col = "black")+
  expand_limits(y = c(0, 63000))+
  scale_y_continuous(labels = scales::comma_format())+
  geom_text(aes(label = pct_txt), vjust = -1, size = 4, fontface = "bold")+
  theme_bw()+
  scale_fill_manual(values = scale_manual_2)+
  theme(legend.title = element_blank(),
        legend.key.size = unit(0.35, "cm"))+
  labs(title = "Proportion of Outcome feature (Satisfaction)", x = "", y = "")

The dataset appears fairly balanced.


Age vs Class

# par(mar = c(4, 4, .1, .1))

p1 <- train_raw_tbl %>% 
    ggplot(aes(class, fill = satisfaction))+
    geom_bar(width = 0.7, position = position_dodge(), col = "black")+
    scale_y_continuous(labels = scales::comma_format())+
    scale_fill_manual(values = scale_manual_2)+
    theme_bw()+
    func_plot_axis_text_format(mosaic = FALSE)+
    labs(title = "Count")+
    theme(legend.position = "bottom",
          legend.direction = "horizontal",
          legend.title = element_blank(),
          legend.key.size = unit(0.35, "cm"))
    
p2 <-  train_raw_tbl %>% 
    ggplot(aes(class, age, fill = satisfaction))+
    geom_boxplot()+
    scale_y_continuous(labels = scales::comma_format())+
    scale_fill_manual(values = scale_manual_2)+
    theme_bw()+
    func_plot_axis_text_format(mosaic = FALSE)+
    labs(title = "Distribution")+
    theme(legend.position = "none")

p2 / p1 +
  plot_annotation(title = "Age vs Class", theme = theme(plot.title = element_text(face = "bold")))


Although there is a high count of passengers in the Eco class, the majority of them are Not Satisfied, while the opposite is the case for the Business class, where the majority of passengers are Satisfied. Additionally, looking at the boxplots, we can see that Satisfied passengers flying Business are slightly older than Not Satisfied ones. Overall, the median age of Satisfied passengers is slightly higher than that of Not Satisfied passengers.


Age vs Travel Type

# par(mar = c(4, 4, .1, .1))

p3 <- train_raw_tbl %>% 
    ggplot(aes(type.of.travel, fill = satisfaction))+
    geom_bar(width = 0.7, position = position_dodge(), col = "black")+
    scale_y_continuous(labels = scales::comma_format())+
    scale_fill_manual(values = scale_manual_2)+
    theme_bw()+
    func_plot_axis_text_format(mosaic = FALSE)+
    labs(title = "Frequency - Travel Type", y = "freq")
    
p4 <-  train_raw_tbl %>% 
    ggplot(aes(type.of.travel, age, fill = satisfaction))+
    geom_boxplot()+
    scale_y_continuous(labels = scales::comma_format())+
    scale_fill_manual(values = scale_manual_2)+
    theme_bw()+
    func_plot_axis_text_format(mosaic = FALSE)+
    labs(title = "Distribution - Age vs Travel Type")+
    theme(legend.position = "none")

p4 / p3 +
  plot_annotation(title = "Age vs Travel Type", theme = theme(plot.title = element_text(face = "bold")))


The majority of passengers flying for Personal Travel are Not Satisfied. It may be of interest to the airline to understand why this is the case.

Next, I’ll show a series of mosaic plots to visualize the relationships between some categorical features with satisfaction scores (0 - 5) and the outcome feature (Satisfaction).

func_mosaic_plot(data = train_raw_tbl, var_name = online.boarding) + 
  facet_grid(~ online.boarding, scales = "free_x", space = "free_x")+
  func_plot_axis_text_format(facet = TRUE, mosaic = TRUE)+
  labs(title = "Online Boarding", x = "", y = "")


func_mosaic_plot(data = train_raw_tbl, var_name = inflight.wifi.service) + 
  facet_grid(~ inflight.wifi.service, scales = "free_x", space = "free_x")+
  func_plot_axis_text_format(facet = TRUE, mosaic = TRUE)+
  labs(title = "In Flight Wifi Service", x = "", y = "")


func_mosaic_plot(data = train_raw_tbl, var_name = seat.comfort) + 
  facet_grid(~ seat.comfort, scales = "free_x", space = "free_x")+
  func_plot_axis_text_format(facet = TRUE, mosaic = TRUE)+
    labs(title = "Seat Comfort", x = "", y = "")

func_mosaic_plot(data = train_raw_tbl, var_name = inflight.entertainment) + 
  facet_grid(~ inflight.entertainment, scales = "free_x", space = "free_x")+
  func_plot_axis_text_format(facet = TRUE, mosaic = TRUE)+
    labs(title = "In Flight Entertainment", x = "", y = "")


func_mosaic_plot(data = train_raw_tbl, var_name = ease.of.online.booking) + 
  facet_grid(~ ease.of.online.booking, scales = "free_x", space = "free_x")+
  func_plot_axis_text_format(facet = TRUE, mosaic = TRUE)+
    labs(title = "Ease Of Online Booking", x = "", y = "")

func_mosaic_plot(data = train_raw_tbl, var_name = checkin.service) + 
  facet_grid(~ checkin.service, scales = "free_x", space = "free_x")+
  func_plot_axis_text_format(facet = TRUE, mosaic = TRUE)+
    labs(title = "Check-In Service", x = "", y = "")

The mosaic plots above are indicative of the relationships between the satisfaction scores of predictors and the outcome feature (Satisfaction). The x axis shows the variation in the proportions of scores (0 - 5) for each feature, while the y axis shows the variation in the proportion of Satisfaction. Most features show variation in the overall proportion of Satisfaction, as well as in the proportion of Satisfaction for each score (0 - 5).
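
For readers without the func_mosaic_plot() helper, a filled bar chart is a rough stand-in; unlike a true mosaic plot, the bar widths below don’t encode how many passengers gave each score -

# Proportion of Satisfied vs Not Satisfied within each Online Boarding score
train_raw_tbl %>% 
    ggplot(aes(factor(online.boarding), fill = satisfaction))+
    geom_bar(position = "fill", col = "black")+
    scale_fill_manual(values = scale_manual_2)+
    theme_bw()+
    labs(title = "Online Boarding", x = "score", y = "proportion")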


3.3: Numeric Features

We can plot histograms to understand the distributions of other numeric features -

# Histogram of Numeric features
train_raw_tbl %>% 
    select(age, flight.distance, departure.delay.in.minutes, arrival.delay.in.minutes, satisfaction) %>% 
    gather(key = key, value = value, age:arrival.delay.in.minutes) %>% 
    ggplot(aes(value, fill = satisfaction))+
    geom_histogram()+
    scale_y_continuous(labels = scales::comma_format())+
    scale_x_continuous(labels = scales::comma_format())+
    facet_wrap(~ key, scales = "free")+
    theme_bw()+
    func_plot_axis_text_format(facet = TRUE, mosaic = FALSE)+
    scale_fill_manual(values = scale_manual_2)+
    labs(title = "Numeric features")+
    theme(legend.position = "bottom",
          legend.direction = "horizontal",
          legend.title = element_blank(),
          legend.key.size = unit(0.35, "cm"))

We can observe skewed data for Arrival Delay, Departure Delay and Flight Distance; however, this should not be an issue, as H2o AutoML will handle any necessary pre-processing for skewed data.


4.0: Modeling

Note: From this section on, I will not be showing all the outputs. H2o outputs are pretty lengthy so I’ll be selective in what outputs I show.

Steps to be taken in this phase include:

Step 1: Pre-Processing: H2o performs most of the pre-processing steps, however we still need to do a few things first, including releveling the outcome (Satisfaction) factor levels (this step isn’t strictly necessary, but it helps me interpret the results). I’ll also create a recipe to perform the following -

  1. Remove the ID column (we don’t need this for modeling)
  2. Remove any zero variance predictors
  3. Impute missing values in Arrival Delay with the median, which is 0

Step 2: Split Train Set Into Training & Validation Sets: Here we’ll convert our train and test sets into H2o objects and also split the train set into training and validation sets; H2o uses the validation set to evaluate models during training.

Step 3: Create Model Specifications & Implement H2o: This includes assigning roles to our features (predictors and outcome), specifying our training and validation sets, and setting max_runtime_secs.

Step 4: Inspect Leaderboard: Inspect H2o models and determine the best model.

4.1: Initial Pre-Processing

Here we do a fct_relevel() on the outcome and create a basic recipe() object.

# Relevel Factors for Satisfaction 
train_raw_tbl <- train_raw_tbl %>% 
    mutate(across(c(inflight.wifi.service:cleanliness), factor)) %>% 
    mutate(satisfaction = satisfaction %>% fct_relevel(
        "satisfied", "not satisfied"
    ))

test_raw_tbl <- test_raw_tbl %>% 
    mutate(across(c(inflight.wifi.service:cleanliness), factor)) %>% 
    mutate(satisfaction = satisfaction %>% fct_relevel(
        "satisfied", "not satisfied"
    ))

# Recipe
recipe_obj <- recipe(satisfaction ~ ., data = train_raw_tbl) %>% 
    step_rm(id) %>% 
    step_zv(all_predictors()) %>% 
    step_medianimpute(arrival.delay.in.minutes) %>% 
    prep()

# Apply (Bake) the Recipe to the Train & Test Set
train_tbl <- bake(recipe_obj, new_data = train_raw_tbl)
test_tbl <- bake(recipe_obj, new_data = test_raw_tbl)

We are now ready to convert our train and test sets to H2o objects. First we’ll need to initialize H2o. If you’re using H2o for the first time, you’ll need to install it prior to initializing.
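
The install is a one-liner from CRAN (note that h2o also requires a compatible Java runtime) -

# One-time install of the h2o package
install.packages("h2o")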

# Initialize H2o
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 5 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.3 
##     H2O cluster version age:    8 months and 25 days !!! 
##     H2O cluster name:           H2O_started_from_R_BachataLu_kgu237 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.94 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)
# Disable Progress Bar 
h2o.no_progress()

4.2: Split Training Data & convert to H2o Objects

# Split H2o Frame into Training & Validation Sets
h2o_split_obj <- h2o.splitFrame(as.h2o(train_tbl), ratios = c(0.85), seed = 100)
train_h2o <- h2o_split_obj[[1]]
valid_h2o <- h2o_split_obj[[2]]
test_h2o <- as.h2o(test_tbl)

4.3: Model Specification & Implementation

This step could take some time to run, depending on the values of max_runtime_secs and nfolds you specify. Here I’m using 300 secs and 5 folds to ease computation time. Increasing these numbers may improve model performance.

# Assign Roles
y <- "satisfaction"
x <- setdiff(names(train_h2o), y)

# Model Specification
automl_models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame = train_h2o,
    validation_frame = valid_h2o,
    max_runtime_secs = 300,
    nfolds = 5,
    seed = 100
    
)

4.4: Leaderboard Inspection & Model Performance

While H2o produces 20 different models, we’ll examine the top 5 models (based on Area Under the Curve).

# Load Data Table with Top 5 Models
readRDS("../06_Exec_Summary_Plots/top_5_leaderbord_dt_table.rds")

We can see the top 5 models ranked by AUC and logloss. All models have a very high AUC.

The top 5 models were saved in a folder to avoid re-running them every time I opened the file; going forward, models will be loaded from the saved files. Additionally, I do not go into hyper-parameter tuning via grid search for this analysis. Refer to the documentation here to learn how to perform hyper-parameter tuning and grid search in H2o.
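
For reference, a sketch of how the leaderboard can be extracted and the leading model saved, assuming the automl_models_h2o object from section 4.3 is still in memory -

# Extract the leaderboard as a tibble and save the leading model to disk
leaderboard_tbl <- automl_models_h2o@leaderboard %>% 
    as_tibble() %>% 
    slice(1:5)

h2o.saveModel(object = automl_models_h2o@leader, path = "../05_Models")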

5.0: Model Performance Evaluation

In this section we’ll further investigate the performance of our top model (XGBoost_grid__1_AutoML). To see detailed information about a model, we can just enter the model id in the console. H2o outputs are pretty long and detailed, so I’ll just extract the relevant parts I want to show.

Here I show how to extract the AUC for the train, validation and cross-validation sets. The model shows great performance across all sets.

# Load Leaderboard Leader 
automl_leader <- h2o.loadModel("../05_Models/XGBoost_grid__1_AutoML_20210603_185006_model_1")

h2o.auc(automl_leader, train = TRUE, valid = TRUE, xval = TRUE)
##     train     valid      xval 
## 0.9977117 0.9945765 0.9946551

Additionally, we can evaluate this model on our test set. Again, we get a high AUC on the test set.

# AUC on Test Set
h2o.auc(h2o.performance(automl_leader, newdata = test_h2o))
## [1] 0.9948804

To make predictions on the test data, we can run the code below.

h2o.predict(automl_leader, newdata = test_h2o) %>% head()
##         predict not satisfied   satisfied
## 1     satisfied  0.0352563858 0.964743614
## 2     satisfied  0.0002663732 0.999733627
## 3 not satisfied  0.9976973534 0.002302654
## 4     satisfied  0.0049210787 0.995078921
## 5 not satisfied  0.9506860375 0.049313966
## 6     satisfied  0.0054655075 0.994534492

To get the confusion matrix on the test set, we can run the code below.

h2o.confusionMatrix(automl_leader, newdata = test_h2o)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.561208647489548:
##               not satisfied satisfied    Error        Rate
## not satisfied         14305       268 0.018390  =268/14573
## satisfied               701     10702 0.061475  =701/11403
## Totals                15006     10970 0.037304  =969/25976

Note: When evaluating the performance of H2o models on a test set, we can use the h2o.performance() function and supply the test set as newdata. Additionally, we can see a list of performance metrics by calling our performance object and appending “@metrics” to it. You can find further details here.
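
A quick sketch of that approach -

# Evaluate on the test set, then list the metrics stored in the performance object
perf_test <- h2o.performance(automl_leader, newdata = test_h2o)
perf_test@metrics %>% names()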


ROC Plots

We can visualize ROC Plots for our top model.

# Load Models
xgboost_grid_1_automl <- h2o.loadModel("../05_Models/XGBoost_grid__1_AutoML_20210603_185006_model_1")
stacked_ensemble_all_automl <- h2o.loadModel("../05_Models/StackedEnsemble_AllModels_AutoML_20210603_185006")
stacked_ensemble_bof_automl <- h2o.loadModel("../05_Models/StackedEnsemble_BestOfFamily_AutoML_20210603_185006")
gbm_grid_1_automl <- h2o.loadModel("../05_Models/GBM_grid__1_AutoML_20210603_185006_model_1")
xgboost_2_automl <- h2o.loadModel("../05_Models/XGBoost_2_AutoML_20210603_185006")

# Evaluate Performance on Test Set. Extract Metrics for ROC Plot Using Custom Function
xgboost_grid_1_automl_perf <- xgboost_grid_1_automl %>% 
    func_get_model_performance_auc_plot() %>% 
    mutate(model = "xgboost_grid_1_automl")

xgboost_grid_1_automl_perf %>% 
  mutate(auc = auc %>% round(5) %>%  as.character() %>% as.factor()) %>% 
  ggplot(aes(fpr, tpr, color = model, linetype = auc))+
  geom_line(size = 1.5)+
  scale_color_manual(values = scale_manual_5)+
  theme_bw()+
  func_plot_axis_text_format(facet = TRUE)+
   labs(title = "ROC Plot", subtitle = "xgboost_grid_1_automl")+
   theme(legend.position = "none")

Normally you would want to visually inspect ROC curves for multiple models to determine the model with the best AUC; however, since all 5 models are at about 99%, the curves would simply overlap each other if plotted together. The key point here is that all 5 models perform well on our test set.
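
For reference, a sketch of roughly what the custom helper does, assuming h2o.fpr() and h2o.tpr() return one row per threshold -

# Extract fpr/tpr per threshold into a ggplot-friendly ROC table
perf <- h2o.performance(xgboost_grid_1_automl, newdata = test_h2o)

roc_tbl <- tibble(
    fpr = h2o.fpr(perf)$fpr,
    tpr = h2o.tpr(perf)$tpr,
    auc = h2o.auc(perf)
)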

6.0: Gain & Lift

Gain and Lift are additional performance measures for classification models and can be used to make a business case for the necessity of a predictive model. They measure how much better one can expect to do with a predictive model than without one, and are widely used in marketing analytics and other domains.

In this case, suppose the airline would like to identify unsatisfied passengers and find ways to improve their experience. Doing so could cost the airline money and other resources, so a targeted approach using predictive modeling will be more feasible than a random, blanket approach.

# Predictions
predictions_tbl <- h2o.predict(object = xgboost_grid_1_automl, newdata = test_h2o) %>% as_tibble()

predictions_tbl %>% 
  bind_cols(test_tbl %>% select(satisfaction)) %>% 
  arrange(desc(not.satisfied)) %>% 
  slice(1:2597) %>% 
  mutate(match = case_when(
      predict == "not satisfied" & satisfaction == "not satisfied" ~ "match",
      TRUE ~ "not a match")) %>% 
  group_by(match) %>% 
  count() %>% 
  mutate(pct = n/sum(n))
## # A tibble: 1 × 3
## # Groups:   match [1]
##   match     n   pct
##   <chr> <int> <dbl>
## 1 match  2597     1

Within the first 2,597 passengers (the 1st decile of 10 equal buckets) with the highest probability of Not Satisfied, our model has a 100% accuracy rate. Without a model, we would only expect 56% of those first 2,597 passengers to be Not Satisfied (56% being the proportion of Not Satisfied customers in the dataset), i.e. about 1,454 people. Looking at the entire test set, 14,573 (56% of 25,976) passengers were Not Satisfied, and we detected 2,597 of those 14,573 (18%) within the first 2,597 passengers; that 18% is the Gain. Lift is the ratio between what we found (2,597) and what we expected (1,454), so here our Lift is 2,597 / 1,454 ≈ 1.8. In other words, the model captures about 1.8 times as many Not Satisfied passengers in the first 2,597 cases as random selection would.
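
The same arithmetic, spelled out -

# Gain & Lift for the first bucket of 2,597 passengers
n_bucket <- 2597
p_base   <- 14573 / 25976                   # baseline rate of Not Satisfied (~56%)

gain <- n_bucket / 14573                    # share of all Not Satisfied captured: ~18%
lift <- n_bucket / (p_base * n_bucket)      # found vs expected in the bucket: ~1.8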

Next, we can plot the Gain and Lift charts.

# Gain Data 
gain_data_tbl <- predictions_tbl %>% 
    bind_cols(test_tbl %>% select(satisfaction)) %>% 
    arrange(desc(not.satisfied)) %>% 
    mutate(ntile = ntile(not.satisfied, n = 16)) %>% 
    group_by(ntile) %>% 
    summarise(
        cases = n(),
        responses = sum(satisfaction == "not satisfied")
    ) %>% 
    arrange(desc(ntile)) %>% 
    mutate(group = row_number()) %>% 
    select(group, cases, responses) %>% 
    mutate(
        cumulative_responses = cumsum(responses),
        pct_responses        = responses / sum(responses),
        gain                 = cumsum(pct_responses),
        cumulative_pct_cases = cumsum(cases) / sum(cases),
        lift                 = gain / cumulative_pct_cases,
        gain_baseline        = cumulative_pct_cases,
        lift_baseline        = gain_baseline / cumulative_pct_cases
        
    ) %>% 
    select(group, cumulative_pct_cases, gain, gain_baseline) %>% 
    gather(key = key, value = value, gain, gain_baseline )

gain_chart <- gain_data_tbl %>% 
    ggplot(aes(cumulative_pct_cases, value, color = key))+
    geom_line(size = 1.5)+
    theme_bw()+
    theme(legend.position = "right")+
    labs(title = "Gain Chart",
         x = "Cumulative Data Fraction", y = "Gain")+
    func_plot_axis_text_format()+
    scale_color_manual(values = c("#2c3e50", "#CE295E"))

gain_chart

# Lift Data 
lift_data_tbl <- predictions_tbl %>% 
    bind_cols(test_tbl %>% select(satisfaction)) %>% 
    arrange(desc(not.satisfied)) %>% 
    mutate(ntile = ntile(not.satisfied, n = 16)) %>% 
    group_by(ntile) %>% 
    summarise(
        cases = n(),
        responses = sum(satisfaction == "not satisfied")
    ) %>% 
    arrange(desc(ntile)) %>% 
    mutate(group = row_number()) %>% 
    select(group, cases, responses) %>% 
    mutate(
        cumulative_responses = cumsum(responses),
        pct_responses        = responses / sum(responses),
        gain                 = cumsum(pct_responses),
        cumulative_pct_cases = cumsum(cases) / sum(cases),
        lift                 = gain / cumulative_pct_cases,
        gain_baseline        = cumulative_pct_cases,
        lift_baseline        = gain_baseline / cumulative_pct_cases
        
    ) %>% 
    select(group, cumulative_pct_cases, lift, lift_baseline) %>% 
    gather(key = key, value = value, lift, lift_baseline )

# Visualize Lift Data
lift_chart <- lift_data_tbl %>% 
    ggplot(aes(cumulative_pct_cases, value, color = key))+
    geom_line(size = 1.5)+
    theme_bw()+
    theme(legend.position = "right")+
    labs(title = "Lift Chart",
         x = "Cumulative Data Fraction", y = "Gain")+
    func_plot_axis_text_format()+
    scale_color_manual(values = c("#2c3e50", "#CE295E"))

lift_chart

7.0: Model Interpretation With Lime

In this section, we’ll use Lime to interpret the predictions made by our H2o AutoML model. Lime helps us understand which features are driving the outcome (Satisfied or Not Satisfied).

First let’s take a look at predictions for the first 10 passengers in the test set.

first_10_predictions_tbl <- predictions_tbl %>% 
    bind_cols(test_raw_tbl %>% select(satisfaction, id)) %>% 
    slice(1:10)

first_10_predictions_tbl
## # A tibble: 10 × 5
##    predict       not.satisfied satisfied satisfaction     id
##    <fct>                 <dbl>     <dbl> <fct>         <int>
##  1 satisfied         0.0353      0.965   satisfied     19556
##  2 satisfied         0.000266    1.00    satisfied     90035
##  3 not satisfied     0.998       0.00230 not satisfied 12360
##  4 satisfied         0.00492     0.995   satisfied     77959
##  5 not satisfied     0.951       0.0493  satisfied     36875
##  6 satisfied         0.00547     0.995   satisfied     39177
##  7 satisfied         0.000692    0.999   satisfied     79433
##  8 satisfied         0.0000875   1.00    satisfied     97286
##  9 satisfied         0.00362     0.996   satisfied     27508
## 10 satisfied         0.000232    1.00    satisfied     62482

From the output above, the 3rd row (id 12360) is correctly predicted as Not Satisfied with a probability of ~0.998. We can investigate the features for this passenger.

test_raw_tbl %>% 
    filter(id == 12360) %>% 
    glimpse()
## Rows: 1
## Columns: 24
## $ id                                <int> 12360
## $ gender                            <fct> Male
## $ customer.type                     <fct> disloyal Customer
## $ age                               <int> 20
## $ type.of.travel                    <fct> Business travel
## $ class                             <fct> Eco
## $ flight.distance                   <int> 192
## $ inflight.wifi.service             <fct> 2
## $ departure.arrival.time.convenient <fct> 0
## $ ease.of.online.booking            <fct> 2
## $ gate.location                     <fct> 4
## $ food.and.drink                    <fct> 2
## $ online.boarding                   <fct> 2
## $ seat.comfort                      <fct> 2
## $ inflight.entertainment            <fct> 2
## $ on.board.service                  <fct> 4
## $ leg.room.service                  <fct> 1
## $ baggage.handling                  <fct> 3
## $ checkin.service                   <fct> 2
## $ inflight.service                  <fct> 2
## $ cleanliness                       <fct> 2
## $ departure.delay.in.minutes        <int> 0
## $ arrival.delay.in.minutes          <dbl> 0
## $ satisfaction                      <fct> not satisfied

We can see this passenger was male, 20 years old, and also had mostly low scores for In-Flight Service, Ease of Online Booking, etc. Lime helps us further determine why the model classified this passenger correctly as Not Satisfied.

7.1 Lime For A Single Explanation

Let’s use Lime to -

  1. Build an explainer object

  2. Produce an explanation for passenger 12360

  3. Visualize the explanation

# Build Explainer Object
explainer_obj <- train_tbl %>% 
    select(-satisfaction) %>% 
    lime(
        model          = xgboost_grid_1_automl,
        bin_continuous = TRUE, # bins numeric features based on quantiles
        n_bins         = 4,
        quantile_bins  = TRUE # puts equal proportions into each bin
    )

# Use Explainer Object to Produce an Explanation for passenger 12360
set.seed(200)
explanation <- test_tbl %>% 
    slice(3) %>% 
    select(-satisfaction) %>% 
    lime::explain(
        explainer      = explainer_obj,
        n_labels       = 1,
        n_features     = 10, # Top 10 features to explain
        n_permutations = 5000,
        kernel_width   = 3.5, # Can be tuned using grid search
        dist_fun       = "manhattan",
        feature_select = "lasso_path"
    )

# Visualize Explanation
single_explainer_plot <- plot_features(explanation = explanation, ncol = 1)+
  func_plot_axis_text_format()

single_explainer_plot

Looking at the explanations for passenger 12360, we can see the top 10 features that determined the predicted outcome. The length of each bar is the weight Lime assigns to that feature. Additionally, the model predicted this passenger as Not Satisfied with a high probability of approximately 1. Explanation fit refers to the R-squared of the explanation.

Blue bars are features that support or lead to this passenger’s predicted outcome (Not Satisfied), while red bars are features that contradict it. Customer Type (Disloyal Customer) has the highest weight and is highly supportive of this customer’s Not Satisfied prediction. Alternatively, this customer’s Type of Travel is Business Travel, which contradicts the prediction of Not Satisfied. We can also see that all the other features, though with smaller weights, support the prediction of Not Satisfied.

Additionally, we can explain and visualize predictions for multiple passengers. Let’s create explanations for four more passengers (rows 17 - 20 of the test set). This range has a good mix of correctly predicted Satisfied and Not Satisfied outcomes.

# Build Multiple Explanations

set.seed(120)
multiple_explanation <- test_tbl %>% 
    slice(17:20) %>% 
    select(-satisfaction) %>% 
    lime::explain(
        explainer      = explainer_obj,
        n_labels       = 1,
        n_features     = 10,
        n_permutations = 5000,
        kernel_width   = 3.5,
        dist_fun       = "manhattan",
        feature_select = "lasso_path"
    )

multiple_explainer_plot <- plot_features(multiple_explanation, ncol = 2)+
  func_plot_axis_text_format()

multiple_explainer_plot
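
Lime also provides plot_explanations(), which condenses many cases into a single heatmap and can be a handy alternative for larger sets of passengers -

# Heatmap-style overview of the same explanations
plot_explanations(multiple_explanation)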

8.0: Next Steps

As I mentioned earlier, building a predictive model would only be the first step in a real-world setting. Next steps would involve using the Expected Value Framework to translate the model into ROI by examining the financial impact of using it. Scenario Analysis can also be used to test the financial impact of the model while adjusting the probability prediction threshold.