Goals of classification
The main goal of classification is to assign individual entities to categories, also known as classes. Binary classification (the number of classes equals 2) is the most common form of classification.
Assume that a company’s goal is to retain customers. Machine learning can be deployed in this context to classify customers into two groups: predicted to churn or not predicted to churn.
Positive and negative classes
In classification modeling, classes are the distinct groups represented in the target variable. Typically, the class of greatest interest is referred to as the positive class.
In a customer churn problem, customers who do churn represent the group of greatest interest. Therefore, churn is naturally the positive class.
True positives, false positives, true negatives, and false negatives
True positive. A true positive occurs when the predicted class and the actual class are both positive. Example: A customer is predicted to churn and actually churns.
False positive. A false positive occurs when the predicted class is positive but the actual class is negative. Example: A customer is predicted to churn but does not actually churn. In this example, the company devotes resources to a customer who would not actually churn.
True negative. A true negative occurs when the predicted class and the actual class are both negative. Example: A customer is not predicted to churn and does not actually churn.
False negative. A false negative occurs when the predicted class is negative but the actual class is positive. Example: A customer is not predicted to churn but actually churns. In this example, the company fails to proactively identify the customer who would actually churn.
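As a small illustration (the vectors below are made up and are not part of the churn dataset), the four outcomes can be counted in R by comparing predicted classes against actual classes, where "yes" denotes the positive class.
# Hedged sketch: counting the four outcomes for a handful of hypothetical customers
actual    <- c("yes", "no", "yes", "no", "no", "yes")
predicted <- c("yes", "yes", "no", "no", "no", "yes")
true_positives  <- sum(predicted == "yes" & actual == "yes")  # 2
false_positives <- sum(predicted == "yes" & actual == "no")   # 1
true_negatives  <- sum(predicted == "no"  & actual == "no")   # 2
false_negatives <- sum(predicted == "no"  & actual == "yes")  # 1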
Evaluating accuracy
It is normal for all classifiers to make some errors. For example, most classifiers will predict that some instances belong to the positive class when they actually belong to the negative class, and vice versa.
Depending on the problem at hand, different types of error may have unequal consequences. Consider the following scenario. Using machine learning, a company intends to identify customers who are likely to switch companies. Which of the following strategies is wiser?
Identify a small segment of customers who all have a very high likelihood of switching companies. Accuracy is likely to be high, but many customers outside of this small segment may still churn.
Identify a relatively larger segment of customers who, on average, are less likely to switch companies. Accuracy will be lower, but more customers will be intercepted due to the wider net.
There are several ways to evaluate the accuracy of a classification model. In this section, you will learn about a selection of the most common metrics and tools used to evaluate classification accuracy.
Metrics derived from predicted probabilities
Many classification algorithms predict class probabilities. For example, a model may predict that a specific instance has a probability of 0.93 of belonging to the positive class.
Probability-based metrics assess how well predicted probabilities align with the actual class values. Ideally, instances with a high predicted probability of belonging to class A will actually belong to class A.
In industry, predicted probabilities can be extremely useful. For example, in a customer churn problem, an organization may be wise to focus its retention efforts on the segment of customers with the highest predicted probability of churn.
The following probability-based metrics are commonly used in industry.
AUC: AUC (Area Under the Receiver Operating Characteristic curve) measures how well the classification model is able to distinguish between the positive and negative classes. An AUC of 1 indicates a perfect classifier, and an AUC of 0.5 means the model is no better than random guessing. AUC values above 0.7 are generally preferred. The AUC metric is generally not preferred when classes are highly imbalanced.
AUCPR: AUCPR (Area Under the Precision-Recall curve) measures the tradeoff between precision and recall across probability thresholds. It is often preferred over AUC when classes are highly imbalanced.
LogLoss: LogLoss measures how well the predicted probabilities are aligned with the actual classes. For example, a model would have a favorable LogLoss score if the model tended to assign high predicted probabilities to the positive class and low predicted probabilities to the negative class. LogLoss scores are difficult to interpret directly, though they are useful for comparing models. LogLoss values closer to 0 are preferred.
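To make the definition concrete, the following is a minimal sketch of the binary LogLoss calculation on made-up values; in practice, H2O computes this metric for you (for example, via the h2o.logloss() function used later in this course).
# Hedged sketch: binary LogLoss on made-up values (1 = positive class, 0 = negative class)
actual_class          <- c(1, 0, 1, 1, 0)
predicted_probability <- c(0.9, 0.2, 0.8, 0.6, 0.1)
log_loss <- -mean(
  actual_class * log(predicted_probability) +
    (1 - actual_class) * log(1 - predicted_probability)
)
log_loss  # lower is better; values closer to 0 are preferred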
Metrics derived from predicted classes
Because classification algorithms such as Logistic Regression predict class probabilities, a classification threshold is required in order to convert predicted probabilities into predicted classes (e.g. yes/no). By default, H2O will find optimal classification thresholds for a variety of metrics.
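For intuition, the sketch below applies a hypothetical threshold of 0.5 to a few made-up predicted probabilities; because H2O selects metric-specific thresholds automatically, you will rarely need to do this by hand.
# Hedged sketch: converting predicted probabilities to predicted classes with a threshold
predicted_probability <- c(0.93, 0.41, 0.08, 0.77)
threshold <- 0.5
predicted_class <- ifelse(predicted_probability >= threshold, "yes", "no")
predicted_class  # "yes" "no" "no" "yes"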
Precision: Precision measures the proportion of predicted positive classes that are actual positive values. For example, if a machine learning model predicts 100 customers will churn and 90 customers actually churned, then precision equals 0.9. Precision values range from 0 to 1. Values closer to 1 are preferred.
Recall: Recall measures the proportion of actual positive values that are predicted by the model to be positive. For example, if a data set contains 100 customers who actually churned and 70 were correctly predicted to churn, then recall equals 0.7. Recall values range from 0 to 1. Values closer to 1 are preferred.
F1: F1 score intends to strike a balance between Precision and Recall. Because Precision and Recall each have strengths and weaknesses, F1 score is often a good metric to optimize. F1 scores range from 0 to 1. Values closer to 1 are preferred. See the F1 documentation for more details.
F2: The F2 score weights recall more heavily than precision. See the F2 documentation for more details.
F0.5: The F0.5 score weights precision more heavily than recall. See the F0.5 documentation for more details.
Accuracy: Accuracy measures the proportion of all instances whose class is correctly predicted, regardless of class; its complement is the misclassification rate (the proportion predicted incorrectly). Though intuitive, accuracy is often not a useful metric because it is highly sensitive to class imbalance. Accuracy values range from 0 to 1. Values closer to 1 are preferred. The sketch after this list shows how the predicted-class metrics are computed from outcome counts.
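The following is a minimal sketch of how these metrics relate to the outcome counts defined earlier. The counts are hypothetical; in practice, H2O reports these metrics directly.
# Hedged sketch: predicted-class metrics from hypothetical outcome counts
true_positives  <- 90
false_positives <- 10
true_negatives  <- 880
false_negatives <- 20
precision <- true_positives / (true_positives + false_positives)   # 0.9
recall    <- true_positives / (true_positives + false_negatives)   # ~0.82
f1        <- 2 * precision * recall / (precision + recall)         # ~0.86
accuracy  <- (true_positives + true_negatives) /
  (true_positives + false_positives + true_negatives + false_negatives)  # 0.97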
Lift scores
Unlike accuracy metrics, which describe the overall performance of the model, lift scores describe the performance of the model within groups of instances ordered by predicted probability.
For example, assume the overall customer churn rate in a dataset is 10%. Using your machine learning model, you identify the top 10% of instances (customers) with the highest predicted probability of churn. Within this group of instances, you observe the actual churn rate is 90%. Therefore, the lift score associated with this group of instances is 9 (0.9/0.1). In industry, it may be wise to focus retention efforts on this specific group of customers given the expected churn rate relative to the overall churn rate.
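The following sketch reproduces the lift calculation from this example.
# Lift for the top-decile segment from the example above
overall_churn_rate <- 0.10  # churn rate across the entire dataset
segment_churn_rate <- 0.90  # actual churn rate within the top 10% of predicted probabilities
lift <- segment_churn_rate / overall_churn_rate
lift  # 9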
Class balance
Class imbalance occurs when the number of instances in one class greatly outweighs the number of instances in the other class. For example, a data set containing 100 positive instances and 1,000 negative instances would generally be viewed as imbalanced.
Imbalanced data may present a risk since the majority class can easily overwhelm the minority class if the imbalance is severe enough. A common remedy for imbalanced classes is to balance the data during the training phase. Balancing classes is done by either oversampling the minority class (i.e. artificially increasing the amount of minority-class data until balance is achieved) or undersampling the majority class (i.e. artificially decreasing the amount of majority-class data until balance is achieved).
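The following is a conceptual sketch of oversampling with base R. It assumes a hypothetical data.frame named train with a target column churn whose positive level is "yes"; in this course, H2O performs balancing for you via the balance_classes argument.
# Hedged sketch: oversample the minority class until the classes are the same size
minority <- train[train$churn == "yes", ]
majority <- train[train$churn == "no", ]
minority_oversampled <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
train_balanced <- rbind(majority, minority_oversampled)
table(train_balanced$churn)  # both classes now contain the same number of rows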
Quiz
- Which of the following metrics are derived from predicted classes? Select all that apply.
- AUC
- F1 (correct)
- Accuracy (correct)
- Precision (correct)
- Which of the following accuracy metrics are derived from predicted probabilities? Select all that apply.
- AUC (correct)
- Recall
- LogLoss (correct)
- AUCPR (correct)
- If customers who churn represent the positive class, what is an example of a true positive prediction?
- A customer churned and was predicted to churn. (correct)
- A customer did not churn but was predicted to churn.
- A customer did not churn and was not predicted to churn.
- If customers who churn represent the positive class, what is an example of a false positive prediction?
- A customer churned and was predicted to churn.
- A customer did not churn but was predicted to churn. (correct)
- A customer did not churn and was not predicted to churn.
- What is it called when the number of instances in one class greatly outweighs the other class?
- Overfitting
- Class imbalance (correct)
- False negative
- True or False: Oversampling the minority class and undersampling the majority class are two techniques for dealing with class imbalance.
- True or False: Lift analysis describes the overall accuracy of a model.
- If the overall customer churn rate is 5% and the expected customer churn rate associated with the top 1% of instances with the highest predicted probability of churn is 60%, what is the lift score for this group of instances?
Prepare customer churn data
Prerequisites
- Key concepts
Objective
In this section, you will learn how to launch a local H2O cluster, import data, and perform exploratory analysis.
The following steps outline the general process.
- Connect to a local H2O cluster
- Import data into the cluster
- Profile the features
- Evaluate missing data
- Evaluate class balance
- Partition training, validation, and testing data
- Specify the final feature set and the target variable
Connect to local H2O cluster
To launch a local H2O cluster, run the h2o.init() function. By default, clusters are configured to use all available CPUs on the host machine to maximize speed and efficiency.
h2o.init()
Import data into H2O cluster
Use the h2o.importFile() function to import the local CSV file into the local H2O cluster.
churn_h2o <- h2o.importFile(path = "churn.csv")
The class of this object is an H2OFrame, which is efficient and ideally suited for H2O’s algorithms. An H2OFrame is H2O’s equivalent of an R data.frame.
Use the class() function to verify the data type is H2OFrame.
class(churn_h2o)
Report the number of rows and columns
Use the h2o.dim() function to report the dimensions of the H2OFrame.
h2o.dim(churn_h2o)
Profile the variables
To better understand the variables in the H2OFrame, use the h2o.describe() function. The summary table includes important information about each variable in the dataset, including the data type, missing values, numeric summaries, and the number of categories in each categorical variable.
churn_profiles <- h2o.describe(frame = churn_h2o)
head(churn_profiles, 10)
Inspect data types
Using the churn_profiles object, run the table() function on the column Type to count the number of variables by data type.
table(churn_profiles["Type"])
Evaluate missing values
Though H2O is capable of handling NA values with ease, it is important to detect any columns that contain a sufficiently large number of NA values.
Using the churn_profiles object, apply a filter to the column Missing to detect variables with at least one missing value.
churn_profiles[churn_profiles["Missing"] > 0, ]
There are no features in this dataset with missing values.
Evaluate class balance
Assess the distribution of the target variable churn by calculating the number of instances belonging to the positive and negative classes. To do this, apply the h2o.table() function to the churn column of the H2OFrame.
h2o.table(churn_h2o["churn"])
This dataset contains 598 instances of the positive class (14%) and 3,652 instances of the negative class (86%). While there are no strict rules for categorizing the degree of imbalance, the author would consider this dataset moderately imbalanced. During model training, you will have the opportunity to balance classes and observe the effects.
Partition training, validation, and testing splits
The next step in the data preparation workflow is to partition the original H2OFrame object churn_h2o into three datasets: training, validation, and testing. Use the h2o.splitFrame() function to partition the H2OFrame.
In this example, 70% of the data is allocated to training, 15% to validation, and the remaining 15% to testing.
churn_splits <- h2o.splitFrame(
data = churn_h2o,
ratios = c(0.7, 0.15),
seed = 1
)
To alter the percentage of data allocated to each of the three partitions, adjust the ratios argument accordingly. For example, ratios = c(0.6, 0.2) will allocate 60% for training, 20% for validation, and the remaining 20% for testing.
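As a further illustration, the call below would produce an 80/10/10 split of the same frame (the object name is arbitrary); the third ratio is always implied by the remainder.
churn_splits_80_10_10 <- h2o.splitFrame(
  data = churn_h2o,
  ratios = c(0.8, 0.1),
  seed = 1
)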
Because the churn_splits object is an R list, the individual partitions can be extracted using the [[ ]] convention.
churn_train <- churn_splits[[1]]
churn_validate <- churn_splits[[2]]
churn_test <- churn_splits[[3]]
Use the nrow() function to print the number of instances in each partition.
nrow(churn_train)
nrow(churn_validate)
nrow(churn_test)
Specify feature and target variables
The final step of data preparation is to specify the names of the target variable and the features you intend to use in your model. To use all features, use the setdiff() function to retrieve the names of all variables except for the target variable churn. Alternatively, explicitly list the feature names using the convention features <- c("state", "account_length", "...").
target <- "churn"
features <- setdiff(
x = names(churn_h2o),
y = "churn"
)
Quiz
- The h2o.init() function is used to initialize and connect to a local H2O cluster from R.
- Which of the following functions is used to import a local file (e.g. CSV) into the H2O cluster?
- h2o.importCSV()
- h2o.importLocal()
- h2o.importFile() (correct)
- What is H2O’s equivalent of an R data.frame?
- H2O.data.frame
- H2O.df
- H2OFrame (correct)
- Which of the following functions is used to summarize the individual variables in an H2OFrame?
- h2o.profile()
- h2o.describe() (correct)
- h2o.summarize()
- Given an H2OFrame named my_frame, which of the following expressions will calculate the number of instances belonging to each level of the categorical variable named target?
- h2o.countInstances(my_frame["target"])
- h2o.table(my_frame["target"]) (correct)
- table(my_frame$target)
- Using the function h2o.splitFrame to partition training, validation, and testing datasets from an H2OFrame, which of the following configurations of the ratios argument will allocate 80% for training, 10% for validation, and 10% for testing?
- ratios = c(0.8, 0.1) (correct)
- ratios = c(0.8, 0.2)
- ratios = c(0.7, 0.1)
Playground
In this playground, you will explore the code to prepare the customer churn data for machine learning.
h2o.init()
help(h2o.init)
churn_h2o <- h2o.importFile(path = "churn.csv")
class(churn_h2o)
h2o.dim(churn_h2o)
churn_profiles <- h2o.describe(frame = churn_h2o)
head(churn_profiles, 10)
churn_profiles[churn_profiles["Missing"] > 0, ]
h2o.table(churn_h2o["churn"])
churn_splits <- h2o.splitFrame(
data = churn_h2o,
ratios = c(0.7, 0.15),
seed = 1
)
churn_train <- churn_splits[[1]]
churn_validate <- churn_splits[[2]]
churn_test <- churn_splits[[3]]
nrow(churn_train)
nrow(churn_validate)
nrow(churn_test)
target <- "churn"
features <- setdiff(
x = names(churn_h2o),
y = "churn"
)
Gradient Boosted Classification Trees
The next classification algorithm you will learn about is called Gradient Boosted Classification Trees.
Prepare hyperparameter tuning grid
For the Gradient Boosted Classification Trees algorithm, you will tune the following hyperparameters.
max_depth controls the maximum depth of each tree. See the max_depth documentation for more details.
min_rows specifies the minimum number of observations allowed in a leaf. See the min_rows documentation for more details.
sample_rate specifies the row sampling rate used for each tree. See the sample_rate documentation for more details.
col_sample_rate specifies the column sampling rate. See the col_sample_rate documentation for more details.
col_sample_rate_per_tree specifies the column sampling rate for each tree. See the col_sample_rate_per_tree documentation for more details.
balance_classes is used to balance the classes. If TRUE, then H2O will balance the classes during training. See the balance_classes documentation for more details.
max_after_balance_size controls the way in which classes are balanced. Smaller values cause the majority class to be downsampled, whereas larger values cause the minority class to be upsampled. See the max_after_balance_size documentation for more details.
gradient_boosted_hp_grid = list(
max_depth = seq(1, 20, 1),
min_rows = c(1, 5, 10, 20, 50, 100),
sample_rate = seq(0.3, 1, 0.05),
col_sample_rate = seq(0.3, 1, 0.05),
col_sample_rate_per_tree = seq(0.3,1,0.05),
balance_classes = c(TRUE, FALSE),
max_after_balance_size = seq(0.1, 5.1, 0.5),
histogram_type = c("UniformAdaptive", "QuantilesGlobal", "RoundRobin")
)
Define hyperparameter tuning strategy
For the Gradient Boosted Trees algorithm, your hyperparameter tuning strategy will use the following settings.
strategy = "RandomDiscrete" specifies random grid search. When random search is used, H2O randomly and uniformly samples hyperparameter values from the entire hyperparameter space until a high-performing subset of hyperparameters is found. Random grid search is the recommended strategy for efficiently searching large hyperparameter spaces. See the grid search documentation for more details.
max_runtime_secs specifies the amount of time (in seconds) hyperparameter tuning is permitted to run. Smaller values will shorten training time at the risk of failing to discover the best possible model. See the max_runtime_secs documentation for more details.
stopping_metric specifies the metric to use for early stopping. The default value of AUTO will use the logloss metric for classification. See the stopping_metric documentation for more details.
stopping_tolerance specifies the minimum improvement in the stopping metric required for training to continue. Larger values demand greater improvement before training will continue. See the stopping_tolerance documentation for more details.
stopping_rounds specifies the number of scoring rounds without improvement in the stopping metric before training stops early. See the stopping_rounds documentation for more details.
seed sets the random seed so that results are reproducible. See the seed documentation for more details.
gradient_boosted_hp_strategy = list(
strategy = "RandomDiscrete",
max_runtime_secs = 120,
max_models = 100,
stopping_metric = "AUTO",
stopping_tolerance = 0.001,
stopping_rounds = 5,
seed = 123456
)
Train models
In this section, you will train several Gradient Boosted Classification Tree models at once using hyperparameter tuning and cross validation. The result will be a “grid” of models. Each model will be associated with a distinct subset of hyperparameters. Accuracy metrics are generated for each subset of hyperparameters, and the subset that produces the most accurate model will be identified and described in the sections that follow.
To train the models, use the h2o.grid function. Throughout the course, the following function arguments are explicitly defined.
gradient_boosted_grid_search <- h2o.grid(
algorithm = "gbm",
grid_id = "gradient_boosted_churn",
x = features,
y = target,
training_frame = churn_train,
validation_frame = churn_validate,
hyper_params = gradient_boosted_hp_grid,
search_criteria = gradient_boosted_hp_strategy,
ntrees = 100000,
learn_rate = 0.05,
learn_rate_annealing = 0.99,
score_tree_interval = 10,
distribution = "bernoulli",
stopping_rounds = 5,
stopping_tolerance = 0.0001,
stopping_metric = "AUTO",
seed = 123456
)
Select the most accurate model
At this point, you have trained several Gradient Boosted Classification Tree models using many subsets of hyperparameters. To select the most accurate model, you must first sort all models from most to least accurate using the h2o.getGrid function. In this example, models are sorted in descending order according to each model’s F1 score.
gradient_boosted_model_grid <- h2o.getGrid(
grid_id = "gradient_boosted_churn",
sort_by = "f1",
decreasing = TRUE
)
To retrieve the model with the most accurate subset of hyperparameters, use the h2o.getModel function as shown below.
gradient_boosted_best_model <- h2o.getModel(
model_id = gradient_boosted_model_grid@model_ids[[1]]
)
Report model parameters
To report the model’s parameters, including the best-performing hyperparameter values, use the custom function named report_model_parameters below.
report_model_parameters <- function(model, model_grid) {
best_hyperparameters <- model_grid@summary_table[1,]
best_model_summary <- model@model$model_summary
out <- c(
best_hyperparameters,
best_model_summary
)
return(out)
}
gradient_boosted_model_grid@summary_table[1,]
gradient_boosted_best_model@model$model_summary
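The two lines above print each component separately; the same information can be retrieved in a single call to the custom function defined above.
report_model_parameters(
  model = gradient_boosted_best_model,
  model_grid = gradient_boosted_model_grid
)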
Evaluate accuracy
In this section, you will evaluate the accuracy of the best-performing model using a variety of strategies. First, you will evaluate the accuracy of the model on the validation split. Second, you will evaluate the accuracy of the model on the testing split. Finally, you will evaluate the accuracy of the model using cross validation. At the end, you will judge the results and determine the model’s true predictive power.
Custom function
The following custom function report_classification_metrics extracts metrics from H2O’s classification algorithms. This custom function returns F1, F2, F0.5, Accuracy, AUC, AUCPR, and LogLoss scores. The lone argument metrics should be an H2OBinomialMetrics object.
report_classification_metrics <- function(metrics) {
# For the threshold-based metrics (F1, F2, F0.5, Accuracy), report the best value across all thresholds
metrics <- list(
F1 = max(h2o.F1(metrics)[,'f1']),
F2 = max(h2o.F2(metrics)[,'f2']),
F0.5 = max(h2o.F0point5(metrics)[,'f0point5']),
Accuracy = max(h2o.accuracy(metrics)[,'accuracy']),
AUC = h2o.auc(metrics),
AUCPR = h2o.aucpr(metrics),
LogLoss = h2o.logloss(metrics)
)
return(metrics)
}
Evaluate accuracy on validation split
In this section, you will learn how to evaluate the accuracy of the model on the validation split. Because the validation split was used during training, these accuracy metrics may not reflect the model’s true ability to perform on unseen data.
To obtain validation metrics, use the h2o.performance function with valid = TRUE. The class of the resulting object is H2OBinomialMetrics.
gradient_boosted_validation_metrics <- h2o.performance(
model = gradient_boosted_best_model,
valid = TRUE
)
Use the custom function report_classification_metrics to report the validation metrics.
report_classification_metrics(metrics = gradient_boosted_validation_metrics)
Evaluate accuracy on testing split
In this section, you will learn how to evaluate the accuracy of the model on the testing split.
Because the testing split was not used during training, these accuracy metrics reflect the model’s true ability to perform on unseen data. A reliable model will report similar accuracy for both the validation and testing splits. If the difference between validation and testing accuracy is significant, then the model is overfit and should be retrained.
To obtain testing metrics, use the h2o.performance function with newdata = churn_test. The class of the resulting object is H2OBinomialMetrics.
gradient_boosted_testing_metrics <- h2o.performance(
model = gradient_boosted_best_model,
newdata = churn_test
)
Use the custom function report_classification_metrics to report the testing metrics.
report_classification_metrics(metrics = gradient_boosted_testing_metrics)
Evaluate accuracy using cross validation
In this section, you will deploy cross validation to make one final evaluation regarding model accuracy. By using cross validation, you are putting the model through an additional round of rigorous testing to ensure the accuracy metrics are reliable.
To evaluate accuracy using cross validation, you will retrain a new model using all instances in the training and validation splits and the same model parameters obtained previously through hyperparameter tuning. You will also deploy 5-fold cross validation.
In the following code, you call the h2o.gbm function and pass the parameters obtained through hyperparameter tuning. Some tweaks are necessary, such as expanding the training_frame, disabling the validation_frame, and setting nfolds to a value of 5 to enable 5-fold cross validation.
gradient_boosted_cv <- do.call(
what = h2o.gbm,
args = {
# Start from the parameters of the best model found during hyperparameter tuning
parameters = gradient_boosted_best_model@parameters
parameters$model_id = "gradient_boosted_cv"
# Retrain on the combined training and validation splits
parameters$training_frame = h2o.rbind(churn_train, churn_validate)
parameters$validation_frame <- NULL
# Enable 5-fold cross validation
parameters$nfolds = 5
parameters$keep_cross_validation_predictions <- TRUE
parameters$seed <- 1234
parameters
}
)
To obtain cross validation metrics, use the h2o.performance function with xval = TRUE. The class of the resulting object is H2OBinomialMetrics.
gradient_boosted_cv_metrics <- h2o.performance(
model = gradient_boosted_cv,
xval = TRUE
)
Use the custom function report_classification_metrics to report the cross validation metrics.
report_classification_metrics(metrics = gradient_boosted_cv_metrics)
Confusion matrix
Use the h2o.confusionMatrix function to report the counts of true positives, false positives, true negatives, and false negatives on the testing split.
h2o.confusionMatrix(
object = gradient_boosted_testing_metrics
)
Accuracy matrix
An accuracy matrix is a matrix of validation, testing, and cross validation accuracy metrics. Accuracy matrices can easily be created using spreadsheet software (or similar).
The following accuracy matrix shows the validation, testing, and cross validation metrics for the best-performing Gradient Boosted Trees model.
Metric | Validation | Testing | Cross validation
--- | --- | --- | ---
F1 | 0.784 | 0.846 | 0.783
F2 | 0.782 | 0.819 | 0.791
F0.5 | 0.807 | 0.877 | 0.815
Accuracy | 0.946 | 0.956 | 0.942
AUC | 0.886 | 0.905 | 0.916
AUCPR | 0.784 | 0.845 | 0.817
LogLoss | 0.256 | 0.267 | 0.257
To conclude, the validation, testing, and cross validation accuracy metrics are relatively stable, suggesting the best-performing model is reliable and not overfit to the training data.
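As a hedged sketch, the same matrix can also be assembled in R instead of a spreadsheet. The three variables below are hypothetical and assume you stored the outputs of report_classification_metrics() rather than printing them directly.
validation_scores <- report_classification_metrics(metrics = gradient_boosted_validation_metrics)
testing_scores    <- report_classification_metrics(metrics = gradient_boosted_testing_metrics)
cv_scores         <- report_classification_metrics(metrics = gradient_boosted_cv_metrics)

accuracy_matrix <- data.frame(
  Validation      = unlist(validation_scores),
  Testing         = unlist(testing_scores),
  CrossValidation = unlist(cv_scores)
)
accuracy_matrix  # one row per metric, one column per evaluation strategy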
Explain model
Variable importances
# Tabulate the relative importance of each feature in the best model
h2o.varimp(gradient_boosted_best_model)
SHAP summary
# SHAP summary plot: the direction and magnitude of each feature’s contributions on the testing split
h2o.shap_summary_plot(
model = gradient_boosted_best_model,
newdata = churn_test
)
SHAP contributions
# SHAP contributions for a single prediction (row 10 of the testing split)
h2o.shap_explain_row_plot(
model = gradient_boosted_best_model,
newdata = churn_test,
row_index = 10
)
# SHAP contribution of each feature for every row in the testing split
h2o.predict_contributions(
object = gradient_boosted_best_model,
newdata = churn_test
)
Obtain predicted values
# Generate predicted classes and class probabilities for the testing split
gradient_boosted_testing_predictions <- h2o.predict(
object = gradient_boosted_best_model,
newdata = churn_test
)
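As discussed earlier, predicted probabilities can be used to prioritize retention efforts. The sketch below ranks test customers by predicted churn probability; it assumes the positive level of churn is "yes", so the predictions frame contains a probability column named "yes" (adjust the column name to match your data).
predictions_df <- as.data.frame(gradient_boosted_testing_predictions)
ranked <- predictions_df[order(predictions_df$yes, decreasing = TRUE), ]
head(ranked, 10)  # the ten customers with the highest predicted probability of churn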
Evaluate important features
To assess which features are most important to the best Gradient Boosted Classification Trees model, use the h2o.varimp_plot() function to visualize feature importances.
h2o.varimp_plot(
model = gradient_boosted_best_model,
num_of_features = 10
)
Shut down the H2O cluster
h2o.shutdown(prompt = FALSE)
Playground
In this playground, you will experiment with the code to train and evaluate your own Gradient Boosted Tree classification model.
h2o.init()
churn_h2o <- h2o.importFile(path = "churn.csv")
churn_splits <- h2o.splitFrame(
data = churn_h2o,
ratios = c(0.7, 0.15),
seed = 1
)
churn_train <- churn_splits[[1]]
churn_validate <- churn_splits[[2]]
churn_test <- churn_splits[[3]]
target <- "churn"
features <- setdiff(
x = names(churn_h2o),
y = "churn"
)
gradient_boosted_hp_grid = list(
max_depth = seq(1, 20, 1),
min_rows = c(1, 5, 10, 20, 50, 100),
sample_rate = seq(0.3, 1, 0.05),
col_sample_rate = seq(0.3, 1, 0.05),
col_sample_rate_per_tree = seq(0.3,1,0.05),
balance_classes = c(TRUE, FALSE),
max_after_balance_size = seq(0.1, 5.1, 0.5),
histogram_type = c("UniformAdaptive", "QuantilesGlobal", "RoundRobin")
)
gradient_boosted_hp_strategy = list(
strategy = "RandomDiscrete",
max_runtime_secs = 120,
max_models = 100,
stopping_metric = "AUTO",
stopping_tolerance = 0.001,
stopping_rounds = 5,
seed = 123456
)
gradient_boosted_grid_search <- h2o.grid(
algorithm = "gbm",
grid_id = "gradient_boosted_churn",
x = features,
y = target,
training_frame = churn_train,
validation_frame = churn_validate,
hyper_params = gradient_boosted_hp_grid,
search_criteria = gradient_boosted_hp_strategy,
ntrees = 100000,
learn_rate = 0.05,
learn_rate_annealing = 0.99,
score_tree_interval = 10,
distribution = "bernoulli",
stopping_rounds = 5,
stopping_tolerance = 0.0001,
stopping_metric = "AUTO",
seed = 123456
)
gradient_boosted_model_grid <- h2o.getGrid(
grid_id = "gradient_boosted_churn",
sort_by = "f1",
decreasing = TRUE
)
class(gradient_boosted_model_grid)
gradient_boosted_best_model <- h2o.getModel(
model_id = gradient_boosted_model_grid@model_ids[[1]]
)
class(gradient_boosted_best_model)
report_model_parameters <- function(model, model_grid) {
best_hyperparameters <- model_grid@summary_table[1,]
best_model_summary <- model@model$model_summary
out <- c(
best_hyperparameters,
best_model_summary
)
return(out)
}
gradient_boosted_model_grid@summary_table[1,]
gradient_boosted_best_model@model$model_summary
report_classification_metrics <- function(metrics) {
metrics <- list(
F1 = max(h2o.F1(metrics)[,'f1']),
F2 = max(h2o.F2(metrics)[,'f2']),
F0.5 = max(h2o.F0point5(metrics)[,'f0point5']),
Accuracy = max(h2o.accuracy(metrics)[,'accuracy']),
AUC = h2o.auc(metrics),
AUCPR = h2o.aucpr(metrics),
LogLoss = h2o.logloss(metrics)
)
return(metrics)
}
gradient_boosted_validation_metrics <- h2o.performance(
model = gradient_boosted_best_model,
valid = TRUE
)
report_classification_metrics(metrics = gradient_boosted_validation_metrics)
gradient_boosted_cv <- do.call(
what = h2o.gbm,
args = {
parameters = gradient_boosted_best_model@parameters
parameters$model_id = "gradient_boosted_cv"
parameters$training_frame = h2o.rbind(churn_train, churn_validate)
parameters$validation_frame <- NULL
parameters$nfolds = 5
parameters$keep_cross_validation_predictions <- TRUE
parameters$seed <- 1234
parameters
}
)
gradient_boosted_cv_metrics <- h2o.performance(
model = gradient_boosted_cv,
xval = TRUE
)
report_classification_metrics(metrics = gradient_boosted_cv_metrics)
gradient_boosted_testing_metrics <- h2o.performance(
model = gradient_boosted_best_model,
newdata = churn_test
)
report_classification_metrics(metrics = gradient_boosted_testing_metrics)
h2o.confusionMatrix(
object = gradient_boosted_testing_metrics
)
gradient_boosted_testing_predictions <- h2o.predict(
object = gradient_boosted_best_model,
newdata = churn_test
)
h2o.varimp_plot(
model = gradient_boosted_best_model,
num_of_features = 10
)
Additional playground
Logistic regression
In this playground, you will experiment with the code to train and evaluate your own Logistic Regression model. If necessary, revisit the section Prepare customer churn data to improve your understanding of the data preparation phase.
h2o.init()
churn_h2o <- h2o.importFile(path = "churn.csv")
churn_splits <- h2o.splitFrame(
data = churn_h2o,
ratios = c(0.7, 0.15),
seed = 1
)
churn_train <- churn_splits[[1]]
churn_validate <- churn_splits[[2]]
churn_test <- churn_splits[[3]]
target <- "churn"
features <- setdiff(
x = names(churn_h2o),
y = "churn"
)
logistic_regression_hp_grid <- list(
alpha = c(0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1),
lambda = c(0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.9)
)
logistic_regression_hp_strategy = list(
strategy = "RandomDiscrete",
max_runtime_secs = 60,
stopping_metric = "AUTO",
stopping_tolerance = 0.001,
stopping_rounds = 5,
seed = 123456
)
logistic_regression_interactions <- c(
"total_day_charge",
"total_eve_charge",
"total_night_charge",
"total_intl_charge",
"number_customer_service_calls"
)
logistic_regression_grid_search <- h2o.grid(
algorithm = "glm",
grid_id = "logistic_regression_churn",
x = features,
y = target,
training_frame = churn_train,
validation_frame = churn_validate,
interactions = logistic_regression_interactions,
hyper_params = logistic_regression_hp_grid,
search_criteria = logistic_regression_hp_strategy,
family = "binomial",
standardize = TRUE,
balance_classes = TRUE,
stopping_rounds = 5,
stopping_tolerance = 0.001,
stopping_metric = "AUTO",
seed = 123456
)
logistic_regression_model_grid <- h2o.getGrid(
grid_id = "logistic_regression_churn",
sort_by = "f1",
decreasing = TRUE
)
class(logistic_regression_model_grid)
logistic_regression_best_model <- h2o.getModel(
model_id = logistic_regression_model_grid@model_ids[[1]]
)
class(logistic_regression_best_model)
logistic_regression_model_grid@summary_table[1, ]
head(logistic_regression_model_grid@summary_table, 5)
tail(logistic_regression_model_grid@summary_table, 5)
report_classification_metrics <- function(metrics) {
metrics <- list(
F1 = max(h2o.F1(metrics)[,'f1']),
F2 = max(h2o.F2(metrics)[,'f2']),
F0.5 = max(h2o.F0point5(metrics)[,'f0point5']),
Accuracy = max(h2o.accuracy(metrics)[,'accuracy']),
AUC = h2o.auc(metrics),
AUCPR = h2o.aucpr(metrics),
LogLoss = h2o.logloss(metrics)
)
return(metrics)
}
logistic_regression_validation_metrics <- h2o.performance(
model = logistic_regression_best_model,
valid = TRUE
)
report_classification_metrics(metrics = logistic_regression_validation_metrics)
logistic_regression_cv <- do.call(
what = h2o.glm,
args = {
parameters = logistic_regression_best_model@parameters
parameters$model_id = "logistic_regression_cv"
parameters$training_frame = h2o.rbind(churn_train, churn_validate)
parameters$validation_frame <- NULL
parameters$x <- features
parameters$nfolds = 5
parameters$keep_cross_validation_predictions = TRUE
parameters$seed <- 1234
parameters
}
)
logistic_regression_cv_metrics <- h2o.performance(
model = logistic_regression_cv,
xval = TRUE
)
report_classification_metrics(metrics = logistic_regression_cv_metrics)
logistic_regression_testing_metrics <- h2o.performance(
model = logistic_regression_best_model,
newdata = churn_test
)
report_classification_metrics(metrics = logistic_regression_testing_metrics)
h2o.confusionMatrix(
object = logistic_regression_testing_metrics
)
logistic_regression_testing_predictions <- h2o.predict(
object = logistic_regression_best_model,
newdata = churn_test
)
h2o.varimp_plot(
model = logistic_regression_best_model,
num_of_features = 10
)