Tidymodels_classification

Kate C

2022-01-29

Load Packages and Dataset

Packages and data used to import and analyse the data include

  • tidymodels

  • telecom_df: a dataset containing information on customers of a telecommunications company. The outcome variable is canceled_service, which records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.
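
As a minimal setup sketch (assuming telecom_df is already available in the R session; the original source of the data file is not shown here), the packages can be loaded and the data inspected as follows:

library(tidymodels)

# quick look at the columns and the outcome distribution
glimpse(telecom_df)
telecom_df %>% count(canceled_service)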

Practice

Objective: we will predict whether a customer cancels the service, based on the predictors and labels available in the dataset.

First - data resampling

This step is the same as in the previous tidymodels session on regression, so we do not provide much description alongside the code below.

Note that here we want a 75%/25% split for the training/testing data, stratified by the outcome.

telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_training <- telecom_split %>% 
                    training()
telecom_test <- telecom_split %>% 
                testing()
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244
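
As a quick sanity check (not part of the original output), the stratified split should give similar proportions of canceled_service in the training and test sets:

# compare outcome proportions across the two sets
telecom_training %>% count(canceled_service) %>% mutate(prop = n / sum(n))
telecom_test %>% count(canceled_service) %>% mutate(prop = n / sum(n))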

Second - model fitting with training dataset

  • create a logistic model
logistic_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

logistic_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
  • fit to training data. Printing a model fit object displays the estimated model coefficients.
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges, 
      data = telecom_training)

logistic_fit
## parsnip model object
## 
## Fit time:  9ms 
## 
## Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
##     monthly_charges, family = stats::binomial, data = data)
## 
## Coefficients:
##     (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
##        2.089671        -0.009769         0.021364        -0.001908  
## 
## Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
## Null Deviance:       932.4 
## Residual Deviance: 817.1     AIC: 825.1
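
Since tidymodels attaches broom, a tidier view of the same coefficients (with standard errors and p-values) can optionally be obtained with tidy(); this is not part of the original output:

# coefficient table as a tibble
tidy(logistic_fit)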

Third - combining test data results

couple of points to note:

  • unlike base R, here we use the new_data argument; also, the type argument takes “class”, not “classification”
class_preds <- predict(logistic_fit, 
                      new_data = telecom_test, 
                      type = "class")
  • obtain estimated probabilities for each outcome value - here type is “prob” since we are getting the probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test, type = "prob")
  • combine the test set results - here we bind the true outcomes from the test dataset together with the predicted classes and probabilities
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  cbind(class_preds, prob_preds)

head(telecom_results,6)
##   canceled_service .pred_class  .pred_yes  .pred_no
## 1              yes         yes 0.52979047 0.4702095
## 2               no          no 0.38170832 0.6182917
## 3               no         yes 0.56777475 0.4322252
## 4               no          no 0.23016143 0.7698386
## 5               no          no 0.24749052 0.7525095
## 6               no          no 0.02215866 0.9778413
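
As a side note, a more tidyverse-style alternative to cbind() is dplyr’s bind_cols(), which keeps the result as a tibble with the same columns:

telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)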

Assessing model fit

Key points

  • in tidymodels, the outcome variable needs to be a factor; the first level is treated as the positive class.

  • to check the ordering of a factor vector, pass it to levels() (see the short sketch after this list)

  • confusion matrix for checking correct and incorrect predictions: conf_mat()

    • matrix with counts of all combinations of actual and predicted outcome values

    • correct predictions: TP (true positive) and TN (true negative)

    • there are two classification errors: FP and FN

  • classification metrics with yardstick - need to define truth and estimate

    • creating confusion matrices and other model fit metrics with yardstick

    • require a tibble of model results which contain

      • true outcome values

      • predicted outcome categories

      • estimated probabilities of each category

  • classification accuracy with accuracy function - need to define truth and estimate

    • not the best metric on its own, since classifying every customer as “no” would also achieve a fairly high accuracy

    • takes same arguments as conf_mat

    • calculates the classification accuracy

      • \(\frac{TP + TN}{TP + TN + FP + FN}\)

  • sensitivity - proportion of all positive cases that were correctly classified. \(\frac{TP}{TP + FN}\)

  • specificity - proportion of all negative cases that were correctly classified. \(\frac{TN}{TN + FP}\)

  • can also create a custom metric set
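
As noted in the key points, a quick way to confirm that “yes” is the first factor level (and therefore the positive class) is to pass the outcome to levels(); this check is not part of the original output:

# "yes" should be listed first so it is treated as the positive class
levels(telecom_training$canceled_service)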

Exercise

Calculate conf_mat, accuracy, sensitivity, and specificity. The specificity results are much higher than the sensitivity results, which means the model is much better at detecting customers who will not cancel their telecom service than the ones who will.

conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  25  15
##        no   57 147
accuracy(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.705
sens(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.305
spec(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.907
  • otherwise, create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)
  • then calculate metrics using the model results tibble

telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)
## # A tibble: 3 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.705
## 2 sens     binary         0.305
## 3 spec     binary         0.907
  • create a confusion matrix and pass it to summary() to calculate all available binary classification metrics at once.
conf_mat(telecom_results, 
         truth = canceled_service,
         estimate = .pred_class) %>% 
  summary()
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.705
##  2 kap                  binary         0.243
##  3 sens                 binary         0.305
##  4 spec                 binary         0.907
##  5 ppv                  binary         0.625
##  6 npv                  binary         0.721
##  7 mcc                  binary         0.271
##  8 j_index              binary         0.212
##  9 bal_accuracy         binary         0.606
## 10 detection_prevalence binary         0.164
## 11 precision            binary         0.625
## 12 recall               binary         0.305
## 13 f_meas               binary         0.410

Visualization of model performance

A couple key points:

  • heatmap with autoplot

  • the default probability threshold for the positive class is 0.5

  • test performance across a range of probability thresholds (a small re-classification sketch follows this list)

  • use the ROC curve as well - covered before. It plots the proportion correct among actual positives (sensitivity) against the proportion incorrect among actual negatives (1 - specificity), visualizing performance across probability thresholds. Also calculate the ROC AUC.

    • use roc_curve()

    • pass to autoplot to plot the ROC curve

    • use roc_auc to calculate AUC values
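
A minimal, illustrative sketch of re-classifying at a custom threshold (the 0.30 cutoff and the column name .pred_class_30 are arbitrary choices for illustration, not from the original analysis):

# lowering the threshold from 0.50 to 0.30 classifies more customers as "yes",
# which should raise sensitivity at the cost of specificity
telecom_results %>% 
  mutate(.pred_class_30 = factor(if_else(.pred_yes >= 0.30, "yes", "no"),
                                 levels = levels(canceled_service))) %>% 
  sens(truth = canceled_service, estimate = .pred_class_30)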

Practice:

  • plot the confusion matrix in heatmap
conf_mat(telecom_results,
         truth = canceled_service, 
         estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

  • plot the confusion matrix as a mosaic - as stated previously, the model clearly performs better in terms of specificity than sensitivity.
conf_mat(telecom_results, 
         truth = canceled_service, 
         estimate = .pred_class) %>% 
  autoplot(type = "mosaic")

  • ROC curve

first, calculate sensitivity and specificity across the probability thresholds

threshold_df <- telecom_results %>% 
  roc_curve(truth = canceled_service, 
            .pred_yes)
threshold_df
## # A tibble: 246 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1  -Inf          0                 1
##  2     0.0222     0                 1
##  3     0.0468     0.00617           1
##  4     0.0522     0.0123            1
##  5     0.0556     0.0185            1
##  6     0.0585     0.0247            1
##  7     0.0688     0.0309            1
##  8     0.0715     0.0370            1
##  9     0.0739     0.0432            1
## 10     0.0752     0.0494            1
## # … with 236 more rows

second, plot the ROC curve

threshold_df %>% 
  autoplot()

third, calculate the AUC - the area under the ROC curve is about 0.78.

roc_auc(telecom_results, truth = canceled_service, .pred_yes)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.784

Automating the modelling workflow

  • using last_fit()

    • accepts classification models

    • speeds up the modeling process

    • fits the model to the training data and produces predictions on the test dataset

  • steps

    • create a data split object with rsample

    • specifying a model with parsnip

    • pass to last_fit(): the parsnip model object, the model formula, and the data split object; then use collect_metrics() to calculate metrics on the test dataset

    • pass to collect_predictions() to create a tibble with all the columns needed by yardstick functions.

    • for the metrics, note that accuracy, sensitivity, and specificity require the arguments ‘truth’ and ‘estimate’, but roc_auc requires ‘truth’ and the estimated probabilities. So both the class estimate and the probability column need to be included when calling a custom metric function created with metric_set().

The fitting process with last_fit()

  • train the model with last_fit() and view the test set metrics
telecom_last_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

telecom_last_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.705 Preprocessor1_Model1
## 2 roc_auc  binary         0.784 Preprocessor1_Model1
  • collect predictions, view results and calculate metrics
last_fit_results <- telecom_last_fit %>% 
  collect_predictions()

last_fit_results
## # A tibble: 244 × 7
##    id         .pred_yes .pred_no  .row .pred_class canceled_service .config     
##    <chr>          <dbl>    <dbl> <int> <fct>       <fct>            <chr>       
##  1 train/tes…    0.530     0.470    11 yes         yes              Preprocesso…
##  2 train/tes…    0.382     0.618    15 no          no               Preprocesso…
##  3 train/tes…    0.568     0.432    18 yes         no               Preprocesso…
##  4 train/tes…    0.230     0.770    20 no          no               Preprocesso…
##  5 train/tes…    0.247     0.753    23 no          no               Preprocesso…
##  6 train/tes…    0.0222    0.978    37 no          no               Preprocesso…
##  7 train/tes…    0.369     0.631    48 no          no               Preprocesso…
##  8 train/tes…    0.145     0.855    49 no          no               Preprocesso…
##  9 train/tes…    0.578     0.422    53 yes         yes              Preprocesso…
## 10 train/tes…    0.309     0.691    59 no          no               Preprocesso…
## # … with 234 more rows
last_fit_metrics <- metric_set(accuracy, sens, spec, roc_auc)
last_fit_metrics(last_fit_results, truth = canceled_service, 
                 estimate = .pred_class, .pred_yes)
## # A tibble: 4 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.705
## 2 sens     binary         0.305
## 3 spec     binary         0.907
## 4 roc_auc  binary         0.784
  • test whether model performance improves with an additional predictor - months_with_company. The ROC AUC increased from 0.784 to 0.887.
#train a logistic regression model
logistic_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company, 
           split = telecom_split)
#collect metrics
logistic_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.832 Preprocessor1_Model1
## 2 roc_auc  binary         0.887 Preprocessor1_Model1
#collect model predictions and plot ROC curve
logistic_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()