Tidymodels_classification

Kate C

2022-01-29

Load Packages and Dataset

Packages and data used to import and analyse the data include

  • tidymodels

  • telecom_df: a dataset containing information on customers of a telecommunications company. The outcome variable is canceled_service, which records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.
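
As a minimal setup sketch (assuming telecom_df is already available in the R session; the original source of the data file is not shown here), the packages can be loaded and the data inspected as follows:

library(tidymodels)

# quick look at the columns and the outcome distribution
glimpse(telecom_df)
telecom_df %>% count(canceled_service)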

Practice

Objective: we will predict whether a customer cancels the service, based on the predictors and labels available in the dataset.

First - data resampling

This step is the same as in the previous tidymodels session on regression, so we do not provide much description alongside the code below.

Note that here we want a 75%/25% split for the training/testing data, stratified by the outcome.

telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_training <- telecom_split %>% 
                    training()
telecom_test <- telecom_split %>% 
                testing()
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244
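
As a quick sanity check (not part of the original output), the stratified split should give similar proportions of canceled_service in the training and test sets:

# compare outcome proportions across the two sets
telecom_training %>% count(canceled_service) %>% mutate(prop = n / sum(n))
telecom_test %>% count(canceled_service) %>% mutate(prop = n / sum(n))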

Second - model fitting with training dataset

  • create a logistic model
logistic_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

logistic_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
  • fit to training data. Printing a model fit object displays the estimated model coefficients.
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges, 
      data = telecom_training)

logistic_fit
## parsnip model object
## 
## Fit time:  9ms 
## 
## Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
##     monthly_charges, family = stats::binomial, data = data)
## 
## Coefficients:
##     (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
##        2.089671        -0.009769         0.021364        -0.001908  
## 
## Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
## Null Deviance:       932.4 
## Residual Deviance: 817.1     AIC: 825.1
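
Since tidymodels attaches broom, a tidier view of the same coefficients (with standard errors and p-values) can optionally be obtained with tidy(); this is not part of the original output:

# coefficient table as a tibble
tidy(logistic_fit)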

Third - combining test data results

couple of points to note:

  • unlike base R, here we use the new_data argument; also, the type argument takes “class”, not “classification”
class_preds <- predict(logistic_fit, 
                      new_data = telecom_test, 
                      type = "class")
  • obtain estimated probabilities for each outcome value - here type is “prob” since we are getting the probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test, type = "prob")
  • combine the test set results - here we bind the true outcomes from the test dataset together with the predicted classes and probabilities
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  cbind(class_preds, prob_preds)

head(telecom_results,6)
##   canceled_service .pred_class  .pred_yes  .pred_no
## 1              yes         yes 0.52979047 0.4702095
## 2               no          no 0.38170832 0.6182917
## 3               no         yes 0.56777475 0.4322252
## 4               no          no 0.23016143 0.7698386
## 5               no          no 0.24749052 0.7525095
## 6               no          no 0.02215866 0.9778413
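
As a side note, a more tidyverse-style alternative to cbind() is dplyr’s bind_cols(), which keeps the result as a tibble with the same columns:

telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)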

Assessing model fit

Key points

  • in tidymodels, the outcome variable needs to be a factor; the first level is treated as the positive class.

  • to check the ordering of a factor vector, pass it to levels() (see the short sketch after this list)

  • confusion matrix for checking correct and incorrect predictions: conf_mat()

    • matrix with counts of all combinations of actual and predicted outcome values

    • correct predictions: TP (true positive) and TN (true negative)

    • there are two classification errors: FP and FN

  • classification metrics with yardstick - need to define truth and estimate

    • creating confusion matrices and other model fit metrics with yardstick

    • require a tibble of model results which contain

      • true outcome values

      • predicted outcome categories

      • estimated probabilities of each category

  • classification accuracy with accuracy function - need to define truth and estimate

    • not the best metric on its own, since classifying every customer as “no” would also achieve a fairly high accuracy

    • takes same arguments as conf_mat

    • calculates the classification accuracy

      • \(\frac{TP + TN}{TP + TN + FP + FN}\)

  • sensitivity - proportion of all positive cases that were correctly classified. \(\frac{TP}{TP + FN}\)

  • specificity - proportion of all negative cases that were correctly classified. \(\frac{TN}{TN + FP}\)

  • can also create a custom metric set
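
As noted in the key points, a quick way to confirm that “yes” is the first factor level (and therefore the positive class) is to pass the outcome to levels(); this check is not part of the original output:

# "yes" should be listed first so it is treated as the positive class
levels(telecom_training$canceled_service)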

Exercise

Calculate conf_mat, accuracy, sensitivity, and specificity. The specificity results are much higher than the sensitivity results, which means the model is much better at detecting customers who will not cancel their telecom service than the ones who will.

conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  25  15
##        no   57 147
accuracy(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.705
sens(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.305
spec(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.907
  • otherwise, create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)
  • then calculate metrics using the model results tibble

telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)
## # A tibble: 3 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.705
## 2 sens     binary         0.305
## 3 spec     binary         0.907
  • create a confusion matrix and pass it to summary() to calculate all available binary classification metrics at once.
conf_mat(telecom_results, 
         truth = canceled_service,
         estimate = .pred_class) %>% 
  summary()
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.705
##  2 kap                  binary         0.243
##  3 sens                 binary         0.305
##  4 spec                 binary         0.907
##  5 ppv                  binary         0.625
##  6 npv                  binary         0.721
##  7 mcc                  binary         0.271
##  8 j_index              binary         0.212
##  9 bal_accuracy         binary         0.606
## 10 detection_prevalence binary         0.164
## 11 precision            binary         0.625
## 12 recall               binary         0.305
## 13 f_meas               binary         0.410

Visualization of model performance

A couple key points:

  • heatmap with autoplot

  • the default probability threshold for the positive class is 0.5

  • test performance across a range of probability thresholds (a small re-classification sketch follows this list)

  • use the ROC curve as well - covered before. It plots the proportion correct among actual positives (sensitivity) against the proportion incorrect among actual negatives (1 - specificity), visualizing performance across probability thresholds. Also calculate the ROC AUC.

    • use roc_curve()

    • pass to autoplot to plot the ROC curve

    • use roc_auc to calculate AUC values
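
A minimal, illustrative sketch of re-classifying at a custom threshold (the 0.30 cutoff and the column name .pred_class_30 are arbitrary choices for illustration, not from the original analysis):

# lowering the threshold from 0.50 to 0.30 classifies more customers as "yes",
# which should raise sensitivity at the cost of specificity
telecom_results %>% 
  mutate(.pred_class_30 = factor(if_else(.pred_yes >= 0.30, "yes", "no"),
                                 levels = levels(canceled_service))) %>% 
  sens(truth = canceled_service, estimate = .pred_class_30)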

Practice:

  • plot the confusion matrix in heatmap
conf_mat(telecom_results,
         truth = canceled_service, 
         estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

  • plot the confusion matrix as a mosaic - as stated previously, the model clearly performs better in terms of specificity than sensitivity.
conf_mat(telecom_results, 
         truth = canceled_service, 
         estimate = .pred_class) %>% 
  autoplot(type = "mosaic")

  • ROC curve

first, calculate sensitivity and specificity across the probability thresholds

threshold_df <- telecom_results %>% 
  roc_curve(truth = canceled_service, 
            .pred_yes)
threshold_df
## # A tibble: 246 × 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1  -Inf          0                 1
##  2     0.0222     0                 1
##  3     0.0468     0.00617           1
##  4     0.0522     0.0123            1
##  5     0.0556     0.0185            1
##  6     0.0585     0.0247            1
##  7     0.0688     0.0309            1
##  8     0.0715     0.0370            1
##  9     0.0739     0.0432            1
## 10     0.0752     0.0494            1
## # … with 236 more rows

second, plot the ROC curve

threshold_df %>% 
  autoplot()

third, calculate the AUC - the area under the ROC curve is about 0.78.

roc_auc(telecom_results, truth = canceled_service, .pred_yes)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.784

Automating the modelling workflow

  • using last_fit()

    • accepts classification models

    • speeds up the modeling process

    • fits the model to the training data and produces predictions on the test dataset

  • steps

    • create a data split object with rsample

    • specifying a model with parsnip

    • pass to last_fit(): the parsnip model object, the model formula, and the data split object; then use collect_metrics() to calculate metrics on the test dataset

    • pass to collect_predictions() to create a tibble with all the columns needed by yardstick functions.

    • for the metrics, note that accuracy, sensitivity, and specificity require the arguments ‘truth’ and ‘estimate’, but roc_auc requires ‘truth’ and the estimated probabilities. So both the class estimate and the probability column need to be included when calling a custom metric function created with metric_set().

The fitting process with last_fit()

  • train the model with last_fit() and view the test set metrics
telecom_last_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

telecom_last_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.705 Preprocessor1_Model1
## 2 roc_auc  binary         0.784 Preprocessor1_Model1
  • collect predictions, view results and calculate metrics
last_fit_results <- telecom_last_fit %>% 
  collect_predictions()

last_fit_results
## # A tibble: 244 × 7
##    id         .pred_yes .pred_no  .row .pred_class canceled_service .config     
##    <chr>          <dbl>    <dbl> <int> <fct>       <fct>            <chr>       
##  1 train/tes…    0.530     0.470    11 yes         yes              Preprocesso…
##  2 train/tes…    0.382     0.618    15 no          no               Preprocesso…
##  3 train/tes…    0.568     0.432    18 yes         no               Preprocesso…
##  4 train/tes…    0.230     0.770    20 no          no               Preprocesso…
##  5 train/tes…    0.247     0.753    23 no          no               Preprocesso…
##  6 train/tes…    0.0222    0.978    37 no          no               Preprocesso…
##  7 train/tes…    0.369     0.631    48 no          no               Preprocesso…
##  8 train/tes…    0.145     0.855    49 no          no               Preprocesso…
##  9 train/tes…    0.578     0.422    53 yes         yes              Preprocesso…
## 10 train/tes…    0.309     0.691    59 no          no               Preprocesso…
## # … with 234 more rows
last_fit_metrics <- metric_set(accuracy, sens, spec, roc_auc)
last_fit_metrics(last_fit_results, truth = canceled_service, 
                 estimate = .pred_class, .pred_yes)
## # A tibble: 4 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.705
## 2 sens     binary         0.305
## 3 spec     binary         0.907
## 4 roc_auc  binary         0.784
  • test whether model performance improves with an additional predictor - months_with_company. The ROC AUC increased from 0.784 to 0.887.
#train a logistic regression model
logistic_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company, 
           split = telecom_split)
#collect metrics
logistic_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.832 Preprocessor1_Model1
## 2 roc_auc  binary         0.887 Preprocessor1_Model1
#collect model predictions and plot ROC curve
logistic_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()