Tidymodels Classification

Kate C

2022-01-28

Load Packages and Dataset

6.09 pm start time

Packages and data used to import and analyse include (a minimal setup sketch follows the list):

  • tidymodels

  • telecom_df: a dataset containing information on customers of a telecommunications company. The outcome variable is canceled_service, which records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.
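
A minimal setup sketch, assuming telecom_df is stored locally as an .rds file (the file name is hypothetical; the original session loaded the data from the course environment):

library(tidymodels)

# hypothetical file name - load the dataset however it is stored in your project
telecom_df <- readRDS("telecom_df.rds")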

Practice

Objective: we will predict whether a customer cancels their service, based on the predictor variables and labels available in the dataset.

First - data resampling

This step is the same as in the previous session on tidymodels for regression, so the code below is only lightly annotated.

Note that here we want a 75%/25% training/testing split, stratified by the outcome variable (a quick check of the class balance follows the code).

telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_training <- telecom_split %>% 
                    training()
telecom_test <- telecom_split %>% 
                testing()
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244
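
Because the split is stratified on canceled_service, the class proportions should be roughly equal across the two sets. A quick sanity check (output not shown):

# compare outcome class proportions between training and test sets
telecom_training %>% count(canceled_service) %>% mutate(prop = n / sum(n))
telecom_test %>% count(canceled_service) %>% mutate(prop = n / sum(n))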

Second - model fitting with training dataset

  • create a logistic model
logistic_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

logistic_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
  • fit to training data. Printing a model fit object displays the estimated model coefficients.
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges, 
      data = telecom_training)

logistic_fit
## parsnip model object
## 
## Fit time:  9ms 
## 
## Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
##     monthly_charges, family = stats::binomial, data = data)
## 
## Coefficients:
##     (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
##        1.750644        -0.010339         0.021620         0.004416  
## 
## Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
## Null Deviance:       932.4 
## Residual Deviance: 805.3     AIC: 813.3
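
For a tidier coefficient table, the fit can also be passed to tidy() from broom (loaded as part of tidymodels); a minimal sketch:

# returns a tibble with columns term, estimate, std.error, statistic, p.value
tidy(logistic_fit)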

Third - combining test data results

A couple of points to note:

  • unlike base R’s predict(), here we use the new_data argument; also, the type argument here takes “class”, not “classification”
class_preds <- predict(logistic_fit, 
                      new_data = telecom_test, 
                      type = "class")
  • obtain estimated probabilities for each outcome value - here the type is “prob” since we are getting the probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test, type = "prob")
  • combine the test set results - here we create a data frame of model results from the test dataset (cbind() returns a base data frame; a tibble-friendly alternative is shown after the printed results)
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  cbind(class_preds, prob_preds)

telecom_results
##     canceled_service .pred_class  .pred_yes  .pred_no
## 1                 no          no 0.21969062 0.7803094
## 2                 no          no 0.47471229 0.5252877
## 3                yes          no 0.35634034 0.6436597
## 4                 no          no 0.01774991 0.9822501
## 5                yes          no 0.18779518 0.8122048
## 6                 no          no 0.10120042 0.8987996
## 7                 no         yes 0.57021039 0.4297896
## 8                yes         yes 0.65872000 0.3412800
## 9                 no          no 0.15311054 0.8468895
## 10                no          no 0.16558454 0.8344155
## 11                no          no 0.30534670 0.6946533
## 12                no          no 0.43504961 0.5649504
## 13                no          no 0.08771845 0.9122815
## 14               yes         yes 0.52727682 0.4727232
## 15                no          no 0.39242469 0.6075753
## 16                no          no 0.10971986 0.8902801
## 17                no          no 0.37485103 0.6251490
## 18                no          no 0.06413477 0.9358652
## 19               yes          no 0.39367976 0.6063202
## 20               yes          no 0.49954221 0.5004578
## 21               yes          no 0.38936881 0.6106312
## 22               yes          no 0.42431934 0.5756807
## 23               yes          no 0.47024930 0.5297507
## 24                no          no 0.25833906 0.7416609
## 25                no          no 0.35126617 0.6487338
## 26                no          no 0.15479886 0.8452011
## 27                no          no 0.19141678 0.8085832
## 28               yes          no 0.23246311 0.7675369
## 29                no          no 0.12675231 0.8732477
## 30               yes         yes 0.53518426 0.4648157
## 31               yes         yes 0.52274903 0.4772510
## 32               yes         yes 0.57909972 0.4209003
## 33                no          no 0.08454893 0.9154511
## 34               yes          no 0.32827761 0.6717224
## 35                no          no 0.14900478 0.8509952
## 36               yes          no 0.09902395 0.9009761
## 37                no         yes 0.50889162 0.4911084
## 38               yes          no 0.47457242 0.5254276
## 39                no          no 0.42248701 0.5775130
## 40               yes          no 0.25788352 0.7421165
## 41                no          no 0.13409172 0.8659083
## 42               yes          no 0.36095033 0.6390497
## 43                no          no 0.33387427 0.6661257
## 44               yes         yes 0.53992460 0.4600754
## 45                no         yes 0.60977291 0.3902271
## 46                no          no 0.41646518 0.5835348
## 47                no          no 0.26066752 0.7393325
## 48                no          no 0.40697004 0.5930300
## 49                no          no 0.30278949 0.6972105
## 50               yes         yes 0.80235678 0.1976432
## 51                no          no 0.41762059 0.5823794
## 52                no          no 0.18686131 0.8131387
## 53                no          no 0.44713715 0.5528628
## 54                no          no 0.26979945 0.7302005
## 55                no          no 0.42736146 0.5726385
## 56                no         yes 0.54332898 0.4566710
## 57                no         yes 0.63136708 0.3686329
## 58               yes         yes 0.77523432 0.2247657
## 59               yes         yes 0.68240371 0.3175963
## 60                no          no 0.36997843 0.6300216
## 61               yes          no 0.28474046 0.7152595
## 62                no          no 0.31378007 0.6862199
## 63                no          no 0.29646324 0.7035368
## 64                no          no 0.07782842 0.9221716
## 65               yes          no 0.43837761 0.5616224
## 66                no          no 0.22400992 0.7759901
## 67                no          no 0.09167932 0.9083207
## 68                no          no 0.44884524 0.5511548
## 69                no          no 0.20753749 0.7924625
## 70               yes         yes 0.56217004 0.4378300
## 71                no          no 0.28919340 0.7108066
## 72                no          no 0.10962232 0.8903777
## 73                no          no 0.26515045 0.7348495
## 74                no          no 0.36830784 0.6316922
## 75               yes          no 0.32368081 0.6763192
## 76                no          no 0.23979523 0.7602048
## 77                no          no 0.45168726 0.5483127
## 78                no         yes 0.50181565 0.4981843
## 79                no          no 0.06762814 0.9323719
## 80               yes          no 0.25805932 0.7419407
## 81               yes         yes 0.59596832 0.4040317
## 82                no          no 0.26847178 0.7315282
## 83                no          no 0.16089769 0.8391023
## 84                no          no 0.43986919 0.5601308
## 85                no          no 0.47662940 0.5233706
## 86                no          no 0.22607537 0.7739246
## 87                no          no 0.15606558 0.8439344
## 88                no         yes 0.52583128 0.4741687
## 89                no          no 0.10458862 0.8954114
## 90               yes         yes 0.52124741 0.4787526
## 91               yes          no 0.30527049 0.6947295
## 92               yes          no 0.12847640 0.8715236
## 93                no          no 0.17210637 0.8278936
## 94                no          no 0.06697828 0.9330217
## 95                no          no 0.13324119 0.8667588
## 96                no          no 0.06461925 0.9353807
## 97               yes          no 0.42832146 0.5716785
## 98               yes          no 0.10671685 0.8932831
## 99               yes          no 0.40667350 0.5933265
## 100               no          no 0.30980139 0.6901986
## 101              yes          no 0.34278618 0.6572138
## 102              yes         yes 0.84531759 0.1546824
## 103               no          no 0.40239857 0.5976014
## 104               no          no 0.18645305 0.8135469
## 105               no         yes 0.53439444 0.4656056
## 106               no          no 0.05821624 0.9417838
## 107               no          no 0.25679567 0.7432043
## 108              yes          no 0.33458356 0.6654164
## 109               no          no 0.24855802 0.7514420
## 110               no          no 0.37511250 0.6248875
## 111               no          no 0.10830972 0.8916903
## 112               no          no 0.12779057 0.8722094
## 113              yes         yes 0.65290876 0.3470912
## 114              yes          no 0.47914178 0.5208582
## 115               no          no 0.32098485 0.6790151
## 116               no          no 0.09294809 0.9070519
## 117               no          no 0.21468489 0.7853151
## 118               no         yes 0.51198311 0.4880169
## 119              yes          no 0.37879192 0.6212081
## 120               no         yes 0.57826174 0.4217383
## 121               no          no 0.03472197 0.9652780
## 122               no          no 0.35545521 0.6445448
## 123              yes          no 0.29040734 0.7095927
## 124              yes          no 0.43701745 0.5629825
## 125              yes         yes 0.67142293 0.3285771
## 126               no          no 0.18732029 0.8126797
## 127               no          no 0.49848506 0.5015149
## 128               no          no 0.18483610 0.8151639
## 129              yes          no 0.30115306 0.6988469
## 130               no          no 0.29155380 0.7084462
## 131               no          no 0.13158615 0.8684138
## 132               no          no 0.08671606 0.9132839
## 133              yes          no 0.44501833 0.5549817
## 134               no          no 0.23635425 0.7636458
## 135               no          no 0.15337879 0.8466212
## 136              yes          no 0.35567213 0.6443279
## 137               no          no 0.17837096 0.8216290
## 138               no          no 0.25375169 0.7462483
## 139               no          no 0.18372469 0.8162753
## 140              yes          no 0.07726214 0.9227379
## 141               no          no 0.38223330 0.6177667
## 142              yes         yes 0.64106060 0.3589394
## 143              yes         yes 0.59990613 0.4000939
## 144               no          no 0.21996142 0.7800386
## 145               no          no 0.36726945 0.6327305
## 146               no          no 0.35764208 0.6423579
## 147               no         yes 0.60617793 0.3938221
## 148              yes         yes 0.56743483 0.4325652
## 149               no          no 0.20147442 0.7985256
## 150               no          no 0.30953606 0.6904639
## 151               no          no 0.37845219 0.6215478
## 152               no          no 0.15700013 0.8429999
## 153              yes          no 0.45656898 0.5434310
## 154              yes          no 0.22498431 0.7750157
## 155               no          no 0.20176877 0.7982312
## 156              yes          no 0.30646618 0.6935338
## 157               no          no 0.39104526 0.6089547
## 158               no          no 0.39277354 0.6072265
## 159               no          no 0.10660064 0.8933994
## 160               no          no 0.34026109 0.6597389
## 161               no          no 0.20014411 0.7998559
## 162               no         yes 0.64352887 0.3564711
## 163              yes          no 0.27539756 0.7246024
## 164               no          no 0.15043268 0.8495673
## 165               no          no 0.21386025 0.7861397
## 166               no          no 0.08085626 0.9191437
## 167               no          no 0.16253712 0.8374629
## 168               no          no 0.13669980 0.8633002
## 169               no          no 0.19689523 0.8031048
## 170               no          no 0.34202804 0.6579720
## 171               no          no 0.16894074 0.8310593
## 172               no          no 0.24234601 0.7576540
## 173               no          no 0.40858171 0.5914183
## 174               no          no 0.14308586 0.8569141
## 175               no          no 0.35071087 0.6492891
## 176              yes          no 0.36777525 0.6322248
## 177               no          no 0.10175181 0.8982482
## 178               no          no 0.15477150 0.8452285
## 179              yes          no 0.34650681 0.6534932
## 180               no          no 0.29006642 0.7099336
## 181              yes          no 0.45041905 0.5495809
## 182               no          no 0.30175024 0.6982498
## 183               no          no 0.37197311 0.6280269
## 184              yes         yes 0.54719220 0.4528078
## 185               no          no 0.30019815 0.6998019
## 186               no          no 0.13595284 0.8640472
## 187              yes          no 0.14283846 0.8571615
## 188               no          no 0.17840843 0.8215916
## 189              yes          no 0.08503174 0.9149683
## 190               no          no 0.10229905 0.8977009
## 191               no          no 0.11768837 0.8823116
## 192              yes         yes 0.62489932 0.3751007
## 193              yes          no 0.35575694 0.6442431
## 194               no          no 0.07610853 0.9238915
## 195              yes          no 0.48068793 0.5193121
## 196               no          no 0.12302262 0.8769774
## 197               no          no 0.12402611 0.8759739
## 198               no          no 0.15534370 0.8446563
## 199               no          no 0.06256196 0.9374380
## 200              yes         yes 0.57196121 0.4280388
## 201              yes          no 0.30344489 0.6965551
## 202               no          no 0.37090235 0.6290976
## 203               no          no 0.32468175 0.6753183
## 204               no         yes 0.70329080 0.2967092
## 205               no          no 0.28998244 0.7100176
## 206               no          no 0.26971562 0.7302844
## 207               no          no 0.12713185 0.8728681
## 208               no          no 0.19670941 0.8032906
## 209               no         yes 0.59331609 0.4066839
## 210               no          no 0.24722941 0.7527706
## 211              yes          no 0.19285261 0.8071474
## 212               no          no 0.43847296 0.5615270
## 213               no          no 0.15889704 0.8411030
## 214              yes         yes 0.59827052 0.4017295
## 215              yes          no 0.44689603 0.5531040
## 216              yes         yes 0.56496856 0.4350314
## 217              yes          no 0.22137437 0.7786256
## 218              yes          no 0.34601419 0.6539858
## 219              yes          no 0.36626992 0.6337301
## 220              yes         yes 0.52433510 0.4756649
## 221               no         yes 0.62688973 0.3731103
## 222               no          no 0.13234264 0.8676574
## 223               no          no 0.29826718 0.7017328
## 224               no          no 0.30924796 0.6907520
## 225              yes          no 0.25689973 0.7431003
## 226              yes         yes 0.52663430 0.4733657
## 227              yes          no 0.43132288 0.5686771
## 228               no         yes 0.53483026 0.4651697
## 229              yes          no 0.29354161 0.7064584
## 230               no          no 0.18407968 0.8159203
## 231               no          no 0.08763449 0.9123655
## 232               no          no 0.23359719 0.7664028
## 233               no          no 0.16264215 0.8373578
## 234              yes          no 0.45111093 0.5488891
## 235              yes          no 0.19862069 0.8013793
## 236               no          no 0.20384693 0.7961531
## 237               no          no 0.42482690 0.5751731
## 238              yes         yes 0.74017932 0.2598207
## 239               no          no 0.37039490 0.6296051
## 240              yes          no 0.25849625 0.7415037
## 241               no          no 0.17209043 0.8279096
## 242              yes         yes 0.66573572 0.3342643
## 243               no          no 0.14283759 0.8571624
## 244               no          no 0.05202142 0.9479786
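
As a tidyverse-style alternative to cbind(), bind_cols() from dplyr keeps the result a tibble (which prints only its first ten rows); same objects as above:

telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)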

Assessing model fit

Key points

  • in tidymodels, the outcome variable needs to be a factor; the first level is treated as the positive class.

  • to check the ordering of a factor vector, pass it into levels()

  • confusion matrix for checking correct and incorrect predictions: conf_mat()

    • matrix with counts of all combinations of actual and predicted outcome values

    • correct predictions: TP (true positive) and TN (true negative)

    • there are two types of classification errors: FP (false positive) and FN (false negative)

  • classification metrics with yardstick - need to define truth and estimate

    • creating confusion matrices and other model fit metrics with yardstick

    • requires a tibble of model results containing

      • true outcome values

      • predicted outcome categories

      • estimated probabilities of each category

  • classification accuracy with accuracy function - need to define truth and estimate

    • not the best metric on its own: classifying every customer as “no” would also achieve roughly 66% accuracy here (162 of the 244 test cases are “no”)

    • takes same arguments as conf_mat

    • calculates the classification accuracy

      • \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • sensitivity - proportion of all positive cases that were correctly classified: \(\frac{TP}{TP + FN}\)

  • specificity - proportion of all negative cases that were correctly classified: \(\frac{TN}{TN + FP}\)

  • can also create a custom metric set (a worked check of these formulas follows this list)
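
A worked check of these formulas against the confusion matrix counts from the exercise below (TP = 27, FN = 55, FP = 16, TN = 146):

# manual metric calculations from the confusion matrix counts
TP <- 27; FN <- 55; FP <- 16; TN <- 146
(TP + TN) / (TP + TN + FP + FN)  # accuracy    = 173/244 ~ 0.709
TP / (TP + FN)                   # sensitivity =  27/82  ~ 0.329
TN / (TN + FP)                   # specificity = 146/162 ~ 0.901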

Exercise

Calculate the confusion matrix, accuracy, sensitivity, and specificity. The specificity (0.901) is much higher than the sensitivity (0.329), which means the model is much better at detecting customers who will not cancel their telecom service than those who will.

conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  27  16
##        no   55 146
accuracy(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.709
sens(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.329
spec(telecom_results, truth = canceled_service,
    estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.901
  • alternatively

    • create a custom metric set with metric_set()
    telecom_metrics <- metric_set(accuracy, sens, spec)
  • calculate the metrics using the model results

telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)
## # A tibble: 3 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.709
## 2 sens     binary         0.329
## 3 spec     binary         0.901
  • create a confusion matrix and pass it to summary() to calculate all available binary classification metrics at once.
conf_mat(telecom_results, 
         truth = canceled_service,
         estimate = .pred_class) %>% 
  summary()
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.709
##  2 kap                  binary         0.261
##  3 sens                 binary         0.329
##  4 spec                 binary         0.901
##  5 ppv                  binary         0.628
##  6 npv                  binary         0.726
##  7 mcc                  binary         0.286
##  8 j_index              binary         0.231
##  9 bal_accuracy         binary         0.615
## 10 detection_prevalence binary         0.176
## 11 precision            binary         0.628
## 12 recall               binary         0.329
## 13 f_meas               binary         0.432