Load Packages and Dataset
Packages used to import and analyse the data include:
- tidymodels
telecom_df: a dataset containing information on customers of a telecommunications company. The outcome variable is canceled_service, which records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.
Practice
Objective: we will predict whether a customer cancels the service (the label) based on the predictors available in the dataset.
First - data resampling
This step is the same as in the previous session on tidymodels for regression, so we do not provide much description alongside the code below.
Note that here we want a 75%/25% split for training/testing data, stratified by the outcome variable canceled_service.
telecom_split <- initial_split(telecom_df, prop = 0.75, strata = canceled_service)
telecom_training <- telecom_split %>%
  training()
telecom_test <- telecom_split %>%
  testing()
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244
Second - model fitting with the training dataset
- create a logistic model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
logistic_model
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
- fit to the training data. Printing a model fit object displays the estimated model coefficients.
logistic_fit <- logistic_model %>%
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
      data = telecom_training)
logistic_fit
## parsnip model object
##
## Fit time: 9ms
##
## Call: stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins +
## monthly_charges, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) avg_call_mins avg_intl_mins monthly_charges
## 1.750644 -0.010339 0.021620 0.004416
##
## Degrees of Freedom: 730 Total (i.e. Null); 727 Residual
## Null Deviance: 932.4
## Residual Deviance: 805.3 AIC: 813.3
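For a cleaner look at the coefficients than the default print method, the fit can also be passed to tidy() (from broom, which tidymodels loads). This is an optional sketch, not part of the original exercise:
# optional: tidy() returns the coefficients, standard errors,
# test statistics and p-values as a tibble
tidy(logistic_fit)
# exponentiate = TRUE reports odds ratios rather than log-odds coefficients
tidy(logistic_fit, exponentiate = TRUE)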
Third - combining test data results
A couple of points to note:
- unlike base R’s predict(), here we use the new_data argument; also, the type argument takes “class” (not “classification”) to return the predicted classes
class_preds <- predict(logistic_fit,
                       new_data = telecom_test,
                       type = "class")
- obtain estimated probabilities for each outcome value; here the type is “prob” since we are getting the probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test, type = "prob")
- combine the test set results; here we create a table of model results using the test dataset
telecom_results <- telecom_test %>%
  select(canceled_service) %>%
  cbind(class_preds, prob_preds)
telecom_results
## canceled_service .pred_class .pred_yes .pred_no
## 1 no no 0.21969062 0.7803094
## 2 no no 0.47471229 0.5252877
## 3 yes no 0.35634034 0.6436597
## 4 no no 0.01774991 0.9822501
## 5 yes no 0.18779518 0.8122048
## 6 no no 0.10120042 0.8987996
## 7 no yes 0.57021039 0.4297896
## 8 yes yes 0.65872000 0.3412800
## 9 no no 0.15311054 0.8468895
## 10 no no 0.16558454 0.8344155
## 11 no no 0.30534670 0.6946533
## 12 no no 0.43504961 0.5649504
## 13 no no 0.08771845 0.9122815
## 14 yes yes 0.52727682 0.4727232
## 15 no no 0.39242469 0.6075753
## 16 no no 0.10971986 0.8902801
## 17 no no 0.37485103 0.6251490
## 18 no no 0.06413477 0.9358652
## 19 yes no 0.39367976 0.6063202
## 20 yes no 0.49954221 0.5004578
## 21 yes no 0.38936881 0.6106312
## 22 yes no 0.42431934 0.5756807
## 23 yes no 0.47024930 0.5297507
## 24 no no 0.25833906 0.7416609
## 25 no no 0.35126617 0.6487338
## 26 no no 0.15479886 0.8452011
## 27 no no 0.19141678 0.8085832
## 28 yes no 0.23246311 0.7675369
## 29 no no 0.12675231 0.8732477
## 30 yes yes 0.53518426 0.4648157
## 31 yes yes 0.52274903 0.4772510
## 32 yes yes 0.57909972 0.4209003
## 33 no no 0.08454893 0.9154511
## 34 yes no 0.32827761 0.6717224
## 35 no no 0.14900478 0.8509952
## 36 yes no 0.09902395 0.9009761
## 37 no yes 0.50889162 0.4911084
## 38 yes no 0.47457242 0.5254276
## 39 no no 0.42248701 0.5775130
## 40 yes no 0.25788352 0.7421165
## 41 no no 0.13409172 0.8659083
## 42 yes no 0.36095033 0.6390497
## 43 no no 0.33387427 0.6661257
## 44 yes yes 0.53992460 0.4600754
## 45 no yes 0.60977291 0.3902271
## 46 no no 0.41646518 0.5835348
## 47 no no 0.26066752 0.7393325
## 48 no no 0.40697004 0.5930300
## 49 no no 0.30278949 0.6972105
## 50 yes yes 0.80235678 0.1976432
## 51 no no 0.41762059 0.5823794
## 52 no no 0.18686131 0.8131387
## 53 no no 0.44713715 0.5528628
## 54 no no 0.26979945 0.7302005
## 55 no no 0.42736146 0.5726385
## 56 no yes 0.54332898 0.4566710
## 57 no yes 0.63136708 0.3686329
## 58 yes yes 0.77523432 0.2247657
## 59 yes yes 0.68240371 0.3175963
## 60 no no 0.36997843 0.6300216
## 61 yes no 0.28474046 0.7152595
## 62 no no 0.31378007 0.6862199
## 63 no no 0.29646324 0.7035368
## 64 no no 0.07782842 0.9221716
## 65 yes no 0.43837761 0.5616224
## 66 no no 0.22400992 0.7759901
## 67 no no 0.09167932 0.9083207
## 68 no no 0.44884524 0.5511548
## 69 no no 0.20753749 0.7924625
## 70 yes yes 0.56217004 0.4378300
## 71 no no 0.28919340 0.7108066
## 72 no no 0.10962232 0.8903777
## 73 no no 0.26515045 0.7348495
## 74 no no 0.36830784 0.6316922
## 75 yes no 0.32368081 0.6763192
## 76 no no 0.23979523 0.7602048
## 77 no no 0.45168726 0.5483127
## 78 no yes 0.50181565 0.4981843
## 79 no no 0.06762814 0.9323719
## 80 yes no 0.25805932 0.7419407
## 81 yes yes 0.59596832 0.4040317
## 82 no no 0.26847178 0.7315282
## 83 no no 0.16089769 0.8391023
## 84 no no 0.43986919 0.5601308
## 85 no no 0.47662940 0.5233706
## 86 no no 0.22607537 0.7739246
## 87 no no 0.15606558 0.8439344
## 88 no yes 0.52583128 0.4741687
## 89 no no 0.10458862 0.8954114
## 90 yes yes 0.52124741 0.4787526
## 91 yes no 0.30527049 0.6947295
## 92 yes no 0.12847640 0.8715236
## 93 no no 0.17210637 0.8278936
## 94 no no 0.06697828 0.9330217
## 95 no no 0.13324119 0.8667588
## 96 no no 0.06461925 0.9353807
## 97 yes no 0.42832146 0.5716785
## 98 yes no 0.10671685 0.8932831
## 99 yes no 0.40667350 0.5933265
## 100 no no 0.30980139 0.6901986
## 101 yes no 0.34278618 0.6572138
## 102 yes yes 0.84531759 0.1546824
## 103 no no 0.40239857 0.5976014
## 104 no no 0.18645305 0.8135469
## 105 no yes 0.53439444 0.4656056
## 106 no no 0.05821624 0.9417838
## 107 no no 0.25679567 0.7432043
## 108 yes no 0.33458356 0.6654164
## 109 no no 0.24855802 0.7514420
## 110 no no 0.37511250 0.6248875
## 111 no no 0.10830972 0.8916903
## 112 no no 0.12779057 0.8722094
## 113 yes yes 0.65290876 0.3470912
## 114 yes no 0.47914178 0.5208582
## 115 no no 0.32098485 0.6790151
## 116 no no 0.09294809 0.9070519
## 117 no no 0.21468489 0.7853151
## 118 no yes 0.51198311 0.4880169
## 119 yes no 0.37879192 0.6212081
## 120 no yes 0.57826174 0.4217383
## 121 no no 0.03472197 0.9652780
## 122 no no 0.35545521 0.6445448
## 123 yes no 0.29040734 0.7095927
## 124 yes no 0.43701745 0.5629825
## 125 yes yes 0.67142293 0.3285771
## 126 no no 0.18732029 0.8126797
## 127 no no 0.49848506 0.5015149
## 128 no no 0.18483610 0.8151639
## 129 yes no 0.30115306 0.6988469
## 130 no no 0.29155380 0.7084462
## 131 no no 0.13158615 0.8684138
## 132 no no 0.08671606 0.9132839
## 133 yes no 0.44501833 0.5549817
## 134 no no 0.23635425 0.7636458
## 135 no no 0.15337879 0.8466212
## 136 yes no 0.35567213 0.6443279
## 137 no no 0.17837096 0.8216290
## 138 no no 0.25375169 0.7462483
## 139 no no 0.18372469 0.8162753
## 140 yes no 0.07726214 0.9227379
## 141 no no 0.38223330 0.6177667
## 142 yes yes 0.64106060 0.3589394
## 143 yes yes 0.59990613 0.4000939
## 144 no no 0.21996142 0.7800386
## 145 no no 0.36726945 0.6327305
## 146 no no 0.35764208 0.6423579
## 147 no yes 0.60617793 0.3938221
## 148 yes yes 0.56743483 0.4325652
## 149 no no 0.20147442 0.7985256
## 150 no no 0.30953606 0.6904639
## 151 no no 0.37845219 0.6215478
## 152 no no 0.15700013 0.8429999
## 153 yes no 0.45656898 0.5434310
## 154 yes no 0.22498431 0.7750157
## 155 no no 0.20176877 0.7982312
## 156 yes no 0.30646618 0.6935338
## 157 no no 0.39104526 0.6089547
## 158 no no 0.39277354 0.6072265
## 159 no no 0.10660064 0.8933994
## 160 no no 0.34026109 0.6597389
## 161 no no 0.20014411 0.7998559
## 162 no yes 0.64352887 0.3564711
## 163 yes no 0.27539756 0.7246024
## 164 no no 0.15043268 0.8495673
## 165 no no 0.21386025 0.7861397
## 166 no no 0.08085626 0.9191437
## 167 no no 0.16253712 0.8374629
## 168 no no 0.13669980 0.8633002
## 169 no no 0.19689523 0.8031048
## 170 no no 0.34202804 0.6579720
## 171 no no 0.16894074 0.8310593
## 172 no no 0.24234601 0.7576540
## 173 no no 0.40858171 0.5914183
## 174 no no 0.14308586 0.8569141
## 175 no no 0.35071087 0.6492891
## 176 yes no 0.36777525 0.6322248
## 177 no no 0.10175181 0.8982482
## 178 no no 0.15477150 0.8452285
## 179 yes no 0.34650681 0.6534932
## 180 no no 0.29006642 0.7099336
## 181 yes no 0.45041905 0.5495809
## 182 no no 0.30175024 0.6982498
## 183 no no 0.37197311 0.6280269
## 184 yes yes 0.54719220 0.4528078
## 185 no no 0.30019815 0.6998019
## 186 no no 0.13595284 0.8640472
## 187 yes no 0.14283846 0.8571615
## 188 no no 0.17840843 0.8215916
## 189 yes no 0.08503174 0.9149683
## 190 no no 0.10229905 0.8977009
## 191 no no 0.11768837 0.8823116
## 192 yes yes 0.62489932 0.3751007
## 193 yes no 0.35575694 0.6442431
## 194 no no 0.07610853 0.9238915
## 195 yes no 0.48068793 0.5193121
## 196 no no 0.12302262 0.8769774
## 197 no no 0.12402611 0.8759739
## 198 no no 0.15534370 0.8446563
## 199 no no 0.06256196 0.9374380
## 200 yes yes 0.57196121 0.4280388
## 201 yes no 0.30344489 0.6965551
## 202 no no 0.37090235 0.6290976
## 203 no no 0.32468175 0.6753183
## 204 no yes 0.70329080 0.2967092
## 205 no no 0.28998244 0.7100176
## 206 no no 0.26971562 0.7302844
## 207 no no 0.12713185 0.8728681
## 208 no no 0.19670941 0.8032906
## 209 no yes 0.59331609 0.4066839
## 210 no no 0.24722941 0.7527706
## 211 yes no 0.19285261 0.8071474
## 212 no no 0.43847296 0.5615270
## 213 no no 0.15889704 0.8411030
## 214 yes yes 0.59827052 0.4017295
## 215 yes no 0.44689603 0.5531040
## 216 yes yes 0.56496856 0.4350314
## 217 yes no 0.22137437 0.7786256
## 218 yes no 0.34601419 0.6539858
## 219 yes no 0.36626992 0.6337301
## 220 yes yes 0.52433510 0.4756649
## 221 no yes 0.62688973 0.3731103
## 222 no no 0.13234264 0.8676574
## 223 no no 0.29826718 0.7017328
## 224 no no 0.30924796 0.6907520
## 225 yes no 0.25689973 0.7431003
## 226 yes yes 0.52663430 0.4733657
## 227 yes no 0.43132288 0.5686771
## 228 no yes 0.53483026 0.4651697
## 229 yes no 0.29354161 0.7064584
## 230 no no 0.18407968 0.8159203
## 231 no no 0.08763449 0.9123655
## 232 no no 0.23359719 0.7664028
## 233 no no 0.16264215 0.8373578
## 234 yes no 0.45111093 0.5488891
## 235 yes no 0.19862069 0.8013793
## 236 no no 0.20384693 0.7961531
## 237 no no 0.42482690 0.5751731
## 238 yes yes 0.74017932 0.2598207
## 239 no no 0.37039490 0.6296051
## 240 yes no 0.25849625 0.7415037
## 241 no no 0.17209043 0.8279096
## 242 yes yes 0.66573572 0.3342643
## 243 no no 0.14283759 0.8571624
## 244 no no 0.05202142 0.9479786
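One note on the cbind() call above: because cbind() returns a base data.frame, the printout shows numbered rows rather than a tibble. If a tibble is preferred, dplyr’s bind_cols() could be used instead; a minimal alternative sketch:
# alternative to cbind(): bind_cols() keeps the result as a tibble
telecom_results <- telecom_test %>%
  select(canceled_service) %>%
  bind_cols(class_preds, prob_preds)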
Assessing model fit
Key points
- in tidymodels, the outcome variable needs to be a factor; the first level is treated as the positive class
- to check the ordering of a factor vector, pass it into levels() (see the sketch after this list)
- a confusion matrix, created with conf_mat(), checks which predictions were correct and which were not
  - a matrix with counts of all combinations of actual and predicted outcome values
  - correct predictions: TP (true positive) and TN (true negative)
  - there are two types of classification error: FP (false positive) and FN (false negative)
- classification metrics with yardstick - need to define truth and estimate
- creating confusion matrices and other model fit metrics with yardstick requires a tibble of model results which contains
  - true outcome values
  - predicted outcome categories
  - estimated probabilities of each category
- classification accuracy with the accuracy() function - again need to define truth and estimate
  - not the best metric, since classifying every customer as “no” would also achieve roughly 66% accuracy on this test set (162 of the 244 test customers did not cancel)
  - takes the same arguments as conf_mat()
  - calculates the classification accuracy \(\frac{TP + TN}{TP + TN + FP + FN}\)
- sensitivity - proportion of all positive cases that were correctly classified: \(\frac{TP}{TP + FN}\)
- specificity - proportion of all negative cases that were correctly classified: \(\frac{TN}{TN + FP}\)
- can also create a custom metric set
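To make the first two key points concrete, here is a minimal sketch (assuming the telecom_df and canceled_service names used above) of checking and, if necessary, re-ordering the factor levels of the outcome:
# check that "yes" (the positive class) is listed as the first level
levels(telecom_df$canceled_service)
# if not, re-specify the level order before splitting the data
telecom_df$canceled_service <- factor(telecom_df$canceled_service,
                                      levels = c("yes", "no"))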
Exercise
Calculate the confusion matrix, accuracy, sensitivity and specificity. The specificity is much higher than the sensitivity, which means the model is much better at detecting customers who will not cancel their telecom service than those who will.
conf_mat(telecom_results, truth = canceled_service,
         estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  27  16
##        no   55 146
accuracy(telecom_results, truth = canceled_service,
         estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.709
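As a quick check of the earlier point that accuracy alone can be misleading, the class balance of the test set can be inspected; the proportion of “no” is the accuracy a model would achieve by never predicting a cancellation. A small optional sketch:
# class balance of the test set results
telecom_results %>%
  count(canceled_service) %>%
  mutate(prop = n / sum(n))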
sens(telecom_results, truth = canceled_service,
     estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 0.329
spec(telecom_results, truth = canceled_service,
     estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 0.901
Alternatively:
- create a custom metric set
telecom_metrics <- metric_set(accuracy, sens, spec)
- calculate metrics using the model results table
telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.709
## 2 sens binary 0.329
## 3 spec binary 0.901
- create a confusion matrix and pass it to summary() to calculate all available binary classification metrics at once.
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>%
  summary()
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.709
## 2 kap binary 0.261
## 3 sens binary 0.329
## 4 spec binary 0.901
## 5 ppv binary 0.628
## 6 npv binary 0.726
## 7 mcc binary 0.286
## 8 j_index binary 0.231
## 9 bal_accuracy binary 0.615
## 10 detection_prevalence binary 0.176
## 11 precision binary 0.628
## 12 recall binary 0.329
## 13 f_meas binary 0.432
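The estimated probability columns (.pred_yes, .pred_no) in telecom_results are not used by the class-based metrics above, but yardstick can also compute probability-based metrics from them. A minimal sketch (not part of the original exercise), using .pred_yes since “yes” is the first factor level and therefore the event of interest:
# area under the ROC curve from the estimated probability of "yes"
roc_auc(telecom_results, truth = canceled_service, .pred_yes)
# ROC curve data, which can be plotted with autoplot()
roc_curve(telecom_results, truth = canceled_service, .pred_yes) %>%
  autoplot()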