A telecommunications company that offers voice, messaging, and data services has noticed a high rate of churn among its high-revenue customers in recent years. A customer retention campaign was developed and the marketing managers need to identify the customers who have a high probability of churning. These targeted customers will then be placed in the customer retention program. Your task is to build a model to predict customers with high propensity to churn.
In this exercise, telecom.sas7bdat in the data sub-folder will be used. The telecom data set is in SAS data format. It contains customer relationship management data for 13,196 customers. There are 46 columns in the data set.
For this exercise, you are required to have the following R packages installed and loaded: funModeling, skimr, DataExplorer, haven, tidymodels, and tidyverse.
The code chunk below checks whether the necessary R packages have been installed. If any of them are missing, it installs them. Finally, it loads all the necessary packages into the R session.
packages <- c('funModeling', 'skimr', 'DataExplorer', 'haven', 'tidymodels', 'tidyverse')
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The code chunk below shows how a SAS sas7bdat file (i.e. telecom.sas7bdat) can be imported using the read_sas() function of the haven package.
telecom <- read_sas("data/telecom.sas7bdat")
Before any analysis can be performed, it is important for us to explore the data. The purposes of the initial exploration are:
- to understand the structure of the data, and
- to identify data quality issues such as messy data, dirty data, missing data, etc.
Since telecom is a tibble data frame, it is advisable to use glimpse() of the dplyr package to display the data.
glimpse(telecom)
## Rows: 13,196
## Columns: 46
## $ GENDER_CD <chr> "M", "M", "M", "M", "F", "M", "M", "M"...
## $ EDUCATION_CD <chr> " 2", " 1", " 2", " 1", " 4", " 2", " ...
## $ SUBS_TENURE <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ TOT_IB_CALL_DUR <dbl> 54.167, 13.333, 12.333, 49.667, 16.333...
## $ TOT_IB_CALL_CNT <dbl> 192, 51, 94, 171, 536, 656, 283, 148, ...
## $ AVG_OB_CALL_CNT <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD <chr> "000", "000", "002", "000", "002", "00...
## $ TOT_OB_CALL_CNT <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1 <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2 <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3 <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4 <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5 <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6 <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ CUST_SUBS_ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
## $ TOT_REV_AMT <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ TOT_PROF_AMT <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ CUST_ID <dbl> 5198, 752, 3501, 5406, 6115, 5478, 514...
## $ name <chr> "Amy H Thomas", "Ignatius T Lyod", "Le...
## $ CHURN_FLG <chr> "0", "1", "1", "0", "1", "0", "0", "0"...
After reviewing the report above, it is clear that fields name, CUST_ID, and CUST_SUBS_ID are not required for the subsequent analysis. In view of this, we will exclude these three fields from telecom data frame by using the code chunk below.
telecom <- telecom %>%
  select(-name, -CUST_SUBS_ID, -CUST_ID)
By default, all categorical values are stored as character data type. For EDA and modelling purposes, they need to be converted into factor data type.
telecom <- telecom %>%
  mutate_if(is.character, factor)
It is always a good practice to check the data structure after wrangling the data.frame.
glimpse(telecom)
## Rows: 13,196
## Columns: 43
## $ GENDER_CD <fct> M, M, M, M, F, M, M, M, M, M, M, M, M,...
## $ EDUCATION_CD <fct> 2, 1, 2, 1, 4, 2, 3, 2, ., 2...
## $ SUBS_TENURE <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ TOT_IB_CALL_DUR <dbl> 54.167, 13.333, 12.333, 49.667, 16.333...
## $ TOT_IB_CALL_CNT <dbl> 192, 51, 94, 171, 536, 656, 283, 148, ...
## $ AVG_OB_CALL_CNT <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD <fct> 000, 000, 002, 000, 002, 003, 002, 002...
## $ TOT_OB_CALL_CNT <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1 <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2 <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3 <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4 <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5 <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6 <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ TOT_REV_AMT <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ TOT_PROF_AMT <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ CHURN_FLG <fct> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
Notice that the telecom data frame now has only 43 variables and that all the variables of character data type have been converted to factor data type.
The commonly used skim() of the skimr package generates a report with spark charts. The code chunk below uses skim_without_charts() to generate summary statistics of the input variables without spark charts.
skim_without_charts(telecom)
| Name | telecom |
| Number of rows | 13196 |
| Number of columns | 43 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 39 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| GENDER_CD | 0 | 1 | FALSE | 3 | M: 8534, F: 4566, emp: 96 |
| EDUCATION_CD | 0 | 1 | FALSE | 8 | 2: 4870, 1: 4817, .: 2901, 3: 330 |
| BARRING_REASON_CD | 0 | 1 | FALSE | 4 | 000: 3403, 002: 3358, 003: 3223, 001: 3212 |
| CHURN_FLG | 0 | 1 | FALSE | 2 | 0: 12105, 1: 1091 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| SUBS_TENURE | 0 | 1.00 | 119.85 | 27.14 | 24.00 | 102.00 | 120.00 | 138.00 | 240.00 |
| TOT_IB_CALL_DUR | 0 | 1.00 | 44.85 | 13.23 | 3.33 | 39.50 | 46.75 | 53.50 | 78.67 |
| TOT_IB_CALL_CNT | 0 | 1.00 | 904.83 | 12273.97 | 8.00 | 119.75 | 180.00 | 366.00 | 1136214.00 |
| AVG_OB_CALL_CNT | 0 | 1.00 | 3986.69 | 80951.19 | 190.00 | 407.00 | 607.00 | 1209.25 | 5988140.00 |
| TOT_OB_CALL_NAT_ROAM_CNT | 0 | 1.00 | 130.73 | 2099.49 | 11.00 | 21.00 | 32.00 | 62.00 | 225286.00 |
| TOT_OB_CALL_INTL_CNT | 0 | 1.00 | 189.39 | 4265.17 | 11.00 | 21.00 | 31.00 | 62.00 | 410842.00 |
| TOT_OB_CALL_LOC_CNT | 0 | 1.00 | 1833.97 | 22732.02 | 121.00 | 245.00 | 370.00 | 725.25 | 1474931.00 |
| TOT_OB_CALL_NAT_CNT | 0 | 1.00 | 625.89 | 5794.71 | 49.00 | 103.00 | 154.00 | 308.00 | 328972.00 |
| TOT_OB_CALL_INTL_ROAM_CNT | 0 | 1.00 | 2.19 | 27.47 | 0.00 | 0.19 | 0.44 | 0.90 | 2158.12 |
| TOT_DAY_LAST_COMPLAINT_CNT | 0 | 1.00 | 11.78 | 6.99 | 0.00 | 6.00 | 12.00 | 17.00 | 32.00 |
| TOT_DAY_LAST_OB_BARRED_CNT | 0 | 1.00 | 12.67 | 7.36 | 0.00 | 6.00 | 13.00 | 19.00 | 31.00 |
| TOT_DAY_LAST_SUSPENDED_CNT | 0 | 1.00 | 13.36 | 7.72 | 0.00 | 7.00 | 13.00 | 20.00 | 34.00 |
| TOT_EMAIL_QUERY_CNT | 0 | 1.00 | 14.07 | 8.04 | 0.00 | 7.00 | 14.00 | 21.00 | 32.00 |
| MTH_TO_SUBS_END_CNT | 0 | 1.00 | 2.51 | 0.96 | 1.00 | 2.00 | 3.00 | 3.00 | 4.00 |
| TOT_SRV_DROPPED_CNT | 0 | 1.00 | 1.04 | 0.76 | 0.00 | 1.00 | 1.00 | 2.00 | 4.00 |
| TOT_SRV_ADDED_CNT | 0 | 1.00 | 1.41 | 0.98 | 0.00 | 1.00 | 1.00 | 2.00 | 3.00 |
| TOT_OUTSTAND_60_90_DAY_AMT | 0 | 1.00 | 61.94 | 180.05 | 0.00 | 0.00 | 0.00 | 0.00 | 700.00 |
| TOT_REV_FIX_AMT | 0 | 1.00 | 118.99 | 253.31 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 |
| TOT_REV_GPRS_AMT | 0 | 1.00 | 120.28 | 233.92 | 0.00 | 0.00 | 0.00 | 0.00 | 750.00 |
| TOT_REV_INET_AMT | 0 | 1.00 | 166.25 | 281.42 | 0.00 | 0.00 | 0.00 | 471.00 | 800.00 |
| TOT_COMPLAINT_1_MTH_CNT | 0 | 1.00 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| TOT_MTH_LAST_SUSPENDED_CNT | 0 | 1.00 | 0.02 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| LAST_PRICE_PLAN_CHNG_DAY_CNT | 0 | 1.00 | 0.02 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| MTH_SINCE_DATA_ACTVN | 0 | 1.00 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| MTH_SINCE_VM_ACTVN | 0 | 1.00 | 0.08 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| TOT_OB_CALL_CNT | 0 | 1.00 | 23920.12 | 485707.12 | 1140.00 | 2442.00 | 3642.00 | 7255.50 | 35928840.00 |
| TOT_ACTV_SRV_CNT | 0 | 1.00 | 2.44 | 1.64 | 0.00 | 1.00 | 2.00 | 4.00 | 7.00 |
| REV_AMT_BASE_1 | 0 | 1.00 | 694.11 | 284.69 | 200.00 | 457.00 | 681.00 | 938.00 | 1200.00 |
| REV_AMT_BASE_2 | 0 | 1.00 | 787.97 | 340.74 | 200.00 | 492.00 | 772.50 | 1081.00 | 1400.00 |
| REV_AMT_BASE_3 | 0 | 1.00 | 732.92 | 306.31 | 200.00 | 491.00 | 690.00 | 996.00 | 1300.00 |
| REV_AMT_BASE_4 | 0 | 1.00 | 648.61 | 236.75 | 200.00 | 473.00 | 647.00 | 825.00 | 1100.00 |
| REV_AMT_BASE_5 | 0 | 1.00 | 648.36 | 254.81 | 200.00 | 440.00 | 625.00 | 852.00 | 1150.00 |
| REV_AMT_BASE_6 | 0 | 1.00 | 697.86 | 268.67 | 200.00 | 501.00 | 672.00 | 893.00 | 1250.00 |
| CUST_AGE | 0 | 1.00 | 40.72 | 12.16 | 20.00 | 30.00 | 40.00 | 51.00 | 62.00 |
| PCT_CHNG_IB_SMS_CNT | 0 | 1.00 | 1.25 | 0.55 | 0.00 | 0.86 | 1.17 | 1.55 | 6.17 |
| PCT_CHNG_SUSPENDED_CNT | 0 | 1.00 | 181.50 | 1042.50 | 0.00 | 0.00 | 0.00 | 0.00 | 10000.00 |
| PCT_CHNG_BILL_AMT | 0 | 1.00 | 1.13 | 0.43 | 0.25 | 0.82 | 1.09 | 1.37 | 5.07 |
| TOT_REV_AMT | 9026 | 0.32 | 22.21 | 29.72 | -3.00 | 4.00 | 10.00 | 17.00 | 100.00 |
| TOT_PROF_AMT | 9026 | 0.32 | 22.21 | 29.72 | -3.00 | 4.00 | 10.00 | 17.00 | 100.00 |
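The complete_rate column in the report above is simply the share of non-missing values. As a quick check, the sketch below reproduces the 0.32 reported for TOT_REV_AMT from the row and missing counts in the table, then repeats the calculation on a small made-up data frame (the `toy` object and its values are illustrative only and are not part of the telecom data).

```r
# Counts taken from the skim output above
n_rows <- 13196
n_missing <- 9026
complete_rate <- 1 - n_missing / n_rows
round(complete_rate, 2)

# The same idea on a made-up data frame (illustrative values only)
toy <- data.frame(a = c(1, NA, 3, NA), b = c(1, 2, 3, 4))
toy_rates <- colMeans(!is.na(toy))
toy_rates
```

With roughly two thirds of their values missing, TOT_REV_AMT and TOT_PROF_AMT are natural candidates for exclusion from the modelling data.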
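Instead of using ggplot2 to display the distribution of the continuous variables, plot_num() of the funModeling package can be used.
plot_num(telecom, bins = 12)
Similarly, freq() of the funModeling package reports the frequency distribution of the categorical variables.
freq(telecom)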
## GENDER_CD frequency percentage cumulative_perc
## 1 M 8534 64.67 64.67
## 2 F 4566 34.60 99.27
## 3 96 0.73 100.00
## EDUCATION_CD frequency percentage cumulative_perc
## 1 2 4870 36.91 36.91
## 2 1 4817 36.50 73.41
## 3 . 2901 21.98 95.39
## 4 3 330 2.50 97.89
## 5 4 143 1.08 98.97
## 6 6 73 0.55 99.52
## 7 5 60 0.45 99.97
## 8 7 2 0.02 100.00
## BARRING_REASON_CD frequency percentage cumulative_perc
## 1 000 3403 25.79 25.79
## 2 002 3358 25.45 51.24
## 3 003 3223 24.42 75.66
## 4 001 3212 24.34 100.00
## CHURN_FLG frequency percentage cumulative_perc
## 1 0 12105 91.73 91.73
## 2 1 1091 8.27 100.00
## [1] "Variables processed: GENDER_CD, EDUCATION_CD, BARRING_REASON_CD, CHURN_FLG"
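The frequency tables above show a blank GENDER_CD level (96 records) and a "." EDUCATION_CD level (2,901 records), both of which look like placeholders for missing values. The exercise keeps them as ordinary factor levels. If you preferred to make them explicit, one possible approach is sketched below on a made-up vector; the "Unknown" label and the padded input values are assumptions for illustration.

```r
# Made-up vector standing in for EDUCATION_CD (illustrative values only)
edu <- factor(c(" 2", " 1", " .", " 4", " 2", ""))

# Trim padding, then map the placeholder codes to an explicit level.
# "Unknown" is a hypothetical label chosen for this sketch.
edu_chr <- trimws(as.character(edu))
edu_chr[edu_chr %in% c(".", "")] <- "Unknown"
edu_clean <- factor(edu_chr)

table(edu_clean)
```

Keeping a separate "Unknown" level, rather than converting to NA, means no rows are dropped by models that cannot handle missing values.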
One of the tasks during the data exploration stage is to investigate the relationship between the target variable and the input variables. In this section, we are going to explore how this task can be performed by using appropriate functions of the ggplot2 and funModeling packages.
plotar(telecom, target = 'CHURN_FLG', plot_type = "histdens")
telecom %>%
  select(c(1, 2, 28, 43)) %>%
  cross_plot(target = 'CHURN_FLG',
             plot_type = "both")
After examining the input variables carefully, the following input variables will be excluded from the cleaned data set for modelling purposes.
telecom_cleaned <- telecom %>%
  select(-TOT_REV_AMT, -TOT_PROF_AMT, -TOT_IB_CALL_DUR, -TOT_IB_CALL_CNT)
glimpse(telecom_cleaned)
## Rows: 13,196
## Columns: 39
## $ GENDER_CD <fct> M, M, M, M, F, M, M, M, M, M, M, M, M,...
## $ EDUCATION_CD <fct> 2, 1, 2, 1, 4, 2, 3, 2, ., 2...
## $ SUBS_TENURE <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ AVG_OB_CALL_CNT <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD <fct> 000, 000, 002, 000, 002, 003, 002, 002...
## $ TOT_OB_CALL_CNT <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1 <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2 <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3 <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4 <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5 <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6 <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ CHURN_FLG <fct> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
First of all, we want to extract a data set for testing the predictions in the end. We’ll keep 60% of the data for training and 40% of the data for testing.
set.seed(1234)
telecom_split <- initial_split(telecom_cleaned,
                               prop = .6,
                               strata = CHURN_FLG)
telecom_train <- training(telecom_split)
telecom_test <- testing(telecom_split)
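The strata argument asks for the split to be made within each class of CHURN_FLG, so that the training and test sets keep roughly the same churn rate. The base R sketch below illustrates the idea on a made-up outcome vector; it is an illustration of stratified sampling in general, not a reproduction of how rsample implements it.

```r
set.seed(1234)

# Made-up outcome with 8% positives, roughly mirroring the churn rate
y <- factor(rep(c("0", "1"), times = c(920, 80)))

# Draw 60% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.6 * length(idx)))
}))

# Class shares are preserved in both partitions
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```

Without stratification, a rare class such as churners could by chance end up under-represented in one of the two partitions.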
Furthermore, the training data set will be prepared for 3-fold cross-validation (using three here to speed things up). All this is accomplished using the rsample package:
vfold_data <- vfold_cv(telecom_train, v = 3,
                       repeats = 1,
                       strata = CHURN_FLG)
vfold_data %>%
  mutate(df_ana = map(splits, analysis),
         df_ass = map(splits, assessment))
## # 3-fold cross-validation using stratification
## # A tibble: 3 x 4
## splits id df_ana df_ass
## <list> <chr> <list> <list>
## 1 <split [5.3K/2.6K]> Fold1 <tibble [5,278 x 39]> <tibble [2,640 x 39]>
## 2 <split [5.3K/2.6K]> Fold2 <tibble [5,279 x 39]> <tibble [2,639 x 39]>
## 3 <split [5.3K/2.6K]> Fold3 <tibble [5,279 x 39]> <tibble [2,639 x 39]>
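Cross-validation folds like these are typically passed to fit_resamples() of the tune package (loaded with tidymodels) to estimate performance without touching the test set. The sketch below shows the pattern on simulated data; the `toy` data set, its column names, and the choice of a logistic regression are assumptions made only to keep the example self-contained.

```r
library(tidymodels)

set.seed(1234)

# Simulated two-class data (illustrative only, not the telecom data)
toy <- tibble(
  x1 = rnorm(300),
  x2 = rnorm(300),
  y  = factor(ifelse(x1 + rnorm(300) > 0, "1", "0"))
)

# Three stratified folds, as in the exercise
folds <- vfold_cv(toy, v = 3, strata = y)

# A minimal workflow: formula preprocessor plus a logistic regression
toy_wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Fit on each analysis set and evaluate on the matching assessment set
cv_results <- fit_resamples(toy_wf, resamples = folds)
collect_metrics(cv_results)
```

The same call pattern should apply to a workflow built on the telecom data, with vfold_data supplied as the resamples argument.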
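Two data preprocessing steps will be performed: the nominal predictors are converted into dummy variables with step_dummy(), and all numeric variables are centered and scaled with step_normalize().
telecom_recipe <- recipe(CHURN_FLG ~ ., data = telecom_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric())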
telecom_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 38
##
## Operations:
##
## Dummy variables from all_nominal(), -all_outcomes()
## Centering and scaling for all_numeric()
If you want to extract the pre-processed dataset itself, you can first prep() the recipe for a specific dataset and juice() the prepped recipe to extract the pre-processed data. It turns out that extracting the pre-processed data isn’t actually necessary for the pipeline, since this will be done under the hood when the model is fit, but sometimes it’s useful anyway.
telecom_train_preprocessed <- telecom_recipe %>%
  prep(telecom_train) %>%
  juice()
glimpse(telecom_train_preprocessed)
## Rows: 7,918
## Columns: 48
## $ SUBS_TENURE <dbl> 2.8659494236, -0.2197381918, -0.219738...
## $ AVG_OB_CALL_CNT <dbl> -0.039110543, -0.045349571, -0.0379594...
## $ TOT_OB_CALL_NAT_ROAM_CNT <dbl> 0.235027169, -0.042508145, -0.03550164...
## $ TOT_OB_CALL_INTL_CNT <dbl> -0.04654437, 1.91258878, -0.03626113, ...
## $ TOT_OB_CALL_LOC_CNT <dbl> -0.063734543, -0.066713847, -0.0667138...
## $ TOT_OB_CALL_NAT_CNT <dbl> -0.08208404, -0.06639903, -0.09001273,...
## $ TOT_OB_CALL_INTL_ROAM_CNT <dbl> -0.0588093830, -0.1040129999, -0.09064...
## $ TOT_DAY_LAST_COMPLAINT_CNT <dbl> 0.4634149, -0.5349229, 1.0338936, 0.32...
## $ TOT_DAY_LAST_OB_BARRED_CNT <dbl> -0.23106703, -1.04439834, -1.45106399,...
## $ TOT_DAY_LAST_SUSPENDED_CNT <dbl> -1.34529021, 1.24021135, 1.49876150, -...
## $ TOT_EMAIL_QUERY_CNT <dbl> 0.366171725, 0.241172901, 0.616169373,...
## $ MTH_TO_SUBS_END_CNT <dbl> 1.5636067, -0.5192732, 0.5221668, 1.56...
## $ TOT_SRV_DROPPED_CNT <dbl> 1.24889147, 2.55882955, 3.86876764, -1...
## $ TOT_SRV_ADDED_CNT <dbl> -0.4102997, -1.4268057, -1.4268057, -0...
## $ TOT_OUTSTAND_60_90_DAY_AMT <dbl> -0.3401562, -0.3401562, -0.3401562, -0...
## $ TOT_REV_FIX_AMT <dbl> -0.4746233, -0.4746233, -0.4746233, -0...
## $ TOT_REV_GPRS_AMT <dbl> -0.517029, 2.606372, 1.942274, -0.5170...
## $ TOT_REV_INET_AMT <dbl> -0.5847742, -0.5847742, -0.5847742, -0...
## $ TOT_COMPLAINT_1_MTH_CNT <dbl> -0.141294, -0.141294, 7.076549, -0.141...
## $ TOT_MTH_LAST_SUSPENDED_CNT <dbl> -0.1626102, -0.1626102, -0.1626102, -0...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> -0.1567891, -0.1567891, -0.1567891, -0...
## $ MTH_SINCE_DATA_ACTVN <dbl> -0.1767886, -0.1767886, -0.1767886, -0...
## $ MTH_SINCE_VM_ACTVN <dbl> -0.2798142, -0.2798142, -0.2798142, -0...
## $ TOT_OB_CALL_CNT <dbl> -0.039110543, -0.045349571, -0.0379594...
## $ TOT_ACTV_SRV_CNT <dbl> -1.4866598, -1.4866598, -1.4866598, 0....
## $ REV_AMT_BASE_1 <dbl> 0.6135341, -0.4429222, -0.2245879, -0....
## $ REV_AMT_BASE_2 <dbl> -0.7073374, 0.6890768, -0.7102710, -0....
## $ REV_AMT_BASE_3 <dbl> 0.79590683, 0.69106652, 0.40603193, 0....
## $ REV_AMT_BASE_4 <dbl> -1.08905379, -0.20247405, 0.72210198, ...
## $ REV_AMT_BASE_5 <dbl> -1.12917554, 0.33085276, -0.27913211, ...
## $ REV_AMT_BASE_6 <dbl> 2.03004072, 1.59051529, 1.21377920, -0...
## $ CUST_AGE <dbl> -0.7278067, -1.3836548, -1.3836548, 0....
## $ PCT_CHNG_IB_SMS_CNT <dbl> -0.158186621, 0.041398791, -0.61806159...
## $ PCT_CHNG_SUSPENDED_CNT <dbl> -0.1725298, -0.1725298, 4.8130260, -0....
## $ PCT_CHNG_BILL_AMT <dbl> -0.41913970, -0.45860554, -0.95460632,...
## $ CHURN_FLG <fct> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ GENDER_CD_F <dbl> -0.731453, -0.731453, -0.731453, -0.73...
## $ GENDER_CD_M <dbl> 0.7426753, 0.7426753, 0.7426753, 0.742...
## $ EDUCATION_CD_X.1 <dbl> -0.7665859, 1.3043206, -0.7665859, 1.3...
## $ EDUCATION_CD_X.2 <dbl> 1.3135553, -0.7611965, 1.3135553, -0.7...
## $ EDUCATION_CD_X.3 <dbl> -0.1576325, -0.1576325, -0.1576325, -0...
## $ EDUCATION_CD_X.4 <dbl> -0.100382, -0.100382, -0.100382, -0.10...
## $ EDUCATION_CD_X.5 <dbl> -0.06943865, -0.06943865, -0.06943865,...
## $ EDUCATION_CD_X.6 <dbl> -0.07388929, -0.07388929, -0.07388929,...
## $ EDUCATION_CD_X.7 <dbl> -0.01123808, -0.01123808, -0.01123808,...
## $ BARRING_REASON_CD_X001 <dbl> -0.5643825, -0.5643825, -0.5643825, -0...
## $ BARRING_REASON_CD_X002 <dbl> -0.5828552, -0.5828552, 1.7154753, -0....
## $ BARRING_REASON_CD_X003 <dbl> -0.5764389, -0.5764389, -0.5764389, -0...
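The test set can be pre-processed in the same way with bake(). Note that bake() requires a recipe that has already been trained with prep().
test_proc <- bake(prep(telecom_recipe, training = telecom_train), new_data = telecom_test)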
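To make the two preprocessing steps concrete, the base R sketch below mimics them on small made-up vectors: normalization subtracts the mean and divides by the standard deviation, and dummy encoding of a factor with k levels yields k - 1 indicator columns. It uses scale() and model.matrix() as stand-ins and is not how recipes implements the steps internally.

```r
# Normalization: (x - mean) / sd, checked against base R's scale()
x <- c(198, 114, 114, 228, 168)
z <- (x - mean(x)) / sd(x)
all.equal(z, as.numeric(scale(x)))

# Dummy encoding: a factor with 4 levels gives 3 indicator columns
f <- factor(c("000", "002", "003", "001", "002"))
dummies <- model.matrix(~ f)[, -1]  # drop the intercept column
ncol(dummies)
```

This also accounts for the 48 columns of the pre-processed training data: 35 numeric predictors, the outcome, and 2 + 7 + 3 = 12 dummy variables from GENDER_CD (3 levels), EDUCATION_CD (8 levels), and BARRING_REASON_CD (4 levels). Because step_normalize() is applied after step_dummy(), the dummy variables are centered and scaled as well, which is why they no longer take the values 0 and 1.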
We will use the parsnip package to specify different classification models.
The code chunk below specifies a decision tree model that uses the rpart engine.
dt_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
The code chunk below specifies a logistic regression model that uses glm() of the stats package.
lr_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
The code chunk below specifies a random forest model, as implemented by the ranger package, for the purpose of classification.
rf_model <- rand_forest() %>%
  set_engine("ranger",
             importance = "impurity") %>%
  set_mode("classification")
If you want to be able to examine the variable importance of your final model later, you will need to set the importance argument when setting the engine. For ranger, the importance options include “impurity” and “permutation”.
Note that this code doesn’t actually fit the model. Like the recipe, it just outlines a description of the model.
Another thing to note is that nothing about this model specification is specific to the telecom dataset.
We’re now ready to put the model and recipes together into a workflow. You initiate a workflow using workflow() from the workflows package and then you can add a recipe and add a model to it.
dt_workflow <- workflow() %>%
  add_recipe(telecom_recipe) %>%
  add_model(dt_model)
lr_workflow <- workflow() %>%
  add_recipe(telecom_recipe) %>%
  add_model(lr_model)
rf_workflow <- workflow() %>%
  add_recipe(telecom_recipe) %>%
  add_model(rf_model)
Now that we have defined our recipe and our models, we are ready to actually fit the final models. Since all of this information is contained within the workflow objects, we will apply the last_fit() function to each workflow and our train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the test set.
dt_fit <- dt_workflow %>%
  last_fit(telecom_split)
dt_fit
## # Resampling results
## # Monte Carlo cross-validation (0.6/0.4) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [7.9K~ train/test~ <tibble [2 x~ <tibble [0~ <tibble [5,278 ~ <workflo~
lr_fit <- lr_workflow %>%
  last_fit(telecom_split)
lr_fit
## # Resampling results
## # Monte Carlo cross-validation (0.6/0.4) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [7.9K~ train/test~ <tibble [2 x~ <tibble [2~ <tibble [5,278 ~ <workflo~
rf_fit <- rf_workflow %>%
  last_fit(telecom_split)
rf_fit
## # Resampling results
## # Monte Carlo cross-validation (0.6/0.4) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [7.9K~ train/test~ <tibble [2 x~ <tibble [0~ <tibble [5,278 ~ <workflo~
Note that the fit object that is created is a data-frame-like object; specifically, it is a tibble with list columns.
Since we supplied the train/test object when we fit the workflow, the metrics are evaluated on the test set. When we use the collect_metrics() function, it extracts the performance of the final model (since rf_fit consists of a single final model) applied to the test set.
rf_performance <- rf_fit %>%
  collect_metrics()
rf_performance
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.976
## 2 roc_auc binary 0.987
Overall the performance is very good, with an accuracy of 0.976 and an AUC of 0.987. Keep in mind, however, that about 92% of the customers are non-churners, so a high accuracy is expected even from a naive model.
You can also extract the test set predictions themselves using the collect_predictions() function. Note that there are 5,278 rows in the predictions object below, which matches the number of test set observations (just to give you some evidence that these are based on the test set rather than the training set).
test_predictions <- rf_fit %>%
  collect_predictions()
test_predictions
## # A tibble: 5,278 x 6
## id .pred_0 .pred_1 .row .pred_class CHURN_FLG
## <chr> <dbl> <dbl> <int> <fct> <fct>
## 1 train/test split 0.343 0.657 5 1 1
## 2 train/test split 0.934 0.0655 9 0 0
## 3 train/test split 0.922 0.0776 14 0 0
## 4 train/test split 0.966 0.0340 15 0 0
## 5 train/test split 0.291 0.709 16 1 1
## 6 train/test split 0.859 0.141 17 0 1
## 7 train/test split 0.984 0.0163 18 0 0
## 8 train/test split 0.994 0.00593 26 0 0
## 9 train/test split 0.978 0.0222 27 0 0
## 10 train/test split 0.988 0.0121 30 0 0
## # ... with 5,268 more rows
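Because the goal is to target customers with a high probability of churning, the .pred_1 column is often more useful than the hard class label. The sketch below copies the ten rows printed above into a small data frame, ranks them by churn probability, and flags customers above a cutoff. The 0.10 cutoff is a hypothetical value chosen for illustration, not one recommended by the exercise.

```r
# The ten rows printed above, re-entered by hand
preds <- data.frame(
  .row      = c(5, 9, 14, 15, 16, 17, 18, 26, 27, 30),
  .pred_1   = c(0.657, 0.0655, 0.0776, 0.0340, 0.709,
                0.141, 0.0163, 0.00593, 0.0222, 0.0121),
  CHURN_FLG = factor(c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0))
)

# Rank customers from highest to lowest churn probability
ranked <- preds[order(preds$.pred_1, decreasing = TRUE), ]
head(ranked, 3)

# Flag customers above a hypothetical cutoff of 0.10
cutoff <- 0.10
flagged <- preds$.row[preds$.pred_1 >= cutoff]
flagged
```

In these ten rows, the customer in row 17 is an actual churner who is missed at the default cutoff of 0.5 but is picked up at the lower cutoff, which illustrates the trade-off a retention campaign has to weigh.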
Since test_predictions is just a normal data frame/tibble object, we can generate summaries and plots such as a confusion matrix.
test_predictions %>%
  conf_mat(truth = CHURN_FLG,
           estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 4823 124
## 1 4 327
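The accuracy reported earlier can be recomputed from this confusion matrix, together with class-specific measures that are more informative for an imbalanced target. The base R sketch below re-enters the four counts by hand; the variable names are illustrative.

```r
# Confusion matrix counts copied from the output above
# (rows = prediction, columns = truth)
cm <- matrix(c(4823, 4, 124, 327), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Truth = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)          # correct / all
recall    <- cm["1", "1"] / sum(cm[, "1"])    # churners correctly flagged
precision <- cm["1", "1"] / sum(cm["1", ])    # flagged who truly churned
baseline  <- sum(cm[, "0"]) / sum(cm)         # always predicting "0"

round(c(accuracy = accuracy, recall = recall,
        precision = precision, baseline = baseline), 3)
```

The accuracy of about 0.976 compares with about 0.915 from always predicting "no churn", and the recall shows that roughly 72.5% of the actual churners in the test set are identified at the default cutoff.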
If you want to extract the variable importance scores from your model, you need to extract the model object from the fitted workflow returned by fit() (which for us is called rf_model). The function that extracts the model is pull_workflow_fit() (deprecated in favour of extract_fit_parsnip() in newer versions of the workflows package), and then you need to grab the fit object that the output contains. Note that the workflow is refitted here on the full telecom_cleaned data set.
rf_model <- fit(rf_workflow, telecom_cleaned)
ranger_obj <- pull_workflow_fit(rf_model)$fit
ranger_obj
## Ranger result
##
## Call:
## ranger::ranger(formula = ..y ~ ., data = data, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 13196
## Number of independent variables: 47
## Mtry: 6
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.01699434
Then you can extract the variable importance from the ranger object itself (variable.importance is a specific object contained within ranger output - this will need to be adapted for the specific object type of other models).
ranger_obj$variable.importance
## SUBS_TENURE AVG_OB_CALL_CNT
## 12.1924563 69.1914085
## TOT_OB_CALL_NAT_ROAM_CNT TOT_OB_CALL_INTL_CNT
## 54.0967837 56.0624779
## TOT_OB_CALL_LOC_CNT TOT_OB_CALL_NAT_CNT
## 61.2488865 61.9115713
## TOT_OB_CALL_INTL_ROAM_CNT TOT_DAY_LAST_COMPLAINT_CNT
## 116.8822255 11.5065368
## TOT_DAY_LAST_OB_BARRED_CNT TOT_DAY_LAST_SUSPENDED_CNT
## 11.7035548 11.8769972
## TOT_EMAIL_QUERY_CNT MTH_TO_SUBS_END_CNT
## 11.3489732 4.9931900
## TOT_SRV_DROPPED_CNT TOT_SRV_ADDED_CNT
## 124.7099953 106.9056742
## TOT_OUTSTAND_60_90_DAY_AMT TOT_REV_FIX_AMT
## 3.0451607 4.0638076
## TOT_REV_GPRS_AMT TOT_REV_INET_AMT
## 5.1337278 5.4804504
## TOT_COMPLAINT_1_MTH_CNT TOT_MTH_LAST_SUSPENDED_CNT
## 148.3829950 0.5766582
## LAST_PRICE_PLAN_CHNG_DAY_CNT MTH_SINCE_DATA_ACTVN
## 0.5150562 0.5954739
## MTH_SINCE_VM_ACTVN TOT_OB_CALL_CNT
## 0.9487959 60.1454416
## TOT_ACTV_SRV_CNT REV_AMT_BASE_1
## 26.6214676 27.7617097
## REV_AMT_BASE_2 REV_AMT_BASE_3
## 38.5876493 17.2232273
## REV_AMT_BASE_4 REV_AMT_BASE_5
## 16.8719073 17.8636111
## REV_AMT_BASE_6 CUST_AGE
## 17.5503397 325.7159672
## PCT_CHNG_IB_SMS_CNT PCT_CHNG_SUSPENDED_CNT
## 35.1781608 345.6217802
## PCT_CHNG_BILL_AMT GENDER_CD_F
## 20.6142868 1.6994824
## GENDER_CD_M EDUCATION_CD_X.1
## 1.7241296 1.6647733
## EDUCATION_CD_X.2 EDUCATION_CD_X.3
## 1.5073467 0.9941026
## EDUCATION_CD_X.4 EDUCATION_CD_X.5
## 1.2137233 0.5830629
## EDUCATION_CD_X.6 EDUCATION_CD_X.7
## 0.4457791 0.0000000
## BARRING_REASON_CD_X001 BARRING_REASON_CD_X002
## 1.8729967 1.6545299
## BARRING_REASON_CD_X003
## 1.8191001
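The importance vector is printed in column order, which makes it hard to see which variables matter most. Since it is a plain named numeric vector, it can be sorted with base R. The sketch below does this on a handful of the values copied from the output above; with the real object, `imp` would be replaced by ranger_obj$variable.importance.

```r
# A few importance scores copied from the output above
imp <- c(SUBS_TENURE             = 12.1924563,
         TOT_SRV_DROPPED_CNT     = 124.7099953,
         TOT_COMPLAINT_1_MTH_CNT = 148.3829950,
         CUST_AGE                = 325.7159672,
         PCT_CHNG_SUSPENDED_CNT  = 345.6217802)

# Sort from most to least important and show the top three
imp_sorted <- sort(imp, decreasing = TRUE)
head(imp_sorted, 3)
```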
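The same approach can be applied to the decision tree workflow, in which case the extracted object is an rpart model.
dt_model <- fit(dt_workflow, telecom_cleaned)
rpart_obj <- pull_workflow_fit(dt_model)$fit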
rpart_obj
## n= 13196
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13196 1091 0 (0.917323431 0.082676569)
## 2) PCT_CHNG_SUSPENDED_CNT< -0.1737818 12730 625 0 (0.950903378 0.049096622)
## 4) CUST_AGE>=-1.08674 10547 96 0 (0.990897886 0.009102114)
## 8) TOT_SRV_DROPPED_CNT< 1.911073 10532 81 0 (0.992309153 0.007690847) *
## 9) TOT_SRV_DROPPED_CNT>=1.911073 15 0 1 (0.000000000 1.000000000) *
## 5) CUST_AGE< -1.08674 2183 529 0 (0.757672927 0.242327073)
## 10) TOT_SRV_ADDED_CNT>=-0.9278886 1534 176 0 (0.885267275 0.114732725)
## 20) TOT_SRV_ADDED_CNT>=0.09137536 809 0 0 (1.000000000 0.000000000) *
## 21) TOT_SRV_ADDED_CNT< 0.09137536 725 176 0 (0.757241379 0.242758621)
## 42) TOT_OB_CALL_INTL_ROAM_CNT>=-0.0703783 442 48 0 (0.891402715 0.108597285)
## 84) CUST_AGE>=-1.580008 431 37 0 (0.914153132 0.085846868) *
## 85) CUST_AGE< -1.580008 11 0 1 (0.000000000 1.000000000) *
## 43) TOT_OB_CALL_INTL_ROAM_CNT< -0.0703783 283 128 0 (0.547703180 0.452296820)
## 86) REV_AMT_BASE_2>=-0.2992656 140 39 0 (0.721428571 0.278571429) *
## 87) REV_AMT_BASE_2< -0.2992656 143 54 1 (0.377622378 0.622377622)
## 174) REV_AMT_BASE_2< -1.10194 41 15 0 (0.634146341 0.365853659) *
## 175) REV_AMT_BASE_2>=-1.10194 102 28 1 (0.274509804 0.725490196) *
## 11) TOT_SRV_ADDED_CNT< -0.9278886 649 296 1 (0.456086287 0.543913713)
## 22) TOT_OB_CALL_INTL_ROAM_CNT>=-0.06891255 282 77 0 (0.726950355 0.273049645)
## 44) TOT_SRV_DROPPED_CNT< 1.911073 259 54 0 (0.791505792 0.208494208)
## 88) CUST_AGE>=-1.580008 243 38 0 (0.843621399 0.156378601) *
## 89) CUST_AGE< -1.580008 16 0 1 (0.000000000 1.000000000) *
## 45) TOT_SRV_DROPPED_CNT>=1.911073 23 0 1 (0.000000000 1.000000000) *
## 23) TOT_OB_CALL_INTL_ROAM_CNT< -0.06891255 367 91 1 (0.247956403 0.752043597) *
## 3) PCT_CHNG_SUSPENDED_CNT>=-0.1737818 466 0 1 (0.000000000 1.000000000) *
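Because the recipe normalizes the numeric variables, the split points in the tree are expressed in standard deviation units rather than in the original units. A rough back-transformation is sketched below, assuming the rounded means and standard deviations from the skim summary earlier are close to the ones learned by the recipe (the tree was fitted on the full telecom_cleaned data, so they should be, but the results are approximate). The `unscale()` helper is a hypothetical name introduced for this sketch.

```r
# Rounded mean and sd taken from the skim summary earlier in this exercise
age_mean  <- 40.72
age_sd    <- 12.16
susp_mean <- 181.50
susp_sd   <- 1042.50

# Reverse the normalization: original = mean + z * sd
unscale <- function(z, mean, sd) mean + z * sd

age_cut_1 <- unscale(-1.08674,  age_mean, age_sd)    # CUST_AGE split below node 2
age_cut_2 <- unscale(-1.580008, age_mean, age_sd)    # CUST_AGE split lower in the tree
susp_cut  <- unscale(-0.1737818, susp_mean, susp_sd) # root split

round(c(age_cut_1 = age_cut_1, age_cut_2 = age_cut_2, susp_cut = susp_cut), 2)
```

Under these assumptions the two CUST_AGE splits correspond to ages of roughly 27.5 and 21.5 years, and the root split on PCT_CHNG_SUSPENDED_CNT sits only slightly above zero, which suggests it essentially separates customers with a value of zero from everyone else.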
rpart_obj$variable.importance
## PCT_CHNG_SUSPENDED_CNT TOT_COMPLAINT_1_MTH_CNT
## 812.9705049 474.5235565
## CUST_AGE TOT_SRV_ADDED_CNT
## 247.4439015 213.0720731
## TOT_OB_CALL_INTL_ROAM_CNT TOT_SRV_DROPPED_CNT
## 120.0284803 97.1897475
## TOT_ACTV_SRV_CNT REV_AMT_BASE_2
## 65.6103875 26.8835893
## TOT_OB_CALL_NAT_CNT AVG_OB_CALL_CNT
## 24.7426441 23.4184395
## PCT_CHNG_IB_SMS_CNT TOT_OB_CALL_CNT
## 21.3784548 14.3579718
## PCT_CHNG_BILL_AMT TOT_OB_CALL_INTL_CNT
## 9.4252195 4.7794478
## TOT_OB_CALL_LOC_CNT REV_AMT_BASE_3
## 4.2530183 2.7474818
## TOT_EMAIL_QUERY_CNT TOT_OB_CALL_NAT_ROAM_CNT
## 2.6709249 2.3015417
## TOT_DAY_LAST_SUSPENDED_CNT REV_AMT_BASE_6
## 1.9112917 1.3354624
## REV_AMT_BASE_1 TOT_REV_FIX_AMT
## 0.5535323 0.3690216
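For the logistic regression workflow, the extracted object is a glm model, which reports coefficients rather than variable importance.
lr_model <- fit(lr_workflow, telecom_cleaned)
glm_obj <- pull_workflow_fit(lr_model)$fit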
glm_obj
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) SUBS_TENURE
## 3.011e+03 -2.360e-02
## AVG_OB_CALL_CNT TOT_OB_CALL_NAT_ROAM_CNT
## -1.495e+00 1.658e-02
## TOT_OB_CALL_INTL_CNT TOT_OB_CALL_LOC_CNT
## 6.529e-02 -8.921e-01
## TOT_OB_CALL_NAT_CNT TOT_OB_CALL_INTL_ROAM_CNT
## -6.471e-01 -2.299e+00
## TOT_DAY_LAST_COMPLAINT_CNT TOT_DAY_LAST_OB_BARRED_CNT
## 5.641e-02 -7.557e-02
## TOT_DAY_LAST_SUSPENDED_CNT TOT_EMAIL_QUERY_CNT
## 1.348e-02 -3.648e-02
## MTH_TO_SUBS_END_CNT TOT_SRV_DROPPED_CNT
## 1.576e-02 5.105e-01
## TOT_SRV_ADDED_CNT TOT_OUTSTAND_60_90_DAY_AMT
## -1.687e+00 -9.786e-02
## TOT_REV_FIX_AMT TOT_REV_GPRS_AMT
## 3.636e-02 4.443e-02
## TOT_REV_INET_AMT TOT_COMPLAINT_1_MTH_CNT
## -6.297e-02 1.521e+00
## TOT_MTH_LAST_SUSPENDED_CNT LAST_PRICE_PLAN_CHNG_DAY_CNT
## 4.716e-02 -2.640e-02
## MTH_SINCE_DATA_ACTVN MTH_SINCE_VM_ACTVN
## 7.099e-03 8.531e-02
## TOT_OB_CALL_CNT TOT_ACTV_SRV_CNT
## NA 1.319e-01
## REV_AMT_BASE_1 REV_AMT_BASE_2
## -1.243e-01 -2.510e-01
## REV_AMT_BASE_3 REV_AMT_BASE_4
## -4.961e-02 7.314e-03
## REV_AMT_BASE_5 REV_AMT_BASE_6
## -2.142e-02 -6.983e-02
## CUST_AGE PCT_CHNG_IB_SMS_CNT
## -2.613e+00 -2.842e-01
## PCT_CHNG_SUSPENDED_CNT PCT_CHNG_BILL_AMT
## 1.733e+04 1.183e-02
## GENDER_CD_F GENDER_CD_M
## 2.537e-01 3.041e-01
## EDUCATION_CD_X.1 EDUCATION_CD_X.2
## 1.000e-03 -3.178e-02
## EDUCATION_CD_X.3 EDUCATION_CD_X.4
## -6.586e-03 6.920e-02
## EDUCATION_CD_X.5 EDUCATION_CD_X.6
## 2.865e-02 -3.216e-02
## EDUCATION_CD_X.7 BARRING_REASON_CD_X001
## -2.144e-01 2.708e-02
## BARRING_REASON_CD_X002 BARRING_REASON_CD_X003
## -2.241e-02 6.427e-02
##
## Degrees of Freedom: 13195 Total (i.e. Null); 13149 Residual
## Null Deviance: 7529
## Residual Deviance: 2482 AIC: 2576
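The NA coefficient for TOT_OB_CALL_CNT deserves a note. In the rows shown by glimpse(), TOT_OB_CALL_CNT is exactly six times AVG_OB_CALL_CNT (for example 897 * 6 = 5382), and the two columns have identical normalized values, so the second one carries no additional information. The sketch below reproduces this behaviour of glm() on simulated data; the variable names and the simulated outcome are assumptions for illustration.

```r
set.seed(1234)

# Simulated predictors: tot_calls is an exact multiple of avg_calls
n <- 200
avg_calls <- rnorm(n)
tot_calls <- 6 * avg_calls

# Simulated binary outcome that depends on avg_calls
y <- rbinom(n, 1, plogis(0.5 * avg_calls))

# glm() cannot estimate both coefficients and reports NA for the second
fit <- glm(y ~ avg_calls + tot_calls, family = binomial)
coef(fit)
```

Dropping one of the two call-count columns before modelling would remove the redundancy. The very large coefficient for PCT_CHNG_SUSPENDED_CNT is a separate matter; it is likely a sign of separation, since the decision tree above isolates 466 customers who all churned using that variable alone.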