1 Setting the Scene

A telecommunications company that offers voice, messaging, and data services has noticed a high rate of churn among its high-revenue customers in recent years. A customer retention campaign was developed and the marketing managers need to identify the customers who have a high probability of churning. These targeted customers will then be placed in the customer retention program. Your task is to build a model to predict customers with high propensity to churn.

2 The Data

In this exercise, telecom.sas7bdat in the data sub-folder will be used. The telecom data set is in SAS data format. It contains customer relationship management data for 13,196 customers. There are 46 columns in the data set.

3 Getting Started

For this exercise, you are required to have the following R packages installed and loaded.

Data import: haven
Data wrangling and visualisation: tidyverse
Exploratory Data Analysis: funModeling, skimr and DataExplorer
Modelling: tidymodels

3.1 Installing and launching R packages

The code chunk below checks whether the necessary R packages have been installed. If any of them are missing, it installs them. Finally, it loads all the necessary R packages.

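packages <- c('funModeling', 'skimr', 'DataExplorer', 'haven', 'tidymodels', 'tidyverse')
for (p in packages) {
  # require() returns FALSE when the package cannot be loaded, so install it first
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  # attach the package whose name is held in the string p
  library(p, character.only = TRUE)
}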

3.2 Importing data

The code chunk below shows how a SAS data file in sas7bdat format (i.e. telecom.sas7bdat) can be imported using the read_sas() function of the haven package.

telecom <- read_sas("data/telecom.sas7bdat")

4 Getting to Know the Data Frame

Before any analysis can be performed, it is important for us to explore the data. The purposes of the initial exploration are:

to understand the structure of the data, and
to identify data quality issues such as messy data, dirty data, missing data, etc.

4.1 Listing the data structure of the data

Since telecom is a tibble data frame, it is advisable to use glimpse() of the dplyr package to display the data structure.

glimpse(telecom)

## Rows: 13,196
## Columns: 46
## $ GENDER_CD                    <chr> "M", "M", "M", "M", "F", "M", "M", "M"...
## $ EDUCATION_CD                 <chr> " 2", " 1", " 2", " 1", " 4", " 2", " ...
## $ SUBS_TENURE                  <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ TOT_IB_CALL_DUR              <dbl> 54.167, 13.333, 12.333, 49.667, 16.333...
## $ TOT_IB_CALL_CNT              <dbl> 192, 51, 94, 171, 536, 656, 283, 148, ...
## $ AVG_OB_CALL_CNT              <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT     <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT         <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT          <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT          <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT    <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT   <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT   <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT   <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT          <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT          <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT          <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT            <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT   <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT             <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT             <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD            <chr> "000", "000", "002", "000", "002", "00...
## $ TOT_OB_CALL_CNT              <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT             <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1               <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2               <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3               <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4               <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5               <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6               <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE                     <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT          <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT       <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT            <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ CUST_SUBS_ID                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
## $ TOT_REV_AMT                  <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ TOT_PROF_AMT                 <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ CUST_ID                      <dbl> 5198, 752, 3501, 5406, 6115, 5478, 514...
## $ name                         <chr> "Amy H Thomas", "Ignatius T Lyod", "Le...
## $ CHURN_FLG                    <chr> "0", "1", "1", "0", "1", "0", "0", "0"...

4.2 Excluding unwanted columns

After reviewing the report above, it is clear that the fields name, CUST_ID, and CUST_SUBS_ID are not required for the subsequent analysis. In view of this, we will exclude these three fields from the telecom data frame by using the code chunk below.

telecom <- telecom %>% 
  select(-name, -CUST_SUBS_ID, -CUST_ID)

4.3 Converting data type

By default, all categorical values are stored as character data type. For EDA and modelling purposes, they need to be converted into factor data type.

telecom <- telecom %>%
  # convert every character column to a factor
  mutate(across(where(is.character), factor))
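
As a rough illustration of what this conversion does, the sketch below applies the same idea in base R to a small made-up data frame. The toy object and its values are assumptions for illustration, not rows taken from telecom. Note that factor() keeps the raw strings as levels, including any leading spaces.

```r
# Toy data frame standing in for telecom (values are made up for illustration)
toy <- data.frame(
  GENDER_CD = c("M", "F", "M"),
  EDUCATION_CD = c(" 2", " 1", " ."),
  SUBS_TENURE = c(198, 114, 228),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Numeric columns are untouched; character columns are now factors
str(toy)
levels(toy$EDUCATION_CD)
```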

4.4 Reviewing the structure

It is always a good practice to check the data structure after wrangling the data.frame.

glimpse(telecom)

## Rows: 13,196
## Columns: 43
## $ GENDER_CD                    <fct> M, M, M, M, F, M, M, M, M, M, M, M, M,...
## $ EDUCATION_CD                 <fct>  2,  1,  2,  1,  4,  2,  3,  2,  .,  2...
## $ SUBS_TENURE                  <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ TOT_IB_CALL_DUR              <dbl> 54.167, 13.333, 12.333, 49.667, 16.333...
## $ TOT_IB_CALL_CNT              <dbl> 192, 51, 94, 171, 536, 656, 283, 148, ...
## $ AVG_OB_CALL_CNT              <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT     <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT         <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT          <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT          <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT    <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT   <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT   <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT   <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT          <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT          <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT          <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT            <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT   <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT             <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT             <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD            <fct> 000, 000, 002, 000, 002, 003, 002, 002...
## $ TOT_OB_CALL_CNT              <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT             <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1               <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2               <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3               <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4               <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5               <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6               <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE                     <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT          <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT       <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT            <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ TOT_REV_AMT                  <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ TOT_PROF_AMT                 <dbl> NA, 99, 47, NA, 97, NA, NA, NA, 10, 3,...
## $ CHURN_FLG                    <fct> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...

Notice that the telecom data frame now has only 43 variables, and all the variables of character data type have been converted to factor data type.

5 Exploratory Data Analysis

5.1 Generating summary statistics

The commonly used skim() of the skimr package generates a report with spark charts. The code chunk below uses skim_without_charts() to generate summary statistics of the input variables without spark charts.

skim_without_charts(telecom)

Data summary
Name	telecom
Number of rows	13196
Number of columns	43
_______________________
Column type frequency:
factor	4
numeric	39
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
GENDER_CD	1	FALSE	3	M: 8534, F: 4566, emp: 96
EDUCATION_CD	1	FALSE	8	2: 4870, 1: 4817, .: 2901, 3: 330
BARRING_REASON_CD	1	FALSE	4	000: 3403, 002: 3358, 003: 3223, 001: 3212
CHURN_FLG	1	FALSE	2	0: 12105, 1: 1091

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
SUBS_TENURE	0	1.00	119.85	27.14	24.00	102.00	120.00	138.00	240.00
TOT_IB_CALL_DUR	0	1.00	44.85	13.23	3.33	39.50	46.75	53.50	78.67
TOT_IB_CALL_CNT	0	1.00	904.83	12273.97	8.00	119.75	180.00	366.00	1136214.00
AVG_OB_CALL_CNT	0	1.00	3986.69	80951.19	190.00	407.00	607.00	1209.25	5988140.00
TOT_OB_CALL_NAT_ROAM_CNT	0	1.00	130.73	2099.49	11.00	21.00	32.00	62.00	225286.00
TOT_OB_CALL_INTL_CNT	0	1.00	189.39	4265.17	11.00	21.00	31.00	62.00	410842.00
TOT_OB_CALL_LOC_CNT	0	1.00	1833.97	22732.02	121.00	245.00	370.00	725.25	1474931.00
TOT_OB_CALL_NAT_CNT	0	1.00	625.89	5794.71	49.00	103.00	154.00	308.00	328972.00
TOT_OB_CALL_INTL_ROAM_CNT	0	1.00	2.19	27.47	0.00	0.19	0.44	0.90	2158.12
TOT_DAY_LAST_COMPLAINT_CNT	0	1.00	11.78	6.99	0.00	6.00	12.00	17.00	32.00
TOT_DAY_LAST_OB_BARRED_CNT	0	1.00	12.67	7.36	0.00	6.00	13.00	19.00	31.00
TOT_DAY_LAST_SUSPENDED_CNT	0	1.00	13.36	7.72	0.00	7.00	13.00	20.00	34.00
TOT_EMAIL_QUERY_CNT	0	1.00	14.07	8.04	0.00	7.00	14.00	21.00	32.00
MTH_TO_SUBS_END_CNT	0	1.00	2.51	0.96	1.00	2.00	3.00	3.00	4.00
TOT_SRV_DROPPED_CNT	0	1.00	1.04	0.76	0.00	1.00	1.00	2.00	4.00
TOT_SRV_ADDED_CNT	0	1.00	1.41	0.98	0.00	1.00	1.00	2.00	3.00
TOT_OUTSTAND_60_90_DAY_AMT	0	1.00	61.94	180.05	0.00	0.00	0.00	0.00	700.00
TOT_REV_FIX_AMT	0	1.00	118.99	253.31	0.00	0.00	0.00	0.00	800.00
TOT_REV_GPRS_AMT	0	1.00	120.28	233.92	0.00	0.00	0.00	0.00	750.00
TOT_REV_INET_AMT	0	1.00	166.25	281.42	0.00	0.00	0.00	471.00	800.00
TOT_COMPLAINT_1_MTH_CNT	0	1.00	0.02	0.14	0.00	0.00	0.00	0.00	1.00
TOT_MTH_LAST_SUSPENDED_CNT	0	1.00	0.02	0.15	0.00	0.00	0.00	0.00	1.00
LAST_PRICE_PLAN_CHNG_DAY_CNT	0	1.00	0.02	0.15	0.00	0.00	0.00	0.00	1.00
MTH_SINCE_DATA_ACTVN	0	1.00	0.03	0.17	0.00	0.00	0.00	0.00	1.00
MTH_SINCE_VM_ACTVN	0	1.00	0.08	0.26	0.00	0.00	0.00	0.00	1.00
TOT_OB_CALL_CNT	0	1.00	23920.12	485707.12	1140.00	2442.00	3642.00	7255.50	35928840.00
TOT_ACTV_SRV_CNT	0	1.00	2.44	1.64	0.00	1.00	2.00	4.00	7.00
REV_AMT_BASE_1	0	1.00	694.11	284.69	200.00	457.00	681.00	938.00	1200.00
REV_AMT_BASE_2	0	1.00	787.97	340.74	200.00	492.00	772.50	1081.00	1400.00
REV_AMT_BASE_3	0	1.00	732.92	306.31	200.00	491.00	690.00	996.00	1300.00
REV_AMT_BASE_4	0	1.00	648.61	236.75	200.00	473.00	647.00	825.00	1100.00
REV_AMT_BASE_5	0	1.00	648.36	254.81	200.00	440.00	625.00	852.00	1150.00
REV_AMT_BASE_6	0	1.00	697.86	268.67	200.00	501.00	672.00	893.00	1250.00
CUST_AGE	0	1.00	40.72	12.16	20.00	30.00	40.00	51.00	62.00
PCT_CHNG_IB_SMS_CNT	0	1.00	1.25	0.55	0.00	0.86	1.17	1.55	6.17
PCT_CHNG_SUSPENDED_CNT	0	1.00	181.50	1042.50	0.00	0.00	0.00	0.00	10000.00
PCT_CHNG_BILL_AMT	0	1.00	1.13	0.43	0.25	0.82	1.09	1.37	5.07
TOT_REV_AMT	9026	0.32	22.21	29.72	-3.00	4.00	10.00	17.00	100.00
TOT_PROF_AMT	9026	0.32	22.21	29.72	-3.00	4.00	10.00	17.00	100.00
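
The complete_rate column can be reproduced by hand. The sketch below is only an illustration: it recomputes the value for TOT_REV_AMT from the n_missing figure and row count reported above, and then shows a generic per-column version on a small made-up data frame (toy_df is hypothetical).

```r
# Figures taken from the skim output above
n_rows <- 13196
n_missing <- 9026

# Share of non-missing values for TOT_REV_AMT
complete_rate <- 1 - n_missing / n_rows
round(complete_rate, 2)

# Generic version: share of non-missing values in every column
# (toy_df is a made-up example, not the telecom data)
toy_df <- data.frame(a = c(1, NA, 3, NA), b = c(1, 2, 3, 4))
toy_complete <- 1 - colMeans(is.na(toy_df))
toy_complete
```

Roughly two thirds of the values in these two revenue and profit columns are missing, which is the kind of signal to look for when deciding whether a variable is usable.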

5.2 Univariate EDA

5.2.1 Profiling continuous variables

Instead of using ggplot2 to display the distribution of the continuous variables, plot_num() of the funModeling package can be used.

plot_num(telecom, bins = 12)

5.2.2 Profiling categorical variables

freq(telecom)

##   GENDER_CD frequency percentage cumulative_perc
## 1         M      8534      64.67           64.67
## 2         F      4566      34.60           99.27
## 3                  96       0.73          100.00

##   EDUCATION_CD frequency percentage cumulative_perc
## 1            2      4870      36.91           36.91
## 2            1      4817      36.50           73.41
## 3            .      2901      21.98           95.39
## 4            3       330       2.50           97.89
## 5            4       143       1.08           98.97
## 6            6        73       0.55           99.52
## 7            5        60       0.45           99.97
## 8            7         2       0.02          100.00

##   BARRING_REASON_CD frequency percentage cumulative_perc
## 1               000      3403      25.79           25.79
## 2               002      3358      25.45           51.24
## 3               003      3223      24.42           75.66
## 4               001      3212      24.34          100.00

##   CHURN_FLG frequency percentage cumulative_perc
## 1         0     12105      91.73           91.73
## 2         1      1091       8.27          100.00

## [1] "Variables processed: GENDER_CD, EDUCATION_CD, BARRING_REASON_CD, CHURN_FLG"
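
The CHURN_FLG table shows a strong class imbalance, which matters when reading accuracy figures later on. As a small sketch using only the frequencies reported above, the share of each class, and the accuracy of a naive rule that always predicts the majority class, can be worked out as follows. The baseline rule is an illustrative assumption and not part of the exercise itself.

```r
# Class frequencies of CHURN_FLG as reported by freq() above
churn_counts <- c("0" = 12105, "1" = 1091)

# Share of each class, in percent
churn_share <- churn_counts / sum(churn_counts)
round(100 * churn_share, 2)

# Accuracy of a naive rule that always predicts the majority class
baseline_accuracy <- max(churn_share)
round(baseline_accuracy, 4)
```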

5.3 Exploring the relationship between the target variable and the input variables

One of the tasks during the data exploration stage is to investigate the relationship between the target variable and the input variables. In this section, we are going to explore how this task can be performed by using appropriate functions of the ggplot2 and funModeling packages.

5.3.1 Continuous input variables

plotar(telecom, target = 'CHURN_FLG', plot_type = "histdens")

5.3.2 Categorical input variables

telecom %>%
  select(c(1,2,28,43)) %>%
  cross_plot(target= 'CHURN_FLG',
             plot_type = "both")

6 Finalising data set

After examining the input variables carefully, the following input variables will be excluded from the cleaned data set for modelling purposes:

TOT_REV_AMT and TOT_PROF_AMT due to excessive missing values, and
TOT_IB_CALL_DUR and TOT_IB_CALL_CNT due to potential data leakage.

telecom_cleaned <- telecom %>%
  select(-TOT_REV_AMT, -TOT_PROF_AMT, -TOT_IB_CALL_DUR, -TOT_IB_CALL_CNT)
glimpse(telecom_cleaned)

## Rows: 13,196
## Columns: 39
## $ GENDER_CD                    <fct> M, M, M, M, F, M, M, M, M, M, M, M, M,...
## $ EDUCATION_CD                 <fct>  2,  1,  2,  1,  4,  2,  3,  2,  .,  2...
## $ SUBS_TENURE                  <dbl> 198, 114, 114, 228, 168, 132, 120, 138...
## $ AVG_OB_CALL_CNT              <dbl> 897, 252, 1016, 1534, 730, 323, 463, 8...
## $ TOT_OB_CALL_NAT_ROAM_CNT     <dbl> 738, 25, 43, 39, 16, 144, 28, 49, 18, ...
## $ TOT_OB_CALL_INTL_CNT         <dbl> 43, 5568, 72, 23, 16, 27, 18, 779, 33,...
## $ TOT_OB_CALL_LOC_CNT          <dbl> 300, 221, 221, 302, 517, 540, 320, 192...
## $ TOT_OB_CALL_NAT_CNT          <dbl> 125, 216, 79, 108, 570, 85, 73, 446, 1...
## $ TOT_OB_CALL_INTL_ROAM_CNT    <dbl> 0.89995712, 0.03764006, 0.29266312, 0....
## $ TOT_DAY_LAST_COMPLAINT_CNT   <dbl> 15, 8, 19, 14, 4, 22, 4, 1, 19, 5, 6, ...
## $ TOT_DAY_LAST_OB_BARRED_CNT   <dbl> 11, 5, 2, 4, 3, 8, 13, 11, 9, 14, 23, ...
## $ TOT_DAY_LAST_SUSPENDED_CNT   <dbl> 3, 23, 25, 7, 18, 20, 10, 18, 16, 1, 9...
## $ TOT_EMAIL_QUERY_CNT          <dbl> 17, 16, 19, 4, 20, 25, 17, 11, 5, 3, 1...
## $ MTH_TO_SUBS_END_CNT          <dbl> 4, 2, 3, 4, 2, 3, 3, 2, 3, 4, 1, 1, 4,...
## $ TOT_SRV_DROPPED_CNT          <dbl> 2, 3, 4, 0, 2, 0, 0, 1, 1, 1, 0, 1, 1,...
## $ TOT_SRV_ADDED_CNT            <dbl> 1, 0, 0, 1, 1, 2, 0, 2, 0, 2, 1, 0, 0,...
## $ TOT_OUTSTAND_60_90_DAY_AMT   <dbl> 0, 0, 0, 0, 0, 483, 0, 0, 0, 0, 0, 0, ...
## $ TOT_REV_FIX_AMT              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 718, 0, ...
## $ TOT_REV_GPRS_AMT             <dbl> 0, 729, 574, 0, 0, 0, 0, 0, 0, 0, 0, 3...
## $ TOT_REV_INET_AMT             <dbl> 0, 0, 0, 0, 0, 508, 0, 0, 0, 0, 0, 467...
## $ TOT_COMPLAINT_1_MTH_CNT      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TOT_MTH_LAST_SUSPENDED_CNT   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_DATA_ACTVN         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MTH_SINCE_VM_ACTVN           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BARRING_REASON_CD            <fct> 000, 000, 002, 000, 002, 003, 002, 002...
## $ TOT_OB_CALL_CNT              <dbl> 5382, 1512, 6096, 9204, 4380, 1938, 27...
## $ TOT_ACTV_SRV_CNT             <dbl> 0, 0, 0, 4, 3, 3, 4, 2, 0, 3, 3, 0, 2,...
## $ REV_AMT_BASE_1               <dbl> 870, 570, 632, 638, 1162, 236, 445, 11...
## $ REV_AMT_BASE_2               <dbl> 548, 1024, 547, 466, 394, 729, 1091, 4...
## $ REV_AMT_BASE_3               <dbl> 970, 938, 851, 810, 1141, 483, 447, 33...
## $ REV_AMT_BASE_4               <dbl> 392, 602, 821, 655, 244, 325, 568, 645...
## $ REV_AMT_BASE_5               <dbl> 358, 729, 574, 402, 423, 787, 830, 241...
## $ REV_AMT_BASE_6               <dbl> 1248, 1129, 1027, 642, 618, 508, 1133,...
## $ CUST_AGE                     <dbl> 32, 24, 24, 44, 21, 52, 34, 44, 39, 43...
## $ PCT_CHNG_IB_SMS_CNT          <dbl> 1.1553398, 1.2653061, 0.9019608, 1.870...
## $ PCT_CHNG_SUSPENDED_CNT       <dbl> 0, 0, 5000, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PCT_CHNG_BILL_AMT            <dbl> 0.9555256, 0.9381989, 0.7204400, 0.880...
## $ CHURN_FLG                    <fct> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...

7 Data Sampling

7.1 Preparing training and testing data sets

First of all, we want to extract a data set for testing the predictions in the end. We’ll keep 60% of the data for training and 40% of the data for testing.

set.seed(1234)
telecom_split <- initial_split(telecom_cleaned,
                               prop = .6, 
                               strata = CHURN_FLG)

telecom_train <- training(telecom_split)
telecom_test <- testing(telecom_split)
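
To give a feel for what the strata argument does, here is a minimal base R sketch of stratified sampling on a made-up outcome vector. It is not how rsample implements initial_split(); the toy vector y and the sampling logic are assumptions for illustration only.

```r
set.seed(1234)

# Toy outcome with an imbalance similar to CHURN_FLG (made-up data)
y <- factor(c(rep("0", 920), rep("1", 80)))

# Sample 60% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.6 * length(idx)))
}))

train_y <- y[train_idx]
test_y <- y[-train_idx]

# The class shares are the same in both parts
prop.table(table(train_y))
prop.table(table(test_y))
```

Sampling within each class keeps the proportion of churners the same in the training and testing parts, which a purely random split does not guarantee when one class is rare.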

7.2 Preparing cross-validation data sets

Furthermore, the training data set will be prepared for 3-fold cross-validation (using three here to speed things up). All this is accomplished using the rsample package:

vfold_data <- vfold_cv(telecom_train, v = 3, 
                      repeats = 1, 
                      strata = CHURN_FLG)
vfold_data %>% 
  mutate(df_ana = map(splits, analysis),
         df_ass = map(splits, assessment))

## #  3-fold cross-validation using stratification 
## # A tibble: 3 x 4
##   splits              id    df_ana                df_ass               
##   <list>              <chr> <list>                <list>               
## 1 <split [5.3K/2.6K]> Fold1 <tibble [5,278 x 39]> <tibble [2,640 x 39]>
## 2 <split [5.3K/2.6K]> Fold2 <tibble [5,279 x 39]> <tibble [2,639 x 39]>
## 3 <split [5.3K/2.6K]> Fold3 <tibble [5,279 x 39]> <tibble [2,639 x 39]>

8 Data Preprocessing

Two data preprocessing steps will be performed:

creating dummy variables for categorical input variables, and
normalising continuous input variables.

telecom_recipe <- recipe(CHURN_FLG ~ ., data = telecom_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric())
telecom_recipe

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Operations:
## 
## Dummy variables from all_nominal(), -all_outcomes()
## Centering and scaling for all_numeric()
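
The two steps can be illustrated by hand in base R. The sketch below uses made-up values (barring, train_x and test_x are hypothetical) and is not the recipes implementation. It only shows the idea that dummy coding drops a reference level, and that the centering and scaling statistics are learned from the training data and then reused on new data.

```r
# Dummy variables: a four-level factor becomes three 0/1 columns,
# with the first level ("000") acting as the reference level
barring <- factor(c("000", "002", "003", "001"),
                  levels = c("000", "001", "002", "003"))
dummies <- model.matrix(~ barring)[, -1]  # drop the intercept column
dummies

# Normalisation: learn the mean and standard deviation on training values only
train_x <- c(198, 114, 114, 228, 168)  # made-up training values
test_x <- c(132, 120)                  # made-up test values
mu <- mean(train_x)
sigma <- sd(train_x)

# Apply the same training statistics to both sets
train_scaled <- (train_x - mu) / sigma
test_scaled <- (test_x - mu) / sigma
train_scaled
test_scaled
```

Reusing the training mean and standard deviation on the test values keeps information from the test set out of the preprocessing.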

If you want to extract the pre-processed dataset itself, you can first prep() the recipe for a specific dataset and juice() the prepped recipe to extract the pre-processed data. It turns out that extracting the pre-processed data isn’t actually necessary for the pipeline, since this will be done under the hood when the model is fit, but sometimes it’s useful anyway.

telecom_train_preprocessed <- telecom_recipe %>%
  prep(telecom_train) %>%
  juice()
glimpse(telecom_train_preprocessed)

## Rows: 7,918
## Columns: 48
## $ SUBS_TENURE                  <dbl> 2.8659494236, -0.2197381918, -0.219738...
## $ AVG_OB_CALL_CNT              <dbl> -0.039110543, -0.045349571, -0.0379594...
## $ TOT_OB_CALL_NAT_ROAM_CNT     <dbl> 0.235027169, -0.042508145, -0.03550164...
## $ TOT_OB_CALL_INTL_CNT         <dbl> -0.04654437, 1.91258878, -0.03626113, ...
## $ TOT_OB_CALL_LOC_CNT          <dbl> -0.063734543, -0.066713847, -0.0667138...
## $ TOT_OB_CALL_NAT_CNT          <dbl> -0.08208404, -0.06639903, -0.09001273,...
## $ TOT_OB_CALL_INTL_ROAM_CNT    <dbl> -0.0588093830, -0.1040129999, -0.09064...
## $ TOT_DAY_LAST_COMPLAINT_CNT   <dbl> 0.4634149, -0.5349229, 1.0338936, 0.32...
## $ TOT_DAY_LAST_OB_BARRED_CNT   <dbl> -0.23106703, -1.04439834, -1.45106399,...
## $ TOT_DAY_LAST_SUSPENDED_CNT   <dbl> -1.34529021, 1.24021135, 1.49876150, -...
## $ TOT_EMAIL_QUERY_CNT          <dbl> 0.366171725, 0.241172901, 0.616169373,...
## $ MTH_TO_SUBS_END_CNT          <dbl> 1.5636067, -0.5192732, 0.5221668, 1.56...
## $ TOT_SRV_DROPPED_CNT          <dbl> 1.24889147, 2.55882955, 3.86876764, -1...
## $ TOT_SRV_ADDED_CNT            <dbl> -0.4102997, -1.4268057, -1.4268057, -0...
## $ TOT_OUTSTAND_60_90_DAY_AMT   <dbl> -0.3401562, -0.3401562, -0.3401562, -0...
## $ TOT_REV_FIX_AMT              <dbl> -0.4746233, -0.4746233, -0.4746233, -0...
## $ TOT_REV_GPRS_AMT             <dbl> -0.517029, 2.606372, 1.942274, -0.5170...
## $ TOT_REV_INET_AMT             <dbl> -0.5847742, -0.5847742, -0.5847742, -0...
## $ TOT_COMPLAINT_1_MTH_CNT      <dbl> -0.141294, -0.141294, 7.076549, -0.141...
## $ TOT_MTH_LAST_SUSPENDED_CNT   <dbl> -0.1626102, -0.1626102, -0.1626102, -0...
## $ LAST_PRICE_PLAN_CHNG_DAY_CNT <dbl> -0.1567891, -0.1567891, -0.1567891, -0...
## $ MTH_SINCE_DATA_ACTVN         <dbl> -0.1767886, -0.1767886, -0.1767886, -0...
## $ MTH_SINCE_VM_ACTVN           <dbl> -0.2798142, -0.2798142, -0.2798142, -0...
## $ TOT_OB_CALL_CNT              <dbl> -0.039110543, -0.045349571, -0.0379594...
## $ TOT_ACTV_SRV_CNT             <dbl> -1.4866598, -1.4866598, -1.4866598, 0....
## $ REV_AMT_BASE_1               <dbl> 0.6135341, -0.4429222, -0.2245879, -0....
## $ REV_AMT_BASE_2               <dbl> -0.7073374, 0.6890768, -0.7102710, -0....
## $ REV_AMT_BASE_3               <dbl> 0.79590683, 0.69106652, 0.40603193, 0....
## $ REV_AMT_BASE_4               <dbl> -1.08905379, -0.20247405, 0.72210198, ...
## $ REV_AMT_BASE_5               <dbl> -1.12917554, 0.33085276, -0.27913211, ...
## $ REV_AMT_BASE_6               <dbl> 2.03004072, 1.59051529, 1.21377920, -0...
## $ CUST_AGE                     <dbl> -0.7278067, -1.3836548, -1.3836548, 0....
## $ PCT_CHNG_IB_SMS_CNT          <dbl> -0.158186621, 0.041398791, -0.61806159...
## $ PCT_CHNG_SUSPENDED_CNT       <dbl> -0.1725298, -0.1725298, 4.8130260, -0....
## $ PCT_CHNG_BILL_AMT            <dbl> -0.41913970, -0.45860554, -0.95460632,...
## $ CHURN_FLG                    <fct> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ GENDER_CD_F                  <dbl> -0.731453, -0.731453, -0.731453, -0.73...
## $ GENDER_CD_M                  <dbl> 0.7426753, 0.7426753, 0.7426753, 0.742...
## $ EDUCATION_CD_X.1             <dbl> -0.7665859, 1.3043206, -0.7665859, 1.3...
## $ EDUCATION_CD_X.2             <dbl> 1.3135553, -0.7611965, 1.3135553, -0.7...
## $ EDUCATION_CD_X.3             <dbl> -0.1576325, -0.1576325, -0.1576325, -0...
## $ EDUCATION_CD_X.4             <dbl> -0.100382, -0.100382, -0.100382, -0.10...
## $ EDUCATION_CD_X.5             <dbl> -0.06943865, -0.06943865, -0.06943865,...
## $ EDUCATION_CD_X.6             <dbl> -0.07388929, -0.07388929, -0.07388929,...
## $ EDUCATION_CD_X.7             <dbl> -0.01123808, -0.01123808, -0.01123808,...
## $ BARRING_REASON_CD_X001       <dbl> -0.5643825, -0.5643825, -0.5643825, -0...
## $ BARRING_REASON_CD_X002       <dbl> -0.5828552, -0.5828552, 1.7154753, -0....
## $ BARRING_REASON_CD_X003       <dbl> -0.5764389, -0.5764389, -0.5764389, -0...

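# bake() needs a trained recipe, so prep() it on the training data first
test_proc <- telecom_recipe %>%
  prep(telecom_train) %>%
  bake(new_data = telecom_test)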

9 Model Calibration

The parsnip package will be used to calibrate different classification models.

9.1 Calibrating a decision tree model

The code chunk below specifies a decision tree model that uses the rpart package as its engine.

dt_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

9.2 Calibrating a logistic regression model

The code chunk below specifies a logistic regression model that uses the glm() function of the stats package as its engine.

lr_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

9.3 Calibrating a random forest model

The code chunk below specifies a random forest model, as implemented by the ranger package, for the purpose of classification.

rf_model <- rand_forest() %>%
  set_engine("ranger", 
             importance = "impurity") %>%
  set_mode("classification")

If you want to be able to examine the variable importance of your final model later, you will need to set the importance argument when setting the engine. For ranger, the importance options are “impurity” or “permutation”.

Note that this code doesn’t actually fit the model. Like the recipe, it just outlines a description of the model.

Another thing to note is that nothing about this model specification is specific to the telecom dataset.

10 Put it all together in a workflow

We’re now ready to put the model and recipes together into a workflow. You initiate a workflow using workflow() from the workflows package and then you can add a recipe and add a model to it.

dt_workflow <- workflow() %>%
  add_recipe(telecom_recipe) %>%
  add_model(dt_model)

lr_workflow <- workflow() %>%
  add_recipe(telecom_recipe) %>%
  add_model(lr_model)

rf_workflow <- workflow() %>%
  add_recipe(telecom_recipe) %>%
  add_model(rf_model)

11 Model fitting

Now that we’ve defined our recipe and our models, we’re ready to actually fit the final models. Since all of this information is contained within the workflow objects, we will apply the last_fit() function to each workflow and our train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the test set.

dt_fit <- dt_workflow %>%
  last_fit(telecom_split)
dt_fit

## # Resampling results
## # Monte Carlo cross-validation (0.6/0.4) with 1 resamples  
## # A tibble: 1 x 6
##   splits        id          .metrics      .notes      .predictions     .workflow
##   <list>        <chr>       <list>        <list>      <list>           <list>   
## 1 <split [7.9K~ train/test~ <tibble [2 x~ <tibble [0~ <tibble [5,278 ~ <workflo~

lr_fit <- lr_workflow %>%
  last_fit(telecom_split)
lr_fit

## # Resampling results
## # Monte Carlo cross-validation (0.6/0.4) with 1 resamples  
## # A tibble: 1 x 6
##   splits        id          .metrics      .notes      .predictions     .workflow
##   <list>        <chr>       <list>        <list>      <list>           <list>   
## 1 <split [7.9K~ train/test~ <tibble [2 x~ <tibble [2~ <tibble [5,278 ~ <workflo~

rf_fit <- rf_workflow %>%
  last_fit(telecom_split)
rf_fit

## # Resampling results
## # Monte Carlo cross-validation (0.6/0.4) with 1 resamples  
## # A tibble: 1 x 6
##   splits        id          .metrics      .notes      .predictions     .workflow
##   <list>        <chr>       <list>        <list>      <list>           <list>   
## 1 <split [7.9K~ train/test~ <tibble [2 x~ <tibble [0~ <tibble [5,278 ~ <workflo~

Note that the fit object that is created is a data-frame-like object; specifically, it is a tibble with list columns.
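
To make the idea of a list column concrete, the sketch below builds a one-row base R data frame whose .metrics column holds another data frame, and then pulls that inner data frame out. The object res is a made-up stand-in and not the output of last_fit(); the two metric values are simply copied from the random forest results reported in the next section.

```r
# One-row data frame with a list column (a stand-in for a last_fit() result)
res <- data.frame(id = "train/test split", stringsAsFactors = FALSE)
res$.metrics <- list(
  data.frame(.metric = c("accuracy", "roc_auc"),
             .estimate = c(0.976, 0.987))
)

# Each cell of a list column can hold a whole object; [[1]] extracts it
metrics <- res$.metrics[[1]]
metrics

# Look up a single metric by name
metrics$.estimate[metrics$.metric == "roc_auc"]
```

With the real objects, the analogous call would be rf_fit$.metrics[[1]].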

12 Model Evaluation

12.1 Assessing overall model accuracy

Since we supplied the train/test split object when we fit the workflow, the metrics are evaluated on the test set. When we use the collect_metrics() function, it extracts the performance of the final model (since rf_fit now consists of a single final model) applied to the test set.

rf_performance <- rf_fit %>% 
  collect_metrics()
rf_performance

## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.976
## 2 roc_auc  binary         0.987

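Overall the performance is very good, with an accuracy of 0.976 and an AUC of 0.987. Bear in mind, however, that about 92% of the customers are non-churners, so a model that always predicts 0 would already reach an accuracy of about 0.92.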

12.2 Assessing model accuracy with testing data

12.2.1 Computing predictions from the test set

You can also extract the test set predictions themselves using the collect_predictions() function. Note that there are 5,278 rows in the predictions object below, which matches the number of test set observations (just to give you some evidence that these are based on the test set rather than the training set).

test_predictions <- rf_fit %>%
  collect_predictions()
test_predictions

## # A tibble: 5,278 x 6
##    id               .pred_0 .pred_1  .row .pred_class CHURN_FLG
##    <chr>              <dbl>   <dbl> <int> <fct>       <fct>    
##  1 train/test split   0.343 0.657       5 1           1        
##  2 train/test split   0.934 0.0655      9 0           0        
##  3 train/test split   0.922 0.0776     14 0           0        
##  4 train/test split   0.966 0.0340     15 0           0        
##  5 train/test split   0.291 0.709      16 1           1        
##  6 train/test split   0.859 0.141      17 0           1        
##  7 train/test split   0.984 0.0163     18 0           0        
##  8 train/test split   0.994 0.00593    26 0           0        
##  9 train/test split   0.978 0.0222     27 0           0        
## 10 train/test split   0.988 0.0121     30 0           0        
## # ... with 5,268 more rows

Since this is just a normal data frame/tibble object, we can generate summaries and plots such as a confusion matrix.

12.3 Computing a confusion matrix

test_predictions %>% 
  conf_mat(truth = CHURN_FLG, 
           estimate = .pred_class)

##           Truth
## Prediction    0    1
##          0 4823  124
##          1    4  327
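
The counts in this matrix are enough to recompute the usual classification metrics by hand. The sketch below does so in base R, treating churn (class 1) as the positive class; that choice, and the object names, are assumptions made for illustration rather than part of the original workflow.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = true class).
cm <- matrix(c(4823, 4, 124, 327), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))

# Share of all test customers classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual churners that the model catches (recall for class 1).
sensitivity <- cm["1", "1"] / sum(cm[, "1"])

# Share of actual non-churners that the model leaves alone.
specificity <- cm["0", "0"] / sum(cm[, "0"])

# Share of predicted churners who really churned.
precision <- cm["1", "1"] / sum(cm["1", ])

# Accuracy of a naive rule that predicts "no churn" for everyone.
baseline <- sum(cm[, "0"]) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        baseline = baseline), 3)
```

The accuracy agrees with the 0.976 reported by collect_metrics(). The sensitivity is noticeably lower than the accuracy: 124 of the 451 churners in the test set are missed, which matters because the churners are exactly the customers the retention campaign wants to find.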

12.4 Variable importance

If you want to extract the variable importance scores from your model, as far as I can tell you need to extract the model object from the fitted workflow returned by fit() (which for us is called rf_model). The function that extracts the model is pull_workflow_fit(), and you then need to grab the fit element that its output contains. Note that more recent releases of the workflows package deprecate pull_workflow_fit() in favour of extract_fit_parsnip().

rf_model <- fit(rf_workflow, telecom_cleaned)
ranger_obj <- pull_workflow_fit(rf_model)$fit
ranger_obj

## Ranger result
## 
## Call:
##  ranger::ranger(formula = ..y ~ ., data = data, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      13196 
## Number of independent variables:  47 
## Mtry:                             6 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.01699434

You can then extract the variable importance from the ranger object itself. Note that variable.importance is an element specific to ranger output, so this step will need to be adapted to the object type of other models.

ranger_obj$variable.importance

##                  SUBS_TENURE              AVG_OB_CALL_CNT 
##                   12.1924563                   69.1914085 
##     TOT_OB_CALL_NAT_ROAM_CNT         TOT_OB_CALL_INTL_CNT 
##                   54.0967837                   56.0624779 
##          TOT_OB_CALL_LOC_CNT          TOT_OB_CALL_NAT_CNT 
##                   61.2488865                   61.9115713 
##    TOT_OB_CALL_INTL_ROAM_CNT   TOT_DAY_LAST_COMPLAINT_CNT 
##                  116.8822255                   11.5065368 
##   TOT_DAY_LAST_OB_BARRED_CNT   TOT_DAY_LAST_SUSPENDED_CNT 
##                   11.7035548                   11.8769972 
##          TOT_EMAIL_QUERY_CNT          MTH_TO_SUBS_END_CNT 
##                   11.3489732                    4.9931900 
##          TOT_SRV_DROPPED_CNT            TOT_SRV_ADDED_CNT 
##                  124.7099953                  106.9056742 
##   TOT_OUTSTAND_60_90_DAY_AMT              TOT_REV_FIX_AMT 
##                    3.0451607                    4.0638076 
##             TOT_REV_GPRS_AMT             TOT_REV_INET_AMT 
##                    5.1337278                    5.4804504 
##      TOT_COMPLAINT_1_MTH_CNT   TOT_MTH_LAST_SUSPENDED_CNT 
##                  148.3829950                    0.5766582 
## LAST_PRICE_PLAN_CHNG_DAY_CNT         MTH_SINCE_DATA_ACTVN 
##                    0.5150562                    0.5954739 
##           MTH_SINCE_VM_ACTVN              TOT_OB_CALL_CNT 
##                    0.9487959                   60.1454416 
##             TOT_ACTV_SRV_CNT               REV_AMT_BASE_1 
##                   26.6214676                   27.7617097 
##               REV_AMT_BASE_2               REV_AMT_BASE_3 
##                   38.5876493                   17.2232273 
##               REV_AMT_BASE_4               REV_AMT_BASE_5 
##                   16.8719073                   17.8636111 
##               REV_AMT_BASE_6                     CUST_AGE 
##                   17.5503397                  325.7159672 
##          PCT_CHNG_IB_SMS_CNT       PCT_CHNG_SUSPENDED_CNT 
##                   35.1781608                  345.6217802 
##            PCT_CHNG_BILL_AMT                  GENDER_CD_F 
##                   20.6142868                    1.6994824 
##                  GENDER_CD_M             EDUCATION_CD_X.1 
##                    1.7241296                    1.6647733 
##             EDUCATION_CD_X.2             EDUCATION_CD_X.3 
##                    1.5073467                    0.9941026 
##             EDUCATION_CD_X.4             EDUCATION_CD_X.5 
##                    1.2137233                    0.5830629 
##             EDUCATION_CD_X.6             EDUCATION_CD_X.7 
##                    0.4457791                    0.0000000 
##       BARRING_REASON_CD_X001       BARRING_REASON_CD_X002 
##                    1.8729967                    1.6545299 
##       BARRING_REASON_CD_X003 
##                    1.8191001
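
The scores are printed in the order of the predictors, which makes the most important ones hard to spot. A minimal sketch of ranking them in base R follows; it uses a hand-copied subset of the values above so that it runs on its own, and the names imp and imp_sorted are hypothetical. With the real model you would start from ranger_obj$variable.importance instead.

```r
# A few of the impurity importance scores printed above, copied by hand
# for illustration only.
imp <- c(SUBS_TENURE = 12.1924563,
         AVG_OB_CALL_CNT = 69.1914085,
         TOT_SRV_DROPPED_CNT = 124.7099953,
         TOT_COMPLAINT_1_MTH_CNT = 148.3829950,
         CUST_AGE = 325.7159672,
         PCT_CHNG_SUSPENDED_CNT = 345.6217802,
         EDUCATION_CD_X.7 = 0)

# Rank the predictors from most to least important.
imp_sorted <- sort(imp, decreasing = TRUE)

# Show the three strongest predictors in this subset.
head(imp_sorted, 3)
```

In this subset PCT_CHNG_SUSPENDED_CNT, CUST_AGE and TOT_COMPLAINT_1_MTH_CNT come out on top, which matches the largest values in the full listing.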

12.4.1 Decision tree

dt_model <- fit(dt_workflow, telecom_cleaned)
rpart_obj <- pull_workflow_fit(dt_model)$fit
rpart_obj

## n= 13196 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 13196 1091 0 (0.917323431 0.082676569)  
##     2) PCT_CHNG_SUSPENDED_CNT< -0.1737818 12730  625 0 (0.950903378 0.049096622)  
##       4) CUST_AGE>=-1.08674 10547   96 0 (0.990897886 0.009102114)  
##         8) TOT_SRV_DROPPED_CNT< 1.911073 10532   81 0 (0.992309153 0.007690847) *
##         9) TOT_SRV_DROPPED_CNT>=1.911073 15    0 1 (0.000000000 1.000000000) *
##       5) CUST_AGE< -1.08674 2183  529 0 (0.757672927 0.242327073)  
##        10) TOT_SRV_ADDED_CNT>=-0.9278886 1534  176 0 (0.885267275 0.114732725)  
##          20) TOT_SRV_ADDED_CNT>=0.09137536 809    0 0 (1.000000000 0.000000000) *
##          21) TOT_SRV_ADDED_CNT< 0.09137536 725  176 0 (0.757241379 0.242758621)  
##            42) TOT_OB_CALL_INTL_ROAM_CNT>=-0.0703783 442   48 0 (0.891402715 0.108597285)  
##              84) CUST_AGE>=-1.580008 431   37 0 (0.914153132 0.085846868) *
##              85) CUST_AGE< -1.580008 11    0 1 (0.000000000 1.000000000) *
##            43) TOT_OB_CALL_INTL_ROAM_CNT< -0.0703783 283  128 0 (0.547703180 0.452296820)  
##              86) REV_AMT_BASE_2>=-0.2992656 140   39 0 (0.721428571 0.278571429) *
##              87) REV_AMT_BASE_2< -0.2992656 143   54 1 (0.377622378 0.622377622)  
##               174) REV_AMT_BASE_2< -1.10194 41   15 0 (0.634146341 0.365853659) *
##               175) REV_AMT_BASE_2>=-1.10194 102   28 1 (0.274509804 0.725490196) *
##        11) TOT_SRV_ADDED_CNT< -0.9278886 649  296 1 (0.456086287 0.543913713)  
##          22) TOT_OB_CALL_INTL_ROAM_CNT>=-0.06891255 282   77 0 (0.726950355 0.273049645)  
##            44) TOT_SRV_DROPPED_CNT< 1.911073 259   54 0 (0.791505792 0.208494208)  
##              88) CUST_AGE>=-1.580008 243   38 0 (0.843621399 0.156378601) *
##              89) CUST_AGE< -1.580008 16    0 1 (0.000000000 1.000000000) *
##            45) TOT_SRV_DROPPED_CNT>=1.911073 23    0 1 (0.000000000 1.000000000) *
##          23) TOT_OB_CALL_INTL_ROAM_CNT< -0.06891255 367   91 1 (0.247956403 0.752043597) *
##     3) PCT_CHNG_SUSPENDED_CNT>=-0.1737818 466    0 1 (0.000000000 1.000000000) *
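
As the legend at the top of the printout indicates, each line gives the node number, the split, the number of observations (n), the number of observations that do not belong to the predicted class (loss), the predicted class (yval) and the class probabilities (yprob). The sketch below, using hand-copied values for the root and its two children and hypothetical object names, shows how the churn probability follows from n and loss.

```r
# Root node and its two children, copied by hand from the printout above.
nodes <- data.frame(
  node = c(1, 2, 3),
  n    = c(13196, 12730, 466),
  loss = c(1091, 625, 0),
  yval = c(0, 0, 1)
)

# `loss` counts the observations outside the predicted class, so the churn
# probability is loss / n when the node predicts 0 and 1 - loss / n when
# the node predicts 1.
nodes$p_churn <- ifelse(nodes$yval == 1,
                        1 - nodes$loss / nodes$n,
                        nodes$loss / nodes$n)
nodes

# The two children together contain every observation of the root.
sum(nodes$n[nodes$node %in% c(2, 3)]) == nodes$n[nodes$node == 1]
```

Read this way, node 3 says that all 466 customers with PCT_CHNG_SUSPENDED_CNT at or above the split value churned, which is why that variable dominates the importance scores. The split values are negative and fractional, which suggests that the predictors were standardised during preprocessing; that is an inference from the output, not something stated in this section.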

rpart_obj$variable.importance

##     PCT_CHNG_SUSPENDED_CNT    TOT_COMPLAINT_1_MTH_CNT 
##                812.9705049                474.5235565 
##                   CUST_AGE          TOT_SRV_ADDED_CNT 
##                247.4439015                213.0720731 
##  TOT_OB_CALL_INTL_ROAM_CNT        TOT_SRV_DROPPED_CNT 
##                120.0284803                 97.1897475 
##           TOT_ACTV_SRV_CNT             REV_AMT_BASE_2 
##                 65.6103875                 26.8835893 
##        TOT_OB_CALL_NAT_CNT            AVG_OB_CALL_CNT 
##                 24.7426441                 23.4184395 
##        PCT_CHNG_IB_SMS_CNT            TOT_OB_CALL_CNT 
##                 21.3784548                 14.3579718 
##          PCT_CHNG_BILL_AMT       TOT_OB_CALL_INTL_CNT 
##                  9.4252195                  4.7794478 
##        TOT_OB_CALL_LOC_CNT             REV_AMT_BASE_3 
##                  4.2530183                  2.7474818 
##        TOT_EMAIL_QUERY_CNT   TOT_OB_CALL_NAT_ROAM_CNT 
##                  2.6709249                  2.3015417 
## TOT_DAY_LAST_SUSPENDED_CNT             REV_AMT_BASE_6 
##                  1.9112917                  1.3354624 
##             REV_AMT_BASE_1            TOT_REV_FIX_AMT 
##                  0.5535323                  0.3690216

12.4.2 Logistic regression model

lr_model <- fit(lr_workflow, telecom_cleaned)
glm_obj <- pull_workflow_fit(lr_model)$fit
glm_obj

## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                  (Intercept)                   SUBS_TENURE  
##                    3.011e+03                    -2.360e-02  
##              AVG_OB_CALL_CNT      TOT_OB_CALL_NAT_ROAM_CNT  
##                   -1.495e+00                     1.658e-02  
##         TOT_OB_CALL_INTL_CNT           TOT_OB_CALL_LOC_CNT  
##                    6.529e-02                    -8.921e-01  
##          TOT_OB_CALL_NAT_CNT     TOT_OB_CALL_INTL_ROAM_CNT  
##                   -6.471e-01                    -2.299e+00  
##   TOT_DAY_LAST_COMPLAINT_CNT    TOT_DAY_LAST_OB_BARRED_CNT  
##                    5.641e-02                    -7.557e-02  
##   TOT_DAY_LAST_SUSPENDED_CNT           TOT_EMAIL_QUERY_CNT  
##                    1.348e-02                    -3.648e-02  
##          MTH_TO_SUBS_END_CNT           TOT_SRV_DROPPED_CNT  
##                    1.576e-02                     5.105e-01  
##            TOT_SRV_ADDED_CNT    TOT_OUTSTAND_60_90_DAY_AMT  
##                   -1.687e+00                    -9.786e-02  
##              TOT_REV_FIX_AMT              TOT_REV_GPRS_AMT  
##                    3.636e-02                     4.443e-02  
##             TOT_REV_INET_AMT       TOT_COMPLAINT_1_MTH_CNT  
##                   -6.297e-02                     1.521e+00  
##   TOT_MTH_LAST_SUSPENDED_CNT  LAST_PRICE_PLAN_CHNG_DAY_CNT  
##                    4.716e-02                    -2.640e-02  
##         MTH_SINCE_DATA_ACTVN            MTH_SINCE_VM_ACTVN  
##                    7.099e-03                     8.531e-02  
##              TOT_OB_CALL_CNT              TOT_ACTV_SRV_CNT  
##                           NA                     1.319e-01  
##               REV_AMT_BASE_1                REV_AMT_BASE_2  
##                   -1.243e-01                    -2.510e-01  
##               REV_AMT_BASE_3                REV_AMT_BASE_4  
##                   -4.961e-02                     7.314e-03  
##               REV_AMT_BASE_5                REV_AMT_BASE_6  
##                   -2.142e-02                    -6.983e-02  
##                     CUST_AGE           PCT_CHNG_IB_SMS_CNT  
##                   -2.613e+00                    -2.842e-01  
##       PCT_CHNG_SUSPENDED_CNT             PCT_CHNG_BILL_AMT  
##                    1.733e+04                     1.183e-02  
##                  GENDER_CD_F                   GENDER_CD_M  
##                    2.537e-01                     3.041e-01  
##             EDUCATION_CD_X.1              EDUCATION_CD_X.2  
##                    1.000e-03                    -3.178e-02  
##             EDUCATION_CD_X.3              EDUCATION_CD_X.4  
##                   -6.586e-03                     6.920e-02  
##             EDUCATION_CD_X.5              EDUCATION_CD_X.6  
##                    2.865e-02                    -3.216e-02  
##             EDUCATION_CD_X.7        BARRING_REASON_CD_X001  
##                   -2.144e-01                     2.708e-02  
##       BARRING_REASON_CD_X002        BARRING_REASON_CD_X003  
##                   -2.241e-02                     6.427e-02  
## 
## Degrees of Freedom: 13195 Total (i.e. Null);  13149 Residual
## Null Deviance:       7529 
## Residual Deviance: 2482  AIC: 2576
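
One detail in this output deserves a comment: the coefficient of TOT_OB_CALL_CNT is NA. glm() reports NA when a predictor is an exact linear combination of other predictors. A plausible explanation here is that the total outbound call count is the sum of the other call count columns, but that is an assumption that has not been checked against the data. The sketch below reproduces the behaviour on simulated data; all variable names in it are hypothetical.

```r
# Simulated data, for illustration only: `total` is an exact sum of x1 and x2.
set.seed(1)
n     <- 200
x1    <- rnorm(n)
x2    <- rnorm(n)
total <- x1 + x2
y     <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))

# Fit a logistic regression that includes the redundant predictor.
fit <- glm(y ~ x1 + x2 + total, family = binomial)

# The redundant column cannot be estimated, so its coefficient is NA.
coef(fit)
```

In such a case the redundant column carries no extra information, and dropping it from the recipe leaves the fitted probabilities unchanged.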

Hands-on Exercise 8: Building Binary Predictive Models with tidymodels

Dr. Kam Tin Seong
Assoc. Professor of Information Systems

2020/07/15 (updated: 2020-07-25)
