1 Preparation

RPubs Link: https://rpubs.com/adityakw/telco1 Video Link: https://youtu.be/PY-MJkExWhg

1.1 Importing Library

1.1.1 R Library

library(reticulate)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(scales)
library(gridExtra)
library(grid)
library(corrgram)
library(tidyverse)

1.1.2 Python Library

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, cross_validate, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, PrecisionRecallDisplay, roc_auc_score, roc_curve, precision_recall_curve

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from lightgbm.sklearn import LGBMClassifier

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE, RandomOverSampler

from imblearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt

2 Problem Understanding

2.1 Telco Customer Retention Program

Telco is a fictional telecommunications company running a focused customer retention program. The goal of the program is to keep existing customers using Telco's services. The program is important because acquiring a new customer is harder and more costly than retaining an existing one. To be effective, the program needs to target the right customers: those who are likely to leave Telco (churn).

2.2 Telco Customer Churn Prediction

Telco has a database of customer characteristics, including which customers left within the last month (labeled as Churn). We can use machine learning to predict which customers are likely to churn, and then focus the retention program on those customers.

This analysis uses the F1 score as its evaluation metric: detecting every churning customer matters (recall), but misclassifying non-churning customers as churners (low precision) wastes the budget of the focused retention program on the wrong targets.
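As a quick illustration of the metric (a toy example, not Telco data), F1 is the harmonic mean of precision and recall, so it only rewards models that keep both reasonably high:

from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels for illustration only (not from the Telco data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # 3 TP / (3 TP + 2 FP) = 0.60
print(recall_score(y_true, y_pred))     # 3 TP / (3 TP + 1 FN) = 0.75
print(f1_score(y_true, y_pred))         # 2 * 0.60 * 0.75 / (0.60 + 0.75) ~ 0.667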

3 Data Understanding & Preprocessing

3.1 Data Understanding

3.1.1 Data Reading

df_churn = pd.read_csv('Telco-Customer-Churn.csv')
py$df_churn %>% data.frame()

3.1.2 Data Structure

df_churn_describe <- py$df_churn
df_churn_describe %>% mutate_if(is.character, as.factor) %>% str()
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : Factor w/ 6531 levels " ","100.2","100.25",..: 2506 1467 158 1401 926 6105 1551 2610 2647 3023 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
##  - attr(*, "pandas.index")=RangeIndex(start=0, stop=7043, step=1)

The data set contains 19 independent variables, which can be classified into 3 groups:

  1. Demographic Information

    • gender: Whether the client is a female or a male (Female, Male).
    • SeniorCitizen: Whether the client is a senior citizen or not ( 0, 1).
    • Partner: Whether the client has a partner or not (Yes, No).
    • Dependents: Whether the client has dependents or not (Yes, No).
  2. Customer Account Information

    • tenure: Number of months the customer has stayed with the company (Multiple different numeric values).
    • Contract: Indicates the customer’s current contract type (Month-to-month, One year, Two year).
    • PaperlessBilling: Whether the client has paperless billing or not (Yes, No).
    • PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).
    • MonthlyCharges: The amount charged to the customer monthly (Multiple different numeric values).
    • TotalCharges: The total amount charged to the customer (Multiple different numeric values).
  3. Services Information

    • PhoneService: Whether the client has a phone service or not (Yes, No).
    • MultipleLines: Whether the client has multiple lines or not (No phone service, No, Yes).
    • InternetService: Whether the client is subscribed to Internet service with the company (DSL, Fiber optic, No).
    • OnlineSecurity: Whether the client has online security or not (No internet service, No, Yes).
    • OnlineBackup: Whether the client has online backup or not (No internet service, No, Yes).
    • DeviceProtection: Whether the client has device protection or not (No internet service, No, Yes).
    • TechSupport: Whether the client has tech support or not (No internet service, No, Yes).
    • StreamingTV: Whether the client has streaming TV or not (No internet service, No, Yes).
    • StreamingMovies: Whether the client has streaming movies or not (No internet service, No, Yes).

4 Data Cleansing & Exploratory Data Analysis

4.1 Data Cleansing

4.1.1 Missing Value

df_churn.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7043 entries, 0 to 7042
## Data columns (total 21 columns):
##  #   Column            Non-Null Count  Dtype  
## ---  ------            --------------  -----  
##  0   customerID        7043 non-null   object 
##  1   gender            7043 non-null   object 
##  2   SeniorCitizen     7043 non-null   int64  
##  3   Partner           7043 non-null   object 
##  4   Dependents        7043 non-null   object 
##  5   tenure            7043 non-null   int64  
##  6   PhoneService      7043 non-null   object 
##  7   MultipleLines     7043 non-null   object 
##  8   InternetService   7043 non-null   object 
##  9   OnlineSecurity    7043 non-null   object 
##  10  OnlineBackup      7043 non-null   object 
##  11  DeviceProtection  7043 non-null   object 
##  12  TechSupport       7043 non-null   object 
##  13  StreamingTV       7043 non-null   object 
##  14  StreamingMovies   7043 non-null   object 
##  15  Contract          7043 non-null   object 
##  16  PaperlessBilling  7043 non-null   object 
##  17  PaymentMethod     7043 non-null   object 
##  18  MonthlyCharges    7043 non-null   float64
##  19  TotalCharges      7043 non-null   object 
##  20  Churn             7043 non-null   object 
## dtypes: float64(1), int64(2), object(18)
## memory usage: 1.1+ MB
py$df_churn %>% filter(TotalCharges == " ")

Although info() reports no null values, TotalCharges is stored as text and contains rows with a blank string (" ") instead of a number. We remove those rows and convert TotalCharges to numeric. We also remove customerID because it is irrelevant to our analysis.

py$df_churn <- py$df_churn %>% 
  filter(TotalCharges != " ") %>%
  mutate_at(c('TotalCharges'),  as.numeric) %>% 
  select(-customerID)
df_churn.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7032 entries, 0 to 7031
## Data columns (total 20 columns):
##  #   Column            Non-Null Count  Dtype  
## ---  ------            --------------  -----  
##  0   gender            7032 non-null   object 
##  1   SeniorCitizen     7032 non-null   float64
##  2   Partner           7032 non-null   object 
##  3   Dependents        7032 non-null   object 
##  4   tenure            7032 non-null   float64
##  5   PhoneService      7032 non-null   object 
##  6   MultipleLines     7032 non-null   object 
##  7   InternetService   7032 non-null   object 
##  8   OnlineSecurity    7032 non-null   object 
##  9   OnlineBackup      7032 non-null   object 
##  10  DeviceProtection  7032 non-null   object 
##  11  TechSupport       7032 non-null   object 
##  12  StreamingTV       7032 non-null   object 
##  13  StreamingMovies   7032 non-null   object 
##  14  Contract          7032 non-null   object 
##  15  PaperlessBilling  7032 non-null   object 
##  16  PaymentMethod     7032 non-null   object 
##  17  MonthlyCharges    7032 non-null   float64
##  18  TotalCharges      7032 non-null   float64
##  19  Churn             7032 non-null   object 
## dtypes: float64(4), object(16)
## memory usage: 1.1+ MB

4.2 EDA

4.2.1 Dependent Variable

py$df_churn %>% 
  group_by(Churn) %>% 
  summarize(count=n()) %>% 
  mutate(freq = count/sum(count)) %>% 
  ggplot(aes(x=Churn,y=freq)) + 
  geom_col(aes(fill=Churn),show.legend = FALSE) +
  geom_text(aes(label = percent(freq,0.01)),vjust = 1.5,colour='white') +
  labs(title = "Churn Distribution", y='Frequency') +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

The distribution of the Churn variable is 26.58% yes and 73.42% no, which is somewhat imbalanced.

The labels in this variable are categorical (“No” and “Yes”). Since our analysis focuses on customers who churned, we encode “No” as 0 and “Yes” as 1.

df_churn['Churn'] = np.where(df_churn['Churn'] == 'Yes', 1, 0)
df_churn.head()
##    gender  SeniorCitizen Partner  ... MonthlyCharges  TotalCharges Churn
## 0  Female            0.0     Yes  ...          29.85         29.85     0
## 1    Male            0.0      No  ...          56.95       1889.50     0
## 2    Male            0.0      No  ...          53.85        108.15     1
## 3    Male            0.0      No  ...          42.30       1840.75     0
## 4  Female            0.0      No  ...          70.70        151.65     1
## 
## [5 rows x 20 columns]

4.2.2 Independent Variables - Demographic Information

p1 <- py$df_churn %>% 
  ggplot(aes(x=gender)) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution", y='Frequency')
p2 <- py$df_churn %>% 
  ggplot(aes(x=gender, fill = as.factor(Churn))) + geom_bar(position='fill') + labs(title = "Normalized Distribution", y='Percentage')
p3 <- py$df_churn %>% 
  ggplot(aes(x=as.factor(SeniorCitizen))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(y='Frequency', x='Senior Citizen')
p4 <- py$df_churn %>% 
  ggplot(aes(x=as.factor(SeniorCitizen), fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage',x='Senior Citizen')
p5 <- py$df_churn %>% 
  ggplot(aes(x=as.factor(Partner))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(y='Frequency', x='Partner')
p6 <- py$df_churn %>% 
  ggplot(aes(x=Partner, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage',x='Partner')
p7 <- py$df_churn %>% 
  ggplot(aes(x=as.factor(Dependents))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(y='Frequency', x='Dependents')
p8 <- py$df_churn %>% 
  ggplot(aes(x=Dependents, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage',x='Dependents')

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, nrow = 4, top=textGrob("Distribution of Churn Demographic Information"))

From these graphs, we can expect SeniorCitizen to be an important feature and gender to be an unimportant one.

4.2.3 Independent Variables - Categorical Data

p1 <- py$df_churn %>% 
  ggplot(aes(y=PaymentMethod)) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution", x='Frequency')
p2 <- py$df_churn %>% 
  ggplot(aes(y=PaymentMethod, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(title = "Normalized Distribution", x='Percentage')
p3 <- py$df_churn %>% 
  ggplot(aes(y=as.factor(PaperlessBilling))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution", x='Frequency', y='Paperless Billing')
p4 <- py$df_churn %>% 
  ggplot(aes(y=as.factor(PaperlessBilling), fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(title = "Normalized Distribution", x='Percentage',y='Paperless Billing')
p5 <- py$df_churn %>% 
  ggplot(aes(y=as.factor(Contract))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution",x='Frequency', y='Contract')
p6 <- py$df_churn %>% 
  ggplot(aes(y=Contract, fill = as.factor(Churn) )) + geom_bar(position='fill',show.legend=FALSE) + labs(title = "Normalized Distribution", x='Percentage',y='Contract')

grid.arrange(p1, p3, p5, p2, p4, p6, nrow = 2, top=textGrob("Distribution of Churn Categorical Information"))

From these graphs, we see that customers with a month-to-month contract, paperless billing, or electronic check payment tend to churn more than the others. We can expect some important features from these variables.

4.2.4 Independent Variables - Customer Account Information

p1 <- py$df_churn %>% 
  ggplot(aes(x=tenure, fill=as.factor(Churn))) + geom_histogram(position='identity',alpha=0.6,bins=20)
p2 <- py$df_churn %>% 
  ggplot(aes(x=MonthlyCharges, fill=as.factor(Churn))) + geom_histogram(position='identity',alpha=0.6,bins=20)
p3 <- py$df_churn %>% 
  ggplot(aes(x=TotalCharges, fill=as.factor(Churn))) + geom_histogram(position='identity',alpha=0.6,bins=20)

grid.arrange(p1, p2, p3, nrow = 2, top=textGrob("Distribution of Churn Customer Information"))

From these histograms, we see that customers with high tenure or high total charges tend not to churn, while customers with high monthly charges tend to churn.

4.2.5 Independent Variables - Services Information

p1 <- py$df_churn %>% 
  ggplot(aes(x=PhoneService, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p2 <- py$df_churn %>% 
  ggplot(aes(x=MultipleLines, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p3 <- py$df_churn %>% 
  ggplot(aes(x=InternetService, fill = as.factor(Churn))) + geom_bar(position='fill') + labs(y='Percentage')
p4 <- py$df_churn %>% 
  ggplot(aes(x=OnlineSecurity, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p5 <- py$df_churn %>% 
  ggplot(aes(x=OnlineBackup, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p6 <- py$df_churn %>% 
  ggplot(aes(x=DeviceProtection, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p7 <- py$df_churn %>% 
  ggplot(aes(x=TechSupport, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p8 <- py$df_churn %>% 
  ggplot(aes(x=StreamingTV, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p9 <- py$df_churn %>% 
  ggplot(aes(x=StreamingMovies, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, nrow = 3, top=textGrob("Distribution of Churn Services Information"))

From these graphs, we expect no important features from PhoneService and MultipleLines. Customers with fiber optic internet service, no online security, or no tech support tend to churn more than the others, while customers with no internet service tend not to churn.

4.3 Feature Engineering

4.3.1 Target Variables

py$df_churn %>% select(Churn) %>% mutate_if(is.character, as.factor) %>% summary()
##      Churn       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2658  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

4.3.2 Categorical Variables

py$df_churn %>% select_if(is.character) %>% mutate_if(is.character, as.factor) %>% summary()
##     gender     Partner    Dependents PhoneService          MultipleLines 
##  Female:3483   No :3639   No :4933   No : 680     No              :3385  
##  Male  :3549   Yes:3393   Yes:2099   Yes:6352     No phone service: 680  
##                                                   Yes             :2967  
##                                                                          
##     InternetService             OnlineSecurity              OnlineBackup 
##  DSL        :2416   No                 :3497   No                 :3087  
##  Fiber optic:3096   No internet service:1520   No internet service:1520  
##  No         :1520   Yes                :2015   Yes                :2425  
##                                                                          
##             DeviceProtection              TechSupport  
##  No                 :3094    No                 :3472  
##  No internet service:1520    No internet service:1520  
##  Yes                :2418    Yes                :2040  
##                                                        
##               StreamingTV              StreamingMovies           Contract   
##  No                 :2809   No                 :2781   Month-to-month:3875  
##  No internet service:1520   No internet service:1520   One year      :1472  
##  Yes                :2703   Yes                :2731   Two year      :1685  
##                                                                             
##  PaperlessBilling                   PaymentMethod 
##  No :2864         Bank transfer (automatic):1542  
##  Yes:4168         Credit card (automatic)  :1521  
##                   Electronic check         :2365  
##                   Mailed check             :1604
py$cat <- py$df_churn %>% select_if(is.character) %>% colnames()

All the categorical variables are nominal (they have no inherent order), so we can use OneHotEncoder on all of them.
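As a small illustration of what OneHotEncoder(drop='first') produces (a toy Contract column, not the real pipeline built below):

from sklearn.preprocessing import OneHotEncoder

# Toy example: three contract levels become two dummy columns (first level dropped)
enc = OneHotEncoder(drop='first')
toy = [['Month-to-month'], ['One year'], ['Two year'], ['One year']]
print(enc.fit_transform(toy).toarray())          # rows: [0,0], [1,0], [0,1], [1,0]
print(enc.get_feature_names_out(['Contract']))   # ['Contract_One year' 'Contract_Two year']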

4.3.3 Numeric Variables

py$df_churn %>% select_if(is.numeric) %>% summary()
##  SeniorCitizen        tenure      MonthlyCharges    TotalCharges   
##  Min.   :0.0000   Min.   : 1.00   Min.   : 18.25   Min.   :  18.8  
##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 35.59   1st Qu.: 401.4  
##  Median :0.0000   Median :29.00   Median : 70.35   Median :1397.5  
##  Mean   :0.1624   Mean   :32.42   Mean   : 64.80   Mean   :2283.3  
##  3rd Qu.:0.0000   3rd Qu.:55.00   3rd Qu.: 89.86   3rd Qu.:3794.7  
##  Max.   :1.0000   Max.   :72.00   Max.   :118.75   Max.   :8684.8  
##      Churn       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2658  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
py$num <- py$df_churn %>% select(SeniorCitizen,tenure,MonthlyCharges,TotalCharges) %>% colnames()
py$df_churn %>% select_if(is.numeric) %>% boxplot()

py$df_churn %>% select(SeniorCitizen, tenure, MonthlyCharges) %>% boxplot()

From the boxplot, we see that TotalCharges, tenure, and MonthlyCharges have significantly different ranges, so we should apply scaling to those variables.
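MinMaxScaler rescales each feature to [0, 1] via x' = (x - min) / (max - min); a quick sketch on tenure, with the min of 1 and max of 72 taken from the summary above:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# tenure spans 1..72, so e.g. the median of 29 maps to (29 - 1) / (72 - 1) ~ 0.394
scaler = MinMaxScaler()
tenure = np.array([[1.0], [29.0], [72.0]])
print(scaler.fit_transform(tenure).ravel())  # [0.0, ~0.394, 1.0]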

5 Modelling

5.1 Data Splitting

For this analysis, we split the data into a 75% training set and a 25% test set. We also stratify on y so that both splits preserve the churn/non-churn ratio.

X = df_churn.drop(columns='Churn')
y = df_churn['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
print(f'''
X_train.shape = {X_train.shape}
y_train.shape = {y_train.shape}
X_test.shape = {X_test.shape}
y_test.shape = {y_test.shape}
''')
## 
## X_train.shape = (5274, 19)
## y_train.shape = (5274,)
## X_test.shape = (1758, 19)
## y_test.shape = (1758,)

5.2 Column Transformation Pipeline

We use OneHotEncoder for all categorical features and MinMaxScaler for the numeric features.

OneHot = OneHotEncoder(handle_unknown='ignore',drop='first')
minmaxscaler = MinMaxScaler()
ct = ColumnTransformer([('onehot',OneHot,cat),
('minmax',minmaxscaler,num)],remainder='passthrough')

X_train_trans = ct.fit_transform(X_train)
X_test_trans = ct.transform(X_test)
X_train_trans.shape
## (5274, 30)

5.3 Base Model Selection + Resampling Methods

For this analysis, we train multiple binary classification algorithms, all with their default hyperparameters, and pick the one that gives the best result as our base model. We use F1 as the evaluation metric because we want the retention program to target the right customers, and we evaluate with 5-fold cross-validation on the training set.

At the same time, we also test which resampling method maximizes the F1 score.

model_tree = DecisionTreeClassifier(random_state=2023)
model_lr = LogisticRegression(random_state=2023,max_iter=10000)
model_rf = RandomForestClassifier(random_state=2023)
model_xgb = XGBClassifier(random_state=2023,learning_rate= 0.2, max_depth= 6, min_child_weight= 100, n_estimators= 100)
model_svc = SVC(random_state=2023,probability=True)
model_knn = KNeighborsClassifier()
model_lgb = LGBMClassifier(random_state=2023)

smote = SMOTE(random_state=2023)
rus= RandomUnderSampler(random_state=2023) 

estimator=Pipeline([
    ('resample', None),
    ('model', model_lr)
])

models={'resample':[None,smote,rus],
        'model':[model_lr,model_tree,model_rf,model_xgb,model_svc,model_knn,model_lgb]}
                  
gs_models=GridSearchCV(estimator,
                         param_grid=models,
                         cv=5,
                         scoring='f1')
gs_models.fit(X_train_trans,y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('resample', None),
                                       ('model',
                                        LogisticRegression(max_iter=10000,
                                                           random_state=2023))]),
             param_grid={'model': [LogisticRegression(max_iter=10000,
                                                      random_state=2023),
                                   DecisionTreeClassifier(random_state=2023),
                                   RandomForestClassifier(random_state=2023),
                                   XGBClassifier(base_score=None, booster=None,
                                                 callbacks=None,
                                                 colsamp...
                                                 max_depth=6, max_leaves=None,
                                                 min_child_weight=100,
                                                 missing=nan,
                                                 monotone_constraints=None,
                                                 n_estimators=100, n_jobs=None,
                                                 num_parallel_tree=None,
                                                 predictor=None,
                                                 random_state=2023, ...),
                                   SVC(probability=True, random_state=2023),
                                   KNeighborsClassifier(),
                                   LGBMClassifier(random_state=2023)],
                         'resample': [None, SMOTE(random_state=2023),
                                      RandomUnderSampler(random_state=2023)]},
             scoring='f1')
results = pd.concat([pd.DataFrame(gs_models.cv_results_["params"]),pd.DataFrame(gs_models.cv_results_["mean_test_score"], columns=["f1"])],axis=1)

results
##                                                 model  ...        f1
## 0   LogisticRegression(max_iter=10000, random_stat...  ...  0.589727
## 1   LogisticRegression(max_iter=10000, random_stat...  ...  0.633082
## 2   LogisticRegression(max_iter=10000, random_stat...  ...  0.630899
## 3           DecisionTreeClassifier(random_state=2023)  ...  0.499511
## 4           DecisionTreeClassifier(random_state=2023)  ...  0.519894
## 5           DecisionTreeClassifier(random_state=2023)  ...  0.528526
## 6           RandomForestClassifier(random_state=2023)  ...  0.553539
## 7           RandomForestClassifier(random_state=2023)  ...  0.598985
## 8           RandomForestClassifier(random_state=2023)  ...  0.614012
## 9   XGBClassifier(base_score=None, booster=None, c...  ...  0.598545
## 10  XGBClassifier(base_score=None, booster=None, c...  ...  0.622743
## 11  XGBClassifier(base_score=None, booster=None, c...  ...  0.607400
## 12           SVC(probability=True, random_state=2023)  ...  0.570108
## 13           SVC(probability=True, random_state=2023)  ...  0.619906
## 14           SVC(probability=True, random_state=2023)  ...  0.619119
## 15                             KNeighborsClassifier()  ...  0.534968
## 16                             KNeighborsClassifier()  ...  0.552836
## 17                             KNeighborsClassifier()  ...  0.584259
## 18                  LGBMClassifier(random_state=2023)  ...  0.575133
## 19                  LGBMClassifier(random_state=2023)  ...  0.601645
## 20                  LGBMClassifier(random_state=2023)  ...  0.611686
## 
## [21 rows x 3 columns]
py$results %>% data.frame() %>% arrange(-f1)

From the base model testing, we see that Logistic Regression with SMOTE gives the best F1 score. We will use the Logistic Regression model as our base model and tune it to get the best result.

5.4 Hyperparameter Tuning

5.4.1 Before Tuning

# Before tuning
base_model = gs_models.best_estimator_
cv_results = cross_validate(base_model,X_train_trans,y_train,cv=5,scoring=['f1','recall'])
print(f'''cross validation before tuning:
mean fit_time = {round(cv_results['fit_time'].mean(),3)}
std fit_time = {round(cv_results['fit_time'].std(),3)}
mean f1 = {round(cv_results['test_f1'].mean(),3)}
std f1 = {round(cv_results['test_f1'].std(),3)}
mean recall = {round(cv_results['test_recall'].mean(),3)}
std recall = {round(cv_results['test_recall'].std(),3)}
''')
## cross validation before tuning:
## mean fit_time = 0.101
## std fit_time = 0.009
## mean f1 = 0.633
## std f1 = 0.013
## mean recall = 0.788
## std recall = 0.02

5.4.2 Hyperparameter Tuning

# Hyperparameter Tuning
hyperparam_space={'model__penalty' : ['l1', 'l2','elasticnet'],
                  'model__solver': ['lbfgs', 'liblinear'],
                  'model__C': [0.1, 1, 10]}
                  
grid_search=GridSearchCV(base_model,
                         param_grid=hyperparam_space,
                         cv=5,
                         scoring='f1')
grid_search.fit(X_train_trans,y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('resample', SMOTE(random_state=2023)),
                                       ('model',
                                        LogisticRegression(max_iter=10000,
                                                           random_state=2023))]),
             param_grid={'model__C': [0.1, 1, 10],
                         'model__penalty': ['l1', 'l2', 'elasticnet'],
                         'model__solver': ['lbfgs', 'liblinear']},
             scoring='f1')
print(f'''
grid_search.best_score_ = {grid_search.best_score_}
grid_search.best_params_ = {grid_search.best_params_}
''')
## 
## grid_search.best_score_ = 0.634830093389831
## grid_search.best_params_ = {'model__C': 1, 'model__penalty': 'l1', 'model__solver': 'liblinear'}
best_model=grid_search.best_estimator_
cv_results = cross_validate(best_model,X_train_trans,y_train,cv=5,scoring=['f1','recall'])
print(f'''cross validation after tuning:
mean fit_time = {round(cv_results['fit_time'].mean(),3)}
std fit_time = {round(cv_results['fit_time'].std(),3)}
mean f1 = {round(cv_results['test_f1'].mean(),3)}
std f1 = {round(cv_results['test_f1'].std(),3)}
mean recall = {round(cv_results['test_recall'].mean(),3)}
std recall = {round(cv_results['test_recall'].std(),3)}
''')
## cross validation after tuning:
## mean fit_time = 0.086
## std fit_time = 0.008
## mean f1 = 0.635
## std f1 = 0.014
## mean recall = 0.792
## std recall = 0.024

5.5 Evaluation

5.5.1 Before Tuning

base_model = gs_models.best_estimator_
base_model.fit(X_train_trans,y_train)
Pipeline(steps=[('resample', SMOTE(random_state=2023)),
                ('model',
                 LogisticRegression(max_iter=10000, random_state=2023))])
print(classification_report(y_test,base_model.predict(X_test_trans)))
##               precision    recall  f1-score   support
## 
##            0       0.90      0.72      0.80      1291
##            1       0.51      0.78      0.61       467
## 
##     accuracy                           0.74      1758
##    macro avg       0.70      0.75      0.71      1758
## weighted avg       0.80      0.74      0.75      1758
# Prediction results - base model
pred_results = pd.DataFrame({'y_test':y_test,
               'y_pred_proba':base_model.predict_proba(X_test_trans)[:,1],
               'y_pred':base_model.predict(X_test_trans)
               }).reset_index()
graph = sns.scatterplot(data=pred_results, x=pred_results.index, y='y_pred_proba', hue='y_test')
graph.axhline(0.5)
sns.set(rc={"figure.figsize":(15, 15)})
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

5.5.2 After Tuning

best_model = grid_search.best_estimator_
best_model.fit(X_train_trans,y_train)
Pipeline(steps=[('resample', SMOTE(random_state=2023)),
                ('model',
                 LogisticRegression(C=1, max_iter=10000, penalty='l1',
                                    random_state=2023, solver='liblinear'))])
print(classification_report(y_test,best_model.predict(X_test_trans)))
##               precision    recall  f1-score   support
## 
##            0       0.90      0.72      0.80      1291
##            1       0.50      0.79      0.62       467
## 
##     accuracy                           0.74      1758
##    macro avg       0.70      0.75      0.71      1758
## weighted avg       0.80      0.74      0.75      1758
# Prediction results - tuned model
pred_results = pd.DataFrame({'y_test':y_test,
               'y_pred_proba':best_model.predict_proba(X_test_trans)[:,1],
               'y_pred':best_model.predict(X_test_trans)
               }).reset_index()
graph = sns.scatterplot(data=pred_results, x=pred_results.index, y='y_pred_proba', hue='y_test')
graph.axhline(0.5)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

From these results, we see that the model after hyperparameter tuning (best_model) gives slightly better results than the base model, so we use it as our final model. The graph shows that the model can separate churn from non-churn customers to some extent, but many points in the middle remain mixed and cannot be separated; hence its F1 score is 0.62, which is not high enough to deploy this model.

The recall of 79% means we catch 79% of the customers who actually churn. The precision of 50% means that, of all customers we predict as churn (shown in the graph as points above the 0.5 threshold), only half are real churners; the other half are non-churners who would be a mistargeted audience for the focused retention program (potentially leading to ineffective cost). However, this analysis puts more emphasis on retaining potentially churning customers, because it is more costly to acquire new customers than to retain existing ones.
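If higher precision were required, the default 0.5 cut-off itself could be tuned instead of retraining; a hedged sketch using precision_recall_curve (imported earlier), choosing the threshold on the training data so the test set is not leaked:

# Sketch: scan alternative probability thresholds for the tuned model
proba = best_model.predict_proba(X_train_trans)[:, 1]
prec, rec, thr = precision_recall_curve(y_train, proba)
f1s = 2 * prec * rec / (prec + rec + 1e-12)   # F1 at each candidate threshold
best_thr = thr[f1s[:-1].argmax()]             # threshold maximizing F1 on this data
print(best_thr)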

5.6 Feature Importances

feature_importances = pd.DataFrame({'feature_name' : ct.get_feature_names_out(), 
             'coef' : best_model.named_steps["model"].coef_.flatten()}).sort_values('coef')
py$feature_importances %>% mutate(coef = round(coef,3))
ggplot(py$feature_importances, aes(y=reorder(feature_name,coef), x=coef,fill=coef)) + 
  geom_col() + labs(title = "Feature Importance")

The coefficients of the variables match what we predicted from the EDA.

6 Conclusion

We were able to build a machine learning model to predict churn. The final model is a Logistic Regression model (C=1, max_iter=10000, penalty='l1', random_state=2023, solver='liblinear') with SMOTE resampling. It achieves an F1 score of 0.62 on the test set, which we consider not high enough to deploy the model.

The recall of 79% means the model catches 79% of the customers who actually churn, and the precision of 50% means that half of the customers predicted as churn (points above the 0.5 threshold in the graph) are actually non-churners: a mistargeted audience for the focused retention program and a source of wasted cost. However, this analysis puts more emphasis on retaining potentially churning customers, because it is more costly to acquire new customers than to retain existing ones.