RPubs Link: https://rpubs.com/adityakw/telco1 Video Link: https://youtu.be/PY-MJkExWhg
library(reticulate)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(scales)
library(gridExtra)
library(grid)
library(corrgram)
library(tidyverse)import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, cross_validate,RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder,MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, PrecisionRecallDisplay, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from lightgbm.sklearn import LGBMClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline
import seaborn as sns
import matplotlib.pyplot as pltTelco is a fictional Telecommunication Company undergoing a focused customer retention program. This program hopefully will make the customer to keep using Telco. This program is important because getting a new customer is harder and more costly than to retain one. To make the program effective, it needs to target the right customer, which in this case are the customer who will potentially left Telco (Churn).
Telco has a database of the characteristics of their customer and which customer left within the last month (called as Churn). To be able to identify which customer left Telco, we could use a machine learning to predict churn customer. The prediction results could be used to focus the retention program on the said customer.
This analysis will use f1 score as evaluation metric, because while detecting every single churn customer is important, misclassified churn occurence will leads to inefficiency in cost because we choose the wrong target for the focused customer retention program.
df_churn = pd.read_csv('Telco-Customer-Churn.csv')py$df_churn %>% data.frame()df_churn_describe <- py$df_churn
df_churn_describe %>% mutate_if(is.character, as.factor) %>% str()## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : Factor w/ 6531 levels " ","100.2","100.25",..: 2506 1467 158 1401 926 6105 1551 2610 2647 3023 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
## - attr(*, "pandas.index")=RangeIndex(start=0, stop=7043, step=1)
the data set contains 19 independent variables, which can be classified into 3 groups:
Demographic Information
Customer Account Information
Services Information
df_churn.info()## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7043 entries, 0 to 7042
## Data columns (total 21 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 customerID 7043 non-null object
## 1 gender 7043 non-null object
## 2 SeniorCitizen 7043 non-null int64
## 3 Partner 7043 non-null object
## 4 Dependents 7043 non-null object
## 5 tenure 7043 non-null int64
## 6 PhoneService 7043 non-null object
## 7 MultipleLines 7043 non-null object
## 8 InternetService 7043 non-null object
## 9 OnlineSecurity 7043 non-null object
## 10 OnlineBackup 7043 non-null object
## 11 DeviceProtection 7043 non-null object
## 12 TechSupport 7043 non-null object
## 13 StreamingTV 7043 non-null object
## 14 StreamingMovies 7043 non-null object
## 15 Contract 7043 non-null object
## 16 PaperlessBilling 7043 non-null object
## 17 PaymentMethod 7043 non-null object
## 18 MonthlyCharges 7043 non-null float64
## 19 TotalCharges 7043 non-null object
## 20 Churn 7043 non-null object
## dtypes: float64(1), int64(2), object(18)
## memory usage: 1.1+ MB
py$df_churn %>% filter(TotalCharges == " ")We also need to remove customerID information because it is irrelevant for our analysis.
py$df_churn <- py$df_churn %>%
filter(TotalCharges != " ") %>%
mutate_at(c('TotalCharges'), as.numeric) %>%
select(-customerID)df_churn.info()## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7032 entries, 0 to 7031
## Data columns (total 20 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 gender 7032 non-null object
## 1 SeniorCitizen 7032 non-null float64
## 2 Partner 7032 non-null object
## 3 Dependents 7032 non-null object
## 4 tenure 7032 non-null float64
## 5 PhoneService 7032 non-null object
## 6 MultipleLines 7032 non-null object
## 7 InternetService 7032 non-null object
## 8 OnlineSecurity 7032 non-null object
## 9 OnlineBackup 7032 non-null object
## 10 DeviceProtection 7032 non-null object
## 11 TechSupport 7032 non-null object
## 12 StreamingTV 7032 non-null object
## 13 StreamingMovies 7032 non-null object
## 14 Contract 7032 non-null object
## 15 PaperlessBilling 7032 non-null object
## 16 PaymentMethod 7032 non-null object
## 17 MonthlyCharges 7032 non-null float64
## 18 TotalCharges 7032 non-null float64
## 19 Churn 7032 non-null object
## dtypes: float64(4), object(16)
## memory usage: 1.1+ MB
py$df_churn %>%
group_by(Churn) %>%
summarize(count=n()) %>%
mutate(freq = count/sum(count)) %>%
ggplot(aes(x=Churn,y=freq)) +
geom_col(aes(fill=Churn),show.legend = FALSE) +
geom_text(aes(label = percent(freq,0.01)),vjust = 1.5,colour='white') +
labs(title = "Churn Distribution", y='Frequency') +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))Distribution of Churn variable is 26.58% yes and 73.42% no which is a bit imbalance.
The labels in this parameter is a categorical of “No” and “Yes”, as our analysis focused on the customer who churned, we could change the “No” to 0 and “Yes” to 1
df_churn['Churn'] = np.where(df_churn['Churn'] == 'Yes', 1, 0)
df_churn.head()## gender SeniorCitizen Partner ... MonthlyCharges TotalCharges Churn
## 0 Female 0.0 Yes ... 29.85 29.85 0
## 1 Male 0.0 No ... 56.95 1889.50 0
## 2 Male 0.0 No ... 53.85 108.15 1
## 3 Male 0.0 No ... 42.30 1840.75 0
## 4 Female 0.0 No ... 70.70 151.65 1
##
## [5 rows x 20 columns]
p1 <- py$df_churn %>%
ggplot(aes(x=gender)) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution", y='Frequency')
p2 <- py$df_churn %>%
ggplot(aes(x=gender, fill = as.factor(Churn))) + geom_bar(position='fill') + labs(title = "Normalized Distribution", y='Percentage')
p3 <- py$df_churn %>%
ggplot(aes(x=as.factor(SeniorCitizen))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(y='Frequency', x='Senior Citizen')
p4 <- py$df_churn %>%
ggplot(aes(x=as.factor(SeniorCitizen), fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage',x='Senior Citizen')
p5 <- py$df_churn %>%
ggplot(aes(x=as.factor(Partner))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(y='Frequency', x='Partner')
p6 <- py$df_churn %>%
ggplot(aes(x=Partner, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage',x='Partner')
p7 <- py$df_churn %>%
ggplot(aes(x=as.factor(Dependents))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(y='Frequency', x='Dependents')
p8 <- py$df_churn %>%
ggplot(aes(x=Dependents, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage',x='Dependents')
grid.arrange(p1, p2, p3, p4,p5,p6,p7,p8, nrow = 4,top=textGrob("Ditribution of Churn Demographic Information"))From this graph we could expect SeniorCitizen to be an
important class and gender to not be an important
class.
p1 <- py$df_churn %>%
ggplot(aes(y=PaymentMethod)) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution", x='Frequency')
p2 <- py$df_churn %>%
ggplot(aes(y=PaymentMethod, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(title = "Normalized Distribution", x='Percentage')
p3 <- py$df_churn %>%
ggplot(aes(y=as.factor(PaperlessBilling))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution", x='Frequency', y='Paperless Billing')
p4 <- py$df_churn %>%
ggplot(aes(y=as.factor(PaperlessBilling), fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(title = "Normalized Distribution", x='Percentage',y='Paperless Billing')
p5 <- py$df_churn %>%
ggplot(aes(y=as.factor(Contract))) + geom_bar(aes(fill=as.factor(Churn)),show.legend=FALSE) + labs(title = "Frequency Distribution",x='Frequency', y='Contract')
p6 <- py$df_churn %>%
ggplot(aes(y=Contract, fill = as.factor(Churn) )) + geom_bar(position='fill',show.legend=FALSE) + labs(title = "Normalized Distribution", x='Percentage',y='Contract')
grid.arrange(p1, p3, p5, p2,p4,p6, nrow = 2,top=textGrob("Ditribution of Churn Categorical Information"))From this graph we see that customers who have month-to-month contract, paperless billing, and electronic check tends to churn more than the other. We could expect some important feature from these variable.
p1 <- py$df_churn %>%
ggplot(aes(x=tenure, fill=as.factor(Churn))) + geom_histogram(position='identity',alpha=0.6,bins=20)
p2 <- py$df_churn %>%
ggplot(aes(x=MonthlyCharges, fill=as.factor(Churn))) + geom_histogram(position='identity',alpha=0.6,bins=20)
p3 <- py$df_churn %>%
ggplot(aes(x=TotalCharges, fill=as.factor(Churn))) + geom_histogram(position='identity',alpha=0.6,bins=20)
grid.arrange(p1, p2, p3, nrow = 2,top=textGrob("Ditribution of Churn Customer Information"))
From this graph we see that people with high tenure and total charges
tends to not churn, and people with high monthly charge tends to
churn.
p1 <- py$df_churn %>%
ggplot(aes(x=PhoneService, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p2 <- py$df_churn %>%
ggplot(aes(x=MultipleLines, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p3 <- py$df_churn %>%
ggplot(aes(x=InternetService, fill = as.factor(Churn))) + geom_bar(position='fill') + labs(y='Percentage')
p4 <- py$df_churn %>%
ggplot(aes(x=OnlineSecurity, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p5 <- py$df_churn %>%
ggplot(aes(x=OnlineBackup, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p6 <- py$df_churn %>%
ggplot(aes(x=DeviceProtection, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p7 <- py$df_churn %>%
ggplot(aes(x=TechSupport, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p8 <- py$df_churn %>%
ggplot(aes(x=StreamingTV, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
p9 <- py$df_churn %>%
ggplot(aes(x=StreamingMovies, fill = as.factor(Churn))) + geom_bar(position='fill',show.legend=FALSE) + labs(y='Percentage')
grid.arrange(p1, p2, p3, p4,p5,p6,p7,p8,p9, nrow = 3,top=textGrob("Ditribution of Churn Services Information"))From this graph we could expect no important feature from PhoneServices and MultipleLines. Customers with Fibre Optic internet services, no online security, or no tech support tends to churn more than the other. People with no internet service tends to not churn.
py$df_churn %>% select(Churn) %>% mutate_if(is.character, as.factor) %>% summary()## Churn
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2658
## 3rd Qu.:1.0000
## Max. :1.0000
py$df_churn %>% select_if(is.character) %>% mutate_if(is.character, as.factor) %>% summary()## gender Partner Dependents PhoneService MultipleLines
## Female:3483 No :3639 No :4933 No : 680 No :3385
## Male :3549 Yes:3393 Yes:2099 Yes:6352 No phone service: 680
## Yes :2967
##
## InternetService OnlineSecurity OnlineBackup
## DSL :2416 No :3497 No :3087
## Fiber optic:3096 No internet service:1520 No internet service:1520
## No :1520 Yes :2015 Yes :2425
##
## DeviceProtection TechSupport
## No :3094 No :3472
## No internet service:1520 No internet service:1520
## Yes :2418 Yes :2040
##
## StreamingTV StreamingMovies Contract
## No :2809 No :2781 Month-to-month:3875
## No internet service:1520 No internet service:1520 One year :1472
## Yes :2703 Yes :2731 Two year :1685
##
## PaperlessBilling PaymentMethod
## No :2864 Bank transfer (automatic):1542
## Yes:4168 Credit card (automatic) :1521
## Electronic check :2365
## Mailed check :1604
py$cat <- py$df_churn %>% select_if(is.character) %>% colnames()all the categorical variables have no explicit direction, so we could use OneHotEncoder for all of those variables.
py$df_churn %>% select_if(is.numeric) %>% summary()## SeniorCitizen tenure MonthlyCharges TotalCharges
## Min. :0.0000 Min. : 1.00 Min. : 18.25 Min. : 18.8
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 35.59 1st Qu.: 401.4
## Median :0.0000 Median :29.00 Median : 70.35 Median :1397.5
## Mean :0.1624 Mean :32.42 Mean : 64.80 Mean :2283.3
## 3rd Qu.:0.0000 3rd Qu.:55.00 3rd Qu.: 89.86 3rd Qu.:3794.7
## Max. :1.0000 Max. :72.00 Max. :118.75 Max. :8684.8
## Churn
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2658
## 3rd Qu.:1.0000
## Max. :1.0000
py$num <- py$df_churn %>% select(SeniorCitizen,tenure,MonthlyCharges,TotalCharges) %>% colnames()py$df_churn %>% select_if(is.numeric) %>% boxplot()py$df_churn %>% select(SeniorCitizen, tenure, MonthlyCharges) %>% boxplot()From the boxplot, we see that TotalCharges, tenure, and MonthlyCharges have a significantly different range, so we could scaling on those variables.
For this analysis, we split the data into 75% training set and 25% test set. we also use stratification to mitigate imbalance in y.
X = df_churn.drop(columns='Churn')
y = df_churn['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)print(f'''
X_train.shape = {X_train.shape}
y_train.shape = {y_train.shape}
X_test.shape = {X_test.shape}
y_test.shape = {y_test.shape}
''')##
## X_train.shape = (5274, 19)
## y_train.shape = (5274,)
## X_test.shape = (1758, 19)
## y_test.shape = (1758,)
We use OneHotEncoder for every categorical features and
minmaxscaler for numeric features.
OneHot = OneHotEncoder(handle_unknown='ignore',drop='first')
minmaxscaler = MinMaxScaler()
ct = ColumnTransformer([('onehot',OneHot,cat),
('minmax',minmaxscaler,num)],remainder='passthrough')
X_train_trans = ct.fit_transform(X_train)
X_test_trans = ct.transform(X_test)X_train_trans.shape## (5274, 30)
For this analysis, we use multiple binary classification algorithm and choose which one give the best result and use that as our base model. All of the model are trained using its default hyperparameter. For this analysis, we could use f1 as our evaluation metrics because we want to maximize the focused customer. Also, we use cross validation by splitting the training set into 5 sets.
At the same time, we could also test which resampling method could maximize f1 score.
model_tree = DecisionTreeClassifier(random_state=2023)
model_lr = LogisticRegression(random_state=2023,max_iter=10000)
model_rf = RandomForestClassifier(random_state=2023)
model_xgb = XGBClassifier(random_state=2023,learning_rate= 0.2, max_depth= 6, min_child_weight= 100, n_estimators= 100)
model_svc = SVC(random_state=2023,probability=True)
model_knn = KNeighborsClassifier()
model_lgb = LGBMClassifier(random_state=2023)
smote = SMOTE(random_state=2023)
rus= RandomUnderSampler(random_state=2023)
estimator=Pipeline([
('resample', None),
('model', model_lr)
])
models={'resample':[None,smote,rus],
'model':[model_lr,model_tree,model_rf,model_xgb,model_svc,model_knn,model_lgb]}
gs_models=GridSearchCV(estimator,
param_grid=models,
cv=5,
scoring='f1')gs_models.fit(X_train_trans,y_train)GridSearchCV(cv=5,
estimator=Pipeline(steps=[('resample', None),
('model',
LogisticRegression(max_iter=10000,
random_state=2023))]),
param_grid={'model': [LogisticRegression(max_iter=10000,
random_state=2023),
DecisionTreeClassifier(random_state=2023),
RandomForestClassifier(random_state=2023),
XGBClassifier(base_score=None, booster=None,
callbacks=None,
colsamp...
max_depth=6, max_leaves=None,
min_child_weight=100,
missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
predictor=None,
random_state=2023, ...),
SVC(probability=True, random_state=2023),
KNeighborsClassifier(),
LGBMClassifier(random_state=2023)],
'resample': [None, SMOTE(random_state=2023),
RandomUnderSampler(random_state=2023)]},
scoring='f1')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GridSearchCV(cv=5,
estimator=Pipeline(steps=[('resample', None),
('model',
LogisticRegression(max_iter=10000,
random_state=2023))]),
param_grid={'model': [LogisticRegression(max_iter=10000,
random_state=2023),
DecisionTreeClassifier(random_state=2023),
RandomForestClassifier(random_state=2023),
XGBClassifier(base_score=None, booster=None,
callbacks=None,
colsamp...
max_depth=6, max_leaves=None,
min_child_weight=100,
missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
predictor=None,
random_state=2023, ...),
SVC(probability=True, random_state=2023),
KNeighborsClassifier(),
LGBMClassifier(random_state=2023)],
'resample': [None, SMOTE(random_state=2023),
RandomUnderSampler(random_state=2023)]},
scoring='f1')Pipeline(steps=[('resample', None),
('model',
LogisticRegression(max_iter=10000, random_state=2023))])None
LogisticRegression(max_iter=10000, random_state=2023)
results = pd.concat([pd.DataFrame(gs_models.cv_results_["params"]),pd.DataFrame(gs_models.cv_results_["mean_test_score"], columns=["f1"])],axis=1)
results## model ... f1
## 0 LogisticRegression(max_iter=10000, random_stat... ... 0.589727
## 1 LogisticRegression(max_iter=10000, random_stat... ... 0.633082
## 2 LogisticRegression(max_iter=10000, random_stat... ... 0.630899
## 3 DecisionTreeClassifier(random_state=2023) ... 0.499511
## 4 DecisionTreeClassifier(random_state=2023) ... 0.519894
## 5 DecisionTreeClassifier(random_state=2023) ... 0.528526
## 6 RandomForestClassifier(random_state=2023) ... 0.553539
## 7 RandomForestClassifier(random_state=2023) ... 0.598985
## 8 RandomForestClassifier(random_state=2023) ... 0.614012
## 9 XGBClassifier(base_score=None, booster=None, c... ... 0.598545
## 10 XGBClassifier(base_score=None, booster=None, c... ... 0.622743
## 11 XGBClassifier(base_score=None, booster=None, c... ... 0.607400
## 12 SVC(probability=True, random_state=2023) ... 0.570108
## 13 SVC(probability=True, random_state=2023) ... 0.619906
## 14 SVC(probability=True, random_state=2023) ... 0.619119
## 15 KNeighborsClassifier() ... 0.534968
## 16 KNeighborsClassifier() ... 0.552836
## 17 KNeighborsClassifier() ... 0.584259
## 18 LGBMClassifier(random_state=2023) ... 0.575133
## 19 LGBMClassifier(random_state=2023) ... 0.601645
## 20 LGBMClassifier(random_state=2023) ... 0.611686
##
## [21 rows x 3 columns]
py$results %>% data.frame() %>% arrange(-f1)from the base model testing, we see that Logistic Regression with SMOTE give the best results on f1 score. We will use Logistic Regression Model as our base model to then tune it to get the best result.
# Before tuning
base_model = gs_models.best_estimator_
cv_results = cross_validate(base_model,X_train_trans,y_train,cv=5,scoring=['f1','recall'])
print(f'''cross validation before tuning:
rerata fit_time = {round(cv_results['fit_time'].mean(),3)}
std fit_time = {round(cv_results['fit_time'].std(),3)}
rerata f1 = {round(cv_results['test_f1'].mean(),3)}
std f1 = {round(cv_results['test_f1'].std(),3)}
rerata recall = {round(cv_results['test_recall'].mean(),3)}
std recall = {round(cv_results['test_recall'].std(),3)}
''')## cross validation before tuning:
## rerata fit_time = 0.101
## std fit_time = 0.009
## rerata f1 = 0.633
## std f1 = 0.013
## rerata recall = 0.788
## std recall = 0.02
# Hyperparameter Tuning
hyperparam_space={'model__penalty' : ['l1', 'l2','elasticnet'],
'model__solver': ['lbfgs', 'liblinear'],
'model__C': [0.1, 1, 10]}
grid_search=GridSearchCV(base_model,
param_grid=hyperparam_space,
cv=5,
scoring='f1')grid_search.fit(X_train_trans,y_train)GridSearchCV(cv=5,
estimator=Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(max_iter=10000,
random_state=2023))]),
param_grid={'model__C': [0.1, 1, 10],
'model__penalty': ['l1', 'l2', 'elasticnet'],
'model__solver': ['lbfgs', 'liblinear']},
scoring='f1')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GridSearchCV(cv=5,
estimator=Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(max_iter=10000,
random_state=2023))]),
param_grid={'model__C': [0.1, 1, 10],
'model__penalty': ['l1', 'l2', 'elasticnet'],
'model__solver': ['lbfgs', 'liblinear']},
scoring='f1')Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(max_iter=10000, random_state=2023))])SMOTE(random_state=2023)
LogisticRegression(max_iter=10000, random_state=2023)
print(f'''
grid_search.best_score_ = {grid_search.best_score_}
grid_search.best_params_ = {grid_search.best_params_}
''')##
## grid_search.best_score_ = 0.634830093389831
## grid_search.best_params_ = {'model__C': 1, 'model__penalty': 'l1', 'model__solver': 'liblinear'}
best_model=grid_search.best_estimator_
cv_results = cross_validate(best_model,X_train_trans,y_train,cv=5,scoring=['f1','recall'])
print(f'''cross validation before tuning:
rerata fit_time = {round(cv_results['fit_time'].mean(),3)}
std fit_time = {round(cv_results['fit_time'].std(),3)}
rerata f1 = {round(cv_results['test_f1'].mean(),3)}
std f1 = {round(cv_results['test_f1'].std(),3)}
rerata recall = {round(cv_results['test_recall'].mean(),3)}
std recall = {round(cv_results['test_recall'].std(),3)}
''')## cross validation before tuning:
## rerata fit_time = 0.086
## std fit_time = 0.008
## rerata f1 = 0.635
## std f1 = 0.014
## rerata recall = 0.792
## std recall = 0.024
base_model = gs_models.best_estimator_
base_model.fit(X_train_trans,y_train)Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(max_iter=10000, random_state=2023))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(max_iter=10000, random_state=2023))])SMOTE(random_state=2023)
LogisticRegression(max_iter=10000, random_state=2023)
print(classification_report(y_test,base_model.predict(X_test_trans)))## precision recall f1-score support
##
## 0 0.90 0.72 0.80 1291
## 1 0.51 0.78 0.61 467
##
## accuracy 0.74 1758
## macro avg 0.70 0.75 0.71 1758
## weighted avg 0.80 0.74 0.75 1758
# Prediction results - base model
pred_results = pd.DataFrame({'y_test':y_test,
'y_pred_proba':base_model.predict_proba(X_test_trans)[:,1],
'y_pred':base_model.predict(X_test_trans)
}).reset_index()
graph = sns.scatterplot(data=pred_results, x=pred_results.index, y='y_pred_proba', hue='y_test')
graph.axhline(0.5)
sns.set(rc={"figure.figsize":(15, 15)})
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()best_model = grid_search.best_estimator_
best_model.fit(X_train_trans,y_train)Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(C=1, max_iter=10000, penalty='l1',
random_state=2023, solver='liblinear'))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. Pipeline(steps=[('resample', SMOTE(random_state=2023)),
('model',
LogisticRegression(C=1, max_iter=10000, penalty='l1',
random_state=2023, solver='liblinear'))])SMOTE(random_state=2023)
LogisticRegression(C=1, max_iter=10000, penalty='l1', random_state=2023,
solver='liblinear')print(classification_report(y_test,best_model.predict(X_test_trans)))## precision recall f1-score support
##
## 0 0.90 0.72 0.80 1291
## 1 0.50 0.79 0.62 467
##
## accuracy 0.74 1758
## macro avg 0.70 0.75 0.71 1758
## weighted avg 0.80 0.74 0.75 1758
# Prediction results - base model
pred_results = pd.DataFrame({'y_test':y_test,
'y_pred_proba':best_model.predict_proba(X_test_trans)[:,1],
'y_pred':best_model.predict(X_test_trans)
}).reset_index()graph = sns.scatterplot(data=pred_results, x=pred_results.index, y='y_pred_proba', hue='y_test')
graph.axhline(0.5)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()From this result we could see that model after hyperparameter tuning (best_model) gave a slightly better results than the base model, so we could use it as our final model. From the graph, we could see that the model could separate the churn and non churn data. However, there are still many data mixed on the center that could not be separated by the model, hence it’s f1 score is 0.62, which is not high enough to deploy this model.
The recall value 79% means that we could predict 79% of churn occured, and 50% accuracy means that from all the data that we predict as churn (showed in the graph as the dot over 0.5 threshold), 50% are real churn customer and the other 50% are actually a non churn customer, which in this case will be a potential mistargeted audience for the focused retention program (potentially leads to ineffective cost). However, in this analysis we gave more emphasis on retaining potentially churned customer, because it is more costly to get a new customers than to rentent existing customers.
feature_importances = pd.DataFrame({'feature_name' : ct.get_feature_names_out(),
'coef' : best_model.named_steps["model"].coef_.flatten()}).sort_values('coef')py$feature_importances %>% mutate(coef = round(coef,3))ggplot(py$feature_importances, aes(y=reorder(feature_name,coef), x=coef,fill=coef)) +
geom_col() + labs(title = "Feature Importance")The coefficient of each variables are as predicted from EDA.
We are able to build a machine learning model to predict churn. The final model that we use are Logistic Regression Model(C=1, max_iter=10000, penalty=‘l1’,random_state=2023, solver=‘liblinear’) with SMOTE resampling. From our machine learning modeling, we could get f1 score of 0.62, which we think are not high enough to deploy this model.
The recall value 79% means that we could predict 79% of churn occured, and 50% accuracy means that from all the data that we predict as churn (showed in the graph as the dot over 0.5 threshold) is actually a non churn customer, which in this case will be a potential mistargeted audience for the focused retention program (potentially leads to ineffective cost). However, in this analysis we gave more emphasis on retaining potentially churned customer, because it is more costly to get a new customers than to retent existing customers.