PROJECT OBJECTIVE:

Build an ML model to support focused digital marketing by predicting which liability customers are likely to convert to asset customers.

CONTEXT:

Bank XYZ has a growing customer base in which the majority are liability customers (depositors) rather than borrowers (asset customers). The bank wants to expand its borrower base rapidly to bring in more business through loan interest. A campaign the bank ran last quarter showed an average single-digit conversion rate. At the last town hall, the marketing head noted that digital transformation is the core strength of the business strategy and asked how to devise effective campaigns with better target marketing, so as to raise the conversion ratio to double digits on the same budget as the last campaign. As a data scientist, you are asked to develop a machine learning model that identifies potential borrowers to support focused marketing.

DATA DESCRIPTION: The data consists of the following attributes:

  1. ID: Customer ID
  2. Age: Customer’s approximate age.
  3. CustomerSince: Customer of the bank since. [unit is masked]
  4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
  5. ZipCode: Customer’s zip code.
  6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
  7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
  8. Level: A level associated to the customer which is masked by the bank as an IP.
  9. Mortgage: Customer’s mortgage. [unit is masked]
  10. Security: Customer’s security asset with the bank. [unit is masked]
  11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
  12. InternetBanking: Whether the customer uses internet banking.
  13. CreditCard: Whether the customer uses the bank’s credit card.
  14. LoanOnCard: Whether the customer has a loan on the credit card.
#Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree

Load customer data present in CSV file

from zipfile import ZipFile
import urllib.request
from io import BytesIO
folder = urllib.request.urlopen('https://s3.amazonaws.com/projex.dezyre.com/classification-algorithms-for-digital-transformation-in-banking/materials/data.zip')
zipfile = ZipFile(BytesIO(folder.read()))
zipfile.namelist()
## ['data/Data1.csv', 'data/Data2.csv']
# Load customer data present in CSV file
data1 = pd.read_csv(zipfile.open("data/Data1.csv"))
data2 = pd.read_csv(zipfile.open("data/Data2.csv"))

Shape and size of data

print(data1.shape)
## (5000, 8)
print(data2.shape)
## (5000, 7)

Merging the two data frames. Use the pandas merge function to join them on customer ID.

cust_data=data1.merge(data2, how='inner', on='ID')
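
A merge like the one above can be sanity-checked with the `validate` and `indicator` arguments of `merge`. The sketch below uses two small made-up frames (`left` and `right` are hypothetical stand-ins, not the bank data) to show the idea.

```python
import pandas as pd

# Toy stand-ins for data1 and data2 (hypothetical values, not the bank data)
left = pd.DataFrame({"ID": [1, 2, 3], "Age": [25, 40, 58]})
right = pd.DataFrame({"ID": [1, 2, 3], "Mortgage": [0, 155, 0]})

# validate="one_to_one" raises an error if ID is duplicated on either side;
# indicator=True adds a "_merge" column recording where each row came from.
merged = left.merge(right, how="outer", on="ID",
                    validate="one_to_one", indicator=True)

# Rows present in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(merged.shape)
print(len(unmatched))
```

If `unmatched` is empty, an inner merge loses no customers.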

Explore final shape of data

print(cust_data.shape)
## (5000, 14)
# Explore data types
cust_data.dtypes
## ID                       int64
## Age                      int64
## CustomerSince            int64
## HighestSpend             int64
## ZipCode                  int64
## HiddenScore              int64
## MonthlyAverageSpend    float64
## Level                    int64
## Mortgage                 int64
## Security                 int64
## FixedDepositAccount      int64
## InternetBanking          int64
## CreditCard               int64
## LoanOnCard             float64
## dtype: object

Comment: All attributes are stored as numeric types, so no type conversion or encoding is needed here.

# Data description
cust_data.describe().transpose()
##                       count          mean  ...       75%      max
## ID                   5000.0   2500.500000  ...   3750.25   5000.0
## Age                  5000.0     45.338400  ...     55.00     67.0
## CustomerSince        5000.0     20.104600  ...     30.00     43.0
## HighestSpend         5000.0     73.774200  ...     98.00    224.0
## ZipCode              5000.0  93152.503000  ...  94608.00  96651.0
## HiddenScore          5000.0      2.396400  ...      3.00      4.0
## MonthlyAverageSpend  5000.0      1.937938  ...      2.50     10.0
## Level                5000.0      1.881000  ...      3.00      3.0
## Mortgage             5000.0     56.498800  ...    101.00    635.0
## Security             5000.0      0.104400  ...      0.00      1.0
## FixedDepositAccount  5000.0      0.060400  ...      0.00      1.0
## InternetBanking      5000.0      0.596800  ...      1.00      1.0
## CreditCard           5000.0      0.294000  ...      1.00      1.0
## LoanOnCard           4980.0      0.096386  ...      0.00      1.0
## 
## [14 rows x 8 columns]
# Use the pandas dropna function to drop rows that have null values.
cust_data = cust_data.dropna()
cust_data.shape
## (4980, 14)
# Check for null value
cust_data.isnull().sum()
## ID                     0
## Age                    0
## CustomerSince          0
## HighestSpend           0
## ZipCode                0
## HiddenScore            0
## MonthlyAverageSpend    0
## Level                  0
## Mortgage               0
## Security               0
## FixedDepositAccount    0
## InternetBanking        0
## CreditCard             0
## LoanOnCard             0
## dtype: int64

Comment: The LoanOnCard attribute has 20 null values, which is only 0.4% of the data. It is also the target class, so we can’t replace the nulls with the mean or mode. We therefore remove these rows from the dataset.
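
It can help to count the nulls before dropping anything, and to drop only rows where the target is missing. The following is a minimal sketch on a made-up frame (`toy` is hypothetical, not the bank data).

```python
import numpy as np
import pandas as pd

# Made-up frame with a partly missing target column
toy = pd.DataFrame({
    "Age": [30, 41, 52, 29],
    "LoanOnCard": [0.0, np.nan, 1.0, np.nan],
})

# Inspect missing values first
null_counts = toy.isnull().sum()
null_share = toy["LoanOnCard"].isnull().mean() * 100

# Drop only the rows whose target is missing
cleaned = toy.dropna(subset=["LoanOnCard"])

print(null_counts)
print(f"{null_share:.1f}% of target values are missing")
print(cleaned.shape)
```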

#Explore Size after null value removal
cust_data.shape
## (4980, 14)

Exploratory Data Analysis

# Let's explore how the data is distributed by target class.
sns.countplot(x = 'LoanOnCard',  data = cust_data);

This clearly shows that the data is highly imbalanced.

Calculate target class data percentage

n_true = len(cust_data.loc[cust_data['LoanOnCard'] == 1.0])
n_false = len(cust_data.loc[cust_data['LoanOnCard'] == 0.0])
print("Number of true cases: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
## Number of true cases: 480 (9.64%)
print("Number of false cases: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))
## Number of false cases: 4500 (90.36%)

Comment: Data imbalance is a typical problem in machine learning. We shall see its impact later when we develop ML models.
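
To see why imbalance matters, consider a trivial "model" that always predicts the majority class. The sketch below uses only the class counts printed above.

```python
# Class counts taken from the output above
n_true, n_false = 480, 4500

# A trivial classifier that always predicts "no loan" (class 0)
baseline_accuracy = n_false / (n_true + n_false)
baseline_recall = 0 / n_true  # it never identifies a single borrower

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall:   {baseline_recall:.4f}")
```

Any model has to beat roughly 90% accuracy before accuracy says anything useful, which is why recall and precision on the minority class matter more here.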

# Scatter plot to see how data points are distributed for "MonthlyAverageSpend" and "HighestSpend" as per target class
g = sns.scatterplot(x="HighestSpend", y="MonthlyAverageSpend", hue="LoanOnCard",
             data=cust_data,legend='full')
g.set(xscale="log")
## [None]
fig, ax = plt.subplots(1, 2)
sns.histplot(cust_data.loc[cust_data.LoanOnCard == 0.0, 'Mortgage'], ax = ax[0])
sns.histplot(cust_data.loc[cust_data.LoanOnCard == 1.0, 'Mortgage'], ax = ax[1])
plt.show()

fig, ax = plt.subplots(1, 2)
sns.histplot(cust_data.loc[cust_data.LoanOnCard == 0.0, 'FixedDepositAccount'], ax = ax[0])
sns.histplot(cust_data.loc[cust_data.LoanOnCard == 1.0, 'FixedDepositAccount'], ax = ax[1])
plt.show()

columns = list(cust_data)[0:-1] # All columns except the last one (the target, LoanOnCard)
cust_data[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); 
# Histograms of the 13 non-target columns
## array([[<AxesSubplot: title={'center': 'ID'}>,
##         <AxesSubplot: title={'center': 'Age'}>],
##        [<AxesSubplot: title={'center': 'CustomerSince'}>,
##         <AxesSubplot: title={'center': 'HighestSpend'}>],
##        [<AxesSubplot: title={'center': 'ZipCode'}>,
##         <AxesSubplot: title={'center': 'HiddenScore'}>],
##        [<AxesSubplot: title={'center': 'MonthlyAverageSpend'}>,
##         <AxesSubplot: title={'center': 'Level'}>],
##        [<AxesSubplot: title={'center': 'Mortgage'}>,
##         <AxesSubplot: title={'center': 'Security'}>],
##        [<AxesSubplot: title={'center': 'FixedDepositAccount'}>,
##         <AxesSubplot: title={'center': 'InternetBanking'}>],
##        [<AxesSubplot: title={'center': 'CreditCard'}>, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >],
##        [<AxesSubplot: >, <AxesSubplot: >]], dtype=object)
sns.pairplot(cust_data, height=3, hue = 'LoanOnCard')

ZipCode shows no meaningful relationship with the other variables or with the target, so we drop it from the list of independent variables.

Age and CustomerSince appear to carry similar information. We will verify this through correlation analysis.

cust_data = cust_data.drop(columns='ZipCode')
#Correlation analysis
corr = cust_data.corr()
corr
##                            ID       Age  ...  CreditCard  LoanOnCard
## ID                   1.000000 -0.010682  ...    0.015741   -0.027188
## Age                 -0.010682  1.000000  ...    0.007344   -0.008147
## CustomerSince       -0.010366  0.994208  ...    0.008779   -0.007801
## HighestSpend        -0.020739 -0.054951  ...   -0.002780    0.502626
## HiddenScore         -0.015721 -0.045289  ...    0.010784    0.061761
## MonthlyAverageSpend -0.026419 -0.051896  ...   -0.006577    0.366912
## Level                0.021763  0.042750  ...   -0.011766    0.137010
## Mortgage            -0.015546 -0.013272  ...   -0.007600    0.141947
## Security            -0.017160  0.000323  ...   -0.014518    0.021982
## FixedDepositAccount -0.008690  0.007744  ...    0.278924    0.316131
## InternetBanking     -0.003940  0.011227  ...    0.004960    0.006034
## CreditCard           0.015741  0.007344  ...    1.000000    0.002536
## LoanOnCard          -0.027188 -0.008147  ...    0.002536    1.000000
## 
## [13 rows x 13 columns]
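
A correlation table of this size can also be scanned programmatically for strongly correlated pairs. The sketch below runs on a small made-up frame (`toy` and the 0.9 threshold are assumptions for illustration, not values from the bank data).

```python
import numpy as np
import pandas as pd

# Made-up data in which CustomerSince tracks Age almost linearly
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65],
    "CustomerSince": [1, 10, 21, 30, 40],
    "Mortgage": [0, 120, 0, 80, 10],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once (diagonal excluded)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> |r| and keep the strong pairs
pairs = upper.stack()
high = pairs[pairs > 0.9]
print(high)
```
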
#heatmap
fig,ax = plt.subplots(figsize=(10, 10))   
sns.heatmap(cust_data.corr(), ax=ax, annot=True, linewidths=0.05, fmt= '.2f',cmap="magma") # the color intensity is based on 
plt.show()

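As “Age” and “CustomerSince” are highly correlated (0.99), one of them could be dropped. The drop of “Age” is left commented out below, so both columns remain in the data used for modelling.

# Note: without parentheses this displays the bound method; call cust_data.info() for the summary
cust_data.info

# cust_data = cust_data.drop(columns='Age')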
## <bound method DataFrame.info of         ID  Age  CustomerSince  ...  InternetBanking  CreditCard  LoanOnCard
## 9       10   34              9  ...                0           0         1.0
## 10      11   65             39  ...                0           0         0.0
## 11      12   29              5  ...                1           0         0.0
## 12      13   48             23  ...                0           0         0.0
## 13      14   59             32  ...                1           0         0.0
## ...    ...  ...            ...  ...              ...         ...         ...
## 4995  4996   29              3  ...                1           0         0.0
## 4996  4997   30              4  ...                1           0         0.0
## 4997  4998   63             39  ...                0           0         0.0
## 4998  4999   65             40  ...                1           0         0.0
## 4999  5000   28              4  ...                1           1         0.0
## 
## [4980 rows x 13 columns]>
cust_data.shape
## (4980, 13)
cust_data.head(10)
##     ID  Age  CustomerSince  ...  InternetBanking  CreditCard  LoanOnCard
## 9   10   34              9  ...                0           0         1.0
## 10  11   65             39  ...                0           0         0.0
## 11  12   29              5  ...                1           0         0.0
## 12  13   48             23  ...                0           0         0.0
## 13  14   59             32  ...                1           0         0.0
## 14  15   67             41  ...                0           0         0.0
## 15  16   60             30  ...                1           1         0.0
## 16  17   38             14  ...                0           0         1.0
## 17  18   42             18  ...                0           0         0.0
## 18  19   46             21  ...                0           0         1.0
## 
## [10 rows x 13 columns]

Splitting the data

We will use 70% of data for training and 30% for testing.

from sklearn.model_selection import train_test_split

X = cust_data.drop('LoanOnCard',axis=1)     # Predictor feature columns (12 columns, including ID)
Y = cust_data['LoanOnCard']   # Predicted class (1=True, 0=False) (1 X m)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
# 1 is just any random seed number

x_train.head()
##         ID  Age  ...  InternetBanking  CreditCard
## 1479  1480   28  ...                0           0
## 1727  1728   52  ...                0           1
## 2843  2844   27  ...                1           1
## 4106  4107   48  ...                0           0
## 1768  1769   43  ...                0           0
## 
## [5 rows x 12 columns]
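
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class ratios. A hedged alternative is to pass `stratify` to `train_test_split`. The sketch below uses made-up labels with a 90/10 imbalance, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with a 90/10 imbalance and a dummy feature column
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(y_tr.mean(), y_te.mean())
```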

Logistic Regression

# import model and metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, precision_score, recall_score, f1_score
# Fit the model on train
model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)
#predict on test
y_predict = model.predict(x_test)
coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
##          0         1         2  ...        10        11  intercept
## 0 -0.00004 -0.469698  0.464513  ... -0.572026 -0.912256  -0.537507
## 
## [1 rows x 13 columns]
model_score = model.score(x_test, y_test)
print(model_score)
## 0.9437751004016064
# performance
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.9437751004016064
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1321   20]
##  [  64   89]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.7833925516515331
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.5816993464052288
print(f'Precision score: {precision_score(y_test,y_predict)}')
## Precision score: 0.8165137614678899
print(f'f1 score: {f1_score(y_test,y_predict)}')
## f1 score: 0.6793893129770994

For the minority class, the model above predicts only 89 of 153 cases correctly. Although the accuracy is high, this is still not a good model. We need to handle the imbalanced data.
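
The scores printed above follow directly from the confusion matrix. The sketch below recomputes them by hand. It also shows that `roc_auc_score` called on hard 0/1 predictions (as done here) equals the average of recall and specificity, that is, balanced accuracy, rather than a threshold-free AUC computed from predicted probabilities.

```python
# Entries of the logistic regression confusion matrix printed above
tn, fp, fn, tp = 1321, 20, 64, 89

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # share of actual borrowers found
precision = tp / (tp + fp)       # share of predicted borrowers that are real
f1 = 2 * precision * recall / (precision + recall)

# With hard labels the ROC curve has a single interior point, so its area is
# the mean of recall (TPR) and specificity (TNR)
specificity = tn / (tn + fp)
balanced = (recall + specificity) / 2

print(accuracy, recall, precision, f1, balanced)
```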

Weighted Logistic Regression to handle class imbalance

# define class weights
w = {0:1, 1:2}

# Fit the model on train
model_weighted = LogisticRegression(solver="liblinear", class_weight=w)
model_weighted.fit(x_train, y_train)
#predict on test
y_predict = model_weighted.predict(x_test)
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.9390896921017403
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1301   40]
##  [  51  102]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.8184190902311708
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.6666666666666666
print(f'Precision score: {precision_score(y_test,y_predict)}')
## Precision score: 0.7183098591549296
print(f'f1 score: {f1_score(y_test,y_predict)}')
## f1 score: 0.6915254237288136

Although the accuracy decreases slightly, AUC and recall both increase, so this is the better model. Hence we select “model_weighted”.
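
The weights `{0:1, 1:2}` were chosen by hand. For comparison, scikit-learn's `class_weight='balanced'` option uses the heuristic `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the full-data class counts as an illustration. In practice scikit-learn would compute it from the training labels passed to `fit`.

```python
# Full-data class counts from the earlier output (illustration only)
n_samples, n_classes = 4980, 2
counts = {0: 4500, 1: 480}

# "balanced" heuristic: n_samples / (n_classes * class_count)
balanced_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Relative weight of a minority example versus a majority example
ratio = balanced_weights[1] / balanced_weights[0]

print(balanced_weights)
print(ratio)
```

This relative weight is far larger than the 2:1 used above, so it would likely trade more precision for recall.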

Train Naive Bayes algorithm

from sklearn.naive_bayes import GaussianNB # using Gaussian algorithm from Naive Bayes

# create the model
diab_model = GaussianNB()

diab_model.fit(x_train, y_train)

Performance with training data

diab_train_predict = diab_model.predict(x_train)

from sklearn import metrics

print("Model Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, diab_train_predict)))
## Model Accuracy: 0.8907
print()

Performance with testing data

y_predict = diab_model.predict(x_test)

from sklearn import metrics

print("Model Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, y_predict)))
## Model Accuracy: 0.8829
print()
# performance
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.8828647925033467
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1232  109]
##  [  66   87]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.7436724130368032
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.5686274509803921

Use of class prior for imbalanced data

diab_model_cp = GaussianNB(priors=[0.1, 0.9])
#diab_model.class_prior_ = [0.9, 0.1]
diab_model_cp.fit(x_train, y_train.ravel())
y_predict = diab_model_cp.predict(x_test)
# performance
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.8159303882195449
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1078  263]
##  [  12  141]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.8627231653287712
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.9215686274509803
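
The `priors` list follows the order of the class labels, so `priors=[0.1, 0.9]` gives class 0 a prior of 0.1 and class 1 a prior of 0.9. This deliberately favours the minority class. The sketch below shows the effect with Bayes' rule for a single customer. The likelihood values are hypothetical.

```python
def posterior_positive(prior_neg, prior_pos, lik_neg, lik_pos):
    """Bayes' rule for two classes: returns P(class 1 | x)."""
    evidence = prior_neg * lik_neg + prior_pos * lik_pos
    return prior_pos * lik_pos / evidence

# Hypothetical likelihoods p(x | class) for one customer
lik_neg, lik_pos = 0.6, 0.4

# Priors close to the observed class frequencies
p_data = posterior_positive(0.9, 0.1, lik_neg, lik_pos)

# Priors used above, favouring the minority class
p_flipped = posterior_positive(0.1, 0.9, lik_neg, lik_pos)

print(p_data, p_flipped)
```

The same evidence yields a "borrower" prediction only under the second set of priors, which is consistent with the jump in recall and the larger number of false positives above.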

Support Vector Machines

from sklearn import svm
clf = svm.SVC(gamma=0.25, C=10)
clf.fit(x_train , y_train)
y_predict = clf.predict(x_test)
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.8975903614457831
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1341    0]
##  [ 153    0]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.5
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.0
print(f'Precision score: {precision_score(y_test,y_predict)}')
## Precision score: 0.0
## 
## C:\Users\Erick Yegon\AppData\Roaming\Python\Python39\site-packages\sklearn\metrics\_classification.py:1334: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
##   _warn_prf(average, modifier, msg_start, len(result))
print(f'f1 score: {f1_score(y_test,y_predict)}')
## f1 score: 0.0
from scipy.stats import zscore
XScaled  = X.apply(zscore)  # convert all attributes to Z scale 
XScaled.describe()
##                  ID           Age  ...  InternetBanking    CreditCard
## count  4.980000e+03  4.980000e+03  ...     4.980000e+03  4.980000e+03
## mean  -9.131473e-17 -9.488171e-17  ...     6.705925e-17 -9.060133e-17
## std    1.000100e+00  1.000100e+00  ...     1.000100e+00  1.000100e+00
## min   -1.738927e+00 -1.949969e+00  ...    -1.217601e+00 -6.459012e-01
## 25%   -8.655847e-01 -9.031279e-01  ...    -1.217601e+00 -6.459012e-01
## 50%    1.075332e-04 -3.076058e-02  ...     8.212871e-01 -6.459012e-01
## 75%    8.657997e-01  8.416067e-01  ...     8.212871e-01  1.548224e+00
## max    1.731492e+00  1.888448e+00  ...     8.212871e-01  1.548224e+00
## 
## [8 rows x 12 columns]
x_trains, x_tests, y_trains, y_tests = train_test_split(XScaled, Y, test_size=0.3, random_state=1)
clf = svm.SVC(gamma=0.25, C=10)
clf.fit(x_trains , y_trains)
y_predicts = clf.predict(x_tests)
print(f'Accuracy Score: {accuracy_score(y_tests,y_predicts)}')
## Accuracy Score: 0.9692101740294511
print(f'Confusion Matrix: \n{confusion_matrix(y_tests, y_predicts)}')
## Confusion Matrix: 
## [[1333    8]
##  [  38  115]]
print(f'Area Under Curve: {roc_auc_score(y_tests, y_predicts)}')
## Area Under Curve: 0.8728341448436198
print(f'Recall score: {recall_score(y_tests,y_predicts)}')
## Recall score: 0.7516339869281046
print(f'Precision score: {precision_score(y_tests,y_predicts)}')
## Precision score: 0.9349593495934959
print(f'f1 score: {f1_score(y_tests,y_predicts)}')
## f1 score: 0.8333333333333333
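
Note that the z-scores above are computed on all rows before the split, so test-set statistics influence the scaling. A hedged alternative is to put the scaler and the SVM in one pipeline, so that the scaler is fitted on the training rows only. The sketch below uses synthetic data from `make_classification`, not the bank data, and swaps `zscore` for `StandardScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.9, 0.1], random_state=1)
X[:, 0] *= 1000  # give one feature a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# The scaler learns its mean and std from X_tr only, then reuses them on X_te
scaled_svm = make_pipeline(StandardScaler(), SVC(gamma=0.25, C=10))
scaled_svm.fit(X_tr, y_tr)

test_accuracy = scaled_svm.score(X_te, y_te)
print(test_accuracy)
```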

Decision Tree Classifier

# Build decision tree model
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(x_train, y_train)
# Scoring our DT
print(dTree.score(x_train, y_train))
## 1.0
print(dTree.score(x_test, y_test))
## 0.9792503346720214
y_predict = dTree.predict(x_test)
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.9792503346720214
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1327   14]
##  [  17  136]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.9392244593586876
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.8888888888888888
print(f'Precision score: {precision_score(y_test,y_predict)}')
## Precision score: 0.9066666666666666
print(f'f1 score: {f1_score(y_test,y_predict)}')
## f1 score: 0.8976897689768976
#Reducing over fitting (Regularization)
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 5, random_state=1)
dTreeR.fit(x_train, y_train)
print(dTreeR.score(x_train, y_train))
## 0.9896729776247849
print(dTreeR.score(x_test, y_test))
## 0.9832663989290495
y_predictR = dTreeR.predict(x_test)
print(f'Accuracy Score: {accuracy_score(y_test,y_predictR)}')
## Accuracy Score: 0.9832663989290495
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predictR)}')
## Confusion Matrix: 
## [[1335    6]
##  [  19  134]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predictR)}')
## Area Under Curve: 0.9356713602667018
print(f'Recall score: {recall_score(y_test,y_predictR)}')
## Recall score: 0.8758169934640523
print(f'Precision score: {precision_score(y_test,y_predictR)}')
## Precision score: 0.9571428571428572
print(f'f1 score: {f1_score(y_test,y_predictR)}')
## f1 score: 0.9146757679180887
# Decision Tree Visualize
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
dTreeR3 = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1)
dTreeR3.fit(x_train, y_train)
fn = list(x_train)
cn = ['0', '1']
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4, 4), dpi=300)
plot_tree(dTreeR3, feature_names = fn, class_names=cn, filled = True)

fig.savefig('tree.png')

Ensemble Learning: Random forest classifier

from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(random_state=1)
rfcl = rfcl.fit(x_train, y_train)
y_predict = rfcl.predict(x_test)
# performance
print(f'Accuracy Score: {accuracy_score(y_test,y_predict)}')
## Accuracy Score: 0.9846050870147256
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_predict)}')
## Confusion Matrix: 
## [[1339    2]
##  [  21  132]]
print(f'Area Under Curve: {roc_auc_score(y_test, y_predict)}')
## Area Under Curve: 0.9306268368644999
print(f'Recall score: {recall_score(y_test,y_predict)}')
## Recall score: 0.8627450980392157
print(f'Precision score: {precision_score(y_test,y_predict)}')
## Precision score: 0.9850746268656716
print(f'f1 score: {f1_score(y_test,y_predict)}')
## f1 score: 0.9198606271777003

Unbalanced Data Handling

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from collections import Counter
# summarize class distribution
counter = Counter(Y)
print(counter)
## Counter({0.0: 4500, 1.0: 480})
# define pipeline
over = SMOTE(sampling_strategy=0.3,random_state=1) #sampling_strategy=0.1,random_state=1
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [ ('o', over),('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
Xb, Yb = pipeline.fit_resample(XScaled, Y)
# summarize the new class distribution
counter = Counter(Yb)
print(counter)
## Counter({0.0: 2700, 1.0: 1350})
x_trainb, x_testb, y_trainb, y_testb = train_test_split(Xb, Yb, test_size=0.3, random_state=1)
# 1 is just any random seed number
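
One caveat: resampling before the split means the test set also contains synthetic and rebalanced rows, so the scores that follow may be optimistic. A hedged alternative is to split first and resample only the training portion. The sketch below uses made-up data and swaps SMOTE for plain random oversampling via `sklearn.utils.resample`, purely to keep the example small.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.RandomState(1)
X = rng.normal(size=(200, 3))        # made-up features
y = np.array([0] * 180 + [1] * 20)   # 90/10 imbalance

# 1) Split first, so the test set keeps the real class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# 2) Oversample the minority class in the training portion only
X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
X_min_up = resample(X_min, replace=True,
                    n_samples=len(X_maj) // 2, random_state=1)

# 3) Rebuild a training set with a 2:1 majority-to-minority ratio
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print(X_bal.shape, int(y_bal.sum()), int(y_te.sum()))
```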

SVM with balanced Data

clf = svm.SVC(gamma=0.25, C=10)
clf.fit(x_trainb , y_trainb)
y_predictb = clf.predict(x_testb)
# performance
print(f'Accuracy Score: {accuracy_score(y_testb,y_predictb)}')
## Accuracy Score: 0.9827160493827161
print(f'Confusion Matrix: \n{confusion_matrix(y_testb, y_predictb)}')
## Confusion Matrix: 
## [[804  11]
##  [ 10 390]]
print(f'Area Under Curve: {roc_auc_score(y_testb, y_predictb)}')
## Area Under Curve: 0.9807515337423313
print(f'Recall score: {recall_score(y_testb,y_predictb)}')
## Recall score: 0.975
print(f'Precision score: {precision_score(y_testb,y_predictb)}')
## Precision score: 0.972568578553616
print(f'f1 score: {f1_score(y_testb,y_predictb)}')
## f1 score: 0.9737827715355805

Random Forest classifier with Balanced Data

rfcl = RandomForestClassifier(random_state=1)
rfcl = rfcl.fit(x_trainb, y_trainb)
y_predict = rfcl.predict(x_testb)
# performance
print(f'Accuracy Score: {accuracy_score(y_testb,y_predict)}')
## Accuracy Score: 0.9827160493827161
print(f'Confusion Matrix: \n{confusion_matrix(y_testb, y_predict)}')
## Confusion Matrix: 
## [[807   8]
##  [ 13 387]]
print(f'Area Under Curve: {roc_auc_score(y_testb, y_predict)}')
## Area Under Curve: 0.9788420245398772
print(f'Recall score: {recall_score(y_testb,y_predict)}')
## Recall score: 0.9675
print(f'Precision score: {precision_score(y_testb,y_predict)}')
## Precision score: 0.979746835443038
print(f'f1 score: {f1_score(y_testb,y_predict)}')
## f1 score: 0.9735849056603773

Pickle the model

# Pickle model file
import pickle
filename = 'finalized_model.sav'
pickle.dump(rfcl, open(filename, 'wb'))

Load model from pickle file and use

# Checking the pickle model
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.predict(x_testb)
# performance
print(f'Accuracy Score: {accuracy_score(y_testb,result)}')
## Accuracy Score: 0.9827160493827161
print(f'Confusion Matrix: \n{confusion_matrix(y_testb, result)}')
## Confusion Matrix: 
## [[807   8]
##  [ 13 387]]
print(f'Area Under Curve: {roc_auc_score(y_testb, result)}')
## Area Under Curve: 0.9788420245398772
print(f'Recall score: {recall_score(y_testb,result)}')
## Recall score: 0.9675
print(f'Precision score: {precision_score(y_testb,result)}')
## Precision score: 0.979746835443038
print(f'f1 score: {f1_score(y_testb,result)}')
## f1 score: 0.9735849056603773
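
The `open(...)` calls above never close their file handles explicitly. A small sketch of the same save and load round trip using `with` blocks follows. `model_stub` is a hypothetical stand-in for the fitted model, and the file goes to a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted classifier
model_stub = {"name": "rfcl", "params": {"random_state": 1}}

path = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")

# "with" closes the file even if dump/load raises
with open(path, "wb") as fh:
    pickle.dump(model_stub, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model_stub)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.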

Conclusion:

We have built models using logistic regression, naive Bayes, support vector machine, decision tree, and random forest classifiers. This data set is highly imbalanced, so accuracy alone is not a good measure. Hence we have used precision, recall, and AUC to determine the better model.

We used the class weight technique to handle the imbalanced data and observed that model performance improved when class weights were considered.

Scaling/data transformation plays a major role when we work on SVM.
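
One way to see why: the default RBF kernel is exp(-gamma * squared distance). On raw features such as ID, the squared distance between two customers is huge, so the kernel value collapses to zero and every point looks unrelated to every other. The sketch below illustrates this with hypothetical feature values.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

gamma = 0.25

# Two customers described by raw (ID, Age); hypothetical values
k_raw = rbf_kernel([10, 34], [110, 35], gamma)

# Two customers one standard deviation apart on one scaled feature
k_scaled = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma)

print(k_raw, k_scaled)
```

This is consistent with the unscaled SVM predicting only the majority class, while the scaled one separates the classes well.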

We have also explored undersampling and oversampling techniques such as SMOTE to handle data imbalance.

We have also seen how to systematically improve a model.