In this notebook, we will be working with a dataset of client contacts made by a banking institution as part of a product marketing campaign. We will conduct exploratory data analysis (EDA), select a model to try to predict whether a future client contact is likely to lead to a sale, and ensure the data is properly engineered to build that model.
First, we’ll set up RStudio to run Python code by importing the reticulate library and activating a conda environment built specifically for this use case.
knitr::opts_chunk$set(echo = TRUE)
#reticulate allows for python code in R environment
library(reticulate)
#set environment and options specific for reticulate
use_condaenv("r-reticulate", required=TRUE)
Next, we’ll import the python packages we’ll need for future steps:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
And finally, we’ll read in our dataset using pandas’ read_csv function:
df = pd.read_csv('bank+marketing/bank/bank-full.csv',
sep=';')
Let’s get a sense of our columns and data types first:
df.dtypes
## age int64
## job object
## marital object
## education object
## default object
## balance int64
## housing object
## loan object
## contact object
## day int64
## month object
## duration int64
## campaign int64
## pdays int64
## previous int64
## poutcome object
## y object
## dtype: object
This suggests we have 7 numerical features, 9 categorical features, and 1 dependent categorical variable (y). Let’s now examine the numerical features.
describe = df.describe()
describe
## age balance ... pdays previous
## count 45211.000000 45211.000000 ... 45211.000000 45211.000000
## mean 40.936210 1362.272058 ... 40.197828 0.580323
## std 10.618762 3044.765829 ... 100.128746 2.303441
## min 18.000000 -8019.000000 ... -1.000000 0.000000
## 25% 33.000000 72.000000 ... -1.000000 0.000000
## 50% 39.000000 448.000000 ... -1.000000 0.000000
## 75% 48.000000 1428.000000 ... -1.000000 0.000000
## max 95.000000 102127.000000 ... 871.000000 275.000000
##
## [8 rows x 7 columns]
This output provides a topline sense of the distribution of each numerical feature. Before moving on, let’s confirm that there are no nulls among these numerical columns, as determined by the count of each variable matching the length of the dataframe.
describe.loc['count'].unique()[0] == len(df)
## np.True_
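Note that describe() only covers the numerical columns, so this check says nothing about the categorical ones. As a complementary check across every column (a quick sketch, not executed above), we could count missing values directly:
#count missing values in every column, numerical and categorical alike
df.isnull().sum()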
With no nulls to worry about, we’ll still want to look over the values in the data to sanity-check them and ensure that there aren’t stand-in values for nulls, or anything else that doesn’t match our intuitive understanding of the data. Let’s visualize the distributions as histograms.
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Initialize a grid of plots with an Axes for each column in the dataframe
g = sns.FacetGrid(df[describe.columns].melt(var_name='Column', value_name='Value'), col='Column', col_wrap=4, sharex=False, height=3)
# Draw a histogram on each Axes
g = g.map(sns.histplot, 'Value', bins=20, kde=False)
# Add some space between the plots
g.fig.subplots_adjust(wspace=0.5, hspace=0.7)
plt.show()
It appears that age is more or less normally distributed, and day is basically uniform. The other variables show right skews of varying severity, signaling a potential need for outlier removal.
Let’s dig a bit deeper into each variable. Here’s a brief description of the distribution for each numerical feature, based on their histograms and descriptive statistics. We will also take this opportunity to check for outliers, using the function defined here (based on the 1.5 times IQR rule):
def calc_fences(df, col):
    '''For a numerical column, returns lower and upper "fences" based on 1.5 * IQR'''
    data = df[col]
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return low_fence, high_fence
Age: Somewhat normal distribution, with a heavy concentration of people in their 30s and 40s and a long tail skewing lightly to the right.
fences = calc_fences(df, 'age')
df[(df['age']<fences[0])|(df['age']>fences[1])]
## age job marital education ... pdays previous poutcome y
## 29158 83 retired married primary ... -1 0 unknown no
## 29261 75 retired divorced primary ... -1 0 unknown no
## 29263 75 retired married primary ... -1 0 unknown no
## 29322 83 retired married tertiary ... -1 0 unknown no
## 29865 75 retired divorced primary ... -1 0 unknown yes
## ... ... ... ... ... ... ... ... ... ...
## 45163 71 retired married secondary ... 92 3 failure no
## 45191 75 retired divorced tertiary ... 183 1 failure yes
## 45204 73 retired married secondary ... 40 8 failure yes
## 45207 71 retired divorced primary ... -1 0 unknown yes
## 45208 72 retired married secondary ... 184 3 success yes
##
## [487 rows x 17 columns]
While these are technically outliers by the IQR rule, I wouldn’t remove them in this instance. The ages of the clients are plausible (everyone is between 18 and 95), and age may well have relevance for response likelihood.
Balance: The vast majority of balances sit below a few thousand, with a handful of extreme outliers at the upper end and over 3,000 negative balances.
print("98% of balances below {}".format(df['balance'].quantile(0.98)))
## 98% of balances below 9439.3999999999
print("{} balances in the negative".format(len(df[df['balance']<0])))
## 3766 balances in the negative
We can likely retain most of the outliers as the fences themselves are not unreasonable. However, some of the high outliers so far exceed the upper fence that it is likely worth removing or accounting for them.
fences = calc_fences(df, 'balance')
fences[0]
## np.float64(-1962.0)
fences[1]
## np.float64(3462.0)
len(df[df['balance']<fences[0]]['balance'].sort_values())
## 17
len(df[df['balance']>fences[1]]['balance'].sort_values(ascending=False))
## 4712
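To quantify how far the upper tail stretches beyond that fence (the maximum of 102,127 from the summary above is roughly 30 times the upper fence of 3,462), we could inspect a few upper quantiles, for example:
#upper-tail quantiles of balance, to compare against the 3,462 fence
df['balance'].quantile([0.95, 0.99, 0.999])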
Day: Seemingly random: people are contacted on any day throughout the month. While it might make sense to retain this variable, I’m not sure it’s useful to treat it as numerical. By what mechanism would the likelihood of a sign-on map directly onto how early or late in the month it is? Rather, it may make sense to convert it into a categorical variable indicating whether the contact occurred at the beginning, middle, or end of the month (which could relate to things like pay schedules or housing payments); we implement this mapping in the pre-processing section below.
Duration: Calls seem to hover in the range of 180-300 seconds (3-5 minutes). Several evident outliers stretch to over an hour.
fences = calc_fences(df, 'duration')
fences[0]
## np.float64(-221.0)
fences[1]
## np.float64(643.0)
len(df[df['duration']<fences[0]]['duration'])
## 0
len(df[df['duration']>fences[1]]['duration'])
## 3235
As with balance, we can retain some of the upper outlier values, but should draw a line where the duration is unreasonably long and could skew our analysis.
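One natural place to draw that line is a high percentile rather than the IQR fence; a quick sketch of the kind of cutoff we’ll apply in the pre-processing section:
#a percentile-based cutoff is less aggressive than the 643-second fence
df['duration'].quantile(0.99)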
Campaign: For a large majority of contacted customers, this was the first or second contact from the company for this campaign. Thousands of clients received more contacts than that, but fewer and fewer as the number of contacts increases.
fences = calc_fences(df, 'campaign')
fences[0]
## np.float64(-2.0)
fences[1]
## np.float64(6.0)
len(df[df['campaign']<fences[0]]['campaign'].sort_values())
## 0
len(df[df['campaign']>fences[1]]['campaign'].sort_values(ascending=False))
## 3064
P-days: For over 80% of clients, there is no previous campaign contact to count back to. It may be necessary to remove this variable from the data, as it is hard to sensibly quantify the null values. For example, substituting 0 would have the opposite of the intended effect (it would imply the client was contacted extremely recently).
#Percentage of client contacts with "null" val for days since previous
len(df[df['pdays']==-1]) / len(df)
## 0.8173674548229414
#Percentage of client contacts with no prior campaign contacts
len(df[df['previous']==0]) / len(df)
## 0.8173674548229414
Previous: A large majority of clients were not contacted for a previous campaign (as noted above). However, this variable is more usable, since a value of 0 makes numerical sense alongside the other values.
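As a quick sanity check on that relationship (a small sketch on the same dataframe), we can confirm that the -1 placeholder in pdays lines up exactly with clients who have zero prior contacts recorded in previous:
#True only if pdays == -1 for exactly the rows where previous == 0
(df['pdays'].eq(-1) == df['previous'].eq(0)).all()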
Let’s now take a look at our non-numerical (and therefore categorical) columns, filtering out the numerical ones.
for col in df.drop(columns=describe.columns).columns:
    print(df[col].value_counts())
## job
## blue-collar 9732
## management 9458
## technician 7597
## admin. 5171
## services 4154
## retired 2264
## self-employed 1579
## entrepreneur 1487
## unemployed 1303
## housemaid 1240
## student 938
## unknown 288
## Name: count, dtype: int64
## marital
## married 27214
## single 12790
## divorced 5207
## Name: count, dtype: int64
## education
## secondary 23202
## tertiary 13301
## primary 6851
## unknown 1857
## Name: count, dtype: int64
## default
## no 44396
## yes 815
## Name: count, dtype: int64
## housing
## yes 25130
## no 20081
## Name: count, dtype: int64
## loan
## no 37967
## yes 7244
## Name: count, dtype: int64
## contact
## cellular 29285
## unknown 13020
## telephone 2906
## Name: count, dtype: int64
## month
## may 13766
## jul 6895
## aug 6247
## jun 5341
## nov 3970
## apr 2932
## feb 2649
## jan 1403
## oct 738
## sep 579
## mar 477
## dec 214
## Name: count, dtype: int64
## poutcome
## unknown 36959
## failure 4901
## other 1840
## success 1511
## Name: count, dtype: int64
## y
## no 39922
## yes 5289
## Name: count, dtype: int64
As with the numerical variables, we can go variable by variable to understand the distributions across categories.
Job: People with blue-collar jobs are the most common in the data, but not by a wide margin; we see a spread across many different industries and job types.
plt.clf()
df['job'].value_counts().plot(kind='bar')
plt.show()
len(df[df['job']=='unknown'])/len(df)
## 0.006370131162770122
There are a number of “unknown” values here, which are functionally nulls. They constitute a small enough percentage of the data that I think it will be worth removing rows with null jobs.
Marital Status:
The majority of contactees are married, though single and divorced clients make up enough of our population to warrant consideration. The financial commitments associated with marriage and families may impact decision-making and risk tolerance in both directions, and impact how likely someone is to make a major purchase decision in response to a phone call.
plt.clf()
df['marital'].value_counts().plot(kind='bar')
plt.show()
Education: The most common category for educational attainment was secondary education, followed by tertiary.
plt.clf()
df['education'].value_counts().plot(kind='bar')
plt.show()
As with jobs data, we have a small number of null (“unknown”) values here, the rows for which can likely be safely removed given their low prevalence.
df[df['job']=='unknown']['education'].value_counts()
## education
## unknown 127
## secondary 71
## primary 51
## tertiary 39
## Name: count, dtype: int64
Interestingly, it appears many of the null values for job data correspond to further unknowns related to education. Perhaps this is a signal of challenges with data collection, and while the missingness here may itself have value, I still feel comfortable removing null rows given the small amount and their overlap across features.
Default:
plt.clf()
df['default'].value_counts().plot(kind='bar')
plt.show()
Only a small number of clients have credit in default.
#Checking default status against the success of current campaign
df[df['default']=='yes']['y'].value_counts() / len(df[df['default']=='yes'])
## y
## no 0.936196
## yes 0.063804
## Name: count, dtype: float64
At an initial glance, clients in credit default actually appear somewhat less likely to purchase this new product (about 6.4% said yes, versus roughly 11.7% across all contacts). While it’s not the question at hand, the bank may want to consider not only whether a client is likely to purchase the product, but whether it makes financial sense for both buyer and seller to market it to them.
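For reference, the overall conversion rate can be pulled directly (a quick sketch):
#share of 'yes' vs 'no' across all contacts (~11.7% yes, per the counts above)
df['y'].value_counts(normalize=True)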
Housing: Here we have a close to 50/50 split and no null values. It will be interesting to see how this feature plays out, as having a housing payment could be a sign of more advanced financial maturity, but perhaps more responsibility and risk intolerance as well.
plt.clf()
df['housing'].value_counts().plot(kind='bar')
plt.show()
Loan: No nulls, and a much smaller but still meaningful percentage of clients do have personal loans.
plt.clf()
df['loan'].value_counts().plot(kind='bar')
plt.show()
Contact: The means by which the person was contacted, either their home telephone or cellphone.
plt.clf()
df['contact'].value_counts().plot(kind='bar')
plt.show()
This variable has a substantial number of nulls; perhaps the feature is worth removing entirely?
df_contact_test = df[df['contact']!='unknown'][['contact','y']]
df_contact_test.replace({'cellular':0,
'telephone':1,
'no':0,
'yes':1}, inplace = True)
## <string>:2: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_contact_test['contact'].corr(df_contact_test['y'])
## np.float64(-0.012097873408306736)
The correlation, even when accounting for rows with good data, doesn’t appear strong enough to justify keeping a feature with so many nulls.
Month: May is the most common. At a glance this may not be as predictive as other information, but could impact spending. For example, in May many people could have just gotten a tax refund that they want to spend or invest.
plt.clf()
df['month'].value_counts().plot(kind='bar')
plt.show()
P(revious) Outcome: This variable tells us whether a previous campaign was a success for each client contact.
plt.clf()
df['poutcome'].value_counts().plot(kind='bar')
plt.show()
This is a very large number of nulls, but I would hesitate to remove the feature as a result. After all, couldn’t the outcome of a prior campaign be a very important input for future sales?
df[df['poutcome']=='unknown']['previous'].value_counts()
## previous
## 0 36954
## 2 2
## 1 1
## 5 1
## 7 1
## Name: count, dtype: int64
Interesting: it appears that the unknown values almost entirely reflect the fact that there was no previous campaign contact to evaluate. Let’s see whether this could be a useful feature to retain despite the missing values.
#limit to where poutcome is known to be either success or failure
df_poutcome_test = df[df['poutcome'].isin(['success','failure'])][['poutcome','y']]
#convert to binary numeric columns for correlation analysis
df_poutcome_test.replace({'failure':0,
'success':1,
'no':0,
'yes':1}, inplace=True)
## <string>:2: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_poutcome_test['poutcome'].corr(df_poutcome_test['y'])
## np.float64(0.5115443306993206)
That’s a very meaningful correlation. Therefore, my proposal is to convert this variable into a binary category, where the person either has a known prior successful campaign outcome or not. Such a variable would not be as powerful as a definitively known outcome of either success or failure for each client, but it remains a potentially valuable predictor nonetheless.
When choosing a model for this dataset, we have to first recognize that we are dealing with a classification problem. That rules out most forms of regression.
But not all forms! I believe logistic regression would be a great fit here, as it is with many binary classification problems involving a mix of quantitative and categorical features. Logistic regression is a supervised learning method whose output is a probability between 0 and 1. That probability ‘score’ could be valuable for this business problem, as the business may want to adjust the threshold it uses to prioritize which clients to call. The model also yields an interpretable coefficient for each feature, which could be valuable in determining what data to continue collecting and is trackable over time.
In terms of key assumptions for logistic regression: based on all available information, we have independence of observations, as each row represents one client (including prior contacts). We see no concerning multicollinearity, as shown in the pre-processing section below. And when it comes to linearity of the log-odds with respect to the independent variables, we can fit the model and run diagnostics to confirm this before moving forward with the results.
Longer term, data around which clients are not likely to be converted is also useful. The bank can create profiles or “personas” for what contributes to conversion. Perhaps there are certain groups that are not buying the product, but could be convinced based on tweaks to factors that are in the business’s control, like when and how often they make contact.
The pre-processing steps below are based on the prior assessment from EDA and on our model selection.
First, we’re going to cut a couple of features entirely:
- pdays, because it adds little value (the vast majority of clients have no prior contact, which is duplicative with previous)
- contact, because of its high number of nulls and little apparent predictive value
Next, for the following numerical features, we’re going to cut outliers using a 99th-percentile cutoff. I think this is a reasonable cutoff because it limits the culling of our data to at most 1% per feature, and because I assessed the 1.5 * IQR fences to be overly restrictive (they would cut out clients older than about 70 or with a balance above roughly 3,462):
- balance
- duration
- campaign
- previous
For day, we’ll convert it into a categorical variable (early, mid, or late in the month).
For poutcome, we’ll convert it to a binary category indicating known prior success.
#DROPPED COLUMNS
#pdays
df.drop(columns='pdays', inplace = True)
#contact
df.drop(columns='contact', inplace = True)
#OUTLIER CUTOFFS
#set thresholds now so they're not impacted by the order of deletions
balance_q99 = df['balance'].quantile(0.99)
duration_q99 = df['duration'].quantile(0.99)
campaign_q99 = df['campaign'].quantile(0.99)
previous_q99 = df['previous'].quantile(0.99)
#balance
df = df[df['balance'] < balance_q99]
#duration
df = df[df['duration'] < duration_q99]
#campaign
df = df[df['campaign'] < campaign_q99]
#previous
df = df[df['previous'] < previous_q99]
#CUTTING SMALL NUMBER OF ROWS WITH NULLS
#job
df = df[df['job']!='unknown']
#education
df = df[df['education']!='unknown']
#CHANGING CATEGORIES
#day
df['day_cat'] = df['day'].apply(lambda x: 'early' if x < 10 else
'mid' if x < 20 else
'late')
#drop original day col
df.drop(columns='day', inplace = True)
#poutcome
df['known_prior_success'] = df['poutcome'].replace({'success':'yes',
'failure':'no',
'other':'no',
'unknown':'no'})
df.drop(columns = 'poutcome', inplace = True)
plt.figure(figsize=(10, 8)) # Adjust size as needed
sns.heatmap(df.select_dtypes(include=np.number).corr(),
annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()
The numerical features show no strong pairwise correlations, which is valuable to know because it means they won’t be duplicative in the model.
Now let’s create dummy variables for the categorical features.
df1 = pd.get_dummies(df, dtype='float')
#important to remove one 'default' value for each categorical variable to avoid perfect correlations among
#mutually exclusive categories
df1.drop(columns = ['job_student', 'marital_single', 'education_primary',
'default_no', 'housing_no', 'loan_no', 'month_jan',
'day_cat_early', 'known_prior_success_no', 'y_no'], inplace = True)
df1
## age balance ... day_cat_mid known_prior_success_yes
## 0 58 2143 ... 0.0 0.0
## 1 44 29 ... 0.0 0.0
## 2 33 2 ... 0.0 0.0
## 5 35 231 ... 0.0 0.0
## 6 28 447 ... 0.0 0.0
## ... ... ... ... ... ...
## 45205 25 505 ... 1.0 0.0
## 45206 51 825 ... 1.0 0.0
## 45207 71 1729 ... 1.0 0.0
## 45208 72 5715 ... 1.0 1.0
## 45209 57 668 ... 1.0 0.0
##
## [41412 rows x 37 columns]
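As an aside, pandas can drop one level per categorical automatically via the drop_first argument of get_dummies; a brief sketch (note this drops the alphabetically first level rather than the baselines chosen manually above, so the explicit drop keeps more control over the reference categories):
#alternative: let get_dummies pick the dropped level for each categorical
df1_alt = pd.get_dummies(df, dtype='float', drop_first=True)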
Now let’s look at the correlation structure again, this time including the dummy variables.
sns.clustermap(df1.corr(), cmap='coolwarm', linewidths=.5, figsize=(40, 40))
plt.show()
The sheer number of features here makes it challenging to read off correlations. Let’s filter the correlation matrix down to only those features that have a meaningful correlation (|r| of at least 0.3) with at least one other feature.
df_corr = df1.corr()
#drop any feature whose only correlation of |r| >= 0.3 is with itself
for col in df_corr.columns:
    if not (df_corr[col].between(0.3, 0.999).any() or df_corr[col].between(-0.999, -0.3).any()):
        df_corr.drop(columns=col, inplace=True)
        df_corr.drop(index=col, inplace=True)
df_corr
## age ... known_prior_success_yes
## age 1.000000 ... 0.031358
## duration -0.010584 ... 0.053407
## previous 0.008089 ... 0.340751
## job_blue-collar -0.040408 ... -0.050141
## job_management -0.024724 ... 0.017313
## job_retired 0.446388 ... 0.056363
## marital_divorced 0.170740 ... -0.006838
## marital_married 0.279937 ... -0.020756
## education_secondary -0.084560 ... -0.024913
## education_tertiary -0.076560 ... 0.052182
## housing_yes -0.183330 ... -0.091830
## month_aug 0.073973 ... 0.000007
## month_may -0.128167 ... -0.059677
## y_yes 0.026244 ... 0.308051
## day_cat_late -0.029889 ... -0.025931
## day_cat_mid 0.035825 ... 0.016288
## known_prior_success_yes 0.031358 ... 1.000000
##
## [17 rows x 17 columns]
sns.clustermap(df_corr, cmap='coolwarm', linewidths=.5, figsize=(12, 12))
plt.show()
First and foremost, we want to pay attention to correlations with y_yes, which indicate the features most closely associated with a successful campaign engagement. The two features with notable positive correlations there appear to be phone call duration and prior campaign success. While we will learn much more from building and ultimately running our model, this already suggests that spending more time on the phone with clients and targeting those with prior positive engagements could lead to higher success rates.
Some additional correlations worth paying attention to:
- Educational attainment and job type show several correlations, namely between primary education and blue-collar jobs, and between tertiary education and management-oriented jobs.
- Age and retirement are positively correlated.
- Younger clients are less likely to be married or to have a housing loan, while older clients are more likely to be either married or divorced.
All of these correlations make intuitive sense and align with domain knowledge, which is an encouraging sign. Further, while worth noting, none of the correlations appear strong enough to warrant combining or removing variables. For example, while older people are more likely to be retired, it doesn’t mean there aren’t older workers or younger retirees. Those values could prove important for our analysis.
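The heatmap only highlights pairwise relationships; as a complementary check for multi-variable collinearity, we could also compute variance inflation factors on the dummy-encoded frame. A minimal sketch, assuming df1 as built above:
from statsmodels.stats.outliers_influence import variance_inflation_factor
#VIF for each predictor (excluding the target); values well above ~5-10 would be a concern
X = sm.add_constant(df1.drop(columns='y_yes'))
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
vifs.sort_values(ascending=False)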
Logistic regression does not require normalization or standardization, but when running the model I would test different scaling techniques to confirm that holds in this case.
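To close the loop on model selection, here is a minimal sketch of how the fit might look with statsmodels (imported above but not yet used), assuming df1 as prepared in the pre-processing section; this is illustrative rather than the final model run:
#separate the target from the predictors in the dummy-encoded frame
y = df1['y_yes']
X = sm.add_constant(df1.drop(columns='y_yes'))
#fit the logistic regression and inspect coefficients and p-values
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())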