In this notebook, we will be working with a dataset of client contacts made by a banking institution as part of a product marketing campaign. We will conduct exploratory data analysis (EDA), select a model to try to predict whether a future client contact is likely to lead to a sale, and ensure the data is properly engineered to build that model.
First, we’ll set up RStudio to run Python code by importing the reticulate library and activating a conda environment built specifically for this use case.
knitr::opts_chunk$set(echo = TRUE)
#reticulate allows for python code in R environment
library(reticulate)
#set environment and options specific for reticulate
use_condaenv("r-reticulate", required=TRUE)
Next, we’ll import the python packages we’ll need for future steps:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
And finally, we’ll read in our dataset using pandas’ read_csv function:
df = pd.read_csv('bank+marketing/bank/bank-full.csv',
sep=';')
Let’s get a sense of our columns and data types first:
df.dtypes
## age int64
## job object
## marital object
## education object
## default object
## balance int64
## housing object
## loan object
## contact object
## day int64
## month object
## duration int64
## campaign int64
## pdays int64
## previous int64
## poutcome object
## y object
## dtype: object
This suggests we have 7 numerical features, 9 categorical features, and 1 dependent categorical variable (y). Let’s now examine the numerical features.
describe = df.describe()
describe
## age balance ... pdays previous
## count 45211.000000 45211.000000 ... 45211.000000 45211.000000
## mean 40.936210 1362.272058 ... 40.197828 0.580323
## std 10.618762 3044.765829 ... 100.128746 2.303441
## min 18.000000 -8019.000000 ... -1.000000 0.000000
## 25% 33.000000 72.000000 ... -1.000000 0.000000
## 50% 39.000000 448.000000 ... -1.000000 0.000000
## 75% 48.000000 1428.000000 ... -1.000000 0.000000
## max 95.000000 102127.000000 ... 871.000000 275.000000
##
## [8 rows x 7 columns]
This output provides a topline sense of the distribution of each numerical feature. Before moving on, let’s confirm that there are no nulls among these numerical columns, as determined by the count of each variable matching the length of the dataframe.
describe.loc['count'].unique()[0] == len(df)
## np.True_
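Note that describe() only covers the numerical columns, so this check says nothing about the categorical ones. As a complementary check across every column (a quick sketch, not executed above), we could count missing values directly:
#count missing values in every column, numerical and categorical alike
df.isnull().sum()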
With no nulls to worry about, we’ll still want to look over the values in the data to sanity-check them and ensure that there aren’t stand-in values for nulls, or anything else that doesn’t match our intuitive understanding of the data. Let’s visualize the distributions as histograms.
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Initialize a grid of plots with an Axes for each column in the dataframe
g = sns.FacetGrid(df[describe.columns].melt(var_name='Column', value_name='Value'), col='Column', col_wrap=4, sharex=False, height=3)
# Draw a histogram on each Axes
g = g.map(sns.histplot, 'Value', bins=20, kde=False)
# Add some space between the plots
g.fig.subplots_adjust(wspace=0.5, hspace=0.7)
plt.show()
It appears that age is more or less normally distributed, and day is basically uniform. The other variables show right skews of varying severity, signaling a potential need for outlier removal.
Let’s dig a bit deeper into each variable. Here’s a brief description of the distribution for each numerical feature, based on their histograms and descriptive statistics. We will also take this opportunity to check for outliers, using the function defined here (based on the 1.5 times IQR rule):
def calc_fences(df, col):
    '''For a numerical column, returns lower and upper "fences" based on 1.5 * IQR'''
    data = df[col]
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return low_fence, high_fence
Age: Somewhat normal distribution, with a heavy concentration of people in their 30s and 40s and a long tail skewing lightly to the right.
fences = calc_fences(df, 'age')
df[(df['age']<fences[0])|(df['age']>fences[1])]
## age job marital education ... pdays previous poutcome y
## 29158 83 retired married primary ... -1 0 unknown no
## 29261 75 retired divorced primary ... -1 0 unknown no
## 29263 75 retired married primary ... -1 0 unknown no
## 29322 83 retired married tertiary ... -1 0 unknown no
## 29865 75 retired divorced primary ... -1 0 unknown yes
## ... ... ... ... ... ... ... ... ... ...
## 45163 71 retired married secondary ... 92 3 failure no
## 45191 75 retired divorced tertiary ... 183 1 failure yes
## 45204 73 retired married secondary ... 40 8 failure yes
## 45207 71 retired divorced primary ... -1 0 unknown yes
## 45208 72 retired married secondary ... 184 3 success yes
##
## [487 rows x 17 columns]
While these are technically outliers by the IQR rule, I wouldn’t remove them in this instance. The ages of the clients are plausible (everyone is between 18 and 95), and age may well have relevance for response likelihood.
Balance: The vast majority of balances sit below a few thousand, with a handful of extreme outliers at the upper end and over 3,000 negative balances.
print("98% of balances below {}".format(df['balance'].quantile(0.98)))
## 98% of balances below 9439.3999999999
print("{} balances in the negative".format(len(df[df['balance']<0])))
## 3766 balances in the negative
We can likely retain most of the outliers as the fences themselves are not unreasonable. However, some of the high outliers so far exceed the upper fence that it is likely worth removing or accounting for them.
fences = calc_fences(df, 'balance')
fences[0]
## np.float64(-1962.0)
fences[1]
## np.float64(3462.0)
len(df[df['balance']<fences[0]]['balance'].sort_values())
## 17
len(df[df['balance']>fences[1]]['balance'].sort_values(ascending=False))
## 4712
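To quantify how far the upper tail stretches beyond that fence (the maximum of 102,127 from the summary above is roughly 30 times the upper fence of 3,462), we could inspect a few upper quantiles, for example:
#upper-tail quantiles of balance, to compare against the 3,462 fence
df['balance'].quantile([0.95, 0.99, 0.999])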
Day: Seemingly random: people are contacted on any day throughout the month. While it might make sense to retain this variable, I’m not sure it’s useful to treat it as numerical. By what mechanism would the likelihood of a sign-on map directly onto how early or late in the month it is? Rather, it may make sense to convert it into a categorical variable indicating whether the contact occurred at the beginning, middle, or end of the month (which could relate to things like pay schedules or housing payments); we implement this mapping in the pre-processing section below.
Duration: Calls seem to hover in the range of 180-300 seconds (3-5 minutes). Several evident outliers stretch to over an hour.
fences = calc_fences(df, 'duration')
fences[0]
## np.float64(-221.0)
fences[1]
## np.float64(643.0)
len(df[df['duration']<fences[0]]['duration'])
## 0
len(df[df['duration']>fences[1]]['duration'])
## 3235
As with balance, we can retain some of the upper outlier values, but should draw a line where the duration is unreasonably long and could skew our analysis.
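One natural place to draw that line is a high percentile rather than the IQR fence; a quick sketch of the kind of cutoff we’ll apply in the pre-processing section:
#a percentile-based cutoff is less aggressive than the 643-second fence
df['duration'].quantile(0.99)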
Campaign: For a large majority of contacted customers, this was the first or second contact from the company for this campaign. Thousands of clients received more contacts than that, but fewer and fewer as the number of contacts increases.
fences = calc_fences(df, 'campaign')
fences[0]
## np.float64(-2.0)
fences[1]
## np.float64(6.0)
len(df[df['campaign']<fences[0]]['campaign'].sort_values())
## 0
len(df[df['campaign']>fences[1]]['campaign'].sort_values(ascending=False))
## 3064
P-days: For over 80% of clients, there is no previous campaign contact to count back to. It may be necessary to remove this variable from the data, as it is hard to sensibly quantify the null values. For example, substituting 0 would have the opposite of the intended effect (it would imply the client was contacted extremely recently).
#Percentage of client contacts with "null" val for days since previous
len(df[df['pdays']==-1]) / len(df)
## 0.8173674548229414
#Percentage of client contacts with no prior campaign contacts
len(df[df['previous']==0]) / len(df)
## 0.8173674548229414
Previous: A large majority of clients were not contacted for a previous campaign (as noted above). However, this variable is more usable, since a value of 0 makes numerical sense alongside the other values.
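As a quick sanity check on that relationship (a small sketch on the same dataframe), we can confirm that the -1 placeholder in pdays lines up exactly with clients who have zero prior contacts recorded in previous:
#True only if pdays == -1 for exactly the rows where previous == 0
(df['pdays'].eq(-1) == df['previous'].eq(0)).all()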
Let’s now take a look at our non-numerical (and therefore categorical) columns, filtering out the numerical ones.
for col in df.drop(columns=describe.columns).columns:
    print(df[col].value_counts())
## job
## blue-collar 9732
## management 9458
## technician 7597
## admin. 5171
## services 4154
## retired 2264
## self-employed 1579
## entrepreneur 1487
## unemployed 1303
## housemaid 1240
## student 938
## unknown 288
## Name: count, dtype: int64
## marital
## married 27214
## single 12790
## divorced 5207
## Name: count, dtype: int64
## education
## secondary 23202
## tertiary 13301
## primary 6851
## unknown 1857
## Name: count, dtype: int64
## default
## no 44396
## yes 815
## Name: count, dtype: int64
## housing
## yes 25130
## no 20081
## Name: count, dtype: int64
## loan
## no 37967
## yes 7244
## Name: count, dtype: int64
## contact
## cellular 29285
## unknown 13020
## telephone 2906
## Name: count, dtype: int64
## month
## may 13766
## jul 6895
## aug 6247
## jun 5341
## nov 3970
## apr 2932
## feb 2649
## jan 1403
## oct 738
## sep 579
## mar 477
## dec 214
## Name: count, dtype: int64
## poutcome
## unknown 36959
## failure 4901
## other 1840
## success 1511
## Name: count, dtype: int64
## y
## no 39922
## yes 5289
## Name: count, dtype: int64
As with the numerical variables, we can go variable by variable to understand the distributions across categories.
Job: People with blue-collar jobs are the most common in the data, but not by a wide margin; we see a spread across many different industries and job types.
plt.clf()
df['job'].value_counts().plot(kind='bar')
plt.show()
len(df[df['job']=='unknown'])/len(df)
## 0.006370131162770122
There are a number of “unknown” values here, which are functionally nulls. They constitute a small enough percentage of the data that I think it will be worth removing rows with null jobs.
Marital Status:
The majority of contactees are married, though single and divorced clients make up enough of our population to warrant consideration. The financial commitments associated with marriage and families may impact decision-making and risk tolerance in both directions, and impact how likely someone is to make a major purchase decision in response to a phone call.
plt.clf()
df['marital'].value_counts().plot(kind='bar')
plt.show()
Education: The most common category for educational attainment was secondary education, followed by tertiary.
plt.clf()
df['education'].value_counts().plot(kind='bar')
plt.show()
As with jobs data, we have a small number of null (“unknown”) values here, the rows for which can likely be safely removed given their low prevalence.
df[df['job']=='unknown']['education'].value_counts()
## education
## unknown 127
## secondary 71
## primary 51
## tertiary 39
## Name: count, dtype: int64
Interestingly, it appears many of the null values for job data correspond to further unknowns related to education. Perhaps this is a signal of challenges with data collection, and while the missingness here may itself have value, I still feel comfortable removing null rows given the small amount and their overlap across features.
Default:
plt.clf()
df['default'].value_counts().plot(kind='bar')
plt.show()
Only a small number of clients have credit in default.
#Checking default status against the success of current campaign
df[df['default']=='yes']['y'].value_counts() / len(df[df['default']=='yes'])
## y
## no 0.936196
## yes 0.063804
## Name: count, dtype: float64
At an initial glance, clients in credit default actually appear somewhat less likely to purchase this new product (about 6.4% said yes, versus roughly 11.7% across all contacts). While it’s not the question at hand, the bank may want to consider not only whether a client is likely to purchase the product, but whether it makes financial sense for both buyer and seller to market it to them.
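For reference, the overall conversion rate can be pulled directly (a quick sketch):
#share of 'yes' vs 'no' across all contacts (~11.7% yes, per the counts above)
df['y'].value_counts(normalize=True)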
Housing: Here we have a close to 50/50 split and no null values. It will be interesting to see how this feature plays out, as having a housing payment could be a sign of more advanced financial maturity, but perhaps more responsibility and risk intolerance as well.
plt.clf()
df['housing'].value_counts().plot(kind='bar')
plt.show()
Loan: No nulls, and a much smaller but still meaningful percentage of clients do have personal loans.
plt.clf()
df['loan'].value_counts().plot(kind='bar')
plt.show()
Contact: The means by which the person was contacted, either their home telephone or cellphone.
plt.clf()
df['contact'].value_counts().plot(kind='bar')
plt.show()
This variable has a substantial number of nulls; perhaps the feature is worth removing entirely?
df_contact_test = df[df['contact']!='unknown'][['contact','y']]
df_contact_test.replace({'cellular':0,
'telephone':1,
'no':0,
'yes':1}, inplace = True)
## <string>:2: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_contact_test['contact'].corr(df_contact_test['y'])
## np.float64(-0.012097873408306736)
The correlation, even when accounting for rows with good data, doesn’t appear strong enough to justify keeping a feature with so many nulls.
Month: May is the most common. At a glance this may not be as predictive as other information, but could impact spending. For example, in May many people could have just gotten a tax refund that they want to spend or invest.
plt.clf()
df['month'].value_counts().plot(kind='bar')
plt.show()
P(revious) Outcome: This variable tells us whether a previous campaign was a success for each client contact.
plt.clf()
df['poutcome'].value_counts().plot(kind='bar')
plt.show()
This is a very large number of nulls, but I would hesitate to remove the feature as a result. After all, couldn’t the outcome of a prior campaign be a very important input for future sales?
df[df['poutcome']=='unknown']['previous'].value_counts()
## previous
## 0 36954
## 2 2
## 1 1
## 5 1
## 7 1
## Name: count, dtype: int64
Interesting: it appears that the unknown values almost entirely reflect the fact that there was no previous campaign contact to evaluate. Let’s see whether this could be a useful feature to retain despite the missing values.
#limit to where poutcome is known to be either success or failure
df_poutcome_test = df[df['poutcome'].isin(['success','failure'])][['poutcome','y']]
#convert to binary numeric columns for correlation analysis
df_poutcome_test.replace({'failure':0,
'success':1,
'no':0,
'yes':1}, inplace=True)
## <string>:2: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_poutcome_test['poutcome'].corr(df_poutcome_test['y'])
## np.float64(0.5115443306993206)
That’s a very meaningful correlation. Therefore, my proposal is to convert this variable into a binary category, where the person either has a known prior successful campaign outcome or not. Such a variable would not be as powerful as a definitively known outcome of either success or failure for each client, but it remains a potentially valuable predictor nonetheless.
When choosing a model for this dataset, we have to first recognize that we are dealing with a classification problem. That rules out most forms of regression.
But not all forms! I believe logistic regression would be a great fit here, as it is with many binary classification problems involving a mix of quantitative and categorical features. Logistic regression is a supervised learning method whose output is a probability between 0 and 1. That probability ‘score’ could be valuable for this business problem, as the business may want to adjust the threshold it uses to prioritize which clients to call. The model also yields an interpretable coefficient for each feature, which could be valuable in determining what data to continue collecting and is trackable over time.
In terms of key assumptions for logistic regression: based on all available information, we have independence of observations, as each row represents one client (including prior contacts). We see no concerning multicollinearity, as shown in the pre-processing section below. And when it comes to linearity of the log-odds with respect to the independent variables, we can fit the model and run diagnostics to confirm this before moving forward with the results.
Longer term, data around which clients are not likely to be converted is also useful. The bank can create profiles or “personas” for what contributes to conversion. Perhaps there are certain groups that are not buying the product, but could be convinced based on tweaks to factors that are in the business’s control, like when and how often they make contact.
The pre-processing steps below are based on the prior assessment from EDA and on our model selection.
First, we’re going to cut a couple of features entirely:
- pdays, because it adds little value (the vast majority of clients have no prior contact, which is duplicative with previous)
- contact, because of its high number of nulls and little apparent predictive value
Next, for the following numerical features, we’re going to cut outliers using a 99th-percentile cutoff. I think this is a reasonable cutoff because it limits the culling of our data to at most 1% per feature, and because I assessed the 1.5 * IQR fences to be overly restrictive (they would cut out clients older than about 70 or with a balance above roughly 3,462):
- balance
- duration
- campaign
- previous
For day, we’ll convert it into a categorical variable (early, mid, or late in the month).
For poutcome, we’ll convert it to a binary category indicating known prior success.
#DROPPED COLUMNS
#pdays
df.drop(columns='pdays', inplace = True)
#contact
df.drop(columns='contact', inplace = True)
#OUTLIER CUTOFFS
#set thresholds now so they're not impacted by the order of deletions
balance_q99 = df['balance'].quantile(0.99)
duration_q99 = df['duration'].quantile(0.99)
campaign_q99 = df['campaign'].quantile(0.99)
previous_q99 = df['previous'].quantile(0.99)
#balance
df = df[df['balance'] < balance_q99]
#duration
df = df[df['duration'] < duration_q99]
#campaign
df = df[df['campaign'] < campaign_q99]
#previous
df = df[df['previous'] < previous_q99]
#CUTTING SMALL NUMBER OF ROWS WITH NULLS
#job
df = df[df['job']!='unknown']
#education
df = df[df['education']!='unknown']
#CHANGING CATEGORIES
#day
df['day_cat'] = df['day'].apply(lambda x: 'early' if x < 10 else
'mid' if x < 20 else
'late')
#drop original day col
df.drop(columns='day', inplace = True)
#poutcome
df['known_prior_success'] = df['poutcome'].replace({'success':'yes',
'failure':'no',
'other':'no',
'unknown':'no'})
df.drop(columns = 'poutcome', inplace = True)
plt.figure(figsize=(10, 8)) # Adjust size as needed
sns.heatmap(df.select_dtypes(include=np.number).corr(),
annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()
The numerical features show no strong pairwise correlations, which is valuable to know because it means they won’t be duplicative in the model.
Now let’s create dummy variables for the categorical features.
df1 = pd.get_dummies(df, dtype='float')
#important to remove one 'default' value for each categorical variable to avoid perfect correlations among
#mutually exclusive categories
df1.drop(columns = ['job_student', 'marital_single', 'education_primary',
'default_no', 'housing_no', 'loan_no', 'month_jan',
'day_cat_early', 'known_prior_success_no', 'y_no'], inplace = True)
df1
## age balance ... day_cat_mid known_prior_success_yes
## 0 58 2143 ... 0.0 0.0
## 1 44 29 ... 0.0 0.0
## 2 33 2 ... 0.0 0.0
## 5 35 231 ... 0.0 0.0
## 6 28 447 ... 0.0 0.0
## ... ... ... ... ... ...
## 45205 25 505 ... 1.0 0.0
## 45206 51 825 ... 1.0 0.0
## 45207 71 1729 ... 1.0 0.0
## 45208 72 5715 ... 1.0 1.0
## 45209 57 668 ... 1.0 0.0
##
## [41412 rows x 37 columns]
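As an aside, pandas can drop one level per categorical automatically via the drop_first argument of get_dummies; a brief sketch (note this drops the alphabetically first level rather than the baselines chosen manually above, so the explicit drop keeps more control over the reference categories):
#alternative: let get_dummies pick the dropped level for each categorical
df1_alt = pd.get_dummies(df, dtype='float', drop_first=True)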
Now let’s look at the correlation structure again, this time including the dummy variables.
sns.clustermap(df1.corr(), cmap='coolwarm', linewidths=.5, figsize=(40, 40))
plt.show()
The sheer number of features here makes it challenging to read off correlations. Let’s filter the correlation matrix down to only those features that have a meaningful correlation (|r| of at least 0.3) with at least one other feature.
df_corr = df1.corr()
#drop any feature whose only correlation of |r| >= 0.3 is with itself
for col in df_corr.columns:
    if not (df_corr[col].between(0.3, 0.999).any() or df_corr[col].between(-0.999, -0.3).any()):
        df_corr.drop(columns=col, inplace=True)
        df_corr.drop(index=col, inplace=True)
df_corr
## age ... known_prior_success_yes
## age 1.000000 ... 0.031358
## duration -0.010584 ... 0.053407
## previous 0.008089 ... 0.340751
## job_blue-collar -0.040408 ... -0.050141
## job_management -0.024724 ... 0.017313
## job_retired 0.446388 ... 0.056363
## marital_divorced 0.170740 ... -0.006838
## marital_married 0.279937 ... -0.020756
## education_secondary -0.084560 ... -0.024913
## education_tertiary -0.076560 ... 0.052182
## housing_yes -0.183330 ... -0.091830
## month_aug 0.073973 ... 0.000007
## month_may -0.128167 ... -0.059677
## y_yes 0.026244 ... 0.308051
## day_cat_late -0.029889 ... -0.025931
## day_cat_mid 0.035825 ... 0.016288
## known_prior_success_yes 0.031358 ... 1.000000
##
## [17 rows x 17 columns]
sns.clustermap(df_corr, cmap='coolwarm', linewidths=.5, figsize=(12, 12))
plt.show()
First and foremost, we want to pay attention to correlations with y_yes, which indicate the features most closely associated with a successful campaign engagement. The two features with notable positive correlations there appear to be phone call duration and prior campaign success. While we will learn much more from building and ultimately running our model, this already suggests that spending more time on the phone with clients and targeting those with prior positive engagements could lead to higher success rates.
Some additional correlations worth paying attention to:
- Educational attainment and job type show several correlations, namely between primary education and blue-collar jobs, and between tertiary education and management-oriented jobs.
- Age and retirement are positively correlated.
- Younger clients are less likely to be married or to have a housing loan, while older clients are more likely to be either married or divorced.
All of these correlations make intuitive sense and align with domain knowledge, which is an encouraging sign. Further, while worth noting, none of the correlations appear strong enough to warrant combining or removing variables. For example, while older people are more likely to be retired, it doesn’t mean there aren’t older workers or younger retirees. Those values could prove important for our analysis.
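The heatmap only highlights pairwise relationships; as a complementary check for multi-variable collinearity, we could also compute variance inflation factors on the dummy-encoded frame. A minimal sketch, assuming df1 as built above:
from statsmodels.stats.outliers_influence import variance_inflation_factor
#VIF for each predictor (excluding the target); values well above ~5-10 would be a concern
X = sm.add_constant(df1.drop(columns='y_yes'))
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
vifs.sort_values(ascending=False)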
Logistic regression does not require normalization or standardization, but when running the model I would test different scaling techniques to confirm that holds in this case.
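To close the loop on model selection, here is a minimal sketch of how the fit might look with statsmodels (imported above but not yet used), assuming df1 as prepared in the pre-processing section; this is illustrative rather than the final model run:
#separate the target from the predictors in the dummy-encoded frame
y = df1['y_yes']
X = sm.add_constant(df1.drop(columns='y_yes'))
#fit the logistic regression and inspect coefficients and p-values
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())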