Assignment 1
Assignment
Assignment 1 requires conducting exploratory data analysis on the “Bank Marketing” data set from the UC Irvine Machine Learning Repository. I have saved the data set to my working directory and will import it from there. Below I load the libraries that I initially expect to need.
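A sketch of the packages this analysis relies on, reconstructed from the functions called below (the original library chunk may have differed slightly):
library(tidyverse)   # dplyr/tidyr/ggplot2 pipelines used throughout
library(naniar)      # gg_miss_var() for visualizing missingness
library(GGally)      # ggpairs() for pairwise plots of the numeric variables
library(corrplot)    # correlation plot (also called below via corrplot::corrplot)
library(psych)       # describe() for detailed summary statistics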
Exploratory analysis
Import
bank_data <- read.csv('bank-full.csv', sep = ';')
Structure
First, I examine the structure of the data set:
str(bank_data)
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
$ marital : chr "married" "single" "married" "married" ...
$ education: chr "tertiary" "secondary" "secondary" "unknown" ...
$ default : chr "no" "no" "no" "no" ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : chr "yes" "yes" "yes" "yes" ...
$ loan : chr "no" "no" "yes" "no" ...
$ contact : chr "unknown" "unknown" "unknown" "unknown" ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : chr "may" "may" "may" "may" ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
$ y : chr "no" "no" "no" "no" ...
The data set has 45,211 observations and 17 variables, a mix of integer and character data types.
Missing values
Next, I check for missing values:
colSums(is.na(bank_data))
age job marital education default balance housing loan
0 0 0 0 0 0 0 0
contact day month duration campaign pdays previous poutcome
0 0 0 0 0 0 0 0
y
0
None of the columns have missing values.
gg_miss_var(bank_data)
Duplicate values
Below, I check whether any rows in the data frame are duplicates:
sum(duplicated(bank_data))
[1] 0
There is no duplicate data.
Dates
The data set has two columns that correspond to dates: “month” and “day”. These represent the last contact month of the year and the last contact date during that month. While the data set does not have information on the year of contact, the description of the data set says that observations are ordered by date from May 2008 to November 2010.
It may be worth keeping these features, as there may be a seasonal or cyclical pattern relating the month and day of contact to the target variable. The month column will have to be converted to dummy variables.
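As a quick check of that intuition, the subscription rate can be tabulated by month (a sketch assuming the tidyverse is loaded; note that the months sort alphabetically here rather than in calendar order):
bank_data |>
  group_by(month) |>
  summarise(n = n(), subscription_rate = mean(y == 'yes')) |>
  arrange(desc(subscription_rate))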
Summary statistics - numeric variables
Below I look at the summary statistics for the numeric columns:
bank_data |> select(where(is.numeric)) |> summary()
age balance day duration
Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
Median :39.00 Median : 448 Median :16.00 Median : 180.0
Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
campaign pdays previous
Min. : 1.000 Min. : -1.0 Min. : 0.0000
1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
Median : 2.000 Median : -1.0 Median : 0.0000
Mean : 2.764 Mean : 40.2 Mean : 0.5803
3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
Max. :63.000 Max. :871.0 Max. :275.0000
Numeric variables - Correlation, Distributions, and Outliers
Next, I visualize the distribution and correlation of the numeric variables.
bank_data |> select(where(is.numeric)) |> ggpairs()
bank_data |> select(where(is.numeric)) |> cor() |> corrplot::corrplot()
Based on the above graphics, there is a moderate correlation between pdays and previous. pdays measures the “number of days that passed by after the client was last contacted from a previous campaign”, while previous measures the “number of contacts performed before this campaign and for this client”. If I use a regression model, it will be worth checking the variance inflation factor (VIF) for previous and pdays to see if it is worth dropping one of them. Besides this pair, correlations between the numeric variables are low.
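A sketch of that VIF check (the car package and this particular predictor set are illustrative assumptions on my part, not part of the assignment code):
vif_fit <- glm(I(y == 'yes') ~ age + balance + campaign + pdays + previous,
               data = bank_data, family = binomial)
car::vif(vif_fit)   # values well above ~5 would suggest dropping pdays or previous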
Furthermore, I can see that a number of the numeric variables have outliers, as the density plots exhibit positive skew. I exclude day from this analysis, as it is a discrete day-of-month value rather than a continuous measurement like the other numeric variables.
bank_data |>
  select(where(is.numeric)) |>
  select(-day) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')
The above box-plots demonstrate the presence of outliers. It may make sense to take logarithms of these variables or perform Box-Cox transformations before fitting a model.
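One option, sketched below, is a Yeo-Johnson transformation estimated via caret’s preProcess(); unlike Box-Cox, it tolerates the zero and negative values seen in balance and duration. This is an alternative shown for illustration, not the transformation applied in the pre-processing section below:
library(caret)   # also loaded in the pre-processing section below
numeric_cols <- c('age', 'balance', 'duration', 'campaign', 'previous')
yj <- preProcess(bank_data[, numeric_cols], method = 'YeoJohnson')
transformed_numeric <- predict(yj, bank_data[, numeric_cols])
summary(transformed_numeric)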
Lastly, below is a summary table that has measures of central tendency and spread:
bank_data |> select(where(is.numeric)) |> select(-day) |> describe()
vars n mean sd median trimmed mad min max range
age 1 45211 40.94 10.62 39 40.25 10.38 18 95 77
balance 2 45211 1362.27 3044.77 448 767.21 664.20 -8019 102127 110146
duration 3 45211 258.16 257.53 180 210.87 137.88 0 4918 4918
campaign 4 45211 2.76 3.10 2 2.12 1.48 1 63 62
pdays 5 45211 40.20 100.13 -1 11.92 0.00 -1 871 872
previous 6 45211 0.58 2.30 0 0.13 0.00 0 275 275
skew kurtosis se
age 0.68 0.32 0.05
balance 8.36 140.73 14.32
duration 3.14 18.15 1.21
campaign 4.90 39.24 0.01
pdays 2.62 6.93 0.47
previous 41.84 4506.16 0.01
Categorical variables
Next, I evaluate the categorical variables.
categorical_data <- bank_data |> select(where(is.character))
str(categorical_data)
'data.frame': 45211 obs. of 10 variables:
$ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
$ marital : chr "married" "single" "married" "married" ...
$ education: chr "tertiary" "secondary" "secondary" "unknown" ...
$ default : chr "no" "no" "no" "no" ...
$ housing : chr "yes" "yes" "yes" "yes" ...
$ loan : chr "no" "no" "yes" "no" ...
$ contact : chr "unknown" "unknown" "unknown" "unknown" ...
$ month : chr "may" "may" "may" "may" ...
$ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
$ y : chr "no" "no" "no" "no" ...
I’ll have to convert the characters to factors:
categorical_data <- as.data.frame(unclass(categorical_data), stringsAsFactors = TRUE)
str(categorical_data)
'data.frame': 45211 obs. of 10 variables:
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Now I look at the summary of the categorical data:
summary(categorical_data)
job marital education default housing
blue-collar:9732 divorced: 5207 primary : 6851 no :44396 no :20081
management :9458 married :27214 secondary:23202 yes: 815 yes:25130
technician :7597 single :12790 tertiary :13301
admin. :5171 unknown : 1857
services :4154
retired :2264
(Other) :6835
loan contact month poutcome y
no :37967 cellular :29285 may :13766 failure: 4901 no :39922
yes: 7244 telephone: 2906 jul : 6895 other : 1840 yes: 5289
unknown :13020 aug : 6247 success: 1511
jun : 5341 unknown:36959
nov : 3970
apr : 2932
(Other): 6060
Next, I visualize the frequency counts:
frequency_count <- as.data.frame(gather(categorical_data, 'variable', 'value') |>
                                   count(variable, value) |>
                                   group_by(variable) |>
                                   mutate(prop = prop.table(n)))
frequency_count |>
  ggplot(aes(x = value, y = prop)) +
  geom_bar(stat = 'identity') +
  facet_wrap(vars(variable), scales = 'free') +
  coord_flip()
The features “contact”, “education”, and “job” have “unknown” values. For pre-processing, I can change these to NA and use KNN or another method to impute values.
The poutcome variable is mostly unknown (over 80%). This measures the outcome of previous marketing campaigns. Since there are so many unknown outcomes, as well as several “other” outcomes, I can likely drop this variable as it does not provide much insight.
Algorithm selection
I am looking to predict the target, y, which is a binary yes or no value that indicates whether the client subscribed to a term deposit. There are several categorical and numeric features that can be used in modelling.
Before evaluating models, I need to determine which performance metric is most relevant in this case. As seen above, the data set is imbalanced: about 88% of the observations did not subscribe to a term deposit while only about 12% did. Thus, simply looking at accuracy would be misleading.
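This can be confirmed directly from the target column (output omitted here):
prop.table(table(bank_data$y))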
It may be more appropriate to evaluate the model via precision or recall (the true positive rate). This is a hypothetical example, so the true costs of the marketing campaign are unknown. It is possible that the management team cares more about precision, meaning it is important for positive predictions to be correct: since management is dedicating resources to the campaign, we want those resources directed at clients who are likely to subscribe to a term deposit. That said, it is difficult to choose without more detail and a discussion with management about which metric is more appropriate (perhaps management is more sensitive to false negatives, where we predict a client will not subscribe when in fact they would, which recall captures). Thus, for this hypothetical exercise, I recommend the F1 score, the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
This exercise is a classification problem with labeled data, meaning it is a supervised learning exercise. It is also important to strike a balance between interpretability and accuracy for management’s purposes. Logistic regression would be the most interpretable, as it would indicate which features are statistically significant and how they affect the log odds of the response. This can be insightful from a business perspective, as it can suggest which features to focus on in future campaigns. However, logistic regression may not be the most accurate model based on the metric of choice. Thus, the results from logistic regression should be compared to the results from a discriminant analysis model, and the findings discussed with management to determine which approach to take.
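A sketch of how that comparison could be set up is below; the train/test split, the 0.5 threshold, and the use of the raw features (with duration dropped, for the reason discussed in the pre-processing section) are illustrative assumptions rather than final modelling choices.
set.seed(123)

# Work from the raw data with characters converted to factors; drop duration.
model_data <- bank_data
model_data[] <- lapply(model_data, function(x) if (is.character(x)) factor(x) else x)
model_data$duration <- NULL

train_idx <- sample(nrow(model_data), size = round(0.7 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]

# F1 = 2 * precision * recall / (precision + recall)
f1_score <- function(actual, predicted, positive = 'yes') {
  tp        <- sum(predicted == positive & actual == positive)
  precision <- tp / sum(predicted == positive)
  recall    <- tp / sum(actual == positive)
  2 * precision * recall / (precision + recall)
}

logit_fit  <- glm(y ~ ., data = train, family = binomial)
logit_pred <- ifelse(predict(logit_fit, test, type = 'response') > 0.5, 'yes', 'no')

lda_fit  <- MASS::lda(y ~ ., data = train)   # assumes the MASS package is installed
lda_pred <- as.character(predict(lda_fit, test)$class)

c(logistic = f1_score(test$y, logit_pred), lda = f1_score(test$y, lda_pred))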
There is a large amount of data, but not so much that training a model would be prohibitive. The data does have outliers, so transformations will be required to minimize their impact.
Pre-processing
As discussed above, there are a few pre-processing steps:
Imputing values for the “unknown” observations for features “contact”, “education”, and “job”
Converting categorical variables to dummy variables
Checking for features with near zero variance
Transforming continuous variables to control for outliers
Centering and scaling features will be done when fitting a model via the “caret” package
preproccessed_data <- bank_data
library(caret)
Converting unknown values
First, I replace the “unknown” observations in the “contact”, “education”, and “job” columns with NA:
preproccessed_data$contact[preproccessed_data$contact == 'unknown'] <- NA
preproccessed_data$education[preproccessed_data$education == 'unknown'] <- NA
preproccessed_data$job[preproccessed_data$job == 'unknown'] <- NA
Next, I convert the categorical columns to factors:
preproccessed_data <- preproccessed_data |> mutate(across(where(is.character), as.factor))
Next, I perform KNN imputation to replace the NAs
library(VIM)
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
preproccessed_data <- kNN(data = preproccessed_data, variable = c('contact', 'education', 'job'), k = 5, imp_var = F)
prop.table(table(preproccessed_data$contact))
cellular telephone
0.93052576 0.06947424
prop.table(table(preproccessed_data$education))
primary secondary tertiary
0.1589879 0.5380328 0.3029794
prop.table(table(preproccessed_data$job))
admin. blue-collar entrepreneur housemaid management
0.11497202 0.21667293 0.03302294 0.02782509 0.21056822
retired self-employed services student technician
0.05089469 0.03501360 0.09243326 0.02083564 0.16880848
unemployed
0.02895313
Dummy variables
Below, I convert categorical columns to dummy variables:
dummy_variables <- dummyVars(~ . - y, data = preproccessed_data, fullRank = T)
dummy_data <- predict(dummy_variables, newdata = preproccessed_data)
preproccessed_data <- cbind(as.data.frame(dummy_data), y = preproccessed_data$y)
colnames(preproccessed_data)
[1] "age" "job.blue-collar" "job.entrepreneur"
[4] "job.housemaid" "job.management" "job.retired"
[7] "job.self-employed" "job.services" "job.student"
[10] "job.technician" "job.unemployed" "marital.married"
[13] "marital.single" "education.secondary" "education.tertiary"
[16] "default.yes" "balance" "housing.yes"
[19] "loan.yes" "contact.telephone" "day"
[22] "month.aug" "month.dec" "month.feb"
[25] "month.jan" "month.jul" "month.jun"
[28] "month.mar" "month.may" "month.nov"
[31] "month.oct" "month.sep" "duration"
[34] "campaign" "pdays" "previous"
[37] "poutcome.other" "poutcome.success" "poutcome.unknown"
[40] "y"
Near zero variance features
Next, I check for features with near zero variance:
colnames(preproccessed_data)[nearZeroVar(preproccessed_data)]
[1] "job.entrepreneur" "job.housemaid" "job.self-employed"
[4] "job.student" "job.unemployed" "default.yes"
[7] "month.dec" "month.jan" "month.mar"
[10] "month.oct" "month.sep" "pdays"
[13] "poutcome.other" "poutcome.success"
Dummy variables in this list can be ignored; however, “pdays” has near zero variance. I will drop “pdays” from the data set, which is also beneficial given its moderate correlation with the “previous” feature.
preproccessed_data$pdays <- NULL
Transforming continuous variables
First, I look at the continuous variables besides “day”
preproccessed_data |> select(age, balance, duration, campaign, previous) |> describe()
vars n mean sd median trimmed mad min max range
age 1 45211 40.94 10.62 39 40.25 10.38 18 95 77
balance 2 45211 1362.27 3044.77 448 767.21 664.20 -8019 102127 110146
duration 3 45211 258.16 257.53 180 210.87 137.88 0 4918 4918
campaign 4 45211 2.76 3.10 2 2.12 1.48 1 63 62
previous 5 45211 0.58 2.30 0 0.13 0.00 0 275 275
skew kurtosis se
age 0.68 0.32 0.05
balance 8.36 140.73 14.32
duration 3.14 18.15 1.21
campaign 4.90 39.24 0.01
previous 41.84 4506.16 0.01
preproccessed_data |>
  select(age, balance, duration, campaign, previous) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')
As discussed above, these features have outliers.
The features “balance”, “duration”, and “previous” have values that are either negative or include zero, which makes taking log values inappropriate.
Below is the description of the “duration” feature:
last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Thus, I will remove the feature from the data set:
preproccessed_data$duration <- NULL
Next, I will check how many observations are classified as outliers based on the IQR outlier test:
preproccessed_data |>
  select(age, balance, campaign, previous) |>
  summarise(across(everything(),
                   ~ sum(. < (quantile(., 0.25, na.rm = TRUE) - 1.5 * IQR(.)) |
                           . > (quantile(., 0.75, na.rm = TRUE) + 1.5 * IQR(.)), na.rm = TRUE),
                   .names = "outliers_{.col}"))
outliers_age outliers_balance outliers_campaign outliers_previous
1 487 4729 3064 8257
This is a meaningful number of outliers (aside from the age feature), and automatically removing them would discard a large amount of data. These outliers do seem meaningful (e.g., it is plausible for clients to have negative balances or very large positive balances, or for certain clients to have been contacted far more often than others in the current or previous campaigns).
As such, I will apply a log transformation to these features before centering and scaling. While there will still be outliers present even after the log transformation, I believe these values are too meaningful to exclude from the data.
final_preproccessed_data <- preproccessed_data
For balance, I will add a constant of 9,000 to ensure all values are positive before applying the log transformation (the minimum balance is -8,019). For previous, which contains zeros, I will add a constant of 0.1.
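A quick check that each shifted variable is strictly positive, so the logarithm is defined:
min(final_preproccessed_data$balance + 9000)   # minimum balance is -8019, so this should exceed 0
min(final_preproccessed_data$previous + 0.1)   # previous contains zeros, hence the 0.1 shift
min(final_preproccessed_data$age)              # ages start at 18, already positive
min(final_preproccessed_data$campaign)         # campaign counts start at 1, already positive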
#age, balance, campaign, previous
final_preproccessed_data$age <- log(final_preproccessed_data$age)
final_preproccessed_data$balance <- log(final_preproccessed_data$balance + 9000)
final_preproccessed_data$campaign <- log(final_preproccessed_data$campaign)
final_preproccessed_data$previous <- log(final_preproccessed_data$previous + 0.1)
names(final_preproccessed_data)[names(final_preproccessed_data)=='age'] <- 'log_age'
names(final_preproccessed_data)[names(final_preproccessed_data)=='balance'] <- 'balance_transformed'
names(final_preproccessed_data)[names(final_preproccessed_data)=='campaign'] <- 'log_campaign'
names(final_preproccessed_data)[names(final_preproccessed_data)=='previous'] <- 'previous_transformed'
final_preproccessed_data |>
  select(log_age, balance_transformed, log_campaign, previous_transformed) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')
Lastly, centering and scaling will be performed when fitting models.
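A sketch of how that could be specified with caret (illustrative settings only; the model, resampling scheme, and metric would be finalized after the discussion with management described above):
# Center and scale inside train() so the scaling is re-estimated within each resampling fold.
x_vars <- final_preproccessed_data[, names(final_preproccessed_data) != 'y']
names(x_vars) <- make.names(names(x_vars))   # e.g. 'job.blue-collar' becomes 'job.blue.collar'
ctrl <- trainControl(method = 'cv', number = 5, classProbs = TRUE)
logit_model <- train(x = x_vars, y = final_preproccessed_data$y,
                     method = 'glm',
                     preProcess = c('center', 'scale'),
                     trControl = ctrl)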