Assignment 1

Author

Semyon Toybis

Assignment

Assignment 1 requires conducting exploratory data analysis on the “Bank Marketing” data set from the UC Irvine Machine Learning Repository. I have saved the data set to my working directory and will import it from there. Below I load the libraries that I initially expect to need.
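The chunk below reconstructs that step as an assumption rather than the original code: the library set is inferred from the functions used later in this document (select/pivot_longer/ggplot from the tidyverse, ggpairs from GGally, corrplot, describe from psych, and gg_miss_var from naniar), so the session that produced the output may have loaded a slightly different set.

# Assumed library set, inferred from the functions used later in this analysis
library(tidyverse)   # select(), mutate(), pivot_longer(), gather(), ggplot(), etc.
library(GGally)      # ggpairs()
library(corrplot)    # corrplot()
library(psych)       # describe()
library(naniar)      # gg_miss_var()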

Exploratory analysis

Import

bank_data <- read.csv('bank-full.csv', sep = ';')

Structure

First, I examine the structure of the data set:

str(bank_data)
'data.frame':   45211 obs. of  17 variables:
 $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
 $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
 $ marital  : chr  "married" "single" "married" "married" ...
 $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
 $ default  : chr  "no" "no" "no" "no" ...
 $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
 $ housing  : chr  "yes" "yes" "yes" "yes" ...
 $ loan     : chr  "no" "no" "yes" "no" ...
 $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
 $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
 $ month    : chr  "may" "may" "may" "may" ...
 $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
 $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
 $ y        : chr  "no" "no" "no" "no" ...

The data set has 45,211 observations of 17 variables, a mix of integer and character data types.

Missing values

First, I check for missing values:

colSums(is.na(bank_data))
      age       job   marital education   default   balance   housing      loan 
        0         0         0         0         0         0         0         0 
  contact       day     month  duration  campaign     pdays  previous  poutcome 
        0         0         0         0         0         0         0         0 
        y 
        0 

None of the columns have missing values.

gg_miss_var(bank_data)  # naniar plot of missing values per variable

Duplicate values

Below, I check whether any rows in the data frame are duplicates:

sum(duplicated(bank_data))
[1] 0

There are no duplicate rows.

Dates

The data set has two columns that correspond to dates: “month” and “day”. These represent the last contact month of the year and the last contact date during that month. While the data set does not have information on the year of contact, the description of the data set says that observations are ordered by date from May 2008 to November 2010.

It may be worth keeping these features, as there may be a seasonal or cyclical pattern relating month and day to the target variable. The month column will have to be converted to dummy variables before modeling.
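As a rough illustration of that seasonal check (not part of the original output; the calendar-ordering vector below is an assumption), the month column can be placed on an ordered factor and the subscription rate plotted by month:

# Illustrative sketch only: order months by calendar and plot subscription rate by month
month_levels <- c('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec')
bank_data |>
  mutate(month = factor(month, levels = month_levels)) |>
  count(month, y) |>
  ggplot(aes(x = month, y = n, fill = y)) +
  geom_col(position = 'fill') +
  labs(y = 'proportion of contacts',
       title = 'Term deposit subscription rate by last-contact month')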

Summary statistics - numeric variables

Below I look at the summary statistics for the numeric columns:

bank_data |> select(where(is.numeric)) |> summary()
      age           balance            day           duration     
 Min.   :18.00   Min.   : -8019   Min.   : 1.00   Min.   :   0.0  
 1st Qu.:33.00   1st Qu.:    72   1st Qu.: 8.00   1st Qu.: 103.0  
 Median :39.00   Median :   448   Median :16.00   Median : 180.0  
 Mean   :40.94   Mean   :  1362   Mean   :15.81   Mean   : 258.2  
 3rd Qu.:48.00   3rd Qu.:  1428   3rd Qu.:21.00   3rd Qu.: 319.0  
 Max.   :95.00   Max.   :102127   Max.   :31.00   Max.   :4918.0  
    campaign          pdays          previous       
 Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000  
 1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000  
 Median : 2.000   Median : -1.0   Median :  0.0000  
 Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803  
 3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000  
 Max.   :63.000   Max.   :871.0   Max.   :275.0000  

Numeric variables - Correlation, Distributions, and Outliers

Next, I visualize the distribution and correlation of the numeric variables.

bank_data |> select(where(is.numeric)) |> ggpairs()

bank_data |> select(where(is.numeric)) |> cor() |> corrplot::corrplot()

Based on the above graphics, there is a moderate correlation between pdays and previous. pdays measures the “number of days that passed by after the client was last contacted from a previous campaign” while previous measures the “number of contacts performed before this campaign and for this client”. If I use a regression model, it will be worth checking the variance inflation factor (VIF) for previous and pdays to see if it is worth dropping one of them. Aside from this pair, correlations between the numeric variables are low.
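A minimal sketch of that VIF check, assuming the car package (not used elsewhere in this analysis); it fits a quick logistic regression on the numeric predictors only:

# Hypothetical VIF check (assumes the car package); numeric predictors only
library(car)
vif_model <- glm(as.factor(y) ~ age + balance + day + duration + campaign + pdays + previous,
                 data = bank_data, family = binomial)
vif(vif_model)  # values well above ~5 for pdays and previous would support dropping one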

Furthermore, a number of the numeric variables have outliers, as the density plots exhibit positive skew. I exclude day from this analysis because it is a discrete day-of-month value rather than a continuous measurement like the other numeric variables.

bank_data |>
  select(where(is.numeric)) |>
  select(-day) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')

The above box-plots demonstrate the presence of outliers. It may make sense to take logarithms of these variables or perform Box-Cox transformations before fitting a model.
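As a hedged sketch of that transformation idea, caret’s preProcess() can estimate a power transformation; the Yeo-Johnson variant is used here because balance, duration, and previous include zero or negative values, which rules out Box-Cox proper. This is an illustration, not a step the analysis actually runs at this point.

# Sketch only: estimate power transformations with caret
# Yeo-Johnson is used because balance, duration, and previous contain zeros or negative values
library(caret)
numeric_features <- bank_data |> select(where(is.numeric)) |> select(-day)
yj <- preProcess(numeric_features, method = 'YeoJohnson')
numeric_transformed <- predict(yj, numeric_features)
summary(numeric_transformed)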

Lastly, below is a summary table that has measures of central tendency and spread:

bank_data |> select(where(is.numeric)) |> select(-day) |> describe()
         vars     n    mean      sd median trimmed    mad   min    max  range
age         1 45211   40.94   10.62     39   40.25  10.38    18     95     77
balance     2 45211 1362.27 3044.77    448  767.21 664.20 -8019 102127 110146
duration    3 45211  258.16  257.53    180  210.87 137.88     0   4918   4918
campaign    4 45211    2.76    3.10      2    2.12   1.48     1     63     62
pdays       5 45211   40.20  100.13     -1   11.92   0.00    -1    871    872
previous    6 45211    0.58    2.30      0    0.13   0.00     0    275    275
          skew kurtosis    se
age       0.68     0.32  0.05
balance   8.36   140.73 14.32
duration  3.14    18.15  1.21
campaign  4.90    39.24  0.01
pdays     2.62     6.93  0.47
previous 41.84  4506.16  0.01

Categorical variables

Next, I evaluate the categorical variables.

categorical_data <- bank_data |> select(where(is.character)) 
str(categorical_data)
'data.frame':   45211 obs. of  10 variables:
 $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
 $ marital  : chr  "married" "single" "married" "married" ...
 $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
 $ default  : chr  "no" "no" "no" "no" ...
 $ housing  : chr  "yes" "yes" "yes" "yes" ...
 $ loan     : chr  "no" "no" "yes" "no" ...
 $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
 $ month    : chr  "may" "may" "may" "may" ...
 $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
 $ y        : chr  "no" "no" "no" "no" ...

I’ll have to convert the characters to factors:

categorical_data <- as.data.frame(unclass(categorical_data), stringsAsFactors = TRUE)
str(categorical_data)
'data.frame':   45211 obs. of  10 variables:
 $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
 $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
 $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
 $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
 $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
 $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Now I look at the summary of the categorical data:

summary(categorical_data)
          job           marital          education     default     housing    
 blue-collar:9732   divorced: 5207   primary  : 6851   no :44396   no :20081  
 management :9458   married :27214   secondary:23202   yes:  815   yes:25130  
 technician :7597   single  :12790   tertiary :13301                          
 admin.     :5171                    unknown  : 1857                          
 services   :4154                                                             
 retired    :2264                                                             
 (Other)    :6835                                                             
  loan            contact          month          poutcome       y        
 no :37967   cellular :29285   may    :13766   failure: 4901   no :39922  
 yes: 7244   telephone: 2906   jul    : 6895   other  : 1840   yes: 5289  
             unknown  :13020   aug    : 6247   success: 1511              
                               jun    : 5341   unknown:36959              
                               nov    : 3970                              
                               apr    : 2932                              
                               (Other): 6060                              

Next, I visualize the frequency counts:

frequency_count <- categorical_data |>
  gather('variable', 'value') |>
  count(variable, value) |>
  group_by(variable) |>
  mutate(prop = prop.table(n)) |>
  as.data.frame()

frequency_count |>
  ggplot(aes(x = value, y = prop)) +
  geom_bar(stat = 'identity') +
  facet_wrap(vars(variable), scales = 'free') +
  coord_flip()

The features “contact”, “education”, and “job” have “unknown” values. For pre-processing, I can recode these as NA and use KNN or another method to impute values.

The poutcome variable is mostly unknown (over 80%). This measures the outcome of previous marketing campaigns. Since there are so many unknown outcomes, as well as several “other” outcomes, I can likely drop this variable as it does not provide much insight.

Algorithm selection

I am looking to predict the target, y, which is a binary yes or no value that indicates whether the client subscribed to a term deposit. There are several categorical and numeric features that can be used in modelling.

Before evaluating models, I need to determine which performance metric is most relevant in our case. As seen above, the data set is imbalanced: about 88% of the observations did not subscribe to a term deposit while only about 12% did. Thus, simply looking at accuracy will be misleading.

It may be more appropriate to evaluate the model via precision or recall (the true positive rate). This is a hypothetical example, so the true costs of the marketing campaign are unknown. Management may care more about precision, meaning it is important for positive predictions to be accurate: resources are being dedicated to a marketing campaign, so we want them directed at clients who are likely to subscribe to a term deposit. That said, it is difficult to decide which metric is more appropriate without more detail and a discussion with management (perhaps management is more sensitive to false negatives, where we predict a client will not subscribe when in fact they would, which is captured by recall). For this hypothetical exercise, I therefore recommend using the F1 score, the harmonic mean of precision and recall.
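For reference, a sketch of how these metrics could be computed with caret once a model produces class predictions; predicted_y and test_y below are hypothetical placeholders for a model’s predicted classes and the held-out labels:

# Hypothetical sketch: precision, recall, and F1 from caret's confusionMatrix()
# predicted_y and test_y are placeholders (factors with levels 'no'/'yes')
library(caret)
confusionMatrix(data = predicted_y, reference = test_y,
                positive = 'yes', mode = 'prec_recall')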

This exercise is a classification problem with labeled data, meaning it is a supervised learning problem. It is also important to strike a balance between interpretability and accuracy for management’s purposes. Logistic regression would be the most interpretable, as it indicates which features are statistically significant and how they affect the log odds of the response. This can be insightful from a business perspective, since it suggests which features to focus on in future campaigns. However, logistic regression may not be the most accurate model on the metric of choice. Thus, the results from logistic regression should be compared to those from a discriminant analysis model, and the findings discussed with management to determine which approach to take.
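A hedged sketch of how that comparison might be set up in caret, assuming the categorical columns have already been converted to factors and that the MLmetrics package is installed (caret’s prSummary depends on it); model_data is a placeholder for the prepared modeling data set:

# Hypothetical comparison sketch; model_data is a placeholder for the prepared data
# prSummary reports precision, recall, and F, and requires the MLmetrics package
library(caret)
ctrl <- trainControl(method = 'cv', number = 5,
                     classProbs = TRUE, summaryFunction = prSummary)
glm_fit <- train(y ~ ., data = model_data, method = 'glm',
                 family = 'binomial', metric = 'F', trControl = ctrl)
lda_fit <- train(y ~ ., data = model_data, method = 'lda',
                 metric = 'F', trControl = ctrl)
summary(resamples(list(logistic = glm_fit, lda = lda_fit)))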

There is a large amount of data, but not so much that training a model would be prohibitive. The data does have outliers, so transformations will be required to reduce their impact.

Pre-processing

As discussed above, there are a few pre-processing steps:

  • Imputing values for the “unknown” observations for features “contact”, “education”, and “job”

  • Converting categorical variables to dummy variables

  • Checking for features with near zero variance

  • Transforming continuous variables to control for outliers

  • Centering and scaling features will be done when fitting a model via the “caret” package

preproccessed_data <- bank_data
library(caret)

Converting unknown values

First, I replace the “unknown” observations in the “contact”, “education”, and “job” columns with NA:

preproccessed_data$contact[preproccessed_data$contact=='unknown'] <- NA
preproccessed_data$education[preproccessed_data$education=='unknown'] <- NA
preproccessed_data$job[preproccessed_data$job=='unknown'] <- NA

Next, I convert the categorical columns to factors:

preproccessed_data <- preproccessed_data |> mutate(across(where(is.character), as.factor))

Next, I perform KNN imputation to replace the NAs

library(VIM)
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

Attaching package: 'VIM'
The following object is masked from 'package:datasets':

    sleep
preproccessed_data <- kNN(data = preproccessed_data, variable = c('contact', 'education', 'job'), k = 5, imp_var = F)
prop.table(table(preproccessed_data$contact))

  cellular  telephone 
0.93052576 0.06947424 
prop.table(table(preproccessed_data$education))

  primary secondary  tertiary 
0.1589879 0.5380328 0.3029794 
prop.table(table(preproccessed_data$job))

       admin.   blue-collar  entrepreneur     housemaid    management 
   0.11497202    0.21667293    0.03302294    0.02782509    0.21056822 
      retired self-employed      services       student    technician 
   0.05089469    0.03501360    0.09243326    0.02083564    0.16880848 
   unemployed 
   0.02895313 

Dummy variables

Below, I convert categorical columns to dummy variables:

dummy_variables <- dummyVars(~.-y, data = preproccessed_data, fullRank = T)
dummy_data <- predict(dummy_variables, newdata = preproccessed_data)
preproccessed_data <- cbind(as.data.frame(dummy_data), y = preproccessed_data$y)
colnames(preproccessed_data)
 [1] "age"                 "job.blue-collar"     "job.entrepreneur"   
 [4] "job.housemaid"       "job.management"      "job.retired"        
 [7] "job.self-employed"   "job.services"        "job.student"        
[10] "job.technician"      "job.unemployed"      "marital.married"    
[13] "marital.single"      "education.secondary" "education.tertiary" 
[16] "default.yes"         "balance"             "housing.yes"        
[19] "loan.yes"            "contact.telephone"   "day"                
[22] "month.aug"           "month.dec"           "month.feb"          
[25] "month.jan"           "month.jul"           "month.jun"          
[28] "month.mar"           "month.may"           "month.nov"          
[31] "month.oct"           "month.sep"           "duration"           
[34] "campaign"            "pdays"               "previous"           
[37] "poutcome.other"      "poutcome.success"    "poutcome.unknown"   
[40] "y"                  

Near zero variance features

Next, I check for features with near zero variance:

colnames(preproccessed_data)[nearZeroVar(preproccessed_data)]
 [1] "job.entrepreneur"  "job.housemaid"     "job.self-employed"
 [4] "job.student"       "job.unemployed"    "default.yes"      
 [7] "month.dec"         "month.jan"         "month.mar"        
[10] "month.oct"         "month.sep"         "pdays"            
[13] "poutcome.other"    "poutcome.success" 

Dummy variables in this list can be ignored; however, “pdays” also has near zero variance. I will drop “pdays” from the data set, which is also beneficial given its moderate correlation with the “previous” feature.

preproccessed_data$pdays <- NULL

Transforming continuous variables

First, I look at the continuous variables other than “day”:

preproccessed_data |> select(age, balance, duration, campaign, previous) |> describe()
         vars     n    mean      sd median trimmed    mad   min    max  range
age         1 45211   40.94   10.62     39   40.25  10.38    18     95     77
balance     2 45211 1362.27 3044.77    448  767.21 664.20 -8019 102127 110146
duration    3 45211  258.16  257.53    180  210.87 137.88     0   4918   4918
campaign    4 45211    2.76    3.10      2    2.12   1.48     1     63     62
previous    5 45211    0.58    2.30      0    0.13   0.00     0    275    275
          skew kurtosis    se
age       0.68     0.32  0.05
balance   8.36   140.73 14.32
duration  3.14    18.15  1.21
campaign  4.90    39.24  0.01
previous 41.84  4506.16  0.01
preproccessed_data |>
  select(age, balance, duration, campaign, previous) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')

As discussed above, these features have outliers.

The features “balance”, “duration”, and “previous” contain values that are negative or zero, which makes a direct log transformation inappropriate.

Below is the description of the “duration” feature:

last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Thus, I will remove this feature from the data set:

preproccessed_data$duration <- NULL

Next, I will check how many observations are classified as outliers based on the IQR outlier test:

preproccessed_data |> select(age, balance, campaign, previous) |> summarise(across(everything(), 
                   ~ sum(. < (quantile(., 0.25, na.rm = TRUE) - 1.5 * IQR(.)) | 
                         . > (quantile(., 0.75, na.rm = TRUE) + 1.5 * IQR(.)), na.rm = TRUE),
                   .names = "outliers_{.col}"))
  outliers_age outliers_balance outliers_campaign outliers_previous
1          487             4729              3064              8257

This is a meaningful number of outliers (aside from the age feature), and automatically removing them would discard a large amount of data. These outliers also appear genuine (e.g., clients can have negative balances or large positive balances, and certain clients have been contacted more often or less recently than others).

As such, I will apply a log transformation to these features before centering and scaling. While there will still be outliers present even after the log transformation, I believe these values are too meaningful to exclude from the data.

final_preproccessed_data <- preproccessed_data

For balance, whose minimum is -8,019, I will add a constant of 9,000 so that all values are positive before applying the log transformation. For previous, which contains zeros, I will add a constant of 0.1.

# Log-transform age, balance (offset +9,000), campaign, and previous (offset +0.1)
final_preproccessed_data$age <- log(final_preproccessed_data$age)
final_preproccessed_data$balance <- log(final_preproccessed_data$balance + 9000)
final_preproccessed_data$campaign <- log(final_preproccessed_data$campaign)
final_preproccessed_data$previous <- log(final_preproccessed_data$previous + 0.1)

# Rename the transformed columns
names(final_preproccessed_data)[names(final_preproccessed_data)=='age'] <- 'log_age'
names(final_preproccessed_data)[names(final_preproccessed_data)=='balance'] <- 'balance_transformed'
names(final_preproccessed_data)[names(final_preproccessed_data)=='campaign'] <- 'log_campaign'
names(final_preproccessed_data)[names(final_preproccessed_data)=='previous'] <- 'previous_transformed'
final_preproccessed_data |>
  select(log_age, balance_transformed, log_campaign, previous_transformed) |>
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(vars(variable), scales = 'free_y')

Lastly, centering and scaling will be performed when fitting models.
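A short sketch of what that might look like with caret’s train(); the model method and resampling settings below are placeholders chosen for illustration, not the final modeling choice:

# Sketch only: center and scale inside caret's training pipeline (settings are placeholders)
# The prSummary/F-score setup sketched earlier could be reused here via trControl
library(caret)
model_fit <- train(y ~ ., data = final_preproccessed_data,
                   method = 'glm', family = 'binomial',
                   preProcess = c('center', 'scale'),
                   trControl = trainControl(method = 'cv', number = 5))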