A Data-Driven Approach to Predict the Success of Bank Telemarketing.

This project is broken down into four sections.

The goal of this task is to predict if a client will subscribe to a term deposit.

Here is the link to the task on Kaggle https://www.kaggle.com/selvanig/bank-marketing-campaign

We start by loading our packages….
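
The exact chunk isn't reproduced in this write-up; judging from the functions used later in the report (describe, aggr, matrixplot, mosaic, anti_join, na.roughfix), the package list would look roughly like this - treat it as an assumption rather than the original chunk:

#packages assumed from the functions used later in this report
library(dplyr)         # %>%, group_by, summarize, anti_join
library(ggplot2)       # bar charts, histograms, boxplots, facets
library(psych)         # describe()
library(VIM)           # aggr(), matrixplot()
library(vcd)           # mosaic()
library(randomForest)  # na.roughfix()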

and we load the full data-set (bank-additional-full)…

df <- read.table("/Users/kazeemolalekan/Downloads/bank-additional-full.csv", 
                 header = T, sep = ";", stringsAsFactors = F)

Descriptive Analysis:

A good place to start is to check the number of rows and variables we have.

#check dataset dimension (rows and columns)
dim(df)
## [1] 41188    21

Here we can see that the full dataset has 41,188 rows and 21 variables or features. We can follow up on this by viewing the first 10 rows of our data.

#check the first 10 rows
head(df, 10)
##    age         job marital           education default housing loan   contact
## 1   56   housemaid married            basic.4y      no      no   no telephone
## 2   57    services married         high.school unknown      no   no telephone
## 3   37    services married         high.school      no     yes   no telephone
## 4   40      admin. married            basic.6y      no      no   no telephone
## 5   56    services married         high.school      no      no  yes telephone
## 6   45    services married            basic.9y unknown      no   no telephone
## 7   59      admin. married professional.course      no      no   no telephone
## 8   41 blue-collar married             unknown unknown      no   no telephone
## 9   24  technician  single professional.course      no     yes   no telephone
## 10  25    services  single         high.school      no     yes   no telephone
##    month day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1    may         mon      261        1   999        0 nonexistent          1.1
## 2    may         mon      149        1   999        0 nonexistent          1.1
## 3    may         mon      226        1   999        0 nonexistent          1.1
## 4    may         mon      151        1   999        0 nonexistent          1.1
## 5    may         mon      307        1   999        0 nonexistent          1.1
## 6    may         mon      198        1   999        0 nonexistent          1.1
## 7    may         mon      139        1   999        0 nonexistent          1.1
## 8    may         mon      217        1   999        0 nonexistent          1.1
## 9    may         mon      380        1   999        0 nonexistent          1.1
## 10   may         mon       50        1   999        0 nonexistent          1.1
##    cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1          93.994         -36.4     4.857        5191 no
## 2          93.994         -36.4     4.857        5191 no
## 3          93.994         -36.4     4.857        5191 no
## 4          93.994         -36.4     4.857        5191 no
## 5          93.994         -36.4     4.857        5191 no
## 6          93.994         -36.4     4.857        5191 no
## 7          93.994         -36.4     4.857        5191 no
## 8          93.994         -36.4     4.857        5191 no
## 9          93.994         -36.4     4.857        5191 no
## 10         93.994         -36.4     4.857        5191 no

Here we can see the first 10 rows and we can also see some missing values (unknown values). We will come back to unknown values later in this section.

Next, we might want to have a look at the names of all our variables or features. There's no need to go into detail on what each of these variables represents - we already know that from the problem statement. We know marital is the marital status of our clients, and age is the age of the client.

#variable names
names(df)
##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"

and then we can view the structure of our data, to see the data type of each column.

str(df)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...

Columns labeled int are integer columns, columns labeled chr are character columns, and columns labeled num are numeric columns (num and int are both numeric). We might actually want to convert the chr columns to factors - we'll do that soon.

Now, let's talk about unknown values. A good practice is to check which variables or columns have unknown values. We can do this with the colSums function.

#find the variables with unknown mapping (same as NA)
colSums(df == "unknown")
##            age            job        marital      education        default 
##              0            330             80           1731           8597 
##        housing           loan        contact          month    day_of_week 
##            990            990              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

Any variable with a value greater than 0 has unknown values. Here we can see that the job column has 330 unknown values and the marital column has 80. The default column has the highest number of unknown values.

To do a deep dive on unknown values, we will recode them as NA - this is also a good time to recode the chr columns to factors.

The code below recodes unknown to NA, and the line after it converts character columns to factors.

#recode unknown to NA and rename dataframe to df1

df1 <- within(df, {
    job [job == "unknown"] <- NA
    marital [marital == "unknown"] <- NA
    education [education == "unknown"] <- NA
    default [default == "unknown"] <- NA
    housing [housing == "unknown"] <- NA
    loan [loan == "unknown"] <- NA   
})
df1[sapply(df1,is.character)] <- lapply(df1[sapply(df1,is.character)],as.factor)

We can confirm that character columns have been changed to factors by checking the structure of our data again.

str(df1)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : Factor w/ 11 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
##  $ marital       : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ education     : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 NA 6 4 ...
##  $ default       : Factor w/ 2 levels "no","yes": 1 NA 1 1 1 NA 1 NA 1 1 ...
##  $ housing       : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 2 2 ...
##  $ loan          : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

We can also confirm that unknown values have been recoded to NA by doing some descriptive analysis on our dataset.

The n column below is the number of non-missing values. Any column with n less than 41,188 has missing values. We can see that the job, marital, education, default, housing and loan variables have missing values because their n is less than 41,188.

We might also be interested in the mean, median and standard deviation of age. Since the standard deviation of age is 10.42 (large in the context of age), the median is a better representation of a typical age than the mean.

#descriptive analysis
psych::describe(df1)
##                vars     n    mean     sd  median trimmed    mad     min     max
## age               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00
## job*              2 40858    4.67   3.55    3.00    4.43   2.97    1.00   11.00
## marital*          3 41108    2.17   0.60    2.00    2.21   0.00    1.00    3.00
## education*        4 39457    4.60   2.07    4.00    4.76   2.97    1.00    7.00
## default*          5 32591    1.00   0.01    1.00    1.00   0.00    1.00    2.00
## housing*          6 40198    1.54   0.50    2.00    1.55   0.00    1.00    2.00
## loan*             7 40198    1.16   0.36    1.00    1.07   0.00    1.00    2.00
## contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00
## month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00
## day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00
## duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00
## campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00
## pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00
## previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00
## poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00
## emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40
## cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77
## cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90
## euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04
## nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10
## y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00
##                  range   skew kurtosis   se
## age              81.00   0.78     0.79 0.05
## job*             10.00   0.45    -1.41 0.02
## marital*          2.00  -0.09    -0.42 0.00
## education*        6.00  -0.23    -1.24 0.01
## default*          1.00 104.21 10858.00 0.00
## housing*          1.00  -0.15    -1.98 0.00
## loan*             1.00   1.90     1.62 0.00
## contact*          1.00   0.56    -1.69 0.00
## month*            9.00  -0.31    -1.03 0.01
## day_of_week*      4.00   0.01    -1.27 0.01
## duration       4918.00   3.26    20.24 1.28
## campaign         55.00   4.76    36.97 0.01
## pdays           999.00  -4.92    22.23 0.92
## previous          7.00   3.83    20.11 0.00
## poutcome*         2.00  -0.88     3.98 0.00
## emp.var.rate      4.80  -0.72    -1.06 0.01
## cons.price.idx    2.57  -0.23    -0.83 0.00
## cons.conf.idx    23.90   0.30    -0.36 0.02
## euribor3m         4.41  -0.71    -1.41 0.01
## nr.employed     264.50  -1.04     0.00 0.36
## y*                1.00   2.45     4.00 0.00

Let's also check the summary of our dataset. Here we can see the minimum, mean, median and max of each numeric column. We can also see that the most common job among our clients is admin. (job), most of our clients are married (marital), and most of our clients have a university degree (education).

#summarize data
summary(df1)
##       age                 job            marital     
##  Min.   :17.00   admin.     :10422   divorced: 4612  
##  1st Qu.:32.00   blue-collar: 9254   married :24928  
##  Median :38.00   technician : 6743   single  :11568  
##  Mean   :40.02   services   : 3969   NA's    :   80  
##  3rd Qu.:47.00   management : 2924                   
##  Max.   :98.00   (Other)    : 7546                   
##                  NA's       :  330                   
##                education     default      housing        loan      
##  university.degree  :12168   no  :32588   no  :18622   no  :33950  
##  high.school        : 9515   yes :    3   yes :21576   yes : 6248  
##  basic.9y           : 6045   NA's: 8597   NA's:  990   NA's:  990  
##  professional.course: 5243                                         
##  basic.4y           : 4176                                         
##  (Other)            : 2310                                         
##  NA's               : 1731                                         
##       contact          month       day_of_week    duration     
##  cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
##  telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
##                    aug    : 6178   thu:8623    Median : 180.0  
##                    jun    : 5318   tue:8090    Mean   : 258.3  
##                    nov    : 4101   wed:8134    3rd Qu.: 319.0  
##                    apr    : 2632               Max.   :4918.0  
##                    (Other): 2016                               
##     campaign          pdays          previous            poutcome    
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
##  Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
##  Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
##  Max.   :56.000   Max.   :999.0   Max.   :7.000                      
##                                                                      
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08189   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr.employed     y        
##  Min.   :4964   no :36548  
##  1st Qu.:5099   yes: 4640  
##  Median :5191              
##  Mean   :5167              
##  3rd Qu.:5228              
##  Max.   :5228              
## 

Exploratory Analysis

Now, let's do a deep dive on missing (NA) values. Let's see the total number of missing values and also their proportion (in short, what proportion of our dataset has missing values?).

#total number of missing values (NA cells) in the dataset
sum(is.na(df1))
## [1] 12718
#proportion of rows with missing values:
#in short, about 26% of rows have one or more missing values (roughly one in every four rows)

mean(!complete.cases(df1))
## [1] 0.2597844

The total number of missing values (or unknown entries) in our dataset is 12,718, and about 26% of the rows in our dataset contain at least one missing value. A better way to understand this is through visualization.

#visualize missing values and their distribution
aggr(df1, prop = F, numbers = T)

Here, we can see that we have two plots. The plot on the left is a bar chart of missing values - what stands out is the height of the default bar, which means the default variable has the highest number of missing values. Maybe our clients are not keen on disclosing their default status (I mean, it's not something to be proud of). Education is another variable with lots of missing values.

On the right of our plot, we have a grid of missing-value combinations. To read this grid, let's take a bottom-up approach. A blue box represents a non-missing value, so a row of blue boxes with no red box corresponds to rows with no missing values - in short, we have 30,488 rows with no missing values. Next, we have 7,757 rows with exactly one missing value, in the default variable. We also have 1,102 rows with a single missing value in another variable (most likely education, since marital has only 80 unknowns in total). Next, we have 739 rows missing values for both the housing and loan variables (you should get the logic by now).
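
As a sanity check (a small sketch using the df1 data frame from above), we can reproduce a couple of these counts directly:

#rows with no missing values at all
sum(complete.cases(df1))

#rows whose only missing value is in the default column
sum(is.na(df1$default) & rowSums(is.na(df1)) == 1)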

Let's end this subsection with a matrix plot of missing values. This helps us judge whether values are missing at random (MAR), missing completely at random (MCAR), or not missing at random (NMAR).

#how related are the missing values
matrixplot(df1, sortby = 1, main = "How distributed are the missing values in our dataset ?")

## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.

Here, the red shades signify missing values. We can see that the default column has a lot of red shading. The plot is sorted by age, and I can make a case that as age increases, the number of missing values for default also increases - in other words, older clients are more likely to leave the default field as unknown, which would point towards MAR rather than MCAR. If you don't buy that pattern, you could instead make a case that our data is MCAR (missing completely at random).
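
To put a rough number on that claim, we can compare the median age of clients with and without a recorded default status (a quick sketch, not a formal test):

#median age by whether the default field is missing
tapply(df1$age, is.na(df1$default), median)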

Moving on, let's have a look at our output variable y. We want to see how many of our clients subscribed to a term deposit. We can do this with an ftable:

ftable(df1[,"y"])
##     no   yes
##             
##  36548  4640

Here we can see that 36,548 of our clients didn't subscribe and 4,640 did - a huge imbalance. We can visualize this better with a mosaic plot. A quick word on mosaic plots: they usually don't pass the eye test when you have multiple variables, but once you get the logic behind them you'll understand why they are such a powerful plot for categorical variables.

Below is a mosaic plot of our output variable.

mosaic(~y, data = df1, main = "Most of our clients (more than 80%) didn't subscribe to a term deposit")

The plot above passes the eye test: size in a mosaic plot is encoded by the area of the rectangular boxes. The bigger the box, the bigger the share of that category. We can see that "no" makes up more than 80% of the rectangle, with "yes" taking a small chunk.

We can build on this by adding marital status to our mosaic plot - and this is where it might get complicated at first, but after a while you get comfortable with it.

mosaic(~y + marital,shade =T, data = df)

From the plot above, we have eight boxes (although it looks like six - notice the small unknown portion of the plot). Here, we can see that most of our clients are married, followed by single clients, then divorced clients, and then clients that decided not to fill in their marital status (just a few of them, which is why the unknown boxes are very small).

We can also see that most of the clients that subscribed are married (which is not surprising, since most of our clients are married). One major insight here is that single people are more likely to subscribe than married people (notice the blue box extending slightly into the married territory). The blue shading means single clients that subscribed occur more often than expected, and the red shading means married clients that subscribed occur less often than expected.

We can confirm this with a proportion table:

marital_tab <- xtabs(~marital + y, data = df)

addmargins(prop.table(marital_tab,1),2)
##           y
## marital           no       yes       Sum
##   divorced 0.8967910 0.1032090 1.0000000
##   married  0.8984275 0.1015725 1.0000000
##   single   0.8599585 0.1400415 1.0000000
##   unknown  0.8500000 0.1500000 1.0000000

14% of the "single" clients we contacted subscribed, whereas 10% of married clients subscribed. 15% of clients that didn't fill in their marital status subscribed. Of course, there's a huge disparity in counts between married and single clients, so the fact that a bigger proportion of "single" clients subscribed (14%) than "married" clients (10%) might not mean much, but it does mean something nonetheless.

Finally, we can view the counts behind this combination by adding margins to the table.

addmargins(marital_tab)
##           y
## marital       no   yes   Sum
##   divorced  4136   476  4612
##   married  22396  2532 24928
##   single    9948  1620 11568
##   unknown     68    12    80
##   Sum      36548  4640 41188

We can see that out of the 36,548 clients that didn't subscribe, 22,396 identified as married and 9,948 as single. The total number of married clients in our dataset is 24,928 (more than half of our clients are married).

Another thing we might be interested in is how our clients were contacted. Once again, we visualize this with a mosaic plot.

mosaic(~y + contact,shade =T, data = df, main = "Clients contacted via cellular are more likely to subscribe")

Here we can see that most of our clients were contacted via cellular. We can also see that most of the clients that subscribed to a term deposit were contacted via cellular. In short, we might want to communicate with our clients via cellular rather than telephone.

We can check the proportions:

comms_tab <- xtabs(~contact + y, data = df)

addmargins(prop.table(comms_tab,1),2)
##            y
## contact             no        yes        Sum
##   cellular  0.85262393 0.14737607 1.00000000
##   telephone 0.94768679 0.05231321 1.00000000

14% of the clients contacted via cellular subscribed, and just 5% of the clients contacted via telephone subscribed. Let's end this the usual way by viewing the total counts of contact.

addmargins(comms_tab)
##            y
## contact        no   yes   Sum
##   cellular  22291  3853 26144
##   telephone 14257   787 15044
##   Sum       36548  4640 41188

We'll come back to exploring the data with mosaic plots later in this section. For now, let's try to understand the day_of_week and month variables. But before we do, we need to reorder the day_of_week and month factors - we want Monday to come first, then Tuesday, then Wednesday (you get the point) - this will make our job easier and the visualizations easier to read.

#reorder month
df1$month <- factor(df1$month, order = T, 
                   levels = c("mar","apr","may","jun","jul","aug","sep","oct","nov","dec"))
#reorder day_of_the_week
df1$day_of_week <- factor(df1$day_of_week,
                         levels = c("mon","tue","wed","thu","fri"))

Now let's visualize the month our clients were last contacted.

par(family = "mono", font = 2)
barplot(table(df1$month), main = "Most of our clients were last contacted in may")

Most of our clients were last contacted in May, and then July. We can check the proportion of clients that subscribed based on the month they were last contacted.

month_tab <- xtabs(~month + y, data = df1)

addmargins(prop.table(month_tab,1),2)
##      y
## month         no        yes        Sum
##   mar 0.49450549 0.50549451 1.00000000
##   apr 0.79521277 0.20478723 1.00000000
##   may 0.93565255 0.06434745 1.00000000
##   jun 0.89488530 0.10511470 1.00000000
##   jul 0.90953443 0.09046557 1.00000000
##   aug 0.89397863 0.10602137 1.00000000
##   sep 0.55087719 0.44912281 1.00000000
##   oct 0.56128134 0.43871866 1.00000000
##   nov 0.89856133 0.10143867 1.00000000
##   dec 0.51098901 0.48901099 1.00000000

About 50% of the clients last contacted in March subscribed. The September, October and December figures are also impressive - the only downside is that not many clients were contacted in those months, including March. It's almost as if the more clients are contacted in a particular month, the less likely those clients are to subscribe, and vice-versa.
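
To see that pattern side by side (a quick sketch using the month_tab object defined above), we can put the number of contacts per month next to that month's subscription rate:

#contacts per month alongside the subscription rate for that month
data.frame(contacts = rowSums(month_tab),
           subscribe_rate = round(prop.table(month_tab, 1)[, "yes"], 3))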

Next is a plot of the day of the week when our clients were last contacted.

par(family = "mono", font = 2)
barplot(table(df1$day_of_week), main = "Most of our clients were last contacted on monday and thursday")

Contacts across day_of_week are much more uniform than across month, although Thursday is the day when most of our clients were last contacted. We can add a mosaic plot here to supplement our bar plot.

mosaic(~y + day_of_week, shade = T, data = df1)

From our mosaic plot, we can see that clients last contacted on Thursday (and that then subscribed) occur more often than expected (blue shading).

We can view the proportion of our output variable:

day_of_the_tab <- xtabs(~day_of_week + y, data = df1)

addmargins(prop.table(day_of_the_tab,1),2)
##            y
## day_of_week        no       yes       Sum
##         mon 0.9005168 0.0994832 1.0000000
##         tue 0.8822002 0.1177998 1.0000000
##         wed 0.8833292 0.1166708 1.0000000
##         thu 0.8788125 0.1211875 1.0000000
##         fri 0.8919126 0.1080874 1.0000000

Here we can see that Thursday is the best day to contact our clients (12% of clients last contacted on a Thursday subscribed to a term deposit). Monday is definitely not a good day to contact our clients.

To end this part, we'll build a table of month by day_of_week. This will enable us to understand the relationship between the two variables.

day_month_tab <- xtabs(~month + day_of_week, data = df1)

addmargins(prop.table(day_month_tab,1),2)
##      day_of_week
## month        mon        tue        wed        thu        fri        Sum
##   mar 0.26190476 0.25641026 0.12820513 0.18131868 0.17216117 1.00000000
##   apr 0.26671733 0.09574468 0.11398176 0.29179331 0.23176292 1.00000000
##   may 0.19188031 0.20400901 0.21228847 0.18425448 0.20756772 1.00000000
##   jun 0.23523881 0.18239940 0.18484393 0.18183528 0.21568259 1.00000000
##   jul 0.21131865 0.21145804 0.20309451 0.23306384 0.14106496 1.00000000
##   aug 0.19779864 0.20977663 0.20119780 0.21803173 0.17319521 1.00000000
##   sep 0.15789474 0.20701754 0.21929825 0.21403509 0.20175439 1.00000000
##   oct 0.17966574 0.20752089 0.18802228 0.22701950 0.19777159 1.00000000
##   nov 0.18678371 0.19848817 0.21043648 0.22019020 0.18410144 1.00000000
##   dec 0.29120879 0.13736264 0.19230769 0.24725275 0.13186813 1.00000000

A big takeaway from the table above is that, within April, Thursday accounts for the largest share of last contacts (29%), and within December, Monday accounts for 29%. Tuesday in April (under 10%) is the slot when our clients were least contacted. We can build on this by adding the output variable to the table.

day_month_y_tab <- xtabs(~month + day_of_week + y, data = df1)

addmargins(prop.table(day_month_y_tab,1),2)
## , , y = no
## 
##      day_of_week
## month        mon        tue        wed        thu        fri        Sum
##   mar 0.15750916 0.10256410 0.05128205 0.10073260 0.08241758 0.49450549
##   apr 0.23746201 0.06496960 0.08016717 0.20440729 0.20820669 0.79521277
##   may 0.17800857 0.19318760 0.19877987 0.17256155 0.19311497 0.93565255
##   jun 0.20966529 0.15889432 0.16265513 0.16472358 0.19894697 0.89488530
##   jul 0.19459158 0.19291887 0.18427655 0.21313075 0.12461667 0.90953443
##   aug 0.18047912 0.18582065 0.17740369 0.19682745 0.15344772 0.89397863
##   sep 0.11052632 0.10000000 0.10701754 0.11228070 0.12105263 0.55087719
##   oct 0.12116992 0.11420613 0.10027855 0.12395543 0.10167131 0.56128134
##   nov 0.17044623 0.17654231 0.18995367 0.19775664 0.16386247 0.89856133
##   dec 0.16483516 0.05494505 0.07692308 0.13186813 0.08241758 0.51098901
## 
## , , y = yes
## 
##      day_of_week
## month        mon        tue        wed        thu        fri        Sum
##   mar 0.10439560 0.15384615 0.07692308 0.08058608 0.08974359 0.50549451
##   apr 0.02925532 0.03077508 0.03381459 0.08738602 0.02355623 0.20478723
##   may 0.01387174 0.01082141 0.01350861 0.01169293 0.01445276 0.06434745
##   jun 0.02557352 0.02350508 0.02218879 0.01711170 0.01673561 0.10511470
##   jul 0.01672707 0.01853917 0.01881795 0.01993309 0.01644829 0.09046557
##   aug 0.01731952 0.02395597 0.02379411 0.02120427 0.01974749 0.10602137
##   sep 0.04736842 0.10701754 0.11228070 0.10175439 0.08070175 0.44912281
##   oct 0.05849582 0.09331476 0.08774373 0.10306407 0.09610028 0.43871866
##   nov 0.01633748 0.02194587 0.02048281 0.02243355 0.02023897 0.10143867
##   dec 0.12637363 0.08241758 0.11538462 0.11538462 0.04945055 0.48901099

Here's how to read this table, using the first row as an illustration: 49.4% of the clients last contacted in March ended up not subscribing, and of that 49.4%, 15.7% were last contacted on a Monday, 10.3% on a Tuesday, and 5.1% on a Wednesday. Likewise, 50.5% of the clients last contacted in March subscribed; of those, 10.4% were last contacted on a Monday and 15.4% on a Tuesday.
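
If the two stacked sub-tables are hard to scan, we can flatten the same three-way table with ftable so that each month sits on a single row (a presentation sketch, same numbers):

#flatten the three-way table: months as rows, day_of_week crossed with y as columns
ftable(prop.table(day_month_y_tab, 1), row.vars = "month")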

We might also want to test for a relationship between day of the week and our output variable. This answers the question: "if we know the day a client was last contacted, can we predict whether the client will subscribe to a term deposit?"

We do this with a chi-square test:

day_y <- xtabs(~y + day_of_week, data = df)

chisq.test(day_y)
## 
##  Pearson's Chi-squared test
## 
## data:  day_y
## X-squared = 26.145, df = 4, p-value = 2.958e-05

The p-value is 2.958e-05, which is well below 0.05, meaning there's a relationship between day_of_week and our output variable y.

We can also check for the relationship between month and our output variable:

month_y <- xtabs(~y + month, data = df)

chisq.test(month_y)
## 
##  Pearson's Chi-squared test
## 
## data:  month_y
## X-squared = 3101.1, df = 9, p-value < 2.2e-16

Once again, the p-value is less than 0.05, hence a relationship exists between month and our output variable. In other words, knowing the month a client was last contacted tells us something about whether that client will subscribe.

Next, let's look at the housing loan variable and how it relates to our output variable, starting with a mosaic plot:

mosaic(~y + housing,shade =T, data = df)

Here we can see that only a few clients declined to answer the question on housing loans. We can also see that we have almost equal numbers of clients with and without a housing loan. From the plot you can also deduce that clients with a housing loan are slightly more likely to subscribe (although the difference is very small).

We can confirm this with a proportion table:

housing_tab <- xtabs(~housing + y, data = df)

addmargins(prop.table(housing_tab,1),2)
##          y
## housing          no       yes       Sum
##   no      0.8912040 0.1087960 1.0000000
##   unknown 0.8919192 0.1080808 1.0000000
##   yes     0.8838061 0.1161939 1.0000000

11.6% of clients with a housing loan subscribed to a term deposit, 10.9% of clients without a housing loan subscribed, and 10.8% of clients that didn't fill in a value for that question subscribed - almost the same figures.

Again, we can check for the relationship between housing loan and term deposit subscription:

housing_y <- xtabs(~y + housing, data = df)

chisq.test(housing_y)
## 
##  Pearson's Chi-squared test
## 
## data:  housing_y
## X-squared = 5.6845, df = 2, p-value = 0.05829

Here the p-value is 0.058, which is above 0.05, meaning we don't have evidence of a relationship between housing loan and term deposit subscription. In short, you can't predict whether a client will subscribe to a term deposit just by knowing their housing loan status.

Since we are talking about loans, we might as well dive into the loan variable.

mosaic(~y + loan,shade =T, data = df)

Most of our clients have no existing loan. We can check for the relationship between loan and term deposit subscription.

loan_y <- xtabs(~y + loan, data = df)

chisq.test(loan_y)
## 
##  Pearson's Chi-squared test
## 
## data:  loan_y
## X-squared = 1.094, df = 2, p-value = 0.5787

The p-value here is very high, at 0.58, meaning we find no evidence of a relationship between loan and term deposit subscription.

Finally, let's have a look at the default variable. Recall that we have lots of missing values for this variable (more than 8,000, in fact), and we can see this from the plot below. Also, clients that didn't default appear more likely to subscribe.

mosaic(~y + default,shade =T, data = df)

Let's check for a relationship between default status and term deposit subscription (the color shading already suggests there might be one).

tab <- xtabs(~default + y, data = df)
chisq.test(tab)
## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 406.58, df = 2, p-value < 2.2e-16

Here we can see that the p-value is less than 0.05, suggesting a relationship between default status and term deposit subscription. However, I did find that once you remove the missing values, you get a different result. Below I will run another test, this time without the missing values.

tab <- xtabs(~default + y, data = df1)
chisq.test(tab)
## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 6.7698e-24, df = 1, p-value = 1

This time you can see that the p-value is equal to 1 (far greater than 0.05), meaning we find no relationship between default status and our outcome variable (term deposit).

Let's end this look at categorical variables by examining the kind of jobs our clients do and their level of education.

Starting with jobs: below, I have reordered the job factor levels to make the visualization easier to understand, followed by a plot of job against the output variable y.

df$job <- factor(df$job, order = T, 
                   levels = c("unknown","student","unemployed","housemaid","self-employed",
                              "entrepreneur","retired","management","services","technician","blue-collar","admin."))
library(ggplot2)
ggplot(df, aes(job, fill = y)) + 
  geom_bar() + 
  coord_flip() + theme(legend.position  = "top" , 
                       legend.title = element_blank(),
                       legend.text = element_text(size = 13),
                       axis.text.y = element_text(size = 13),
                       axis.text.x = element_text(size = 12),
                       plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
                       plot.subtitle = element_text(size = 15, hjust = .5)) + 
  xlab(NULL) + 
  ylab(NULL) +
  labs(title = "Admin, blue-collar and technician are the most common jobs among our clients",
       subtitle = "Few of our clients are also students, unemployed and maids") +
  scale_y_continuous(labels = scales::comma_format())

Here we can see that most of our clients are admin., blue-collar and technicians. The turquoise portion of each bar is the share of clients that subscribed to a term deposit, and the red portion is the share that didn't. Technician clients are more likely to subscribe than blue-collar clients. Retired clients and students also stand out as more likely to subscribe to a term deposit.

We can look further into this and confirm it with a proportion table:

job_tab <- xtabs(~job + y, data = df)

addmargins(prop.table(job_tab,1),2)
##                y
## job                     no        yes        Sum
##   unknown       0.88787879 0.11212121 1.00000000
##   student       0.68571429 0.31428571 1.00000000
##   unemployed    0.85798817 0.14201183 1.00000000
##   housemaid     0.90000000 0.10000000 1.00000000
##   self-employed 0.89514426 0.10485574 1.00000000
##   entrepreneur  0.91483516 0.08516484 1.00000000
##   retired       0.74767442 0.25232558 1.00000000
##   management    0.88782490 0.11217510 1.00000000
##   services      0.91861930 0.08138070 1.00000000
##   technician    0.89173958 0.10826042 1.00000000
##   blue-collar   0.93105684 0.06894316 1.00000000
##   admin.        0.87027442 0.12972558 1.00000000

You can see that 31% of students subscribed to a term deposit, and one in four of our retired clients subscribed as well. Those are very impressive numbers.

Let’s do the usual relationship check:

tab <- xtabs(~job + y, data = df1)
chisq.test(tab)
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 961.21, df = 10, p-value < 2.2e-16

As always, we look at the p-value; it is way less than 0.05, meaning a relationship exists between job and our output variable. In short, if I know the job of a particular client, I can better predict whether the client will subscribe or not.

And finally, the education feature. We do the usual: factor reordering, then a plot.

df$education <- factor(df$education, order = T, 
                   levels = c("illiterate","unknown","basic.6y","basic.4y","professional.course",
                              "basic.9y","high.school","university.degree"))
ggplot(df, aes(education, fill = y)) + 
  geom_bar() + 
  coord_flip() + theme(legend.position  = "top" , 
                       legend.title = element_blank(),
                       legend.text = element_text(size = 13),
                       axis.text.y = element_text(size = 13),
                       axis.text.x = element_text(size = 12),
                       plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
                       plot.subtitle = element_text(size = 15, hjust = .5)) + 
  xlab(NULL) + 
  ylab(NULL) +
  labs(title = "A big chunk of our clients are well educated with a degree",
       subtitle = "The more educated our clients are the more likely they are to subscribe to a term deposit") +
  scale_y_continuous(labels = scales::comma_format())

Let’s stop here with categorical variables and have a look at our numeric features.

We can see a breakdown of our numeric columns below.

num.cols <- sapply(df,is.numeric)
summary((df[,num.cols]))
##       age           duration         campaign          pdays      
##  Min.   :17.00   Min.   :   0.0   Min.   : 1.000   Min.   :  0.0  
##  1st Qu.:32.00   1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0  
##  Median :38.00   Median : 180.0   Median : 2.000   Median :999.0  
##  Mean   :40.02   Mean   : 258.3   Mean   : 2.568   Mean   :962.5  
##  3rd Qu.:47.00   3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0  
##  Max.   :98.00   Max.   :4918.0   Max.   :56.000   Max.   :999.0  
##     previous      emp.var.rate      cons.price.idx  cons.conf.idx  
##  Min.   :0.000   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  1st Qu.:0.000   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
##  Median :0.000   Median : 1.10000   Median :93.75   Median :-41.8  
##  Mean   :0.173   Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
##  3rd Qu.:0.000   3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
##  Max.   :7.000   Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
##    euribor3m      nr.employed  
##  Min.   :0.634   Min.   :4964  
##  1st Qu.:1.344   1st Qu.:5099  
##  Median :4.857   Median :5191  
##  Mean   :3.621   Mean   :5167  
##  3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :5.045   Max.   :5228

Here we can see that our youngest client is 17 years old and our oldest client is 98. For duration, the maximum time spent talking to a client is 4,918 seconds (about 82 minutes). The minimum number of campaign contacts is 1 and the maximum is 56, meaning we have a client that was contacted 56 times.

Overall, our major focus will be on the age variable and, to some degree, the duration variable. Because campaign is skewed to the right, I will not be looking into it.

How do I know that campaign is skewed to the right? Because the minimum, mean, median and 3rd quartile are all very close; that's enough to tell me most of our clients were contacted once or twice, and the maximum of 56 is an outlier.
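
We can back that up with a few upper quantiles of campaign (a quick sketch):

#upper quantiles of campaign, to see how quickly the right tail stretches out
quantile(df1$campaign, probs = c(0.5, 0.75, 0.9, 0.95, 0.99, 1))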

We'll also not be looking at the social and economic attributes, since these are economic factors that we don't have the power to adjust or control.

Let's go over age. The standard deviation of age is 10.4; another way to say that is that most of our clients are within one standard deviation of the mean, roughly between 30 and 50 years old. Because our standard deviation is big (in the context of age), the median rather than the mean is a better representation of the average age (the bigger the standard deviation, the less the mean represents a typical client).

#standard deviation of age
sd(df1$age)
## [1] 10.42125
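
We can also check what share of clients actually falls within one standard deviation of the mean age (a quick sketch):

#proportion of clients within one standard deviation of the mean age
mean(abs(df1$age - mean(df1$age)) <= sd(df1$age))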

Next is the distribution of our clients' ages, using a histogram and a density plot.

par(mfrow = c(1,2), family = "mono")
hist(df1$age,breaks = 10, col = "red", xlab = "Age", main = "Most of our clients are \n between the age of 30-40 years old")
plot(density(df1$age), col = "red", main = "")
polygon(density(df1$age), col = "brown")

Here we can verify what we already know: most of our clients fall between the ages of 30-40, or more broadly 30-50, years old.

What we might want to do is to visualize our age variable with other categorical variables.

We can do this with a facet plot, first of all let’s check the age distribution of our clients with regards to their jobs.

ggplot(df1, aes(age)) + geom_histogram(color = "white", binwidth = 5) + 
facet_wrap (~job) + theme(strip.text.x = element_text(size = 13),
                          axis.text.x = element_text(size = 11),
                          axis.text.y = element_text(size = 11),
                          plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
                       plot.subtitle = element_text(size = 15, hjust = .5)) +
  labs(title = "Most of our clients that are students are 25 years old",
       subtitle = "Our clients that are into admin. are equally distributed between 30 and 35 years old")

… and age and education breakdown

ggplot(df1, aes(age)) + geom_histogram(color = "white", binwidth = 5) + 
facet_wrap (~education) + theme(strip.text.x = element_text(size = 13),
                          axis.text.x = element_text(size = 11),
                          axis.text.y = element_text(size = 11),
                          plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
                       plot.subtitle = element_text(size = 15, hjust = .5)) +
  labs(title = "Clients with basic.4y education are quite old",
       subtitle = "Clients with university degree fall in between ages of 30-35 years old")

Still on age and education: we can eyeball the median age of our clients by education level.

df %>% 
group_by(education) %>% 
summarize(median_age = median(age)) %>% 
  ggplot(aes(median_age, reorder(education,median_age))) + 
  geom_point(size = 2) +
  theme(axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
        plot.subtitle = element_text(size = 15, hjust = .5)) +
  ylab(NULL) + 
  labs(title = "Our younger clients are more educated than our older clients",
       subtitle = "There's a big gap in age between our uneducated and basic.4y clients and basic.6y clients")

You can make a case that clients with an "unknown" education entry are most likely not well educated. Judging by the median ages in the plot, clients with a university degree, a high school education, or basic.9y education are less likely to leave their education blank (look at the gap between their median ages and the median age of clients in the "unknown" group).

We can also check the median age of our clients with regard to their jobs.

df %>% 
group_by(job) %>% 
summarize(median_age = median(age)) %>% 
  ggplot(aes(median_age, reorder(job,median_age))) + 
  geom_point(size = 2) +
  theme(axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
        plot.subtitle = element_text(size = 15, hjust = .5)) +
  ylab(NULL) + 
  labs(title = "Housemaids and clients with a missing job value share the same median age",
       subtitle = "Clients with missing job values are most likely not students, in services or in admin")

Once again, we can make a case, or even a guess, that the clients that didn't fill in their job are most likely retired, housemaids or in management.

Let's see how all of this relates to our output variable.

ggplot(df1, aes(job, age, fill = y)) + 
geom_boxplot() + 
  theme(axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
        plot.subtitle = element_text(size = 15, hjust = .5),
        legend.position = "top",
        legend.title = element_blank(),
        legend.text = element_text(size = 13)) +
  xlab(NULL) + 
  labs(title = "Older clients in the retired and housemaid cohorts are more likely to subscribe",
       subtitle = "Younger clients in the student, self-employed or admin cohorts are more likely to subscribe")

The irony here is that among retired clients and housemaids it is the older clients who are more likely to subscribe, whereas among students, the self-employed and admin workers it is the younger clients who are more likely to subscribe.

ggplot(df1, aes(education, age, fill = y)) + 
geom_boxplot()+
  theme(axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
        plot.subtitle = element_text(size = 15, hjust = .5),
        legend.position = "top",
        legend.title = element_blank(),
        legend.text = element_text(size = 13)) +
  xlab(NULL) + 
  labs(title = "Older clients in the basic.4y cohorts are more likely to subscribe",
       subtitle = "Younger people in other cohorts are more likely to subscribe")

To end the exploratory section, let's have a quick look at call durations.

df1 %>% 
  mutate(minutes = round(duration/60,0)) %>% 
  filter(minutes > 0 & minutes < 20 ) %>% 
  ggplot(aes(minutes)) + geom_bar() + facet_grid(y~contact) +
  theme(strip.text.x = element_text(size = 13),
        strip.text.y = element_text(size = 13),
                          axis.text.x = element_text(size = 11),
                          axis.text.y = element_text(size = 11),
                          plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
                       plot.subtitle = element_text(size = 15, hjust = .5)) +
  labs(title = "We spent around 4 minutes on the phone with most of our clients that subscribed",
       subtitle = "Most Clients that didn't subscribe were on the phone for 2 minutes.")

For the plot above, I have removed calls that lasted more than 20 minutes (they are mostly outliers). Each bar represents one minute, so you can count the bars to read off the exact minute. I have also faceted the plot by contact type and our output variable. Once again we can see that most of our clients were contacted via cellular.
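
To back up that reading with numbers rather than bar heights, here is a quick sketch of the median call length by outcome (assuming dplyr is loaded, as before):

#median call length in minutes, split by whether the client subscribed
df1 %>%
  mutate(minutes = duration / 60) %>%
  group_by(y) %>%
  summarize(median_minutes = median(minutes))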

That’s enough for now. We have more than covered a basic exploration of our dataset. Now we move on to the modeling and prediction section.

Modeling | Predictions

For this section we are going to reload our data, and this time make use of both the bank-additional-full and bank-additional datasets.

df <- read.table("/Users/kazeemolalekan/Downloads/bank-additional-full.csv", 
                 header = T, sep = ";", stringsAsFactors = F)

df.test_set <- read.table("/Users/kazeemolalekan/Downloads/bank-additional/bank-additional.csv",header = T, sep = ";", stringsAsFactors = F)

Now, this part is important. From the problem statement, it was stated that bank-additional.csv (let's call this our test set) is a random selection of rows from bank-additional-full.csv. So what we want to do now is remove our test set from the bank-additional-full dataset, so that when we build our model, we are not training on our test set.

We do this with anti_join below:

df.train_set <- anti_join(df, df.test_set)
## Joining, by = c("age", "job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "duration", "campaign", "pdays", "previous", "poutcome", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed", "y")

Now, our test set has been excluded from bank-additional-full. And we can verify this below

dim(df.train_set)
## [1] 37068    21

Our bank-additional-full dataset (or let’s call it training set) has 37,068 rows.

dim(df.test_set)
## [1] 4119   21

… and our test set has 4,119 rows.

Next, we'll remove the duration variable (we were advised to do this in the problem statement), and then we'll recode unknown values to NA for both datasets.

changes <- function(x){
  #remove duration column
  var <- names(x) %in% "duration"
  x <- x[!var]
  
  #replace unknown with NA
    x$job [x$job == "unknown"] <- NA
    x$marital [x$marital == "unknown"] <- NA
    x$education [x$education == "unknown"] <- NA
    x$default [x$default == "unknown"] <- NA
    x$housing [x$housing == "unknown"] <- NA
    x$loan [x$loan == "unknown"] <- NA   

  #convert character columns to factors
  x[sapply(x,is.character)] <- lapply(x[sapply(x, is.character)], as.factor)
  x
}

… and then we apply the function to both sets. To the training set,

df.train <- changes(df.train_set)

… and then test set.

df.test <- changes(df.test_set)

This is where it gets interesting. Before we get into it, I think it's important to go over how we are going to approach the modeling. Right now we have two decisions to make: how to handle the missing values, and which models to fit.

Back to missing values. Recall that we have 12,718 missing values in total, and the rows containing them make up more than 20% of our data (26% to be specific) - roughly one in every four rows has a missing value. We can confirm this again:

#total number of missing values (NA cells)
sum(is.na(df1))
## [1] 12718
#proportion of rows with at least one missing value
mean(!complete.cases(df1))
## [1] 0.2597844

There are two ways to approach modeling with a dataset that has missing values: either we fit the model on complete cases only (listwise deletion, where every row with a missing value is dropped), or we impute the missing values first (for example, with multiple imputation).

Now, 26% of our rows have missing values - that's a red flag that the listwise approach would be the wrong one to take. For instance, we have seven thousand, seven hundred and fifty-seven rows (7,757) that are missing just a single value, for the default variable, whereas those same rows have non-missing values for every other variable like age, contact, education and the rest. The irony of deleting rows like these is that we would be deleting rows with far more non-missing values than missing ones.

What I'm driving at is that it would be bad practice for us to ignore or remove the rows with missing values - 26% of incomplete rows is a lot of data/information to throw away - which is why we won't be taking the listwise approach, since it deletes every row with any missing value. What we might want to do instead is take the multiple imputation approach.
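
To see what listwise deletion would cost us in rows (a quick sketch on the exploration data frame df1):

#rows kept under listwise deletion vs. the total number of rows
c(complete = sum(complete.cases(df1)), total = nrow(df1))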

How does this work? I won't go into specific details (to avoid bloating the report), but the idea is to fill in each missing value with plausible values drawn from the observed data several times over, fit the model on each completed dataset, and pool the results.

For this project, though, we'll be using the simpler na.roughfix approach (from the randomForest package).

Actually, we could go over all the approaches - listwise removal, roughfix and multiple imputation - and then compare the coefficients and p-values of each to see if they are similar (if they are, then deleting the rows with missing values has little or no impact on the overall result; if they are not, then the rows with missing values play a big part in our model fit). We won't do this; I just wanted to let you know it can be done and how to go about it.
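
For completeness, here is roughly how that comparison could be set up - a sketch only, not something we run in this report, and it assumes df.train still contains its NA values at this point:

#fit the same logistic regression on complete cases and on roughfix-imputed data
fit_listwise <- glm(y ~ ., family = binomial, data = na.omit(df.train))
fit_roughfix <- glm(y ~ ., family = binomial, data = randomForest::na.roughfix(df.train))

#compare a few coefficients and p-values side by side
head(summary(fit_listwise)$coefficients)
head(summary(fit_roughfix)$coefficients)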

Let's focus on the na.roughfix approach.

Recall that we have two data sets, a training data and a test data. And both of these datasets have missing rows. We can check the proportion of our missing data.

Starting with our training data…..

mean(!complete.cases(df.train))
## [1] 0.2608989

26% of the rows in our training data have a missing value, and

mean(!complete.cases(df.test))
## [1] 0.2498179

25% of the rows in our test data have a missing value.

The basic idea behind na.roughfix is that, for each column with missing values, the missing entries are replaced with the median of that column (for numeric columns) or with the mode, i.e. the most frequent level (for categorical columns). The best way to understand this is by illustration.
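
Before the real-data illustration below, here is the idea on a toy example (hypothetical vectors, not our data):

#replace NAs in a factor with its modal (most frequent) level
x <- factor(c("no", "yes", NA, "no", NA))
x[is.na(x)] <- names(which.max(table(x)))
x

#for a numeric column, the same idea uses the median instead
v <- c(10, 20, NA, 40)
v[is.na(v)] <- median(v, na.rm = TRUE)
v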

This is a subset of our data with unknown or NA values:

missing_values <- head(df[df[,c("job")]=="unknown",c("job","education","default","housing","loan")], 10)

missing_values
##         job         education default housing    loan
## 30  unknown university.degree unknown unknown unknown
## 36  unknown          basic.4y unknown     yes      no
## 74  unknown           unknown unknown      no      no
## 92  unknown           unknown unknown     yes      no
## 145 unknown       high.school unknown     yes      no
## 300 unknown           unknown unknown      no      no
## 304 unknown           unknown      no     yes      no
## 344 unknown           unknown unknown     yes      no
## 389 unknown           unknown unknown     yes     yes
## 429 unknown           unknown unknown     yes      no

This is the same subset after applying the na.roughfix function:

df3 <- na.roughfix(df1)

head(df3[c(as.numeric(row.names(missing_values))),
         c("job","education","default","housing","loan")],length(row.names(missing_values)))
##        job         education default housing loan
## 30  admin. university.degree      no     yes   no
## 36  admin.          basic.4y      no     yes   no
## 74  admin. university.degree      no      no   no
## 92  admin. university.degree      no     yes   no
## 145 admin.       high.school      no     yes   no
## 300 admin. university.degree      no      no   no
## 304 admin. university.degree      no     yes   no
## 344 admin. university.degree      no     yes   no
## 389 admin. university.degree      no     yes  yes
## 429 admin. university.degree      no     yes   no

Here we can see that for the job variable, missing or unknown values have been replaced with "admin.", which is the mode (the most frequent value) of that column. For the education column, missing values have been replaced with "university.degree".

Does this make sense, and will it not have a negative impact on our model? Thinking about it, this approach actually makes sense. The point of the approach is not so much the replacement values themselves but keeping the other information we would otherwise have discarded. In a dataset of 20 variables, a row with one missing value still has 19 non-missing values, so replacing the missing entry with the median (or mode) is a good trade-off compared with removing the entire row over one missing value.

The only drawback, and the problem I had with this approach, is the default column. Because we have 8,597 missing values in the default column alone, after applying na.roughfix those missing values are all replaced with "no", which skews that column even further towards "no" - but again, this should not be a problem. We could actually show it is not a problem by using the mice package, but we'll not be doing that here (to avoid bloating the report); I have already done it as rough work and I can confirm that is the case (you'll just have to take my word on this).

Moving on, what we can do next is apply the na.roughfix function to both our training and test data, and then run our models on them.

#training set
df.train <- na.roughfix(df.train)

#test set
df.test <- na.roughfix(df.test)

Once again we can verify that we no longer have missing values in our datasets:

#training set
mean(!complete.cases(df.train))
## [1] 0
#test set
mean(!complete.cases(df.test))
## [1] 0

In short, the proportion of rows with missing values is 0%, which is the same as saying there are no missing values left in either the training or the test data.

This is also a good point to take stock of the datasets we have built so far before moving on to modelling.

Now onto building our models. This is a broad subject and there is no need to dive into every detail. Based on my experience and domain knowledge, several model families are suitable for this task: logistic regression, linear discriminant analysis, quadratic discriminant analysis, support vector machines, trees, bagging, randomForest and boosting.

From my experience (at least with the models I have built), classification models can be ranked roughly by accuracy as follows:

My point is that classical trees tend to be more accurate than logistic regression. Bagging, randomForest and boosting (which are all built on top of trees) tend to be more accurate than classical trees and logistic regression, and randomForest and boosting are usually more accurate than bagging. Support vector machines are also more accurate than bagging, and against randomForest and boosting it is a close call: in some cases an SVM comes out on top, in others randomForest wins.

Note: this is not true in every case. I have seen cases where a logistic regression model turned out to be more accurate than the alternatives, but more often than not it is not.

For this project, there is no need to try all of these models; I will use logistic regression, linear discriminant analysis, randomForest and an SVM.

One more thing: to build a classification model you would normally have to create dummy variables for the categorical columns, but R does this automatically for us through the formula interface of glm, so there is no need to do it by hand.
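To see what that dummy coding looks like, here is a small sketch using model.matrix, which is what the formula interface relies on under the hood (the two variables are picked purely for illustration):

#each factor is expanded into 0/1 indicator columns, dropping one reference level,
#e.g. marital -> maritalmarried, maritalsingle and contact -> contacttelephone
head(model.matrix(~ marital + contact, data = df.train), 3)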

Let’s start with logistic regression

set.seed(1)

#fit model
logit.fit <- glm(y ~ ., family = binomial, data = df.train)

#check summary
summary(logit.fit)
## 
## Call:
## glm(formula = y ~ ., family = binomial, data = df.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0671  -0.3916  -0.3192  -0.2628   2.9455  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.371e+02  3.536e+01  -6.707 1.99e-11 ***
## age                          -2.578e-03  2.210e-03  -1.166 0.243497    
## jobblue-collar               -1.561e-01  7.041e-02  -2.217 0.026646 *  
## jobentrepreneur              -2.857e-02  1.114e-01  -0.257 0.797536    
## jobhousemaid                 -8.784e-02  1.353e-01  -0.649 0.516153    
## jobmanagement                -9.824e-03  7.881e-02  -0.125 0.900805    
## jobretired                    3.229e-01  9.859e-02   3.275 0.001056 ** 
## jobself-employed             -2.504e-02  1.070e-01  -0.234 0.814916    
## jobservices                  -1.361e-01  7.857e-02  -1.732 0.083338 .  
## jobstudent                    1.994e-01  1.018e-01   1.959 0.050059 .  
## jobtechnician                -8.723e-03  6.543e-02  -0.133 0.893945    
## jobunemployed                -4.783e-02  1.181e-01  -0.405 0.685517    
## maritalmarried                2.080e-02  6.301e-02   0.330 0.741343    
## maritalsingle                 6.151e-02  7.182e-02   0.857 0.391708    
## educationbasic.6y             1.096e-01  1.086e-01   1.008 0.313278    
## educationbasic.9y            -1.615e-02  8.638e-02  -0.187 0.851644    
## educationhigh.school          5.883e-02  8.364e-02   0.703 0.481836    
## educationilliterate           1.111e+00  6.511e-01   1.707 0.087815 .  
## educationprofessional.course  5.501e-02  9.274e-02   0.593 0.553084    
## educationuniversity.degree    1.304e-01  8.110e-02   1.608 0.107904    
## defaultyes                   -7.648e+00  8.448e+01  -0.091 0.927865    
## housingyes                   -2.433e-02  3.760e-02  -0.647 0.517618    
## loanyes                      -1.512e-02  5.251e-02  -0.288 0.773400    
## contacttelephone             -7.231e-01  7.037e-02 -10.275  < 2e-16 ***
## monthaug                      5.297e-01  1.140e-01   4.648 3.34e-06 ***
## monthdec                      4.123e-01  2.012e-01   2.049 0.040483 *  
## monthjul                      7.378e-02  8.706e-02   0.847 0.396747    
## monthjun                     -7.208e-01  1.175e-01  -6.135 8.54e-10 ***
## monthmar                      1.484e+00  1.365e-01  10.870  < 2e-16 ***
## monthmay                     -4.158e-01  7.569e-02  -5.493 3.95e-08 ***
## monthnov                     -4.132e-01  1.105e-01  -3.738 0.000186 ***
## monthoct                      7.079e-02  1.424e-01   0.497 0.619105    
## monthsep                      2.819e-01  1.675e-01   1.683 0.092443 .  
## day_of_weekmon               -2.261e-01  6.123e-02  -3.693 0.000222 ***
## day_of_weekthu                8.934e-02  5.861e-02   1.524 0.127461    
## day_of_weektue                7.076e-02  6.052e-02   1.169 0.242346    
## day_of_weekwed                1.708e-01  5.993e-02   2.850 0.004370 ** 
## campaign                     -4.009e-02  9.519e-03  -4.212 2.53e-05 ***
## pdays                        -1.175e-03  2.149e-04  -5.467 4.58e-08 ***
## previous                     -9.747e-02  6.016e-02  -1.620 0.105203    
## poutcomenonexistent           4.114e-01  9.198e-02   4.473 7.72e-06 ***
## poutcomesuccess               7.037e-01  2.100e-01   3.351 0.000805 ***
## emp.var.rate                 -1.539e+00  1.314e-01 -11.712  < 2e-16 ***
## cons.price.idx                2.142e+00  2.333e-01   9.184  < 2e-16 ***
## cons.conf.idx                 2.740e-02  7.366e-03   3.719 0.000200 ***
## euribor3m                     2.141e-01  1.212e-01   1.767 0.077240 .  
## nr.employed                   6.961e-03  2.877e-03   2.420 0.015535 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26152  on 37067  degrees of freedom
## Residual deviance: 20485  on 37021  degrees of freedom
## AIC: 20579
## 
## Number of Fisher Scoring iterations: 9

How do we read this? Based on the significance codes, the variables in the model summary can be sorted into five groups:

+ Group 1: variables with three asterisks (***) have the strongest relationship with “y”; their p-value is less than or equal to 0.001. Examples from our summary are contacttelephone, monthmar, emp.var.rate and cons.price.idx.

+ Group 2: variables with two asterisks (**) also have a strong relationship with “y”, just not as strong as Group 1; their p-value is greater than 0.001 but less than or equal to 0.01. Examples are jobretired and day_of_weekwed.

+ Group 3: variables with one asterisk (*) have a good relationship with “y”, but not as good as Groups 1 or 2. Their p-value is greater than 0.01 (1%) but less than or equal to 0.05 (5%). Variables in this group are nr.employed, monthdec and jobblue-collar.

+ Group 4: variables in this group (marked with a period “.”) have a small or weak relationship with “y”. The relationship is weak, but it is a relationship nonetheless. In some settings they would be lumped together with Group 5 (the no-relationship group): when a 5% p-value is the benchmark, they are considered insignificant because their p-value is greater than 0.05 (though no more than 0.10). Variables in this group are euribor3m, monthsep, educationilliterate, jobstudent and jobservices.

+ Group 5: variables in this group show no detectable relationship with the output variable; in short, they have little say in whether a client subscribes to a term deposit. Their p-value is greater than 0.10, and any variable with no asterisk or period belongs here. Removing these variables and re-running the model is one way to try to make it more accurate.

A quick word on the Estimate values. The Estimate for contacttelephone is -7.231e-01. What does that tell us? It means being contacted by telephone has a negative effect on the outcome; in short, clients contacted by telephone are less likely to subscribe. To be exact, being contacted by telephone (rather than by cellular) decreases the log-odds of subscribing by 0.723. Because -0.723 is on the log-odds scale (logistic regression jargon), we can get a more interpretable figure, the odds ratio, by exponentiating it:

exp(coefficients(logit.fit)["contacttelephone"])
## contacttelephone 
##        0.4852474

The odds of a client subscribing are multiplied by a factor of about 0.49 when we contact them by telephone rather than by cellular phone. We might as well focus on contacting our clients on their cellular phones.
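If we want the same kind of interpretation for other coefficients, a quick sketch like this prints odds ratios with rough 95% confidence intervals (confint.default gives Wald intervals, which is an approximation; the three coefficients shown are just examples):

#odds ratios with approximate (Wald) 95% confidence intervals
or_table <- exp(cbind(OR = coef(logit.fit), confint.default(logit.fit)))

#a few coefficients of interest
or_table[c("contacttelephone", "jobretired", "monthmar"), ]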

We might also want to look at clients last contacted in June; they too are less likely to subscribe. The same goes for clients contacted on Mondays (day_of_weekmon): Monday is not a good day to contact clients.

On the positive side, retired clients are more likely to subscribe, as are clients last contacted in August (monthaug) and in March (monthmar). Wednesday (day_of_weekwed) is also a good day to contact clients.

Another thing we might be interested in is the rows or cases that have an outsized influence on our model.

#Cook's distance plot (which = 4) highlights influential observations
plot(logit.fit, which = 4)

Observations 24011, 25780 and 34054 all have a large influence on our model. It is good practice to investigate influential observations, and sometimes removing them can make the model more accurate, but there is no need for us to do that here. I just want to show how you might go about it.
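If we would rather not read the row labels off the plot, a short sketch like this pulls the largest Cook's distances directly (it should point at the same three observations):

#Cook's distances for every training observation
cooks <- cooks.distance(logit.fit)

#the three most influential rows
head(sort(cooks, decreasing = TRUE), 3)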

Before we move on, we can also take a look at these influential observations. My guess is that what makes them influential is an unusual combination of variable values.

df.train[c(24011,25780,34054),]
##       age           job  marital  education default housing loan   contact
## 24011  34 self-employed  married illiterate      no     yes   no  cellular
## 25780  51  entrepreneur  married illiterate      no     yes   no  cellular
## 34054  42       retired divorced illiterate      no      no   no telephone
##       month day_of_week campaign pdays previous    poutcome emp.var.rate
## 24011   nov         thu        1   999        0 nonexistent         -0.1
## 25780   apr         thu        3   999        0 nonexistent         -1.8
## 34054   aug         wed        3   999        0 nonexistent         -2.9
##       cons.price.idx cons.conf.idx euribor3m nr.employed   y
## 24011         93.200         -42.0     4.076      5195.8 yes
## 25780         93.075         -47.1     1.410      5099.1 yes
## 34054         92.201         -31.4     0.834      5076.2 yes

If you remember from our exploratory analysis, the average age of our retired clients is above 60, so a retired client who is 42 years old (row 34054) is unusual, an outlier.

Moving forward, let’s dive into prediction.

#predict using test set
logit.pred <- predict(logit.fit, df.test, type = "response")

#classify prediction
logit.prob <- factor(logit.pred > 0.5, levels = c(FALSE, TRUE),
                     labels = c("no","yes"))

logit.perf <- table(df.test$y,logit.prob, dnn = c("Actual","Predicted"))

#confusion matrix
logit.perf
##       Predicted
## Actual   no  yes
##    no  3601   67
##    yes  341  110
#test error rate
mean(logit.prob != df.test$y)
## [1] 0.09905317

From the confusion matrix we can see that 3,601 cases were correctly classified as “no” and 110 cases were correctly classified as “yes”. We also wrongly classified 341 subscribers as “no” and 67 non-subscribers as “yes”.

Hence the accuracy, the proportion of cases correctly classified, is (3601 + 110) / (3601 + 110 + 67 + 341), which is about 0.90 or 90%.

Is this a good predictive model? That depends on several metrics, which I will dive into later in this section. While the overall accuracy is good (90%), the sensitivity among clients that subscribed is low: out of 451 clients that subscribed, our model correctly picked out only 110 and missed the other 341, a sensitivity of 24%, which is poor.

We can work on improving this result in various ways. One way is to change the classification threshold. Right now our threshold is 0.5: if the predicted probability is greater than 0.5 we call it a “yes”, and if it is less than or equal to 0.5 we call it a “no”. We might instead lower the threshold, for example calling a prediction “yes” (client likely to subscribe) whenever the probability is greater than 0.2 and “no” otherwise. This is one way of improving our model, and the best way to understand it is by illustration:

#predict using test set
logit.p <- predict(logit.fit, df.test, type = "response")

#classify prediction
logit.pr <- factor(logit.p > 0.2, levels = c(FALSE, TRUE),
                     labels = c("no","yes"))

logit.pe <- table(df.test$y,logit.pr, dnn = c("Actual","Predicted"))

#confusion matrix
logit.pe
##       Predicted
## Actual   no  yes
##    no  3328  340
##    yes  189  262
#test error rate
mean(logit.pr != df.test$y)
## [1] 0.1284292

Here we can see that our model now correctly classifies 262 subscribing clients, so sensitivity is 58% (34 percentage points higher than before), but at the same time the overall error rate increases slightly to 13%, i.e. accuracy drops to 87%.

At the end of the day, the organization decides which measure matters most. For some companies a model with 58% sensitivity is more useful than a model with 90% accuracy; sensitivity can matter far more than accuracy (banks, for example, will pay more to identify customers who are likely to default than those who will not).
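To make this trade-off concrete, here is a small sketch that reuses the predicted probabilities logit.pred and the test labels from above; the particular cut-offs are just examples:

#sensitivity vs accuracy at a few example thresholds
for (t in c(0.5, 0.3, 0.2, 0.1)) {
  pred <- ifelse(logit.pred > t, "yes", "no")
  cat("threshold =", t,
      " sensitivity =", round(mean(pred[df.test$y == "yes"] == "yes"), 2),
      " accuracy =", round(mean(pred == df.test$y), 2), "\n")
}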

Another way of improving the model is variable selection, by which I mean using only the variables that have a significant relationship with the outcome “y”. We can do this by hand, going over the model summary and keeping only the variables with at least a period “.” (groups 1 to 4 above), or automatically with the step() function, which is what we will do. I am not printing the output of step() because it is very long; I simply copied the recommended variables into the model further below.
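For reference, the selection itself would look something like the sketch below (logit.sel is just a throwaway name and direction = "both" is my choice); trace = 0 suppresses the long step-by-step log:

#stepwise selection by AIC, starting from the full logistic model
logit.sel <- step(logit.fit, direction = "both", trace = 0)

#the retained variables (should match the formula used in logit.step below)
formula(logit.sel)

Now let’s build another model with the selected variables: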

set.seed(1)
#best variables model
logit.step <- glm(y ~ age + job + contact + month + day_of_week + 
    campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx + 
    cons.conf.idx + euribor3m + nr.employed, family = binomial, 
    data = df.train)

#predictions
logit.pred.step <- predict(logit.step, df.test, type = "response")

#classify prob
logit.prob.step <- factor(logit.pred.step > 0.5, levels = c(FALSE, TRUE),
                     labels = c("no","yes"))

#prediction table
logit.perf.step <- table(df.test$y,logit.prob.step,dnn = c("Actual","Predicted"))

#print out table
logit.perf.step
##       Predicted
## Actual   no  yes
##    no  3605   63
##    yes  339  112
#test error
mean(logit.prob.step != df.test$y)
## [1] 0.0975965

This is a slight improvement over our first model: it identifies 112 subscribing clients (the first model identified 110).

Let’s try out linear discriminant analysis

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
lda.fit <- lda(y ~., data = df.train)

lda.pred <- predict(lda.fit, df.test)

lda.class <- lda.pred$class
lda.perf <- table(df.test$y,lda.class,dnn = c("Actual","Predicted"))
lda.perf
##       Predicted
## Actual   no  yes
##    no  3501  167
##    yes  274  177
mean(lda.class != df.test$y)
## [1] 0.1070648

This is actually a better model than our logistic regression model in one respect: it correctly identifies 177 of the 451 clients that subscribed. Its sensitivity is better than logistic regression’s, but it is slightly less accurate.

As with logistic regression, we could improve the model’s sensitivity by adjusting the classification threshold on the posterior probabilities. First a quick check:

sum(lda.pred$posterior[,1] < .5)
## [1] 344

This counts the test observations whose posterior probability of “no” is below 0.5, i.e. the 344 cases (167 + 177 in the confusion matrix above) that the model classifies as “yes” at the default 0.5 threshold. Lowering the threshold for a “yes” classification would flag more clients as likely subscribers and raise sensitivity, at the cost of more false positives.
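If we wanted to actually lower the threshold rather than just count the default predictions, a sketch along these lines would do it, mirroring what we did for logistic regression (the 0.2 cut-off and the lda.class.02 name are purely illustrative):

#classify "yes" whenever the posterior probability of "yes" exceeds 0.2
lda.class.02 <- factor(lda.pred$posterior[, "yes"] > 0.2,
                       levels = c(FALSE, TRUE), labels = c("no", "yes"))

#confusion matrix at the lower threshold
table(df.test$y, lda.class.02, dnn = c("Actual", "Predicted"))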

Let’s try the randomForest model

#randomForest
set.seed(1)

rf.fit <- randomForest(y ~ ., data = df.train, 
                       importance = T, mtry = 4)

Before looking at the output, let’s go over the basic concept of randomForest. randomForest is built on the ideas of decision trees and the bootstrap.

In randomForest, we build many trees; each tree is grown on a bootstrap sample of the training data (this is where the bootstrap comes in), and at each split only a random subset of the predictors is considered as candidates. The predictions of all the trees are then combined (a majority vote for classification) to give the final result. By default randomForest builds 500 trees and considers a specified number of predictors at each split.

For our model, I have specified 4 as the number of predictors to consider at each split (mtry = 4). So at every split of every one of the 500 trees, randomForest draws a different random set of four candidates: one split might consider “job, age, marital, loan”, another “job, education, loan, pdays”, and another “contact, month, pdays, euribor3m” (you get the idea). It does this across all 500 trees and combines the results.

Note that if mtry = 19 (all predictors), that is bagging rather than randomForest. I actually ran a bagging model on our data and the result was not as good as randomForest, but I am not including it in this report for the reason mentioned before (to avoid bloating it).
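For reference, the bagging fit I mentioned (whose results I am not including) is just randomForest with mtry set to the full number of predictors; bag.fit is only an illustrative name:

#bagging = randomForest considering all 19 predictors at every split
bag.fit <- randomForest(y ~ ., data = df.train, mtry = 19)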

We can verify the number of trees built and the number of variables tried at each split below:

rf.fit
## 
## Call:
##  randomForest(formula = y ~ ., data = df.train, importance = T,      mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 10.12%
## Confusion matrix:
##        no  yes class.error
## no  32114  765  0.02326713
## yes  2988 1201  0.71329673
importance(rf.fit)
##                       no        yes MeanDecreaseAccuracy MeanDecreaseGini
## age            37.965523  -2.430191            36.664020     7.128217e+02
## job            57.045411 -12.063915            51.845525     4.914984e+02
## marital        20.081200  -4.209574            17.306731     1.741618e+02
## education      38.455699  -5.747768            34.875182     3.341983e+02
## default         0.000000   0.000000             0.000000     1.618952e-03
## housing         6.196515  -3.369117             3.862782     1.392867e+02
## loan            2.133211   1.853132             2.786042     9.964356e+01
## contact         9.197919  29.529872            11.790618     8.408242e+01
## month          25.099893  -9.798359            25.943122     2.134846e+02
## day_of_week    35.313835  -1.191252            35.077292     3.317685e+02
## campaign       14.898398  13.954336            19.800422     3.505065e+02
## pdays           6.079087  35.419908            24.522829     2.921908e+02
## previous        7.794241   3.434010             8.702346     1.052016e+02
## poutcome       12.332619  13.704904            16.048634     2.178699e+02
## emp.var.rate   16.955485   2.580982            17.797485     1.549350e+02
## cons.price.idx 19.773182 -15.163870            19.624726     1.400434e+02
## cons.conf.idx  17.861462  -7.781122            18.333387     1.632809e+02
## euribor3m      29.372158   7.912342            32.398056     7.958030e+02
## nr.employed    20.591462  16.362149            23.989428     3.956196e+02

Judging by MeanDecreaseAccuracy, the most important variable is job, followed by age. Another way to read this: if we randomly permute the job variable, the model loses the most accuracy (the largest value in the MeanDecreaseAccuracy column, roughly 52 on this scale). Based on the Gini index (MeanDecreaseGini), on the other hand, the most important variable is euribor3m, followed by age. We can see this in the plot below.

varImpPlot(rf.fit, main = "Job(based on MeanDecrease) and euribor3m(based on GiniIndex) are the most important variables")

rf.pred <- predict(rf.fit, df.test)

rf.perf <- table(df.test$y, rf.pred, dnn = c("Actual","Predicted"))

rf.perf
##       Predicted
## Actual   no  yes
##    no  3579   89
##    yes  316  135
mean(rf.pred != df.test$y)
## [1] 0.09832484

Our test error is 0.098, or 9.8% (remember: the lower the test error, the better). We will compare all the models’ metrics at the end of this section.

Now we check support vector machine.

set.seed(1)
svm.fit <- svm(y ~., data = df.train) 
svm.fit
## 
## Call:
## svm(formula = y ~ ., data = df.train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  10467
svm.pred <- predict(svm.fit, df.test)

svm.perf <- table(df.test$y,svm.pred, 
                  dnn = c("Actual","Predicted"))

svm.perf
##       Predicted
## Actual   no  yes
##    no  3625   43
##    yes  359   92
mean(svm.pred != df.test$y)
## [1] 0.0975965

The test error for the SVM is 0.0976, or about 9.8%, which is the same error rate as our variable-selection logistic regression model.

We could improve the SVM further by tuning its parameters. This is computationally intensive because we have a large dataset (large, at least, in the context of tuning SVM parameters), and depending on your workstation it can take a long time; I tried it on a workstation with 4GB of RAM and it was still running after 52 minutes.

For that reason, I will not run it again here, but I include the code below just in case….

set.seed(1)
#tune.svm <- tune.svm(y ~., data = df.train,
#                     gamma = 10 ^ (-3:1),
#                     cost = 10 ^ (-5:5))
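One way to make the tuning feasible on a modest machine is to tune on a random subsample first and then refit the best gamma/cost on the full training set. This sketch is left commented out like the block above; the 5,000-row subsample and the narrower cost grid are arbitrary choices of mine:

#sub <- df.train[sample(nrow(df.train), 5000), ]
#tune.sub <- tune.svm(y ~ ., data = sub,
#                     gamma = 10 ^ (-3:1),
#                     cost = 10 ^ (-2:2))
#tune.sub$best.parameters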

To get a holistic view of each model’s performance, we need to focus on five metrics plus the test error rate.

These five metrics are sensitivity, specificity, positive predicted value (PPV), negative predicted value (NPV) and accuracy; each of them is defined right after the function.

Let’s build a function to calculate them.

performance <- 
  function(predictions, table, n = 2, actual = df.test$y){
    #expects a 2 x 2 confusion matrix with Actual in rows and Predicted in columns
    if(!all(dim(table) == c(2,2)))
      stop("Must be a 2 x 2 table")
    tn = table[1,1]   #true negatives
    fp = table[1,2]   #false positives
    fn = table[2,1]   #false negatives
    tp = table[2,2]   #true positives
    sensitivity = tp/(tp+fn)
    specificity = tn/(tn+fp)
    ppp = tp/(tp+fp)  #positive predicted value (computed but not printed below)
    npp = tn/(tn+fn)  #negative predicted value
    hitrate = (tp+tn)/(tp+tn+fp+fn)
    test_error_rate = mean(predictions != actual)
    result <- paste("Sensitivity = ", round(sensitivity, n),
                    "\nSpecificity = ", round(specificity, n),
                    "\nNegative Predicted Value = ", round(npp, n),
                    "\nAccuracy = ", round(hitrate, 3),
                    "\nTest Error rate = ", round(test_error_rate, 3),
                    "\n", sep = "")
    cat(result)
  }

Let’s go over each of these metrics:

Sensitivity: the probability of a positive classification when the true outcome is positive. It is also called recall or TPR (true positive rate).

Specificity: the opposite of sensitivity; the probability of a negative classification when the true outcome is negative (in our case, the probability of classifying as “no” the clients who did not subscribe).

PPV: positive predicted value; the probability that an observation classified as positive really is positive (also called precision).

NPV: negative predicted value; the opposite of PPV.

Accuracy: also called ACC; the proportion of observations correctly classified.
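As a quick sanity check of these definitions, the same numbers can be recomputed by hand from the logistic-regression confusion matrix (logit.perf) we built earlier; they should match the performance() output below:

#sensitivity: correctly predicted "yes" out of all actual "yes" -> 110/451
logit.perf[2, 2] / sum(logit.perf[2, ])

#specificity: correctly predicted "no" out of all actual "no" -> 3601/3668
logit.perf[1, 1] / sum(logit.perf[1, ])

#accuracy: everything on the diagonal over all cases -> 3711/4119
sum(diag(logit.perf)) / sum(logit.perf)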

Now let’s check the metrics with our models.

For logistic Regression:

performance(logit.prob, logit.perf)
## Sensitivity = 0.24
## Specificity = 0.98
## Negative Predicted Value = 0.91
## Accuracy = 0.901
## Test Error rate = 0.099

For Logistic Regression (variable selection)

performance(logit.prob.step, logit.perf.step)
## Sensitivity = 0.25
## Specificity = 0.98
## Negative Predicted Value = 0.91
## Accuracy = 0.902
## Test Error rate = 0.098

For linear discriminant analysis

performance(lda.class, lda.perf)
## Sensitivity = 0.39
## Specificity = 0.95
## Negative Predicted Value = 0.93
## Accuracy = 0.893
## Test Error rate = 0.107

For random forest

performance(rf.pred, rf.perf)
## Sensitivity = 0.3
## Specificity = 0.98
## Negative Predicted Value = 0.92
## Accuracy = 0.902
## Test Error rate = 0.098

For support vector machine

performance(svm.pred, svm.perf)
## Sensitivity = 0.2
## Specificity = 0.99
## Negative Predicted Value = 0.91
## Accuracy = 0.902
## Test Error rate = 0.098

Which is the best model? The answer depends on context. All of the models have poor sensitivity at the default threshold, but as we saw with logistic regression and LDA, sensitivity can easily be improved by adjusting that threshold.

If identifying clients that will subscribe is what matters most, we might go with LDA, since its sensitivity is noticeably better. If the goal is overall accuracy, we might go with the SVM or the variable-selection logistic regression model. I actually believe a tuned SVM would outclass all the other models in terms of accuracy, though not in terms of sensitivity.