This project is broken down into four sections.
The goal of this task is to predict if a client will subscribe to a term deposit.
Here is the link to the task on Kaggle https://www.kaggle.com/selvanig/bank-marketing-campaign
We start by loading our packages….
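The package-loading code isn’t shown in this write-up, but judging by the functions called later (aggr, matrixplot, mosaic, anti_join, the %>% pipe and na.roughfix), the setup is presumably something along these lines:
#packages assumed from the functions used later in this report
#(psych is only referenced as psych::describe, so it just needs to be installed;
# ggplot2 is loaded explicitly later in the report)
library(VIM)          #aggr(), matrixplot()
library(vcd)          #mosaic()
library(dplyr)        #anti_join(), %>%, group_by(), summarize()
library(randomForest) #na.roughfix()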
… and then we load the full dataset (bank-additional-full):
df <- read.table("/Users/kazeemolalekan/Downloads/bank-additional-full.csv",
header = T, sep = ";", stringsAsFactors = F)
A good place to start is to check the number of rows and variables we have.
#check dataset dimension (rows and columns)
dim(df)
## [1] 41188 21
Here we can see that the full dataset has 41,188 rows and 21 variables or features. We can follow this up by viewing the first 10 rows of our data.
#check the first 10 rows
head(df, 10)
## age job marital education default housing loan contact
## 1 56 housemaid married basic.4y no no no telephone
## 2 57 services married high.school unknown no no telephone
## 3 37 services married high.school no yes no telephone
## 4 40 admin. married basic.6y no no no telephone
## 5 56 services married high.school no no yes telephone
## 6 45 services married basic.9y unknown no no telephone
## 7 59 admin. married professional.course no no no telephone
## 8 41 blue-collar married unknown unknown no no telephone
## 9 24 technician single professional.course no yes no telephone
## 10 25 services single high.school no yes no telephone
## month day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 may mon 261 1 999 0 nonexistent 1.1
## 2 may mon 149 1 999 0 nonexistent 1.1
## 3 may mon 226 1 999 0 nonexistent 1.1
## 4 may mon 151 1 999 0 nonexistent 1.1
## 5 may mon 307 1 999 0 nonexistent 1.1
## 6 may mon 198 1 999 0 nonexistent 1.1
## 7 may mon 139 1 999 0 nonexistent 1.1
## 8 may mon 217 1 999 0 nonexistent 1.1
## 9 may mon 380 1 999 0 nonexistent 1.1
## 10 may mon 50 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 no
## 2 93.994 -36.4 4.857 5191 no
## 3 93.994 -36.4 4.857 5191 no
## 4 93.994 -36.4 4.857 5191 no
## 5 93.994 -36.4 4.857 5191 no
## 6 93.994 -36.4 4.857 5191 no
## 7 93.994 -36.4 4.857 5191 no
## 8 93.994 -36.4 4.857 5191 no
## 9 93.994 -36.4 4.857 5191 no
## 10 93.994 -36.4 4.857 5191 no
Here we can see the first 10 rows and we can also see some missing values (unknown values). We will come back to unknown values later in this section.
Next, we might want to have a look at the names of all our variables or features. There’s no need to go into detail about what each of these variables represents - we already know that from the problem statement. We know marital is the marital status of our clients, and age is the age of the client.
#variable names
names(df)
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
and then we can view the structure of our data to see the data type of each variable.
str(df)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
Columns labeled int are integer columns and columns labeled chr are character columns; num columns are numeric (int and num are both numeric types). We might actually want to convert the chr columns to factors - we’ll do that soon.
Now, let’s talk about unknown values. A good practice is to check which variables or columns contain unknown values. We can do this with the colSums function.
#find the variables with unknown mapping (same as NA)
colSums(df == "unknown")
## age job marital education default
## 0 330 80 1731 8597
## housing loan contact month day_of_week
## 990 990 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
Any variable with a value greater than 0 has unknown values. Here we can see that the job column has 330 unknown values and the marital column has 80. The default column has the highest number of unknown values.
To do a deep-dive on unknown values, we will recode these values as NA - it is also a good time to recode the chr columns to factors.
The first block of code below recodes unknown to NA, and the line after it converts the character columns to factors.
#recode unknown to NA and rename dataframe to df1
df1 <- within(df, {
job [job == "unknown"] <- NA
marital [marital == "unknown"] <- NA
education [education == "unknown"] <- NA
default [default == "unknown"] <- NA
housing [housing == "unknown"] <- NA
loan [loan == "unknown"] <- NA
})
df1[sapply(df1,is.character)] <- lapply(df1[sapply(df1,is.character)],as.factor)
We can confirm that character columns have been changed to factors by checking the structure of our data again.
str(df1)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 NA 6 4 ...
## $ default : Factor w/ 2 levels "no","yes": 1 NA 1 1 1 NA 1 NA 1 1 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
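Another quick way to confirm the recoding is to count the NA values per column; the totals should line up with the “unknown” counts we saw earlier. A small sanity-check sketch:
#count NA values per column (should match the earlier "unknown" counts)
colSums(is.na(df1))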
We can also confirm that unknown values have been recoded to NA by doing some descriptive analysis on our dataset.
The n column below is the number of non-missing values. Any column with n less than 41,188 has missing values. We can see that the job, marital, education, default, housing and loan variables have missing values because their n is less than 41,188.
We might also be interested in the mean, median and standard deviation of age. Since the standard deviation of age is 10.42, the median will be a better representation of a typical age than the mean.
#descriptive analysis
psych::describe(df1)
## vars n mean sd median trimmed mad min max
## age 1 41188 40.02 10.42 38.00 39.30 10.38 17.00 98.00
## job* 2 40858 4.67 3.55 3.00 4.43 2.97 1.00 11.00
## marital* 3 41108 2.17 0.60 2.00 2.21 0.00 1.00 3.00
## education* 4 39457 4.60 2.07 4.00 4.76 2.97 1.00 7.00
## default* 5 32591 1.00 0.01 1.00 1.00 0.00 1.00 2.00
## housing* 6 40198 1.54 0.50 2.00 1.55 0.00 1.00 2.00
## loan* 7 40198 1.16 0.36 1.00 1.07 0.00 1.00 2.00
## contact* 8 41188 1.37 0.48 1.00 1.33 0.00 1.00 2.00
## month* 9 41188 5.23 2.32 5.00 5.31 2.97 1.00 10.00
## day_of_week* 10 41188 3.00 1.40 3.00 3.01 1.48 1.00 5.00
## duration 11 41188 258.29 259.28 180.00 210.61 139.36 0.00 4918.00
## campaign 12 41188 2.57 2.77 2.00 1.99 1.48 1.00 56.00
## pdays 13 41188 962.48 186.91 999.00 999.00 0.00 0.00 999.00
## previous 14 41188 0.17 0.49 0.00 0.05 0.00 0.00 7.00
## poutcome* 15 41188 1.93 0.36 2.00 2.00 0.00 1.00 3.00
## emp.var.rate 16 41188 0.08 1.57 1.10 0.27 0.44 -3.40 1.40
## cons.price.idx 17 41188 93.58 0.58 93.75 93.58 0.56 92.20 94.77
## cons.conf.idx 18 41188 -40.50 4.63 -41.80 -40.60 6.52 -50.80 -26.90
## euribor3m 19 41188 3.62 1.73 4.86 3.81 0.16 0.63 5.04
## nr.employed 20 41188 5167.04 72.25 5191.00 5178.43 55.00 4963.60 5228.10
## y* 21 41188 1.11 0.32 1.00 1.02 0.00 1.00 2.00
## range skew kurtosis se
## age 81.00 0.78 0.79 0.05
## job* 10.00 0.45 -1.41 0.02
## marital* 2.00 -0.09 -0.42 0.00
## education* 6.00 -0.23 -1.24 0.01
## default* 1.00 104.21 10858.00 0.00
## housing* 1.00 -0.15 -1.98 0.00
## loan* 1.00 1.90 1.62 0.00
## contact* 1.00 0.56 -1.69 0.00
## month* 9.00 -0.31 -1.03 0.01
## day_of_week* 4.00 0.01 -1.27 0.01
## duration 4918.00 3.26 20.24 1.28
## campaign 55.00 4.76 36.97 0.01
## pdays 999.00 -4.92 22.23 0.92
## previous 7.00 3.83 20.11 0.00
## poutcome* 2.00 -0.88 3.98 0.00
## emp.var.rate 4.80 -0.72 -1.06 0.01
## cons.price.idx 2.57 -0.23 -0.83 0.00
## cons.conf.idx 23.90 0.30 -0.36 0.02
## euribor3m 4.41 -0.71 -1.41 0.01
## nr.employed 264.50 -1.04 0.00 0.36
## y* 1.00 2.45 4.00 0.00
Let’s also check the summary of our dataset. Here we can see the minimum, mean, median and maximum of each numeric column, and counts for the factor columns. We can see that the most common job among our clients is admin. (job), and most of our clients are married (marital). We can also see that most of our clients have a university degree (education).
#summarize data
summary(df1)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 NA's : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 (Other) : 7546
## NA's : 330
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 yes : 3 yes :21576 yes : 6248
## basic.9y : 6045 NA's: 8597 NA's: 990 NA's: 990
## professional.course: 5243
## basic.4y : 4176
## (Other) : 2310
## NA's : 1731
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :36548
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
Now, let’s do a deep-dive on the missing (NA) values. Let’s see the total number of missing values and also their proportion (in short, what fraction of our dataset is affected?).
#count the total number of missing values
sum(is.na(df1))
## [1] 12718
#proportion of rows with missing values:
#in short, about 26% of rows have at least one missing value (roughly one in four rows)
mean(!complete.cases(df1))
## [1] 0.2597844
Our dataset contains 12,718 missing values (formerly “unknown” entries), and about 26% of the rows have at least one missing value. A better way to understand this is through visualization.
#visualize missing values and their distribution
aggr(df1, prop = F, numbers = T)
Here we can see two plots. The plot on the left is a bar chart of missing values - what is obvious from the bar chart is the height of the default variable, which signifies that the default variable has the highest number of missing values. Maybe our clients are not keen on disclosing their default status (I mean, it’s not something to be proud of). Another variable with lots of missing values is education.
On the right of our plot, we have a grid of missing-value combinations. To understand this grid, let’s take a bottom-up approach. A blue box represents a non-missing value, so a row of blue boxes with no red box corresponds to rows with no missing values - in short, we have 30,488 rows with no missing values. Next, we have 7,757 rows that are missing just one value (the default variable). We also have 1,102 rows with a single missing value, this time for the education variable. Next, we have 739 rows that are missing values for both the housing and loan variables (you should get the logic by now).
Let’s end this part with a matrix plot of missing values. This helps us judge whether values are missing at random (MAR), missing completely at random (MCAR) or not missing at random (NMAR).
#how related are the missing values
matrixplot(df1, sortby = 1, main = "How distributed are the missing values in our dataset ?")
##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
Here, the red shades signify missing values. We can see that the default column has a lot of red shading. The plot is sorted by age, and I could actually make a case that as age increases, the number of missing values for default also increases - which is the same as saying older clients are more likely to leave the default question as unknown. You could also make a case that our data is MCAR (missing completely at random).
Moving on, let’s have a look at our output variable “y” . We want to see how many of our clients subscribed to a term deposit. We can do this with an ftable:
ftable(df1[,"y"])
## no yes
##
## 36548 4640
Here we can see that 36,548 of our clients didn’t subscribe and 4,640 did - a huge imbalance. We can visualize this better with a mosaic plot. A quick word on mosaic plots: they usually don’t pass the eye test when you have multiple variables, but once you get the logic behind them you’ll understand why they are such a powerful plot for categorical variables.
Below is a mosaic plot of our output variable.
mosaic(~y, data = df1, main = "Most of our clients (more than 80%) didn't subscribe to a term deposit")
The plot above passes the eye test: size in a mosaic plot is conveyed through rectangular boxes. The bigger the box, the bigger the share of that category. We can see that “no” makes up more than 80% of the rectangle, with “yes” taking a small chunk.
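To put a number on that, here is a quick sketch that computes the share of each outcome directly:
#proportion of clients that did / didn't subscribe to a term deposit
prop.table(table(df1$y))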
We can build on this by adding marital status to our mosaic plot - and this is where it might get complicated at first, but after a while you get comfortable with it.
mosaic(~y + marital,shade =T, data = df)
From the plot above, we have eight boxes (although it looks like six, notice the thin unknown portion of the plot). Here we can see that most of our clients are married, followed by single clients, then divorced clients, and then clients that chose not to give their marital status (just a few of them, which is why the unknown box is so small).
We can also see that most of the clients that subscribed are married (not surprising, since most of our clients are married). One major insight here is that single people are more likely to subscribe than married people (notice the blue box extending slightly into the married territory). The blue shading means single people that subscribed occur more often than expected, and the red shading means married people that subscribed occur less often than expected.
We can confirm this with a proportion table:
marital_tab <- xtabs(~marital + y, data = df)
addmargins(prop.table(marital_tab,1),2)
## y
## marital no yes Sum
## divorced 0.8967910 0.1032090 1.0000000
## married 0.8984275 0.1015725 1.0000000
## single 0.8599585 0.1400415 1.0000000
## unknown 0.8500000 0.1500000 1.0000000
14% of the single clients we contacted subscribed, whereas 10% of married clients subscribed. 15% of clients that didn’t give their marital status subscribed. Of course there’s a huge disparity in counts between married and single clients, so the fact that a bigger proportion of single clients (14%) subscribed than married clients (10%) might not mean much, but it does mean something nonetheless.
Finally, we can view the raw counts behind this combination.
addmargins(marital_tab)
## y
## marital no yes Sum
## divorced 4136 476 4612
## married 22396 2532 24928
## single 9948 1620 11568
## unknown 68 12 80
## Sum 36548 4640 41188
We can see that out of the 36,548 clients that didn’t subscribe, 22,396 identified as married and 9,948 as single. The total number of married clients in our dataset is 24,928 (more than half of our clients are married).
Another thing we might be interested in is how our clients were contacted. Once again, we visualize this with a mosaic plot.
mosaic(~y + contact,shade =T, data = df, main = "Clients contacted via cellular are more likely to subscribe")
Here we can see that most of our clients were contacted via cellular. We can also see that most of the clients that subscribed to a term deposit were contacted via cellular. In short, we might want to communicate with our clients via cellular rather than telephone.
We can check the proportions:
comms_tab <- xtabs(~contact + y, data = df)
addmargins(prop.table(comms_tab,1),2)
## y
## contact no yes Sum
## cellular 0.85262393 0.14737607 1.00000000
## telephone 0.94768679 0.05231321 1.00000000
14% of clients contacted via cellular subscribed, while just 5% of clients contacted via telephone subscribed. Let’s end this the usual way by viewing the total counts per contact type.
addmargins(comms_tab)
## y
## contact no yes Sum
## cellular 22291 3853 26144
## telephone 14257 787 15044
## Sum 36548 4640 41188
We’ll come back to exploring the data with mosaic plots later in the section. For now, let’s try to understand the day_of_week and month variables. But before we do, we need to reorder the levels of day_of_week and month - we want monday to come first, then tuesday, then wednesday (you get the point) - this will make our job easier and the visualizations easier to understand.
#reorder month
df1$month <- factor(df1$month, order = T,
levels = c("mar","apr","may","jun","jul","aug","sep","oct","nov","dec"))
#reorder day_of_week
df1$day_of_week <- factor(df1$day_of_week,
levels = c("mon","tue","wed","thu","fri"))
Now let’s visualize the month in which our clients were last contacted.
par(family = "mono", font = 2)
barplot(table(df1$month), main = "Most of our clients were last contacted in may")
Most of our clients were last contacted in may, followed by july. We can check the proportion of clients that subscribed based on the month they were last contacted.
month_tab <- xtabs(~month + y, data = df1)
addmargins(prop.table(month_tab,1),2)
## y
## month no yes Sum
## mar 0.49450549 0.50549451 1.00000000
## apr 0.79521277 0.20478723 1.00000000
## may 0.93565255 0.06434745 1.00000000
## jun 0.89488530 0.10511470 1.00000000
## jul 0.90953443 0.09046557 1.00000000
## aug 0.89397863 0.10602137 1.00000000
## sep 0.55087719 0.44912281 1.00000000
## oct 0.56128134 0.43871866 1.00000000
## nov 0.89856133 0.10143867 1.00000000
## dec 0.51098901 0.48901099 1.00000000
50% of clients last contacted in march subscribed. The september, october and december figures are also impressive - the only downside is that not many clients were contacted in those months, including march. It’s almost as if the more clients we contact in a particular month, the less likely those clients are to subscribe, and vice-versa.
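One quick way to eyeball that pattern is to line up the number of contacts and the subscription rate per month; a small sketch using the dplyr pipeline we also lean on later:
#contacts per month vs subscription rate per month
df1 %>%
  group_by(month) %>%
  summarize(contacts = n(), sub_rate = mean(y == "yes")) %>%
  arrange(desc(contacts))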
Next is a plot of the day of the week on which our clients were last contacted.
par(family = "mono", font = 2)
barplot(table(df1$day_of_week), main = "Most of our clients were last contacted on monday and thursday")
Contacts across day_of_week are far more uniform than across month, although thursday is the day on which most of our clients were last contacted. We can add a mosaic plot here to supplement our barplot.
mosaic(~y + day_of_week, shade = T, data = df1)
From our mosaic plot, we can see that clients last contacted on thursday (and who then subscribed) occur more often than expected (blue shading).
We can view the subscription proportions by day of the week:
day_of_the_tab <- xtabs(~day_of_week + y, data = df1)
addmargins(prop.table(day_of_the_tab,1),2)
## y
## day_of_week no yes Sum
## mon 0.9005168 0.0994832 1.0000000
## tue 0.8822002 0.1177998 1.0000000
## wed 0.8833292 0.1166708 1.0000000
## thu 0.8788125 0.1211875 1.0000000
## fri 0.8919126 0.1080874 1.0000000
Here we can see that thursday is the best day to contact our clients (12% of clients last contacted on thursday subscribed to a term deposit). Monday is definitely not a good day to contact our clients.
To end this part, we’ll build a table of month against day_of_week. This will help us understand the relationship between the two variables.
day_month_tab <- xtabs(~month + day_of_week, data = df1)
addmargins(prop.table(day_month_tab,1),2)
## day_of_week
## month mon tue wed thu fri Sum
## mar 0.26190476 0.25641026 0.12820513 0.18131868 0.17216117 1.00000000
## apr 0.26671733 0.09574468 0.11398176 0.29179331 0.23176292 1.00000000
## may 0.19188031 0.20400901 0.21228847 0.18425448 0.20756772 1.00000000
## jun 0.23523881 0.18239940 0.18484393 0.18183528 0.21568259 1.00000000
## jul 0.21131865 0.21145804 0.20309451 0.23306384 0.14106496 1.00000000
## aug 0.19779864 0.20977663 0.20119780 0.21803173 0.17319521 1.00000000
## sep 0.15789474 0.20701754 0.21929825 0.21403509 0.20175439 1.00000000
## oct 0.17966574 0.20752089 0.18802228 0.22701950 0.19777159 1.00000000
## nov 0.18678371 0.19848817 0.21043648 0.22019020 0.18410144 1.00000000
## dec 0.29120879 0.13736264 0.19230769 0.24725275 0.13186813 1.00000000
A big take-away from the table above is that, within april, thursday accounts for the largest share of last contacts (29%), while within december, monday does (29%). Tuesday in april (9%) accounts for the smallest share. We can build on this by adding the output variable to our table.
day_month_y_tab <- xtabs(~month + day_of_week + y, data = df1)
addmargins(prop.table(day_month_y_tab,1),2)
## , , y = no
##
## day_of_week
## month mon tue wed thu fri Sum
## mar 0.15750916 0.10256410 0.05128205 0.10073260 0.08241758 0.49450549
## apr 0.23746201 0.06496960 0.08016717 0.20440729 0.20820669 0.79521277
## may 0.17800857 0.19318760 0.19877987 0.17256155 0.19311497 0.93565255
## jun 0.20966529 0.15889432 0.16265513 0.16472358 0.19894697 0.89488530
## jul 0.19459158 0.19291887 0.18427655 0.21313075 0.12461667 0.90953443
## aug 0.18047912 0.18582065 0.17740369 0.19682745 0.15344772 0.89397863
## sep 0.11052632 0.10000000 0.10701754 0.11228070 0.12105263 0.55087719
## oct 0.12116992 0.11420613 0.10027855 0.12395543 0.10167131 0.56128134
## nov 0.17044623 0.17654231 0.18995367 0.19775664 0.16386247 0.89856133
## dec 0.16483516 0.05494505 0.07692308 0.13186813 0.08241758 0.51098901
##
## , , y = yes
##
## day_of_week
## month mon tue wed thu fri Sum
## mar 0.10439560 0.15384615 0.07692308 0.08058608 0.08974359 0.50549451
## apr 0.02925532 0.03077508 0.03381459 0.08738602 0.02355623 0.20478723
## may 0.01387174 0.01082141 0.01350861 0.01169293 0.01445276 0.06434745
## jun 0.02557352 0.02350508 0.02218879 0.01711170 0.01673561 0.10511470
## jul 0.01672707 0.01853917 0.01881795 0.01993309 0.01644829 0.09046557
## aug 0.01731952 0.02395597 0.02379411 0.02120427 0.01974749 0.10602137
## sep 0.04736842 0.10701754 0.11228070 0.10175439 0.08070175 0.44912281
## oct 0.05849582 0.09331476 0.08774373 0.10306407 0.09610028 0.43871866
## nov 0.01633748 0.02194587 0.02048281 0.02243355 0.02023897 0.10143867
## dec 0.12637363 0.08241758 0.11538462 0.11538462 0.04945055 0.48901099
Here’s how to read this table, using the first row as an illustration: 49.4% of clients last contacted in march ended up not subscribing, and of that 49.4%, 15.7 percentage points were last contacted on monday, 10.3 on tuesday and 5.1 on wednesday. Likewise, 50.5% of clients last contacted in march subscribed; of that 50.5%, 10.4 percentage points were last contacted on monday and 15.4 on tuesday.
We might also want to formally test the relationship between day of the week and our output variable. This answers the question: if we know the day a client was last contacted, can we predict whether the client will subscribe to a term deposit?
We do this with a chi-square test:
day_y <- xtabs(~y + day_of_week, data = df)
chisq.test(day_y)
##
## Pearson's Chi-squared test
##
## data: day_y
## X-squared = 26.145, df = 4, p-value = 2.958e-05
The p-value is about 0.00003 (less than 0.05), meaning there’s a relationship between day_of_week and our output variable y.
We can also check for the relationship between month and our output variable:
month_y <- xtabs(~y + month, data = df)
chisq.test(month_y)
##
## Pearson's Chi-squared test
##
## data: month_y
## X-squared = 3101.1, df = 9, p-value < 2.2e-16
Once again the p-value is less than 0.05, hence a relationship exists between our output variable and month: if we know the month a client was last contacted, we have some ability to predict whether that client will subscribe.
Next, let’s look at the housing loan variable and how it relates to our output variable, starting with a mosaic plot:
mosaic(~y + housing,shade =T, data = df)
Here we can see that a few clients chose not to answer the question on housing loans. We can also see that we have almost as many clients with a housing loan as without one. From the plot you can also deduce that clients with a housing loan are slightly more likely to subscribe (although the difference is very small).
We can confirm this with a proportion table:
housing_tab <- xtabs(~housing + y, data = df)
addmargins(prop.table(housing_tab,1),2)
## y
## housing no yes Sum
## no 0.8912040 0.1087960 1.0000000
## unknown 0.8919192 0.1080808 1.0000000
## yes 0.8838061 0.1161939 1.0000000
11.6% of clients with a housing loan subscribed to a term deposit, versus about 10.9% of clients without one (and almost the same, 10.8%, for clients that didn’t answer that question).
Again, we can check for a relationship between housing loan and term deposit subscription:
housing_y <- xtabs(~y + housing, data = df)
chisq.test(housing_y)
##
## Pearson's Chi-squared test
##
## data: housing_y
## X-squared = 5.6845, df = 2, p-value = 0.05829
Here the p-value is 0.058, which is more than 0.05, meaning there’s no statistically significant relationship between housing loan and term deposit subscription. In short, you can’t predict whether a client will subscribe to a term deposit just by knowing their housing loan status.
Since we are talking about loans, we might as well dive into the loan variable.
mosaic(~y + loan,shade =T, data = df)
Most of our clients have no existing loan. We can check for the relationship between loan and term deposit subscription.
loan_y <- xtabs(~y + loan, data = df)
chisq.test(loan_y)
##
## Pearson's Chi-squared test
##
## data: loan_y
## X-squared = 1.094, df = 2, p-value = 0.5787
The p-value here is very high, at about 0.58, meaning there is no evidence of a relationship between loan and term deposit subscription.
Finally, let’s have a look at the default variable. Recall that we have lots of missing values for this variable (more than 8,000 rows, in fact), and we can see this in the plot below. Also, clients that didn’t default are far more likely to subscribe.
mosaic(~y + default,shade =T, data = df)
Let’s check for a relationship between default status and term deposit subscription (the color shading already suggests there might be one).
tab <- xtabs(~default + y, data = df)
chisq.test(tab)
## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 406.58, df = 2, p-value < 2.2e-16
Here we can see that the p-value is less than 0.05, meaning there’s a relationship between default status and term deposit subscription. However, I did find that once you remove the missing values you get a different result. Below I will run another test, this time without the missing values.
tab <- xtabs(~default + y, data = df1)
chisq.test(tab)
## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 6.7698e-24, df = 1, p-value = 1
This time you can see that the p-value is equal to 1 (far greater than 0.05), meaning no relationship whatsoever exists between default status and our outcome variable (term deposit).
Let’s end this look at categorical variables by examining the kinds of jobs our clients are in and their level of education.
Starting with jobs: below I have reordered the job factor levels to make the visualization easier to understand, followed by a plot of job against the output variable y.
df$job <- factor(df$job, order = T,
levels = c("unknown","student","unemployed","housemaid","self-employed",
"entrepreneur","retired","management","services","technician","blue-collar","admin."))
library(ggplot2)
ggplot(df, aes(job, fill = y)) +
geom_bar() +
coord_flip() + theme(legend.position = "top" ,
legend.title = element_blank(),
legend.text = element_text(size = 13),
axis.text.y = element_text(size = 13),
axis.text.x = element_text(size = 12),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
xlab(NULL) +
ylab(NULL) +
labs(title = "Admin, blue-collar and technician are the most common jobs among our clients",
subtitle = "Few of our clients are also students, unemployed and maids") +
scale_y_continuous(labels = scales::comma_format())
Here we can see that most of our clients work in admin., blue-collar and technician jobs. The turquoise color coding represents the share of clients that subscribed to a term deposit and the red color coding the share that didn’t. Technician clients are more likely to subscribe to a term deposit than blue-collar clients. Retired clients and students also stand out as more likely to subscribe.
We can look into this further and confirm it with a proportion table:
job_tab <- xtabs(~job + y, data = df)
addmargins(prop.table(job_tab,1),2)
## y
## job no yes Sum
## unknown 0.88787879 0.11212121 1.00000000
## student 0.68571429 0.31428571 1.00000000
## unemployed 0.85798817 0.14201183 1.00000000
## housemaid 0.90000000 0.10000000 1.00000000
## self-employed 0.89514426 0.10485574 1.00000000
## entrepreneur 0.91483516 0.08516484 1.00000000
## retired 0.74767442 0.25232558 1.00000000
## management 0.88782490 0.11217510 1.00000000
## services 0.91861930 0.08138070 1.00000000
## technician 0.89173958 0.10826042 1.00000000
## blue-collar 0.93105684 0.06894316 1.00000000
## admin. 0.87027442 0.12972558 1.00000000
You can see that 31% of students subscribed to a term deposit, and roughly one in four of our retired clients did as well. Those are very impressive figures.
Let’s do the usual relationship check:
tab <- xtabs(~job + y, data = df1)
chisq.test(tab)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 961.21, df = 10, p-value < 2.2e-16
As always, we look at the p-value: it is far less than 0.05, meaning a relationship exists between job and our output variable. In short, knowing the job of a particular client helps us predict whether the client will subscribe or not.
And finally, the education feature. We do the usual factor reordering and then make a plot.
df$education <- factor(df$education, order = T,
levels = c("illiterate","unknown","basic.6y","basic.4y","professional.course",
"basic.9y","high.school","university.degree"))
ggplot(df, aes(education, fill = y)) +
geom_bar() +
coord_flip() + theme(legend.position = "top" ,
legend.title = element_blank(),
legend.text = element_text(size = 13),
axis.text.y = element_text(size = 13),
axis.text.x = element_text(size = 12),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
xlab(NULL) +
ylab(NULL) +
labs(title = "A big chunck of our clients are well educated with a degree",
subtitle = "The more educated our clients are the more likely they are to subscribe to a term deposit") +
scale_y_continuous(labels = scales::comma_format())
Let’s stop here with categorical variables and have a look at our numeric features.
We can see a breakdown of our numeric columns below.
num.cols <- sapply(df,is.numeric)
summary((df[,num.cols]))
## age duration campaign pdays
## Min. :17.00 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.:32.00 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0
## Median :38.00 Median : 180.0 Median : 2.000 Median :999.0
## Mean :40.02 Mean : 258.3 Mean : 2.568 Mean :962.5
## 3rd Qu.:47.00 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## Max. :98.00 Max. :4918.0 Max. :56.000 Max. :999.0
## previous emp.var.rate cons.price.idx cons.conf.idx
## Min. :0.000 Min. :-3.40000 Min. :92.20 Min. :-50.8
## 1st Qu.:0.000 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Median :0.000 Median : 1.10000 Median :93.75 Median :-41.8
## Mean :0.173 Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. :7.000 Max. : 1.40000 Max. :94.77 Max. :-26.9
## euribor3m nr.employed
## Min. :0.634 Min. :4964
## 1st Qu.:1.344 1st Qu.:5099
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
Here we can see that our youngest client is 17 years old and our oldest is 98. For duration, the longest call with a client lasted 4,918 seconds (about 82 minutes, or roughly 1.4 hours). The minimum number of campaign contacts is 1 and the maximum is 56 (meaning we have a client that was contacted 56 times).
Overall, our major focus will be on the age variable and, to some degree, the duration variable. Because campaign is skewed to the right, I will not be looking into it.
How do I know that campaign is skewed to the right? Because the minimum, mean, median and 3rd quartile are all very close; that’s enough to tell me most of our clients were contacted once or twice, and the maximum of 56 is an outlier.
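A quick look at the upper quantiles of campaign backs this up; a minimal sketch:
#upper quantiles of campaign - most clients were contacted only a handful of times
quantile(df1$campaign, probs = c(0.50, 0.75, 0.90, 0.95, 0.99, 1))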
We’ll also not be looking at the social and economic attributes, since these are macro-economic factors we have no power to adjust or control.
Let’s go over age. The standard deviation of age is about 10.4; another way to say that is that most of our clients fall within one standard deviation of the mean, roughly between 30 and 50 years old. Because the standard deviation is large (in the context of age), the median, not the mean, is a better representation of the average age (the larger the standard deviation, the less well the mean represents a typical value).
#standard deviation of age
sd(df1$age)
## [1] 10.42125
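The “between 30 and 50” claim is just the mean plus or minus one standard deviation; a tiny sketch:
#one-standard-deviation band around the mean age (roughly 30 to 50)
mean(df1$age) + c(-1, 1) * sd(df1$age)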
Next is the distribution of our clients’ ages, using a histogram and a density plot.
par(mfrow = c(1,2), family = "mono")
hist(df1$age,breaks = 10, col = "red", xlab = "Age", main = "Most of our clients are \n between the age of 30-40 years old")
plot(density(df1$age), col = "red", main = "")
polygon(density(df1$age), col = "brown")
Here we can verify what we already know: most of our clients fall between the ages of 30 and 40, or more broadly 30 and 50.
What we might want to do next is visualize the age variable against other categorical variables.
We can do this with a facet plot. First, let’s check the age distribution of our clients with regard to their jobs.
ggplot(df1, aes(age)) + geom_histogram(color = "white", binwidth = 5) +
facet_wrap (~job) + theme(strip.text.x = element_text(size = 13),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
labs(title = "Most of our clients that are students are 25 years old",
subtitle = "Our clients that are into admin. are equally distributed between 30 and 35 years old")
… and here is the age and education breakdown:
ggplot(df1, aes(age)) + geom_histogram(color = "white", binwidth = 5) +
facet_wrap (~education) + theme(strip.text.x = element_text(size = 13),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
labs(title = "Clients with basic.4y eduaction are quite old",
subtitle = "Clients with university degree fall in between ages of 30-35 years old")
Still on age and education, we can eyeball the median age of our clients per education level.
df %>%
group_by(education) %>%
summarize(median_age = median(age)) %>%
ggplot(aes(median_age, reorder(education,median_age))) +
geom_point(size = 2) +
theme(axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
ylab(NULL) +
labs(title = "Our younger clients are more educated than our older clients",
subtitle = "There's a big gap in age between our uneducated and basic.4y clients and basic.6y clients")
You can make a case that clients with missing or “unknown” education entries are most likely not well educated. Judging by the median ages on the plot, clients with a university degree, a high-school education or basic.9y education are less likely to have left their education as unknown (look at the gap between the median age of those clients and the median age of the “unknown” group).
We can also check the median age of our clients with regard to their jobs.
df %>%
group_by(job) %>%
summarize(median_age = median(age)) %>%
ggplot(aes(median_age, reorder(job,median_age))) +
geom_point(size = 2) +
theme(axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
ylab(NULL) +
labs(title = "Housemaids and clients with missing age value share the same median age",
subtitle = "Clients with missing age values are most likely not student, services or into admin")
Once again, we can make a case - or at least a guess - that the clients who didn’t give their job are most likely retired, housemaids or in management.
Let’s see how all of this relates to our output variable.
ggplot(df1, aes(job, age, fill = y)) +
geom_boxplot() +
theme(axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 13)) +
xlab(NULL) +
labs(title = "Older clients in the retired and housemaid cohorts are more likely to subscribe",
subtitle = "Younger clients in the student, self-employed or admin cohorts are more likely to subscribe")
The interesting contrast here is that among retired clients and housemaids it is the older clients who are more likely to subscribe, whereas among students, the self-employed and admin workers it is the younger clients who are more likely to subscribe.
ggplot(df1, aes(education, age, fill = y)) +
geom_boxplot()+
theme(axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 13)) +
xlab(NULL) +
labs(title = "Older clients in the basic.4y cohorts are more likely to subscribe",
subtitle = "Younger people in other cohorts are more likely to subscribe")
To end the exploratory section, let’s have a quick look at call duration.
df1 %>%
mutate(minutes = round(duration/60,0)) %>%
filter(minutes > 0 & minutes < 20 ) %>%
ggplot(aes(minutes)) + geom_bar() + facet_grid(y~contact) +
theme(strip.text.x = element_text(size = 13),
strip.text.y = element_text(size = 13),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5),
plot.subtitle = element_text(size = 15, hjust = .5)) +
labs(title = "We spent around 4 minutes on the phone with most of our clients that subscribed",
subtitle = "Most Clients that didn't subscribe were on the phone for 2 minutes.")
For the plot above, I have removed calls that lasted more than 20 minutes (they are mostly outliers). Each bar represents 1 minute, so you can count the bars to read off the exact minute. I have also faceted the plot by contact type and our output variable. Once again we can see that most of our clients were contacted via cellular.
That’s enough for now. We have more than covered a basic exploration of our dataset. Now we move on to the modeling and prediction section.
For this section we are going to reload our data, this time making use of both the bank-additional-full and bank-additional datasets.
df <- read.table("/Users/kazeemolalekan/Downloads/bank-additional-full.csv",
header = T, sep = ";", stringsAsFactors = F)
df.test_set <- read.table("/Users/kazeemolalekan/Downloads/bank-additional/bank-additional.csv",header = T, sep = ";", stringsAsFactors = F)
Now, this part is important. The problem statement says that bank-additional.csv (let’s call this our test set) is a random sample of rows from bank-additional-full.csv. So what we want to do now is remove our test set from the bank-additional-full dataset, so that when we build our model we are not training on our test set.
We do this with anti_join below:
df.train_set <- anti_join(df, df.test_set)
## Joining, by = c("age", "job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "duration", "campaign", "pdays", "previous", "poutcome", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed", "y")
Now our test set has been excluded from bank-additional-full, and we can verify this below.
dim(df.train_set)
## [1] 37068 21
Our training set (bank-additional-full minus the test rows) has 37,068 rows.
dim(df.test_set)
## [1] 4119 21
… and our test set has 4,119 rows.
Next, we’ll remove the duration variable (we were advised to do this in the problem statement) and then recode “unknown” values to NA for both datasets.
changes <- function(x){
#remove duration column
var <- names(x) %in% "duration"
x <- x[!var]
#replace unknown with NA
x$job [x$job == "unknown"] <- NA
x$marital [x$marital == "unknown"] <- NA
x$education [x$education == "unknown"] <- NA
x$default [x$default == "unknown"] <- NA
x$housing [x$housing == "unknown"] <- NA
x$loan [x$loan == "unknown"] <- NA
#convert character columns to factors
x[sapply(x,is.character)] <- lapply(x[sapply(x, is.character)], as.factor)
x
}
… and then we apply the function to both sets. First to the training set,
df.train <- changes(df.train_set)
… and then to the test set.
df.test <- changes(df.test_set)
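Before going further, it’s worth a quick sanity check that the function did what we expect on both sets; a small sketch:
#confirm the duration column is gone and character columns are now factors
"duration" %in% names(df.train)
"duration" %in% names(df.test)
sapply(df.train, class)[c("job", "marital", "y")]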
This is where it gets interesting. Before we get into it, I think it’s important we go over how we are going to approach the modeling. Right now we have two decisions to make:
what do we do with missing values - should we ignore them or take them into consideration;
what modeling approach are we going to use - logistic regression, linear discriminant analysis, quadratic discriminant analysis, conditional trees, random forest, SVM or bagging? We have multiple ways to build a model; we could try them all and compare their prediction accuracy, but that would be a waste of time since we already have an idea of how accurate these models tend to be.
Back to missing values. Recall that we have more than twelve thousand missing values in total, and rows with at least one missing value make up more than 20% of our data (26% to be specific) - roughly one in every four rows. We can confirm this again:
#total number of missing values
sum(is.na(df1))
## [1] 12718
#proportion of rows with missing values
mean(!complete.cases(df1))
## [1] 0.2597844
There are two ways to approach modeling with a dataset that has missing values: either we fit a model on complete cases only (listwise deletion, dropping every row with a missing value), or we fit a model after imputing the missing values.
Now, 26% of our rows have missing values - that’s a red flag that the listwise approach would be the wrong one to take. For instance, we have 7,757 rows that are missing just a single value, for the default variable, while those same rows have non-missing values for every other variable (age, contact, education and so on). The irony of deleting such rows is that we would be throwing away rows that contain far more non-missing values than missing ones.
What I’m driving at is that it would be bad practice to ignore or remove the incomplete rows; 26% of rows is a lot of data/information to discard, which is why we won’t take the listwise approach (it deletes every row with any missing value). What we’ll do instead is impute the missing values.
How does this work? I won’t go into specific details (to avoid bloating the report), but the idea is one of the following:
you replace missing values via multiple imputation: from the dataset with missing values you generate several completed datasets, in each of which the missing values are filled in with plausible random draws; you then fit a model on each generated dataset and finally combine the results into a single pooled result. For this we use the mice package; or
you replace missing values with the median (for numeric columns) or the mode (for categorical columns) and then fit a single model. For this we use the na.roughfix function.
For this project, we’ll be using the na.roughfix approach.
Actually, we could go through all the approaches - listwise removal and multiple imputation - and then compare the coefficients and p-values of each to see if they are similar (if they are, deleting the rows with missing values has little or no impact on the overall result; if they are not, the rows with missing values play a big part in our model fit). We won’t do this here; I just wanted to let you know it can be done and how to go about it.
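For completeness, here is roughly what the multiple-imputation route would look like with the mice package. This is a sketch only, not run in this report, and the formula uses just a handful of predictors for illustration:
#multiple imputation sketch with mice (not run)
library(mice)
imp <- mice(df.train, m = 5, seed = 1)   #generate 5 imputed datasets
fits <- with(imp, glm(y ~ age + job + contact, family = binomial))   #fit the same model on each
summary(pool(fits))   #pool the results into one combined summary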
Let’s focus on the na.roughfix approach.
Recall that we have two datasets, a training set and a test set, and both contain rows with missing values. We can check the proportion of missing data in each.
Starting with our training data…
mean(!complete.cases(df.train))
## [1] 0.2608989
26% of the rows in our training data have at least one missing value, and
mean(!complete.cases(df.test))
## [1] 0.2498179
25% of the rows in our test data have at least one missing value.
The basic idea behind na.roughfix is that, for each column with missing values, the missing entries are replaced with the median of that column (for numeric columns) or the mode (for categorical columns). The best way to understand this is by illustration.
This is a subset of our data with unknown (NA) values…
missing_values <- head(df[df[,c("job")]=="unknown",c("job","education","default","housing","loan")], 10)
missing_values
## job education default housing loan
## 30 unknown university.degree unknown unknown unknown
## 36 unknown basic.4y unknown yes no
## 74 unknown unknown unknown no no
## 92 unknown unknown unknown yes no
## 145 unknown high.school unknown yes no
## 300 unknown unknown unknown no no
## 304 unknown unknown no yes no
## 344 unknown unknown unknown yes no
## 389 unknown unknown unknown yes yes
## 429 unknown unknown unknown yes no
… and this is the same subset after applying the na.roughfix function:
df3 <- na.roughfix(df1)
head(df3[c(as.numeric(row.names(missing_values))),
c("job","education","default","housing","loan")],length(row.names(missing_values)))
## job education default housing loan
## 30 admin. university.degree no yes no
## 36 admin. basic.4y no yes no
## 74 admin. university.degree no no no
## 92 admin. university.degree no yes no
## 145 admin. high.school no yes no
## 300 admin. university.degree no no no
## 304 admin. university.degree no yes no
## 344 admin. university.degree no yes no
## 389 admin. university.degree no yes yes
## 429 admin. university.degree no yes no
Here we can see that for the job variable, missing (unknown) values have been replaced with “admin.”, which is the most frequent value (the mode) in that column. For the education column, missing values have been replaced with “university.degree”.
Does this make sense, and won’t it have a negative impact on our model? Thinking about it, this approach actually does make sense. Its value is not so much in the replacement values themselves as in keeping all the other information we would otherwise have discarded. In a dataset of 20 variables, a row with a single missing value still has 19 non-missing variables, so filling in a median or mode is a good trade-off against removing the entire row.
The only drawback I found with this approach concerns the default column. Because we have more than 8,000 missing values for the default column alone, after applying na.roughfix all of those missing values are replaced by “no”, which can skew that column further towards “no” - but again, this should not be a problem. We could actually show this by using the mice package, but we won’t do that here (to avoid bloating the report); I have already done it as rough work and I can confirm it is the case (you’ll just have to take my word on this).
Moving on, we can now apply the na.roughfix function to both our training and test data, and then run our model.
#training set
df.train <- na.roughfix(df.train)
#test set
df.test <- na.roughfix(df.test)
Once again we can verify whether we still have missing values in our datasets:
#training set
mean(!complete.cases(df.train))
## [1] 0
#test set
mean(!complete.cases(df.test))
## [1] 0
In short, the proportion of rows with missing values is now 0% in both sets - we no longer have any missing values in our training or test data.
This is also a good time to recap the datasets used in this project:
df: this is our original dataset or bank-additional-full dataset;
df1: this is the dataset with unknown values recoded as NA values;
df.train: this is our training dataset or (bank additional full - bank additional) dataset;
df.test: this is our test dataset or bank additional data set;
Now on to building our model. This is a complex subject and there’s really no need to dive into all the details. Based on my experience and domain knowledge, we could use several different models for this task (logistic regression, linear discriminant analysis, quadratic discriminant analysis, support vector machines, trees, bagging, randomForest or boosting).
Also, from my domain knowledge (at least for the models I have built), we can roughly rank classification models by their typical accuracy:
Classical trees tend to be more accurate than logistic regression models. Bagging, randomForest and boosting (which are all built on the tree mechanism) tend to be more accurate than classical trees and logistic regression, and randomForest and boosting are usually more accurate than bagging. Support vector machines are also more accurate than bagging, though compared to randomForest and boosting it is tight: in some cases SVM comes out on top and in other cases randomForest does.
Note: this is not true in every case - I have seen situations where a logistic regression model turned out to be more accurate than the other models - but more often than not it is not.
For this project there’s no need to try all the models; instead I will be using logistic regression, linear discriminant analysis, randomForest and SVM.
One more thing: to build a classification model you might need to create dummy variables for the categorical columns, but since R does this automatically for us inside glm, there’s no need to do it ourselves.
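For illustration, here is a peek at how R expands categorical columns into dummy variables behind the scenes; a small sketch using model.matrix on a couple of arbitrarily chosen columns:
#how R dummy-codes factors under the hood (illustration only)
head(model.matrix(~ contact + marital, data = df.train))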
Let’s start with logistic regression:
set.seed(1)
#fit model
logit.fit <- glm(y ~ ., family = binomial, data = df.train)
#check summary
summary(logit.fit)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = df.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0671 -0.3916 -0.3192 -0.2628 2.9455
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.371e+02 3.536e+01 -6.707 1.99e-11 ***
## age -2.578e-03 2.210e-03 -1.166 0.243497
## jobblue-collar -1.561e-01 7.041e-02 -2.217 0.026646 *
## jobentrepreneur -2.857e-02 1.114e-01 -0.257 0.797536
## jobhousemaid -8.784e-02 1.353e-01 -0.649 0.516153
## jobmanagement -9.824e-03 7.881e-02 -0.125 0.900805
## jobretired 3.229e-01 9.859e-02 3.275 0.001056 **
## jobself-employed -2.504e-02 1.070e-01 -0.234 0.814916
## jobservices -1.361e-01 7.857e-02 -1.732 0.083338 .
## jobstudent 1.994e-01 1.018e-01 1.959 0.050059 .
## jobtechnician -8.723e-03 6.543e-02 -0.133 0.893945
## jobunemployed -4.783e-02 1.181e-01 -0.405 0.685517
## maritalmarried 2.080e-02 6.301e-02 0.330 0.741343
## maritalsingle 6.151e-02 7.182e-02 0.857 0.391708
## educationbasic.6y 1.096e-01 1.086e-01 1.008 0.313278
## educationbasic.9y -1.615e-02 8.638e-02 -0.187 0.851644
## educationhigh.school 5.883e-02 8.364e-02 0.703 0.481836
## educationilliterate 1.111e+00 6.511e-01 1.707 0.087815 .
## educationprofessional.course 5.501e-02 9.274e-02 0.593 0.553084
## educationuniversity.degree 1.304e-01 8.110e-02 1.608 0.107904
## defaultyes -7.648e+00 8.448e+01 -0.091 0.927865
## housingyes -2.433e-02 3.760e-02 -0.647 0.517618
## loanyes -1.512e-02 5.251e-02 -0.288 0.773400
## contacttelephone -7.231e-01 7.037e-02 -10.275 < 2e-16 ***
## monthaug 5.297e-01 1.140e-01 4.648 3.34e-06 ***
## monthdec 4.123e-01 2.012e-01 2.049 0.040483 *
## monthjul 7.378e-02 8.706e-02 0.847 0.396747
## monthjun -7.208e-01 1.175e-01 -6.135 8.54e-10 ***
## monthmar 1.484e+00 1.365e-01 10.870 < 2e-16 ***
## monthmay -4.158e-01 7.569e-02 -5.493 3.95e-08 ***
## monthnov -4.132e-01 1.105e-01 -3.738 0.000186 ***
## monthoct 7.079e-02 1.424e-01 0.497 0.619105
## monthsep 2.819e-01 1.675e-01 1.683 0.092443 .
## day_of_weekmon -2.261e-01 6.123e-02 -3.693 0.000222 ***
## day_of_weekthu 8.934e-02 5.861e-02 1.524 0.127461
## day_of_weektue 7.076e-02 6.052e-02 1.169 0.242346
## day_of_weekwed 1.708e-01 5.993e-02 2.850 0.004370 **
## campaign -4.009e-02 9.519e-03 -4.212 2.53e-05 ***
## pdays -1.175e-03 2.149e-04 -5.467 4.58e-08 ***
## previous -9.747e-02 6.016e-02 -1.620 0.105203
## poutcomenonexistent 4.114e-01 9.198e-02 4.473 7.72e-06 ***
## poutcomesuccess 7.037e-01 2.100e-01 3.351 0.000805 ***
## emp.var.rate -1.539e+00 1.314e-01 -11.712 < 2e-16 ***
## cons.price.idx 2.142e+00 2.333e-01 9.184 < 2e-16 ***
## cons.conf.idx 2.740e-02 7.366e-03 3.719 0.000200 ***
## euribor3m 2.141e-01 1.212e-01 1.767 0.077240 .
## nr.employed 6.961e-03 2.877e-03 2.420 0.015535 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26152 on 37067 degrees of freedom
## Residual deviance: 20485 on 37021 degrees of freedom
## AIC: 20579
##
## Number of Fisher Scoring iterations: 9
How do we read this? The rows of the model summary can be classified into five groups:
Group 1: rows with three asterisks (***) are (dummy) variables that have a highly significant relationship with the dependent variable (y), meaning you can predict y from them. For instance, contacttelephone, pdays, campaign, emp.var.rate and a few others all have a significant relationship with y. Another way to identify these variables is through the p-value: variables with a p-value at or below 0.001 (0.1%) fall in this group. They are the biggest factors in determining whether a client will subscribe to a term deposit or not;
Group 2: variables with two asterisks (**) also have a significant relationship with the dependent variable, but not as strong as Group 1. Their p-value is greater than 0.001 but at most 0.01 (1%). Variables in this group include day_of_weekwed and jobretired;
Group 3: variables with one asterisk (*) have a good relationship with y, but not as strong as Groups 1 or 2. Their p-value is greater than 0.01 (1%) but at most 0.05 (5%). Variables in this group include nr.employed, monthdec and jobblue-collar;
Group 4: variables marked with a period (.) have a weak relationship with y, but a relationship nonetheless. In some settings they are lumped in with Group 5 (the no-relationship group): if a 5% p-value is the benchmark, the variables in this group are considered insignificant because their p-value is greater than 0.05 (but at most 0.10). Variables in this group include euribor3m, monthsep, educationilliterate, jobstudent and jobservices;
Group 5: variables in this group have no detectable relationship with the outcome variable; in short, they have no say in whether a client will subscribe to a term deposit. Their p-value is greater than 0.10, and any variable with no asterisk or period belongs here. To simplify the model (and sometimes improve it) you can drop these variables and refit.
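If you prefer to pull these groups out programmatically rather than reading the printed table, a minimal sketch using the coefficient matrix returned by summary() might look like this (the object name coef.mat is just illustrative):
# coefficient matrix: estimate, std. error, z value and p-value for every term
coef.mat <- summary(logit.fit)$coefficients
# Group 1: terms with a p-value at or below 0.001 (the "***" rows)
rownames(coef.mat)[coef.mat[, "Pr(>|z|)"] <= 0.001]
# Group 5 candidates: terms with a p-value above 0.10 (no asterisk or period)
rownames(coef.mat)[coef.mat[, "Pr(>|z|)"] > 0.10]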
A quick word on the Estimate column. The estimate for contacttelephone is -7.231e-01. What does that tell us? It means being contacted by telephone has a negative impact on the outcome: clients contacted by telephone (rather than on their cellular phone) are less likely to subscribe. To be exact, being contacted by telephone lowers the log-odds of subscribing by 0.723. Because -0.723 is on the log-odds scale (logistic regression mathematical jargon), we can get a more interpretable figure by exponentiating it:
exp(coefficients(logit.fit)["contacttelephone"])
## contacttelephone
## 0.4852474
The odds of a client subscribing are multiplied by roughly 0.49, i.e. about halved, when we contact them by telephone instead of by cellular phone. We might as well focus on contacting our clients on their cellular phones.
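The same exponentiation works for every coefficient at once, which is a quick way to read all effects on the odds scale; a minimal sketch (the confidence-interval line uses profile likelihood and can be slow on a dataset this size, so it is left commented out):
# odds ratios for every term in the model
round(exp(coef(logit.fit)), 3)
# optionally with 95% confidence intervals (slower on large data)
# exp(cbind(OR = coef(logit.fit), confint(logit.fit)))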
We might also take note of clients last contacted in June (monthjun): they are less likely to subscribe. The same goes for clients contacted on a Monday (day_of_weekmon); Monday is not a good day to contact clients.
On the positive side, retired clients are more likely to subscribe, as are clients last contacted in August (monthaug) or March (monthmar). Wednesday (day_of_weekwed) is also a good day to contact clients.
Another thing we might be interested in is the individual rows (cases) that have an outsized influence on our model:
plot(logit.fit, which = 4)
Observations 24011, 25780 and 34054 all have a large influence on our model. It is often good practice to investigate, and sometimes remove, influential observations to make a model more accurate, but there is no need for us to do that here; I just want to show how you would go about it.
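If you did want to flag influential observations programmatically rather than reading them off the Cook's distance plot, a common rule of thumb is a Cook's distance above 4/n. A minimal sketch (the object names are just illustrative):
# Cook's distance for every training observation
cooks <- cooks.distance(logit.fit)
# flag observations above the common 4/n rule of thumb
influential <- which(cooks > 4 / nrow(df.train))
length(influential)
# refitting without them would look like this (not run here):
# logit.refit <- glm(y ~ ., family = binomial, data = df.train[-influential, ])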
Before we move on, we can also take a look at these influential observations. My suspicion is that what makes them influential is an unusual combination of variable values.
df.train[c(24011,25780,34054),]
## age job marital education default housing loan contact
## 24011 34 self-employed married illiterate no yes no cellular
## 25780 51 entrepreneur married illiterate no yes no cellular
## 34054 42 retired divorced illiterate no no no telephone
## month day_of_week campaign pdays previous poutcome emp.var.rate
## 24011 nov thu 1 999 0 nonexistent -0.1
## 25780 apr thu 3 999 0 nonexistent -1.8
## 34054 aug wed 3 999 0 nonexistent -2.9
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 24011 93.200 -42.0 4.076 5195.8 yes
## 25780 93.075 -47.1 1.410 5099.1 yes
## 34054 92.201 -31.4 0.834 5076.2 yes
If you remember from our exploratory analysis, the average age of our retired clients is above 60, so a retired client who is 42 years old is unusual, an outlier (row 34054).
Moving forward, let’s dive into prediction.
#predict using test set
logit.pred <- predict(logit.fit, df.test, type = "response")
#classify prediction
logit.prob <- factor(logit.pred > 0.5, levels = c(FALSE, TRUE),
labels = c("no","yes"))
logit.perf <- table(df.test$y,logit.prob, dnn = c("Actual","Predicted"))
#confusion matrix
logit.perf
## Predicted
## Actual no yes
## no 3601 67
## yes 341 110
#test error rate
mean(logit.prob != df.test$y)
## [1] 0.09905317
From our confusion matrix we can see that 3,601 cases were correctly classified as no and 110 cases were correctly classified as yes. We also wrongly classified 341 cases as no and 67 cases as yes.
Hence the overall accuracy, i.e. the proportion of cases correctly classified, is (3601 + 110) / (3601 + 110 + 67 + 341), which is about 0.90 or 90%.
Is this a good predictive model? That depends on several metrics, which I will dive into later in this section. While the overall accuracy is good (90%), the model does poorly on the clients who actually subscribed: out of 451 subscribers in the test set it correctly picked out only 110 and missed 341, a sensitivity of 24%, which is poor.
We can try to improve this result in several ways. One is to change the classification threshold. Right now the threshold is 0.5: if the predicted probability is greater than 0.5 we call the prediction a "yes", otherwise a "no". We might instead lower the threshold and say that any probability greater than 0.2 counts as a "yes" (the client is likely to subscribe) and anything at or below 0.2 counts as a "no". The easiest way to understand this is by illustration:
#predict using test set
logit.p <- predict(logit.fit, df.test, type = "response")
#classify prediction
logit.pr <- factor(logit.p > 0.2, levels = c(FALSE, TRUE),
labels = c("no","yes"))
logit.pe <- table(df.test$y,logit.pr, dnn = c("Actual","Predicted"))
#confusion matrix
logit.pe
## Predicted
## Actual no yes
## no 3328 340
## yes 189 262
#test error rate
mean(logit.pr != df.test$y)
## [1] 0.1284292
Here we can see that our model now correctly identifies 262 subscribers, for a sensitivity of 58% (34 percentage points higher than before), but at the same time the overall error rate rises slightly to 13%, i.e. accuracy drops to 87%.
At the end of the day, organizations decide which measure matters most to them. For some companies a model with 58% sensitivity is better than one with 90% accuracy; in some settings sensitivity matters far more than accuracy (a bank, for example, will pay more to identify customers who are likely to default than customers who will not).
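Rather than picking 0.5 or 0.2 by eye, we could sweep a range of thresholds and see how sensitivity and accuracy trade off. A minimal sketch, reusing logit.pred from above (the object names are just illustrative):
# sweep a few thresholds and record sensitivity and accuracy at each one
thresholds <- seq(0.1, 0.6, by = 0.1)
sweep <- sapply(thresholds, function(th) {
  pred <- factor(logit.pred > th, levels = c(FALSE, TRUE), labels = c("no", "yes"))
  tab  <- table(Actual = df.test$y, Predicted = pred)
  c(threshold   = th,
    sensitivity = tab["yes", "yes"] / sum(tab["yes", ]),
    accuracy    = sum(diag(tab)) / sum(tab))
})
round(t(sweep), 3)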
Another way of improving the model is variable selection: using only the variables that have a significant relationship with the outcome y. We can do this by hand, going over the model summary and keeping every variable marked with at least a period "." (groups 1 to 4), or we can do it automatically with the step() function, which is what I did. I am not printing the step() output here because it is very long; instead I copied the recommended variables straight into the next model.
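For reference, the call looked roughly like the sketch below (the object name logit.aic is just illustrative; trace = 0 suppresses the long step-by-step printout, and refitting many candidate models takes a while on this dataset):
# stepwise selection by AIC on the full logistic model (output suppressed)
logit.aic <- step(logit.fit, direction = "both", trace = 0)
formula(logit.aic)   # the variables kept, which I copied into the model below
Now let's build another model with the selected variables: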
set.seed(1)
#best variables model
logit.step <- glm(y ~ age + job + contact + month + day_of_week +
campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx +
cons.conf.idx + euribor3m + nr.employed, family = binomial,
data = df.train)
#predictions
logit.pred.step <- predict(logit.step, df.test, type = "response")
#classify prob
logit.prob.step <- factor(logit.pred.step > 0.5, levels = c(FALSE, TRUE),
labels = c("no","yes"))
#prediction table
logit.perf.step <- table(df.test$y,logit.prob.step,dnn = c("Actual","Predicted"))
#print out table
logit.perf.step
## Predicted
## Actual no yes
## no 3605 63
## yes 339 112
#test error
mean(logit.prob.step != df.test$y)
## [1] 0.0975965
This is a slight improvement over our first model: it identified 112 subscribers versus 110 for the full model.
Let’s try out linear discriminant analysis
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lda.fit <- lda(y ~., data = df.train)
lda.pred <- predict(lda.fit, df.test)
lda.class <- lda.pred$class
lda.perf <- table(df.test$y,lda.class,dnn = c("Actual","Predicted"))
lda.perf
## Predicted
## Actual no yes
## no 3501 167
## yes 274 177
mean(lda.class != df.test$y)
## [1] 0.1070648
This is arguably a better model than our logistic regression because it correctly identified 177 of the 451 subscribers. Its sensitivity is better than logistic regression's, although it is slightly less accurate overall.
Once again we can look at the classification threshold. The line below counts how many test observations have a posterior probability of "no" below 0.5, i.e. how many are predicted "yes" at the default cut-off:
sum(lda.pred$posterior[,1] < .5)
## [1] 344
That count, 344, is simply the total number of "yes" predictions in the confusion matrix above (167 + 177), so on its own it only confirms the default 0.5 cut-off. To actually improve sensitivity we would lower the bar for predicting "yes", for example calling "yes" whenever the posterior probability of "no" falls below 0.8, and then rebuild the confusion matrix.
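A minimal sketch of that re-thresholding, mirroring the 0.2 cut-off we used for logistic regression (the object name lda.class.adj is just illustrative, and the resulting confusion matrix is not shown here):
# call "yes" whenever the posterior probability of "no" drops below 0.8,
# i.e. whenever P(yes) > 0.2 -- the same relaxed cut-off used earlier
lda.class.adj <- factor(lda.pred$posterior[, "no"] < 0.8,
                        levels = c(FALSE, TRUE), labels = c("no", "yes"))
# confusion matrix at the relaxed threshold
table(Actual = df.test$y, Predicted = lda.class.adj)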
Let’s try the randomForest model
#randomForest
set.seed(1)
rf.fit <- randomForest(y ~ ., data = df.train,
importance = T, mtry = 4)
Before looking at the output, let's go over the basic idea behind randomForest. It is built on decision trees combined with the bootstrap.
In a random forest we grow many trees, each on a bootstrap sample of the training rows (this is where the bootstrap comes in), and at each split of each tree only a random subset of the predictors is considered as candidates. The final prediction aggregates the individual trees (a majority vote for classification). By default randomForest grows 500 trees, and at each split it tries a specified number of predictors.
For our model I have set that number to 4 (mtry = 4). So at one split the candidate predictors might be job, age, marital and loan; at another they might be job, education, loan and pdays; at another contact, month, pdays and euribor3m, and so on across all 500 trees. This per-split randomness is what decorrelates the trees and makes the aggregated result stable.
Note that if mtry = 19 (all of the predictors) this is bagging rather than a random forest. I actually ran a bagging model on our data and the result was not as good as randomForest, but I am not including its output here for the reason mentioned before (to avoid bloating the report); the call is sketched below.
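For completeness, a minimal sketch of that bagging fit (output not reproduced here):
# bagging: a "random forest" where all 19 predictors are tried at every split
set.seed(1)
bag.fit <- randomForest(y ~ ., data = df.train, mtry = 19, importance = TRUE)
bag.fit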
We can verify the number of trees built and the number of variables tried at each split below:
rf.fit
##
## Call:
## randomForest(formula = y ~ ., data = df.train, importance = T, mtry = 4)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 10.12%
## Confusion matrix:
## no yes class.error
## no 32114 765 0.02326713
## yes 2988 1201 0.71329673
importance(rf.fit)
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## age 37.965523 -2.430191 36.664020 7.128217e+02
## job 57.045411 -12.063915 51.845525 4.914984e+02
## marital 20.081200 -4.209574 17.306731 1.741618e+02
## education 38.455699 -5.747768 34.875182 3.341983e+02
## default 0.000000 0.000000 0.000000 1.618952e-03
## housing 6.196515 -3.369117 3.862782 1.392867e+02
## loan 2.133211 1.853132 2.786042 9.964356e+01
## contact 9.197919 29.529872 11.790618 8.408242e+01
## month 25.099893 -9.798359 25.943122 2.134846e+02
## day_of_week 35.313835 -1.191252 35.077292 3.317685e+02
## campaign 14.898398 13.954336 19.800422 3.505065e+02
## pdays 6.079087 35.419908 24.522829 2.921908e+02
## previous 7.794241 3.434010 8.702346 1.052016e+02
## poutcome 12.332619 13.704904 16.048634 2.178699e+02
## emp.var.rate 16.955485 2.580982 17.797485 1.549350e+02
## cons.price.idx 19.773182 -15.163870 19.624726 1.400434e+02
## cons.conf.idx 17.861462 -7.781122 18.333387 1.632809e+02
## euribor3m 29.372158 7.912342 32.398056 7.958030e+02
## nr.employed 20.591462 16.362149 23.989428 3.956196e+02
Judging by MeanDecreaseAccuracy, the most important variable is job, followed by age: job has the largest value (about 51.8), meaning that randomly permuting job degrades the model's out-of-bag accuracy more than permuting any other variable. Based on the Gini index, on the other hand, the most important variable is euribor3m, followed by age. We can see the plot below.
varImpPlot(rf.fit, main = "Job(based on MeanDecrease) and euribor3m(based on GiniIndex) are the most important variables")
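If you would rather rank the variables programmatically than read them off the plot or the table, you can sort the importance matrix on either column:
imp <- importance(rf.fit)
# top five variables by permutation importance
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 5)
# top five variables by Gini importance
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ], 5)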
rf.pred <- predict(rf.fit, df.test)
rf.perf <- table(df.test$y, rf.pred, dnn = c("Actual","Predicted"))
rf.perf
## Predicted
## Actual no yes
## no 3579 89
## yes 316 135
mean(rf.pred != df.test$y)
## [1] 0.09832484
Our test error is 0.098, or 9.8% (remember, the lower the test error the better). We will compare the accuracy of all the models at the end of this section.
Now let's try the support vector machine.
set.seed(1)
svm.fit <- svm(y ~., data = df.train)
svm.fit
##
## Call:
## svm(formula = y ~ ., data = df.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 10467
svm.pred <- predict(svm.fit, df.test)
svm.perf <- table(df.test$y,svm.pred,
dnn = c("Actual","Predicted"))
svm.perf
## Predicted
## Actual no yes
## no 3625 43
## yes 359 92
mean(svm.pred != df.test$y)
## [1] 0.0975965
The test error for the SVM is 0.0976, or roughly 9.8%, which is almost the same as our variable-selection logistic regression model.
We could improve the SVM by tuning its parameters, but that is computationally intensive because we have a large dataset (in the context of tuning SVM parameters, this is a large dataset). Depending on your workstation it can take a long time; I tried it on a machine with 4 GB of RAM and it was still running after 52 minutes.
For that reason I stopped it and will not run the code again, but I will leave the call below in case you want to try it:
set.seed(1)
#tune.svm <- tune.svm(y ~., data = df.train,
# gamma = 10 ^ (-3:1),
# cost = 10 ^ (-5:5))
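If you do want to run the tuning, one way to keep it tractable is to tune on a random subsample first and then refit the best gamma/cost on the full training set. A minimal sketch (the 5,000-row sample, the grid and the object names are just illustrative):
# tune on a 5,000-row random subsample to keep the runtime manageable
set.seed(1)
sub <- df.train[sample(nrow(df.train), 5000), ]
svm.tune.sub <- tune.svm(y ~ ., data = sub,
                         gamma = 10 ^ (-2:0),
                         cost  = 10 ^ (0:2))
summary(svm.tune.sub)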
To have a holistic view of model performance, we need to focus on five metrics plus the test error rate.
These five metrics are:
Sensitivity
Specificity
Positive predicted value
Negative predicted value
Accuracy
Test error rate
Let's build a function to calculate each of these.
performance <- function(predictions, table, n = 2, actual = df.test$y) {
  # expects a 2 x 2 confusion matrix with Actual in rows and Predicted in columns
  if (!all(dim(table) == c(2, 2)))
    stop("Must be a 2 x 2 table")
  tn <- table[1, 1]   # true negatives
  fp <- table[1, 2]   # false positives
  fn <- table[2, 1]   # false negatives
  tp <- table[2, 2]   # true positives
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  ppp <- tp / (tp + fp)   # positive predicted value (computed but not printed below)
  npp <- tn / (tn + fn)   # negative predicted value
  hitrate <- (tp + tn) / (tp + tn + fp + fn)
  test_error_rate <- mean(predictions != actual)
  result <- paste("Sensitivity = ", round(sensitivity, n),
                  "\nSpecificity = ", round(specificity, n),
                  "\nNegative Predicted Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, 3),
                  "\nTest Error rate = ", round(test_error_rate, 3),
                  "\n", sep = "")
  cat(result)
}
First, let's go over what each of these metrics means:
Sensitivity: the probability of a positive classification when the true outcome is positive. It is also called recall or TPR (true positive rate);
Specificity: the opposite of sensitivity, i.e. the probability of a negative classification when the true outcome is negative (in our case, correctly identifying clients that did not subscribe);
PPV: positive predicted value, the probability that an observation classified as positive really is positive (also called precision);
NPV: negative predicted value, the probability that an observation classified as negative really is negative;
Accuracy: also called ACC, the proportion of observations correctly classified.
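As a quick sanity check on these definitions, we can compute sensitivity and accuracy by hand from the logistic regression confusion matrix (logit.perf) before calling the function:
# sensitivity: subscribers the model actually caught (110 of 451)
logit.perf["yes", "yes"] / sum(logit.perf["yes", ])
# accuracy: all correct classifications over all test cases
sum(diag(logit.perf)) / sum(logit.perf)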
Now let’s check the metrics with our models.
For logistic Regression:
performance(logit.prob, logit.perf)
## Sensitivity = 0.24
## Specificity = 0.98
## Negative Predicted Value = 0.91
## Accuracy = 0.901
## Test Error rate = 0.099
For Logistic Regression (variable selection)
performance(logit.prob.step, logit.perf.step)
## Sensitivity = 0.25
## Specificity = 0.98
## Negative Predicted Value = 0.91
## Accuracy = 0.902
## Test Error rate = 0.098
For linear discriminant analysis
performance(lda.class, lda.perf)
## Sensitivity = 0.39
## Specificity = 0.95
## Negative Predicted Value = 0.93
## Accuracy = 0.893
## Test Error rate = 0.107
For random forest
performance(rf.pred, rf.perf)
## Sensitivity = 0.3
## Specificity = 0.98
## Negative Predicted Value = 0.92
## Accuracy = 0.902
## Test Error rate = 0.098
For support vector machine
performance(svm.pred, svm.perf)
## Sensitivity = 0.2
## Specificity = 0.99
## Negative Predicted Value = 0.91
## Accuracy = 0.902
## Test Error rate = 0.098
Which is the best model? The answer depends on context. All of the models have poor sensitivity, but as we saw with LDA, sensitivity can be improved by adjusting the classification threshold.
If identifying clients that will subscribe is the priority, we might go with LDA, since its sensitivity is noticeably better. If raw accuracy is the goal, we might go with the SVM or the variable-selection logistic regression model. I suspect a properly tuned SVM would beat all of these models on accuracy, though not necessarily on sensitivity.