Hello!

Glad to meet you at the first, exploratory part, of the report, prepare for the bank company. My name is Anna, I’m a third year student data analyst. In this part I’m going to explore tha data as thoroughly, as possible, look at somу possible relations and provide some ideas for further analysis.

loading the data and variable description

I would also like to add an additional column, which’ll tell us if a client is employed or not. Retired, students, housemaids are unemployed in this case.

#unique(ds$job) to see which levels does this factor have
ds$employed <- NA
ds$employed <- ifelse(ds$job %in% c("unemployed", "retired", "housemaid", "student"), "no", "yes")
ds$employed <- as.factor(ds$employed)
#str(ds$employed)
#str(ds) juuust to check everything up :)

Firstly, let’s have a quick overview at our data, variables, which we have. You may skip it if you’d like, I will remind about everything in each part. However, extra look at everything at ones might be helpful :) We’ve been using ‘bank_cleaned’ data. Moreover, I’ve added a new variable, employed, which tells us if a client is employed or not. Retired, students, housemaids are unemployed in this case.

Socio-demoraphic data:

age (numeric);
marital status (categorical: divorced, married, single);
education (categorical: primary, secondary, tertiary);
employment: is the client employed? (categorical: yes or no);
job position (categorical, 12 types: management, technician, entrepreneur, retired, admin., services, blue-collar, self-employed, unemployed, housemaid, student and other);
balance level (numeric);

Client’s bank history in this bank: (all three variable are categorical, with yes/no answer)

default: has credit in default?
housing: has housing loan?
loan: has personal loan?

Last contact information of the current campaign:

day: last contact day (numeric);
month: last contact month (categorical: 12 months of the year);
duration: last contact duration, in seconds (numeric);

Infornation about the number of contacts and days passed after the last contact: (all three variable are numeric)

previous: number of contacts performed before this campaign and for this client;
campaign: number of contacts performed during this campaign and for this client;
pdays: number of days that passed by after the client was last contacted from a previous campaign;

And of course campaign outcomes:

poutcome: outcome of the previous marketing ampaign (categorical: success, failure or unknown);
response: has the client subscribed a term deposit? (binary: yes or no).

Socio-demoraphic factors

First, let’s have a look at our client in general. Down below you can see tables with information how many client are according to every socio-demographic trait. I’ve included both number of observations and percantage distribution and put it in descending order. The socio-demographic characteristics are:

marital status (divorced, married, single);
education (primary, secondary, tertiary);
employment (yes or no);
job position (12 types: management, technician, entrepreneur, retired, admin., services, blue-collar, self-employed, unemployed, housemaid, student and other);
age;
balance level.

First, I’ll show you everything in graphics and tables and than will provide a short summary.

ds %>% group_by(marital) %>% summarize(count=n()) %>% arrange(desc(count)) %>% 
       mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()

ds %>% group_by(education) %>% summarize(count=n()) %>% arrange(desc(count)) %>% 
       mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()

ds %>% group_by(employed) %>% summarize(count=n()) %>% arrange(desc(count)) %>% 
       mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()

ds %>% group_by(job) %>% summarize(count=n()) %>% arrange(desc(count)) %>% 
       mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()

# I've also created some barplot, but didn't add it now, cause slightly similar barplots will be a bit later, but here it's not realy necessary, tables are showing information quite clear :)

#ggplot(data = ds, aes(x = factor(marital), 
#                          y = prop.table(stat(count)), 
#                          label = scales::percent(prop.table(stat(count))))) +
#    geom_bar() + 
#    geom_text(stat = 'count',
#              position = position_dodge(.9), 
#              vjust =  1.2, 
#              size = 4,
#              col = "white") + 
#    scale_y_continuous(labels = scales::percent) + 
#    labs(x = 'marital status', y = 'percentage')

#ggplot(data = ds, aes(x = factor(education), 
#                          y = prop.table(stat(count)), 
#                          label = scales::percent(prop.table(stat(count))))) +
#    geom_bar() + 
#    geom_text(stat = 'count',
#              position = position_dodge(.9), 
#              vjust =  1.2, 
#              size = 4,
#              col = "white") + 
#    scale_y_continuous(labels = scales::percent) + 
#    labs(x = 'education', y = 'percentage')

#ggplot(data = ds, aes(x = factor(employed), 
#                          y = prop.table(stat(count)), 
#                          label = scales::percent(prop.table(stat(count))))) +
#    geom_bar() + 
#    geom_text(stat = 'count',
#              position = position_dodge(.9), 
#              vjust =  1.2, 
#              size = 4,
#              col = "white") + 
#    scale_y_continuous(labels = scales::percent) + 
#    labs(x = 'employed', y = 'percentage')

library(forcats)
#ggplot(data = ds, aes(x = fct_infreq(job), 
#                          y = prop.table(stat(count)), 
#                          label = scales::percent(prop.table(stat(count))))) +
#    geom_bar() + 
#    geom_text(stat = 'count',
#              position = position_dodge(.9), 
#              vjust =  1.2, 
#              size = 4,
#              col = "white") + 
#    scale_y_continuous(labels = scales::percent) + 
#    labs(x = 'job', y = 'percentage')

# age
hist(ds$age, 
     xlab = "Client's age",
     main = "Age distribution")
abline(v=c(mean(ds$age), median(ds$age)), 
       col=c("blue", "red"), 
       lwd = c(2, 2))
legend(x = "topright",
 c("Mean", "Median"),
 col = c("blue", "red"),
 lwd = c(2, 2))

summary(ds$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   39.00   40.79   48.00   95.00

# balance level

hist(ds$balance, breaks = 50,
     xlab = "Client's balance",
     main = "Clinents' balance distribution")
abline(v=c(mean(ds$balance), median(ds$balance)), 
       col=c("blue", "red"), 
       lwd = c(2, 2))
legend(x = "topright",
 c("Mean", "Median"),
 col = c("blue", "red"),
 lwd = c(2, 2))

summary(ds$balance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6847      64     421    1074    1333   10443

What do we know so far?

Marital status: Almost 2/3 of our sample - married people. Another one thrid part is divided between singles (almost 30%) and divorced ones (12%);

Education: More than a half of clients (53.7%) have secondary education, 30% have higher education and 16%,by contrast, have primary education.

Job position: First of all, 88% of our clients are employed somewhere. As for the more detailed information, the most wide-spread job positions among our clients are: blue-collar (22%), management (21%) and technician (17%). Than goes administrator position (12%) and service (9%). All other positions have less than 10% of observations and can be seen in the table above.

Age: In general, the age of our clients vary from 18 to 95. The mean age is 40, the median in our data is near the mean value - 39. From the graph we may see the first increase from 25 y.o. and peak at the point of 30-35 with following decrease. All in all, I would say that the bigger part of our observations fill the gap between 25 and 45 y.o. And as for descriptive statistics, it gives the age of 33 y.o. as the 1st quartile and 48 as the 3rd one, (which means than age of 50% of our clients lie between 33 and 48).

Balance: As for the client’s balance, from the graph we may see that the most frequent balance vary from -500 to 1000-1500 (every bar (break) on the graph is equal to 500). Descriptive statistics gives us 64 as the 1st quartile and 1333 as the 3rd one, (which means than 50% of our observations lie between 64 and 1333). The avarage client’s balance here is about 1000 and the median is 421. Minimal and maximum values can also be seen in the table (-6847 and 10443).

Current bank situation

Now I’d like to look throught the current situation in bank database. We have information about the following variables:

defaut - if the client has credit in default or not;
housing - if the client has housing loan?
loan - if the client has personal loan?

Let’s have a look at the dustribution of all observations in every variable. I’ve manually put three bar graphs in order as positive portions are ascending :)

gdef <- ggplot(data = ds, aes(x = default, fill = default,
                      y = prop.table(stat(count)))) +
  geom_bar(aes(y = prop.table(..count..)),
             position = "dodge") + 
  geom_text(aes(y = prop.table(..count..), 
                  label = paste0(round(prop.table(..count..),digits= 2), '%')), 
              stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) + 
    scale_y_continuous(labels = scales::percent) +  
  labs(y="percentage", x = "",  title="Does the client\nhas credit in default\nor not?") +
  theme_bw() +
  theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
  scale_fill_manual(values=c("tomato2", "lightseagreen"))

ghous <- ggplot(data = ds, aes(x = housing, fill = housing,
                      y = prop.table(stat(count)))) +
  geom_bar(aes(y = prop.table(..count..)),
             position = "dodge") + 
  geom_text(aes(y = prop.table(..count..), 
                  label = paste0(round(prop.table(..count..),digits= 2), '%')), 
              stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) + 
    scale_y_continuous(labels = scales::percent) +  
  labs(y="", x = "",  title="Does the client\nhave housing loan?") +
  theme_bw() +
  theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
  scale_fill_manual(values=c("tomato2", "lightseagreen"))

gloan <- ggplot(data = ds, aes(x = loan, fill = loan,
                      y = prop.table(stat(count)))) +
  geom_bar(aes(y = prop.table(..count..)),
             position = "dodge") + 
  geom_text(aes(y = prop.table(..count..), 
                  label = paste0(round(prop.table(..count..),digits= 2), '%')), 
              stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) + 
    scale_y_continuous(labels = scales::percent) +  
  labs(y="", x = "",  title="Does the client\nhave personal loan?") +
  theme_bw() +
  theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
  scale_fill_manual(values=c("tomato2", "lightseagreen"))

grid.arrange(gdef, gloan, ghous, ncol=3)

Well, here we see that:

really little part of clients has credit in default, only 2% of all observations;
nearly the same, but better situation is happening with personal loan: about 1/6 (17%) of all observed people have personal loan;
housing loan shows really good result: more than a half of clients have housing loan (and for now it’s the oly case, were there are more positive responses).

Now, let’s look at the results of the previous marketing campaign and current one (if the client subscribed a term deposit or not)

gpoutcome <- ggplot(data = ds, aes(x = poutcome, fill = poutcome,
                      y = prop.table(stat(count)))) +
  geom_bar(aes(y = prop.table(..count..)),
             position = "dodge") + 
  geom_text(aes(y = prop.table(..count..), 
                  label = paste0(round(prop.table(..count..),digits= 2), '%')), 
              stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) + 
    scale_y_continuous(labels = scales::percent) +  
  labs(y="", x = "",  title="Has the client gave positive\nresponse to the previous campaign?") +
  theme_bw() +
  theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
  scale_fill_manual(values=c("tomato2", "lightseagreen", "grey"))

gresponse <- ggplot(data = ds, aes(x = response, fill = response,
                      y = prop.table(stat(count)))) +
  geom_bar(aes(y = prop.table(..count..)),
             position = "dodge") + 
  geom_text(aes(y = prop.table(..count..), 
                  label = paste0(round(prop.table(..count..),digits= 2), '%')), 
              stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) + 
    scale_y_continuous(labels = scales::percent) +  
  labs(y="", x = "",  title="Has the client subscribed\na term deposit?") +
  theme_bw() +
  theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
  scale_fill_manual(values=c("tomato2", "lightseagreen"))

grid.arrange(gpoutcome, gresponse, ncol=2)

Here we see, the problem: a huge part of outcome of previous campain is unknown. But we know, that 3% of all cases were successful and 11% were not (which is almost 4 times more). As for the present marketing campaign, we see that 11% of our clients gave positive responses and subscribed a term deposit. Not the best result, BUT it’s not as bad as it could be: in comparison with number of clients, who have credit in default or with known ‘successful’ outcome cases of previuos campaign 11% of positive responses is not very bad.

Look at the predictors

Okay. After we know who is our client and what are results of our marketing campaign, let’s look at factors which could relate somehow to the responses to the present campaign (yes, our predictors). For that I will firstly just draw some different graphs. Each one will show the percentage proportion of those clients, who subscribed a term deposit and those, who did not.

Let’s go!

Socio-demographic factors

marital status

set_theme(base = theme_classic())

# marital status
sjp.xtab(ds$marital, ds$response, 
         margin = "row", 
         bar.pos = "stack", 
         title = "Relation with marital status",
         axis.titles = "marital status", 
         legend.title = "subscribed a term deposit",
         show.summary = TRUE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

# chi-sq test 
t_marital <- with(ds, table(marital, response))
chi_mar <- chisq.test(t_marital)
chi_mar$stdres

##           response
## marital             no         yes
##   divorced  -0.5671748   0.5671748
##   married   12.3011747 -12.3011747
##   single   -12.9948418  12.9948418

On the graph above we may notice that single people give positive responses more often, than married ones. Test shows that there’s a relation between client’s marital status and if they subscibed to a term deposit: single clients tend to give positive responses, while married, in contrast, give negative.

job and education

# job and education
sjp.xtab(ds$education, ds$response, 
         margin = "row", 
         bar.pos = "stack", 
         title = "Relation with сlient's education",
         axis.titles = "education level", 
         legend.title = "subscribed a term deposit",
         show.summary = TRUE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

t_edu <- with(ds, table(education, response))
chi_edu <- chisq.test(t_edu)
chi_edu

## 
##  Pearson's Chi-squared test
## 
## data:  t_edu
## X-squared = 210.72, df = 2, p-value < 2.2e-16

chi_edu$stdres

##            response
## education           no        yes
##   primary     8.276654  -8.276654
##   secondary   6.639639  -6.639639
##   tertiary  -13.802452  13.802452

From the graph we see a slight percentage increase with the increasing educational level. As for the test, there is some relation: clients with higher education (tertiary) give positive responses and those, who have primary or secondary education, tend to give negative responses to the marketing campaign.

sjp.xtab(ds$employed, ds$response, 
         margin = "row", 
         bar.pos = "stack", 
         title = "Relation with сlient's employment",
         axis.titles = "If client employed or not", 
         legend.title = "subscribed a term deposit",
         show.summary = FALSE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

sjp.xtab(ds$job, ds$response, 
         margin = "row", 
         bar.pos = "stack", 
         title = "Relation with сlient's job position",
         axis.titles = "job position", 
         legend.title = "subscribed a term deposit",
         show.summary = FALSE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

t_emp <- with(ds, table(employed, response))
chi_emp <- chisq.test(t_emp)
chi_emp

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t_emp
## X-squared = 298.98, df = 1, p-value < 2.2e-16

chi_emp$stdres

##         response
## employed        no       yes
##      no  -17.31465  17.31465
##      yes  17.31465 -17.31465

t_job <- with(ds, table(job, response))
chi_job <- chisq.test(t_job)
chi_job

## 
##  Pearson's Chi-squared test
## 
## data:  t_job
## X-squared = 708.67, df = 11, p-value < 2.2e-16

chi_job$stdres

##                response
## job                      no         yes
##   admin.         -1.3956813   1.3956813
##   blue-collar    13.7704109 -13.7704109
##   entrepreneur    3.7823118  -3.7823118
##   housemaid       2.7831881  -2.7831881
##   management     -5.8273865   5.8273865
##   other           0.7402356  -0.7402356
##   retired       -15.7187125  15.7187125
##   self-employed   0.1609809  -0.1609809
##   services        5.3538543  -5.3538543
##   student       -15.1044177  15.1044177
##   technician      1.4887414  -1.4887414
##   unemployed     -4.8092993   4.8092993

As for the employment level and job position, I would say, that there is such relation that unemployed clients tend to give positive responses to the marketing campaign, while employed clients, in contrast, give negative responses. Here I would like to remind, that factor ‘unempoyed’ include students and retired clients. This relation was also supported by the test. Speaking about employed clients, the positive relation can be found only among managerial and administrator positions. Other give either no relation or negative one.

age and balance

# age and balance
g_age <- ggplot(ds, aes(x = response, y = age, fill = response)) +
  geom_boxplot() +
  labs(y="age", x = "",  
       title="Client's age") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("tomato2", "lightseagreen")) 

g_bal <- ggplot(ds, aes(x = response, y = balance, fill = response)) +
  geom_boxplot() +
  labs(y="balance", x = "",  
       title="Client's balance") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("tomato2", "lightseagreen")) 

grid.arrange(g_age, g_bal, ncol=2, top = textGrob("Has the client subscribed a term deposit?", gp=gpar(fontsize=17)))

t.test(ds$age ~ ds$response)

## 
##  Welch Two Sample t-test
## 
## data:  ds$age by ds$response
## t = -3.7897, df = 5339.8, p-value = 0.0001525
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1623314 -0.3697696
## sample estimates:
##  mean in group no mean in group yes 
##          40.70366          41.46971

t.test(ds$balance ~ ds$response)

## 
##  Welch Two Sample t-test
## 
## data:  ds$balance by ds$response
## t = -13.121, df = 5674.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -430.8670 -318.8538
## sample estimates:
##  mean in group no mean in group yes 
##          1031.403          1406.263

Here I see no much differences on graphs. Means are relatively similar, no big differences in position of boxes on the graph. Let’s try t-test: in contrast, it tells that the difference in means is statistically significant (p-value is small in both cases: with age and with balance). Not quite evident, but that’s the test results. Clients, who have subscribed to a term deposit tend to be older and with higher balance on their accounts.

Bank history

I’d like to see if there any relation between the fact that client may have or not other loans or credit in default and their response to the marketing campaign.

#default
sjp.xtab(ds$default, ds$response, 
         margin = "row", 
         bar.pos = "stack",
         axis.titles = "If client has credit in default?", 
         legend.title = "subscribed a term deposit",
         show.summary = TRUE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

# housing
sjp.xtab(ds$housing, ds$response, 
         margin = "row", 
         bar.pos = "stack",
         axis.titles = "If client has housing loan?",  
         legend.title = "subscribed a term deposit",
         show.summary = TRUE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

# loan
sjp.xtab(ds$loan, ds$response, 
         margin = "row", 
         bar.pos = "stack",
         axis.titles = "If client has personal loan?",  
         legend.title = "subscribed a term deposit",
         show.summary = TRUE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

t_def <- with(ds, table(default, response))
chi_def <- chisq.test(t_def) 
# the results of the test was insignoficant, so I didn't add it in desciption of results 

# chi_def
# chi_def$stdres


t_hos <- with(ds, table(housing, response))
chi_hos <- chisq.test(t_hos)
chi_hos

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t_hos
## X-squared = 720.3, df = 1, p-value < 2.2e-16

chi_hos$stdres

##        response
## housing        no       yes
##     no  -26.85413  26.85413
##     yes  26.85413 -26.85413

t_loan <- with(ds, table(loan, response))
chi_loan <- chisq.test(t_loan)
chi_loan

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t_loan
## X-squared = 176.98, df = 1, p-value < 2.2e-16

chi_loan$stdres

##      response
## loan        no      yes
##   no  -13.3242  13.3242
##   yes  13.3242 -13.3242

Well, here we see that if a client is already having some loan or credit, they will probably refuse in subscribing to a term deposit. That’s quite logical. This hypothesis is noticable from the graphs and is supported by tests (the part with housing and personal loan at least).

Separatly I want to look is there any relation between the results of the previous campaign and present one:

sjp.xtab(ds$poutcome, ds$response, 
         margin = "row", 
         bar.pos = "stack",
         axis.titles = "The outcome of the previous marketing campaign", 
         legend.title = "subscribed a term deposit",
         show.summary = TRUE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

t_pc <- with(ds, table(poutcome, response))
chi_pc <- chisq.test(t_pc)
chi_pc

## 
##  Pearson's Chi-squared test
## 
## data:  t_pc
## X-squared = 4072.1, df = 2, p-value < 2.2e-16

chi_pc$stdres

##          response
## poutcome          no        yes
##   failure  -2.506702   2.506702
##   success -63.450609  63.450609
##   unknown  34.664544 -34.664544

Wow, here we can clearly see the prevalence of positive responses among successful cases of the previous campaign. The same relation is supported by the test. That’s what I was expecting: if the client agrees to do smth in one marketing campaign, there should be a good chance, that they will repeat this success in the next campaign. Maybe, these are some loyal clients or smth of that kind.

Bank campany’s actions

For last in exploralory part I’d like to look at the relation of client’s response with the variables: 1. Related with the last contact of the current campaign; 2. Related with the number of contacts and days passed after the last contact.

relation with the last contact of the current campaign

Variables here are:

day - last contact day;
month - last contact month;
duration - last contact duration, in seconds;

( * - the asterisk mark on the y-axis means that the scale is converted into log10 for normalization and better visualization)

#month
sjp.xtab(ds$month, ds$response, 
         margin = "row", 
         bar.pos = "stack",
         axis.titles = "The last contact month", 
         legend.title = "subscribed a term deposit",
         show.summary = FALSE,
         show.n = FALSE,
         show.prc = TRUE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

#day
sjp.xtab(ds$day, ds$response, 
         margin = "row", 
         bar.pos = "stack",
         expand.grid  = TRUE,
         axis.titles = "The last contact day", 
         legend.title = "subscribed a term deposit",
         show.summary = FALSE,
         show.n = FALSE,
         show.prc = FALSE,
         show.total = FALSE, 
         coord.flip = TRUE,
         rev.order = FALSE)

#duration
ggplot(ds, aes(x = response, y = duration, fill = response)) +
  geom_boxplot() +      
  scale_y_log10() + 
  labs(y="Duration, in sec *", x = "",  
       title="Last contact duration") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("tomato2", "lightseagreen"))

t_day <- with(ds, table(day, response))
chi_day <- chisq.test(t_day)
chi_day

## 
##  Pearson's Chi-squared test
## 
## data:  t_day
## X-squared = 532.4, df = 30, p-value < 2.2e-16

chi_day$stdres

##     response
## day          no        yes
##   1  -9.0695600  9.0695600
##   2  -2.5452954  2.5452954
##   3  -5.6432597  5.6432597
##   4  -5.4083247  5.4083247
##   5  -0.5332721  0.5332721
##   6   3.0905455 -3.0905455
##   7   3.7712217 -3.7712217
##   8   1.4063715 -1.4063715
##   9   0.2252072 -0.2252072
##   10 -6.8231465  6.8231465
##   11 -1.0951643  1.0951643
##   12 -4.4584670  4.4584670
##   13 -5.0311946  5.0311946
##   14  1.0347888 -1.0347888
##   15 -2.6758554  2.6758554
##   16 -2.0544013  2.0544013
##   17  3.0887191 -3.0887191
##   18  2.1368954 -2.1368954
##   19  6.3524519 -6.3524519
##   20  6.8916032 -6.8916032
##   21  2.3227162 -2.3227162
##   22 -3.7488284  3.7488284
##   23 -1.0299458  1.0299458
##   24 -1.6117461  1.6117461
##   25 -3.1818594  3.1818594
##   26  1.5484329 -1.5484329
##   27 -0.5925772  0.5925772
##   28  6.0475078 -6.0475078
##   29  5.6696739 -5.6696739
##   30 -7.0864269  7.0864269
##   31  3.2913651 -3.2913651

t_mon <- with(ds, table(month, response))
chi_mon <- chisq.test(t_mon)
chi_mon

## 
##  Pearson's Chi-squared test
## 
## data:  t_mon
## X-squared = 2864.5, df = 11, p-value < 2.2e-16

chi_mon$stdres

##      response
## month         no        yes
##   apr -13.506027  13.506027
##   aug   2.379227  -2.379227
##   dec -14.250612  14.250612
##   feb  -9.110778   9.110778
##   jan   1.150512  -1.150512
##   jul   8.029198  -8.029198
##   jun   2.951178  -2.951178
##   mar -27.438654  27.438654
##   may  20.080571 -20.080571
##   nov   2.323969  -2.323969
##   oct -26.240709  26.240709
##   sep -25.364717  25.364717

t.test(ds$duration ~ ds$response)

## 
##  Welch Two Sample t-test
## 
## data:  ds$duration by ds$response
## t = -54.686, df = 4961.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.632025 -5.242192
## sample estimates:
##  mean in group no mean in group yes 
##          3.691365          9.128474

Month and day of the last contact

From the graph with months we may clearly see the prevalence of positive responses among the clients, who were last contacted in september, october, march and december. The results of statistical test support this positive relation, adding april and february to the set. Other months, in contrast, give negative relation.

As for the day, we can se the increase of positive responses among people, who were lastly connected in the beginning of the month (1st, 3rd, 4th, 10th days) and some noticable increase on the 30th. If the graph doesn’t give clear information, we can look at the results of the test, there exist some significant relations: connecting with client at the beginning of the month have positive relation with giving positive response to the marketing campaign. 30th day also is positively related to client’s subscription to the term deposit.

Duration of the last contact

There is a difference between duration of the last contact, which can be evidently seen from the graph: among positive responses to the marketing campaign we can see longer contact duration. Maybe that is more or less obvious and explainatory (the person can give a fast negative response, but positove response takes more time… but that’s only my hypothesis). The significant difference between means of these two groups (positive and negative responses) is also supported by the test.

Relation with the number of contacts and days passed after the last contact

Variables here are:

previous - number of contacts performed before this campaign and for this client;
campaign - number of contacts performed during this campaign and for this client;
pdays - number of days that passed by after the client was last contacted from a previous campaign;

#Number of contacts performed BEFORE this campaign and for this client
g_prev <- ggplot(ds, aes(x = response, y = previous, fill = response)) +
  geom_boxplot() +
  scale_y_log10() +     
  labs(y="Number of contacts *", x = "",  
       title="Number of contacts\nperformed BEFORE\nthis campaign") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("tomato2", "lightseagreen")) 

#Number of contacts performed DURING this campaign and for this client
g_camp <- ggplot(ds, aes(x = response, y = campaign, fill = response)) +
  geom_boxplot() +
  scale_y_log10() +     
  labs(y="Number of contacts *", x = "",  
       title="Number of contacts\nperformed DURING\nthis campaign") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("tomato2", "lightseagreen")) 

#number of days that passed by after the past contact
g_pdays <- ggplot(ds, aes(x = response, y = pdays, fill = response)) +
  geom_boxplot() +
  scale_y_log10() +     
  labs(y="Number of days *", x = "",  
       title="Number of days that\nPASSED by after the\nlast contact from a\nPREVIOUS campaign") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("tomato2", "lightseagreen")) 

grid.arrange(g_prev, g_camp, g_pdays, ncol=3, top = textGrob("Client's response and number of contacts with them", gp=gpar(fontsize=17)))

t.test(ds$previous ~ ds$response)

## 
##  Welch Two Sample t-test
## 
## data:  ds$previous by ds$response
## t = -19.049, df = 5164.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7010387 -0.5702062
## sample estimates:
##  mean in group no mean in group yes 
##         0.3645931         1.0002156

t.test(ds$campaign ~ ds$response)

## 
##  Welch Two Sample t-test
## 
## data:  ds$campaign by ds$response
## t = 21.382, df = 8324.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.6441811 0.7742162
## sample estimates:
##  mean in group no mean in group yes 
##          2.854704          2.145505

t.test(ds$pdays ~ ds$response)

## 
##  Welch Two Sample t-test
## 
## data:  ds$pdays by ds$response
## t = -18.061, df = 5365.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -34.41140 -27.67271
## sample estimates:
##  mean in group no mean in group yes 
##          28.72234          59.76439

As fot the number of contact with clients we have slightly similar results in both cases: defore and after the campaign. And the fact is there’s no big difference seen from the graphs (however t-test says that the difference in means exists and it’s significant). Anyway, here, it’s more interesting to have a look at the third graph with number of day passed by after the last contact. The difference between means an be seen clearly from the graph as well as the whole position of boxes. The same idea, that there is a significant difference between means in two groups (who gave postitive or nagative response to the marketing campaign) is supported be the test. So, we may said that, in our case, the more recent was the last contact, the more chances that the client subsribes to the term deposit. Which is wuite logical: if we forget about our client and don’t contact them, they will forget about our service too and may even churn. So don’t forget your client and they will give positive responses to the marketing campaign. But, again, it’s only hypothesis :)

Summary for exploratory part

Well, now we know lot’s of things. First of all, we have accumulated knowledge about our client from socio-demographic point of view. In further steps I recomend to take a look at age and job relation to clients’ response. Also, I suggest pay attention to the actions, that bank company can do: number of contacts, its duration, day or month, when the contact is performed - all these thigs can be really influential. Also, I really recomend to check the influence of the previuos campaign. If this relation will be supported by further analysis (with predictive modeling, I mean), that would be a good sign for the company to look at the traits and hints, which were used that time. And maybe create some loyalty program for the customers, who give positive responses in marketing programs! Anyway, this we will know later.

Logistic regression

Out of my personal interest I decided to try logistic regression. Leter, my colleguages will catch me up and continue this story with other methods (Decision tree and Bayesian network models in particular).

Our outcome is response - we are interested to see which factors can predict will the client give positive or negative response to our marketing campaign. First, we divide the sample into two parts: train and test ones in proportion 80/20. Then, make a model. As the predictors I’ve used all the variables.

ind = createDataPartition(ds$response, p = 0.20, list = F) 
ds.test = ds[ind,] 
ds.train = ds[-ind,]

logitModelFull <- glm(response_binary ~ age + education + employed + job + marital + default + balance + housing + loan + day + month + duration + campaign + pdays + previous + poutcome, family = binomial, data = ds.train)
summary(logitModelFull) # the model

## 
## Call:
## glm(formula = response_binary ~ age + education + employed + 
##     job + marital + default + balance + housing + loan + day + 
##     month + duration + campaign + pdays + previous + poutcome, 
##     family = binomial, data = ds.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1534  -0.3604  -0.2446  -0.1679   3.0901  
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.806e+00  3.034e-01  -5.952 2.64e-09 ***
## age                -3.159e-03  2.637e-03  -1.198  0.23102    
## educationsecondary  2.189e-01  7.571e-02   2.891  0.00384 ** 
## educationtertiary   4.575e-01  8.889e-02   5.147 2.65e-07 ***
## employedyes         1.007e-01  1.289e-01   0.781  0.43463    
## jobblue-collar     -3.514e-01  8.656e-02  -4.060 4.92e-05 ***
## jobentrepreneur    -4.485e-01  1.513e-01  -2.965  0.00303 ** 
## jobhousemaid       -3.254e-01  1.810e-01  -1.798  0.07215 .  
## jobmanagement      -2.659e-01  8.776e-02  -3.030  0.00244 ** 
## jobother           -4.483e-01  3.716e-01  -1.206  0.22766    
## jobretired          3.264e-01  1.473e-01   2.216  0.02667 *  
## jobself-employed   -4.063e-01  1.371e-01  -2.964  0.00303 ** 
## jobservices        -2.830e-01  1.001e-01  -2.828  0.00469 ** 
## jobstudent          4.803e-01  1.687e-01   2.847  0.00441 ** 
## jobtechnician      -1.548e-01  8.130e-02  -1.904  0.05690 .  
## jobunemployed              NA         NA      NA       NA    
## maritalmarried     -1.964e-01  7.028e-02  -2.794  0.00520 ** 
## maritalsingle       9.197e-02  7.986e-02   1.152  0.24947    
## defaultyes          6.329e-02  1.846e-01   0.343  0.73170    
## balance             3.562e-05  1.201e-05   2.967  0.00301 ** 
## housingyes         -7.288e-01  5.251e-02 -13.878  < 2e-16 ***
## loanyes            -4.180e-01  7.028e-02  -5.947 2.73e-09 ***
## day2               -4.820e-01  2.243e-01  -2.149  0.03165 *  
## day3               -3.487e-01  2.247e-01  -1.552  0.12076    
## day4               -2.039e-01  2.172e-01  -0.939  0.34772    
## day5               -5.718e-01  2.186e-01  -2.616  0.00890 ** 
## day6               -6.295e-01  2.218e-01  -2.838  0.00454 ** 
## day7               -5.519e-01  2.229e-01  -2.476  0.01330 *  
## day8               -3.163e-01  2.192e-01  -1.443  0.14894    
## day9               -4.826e-01  2.218e-01  -2.176  0.02956 *  
## day10               1.917e-01  2.467e-01   0.777  0.43697    
## day11              -3.031e-01  2.216e-01  -1.368  0.17143    
## day12              -6.016e-02  2.163e-01  -0.278  0.78093    
## day13               1.151e-01  2.181e-01   0.528  0.59761    
## day14              -2.194e-01  2.212e-01  -0.992  0.32144    
## day15              -6.030e-02  2.185e-01  -0.276  0.78259    
## day16              -5.148e-01  2.215e-01  -2.324  0.02013 *  
## day17              -1.019e+00  2.214e-01  -4.601 4.21e-06 ***
## day18              -4.242e-01  2.171e-01  -1.954  0.05066 .  
## day19              -1.247e+00  2.381e-01  -5.237 1.63e-07 ***
## day20              -9.226e-01  2.187e-01  -4.218 2.47e-05 ***
## day21              -6.210e-01  2.239e-01  -2.774  0.00554 ** 
## day22              -2.191e-01  2.339e-01  -0.937  0.34900    
## day23              -5.813e-02  2.372e-01  -0.245  0.80638    
## day24              -5.145e-01  2.840e-01  -1.812  0.07003 .  
## day25               7.964e-02  2.362e-01   0.337  0.73594    
## day26              -2.395e-01  2.389e-01  -1.003  0.31604    
## day27              -2.761e-02  2.340e-01  -0.118  0.90610    
## day28              -4.559e-01  2.337e-01  -1.951  0.05101 .  
## day29              -7.607e-01  2.369e-01  -3.212  0.00132 ** 
## day30              -9.446e-02  2.179e-01  -0.433  0.66467    
## day31               3.745e-02  2.847e-01   0.132  0.89534    
## monthaug           -8.687e-01  1.036e-01  -8.382  < 2e-16 ***
## monthdec            5.830e-01  2.202e-01   2.648  0.00810 ** 
## monthfeb           -2.988e-01  1.177e-01  -2.539  0.01112 *  
## monthjan           -1.236e+00  1.634e-01  -7.565 3.88e-14 ***
## monthjul           -1.051e+00  1.020e-01 -10.302  < 2e-16 ***
## monthjun           -6.221e-01  1.035e-01  -6.010 1.86e-09 ***
## monthmar            1.543e+00  1.522e-01  10.141  < 2e-16 ***
## monthmay           -1.175e+00  9.238e-02 -12.723  < 2e-16 ***
## monthnov           -5.896e-01  1.134e-01  -5.201 1.98e-07 ***
## monthoct            7.101e-01  1.363e-01   5.210 1.89e-07 ***
## monthsep            7.322e-01  1.548e-01   4.730 2.24e-06 ***
## duration            2.592e-01  4.623e-03  56.072  < 2e-16 ***
## campaign           -8.808e-02  1.215e-02  -7.251 4.14e-13 ***
## pdays              -1.690e-05  3.995e-04  -0.042  0.96625    
## previous            2.879e-02  1.638e-02   1.757  0.07892 .  
## poutcomesuccess     2.162e+00  9.772e-02  22.123  < 2e-16 ***
## poutcomeunknown    -3.038e-01  1.262e-01  -2.408  0.01604 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23128  on 32671  degrees of freedom
## Residual deviance: 15135  on 32604  degrees of freedom
## AIC: 15271
## 
## Number of Fisher Scoring iterations: 6

coefsexp <- coef(logitModelFull) %>% exp() %>% round(2)
coefsexp # the odds

##        (Intercept)                age educationsecondary 
##               0.16               1.00               1.24 
##  educationtertiary        employedyes     jobblue-collar 
##               1.58               1.11               0.70 
##    jobentrepreneur       jobhousemaid      jobmanagement 
##               0.64               0.72               0.77 
##           jobother         jobretired   jobself-employed 
##               0.64               1.39               0.67 
##        jobservices         jobstudent      jobtechnician 
##               0.75               1.62               0.86 
##      jobunemployed     maritalmarried      maritalsingle 
##                 NA               0.82               1.10 
##         defaultyes            balance         housingyes 
##               1.07               1.00               0.48 
##            loanyes               day2               day3 
##               0.66               0.62               0.71 
##               day4               day5               day6 
##               0.82               0.56               0.53 
##               day7               day8               day9 
##               0.58               0.73               0.62 
##              day10              day11              day12 
##               1.21               0.74               0.94 
##              day13              day14              day15 
##               1.12               0.80               0.94 
##              day16              day17              day18 
##               0.60               0.36               0.65 
##              day19              day20              day21 
##               0.29               0.40               0.54 
##              day22              day23              day24 
##               0.80               0.94               0.60 
##              day25              day26              day27 
##               1.08               0.79               0.97 
##              day28              day29              day30 
##               0.63               0.47               0.91 
##              day31           monthaug           monthdec 
##               1.04               0.42               1.79 
##           monthfeb           monthjan           monthjul 
##               0.74               0.29               0.35 
##           monthjun           monthmar           monthmay 
##               0.54               4.68               0.31 
##           monthnov           monthoct           monthsep 
##               0.55               2.03               2.08 
##           duration           campaign              pdays 
##               1.30               0.92               1.00 
##           previous    poutcomesuccess    poutcomeunknown 
##               1.03               8.69               0.74

What now do we see?

Positive relation:

tertiary education;
duration of last contact;
successful outcome in previous campaign;
months of the last contact: october, september and march;

Negative relation:

blue-collar job;
having housing or personal loan;
campaign - number of contacts performed during this campaign;
months of the last contact: january, november, may and all summer months.

Here I don’t what to spread and tell about each predictors’ odds, but I really want to highlight the huge positive relation of outcome of previous campaign. By looking at the odds we may claim: positive response in previous campaign increases the odds of giving positive response in present one by a factor of 9.69 which is 869% compared to someone who gave negative response in previous campaign.

Also, I want to mention, that the results, we see here, support lot’s of our hypothesis in the previous exploralory part :) Let’s now look at how our model is accurate and reliable.

#library(MASS)
#logitModelNew <- stepAIC(logitModelFull, trace = 0)
#summary(logitModelNew)
#coefsexp <- coef(logitModelNew) %>% exp() %>% round(2)
#coefsexp 
# all this chunk gave me pretty the same results, so I decided to skip it :)

LogRegR2(logitModelFull)

## Chi2                 7992.937 
## Df                   67 
## Sig.                 0 
## Cox and Snell Index  0.2170151 
## Nagelkerke Index     0.4277719 
## McFadden's R2        0.3455954

predNew <- predict(logitModelFull, type = "response", na.action = na.exclude, data = ds.test)
#ds.test %>% dplyr::select(response_binary, predNew) %>% tail()

Indeces here are not very high, but that’s still okay.

ds.test$pred = predict(logitModelFull, ds.test, type="response")
confMatrix <- confusion.matrix(ds.test$response_binary, 
                    ds.test$pred, threshold = 0.5)
confMatrix

##     obs
## pred    0   1
##    0 7065 608
##    1  176 320
## attr(,"class")
## [1] "confusion.matrix"

accuracy <- sum(diag(confMatrix)) / sum(confMatrix)
accuracy

## [1] 0.9040274

As for the confusion matrix we see that 7056 of people were correctly predicted as giving negative response to the present campaign and 319 people were correctly predited to give positove response to the present campaign. There’s also many people were incorrectly presicted as giving negative responses (609), but it can be explained by the fact, that there’s a big prevalege of negative responses at all in the whole sample. And as for the accuracy we have 90% of it, which is really nice.

References

data:

helping sources and useful forums:

https://www.rdocumentation.org/
https://stackoverflow.com/
https://www.r-graph-gallery.com/
https://www.r-bloggers.com/
http://www.sthda.com/
http://web.mnstate.edu/
http://rpubs.com/Nuta/ I’ve also used some part of code from my previous projects , so yes, I refer to my page on rpubs, too :) And of course a great thank you to my teammate Anna Moroz, who supported and helped me in difficult moments of doing the project!

And our full report can be found here: http://rpubs.com/gingerball/final

part 1 - explorative data analysis

Anna Podolskaya

15 12 2019