Glad to meet you in the first, exploratory part of the report prepared for the bank. My name is Anna, I'm a third-year student and data analyst. In this part I'm going to explore the data as thoroughly as possible, look at some possible relations and provide some ideas for further analysis.
I would also like to add an extra column that tells us whether a client is employed or not. Retired clients, students and housemaids count as unemployed in this case.
#unique(ds$job) to see which levels this factor has
ds$employed <- NA
ds$employed <- ifelse(ds$job %in% c("unemployed", "retired", "housemaid", "student"), "no", "yes")
ds$employed <- as.factor(ds$employed)
#str(ds$employed)
#str(ds) juuust to check everything up :)
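Just to be safe, here is a quick sanity check of the recode (my own addition, assuming ds is already loaded as in the rest of the report):
# cross-tabulate the new flag against job: only retired, student, housemaid
# and unemployed should fall into the "no" column
table(ds$job, ds$employed)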
First, let's have a quick overview of our data and the variables we have. You may skip it if you'd like, since I will recap everything in each part, but an extra look at everything at once might be helpful :) We are using the 'bank_cleaned' data. Moreover, I've added a new variable, employed, which tells us whether a client is employed or not; retired clients, students and housemaids are treated as unemployed.
Socio-demographic data:
Client's history with this bank: (all three variables are categorical, with yes/no answers)
Last contact information of the current campaign:
Information about the number of contacts and days passed since the last contact: (all three variables are numeric)
And of course campaign outcomes:
First, let's have a look at our clients in general. Below you can see tables showing how many clients fall into each socio-demographic category. I've included both the number of observations and the percentage distribution, sorted in descending order. The socio-demographic characteristics are:
First, I'll show everything in graphs and tables and then provide a short summary.
ds %>% group_by(marital) %>% summarize(count=n()) %>% arrange(desc(count)) %>%
mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()
ds %>% group_by(education) %>% summarize(count=n()) %>% arrange(desc(count)) %>%
mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()
ds %>% group_by(employed) %>% summarize(count=n()) %>% arrange(desc(count)) %>%
mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()
ds %>% group_by(job) %>% summarize(count=n()) %>% arrange(desc(count)) %>%
mutate(perc=paste0(round(count/sum(count)*100, 2), "%")) %>% as.data.frame()
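A side note on the code: the same count-and-percentage pipeline is repeated four times above, so it could be wrapped into a small helper. The sketch below is only a suggestion (the function name freq_table is mine, and it assumes dplyr is attached, as elsewhere in this report):
library(dplyr)
# reusable frequency table: counts in descending order plus a percentage column
freq_table <- function(data, var) {
  data %>%
    count({{ var }}, sort = TRUE) %>%
    mutate(perc = paste0(round(n / sum(n) * 100, 2), "%")) %>%
    as.data.frame()
}
# e.g. freq_table(ds, marital) reproduces the first table above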
# I've also created some barplots, but didn't include them here: fairly similar barplots appear a bit later, and the tables already show the information clearly :)
#ggplot(data = ds, aes(x = factor(marital),
# y = prop.table(stat(count)),
# label = scales::percent(prop.table(stat(count))))) +
# geom_bar() +
# geom_text(stat = 'count',
# position = position_dodge(.9),
# vjust = 1.2,
# size = 4,
# col = "white") +
# scale_y_continuous(labels = scales::percent) +
# labs(x = 'marital status', y = 'percentage')
#ggplot(data = ds, aes(x = factor(education),
# y = prop.table(stat(count)),
# label = scales::percent(prop.table(stat(count))))) +
# geom_bar() +
# geom_text(stat = 'count',
# position = position_dodge(.9),
# vjust = 1.2,
# size = 4,
# col = "white") +
# scale_y_continuous(labels = scales::percent) +
# labs(x = 'education', y = 'percentage')
#ggplot(data = ds, aes(x = factor(employed),
# y = prop.table(stat(count)),
# label = scales::percent(prop.table(stat(count))))) +
# geom_bar() +
# geom_text(stat = 'count',
# position = position_dodge(.9),
# vjust = 1.2,
# size = 4,
# col = "white") +
# scale_y_continuous(labels = scales::percent) +
# labs(x = 'employed', y = 'percentage')
library(forcats)
#ggplot(data = ds, aes(x = fct_infreq(job),
# y = prop.table(stat(count)),
# label = scales::percent(prop.table(stat(count))))) +
# geom_bar() +
# geom_text(stat = 'count',
# position = position_dodge(.9),
# vjust = 1.2,
# size = 4,
# col = "white") +
# scale_y_continuous(labels = scales::percent) +
# labs(x = 'job', y = 'percentage')
# age
hist(ds$age,
xlab = "Client's age",
main = "Age distribution")
abline(v=c(mean(ds$age), median(ds$age)),
col=c("blue", "red"),
lwd = c(2, 2))
legend(x = "topright",
c("Mean", "Median"),
col = c("blue", "red"),
lwd = c(2, 2))
summary(ds$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 39.00 40.79 48.00 95.00
# balance level
hist(ds$balance, breaks = 50,
xlab = "Client's balance",
main = "Clients' balance distribution")
abline(v=c(mean(ds$balance), median(ds$balance)),
col=c("blue", "red"),
lwd = c(2, 2))
legend(x = "topright",
c("Mean", "Median"),
col = c("blue", "red"),
lwd = c(2, 2))
summary(ds$balance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6847 64 421 1074 1333 10443
What do we know so far?
Marital status: Almost 2/3 of our sample are married people. The rest is divided between singles (almost 30%) and divorced clients (12%);
Education: More than half of the clients (53.7%) have secondary education, 30% have higher (tertiary) education and 16%, by contrast, have only primary education.
Job position: First of all, 88% of our clients are employed. As for the details, the most widespread job positions among our clients are blue-collar (22%), management (21%) and technician (17%). Then come administrative positions (12%) and services (9%). Each of the other positions accounts for less than 10% of observations and can be seen in the table above.
Age: In general, the age of our clients varies from 18 to 95. The mean age is about 41, and the median, at 39, is close to the mean. From the graph we can see a first rise from around 25 and a peak around 30-35, followed by a decline. All in all, I would say that the larger part of our observations falls between 25 and 45. As for the descriptive statistics, the 1st quartile is 33 and the 3rd quartile is 48, which means that the ages of 50% of our clients lie between 33 and 48.
Balance: As for the clients' balance, the graph shows that the most frequent balances lie roughly between -500 and 1000-1500 (every bar (break) on the graph is equal to 500). The descriptive statistics give 64 as the 1st quartile and 1333 as the 3rd one, which means that 50% of our observations lie between 64 and 1333. The average client balance is about 1000 and the median is 421. The minimum and maximum values can also be seen in the table (-6847 and 10443).
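For the curious reader, the quartile claims above can be reproduced directly; this is just a small verification of my own, nothing new analytically:
# middle 50% of clients by age and by balance
quantile(ds$age, probs = c(0.25, 0.75))
quantile(ds$balance, probs = c(0.25, 0.75))
IQR(ds$balance)  # width of that middle range for balance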
Now I'd like to look through the current situation in the bank's database. We have information about the following variables:
Let's have a look at the distribution of observations for each variable. I've manually ordered the three bar graphs so that the share of positive answers is ascending :)
gdef <- ggplot(data = ds, aes(x = default, fill = default,
y = prop.table(stat(count)))) +
geom_bar(aes(y = prop.table(..count..)),
position = "dodge") +
geom_text(aes(y = prop.table(..count..),
label = paste0(round(prop.table(..count..) * 100, digits = 2), '%')),
stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) +
scale_y_continuous(labels = scales::percent) +
labs(y="percentage", x = "", title="Does the client\nhave credit in default\nor not?") +
theme_bw() +
theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
ghous <- ggplot(data = ds, aes(x = housing, fill = housing,
y = prop.table(stat(count)))) +
geom_bar(aes(y = prop.table(..count..)),
position = "dodge") +
geom_text(aes(y = prop.table(..count..),
label = paste0(round(prop.table(..count..) * 100, digits = 2), '%')),
stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) +
scale_y_continuous(labels = scales::percent) +
labs(y="", x = "", title="Does the client\nhave housing loan?") +
theme_bw() +
theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
gloan <- ggplot(data = ds, aes(x = loan, fill = loan,
y = prop.table(stat(count)))) +
geom_bar(aes(y = prop.table(..count..)),
position = "dodge") +
geom_text(aes(y = prop.table(..count..),
label = paste0(round(prop.table(..count..) * 100, digits = 2), '%')),
stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) +
scale_y_continuous(labels = scales::percent) +
labs(y="", x = "", title="Does the client\nhave personal loan?") +
theme_bw() +
theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
grid.arrange(gdef, gloan, ghous, ncol=3)
Well, here we see that:
Now, let's look at the results of the previous marketing campaign and the current one (whether the client subscribed to a term deposit or not).
gpoutcome <- ggplot(data = ds, aes(x = poutcome, fill = poutcome,
y = prop.table(stat(count)))) +
geom_bar(aes(y = prop.table(..count..)),
position = "dodge") +
geom_text(aes(y = prop.table(..count..),
label = paste0(round(prop.table(..count..) * 100, digits = 2), '%')),
stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) +
scale_y_continuous(labels = scales::percent) +
labs(y="", x = "", title="Did the client give a positive\nresponse to the previous campaign?") +
theme_bw() +
theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
scale_fill_manual(values=c("tomato2", "lightseagreen", "grey"))
gresponse <- ggplot(data = ds, aes(x = response, fill = response,
y = prop.table(stat(count)))) +
geom_bar(aes(y = prop.table(..count..)),
position = "dodge") +
geom_text(aes(y = prop.table(..count..),
label = paste0(round(prop.table(..count..) * 100, digits = 2), '%')),
stat = 'count', position=position_dodge(width=0.9), vjust=-0.25, size = 4) +
scale_y_continuous(labels = scales::percent) +
labs(y="", x = "", title="Has the client subscribed\nto a term deposit?") +
theme_bw() +
theme(axis.text = element_text(size=13.5), plot.title = element_text(size=12), legend.position = "none") +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
grid.arrange(gpoutcome, gresponse, ncol=2)
Here we see a problem: a huge part of the previous campaign's outcomes is unknown. But we do know that 3% of all cases were successful and 11% were not (which is almost 4 times more). As for the present marketing campaign, we see that 11% of our clients gave a positive response and subscribed to a term deposit. Not the best result, BUT it's not as bad as it could be: compared with the share of clients who have credit in default, or with the share of known successful outcomes from the previous campaign, 11% of positive responses is not that bad.
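For reference, the shares mentioned above can be checked with two quick proportion tables (an extra look, not a new analysis):
# percentage share of each previous-campaign outcome and of the current response
round(prop.table(table(ds$poutcome)) * 100, 1)
round(prop.table(table(ds$response)) * 100, 1)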
Okay. Now that we know who our clients are and what the results of our marketing campaign look like, let's look at the factors which could somehow relate to the responses to the present campaign (yes, our predictors). For that I will first simply draw a series of graphs. Each one will show the percentage split between the clients who subscribed to a term deposit and those who did not.
Let’s go!
set_theme(base = theme_classic())
# marital status
sjp.xtab(ds$marital, ds$response,
margin = "row",
bar.pos = "stack",
title = "Relation with marital status",
axis.titles = "marital status",
legend.title = "subscribed a term deposit",
show.summary = TRUE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
# chi-sq test
t_marital <- with(ds, table(marital, response))
chi_mar <- chisq.test(t_marital)
chi_mar$stdres
## response
## marital no yes
## divorced -0.5671748 0.5671748
## married 12.3011747 -12.3011747
## single -12.9948418 12.9948418
On the graph above we may notice that single people give positive responses more often than married ones. The test shows that there is a relation between a client's marital status and whether they subscribed to a term deposit: single clients tend to give positive responses, while married ones, in contrast, tend to give negative ones.
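The chi-squared test only tells us that an association exists; to get a feel for its strength one could additionally compute Cramér's V. This is a supplementary sketch of my own (the object names come from the chunk above):
# Cramér's V = sqrt(chi^2 / (n * (min(rows, cols) - 1)));
# values near 0 mean a weak association, values near 1 a strong one
n <- sum(t_marital)
sqrt(unname(chi_mar$statistic) / (n * (min(dim(t_marital)) - 1)))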
# job and education
sjp.xtab(ds$education, ds$response,
margin = "row",
bar.pos = "stack",
title = "Relation with client's education",
axis.titles = "education level",
legend.title = "subscribed a term deposit",
show.summary = TRUE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
t_edu <- with(ds, table(education, response))
chi_edu <- chisq.test(t_edu)
chi_edu
##
## Pearson's Chi-squared test
##
## data: t_edu
## X-squared = 210.72, df = 2, p-value < 2.2e-16
chi_edu$stdres
## response
## education no yes
## primary 8.276654 -8.276654
## secondary 6.639639 -6.639639
## tertiary -13.802452 13.802452
From the graph we see a slight percentage increase with increasing education level. As for the test, there is indeed a relation: clients with higher (tertiary) education tend to give positive responses, while those with primary or secondary education tend to give negative responses to the marketing campaign.
sjp.xtab(ds$employed, ds$response,
margin = "row",
bar.pos = "stack",
title = "Relation with client's employment",
axis.titles = "Is the client employed or not?",
legend.title = "subscribed a term deposit",
show.summary = FALSE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
sjp.xtab(ds$job, ds$response,
margin = "row",
bar.pos = "stack",
title = "Relation with client's job position",
axis.titles = "job position",
legend.title = "subscribed a term deposit",
show.summary = FALSE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
t_emp <- with(ds, table(employed, response))
chi_emp <- chisq.test(t_emp)
chi_emp
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t_emp
## X-squared = 298.98, df = 1, p-value < 2.2e-16
chi_emp$stdres
## response
## employed no yes
## no -17.31465 17.31465
## yes 17.31465 -17.31465
t_job <- with(ds, table(job, response))
chi_job <- chisq.test(t_job)
chi_job
##
## Pearson's Chi-squared test
##
## data: t_job
## X-squared = 708.67, df = 11, p-value < 2.2e-16
chi_job$stdres
## response
## job no yes
## admin. -1.3956813 1.3956813
## blue-collar 13.7704109 -13.7704109
## entrepreneur 3.7823118 -3.7823118
## housemaid 2.7831881 -2.7831881
## management -5.8273865 5.8273865
## other 0.7402356 -0.7402356
## retired -15.7187125 15.7187125
## self-employed 0.1609809 -0.1609809
## services 5.3538543 -5.3538543
## student -15.1044177 15.1044177
## technician 1.4887414 -1.4887414
## unemployed -4.8092993 4.8092993
As for employment and job position, I would say there is a relation: unemployed clients tend to give positive responses to the marketing campaign, while employed clients, in contrast, tend to give negative ones. Here I would like to remind you that the 'unemployed' group includes students and retired clients. This relation is also supported by the test. Speaking about employed clients, a clear positive relation can be found only for the management position (the administrative position leans positive, but not significantly); the other positions show either no relation or a negative one.
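To make this easier to see at a glance, here is how the raw subscription rate per job could be listed (a purely descriptive companion to the test above, assuming response is coded "yes"/"no" as in the tables):
# percentage of positive responses within each job category, sorted
ds %>%
  group_by(job) %>%
  summarize(positive_rate = round(mean(response == "yes") * 100, 1)) %>%
  arrange(desc(positive_rate)) %>%
  as.data.frame()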
# age and balance
g_age <- ggplot(ds, aes(x = response, y = age, fill = response)) +
geom_boxplot() +
labs(y="age", x = "",
title="Client's age") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size=13)) +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
g_bal <- ggplot(ds, aes(x = response, y = balance, fill = response)) +
geom_boxplot() +
labs(y="balance", x = "",
title="Client's balance") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size=13)) +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
grid.arrange(g_age, g_bal, ncol=2, top = textGrob("Has the client subscribed a term deposit?", gp=gpar(fontsize=17)))
t.test(ds$age ~ ds$response)
##
## Welch Two Sample t-test
##
## data: ds$age by ds$response
## t = -3.7897, df = 5339.8, p-value = 0.0001525
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1623314 -0.3697696
## sample estimates:
## mean in group no mean in group yes
## 40.70366 41.46971
t.test(ds$balance ~ ds$response)
##
## Welch Two Sample t-test
##
## data: ds$balance by ds$response
## t = -13.121, df = 5674.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -430.8670 -318.8538
## sample estimates:
## mean in group no mean in group yes
## 1031.403 1406.263
Here I don't see much difference in the graphs: the means look relatively similar and there are no big differences in the positions of the boxes. The t-tests, in contrast, tell us that the difference in means is statistically significant (the p-value is small in both cases, for age and for balance). Not visually obvious, but those are the test results: clients who have subscribed to a term deposit tend to be older and to have a higher balance on their accounts.
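The apparent contradiction (the boxplots look similar, yet the t-tests are significant) is largely a sample-size effect: with thousands of clients, even a tiny difference in means becomes statistically significant. A standardized effect size makes this explicit. The sketch below is my own addition (the helper name cohens_d is hypothetical):
# Cohen's d: mean difference divided by the pooled standard deviation
cohens_d <- function(x, group) {
  m  <- tapply(x, group, mean)
  s  <- tapply(x, group, sd)
  n  <- tapply(x, group, length)
  sp <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (sum(n) - 2))  # pooled SD
  unname(diff(m) / sp)
}
cohens_d(ds$age, ds$response)
cohens_d(ds$balance, ds$response)
Values below roughly 0.2 are conventionally considered small, so a significant but small d would fit the visual impression.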
I'd also like to see whether there is any relation between a client having other loans or credit in default and their response to the marketing campaign.
#default
sjp.xtab(ds$default, ds$response,
margin = "row",
bar.pos = "stack",
axis.titles = "Does the client have credit in default?",
legend.title = "subscribed a term deposit",
show.summary = TRUE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
# housing
sjp.xtab(ds$housing, ds$response,
margin = "row",
bar.pos = "stack",
axis.titles = "Does the client have a housing loan?",
legend.title = "subscribed a term deposit",
show.summary = TRUE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
# loan
sjp.xtab(ds$loan, ds$response,
margin = "row",
bar.pos = "stack",
axis.titles = "Does the client have a personal loan?",
legend.title = "subscribed a term deposit",
show.summary = TRUE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
t_def <- with(ds, table(default, response))
chi_def <- chisq.test(t_def)
# the results of the test were insignificant, so I didn't include them in the description of the results
# chi_def
# chi_def$stdres
t_hos <- with(ds, table(housing, response))
chi_hos <- chisq.test(t_hos)
chi_hos
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t_hos
## X-squared = 720.3, df = 1, p-value < 2.2e-16
chi_hos$stdres
## response
## housing no yes
## no -26.85413 26.85413
## yes 26.85413 -26.85413
t_loan <- with(ds, table(loan, response))
chi_loan <- chisq.test(t_loan)
chi_loan
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t_loan
## X-squared = 176.98, df = 1, p-value < 2.2e-16
chi_loan$stdres
## response
## loan no yes
## no -13.3242 13.3242
## yes 13.3242 -13.3242
Well, here we see that if a client already has a loan or credit, they will probably decline to subscribe to a term deposit, which is quite logical. This pattern is noticeable in the graphs and is supported by the tests (for the housing and personal loans, at least).
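One simple way to put a number on this for the personal loan is the sample odds ratio from the 2x2 table built above (again, only an extra illustration of mine):
# odds of subscribing without a personal loan vs. with one
odds_no_loan   <- t_loan["no", "yes"]  / t_loan["no", "no"]
odds_with_loan <- t_loan["yes", "yes"] / t_loan["yes", "no"]
odds_no_loan / odds_with_loan  # > 1 means loan-free clients subscribe more readily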
Separately, I want to check whether there is any relation between the results of the previous campaign and the present one:
sjp.xtab(ds$poutcome, ds$response,
margin = "row",
bar.pos = "stack",
axis.titles = "The outcome of the previous marketing campaign",
legend.title = "subscribed a term deposit",
show.summary = TRUE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
t_pc <- with(ds, table(poutcome, response))
chi_pc <- chisq.test(t_pc)
chi_pc
##
## Pearson's Chi-squared test
##
## data: t_pc
## X-squared = 4072.1, df = 2, p-value < 2.2e-16
chi_pc$stdres
## response
## poutcome no yes
## failure -2.506702 2.506702
## success -63.450609 63.450609
## unknown 34.664544 -34.664544
Wow, here we can clearly see the prevalence of positive responses among the successful cases of the previous campaign. The same relation is supported by the test. That's what I was expecting: if a client agreed to something in one marketing campaign, there should be a good chance that they will repeat this success in the next campaign. Maybe these are loyal clients or something of that kind.
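The conditional rates behind this graph can also be read off directly from the table used for the test (a quick numeric check, nothing more):
# percentage of current responses within each previous-campaign outcome
round(prop.table(t_pc, margin = 1) * 100, 1)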
Lastly in the exploratory part, I'd like to look at the relation of the client's response with the variables: 1. related to the last contact of the current campaign; 2. related to the number of contacts and days passed since the last contact.
Variables here are:
( * - the asterisk mark on the y-axis means that the scale is converted into log10 for normalization and better visualization)
#month
sjp.xtab(ds$month, ds$response,
margin = "row",
bar.pos = "stack",
axis.titles = "The last contact month",
legend.title = "subscribed a term deposit",
show.summary = FALSE,
show.n = FALSE,
show.prc = TRUE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
#day
sjp.xtab(ds$day, ds$response,
margin = "row",
bar.pos = "stack",
expand.grid = TRUE,
axis.titles = "The last contact day",
legend.title = "subscribed a term deposit",
show.summary = FALSE,
show.n = FALSE,
show.prc = FALSE,
show.total = FALSE,
coord.flip = TRUE,
rev.order = FALSE)
#duration
ggplot(ds, aes(x = response, y = duration, fill = response)) +
geom_boxplot() +
scale_y_log10() +
labs(y="Duration, in sec *", x = "",
title="Last contact duration") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size=13)) +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
t_day <- with(ds, table(day, response))
chi_day <- chisq.test(t_day)
chi_day
##
## Pearson's Chi-squared test
##
## data: t_day
## X-squared = 532.4, df = 30, p-value < 2.2e-16
chi_day$stdres
## response
## day no yes
## 1 -9.0695600 9.0695600
## 2 -2.5452954 2.5452954
## 3 -5.6432597 5.6432597
## 4 -5.4083247 5.4083247
## 5 -0.5332721 0.5332721
## 6 3.0905455 -3.0905455
## 7 3.7712217 -3.7712217
## 8 1.4063715 -1.4063715
## 9 0.2252072 -0.2252072
## 10 -6.8231465 6.8231465
## 11 -1.0951643 1.0951643
## 12 -4.4584670 4.4584670
## 13 -5.0311946 5.0311946
## 14 1.0347888 -1.0347888
## 15 -2.6758554 2.6758554
## 16 -2.0544013 2.0544013
## 17 3.0887191 -3.0887191
## 18 2.1368954 -2.1368954
## 19 6.3524519 -6.3524519
## 20 6.8916032 -6.8916032
## 21 2.3227162 -2.3227162
## 22 -3.7488284 3.7488284
## 23 -1.0299458 1.0299458
## 24 -1.6117461 1.6117461
## 25 -3.1818594 3.1818594
## 26 1.5484329 -1.5484329
## 27 -0.5925772 0.5925772
## 28 6.0475078 -6.0475078
## 29 5.6696739 -5.6696739
## 30 -7.0864269 7.0864269
## 31 3.2913651 -3.2913651
t_mon <- with(ds, table(month, response))
chi_mon <- chisq.test(t_mon)
chi_mon
##
## Pearson's Chi-squared test
##
## data: t_mon
## X-squared = 2864.5, df = 11, p-value < 2.2e-16
chi_mon$stdres
## response
## month no yes
## apr -13.506027 13.506027
## aug 2.379227 -2.379227
## dec -14.250612 14.250612
## feb -9.110778 9.110778
## jan 1.150512 -1.150512
## jul 8.029198 -8.029198
## jun 2.951178 -2.951178
## mar -27.438654 27.438654
## may 20.080571 -20.080571
## nov 2.323969 -2.323969
## oct -26.240709 26.240709
## sep -25.364717 25.364717
t.test(ds$duration ~ ds$response)
##
## Welch Two Sample t-test
##
## data: ds$duration by ds$response
## t = -54.686, df = 4961.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.632025 -5.242192
## sample estimates:
## mean in group no mean in group yes
## 3.691365 9.128474
From the graph with months we can clearly see the prevalence of positive responses among the clients who were last contacted in September, October, March and December. The results of the statistical test support this positive relation, adding April and February to the set. The other months, in contrast, show a negative relation.
As for the day, we can see an increase in positive responses among people who were last contacted at the beginning of the month (the 1st, 3rd, 4th and 10th) and a noticeable increase on the 30th. Where the graph is not clear enough, we can look at the test results, which do show some significant relations: contacting a client at the beginning of the month is positively related to a positive response to the marketing campaign, and the 30th is also positively related to subscribing to the term deposit.
There is a difference in the duration of the last contact, which can be clearly seen from the graph: among positive responses to the marketing campaign we see longer contact durations. Maybe that is more or less obvious and self-explanatory (a person can give a fast negative answer, but a positive answer takes more time... though that's only my hypothesis). The significant difference between the means of these two groups (positive and negative responses) is also supported by the test.
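Since the y-axis of the duration plot is on a log scale, it may also help to look at the raw medians; this is a small addition of my own:
# median last-contact duration, in the original units of the dataset, by response
aggregate(duration ~ response, data = ds, FUN = median)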
Variables here are:
#Number of contacts performed BEFORE this campaign and for this client
g_prev <- ggplot(ds, aes(x = response, y = previous, fill = response)) +
geom_boxplot() +
scale_y_log10() +
labs(y="Number of contacts *", x = "",
title="Number of contacts\nperformed BEFORE\nthis campaign") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size=13)) +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
#Number of contacts performed DURING this campaign and for this client
g_camp <- ggplot(ds, aes(x = response, y = campaign, fill = response)) +
geom_boxplot() +
scale_y_log10() +
labs(y="Number of contacts *", x = "",
title="Number of contacts\nperformed DURING\nthis campaign") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size=13)) +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
#number of days that passed by after the past contact
g_pdays <- ggplot(ds, aes(x = response, y = pdays, fill = response)) +
geom_boxplot() +
scale_y_log10() +
labs(y="Number of days *", x = "",
title="Number of days that\nPASSED by after the\nlast contact from a\nPREVIOUS campaign") +
theme_bw() +
theme(legend.position = "none", axis.text = element_text(size=13)) +
scale_fill_manual(values=c("tomato2", "lightseagreen"))
grid.arrange(g_prev, g_camp, g_pdays, ncol=3, top = textGrob("Client's response and number of contacts with them", gp=gpar(fontsize=17)))
t.test(ds$previous ~ ds$response)
##
## Welch Two Sample t-test
##
## data: ds$previous by ds$response
## t = -19.049, df = 5164.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7010387 -0.5702062
## sample estimates:
## mean in group no mean in group yes
## 0.3645931 1.0002156
t.test(ds$campaign ~ ds$response)
##
## Welch Two Sample t-test
##
## data: ds$campaign by ds$response
## t = 21.382, df = 8324.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.6441811 0.7742162
## sample estimates:
## mean in group no mean in group yes
## 2.854704 2.145505
t.test(ds$pdays ~ ds$response)
##
## Welch Two Sample t-test
##
## data: ds$pdays by ds$response
## t = -18.061, df = 5365.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34.41140 -27.67271
## sample estimates:
## mean in group no mean in group yes
## 28.72234 59.76439
As for the number of contacts with clients, we have fairly similar results in both cases, before and during the campaign: there is no big difference visible in the graphs (although the t-tests say the difference in means exists and is significant). It is more interesting to look at the third graph, with the number of days passed since the last contact of the previous campaign. The difference between the groups can be seen clearly from the positions of the boxes, and the test confirms a significant difference between the means of the two groups (those who gave a positive or a negative response to the marketing campaign). Interestingly, clients who subscribed have, on average, more days recorded since that last contact (about 60 versus 29), which most likely reflects that a large share of the non-subscribers had no previous contact at all. Either way, the contact history clearly matters: if we never reach out to our clients, they may forget about our service and even churn. But, again, that's only a hypothesis :)
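Because these count variables are heavily skewed (hence the log axis on the plots), a rank-based test could be a sensible robustness check alongside the t-tests. This is an optional extra, not something the report relies on:
# Wilcoxon rank-sum tests as a distribution-free complement to the t-tests above
wilcox.test(previous ~ response, data = ds)
wilcox.test(campaign ~ response, data = ds)
wilcox.test(pdays ~ response, data = ds)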
Well, now we know a lot. First of all, we have built up knowledge about our clients from a socio-demographic point of view. For the next steps I recommend taking a closer look at how age and job relate to clients' responses. I also suggest paying attention to the actions the bank itself can control: the number of contacts, their duration, and the day or month when the contact is made can all be really influential. I also strongly recommend checking the influence of the previous campaign. If this relation is supported by further analysis (with predictive modelling, I mean), that would be a good sign for the company to look at the approaches and hints that were used that time. And maybe create a loyalty program for the customers who give positive responses to marketing campaigns! Anyway, we will know this later.
Out of personal interest I decided to try logistic regression. Later, my colleagues will pick this up and continue the story with other methods (Decision tree and Bayesian network models in particular).
Our outcome is response: we are interested in which factors can predict whether a client will give a positive or a negative response to our marketing campaign. First, we split the sample into two parts, train and test, in an 80/20 proportion. Then we fit the model, using all the variables as predictors.
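One small detail I would add before the split (a suggestion only, not part of the original code): fixing the random seed, so that the partition, and therefore all the numbers below, stay the same when the report is re-knitted.
# any fixed value works; the point is a reproducible 80/20 split
set.seed(42)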
ind = createDataPartition(ds$response, p = 0.20, list = F)
ds.test = ds[ind,]
ds.train = ds[-ind,]
logitModelFull <- glm(response_binary ~ age + education + employed + job + marital + default + balance + housing + loan + day + month + duration + campaign + pdays + previous + poutcome, family = binomial, data = ds.train)
summary(logitModelFull) # the model
##
## Call:
## glm(formula = response_binary ~ age + education + employed +
## job + marital + default + balance + housing + loan + day +
## month + duration + campaign + pdays + previous + poutcome,
## family = binomial, data = ds.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1534 -0.3604 -0.2446 -0.1679 3.0901
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.806e+00 3.034e-01 -5.952 2.64e-09 ***
## age -3.159e-03 2.637e-03 -1.198 0.23102
## educationsecondary 2.189e-01 7.571e-02 2.891 0.00384 **
## educationtertiary 4.575e-01 8.889e-02 5.147 2.65e-07 ***
## employedyes 1.007e-01 1.289e-01 0.781 0.43463
## jobblue-collar -3.514e-01 8.656e-02 -4.060 4.92e-05 ***
## jobentrepreneur -4.485e-01 1.513e-01 -2.965 0.00303 **
## jobhousemaid -3.254e-01 1.810e-01 -1.798 0.07215 .
## jobmanagement -2.659e-01 8.776e-02 -3.030 0.00244 **
## jobother -4.483e-01 3.716e-01 -1.206 0.22766
## jobretired 3.264e-01 1.473e-01 2.216 0.02667 *
## jobself-employed -4.063e-01 1.371e-01 -2.964 0.00303 **
## jobservices -2.830e-01 1.001e-01 -2.828 0.00469 **
## jobstudent 4.803e-01 1.687e-01 2.847 0.00441 **
## jobtechnician -1.548e-01 8.130e-02 -1.904 0.05690 .
## jobunemployed NA NA NA NA
## maritalmarried -1.964e-01 7.028e-02 -2.794 0.00520 **
## maritalsingle 9.197e-02 7.986e-02 1.152 0.24947
## defaultyes 6.329e-02 1.846e-01 0.343 0.73170
## balance 3.562e-05 1.201e-05 2.967 0.00301 **
## housingyes -7.288e-01 5.251e-02 -13.878 < 2e-16 ***
## loanyes -4.180e-01 7.028e-02 -5.947 2.73e-09 ***
## day2 -4.820e-01 2.243e-01 -2.149 0.03165 *
## day3 -3.487e-01 2.247e-01 -1.552 0.12076
## day4 -2.039e-01 2.172e-01 -0.939 0.34772
## day5 -5.718e-01 2.186e-01 -2.616 0.00890 **
## day6 -6.295e-01 2.218e-01 -2.838 0.00454 **
## day7 -5.519e-01 2.229e-01 -2.476 0.01330 *
## day8 -3.163e-01 2.192e-01 -1.443 0.14894
## day9 -4.826e-01 2.218e-01 -2.176 0.02956 *
## day10 1.917e-01 2.467e-01 0.777 0.43697
## day11 -3.031e-01 2.216e-01 -1.368 0.17143
## day12 -6.016e-02 2.163e-01 -0.278 0.78093
## day13 1.151e-01 2.181e-01 0.528 0.59761
## day14 -2.194e-01 2.212e-01 -0.992 0.32144
## day15 -6.030e-02 2.185e-01 -0.276 0.78259
## day16 -5.148e-01 2.215e-01 -2.324 0.02013 *
## day17 -1.019e+00 2.214e-01 -4.601 4.21e-06 ***
## day18 -4.242e-01 2.171e-01 -1.954 0.05066 .
## day19 -1.247e+00 2.381e-01 -5.237 1.63e-07 ***
## day20 -9.226e-01 2.187e-01 -4.218 2.47e-05 ***
## day21 -6.210e-01 2.239e-01 -2.774 0.00554 **
## day22 -2.191e-01 2.339e-01 -0.937 0.34900
## day23 -5.813e-02 2.372e-01 -0.245 0.80638
## day24 -5.145e-01 2.840e-01 -1.812 0.07003 .
## day25 7.964e-02 2.362e-01 0.337 0.73594
## day26 -2.395e-01 2.389e-01 -1.003 0.31604
## day27 -2.761e-02 2.340e-01 -0.118 0.90610
## day28 -4.559e-01 2.337e-01 -1.951 0.05101 .
## day29 -7.607e-01 2.369e-01 -3.212 0.00132 **
## day30 -9.446e-02 2.179e-01 -0.433 0.66467
## day31 3.745e-02 2.847e-01 0.132 0.89534
## monthaug -8.687e-01 1.036e-01 -8.382 < 2e-16 ***
## monthdec 5.830e-01 2.202e-01 2.648 0.00810 **
## monthfeb -2.988e-01 1.177e-01 -2.539 0.01112 *
## monthjan -1.236e+00 1.634e-01 -7.565 3.88e-14 ***
## monthjul -1.051e+00 1.020e-01 -10.302 < 2e-16 ***
## monthjun -6.221e-01 1.035e-01 -6.010 1.86e-09 ***
## monthmar 1.543e+00 1.522e-01 10.141 < 2e-16 ***
## monthmay -1.175e+00 9.238e-02 -12.723 < 2e-16 ***
## monthnov -5.896e-01 1.134e-01 -5.201 1.98e-07 ***
## monthoct 7.101e-01 1.363e-01 5.210 1.89e-07 ***
## monthsep 7.322e-01 1.548e-01 4.730 2.24e-06 ***
## duration 2.592e-01 4.623e-03 56.072 < 2e-16 ***
## campaign -8.808e-02 1.215e-02 -7.251 4.14e-13 ***
## pdays -1.690e-05 3.995e-04 -0.042 0.96625
## previous 2.879e-02 1.638e-02 1.757 0.07892 .
## poutcomesuccess 2.162e+00 9.772e-02 22.123 < 2e-16 ***
## poutcomeunknown -3.038e-01 1.262e-01 -2.408 0.01604 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23128 on 32671 degrees of freedom
## Residual deviance: 15135 on 32604 degrees of freedom
## AIC: 15271
##
## Number of Fisher Scoring iterations: 6
coefsexp <- coef(logitModelFull) %>% exp() %>% round(2)
coefsexp # the odds
## (Intercept) age educationsecondary
## 0.16 1.00 1.24
## educationtertiary employedyes jobblue-collar
## 1.58 1.11 0.70
## jobentrepreneur jobhousemaid jobmanagement
## 0.64 0.72 0.77
## jobother jobretired jobself-employed
## 0.64 1.39 0.67
## jobservices jobstudent jobtechnician
## 0.75 1.62 0.86
## jobunemployed maritalmarried maritalsingle
## NA 0.82 1.10
## defaultyes balance housingyes
## 1.07 1.00 0.48
## loanyes day2 day3
## 0.66 0.62 0.71
## day4 day5 day6
## 0.82 0.56 0.53
## day7 day8 day9
## 0.58 0.73 0.62
## day10 day11 day12
## 1.21 0.74 0.94
## day13 day14 day15
## 1.12 0.80 0.94
## day16 day17 day18
## 0.60 0.36 0.65
## day19 day20 day21
## 0.29 0.40 0.54
## day22 day23 day24
## 0.80 0.94 0.60
## day25 day26 day27
## 1.08 0.79 0.97
## day28 day29 day30
## 0.63 0.47 0.91
## day31 monthaug monthdec
## 1.04 0.42 1.79
## monthfeb monthjan monthjul
## 0.74 0.29 0.35
## monthjun monthmar monthmay
## 0.54 4.68 0.31
## monthnov monthoct monthsep
## 0.55 2.03 2.08
## duration campaign pdays
## 1.30 0.92 1.00
## previous poutcomesuccess poutcomeunknown
## 1.03 8.69 0.74
What do we see now?
Positive relation:
Negative relation:
Here I don't want to go through every predictor's odds, but I really want to highlight the huge positive relation with the outcome of the previous campaign. Looking at the odds, we may claim: a positive response in the previous campaign multiplies the odds of a positive response in the present one by a factor of about 8.7, i.e. the odds are almost nine times those of someone who gave a negative response in the previous campaign.
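To be a little more careful about these odds ratios, one could also report Wald confidence intervals for them; this is a supplementary step I sketch here rather than part of the main output:
# 95% Wald confidence intervals for the odds ratios (point estimates are in coefsexp)
or_ci <- exp(confint.default(logitModelFull))
round(or_ci["poutcomesuccess", ], 2)  # interval for the previous-success effect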
I also want to mention that the results we see here support many of our hypotheses from the exploratory part :) Let's now look at how accurate and reliable our model is.
#library(MASS)
#logitModelNew <- stepAIC(logitModelFull, trace = 0)
#summary(logitModelNew)
#coefsexp <- coef(logitModelNew) %>% exp() %>% round(2)
#coefsexp
# this whole chunk gave me pretty much the same results, so I decided to skip it :)
LogRegR2(logitModelFull)
## Chi2 7992.937
## Df 67
## Sig. 0
## Cox and Snell Index 0.2170151
## Nagelkerke Index 0.4277719
## McFadden's R2 0.3455954
predNew <- predict(logitModelFull, type = "response", na.action = na.exclude, newdata = ds.test)
#ds.test %>% dplyr::select(response_binary, predNew) %>% tail()
The indices here are not very high, but that's still okay.
ds.test$pred = predict(logitModelFull, ds.test, type="response")
confMatrix <- confusion.matrix(ds.test$response_binary,
ds.test$pred, threshold = 0.5)
confMatrix
## obs
## pred 0 1
## 0 7065 608
## 1 176 320
## attr(,"class")
## [1] "confusion.matrix"
accuracy <- sum(diag(confMatrix)) / sum(confMatrix)
accuracy
## [1] 0.9040274
As for the confusion matrix, we see that 7065 people were correctly predicted as giving a negative response to the present campaign and 320 people were correctly predicted to give a positive one. There are also quite a few people incorrectly predicted as giving a negative response (608), but this can be explained by the large prevalence of negative responses in the whole sample. As for the accuracy, we get about 90%, which is really nice.
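Because of that imbalance, accuracy alone paints too rosy a picture. As an extra check (computed from the matrix printed above, not part of the report's main numbers), one can look at sensitivity and specificity:
# rows of confMatrix are predictions, columns are observed values (see the printout above)
TN <- confMatrix[1, 1]; FN <- confMatrix[1, 2]   # predicted "no"
FP <- confMatrix[2, 1]; TP <- confMatrix[2, 2]   # predicted "yes"
c(sensitivity = TP / (TP + FN),   # share of actual subscribers the model catches
  specificity = TN / (TN + FP))   # share of actual non-subscribers it rules out
The specificity is very high, while the sensitivity is much lower (roughly a third of the actual subscribers are caught), which is exactly the imbalance effect mentioned above.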
Data:
Helpful sources and useful forums:
And our full report can be found here: http://rpubs.com/gingerball/final