The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y). The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
source: https://archive.ics.uci.edu/dataset/222/bank+marketing
How many times on average can a campaign reach success?
Recommended best campaign segments?
How much savings does someone who is successful in carrying out a campaign have on average?
data <- read.csv("data_input/bank.csv", header = FALSE, sep = ";", quote = "\"")
## age job marital education default balance housing loan contact day
## 2 30 unemployed married primary no 1787 no no cellular 19
## 3 33 services married secondary no 4789 yes yes cellular 11
## 4 35 management single tertiary no 1350 yes no cellular 16
## 5 30 management married tertiary no 1476 yes yes unknown 3
## 6 59 blue-collar married secondary no 0 yes no unknown 5
## 7 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 2 oct 79 1 -1 0 unknown no
## 3 may 220 1 339 4 failure no
## 4 apr 185 1 330 1 failure no
## 5 jun 199 4 -1 0 unknown no
## 6 may 226 1 -1 0 unknown no
## 7 feb 141 2 176 3 failure no
First things first, let’s read our data.
summary(data1)
## age job marital education
## Length:4521 Length:4521 Length:4521 Length:4521
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## default balance housing loan
## Length:4521 Length:4521 Length:4521 Length:4521
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## contact day month duration
## Length:4521 Length:4521 Length:4521 Length:4521
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## campaign pdays previous poutcome
## Length:4521 Length:4521 Length:4521 Length:4521
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## y
## Length:4521
## Class :character
## Mode :character
Based on the table above, let’s select the variables we need.
data2 <- data1 %>% select(
age, job, marital, education, default, balance, housing, loan, duration, campaign, previous, y)
To make it easier to read, we changed some column names.
names(data2)[names(data2) == "y"] <- "result"
names(data2)[names(data2) == "previous"] <- "campaignbef"
names(data2)[names(data2) == "campaign"] <- "campaignnow"
Let’s check for missing values in our data.
colSums(is.na(data2))
## age job marital education default balance
## 0 0 0 0 0 0
## housing loan duration campaignnow campaignbef result
## 0 0 0 0 0 0
To make processing easier, let’s change our data type.
data3 <- data2 %>%
mutate(age = as.numeric(age),
job = as.factor(job),
marital = as.factor(marital),
education = as.factor(education),
default = as.factor(default),
balance = as.numeric(balance),
housing = as.factor(housing),
loan = as.factor(loan),
duration = as.numeric(duration),
campaignnow = as.numeric(campaignnow),
campaignbef = as.numeric(campaignbef),
result = as.factor(result)
)
Once we have finished preparing our data, let’s start exploring our data.
For the first exploration, we will try to answer the business question regarding How many times on average can a campaign achieve success? For this reason, we do aggregation for several variables.
camp <- data3 %>%
group_by(result) %>%
summarise(avg_campaign = mean(campaignbef + campaignnow)) %>%
ungroup() %>%
mutate(avg_campaign = round(avg_campaign, 1))
camp
## # A tibble: 2 × 2
## result avg_campaign
## <fct> <dbl>
## 1 no 3.3
## 2 yes 3.4
Visualize the data.
plot1 <- ggplot(camp, aes(x = result, y = avg_campaign, fill = result)) +
geom_bar(stat = "identity", width = 0.1) + scale_fill_brewer(palette="Set2") +
labs(x = "Result", y = NULL, title = "Average Campaign by Result") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot1
Based on the data processing above, from this plot we can see that on average 3.4 new campaigns can produce success until someone wants to carry out the goals of this campaign.
To answer the second question, we try to aggregate based on variables that can group data that describes certain segments.
data3 %>%
group_by(age, job, marital, education, balance, housing, loan) %>%
summarise(
result_yes = sum(result == "yes"),
result_no = sum(result == "no")
)
## `summarise()` has grouped output by 'age', 'job', 'marital', 'education',
## 'balance', 'housing'. You can override using the `.groups` argument.
## # A tibble: 4,464 × 9
## # Groups: age, job, marital, education, balance, housing [4,456]
## age job marital education balance housing loan result_yes result_no
## <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <int> <int>
## 1 19 student single primary 103 no no 1 0
## 2 19 student single secondary 302 no no 1 0
## 3 19 student single unknown 0 no no 0 1
## 4 19 student single unknown 1169 no no 0 1
## 5 20 student single secondary 291 no no 0 1
## 6 20 student single secondary 502 no no 1 0
## 7 20 student single secondary 1191 no no 0 1
## 8 21 services single secondary 361 no no 0 1
## 9 21 services single secondary 1903 yes no 0 1
## 10 21 student single secondary 6 no no 0 1
## # ℹ 4,454 more rows
Let’s regroup for the job variable.
seg <- data3 %>%
filter(result == "yes") %>%
group_by(job, result) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ungroup()
## `summarise()` has grouped output by 'job'. You can override using the `.groups`
## argument.
seg
## # A tibble: 12 × 3
## job result count
## <fct> <fct> <int>
## 1 management yes 131
## 2 technician yes 83
## 3 blue-collar yes 69
## 4 admin. yes 58
## 5 retired yes 54
## 6 services yes 38
## 7 self-employed yes 20
## 8 student yes 19
## 9 entrepreneur yes 15
## 10 housemaid yes 14
## 11 unemployed yes 13
## 12 unknown yes 7
Visualize for variable job based data.
plot2 <- ggplot(data = seg, aes(x = reorder(job, count), y = count, fill = result)) +
geom_bar(stat = "identity", position = position_dodge()) +
scale_fill_brewer(palette="Set2") +
labs(title = "Distribution of Job and Education with Result Counts",
x = NULL,
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
coord_flip()
plot2
We see that the job segment provides a level of success in the campaign, especially management or managerial jobs.
This time we aggregate based on savings.
bal <- data3 %>%
group_by(result) %>%
summarise(median_balance = median(balance, na.rm = TRUE)) %>%
ungroup()
bal
## # A tibble: 2 × 2
## result median_balance
## <fct> <dbl>
## 1 no 420.
## 2 yes 710
Here’s the visualization.
plot3 <- ggplot(bal, aes(x = result, y = median_balance, fill = result)) +
geom_bar(stat = "identity") +scale_fill_brewer(palette="Set2") +
labs(title = "Median Balance by Result",
x = "Result",
y = NULL) +
theme_minimal()
plot3
Based on the median savings that have the potential for success in this campaign, it is worth 12,518,522 Rupiah or 710 Euros.
Next, let’s try to aggregate it based on debt.
loany <- data3 %>%
filter(result == "yes") %>%
group_by(loan, housing, result) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ungroup()
## `summarise()` has grouped output by 'loan', 'housing'. You can override using
## the `.groups` argument.
loany
## # A tibble: 4 × 4
## loan housing result count
## <fct> <fct> <fct> <int>
## 1 no no yes 283
## 2 no yes yes 195
## 3 yes yes yes 25
## 4 yes no yes 18
It can be seen that the data still needs to be summarized again, let’s try manipulating it into several groups by aggregation.
loann <- data3 %>%
filter(result == "no") %>%
group_by(loan, housing, result) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ungroup()
## `summarise()` has grouped output by 'loan', 'housing'. You can override using
## the `.groups` argument.
loann
## # A tibble: 4 × 4
## loan housing result count
## <fct> <fct> <fct> <int>
## 1 no yes no 1958
## 2 no no no 1394
## 3 yes yes no 381
## 4 yes no no 267
After that, provide a new column with descriptive data to clarify the group.
loan1 <- loany %>%
mutate(group_yes = paste0("housing - ", housing, " & loan - ", loan)) %>%
select(group_yes, count)
loan1
## # A tibble: 4 × 2
## group_yes count
## <chr> <int>
## 1 housing - no & loan - no 283
## 2 housing - yes & loan - no 195
## 3 housing - yes & loan - yes 25
## 4 housing - no & loan - yes 18
Here’s the visualization.
# Compute the cumulative percentages (top of each rectangle)
loan1$fraction <- loan1$count / sum(loan1$count)
# Compute the cumulative percentages (top of each rectangle)
loan1$ymax <- cumsum(loan1$fraction)
# Compute the bottom of each rectangle
loan1$ymin <- c(0, head(loan1$ymax, n=-1))
# Compute label position
loan1$labelPosition <- (loan1$ymax + loan1$ymin) / 2
# Compute a good label
loan1$label <- paste0(loan1$count)
# Make the plot
plot4 <- ggplot(loan1, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=group_yes)) +
geom_rect(color="black") +
geom_text(x=2.5, aes(y=labelPosition, label=count), color="black", size=4) +
scale_fill_brewer(palette="Set2") +
scale_color_brewer(palette="Set1") +
coord_polar(theta="y") +
xlim(c(-1, 4)) +
theme_void() +
labs(title = "Doughnut Chart of Success Campaign Result Base on Loan")+
guides(fill = guide_legend(title = "Group"))
plot4
Based on the data above, the best potential is someone who has no mortgage payments and no debt with a total of 283 data.
Based on the data above, we conclude in the form of recommendations as follows To increase success in this campaign, there are several steps that need to be taken. Firstly, increase the number of campaigns, based on the statistical data above to achieve success in a campaign, on average it requires 3.4 campaigns. It is unfortunate that the campaign failure ratio is also seen after 3.3 campaigns have been carried out, with a difference of 0.1 in the statistical value of success and failure, we need to improve other variables. Another variable needed is what segments can increase success. We tried to explore data with various variables, namely the ratio of job debt and savings. We concluded that the job variable provided quite significant value to the campaign success ratio. So increase the focus on job segmentation in the management/managerial section of the company and the savings value is above the median, namely 710. This will increase the success of the campaign.