The dataset from the “Bank Marketing” provides valuable insights into direct marketing campaigns conducted by a Portuguese banking institution. These campaigns aim to promote term deposit subscriptions among customers. By analyzing the data, we can gain a better understanding of customer behavior and factors influencing subscription patterns. The dataset encompasses a wide range of attributes, including demographic, economic, and social factors, providing a comprehensive view of the customers targeted by the campaigns.
Note: Detail explanatory will be given at the end of content
Make sure our data placed in the same folder our R project data.
bank <- read.csv("bank-additional/bank-additional.csv", sep = ";")input data is DONE! then let’s get started.
head(bank)tail(bank)dim(bank)#> [1] 4119 21
names(bank)#> [1] "age" "job" "marital" "education"
#> [5] "default" "housing" "loan" "contact"
#> [9] "month" "day_of_week" "duration" "campaign"
#> [13] "pdays" "previous" "poutcome" "emp.var.rate"
#> [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
#> [21] "y"
From our inspection we can conclude: Bank data contain 4119 rows and 21 columns Each of column name: “age”, “job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “day_of_week”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, “emp.var.rate”, cons.price.idx” “cons.conf.idx”, “euribor3m”, “nr.employed”, “y”.
Check data type for each column
str(bank)#> 'data.frame': 4119 obs. of 21 variables:
#> $ age : int 30 39 25 38 47 32 32 41 31 35 ...
#> $ job : chr "blue-collar" "services" "services" "services" ...
#> $ marital : chr "married" "single" "married" "married" ...
#> $ education : chr "basic.9y" "high.school" "high.school" "basic.9y" ...
#> $ default : chr "no" "no" "no" "no" ...
#> $ housing : chr "yes" "no" "yes" "unknown" ...
#> $ loan : chr "no" "no" "no" "unknown" ...
#> $ contact : chr "cellular" "telephone" "telephone" "telephone" ...
#> $ month : chr "may" "may" "jun" "jun" ...
#> $ day_of_week : chr "fri" "fri" "wed" "fri" ...
#> $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
#> $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
#> $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
#> $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
#> $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
#> $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
#> $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 ...
#> $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
#> $ euribor3m : num 1.31 4.86 4.96 4.96 4.19 ...
#> $ nr.employed : num 5099 5191 5228 5228 5196 ...
#> $ y : chr "no" "no" "no" "no" ...
From this result, we find some of data type not in the correct type. We need to convert into correct type (data correlation)
#Changing the data types into categorical
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$day_of_week <- as.factor(bank$day_of_week)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)
str(bank)#> 'data.frame': 4119 obs. of 21 variables:
#> $ age : int 30 39 25 38 47 32 32 41 31 35 ...
#> $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
#> $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
#> $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
#> $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
#> $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
#> $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
#> $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
#> $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
#> $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
#> $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
#> $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
#> $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
#> $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
#> $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
#> $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
#> $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 ...
#> $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
#> $ euribor3m : num 1.31 4.86 4.96 4.96 4.19 ...
#> $ nr.employed : num 5099 5191 5228 5228 5196 ...
#> $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Each of column already changed into desired data type
Check for missing value
colSums(is.na(bank))#> age job marital education default
#> 0 0 0 0 0
#> housing loan contact month day_of_week
#> 0 0 0 0 0
#> duration campaign pdays previous poutcome
#> 0 0 0 0 0
#> emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
#> 0 0 0 0 0
#> y
#> 0
anyNA(bank)#> [1] FALSE
Great! No missing value. Now, Bank dataset is ready to be processes and analyzed
Brief explanation
summary(bank)#> age job marital education
#> Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
#> 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
#> Median :38.00 technician : 691 single :1153 basic.9y : 574
#> Mean :40.11 services : 393 unknown : 11 professional.course: 535
#> 3rd Qu.:47.00 management : 324 basic.4y : 429
#> Max. :88.00 retired : 166 basic.6y : 228
#> (Other) : 649 (Other) : 168
#> default housing loan contact month
#> no :3315 no :1839 no :3349 cellular :2652 may :1378
#> unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
#> yes : 1 yes :2175 yes : 665 aug : 636
#> jun : 530
#> nov : 446
#> apr : 215
#> (Other): 203
#> day_of_week duration campaign pdays previous
#> fri:768 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
#> mon:855 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
#> thu:860 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
#> tue:841 Mean : 256.8 Mean : 2.537 Mean :960.4 Mean :0.1903
#> wed:795 3rd Qu.: 317.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
#> Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
#>
#> poutcome emp.var.rate cons.price.idx cons.conf.idx
#> failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
#> nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
#> success : 142 Median : 1.10000 Median :93.75 Median :-41.8
#> Mean : 0.08497 Mean :93.58 Mean :-40.5
#> 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
#> Max. : 1.40000 Max. :94.77 Max. :-26.9
#>
#> euribor3m nr.employed y
#> Min. :0.635 Min. :4964 no :3668
#> 1st Qu.:1.334 1st Qu.:5099 yes: 451
#> Median :4.857 Median :5191
#> Mean :3.621 Mean :5166
#> 3rd Qu.:4.961 3rd Qu.:5228
#> Max. :5.045 Max. :5228
#>
Check the outlier within duration
boxplot(bank$duration)Based on the information above, we have identified possible outliers. However, they are still considered tolerable as they are relevant to the case.
job_subs <- aggregate(y ~ job,
data = bank,
FUN = function(x) sum(x == "yes"))
job_subs[order(job_subs$y, decreasing = T), ]Answer: Excluding customers with unknown job, the most subscribed job type is admin, while the least subscribed job type is entrepreneur.
edu_subs <- aggregate(y ~ education,
data = bank,
FUN = function(x) sum(x == "yes"))
edu_subs[order(edu_subs$y, decreasing = T), ]Answer: The customers with a university degree have the highest subscription rate, while those with a professional course education have a lower subscription rate compared to customers with a high school background.
job_duration_call <- aggregate(duration ~ job,
data = bank,
FUN = max)
job_duration_call[order(job_duration_call$duration, decreasing = T), ]Answer: The longest contact duration is found among customers with the job occupation of “technician”. However, customers with the job occupation of “admin” have the highest number of subscriptions to the campaign product.
table(bank$y)#>
#> no yes
#> 3668 451
round(prop.table(table(bank$y)),4)*100#>
#> no yes
#> 89.05 10.95
Answer: there is 451 customer that actual subscribe to our product through campaign which mean 10,95% from all campaign target.
job_count <- table(bank$job)
job_data <- data.frame(job = job_count)
names(job_data) <- c("job", "count")
job_data <- job_data[order(job_data$count, decreasing = T), ]
job_datalibrary(ggplot2)
ggplot(data = job_data, mapping = aes(y = reorder(job, count), x = count)) +
geom_col(aes(fill = job)) +
geom_text(aes(label = count), nudge_x = 30) +
labs(title = "The Most Common Job Among Customer",
y = "Occupation",
x = "Number of Occupation") +
theme_minimal() +
theme(legend.position = "none")Answer: The most common customer job is admin, followed by blue_collar and technician
contact <- as.data.frame(round(prop.table(table(bank$contact, bank$y)),4)*100)
names(contact) <- c("contact", "decision", "freq")
contactggplot(data = contact, aes(x = contact, y = freq, fill = decision)) +
geom_col(position = "dodge") +
geom_text(aes(label=freq), nudge_y = 2) +
scale_fill_manual(values = c("#cad2c5", "#84a59d")) +
labs(title = "Proportion of Contacts and Decision",
x = "Contact",
y = "Proportion (%)",
fill = "Decision") +
theme_minimal() Answer: Most customer that subscribe to our product is contacted from celuler but it’s also linier with “no” decision.
most_call <- as.data.frame(xtabs(duration ~ day_of_week, data = bank))
names(most_call) <- c("day", "freq")
most_callmost_call$day <- factor(most_call$day, levels = c("mon", "tue", "wed", "thu", "fri"))
ggplot(data = most_call, mapping = aes(x = day, y = freq, group = 1)) +
geom_smooth(formula = y ~ x, method = "loess", se = F, color = "grey") +
labs(title = "Call Frequency",
x = "Day of Week",
y = "Frequency") +
theme_minimal()Answer: The highest occurrence of phone calls to customers is on the early weekdays, specifically Monday and Tuesday. It gradually decreases towards the middle of the week. There is a significant increase on Thursday, followed by a decline on Friday, aligning with the approaching weekend.
round(cor(bank$duration, as.numeric(bank$y == "yes")),4)*100#> [1] 41.86
ggplot(data = bank, mapping = aes(x = y, y = duration)) +
geom_boxplot() +
labs(title = "Relationship between Duration and Decision",
x = "Decision",
y = "Duration") +
theme_minimal()Answer: There is a positive correlation of 41.86% between the duration of phone calls during the campaign and the likelihood of customers subscribing to our product.
ggplot(data = bank, mapping = aes(x = age, y = ..count..)) +
geom_density(aes(fill = y)) +
scale_fill_manual(values = c("#a8dadc", "#52796f")) +
labs(title = "Relationship Between Age and Subscription Decision",
x = "Age",
y = "Count",
fill = "Subcription Decision") +
theme_minimal()round(cor(bank$age, as.numeric(bank$y == "yes")),4)*100#> [1] 6.04
Answer: Based on the analysis conducted, there is no clear correlation between age and the decision to subscribe to the product through the campaign. Age does not appear to be a determining factor in whether someone chooses to subscribe as the correlation show 6%.
Based on the analysis of the Bank Marketing dataset, it is found that customers with an administration job type are the most likely to subscribe to the campaign product, while entrepreneurs are the least likely. Additionally, customers with a university degree have a higher subscription rate compared to those with a professional course or high school background. The duration of contact during the campaign shows a positive correlation with the likelihood of subscription, indicating that longer phone calls may lead to higher subscription rates. However, age does not appear to be a determining factor in the decision to subscribe. Overall, these findings provide valuable insights into customer behavior and can help guide marketing strategies for future campaigns.
Recomendations: Based on the analysis of the Bank Marketing dataset, several recommendations can be made to improve the effectiveness of future marketing campaigns.
Target Administrators and Avoid Entrepreneurs: Since customers with an administration job type have shown the highest subscription rate, it would be beneficial to focus efforts on targeting this group in future campaigns. On the other hand, the lowest subscription rate was observed among entrepreneurs, suggesting that allocating resources towards this segment may not yield significant results.
Tailor Strategies for Education Background: Customers with a university degree have demonstrated the highest subscription rate. It would be wise to create marketing messages and strategies that appeal to this educated segment. However, customers with a professional course education have shown a lower subscription rate compared to those with a high school background. Considering this, it may be necessary to fine-tune approaches to engage and persuade customers with a professional course education to increase their subscription likelihood.
Prioritize Contacting Technicians and Administrators: Although technicians have shown the longest contact duration, administrators have the highest number of subscriptions to the campaign product. It is advisable to allocate more resources towards contacting administrators and technicians as they have demonstrated a higher likelihood of subscribing.
Optimize Contact Methods: Most customers who subscribed to the product were contacted through cellular communication. However, it is important to note that this contact method also aligns with the “no” decision. This suggests the need to evaluate the effectiveness of cellular communication and explore alternative contact methods to improve subscription rates.
Timing and Frequency of Phone Calls: The highest occurrence of phone calls to customers was observed on early weekdays, specifically Monday and Tuesday, gradually decreasing towards the middle of the week. There was a significant increase on Thursday, followed by a decline on Friday, indicating a potential relationship with the approaching weekend. This information can be used to optimize the timing and frequency of phone calls, ensuring that potential customers are contacted when they are most receptive.
Emphasize Call Duration: The analysis revealed a positive correlation of 41.86% between the duration of phone calls during the campaign and the likelihood of customers subscribing to the product. It is important to encourage representatives to engage in longer conversations with potential customers, as these extended interactions have shown a higher probability of resulting in a subscription.
Consider Factors Beyond Age: The analysis did not find a clear correlation between age and the decision to subscribe through the campaign. Age alone does not appear to be a determining factor in whether someone chooses to subscribe. Therefore, it is crucial to consider other variables and characteristics when targeting potential customers, rather than relying solely on age as a predictor.
By implementing these recommendations, the marketing team can optimize their strategies, improve targeting, and increase the effectiveness of future campaigns, ultimately leading to higher subscription rates and better overall results.