1 Explanation

The dataset from the “Bank Marketing” provides valuable insights into direct marketing campaigns conducted by a Portuguese banking institution. These campaigns aim to promote term deposit subscriptions among customers. By analyzing the data, we can gain a better understanding of customer behavior and factors influencing subscription patterns. The dataset encompasses a wide range of attributes, including demographic, economic, and social factors, providing a comprehensive view of the customers targeted by the campaigns.

Note: Detail explanatory will be given at the end of content

2 Input Data

Make sure our data placed in the same folder our R project data.

bank <- read.csv("bank-additional/bank-additional.csv", sep = ";")

input data is DONE! then let’s get started.

2.1 Data Inspection

head(bank)
tail(bank)
dim(bank)
#> [1] 4119   21
names(bank)
#>  [1] "age"            "job"            "marital"        "education"     
#>  [5] "default"        "housing"        "loan"           "contact"       
#>  [9] "month"          "day_of_week"    "duration"       "campaign"      
#> [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
#> [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
#> [21] "y"

From our inspection we can conclude: Bank data contain 4119 rows and 21 columns Each of column name: “age”, “job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “day_of_week”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, “emp.var.rate”, cons.price.idx” “cons.conf.idx”, “euribor3m”, “nr.employed”, “y”.

  1. age.
  2. job : type of job.
  3. marital : marital status.
  4. education.
  5. default: credit in default.
  6. housing: housing loan.
  7. loan: personal loan.
  8. contact: contact communication type.
  9. month: last contact month of year.
  10. day_of_week: last contact day of the week.
  11. duration: last contact duration, in seconds.
  12. campaign: number of contacts performed during this campaign and for this client.
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign.
  14. previous: number of contacts performed before this campaign and for this client.
  15. poutcome: outcome of the previous marketing campaign.
  16. emp.var.rate: employment variation rate.
  17. cons.price.idx: consumer price index.
  18. cons.conf.idx: consumer confidence index.
  19. euribor3m: euribor 3 month rate.
  20. nr.employed: number of employee.

2.2 Data Cleansing & correlation

Check data type for each column

str(bank)
#> 'data.frame':    4119 obs. of  21 variables:
#>  $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
#>  $ job           : chr  "blue-collar" "services" "services" "services" ...
#>  $ marital       : chr  "married" "single" "married" "married" ...
#>  $ education     : chr  "basic.9y" "high.school" "high.school" "basic.9y" ...
#>  $ default       : chr  "no" "no" "no" "no" ...
#>  $ housing       : chr  "yes" "no" "yes" "unknown" ...
#>  $ loan          : chr  "no" "no" "no" "unknown" ...
#>  $ contact       : chr  "cellular" "telephone" "telephone" "telephone" ...
#>  $ month         : chr  "may" "may" "jun" "jun" ...
#>  $ day_of_week   : chr  "fri" "fri" "wed" "fri" ...
#>  $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
#>  $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
#>  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
#>  $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
#>  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
#>  $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
#>  $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 ...
#>  $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
#>  $ euribor3m     : num  1.31 4.86 4.96 4.96 4.19 ...
#>  $ nr.employed   : num  5099 5191 5228 5228 5196 ...
#>  $ y             : chr  "no" "no" "no" "no" ...

From this result, we find some of data type not in the correct type. We need to convert into correct type (data correlation)

#Changing the data types into categorical
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$day_of_week <- as.factor(bank$day_of_week)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)
#> 'data.frame':    4119 obs. of  21 variables:
#>  $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
#>  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
#>  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
#>  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
#>  $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
#>  $ housing       : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
#>  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
#>  $ contact       : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
#>  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
#>  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
#>  $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
#>  $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
#>  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
#>  $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
#>  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
#>  $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
#>  $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 ...
#>  $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
#>  $ euribor3m     : num  1.31 4.86 4.96 4.96 4.19 ...
#>  $ nr.employed   : num  5099 5191 5228 5228 5196 ...
#>  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Each of column already changed into desired data type

Check for missing value

colSums(is.na(bank))
#>            age            job        marital      education        default 
#>              0              0              0              0              0 
#>        housing           loan        contact          month    day_of_week 
#>              0              0              0              0              0 
#>       duration       campaign          pdays       previous       poutcome 
#>              0              0              0              0              0 
#>   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
#>              0              0              0              0              0 
#>              y 
#>              0
anyNA(bank)
#> [1] FALSE

Great! No missing value. Now, Bank dataset is ready to be processes and analyzed

3 Data Explanation

Brief explanation

summary(bank)
#>       age                 job           marital                   education   
#>  Min.   :18.00   admin.     :1012   divorced: 446   university.degree  :1264  
#>  1st Qu.:32.00   blue-collar: 884   married :2509   high.school        : 921  
#>  Median :38.00   technician : 691   single  :1153   basic.9y           : 574  
#>  Mean   :40.11   services   : 393   unknown :  11   professional.course: 535  
#>  3rd Qu.:47.00   management : 324                   basic.4y           : 429  
#>  Max.   :88.00   retired    : 166                   basic.6y           : 228  
#>                  (Other)    : 649                   (Other)            : 168  
#>     default        housing          loan           contact         month     
#>  no     :3315   no     :1839   no     :3349   cellular :2652   may    :1378  
#>  unknown: 803   unknown: 105   unknown: 105   telephone:1467   jul    : 711  
#>  yes    :   1   yes    :2175   yes    : 665                    aug    : 636  
#>                                                                jun    : 530  
#>                                                                nov    : 446  
#>                                                                apr    : 215  
#>                                                                (Other): 203  
#>  day_of_week    duration         campaign          pdays          previous     
#>  fri:768     Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.0000  
#>  mon:855     1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.0000  
#>  thu:860     Median : 181.0   Median : 2.000   Median :999.0   Median :0.0000  
#>  tue:841     Mean   : 256.8   Mean   : 2.537   Mean   :960.4   Mean   :0.1903  
#>  wed:795     3rd Qu.: 317.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.0000  
#>              Max.   :3643.0   Max.   :35.000   Max.   :999.0   Max.   :6.0000  
#>                                                                                
#>         poutcome     emp.var.rate      cons.price.idx  cons.conf.idx  
#>  failure    : 454   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
#>  nonexistent:3523   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
#>  success    : 142   Median : 1.10000   Median :93.75   Median :-41.8  
#>                     Mean   : 0.08497   Mean   :93.58   Mean   :-40.5  
#>                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
#>                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
#>                                                                       
#>    euribor3m      nr.employed     y       
#>  Min.   :0.635   Min.   :4964   no :3668  
#>  1st Qu.:1.334   1st Qu.:5099   yes: 451  
#>  Median :4.857   Median :5191             
#>  Mean   :3.621   Mean   :5166             
#>  3rd Qu.:4.961   3rd Qu.:5228             
#>  Max.   :5.045   Max.   :5228             
#> 
  1. The average age of bank customers is 40 years old, with the youngest customer being 18 years old and the oldest customer being 88 years old.
  2. The most common occupation among customers is administration.
  3. There are more customers who are married compared to those who are single.
  4. The majority of customers have a university degree.
  5. There are more customers who have a housing loan compared to customers who have a personal loan.
  6. The highest number of customer contacts for the campaign program through direct phone calls occurred in May.
  7. The average duration of phone calls to customers during the campaign was 256 seconds or 4 minutes.

Check the outlier within duration

boxplot(bank$duration)

Based on the information above, we have identified possible outliers. However, they are still considered tolerable as they are relevant to the case.

4 Data Manipualtion & Transformation

  1. How many customers have subscribed based on job type?
job_subs <- aggregate(y ~ job,
                      data = bank,
                      FUN = function(x) sum(x == "yes"))
job_subs[order(job_subs$y, decreasing = T), ]

Answer: Excluding customers with unknown job, the most subscribed job type is admin, while the least subscribed job type is entrepreneur.

  1. How many customers are subscribed based on their education level?
edu_subs <- aggregate(y ~ education,
                      data = bank,
                      FUN = function(x) sum(x == "yes"))
edu_subs[order(edu_subs$y, decreasing = T), ]

Answer: The customers with a university degree have the highest subscription rate, while those with a professional course education have a lower subscription rate compared to customers with a high school background.

  1. What is the maximum value of the contact duration variable based on job occupation?
job_duration_call <- aggregate(duration ~ job,
                               data = bank,
                               FUN = max)
job_duration_call[order(job_duration_call$duration, decreasing = T), ]

Answer: The longest contact duration is found among customers with the job occupation of “technician”. However, customers with the job occupation of “admin” have the highest number of subscriptions to the campaign product.

  1. How many customers have subscribed to our product or service? And what is the percentage among them?
table(bank$y)
#> 
#>   no  yes 
#> 3668  451
round(prop.table(table(bank$y)),4)*100
#> 
#>    no   yes 
#> 89.05 10.95

Answer: there is 451 customer that actual subscribe to our product through campaign which mean 10,95% from all campaign target.

  1. Apa jenis pekerjaan yang paling umum di antara nasabah dalam dataset ini?
job_count <- table(bank$job)
job_data <- data.frame(job = job_count)
names(job_data) <- c("job", "count")

job_data <- job_data[order(job_data$count, decreasing = T), ]
job_data
library(ggplot2)
ggplot(data = job_data, mapping = aes(y = reorder(job, count), x = count)) +
  geom_col(aes(fill = job)) +
  geom_text(aes(label = count), nudge_x = 30) +
  
  labs(title = "The Most Common Job Among Customer",
       y = "Occupation",
       x = "Number of Occupation") +
  theme_minimal() +
  theme(legend.position = "none")

Answer: The most common customer job is admin, followed by blue_collar and technician

  1. What is the percentage of customers who have subscribed to our product or service based on the type of communication contact?
contact <- as.data.frame(round(prop.table(table(bank$contact, bank$y)),4)*100)
names(contact) <- c("contact", "decision", "freq")
contact
ggplot(data = contact, aes(x = contact, y = freq, fill = decision)) +
  geom_col(position = "dodge") +
  geom_text(aes(label=freq), nudge_y = 2) +
  scale_fill_manual(values = c("#cad2c5", "#84a59d")) +
  labs(title = "Proportion of Contacts and Decision",
       x = "Contact",
       y = "Proportion (%)",
       fill = "Decision") +
  theme_minimal()

Answer: Most customer that subscribe to our product is contacted from celuler but it’s also linier with “no” decision.

  1. Which day that has the most phone-call happen?
most_call <- as.data.frame(xtabs(duration ~ day_of_week, data = bank))
names(most_call) <- c("day", "freq")
most_call
most_call$day <- factor(most_call$day, levels = c("mon", "tue", "wed", "thu", "fri"))

ggplot(data = most_call, mapping = aes(x = day, y = freq, group = 1)) +
  geom_smooth(formula = y ~ x, method = "loess", se = F, color = "grey") +
  
    labs(title = "Call Frequency",
       x = "Day of Week",
       y = "Frequency") +
  theme_minimal()

Answer: The highest occurrence of phone calls to customers is on the early weekdays, specifically Monday and Tuesday. It gradually decreases towards the middle of the week. There is a significant increase on Thursday, followed by a decline on Friday, aligning with the approaching weekend.

  1. What is the relationship between the contact duration with customers and their likelihood of subscribing to the bank’s products?
round(cor(bank$duration, as.numeric(bank$y == "yes")),4)*100
#> [1] 41.86
ggplot(data = bank, mapping = aes(x = y, y = duration)) +
  geom_boxplot() +
  labs(title = "Relationship between Duration and Decision",
       x = "Decision",
       y = "Duration") +
  theme_minimal()

Answer: There is a positive correlation of 41.86% between the duration of phone calls during the campaign and the likelihood of customers subscribing to our product.

  1. Does age factor influence customers’ decision to subscribe to the bank’s products?
ggplot(data = bank, mapping = aes(x = age, y = ..count..)) +
  geom_density(aes(fill = y)) +
  scale_fill_manual(values = c("#a8dadc", "#52796f")) +
  labs(title = "Relationship Between Age and Subscription Decision",
       x = "Age",
       y = "Count",
       fill = "Subcription Decision") +
  theme_minimal()

round(cor(bank$age, as.numeric(bank$y == "yes")),4)*100
#> [1] 6.04

Answer: Based on the analysis conducted, there is no clear correlation between age and the decision to subscribe to the product through the campaign. Age does not appear to be a determining factor in whether someone chooses to subscribe as the correlation show 6%.

5 Explanatory Text & Business Recomendation

Based on the analysis of the Bank Marketing dataset, it is found that customers with an administration job type are the most likely to subscribe to the campaign product, while entrepreneurs are the least likely. Additionally, customers with a university degree have a higher subscription rate compared to those with a professional course or high school background. The duration of contact during the campaign shows a positive correlation with the likelihood of subscription, indicating that longer phone calls may lead to higher subscription rates. However, age does not appear to be a determining factor in the decision to subscribe. Overall, these findings provide valuable insights into customer behavior and can help guide marketing strategies for future campaigns.

Recomendations: Based on the analysis of the Bank Marketing dataset, several recommendations can be made to improve the effectiveness of future marketing campaigns.

  1. Target Administrators and Avoid Entrepreneurs: Since customers with an administration job type have shown the highest subscription rate, it would be beneficial to focus efforts on targeting this group in future campaigns. On the other hand, the lowest subscription rate was observed among entrepreneurs, suggesting that allocating resources towards this segment may not yield significant results.

  2. Tailor Strategies for Education Background: Customers with a university degree have demonstrated the highest subscription rate. It would be wise to create marketing messages and strategies that appeal to this educated segment. However, customers with a professional course education have shown a lower subscription rate compared to those with a high school background. Considering this, it may be necessary to fine-tune approaches to engage and persuade customers with a professional course education to increase their subscription likelihood.

  3. Prioritize Contacting Technicians and Administrators: Although technicians have shown the longest contact duration, administrators have the highest number of subscriptions to the campaign product. It is advisable to allocate more resources towards contacting administrators and technicians as they have demonstrated a higher likelihood of subscribing.

  4. Optimize Contact Methods: Most customers who subscribed to the product were contacted through cellular communication. However, it is important to note that this contact method also aligns with the “no” decision. This suggests the need to evaluate the effectiveness of cellular communication and explore alternative contact methods to improve subscription rates.

  5. Timing and Frequency of Phone Calls: The highest occurrence of phone calls to customers was observed on early weekdays, specifically Monday and Tuesday, gradually decreasing towards the middle of the week. There was a significant increase on Thursday, followed by a decline on Friday, indicating a potential relationship with the approaching weekend. This information can be used to optimize the timing and frequency of phone calls, ensuring that potential customers are contacted when they are most receptive.

  6. Emphasize Call Duration: The analysis revealed a positive correlation of 41.86% between the duration of phone calls during the campaign and the likelihood of customers subscribing to the product. It is important to encourage representatives to engage in longer conversations with potential customers, as these extended interactions have shown a higher probability of resulting in a subscription.

  7. Consider Factors Beyond Age: The analysis did not find a clear correlation between age and the decision to subscribe through the campaign. Age alone does not appear to be a determining factor in whether someone chooses to subscribe. Therefore, it is crucial to consider other variables and characteristics when targeting potential customers, rather than relying solely on age as a predictor.

By implementing these recommendations, the marketing team can optimize their strategies, improve targeting, and increase the effectiveness of future campaigns, ultimately leading to higher subscription rates and better overall results.