Bank portugal saat ini sedang mengalami penurunan revenue, mereka ingin tahu tindakan apa yang harus di ambil, setelah diselidiki, kami menemukan bahwa akar masalahnya adalah banyak nasabah tidak melakukan penyetoran sesering sebelumnya. selain itu, bank memiliki peluang lebih besar untuk membujuk nasabah yang membeli produk lain Seperti asuransi atau dana pensiun untuk lebih meningkatkan pendapatannya, oleh karena itu bank portugal ingin mengindentifikasi klien yang ada yang memiliki peluang lebih tinggi untuk berlangganan deposito berjangka dan memfokuskan upaya pemasaran pada klien tersebut.
Bank Portugal telah mengumpulkan sejumlah besar data yang mencakup profil pelanggan dari mereka yang telah berlangganan deposito berjangka dan mereka yang tidak berlangganan deposito berjangka. Data tersebut mencakup kolom-kolom berikut.
Other attributes:
The following are social and economic context attributes
Kita akan memasukan data ke object dt
## [1] 41188 21
Dari hasil inspection di atas, kita dapat description sederhana dari data bank portugal mempunyai 41188x21 row dan column, lalu kita akan check struktur data.
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
hasil dari struktur data di atas kita bisa lihat bahwa ada beberapa data yang masih belum sesuai type nya, pada peroses selanjutnya kita akan menyesuaikan type data.
dt[, c("marital","default", "housing", "loan", "contact", "day_of_week", "month", "education", "y" )] <- lapply(dt[,c("marital", "default", "housing", "loan", "contact", "day_of_week", "month", "education", "y")], as.factor)
str(dt)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## [1] FALSE
Jika kita lihat data bank portugal tidak ada missing sehingga kita tidak perlu lagi adjust data,
Rename kolom y menjadi deposit
## age job marital
## Min. :17.00 Length:41188 divorced: 4612
## 1st Qu.:32.00 Class :character married :24928
## Median :38.00 Mode :character single :11568
## Mean :40.02 unknown : 80
## 3rd Qu.:47.00
## Max. :98.00
##
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 Length:41188
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 Class :character
## Median : 2.000 Median :999.0 Median :0.000 Mode :character
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed deposit
## Min. :4964 no :36548
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
kita ingin mengetahui corelation deposito by duration
ggplot(data = dt, mapping = aes(x =deposit, y = duration))+
geom_boxplot(aes(fill = deposit))+
labs(title = "Deposit by Duration", x = "deposit", y = "Duration")+
theme(plot.title = element_text(hjust = 0.5))+
geom_hline(yintercept = mean(dt$duration), col = "blue", linetype = 5)
Insight :
kita ingin mengetahui corelation antara duration dan campaign
ggthemr_reset()
ggplot(data = dt, mapping = aes(x = campaign, y = duration, col = deposit))+
geom_point()+
geom_hline(yintercept = mean(dt$duration), col = "blue", linetype = 5)+
labs(title = "Duration vs Campaign")+
theme(plot.title = element_text(hjust = 0.5))
Insight :
kita ingin mengetahui nilai rata2 dari duration call dan campaign berdasarkan contact yang di gunakan
agg <- aggregate.data.frame(list(duration = dt$duration, campaign = dt$campaign),by = list(contact = dt$contact), FUN = mean)
dim(agg)
## [1] 2 3
agg.pi <- pivot_longer(agg,
cols = c("duration", "campaign"),
names_to = "stat",
values_to = "value")
agg.pi
ggthemr_reset()
ggplot(agg.pi, mapping = aes(x = contact, y = value))+
geom_col(aes(fill = stat), position = "dodge")+
coord_flip()+
labs(title = "Average duration & campaign for each contact", x = " contact", y = "value", fill = "variabel")+
geom_text(aes(label = round(value, digits = 1)), hjust = -0.1, size = 3)+
theme(plot.title = element_text(hjust = 0.5))
Insight :
campaign dengan telephone memiliki nilai rata-rata minimum lebih tinggi dibandingkan cellular mesikpun duration nya telephone rendah, ini bisa terjadi karena cost lebih mahal.
Cellular memiliki nilai rata-rata minimum campaign lebih rendah dibandikan dengan telephon tapi durationnya lebih tinggi hal ini bisa terjadi karena dari cost lebih murah hampir semua orang memilikinya dan flexibel menggunakannya.
kita akan mengelompokan usia dan membuat boxplot education by duration
p <- function(x){if(x<30){x <- "<30"} else if(x>=30 & x<=39){x <- "30-39"}else if (x>=40 & x<=49){x <-"40-49"}else if(x>=50 & x <=59){x <- "50-59"}else(x <- "60+")}
dt$ageg <- as.factor(sapply(dt$age, p))
head(dt)
ggthemr_reset()
ggplot(dt, aes(x = duration, y = education))+
geom_jitter(aes(col = dt$ageg))+
geom_boxplot(alpha = 0.5)+
scale_x_continuous(trans = "log10")+
labs(title = "Entire education by duration ", x = "Duration", y = "Education", col = "age segment")+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Use of `dt$ageg` is discouraged. Use `ageg` instead.
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Insight :
kita akan melihat bagaimana campaign banyak terjadi di bulan apa
ggplot(dt, aes(x = month , y = campaign))+
geom_col(aes(fill = deposit))+
labs(title = "Campaign vs Month", x = "Month", y = "campaign")+
theme(plot.title = element_text(hjust = 0.5))
Insight :
Kita ingin mencari segment pekerjaan dengan pesentase tertinggi
ggthemr("grape", type = "inner")
total_deposit <- dt %>%
select(job) %>%
group_by(job) %>%
summarise(n=n()) %>%
mutate(percent=round(prop.table(n),2) * 100) %>%
ggplot(mapping = aes(x =reorder(job, percent), y = percent))+
geom_bar(stat = "identity")+
coord_flip()+
scale_fill_brewer(palette = "Dark2")+
theme(legend.position = "none")+
xlab("Jobs")+
geom_label(aes(label = paste0(percent, "%")),
position = position_stack(vjust = 1 ),
colour = "black",
fontface = "bold")
## `summarise()` ungrouping output (override with `.groups` argument)
Insight :
kita ingin mencari bagaimana interaction job berdasarkan deposito
job_by_deposit <- dt %>%
select(job, deposit) %>%
dplyr::group_by(job, deposit) %>%
summarise(n=n()) %>%
mutate(percent=round(prop.table(n),2) * 100) %>%
ungroup() %>%
ggplot(aes(x = reorder(job, -n), y = n, fill = deposit)) +
geom_bar(stat="identity", width = 0.9, position = position_dodge(width = 0.8)) +
geom_text(aes(label = sprintf("%.f%%", percent)),
position = position_dodge(width = 0.8),
vjust = -0.3, size = 3.5,
colour="black", fontface="bold") +
labs(x="", y="") +
labs(title="Jobs By Deposit") +
theme(plot.title=element_text(hjust=0.5))
## `summarise()` regrouping output by 'job' (override with `.groups` argument)
Insight :
Jika kita lihat graph di atas admin memiliki proporsi tidak membuka deposito tertinggi 87% di ikuti oleh blue-collar sebesar 93% dan 7% yang membeli, technician 89% penolakan dan 11 % yang memutuskan untuk membuka deposito
Dari semua segment pekerjaan belum menemukan break event point mayoritas customer dari berbagai segment pekerjaan masih banyak yang tidak mau membuka deposito, yang menarik dari graph diatas adalah pada retired kontribusi nya cukup tinggi sebesar 25%, ini mengindikasikan bahwa pensiunan sangat tertarik untuk membuka deposito hal ini dikarenakan low risk dan return yang stabil meski rata-rata yield nya kecil.
Kita ingin mencari bagaimana interaction month berdasarkan deposito
monht_deposit <- dt %>%
select(month, deposit) %>%
group_by(month, deposit) %>%
summarise(n = n()) %>%
mutate(percent = round(prop.table(n), digits = 2)*100) %>%
ungroup() %>%
ggplot(aes(x =reorder(month, -n), y = n, fill = deposit))+
geom_bar(stat = "identity", width = 0.9, position = position_dodge(width = 0.8))+
theme(axis.title.x = element_text(size = 12, face = "bold"))+
theme(axis.title.y = element_text(size = 12, face = "bold"))+
xlab("Jobs")+
ylab("Deposit")+
geom_text(aes(label = sprintf("%.f%%", percent)),
position = position_dodge(width = 0.8),
vjust = -0.3, size = 3.5,
colour = "black", fontface = "bold")+
labs(x = "", y = "")+
labs(title = "Month by deposit")+
theme(plot.title = element_text(hjust = 0.5))
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
Insight :
Jika kita lihat graph di atas pada bulan may hanya 6% yang membuka produk, padahal jika kita refer to graph sebelumnya Campaign vs Month di bulan mei paling banyak melakukn campaign dan diikuti oleh bulan juni, juli, agustus, tapi justru di bulan ini terjadi banyak menolak untuk membuka deposito.
Jika kita amati di bulan oct, sep, mar, jumlah nya cukup proposional jumlah tertinggi melakukan pembukaan deposito terjadi di bulan maret sebesar 51%.
Kita ingin mengetahui interaction loan housing terhadap deposito
ggthemr("grape", type = "inner")
loan_by_deposit <- dt %>%
select(loan, deposit) %>%
group_by(loan, deposit) %>%
summarise(n=n()) %>%
mutate(percent = round(prop.table(n),digits = 3)*100) %>%
ungroup() %>%
ggplot(aes(x = loan, y = n, fill = deposit))+
geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8))+
theme(axis.title.x = element_text(size = 12, face = "bold"))+
geom_text(aes(label = sprintf("%.1f%%", percent)),
position = position_dodge(width = 0.8),
vjust = -0.3, size = 3.5,
colour = "black", fontface = "bold")+
labs(x = "", y = "")+
labs(title = "Loan by deposit")+
theme(plot.title = element_text(hjust = 0.5))
## `summarise()` regrouping output by 'loan' (override with `.groups` argument)
housing_by_deposit <- dt %>%
select(housing, deposit) %>%
group_by(housing, deposit) %>%
summarise(n = n()) %>%
mutate(percent = round(prop.table(n) ,digits = 2)*100) %>%
ungroup() %>%
ggplot(aes(x = housing, y = n, fill = deposit))+
geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.8))+
theme(axis.title.x = element_text(size = 15, face = "bold"))+
geom_text(aes(label = sprintf("%.1f%%", percent)),
position = position_dodge(width = 0.8),
vjust = -0.3, size = 3.5,
colour = "black", fontface = "bold")+
labs(title = "Housing by deposit", x = "", y ="")+
theme(plot.title = element_text(hjust = 0.5))
## `summarise()` regrouping output by 'housing' (override with `.groups` argument)
Insgiht :
Kita ingin melihat bagaimana distribusi usia memiliki interaction dengan deposito
ggthemr::ggthemr("flat dark", type = "inner")
age_dis <- dt %>%
select(age, deposit) %>%
ggplot(aes(x = age, fill = deposit))+
geom_histogram(alpha = 0.9, col = "grey")+
scale_x_continuous(breaks = seq(min(18), max(95), by = 10))+
theme(legend.position = "right")+
labs(x = "", y = "")+
labs(title = "Distribution Age")+
theme(plot.title = element_text(hjust = 0.5))
age_dis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Insight
Usia yang membuka deposito paling tinggi dilakukan pada range usia 28-40 dan di rentang usia tersebut pula terjadi banyak penolakan pembukaan deposito.
jika kita amati graph yang menarik adalah di range usia 60-80 tidak semasif di range usia 28-40, tapi penolakan pembukaan deposito justru hanya sedikit di usia tersebut ini mengindikasikan di usia tersebut orang cenderung lebih tertarik untuk membuka deposito, kemungkinan hal ini terjadi karena cash flow yang rutin dari deposito cocok untuk usia pensiun yaitu 60-80.
Dari semua graph dan analysis di atas, kami mempunyai beberapa kesimpulan.