Laporan ini menganalisis dataset Banking Customers (1.000 nasabah, 7 variabel) dengan fokus pada hubungan antara pendapatan dan perilaku menabung. Pertanyaan utama yang dijawab: apakah pendapatan yang lebih tinggi otomatis berarti perilaku menabung yang lebih baik?
Package yang dibutuhkan: readxl untuk baca Excel,
tidyverse untuk manipulasi data dan visualisasi,
scales untuk format angka di chart.
library(readxl)
library(tidyverse)
library(scales)
df_raw <- read_excel("04_BANKING_CUSTOMERS.xlsx")
cat("Dataset awal:\n")
## Dataset awal:
cat(" Rows :", nrow(df_raw), "\n")
## Rows : 1000
cat(" Cols :", ncol(df_raw), "\n\n")
## Cols : 7
glimpse(df_raw)
## Rows: 1,000
## Columns: 7
## $ customer_id <chr> "C000001", "C000002", "C000003", "C000004", "C000005"…
## $ age <dbl> 57, 30, 55, 20, 33, 59, 24, 71, 70, 20, 31, 33, 21, 4…
## $ income <dbl> 51808666, 33241740, 121506340, 17855323, 51440557, 18…
## $ balance <dbl> 71898776, 36624170, 222845105, 57806634, 34534238, 16…
## $ num_transactions <dbl> 23, 24, 15, 30, 13, 26, 19, 26, 21, 17, 8, 19, 15, 14…
## $ credit_score <dbl> 627.8335, 668.0386, 545.5959, 639.2831, 590.2093, 642…
## $ account_age <dbl> 18, 21, 8, 1, 14, 7, 9, 6, 18, 0, 21, 1, 2, 27, 13, 1…
Saat eksplorasi awal, ditemukan beberapa anomali yang jelas merupakan data error:
Treatment yang diterapkan:
df_clean <- df_raw |>
# Filter umur ke 17-90 (17 = batas legal buka rekening di Indonesia)
filter(age >= 17, age <= 90) |>
# Filter credit score ke range FICO valid
filter(is.na(credit_score) | between(credit_score, 300, 850)) |>
# Drop missing values (~5% data, loss kecil)
drop_na()
cat("Hasil cleaning:\n")
## Hasil cleaning:
cat(" Rows tersisa :", nrow(df_clean), "\n")
## Rows tersisa : 915
cat(" Retained :", round(nrow(df_clean) / nrow(df_raw) * 100, 1), "%\n")
## Retained : 91.5 %
Justifikasi: drop missing values dipilih daripada imputasi karena hanya sekitar 5% data — loss minimal. Imputasi pada data finansial (income, balance) berisiko misleading karena distribusinya skewed.
Insight utama membutuhkan satu variabel turunan:
saving_multiple — rasio saldo terhadap
pendapatan tahunan. Variabel ini mengukur perilaku menabung
relatif terhadap kapasitas finansial, bukan saldo absolut.
\[\text{saving\_multiple} = \frac{\text{balance}}{\text{income}}\]
Interpretasi: nilai 2 berarti saldo seseorang setara dengan 2 tahun pendapatan tahunan (penabung disiplin); nilai 0.3 berarti saldo hanya 30% dari pendapatan tahunan (boros relatif).
df_clean <- df_clean |>
mutate(
saving_multiple = balance / income,
# Segmentasi berdasarkan saving_multiple
segment = case_when(
saving_multiple < 0.5 ~ "Big Spenders", # Saldo < setengah pendapatan tahunan
saving_multiple <= 2.0 ~ "Average Savers", # Saldo 0.5x - 2x pendapatan
TRUE ~ "Hidden Wealthy" # Saldo > 2x pendapatan
),
segment = factor(segment,
levels = c("Big Spenders", "Average Savers", "Hidden Wealthy")),
# Kelompok umur
age_group = case_when(
age < 30 ~ "Young (17-29)",
age < 50 ~ "Mid (30-49)",
TRUE ~ "Senior (50+)"
),
age_group = factor(age_group,
levels = c("Young (17-29)", "Mid (30-49)", "Senior (50+)")),
# Kuartil pendapatan
income_quartile = factor(
ntile(income, 4),
labels = c("Q1\nTerendah", "Q2", "Q3", "Q4\nTertinggi")
)
)
# Preview hasil transformasi
df_clean |>
select(customer_id, age, income, balance, saving_multiple, segment, age_group) |>
head(5)
## # A tibble: 5 × 7
## customer_id age income balance saving_multiple segment age_group
## <chr> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 C000001 57 51808666. 71898776. 1.39 Average Sav… Senior (…
## 2 C000002 30 33241740. 36624170. 1.10 Average Sav… Mid (30-…
## 3 C000003 55 121506340. 222845105. 1.83 Average Sav… Senior (…
## 4 C000004 20 17855323. 57806634. 3.24 Hidden Weal… Young (1…
## 5 C000005 33 51440557. 34534238. 0.671 Average Sav… Mid (30-…
Sebagian besar nasabah ternyata adalah penabung disiplin — bukan boros seperti asumsi umum.
df_clean |>
count(segment) |>
mutate(persentase = round(n / sum(n) * 100, 1))
## # A tibble: 3 × 3
## segment n persentase
## <fct> <int> <dbl>
## 1 Big Spenders 216 23.6
## 2 Average Savers 306 33.4
## 3 Hidden Wealthy 393 43
Inilah pola kunci yang akan jadi backbone laporan ini:
df_clean |>
count(income_quartile, segment) |>
group_by(income_quartile) |>
mutate(persentase = round(n / sum(n) * 100, 1)) |>
select(-n) |>
pivot_wider(names_from = segment, values_from = persentase)
## # A tibble: 4 × 4
## # Groups: income_quartile [4]
## income_quartile `Big Spenders` `Average Savers` `Hidden Wealthy`
## <fct> <dbl> <dbl> <dbl>
## 1 "Q1\nTerendah" 8.3 15.7 76
## 2 "Q2" 15.7 32.3 52
## 3 "Q3" 23.1 45.9 31
## 4 "Q4\nTertinggi" 47.4 39.9 12.7
Yang menarik di sini: di Q1 (kuartil pendapatan terendah), 76% nasabah adalah Hidden Wealthy. Sementara di Q4 (kuartil pendapatan tertinggi), justru 47% nasabah adalah Big Spenders — dan Hidden Wealthy turun jadi 13%.
Trennya konsisten linear, dan ini counterintuitive: makin tinggi pendapatan, makin boros perilaku menabungnya.
Bukti paling tajam dari paradoks ini terlihat saat membandingkan dua kelompok ekstrem:
young_wealthy <- df_clean |> filter(age_group == "Young (17-29)", segment == "Hidden Wealthy")
senior_spenders <- df_clean |> filter(age_group == "Senior (50+)", segment == "Big Spenders")
cat("Muda & Hidden Wealthy :\n")
## Muda & Hidden Wealthy :
cat(" Jumlah :", nrow(young_wealthy), "nasabah\n")
## Jumlah : 90 nasabah
cat(" Median pendapatan : Rp", round(median(young_wealthy$income) / 1e6, 1), "juta\n")
## Median pendapatan : Rp 29.8 juta
cat(" Median saving multi :", round(median(young_wealthy$saving_multiple), 2), "x\n\n")
## Median saving multi : 4.04 x
cat("Senior & Big Spenders :\n")
## Senior & Big Spenders :
cat(" Jumlah :", nrow(senior_spenders), "nasabah\n")
## Jumlah : 97 nasabah
cat(" Median pendapatan : Rp", round(median(senior_spenders$income) / 1e6, 1), "juta\n")
## Median pendapatan : Rp 84.1 juta
cat(" Median saving multi :", round(median(senior_spenders$saving_multiple), 2), "x\n\n")
## Median saving multi : 0.22 x
ratio_income <- round(median(senior_spenders$income) / median(young_wealthy$income), 1)
ratio_saving <- round(median(young_wealthy$saving_multiple) / median(senior_spenders$saving_multiple), 0)
cat(sprintf("→ Senior berpendapatan %sx lebih besar, tapi disiplin menabung %sx lebih rendah.\n",
ratio_income, ratio_saving))
## → Senior berpendapatan 2.8x lebih besar, tapi disiplin menabung 18x lebih rendah.
Pertanyaan logis: apakah credit score berkorelasi dengan variabel finansial yang lain?
df_clean |>
select(credit_score, income, balance, age, account_age,
num_transactions, saving_multiple) |>
cor() |>
round(3) |>
as.data.frame() |>
select(credit_score) |>
arrange(desc(credit_score))
## credit_score
## credit_score 1.000
## balance 0.029
## saving_multiple 0.024
## account_age 0.020
## num_transactions 0.005
## income -0.010
## age -0.018
Insight: semua korelasi mendekati nol (< 0.05). Credit score di dataset ini tidak punya pola signifikan dengan variabel lain, jadi kita fokus pada variabel yang memberikan signal jelas.
# Palet warna konsisten di semua chart
COLORS <- c(
"Big Spenders" = "#C97064",
"Average Savers" = "#D4A574",
"Hidden Wealthy" = "#5C8A8A"
)
# Custom theme untuk konsistensi visual
theme_banking <- theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#FBF7F0", color = NA),
panel.background = element_rect(fill = "#FBF7F0", color = NA),
plot.title = element_text(face = "bold", size = 14,
color = "#3D2B1F", margin = margin(b = 8)),
axis.title = element_text(face = "bold", size = 11, color = "#3D2B1F"),
axis.text = element_text(color = "#7A6856", size = 10),
panel.grid.major = element_line(color = "#E8E0D5", linewidth = 0.5),
panel.grid.minor = element_blank(),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11, color = "#3D2B1F"),
plot.margin = margin(16, 20, 16, 16),
strip.text = element_text(face = "bold", color = "#3D2B1F", size = 11)
)
theme_set(theme_banking)
p1 <- df_clean |>
ggplot(aes(x = income / 1e6, y = balance / 1e6, color = segment)) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "#999",
linewidth = 0.8, alpha = 0.7) +
geom_point(alpha = 0.55, size = 2, shape = 16) +
scale_color_manual(values = COLORS) +
# coord_cartesian (bukan xlim/ylim) — clip visual tanpa drop data, no warning
coord_cartesian(xlim = c(0, 450), ylim = c(0, 600)) +
annotate("text", x = 380, y = 410,
label = "Saldo = Pendapatan",
color = "#666", size = 3.2, hjust = 1, fontface = "italic") +
labs(
title = 'Paradoks Penabung: Siapa Yang Sebenarnya "Kaya"?',
x = "Pendapatan Tahunan (Rp Juta)",
y = "Saldo Rekening (Rp Juta)"
)
print(p1)
ggsave("01_hero_chart_R.png", p1,
width = 11, height = 6.5, dpi = 180, bg = "#FBF7F0")
Reading chart: garis putus-putus diagonal adalah titik di mana saldo setara dengan pendapatan tahunan (saving multiple = 1x). Titik di atas garis = penabung disiplin (Hidden Wealthy, hijau-teal). Titik di bawah garis = boros relatif (Big Spenders, merah-coral). Cluster Hidden Wealthy banyak di kiri atas — low income, high balance. Cluster Big Spenders banyak di kanan bawah — high income, low balance.
quartile_pct <- df_clean |>
count(income_quartile, segment) |>
group_by(income_quartile) |>
mutate(pct = n / sum(n) * 100)
p2 <- quartile_pct |>
ggplot(aes(x = income_quartile, y = pct, fill = segment)) +
geom_col(color = "white", linewidth = 1.5, width = 0.65) +
geom_text(aes(label = paste0(round(pct), "%")),
position = position_stack(vjust = 0.5),
color = "white", fontface = "bold", size = 4) +
scale_fill_manual(values = COLORS) +
scale_y_continuous(labels = label_percent(scale = 1),
expand = expansion(mult = c(0, 0.02))) +
labs(
title = "Semakin Tinggi Pendapatan, Semakin Boros Perilakunya",
x = "Kuartil Pendapatan",
y = "Distribusi Segmen (%)"
) +
theme(legend.position = "bottom")
print(p2)
ggsave("02_income_paradox_R.png", p2,
width = 10, height = 6, dpi = 180, bg = "#FBF7F0")
Reading chart: proporsi Hidden Wealthy (teal) menurun drastis dari 76% di Q1 ke 13% di Q4. Sebaliknya, proporsi Big Spenders (coral) naik dari 8% di Q1 ke 47% di Q4. Pola inversinya sangat clean.
| Segmen | Profil | Rekomendasi |
|---|---|---|
| Hidden Wealthy (dominan Q1) | Penabung disiplin meski income kecil | Loyalty program + tawaran deposito berjangka |
| Big Spenders (dominan Q4) | High income tapi low balance | Cross-sell produk investasi + financial advisory |
| Senior Big Spenders | High earner, low saving discipline | Retirement planning + wealth management |
| Young Hidden Wealthy | Saving champion dengan growth runway | Reksa dana + education savings |
saving_multiple lebih informatif
dibanding saldo absolut karena memungkinkan perbandingan lintas tingkat
pendapatan.