Pendahuluan

Laporan ini menganalisis dataset Banking Customers (1.000 nasabah, 7 variabel) dengan fokus pada hubungan antara pendapatan dan perilaku menabung. Pertanyaan utama yang dijawab: apakah pendapatan yang lebih tinggi otomatis berarti perilaku menabung yang lebih baik?


1. Setup & Load Data

Package yang dibutuhkan: readxl untuk baca Excel, tidyverse untuk manipulasi data dan visualisasi, scales untuk format angka di chart.

library(readxl)
library(tidyverse)
library(scales)
df_raw <- read_excel("04_BANKING_CUSTOMERS.xlsx")

cat("Dataset awal:\n")
## Dataset awal:
cat("  Rows :", nrow(df_raw), "\n")
##   Rows : 1000
cat("  Cols :", ncol(df_raw), "\n\n")
##   Cols : 7
glimpse(df_raw)
## Rows: 1,000
## Columns: 7
## $ customer_id      <chr> "C000001", "C000002", "C000003", "C000004", "C000005"…
## $ age              <dbl> 57, 30, 55, 20, 33, 59, 24, 71, 70, 20, 31, 33, 21, 4…
## $ income           <dbl> 51808666, 33241740, 121506340, 17855323, 51440557, 18…
## $ balance          <dbl> 71898776, 36624170, 222845105, 57806634, 34534238, 16…
## $ num_transactions <dbl> 23, 24, 15, 30, 13, 26, 19, 26, 21, 17, 8, 19, 15, 14…
## $ credit_score     <dbl> 627.8335, 668.0386, 545.5959, 639.2831, 590.2093, 642…
## $ account_age      <dbl> 18, 21, 8, 1, 14, 7, 9, 6, 18, 0, 21, 1, 2, 27, 13, 1…

2. Data Cleaning

Saat eksplorasi awal, ditemukan beberapa anomali yang jelas merupakan data error:

Treatment yang diterapkan:

df_clean <- df_raw |>
  # Filter umur ke 17-90 (17 = batas legal buka rekening di Indonesia)
  filter(age >= 17, age <= 90) |>
  # Filter credit score ke range FICO valid
  filter(is.na(credit_score) | between(credit_score, 300, 850)) |>
  # Drop missing values (~5% data, loss kecil)
  drop_na()

cat("Hasil cleaning:\n")
## Hasil cleaning:
cat("  Rows tersisa  :", nrow(df_clean), "\n")
##   Rows tersisa  : 915
cat("  Retained      :", round(nrow(df_clean) / nrow(df_raw) * 100, 1), "%\n")
##   Retained      : 91.5 %

Justifikasi: drop missing values dipilih daripada imputasi karena hanya sekitar 5% data — loss minimal. Imputasi pada data finansial (income, balance) berisiko misleading karena distribusinya skewed.


3. Transformasi: Saving Multiple

Insight utama membutuhkan satu variabel turunan: saving_multiple — rasio saldo terhadap pendapatan tahunan. Variabel ini mengukur perilaku menabung relatif terhadap kapasitas finansial, bukan saldo absolut.

\[\text{saving\_multiple} = \frac{\text{balance}}{\text{income}}\]

Interpretasi: nilai 2 berarti saldo seseorang setara dengan 2 tahun pendapatan tahunan (penabung disiplin); nilai 0.3 berarti saldo hanya 30% dari pendapatan tahunan (boros relatif).

df_clean <- df_clean |>
  mutate(
    saving_multiple = balance / income,

    # Segmentasi berdasarkan saving_multiple
    segment = case_when(
      saving_multiple < 0.5  ~ "Big Spenders",     # Saldo < setengah pendapatan tahunan
      saving_multiple <= 2.0 ~ "Average Savers",   # Saldo 0.5x - 2x pendapatan
      TRUE                   ~ "Hidden Wealthy"    # Saldo > 2x pendapatan
    ),
    segment = factor(segment,
                     levels = c("Big Spenders", "Average Savers", "Hidden Wealthy")),

    # Kelompok umur
    age_group = case_when(
      age < 30 ~ "Young (17-29)",
      age < 50 ~ "Mid (30-49)",
      TRUE     ~ "Senior (50+)"
    ),
    age_group = factor(age_group,
                       levels = c("Young (17-29)", "Mid (30-49)", "Senior (50+)")),

    # Kuartil pendapatan
    income_quartile = factor(
      ntile(income, 4),
      labels = c("Q1\nTerendah", "Q2", "Q3", "Q4\nTertinggi")
    )
  )

# Preview hasil transformasi
df_clean |>
  select(customer_id, age, income, balance, saving_multiple, segment, age_group) |>
  head(5)
## # A tibble: 5 × 7
##   customer_id   age     income    balance saving_multiple segment      age_group
##   <chr>       <dbl>      <dbl>      <dbl>           <dbl> <fct>        <fct>    
## 1 C000001        57  51808666.  71898776.           1.39  Average Sav… Senior (…
## 2 C000002        30  33241740.  36624170.           1.10  Average Sav… Mid (30-…
## 3 C000003        55 121506340. 222845105.           1.83  Average Sav… Senior (…
## 4 C000004        20  17855323.  57806634.           3.24  Hidden Weal… Young (1…
## 5 C000005        33  51440557.  34534238.           0.671 Average Sav… Mid (30-…

4. Analisis Eksploratif

4.1 Distribusi Segmen

Sebagian besar nasabah ternyata adalah penabung disiplin — bukan boros seperti asumsi umum.

df_clean |>
  count(segment) |>
  mutate(persentase = round(n / sum(n) * 100, 1))
## # A tibble: 3 × 3
##   segment            n persentase
##   <fct>          <int>      <dbl>
## 1 Big Spenders     216       23.6
## 2 Average Savers   306       33.4
## 3 Hidden Wealthy   393       43

4.2 Hubungan Kuartil Pendapatan × Segmen

Inilah pola kunci yang akan jadi backbone laporan ini:

df_clean |>
  count(income_quartile, segment) |>
  group_by(income_quartile) |>
  mutate(persentase = round(n / sum(n) * 100, 1)) |>
  select(-n) |>
  pivot_wider(names_from = segment, values_from = persentase)
## # A tibble: 4 × 4
## # Groups:   income_quartile [4]
##   income_quartile `Big Spenders` `Average Savers` `Hidden Wealthy`
##   <fct>                    <dbl>            <dbl>            <dbl>
## 1 "Q1\nTerendah"             8.3             15.7             76  
## 2 "Q2"                      15.7             32.3             52  
## 3 "Q3"                      23.1             45.9             31  
## 4 "Q4\nTertinggi"           47.4             39.9             12.7

Yang menarik di sini: di Q1 (kuartil pendapatan terendah), 76% nasabah adalah Hidden Wealthy. Sementara di Q4 (kuartil pendapatan tertinggi), justru 47% nasabah adalah Big Spenders — dan Hidden Wealthy turun jadi 13%.

Trennya konsisten linear, dan ini counterintuitive: makin tinggi pendapatan, makin boros perilaku menabungnya.

4.3 Counterintuitive Finding: Muda vs Senior

Bukti paling tajam dari paradoks ini terlihat saat membandingkan dua kelompok ekstrem:

young_wealthy   <- df_clean |> filter(age_group == "Young (17-29)", segment == "Hidden Wealthy")
senior_spenders <- df_clean |> filter(age_group == "Senior (50+)",  segment == "Big Spenders")

cat("Muda & Hidden Wealthy :\n")
## Muda & Hidden Wealthy :
cat("  Jumlah               :", nrow(young_wealthy), "nasabah\n")
##   Jumlah               : 90 nasabah
cat("  Median pendapatan    : Rp", round(median(young_wealthy$income) / 1e6, 1), "juta\n")
##   Median pendapatan    : Rp 29.8 juta
cat("  Median saving multi  :", round(median(young_wealthy$saving_multiple), 2), "x\n\n")
##   Median saving multi  : 4.04 x
cat("Senior & Big Spenders :\n")
## Senior & Big Spenders :
cat("  Jumlah               :", nrow(senior_spenders), "nasabah\n")
##   Jumlah               : 97 nasabah
cat("  Median pendapatan    : Rp", round(median(senior_spenders$income) / 1e6, 1), "juta\n")
##   Median pendapatan    : Rp 84.1 juta
cat("  Median saving multi  :", round(median(senior_spenders$saving_multiple), 2), "x\n\n")
##   Median saving multi  : 0.22 x
ratio_income <- round(median(senior_spenders$income) / median(young_wealthy$income), 1)
ratio_saving <- round(median(young_wealthy$saving_multiple) / median(senior_spenders$saving_multiple), 0)

cat(sprintf("→ Senior berpendapatan %sx lebih besar, tapi disiplin menabung %sx lebih rendah.\n",
            ratio_income, ratio_saving))
## → Senior berpendapatan 2.8x lebih besar, tapi disiplin menabung 18x lebih rendah.

4.4 Korelasi dengan Credit Score

Pertanyaan logis: apakah credit score berkorelasi dengan variabel finansial yang lain?

df_clean |>
  select(credit_score, income, balance, age, account_age,
         num_transactions, saving_multiple) |>
  cor() |>
  round(3) |>
  as.data.frame() |>
  select(credit_score) |>
  arrange(desc(credit_score))
##                  credit_score
## credit_score            1.000
## balance                 0.029
## saving_multiple         0.024
## account_age             0.020
## num_transactions        0.005
## income                 -0.010
## age                    -0.018

Insight: semua korelasi mendekati nol (< 0.05). Credit score di dataset ini tidak punya pola signifikan dengan variabel lain, jadi kita fokus pada variabel yang memberikan signal jelas.


5. Visualisasi

5.1 Setup Tema

# Palet warna konsisten di semua chart
COLORS <- c(
  "Big Spenders"   = "#C97064",
  "Average Savers" = "#D4A574",
  "Hidden Wealthy" = "#5C8A8A"
)

# Custom theme untuk konsistensi visual
theme_banking <- theme_minimal(base_size = 12) +
  theme(
    plot.background  = element_rect(fill = "#FBF7F0", color = NA),
    panel.background = element_rect(fill = "#FBF7F0", color = NA),
    plot.title       = element_text(face = "bold", size = 14,
                                    color = "#3D2B1F", margin = margin(b = 8)),
    axis.title       = element_text(face = "bold", size = 11, color = "#3D2B1F"),
    axis.text        = element_text(color = "#7A6856", size = 10),
    panel.grid.major = element_line(color = "#E8E0D5", linewidth = 0.5),
    panel.grid.minor = element_blank(),
    legend.position  = "top",
    legend.title     = element_blank(),
    legend.text      = element_text(size = 11, color = "#3D2B1F"),
    plot.margin      = margin(16, 20, 16, 16),
    strip.text       = element_text(face = "bold", color = "#3D2B1F", size = 11)
  )

theme_set(theme_banking)

5.2 Hero Chart — Scatter Pendapatan vs Saldo

p1 <- df_clean |>
  ggplot(aes(x = income / 1e6, y = balance / 1e6, color = segment)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", color = "#999",
              linewidth = 0.8, alpha = 0.7) +
  geom_point(alpha = 0.55, size = 2, shape = 16) +
  scale_color_manual(values = COLORS) +
  # coord_cartesian (bukan xlim/ylim) — clip visual tanpa drop data, no warning
  coord_cartesian(xlim = c(0, 450), ylim = c(0, 600)) +
  annotate("text", x = 380, y = 410,
           label = "Saldo = Pendapatan",
           color = "#666", size = 3.2, hjust = 1, fontface = "italic") +
  labs(
    title = 'Paradoks Penabung: Siapa Yang Sebenarnya "Kaya"?',
    x     = "Pendapatan Tahunan (Rp Juta)",
    y     = "Saldo Rekening (Rp Juta)"
  )

print(p1)

ggsave("01_hero_chart_R.png", p1,
       width = 11, height = 6.5, dpi = 180, bg = "#FBF7F0")

Reading chart: garis putus-putus diagonal adalah titik di mana saldo setara dengan pendapatan tahunan (saving multiple = 1x). Titik di atas garis = penabung disiplin (Hidden Wealthy, hijau-teal). Titik di bawah garis = boros relatif (Big Spenders, merah-coral). Cluster Hidden Wealthy banyak di kiri atas — low income, high balance. Cluster Big Spenders banyak di kanan bawah — high income, low balance.

5.3 Paradoks Pendapatan vs Disiplin Menabung

quartile_pct <- df_clean |>
  count(income_quartile, segment) |>
  group_by(income_quartile) |>
  mutate(pct = n / sum(n) * 100)

p2 <- quartile_pct |>
  ggplot(aes(x = income_quartile, y = pct, fill = segment)) +
  geom_col(color = "white", linewidth = 1.5, width = 0.65) +
  geom_text(aes(label = paste0(round(pct), "%")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 4) +
  scale_fill_manual(values = COLORS) +
  scale_y_continuous(labels = label_percent(scale = 1),
                     expand = expansion(mult = c(0, 0.02))) +
  labs(
    title = "Semakin Tinggi Pendapatan, Semakin Boros Perilakunya",
    x     = "Kuartil Pendapatan",
    y     = "Distribusi Segmen (%)"
  ) +
  theme(legend.position = "bottom")

print(p2)

ggsave("02_income_paradox_R.png", p2,
       width = 10, height = 6, dpi = 180, bg = "#FBF7F0")

Reading chart: proporsi Hidden Wealthy (teal) menurun drastis dari 76% di Q1 ke 13% di Q4. Sebaliknya, proporsi Big Spenders (coral) naik dari 8% di Q1 ke 47% di Q4. Pola inversinya sangat clean.

5.4 Muda Hidden Wealthy vs Senior Big Spenders

compare_data <- bind_rows(
  young_wealthy |> summarise(
    group   = "Muda &\nHidden Wealthy",
    income  = median(income) / 1e6,
    balance = median(balance) / 1e6,
    sm      = median(saving_multiple)
  ),
  senior_spenders |> summarise(
    group   = "Senior &\nBig Spenders",
    income  = median(income) / 1e6,
    balance = median(balance) / 1e6,
    sm      = median(saving_multiple)
  )
) |>
  pivot_longer(c(income, balance, sm),
               names_to = "metric", values_to = "value") |>
  mutate(
    metric = factor(metric,
                    levels = c("income", "balance", "sm"),
                    labels = c("Median Pendapatan (Rp Juta)",
                               "Median Saldo (Rp Juta)",
                               "Saving Multiple")),
    color_key = if_else(grepl("Muda", group), "Hidden Wealthy", "Big Spenders"),
    color_key = factor(color_key, levels = c("Big Spenders", "Hidden Wealthy"))
  )

p3 <- compare_data |>
  ggplot(aes(x = group, y = value, fill = color_key)) +
  geom_col(width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = if_else(grepl("Saving", as.character(metric)),
                                 paste0(round(value, 1), "x"),
                                 format(round(value, 1), nsmall = 1))),
            vjust = -0.4, fontface = "bold", size = 5) +
  facet_wrap(~metric, scales = "free_y") +
  scale_fill_manual(values = COLORS) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(
    title = "Anak Muda Menabung Lebih Disiplin Dari Senior Berpendapatan Tinggi",
    x = NULL, y = NULL
  )

print(p3)

ggsave("03_young_vs_senior_R.png", p3,
       width = 12, height = 5.5, dpi = 180, bg = "#FBF7F0")

Reading chart: meskipun senior Big Spenders punya pendapatan median 2.8x lebih tinggi (Rp 84.1 jt vs Rp 29.8 jt), saving multiple mereka 18x lebih rendah (0.2x vs 4.0x). Anak muda berpendapatan kecil justru lebih disiplin menabung.


6. Kesimpulan & Rekomendasi Aksi

6.1 Temuan Utama

  1. Disiplin menabung TIDAK ditentukan oleh besarnya pendapatan. Pola di data menunjukkan korelasi yang berlawanan: makin tinggi income, makin boros perilaku.
  2. Hidden Wealthy (43% dari nasabah) mayoritas berada di kuartil pendapatan rendah-menengah — mereka adalah penabung disiplin yang sering terabaikan dalam targeting tradisional.
  3. Big Spenders Q4 (47% nasabah berpendapatan tertinggi) adalah segmen berisiko tinggi dari sisi retention dan financial health.

6.2 Rekomendasi Strategis

Segmen Profil Rekomendasi
Hidden Wealthy (dominan Q1) Penabung disiplin meski income kecil Loyalty program + tawaran deposito berjangka
Big Spenders (dominan Q4) High income tapi low balance Cross-sell produk investasi + financial advisory
Senior Big Spenders High earner, low saving discipline Retirement planning + wealth management
Young Hidden Wealthy Saving champion dengan growth runway Reksa dana + education savings

6.3 Catatan Metodologi

  • Cleaning menghasilkan 915 baris dari 1.000 (91.5% retained), cukup untuk inferensi yang reliable.
  • Variabel turunan saving_multiple lebih informatif dibanding saldo absolut karena memungkinkan perbandingan lintas tingkat pendapatan.
  • Credit score di dataset ini tidak menunjukkan korelasi signifikan dengan variabel lain, sehingga tidak dijadikan fokus analisis.