1 Purposes & Scope

Purposes:

General Purposes: Conduct a comprehensive analysis of the bank data and articulate key insights derived from the findings.

Special Purposes: Find the solution from insights information.

Scope:

Dataset: A sample of customer data comprising debit, credit, and deposit transactions.

Time Period: - 

Tool: R Studio

2 Prepare The Data

2.1 Load The Library

Let’s load these library packages to make it easier to work.

library(dplyr)
library(tidyverse)
library(readxl)
library(tidyr)
library(hrbrthemes)
library(ggridges)
library(ggplot2)
library(kableExtra)
library(plotly)
library(viridis)
library(corrplot)
library(RColorBrewer)
library(scales)

2.2 Import Data

The data used in this project is ABC bank customer data in the Greater Jakarta area of Indonesia.dim(), names(), summary(), and rmarkdown::paged_table are used to know the data background.

jawa <- read_excel("Jabodetabek.xlsx", sheet=1)
dim(jawa)

## [1] 6940    8

The dataset comprises eight variables and 6,940 observations.

names(jawa)

## [1] "id"      "city"    "gender"  "name"    "date"    "debit"   "credit" 
## [8] "deposit"

The 8 Variables are:

ID
City
Gender
Name
Date
Debit
Credit
Deposit

summary(jawa)

##        id              city              gender              name          
##  Min.   :     33   Length:6940        Length:6940        Length:6940       
##  1st Qu.: 187465   Class :character   Class :character   Class :character  
##  Median : 445320   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 484521                                                           
##  3rd Qu.: 792419                                                           
##  Max.   :1048546                                                           
##       date                        debit               credit     
##  Min.   :1987-01-01 00:00:00   Length:6940        Min.   :  1.5  
##  1st Qu.:1991-10-01 18:00:00   Class :character   1st Qu.:100.7  
##  Median :1996-07-01 12:00:00   Mode  :character   Median :185.2  
##  Mean   :1996-07-01 12:00:00                      Mean   :194.4  
##  3rd Qu.:2001-04-01 06:00:00                      3rd Qu.:270.0  
##  Max.   :2005-12-31 00:00:00                      Max.   :665.9  
##     deposit     
##  Min.   : 61.6  
##  1st Qu.:196.5  
##  Median :245.6  
##  Mean   :252.3  
##  3rd Qu.:301.4  
##  Max.   :624.8

Detailed description of each variable:

ID: This variable represents a unique identifier for each observation in the dataset.
City: Categorical variable indicating the city associated with each observation.
Gender: This variable categorizes individuals based on gender.
Name: The ‘name’ variable represents the names of individuals, recorded as character data.
Date: The ‘date’ variable indicates the date of customer date of birth.
Debit: This variable reflects the debit amounts associated with each observation.
Credit: The ‘credit’ variable indicates the credit amounts.
Deposit: This variable represents the deposit amounts associated with each observation.

Overall, it is important to note that the variables date and debit require restructuring to align with the desired format for analysis, as they are currently categorized as character data rather than in their appropriate date and numeric formats.

rmarkdown::paged_table((jawa))

2.3 Check Unique Data

unique(jawa$gender)

## [1] "M"      "F"      "gender"

There is one unique data in gender column, so we will delete it and fill it with modus

# Count Modus From gender
modus_gender <- jawa %>%
  filter(gender != "gender") %>%  
  count(gender) %>%               
  arrange(desc(n)) %>%            
  slice(1) %>%                     # Take modus from the most value
  pull(gender)                     # Take gender value as modus

# Change gender value with modus
bank_data_clean <- jawa %>%
  mutate(gender = ifelse(gender == "gender", modus_gender, gender))

# Result
table(bank_data_clean$gender)

## 
##    F    M 
## 3845 3095

There is 3845 female with 3095 male.

jawa <- bank_data_clean

2.4 Check NA Values

To check missing values:

colnames(jawa)[ apply(jawa, 2, anyNA) ]

## [1] "city"  "debit"

city and debit column have missing values.

# Find missing value from every column
colSums(is.na(jawa))

##      id    city  gender    name    date   debit  credit deposit 
##       0       7       0       0       0       2       0       0

city has 7 data missing values and debit has 2 data missing values.

2.5 Change Data Type

jawa$date <-as.Date(jawa$date, format = "%m/%d/%Y")
jawa$debit <- suppressWarnings(as.numeric(jawa$debit))

Adjust the date and debit variables to ensure they conform to the required format for effective analysis.

2.6 Fill The Missing Value for Numeric Data

Filling missing numeric data with the mean is common for a few reasons:

Represents the Data: The mean reflects the average value of the data, so it doesn’t skew too high or too low.
Avoids Bias: Filling with the mean keeps the overall data balance without introducing extreme values.
Simple to Apply: It’s easy to calculate and use, making it a quick solution for handling missing data.
Maintains Data Variation: While it won’t add new variation, it also won’t remove much of the existing variability.

However, this method can reduce variation around missing values, especially if there are outliers or a skewed distribution.

jawa$debit[is.na(jawa$debit)] <-mean(jawa$debit,na.rm= T)
jawa$date      <-as.Date(jawa$date)

# Find missing value from every column
colSums(is.na(jawa))

##      id    city  gender    name    date   debit  credit deposit 
##       0       7       0       0       0       0       0       0

Now, there is no NA value on debit variable.

2.7 Fill The Missing Value for String Type Data

Filling missing string data with the mode (most frequent value) is common because:

Represents the Most Common Category: The mode reflects the value that appears the most, ensuring consistency with the majority of the data.
Prevents Adding Irrelevant Values: It avoids introducing values that don’t fit the context since the mode is already a frequent category.
Simple and Logical: It’s easy to apply and often makes sense for categorical data like cities or gender.
Maintains Data Balance: Filling with the mode keeps the overall distribution of categories intact.

However, if the data is highly diverse, the mode might not always represent the missing values well.

# Change NA with Modus

get_mode <- function(x) {
  uniq_vals <- na.omit(x)  # Remove NA from data
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]  # Take the value with the highest frequency
}
modus <- get_mode(jawa$city)

jawa$city[is.na(jawa$city)] <- modus

# Find missing value from every column
colSums(is.na(jawa))

##      id    city  gender    name    date   debit  credit deposit 
##       0       0       0       0       0       0       0       0

Now, there is no NA value on city variable.

3 Visualization

3.1 Customer Plot

First, we need to calculate how many male and female customers are.

Male <- sum(bank_data$Male)
Male

## [1] 3095

There are 3095 male customers.

## [1] 3845

There are 3845 female customers.

3.1.1 All Customer

Customer distribution based on birth year is almost the same.

3.1.2 Female Visualization

The female customers at this bank are mostly women born in 1999, as many as 262, and the lowest number was in 1994, as many as 159.

Customers born from 1987 to 2005 and from Jakarta remain constant as customers, however, customers born from 2003 - 2005 experienced a decline so that only six female customers were left who were born in 2005. Female customers born from 1998 to 2005 from Bekasi began to become bank customers indicating that female customers are starting to have awareness of banks. On the other hand, areas such as Bogor, Depok, and Tangerang do not have stable female customers during the period of the customer’s birth year.

3.1.3 Male Visualization

Male customers born in 1997-1999 have a decrease in becoming bank customers, while the largest number of male customers (206) were born in 1994.

Jakarta male customers born in 1987-2005 became regular customers in Jakarta, but experienced a decline in 2005 with the result of only 1 male customer born in 2005. Male customers in Bekasi born between 1998 and 2005 became bank customers, and there was an increase among those born in 2005. While the graph above shows the lack of consistency from male customers from the Bogor, Depok, Tangerang areas to become customers at the bank.

3.2 Age

Because the birth chart is difficult to see, we will add a new column of age to facilitate analysis and grouped by class:

Teenagers: 12-25 years
Adults: 26-45 years
Elderly: 46-65 years

Thereis 63,2% Adult and 36,8% Teenage Customer.

3.2.1 Female Distribution

There is 60,9% female adult and 39,1% teenage adult.

3.2.2 Male Distribution

There is 66% male adult and 34% teenage adult.

4 Debit, Credit, and Deposit

Make a new dataset by grouping the year and the city and summarizing the total of the debit, credit, and deposit.

4.1 Assets

We will use the assumption that assets are debits plus deposits to find out which city assets at Bank are the most and the least.

jkt           <- filter(jawa, city == "Jakarta")
Assetjkt      <- mutate(jkt, Asset = debit + deposit)
Assettotaljkt <- sum(Assetjkt$Asset, na.rm=T)

bgr           <- filter(jawa, city == "Bogor")
Assetbgr      <- mutate(bgr, Asset = debit + deposit)
Assettotalbgr <- sum(Assetbgr$Asset, na.rm=T)

tgr           <- filter(jawa, city == "Tangerang")
Assettgr      <- mutate(tgr, Asset = debit + deposit)
Assettotaltgr <- sum(Assettgr$Asset, na.rm=T)

depok           <- filter(jawa, city == "Depok")
Assetdepok      <- mutate(depok, Asset = debit + deposit)
Assettotaldepok <- sum(Assetdepok$Asset, na.rm=T)

bekasi           <- filter(jawa, city == "Bekasi")
Assetbekasi      <- mutate(bekasi, Asset = debit + deposit)
Assettotalbekasi <- sum(Assetbekasi$Asset, na.rm=T)

Alloftheasset <- data.frame(Assettotaljkt, Assettotalbgr, Assettotaltgr, Assettotaldepok, Assettotalbekasi)

# Count total from every column
total_per_column <- colSums(Alloftheasset)

# Sorting
sorted_columns <- names(sort(total_per_column, decreasing = TRUE))
Alloftheasset_sorted <- Alloftheasset[, sorted_columns]

# Change each column to currency format with commas
total_per_column <- colSums(Alloftheasset)
Alloftheasset_formatted <- Alloftheasset_sorted %>%
  mutate(across(everything(), ~ comma(.)))

Alloftheasset_formatted

Highest asset is customer from Jakarta area with the lowest asset is customer from Bogor.

4.2 Net Asset

We will use the assumption that assets are debits plus deposits then deducted by credit to find out which city assets at Bank are the most and the least.

net_jkt           <- filter(jawa, city == "Jakarta")
net_Assetjkt      <- mutate(net_jkt, Asset = debit + deposit-credit)
net_Assettotaljkt <- sum(net_Assetjkt$Asset, na.rm=T)

net_bgr           <- filter(jawa, city == "Bogor")
net_Assetbgr      <- mutate(net_bgr, Asset = debit + deposit-credit)
net_Assettotalbgr <- sum(net_Assetbgr$Asset, na.rm=T)

net_tgr           <- filter(jawa, city == "Tangerang")
net_Assettgr      <- mutate(net_tgr, Asset = debit + deposit-credit)
net_Assettotaltgr <- sum(net_Assettgr$Asset, na.rm=T)

net_depok           <- filter(jawa, city == "Depok")
net_Assetdepok      <- mutate(net_depok, Asset = debit + deposit-credit)
net_Assettotaldepok <- sum(net_Assetdepok$Asset, na.rm=T)

net_bekasi           <- filter(jawa, city == "Bekasi")
net_Assetbekasi      <- mutate(bekasi, Asset = debit + deposit-credit)
net_Assettotalbekasi <- sum(net_Assetbekasi$Asset, na.rm=T)

net_Alloftheasset <- data.frame(net_Assettotaljkt, net_Assettotalbgr, net_Assettotaltgr, net_Assettotaldepok, net_Assettotalbekasi)

# Change each column to currency format with commas
total_per_column <- colSums(net_Alloftheasset)

# Sorting
sorted_columns <- names(sort(total_per_column, decreasing = TRUE))
Alloftheasset_sorted <- net_Alloftheasset[, sorted_columns]

Alloftheasset_formatted <- Alloftheasset_sorted %>%
  mutate(across(everything(), ~ comma(.)))

Alloftheasset_formatted

Highest asset is customer from Jakarta area with the lowest asset is customer from Bogor.

Net Asset will be used for further analysis, with the reason:

More Accurate: Net Asset takes into account debt, reflecting the true net worth.
Relevant for Analysis: Provides a picture of wealth after obligations are met.
Avoid Misleading: Not only showing total assets which can give an exaggerated impression.

4.3 Segementation Asset per Customer

Manual Segmentation Based on Assets:

Low Assets (Micro Segment) Assets: Below $500.

Description: This can include micro-enterprises that are just starting out with very little capital. They may only have a basic savings account and are using micro-credit services to grow their business.
Medium Assets (Small Segment)

Assets: $500 - $1,500

Description: Customers in this category are usually small business owners who have been operating for some time and have a more stable capital turnover.
High Assets (Medium Segment)

Assets: Above $1,500

Description: MSME customers in this category usually have more developed businesses with more complex banking needs.

 yearly_variable <- yearly_variable %>% mutate(
       age = 2024 - year,  
  
    age_group = case_when(
      age >= 12 & age <= 25 ~ "Teenage",
      age >= 26 & age <= 45 ~ "Adult",
      age >= 46 & age <= 65 ~ "Eldery",
      TRUE ~ "Others"),
    net_asset = debit + deposit - credit,
    
    asset_group = case_when(
    net_asset < 500 ~ "Low Assets",
    net_asset >= 500 & net_asset <= 1500 ~ "Middle Assets",
    net_asset > 1500~ "High Assets") )
yearly_variable

There is 4820 customer with Low asset followed by 2117 Middle Assets, and 3 High Assets.

The results of manual customer segmentation are 69.5% Low Assets, 30.5% Middle Assets, 0.0432% High Assets.

K-Means Clustering Segmentation

# === Option 2: Segmentation with K-Means Clustering ===
# Set seed for consistent results and determine the number of clusters (e.g. 3)
set.seed(123)
num_clusters <- 3

# Run k-means clustering based on total assets
kmeans_result <- kmeans(yearly_variable$net_asset, centers = num_clusters)

# Add cluster results into yearly_variable
yearly_variable$cluster <- as.factor(kmeans_result$cluster)

# Displays clustering results
yearly_variable

From the distribution results, it is known that the green color is low assets, pink is middle assets, and blue is high assets.

The results of the asset segmentation distribution using K Means Clustering are 36.8% Low Asset, 47% Middle Asset, and 16.2% High Asset.

K-Means Clustering is used for further analysis, with the reasons:

1. Automation and Efficiency:

K-Means automatically groups data based on hidden patterns, making it faster and more efficient than manual clustering.

2. Scalability:

K-Means can handle large amounts of data quickly, while manual clustering is difficult to apply as the data grows.

3. Objectivity:

K-Means generates clusters based on mathematical distances, avoiding subjective bias that can arise in manual clustering.

4. Consistency:

The results of K-Means are more consistent, while manual clustering can vary depending on the person performing it.

5. Multi-Dimensional Analysis:

K-Means is more effective in grouping data with many variables, while manual clustering is more difficult to perform on complex data.

4.4 Other Analysis

prop_gender_asset <- yearly_variable %>%
  group_by( gender, cluster_labels) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

prop_gender_asset

Female customers with middle assets are the most numerous, while the lowest are males with high assets.

prop_year_asset <- yearly_variable %>%
  group_by(year, cluster_labels) %>%
  summarise(count = n()) %>%
  arrange(desc(count))  

prop_year_asset

Customers born in 1988 with middle assets are the most, while the lowest are those born in 2005 with high assets.

prop_age_asset <- yearly_variable %>%
  group_by(age_group, cluster_labels) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

prop_age_asset

Adult customers with middle assets are the most, while the lowest are teenagers with high assets.

prop_year_gender_asset <- yearly_variable %>%
  group_by(year, gender, cluster_labels) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

prop_year_gender_asset

Female customers with low assets born in 1999 are the most. Meanwhile, male customers with high assets born in 1999 are the least.

prop_age_gender_asset <- yearly_variable %>%
  group_by(age_group, gender, cluster_labels) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

prop_age_gender_asset

There are the most adult female customers with middle assets, while there are the fewest teenage male customers with high assets.

5 Conclusion

Majority of Customers Are from Jakarta and Bekasi Should Be Retained.
Customers from Bogor, Depok, Tangerang Are Unstable and Few in Number.
Largest Assets Come from Jakarta Customers, Smallest from Bogor.
Asset Segmentation Distribution with K-Means: 36% Low Asset, 47% Middle Asset, and 16.2% High Asset.
Largest Customer Segment: Adult Female with Middle Asset, Followed by Adult Male with Middle Asset.
Smallest Customer Segment: Teenage Female and Teenage Male with High Assets.

6 Solution

Retention Focus on Jakarta and Bekasi Customers:
Jakarta and Bekasi customers should be prioritized to maintain loyalty. One way to implement this is by offering exclusive loyalty programs, such as reward points that can be exchanged for gifts or service discounts. Additionally, premium services like priority customer service or access to special financial products can also be offered.
Development Strategy in Bogor, Depok, Tangerang:
To increase the number of customers in the Bogor, Depok, and Tangerang areas, more localized marketing campaigns can be implemented, such as targeted social media promotions or opening small service branches in strategic locations. For example, offering low-interest loan promotions or special referral programs for residents in these areas.
Increase Asset Conversion for Bogor Customers:
Customers in Bogor can be encouraged to grow their assets by offering investment products or long-term savings plans tailored to their needs. For instance, offering education savings plans with attractive interest rates or affordable insurance products with small investments for young families.
Optimize Offers for the Middle Asset Segment:
Since most customers fall into the middle asset segment, the bank can develop more varied investment products, such as mid-range mutual funds or micro-business loan programs. For example, offering integrated investment packages with life insurance or business capital loans with competitive interest rates to promote asset growth.
Increased Focus on Teen Customers with High Assets:
The teen segment with high assets should be nurtured due to its long-term potential. Example strategies include investment mentorship programs or financial classes that teach wealth management from an early age. Additionally, offer special millennial investment accounts with more flexibility or managed stock products that match their lifestyle.
Product Diversification for Low-Asset Customers:
Customers with low assets can be supported through more accessible financial products, such as micro-savings accounts with no administrative fees or micro-loans without collateral for urgent needs. For example, a daily savings program with very low minimum deposits or micro-loan products with fast approval processes to support small businesses.

Analysis of Bank Data

Sofia N. S.

19/10/2024