Student 1: Alavath Ruthik
Roll No: 34 | Reg No: 12321995 | Group 1
Student 2: Hirak Roy
Roll No: 26 | Reg No: 12301094 | Group 1
In this project, we analyze a Credit Card Customer Dataset containing customer details such as age, income, card type, credit limit, payment ratio, current balance, utilization rate, and monthly spending across different categories. The dataset also includes essential (“Needs”) and non-essential (“Wants/Lifestyle”) spend categories, allowing deeper behavioral segmentation.
Using R programming for data cleaning, transformation, and visualization, the project aims to identify hidden patterns in how customers spend and repay their credit card dues. The analysis helps classify customers into various groups such as high spenders, low-risk users, lifestyle-driven users, and potential defaulters. Spending categories, repayment behavior, age groups, and credit utilization are explored to understand how different customer segments exhibit distinct financial habits.This study not only highlights the financial behavior of customers but also supports decision-making for credit risk assessment. By analyzing spending patterns and repayment discipline, the project offers insights that can help financial institutions reduce default risk, understand customer needs, and improve credit product strategies.
# Load all required libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(psych)
library(scales) # for comma formatting
library(corrplot)
library(readxl)
library(stringr)
# ----------------------------
# 1. LOAD THE DATASET
# ----------------------------
credit_data <- read_excel("C:/Users/alava/Downloads/credit_card_dataset_no_spareparts.xlsx")
# View structure & summary
str(credit_data)
## tibble [1,000 × 24] (S3: tbl_df/tbl/data.frame)
## $ Customer_ID : chr [1:1000] "C0001" "C0002" "C0003" "C0004" ...
## $ Age : num [1:1000] 59 49 35 63 28 41 59 39 43 31 ...
## $ Income : num [1:1000] 24324 96323 29111 189182 36389 ...
## $ Card_Type : chr [1:1000] "Standard" "Standard" "Gold" "Gold" ...
## $ Credit_Limit : num [1:1000] 74179 70379 132172 159197 74014 ...
## $ Current_Balance : num [1:1000] 29019 44586 34789 35678 66416 ...
## $ Payment_Ratio : num [1:1000] 0.9 0.58 0.83 0.4 0.31 0.31 0.71 0.62 0.66 0.31 ...
## $ Late_Payments : num [1:1000] 1 2 2 2 2 1 2 3 1 3 ...
## $ Utilisation_Rate : num [1:1000] 0.39 0.63 0.26 0.22 0.9 0.88 0.85 0.63 0.79 0.16 ...
## $ Defaulted : chr [1:1000] "No" "No" "No" "No" ...
## $ Grocery_Spend : num [1:1000] 19565 17738 3631 18905 12370 ...
## $ Online_Spend : num [1:1000] 27485 28199 22815 26773 19862 ...
## $ Travel_Spend : num [1:1000] 16015 37756 6344 41198 48461 ...
## $ Dining_Spend : num [1:1000] 682 7240 12544 664 7458 ...
## $ Utilities_Spend : num [1:1000] 8977 14147 8731 19986 13641 ...
## $ Entertainment_Spend: num [1:1000] 9443 1802 4767 11828 6707 ...
## $ Monthly_Spend : num [1:1000] 82167 106882 58832 119354 108499 ...
## $ Clothes_Spend : num [1:1000] 17795 2860 7390 13964 13284 ...
## $ Bill_Payments : num [1:1000] 2387 14591 13880 11953 3679 ...
## $ Shopping_Spend : num [1:1000] 14218 27612 11471 7863 15693 ...
## $ Fuel_Spend : num [1:1000] 7040 6934 6925 7286 5628 ...
## $ Medical_Spend : num [1:1000] 2446 3283 4144 5741 6920 ...
## $ Total_Monthly_Spend: num [1:1000] 126053 162162 102642 166161 153703 ...
## $ Spender_Type : chr [1:1000] "Luxury Spender" "Luxury Spender" "High Spender" "Luxury Spender" ...
summary(credit_data)
## Customer_ID Age Income Card_Type
## Length:1000 Min. :21.00 Min. : 20126 Length:1000
## Class :character 1st Qu.:33.00 1st Qu.: 67569 Class :character
## Mode :character Median :46.00 Median :111092 Mode :character
## Mean :45.38 Mean :112332
## 3rd Qu.:57.00 3rd Qu.:157153
## Max. :69.00 Max. :199917
## Credit_Limit Current_Balance Payment_Ratio Late_Payments
## Min. : 30112 Min. : 4053 Min. :0.3000 Min. :0.000
## 1st Qu.: 63752 1st Qu.: 27831 1st Qu.:0.4700 1st Qu.:1.000
## Median :104532 Median : 52621 Median :0.6400 Median :2.000
## Mean :145894 Mean : 75016 Mean :0.6469 Mean :1.914
## 3rd Qu.:183114 3rd Qu.: 95767 3rd Qu.:0.8300 3rd Qu.:3.000
## Max. :499569 Max. :458417 Max. :1.0000 Max. :7.000
## Utilisation_Rate Defaulted Grocery_Spend Online_Spend
## Min. :0.100 Length:1000 Min. : 1005 Min. : 1020
## 1st Qu.:0.310 Class :character 1st Qu.: 6060 1st Qu.: 8445
## Median :0.510 Mode :character Median :10641 Median :15709
## Mean :0.524 Mean :10656 Mean :15632
## 3rd Qu.:0.730 3rd Qu.:15389 3rd Qu.:23030
## Max. :0.950 Max. :19975 Max. :29890
## Travel_Spend Dining_Spend Utilities_Spend Entertainment_Spend
## Min. : 16 Min. : 500 Min. : 818 Min. : 327
## 1st Qu.:12214 1st Qu.: 3896 1st Qu.: 5952 1st Qu.: 3322
## Median :23915 Median : 7142 Median :10763 Median : 6292
## Mean :24279 Mean : 7436 Mean :10688 Mean : 6329
## 3rd Qu.:36174 3rd Qu.:10795 3rd Qu.:15582 3rd Qu.: 9449
## Max. :49916 Max. :14997 Max. :19994 Max. :11983
## Monthly_Spend Clothes_Spend Bill_Payments Shopping_Spend
## Min. : 22606 Min. : 2009 Min. : 1001 Min. : 5011
## 1st Qu.: 62081 1st Qu.: 6657 1st Qu.: 4378 1st Qu.:11415
## Median : 74825 Median :10516 Median : 8210 Median :17262
## Mean : 75020 Mean :10806 Mean : 8088 Mean :17516
## 3rd Qu.: 88238 3rd Qu.:15133 3rd Qu.:11698 3rd Qu.:23709
## Max. :127699 Max. :19988 Max. :14998 Max. :29983
## Fuel_Spend Medical_Spend Total_Monthly_Spend Spender_Type
## Min. :1006 Min. : 502 Min. : 62605 Length:1000
## 1st Qu.:2873 1st Qu.:2940 1st Qu.:106366 Class :character
## Median :4600 Median :4900 Median :120738 Mode :character
## Mean :4530 Mean :5087 Mean :121047
## 3rd Qu.:6204 3rd Qu.:7401 3rd Qu.:135570
## Max. :7999 Max. :9986 Max. :181590
# ----------------------------
# 2. CLEAN COLUMN NAMES
# ----------------------------
credit_data <- credit_data %>%
rename_all(~ str_replace_all(., " ", "_")) %>% # replace spaces
rename_all(~ str_replace_all(., "\\/", "_")) %>% # replace slashes
rename_all(tolower) # lowercase names
# ----------------------------
# 3. HANDLE MISSING VALUES
# ----------------------------
# Count missing values
colSums(is.na(credit_data))
## customer_id age income card_type
## 0 0 0 0
## credit_limit current_balance payment_ratio late_payments
## 0 0 0 0
## utilisation_rate defaulted grocery_spend online_spend
## 0 0 0 0
## travel_spend dining_spend utilities_spend entertainment_spend
## 0 0 0 0
## monthly_spend clothes_spend bill_payments shopping_spend
## 0 0 0 0
## fuel_spend medical_spend total_monthly_spend spender_type
## 0 0 0 0
# Fill NA in numeric columns with median
numeric_cols <- sapply(credit_data, is.numeric)
credit_data[numeric_cols] <- lapply(credit_data[numeric_cols], function(x){
replace(x, is.na(x), median(x, na.rm = TRUE))
})
# Fill NA in character columns with "Unknown"
cat_cols <- sapply(credit_data, is.character)
credit_data[cat_cols] <- lapply(credit_data[cat_cols], function(x){
replace(x, is.na(x), "Unknown")
})
# ----------------------------
# 4. CREATE AGE CATEGORIES
# ----------------------------
credit_data <- credit_data %>%
mutate(
age_category = case_when(
age >= 25 & age <= 35 ~ "25-35",
age >= 36 & age <= 55 ~ "36-55",
age >= 56 & age <= 75 ~ "56-75",
TRUE ~ "Other"
)
)
# ----------------------------
# 5. CREATE UTILIZATION RATE
# ----------------------------
# If columns exist: current_balance, credit_limit
credit_data <- credit_data %>%
mutate(utilization_rate = current_balance / credit_limit)
# ----------------------------
# 6. CREATE NEEDS & WANTS SPENDING (Corrected)
# ----------------------------
# NEEDS = essential expenses
credit_data <- credit_data %>%
mutate(
needs_spend = grocery_spend +
utilities_spend +
bill_payments +
fuel_spend +
medical_spend
)
# WANTS = lifestyle expenses
credit_data <- credit_data %>%
mutate(
wants_spend = shopping_spend +
dining_spend +
travel_spend +
entertainment_spend +
clothes_spend +
online_spend
)
# ----------------------------
# 7. CREATE PAYMENT CATEGORY
# ----------------------------
credit_data <- credit_data %>%
mutate(
payment_category = case_when(
payment_ratio < 0.30 ~ "Low Payer",
payment_ratio >= 0.30 & payment_ratio < 0.70 ~ "Partial Payer",
payment_ratio >= 0.70 ~ "Good Payer",
TRUE ~ "Unknown"
)
)
# ----------------------------
# 8. FINAL CLEANED DATASET
# ----------------------------
glimpse(credit_data)
## Rows: 1,000
## Columns: 29
## $ customer_id <chr> "C0001", "C0002", "C0003", "C0004", "C0005", "C000…
## $ age <dbl> 59, 49, 35, 63, 28, 41, 59, 39, 43, 31, 31, 44, 56…
## $ income <dbl> 24324, 96323, 29111, 189182, 36389, 142177, 61427,…
## $ card_type <chr> "Standard", "Standard", "Gold", "Gold", "Standard"…
## $ credit_limit <dbl> 74179, 70379, 132172, 159197, 74014, 41738, 77498,…
## $ current_balance <dbl> 29019, 44586, 34789, 35678, 66416, 36729, 66210, 7…
## $ payment_ratio <dbl> 0.90, 0.58, 0.83, 0.40, 0.31, 0.31, 0.71, 0.62, 0.…
## $ late_payments <dbl> 1, 2, 2, 2, 2, 1, 2, 3, 1, 3, 3, 1, 1, 4, 2, 1, 1,…
## $ utilisation_rate <dbl> 0.39, 0.63, 0.26, 0.22, 0.90, 0.88, 0.85, 0.63, 0.…
## $ defaulted <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ grocery_spend <dbl> 19565, 17738, 3631, 18905, 12370, 19721, 18012, 36…
## $ online_spend <dbl> 27485, 28199, 22815, 26773, 19862, 14055, 28086, 1…
## $ travel_spend <dbl> 16015, 37756, 6344, 41198, 48461, 9106, 34206, 475…
## $ dining_spend <dbl> 682, 7240, 12544, 664, 7458, 2637, 10715, 954, 278…
## $ utilities_spend <dbl> 8977, 14147, 8731, 19986, 13641, 12644, 9394, 1421…
## $ entertainment_spend <dbl> 9443, 1802, 4767, 11828, 6707, 8601, 2654, 5619, 5…
## $ monthly_spend <dbl> 82167, 106882, 58832, 119354, 108499, 66764, 10306…
## $ clothes_spend <dbl> 17795, 2860, 7390, 13964, 13284, 8265, 18850, 6426…
## $ bill_payments <dbl> 2387, 14591, 13880, 11953, 3679, 3106, 9173, 5223,…
## $ shopping_spend <dbl> 14218, 27612, 11471, 7863, 15693, 6177, 9760, 5011…
## $ fuel_spend <dbl> 7040, 6934, 6925, 7286, 5628, 7370, 2336, 4700, 19…
## $ medical_spend <dbl> 2446, 3283, 4144, 5741, 6920, 8057, 3145, 7089, 68…
## $ total_monthly_spend <dbl> 126053, 162162, 102642, 166161, 153703, 99739, 146…
## $ spender_type <chr> "Luxury Spender", "Luxury Spender", "High Spender"…
## $ age_category <chr> "56-75", "36-55", "25-35", "56-75", "25-35", "36-5…
## $ utilization_rate <dbl> 0.3912024, 0.6335128, 0.2632101, 0.2241123, 0.8973…
## $ needs_spend <dbl> 40415, 56693, 37311, 63871, 42238, 50898, 42060, 3…
## $ wants_spend <dbl> 85638, 105469, 65331, 102290, 111465, 48841, 10427…
## $ payment_category <chr> "Good Payer", "Partial Payer", "Good Payer", "Part…
head(credit_data)
## # A tibble: 6 × 29
## customer_id age income card_type credit_limit current_balance payment_ratio
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 C0001 59 24324 Standard 74179 29019 0.9
## 2 C0002 49 96323 Standard 70379 44586 0.58
## 3 C0003 35 29111 Gold 132172 34789 0.83
## 4 C0004 63 189182 Gold 159197 35678 0.4
## 5 C0005 28 36389 Standard 74014 66416 0.31
## 6 C0006 41 142177 Standard 41738 36729 0.31
## # ℹ 22 more variables: late_payments <dbl>, utilisation_rate <dbl>,
## # defaulted <chr>, grocery_spend <dbl>, online_spend <dbl>,
## # travel_spend <dbl>, dining_spend <dbl>, utilities_spend <dbl>,
## # entertainment_spend <dbl>, monthly_spend <dbl>, clothes_spend <dbl>,
## # bill_payments <dbl>, shopping_spend <dbl>, fuel_spend <dbl>,
## # medical_spend <dbl>, total_monthly_spend <dbl>, spender_type <chr>,
## # age_category <chr>, utilization_rate <dbl>, needs_spend <dbl>, …
#Q: What is the typical monthly spending behavior of credit card customers, and are there extreme spending outliers?
library(scales)
ggplot(credit_data, aes(x = monthly_spend)) +
geom_histogram( #geometric object
bins = 30,
fill = "#0073C2FF",
color = "black",
alpha = 0.8
) +
stat_bin(
bins = 30,
geom = "text",
aes(label = ..count..),
vjust = -0.5,
size = 3
) +
scale_x_continuous(
labels = comma # 50,000 instead of 5e+04
) +
labs(
title = "Distribution of Monthly Spend",
x = "Monthly Spend (₹)",
y = "Number of Customers"
) +
theme_minimal(base_size = 14)
Most customers spend between ₹50,000 and ₹1,00,000
each month.
- The highest spending cluster is ₹75,000–₹85,000, with the largest
bar.
- Very low spenders (< ₹30,000) and very high spenders (>
₹1,20,000) are rare.
- The shape is roughly bell-shaped, meaning spending is stable across
customers.
- Only a few outliers exist on both ends of the range.
#Q: How does a customer’s credit utilization rate affect their payment behavior?
ggplot(credit_data, aes(x = utilization_rate,
y = payment_ratio,
color = age_category)) +
geom_point(size = 3, alpha = 0.8) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Utilization Rate vs Payment Ratio",
x = "Utilization Rate (Percentage of Credit Limit Used)",
y = "Payment Ratio (Percentage of Bill Paid)",
color = "Age Category"
) +
theme_minimal(base_size = 14)
Bottom-right points (high utilization, low
payment ratio)
→ These are high-risk customers.
They use a lot of
cr++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++edit
but pay back very little.
Top-left points (low utilization, high payment
ratio)
→ These are low-risk customers who spend responsibly and pay
fully.
Middle cluster
→ Most customers fall in a moderate zone (40–70% utilization and 40–70%
payment ratio).
Age Insight
Colors show age groups.
If younger customers cluster at high utilization, they may show
aggressive spending.
#Q: Among the top 100 highest-spending customers, how does credit limit influence monthly spending? Do customers with higher limits spend more?
library(dplyr)
library(ggplot2)
library(scales)
# Select Top 100 Customers by Monthly Spend
top100 <- credit_data %>%
arrange(desc(monthly_spend)) %>%
slice(1:100)
ggplot(top100, aes(x = credit_limit, y = monthly_spend, color = age_category)) +
geom_point(size = 3, alpha = 0.8) +
geom_smooth(method = "lm", se = TRUE, color = "red", size = 1.2) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Credit Limit vs Monthly Spend (Top 100 Customers)",
x = "Credit Limit (₹)",
y = "Monthly Spend (₹)",
color = "Age Category"
) +
theme_minimal(base_size = 14)
The plot shows the relationship between credit limits and monthly spending for the top 100 highest-spending customers.
#Q: Do customers spend more on essential needs or lifestyle wants?
# Calculate average spending
avg_spend <- credit_data %>%
summarise(
avg_needs = mean(needs_spend),
avg_wants = mean(wants_spend)
) %>%
pivot_longer(cols = everything(), names_to = "category", values_to = "average_spend")
# Bar chart
ggplot(avg_spend, aes(x = category, y = average_spend, fill = category)) +
geom_col(width = 0.6) +
scale_y_continuous(labels = comma) +
labs(
title = "Average Needs vs Wants Spending",
x = "Category",
y = "Average Spend (₹)"
) +
theme_minimal(base_size = 14) +
scale_fill_manual(values = c("avg_needs" = "#1f77b4", "avg_wants" = "#ff7f0e"))
The bar chart shows that customers spend significantly more on wants
than on essential needs.
Average Wants Spend (~₹82,000) is more than double the Average Needs Spend (~₹39,000).
This indicates that customer spending is lifestyle-driven, with a larger portion going toward shopping, dining, travel, entertainment, and online purchases.
The pattern suggests higher discretionary behaviour, which may increase financial risk if payment discipline is low.
#Q: How does utilization differ between defaulters and non-defaulters?
# ---- 1. Fix utilization_rate (replace Inf/NA) ----
credit_data <- credit_data %>%
mutate(
utilization_rate = ifelse(
is.infinite(utilization_rate) | is.na(utilization_rate),
NA,
utilization_rate
)
)
# ---- 2. Clean defaulted column ----
credit_data <- credit_data %>%
mutate(
defaulted = case_when(
defaulted %in% c(1, "1", "yes", "Yes", "Y") ~ 1,
defaulted %in% c(0, "0", "no", "No", "N") ~ 0,
TRUE ~ NA_real_
)
)
# ---- 3. Filter valid rows ----
plot_df <- credit_data %>%
filter(!is.na(utilization_rate), !is.na(defaulted))
# ---- 4. Boxplot ----
ggplot(plot_df, aes(x = factor(defaulted), y = utilization_rate, fill = factor(defaulted))) +
geom_boxplot(alpha = 0.7) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Utilization Rate vs Default Status",
x = "Defaulted (0 = No, 1 = Yes)",
y = "Utilization Rate",
fill = "Defaulted"
) +
theme_minimal(base_size = 14)
The boxplot shows a clear difference in utilization behaviour between defaulters and non-defaulters:
Defaulters (1) have very high utilization rates, mostly between 75% and 90%, meaning they consistently use a large portion of their credit limit.
Non-defaulters (0) show much lower and more varied utilization, mostly between 25% and 60%.
The median utilization of defaulters is far higher, indicating a strong link between high credit usage and default risk.
#** the correlation of the age income credit limist**
library(corrplot)
numeric_cols <- credit_data %>%
select(age, income, credit_limit, current_balance, monthly_spend,
needs_spend, wants_spend, utilization_rate, payment_ratio)
corr_matrix <- cor(numeric_cols)
corrplot(corr_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
number.cex = 0.6,
title = "Correlation Heatmap of Financial Variables",
mar = c(0,0,2,0))
library(ggplot2)
library(scales)
# ---- 1. Simple Linear Regression Model ----
model <- lm(monthly_spend ~ credit_limit, data = credit_data)
# ---- 2. Print Summary ----
summary(model)
##
## Call:
## lm(formula = monthly_spend ~ credit_limit, data = credit_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52511 -13047 -109 13198 52699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.455e+04 9.675e+02 77.057 <2e-16 ***
## credit_limit 3.192e-03 5.243e-03 0.609 0.543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18730 on 998 degrees of freedom
## Multiple R-squared: 0.0003713, Adjusted R-squared: -0.0006304
## F-statistic: 0.3707 on 1 and 998 DF, p-value: 0.5428
# ---- 3. Regression Plot ----
ggplot(credit_data, aes(x = credit_limit, y = monthly_spend)) +
geom_point(color = "#0073C2FF", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = TRUE, size = 1.2) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Simple Linear Regression: Credit Limit vs Monthly Spend",
x = "Credit Limit (INR)",
y = "Monthly Spend (INR)"
) +
theme_minimal(base_size = 14)
The regression plot shows a positive relationship between credit limit and monthly spending:
As the credit limit increases, customers tend to spend more on average.
The red regression line slopes upward, indicating that credit limit is a significant predictor of monthly spend.
The shaded grey area shows the confidence interval, meaning we are reasonably confident about the trend.
Although the points are somewhat spread out, the trend remains clear: Higher-limit customers generally show higher spending ability.
Some customers with high limits still spend less → indicates conservative spenders.
# Convert age category into a factor
credit_data$age_category <- factor(credit_data$age_category)
# 1. Check group sizes
table(credit_data$age_category)
##
## 25-35 36-55 56-75 Other
## 201 433 281 85
# 2. One-way ANOVA
anova_model <- aov(monthly_spend ~ age_category, data = credit_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## age_category 3 6.430e+08 214325463 0.61 0.608
## Residuals 996 3.498e+11 351163882
# 3. Check ANOVA Assumptions
## (a) Normality of residuals
shapiro.test(residuals(anova_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(anova_model)
## W = 0.99738, p-value = 0.1069
## (b) Homogeneity of variance
library(car)
leveneTest(monthly_spend ~ age_category, data = credit_data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.6656 0.5732
## 996
# 4. Post-hoc test (Tukey HSD)
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = monthly_spend ~ age_category, data = credit_data)
##
## $age_category
## diff lwr upr p adj
## 36-55-25-35 1698.4959 -2417.362 5814.354 0.7127890
## 56-75-25-35 1302.5129 -3152.306 5757.332 0.8756467
## Other-25-35 2971.3603 -3267.903 9210.623 0.6107235
## 56-75-36-55 -395.9831 -4090.093 3298.126 0.9926675
## Other-36-55 1272.8644 -4448.100 6993.829 0.9402625
## Other-56-75 1668.8474 -4300.621 7638.316 0.8893823
# 5. Visualization
library(ggplot2)
library(scales)
ggplot(credit_data, aes(x = age_category, y = monthly_spend, fill = age_category)) +
geom_boxplot(alpha = 0.8) +
scale_y_continuous(labels = comma) +
labs(
title = "Monthly Spending Across Age Groups",
x = "Age Group",
y = "Monthly Spend (₹)",
fill = "Age Group"
) +
theme_minimal(base_size = 14)
# ----------------------------------------------------------
# KNN Classification: Predicting whether a customer defaults
# ----------------------------------------------------------
library(class)
library(ggplot2)
library(dplyr)
# Ensure defaulted is numeric/factor
credit_data <- credit_data %>%
mutate(
defaulted = case_when(
defaulted %in% c(1, "1", "yes", "Yes", "Y") ~ 1,
defaulted %in% c(0, "0", "no", "No", "N") ~ 0,
TRUE ~ NA_real_
)
)
# Remove missing default values
knn_data <- credit_data %>%
select(defaulted, monthly_spend, credit_limit,
utilization_rate, payment_ratio, income, current_balance) %>%
filter(!is.na(defaulted))
# Convert defaulted to factor
knn_data$defaulted <- factor(knn_data$defaulted)
# Normalize numeric predictors
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
knn_norm <- knn_data %>%
mutate(across(-defaulted, ~ normalize(.x)))
# Train-test split (80/20)
set.seed(123)
n <- nrow(knn_norm)
train_idx <- sample(1:n, size = 0.8*n)
train <- knn_norm[train_idx, ]
test <- knn_norm[-train_idx, ]
train_x <- train %>% select(-defaulted)
test_x <- test %>% select(-defaulted)
train_y <- train$defaulted
test_y <- test$defaulted
# Run KNN (k = 5)
knn_pred <- knn(train = train_x,
test = test_x,
cl = train_y,
k = 5)
# Confusion Matrix
conf_matrix <- table(Predicted = knn_pred, Actual = test_y)
conf_matrix
## Actual
## Predicted 0 1
## 0 196 4
## 1 0 0
# Accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.98
# ----------------------------------------------------------
# Visualization: Plotting KNN predictions
# Using the two strongest predictors for plotting:
# credit_limit (x-axis) vs monthly_spend (y-axis)
# ----------------------------------------------------------
plot_df <- test %>%
mutate(predicted = knn_pred)
ggplot(plot_df, aes(x = credit_limit, y = monthly_spend, color = predicted)) +
geom_point(size = 3, alpha = 0.8) +
labs(
title = "KNN Classification Results (k = 5)",
x = "Credit Limit (Normalized)",
y = "Monthly Spend (Normalized)",
color = "Predicted\nDefault"
) +
theme_minimal(base_size = 14)
This project provided a comprehensive descriptive and visual analysis of credit card customer behavior using a dataset of 1,000 individuals. Through systematic data cleaning, feature engineering, and visualization in R, we uncovered several important patterns in spending behavior, credit usage, and repayment discipline.
The analysis revealed that most customers maintain moderate monthly spending, typically between ₹50,000 and ₹1,00,000, with only a few extreme outliers. Spending behavior is strongly lifestyle-driven, as customers consistently allocate more money toward non-essential “wants” such as shopping, entertainment, and travel compared to essential needs.
When examining financial discipline, we observed clear relationships between credit utilization, payment ratios, and age groups. Customers who maintain low utilization and high payment ratios appear financially responsible, whereas those with high utilization and low repayment ratios potentially indicate risky financial habits. Age also plays a role, with some younger customers showing more aggressive spending patterns.
The analysis of credit limit and spending among the top 100 customers highlighted a positive correlation—higher credit limits often encourage higher spending. However, a subset of customers underutilizes their available credit, showing more conservative financial behavior.
Finally, the default risk analysis clearly demonstrated that defaulters exhibit significantly higher credit utilization rates (typically 75–90%) compared to non-defaulters (25–60%). This strong separation confirms that credit utilization is one of the most powerful indicators of default risk.
Overall, the project successfully identified key behavioral insights that can support financial institutions in designing better credit policies, improving customer segmentation, reducing default rates, and tailoring products to customer needs. The findings highlight the importance of monitoring utilization patterns, encouraging healthy repayment behavior, and understanding customer lifestyle preferences to ensure sustainable credit management.