1 Introduction

In this project, we analyze a Credit Card Customer Dataset containing customer details such as age, income, card type, credit limit, payment ratio, current balance, utilization rate, and monthly spending across different categories. The dataset also includes essential (“Needs”) and non-essential (“Wants/Lifestyle”) spend categories, allowing deeper behavioral segmentation.

Using R programming for data cleaning, transformation, and visualization, the project aims to identify hidden patterns in how customers spend and repay their credit card dues. The analysis helps classify customers into various groups such as high spenders, low-risk users, lifestyle-driven users, and potential defaulters. Spending categories, repayment behavior, age groups, and credit utilization are explored to understand how different customer segments exhibit distinct financial habits.This study not only highlights the financial behavior of customers but also supports decision-making for credit risk assessment. By analyzing spending patterns and repayment discipline, the project offers insights that can help financial institutions reduce default risk, understand customer needs, and improve credit product strategies.

2 Data Cleaning and Preprocessing

# Load all required libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(psych)
library(scales)   # for comma formatting
library(corrplot)
library(readxl)
library(stringr)

# ----------------------------
# 1. LOAD THE DATASET
# ----------------------------
credit_data <- read_excel("C:/Users/alava/Downloads/credit_card_dataset_no_spareparts.xlsx")

# View structure & summary
str(credit_data)

## tibble [1,000 × 24] (S3: tbl_df/tbl/data.frame)
##  $ Customer_ID        : chr [1:1000] "C0001" "C0002" "C0003" "C0004" ...
##  $ Age                : num [1:1000] 59 49 35 63 28 41 59 39 43 31 ...
##  $ Income             : num [1:1000] 24324 96323 29111 189182 36389 ...
##  $ Card_Type          : chr [1:1000] "Standard" "Standard" "Gold" "Gold" ...
##  $ Credit_Limit       : num [1:1000] 74179 70379 132172 159197 74014 ...
##  $ Current_Balance    : num [1:1000] 29019 44586 34789 35678 66416 ...
##  $ Payment_Ratio      : num [1:1000] 0.9 0.58 0.83 0.4 0.31 0.31 0.71 0.62 0.66 0.31 ...
##  $ Late_Payments      : num [1:1000] 1 2 2 2 2 1 2 3 1 3 ...
##  $ Utilisation_Rate   : num [1:1000] 0.39 0.63 0.26 0.22 0.9 0.88 0.85 0.63 0.79 0.16 ...
##  $ Defaulted          : chr [1:1000] "No" "No" "No" "No" ...
##  $ Grocery_Spend      : num [1:1000] 19565 17738 3631 18905 12370 ...
##  $ Online_Spend       : num [1:1000] 27485 28199 22815 26773 19862 ...
##  $ Travel_Spend       : num [1:1000] 16015 37756 6344 41198 48461 ...
##  $ Dining_Spend       : num [1:1000] 682 7240 12544 664 7458 ...
##  $ Utilities_Spend    : num [1:1000] 8977 14147 8731 19986 13641 ...
##  $ Entertainment_Spend: num [1:1000] 9443 1802 4767 11828 6707 ...
##  $ Monthly_Spend      : num [1:1000] 82167 106882 58832 119354 108499 ...
##  $ Clothes_Spend      : num [1:1000] 17795 2860 7390 13964 13284 ...
##  $ Bill_Payments      : num [1:1000] 2387 14591 13880 11953 3679 ...
##  $ Shopping_Spend     : num [1:1000] 14218 27612 11471 7863 15693 ...
##  $ Fuel_Spend         : num [1:1000] 7040 6934 6925 7286 5628 ...
##  $ Medical_Spend      : num [1:1000] 2446 3283 4144 5741 6920 ...
##  $ Total_Monthly_Spend: num [1:1000] 126053 162162 102642 166161 153703 ...
##  $ Spender_Type       : chr [1:1000] "Luxury Spender" "Luxury Spender" "High Spender" "Luxury Spender" ...

summary(credit_data)

##  Customer_ID             Age            Income        Card_Type        
##  Length:1000        Min.   :21.00   Min.   : 20126   Length:1000       
##  Class :character   1st Qu.:33.00   1st Qu.: 67569   Class :character  
##  Mode  :character   Median :46.00   Median :111092   Mode  :character  
##                     Mean   :45.38   Mean   :112332                     
##                     3rd Qu.:57.00   3rd Qu.:157153                     
##                     Max.   :69.00   Max.   :199917                     
##   Credit_Limit    Current_Balance  Payment_Ratio    Late_Payments  
##  Min.   : 30112   Min.   :  4053   Min.   :0.3000   Min.   :0.000  
##  1st Qu.: 63752   1st Qu.: 27831   1st Qu.:0.4700   1st Qu.:1.000  
##  Median :104532   Median : 52621   Median :0.6400   Median :2.000  
##  Mean   :145894   Mean   : 75016   Mean   :0.6469   Mean   :1.914  
##  3rd Qu.:183114   3rd Qu.: 95767   3rd Qu.:0.8300   3rd Qu.:3.000  
##  Max.   :499569   Max.   :458417   Max.   :1.0000   Max.   :7.000  
##  Utilisation_Rate  Defaulted         Grocery_Spend    Online_Spend  
##  Min.   :0.100    Length:1000        Min.   : 1005   Min.   : 1020  
##  1st Qu.:0.310    Class :character   1st Qu.: 6060   1st Qu.: 8445  
##  Median :0.510    Mode  :character   Median :10641   Median :15709  
##  Mean   :0.524                       Mean   :10656   Mean   :15632  
##  3rd Qu.:0.730                       3rd Qu.:15389   3rd Qu.:23030  
##  Max.   :0.950                       Max.   :19975   Max.   :29890  
##   Travel_Spend    Dining_Spend   Utilities_Spend Entertainment_Spend
##  Min.   :   16   Min.   :  500   Min.   :  818   Min.   :  327      
##  1st Qu.:12214   1st Qu.: 3896   1st Qu.: 5952   1st Qu.: 3322      
##  Median :23915   Median : 7142   Median :10763   Median : 6292      
##  Mean   :24279   Mean   : 7436   Mean   :10688   Mean   : 6329      
##  3rd Qu.:36174   3rd Qu.:10795   3rd Qu.:15582   3rd Qu.: 9449      
##  Max.   :49916   Max.   :14997   Max.   :19994   Max.   :11983      
##  Monthly_Spend    Clothes_Spend   Bill_Payments   Shopping_Spend 
##  Min.   : 22606   Min.   : 2009   Min.   : 1001   Min.   : 5011  
##  1st Qu.: 62081   1st Qu.: 6657   1st Qu.: 4378   1st Qu.:11415  
##  Median : 74825   Median :10516   Median : 8210   Median :17262  
##  Mean   : 75020   Mean   :10806   Mean   : 8088   Mean   :17516  
##  3rd Qu.: 88238   3rd Qu.:15133   3rd Qu.:11698   3rd Qu.:23709  
##  Max.   :127699   Max.   :19988   Max.   :14998   Max.   :29983  
##    Fuel_Spend   Medical_Spend  Total_Monthly_Spend Spender_Type      
##  Min.   :1006   Min.   : 502   Min.   : 62605      Length:1000       
##  1st Qu.:2873   1st Qu.:2940   1st Qu.:106366      Class :character  
##  Median :4600   Median :4900   Median :120738      Mode  :character  
##  Mean   :4530   Mean   :5087   Mean   :121047                        
##  3rd Qu.:6204   3rd Qu.:7401   3rd Qu.:135570                        
##  Max.   :7999   Max.   :9986   Max.   :181590

# ----------------------------
# 2. CLEAN COLUMN NAMES
# ----------------------------
credit_data <- credit_data %>%
  rename_all(~ str_replace_all(., " ", "_")) %>%    # replace spaces
  rename_all(~ str_replace_all(., "\\/", "_")) %>%  # replace slashes
  rename_all(tolower)                               # lowercase names

# ----------------------------
# 3. HANDLE MISSING VALUES
# ----------------------------

# Count missing values
colSums(is.na(credit_data))

##         customer_id                 age              income           card_type 
##                   0                   0                   0                   0 
##        credit_limit     current_balance       payment_ratio       late_payments 
##                   0                   0                   0                   0 
##    utilisation_rate           defaulted       grocery_spend        online_spend 
##                   0                   0                   0                   0 
##        travel_spend        dining_spend     utilities_spend entertainment_spend 
##                   0                   0                   0                   0 
##       monthly_spend       clothes_spend       bill_payments      shopping_spend 
##                   0                   0                   0                   0 
##          fuel_spend       medical_spend total_monthly_spend        spender_type 
##                   0                   0                   0                   0

# Fill NA in numeric columns with median
numeric_cols <- sapply(credit_data, is.numeric)
credit_data[numeric_cols] <- lapply(credit_data[numeric_cols], function(x){
  replace(x, is.na(x), median(x, na.rm = TRUE))
})

# Fill NA in character columns with "Unknown"
cat_cols <- sapply(credit_data, is.character)
credit_data[cat_cols] <- lapply(credit_data[cat_cols], function(x){
  replace(x, is.na(x), "Unknown")
})



# ----------------------------
# 4. CREATE AGE CATEGORIES
# ----------------------------
credit_data <- credit_data %>%
  mutate(
    age_category = case_when(
      age >= 25 & age <= 35 ~ "25-35",
      age >= 36 & age <= 55 ~ "36-55",
      age >= 56 & age <= 75 ~ "56-75",
      TRUE ~ "Other"
    )
  )

# ----------------------------
# 5. CREATE UTILIZATION RATE
# ----------------------------
# If columns exist: current_balance, credit_limit
credit_data <- credit_data %>%
  mutate(utilization_rate = current_balance / credit_limit)


# ----------------------------
# 6. CREATE NEEDS & WANTS SPENDING (Corrected)
# ----------------------------

# NEEDS = essential expenses
credit_data <- credit_data %>%
  mutate(
    needs_spend = grocery_spend +
                   utilities_spend +
                   bill_payments +
                   fuel_spend +
                   medical_spend
  )

# WANTS = lifestyle expenses
credit_data <- credit_data %>%
  mutate(
    wants_spend = shopping_spend +
                  dining_spend +
                  travel_spend +
                  entertainment_spend +
                  clothes_spend +
                  online_spend
  )


# ----------------------------
# 7. CREATE PAYMENT CATEGORY
# ----------------------------
credit_data <- credit_data %>%
  mutate(
    payment_category = case_when(
      payment_ratio < 0.30 ~ "Low Payer",
      payment_ratio >= 0.30 & payment_ratio < 0.70 ~ "Partial Payer",
      payment_ratio >= 0.70 ~ "Good Payer",
      TRUE ~ "Unknown"
    )
  )

# ----------------------------
# 8. FINAL CLEANED DATASET
# ----------------------------
glimpse(credit_data)

## Rows: 1,000
## Columns: 29
## $ customer_id         <chr> "C0001", "C0002", "C0003", "C0004", "C0005", "C000…
## $ age                 <dbl> 59, 49, 35, 63, 28, 41, 59, 39, 43, 31, 31, 44, 56…
## $ income              <dbl> 24324, 96323, 29111, 189182, 36389, 142177, 61427,…
## $ card_type           <chr> "Standard", "Standard", "Gold", "Gold", "Standard"…
## $ credit_limit        <dbl> 74179, 70379, 132172, 159197, 74014, 41738, 77498,…
## $ current_balance     <dbl> 29019, 44586, 34789, 35678, 66416, 36729, 66210, 7…
## $ payment_ratio       <dbl> 0.90, 0.58, 0.83, 0.40, 0.31, 0.31, 0.71, 0.62, 0.…
## $ late_payments       <dbl> 1, 2, 2, 2, 2, 1, 2, 3, 1, 3, 3, 1, 1, 4, 2, 1, 1,…
## $ utilisation_rate    <dbl> 0.39, 0.63, 0.26, 0.22, 0.90, 0.88, 0.85, 0.63, 0.…
## $ defaulted           <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ grocery_spend       <dbl> 19565, 17738, 3631, 18905, 12370, 19721, 18012, 36…
## $ online_spend        <dbl> 27485, 28199, 22815, 26773, 19862, 14055, 28086, 1…
## $ travel_spend        <dbl> 16015, 37756, 6344, 41198, 48461, 9106, 34206, 475…
## $ dining_spend        <dbl> 682, 7240, 12544, 664, 7458, 2637, 10715, 954, 278…
## $ utilities_spend     <dbl> 8977, 14147, 8731, 19986, 13641, 12644, 9394, 1421…
## $ entertainment_spend <dbl> 9443, 1802, 4767, 11828, 6707, 8601, 2654, 5619, 5…
## $ monthly_spend       <dbl> 82167, 106882, 58832, 119354, 108499, 66764, 10306…
## $ clothes_spend       <dbl> 17795, 2860, 7390, 13964, 13284, 8265, 18850, 6426…
## $ bill_payments       <dbl> 2387, 14591, 13880, 11953, 3679, 3106, 9173, 5223,…
## $ shopping_spend      <dbl> 14218, 27612, 11471, 7863, 15693, 6177, 9760, 5011…
## $ fuel_spend          <dbl> 7040, 6934, 6925, 7286, 5628, 7370, 2336, 4700, 19…
## $ medical_spend       <dbl> 2446, 3283, 4144, 5741, 6920, 8057, 3145, 7089, 68…
## $ total_monthly_spend <dbl> 126053, 162162, 102642, 166161, 153703, 99739, 146…
## $ spender_type        <chr> "Luxury Spender", "Luxury Spender", "High Spender"…
## $ age_category        <chr> "56-75", "36-55", "25-35", "56-75", "25-35", "36-5…
## $ utilization_rate    <dbl> 0.3912024, 0.6335128, 0.2632101, 0.2241123, 0.8973…
## $ needs_spend         <dbl> 40415, 56693, 37311, 63871, 42238, 50898, 42060, 3…
## $ wants_spend         <dbl> 85638, 105469, 65331, 102290, 111465, 48841, 10427…
## $ payment_category    <chr> "Good Payer", "Partial Payer", "Good Payer", "Part…

head(credit_data)

## # A tibble: 6 × 29
##   customer_id   age income card_type credit_limit current_balance payment_ratio
##   <chr>       <dbl>  <dbl> <chr>            <dbl>           <dbl>         <dbl>
## 1 C0001          59  24324 Standard         74179           29019          0.9 
## 2 C0002          49  96323 Standard         70379           44586          0.58
## 3 C0003          35  29111 Gold            132172           34789          0.83
## 4 C0004          63 189182 Gold            159197           35678          0.4 
## 5 C0005          28  36389 Standard         74014           66416          0.31
## 6 C0006          41 142177 Standard         41738           36729          0.31
## # ℹ 22 more variables: late_payments <dbl>, utilisation_rate <dbl>,
## #   defaulted <chr>, grocery_spend <dbl>, online_spend <dbl>,
## #   travel_spend <dbl>, dining_spend <dbl>, utilities_spend <dbl>,
## #   entertainment_spend <dbl>, monthly_spend <dbl>, clothes_spend <dbl>,
## #   bill_payments <dbl>, shopping_spend <dbl>, fuel_spend <dbl>,
## #   medical_spend <dbl>, total_monthly_spend <dbl>, spender_type <chr>,
## #   age_category <chr>, utilization_rate <dbl>, needs_spend <dbl>, …

2.1 Monthly Spend Analysis

#Q: What is the typical monthly spending behavior of credit card customers, and are there extreme spending outliers?

library(scales)

ggplot(credit_data, aes(x = monthly_spend)) +
  geom_histogram(    #geometric object
    bins = 30, 
    fill = "#0073C2FF",
    color = "black",
    alpha = 0.8
  ) +
  stat_bin(
    bins = 30,
    geom = "text",
    aes(label = ..count..),
    vjust = -0.5,
    size = 3
  ) +
  scale_x_continuous(
    labels = comma  # 50,000 instead of 5e+04
  ) +
  labs(
    title = "Distribution of Monthly Spend",
    x = "Monthly Spend (₹)",
    y = "Number of Customers"
  ) +
  theme_minimal(base_size = 14)

Most customers spend between ₹50,000 and ₹1,00,000 each month.
- The highest spending cluster is ₹75,000–₹85,000, with the largest bar.
- Very low spenders (< ₹30,000) and very high spenders (> ₹1,20,000) are rare.
- The shape is roughly bell-shaped, meaning spending is stable across customers.
- Only a few outliers exist on both ends of the range.

3 Utilization Rate vs Payment Ratio

#Q: How does a customer’s credit utilization rate affect their payment behavior?

ggplot(credit_data, aes(x = utilization_rate, 
                        y = payment_ratio, 
                        color = age_category)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    title = "Utilization Rate vs Payment Ratio",
    x = "Utilization Rate (Percentage of Credit Limit Used)",
    y = "Payment Ratio (Percentage of Bill Paid)",
    color = "Age Category"
  ) +
  theme_minimal(base_size = 14)

Bottom-right points (high utilization, low payment ratio)
→ These are high-risk customers.
They use a lot of cr++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++edit but pay back very little.
Top-left points (low utilization, high payment ratio)
→ These are low-risk customers who spend responsibly and pay fully.
Middle cluster
→ Most customers fall in a moderate zone (40–70% utilization and 40–70% payment ratio).
Age Insight
Colors show age groups.
If younger customers cluster at high utilization, they may show aggressive spending.

4 Credit Limit vs Monthly Spend

#Q: Among the top 100 highest-spending customers, how does credit limit influence monthly spending? Do customers with higher limits spend more?

library(dplyr)
library(ggplot2)
library(scales)

# Select Top 100 Customers by Monthly Spend
top100 <- credit_data %>%
  arrange(desc(monthly_spend)) %>%
  slice(1:100)

ggplot(top100, aes(x = credit_limit, y = monthly_spend, color = age_category)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE, color = "red", size = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Credit Limit vs Monthly Spend (Top 100 Customers)",
    x = "Credit Limit (₹)",
    y = "Monthly Spend (₹)",
    color = "Age Category"
  ) +
  theme_minimal(base_size = 14)

The plot shows the relationship between credit limits and monthly spending for the top 100 highest-spending customers.

Customers with higher credit limits generally spend more.
A visible upward trend indicates that credit limit strongly influences customer spending.
Some customers have high limits but low spend, showing under-utilization.
A few customers spend aggressively despite lower limits, indicating potential risk.

5 Needs vs Wants Spending Analysis

#Q: Do customers spend more on essential needs or lifestyle wants?

# Calculate average spending
avg_spend <- credit_data %>%
  summarise(
    avg_needs = mean(needs_spend),
    avg_wants = mean(wants_spend)
  ) %>%
  pivot_longer(cols = everything(), names_to = "category", values_to = "average_spend")

# Bar chart
ggplot(avg_spend, aes(x = category, y = average_spend, fill = category)) +
  geom_col(width = 0.6) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Average Needs vs Wants Spending",
    x = "Category",
    y = "Average Spend (₹)"
  ) +
  theme_minimal(base_size = 14) +
  scale_fill_manual(values = c("avg_needs" = "#1f77b4", "avg_wants" = "#ff7f0e"))

The bar chart shows that customers spend significantly more on wants than on essential needs.

Average Wants Spend (~₹82,000) is more than double the Average Needs Spend (~₹39,000).

This indicates that customer spending is lifestyle-driven, with a larger portion going toward shopping, dining, travel, entertainment, and online purchases.

The pattern suggests higher discretionary behaviour, which may increase financial risk if payment discipline is low.

6 Default Risk Analysis

#Q: How does utilization differ between defaulters and non-defaulters?

# ---- 1. Fix utilization_rate (replace Inf/NA) ----
credit_data <- credit_data %>%
  mutate(
    utilization_rate = ifelse(
      is.infinite(utilization_rate) | is.na(utilization_rate),
      NA,
      utilization_rate
    )
  )

# ---- 2. Clean defaulted column ----
credit_data <- credit_data %>%
  mutate(
    defaulted = case_when(
      defaulted %in% c(1, "1", "yes", "Yes", "Y") ~ 1,
      defaulted %in% c(0, "0", "no", "No", "N") ~ 0,
      TRUE ~ NA_real_
    )
  )

# ---- 3. Filter valid rows ----
plot_df <- credit_data %>%
  filter(!is.na(utilization_rate), !is.na(defaulted))

# ---- 4. Boxplot ----
ggplot(plot_df, aes(x = factor(defaulted), y = utilization_rate, fill = factor(defaulted))) +
  geom_boxplot(alpha = 0.7) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Utilization Rate vs Default Status",
    x = "Defaulted (0 = No, 1 = Yes)",
    y = "Utilization Rate",
    fill = "Defaulted"
  ) +
  theme_minimal(base_size = 14)

The boxplot shows a clear difference in utilization behaviour between defaulters and non-defaulters:

Defaulters (1) have very high utilization rates, mostly between 75% and 90%, meaning they consistently use a large portion of their credit limit.

Non-defaulters (0) show much lower and more varied utilization, mostly between 25% and 60%.

The median utilization of defaulters is far higher, indicating a strong link between high credit usage and default risk.

7 Correlation Scatter Plot (Top 100 Customers)

#** the correlation of the age income credit limist**

library(corrplot)

numeric_cols <- credit_data %>%
  select(age, income, credit_limit, current_balance, monthly_spend,
         needs_spend, wants_spend, utilization_rate, payment_ratio)

corr_matrix <- cor(numeric_cols)

corrplot(corr_matrix,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         number.cex = 0.6,
         title = "Correlation Heatmap of Financial Variables",
         mar = c(0,0,2,0))

8 Simple Linear Regression Analysis

8.0.1 Q: Does a higher credit limit lead to higher monthly spending among customers?

library(ggplot2)
library(scales)

# ---- 1. Simple Linear Regression Model ----
model <- lm(monthly_spend ~ credit_limit, data = credit_data)

# ---- 2. Print Summary ----
summary(model)

## 
## Call:
## lm(formula = monthly_spend ~ credit_limit, data = credit_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52511 -13047   -109  13198  52699 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.455e+04  9.675e+02  77.057   <2e-16 ***
## credit_limit 3.192e-03  5.243e-03   0.609    0.543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18730 on 998 degrees of freedom
## Multiple R-squared:  0.0003713,  Adjusted R-squared:  -0.0006304 
## F-statistic: 0.3707 on 1 and 998 DF,  p-value: 0.5428

# ---- 3. Regression Plot ----
ggplot(credit_data, aes(x = credit_limit, y = monthly_spend)) +
  geom_point(color = "#0073C2FF", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = TRUE, size = 1.2) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Simple Linear Regression: Credit Limit vs Monthly Spend",
    x = "Credit Limit (INR)",
    y = "Monthly Spend (INR)"
  ) +
  theme_minimal(base_size = 14)

The regression plot shows a positive relationship between credit limit and monthly spending:

As the credit limit increases, customers tend to spend more on average.

The red regression line slopes upward, indicating that credit limit is a significant predictor of monthly spend.

The shaded grey area shows the confidence interval, meaning we are reasonably confident about the trend.

Although the points are somewhat spread out, the trend remains clear: Higher-limit customers generally show higher spending ability.

Some customers with high limits still spend less → indicates conservative spenders.

8.1 ANOVA: Monthly Spending Across Age Groups

# Convert age category into a factor
credit_data$age_category <- factor(credit_data$age_category)

# 1. Check group sizes
table(credit_data$age_category)

## 
## 25-35 36-55 56-75 Other 
##   201   433   281    85

# 2. One-way ANOVA
anova_model <- aov(monthly_spend ~ age_category, data = credit_data)
summary(anova_model)

##               Df    Sum Sq   Mean Sq F value Pr(>F)
## age_category   3 6.430e+08 214325463    0.61  0.608
## Residuals    996 3.498e+11 351163882

# 3. Check ANOVA Assumptions

## (a) Normality of residuals
shapiro.test(residuals(anova_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(anova_model)
## W = 0.99738, p-value = 0.1069

## (b) Homogeneity of variance
library(car)
leveneTest(monthly_spend ~ age_category, data = credit_data)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  0.6656 0.5732
##       996

# 4. Post-hoc test (Tukey HSD)
TukeyHSD(anova_model)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = monthly_spend ~ age_category, data = credit_data)
## 
## $age_category
##                  diff       lwr      upr     p adj
## 36-55-25-35 1698.4959 -2417.362 5814.354 0.7127890
## 56-75-25-35 1302.5129 -3152.306 5757.332 0.8756467
## Other-25-35 2971.3603 -3267.903 9210.623 0.6107235
## 56-75-36-55 -395.9831 -4090.093 3298.126 0.9926675
## Other-36-55 1272.8644 -4448.100 6993.829 0.9402625
## Other-56-75 1668.8474 -4300.621 7638.316 0.8893823

# 5. Visualization
library(ggplot2)
library(scales)

ggplot(credit_data, aes(x = age_category, y = monthly_spend, fill = age_category)) +
  geom_boxplot(alpha = 0.8) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Monthly Spending Across Age Groups",
    x = "Age Group",
    y = "Monthly Spend (₹)",
    fill = "Age Group"
  ) +
  theme_minimal(base_size = 14)

8.2 KNN Classification: Predicting Default Risk

# ----------------------------------------------------------
# KNN Classification: Predicting whether a customer defaults
# ----------------------------------------------------------

library(class)
library(ggplot2)
library(dplyr)

# Ensure defaulted is numeric/factor
credit_data <- credit_data %>%
  mutate(
    defaulted = case_when(
      defaulted %in% c(1, "1", "yes", "Yes", "Y") ~ 1,
      defaulted %in% c(0, "0", "no", "No", "N") ~ 0,
      TRUE ~ NA_real_
    )
  )

# Remove missing default values
knn_data <- credit_data %>%
  select(defaulted, monthly_spend, credit_limit,
         utilization_rate, payment_ratio, income, current_balance) %>%
  filter(!is.na(defaulted))

# Convert defaulted to factor
knn_data$defaulted <- factor(knn_data$defaulted)

# Normalize numeric predictors
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

knn_norm <- knn_data %>%
  mutate(across(-defaulted, ~ normalize(.x)))

# Train-test split (80/20)
set.seed(123)
n <- nrow(knn_norm)
train_idx <- sample(1:n, size = 0.8*n)

train <- knn_norm[train_idx, ]
test  <- knn_norm[-train_idx, ]

train_x <- train %>% select(-defaulted)
test_x  <- test %>% select(-defaulted)
train_y <- train$defaulted
test_y  <- test$defaulted

# Run KNN (k = 5)
knn_pred <- knn(train = train_x, 
                test = test_x,
                cl = train_y,
                k = 5)

# Confusion Matrix
conf_matrix <- table(Predicted = knn_pred, Actual = test_y)
conf_matrix

##          Actual
## Predicted   0   1
##         0 196   4
##         1   0   0

# Accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy

## [1] 0.98

# ----------------------------------------------------------
# Visualization: Plotting KNN predictions
# Using the two strongest predictors for plotting:
# credit_limit (x-axis) vs monthly_spend (y-axis)
# ----------------------------------------------------------

plot_df <- test %>%
  mutate(predicted = knn_pred)

ggplot(plot_df, aes(x = credit_limit, y = monthly_spend, color = predicted)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(
    title = "KNN Classification Results (k = 5)",
    x = "Credit Limit (Normalized)",
    y = "Monthly Spend (Normalized)",
    color = "Predicted\nDefault"
  ) +
  theme_minimal(base_size = 14)

9 Conclusion

This project provided a comprehensive descriptive and visual analysis of credit card customer behavior using a dataset of 1,000 individuals. Through systematic data cleaning, feature engineering, and visualization in R, we uncovered several important patterns in spending behavior, credit usage, and repayment discipline.

The analysis revealed that most customers maintain moderate monthly spending, typically between ₹50,000 and ₹1,00,000, with only a few extreme outliers. Spending behavior is strongly lifestyle-driven, as customers consistently allocate more money toward non-essential “wants” such as shopping, entertainment, and travel compared to essential needs.

When examining financial discipline, we observed clear relationships between credit utilization, payment ratios, and age groups. Customers who maintain low utilization and high payment ratios appear financially responsible, whereas those with high utilization and low repayment ratios potentially indicate risky financial habits. Age also plays a role, with some younger customers showing more aggressive spending patterns.

The analysis of credit limit and spending among the top 100 customers highlighted a positive correlation—higher credit limits often encourage higher spending. However, a subset of customers underutilizes their available credit, showing more conservative financial behavior.

Finally, the default risk analysis clearly demonstrated that defaulters exhibit significantly higher credit utilization rates (typically 75–90%) compared to non-defaulters (25–60%). This strong separation confirms that credit utilization is one of the most powerful indicators of default risk.

Overall, the project successfully identified key behavioral insights that can support financial institutions in designing better credit policies, improving customer segmentation, reducing default rates, and tailoring products to customer needs. The findings highlight the importance of monitoring utilization patterns, encouraging healthy repayment behavior, and understanding customer lifestyle preferences to ensure sustainable credit management.

Descriptive and Visual Analysis of Credit Card Customer Data

Student 1: Alavath Ruthik
Roll No: 34 | Reg No: 12321995 | Group 1

Student 2: Hirak Roy
Roll No: 26 | Reg No: 12301094 | Group 1

2025-11-14

1 Introduction

2 Data Cleaning and Preprocessing

2.1 Monthly Spend Analysis

3 Utilization Rate vs Payment Ratio

4 Credit Limit vs Monthly Spend

5 Needs vs Wants Spending Analysis

6 Default Risk Analysis

7 Correlation Scatter Plot (Top 100 Customers)

8 Simple Linear Regression Analysis

8.0.1 Q: Does a higher credit limit lead to higher monthly spending among customers?

8.1 ANOVA: Monthly Spending Across Age Groups

8.2 KNN Classification: Predicting Default Risk

9 Conclusion

Descriptive and Visual Analysis of Credit Card Customer Data

Student 1: Alavath Ruthik Roll No: 34 | Reg No: 12321995 | Group 1 Student 2: Hirak Roy Roll No: 26 | Reg No: 12301094 | Group 1

2025-11-14

1 Introduction

2 Data Cleaning and Preprocessing

2.1 Monthly Spend Analysis

3 Utilization Rate vs Payment Ratio

4 Credit Limit vs Monthly Spend

5 Needs vs Wants Spending Analysis

6 Default Risk Analysis

7 Correlation Scatter Plot (Top 100 Customers)

8 Simple Linear Regression Analysis

8.0.1 Q: Does a higher credit limit lead to higher monthly spending among customers?

8.1 ANOVA: Monthly Spending Across Age Groups

8.2 KNN Classification: Predicting Default Risk

9 Conclusion

Student 1: Alavath Ruthik
Roll No: 34 | Reg No: 12321995 | Group 1

Student 2: Hirak Roy
Roll No: 26 | Reg No: 12301094 | Group 1