Introduction

In this data dive,I am going to explore the transaction data from a digital wallet platform.Main aim is to investigate the probabilities of different groups and identify rare combinations. This will help us better understand patterns of user behavior, spending trends, and anomalies in the data.

Grouping by Columns

Group 1: Grouping by Product Category

We will first group the data by product_category and investigate the summary statistics for product_amount.

# Load necessary libraries
library(dplyr)
library(knitr)

# Group by product category and calculate summary statistics
group1 <- data %>%
  group_by(product_category) %>%
  summarize(mean_amount = mean(product_amount, na.rm = TRUE),
            count = n(),
            sum_amount = sum(product_amount, na.rm = TRUE))

group1
## # A tibble: 20 × 4
##    product_category  mean_amount count sum_amount
##    <chr>                   <dbl> <int>      <dbl>
##  1 Bus Ticket              4748.   235   1115712.
##  2 Education Fee           4718.   286   1349322.
##  3 Electricity Bill        4944.   252   1245973.
##  4 Flight Booking          5205.   216   1124283.
##  5 Food Delivery           5085.   259   1317106.
##  6 Gaming Credits          4921.   231   1136665.
##  7 Gas Bill                5446.   250   1361520.
##  8 Gift Card               4628.   221   1022796.
##  9 Grocery Shopping        4769.   238   1134973.
## 10 Hotel Booking           4816.   274   1319604.
## 11 Insurance Premium       4589.   225   1032609.
## 12 Internet Bill           5353.   233   1247178.
## 13 Loan Repayment          4942.   245   1210682.
## 14 Mobile Recharge         4964.   241   1196335.
## 15 Movie Ticket            4917.   272   1337400.
## 16 Online Shopping         4969.   243   1207396.
## 17 Rent Payment            5013.   251   1258264.
## 18 Streaming Service       4891.   299   1462462.
## 19 Taxi Fare               5104.   256   1306566.
## 20 Water Bill              5131.   273   1400669.

Insights:

Gas Bill, Water Bill categories have high mean_amount tend to be essential services,and Flight Booking has irregular high-cost transactions.These represent essential services or popular subscriptions, leading to frequent and relatively high-value transactions and Flight bookings are typically less frequent but higher in value. Streaming Service and Movie Ticket categories have a higher number of transactions indicate frequent low-cost transactions, contributing to a higher overall total spending.Streaming services are likely popular due to monthly subscriptions, while education fees and movie tickets reflect common spending for students and leisure.Gift Card has rare categories may suggest either infrequent purchases.Gift cards may be purchased for specific occasions, contributing to the lower count.

Group 2: Grouping by Payment Method

We will now group the data by payment_method and summarize the same product_amount column.

group2 <- data %>%
  group_by(payment_method) %>%
  summarize(mean_amount = mean(product_amount, na.rm = TRUE),
            count = n(),
            sum_amount = sum(product_amount, na.rm = TRUE))
group2
## # A tibble: 5 × 4
##   payment_method mean_amount count sum_amount
##   <chr>                <dbl> <int>      <dbl>
## 1 Bank Transfer        4993.  1045   5218142.
## 2 Credit Card          4926.   992   4886969.
## 3 Debit Card           5049.  1022   5160183.
## 4 UPI                  4904.   999   4898629.
## 5 Wallet Balance       4908.   942   4623590.

Insights:

Debit Card has the highest mean transaction amount because One possible reason could be that debit card payments are directly linked to the user’s bank account, which may give them more flexibility with larger payments and otps.Bank Transfer comes second in terms of total transaction amount,The relatively high total spending might indicate that users often choose bank transfers for high-value payments or bills.UPI and Wallet Balance have slightly lower mean transaction amounts, are often preferred for smaller, quick, or frequent transactions, such as paying for groceries or other daily needs.Wallet Balance has the lowest total transaction amount. Wallets are usually capped at a certain balance, and users might only use them for smaller, convenience-based transactions.Credit Card transactions show a mean amount,slightly lower than debit cards and bank transfers because Credit cards might have more transactional restrictions or fees.

Group 3: Grouping by Transaction Status

Lastly, we will group the data by transaction_status and summarize the product_amount column again.

group3 <- data %>%
  group_by(transaction_status) %>%
  summarize(mean_amount = mean(product_amount, na.rm = TRUE),
            count = n(),
            sum_amount = sum(product_amount, na.rm = TRUE))
group3
## # A tibble: 3 × 4
##   transaction_status mean_amount count sum_amount
##   <chr>                    <dbl> <int>      <dbl>
## 1 Failed                   5045.   146    736511.
## 2 Pending                  4970.    99    491984 
## 3 Successful               4955.  4755  23559019.

Insights:

Successful transactions have the highest mean amount, highest count, and highest total spending, showing that users are more likely to complete larger and frequent payments successfully.Failed transactions are more common for lower-value payments, possibly due to issues like insufficient funds, card limits, or abandoned purchases.Pending transactions show intermediate behavior, potentially reflecting mid-sized payments that require additional verification or confirmation before completion.

Probability Calculation and Rare Group Identification

Now, let’s calculate the probability of selecting a row from each group and tag the rarest ones.

group1 <- group1 %>%
  mutate(probability = count / sum(count))
group1 <- group1 %>%
  mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))
group1
## # A tibble: 20 × 6
##    product_category  mean_amount count sum_amount probability tag            
##    <chr>                   <dbl> <int>      <dbl>       <dbl> <chr>          
##  1 Bus Ticket              4748.   235   1115712.      0.047  Normal         
##  2 Education Fee           4718.   286   1349322.      0.0572 Normal         
##  3 Electricity Bill        4944.   252   1245973.      0.0504 Normal         
##  4 Flight Booking          5205.   216   1124283.      0.0432 Low Probability
##  5 Food Delivery           5085.   259   1317106.      0.0518 Normal         
##  6 Gaming Credits          4921.   231   1136665.      0.0462 Normal         
##  7 Gas Bill                5446.   250   1361520.      0.05   Normal         
##  8 Gift Card               4628.   221   1022796.      0.0442 Normal         
##  9 Grocery Shopping        4769.   238   1134973.      0.0476 Normal         
## 10 Hotel Booking           4816.   274   1319604.      0.0548 Normal         
## 11 Insurance Premium       4589.   225   1032609.      0.045  Normal         
## 12 Internet Bill           5353.   233   1247178.      0.0466 Normal         
## 13 Loan Repayment          4942.   245   1210682.      0.049  Normal         
## 14 Mobile Recharge         4964.   241   1196335.      0.0482 Normal         
## 15 Movie Ticket            4917.   272   1337400.      0.0544 Normal         
## 16 Online Shopping         4969.   243   1207396.      0.0486 Normal         
## 17 Rent Payment            5013.   251   1258264.      0.0502 Normal         
## 18 Streaming Service       4891.   299   1462462.      0.0598 Normal         
## 19 Taxi Fare               5104.   256   1306566.      0.0512 Normal         
## 20 Water Bill              5131.   273   1400669.      0.0546 Normal

Observations:

Flight booking has the Lowest probability (4.51%) that implies that for every transaction in the dataset, there is only a small chance.This could indicate that users make flight-related purchases less frequently than they do for everyday services like food delivery or utility bills.Booking a flight is not an everyday event—it’s likely tied to specific occasions like vacations, work travel, or emergencies, contributing to its rarity.

Combining Two Categorical Variables

We will now investigate the combinations of two categorical columns: product_category and payment_method.

combinations <- data %>%
  group_by(product_category, payment_method) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'product_category'. You can override using
## the `.groups` argument.
combinations
## # A tibble: 100 × 3
## # Groups:   product_category [20]
##    product_category payment_method count
##    <chr>            <chr>          <int>
##  1 Bus Ticket       Bank Transfer     66
##  2 Bus Ticket       Credit Card       45
##  3 Bus Ticket       Debit Card        43
##  4 Bus Ticket       UPI               42
##  5 Bus Ticket       Wallet Balance    39
##  6 Education Fee    Bank Transfer     58
##  7 Education Fee    Credit Card       47
##  8 Education Fee    Debit Card        63
##  9 Education Fee    UPI               71
## 10 Education Fee    Wallet Balance    47
## # ℹ 90 more rows

The observed trends indicate that payment methods like UPI and Bank Transfer dominate in high-value and frequent transactions such as Education Fees, while Wallet Balances and Debit Cards are used less frequently, especially for lower-value transactions like Bus Tickets. These preferences are likely driven by the ease, speed, and perceived convenience of different payment methods based on the type of product or service.

Identifying the Most/Least Common Combinations

# Sort to identify the most and least common combinations
combinations <- combinations %>%
  arrange(desc(count))

# Display the top and bottom 5 combinations
head(combinations, 5)
## # A tibble: 5 × 3
## # Groups:   product_category [5]
##   product_category  payment_method count
##   <chr>             <chr>          <int>
## 1 Movie Ticket      Bank Transfer     75
## 2 Education Fee     UPI               71
## 3 Streaming Service Debit Card        70
## 4 Bus Ticket        Bank Transfer     66
## 5 Taxi Fare         Credit Card       66
tail(combinations, 5)
## # A tibble: 5 × 3
## # Groups:   product_category [5]
##   product_category  payment_method count
##   <chr>             <chr>          <int>
## 1 Flight Booking    Debit Card        39
## 2 Insurance Premium Debit Card        39
## 3 Gift Card         Wallet Balance    38
## 4 Taxi Fare         Debit Card        37
## 5 Mobile Recharge   Bank Transfer     31

Observations The most common combinations involve everyday transactions (e.g., movie tickets, streaming services) with popular payment methods like Bank Transfer and UPI. On the other hand, less frequent combinations like Flight Booking with Debit Card or Mobile Recharge with Bank Transfer likely reflect the preference for different payment methods for specific transaction types (e.g., travel rewards on credit cards, instant payments via UPI for smaller amounts).

Missing Combinations:

Now we will check for any combinations that do not exist in the data (i.e., a product category that has never been purchased with a specific payment method).

all_combinations <- expand.grid(unique(data$product_category), unique(data$payment_method))
colnames(all_combinations) <- c("product_category", "payment_method")
missing_combinations <- anti_join(all_combinations, combinations, by = c("product_category", "payment_method"))
missing_combinations
## [1] product_category payment_method  
## <0 rows> (or 0-length row.names)

There is no any missing values.

Explanation:

Missing combinations indicate product categories that were never purchased with certain payment methods. This can lead to interesting insights about user preferences or limitations in the platform.

Visualizing Groups

Plot 1: Transaction Amount by Product Category

ggplot(data, aes(x = product_category, y = product_amount, fill = product_category)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Product Amount by Product Category", x = "Product Category", y = "Product Amount") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  
        plot.title = element_text(hjust = 0.5),             
        plot.margin = unit(c(1, 1, 1, 2), "cm"))            

They spent more amount on Gas bill compared to all product category.

Plot 2: Count by Payment Method

ggplot(group2, aes(x = payment_method, y = count, fill = payment_method)) +
  geom_bar(stat = "identity") +
  labs(title = "Transaction Count by Payment Method", x = "Payment Method", y = "Count") +
  theme_minimal()

The Bank Transfers are more compared to any other on transaction count basis of payment method.

Plot 3: Transaction Amount by Transaction Status

ggplot(group3, aes(x = transaction_status, y = mean_amount, fill = transaction_status)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Transaction Amount by Transaction Status", x = "Transaction Status", y = "Average Amount") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, max(group3$mean_amount, na.rm = TRUE), by = 500)) +  # Y-axis breaks at 500
  theme(plot.title = element_text(hjust = 0.5))  # Center the title

The Failed transactions are more compared to pending and successful.