In this data dive,I am going to explore the transaction data from a digital wallet platform.Main aim is to investigate the probabilities of different groups and identify rare combinations. This will help us better understand patterns of user behavior, spending trends, and anomalies in the data.
We will first group the data by product_category and investigate the summary statistics for product_amount.
# Load necessary libraries
library(dplyr)
library(knitr)
# Group by product category and calculate summary statistics
group1 <- data %>%
group_by(product_category) %>%
summarize(mean_amount = mean(product_amount, na.rm = TRUE),
count = n(),
sum_amount = sum(product_amount, na.rm = TRUE))
group1
## # A tibble: 20 × 4
## product_category mean_amount count sum_amount
## <chr> <dbl> <int> <dbl>
## 1 Bus Ticket 4748. 235 1115712.
## 2 Education Fee 4718. 286 1349322.
## 3 Electricity Bill 4944. 252 1245973.
## 4 Flight Booking 5205. 216 1124283.
## 5 Food Delivery 5085. 259 1317106.
## 6 Gaming Credits 4921. 231 1136665.
## 7 Gas Bill 5446. 250 1361520.
## 8 Gift Card 4628. 221 1022796.
## 9 Grocery Shopping 4769. 238 1134973.
## 10 Hotel Booking 4816. 274 1319604.
## 11 Insurance Premium 4589. 225 1032609.
## 12 Internet Bill 5353. 233 1247178.
## 13 Loan Repayment 4942. 245 1210682.
## 14 Mobile Recharge 4964. 241 1196335.
## 15 Movie Ticket 4917. 272 1337400.
## 16 Online Shopping 4969. 243 1207396.
## 17 Rent Payment 5013. 251 1258264.
## 18 Streaming Service 4891. 299 1462462.
## 19 Taxi Fare 5104. 256 1306566.
## 20 Water Bill 5131. 273 1400669.
Insights:
Gas Bill, Water Bill categories have high mean_amount tend to be essential services,and Flight Booking has irregular high-cost transactions.These represent essential services or popular subscriptions, leading to frequent and relatively high-value transactions and Flight bookings are typically less frequent but higher in value. Streaming Service and Movie Ticket categories have a higher number of transactions indicate frequent low-cost transactions, contributing to a higher overall total spending.Streaming services are likely popular due to monthly subscriptions, while education fees and movie tickets reflect common spending for students and leisure.Gift Card has rare categories may suggest either infrequent purchases.Gift cards may be purchased for specific occasions, contributing to the lower count.
We will now group the data by payment_method and summarize the same product_amount column.
group2 <- data %>%
group_by(payment_method) %>%
summarize(mean_amount = mean(product_amount, na.rm = TRUE),
count = n(),
sum_amount = sum(product_amount, na.rm = TRUE))
group2
## # A tibble: 5 × 4
## payment_method mean_amount count sum_amount
## <chr> <dbl> <int> <dbl>
## 1 Bank Transfer 4993. 1045 5218142.
## 2 Credit Card 4926. 992 4886969.
## 3 Debit Card 5049. 1022 5160183.
## 4 UPI 4904. 999 4898629.
## 5 Wallet Balance 4908. 942 4623590.
Insights:
Debit Card has the highest mean transaction amount because One possible reason could be that debit card payments are directly linked to the user’s bank account, which may give them more flexibility with larger payments and otps.Bank Transfer comes second in terms of total transaction amount,The relatively high total spending might indicate that users often choose bank transfers for high-value payments or bills.UPI and Wallet Balance have slightly lower mean transaction amounts, are often preferred for smaller, quick, or frequent transactions, such as paying for groceries or other daily needs.Wallet Balance has the lowest total transaction amount. Wallets are usually capped at a certain balance, and users might only use them for smaller, convenience-based transactions.Credit Card transactions show a mean amount,slightly lower than debit cards and bank transfers because Credit cards might have more transactional restrictions or fees.
Lastly, we will group the data by transaction_status and summarize the product_amount column again.
group3 <- data %>%
group_by(transaction_status) %>%
summarize(mean_amount = mean(product_amount, na.rm = TRUE),
count = n(),
sum_amount = sum(product_amount, na.rm = TRUE))
group3
## # A tibble: 3 × 4
## transaction_status mean_amount count sum_amount
## <chr> <dbl> <int> <dbl>
## 1 Failed 5045. 146 736511.
## 2 Pending 4970. 99 491984
## 3 Successful 4955. 4755 23559019.
Insights:
Successful transactions have the highest mean amount, highest count, and highest total spending, showing that users are more likely to complete larger and frequent payments successfully.Failed transactions are more common for lower-value payments, possibly due to issues like insufficient funds, card limits, or abandoned purchases.Pending transactions show intermediate behavior, potentially reflecting mid-sized payments that require additional verification or confirmation before completion.
Now, let’s calculate the probability of selecting a row from each group and tag the rarest ones.
group1 <- group1 %>%
mutate(probability = count / sum(count))
group1 <- group1 %>%
mutate(tag = ifelse(probability == min(probability), "Low Probability", "Normal"))
group1
## # A tibble: 20 × 6
## product_category mean_amount count sum_amount probability tag
## <chr> <dbl> <int> <dbl> <dbl> <chr>
## 1 Bus Ticket 4748. 235 1115712. 0.047 Normal
## 2 Education Fee 4718. 286 1349322. 0.0572 Normal
## 3 Electricity Bill 4944. 252 1245973. 0.0504 Normal
## 4 Flight Booking 5205. 216 1124283. 0.0432 Low Probability
## 5 Food Delivery 5085. 259 1317106. 0.0518 Normal
## 6 Gaming Credits 4921. 231 1136665. 0.0462 Normal
## 7 Gas Bill 5446. 250 1361520. 0.05 Normal
## 8 Gift Card 4628. 221 1022796. 0.0442 Normal
## 9 Grocery Shopping 4769. 238 1134973. 0.0476 Normal
## 10 Hotel Booking 4816. 274 1319604. 0.0548 Normal
## 11 Insurance Premium 4589. 225 1032609. 0.045 Normal
## 12 Internet Bill 5353. 233 1247178. 0.0466 Normal
## 13 Loan Repayment 4942. 245 1210682. 0.049 Normal
## 14 Mobile Recharge 4964. 241 1196335. 0.0482 Normal
## 15 Movie Ticket 4917. 272 1337400. 0.0544 Normal
## 16 Online Shopping 4969. 243 1207396. 0.0486 Normal
## 17 Rent Payment 5013. 251 1258264. 0.0502 Normal
## 18 Streaming Service 4891. 299 1462462. 0.0598 Normal
## 19 Taxi Fare 5104. 256 1306566. 0.0512 Normal
## 20 Water Bill 5131. 273 1400669. 0.0546 Normal
Observations:
Flight booking has the Lowest probability (4.51%) that implies that for every transaction in the dataset, there is only a small chance.This could indicate that users make flight-related purchases less frequently than they do for everyday services like food delivery or utility bills.Booking a flight is not an everyday event—it’s likely tied to specific occasions like vacations, work travel, or emergencies, contributing to its rarity.
We will now investigate the combinations of two categorical columns: product_category and payment_method.
combinations <- data %>%
group_by(product_category, payment_method) %>%
summarize(count = n())
## `summarise()` has grouped output by 'product_category'. You can override using
## the `.groups` argument.
combinations
## # A tibble: 100 × 3
## # Groups: product_category [20]
## product_category payment_method count
## <chr> <chr> <int>
## 1 Bus Ticket Bank Transfer 66
## 2 Bus Ticket Credit Card 45
## 3 Bus Ticket Debit Card 43
## 4 Bus Ticket UPI 42
## 5 Bus Ticket Wallet Balance 39
## 6 Education Fee Bank Transfer 58
## 7 Education Fee Credit Card 47
## 8 Education Fee Debit Card 63
## 9 Education Fee UPI 71
## 10 Education Fee Wallet Balance 47
## # ℹ 90 more rows
The observed trends indicate that payment methods like UPI and Bank Transfer dominate in high-value and frequent transactions such as Education Fees, while Wallet Balances and Debit Cards are used less frequently, especially for lower-value transactions like Bus Tickets. These preferences are likely driven by the ease, speed, and perceived convenience of different payment methods based on the type of product or service.
Identifying the Most/Least Common Combinations
# Sort to identify the most and least common combinations
combinations <- combinations %>%
arrange(desc(count))
# Display the top and bottom 5 combinations
head(combinations, 5)
## # A tibble: 5 × 3
## # Groups: product_category [5]
## product_category payment_method count
## <chr> <chr> <int>
## 1 Movie Ticket Bank Transfer 75
## 2 Education Fee UPI 71
## 3 Streaming Service Debit Card 70
## 4 Bus Ticket Bank Transfer 66
## 5 Taxi Fare Credit Card 66
tail(combinations, 5)
## # A tibble: 5 × 3
## # Groups: product_category [5]
## product_category payment_method count
## <chr> <chr> <int>
## 1 Flight Booking Debit Card 39
## 2 Insurance Premium Debit Card 39
## 3 Gift Card Wallet Balance 38
## 4 Taxi Fare Debit Card 37
## 5 Mobile Recharge Bank Transfer 31
Observations The most common combinations involve everyday transactions (e.g., movie tickets, streaming services) with popular payment methods like Bank Transfer and UPI. On the other hand, less frequent combinations like Flight Booking with Debit Card or Mobile Recharge with Bank Transfer likely reflect the preference for different payment methods for specific transaction types (e.g., travel rewards on credit cards, instant payments via UPI for smaller amounts).
Missing Combinations:
Now we will check for any combinations that do not exist in the data (i.e., a product category that has never been purchased with a specific payment method).
all_combinations <- expand.grid(unique(data$product_category), unique(data$payment_method))
colnames(all_combinations) <- c("product_category", "payment_method")
missing_combinations <- anti_join(all_combinations, combinations, by = c("product_category", "payment_method"))
missing_combinations
## [1] product_category payment_method
## <0 rows> (or 0-length row.names)
There is no any missing values.
Explanation:
Missing combinations indicate product categories that were never purchased with certain payment methods. This can lead to interesting insights about user preferences or limitations in the platform.
Plot 1: Transaction Amount by Product Category
ggplot(data, aes(x = product_category, y = product_amount, fill = product_category)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Product Amount by Product Category", x = "Product Category", y = "Product Amount") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
plot.margin = unit(c(1, 1, 1, 2), "cm"))
They spent more amount on Gas bill compared to all product category.
Plot 2: Count by Payment Method
ggplot(group2, aes(x = payment_method, y = count, fill = payment_method)) +
geom_bar(stat = "identity") +
labs(title = "Transaction Count by Payment Method", x = "Payment Method", y = "Count") +
theme_minimal()
The Bank Transfers are more compared to any other on transaction count basis of payment method.
Plot 3: Transaction Amount by Transaction Status
ggplot(group3, aes(x = transaction_status, y = mean_amount, fill = transaction_status)) +
geom_bar(stat = "identity") +
labs(title = "Average Transaction Amount by Transaction Status", x = "Transaction Status", y = "Average Amount") +
theme_minimal() +
scale_y_continuous(breaks = seq(0, max(group3$mean_amount, na.rm = TRUE), by = 500)) + # Y-axis breaks at 500
theme(plot.title = element_text(hjust = 0.5)) # Center the title
The Failed transactions are more compared to pending and successful.