Week 5

Introduction

In this notebook, I will analyze the digital wallet transaction dataset. I will explore unclear columns/values, investigate encoding choices, and visualize the impact of data ambiguity.

The following columns were unclear until I read the documentation:

Product_amount:

It’s not immediately clear whether this column refers to the gross total or to the net amount after discounts. The documentation clarifies that it refers to the final amount the customer paid after any discounts and taxes.

Why was it encoded this way?

The product_amount column likely records the final amount the customer paid, after any discounts and taxes, because that gives a clear, comprehensive picture of the transaction from the user’s perspective. Customers are typically most concerned with the final amount, which makes it the most relevant metric for analysis.

What could have happened if you didn’t read the documentation?

I might have assumed that the product_amount represented the original price, excluding any discounts or taxes. This could lead to incorrect conclusions when analyzing total spending, product profitability, or customer behavior, as the true transaction value would be misinterpreted.
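To make this risk concrete, here is a minimal sketch of how the wrong assumption distorts a spending total. The amounts and the 10% discount rate are made up for illustration; they are not taken from the dataset or its documentation.

```r
# Hypothetical final amounts paid (what the documentation says product_amount holds)
product_amount <- c(90, 180, 45)

# Correct reading: the values are already net of discounts and taxes
total_correct <- sum(product_amount)

# Mistaken reading: treat the values as original prices and apply an
# assumed 10% discount again (the rate is an assumption for this sketch)
assumed_discount <- 0.10
total_mistaken <- sum(product_amount * (1 - assumed_discount))

# The mistaken total understates actual spending
print(c(correct = total_correct, mistaken = total_mistaken))
```

In this toy case the mistaken reading lowers total spending by the full assumed discount, so any profitability or customer-spend figure built on it would be off by the same proportion.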

Transaction status

The values Success, Pending, and Failed are used. While Success and Failed are self-explanatory, the meaning of Pending was unclear. According to the documentation, Pending refers to transactions that are still being processed or awaiting confirmation from the payment gateway.

Why was it encoded this way? Using simple categories like “Success,” “Pending,” and “Failed” makes the data easy to interpret at a high level. It’s common for financial data to be simplified into these statuses so that users and analysts can quickly understand the transaction’s outcome without needing further explanation.

What could have happened if you didn’t read the documentation? If I didn’t read the documentation, the status “Pending” might have been misinterpreted as a failed transaction or one that doesn’t matter, when in reality, these transactions are still being processed. Misclassifying “Pending” transactions could result in inaccurate analysis of payment success rates or customer transaction behavior.
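The effect on success rates can be sketched with a small made-up example. The status values match the ones described above, but the counts are invented for illustration and are not from the actual dataset.

```r
# Toy data for illustration only; not values from the real dataset
toy <- data.frame(
  transaction_status = c("Success", "Success", "Success",
                         "Failed", "Pending", "Pending")
)

n_success <- sum(toy$transaction_status == "Success")
n_pending <- sum(toy$transaction_status == "Pending")
n_total   <- nrow(toy)

# Interpretation A: Pending is treated like a failure (stays in the denominator)
rate_pending_as_failed <- n_success / n_total

# Interpretation B: Pending is excluded because its outcome is not yet known
rate_pending_excluded <- n_success / (n_total - n_pending)

print(c(pending_as_failed = rate_pending_as_failed,
        pending_excluded  = rate_pending_excluded))
```

The two readings give noticeably different success rates from the same rows, which is why the meaning of Pending needs to be settled before any rate is reported.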

Payment_method:

The documentation distinguishes between UPI, Wallet Balance, Bank Transfer, Credit Card, and Debit Card. However, some transactions were marked as Other, which was unclear until I read that Other can include less common methods like mobile banking or coupons.

Why was it encoded this way? Using specific categories like UPI, Wallet Balance, Bank Transfer, Credit Card, and Debit Card provides detailed insights into payment preferences. The “Other” category aggregates less common payment methods to avoid excessive fragmentation while still capturing alternative methods.

What could have happened if you didn’t read the documentation? Without reading the documentation, I might have dismissed or misunderstood the “Other” category. I could have assumed these rows were errors or negligible transactions and missed insights about mobile banking, coupons, or other less common payment methods. This could skew my understanding of user behavior and preferences.
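One simple check before dismissing a category is to look at its share of transactions. The sketch below uses an invented vector of payment methods; the category names follow the documentation, but the counts are assumptions for illustration.

```r
# Toy payment methods for illustration only; not values from the real dataset
toy_methods <- c("UPI", "UPI", "Wallet Balance", "Credit Card",
                 "Other", "Other", "Debit Card", "Bank Transfer")

# Share of transactions per payment method
method_share <- prop.table(table(toy_methods))
print(method_share)

# Share of the "Other" category on its own
other_share <- as.numeric(method_share["Other"])
print(other_share)
```

If the share of “Other” turns out to be sizeable in the real data, dropping it would remove a meaningful part of the picture.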

Unclear Data Element Even After Reading Documentation

Transaction Date

Although the documentation clarifies that the column records the date of the transaction, it does not specify whether this is the initiation, completion, or settlement date. For example, if a transaction is pending for days, it’s unclear whether the date reflects when the payment was initiated or when the status of the transaction was last updated. This distinction affects the accuracy of timing-related analyses, such as seasonality or payment delays.

Visualizing the Issue

Now I will visualize the distribution of transactions over time, split by status:

library(dplyr)
library(ggplot2)

# Assumes `data` is already loaded and has transaction_date and transaction_status columns
data$transaction_date <- as.Date(data$transaction_date)

# Count transactions per date and status
transaction_summary <- data %>%
  group_by(transaction_date, transaction_status) %>%
  summarize(transaction_count = n(), .groups = "drop")

ggplot(transaction_summary, aes(x = transaction_date, y = transaction_count, fill = transaction_status)) +
  geom_col() +  # Same as geom_bar(stat = "identity")
  theme_minimal() +
  labs(title = "Transactions Over Time by Status",
       x = "Transaction Date",
       y = "Transaction Count") +
  facet_wrap(~ transaction_status, scales = "free_y") +  # Separate panel for each status
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

The transaction_date column raises ambiguity about whether it refers to the date when the transaction was initiated or when the transaction was fully processed and confirmed. This difference can have a significant impact on the analysis:

If the date refers to when the transaction was initiated, the “Pending” transactions may appear on earlier dates even if they were resolved later.
If the date refers to when the transaction was finalized, it becomes harder to track when issues (like delays) actually began.

Why is it unclear? The documentation does not clarify if transaction_date reflects the initiation date, processing date, or final settlement date. This makes it difficult to accurately assess timelines for failed or pending transactions, and could lead to incorrect interpretations of transaction delays or system inefficiencies.

Risks: If the wrong date is analyzed, the company might underestimate how long customers wait for a transaction to be resolved. That can lead to poor service, delayed responses to customer complaints, and friction between customers and sellers.

Mitigating Risks: This issue can be addressed by adding separate columns to the dataset for the initiation date and the completion date. With both dates recorded, we can tell how long transactions stayed pending and when they were resolved.
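A minimal sketch of what the two-date approach could look like is shown below. The column names initiation_date and completion_date are hypothetical (the current dataset only has transaction_date), and the rows are invented for illustration.

```r
# Hypothetical layout with two date columns; names and values are assumptions
toy_tx <- data.frame(
  transaction_id  = 1:3,
  initiation_date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  completion_date = as.Date(c("2024-01-01", "2024-01-05", NA))  # NA = not yet resolved
)

# Days between initiation and completion (NA while still unresolved)
toy_tx$days_to_resolve <- as.numeric(
  difftime(toy_tx$completion_date, toy_tx$initiation_date, units = "days")
)

# Flag transactions that have no completion date yet
toy_tx$still_pending <- is.na(toy_tx$completion_date)

print(toy_tx)
```

With this structure, time-based plots can be built on initiation_date while delay analysis uses days_to_resolve, so the two questions no longer depend on a single ambiguous date.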