This project is aimed at digital wallet platform managers, marketers, and financial strategists. It provides insights into spending patterns, seasonal trends, and the impact of cashback on customer engagement to inform data-driven business strategies.

Number of Distinct Attributes:

num_attributes <- ncol(data)
print(paste("Number of distinct attributes:", num_attributes))

## [1] "Number of distinct attributes: 16"

Classify Attributes as Continuous or Categorical:

# Initialize counters
num_continuous <- 0
num_categorical <- 0

# Check column types
for (col in colnames(data)) {
  if (is.numeric(data[[col]])) { 
    # Numeric column: treated as continuous only if it has many distinct values
    num_unique <- length(unique(data[[col]]))
    if (num_unique > 10) {  # Assuming >10 unique values implies continuous
      num_continuous <- num_continuous + 1
    } else { 
      num_categorical <- num_categorical + 1
    }
  } else {
    # Categorical: Character or factor type
    num_categorical <- num_categorical + 1
  }
}

# Results
print(paste("Number of continuous attributes:", num_continuous))

## [1] "Number of continuous attributes: 5"

print(paste("Number of categorical attributes:", num_categorical))

## [1] "Number of categorical attributes: 11"
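
The classification rule above can also be written as a small reusable function. The sketch below is illustrative only: `classify_columns`, its `max_levels` argument, and the `toy` data frame are hypothetical names that do not appear in the original analysis, and the threshold of 10 distinct values is the same assumption the loop makes.

```r
# Hypothetical helper: label each column using the same rule as the loop above.
# Numeric columns with more than `max_levels` distinct values count as
# continuous; everything else counts as categorical.
classify_columns <- function(df, max_levels = 10) {
  vapply(df, function(col) {
    if (is.numeric(col) && length(unique(col)) > max_levels) {
      "continuous"
    } else {
      "categorical"
    }
  }, character(1))
}

# Toy data (made up for illustration, not the project dataset)
toy <- data.frame(
  amount = seq(10, 500, length.out = 50),     # 50 distinct numeric values
  rating = rep(1:5, 10),                      # numeric, but only 5 distinct values
  status = rep(c("Successful", "Failed"), 25) # character column
)

col_types <- classify_columns(toy)
print(col_types)
print(table(col_types))
```

Returning a named vector rather than two counters makes it easy to see which columns fell on each side of the threshold.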

summary(data$product_amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.09 2453.98 4943.69 4957.50 7444.81 9996.95

# Additional statistics for product_amount
mean_product_amount <- mean(data$product_amount, na.rm = TRUE)
sd_product_amount <- sd(data$product_amount, na.rm = TRUE)

# Display the mean and standard deviation
mean_product_amount

## [1] 4957.503

sd_product_amount

## [1] 2885.034

# Unique values and their counts for product_category
product_category_counts <- table(data$product_category) 
product_category_counts

## 
##        Bus Ticket     Education Fee  Electricity Bill    Flight Booking 
##               235               286               252               216 
##     Food Delivery    Gaming Credits          Gas Bill         Gift Card 
##               259               231               250               221 
##  Grocery Shopping     Hotel Booking Insurance Premium     Internet Bill 
##               238               274               225               233 
##    Loan Repayment   Mobile Recharge      Movie Ticket   Online Shopping 
##               245               241               272               243 
##      Rent Payment Streaming Service         Taxi Fare        Water Bill 
##               251               299               256               273

barplot(product_category_counts,
        main = "Counts of Product Categories",
        xlab = "Product Category",
        ylab = "Count",
        col = "steelblue",
        las = 2, 
        cex.names = 0.4) # Adjust size of axis labels if needed

Unique values and their counts for transaction_status

transaction_status_counts <- table(data$transaction_status)
transaction_status_counts

## 
##     Failed    Pending Successful 
##        146         99       4755

barplot(transaction_status_counts,
        main = "Counts of Transaction Status",
        xlab = "Transaction Status",
        ylab = "Count",
        col = "steelblue",
        las = 2, # Rotate x-axis labels if necessary
        cex.names = 0.8) # Adjust size of axis labels if needed

Initial Findings

Hypothesis 1: Transaction Fees Affect Transaction Status

success_data <- data %>% filter(transaction_status == "Successful")
failed_data <- data %>% filter(transaction_status == "Failed")
# Combine the filtered data for visualization
transaction_data <- rbind(
  mutate(success_data, transaction_status = "Successful"),
  mutate(failed_data, transaction_status = "Failed")
)
# Subset transaction fees for successful and failed transactions
success_fees <- success_data$transaction_fee
failed_fees <- failed_data$transaction_fee

# Perform two-sample t-test
t_test_result <- t.test(success_fees, failed_fees, var.equal = TRUE) # Assuming equal variances
print(t_test_result)

## 
##  Two Sample t-test
## 
## data:  success_fees and failed_fees
## t = 0.62499, df = 4899, p-value = 0.532
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.631066  3.157743
## sample estimates:
## mean of x mean of y 
##  25.21991  24.45658

# Create boxplot for transaction fees by transaction status
ggplot(transaction_data, aes(x = transaction_status, y = transaction_fee, fill = transaction_status)) +
  geom_boxplot() +
  labs(title = "Transaction Fee by Transaction Status",
       x = "Transaction Status",
       y = "Transaction Fee") +
  theme_minimal() +
  scale_fill_manual(values = c("Successful" = "lightgreen", "Failed" = "salmon")) +
  theme(legend.position = "none")

The median transaction fees for successful and failed transactions are close. This is consistent with the t-test result (p = 0.532), which shows no statistically significant difference in mean transaction fees between the two groups.
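
Because the two groups differ greatly in size (4755 successful versus 146 failed transactions), it may be worth checking whether the conclusion depends on the equal-variance assumption. The sketch below compares the pooled test with Welch's test, which is the default in `t.test`. The fee values are simulated (uniform between 0 and 50, an assumption chosen only for illustration), so the numbers it prints are not results from the project data.

```r
set.seed(42)

# Simulated fees (assumption: not the real data), using the group sizes
# reported in the transaction_status table above
success_fees <- runif(4755, min = 0, max = 50)
failed_fees  <- runif(146,  min = 0, max = 50)

# Pooled-variance test, as used in the analysis
pooled <- t.test(success_fees, failed_fees, var.equal = TRUE)

# Welch's test: does not assume equal variances (default var.equal = FALSE)
welch <- t.test(success_fees, failed_fees)

print(c(pooled_df = unname(pooled$parameter),
        welch_df  = unname(welch$parameter)))
print(c(pooled_p = pooled$p.value, welch_p = welch$p.value))
```

If both tests lead to the same conclusion on the real data, the equal-variance assumption is not driving the result.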

Hypothesis 2: Does Cashback Impact Transaction Count?

# Create the cashback_category column (overwrites any existing column of that name)
data <- data %>%
  mutate(cashback_category = ifelse(cashback > 0, "Cashback Received", "No Cashback"))

# Ensure there are no missing values in the relevant columns
data <- data %>%
  filter(!is.na(cashback_category), !is.na(transaction_status))

# Summarize the transaction count by cashback category and transaction status
cashback_success_counts <- data %>%
  group_by(cashback_category, transaction_status) %>%
  summarise(transaction_count = n(), .groups = 'drop')

# Check the summarized data before plotting
print(cashback_success_counts)

## # A tibble: 4 × 3
##   cashback_category transaction_status transaction_count
##   <chr>             <chr>                          <int>
## 1 Cashback Received Failed                           146
## 2 Cashback Received Pending                           99
## 3 Cashback Received Successful                      4754
## 4 No Cashback       Successful                         1

# Create a bar chart for the count of transactions, colored by transaction status
ggplot(cashback_success_counts, aes(x = cashback_category, y = transaction_count, fill = transaction_status)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  labs(title = "Transaction Count by Cashback Category and Success Status",
       x = "Cashback Category",
       y = "Transaction Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Successful" = "lightgreen", 
                               "Failed" = "salmon", 
                               "Pending" = "gray")) +  # Add color for Pending status
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

The counts above show that all but one transaction fall in the “Cashback Received” category: 4,999 transactions received cashback and only 1 did not. With a single “No Cashback” observation, the two categories cannot be meaningfully compared, so this dataset neither supports nor refutes an effect of cashback on transaction count or success rate. Within the “Cashback Received” group, 4,754 of 4,999 transactions (about 95.1%) were successful, 146 failed, and 99 were pending.
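
The group sizes and success rates can be computed directly from the summarized counts printed above. The sketch below re-enters those four rows by hand, so it runs without the original dataset; the object names (`counts`, `totals`, `successes`, `success_rate`) are illustrative.

```r
# Counts copied from the cashback_success_counts output above
counts <- data.frame(
  cashback_category  = c(rep("Cashback Received", 3), "No Cashback"),
  transaction_status = c("Failed", "Pending", "Successful", "Successful"),
  transaction_count  = c(146, 99, 4754, 1)
)

# Total transactions per cashback category
totals <- tapply(counts$transaction_count, counts$cashback_category, sum)

# Successful transactions per cashback category
successful <- subset(counts, transaction_status == "Successful")
successes  <- tapply(successful$transaction_count,
                     successful$cashback_category, sum)

# Share of successful transactions in each category
success_rate <- successes / totals[names(successes)]

print(totals)
print(round(success_rate, 4))
```

The "No Cashback" rate rests on a single transaction, which is why it should not be read as evidence in either direction.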

# Convert transaction_date to Date
data <- data %>%
  mutate(transaction_date = as.Date(transaction_date, format = "%Y-%m-%d"))  # Adjust format if necessary

Aggregating transactions by Day and by Month

daily_data <- data %>%
  group_by(transaction_date) %>%
  summarize(num_transactions = n())  # Count transactions per day

# Aggregate transactions by month
monthly_data <- data %>%
  mutate(year_month = format(transaction_date, "%Y-%m")) %>%  # Extract year and month
  group_by(year_month) %>%
  summarize(num_transactions = n())

Counting transactions by day of the week and Visualizing

# Weekday order for the plot (assumes an English locale for weekdays())
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

daily_counts <- data %>%
  mutate(day_of_week = factor(weekdays(transaction_date), levels = day_levels)) %>% # Factor levels keep the x-axis in Monday-to-Sunday order
  group_by(day_of_week) %>%
  summarise(transaction_count = n(), .groups = "drop")

# Plot daily transaction counts
ggplot(daily_counts, aes(x = day_of_week, y = transaction_count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Transactions by Day of the Week",
       x = "Day of Week", y = "Transaction Count")

Observation: Wednesday appears to have the highest number of transactions, while the other days are relatively uniform but slightly lower.

Possible Reasons (these are speculative and not tested in this analysis):

Midweek activity: People often handle payments, bookings, or online transactions in the middle of the week, when they are at work or actively managing tasks.

Routine schedules: Businesses might send bills or process payroll midweek, increasing the volume of transactions.

Cultural/Market trends: Some industries (e.g., fast-food chains such as KFC and McDonald’s, grocery, and bill payments) might see peak activity on Wednesdays due to midweek promotions or habits.

Counting transactions by Month and Visualizing

# Count transactions by month
monthly_counts <- data %>%
  mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as a labeled factor (e.g., "Jan", "Feb")
  group_by(month) %>%
  summarise(transaction_count = n()) %>%
  arrange(month) # Ensures the months are ordered properly (Jan to Dec)

# Plot monthly transaction counts
ggplot(monthly_counts, aes(x = month, y = transaction_count)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  theme_minimal() +
  labs(title = "Transactions by Month",
       x = "Month", y = "Transaction Count")

Observation: May, March, and December have higher transaction counts, while other months, such as April, February, and October, have slightly lower activity.

Possible Reasons (these are speculative and not tested in this analysis):

Seasonality: May, March, and December are popular months for activities like vacations, weddings, and summer shopping, leading to increased transactions for travel, gifts, or preparations.

February: As the shortest month, February may inherently have fewer transactions, coupled with reduced spending after the New Year.

April: April is often associated with tax filing deadlines (April 15 in the U.S.), so people might limit discretionary spending to focus on paying taxes or saving for potential payments.

October: October might represent a pre-holiday lull, where people pause major spending in preparation for the November/December holidays.

Promotions and sales: Sales events such as “Back-to-School” in late summer and mid-year sales in May/June might boost transaction activity.

library(dplyr)
library(lubridate)


daily_counts <- data %>%
  mutate(day_of_week = weekdays(transaction_date)) %>% # Extract day of the week
  group_by(day_of_week) %>%
  summarise(transaction_count = n(), .groups = "drop")


sum(is.na(daily_counts$day_of_week))  # Should be 0 for no missing values

## [1] 0

daily_counts$day_of_week <- factor(
  daily_counts$day_of_week,
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
)


anova_data_day <- aov(transaction_count ~ day_of_week, data = daily_counts) # Perform ANOVA
summary(anova_data_day)

##             Df Sum Sq Mean Sq
## day_of_week  6   3927   654.6

# Monthly transaction counts
monthly_counts <- data %>%
  mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as labeled factor
  group_by(month) %>%
  summarise(transaction_count = n(), .groups = "drop")


sum(is.na(monthly_counts$month))  # Should be 0 for no missing values

## [1] 0

monthly_counts$month <- factor(
  monthly_counts$month,
  levels = month.abb  # month.abb is the default set of month abbreviations (Jan, Feb, etc.)
)

#Perform ANOVA on transaction counts by month
anova_data_month <- aov(transaction_count ~ month, data = monthly_counts) # Perform ANOVA
summary(anova_data_month)

##             Df Sum Sq Mean Sq
## month       11   3101   281.9

Day of the Week (ANOVA): The ANOVA table above reports only Df, Sum Sq, and Mean Sq. Because daily_counts holds a single aggregated count per weekday (7 rows for 7 levels), there are no residual degrees of freedom, so no F statistic or p-value can be computed. The output therefore does not establish a statistically significant difference between days. The variation in counts is descriptive only; testing it would require replicated observations, such as the per-date counts in daily_data.

Month (ANOVA): The same limitation applies. monthly_counts has one row per month (12 rows for 12 levels), so the table contains no F statistic or p-value. Seasonal factors or month-specific promotions could be influencing transaction frequency, but this output cannot confirm that the differences between months are statistically significant.
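
The effect of aggregation on the ANOVA can be seen in a small simulation. The sketch below uses made-up Poisson counts for each date of an arbitrary year (an assumption, not the project data) and fits the model twice: once on one total per weekday, which leaves no residual degrees of freedom, and once on per-date counts, which yields an F statistic and p-value. All object names are illustrative.

```r
set.seed(7)

# Simulated per-date transaction counts (assumption: not the real data)
dates <- seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "day")
daily <- data.frame(
  transaction_date = dates,
  num_transactions = rpois(length(dates), lambda = 14)
)

# "%u" gives the ISO weekday number (1 = Monday), independent of locale
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
daily$day_of_week <- factor(format(daily$transaction_date, "%u"),
                            levels = as.character(1:7),
                            labels = day_labels)

# Case 1: one aggregated total per weekday, so nothing is left for the error term
weekly_totals <- aggregate(num_transactions ~ day_of_week, data = daily, FUN = sum)
fit_totals <- aov(num_transactions ~ day_of_week, data = weekly_totals)
print(df.residual(fit_totals))

# Case 2: one row per date, so each weekday has many replicates
fit_daily <- aov(num_transactions ~ day_of_week, data = daily)
anova_table <- summary(fit_daily)[[1]]
print(anova_table)
```

On the real data, the analogous model would use the per-date counts rather than the weekday totals.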

# Load required libraries
library(dplyr)
library(ggplot2)
library(lubridate)

# Ensure your data has the `transaction_date` column in date format
# and load your dataset into `data`

# Count transactions by month
monthly_counts <- data %>%
  mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as a labeled factor
  group_by(month) %>%
  summarise(transaction_count = n()) %>%
  arrange(month)

# Add month numeric values for regression
monthly_counts <- monthly_counts %>%
  mutate(month_numeric = as.numeric(month))

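# Sine regression model
# Note: sin(month_numeric^2) does not repeat every 12 months, so it is only a rough proxy for seasonality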
lm_seasonal <- lm(transaction_count ~ sin( month_numeric ^2), data = monthly_counts)

# Summary of the regression model
summary(lm_seasonal)

## 
## Call:
## lm(formula = transaction_count ~ sin(month_numeric^2), data = monthly_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.986 -10.851   1.442  10.614  30.331 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           416.398      5.150  80.856 2.05e-15 ***
## sin(month_numeric^2)   -2.040      7.175  -0.284    0.782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.54 on 10 degrees of freedom
## Multiple R-squared:  0.008023,   Adjusted R-squared:  -0.09117 
## F-statistic: 0.08088 on 1 and 10 DF,  p-value: 0.7819

# Predict and add fitted values
monthly_counts <- monthly_counts %>%
  mutate(fitted_values = predict(lm_seasonal))

# Plot the data with month names on the x-axis
ggplot(monthly_counts, aes(x = month, y = transaction_count)) +
  geom_point(color = "dodgerblue", size = 3) + # Blue points
  geom_line(aes(y = fitted_values, group = 1), color = "red", linewidth = 1) + # Red line of fitted values (linewidth replaces the deprecated size argument)
  theme_minimal() +
  labs(title = "Seasonal Trends in Transactions",
       x = "Month",
       y = "Transaction Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate month labels for clarity
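
The model above explains almost none of the variation in monthly counts (R-squared of about 0.008, p = 0.78). One common alternative for annual seasonality, offered here as a swapped-in technique rather than the method used in this report, is a harmonic regression with a sine and cosine term of period 12. The sketch below uses simulated monthly counts (an assumption, not the project data), and the names `monthly`, `sin12`, `cos12`, and `fit_harmonic` are illustrative.

```r
set.seed(3)

# Simulated monthly counts with a mild annual cycle (assumption: made-up values)
monthly <- data.frame(month_numeric = 1:12)
monthly$transaction_count <- round(
  415 + 15 * sin(2 * pi * monthly$month_numeric / 12) + rnorm(12, sd = 10)
)

# Harmonic terms that complete exactly one cycle over 12 months
monthly$sin12 <- sin(2 * pi * monthly$month_numeric / 12)
monthly$cos12 <- cos(2 * pi * monthly$month_numeric / 12)

# Fitting both terms lets the model estimate the phase as well as the amplitude
fit_harmonic <- lm(transaction_count ~ sin12 + cos12, data = monthly)
print(summary(fit_harmonic)$coefficients)

# Estimated amplitude of the seasonal swing, in transactions
amplitude <- sqrt(sum(coef(fit_harmonic)[c("sin12", "cos12")]^2))
print(amplitude)
```

With only 12 monthly observations, any such fit has little power, so a non-significant result would not rule out seasonality.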

Select a Binary Variable

We transform the payment_method variable into a binary format to predict the likelihood of using “UPI”.

# Convert payment_method to binary: 1 if "UPI", 0 otherwise
data <- data |>
  mutate(binary_status = if_else(payment_method == "UPI", 1, 0))

Logistic Regression Model

We’ll build a logistic regression model to predict binary_status based on explanatory variables such as product_amount, transaction_fee, and cashback.

# Fit logistic regression model
logit_model <- glm(binary_status ~ product_amount + transaction_fee + cashback, 
                   data = data, family = binomial(link='logit'))
summary(logit_model)

## 
## Call:
## glm(formula = binary_status ~ product_amount + transaction_fee + 
##     cashback, family = binomial(link = "logit"), data = data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.563e+00  1.136e-01 -13.756   <2e-16 ***
## product_amount  -8.279e-06  1.227e-05  -0.675   0.5000    
## transaction_fee  3.796e-03  2.436e-03   1.558   0.1191    
## cashback         2.334e-03  1.243e-03   1.878   0.0604 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5001.3  on 4999  degrees of freedom
## Residual deviance: 4994.9  on 4996  degrees of freedom
## AIC: 5002.9
## 
## Number of Fisher Scoring iterations: 4

Interpretation of Logistic Regression Coefficients

Intercept (-1.563): Represents the log odds of using UPI when all predictors are zero. This corresponds to a probability of about 0.17, so UPI usage is relatively unlikely under these conditions.

Product Amount (-8.279e-06): Indicates that for each unit increase in product_amount, the log odds of using UPI decrease by approximately 0.0000083, but this effect is not statistically significant (p = 0.5000).

Transaction Fee (3.796e-03): For each unit increase in transaction_fee, the log odds of using UPI increase by about 0.0038; however, this coefficient is also not statistically significant (p = 0.1191).

Cashback (2.334e-03): Suggests that for each unit increase in cashback, the log odds of opting for UPI increase by approximately 0.0023. This is close to, but does not reach, significance at the 0.05 level (p = 0.0604).
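
Log-odds coefficients are often easier to read as odds ratios and probabilities. The sketch below applies `exp()` and `plogis()` to the estimates copied from the model summary printed above; the vector name `estimates` and its element names are illustrative.

```r
# Estimates copied from the printed summary(logit_model) output
estimates <- c(
  intercept       = -1.563,
  product_amount  = -8.279e-06,
  transaction_fee =  3.796e-03,
  cashback        =  2.334e-03
)

# Odds ratio per one-unit increase in each predictor
odds_ratios <- exp(estimates[-1])
print(round(odds_ratios, 5))

# Probability of UPI when all predictors are zero (inverse logit of the intercept)
baseline_prob <- plogis(estimates[["intercept"]])
print(round(baseline_prob, 3))
```

All three odds ratios are very close to 1, which matches the small and non-significant coefficients in the summary.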

Confidence Interval for Coefficients

We can use the standard errors of coefficients to calculate confidence intervals, giving insights into the stability of our estimates.

# Calculate confidence interval for the product_amount coefficient
coef_estimate <- summary(logit_model)$coefficients["product_amount", "Estimate"]
coef_se <- summary(logit_model)$coefficients["product_amount", "Std. Error"]
conf_int <- coef_estimate + c(-1.96, 1.96) * coef_se
conf_int

## [1] -3.233528e-05  1.577759e-05

Interpretation of the Confidence Interval

C.I. = (-3.23e-05, 1.58e-05): This is an approximate 95% interval for the product_amount coefficient, so the effect of a one-unit increase in product_amount on the log odds of choosing UPI plausibly lies between about -0.0000323 and 0.0000158. Because the interval contains zero, the data are consistent with product_amount having no effect on the likelihood of selecting UPI, in line with its p-value of 0.5000. Applying the same formula to the cashback coefficient (2.334e-03, standard error 1.243e-03) gives roughly (-0.0001, 0.0048), which also contains zero.
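
The estimate plus or minus 1.96 standard errors calculation is a Wald interval, and base R's `confint.default()` computes the same interval for every coefficient at once. The sketch below checks this on a simulated logistic model; the data, the true coefficients, and the object names are assumptions made for illustration and are unrelated to the project dataset.

```r
set.seed(11)

# Simulated data (assumption: made-up predictor and coefficients)
n <- 500
sim <- data.frame(cashback = runif(n, min = 0, max = 100))
sim$binary_status <- rbinom(n, size = 1,
                            prob = plogis(-1.5 + 0.002 * sim$cashback))

fit <- glm(binary_status ~ cashback, data = sim,
           family = binomial(link = "logit"))

# Manual Wald interval: estimate +/- z * standard error
coefs <- summary(fit)$coefficients
z <- qnorm(0.975)  # about 1.96
manual_ci <- cbind(
  lower = coefs[, "Estimate"] - z * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + z * coefs[, "Std. Error"]
)

# Same interval from the built-in Wald method
wald_ci <- confint.default(fit)

print(manual_ci)
print(wald_ci)
```

Using `qnorm(0.975)` instead of the rounded 1.96 is what makes the two results agree to numerical precision.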

Comparing Payment method and cashback

# Convert payment_method to binary variable
data <- data %>%
  mutate(payment_method_binary = if_else(payment_method == "UPI", "UPI", "Non-UPI"))
# Add predicted probabilities to data
data <- data %>%
  mutate(predicted_prob = predict(logit_model, type = "response"))

# Plot predicted probabilities against cashback, colored by payment_method_binary
ggplot(data, aes(x = cashback, y = predicted_prob, color = payment_method_binary)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Predicted Probability of UPI Payment by Cashback",
       x = "Cashback",
       y = "Predicted Probability of UPI",
       color = "Payment Method") +
  scale_color_manual(values = c("Non-UPI" = "red", "UPI" = "black"), 
                     labels = c("Non-UPI", "UPI")) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insights: Predicted Probability of UPI Payment by Cashback

Trend Analysis: The graph indicates a positive association between the cashback amount and the predicted probability of using UPI (Unified Payments Interface) as a payment method. This reflects the positive cashback coefficient in the model, which is not statistically significant at the 0.05 level (p = 0.0604), so the trend should be read as suggestive rather than conclusive.

Probability Range: The predicted probability for UPI payments ranges approximately from 0.18 to 0.24. Even at lower cashback amounts, roughly one in five transactions is predicted to use UPI, and the predicted share rises modestly with higher cashback.

Payment Method Distinction: The plot separates UPI (black) from non-UPI (red) transactions. The red line (non-UPI) is positioned lower than the black line (UPI). Because the predicted probabilities depend only on product_amount, transaction_fee, and cashback, this gap reflects differences in those predictors between the two groups and does not by itself show that cashback drives UPI usage.

Box Plot

# Create the box plot
ggplot(data, aes(x = payment_method_binary, y = cashback, fill = payment_method_binary)) +
  geom_boxplot() +
  labs(title = "Box Plot of Cashback by Payment Method",
       x = "Payment Method",
       y = "Cashback") +
  scale_fill_manual(values = c("Non-UPI" = "gray", "UPI" = "black")) +
  theme_minimal()

Insights:

Cashback Distribution: The box plot shows that the cashback distribution for UPI payments (black box) is generally higher than that for non-UPI payments (gray box). The median cashback for UPI is likely above the median for non-UPI payments, indicating that users using UPI receive higher cashback rewards on average.

Spread of Data: The interquartile range (IQR) for UPI seems larger, suggesting that there is greater variability in the cashback received when using UPI compared to non-UPI. This may imply that UPI transactions can lead to both significantly high and low cashback outcomes.

Outliers: If present, any points outside the whiskers would indicate outliers, suggesting that a few users are receiving much higher or lower cashback amounts in the UPI category, potentially warranting further investigation.
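
The visual impressions from a box plot (medians, IQRs) can be backed up with group summaries and a two-sample test. The sketch below uses simulated cashback values in which both groups come from the same distribution (an assumption for illustration, not the project data) and computes the median and IQR per group, followed by a Wilcoxon rank-sum test. Object names are illustrative.

```r
set.seed(5)

# Simulated data (assumption: made-up values, identical distribution in both groups)
sim <- data.frame(
  payment_method_binary = sample(c("UPI", "Non-UPI"), size = 1000,
                                 replace = TRUE, prob = c(0.2, 0.8)),
  cashback = runif(1000, min = 0, max = 100)
)

# Median and IQR of cashback for each payment method group
group_summary <- aggregate(
  cashback ~ payment_method_binary, data = sim,
  FUN = function(x) c(median = median(x), iqr = IQR(x))
)
print(group_summary)

# Rank-based test of whether the two cashback distributions differ in location
wilcox_result <- wilcox.test(cashback ~ payment_method_binary, data = sim)
print(wilcox_result$p.value)
```

Running the same two steps on the real data would show whether the apparent gap in medians is larger than chance variation.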

Overall Insights:

UPI as a Preferred Payment Method: The analyses suggest that higher cashback amounts may be associated with greater use of UPI for payments, although the cashback effect in the logistic model is not statistically significant at the 0.05 level (p = 0.0604). If confirmed, businesses could use cashback offers to promote UPI transactions.

Targeted Promotions: Understanding the relationship between cashback and payment method can help in formulating targeted promotions. Increasing cashback incentives could encourage more users to adopt UPI for their transactions.

Further Analysis: It might be beneficial to conduct additional analysis to explore the demographic or behavioral characteristics of users opting for UPI versus non-UPI payments, as well as to investigate the reasons behind the variability in cashback amounts.

Project

Sai Pranam

2024-12-11