This project is aimed at digital wallet platform managers, marketers, and financial strategists. It provides insights into spending patterns, seasonal trends, and the impact of cashback on customer engagement to inform data-driven business strategies.
num_attributes <- ncol(data)
print(paste("Number of distinct attributes:", num_attributes))
## [1] "Number of distinct attributes: 16"
# Initialize counters
num_continuous <- 0
num_categorical <- 0
# Check column types
for (col in colnames(data)) {
if (is.numeric(data[[col]])) {
# Continuous: Numeric type
num_unique <- length(unique(data[[col]]))
if (num_unique > 10) { # Assuming >10 unique values implies continuous
num_continuous <- num_continuous + 1
} else {
num_categorical <- num_categorical + 1
}
} else {
# Categorical: Character or factor type
num_categorical <- num_categorical + 1
}
}
# Results
print(paste("Number of continuous attributes:", num_continuous))
## [1] "Number of continuous attributes: 5"
print(paste("Number of categorical attributes:", num_categorical))
## [1] "Number of categorical attributes: 11"
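The counting loop above can also be written as a small reusable function. The following is a minimal sketch, not the report's own code: `classify_columns` and the `toy` data frame are hypothetical names introduced for illustration, and the threshold of 10 unique values is the same assumption the loop makes.

```r
# Hypothetical helper: label each column with the same rule as the loop above
# (numeric with more than `max_levels` unique values -> continuous).
classify_columns <- function(df, max_levels = 10) {
  vapply(df, function(col) {
    if (is.numeric(col) && length(unique(col)) > max_levels) {
      "continuous"
    } else {
      "categorical"
    }
  }, character(1))
}

# Toy data frame (made up for illustration, not the wallet dataset)
toy <- data.frame(
  amount = seq(10, 120, by = 10),             # 12 distinct numeric values
  rating = rep(1:3, times = 4),               # numeric, but only 3 levels
  status = rep(c("Failed", "Successful"), 6)  # character column
)

col_types <- classify_columns(toy)
table(col_types)
```

Returning a named vector rather than two counters makes it easy to see which columns the 10-value threshold assigns to each group, which is useful because low-cardinality numeric columns are silently treated as categorical.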
summary(data$product_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.09 2453.98 4943.69 4957.50 7444.81 9996.95
# Additional statistics for product_amount
mean_product_amount <- mean(data$product_amount, na.rm = TRUE)
sd_product_amount <- sd(data$product_amount, na.rm = TRUE)
# Display the mean and standard deviation
mean_product_amount
## [1] 4957.503
sd_product_amount
## [1] 2885.034
# Unique values and their counts for product_category
product_category_counts <- table(data$product_category)
product_category_counts
##
## Bus Ticket Education Fee Electricity Bill Flight Booking
## 235 286 252 216
## Food Delivery Gaming Credits Gas Bill Gift Card
## 259 231 250 221
## Grocery Shopping Hotel Booking Insurance Premium Internet Bill
## 238 274 225 233
## Loan Repayment Mobile Recharge Movie Ticket Online Shopping
## 245 241 272 243
## Rent Payment Streaming Service Taxi Fare Water Bill
## 251 299 256 273
barplot(product_category_counts,
main = "Counts of Product Categories",
xlab = "Product Category",
ylab = "Count",
col = "steelblue",
las = 2,
cex.names = 0.4) # Adjust size of axis labels if needed
transaction_status_counts <- table(data$transaction_status)
transaction_status_counts
##
## Failed Pending Successful
## 146 99 4755
barplot(transaction_status_counts,
main = "Counts of Transaction Status",
xlab = "Transaction Status",
ylab = "Count",
col = "steelblue",
las = 2, # Rotate x-axis labels if necessary
cex.names = 0.8) # Adjust size of axis labels if needed
Hypothesis 1: Transaction Fees Affect Transaction Status
success_data <- data %>% filter(transaction_status == "Successful")
failed_data <- data %>% filter(transaction_status == "Failed")
# Combine the filtered data for visualization
transaction_data <- bind_rows(success_data, failed_data)
# Subset transaction fees for successful and failed transactions
success_fees <- success_data$transaction_fee
failed_fees <- failed_data$transaction_fee
# Perform two-sample t-test
t_test_result <- t.test(success_fees, failed_fees, var.equal = TRUE) # Assuming equal variances
print(t_test_result)
##
## Two Sample t-test
##
## data: success_fees and failed_fees
## t = 0.62499, df = 4899, p-value = 0.532
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.631066 3.157743
## sample estimates:
## mean of x mean of y
## 25.21991 24.45658
# Create boxplot for transaction fees by transaction status
ggplot(transaction_data, aes(x = transaction_status, y = transaction_fee, fill = transaction_status)) +
geom_boxplot() +
labs(title = "Transaction Fee by Transaction Status",
x = "Transaction Status",
y = "Transaction Fee") +
theme_minimal() +
scale_fill_manual(values = c("Successful" = "lightgreen", "Failed" = "salmon")) +
theme(legend.position = "none")
The median transaction fees for successful and failed transactions are close, which is consistent with the t-test result (p = 0.532): there is no statistically significant difference in mean transaction fees between the two groups.
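Because the two groups differ greatly in size (4,755 successful versus 146 failed transactions), the equal-variance assumption may be worth checking against Welch's t-test, which `t.test()` runs by default when `var.equal` is not set. The sketch below uses simulated fees, not the real data; the means, standard deviation, and the `sim_*` names are assumptions chosen only to make the example runnable.

```r
set.seed(42)  # reproducible simulated fees (illustrative, not the real data)

# Group sizes match the report; mean and sd are assumed values
sim_success <- rnorm(4755, mean = 25, sd = 14)
sim_failed  <- rnorm(146,  mean = 25, sd = 14)

# Welch's t-test: var.equal = FALSE is the default in t.test()
welch <- t.test(sim_success, sim_failed)

welch$method     # name of the test that was run
welch$parameter  # Welch-adjusted degrees of freedom
welch$p.value
```

On the real data the equivalent call would be `t.test(success_fees, failed_fees)`. If its p-value is close to the pooled-variance result, the conclusion does not depend on the equal-variance assumption.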
# Create cashback_category column if not already present
data <- data %>%
mutate(cashback_category = ifelse(cashback > 0, "Cashback Received", "No Cashback"))
# Ensure there are no missing values in the relevant columns
data <- data %>%
filter(!is.na(cashback_category), !is.na(transaction_status))
# Summarize the transaction count by cashback category and transaction status
cashback_success_counts <- data %>%
group_by(cashback_category, transaction_status) %>%
summarise(transaction_count = n(), .groups = 'drop')
# Check the summarized data before plotting
print(cashback_success_counts)
## # A tibble: 4 × 3
## cashback_category transaction_status transaction_count
## <chr> <chr> <int>
## 1 Cashback Received Failed 146
## 2 Cashback Received Pending 99
## 3 Cashback Received Successful 4754
## 4 No Cashback Successful 1
# Create a bar chart for the count of transactions, colored by transaction status
ggplot(cashback_success_counts, aes(x = cashback_category, y = transaction_count, fill = transaction_status)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(title = "Transaction Count by Cashback Category and Success Status",
x = "Cashback Category",
y = "Transaction Count") +
theme_minimal() +
scale_fill_manual(values = c("Successful" = "lightgreen",
"Failed" = "salmon",
"Pending" = "gray")) + # Add color for Pending status
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
The summary table shows that 4,999 of the 5,000 transactions received cashback and only one did not, so the “Cashback Received” and “No Cashback” bars are not comparable. With a single “No Cashback” transaction, this dataset cannot show whether cashback affects transaction success. Within the “Cashback Received” group, 4,754 transactions (about 95%) were successful, 146 failed, and 99 were pending.
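Comparing the two cashback groups is easier with rates than with raw counts. The sketch below rebuilds the contingency table from the counts printed above (combinations absent from the tibble are entered as 0) and computes the success rate within each category; the object names are introduced here for illustration.

```r
# Counts copied from the printed tibble; missing combinations are zero
counts <- matrix(
  c(146, 99, 4754,
      0,  0,    1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    cashback_category  = c("Cashback Received", "No Cashback"),
    transaction_status = c("Failed", "Pending", "Successful")
  )
)

group_sizes  <- rowSums(counts)                       # transactions per category
success_rate <- counts[, "Successful"] / group_sizes  # share that succeeded

group_sizes
round(success_rate, 3)
```

The group sizes make the limitation explicit: a success rate computed from a single transaction carries no information about the effect of cashback.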
# Convert transaction_date to Date
data <- data %>%
mutate(transaction_date = as.Date(transaction_date, format = "%Y-%m-%d")) # Adjust format if necessary
daily_data <- data %>%
group_by(transaction_date) %>%
summarize(num_transactions = n()) # Count transactions per day
# Aggregate transactions by month
monthly_data <- data %>%
mutate(year_month = format(transaction_date, "%Y-%m")) %>% # Extract year and month
group_by(year_month) %>%
summarize(num_transactions = n())
daily_counts <- data %>%
mutate(day_of_week = factor(weekdays(transaction_date), # Extract day of the week as an ordered factor
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>% # Factor levels keep the plot in Monday-to-Sunday order
group_by(day_of_week) %>%
summarise(transaction_count = n())
# Plot daily transaction counts
ggplot(daily_counts, aes(x = day_of_week, y = transaction_count)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Transactions by Day of the Week",
x = "Day of Week", y = "Transaction Count")
Observation: Wednesday appears to have the highest number of transactions, while the other days are fairly uniform and slightly lower.
Possible reasons:
Midweek activity: People often handle payments, bookings, or online transactions in the middle of the week, when they are at work or actively managing tasks.
Routine schedules: Businesses might send bills or process payroll midweek, increasing the volume of transactions.
Cultural/market trends: Some sectors may see peak activity on Wednesdays because of midweek promotions or habits, for example fast-food chains such as KFC and McDonald's offering Wednesday discounts, as well as grocery shopping and bill payments.
# Count transactions by month
monthly_counts <- data %>%
mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as a labeled factor (e.g., "Jan", "Feb")
group_by(month) %>%
summarise(transaction_count = n()) %>%
arrange(month) # Ensures the months are ordered properly (Jan to Dec)
# Plot monthly transaction counts
ggplot(monthly_counts, aes(x = month, y = transaction_count)) +
geom_bar(stat = "identity", fill = "dodgerblue") +
theme_minimal() +
labs(title = "Transactions by Month",
x = "Month", y = "Transaction Count")
Observation: May, March, and December have higher transaction counts, while April, February, and October have slightly lower activity.
Possible reasons:
Seasonality: May, March, and December are popular months for vacations, weddings, and summer shopping, which can increase transactions for travel, gifts, or preparations. February may have fewer transactions simply because it is a shorter month, combined with reduced spending after the New Year. April is associated with tax filing deadlines (April 15 in the U.S.), so people may limit discretionary spending to pay taxes or save for potential payments. October may represent a pre-holiday lull, in which people pause major spending ahead of the November/December holidays.
Promotions and sales: Sales events such as “Back-to-School” in late summer and mid-year sales in May/June might boost transaction activity.
library(dplyr)
library(lubridate)
daily_counts <- data %>%
mutate(day_of_week = weekdays(transaction_date)) %>% # Extract day of the week
group_by(day_of_week) %>%
summarise(transaction_count = n(), .groups = "drop")
sum(is.na(daily_counts$day_of_week)) # Should be 0 for no missing values
## [1] 0
daily_counts$day_of_week <- factor(
daily_counts$day_of_week,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
)
anova_data_day <- aov(transaction_count ~ day_of_week, data = daily_counts) # Perform ANOVA
summary(anova_data_day)
## Df Sum Sq Mean Sq
## day_of_week 6 3927 654.6
# Monthly transaction counts
monthly_counts <- data %>%
mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as labeled factor
group_by(month) %>%
summarise(transaction_count = n(), .groups = "drop")
sum(is.na(monthly_counts$month)) # Should be 0 for no missing values
## [1] 0
monthly_counts$month <- factor(
monthly_counts$month,
levels = month.abb # month.abb is the default set of month abbreviations (Jan, Feb, etc.)
)
#Perform ANOVA on transaction counts by month
anova_data_month <- aov(transaction_count ~ month, data = monthly_counts) # Perform ANOVA
summary(anova_data_month)
## Df Sum Sq Mean Sq
## month 11 3101 281.9
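Day of the Week (ANOVA): Because daily_counts holds a single aggregated count per weekday, the model has no residual degrees of freedom, and the output above reports only Df, Sum Sq, and Mean Sq, with no F statistic or p-value. The ANOVA as run therefore cannot establish whether transaction counts differ significantly across days of the week. The differences seen in the bar chart remain descriptive; testing them requires replicated observations (for example, one count per calendar date) or a test applied directly to the counts.
Month (ANOVA): The same limitation applies to the monthly model. With one count per month there is no residual variation, so the output contains no F statistic or p-value, and the analysis does not demonstrate significant differences in transaction counts across months. Seasonal factors or month-specific promotions could still influence transaction frequency, but this output does not confirm it.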
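One way to attach a p-value to these differences is a chi-square goodness-of-fit test against equal expected counts. With one count per level, the Sum Sq reported by `aov()` equals the sum of squared deviations of the counts from their mean, so the chi-square statistic can be reconstructed from the printed output. This is a sketch under stated assumptions: `gof_from_sum_sq` is a hypothetical helper, the total of 5,000 rows is taken from the earlier frequency tables, the Sum Sq values are rounded, and equal expected counts ignore the fact that months differ in length.

```r
# Row count implied by the transaction status table (146 + 99 + 4755)
n_total <- 5000

# Hypothetical helper: goodness-of-fit statistic against equal expected
# counts, rebuilt from the sum of squared deviations reported by aov().
gof_from_sum_sq <- function(sum_sq, k, n) {
  expected  <- n / k               # equal expected count per level
  statistic <- sum_sq / expected   # sum((O - E)^2) / E with constant E
  p_value   <- pchisq(statistic, df = k - 1, lower.tail = FALSE)
  c(statistic = statistic, df = k - 1, p_value = p_value)
}

gof_day   <- gof_from_sum_sq(sum_sq = 3927, k = 7,  n = n_total)
gof_month <- gof_from_sum_sq(sum_sq = 3101, k = 12, n = n_total)

round(rbind(day_of_week = gof_day, month = gof_month), 3)
```

Under these assumptions the statistics are about 5.5 on 6 degrees of freedom and 7.4 on 11 degrees of freedom, both below the 5% critical values, which would not indicate significant variation by weekday or month. On the raw data, `chisq.test(table(weekdays(data$transaction_date)))` performs the same weekday test directly.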
# Load required libraries
library(dplyr)
library(ggplot2)
library(lubridate)
# Ensure your data has the `transaction_date` column in date format
# and load your dataset into `data`
# Count transactions by month
monthly_counts <- data %>%
mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as a labeled factor
group_by(month) %>%
summarise(transaction_count = n()) %>%
arrange(month)
# Add month numeric values for regression
monthly_counts <- monthly_counts %>%
mutate(month_numeric = as.numeric(month))
# Sine regression model: regress monthly counts on sin(month_numeric^2)
lm_seasonal <- lm(transaction_count ~ sin(month_numeric^2), data = monthly_counts)
# Summary of the regression model
summary(lm_seasonal)
##
## Call:
## lm(formula = transaction_count ~ sin(month_numeric^2), data = monthly_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.986 -10.851 1.442 10.614 30.331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 416.398 5.150 80.856 2.05e-15 ***
## sin(month_numeric^2) -2.040 7.175 -0.284 0.782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.54 on 10 degrees of freedom
## Multiple R-squared: 0.008023, Adjusted R-squared: -0.09117
## F-statistic: 0.08088 on 1 and 10 DF, p-value: 0.7819
# Predict and add fitted values
monthly_counts <- monthly_counts %>%
mutate(fitted_values = predict(lm_seasonal))
# Plot the data with month names on the x-axis
ggplot(monthly_counts, aes(x = month, y = transaction_count)) +
geom_point(color = "dodgerblue", size = 3) + # Blue points
geom_line(aes(y = fitted_values, group = 1), color = "red", linewidth = 1) + # Fitted values (red line); linewidth replaces the deprecated size argument
theme_minimal() +
labs(title = "Seasonal Trends in Transactions",
x = "Month",
y = "Transaction Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate month labels for clarity
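The fitted model explains under 1% of the variance (R-squared = 0.008, p = 0.78), and `sin(month_numeric^2)` does not repeat on a 12-month cycle. A common alternative for annual seasonality is a harmonic regression with paired sine and cosine terms of period 12. The sketch below uses simulated counts, not the real monthly totals; the amplitude, baseline, noise level, and object names are assumptions for illustration.

```r
set.seed(1)  # reproducible simulated data (illustrative only)

month_numeric <- 1:12

# Simulated monthly counts with a known 12-month cycle plus noise
sim_counts <- 400 + 30 * sin(2 * pi * month_numeric / 12) + rnorm(12, sd = 5)

# Harmonic regression: sine and cosine terms with a 12-month period
harmonic_fit <- lm(
  sim_counts ~ sin(2 * pi * month_numeric / 12) +
               cos(2 * pi * month_numeric / 12)
)

coef(harmonic_fit)                # intercept, sine and cosine amplitudes
summary(harmonic_fit)$r.squared   # share of variance explained
```

Using both sine and cosine lets the model estimate the phase of the cycle as well as its amplitude. Applied to `monthly_counts`, the same formula with `transaction_count` as the response would test for a single annual cycle, although 12 data points leave little power either way.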
We transform the payment_method variable into a binary format to predict the likelihood of using “UPI”.
# Convert payment_method to binary: 1 if "UPI", 0 otherwise
data <- data |>
mutate(binary_status = if_else(payment_method == "UPI", 1, 0))
We’ll build a logistic regression model to predict binary_status based on explanatory variables such as product_amount, transaction_fee, and cashback.
# Fit logistic regression model
logit_model <- glm(binary_status ~ product_amount + transaction_fee + cashback,
data = data, family = binomial(link='logit'))
summary(logit_model)
##
## Call:
## glm(formula = binary_status ~ product_amount + transaction_fee +
## cashback, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.563e+00 1.136e-01 -13.756 <2e-16 ***
## product_amount -8.279e-06 1.227e-05 -0.675 0.5000
## transaction_fee 3.796e-03 2.436e-03 1.558 0.1191
## cashback 2.334e-03 1.243e-03 1.878 0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5001.3 on 4999 degrees of freedom
## Residual deviance: 4994.9 on 4996 degrees of freedom
## AIC: 5002.9
##
## Number of Fisher Scoring iterations: 4
Intercept (-1.563): Represents the log odds of using UPI when all predictors are zero. This corresponds to a probability of about 0.17, so UPI is the less likely choice under these conditions.
Product Amount (-8.279e-06): For each unit increase in product_amount, the log odds of using UPI decrease by approximately 0.0000083, but this effect is not statistically significant (p = 0.5000).
Transaction Fee (0.003796): For each unit increase in transaction_fee, the log odds of using UPI increase by about 0.0038; this coefficient is also not statistically significant (p = 0.1191).
Cashback (0.002334): For each unit increase in cashback, the log odds of opting for UPI increase by approximately 0.0023. The effect falls just short of significance at the 5% level (p = 0.0604).
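Log odds are easier to interpret after converting them to a probability (for the intercept) or to odds ratios (for the slopes). The sketch below applies `plogis()` and `exp()` to the estimates printed in the model summary above; the values are typed in by hand, so they carry the rounding of that table, and the object names are introduced here for illustration.

```r
# Estimates copied from the summary(logit_model) output above
coefs <- c(
  intercept       = -1.563,
  product_amount  = -8.279e-06,
  transaction_fee =  3.796e-03,
  cashback        =  2.334e-03
)

# Probability of UPI when all predictors are zero (inverse logit)
baseline_prob <- plogis(coefs[["intercept"]])

# Multiplicative change in the odds of UPI per unit increase in each predictor
odds_ratios <- exp(coefs[-1])

round(baseline_prob, 3)
round(odds_ratios, 5)
```

All three odds ratios are within half a percent of 1, which shows how small the per-unit effects are. With the fitted object available, `exp(coef(logit_model))` gives the same quantities without retyping.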
We can use the standard errors of coefficients to calculate confidence intervals, giving insights into the stability of our estimates.
# Calculate confidence interval for the product_amount coefficient
coef_estimate <- summary(logit_model)$coefficients["product_amount", "Estimate"]
coef_se <- summary(logit_model)$coefficients["product_amount", "Std. Error"]
conf_int <- coef_estimate + c(-1.96, 1.96) * coef_se
conf_int
## [1] -3.233528e-05 1.577759e-05
C.I. = (-3.23e-05, 1.58e-05): We are 95% confident that the true effect of product_amount on the log odds of choosing UPI lies between approximately -0.0000323 and 0.0000158. Because the interval includes zero, product_amount may have no effect on the likelihood of selecting UPI, which is consistent with its p-value of 0.50. Applying the same formula to cashback (estimate 0.002334, standard error 0.001243) gives roughly (-0.0001, 0.0048), an interval that also includes zero, though only narrowly.
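The same Wald interval can be computed for every coefficient at once. This sketch reuses the estimates and standard errors printed in the model summary above (typed in by hand, so subject to that table's rounding) and flags which intervals contain zero; the object names are introduced here for illustration.

```r
# Estimates and standard errors copied from the summary(logit_model) output
est <- c(intercept = -1.563, product_amount = -8.279e-06,
         transaction_fee = 3.796e-03, cashback = 2.334e-03)
se  <- c(intercept = 1.136e-01, product_amount = 1.227e-05,
         transaction_fee = 2.436e-03, cashback = 1.243e-03)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Wald confidence intervals: estimate +/- z * standard error
wald_ci <- cbind(lower = est - z * se, upper = est + z * se)

# TRUE where the interval contains zero
includes_zero <- wald_ci[, "lower"] < 0 & wald_ci[, "upper"] > 0

signif(wald_ci, 3)
includes_zero
```

With the fitted model in memory, `confint.default(logit_model)` returns these Wald intervals directly.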
# Convert payment_method to binary variable
data <- data %>%
mutate(payment_method_binary = if_else(payment_method == "UPI", "UPI", "Non-UPI"))
# Add predicted probabilities to data
data <- data %>%
mutate(predicted_prob = predict(logit_model, type = "response"))
# Plot predicted probabilities against cashback, colored by payment_method_binary
ggplot(data, aes(x = cashback, y = predicted_prob, color = payment_method_binary)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Predicted Probability of UPI Payment by Cashback",
x = "Cashback",
y = "Predicted Probability of UPI",
color = "Payment Method") +
scale_color_manual(values = c("Non-UPI" = "red", "UPI" = "black"),
labels = c("Non-UPI", "UPI")) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Insights: Predicted Probability of UPI Payment by Cashback
Trend analysis: The graph shows the predicted probability of using UPI (Unified Payments Interface) rising with the cashback amount, as expected from the positive cashback coefficient in the model.
Probability range: The predicted probability of UPI payment ranges from approximately 0.18 to 0.24. UPI is therefore the choice in roughly one in five transactions across the whole cashback range, and the increase with higher cashback is modest.
Payment method distinction: The smoothed line for UPI transactions (black) sits above the line for non-UPI transactions (red), which is consistent with higher cashback being associated with UPI usage. Because the cashback coefficient is only marginally significant (p = 0.0604), this pattern should be read as suggestive rather than conclusive.
Box Plot
# Create the box plot
ggplot(data, aes(x = payment_method_binary, y = cashback, fill = payment_method_binary)) +
geom_boxplot() +
labs(title = "Box Plot of Cashback by Payment Method",
x = "Payment Method",
y = "Cashback") +
scale_fill_manual(values = c("Non-UPI" = "gray", "UPI" = "black")) +
theme_minimal()
Insights:
Cashback distribution: The box plot suggests that the cashback distribution for UPI payments (black box) is somewhat higher than for non-UPI payments (gray box). The median cashback for UPI appears to lie above the non-UPI median, indicating that UPI users may receive slightly higher cashback on average.
Spread of data: The interquartile range (IQR) for UPI appears larger, suggesting greater variability in the cashback received with UPI than with other methods.
Outliers: Any points outside the whiskers would indicate users receiving unusually high or low cashback amounts, which may warrant further investigation.
Overall Insights:
UPI and cashback: The analyses suggest that higher cashback amounts may be associated with greater use of UPI, but the effect is small and not statistically significant at the 5% level (p = 0.0604). Businesses considering cashback offers to promote UPI should treat this as a hypothesis to test rather than an established effect.
Targeted promotions: Understanding the relationship between cashback and payment method can help in formulating targeted promotions. Increased cashback incentives might encourage more users to adopt UPI, although the current model does not confirm this.
Further analysis: Additional analysis could explore the demographic or behavioral characteristics of users who choose UPI versus other payment methods, and investigate the reasons behind the variability in cashback amounts.