# Load the packages used in this analysis
library(car)      # vif()
library(dplyr)    # %>%, mutate(), if_else()
library(ggplot2)  # plotting
# Load the dataset
data <- read.csv("C:/Users/My PC/Downloads/Digital Wallet Dataaa/digital_wallet_transactions.csv")
# Convert payment_method to a binary variable: UPI (1) and Non-UPI (0)
data <- data %>%
  mutate(payment_method_binary = if_else(payment_method == "UPI", 1, 0))
Model Building: We will build a logistic regression model using payment_method_binary as the response variable, with product_amount, transaction_fee, and cashback as explanatory variables.
# Fit logistic regression model
logit_model <- glm(payment_method_binary ~ product_amount + transaction_fee + cashback,
                   data = data, family = binomial(link = "logit"))
summary(logit_model)
##
## Call:
## glm(formula = payment_method_binary ~ product_amount + transaction_fee +
## cashback, family = binomial(link = "logit"), data = data)
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -1.563e+00  1.136e-01 -13.756   <2e-16 ***
## product_amount  -8.279e-06  1.227e-05  -0.675   0.5000
## transaction_fee  3.796e-03  2.436e-03   1.558   0.1191
## cashback         2.334e-03  1.243e-03   1.878   0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5001.3 on 4999 degrees of freedom
## Residual deviance: 4994.9 on 4996 degrees of freedom
## AIC: 5002.9
##
## Number of Fisher Scoring iterations: 4
Key Coefficients:
Intercept (-1.563): Represents the log-odds of selecting UPI when product_amount, transaction_fee, and cashback are all zero. On the probability scale this is about 0.17, so the baseline probability of UPI usage is low when all three predictors are zero.
Product Amount (-8.279e-06): The estimate is close to zero and not statistically significant (p = 0.5000), indicating little to no relationship between product_amount and the likelihood of UPI usage.
Transaction Fee (0.003796): Although positive, this effect is not statistically significant (p = 0.1191). It suggests a small increase in UPI likelihood with increasing transaction fees, but the evidence is weak.
Cashback (0.002334): This positive coefficient suggests that an increase in cashback may be associated with a higher probability of UPI usage. The p-value (0.0604) is just above the conventional 0.05 threshold, indicating a potential trend but not strong evidence.
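Log-odds coefficients are easier to read after conversion to a baseline probability and odds ratios. The sketch below is a minimal illustration that uses the rounded estimates copied from the summary output; the vector name `est` is introduced here for illustration only. With the fitted object available, `plogis(coef(logit_model)[1])` and `exp(coef(logit_model))` should give the unrounded equivalents.

```r
# Estimates copied from the summary output above (rounded as printed)
est <- c(intercept = -1.563, product_amount = -8.279e-06,
         transaction_fee = 3.796e-03, cashback = 2.334e-03)

# Baseline probability of UPI when all predictors are zero (inverse logit)
baseline_prob <- plogis(est[["intercept"]])

# Odds ratios: multiplicative change in the odds of UPI
# for a one-unit increase in each predictor
odds_ratios <- exp(est[-1])

round(baseline_prob, 3)
round(odds_ratios, 5)
```

All three odds ratios sit very close to 1. For cashback, for example, each additional unit multiplies the odds of UPI by roughly 1.0023, which is an increase of about 0.23% in the odds.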
Model Diagnostics: To check for multicollinearity, we calculate the variance inflation factor (VIF) for each predictor; to assess model fit, we examine the residual deviance and AIC.
# Calculate VIF to check for multicollinearity
vif_values <- vif(logit_model)
vif_values
##  product_amount transaction_fee        cashback
##        1.000039        1.000016        1.000039
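To show what a VIF near 1 means, the sketch below computes the textbook linear-model VIF, 1 / (1 - R²), by regressing each predictor on the others. It uses simulated, independent predictors (`x1`, `x2`, `x3` are hypothetical names, not columns of the wallet dataset). The values that `vif()` reports for a GLM are derived from the estimated coefficient covariance matrix, so they may differ slightly from this unweighted calculation.

```r
set.seed(1)  # simulated data, for illustration only

n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# VIF for one predictor: 1 / (1 - R^2) from regressing it on the others
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

vif_manual
```

Because the simulated predictors are independent, each R² is close to zero and each VIF is close to 1, which mirrors the pattern reported above.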
Approximate 95% (Wald) confidence interval for the cashback coefficient, computed as the estimate plus or minus 1.96 standard errors:
cashback_estimate <- coef(logit_model)["cashback"]
cashback_se <- summary(logit_model)$coefficients["cashback", "Std. Error"]
cashback_ci <- cashback_estimate + c(-1.96, 1.96) * cashback_se
cashback_ci
## [1] -0.0001017219 0.0047701802
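The interval above just includes zero, which is consistent with the p-value of 0.0604. As a rough check, the sketch below recomputes the Wald z statistic, the two-sided p-value, and the interval on the odds-ratio scale from the rounded estimate and standard error printed in the summary. The variable names are illustrative, and the results differ from the output above in the last digits because of rounding.

```r
# Values copied from the summary output above (rounded as printed)
est <- 2.334e-03
se  <- 1.243e-03

# Wald z statistic and two-sided p-value
z <- est / se
p_value <- 2 * pnorm(-abs(z))

# 95% Wald interval on the log-odds scale, then on the odds-ratio scale
ci_logodds <- est + c(-1, 1) * qnorm(0.975) * se
ci_odds_ratio <- exp(ci_logodds)

round(c(z = z, p = p_value), 4)
ci_odds_ratio
```

An odds-ratio interval that spans 1 says the same thing as a log-odds interval that spans 0. Calling `confint()` on the fitted model would give a profile-likelihood interval instead, which can differ slightly from the Wald interval.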
Visualize the predicted probabilities of UPI selection against cashback.
# Add predicted probabilities
data <- data %>%
  mutate(predicted_prob = predict(logit_model, type = "response"))
# Plot predicted probabilities by cashback
ggplot(data, aes(x = cashback, y = predicted_prob, color = factor(payment_method_binary))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Predicted Probability of UPI Payment by Cashback",
       x = "Cashback",
       y = "Predicted Probability of UPI",
       color = "Payment Method") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("Non-UPI", "UPI")) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The predicted probabilities in the plot span a very narrow range (approximately 0.18–0.24), indicating that the model does not clearly separate UPI from Non-UPI transactions on the basis of the given variables. This could be because each predictor has only a small effect, or because these predictors explain little of the choice of payment method.
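A narrow range of predicted probabilities usually goes with weak discrimination, which can be summarized by the area under the ROC curve (AUC). The sketch below is an illustration on simulated data only: the predictor `x`, the coefficients, and the sample are assumptions chosen to mimic a weak effect, not values from the wallet dataset. The AUC is computed in base R through the rank-sum (Mann-Whitney) identity.

```r
set.seed(42)  # simulated data, for illustration only

n <- 5000
x <- runif(n, 0, 100)                        # hypothetical predictor
y <- rbinom(n, 1, plogis(-1.5 + 0.002 * x))  # weak true effect

fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(p_hat)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

range(p_hat)
auc
```

An AUC near 0.5 means the predicted probabilities rank UPI and Non-UPI cases little better than chance. Applying the same calculation to `predicted_prob` and `payment_method_binary` would quantify this for the fitted model.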
Diagnostic Summary and Issues with the Model
Multicollinearity:
The variance inflation factors (VIFs) for product_amount, transaction_fee, and cashback are all very close to 1, indicating no multicollinearity concerns among the predictors.
Model Fit:
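AIC: 5002.9. AIC has no absolute scale, so this value is informative only when compared with other models fitted to the same data; an intercept-only model would have an AIC of about 5003.3 (the null deviance plus 2), which is essentially the same. Residual Deviance: 4994.9 on 4996 degrees of freedom, only 6.4 below the null deviance of 5001.3 on 4999 degrees of freedom. A drop of 6.4 on 3 degrees of freedom falls short of the 5% chi-squared critical value of 7.81, so the model does not significantly improve over a null model (a model without predictors), implying weak predictive power of the chosen explanatory variables.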
Predictive Limitations:
As noted for the plot, the predicted probabilities span only about 0.18–0.24, so the model barely separates UPI from Non-UPI transactions. The likely causes are small effect sizes for each predictor or predictors that explain little of the choice of payment method.
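The model-fit figures in this summary can be checked with a likelihood-ratio test against the intercept-only model. The sketch below is a minimal version that uses only the deviances and degrees of freedom printed in the summary output; the variable names are illustrative. With the fitted object available, `anova(logit_model, test = "Chisq")` should report the corresponding sequential tests.

```r
# Deviances and degrees of freedom copied from the summary output above
null_dev  <- 5001.3; null_df  <- 4999
resid_dev <- 4994.9; resid_df <- 4996

# Likelihood-ratio statistic: drop in deviance from adding the predictors
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For ungrouped binary data, AIC = deviance + 2 * (number of coefficients)
aic_full <- resid_dev + 2 * 4  # intercept plus three slopes
aic_null <- null_dev + 2 * 1   # intercept only

c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
c(aic_full = aic_full, aic_null = aic_null)
```

The resulting p-value lies between 0.05 and 0.10, and the two AIC values differ by less than half a point, so the three predictors together add little beyond the intercept.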
These insights suggest that while cashback may have a small positive association with UPI selection, the evidence falls short of conventional significance, the model’s overall fit is weak, and additional explanatory factors need to be explored.