The aim of this paper is to identify and quantify the factors that influence customer churn in e-commerce from an econometric perspective. Customer churn remains a critical challenge for business profitability, as retaining existing customers is often more cost-effective than acquiring new ones. In a competitive landscape marked by evolving consumer expectations, understanding the drivers of churn is more important than ever.
Using a synthetic e-commerce dataset, we employ a binary logistic regression model to examine the impact of customer characteristics such as shopping behavior, engagement metrics, demographics, satisfaction levels, and product preferences on churn likelihood. Contrary to initial expectations, higher satisfaction scores were associated with a slight increase in churn probability, suggesting complex dynamics at play. Meanwhile, longer tenure and recent purchase activity significantly reduced the risk of churn, while complaints and greater delivery distance increased it.
Notably, customers with more registered devices or a preference for niche product categories were more likely to churn, while those favoring mainstream categories like laptops or mobile phones were less likely to leave. The results confirm that both behavioral and demographic variables play a significant role in customer retention. These findings provide practical guidance for firms to improve complaint handling, segment at-risk customers more effectively, and tailor engagement strategies. Furthermore, the study builds a foundation for applying advanced predictive models such as machine learning to refine churn prediction and enhance customer relationship management.
In recent years, e-commerce has developed into a fast-growing industry in which competition between companies is becoming increasingly fierce. One of the biggest challenges for e-commerce platforms is customer churn. When customers stop shopping or switch to another platform, this not only reduces revenue, but also wastes investment costs for marketing, advertising and customer care. According to a Harvard Business Review report, the cost of acquiring a new customer can be 5 to 25 times higher than the cost of retaining an existing customer. In addition, research by Bain & Company shows that a 5% increase in customer retention can increase profits by 25 to 95%. These figures underline the economic importance of analyzing customer churn behavior in the e-commerce sector.
In this context, understanding the factors that influence churn behavior is a top priority for companies in order to optimize their customer retention strategies. Although many previous studies have applied statistical methods and machine learning models to predict customer churn, there is still a lack of in-depth analysis of the causal relationship between customer behavior characteristics and churn probability based on traditional econometric approaches.
This study adopts a binary logit regression model to quantify the relationship between customer characteristics and churn probability. The variables analyzed include time spent on the platform, purchase behavior, engagement, payment method, satisfaction level and complaint occurrence. By applying this methodology, the study not only identifies the key factors influencing churn, but also provides clear quantitative evidence to support effective customer management strategies.
It is expected that the results of this study will provide insights for the development of more advanced customer churn prediction models in the future and assist managers in making strategic decisions about marketing, customer care and the optimization of company resources.
Customer churn is a pressing issue in the e-commerce sector, where acquiring new customers is generally more expensive than retaining existing ones. Understanding the underlying drivers of churn enables businesses to allocate marketing resources more efficiently and develop effective retention strategies. A growing body of research has explored behavioral, demographic, experiential, and preference-based factors influencing customer churn using both traditional econometric models and modern machine learning approaches.
Customer behavior and engagement are among the most consistently cited predictors of churn. Li (2022) applied a Random Forest model to e-commerce churn prediction and found that OrderCount, DaySinceLastOrder, and Complain were among the most influential predictors. Similarly, Berger and Kompan (2019) emphasized behavioral metrics such as HourSpendOnApp and NumberOfDeviceRegistered, arguing that decreased engagement with the platform often precedes churn.
Bhattacharya (2021) also supported this view, showing that declining frequency of interactions and lower transaction volume significantly raised churn risk among e-commerce customers. The study highlighted that users with sporadic usage patterns are more susceptible to disengagement. These results underscore the importance of continuous engagement in mitigating churn.
Demographic characteristics, such as gender, age, and tenure, are also commonly linked to churn behavior. In Li’s (2022) study, female users and customers with shorter tenure periods were more likely to churn. Similarly, Berger and Kompan (2019) found gender to be a significant factor, while Liu and Wang (2010) identified educational attainment as a key demographic influencing churn in the service sector—suggesting that more educated users may have higher service expectations and a lower tolerance for dissatisfaction.
Adding to this, Ahmad et al. (2019) showed that age group and marital status play a significant role in customer retention, with younger and single customers demonstrating a higher probability of switching platforms. These insights are especially relevant when designing personalized retention strategies.
Product preferences, while often overlooked, are crucial in understanding churn. Berger and Kompan (2019) found that customers with narrow or niche product interests were more prone to churn when platforms failed to meet their expectations. In a similar study, Dahiya and Bhatia (2020) analyzed product-level transaction data and concluded that customers who primarily purchased electronics or single-category goods exhibited more volatile loyalty patterns, particularly when competitors offered better alternatives.
This aligns with the current study’s findings, where customers who preferred mainstream product categories (e.g., laptops or mobile phones) were significantly less likely to churn than those in “Other” categories.
Customer experience—including satisfaction levels and service-related issues—plays a fundamental role in determining churn. Li (2022) found that SatisfactionScore and Complain strongly influence churn likelihood, echoing earlier findings by Mittal and Kamakura (2001), who showed that customer satisfaction and complaint handling quality directly impact retention and brand loyalty.
Moreover, Jaiswal and Niraj (2011) emphasized that satisfaction must be interpreted in context; not all satisfied customers are loyal. Factors such as perceived switching cost, emotional attachment, and service recovery also mediate the relationship between satisfaction and churn.
The dataset used for this analysis is Ecommerce Customer Churn Analysis and Prediction.csv, provided on Kaggle (https://www.kaggle.com/datasets/ankitverma2010/ecommerce-customer-churn-analysis-and-prediction). It contains information about customer behavior, including demographics, purchase frequency, usage and attitudes, loyalty, and complaints.
Dataset contains 17 variables:
Demographic Factors
Customer Behavior and Engagement
Experience with the Platform
# Packages & Libraries
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(GGally,corrplot,car,lmtest,modelsummary,margins,BaylorEdPsych,ResourceSelection,caret,pROC)
# Import dataset
data<-read.csv("E Commerce Dataset.csv", sep=";", dec=".",header=TRUE)
# Checking data
head(data)
dim(data)
## [1] 5630 20
summary(data)
## CustomerID Churn Tenure PreferredLoginDevice
## Min. :50001 Min. :0.0000 Min. : 0.00 Length:5630
## 1st Qu.:51408 1st Qu.:0.0000 1st Qu.: 2.00 Class :character
## Median :52816 Median :0.0000 Median : 9.00 Mode :character
## Mean :52816 Mean :0.1684 Mean :10.19
## 3rd Qu.:54223 3rd Qu.:0.0000 3rd Qu.:16.00
## Max. :55630 Max. :1.0000 Max. :61.00
## NA's :264
## CityTier WarehouseToHome PreferredPaymentMode Gender
## Min. :1.000 Min. : 5.00 Length:5630 Length:5630
## 1st Qu.:1.000 1st Qu.: 9.00 Class :character Class :character
## Median :1.000 Median : 14.00 Mode :character Mode :character
## Mean :1.655 Mean : 15.64
## 3rd Qu.:3.000 3rd Qu.: 20.00
## Max. :3.000 Max. :127.00
## NA's :251
## HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat SatisfactionScore
## Min. :0.000 Min. :1.000 Length:5630 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.000 Class :character 1st Qu.:2.000
## Median :3.000 Median :4.000 Mode :character Median :3.000
## Mean :2.932 Mean :3.689 Mean :3.067
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :6.000 Max. :5.000
## NA's :255
## MaritalStatus NumberOfAddress Complain
## Length:5630 Min. : 1.000 Min. :0.0000
## Class :character 1st Qu.: 2.000 1st Qu.:0.0000
## Mode :character Median : 3.000 Median :0.0000
## Mean : 4.214 Mean :0.2849
## 3rd Qu.: 6.000 3rd Qu.:1.0000
## Max. :22.000 Max. :1.0000
##
## OrderAmountHikeFromlastYear CouponUsed OrderCount
## Min. :11.00 Min. : 0.000 Min. : 1.000
## 1st Qu.:13.00 1st Qu.: 1.000 1st Qu.: 1.000
## Median :15.00 Median : 1.000 Median : 2.000
## Mean :15.71 Mean : 1.751 Mean : 3.008
## 3rd Qu.:18.00 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :26.00 Max. :16.000 Max. :16.000
## NA's :265 NA's :256 NA's :258
## DaySinceLastOrder CashbackAmount
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 2.000 1st Qu.:146.0
## Median : 3.000 Median :163.0
## Mean : 4.543 Mean :177.2
## 3rd Qu.: 7.000 3rd Qu.:196.0
## Max. :46.000 Max. :325.0
## NA's :307
Several variables contain missing (NA) values, so we use complete.cases() to keep only the rows with no missing values.
data <- data[complete.cases(data), ]
any(is.na(data))
## [1] FALSE
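Before dropping rows, it can help to measure how much data listwise deletion costs. When the missing values of different columns fall in different rows, the share of dropped rows can be far larger than the share of missing values in any single column. The sketch below illustrates this on a small made-up data frame; `toy` and its values are assumptions for illustration, not the churn data.

```r
# Toy data frame standing in for the churn data (values are made up)
toy <- data.frame(
  Tenure     = c(1, NA, 3, 4, 5, 6),
  OrderCount = c(2, 2, NA, 1, 3, 4),
  Complain   = c(0, 1, 0, 0, 1, 0)
)

# Number and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

# Rows that survive listwise deletion
keep          <- complete.cases(toy)
rows_dropped  <- sum(!keep)
share_dropped <- mean(!keep)

print(na_count)
print(round(na_share, 3))
cat("Rows dropped:", rows_dropped, "of", nrow(toy), "\n")
```

Applied to the real data, the same calls would be run on `data` before the `complete.cases()` filter.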
# Remove the CustomerID column (an identifier with no explanatory content)
data <- data[, -1]
# Checking distribution of numeric data
barplot(table(data$Churn),
col = "lightblue",
xlab = "Churn",
ylab = "Number of Customers",
main = "Bar Plot of Churn Variable",
space = 0.8)
barplot(table(data$Tenure),
col = "lightblue",
xlab = "Tenure",
ylab = "Number of Customers",
main = "Bar Plot of Tenure Variable",
space = 0.2)
barplot(table(data$CityTier),
col = "lightblue",
xlab = "CityTier",
ylab = "Number of Customers",
main = "Bar Plot of CityTier Variable",
space = 0.6)
barplot(table(data$WarehouseToHome),
col = "lightblue",
xlab = "WarehouseToHome",
ylab = "Number of Customers",
main = "Bar Plot of WarehouseToHome Variable",
space = 0.1)
barplot(table(data$HourSpendOnApp),
col = "lightblue",
xlab = "HourSpendOnApp",
ylab = "Number of Customers",
main = "Bar Plot of HourSpendOnApp Variable",
space = 0.3)
barplot(table(data$NumberOfDeviceRegistered),
col = "lightblue",
xlab = "NumberOfDeviceRegistered",
ylab = "Number of Customers",
main = "Bar Plot of NumberOfDeviceRegistered Variable",
space = 0.3)
barplot(table(data$SatisfactionScore),
col = "lightblue",
xlab = "SatisfactionScore",
ylab = "Number of Customers",
main = "Bar Plot of SatisfactionScore Variable",
space = 0.3)
barplot(table(data$NumberOfAddress),
col = "lightblue",
xlab = "NumberOfAddress",
ylab = "Number of Customers",
main = "Bar Plot of NumberOfAddress Variable",
space = 0.1)
barplot(table(data$Complain),
col = "lightblue",
xlab = "Complain",
ylab = "Number of Customers",
main = "Bar Plot of Complain Variable",
space = 0.8)
barplot(table(data$OrderAmountHikeFromlastYear),
col = "lightblue",
xlab = "OrderAmountHikeFromlastYear",
ylab = "Number of Customers",
main = "Bar Plot of OrderAmountHikeFromlastYear Variable",
space = 0.1)
barplot(table(data$CouponUsed),
col = "lightblue",
xlab = "CouponUsed",
ylab = "Number of Customers",
main = "Bar Plot of CouponUsed Variable",
space = 0.2)
barplot(table(data$OrderCount),
col = "lightblue",
xlab = "OrderCount",
ylab = "Number of Customers",
main = "Bar Plot of OrderCount Variable",
space = 0.2)
barplot(table(data$DaySinceLastOrder),
col = "lightblue",
xlab = "DaySinceLastOrder",
ylab = "Number of Customers",
main = "Bar Plot of DaySinceLastOrder Variable",
space = 0.2)
barplot(table(data$CashbackAmount),
col = "lightblue",
xlab = "CashbackAmount",
ylab = "Number of Customers",
main = "Bar Plot of CashbackAmount Variable",
space = 0.1)
# Checking distribution of character data
barplot(table(data$PreferredLoginDevice),
col = "blue",
xlab = "PreferredLoginDevice",
ylab = "Number of Customers",
main = "Bar Plot of PreferredLoginDevice Variable",
space = 0.8)
barplot(table(data$PreferredPaymentMode),
col = "blue",
ylab = "Number of Customers",
main = "Bar Plot of PreferredPaymentMode Variable",
space = 0.2,
las = 2,
cex.names = 1)
barplot(table(data$Gender),
col = "blue",
xlab = "Gender",
ylab = "Number of Customers",
main = "Bar Plot of Gender Variable",
space = 0.8)
barplot(table(data$PreferedOrderCat),
col = "blue",
ylab = "Number of Customers",
main = "Bar Plot of PreferedOrderCat Variable",
space = 0.2,
las = 2,
cex.names = 1)
barplot(table(data$MaritalStatus),
col = "blue",
xlab = "MaritalStatus",
ylab = "Number of Customers",
main = "Bar Plot of MaritalStatus Variable",
space = 0.8)
# Check the relationship among numeric variables
numeric_data <- data[, sapply(data, is.numeric)]
ggpairs(numeric_data)
corrplot(cor(numeric_data),
method = "number",
number.cex = 0.6,
tl.cex = 0.6,
order = "hclust")
This heatmap shows the correlation matrix between various customer behavior and profile features.
Correlation values:
Color scheme: Dark blue indicates strong positive correlation; red indicates strong negative correlation.
# Churn within categories
churn_gender_table <- table(data$Churn, data$Gender)
barplot(churn_gender_table, beside = TRUE, col = c("lightblue", "deepskyblue"),
legend = rownames(churn_gender_table),
xlab = "Gender", ylab = "Number of Customers",
main = "Churn by Gender")
churn_PreferredLoginDevice_table <- table(data$Churn, data$PreferredLoginDevice)
barplot(churn_PreferredLoginDevice_table, beside = TRUE, col = c("lightblue", "deepskyblue"),
legend = rownames(churn_PreferredLoginDevice_table),
xlab = "PreferredLoginDevice", ylab = "Number of Customers",
main = "Churn by PreferredLoginDevice")
# Use the full column name; data$PreferredPaymentMod only works through partial matching
churn_PreferredPaymentMode_table <- table(data$Churn, data$PreferredPaymentMode)
barplot(churn_PreferredPaymentMode_table, beside = TRUE, col = c("lightblue", "deepskyblue"),
legend = rownames(churn_PreferredPaymentMode_table),
xlab = "PreferredPaymentMode", ylab = "Number of Customers",
main = "Churn by PreferredPaymentMode",
cex.names = 0.5)
churn_PreferedOrderCat_table <- table(data$Churn, data$PreferedOrderCat)
barplot(churn_PreferedOrderCat_table, beside = TRUE, col = c("lightblue", "deepskyblue"),
legend = rownames(churn_PreferedOrderCat_table),
xlab = "PreferedOrderCat", ylab = "Number of Customers",
main = "Churn by PreferedOrderCat",
cex.names = 0.3)
churn_MaritalStatus_table <- table(data$Churn, data$MaritalStatus)
barplot(churn_MaritalStatus_table, beside = TRUE, col = c("lightblue", "deepskyblue"),
legend = rownames(churn_MaritalStatus_table),
xlab = "MaritalStatus", ylab = "Number of Customers",
main = "Churn by MaritalStatus",
cex.names = 0.7)
To obtain reliable and accurate model results, data preprocessing is essential. As shown in the summary above, seven variables each have missing values (NA) in roughly 5% of their rows. Because these gaps fall in different rows, removing every incomplete row drops 1,856 of the 5,630 observations (about 33%) and leaves 3,774 complete cases for the subsequent analysis.
The dataset contains a mix of numerical and categorical variables. To handle the categorical variables properly, all text fields were converted into factors, so that R expands them into dummy variables when the model is estimated.
# Convert category variables into factor
data[sapply(data, is.character)] <- lapply(data[sapply(data, is.character)], factor)
# Check result
summary(data)
## Churn Tenure PreferredLoginDevice CityTier
## Min. :0.0000 Min. : 0.000 Computer :1111 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 1.000 Mobile Phone:1936 1st Qu.:1.000
## Median :0.0000 Median : 8.000 Phone : 727 Median :1.000
## Mean :0.1672 Mean : 8.777 Mean :1.708
## 3rd Qu.:0.0000 3rd Qu.:13.000 3rd Qu.:3.000
## Max. :1.0000 Max. :51.000 Max. :3.000
##
## WarehouseToHome PreferredPaymentMode Gender HourSpendOnApp
## Min. : 5.00 Cash on Delivery: 48 Female:1503 Min. :0.000
## 1st Qu.: 9.00 CC : 35 Male :2271 1st Qu.:2.000
## Median : 14.00 COD : 301 Median :3.000
## Mean : 15.74 Credit Card :1124 Mean :2.981
## 3rd Qu.: 21.00 Debit Card :1538 3rd Qu.:3.000
## Max. :127.00 E wallet : 443 Max. :5.000
## UPI : 285
## NumberOfDeviceRegistered PreferedOrderCat SatisfactionScore
## Min. :1.000 Fashion : 443 Min. :1.000
## 1st Qu.:3.000 Grocery : 6 1st Qu.:2.000
## Median :4.000 Laptop & Accessory:1961 Median :3.000
## Mean :3.754 Mobile : 119 Mean :3.056
## 3rd Qu.:4.000 Mobile Phone :1227 3rd Qu.:4.000
## Max. :6.000 Others : 18 Max. :5.000
##
## MaritalStatus NumberOfAddress Complain OrderAmountHikeFromlastYear
## Divorced: 547 Min. : 1.000 Min. :0.0000 Min. :11.00
## Married :1982 1st Qu.: 2.000 1st Qu.:0.0000 1st Qu.:13.00
## Single :1245 Median : 3.000 Median :0.0000 Median :15.00
## Mean : 4.216 Mean :0.2822 Mean :15.73
## 3rd Qu.: 6.000 3rd Qu.:1.0000 3rd Qu.:18.00
## Max. :22.000 Max. :1.0000 Max. :26.00
##
## CouponUsed OrderCount DaySinceLastOrder CashbackAmount
## Min. : 0.00 Min. : 1.000 Min. : 0.000 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.:148.2
## Median : 1.00 Median : 2.000 Median : 3.000 Median :160.0
## Mean : 1.72 Mean : 2.825 Mean : 4.526 Mean :164.2
## 3rd Qu.: 2.00 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:178.0
## Max. :16.00 Max. :16.000 Max. :46.000 Max. :325.0
##
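The factor summary above shows labels that look like duplicates of one another, for example "Phone" and "Mobile Phone" in PreferredLoginDevice, "CC" and "Credit Card" or "COD" and "Cash on Delivery" in PreferredPaymentMode, and "Mobile" and "Mobile Phone" in PreferedOrderCat. Whether these really denote the same category is an assumption that would need to be checked against the data source. If it holds, the levels could be merged before modeling, as in this minimal sketch on a toy factor (the vector `device` is made up for illustration).

```r
# Toy factor with the same kind of duplicate labels seen in the summary
device <- factor(c("Computer", "Mobile Phone", "Phone", "Phone", "Mobile Phone"))

# ASSUMPTION: "Phone" and "Mobile Phone" describe the same login device.
# Renaming a level to an existing level name merges the two levels.
levels(device)[levels(device) == "Phone"] <- "Mobile Phone"

print(table(device))
```

Merging would reduce the number of dummy variables and change the reference comparisons, so the coefficients reported in this paper would not carry over unchanged.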
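We apply a log transformation to the right-skewed continuous variables. Because log(0) is undefined, we first check which variables contain zeros and add 1 to those variables before taking logs.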
# Check "zero-value"
vars_to_check <- c("Tenure", "WarehouseToHome", "NumberOfAddress",
"OrderAmountHikeFromlastYear", "CouponUsed",
"OrderCount", "DaySinceLastOrder", "CashbackAmount")
sapply(vars_to_check, function(var) {
sum(data[[var]] == 0)
})
## Tenure WarehouseToHome
## 303 0
## NumberOfAddress OrderAmountHikeFromlastYear
## 0 0
## CouponUsed OrderCount
## 610 0
## DaySinceLastOrder CashbackAmount
## 207 4
#Log - transformation
data$log_Tenure <- log(data$Tenure + 1)
data$log_WarehouseToHome <- log(data$WarehouseToHome)
data$log_NumberOfAddress <- log(data$NumberOfAddress)
data$log_OrderAmountHikeFromlastYear <- log(data$OrderAmountHikeFromlastYear)
data$log_CouponUsed <- log(data$CouponUsed + 1)
data$log_OrderCount <- log(data$OrderCount)
data$log_DaySinceLastOrder <- log(data$DaySinceLastOrder + 1)
data$log_CashbackAmount <- log(data$CashbackAmount + 1)
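The reason for the `+ 1` shift can be seen on a small vector. The sketch below uses made-up counts (`x` is an assumption for illustration) and also shows that base R's `log1p()` computes the same shifted logarithm, with `expm1()` as its inverse.

```r
x <- c(0, 1, 9, 99)          # illustrative counts, including a zero

plain   <- log(x)            # first element is -Inf, which glm() cannot use
shifted <- log(x + 1)        # the transformation used in the text
via_log1p <- log1p(x)        # same values, computed more accurately near zero
back <- expm1(via_log1p)     # inverse transformation recovers x

print(plain)
print(shifted)
print(back)
```

One consequence of the shift is that coefficients on these variables describe changes in log(x + 1) rather than in log(x), which matters mostly for small values of x.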
# Compare logit and probit model
mylogit <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder + log_CashbackAmount,data = data,
family = binomial(link = "logit"))
summary(mylogit)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp +
## NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore +
## MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear +
## log_CouponUsed + log_OrderCount + log_DaySinceLastOrder +
## log_CashbackAmount, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.44543 2.89485 -1.536 0.124628
## log_Tenure -1.71957 0.08123 -21.170 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.52842 0.14229 -3.714 0.000204 ***
## PreferredLoginDevicePhone -0.31936 0.18092 -1.765 0.077522 .
## CityTier 0.28865 0.07756 3.721 0.000198 ***
## log_WarehouseToHome 0.68573 0.12190 5.625 1.85e-08 ***
## PreferredPaymentModeCC -0.81463 0.89271 -0.913 0.361484
## PreferredPaymentModeCOD -0.16753 0.62870 -0.266 0.789872
## PreferredPaymentModeCredit Card -0.70332 0.60164 -1.169 0.242404
## PreferredPaymentModeDebit Card -0.52641 0.59676 -0.882 0.377716
## PreferredPaymentModeE wallet -0.04716 0.61785 -0.076 0.939153
## PreferredPaymentModeUPI -0.77897 0.63084 -1.235 0.216900
## GenderMale 0.25921 0.12546 2.066 0.038812 *
## HourSpendOnApp 0.09154 0.09971 0.918 0.358606
## NumberOfDeviceRegistered 0.38531 0.06785 5.679 1.36e-08 ***
## PreferedOrderCatGrocery -12.47451 313.81841 -0.040 0.968292
## PreferedOrderCatLaptop & Accessory -1.77436 0.22183 -7.999 1.26e-15 ***
## PreferedOrderCatMobile -1.29480 0.51844 -2.498 0.012507 *
## PreferedOrderCatMobile Phone -0.90115 0.26043 -3.460 0.000540 ***
## PreferedOrderCatOthers 1.57843 0.71870 2.196 0.028076 *
## SatisfactionScore 0.25602 0.04568 5.604 2.09e-08 ***
## MaritalStatusMarried -0.29964 0.18096 -1.656 0.097759 .
## MaritalStatusSingle 0.75270 0.18137 4.150 3.32e-05 ***
## log_NumberOfAddress 1.28533 0.11808 10.885 < 2e-16 ***
## Complain 1.66321 0.12714 13.082 < 2e-16 ***
## log_OrderAmountHikeFromlastYear -0.06925 0.27770 -0.249 0.803062
## log_CouponUsed -0.26687 0.18436 -1.448 0.147744
## log_OrderCount 0.85508 0.15964 5.356 8.49e-08 ***
## log_DaySinceLastOrder -0.66471 0.11163 -5.954 2.61e-09 ***
## log_CashbackAmount 0.08730 0.53588 0.163 0.870591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1865.8 on 3744 degrees of freedom
## AIC: 1925.8
##
## Number of Fisher Scoring iterations: 13
myprobit <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder + log_CashbackAmount,data = data,
family = binomial(link = "probit"))
summary(myprobit)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp +
## NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore +
## MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear +
## log_CouponUsed + log_OrderCount + log_DaySinceLastOrder +
## log_CashbackAmount, family = binomial(link = "probit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.93279 1.37017 -1.411 0.158356
## log_Tenure -0.89854 0.04026 -22.316 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.28244 0.07562 -3.735 0.000188 ***
## PreferredLoginDevicePhone -0.15296 0.09802 -1.560 0.118660
## CityTier 0.15460 0.04119 3.753 0.000175 ***
## log_WarehouseToHome 0.36502 0.06481 5.632 1.78e-08 ***
## PreferredPaymentModeCC -0.70616 0.46244 -1.527 0.126758
## PreferredPaymentModeCOD -0.32780 0.29433 -1.114 0.265409
## PreferredPaymentModeCredit Card -0.64103 0.27724 -2.312 0.020767 *
## PreferredPaymentModeDebit Card -0.58226 0.27466 -2.120 0.034010 *
## PreferredPaymentModeE wallet -0.35276 0.28777 -1.226 0.220265
## PreferredPaymentModeUPI -0.72544 0.29718 -2.441 0.014642 *
## GenderMale 0.13393 0.06673 2.007 0.044749 *
## HourSpendOnApp 0.04268 0.05283 0.808 0.419106
## NumberOfDeviceRegistered 0.20128 0.03597 5.596 2.19e-08 ***
## PreferedOrderCatGrocery -4.40245 80.50886 -0.055 0.956391
## PreferedOrderCatLaptop & Accessory -0.98555 0.11372 -8.667 < 2e-16 ***
## PreferedOrderCatMobile -0.78459 0.27906 -2.812 0.004931 **
## PreferedOrderCatMobile Phone -0.55055 0.13446 -4.094 4.23e-05 ***
## PreferedOrderCatOthers 0.70534 0.38982 1.809 0.070393 .
## SatisfactionScore 0.12567 0.02407 5.220 1.79e-07 ***
## MaritalStatusMarried -0.16706 0.09632 -1.734 0.082832 .
## MaritalStatusSingle 0.39684 0.09715 4.085 4.41e-05 ***
## log_NumberOfAddress 0.67003 0.06185 10.834 < 2e-16 ***
## Complain 0.87246 0.06719 12.985 < 2e-16 ***
## log_OrderAmountHikeFromlastYear -0.04511 0.14829 -0.304 0.760959
## log_CouponUsed -0.16686 0.09553 -1.747 0.080692 .
## log_OrderCount 0.48010 0.08150 5.891 3.84e-09 ***
## log_DaySinceLastOrder -0.35997 0.05878 -6.124 9.14e-10 ***
## log_CashbackAmount 0.04711 0.24789 0.190 0.849284
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1908.7 on 3744 degrees of freedom
## AIC: 1968.7
##
## Number of Fisher Scoring iterations: 13
To model customer churn effectively, we chose the Binary Logistic Regression Model. Our dataset’s target variable, Churn, is a binary outcome (1/0), making logistic regression a natural and appropriate choice. Additionally, our dataset includes a combination of numerical and categorical predictors. After converting all categorical variables into factors, we ensure the logistic model can handle them appropriately without needing extensive additional preprocessing.
We prefer the binary logit model for two reasons. First, its coefficients are changes in log-odds, so exponentiating them gives odds ratios that are easy to interpret. Second, it fits the data slightly better: estimated on the same sample with the same predictors, the logit model has a lower Akaike Information Criterion (AIC = 1925.8) than the probit model (AIC = 1968.7). Since both models have the same number of parameters, the lower AIC reflects a higher log-likelihood for the logit specification.
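The reported AIC values can be reproduced by hand from the model summaries. For ungrouped binary data the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below uses only numbers copied from the two summary() outputs; the variable names are illustrative.

```r
# Number of estimated coefficients: intercept + (3773 - 3744) slope terms
k <- 1 + (3773 - 3744)

# Residual deviances copied from summary(mylogit) and summary(myprobit)
dev_logit  <- 1865.8
dev_probit <- 1908.7

# AIC = deviance + 2k for ungrouped binary responses
aic_logit  <- dev_logit  + 2 * k
aic_probit <- dev_probit + 2 * k

cat("AIC logit: ", aic_logit,  "\n")
cat("AIC probit:", aic_probit, "\n")
cat("Difference:", aic_probit - aic_logit, "\n")
```

With the fitted objects available, `AIC(mylogit, myprobit)` would return the same comparison directly.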
The logit model is based on the logistic function, an S-shaped curve that maps any real-valued number into the interval (0, 1). This makes it ideal for modeling probabilities—such as the likelihood of churn, which is the focus of our study.
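As a quick illustration of this mapping, the sketch below evaluates the logistic function on a few values of the linear predictor and compares it with base R's `plogis()`. The helper name `logistic` and the vector `z` are assumptions made for the example.

```r
# Logistic function: maps any real number into (0, 1)
logistic <- function(z) exp(z) / (1 + exp(z))

z <- c(-10, -2, 0, 2, 10)    # illustrative values of the linear predictor
p <- logistic(z)

print(round(p, 5))

# plogis() is the built-in equivalent; qlogis() is its inverse (the logit)
print(isTRUE(all.equal(p, plogis(z))))
print(isTRUE(all.equal(qlogis(p), z)))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and large positive or negative values approach 1 or 0 without reaching them.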
Mathematical Formulation
The binary logit model can be expressed as:
\[ P(Y = 1 \mid X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)} \]
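where \(P(Y = 1 \mid X)\) is the probability that a customer churns given the explanatory variables \(X_1, \ldots, X_n\), \(\beta_0\) is the intercept, and \(\beta_1, \ldots, \beta_n\) are the coefficients to be estimated.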
Interpretation
The goal of estimating these parameters is to understand how changes in the predictor variables affect the odds of churn. Each coefficient \(\beta_i\) represents the log-odds change in churn associated with a one-unit increase in \(X_i\), holding all other variables constant.
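Because log-odds are hard to read directly, coefficients are often exponentiated into odds ratios. The sketch below does this for three untransformed predictors, using estimates copied from the summary(mylogit) output; the vector names are illustrative.

```r
# Logit coefficients copied from the summary(mylogit) output
beta <- c(Complain          = 1.66321,
          SatisfactionScore = 0.25602,
          CityTier          = 0.28865)

# Odds ratio for a one-unit increase, other variables held constant
odds_ratio <- exp(beta)

# Same information expressed as a percentage change in the odds
pct_change <- 100 * (odds_ratio - 1)

print(round(odds_ratio, 3))
print(round(pct_change, 1))
```

Read this way, having filed a complaint multiplies the odds of churn by roughly 5.3, holding the other variables constant. With the fitted object available, `exp(coef(mylogit))` gives the odds ratios for all terms at once. For the log-transformed predictors, a one-unit increase refers to the logged variable, not to the original scale.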
# First model (full model)
summary(mylogit)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp +
## NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore +
## MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear +
## log_CouponUsed + log_OrderCount + log_DaySinceLastOrder +
## log_CashbackAmount, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.44543 2.89485 -1.536 0.124628
## log_Tenure -1.71957 0.08123 -21.170 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.52842 0.14229 -3.714 0.000204 ***
## PreferredLoginDevicePhone -0.31936 0.18092 -1.765 0.077522 .
## CityTier 0.28865 0.07756 3.721 0.000198 ***
## log_WarehouseToHome 0.68573 0.12190 5.625 1.85e-08 ***
## PreferredPaymentModeCC -0.81463 0.89271 -0.913 0.361484
## PreferredPaymentModeCOD -0.16753 0.62870 -0.266 0.789872
## PreferredPaymentModeCredit Card -0.70332 0.60164 -1.169 0.242404
## PreferredPaymentModeDebit Card -0.52641 0.59676 -0.882 0.377716
## PreferredPaymentModeE wallet -0.04716 0.61785 -0.076 0.939153
## PreferredPaymentModeUPI -0.77897 0.63084 -1.235 0.216900
## GenderMale 0.25921 0.12546 2.066 0.038812 *
## HourSpendOnApp 0.09154 0.09971 0.918 0.358606
## NumberOfDeviceRegistered 0.38531 0.06785 5.679 1.36e-08 ***
## PreferedOrderCatGrocery -12.47451 313.81841 -0.040 0.968292
## PreferedOrderCatLaptop & Accessory -1.77436 0.22183 -7.999 1.26e-15 ***
## PreferedOrderCatMobile -1.29480 0.51844 -2.498 0.012507 *
## PreferedOrderCatMobile Phone -0.90115 0.26043 -3.460 0.000540 ***
## PreferedOrderCatOthers 1.57843 0.71870 2.196 0.028076 *
## SatisfactionScore 0.25602 0.04568 5.604 2.09e-08 ***
## MaritalStatusMarried -0.29964 0.18096 -1.656 0.097759 .
## MaritalStatusSingle 0.75270 0.18137 4.150 3.32e-05 ***
## log_NumberOfAddress 1.28533 0.11808 10.885 < 2e-16 ***
## Complain 1.66321 0.12714 13.082 < 2e-16 ***
## log_OrderAmountHikeFromlastYear -0.06925 0.27770 -0.249 0.803062
## log_CouponUsed -0.26687 0.18436 -1.448 0.147744
## log_OrderCount 0.85508 0.15964 5.356 8.49e-08 ***
## log_DaySinceLastOrder -0.66471 0.11163 -5.954 2.61e-09 ***
## log_CashbackAmount 0.08730 0.53588 0.163 0.870591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1865.8 on 3744 degrees of freedom
## AIC: 1925.8
##
## Number of Fisher Scoring iterations: 13
Remove statistically insignificant variables
We sequentially remove statistically insignificant variables based on the regression results from the summary(mylogit) function. Specifically, we retain variables with p-values less than 0.05 (shown in the Pr(>|z|) column), corresponding to a significance level of 5%.
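Several rows of the coefficient table are individual levels of a single factor (for example the PreferredPaymentMode dummies), and each of those p-values only compares one level with the reference level. One way to judge a factor as a whole is a likelihood-ratio test per term, which `drop1()` provides for glm objects. The sketch below runs on simulated data, so the data frame `sim` and its variables are assumptions for illustration and not results from this study.

```r
set.seed(1)
n <- 500

# Simulated predictors: one numeric variable and one three-level factor
sim <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)

# Simulated outcome depends on x only, so g is noise by construction
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x))

fit <- glm(y ~ x + g, data = sim, family = binomial(link = "logit"))

# One likelihood-ratio test per term; the factor g is tested jointly on 2 df
term_tests <- drop1(fit, test = "Chisq")
print(term_tests)
```

On the churn model the analogous call would be `drop1(mylogit, test = "Chisq")`, which yields one p-value per variable rather than one per dummy.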
Insignificant variables (p > 0.05)
PreferedOrderCatGrocery (p = 0.968292)
PreferredPaymentModeE wallet (p = 0.939153)
log_CashbackAmount (p = 0.870591)
log_OrderAmountHikeFromlastYear (p = 0.803062)
PreferredPaymentModeCOD (p = 0.789872)
PreferredPaymentModeDebit Card (p = 0.377716)
PreferredPaymentModeCC (p = 0.361484)
HourSpendOnApp (p = 0.358606)
PreferredPaymentModeCredit Card (p = 0.242404)
PreferredPaymentModeUPI (p = 0.216900)
log_CouponUsed (p = 0.147744)
MaritalStatusMarried (p = 0.097759)
PreferredLoginDevicePhone (p = 0.077522)

# Likelihood ratio test of the full model against the intercept-only (null) model
null_logit <- glm(Churn ~ 1, data = data, family = binomial(link = "logit"))
lrtest(mylogit, null_logit)
The p-value is extremely small (p < 0.001), indicating strong evidence against the null model. In other words, the full model fits the data significantly better than the intercept-only model, and the explanatory variables jointly contribute to explaining the likelihood of churn. Including these predictors in the model is therefore statistically justified.
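The likelihood ratio statistic can also be recovered from the deviances printed in the summary(mylogit) output, since it equals the null deviance minus the residual deviance. The sketch below uses only those printed numbers; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from summary(mylogit)
null_dev  <- 3407.3   # on 3773 degrees of freedom
resid_dev <- 1865.8   # on 3744 degrees of freedom

# LR statistic and the number of restrictions being tested
lr_stat <- null_dev - resid_dev
df_diff <- 3773 - 3744

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", df_diff, "df\n")
cat("p-value:", format.pval(p_value), "\n")
```

A statistic of 1541.5 on 29 degrees of freedom lies far in the tail of the chi-squared distribution, which is consistent with the conclusion drawn from lrtest().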
We check for multicollinearity with the Variance Inflation Factor (VIF), computed with the vif() function from the car package. The VIF measures how much the variance of a regression coefficient is inflated by collinearity with the other predictors; higher values indicate stronger multicollinearity.
vif(mylogit)
## GVIF Df GVIF^(1/(2*Df))
## log_Tenure 1.566864 1 1.251744
## PreferredLoginDevice 1.405962 2 1.088913
## CityTier 1.561601 1 1.249640
## log_WarehouseToHome 1.075885 1 1.037249
## PreferredPaymentMode 2.829258 6 1.090534
## Gender 1.038172 1 1.018907
## HourSpendOnApp 1.311566 1 1.145236
## NumberOfDeviceRegistered 1.176532 1 1.084681
## PreferedOrderCat 5.952965 5 1.195290
## SatisfactionScore 1.068287 1 1.033580
## MaritalStatus 1.100013 2 1.024117
## log_NumberOfAddress 1.307668 1 1.143533
## Complain 1.096877 1 1.047319
## log_OrderAmountHikeFromlastYear 1.056426 1 1.027826
## log_CouponUsed 2.334681 1 1.527966
## log_OrderCount 2.760417 1 1.661450
## log_DaySinceLastOrder 1.460998 1 1.208718
## log_CashbackAmount 2.271725 1 1.507224
To check for multicollinearity among the explanatory variables, we calculated the Generalized Variance Inflation Factor (GVIF) for each variable. To account for variables with multiple degrees of freedom, we adjusted the GVIF values using the formula:
\[ GVIF^{1/(2 \cdot Df)} \]
All adjusted GVIF values were below 2, which suggests no serious multicollinearity among the predictors. The highest adjusted GVIF was for log_OrderCount (1.661450), which is still well within acceptable limits.
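As a quick check of the adjustment, the third column of the vif() output can be recomputed from the first two. The sketch below uses only base R and the GVIF and Df values printed above; the object names (gvif, dof, adj_gvif) are illustrative and not part of the original analysis.

```r
# GVIF values copied from the vif(mylogit) output above
gvif <- c(
  log_Tenure = 1.566864, PreferredLoginDevice = 1.405962,
  CityTier = 1.561601, log_WarehouseToHome = 1.075885,
  PreferredPaymentMode = 2.829258, Gender = 1.038172,
  HourSpendOnApp = 1.311566, NumberOfDeviceRegistered = 1.176532,
  PreferedOrderCat = 5.952965, SatisfactionScore = 1.068287,
  MaritalStatus = 1.100013, log_NumberOfAddress = 1.307668,
  Complain = 1.096877, log_OrderAmountHikeFromlastYear = 1.056426,
  log_CouponUsed = 2.334681, log_OrderCount = 2.760417,
  log_DaySinceLastOrder = 1.460998, log_CashbackAmount = 2.271725
)
# Degrees of freedom, in the same order as the GVIF values
dof <- c(1, 2, 1, 1, 6, 1, 1, 1, 5, 1, 2, 1, 1, 1, 1, 1, 1, 1)

# Adjusted GVIF: GVIF^(1 / (2 * Df)); for Df = 1 this is sqrt(GVIF)
adj_gvif <- gvif^(1 / (2 * dof))

round(sort(adj_gvif, decreasing = TRUE), 6)  # ranked adjusted values
names(which.max(adj_gvif))                   # predictor with the largest value
all(adj_gvif < 2)                            # threshold used in the text
```

Squaring an adjusted value puts it back on the usual VIF scale, so the cutoff of 2 used here corresponds to a VIF cutoff of 4.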
Remove “CashbackAmount” variable
mylogit1 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit1)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp +
## NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore +
## MaritalStatus + log_NumberOfAddress + Complain + log_OrderAmountHikeFromlastYear +
## log_CouponUsed + log_OrderCount + log_DaySinceLastOrder,
## family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.00905 1.09946 -3.646 0.000266 ***
## log_Tenure -1.71897 0.08112 -21.190 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.52677 0.14193 -3.711 0.000206 ***
## PreferredLoginDevicePhone -0.32324 0.17929 -1.803 0.071405 .
## CityTier 0.28864 0.07756 3.721 0.000198 ***
## log_WarehouseToHome 0.68557 0.12190 5.624 1.86e-08 ***
## PreferredPaymentModeCC -0.81491 0.89321 -0.912 0.361594
## PreferredPaymentModeCOD -0.16765 0.62941 -0.266 0.789960
## PreferredPaymentModeCredit Card -0.70370 0.60237 -1.168 0.242723
## PreferredPaymentModeDebit Card -0.52645 0.59750 -0.881 0.378270
## PreferredPaymentModeE wallet -0.04661 0.61857 -0.075 0.939936
## PreferredPaymentModeUPI -0.77957 0.63154 -1.234 0.217054
## GenderMale 0.26000 0.12537 2.074 0.038093 *
## HourSpendOnApp 0.09432 0.09831 0.959 0.337365
## NumberOfDeviceRegistered 0.38652 0.06745 5.731 1.00e-08 ***
## PreferedOrderCatGrocery -12.46431 314.37643 -0.040 0.968374
## PreferedOrderCatLaptop & Accessory -1.78663 0.20890 -8.553 < 2e-16 ***
## PreferedOrderCatMobile -1.31792 0.49886 -2.642 0.008245 **
## PreferedOrderCatMobile Phone -0.92219 0.22637 -4.074 4.62e-05 ***
## PreferedOrderCatOthers 1.61775 0.67712 2.389 0.016886 *
## SatisfactionScore 0.25620 0.04567 5.610 2.03e-08 ***
## MaritalStatusMarried -0.30005 0.18096 -1.658 0.097293 .
## MaritalStatusSingle 0.75278 0.18139 4.150 3.32e-05 ***
## log_NumberOfAddress 1.28746 0.11736 10.970 < 2e-16 ***
## Complain 1.66402 0.12707 13.096 < 2e-16 ***
## log_OrderAmountHikeFromlastYear -0.06897 0.27769 -0.248 0.803865
## log_CouponUsed -0.26568 0.18419 -1.442 0.149195
## log_OrderCount 0.85548 0.15960 5.360 8.32e-08 ***
## log_DaySinceLastOrder -0.66269 0.11093 -5.974 2.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1865.8 on 3745 degrees of freedom
## AIC: 1923.8
##
## Number of Fisher Scoring iterations: 13
lrtest(mylogit1, mylogit)
The p-value is large (0.8658), indicating that CashbackAmount does not provide a statistically significant improvement in model fit.
Conclusion: There is no evidence to justify keeping CashbackAmount in the model. Removing it simplifies the model without loss of explanatory power.
Remove “CashbackAmount”, “OrderAmountHikeFromlastYear” variables
mylogit2 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit2)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + PreferredPaymentMode + Gender + HourSpendOnApp +
## NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore +
## MaritalStatus + log_NumberOfAddress + Complain + log_CouponUsed +
## log_OrderCount + log_DaySinceLastOrder, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.19165 0.81854 -5.121 3.04e-07 ***
## log_Tenure -1.71898 0.08111 -21.192 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.52658 0.14195 -3.710 0.000207 ***
## PreferredLoginDevicePhone -0.32267 0.17926 -1.800 0.071854 .
## CityTier 0.28962 0.07745 3.739 0.000184 ***
## log_WarehouseToHome 0.68355 0.12162 5.620 1.91e-08 ***
## PreferredPaymentModeCC -0.80675 0.89289 -0.904 0.366246
## PreferredPaymentModeCOD -0.15821 0.62889 -0.252 0.801378
## PreferredPaymentModeCredit Card -0.69675 0.60235 -1.157 0.247388
## PreferredPaymentModeDebit Card -0.51989 0.59755 -0.870 0.384278
## PreferredPaymentModeE wallet -0.03953 0.61852 -0.064 0.949040
## PreferredPaymentModeUPI -0.77263 0.63156 -1.223 0.221186
## GenderMale 0.26128 0.12523 2.086 0.036951 *
## HourSpendOnApp 0.09460 0.09831 0.962 0.335937
## NumberOfDeviceRegistered 0.38558 0.06733 5.727 1.02e-08 ***
## PreferedOrderCatGrocery -12.47079 314.61006 -0.040 0.968381
## PreferedOrderCatLaptop & Accessory -1.79148 0.20794 -8.615 < 2e-16 ***
## PreferedOrderCatMobile -1.31546 0.49830 -2.640 0.008294 **
## PreferedOrderCatMobile Phone -0.92730 0.22534 -4.115 3.87e-05 ***
## PreferedOrderCatOthers 1.60880 0.67739 2.375 0.017549 *
## SatisfactionScore 0.25581 0.04565 5.604 2.09e-08 ***
## MaritalStatusMarried -0.30018 0.18088 -1.660 0.097009 .
## MaritalStatusSingle 0.75203 0.18132 4.148 3.36e-05 ***
## log_NumberOfAddress 1.28734 0.11739 10.966 < 2e-16 ***
## Complain 1.66471 0.12704 13.104 < 2e-16 ***
## log_CouponUsed -0.26792 0.18403 -1.456 0.145427
## log_OrderCount 0.85527 0.15965 5.357 8.45e-08 ***
## log_DaySinceLastOrder -0.66214 0.11090 -5.970 2.37e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1865.9 on 3746 degrees of freedom
## AIC: 1921.9
##
## Number of Fisher Scoring iterations: 13
lrtest(mylogit2, mylogit1)
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0"))
Likelihood ratio test: Since the p-value (0.8038) is far greater than 0.05, we fail to reject the null hypothesis. This means that OrderAmountHikeFromlastYear does not significantly improve the model when added. Therefore, it can be removed from the model without reducing its explanatory power.
Wald test (joint significance of CashbackAmount and OrderAmountHikeFromlastYear): The Wald test evaluates whether the coefficients of CashbackAmount and OrderAmountHikeFromlastYear are jointly equal to zero.
Hypotheses:
\[ H_0: \beta_{CashbackAmount} = \beta_{OrderAmountHikeFromlastYear} = 0 \]
\[ H_1: \text{at least one of the two coefficients is nonzero} \]
With a p-value of 0.9569, we fail to reject the null hypothesis, indicating that CashbackAmount and OrderAmountHikeFromlastYear are not jointly significant. Their contribution to explaining churn is negligible, and they can be safely excluded from the model to simplify it.
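To make the joint Wald test concrete, the sketch below computes the statistic by hand in base R as the quadratic form b' V^{-1} b, where b holds the tested coefficients and V is the matching block of the estimated covariance matrix. It runs on simulated data, not the churn dataset, and the names sim, fit and wald_joint are hypothetical helpers introduced only for illustration.

```r
# Simulated data (an assumption for illustration; not the churn dataset)
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1))  # only x1 affects y

fit <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial(link = "logit"))

# Wald chi-square test of H0: all coefficients named in `terms` equal zero
wald_joint <- function(model, terms) {
  b <- coef(model)[terms]                          # tested coefficients
  V <- vcov(model)[terms, terms, drop = FALSE]     # their covariance block
  stat <- as.numeric(t(b) %*% solve(V) %*% b)      # b' V^{-1} b
  k <- length(terms)                               # number of restrictions
  c(statistic = stat, df = k,
    p.value = pchisq(stat, df = k, lower.tail = FALSE))
}

res <- wald_joint(fit, c("x2", "x3"))
res
```

With a single restriction the statistic reduces to the squared z value from summary(), which is a convenient sanity check. On the same model and restrictions, this quadratic form should agree with the chi-square statistic reported by a Wald test such as the linearHypothesis() call above, up to numerical precision.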
Remove “CashbackAmount”, “OrderAmountHikeFromlastYear”, “PreferredPaymentMode” variables
mylogit3 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + Gender + HourSpendOnApp + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit3)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + Gender + HourSpendOnApp + NumberOfDeviceRegistered +
## PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder,
## family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.78648 0.59248 -8.079 6.55e-16 ***
## log_Tenure -1.70938 0.08039 -21.263 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.52409 0.14019 -3.738 0.000185 ***
## PreferredLoginDevicePhone -0.36251 0.17840 -2.032 0.042158 *
## CityTier 0.37998 0.06968 5.453 4.96e-08 ***
## log_WarehouseToHome 0.66824 0.11991 5.573 2.51e-08 ***
## GenderMale 0.24653 0.12387 1.990 0.046565 *
## HourSpendOnApp 0.10130 0.09717 1.043 0.297140
## NumberOfDeviceRegistered 0.38479 0.06654 5.783 7.34e-09 ***
## PreferedOrderCatGrocery -12.60306 312.49731 -0.040 0.967830
## PreferedOrderCatLaptop & Accessory -1.79826 0.20518 -8.764 < 2e-16 ***
## PreferedOrderCatMobile -1.32510 0.40728 -3.254 0.001140 **
## PreferedOrderCatMobile Phone -0.93836 0.22000 -4.265 2.00e-05 ***
## PreferedOrderCatOthers 1.60063 0.69581 2.300 0.021427 *
## SatisfactionScore 0.25803 0.04533 5.693 1.25e-08 ***
## MaritalStatusMarried -0.30670 0.18000 -1.704 0.088390 .
## MaritalStatusSingle 0.73360 0.18026 4.070 4.71e-05 ***
## log_NumberOfAddress 1.26377 0.11599 10.896 < 2e-16 ***
## Complain 1.67576 0.12642 13.256 < 2e-16 ***
## log_CouponUsed -0.24996 0.18292 -1.367 0.171781
## log_OrderCount 0.83933 0.15916 5.274 1.34e-07 ***
## log_DaySinceLastOrder -0.66623 0.10940 -6.090 1.13e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1879.7 on 3752 degrees of freedom
## AIC: 1923.7
##
## Number of Fisher Scoring iterations: 13
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0"))
Wald test
Since p > 0.05, we fail to reject the null hypothesis, meaning that these three variables do not jointly contribute significantly to the model. It is safe to remove CashbackAmount, OrderAmountHikeFromlastYear, and PreferredPaymentMode from our model.
Remove “HourSpendOnApp”, “OrderAmountHikeFromlastYear”, “CashbackAmount”, “PreferredPaymentMode” variables
mylogit4 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + Gender + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit4)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + Gender + NumberOfDeviceRegistered +
## PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_CouponUsed + log_OrderCount + log_DaySinceLastOrder,
## family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.64517 0.57566 -8.069 7.07e-16 ***
## log_Tenure -1.70153 0.07982 -21.317 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.51646 0.13999 -3.689 0.000225 ***
## PreferredLoginDevicePhone -0.37523 0.17785 -2.110 0.034869 *
## CityTier 0.38157 0.06965 5.478 4.29e-08 ***
## log_WarehouseToHome 0.67803 0.11954 5.672 1.41e-08 ***
## GenderMale 0.24648 0.12385 1.990 0.046575 *
## NumberOfDeviceRegistered 0.39726 0.06555 6.061 1.36e-09 ***
## PreferedOrderCatGrocery -12.62001 313.17989 -0.040 0.967857
## PreferedOrderCatLaptop & Accessory -1.77272 0.20340 -8.716 < 2e-16 ***
## PreferedOrderCatMobile -1.29601 0.40626 -3.190 0.001422 **
## PreferedOrderCatMobile Phone -0.87955 0.21206 -4.148 3.36e-05 ***
## PreferedOrderCatOthers 1.61493 0.69643 2.319 0.020402 *
## SatisfactionScore 0.26158 0.04521 5.786 7.19e-09 ***
## MaritalStatusMarried -0.31317 0.17981 -1.742 0.081556 .
## MaritalStatusSingle 0.72494 0.17990 4.030 5.59e-05 ***
## log_NumberOfAddress 1.27376 0.11556 11.023 < 2e-16 ***
## Complain 1.67803 0.12637 13.278 < 2e-16 ***
## log_CouponUsed -0.22970 0.18145 -1.266 0.205533
## log_OrderCount 0.83902 0.15875 5.285 1.26e-07 ***
## log_DaySinceLastOrder -0.66054 0.10908 -6.056 1.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1880.8 on 3753 degrees of freedom
## AIC: 1922.8
##
## Number of Fisher Scoring iterations: 13
lrtest(mylogit4, mylogit3)
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0","HourSpendOnApp=0"))
Likelihood ratio test
Since p > 0.05, we fail to reject the null hypothesis that the coefficient for HourSpendOnApp is zero. This suggests that HourSpendOnApp is not statistically significant in predicting churn.
Wald test
Since p > 0.05, we fail to reject the null hypothesis, meaning that these four variables do not jointly contribute significantly to the model. It is safe to remove PreferredPaymentMode, HourSpendOnApp, OrderAmountHikeFromlastYear, and CashbackAmount from our model.
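The likelihood ratio comparison of mylogit4 against mylogit3 can be approximated from the residual deviances printed in the two summaries. This is a rough sketch: the deviances are rounded to one decimal in the output, so the result is only approximate, and the variable names are illustrative.

```r
# Residual deviances and residual degrees of freedom as printed by summary()
dev_mylogit3 <- 1879.7; rdf_mylogit3 <- 3752   # includes HourSpendOnApp
dev_mylogit4 <- 1880.8; rdf_mylogit4 <- 3753   # HourSpendOnApp removed

# LR statistic: increase in deviance when the restriction is imposed
lr_stat <- dev_mylogit4 - dev_mylogit3
# Degrees of freedom: number of coefficients removed
lr_df <- rdf_mylogit4 - rdf_mylogit3
# Upper-tail chi-square probability
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p.value = lr_p)
```

With these rounded inputs the statistic is about 1.1 on 1 degree of freedom, giving a p-value of roughly 0.29, close to the Wald p-value of 0.297 shown for HourSpendOnApp in the mylogit3 summary.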
Remove “HourSpendOnApp”, “OrderAmountHikeFromlastYear”, “CashbackAmount”, “PreferredPaymentMode” and “CouponUsed” variables
mylogit5 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + Gender + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit5)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + Gender + NumberOfDeviceRegistered +
## PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_OrderCount + log_DaySinceLastOrder, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.71281 0.57252 -8.232 < 2e-16 ***
## log_Tenure -1.70031 0.07972 -21.329 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.51886 0.13997 -3.707 0.00021 ***
## PreferredLoginDevicePhone -0.37259 0.17771 -2.097 0.03603 *
## CityTier 0.38026 0.06961 5.463 4.68e-08 ***
## log_WarehouseToHome 0.68448 0.11937 5.734 9.81e-09 ***
## GenderMale 0.24871 0.12374 2.010 0.04443 *
## NumberOfDeviceRegistered 0.38911 0.06507 5.980 2.24e-09 ***
## PreferedOrderCatGrocery -12.54966 316.43693 -0.040 0.96836
## PreferedOrderCatLaptop & Accessory -1.78780 0.20312 -8.802 < 2e-16 ***
## PreferedOrderCatMobile -1.29576 0.40615 -3.190 0.00142 **
## PreferedOrderCatMobile Phone -0.90278 0.21114 -4.276 1.90e-05 ***
## PreferedOrderCatOthers 1.59205 0.69922 2.277 0.02279 *
## SatisfactionScore 0.26153 0.04517 5.790 7.05e-09 ***
## MaritalStatusMarried -0.29425 0.17923 -1.642 0.10065
## MaritalStatusSingle 0.73590 0.17983 4.092 4.27e-05 ***
## log_NumberOfAddress 1.26267 0.11501 10.979 < 2e-16 ***
## Complain 1.68154 0.12636 13.308 < 2e-16 ***
## log_OrderCount 0.70290 0.11715 6.000 1.97e-09 ***
## log_DaySinceLastOrder -0.65750 0.10897 -6.034 1.60e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3407.3 on 3773 degrees of freedom
## Residual deviance: 1882.4 on 3754 degrees of freedom
## AIC: 1922.4
##
## Number of Fisher Scoring iterations: 13
lrtest(mylogit5, mylogit4)
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0","HourSpendOnApp=0", "log_CouponUsed = 0"))
Likelihood ratio test
Since p > 0.05, we fail to reject the null hypothesis that the coefficient for CouponUsed is zero. This suggests that CouponUsed is not statistically significant in predicting churn.
Wald test
Since p > 0.05, we fail to reject the null hypothesis, meaning that these five variables do not jointly contribute significantly to the model. It is safe to remove HourSpendOnApp, OrderAmountHikeFromlastYear, CashbackAmount, PreferredPaymentMode, and CouponUsed from our model.
Remove “HourSpendOnApp”, “OrderAmountHikeFromlastYear”, “CashbackAmount”, “PreferredPaymentMode”, “CouponUsed” and “PreferedOrderCatGrocery” variables
# Drop the "Grocery" level: rows in this category become NA and are excluded when the model is refitted
data$PreferedOrderCat <- factor(data$PreferedOrderCat, levels = setdiff(levels(data$PreferedOrderCat), "Grocery"))
# Fit updated model
mylogit6 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + Gender + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit6)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + Gender + NumberOfDeviceRegistered +
## PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_OrderCount + log_DaySinceLastOrder, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.71281 0.57250 -8.232 < 2e-16 ***
## log_Tenure -1.70031 0.07971 -21.331 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.51886 0.13996 -3.707 0.00021 ***
## PreferredLoginDevicePhone -0.37259 0.17771 -2.097 0.03602 *
## CityTier 0.38026 0.06960 5.463 4.67e-08 ***
## log_WarehouseToHome 0.68448 0.11937 5.734 9.79e-09 ***
## GenderMale 0.24871 0.12373 2.010 0.04442 *
## NumberOfDeviceRegistered 0.38911 0.06507 5.980 2.23e-09 ***
## PreferedOrderCatLaptop & Accessory -1.78780 0.20311 -8.802 < 2e-16 ***
## PreferedOrderCatMobile -1.29576 0.40614 -3.190 0.00142 **
## PreferedOrderCatMobile Phone -0.90278 0.21113 -4.276 1.90e-05 ***
## PreferedOrderCatOthers 1.59205 0.69920 2.277 0.02279 *
## SatisfactionScore 0.26153 0.04517 5.790 7.04e-09 ***
## MaritalStatusMarried -0.29425 0.17923 -1.642 0.10064
## MaritalStatusSingle 0.73590 0.17982 4.092 4.27e-05 ***
## log_NumberOfAddress 1.26267 0.11500 10.979 < 2e-16 ***
## Complain 1.68154 0.12635 13.308 < 2e-16 ***
## log_OrderCount 0.70290 0.11714 6.000 1.97e-09 ***
## log_DaySinceLastOrder -0.65750 0.10897 -6.034 1.60e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3405.1 on 3767 degrees of freedom
## Residual deviance: 1882.4 on 3749 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 1920.4
##
## Number of Fisher Scoring iterations: 6
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0","HourSpendOnApp=0", "log_CouponUsed = 0", "PreferedOrderCatGrocery=0"))
Wald test
Since p > 0.05, we fail to reject the null hypothesis, meaning that these six variables do not jointly contribute significantly to the model. It is safe to remove HourSpendOnApp, OrderAmountHikeFromlastYear, CashbackAmount, PreferredPaymentMode, CouponUsed, and PreferedOrderCatGrocery from our model. Because Grocery is removed by dropping the factor level, the 6 customers in that category are excluded from the estimation sample of mylogit6.
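Rebuilding a factor without one of its levels turns every row in that level into NA, and glm() then omits those rows under the default na.action. The base R sketch below shows this on hypothetical toy data (the data frame toy and its category labels are invented for illustration). It also shows one possible alternative that keeps all rows by merging the level into the reference category; whether that matches the intent of the analysis is an assumption, and "Fashion" is used as the reference level only for the example.

```r
# Hypothetical toy data; labels and sizes are illustrative only
set.seed(42)
toy <- data.frame(
  y = rbinom(100, 1, 0.4),
  x = rnorm(100),
  category = factor(rep(c("Fashion", "Grocery", "Laptop & Accessory"),
                        times = c(50, 10, 40)))
)

# Same pattern as in the text: rebuild the factor without "Grocery"
dropped <- toy
dropped$category <- factor(dropped$category,
                           levels = setdiff(levels(dropped$category), "Grocery"))
sum(is.na(dropped$category))       # former "Grocery" rows are now NA

fit_all     <- glm(y ~ x + category, data = toy,     family = binomial(link = "logit"))
fit_dropped <- glm(y ~ x + category, data = dropped, family = binomial(link = "logit"))
nobs(fit_all); nobs(fit_dropped)   # the NA rows are left out of the second fit

# Alternative (assumption about intent, not the authors' approach):
# keep every row and merge "Grocery" into the reference level
merged <- toy
levels(merged$category)[levels(merged$category) == "Grocery"] <- "Fashion"
fit_merged <- glm(y ~ x + category, data = merged, family = binomial(link = "logit"))
nobs(fit_merged)
```

The two approaches answer different questions: dropping the level estimates the model on a subsample, while merging keeps the full sample and only removes the separate coefficient.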
Remove “HourSpendOnApp”, “OrderAmountHikeFromlastYear”, “CashbackAmount”, “PreferredPaymentMode”, “CouponUsed”, “PreferedOrderCatGrocery” and “MaritalStatusMarried” variables
# Drop the "Married" level from MaritalStatus (these rows become NA and are excluded) and fit the model again
data$MaritalStatus <- factor(data$MaritalStatus, levels = setdiff(levels(data$MaritalStatus), "Married"))
mylogit7 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + Gender + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit7)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + Gender + NumberOfDeviceRegistered +
## PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_OrderCount + log_DaySinceLastOrder, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.89439 0.75192 -6.509 7.56e-11 ***
## log_Tenure -1.74908 0.10942 -15.984 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.59996 0.18604 -3.225 0.001260 **
## PreferredLoginDevicePhone -0.46179 0.24379 -1.894 0.058198 .
## CityTier 0.32792 0.09119 3.596 0.000323 ***
## log_WarehouseToHome 0.79739 0.16044 4.970 6.69e-07 ***
## GenderMale 0.21714 0.16538 1.313 0.189182
## NumberOfDeviceRegistered 0.39327 0.08859 4.439 9.04e-06 ***
## PreferedOrderCatLaptop & Accessory -1.67644 0.27755 -6.040 1.54e-09 ***
## PreferedOrderCatMobile -1.37396 0.52809 -2.602 0.009274 **
## PreferedOrderCatMobile Phone -0.93292 0.28728 -3.247 0.001164 **
## PreferedOrderCatOthers 2.15199 0.99723 2.158 0.030930 *
## SatisfactionScore 0.36126 0.06402 5.643 1.67e-08 ***
## MaritalStatusSingle 0.83072 0.18764 4.427 9.54e-06 ***
## log_NumberOfAddress 1.05197 0.15390 6.835 8.17e-12 ***
## Complain 1.85945 0.17426 10.671 < 2e-16 ***
## log_OrderCount 0.97958 0.16346 5.993 2.06e-09 ***
## log_DaySinceLastOrder -0.90233 0.15316 -5.892 3.82e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1931.9 on 1790 degrees of freedom
## Residual deviance: 1037.7 on 1773 degrees of freedom
## (1983 observations deleted due to missingness)
## AIC: 1073.7
##
## Number of Fisher Scoring iterations: 6
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0","HourSpendOnApp=0", "log_CouponUsed = 0", "PreferedOrderCatGrocery=0","MaritalStatusMarried=0"))
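Wald test
Since p > 0.05, we fail to reject the null hypothesis, meaning that these seven variables do not jointly contribute significantly to the model. It is safe to remove HourSpendOnApp, OrderAmountHikeFromlastYear, CashbackAmount, PreferredPaymentMode, CouponUsed, PreferedOrderCatGrocery, and MaritalStatusMarried from our model. Because Married is removed by dropping the factor level, all married customers are excluded and mylogit7 is estimated on the remaining 1,791 observations (1,983 observations are deleted due to missingness), so its deviance and AIC are not directly comparable with those of the earlier models.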
Remove “HourSpendOnApp”, “OrderAmountHikeFromlastYear”, “CashbackAmount”, “PreferredPaymentMode”, “CouponUsed”, “PreferedOrderCatGrocery”, “MaritalStatusMarried” and “PreferredLoginDevicePhone” variables
# Drop the "Phone" level from PreferredLoginDevice (these rows become NA and are excluded) and fit the model again
data$PreferredLoginDevice <- factor(data$PreferredLoginDevice, levels = setdiff(levels(data$PreferredLoginDevice), "Phone"))
mylogit8 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + Gender + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit8)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + Gender + NumberOfDeviceRegistered +
## PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_OrderCount + log_DaySinceLastOrder, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.20046 0.84795 -6.133 8.62e-10 ***
## log_Tenure -1.72081 0.11994 -14.347 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.57483 0.18750 -3.066 0.002172 **
## CityTier 0.44867 0.10086 4.448 8.65e-06 ***
## log_WarehouseToHome 0.84275 0.17970 4.690 2.74e-06 ***
## GenderMale 0.07522 0.18457 0.408 0.683617
## NumberOfDeviceRegistered 0.35347 0.09863 3.584 0.000339 ***
## PreferedOrderCatLaptop & Accessory -1.66792 0.28414 -5.870 4.36e-09 ***
## PreferedOrderCatMobile 0.14697 1.18295 0.124 0.901126
## PreferedOrderCatMobile Phone -0.82414 0.29743 -2.771 0.005591 **
## PreferedOrderCatOthers 2.19226 0.99832 2.196 0.028095 *
## SatisfactionScore 0.34524 0.07136 4.838 1.31e-06 ***
## MaritalStatusSingle 0.81816 0.21135 3.871 0.000108 ***
## log_NumberOfAddress 1.11583 0.17885 6.239 4.41e-10 ***
## Complain 1.90614 0.19669 9.691 < 2e-16 ***
## log_OrderCount 0.92139 0.17774 5.184 2.17e-07 ***
## log_DaySinceLastOrder -0.83346 0.16963 -4.913 8.95e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1516.95 on 1422 degrees of freedom
## Residual deviance: 823.21 on 1406 degrees of freedom
## (2351 observations deleted due to missingness)
## AIC: 857.21
##
## Number of Fisher Scoring iterations: 6
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0","HourSpendOnApp=0", "log_CouponUsed = 0", "PreferedOrderCatGrocery=0","MaritalStatusMarried=0","PreferredLoginDevicePhone=0"))
Wald test
Since p < 0.05 (0.0374), we reject the null hypothesis, meaning that these eight variables jointly contribute significantly to the model. It is not safe to remove PreferredLoginDevicePhone from our model.
Remove “HourSpendOnApp”, “OrderAmountHikeFromlastYear”, “CashbackAmount”, “PreferredPaymentMode”, “CouponUsed”, “PreferedOrderCatGrocery”, “MaritalStatusMarried” and “Gender” variables
mylogit9 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderCount + log_DaySinceLastOrder, data = data,
family = binomial(link = "logit"))
summary(mylogit9)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + NumberOfDeviceRegistered + PreferedOrderCat +
## SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_OrderCount + log_DaySinceLastOrder, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.14019 0.83376 -6.165 7.04e-10 ***
## log_Tenure -1.71981 0.11976 -14.360 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.57218 0.18748 -3.052 0.002274 **
## CityTier 0.44956 0.10086 4.457 8.30e-06 ***
## log_WarehouseToHome 0.83694 0.17885 4.680 2.87e-06 ***
## NumberOfDeviceRegistered 0.35636 0.09840 3.622 0.000293 ***
## PreferedOrderCatLaptop & Accessory -1.66625 0.28372 -5.873 4.28e-09 ***
## PreferedOrderCatMobile 0.17419 1.17777 0.148 0.882424
## PreferedOrderCatMobile Phone -0.82256 0.29709 -2.769 0.005628 **
## PreferedOrderCatOthers 2.18772 0.99336 2.202 0.027641 *
## SatisfactionScore 0.34594 0.07134 4.849 1.24e-06 ***
## MaritalStatusSingle 0.81462 0.21103 3.860 0.000113 ***
## log_NumberOfAddress 1.11165 0.17840 6.231 4.63e-10 ***
## Complain 1.90343 0.19657 9.683 < 2e-16 ***
## log_OrderCount 0.91528 0.17686 5.175 2.28e-07 ***
## log_DaySinceLastOrder -0.83797 0.16934 -4.948 7.48e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1516.95 on 1422 degrees of freedom
## Residual deviance: 823.37 on 1407 degrees of freedom
## (2351 observations deleted due to missingness)
## AIC: 855.37
##
## Number of Fisher Scoring iterations: 6
# Wald test for joint significance
linearHypothesis(mylogit, c("log_CashbackAmount = 0", "log_OrderAmountHikeFromlastYear = 0","PreferredPaymentModeCC = 0","PreferredPaymentModeCOD =0", "PreferredPaymentModeCredit Card=0","PreferredPaymentModeDebit Card=0","PreferredPaymentModeE wallet =0","PreferredPaymentModeUPI =0","HourSpendOnApp=0", "log_CouponUsed = 0", "PreferedOrderCatGrocery=0","MaritalStatusMarried=0", "GenderMale=0"))
Wald test
Since p < 0.05 (0.04079), we reject the null hypothesis, meaning that these eight variables jointly contribute significantly to the model. It is not safe to remove Gender from our model.
Add interactions between variables
mylogit10 <- glm(Churn ~ log_Tenure + PreferredLoginDevice + CityTier + log_WarehouseToHome + NumberOfDeviceRegistered + PreferedOrderCat + SatisfactionScore + MaritalStatus + log_NumberOfAddress + Complain + log_OrderCount + log_DaySinceLastOrder + CouponUsed * log_OrderCount + Complain * log_OrderCount +log_Tenure * CityTier, data = data,
family = binomial(link = "logit"))
summary(mylogit10)
##
## Call:
## glm(formula = Churn ~ log_Tenure + PreferredLoginDevice + CityTier +
## log_WarehouseToHome + NumberOfDeviceRegistered + PreferedOrderCat +
## SatisfactionScore + MaritalStatus + log_NumberOfAddress +
## Complain + log_OrderCount + log_DaySinceLastOrder + CouponUsed *
## log_OrderCount + Complain * log_OrderCount + log_Tenure *
## CityTier, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.73350 0.89857 -5.268 1.38e-07 ***
## log_Tenure -2.08504 0.24392 -8.548 < 2e-16 ***
## PreferredLoginDeviceMobile Phone -0.52903 0.18921 -2.796 0.005175 **
## CityTier 0.22891 0.16900 1.354 0.175577
## log_WarehouseToHome 0.84523 0.18076 4.676 2.92e-06 ***
## NumberOfDeviceRegistered 0.39334 0.10123 3.885 0.000102 ***
## PreferedOrderCatLaptop & Accessory -1.61552 0.28626 -5.644 1.67e-08 ***
## PreferedOrderCatMobile 0.14779 1.19761 0.123 0.901789
## PreferedOrderCatMobile Phone -0.79111 0.30544 -2.590 0.009595 **
## PreferedOrderCatOthers 2.29801 1.01756 2.258 0.023923 *
## SatisfactionScore 0.33523 0.07182 4.668 3.05e-06 ***
## MaritalStatusSingle 0.78535 0.21224 3.700 0.000215 ***
## log_NumberOfAddress 1.16911 0.18213 6.419 1.37e-10 ***
## Complain 2.09657 0.33010 6.351 2.13e-10 ***
## log_OrderCount 1.08941 0.27275 3.994 6.49e-05 ***
## log_DaySinceLastOrder -0.84132 0.17069 -4.929 8.26e-07 ***
## CouponUsed -0.31460 0.18493 -1.701 0.088920 .
## log_OrderCount:CouponUsed 0.10839 0.07717 1.404 0.160178
## Complain:log_OrderCount -0.19024 0.31043 -0.613 0.539991
## log_Tenure:CityTier 0.16490 0.10145 1.625 0.104064
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1516.95 on 1422 degrees of freedom
## Residual deviance: 816.98 on 1403 degrees of freedom
## (2351 observations deleted due to missingness)
## AIC: 856.98
##
## Number of Fisher Scoring iterations: 6
None of the interaction terms is significant at the 5% level (all p-values > 0.05), so they add complexity without improving the model. We therefore drop the interactions and retain mylogit7 as the final model.
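A joint check of the added terms points the same way as the individual z-tests. As a rough sketch, take the residual deviances printed for mylogit9 and mylogit10, which are estimated on the same 1,423 observations, and treat the four terms that mylogit10 adds (CouponUsed and the three interactions) as the restriction being tested:
\[ LR = D_{mylogit9} - D_{mylogit10} = 823.37 - 816.98 = 6.39, \qquad df = 1407 - 1403 = 4 \]
\[ p = P(\chi^2_4 > 6.39) = e^{-6.39/2} \left( 1 + \frac{6.39}{2} \right) \approx 0.17 \]
Under these assumptions the added terms are not jointly significant at the 5% level.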
model_list <- list(
"Logit (full model)" = mylogit,
"Probit (full model)" = myprobit,
"Intermediate model" = mylogit3,
"Final Model" = mylogit7
)
modelsummary(model_list,
statistic = "std.error",
gof_omit = ".*IC|Log.Lik|Deviance",
stars = TRUE)
## Warning in n * resp: longer object length is not a multiple of shorter object
## length
## Warning in n * resp: longer object length is not a multiple of shorter object
## length
## Warning in n * resp: longer object length is not a multiple of shorter object
## length
## Warning in n * resp: longer object length is not a multiple of shorter object
## length
| | Logit (full model) | Probit (full model) | Intermediate model | Final Model |
|---|---|---|---|---|
| (Intercept) | -4.445 | -1.933 | -4.786*** | -4.894*** |
| | (2.895) | (1.370) | (0.592) | (0.752) |
| log_Tenure | -1.720*** | -0.899*** | -1.709*** | -1.749*** |
| | (0.081) | (0.040) | (0.080) | (0.109) |
| PreferredLoginDeviceMobile Phone | -0.528*** | -0.282*** | -0.524*** | -0.600** |
| | (0.142) | (0.076) | (0.140) | (0.186) |
| PreferredLoginDevicePhone | -0.319+ | -0.153 | -0.363* | -0.462+ |
| | (0.181) | (0.098) | (0.178) | (0.244) |
| CityTier | 0.289*** | 0.155*** | 0.380*** | 0.328*** |
| | (0.078) | (0.041) | (0.070) | (0.091) |
| log_WarehouseToHome | 0.686*** | 0.365*** | 0.668*** | 0.797*** |
| | (0.122) | (0.065) | (0.120) | (0.160) |
| PreferredPaymentModeCC | -0.815 | -0.706 | | |
| | (0.893) | (0.462) | | |
| PreferredPaymentModeCOD | -0.168 | -0.328 | | |
| | (0.629) | (0.294) | | |
| PreferredPaymentModeCredit Card | -0.703 | -0.641* | | |
| | (0.602) | (0.277) | | |
| PreferredPaymentModeDebit Card | -0.526 | -0.582* | | |
| | (0.597) | (0.275) | | |
| PreferredPaymentModeE wallet | -0.047 | -0.353 | | |
| | (0.618) | (0.288) | | |
| PreferredPaymentModeUPI | -0.779 | -0.725* | | |
| | (0.631) | (0.297) | | |
| GenderMale | 0.259* | 0.134* | 0.247* | 0.217 |
| | (0.125) | (0.067) | (0.124) | (0.165) |
| HourSpendOnApp | 0.092 | 0.043 | 0.101 | |
| | (0.100) | (0.053) | (0.097) | |
| NumberOfDeviceRegistered | 0.385*** | 0.201*** | 0.385*** | 0.393*** |
| | (0.068) | (0.036) | (0.067) | (0.089) |
| PreferedOrderCatGrocery | -12.475 | -4.402 | -12.603 | |
| | (313.818) | (80.509) | (312.497) | |
| PreferedOrderCatLaptop & Accessory | -1.774*** | -0.986*** | -1.798*** | -1.676*** |
| | (0.222) | (0.114) | (0.205) | (0.278) |
| PreferedOrderCatMobile | -1.295* | -0.785** | -1.325** | -1.374** |
| | (0.518) | (0.279) | (0.407) | (0.528) |
| PreferedOrderCatMobile Phone | -0.901*** | -0.551*** | -0.938*** | -0.933** |
| | (0.260) | (0.134) | (0.220) | (0.287) |
| PreferedOrderCatOthers | 1.578* | 0.705+ | 1.601* | 2.152* |
| | (0.719) | (0.390) | (0.696) | (0.997) |
| SatisfactionScore | 0.256*** | 0.126*** | 0.258*** | 0.361*** |
| | (0.046) | (0.024) | (0.045) | (0.064) |
| MaritalStatusMarried | -0.300+ | -0.167+ | -0.307+ | |
| | (0.181) | (0.096) | (0.180) | |
| MaritalStatusSingle | 0.753*** | 0.397*** | 0.734*** | 0.831*** |
| | (0.181) | (0.097) | (0.180) | (0.188) |
| log_NumberOfAddress | 1.285*** | 0.670*** | 1.264*** | 1.052*** |
| | (0.118) | (0.062) | (0.116) | (0.154) |
| Complain | 1.663*** | 0.872*** | 1.676*** | 1.859*** |
| | (0.127) | (0.067) | (0.126) | (0.174) |
| log_OrderAmountHikeFromlastYear | -0.069 | -0.045 | | |
| | (0.278) | (0.148) | | |
| log_CouponUsed | -0.267 | -0.167+ | -0.250 | |
| | (0.184) | (0.096) | (0.183) | |
| log_OrderCount | 0.855*** | 0.480*** | 0.839*** | 0.980*** |
| | (0.160) | (0.081) | (0.159) | (0.163) |
| log_DaySinceLastOrder | -0.665*** | -0.360*** | -0.666*** | -0.902*** |
| | (0.112) | (0.059) | (0.109) | (0.153) |
| log_CashbackAmount | 0.087 | 0.047 | | |
| | (0.536) | (0.248) | | |
| Num.Obs. | 3774 | 3774 | 3774 | 1791 |
| F | 24.475 | 28.931 | 33.602 | 22.841 |
| RMSE | 0.27 | 0.27 | 0.27 | 0.29 |

Standard errors in parentheses. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
marginal_effects <- margins(mylogit7)
summary(marginal_effects)
Marginal effects in a logistic regression model represent the change in the predicted probability of the outcome (customer churn) for a one-unit change in a predictor variable, holding all other predictors constant. Below is the interpretation of the marginal effects for the statistically significant variables in our final model:
Complain: Customers who have lodged a complaint are associated with a 16.63 percentage point increase in the predicted probability of churn, holding all other factors constant.
log_Tenure: A one-unit increase in the logarithm of tenure is associated with a 15.65 percentage point decrease in the probability of churn, all else being equal.
log_DaySinceLastOrder: A one-unit increase in the logarithm of days since the last order is associated with an 8.07 percentage point decrease in the probability of churn, ceteris paribus.
log_NumberOfAddress: Customers with more recorded addresses (log scale) are 9.41 percentage points more likely to churn.
log_OrderCount: Each unit increase in the log of order count increases the predicted probability of churn by 8.76 percentage points, holding other variables constant.
log_WarehouseToHome: Greater delivery distance (log-transformed) increases the likelihood of churn by 7.13 percentage points.
MaritalStatusSingle: Being single is associated with a 7.24 percentage point increase in the probability of churn, all else equal.
NumberOfDeviceRegistered: Each additional registered device is associated with a 3.52 percentage point increase in churn probability.
CityTier: Each one-step increase in the city tier value is associated with a 2.93 percentage point higher probability of churning.
PreferredLoginDeviceMobile Phone: Using a mobile phone to log in reduces the predicted probability of churn by 5.52 percentage points.
PreferedOrderCatLaptop & Accessory: Preference for laptops and accessories is associated with a 16.02 percentage point decrease in the probability of churn.
PreferedOrderCatMobile: Preference for mobile products reduces churn probability by 13.58 percentage points, all else constant.
PreferedOrderCatMobile Phone: Preference for mobile phones is linked to a 9.66 percentage point decrease in predicted churn.
PreferedOrderCatOthers: Customers preferring “Other” categories are 27.28 percentage points more likely to churn, the largest positive marginal effect observed.
SatisfactionScore: A one-point increase in satisfaction score is surprisingly associated with a 3.23 percentage point increase in churn probability, which may indicate a complex relationship requiring further investigation.
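For a continuous predictor in a logit model, the average marginal effect (AME) equals the coefficient multiplied by the sample mean of p(1 - p). The sketch below checks this identity numerically on simulated data (the variables `x1` and `x2` are placeholders). It illustrates the kind of quantity interpreted above and does not reproduce the reported values; it also assumes that `summary(margins(...))` reports average marginal effects.

```r
# Illustrative only: simulated data with placeholder variable names.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

p_hat <- fitted(fit)

# Analytic AME of x1: coefficient times the average of p * (1 - p).
ame_analytic <- unname(coef(fit)["x1"]) * mean(p_hat * (1 - p_hat))

# Numerical check: nudge x1 by a small step and average the change
# in predicted probability across all observations.
eps  <- 1e-5
p_up <- predict(fit, newdata = data.frame(x1 = x1 + eps, x2 = x2),
                type = "response")
ame_numeric <- mean((p_up - p_hat) / eps)

print(c(analytic = ame_analytic, numeric = ame_numeric))
```

Because the effect depends on p(1 - p), the same coefficient implies a larger change in probability for customers whose predicted churn probability is near 0.5 than for those near 0 or 1.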
In logistic regression, the odds ratio (OR) for a predictor indicates how the odds of the outcome (here, customer churn) change with a one-unit increase in that predictor, holding all other variables constant. Odds ratios are calculated by exponentiating the model coefficients:
# Calculate odds ratios and 95% CI
odds_ratios <- exp(coef(mylogit7))
conf_int <- exp(confint(mylogit7))
## Waiting for profiling to be done...
odds_table <- data.frame(
Variable = names(odds_ratios),
OR = odds_ratios,
CI_lower = conf_int[, 1],
CI_upper = conf_int[, 2]
)
print(odds_table)
## Variable
## (Intercept) (Intercept)
## log_Tenure log_Tenure
## PreferredLoginDeviceMobile Phone PreferredLoginDeviceMobile Phone
## PreferredLoginDevicePhone PreferredLoginDevicePhone
## CityTier CityTier
## log_WarehouseToHome log_WarehouseToHome
## GenderMale GenderMale
## NumberOfDeviceRegistered NumberOfDeviceRegistered
## PreferedOrderCatLaptop & Accessory PreferedOrderCatLaptop & Accessory
## PreferedOrderCatMobile PreferedOrderCatMobile
## PreferedOrderCatMobile Phone PreferedOrderCatMobile Phone
## PreferedOrderCatOthers PreferedOrderCatOthers
## SatisfactionScore SatisfactionScore
## MaritalStatusSingle MaritalStatusSingle
## log_NumberOfAddress log_NumberOfAddress
## Complain Complain
## log_OrderCount log_OrderCount
## log_DaySinceLastOrder log_DaySinceLastOrder
## OR CI_lower CI_upper
## (Intercept) 0.007488452 0.001675966 0.03206093
## log_Tenure 0.173934743 0.139392324 0.21416163
## PreferredLoginDeviceMobile Phone 0.548832370 0.380695912 0.79000051
## PreferredLoginDevicePhone 0.630151651 0.389494589 1.01387984
## CityTier 1.388077986 1.161840165 1.66172336
## log_WarehouseToHome 2.219738593 1.624927122 3.04960464
## GenderMale 1.242522562 0.899570839 1.72131318
## NumberOfDeviceRegistered 1.481821339 1.247657410 1.76634279
## PreferedOrderCatLaptop & Accessory 0.187037924 0.108360235 0.32232693
## PreferedOrderCatMobile 0.253102033 0.089111281 0.70914091
## PreferedOrderCatMobile Phone 0.393402273 0.223936558 0.69173118
## PreferedOrderCatOthers 8.601942976 0.989711447 58.48346236
## SatisfactionScore 1.435131635 1.267829844 1.62992204
## MaritalStatusSingle 2.294966458 1.596385949 3.33378834
## log_NumberOfAddress 2.863279426 2.126373623 3.88946917
## Complain 6.420232972 4.584244317 9.08348686
## log_OrderCount 2.663332340 1.938577760 3.68191007
## log_DaySinceLastOrder 0.405624057 0.299490255 0.54622137
Interpretation of selected odds ratios:
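As a hedged illustration of how such odds ratios are read, the sketch below converts three coefficients, copied by hand from the `summary(mylogit7)` output above, into odds ratios and percentage changes in the odds of churn. For the log-transformed predictors, "one unit" refers to the log scale.

```r
# Coefficients copied from the summary(mylogit7) output above.
coefs <- c(
  Complain          = 1.85945,
  log_Tenure        = -1.74908,
  SatisfactionScore = 0.36126
)

odds_ratio <- exp(coefs)
# Percentage change in the odds of churn per one-unit increase.
pct_change <- 100 * (odds_ratio - 1)

or_summary <- data.frame(
  OR                 = round(odds_ratio, 3),
  pct_change_in_odds = round(pct_change, 1)
)
print(or_summary)
```

Read this way, a complaint multiplies the odds of churn by about 6.42, a one-unit increase in log tenure reduces the odds by roughly 83%, and each additional satisfaction point raises the odds by roughly 44%. These are changes in odds, not in probabilities.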
The link test is a diagnostic tool used to assess the specification of a logistic regression model. It helps determine whether the model is correctly specified or if key predictors may have been omitted or if non-linearities remain unaddressed.
The logic behind the link test is as follows: the outcome is regressed on the model's predicted value (hat) and its square (hat_sq). If the model is properly specified, hat should be statistically significant, because it captures the systematic part of the variation, while hat_sq should not be significant, because a significant squared term would suggest a misspecification.
# Update final data
data_clean <- subset(data,
PreferedOrderCat != "Grocery" &
MaritalStatus != "Married" )
data$PreferedOrderCat <- droplevels(data$PreferedOrderCat)
data$MaritalStatus <- droplevels(data$MaritalStatus)
dim(data)
## [1] 3774 27
#Link test
data_clean$hat <- fitted(mylogit7)
data_clean$hat_sq <- data_clean$hat^2
link_test_model <- glm(Churn ~ hat + hat_sq, family = binomial, data = data_clean)
summary(link_test_model)
##
## Call:
## glm(formula = Churn ~ hat + hat_sq, family = binomial, data = data_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2585 0.1639 -19.885 < 2e-16 ***
## hat 5.6644 0.9879 5.734 9.8e-09 ***
## hat_sq 1.2489 1.1973 1.043 0.297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1931.9 on 1790 degrees of freedom
## Residual deviance: 1024.5 on 1788 degrees of freedom
## AIC: 1030.5
##
## Number of Fisher Scoring iterations: 5
The coefficient for hat is statistically significant (p < 0.001), while the coefficient for hat_sq is not (p = 0.297). The test therefore provides no evidence of omitted variables or an incorrect functional form, and the model passes the link test.
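The link test as usually described (for example, Stata's `linktest`) is built on the linear predictor, whereas `fitted()` on a `glm` object returns predicted probabilities. A minimal sketch of the linear-predictor variant, on simulated data with placeholder variable names, could look as follows; it is offered as a possible robustness check rather than as the procedure used above.

```r
# Illustrative only: simulated data with placeholder variable names.
set.seed(7)
n   <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * sim$x1 - 0.7 * sim$x2))

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = sim)

# Linear predictor (log-odds scale) and its square.
sim$hat    <- predict(fit, type = "link")
sim$hat_sq <- sim$hat^2

link_fit   <- glm(y ~ hat + hat_sq, family = binomial(link = "logit"),
                  data = sim)
link_coefs <- coef(summary(link_fit))
print(link_coefs)
```

The reading is the same as before: a significant `hat` and a non-significant `hat_sq` give no evidence of misspecification.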
# R-squared statistics
PseudoR2(myprobit)
## McFadden Adj.McFadden Cox.Snell Nagelkerke
## 0.4398153 0.4216189 0.3277176 0.5511770
## McKelvey.Zavoina Effron Count Adj.Count
## 0.6228493 0.4808940 0.9067303 0.4421553
## AIC Corrected.AIC
## 1968.6984204 1969.1953480
Model fit was evaluated using several pseudo-R² and related measures. The call above passes myprobit, so the reported values describe the full probit model rather than the final logit model (mylogit7):
McKelvey & Zavoina R² (0.623) is often regarded as the closest analogue of the traditional R² in binary outcome models; it indicates that about 62.3% of the variance in the underlying latent variable is explained.
Count R² (0.907) indicates that 90.7% of observations were correctly classified. This metric can be inflated in imbalanced datasets, because predicting the majority class for every observation already yields high accuracy.
Adjusted Count R² (0.442) corrects for this baseline: the model makes about 44.2% fewer classification errors than always predicting the majority class.
Overall, these fit statistics indicate that the specification explains customer churn behavior well.
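As a cross-check for the final logit, McFadden's pseudo-R² can be recovered from the deviances printed in `summary(mylogit7)`, because for an ungrouped binary response the deviance equals minus twice the log-likelihood. The sketch below uses only numbers copied from that output; the coefficient count of 18 is read off the printed table.

```r
# Values copied from the summary(mylogit7) output above.
null_dev  <- 1931.9  # null deviance
resid_dev <- 1037.7  # residual deviance
n_coef    <- 18      # intercept plus 17 slope coefficients

# McFadden's pseudo-R2: 1 - logLik(model) / logLik(null).
mcfadden <- 1 - resid_dev / null_dev

# Consistency check: AIC = residual deviance + 2 * number of coefficients.
aic_check <- resid_dev + 2 * n_coef

print(round(c(McFadden = mcfadden, AIC = aic_check), 4))
```

This gives a McFadden R² of about 0.46 for the final model, and the reconstructed AIC of 1073.7 matches the printed value.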
# Hosmer-Lemeshow Test
hl_test <- hoslem.test(x = mylogit7$y, y = fitted(mylogit7), g = 10)
print(hl_test)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mylogit7$y, fitted(mylogit7)
## X-squared = 45.704, df = 8, p-value = 2.706e-07
pred_class <- ifelse(fitted(mylogit7) > 0.5, 1, 0)
confusionMatrix(as.factor(pred_class), as.factor(mylogit7$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1308 128
## 1 71 284
##
## Accuracy : 0.8889
## 95% CI : (0.8734, 0.9031)
## No Information Rate : 0.77
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6704
##
## Mcnemar's Test P-Value : 7.195e-05
##
## Sensitivity : 0.9485
## Specificity : 0.6893
## Pos Pred Value : 0.9109
## Neg Pred Value : 0.8000
## Prevalence : 0.7700
## Detection Rate : 0.7303
## Detection Prevalence : 0.8018
## Balanced Accuracy : 0.8189
##
## 'Positive' Class : 0
##
#ROC Curve and AUC
roc_curve <- roc(mylogit7$y, fitted(mylogit7))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve")
auc(roc_curve)
## Area under the curve: 0.9147
The logistic regression model demonstrates strong classification performance, as shown by an overall accuracy of 88.89% (95% CI: 87.34% to 90.31%), a Kappa of 0.67, and an AUC of 0.9147.
Despite a significant Hosmer–Lemeshow test (p < 0.001), which points to some lack of fit, the model performs well in practical terms, especially in correctly identifying non-churners (sensitivity of 94.85%, with class 0 treated as the positive class). However, the lower specificity (68.93%) means that about 31% of churners are still misclassified as non-churners. Future improvements could include exploring additional interaction terms, non-linear effects, or machine learning methods to capture more complex relationships in the data.
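Because the confusion matrix above treats class 0 (non-churn) as the positive class, it can help to restate the results with churn as the class of interest. The sketch below rebuilds the matrix from the printed counts and computes recall, precision, and F1 for churners by hand; in `caret`, the `positive` argument of `confusionMatrix()` could be used for the same purpose.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(1308, 71, 128, 284),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

tp <- cm["1", "1"]  # churners predicted as churners
fn <- cm["0", "1"]  # churners predicted as non-churners
fp <- cm["1", "0"]  # non-churners predicted as churners
tn <- cm["0", "0"]  # non-churners predicted as non-churners

recall_churn    <- tp / (tp + fn)
precision_churn <- tp / (tp + fp)
f1_churn        <- 2 * precision_churn * recall_churn /
  (precision_churn + recall_churn)
accuracy        <- (tp + tn) / sum(cm)

print(round(c(recall = recall_churn, precision = precision_churn,
              f1 = f1_churn, accuracy = accuracy), 4))
```

With churn as the positive class, recall is about 0.69 and precision is 0.80, so the 0.5 cut-off misses roughly three in ten churners; lowering the threshold would trade precision for recall.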
Based on the insights from prior research and exploratory analysis, we formulate the following primary and secondary hypotheses to test the determinants of customer churn in e-commerce.
H0: Filing a complaint does not affect the likelihood of customer churn.
H1: Customers who lodge complaints are more likely to churn.
The marginal effects indicate that complaints are associated with a 16.63 percentage point increase in the probability of churn, holding all other variables constant. The coefficient on the complaint variable is statistically significant at the 0.1% level (p < 2e-16). Thus, we reject the null hypothesis; customer complaints are a strong predictor of churn.
H0: Customer tenure has no impact on the likelihood of churn.
H1: Longer tenure is associated with a lower probability of churn.
The variable log_Tenure is statistically significant and is associated with a 15.65 percentage point decrease in churn probability. Therefore, we reject the null hypothesis and conclude that longer-tenured customers are less likely to churn.
H0: Order frequency does not influence customer churn.
H1: Higher order frequency reduces the probability of churn.
log_OrderCount has a statistically significant positive effect: a one-unit increase in the log of order count is associated with an 8.76 percentage point increase in churn probability. We therefore reject the null hypothesis of no effect, but the sign is the opposite of the one stated in H1, so the directional hypothesis is not supported. Rather than decreasing churn, more frequent orders are positively associated with churn in our dataset, which suggests complex customer dynamics, possibly related to dissatisfaction despite higher purchase activity.
H0: Delivery distance has no effect on churn likelihood.
H1: Longer delivery distances increase the probability of churn.
The log_WarehouseToHome variable is positively associated with churn (7.13 percentage point increase), and is statistically significant. Hence, we reject the null hypothesis and conclude that delivery logistics impact customer retention.
H0: Product category preference does not influence churn.
H1: Certain product category preferences are associated with lower churn probability.
Customers who preferred Laptop & Accessory, Mobile, or Mobile Phone categories showed lower predicted churn probabilities (16.02, 13.58, and 9.66 percentage point decreases respectively). In contrast, those preferring “Others” showed the highest increase (27.28 percentage points). These findings are statistically significant, leading us to reject the null hypothesis.
H0: Marital status does not affect churn.
H1: Single customers are more likely to churn.
Single customers are associated with a 7.24 percentage point increase in churn probability, and the result is statistically significant. Therefore, we reject the null hypothesis and confirm marital status as a relevant demographic factor.
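The decisions above rest on the Wald z statistics in the final model's coefficient table, which report two-sided p-values, while the alternative hypotheses are directional. The sketch below recomputes z from estimates and standard errors copied from `summary(mylogit7)` and adds a one-sided p-value in the direction stated by each H1. The `h1_sign` column is an assumption introduced here to encode those directions.

```r
# Estimates and standard errors copied from summary(mylogit7) above.
# h1_sign encodes the direction claimed by each H1 (+1 raises churn,
# -1 lowers churn); it is an illustrative helper, not part of the model.
coef_tab <- data.frame(
  term     = c("Complain", "log_Tenure", "log_OrderCount",
               "log_WarehouseToHome", "MaritalStatusSingle"),
  estimate = c(1.85945, -1.74908, 0.97958, 0.79739, 0.83072),
  std_err  = c(0.17426, 0.10942, 0.16346, 0.16044, 0.18764),
  h1_sign  = c(1, -1, -1, 1, 1)
)

# Wald statistic and the two-sided p-value reported by summary().
coef_tab$z           <- coef_tab$estimate / coef_tab$std_err
coef_tab$p_two_sided <- 2 * pnorm(-abs(coef_tab$z))

# One-sided p-value for the alternative in the direction of h1_sign.
coef_tab$p_one_sided <- pnorm(-coef_tab$h1_sign * coef_tab$z)

print(coef_tab, digits = 4)
```

For complaints, tenure, delivery distance, and marital status, the one-sided p-values are very small, which supports the directional hypotheses. For log_OrderCount the one-sided p-value is close to 1, which makes explicit that the data contradict the direction stated in its H1.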
This study set out to identify the key drivers of customer churn in the e-commerce sector using a binary logistic regression model. By analyzing a comprehensive set of variables, including behavioral indicators, demographic information, and customer preferences, the research revealed several statistically significant predictors of churn. Customers who submitted complaints were substantially more likely to churn, while those with longer tenure and more recent purchase activity were less likely to do so. Other important predictors included the number of devices registered, marital status, delivery distance, and product category preferences. Some product categories, such as laptops or mobile phones, were associated with reduced churn, whereas others, particularly “Other” categories, had a strong positive relationship with churn risk.
The marginal effects analysis provided further insights into the magnitude of these relationships. For instance, a complaint was associated with a 16.63 percentage point increase in churn probability, while each unit increase in log tenure decreased the probability of churn by 15.65 percentage points. Interestingly, the satisfaction score, which would typically be expected to lower churn, showed a small but significant positive association with churn. This finding may reflect unobserved factors such as inflated satisfaction ratings or unmet expectations despite high scores, indicating the need for deeper qualitative assessment in future research.
From a model performance perspective, the logistic regression demonstrated good predictive ability. The overall classification accuracy was 88.89%, with high sensitivity for non-churners (94.85%) but lower specificity (68.93%), meaning that roughly three in ten churners were classified as non-churners. The area under the ROC curve (AUC) was 0.9147, indicating high discriminative power. Pseudo-R² statistics, including McFadden’s R² (0.44) and McKelvey-Zavoina R² (0.62), which were computed for the full probit model, indicated that a substantial portion of the variance in customer churn behavior was captured. Although the Hosmer-Lemeshow goodness-of-fit test yielded a significant p-value, this is a common occurrence in large samples and does not necessarily undermine the model’s overall validity; it does suggest that the predicted probabilities may be imperfectly calibrated.
In conclusion, this research confirms that customer churn is a multifactorial outcome influenced by a range of behavioral, demographic, and experiential factors. The findings align well with existing literature and highlight the practical importance of monitoring complaints, purchase recency, and order behavior in predicting churn. These insights can support the development of targeted retention strategies and customer relationship management practices. Future work could enhance prediction accuracy through non-linear models, machine learning algorithms, or deeper exploration of the satisfaction-churn paradox.
Ahmad, A., Jafar, A., & Aljoumaa, K. (2019). Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6(1), 1–24.
Berger, P., & Kompan, M. (2019). Predicting customer churn in e-commerce using behavior-based models. International Journal of Information Management, 47, 150–162.
Bhattacharya, S. (2021). Predicting e-commerce customer churn using transactional data. Electronic Commerce Research and Applications, 45, 101024.
Dahiya, M., & Bhatia, M. P. S. (2020). Predictive analytics for customer churn using machine learning techniques. Procedia Computer Science, 167, 2319–2328.
Jaiswal, A. K., & Niraj, R. (2011). Examining mediating role of attitudinal loyalty and satisfaction on customer behavior. Journal of Services Marketing, 25(3), 165–175.
Li, M. (2022). Customer churn prediction on e-commerce platform using Random Forest. International Journal of Business Analytics, 9(4), 45–57.
Liu, Q., & Wang, Y. (2010). Predicting customer churn in the telecommunications industry–An application of survival analysis modeling using SAS. SAS Global Forum.