Data preparation

Required packages

library(survival)
library(survminer)
library(tidyverse)
library(broom)
library(janitor)

Data set

survival_data = read_csv("survival_data.csv")
survival_data = survival_data %>% 
  # status = 1 if the event occurred within 30 days, 0 otherwise
  mutate(status = ifelse(time_to_event <= 30, 1, 0))

Descriptive analysis

Overall view

survival_data %>% 
  group_by(return_status, status) %>% 
  summarise(
    create_day = median(processed_pr_day),  # median orders processed per day
    return_pct = round(mean(ret_pct)),      # mean return percentage, rounded
    time_event = median(time_to_event)      # median time to event
  )
# A tibble: 4 × 5
# Groups:   return_status [2]
  return_status status create_day return_pct time_event
  <chr>          <dbl>      <dbl>      <dbl>      <dbl>
1 NR                 0          1          0         81
2 NR                 1          1          0          8
3 R                  0          2         19         72
4 R                  1          4          7          4

Cross table

survival_data %>% 
  tabyl(return_status,status) %>% adorn_totals() %>%
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front") %>% 
  adorn_title(row_name = "Return status",
              col_name = "Order status")
                 Order status               
 Return status              0              1
            NR  7,759 (64.6%)  4,246 (35.4%)
             R  2,375 (22.1%)  8,368 (77.9%)
         Total 10,134 (44.5%) 12,614 (55.5%)

Survival analysis

Fit the overall model

surv_model = survfit(Surv(time_to_event,status) ~ 1, data = survival_data)

ggsurvplot(
  surv_model, data = survival_data,
  ggtheme = theme_minimal()
)$plot +
  labs(
    title = expression(bold("Survival plot for overall group"))
  )

Hazard graph for return-affected vs return-unaffected customers

surv_model_group = survfit(Surv(time_to_event,status) ~ strata(return_status), data = survival_data)

ggsurvplot(
  surv_model_group, data = survival_data, fun = function(y) -log(y),
  ggtheme = theme_minimal()
)$plot +
  labs(
    title = expression(bold("Hazard plot for the two groups"))
  )

Fit coxph model

cox_model = coxph(Surv(time_to_event,status) ~ strata(return_status), data = survival_data)
summary(cox_model)
Call:  coxph(formula = Surv(time_to_event, status) ~ strata(return_status), 
    data = survival_data)

Null model
  log likelihood= -111912.9 
  n= 22748 
ggsurvplot(
  survfit(cox_model, data = survival_data),
  ggtheme = theme_minimal(), fun = function(y) -log(y)
)$plot +
  labs(
    title = expression(bold("Hazard plot for the two groups")),
    subtitle = expression(italic("Return unaffected (NR), return affected (R)"))
  )

Adjusting for covariates in the model

cox_cov = coxph(Surv(time_to_event, status) ~ return_status + processed_pr_day + ret_pct, data = survival_data)
summary(cox_cov)
Call:
coxph(formula = Surv(time_to_event, status) ~ return_status + 
    processed_pr_day + ret_pct, data = survival_data)

  n= 22748, number of events= 12614 

                       coef  exp(coef)   se(coef)      z Pr(>|z|)    
return_statusR    1.729e+00  5.636e+00  2.210e-02  78.23  < 2e-16 ***
processed_pr_day  5.565e-04  1.001e+00  9.434e-05   5.90 3.64e-09 ***
ret_pct          -4.701e-02  9.541e-01  1.489e-03 -31.57  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                 exp(coef) exp(-coef) lower .95 upper .95
return_statusR      5.6363     0.1774    5.3973    5.8858
processed_pr_day    1.0006     0.9994    1.0004    1.0007
ret_pct             0.9541     1.0481    0.9513    0.9569

Concordance= 0.716  (se = 0.002 )
Likelihood ratio test= 6538  on 3 df,   p=<2e-16
Wald test            = 6309  on 3 df,   p=<2e-16
Score (logrank) test = 7060  on 3 df,   p=<2e-16

From the table, a one-unit increase in orders processed per day slightly raises the reorder hazard (hazard ratio ≈ 1.0006), holding the other covariates constant, which corresponds to a lower risk of churn. Surprisingly, return-affected customers have a much higher reorder hazard than return-unaffected customers (hazard ratio ≈ 5.64), meaning they tend to reorder sooner, while a higher return percentage lowers the reorder hazard (hazard ratio ≈ 0.95), again holding the other covariates constant.
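For easier reading, the same hazard ratios and their 95% confidence intervals can also be pulled into a tidy data frame with broom, for example:

# Hazard ratios (exp(coef)) with 95% confidence intervals for the adjusted Cox model
tidy(cox_cov, exponentiate = TRUE, conf.int = TRUE)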

Test whether the two groups have the same survival time

log_rank_test <- survdiff(Surv(time_to_event,status) ~ return_status, data = survival_data)
log_rank_test
Call:
survdiff(formula = Surv(time_to_event, status) ~ return_status, 
    data = survival_data)

                     N Observed Expected (O-E)^2/E (O-E)^2/V
return_status=NR 12005     4246     8008      1768      5195
return_status=R  10743     8368     4606      3074      5195

 Chisq= 5195  on 1 degrees of freedom, p= <2e-16 

The log-rank test shows a significant difference in survival time between the two groups (p < 2e-16).
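As an optional visual companion to the test, the Kaplan-Meier curves for the two groups can be drawn with the log-rank p-value annotated on the plot (km_fit here is just a helper fit introduced for this figure):

# Kaplan-Meier curves by return status, annotated with the log-rank p-value
km_fit = survfit(Surv(time_to_event, status) ~ return_status, data = survival_data)
ggsurvplot(
  km_fit, data = survival_data,
  pval = TRUE,                # adds the log-rank p-value to the plot
  ggtheme = theme_minimal()
)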

Analysis for each group

Analysis for return-affected merchants, adjusting for covariates

ret_mer = survival_data %>% 
  filter(return_status == "R")

ret_cox = coxph(Surv(time_to_event, status) ~ processed_pr_day + ret_pct, data = ret_mer)
summary(ret_cox)
Call:
coxph(formula = Surv(time_to_event, status) ~ processed_pr_day + 
    ret_pct, data = ret_mer)

  n= 10743, number of events= 8368 

                       coef  exp(coef)   se(coef)       z Pr(>|z|)    
processed_pr_day  5.336e-04  1.001e+00  9.721e-05   5.489 4.05e-08 ***
ret_pct          -4.647e-02  9.546e-01  1.495e-03 -31.073  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                 exp(coef) exp(-coef) lower .95 upper .95
processed_pr_day    1.0005     0.9995    1.0003    1.0007
ret_pct             0.9546     1.0476    0.9518    0.9574

Concordance= 0.58  (se = 0.004 )
Likelihood ratio test= 1508  on 2 df,   p=<2e-16
Wald test            = 1000  on 2 df,   p=<2e-16
Score (logrank) test = 955  on 2 df,   p=<2e-16
ggsurvplot(
  survfit(ret_cox, data = ret_mer),
  ggtheme = theme_minimal(), palette = "#2E9FDF"
)$plot +
  labs(
    title = expression(bold("Survival plot for return-affected merchants, adjusting for covariates"))
  )

Return unaffected customers

nret_mer = survival_data %>% 
  filter(return_status == "NR")

nret_cox = coxph(Surv(time_to_event, status) ~ processed_pr_day, data = nret_mer)
summary(nret_cox)
Call:
coxph(formula = Surv(time_to_event, status) ~ processed_pr_day, 
    data = nret_mer)

  n= 12005, number of events= 4246 

                     coef exp(coef) se(coef)     z Pr(>|z|)    
processed_pr_day 0.022639  1.022897 0.002316 9.773   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                 exp(coef) exp(-coef) lower .95 upper .95
processed_pr_day     1.023     0.9776     1.018     1.028

Concordance= 0.633  (se = 0.004 )
Likelihood ratio test= 47.15  on 1 df,   p=7e-12
Wald test            = 95.52  on 1 df,   p=<2e-16
Score (logrank) test = 142.7  on 1 df,   p=<2e-16
ggsurvplot(
  survfit(nret_cox, data = nret_mer),
  ggtheme = theme_minimal()
)$plot +
  labs(
    title = expression(bold("Survival plot for return-unaffected merchants, adjusting for orders processed per day"))
  )

Impact of return percentage on reordering time

return_data = survival_data %>% 
  filter(return_status == "R") 
return_data %>% 
  dim()
[1] 10743     8
library(MatchIt)

# Propensity score matching: nearest-neighbour matching on a logistic-regression propensity score
matchit_model <- matchit(status ~ processed_pr_day + ret_pct, data = return_data,
                         method = "nearest", distance = "glm")

# Create the matched dataset
matched_data <- match.data(matchit_model)
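As a quick numeric complement to the density plots in the next section, summary() on the matchit object reports covariate balance (standardized mean differences) before and after matching:

# Covariate balance before and after matching (standardized mean differences)
summary(matchit_model)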

Checking covariate balance

ggplot(return_data, aes(x = ret_pct, group = status, color = factor(status))) +
  geom_density() +
  theme_minimal() +
  labs(title = "Density plot of return percentage by status before matching",
       x = "Return percentage",
       y = "Density",
       color = "Status")

After matching

ggplot(matched_data, aes(x = ret_pct, group = status, color = factor(status))) +
  geom_density() +
  theme_minimal() +
  labs(title = "Density plot of return percentage by status after matching",
       x = "Return percentage",
       y = "Density",
       color = "Status")

Fit the linear regression model on matched data

matched_model <- lm(time_to_event ~ ret_pct, data = matched_data)

# Display the coefficient estimates from the matched model
tidy(matched_model)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    27.1     0.580       46.7       0
2 ret_pct         1.40    0.0297      47.2       0

The table shows that return percentage has a positive coefficient (≈ 1.40), indicating that a higher return percentage lengthens the reorder time; this is something the business needs to address.
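To make the size of this effect concrete, the fitted model can predict the expected reorder time at a few illustrative return percentages (0, 10 and 20 below are arbitrary example values, not figures from the data):

# Predicted reorder time at illustrative return percentages
predict(matched_model, newdata = data.frame(ret_pct = c(0, 10, 20)))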

Business action

The analysis results bring positive insights for the business owner of Baby Cakes. It’s promising to note that customers affected by return orders tend to reorder faster compared to those who are not affected. This trend suggests a high level of trust and loyalty among return-affected customers, indicating their satisfaction with the cakes and their affinity towards the brand.

However, the analysis also raises a concern: as the return percentage increases, reorder times lengthen, which suggests growing dissatisfaction among customers with the quality or taste of the cakes. This dissatisfaction could lead to customer churn from Baby Cakes, posing a significant risk to the business’s reputation and revenue.

Quality Assurance and Product Improvement: Baby Cakes should prioritize efforts to enhance product quality and taste based on the feedback received from return orders. Conducting thorough evaluations of the reasons behind returns and addressing any issues related to ingredients, the baking process, or packaging can help improve customer satisfaction and reduce return rates.

Customer Feedback Mechanism: Implementing a robust system for gathering and analyzing customer feedback can provide valuable insights into specific areas for improvement. Actively soliciting feedback through surveys, reviews, and direct communication channels allows Baby Cakes to address customer concerns promptly and proactively.

Communication and Transparency: Maintain transparent communication with customers regarding any changes or improvements made to the products and processes in response to their feedback. Demonstrating a commitment to quality and customer satisfaction can help rebuild trust and loyalty among dissatisfied customers.

Retention Strategies: Develop targeted retention strategies to engage and retain customers who have experienced return orders. Offering personalized incentives, discounts, or special promotions can encourage them to continue patronizing Baby Cakes while demonstrating appreciation for their loyalty.

By taking proactive measures to address customer concerns and improve product quality, Baby Cakes can mitigate the risk of churn and foster long-term relationships with its customers, ensuring sustained growth and success in the competitive online cake market.