What is Customer Churn?

Customer churn is loss of customers/subscribers. Is it possible to use AI/ML/Data Science techniques to construct a model that can help predict customer churn?

We can formulate this problem in at least two different ways:

In this report, we will show how both these approaches can be applied and present results.

Read data

# Read data
data_file <- here("data", "customer_churn_data_extended.csv")
churn_data <- read.csv(data_file)
datatable(churn_data, rownames = FALSE, filter="top", options = list(pageLength = 6, scrollX= T))
Warning in instance$preRenderHook(instance) :
  It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
Warning in instance$preRenderHook(instance) :
  It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html

Quick missing data check

missmap(churn_data, col=c("blue", "red"), legend=FALSE)

  • Looks like there are some missing values in total_charges. But overall data quality is good.

Data Description (Reference: Extended telco churn data)

Explore data and implement preprocessing steps

# Standardize column names
colnames(churn_data) <- tolower(colnames(churn_data))
colnames(churn_data) <- gsub("\\.", "_", colnames(churn_data))

# Confirm unique customers (no duplicates)
## if no duplicates the following length computation should be equal to the number of rows of the dataframe 
length(unique(churn_data$customerid))
[1] 7043
# Missing values
summary(churn_data)
  customerid            count     country             state               city              zip_code       lat_long            latitude    
 Length:7043        Min.   :1   Length:7043        Length:7043        Length:7043        Min.   :90001   Length:7043        Min.   :32.56  
 Class :character   1st Qu.:1   Class :character   Class :character   Class :character   1st Qu.:92102   Class :character   1st Qu.:34.03  
 Mode  :character   Median :1   Mode  :character   Mode  :character   Mode  :character   Median :93552   Mode  :character   Median :36.39  
                    Mean   :1                                                            Mean   :93522                      Mean   :36.28  
                    3rd Qu.:1                                                            3rd Qu.:95351                      3rd Qu.:38.22  
                    Max.   :1                                                            Max.   :96161                      Max.   :41.96  
                                                                                                                                           
   longitude         gender          senior_citizen       partner           dependents        tenure_months   phone_service     
 Min.   :-124.3   Length:7043        Length:7043        Length:7043        Length:7043        Min.   : 0.00   Length:7043       
 1st Qu.:-121.8   Class :character   Class :character   Class :character   Class :character   1st Qu.: 9.00   Class :character  
 Median :-119.7   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :29.00   Mode  :character  
 Mean   :-119.8                                                                               Mean   :32.37                     
 3rd Qu.:-118.0                                                                               3rd Qu.:55.00                     
 Max.   :-114.2                                                                               Max.   :72.00                     
                                                                                                                                
 multiple_lines     internet_service   online_security    online_backup      device_protection  tech_support       streaming_tv      
 Length:7043        Length:7043        Length:7043        Length:7043        Length:7043        Length:7043        Length:7043       
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                                                                     
                                                                                                                                     
                                                                                                                                     
                                                                                                                                     
 streaming_movies     contract         paperless_billing  payment_method     monthly_charges  total_charges    churn_label         churn_value    
 Length:7043        Length:7043        Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8   Length:7043        Min.   :0.0000  
 Class :character   Class :character   Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4   Class :character   1st Qu.:0.0000  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 70.35   Median :1397.5   Mode  :character   Median :0.0000  
                                                                             Mean   : 64.76   Mean   :2283.3                      Mean   :0.2654  
                                                                             3rd Qu.: 89.85   3rd Qu.:3794.7                      3rd Qu.:1.0000  
                                                                             Max.   :118.75   Max.   :8684.8                      Max.   :1.0000  
                                                                                              NA's   :11                                          
  churn_score         cltv      churn_reason      
 Min.   :  5.0   Min.   :2003   Length:7043       
 1st Qu.: 40.0   1st Qu.:3469   Class :character  
 Median : 61.0   Median :4527   Mode  :character  
 Mean   : 58.7   Mean   :4400                     
 3rd Qu.: 75.0   3rd Qu.:5380                     
 Max.   :100.0   Max.   :6500                     
                                                  

Churned vs. Not churned

churn_data %>% ggplot(aes(x = churn_label, fill = churn_label)) + geom_bar() + theme_minimal()

How good is the company in retaining its customers?

mean(churn_data$churn_value)
[1] 0.2653699

One can answer this question easily by estimating survival curve using the given data. The Kaplan-Meir estimate for the given churn data is shown below.

churn_data <- churn_data %>%
  mutate(
    b_churn = ifelse(churn_label == "Yes", 1, 0)
  )
kmfit <- survfit(Surv(tenure_months, b_churn) ~ 1, data = churn_data)
km_survplot <- survminer::ggsurvplot(
  fit = kmfit, 
  surv.median.line = "none",
  xlab = "Months", 
  ylab = "Survival probability", ylim = c(0.4, 1))
km_survplot$plot + geom_hline(yintercept = 0.61, linetype = "dashed") + geom_vline(xintercept = 70, linetype = "dashed")

  • The probability that an average customer will continue with the company beyond 70 months is \(0.61\).
  • Even after 70 months, the company is able to retain 61% or more of their customers.

Closer look at all variables

State hypotheses before building a data-driven model

  • gender: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
  • senior_citizen: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
  • partner/dependents: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
  • tenure_months: In an average sense, a long time customer is more likely to stay with the company (low churn risk)
  • phone_service/internet_service: Customers with both phone and internet service should be less likely to churn.
  • multiple_lines: Having multiple lines should reduce churn risk. Multiple lines better discount more likely to be cost efficient for the customer.
  • online_security, online_backup, device_protection, streaming_tv, streaming_movies: Data description tells us that customers do not pay for extra services. In general, customers with more services and diverse usage should be less likely to churn.
  • contract: Expect to be a strong predictor. Customers with long term contracts less likely to churn than those with month-to-month contracts.
  • payment_method: customers with automatic payment setup (automatic bank transfer or credit card) less likely to churn than manual payment methods.
  • monthly_charges: churn risk should increase with higher monthly charges. However, need to be careful as this is not normalized with respect to number of lines, services, etc.
  • total_charges: total charges is cumulative and depends on the duration of service. Not a good idea to use this in its raw form

Visualize key variables

Tenure

churn_data %>% ggplot(aes(x = tenure_months, fill = churn_label)) + geom_histogram() + facet_wrap(churn_label ~ .)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Multiple lines

ggplot(churn_data, aes(x = multiple_lines, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ multiple_lines, scale = "free")

Contract

ggplot(churn_data, aes(x = contract, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ contract, scale = "free")

Payment method

ggplot(churn_data, aes(x = payment_method, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ payment_method, scale = "free")

Logistic regression

We consider the following model:

\[\begin{eqnarray} y_i & \sim & Bernoulli(\theta_i) \\ \text{logit}\left(\theta_i\right) & = & \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n, \\ \beta_0, \beta_1, \ldots, \beta_n & \sim & \mathcal{N}(0, 5) \hspace{10pt} (priors) \end{eqnarray}\]

where \(y_i\) is binary data (churned or not churned), \(\theta_i\) is the probability of churn and \(x_i's\) are covariates of customer \(i\). \(\beta_0, \beta_1, \ldots, \beta_n\) are model parameters which will be estimated using the collected data.

Model summary

if(!bBuild_mdl_bayes_FE){
  mdl_logit_FixedEffects <- readRDS("mdl_logit_bayes_1.rds")
} else {
  mdl_logit_FixedEffects <- brm(formula = churn_value ~ contract  + multiple_lines + payment_method + tenure_months,
                                data=churn_data,
                                family = bernoulli(link = "logit"),
                                prior = c(prior(normal(0, 2), class = Intercept),
                                          prior(normal(0, 2), class = b)),
                                sample_prior = TRUE,
                                warmup = 500,
                                iter = 3000,
                                chains = 4,
                                inits= "0",
                                cores = 8,
                                seed = 123)
  saveRDS(mdl_logit_FixedEffects, "mdl_logit_FixedEffects.rds")
}
summary(mdl_logit_FixedEffects)
 Family: bernoulli 
  Links: mu = logit 
Formula: churn_value ~ contract + multiple_lines + payment_method + tenure_months 
   Data: churn_data (Number of observations: 7043) 
  Draws: 4 chains, each with iter = 3000; warmup = 500; thin = 1;
         total post-warmup draws = 10000

Population-Level Effects: 
                                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                            -0.32      0.09    -0.50    -0.14 1.00     6841     7598
contractOneyear                      -1.07      0.10    -1.26    -0.88 1.00    10375     7535
contractTwoyear                      -2.12      0.16    -2.45    -1.80 1.00     8931     7159
multiple_linesNophoneservice          0.20      0.11    -0.02     0.41 1.00     9998     7716
multiple_linesYes                     0.71      0.07     0.57     0.85 1.00     9656     7536
payment_methodCreditcardautomatic    -0.09      0.11    -0.30     0.13 1.00     7496     7672
payment_methodElectroniccheck         0.64      0.09     0.47     0.82 1.00     7001     7450
payment_methodMailedcheck            -0.42      0.11    -0.63    -0.22 1.00     7023     7838
tenure_months                        -0.03      0.00    -0.03    -0.02 1.00    11053     9002

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Posterior distribution of parameters

  • Probability of parameters \(\theta\) given evidence (data)
  • We show how data modifies our a priori assumption of \(\theta\). We plot only one parameter for simplicity.
rbind(draws_prior, df_post) %>% 
  ggplot(aes(x = x, fill = label)) + geom_density() +
  scale_fill_viridis(discrete = TRUE, alpha = 0.5) + theme_minimal() + 
  labs(x = varname)

Posterior predictive distribution

  • Distribution of possible unobserved values conditional on observed data: \(p(y_{new}|y)\).
tenure1 = 15.8
tenure2 = 10.5

Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure1
p1 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p1 <- data.frame(p = p1, tenure_months = tenure1)


Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure2
#table(Xnew$tenure_months == churn_data$tenure_months)
p2 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p2 <- data.frame(p = p2, tenure_months = tenure2)
df_p <- rbind(df_p1, df_p2)
df_p$tenure_months <- as.factor(df_p$tenure_months)

ggplot(df_p, aes(x = p, fill = tenure_months)) + geom_density(alpha = 0.5) +
  labs(title = paste0("Contract: ", Xnew$contract, ", Multiple lines: ", Xnew$multiple_lines, ", Payment method: ",
                      Xnew$payment_method))

Model Evaluation

  1. How wrong is the model?
  2. How accurate are the model’s posterior classifications?
  3. Model comparisons

To answer question (1), we will first check convergence of posterior distribution of parameters using traceplots. Next, we perform posterior predictive checks to confirm that the data simulated from our posterior logistic regression model has features similar to the original data. That is, simulations from the fit model result in similar number of ones and zeros as in the original data. For (2), we will address this question in two ways: with and without cross-validation.

Check model convergence by visual inspection

Figure below shows that there are no divergence chains and hence we can conclude that our model does not have any convergence problems.

plot(mdl_logit_FixedEffects, ask = FALSE, newpage = FALSE, N = 2)

Posterior predictive checks

To check he fit of the model to data we draw simulated values from the posterior predictive distribution of replicated data, \(y^{rep}\) and compare the samples to original data. Any systematic differences between the simulations and the data will indicate potential failings of the model (e.g., unsuitable priors). The plot below compares original data with \(500\) draws from the fit model. The figure clearly shows that the model passes this check

pp_check(mdl_logit_FixedEffects, ndraws = 500)

Classification accuracy

Training accuracy

  • Threshold = 0.5
set.seed(84735)
threshold = 0.5
# Predicted probabilities
#linpred <- posterior_linpred(mdl_logit_FixedEffects)
preds <- posterior_linpred(mdl_logit_FixedEffects, transform=TRUE)
pred <- colMeans(preds)
pr <- as.integer(pred >= 0.5)

# posterior classification accuracy
round(mean(xor(pr,as.integer(churn_data$churn_value==0))),6)
[1] 0.781769

Confusion matrix

table(data.frame(churn_value = churn_data$churn_value, churn_pred = pr))
           churn_pred
churn_value    0    1
          0 4648  526
          1 1011  858

Receiver Operating Characteristic (ROC) curve

threshold <- seq(0, 1, by = 0.05)
compute_churn_decision <- function(threshold, pred, churn_value){
  pr <- as.integer(pred >= threshold)
  y <- data.frame(churn_value = churn_value, churn_pred = pr)
  pd <- y %>% dplyr::filter(churn_value == 1) %>%
    summarise(pd = mean(churn_pred)) %>% pull()
  pf <- y %>% dplyr::filter(churn_value == 0) %>%
    summarise(
      pf = mean(churn_pred)
    ) %>% pull()
  return(list(data.frame("pf" = pf, "pd" = pd, "th" = threshold)))
}

roc = lapply(threshold, compute_churn_decision, pred = pred, churn_value = churn_data$churn_value) %>%
  bind_rows()
roc <- roc %>%
  mutate(
    type = "training set"
  )

ggplot(roc, aes(x = pf, y = pd)) + geom_line(size = 2) + 
  geom_point(size = 5) + 
  labs(title = "ROC", x = "Probability of false alarm", y = "Probability of detection")

NA
NA

Cross validation

To better estimate the predictive performance for new not yet seen data we next use leave-one-out cross-validation.

if(bRunCV_mdl_bayes_FE){
  loo_mdl_logit_FixedEffects <- loo(mdl_logit_FixedEffects, save_psis = TRUE)
  saveRDS(loo_mdl_logit_FixedEffects, "loo_mdl_logit_FixedEffects.rds")
  # LOO predictive probabilities
  ploo_mdl_logit_FixedEffects=E_loo(preds, loo_mdl_logit_FixedEffects$psis_object, type="mean", log_ratios = -log_lik(mdl_logit_FixedEffects))$value
  saveRDS(ploo, "ploo_mdl_logit_FixedEffects.rds")
} else{
  loo_mdl_logit_FixedEffects <- readRDS("loo_mdl_logit_FixedEffects.rds")
  ploo_mdl_logit_FixedEffects <- readRDS("ploo_mdl_logit_FixedEffects.rds")
}

## LOO classification accuracy
round(mean(xor(ploo>0.5,as.integer(churn_data$churn_value==0))), 6)
[1] 0.781627

Challenger model: To do

  • Churn rate vs. Geographical location
  • Multilevel model to incorporate location variations (especially useful to reduce prediction uncertainty in small sample size counties)
table(churn_data$county)

        california,alameda          california,alpine          california,amador           california,butte       california,calaveras 
                       180                          8                         36                         72                         72 
         california,colusa    california,contra costa       california,del norte       california,el dorado          california,fresno 
                        24                        140                         16                         84                        200 
          california,glenn        california,humboldt        california,imperial            california,inyo            california,kern 
                        28                        136                         70                         40                        168 
          california,kings            california,lake          california,lassen     california,los angeles          california,madera 
                        32                         52                         60                       1352                         48 
          california,marin        california,mariposa       california,mendocino          california,merced           california,modoc 
                       108                         32                        100                         72                         36 
           california,mono        california,monterey            california,napa          california,nevada          california,orange 
                        28                        112                         36                         28                        341 
         california,placer          california,plumas       california,riverside      california,sacramento      california,san benito 
                       124                         48                        308                        196                         16 
 california,san bernardino       california,san diego   california,san francisco     california,san joaquin california,san luis obispo 
                       347                        475                        104                        104                         76 
      california,san mateo   california,santa barbara     california,santa clara      california,santa cruz          california,shasta 
                       116                         96                        220                         64                        108 
         california,sierra        california,siskiyou          california,solano          california,sonoma      california,stanislaus 
                        32                         84                         52                        140                         96 
         california,sutter          california,tehama         california,trinity          california,tulare        california,tuolumne 
                        36                         56                         52                        132                         48 
        california,ventura            california,yolo            california,yuba 
                       102                         56                         44 

churn_data %>% group_by(county) %>%
  summarise(p = mean(churn_value)) %>% arrange(desc(p)) %>%
  mutate(county = fct_reorder(county, p)) %>%
  ggplot( aes(x=county, y=p)) +
    geom_bar(stat="identity") +
    coord_flip() +
    xlab("") + ylab("churn rate") + 
    theme_bw()

Comparison with classical logistic regression (Frequentist)

mdl_logit_FixedEffects_classical <- glm(churn_value ~ contract  + multiple_lines + payment_method + tenure_months,
                                        data = churn_data, family = "binomial")
summary(mdl_logit_FixedEffects_classical)

Call:
glm(formula = churn_value ~ contract + multiple_lines + payment_method + 
    tenure_months, family = "binomial", data = churn_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6242  -0.7999  -0.3223   0.8172   3.0003  

Coefficients:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           -0.314931   0.091434  -3.444 0.000572 ***
contractOne year                      -1.070434   0.099166 -10.794  < 2e-16 ***
contractTwo year                      -2.127030   0.164503 -12.930  < 2e-16 ***
multiple_linesNo phone service         0.202842   0.110110   1.842 0.065451 .  
multiple_linesYes                      0.712004   0.072611   9.806  < 2e-16 ***
payment_methodCredit card (automatic) -0.087631   0.109170  -0.803 0.422149    
payment_methodElectronic check         0.638239   0.089455   7.135 9.70e-13 ***
payment_methodMailed check            -0.424191   0.105653  -4.015 5.95e-05 ***
tenure_months                         -0.027519   0.001997 -13.778  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 6321.2  on 7034  degrees of freedom
AIC: 6339.2

Number of Fisher Scoring iterations: 6
  • Both classical and Bayesian solutions match pretty well

Survival analysis: To Do

---
title: "Customer churn prediction"
author: "Satish Iyengar"
date: "9/23/2021"
output: html_notebook
---


```{r load_packages, include = FALSE, echo=FALSE, message=FALSE, warning=FALSE}

library(survival)
library(here)
library(dplyr)
library(tidyverse)
library(lubridate)
library(kableExtra)
library(data.table)
library(shiny)
library(DT)
library(survminer)
library(pec)
library(party)
library(rms)
library(sp)
library(maps)
library(maptools)
library(brms)
library(Amelia)
library(ISLR)

library(janitor)
library(purrr)
library(viridis)
library(tidybayes)
library(bayesplot)
library(arm)
library(modelr)
library(bayesrules)

```


```{r boolean_vars, include=FALSE}
bBuild_mdl_bayes_FE <- 0
bRunCV_mdl_bayes_FE <- 0
bBuild_mdl_bayes_RE <- 0
bRunCV_mdl_bayes_RE <- 0
bCompareModels <- 0

```

## What is Customer Churn?

Customer churn is loss of customers/subscribers. Is it possible to use AI/ML/Data Science techniques to construct a model that can help predict customer churn? 

We can formulate this problem in at least two different ways:

- Apply classical techniques such as logistic regression to predict if a customer will churn (Yes/No) given *relevant covariates* such as his/her tenure with the company so far.

- Build *time-to-event* models using survival analysis which will allow us to answer questions such as: What is the probability that the customer will churn in the next $x$ days.

In this report, we will show how both these approaches can be applied and present results.


## Read data

- Raw data used in this report can be downloaded from here: [Extended telco churn data](https://www.kaggle.com/yeanzc/telco-customer-churn-ibm-dataset)


```{r read_data}
# Read data
data_file <- here("data", "customer_churn_data_extended.csv")
churn_data <- read.csv(data_file)
datatable(churn_data, rownames = FALSE, filter="top", options = list(pageLength = 6, scrollX= T))

```

### Quick missing data check

```{r}
missmap(churn_data, col=c("blue", "red"), legend=FALSE)

```

- Looks like there are some missing values in *total_charges*. But overall data quality is good. 

## Data Description (Reference: [Extended telco churn data](https://www.kaggle.com/yeanzc/telco-customer-churn-ibm-dataset))

- 7043 observations with 33 variables
- CustomerID: A unique ID that identifies each customer.
- Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set.
- Country: The country of the customer’s primary residence: United States.
- State: The state of the customer’s primary residence: California.
- City: The city of the customer’s primary residence: 1129 cities in CA.
- Zip Code: The zip code of the customer’s primary residence: 1652 zip codes.
- Lat Long: The combined latitude and longitude of the customer’s primary residence: 1652 lat_lons.
- Latitude: The latitude of the customer’s primary residence: 1652 lats.
- Longitude: The longitude of the customer’s primary residence: 1652 lons.
- Gender: The customer’s gender: Male (3555), Female (3488).
- Senior Citizen: Indicates if the customer is 65 or older: Yes (1142), No (5901).
- Partner: Indicate if the customer has a partner: Yes (3402), No (3641).
- Dependents: Indicates if the customer lives with any dependents: Yes (1627), No (5416). Dependents could be children, parents, grandparents, etc.
- Tenure Months: Indicates the total amount of months that the customer has been with the company by the end of the quarter specified above.
- Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes (6361), No (682).
- Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes (2971), No (3390), No phone service (682).
- Internet Service: Indicates if the customer subscribes to Internet service with the company: No (1526), DSL (2421), Fiber Optic (3096).
- Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes (2019), No (3498), No internet service (1526).
- Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes (2429), No (3088), No internet service (1526).
- Device Protection: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes (2422), No (3095), No internet service (1526).
- Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes (2044), No (3473), No internet service (1526).
- Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes (2707), No (2810), No internet service (1526). The company does not charge an additional fee for this service.
- Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes (2732), No (2785), No internet service (1526). The company does not charge an additional fee for this service.
- Contract: Indicates the customer’s current contract type: Month-to-Month (3875), One Year (1473), Two Year (1695).
- Paperless Billing: Indicates if the customer has chosen paperless billing: Yes (4171), No (2872).
- Payment Method: Indicates how the customer pays their bill: Bank transfer (1544), Credit Card (1522), Electronic check (2365), Mailed Check (1612).
- Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.
- Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.
- Churn Label: Yes = the customer left the company this quarter (1869). No = the customer remained with the company (5174). Directly related to Churn Value.
- Churn Value: 1 = the customer left the company this quarter (1869). 0 = the customer remained with the company. Directly related to Churn Label (5174).
- Churn Score: A value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model incorporates multiple factors known to cause churn. The higher the score, the more likely the customer will churn.
- CLTV: Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.
- Churn Reason: A customer’s specific reason for leaving the company. Directly related to Churn Category: 19 unique reasons.

## Explore data and implement preprocessing steps

- Replace dots in column names to underscores. Change all column names to lowercase
- Confirm each row is one customer (no duplicates)
- Check for missing values (11 NA values found in the column *total_charges*. All other columns are good. No further steps needed.)
- Add county values for each record for spatial analysis (future)

```{r preprocess_data, message=FALSE, warning=FALSE}
# Standardize column names
colnames(churn_data) <- tolower(colnames(churn_data))
colnames(churn_data) <- gsub("\\.", "_", colnames(churn_data))

# Confirm unique customers (no duplicates)
## if no duplicates the following length computation should be equal to the number of rows of the dataframe 
length(unique(churn_data$customerid))

# Missing values
summary(churn_data)

```

```{r find_county, warning=FALSE, error=FALSE, include=FALSE}
# The single argument to this function, pointsDF, is a data.frame in which:
#   - column 1 contains the longitude in degrees (negative in the US)
#   - column 2 contains the latitude in degrees
latlong2county <- function(pointsDF) {
  # Prepare SpatialPolygons object with one SpatialPolygon
  # per county
  counties <- map('county', fill=TRUE, col="transparent", plot=FALSE)
  IDs <- sapply(strsplit(counties$names, ":"), function(x) x[1])
  counties_sp <- map2SpatialPolygons(counties, IDs=IDs,
                                     proj4string=CRS("+proj=longlat +datum=WGS84"))
  # Convert pointsDF to a SpatialPoints object 
  pointsSP <- SpatialPoints(pointsDF, 
                            proj4string=CRS("+proj=longlat +datum=WGS84"))
  # Use 'over' to get _indices_ of the Polygons object containing each point 
  indices <- over(pointsSP, counties_sp)
  # Return the county names of the Polygons object containing each point
  countyNames <- sapply(counties_sp@polygons, function(x) x@ID)
  countyNames[indices]
}
churn_data <- churn_data %>%
  mutate(
    county = latlong2county(data.frame(x = churn_data$longitude, y = churn_data$latitude))
  )
df <- read.csv("df_updated.csv")
churn_data <- churn_data %>% dplyr::filter(!is.na(county)) %>%
  rbind(df)

```

### Churned vs. Not churned

```{r, fig.align='center'}
churn_data %>% ggplot(aes(x = churn_label, fill = churn_label)) + geom_bar() + theme_minimal()

```

### How good is the company in retaining its customers?


```{r}
mean(churn_data$churn_value)

```

One can answer this question easily by estimating survival curve using the given data. The Kaplan-Meir estimate for the given churn data is shown below.



```{r kmfit, warning=FALSE, fig.align='center'}
churn_data <- churn_data %>%
  mutate(
    b_churn = ifelse(churn_label == "Yes", 1, 0)
  )
kmfit <- survfit(Surv(tenure_months, b_churn) ~ 1, data = churn_data)
km_survplot <- survminer::ggsurvplot(
  fit = kmfit, 
  surv.median.line = "none",
  xlab = "Months", 
  ylab = "Survival probability", ylim = c(0.4, 1))
km_survplot$plot + geom_hline(yintercept = 0.61, linetype = "dashed") + geom_vline(xintercept = 70, linetype = "dashed")

```

- The probability that an average customer will continue with the company beyond 70 months is $0.61$.
- Even after 70 months, the company is able to retain 61% or more of their customers.

### Closer look at all variables 

**State hypotheses before building a data-driven model**

- gender: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- senior_citizen: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- partner/dependents: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- **tenure_months**: In an average sense, a long time customer is more likely to stay with the company (low churn risk)
- phone_service/internet_service: Customers with both phone and internet service should be less likely to churn.
- **multiple_lines**: Having multiple lines should reduce churn risk. Multiple lines better discount more likely to be cost efficient for the customer.
- online_security, online_backup, device_protection, streaming_tv, streaming_movies: Data description tells us that customers do not pay for extra services. In general, customers with more services and diverse usage should be less likely to churn.
- **contract**: Expect to be a strong predictor. Customers with long term contracts less likely to churn than those with month-to-month contracts.
- **payment_method**: customers with automatic payment setup (automatic bank transfer or credit card) less likely to churn than manual payment methods.
- **monthly_charges**: churn risk should increase with higher monthly charges. However, need to be careful as this is not normalized with respect to number of lines, services, etc.
- total_charges: total charges is cumulative and depends on the duration of service. Not a good idea to use this in its raw form


### Visualize *key* variables

#### Tenure

```{r, fig.align='center', warning=FALSE}
churn_data %>% ggplot(aes(x = tenure_months, fill = churn_label)) + geom_histogram() + facet_wrap(churn_label ~ .)

```





#### Multiple lines

```{r, fig.align='center', warning=FALSE}
ggplot(churn_data, aes(x = multiple_lines, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ multiple_lines, scale = "free")

```



#### Contract

```{r, fig.align='center'}
ggplot(churn_data, aes(x = contract, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ contract, scale = "free")

```

#### Payment method

```{r, fig.align='center'}
ggplot(churn_data, aes(x = payment_method, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ payment_method, scale = "free")

```



## Logistic regression

We consider the following model:

\begin{eqnarray}
y_i & \sim & Bernoulli(\theta_i) \\
\text{logit}\left(\theta_i\right) & = & \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n, \\
\beta_0, \beta_1, \ldots, \beta_n & \sim & \mathcal{N}(0, 5) \hspace{10pt} (priors)
\end{eqnarray}

where $y_i$ is binary data (churned or not churned), $\theta_i$ is the probability of churn and $x_i's$ are covariates of customer $i$. $\beta_0, \beta_1, \ldots, \beta_n$ are model parameters which will be estimated using the collected data. 

### Model summary

```{r}
if(!bBuild_mdl_bayes_FE){
  mdl_logit_FixedEffects <- readRDS("mdl_logit_bayes_1.rds")
} else {
  mdl_logit_FixedEffects <- brm(formula = churn_value ~ contract  + multiple_lines + payment_method + tenure_months,
                                data=churn_data,
                                family = bernoulli(link = "logit"),
                                prior = c(prior(normal(0, 2), class = Intercept),
                                          prior(normal(0, 2), class = b)),
                                sample_prior = TRUE,
                                warmup = 500,
                                iter = 3000,
                                chains = 4,
                                inits= "0",
                                cores = 8,
                                seed = 123)
  saveRDS(mdl_logit_FixedEffects, "mdl_logit_FixedEffects.rds")
}
summary(mdl_logit_FixedEffects)

```

#### Posterior distribution of parameters

- Probability of parameters $\theta$ given evidence (data)
- We show how data modifies our a priori assumption of $\theta$. We plot only one parameter for simplicity.

```{r, include=FALSE}
varname <- "b_contractTwoyear"
draws_prior <- prior_draws(mdl_logit_FixedEffects, varname) %>%
  dplyr::rename(x = !!as.name(varname)) %>% 
  mutate(
    label = "prior"
  )
draws_post <- as_draws_df(mdl_logit_FixedEffects) %>%
  mutate(
    label = "posterior"
  ) 

df_post <- draws_post %>%
  dplyr::select(as.name(varname), label) %>%
  rename(x = !!as.name(varname))
```

```{r}
rbind(draws_prior, df_post) %>% 
  ggplot(aes(x = x, fill = label)) + geom_density() +
  scale_fill_viridis(discrete = TRUE, alpha = 0.5) + theme_minimal() + 
  labs(x = varname)

```


### Posterior predictive distribution

- Distribution of possible unobserved values conditional on observed data: $p(y_{new}|y)$.

```{r}
tenure1 = 15.8
tenure2 = 10.5

Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure1
p1 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p1 <- data.frame(p = p1, tenure_months = tenure1)


Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure2
#table(Xnew$tenure_months == churn_data$tenure_months)
p2 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p2 <- data.frame(p = p2, tenure_months = tenure2)
df_p <- rbind(df_p1, df_p2)
df_p$tenure_months <- as.factor(df_p$tenure_months)

ggplot(df_p, aes(x = p, fill = tenure_months)) + geom_density(alpha = 0.5) +
  labs(title = paste0("Contract: ", Xnew$contract, ", Multiple lines: ", Xnew$multiple_lines, ", Payment method: ",
                      Xnew$payment_method))

```


### Model Evaluation

1) How wrong is the model?
2) How accurate are the model’s posterior classifications?
3) Model comparisons

To answer question (1), we will first check convergence of posterior distribution of parameters using traceplots. Next, we perform posterior predictive checks to confirm that the data simulated from our posterior logistic regression model has features similar to the original data. That is, simulations from the fit model result in similar number of ones and zeros as in the original data. For (2), we will address this question in two ways: with and without cross-validation. 

#### Check model convergence by visual inspection 

Figure below shows that there are no divergence chains and hence we can conclude that our model does not have any convergence problems.

```{r, fig.align='center', fig.height=6, fig.height=6}
plot(mdl_logit_FixedEffects, ask = FALSE, newpage = FALSE, N = 2)

```

#### Posterior predictive checks
To check he fit of the model to data we draw simulated values from the posterior predictive distribution of replicated data, $y^{rep}$ and compare the samples to original data. Any systematic differences between the simulations and the data will indicate potential failings of the model (e.g., unsuitable priors). The plot below compares original data with $500$ draws from the fit model. The figure clearly shows that the model passes this check

```{r, fig.align='center'}
pp_check(mdl_logit_FixedEffects, ndraws = 500)

```

### Classification accuracy

#### Training accuracy

- Threshold = 0.5


```{r}
set.seed(84735)
threshold = 0.5
# Predicted probabilities
#linpred <- posterior_linpred(mdl_logit_FixedEffects)
preds <- posterior_linpred(mdl_logit_FixedEffects, transform=TRUE)
pred <- colMeans(preds)
pr <- as.integer(pred >= 0.5)

# posterior classification accuracy
round(mean(xor(pr,as.integer(churn_data$churn_value==0))),6)


```
#### Confusion matrix

```{r}
table(data.frame(churn_value = churn_data$churn_value, churn_pred = pr))

```
#### Receiver Operating Characteristic (ROC) curve

```{r}
threshold <- seq(0, 1, by = 0.05)
compute_churn_decision <- function(threshold, pred, churn_value){
  pr <- as.integer(pred >= threshold)
  y <- data.frame(churn_value = churn_value, churn_pred = pr)
  pd <- y %>% dplyr::filter(churn_value == 1) %>%
    summarise(pd = mean(churn_pred)) %>% pull()
  pf <- y %>% dplyr::filter(churn_value == 0) %>%
    summarise(
      pf = mean(churn_pred)
    ) %>% pull()
  return(list(data.frame("pf" = pf, "pd" = pd, "th" = threshold)))
}

roc = lapply(threshold, compute_churn_decision, pred = pred, churn_value = churn_data$churn_value) %>%
  bind_rows()
roc <- roc %>%
  mutate(
    type = "training set"
  )

ggplot(roc, aes(x = pf, y = pd)) + geom_line(size = 2) + 
  geom_point(size = 5) + 
  labs(title = "ROC", x = "Probability of false alarm", y = "Probability of detection")


```

#### Cross validation
To better estimate the predictive performance for new not yet seen data we next use leave-one-out cross-validation.

```{r}
if(bRunCV_mdl_bayes_FE){
  loo_mdl_logit_FixedEffects <- loo(mdl_logit_FixedEffects, save_psis = TRUE)
  saveRDS(loo_mdl_logit_FixedEffects, "loo_mdl_logit_FixedEffects.rds")
  # LOO predictive probabilities
  ploo_mdl_logit_FixedEffects=E_loo(preds, loo_mdl_logit_FixedEffects$psis_object, type="mean", log_ratios = -log_lik(mdl_logit_FixedEffects))$value
  saveRDS(ploo, "ploo_mdl_logit_FixedEffects.rds")
} else{
  loo_mdl_logit_FixedEffects <- readRDS("loo_mdl_logit_FixedEffects.rds")
  ploo_mdl_logit_FixedEffects <- readRDS("ploo_mdl_logit_FixedEffects.rds")
}

## LOO classification accuracy
round(mean(xor(ploo>0.5,as.integer(churn_data$churn_value==0))), 6)

```


### Challenger model: To do

- Churn rate vs. Geographical location
- Multilevel model to incorporate location variations (especially useful to reduce prediction uncertainty in small sample size counties)


```{r}
table(churn_data$county)

```



```{r, fig.width=8, fig.height=8}

churn_data %>% group_by(county) %>%
  summarise(p = mean(churn_value)) %>% arrange(desc(p)) %>%
  mutate(county = fct_reorder(county, p)) %>%
  ggplot( aes(x=county, y=p)) +
    geom_bar(stat="identity") +
    coord_flip() +
    xlab("") + ylab("churn rate") + 
    theme_bw()

```


### Comparison with classical logistic regression (Frequentist)

```{r}
mdl_logit_FixedEffects_classical <- glm(churn_value ~ contract  + multiple_lines + payment_method + tenure_months,
                                        data = churn_data, family = "binomial")
summary(mdl_logit_FixedEffects_classical)

```
- Both classical and Bayesian solutions match pretty well

## Survival analysis: To Do


