Read data
# Read data
data_file <- here("data", "customer_churn_data_extended.csv")
churn_data <- read.csv(data_file)
datatable(churn_data, rownames = FALSE, filter="top", options = list(pageLength = 6, scrollX= T))
Warning in instance$preRenderHook(instance) :
It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
Warning in instance$preRenderHook(instance) :
It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
Quick missing data check
missmap(churn_data, col=c("blue", "red"), legend=FALSE)

- Looks like there are some missing values in total_charges. But overall data quality is good.
Explore data and implement preprocessing steps
- Replace dots in column names to underscores. Change all column names to lowercase
- Confirm each row is one customer (no duplicates)
- Check for missing values (11 NA values found in the column total_charges. All other columns are good. No further steps needed.)
- Add county values for each record for spatial analysis (future)
# Standardize column names
colnames(churn_data) <- tolower(colnames(churn_data))
colnames(churn_data) <- gsub("\\.", "_", colnames(churn_data))
# Confirm unique customers (no duplicates)
## if no duplicates the following length computation should be equal to the number of rows of the dataframe
length(unique(churn_data$customerid))
[1] 7043
# Missing values
summary(churn_data)
customerid count country state city zip_code lat_long latitude
Length:7043 Min. :1 Length:7043 Length:7043 Length:7043 Min. :90001 Length:7043 Min. :32.56
Class :character 1st Qu.:1 Class :character Class :character Class :character 1st Qu.:92102 Class :character 1st Qu.:34.03
Mode :character Median :1 Mode :character Mode :character Mode :character Median :93552 Mode :character Median :36.39
Mean :1 Mean :93522 Mean :36.28
3rd Qu.:1 3rd Qu.:95351 3rd Qu.:38.22
Max. :1 Max. :96161 Max. :41.96
longitude gender senior_citizen partner dependents tenure_months phone_service
Min. :-124.3 Length:7043 Length:7043 Length:7043 Length:7043 Min. : 0.00 Length:7043
1st Qu.:-121.8 Class :character Class :character Class :character Class :character 1st Qu.: 9.00 Class :character
Median :-119.7 Mode :character Mode :character Mode :character Mode :character Median :29.00 Mode :character
Mean :-119.8 Mean :32.37
3rd Qu.:-118.0 3rd Qu.:55.00
Max. :-114.2 Max. :72.00
multiple_lines internet_service online_security online_backup device_protection tech_support streaming_tv
Length:7043 Length:7043 Length:7043 Length:7043 Length:7043 Length:7043 Length:7043
Class :character Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
streaming_movies contract paperless_billing payment_method monthly_charges total_charges churn_label churn_value
Length:7043 Length:7043 Length:7043 Length:7043 Min. : 18.25 Min. : 18.8 Length:7043 Min. :0.0000
Class :character Class :character Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4 Class :character 1st Qu.:0.0000
Mode :character Mode :character Mode :character Mode :character Median : 70.35 Median :1397.5 Mode :character Median :0.0000
Mean : 64.76 Mean :2283.3 Mean :0.2654
3rd Qu.: 89.85 3rd Qu.:3794.7 3rd Qu.:1.0000
Max. :118.75 Max. :8684.8 Max. :1.0000
NA's :11
churn_score cltv churn_reason
Min. : 5.0 Min. :2003 Length:7043
1st Qu.: 40.0 1st Qu.:3469 Class :character
Median : 61.0 Median :4527 Mode :character
Mean : 58.7 Mean :4400
3rd Qu.: 75.0 3rd Qu.:5380
Max. :100.0 Max. :6500
Churned vs. Not churned
churn_data %>% ggplot(aes(x = churn_label, fill = churn_label)) + geom_bar() + theme_minimal()

How good is the company in retaining its customers?
mean(churn_data$churn_value)
[1] 0.2653699
One can answer this question easily by estimating survival curve using the given data. The Kaplan-Meir estimate for the given churn data is shown below.
churn_data <- churn_data %>%
mutate(
b_churn = ifelse(churn_label == "Yes", 1, 0)
)
kmfit <- survfit(Surv(tenure_months, b_churn) ~ 1, data = churn_data)
km_survplot <- survminer::ggsurvplot(
fit = kmfit,
surv.median.line = "none",
xlab = "Months",
ylab = "Survival probability", ylim = c(0.4, 1))
km_survplot$plot + geom_hline(yintercept = 0.61, linetype = "dashed") + geom_vline(xintercept = 70, linetype = "dashed")

- The probability that an average customer will continue with the company beyond 70 months is \(0.61\).
- Even after 70 months, the company is able to retain 61% or more of their customers.
Closer look at all variables
State hypotheses before building a data-driven model
- gender: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- senior_citizen: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- partner/dependents: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- tenure_months: In an average sense, a long time customer is more likely to stay with the company (low churn risk)
- phone_service/internet_service: Customers with both phone and internet service should be less likely to churn.
- multiple_lines: Having multiple lines should reduce churn risk. Multiple lines better discount more likely to be cost efficient for the customer.
- online_security, online_backup, device_protection, streaming_tv, streaming_movies: Data description tells us that customers do not pay for extra services. In general, customers with more services and diverse usage should be less likely to churn.
- contract: Expect to be a strong predictor. Customers with long term contracts less likely to churn than those with month-to-month contracts.
- payment_method: customers with automatic payment setup (automatic bank transfer or credit card) less likely to churn than manual payment methods.
- monthly_charges: churn risk should increase with higher monthly charges. However, need to be careful as this is not normalized with respect to number of lines, services, etc.
- total_charges: total charges is cumulative and depends on the duration of service. Not a good idea to use this in its raw form
Visualize key variables
Tenure
churn_data %>% ggplot(aes(x = tenure_months, fill = churn_label)) + geom_histogram() + facet_wrap(churn_label ~ .)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Multiple lines
ggplot(churn_data, aes(x = multiple_lines, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ multiple_lines, scale = "free")

Contract
ggplot(churn_data, aes(x = contract, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ contract, scale = "free")

Payment method
ggplot(churn_data, aes(x = payment_method, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ payment_method, scale = "free")

Logistic regression
We consider the following model:
\[\begin{eqnarray}
y_i & \sim & Bernoulli(\theta_i) \\
\text{logit}\left(\theta_i\right) & = & \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n, \\
\beta_0, \beta_1, \ldots, \beta_n & \sim & \mathcal{N}(0, 5) \hspace{10pt} (priors)
\end{eqnarray}\]
where \(y_i\) is binary data (churned or not churned), \(\theta_i\) is the probability of churn and \(x_i's\) are covariates of customer \(i\). \(\beta_0, \beta_1, \ldots, \beta_n\) are model parameters which will be estimated using the collected data.
Model summary
if(!bBuild_mdl_bayes_FE){
mdl_logit_FixedEffects <- readRDS("mdl_logit_bayes_1.rds")
} else {
mdl_logit_FixedEffects <- brm(formula = churn_value ~ contract + multiple_lines + payment_method + tenure_months,
data=churn_data,
family = bernoulli(link = "logit"),
prior = c(prior(normal(0, 2), class = Intercept),
prior(normal(0, 2), class = b)),
sample_prior = TRUE,
warmup = 500,
iter = 3000,
chains = 4,
inits= "0",
cores = 8,
seed = 123)
saveRDS(mdl_logit_FixedEffects, "mdl_logit_FixedEffects.rds")
}
summary(mdl_logit_FixedEffects)
Family: bernoulli
Links: mu = logit
Formula: churn_value ~ contract + multiple_lines + payment_method + tenure_months
Data: churn_data (Number of observations: 7043)
Draws: 4 chains, each with iter = 3000; warmup = 500; thin = 1;
total post-warmup draws = 10000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.32 0.09 -0.50 -0.14 1.00 6841 7598
contractOneyear -1.07 0.10 -1.26 -0.88 1.00 10375 7535
contractTwoyear -2.12 0.16 -2.45 -1.80 1.00 8931 7159
multiple_linesNophoneservice 0.20 0.11 -0.02 0.41 1.00 9998 7716
multiple_linesYes 0.71 0.07 0.57 0.85 1.00 9656 7536
payment_methodCreditcardautomatic -0.09 0.11 -0.30 0.13 1.00 7496 7672
payment_methodElectroniccheck 0.64 0.09 0.47 0.82 1.00 7001 7450
payment_methodMailedcheck -0.42 0.11 -0.63 -0.22 1.00 7023 7838
tenure_months -0.03 0.00 -0.03 -0.02 1.00 11053 9002
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Posterior distribution of parameters
- Probability of parameters \(\theta\) given evidence (data)
- We show how data modifies our a priori assumption of \(\theta\). We plot only one parameter for simplicity.
rbind(draws_prior, df_post) %>%
ggplot(aes(x = x, fill = label)) + geom_density() +
scale_fill_viridis(discrete = TRUE, alpha = 0.5) + theme_minimal() +
labs(x = varname)

Posterior predictive distribution
- Distribution of possible unobserved values conditional on observed data: \(p(y_{new}|y)\).
tenure1 = 15.8
tenure2 = 10.5
Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure1
p1 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p1 <- data.frame(p = p1, tenure_months = tenure1)
Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure2
#table(Xnew$tenure_months == churn_data$tenure_months)
p2 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p2 <- data.frame(p = p2, tenure_months = tenure2)
df_p <- rbind(df_p1, df_p2)
df_p$tenure_months <- as.factor(df_p$tenure_months)
ggplot(df_p, aes(x = p, fill = tenure_months)) + geom_density(alpha = 0.5) +
labs(title = paste0("Contract: ", Xnew$contract, ", Multiple lines: ", Xnew$multiple_lines, ", Payment method: ",
Xnew$payment_method))

Model Evaluation
- How wrong is the model?
- How accurate are the model’s posterior classifications?
- Model comparisons
To answer question (1), we will first check convergence of posterior distribution of parameters using traceplots. Next, we perform posterior predictive checks to confirm that the data simulated from our posterior logistic regression model has features similar to the original data. That is, simulations from the fit model result in similar number of ones and zeros as in the original data. For (2), we will address this question in two ways: with and without cross-validation.
Check model convergence by visual inspection
Figure below shows that there are no divergence chains and hence we can conclude that our model does not have any convergence problems.
plot(mdl_logit_FixedEffects, ask = FALSE, newpage = FALSE, N = 2)





Posterior predictive checks
To check he fit of the model to data we draw simulated values from the posterior predictive distribution of replicated data, \(y^{rep}\) and compare the samples to original data. Any systematic differences between the simulations and the data will indicate potential failings of the model (e.g., unsuitable priors). The plot below compares original data with \(500\) draws from the fit model. The figure clearly shows that the model passes this check
pp_check(mdl_logit_FixedEffects, ndraws = 500)

Classification accuracy
Training accuracy
set.seed(84735)
threshold = 0.5
# Predicted probabilities
#linpred <- posterior_linpred(mdl_logit_FixedEffects)
preds <- posterior_linpred(mdl_logit_FixedEffects, transform=TRUE)
pred <- colMeans(preds)
pr <- as.integer(pred >= 0.5)
# posterior classification accuracy
round(mean(xor(pr,as.integer(churn_data$churn_value==0))),6)
[1] 0.781769
Confusion matrix
table(data.frame(churn_value = churn_data$churn_value, churn_pred = pr))
churn_pred
churn_value 0 1
0 4648 526
1 1011 858
Receiver Operating Characteristic (ROC) curve
threshold <- seq(0, 1, by = 0.05)
compute_churn_decision <- function(threshold, pred, churn_value){
pr <- as.integer(pred >= threshold)
y <- data.frame(churn_value = churn_value, churn_pred = pr)
pd <- y %>% dplyr::filter(churn_value == 1) %>%
summarise(pd = mean(churn_pred)) %>% pull()
pf <- y %>% dplyr::filter(churn_value == 0) %>%
summarise(
pf = mean(churn_pred)
) %>% pull()
return(list(data.frame("pf" = pf, "pd" = pd, "th" = threshold)))
}
roc = lapply(threshold, compute_churn_decision, pred = pred, churn_value = churn_data$churn_value) %>%
bind_rows()
roc <- roc %>%
mutate(
type = "training set"
)
ggplot(roc, aes(x = pf, y = pd)) + geom_line(size = 2) +
geom_point(size = 5) +
labs(title = "ROC", x = "Probability of false alarm", y = "Probability of detection")

NA
NA
Cross validation
To better estimate the predictive performance for new not yet seen data we next use leave-one-out cross-validation.
if(bRunCV_mdl_bayes_FE){
loo_mdl_logit_FixedEffects <- loo(mdl_logit_FixedEffects, save_psis = TRUE)
saveRDS(loo_mdl_logit_FixedEffects, "loo_mdl_logit_FixedEffects.rds")
# LOO predictive probabilities
ploo_mdl_logit_FixedEffects=E_loo(preds, loo_mdl_logit_FixedEffects$psis_object, type="mean", log_ratios = -log_lik(mdl_logit_FixedEffects))$value
saveRDS(ploo, "ploo_mdl_logit_FixedEffects.rds")
} else{
loo_mdl_logit_FixedEffects <- readRDS("loo_mdl_logit_FixedEffects.rds")
ploo_mdl_logit_FixedEffects <- readRDS("ploo_mdl_logit_FixedEffects.rds")
}
## LOO classification accuracy
round(mean(xor(ploo>0.5,as.integer(churn_data$churn_value==0))), 6)
[1] 0.781627
Challenger model: To do
- Churn rate vs. Geographical location
- Multilevel model to incorporate location variations (especially useful to reduce prediction uncertainty in small sample size counties)
table(churn_data$county)
california,alameda california,alpine california,amador california,butte california,calaveras
180 8 36 72 72
california,colusa california,contra costa california,del norte california,el dorado california,fresno
24 140 16 84 200
california,glenn california,humboldt california,imperial california,inyo california,kern
28 136 70 40 168
california,kings california,lake california,lassen california,los angeles california,madera
32 52 60 1352 48
california,marin california,mariposa california,mendocino california,merced california,modoc
108 32 100 72 36
california,mono california,monterey california,napa california,nevada california,orange
28 112 36 28 341
california,placer california,plumas california,riverside california,sacramento california,san benito
124 48 308 196 16
california,san bernardino california,san diego california,san francisco california,san joaquin california,san luis obispo
347 475 104 104 76
california,san mateo california,santa barbara california,santa clara california,santa cruz california,shasta
116 96 220 64 108
california,sierra california,siskiyou california,solano california,sonoma california,stanislaus
32 84 52 140 96
california,sutter california,tehama california,trinity california,tulare california,tuolumne
36 56 52 132 48
california,ventura california,yolo california,yuba
102 56 44
churn_data %>% group_by(county) %>%
summarise(p = mean(churn_value)) %>% arrange(desc(p)) %>%
mutate(county = fct_reorder(county, p)) %>%
ggplot( aes(x=county, y=p)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("") + ylab("churn rate") +
theme_bw()

Comparison with classical logistic regression (Frequentist)
mdl_logit_FixedEffects_classical <- glm(churn_value ~ contract + multiple_lines + payment_method + tenure_months,
data = churn_data, family = "binomial")
summary(mdl_logit_FixedEffects_classical)
Call:
glm(formula = churn_value ~ contract + multiple_lines + payment_method +
tenure_months, family = "binomial", data = churn_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6242 -0.7999 -0.3223 0.8172 3.0003
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.314931 0.091434 -3.444 0.000572 ***
contractOne year -1.070434 0.099166 -10.794 < 2e-16 ***
contractTwo year -2.127030 0.164503 -12.930 < 2e-16 ***
multiple_linesNo phone service 0.202842 0.110110 1.842 0.065451 .
multiple_linesYes 0.712004 0.072611 9.806 < 2e-16 ***
payment_methodCredit card (automatic) -0.087631 0.109170 -0.803 0.422149
payment_methodElectronic check 0.638239 0.089455 7.135 9.70e-13 ***
payment_methodMailed check -0.424191 0.105653 -4.015 5.95e-05 ***
tenure_months -0.027519 0.001997 -13.778 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 6321.2 on 7034 degrees of freedom
AIC: 6339.2
Number of Fisher Scoring iterations: 6
- Both classical and Bayesian solutions match pretty well
---
title: "Customer churn prediction"
author: "Satish Iyengar"
date: "9/23/2021"
output: html_notebook
---


```{r load_packages, include = FALSE, echo=FALSE, message=FALSE, warning=FALSE}

library(survival)
library(here)
library(dplyr)
library(tidyverse)
library(lubridate)
library(kableExtra)
library(data.table)
library(shiny)
library(DT)
library(survminer)
library(pec)
library(party)
library(rms)
library(sp)
library(maps)
library(maptools)
library(brms)
library(Amelia)
library(ISLR)

library(janitor)
library(purrr)
library(viridis)
library(tidybayes)
library(bayesplot)
library(arm)
library(modelr)
library(bayesrules)

```


```{r boolean_vars, include=FALSE}
bBuild_mdl_bayes_FE <- 0
bRunCV_mdl_bayes_FE <- 0
bBuild_mdl_bayes_RE <- 0
bRunCV_mdl_bayes_RE <- 0
bCompareModels <- 0

```

## What is Customer Churn?

Customer churn is loss of customers/subscribers. Is it possible to use AI/ML/Data Science techniques to construct a model that can help predict customer churn? 

We can formulate this problem in at least two different ways:

- Apply classical techniques such as logistic regression to predict if a customer will churn (Yes/No) given *relevant covariates* such as his/her tenure with the company so far.

- Build *time-to-event* models using survival analysis which will allow us to answer questions such as: What is the probability that the customer will churn in the next $x$ days.

In this report, we will show how both these approaches can be applied and present results.


## Read data

- Raw data used in this report can be downloaded from here: [Extended telco churn data](https://www.kaggle.com/yeanzc/telco-customer-churn-ibm-dataset)


```{r read_data}
# Read data
data_file <- here("data", "customer_churn_data_extended.csv")
churn_data <- read.csv(data_file)
datatable(churn_data, rownames = FALSE, filter="top", options = list(pageLength = 6, scrollX= T))

```

### Quick missing data check

```{r}
missmap(churn_data, col=c("blue", "red"), legend=FALSE)

```

- Looks like there are some missing values in *total_charges*. But overall data quality is good. 

## Data Description (Reference: [Extended telco churn data](https://www.kaggle.com/yeanzc/telco-customer-churn-ibm-dataset))

- 7043 observations with 33 variables
- CustomerID: A unique ID that identifies each customer.
- Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set.
- Country: The country of the customer’s primary residence: United States.
- State: The state of the customer’s primary residence: California.
- City: The city of the customer’s primary residence: 1129 cities in CA.
- Zip Code: The zip code of the customer’s primary residence: 1652 zip codes.
- Lat Long: The combined latitude and longitude of the customer’s primary residence: 1652 lat_lons.
- Latitude: The latitude of the customer’s primary residence: 1652 lats.
- Longitude: The longitude of the customer’s primary residence: 1652 lons.
- Gender: The customer’s gender: Male (3555), Female (3488).
- Senior Citizen: Indicates if the customer is 65 or older: Yes (1142), No (5901).
- Partner: Indicate if the customer has a partner: Yes (3402), No (3641).
- Dependents: Indicates if the customer lives with any dependents: Yes (1627), No (5416). Dependents could be children, parents, grandparents, etc.
- Tenure Months: Indicates the total amount of months that the customer has been with the company by the end of the quarter specified above.
- Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes (6361), No (682).
- Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes (2971), No (3390), No phone service (682).
- Internet Service: Indicates if the customer subscribes to Internet service with the company: No (1526), DSL (2421), Fiber Optic (3096).
- Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes (2019), No (3498), No internet service (1526).
- Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes (2429), No (3088), No internet service (1526).
- Device Protection: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes (2422), No (3095), No internet service (1526).
- Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes (2044), No (3473), No internet service (1526).
- Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes (2707), No (2810), No internet service (1526). The company does not charge an additional fee for this service.
- Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes (2732), No (2785), No internet service (1526). The company does not charge an additional fee for this service.
- Contract: Indicates the customer’s current contract type: Month-to-Month (3875), One Year (1473), Two Year (1695).
- Paperless Billing: Indicates if the customer has chosen paperless billing: Yes (4171), No (2872).
- Payment Method: Indicates how the customer pays their bill: Bank transfer (1544), Credit Card (1522), Electronic check (2365), Mailed Check (1612).
- Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.
- Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.
- Churn Label: Yes = the customer left the company this quarter (1869). No = the customer remained with the company (5174). Directly related to Churn Value.
- Churn Value: 1 = the customer left the company this quarter (1869). 0 = the customer remained with the company. Directly related to Churn Label (5174).
- Churn Score: A value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model incorporates multiple factors known to cause churn. The higher the score, the more likely the customer will churn.
- CLTV: Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.
- Churn Reason: A customer’s specific reason for leaving the company. Directly related to Churn Category: 19 unique reasons.

## Explore data and implement preprocessing steps

- Replace dots in column names to underscores. Change all column names to lowercase
- Confirm each row is one customer (no duplicates)
- Check for missing values (11 NA values found in the column *total_charges*. All other columns are good. No further steps needed.)
- Add county values for each record for spatial analysis (future)

```{r preprocess_data, message=FALSE, warning=FALSE}
# Standardize column names
colnames(churn_data) <- tolower(colnames(churn_data))
colnames(churn_data) <- gsub("\\.", "_", colnames(churn_data))

# Confirm unique customers (no duplicates)
## if no duplicates the following length computation should be equal to the number of rows of the dataframe 
length(unique(churn_data$customerid))

# Missing values
summary(churn_data)

```

```{r find_county, warning=FALSE, error=FALSE, include=FALSE}
# The single argument to this function, pointsDF, is a data.frame in which:
#   - column 1 contains the longitude in degrees (negative in the US)
#   - column 2 contains the latitude in degrees
latlong2county <- function(pointsDF) {
  # Prepare SpatialPolygons object with one SpatialPolygon
  # per county
  counties <- map('county', fill=TRUE, col="transparent", plot=FALSE)
  IDs <- sapply(strsplit(counties$names, ":"), function(x) x[1])
  counties_sp <- map2SpatialPolygons(counties, IDs=IDs,
                                     proj4string=CRS("+proj=longlat +datum=WGS84"))
  # Convert pointsDF to a SpatialPoints object 
  pointsSP <- SpatialPoints(pointsDF, 
                            proj4string=CRS("+proj=longlat +datum=WGS84"))
  # Use 'over' to get _indices_ of the Polygons object containing each point 
  indices <- over(pointsSP, counties_sp)
  # Return the county names of the Polygons object containing each point
  countyNames <- sapply(counties_sp@polygons, function(x) x@ID)
  countyNames[indices]
}
churn_data <- churn_data %>%
  mutate(
    county = latlong2county(data.frame(x = churn_data$longitude, y = churn_data$latitude))
  )
df <- read.csv("df_updated.csv")
churn_data <- churn_data %>% dplyr::filter(!is.na(county)) %>%
  rbind(df)

```

### Churned vs. Not churned

```{r, fig.align='center'}
churn_data %>% ggplot(aes(x = churn_label, fill = churn_label)) + geom_bar() + theme_minimal()

```

### How good is the company in retaining its customers?


```{r}
mean(churn_data$churn_value)

```

One can answer this question easily by estimating survival curve using the given data. The Kaplan-Meir estimate for the given churn data is shown below.



```{r kmfit, warning=FALSE, fig.align='center'}
churn_data <- churn_data %>%
  mutate(
    b_churn = ifelse(churn_label == "Yes", 1, 0)
  )
kmfit <- survfit(Surv(tenure_months, b_churn) ~ 1, data = churn_data)
km_survplot <- survminer::ggsurvplot(
  fit = kmfit, 
  surv.median.line = "none",
  xlab = "Months", 
  ylab = "Survival probability", ylim = c(0.4, 1))
km_survplot$plot + geom_hline(yintercept = 0.61, linetype = "dashed") + geom_vline(xintercept = 70, linetype = "dashed")

```

- The probability that an average customer will continue with the company beyond 70 months is $0.61$.
- Even after 70 months, the company is able to retain 61% or more of their customers.

### Closer look at all variables 

**State hypotheses before building a data-driven model**

- gender: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- senior_citizen: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- partner/dependents: Do not expect to be a strong predictor of churn risk. We will first build a model without this variable, and revisit only if needed.
- **tenure_months**: In an average sense, a long time customer is more likely to stay with the company (low churn risk)
- phone_service/internet_service: Customers with both phone and internet service should be less likely to churn.
- **multiple_lines**: Having multiple lines should reduce churn risk. Multiple lines better discount more likely to be cost efficient for the customer.
- online_security, online_backup, device_protection, streaming_tv, streaming_movies: Data description tells us that customers do not pay for extra services. In general, customers with more services and diverse usage should be less likely to churn.
- **contract**: Expect to be a strong predictor. Customers with long term contracts less likely to churn than those with month-to-month contracts.
- **payment_method**: customers with automatic payment setup (automatic bank transfer or credit card) less likely to churn than manual payment methods.
- **monthly_charges**: churn risk should increase with higher monthly charges. However, need to be careful as this is not normalized with respect to number of lines, services, etc.
- total_charges: total charges is cumulative and depends on the duration of service. Not a good idea to use this in its raw form


### Visualize *key* variables

#### Tenure

```{r, fig.align='center', warning=FALSE}
churn_data %>% ggplot(aes(x = tenure_months, fill = churn_label)) + geom_histogram() + facet_wrap(churn_label ~ .)

```





#### Multiple lines

```{r, fig.align='center', warning=FALSE}
ggplot(churn_data, aes(x = multiple_lines, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ multiple_lines, scale = "free")

```



#### Contract

```{r, fig.align='center'}
ggplot(churn_data, aes(x = contract, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ contract, scale = "free")

```

#### Payment method

```{r, fig.align='center'}
ggplot(churn_data, aes(x = payment_method, y = tenure_months, fill = churn_label)) + geom_boxplot() + facet_wrap(~ payment_method, scale = "free")

```



## Logistic regression

We consider the following model:

\begin{eqnarray}
y_i & \sim & Bernoulli(\theta_i) \\
\text{logit}\left(\theta_i\right) & = & \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n, \\
\beta_0, \beta_1, \ldots, \beta_n & \sim & \mathcal{N}(0, 5) \hspace{10pt} (priors)
\end{eqnarray}

where $y_i$ is binary data (churned or not churned), $\theta_i$ is the probability of churn and $x_i's$ are covariates of customer $i$. $\beta_0, \beta_1, \ldots, \beta_n$ are model parameters which will be estimated using the collected data. 

### Model summary

```{r}
if(!bBuild_mdl_bayes_FE){
  mdl_logit_FixedEffects <- readRDS("mdl_logit_bayes_1.rds")
} else {
  mdl_logit_FixedEffects <- brm(formula = churn_value ~ contract  + multiple_lines + payment_method + tenure_months,
                                data=churn_data,
                                family = bernoulli(link = "logit"),
                                prior = c(prior(normal(0, 2), class = Intercept),
                                          prior(normal(0, 2), class = b)),
                                sample_prior = TRUE,
                                warmup = 500,
                                iter = 3000,
                                chains = 4,
                                inits= "0",
                                cores = 8,
                                seed = 123)
  saveRDS(mdl_logit_FixedEffects, "mdl_logit_FixedEffects.rds")
}
summary(mdl_logit_FixedEffects)

```

#### Posterior distribution of parameters

- Probability of parameters $\theta$ given evidence (data)
- We show how data modifies our a priori assumption of $\theta$. We plot only one parameter for simplicity.

```{r, include=FALSE}
varname <- "b_contractTwoyear"
draws_prior <- prior_draws(mdl_logit_FixedEffects, varname) %>%
  dplyr::rename(x = !!as.name(varname)) %>% 
  mutate(
    label = "prior"
  )
draws_post <- as_draws_df(mdl_logit_FixedEffects) %>%
  mutate(
    label = "posterior"
  ) 

df_post <- draws_post %>%
  dplyr::select(as.name(varname), label) %>%
  rename(x = !!as.name(varname))
```

```{r}
rbind(draws_prior, df_post) %>% 
  ggplot(aes(x = x, fill = label)) + geom_density() +
  scale_fill_viridis(discrete = TRUE, alpha = 0.5) + theme_minimal() + 
  labs(x = varname)

```


### Posterior predictive distribution

- Distribution of possible unobserved values conditional on observed data: $p(y_{new}|y)$.

```{r}
tenure1 = 15.8
tenure2 = 10.5

Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure1
p1 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p1 <- data.frame(p = p1, tenure_months = tenure1)


Xnew <- churn_data[1, ]
Xnew$tenure_months <- tenure2
#table(Xnew$tenure_months == churn_data$tenure_months)
p2 = posterior_linpred(mdl_logit_FixedEffects, transform=TRUE, newdata = Xnew)
df_p2 <- data.frame(p = p2, tenure_months = tenure2)
df_p <- rbind(df_p1, df_p2)
df_p$tenure_months <- as.factor(df_p$tenure_months)

ggplot(df_p, aes(x = p, fill = tenure_months)) + geom_density(alpha = 0.5) +
  labs(title = paste0("Contract: ", Xnew$contract, ", Multiple lines: ", Xnew$multiple_lines, ", Payment method: ",
                      Xnew$payment_method))

```


### Model Evaluation

1) How wrong is the model?
2) How accurate are the model’s posterior classifications?
3) Model comparisons

To answer question (1), we will first check convergence of posterior distribution of parameters using traceplots. Next, we perform posterior predictive checks to confirm that the data simulated from our posterior logistic regression model has features similar to the original data. That is, simulations from the fit model result in similar number of ones and zeros as in the original data. For (2), we will address this question in two ways: with and without cross-validation. 

#### Check model convergence by visual inspection 

Figure below shows that there are no divergence chains and hence we can conclude that our model does not have any convergence problems.

```{r, fig.align='center', fig.height=6, fig.height=6}
plot(mdl_logit_FixedEffects, ask = FALSE, newpage = FALSE, N = 2)

```

#### Posterior predictive checks
To check he fit of the model to data we draw simulated values from the posterior predictive distribution of replicated data, $y^{rep}$ and compare the samples to original data. Any systematic differences between the simulations and the data will indicate potential failings of the model (e.g., unsuitable priors). The plot below compares original data with $500$ draws from the fit model. The figure clearly shows that the model passes this check

```{r, fig.align='center'}
pp_check(mdl_logit_FixedEffects, ndraws = 500)

```

### Classification accuracy

#### Training accuracy

- Threshold = 0.5


```{r}
set.seed(84735)
threshold = 0.5
# Predicted probabilities
#linpred <- posterior_linpred(mdl_logit_FixedEffects)
preds <- posterior_linpred(mdl_logit_FixedEffects, transform=TRUE)
pred <- colMeans(preds)
pr <- as.integer(pred >= 0.5)

# posterior classification accuracy
round(mean(xor(pr,as.integer(churn_data$churn_value==0))),6)


```
#### Confusion matrix

```{r}
table(data.frame(churn_value = churn_data$churn_value, churn_pred = pr))

```
#### Receiver Operating Characteristic (ROC) curve

```{r}
threshold <- seq(0, 1, by = 0.05)
compute_churn_decision <- function(threshold, pred, churn_value){
  pr <- as.integer(pred >= threshold)
  y <- data.frame(churn_value = churn_value, churn_pred = pr)
  pd <- y %>% dplyr::filter(churn_value == 1) %>%
    summarise(pd = mean(churn_pred)) %>% pull()
  pf <- y %>% dplyr::filter(churn_value == 0) %>%
    summarise(
      pf = mean(churn_pred)
    ) %>% pull()
  return(list(data.frame("pf" = pf, "pd" = pd, "th" = threshold)))
}

roc = lapply(threshold, compute_churn_decision, pred = pred, churn_value = churn_data$churn_value) %>%
  bind_rows()
roc <- roc %>%
  mutate(
    type = "training set"
  )

ggplot(roc, aes(x = pf, y = pd)) + geom_line(size = 2) + 
  geom_point(size = 5) + 
  labs(title = "ROC", x = "Probability of false alarm", y = "Probability of detection")


```

#### Cross validation
To better estimate the predictive performance for new not yet seen data we next use leave-one-out cross-validation.

```{r}
if(bRunCV_mdl_bayes_FE){
  loo_mdl_logit_FixedEffects <- loo(mdl_logit_FixedEffects, save_psis = TRUE)
  saveRDS(loo_mdl_logit_FixedEffects, "loo_mdl_logit_FixedEffects.rds")
  # LOO predictive probabilities
  ploo_mdl_logit_FixedEffects=E_loo(preds, loo_mdl_logit_FixedEffects$psis_object, type="mean", log_ratios = -log_lik(mdl_logit_FixedEffects))$value
  saveRDS(ploo, "ploo_mdl_logit_FixedEffects.rds")
} else{
  loo_mdl_logit_FixedEffects <- readRDS("loo_mdl_logit_FixedEffects.rds")
  ploo_mdl_logit_FixedEffects <- readRDS("ploo_mdl_logit_FixedEffects.rds")
}

## LOO classification accuracy
round(mean(xor(ploo>0.5,as.integer(churn_data$churn_value==0))), 6)

```


### Challenger model: To do

- Churn rate vs. Geographical location
- Multilevel model to incorporate location variations (especially useful to reduce prediction uncertainty in small sample size counties)


```{r}
table(churn_data$county)

```



```{r, fig.width=8, fig.height=8}

churn_data %>% group_by(county) %>%
  summarise(p = mean(churn_value)) %>% arrange(desc(p)) %>%
  mutate(county = fct_reorder(county, p)) %>%
  ggplot( aes(x=county, y=p)) +
    geom_bar(stat="identity") +
    coord_flip() +
    xlab("") + ylab("churn rate") + 
    theme_bw()

```


### Comparison with classical logistic regression (Frequentist)

```{r}
mdl_logit_FixedEffects_classical <- glm(churn_value ~ contract  + multiple_lines + payment_method + tenure_months,
                                        data = churn_data, family = "binomial")
summary(mdl_logit_FixedEffects_classical)

```
- Both classical and Bayesian solutions match pretty well

## Survival analysis: To Do


