Churn Analysis with Logistic Regression and Random Forest

Introduction

Churn analysis is a fundamental problem in data science. The investigator obtains information on customer behavior and attributes and uses this information to predict whether the customer will terminate a contract, or not. In this study, I conduct a churn analysis based on simulated cell phone customer data from a Kaggle competition, Customer Churn Prediction 2020. I combine this data with information on Google searches pertaining to each of the four major cell phone carriers in the US. The study addresses four main questions:

Does the mean total_day_charge for customers vary by area_code?
Does the mean value of number_customer_service_calls vary depending on whether a customer carries an international_plan?
Is number_customer_service_calls related to the customer’s total_day_charge?
Are a customer’s characteristics predictive of whether they will terminate their contract?

The study follows the OSEMN workflow. (“OSEMN” stands for Obtain, Scrub, Explore, Model, iNterpret, and it is an osemn/awesome way to structure a data science project.) The main strategy for data exploration, in addition to visualization, is hypothesis testing. In the modeling section, I build a logistic regression as well as a random forest model to predict customer churn. I conclude with a summary of my findings.

Obtain

This analysis combines two data sources. The first is from the Kaggle competition, Customer Churn Prediction 2020. The second data source contains information on Google searches pertaining to each of the four major cell phone carriers in the US.

library(caret)
library(InformationValue)
library(GGally)
library(gtrendsR)
library(infer)
library(psych)
library(randomForest)
library(ROCR)
library(stats)
library(tidyverse)
set.seed(210509)

k1.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/churn_train.csv")
str(k1.dat)

## spec_tbl_df [4,250 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ state                        : chr [1:4250] "OH" "NJ" "OH" "OK" ...
##  $ account_length               : num [1:4250] 107 137 84 75 121 147 117 141 65 74 ...
##  $ area_code                    : chr [1:4250] "area_code_415" "area_code_415" "area_code_408" "area_code_415" ...
##  $ international_plan           : chr [1:4250] "no" "no" "yes" "yes" ...
##  $ voice_mail_plan              : chr [1:4250] "yes" "no" "no" "no" ...
##  $ number_vmail_messages        : num [1:4250] 26 0 0 0 24 0 0 37 0 0 ...
##  $ total_day_minutes            : num [1:4250] 162 243 299 167 218 ...
##  $ total_day_calls              : num [1:4250] 123 114 71 113 88 79 97 84 137 127 ...
##  $ total_day_charge             : num [1:4250] 27.5 41.4 50.9 28.3 37.1 ...
##  $ total_eve_minutes            : num [1:4250] 195.5 121.2 61.9 148.3 348.5 ...
##  $ total_eve_calls              : num [1:4250] 103 110 88 122 108 94 80 111 83 148 ...
##  $ total_eve_charge             : num [1:4250] 16.62 10.3 5.26 12.61 29.62 ...
##  $ total_night_minutes          : num [1:4250] 254 163 197 187 213 ...
##  $ total_night_calls            : num [1:4250] 103 104 89 121 118 96 90 97 111 94 ...
##  $ total_night_charge           : num [1:4250] 11.45 7.32 8.86 8.41 9.57 ...
##  $ total_intl_minutes           : num [1:4250] 13.7 12.2 6.6 10.1 7.5 7.1 8.7 11.2 12.7 9.1 ...
##  $ total_intl_calls             : num [1:4250] 3 5 7 3 7 6 4 5 6 5 ...
##  $ total_intl_charge            : num [1:4250] 3.7 3.29 1.78 2.73 2.03 1.92 2.35 3.02 3.43 2.46 ...
##  $ number_customer_service_calls: num [1:4250] 1 0 2 3 3 0 1 0 4 0 ...
##  $ churn                        : chr [1:4250] "no" "no" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   state = col_character(),
##   ..   account_length = col_double(),
##   ..   area_code = col_character(),
##   ..   international_plan = col_character(),
##   ..   voice_mail_plan = col_character(),
##   ..   number_vmail_messages = col_double(),
##   ..   total_day_minutes = col_double(),
##   ..   total_day_calls = col_double(),
##   ..   total_day_charge = col_double(),
##   ..   total_eve_minutes = col_double(),
##   ..   total_eve_calls = col_double(),
##   ..   total_eve_charge = col_double(),
##   ..   total_night_minutes = col_double(),
##   ..   total_night_calls = col_double(),
##   ..   total_night_charge = col_double(),
##   ..   total_intl_minutes = col_double(),
##   ..   total_intl_calls = col_double(),
##   ..   total_intl_charge = col_double(),
##   ..   number_customer_service_calls = col_double(),
##   ..   churn = col_character()
##   .. )

k1.dat contains 20 variables with 4,250 rows. Columns 1 through 19 contain information pertaining to customer accounts, and column 20 indicates whether the customer terminated their account (churned).

gtrends_search_terms <- c("att", "at&t", "tmobile", "t-mobile", "us cellular", "u.s. cellular", "verizon", "verizon wireless")

states <- c('US-AL', 'US-AK', 'US-AZ', 'US-AR', 'US-CA', 'US-CO', 'US-CT', 'US-DE', 'US-DC', 'US-FL', 'US-GA', 'US-HI', 'US-ID', 'US-IL', 'US-IN', 'US-IA', 'US-KS', 'US-KY', 'US-LA', 'US-ME', 'US-MD', 'US-MA', 'US-MI', 'US-MN', 'US-MS', 'US-MO', 'US-MT', 'US-NE', 'US-NV', 'US-NH', 'US-NJ', 'US-NM', 'US-NY', 'US-NC', 'US-ND', 'US-OH', 'US-OK', 'US-OR', 'US-PA', 'US-RI', 'US-SC', 'US-SD', 'US-TN', 'US-TX', 'US-UT', 'US-VT', 'US-VA', 'US-WA', 'US-WV', 'US-WI', 'US-WY')

g1.dat <- list()

for (i in seq(1,51)){
  g1.dat[i] <- gtrends(keyword = gtrends_search_terms[1:4], geo = states[i], time = "2020-03-01 2020-03-30")
}

for (i in seq(52,102)){
  g1.dat[i] <- gtrends(keyword = gtrends_search_terms[5:8], geo = states[i-51], time = "2020-03-01 2020-03-30")
}

str(g1.dat[[1]])

## 'data.frame':    120 obs. of  7 variables:
##  $ date    : POSIXct, format: "2020-03-01" "2020-03-02" ...
##  $ hits    : int  44 20 84 34 36 76 32 38 69 59 ...
##  $ keyword : chr  "att" "att" "att" "att" ...
##  $ geo     : chr  "US-AL" "US-AL" "US-AL" "US-AL" ...
##  $ time    : chr  "2020-03-01 2020-03-30" "2020-03-01 2020-03-30" "2020-03-01 2020-03-30" "2020-03-01 2020-03-30" ...
##  $ gprop   : chr  "web" "web" "web" "web" ...
##  $ category: int  0 0 0 0 0 0 0 0 0 0 ...

summary(g1.dat[[1]])

##       date                          hits          keyword         
##  Min.   :2020-03-01 00:00:00   Min.   :  0.00   Length:120        
##  1st Qu.:2020-03-08 00:00:00   1st Qu.:  6.00   Class :character  
##  Median :2020-03-15 12:00:00   Median : 30.00   Mode  :character  
##  Mean   :2020-03-15 12:00:00   Mean   : 36.23                     
##  3rd Qu.:2020-03-23 00:00:00   3rd Qu.: 63.25                     
##  Max.   :2020-03-30 00:00:00   Max.   :100.00                     
##      geo                time              gprop              category
##  Length:120         Length:120         Length:120         Min.   :0  
##  Class :character   Class :character   Class :character   1st Qu.:0  
##  Mode  :character   Mode  :character   Mode  :character   Median :0  
##                                                           Mean   :0  
##                                                           3rd Qu.:0  
##                                                           Max.   :0

g1.dat is a list of data frames. Each of the 102 dataframes contained in g1.dat contains Google Trends information for 4 search terms, 1 state, and 30 days. Since there are 8 search terms, each state is represented by 2 data frames.

Scrub

In this section, I summarize the Google Trends data for each state, and I combine all the data into a single data frame.

Summary data for k1.dat shows that some columns are misclassified. For example, read_csv interprets churn as a character vector, but it’s better represented as a Boolean. Below I reclassify some of the columns.

k2.dat <- k1.dat %>%
  mutate("international_plan" = ifelse(international_plan == "yes", TRUE, FALSE)) %>%
  mutate("voice_mail_plan" = ifelse(voice_mail_plan == "yes", TRUE, FALSE)) %>%
  mutate("churn" = ifelse(churn == "yes", TRUE, FALSE)) %>%
  mutate("state" = as.factor(state)) %>%
  mutate("area_code" = as.factor(area_code))
  
str(k2.dat)

## spec_tbl_df [4,250 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 36 32 36 37 20 25 19 50 16 40 ...
##  $ account_length               : num [1:4250] 107 137 84 75 121 147 117 141 65 74 ...
##  $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 1 2 3 2 1 2 2 2 ...
##  $ international_plan           : logi [1:4250] FALSE FALSE TRUE TRUE FALSE TRUE ...
##  $ voice_mail_plan              : logi [1:4250] TRUE FALSE FALSE FALSE TRUE FALSE ...
##  $ number_vmail_messages        : num [1:4250] 26 0 0 0 24 0 0 37 0 0 ...
##  $ total_day_minutes            : num [1:4250] 162 243 299 167 218 ...
##  $ total_day_calls              : num [1:4250] 123 114 71 113 88 79 97 84 137 127 ...
##  $ total_day_charge             : num [1:4250] 27.5 41.4 50.9 28.3 37.1 ...
##  $ total_eve_minutes            : num [1:4250] 195.5 121.2 61.9 148.3 348.5 ...
##  $ total_eve_calls              : num [1:4250] 103 110 88 122 108 94 80 111 83 148 ...
##  $ total_eve_charge             : num [1:4250] 16.62 10.3 5.26 12.61 29.62 ...
##  $ total_night_minutes          : num [1:4250] 254 163 197 187 213 ...
##  $ total_night_calls            : num [1:4250] 103 104 89 121 118 96 90 97 111 94 ...
##  $ total_night_charge           : num [1:4250] 11.45 7.32 8.86 8.41 9.57 ...
##  $ total_intl_minutes           : num [1:4250] 13.7 12.2 6.6 10.1 7.5 7.1 8.7 11.2 12.7 9.1 ...
##  $ total_intl_calls             : num [1:4250] 3 5 7 3 7 6 4 5 6 5 ...
##  $ total_intl_charge            : num [1:4250] 3.7 3.29 1.78 2.73 2.03 1.92 2.35 3.02 3.43 2.46 ...
##  $ number_customer_service_calls: num [1:4250] 1 0 2 3 3 0 1 0 4 0 ...
##  $ churn                        : logi [1:4250] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   state = col_character(),
##   ..   account_length = col_double(),
##   ..   area_code = col_character(),
##   ..   international_plan = col_character(),
##   ..   voice_mail_plan = col_character(),
##   ..   number_vmail_messages = col_double(),
##   ..   total_day_minutes = col_double(),
##   ..   total_day_calls = col_double(),
##   ..   total_day_charge = col_double(),
##   ..   total_eve_minutes = col_double(),
##   ..   total_eve_calls = col_double(),
##   ..   total_eve_charge = col_double(),
##   ..   total_night_minutes = col_double(),
##   ..   total_night_calls = col_double(),
##   ..   total_night_charge = col_double(),
##   ..   total_intl_minutes = col_double(),
##   ..   total_intl_calls = col_double(),
##   ..   total_intl_charge = col_double(),
##   ..   number_customer_service_calls = col_double(),
##   ..   churn = col_character()
##   .. )

The Google Trends data can be transformed into a measure of inequality across search activity within each state. This is motivated by the assumption that there will be greater equality in search activity across wireless providers in states having competitive wireless markets. By contrast, in states where a single wireless provider dominates, the majority of searches will seek the dominant provider. This will result in greater inequality across search terms within low-competition states.

The inequality measure I use is the Gini index. A Gini index is typically used to measure economic inequality. When it’s used to measure income inequality, “A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of 1 (or 100%) expresses maximal inequality among values (e.g., for a large number of people where only one person has all the income or consumption and all others have none…)” (Wikipedia). In the context of this data, a Gini index of 1 means that all queries within a state sought a single wireless provider. An index of 0 means that search queries were equally split across the four providers.

To construct the Gini index for each state, I add together all the queries within a state for a given provider across the entire month of March, 2020, the month immediately preceding the opening of the Kaggle contest. Finally, I construct the Gini coefficient for the state by comparing the March 2020 queries for different providers within the state.

tmp_query_summaries <- data.frame("geo" = "x", "att" = -1, "tmobile" = -1, "uscell" = -1, "verizon" = -1)

for (i in seq(1,51)) {
  tmp_queries_in_state <- g1.dat[[i]] %>%
    mutate("hits" = as.numeric(ifelse(hits == "<1", 0, hits))) %>%
    group_by(keyword) %>%
    summarise(sum(hits))
  
  tmp_query_summaries <- rbind(tmp_query_summaries, c("geo" = states[i], "att" = sum(tmp_queries_in_state[1,2], tmp_queries_in_state[2,2]), "tmobile" = sum(tmp_queries_in_state[3,2], tmp_queries_in_state[4,2]), "uscell" = NA, "verizon" = NA))
}

tmp_query_summaries <- tmp_query_summaries %>%
  filter(att >= 0)

for (i in seq(52,102)) {
  tmp_queries_in_state <- g1.dat[[i]] %>%
    mutate("hits" = as.numeric(ifelse(hits == "<1", 0, hits))) %>%
    group_by(keyword) %>%
    summarise(sum(hits))

  tmp_query_summaries[i-51,4] <- sum(tmp_queries_in_state[1,2], tmp_queries_in_state[2,2])
  tmp_query_summaries[i-51,5] <- sum(tmp_queries_in_state[3,2], tmp_queries_in_state[4,2])
}

tmp_query_summaries is a relative measure of search activity for each carrier for each state for March 2020.

head(tmp_query_summaries)

##     geo  att tmobile uscell verizon
## 1 US-AL 3740     608     17    2443
## 2 US-AK 2143     112      0     954
## 3 US-AZ 2978    1799      5    2497
## 4 US-AR 2800     245      7    2126
## 5 US-CA 4172    1385     27    2772
## 6 US-CO 2687    1546     10    2224

Next, a Gini index is computed for each state from this data, and the variable gini is added to k2.dat.

ginis <- data.frame("geo" = "x", "gini" = "-1")
for(i in seq(nrow(tmp_query_summaries))) {
  tmp <- sort(as.integer(tmp_query_summaries[i,2:5]))
  tmp.gini <- 1 - ((1/sum(tmp)) * (1.75 * tmp[1] + 1.25 * tmp[2] + 0.75 * tmp[3] + 0.25 * tmp[4]))
  ginis <- rbind(ginis, c(tmp_query_summaries[i,1], round(tmp.gini, 4)))
}
ginis <- ginis %>%
  filter(gini >= 0)

ginis <- ginis %>%
  mutate("tmp" = str_sub(geo, -2)) %>%
  select(3,2) %>%
  rename("state" = tmp)

k3.dat <- left_join(k2.dat, ginis, by = "state") %>%
  select(2,6:18,21,19,4,5,1,3,20) %>%
  transform(gini = as.numeric(gini)) %>%
  transform(state = as.factor(state))

colnames(k3.dat)

##  [1] "account_length"                "number_vmail_messages"        
##  [3] "total_day_minutes"             "total_day_calls"              
##  [5] "total_day_charge"              "total_eve_minutes"            
##  [7] "total_eve_calls"               "total_eve_charge"             
##  [9] "total_night_minutes"           "total_night_calls"            
## [11] "total_night_charge"            "total_intl_minutes"           
## [13] "total_intl_calls"              "total_intl_charge"            
## [15] "gini"                          "number_customer_service_calls"
## [17] "international_plan"            "voice_mail_plan"              
## [19] "state"                         "area_code"                    
## [21] "churn"

Explore

What are the shapes of the distributions of the numeric variables? Do any of the variables possess extreme outliers? Histograms are a useful way to begin to respond to these questions.

k3.dat %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

All the variables pertaining to customer call activity are normally distributed with no apparent outliers, and no missing values. One exception is total_intl_calls, which is skewed right and displays some gaps in the distribution. number_customer_service_calls also exhibits right skew, but there are no extreme outliers. number_vmail_messages is the “worst-behaved” of the variables here. It appears to exhibit several extreme outliers, while most of its values lie near zero.

hist(k3.dat$number_vmail_messages, main = "Customer Use of Voicemail", xlab = "Number of voicemail messages", ylab = "Number of customers")

Closer inspection reveals a roughly normal distribution centered near 30, and a large number of observations between 0 and 5. Are those also distributed normally?

small_vmails <- k3.dat %>%
  filter(number_vmail_messages <= 5)

hist(small_vmails$number_vmail_messages, main = "Customer Use of Voicemail, msgs <= 5", xlab = "Number of voicemail messages", ylab = "Number of customers")

The large majority of customers do not use voice mail at all.

Examining categorical variables:

k3.dat %>%
  select(area_code) %>%
  table()

## .
## area_code_408 area_code_415 area_code_510 
##          1086          2108          1056

explore_state <- k3.dat %>%
  group_by(state) %>%
  summarise("fraction of customers" = round(n()/nrow(k3.dat),4)) %>%
  arrange(`fraction of customers`)

explore_state[c(1:5,47:51),]

## # A tibble: 10 x 2
##    state `fraction of customers`
##    <fct>                   <dbl>
##  1 CA                     0.0092
##  2 AK                     0.0144
##  3 IA                     0.0146
##  4 GA                     0.0151
##  5 ND                     0.0158
##  6 VA                     0.0235
##  7 AL                     0.0238
##  8 ID                     0.0249
##  9 MN                     0.0254
## 10 WV                     0.0327

Every state is represented in the data, and no state’s customers outnumber any other states by more than a factor of about 3. It’s interesting to note that the most populous state, California, shows the smallest number of customers. The state with the largest number of customers is West Virginia.

Examining the response variable, churn:

k3.dat %>%
  select(churn) %>%
  table()

## .
## FALSE  TRUE 
##  3652   598

There are no missing values, but the class is significantly imbalanced in favor of non-churning customers. I conclude this section of the exploratory analysis by considering some numeric variables, together with churn, in relation to each other. Perhaps there are interesting patterns that will be useful later on.

ggpairs(k3.dat[,c(1:4,15,16,21)])

There may be a relationship between number_customer_service_calls and churn, as well as between number vmail_messages and churn. Otherwise, there do not appear to be any meaningful relationships among any of these variable pairs.

I continue exploring the data set by running some hypothesis tests. In an effort to better understand the customer behavior described by the data, I ask: * Does the mean total_day_charge vary by area_code? * Does the mean value of number_customer_service_calls vary depending on whether a customer carries an international_plan? * Is the number of calls to customer service related to the customer’s total_day_charge?

Does the mean `total_day_charge` vary by `area_code`?

Are there any obvious differences apparent in the data?

k3.dat %>%
  ggplot(aes(x = total_day_charge, y = area_code)) +
  geom_boxplot() +
  labs(title = "Total Daytime Charges by Area Code", x = "Total Daytime Charge ($)", y = "Area Code")

Based on this visualization, there does not appear to be a statistically significant difference in the distribution of total_day_charge across area_codes. Next, I check conditions for inference to prepare for the formal hypothesis test.

Are observations independent within and across groups? Yes. No customer’s total_day_charge has a direct impact on any other customer’s total_day_charge. And no group’s charges has a meaningful direct impact on any other group’s.
Are the data within each group nearly normal?

ggplot(data = k3.dat, aes(x = total_day_charge)) +
  geom_histogram() +
  facet_wrap(~area_code) +
  labs(title = "Total Day Charge by Area Code", x = "Total Day Charge ($)", y = "count")

Yes.

Is the variability within each group approximately equal? Based on the boxplots and histograms, it does appear that the variability within each group is very similar.

The null hypothesis, \(H_{0}\), is that the mean values of total_day_charge for all levels of area_code are equal. The alternative hypothesis, \(H_{A}\), is that at least two mean values of total_day_charge for different levels of area_code are unequal. For this hypothesis test, \(\alpha = 0.05\).

#Compute the point estimate
F_hat <- k3.dat %>%
  specify(total_day_charge ~ area_code) %>%
  calculate(stat = "F")

#Generate the null distribution (the sampling distribution of the test statistic if the null hypothesis is true)
null_f_distn <- k3.dat %>%
  specify(total_day_charge ~ area_code) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

#Get the p-value for your test statistic with respect to the null distribution
null_f_distn %>%
  get_p_value(obs_stat = F_hat, direction = "greater")

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.196

#Visualize the result
visualize(null_f_distn, method = "simulation") + 
  shade_p_value(obs_stat = F_hat, direction = "greater")

For this hypothesis test, \(p > \alpha\), so the null hypothesis is not rejected. There is not sufficient evidence to support the claim that mean day charges vary across area code.

The partially shaded histogram above, “Simulation-Based Null Distribution,” shows the point estimate for the F-statistic compared to the null distribution. If the null hypothesis were true, then about 19.6% of F-statistics calculated from samples like this one would lie at or to the right of the point estimate calculated above.

Does the mean value of `number_customer_service_calls` vary depending on whether a customer carries an `international_plan`?

ggplot(data = k3.dat, aes(x = number_customer_service_calls)) +
  geom_histogram() +
  facet_wrap(~international_plan) +
  labs(title = "Customer Service Calls For Customers With/out Intl Plans", x = "Number of Customer Service Calls", y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that the groups are very unequal:

k3.dat %>%
  group_by(international_plan) %>%
  summarise(n())

## # A tibble: 2 x 2
##   international_plan `n()`
## * <lgl>              <int>
## 1 FALSE               3854
## 2 TRUE                 396

Based on the above histogram, there does not appear to be a statistically significant difference in the distribution of number_customer_service_calls across the two levels of international_plan. Next, I check conditions for inference to prepare for the formal hypothesis test.

Are the data independent within and across groups? The data are independent within groups, because no customer’s calls to customer service have an effect on any other customer’s calls to customer service. The groups are independent of each other because no one’s decision to carry an international plan impacts anyone else’s.
Is the data within each group distributed normally? Both data sets are similarly skewed right, but neither exhibits any extreme outliers, and both contain many more than 30 observations. So the normality condition is sufficiently satisfied.

The null hypothesis, \(H_{0}\), is that the mean values of number_customer_service_calls for both levels of international_plan are equal. The alternative hypothesis, \(H_{A}\), is that they are unequal. For this hypothesis test, \(\alpha = 0.05\).

obs_diff <- k3.dat %>%
  specify(number_customer_service_calls ~ international_plan) %>%
  calculate(stat = "diff in means", order = c(TRUE, FALSE))

null_t_distn <- k3.dat %>%
  specify(number_customer_service_calls ~ international_plan) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c(TRUE, FALSE))

null_t_distn %>%
  get_p_value(obs_stat = obs_diff, direction = "left")

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.306

visualize(null_t_distn, method = "simulation") +
  shade_p_value(obs_stat = obs_diff, direction = "two_sided")

For this hypothesis test, \(p = 0.306\). Since \(p > \alpha\), the null hypothesis is not rejected. There is not sufficient evidence to support the claim that the number of calls a customer makes to customer service varies depending on whether they carry an international plan.

The partially shaded histogram above, “Simulation-Based Null Distribution,” shows the point estimate for the t-statistic compared to the null distribution. If the null hypothesis were true, then about 61% of t-statistics calculated from samples like this one would lie toward the tails of the point estimate calculated above.

Is `number_customer_service_calls` related to the customer’s `total_day_charge`?

This question can be addressed by attempting to fit a linear regression model to the number_customer_service_calls and total_day_charge data.

The most important condition for linear regression is that the data display a linear relationship. Do they?

ggplot(data = k3.dat, mapping = aes(x = total_day_charge, y = number_customer_service_calls)) +
  geom_point() +
  labs(title = "Do Customer Svc Calls Vary Linearly with Day Charges?", x = "Total Day Charges ($)", y = "Number of Customer Service Calls")

This scatterplot also does not show a linear pattern, so these two variables do not satisfy the conditions for regression.

Model: Logistic Regression

Logistic regression is a special type of linear model used when the response variable takes on two or more discrete factors. Here, I use logistic regression to try to predict a customer’s churn value based on their other characteristics. The first step is to assess whether the data satisfy conditions for the model. Then a “default” model is constructed and then refined to improve its simplicity and avoid overfitting. Finally, I assess the model’s performance against the test data, and interpret the results.

Conditions for logistic regression

Logistic regression may be appropriate if each customer’s churn status is independent of each other’s, and if each predictor variable is linearly related to \(\text{logit}(p_{i})\), when other predictors are held constant. Based on our knowledge of customer behavior, no customer’s churn decision has a direct impact on the decision of any other customer, so the first condition is satisfied. After arriving at the final model, a residuals plot can help inform whether the second condition of linearity is met.

While class balance is not considered a condition for fitting a logistic regression, data sets with imbalanced classes can sometimes lead to a model that is weaker in predicting the minority class. The data set as it stands exhibits a strong imbalance in favor of churn == FALSE.

k3.dat %>%
  select(churn) %>%
  table()

## .
## FALSE  TRUE 
##  3652   598

Below, I generate a balanced training set and a test set. To balance the training data, observations from the majority class are omitted until both classes are equal. While this entails ignoring some potentially informative observations, it will hopefully lead to a model that is better able to predict members of both classes.

Split the data 80/20 into train/test sets:

training_rows <- sample(nrow(k3.dat), nrow(k3.dat) * 0.80, replace = FALSE)
k3_train.dat <- k3.dat[training_rows,]
k3_test.dat <- k3.dat[-training_rows,]

tf_table <- k3_train.dat %>%
  select(churn) %>%
  table()

Downsampling the training set involves removing rows for which churn == FALSE until both classes contain the same number of observations, in this case, 470. First, we’ll sample 470 entries where churn == FALSE from the training data. Then we’ll combine these with all the entries from the positive class.

k3_train_negs.dat <- k3_train.dat %>%
  filter(churn == FALSE)

keepers <- sample(nrow(k3_train_negs.dat), tf_table[2], replace = FALSE)
k3_train_negs.dat <- k3_train_negs.dat[keepers,]

k3_train_negs.dat %>%
  select(churn) %>%
  table()

## .
## FALSE 
##   470

k4_train.dat <- k3_train.dat %>%
  filter(churn == TRUE) %>%
  rbind(k3_train_negs.dat)

k4_train.dat %>%
  select(churn) %>%
  table()

## .
## FALSE  TRUE 
##   470   470

Now that the training set is balanced, it’s ready for logistic regression.

k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ .)
summary(k4_train.glm)

## 
## Call:
## glm(formula = churn ~ ., family = binomial(link = "logit"), data = k4_train.dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.76971  -0.72429  -0.03442   0.71163   2.92395  
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -6.606128   5.411118  -1.221  0.22215    
## account_length                  0.002201   0.002213   0.995  0.31991    
## number_vmail_messages           0.056755   0.027134   2.092  0.03647 *  
## total_day_minutes               9.794574   5.388074   1.818  0.06909 .  
## total_day_calls                 0.004704   0.004455   1.056  0.29108    
## total_day_charge              -57.526732  31.693817  -1.815  0.06951 .  
## total_eve_minutes               8.043411   2.666291   3.017  0.00256 ** 
## total_eve_calls                 0.001868   0.004471   0.418  0.67609    
## total_eve_charge              -94.540694  31.367390  -3.014  0.00258 ** 
## total_night_minutes            -2.337528   1.377518  -1.697  0.08971 .  
## total_night_calls              -0.003077   0.004559  -0.675  0.49970    
## total_night_charge             52.023829  30.612205   1.699  0.08923 .  
## total_intl_minutes             -2.247766   8.419649  -0.267  0.78949    
## total_intl_calls               -0.004009   0.035643  -0.112  0.91045    
## total_intl_charge               8.522457  31.187765   0.273  0.78465    
## gini                           -2.558943  10.858083  -0.236  0.81369    
## number_customer_service_calls   0.669806   0.068046   9.843  < 2e-16 ***
## international_planTRUE          2.693637   0.280713   9.596  < 2e-16 ***
## voice_mail_planTRUE            -3.150368   0.873454  -3.607  0.00031 ***
## stateAL                         1.219992   1.036300   1.177  0.23909    
## stateAR                         0.121749   0.906541   0.134  0.89317    
## stateAZ                        -1.009545   1.875777  -0.538  0.59044    
## stateCA                         2.880239   1.516575   1.899  0.05754 .  
## stateCO                        -0.598201   1.792712  -0.334  0.73862    
## stateCT                         0.023930   0.917865   0.026  0.97920    
## stateDC                         0.615978   1.976763   0.312  0.75534    
## stateDE                         0.031190   1.324862   0.024  0.98122    
## stateFL                         0.066972   1.075439   0.062  0.95034    
## stateGA                        -0.632626   1.090895  -0.580  0.56197    
## stateHI                        -1.945094   2.205508  -0.882  0.37782    
## stateIA                         0.745442   3.018458   0.247  0.80494    
## stateID                        -0.132230   1.267892  -0.104  0.91694    
## stateIL                        -0.523394   1.092523  -0.479  0.63189    
## stateIN                        -0.553687   0.737830  -0.750  0.45300    
## stateKS                        -0.428760   1.390104  -0.308  0.75775    
## stateKY                         0.759186   0.730338   1.040  0.29857    
## stateLA                         0.303714   0.798364   0.380  0.70363    
## stateMA                         0.629615   1.659161   0.379  0.70433    
## stateMD                        -0.262028   1.645109  -0.159  0.87345    
## stateME                         0.368467   2.507663   0.147  0.88318    
## stateMI                        -0.071056   0.730989  -0.097  0.92256    
## stateMN                         0.444058   1.475748   0.301  0.76349    
## stateMO                         0.741384   1.093474   0.678  0.49777    
## stateMS                         0.933923   0.873758   1.069  0.28513    
## stateMT                         1.028097   0.719033   1.430  0.15276    
## stateNC                        -0.493945   1.281062  -0.386  0.69981    
## stateND                        -0.770761   0.868640  -0.887  0.37491    
## stateNE                        -0.045074   1.188854  -0.038  0.96976    
## stateNH                         0.500424   2.029657   0.247  0.80525    
## stateNJ                         0.867509   1.667620   0.520  0.60292    
## stateNM                         0.827158   1.969721   0.420  0.67453    
## stateNV                         0.204461   1.280049   0.160  0.87309    
## stateNY                         0.102290   1.748559   0.058  0.95335    
## stateOH                         0.757006   0.768119   0.986  0.32436    
## stateOK                         0.365292   1.427422   0.256  0.79802    
## stateOR                         0.374018   1.766685   0.212  0.83234    
## statePA                         1.101522   1.415224   0.778  0.43637    
## stateRI                        -1.141071   2.386949  -0.478  0.63262    
## stateSC                         0.971614   1.020330   0.952  0.34097    
## stateSD                         0.377213   0.827736   0.456  0.64859    
## stateTN                        -0.144006   0.797764  -0.181  0.85675    
## stateTX                         0.894691   0.744787   1.201  0.22965    
## stateUT                         0.354550   2.009200   0.176  0.85993    
## stateVA                        -0.297560   1.419920  -0.210  0.83401    
## stateVT                         0.077435   0.792252   0.098  0.92214    
## stateWA                         0.670911   2.710610   0.248  0.80451    
## stateWI                        -0.643034   2.207066  -0.291  0.77078    
## stateWV                         0.421116   0.762781   0.552  0.58089    
## stateWY                               NA         NA      NA       NA    
## area_codearea_code_415         -0.143189   0.219758  -0.652  0.51467    
## area_codearea_code_510         -0.170650   0.252308  -0.676  0.49881    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1303.12  on 939  degrees of freedom
## Residual deviance:  857.82  on 870  degrees of freedom
## AIC: 997.82
## 
## Number of Fisher Scoring iterations: 5

The model shows that most variables in the data set are not statistically significant predictors of churn, although some are. Exactly which ones will remain in the final model will be determined by a process of backward induction. Deviance residuals are roughly symmetric, which suggests that the model may be a good fit to the data.

Refining the model

The process of refining the default model involves iteratively removing non-significant predictor variables, and attending to other properties of the model along the way. In particular, reductions in the AIC (Akaike information criterion) are a sign that the model is improving.

The model summary indicates that no state is a statistically significant predictor of churn, so the variable pruning process begins by removing state. Total charges for different types of calls are also dropped, since these variables are just scalar multiples of their corresponding minutes variables.

#drop state and total charges
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_day_calls + total_eve_minutes + total_eve_calls + total_night_minutes + total_night_calls + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan + area_code)
summary(k4_train.glm)

The AIC is significantly reduced, and the model is considerably simpler. Continue removing variables with p values greater than 0.05.

#drop total_night_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_day_calls + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan + area_code)
summary(k4_train.glm)

#drop area_code
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_day_calls + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop total_day_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop gini
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop total_eve_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + total_intl_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop account_length
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + total_intl_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop total_intl_minutes
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop total_intl_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

#drop number_vmail_messages
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ total_day_minutes + total_eve_minutes + total_night_minutes + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)

## 
## Call:
## glm(formula = churn ~ total_day_minutes + total_eve_minutes + 
##     total_night_minutes + number_customer_service_calls + international_plan + 
##     voice_mail_plan, family = binomial(link = "logit"), data = k4_train.dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.55164  -0.78148  -0.03148   0.82985   2.67961  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -6.060681   0.612370  -9.897  < 2e-16 ***
## total_day_minutes              0.014621   0.001472   9.936  < 2e-16 ***
## total_eve_minutes              0.006237   0.001579   3.951 7.79e-05 ***
## total_night_minutes            0.003483   0.001645   2.117   0.0343 *  
## number_customer_service_calls  0.630234   0.061077  10.319  < 2e-16 ***
## international_planTRUE         2.487112   0.255333   9.741  < 2e-16 ***
## voice_mail_planTRUE           -1.356577   0.217122  -6.248 4.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1303.12  on 939  degrees of freedom
## Residual deviance:  933.62  on 933  degrees of freedom
## AIC: 947.62
## 
## Number of Fisher Scoring iterations: 5

The above model is parsimonious, with a small number of statistically significant predictor variables. Deviance residuals remain roughly symmetric, and AIC is lower than for the default model.

The last step in completing the model is to determine the optimal cutoff value for \(\text{logit}(p_{i})\). If the predicted value for an observation falls below the cutoff point, then the predicted class for the observation will be churn == FALSE. The model predicts churn == TRUE for observations with values of \(\text{logit}(p_{i})\) greater than the cutoff value.

In computing an optimal cutoff, one might choose any of a few different objectives. One objective might be to minimize false negative predictions. Customers assigned a false negative prediction are those who did churn, but who were not predicted to do so. Losing these customers is very expensive. The optimal cutoff that minimizes false negatives is the one that maximizes detection of customers likely to churn. The trade-off is that a low cutoff also increases false positives.

Optimal cutoff for churn detection:

k4_train_preds <- predict(k4_train.glm, k3_test.dat, type = "response")
optimal_cutoff <- optimalCutoff(actuals = k3_test.dat$churn, predictedScores = k4_train_preds, optimiseFor = "Ones")
optimal_cutoff

## [1] 0.08565906

Confusion matrix:

k4_train_preds <- ifelse(k4_train_preds >= optimal_cutoff, TRUE, FALSE)
k4_train_preds_table <- table(k3_test.dat$churn, k4_train_preds)
k4_train_preds_table

##        k4_train_preds
##         FALSE TRUE
##   FALSE    95  627
##   TRUE      0  128

With this optimal cut-off criterion, the model correctly detects all 128 customers likely to churn. However, it incurs 627 false positives. Overall accuracy here is 26.2%– much less than what one would obtain by naively assigning the majority class as the prediction for every observation in the test data. However, even with the low accuracy model, we were able to correctly identify 95 true negatives. If a company is considering sending a promotion to all customers to reduce churn, even this identification of 95 true negatives would reduce promotional offers by about 11.2%.

Another optimal cut-off strategy might be to minimize total false predictions.

k4_train_preds <- predict(k4_train.glm, k3_test.dat, type = "response")
optimal_cutoff <- optimalCutoff(actuals = k3_test.dat$churn, predictedScores = k4_train_preds, optimiseFor = "misclasserror")
optimal_cutoff

## [1] 0.9456591

Confusion matrix:

k4_train_preds <- ifelse(k4_train_preds >= optimal_cutoff, TRUE, FALSE)
k4_train_preds_table <- table(k3_test.dat$churn, k4_train_preds)
k4_train_preds_table

##        k4_train_preds
##         FALSE TRUE
##   FALSE   719    3
##   TRUE    114   14

With this criterion, accuracy rises to 86.2%. However, the model detects only 14 customers truly likely to churn, and overlooks 114 of them. Depending on the actual costs of offering promotions and losing customers, this cutoff that maximizes overall accuracy may not be the one that minimizes overall cost of churn.

The ROC curve shows the tradeoff between true positive rate and false positive rate across all cutoff values.

k4_train_preds <- predict(k4_train.glm, k3_test.dat, type = "response")
pred <- ROCR::prediction(k4_train_preds, k3_test.dat$churn)
perf <- ROCR::performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE, main = "ROC curve for logistic regression on churn data")

Residual Analysis

If the logistic regression model is a good fit to the data, what can we expect of the residuals? In a traditional linear regression, an important part of validating the model is verifying that the residuals are randomly and evenly distributed about a mean of zero. In a logistic regression, we cannot expect this. That’s because all the actual values of the response variable are either 0 or 1, and the logistic regression predicts not the value of the response variable, but the log odds that the response variable will be 1.

A graph of the residuals for this logistic regression is:

predprob <- predict(k4_train.glm, type = "response")
resid_analysis <- data.frame("churn" = k4_train.dat$churn, "predprob" = predprob, "rawres" = k4_train.dat$churn - predprob)
ggplot(resid_analysis, aes(x = predprob, y = rawres)) +
  geom_point() +
  labs(title = "Raw Residuals", x = "Predicted Probability", y = "Raw Residuals")

Regardless of the fit of the model, the shape of a residual plot for a logistic regression will appear the same: the top line representing residuals for observations of the positive class, and the bottom line representing residuals for observations of the negative class.

A better visualization of the residuals compares binned residuals to the linear predictor instead of the predicted probability. The linear predictor is the exponent in the logit function, and it’s the linear equation whose coefficients the model estimates.

A typical procedure for assessing residuals of a logistic regression is the binned residuals plot. (What follows relies on Faraway, pp. 34ff.) If the model is a good fit, we should see what we expect from a traditional linear regression: a cloud of points with mean zero.

k5_train.dat <- k4_train.dat %>%
mutate("residuals" = residuals(k4_train.glm), linpred = predict(k4_train.glm))
gdf <- group_by(k5_train.dat, cut(linpred, breaks = unique(quantile(linpred, (1:100)/101))))
diagdf <- summarise(gdf, residuals = mean(residuals), linpred = mean(linpred))
plot(residuals ~ linpred, diagdf, xlab = "linear predictor")

The binned residuals plot suggests the logistic regression is an appropriate model for this data.

Model: Random Forest

Another option for modeling our data is to fit a random forest. A random forest model is a means of testing many different partitions of the data with respect to each of the explanatory variables. As the data are recursively partitioned in the building of a single tree, the cut-point for any particular variable is the one that minimizes the residual sum of squares. The random forest builds many such tree models. It then accepts the majority prediction for a given data point among all the tree models constituting the forest.

The default random forest model for the training set is shown below.

k4_train.for <- randomForest(as.factor(churn) ~ ., k4_train.dat)
print(k4_train.for)

## 
## Call:
##  randomForest(formula = as.factor(churn) ~ ., data = k4_train.dat) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 14.89%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   390   80   0.1702128
## TRUE     60  410   0.1276596

Without any model tuning, the accuracy of the model applied to the training data is about equal to the accuracy of the tuned logistic regression: 84.6%. However, since this measure is based only on the training data and not the test data, it’s not clear what it implies for the model’s effectiveness with unseen data.

Assessing the model using the test data:

k4_test_preds_for <- predict(k4_train.for, k3_test.dat) #predictions on the k3_test.dat data based on the model trained on k4_train.dat.
confusionMatrix(k4_test_preds_for, as.numeric(k3_test.dat$churn)) #A confusion matrix comparing predicted values for k3_test.dat with actual values.

##   FALSE TRUE
## 0   617  105
## 1    18  110

The model performs well against the test data, with an accuracy of almost 85%. This suggests that the default random forest was successful in avoiding overfitting, even though it considered all variables in the original data set.

Tuning the Model

The model can be refined by searching for an optimal value for mtry such that the out-of-bag error rate is minimized. The parameter mtry gives the “number of variables randomly sampled as candidates at each split.”

t <- tuneRF(k4_train.dat[,-21], as.factor(k4_train.dat[,21]), stepFactor = 0.3, plot = TRUE, trace = TRUE, improve = 0.01)

## mtry = 4  OOB error = 16.91% 
## Searching left ...
## mtry = 14    OOB error = 16.49% 
## 0.02515723 0.01

## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## invalid mtry: reset to within valid range

## mtry = 47    OOB error = 16.38% 
## 0.006451613 0.01 
## Searching right ...
## mtry = 1     OOB error = 23.62% 
## -0.4322581 0.01

Based on the plot above, let’s try an mtry parameter of 14 and rerun the forest.

k4_train.for <- randomForest(as.factor(churn) ~ ., k4_train.dat, ntree = 200, mtry = 14, importance = TRUE, proximity = TRUE)
print(k4_train.for)

## 
## Call:
##  randomForest(formula = as.factor(churn) ~ ., data = k4_train.dat,      ntree = 200, mtry = 14, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 15.85%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   386   84   0.1787234
## TRUE     65  405   0.1382979

Applying the new model to the test set:

k4_test_preds_for <- predict(k4_train.for, k3_test.dat) #predictions on the k3_test.dat data based on the model trained on k4_train.dat.
confusionMatrix(k4_test_preds_for, as.numeric(k3_test.dat$churn)) #A confusion matrix comparing predicted values for k3_test.dat with actual values.

##   FALSE TRUE
## 0   603  119
## 1    20  108

Tuning the model by using an optimal value for mtry did not significantly affect accuracy. However, the confusion matrix above is different in important ways from the confusion matrix from the logistic regression. True positives increased significantly, while true negatives decreased significantly Overall, the random forest model is better at identifying customers likely to churn. The cost is that false positives go up as well. Depending on the expected loss of a churning customer, and the cost of retaining that customer, this may be an acceptable tradeoff. Overall, the difference in confusion matrices across models with similar accuracy shows that one needs to consider model properties and business considerations beyond mere accuracy in order to select the best model.

Which variables were most important in the random forest model?

varImpPlot(k4_train.for, main = "Variable Importance", n.var = 5, sort = TRUE)

Note that most of these variables were also statistically significant predictors in the logistic regression model. However, the first variable we eliminated from that model, state, shows up here as having great importance to the random forest model. Why? Is it because some states have a higher proportion of churning customers than others?

explore_state <- k4_train.dat %>%
  group_by(state) %>%
  summarise(n(), sum(churn), fract_churn = sum(churn)/n())
hist(explore_state$fract_churn, main = "Fraction of customers churning across states", xlab = "Fraction of customers churning", ylab = "Number of states")

This histogram does not lend any insight. The distribution of churn across states is approximately normal, with no outliers.

Conclusion

By conducting hypothesis tests and modeling the churn data using logistic regression and random forests, it’s possible to make some valuable predictions of future customer behavior. While neither the logistic regression nor the random forest produced accuracy scores much higher than the overall prevalance of the majority class, each model yielded valuable insights. The logistic regression model can be used to capture some true negatives, preventing a company from spending money on promotions offered to customers unlikely to churn. The random forest model did better at predicting customers likely to churn. Along with this strong ability to predict true positives comes a high false positive rate. If lost customers are very expensive and promotions are inexpensive, then this might be an acceptable tradeoff.

The model yielded other insights as well. For example, a customer’s likelihood to churn is associated with whether they carry an international plan, and the number of times they call customer service. Even if the company is not interested in preventing churn, this churn analysis points out that resolving issues during calls to customer service, as well as competing on their international plan, may be important next steps for this business.

References

Diez, David, Mine Cetinkaya-Rundel, and Christopher Barr. OpenIntro Statistics, 4 ed. 2019. openintro.org/os.

Faraway, Julian J. Extending the Linear Model with R, 2 ed. Boca Raton, FL: CRC Press, 2016.

Kaggle. Customer Churn Prediction 2020. https://www.kaggle.com/c/customer-churn-prediction-2020. Accessed 5/9/2021.

Nwanganga, Fred Chukwuka, and Mike Chapple. Practical Machine Learning in R . 1st edition. Indianapolis: John Wiley and Sons, 2020. Print.

Rai, Bharatendra. “Random Forest in R - Classification and Prediction Example with Definition & Steps.” https://www.youtube.com/watch?v=dJclNIN-TPo&t=1255s. Accessed 5/9/2021.

Wikipedia. “Gini coefficient.” https://en.wikipedia.org/wiki/Gini_coefficient. Accessed 5/9/2021.

Churn Analysis with Logistic Regression and Random Forest

Daniel Moscoe

5/11/2021

Introduction

Obtain

Scrub

Explore

Does the mean `total_day_charge` vary by `area_code`?

Does the mean value of `number_customer_service_calls` vary depending on whether a customer carries an `international_plan`?

Model: Logistic Regression

Conditions for logistic regression

Refining the model

Residual Analysis

Model: Random Forest

Tuning the Model

Conclusion

References

Churn Analysis with Logistic Regression and Random Forest

Daniel Moscoe

5/11/2021

Introduction

Obtain

Scrub

Explore

Does the mean total_day_charge vary by area_code?

Does the mean value of number_customer_service_calls vary depending on whether a customer carries an international_plan?

Is number_customer_service_calls related to the customer’s total_day_charge?

Model: Logistic Regression

Conditions for logistic regression

Refining the model

Residual Analysis

Model: Random Forest

Tuning the Model

Conclusion

References

Does the mean `total_day_charge` vary by `area_code`?

Does the mean value of `number_customer_service_calls` vary depending on whether a customer carries an `international_plan`?

Is `number_customer_service_calls` related to the customer’s `total_day_charge`?