Churn analysis is a fundamental problem in data science. The investigator obtains information on customer behavior and attributes and uses this information to predict whether the customer will terminate a contract, or not. In this study, I conduct a churn analysis based on simulated cell phone customer data from a Kaggle competition, Customer Churn Prediction 2020. I combine this data with information on Google searches pertaining to each of the four major cell phone carriers in the US. The study addresses four main questions:
total_day_charge
for customers vary by area_code
?number_customer_service_calls
vary depending on whether a customer carries an international_plan
?number_customer_service_calls
related to the customer’s total_day_charge
?The study follows the OSEMN workflow. (“OSEMN” stands for Obtain, Scrub, Explore, Model, iNterpret, and it is an osemn/awesome way to structure a data science project.) The main strategy for data exploration, in addition to visualization, is hypothesis testing. In the modeling section, I build a logistic regression as well as a random forest model to predict customer churn. I conclude with a summary of my findings.
This analysis combines two data sources. The first is from the Kaggle competition, Customer Churn Prediction 2020. The second data source contains information on Google searches pertaining to each of the four major cell phone carriers in the US.
library(caret)
library(InformationValue)
library(GGally)
library(gtrendsR)
library(infer)
library(psych)
library(randomForest)
library(ROCR)
library(stats)
library(tidyverse)
set.seed(210509)
k1.dat <- read_csv("https://raw.githubusercontent.com/dmoscoe/SPS/main/churn_train.csv")
str(k1.dat)
## spec_tbl_df [4,250 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ state : chr [1:4250] "OH" "NJ" "OH" "OK" ...
## $ account_length : num [1:4250] 107 137 84 75 121 147 117 141 65 74 ...
## $ area_code : chr [1:4250] "area_code_415" "area_code_415" "area_code_408" "area_code_415" ...
## $ international_plan : chr [1:4250] "no" "no" "yes" "yes" ...
## $ voice_mail_plan : chr [1:4250] "yes" "no" "no" "no" ...
## $ number_vmail_messages : num [1:4250] 26 0 0 0 24 0 0 37 0 0 ...
## $ total_day_minutes : num [1:4250] 162 243 299 167 218 ...
## $ total_day_calls : num [1:4250] 123 114 71 113 88 79 97 84 137 127 ...
## $ total_day_charge : num [1:4250] 27.5 41.4 50.9 28.3 37.1 ...
## $ total_eve_minutes : num [1:4250] 195.5 121.2 61.9 148.3 348.5 ...
## $ total_eve_calls : num [1:4250] 103 110 88 122 108 94 80 111 83 148 ...
## $ total_eve_charge : num [1:4250] 16.62 10.3 5.26 12.61 29.62 ...
## $ total_night_minutes : num [1:4250] 254 163 197 187 213 ...
## $ total_night_calls : num [1:4250] 103 104 89 121 118 96 90 97 111 94 ...
## $ total_night_charge : num [1:4250] 11.45 7.32 8.86 8.41 9.57 ...
## $ total_intl_minutes : num [1:4250] 13.7 12.2 6.6 10.1 7.5 7.1 8.7 11.2 12.7 9.1 ...
## $ total_intl_calls : num [1:4250] 3 5 7 3 7 6 4 5 6 5 ...
## $ total_intl_charge : num [1:4250] 3.7 3.29 1.78 2.73 2.03 1.92 2.35 3.02 3.43 2.46 ...
## $ number_customer_service_calls: num [1:4250] 1 0 2 3 3 0 1 0 4 0 ...
## $ churn : chr [1:4250] "no" "no" "no" "no" ...
## - attr(*, "spec")=
## .. cols(
## .. state = col_character(),
## .. account_length = col_double(),
## .. area_code = col_character(),
## .. international_plan = col_character(),
## .. voice_mail_plan = col_character(),
## .. number_vmail_messages = col_double(),
## .. total_day_minutes = col_double(),
## .. total_day_calls = col_double(),
## .. total_day_charge = col_double(),
## .. total_eve_minutes = col_double(),
## .. total_eve_calls = col_double(),
## .. total_eve_charge = col_double(),
## .. total_night_minutes = col_double(),
## .. total_night_calls = col_double(),
## .. total_night_charge = col_double(),
## .. total_intl_minutes = col_double(),
## .. total_intl_calls = col_double(),
## .. total_intl_charge = col_double(),
## .. number_customer_service_calls = col_double(),
## .. churn = col_character()
## .. )
k1.dat
contains 20 variables with 4,250 rows. Columns 1 through 19 contain information pertaining to customer accounts, and column 20 indicates whether the customer terminated their account (churned).
gtrends_search_terms <- c("att", "at&t", "tmobile", "t-mobile", "us cellular", "u.s. cellular", "verizon", "verizon wireless")
states <- c('US-AL', 'US-AK', 'US-AZ', 'US-AR', 'US-CA', 'US-CO', 'US-CT', 'US-DE', 'US-DC', 'US-FL', 'US-GA', 'US-HI', 'US-ID', 'US-IL', 'US-IN', 'US-IA', 'US-KS', 'US-KY', 'US-LA', 'US-ME', 'US-MD', 'US-MA', 'US-MI', 'US-MN', 'US-MS', 'US-MO', 'US-MT', 'US-NE', 'US-NV', 'US-NH', 'US-NJ', 'US-NM', 'US-NY', 'US-NC', 'US-ND', 'US-OH', 'US-OK', 'US-OR', 'US-PA', 'US-RI', 'US-SC', 'US-SD', 'US-TN', 'US-TX', 'US-UT', 'US-VT', 'US-VA', 'US-WA', 'US-WV', 'US-WI', 'US-WY')
g1.dat <- list()
for (i in seq(1,51)){
g1.dat[i] <- gtrends(keyword = gtrends_search_terms[1:4], geo = states[i], time = "2020-03-01 2020-03-30")
}
for (i in seq(52,102)){
g1.dat[i] <- gtrends(keyword = gtrends_search_terms[5:8], geo = states[i-51], time = "2020-03-01 2020-03-30")
}
str(g1.dat[[1]])
## 'data.frame': 120 obs. of 7 variables:
## $ date : POSIXct, format: "2020-03-01" "2020-03-02" ...
## $ hits : int 44 20 84 34 36 76 32 38 69 59 ...
## $ keyword : chr "att" "att" "att" "att" ...
## $ geo : chr "US-AL" "US-AL" "US-AL" "US-AL" ...
## $ time : chr "2020-03-01 2020-03-30" "2020-03-01 2020-03-30" "2020-03-01 2020-03-30" "2020-03-01 2020-03-30" ...
## $ gprop : chr "web" "web" "web" "web" ...
## $ category: int 0 0 0 0 0 0 0 0 0 0 ...
summary(g1.dat[[1]])
## date hits keyword
## Min. :2020-03-01 00:00:00 Min. : 0.00 Length:120
## 1st Qu.:2020-03-08 00:00:00 1st Qu.: 6.00 Class :character
## Median :2020-03-15 12:00:00 Median : 30.00 Mode :character
## Mean :2020-03-15 12:00:00 Mean : 36.23
## 3rd Qu.:2020-03-23 00:00:00 3rd Qu.: 63.25
## Max. :2020-03-30 00:00:00 Max. :100.00
## geo time gprop category
## Length:120 Length:120 Length:120 Min. :0
## Class :character Class :character Class :character 1st Qu.:0
## Mode :character Mode :character Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
g1.dat
is a list of data frames. Each of the 102 dataframes contained in g1.dat
contains Google Trends information for 4 search terms, 1 state, and 30 days. Since there are 8 search terms, each state is represented by 2 data frames.
In this section, I summarize the Google Trends data for each state, and I combine all the data into a single data frame.
Summary data for k1.dat
shows that some columns are misclassified. For example, read_csv
interprets churn
as a character vector, but it’s better represented as a Boolean. Below I reclassify some of the columns.
k2.dat <- k1.dat %>%
mutate("international_plan" = ifelse(international_plan == "yes", TRUE, FALSE)) %>%
mutate("voice_mail_plan" = ifelse(voice_mail_plan == "yes", TRUE, FALSE)) %>%
mutate("churn" = ifelse(churn == "yes", TRUE, FALSE)) %>%
mutate("state" = as.factor(state)) %>%
mutate("area_code" = as.factor(area_code))
str(k2.dat)
## spec_tbl_df [4,250 x 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 36 32 36 37 20 25 19 50 16 40 ...
## $ account_length : num [1:4250] 107 137 84 75 121 147 117 141 65 74 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 1 2 3 2 1 2 2 2 ...
## $ international_plan : logi [1:4250] FALSE FALSE TRUE TRUE FALSE TRUE ...
## $ voice_mail_plan : logi [1:4250] TRUE FALSE FALSE FALSE TRUE FALSE ...
## $ number_vmail_messages : num [1:4250] 26 0 0 0 24 0 0 37 0 0 ...
## $ total_day_minutes : num [1:4250] 162 243 299 167 218 ...
## $ total_day_calls : num [1:4250] 123 114 71 113 88 79 97 84 137 127 ...
## $ total_day_charge : num [1:4250] 27.5 41.4 50.9 28.3 37.1 ...
## $ total_eve_minutes : num [1:4250] 195.5 121.2 61.9 148.3 348.5 ...
## $ total_eve_calls : num [1:4250] 103 110 88 122 108 94 80 111 83 148 ...
## $ total_eve_charge : num [1:4250] 16.62 10.3 5.26 12.61 29.62 ...
## $ total_night_minutes : num [1:4250] 254 163 197 187 213 ...
## $ total_night_calls : num [1:4250] 103 104 89 121 118 96 90 97 111 94 ...
## $ total_night_charge : num [1:4250] 11.45 7.32 8.86 8.41 9.57 ...
## $ total_intl_minutes : num [1:4250] 13.7 12.2 6.6 10.1 7.5 7.1 8.7 11.2 12.7 9.1 ...
## $ total_intl_calls : num [1:4250] 3 5 7 3 7 6 4 5 6 5 ...
## $ total_intl_charge : num [1:4250] 3.7 3.29 1.78 2.73 2.03 1.92 2.35 3.02 3.43 2.46 ...
## $ number_customer_service_calls: num [1:4250] 1 0 2 3 3 0 1 0 4 0 ...
## $ churn : logi [1:4250] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. state = col_character(),
## .. account_length = col_double(),
## .. area_code = col_character(),
## .. international_plan = col_character(),
## .. voice_mail_plan = col_character(),
## .. number_vmail_messages = col_double(),
## .. total_day_minutes = col_double(),
## .. total_day_calls = col_double(),
## .. total_day_charge = col_double(),
## .. total_eve_minutes = col_double(),
## .. total_eve_calls = col_double(),
## .. total_eve_charge = col_double(),
## .. total_night_minutes = col_double(),
## .. total_night_calls = col_double(),
## .. total_night_charge = col_double(),
## .. total_intl_minutes = col_double(),
## .. total_intl_calls = col_double(),
## .. total_intl_charge = col_double(),
## .. number_customer_service_calls = col_double(),
## .. churn = col_character()
## .. )
The Google Trends data can be transformed into a measure of inequality across search activity within each state. This is motivated by the assumption that there will be greater equality in search activity across wireless providers in states having competitive wireless markets. By contrast, in states where a single wireless provider dominates, the majority of searches will seek the dominant provider. This will result in greater inequality across search terms within low-competition states.
The inequality measure I use is the Gini index. A Gini index is typically used to measure economic inequality. When it’s used to measure income inequality, “A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of 1 (or 100%) expresses maximal inequality among values (e.g., for a large number of people where only one person has all the income or consumption and all others have none…)” (Wikipedia). In the context of this data, a Gini index of 1 means that all queries within a state sought a single wireless provider. An index of 0 means that search queries were equally split across the four providers.
To construct the Gini index for each state, I add together all the queries within a state for a given provider across the entire month of March, 2020, the month immediately preceding the opening of the Kaggle contest. Finally, I construct the Gini coefficient for the state by comparing the March 2020 queries for different providers within the state.
tmp_query_summaries <- data.frame("geo" = "x", "att" = -1, "tmobile" = -1, "uscell" = -1, "verizon" = -1)
for (i in seq(1,51)) {
tmp_queries_in_state <- g1.dat[[i]] %>%
mutate("hits" = as.numeric(ifelse(hits == "<1", 0, hits))) %>%
group_by(keyword) %>%
summarise(sum(hits))
tmp_query_summaries <- rbind(tmp_query_summaries, c("geo" = states[i], "att" = sum(tmp_queries_in_state[1,2], tmp_queries_in_state[2,2]), "tmobile" = sum(tmp_queries_in_state[3,2], tmp_queries_in_state[4,2]), "uscell" = NA, "verizon" = NA))
}
tmp_query_summaries <- tmp_query_summaries %>%
filter(att >= 0)
for (i in seq(52,102)) {
tmp_queries_in_state <- g1.dat[[i]] %>%
mutate("hits" = as.numeric(ifelse(hits == "<1", 0, hits))) %>%
group_by(keyword) %>%
summarise(sum(hits))
tmp_query_summaries[i-51,4] <- sum(tmp_queries_in_state[1,2], tmp_queries_in_state[2,2])
tmp_query_summaries[i-51,5] <- sum(tmp_queries_in_state[3,2], tmp_queries_in_state[4,2])
}
tmp_query_summaries
is a relative measure of search activity for each carrier for each state for March 2020.
head(tmp_query_summaries)
## geo att tmobile uscell verizon
## 1 US-AL 3740 608 17 2443
## 2 US-AK 2143 112 0 954
## 3 US-AZ 2978 1799 5 2497
## 4 US-AR 2800 245 7 2126
## 5 US-CA 4172 1385 27 2772
## 6 US-CO 2687 1546 10 2224
Next, a Gini index is computed for each state from this data, and the variable gini
is added to k2.dat
.
ginis <- data.frame("geo" = "x", "gini" = "-1")
for(i in seq(nrow(tmp_query_summaries))) {
tmp <- sort(as.integer(tmp_query_summaries[i,2:5]))
tmp.gini <- 1 - ((1/sum(tmp)) * (1.75 * tmp[1] + 1.25 * tmp[2] + 0.75 * tmp[3] + 0.25 * tmp[4]))
ginis <- rbind(ginis, c(tmp_query_summaries[i,1], round(tmp.gini, 4)))
}
ginis <- ginis %>%
filter(gini >= 0)
ginis <- ginis %>%
mutate("tmp" = str_sub(geo, -2)) %>%
select(3,2) %>%
rename("state" = tmp)
k3.dat <- left_join(k2.dat, ginis, by = "state") %>%
select(2,6:18,21,19,4,5,1,3,20) %>%
transform(gini = as.numeric(gini)) %>%
transform(state = as.factor(state))
colnames(k3.dat)
## [1] "account_length" "number_vmail_messages"
## [3] "total_day_minutes" "total_day_calls"
## [5] "total_day_charge" "total_eve_minutes"
## [7] "total_eve_calls" "total_eve_charge"
## [9] "total_night_minutes" "total_night_calls"
## [11] "total_night_charge" "total_intl_minutes"
## [13] "total_intl_calls" "total_intl_charge"
## [15] "gini" "number_customer_service_calls"
## [17] "international_plan" "voice_mail_plan"
## [19] "state" "area_code"
## [21] "churn"
What are the shapes of the distributions of the numeric variables? Do any of the variables possess extreme outliers? Histograms are a useful way to begin to respond to these questions.
k3.dat %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
All the variables pertaining to customer call activity are normally distributed with no apparent outliers, and no missing values. One exception is total_intl_calls
, which is skewed right and displays some gaps in the distribution. number_customer_service_calls
also exhibits right skew, but there are no extreme outliers. number_vmail_messages
is the “worst-behaved” of the variables here. It appears to exhibit several extreme outliers, while most of its values lie near zero.
hist(k3.dat$number_vmail_messages, main = "Customer Use of Voicemail", xlab = "Number of voicemail messages", ylab = "Number of customers")
Closer inspection reveals a roughly normal distribution centered near 30, and a large number of observations between 0 and 5. Are those also distributed normally?
small_vmails <- k3.dat %>%
filter(number_vmail_messages <= 5)
hist(small_vmails$number_vmail_messages, main = "Customer Use of Voicemail, msgs <= 5", xlab = "Number of voicemail messages", ylab = "Number of customers")
The large majority of customers do not use voice mail at all.
Examining categorical variables:
k3.dat %>%
select(area_code) %>%
table()
## .
## area_code_408 area_code_415 area_code_510
## 1086 2108 1056
explore_state <- k3.dat %>%
group_by(state) %>%
summarise("fraction of customers" = round(n()/nrow(k3.dat),4)) %>%
arrange(`fraction of customers`)
explore_state[c(1:5,47:51),]
## # A tibble: 10 x 2
## state `fraction of customers`
## <fct> <dbl>
## 1 CA 0.0092
## 2 AK 0.0144
## 3 IA 0.0146
## 4 GA 0.0151
## 5 ND 0.0158
## 6 VA 0.0235
## 7 AL 0.0238
## 8 ID 0.0249
## 9 MN 0.0254
## 10 WV 0.0327
Every state is represented in the data, and no state’s customers outnumber any other states by more than a factor of about 3. It’s interesting to note that the most populous state, California, shows the smallest number of customers. The state with the largest number of customers is West Virginia.
Examining the response variable, churn
:
k3.dat %>%
select(churn) %>%
table()
## .
## FALSE TRUE
## 3652 598
There are no missing values, but the class is significantly imbalanced in favor of non-churning customers. I conclude this section of the exploratory analysis by considering some numeric variables, together with churn
, in relation to each other. Perhaps there are interesting patterns that will be useful later on.
ggpairs(k3.dat[,c(1:4,15,16,21)])
There may be a relationship between number_customer_service_calls
and churn
, as well as between number vmail_messages
and churn. Otherwise, there do not appear to be any meaningful relationships among any of these variable pairs.
I continue exploring the data set by running some hypothesis tests. In an effort to better understand the customer behavior described by the data, I ask: * Does the mean total_day_charge
vary by area_code
? * Does the mean value of number_customer_service_calls
vary depending on whether a customer carries an international_plan
? * Is the number of calls to customer service related to the customer’s total_day_charge
?
total_day_charge
vary by area_code
?Are there any obvious differences apparent in the data?
k3.dat %>%
ggplot(aes(x = total_day_charge, y = area_code)) +
geom_boxplot() +
labs(title = "Total Daytime Charges by Area Code", x = "Total Daytime Charge ($)", y = "Area Code")
Based on this visualization, there does not appear to be a statistically significant difference in the distribution of total_day_charge
across area_code
s. Next, I check conditions for inference to prepare for the formal hypothesis test.
Are observations independent within and across groups? Yes. No customer’s total_day_charge
has a direct impact on any other customer’s total_day_charge
. And no group’s charges has a meaningful direct impact on any other group’s.
Are the data within each group nearly normal?
ggplot(data = k3.dat, aes(x = total_day_charge)) +
geom_histogram() +
facet_wrap(~area_code) +
labs(title = "Total Day Charge by Area Code", x = "Total Day Charge ($)", y = "count")
Yes.
The null hypothesis, \(H_{0}\), is that the mean values of total_day_charge
for all levels of area_code
are equal. The alternative hypothesis, \(H_{A}\), is that at least two mean values of total_day_charge
for different levels of area_code
are unequal. For this hypothesis test, \(\alpha = 0.05\).
#Compute the point estimate
F_hat <- k3.dat %>%
specify(total_day_charge ~ area_code) %>%
calculate(stat = "F")
#Generate the null distribution (the sampling distribution of the test statistic if the null hypothesis is true)
null_f_distn <- k3.dat %>%
specify(total_day_charge ~ area_code) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
#Get the p-value for your test statistic with respect to the null distribution
null_f_distn %>%
get_p_value(obs_stat = F_hat, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.196
#Visualize the result
visualize(null_f_distn, method = "simulation") +
shade_p_value(obs_stat = F_hat, direction = "greater")
For this hypothesis test, \(p > \alpha\), so the null hypothesis is not rejected. There is not sufficient evidence to support the claim that mean day charges vary across area code.
The partially shaded histogram above, “Simulation-Based Null Distribution,” shows the point estimate for the F-statistic compared to the null distribution. If the null hypothesis were true, then about 19.6% of F-statistics calculated from samples like this one would lie at or to the right of the point estimate calculated above.
number_customer_service_calls
vary depending on whether a customer carries an international_plan
?ggplot(data = k3.dat, aes(x = number_customer_service_calls)) +
geom_histogram() +
facet_wrap(~international_plan) +
labs(title = "Customer Service Calls For Customers With/out Intl Plans", x = "Number of Customer Service Calls", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that the groups are very unequal:
k3.dat %>%
group_by(international_plan) %>%
summarise(n())
## # A tibble: 2 x 2
## international_plan `n()`
## * <lgl> <int>
## 1 FALSE 3854
## 2 TRUE 396
Based on the above histogram, there does not appear to be a statistically significant difference in the distribution of number_customer_service_calls
across the two levels of international_plan
. Next, I check conditions for inference to prepare for the formal hypothesis test.
Are the data independent within and across groups? The data are independent within groups, because no customer’s calls to customer service have an effect on any other customer’s calls to customer service. The groups are independent of each other because no one’s decision to carry an international plan impacts anyone else’s.
Is the data within each group distributed normally? Both data sets are similarly skewed right, but neither exhibits any extreme outliers, and both contain many more than 30 observations. So the normality condition is sufficiently satisfied.
The null hypothesis, \(H_{0}\), is that the mean values of number_customer_service_calls
for both levels of international_plan
are equal. The alternative hypothesis, \(H_{A}\), is that they are unequal. For this hypothesis test, \(\alpha = 0.05\).
obs_diff <- k3.dat %>%
specify(number_customer_service_calls ~ international_plan) %>%
calculate(stat = "diff in means", order = c(TRUE, FALSE))
null_t_distn <- k3.dat %>%
specify(number_customer_service_calls ~ international_plan) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c(TRUE, FALSE))
null_t_distn %>%
get_p_value(obs_stat = obs_diff, direction = "left")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.306
visualize(null_t_distn, method = "simulation") +
shade_p_value(obs_stat = obs_diff, direction = "two_sided")
For this hypothesis test, \(p = 0.306\). Since \(p > \alpha\), the null hypothesis is not rejected. There is not sufficient evidence to support the claim that the number of calls a customer makes to customer service varies depending on whether they carry an international plan.
The partially shaded histogram above, “Simulation-Based Null Distribution,” shows the point estimate for the t-statistic compared to the null distribution. If the null hypothesis were true, then about 61% of t-statistics calculated from samples like this one would lie toward the tails of the point estimate calculated above.
Logistic regression is a special type of linear model used when the response variable takes on two or more discrete factors. Here, I use logistic regression to try to predict a customer’s churn
value based on their other characteristics. The first step is to assess whether the data satisfy conditions for the model. Then a “default” model is constructed and then refined to improve its simplicity and avoid overfitting. Finally, I assess the model’s performance against the test data, and interpret the results.
Logistic regression may be appropriate if each customer’s churn status is independent of each other’s, and if each predictor variable is linearly related to \(\text{logit}(p_{i})\), when other predictors are held constant. Based on our knowledge of customer behavior, no customer’s churn decision has a direct impact on the decision of any other customer, so the first condition is satisfied. After arriving at the final model, a residuals plot can help inform whether the second condition of linearity is met.
While class balance is not considered a condition for fitting a logistic regression, data sets with imbalanced classes can sometimes lead to a model that is weaker in predicting the minority class. The data set as it stands exhibits a strong imbalance in favor of churn == FALSE
.
k3.dat %>%
select(churn) %>%
table()
## .
## FALSE TRUE
## 3652 598
Below, I generate a balanced training set and a test set. To balance the training data, observations from the majority class are omitted until both classes are equal. While this entails ignoring some potentially informative observations, it will hopefully lead to a model that is better able to predict members of both classes.
Split the data 80/20 into train/test sets:
training_rows <- sample(nrow(k3.dat), nrow(k3.dat) * 0.80, replace = FALSE)
k3_train.dat <- k3.dat[training_rows,]
k3_test.dat <- k3.dat[-training_rows,]
tf_table <- k3_train.dat %>%
select(churn) %>%
table()
Downsampling the training set involves removing rows for which churn == FALSE
until both classes contain the same number of observations, in this case, 470. First, we’ll sample 470 entries where churn == FALSE
from the training data. Then we’ll combine these with all the entries from the positive class.
k3_train_negs.dat <- k3_train.dat %>%
filter(churn == FALSE)
keepers <- sample(nrow(k3_train_negs.dat), tf_table[2], replace = FALSE)
k3_train_negs.dat <- k3_train_negs.dat[keepers,]
k3_train_negs.dat %>%
select(churn) %>%
table()
## .
## FALSE
## 470
k4_train.dat <- k3_train.dat %>%
filter(churn == TRUE) %>%
rbind(k3_train_negs.dat)
k4_train.dat %>%
select(churn) %>%
table()
## .
## FALSE TRUE
## 470 470
Now that the training set is balanced, it’s ready for logistic regression.
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ .)
summary(k4_train.glm)
##
## Call:
## glm(formula = churn ~ ., family = binomial(link = "logit"), data = k4_train.dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.76971 -0.72429 -0.03442 0.71163 2.92395
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.606128 5.411118 -1.221 0.22215
## account_length 0.002201 0.002213 0.995 0.31991
## number_vmail_messages 0.056755 0.027134 2.092 0.03647 *
## total_day_minutes 9.794574 5.388074 1.818 0.06909 .
## total_day_calls 0.004704 0.004455 1.056 0.29108
## total_day_charge -57.526732 31.693817 -1.815 0.06951 .
## total_eve_minutes 8.043411 2.666291 3.017 0.00256 **
## total_eve_calls 0.001868 0.004471 0.418 0.67609
## total_eve_charge -94.540694 31.367390 -3.014 0.00258 **
## total_night_minutes -2.337528 1.377518 -1.697 0.08971 .
## total_night_calls -0.003077 0.004559 -0.675 0.49970
## total_night_charge 52.023829 30.612205 1.699 0.08923 .
## total_intl_minutes -2.247766 8.419649 -0.267 0.78949
## total_intl_calls -0.004009 0.035643 -0.112 0.91045
## total_intl_charge 8.522457 31.187765 0.273 0.78465
## gini -2.558943 10.858083 -0.236 0.81369
## number_customer_service_calls 0.669806 0.068046 9.843 < 2e-16 ***
## international_planTRUE 2.693637 0.280713 9.596 < 2e-16 ***
## voice_mail_planTRUE -3.150368 0.873454 -3.607 0.00031 ***
## stateAL 1.219992 1.036300 1.177 0.23909
## stateAR 0.121749 0.906541 0.134 0.89317
## stateAZ -1.009545 1.875777 -0.538 0.59044
## stateCA 2.880239 1.516575 1.899 0.05754 .
## stateCO -0.598201 1.792712 -0.334 0.73862
## stateCT 0.023930 0.917865 0.026 0.97920
## stateDC 0.615978 1.976763 0.312 0.75534
## stateDE 0.031190 1.324862 0.024 0.98122
## stateFL 0.066972 1.075439 0.062 0.95034
## stateGA -0.632626 1.090895 -0.580 0.56197
## stateHI -1.945094 2.205508 -0.882 0.37782
## stateIA 0.745442 3.018458 0.247 0.80494
## stateID -0.132230 1.267892 -0.104 0.91694
## stateIL -0.523394 1.092523 -0.479 0.63189
## stateIN -0.553687 0.737830 -0.750 0.45300
## stateKS -0.428760 1.390104 -0.308 0.75775
## stateKY 0.759186 0.730338 1.040 0.29857
## stateLA 0.303714 0.798364 0.380 0.70363
## stateMA 0.629615 1.659161 0.379 0.70433
## stateMD -0.262028 1.645109 -0.159 0.87345
## stateME 0.368467 2.507663 0.147 0.88318
## stateMI -0.071056 0.730989 -0.097 0.92256
## stateMN 0.444058 1.475748 0.301 0.76349
## stateMO 0.741384 1.093474 0.678 0.49777
## stateMS 0.933923 0.873758 1.069 0.28513
## stateMT 1.028097 0.719033 1.430 0.15276
## stateNC -0.493945 1.281062 -0.386 0.69981
## stateND -0.770761 0.868640 -0.887 0.37491
## stateNE -0.045074 1.188854 -0.038 0.96976
## stateNH 0.500424 2.029657 0.247 0.80525
## stateNJ 0.867509 1.667620 0.520 0.60292
## stateNM 0.827158 1.969721 0.420 0.67453
## stateNV 0.204461 1.280049 0.160 0.87309
## stateNY 0.102290 1.748559 0.058 0.95335
## stateOH 0.757006 0.768119 0.986 0.32436
## stateOK 0.365292 1.427422 0.256 0.79802
## stateOR 0.374018 1.766685 0.212 0.83234
## statePA 1.101522 1.415224 0.778 0.43637
## stateRI -1.141071 2.386949 -0.478 0.63262
## stateSC 0.971614 1.020330 0.952 0.34097
## stateSD 0.377213 0.827736 0.456 0.64859
## stateTN -0.144006 0.797764 -0.181 0.85675
## stateTX 0.894691 0.744787 1.201 0.22965
## stateUT 0.354550 2.009200 0.176 0.85993
## stateVA -0.297560 1.419920 -0.210 0.83401
## stateVT 0.077435 0.792252 0.098 0.92214
## stateWA 0.670911 2.710610 0.248 0.80451
## stateWI -0.643034 2.207066 -0.291 0.77078
## stateWV 0.421116 0.762781 0.552 0.58089
## stateWY NA NA NA NA
## area_codearea_code_415 -0.143189 0.219758 -0.652 0.51467
## area_codearea_code_510 -0.170650 0.252308 -0.676 0.49881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1303.12 on 939 degrees of freedom
## Residual deviance: 857.82 on 870 degrees of freedom
## AIC: 997.82
##
## Number of Fisher Scoring iterations: 5
The model shows that most variables in the data set are not statistically significant predictors of churn
, although some are. Exactly which ones will remain in the final model will be determined by a process of backward induction. Deviance residuals are roughly symmetric, which suggests that the model may be a good fit to the data.
The process of refining the default model involves iteratively removing non-significant predictor variables, and attending to other properties of the model along the way. In particular, reductions in the AIC (Akaike information criterion) are a sign that the model is improving.
The model summary indicates that no state is a statistically significant predictor of churn
, so the variable pruning process begins by removing state
. Total charges for different types of calls are also dropped, since these variables are just scalar multiples of their corresponding minutes
variables.
#drop state and total charges
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_day_calls + total_eve_minutes + total_eve_calls + total_night_minutes + total_night_calls + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan + area_code)
summary(k4_train.glm)
The AIC is significantly reduced, and the model is considerably simpler. Continue removing variables with p values greater than 0.05.
#drop total_night_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_day_calls + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan + area_code)
summary(k4_train.glm)
#drop area_code
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_day_calls + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop total_day_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + gini + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop gini
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_eve_minutes + total_eve_calls + total_night_minutes + total_intl_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop total_eve_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ account_length + number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + total_intl_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop account_length
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + total_intl_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop total_intl_minutes
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + total_intl_calls + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop total_intl_calls
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ number_vmail_messages + total_day_minutes + total_eve_minutes + total_night_minutes + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
#drop number_vmail_messages
k4_train.glm <- glm(data = k4_train.dat, family = binomial(link = "logit"), formula = churn ~ total_day_minutes + total_eve_minutes + total_night_minutes + number_customer_service_calls + international_plan + voice_mail_plan)
summary(k4_train.glm)
##
## Call:
## glm(formula = churn ~ total_day_minutes + total_eve_minutes +
## total_night_minutes + number_customer_service_calls + international_plan +
## voice_mail_plan, family = binomial(link = "logit"), data = k4_train.dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.55164 -0.78148 -0.03148 0.82985 2.67961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.060681 0.612370 -9.897 < 2e-16 ***
## total_day_minutes 0.014621 0.001472 9.936 < 2e-16 ***
## total_eve_minutes 0.006237 0.001579 3.951 7.79e-05 ***
## total_night_minutes 0.003483 0.001645 2.117 0.0343 *
## number_customer_service_calls 0.630234 0.061077 10.319 < 2e-16 ***
## international_planTRUE 2.487112 0.255333 9.741 < 2e-16 ***
## voice_mail_planTRUE -1.356577 0.217122 -6.248 4.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1303.12 on 939 degrees of freedom
## Residual deviance: 933.62 on 933 degrees of freedom
## AIC: 947.62
##
## Number of Fisher Scoring iterations: 5
The above model is parsimonious, with a small number of statistically significant predictor variables. Deviance residuals remain roughly symmetric, and AIC is lower than for the default model.
The last step in completing the model is to determine the optimal cutoff value for \(\text{logit}(p_{i})\). If the predicted value for an observation falls below the cutoff point, then the predicted class for the observation will be churn == FALSE
. The model predicts churn == TRUE
for observations with values of \(\text{logit}(p_{i})\) greater than the cutoff value.
In computing an optimal cutoff, one might choose any of a few different objectives. One objective might be to minimize false negative predictions. Customers assigned a false negative prediction are those who did churn, but who were not predicted to do so. Losing these customers is very expensive. The optimal cutoff that minimizes false negatives is the one that maximizes detection of customers likely to churn. The trade-off is that a low cutoff also increases false positives.
Optimal cutoff for churn detection:
k4_train_preds <- predict(k4_train.glm, k3_test.dat, type = "response")
optimal_cutoff <- optimalCutoff(actuals = k3_test.dat$churn, predictedScores = k4_train_preds, optimiseFor = "Ones")
optimal_cutoff
## [1] 0.08565906
Confusion matrix:
k4_train_preds <- ifelse(k4_train_preds >= optimal_cutoff, TRUE, FALSE)
k4_train_preds_table <- table(k3_test.dat$churn, k4_train_preds)
k4_train_preds_table
## k4_train_preds
## FALSE TRUE
## FALSE 95 627
## TRUE 0 128
With this optimal cut-off criterion, the model correctly detects all 128 customers likely to churn. However, it incurs 627 false positives. Overall accuracy here is 26.2%– much less than what one would obtain by naively assigning the majority class as the prediction for every observation in the test data. However, even with the low accuracy model, we were able to correctly identify 95 true negatives. If a company is considering sending a promotion to all customers to reduce churn, even this identification of 95 true negatives would reduce promotional offers by about 11.2%.
Another optimal cut-off strategy might be to minimize total false predictions.
k4_train_preds <- predict(k4_train.glm, k3_test.dat, type = "response")
optimal_cutoff <- optimalCutoff(actuals = k3_test.dat$churn, predictedScores = k4_train_preds, optimiseFor = "misclasserror")
optimal_cutoff
## [1] 0.9456591
Confusion matrix:
k4_train_preds <- ifelse(k4_train_preds >= optimal_cutoff, TRUE, FALSE)
k4_train_preds_table <- table(k3_test.dat$churn, k4_train_preds)
k4_train_preds_table
## k4_train_preds
## FALSE TRUE
## FALSE 719 3
## TRUE 114 14
With this criterion, accuracy rises to 86.2%. However, the model detects only 14 customers truly likely to churn, and overlooks 114 of them. Depending on the actual costs of offering promotions and losing customers, this cutoff that maximizes overall accuracy may not be the one that minimizes overall cost of churn.
The ROC curve shows the tradeoff between true positive rate and false positive rate across all cutoff values.
k4_train_preds <- predict(k4_train.glm, k3_test.dat, type = "response")
pred <- ROCR::prediction(k4_train_preds, k3_test.dat$churn)
perf <- ROCR::performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE, main = "ROC curve for logistic regression on churn data")
If the logistic regression model is a good fit to the data, what can we expect of the residuals? In a traditional linear regression, an important part of validating the model is verifying that the residuals are randomly and evenly distributed about a mean of zero. In a logistic regression, we cannot expect this. That’s because all the actual values of the response variable are either 0 or 1, and the logistic regression predicts not the value of the response variable, but the log odds that the response variable will be 1.
A graph of the residuals for this logistic regression is:
predprob <- predict(k4_train.glm, type = "response")
resid_analysis <- data.frame("churn" = k4_train.dat$churn, "predprob" = predprob, "rawres" = k4_train.dat$churn - predprob)
ggplot(resid_analysis, aes(x = predprob, y = rawres)) +
geom_point() +
labs(title = "Raw Residuals", x = "Predicted Probability", y = "Raw Residuals")
Regardless of the fit of the model, the shape of a residual plot for a logistic regression will appear the same: the top line representing residuals for observations of the positive class, and the bottom line representing residuals for observations of the negative class.
A better visualization of the residuals compares binned residuals to the linear predictor instead of the predicted probability. The linear predictor is the exponent in the logit function, and it’s the linear equation whose coefficients the model estimates.
A typical procedure for assessing residuals of a logistic regression is the binned residuals plot. (What follows relies on Faraway, pp. 34ff.) If the model is a good fit, we should see what we expect from a traditional linear regression: a cloud of points with mean zero.
k5_train.dat <- k4_train.dat %>%
mutate("residuals" = residuals(k4_train.glm), linpred = predict(k4_train.glm))
gdf <- group_by(k5_train.dat, cut(linpred, breaks = unique(quantile(linpred, (1:100)/101))))
diagdf <- summarise(gdf, residuals = mean(residuals), linpred = mean(linpred))
plot(residuals ~ linpred, diagdf, xlab = "linear predictor")
The binned residuals plot suggests the logistic regression is an appropriate model for this data.
Another option for modeling our data is to fit a random forest. A random forest model is a means of testing many different partitions of the data with respect to each of the explanatory variables. As the data are recursively partitioned in the building of a single tree, the cut-point for any particular variable is the one that minimizes the residual sum of squares. The random forest builds many such tree models. It then accepts the majority prediction for a given data point among all the tree models constituting the forest.
The default random forest model for the training set is shown below.
k4_train.for <- randomForest(as.factor(churn) ~ ., k4_train.dat)
print(k4_train.for)
##
## Call:
## randomForest(formula = as.factor(churn) ~ ., data = k4_train.dat)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 14.89%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 390 80 0.1702128
## TRUE 60 410 0.1276596
Without any model tuning, the accuracy of the model applied to the training data is about equal to the accuracy of the tuned logistic regression: 84.6%. However, since this measure is based only on the training data and not the test data, it’s not clear what it implies for the model’s effectiveness with unseen data.
Assessing the model using the test data:
k4_test_preds_for <- predict(k4_train.for, k3_test.dat) #predictions on the k3_test.dat data based on the model trained on k4_train.dat.
confusionMatrix(k4_test_preds_for, as.numeric(k3_test.dat$churn)) #A confusion matrix comparing predicted values for k3_test.dat with actual values.
## FALSE TRUE
## 0 617 105
## 1 18 110
The model performs well against the test data, with an accuracy of almost 85%. This suggests that the default random forest was successful in avoiding overfitting, even though it considered all variables in the original data set.
The model can be refined by searching for an optimal value for mtry
such that the out-of-bag error rate is minimized. The parameter mtry
gives the “number of variables randomly sampled as candidates at each split.”
t <- tuneRF(k4_train.dat[,-21], as.factor(k4_train.dat[,21]), stepFactor = 0.3, plot = TRUE, trace = TRUE, improve = 0.01)
## mtry = 4 OOB error = 16.91%
## Searching left ...
## mtry = 14 OOB error = 16.49%
## 0.02515723 0.01
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## invalid mtry: reset to within valid range
## mtry = 47 OOB error = 16.38%
## 0.006451613 0.01
## Searching right ...
## mtry = 1 OOB error = 23.62%
## -0.4322581 0.01
Based on the plot above, let’s try an mtry
parameter of 14 and rerun the forest.
k4_train.for <- randomForest(as.factor(churn) ~ ., k4_train.dat, ntree = 200, mtry = 14, importance = TRUE, proximity = TRUE)
print(k4_train.for)
##
## Call:
## randomForest(formula = as.factor(churn) ~ ., data = k4_train.dat, ntree = 200, mtry = 14, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 15.85%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 386 84 0.1787234
## TRUE 65 405 0.1382979
Applying the new model to the test set:
k4_test_preds_for <- predict(k4_train.for, k3_test.dat) #predictions on the k3_test.dat data based on the model trained on k4_train.dat.
confusionMatrix(k4_test_preds_for, as.numeric(k3_test.dat$churn)) #A confusion matrix comparing predicted values for k3_test.dat with actual values.
## FALSE TRUE
## 0 603 119
## 1 20 108
Tuning the model by using an optimal value for mtry
did not significantly affect accuracy. However, the confusion matrix above is different in important ways from the confusion matrix from the logistic regression. True positives increased significantly, while true negatives decreased significantly Overall, the random forest model is better at identifying customers likely to churn. The cost is that false positives go up as well. Depending on the expected loss of a churning customer, and the cost of retaining that customer, this may be an acceptable tradeoff. Overall, the difference in confusion matrices across models with similar accuracy shows that one needs to consider model properties and business considerations beyond mere accuracy in order to select the best model.
Which variables were most important in the random forest model?
varImpPlot(k4_train.for, main = "Variable Importance", n.var = 5, sort = TRUE)
Note that most of these variables were also statistically significant predictors in the logistic regression model. However, the first variable we eliminated from that model, state
, shows up here as having great importance to the random forest model. Why? Is it because some states have a higher proportion of churning customers than others?
explore_state <- k4_train.dat %>%
group_by(state) %>%
summarise(n(), sum(churn), fract_churn = sum(churn)/n())
hist(explore_state$fract_churn, main = "Fraction of customers churning across states", xlab = "Fraction of customers churning", ylab = "Number of states")
This histogram does not lend any insight. The distribution of churn across states is approximately normal, with no outliers.
By conducting hypothesis tests and modeling the churn data using logistic regression and random forests, it’s possible to make some valuable predictions of future customer behavior. While neither the logistic regression nor the random forest produced accuracy scores much higher than the overall prevalance of the majority class, each model yielded valuable insights. The logistic regression model can be used to capture some true negatives, preventing a company from spending money on promotions offered to customers unlikely to churn. The random forest model did better at predicting customers likely to churn. Along with this strong ability to predict true positives comes a high false positive rate. If lost customers are very expensive and promotions are inexpensive, then this might be an acceptable tradeoff.
The model yielded other insights as well. For example, a customer’s likelihood to churn is associated with whether they carry an international plan, and the number of times they call customer service. Even if the company is not interested in preventing churn, this churn analysis points out that resolving issues during calls to customer service, as well as competing on their international plan, may be important next steps for this business.
Diez, David, Mine Cetinkaya-Rundel, and Christopher Barr. OpenIntro Statistics, 4 ed. 2019. openintro.org/os.
Faraway, Julian J. Extending the Linear Model with R, 2 ed. Boca Raton, FL: CRC Press, 2016.
Kaggle. Customer Churn Prediction 2020. https://www.kaggle.com/c/customer-churn-prediction-2020. Accessed 5/9/2021.
Nwanganga, Fred Chukwuka, and Mike Chapple. Practical Machine Learning in R . 1st edition. Indianapolis: John Wiley and Sons, 2020. Print.
Rai, Bharatendra. “Random Forest in R - Classification and Prediction Example with Definition & Steps.” https://www.youtube.com/watch?v=dJclNIN-TPo&t=1255s. Accessed 5/9/2021.
Wikipedia. “Gini coefficient.” https://en.wikipedia.org/wiki/Gini_coefficient. Accessed 5/9/2021.