We can shortly define customer churn (most commonly called “churn”) as customers that stop doing business with a company or a service. There are customer churns in different business area.
最常见的就是我们更换手机的运营商。
用数据分析工具识别那些客户容易流失,他们有什么特征。重点关照这些客户,给他们定制广告,优惠等留住他们。因为留住一个客户比开发一个新客户容易地多,便宜地多。
It’s a trendy topic in customer relationship management (CRM) departments because it costs more money to find new customers than keeping the existing ones. So companies want to prevent them to leave.
简单说来,我们需要以往的消费者数据,其中有人是流失了的,来建立模型,通常是logistic regression,然后将该模型应用到新的数据里面去预测其他客户流失的概率。设定一个标准,据此标准判断哪些客户容易流失。
To identify the customers, we need to have a database with data about the previous customers that churned. Using this data, we develop a model which identifies customers that have a profile close to the ones that already left. To simulate an experiment where we want to predict if our customers will churn, we need to work with a partitioned database. The database has 2 parts, one part will be the training set. This will be used to create the model. The second part will be the testing set which will be used to evaluate our model. In this case we know customer answers from the testing dataset so we can compare the model prediction with the true answers. Nevertheless in reality, we don’t know what will be the true answers. So we have to target mainly customers with high probability to churn. This probability is given by our model.
我们将使用来自IBM的消费者通信服务数据。我们的目标是预测哪些客户可能会离开他们现在的通信服务商。
我们的课程假定你已经有一些统计的基础,重点在教会大家如何在R里面建立logistic regreesion来做Churn Analysis并进行预测。
与linear regression类似,logistic regression也是探索因变量与自变量的关系,不同的是,logistic regression的自变量是binary variable, 0, 1, 而不是连续性变量,例如票房收入。预测的结果是被观测样本可能发生某种行为的概率。在建模的时候使用的函数以及其它参数也是不一样的。
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.5.1
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
telco_customer <- read.csv("https://www.dropbox.com/s/1t5umhj7l2ituqy/Telco-Customer-Churn.csv?dl=1")
str(telco_customer)
## 'data.frame': 7043 obs. of 21 variables:
## $ ï..customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
summary(telco_customer)
## ï..customerID gender SeniorCitizen Partner Dependents
## 0002-ORFBO: 1 Female:3488 Min. :0.0000 No :3641 No :4933
## 0003-MKNFE: 1 Male :3555 1st Qu.:0.0000 Yes:3402 Yes:2110
## 0004-TLHLJ: 1 Median :0.0000
## 0011-IGKFF: 1 Mean :0.1621
## 0013-EXCHZ: 1 3rd Qu.:0.0000
## 0013-MHZWF: 1 Max. :1.0000
## (Other) :7037
## tenure PhoneService MultipleLines InternetService
## Min. : 0.00 No : 682 No :3390 DSL :2421
## 1st Qu.: 9.00 Yes:6361 No phone service: 682 Fiber optic:3096
## Median :29.00 Yes :2971 No :1526
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## OnlineSecurity OnlineBackup
## No :3498 No :3088
## No internet service:1526 No internet service:1526
## Yes :2019 Yes :2429
##
##
##
##
## DeviceProtection TechSupport
## No :3095 No :3473
## No internet service:1526 No internet service:1526
## Yes :2422 Yes :2044
##
##
##
##
## StreamingTV StreamingMovies
## No :2810 No :2785
## No internet service:1526 No internet service:1526
## Yes :2707 Yes :2732
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2872 Bank transfer (automatic):1544
## One year :1473 Yes:4171 Credit card (automatic) :1522
## Two year :1695 Electronic check :2365
## Mailed check :1612
##
##
##
## MonthlyCharges TotalCharges Churn
## Min. : 18.25 Min. : 18.8 No :5174
## 1st Qu.: 35.50 1st Qu.: 401.4 Yes:1869
## Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
通过上面简单的描述统计,我们发现有些变量有不和谐的声音,例如 “Yes” “No” “No internet service” (or “No phone service”).
第三个类别没有提供更多的信息,我们把第三类都归为第二类 “No”。
具体地,我们要结合使用lapply和select来达到目的:
colnames(telco_customer)[colnames(telco_customer)=="ï..customerID"] <- "customerID"
factor.variables = lapply(telco_customer %>%
select(-customerID,-MonthlyCharges,
-TotalCharges, -tenure),
function(x){
x = gsub("No internet service", "No", x)
x = gsub("No phone service", "No", x)
return(x)
})
factor.variables <- as.data.frame(factor.variables)
telco_customer <- cbind( customerID = telco_customer$customerID,
TotalCharges = telco_customer$TotalCharges,
MonthlyCharges = telco_customer$MonthlyCharges,
tenure = telco_customer$tenure,
factor.variables)
summary(telco_customer)
## customerID TotalCharges MonthlyCharges tenure
## 0002-ORFBO: 1 Min. : 18.8 Min. : 18.25 Min. : 0.00
## 0003-MKNFE: 1 1st Qu.: 401.4 1st Qu.: 35.50 1st Qu.: 9.00
## 0004-TLHLJ: 1 Median :1397.5 Median : 70.35 Median :29.00
## 0011-IGKFF: 1 Mean :2283.3 Mean : 64.76 Mean :32.37
## 0013-EXCHZ: 1 3rd Qu.:3794.7 3rd Qu.: 89.85 3rd Qu.:55.00
## 0013-MHZWF: 1 Max. :8684.8 Max. :118.75 Max. :72.00
## (Other) :7037 NA's :11
## gender SeniorCitizen Partner Dependents PhoneService
## Female:3488 0:5901 No :3641 No :4933 No : 682
## Male :3555 1:1142 Yes:3402 Yes:2110 Yes:6361
##
##
##
##
##
## MultipleLines InternetService OnlineSecurity OnlineBackup
## No :4072 DSL :2421 No :5024 No :4614
## Yes:2971 Fiber optic:3096 Yes:2019 Yes:2429
## No :1526
##
##
##
##
## DeviceProtection TechSupport StreamingTV StreamingMovies
## No :4621 No :4999 No :4336 No :4311
## Yes:2422 Yes:2044 Yes:2707 Yes:2732
##
##
##
##
##
## Contract PaperlessBilling PaymentMethod
## Month-to-month:3875 No :2872 Bank transfer (automatic):1544
## One year :1473 Yes:4171 Credit card (automatic) :1522
## Two year :1695 Electronic check :2365
## Mailed check :1612
##
##
##
## Churn
## No :5174
## Yes:1869
##
##
##
##
##
telco_customer %>%
is.na() %>%
sum()
## [1] 11
telco_customer %>%
is.na() %>%
colSums()
## customerID TotalCharges MonthlyCharges tenure
## 0 11 0 0
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## PhoneService MultipleLines InternetService OnlineSecurity
## 0 0 0 0
## OnlineBackup DeviceProtection TechSupport StreamingTV
## 0 0 0 0
## StreamingMovies Contract PaperlessBilling PaymentMethod
## 0 0 0 0
## Churn
## 0
telco_customer %>%
filter(is.na(TotalCharges)==TRUE) %>%
select(Churn) %>%
table()
## Warning: package 'bindrcpp' was built under R version 3.5.1
## .
## No Yes
## 11 0
11/7042
## [1] 0.001562056
经过上面的探索,我们发现:
所以,我们剔除那些有缺失值的样本。
还有我们需要把 SeniorCitizen 变成因子变量。
下面的代码使用 dplyr里面的函数结合 pipe operator, 一气呵成,完成上面的任务:
telco_customer <- telco_customer %>%
filter(!is.na(TotalCharges)) %>%
mutate(SeniorCitizen=as.factor(SeniorCitizen))
绘制MonthlyCharges的直方图,查看MonthlyCharges的分布:
telco_customer %>%
ggplot(aes(x=MonthlyCharges)) +
geom_histogram(color='blue', fill='red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
绘制TotalCharges的直方图,查看TotalCharges的分布:
telco_customer %>%
ggplot(aes(x=TotalCharges)) +
geom_histogram(color='black', fill='green')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
绘制 tenure 的直方图,查看 tenure 的分布:
telco_customer %>%
ggplot(aes(x=tenure)) +
geom_histogram(color='black', fill='white')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
将数据随机分成训练数据和测试数据:
set.seed(100)
trainingRowIndex <- sample(1:nrow(telco_customer), 0.7*nrow(telco_customer))
training_data <- telco_customer[trainingRowIndex, ] # model training data
testing_data <- telco_customer[-trainingRowIndex, ] # test data
names(training_data)
## [1] "customerID" "TotalCharges" "MonthlyCharges"
## [4] "tenure" "gender" "SeniorCitizen"
## [7] "Partner" "Dependents" "PhoneService"
## [10] "MultipleLines" "InternetService" "OnlineSecurity"
## [13] "OnlineBackup" "DeviceProtection" "TechSupport"
## [16] "StreamingTV" "StreamingMovies" "Contract"
## [19] "PaperlessBilling" "PaymentMethod" "Churn"
Investigate who churned?
training_data %>%
ggplot(aes(x = Churn)) +
geom_bar(position = "dodge")
training_data %>%
ggplot(aes(x = gender, fill = Churn)) +
geom_bar(position = "fill")
training_data %>%
ggplot(aes(x = SeniorCitizen, fill = Churn)) +
geom_bar(position = "fill")
training_data %>%
ggplot(aes(x = Partner, fill = Churn)) +
geom_bar(position = "fill")
training_data %>%
ggplot(aes(x = Dependents, fill = Churn)) +
geom_bar(position = "fill")
training_data %>%
ggplot(aes(x = PhoneService, fill = Churn)) +
geom_bar(position = "fill")
training_data %>%
ggplot(aes(x = OnlineBackup, fill = Churn)) +
geom_bar(position = "fill")
training_data %>%
ggplot(aes(x = PaymentMethod, fill = Churn)) +
geom_bar(position = "fill")
using “gridExtra” package:
p1 <- ggplot(training_data, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(training_data, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(training_data, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(training_data, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p5 <- ggplot(training_data, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p6 <- ggplot(training_data, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p7 <- ggplot(training_data, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p8 <- ggplot(training_data, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol=2)
# Model specification using lm
fullModel <- glm(Churn ~.-customerID,
data = training_data,
family=binomial(link='logit'))
# Looking at model summary
summary(fullModel)
##
## Call:
## glm(formula = Churn ~ . - customerID, family = binomial(link = "logit"),
## data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8887 -0.6747 -0.2758 0.7161 3.3029
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.8168751 0.9942323 0.822
## TotalCharges 0.0003392 0.0000887 3.824
## MonthlyCharges -0.0262286 0.0387536 -0.677
## tenure -0.0617111 0.0078214 -7.890
## genderMale -0.0604784 0.0783548 -0.772
## SeniorCitizen1 0.2301161 0.1014211 2.269
## PartnerYes 0.0665439 0.0946277 0.703
## DependentsYes -0.1796625 0.1093367 -1.643
## PhoneServiceYes -0.1069154 0.7917878 -0.135
## MultipleLinesYes 0.3895390 0.2162088 1.802
## InternetServiceFiber optic 1.4276896 0.9751787 1.464
## InternetServiceNo -1.4239599 0.9843287 -1.447
## OnlineSecurityYes -0.3742309 0.2194692 -1.705
## OnlineBackupYes -0.0043264 0.2127520 -0.020
## DeviceProtectionYes 0.0793275 0.2133443 0.372
## TechSupportYes -0.2756729 0.2210403 -1.247
## StreamingTVYes 0.4364784 0.3980483 1.097
## StreamingMoviesYes 0.4239477 0.4006505 1.058
## ContractOne year -0.7735879 0.1343541 -5.758
## ContractTwo year -1.4622699 0.2193340 -6.667
## PaperlessBillingYes 0.2649639 0.0895322 2.959
## PaymentMethodCredit card (automatic) -0.0538417 0.1393127 -0.386
## PaymentMethodElectronic check 0.3363397 0.1148767 2.928
## PaymentMethodMailed check -0.0202119 0.1402156 -0.144
## Pr(>|z|)
## (Intercept) 0.411297
## TotalCharges 0.000131 ***
## MonthlyCharges 0.498530
## tenure 3.02e-15 ***
## genderMale 0.440201
## SeniorCitizen1 0.023273 *
## PartnerYes 0.481920
## DependentsYes 0.100341
## PhoneServiceYes 0.892588
## MultipleLinesYes 0.071596 .
## InternetServiceFiber optic 0.143186
## InternetServiceNo 0.148000
## OnlineSecurityYes 0.088164 .
## OnlineBackupYes 0.983776
## DeviceProtectionYes 0.710021
## TechSupportYes 0.212338
## StreamingTVYes 0.272840
## StreamingMoviesYes 0.289988
## ContractOne year 8.52e-09 ***
## ContractTwo year 2.61e-11 ***
## PaperlessBillingYes 0.003082 **
## PaymentMethodCredit card (automatic) 0.699141
## PaymentMethodElectronic check 0.003413 **
## PaymentMethodMailed check 0.885383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5660.5 on 4921 degrees of freedom
## Residual deviance: 4003.4 on 4898 degrees of freedom
## AIC: 4051.4
##
## Number of Fisher Scoring iterations: 6
使用stepAIC函数,剔除无关变量: Choose a model by AIC in a Stepwise Algorithm
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.1
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Model specification using lm
fullModel <- glm(Churn ~.-customerID,
data = training_data,
family=binomial(link='logit'))
newModel <- stepAIC(fullModel,trace = 0)
summary(newModel)
##
## Call:
## glm(formula = Churn ~ TotalCharges + MonthlyCharges + tenure +
## SeniorCitizen + Dependents + MultipleLines + InternetService +
## OnlineSecurity + TechSupport + StreamingTV + StreamingMovies +
## Contract + PaperlessBilling + PaymentMethod, family = binomial(link = "logit"),
## data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9233 -0.6774 -0.2760 0.7172 3.2728
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 8.076e-01 3.429e-01 2.355
## TotalCharges 3.421e-04 8.838e-05 3.871
## MonthlyCharges -2.812e-02 7.120e-03 -3.949
## tenure -6.113e-02 7.794e-03 -7.843
## SeniorCitizen1 2.382e-01 1.007e-01 2.365
## DependentsYes -1.471e-01 9.840e-02 -1.495
## MultipleLinesYes 3.917e-01 1.060e-01 3.696
## InternetServiceFiber optic 1.457e+00 2.358e-01 6.181
## InternetServiceNo -1.511e+00 2.176e-01 -6.944
## OnlineSecurityYes -3.642e-01 1.097e-01 -3.319
## TechSupportYes -2.632e-01 1.120e-01 -2.350
## StreamingTVYes 4.680e-01 1.184e-01 3.954
## StreamingMoviesYes 4.531e-01 1.162e-01 3.900
## ContractOne year -7.618e-01 1.339e-01 -5.688
## ContractTwo year -1.446e+00 2.189e-01 -6.606
## PaperlessBillingYes 2.635e-01 8.939e-02 2.947
## PaymentMethodCredit card (automatic) -5.320e-02 1.391e-01 -0.383
## PaymentMethodElectronic check 3.346e-01 1.148e-01 2.915
## PaymentMethodMailed check -2.668e-02 1.400e-01 -0.191
## Pr(>|z|)
## (Intercept) 0.018525 *
## TotalCharges 0.000108 ***
## MonthlyCharges 7.83e-05 ***
## tenure 4.40e-15 ***
## SeniorCitizen1 0.018022 *
## DependentsYes 0.134986
## MultipleLinesYes 0.000219 ***
## InternetServiceFiber optic 6.37e-10 ***
## InternetServiceNo 3.81e-12 ***
## OnlineSecurityYes 0.000902 ***
## TechSupportYes 0.018778 *
## StreamingTVYes 7.68e-05 ***
## StreamingMoviesYes 9.60e-05 ***
## ContractOne year 1.29e-08 ***
## ContractTwo year 3.96e-11 ***
## PaperlessBillingYes 0.003205 **
## PaymentMethodCredit card (automatic) 0.702031
## PaymentMethodElectronic check 0.003552 **
## PaymentMethodMailed check 0.848887
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5660.5 on 4921 degrees of freedom
## Residual deviance: 4005.6 on 4903 degrees of freedom
## AIC: 4043.6
##
## Number of Fisher Scoring iterations: 6
newFormula <- as.formula(summary(newModel)$call)
newFormula
## Churn ~ TotalCharges + MonthlyCharges + tenure + SeniorCitizen +
## Dependents + MultipleLines + InternetService + OnlineSecurity +
## TechSupport + StreamingTV + StreamingMovies + Contract +
## PaperlessBilling + PaymentMethod
table(telco_customer$Churn)
##
## No Yes
## 5163 1869
1869/(1869+5163)
## [1] 0.265785
class(testing_data$Churn)
## [1] "factor"
testing_data$Churn <- as.character(testing_data$Churn)
testing_data$Churn[testing_data$Churn=="No"] <- "0"
testing_data$Churn[testing_data$Churn=="Yes"] <- "1"
fitted.results <- predict(newModel,newdata=testing_data,type='response')
fitted.results <- ifelse(fitted.results > 0.27,1,0)
misClasificError <- mean(fitted.results != testing_data$Churn)
misClasificError
## [1] 0.2417062
print(paste('Logistic Regression Accuracy',1-misClasificError))
## [1] "Logistic Regression Accuracy 0.758293838862559"
可能流失的客户有什么特征?
testing_data$pred <- fitted.results
testing_data%>%
group_by(pred) %>%
select(TotalCharges, MonthlyCharges, tenure) %>%
summarise_all(funs(mean(.)))
## Adding missing grouping variables: `pred`
## # A tibble: 2 x 4
## pred TotalCharges MonthlyCharges tenure
## <dbl> <dbl> <dbl> <dbl>
## 1 0 2770. 56.3 42.8
## 2 1 1430. 75.4 16.3
完结撒花!