什么是 customer Churn，客户流失？

We can shortly define customer churn (most commonly called “churn”) as customers that stop doing business with a company or a service. There are customer churns in different business area.

最常见的就是我们更换手机的运营商。

什么是 Churn Analysis？

用数据分析工具识别那些客户容易流失，他们有什么特征。重点关照这些客户，给他们定制广告，优惠等留住他们。因为留住一个客户比开发一个新客户容易地多，便宜地多。

It’s a trendy topic in customer relationship management (CRM) departments because it costs more money to find new customers than keeping the existing ones. So companies want to prevent them to leave.

怎么做 Churn Analysis？

简单说来，我们需要以往的消费者数据，其中有人是流失了的，来建立模型，通常是logistic regression，然后将该模型应用到新的数据里面去预测其他客户流失的概率。设定一个标准，据此标准判断哪些客户容易流失。

To identify the customers, we need to have a database with data about the previous customers that churned. Using this data, we develop a model which identifies customers that have a profile close to the ones that already left. To simulate an experiment where we want to predict if our customers will churn, we need to work with a partitioned database. The database has 2 parts, one part will be the training set. This will be used to create the model. The second part will be the testing set which will be used to evaluate our model. In this case we know customer answers from the testing dataset so we can compare the model prediction with the true answers. Nevertheless in reality, we don’t know what will be the true answers. So we have to target mainly customers with high probability to churn. This probability is given by our model.

我们本节课的内容

我们将使用来自IBM的消费者通信服务数据。我们的目标是预测哪些客户可能会离开他们现在的通信服务商。

我们的课程假定你已经有一些统计的基础，重点在教会大家如何在R里面建立logistic regreesion来做Churn Analysis并进行预测。

与linear regression类似，logistic regression也是探索因变量与自变量的关系，不同的是，logistic regression的自变量是binary variable, 0, 1, 而不是连续性变量，例如票房收入。预测的结果是被观测样本可能发生某种行为的概率。在建模的时候使用的函数以及其它参数也是不一样的。

读取数据并做初步探索：

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.1

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.5.1

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

telco_customer <- read.csv("https://www.dropbox.com/s/1t5umhj7l2ituqy/Telco-Customer-Churn.csv?dl=1")

str(telco_customer)

## 'data.frame':    7043 obs. of  21 variables:
##  $ ï..customerID   : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

summary(telco_customer)

##     ï..customerID     gender     SeniorCitizen    Partner    Dependents
##  0002-ORFBO:   1   Female:3488   Min.   :0.0000   No :3641   No :4933  
##  0003-MKNFE:   1   Male  :3555   1st Qu.:0.0000   Yes:3402   Yes:2110  
##  0004-TLHLJ:   1                 Median :0.0000                        
##  0011-IGKFF:   1                 Mean   :0.1621                        
##  0013-EXCHZ:   1                 3rd Qu.:0.0000                        
##  0013-MHZWF:   1                 Max.   :1.0000                        
##  (Other)   :7037                                                       
##      tenure      PhoneService          MultipleLines     InternetService
##  Min.   : 0.00   No : 682     No              :3390   DSL        :2421  
##  1st Qu.: 9.00   Yes:6361     No phone service: 682   Fiber optic:3096  
##  Median :29.00                Yes             :2971   No         :1526  
##  Mean   :32.37                                                          
##  3rd Qu.:55.00                                                          
##  Max.   :72.00                                                          
##                                                                         
##              OnlineSecurity              OnlineBackup 
##  No                 :3498   No                 :3088  
##  No internet service:1526   No internet service:1526  
##  Yes                :2019   Yes                :2429  
##                                                       
##                                                       
##                                                       
##                                                       
##             DeviceProtection              TechSupport  
##  No                 :3095    No                 :3473  
##  No internet service:1526    No internet service:1526  
##  Yes                :2422    Yes                :2044  
##                                                        
##                                                        
##                                                        
##                                                        
##               StreamingTV              StreamingMovies
##  No                 :2810   No                 :2785  
##  No internet service:1526   No internet service:1526  
##  Yes                :2707   Yes                :2732  
##                                                       
##                                                       
##                                                       
##                                                       
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2872         Bank transfer (automatic):1544  
##  One year      :1473   Yes:4171         Credit card (automatic)  :1522  
##  Two year      :1695                    Electronic check         :2365  
##                                         Mailed check             :1612  
##                                                                         
##                                                                         
##                                                                         
##  MonthlyCharges    TotalCharges    Churn     
##  Min.   : 18.25   Min.   :  18.8   No :5174  
##  1st Qu.: 35.50   1st Qu.: 401.4   Yes:1869  
##  Median : 70.35   Median :1397.5             
##  Mean   : 64.76   Mean   :2283.3             
##  3rd Qu.: 89.85   3rd Qu.:3794.7             
##  Max.   :118.75   Max.   :8684.8             
##                   NA's   :11

通过上面简单的描述统计，我们发现有些变量有不和谐的声音，例如 “Yes” “No” “No internet service” (or “No phone service”).

第三个类别没有提供更多的信息，我们把第三类都归为第二类 “No”。

具体地，我们要结合使用lapply和select来达到目的：

colnames(telco_customer)[colnames(telco_customer)=="ï..customerID"] <- "customerID"
factor.variables = lapply(telco_customer %>% 
                      select(-customerID,-MonthlyCharges, 
                      -TotalCharges, -tenure), 
                      function(x){
        x = gsub("No internet service", "No", x)
        x = gsub("No phone service", "No", x)
        return(x)
})

factor.variables <- as.data.frame(factor.variables)
telco_customer <- cbind( customerID = telco_customer$customerID, 
                             TotalCharges = telco_customer$TotalCharges,
                             MonthlyCharges = telco_customer$MonthlyCharges,
                             tenure = telco_customer$tenure,
                  factor.variables)

summary(telco_customer)

##       customerID    TotalCharges    MonthlyCharges       tenure     
##  0002-ORFBO:   1   Min.   :  18.8   Min.   : 18.25   Min.   : 0.00  
##  0003-MKNFE:   1   1st Qu.: 401.4   1st Qu.: 35.50   1st Qu.: 9.00  
##  0004-TLHLJ:   1   Median :1397.5   Median : 70.35   Median :29.00  
##  0011-IGKFF:   1   Mean   :2283.3   Mean   : 64.76   Mean   :32.37  
##  0013-EXCHZ:   1   3rd Qu.:3794.7   3rd Qu.: 89.85   3rd Qu.:55.00  
##  0013-MHZWF:   1   Max.   :8684.8   Max.   :118.75   Max.   :72.00  
##  (Other)   :7037   NA's   :11                                       
##     gender     SeniorCitizen Partner    Dependents PhoneService
##  Female:3488   0:5901        No :3641   No :4933   No : 682    
##  Male  :3555   1:1142        Yes:3402   Yes:2110   Yes:6361    
##                                                                
##                                                                
##                                                                
##                                                                
##                                                                
##  MultipleLines    InternetService OnlineSecurity OnlineBackup
##  No :4072      DSL        :2421   No :5024       No :4614    
##  Yes:2971      Fiber optic:3096   Yes:2019       Yes:2429    
##                No         :1526                              
##                                                              
##                                                              
##                                                              
##                                                              
##  DeviceProtection TechSupport StreamingTV StreamingMovies
##  No :4621         No :4999    No :4336    No :4311       
##  Yes:2422         Yes:2044    Yes:2707    Yes:2732       
##                                                          
##                                                          
##                                                          
##                                                          
##                                                          
##            Contract    PaperlessBilling                   PaymentMethod 
##  Month-to-month:3875   No :2872         Bank transfer (automatic):1544  
##  One year      :1473   Yes:4171         Credit card (automatic)  :1522  
##  Two year      :1695                    Electronic check         :2365  
##                                         Mailed check             :1612  
##                                                                         
##                                                                         
##                                                                         
##  Churn     
##  No :5174  
##  Yes:1869  
##            
##            
##            
##            
##

处理缺失值

telco_customer %>%
  is.na() %>%
  sum()

## [1] 11

telco_customer %>%
  is.na() %>%
  colSums()

##       customerID     TotalCharges   MonthlyCharges           tenure 
##                0               11                0                0 
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##     PhoneService    MultipleLines  InternetService   OnlineSecurity 
##                0                0                0                0 
##     OnlineBackup DeviceProtection      TechSupport      StreamingTV 
##                0                0                0                0 
##  StreamingMovies         Contract PaperlessBilling    PaymentMethod 
##                0                0                0                0 
##            Churn 
##                0

telco_customer %>%
  filter(is.na(TotalCharges)==TRUE) %>%
  select(Churn) %>%
  table()

## Warning: package 'bindrcpp' was built under R version 3.5.1

## .
##  No Yes 
##  11   0

11/7042

## [1] 0.001562056

经过上面的探索，我们发现：

all missings are from TotalCharges
all of them do not churn
only 11/7042 missing

所以，我们剔除那些有缺失值的样本。

还有我们需要把 SeniorCitizen 变成因子变量。

下面的代码使用 dplyr里面的函数结合 pipe operator, 一气呵成，完成上面的任务：

telco_customer <- telco_customer %>%
  filter(!is.na(TotalCharges)) %>%
  mutate(SeniorCitizen=as.factor(SeniorCitizen))

绘制MonthlyCharges的直方图，查看MonthlyCharges的分布：

telco_customer %>%
  ggplot(aes(x=MonthlyCharges)) +
  geom_histogram(color='blue', fill='red')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

绘制TotalCharges的直方图，查看TotalCharges的分布：

telco_customer %>%
  ggplot(aes(x=TotalCharges)) +
  geom_histogram(color='black', fill='green')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

绘制 tenure 的直方图，查看 tenure 的分布:

telco_customer %>%
  ggplot(aes(x=tenure)) +
  geom_histogram(color='black', fill='white')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

将数据随机分成训练数据和测试数据:

set.seed(100) 
trainingRowIndex <- sample(1:nrow(telco_customer), 0.7*nrow(telco_customer)) 

training_data <- telco_customer[trainingRowIndex, ] # model training data
testing_data <- telco_customer[-trainingRowIndex, ] # test data

names(training_data)

##  [1] "customerID"       "TotalCharges"     "MonthlyCharges"  
##  [4] "tenure"           "gender"           "SeniorCitizen"   
##  [7] "Partner"          "Dependents"       "PhoneService"    
## [10] "MultipleLines"    "InternetService"  "OnlineSecurity"  
## [13] "OnlineBackup"     "DeviceProtection" "TechSupport"     
## [16] "StreamingTV"      "StreamingMovies"  "Contract"        
## [19] "PaperlessBilling" "PaymentMethod"    "Churn"

Investigate who churned?

training_data %>%
  ggplot(aes(x = Churn)) +
  geom_bar(position = "dodge")

training_data %>%
  ggplot(aes(x = gender, fill = Churn)) +
  geom_bar(position = "fill")

training_data %>%
  ggplot(aes(x = SeniorCitizen, fill = Churn)) +
  geom_bar(position = "fill")

training_data %>%
  ggplot(aes(x = Partner, fill = Churn)) +
  geom_bar(position = "fill")

training_data %>%
  ggplot(aes(x = Dependents, fill = Churn)) +
  geom_bar(position = "fill")

training_data %>%
  ggplot(aes(x = PhoneService, fill = Churn)) +
  geom_bar(position = "fill")

training_data %>%
  ggplot(aes(x = OnlineBackup, fill = Churn)) +
  geom_bar(position = "fill")

training_data %>%
  ggplot(aes(x = PaymentMethod, fill = Churn)) +
  geom_bar(position = "fill")

Another way of visulization:

using “gridExtra” package:

p1 <- ggplot(training_data, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(training_data, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") + 
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(training_data, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") + 
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(training_data, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()

p5 <- ggplot(training_data, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p6 <- ggplot(training_data, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") + 
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p7 <- ggplot(training_data, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") + 
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p8 <- ggplot(training_data, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol=2)

building a good logistical model:

exert caution when you interpret the coefficients, only directions make sense.

# Model specification using lm
fullModel <- glm(Churn ~.-customerID, 
                        data = training_data,
                 family=binomial(link='logit'))

# Looking at model summary
summary(fullModel)

## 
## Call:
## glm(formula = Churn ~ . - customerID, family = binomial(link = "logit"), 
##     data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8887  -0.6747  -0.2758   0.7161   3.3029  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                           0.8168751  0.9942323   0.822
## TotalCharges                          0.0003392  0.0000887   3.824
## MonthlyCharges                       -0.0262286  0.0387536  -0.677
## tenure                               -0.0617111  0.0078214  -7.890
## genderMale                           -0.0604784  0.0783548  -0.772
## SeniorCitizen1                        0.2301161  0.1014211   2.269
## PartnerYes                            0.0665439  0.0946277   0.703
## DependentsYes                        -0.1796625  0.1093367  -1.643
## PhoneServiceYes                      -0.1069154  0.7917878  -0.135
## MultipleLinesYes                      0.3895390  0.2162088   1.802
## InternetServiceFiber optic            1.4276896  0.9751787   1.464
## InternetServiceNo                    -1.4239599  0.9843287  -1.447
## OnlineSecurityYes                    -0.3742309  0.2194692  -1.705
## OnlineBackupYes                      -0.0043264  0.2127520  -0.020
## DeviceProtectionYes                   0.0793275  0.2133443   0.372
## TechSupportYes                       -0.2756729  0.2210403  -1.247
## StreamingTVYes                        0.4364784  0.3980483   1.097
## StreamingMoviesYes                    0.4239477  0.4006505   1.058
## ContractOne year                     -0.7735879  0.1343541  -5.758
## ContractTwo year                     -1.4622699  0.2193340  -6.667
## PaperlessBillingYes                   0.2649639  0.0895322   2.959
## PaymentMethodCredit card (automatic) -0.0538417  0.1393127  -0.386
## PaymentMethodElectronic check         0.3363397  0.1148767   2.928
## PaymentMethodMailed check            -0.0202119  0.1402156  -0.144
##                                      Pr(>|z|)    
## (Intercept)                          0.411297    
## TotalCharges                         0.000131 ***
## MonthlyCharges                       0.498530    
## tenure                               3.02e-15 ***
## genderMale                           0.440201    
## SeniorCitizen1                       0.023273 *  
## PartnerYes                           0.481920    
## DependentsYes                        0.100341    
## PhoneServiceYes                      0.892588    
## MultipleLinesYes                     0.071596 .  
## InternetServiceFiber optic           0.143186    
## InternetServiceNo                    0.148000    
## OnlineSecurityYes                    0.088164 .  
## OnlineBackupYes                      0.983776    
## DeviceProtectionYes                  0.710021    
## TechSupportYes                       0.212338    
## StreamingTVYes                       0.272840    
## StreamingMoviesYes                   0.289988    
## ContractOne year                     8.52e-09 ***
## ContractTwo year                     2.61e-11 ***
## PaperlessBillingYes                  0.003082 ** 
## PaymentMethodCredit card (automatic) 0.699141    
## PaymentMethodElectronic check        0.003413 ** 
## PaymentMethodMailed check            0.885383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5660.5  on 4921  degrees of freedom
## Residual deviance: 4003.4  on 4898  degrees of freedom
## AIC: 4051.4
## 
## Number of Fisher Scoring iterations: 6

使用stepAIC函数，剔除无关变量： Choose a model by AIC in a Stepwise Algorithm

library(MASS)

## Warning: package 'MASS' was built under R version 3.5.1

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

# Model specification using lm
fullModel <- glm(Churn ~.-customerID, 
                        data = training_data,
                 family=binomial(link='logit'))

newModel <- stepAIC(fullModel,trace = 0)

summary(newModel)

## 
## Call:
## glm(formula = Churn ~ TotalCharges + MonthlyCharges + tenure + 
##     SeniorCitizen + Dependents + MultipleLines + InternetService + 
##     OnlineSecurity + TechSupport + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod, family = binomial(link = "logit"), 
##     data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9233  -0.6774  -0.2760   0.7172   3.2728  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                           8.076e-01  3.429e-01   2.355
## TotalCharges                          3.421e-04  8.838e-05   3.871
## MonthlyCharges                       -2.812e-02  7.120e-03  -3.949
## tenure                               -6.113e-02  7.794e-03  -7.843
## SeniorCitizen1                        2.382e-01  1.007e-01   2.365
## DependentsYes                        -1.471e-01  9.840e-02  -1.495
## MultipleLinesYes                      3.917e-01  1.060e-01   3.696
## InternetServiceFiber optic            1.457e+00  2.358e-01   6.181
## InternetServiceNo                    -1.511e+00  2.176e-01  -6.944
## OnlineSecurityYes                    -3.642e-01  1.097e-01  -3.319
## TechSupportYes                       -2.632e-01  1.120e-01  -2.350
## StreamingTVYes                        4.680e-01  1.184e-01   3.954
## StreamingMoviesYes                    4.531e-01  1.162e-01   3.900
## ContractOne year                     -7.618e-01  1.339e-01  -5.688
## ContractTwo year                     -1.446e+00  2.189e-01  -6.606
## PaperlessBillingYes                   2.635e-01  8.939e-02   2.947
## PaymentMethodCredit card (automatic) -5.320e-02  1.391e-01  -0.383
## PaymentMethodElectronic check         3.346e-01  1.148e-01   2.915
## PaymentMethodMailed check            -2.668e-02  1.400e-01  -0.191
##                                      Pr(>|z|)    
## (Intercept)                          0.018525 *  
## TotalCharges                         0.000108 ***
## MonthlyCharges                       7.83e-05 ***
## tenure                               4.40e-15 ***
## SeniorCitizen1                       0.018022 *  
## DependentsYes                        0.134986    
## MultipleLinesYes                     0.000219 ***
## InternetServiceFiber optic           6.37e-10 ***
## InternetServiceNo                    3.81e-12 ***
## OnlineSecurityYes                    0.000902 ***
## TechSupportYes                       0.018778 *  
## StreamingTVYes                       7.68e-05 ***
## StreamingMoviesYes                   9.60e-05 ***
## ContractOne year                     1.29e-08 ***
## ContractTwo year                     3.96e-11 ***
## PaperlessBillingYes                  0.003205 ** 
## PaymentMethodCredit card (automatic) 0.702031    
## PaymentMethodElectronic check        0.003552 ** 
## PaymentMethodMailed check            0.848887    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5660.5  on 4921  degrees of freedom
## Residual deviance: 4005.6  on 4903  degrees of freedom
## AIC: 4043.6
## 
## Number of Fisher Scoring iterations: 6

newFormula <- as.formula(summary(newModel)$call)
newFormula

## Churn ~ TotalCharges + MonthlyCharges + tenure + SeniorCitizen + 
##     Dependents + MultipleLines + InternetService + OnlineSecurity + 
##     TechSupport + StreamingTV + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod

Let us predict with full model:

table(telco_customer$Churn)

## 
##   No  Yes 
## 5163 1869

1869/(1869+5163)

## [1] 0.265785

class(testing_data$Churn)

## [1] "factor"

testing_data$Churn <- as.character(testing_data$Churn)
testing_data$Churn[testing_data$Churn=="No"] <- "0"
testing_data$Churn[testing_data$Churn=="Yes"] <- "1"

fitted.results <- predict(newModel,newdata=testing_data,type='response')
fitted.results <- ifelse(fitted.results > 0.27,1,0)
misClasificError <- mean(fitted.results != testing_data$Churn)
misClasificError

## [1] 0.2417062

print(paste('Logistic Regression Accuracy',1-misClasificError))

## [1] "Logistic Regression Accuracy 0.758293838862559"

可能流失的客户有什么特征？

testing_data$pred <- fitted.results

testing_data%>%
  group_by(pred) %>%
  select(TotalCharges, MonthlyCharges, tenure) %>%
  summarise_all(funs(mean(.)))

## Adding missing grouping variables: `pred`

## # A tibble: 2 x 4
##    pred TotalCharges MonthlyCharges tenure
##   <dbl>        <dbl>          <dbl>  <dbl>
## 1     0        2770.           56.3   42.8
## 2     1        1430.           75.4   16.3

完结撒花！

Churn Analysis: logistical regression

Chris Qi

2018