Source file ⇒ Churn_Analysis.Rmd

What is Churn?

In customer management, customer churn refers to a customer's decision to end the business relationship. It is also referred to as loss of clients or customers. Measured as rates over the same period, customer loyalty (retention) and customer churn add up to 100%.

It is very important to predict which users are likely to churn and which factors affect that decision. This analysis shows how a logistic regression model, support vector machines, and a random forest model can be used to identify customer churn in a telecom dataset.

The Data

The “churn” data set was developed to predict telecom customer churn based on information about their account. The data files state that the data are “artificial based on claims similar to real world”.

#Packages used in analysis
library(ggplot2)
library(reshape2)
library(corrplot)
library(e1071)
library(caret)
## Loading required package: lattice
library(rpart)
library(C50)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner,
##     node_surv, node_terminal
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(dplyr)
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
#reading in data from C50 package
library(C50)
data(churn)
test <- churnTest
train <- churnTrain
mydata <- rbind(test,train)
#Transforming variables to numeric form
#churn has factor levels "yes", "no", so as.integer() gives yes = 1, no = 2
#after recoding: yes = 1, no = 0
mydata$churn <- as.integer(mydata$churn)
mydata$churn[mydata$churn == 2] <- 0

mydata$international_plan <-as.integer(mydata$international_plan)
mydata$international_plan[mydata$international_plan == "1"] <- 0
mydata$international_plan[mydata$international_plan == "2"] <- 1

mydata$voice_mail_plan <- as.integer(mydata$voice_mail_plan)
mydata$voice_mail_plan[mydata$voice_mail_plan == "1"] <- 0
mydata$voice_mail_plan[mydata$voice_mail_plan == "2"] <- 1
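The recoding above relies on how `as.integer()` treats a factor: it returns the position of each value in `levels()`, not anything derived from the label itself. A minimal sketch on a toy factor (the vector `plan` is made up for illustration) shows the mapping, along with a label-based alternative that does not depend on level order:

```r
# Toy factor, invented for illustration; levels are set explicitly to "no", "yes"
plan <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))

# as.integer() returns the index of each value in levels(plan)
codes <- as.integer(plan)        # "no" -> 1, "yes" -> 2

# Subtracting 1 gives the usual 0/1 indicator
indicator <- codes - 1L

# Comparing against the label gives the same result whatever the level order
indicator_by_label <- as.integer(plan == "yes")

print(levels(plan))
print(indicator)
print(indicator_by_label)
```

Checking `levels()` before converting is a cheap way to confirm which label ends up as 1.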
#Removing unwanted variables for analysis
mydata$state <- NULL
mydata$area_code <- NULL
#Remove observations with missing values from the dataset
mydata <- na.omit(mydata)
head(mydata)
##   account_length international_plan voice_mail_plan number_vmail_messages
## 1            101                  0               0                     0
## 2            137                  0               0                     0
## 3            103                  0               1                    29
## 4             99                  0               0                     0
## 5            108                  0               0                     0
## 6            117                  0               0                     0
##   total_day_minutes total_day_calls total_day_charge total_eve_minutes
## 1              70.9             123            12.05             211.9
## 2             223.6              86            38.01             244.8
## 3             294.7              95            50.10             237.3
## 4             216.8             123            36.86             126.4
## 5             197.4              78            33.56             124.0
## 6             226.5              85            38.51             141.6
##   total_eve_calls total_eve_charge total_night_minutes total_night_calls
## 1              73            18.01               236.0                73
## 2             139            20.81                94.2                81
## 3             105            20.17               300.3               127
## 4              88            10.74               220.6                82
## 5             101            10.54               204.5               107
## 6              68            12.04               223.0                90
##   total_night_charge total_intl_minutes total_intl_calls total_intl_charge
## 1              10.62               10.6                3              2.86
## 2               4.24                9.5                7              2.57
## 3              13.51               13.7                6              3.70
## 4               9.93               15.7                2              4.24
## 5               9.20                7.7                4              2.08
## 6              10.04                6.9                5              1.86
##   number_customer_service_calls churn
## 1                             3     0
## 2                             0     0
## 3                             1     0
## 4                             1     0
## 5                             2     0
## 6                             1     0
#Now begin exploratory data analysis
#Summarize dataset
summary(mydata)
##  account_length  international_plan voice_mail_plan  number_vmail_messages
##  Min.   :  1.0   Min.   :0.0000     Min.   :0.0000   Min.   : 0.000       
##  1st Qu.: 73.0   1st Qu.:0.0000     1st Qu.:0.0000   1st Qu.: 0.000       
##  Median :100.0   Median :0.0000     Median :0.0000   Median : 0.000       
##  Mean   :100.3   Mean   :0.0946     Mean   :0.2646   Mean   : 7.755       
##  3rd Qu.:127.0   3rd Qu.:0.0000     3rd Qu.:1.0000   3rd Qu.:17.000       
##  Max.   :243.0   Max.   :1.0000     Max.   :1.0000   Max.   :52.000       
##  total_day_minutes total_day_calls total_day_charge total_eve_minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0    
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4    
##  Median :180.1     Median :100     Median :30.62    Median :201.0    
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6    
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1    
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7    
##  total_eve_calls total_eve_charge total_night_minutes total_night_calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00   
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00   
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00   
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92   
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00   
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00   
##  total_night_charge total_intl_minutes total_intl_calls total_intl_charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780    
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771    
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
##  number_customer_service_calls     churn       
##  Min.   :0.00                  Min.   :0.0000  
##  1st Qu.:1.00                  1st Qu.:0.0000  
##  Median :1.00                  Median :0.0000  
##  Mean   :1.57                  Mean   :0.1414  
##  3rd Qu.:2.00                  3rd Qu.:0.0000  
##  Max.   :9.00                  Max.   :1.0000
sapply(mydata, sd)
##                account_length            international_plan 
##                    39.6945595                     0.2926909 
##               voice_mail_plan         number_vmail_messages 
##                     0.4411641                    13.5463934 
##             total_day_minutes               total_day_calls 
##                    53.8946992                    19.8311974 
##              total_day_charge             total_eve_minutes 
##                     9.1620687                    50.5513090 
##               total_eve_calls              total_eve_charge 
##                    19.8264958                     4.2968433 
##           total_night_minutes             total_night_calls 
##                    50.5277893                    19.9586859 
##            total_night_charge            total_intl_minutes 
##                     2.2737627                     2.7613957 
##              total_intl_calls             total_intl_charge 
##                     2.4567882                     0.7455137 
## number_customer_service_calls                         churn 
##                     1.3063633                     0.3484685
cormatrix <- round(cor(mydata), digits = 2 )
cormatrix
##                               account_length international_plan
## account_length                          1.00               0.01
## international_plan                      0.01               1.00
## voice_mail_plan                        -0.01               0.01
## number_vmail_messages                  -0.01               0.01
## total_day_minutes                       0.00               0.03
## total_day_calls                         0.03               0.01
## total_day_charge                        0.00               0.03
## total_eve_minutes                      -0.01               0.02
## total_eve_calls                         0.01               0.00
## total_eve_charge                       -0.01               0.02
## total_night_minutes                     0.00              -0.03
## total_night_calls                      -0.01               0.01
## total_night_charge                      0.00              -0.03
## total_intl_minutes                      0.00               0.03
## total_intl_calls                        0.01               0.00
## total_intl_charge                       0.00               0.03
## number_customer_service_calls           0.00              -0.01
## churn                                   0.02               0.26
##                               voice_mail_plan number_vmail_messages
## account_length                          -0.01                 -0.01
## international_plan                       0.01                  0.01
## voice_mail_plan                          1.00                  0.95
## number_vmail_messages                    0.95                  1.00
## total_day_minutes                        0.00                  0.01
## total_day_calls                          0.00                  0.00
## total_day_charge                         0.00                  0.01
## total_eve_minutes                        0.02                  0.02
## total_eve_calls                         -0.01                  0.00
## total_eve_charge                         0.02                  0.02
## total_night_minutes                      0.01                  0.01
## total_night_calls                        0.01                  0.00
## total_night_charge                       0.01                  0.01
## total_intl_minutes                       0.00                  0.00
## total_intl_calls                        -0.01                  0.00
## total_intl_charge                        0.00                  0.00
## number_customer_service_calls           -0.01                 -0.01
## churn                                   -0.11                 -0.10
##                               total_day_minutes total_day_calls
## account_length                             0.00            0.03
## international_plan                         0.03            0.01
## voice_mail_plan                            0.00            0.00
## number_vmail_messages                      0.01            0.00
## total_day_minutes                          1.00            0.00
## total_day_calls                            0.00            1.00
## total_day_charge                           1.00            0.00
## total_eve_minutes                         -0.01            0.00
## total_eve_calls                            0.01            0.00
## total_eve_charge                          -0.01            0.00
## total_night_minutes                        0.01            0.00
## total_night_calls                          0.00           -0.01
## total_night_charge                         0.01            0.00
## total_intl_minutes                        -0.02            0.01
## total_intl_calls                           0.00            0.01
## total_intl_charge                         -0.02            0.01
## number_customer_service_calls              0.00           -0.01
## churn                                      0.21            0.02
##                               total_day_charge total_eve_minutes
## account_length                            0.00             -0.01
## international_plan                        0.03              0.02
## voice_mail_plan                           0.00              0.02
## number_vmail_messages                     0.01              0.02
## total_day_minutes                         1.00             -0.01
## total_day_calls                           0.00              0.00
## total_day_charge                          1.00             -0.01
## total_eve_minutes                        -0.01              1.00
## total_eve_calls                           0.01              0.00
## total_eve_charge                         -0.01              1.00
## total_night_minutes                       0.01             -0.02
## total_night_calls                         0.00              0.01
## total_night_charge                        0.01             -0.02
## total_intl_minutes                       -0.02              0.00
## total_intl_calls                          0.00              0.01
## total_intl_charge                        -0.02              0.00
## number_customer_service_calls             0.00             -0.01
## churn                                     0.21              0.09
##                               total_eve_calls total_eve_charge
## account_length                           0.01            -0.01
## international_plan                       0.00             0.02
## voice_mail_plan                         -0.01             0.02
## number_vmail_messages                    0.00             0.02
## total_day_minutes                        0.01            -0.01
## total_day_calls                          0.00             0.00
## total_day_charge                         0.01            -0.01
## total_eve_minutes                        0.00             1.00
## total_eve_calls                          1.00             0.00
## total_eve_charge                         0.00             1.00
## total_night_minutes                      0.00            -0.02
## total_night_calls                       -0.01             0.01
## total_night_charge                       0.00            -0.02
## total_intl_minutes                      -0.01             0.00
## total_intl_calls                         0.01             0.01
## total_intl_charge                       -0.01             0.00
## number_customer_service_calls            0.01            -0.01
## churn                                   -0.01             0.09
##                               total_night_minutes total_night_calls
## account_length                               0.00             -0.01
## international_plan                          -0.03              0.01
## voice_mail_plan                              0.01              0.01
## number_vmail_messages                        0.01              0.00
## total_day_minutes                            0.01              0.00
## total_day_calls                              0.00             -0.01
## total_day_charge                             0.01              0.00
## total_eve_minutes                           -0.02              0.01
## total_eve_calls                              0.00             -0.01
## total_eve_charge                            -0.02              0.01
## total_night_minutes                          1.00              0.03
## total_night_calls                            0.03              1.00
## total_night_charge                           1.00              0.03
## total_intl_minutes                          -0.01              0.00
## total_intl_calls                            -0.02              0.00
## total_intl_charge                           -0.01              0.00
## number_customer_service_calls               -0.01             -0.01
## churn                                        0.05             -0.01
##                               total_night_charge total_intl_minutes
## account_length                              0.00               0.00
## international_plan                         -0.03               0.03
## voice_mail_plan                             0.01               0.00
## number_vmail_messages                       0.01               0.00
## total_day_minutes                           0.01              -0.02
## total_day_calls                             0.00               0.01
## total_day_charge                            0.01              -0.02
## total_eve_minutes                          -0.02               0.00
## total_eve_calls                             0.00              -0.01
## total_eve_charge                           -0.02               0.00
## total_night_minutes                         1.00              -0.01
## total_night_calls                           0.03               0.00
## total_night_charge                          1.00              -0.01
## total_intl_minutes                         -0.01               1.00
## total_intl_calls                           -0.02               0.02
## total_intl_charge                          -0.01               1.00
## number_customer_service_calls              -0.01              -0.01
## churn                                       0.05               0.06
##                               total_intl_calls total_intl_charge
## account_length                            0.01              0.00
## international_plan                        0.00              0.03
## voice_mail_plan                          -0.01              0.00
## number_vmail_messages                     0.00              0.00
## total_day_minutes                         0.00             -0.02
## total_day_calls                           0.01              0.01
## total_day_charge                          0.00             -0.02
## total_eve_minutes                         0.01              0.00
## total_eve_calls                           0.01             -0.01
## total_eve_charge                          0.01              0.00
## total_night_minutes                      -0.02             -0.01
## total_night_calls                         0.00              0.00
## total_night_charge                       -0.02             -0.01
## total_intl_minutes                        0.02              1.00
## total_intl_calls                          1.00              0.02
## total_intl_charge                         0.02              1.00
## number_customer_service_calls            -0.02             -0.01
## churn                                    -0.05              0.06
##                               number_customer_service_calls churn
## account_length                                         0.00  0.02
## international_plan                                    -0.01  0.26
## voice_mail_plan                                       -0.01 -0.11
## number_vmail_messages                                 -0.01 -0.10
## total_day_minutes                                      0.00  0.21
## total_day_calls                                       -0.01  0.02
## total_day_charge                                       0.00  0.21
## total_eve_minutes                                     -0.01  0.09
## total_eve_calls                                        0.01 -0.01
## total_eve_charge                                      -0.01  0.09
## total_night_minutes                                   -0.01  0.05
## total_night_calls                                     -0.01 -0.01
## total_night_charge                                    -0.01  0.05
## total_intl_minutes                                    -0.01  0.06
## total_intl_calls                                      -0.02 -0.05
## total_intl_charge                                     -0.01  0.06
## number_customer_service_calls                          1.00  0.21
## churn                                                  0.21  1.00
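In the matrix above, each `*_minutes` column correlates at 1.00 with its matching `*_charge` column, so each pair carries the same information. One way to list such pairs programmatically is sketched below on a small simulated data frame (the data and the 0.95 cutoff are assumptions for illustration, not part of the original analysis):

```r
set.seed(1)

# Simulated stand-in data: charge is an exact multiple of minutes, calls is independent
toy <- data.frame(minutes = runif(200, 0, 300))
toy$charge <- 0.17 * toy$minutes
toy$calls <- rpois(200, lambda = 100)

cm <- cor(toy)

# Blank out the diagonal and lower triangle so each pair is reported once
cm[lower.tri(cm, diag = TRUE)] <- NA
high <- which(abs(cm) > 0.95, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

Dropping one member of each flagged pair before modelling avoids feeding duplicate predictors to the selection step.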
#plot() opens its own frame, so no plot.new() call is needed
plot(mydata$churn ~mydata$total_day_minutes)
title('Basic Scatterplot')

ggplot(mydata, aes(x = total_day_minutes)) + geom_histogram(binwidth = 1, fill = "white", color = "black")

#Randomly split data into train and test set
#About 70% will be assigned to the train set, about 30% to the test set

set.seed(1234)
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(.7,.3))
traindata <- mydata[ind == 1,]
testdata <- mydata[ind == 2,]
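Because `sample(2, n, replace = TRUE, prob = c(.7,.3))` labels each row independently, the resulting split is only approximately 70/30. If an exact split is wanted, one alternative (a sketch on a made-up data frame `toy`, not the method used in this analysis) is to draw row indices without replacement:

```r
set.seed(1234)

# Made-up data frame standing in for mydata
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# Approach used above: independent label per row, group sizes vary around 70/30
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(.7, .3))
print(table(ind))

# Alternative: exactly 70% of the row indices, drawn without replacement
train_idx <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(nrow(toy_train))
print(nrow(toy_test))
```

Either approach is reproducible as long as `set.seed()` is called first.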
#Forward selection
#Lower AIC indicates a better model
#Note: glm() is called without a family argument, so it defaults to gaussian here

forward <- step(glm(churn ~ 1, data = traindata), direction = 'forward', scope = ~ account_length + international_plan + voice_mail_plan + number_vmail_messages + total_day_minutes + total_day_calls + total_day_charge + total_eve_minutes + total_eve_calls + total_eve_charge + total_night_minutes + total_night_calls + total_night_charge + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls)
## Start:  AIC=2618.93
## churn ~ 1
## 
##                                 Df Deviance    AIC
## + international_plan             1   404.63 2379.1
## + number_customer_service_calls  1   412.04 2443.0
## + total_day_minutes              1   417.70 2491.1
## + total_day_charge               1   417.70 2491.1
## + voice_mail_plan                1   427.13 2569.8
## + number_vmail_messages          1   428.28 2579.3
## + total_eve_minutes              1   430.31 2596.0
## + total_eve_charge               1   430.31 2596.0
## + total_intl_minutes             1   431.69 2607.2
## + total_intl_charge              1   431.69 2607.2
## + total_intl_calls               1   432.37 2612.8
## + total_night_charge             1   432.37 2612.8
## + total_night_minutes            1   432.37 2612.8
## + account_length                 1   433.01 2618.0
## <none>                               433.37 2618.9
## + total_eve_calls                1   433.23 2619.8
## + total_night_calls              1   433.30 2620.4
## + total_day_calls                1   433.34 2620.7
## 
## Step:  AIC=2379.08
## churn ~ international_plan
## 
##                                 Df Deviance    AIC
## + number_customer_service_calls  1   382.89 2186.4
## + total_day_minutes              1   390.25 2253.5
## + total_day_charge               1   390.25 2253.5
## + voice_mail_plan                1   398.31 2325.6
## + number_vmail_messages          1   399.36 2334.9
## + total_eve_minutes              1   402.09 2358.9
## + total_eve_charge               1   402.09 2358.9
## + total_night_charge             1   403.48 2371.0
## + total_night_minutes            1   403.48 2371.0
## + total_intl_charge              1   403.50 2371.2
## + total_intl_minutes             1   403.50 2371.2
## + total_intl_calls               1   403.51 2371.3
## <none>                               404.63 2379.1
## + total_eve_calls                1   404.41 2379.1
## + account_length                 1   404.42 2379.2
## + total_night_calls              1   404.54 2380.3
## + total_day_calls                1   404.62 2381.0
## 
## Step:  AIC=2186.45
## churn ~ international_plan + number_customer_service_calls
## 
##                         Df Deviance    AIC
## + total_day_minutes      1   368.61 2054.4
## + total_day_charge       1   368.61 2054.4
## + voice_mail_plan        1   377.01 2133.8
## + number_vmail_messages  1   377.95 2142.7
## + total_eve_minutes      1   380.05 2162.2
## + total_eve_charge       1   380.06 2162.2
## + total_night_charge     1   381.62 2176.8
## + total_night_minutes    1   381.63 2176.8
## + total_intl_charge      1   381.68 2177.2
## + total_intl_minutes     1   381.68 2177.2
## + total_intl_calls       1   382.09 2181.0
## + total_eve_calls        1   382.63 2186.1
## + account_length         1   382.64 2186.2
## <none>                       382.89 2186.4
## + total_night_calls      1   382.82 2187.7
## + total_day_calls        1   382.86 2188.2
## 
## Step:  AIC=2054.41
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes
## 
##                         Df Deviance    AIC
## + voice_mail_plan        1   362.60 1998.5
## + number_vmail_messages  1   363.57 2007.9
## + total_eve_minutes      1   365.55 2027.0
## + total_eve_charge       1   365.55 2027.0
## + total_intl_charge      1   367.05 2041.5
## + total_intl_minutes     1   367.05 2041.5
## + total_night_charge     1   367.49 2045.7
## + total_night_minutes    1   367.49 2045.7
## + total_intl_calls       1   367.69 2047.6
## + total_day_charge       1   368.23 2052.8
## + total_eve_calls        1   368.34 2053.9
## + account_length         1   368.40 2054.4
## <none>                       368.61 2054.4
## + total_night_calls      1   368.50 2055.4
## + total_day_calls        1   368.59 2056.2
## 
## Step:  AIC=1998.52
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan
## 
##                         Df Deviance    AIC
## + total_eve_minutes      1   359.28 1968.1
## + total_eve_charge       1   359.28 1968.1
## + total_intl_charge      1   360.92 1984.2
## + total_intl_minutes     1   360.93 1984.2
## + total_night_charge     1   361.48 1989.6
## + total_night_minutes    1   361.48 1989.6
## + total_intl_calls       1   361.69 1991.6
## + total_day_charge       1   362.24 1996.9
## + total_eve_calls        1   362.32 1997.7
## <none>                       362.60 1998.5
## + account_length         1   362.44 1998.9
## + number_vmail_messages  1   362.50 1999.5
## + total_night_calls      1   362.52 1999.7
## + total_day_calls        1   362.58 2000.3
## 
## Step:  AIC=1968.11
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan + total_eve_minutes
## 
##                         Df Deviance    AIC
## + total_intl_charge      1   357.51 1952.6
## + total_intl_minutes     1   357.51 1952.6
## + total_night_charge     1   358.11 1958.6
## + total_night_minutes    1   358.11 1958.6
## + total_intl_calls       1   358.38 1961.2
## + total_day_charge       1   358.89 1966.2
## + total_eve_calls        1   359.00 1967.3
## <none>                       359.28 1968.1
## + account_length         1   359.11 1968.4
## + number_vmail_messages  1   359.15 1968.8
## + total_night_calls      1   359.19 1969.2
## + total_day_calls        1   359.26 1969.9
## + total_eve_charge       1   359.28 1970.1
## 
## Step:  AIC=1952.63
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan + total_eve_minutes + 
##     total_intl_charge
## 
##                         Df Deviance    AIC
## + total_night_charge     1   356.32 1942.9
## + total_night_minutes    1   356.32 1942.9
## + total_intl_calls       1   356.56 1945.3
## + total_day_charge       1   357.11 1950.7
## + total_eve_calls        1   357.22 1951.8
## <none>                       357.51 1952.6
## + account_length         1   357.35 1953.1
## + number_vmail_messages  1   357.36 1953.2
## + total_night_calls      1   357.42 1953.7
## + total_day_calls        1   357.49 1954.4
## + total_eve_charge       1   357.50 1954.6
## + total_intl_minutes     1   357.50 1954.6
## 
## Step:  AIC=1942.86
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan + total_eve_minutes + 
##     total_intl_charge + total_night_charge
## 
##                         Df Deviance    AIC
## + total_intl_calls       1   355.40 1935.8
## + total_day_charge       1   355.92 1940.9
## + total_eve_calls        1   356.03 1942.0
## <none>                       356.32 1942.9
## + account_length         1   356.15 1943.2
## + number_vmail_messages  1   356.15 1943.2
## + total_night_calls      1   356.20 1943.7
## + total_night_minutes    1   356.25 1944.2
## + total_day_calls        1   356.29 1944.6
## + total_eve_charge       1   356.31 1944.8
## + total_intl_minutes     1   356.32 1944.8
## 
## Step:  AIC=1935.79
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan + total_eve_minutes + 
##     total_intl_charge + total_night_charge + total_intl_calls
## 
##                         Df Deviance    AIC
## + total_day_charge       1   354.99 1933.7
## + total_eve_calls        1   355.12 1935.0
## <none>                       355.40 1935.8
## + number_vmail_messages  1   355.22 1936.0
## + account_length         1   355.23 1936.1
## + total_night_calls      1   355.28 1936.6
## + total_night_minutes    1   355.32 1937.0
## + total_day_calls        1   355.38 1937.5
## + total_intl_minutes     1   355.40 1937.8
## + total_eve_charge       1   355.40 1937.8
## 
## Step:  AIC=1933.7
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan + total_eve_minutes + 
##     total_intl_charge + total_night_charge + total_intl_calls + 
##     total_day_charge
## 
##                         Df Deviance    AIC
## + total_eve_calls        1   354.72 1933.0
## <none>                       354.99 1933.7
## + number_vmail_messages  1   354.82 1934.0
## + account_length         1   354.82 1934.0
## + total_night_calls      1   354.88 1934.6
## + total_night_minutes    1   354.91 1934.9
## + total_day_calls        1   354.96 1935.5
## + total_eve_charge       1   354.99 1935.7
## + total_intl_minutes     1   354.99 1935.7
## 
## Step:  AIC=1933
## churn ~ international_plan + number_customer_service_calls + 
##     total_day_minutes + voice_mail_plan + total_eve_minutes + 
##     total_intl_charge + total_night_charge + total_intl_calls + 
##     total_day_charge + total_eve_calls
## 
##                         Df Deviance    AIC
## <none>                       354.72 1933.0
## + number_vmail_messages  1   354.54 1933.2
## + account_length         1   354.55 1933.3
## + total_night_calls      1   354.60 1933.8
## + total_night_minutes    1   354.63 1934.2
## + total_day_calls        1   354.69 1934.8
## + total_eve_charge       1   354.72 1935.0
## + total_intl_minutes     1   354.72 1935.0
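
For a 0/1 response, forward selection can also be run on a logistic model by passing `family = binomial` to `glm()`. A minimal sketch on simulated data (the variables `x1`, `x2`, `noise` and their effect sizes are invented for illustration):

```r
set.seed(42)

# Simulated data: y depends on x1 and x2, but not on noise
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Start from the intercept-only logistic model
null_model <- glm(y ~ 1, data = sim, family = binomial)

# Add one term at a time for as long as AIC keeps dropping
fwd <- step(null_model, direction = "forward",
            scope = ~ x1 + x2 + noise, trace = 0)

print(formula(fwd))
print(AIC(null_model))
print(AIC(fwd))
```

Setting `trace = 0` suppresses the per-step tables; leave it at the default to see output in the same layout as above.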

The stepwise output above ranks the candidate variables by AIC; significance codes appear only in the model summary below. I perform further analysis on total_day_charge, number_vmail_messages, total_intl_charge and total_eve_minutes.

logit <- glm(churn ~ total_day_charge + number_vmail_messages+ total_intl_charge + total_eve_minutes, data = traindata, family = "binomial")
summary(logit)
## 
## Call:
## glm(formula = churn ~ total_day_charge + number_vmail_messages + 
##     total_intl_charge + total_eve_minutes, family = "binomial", 
##     data = traindata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0809  -0.5949  -0.4701  -0.3226   2.9249  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -5.718786   0.366600 -15.600  < 2e-16 ***
## total_day_charge       0.065126   0.005710  11.405  < 2e-16 ***
## number_vmail_messages -0.030606   0.004593  -6.664 2.67e-11 ***
## total_intl_charge      0.306952   0.068434   4.485 7.28e-06 ***
## total_eve_minutes      0.005580   0.000990   5.636 1.74e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2900.0  on 3524  degrees of freedom
## Residual deviance: 2670.5  on 3520  degrees of freedom
## AIC: 2680.5
## 
## Number of Fisher Scoring iterations: 5
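# Evaluate the model's fit: Cook's distance and hat values flag influential observations.
# influenceIndexPlot() comes from the car package; recent car versions take id = list(n = 4) in place of id.n = 4.
influenceIndexPlot(logit, vars = c("Cook", "hat"), id.n = 4)

# Confidence intervals based on the profile likelihood
confint(logit)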
## Waiting for profiling to be done...
##                              2.5 %       97.5 %
## (Intercept)           -6.446119939 -5.008602987
## total_day_charge       0.054020111  0.076412727
## number_vmail_messages -0.039871089 -0.021842583
## total_intl_charge      0.173322675  0.441655939
## total_eve_minutes      0.003646016  0.007528171
exp(coef(logit)) # odds ratios
##           (Intercept)      total_day_charge number_vmail_messages 
##           0.003283696           1.067293484           0.969857536 
##     total_intl_charge     total_eve_minutes 
##           1.359275814           1.005595136
exp(confint(logit)) # 95% confidence intervals for the odds ratios
## Waiting for profiling to be done...
##                             2.5 %      97.5 %
## (Intercept)           0.001586667 0.006680229
## total_day_charge      1.055505829 1.079407983
## number_vmail_messages 0.960913304 0.978394238
## total_intl_charge     1.189249785 1.555280536
## total_eve_minutes     1.003652671 1.007556579

An odds ratio answers the question: by what factor do the odds of an outcome change when a predictor increases by one unit, holding the other predictors fixed? For example, each one-unit increase in total_intl_charge multiplies the odds of churning (leaving the company) by about 1.36, a 36% increase in the odds; the 95% confidence interval runs from roughly 19% to 56%.
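As a worked check of this interpretation, the sketch below recomputes the odds ratios from the coefficient estimates printed in the model summary and then turns one linear predictor into a churn probability. The customer profile is hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the summary(logit) output above
coefs <- c("(Intercept)"         = -5.718786,
           total_day_charge      =  0.065126,
           number_vmail_messages = -0.030606,
           total_intl_charge     =  0.306952,
           total_eve_minutes     =  0.005580)

# Odds ratio = exp(coefficient); percent change in odds = (OR - 1) * 100
odds_ratio <- exp(coefs[-1])
pct_change <- (odds_ratio - 1) * 100
print(round(cbind(odds_ratio, pct_change), 3))

# Hypothetical customer (illustrative values, not taken from the data);
# the leading 1 multiplies the intercept
customer <- c(1,
              total_day_charge      = 30,
              number_vmail_messages = 0,
              total_intl_charge     = 3,
              total_eve_minutes     = 200)

# Linear predictor (log-odds), then the inverse logit gives a probability
log_odds <- sum(coefs * customer)
p_churn  <- plogis(log_odds)
print(p_churn)
```

Because the model is linear on the log-odds scale, the same one-unit change shifts the probability by different amounts depending on where the customer starts, which is why the effect is reported as an odds ratio.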

# Support vector machine (another prediction model).
# churn is numeric (0/1) here, so svm() fits an eps-regression and returns continuous scores.
svm_model <- svm(churn ~ ., data = traindata, gamma = 0.1, cost = 1)
print(svm_model)
## 
## Call:
## svm(formula = churn ~ ., data = traindata, gamma = 0.1, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  2048
summary(svm_model)
## 
## Call:
## svm(formula = churn ~ ., data = traindata, gamma = 0.1, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  2048
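# Random forest model: grows many decision trees and averages their predictions.
# churn is numeric (0/1) here, so this is a regression forest (hence the warning below).
rf <- randomForest(churn ~ ., data = traindata, ntree = 500, mtry = 5, importance = TRUE)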
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
print(rf)
## 
## Call:
##  randomForest(formula = churn ~ ., data = traindata, ntree = 500,      mtry = 5, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.04123458
##                     % Var explained: 66.46
importance(rf)
##                                   %IncMSE IncNodePurity
## account_length                 -0.1711648     11.157122
## international_plan             97.3564559     34.104464
## voice_mail_plan                23.5721931      8.742515
## number_vmail_messages          24.8403757     12.176776
## total_day_minutes              37.6210656     58.161602
## total_day_calls                 0.5055798     10.501387
## total_day_charge               40.3613063     60.283992
## total_eve_minutes              24.7559703     27.629199
## total_eve_calls                -3.2696250      9.763455
## total_eve_charge               24.7693715     26.977522
## total_night_minutes            18.8966563     15.916180
## total_night_calls              -0.7744242     10.023966
## total_night_charge             18.5910094     15.410549
## total_intl_minutes             27.2684694     16.153068
## total_intl_calls               52.5423688     23.100111
## total_intl_charge              27.5696348     16.361977
## number_customer_service_calls 118.8471276     56.515437
plot.new()
varImpPlot(rf, type = 1, pch = 17, col = 1, cex = 1.0, main = "")
abline(v= 45, col= "red")

To the right of the red line are the variables number_customer_service_calls, international_plan and total_intl_calls. By the %IncMSE measure, these are the most important factors in determining customer churn. Intuitively, this makes sense: a customer who has to call customer service many times to resolve an issue is likely to become frustrated and take their business elsewhere.
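The same selection can be made without reading it off the plot. The sketch below copies the %IncMSE column from the importance(rf) output above and keeps the variables beyond the cut-off of 45 used for the red line.

```r
# %IncMSE values copied from the importance(rf) output above
inc_mse <- c(
  account_length                = -0.1711648,
  international_plan            = 97.3564559,
  voice_mail_plan               = 23.5721931,
  number_vmail_messages         = 24.8403757,
  total_day_minutes             = 37.6210656,
  total_day_calls               =  0.5055798,
  total_day_charge              = 40.3613063,
  total_eve_minutes             = 24.7559703,
  total_eve_calls               = -3.2696250,
  total_eve_charge              = 24.7693715,
  total_night_minutes           = 18.8966563,
  total_night_calls             = -0.7744242,
  total_night_charge            = 18.5910094,
  total_intl_minutes            = 27.2684694,
  total_intl_calls              = 52.5423688,
  total_intl_charge             = 27.5696348,
  number_customer_service_calls = 118.8471276
)

threshold <- 45  # same cut-off as the red line in the plot

# Keep variables above the cut-off, most important first
top_vars <- sort(inc_mse[inc_mse > threshold], decreasing = TRUE)
print(names(top_vars))
```

With the fitted model available, `importance(rf)[, "%IncMSE"]` could replace the hard-coded vector. The cut-off of 45 is a visual choice, not a statistical threshold.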

mydata$churn <- as.factor(mydata$churn)

# C5.0 decision tree, fit on the full data set (so the error rates below are training errors)
tree <- C5.0(churn ~., data = mydata)
summary(tree)
## 
## Call:
## C5.0.formula(formula = churn ~ ., data = mydata)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Sep  3 14:47:37 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5000 cases (18 attributes) from undefined.data
## 
## Decision tree:
## 
## number_customer_service_calls > 3:
## :...total_day_minutes <= 160.2:
## :   :...total_eve_charge <= 19.83: 1 (113/4)
## :   :   total_eve_charge > 19.83:
## :   :   :...total_day_minutes <= 134.5: 1 (17/1)
## :   :       total_day_minutes > 134.5: 0 (15/3)
## :   total_day_minutes > 160.2:
## :   :...international_plan > 0:
## :       :...number_customer_service_calls > 4: 1 (6)
## :       :   number_customer_service_calls <= 4:
## :       :   :...total_intl_calls <= 2: 1 (5)
## :       :       total_intl_calls > 2:
## :       :       :...total_intl_charge <= 3.56: 0 (12/1)
## :       :           total_intl_charge > 3.56: 1 (2)
## :       international_plan <= 0:
## :       :...total_day_minutes > 263.4:
## :           :...voice_mail_plan > 0: 0 (5)
## :           :   voice_mail_plan <= 0:
## :           :   :...total_eve_minutes <= 184.9: 0 (4/1)
## :           :       total_eve_minutes > 184.9: 1 (13)
## :           total_day_minutes <= 263.4:
## :           :...total_eve_charge <= 13.22:
## :               :...total_day_minutes <= 197.2: 1 (16/1)
## :               :   total_day_minutes > 197.2: 0 (21/5)
## :               total_eve_charge > 13.22:
## :               :...total_day_minutes <= 185.7:
## :                   :...total_eve_minutes > 216.9: 0 (24)
## :                   :   total_eve_minutes <= 216.9:
## :                   :   :...total_night_minutes <= 172.9: 1 (15/2)
## :                   :       total_night_minutes > 172.9: 0 (22/5)
## :                   total_day_minutes > 185.7:
## :                   :...total_night_charge <= 11.41: 0 (95/1)
## :                       total_night_charge > 11.41:
## :                       :...total_intl_minutes <= 9.9: 1 (7/1)
## :                           total_intl_minutes > 9.9: 0 (7)
## number_customer_service_calls <= 3:
## :...total_day_minutes > 245.1:
##     :...voice_mail_plan > 0: 0 (121/8)
##     :   voice_mail_plan <= 0:
##     :   :...total_eve_minutes <= 201:
##     :       :...total_day_minutes <= 277.7:
##     :       :   :...international_plan <= 0: 0 (118/11)
##     :       :   :   international_plan > 0:
##     :       :   :   :...total_intl_calls <= 2: 1 (6)
##     :       :   :       total_intl_calls > 2: 0 (14/3)
##     :       :   total_day_minutes > 277.7:
##     :       :   :...total_night_charge > 9.31: 1 (27)
##     :       :       total_night_charge <= 9.31:
##     :       :       :...total_eve_minutes <= 152.7: 0 (13)
##     :       :           total_eve_minutes > 152.7:
##     :       :           :...account_length <= 69: 0 (2)
##     :       :               account_length > 69: 1 (17/1)
##     :       total_eve_minutes > 201:
##     :       :...total_night_charge > 8.54: 1 (114/3)
##     :           total_night_charge <= 8.54:
##     :           :...total_day_minutes <= 264.7:
##     :               :...total_eve_minutes <= 242.4: 0 (20/1)
##     :               :   total_eve_minutes > 242.4: 1 (18/6)
##     :               total_day_minutes > 264.7:
##     :               :...total_night_minutes > 128.5: 1 (36)
##     :                   total_night_minutes <= 128.5:
##     :                   :...total_day_minutes <= 277: 0 (4)
##     :                       total_day_minutes > 277: 1 (4)
##     total_day_minutes <= 245.1:
##     :...international_plan > 0:
##         :...total_intl_calls <= 2: 1 (68)
##         :   total_intl_calls > 2:
##         :   :...total_intl_minutes <= 13: 0 (239/5)
##         :       total_intl_minutes > 13: 1 (59)
##         international_plan <= 0:
##         :...total_day_minutes <= 221.8: 0 (3288/86)
##             total_day_minutes > 221.8:
##             :...total_eve_charge > 22.7:
##                 :...voice_mail_plan <= 0: 1 (34/3)
##                 :   voice_mail_plan > 0: 0 (8)
##                 total_eve_charge <= 22.7:
##                 :...voice_mail_plan > 0: 0 (111/2)
##                     voice_mail_plan <= 0:
##                     :...total_eve_minutes <= 234.2: 0 (247/11)
##                         total_eve_minutes > 234.2:
##                         :...total_intl_charge > 3.48: 1 (3)
##                             total_intl_charge <= 3.48:
##                             :...total_night_charge <= 9.16: 0 (18)
##                                 total_night_charge > 9.16:
##                                 :...total_day_minutes <= 237.8: 0 (7/1)
##                                     total_day_minutes > 237.8: 1 (5)
## 
## 
## Evaluation on training data (5000 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      44  166( 3.3%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    4271    22    (a): class 0
##     144   563    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% total_day_minutes
##  100.00% number_customer_service_calls
##   89.58% international_plan
##   19.38% voice_mail_plan
##   15.70% total_eve_charge
##   15.02% total_eve_minutes
##    8.10% total_intl_calls
##    7.88% total_night_charge
##    6.24% total_intl_minutes
##    1.62% total_night_minutes
##    0.94% total_intl_charge
##    0.38% account_length
## 
## 
## Time: 0.1 secs
results <- C5.0(churn ~., data = mydata, rules = TRUE)
summary(results)
## 
## Call:
## C5.0.formula(formula = churn ~ ., data = mydata, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Sep  3 14:47:37 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 5000 cases (18 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (239/5, lift 1.1)
##  international_plan > 0
##  total_day_minutes <= 245.1
##  total_intl_minutes <= 13
##  total_intl_calls > 2
##  number_customer_service_calls <= 3
##  ->  class 0  [0.975]
## 
## Rule 2: (3288/86, lift 1.1)
##  international_plan <= 0
##  total_day_minutes <= 221.8
##  number_customer_service_calls <= 3
##  ->  class 0  [0.974]
## 
## Rule 3: (210/10, lift 1.1)
##  total_day_minutes > 134.5
##  total_day_minutes <= 160.2
##  total_eve_charge > 19.83
##  ->  class 0  [0.948]
## 
## Rule 4: (3201/495, lift 1.0)
##  total_day_minutes > 160.2
##  ->  class 0  [0.845]
## 
## Rule 5: (89, lift 7.0)
##  international_plan > 0
##  total_intl_calls <= 2
##  ->  class 1  [0.989]
## 
## Rule 6: (79, lift 7.0)
##  international_plan > 0
##  total_intl_minutes > 13
##  ->  class 1  [0.988]
## 
## Rule 7: (55, lift 6.9)
##  voice_mail_plan <= 0
##  total_day_minutes > 237.8
##  total_eve_minutes > 234.2
##  total_night_charge > 9.16
##  ->  class 1  [0.982]
## 
## Rule 8: (55, lift 6.9)
##  voice_mail_plan <= 0
##  total_day_minutes > 277.7
##  total_night_charge > 9.31
##  number_customer_service_calls <= 3
##  ->  class 1  [0.982]
## 
## Rule 9: (82/1, lift 6.9)
##  account_length > 69
##  voice_mail_plan <= 0
##  total_day_minutes > 277.7
##  total_eve_minutes > 152.7
##  number_customer_service_calls <= 3
##  ->  class 1  [0.976]
## 
## Rule 10: (114/3, lift 6.8)
##  voice_mail_plan <= 0
##  total_day_minutes > 245.1
##  total_eve_minutes > 201
##  total_night_charge > 8.54
##  number_customer_service_calls <= 3
##  ->  class 1  [0.966]
## 
## Rule 11: (130/5, lift 6.8)
##  international_plan <= 0
##  voice_mail_plan <= 0
##  total_day_minutes > 263.4
##  total_eve_minutes > 184.9
##  ->  class 1  [0.955]
## 
## Rule 12: (39/1, lift 6.7)
##  total_day_minutes <= 197.2
##  total_eve_charge <= 13.22
##  number_customer_service_calls > 3
##  ->  class 1  [0.951]
## 
## Rule 13: (54/2, lift 6.7)
##  total_day_minutes <= 185.7
##  total_eve_minutes <= 216.9
##  total_night_minutes <= 172.9
##  number_customer_service_calls > 3
##  ->  class 1  [0.946]
## 
## Rule 14: (63/3, lift 6.6)
##  voice_mail_plan <= 0
##  total_day_minutes > 221.8
##  total_eve_charge > 22.7
##  number_customer_service_calls <= 3
##  ->  class 1  [0.938]
## 
## Rule 15: (90/6, lift 6.5)
##  voice_mail_plan <= 0
##  total_day_minutes > 245.1
##  total_eve_minutes > 242.4
##  ->  class 1  [0.924]
## 
## Rule 16: (6, lift 6.2)
##  international_plan > 0
##  total_day_minutes > 160.2
##  number_customer_service_calls > 4
##  ->  class 1  [0.875]
## 
## Rule 17: (8/1, lift 5.7)
##  international_plan <= 0
##  total_day_minutes > 185.7
##  total_eve_charge > 13.22
##  total_night_charge > 11.41
##  total_intl_minutes <= 9.9
##  number_customer_service_calls > 3
##  ->  class 1  [0.800]
## 
## Rule 18: (399/198, lift 3.6)
##  number_customer_service_calls > 3
##  ->  class 1  [0.504]
## 
## Default class: 0
## 
## 
## Evaluation on training data (5000 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##      18  159( 3.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    4267    26    (a): class 0
##     133   574    (b): class 1
## 
## 
##  Attribute usage:
## 
##   97.50% total_day_minutes
##   82.96% number_customer_service_calls
##   76.54% international_plan
##    6.56% total_intl_calls
##    6.52% total_intl_minutes
##    6.40% total_eve_charge
##    5.88% total_eve_minutes
##    5.70% voice_mail_plan
##    3.26% total_night_charge
##    1.64% account_length
##    1.08% total_night_minutes
## 
## 
## Time: 0.1 secs

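There are many rules here, but let’s look at rule 9. It says that a customer with no voice mail plan, an account length above 69, at most 3 customer service calls, more than 277.7 total day minutes and more than 152.7 total evening minutes is likely to leave the company: 81 of the 82 training cases matching the rule churned. Rules like this can be used to derive business insight.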
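To make the rule concrete, the sketch below writes rule 9 as a predicate over a data frame and recomputes the numbers printed with it. The two customers are made up. The bracketed confidence is assumed to be the Laplace ratio (n - m + 1) / (n + 2) and the lift to be that confidence divided by the training share of the predicted class; both assumptions reproduce the printed values here.

```r
# Rule 9 from the C5.0 rule set, written as a predicate on a data frame
rule9 <- function(d) {
  d$account_length > 69 &
    d$voice_mail_plan <= 0 &
    d$total_day_minutes > 277.7 &
    d$total_eve_minutes > 152.7 &
    d$number_customer_service_calls <= 3
}

# Two made-up customers that differ only in voice_mail_plan
toy <- data.frame(
  account_length                = c(100, 100),
  voice_mail_plan               = c(0, 1),
  total_day_minutes             = c(300, 300),
  total_eve_minutes             = c(200, 200),
  number_customer_service_calls = c(1, 1)
)
print(rule9(toy))  # first customer matches the rule, second does not

# Assumed formula: Laplace ratio for n covered cases, m of them misclassified
laplace_conf <- function(n, m) (n - m + 1) / (n + 2)
conf9 <- laplace_conf(82, 1)

# Assumed formula: lift = confidence / share of class 1 in the training data
# (707 churners out of 5000 cases, from the confusion matrix above)
lift9 <- conf9 / (707 / 5000)
print(round(c(confidence = conf9, lift = lift9), 3))
```

A lift near 6.9 means customers matching the rule are about seven times as likely to churn as a customer picked at random from the training data.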

# Check which models are better than others
logistic_model <- predict(logit, testdata, type = "response")
svm_predict <- predict(svm_model, testdata, type = "response")
rf_predict <- predict(rf, testdata, type = "response")
testdata$Yhat1 <- logistic_model
testdata$Yhat2 <- svm_predict
testdata$Yhat3 <- rf_predict

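# Threshold helpers: classify as churn (1) when the predicted score exceeds x
predict1 <- function(x) ifelse(logistic_model > x, 1, 0)
predict2 <- function(x) ifelse(svm_predict > x, 1, 0)
predict3 <- function(x) ifelse(rf_predict > x, 1, 0)
# Recent versions of caret require factors with identical levels
confusionMatrix(factor(predict1(.5), levels = c(0, 1)), factor(testdata$churn, levels = c(0, 1)))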
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1274  191
##          1    0   10
##                                           
##                Accuracy : 0.8705          
##                  95% CI : (0.8523, 0.8872)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : 0.2368          
##                                           
##                   Kappa : 0.0829          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.00000         
##             Specificity : 0.04975         
##          Pos Pred Value : 0.86962         
##          Neg Pred Value : 1.00000         
##              Prevalence : 0.86373         
##          Detection Rate : 0.86373         
##    Detection Prevalence : 0.99322         
##       Balanced Accuracy : 0.52488         
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(factor(predict2(.5), levels = c(0, 1)), factor(testdata$churn, levels = c(0, 1)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1270  117
##          1    4   84
##                                           
##                Accuracy : 0.918           
##                  95% CI : (0.9028, 0.9315)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : 6.187e-11       
##                                           
##                   Kappa : 0.5434          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9969          
##             Specificity : 0.4179          
##          Pos Pred Value : 0.9156          
##          Neg Pred Value : 0.9545          
##              Prevalence : 0.8637          
##          Detection Rate : 0.8610          
##    Detection Prevalence : 0.9403          
##       Balanced Accuracy : 0.7074          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(factor(predict3(.5), levels = c(0, 1)), factor(testdata$churn, levels = c(0, 1)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1259   45
##          1   15  156
##                                           
##                Accuracy : 0.9593          
##                  95% CI : (0.9479, 0.9688)
##     No Information Rate : 0.8637          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8156          
##  Mcnemar's Test P-Value : 0.0001812       
##                                           
##             Sensitivity : 0.9882          
##             Specificity : 0.7761          
##          Pos Pred Value : 0.9655          
##          Neg Pred Value : 0.9123          
##              Prevalence : 0.8637          
##          Detection Rate : 0.8536          
##    Detection Prevalence : 0.8841          
##       Balanced Accuracy : 0.8822          
##                                           
##        'Positive' Class : 0               
## 
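Note that caret treats class 0 (no churn) as the positive class in the three matrices above, so "Sensitivity" measures how well non-churners are identified. The share of actual churners each model catches is reported as "Specificity". The sketch below recomputes recall and precision for the churn class from the counts printed above.

```r
# Build a 2x2 confusion matrix: rows = prediction, columns = reference
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
}

# Counts copied from the confusion matrices above (filled column by column)
cm <- list(
  logistic      = make_cm(c(1274,  0, 191,  10)),
  svm           = make_cm(c(1270,  4, 117,  84)),
  random_forest = make_cm(c(1259, 15,  45, 156))
)

# Metrics with churn (class 1) treated as the class of interest
churn_metrics <- function(m) {
  c(recall    = m["1", "1"] / sum(m[, "1"]),   # churners caught
    precision = m["1", "1"] / sum(m["1", ]))   # flagged customers who churned
}

metrics <- t(sapply(cm, churn_metrics))
print(round(metrics, 4))
```

At the 0.5 threshold the logistic model catches only about 5% of churners despite its 87% accuracy, which is barely above the no-information rate of 86.4%. Passing `positive = "1"` to `confusionMatrix()` would report these quantities directly.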
# Plot the ROC curves of the three models
predict_1 <- prediction(testdata$Yhat1, testdata$churn)
predict_2 <- prediction(testdata$Yhat2, testdata$churn)
predict_3 <- prediction(testdata$Yhat3, testdata$churn)

performance1 <- performance(predict_1, "tpr", "fpr")
performance2 <- performance(predict_2, "tpr", "fpr")
performance3 <- performance(predict_3, "tpr", "fpr")

plot.new()
plot(performance1, col= "yellow")
plot(performance2, add = TRUE, col= "blue")
plot(performance3, add = TRUE, col= "green")
abline(0,1, col = "red")
title("ROC curve")
legend(.8, .4,c("Logistic", "SVM", "Random Forest"), 
       lty = c(1,1,1), 
       lwd = c(1.4, 1.4,1.4), col = c("yellow", "blue", "green"))

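The closer a curve hugs the top-left corner of the plot, the better the model separates churners from non-churners; the red diagonal corresponds to random guessing. By this criterion the random forest model is the best fit for the data. To quantify this, we compute the AUC (area under the ROC curve) for each model: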

accuracy_log <- performance(predict_1, "auc")
accuracy_svm <- performance(predict_2, "auc")
accuracy_rf <- performance(predict_3, "auc")
accuracy_log
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.6930458
## 
## 
## Slot "alpha.values":
## list()
accuracy_svm
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.9276967
## 
## 
## Slot "alpha.values":
## list()
accuracy_rf
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.9295575
## 
## 
## Slot "alpha.values":
## list()

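The random forest has the highest AUC, about 0.930, narrowly ahead of the SVM (0.928) and well ahead of logistic regression (0.693). Note that AUC is not an accuracy: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The random forest's accuracy at the 0.5 threshold was 95.9%, as shown in its confusion matrix above.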
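The ranking interpretation of AUC can be sketched in base R. The helper `auc_rank` below is a hypothetical name and is not part of ROCR. It uses the rank-sum (Mann-Whitney) identity and is demonstrated on made-up scores.

```r
# AUC via the rank-sum identity: the share of (churner, non-churner) pairs
# in which the churner receives the higher score (ties count as one half)
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up example: three churners (1) and three non-churners (0)
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1,   1,   1,    0,   0,   0)

# 8 of the 9 churner/non-churner pairs are ordered correctly
auc_toy <- auc_rank(scores, labels)
print(auc_toy)
```

Applied to the objects in this analysis, `auc_rank(testdata$Yhat3, testdata$churn)` should agree with the value ROCR reports for the random forest, assuming churn is coded 0/1.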

The random forest model can therefore be used to flag customers at high risk of churning and to support business decisions.