Source file ⇒ Churn_Analysis.Rmd
What is Churn?
In customer management, customer churn refers to a customer's decision to end the business relationship. It is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%.
It is important to predict which users are likely to churn and which factors affect their decisions. This analysis shows how a logistic regression model, support vector machines, and a random forest model can be used to identify customer churn in a telecom dataset.
The Data
The “churn” data set was developed to predict telecom customer churn based on information about their account. The data files state that the data are “artificial based on claims similar to real world”.
#Packages used in analysis
library(ggplot2)
library(reshape2)
library(corrplot)
library(e1071)
library(caret)
## Loading required package: lattice
library(rpart)
library(C50)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner,
## node_surv, node_terminal
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(dplyr)
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
#reading in data from C50 package
library(C50)
data(churn)
test <- churnTest
train <- churnTrain
mydata <- rbind(test,train)
#Transforming variables to numeric form
#as.integer() returns the factor level index (1 or 2); churn is recoded so that yes = 1, no = 0
mydata$churn <- as.integer(mydata$churn)
mydata$churn[mydata$churn == "1"] <- 1
mydata$churn[mydata$churn == "2"] <- 0
mydata$international_plan <-as.integer(mydata$international_plan)
mydata$international_plan[mydata$international_plan == "1"] <- 0
mydata$international_plan[mydata$international_plan == "2"] <- 1
mydata$voice_mail_plan <- as.integer(mydata$voice_mail_plan)
mydata$voice_mail_plan[mydata$voice_mail_plan == "1"] <- 0
mydata$voice_mail_plan[mydata$voice_mail_plan == "2"] <- 1
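The recoding above relies on the order of the factor levels, because as.integer() returns the level index rather than the label. A minimal sketch on a toy factor (the values and the "yes"-first level order are assumptions for illustration, not the real data) shows the difference and a label-based alternative:

```r
# Toy factor; the level order ("yes" first) is an assumption for illustration
plan <- factor(c("no", "yes", "no", "yes", "no"), levels = c("yes", "no"))

# as.integer() returns the level index, so the result depends on level order
level_index <- as.integer(plan)

# Comparing against the label gives 1 for "yes" and 0 for "no",
# whatever the level order is
plan_flag <- as.integer(plan == "yes")

data.frame(plan, level_index, plan_flag)
```

Comparing against the label avoids having to remember which level comes first.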
#Removing unwanted variables for analysis
mydata$state <- NULL
mydata$area_code <- NULL
#Remove observations with missing values from the dataset
mydata <- na.omit(mydata)
head(mydata)
## account_length international_plan voice_mail_plan number_vmail_messages
## 1 101 0 0 0
## 2 137 0 0 0
## 3 103 0 1 29
## 4 99 0 0 0
## 5 108 0 0 0
## 6 117 0 0 0
## total_day_minutes total_day_calls total_day_charge total_eve_minutes
## 1 70.9 123 12.05 211.9
## 2 223.6 86 38.01 244.8
## 3 294.7 95 50.10 237.3
## 4 216.8 123 36.86 126.4
## 5 197.4 78 33.56 124.0
## 6 226.5 85 38.51 141.6
## total_eve_calls total_eve_charge total_night_minutes total_night_calls
## 1 73 18.01 236.0 73
## 2 139 20.81 94.2 81
## 3 105 20.17 300.3 127
## 4 88 10.74 220.6 82
## 5 101 10.54 204.5 107
## 6 68 12.04 223.0 90
## total_night_charge total_intl_minutes total_intl_calls total_intl_charge
## 1 10.62 10.6 3 2.86
## 2 4.24 9.5 7 2.57
## 3 13.51 13.7 6 3.70
## 4 9.93 15.7 2 4.24
## 5 9.20 7.7 4 2.08
## 6 10.04 6.9 5 1.86
## number_customer_service_calls churn
## 1 3 0
## 2 0 0
## 3 1 0
## 4 1 0
## 5 2 0
## 6 1 0
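Before dropping rows it can help to see how many values are actually missing. The following is a small sketch on a made-up data frame (the toy values are hypothetical; only the column names mirror the churn data):

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(
  account_length    = c(101, NA, 103, 99),
  total_day_minutes = c(70.9, 223.6, NA, 216.8),
  churn             = c(0, 0, 0, 1)
)

# Number of missing values per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# na.omit() keeps only the rows with no missing values
toy_complete <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_complete)
rows_dropped
```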
#Now begin exploratory data analysis
#Summarize dataset
summary(mydata)
## account_length international_plan voice_mail_plan number_vmail_messages
## Min. : 1.0 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 73.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :100.0 Median :0.0000 Median :0.0000 Median : 0.000
## Mean :100.3 Mean :0.0946 Mean :0.2646 Mean : 7.755
## 3rd Qu.:127.0 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:17.000
## Max. :243.0 Max. :1.0000 Max. :1.0000 Max. :52.000
## total_day_minutes total_day_calls total_day_charge total_eve_minutes
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4
## Median :180.1 Median :100 Median :30.62 Median :201.0
## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6
## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1
## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7
## total_eve_calls total_eve_charge total_night_minutes total_night_calls
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 87.0 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00
## Median :100.0 Median :17.09 Median :200.4 Median :100.00
## Mean :100.2 Mean :17.05 Mean :200.4 Mean : 99.92
## 3rd Qu.:114.0 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00
## Max. :170.0 Max. :30.91 Max. :395.0 Max. :175.00
## total_night_charge total_intl_minutes total_intl_calls total_intl_charge
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 7.510 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median : 9.020 Median :10.30 Median : 4.000 Median :2.780
## Mean : 9.018 Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:10.560 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
## number_customer_service_calls churn
## Min. :0.00 Min. :0.0000
## 1st Qu.:1.00 1st Qu.:0.0000
## Median :1.00 Median :0.0000
## Mean :1.57 Mean :0.1414
## 3rd Qu.:2.00 3rd Qu.:0.0000
## Max. :9.00 Max. :1.0000
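Because churn is coded 0/1, its mean in the summary above is the overall churn rate. A short check, using the class counts that the C5.0 evaluation in this document reports (4293 customers in class 0 and 707 in class 1), also illustrates that churn and retention add up to 100%:

```r
# Class counts taken from the C5.0 evaluation output in this document
n_stayed  <- 4271 + 22    # class 0
n_churned <- 144 + 563    # class 1

# Rebuild a 0/1 churn vector with those counts
churn <- rep(c(0, 1), times = c(n_stayed, n_churned))

churn_rate     <- mean(churn)
retention_rate <- 1 - churn_rate

round(churn_rate, 4)       # 0.1414
round(retention_rate, 4)   # 0.8586
```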
sapply(mydata, sd)
## account_length international_plan
## 39.6945595 0.2926909
## voice_mail_plan number_vmail_messages
## 0.4411641 13.5463934
## total_day_minutes total_day_calls
## 53.8946992 19.8311974
## total_day_charge total_eve_minutes
## 9.1620687 50.5513090
## total_eve_calls total_eve_charge
## 19.8264958 4.2968433
## total_night_minutes total_night_calls
## 50.5277893 19.9586859
## total_night_charge total_intl_minutes
## 2.2737627 2.7613957
## total_intl_calls total_intl_charge
## 2.4567882 0.7455137
## number_customer_service_calls churn
## 1.3063633 0.3484685
cormatrix <- round(cor(mydata), digits = 2 )
cormatrix
## account_length international_plan
## account_length 1.00 0.01
## international_plan 0.01 1.00
## voice_mail_plan -0.01 0.01
## number_vmail_messages -0.01 0.01
## total_day_minutes 0.00 0.03
## total_day_calls 0.03 0.01
## total_day_charge 0.00 0.03
## total_eve_minutes -0.01 0.02
## total_eve_calls 0.01 0.00
## total_eve_charge -0.01 0.02
## total_night_minutes 0.00 -0.03
## total_night_calls -0.01 0.01
## total_night_charge 0.00 -0.03
## total_intl_minutes 0.00 0.03
## total_intl_calls 0.01 0.00
## total_intl_charge 0.00 0.03
## number_customer_service_calls 0.00 -0.01
## churn 0.02 0.26
## voice_mail_plan number_vmail_messages
## account_length -0.01 -0.01
## international_plan 0.01 0.01
## voice_mail_plan 1.00 0.95
## number_vmail_messages 0.95 1.00
## total_day_minutes 0.00 0.01
## total_day_calls 0.00 0.00
## total_day_charge 0.00 0.01
## total_eve_minutes 0.02 0.02
## total_eve_calls -0.01 0.00
## total_eve_charge 0.02 0.02
## total_night_minutes 0.01 0.01
## total_night_calls 0.01 0.00
## total_night_charge 0.01 0.01
## total_intl_minutes 0.00 0.00
## total_intl_calls -0.01 0.00
## total_intl_charge 0.00 0.00
## number_customer_service_calls -0.01 -0.01
## churn -0.11 -0.10
## total_day_minutes total_day_calls
## account_length 0.00 0.03
## international_plan 0.03 0.01
## voice_mail_plan 0.00 0.00
## number_vmail_messages 0.01 0.00
## total_day_minutes 1.00 0.00
## total_day_calls 0.00 1.00
## total_day_charge 1.00 0.00
## total_eve_minutes -0.01 0.00
## total_eve_calls 0.01 0.00
## total_eve_charge -0.01 0.00
## total_night_minutes 0.01 0.00
## total_night_calls 0.00 -0.01
## total_night_charge 0.01 0.00
## total_intl_minutes -0.02 0.01
## total_intl_calls 0.00 0.01
## total_intl_charge -0.02 0.01
## number_customer_service_calls 0.00 -0.01
## churn 0.21 0.02
## total_day_charge total_eve_minutes
## account_length 0.00 -0.01
## international_plan 0.03 0.02
## voice_mail_plan 0.00 0.02
## number_vmail_messages 0.01 0.02
## total_day_minutes 1.00 -0.01
## total_day_calls 0.00 0.00
## total_day_charge 1.00 -0.01
## total_eve_minutes -0.01 1.00
## total_eve_calls 0.01 0.00
## total_eve_charge -0.01 1.00
## total_night_minutes 0.01 -0.02
## total_night_calls 0.00 0.01
## total_night_charge 0.01 -0.02
## total_intl_minutes -0.02 0.00
## total_intl_calls 0.00 0.01
## total_intl_charge -0.02 0.00
## number_customer_service_calls 0.00 -0.01
## churn 0.21 0.09
## total_eve_calls total_eve_charge
## account_length 0.01 -0.01
## international_plan 0.00 0.02
## voice_mail_plan -0.01 0.02
## number_vmail_messages 0.00 0.02
## total_day_minutes 0.01 -0.01
## total_day_calls 0.00 0.00
## total_day_charge 0.01 -0.01
## total_eve_minutes 0.00 1.00
## total_eve_calls 1.00 0.00
## total_eve_charge 0.00 1.00
## total_night_minutes 0.00 -0.02
## total_night_calls -0.01 0.01
## total_night_charge 0.00 -0.02
## total_intl_minutes -0.01 0.00
## total_intl_calls 0.01 0.01
## total_intl_charge -0.01 0.00
## number_customer_service_calls 0.01 -0.01
## churn -0.01 0.09
## total_night_minutes total_night_calls
## account_length 0.00 -0.01
## international_plan -0.03 0.01
## voice_mail_plan 0.01 0.01
## number_vmail_messages 0.01 0.00
## total_day_minutes 0.01 0.00
## total_day_calls 0.00 -0.01
## total_day_charge 0.01 0.00
## total_eve_minutes -0.02 0.01
## total_eve_calls 0.00 -0.01
## total_eve_charge -0.02 0.01
## total_night_minutes 1.00 0.03
## total_night_calls 0.03 1.00
## total_night_charge 1.00 0.03
## total_intl_minutes -0.01 0.00
## total_intl_calls -0.02 0.00
## total_intl_charge -0.01 0.00
## number_customer_service_calls -0.01 -0.01
## churn 0.05 -0.01
## total_night_charge total_intl_minutes
## account_length 0.00 0.00
## international_plan -0.03 0.03
## voice_mail_plan 0.01 0.00
## number_vmail_messages 0.01 0.00
## total_day_minutes 0.01 -0.02
## total_day_calls 0.00 0.01
## total_day_charge 0.01 -0.02
## total_eve_minutes -0.02 0.00
## total_eve_calls 0.00 -0.01
## total_eve_charge -0.02 0.00
## total_night_minutes 1.00 -0.01
## total_night_calls 0.03 0.00
## total_night_charge 1.00 -0.01
## total_intl_minutes -0.01 1.00
## total_intl_calls -0.02 0.02
## total_intl_charge -0.01 1.00
## number_customer_service_calls -0.01 -0.01
## churn 0.05 0.06
## total_intl_calls total_intl_charge
## account_length 0.01 0.00
## international_plan 0.00 0.03
## voice_mail_plan -0.01 0.00
## number_vmail_messages 0.00 0.00
## total_day_minutes 0.00 -0.02
## total_day_calls 0.01 0.01
## total_day_charge 0.00 -0.02
## total_eve_minutes 0.01 0.00
## total_eve_calls 0.01 -0.01
## total_eve_charge 0.01 0.00
## total_night_minutes -0.02 -0.01
## total_night_calls 0.00 0.00
## total_night_charge -0.02 -0.01
## total_intl_minutes 0.02 1.00
## total_intl_calls 1.00 0.02
## total_intl_charge 0.02 1.00
## number_customer_service_calls -0.02 -0.01
## churn -0.05 0.06
## number_customer_service_calls churn
## account_length 0.00 0.02
## international_plan -0.01 0.26
## voice_mail_plan -0.01 -0.11
## number_vmail_messages -0.01 -0.10
## total_day_minutes 0.00 0.21
## total_day_calls -0.01 0.02
## total_day_charge 0.00 0.21
## total_eve_minutes -0.01 0.09
## total_eve_calls 0.01 -0.01
## total_eve_charge -0.01 0.09
## total_night_minutes -0.01 0.05
## total_night_calls -0.01 -0.01
## total_night_charge -0.01 0.05
## total_intl_minutes -0.01 0.06
## total_intl_calls -0.02 -0.05
## total_intl_charge -0.01 0.06
## number_customer_service_calls 1.00 0.21
## churn 0.21 1.00
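The matrix above is long, and the pairs worth noticing are the ones with correlation 1.00 (each minutes column and its matching charge column). A sketch of how such pairs could be pulled out automatically follows. It uses simulated data, and the 0.17 per-minute rate is an assumption inferred from the first rows shown earlier, not something the data files state:

```r
set.seed(1)
minutes <- runif(200, 0, 350)

# Simulated stand-in for three of the churn columns
toy <- data.frame(
  total_day_minutes = minutes,
  total_day_charge  = 0.17 * minutes,   # assumed flat per-minute rate
  total_day_calls   = rpois(200, 100)
)

cm <- cor(toy)

# Keep only the lower triangle so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Positions where the absolute correlation exceeds 0.9 (NA entries are skipped)
high <- which(abs(cm) > 0.9, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
high_pairs
```

Perfectly correlated predictors carry the same information, so in a regression model it is usually enough to keep one from each pair.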
plot.new()
plot(mydata$churn ~mydata$total_day_minutes)
title('Basic Scatterplot')
ggplot(mydata, aes(x = total_day_minutes)) + geom_histogram(binwidth = 1, fill = "white", color = "black")
#Randomly split data into train and test set
#About 70% will be assigned to the train set, about 30% will be assigned to the test set
set.seed(1234)
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(.7,.3))
traindata <- mydata[ind == 1,]
testdata <- mydata[ind == 2,]
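Note that sample(2, ..., prob = c(.7, .3)) assigns each row independently, so the split is only approximately 70/30. The sketch below compares it with an exact split made by sampling row indices; the row count of 5000 matches the data, while the variable names train_rows and test_rows are mine:

```r
set.seed(1234)
n <- 5000

# Independent assignment per row: proportions are close to 0.7 / 0.3, not exact
ind <- sample(2, n, replace = TRUE, prob = c(.7, .3))
table(ind) / n

# Exact alternative: draw 70% of the row indices without replacement
train_rows <- sample(n, size = round(0.7 * n))
test_rows  <- setdiff(seq_len(n), train_rows)

length(train_rows)   # 3500
length(test_rows)    # 1500
```

Either approach works for this analysis; the exact version just makes the set sizes predictable.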
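#Forward selection, starting from the intercept-only model
#Lower AIC indicates a better model
#Note: no family is given, so glm() uses its default (gaussian) for the fits below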
forward <- step(glm(churn ~ 1, data = traindata), direction = 'forward', scope = ~ account_length + international_plan + voice_mail_plan + number_vmail_messages + total_day_minutes + total_day_calls + total_day_charge + total_eve_minutes + total_eve_calls + total_eve_charge + total_night_minutes + total_night_calls + total_night_charge + total_intl_minutes + total_intl_calls + total_intl_charge + number_customer_service_calls)
## Start: AIC=2618.93
## churn ~ 1
##
## Df Deviance AIC
## + international_plan 1 404.63 2379.1
## + number_customer_service_calls 1 412.04 2443.0
## + total_day_minutes 1 417.70 2491.1
## + total_day_charge 1 417.70 2491.1
## + voice_mail_plan 1 427.13 2569.8
## + number_vmail_messages 1 428.28 2579.3
## + total_eve_minutes 1 430.31 2596.0
## + total_eve_charge 1 430.31 2596.0
## + total_intl_minutes 1 431.69 2607.2
## + total_intl_charge 1 431.69 2607.2
## + total_intl_calls 1 432.37 2612.8
## + total_night_charge 1 432.37 2612.8
## + total_night_minutes 1 432.37 2612.8
## + account_length 1 433.01 2618.0
## <none> 433.37 2618.9
## + total_eve_calls 1 433.23 2619.8
## + total_night_calls 1 433.30 2620.4
## + total_day_calls 1 433.34 2620.7
##
## Step: AIC=2379.08
## churn ~ international_plan
##
## Df Deviance AIC
## + number_customer_service_calls 1 382.89 2186.4
## + total_day_minutes 1 390.25 2253.5
## + total_day_charge 1 390.25 2253.5
## + voice_mail_plan 1 398.31 2325.6
## + number_vmail_messages 1 399.36 2334.9
## + total_eve_minutes 1 402.09 2358.9
## + total_eve_charge 1 402.09 2358.9
## + total_night_charge 1 403.48 2371.0
## + total_night_minutes 1 403.48 2371.0
## + total_intl_charge 1 403.50 2371.2
## + total_intl_minutes 1 403.50 2371.2
## + total_intl_calls 1 403.51 2371.3
## <none> 404.63 2379.1
## + total_eve_calls 1 404.41 2379.1
## + account_length 1 404.42 2379.2
## + total_night_calls 1 404.54 2380.3
## + total_day_calls 1 404.62 2381.0
##
## Step: AIC=2186.45
## churn ~ international_plan + number_customer_service_calls
##
## Df Deviance AIC
## + total_day_minutes 1 368.61 2054.4
## + total_day_charge 1 368.61 2054.4
## + voice_mail_plan 1 377.01 2133.8
## + number_vmail_messages 1 377.95 2142.7
## + total_eve_minutes 1 380.05 2162.2
## + total_eve_charge 1 380.06 2162.2
## + total_night_charge 1 381.62 2176.8
## + total_night_minutes 1 381.63 2176.8
## + total_intl_charge 1 381.68 2177.2
## + total_intl_minutes 1 381.68 2177.2
## + total_intl_calls 1 382.09 2181.0
## + total_eve_calls 1 382.63 2186.1
## + account_length 1 382.64 2186.2
## <none> 382.89 2186.4
## + total_night_calls 1 382.82 2187.7
## + total_day_calls 1 382.86 2188.2
##
## Step: AIC=2054.41
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes
##
## Df Deviance AIC
## + voice_mail_plan 1 362.60 1998.5
## + number_vmail_messages 1 363.57 2007.9
## + total_eve_minutes 1 365.55 2027.0
## + total_eve_charge 1 365.55 2027.0
## + total_intl_charge 1 367.05 2041.5
## + total_intl_minutes 1 367.05 2041.5
## + total_night_charge 1 367.49 2045.7
## + total_night_minutes 1 367.49 2045.7
## + total_intl_calls 1 367.69 2047.6
## + total_day_charge 1 368.23 2052.8
## + total_eve_calls 1 368.34 2053.9
## + account_length 1 368.40 2054.4
## <none> 368.61 2054.4
## + total_night_calls 1 368.50 2055.4
## + total_day_calls 1 368.59 2056.2
##
## Step: AIC=1998.52
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan
##
## Df Deviance AIC
## + total_eve_minutes 1 359.28 1968.1
## + total_eve_charge 1 359.28 1968.1
## + total_intl_charge 1 360.92 1984.2
## + total_intl_minutes 1 360.93 1984.2
## + total_night_charge 1 361.48 1989.6
## + total_night_minutes 1 361.48 1989.6
## + total_intl_calls 1 361.69 1991.6
## + total_day_charge 1 362.24 1996.9
## + total_eve_calls 1 362.32 1997.7
## <none> 362.60 1998.5
## + account_length 1 362.44 1998.9
## + number_vmail_messages 1 362.50 1999.5
## + total_night_calls 1 362.52 1999.7
## + total_day_calls 1 362.58 2000.3
##
## Step: AIC=1968.11
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan + total_eve_minutes
##
## Df Deviance AIC
## + total_intl_charge 1 357.51 1952.6
## + total_intl_minutes 1 357.51 1952.6
## + total_night_charge 1 358.11 1958.6
## + total_night_minutes 1 358.11 1958.6
## + total_intl_calls 1 358.38 1961.2
## + total_day_charge 1 358.89 1966.2
## + total_eve_calls 1 359.00 1967.3
## <none> 359.28 1968.1
## + account_length 1 359.11 1968.4
## + number_vmail_messages 1 359.15 1968.8
## + total_night_calls 1 359.19 1969.2
## + total_day_calls 1 359.26 1969.9
## + total_eve_charge 1 359.28 1970.1
##
## Step: AIC=1952.63
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan + total_eve_minutes +
## total_intl_charge
##
## Df Deviance AIC
## + total_night_charge 1 356.32 1942.9
## + total_night_minutes 1 356.32 1942.9
## + total_intl_calls 1 356.56 1945.3
## + total_day_charge 1 357.11 1950.7
## + total_eve_calls 1 357.22 1951.8
## <none> 357.51 1952.6
## + account_length 1 357.35 1953.1
## + number_vmail_messages 1 357.36 1953.2
## + total_night_calls 1 357.42 1953.7
## + total_day_calls 1 357.49 1954.4
## + total_eve_charge 1 357.50 1954.6
## + total_intl_minutes 1 357.50 1954.6
##
## Step: AIC=1942.86
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan + total_eve_minutes +
## total_intl_charge + total_night_charge
##
## Df Deviance AIC
## + total_intl_calls 1 355.40 1935.8
## + total_day_charge 1 355.92 1940.9
## + total_eve_calls 1 356.03 1942.0
## <none> 356.32 1942.9
## + account_length 1 356.15 1943.2
## + number_vmail_messages 1 356.15 1943.2
## + total_night_calls 1 356.20 1943.7
## + total_night_minutes 1 356.25 1944.2
## + total_day_calls 1 356.29 1944.6
## + total_eve_charge 1 356.31 1944.8
## + total_intl_minutes 1 356.32 1944.8
##
## Step: AIC=1935.79
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan + total_eve_minutes +
## total_intl_charge + total_night_charge + total_intl_calls
##
## Df Deviance AIC
## + total_day_charge 1 354.99 1933.7
## + total_eve_calls 1 355.12 1935.0
## <none> 355.40 1935.8
## + number_vmail_messages 1 355.22 1936.0
## + account_length 1 355.23 1936.1
## + total_night_calls 1 355.28 1936.6
## + total_night_minutes 1 355.32 1937.0
## + total_day_calls 1 355.38 1937.5
## + total_intl_minutes 1 355.40 1937.8
## + total_eve_charge 1 355.40 1937.8
##
## Step: AIC=1933.7
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan + total_eve_minutes +
## total_intl_charge + total_night_charge + total_intl_calls +
## total_day_charge
##
## Df Deviance AIC
## + total_eve_calls 1 354.72 1933.0
## <none> 354.99 1933.7
## + number_vmail_messages 1 354.82 1934.0
## + account_length 1 354.82 1934.0
## + total_night_calls 1 354.88 1934.6
## + total_night_minutes 1 354.91 1934.9
## + total_day_calls 1 354.96 1935.5
## + total_eve_charge 1 354.99 1935.7
## + total_intl_minutes 1 354.99 1935.7
##
## Step: AIC=1933
## churn ~ international_plan + number_customer_service_calls +
## total_day_minutes + voice_mail_plan + total_eve_minutes +
## total_intl_charge + total_night_charge + total_intl_calls +
## total_day_charge + total_eve_calls
##
## Df Deviance AIC
## <none> 354.72 1933.0
## + number_vmail_messages 1 354.54 1933.2
## + account_length 1 354.55 1933.3
## + total_night_calls 1 354.60 1933.8
## + total_night_minutes 1 354.63 1934.2
## + total_day_calls 1 354.69 1934.8
## + total_eve_charge 1 354.72 1935.0
## + total_intl_minutes 1 354.72 1935.0
The stepwise output above shows, at each step, which variable lowers the AIC the most. For further analysis I fit a logistic regression on total_day_charge, number_vmail_messages, total_intl_charge and total_eve_minutes.
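For a 0/1 outcome, the same forward search can also be run on logistic fits by passing family = binomial to the starting model. This is a minimal sketch on simulated data (the variables x1, x2 and noise and their effect sizes are made up for illustration), not a rerun of the analysis above:

```r
set.seed(42)
n <- 500

# Simulated predictors: x1 and x2 drive the outcome, noise does not
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x1 - 1 * toy$x2))

# Start from the intercept-only logistic model
null_fit <- glm(y ~ 1, data = toy, family = binomial)

# Add one variable at a time while the AIC keeps dropping
fwd <- step(null_fit, direction = "forward",
            scope = ~ x1 + x2 + noise, trace = 0)

formula(fwd)
selected <- attr(terms(fwd), "term.labels")
selected
```

Setting trace = 0 hides the step-by-step tables; leave it at the default to see output like the listing above.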
logit <- glm(churn ~ total_day_charge + number_vmail_messages+ total_intl_charge + total_eve_minutes, data = traindata, family = "binomial")
summary(logit)
##
## Call:
## glm(formula = churn ~ total_day_charge + number_vmail_messages +
## total_intl_charge + total_eve_minutes, family = "binomial",
## data = traindata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0809 -0.5949 -0.4701 -0.3226 2.9249
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.718786 0.366600 -15.600 < 2e-16 ***
## total_day_charge 0.065126 0.005710 11.405 < 2e-16 ***
## number_vmail_messages -0.030606 0.004593 -6.664 2.67e-11 ***
## total_intl_charge 0.306952 0.068434 4.485 7.28e-06 ***
## total_eve_minutes 0.005580 0.000990 5.636 1.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2900.0 on 3524 degrees of freedom
## Residual deviance: 2670.5 on 3520 degrees of freedom
## AIC: 2680.5
##
## Number of Fisher Scoring iterations: 5
#evaluate model's fit and performance
influenceIndexPlot(logit, vars = c('Cook', "hat"), id.n =4)
# Confidence interval using log-likelihood
confint(logit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -6.446119939 -5.008602987
## total_day_charge 0.054020111 0.076412727
## number_vmail_messages -0.039871089 -0.021842583
## total_intl_charge 0.173322675 0.441655939
## total_eve_minutes 0.003646016 0.007528171
exp(logit$coefficients)
## (Intercept) total_day_charge number_vmail_messages
## 0.003283696 1.067293484 0.969857536
## total_intl_charge total_eve_minutes
## 1.359275814 1.005595136
exp(confint(logit)) #odds ratio
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.001586667 0.006680229
## total_day_charge 1.055505829 1.079407983
## number_vmail_messages 0.960913304 0.978394238
## total_intl_charge 1.189249785 1.555280536
## total_eve_minutes 1.003652671 1.007556579
The odds ratio answers the question: how do the odds of an outcome change when a variable changes by one unit? For example, each one-unit increase in total_intl_charge multiplies the odds of churning (leaving the company or business) by about 1.36, which is a 36% increase in the odds (95% CI: roughly 19% to 56%), holding the other variables fixed.
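The arithmetic behind that reading can be reproduced by hand. The sketch below copies the total_intl_charge coefficient and its confidence limits from the output above and converts them into percentage changes in the odds:

```r
# Coefficient and 95% confidence limits for total_intl_charge,
# copied from the summary(logit) and confint(logit) output above
beta <- 0.306952
ci   <- c(lower = 0.173322675, upper = 0.441655939)

# Exponentiating a logit coefficient gives the odds ratio
odds_ratio <- exp(beta)
round(odds_ratio, 3)                # 1.359

# Percentage change in the odds per one-unit increase
pct_change <- (odds_ratio - 1) * 100
round(pct_change, 1)                # 35.9

# Same conversion for the confidence limits
pct_ci <- (exp(ci) - 1) * 100
round(pct_ci, 1)                    # lower 18.9, upper 55.5
```

These are changes in the odds, not in the probability of churning; the change in probability depends on the customer's starting point.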
# Making a support vector machine (another prediction model)
svm_model <- svm(churn ~., data= traindata, gamma = .1, cost =1)
print(svm_model)
##
## Call:
## svm(formula = churn ~ ., data = traindata, gamma = 0.1, cost = 1)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1
## epsilon: 0.1
##
##
## Number of Support Vectors: 2048
summary(svm_model)
##
## Call:
## svm(formula = churn ~ ., data = traindata, gamma = 0.1, cost = 1)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1
## epsilon: 0.1
##
##
## Number of Support Vectors: 2048
#Random forest model- takes decision trees and averages them
rf <- randomForest(churn ~., data= traindata, ntree = 500, mtry = 5, importance = TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
print(rf)
##
## Call:
## randomForest(formula = churn ~ ., data = traindata, ntree = 500, mtry = 5, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.04123458
## % Var explained: 66.46
importance(rf)
## %IncMSE IncNodePurity
## account_length -0.1711648 11.157122
## international_plan 97.3564559 34.104464
## voice_mail_plan 23.5721931 8.742515
## number_vmail_messages 24.8403757 12.176776
## total_day_minutes 37.6210656 58.161602
## total_day_calls 0.5055798 10.501387
## total_day_charge 40.3613063 60.283992
## total_eve_minutes 24.7559703 27.629199
## total_eve_calls -3.2696250 9.763455
## total_eve_charge 24.7693715 26.977522
## total_night_minutes 18.8966563 15.916180
## total_night_calls -0.7744242 10.023966
## total_night_charge 18.5910094 15.410549
## total_intl_minutes 27.2684694 16.153068
## total_intl_calls 52.5423688 23.100111
## total_intl_charge 27.5696348 16.361977
## number_customer_service_calls 118.8471276 56.515437
plot.new()
varImpPlot(rf, type = 1, pch = 17, col = 1, cex = 1.0, main = "")
abline(v= 45, col= "red")
To the right of the red line are the variables number of customer service calls, international plan, and total international calls. According to this model, these are the most important factors in determining customer churn. Intuitively, this makes sense: a customer who has to make many customer service calls to resolve an issue would likely become frustrated and take their business elsewhere.
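The %IncMSE column is a permutation-based measure: it reflects how much the prediction error grows when a variable's values are shuffled. The idea can be sketched in base R with a plain linear model on simulated data. This is a simplified illustration of the permutation idea, not randomForest's own computation, and the variables strong, weak and noise are made up:

```r
set.seed(7)
n <- 400

# Simulated data: one strong predictor, one weak predictor, one pure noise column
toy <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$strong + 0.5 * toy$weak + rnorm(n)

train <- toy[1:300, ]
test  <- toy[301:400, ]
fit   <- lm(y ~ strong + weak + noise, data = train)

# Mean squared error of a model on a data set
mse <- function(model, data) mean((data$y - predict(model, data))^2)
base_mse <- mse(fit, test)

# Shuffle one column at a time and record how much the error increases
perm_increase <- sapply(c("strong", "weak", "noise"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and y
  mse(fit, shuffled) - base_mse
})

sort(perm_increase, decreasing = TRUE)
```

A variable the model leans on heavily produces a large increase, which is why the variables to the right of the red line stand out.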
mydata$churn <- as.factor(mydata$churn)
#algorithm for decision tree
tree <- C5.0(churn ~., data = mydata)
summary(tree)
##
## Call:
## C5.0.formula(formula = churn ~ ., data = mydata)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Sep 3 14:47:37 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5000 cases (18 attributes) from undefined.data
##
## Decision tree:
##
## number_customer_service_calls > 3:
## :...total_day_minutes <= 160.2:
## : :...total_eve_charge <= 19.83: 1 (113/4)
## : : total_eve_charge > 19.83:
## : : :...total_day_minutes <= 134.5: 1 (17/1)
## : : total_day_minutes > 134.5: 0 (15/3)
## : total_day_minutes > 160.2:
## : :...international_plan > 0:
## : :...number_customer_service_calls > 4: 1 (6)
## : : number_customer_service_calls <= 4:
## : : :...total_intl_calls <= 2: 1 (5)
## : : total_intl_calls > 2:
## : : :...total_intl_charge <= 3.56: 0 (12/1)
## : : total_intl_charge > 3.56: 1 (2)
## : international_plan <= 0:
## : :...total_day_minutes > 263.4:
## : :...voice_mail_plan > 0: 0 (5)
## : : voice_mail_plan <= 0:
## : : :...total_eve_minutes <= 184.9: 0 (4/1)
## : : total_eve_minutes > 184.9: 1 (13)
## : total_day_minutes <= 263.4:
## : :...total_eve_charge <= 13.22:
## : :...total_day_minutes <= 197.2: 1 (16/1)
## : : total_day_minutes > 197.2: 0 (21/5)
## : total_eve_charge > 13.22:
## : :...total_day_minutes <= 185.7:
## : :...total_eve_minutes > 216.9: 0 (24)
## : : total_eve_minutes <= 216.9:
## : : :...total_night_minutes <= 172.9: 1 (15/2)
## : : total_night_minutes > 172.9: 0 (22/5)
## : total_day_minutes > 185.7:
## : :...total_night_charge <= 11.41: 0 (95/1)
## : total_night_charge > 11.41:
## : :...total_intl_minutes <= 9.9: 1 (7/1)
## : total_intl_minutes > 9.9: 0 (7)
## number_customer_service_calls <= 3:
## :...total_day_minutes > 245.1:
## :...voice_mail_plan > 0: 0 (121/8)
## : voice_mail_plan <= 0:
## : :...total_eve_minutes <= 201:
## : :...total_day_minutes <= 277.7:
## : : :...international_plan <= 0: 0 (118/11)
## : : : international_plan > 0:
## : : : :...total_intl_calls <= 2: 1 (6)
## : : : total_intl_calls > 2: 0 (14/3)
## : : total_day_minutes > 277.7:
## : : :...total_night_charge > 9.31: 1 (27)
## : : total_night_charge <= 9.31:
## : : :...total_eve_minutes <= 152.7: 0 (13)
## : : total_eve_minutes > 152.7:
## : : :...account_length <= 69: 0 (2)
## : : account_length > 69: 1 (17/1)
## : total_eve_minutes > 201:
## : :...total_night_charge > 8.54: 1 (114/3)
## : total_night_charge <= 8.54:
## : :...total_day_minutes <= 264.7:
## : :...total_eve_minutes <= 242.4: 0 (20/1)
## : : total_eve_minutes > 242.4: 1 (18/6)
## : total_day_minutes > 264.7:
## : :...total_night_minutes > 128.5: 1 (36)
## : total_night_minutes <= 128.5:
## : :...total_day_minutes <= 277: 0 (4)
## : total_day_minutes > 277: 1 (4)
## total_day_minutes <= 245.1:
## :...international_plan > 0:
## :...total_intl_calls <= 2: 1 (68)
## : total_intl_calls > 2:
## : :...total_intl_minutes <= 13: 0 (239/5)
## : total_intl_minutes > 13: 1 (59)
## international_plan <= 0:
## :...total_day_minutes <= 221.8: 0 (3288/86)
## total_day_minutes > 221.8:
## :...total_eve_charge > 22.7:
## :...voice_mail_plan <= 0: 1 (34/3)
## : voice_mail_plan > 0: 0 (8)
## total_eve_charge <= 22.7:
## :...voice_mail_plan > 0: 0 (111/2)
## voice_mail_plan <= 0:
## :...total_eve_minutes <= 234.2: 0 (247/11)
## total_eve_minutes > 234.2:
## :...total_intl_charge > 3.48: 1 (3)
## total_intl_charge <= 3.48:
## :...total_night_charge <= 9.16: 0 (18)
## total_night_charge > 9.16:
## :...total_day_minutes <= 237.8: 0 (7/1)
## total_day_minutes > 237.8: 1 (5)
##
##
## Evaluation on training data (5000 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 44 166( 3.3%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4271 22 (a): class 0
## 144 563 (b): class 1
##
##
## Attribute usage:
##
## 100.00% total_day_minutes
## 100.00% number_customer_service_calls
## 89.58% international_plan
## 19.38% voice_mail_plan
## 15.70% total_eve_charge
## 15.02% total_eve_minutes
## 8.10% total_intl_calls
## 7.88% total_night_charge
## 6.24% total_intl_minutes
## 1.62% total_night_minutes
## 0.94% total_intl_charge
## 0.38% account_length
##
##
## Time: 0.1 secs
results <- C5.0(churn ~., data = mydata, rules = TRUE)
summary(results)
##
## Call:
## C5.0.formula(formula = churn ~ ., data = mydata, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Sep 3 14:47:37 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 5000 cases (18 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (239/5, lift 1.1)
## international_plan > 0
## total_day_minutes <= 245.1
## total_intl_minutes <= 13
## total_intl_calls > 2
## number_customer_service_calls <= 3
## -> class 0 [0.975]
##
## Rule 2: (3288/86, lift 1.1)
## international_plan <= 0
## total_day_minutes <= 221.8
## number_customer_service_calls <= 3
## -> class 0 [0.974]
##
## Rule 3: (210/10, lift 1.1)
## total_day_minutes > 134.5
## total_day_minutes <= 160.2
## total_eve_charge > 19.83
## -> class 0 [0.948]
##
## Rule 4: (3201/495, lift 1.0)
## total_day_minutes > 160.2
## -> class 0 [0.845]
##
## Rule 5: (89, lift 7.0)
## international_plan > 0
## total_intl_calls <= 2
## -> class 1 [0.989]
##
## Rule 6: (79, lift 7.0)
## international_plan > 0
## total_intl_minutes > 13
## -> class 1 [0.988]
##
## Rule 7: (55, lift 6.9)
## voice_mail_plan <= 0
## total_day_minutes > 237.8
## total_eve_minutes > 234.2
## total_night_charge > 9.16
## -> class 1 [0.982]
##
## Rule 8: (55, lift 6.9)
## voice_mail_plan <= 0
## total_day_minutes > 277.7
## total_night_charge > 9.31
## number_customer_service_calls <= 3
## -> class 1 [0.982]
##
## Rule 9: (82/1, lift 6.9)
## account_length > 69
## voice_mail_plan <= 0
## total_day_minutes > 277.7
## total_eve_minutes > 152.7
## number_customer_service_calls <= 3
## -> class 1 [0.976]
##
## Rule 10: (114/3, lift 6.8)
## voice_mail_plan <= 0
## total_day_minutes > 245.1
## total_eve_minutes > 201
## total_night_charge > 8.54
## number_customer_service_calls <= 3
## -> class 1 [0.966]
##
## Rule 11: (130/5, lift 6.8)
## international_plan <= 0
## voice_mail_plan <= 0
## total_day_minutes > 263.4
## total_eve_minutes > 184.9
## -> class 1 [0.955]
##
## Rule 12: (39/1, lift 6.7)
## total_day_minutes <= 197.2
## total_eve_charge <= 13.22
## number_customer_service_calls > 3
## -> class 1 [0.951]
##
## Rule 13: (54/2, lift 6.7)
## total_day_minutes <= 185.7
## total_eve_minutes <= 216.9
## total_night_minutes <= 172.9
## number_customer_service_calls > 3
## -> class 1 [0.946]
##
## Rule 14: (63/3, lift 6.6)
## voice_mail_plan <= 0
## total_day_minutes > 221.8
## total_eve_charge > 22.7
## number_customer_service_calls <= 3
## -> class 1 [0.938]
##
## Rule 15: (90/6, lift 6.5)
## voice_mail_plan <= 0
## total_day_minutes > 245.1
## total_eve_minutes > 242.4
## -> class 1 [0.924]
##
## Rule 16: (6, lift 6.2)
## international_plan > 0
## total_day_minutes > 160.2
## number_customer_service_calls > 4
## -> class 1 [0.875]
##
## Rule 17: (8/1, lift 5.7)
## international_plan <= 0
## total_day_minutes > 185.7
## total_eve_charge > 13.22
## total_night_charge > 11.41
## total_intl_minutes <= 9.9
## number_customer_service_calls > 3
## -> class 1 [0.800]
##
## Rule 18: (399/198, lift 3.6)
## number_customer_service_calls > 3
## -> class 1 [0.504]
##
## Default class: 0
##
##
## Evaluation on training data (5000 cases):
##
## Rules
## ----------------
## No Errors
##
## 18 159( 3.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 4267 26 (a): class 0
## 133 574 (b): class 1
##
##
## Attribute usage:
##
## 97.50% total_day_minutes
## 82.96% number_customer_service_calls
## 76.54% international_plan
## 6.56% total_intl_calls
## 6.52% total_intl_minutes
## 6.40% total_eve_charge
## 5.88% total_eve_minutes
## 5.70% voice_mail_plan
## 3.26% total_night_charge
## 1.64% account_length
## 1.08% total_night_minutes
##
##
## Time: 0.1 secs
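There are many rules here, but let’s look at rule 9. It says that a customer with an account length above 69, no voice mail plan, more than 277.7 total day minutes, more than 152.7 total evening minutes, and at most 3 customer service calls is likely to leave the company. The rule covers 82 training cases and misclassifies only one of them. This can be used to derive business insight.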
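The confidence in square brackets and the lift printed with each rule can be reproduced from the coverage counts. C5.0 reports (n/m) as the number of training cases a rule covers and the number it misclassifies, estimates confidence with the Laplace ratio (n - m + 1)/(n + 2), and divides that by the relative frequency of the predicted class to obtain lift. A minimal sketch, where `rule_confidence` and `rule_lift` are helper names introduced here for illustration and the class counts are read off the training confusion matrix in the output above:

```r
# Laplace-corrected confidence of a rule covering n cases with m errors
rule_confidence <- function(n, m = 0) (n - m + 1) / (n + 2)

# Lift: rule confidence relative to the base rate of the predicted class
rule_lift <- function(n, m, class_count, total) {
  rule_confidence(n, m) / (class_count / total)
}

# Rule 9 covers 82 cases with 1 error and predicts class 1 (churn)
conf_rule9 <- rule_confidence(82, 1)

# 707 of the 5000 training cases are churners (133 + 574)
lift_rule9 <- rule_lift(82, 1, class_count = 707, total = 5000)

round(conf_rule9, 3)  # 0.976
round(lift_rule9, 1)  # 6.9
```

The same arithmetic explains why rule 18, which covers 399 cases but misclassifies 198 of them, carries a confidence of only 0.504.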
# Check which models perform better than the others
logistic_model <- predict(logit, testdata, type = "response")
svm_predict <- predict(svm_model, testdata, type = "response")
rf_predict <- predict(rf, testdata, type = "response")
testdata$Yhat1 <- logistic_model
testdata$Yhat2 <- svm_predict
testdata$Yhat3 <- rf_predict
# Threshold helpers: classify a customer as churn (1) when the predicted score exceeds x
predict1 <- function(x) ifelse(logistic_model > x, 1, 0)
predict2 <- function(x) ifelse(svm_predict > x, 1, 0)
predict3 <- function(x) ifelse(rf_predict > x, 1, 0)
confusionMatrix(predict1(.5), testdata$churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1274 191
## 1 0 10
##
## Accuracy : 0.8705
## 95% CI : (0.8523, 0.8872)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : 0.2368
##
## Kappa : 0.0829
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.00000
## Specificity : 0.04975
## Pos Pred Value : 0.86962
## Neg Pred Value : 1.00000
## Prevalence : 0.86373
## Detection Rate : 0.86373
## Detection Prevalence : 0.99322
## Balanced Accuracy : 0.52488
##
## 'Positive' Class : 0
##
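At a 0.5 cutoff the logistic model flags only 10 test customers as churners. Lowering the cutoff catches more churners at the cost of more false alarms. A sketch of that trade-off on made-up scores (the vectors `score` and `actual` and the helper `churn_metrics` are illustrative and not taken from the telecom data):

```r
# Made-up predicted churn probabilities and true labels (1 = churn)
score  <- c(0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80)
actual <- c(0,    0,    0,    1,    0,    1,    1,    0,    1,    1)

# Recall and precision for the churn class at a given cutoff
churn_metrics <- function(cutoff) {
  predicted <- ifelse(score > cutoff, 1, 0)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(cutoff = cutoff, recall = tp / (tp + fn), precision = tp / (tp + fp))
}

# One row per cutoff
cutoff_table <- t(sapply(c(0.5, 0.3, 0.15), churn_metrics))
cutoff_table
```

Which cutoff is appropriate depends on the relative cost of a missed churner versus an unnecessary retention offer, which the data alone does not determine.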
confusionMatrix(predict2(.5), testdata$churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1270 117
## 1 4 84
##
## Accuracy : 0.918
## 95% CI : (0.9028, 0.9315)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : 6.187e-11
##
## Kappa : 0.5434
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9969
## Specificity : 0.4179
## Pos Pred Value : 0.9156
## Neg Pred Value : 0.9545
## Prevalence : 0.8637
## Detection Rate : 0.8610
## Detection Prevalence : 0.9403
## Balanced Accuracy : 0.7074
##
## 'Positive' Class : 0
##
confusionMatrix(predict3(.5), testdata$churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1259 45
## 1 15 156
##
## Accuracy : 0.9593
## 95% CI : (0.9479, 0.9688)
## No Information Rate : 0.8637
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8156
## Mcnemar's Test P-Value : 0.0001812
##
## Sensitivity : 0.9882
## Specificity : 0.7761
## Pos Pred Value : 0.9655
## Neg Pred Value : 0.9123
## Prevalence : 0.8637
## Detection Rate : 0.8536
## Detection Prevalence : 0.8841
## Balanced Accuracy : 0.8822
##
## 'Positive' Class : 0
##
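Because the positive class in these tables is 0 (non-churn), the figures that matter most for churn appear under "Specificity" and "Neg Pred Value" and are easy to miss. A short sketch that recomputes accuracy, recall and precision for class 1 directly from the counts printed above (the helper `churn_summary` is a name introduced here):

```r
# Counts copied from the three confusion matrices
# (rows = prediction 0/1, columns = reference 0/1; matrix() fills by column)
cm <- list(
  logistic      = matrix(c(1274, 0, 191, 10),  nrow = 2),
  svm           = matrix(c(1270, 4, 117, 84),  nrow = 2),
  random_forest = matrix(c(1259, 15, 45, 156), nrow = 2)
)

# Accuracy, plus recall and precision for the churn class (second row/column)
churn_summary <- function(m) {
  c(accuracy  = sum(diag(m)) / sum(m),
    recall    = m[2, 2] / sum(m[, 2]),
    precision = m[2, 2] / sum(m[2, ]))
}

summary_table <- sapply(cm, churn_summary)
round(summary_table, 3)
```

Read this way, the logistic model finds about 5% of the 201 churners in the test set, the SVM about 42%, and the random forest about 78%.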
# Plot the ROC curves of the three models
predict_1 <- prediction(testdata$Yhat1, testdata$churn)
predict_2 <- prediction(testdata$Yhat2, testdata$churn)
predict_3 <- prediction(testdata$Yhat3, testdata$churn)
performance1 <- performance(predict_1, "tpr", "fpr")
performance2 <- performance(predict_2, "tpr", "fpr")
performance3 <- performance(predict_3, "tpr", "fpr")
plot.new()
plot(performance1, col= "yellow")
plot(performance2, add = TRUE, col= "blue")
plot(performance3, add = TRUE, col= "green")
abline(0,1, col = "red")
title("ROC curve")
legend(.8, .4,c("Logistic", "SVM", "Random Forest"),
lty = c(1,1,1),
lwd = c(1.4, 1.4,1.4), col = c("yellow", "blue", "green"))
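A good classifier has a ROC curve that hugs the top-left corner of the plot, far from the red diagonal that represents random guessing. By that criterion, the random forest model is the best fit for the data. To quantify this, we compute the AUC (area under the ROC curve) for each model: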
accuracy_log <- performance(predict_1, "auc")
accuracy_svm <- performance(predict_2, "auc")
accuracy_rf <- performance(predict_3, "auc")
accuracy_log
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.6930458
##
##
## Slot "alpha.values":
## list()
accuracy_svm
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.9276967
##
##
## Slot "alpha.values":
## list()
accuracy_rf
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.9295575
##
##
## Slot "alpha.values":
## list()
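The random forest has the highest AUC, about 0.930, narrowly ahead of the SVM (0.928) and far ahead of logistic regression (0.693). Note that AUC measures how well a model ranks churners above non-churners; it is not an accuracy figure. At the 0.5 threshold, the random forest's accuracy on the test set is 95.9%, and it identifies 77.6% of the actual churners.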
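AUC has a direct interpretation: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A minimal base-R sketch of that definition on made-up data (the function `auc_by_pairs` is a name introduced here, and ROCR's internal computation may be implemented differently):

```r
# AUC as the share of (churner, non-churner) pairs ranked correctly;
# tied pairs count as half
auc_by_pairs <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  diffs <- outer(pos, neg, "-")  # every churner score minus every non-churner score
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Made-up scores and labels (1 = churn), not taken from the telecom data
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1)
label <- c(1,   1,   1,   0,   0,   0,   0)

toy_auc <- auc_by_pairs(score, label)
toy_auc
```

For the fitted models, the scalar AUC can be read from the ROCR object's `y.values` slot, for example `accuracy_rf@y.values[[1]]`.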
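The random forest model can therefore be used to flag customers who are likely to churn and to support business decisions aimed at retaining them.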