Churn prediction is one of the most popular big-data use cases in business: detecting customers who are likely to cancel a subscription to a service. We want to predict, for each current customer, the answer to the question "Is this customer going to leave us within the next X months?" There are only two possible answers, yes or no, which makes this a binary classification task: the input is a customer and the output is the answer to the question. Being able to predict churn from customer data has proven extremely valuable to large telecom companies.
The telecom company Ding Dong is facing an uphill battle to retain its customers, with an annual churn rate of around 14% for 2014. The data set consists of 5,000 customers and a number of features that help predict churn. You, as a budding data scientist, are charged with using advanced analytics to predict churners so the company can retain them.
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
##     last_plot
## The following object is masked from 'package:stats':
##
##     filter
## The following object is masked from 'package:graphics':
##
##     layout
The data set consists of the following variables:
1. state
2. account_length
3. area_code
4. international_plan
5. voice_mail_plan
6. number_vmail_messages
7. total_day_minutes
8. total_day_calls
9. total_day_charge
10. total_eve_minutes
11. total_eve_calls
12. total_eve_charge
13. total_night_minutes
14. total_night_calls
15. total_night_charge
16. total_intl_minutes
17. total_intl_calls
18. total_intl_charge
19. number_customer_service_calls
20. churn
Here we divide the data set into two parts, categorical variables and continuous variables, and examine each group's relation to the churn variable.
library(caret)
## Loading required package: lattice
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:plotly':
##
## rename
Data=Data[,-2]                 # drop account_length
Cont_Data=Data[,-c(1,2,3,4)]   # continuous variables plus churn
Cat_Data=Data[,c(1,2,3,4,19)]  # categorical variables plus churn
Cont_melt=melt(Cont_Data)
## Using churn as id variables
ggplot(Cont_melt, aes(x = churn, y = value, fill = variable)) + geom_boxplot() + facet_wrap(~variable)
library(plyr)
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:reshape':
##
## rename, round_any
## The following objects are masked from 'package:plotly':
##
## arrange, mutate, rename, summarise
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:reshape':
##
## melt
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 3.3.3
Data=data.table(Data)
Reason_areacode=Data[, count(churn), by = area_code]
sjp.xtab(Data$churn,
         Data$area_code,
         bar.pos = "dodge",
         show.total = FALSE)
1. International plan
sjp.xtab(Data$churn,
         Data$international_plan,
         bar.pos = "dodge",
         show.total = FALSE)
2. Voice mail plan
sjp.xtab(Data$churn,
         Data$voice_mail_plan,
         bar.pos = "dodge",
         show.total = FALSE)
library(caret)
### Dividing data into train and test
set.seed(3456)
trainIndex <- createDataPartition(Data$churn, p = .6,
                                  list = FALSE,
                                  times = 1)
Train <- Data[ trainIndex,]
Test <- Data[-trainIndex,]
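As a quick sanity check (a sketch, not part of the original analysis), createDataPartition stratifies on the outcome, so the churn proportion should be almost identical in both splits:
# Stratified split should preserve the ~14% churn share in Train and Test
prop.table(table(Train$churn))
prop.table(table(Test$churn))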
Here we are going to use a linear model and a machine-learning model:
1. A logistic regression model
2. A random forest model
Random forest is an ensemble method that takes a subset of observations and a subset of variables to build each decision tree. It builds many such trees and amalgamates them to get a more accurate and stable prediction; this is a direct consequence of the fact that majority voting by a panel of independent judges gives a better final verdict than the best individual judge. Each tree is planted and grown as follows:
If the number of cases in the training set is N, a sample of N cases is taken at random but with replacement; this bootstrap sample becomes the training set for growing the tree. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node; the value of m is held constant while the forest is grown.
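To make the mechanics concrete, here is a minimal sketch using the standalone randomForest package (an assumption; the actual runs below use h2o). Each tree sees a bootstrap sample of the rows, and mtry plays the role of m:
# Sketch of the algorithm described above, assuming the randomForest package
library(randomForest)
set.seed(42)
rf_sketch <- randomForest(churn ~ ., data = Train,
                          ntree = 500,                          # number of trees
                          mtry = floor(sqrt(ncol(Train) - 1)),  # m << M variables per node
                          importance = TRUE)
print(rf_sketch)        # shows the out-of-bag error estimate
varImpPlot(rf_sketch)   # which variables drive the votes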
model <- glm(churn ~ ., family = binomial(link = 'logit'), data = Train)
summary(model)
##
## Call:
## glm(formula = churn ~ ., family = binomial(link = "logit"), data = Train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.4373617 1.0566345 9.878 < 2e-16 ***
## stateAL -0.6319295 0.7922579 -0.798 0.425085
## stateAR -1.0752785 0.7756407 -1.386 0.165652
## stateAZ -1.1593492 0.7883534 -1.471 0.141400
## stateCA -2.4032731 0.8077485 -2.975 0.002927 **
## stateCO -0.2153291 0.8456921 -0.255 0.799018
## stateCT -1.1040894 0.7531453 -1.466 0.142656
## stateDC -0.6510514 0.8103770 -0.803 0.421747
## stateDE -0.7027549 0.7647536 -0.919 0.358132
## stateFL -0.6204576 0.8073923 -0.768 0.442207
## stateGA -0.4982346 0.8225564 -0.606 0.544704
## stateHI 0.3685022 0.9897189 0.372 0.709647
## stateIA -0.8521855 0.8469577 -1.006 0.314333
## stateID -0.9598147 0.7647202 -1.255 0.209436
## stateIL -0.3356726 0.8353334 -0.402 0.687800
## stateIN -0.7908941 0.7673603 -1.031 0.302696
## stateKS -0.8110337 0.7748424 -1.047 0.295234
## stateKY -1.0128530 0.7676893 -1.319 0.187051
## stateLA -1.3994971 0.8022960 -1.744 0.081095 .
## stateMA -1.8655432 0.7446759 -2.505 0.012239 *
## stateMD -1.1739337 0.7376562 -1.591 0.111511
## stateME -1.1772465 0.7696310 -1.530 0.126110
## stateMI -1.0105505 0.7863367 -1.285 0.198744
## stateMN -1.0501207 0.7517450 -1.397 0.162440
## stateMO -0.3441704 0.8050351 -0.428 0.668999
## stateMS -1.4803300 0.7655319 -1.934 0.053147 .
## stateMT -2.4561637 0.7456827 -3.294 0.000988 ***
## stateNC -0.9068701 0.8034272 -1.129 0.259002
## stateND -0.8866100 0.7903932 -1.122 0.261976
## stateNE -0.0915463 0.8982225 -0.102 0.918821
## stateNH -0.2350965 0.8546874 -0.275 0.783265
## stateNJ -1.7240425 0.7370997 -2.339 0.019338 *
## stateNM -0.5650474 0.8157343 -0.693 0.488507
## stateNV -1.2623842 0.7607225 -1.659 0.097024 .
## stateNY -1.1619708 0.7517854 -1.546 0.122198
## stateOH -0.7926898 0.7682956 -1.032 0.302189
## stateOK -0.9225553 0.7821815 -1.179 0.238213
## stateOR -1.2742907 0.7397678 -1.723 0.084969 .
## statePA -0.7994338 0.8291977 -0.964 0.334993
## stateRI 1.5008191 1.0289842 1.459 0.144691
## stateSC -1.8420346 0.7728648 -2.383 0.017154 *
## stateSD -0.7316450 0.8069820 -0.907 0.364595
## stateTN -1.3879070 0.7608127 -1.824 0.068115 .
## stateTX -1.1629366 0.7468716 -1.557 0.119452
## stateUT -0.7927999 0.7977450 -0.994 0.320320
## stateVA 0.2340778 0.8445123 0.277 0.781646
## stateVT -0.0238572 0.8077224 -0.030 0.976437
## stateWA -1.6076665 0.7649265 -2.102 0.035577 *
## stateWI -0.8455032 0.8018795 -1.054 0.291699
## stateWV -0.9821144 0.7401787 -1.327 0.184555
## stateWY 0.5768982 0.9200816 0.627 0.530654
## area_codearea_code_415 -0.1599630 0.1553637 -1.030 0.303196
## area_codearea_code_510 -0.1547219 0.1810984 -0.854 0.392910
## international_planyes -2.2676012 0.1654446 -13.706 < 2e-16 ***
## voice_mail_planyes 2.8945601 0.6921417 4.182 2.89e-05 ***
## number_vmail_messages -0.0554142 0.0210694 -2.630 0.008537 **
## total_day_minutes -0.7406116 3.6777155 -0.201 0.840403
## total_day_calls -0.0048923 0.0031092 -1.574 0.115599
## total_day_charge 4.2692953 21.6338566 0.197 0.843559
## total_eve_minutes 0.3349721 1.8203582 0.184 0.854002
## total_eve_calls 0.0006109 0.0031936 0.191 0.848297
## total_eve_charge -4.0377420 21.4158088 -0.189 0.850453
## total_night_minutes 0.4069335 0.9932397 0.410 0.682024
## total_night_calls 0.0015455 0.0031339 0.493 0.621892
## total_night_charge -9.1637374 22.0712329 -0.415 0.678003
## total_intl_minutes -4.0538783 5.9411554 -0.682 0.495025
## total_intl_calls 0.0510321 0.0260804 1.957 0.050380 .
## total_intl_charge 14.7410272 22.0046267 0.670 0.502918
## number_customer_service_calls -0.5397787 0.0461981 -11.684 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2448.2 on 3000 degrees of freedom
## Residual deviance: 1779.4 on 2932 degrees of freedom
## AIC: 1917.4
##
## Number of Fisher Scoring iterations: 6
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: churn
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3000 2448.2
## state 50 83.967 2950 2364.2 0.001864
## area_code 2 1.279 2948 2362.9 0.527444
## international_plan 1 168.017 2947 2194.9 < 2.2e-16
## voice_mail_plan 1 55.122 2946 2139.8 1.133e-13
## number_vmail_messages 1 4.724 2945 2135.1 0.029747
## total_day_minutes 1 144.295 2944 1990.8 < 2.2e-16
## total_day_calls 1 1.519 2943 1989.2 0.217823
## total_day_charge 1 0.056 2942 1989.2 0.812846
## total_eve_minutes 1 30.658 2941 1958.5 3.078e-08
## total_eve_calls 1 0.039 2940 1958.5 0.844399
## total_eve_charge 1 0.035 2939 1958.5 0.851003
## total_night_minutes 1 17.880 2938 1940.6 2.353e-05
## total_night_calls 1 0.458 2937 1940.1 0.498729
## total_night_charge 1 0.142 2936 1940.0 0.705919
## total_intl_minutes 1 8.908 2935 1931.1 0.002840
## total_intl_calls 1 6.248 2934 1924.8 0.012432
## total_intl_charge 1 0.309 2933 1924.5 0.578053
## number_customer_service_calls 1 145.091 2932 1779.4 < 2.2e-16
##
## NULL
## state **
## area_code
## international_plan ***
## voice_mail_plan ***
## number_vmail_messages *
## total_day_minutes ***
## total_day_calls
## total_day_charge
## total_eve_minutes ***
## total_eve_calls
## total_eve_charge
## total_night_minutes ***
## total_night_calls
## total_night_charge
## total_intl_minutes **
## total_intl_calls *
## total_intl_charge
## number_customer_service_calls ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted.results_prob <- predict(model,Test,type='response')
fitted.results_class <- ifelse(fitted.results_prob > 0.5,"yes","no")
logit_Res=confusionMatrix(fitted.results_class,Test$churn,positive = "yes")
## Warning in confusionMatrix.default(fitted.results_class, Test$churn,
## positive = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.
logit_Res$table
## Reference
## Prediction yes no
## yes 211 1653
## no 71 64
logit_Res$byClass
## Sensitivity Specificity Pos Pred Value
## 0.74822695 0.03727432 0.11319742
## Neg Pred Value Precision Recall
## 0.47407407 0.11319742 0.74822695
## F1 Prevalence Detection Rate
## 0.19664492 0.14107054 0.10555278
## Detection Prevalence Balanced Accuracy
## 0.93246623 0.39275063
library(ROSE)
## Loaded ROSE 0.0-3
roc.curve(Test$churn,fitted.results_prob,main="ROC curve Logit")
## Area under the curve (AUC): 0.803
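The 0.5 cutoff used above is only a default; under class imbalance it is worth sweeping the cutoff and tracking the F1 score, the metric we return to at the end of this chapter. A hedged sketch, not part of the original analysis:
# Sweep the classification cutoff and compute the F1 score at each value
thresholds <- seq(0.1, 0.9, by = 0.05)
f1_scores <- sapply(thresholds, function(t) {
  pred <- ifelse(fitted.results_prob > t, "yes", "no")
  tp <- sum(pred == "yes" & Test$churn == "yes")
  precision <- tp / sum(pred == "yes")
  recall <- tp / sum(Test$churn == "yes")
  2 * precision * recall / (precision + recall)
})
plot(thresholds, f1_scores, type = "b", xlab = "Cutoff", ylab = "F1",
     main = "F1 score vs classification cutoff (logit)")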
suppressPackageStartupMessages(library(h2o))
suppressWarnings(h2o.init( nthreads=-1,max_mem_size = "64g"))
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 20 hours 55 minutes
## H2O cluster version: 3.10.0.8
## H2O cluster version age: 5 months and 11 days !!!
## H2O cluster name: H2O_started_from_R_M00864_puz801
## H2O cluster total nodes: 1
## H2O cluster total memory: 56.75 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.3.2 (2016-10-31)
# The H2O cluster specifications are printed above.
# Converting the data sets into H2O format
h2o_Train=as.h2o(Train)
h2o_Test=as.h2o(Test)
# Running the random forest model
Rf_normal=h2o.randomForest(x = 1:18,
                           y = 19,
                           training_frame = h2o_Train,
                           ntrees = 1000,
                           max_depth = 20,
                           # stop early once the AUC doesn't improve by at
                           # least 0.01% for 5 consecutive scoring events
                           stopping_rounds = 5,
                           stopping_tolerance = 1e-4,
                           stopping_metric = "AUC",
                           # sample 80% of rows per tree
                           sample_rate = 0.8,
                           # score every 10 trees so that early stopping is
                           # reproducible (it depends on the scoring interval)
                           score_tree_interval = 10)
rf_pred_normal=h2o.predict(Rf_normal,h2o_Test)
rf_pred_normal_df=as.data.frame(rf_pred_normal)
rf_pred_normal1=rf_pred_normal_df[,1]   # predicted class labels
Results_normal_rf=confusionMatrix(Test$churn,rf_pred_normal1,positive = "yes")
## Warning in confusionMatrix.default(Test$churn, rf_pred_normal1, positive
## = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.
roc.curve(Test$churn,rf_pred_normal_df$yes,main="ROC curve Random Forest")
## Area under the curve (AUC): 0.915
Here in this chapter we look into the case of class imbalance, where one class has many more examples than the other. Below we can see that the churn share is about 14% while the non-churn share is about 86%.
prop.table(table(Data$churn))
##
## yes no
## 0.1414 0.8586
To counter this we can oversample the minority class or undersample the majority class, so that the model gets to train on the underrepresented class more often.
The strategy here is to train the model on the resampled classes but test it on a data set where the class imbalance persists.
library(caret)
set.seed(123)
Train=as.data.frame(Train)
down_train <- downSample(x = Train[, -ncol(Train)],y = Train$churn)
table(down_train$Class)
##
## yes no
## 425 425
up_train <- upSample(x = Train[, -19],y = Train$churn)
table(up_train$Class)
##
## yes no
## 2576 2576
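Besides caret's downSample and upSample, the ROSE package loaded earlier can generate a synthetically rebalanced training set; a sketch, not used in the runs below:
# Synthetic rebalancing with ROSE (sketch; the models below use down/up_train)
library(ROSE)
rose_train <- ROSE(churn ~ ., data = Train, seed = 123)$data
prop.table(table(rose_train$churn))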
h2o_Train_down=as.h2o(down_train)
h2o_Train_up=as.h2o(up_train)
model_down <- glm(Class ~.,family=binomial(link='logit'),data=down_train)
summary(model_down)
##
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = down_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68285 -0.72066 0.01304 0.69580 3.09649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.965e+00 1.527e+00 5.872 4.3e-09 ***
## stateAL -6.783e-01 1.133e+00 -0.599 0.54926
## stateAR -1.285e+00 1.152e+00 -1.115 0.26481
## stateAZ -1.988e+00 1.139e+00 -1.746 0.08078 .
## stateCA -4.028e+00 1.438e+00 -2.800 0.00511 **
## stateCO -1.613e+00 1.186e+00 -1.360 0.17376
## stateCT -2.139e+00 1.159e+00 -1.845 0.06500 .
## stateDC -1.693e+00 1.159e+00 -1.461 0.14404
## stateDE -8.105e-01 1.071e+00 -0.757 0.44901
## stateFL -1.353e+00 1.160e+00 -1.166 0.24351
## stateGA -9.656e-01 1.187e+00 -0.814 0.41587
## stateHI -1.453e+00 1.562e+00 -0.930 0.35227
## stateIA -1.019e+00 1.189e+00 -0.857 0.39144
## stateID -8.931e-01 1.067e+00 -0.837 0.40241
## stateIL -1.369e+00 1.255e+00 -1.091 0.27543
## stateIN -1.557e+00 1.123e+00 -1.387 0.16552
## stateKS -2.053e+00 1.114e+00 -1.843 0.06530 .
## stateKY -1.511e+00 1.100e+00 -1.374 0.16950
## stateLA -2.518e+00 1.132e+00 -2.225 0.02610 *
## stateMA -3.002e+00 1.139e+00 -2.636 0.00839 **
## stateMD -1.329e+00 1.065e+00 -1.248 0.21201
## stateME -1.893e+00 1.085e+00 -1.745 0.08094 .
## stateMI -1.499e+00 1.127e+00 -1.331 0.18332
## stateMN -1.515e+00 1.059e+00 -1.430 0.15274
## stateMO -8.670e-01 1.204e+00 -0.720 0.47153
## stateMS -2.117e+00 1.076e+00 -1.967 0.04920 *
## stateMT -3.299e+00 1.089e+00 -3.029 0.00245 **
## stateNC -1.129e+00 1.152e+00 -0.980 0.32717
## stateND -1.270e+00 1.114e+00 -1.140 0.25425
## stateNE -6.867e-01 1.187e+00 -0.579 0.56281
## stateNH -1.713e+00 1.261e+00 -1.358 0.17443
## stateNJ -2.153e+00 1.042e+00 -2.066 0.03880 *
## stateNM -1.204e+00 1.126e+00 -1.069 0.28485
## stateNV -2.079e+00 1.098e+00 -1.894 0.05821 .
## stateNY -1.450e+00 1.048e+00 -1.384 0.16641
## stateOH -8.633e-01 1.096e+00 -0.788 0.43096
## stateOK -1.506e+00 1.055e+00 -1.427 0.15347
## stateOR -2.329e+00 1.122e+00 -2.076 0.03788 *
## statePA -1.442e+00 1.141e+00 -1.264 0.20639
## stateRI 4.109e-01 1.325e+00 0.310 0.75644
## stateSC -2.048e+00 1.103e+00 -1.856 0.06345 .
## stateSD -2.209e+00 1.215e+00 -1.818 0.06910 .
## stateTN -2.848e+00 1.209e+00 -2.356 0.01846 *
## stateTX -1.511e+00 1.073e+00 -1.409 0.15897
## stateUT -2.766e+00 1.256e+00 -2.202 0.02766 *
## stateVA -5.948e-01 1.176e+00 -0.506 0.61309
## stateVT -5.085e-01 1.104e+00 -0.461 0.64505
## stateWA -2.538e+00 1.179e+00 -2.153 0.03133 *
## stateWI -1.171e+00 1.134e+00 -1.032 0.30188
## stateWV -1.561e+00 1.044e+00 -1.495 0.13495
## stateWY 3.842e-01 1.171e+00 0.328 0.74289
## area_codearea_code_415 -3.396e-01 2.281e-01 -1.489 0.13653
## area_codearea_code_510 -3.722e-01 2.707e-01 -1.375 0.16917
## international_planyes -3.175e+00 3.240e-01 -9.799 < 2e-16 ***
## voice_mail_planyes 2.678e+00 9.480e-01 2.825 0.00473 **
## number_vmail_messages -4.351e-02 2.968e-02 -1.466 0.14260
## total_day_minutes 3.037e+00 5.752e+00 0.528 0.59751
## total_day_calls -2.108e-03 4.753e-03 -0.443 0.65743
## total_day_charge -1.796e+01 3.384e+01 -0.531 0.59557
## total_eve_minutes -1.957e-01 2.722e+00 -0.072 0.94268
## total_eve_calls 7.467e-04 4.721e-03 0.158 0.87433
## total_eve_charge 2.182e+00 3.203e+01 0.068 0.94568
## total_night_minutes 1.149e-01 1.485e+00 0.077 0.93833
## total_night_calls 2.710e-03 4.800e-03 0.565 0.57234
## total_night_charge -2.582e+00 3.301e+01 -0.078 0.93766
## total_intl_minutes 9.502e-01 8.839e+00 0.108 0.91439
## total_intl_calls 3.290e-02 3.835e-02 0.858 0.39092
## total_intl_charge -3.705e+00 3.275e+01 -0.113 0.90991
## number_customer_service_calls -6.902e-01 7.279e-02 -9.482 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1178.35 on 849 degrees of freedom
## Residual deviance: 756.82 on 781 degrees of freedom
## AIC: 894.82
##
## Number of Fisher Scoring iterations: 5
anova(model_down, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 849 1178.35
## state 50 63.078 799 1115.27 0.101351
## area_code 2 9.524 797 1105.75 0.008549
## international_plan 1 97.983 796 1007.77 < 2.2e-16
## voice_mail_plan 1 41.172 795 966.59 1.394e-10
## number_vmail_messages 1 4.646 794 961.95 0.031118
## total_day_minutes 1 69.557 793 892.39 < 2.2e-16
## total_day_calls 1 0.436 792 891.95 0.509190
## total_day_charge 1 0.025 791 891.93 0.875445
## total_eve_minutes 1 15.430 790 876.50 8.560e-05
## total_eve_calls 1 0.040 789 876.46 0.841061
## total_eve_charge 1 0.135 788 876.32 0.713530
## total_night_minutes 1 0.019 787 876.31 0.891797
## total_night_calls 1 0.539 786 875.77 0.462752
## total_night_charge 1 0.027 785 875.74 0.870459
## total_intl_minutes 1 2.422 784 873.32 0.119652
## total_intl_calls 1 1.573 783 871.74 0.209752
## total_intl_charge 1 0.004 782 871.74 0.949136
## number_customer_service_calls 1 114.920 781 756.82 < 2.2e-16
##
## NULL
## state
## area_code **
## international_plan ***
## voice_mail_plan ***
## number_vmail_messages *
## total_day_minutes ***
## total_day_calls
## total_day_charge
## total_eve_minutes ***
## total_eve_calls
## total_eve_charge
## total_night_minutes
## total_night_calls
## total_night_charge
## total_intl_minutes
## total_intl_calls
## total_intl_charge
## number_customer_service_calls ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted.results_prob_down <- predict(model_down,Test,type='response')
fitted.results_class_down <- ifelse(fitted.results_prob_down > 0.5,"yes","no")
logit_Res_down=confusionMatrix(fitted.results_class_down,Test$churn,positive = "yes")
## Warning in confusionMatrix.default(fitted.results_class_down, Test$churn, :
## Levels are not in the same order for reference and data. Refactoring data
## to match.
logit_Res_down$table
## Reference
## Prediction yes no
## yes 70 1246
## no 212 471
logit_Res_down$byClass
## Sensitivity Specificity Pos Pred Value
## 0.24822695 0.27431567 0.05319149
## Neg Pred Value Precision Recall
## 0.68960469 0.05319149 0.24822695
## F1 Prevalence Detection Rate
## 0.08760951 0.14107054 0.03501751
## Detection Prevalence Balanced Accuracy
## 0.65832916 0.26127131
library(ROSE)
roc.curve(Test$churn,fitted.results_prob_down,main="ROC curve Logit Down")
## Area under the curve (AUC): 0.798
model_up <- glm(Class ~.,family=binomial(link='logit'),data=up_train)
summary(model_up)
##
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = up_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5671 -0.7393 0.0070 0.7255 3.0943
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.957e+00 5.809e-01 15.418 < 2e-16 ***
## stateAL -3.571e-01 4.226e-01 -0.845 0.398060
## stateAR -8.157e-01 4.282e-01 -1.905 0.056775 .
## stateAZ -1.234e+00 4.299e-01 -2.871 0.004098 **
## stateCA -2.201e+00 4.730e-01 -4.654 3.25e-06 ***
## stateCO 1.017e-01 4.926e-01 0.207 0.836391
## stateCT -1.063e+00 4.046e-01 -2.627 0.008626 **
## stateDC -1.036e+00 4.282e-01 -2.420 0.015540 *
## stateDE -5.141e-01 4.162e-01 -1.235 0.216695
## stateFL -2.407e-01 4.467e-01 -0.539 0.590057
## stateGA -4.219e-01 4.425e-01 -0.954 0.340331
## stateHI 3.103e-01 4.809e-01 0.645 0.518790
## stateIA -7.992e-01 4.621e-01 -1.729 0.083736 .
## stateID -9.720e-01 4.084e-01 -2.380 0.017316 *
## stateIL -4.309e-01 4.692e-01 -0.918 0.358367
## stateIN -1.298e+00 4.214e-01 -3.079 0.002075 **
## stateKS -8.316e-01 4.166e-01 -1.996 0.045919 *
## stateKY -9.258e-01 4.143e-01 -2.235 0.025430 *
## stateLA -1.636e+00 4.274e-01 -3.828 0.000129 ***
## stateMA -1.744e+00 4.115e-01 -4.238 2.25e-05 ***
## stateMD -1.011e+00 4.032e-01 -2.507 0.012180 *
## stateME -1.233e+00 4.112e-01 -2.999 0.002708 **
## stateMI -7.183e-01 4.296e-01 -1.672 0.094500 .
## stateMN -9.470e-01 4.072e-01 -2.326 0.020039 *
## stateMO -1.245e-01 4.360e-01 -0.286 0.775171
## stateMS -1.662e+00 4.171e-01 -3.983 6.79e-05 ***
## stateMT -2.877e+00 4.115e-01 -6.990 2.75e-12 ***
## stateNC -8.451e-01 4.618e-01 -1.830 0.067219 .
## stateND -1.244e+00 4.127e-01 -3.015 0.002567 **
## stateNE -2.181e-01 4.587e-01 -0.476 0.634425
## stateNH -2.432e-01 4.434e-01 -0.548 0.583406
## stateNJ -1.865e+00 4.055e-01 -4.599 4.25e-06 ***
## stateNM -5.136e-01 4.342e-01 -1.183 0.236875
## stateNV -1.313e+00 4.153e-01 -3.161 0.001570 **
## stateNY -1.057e+00 4.056e-01 -2.605 0.009178 **
## stateOH -5.245e-01 4.199e-01 -1.249 0.211626
## stateOK -1.201e+00 4.190e-01 -2.866 0.004153 **
## stateOR -1.252e+00 3.992e-01 -3.136 0.001715 **
## statePA -7.227e-01 4.402e-01 -1.642 0.100625
## stateRI 1.135e+00 5.133e-01 2.211 0.027013 *
## stateSC -1.728e+00 4.258e-01 -4.060 4.91e-05 ***
## stateSD -1.281e+00 4.346e-01 -2.948 0.003194 **
## stateTN -1.284e+00 4.222e-01 -3.042 0.002347 **
## stateTX -9.722e-01 4.028e-01 -2.414 0.015795 *
## stateUT -1.063e+00 4.198e-01 -2.533 0.011311 *
## stateVA -7.975e-02 4.348e-01 -0.183 0.854464
## stateVT 2.563e-01 4.399e-01 0.583 0.560175
## stateWA -1.561e+00 4.199e-01 -3.718 0.000201 ***
## stateWI -8.013e-01 4.263e-01 -1.879 0.060182 .
## stateWV -1.039e+00 3.954e-01 -2.628 0.008586 **
## stateWY 6.482e-01 4.888e-01 1.326 0.184838
## area_codearea_code_415 -2.153e-01 9.176e-02 -2.346 0.018985 *
## area_codearea_code_510 -2.646e-01 1.057e-01 -2.502 0.012337 *
## international_planyes -2.765e+00 1.199e-01 -23.055 < 2e-16 ***
## voice_mail_planyes 2.646e+00 3.662e-01 7.226 4.98e-13 ***
## number_vmail_messages -4.650e-02 1.132e-02 -4.106 4.02e-05 ***
## total_day_minutes 2.504e+00 2.170e+00 1.154 0.248600
## total_day_calls -4.100e-03 1.806e-03 -2.270 0.023206 *
## total_day_charge -1.482e+01 1.277e+01 -1.161 0.245813
## total_eve_minutes 1.173e-01 1.079e+00 0.109 0.913402
## total_eve_calls -8.249e-04 1.850e-03 -0.446 0.655611
## total_eve_charge -1.481e+00 1.269e+01 -0.117 0.907070
## total_night_minutes 1.015e+00 5.668e-01 1.791 0.073259 .
## total_night_calls 2.775e-03 1.886e-03 1.471 0.141221
## total_night_charge -2.269e+01 1.260e+01 -1.801 0.071694 .
## total_intl_minutes -2.723e+00 3.511e+00 -0.776 0.438037
## total_intl_calls 5.322e-02 1.504e-02 3.539 0.000401 ***
## total_intl_charge 9.848e+00 1.300e+01 0.757 0.448893
## number_customer_service_calls -6.517e-01 2.771e-02 -23.517 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7142.2 on 5151 degrees of freedom
## Residual deviance: 4811.5 on 5083 degrees of freedom
## AIC: 4949.5
##
## Number of Fisher Scoring iterations: 5
anova(model_up, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 5151 7142.2
## state 50 313.26 5101 6828.9 < 2.2e-16
## area_code 2 6.94 5099 6822.0 0.031054
## international_plan 1 502.46 5098 6319.5 < 2.2e-16
## voice_mail_plan 1 245.85 5097 6073.7 < 2.2e-16
## number_vmail_messages 1 26.00 5096 6047.7 3.421e-07
## total_day_minutes 1 392.85 5095 5654.8 < 2.2e-16
## total_day_calls 1 4.95 5094 5649.9 0.026021
## total_day_charge 1 1.31 5093 5648.6 0.253036
## total_eve_minutes 1 71.50 5092 5577.1 < 2.2e-16
## total_eve_calls 1 0.00 5091 5577.1 0.996561
## total_eve_charge 1 0.65 5090 5576.4 0.421593
## total_night_minutes 1 40.68 5089 5535.7 1.794e-10
## total_night_calls 1 2.74 5088 5533.0 0.097741
## total_night_charge 1 0.89 5087 5532.1 0.345079
## total_intl_minutes 1 10.62 5086 5521.5 0.001121
## total_intl_calls 1 24.05 5085 5497.4 9.368e-07
## total_intl_charge 1 0.76 5084 5496.7 0.382223
## number_customer_service_calls 1 685.14 5083 4811.5 < 2.2e-16
##
## NULL
## state ***
## area_code *
## international_plan ***
## voice_mail_plan ***
## number_vmail_messages ***
## total_day_minutes ***
## total_day_calls *
## total_day_charge
## total_eve_minutes ***
## total_eve_calls
## total_eve_charge
## total_night_minutes ***
## total_night_calls .
## total_night_charge
## total_intl_minutes **
## total_intl_calls ***
## total_intl_charge
## number_customer_service_calls ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted.results_prob_up <- predict(model_up,Test,type='response')
fitted.results_class_up <- ifelse(fitted.results_prob_up > 0.5,"yes","no")
logit_Res_up=confusionMatrix(fitted.results_class_up,Test$churn,positive = "yes")
## Warning in confusionMatrix.default(fitted.results_class_up, Test$churn, :
## Levels are not in the same order for reference and data. Refactoring data
## to match.
logit_Res_up$table
## Reference
## Prediction yes no
## yes 76 1305
## no 206 412
logit_Res_up$byClass
## Sensitivity Specificity Pos Pred Value
## 0.26950355 0.23995341 0.05503259
## Neg Pred Value Precision Recall
## 0.66666667 0.05503259 0.26950355
## F1 Prevalence Detection Rate
## 0.09140108 0.14107054 0.03801901
## Detection Prevalence Balanced Accuracy
## 0.69084542 0.25472848
library(ROSE)
roc.curve(Test$churn,fitted.results_prob_up,main="ROC curve Logit up")
## Area under the curve (AUC): 0.803
# Running the random forest model on the upsampled training data
Rf_normal_up=h2o.randomForest(x = 1:18,
                              y = 19,
                              training_frame = h2o_Train_up,
                              ntrees = 1000,
                              max_depth = 20,
                              # stop early once the AUC doesn't improve by at
                              # least 0.01% for 5 consecutive scoring events
                              stopping_rounds = 5,
                              stopping_tolerance = 1e-4,
                              stopping_metric = "AUC",
                              # sample 80% of rows per tree
                              sample_rate = 0.8,
                              # score every 10 trees so that early stopping is
                              # reproducible (it depends on the scoring interval)
                              score_tree_interval = 10)
rf_pred_normal_up=h2o.predict(Rf_normal_up,h2o_Test)
rf_pred_normal_df_up=as.data.frame(rf_pred_normal_up)
rf_pred_normal1_up=rf_pred_normal_df_up[,1]   # predicted class labels
Results_normal_rf_up=confusionMatrix(Test$churn,rf_pred_normal1_up,positive = "yes")
## Warning in confusionMatrix.default(Test$churn, rf_pred_normal1_up,
## positive = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.
roc.curve(Test$churn,rf_pred_normal_df_up$yes,main="ROC curve Random Forest UP")
## Area under the curve (AUC): 0.914
# Running the random forest model on the downsampled training data
Rf_normal_down=h2o.randomForest(x = 1:18,
                                y = 19,
                                training_frame = h2o_Train_down,
                                ntrees = 1000,
                                max_depth = 20,
                                # stop early once the AUC doesn't improve by at
                                # least 0.01% for 5 consecutive scoring events
                                stopping_rounds = 5,
                                stopping_tolerance = 1e-4,
                                stopping_metric = "AUC",
                                # sample 80% of rows per tree
                                sample_rate = 0.8,
                                # score every 10 trees so that early stopping is
                                # reproducible (it depends on the scoring interval)
                                score_tree_interval = 10)
rf_pred_normal_down=h2o.predict(Rf_normal_down,h2o_Test)
rf_pred_normal_df_down=as.data.frame(rf_pred_normal_down)
rf_pred_normal1_down=rf_pred_normal_df_down[,1]   # predicted class labels
Results_normal_rf_down=confusionMatrix(Test$churn,rf_pred_normal1_down,positive = "yes")
## Warning in confusionMatrix.default(Test$churn, rf_pred_normal1_down,
## positive = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.
roc.curve(Test$churn,rf_pred_normal_df_down$yes,main="ROC curve Random Forest down")
## Area under the curve (AUC): 0.910
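All three forests above were trained on manually resampled frames; h2o can also rebalance internally. A sketch using the balance_classes argument on the original training frame (the results above come from the manual resampling, not from this run):
# Let h2o oversample the minority class internally instead of by hand
Rf_balanced = h2o.randomForest(x = 1:18,
                               y = 19,
                               training_frame = h2o_Train,
                               ntrees = 1000,
                               max_depth = 20,
                               sample_rate = 0.8,
                               balance_classes = TRUE)  # internal class rebalancing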
Here we look at the ROC curve and at the confusion matrix to find out how each model performs.
The accuracy paradox in predictive analytics states that a model with a given level of accuracy may have greater predictive power than a model with higher accuracy, so it can be better to set the accuracy metric aside in favor of metrics such as precision and recall.
Accuracy, the ratio of correct predictions to the total number of cases evaluated, is the obvious starting point for judging a predictive model, yet a model can have high accuracy and still be useless.
(https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/ “Why Accuracy is a Bad Measure for Class Imbalance Problems”)
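A one-line illustration on this very test set: a model that always predicts "no" is about 86% accurate yet catches zero churners.
# The accuracy paradox in one line: the majority-class "model" scores ~86%
mean(Test$churn == "no")   # accuracy of always predicting "no"; its recall is 0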
Here we therefore use the F1 score, the harmonic mean of precision and recall, to evaluate model performance, alongside the ROC curve.
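To close, here are the headline numbers reported above, collected into one table (values transcribed from the outputs earlier in this chapter; the random forest F1 scores were not printed above):
# Summary of the results reported above (transcribed, not recomputed)
results <- data.frame(
  Model = c("Logit", "Logit (down)", "Logit (up)",
            "RF", "RF (up)", "RF (down)"),
  AUC = c(0.803, 0.798, 0.803, 0.915, 0.914, 0.910),
  F1 = c(0.197, 0.088, 0.091, NA, NA, NA))
results[order(-results$AUC), ]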