Telecom Churn Prediction Challenge

Churn prediction is one of the most popular Big Data use cases in business. It consists of detecting customers who are likely to cancel a subscription to a service. We want to predict the answer to the following question, asked for each current customer: “Is this customer going to leave us within the next X months?” There are only two possible answers, yes or no, which makes this a binary classification task. The input of the task is a customer and the output is the answer to the question (yes or no). Being able to predict churn from customer data has proven extremely valuable to big telecom companies.

The Business Case

Telecom company Ding Dong is facing an uphill battle to retain its customers. It had an annual churn rate of around 14% for the year 2014. The data set consists of 5000 customers and a set of features that help predict churn. You, as a budding data scientist, are now charged with the task of using advanced analytics to predict churners and retain customers.

The Data set

The data set consists of the following variables:

1. state

2. account_length

3. area_code

4. international_plan

5. voice_mail_plan

6. number_vmail_messages

7. total_day_minutes

8. total_day_calls

9. total_day_charge

10. total_eve_minutes

11. total_eve_calls

12. total_eve_charge

13. total_night_minutes

14. total_night_calls

15. total_night_charge

16. total_intl_minutes

17. total_intl_calls

18. total_intl_charge

19. number_customer_service_calls

20. churn

Chapter 1 Graphically exploring the data set

Here we divide the data set into two parts, categorical variables and continuous variables, and look at how each relates to the churn variable.

Plotting the Continuous Variables

library(caret)

## Loading required package: lattice

library(reshape)

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:plotly':
## 
##     rename

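# Drop account_length (column 2)
Data=Data[,-2]

# Continuous variables plus churn: everything except the four categorical columns
Cont_Data=Data[,-c(1,2,3,4)]

# Categorical variables (state, area_code, international_plan, voice_mail_plan) plus churn (column 19)
Cat_Data=Data[,c(1,2,3,4,19)]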


Cont_melt=melt(Cont_Data)

## Using churn as id variables

ggplot(Cont_melt, aes(x=churn,y=value,fill=variable))+geom_boxplot()+facet_wrap(~variable)
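As a numeric companion to the boxplots, here is a minimal base-R sketch of the same reshaping idea: stack several numeric columns into long form and summarise each variable by churn class. The data frame `toy` and its values are made up for illustration and are not the challenge data.

```r
# Hypothetical toy data (not the challenge data): two numeric columns plus churn.
toy <- data.frame(
  total_day_minutes             = c(120, 250, 180, 300, 140, 310),
  number_customer_service_calls = c(1, 4, 0, 5, 2, 6),
  churn = factor(c("no", "yes", "no", "yes", "no", "yes"), levels = c("yes", "no"))
)

# Long format: one row per (customer, variable), similar to melt() with churn as id.
num_cols <- setdiff(names(toy), "churn")
toy_long <- data.frame(
  churn    = rep(toy$churn, times = length(num_cols)),
  variable = rep(num_cols, each = nrow(toy)),
  value    = unlist(toy[num_cols], use.names = FALSE)
)

# Median of each variable within each churn class.
class_medians <- aggregate(value ~ churn + variable, data = toy_long, FUN = median)
print(class_medians)
```

A large gap between the two class medians of a variable corresponds to boxes that sit at clearly different heights in the faceted plot.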

Plotting the Categorical Variables

Area code

library(plyr)

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:reshape':
## 
##     rename, round_any

## The following objects are masked from 'package:plotly':
## 
##     arrange, mutate, rename, summarise

library(data.table)

## 
## Attaching package: 'data.table'

## The following object is masked from 'package:reshape':
## 
##     melt

library(sjPlot)

## Warning: package 'sjPlot' was built under R version 3.3.3

Data=data.table(Data) 

Reason_areacode=Data[, count(churn), by = area_code]

sjp.xtab(Data$churn, 
         Data$area_code,
         bar.pos = c("dodge"),
         show.total = FALSE)

International Plan

sjp.xtab(Data$churn, 
         Data$international_plan,
         bar.pos = c("dodge"),
         show.total = FALSE)

Voice Mail Plan

sjp.xtab(Data$churn, 
         Data$voice_mail_plan,
         bar.pos = c("dodge"),
         show.total = FALSE)
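The dodged bar charts above are drawn from cross-tabulations of churn against each categorical variable. A minimal base-R sketch of the underlying table, using invented data rather than the challenge data, could look like this:

```r
# Hypothetical toy data (not the challenge data).
toy <- data.frame(
  international_plan = c("yes", "yes", "yes", "no", "no", "no", "no", "no"),
  churn              = c("yes", "yes", "no", "no", "no", "no", "yes", "no")
)

# Counts of churn (rows) by plan (columns), the table behind a dodged bar chart.
counts <- table(churn = toy$churn, international_plan = toy$international_plan)

# Column-wise proportions: share of churners within each plan group.
churn_share <- prop.table(counts, margin = 2)
print(round(churn_share, 2))
```

Reading the `yes` row across the columns gives the churn rate in each group, which is often easier to compare than raw counts when the groups differ in size.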

Dividing the data set into Train and Test

library(caret)

### Dividing data into train and test 

set.seed(3456)
trainIndex <- createDataPartition(Data$churn, p = .6, 
                                  list = FALSE, 
                                  times = 1)


Train <- Data[ trainIndex,]
Test  <- Data[-trainIndex,]
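`createDataPartition` samples within each class, so the train and test parts keep roughly the same churn rate. The sketch below illustrates that idea in base R, assuming a simple per-class random draw. It is not caret's implementation, and `stratified_index` is a hypothetical helper written for this example.

```r
# Hypothetical helper: draw a fraction p of row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index into idx explicitly to avoid sample()'s special case for length-1 input.
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(3456)
# Made-up labels with roughly the 14% / 86% imbalance described in the text.
y <- factor(rep(c("yes", "no"), times = c(140, 860)), levels = c("yes", "no"))

train_idx <- stratified_index(y, p = 0.6)
train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both parts keep the original class proportions.
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```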

Chapter 4 - Machine Learning Models

Here we use one linear model and one machine learning model:

A logistic regression model
A random forest model

Description of a Random Forest

Random forest is an ensemble method that builds each decision tree from a random subset of observations and a random subset of variables. It builds many such trees and combines them to get a more accurate and stable prediction. The intuition is that a majority vote from a panel of independent judges tends to give a better final prediction than the best single judge. Each tree is planted and grown as follows:

If the number of cases in the training set is N, then a sample of N cases is taken at random, with replacement. This sample is the training set for growing the tree. If there are M input variables, a number m < M of them is drawn at random at each node, and the best split on these m variables is used to split the node.
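The two sources of randomness and the final vote can be sketched in a few lines of base R. This is an illustration only, not a random forest implementation: the sizes `N` and `M`, the choice `m = floor(sqrt(M))` (a common default for classification, assumed here), and the tree votes are all made up.

```r
set.seed(1)

N <- 10              # number of training cases (made-up)
M <- 18              # number of input variables (made-up)
m <- floor(sqrt(M))  # a common default for classification; an assumption here

# Bootstrap sample for one tree: N row indices drawn with replacement.
boot_rows <- sample.int(N, size = N, replace = TRUE)

# Candidate variables for one split: m of the M columns, without replacement.
split_vars <- sample.int(M, size = m)

# Rows never drawn are "out of bag" for this tree.
oob_rows <- setdiff(seq_len(N), boot_rows)

# Majority vote across trees: hypothetical votes of 5 trees for one customer.
votes <- c("no", "yes", "no", "no", "yes")
forest_prediction <- names(which.max(table(votes)))
print(forest_prediction)
```

Because each tree sees a different bootstrap sample and different candidate variables, the trees make partly independent errors, which is what lets the vote outperform a single tree.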
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   10.4373617  1.0566345   9.878  < 2e-16 ***
## stateAL                       -0.6319295  0.7922579  -0.798 0.425085
## stateAR                       -1.0752785  0.7756407  -1.386 0.165652
## stateAZ                       -1.1593492  0.7883534  -1.471 0.141400
## stateCA                       -2.4032731  0.8077485  -2.975 0.002927 **
## stateCO                       -0.2153291  0.8456921  -0.255 0.799018
## stateCT                       -1.1040894  0.7531453  -1.466 0.142656
## stateDC                       -0.6510514  0.8103770  -0.803 0.421747
## stateDE                       -0.7027549  0.7647536  -0.919 0.358132
## stateFL                       -0.6204576  0.8073923  -0.768 0.442207
## stateGA                       -0.4982346  0.8225564  -0.606 0.544704
## stateHI                        0.3685022  0.9897189   0.372 0.709647
## stateIA                       -0.8521855  0.8469577  -1.006 0.314333
## stateID                       -0.9598147  0.7647202  -1.255 0.209436
## stateIL                       -0.3356726  0.8353334  -0.402 0.687800
## stateIN                       -0.7908941  0.7673603  -1.031 0.302696
## stateKS                       -0.8110337  0.7748424  -1.047 0.295234
## stateKY                       -1.0128530  0.7676893  -1.319 0.187051
## stateLA                       -1.3994971  0.8022960  -1.744 0.081095 .
## stateMA                       -1.8655432  0.7446759  -2.505 0.012239 *
## stateMD                       -1.1739337  0.7376562  -1.591 0.111511
## stateME                       -1.1772465  0.7696310  -1.530 0.126110
## stateMI                       -1.0105505  0.7863367  -1.285 0.198744
## stateMN                       -1.0501207  0.7517450  -1.397 0.162440
## stateMO                       -0.3441704  0.8050351  -0.428 0.668999
## stateMS                       -1.4803300  0.7655319  -1.934 0.053147 .
## stateMT                       -2.4561637  0.7456827  -3.294 0.000988 ***
## stateNC                       -0.9068701  0.8034272  -1.129 0.259002
## stateND                       -0.8866100  0.7903932  -1.122 0.261976
## stateNE                       -0.0915463  0.8982225  -0.102 0.918821
## stateNH                       -0.2350965  0.8546874  -0.275 0.783265
## stateNJ                       -1.7240425  0.7370997  -2.339 0.019338 *
## stateNM                       -0.5650474  0.8157343  -0.693 0.488507
## stateNV                       -1.2623842  0.7607225  -1.659 0.097024 .
## stateNY                       -1.1619708  0.7517854  -1.546 0.122198
## stateOH                       -0.7926898  0.7682956  -1.032 0.302189
## stateOK                       -0.9225553  0.7821815  -1.179 0.238213
## stateOR                       -1.2742907  0.7397678  -1.723 0.084969 .
## statePA                       -0.7994338  0.8291977  -0.964 0.334993
## stateRI                        1.5008191  1.0289842   1.459 0.144691
## stateSC                       -1.8420346  0.7728648  -2.383 0.017154 *
## stateSD                       -0.7316450  0.8069820  -0.907 0.364595
## stateTN                       -1.3879070  0.7608127  -1.824 0.068115 .
## stateTX                       -1.1629366  0.7468716  -1.557 0.119452
## stateUT                       -0.7927999  0.7977450  -0.994 0.320320
## stateVA                        0.2340778  0.8445123   0.277 0.781646
## stateVT                       -0.0238572  0.8077224  -0.030 0.976437
## stateWA                       -1.6076665  0.7649265  -2.102 0.035577 *
## stateWI                       -0.8455032  0.8018795  -1.054 0.291699
## stateWV                       -0.9821144  0.7401787  -1.327 0.184555
## stateWY                        0.5768982  0.9200816   0.627 0.530654
## area_codearea_code_415        -0.1599630  0.1553637  -1.030 0.303196
## area_codearea_code_510        -0.1547219  0.1810984  -0.854 0.392910
## international_planyes         -2.2676012  0.1654446 -13.706  < 2e-16 ***
## voice_mail_planyes             2.8945601  0.6921417   4.182 2.89e-05 ***
## number_vmail_messages         -0.0554142  0.0210694  -2.630 0.008537 **
## total_day_minutes             -0.7406116  3.6777155  -0.201 0.840403
## total_day_calls               -0.0048923  0.0031092  -1.574 0.115599
## total_day_charge               4.2692953 21.6338566   0.197 0.843559
## total_eve_minutes              0.3349721  1.8203582   0.184 0.854002
## total_eve_calls                0.0006109  0.0031936   0.191 0.848297
## total_eve_charge              -4.0377420 21.4158088  -0.189 0.850453
## total_night_minutes            0.4069335  0.9932397   0.410 0.682024
## total_night_calls              0.0015455  0.0031339   0.493 0.621892
## total_night_charge            -9.1637374 22.0712329  -0.415 0.678003
## total_intl_minutes            -4.0538783  5.9411554  -0.682 0.495025
## total_intl_calls               0.0510321  0.0260804   1.957 0.050380 .
## total_intl_charge             14.7410272 22.0046267   0.670 0.502918
## number_customer_service_calls -0.5397787  0.0461981 -11.684  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2448.2  on 3000  degrees of freedom
## Residual deviance: 1779.4  on 2932  degrees of freedom
## AIC: 1917.4
## 
## Number of Fisher Scoring iterations: 6
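When reading coefficient signs like these, it helps to know which class the model treats as the "success". For a factor response, R's `glm` with `family = binomial` takes the first level as failure and the remaining level as success, so with levels `c("yes", "no")` the fitted probabilities are for `"no"`. The sketch below shows this on simulated data. All names and numbers in it are invented for illustration.

```r
set.seed(42)

# Simulated data: more service calls -> higher chance of churn.
n <- 500
calls <- rpois(n, lambda = 2)
p_churn <- plogis(-3 + 0.8 * calls)
churned <- rbinom(n, size = 1, prob = p_churn) == 1

# Same response, two level orders.
churn_yes_first <- factor(ifelse(churned, "yes", "no"), levels = c("yes", "no"))
churn_no_first  <- factor(ifelse(churned, "yes", "no"), levels = c("no", "yes"))

fit_yes_first <- glm(churn_yes_first ~ calls, family = binomial(link = "logit"))
fit_no_first  <- glm(churn_no_first  ~ calls, family = binomial(link = "logit"))

# The slopes have opposite signs: the first fit models P("no"), the second P("yes").
print(coef(fit_yes_first)["calls"])
print(coef(fit_no_first)["calls"])

# The two sets of fitted probabilities are complements of each other.
print(all.equal(unname(fitted(fit_yes_first)), 1 - unname(fitted(fit_no_first))))
```

Under this reading, a negative coefficient in a model whose response has `"yes"` as its first level means the variable pushes a customer towards churning.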

anova(model, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: churn
## 
## Terms added sequentially (first to last)
## 
## 
##                               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                                           3000     2448.2          
## state                         50   83.967      2950     2364.2  0.001864
## area_code                      2    1.279      2948     2362.9  0.527444
## international_plan             1  168.017      2947     2194.9 < 2.2e-16
## voice_mail_plan                1   55.122      2946     2139.8 1.133e-13
## number_vmail_messages          1    4.724      2945     2135.1  0.029747
## total_day_minutes              1  144.295      2944     1990.8 < 2.2e-16
## total_day_calls                1    1.519      2943     1989.2  0.217823
## total_day_charge               1    0.056      2942     1989.2  0.812846
## total_eve_minutes              1   30.658      2941     1958.5 3.078e-08
## total_eve_calls                1    0.039      2940     1958.5  0.844399
## total_eve_charge               1    0.035      2939     1958.5  0.851003
## total_night_minutes            1   17.880      2938     1940.6 2.353e-05
## total_night_calls              1    0.458      2937     1940.1  0.498729
## total_night_charge             1    0.142      2936     1940.0  0.705919
## total_intl_minutes             1    8.908      2935     1931.1  0.002840
## total_intl_calls               1    6.248      2934     1924.8  0.012432
## total_intl_charge              1    0.309      2933     1924.5  0.578053
## number_customer_service_calls  1  145.091      2932     1779.4 < 2.2e-16
##                                  
## NULL                             
## state                         ** 
## area_code                        
## international_plan            ***
## voice_mail_plan               ***
## number_vmail_messages         *  
## total_day_minutes             ***
## total_day_calls                  
## total_day_charge                 
## total_eve_minutes             ***
## total_eve_calls                  
## total_eve_charge                 
## total_night_minutes           ***
## total_night_calls                
## total_night_charge               
## total_intl_minutes            ** 
## total_intl_calls              *  
## total_intl_charge                
## number_customer_service_calls ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
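Each row of the deviance table tests whether adding that term reduces the deviance by more than chance would explain, using a chi-squared distribution with `Df` degrees of freedom. As a worked check, the `state` row can be reproduced by hand with the numbers copied from the table:

```r
# Values copied from the 'state' row of the table above.
deviance_drop <- 83.967
df_added      <- 50

# Upper-tail chi-squared probability, as reported in the Pr(>Chi) column.
p_value <- pchisq(deviance_drop, df = df_added, lower.tail = FALSE)
print(signif(p_value, 4))

# The residual deviance column is a running difference from the null deviance.
null_deviance <- 2448.2
print(null_deviance - deviance_drop)
```

Because terms are added sequentially, each p-value depends on the order of the variables. A term such as `total_day_charge` can look unimportant simply because `total_day_minutes`, which carries nearly the same information, entered the model first.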

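# Note: churn has levels c("yes","no"), so glm models P(churn == "no");
# labelling values above 0.5 as "yes" inverts the classes (see the specificity below).
fitted.results_prob <- predict(model,Test,type='response')
fitted.results_class <- ifelse(fitted.results_prob > 0.5,"yes","no")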

logit_Res=confusionMatrix(fitted.results_class,Test$churn,positive = "yes")

## Warning in confusionMatrix.default(fitted.results_class, Test$churn,
## positive = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.

logit_Res$table

##           Reference
## Prediction  yes   no
##        yes  211 1653
##        no    71   64

logit_Res$byClass

##          Sensitivity          Specificity       Pos Pred Value 
##           0.74822695           0.03727432           0.11319742 
##       Neg Pred Value            Precision               Recall 
##           0.47407407           0.11319742           0.74822695 
##                   F1           Prevalence       Detection Rate 
##           0.19664492           0.14107054           0.10555278 
## Detection Prevalence    Balanced Accuracy 
##           0.93246623           0.39275063
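The `byClass` figures follow directly from the four cells of the confusion matrix. The short calculation below recomputes a few of them from the counts printed above, as a way to see what each metric measures:

```r
# Counts copied from logit_Res$table above (rows = prediction, columns = reference).
tp <- 211   # predicted yes, actually yes
fp <- 1653  # predicted yes, actually no
fn <- 71    # predicted no,  actually yes
tn <- 64    # predicted no,  actually no

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity, specificity = specificity,
              precision = precision, f1 = f1,
              balanced_accuracy = balanced_accuracy), 4))
```

A balanced accuracy below 0.5 means the labels do worse than random guessing, which usually indicates that the predicted classes are flipped relative to the modelled probability rather than that the model has no signal.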

library(ROSE)

## Loaded ROSE 0.0-3

roc.curve(Test$churn,fitted.results_prob,main="ROC curve Logit")

## Area under the curve (AUC): 0.803
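The AUC can be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. It depends only on how the scores rank the cases, not on the 0.5 cut-off used for the class labels. A minimal base-R sketch of that rank-based calculation follows. The helper `auc_rank` and the scores are hypothetical, and this is not how ROSE computes its curve.

```r
# Hypothetical helper: AUC as the probability that a random positive case
# receives a higher score than a random negative case (ties count as half).
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)  # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and labels for illustration.
scores      <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

print(auc_rank(scores, is_positive))
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the function returns 8/9.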

Random Forest without sampling

suppressPackageStartupMessages(library(h2o))
suppressWarnings(h2o.init( nthreads=-1,max_mem_size = "64g"))

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         20 hours 55 minutes 
##     H2O cluster version:        3.10.0.8 
##     H2O cluster version age:    5 months and 11 days !!! 
##     H2O cluster name:           H2O_started_from_R_M00864_puz801 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   56.75 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)

##H2o Cluster specifications 


##Converting Data set into H2o Format 
h2o_Train=as.h2o(Train)

h2o_Test=as.h2o(Test)

#Running Random Forest Model
Rf_normal=h2o.randomForest(x = 1:18, 
                 y = 19, 
                 training_frame = h2o_Train,
                 ntrees = 1000, 
                 max_depth=20,
                 ## early stopping once the AUC doesn't improve by at least
                 ## 0.01% for 5 consecutive scoring events
                 stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC", 
                 
                 ## sample 80% of rows per tree
                 sample_rate = 0.8,
                 
                 ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
                 score_tree_interval = 10 )

rf_pred_normal=h2o.predict(Rf_normal,h2o_Test)

rf_pred_normal_df=as.data.frame(rf_pred_normal)
rf_pred_normal1=rf_pred_normal_df[,1]

# confusionMatrix() expects the predictions first and the reference labels second
Results_normal_rf=confusionMatrix(rf_pred_normal1,Test$churn,positive = "yes")

roc.curve(Test$churn,rf_pred_normal_df$yes,main="ROC curve Random Forest")

## Area under the curve (AUC): 0.915

Chapter 4 Class Imbalance and use of Over and Undersampling

In this chapter we look at class imbalance, where one class has many more examples than the other. Below we can see that churners make up about 14% of the data and non-churners about 86%.

prop.table(table(Data$churn))

## 
##    yes     no 
## 0.1414 0.8586

To address this, we over-sample and under-sample the training data so that the model sees the underrepresented class more often.

The strategy is to train the model on the resampled (balanced) data but test it on a data set where the class imbalance persists.

library(caret)
set.seed(123)

Train=as.data.frame(Train)
down_train <- downSample(x = Train[, -ncol(Train)],y = Train$churn)
table(down_train$Class)

## 
## yes  no 
## 425 425

up_train <- upSample(x = Train[, -19],y = Train$churn)
table(up_train$Class)

## 
##  yes   no 
## 2576 2576
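The two class tables above show what the resampling does: down-sampling shrinks the majority class to the size of the minority class, while up-sampling repeats minority rows until the classes match. The following base-R sketch mimics both on made-up labels. It illustrates the idea under simple assumptions and is not caret's implementation.

```r
set.seed(123)

# Made-up imbalanced labels: 10 churners, 60 non-churners.
toy <- data.frame(
  id    = 1:70,
  churn = factor(rep(c("yes", "no"), times = c(10, 60)), levels = c("yes", "no"))
)

rows_by_class <- split(seq_len(nrow(toy)), toy$churn)
n_min <- min(lengths(rows_by_class))
n_max <- max(lengths(rows_by_class))

# Down-sampling: keep n_min rows of every class, drawn without replacement.
down_rows <- unlist(lapply(rows_by_class, function(idx) idx[sample.int(length(idx), n_min)]))

# Up-sampling: add rows drawn with replacement until every class has n_max rows.
up_rows <- unlist(lapply(rows_by_class, function(idx) {
  extra <- idx[sample.int(length(idx), n_max - length(idx), replace = TRUE)]
  c(idx, extra)
}))

print(table(toy$churn[down_rows]))
print(table(toy$churn[up_rows]))
```

Down-sampling discards information from the majority class, while up-sampling duplicates minority rows, which makes standard errors from a model fitted on the up-sampled data look smaller than the real amount of data justifies.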

h2o_Train_down=as.h2o(down_train)

h2o_Train_up=as.h2o(up_train)

Logit with Over and Under sampling

model_down <- glm(Class ~.,family=binomial(link='logit'),data=down_train)

summary(model_down)

## 
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = down_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.68285  -0.72066   0.01304   0.69580   3.09649  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    8.965e+00  1.527e+00   5.872  4.3e-09 ***
## stateAL                       -6.783e-01  1.133e+00  -0.599  0.54926    
## stateAR                       -1.285e+00  1.152e+00  -1.115  0.26481    
## stateAZ                       -1.988e+00  1.139e+00  -1.746  0.08078 .  
## stateCA                       -4.028e+00  1.438e+00  -2.800  0.00511 ** 
## stateCO                       -1.613e+00  1.186e+00  -1.360  0.17376    
## stateCT                       -2.139e+00  1.159e+00  -1.845  0.06500 .  
## stateDC                       -1.693e+00  1.159e+00  -1.461  0.14404    
## stateDE                       -8.105e-01  1.071e+00  -0.757  0.44901    
## stateFL                       -1.353e+00  1.160e+00  -1.166  0.24351    
## stateGA                       -9.656e-01  1.187e+00  -0.814  0.41587    
## stateHI                       -1.453e+00  1.562e+00  -0.930  0.35227    
## stateIA                       -1.019e+00  1.189e+00  -0.857  0.39144    
## stateID                       -8.931e-01  1.067e+00  -0.837  0.40241    
## stateIL                       -1.369e+00  1.255e+00  -1.091  0.27543    
## stateIN                       -1.557e+00  1.123e+00  -1.387  0.16552    
## stateKS                       -2.053e+00  1.114e+00  -1.843  0.06530 .  
## stateKY                       -1.511e+00  1.100e+00  -1.374  0.16950    
## stateLA                       -2.518e+00  1.132e+00  -2.225  0.02610 *  
## stateMA                       -3.002e+00  1.139e+00  -2.636  0.00839 ** 
## stateMD                       -1.329e+00  1.065e+00  -1.248  0.21201    
## stateME                       -1.893e+00  1.085e+00  -1.745  0.08094 .  
## stateMI                       -1.499e+00  1.127e+00  -1.331  0.18332    
## stateMN                       -1.515e+00  1.059e+00  -1.430  0.15274    
## stateMO                       -8.670e-01  1.204e+00  -0.720  0.47153    
## stateMS                       -2.117e+00  1.076e+00  -1.967  0.04920 *  
## stateMT                       -3.299e+00  1.089e+00  -3.029  0.00245 ** 
## stateNC                       -1.129e+00  1.152e+00  -0.980  0.32717    
## stateND                       -1.270e+00  1.114e+00  -1.140  0.25425    
## stateNE                       -6.867e-01  1.187e+00  -0.579  0.56281    
## stateNH                       -1.713e+00  1.261e+00  -1.358  0.17443    
## stateNJ                       -2.153e+00  1.042e+00  -2.066  0.03880 *  
## stateNM                       -1.204e+00  1.126e+00  -1.069  0.28485    
## stateNV                       -2.079e+00  1.098e+00  -1.894  0.05821 .  
## stateNY                       -1.450e+00  1.048e+00  -1.384  0.16641    
## stateOH                       -8.633e-01  1.096e+00  -0.788  0.43096    
## stateOK                       -1.506e+00  1.055e+00  -1.427  0.15347    
## stateOR                       -2.329e+00  1.122e+00  -2.076  0.03788 *  
## statePA                       -1.442e+00  1.141e+00  -1.264  0.20639    
## stateRI                        4.109e-01  1.325e+00   0.310  0.75644    
## stateSC                       -2.048e+00  1.103e+00  -1.856  0.06345 .  
## stateSD                       -2.209e+00  1.215e+00  -1.818  0.06910 .  
## stateTN                       -2.848e+00  1.209e+00  -2.356  0.01846 *  
## stateTX                       -1.511e+00  1.073e+00  -1.409  0.15897    
## stateUT                       -2.766e+00  1.256e+00  -2.202  0.02766 *  
## stateVA                       -5.948e-01  1.176e+00  -0.506  0.61309    
## stateVT                       -5.085e-01  1.104e+00  -0.461  0.64505    
## stateWA                       -2.538e+00  1.179e+00  -2.153  0.03133 *  
## stateWI                       -1.171e+00  1.134e+00  -1.032  0.30188    
## stateWV                       -1.561e+00  1.044e+00  -1.495  0.13495    
## stateWY                        3.842e-01  1.171e+00   0.328  0.74289    
## area_codearea_code_415        -3.396e-01  2.281e-01  -1.489  0.13653    
## area_codearea_code_510        -3.722e-01  2.707e-01  -1.375  0.16917    
## international_planyes         -3.175e+00  3.240e-01  -9.799  < 2e-16 ***
## voice_mail_planyes             2.678e+00  9.480e-01   2.825  0.00473 ** 
## number_vmail_messages         -4.351e-02  2.968e-02  -1.466  0.14260    
## total_day_minutes              3.037e+00  5.752e+00   0.528  0.59751    
## total_day_calls               -2.108e-03  4.753e-03  -0.443  0.65743    
## total_day_charge              -1.796e+01  3.384e+01  -0.531  0.59557    
## total_eve_minutes             -1.957e-01  2.722e+00  -0.072  0.94268    
## total_eve_calls                7.467e-04  4.721e-03   0.158  0.87433    
## total_eve_charge               2.182e+00  3.203e+01   0.068  0.94568    
## total_night_minutes            1.149e-01  1.485e+00   0.077  0.93833    
## total_night_calls              2.710e-03  4.800e-03   0.565  0.57234    
## total_night_charge            -2.582e+00  3.301e+01  -0.078  0.93766    
## total_intl_minutes             9.502e-01  8.839e+00   0.108  0.91439    
## total_intl_calls               3.290e-02  3.835e-02   0.858  0.39092    
## total_intl_charge             -3.705e+00  3.275e+01  -0.113  0.90991    
## number_customer_service_calls -6.902e-01  7.279e-02  -9.482  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1178.35  on 849  degrees of freedom
## Residual deviance:  756.82  on 781  degrees of freedom
## AIC: 894.82
## 
## Number of Fisher Scoring iterations: 5

anova(model_down, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Class
## 
## Terms added sequentially (first to last)
## 
## 
##                               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                                            849    1178.35          
## state                         50   63.078       799    1115.27  0.101351
## area_code                      2    9.524       797    1105.75  0.008549
## international_plan             1   97.983       796    1007.77 < 2.2e-16
## voice_mail_plan                1   41.172       795     966.59 1.394e-10
## number_vmail_messages          1    4.646       794     961.95  0.031118
## total_day_minutes              1   69.557       793     892.39 < 2.2e-16
## total_day_calls                1    0.436       792     891.95  0.509190
## total_day_charge               1    0.025       791     891.93  0.875445
## total_eve_minutes              1   15.430       790     876.50 8.560e-05
## total_eve_calls                1    0.040       789     876.46  0.841061
## total_eve_charge               1    0.135       788     876.32  0.713530
## total_night_minutes            1    0.019       787     876.31  0.891797
## total_night_calls              1    0.539       786     875.77  0.462752
## total_night_charge             1    0.027       785     875.74  0.870459
## total_intl_minutes             1    2.422       784     873.32  0.119652
## total_intl_calls               1    1.573       783     871.74  0.209752
## total_intl_charge              1    0.004       782     871.74  0.949136
## number_customer_service_calls  1  114.920       781     756.82 < 2.2e-16
##                                  
## NULL                             
## state                            
## area_code                     ** 
## international_plan            ***
## voice_mail_plan               ***
## number_vmail_messages         *  
## total_day_minutes             ***
## total_day_calls                  
## total_day_charge                 
## total_eve_minutes             ***
## total_eve_calls                  
## total_eve_charge                 
## total_night_minutes              
## total_night_calls                
## total_night_charge               
## total_intl_minutes               
## total_intl_calls                 
## total_intl_charge                
## number_customer_service_calls ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

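# Note: Class has levels c("yes","no"), so glm models P(Class == "no");
# labelling values above 0.5 as "yes" inverts the classes (see the metrics below).
fitted.results_prob_down <- predict(model_down,Test,type='response')
fitted.results_class_down <- ifelse(fitted.results_prob_down > 0.5,"yes","no")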

logit_Res_down=confusionMatrix(fitted.results_class_down,Test$churn,positive = "yes")

## Warning in confusionMatrix.default(fitted.results_class_down, Test$churn, :
## Levels are not in the same order for reference and data. Refactoring data
## to match.

logit_Res_down$table

##           Reference
## Prediction  yes   no
##        yes   70 1246
##        no   212  471

logit_Res_down$byClass

##          Sensitivity          Specificity       Pos Pred Value 
##           0.24822695           0.27431567           0.05319149 
##       Neg Pred Value            Precision               Recall 
##           0.68960469           0.05319149           0.24822695 
##                   F1           Prevalence       Detection Rate 
##           0.08760951           0.14107054           0.03501751 
## Detection Prevalence    Balanced Accuracy 
##           0.65832916           0.26127131

library(ROSE)

roc.curve(Test$churn,fitted.results_prob_down,main="ROC curve Logit Down")

## Area under the curve (AUC): 0.798

model_up <- glm(Class ~.,family=binomial(link='logit'),data=up_train)

summary(model_up)

## 
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = up_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5671  -0.7393   0.0070   0.7255   3.0943  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    8.957e+00  5.809e-01  15.418  < 2e-16 ***
## stateAL                       -3.571e-01  4.226e-01  -0.845 0.398060    
## stateAR                       -8.157e-01  4.282e-01  -1.905 0.056775 .  
## stateAZ                       -1.234e+00  4.299e-01  -2.871 0.004098 ** 
## stateCA                       -2.201e+00  4.730e-01  -4.654 3.25e-06 ***
## stateCO                        1.017e-01  4.926e-01   0.207 0.836391    
## stateCT                       -1.063e+00  4.046e-01  -2.627 0.008626 ** 
## stateDC                       -1.036e+00  4.282e-01  -2.420 0.015540 *  
## stateDE                       -5.141e-01  4.162e-01  -1.235 0.216695    
## stateFL                       -2.407e-01  4.467e-01  -0.539 0.590057    
## stateGA                       -4.219e-01  4.425e-01  -0.954 0.340331    
## stateHI                        3.103e-01  4.809e-01   0.645 0.518790    
## stateIA                       -7.992e-01  4.621e-01  -1.729 0.083736 .  
## stateID                       -9.720e-01  4.084e-01  -2.380 0.017316 *  
## stateIL                       -4.309e-01  4.692e-01  -0.918 0.358367    
## stateIN                       -1.298e+00  4.214e-01  -3.079 0.002075 ** 
## stateKS                       -8.316e-01  4.166e-01  -1.996 0.045919 *  
## stateKY                       -9.258e-01  4.143e-01  -2.235 0.025430 *  
## stateLA                       -1.636e+00  4.274e-01  -3.828 0.000129 ***
## stateMA                       -1.744e+00  4.115e-01  -4.238 2.25e-05 ***
## stateMD                       -1.011e+00  4.032e-01  -2.507 0.012180 *  
## stateME                       -1.233e+00  4.112e-01  -2.999 0.002708 ** 
## stateMI                       -7.183e-01  4.296e-01  -1.672 0.094500 .  
## stateMN                       -9.470e-01  4.072e-01  -2.326 0.020039 *  
## stateMO                       -1.245e-01  4.360e-01  -0.286 0.775171    
## stateMS                       -1.662e+00  4.171e-01  -3.983 6.79e-05 ***
## stateMT                       -2.877e+00  4.115e-01  -6.990 2.75e-12 ***
## stateNC                       -8.451e-01  4.618e-01  -1.830 0.067219 .  
## stateND                       -1.244e+00  4.127e-01  -3.015 0.002567 ** 
## stateNE                       -2.181e-01  4.587e-01  -0.476 0.634425    
## stateNH                       -2.432e-01  4.434e-01  -0.548 0.583406    
## stateNJ                       -1.865e+00  4.055e-01  -4.599 4.25e-06 ***
## stateNM                       -5.136e-01  4.342e-01  -1.183 0.236875    
## stateNV                       -1.313e+00  4.153e-01  -3.161 0.001570 ** 
## stateNY                       -1.057e+00  4.056e-01  -2.605 0.009178 ** 
## stateOH                       -5.245e-01  4.199e-01  -1.249 0.211626    
## stateOK                       -1.201e+00  4.190e-01  -2.866 0.004153 ** 
## stateOR                       -1.252e+00  3.992e-01  -3.136 0.001715 ** 
## statePA                       -7.227e-01  4.402e-01  -1.642 0.100625    
## stateRI                        1.135e+00  5.133e-01   2.211 0.027013 *  
## stateSC                       -1.728e+00  4.258e-01  -4.060 4.91e-05 ***
## stateSD                       -1.281e+00  4.346e-01  -2.948 0.003194 ** 
## stateTN                       -1.284e+00  4.222e-01  -3.042 0.002347 ** 
## stateTX                       -9.722e-01  4.028e-01  -2.414 0.015795 *  
## stateUT                       -1.063e+00  4.198e-01  -2.533 0.011311 *  
## stateVA                       -7.975e-02  4.348e-01  -0.183 0.854464    
## stateVT                        2.563e-01  4.399e-01   0.583 0.560175    
## stateWA                       -1.561e+00  4.199e-01  -3.718 0.000201 ***
## stateWI                       -8.013e-01  4.263e-01  -1.879 0.060182 .  
## stateWV                       -1.039e+00  3.954e-01  -2.628 0.008586 ** 
## stateWY                        6.482e-01  4.888e-01   1.326 0.184838    
## area_codearea_code_415        -2.153e-01  9.176e-02  -2.346 0.018985 *  
## area_codearea_code_510        -2.646e-01  1.057e-01  -2.502 0.012337 *  
## international_planyes         -2.765e+00  1.199e-01 -23.055  < 2e-16 ***
## voice_mail_planyes             2.646e+00  3.662e-01   7.226 4.98e-13 ***
## number_vmail_messages         -4.650e-02  1.132e-02  -4.106 4.02e-05 ***
## total_day_minutes              2.504e+00  2.170e+00   1.154 0.248600    
## total_day_calls               -4.100e-03  1.806e-03  -2.270 0.023206 *  
## total_day_charge              -1.482e+01  1.277e+01  -1.161 0.245813    
## total_eve_minutes              1.173e-01  1.079e+00   0.109 0.913402    
## total_eve_calls               -8.249e-04  1.850e-03  -0.446 0.655611    
## total_eve_charge              -1.481e+00  1.269e+01  -0.117 0.907070    
## total_night_minutes            1.015e+00  5.668e-01   1.791 0.073259 .  
## total_night_calls              2.775e-03  1.886e-03   1.471 0.141221    
## total_night_charge            -2.269e+01  1.260e+01  -1.801 0.071694 .  
## total_intl_minutes            -2.723e+00  3.511e+00  -0.776 0.438037    
## total_intl_calls               5.322e-02  1.504e-02   3.539 0.000401 ***
## total_intl_charge              9.848e+00  1.300e+01   0.757 0.448893    
## number_customer_service_calls -6.517e-01  2.771e-02 -23.517  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7142.2  on 5151  degrees of freedom
## Residual deviance: 4811.5  on 5083  degrees of freedom
## AIC: 4949.5
## 
## Number of Fisher Scoring iterations: 5

anova(model_up, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Class
## 
## Terms added sequentially (first to last)
## 
## 
##                               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                                           5151     7142.2
## state                         50   313.26      5101     6828.9 < 2.2e-16 ***
## area_code                      2     6.94      5099     6822.0  0.031054 *
## international_plan             1   502.46      5098     6319.5 < 2.2e-16 ***
## voice_mail_plan                1   245.85      5097     6073.7 < 2.2e-16 ***
## number_vmail_messages          1    26.00      5096     6047.7 3.421e-07 ***
## total_day_minutes              1   392.85      5095     5654.8 < 2.2e-16 ***
## total_day_calls                1     4.95      5094     5649.9  0.026021 *
## total_day_charge               1     1.31      5093     5648.6  0.253036
## total_eve_minutes              1    71.50      5092     5577.1 < 2.2e-16 ***
## total_eve_calls                1     0.00      5091     5577.1  0.996561
## total_eve_charge               1     0.65      5090     5576.4  0.421593
## total_night_minutes            1    40.68      5089     5535.7 1.794e-10 ***
## total_night_calls              1     2.74      5088     5533.0  0.097741 .
## total_night_charge             1     0.89      5087     5532.1  0.345079
## total_intl_minutes             1    10.62      5086     5521.5  0.001121 **
## total_intl_calls               1    24.05      5085     5497.4 9.368e-07 ***
## total_intl_charge              1     0.76      5084     5496.7  0.382223
## number_customer_service_calls  1   685.14      5083     4811.5 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
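
Each p-value in the analysis of deviance table comes from comparing the drop in deviance with a chi-square distribution on the matching degrees of freedom. The following is a minimal sketch of that arithmetic in base R, using numbers copied from the output above; the variable names are made up for illustration, and the result differs slightly from the table because the printed deviance is rounded.

```r
# Values copied from the analysis of deviance table above.
dev_area_code <- 6.94   # drop in deviance when area_code is added
df_area_code  <- 2      # degrees of freedom used by area_code

# Upper-tail chi-square probability = p-value of the likelihood-ratio test.
p_area_code <- pchisq(dev_area_code, df = df_area_code, lower.tail = FALSE)
print(p_area_code)      # close to the 0.031054 reported in the table

# Same idea for the model as a whole: null deviance minus residual deviance,
# tested on the difference in residual degrees of freedom.
dev_model <- 7142.2 - 4811.5
df_model  <- 5151 - 5083
p_model   <- pchisq(dev_model, df = df_model, lower.tail = FALSE)
print(c(deviance_drop = dev_model, df = df_model, p_value = p_model))
```

Because the terms are added sequentially, each row tests a variable given the variables listed before it, so the order of the predictors affects the individual rows.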

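# Predicted probabilities on the test set. For a factor response, glm() models
# the probability of the second factor level, so check the levels of the
# response (Class) before mapping probabilities to labels. The negative
# coefficients for international_plan and number_customer_service_calls above,
# together with the below-chance balanced accuracy reported below, suggest the
# model describes P(churn == "no"); in that case the two labels in ifelse()
# are swapped.
fitted.results_prob_up <- predict(model_up, Test, type = 'response')
fitted.results_class_up <- ifelse(fitted.results_prob_up > 0.5, "yes", "no")

logit_Res_up = confusionMatrix(fitted.results_class_up, Test$churn, positive = "yes")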

## Warning in confusionMatrix.default(fitted.results_class_up, Test$churn, :
## Levels are not in the same order for reference and data. Refactoring data
## to match.

logit_Res_up$table

##           Reference
## Prediction  yes   no
##        yes   76 1305
##        no   206  412

logit_Res_up$byClass

##          Sensitivity          Specificity       Pos Pred Value 
##           0.26950355           0.23995341           0.05503259 
##       Neg Pred Value            Precision               Recall 
##           0.66666667           0.05503259           0.26950355 
##                   F1           Prevalence       Detection Rate 
##           0.09140108           0.14107054           0.03801901 
## Detection Prevalence    Balanced Accuracy 
##           0.69084542           0.25472848
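
A balanced accuracy well below 0.5 usually points to swapped labels rather than to a model with no signal. The sketch below uses simulated data (not the churn data set; all names are hypothetical) to show how `glm()` treats a factor response: the first level counts as "failure", so the fitted probability refers to the second level. It assumes the response levels are ordered `"yes"`, `"no"`, as the warning from `confusionMatrix` hints.

```r
# Toy data (hypothetical): the chance of churn increases with service calls.
set.seed(42)
n <- 500
calls <- rpois(n, 2)
churn_chr <- ifelse(runif(n) < plogis(-2 + 0.8 * calls), "yes", "no")

# Put "yes" first on purpose, so that "no" is the second level.
toy <- data.frame(calls = calls,
                  Class = factor(churn_chr, levels = c("yes", "no")))

fit <- glm(Class ~ calls, data = toy, family = binomial(link = "logit"))

# For a factor response the first level is "failure"; the fitted
# probability therefore refers to the second level.
modelled_level <- levels(toy$Class)[2]
other_level    <- levels(toy$Class)[1]
prob <- predict(fit, toy, type = "response")

# Map probabilities to labels using the levels instead of hard-coded strings.
pred <- factor(ifelse(prob > 0.5, modelled_level, other_level),
               levels = levels(toy$Class))

print(modelled_level)        # the level whose probability is being modelled
print(coef(fit)["calls"])    # negative: more calls lower P(Class == "no")
print(mean(pred == toy$Class))
```

Deriving the labels from `levels()` keeps the mapping correct whichever way the factor happens to be ordered.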

library(ROSE)

roc.curve(Test$churn,fitted.results_prob_up,main="ROC curve Logit up")

## Area under the curve (AUC): 0.803
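
The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. As a rough illustration (a rank-based sketch in base R, not the code that `roc.curve` runs; the function name `auc_rank` and the toy vectors are made up):

```r
# AUC from ranks (Mann-Whitney form): the share of (positive, negative)
# pairs in which the positive case has the higher score; ties count as half.
auc_rank <- function(labels, scores, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks are used for tied scores
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c("yes", "no", "yes", "no", "no")
scores <- c(0.9, 0.2, 0.6, 0.7, 0.1)   # higher score = more likely "yes"

auc_toy <- auc_rank(labels, scores, positive = "yes")
print(auc_toy)  # 5 of the 6 (yes, no) pairs are ordered correctly
```

Unlike the confusion matrix, this measure does not depend on the 0.5 cut-off, which is why a model can show a reasonable AUC and a poor confusion matrix at the same time.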

Random Forest with over- and undersampling
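
As a reminder of what the two resampled training sets contain, here is a base-R sketch of the idea on toy data. It assumes that `up_train` and `down_train` were built by resampling rows until both classes have the same size; the data frame and variable names below are hypothetical.

```r
set.seed(1)

# Toy imbalanced data: 4 churners and 16 non-churners.
toy <- data.frame(
  x     = rnorm(20),
  churn = factor(rep(c("yes", "no"), times = c(4, 16)), levels = c("yes", "no"))
)

# Row indices grouped by class.
idx_by_class <- split(seq_len(nrow(toy)), toy$churn)
n_max <- max(lengths(idx_by_class))
n_min <- min(lengths(idx_by_class))

# Up-sampling: keep every row and draw extra minority rows with replacement
# until each class has as many rows as the largest class.
up_idx <- unlist(lapply(idx_by_class, function(i) {
  extra <- n_max - length(i)
  c(i, i[sample.int(length(i), extra, replace = TRUE)])
}), use.names = FALSE)

# Down-sampling: draw rows without replacement until each class has as
# many rows as the smallest class.
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}), use.names = FALSE)

toy_up   <- toy[up_idx, ]
toy_down <- toy[down_idx, ]

print(table(toy_up$churn))
print(table(toy_down$churn))
```

Only the training data should be resampled; the test set keeps its original class balance so that the reported metrics reflect the real churn rate.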

suppressPackageStartupMessages(library(h2o))
suppressWarnings(h2o.init( nthreads=-1,max_mem_size = "64g"))

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         20 hours 55 minutes 
##     H2O cluster version:        3.10.0.8 
##     H2O cluster version age:    5 months and 11 days !!! 
##     H2O cluster name:           H2O_started_from_R_M00864_puz801 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   56.74 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)

## H2O cluster specifications (printed above by h2o.init)

## Convert the data sets into H2O frames
h2o_Train=as.h2o(Train)

h2o_Test=as.h2o(Test)

h2o_Train_down=as.h2o(down_train)

h2o_Train_up=as.h2o(up_train)

# Random forest trained on the up-sampled training set
Rf_normal_up = h2o.randomForest(x = 1:18,
                 y = 19,
                 training_frame = h2o_Train_up,
                 ntrees = 1000,
                 max_depth = 20,
                 ## early stopping once the AUC doesn't improve by at least
                 ## 0.01% for 5 consecutive scoring events (no validation_frame
                 ## is supplied here, so this is not a held-out validation AUC)
                 stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",

                 ## sample 80% of rows per tree
                 sample_rate = 0.8,

                 ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
                 score_tree_interval = 10)

rf_pred_normal_up=h2o.predict(Rf_normal_up,h2o_Test)

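rf_pred_normal_df_up = as.data.frame(rf_pred_normal_up)
# The first column of the H2O prediction frame holds the predicted class label
rf_pred_normal1_up = rf_pred_normal_df_up[, 1]

Target = Test$Target

# Note: caret::confusionMatrix() expects (data = predictions, reference = truth).
# The arguments below are in the opposite order, which swaps sensitivity with
# precision (and specificity with negative predictive value) in the result.
Results_normal_rf_up = confusionMatrix(Test$churn, rf_pred_normal1_up, positive = "yes")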

## Warning in confusionMatrix.default(Test$churn, rf_pred_normal1_up,
## positive = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.

roc.curve(Test$churn,rf_pred_normal_df_up$yes,main="ROC curve Random Forest UP")

## Area under the curve (AUC): 0.914

# Random forest trained on the down-sampled training set
Rf_normal_down = h2o.randomForest(x = 1:18,
                 y = 19,
                 training_frame = h2o_Train_down,
                 ntrees = 1000,
                 max_depth = 20,
                 ## early stopping once the AUC doesn't improve by at least
                 ## 0.01% for 5 consecutive scoring events (no validation_frame
                 ## is supplied here, so this is not a held-out validation AUC)
                 stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",

                 ## sample 80% of rows per tree
                 sample_rate = 0.8,

                 ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
                 score_tree_interval = 10)

rf_pred_normal_down=h2o.predict(Rf_normal_down,h2o_Test)

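rf_pred_normal_df_down = as.data.frame(rf_pred_normal_down)
# The first column of the H2O prediction frame holds the predicted class label
rf_pred_normal1_down = rf_pred_normal_df_down[, 1]

Target = Test$Target

# Note: caret::confusionMatrix() expects (data = predictions, reference = truth).
# The arguments below are in the opposite order, which swaps sensitivity with
# precision (and specificity with negative predictive value) in the result.
Results_normal_rf_down = confusionMatrix(Test$churn, rf_pred_normal1_down, positive = "yes")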

## Warning in confusionMatrix.default(Test$churn, rf_pred_normal1_down,
## positive = "yes"): Levels are not in the same order for reference and data.
## Refactoring data to match.

roc.curve(Test$churn,rf_pred_normal_df_down$yes,main="ROC curve Random Forest down")

## Area under the curve (AUC): 0.910

Chapter 7 Model Evaluation

ROC curve Plot

Here we look at the ROC curve and the confusion matrix to see how each model performs.

The accuracy paradox states that a predictive model with lower accuracy may have greater predictive power than a model with higher accuracy. With imbalanced classes it may therefore be better to rely on metrics such as precision and recall rather than on accuracy.

Accuracy, the ratio of correct predictions to the total number of cases evaluated, is often the starting point for judging a predictive model, and it may seem obvious that it should be a key metric. Yet a model can have high accuracy and still be useless: in this data set roughly 14% of customers churn, so a model that predicts "no" for every customer is about 86% accurate while identifying no churners at all.
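
The point can be checked with the class counts of the test set, taken from the column sums of the logistic regression confusion matrix shown earlier (282 churners, 1717 non-churners). The sketch below scores a hypothetical classifier that always answers "no"; the variable names are made up for illustration.

```r
# Class counts of the test set, from the confusion matrix printed earlier.
n_yes <- 76 + 206     # actual churners
n_no  <- 1305 + 412   # actual non-churners

truth <- factor(rep(c("yes", "no"), times = c(n_yes, n_no)),
                levels = c("yes", "no"))

# A "model" that predicts "no" for every customer.
pred <- factor(rep("no", length(truth)), levels = c("yes", "no"))

accuracy <- mean(pred == truth)
recall   <- sum(pred == "yes" & truth == "yes") / sum(truth == "yes")

print(round(c(accuracy = accuracy, recall = recall), 3))
```

The accuracy is high only because non-churners dominate the data; the recall of zero shows that not a single churner is found.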

Source: “Why Accuracy is a Bad Measure for Class Imbalance Problems”, https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/

Here we evaluate model performance with the F1 score, which is the harmonic mean of precision and recall, and by plotting the ROC curve.
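
As a worked example, precision, recall and F1 can be recomputed by hand from the four cells of the logistic regression confusion matrix shown earlier, with "yes" as the positive class. This is only a sketch of the arithmetic; the variable names are made up.

```r
# Cells of the confusion matrix printed earlier (positive class = "yes").
tp <- 76     # predicted yes, actually yes
fp <- 1305   # predicted yes, actually no
fn <- 206    # predicted no,  actually yes
tn <- 412    # predicted no,  actually no

precision <- tp / (tp + fp)   # share of predicted churners who really churn
recall    <- tp / (tp + fn)   # share of real churners who are detected

# F1 is the harmonic mean of precision and recall.
f1 <- 2 * precision * recall / (precision + recall)

print(round(c(precision = precision, recall = recall, f1 = f1), 4))
```

The values agree with the Precision, Recall and F1 entries of `logit_Res_up$byClass` above. Because the harmonic mean is pulled towards the smaller of the two numbers, a model cannot reach a high F1 by doing well on precision or recall alone.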

coe Face to Face Machine Learning Hands on Session

Swarnava

March 20, 2017