CUSTOMER CHURN PREDICTION

1.INTRODUCTION:

This project is studied for Data Scientist / Data Analyst roles, and R programming skill is purposed to display that will be required to the position. During the project, data preprocessing operations and machine learning processing will be implement using R Programming. Decision Trees (CART) model will be leveraged in ML modeling.

2.BUSINESS PROBLEM:

According to the scenario, as a data scientist, my goal is to create a prediction model that will predict customers who have the potential to churn. Thus, sales and marketing functional departments will be able to concentrate on these possible customers.

3.DATA DESCRIPTION:

We have a CSV file keeping more than 41-thousand customer list for this study. The data consists of 21 features keeping customer details. The column titles referring to features are clarified below:

1.Dependent variable (labeled variable) : “Exited”

The “Exited” column refers to customer churn. - If observation point 1, that means the churning customer. - If observation point 0, that means the loyal customer.

2.Independent variables in data below :

a)"age": 
b)"job": 
c)"marital":
d)"education": 
e)"default":
f)"housing": 
g)"contact": 
h)"month":
i)"day_of_week":
j)"duration":
k)"campaign":
l)"pdays":
m)"previous":
n)"outcome":
o)"emp_var_rate":
p)"cons_price_idx":
q)cons_price_idx":
r)"euribor3m":
s)"nr_employed":

4.METHODOLOGY:

This project will be implemented in 2 phases.

A. In descriptive analysis, it will be focused on understanding customer data, then pre-processing (data wrangling) will be implemented for predictive analysis.

B. In Predictive analysis phase, it will be created a Machine Learning model enabling to predict customers for classification.

A. Descriptive Analysis

I. Load Packages

library("tidyverse")

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("rpart")
library("rpart.plot")
library("readr")
library("knitr")
library("ggplot2")
library("modelr")
library("ggpubr")
library("ROCR")
library("tune")

II. Load Dataset From Outsource

set.seed(1234)

file = "https://raw.githubusercontent.com/vzn2000/sample_dataset/master/bankcustomers-data.csv" 

data <- read.csv(file)

head(data, dec=",")

##   age         job  marital         education default housing loan  contact
## 1  44 blue-collar  married          basic.4y unknown     yes   no cellular
## 2  53  technician  married           unknown      no      no   no cellular
## 3  28  management   single university.degree      no     yes   no cellular
## 4  39    services  married       high.school      no      no   no cellular
## 5  55     retired  married          basic.4y      no     yes   no cellular
## 6  30  management divorced          basic.4y      no     yes   no cellular
##   month day_of_week duration campaign pdays previous    poutcome emp_var_rate
## 1   aug         thu      210        1   999        0 nonexistent          1.4
## 2   nov         fri      138        1   999        0 nonexistent         -0.1
## 3   jun         thu      339        3     6        2     success         -1.7
## 4   apr         fri      185        2   999        0 nonexistent         -1.8
## 5   aug         fri      137        1     3        1     success         -2.9
## 6   jul         tue       68        8   999        0 nonexistent          1.4
##   cons_price_idx cons_conf_idx euribor3m nr_employed Exited
## 1         93.444         -36.1     4.963      5228.1      0
## 2         93.200         -42.0     4.021      5195.8      0
## 3         94.055         -39.8     0.729      4991.6      1
## 4         93.075         -47.1     1.405      5099.1      0
## 5         92.201         -31.4     0.869      5076.2      1
## 6         93.918         -42.7     4.961      5228.1      0

III. Explanatory Data Analysis and Data Preprocessing

It will be manipulated data for further analysis while exploring.

(a) Examine Data

Check data format

is.data.frame(data)

## [1] TRUE

Check variable types in data

str(data)

## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  44 53 28 39 55 30 37 39 36 27 ...
##  $ job           : chr  "blue-collar" "technician" "management" "services" ...
##  $ marital       : chr  "married" "married" "single" "married" ...
##  $ education     : chr  "basic.4y" "unknown" "university.degree" "high.school" ...
##  $ default       : chr  "unknown" "no" "no" "no" ...
##  $ housing       : chr  "yes" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "cellular" "cellular" "cellular" "cellular" ...
##  $ month         : chr  "aug" "nov" "jun" "apr" ...
##  $ day_of_week   : chr  "thu" "fri" "thu" "fri" ...
##  $ duration      : int  210 138 339 185 137 68 204 191 174 191 ...
##  $ campaign      : int  1 1 3 2 1 8 1 1 1 2 ...
##  $ pdays         : int  999 999 6 999 3 999 999 999 3 999 ...
##  $ previous      : int  0 0 2 0 1 0 0 0 1 1 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "success" "nonexistent" ...
##  $ emp_var_rate  : num  1.4 -0.1 -1.7 -1.8 -2.9 1.4 -1.8 -1.8 -2.9 -1.8 ...
##  $ cons_price_idx: num  93.4 93.2 94.1 93.1 92.2 ...
##  $ cons_conf_idx : num  -36.1 -42 -39.8 -47.1 -31.4 -42.7 -46.2 -46.2 -40.8 -47.1 ...
##  $ euribor3m     : num  4.963 4.021 0.729 1.405 0.869 ...
##  $ nr_employed   : num  5228 5196 4992 5099 5076 ...
##  $ Exited        : int  0 0 1 0 1 0 0 0 1 0 ...

Check data volume

dim(data)

## [1] 41188    21

Check data summary

summary(data)

##       age            job              marital           education        
##  Min.   :17.00   Length:41188       Length:41188       Length:41188      
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.02                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :98.00                                                           
##    default            housing              loan             contact         
##  Length:41188       Length:41188       Length:41188       Length:41188      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:41188       Length:41188       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 102.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.3   Mean   : 2.568  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :56.000  
##      pdays          previous       poutcome          emp_var_rate     
##  Min.   :  0.0   Min.   :0.000   Length:41188       Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.000   Mode  :character   Median : 1.10000  
##  Mean   :962.5   Mean   :0.173                      Mean   : 0.08189  
##  3rd Qu.:999.0   3rd Qu.:0.000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :7.000                      Max.   : 1.40000  
##  cons_price_idx  cons_conf_idx     euribor3m      nr_employed  
##  Min.   :92.20   Min.   :-50.8   Min.   :0.634   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   1st Qu.:5099  
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191  
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5167  
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
##      Exited      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1127  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

(b) Check Missing Values

Is there any missing value ?

sum(is.na(data))

## [1] 0

(Note: If there were null values I would try to understand reason. If nulls are randomly, I would fill them mean or fix value or drop from data. It depends on situation.)

In order to fill missing value, I would prefer to use one of below:

df$age[is.na(df$age)]<- mean(data$age, na.rm=TRUE)

data %>% mutate(c1 = replace_na(age, mean(data$age, na.rm=TRUE))

mean(data$age, na.rm=TRUE)

## [1] 40.02406

(c) Outliers

Box Plot Graph is the most common method to observe outliers. Then we can remove outliers or change outliers with mean values or threshold values using Interquartile Range (IQR).

# Box Whisker Plot

ggplot(data,aes(age)) +
  geom_boxplot(color ="black", 
            fill = "orange") +
  xlab("Age") +
  ggtitle("Age Distribution")

According to Box Plot chart, there are several outliers in age variable. However, we assume there is no outlier in dataset.

Let’s make several data manipulation and data visualization for this part.

# Histogram and Density Graphs

ggplot(data, aes(age)) +
  geom_histogram(aes(y = ..density..),
                 color ="black", 
                 fill = "orange") +
  xlab("Age Distribution") +
  ylab("Frequency") + 
  ggtitle("Age Histogram")+
  geom_density(alpha = .3,
               fill = "black")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(age, fill=marital)) +
  geom_histogram(alpha = .5,
                 position = "identity") +
  xlab("Age") +
  ylab("Frequency") + 
  ggtitle("Marital Status by Age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Loan by Jobs

ggplot(data,aes(job, fill=loan)) +
  geom_bar()+
  xlab("Jobs") +
  ylab("Loan") + 
  ggtitle("Loan by Jobs")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Let's check average ages by education groups 

data %>%
    group_by(education) %>%
                    summarise(num = n(),
                            avg_age = mean(age, na.rm = TRUE),
                            med_age = median(age, na.rm = TRUE),
                            std_age = sd(age, na.rm = TRUE),
                            var_age = var(age, na.rm = TRUE)
                            
                            )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 8 x 6
##   education             num avg_age med_age std_age var_age
##   <chr>               <int>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 basic.4y             4176    47.6      47   12.1    147. 
## 2 basic.6y             2292    40.4      39    8.69    75.5
## 3 basic.9y             6045    39.1      38    9.59    91.9
## 4 high.school          9515    38.0      36    9.71    94.2
## 5 illiterate             18    48.5      48   10.9    120. 
## 6 professional.course  5243    40.1      38    9.90    97.9
## 7 university.degree   12168    38.9      36    9.62    92.6
## 8 unknown              1731    43.5      43   12.5    156.

Data Aggregation Using “group_by”

# Let's check average ages by job groups 

data %>%
     group_by(job) %>%
                  summarise(num = n(),
                            avg_age = mean(age, na.rm = TRUE),
                            med_age = median(age, na.rm = TRUE),
                            std_age = sd(age, na.rm = TRUE),
                            var_age = var(age, na.rm = TRUE)
                            
                            )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 12 x 6
##    job             num avg_age med_age std_age var_age
##    <chr>         <int>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 admin.        10422    38.2      36    8.91    79.3
##  2 blue-collar    9254    39.6      39    8.83    77.9
##  3 entrepreneur   1456    41.7      41    8.91    79.4
##  4 housemaid      1060    45.5      45   10.8    116. 
##  5 management     2924    42.4      42    9.30    86.6
##  6 retired        1720    62.0      59   10.5    110. 
##  7 self-employed  1421    39.9      39    9.42    88.8
##  8 services       3969    37.9      36    9.02    81.3
##  9 student         875    25.9      25    4.99    24.9
## 10 technician     6743    38.5      37    8.66    75.0
## 11 unemployed     1014    39.7      39    9.27    85.9
## 12 unknown         330    45.6      45   11.2    125.

(d) Correlation of Independent Variables with Each Other

In this step, we should explore the correlation of independent variables, and then we should eliminate some of those if they are represented by one of them. Two or more correlated variables can affect both prediction and performance negatively. ROC Curve is an excellent option to see which independent variable that is correlated with other independent ones can represent all, so we can keep that one and remove other correlated independent variables , also heat map and matrix can be used to see the correlation of variables.

(Note:Spearman is utilized in parametric test whereas Pearson is utilized non-parametric test. Kendall is another method if data fails Normality Test. That means non-parametric test will be applied.)

library("ggpubr")


data_num<- data%>%
    select(-job, -marital, -education, -default, -housing, -loan, -poutcome, -contact, -month, -day_of_week)


cor(data_num, method = c("spearman"))

##                         age     duration     campaign        pdays    previous
## age             1.000000000 -0.002122552  0.005715229 -0.001062391 -0.01264261
## duration       -0.002122552  1.000000000 -0.080952981 -0.083072207  0.04240740
## campaign        0.005715229 -0.080952981  1.000000000  0.055514463 -0.08741996
## pdays          -0.001062391 -0.083072207  0.055514463  1.000000000 -0.50958283
## previous       -0.012642606  0.042407399 -0.087419960 -0.509582829  1.00000000
## emp_var_rate    0.045003376 -0.069200749  0.156360175  0.227697760 -0.43529294
## cons_price_idx  0.044789573  0.002854432  0.096494032  0.056762734 -0.28271977
## cons_conf_idx   0.114536484 -0.008677671 -0.001554411 -0.077264772 -0.11596680
## euribor3m       0.054390893 -0.078346358  0.140512480  0.278488098 -0.45472433
## nr_employed     0.044789262 -0.095224788  0.144264423  0.290673104 -0.43872079
## Exited         -0.011860719  0.348776856 -0.063668140 -0.324941799  0.20091403
##                emp_var_rate cons_price_idx cons_conf_idx   euribor3m
## age              0.04500338    0.044789573   0.114536484  0.05439089
## duration        -0.06920075    0.002854432  -0.008677671 -0.07834636
## campaign         0.15636017    0.096494032  -0.001554411  0.14051248
## pdays            0.22769776    0.056762734  -0.077264772  0.27848810
## previous        -0.43529294   -0.282719773  -0.115966803 -0.45472433
## emp_var_rate     1.00000000    0.664890298   0.224693285  0.93991521
## cons_price_idx   0.66489030    1.000000000   0.245633251  0.49097007
## cons_conf_idx    0.22469329    0.245633251   1.000000000  0.23664618
## euribor3m        0.93991521    0.490970068   0.236646183  1.00000000
## nr_employed      0.94470154    0.464738409   0.132707479  0.92885696
## Exited          -0.24747967   -0.122187103   0.041228586 -0.26682160
##                nr_employed      Exited
## age             0.04478926 -0.01186072
## duration       -0.09522479  0.34877686
## campaign        0.14426442 -0.06366814
## pdays           0.29067310 -0.32494180
## previous       -0.43872079  0.20091403
## emp_var_rate    0.94470154 -0.24747967
## cons_price_idx  0.46473841 -0.12218710
## cons_conf_idx   0.13270748  0.04122859
## euribor3m       0.92885696 -0.26682160
## nr_employed     1.00000000 -0.28395704
## Exited         -0.28395704  1.00000000

There are very high positive correlation between “nr_employed” and “euribor3m” and “emp_var_rate”. There are also some moderate negative correlations between a few variables that we can see the matrix above. We can think to remove 2 ones of 3 correlated variables after ROC Curve Analysis.

We can easily understand if one variable stays inside area under curve (AUC) of the other one, it means that it will be represented by the second one.

Let’s check it below!

library(ROCR)

data(ROCR.simple)


df_roc<-data.frame(ROCR.simple)

pred1<- prediction(data$nr_employed , data_num$Exited)
pred2<- prediction(data$euribor3m , data_num$Exited)
pred3<- prediction(data$emp_var_rate , data_num$Exited)


perf1<- performance(pred1, "tpr","fpr")
perf2<- performance(pred2, "tpr","fpr")
perf3<- performance(pred3, "tpr","fpr")


plot(perf1)
plot(perf2, add = TRUE,)
plot(perf3, add = TRUE, colorize=TRUE)

As understood from ROC analysis above, “emp_var_rate” represents other 2 variables. So we can remove “nr_employed” and “euribor3m”.

(e) Correlation of Independent Variables With Target Variable

This is also another important subject. When we explore impact of independent variables on target (dependent variable) we will understand importance of level of independent variables. In this situation, it will be unnecessary to hold non-correlated or low correlated variables from our predictive model. It depends!

cor(y=data$Exited, x=data_num,  method = c("spearman"))

##                       [,1]
## age            -0.01186072
## duration        0.34877686
## campaign       -0.06366814
## pdays          -0.32494180
## previous        0.20091403
## emp_var_rate   -0.24747967
## cons_price_idx -0.12218710
## cons_conf_idx   0.04122859
## euribor3m      -0.26682160
## nr_employed    -0.28395704
## Exited          1.00000000

There is no important correlation between dependent and independent variables.

B. Predictive Analysis

Now, we will convert all categorical data into numerical one, or remove from dataset if not need. Because ML algorithms runs only numerical data.

I.Data Encoding / One Hot Encoding

Now we will select categorical variables and make them continuous data types

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

data_cat<-(
  data%>%
  select(job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome)
           )

dmy<- dummyVars("~ .", data = data_cat)

trs<- data.frame(predict(dmy, newdata=data_cat))

head(trs)

##   jobadmin. jobblue.collar jobentrepreneur jobhousemaid jobmanagement
## 1         0              1               0            0             0
## 2         0              0               0            0             0
## 3         0              0               0            0             1
## 4         0              0               0            0             0
## 5         0              0               0            0             0
## 6         0              0               0            0             1
##   jobretired jobself.employed jobservices jobstudent jobtechnician
## 1          0                0           0          0             0
## 2          0                0           0          0             1
## 3          0                0           0          0             0
## 4          0                0           1          0             0
## 5          1                0           0          0             0
## 6          0                0           0          0             0
##   jobunemployed jobunknown maritaldivorced maritalmarried maritalsingle
## 1             0          0               0              1             0
## 2             0          0               0              1             0
## 3             0          0               0              0             1
## 4             0          0               0              1             0
## 5             0          0               0              1             0
## 6             0          0               1              0             0
##   maritalunknown educationbasic.4y educationbasic.6y educationbasic.9y
## 1              0                 1                 0                 0
## 2              0                 0                 0                 0
## 3              0                 0                 0                 0
## 4              0                 0                 0                 0
## 5              0                 1                 0                 0
## 6              0                 1                 0                 0
##   educationhigh.school educationilliterate educationprofessional.course
## 1                    0                   0                            0
## 2                    0                   0                            0
## 3                    0                   0                            0
## 4                    1                   0                            0
## 5                    0                   0                            0
## 6                    0                   0                            0
##   educationuniversity.degree educationunknown defaultno defaultunknown
## 1                          0                0         0              1
## 2                          0                1         1              0
## 3                          1                0         1              0
## 4                          0                0         1              0
## 5                          0                0         1              0
## 6                          0                0         1              0
##   defaultyes housingno housingunknown housingyes loanno loanunknown loanyes
## 1          0         0              0          1      1           0       0
## 2          0         1              0          0      1           0       0
## 3          0         0              0          1      1           0       0
## 4          0         1              0          0      1           0       0
## 5          0         0              0          1      1           0       0
## 6          0         0              0          1      1           0       0
##   contactcellular contacttelephone monthapr monthaug monthdec monthjul monthjun
## 1               1                0        0        1        0        0        0
## 2               1                0        0        0        0        0        0
## 3               1                0        0        0        0        0        1
## 4               1                0        1        0        0        0        0
## 5               1                0        0        1        0        0        0
## 6               1                0        0        0        0        1        0
##   monthmar monthmay monthnov monthoct monthsep day_of_weekfri day_of_weekmon
## 1        0        0        0        0        0              0              0
## 2        0        0        1        0        0              1              0
## 3        0        0        0        0        0              0              0
## 4        0        0        0        0        0              1              0
## 5        0        0        0        0        0              1              0
## 6        0        0        0        0        0              0              0
##   day_of_weekthu day_of_weektue day_of_weekwed poutcomefailure
## 1              1              0              0               0
## 2              0              0              0               0
## 3              1              0              0               0
## 4              0              0              0               0
## 5              0              0              0               0
## 6              0              1              0               0
##   poutcomenonexistent poutcomesuccess
## 1                   1               0
## 2                   1               0
## 3                   0               1
## 4                   1               0
## 5                   0               1
## 6                   1               0

We converted categorical variables into continuous variable using Encoders. Now let’s remove categorical columns from dataset and put new encoded data instead.

# remove categorical variables. We assumed some categorical variables don't affect target variable. Therefore we will remove them as well.

data_<-data%>%
      select( -job, -marital, -education, -default, -housing, -loan, -poutcome, -contact, -month, -day_of_week, -nr_employed, -euribor3m )

str(data_)

## 'data.frame':    41188 obs. of  9 variables:
##  $ age           : int  44 53 28 39 55 30 37 39 36 27 ...
##  $ duration      : int  210 138 339 185 137 68 204 191 174 191 ...
##  $ campaign      : int  1 1 3 2 1 8 1 1 1 2 ...
##  $ pdays         : int  999 999 6 999 3 999 999 999 3 999 ...
##  $ previous      : int  0 0 2 0 1 0 0 0 1 1 ...
##  $ emp_var_rate  : num  1.4 -0.1 -1.7 -1.8 -2.9 1.4 -1.8 -1.8 -2.9 -1.8 ...
##  $ cons_price_idx: num  93.4 93.2 94.1 93.1 92.2 ...
##  $ cons_conf_idx : num  -36.1 -42 -39.8 -47.1 -31.4 -42.7 -46.2 -46.2 -40.8 -47.1 ...
##  $ Exited        : int  0 0 1 0 1 0 0 0 1 0 ...

encoded_data<- cbind(data_, trs )

head(encoded_data)

##   age duration campaign pdays previous emp_var_rate cons_price_idx
## 1  44      210        1   999        0          1.4         93.444
## 2  53      138        1   999        0         -0.1         93.200
## 3  28      339        3     6        2         -1.7         94.055
## 4  39      185        2   999        0         -1.8         93.075
## 5  55      137        1     3        1         -2.9         92.201
## 6  30       68        8   999        0          1.4         93.918
##   cons_conf_idx Exited jobadmin. jobblue.collar jobentrepreneur jobhousemaid
## 1         -36.1      0         0              1               0            0
## 2         -42.0      0         0              0               0            0
## 3         -39.8      1         0              0               0            0
## 4         -47.1      0         0              0               0            0
## 5         -31.4      1         0              0               0            0
## 6         -42.7      0         0              0               0            0
##   jobmanagement jobretired jobself.employed jobservices jobstudent
## 1             0          0                0           0          0
## 2             0          0                0           0          0
## 3             1          0                0           0          0
## 4             0          0                0           1          0
## 5             0          1                0           0          0
## 6             1          0                0           0          0
##   jobtechnician jobunemployed jobunknown maritaldivorced maritalmarried
## 1             0             0          0               0              1
## 2             1             0          0               0              1
## 3             0             0          0               0              0
## 4             0             0          0               0              1
## 5             0             0          0               0              1
## 6             0             0          0               1              0
##   maritalsingle maritalunknown educationbasic.4y educationbasic.6y
## 1             0              0                 1                 0
## 2             0              0                 0                 0
## 3             1              0                 0                 0
## 4             0              0                 0                 0
## 5             0              0                 1                 0
## 6             0              0                 1                 0
##   educationbasic.9y educationhigh.school educationilliterate
## 1                 0                    0                   0
## 2                 0                    0                   0
## 3                 0                    0                   0
## 4                 0                    1                   0
## 5                 0                    0                   0
## 6                 0                    0                   0
##   educationprofessional.course educationuniversity.degree educationunknown
## 1                            0                          0                0
## 2                            0                          0                1
## 3                            0                          1                0
## 4                            0                          0                0
## 5                            0                          0                0
## 6                            0                          0                0
##   defaultno defaultunknown defaultyes housingno housingunknown housingyes
## 1         0              1          0         0              0          1
## 2         1              0          0         1              0          0
## 3         1              0          0         0              0          1
## 4         1              0          0         1              0          0
## 5         1              0          0         0              0          1
## 6         1              0          0         0              0          1
##   loanno loanunknown loanyes contactcellular contacttelephone monthapr monthaug
## 1      1           0       0               1                0        0        1
## 2      1           0       0               1                0        0        0
## 3      1           0       0               1                0        0        0
## 4      1           0       0               1                0        1        0
## 5      1           0       0               1                0        0        1
## 6      1           0       0               1                0        0        0
##   monthdec monthjul monthjun monthmar monthmay monthnov monthoct monthsep
## 1        0        0        0        0        0        0        0        0
## 2        0        0        0        0        0        1        0        0
## 3        0        0        1        0        0        0        0        0
## 4        0        0        0        0        0        0        0        0
## 5        0        0        0        0        0        0        0        0
## 6        0        1        0        0        0        0        0        0
##   day_of_weekfri day_of_weekmon day_of_weekthu day_of_weektue day_of_weekwed
## 1              0              0              1              0              0
## 2              1              0              0              0              0
## 3              0              0              1              0              0
## 4              1              0              0              0              0
## 5              1              0              0              0              0
## 6              0              0              0              1              0
##   poutcomefailure poutcomenonexistent poutcomesuccess
## 1               0                   1               0
## 2               0                   1               0
## 3               0                   0               1
## 4               0                   1               0
## 5               0                   0               1
## 6               0                   1               0

II.Split Train and Test Data

# split our data so that 30% is in the test set and 70% is in the training set

split_df <- resample_partition (encoded_data, c(test= 0.3, train= 0.7))


# how many cases are in test & training set? 

lapply(split_df, dim)

## $test
## [1] 12356    62
## 
## $train
## [1] 28832    62

III. Set up Model

It will be utilized Decision Trees model.

(a) Decision Trees

# Let's apply classification tree


fit <- rpart( Exited ~ ., data = split_df$train, method ="class", parms = list(split = "information"))

y_pred <- predict(fit, split_df$test, type="class")

y_pred_df <- data.frame(y_pred)

y_test_df <- data.frame(split_df$test)%>%select(Exited)


accuracy<- mean( y_pred_df == y_test_df)

accuracy

## [1] 0.9110554

Current Decision Trees Model

rpart.plot(fit, extra =106, )

(b) Model Tuning

Tuning manually

tuning<- rpart.control(minsplit=8, minbucket = round(8/3), maxdepth=30)

fit <- rpart( Exited ~ ., data = split_df$train, method ="class", parms = list(split = "information"), control=tuning)

y_pred_tuned <- predict(fit, split_df$test, type="class")

y_pred_tuned_df <- data.frame(y_pred_tuned)

y_test_df <- data.frame(split_df$test)%>%select(Exited)


accuracy<- mean( y_pred_tuned_df == y_test_df)

accuracy

## [1] 0.9110554

Tuning Automatically

library("tune")


fit <- rpart( Exited ~ ., data = split_df$train, method ="class", parms = list(split = "information"), control=tune())

y_pred_tuned <- predict(fit, split_df$test, type="class")

y_pred_tuned_df <- data.frame(y_pred_tuned)

y_test_df <- data.frame(split_df$test)%>%select(Exited)


accuracy<- mean( y_pred_tuned_df == y_test_df)

accuracy

## [1] 0.9110554

5.RESULT AND DISCUSSION

In this model, it is preferred to use Decision Trees (CART) to benefit from the advantages of the model. The most important advantages for this task are that CART is resistant to outliers and missing values, and no needed to make a scaling dataset. Also, it is easy to explain the model to technical personnel and stakeholders. Therefore, CART will be able to tackle and tolerate human error and perform big datasets.

Additionally, the prediction rate is % 91.1. This displays that the prediction accuracy is highly successful for this task. It is recommended to utilize the CART Model for this customer churn project since CART is one of the best models to balance possible issues in the future.

6.CONCLUSION

This project is completed successfully. Data preprocessing and ML modeling phases are targeted to display basically using 41-thousand-row of dataset. The project has samples of R programming functions, data preprocessing, data visualization, statistics, correlation analysis, machine learning, and reporting as well. Besides, the prediction success rate of the model is achieved as 91.1 % at the end of the project.

R Machine Learning Sample Project

Vedat Ozkan

1/24/2021