This project is studied for Data Scientist / Data Analyst roles, and R programming skill is purposed to display that will be required to the position. During the project, data preprocessing operations and machine learning processing will be implement using R Programming. Decision Trees (CART) model will be leveraged in ML modeling.
According to the scenario, as a data scientist, my goal is to create a prediction model that will predict customers who have the potential to churn. Thus, sales and marketing functional departments will be able to concentrate on these possible customers.
We have a CSV file keeping more than 41-thousand customer list for this study. The data consists of 21 features keeping customer details. The column titles referring to features are clarified below:
1.Dependent variable (labeled variable) : “Exited”
The “Exited” column refers to customer churn. - If observation point 1, that means the churning customer. - If observation point 0, that means the loyal customer.
2.Independent variables in data below :
a)"age":
b)"job":
c)"marital":
d)"education":
e)"default":
f)"housing":
g)"contact":
h)"month":
i)"day_of_week":
j)"duration":
k)"campaign":
l)"pdays":
m)"previous":
n)"outcome":
o)"emp_var_rate":
p)"cons_price_idx":
q)cons_price_idx":
r)"euribor3m":
s)"nr_employed":
This project will be implemented in 2 phases.
A. In descriptive analysis, it will be focused on understanding customer data, then pre-processing (data wrangling) will be implemented for predictive analysis.
B. In Predictive analysis phase, it will be created a Machine Learning model enabling to predict customers for classification.
library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("rpart")
library("rpart.plot")
library("readr")
library("knitr")
library("ggplot2")
library("modelr")
library("ggpubr")
library("ROCR")
library("tune")
set.seed(1234)
file = "https://raw.githubusercontent.com/vzn2000/sample_dataset/master/bankcustomers-data.csv"
data <- read.csv(file)
head(data, dec=",")
## age job marital education default housing loan contact
## 1 44 blue-collar married basic.4y unknown yes no cellular
## 2 53 technician married unknown no no no cellular
## 3 28 management single university.degree no yes no cellular
## 4 39 services married high.school no no no cellular
## 5 55 retired married basic.4y no yes no cellular
## 6 30 management divorced basic.4y no yes no cellular
## month day_of_week duration campaign pdays previous poutcome emp_var_rate
## 1 aug thu 210 1 999 0 nonexistent 1.4
## 2 nov fri 138 1 999 0 nonexistent -0.1
## 3 jun thu 339 3 6 2 success -1.7
## 4 apr fri 185 2 999 0 nonexistent -1.8
## 5 aug fri 137 1 3 1 success -2.9
## 6 jul tue 68 8 999 0 nonexistent 1.4
## cons_price_idx cons_conf_idx euribor3m nr_employed Exited
## 1 93.444 -36.1 4.963 5228.1 0
## 2 93.200 -42.0 4.021 5195.8 0
## 3 94.055 -39.8 0.729 4991.6 1
## 4 93.075 -47.1 1.405 5099.1 0
## 5 92.201 -31.4 0.869 5076.2 1
## 6 93.918 -42.7 4.961 5228.1 0
It will be manipulated data for further analysis while exploring.
is.data.frame(data)
## [1] TRUE
str(data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 44 53 28 39 55 30 37 39 36 27 ...
## $ job : chr "blue-collar" "technician" "management" "services" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education : chr "basic.4y" "unknown" "university.degree" "high.school" ...
## $ default : chr "unknown" "no" "no" "no" ...
## $ housing : chr "yes" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "cellular" "cellular" "cellular" "cellular" ...
## $ month : chr "aug" "nov" "jun" "apr" ...
## $ day_of_week : chr "thu" "fri" "thu" "fri" ...
## $ duration : int 210 138 339 185 137 68 204 191 174 191 ...
## $ campaign : int 1 1 3 2 1 8 1 1 1 2 ...
## $ pdays : int 999 999 6 999 3 999 999 999 3 999 ...
## $ previous : int 0 0 2 0 1 0 0 0 1 1 ...
## $ poutcome : chr "nonexistent" "nonexistent" "success" "nonexistent" ...
## $ emp_var_rate : num 1.4 -0.1 -1.7 -1.8 -2.9 1.4 -1.8 -1.8 -2.9 -1.8 ...
## $ cons_price_idx: num 93.4 93.2 94.1 93.1 92.2 ...
## $ cons_conf_idx : num -36.1 -42 -39.8 -47.1 -31.4 -42.7 -46.2 -46.2 -40.8 -47.1 ...
## $ euribor3m : num 4.963 4.021 0.729 1.405 0.869 ...
## $ nr_employed : num 5228 5196 4992 5099 5076 ...
## $ Exited : int 0 0 1 0 1 0 0 0 1 0 ...
dim(data)
## [1] 41188 21
summary(data)
## age job marital education
## Min. :17.00 Length:41188 Length:41188 Length:41188
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.02
## 3rd Qu.:47.00
## Max. :98.00
## default housing loan contact
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:41188 Length:41188 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.3 Mean : 2.568
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :56.000
## pdays previous poutcome emp_var_rate
## Min. : 0.0 Min. :0.000 Length:41188 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.000 Mode :character Median : 1.10000
## Mean :962.5 Mean :0.173 Mean : 0.08189
## 3rd Qu.:999.0 3rd Qu.:0.000 3rd Qu.: 1.40000
## Max. :999.0 Max. :7.000 Max. : 1.40000
## cons_price_idx cons_conf_idx euribor3m nr_employed
## Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## Exited
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1127
## 3rd Qu.:0.0000
## Max. :1.0000
sum(is.na(data))
## [1] 0
(Note: If there were null values I would try to understand reason. If nulls are randomly, I would fill them mean or fix value or drop from data. It depends on situation.)
In order to fill missing value, I would prefer to use one of below:
df\(age[is.na(df\)age)]<- mean(data$age, na.rm=TRUE)
OR
data %>% mutate(c1 = replace_na(age, mean(data$age, na.rm=TRUE))
mean(data$age, na.rm=TRUE)
## [1] 40.02406
# Box Whisker Plot
ggplot(data,aes(age)) +
geom_boxplot(color ="black",
fill = "orange") +
xlab("Age") +
ggtitle("Age Distribution")
According to Box Plot chart, there are several outliers in age variable. However, we assume there is no outlier in dataset.
# Histogram and Density Graphs
ggplot(data, aes(age)) +
geom_histogram(aes(y = ..density..),
color ="black",
fill = "orange") +
xlab("Age Distribution") +
ylab("Frequency") +
ggtitle("Age Histogram")+
geom_density(alpha = .3,
fill = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(age, fill=marital)) +
geom_histogram(alpha = .5,
position = "identity") +
xlab("Age") +
ylab("Frequency") +
ggtitle("Marital Status by Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Loan by Jobs
ggplot(data,aes(job, fill=loan)) +
geom_bar()+
xlab("Jobs") +
ylab("Loan") +
ggtitle("Loan by Jobs")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
# Let's check average ages by education groups
data %>%
group_by(education) %>%
summarise(num = n(),
avg_age = mean(age, na.rm = TRUE),
med_age = median(age, na.rm = TRUE),
std_age = sd(age, na.rm = TRUE),
var_age = var(age, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 6
## education num avg_age med_age std_age var_age
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 basic.4y 4176 47.6 47 12.1 147.
## 2 basic.6y 2292 40.4 39 8.69 75.5
## 3 basic.9y 6045 39.1 38 9.59 91.9
## 4 high.school 9515 38.0 36 9.71 94.2
## 5 illiterate 18 48.5 48 10.9 120.
## 6 professional.course 5243 40.1 38 9.90 97.9
## 7 university.degree 12168 38.9 36 9.62 92.6
## 8 unknown 1731 43.5 43 12.5 156.
Data Aggregation Using “group_by”
# Let's check average ages by job groups
data %>%
group_by(job) %>%
summarise(num = n(),
avg_age = mean(age, na.rm = TRUE),
med_age = median(age, na.rm = TRUE),
std_age = sd(age, na.rm = TRUE),
var_age = var(age, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 6
## job num avg_age med_age std_age var_age
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 admin. 10422 38.2 36 8.91 79.3
## 2 blue-collar 9254 39.6 39 8.83 77.9
## 3 entrepreneur 1456 41.7 41 8.91 79.4
## 4 housemaid 1060 45.5 45 10.8 116.
## 5 management 2924 42.4 42 9.30 86.6
## 6 retired 1720 62.0 59 10.5 110.
## 7 self-employed 1421 39.9 39 9.42 88.8
## 8 services 3969 37.9 36 9.02 81.3
## 9 student 875 25.9 25 4.99 24.9
## 10 technician 6743 38.5 37 8.66 75.0
## 11 unemployed 1014 39.7 39 9.27 85.9
## 12 unknown 330 45.6 45 11.2 125.
In this step, we should explore the correlation of independent variables, and then we should eliminate some of those if they are represented by one of them. Two or more correlated variables can affect both prediction and performance negatively. ROC Curve is an excellent option to see which independent variable that is correlated with other independent ones can represent all, so we can keep that one and remove other correlated independent variables , also heat map and matrix can be used to see the correlation of variables.
(Note:Spearman is utilized in parametric test whereas Pearson is utilized non-parametric test. Kendall is another method if data fails Normality Test. That means non-parametric test will be applied.)
library("ggpubr")
data_num<- data%>%
select(-job, -marital, -education, -default, -housing, -loan, -poutcome, -contact, -month, -day_of_week)
cor(data_num, method = c("spearman"))
## age duration campaign pdays previous
## age 1.000000000 -0.002122552 0.005715229 -0.001062391 -0.01264261
## duration -0.002122552 1.000000000 -0.080952981 -0.083072207 0.04240740
## campaign 0.005715229 -0.080952981 1.000000000 0.055514463 -0.08741996
## pdays -0.001062391 -0.083072207 0.055514463 1.000000000 -0.50958283
## previous -0.012642606 0.042407399 -0.087419960 -0.509582829 1.00000000
## emp_var_rate 0.045003376 -0.069200749 0.156360175 0.227697760 -0.43529294
## cons_price_idx 0.044789573 0.002854432 0.096494032 0.056762734 -0.28271977
## cons_conf_idx 0.114536484 -0.008677671 -0.001554411 -0.077264772 -0.11596680
## euribor3m 0.054390893 -0.078346358 0.140512480 0.278488098 -0.45472433
## nr_employed 0.044789262 -0.095224788 0.144264423 0.290673104 -0.43872079
## Exited -0.011860719 0.348776856 -0.063668140 -0.324941799 0.20091403
## emp_var_rate cons_price_idx cons_conf_idx euribor3m
## age 0.04500338 0.044789573 0.114536484 0.05439089
## duration -0.06920075 0.002854432 -0.008677671 -0.07834636
## campaign 0.15636017 0.096494032 -0.001554411 0.14051248
## pdays 0.22769776 0.056762734 -0.077264772 0.27848810
## previous -0.43529294 -0.282719773 -0.115966803 -0.45472433
## emp_var_rate 1.00000000 0.664890298 0.224693285 0.93991521
## cons_price_idx 0.66489030 1.000000000 0.245633251 0.49097007
## cons_conf_idx 0.22469329 0.245633251 1.000000000 0.23664618
## euribor3m 0.93991521 0.490970068 0.236646183 1.00000000
## nr_employed 0.94470154 0.464738409 0.132707479 0.92885696
## Exited -0.24747967 -0.122187103 0.041228586 -0.26682160
## nr_employed Exited
## age 0.04478926 -0.01186072
## duration -0.09522479 0.34877686
## campaign 0.14426442 -0.06366814
## pdays 0.29067310 -0.32494180
## previous -0.43872079 0.20091403
## emp_var_rate 0.94470154 -0.24747967
## cons_price_idx 0.46473841 -0.12218710
## cons_conf_idx 0.13270748 0.04122859
## euribor3m 0.92885696 -0.26682160
## nr_employed 1.00000000 -0.28395704
## Exited -0.28395704 1.00000000
There are very high positive correlation between “nr_employed” and “euribor3m” and “emp_var_rate”. There are also some moderate negative correlations between a few variables that we can see the matrix above. We can think to remove 2 ones of 3 correlated variables after ROC Curve Analysis.
We can easily understand if one variable stays inside area under curve (AUC) of the other one, it means that it will be represented by the second one.
library(ROCR)
data(ROCR.simple)
df_roc<-data.frame(ROCR.simple)
pred1<- prediction(data$nr_employed , data_num$Exited)
pred2<- prediction(data$euribor3m , data_num$Exited)
pred3<- prediction(data$emp_var_rate , data_num$Exited)
perf1<- performance(pred1, "tpr","fpr")
perf2<- performance(pred2, "tpr","fpr")
perf3<- performance(pred3, "tpr","fpr")
plot(perf1)
plot(perf2, add = TRUE,)
plot(perf3, add = TRUE, colorize=TRUE)
As understood from ROC analysis above, “emp_var_rate” represents other 2 variables. So we can remove “nr_employed” and “euribor3m”.
This is also another important subject. When we explore impact of independent variables on target (dependent variable) we will understand importance of level of independent variables. In this situation, it will be unnecessary to hold non-correlated or low correlated variables from our predictive model. It depends!
cor(y=data$Exited, x=data_num, method = c("spearman"))
## [,1]
## age -0.01186072
## duration 0.34877686
## campaign -0.06366814
## pdays -0.32494180
## previous 0.20091403
## emp_var_rate -0.24747967
## cons_price_idx -0.12218710
## cons_conf_idx 0.04122859
## euribor3m -0.26682160
## nr_employed -0.28395704
## Exited 1.00000000
There is no important correlation between dependent and independent variables.
Now, we will convert all categorical data into numerical one, or remove from dataset if not need. Because ML algorithms runs only numerical data.
Now we will select categorical variables and make them continuous data types
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
data_cat<-(
data%>%
select(job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome)
)
dmy<- dummyVars("~ .", data = data_cat)
trs<- data.frame(predict(dmy, newdata=data_cat))
head(trs)
## jobadmin. jobblue.collar jobentrepreneur jobhousemaid jobmanagement
## 1 0 1 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 1
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 1
## jobretired jobself.employed jobservices jobstudent jobtechnician
## 1 0 0 0 0 0
## 2 0 0 0 0 1
## 3 0 0 0 0 0
## 4 0 0 1 0 0
## 5 1 0 0 0 0
## 6 0 0 0 0 0
## jobunemployed jobunknown maritaldivorced maritalmarried maritalsingle
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 0 0 0 1
## 4 0 0 0 1 0
## 5 0 0 0 1 0
## 6 0 0 1 0 0
## maritalunknown educationbasic.4y educationbasic.6y educationbasic.9y
## 1 0 1 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1 0 0
## 6 0 1 0 0
## educationhigh.school educationilliterate educationprofessional.course
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 0 0
## educationuniversity.degree educationunknown defaultno defaultunknown
## 1 0 0 0 1
## 2 0 1 1 0
## 3 1 0 1 0
## 4 0 0 1 0
## 5 0 0 1 0
## 6 0 0 1 0
## defaultyes housingno housingunknown housingyes loanno loanunknown loanyes
## 1 0 0 0 1 1 0 0
## 2 0 1 0 0 1 0 0
## 3 0 0 0 1 1 0 0
## 4 0 1 0 0 1 0 0
## 5 0 0 0 1 1 0 0
## 6 0 0 0 1 1 0 0
## contactcellular contacttelephone monthapr monthaug monthdec monthjul monthjun
## 1 1 0 0 1 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 1
## 4 1 0 1 0 0 0 0
## 5 1 0 0 1 0 0 0
## 6 1 0 0 0 0 1 0
## monthmar monthmay monthnov monthoct monthsep day_of_weekfri day_of_weekmon
## 1 0 0 0 0 0 0 0
## 2 0 0 1 0 0 1 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 1 0
## 5 0 0 0 0 0 1 0
## 6 0 0 0 0 0 0 0
## day_of_weekthu day_of_weektue day_of_weekwed poutcomefailure
## 1 1 0 0 0
## 2 0 0 0 0
## 3 1 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 1 0 0
## poutcomenonexistent poutcomesuccess
## 1 1 0
## 2 1 0
## 3 0 1
## 4 1 0
## 5 0 1
## 6 1 0
We converted categorical variables into continuous variable using Encoders. Now let’s remove categorical columns from dataset and put new encoded data instead.
# remove categorical variables. We assumed some categorical variables don't affect target variable. Therefore we will remove them as well.
data_<-data%>%
select( -job, -marital, -education, -default, -housing, -loan, -poutcome, -contact, -month, -day_of_week, -nr_employed, -euribor3m )
str(data_)
## 'data.frame': 41188 obs. of 9 variables:
## $ age : int 44 53 28 39 55 30 37 39 36 27 ...
## $ duration : int 210 138 339 185 137 68 204 191 174 191 ...
## $ campaign : int 1 1 3 2 1 8 1 1 1 2 ...
## $ pdays : int 999 999 6 999 3 999 999 999 3 999 ...
## $ previous : int 0 0 2 0 1 0 0 0 1 1 ...
## $ emp_var_rate : num 1.4 -0.1 -1.7 -1.8 -2.9 1.4 -1.8 -1.8 -2.9 -1.8 ...
## $ cons_price_idx: num 93.4 93.2 94.1 93.1 92.2 ...
## $ cons_conf_idx : num -36.1 -42 -39.8 -47.1 -31.4 -42.7 -46.2 -46.2 -40.8 -47.1 ...
## $ Exited : int 0 0 1 0 1 0 0 0 1 0 ...
encoded_data<- cbind(data_, trs )
head(encoded_data)
## age duration campaign pdays previous emp_var_rate cons_price_idx
## 1 44 210 1 999 0 1.4 93.444
## 2 53 138 1 999 0 -0.1 93.200
## 3 28 339 3 6 2 -1.7 94.055
## 4 39 185 2 999 0 -1.8 93.075
## 5 55 137 1 3 1 -2.9 92.201
## 6 30 68 8 999 0 1.4 93.918
## cons_conf_idx Exited jobadmin. jobblue.collar jobentrepreneur jobhousemaid
## 1 -36.1 0 0 1 0 0
## 2 -42.0 0 0 0 0 0
## 3 -39.8 1 0 0 0 0
## 4 -47.1 0 0 0 0 0
## 5 -31.4 1 0 0 0 0
## 6 -42.7 0 0 0 0 0
## jobmanagement jobretired jobself.employed jobservices jobstudent
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 1 0 0 0 0
## 4 0 0 0 1 0
## 5 0 1 0 0 0
## 6 1 0 0 0 0
## jobtechnician jobunemployed jobunknown maritaldivorced maritalmarried
## 1 0 0 0 0 1
## 2 1 0 0 0 1
## 3 0 0 0 0 0
## 4 0 0 0 0 1
## 5 0 0 0 0 1
## 6 0 0 0 1 0
## maritalsingle maritalunknown educationbasic.4y educationbasic.6y
## 1 0 0 1 0
## 2 0 0 0 0
## 3 1 0 0 0
## 4 0 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
## educationbasic.9y educationhigh.school educationilliterate
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 0
## 6 0 0 0
## educationprofessional.course educationuniversity.degree educationunknown
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## defaultno defaultunknown defaultyes housingno housingunknown housingyes
## 1 0 1 0 0 0 1
## 2 1 0 0 1 0 0
## 3 1 0 0 0 0 1
## 4 1 0 0 1 0 0
## 5 1 0 0 0 0 1
## 6 1 0 0 0 0 1
## loanno loanunknown loanyes contactcellular contacttelephone monthapr monthaug
## 1 1 0 0 1 0 0 1
## 2 1 0 0 1 0 0 0
## 3 1 0 0 1 0 0 0
## 4 1 0 0 1 0 1 0
## 5 1 0 0 1 0 0 1
## 6 1 0 0 1 0 0 0
## monthdec monthjul monthjun monthmar monthmay monthnov monthoct monthsep
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 1 0 0
## 3 0 0 1 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 1 0 0 0 0 0 0
## day_of_weekfri day_of_weekmon day_of_weekthu day_of_weektue day_of_weekwed
## 1 0 0 1 0 0
## 2 1 0 0 0 0
## 3 0 0 1 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 0 0 0 1 0
## poutcomefailure poutcomenonexistent poutcomesuccess
## 1 0 1 0
## 2 0 1 0
## 3 0 0 1
## 4 0 1 0
## 5 0 0 1
## 6 0 1 0
# split our data so that 30% is in the test set and 70% is in the training set
split_df <- resample_partition (encoded_data, c(test= 0.3, train= 0.7))
# how many cases are in test & training set?
lapply(split_df, dim)
## $test
## [1] 12356 62
##
## $train
## [1] 28832 62
It will be utilized Decision Trees model.
# Let's apply classification tree
fit <- rpart( Exited ~ ., data = split_df$train, method ="class", parms = list(split = "information"))
y_pred <- predict(fit, split_df$test, type="class")
y_pred_df <- data.frame(y_pred)
y_test_df <- data.frame(split_df$test)%>%select(Exited)
accuracy<- mean( y_pred_df == y_test_df)
accuracy
## [1] 0.9110554
rpart.plot(fit, extra =106, )
tuning<- rpart.control(minsplit=8, minbucket = round(8/3), maxdepth=30)
fit <- rpart( Exited ~ ., data = split_df$train, method ="class", parms = list(split = "information"), control=tuning)
y_pred_tuned <- predict(fit, split_df$test, type="class")
y_pred_tuned_df <- data.frame(y_pred_tuned)
y_test_df <- data.frame(split_df$test)%>%select(Exited)
accuracy<- mean( y_pred_tuned_df == y_test_df)
accuracy
## [1] 0.9110554
library("tune")
fit <- rpart( Exited ~ ., data = split_df$train, method ="class", parms = list(split = "information"), control=tune())
y_pred_tuned <- predict(fit, split_df$test, type="class")
y_pred_tuned_df <- data.frame(y_pred_tuned)
y_test_df <- data.frame(split_df$test)%>%select(Exited)
accuracy<- mean( y_pred_tuned_df == y_test_df)
accuracy
## [1] 0.9110554
In this model, it is preferred to use Decision Trees (CART) to benefit from the advantages of the model. The most important advantages for this task are that CART is resistant to outliers and missing values, and no needed to make a scaling dataset. Also, it is easy to explain the model to technical personnel and stakeholders. Therefore, CART will be able to tackle and tolerate human error and perform big datasets.
Additionally, the prediction rate is % 91.1. This displays that the prediction accuracy is highly successful for this task. It is recommended to utilize the CART Model for this customer churn project since CART is one of the best models to balance possible issues in the future.
This project is completed successfully. Data preprocessing and ML modeling phases are targeted to display basically using 41-thousand-row of dataset. The project has samples of R programming functions, data preprocessing, data visualization, statistics, correlation analysis, machine learning, and reporting as well. Besides, the prediction success rate of the model is achieved as 91.1 % at the end of the project.