The first step is to set the working directory in R and place the available data file in that directory.
setwd("D:\\DS\\R\\[R] Random Forest Classifier")
getwd()
## [1] "D:/DS/R/[R] Random Forest Classifier"
rm(list=ls(all=TRUE))
The next step is to read in the dataset:
##Input Data
telco <- read.csv("telco.csv", header = T)
head(telco)
## UpdatedAt customerID gender SeniorCitizen Partner tenure PhoneService
## 1 202006 45759018157 Female No Yes 1 No
## 2 202006 45315483266 Male No Yes 60 Yes
## 3 202006 45236961615 Male No No 5 Yes
## 4 202006 45929827382 Female No Yes 72 Yes
## 5 202006 45305082233 Female No Yes 56 Yes
## 6 202006 45072364214 Male No No 44 Yes
## StreamingTV InternetService PaperlessBilling MonthlyCharges TotalCharges
## 1 No Yes Yes 29.85 29.85
## 2 No No Yes 20.50 1198.80
## 3 Yes Yes No 104.10 541.90
## 4 Yes Yes Yes 115.50 8312.75
## 5 Yes Yes No 81.25 4620.40
## 6 Yes Yes Yes 85.25 3704.15
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 No
## 6 No
Because the UpdatedAt and customerID variables are identifiers with no predictive value, they are removed from the dataset first:
telco <- telco[, -c(1:2)]
head(telco)
## gender SeniorCitizen Partner tenure PhoneService StreamingTV InternetService
## 1 Female No Yes 1 No No Yes
## 2 Male No Yes 60 Yes No No
## 3 Male No No 5 Yes Yes Yes
## 4 Female No Yes 72 Yes Yes Yes
## 5 Female No Yes 56 Yes Yes Yes
## 6 Male No No 44 Yes Yes Yes
## PaperlessBilling MonthlyCharges TotalCharges Churn
## 1 Yes 29.85 29.85 No
## 2 Yes 20.50 1198.80 No
## 3 No 104.10 541.90 Yes
## 4 Yes 115.50 8312.75 No
## 5 No 81.25 4620.40 No
## 6 Yes 85.25 3704.15 No
The next step is to explore the dataset:
#Inspect the data structure
str(telco)
## 'data.frame': 6950 obs. of 11 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 1 2 2 2 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 1 2 2 2 2 ...
## $ tenure : int 1 60 5 72 56 44 39 12 71 19 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 1 1 1 ...
## $ InternetService : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 1 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 2 1 1 ...
## $ MonthlyCharges : num 29.9 20.5 104.1 115.5 81.2 ...
## $ TotalCharges : num 29.9 1198.8 541.9 8312.8 4620.4 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 1 1 1 ...
The structure above shows that each variable already has an appropriate data type: the categorical variables are stored as factors, and the non-categorical variables as integer or numeric.
#Check for missing data
summary(telco)
## gender SeniorCitizen Partner tenure PhoneService
## Female:3445 No :5822 No :3591 Min. : 0.00 No : 669
## Male :3505 Yes:1128 Yes:3359 1st Qu.: 9.00 Yes:6281
## Median : 29.00
## Mean : 32.42
## 3rd Qu.: 55.00
## Max. :124.00
## StreamingTV InternetService PaperlessBilling MonthlyCharges TotalCharges
## No :4279 No :1505 No :2836 Min. : 0.00 Min. : 19
## Yes:2671 Yes:5445 Yes:4114 1st Qu.: 36.46 1st Qu.: 407
## Median : 70.45 Median :1401
## Mean : 64.99 Mean :2286
## 3rd Qu.: 89.85 3rd Qu.:3800
## Max. :169.93 Max. :8889
## Churn
## No :5114
## Yes:1836
##
##
##
##
table(is.na(telco))
##
## FALSE
## 76450
There is no missing data in the dataset. Next, a pie chart is made for the Churn variable, starting from a frequency table of its levels:
#Pie chart of the Churn variable
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_churn <- telco %>%
  select(Churn) %>%
  group_by(Churn) %>%
  summarise(Total = n())
df_churn
## # A tibble: 2 x 2
## Churn Total
## <fct> <int>
## 1 No 5114
## 2 Yes 1836
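For the pie chart itself, a minimal base-R sketch built from df_churn (the labels and title are illustrative, not from the original):
#Pie chart sketch
pie(df_churn$Total,
    labels = paste0(df_churn$Churn, " (", round(100 * df_churn$Total / sum(df_churn$Total), 1), "%)"),
    main = "Churn Proportion")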
The next step is to split the data into training and testing sets:
##Split Data
library(caTools)
set.seed(123)
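#Note: sample.split() expects a label vector (e.g. telco$Churn) for a stratified split; passing the whole data frame, as here, does not stratify by Churn and may not yield an exact 70/30 split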
Split <- sample.split(telco, SplitRatio = 0.7)
Train <- subset(telco, Split==TRUE)
Test <- subset(telco, Split==FALSE)
The next step is to fit a model on the training data with the default parameters: the number of trees to grow is ntree = 500, and the number of variables randomly sampled as candidates at each split is mtry = floor(sqrt(p)) for classification. Since this data has 10 predictor variables, mtry = floor(sqrt(10)) = 3:
##Modeling Train Data (default parameter)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(222)
rf_model1<-randomForest(Churn~., data = Train)
rf_model1
##
## Call:
## randomForest(formula = Churn ~ ., data = Train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.48%
## Confusion matrix:
## No Yes class.error
## No 3005 277 0.08439976
## Yes 629 512 0.55127082
The next step is to predict on the testing data using the trained model:
##Predict (default parameter)
library(caret)
## Loading required package: lattice
p1_test <- predict(rf_model1, newdata = Test)
p1_test_cm<-confusionMatrix(p1_test, Test$Churn)
p1_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1655 370
## Yes 177 325
##
## Accuracy : 0.7835
## 95% CI : (0.767, 0.7995)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 8.416e-12
##
## Kappa : 0.406
##
## Mcnemar's Test P-Value : 2.224e-16
##
## Sensitivity : 0.9034
## Specificity : 0.4676
## Pos Pred Value : 0.8173
## Neg Pred Value : 0.6474
## Prevalence : 0.7250
## Detection Rate : 0.6549
## Detection Prevalence : 0.8013
## Balanced Accuracy : 0.6855
##
## 'Positive' Class : No
##
The next step is to search for the optimal value of the mtry parameter for the random forest model. There are several ways to do this:
#Random Search CV
set.seed(222)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        search = "random")
rf_random <- train(Churn~.,
                   data = telco,
                   method = "rf",
                   metric = "Accuracy",
                   tuneLength = 10,
                   trControl = control)
rf_random
## Random Forest
##
## 6950 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6256, 6256, 6256, 6254, 6256, 6254, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7951543 0.3987911
## 4 0.7833086 0.3953396
## 6 0.7769782 0.3837060
## 7 0.7764493 0.3824825
## 8 0.7756830 0.3821310
## 9 0.7743879 0.3782746
## 10 0.7743385 0.3794617
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
This method finds the best mtry to be 2.
#Grid Search CV
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        search = "grid")
set.seed(222)
tunegrid <- expand.grid(.mtry = c(1:10))
rf_gridsearch <- train(Churn~.,
                       data = telco,
                       method = "rf",
                       metric = "Accuracy",
                       tuneGrid = tunegrid,
                       trControl = control)
rf_gridsearch
## Random Forest
##
## 6950 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6256, 6256, 6256, 6254, 6256, 6254, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7574560 0.1345038
## 2 0.7939545 0.3952879
## 3 0.7930936 0.4115954
## 4 0.7833084 0.3944844
## 5 0.7789922 0.3882772
## 6 0.7773612 0.3864893
## 7 0.7754420 0.3800425
## 8 0.7759223 0.3817688
## 9 0.7744354 0.3794570
## 10 0.7745316 0.3797799
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
This method also finds the best mtry to be 2.
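A third option is randomForest::tuneRF, which starts from the default mtry and steps outward by stepFactor for as long as the relative OOB error improves by at least improve. The call that produced the output below is missing from the original; the following is a reconstruction (stepFactor = 1.5 is an assumption, chosen because it reproduces the steps from mtry = 3 to 2 and 4 seen below):
#tuneRF OOB search (reconstructed call)
set.seed(222)
tuneRF(x = Train[, setdiff(names(Train), "Churn")], y = Train$Churn,
       ntreeTry = 500,   #same number of trees as the models above
       stepFactor = 1.5, #from mtry = 3, try 2 (left) and 4 (right)
       improve = 0.05)   #stop when relative OOB improvement drops below 5%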
## mtry = 3 OOB error = 20.48%
## Searching left ...
## mtry = 2 OOB error = 20.08%
## 0.01986755 0.05
## Searching right ...
## mtry = 4 OOB error = 21.34%
## -0.0419426 0.05
This method, too, finds the best mtry to be 2 (OOB error 20.08% at mtry = 2 versus 20.48% at mtry = 3).
A model with mtry = 3 has already been fitted above with the default parameters, so the next step is to fit a model with mtry = 2:
##Modeling Train Data (mtry=2)
library(randomForest)
set.seed(222)
rf_model2<-randomForest(Churn~., data = Train, mtry=2)
rf_model2
##
## Call:
## randomForest(formula = Churn ~ ., data = Train, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.12%
## Confusion matrix:
## No Yes class.error
## No 3071 211 0.06429007
## Yes 679 462 0.59509202
##Predict (mtry=2)
library(caret)
p2_test <- predict(rf_model2, newdata = Test)
p2_test_cm<-confusionMatrix(p2_test, Test$Churn)
p2_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1691 402
## Yes 141 293
##
## Accuracy : 0.7851
## 95% CI : (0.7686, 0.801)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 2.275e-12
##
## Kappa : 0.3901
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9230
## Specificity : 0.4216
## Pos Pred Value : 0.8079
## Neg Pred Value : 0.6751
## Prevalence : 0.7250
## Detection Rate : 0.6692
## Detection Prevalence : 0.8283
## Balanced Accuracy : 0.6723
##
## 'Positive' Class : No
##
With mtry = 2 the accuracy is higher than with mtry = 3 (the default-parameter model), so the model with mtry = 2 is selected.
However, in the mtry = 2 model the Sensitivity and Specificity are far apart. This is caused by imbalanced data: in the Churn variable, customers who churn (Yes = 1836) are far fewer than customers who do not (No = 5114).
There are several ways to address this problem:
##Handling the low specificity caused by imbalanced data
library(ROSE)
## Loaded ROSE 0.0-3
##OverSampling
table(Train$Churn)
##
## No Yes
## 3282 1141
3286*2
## [1] 6572
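#N is the desired total size of the resampled data: about twice the majority-class count (No = 3282), so the minority class is duplicated until the classes are roughly balanced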
over <- ovun.sample(Churn~., data = Train, method = "over", N=6572)$data
table(over$Churn)
##
## No Yes
## 3282 3290
#Modeling Train Data (oversampling)
library(randomForest)
set.seed(222)
rf_model2_over<-randomForest(Churn~., data = over, mtry=2)
rf_model2_over
##
## Call:
## randomForest(formula = Churn ~ ., data = over, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 19%
## Confusion matrix:
## No Yes class.error
## No 2461 821 0.2501523
## Yes 428 2862 0.1300912
#Predict (oversampling)
library(caret)
p2_test_over <- predict(rf_model2_over, newdata = Test)
p2_test_over_cm<-confusionMatrix(p2_test_over, Test$Churn)
p2_test_over_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1345 177
## Yes 487 518
##
## Accuracy : 0.7372
## 95% CI : (0.7196, 0.7543)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.08665
##
## Kappa : 0.4212
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.7342
## Specificity : 0.7453
## Pos Pred Value : 0.8837
## Neg Pred Value : 0.5154
## Prevalence : 0.7250
## Detection Rate : 0.5323
## Detection Prevalence : 0.6023
## Balanced Accuracy : 0.7397
##
## 'Positive' Class : No
##
##UnderSampling
1137*2
## [1] 2274
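#N is about twice the minority-class count (Yes = 1141), so the majority class is randomly downsampled to roughly match it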
under <- ovun.sample(Churn~., data = Train, method = "under", N=2274)$data
table(under$Churn)
##
## No Yes
## 1133 1141
#Modeling Train Data (undersampling)
library(randomForest)
set.seed(222)
rf_model2_under<-randomForest(Churn~., data = under, mtry=2)
rf_model2_under
##
## Call:
## randomForest(formula = Churn ~ ., data = under, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 25.81%
## Confusion matrix:
## No Yes class.error
## No 806 327 0.2886143
## Yes 260 881 0.2278703
#Predict (undersampling)
library(caret)
p2_test_under <- predict(rf_model2_under, newdata = Test)
p2_test_under_cm<-confusionMatrix(p2_test_under, Test$Churn)
p2_test_under_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1283 156
## Yes 549 539
##
## Accuracy : 0.721
## 95% CI : (0.7031, 0.7384)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.681
##
## Kappa : 0.4048
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7003
## Specificity : 0.7755
## Pos Pred Value : 0.8916
## Neg Pred Value : 0.4954
## Prevalence : 0.7250
## Detection Rate : 0.5077
## Detection Prevalence : 0.5694
## Balanced Accuracy : 0.7379
##
## 'Positive' Class : No
##
##SMOTE
table(Train$Churn)
##
## No Yes
## 3282 1141
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
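#perc.over = 100: synthesize one new minority (Yes) case per existing one, 1141 -> 2282
#perc.under = 200: sample majority (No) cases at 200% of the newly synthesized count, 2 x 1141 = 2282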
smote <- SMOTE(Churn~. , data = Train, perc.over = 100, perc.under = 200)
table(smote$Churn)
##
## No Yes
## 2282 2282
#Modeling Train Data (smote)
library(randomForest)
set.seed(222)
rf_model2_smote<-randomForest(Churn~., data = smote, mtry=2)
rf_model2_smote
##
## Call:
## randomForest(formula = Churn ~ ., data = smote, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 21.69%
## Confusion matrix:
## No Yes class.error
## No 1776 506 0.2217353
## Yes 484 1798 0.2120947
#Predict (smote)
library(caret)
p2_test_smote<- predict(rf_model2_smote, newdata = Test)
p2_test_smote_cm<-confusionMatrix(p2_test_smote, Test$Churn)
p2_test_smote_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1295 174
## Yes 537 521
##
## Accuracy : 0.7186
## 95% CI : (0.7007, 0.7361)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.7693
##
## Kappa : 0.3928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7069
## Specificity : 0.7496
## Pos Pred Value : 0.8816
## Neg Pred Value : 0.4924
## Prevalence : 0.7250
## Detection Rate : 0.5125
## Detection Prevalence : 0.5813
## Balanced Accuracy : 0.7283
##
## 'Positive' Class : No
##
From the model without imbalance handling and the three imbalance-handling approaches above, the best-performing method is now selected by comparing the AUC (Area Under the Curve) values:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
auc<- roc(Test$Churn, factor(p2_test, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p2_test, ordered = TRUE))
##
## Data: factor(p2_test, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.6723
auc_over <- roc(Test$Churn, factor(p2_test_over, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc_over
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p2_test_over, ordered = TRUE))
##
## Data: factor(p2_test_over, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.7397
auc_under <- roc(Test$Churn, factor(p2_test_under, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc_under
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p2_test_under, ordered = TRUE))
##
## Data: factor(p2_test_under, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.7379
auc_smote <- roc(Test$Churn, factor(p2_test_smote, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc_smote
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p2_test_smote, ordered = TRUE))
##
## Data: factor(p2_test_smote, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.7283
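Note that roc() receives hard class labels here, so each reported AUC is simply the balanced accuracy of the corresponding confusion matrix (compare the Balanced Accuracy values above). A sketch of a probability-based ROC, which gives a genuine curve across thresholds; the type = "prob" call is an assumption and its output is not shown in the original:
#ROC from predicted probabilities (sketch)
p2_prob <- predict(rf_model2, newdata = Test, type = "prob")[, "Yes"] #P(Churn = Yes)
roc(Test$Churn, p2_prob, levels = c("No", "Yes"), direction = "<")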
The following summary table evaluates the four models above:
nama_metode <- c("No Handling", "OverSampling", "UnderSampling", "SMOTE")
accuracy <- c(p2_test_cm$overall[1], p2_test_over_cm$overall[1], p2_test_under_cm$overall[1], p2_test_smote_cm$overall[1])
#Sensitivity and Specificity are stored in $byClass; $overall[2:3] are Kappa and the lower CI bound
sensitivity <- c(p2_test_cm$byClass[1], p2_test_over_cm$byClass[1], p2_test_under_cm$byClass[1], p2_test_smote_cm$byClass[1])
specificity <- c(p2_test_cm$byClass[2], p2_test_over_cm$byClass[2], p2_test_under_cm$byClass[2], p2_test_smote_cm$byClass[2])
auc <- c(auc$auc, auc_over$auc, auc_under$auc, auc_smote$auc)
data.frame(nama_metode, accuracy, sensitivity, specificity, auc)
##     nama_metode  accuracy sensitivity specificity       auc
## 1   No Handling 0.7851207   0.9230349   0.4215827 0.6723088
## 2  OverSampling 0.7372378   0.7341703   0.7453237 0.7397470
## 3 UnderSampling 0.7210131   0.7003275   0.7755396 0.7379335
## 4         SMOTE 0.7186387   0.7068777   0.7496403 0.7282590
So the random forest model chosen for this data uses mtry = 2 and ntree = 500 with the imbalanced data handled by undersampling: its AUC (0.7379) is essentially tied with oversampling (0.7397), and it attains the highest specificity (0.7755).
Finally, we look at which variables are most important in influencing a customer's decision to churn:
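The importance plot itself does not appear above; a sketch using the selected model (applying it to rf_model2_under is an assumption):
#Variable importance (sketch)
importance(rf_model2_under) #MeanDecreaseGini per predictor
varImpPlot(rf_model2_under, main = "Variable Importance")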
The importance plot shows that the variables with the greatest influence on a customer's decision to churn are tenure, TotalCharges, and MonthlyCharges.