The first step is to set the working directory in R and place the data file in that directory.
setwd("D:\\DS\\R\\[R] Support Vector Machine")
getwd()
## [1] "D:/DS/R/[R] Support Vector Machine"
rm(list=ls(all=TRUE))
The next step is to load the dataset:
##Input Data
telco <- read.csv("telco.csv", header = TRUE, stringsAsFactors = TRUE)
head(telco)
## UpdatedAt customerID gender SeniorCitizen Partner tenure PhoneService
## 1 202006 45759018157 Female No Yes 1 No
## 2 202006 45315483266 Male No Yes 60 Yes
## 3 202006 45236961615 Male No No 5 Yes
## 4 202006 45929827382 Female No Yes 72 Yes
## 5 202006 45305082233 Female No Yes 56 Yes
## 6 202006 45072364214 Male No No 44 Yes
## StreamingTV InternetService PaperlessBilling MonthlyCharges TotalCharges
## 1 No Yes Yes 29.85 29.85
## 2 No No Yes 20.50 1198.80
## 3 Yes Yes No 104.10 541.90
## 4 Yes Yes Yes 115.50 8312.75
## 5 Yes Yes No 81.25 4620.40
## 6 Yes Yes Yes 85.25 3704.15
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 No
## 6 No
Because customerID is a unique identifier and carries no predictive information, it is removed from the dataset. The code below drops the first two columns, so UpdatedAt is removed as well:
telco <-telco[,-c(1:2)]
head(telco)
## gender SeniorCitizen Partner tenure PhoneService StreamingTV InternetService
## 1 Female No Yes 1 No No Yes
## 2 Male No Yes 60 Yes No No
## 3 Male No No 5 Yes Yes Yes
## 4 Female No Yes 72 Yes Yes Yes
## 5 Female No Yes 56 Yes Yes Yes
## 6 Male No No 44 Yes Yes Yes
## PaperlessBilling MonthlyCharges TotalCharges Churn
## 1 Yes 29.85 29.85 No
## 2 Yes 20.50 1198.80 No
## 3 No 104.10 541.90 Yes
## 4 Yes 115.50 8312.75 No
## 5 No 81.25 4620.40 No
## 6 Yes 85.25 3704.15 No
The next step is to explore the dataset:
#Inspect the data structure
str(telco)
## 'data.frame': 6950 obs. of 11 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 1 2 2 2 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 1 2 2 2 2 ...
## $ tenure : int 1 60 5 72 56 44 39 12 71 19 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 1 1 1 ...
## $ InternetService : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 2 1 1 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 2 1 1 ...
## $ MonthlyCharges : num 29.9 20.5 104.1 115.5 81.2 ...
## $ TotalCharges : num 29.9 1198.8 541.9 8312.8 4620.4 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 1 1 1 ...
The structure above shows that each variable already has an appropriate data type: categorical variables are factors, and non-categorical variables are integer or numeric.
summary(telco)
## gender SeniorCitizen Partner tenure PhoneService
## Female:3445 No :5822 No :3591 Min. : 0.00 No : 669
## Male :3505 Yes:1128 Yes:3359 1st Qu.: 9.00 Yes:6281
## Median : 29.00
## Mean : 32.42
## 3rd Qu.: 55.00
## Max. :124.00
## StreamingTV InternetService PaperlessBilling MonthlyCharges TotalCharges
## No :4279 No :1505 No :2836 Min. : 0.00 Min. : 19
## Yes:2671 Yes:5445 Yes:4114 1st Qu.: 36.46 1st Qu.: 407
## Median : 70.45 Median :1401
## Mean : 64.99 Mean :2286
## 3rd Qu.: 89.85 3rd Qu.:3800
## Max. :169.93 Max. :8889
## Churn
## No :5114
## Yes:1836
##
##
##
##
The numeric and integer variables (tenure, MonthlyCharges, and TotalCharges) are measured on different scales. Because an SVM relies on distances between observations, variables with larger ranges can dominate and lead to misclassification, so these variables are standardized first.
#Standardize the data
telco[,c(4, 9:10)]<-scale(telco[,c(4, 9:10)])
summary(telco)
## gender SeniorCitizen Partner tenure PhoneService
## Female:3445 No :5822 No :3591 Min. :-1.3190 No : 669
## Male :3505 Yes:1128 Yes:3359 1st Qu.:-0.9529 Yes:6281
## Median :-0.1393
## Mean : 0.0000
## 3rd Qu.: 0.9185
## Max. : 3.7255
## StreamingTV InternetService PaperlessBilling MonthlyCharges
## No :4279 No :1505 No :2836 Min. :-2.1641
## Yes:2671 Yes:5445 Yes:4114 1st Qu.:-0.9500
## Median : 0.1817
## Mean : 0.0000
## 3rd Qu.: 0.8277
## Max. : 3.4942
## TotalCharges Churn
## Min. :-1.0006 No :5114
## 1st Qu.:-0.8294 Yes:1836
## Median :-0.3907
## Mean : 0.0000
## 3rd Qu.: 0.6681
## Max. : 2.9144
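The standardization step above can be checked by hand. The sketch below uses a small hypothetical data frame (`toy`, holding the tenure and MonthlyCharges values of the six rows printed by `head(telco)`) rather than the full dataset, and confirms that `scale()` subtracts each column's mean and divides by its standard deviation.

```r
# Hypothetical toy data: first six rows of tenure and MonthlyCharges
toy <- data.frame(
  tenure         = c(1, 60, 5, 72, 56, 44),
  MonthlyCharges = c(29.85, 20.50, 104.10, 115.50, 81.25, 85.25)
)

# scale() centres each column on its mean and divides by its standard deviation
scaled <- scale(toy)

# The same transformation written out explicitly
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Both versions agree; each scaled column has mean 0 and standard deviation 1
all.equal(as.numeric(scaled), as.numeric(manual))
colMeans(scaled)
apply(scaled, 2, sd)
```

Note that `svm()` in e1071 also scales numeric inputs by default (`scale = TRUE`), so the explicit step mainly makes the transformation visible in the summary.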
attach(telco)
#Plot Data
#Pie plot of the Churn variable
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_churn <- telco %>%
select(Churn) %>%
group_by(Churn) %>%
summarise(Total = n())
df_churn
## # A tibble: 2 x 2
## Churn Total
## <fct> <int>
## 1 No 5114
## 2 Yes 1836
library(ggplot2)
ggplot(df_churn, aes(x="", y=Total, fill = Churn))+
geom_bar(stat = "identity")+
coord_polar(theta = "y")+
geom_text(aes(label=Total), position = position_stack(vjust=0.5))+
labs(title = "Pie Plot Churn Customer", y = "Churn")
The next step is to split the data into training data and testing data:
##Split Data
library(caTools)
set.seed(123)
Split <- sample.split(telco, SplitRatio = 0.7)
Train <- subset(telco, Split==TRUE)
Test <- subset(telco, Split==FALSE)
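`sample.split()` is documented as taking the vector of class labels as its first argument. Passing the whole data frame, as above, appears to produce a logical vector with one element per column (11 here) that `subset()` then recycles over the rows, which would explain why the training set ends up with 4423 of 6950 rows (about 7/11) rather than 70%. A stratified call would be `sample.split(telco$Churn, SplitRatio = 0.7)`; the results in this article come from the original call. The base-R sketch below illustrates the same idea of a stratified 70/30 split on hypothetical toy labels (`churn` and the helper names are illustrative, not part of the original analysis).

```r
set.seed(123)

# Hypothetical toy labels with roughly the same class ratio as Churn
churn <- factor(sample(c("No", "Yes"), size = 1000, replace = TRUE,
                       prob = c(0.74, 0.26)))

# Row indices grouped by class
rows_by_class <- split(seq_along(churn), churn)

# Draw 70% of the rows within each class so both subsets keep the class ratio
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
is_train <- seq_along(churn) %in% train_idx

# Share of rows in training, and share of "Yes" in each subset
train_share     <- mean(is_train)
yes_share_train <- mean(churn[is_train] == "Yes")
yes_share_test  <- mean(churn[!is_train] == "Yes")
c(train_share, yes_share_train, yes_share_test)
```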
library(e1071)
svm_model <- svm(Churn~. , data = Train)
summary(svm_model)
##
## Call:
## svm(formula = Churn ~ ., data = Train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2035
##
## ( 1040 995 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
The default parameters use SVM-type C-classification because the response variable (Churn) is categorical, with a radial kernel and cost = 1. The next step is to evaluate the fitted model, first on the training data and then on the testing data:
##Predict (default parameter)
library(caret)
## Loading required package: lattice
p_train <- predict(svm_model)
p_train_cm<-confusionMatrix(p_train, Train$Churn)
p_train_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3082 682
## Yes 200 459
##
## Accuracy : 0.8006
## 95% CI : (0.7885, 0.8123)
## No Information Rate : 0.742
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3959
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9391
## Specificity : 0.4023
## Pos Pred Value : 0.8188
## Neg Pred Value : 0.6965
## Prevalence : 0.7420
## Detection Rate : 0.6968
## Detection Prevalence : 0.8510
## Balanced Accuracy : 0.6707
##
## 'Positive' Class : No
##
p_test <- predict(svm_model, newdata = Test)
p_test_cm<-confusionMatrix(p_test, Test$Churn)
p_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1699 422
## Yes 133 273
##
## Accuracy : 0.7804
## 95% CI : (0.7637, 0.7964)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 1.032e-10
##
## Kappa : 0.3676
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9274
## Specificity : 0.3928
## Pos Pred Value : 0.8010
## Neg Pred Value : 0.6724
## Prevalence : 0.7250
## Detection Rate : 0.6723
## Detection Prevalence : 0.8393
## Balanced Accuracy : 0.6601
##
## 'Positive' Class : No
##
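The statistics that `confusionMatrix()` reports follow directly from the four cell counts. As a check, the sketch below recomputes Accuracy, Sensitivity, Specificity, Balanced Accuracy, and Kappa from the test confusion matrix above, with "No" as the positive class. The object names (`cm`, `kappa_stat`, and so on) are illustrative.

```r
# Test confusion matrix of the default model (rows = prediction, columns = reference)
cm <- matrix(c(1699, 133, 422, 273), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))
n <- sum(cm)

# Share of correct predictions
accuracy <- sum(diag(cm)) / n

# "No" is the positive class, so sensitivity is computed on the "No" column
sensitivity <- cm["No", "No"]   / sum(cm[, "No"])
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's Kappa: observed agreement corrected for chance agreement
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced_accuracy,
        kappa = kappa_stat), 4)
```

The gap between Sensitivity and Specificity is what the later sections on imbalanced data try to close.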
Next, this result is compared with the "linear", "polynomial", and "sigmoid" kernels.
#Kernel linear
library(e1071)
svm_model1 <- svm(Churn~. , data = Train, kernel = "linear")
summary(svm_model1)
##
## Call:
## svm(formula = Churn ~ ., data = Train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 2135
##
## ( 1070 1065 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction
p1_test <- predict(svm_model1, newdata = Test)
p1_test_cm<-confusionMatrix(p1_test, Test$Churn)
p1_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1642 368
## Yes 190 327
##
## Accuracy : 0.7792
## 95% CI : (0.7625, 0.7952)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 2.546e-10
##
## Kappa : 0.3985
##
## Mcnemar's Test P-Value : 6.731e-14
##
## Sensitivity : 0.8963
## Specificity : 0.4705
## Pos Pred Value : 0.8169
## Neg Pred Value : 0.6325
## Prevalence : 0.7250
## Detection Rate : 0.6498
## Detection Prevalence : 0.7954
## Balanced Accuracy : 0.6834
##
## 'Positive' Class : No
##
#Kernel polynomial
svm_model2 <- svm(Churn~. , data = Train, kernel = "polynomial")
summary(svm_model2)
##
## Call:
## svm(formula = Churn ~ ., data = Train, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 2014
##
## ( 1020 994 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction
p2_test <- predict(svm_model2, newdata = Test)
p2_test_cm<-confusionMatrix(p2_test, Test$Churn)
p2_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1711 443
## Yes 121 252
##
## Accuracy : 0.7768
## 95% CI : (0.7601, 0.7929)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 1.457e-09
##
## Kappa : 0.3463
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9340
## Specificity : 0.3626
## Pos Pred Value : 0.7943
## Neg Pred Value : 0.6756
## Prevalence : 0.7250
## Detection Rate : 0.6771
## Detection Prevalence : 0.8524
## Balanced Accuracy : 0.6483
##
## 'Positive' Class : No
##
#Kernel sigmoid
svm_model3 <- svm(Churn~. , data = Train, kernel = "sigmoid")
summary(svm_model3)
##
## Call:
## svm(formula = Churn ~ ., data = Train, kernel = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 1315
##
## ( 657 658 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction
p3_test <- predict(svm_model3, newdata = Test)
p3_test_cm<-confusionMatrix(p3_test, Test$Churn)
p3_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1521 399
## Yes 311 296
##
## Accuracy : 0.719
## 95% CI : (0.7011, 0.7365)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.755614
##
## Kappa : 0.2666
##
## Mcnemar's Test P-Value : 0.001094
##
## Sensitivity : 0.8302
## Specificity : 0.4259
## Pos Pred Value : 0.7922
## Neg Pred Value : 0.4876
## Prevalence : 0.7250
## Detection Rate : 0.6019
## Detection Prevalence : 0.7598
## Balanced Accuracy : 0.6281
##
## 'Positive' Class : No
##
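Of the four kernels, the "radial" kernel of the default model gives the highest test Accuracy (0.7804), while the "linear" kernel gives the highest Kappa (0.3985) at nearly the same Accuracy (0.7792). The radial kernel is carried forward for tuning.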
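For reference, the four kernels compared above can be written as small functions of two feature vectors. This is a sketch following the kernel formulas given in the e1071 documentation; the function names and the example vectors `u` and `v` are illustrative, and `gamma`, `coef0`, and `degree` correspond to the parameters shown in the model summaries.

```r
# Kernel functions for two numeric vectors u and v
linear_kernel <- function(u, v) sum(u * v)

polynomial_kernel <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}

radial_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

sigmoid_kernel <- function(u, v, gamma, coef0 = 0) {
  tanh(gamma * sum(u * v) + coef0)
}

# Hypothetical example vectors
u <- c(1, 2)
v <- c(2, 0)

k_linear <- linear_kernel(u, v)
k_poly   <- polynomial_kernel(u, v, gamma = 0.5)
k_radial <- radial_kernel(u, v, gamma = 0.5)
k_sig    <- sigmoid_kernel(u, v, gamma = 0.5)
c(k_linear, k_poly, k_radial, k_sig)
```

A larger `gamma` makes the radial kernel decay faster with distance, which is why gamma is tuned together with cost.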
The next step is to find optimal values for the gamma and cost parameters. One of several ways to do this is grid search, which the "e1071" library provides through tune():
Tune <- tune(svm, Churn ~ .,
data = Train, type = "C-classification", kernel = "radial",
ranges = list(gamma = c(0.5, 1.0, 1.5), cost = 10^(0.1:1)))
summary(Tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.5 1.258925
##
## - best performance: 0.1998693
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.5 1.258925 0.1998693 0.02281869
## 2 1.0 1.258925 0.2025837 0.02331603
## 3 1.5 1.258925 0.2050688 0.02163950
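The tuning output lists a single cost value because `0.1:1` is a sequence that starts at 0.1 and steps by 1, so it contains only 0.1, and `10^(0.1:1)` is just 10^0.1. The sketch below shows this and builds a wider grid for comparison; the exponents `-1:1` are an assumption chosen for illustration, not values from the original analysis.

```r
# The cost grid as written in the tune() call: a single value
cost_exponents <- 0.1:1
cost_used      <- 10^cost_exponents
cost_used

# A wider, hypothetical cost grid (illustrative exponents)
cost_wider <- 10^(-1:1)

# All gamma/cost combinations a grid search over the wider grid would evaluate
grid <- expand.grid(gamma = c(0.5, 1.0, 1.5), cost = cost_wider)
nrow(grid)
grid
```

A list such as `list(gamma = c(0.5, 1.0, 1.5), cost = cost_wider)` could be passed as `ranges` to `tune()`, at the price of three times as many cross-validated fits.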
#Random Search CV
set.seed(222)
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search = "random")
svm_random <- train(Churn~.,
data = telco,
method = "svmRadial",
metric = "Accuracy",
tuneLength = 10,
trControl = control)
svm_random
## Support Vector Machines with Radial Basis Function Kernel
##
## 6950 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6256, 6256, 6256, 6254, 6256, 6254, ...
## Resampling results across tuning parameters:
##
## sigma C Accuracy Kappa
## 0.01699863 595.69475337 0.7937607 0.3917258
## 0.04318439 0.07812318 0.7810524 0.3278050
## 0.04961205 0.07352501 0.7813399 0.3307863
## 0.05164241 16.61873787 0.7938567 0.3949774
## 0.07629500 0.80140798 0.7930908 0.3905625
## 0.09192807 793.12942215 0.7877667 0.3974551
## 0.09904983 3.49806630 0.7945759 0.3984123
## 0.15384435 2.95086023 0.7937127 0.3983778
## 0.15616259 3.96544819 0.7942887 0.4019388
## 0.28463599 7.15624148 0.7914593 0.4022572
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.09904983 and C = 3.498066.
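Random search differs from grid search in that candidate (sigma, C) pairs are drawn at random instead of being taken from a fixed grid. The sketch below shows only that sampling step, assuming log-uniform ranges chosen for illustration; they are not necessarily the ranges caret uses internally. Note also that the `train()` call above is given the full `telco` data rather than `Train`, so the test rows take part in the tuning.

```r
set.seed(222)
n_candidates <- 10

# Draw candidates uniformly on a log scale (illustrative ranges)
candidates <- data.frame(
  sigma = 10^runif(n_candidates, min = -2, max = 0),
  C     = 2^runif(n_candidates, min = -5, max = 10)
)

# Each row would be evaluated by cross-validation; the best row is kept
candidates[order(candidates$sigma), ]
```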
#GridSearch
svm_model4 <- svm(Churn~. , data = Train, kernel = "radial", gamma = 0.5, cost = 1.258925 )
summary(svm_model4)
##
## Call:
## svm(formula = Churn ~ ., data = Train, kernel = "radial", gamma = 0.5,
## cost = 1.258925)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.258925
##
## Number of Support Vectors: 2046
##
## ( 1107 939 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction
p4_test <- predict(svm_model4, newdata = Test)
p4_test_cm<-confusionMatrix(p4_test, Test$Churn)
p4_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1674 389
## Yes 158 306
##
## Accuracy : 0.7835
## 95% CI : (0.767, 0.7995)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 8.416e-12
##
## Kappa : 0.3948
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9138
## Specificity : 0.4403
## Pos Pred Value : 0.8114
## Neg Pred Value : 0.6595
## Prevalence : 0.7250
## Detection Rate : 0.6624
## Detection Prevalence : 0.8164
## Balanced Accuracy : 0.6770
##
## 'Positive' Class : No
##
#RandomSearch
svm_model5 <- svm(Churn~. , data = Train, kernel = "radial", gamma = 0.09904983, cost = 3.498066 )
summary(svm_model5)
##
## Call:
## svm(formula = Churn ~ ., data = Train, kernel = "radial", gamma = 0.09904983,
## cost = 3.498066)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 3.498066
##
## Number of Support Vectors: 1998
##
## ( 1029 969 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction
p5_test <- predict(svm_model5, newdata = Test)
p5_test_cm<-confusionMatrix(p5_test, Test$Churn)
p5_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1699 430
## Yes 133 265
##
## Accuracy : 0.7772
## 95% CI : (0.7605, 0.7933)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 1.096e-09
##
## Kappa : 0.3559
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9274
## Specificity : 0.3813
## Pos Pred Value : 0.7980
## Neg Pred Value : 0.6658
## Prevalence : 0.7250
## Detection Rate : 0.6723
## Detection Prevalence : 0.8425
## Balanced Accuracy : 0.6543
##
## 'Positive' Class : No
##
Compared with the default-parameter model and the model tuned by random search, the model tuned by grid search yields higher test Accuracy and Kappa, so the grid search parameters are used.
In the grid search model, Sensitivity (0.9138) and Specificity (0.4403) differ widely. This is caused by class imbalance: in the Churn variable, customers who churn (Yes = 1836) are far fewer than customers who do not (No = 5114).
There are several ways to address this problem:
##Handling low specificity caused by imbalanced data
library(ROSE)
## Loaded ROSE 0.0-3
##OverSampling
table(Train$Churn)
##
## No Yes
## 3282 1141
3282*2
## [1] 6564
over <- ovun.sample(Churn~., data = Train, method = "over", N=6564)$data
table(over$Churn)
##
## No Yes
## 3282 3282
#Modeling Train Data (oversampling)
svm_model4_over <- svm(Churn~. , data = over, kernel = "radial", gamma = 0.5, cost = 1.258925 )
summary(svm_model4_over)
##
## Call:
## svm(formula = Churn ~ ., data = over, kernel = "radial", gamma = 0.5,
## cost = 1.258925)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.258925
##
## Number of Support Vectors: 3515
##
## ( 1747 1768 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction (oversampling)
p4_test_over <- predict(svm_model4_over, newdata = Test)
p4_test_cm_over<-confusionMatrix(p4_test_over, Test$Churn)
p4_test_cm_over
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1323 185
## Yes 509 510
##
## Accuracy : 0.7254
## 95% CI : (0.7075, 0.7427)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.4924
##
## Kappa : 0.3983
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7222
## Specificity : 0.7338
## Pos Pred Value : 0.8773
## Neg Pred Value : 0.5005
## Prevalence : 0.7250
## Detection Rate : 0.5235
## Detection Prevalence : 0.5968
## Balanced Accuracy : 0.7280
##
## 'Positive' Class : No
##
##UnderSampling
table(Train$Churn)
##
## No Yes
## 3282 1141
1141*2
## [1] 2282
under <- ovun.sample(Churn~., data = Train, method = "under", N=2282)$data
table(under$Churn)
##
## No Yes
## 1141 1141
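The two resampling steps above can be pictured with plain index sampling. This is a base-R sketch of random oversampling and undersampling on hypothetical labels with the same class counts as `Train$Churn`. It is not ROSE's implementation, and the variable names are illustrative.

```r
set.seed(1)

# Hypothetical labels with the class counts of the training data
churn   <- factor(rep(c("No", "Yes"), times = c(3282, 1141)))
idx_no  <- which(churn == "No")
idx_yes <- which(churn == "Yes")

# Oversampling: keep every row and add minority rows drawn with replacement
extra_yes <- sample(idx_yes, size = length(idx_no) - length(idx_yes),
                    replace = TRUE)
over_idx  <- c(idx_no, idx_yes, extra_yes)

# Undersampling: keep every minority row and a random subset of the majority
under_idx <- c(sample(idx_no, size = length(idx_yes)), idx_yes)

table(churn[over_idx])
table(churn[under_idx])
```

Oversampling keeps all the information but repeats minority rows, while undersampling discards about two thirds of the majority class here.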
#Modeling Train Data (undersampling)
svm_model4_under <- svm(Churn~. , data = under, kernel = "radial", gamma = 0.5, cost = 1.258925 )
summary(svm_model4_under)
##
## Call:
## svm(formula = Churn ~ ., data = under, kernel = "radial", gamma = 0.5,
## cost = 1.258925)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.258925
##
## Number of Support Vectors: 1290
##
## ( 659 631 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction (undersampling)
p4_test_under<- predict(svm_model4_under, newdata = Test)
p4_test_cm_under<-confusionMatrix(p4_test_under, Test$Churn)
p4_test_cm_under
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1298 177
## Yes 534 518
##
## Accuracy : 0.7186
## 95% CI : (0.7007, 0.7361)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.7693
##
## Kappa : 0.3914
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7085
## Specificity : 0.7453
## Pos Pred Value : 0.8800
## Neg Pred Value : 0.4924
## Prevalence : 0.7250
## Detection Rate : 0.5137
## Detection Prevalence : 0.5837
## Balanced Accuracy : 0.7269
##
## 'Positive' Class : No
##
##SMOTE
table(Train$Churn)
##
## No Yes
## 3282 1141
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
smote <- SMOTE(Churn~. , data = Train, perc.over = 100, perc.under = 200)
table(smote$Churn)
##
## No Yes
## 2282 2282
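The class counts above follow from the two percentages, on the understanding that `perc.over = 100` creates one synthetic case per original minority case and `perc.under = 200` keeps two majority cases per synthetic case. The sketch below reproduces that arithmetic and shows the interpolation idea behind a synthetic point. The two feature vectors are hypothetical, and the code is an illustration of the idea rather than DMwR's implementation.

```r
# Class-count arithmetic for the SMOTE call above
n_yes      <- 1141
perc_over  <- 100
perc_under <- 200

n_synthetic <- n_yes * perc_over / 100          # new minority cases
n_yes_total <- n_yes + n_synthetic              # original + synthetic
n_no_kept   <- n_synthetic * perc_under / 100   # sampled majority cases
c(No = n_no_kept, Yes = n_yes_total)

# Interpolation: a synthetic point lies between a minority case and a neighbour
set.seed(1)
x         <- c(tenure = -0.5, MonthlyCharges = 1.2)   # hypothetical minority case
neighbour <- c(tenure =  0.1, MonthlyCharges = 0.8)   # hypothetical nearest neighbour
gap       <- runif(1)                                 # random position on the segment
synthetic <- x + gap * (neighbour - x)
synthetic
```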
#Modeling Train Data (smote)
svm_model4_smote <- svm(Churn~. , data = smote, kernel = "radial", gamma = 0.5, cost = 1.258925 )
summary(svm_model4_smote)
##
## Call:
## svm(formula = Churn ~ ., data = smote, kernel = "radial", gamma = 0.5,
## cost = 1.258925)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.258925
##
## Number of Support Vectors: 2495
##
## ( 1258 1237 )
##
##
## Number of Classes: 2
##
## Levels:
## No Yes
#Prediction (smote)
p4_test_smote<- predict(svm_model4_smote, newdata = Test)
p4_test_cm_smote<-confusionMatrix(p4_test_smote, Test$Churn)
p4_test_cm_smote
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1323 190
## Yes 509 505
##
## Accuracy : 0.7234
## 95% CI : (0.7055, 0.7408)
## No Information Rate : 0.725
## P-Value [Acc > NIR] : 0.5807
##
## Kappa : 0.3928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7222
## Specificity : 0.7266
## Pos Pred Value : 0.8744
## Neg Pred Value : 0.4980
## Prevalence : 0.7250
## Detection Rate : 0.5235
## Detection Prevalence : 0.5987
## Balanced Accuracy : 0.7244
##
## 'Positive' Class : No
##
From the model without imbalance handling and the three models that handle imbalance above, the method with the best performance is selected by the largest AUC (Area Under the Curve):
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
auc<- roc(Test$Churn, factor(p4_test, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p4_test, ordered = TRUE))
##
## Data: factor(p4_test, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.677
auc_over <- roc(Test$Churn, factor(p4_test_over, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc_over
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p4_test_over, ordered = TRUE))
##
## Data: factor(p4_test_over, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.728
auc_under <- roc(Test$Churn, factor(p4_test_under, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc_under
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p4_test_under, ordered = TRUE))
##
## Data: factor(p4_test_under, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.7269
auc_smote <- roc(Test$Churn, factor(p4_test_smote, ordered = TRUE))
## Setting levels: control = No, case = Yes
## Warning in value[[3L]](cond): Ordered predictor converted to numeric vector.
## Threshold values will not correspond to values in predictor.
## Setting direction: controls < cases
auc_smote
##
## Call:
## roc.default(response = Test$Churn, predictor = factor(p4_test_smote, ordered = TRUE))
##
## Data: factor(p4_test_smote, ordered = TRUE) in 1832 controls (Test$Churn No) < 695 cases (Test$Churn Yes).
## Area under the curve: 0.7244
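Because the predictor passed to `roc()` is a hard class label rather than a continuous score, the ROC curve has a single interior point, and the AUC reduces to the average of Sensitivity and Specificity (the Balanced Accuracy reported earlier). The sketch below checks this for the grid search model without imbalance handling, rebuilding the labels from its confusion matrix counts and computing the AUC through the rank-sum formula. Variable names are illustrative.

```r
# Rebuild test labels from the confusion matrix (0 = No, 1 = Yes)
response  <- rep(c(0, 1), times = c(1832, 695))
predicted <- c(rep(c(0, 1), times = c(1674, 158)),   # true "No" rows
               rep(c(0, 1), times = c(389, 306)))    # true "Yes" rows

# AUC via the rank-sum (Mann-Whitney) formula; ties receive average ranks
r  <- rank(predicted)
n1 <- sum(response == 1)
n0 <- sum(response == 0)
auc_hard <- (sum(r[response == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Balanced accuracy from the same counts
sens <- 1674 / 1832
spec <- 306 / 695
balanced_accuracy <- (sens + spec) / 2

c(auc = auc_hard, balanced_accuracy = balanced_accuracy)
```

A threshold-free AUC would need continuous scores, for example the decision values that `predict()` returns for an e1071 model when called with `decision.values = TRUE`.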
The following table summarizes the evaluation of the four models:
nama_metode<-c("No Handling", "OverSampling", "UnderSampling", "SMOTE")
accuracy <- c(p4_test_cm$overall["Accuracy"], p4_test_cm_over$overall["Accuracy"], p4_test_cm_under$overall["Accuracy"], p4_test_cm_smote$overall["Accuracy"])
#Sensitivity and Specificity are stored in byClass; overall[2] and overall[3] are Kappa and the lower CI bound
sensitivity <- c(p4_test_cm$byClass["Sensitivity"], p4_test_cm_over$byClass["Sensitivity"], p4_test_cm_under$byClass["Sensitivity"], p4_test_cm_smote$byClass["Sensitivity"])
specificity <- c(p4_test_cm$byClass["Specificity"], p4_test_cm_over$byClass["Specificity"], p4_test_cm_under$byClass["Specificity"], p4_test_cm_smote$byClass["Specificity"])
auc <- c(auc$auc, auc_over$auc, auc_under$auc, auc_smote$auc)
data.frame(nama_metode, accuracy, sensitivity, specificity, auc)
## nama_metode accuracy sensitivity specificity auc
## 1 No Handling 0.7835378 0.9137555 0.4402878 0.6770216
## 2 OverSampling 0.7253660 0.7221616 0.7338129 0.7279873
## 3 UnderSampling 0.7186387 0.7085153 0.7453237 0.7269195
## 4 SMOTE 0.7233874 0.7221616 0.7266187 0.7243901
By AUC, oversampling scores highest (0.7280), followed very closely by undersampling (0.7269), which has the highest Specificity (0.7453). The SVM model chosen for this data uses gamma = 0.5, cost = 1.258925, and kernel = "radial", with imbalanced data handled by undersampling.
Next, we look at which variables matter most in driving customers to churn:
library(rminer)
Model <- fit(Churn~., data=Train, model="svm", kpar=list(sigma=0.5), C=1.258925)
svm.imp <- Importance(Model, data=Train)
L=list(runs=1,sen=t(svm.imp$imp),
sresponses=svm.imp$sresponses)
mgraph(L,graph="IMP",leg=names(Train), col=c("#8FBC8F"),Grid=10)
The variables with the most influence on a customer's decision to churn are MonthlyCharges, TotalCharges, and tenure.