Nowadays the internet is a basic necessity for millennials and older generations. In this study, I want to examine how the type of internet service relates to whether a customer has tech support. The study uses the Decision Tree and Naive Bayes methods.

Library

set.seed(666)
library(dplyr) 
library(ggplot2) 
library(gridExtra) 
library(inspectdf) 
library(tidymodels) 
library(caret) 
library(partykit)
library(e1071)

Read Data

G1 <- read.csv("Data/Telco_Churn.csv", stringsAsFactors = T)
glimpse(G1)
## Rows: 7,043
## Columns: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-~
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
## $ MultipleLines    <fct> No phone service, No, No, No phone service, No, Yes, ~
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No ~
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in~
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in~
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte~
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int~
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte~
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One year, M~
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check, Bank tr~
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
head(G1,10)

Data Cleaning

anyNA(G1)
## [1] TRUE
colSums(is.na(G1))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

Almost clean: only TotalCharges has missing values (11 rows); every other column is complete.
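The colSums() output above reports 11 missing values in TotalCharges. As a minimal sketch of one way to handle them, the base R snippet below drops incomplete rows from a small made-up data frame. The name `toy` and its values are hypothetical stand-ins for G1, and whether dropping or imputing is the better choice for the real data is left open here.

```r
# Hypothetical toy data standing in for G1 (values are made up).
toy <- data.frame(
  tenure       = c(1, 34, 0, 45),
  TotalCharges = c(29.85, 1889.50, NA, 1840.75)
)

# Count missing values per column, as colSums(is.na(G1)) does above.
na_counts <- colSums(is.na(toy))

# Keep only the rows that have no missing value in any column.
toy_clean <- toy[complete.cases(toy), ]

print(na_counts)
print(nrow(toy_clean))
```

Applied to the real data, the same pattern would be `G1[complete.cases(G1), ]`, which would remove the 11 affected rows before the train-test split.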

Data Proportion

table(G1$InternetService) %>% 
  prop.table()
## 
##         DSL Fiber optic          No 
##   0.3437456   0.4395854   0.2166690

The classes are imbalanced: the No class is the smallest and needs to be upsampled so that the classes are balanced.

Train-Test Split

index <- initial_split(G1, prop = 0.8)

data_train <- training(index)
data_test <- testing(index)
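# Exclude the target from x so train_up does not end up with two InternetService columns
train_up <- upSample(x = dplyr::select(data_train, -InternetService),
                     y = data_train$InternetService, yname = "InternetService"
                     )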

table(train_up$InternetService)
## 
##         DSL Fiber optic          No 
##        2501        2501        2501
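To make the upsampling step concrete, here is a minimal base R sketch of the idea: every class keeps its original rows and receives extra rows, sampled with replacement, until it matches the largest class. The data frame `toy` and its counts are made up, and the logic is only assumed to resemble what upSample() does internally; it is not a copy of the caret implementation.

```r
set.seed(666)

# Hypothetical imbalanced labels (made-up counts, not the Telco data).
toy <- data.frame(
  id    = 1:10,
  class = factor(c(rep("DSL", 3), rep("Fiber optic", 5), rep("No", 2)))
)

# Target size: the number of rows in the largest class.
n_max <- max(table(toy$class))

# For each class, keep all original rows and add rows sampled
# with replacement until the class has n_max rows.
rows_by_class <- split(seq_len(nrow(toy)), toy$class)
up_rows <- unlist(lapply(rows_by_class, function(rows) {
  extra <- n_max - length(rows)
  c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
}))

toy_up <- toy[up_rows, ]
print(table(toy_up$class))
```

Only the training set is resampled this way; the test set keeps its original class proportions so that the evaluation reflects the real distribution.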

Model Building

Decision Tree

G2 <- ctree(TechSupport ~ InternetService, data = data_train)
plot(G2)

Model Evaluation

pred_test <- predict(G2, data_test)

head(pred_test)
##                   5                   6                  10                  13 
##                  No                  No                  No                  No 
##                  19                  23 
##                  No No internet service 
## Levels: No No internet service Yes
confusionMatrix(pred_test, data_test$TechSupport, positive = "Yes")
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction             No No internet service Yes
##   No                  700                   0 414
##   No internet service   0                 295   0
##   Yes                   0                   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7062          
##                  95% CI : (0.6816, 0.7299)
##     No Information Rate : 0.4968          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4785          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: No Class: No internet service Class: Yes
## Sensitivity             1.0000                     1.0000     0.0000
## Specificity             0.4161                     1.0000     1.0000
## Pos Pred Value          0.6284                     1.0000        NaN
## Neg Pred Value          1.0000                     1.0000     0.7062
## Prevalence              0.4968                     0.2094     0.2938
## Detection Rate          0.4968                     0.2094     0.0000
## Detection Prevalence    0.7906                     0.2094     0.0000
## Balanced Accuracy       0.7080                     1.0000     0.5000

From the plot, we can conclude that DSL customers (DSL is internet delivered over copper telephone lines) are more likely to have technical support than fiber optic customers. On the test data the model reaches about 71% accuracy (0.7062), although it never predicts the Yes class (sensitivity 0).
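The accuracy and per-class sensitivity reported above can be recomputed directly from the confusion matrix. The sketch below copies the counts from the confusionMatrix() output for the tree; the variable names are illustrative only.

```r
# Confusion matrix copied from the decision tree output above
# (rows = prediction, columns = reference).
classes <- c("No", "No internet service", "Yes")
cm <- matrix(
  c(700,   0, 414,
      0, 295,   0,
      0,   0,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified cases divided by all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions divided by the reference total.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The accuracy works out to 995 / 1409, and the sensitivity of 0 for Yes shows that all 414 customers with tech support were predicted as No.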

Naive Bayes

model_bayes <- naiveBayes(InternetService ~ ., train_up)

pred_bayes <- predict(model_bayes, data_test)

confusionMatrix(pred_bayes, data_test$InternetService)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    DSL Fiber optic  No
##   DSL         436          55   0
##   Fiber optic  83         540   0
##   No            0           0 295
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9021          
##                  95% CI : (0.8853, 0.9171)
##     No Information Rate : 0.4223          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8472          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: DSL Class: Fiber optic Class: No
## Sensitivity              0.8401             0.9076    1.0000
## Specificity              0.9382             0.8980    1.0000
## Pos Pred Value           0.8880             0.8668    1.0000
## Neg Pred Value           0.9096             0.9300    1.0000
## Prevalence               0.3683             0.4223    0.2094
## Detection Rate           0.3094             0.3833    0.2094
## Detection Prevalence     0.3485             0.4422    0.2094
## Balanced Accuracy        0.8891             0.9028    1.0000

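The Naive Bayes model reaches about 90% accuracy (0.9021), higher than the roughly 71% of the ctree model. Note that the two models predict different targets (InternetService here, TechSupport for the tree), so the two accuracies are not directly comparable.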
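As a cross-check of the Naive Bayes output above, the accuracy and Cohen's kappa can be recomputed by hand from the confusion matrix. This is a sketch using the standard kappa formula (observed agreement corrected for the agreement expected by chance); the variable names are illustrative only.

```r
# Confusion matrix copied from the Naive Bayes output above
# (rows = prediction, columns = reference).
classes <- c("DSL", "Fiber optic", "No")
cm <- matrix(
  c(436,  55,   0,
     83, 540,   0,
      0,   0, 295),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)

# Observed agreement is the accuracy.
observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: how far the observed agreement exceeds chance.
kappa_hat <- (observed - expected) / (1 - expected)

print(round(observed, 4))
print(round(kappa_hat, 4))
```

Both values agree with the reported Accuracy (0.9021) and Kappa (0.8472) to four decimal places.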

Conclusion

Of the two tests, the Naive Bayes method performs better, with an accuracy of about 90%.

Based on the ctree test, we can conclude that DSL customers are more likely to have tech support than fiber optic customers.