Nowadays the internet is a basic necessity for millennials and older generations. In this study I examine which type of internet service is associated with having tech support. The analysis uses two methods: Decision Tree and Naive Bayes.
set.seed(666)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(inspectdf)
library(tidymodels)
library(caret)
library(partykit)
library(e1071)
G1 <- read.csv("Data/Telco_Churn.csv", stringsAsFactors = T)
glimpse(G1)
## Rows: 7,043
## Columns: 21
## $ customerID <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOCW, 9237-~
## $ gender <fct> Female, Male, Male, Male, Female, Female, Male, Femal~
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Partner <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes, No, Ye~
## $ Dependents <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes, No, No~
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2~
## $ PhoneService <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, Y~
## $ MultipleLines <fct> No phone service, No, No, No phone service, No, Yes, ~
## $ InternetService <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic, Fiber o~
## $ OnlineSecurity <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No ~
## $ OnlineBackup <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, No, No in~
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, No, No in~
## $ TechSupport <fct> No, No, No, Yes, No, No, No, No, Yes, No, No, No inte~
## $ StreamingTV <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No, No int~
## $ StreamingMovies <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No inte~
## $ Contract <fct> Month-to-month, One year, Month-to-month, One year, M~
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No, Yes, No~
## $ PaymentMethod <fct> Electronic check, Mailed check, Mailed check, Bank tr~
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7~
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949~
## $ Churn <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
head(G1, 10)
anyNA(G1)
## [1] TRUE
colSums(is.na(G1))
##       customerID           gender    SeniorCitizen          Partner
##                0                0                0                0
##       Dependents           tenure     PhoneService    MultipleLines
##                0                0                0                0
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection
##                0                0                0                0
##      TechSupport      StreamingTV  StreamingMovies         Contract
##                0                0                0                0
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges
##                0                0                0               11
##            Churn
##                0
The data is almost clean: the only missing values are 11 entries in TotalCharges.
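The `colSums(is.na(G1))` output above reports 11 missing values in `TotalCharges`, and the analysis here proceeds without removing them. One possible way to inspect and drop such rows is sketched below on a small made-up data frame. The `toy` object and its values are hypothetical and not taken from the Telco data, and dropping rows is an assumption, not the choice made in this study.

```r
# Hypothetical toy data standing in for G1; the values are made up.
toy <- data.frame(
  tenure       = c(1, 0, 34, 0, 2),
  TotalCharges = c(29.85, NA, 1889.50, NA, 108.15)
)

# Count missing values per column, as colSums(is.na(G1)) does above.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Look at the incomplete rows before deciding what to do with them.
print(toy[!complete.cases(toy), ])

# One option (an assumption, not the author's choice): drop incomplete rows.
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

With the real data, the same `complete.cases()` filter could be applied to `G1` before splitting, which would change the row counts reported further down.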
table(G1$InternetService) %>%
  prop.table()
##
##         DSL Fiber optic          No
##   0.3437456   0.4395854   0.2166690
The class proportions are imbalanced, with No the smallest class at about 22%, so the training data is upsampled to balance them.
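Upsampling resamples rows of the smaller classes with replacement until every class has as many rows as the largest one. A minimal base-R sketch of that idea on made-up data follows. The `toy` data frame and the helper `upsample_to_max` are hypothetical names introduced for illustration; the analysis itself relies on `upSample` from caret.

```r
set.seed(1)

# Hypothetical imbalanced data: 4 DSL, 5 Fiber optic, 1 No.
toy <- data.frame(
  x     = 1:10,
  class = factor(c(rep("DSL", 4), rep("Fiber optic", 5), "No"))
)

# Hypothetical helper: grow every class to the size of the largest class.
upsample_to_max <- function(df, target) {
  target_size <- max(table(df[[target]]))
  groups <- split(df, df[[target]])
  resampled <- lapply(groups, function(g) {
    extra <- target_size - nrow(g)
    # Keep every original row, then draw the shortfall with replacement.
    idx <- c(seq_len(nrow(g)),
             sample(seq_len(nrow(g)), extra, replace = TRUE))
    g[idx, , drop = FALSE]
  })
  out <- do.call(rbind, resampled)
  rownames(out) <- NULL
  out
}

toy_up <- upsample_to_max(toy, "class")
print(table(toy_up$class))
```

Only the training split should be resampled this way, so that the test set keeps the original class proportions.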
index <- initial_split(G1, prop = 0.8)
data_train <- training(index)
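data_test <- testing(index)
# Pass only the predictors as x so the target column is not duplicated in the result
train_up <- upSample(x = data_train %>% select(-InternetService),
                     y = data_train$InternetService, yname = "InternetService"
)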
table(train_up$InternetService)
##
##         DSL Fiber optic          No
##        2501        2501        2501
G2 <- ctree(TechSupport ~ InternetService, data = data_train)
plot(G2)
pred_test <- predict(G2, data_test)
head(pred_test)
##                   5                   6                  10                  13
##                  No                  No                  No                  No
##                  19                  23
##                  No No internet service
## Levels: No No internet service Yes
confusionMatrix(pred_test, data_test$TechSupport, positive = "Yes")
## Confusion Matrix and Statistics
##
##                      Reference
## Prediction             No No internet service Yes
##   No                  700                   0 414
##   No internet service   0                 295   0
##   Yes                   0                   0   0
##
## Overall Statistics
##
##                Accuracy : 0.7062
##                  95% CI : (0.6816, 0.7299)
##     No Information Rate : 0.4968
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.4785
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: No Class: No internet service Class: Yes
## Sensitivity             1.0000                     1.0000     0.0000
## Specificity             0.4161                     1.0000     1.0000
## Pos Pred Value          0.6284                     1.0000        NaN
## Neg Pred Value          1.0000                     1.0000     0.7062
## Prevalence              0.4968                     0.2094     0.2938
## Detection Rate          0.4968                     0.2094     0.0000
## Detection Prevalence    0.7906                     0.2094     0.0000
## Balanced Accuracy       0.7080                     1.0000     0.5000
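From the plot it can be concluded that DSL customers (DSL is delivered over copper telephone lines) have technical support more often than fiber optic customers. The model reaches about 71% accuracy on the test data, although it never predicts the Yes class (sensitivity 0), so that accuracy comes entirely from the No and No internet service classes.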
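A comparison of this kind can also be read from a cross-tabulation of row proportions, which gives the share of customers with tech support within each internet service type. The sketch below uses made-up data: the `toy` data frame and its counts are hypothetical and do not reproduce the Telco figures. With the real data, the same call would take `G1$InternetService` and `G1$TechSupport`.

```r
# Hypothetical counts: 10 DSL and 10 Fiber optic customers.
toy <- data.frame(
  InternetService = c(rep("DSL", 10), rep("Fiber optic", 10)),
  TechSupport     = c(rep("Yes", 4), rep("No", 6),
                      rep("Yes", 2), rep("No", 8))
)

# Rows = internet service type, columns = tech support status.
counts <- table(toy$InternetService, toy$TechSupport)

# margin = 1 normalises each row, so every row sums to 1.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

Reading along a row then answers the question directly: the `Yes` column is the fraction of that service type's customers who have tech support.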
model_bayes <- naiveBayes(InternetService ~ ., train_up)
pred_bayes <- predict(model_bayes, data_test)
# "Yes" is not a level of InternetService, and positive only applies to two-class problems
confusionMatrix(pred_bayes, data_test$InternetService)
## Confusion Matrix and Statistics
##
##              Reference
## Prediction    DSL Fiber optic  No
##   DSL         436          55   0
##   Fiber optic  83         540   0
##   No            0           0 295
##
## Overall Statistics
##
##                Accuracy : 0.9021
##                  95% CI : (0.8853, 0.9171)
##     No Information Rate : 0.4223
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.8472
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: DSL Class: Fiber optic Class: No
## Sensitivity              0.8401             0.9076    1.0000
## Specificity              0.9382             0.8980    1.0000
## Pos Pred Value           0.8880             0.8668    1.0000
## Neg Pred Value           0.9096             0.9300    1.0000
## Prevalence               0.3683             0.4223    0.2094
## Detection Rate           0.3094             0.3833    0.2094
## Detection Prevalence     0.3485             0.4422    0.2094
## Balanced Accuracy        0.8891             0.9028    1.0000
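The naive Bayes model reaches about 90% accuracy, well above the roughly 71% of the ctree model. Keep in mind that the two models predict different targets (InternetService for naive Bayes, TechSupport for ctree), so the comparison is only indicative.
Of the two tests, naive Bayes performs better, with about 90% accuracy.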
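The accuracy figures can be checked by hand: accuracy is the sum of the diagonal of a confusion matrix divided by the total number of test cases. The sketch below copies the two matrices from the `confusionMatrix()` output above; the object names `cm_ctree`, `cm_bayes`, and `accuracy` are introduced here for illustration only.

```r
# Confusion matrices copied from the output above
# (rows = prediction, columns = reference).
cm_ctree <- matrix(c(700,   0, 414,
                       0, 295,   0,
                       0,   0,   0),
                   nrow = 3, byrow = TRUE)
cm_bayes <- matrix(c(436,  55,   0,
                      83, 540,   0,
                       0,   0, 295),
                   nrow = 3, byrow = TRUE)

# Accuracy = correctly classified cases / all cases.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc_ctree <- accuracy(cm_ctree)
acc_bayes <- accuracy(cm_bayes)
print(round(c(ctree = acc_ctree, naive_bayes = acc_bayes), 4))
```

This gives 995/1409 for ctree and 1271/1409 for naive Bayes, which round to the reported 0.7062 and 0.9021.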
Based on the ctree analysis, DSL customers are more likely to have tech support than fiber optic customers.