Supervised Learning Case Study : Telco Churn Prevention
Digunakan data transaksi sebulan terakhir.
Objektif:
Karakteristik pelanggan yang akan churn dan stay
Prediksi pelanggan yang punya kecenderungan akan churn (berhenti berlangganan) paket telepon pascabayar di bulan depan.
Dari sisi bisnis, apa yang dapat dilakukan setelah mendapatkan model prediksi churn?
Data
| Variable | Deskripsi | Satuan |
| MSISDN | Data Nomor Telepon yang telah di masking | |
| los | Length of Stay / Lamanya menjadi customer | hari |
| voice_rev | Penghasilan dari jasa voice/telepon dalam sebulan terakhir | Rp |
| voice_trx | Banyaknya transaksi jasa voice/telepon dalam sebulan terakhir | kali |
| voice_mou | Lamanya waktu menggunakan layanan voice/telepon dalam sebulan terakhir | menit |
| voice_dou | Banyaknya hari menggunakan layanan voice/telepon dalam sebulan terakhir | hari |
| sms_rev | Penghasilan dari jasa sms dalam sebulan terakhir | Rp |
| sms_trx | Banyaknya transaksi SMS dalam sebulan terakhir | kali |
| sms_dou | Banyaknya menggunakan SMS dalam sebulan terakhir | hari |
| broadband_rev | Penghasilan dari jasa paket data(broadband) dalam sebulan terakhir | Rp |
| broadband_usg | penggunaan layanan broadband dalam sebulan terakhir | MB |
| broadband_dou | Banyaknya hari menggunakan layanan broadband dalam sebulan terakhir | hari |
| voice_package_rev | Penghasilan dari paket telepon dalam sebulan terakhir | Rp |
| vocie_package_trx | Banyaknya transaksi menggunakan paket telepon dalam sebulan terakhir | kali |
| voice_package_dou | Banyaknya hari menggunakan paket telepon dalam sebulan terakhir | hari |
| churn | menunjukkan status churn atau masih aktif (1 = churn ; positive event, 0 = aktif) |
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## Warning: package 'dials' was built under R version 4.3.2
## Warning: package 'infer' was built under R version 4.3.2
## Warning: package 'modeldata' was built under R version 4.3.2
## Warning: package 'parsnip' was built under R version 4.3.2
## Warning: package 'rsample' was built under R version 4.3.2
## Warning: package 'tune' was built under R version 4.3.2
## Warning: package 'workflows' was built under R version 4.3.2
## Warning: package 'workflowsets' was built under R version 4.3.2
## Warning: package 'yardstick' was built under R version 4.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(finetune)
## Warning: package 'finetune' was built under R version 4.3.2
library(vip)
## Warning: package 'vip' was built under R version 4.3.2
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
setwd("D:/Kuliah/Mat/TSA Kominfo/Praktikum")
telco <- read.csv("4. telco_churn_sample.csv")
head(telco, 15)
## MSISDN los voice_rev voice_trx voice_mou voice_dou sms_rev sms_trx sms_dou
## 1 31441 534 5860 57 251 7 3465 16 5
## 2 31944 589 0 0 0 0 0 0 0
## 3 82538 1223 3744 6 5 2 2475 13 5
## 4 86957 1221 231 52 188 6 0 2 2
## 5 27618 60 12073 36 588 6 2100 20 5
## 6 96313 188 3829 7 261 3 0 0 0
## 7 21363 255 320 8 207 4 20430 97 7
## 8 85847 594 317 1 0 1 0 0 0
## 9 56880 3190 0 0 0 0 1505 7 4
## 10 83060 357 17930 37 67 7 2150 16 6
## 11 12681 39 0 0 0 0 0 0 0
## 12 86834 2773 0 16 51 3 1350 13 5
## 13 87358 176 5133 14 16 6 7210 37 7
## 14 76508 2650 12120 27 17 6 9095 43 6
## 15 55676 617 1149 2 1 2 900 6 2
## broadband_rev broadband_usg broadband_dou voice_package_rev
## 1 35000 1713.20277 6 34200
## 2 0 0.00000 0 8350
## 3 0 0.00000 0 8200
## 4 0 341.69264 7 7950
## 5 0 0.00000 0 26600
## 6 0 0.00000 0 4550
## 7 0 0.00000 0 15800
## 8 9000 40.74369 7 17900
## 9 0 0.00000 0 16650
## 10 0 1573.61691 7 8700
## 11 0 0.00000 0 8200
## 12 0 0.00000 0 22050
## 13 0 0.00000 0 9000
## 14 0 0.00000 0 33300
## 15 2888 375.30317 2 3850
## voice_package_trx voice_package_dou churn
## 1 8 6 0
## 2 2 2 1
## 3 2 2 0
## 4 2 2 0
## 5 6 5 1
## 6 1 1 0
## 7 4 4 0
## 8 4 2 0
## 9 4 4 0
## 10 2 2 0
## 11 2 1 1
## 12 6 5 0
## 13 4 4 0
## 14 8 6 0
## 15 1 1 1
str(telco)
## 'data.frame': 5000 obs. of 16 variables:
## $ MSISDN : int 31441 31944 82538 86957 27618 96313 21363 85847 56880 83060 ...
## $ los : int 534 589 1223 1221 60 188 255 594 3190 357 ...
## $ voice_rev : num 5860 0 3744 231 12073 ...
## $ voice_trx : int 57 0 6 52 36 7 8 1 0 37 ...
## $ voice_mou : int 251 0 5 188 588 261 207 0 0 67 ...
## $ voice_dou : int 7 0 2 6 6 3 4 1 0 7 ...
## $ sms_rev : int 3465 0 2475 0 2100 0 20430 0 1505 2150 ...
## $ sms_trx : int 16 0 13 2 20 0 97 0 7 16 ...
## $ sms_dou : int 5 0 5 2 5 0 7 0 4 6 ...
## $ broadband_rev : num 35000 0 0 0 0 0 0 9000 0 0 ...
## $ broadband_usg : num 1713 0 0 342 0 ...
## $ broadband_dou : int 6 0 0 7 0 0 0 7 0 7 ...
## $ voice_package_rev: int 34200 8350 8200 7950 26600 4550 15800 17900 16650 8700 ...
## $ voice_package_trx: int 8 2 2 2 6 1 4 4 4 2 ...
## $ voice_package_dou: int 6 2 2 2 5 1 4 2 4 2 ...
## $ churn : int 0 1 0 0 1 0 0 0 0 0 ...
# library(dplyr)
glimpse(telco)
## Rows: 5,000
## Columns: 16
## $ MSISDN <int> 31441, 31944, 82538, 86957, 27618, 96313, 21363, 858…
## $ los <int> 534, 589, 1223, 1221, 60, 188, 255, 594, 3190, 357, …
## $ voice_rev <dbl> 5860.00, 0.00, 3744.00, 231.00, 12073.00, 3829.00, 3…
## $ voice_trx <int> 57, 0, 6, 52, 36, 7, 8, 1, 0, 37, 0, 16, 14, 27, 2, …
## $ voice_mou <int> 251, 0, 5, 188, 588, 261, 207, 0, 0, 67, 0, 51, 16, …
## $ voice_dou <int> 7, 0, 2, 6, 6, 3, 4, 1, 0, 7, 0, 3, 6, 6, 2, 2, 7, 5…
## $ sms_rev <int> 3465, 0, 2475, 0, 2100, 0, 20430, 0, 1505, 2150, 0, …
## $ sms_trx <int> 16, 0, 13, 2, 20, 0, 97, 0, 7, 16, 0, 13, 37, 43, 6,…
## $ sms_dou <int> 5, 0, 5, 2, 5, 0, 7, 0, 4, 6, 0, 5, 7, 6, 2, 0, 7, 6…
## $ broadband_rev <dbl> 35000, 0, 0, 0, 0, 0, 0, 9000, 0, 0, 0, 0, 0, 0, 288…
## $ broadband_usg <dbl> 1.713203e+03, 0.000000e+00, 0.000000e+00, 3.416926e+…
## $ broadband_dou <int> 6, 0, 0, 7, 0, 0, 0, 7, 0, 7, 0, 0, 0, 0, 2, 0, 0, 1…
## $ voice_package_rev <int> 34200, 8350, 8200, 7950, 26600, 4550, 15800, 17900, …
## $ voice_package_trx <int> 8, 2, 2, 2, 6, 1, 4, 4, 4, 2, 2, 6, 4, 8, 1, 2, 4, 1…
## $ voice_package_dou <int> 6, 2, 2, 2, 5, 1, 4, 2, 4, 2, 1, 5, 4, 6, 1, 2, 4, 1…
## $ churn <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1…
summary(telco)
## MSISDN los voice_rev voice_trx
## Min. : 23 Min. : 19 Min. : 0 Min. : 0.00
## 1st Qu.:24788 1st Qu.: 341 1st Qu.: 435 1st Qu.: 4.00
## Median :49573 Median : 956 Median : 3664 Median : 15.00
## Mean :49784 Mean :1224 Mean : 7962 Mean : 28.57
## 3rd Qu.:75005 3rd Qu.:2097 3rd Qu.: 9955 3rd Qu.: 37.00
## Max. :99992 Max. :3220 Max. :397101 Max. :507.00
## voice_mou voice_dou sms_rev sms_trx
## Min. : 0.0 Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.: 4.0 1st Qu.:2.000 1st Qu.: 415 1st Qu.: 3.00
## Median : 44.0 Median :4.000 Median : 1980 Median : 12.00
## Mean : 117.8 Mean :3.941 Mean : 4551 Mean : 35.26
## 3rd Qu.: 137.0 3rd Qu.:6.000 3rd Qu.: 5746 3rd Qu.: 34.00
## Max. :3477.0 Max. :7.000 Max. :72430 Max. :1449.00
## sms_dou broadband_rev broadband_usg broadband_dou
## Min. :0.000 Min. : 0 Min. : 0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:0.000
## Median :4.000 Median : 0 Median : 0.000 Median :0.000
## Mean :3.915 Mean : 3952 Mean : 104.874 Mean :1.207
## 3rd Qu.:6.000 3rd Qu.: 0 3rd Qu.: 0.012 3rd Qu.:1.000
## Max. :7.000 Max. :424720 Max. :9882.606 Max. :7.000
## voice_package_rev voice_package_trx voice_package_dou churn
## Min. : 0 Min. : 1.000 Min. :1.000 Min. :0.000
## 1st Qu.: 4350 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.:0.000
## Median : 8200 Median : 2.000 Median :2.000 Median :0.000
## Mean :12065 Mean : 3.201 Mean :2.625 Mean :0.337
## 3rd Qu.:16475 3rd Qu.: 4.000 3rd Qu.:4.000 3rd Qu.:1.000
## Max. :88000 Max. :24.000 Max. :7.000 Max. :1.000
Proporsi Churn
library(ggplot2)
ggplot(telco, aes(x = factor(churn))) +
geom_bar(aes(y = ..count.., fill = factor(churn)), position = "dodge") +
geom_text(stat = "count", aes(label = scales::percent(..count../sum(..count..)),
y = ..count.., group = factor(churn)),
position = position_dodge(width = 0.9),
vjust = -0.5) + # the position of the text
scale_y_continuous(breaks = seq(0, 4000, by = 1000)) +
labs(title = "Proportion",x = "Churn", y = "Jumlah Nasabah") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("No", "Yes")) +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
theme_bw()
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Proporsi sangat berpengaruh pada klasifikasi biner. Dari plot di atas didapat bahwa proporsi antara churn dengan tidak adalah 66:34. Proporsi pada data ini cukup baik untuk digunakan dalam pemodelan. (Usahakan proporsi imbalance data salah satunya mecapai 20%).
library(ggplot2)
telco %>%
ggplot(aes(x = los)) +
geom_histogram(
binwidth = 100,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of Length of Stay (LOS)",
x = "length of stay",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = los, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of Length of Stay (LOS) by Churn",
y = "Churn",
x = "Length of Stay (LOS)") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = los, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of Length of Stay (LOS) by Churn",
x = "Length of Stay (LOS)",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran los, didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel los tidak terlalu berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_rev), y = ..count..) +
geom_histogram(
binwidth = 10000,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of Voice Renevue",
x = "Voice Renevue",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_rev, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of Voice Renevue by Churn",
y = "Churn",
x = "Voice Renevue") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_rev, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of Voice Renevue by Churn",
x = "Voice Renevue",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran voice_rev didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_rev tidak terlalu berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_trx), y = ..count..) +
geom_histogram(
binwidth = 20,
fill = 'skyblue',
color = "white",
alpha = 0.9) +
labs(title = "Histogram of Voice Transaction",
x = "Voice Transaction",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_trx, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of Voice Transaction by Churn",
y = "Churn",
x = "Voice Transaction") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_rev, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of Voice Transaction by Churn",
x = "Voice Transaction",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran voice_trx didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_trx berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_mou), y = ..count..) +
geom_histogram(
binwidth = 50,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(tile = "Histogram of Voice Mou",
x = "Voice Mou",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_mou, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of Voice Mou by Churn",
y = "Churn",
x = "Voice Mou") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_mou, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of Voice Mou by Churn",
x = "Voice Mou",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran voice_mou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_mou berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_dou), y = ..count..) +
geom_histogram(
binwidth = 1,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(tile = "Histogram of Voice Dou",
x = "Voice Dou",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_dou, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of Voice Dou by Churn",
y = "Churn",
x = "Voice Dou") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_dou, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of Voice Dou by Churn",
x = "Voice Dou",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
telco %>%
ggplot(aes(x = voice_dou, fill = factor(churn))) +
geom_histogram(position = position_dodge(width = 0.5),
binwidth = 1,
alpha = 1) +
labs(title = "Histogram of Voice Dou by Churn",
x = "Voice Dou",
y = "Frequency")+
scale_fill_manual(values = c("0" = "skyblue",
"1" = "pink"),
name = "Churn",
labels = c("No", "Yes")) +
theme_minimal() +
theme(legend.position = "top")
Dari grafik boxplot dan density pada sebaran voice_dou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_dou berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = sms_rev)) +
geom_histogram(
binwidth = 2000,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of sms_rev",
x = "sms_rev",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = sms_rev, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of sms_rev by Churn",
y = "Churn",
x = "sms_rev") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = sms_rev, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of sms_rev by Churn",
x = "sms_rev",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran sms_rev, didapati bahwa terdapat sedikit perbedaan yang signifikan antara churn dengan tidak. Artinya variabel sms_rev cukup berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = sms_trx), y = ..count..) +
geom_histogram(
binwidth = 50,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of sms_trx",
x = "sms_trx",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = sms_trx, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of sms_trx by Churn",
y = "Churn",
x = "sms_trx") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = sms_trx, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of sms_trx by Churn",
x = "sms_trx",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran sms_trx didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel sms_trx tidak terlalu berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = sms_dou), y = ..count..) +
geom_histogram(
binwidth = 1,
fill = 'skyblue',
color = "white",
alpha = 0.9) +
labs(title = "Histogram of sms_dou",
x = "sms_dou",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = sms_dou, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of sms_dou by Churn",
y = "Churn",
x = "sms_dou") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = sms_dou, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of sms_dou by Churn",
x = "sms_dou",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran sms_dou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel sms_dou berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = broadband_rev), y = ..count..) +
geom_histogram(
binwidth = 10000,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of broadband_rev",
x = "broadband_rev",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = broadband_rev, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of broadband_rev by Churn",
y = "Churn",
x = "broadband_rev") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = broadband_rev, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of broadband_rev by Churn",
x = "broadband_rev",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran broadband_rev didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel broadband_rev tidak berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = broadband_usg), y = ..count..) +
geom_histogram(
binwidth = 200,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of broadband_usg",
x = "broadband_usg",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = broadband_usg, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of broadband_usg by Churn",
y = "Churn",
x = "broadband_usg") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = broadband_usg, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of broadband_usg by Churn",
x = "broadband_usg",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran broadband_usg didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel broadband_usg cenderung tidak berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = broadband_dou), y = ..count..) +
geom_histogram(
binwidth = 1,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of broadband_dou",
x = "broadband_dou",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = broadband_dou, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of broadband_dou by Churn",
y = "Churn",
x = "broadband_dou") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = broadband_dou, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of broadband_dou by Churn",
x = "broadband_dou",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran broadband_dou didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel broadband_dou tidak berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_package_rev), y = ..count..) +
geom_histogram(
binwidth = 2000,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of voice_package_rev",
x = "voice_package_rev",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_package_rev, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of voice_package_rev by Churn",
y = "Churn",
x = "voice_package_rev") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_package_rev, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of voice_package_rev by Churn",
x = "voice_package_rev",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran voice_package_rev didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_package_rev cenderung berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_package_trx), y = ..count..) +
geom_histogram(
binwidth = 1,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of voice_package_trx",
x = "voice_package_trx",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_package_trx, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of voice_package_trx by Churn",
y = "Churn",
x = "voice_package_trx") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_package_trx, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of voice_package_trx by Churn",
x = "voice_package_trx",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran voice_package_trx didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_package_trx berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
library(ggplot2)
telco %>%
ggplot(aes(x = voice_package_dou), y = ..count..) +
geom_histogram(
binwidth = 1,
fill = 'skyblue',
color = "white",
alpha = 0.7) +
labs(title = "Histogram of voice_package_dou",
x = "voice_package_dou",
y = "frequency") +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_package_dou, y = factor(churn), fill = factor(churn))) +
geom_boxplot() +
labs(title = "Boxplot of voice_package_dou by Churn",
y = "Churn",
x = "voice_package_dou") +
scale_fill_manual(values = c("skyblue", "pink")) +
scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
guides(fill = FALSE) +
theme_bw()
library(ggplot2)
# Your ggplot code here
ggplot(telco, aes(x = voice_package_dou, fill = factor(churn))) +
geom_density(alpha = 0.8) +
labs(title = "Density Plot of voice_package_dou by Churn",
x = "voice_package_dou",
y = "Density") +
scale_fill_manual(values = c("skyblue", "pink"),
name = "Churn",
labels = c("1" = "Yes", "0" = "No")) +
theme_minimal()
Dari grafik boxplot dan density pada sebaran voice_package_dou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_package_dou cenderung berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.
Multikolinearitas
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(telco[,2:15]), method = "square", cl.cex = 0.8 )
Terdapat beberapa variabel yang multikolinearitas. Masalah multikolinearitas ini harus diatasi.
set.seed(100)
telco_split <- telco %>%
select(-MSISDN) %>%
mutate(churn = factor(churn,
levels = c(1, 0),
label = c("Yes", "No"))) %>%
initial_split(prop = 0.8, strata = churn)
telco_split
## <Training/Testing/Total>
## <4000/1000/5000>
# Data Traiing
telco_train <- telco_split %>% training()
# Data Testing
telco_test <- telco_split %>% testing()
# Data k-fold cross validation, k = v = 5
cv <- telco_train %>%
vfold_cv(v = 5, strata = churn)
cv
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [3199/801]> Fold1
## 2 <split [3199/801]> Fold2
## 3 <split [3200/800]> Fold3
## 4 <split [3201/799]> Fold4
## 5 <split [3201/799]> Fold5
Membuat model regrsi logistik tanpa validasi silang
reglog <- glm(churn ~.,data = telco_train, family = "binomial")
reglog %>%
summary()
##
## Call:
## glm(formula = churn ~ ., family = "binomial", data = telco_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.549e+00 1.025e-01 -15.106 < 2e-16 ***
## los 8.648e-05 4.254e-05 2.033 0.0421 *
## voice_rev -2.318e-05 3.522e-06 -6.581 4.68e-11 ***
## voice_trx 2.591e-02 3.845e-03 6.738 1.60e-11 ***
## voice_mou 2.022e-03 4.938e-04 4.095 4.22e-05 ***
## voice_dou 1.209e-01 2.687e-02 4.498 6.85e-06 ***
## sms_rev -9.683e-06 9.545e-06 -1.014 0.3104
## sms_trx -1.887e-04 7.038e-04 -0.268 0.7886
## sms_dou 1.054e-01 2.335e-02 4.514 6.37e-06 ***
## broadband_rev 9.649e-07 2.971e-06 0.325 0.7454
## broadband_usg 2.775e-05 1.051e-04 0.264 0.7918
## broadband_dou -2.022e-02 2.007e-02 -1.007 0.3138
## voice_package_rev -9.735e-06 1.358e-05 -0.717 0.4734
## voice_package_trx -5.111e-02 5.718e-02 -0.894 0.3714
## voice_package_dou 5.228e-01 8.143e-02 6.421 1.36e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5112.2 on 3999 degrees of freedom
## Residual deviance: 3926.0 on 3985 degrees of freedom
## AIC: 3956
##
## Number of Fisher Scoring iterations: 6
Recipe tanpa pra-process apapun
telco_recipe <- telco_train %>%
## menuliskan semua variabel dalam formula
# recipe(formula = churn~ los +
# voice_rev + voice_trx + voice_mou + voice_dou +
# sms_rev + sms_trx + sms_dou +
# broadband_rev + broadband_usg + broadband_dou +
# voice_package_rev + voice_package_trx + voice_package_dou)
## Tanda . sebagai pengganti untuk menyebutkan semua variabel
recipe(formula = churn ~.)
telco_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 14
Recipe dengan normalisasi semua variabel numerik prediktornya. Kita akan menyimpan nilai rataan dan standar deviasi disimpan di model. Sehingga ketika melakukan prediksi, normalisasinya tidak menggunakan statistik rataan dan standar deviasinya dari data yang baru melainkan menggunakan nilai dari data training
telco_rec_normalize <- telco_train %>%
recipe(formula = churn ~.) %>%
step_normalize(all_numeric_predictors())
Masing-masing model mempunyai hyperparameter yang bisa diganti-ganti agar dapat menghasilkan model yang terbaik (hyperparameter tunning). Untuk melakukan hyperparameter tunning gunakan tune() untuk parameter yang diinginkan.
Dengan package glmnet
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
logreg <- logistic_reg(penalty = tune(), mixture = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
Dengan package rpart
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
dtree <- decision_tree(cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()) %>%
set_mode("classification") %>%
set_engine("rpart")
Dengan package ranger dan perhitungan importance. Variabel dengan metode permutasi.
library(ranger)
## Warning: package 'ranger' was built under R version 4.3.2
rf <- rand_forest(mtry =tune(),
min_n = tune(),
trees = 1000) %>%
set_mode("classification") %>%
set_engine("ranger", importance = "permutation")
List semua spesifikasi model
models <- list(
logreg = logreg,
dtree = dtree,
rand_forest = rf
)
Set workflow recipe and model
telco_set <- workflow_set(preproc = list(simple = telco_recipe,
normalize = telco_rec_normalize),
models = models, cross = TRUE)
Grid control untuk menyimpan hasil prediksi dan workflow saat proses membuat model
grid_ctrl <- control_race(
save_pred = TRUE,
parallel_over = "everything",
save_workflow = TRUE
)
Lakukan pemodelan dan cross validation.
Perintah ini membutuhkan waktu yang cukup lama dan komputasi yang berat
telco_models <- telco_set %>%
workflow_map(fn = "tune_race_anova",
resamples = cv,
grid = 15,
control = grid_ctrl,
verbose = TRUE,
seed = 1001)
## i 1 of 6 tuning: simple_logreg
## ✔ 1 of 6 tuning: simple_logreg (18.4s)
## i 2 of 6 tuning: simple_dtree
## ✔ 2 of 6 tuning: simple_dtree (12.4s)
## i 3 of 6 tuning: simple_rand_forest
## ✔ 3 of 6 tuning: simple_rand_forest (7m 29.7s)
## i 4 of 6 tuning: normalize_logreg
## ✔ 4 of 6 tuning: normalize_logreg (19.5s)
## i 5 of 6 tuning: normalize_dtree
## ✔ 5 of 6 tuning: normalize_dtree (18.4s)
## i 6 of 6 tuning: normalize_rand_forest
## ✔ 6 of 6 tuning: normalize_rand_forest (7m 1.2s)
Tampilkan list workflow berdasarkan AUC
telco_models %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(wflow_id, model, mean) %>%
arrange(desc(mean)) %>%
print(width = Inf)
## # A tibble: 46 × 3
## wflow_id model mean
## <chr> <chr> <dbl>
## 1 simple_rand_forest rand_forest 0.817
## 2 normalize_rand_forest rand_forest 0.817
## 3 simple_rand_forest rand_forest 0.816
## 4 normalize_rand_forest rand_forest 0.816
## 5 normalize_rand_forest rand_forest 0.816
## 6 simple_rand_forest rand_forest 0.816
## 7 normalize_rand_forest rand_forest 0.813
## 8 simple_rand_forest rand_forest 0.813
## 9 simple_logreg logistic_reg 0.806
## 10 normalize_logreg logistic_reg 0.806
## # ℹ 36 more rows
Dari data ROC_AUC di atas diperoleh model yang paling bagus adalah model dengan metode random_forest.
Plot perbandingan model dan pra-prosesnya berdasarkan AUC
telco_models %>%
autoplot(rank_metric = "roc_auc",
metric = "roc_auc",
select_best = TRUE) +
theme_bw()
Plot perbandingan ROC semua model dan pra-prosesnya
telco_models %>%
collect_predictions() %>%
group_by(wflow_id) %>%
roc_curve(churn, .pred_Yes) %>%
autoplot()
Model yang paling bagus masih simple dan normalize random forest.
Mengambil parameter hasil tunning terbaik
best_param <- telco_models %>%
extract_workflow_set_result("simple_rand_forest") %>%
select_best(metric = "roc_auc")
best_param
## # A tibble: 1 × 3
## mtry min_n .config
## <int> <int> <chr>
## 1 2 40 Preprocessor1_Model10
Fit terakhir dengan parameter terbaik sehingga mendapatkan workflow terbaik
best_telco <- telco_models %>%
extract_workflow("simple_rand_forest") %>%
finalize_workflow(best_param) %>%
last_fit(split = telco_split)
Menampilkan metric dari model dengan parameter terbaik
best_telco %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.779 Preprocessor1_Model1
## 2 roc_auc binary 0.830 Preprocessor1_Model1
Nilai nya cukup bagus, karna nilai AUC sudah diatas 80%.
Menyimpan hasil prediksi dari data testing
pred_result <- best_telco %>%
collect_predictions()
pred_result
## # A tibble: 1,000 × 7
## id .pred_Yes .pred_No .row .pred_class churn .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/test split 0.0837 0.916 12 No No Preprocessor1_Mo…
## 2 train/test split 0.510 0.490 20 Yes Yes Preprocessor1_Mo…
## 3 train/test split 0.763 0.237 32 Yes Yes Preprocessor1_Mo…
## 4 train/test split 0.388 0.612 37 No No Preprocessor1_Mo…
## 5 train/test split 0.472 0.528 40 No No Preprocessor1_Mo…
## 6 train/test split 0.163 0.837 50 No No Preprocessor1_Mo…
## 7 train/test split 0.404 0.596 51 No Yes Preprocessor1_Mo…
## 8 train/test split 0.249 0.751 54 No No Preprocessor1_Mo…
## 9 train/test split 0.606 0.394 56 Yes No Preprocessor1_Mo…
## 10 train/test split 0.382 0.618 59 No Yes Preprocessor1_Mo…
## # ℹ 990 more rows
Menghitung metric klasifikasi.
Untuk menghitung accuracy, sensitivity, specificity, recall, precision dan f_score menggunakan nilai aktual dan prediksi kelas. Sedangkan untuk menghitung AUC menggunakan nilai aktual dan nilai pediksi peluangnya.
bind_rows(
pred_result %>%
accuracy(truth = churn, estimate = .pred_class),
pred_result %>%
sensitivity(truth = churn, estimate = .pred_class),
pred_result %>%
recall(truth = churn, estimate = .pred_class),
pred_result %>%
specificity(truth = churn, estimate = .pred_class),
pred_result %>%
precision(truth = churn, estimate = .pred_class),
pred_result %>%
f_meas(truth = churn, estimate = .pred_class),
pred_result %>%
roc_auc(churn, .pred_Yes)
)
## # A tibble: 7 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.779
## 2 sensitivity binary 0.593
## 3 recall binary 0.593
## 4 specificity binary 0.873
## 5 precision binary 0.704
## 6 f_meas binary 0.644
## 7 roc_auc binary 0.830
Kurva ROC dari model terbaik
pred_result %>%
roc_curve(truth = churn, .pred_Yes) %>%
autoplot()
Workflow terbaik yang nantinya digunakan untuk prediksi sebagai model akhir.
final_model <- best_telco %>%
extract_workflow()
final_model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), num.trees = ~1000, min.node.size = min_rows(~40L, x), importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 1000
## Sample size: 4000
## Number of independent variables: 14
## Mtry: 2
## Target node size: 40
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error (Brier s.): 0.162052
Akan digunakan data baru untuk diprediksi.
telco_to_pred <- read_csv("4. telco_churn_to_pred.csv")
## Rows: 1000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): MSISDN, los, voice_rev, voice_trx, voice_mou, voice_dou, sms_rev, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Melakukan prediksi dan menggabungkan hasil prediksi dengan data asli. Haisl prediksi berupa kelas dan probability.
telco_to_pred %>%
bind_cols(
final_model %>%
predict(new_data = telco_to_pred),
final_model %>%
predict(new_data = telco_to_pred, type = "prob"))%>%
print(width = Inf)
## # A tibble: 1,000 × 18
## MSISDN los voice_rev voice_trx voice_mou voice_dou sms_rev sms_trx sms_dou
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68424 851 5994 12 5 6 3435 25 6
## 2 29726 1935 3319 8 24 3 2330 12 5
## 3 51163 51 0 0 0 0 0 0 0
## 4 27287 1665 84397 196 441 7 18885 102 7
## 5 16430 3154 33341 38 18 6 220 3 3
## 6 28634 180 4585 282 626 6 1600 7 4
## 7 19185 1421 2747 53 168 7 7390 42 7
## 8 85760 1161 13499 43 871 6 2820 12 6
## 9 17098 454 6265 20 64 4 5260 25 3
## 10 78350 253 0 0 0 0 0 0 0
## broadband_rev broadband_usg broadband_dou voice_package_rev voice_package_trx
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 2500 1
## 2 0 0 0 3600 1
## 3 0 0 0 10950 2
## 4 0 269. 7 13050 3
## 5 0 0 0 18200 10
## 6 0 0 0 13050 3
## 7 0 0 0 24500 7
## 8 0 0 0 21900 5
## 9 735 8.80 4 8700 2
## 10 0 0 0 17650 4
## voice_package_dou .pred_class .pred_Yes .pred_No
## <dbl> <fct> <dbl> <dbl>
## 1 1 Yes 0.615 0.385
## 2 1 Yes 0.566 0.434
## 3 2 Yes 0.799 0.201
## 4 3 No 0.0975 0.902
## 5 5 No 0.386 0.614
## 6 2 No 0.165 0.835
## 7 7 No 0.00174 0.998
## 8 5 No 0.0275 0.972
## 9 2 No 0.178 0.822
## 10 4 Yes 0.631 0.369
## # ℹ 990 more rows
Melihat variable importance
final_model %>%
extract_fit_engine() %>%
vip::vi_model() %>%
ggplot(aes(x =Importance,
y = reorder(Variable, Importance))) +
geom_col() +
labs(y = "Variable") +
theme_bw()
Variabel yang paling berpengaruh adalah: voice_trx, voice_mou, dan voice_package_dou. Maka akan dilakukan prediksi untuk mengetahui user yang akan churn. Pada user inilah team marketing/sales akan khusus melakukan prelakukan sebagai preventif kemungkinan churn agar tidak terjadi. Kemudian, untuk user yang diprediksi tidak churn dapat diberikan perlakuan yang berbeda pula agar user ini dapat tetap loyal memakai jasa provider.