Supervised Learning Case Study : Telco Churn Prevention

Digunakan data transaksi sebulan terakhir.

Objektif:

Karakteristik pelanggan yang akan churn dan stay
Prediksi pelanggan yang punya kecenderungan akan churn (berhenti berlangganan) paket telepon pascabayar di bulan depan.
Dari sisi bisnis, apa yang dapat dilakukan setelah mendapatkan model prediksi churn?

Data

Variable	Deskripsi	Satuan
MSISDN	Data Nomor Telepon yang telah di masking
los	Length of Stay / Lamanya menjadi customer	hari
voice_rev	Penghasilan dari jasa voice/telepon dalam sebulan terakhir	Rp
voice_trx	Banyaknya transaksi jasa voice/telepon dalam sebulan terakhir	kali
voice_mou	Lamanya waktu menggunakan layanan voice/telepon dalam sebulan terakhir	menit
voice_dou	Banyaknya hari menggunakan layanan voice/telepon dalam sebulan terakhir	hari
sms_rev	Penghasilan dari jasa sms dalam sebulan terakhir	Rp
sms_trx	Banyaknya transaksi SMS dalam sebulan terakhir	kali
sms_dou	Banyaknya menggunakan SMS dalam sebulan terakhir	hari
broadband_rev	Penghasilan dari jasa paket data(broadband) dalam sebulan terakhir	Rp
broadband_usg	penggunaan layanan broadband dalam sebulan terakhir	MB
broadband_dou	Banyaknya hari menggunakan layanan broadband dalam sebulan terakhir	hari
voice_package_rev	Penghasilan dari paket telepon dalam sebulan terakhir	Rp
vocie_package_trx	Banyaknya transaksi menggunakan paket telepon dalam sebulan terakhir	kali
voice_package_dou	Banyaknya hari menggunakan paket telepon dalam sebulan terakhir	hari
churn	menunjukkan status churn atau masih aktif (1 = churn ; positive event, 0 = aktif)

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.2

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8

## Warning: package 'dials' was built under R version 4.3.2

## Warning: package 'infer' was built under R version 4.3.2

## Warning: package 'modeldata' was built under R version 4.3.2

## Warning: package 'parsnip' was built under R version 4.3.2

## Warning: package 'rsample' was built under R version 4.3.2

## Warning: package 'tune' was built under R version 4.3.2

## Warning: package 'workflows' was built under R version 4.3.2

## Warning: package 'workflowsets' was built under R version 4.3.2

## Warning: package 'yardstick' was built under R version 4.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

library(finetune)

## Warning: package 'finetune' was built under R version 4.3.2

library(vip)

## Warning: package 'vip' was built under R version 4.3.2

## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi

setwd("D:/Kuliah/Mat/TSA Kominfo/Praktikum")
telco <- read.csv("4. telco_churn_sample.csv")
head(telco, 15)

##    MSISDN  los voice_rev voice_trx voice_mou voice_dou sms_rev sms_trx sms_dou
## 1   31441  534      5860        57       251         7    3465      16       5
## 2   31944  589         0         0         0         0       0       0       0
## 3   82538 1223      3744         6         5         2    2475      13       5
## 4   86957 1221       231        52       188         6       0       2       2
## 5   27618   60     12073        36       588         6    2100      20       5
## 6   96313  188      3829         7       261         3       0       0       0
## 7   21363  255       320         8       207         4   20430      97       7
## 8   85847  594       317         1         0         1       0       0       0
## 9   56880 3190         0         0         0         0    1505       7       4
## 10  83060  357     17930        37        67         7    2150      16       6
## 11  12681   39         0         0         0         0       0       0       0
## 12  86834 2773         0        16        51         3    1350      13       5
## 13  87358  176      5133        14        16         6    7210      37       7
## 14  76508 2650     12120        27        17         6    9095      43       6
## 15  55676  617      1149         2         1         2     900       6       2
##    broadband_rev broadband_usg broadband_dou voice_package_rev
## 1          35000    1713.20277             6             34200
## 2              0       0.00000             0              8350
## 3              0       0.00000             0              8200
## 4              0     341.69264             7              7950
## 5              0       0.00000             0             26600
## 6              0       0.00000             0              4550
## 7              0       0.00000             0             15800
## 8           9000      40.74369             7             17900
## 9              0       0.00000             0             16650
## 10             0    1573.61691             7              8700
## 11             0       0.00000             0              8200
## 12             0       0.00000             0             22050
## 13             0       0.00000             0              9000
## 14             0       0.00000             0             33300
## 15          2888     375.30317             2              3850
##    voice_package_trx voice_package_dou churn
## 1                  8                 6     0
## 2                  2                 2     1
## 3                  2                 2     0
## 4                  2                 2     0
## 5                  6                 5     1
## 6                  1                 1     0
## 7                  4                 4     0
## 8                  4                 2     0
## 9                  4                 4     0
## 10                 2                 2     0
## 11                 2                 1     1
## 12                 6                 5     0
## 13                 4                 4     0
## 14                 8                 6     0
## 15                 1                 1     1

str(telco)

## 'data.frame':    5000 obs. of  16 variables:
##  $ MSISDN           : int  31441 31944 82538 86957 27618 96313 21363 85847 56880 83060 ...
##  $ los              : int  534 589 1223 1221 60 188 255 594 3190 357 ...
##  $ voice_rev        : num  5860 0 3744 231 12073 ...
##  $ voice_trx        : int  57 0 6 52 36 7 8 1 0 37 ...
##  $ voice_mou        : int  251 0 5 188 588 261 207 0 0 67 ...
##  $ voice_dou        : int  7 0 2 6 6 3 4 1 0 7 ...
##  $ sms_rev          : int  3465 0 2475 0 2100 0 20430 0 1505 2150 ...
##  $ sms_trx          : int  16 0 13 2 20 0 97 0 7 16 ...
##  $ sms_dou          : int  5 0 5 2 5 0 7 0 4 6 ...
##  $ broadband_rev    : num  35000 0 0 0 0 0 0 9000 0 0 ...
##  $ broadband_usg    : num  1713 0 0 342 0 ...
##  $ broadband_dou    : int  6 0 0 7 0 0 0 7 0 7 ...
##  $ voice_package_rev: int  34200 8350 8200 7950 26600 4550 15800 17900 16650 8700 ...
##  $ voice_package_trx: int  8 2 2 2 6 1 4 4 4 2 ...
##  $ voice_package_dou: int  6 2 2 2 5 1 4 2 4 2 ...
##  $ churn            : int  0 1 0 0 1 0 0 0 0 0 ...

1 Eksplorasi Data

# library(dplyr)
glimpse(telco)

## Rows: 5,000
## Columns: 16
## $ MSISDN            <int> 31441, 31944, 82538, 86957, 27618, 96313, 21363, 858…
## $ los               <int> 534, 589, 1223, 1221, 60, 188, 255, 594, 3190, 357, …
## $ voice_rev         <dbl> 5860.00, 0.00, 3744.00, 231.00, 12073.00, 3829.00, 3…
## $ voice_trx         <int> 57, 0, 6, 52, 36, 7, 8, 1, 0, 37, 0, 16, 14, 27, 2, …
## $ voice_mou         <int> 251, 0, 5, 188, 588, 261, 207, 0, 0, 67, 0, 51, 16, …
## $ voice_dou         <int> 7, 0, 2, 6, 6, 3, 4, 1, 0, 7, 0, 3, 6, 6, 2, 2, 7, 5…
## $ sms_rev           <int> 3465, 0, 2475, 0, 2100, 0, 20430, 0, 1505, 2150, 0, …
## $ sms_trx           <int> 16, 0, 13, 2, 20, 0, 97, 0, 7, 16, 0, 13, 37, 43, 6,…
## $ sms_dou           <int> 5, 0, 5, 2, 5, 0, 7, 0, 4, 6, 0, 5, 7, 6, 2, 0, 7, 6…
## $ broadband_rev     <dbl> 35000, 0, 0, 0, 0, 0, 0, 9000, 0, 0, 0, 0, 0, 0, 288…
## $ broadband_usg     <dbl> 1.713203e+03, 0.000000e+00, 0.000000e+00, 3.416926e+…
## $ broadband_dou     <int> 6, 0, 0, 7, 0, 0, 0, 7, 0, 7, 0, 0, 0, 0, 2, 0, 0, 1…
## $ voice_package_rev <int> 34200, 8350, 8200, 7950, 26600, 4550, 15800, 17900, …
## $ voice_package_trx <int> 8, 2, 2, 2, 6, 1, 4, 4, 4, 2, 2, 6, 4, 8, 1, 2, 4, 1…
## $ voice_package_dou <int> 6, 2, 2, 2, 5, 1, 4, 2, 4, 2, 1, 5, 4, 6, 1, 2, 4, 1…
## $ churn             <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1…

summary(telco)

##      MSISDN           los         voice_rev        voice_trx     
##  Min.   :   23   Min.   :  19   Min.   :     0   Min.   :  0.00  
##  1st Qu.:24788   1st Qu.: 341   1st Qu.:   435   1st Qu.:  4.00  
##  Median :49573   Median : 956   Median :  3664   Median : 15.00  
##  Mean   :49784   Mean   :1224   Mean   :  7962   Mean   : 28.57  
##  3rd Qu.:75005   3rd Qu.:2097   3rd Qu.:  9955   3rd Qu.: 37.00  
##  Max.   :99992   Max.   :3220   Max.   :397101   Max.   :507.00  
##    voice_mou        voice_dou        sms_rev         sms_trx       
##  Min.   :   0.0   Min.   :0.000   Min.   :    0   Min.   :   0.00  
##  1st Qu.:   4.0   1st Qu.:2.000   1st Qu.:  415   1st Qu.:   3.00  
##  Median :  44.0   Median :4.000   Median : 1980   Median :  12.00  
##  Mean   : 117.8   Mean   :3.941   Mean   : 4551   Mean   :  35.26  
##  3rd Qu.: 137.0   3rd Qu.:6.000   3rd Qu.: 5746   3rd Qu.:  34.00  
##  Max.   :3477.0   Max.   :7.000   Max.   :72430   Max.   :1449.00  
##     sms_dou      broadband_rev    broadband_usg      broadband_dou  
##  Min.   :0.000   Min.   :     0   Min.   :   0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:     0   1st Qu.:   0.000   1st Qu.:0.000  
##  Median :4.000   Median :     0   Median :   0.000   Median :0.000  
##  Mean   :3.915   Mean   :  3952   Mean   : 104.874   Mean   :1.207  
##  3rd Qu.:6.000   3rd Qu.:     0   3rd Qu.:   0.012   3rd Qu.:1.000  
##  Max.   :7.000   Max.   :424720   Max.   :9882.606   Max.   :7.000  
##  voice_package_rev voice_package_trx voice_package_dou     churn      
##  Min.   :    0     Min.   : 1.000    Min.   :1.000     Min.   :0.000  
##  1st Qu.: 4350     1st Qu.: 1.000    1st Qu.:1.000     1st Qu.:0.000  
##  Median : 8200     Median : 2.000    Median :2.000     Median :0.000  
##  Mean   :12065     Mean   : 3.201    Mean   :2.625     Mean   :0.337  
##  3rd Qu.:16475     3rd Qu.: 4.000    3rd Qu.:4.000     3rd Qu.:1.000  
##  Max.   :88000     Max.   :24.000    Max.   :7.000     Max.   :1.000

Proporsi Churn

library(ggplot2)

ggplot(telco, aes(x = factor(churn))) + 
  geom_bar(aes(y = ..count.., fill = factor(churn)), position = "dodge") +
  geom_text(stat = "count", aes(label = scales::percent(..count../sum(..count..)),
            y = ..count.., group = factor(churn)),
            position = position_dodge(width = 0.9),
            vjust = -0.5) +  # the position of the text
  scale_y_continuous(breaks = seq(0, 4000, by = 1000)) +
  labs(title = "Proportion",x = "Churn", y = "Jumlah Nasabah") +
  scale_fill_manual(values = c("skyblue", "pink"),
                    name = "Churn", 
                    labels = c("No", "Yes")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_bw()

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Proporsi sangat berpengaruh pada klasifikasi biner. Dari plot di atas didapat bahwa proporsi antara churn dengan tidak adalah 66:34. Proporsi pada data ini cukup baik untuk digunakan dalam pemodelan. (Usahakan proporsi imbalance data salah satunya mecapai 20%).

1.1 Variabel Los

library(ggplot2)
telco %>%
  ggplot(aes(x = los)) + 
  geom_histogram(
    binwidth = 100,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of Length of Stay (LOS)",
       x = "length of stay",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = los, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of Length of Stay (LOS) by Churn",
       y = "Churn",
       x = "Length of Stay (LOS)") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = los, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of Length of Stay (LOS) by Churn",
       x = "Length of Stay (LOS)",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran los, didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel los tidak terlalu berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.2 Variabel Voice Renevue

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_rev), y = ..count..) + 
  geom_histogram(
    binwidth = 10000,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of Voice Renevue",
       x = "Voice Renevue",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_rev, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of Voice Renevue by Churn",
       y = "Churn",
       x = "Voice Renevue") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_rev, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of Voice Renevue by Churn",
       x = "Voice Renevue",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran voice_rev didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_rev tidak terlalu berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.3 Variabel Voice Transaction

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_trx), y = ..count..) + 
  geom_histogram(
    binwidth = 20,
    fill = 'skyblue',
    color = "white",
    alpha = 0.9) +
  labs(title = "Histogram of Voice Transaction",
       x = "Voice Transaction",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_trx, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of Voice Transaction by Churn",
       y = "Churn",
       x = "Voice Transaction") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_rev, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of Voice Transaction by Churn",
       x = "Voice Transaction",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran voice_trx didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_trx berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.4 Variabel Voice Mou

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_mou), y = ..count..) + 
  geom_histogram(
    binwidth = 50,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(tile = "Histogram of Voice Mou",
       x = "Voice Mou",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_mou, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of Voice Mou by Churn",
       y = "Churn",
       x = "Voice Mou") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_mou, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of Voice Mou by Churn",
       x = "Voice Mou",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran voice_mou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_mou berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.5 Variabel Voice Dou

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_dou), y = ..count..) + 
  geom_histogram(
    binwidth = 1,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(tile = "Histogram of Voice Dou",
       x = "Voice Dou",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_dou, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of Voice Dou by Churn",
       y = "Churn",
       x = "Voice Dou") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_dou, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of Voice Dou by Churn",
       x = "Voice Dou",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

telco %>%
  ggplot(aes(x = voice_dou, fill = factor(churn))) + 
  geom_histogram(position = position_dodge(width = 0.5),
                 binwidth = 1,
                 alpha = 1) + 
  labs(title = "Histogram of Voice Dou by Churn",
       x = "Voice Dou",
       y = "Frequency")+
  scale_fill_manual(values = c("0" = "skyblue",
                               "1" = "pink"),
                    name = "Churn",
                    labels = c("No", "Yes")) +
  theme_minimal() +
  theme(legend.position = "top")

Dari grafik boxplot dan density pada sebaran voice_dou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_dou berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.6 Variabel sms_rev

library(ggplot2)
telco %>%
  ggplot(aes(x = sms_rev)) + 
  geom_histogram(
    binwidth = 2000,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of sms_rev",
       x = "sms_rev",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = sms_rev, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of sms_rev by Churn",
       y = "Churn",
       x = "sms_rev") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = sms_rev, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of sms_rev by Churn",
       x = "sms_rev",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran sms_rev, didapati bahwa terdapat sedikit perbedaan yang signifikan antara churn dengan tidak. Artinya variabel sms_rev cukup berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.7 Variabel sms_trx

library(ggplot2)
telco %>%
  ggplot(aes(x = sms_trx), y = ..count..) + 
  geom_histogram(
    binwidth = 50,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of sms_trx",
       x = "sms_trx",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = sms_trx, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of sms_trx by Churn",
       y = "Churn",
       x = "sms_trx") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = sms_trx, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of sms_trx by Churn",
       x = "sms_trx",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran sms_trx didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel sms_trx tidak terlalu berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.8 Variabel sms_dou

library(ggplot2)
telco %>%
  ggplot(aes(x = sms_dou), y = ..count..) + 
  geom_histogram(
    binwidth = 1,
    fill = 'skyblue',
    color = "white",
    alpha = 0.9) +
  labs(title = "Histogram of sms_dou",
       x = "sms_dou",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = sms_dou, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of sms_dou by Churn",
       y = "Churn",
       x = "sms_dou") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = sms_dou, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of sms_dou by Churn",
       x = "sms_dou",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran sms_dou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel sms_dou berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.9 Variabel broadband_rev

library(ggplot2)
telco %>%
  ggplot(aes(x = broadband_rev), y = ..count..) + 
  geom_histogram(
    binwidth = 10000,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of broadband_rev",
       x = "broadband_rev",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = broadband_rev, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of broadband_rev by Churn",
       y = "Churn",
       x = "broadband_rev") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = broadband_rev, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of broadband_rev by Churn",
       x = "broadband_rev",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran broadband_rev didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel broadband_rev tidak berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.10 Variabel broadband_usg

library(ggplot2)
telco %>%
  ggplot(aes(x = broadband_usg), y = ..count..) + 
  geom_histogram(
    binwidth = 200,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of broadband_usg",
       x = "broadband_usg",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = broadband_usg, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of broadband_usg by Churn",
       y = "Churn",
       x = "broadband_usg") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = broadband_usg, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of broadband_usg by Churn",
       x = "broadband_usg",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran broadband_usg didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel broadband_usg cenderung tidak berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.11 Variabel broadband_dou

library(ggplot2)
telco %>%
  ggplot(aes(x = broadband_dou), y = ..count..) + 
  geom_histogram(
    binwidth = 1,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of broadband_dou",
       x = "broadband_dou",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = broadband_dou, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of broadband_dou by Churn",
       y = "Churn",
       x = "broadband_dou") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = broadband_dou, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of broadband_dou by Churn",
       x = "broadband_dou",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran broadband_dou didapati bahwa tidak terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel broadband_dou tidak berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.12 Variabel voice_package_rev

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_package_rev), y = ..count..) + 
  geom_histogram(
    binwidth = 2000,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of voice_package_rev",
       x = "voice_package_rev",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_package_rev, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of voice_package_rev by Churn",
       y = "Churn",
       x = "voice_package_rev") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_package_rev, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of voice_package_rev by Churn",
       x = "voice_package_rev",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran voice_package_rev didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_package_rev cenderung berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.13 Variabel voice_package_trx

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_package_trx), y = ..count..) + 
  geom_histogram(
    binwidth = 1,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of voice_package_trx",
       x = "voice_package_trx",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_package_trx, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of voice_package_trx by Churn",
       y = "Churn",
       x = "voice_package_trx") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_package_trx, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of voice_package_trx by Churn",
       x = "voice_package_trx",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran voice_package_trx didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_package_trx berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

1.14 Variabel voice_package_dou

library(ggplot2)
telco %>%
  ggplot(aes(x = voice_package_dou), y = ..count..) + 
  geom_histogram(
    binwidth = 1,
    fill = 'skyblue',
    color = "white",
    alpha = 0.7) +
  labs(title = "Histogram of voice_package_dou",
       x = "voice_package_dou",
       y = "frequency") +
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_package_dou, y = factor(churn), fill = factor(churn))) +
  geom_boxplot() +
  labs(title = "Boxplot of voice_package_dou by Churn",
       y = "Churn",
       x = "voice_package_dou") +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_y_discrete(labels = c("0" = "No", "1" = "Yes")) +
  guides(fill = FALSE) + 
  theme_bw()

library(ggplot2)

# Your ggplot code here
ggplot(telco, aes(x = voice_package_dou, fill = factor(churn))) +
  geom_density(alpha = 0.8) +
  labs(title = "Density Plot of voice_package_dou by Churn",
       x = "voice_package_dou",
       y = "Density") +
  scale_fill_manual(values = c("skyblue", "pink"), 
                    name = "Churn",
                    labels = c("1" = "Yes", "0" = "No")) +
  theme_minimal()

Dari grafik boxplot dan density pada sebaran voice_package_dou didapati bahwa terdapat perbedaan yang signifikan antara churn dengan tidak. Artinya variabel voice_package_dou cenderung berpengaruh untuk membedakan nasabah yang akan churn dengan tidak.

Multikolinearitas

library(corrplot)

## corrplot 0.92 loaded

corrplot(cor(telco[,2:15]), method = "square", cl.cex = 0.8 )

Terdapat beberapa variabel yang multikolinearitas. Masalah multikolinearitas ini harus diatasi.

2 Splitting

set.seed(100)
telco_split <- telco %>% 
  select(-MSISDN) %>%
  mutate(churn = factor(churn,
                        levels = c(1, 0),
                        label = c("Yes", "No"))) %>%
  initial_split(prop = 0.8, strata = churn)
telco_split

## <Training/Testing/Total>
## <4000/1000/5000>

# Data Traiing 
telco_train <- telco_split %>% training()

# Data Testing
telco_test <- telco_split %>% testing()

# Data k-fold cross validation, k = v = 5
cv <- telco_train %>%
  vfold_cv(v = 5, strata = churn)
cv

## #  5-fold cross-validation using stratification 
## # A tibble: 5 × 2
##   splits             id   
##   <list>             <chr>
## 1 <split [3199/801]> Fold1
## 2 <split [3199/801]> Fold2
## 3 <split [3200/800]> Fold3
## 4 <split [3201/799]> Fold4
## 5 <split [3201/799]> Fold5

3 Simple Model

Membuat model regrsi logistik tanpa validasi silang

reglog <- glm(churn ~.,data = telco_train, family = "binomial")
reglog %>%
  summary()

## 
## Call:
## glm(formula = churn ~ ., family = "binomial", data = telco_train)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.549e+00  1.025e-01 -15.106  < 2e-16 ***
## los                8.648e-05  4.254e-05   2.033   0.0421 *  
## voice_rev         -2.318e-05  3.522e-06  -6.581 4.68e-11 ***
## voice_trx          2.591e-02  3.845e-03   6.738 1.60e-11 ***
## voice_mou          2.022e-03  4.938e-04   4.095 4.22e-05 ***
## voice_dou          1.209e-01  2.687e-02   4.498 6.85e-06 ***
## sms_rev           -9.683e-06  9.545e-06  -1.014   0.3104    
## sms_trx           -1.887e-04  7.038e-04  -0.268   0.7886    
## sms_dou            1.054e-01  2.335e-02   4.514 6.37e-06 ***
## broadband_rev      9.649e-07  2.971e-06   0.325   0.7454    
## broadband_usg      2.775e-05  1.051e-04   0.264   0.7918    
## broadband_dou     -2.022e-02  2.007e-02  -1.007   0.3138    
## voice_package_rev -9.735e-06  1.358e-05  -0.717   0.4734    
## voice_package_trx -5.111e-02  5.718e-02  -0.894   0.3714    
## voice_package_dou  5.228e-01  8.143e-02   6.421 1.36e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5112.2  on 3999  degrees of freedom
## Residual deviance: 3926.0  on 3985  degrees of freedom
## AIC: 3956
## 
## Number of Fisher Scoring iterations: 6

4 Recipe and Data Preproccessing

Recipe tanpa pra-process apapun

telco_recipe <- telco_train %>%
  ## menuliskan semua variabel dalam formula
  # recipe(formula = churn~ los +
  #          voice_rev + voice_trx + voice_mou + voice_dou +
  #          sms_rev + sms_trx + sms_dou +
  #          broadband_rev + broadband_usg + broadband_dou + 
  #          voice_package_rev + voice_package_trx + voice_package_dou)
  ## Tanda . sebagai pengganti untuk menyebutkan semua variabel
  recipe(formula = churn ~.)
telco_recipe

##

## ── Recipe ──────────────────────────────────────────────────────────────────────

##

## ── Inputs

## Number of variables by role

## outcome:    1
## predictor: 14

Recipe dengan normalisasi semua variabel numerik prediktornya. Kita akan menyimpan nilai rataan dan standar deviasi disimpan di model. Sehingga ketika melakukan prediksi, normalisasinya tidak menggunakan statistik rataan dan standar deviasinya dari data yang baru melainkan menggunakan nilai dari data training

telco_rec_normalize <- telco_train %>%
  recipe(formula = churn ~.) %>%
  step_normalize(all_numeric_predictors())

5 Model Spesification

Masing-masing model mempunyai hyperparameter yang bisa diganti-ganti agar dapat menghasilkan model yang terbaik (hyperparameter tunning). Untuk melakukan hyperparameter tunning gunakan tune() untuk parameter yang diinginkan.

5.0.1 Logistic Regression

Dengan package glmnet

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.3.2

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-8

logreg <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

5.0.2 Decision Tree

Dengan package rpart

library(rpart)

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:dials':
## 
##     prune

dtree <- decision_tree(cost_complexity = tune(),
                       tree_depth = tune(),
                       min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("rpart")

5.0.3 Random Forest

Dengan package ranger dan perhitungan importance. Variabel dengan metode permutasi.

library(ranger)

## Warning: package 'ranger' was built under R version 4.3.2

rf <- rand_forest(mtry =tune(),
                  min_n = tune(),
                  trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation")

6 Workflow

List semua spesifikasi model

models <- list(
  logreg = logreg, 
  dtree = dtree,
  rand_forest = rf
)

Set workflow recipe and model

telco_set <- workflow_set(preproc = list(simple = telco_recipe,
                                         normalize = telco_rec_normalize),
                          models = models, cross = TRUE)

Grid control untuk menyimpan hasil prediksi dan workflow saat proses membuat model

grid_ctrl <- control_race(
  save_pred = TRUE,
  parallel_over = "everything",
  save_workflow = TRUE
)

Lakukan pemodelan dan cross validation.

Perintah ini membutuhkan waktu yang cukup lama dan komputasi yang berat

telco_models <- telco_set %>% 
  workflow_map(fn = "tune_race_anova",
               resamples = cv,
               grid = 15,
               control = grid_ctrl,
               verbose = TRUE,
               seed = 1001)

## i 1 of 6 tuning:     simple_logreg

## ✔ 1 of 6 tuning:     simple_logreg (18.4s)

## i 2 of 6 tuning:     simple_dtree

## ✔ 2 of 6 tuning:     simple_dtree (12.4s)

## i 3 of 6 tuning:     simple_rand_forest

## ✔ 3 of 6 tuning:     simple_rand_forest (7m 29.7s)

## i 4 of 6 tuning:     normalize_logreg

## ✔ 4 of 6 tuning:     normalize_logreg (19.5s)

## i 5 of 6 tuning:     normalize_dtree

## ✔ 5 of 6 tuning:     normalize_dtree (18.4s)

## i 6 of 6 tuning:     normalize_rand_forest

## ✔ 6 of 6 tuning:     normalize_rand_forest (7m 1.2s)

Tampilkan list workflow berdasarkan AUC

telco_models %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  select(wflow_id, model, mean) %>%
  arrange(desc(mean)) %>%
  print(width = Inf)

## # A tibble: 46 × 3
##    wflow_id              model         mean
##    <chr>                 <chr>        <dbl>
##  1 simple_rand_forest    rand_forest  0.817
##  2 normalize_rand_forest rand_forest  0.817
##  3 simple_rand_forest    rand_forest  0.816
##  4 normalize_rand_forest rand_forest  0.816
##  5 normalize_rand_forest rand_forest  0.816
##  6 simple_rand_forest    rand_forest  0.816
##  7 normalize_rand_forest rand_forest  0.813
##  8 simple_rand_forest    rand_forest  0.813
##  9 simple_logreg         logistic_reg 0.806
## 10 normalize_logreg      logistic_reg 0.806
## # ℹ 36 more rows

Dari data ROC_AUC di atas diperoleh model yang paling bagus adalah model dengan metode random_forest.

Plot perbandingan model dan pra-prosesnya berdasarkan AUC

telco_models %>%
  autoplot(rank_metric = "roc_auc",
           metric = "roc_auc",
           select_best = TRUE) +
  theme_bw()

Plot perbandingan ROC semua model dan pra-prosesnya

telco_models %>%
  collect_predictions() %>%
  group_by(wflow_id) %>%
  roc_curve(churn, .pred_Yes) %>%
  autoplot()

Model yang paling bagus masih simple dan normalize random forest.

Mengambil parameter hasil tunning terbaik

best_param <- telco_models %>%
  extract_workflow_set_result("simple_rand_forest") %>%
  select_best(metric = "roc_auc")
best_param

## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     2    40 Preprocessor1_Model10

Fit terakhir dengan parameter terbaik sehingga mendapatkan workflow terbaik

best_telco <- telco_models %>%
  extract_workflow("simple_rand_forest") %>%
  finalize_workflow(best_param) %>%
  last_fit(split = telco_split)

Menampilkan metric dari model dengan parameter terbaik

best_telco %>% 
  collect_metrics()

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.779 Preprocessor1_Model1
## 2 roc_auc  binary         0.830 Preprocessor1_Model1

Nilai nya cukup bagus, karna nilai AUC sudah diatas 80%.

Menyimpan hasil prediksi dari data testing

pred_result <- best_telco %>%
  collect_predictions()
pred_result

## # A tibble: 1,000 × 7
##    id               .pred_Yes .pred_No  .row .pred_class churn .config          
##    <chr>                <dbl>    <dbl> <int> <fct>       <fct> <chr>            
##  1 train/test split    0.0837    0.916    12 No          No    Preprocessor1_Mo…
##  2 train/test split    0.510     0.490    20 Yes         Yes   Preprocessor1_Mo…
##  3 train/test split    0.763     0.237    32 Yes         Yes   Preprocessor1_Mo…
##  4 train/test split    0.388     0.612    37 No          No    Preprocessor1_Mo…
##  5 train/test split    0.472     0.528    40 No          No    Preprocessor1_Mo…
##  6 train/test split    0.163     0.837    50 No          No    Preprocessor1_Mo…
##  7 train/test split    0.404     0.596    51 No          Yes   Preprocessor1_Mo…
##  8 train/test split    0.249     0.751    54 No          No    Preprocessor1_Mo…
##  9 train/test split    0.606     0.394    56 Yes         No    Preprocessor1_Mo…
## 10 train/test split    0.382     0.618    59 No          Yes   Preprocessor1_Mo…
## # ℹ 990 more rows

Menghitung metric klasifikasi.

Untuk menghitung accuracy, sensitivity, specificity, recall, precision dan f_score menggunakan nilai aktual dan prediksi kelas. Sedangkan untuk menghitung AUC menggunakan nilai aktual dan nilai pediksi peluangnya.

bind_rows(
  pred_result %>%
    accuracy(truth = churn, estimate = .pred_class),
  pred_result %>% 
    sensitivity(truth = churn, estimate = .pred_class),
  pred_result %>%
    recall(truth = churn, estimate = .pred_class),
  pred_result %>%
    specificity(truth = churn, estimate = .pred_class),
  pred_result %>%
    precision(truth = churn, estimate = .pred_class),
  pred_result %>%
    f_meas(truth = churn, estimate = .pred_class),
  pred_result %>%
    roc_auc(churn, .pred_Yes)
    
)

## # A tibble: 7 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 accuracy    binary         0.779
## 2 sensitivity binary         0.593
## 3 recall      binary         0.593
## 4 specificity binary         0.873
## 5 precision   binary         0.704
## 6 f_meas      binary         0.644
## 7 roc_auc     binary         0.830

Kurva ROC dari model terbaik

pred_result %>%
  roc_curve(truth = churn, .pred_Yes) %>%
  autoplot()

Workflow terbaik yang nantinya digunakan untuk prediksi sebagai model akhir.

7 Final Model

final_model <- best_telco %>%
  extract_workflow()
final_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), num.trees = ~1000, min.node.size = min_rows(~40L, x),      importance = ~"permutation", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  1000 
## Sample size:                      4000 
## Number of independent variables:  14 
## Mtry:                             2 
## Target node size:                 40 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.162052

7.1 Data Prediksi

Akan digunakan data baru untuk diprediksi.

telco_to_pred <- read_csv("4. telco_churn_to_pred.csv")

## Rows: 1000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): MSISDN, los, voice_rev, voice_trx, voice_mou, voice_dou, sms_rev, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Melakukan prediksi dan menggabungkan hasil prediksi dengan data asli. Haisl prediksi berupa kelas dan probability.

telco_to_pred %>%
  bind_cols(
    final_model %>% 
      predict(new_data = telco_to_pred),
    final_model %>%
      predict(new_data = telco_to_pred, type = "prob"))%>%
  print(width = Inf)

## # A tibble: 1,000 × 18
##    MSISDN   los voice_rev voice_trx voice_mou voice_dou sms_rev sms_trx sms_dou
##     <dbl> <dbl>     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
##  1  68424   851      5994        12         5         6    3435      25       6
##  2  29726  1935      3319         8        24         3    2330      12       5
##  3  51163    51         0         0         0         0       0       0       0
##  4  27287  1665     84397       196       441         7   18885     102       7
##  5  16430  3154     33341        38        18         6     220       3       3
##  6  28634   180      4585       282       626         6    1600       7       4
##  7  19185  1421      2747        53       168         7    7390      42       7
##  8  85760  1161     13499        43       871         6    2820      12       6
##  9  17098   454      6265        20        64         4    5260      25       3
## 10  78350   253         0         0         0         0       0       0       0
##    broadband_rev broadband_usg broadband_dou voice_package_rev voice_package_trx
##            <dbl>         <dbl>         <dbl>             <dbl>             <dbl>
##  1             0          0                0              2500                 1
##  2             0          0                0              3600                 1
##  3             0          0                0             10950                 2
##  4             0        269.               7             13050                 3
##  5             0          0                0             18200                10
##  6             0          0                0             13050                 3
##  7             0          0                0             24500                 7
##  8             0          0                0             21900                 5
##  9           735          8.80             4              8700                 2
## 10             0          0                0             17650                 4
##    voice_package_dou .pred_class .pred_Yes .pred_No
##                <dbl> <fct>           <dbl>    <dbl>
##  1                 1 Yes           0.615      0.385
##  2                 1 Yes           0.566      0.434
##  3                 2 Yes           0.799      0.201
##  4                 3 No            0.0975     0.902
##  5                 5 No            0.386      0.614
##  6                 2 No            0.165      0.835
##  7                 7 No            0.00174    0.998
##  8                 5 No            0.0275     0.972
##  9                 2 No            0.178      0.822
## 10                 4 Yes           0.631      0.369
## # ℹ 990 more rows

7.2 Variable Importance

Melihat variable importance

final_model %>%
  extract_fit_engine() %>%
  vip::vi_model() %>%
  ggplot(aes(x =Importance,
             y = reorder(Variable, Importance))) +
  geom_col() +
  labs(y = "Variable") +
  theme_bw()

Variabel yang paling berpengaruh adalah: voice_trx, voice_mou, dan voice_package_dou. Maka akan dilakukan prediksi untuk mengetahui user yang akan churn. Pada user inilah team marketing/sales akan khusus melakukan prelakukan sebagai preventif kemungkinan churn agar tidak terjadi. Kemudian, untuk user yang diprediksi tidak churn dapat diberikan perlakuan yang berbeda pula agar user ini dapat tetap loyal memakai jasa provider.

Supervised Learning: Churn Detection

Carolin

2023-09-23