1 Dataset

CustomerID: Unique ID for each customer in the dataset.
Age: Customer age in years.
Gender: Customer gender (“Male” or “Female”).
Tenure: How long the customer has been subscribed, in months.
Usage Frequency: How often the customer used the service within a given period.
Support Calls: Number of support calls made by the customer.
Payment Delay: How late the customer's bill payment is, in days.
Subscription Type: Subscription tier (“Basic”, “Standard”, or “Premium”).
Contract Length: Contract duration (“Monthly”, “Quarterly”, or “Annual”).
Total Spend: Total amount spent by the customer over a given period.
Last Interaction: Time since the customer's last interaction with the company, recorded as an integer from 1 to 30 (presumably days).
Churn: Label indicating whether the customer churned (1) or not (0).

library(dplyr)
library(e1071) # install the e1071 package first if it is not available
library(caret)
library(ROCR)
library(ggplot2)
library(tidyr)

2 Data Preparation

churn_train <- read.csv("./data_input/customer_churn_dataset-training-master.csv")
churn_test <- read.csv("./data_input/customer_churn_dataset-testing-master.csv")

# Combine the training and test data
costumer_churn <- rbind(churn_train, churn_test)
head(costumer_churn)

#>   CustomerID Age Gender Tenure Usage.Frequency Support.Calls Payment.Delay
#> 1          2  30 Female     39              14             5            18
#> 2          3  65 Female     49               1            10             8
#> 3          4  55 Female     14               4             6            18
#> 4          5  58   Male     38              21             7             7
#> 5          6  23   Male     32              20             5             8
#> 6          8  51   Male     33              25             9            26
#>   Subscription.Type Contract.Length Total.Spend Last.Interaction Churn
#> 1          Standard          Annual         932               17     1
#> 2             Basic         Monthly         557                6     1
#> 3             Basic       Quarterly         185                3     1
#> 4          Standard         Monthly         396               29     1
#> 5             Basic         Monthly         617               20     1
#> 6           Premium          Annual         129                8     1

2.1 Check Data Structure

glimpse(costumer_churn)

#> Rows: 505,207
#> Columns: 12
#> $ CustomerID        <int> 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Age               <int> 30, 65, 55, 58, 23, 51, 58, 55, 39, 64, 29, 52, 22, …
#> $ Gender            <chr> "Female", "Female", "Female", "Male", "Male", "Male"…
#> $ Tenure            <int> 39, 49, 14, 38, 32, 33, 49, 37, 12, 3, 18, 21, 41, 3…
#> $ Usage.Frequency   <int> 14, 1, 4, 21, 20, 25, 12, 8, 5, 25, 9, 6, 17, 25, 9,…
#> $ Support.Calls     <int> 5, 10, 6, 7, 5, 9, 3, 4, 7, 2, 0, 3, 10, 1, 4, 2, 7,…
#> $ Payment.Delay     <int> 18, 8, 18, 7, 8, 26, 16, 15, 4, 11, 30, 26, 25, 13, …
#> $ Subscription.Type <chr> "Standard", "Basic", "Basic", "Standard", "Basic", "…
#> $ Contract.Length   <chr> "Annual", "Monthly", "Quarterly", "Monthly", "Monthl…
#> $ Total.Spend       <dbl> 932, 557, 185, 396, 617, 129, 821, 445, 969, 415, 93…
#> $ Last.Interaction  <int> 17, 6, 3, 29, 20, 8, 24, 30, 13, 29, 18, 19, 23, 17,…
#> $ Churn             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Insight: Several columns have an unsuitable type. Gender, Subscription.Type, and Contract.Length are stored as character and Churn as integer; all four should be factors.

2.2 Converting Data Types

costumer_churn <- costumer_churn %>%
  mutate(Gender = as.factor(Gender),
         Subscription.Type = as.factor(Subscription.Type),
         Contract.Length = as.factor(Contract.Length),
         Churn = as.factor(Churn))
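As a variation on the conversion above, the character columns can be converted in one pass instead of being listed one by one. This is only a sketch on a made-up `toy` data frame (the values are not from the churn dataset), and it assumes dplyr 1.0 or later for `across()` and `where()`.

```r
library(dplyr)

# Toy data standing in for the real churn table (values are invented)
toy <- data.frame(
  Gender = c("Female", "Male", "Male"),
  Subscription.Type = c("Basic", "Premium", "Standard"),
  Age = c(30L, 45L, 52L),
  Churn = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

toy_converted <- toy %>%
  mutate(across(where(is.character), as.factor),  # every text column
         Churn = as.factor(Churn))                # integer label, converted explicitly

str(toy_converted)
```

The explicit column list used in the report is easier to audit; the `across()` form mainly saves typing when there are many text columns.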

2.3 Removing Unused Columns and Missing Values

# Drop the 'CustomerID' column
costumer_churn <- costumer_churn %>%
  select(-CustomerID)

# Drop rows that contain NA (missing values)
costumer_churn <- costumer_churn %>%
  na.omit()
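The summary in section 3.1 lists an empty level with a count of 0 for Gender, Subscription.Type, and Contract.Length. A likely explanation (an assumption, not confirmed by the source) is that a row with blank text fields was removed by `na.omit()` because its numeric fields were NA, while the blank factor level stayed behind. The sketch below reproduces that situation on an invented `toy` data frame and removes the leftover level with `droplevels()`.

```r
# Toy data: one row has a blank Gender and a missing Age (values are invented)
toy <- data.frame(
  Gender = factor(c("Female", "Male", "", "Male")),
  Age = c(30, 45, NA, 52)
)

colSums(is.na(toy))        # missing values per column

toy_clean <- na.omit(toy)  # drops the row with NA in Age
levels(toy_clean$Gender)   # the "" level is still listed

toy_clean <- droplevels(toy_clean)  # remove factor levels that no longer occur
levels(toy_clean$Gender)
```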

3 Exploratory Data Analysis

3.1 Descriptive Statistics

# Summary statistics for the costumer_churn data frame
summary(costumer_churn)

#>       Age          Gender           Tenure      Usage.Frequency
#>  Min.   :18.0         :     0   Min.   : 1.00   Min.   : 1.00  
#>  1st Qu.:29.0   Female:224933   1st Qu.:16.00   1st Qu.: 8.00  
#>  Median :40.0   Male  :280273   Median :32.00   Median :16.00  
#>  Mean   :39.7                   Mean   :31.35   Mean   :15.71  
#>  3rd Qu.:49.0                   3rd Qu.:46.00   3rd Qu.:23.00  
#>  Max.   :65.0                   Max.   :60.00   Max.   :30.00  
#>  Support.Calls    Payment.Delay  Subscription.Type  Contract.Length  
#>  Min.   : 0.000   Min.   : 0.0           :     0            :     0  
#>  1st Qu.: 1.000   1st Qu.: 6.0   Basic   :164477   Annual   :198608  
#>  Median : 3.000   Median :13.0   Premium :170099   Monthly  :109234  
#>  Mean   : 3.833   Mean   :13.5   Standard:170630   Quarterly:197364  
#>  3rd Qu.: 6.000   3rd Qu.:20.0                                       
#>  Max.   :10.000   Max.   :30.0                                       
#>   Total.Spend     Last.Interaction Churn     
#>  Min.   : 100.0   Min.   : 1.00    0:224714  
#>  1st Qu.: 446.0   1st Qu.: 7.00    1:280492  
#>  Median : 648.9   Median :14.00              
#>  Mean   : 620.1   Mean   :14.61              
#>  3rd Qu.: 824.0   3rd Qu.:22.00              
#>  Max.   :1000.0   Max.   :30.00

Insight
* The average customer age is about 39.7 years; the youngest customer is 18 and the oldest is 65.
* Tenure ranges from 1 month to 60 months.
* The longest payment delay is 30 days.
* Average total spend per customer is 620.1.

# Bar plot of the churn vs. not-churn distribution
ggplot(data = costumer_churn, aes(x = Churn, fill = Churn)) +
  geom_bar() +
  # Names in `values` must match the factor levels "0" and "1"
  scale_fill_manual(values = c("0" = "deepskyblue", "1" = "salmon"),
                    labels = c("Not Churn", "Churn")) +
  xlab("Churn") +
  ylab("Count") +
  labs(title = "Distribution of Churn") +
  scale_x_discrete(labels = c("Not Churn", "Churn"))

# Count churned and non-churned customers for each subscription type
subscription_churn_counts <- costumer_churn %>%
  group_by(Subscription.Type, Churn) %>%
  summarize(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Churn, values_from = count, values_fill = 0)

# Stacked bar plot comparing churn and not churn per subscription type
ggplot(data = costumer_churn, aes(x = Subscription.Type, fill = factor(Churn))) +
  geom_bar(position = "stack", color = "black") +
  scale_fill_manual(values = c("deepskyblue", "salmon"), labels = c("Not Churn", "Churn")) +
  xlab("Subscription Type") +
  ylab("Count") +
  labs(title = "Comparison of Churn vs. Subscription Type") +
  theme_minimal() +
  theme(legend.position = "top")

Insight (Comparison of Churn vs. Subscription Type): The plot shows the same pattern for every subscription type: in all three tiers, churned customers outnumber those who stayed.
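Stacked counts make it hard to compare tiers of different sizes, so the same comparison can be made on churn rates. The sketch below computes the share of churned customers per tier on an invented `toy` data frame; the column names mirror the report, but the values are made up.

```r
library(dplyr)

# Toy data with the same column names as the report (values are invented)
toy <- data.frame(
  Subscription.Type = c("Basic", "Basic", "Basic",
                        "Premium", "Premium",
                        "Standard", "Standard", "Standard"),
  Churn = factor(c(1, 1, 0, 1, 0, 1, 0, 0))
)

churn_rate <- toy %>%
  group_by(Subscription.Type) %>%
  summarise(customers = n(),
            churn_rate = mean(Churn == "1"),  # share of rows labelled 1
            .groups = "drop")

churn_rate
```

In a plot, `geom_bar(position = "fill")` gives the corresponding view, with each bar scaled to proportions instead of counts.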

# Stacked bar plot comparing churn and not-churn counts for men and women
ggplot(data = costumer_churn, aes(x = Gender, fill = factor(Churn))) +
  geom_bar(position = "stack", color = "black") +
  scale_fill_manual(values = c("deepskyblue", "salmon"), labels = c("Not Churn", "Churn")) +
  xlab("Gender") +
  ylab("Count") +
  labs(title = "Comparison of Churn vs. Gender") +
  theme_minimal() +
  theme(legend.position = "top")

Insight (Comparison of Churn vs. Gender): The plot suggests that women churn more than men. The bar segment for churned women is about the same size as the segment for non-churned men.

# Bar plot of summed Total Spend for churned vs. non-churned customers
ggplot(data = costumer_churn, aes(x = Churn, y = Total.Spend, fill = Churn)) +
  stat_summary(fun = sum, geom = "bar", position = "dodge", color = "black") +
  scale_fill_manual(values = c("deepskyblue", "salmon"), labels = c("Not Churn", "Churn")) +
  xlab("Churn") +
  ylab("Total Spend") +
  labs(title = "Total Spend for Churn vs. Not Churn") +
  theme_minimal() +
  scale_x_discrete(labels = c("Not Churn", "Churn"))

Insight (Total Spend for Churn vs. Not Churn): The plot shows that customers who did not churn spent more in total than customers who churned.
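The bar plot sums Total.Spend per group, so the result mixes group size with spend per customer, and the two groups differ in size (224,714 non-churned vs. 280,492 churned customers in the summary). A per-customer mean separates the two effects. The sketch below uses an invented `toy` data frame to show that the sum and the mean can rank the groups differently.

```r
# Toy data: more churned customers, each spending less (values are invented)
toy <- data.frame(
  Churn = factor(c(0, 0, 1, 1, 1, 1)),
  Total.Spend = c(900, 800, 400, 500, 450, 450)
)

spend_sum  <- tapply(toy$Total.Spend, toy$Churn, sum)   # total per group
spend_mean <- tapply(toy$Total.Spend, toy$Churn, mean)  # average per customer

spend_sum
spend_mean
```

In the ggplot call, swapping `fun = sum` for `fun = mean` in `stat_summary()` would plot the per-customer average instead.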

# Boxplot of the 'Total.Spend' column
ggplot(data = costumer_churn, aes(y = `Total.Spend`)) +
  geom_boxplot(color = "black", fill = "skyblue") +
  ylab("Total Spend") +
  labs(title = "Boxplot of Total Spend") +
  theme_minimal()

Insight (Boxplot of Total Spend): The boxplot shows that Total Spend is negatively skewed: spending is concentrated at the higher end, with the middle half of customers spending roughly 446 to 824.

# Boxplot of the 'Age' column
ggplot(data = costumer_churn, aes(y = `Age`)) +
  geom_boxplot(color = "black", fill = "skyblue") +
  ylab("Age") +
  labs(title = "Boxplot of Age") +
  theme_minimal()

Insight (Boxplot of Age): The boxplot shows that customer age is close to symmetric (mean 39.7, median 40); the middle half of customers are aged roughly 29 to 49.
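The skewness readings above come from eyeballing boxplots. A moment-based skewness coefficient gives a numerical check: negative values indicate a longer left tail, positive values a longer right tail. The helper `skew()` below is a hypothetical function written for this sketch in base R, and the two vectors are invented. The e1071 package loaded earlier also provides a `skewness()` function that could be applied to `Total.Spend` and `Age`.

```r
# Hypothetical helper: third standardized moment (population form)
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

left_tailed  <- c(1, 8, 9, 9, 10, 10, 10)  # most values high, one low outlier
right_tailed <- c(1, 1, 1, 2, 2, 3, 10)    # most values low, one high outlier

skew(left_tailed)   # negative
skew(right_tailed)  # positive
```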

4 Cross Validation

RNGkind(sample.kind = "Rounding")

set.seed(100)

# Randomly assign 80% of the rows to training and the remaining 20% to testing

ind_churn <- sample(x = nrow(costumer_churn), size= nrow(costumer_churn)*0.8) 
churn_train <- costumer_churn[ind_churn,] 
churn_test <- costumer_churn[-ind_churn,]

# Check the class proportions in the training set
prop.table(table(churn_train$Churn))

#> 
#>         0         1 
#> 0.4452747 0.5547253
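The split above is a single random 80/20 hold-out, and with roughly 500,000 rows the training proportions stay close to those of the full data. On smaller datasets a purely random split can drift, and sampling within each class keeps the proportions fixed. The sketch below does this in base R on an invented `toy` data frame; it is an alternative, not the method used in this report. caret's `createDataPartition()` offers a similar stratified split.

```r
set.seed(100)

# Toy data with a 45/55 class balance (values are invented)
toy <- data.frame(
  x = rnorm(1000),
  Churn = factor(rep(c(0, 1), times = c(450, 550)))
)

# Sample 80% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$Churn)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

prop.table(table(toy_train$Churn))
prop.table(table(toy_test$Churn))
```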

5 Modeling

5.1 Naive Bayes

# Fit a Naive Bayes classifier with Churn as the target and all other columns as predictors
model_nb <- naiveBayes(formula = Churn ~ ., data = churn_train)

5.1.1 Confusion Matrix

# Predict data test
churn_pred <- predict(object = model_nb, newdata = churn_test, type = "class")

confusionMatrix(data = churn_pred,
                reference = churn_test$Churn,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 39378  7496
#>          1  5372 48796
#>                                                
#>                Accuracy : 0.8726               
#>                  95% CI : (0.8706, 0.8747)     
#>     No Information Rate : 0.5571               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7432               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8668               
#>             Specificity : 0.8800               
#>          Pos Pred Value : 0.9008               
#>          Neg Pred Value : 0.8401               
#>              Prevalence : 0.5571               
#>          Detection Rate : 0.4829               
#>    Detection Prevalence : 0.5361               
#>       Balanced Accuracy : 0.8734               
#>                                                
#>        'Positive' Class : 1                    
#>

Insight
- Positive class: Churn (1).
- The model's accuracy is 0.8726, so about 87.26% of its predictions are correct. Precision is 0.9008: about 90.08% of the customers predicted to churn actually churned. Recall (sensitivity) is 0.8668: the model detected about 86.68% of all actual churners.
- Precision is the priority metric here, because the more harmful error is classifying a customer as “Churn” (positive) when they would not actually churn (a false positive). A model with high precision is therefore preferred in this case.
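The reported metrics can be recomputed by hand from the four cells of the confusion matrix above, which makes the definitions explicit. In caret's layout the rows are predictions and the columns are the reference labels, so with positive class 1 the cells map to TP, FP, FN, and TN as shown in this sketch.

```r
# Cells taken from the confusion matrix output (positive class = 1)
TP <- 48796  # predicted 1, actually 1
FP <- 5372   # predicted 1, actually 0
FN <- 7496   # predicted 0, actually 1
TN <- 39378  # predicted 0, actually 0

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)  # caret: Pos Pred Value
recall      <- TP / (TP + FN)  # caret: Sensitivity
specificity <- TN / (TN + FP)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 4)
```

Rounded to four decimals, these match the Accuracy, Pos Pred Value, Sensitivity, and Specificity lines in the output.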

5.1.2 ROC

# Get the predicted probabilities for the churn and not-churn classes
churn_prob <- predict(object = model_nb, newdata = churn_test, type = "raw")

data_pred <-  data.frame(
  pred_prob_pos = churn_prob[,'1'],
  label = churn_test$Churn
)

library(ROCR)

# Build the ROCR prediction object from the positive-class probabilities
churn_roc <- prediction(predictions = data_pred$pred_prob_pos,
                        labels = data_pred$label)

plot(performance(
  prediction.obj = churn_roc,
  measure = "tpr", # metric on the y-axis
  x.measure = "fpr" # metric on the x-axis
))
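A ROC curve is usually summarized by the area under it (AUC): the probability that a randomly chosen churner receives a higher predicted probability than a randomly chosen non-churner. The sketch below computes it with the rank-based (Mann-Whitney) formula in base R on invented scores; `auc_rank()` is a hypothetical helper written for this example. With ROCR, calling `performance()` on the prediction object with `measure = "auc"` should give the same quantity.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(prob_pos, label, positive = "1") {
  is_pos <- label == positive
  r <- rank(prob_pos)                # average ranks, so ties are handled
  n_pos <- as.numeric(sum(is_pos))   # numeric avoids integer overflow on large data
  n_neg <- as.numeric(sum(!is_pos))
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores and labels (values are invented)
toy_prob  <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
toy_label <- factor(c(1, 1, 0, 1, 0, 0))

auc_rank(toy_prob, toy_label)  # 8 of the 9 positive/negative pairs are ordered correctly
```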

Customer Churn Prediction

Rizki Aldiansyah

July 25, 2023