Predicting Customer Churn: A Comparative Study of Semi-Supervised and One-Class Methods
Author
Saurabh C Srivastava
Published
March 5, 2025
This analysis uses the Telco Customer Churn dataset, which records customer churn alongside attributes such as demographics, tenure, monthly charges, and subscribed services. To simulate a semi-supervised learning scenario, I converted the churn indicator to a binary class label and masked 95% of its values as missing. The objective is to compare how semi-supervised learning algorithms and one-class learning algorithms perform at predicting customer churn.
Exploring Semi-Supervised Machine Learning
This analysis began by loading the necessary R packages and the ‘Telco Customer Churn’ dataset. Exploratory Data Analysis (EDA) was then performed, which included two primary data preparation steps:
Imputation, where missing values were filled in with mean imputation via the mice package and the result was checked to confirm that no missing values remained; and
Data preparation: renaming the dependent variable ‘Churn’ to ‘Class’, removing the ‘customerID’ column, and converting categorical variables to factors (a sketch of the imputation step appears after this list).
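A minimal sketch of the imputation step, assuming the raw data has been read into a data frame named telco_df and that TotalCharges is the only column with missing values:

library(mice)

# Unconditional mean imputation; m = 1 keeps a single completed data set.
imp      <- mice(telco_df, method = "mean", m = 1, maxit = 1, printFlag = FALSE)
telco_df <- complete(imp)

# Verification: count remaining missing values per column (all should be 0).
colSums(is.na(telco_df))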
To simulate a semi-supervised learning scenario, 95% of the ‘Class’ labels were masked using the add_missinglabels_mar() function, leaving 5% labeled data for model training. To ensure reproducible results, a seed value was set.
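A minimal sketch of this masking step with the RSSL package, assuming the imputed and prepared data frame (with the label column named Class) is called telco_df, and using an arbitrary seed:

library(RSSL)

set.seed(123)   # the actual seed value is not shown; any fixed value gives reproducibility
telco_ssl_df <- add_missinglabels_mar(telco_df, Class ~ ., prob = 0.95)

summary(telco_ssl_df$Class)   # roughly 5% of rows keep their label; the rest are NA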
Finally, various semi-supervised and supervised models were benchmarked on the Telco dataset, including Laplacian SVM, Self-Learning SVM with a linear kernel, Self-Learning Nearest Mean Classifier, and a standard Nearest Mean Classifier.
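A sketch of how such a benchmark can be set up with RSSL. The object names (telco_ssl_df, X_l, y_l, X_u) are my own assumptions, and only the Nearest Mean Classifier and its self-learning counterpart are shown; the SVM-based models follow the same pattern.

library(RSSL)

# Build a numeric design matrix from the features (Class excluded).
features <- setdiff(names(telco_ssl_df), "Class")
X_all    <- model.matrix(~ . - 1, data = telco_ssl_df[, features])
y_all    <- telco_ssl_df$Class

# Split into the 5% labeled and 95% unlabeled parts.
labeled <- !is.na(y_all)
X_l <- X_all[labeled, ]
y_l <- droplevels(y_all[labeled])
X_u <- X_all[!labeled, ]

g_nm   <- NearestMeanClassifier(X_l, y_l)                              # supervised baseline
g_self <- SelfLearning(X_l, y_l, X_u, method = NearestMeanClassifier)  # semi-supervised
# Accuracy is then estimated with predict() against the held-back true labels.

The first few rows of the raw data are shown below for reference.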
customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 7590-VHVEG Female 0 Yes No 1 No
2 5575-GNVDE Male 0 No No 34 Yes
3 3668-QPYBK Male 0 No No 2 Yes
MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1 No phone service DSL No Yes No
2 No DSL Yes No Yes
3 No DSL Yes Yes No
TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
1 No No No Month-to-month Yes
2 No No No One year No
3 No No No Month-to-month Yes
PaymentMethod MonthlyCharges TotalCharges Churn
1 Electronic check 29.85 29.85 No
2 Mailed check 56.95 1889.50 No
3 Mailed check 53.85 108.15 Yes
customerID gender SeniorCitizen Partner Dependents
1 7590-VHVEG Female 0 Yes No
2 5575-GNVDE Male 0 No No
3 3668-QPYBK Male 0 No No
4 7795-CFOCW Male 0 No No
5 9237-HQITU Female 0 No No
For the one-class experiments, a separate copy of the data, telco_oc_df, was prepared with the same cleaning steps:

telco_oc_df %<>% dplyr::rename(Class = Churn) %>% dplyr::select(-customerID)

# Convert character columns to factors
char_cols <- c("gender", "Partner", "Dependents", "PhoneService", "MultipleLines",
               "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
               "TechSupport", "StreamingTV", "StreamingMovies", "Contract",
               "PaperlessBilling", "PaymentMethod", "Class")
telco_oc_df[char_cols] <- lapply(telco_oc_df[char_cols], as.factor)

# Convert SeniorCitizen to a factor as well, and TotalCharges to numeric
telco_oc_df$SeniorCitizen <- as.factor(telco_oc_df$SeniorCitizen)
telco_oc_df$TotalCharges  <- as.numeric(as.character(telco_oc_df$TotalCharges))

str(telco_oc_df)
# Create the one-class version with no y values: every Class label becomes "N"
telco_oc_df$Class <- ifelse(telco_oc_df$Class == "Yes", "N", "N")
table(telco_oc_df$Class)
N
7043
head(telco_oc_df)
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines
1 Female 0 Yes No 1 No No phone service
2 Male 0 No No 34 Yes No
3 Male 0 No No 2 Yes No
4 Male 0 No No 45 No No phone service
5 Female 0 No No 2 Yes No
6 Female 0 No No 8 Yes Yes
InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
1 DSL No Yes No No
2 DSL Yes No Yes No
3 DSL Yes Yes No No
4 DSL Yes No Yes Yes
5 Fiber optic No No No No
6 Fiber optic No No Yes No
StreamingTV StreamingMovies Contract PaperlessBilling
1 No No Month-to-month Yes
2 No No One year No
3 No No Month-to-month Yes
4 No No One year No
5 No No Month-to-month Yes
6 Yes Yes Month-to-month Yes
PaymentMethod MonthlyCharges TotalCharges Class
1 Electronic check 29.85 29.85 N
2 Mailed check 56.95 1889.50 N
3 Mailed check 53.85 108.15 N
4 Bank transfer (automatic) 42.30 1840.75 N
5 Electronic check 70.70 151.65 N
6 Electronic check 99.65 820.50 N
# complete_df is assumed to be the imputed, numeric feature set prepared earlier;
# drop the label column, since a one-class SVM is fit without class labels.
x <- complete_df[, setdiff(names(complete_df), "Class")]
y <- complete_df$Class   # kept for reference; not used by the one-class fit

model_svm3 <- svm(x, y = NULL, type = "one-classification", kernel = "linear")
model_svm3
Call:
svm.default(x = x, y = NULL, type = "one-classification", kernel = "linear")
Parameters:
SVM-Type: one-classification
SVM-Kernel: linear
gamma: 0.02173913
nu: 0.5
Number of Support Vectors: 3674
summary(model_svm3)
Call:
svm.default(x = x, y = NULL, type = "one-classification", kernel = "linear")
Parameters:
SVM-Type: one-classification
SVM-Kernel: linear
gamma: 0.02173913
nu: 0.5
Number of Support Vectors: 3674
Number of Classes: 1
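The prediction step itself is not shown above; the following is a hedged reconstruction of how svm_predict3 and its comparison with save_actual (the original churn labels, saved before they were overwritten with "N") could have been produced.

# predict() on an e1071 one-classification model returns TRUE for points judged
# to belong to the single training class. Mapping TRUE to "No" (no churn) is an
# assumption made so the predictions are comparable with save_actual.
svm_predict3 <- ifelse(predict(model_svm3, x), "No", "Yes")
table(svm_predict3, save_actual)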
save_actual
svm_predict3 No Yes
No 2485 1044
Yes 2689 825
mean(svm_predict3 == save_actual)
[1] 0.4699702
Model                          Accuracy (%)
One-Class SVM (linear kernel)  47%
Analysis Outcomes and Recommendations
The Self-Learning SVM with a linear kernel achieved the highest accuracy among the semi-supervised models and is therefore the most promising approach for churn prediction in this scenario. Further fine-tuning of this model, including hyperparameter optimization and feature engineering, could be explored to improve its performance. The One-Class SVM, by contrast, performed poorly: its 47% accuracy is well below the roughly 73% a trivial model would achieve by labeling every customer as a non-churner (5,174 of the 7,043 customers in the confusion matrix did not churn), so it warrants further investigation. Experimenting with different kernel functions and parameter settings, or using alternative one-class classification algorithms, is recommended.
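As a starting point for that experimentation, here is a minimal sketch of a small grid search over kernels and nu values for the one-class SVM. It reuses the x feature matrix and the save_actual labels from above, and the grid values themselves are illustrative assumptions, not tuned choices.

library(e1071)

# Candidate kernels and nu values to try (illustrative, not exhaustive).
grid <- expand.grid(kernel = c("linear", "radial", "polynomial"),
                    nu     = c(0.10, 0.25, 0.50),
                    stringsAsFactors = FALSE)

# Refit the one-class SVM for each setting and score it against the saved labels.
grid$accuracy <- mapply(function(k, n) {
  fit  <- svm(x, y = NULL, type = "one-classification", kernel = k, nu = n)
  pred <- ifelse(predict(fit, x), "No", "Yes")
  mean(pred == save_actual)
}, grid$kernel, grid$nu)

grid[order(-grid$accuracy), ]   # settings ranked by accuracy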