Churn Dataset Analysis

Introduction

This analysis aims to identify the customers most likely to cancel their subscription plans, using the dataset's predictor variables. Three methods are applied: k-nearest neighbours (KNN), Naive Bayes, and logistic regression.

Data Summary

The dataset contains 3,150 observations of 11 variables: the binary target variable, Churn, and ten predictor variables.

First 5 Rows of Dataset
Call.Failure	Subscription.Length	Minutes.of.Use	Distinct.Called.Numbers	Status	Age	Customer.Value
8	38	72.83	17	1	30	197.640
0	39	5.30	4	0	25	46.035
10	37	40.88	24	1	30	1536.520
10	38	69.97	35	1	15	240.020
3	38	39.88	33	1	15	145.805

## data.frame [3150 × 11]

##  $ Call.Failure        : integer [1:3150] 8 0 10 10 3
##  $ Complains           : factor [1:3150] 0 0 0 0 0
##  $ Subscription.Length : integer [1:3150] 38 39 37 38 38
##  $ Charge.Amount       : integer [1:3150] 0 0 0 0 0
##  $ Minutes.of.Use      : numeric [1:3150] 72.83 5.3 40.88 69.97 39.88
##  $ Distinct.Called.Numbers: integer [1:3150] 17 4 24 35 33
##  $ Tariff.Plan         : factor [1:3150] 0 0 0 0 0
##  $ Status              : factor [1:3150] 1 0 1 1 1
##  $ Age                 : integer [1:3150] 30 25 30 15 15
##  $ Customer.Value      : numeric [1:3150] 197.64 46.035 1536.52 240.02 145.805
##  $ Churn               : factor [1:3150] 0 0 0 0 0

Histograms

Fig. 1.1

Box Plots

Fig. 1.2

Filled Plots

Fig. 1.3

Scatterplots

Fig. 1.4

Correlation Matrix

Fig. 1.5

Partitioning

The data was partitioned into a training set and a holdout set. The split was stratified on Churn so that both sets keep roughly the same churn rate: 70% of the rows (2,206) went to the training set, and the remaining 30% (944) were set aside as the holdout set. The seed was set to 1 so that the results can be reproduced.

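library(caret)  # provides createDataPartition()

set.seed(1)  # fixed seed so the split is reproducible
# Stratified 70/30 split on Churn
idx <- createDataPartition(df$Churn, p = 0.7, list = FALSE)
training.df <- df[idx, ]
holdout.df <- df[-idx, ]
# Make sure the target is a factor in both sets
training.df$Churn <- as.factor(training.df$Churn)
holdout.df$Churn <- as.factor(holdout.df$Churn)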
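
To illustrate what a stratified 70/30 split does, the sketch below repeats the idea in base R on a stand-in label vector. It is an approximation for illustration, not caret's own code. The class counts (2,655 non-churners and 495 churners) are an assumption inferred from the model outputs in this report, and the names `churn` and `train_idx` are hypothetical.

```r
# Stand-in for df$Churn; class counts are inferred from the report's outputs
# (assumption), not read from the real dataset.
churn <- factor(c(rep(0, 2655), rep(1, 495)))

set.seed(1)

# Within each class, sample 70% of the row indices, rounding up.
train_idx <- unlist(lapply(split(seq_along(churn), churn), function(rows) {
  sample(rows, size = ceiling(0.7 * length(rows)))
}))

length(train_idx)                     # number of training rows
length(churn) - length(train_idx)     # number of holdout rows
prop.table(table(churn[train_idx]))   # churn share in the training set
prop.table(table(churn[-train_idx]))  # churn share in the holdout set
```

Because sampling is done separately within each class, both sets end up with nearly the same churn rate, which matters when the positive class is as rare as it is here.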

KNN Model

## k-Nearest Neighbors 
## 
## 2206 samples
##   10 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1986, 1986, 1985, 1985, 1986, 1985, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9031441  0.6125965
##   7  0.8977040  0.5792232
##   9  0.8880324  0.5329789
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

	Call.Failure	Complains	Subscription.Length	Charge.Amount	Minutes.of.Use	Distinct.Called.Numbers	Tariff.Plan	Status	Age	Customer.Value	Churn	prediction
2	0	0	39	0	5.30	4	0	0	25	46.035	0	1
6	11	0	38	1	62.92	28	0	1	30	282.280	0	0
10	7	0	38	1	75.25	25	0	1	30	191.920	0	0
22	8	0	37	1	111.97	37	0	1	25	791.685	0	0
28	9	1	36	0	37.80	31	0	0	30	228.480	1	1
31	3	0	33	1	113.08	22	0	1	55	124.230	0	0
34	25	0	31	3	267.92	80	1	1	25	734.085	0	0
38	6	0	36	0	46.33	23	0	1	25	914.985	0	0
39	4	0	27	1	21.92	11	1	1	25	929.295	0	0
40	4	0	26	0	26.13	68	0	1	45	115.175	0	0

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 757  59
##          1  39  89
##                                           
##                Accuracy : 0.8962          
##                  95% CI : (0.8749, 0.9149)
##     No Information Rate : 0.8432          
##     P-Value [Acc > NIR] : 1.593e-06       
##                                           
##                   Kappa : 0.5845          
##                                           
##  Mcnemar's Test P-Value : 0.05495         
##                                           
##             Sensitivity : 0.60135         
##             Specificity : 0.95101         
##          Pos Pred Value : 0.69531         
##          Neg Pred Value : 0.92770         
##              Prevalence : 0.15678         
##          Detection Rate : 0.09428         
##    Detection Prevalence : 0.13559         
##       Balanced Accuracy : 0.77618         
##                                           
##        'Positive' Class : 1               
##

Output

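The model reaches an accuracy of 89.6% on the holdout set, significantly above the no-information rate of 84.3% (p < 0.001). The Kappa statistic of 0.58 indicates moderate agreement beyond chance. Specificity is the model's strong point at 95%, but sensitivity is only 60%, so roughly two in five actual churners are missed. The model was also fitted with no pre-processing, so predictors with large ranges, such as Customer.Value, carry the most weight in the distance calculation. By finding similarities among neighbouring data points, KNN serves here as a reasonable baseline for predictive modeling.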
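
As a check on the Kappa statistic, it can be recomputed by hand from the four cells of the KNN confusion matrix. The sketch below uses only base R; the variable names are illustrative and not part of the original analysis.

```r
# Holdout confusion matrix for KNN, copied from the output above.
# Rows are predictions, columns are the reference (actual) classes.
cm <- matrix(c(757, 59,
                39, 89),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal (the accuracy).
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement achieved beyond chance, as a share of the maximum.
kappa_stat <- (p_observed - p_expected) / (1 - p_expected)

round(c(accuracy = p_observed, kappa = kappa_stat), 4)
```

The two values should agree with the Accuracy and Kappa lines of the confusion matrix output. Chance agreement is high here (about 0.75) because most customers do not churn, which is why Kappa is much lower than the raw accuracy.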

Naive Bayes Model

Summary

## 'data.frame':    2206 obs. of  11 variables:
##  $ Call.Failure           : Factor w/ 2 levels "Few","Many": 1 2 2 1 1 2 1 1 2 2 ...
##  $ Complains              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Subscription.Length    : Factor w/ 2 levels "Short","Long": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Charge.Amount          : Factor w/ 2 levels "Small","Large": 1 1 1 1 1 2 1 1 1 2 ...
##  $ Minutes.of.Use         : Factor w/ 2 levels "Few","Many": 1 1 1 1 1 2 2 2 1 2 ...
##  $ Distinct.Called.Numbers: Factor w/ 2 levels "Few","Many": 1 1 2 2 1 2 2 1 1 2 ...
##  $ Tariff.Plan            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Status                 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
##  $ Age                    : Factor w/ 5 levels "15","25","30",..: 3 3 1 1 3 3 3 3 3 3 ...
##  $ Customer.Value         : Factor w/ 2 levels "Less Valuable",..: 1 2 1 1 2 2 2 1 1 2 ...
##  $ Churn                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

## 'data.frame':    944 obs. of  11 variables:
##  $ Call.Failure           : Factor w/ 2 levels "Few","Many": 1 2 1 1 2 1 2 1 1 1 ...
##  $ Complains              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ Subscription.Length    : Factor w/ 2 levels "Short","Long": 2 2 2 2 2 1 1 2 1 1 ...
##  $ Charge.Amount          : Factor w/ 2 levels "Small","Large": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Minutes.of.Use         : Factor w/ 2 levels "Few","Many": 1 1 2 2 1 2 2 1 1 1 ...
##  $ Distinct.Called.Numbers: Factor w/ 2 levels "Few","Many": 1 2 2 2 2 1 2 1 1 2 ...
##  $ Tariff.Plan            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
##  $ Status                 : Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Age                    : Factor w/ 5 levels "15","25","30",..: 2 3 3 2 3 5 2 2 2 4 ...
##  $ Customer.Value         : Factor w/ 2 levels "Less Valuable",..: 1 1 1 2 1 1 2 2 2 1 ...
##  $ Churn                  : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.8427017 0.1572983 
## 
## Conditional probabilities:
##    Call.Failure
## Y         Few      Many
##   0 0.6120365 0.3879635
##   1 0.6590258 0.3409742
## 
##    Complains
## Y            0          1
##   0 0.98173025 0.01826975
##   1 0.57593123 0.42406877
## 
##    Subscription.Length
## Y       Short      Long
##   0 0.4040838 0.5959162
##   1 0.4040115 0.5959885
## 
##    Charge.Amount
## Y        Small      Large
##   0 0.72326706 0.27673294
##   1 0.93123209 0.06876791
## 
##    Minutes.of.Use
## Y          Few       Many
##   0 0.59108006 0.40891994
##   1 0.91690544 0.08309456
## 
##    Distinct.Called.Numbers
## Y         Few      Many
##   0 0.5206878 0.4793122
##   1 0.8538682 0.1461318
## 
##    Tariff.Plan
## Y            0          1
##   0 0.90865126 0.09134874
##   1 0.98280802 0.01719198
## 
##    Status
## Y           0         1
##   0 0.1526061 0.8473939
##   1 0.7363897 0.2636103
## 
##    Age
## Y            15          25          30          45          55
##   0 0.046673820 0.331008584 0.442596567 0.112124464 0.067596567
##   1 0.002840909 0.380681818 0.434659091 0.173295455 0.008522727
## 
##    Customer.Value
## Y   Less Valuable More Valuable
##   0    0.60827512    0.39172488
##   1    0.98280802    0.01719198

Classification Results

Result table
Call.Failure	Complains	Subscription.Length	Charge.Amount	Minutes.of.Use	Distinct.Called.Numbers	Tariff.Plan	Status	Age	Customer.Value	Churn	X0	X1	pred.class
Few	0	Long	Small	Few	Few	0	0	25	Less Valuable	0	0.2107083	0.7892917	1
Many	0	Long	Small	Few	Many	0	1	30	Less Valuable	0	0.9696575	0.0303425	0
Few	0	Long	Small	Many	Many	0	1	30	Less Valuable	0	0.9950030	0.0049970	0
Few	0	Long	Small	Many	Many	0	1	25	More Valuable	0	0.9998403	0.0001597	0
Many	1	Long	Small	Few	Many	0	0	30	Less Valuable	1	0.0494923	0.9505077	1
Few	0	Short	Small	Many	Few	0	1	55	Less Valuable	0	0.9965450	0.0034550	0
Many	0	Short	Large	Many	Many	1	1	25	More Valuable	0	0.9999956	0.0000044	0
Few	0	Long	Small	Few	Few	0	1	25	More Valuable	0	0.9934832	0.0065168	0
Few	0	Short	Small	Few	Few	1	1	25	More Valuable	0	0.9988603	0.0011397	0
Few	0	Short	Small	Few	Many	0	1	45	Less Valuable	0	0.9431137	0.0568863	0
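
To show where the X0 and X1 columns come from, the sketch below reproduces the posterior for the first row of the result table by hand, using the a-priori and conditional probabilities printed in the model summary. It is a base R illustration with hypothetical variable names, not the code used to produce the table.

```r
# Priors copied from the "A-priori probabilities" output.
prior <- c("0" = 0.8427017, "1" = 0.1572983)

# Conditional probabilities for the first row's values:
# column 1 is P(value | Churn = 0), column 2 is P(value | Churn = 1).
likelihood <- rbind(
  Call.Failure.Few            = c(0.6120365,   0.6590258),
  Complains.0                 = c(0.98173025,  0.57593123),
  Subscription.Length.Long    = c(0.5959162,   0.5959885),
  Charge.Amount.Small         = c(0.72326706,  0.93123209),
  Minutes.of.Use.Few          = c(0.59108006,  0.91690544),
  Distinct.Called.Numbers.Few = c(0.5206878,   0.8538682),
  Tariff.Plan.0               = c(0.90865126,  0.98280802),
  Status.0                    = c(0.1526061,   0.7363897),
  Age.25                      = c(0.331008584, 0.380681818),
  Customer.Value.Less         = c(0.60827512,  0.98280802)
)
colnames(likelihood) <- c("0", "1")

# Naive Bayes score: prior times the product of the per-feature likelihoods.
score <- prior * apply(likelihood, 2, prod)

# Normalise so the two class scores sum to one.
posterior <- score / sum(score)
round(posterior, 4)
```

The result should be close to the 0.2107 and 0.7893 shown for that row. Status = 0 contributes most to the churn prediction, since it is about five times more likely among churners than among non-churners.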

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Churn
##      No    671    35
##      Churn 125   113
##                                          
##                Accuracy : 0.8305         
##                  95% CI : (0.805, 0.8539)
##     No Information Rate : 0.8432         
##     P-Value [Acc > NIR] : 0.8679         
##                                          
##                   Kappa : 0.4861         
##                                          
##  Mcnemar's Test P-Value : 1.977e-12      
##                                          
##             Sensitivity : 0.7635         
##             Specificity : 0.8430         
##          Pos Pred Value : 0.4748         
##          Neg Pred Value : 0.9504         
##              Prevalence : 0.1568         
##          Detection Rate : 0.1197         
##    Detection Prevalence : 0.2521         
##       Balanced Accuracy : 0.8032         
##                                          
##        'Positive' Class : Churn          
##

Output

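The model achieves an accuracy of 83.05%, which is slightly below the no-information rate of 84.32%, so on accuracy alone it does no better than always predicting the majority (non-churn) class. Unlike KNN, it flags churn often: 25.2% of holdout customers are predicted to churn against a prevalence of 15.7%. This gives a sensitivity of 76.4%, but the positive predictive value is only 47.5%, so more than half of the customers flagged as churners do not actually churn. The Kappa statistic of 0.49 indicates moderate agreement beyond chance. Specificity is 84.3% and the negative predictive value is 95.0%, so predictions of the negative class are reliable; predictions of the positive class, which matter more here, are less so.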

The model is hampered by the low prevalence of churn in the data. It also assumes that the predictors are independent given the class, which is rarely realistic in practice. Correlation between predictor variables can therefore make the estimated probabilities inaccurate.
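
The trade-off between catching churners and raising false alarms can be read directly from the Naive Bayes confusion matrix. The sketch below recomputes the main rates in base R and compares accuracy with the no-information rate; the variable names are illustrative only.

```r
# Holdout confusion matrix for Naive Bayes, copied from the output above.
# Rows are predictions, columns are the reference (actual) classes.
cm_nb <- matrix(c(671,  35,
                  125, 113),
                nrow = 2, byrow = TRUE,
                dimnames = list(Prediction = c("No", "Churn"),
                                Reference  = c("No", "Churn")))

n_nb <- sum(cm_nb)

accuracy <- sum(diag(cm_nb)) / n_nb
# No-information rate: accuracy from always predicting the larger class.
nir <- max(colSums(cm_nb)) / n_nb

# Share of actual churners that were flagged.
sensitivity <- cm_nb["Churn", "Churn"] / sum(cm_nb[, "Churn"])
# Share of actual non-churners that were left alone.
specificity <- cm_nb["No", "No"] / sum(cm_nb[, "No"])
# Share of flagged customers who really churned.
ppv <- cm_nb["Churn", "Churn"] / sum(cm_nb["Churn", ])

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv), 4)
```

Whether this trade-off is acceptable depends on the relative cost of a missed churner against an unnecessary retention offer, which the data alone do not settle.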

Logistic Regression Model

Summary

## 
## Call:
## glm(formula = Churn ~ Call.Failure + Subscription.Length + Charge.Amount + 
##     Distinct.Called.Numbers + Customer.Value + Minutes.of.Use + 
##     Age, family = binomial(link = "logit"), data = training.df)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -16.8251   391.9665  -0.043   0.9658    
## Call.FailureMany              0.7697     0.1529   5.035 4.78e-07 ***
## Subscription.LengthLong       0.3409     0.1350   2.525   0.0116 *  
## Charge.AmountLarge           -1.0245     0.2521  -4.064 4.82e-05 ***
## Distinct.Called.NumbersMany  -0.8124     0.1842  -4.411 1.03e-05 ***
## Customer.ValueMore Valuable  -3.2931     0.4587  -7.180 6.99e-13 ***
## Minutes.of.UseMany           -1.0874     0.2284  -4.760 1.94e-06 ***
## Age25                        15.9001   391.9665   0.041   0.9676    
## Age30                        15.6433   391.9665   0.040   0.9682    
## Age45                        15.9637   391.9665   0.041   0.9675    
## Age55                        13.3153   391.9671   0.034   0.9729    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1919.9  on 2205  degrees of freedom
## Residual deviance: 1460.6  on 2195  degrees of freedom
## AIC: 1482.6
## 
## Number of Fisher Scoring iterations: 16

Output

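In the summary, three asterisks mark coefficients that are significant at the 0.001 level: Call.Failure, Charge.Amount, Distinct.Called.Numbers, Customer.Value, and Minutes.of.Use. Subscription.Length is significant at the 0.05 level. Many call failures and a long subscription are associated with higher odds of churn, while the other significant predictors are associated with lower odds. The intercept and the Age coefficients have very large standard errors (about 392) and are not significant, which typically points to near-complete separation in one of the age groups.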
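
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for the estimates printed in the summary. The Age terms are left out because their estimates are unstable, and the variable names are illustrative.

```r
# Coefficient estimates copied from the glm summary above (Age terms omitted).
coefs <- c(
  "Call.FailureMany"            =  0.7697,
  "Subscription.LengthLong"     =  0.3409,
  "Charge.AmountLarge"          = -1.0245,
  "Distinct.Called.NumbersMany" = -0.8124,
  "Customer.ValueMore Valuable" = -3.2931,
  "Minutes.of.UseMany"          = -1.0874
)

# Odds ratio relative to the reference level of each predictor,
# holding the other predictors fixed.
odds_ratio <- exp(coefs)

round(sort(odds_ratio), 3)
```

An odds ratio above 1 means higher odds of churn than the reference level, and a value below 1 means lower odds. For example, customers with many call failures have roughly twice the odds of churning, while more valuable customers have odds of churn that are only a few percent of those of less valuable customers.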

Conclusion

Using R, we developed three models to predict churn and visualized the relationships between many of the variables. This gives a better understanding of the factors associated with customer churn, so the data can be acted upon with more confidence than before.

Radoslaw Budzik

2025-04-17
