Introduction

This analysis aims to predict which customers are most likely to cancel their subscription plans, using the dataset’s predictor variables. Three methods are compared: k-nearest neighbors (KNN), Naive Bayes, and logistic regression.

Data Summary

The dataset consists of the binary target variable Churn and ten predictor variables.
First 5 Rows of Dataset
Call.Failure Complains Subscription.Length Charge.Amount Minutes.of.Use Distinct.Called.Numbers Tariff.Plan Status Age Customer.Value Churn
           8         0                  38             0          72.83                      17           0      1  30        197.640     0
           0         0                  39             0           5.30                       4           0      0  25         46.035     0
          10         0                  37             0          40.88                      24           0      1  30       1536.520     0
          10         0                  38             0          69.97                      35           0      1  15        240.020     0
           3         0                  38             0          39.88                      33           0      1  15        145.805     0
## data.frame [3150 × 11]
##  $ Call.Failure        : integer [1:3150] 8 0 10 10 3
##  $ Complains           : factor [1:3150] 0 0 0 0 0
##  $ Subscription.Length : integer [1:3150] 38 39 37 38 38
##  $ Charge.Amount       : integer [1:3150] 0 0 0 0 0
##  $ Minutes.of.Use      : numeric [1:3150] 72.83 5.3 40.88 69.97 39.88
##  $ Distinct.Called.Numbers: integer [1:3150] 17 4 24 35 33
##  $ Tariff.Plan         : factor [1:3150] 0 0 0 0 0
##  $ Status              : factor [1:3150] 1 0 1 1 1
##  $ Age                 : integer [1:3150] 30 25 30 15 15
##  $ Customer.Value      : numeric [1:3150] 197.64 46.035 1536.52 240.02 145.805
##  $ Churn               : factor [1:3150] 0 0 0 0 0

Histograms

Fig. 1.1

Box Plots

Fig. 1.2

Filled Plots

Fig. 1.3

Scatterplots

Fig. 1.4

Correlation Matrix

Fig. 1.5

Partitioning

The data was partitioned into a training set and a holdout set, stratified on Churn, the dependent variable: 70% of the data was allocated to the training set, and the remaining 30% was set aside as the holdout set. The seed was set to 1 so that the results can be reproduced.

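library(caret)  # provides createDataPartition()

set.seed(1)  # fix the random seed so the split is reproducible
# Stratified 70/30 split on the outcome variable
idx <- createDataPartition(df$Churn, p = 0.7, list = FALSE)
training.df <- df[idx, ]
holdout.df <- df[-idx, ]
# Make sure the outcome is a factor so the models treat it as a class label
training.df$Churn <- as.factor(training.df$Churn)
holdout.df$Churn <- as.factor(holdout.df$Churn)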

KNN Model

## k-Nearest Neighbors 
## 
## 2206 samples
##   10 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1986, 1986, 1985, 1985, 1986, 1985, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9031441  0.6125965
##   7  0.8977040  0.5792232
##   9  0.8880324  0.5329789
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
Row Call.Failure Complains Subscription.Length Charge.Amount Minutes.of.Use Distinct.Called.Numbers Tariff.Plan Status Age Customer.Value Churn prediction
  2            0         0                  39             0           5.30                       4           0      0  25         46.035     0          1
  6           11         0                  38             1          62.92                      28           0      1  30        282.280     0          0
 10            7         0                  38             1          75.25                      25           0      1  30        191.920     0          0
 22            8         0                  37             1         111.97                      37           0      1  25        791.685     0          0
 28            9         1                  36             0          37.80                      31           0      0  30        228.480     1          1
 31            3         0                  33             1         113.08                      22           0      1  55        124.230     0          0
 34           25         0                  31             3         267.92                      80           1      1  25        734.085     0          0
 38            6         0                  36             0          46.33                      23           0      1  25        914.985     0          0
 39            4         0                  27             1          21.92                      11           1      1  25        929.295     0          0
 40            4         0                  26             0          26.13                      68           0      1  45        115.175     0          0
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 757  59
##          1  39  89
##                                           
##                Accuracy : 0.8962          
##                  95% CI : (0.8749, 0.9149)
##     No Information Rate : 0.8432          
##     P-Value [Acc > NIR] : 1.593e-06       
##                                           
##                   Kappa : 0.5845          
##                                           
##  Mcnemar's Test P-Value : 0.05495         
##                                           
##             Sensitivity : 0.60135         
##             Specificity : 0.95101         
##          Pos Pred Value : 0.69531         
##          Neg Pred Value : 0.92770         
##              Prevalence : 0.15678         
##          Detection Rate : 0.09428         
##    Detection Prevalence : 0.13559         
##       Balanced Accuracy : 0.77618         
##                                           
##        'Positive' Class : 1               
## 

Output

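On the holdout set the accuracy is 89.6%, significantly above the no-information rate of 84.3% (p = 1.6e-06), which suggests the model is an effective predictor. The Kappa statistic of 0.585 indicates moderate agreement beyond chance. Specificity is a strong point of this model at 95.1%, whereas sensitivity is only 60.1%, so roughly two in five churners are missed. Because no pre-processing was applied, the distances are likely dominated by the predictors with the largest numeric ranges, so centering and scaling may be worth testing. By classifying each customer according to the most similar neighbouring data points, KNN serves here as a solid foundation for predictive modeling.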
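As a quick check, the headline statistics can be recomputed by hand from the four cells of the KNN confusion matrix. The following is a minimal base-R sketch, not the code used to produce the report; the object names (cm, kappa_hat, and so on) are illustrative assumptions.

```r
# Holdout confusion matrix for KNN, copied from the caret output.
# Rows are predictions, columns are reference (actual) classes.
# matrix() fills column by column: Reference 0 first, then Reference 1.
cm <- matrix(c(757, 39, 59, 89), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n  <- sum(cm)
tp <- cm["1", "1"]  # churners correctly flagged
tn <- cm["0", "0"]  # non-churners correctly passed
fp <- cm["1", "0"]  # non-churners wrongly flagged
fn <- cm["0", "1"]  # churners missed

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)

# Cohen's kappa: observed agreement corrected for the agreement
# expected by chance from the row and column totals.
p_obs     <- accuracy
p_exp     <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (p_obs - p_exp) / (1 - p_exp)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv,
              kappa = kappa_hat), 4))
```

The same steps apply to any 2 x 2 confusion matrix, provided the positive class is the churn class, as it is in the output above.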

Naive Bayes Model

Summary

## 'data.frame':    2206 obs. of  11 variables:
##  $ Call.Failure           : Factor w/ 2 levels "Few","Many": 1 2 2 1 1 2 1 1 2 2 ...
##  $ Complains              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Subscription.Length    : Factor w/ 2 levels "Short","Long": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Charge.Amount          : Factor w/ 2 levels "Small","Large": 1 1 1 1 1 2 1 1 1 2 ...
##  $ Minutes.of.Use         : Factor w/ 2 levels "Few","Many": 1 1 1 1 1 2 2 2 1 2 ...
##  $ Distinct.Called.Numbers: Factor w/ 2 levels "Few","Many": 1 1 2 2 1 2 2 1 1 2 ...
##  $ Tariff.Plan            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Status                 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
##  $ Age                    : Factor w/ 5 levels "15","25","30",..: 3 3 1 1 3 3 3 3 3 3 ...
##  $ Customer.Value         : Factor w/ 2 levels "Less Valuable",..: 1 2 1 1 2 2 2 1 1 2 ...
##  $ Churn                  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## 'data.frame':    944 obs. of  11 variables:
##  $ Call.Failure           : Factor w/ 2 levels "Few","Many": 1 2 1 1 2 1 2 1 1 1 ...
##  $ Complains              : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ Subscription.Length    : Factor w/ 2 levels "Short","Long": 2 2 2 2 2 1 1 2 1 1 ...
##  $ Charge.Amount          : Factor w/ 2 levels "Small","Large": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Minutes.of.Use         : Factor w/ 2 levels "Few","Many": 1 1 2 2 1 2 2 1 1 1 ...
##  $ Distinct.Called.Numbers: Factor w/ 2 levels "Few","Many": 1 2 2 2 2 1 2 1 1 2 ...
##  $ Tariff.Plan            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
##  $ Status                 : Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 2 2 2 ...
##  $ Age                    : Factor w/ 5 levels "15","25","30",..: 2 3 3 2 3 5 2 2 2 4 ...
##  $ Customer.Value         : Factor w/ 2 levels "Less Valuable",..: 1 1 1 2 1 1 2 2 2 1 ...
##  $ Churn                  : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.8427017 0.1572983 
## 
## Conditional probabilities:
##    Call.Failure
## Y         Few      Many
##   0 0.6120365 0.3879635
##   1 0.6590258 0.3409742
## 
##    Complains
## Y            0          1
##   0 0.98173025 0.01826975
##   1 0.57593123 0.42406877
## 
##    Subscription.Length
## Y       Short      Long
##   0 0.4040838 0.5959162
##   1 0.4040115 0.5959885
## 
##    Charge.Amount
## Y        Small      Large
##   0 0.72326706 0.27673294
##   1 0.93123209 0.06876791
## 
##    Minutes.of.Use
## Y          Few       Many
##   0 0.59108006 0.40891994
##   1 0.91690544 0.08309456
## 
##    Distinct.Called.Numbers
## Y         Few      Many
##   0 0.5206878 0.4793122
##   1 0.8538682 0.1461318
## 
##    Tariff.Plan
## Y            0          1
##   0 0.90865126 0.09134874
##   1 0.98280802 0.01719198
## 
##    Status
## Y           0         1
##   0 0.1526061 0.8473939
##   1 0.7363897 0.2636103
## 
##    Age
## Y            15          25          30          45          55
##   0 0.046673820 0.331008584 0.442596567 0.112124464 0.067596567
##   1 0.002840909 0.380681818 0.434659091 0.173295455 0.008522727
## 
##    Customer.Value
## Y   Less Valuable More Valuable
##   0    0.60827512    0.39172488
##   1    0.98280802    0.01719198

Classification Results

Result table
Call.Failure Complains Subscription.Length Charge.Amount Minutes.of.Use Distinct.Called.Numbers Tariff.Plan Status Age Customer.Value Churn        X0        X1 pred.class
         Few         0                Long         Small            Few                     Few           0      0  25  Less Valuable     0 0.2107083 0.7892917          1
        Many         0                Long         Small            Few                    Many           0      1  30  Less Valuable     0 0.9696575 0.0303425          0
         Few         0                Long         Small           Many                    Many           0      1  30  Less Valuable     0 0.9950030 0.0049970          0
         Few         0                Long         Small           Many                    Many           0      1  25  More Valuable     0 0.9998403 0.0001597          0
        Many         1                Long         Small            Few                    Many           0      0  30  Less Valuable     1 0.0494923 0.9505077          1
         Few         0               Short         Small           Many                     Few           0      1  55  Less Valuable     0 0.9965450 0.0034550          0
        Many         0               Short         Large           Many                    Many           1      1  25  More Valuable     0 0.9999956 0.0000044          0
         Few         0                Long         Small            Few                     Few           0      1  25  More Valuable     0 0.9934832 0.0065168          0
         Few         0               Short         Small            Few                     Few           1      1  25  More Valuable     0 0.9988603 0.0011397          0
         Few         0               Short         Small            Few                    Many           0      1  45  Less Valuable     0 0.9431137 0.0568863          0

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Churn
##      No    671    35
##      Churn 125   113
##                                          
##                Accuracy : 0.8305         
##                  95% CI : (0.805, 0.8539)
##     No Information Rate : 0.8432         
##     P-Value [Acc > NIR] : 0.8679         
##                                          
##                   Kappa : 0.4861         
##                                          
##  Mcnemar's Test P-Value : 1.977e-12      
##                                          
##             Sensitivity : 0.7635         
##             Specificity : 0.8430         
##          Pos Pred Value : 0.4748         
##          Neg Pred Value : 0.9504         
##              Prevalence : 0.1568         
##          Detection Rate : 0.1197         
##    Detection Prevalence : 0.2521         
##       Balanced Accuracy : 0.8032         
##                                          
##        'Positive' Class : Churn          
## 

Output

The model achieves an accuracy of 83.05% on the holdout set, slightly below the no-information rate of 84.32% (p = 0.8679), so on accuracy alone it does no better than always predicting the majority (non-churn) class. The low positive predictive value of 0.4748 shows that fewer than half of the customers flagged as churners actually churn. The Kappa statistic of 0.4861 indicates moderate agreement beyond chance. Sensitivity is 76.35% and specificity is 84.30%, so the model identifies most churners, the more important class, but at the cost of 125 false alarms among the 796 non-churners.
The model is affected by the low prevalence of churn (15.7% of the data). It also assumes that the predictors are independent given the class, which is rarely realistic in practice; correlation between predictor variables can therefore make its probability estimates inaccurate.
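To make the independence assumption concrete, the posterior for the first row of the result table can be reproduced by multiplying the prior by the matching conditional probabilities printed in the model summary. This is a minimal base-R sketch for illustration only; the object names (prior, lik, posterior) are assumptions, and every number is copied from the output above.

```r
# A-priori class probabilities from the Naive Bayes summary.
prior <- c("0" = 0.8427017, "1" = 0.1572983)

# Conditional probabilities P(feature value | class) for the first row of
# the result table: Few, 0, Long, Small, Few, Few, 0, 0, 25, Less Valuable.
# First number in each pair is for class 0, second for class 1.
lik <- rbind(
  Call.Failure.Few            = c(0.6120365,   0.6590258),
  Complains.0                 = c(0.98173025,  0.57593123),
  Subscription.Length.Long    = c(0.5959162,   0.5959885),
  Charge.Amount.Small         = c(0.72326706,  0.93123209),
  Minutes.of.Use.Few          = c(0.59108006,  0.91690544),
  Distinct.Called.Numbers.Few = c(0.5206878,   0.8538682),
  Tariff.Plan.0               = c(0.90865126,  0.98280802),
  Status.0                    = c(0.1526061,   0.7363897),
  Age.25                      = c(0.331008584, 0.380681818),
  Customer.Value.Less         = c(0.60827512,  0.98280802)
)
colnames(lik) <- c("0", "1")

# Naive Bayes: joint score = prior times the product of the per-feature
# likelihoods, treating the features as independent given the class.
joint <- prior * apply(lik, 2, prod)

# Normalise so the two class scores sum to 1.
posterior <- joint / sum(joint)
print(round(posterior, 4))
```

The churn posterior should agree with the X1 value in the first row of the result table (about 0.789), up to rounding in the printed probabilities.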

Logistic Regression Model

Summary

## 
## Call:
## glm(formula = Churn ~ Call.Failure + Subscription.Length + Charge.Amount + 
##     Distinct.Called.Numbers + Customer.Value + Minutes.of.Use + 
##     Age, family = binomial(link = "logit"), data = training.df)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -16.8251   391.9665  -0.043   0.9658    
## Call.FailureMany              0.7697     0.1529   5.035 4.78e-07 ***
## Subscription.LengthLong       0.3409     0.1350   2.525   0.0116 *  
## Charge.AmountLarge           -1.0245     0.2521  -4.064 4.82e-05 ***
## Distinct.Called.NumbersMany  -0.8124     0.1842  -4.411 1.03e-05 ***
## Customer.ValueMore Valuable  -3.2931     0.4587  -7.180 6.99e-13 ***
## Minutes.of.UseMany           -1.0874     0.2284  -4.760 1.94e-06 ***
## Age25                        15.9001   391.9665   0.041   0.9676    
## Age30                        15.6433   391.9665   0.040   0.9682    
## Age45                        15.9637   391.9665   0.041   0.9675    
## Age55                        13.3153   391.9671   0.034   0.9729    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1919.9  on 2205  degrees of freedom
## Residual deviance: 1460.6  on 2195  degrees of freedom
## AIC: 1482.6
## 
## Number of Fisher Scoring iterations: 16

Output

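In the coefficient table, three stars (***) mark the variables whose coefficients are significant at the 0.001 level: Call.Failure, Charge.Amount, Distinct.Called.Numbers, Customer.Value, and Minutes.of.Use. Subscription.Length is significant at the 0.05 level. The Age coefficients have very large standard errors (about 392) and are not significant, which is consistent with the near-absence of churners in the reference age group (15) in the training data.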
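Because the model is fitted on the logit scale, exponentiating a coefficient gives an odds ratio, which is often easier to read. The sketch below uses only the estimates printed in the summary; the vector name est is illustrative, and the Age terms are left out because their standard errors (about 392) are too large for the estimates to be meaningful.

```r
# Coefficient estimates copied from the glm summary (log-odds scale).
est <- c(
  "Call.FailureMany"            =  0.7697,
  "Subscription.LengthLong"     =  0.3409,
  "Charge.AmountLarge"          = -1.0245,
  "Distinct.Called.NumbersMany" = -0.8124,
  "Customer.ValueMore Valuable" = -3.2931,
  "Minutes.of.UseMany"          = -1.0874
)

# exp() turns a log-odds coefficient into an odds ratio relative to the
# reference level of that predictor, holding the other predictors fixed.
odds_ratio <- exp(est)
print(round(odds_ratio, 3))
```

For example, exp(0.7697) is about 2.16, so customers in the "Many" call-failure group have roughly twice the odds of churning compared with the "Few" group, while the odds ratio of about 0.04 for "More Valuable" customers points to a much lower churn risk.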

Conclusion

With the help of R, we developed three models that predict churn from the data provided and visualized the relationships between many of the variables. There is now a better understanding of which factors are associated with customer churn, so the data can be acted upon with more confidence than before.