Homework BA

There was a problem uploading the HTML file, so here is a link to RPubs: http://rpubs.com/Yuma02/BAhomework2

Telco customer churn

Exploratory Data Analysis

str(c1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 21
##   .. ..$ customerID      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ gender          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ SeniorCitizen   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Partner         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Dependents      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ tenure          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PhoneService    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MultipleLines   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ InternetService : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ OnlineSecurity  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ OnlineBackup    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ DeviceProtection: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ TechSupport     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ StreamingTV     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ StreamingMovies : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ Contract        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ PaperlessBilling: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ PaymentMethod   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ MonthlyCharges  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ TotalCharges    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Churn           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

The data types need some reformatting. For example, SeniorCitizen should be a character variable, while tenure should be numeric. Let's convert them. The other variables look fine.

c1$SeniorCitizen = as.character(c1$SeniorCitizen)
c1$tenure = as.numeric(c1$tenure)

Unfortunately, the dataset has some missing values (in TotalCharges). They could affect the analysis, so I delete the rows that contain them.
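The deletion step is not shown in the report. A minimal sketch of how missing values can be counted and dropped in base R follows; the data frame `toy` and its values are made up for illustration and are not the telco data.

```r
# Toy data standing in for the telco table; the values are made up.
toy <- data.frame(
  tenure = c(1, 34, 0, 45),
  MonthlyCharges = c(29.85, 56.95, 52.55, 42.30),
  TotalCharges = c(29.85, 1889.50, NA, 1840.75)
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows without any missing value.
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Applied to the real table, the same two calls would show which columns contain `NA` and how many rows remain after the deletion.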

Continuous variables

Tenure

hchart(c1$tenure, color = "blue") %>% 
  hc_title(text = "Tenure")

summary(c1$tenure)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   29.00   32.42   55.00   72.00

From the summary and the histogram, the highest tenure bin (70 to 75) contains 532 observations, while the lowest bin (0 to 5) contains 1371; the mean tenure is 32.42.

Monthly charges

hchart(c1$MonthlyCharges, color = "green") %>% 
  hc_title(text = "MonthlyCharges")

summary(c1$MonthlyCharges)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.59   70.35   64.80   89.86  118.75

The lowest monthly-charges bin (0 to 20) contains 656 observations, while the highest bin (115 to 120) contains 64; the mean monthly charge is 64.80.

Total charges

hchart(c1$TotalCharges, color = "red") %>% 
  hc_title(text = "TotalCharges")

summary(c1$TotalCharges)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.8   401.4  1397.5  2283.3  3794.7  8684.8

The lowest total-charges bin (0 to 500) contains 2000 observations, while the highest bin contains 70; the mean total charge is 2283.3.

None of the three variables is normally distributed, and we will take this into account in the later analysis.
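One rough way to see the asymmetry without the histograms is to compare the mean and the median printed by `summary()` above. The sketch below only reuses those printed numbers; the column name `skew_hint` is a hypothetical label, and the rule is a heuristic rather than a formal normality test.

```r
# Mean and median as printed by summary() above.
stats <- data.frame(
  variable = c("tenure", "MonthlyCharges", "TotalCharges"),
  median = c(29.00, 70.35, 1397.5),
  mean = c(32.42, 64.80, 2283.3)
)

# A mean above the median hints at a longer right tail, below it at a longer left tail.
stats$skew_hint <- ifelse(stats$mean > stats$median, "right", "left")
print(stats)
```

By this reading, tenure and TotalCharges lean to the right, while MonthlyCharges leans to the left.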

Correlations

plot_correlation(c1, type = 'continuous')

From the correlation plot we can see that TotalCharges is positively correlated with MonthlyCharges (0.65) and tenure (0.83). As both TotalCharges and MonthlyCharges are positively correlated with tenure, we could drop one of the variables.
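The strong link between TotalCharges and the other two variables is plausible if total charges are roughly tenure times the monthly charge. The sketch below illustrates this on simulated data; the multiplication rule, the seed and the value ranges are assumptions made for the example, not facts about the dataset.

```r
set.seed(1)  # arbitrary seed so the toy numbers are reproducible
n <- 500
tenure <- sample(1:72, n, replace = TRUE)   # months with the company
monthly <- runif(n, min = 18, max = 119)    # monthly charge
total <- tenure * monthly                   # simplification: the price never changes
toy <- data.frame(tenure, MonthlyCharges = monthly, TotalCharges = total)

# Pairwise Pearson correlations; use = "complete.obs" drops rows with NA.
cm <- cor(toy, use = "complete.obs")
print(round(cm, 2))
```

In this toy setup TotalCharges correlates positively with both inputs even though tenure and the monthly charge are drawn independently.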

Churn and continuous variables

Churn and tenure

# c1 is already in the workspace, so no data() call is needed
hcboxplot(x = c1$tenure, var = c1$Churn) %>% 
  hc_chart(type = "column") %>% 
  hc_title(text = "Churn and tenure")

As expected, customers with longer tenure are less likely to churn: the median tenure is 38 for customers who stay and 10 for customers who churn.

Churn and Total Charges

# c1 is already in the workspace, so no data() call is needed
hcboxplot(x = c1$TotalCharges, var = c1$Churn) %>% 
  hc_chart(type = "column") %>% 
  hc_title(text = "Churn and Total Charges")

From the boxplots we can see that the two groups, customers who churn and customers who do not, are not equal in size. However, the median total charge of customers who do not churn (1683) is higher than the median of customers who churn (704).

Churn and Monthly Charges

# c1 is already in the workspace, so no data() call is needed
hcboxplot(x = c1$MonthlyCharges, var = c1$Churn) %>% 
  hc_chart(type = "column") %>% 
  hc_title(text = "Churn and Monthly Charges")

These boxplots show that the median monthly charge of customers who churn is higher: it is approximately 80, while for customers who do not churn it is close to 64.

Predictive Model

Before modelling, we should recode tenure. It is a numeric variable, but grouped into categories it is easier to interpret. Some models (SVM) do not work with categorical variables directly, so the categories are dummy-coded afterwards.

# Bin tenure (in months) into one-year groups. cut() works on the numeric
# column directly, so no value is compared after it has become a string.
c1$tenure <- cut(c1$tenure,
                 breaks = c(0, 12, 24, 36, 48, 60, 72),
                 labels = c('0-1 year', '1-2 years', '2-3 years',
                            '3-4 years', '4-5 years', '5-6 years'),
                 include.lowest = TRUE)  # the result is already a factor

Based on the EDA, we can divide tenure into several categories. I decided to use bins of one year (12 months).

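library(caret)  # dummyVars, createDataPartition, confusionMatrix
# Encode the target as 0/1
c1$Churn <- ifelse(c1$Churn == "Yes", 1, 0)
# Turn every categorical predictor into dummy columns
dmy <- dummyVars(" ~ .", data = c1)
dmy <- data.frame(predict(dmy, newdata = c1))
set.seed(123)
# Hold out 20% of the rows as the test set
ind <- createDataPartition(dmy$Churn, p = 0.2, list = FALSE)
test <- dmy[ind, ]
train <- dmy[-ind, ]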

As we have some categorical variables, we should code them as dummy variables, because SVM works only with numeric variables.
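To make the dummy coding concrete, here is a small base-R sketch that uses `model.matrix()` instead of caret's `dummyVars()`. The data frame `toy` and its values are made up; only the column names mirror the telco data.

```r
# Toy customers; the column names mirror the telco data, the values are made up.
toy <- data.frame(
  Contract = factor(c("Month-to-month", "One year", "Two year", "One year")),
  PaperlessBilling = factor(c("Yes", "No", "Yes", "No")),
  MonthlyCharges = c(29.85, 56.95, 53.85, 42.30)
)

# "- 1" removes the intercept, so the first factor gets one column per level.
# Later factors still drop their reference level (default treatment contrasts).
X <- model.matrix(~ . - 1, data = toy)
print(X)
```

Unlike this sketch, the report's `dummyVars()` output keeps a column for every level of every factor, which is why both `PaperlessBillingNo` and `PaperlessBillingYes` appear among the model inputs.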

Tree

library(partykit)

## Loading required package: grid

library(rpart)
model.tree = rpart::rpart(Churn ~., data = train)
predTrain.tree = predict(model.tree, train)
predTest.tree = predict(model.tree, test)
train_pred <- factor(ifelse(predTrain.tree > 0.5, "Yes", "No"))
train_actual <- factor(ifelse(train$Churn == 1, "Yes", "No"))
test_pred <- factor(ifelse(predTest.tree > 0.5, "Yes", "No"))
test_actual <- factor(ifelse(test$Churn == 1, "Yes", "No"))
confusionMatrix(data = train_pred, reference = train_actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3727  715
##        Yes  421  762
##                                           
##                Accuracy : 0.798           
##                  95% CI : (0.7873, 0.8085)
##     No Information Rate : 0.7374          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4428          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8985          
##             Specificity : 0.5159          
##          Pos Pred Value : 0.8390          
##          Neg Pred Value : 0.6441          
##              Prevalence : 0.7374          
##          Detection Rate : 0.6626          
##    Detection Prevalence : 0.7897          
##       Balanced Accuracy : 0.7072          
##                                           
##        'Positive' Class : No              
##

confusionMatrix(data = test_pred, reference = test_actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  913 197
##        Yes 102 195
##                                           
##                Accuracy : 0.7875          
##                  95% CI : (0.7652, 0.8086)
##     No Information Rate : 0.7214          
##     P-Value [Acc > NIR] : 8.072e-09       
##                                           
##                   Kappa : 0.4289          
##  Mcnemar's Test P-Value : 5.444e-08       
##                                           
##             Sensitivity : 0.8995          
##             Specificity : 0.4974          
##          Pos Pred Value : 0.8225          
##          Neg Pred Value : 0.6566          
##              Prevalence : 0.7214          
##          Detection Rate : 0.6489          
##    Detection Prevalence : 0.7889          
##       Balanced Accuracy : 0.6985          
##                                           
##        'Positive' Class : No              
##

accuracyTrain.tree = confusionMatrix(data = train_pred, reference = train_actual)$overall["Accuracy"]
accuracyTest.tree = confusionMatrix(data = test_pred, reference = test_actual)$overall["Accuracy"]

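As we can see, accuracy (0.7875) and sensitivity (0.8995) are good, but specificity is quite low (0.4974) on the test dataset. Because the positive class is "No", specificity here is the share of actual churners that the model identifies. The low value may be because we did not use an optimal cutoff. However, we do not tune the cutoff here, because the models should be compared under the same settings.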
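The three metrics quoted above can be recomputed by hand from the test-set confusion matrix of the tree. The sketch below copies the four counts from the output above and assumes nothing else; `cm` is just a local name for the matrix.

```r
# Test-set confusion matrix of the tree, copied from the output above.
# Rows are predictions, columns are the reference; "No" is the positive class.
cm <- matrix(c(913, 102, 197, 195), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["No", "No"] / sum(cm[, "No"])     # actual "No" predicted as "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # actual "Yes" predicted as "Yes"

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

The specificity line makes the weak spot visible: only 195 of the 392 actual churners in the test set are predicted as churners.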

Logistic regression

library(MASS)  # provides stepAIC()
model.log = glm(Churn~., data = train, family = binomial)
# stepAIC performs stepwise selection by AIC to keep only the useful predictors.
# Unfortunately, this method cannot be used for the other models.
model.log <- stepAIC(model.log, trace = 0)
summary(model.log)

## 
## Call:
## glm(formula = Churn ~ SeniorCitizen0 + tenure.0.1.year + tenure.1.2.years + 
##     MultipleLinesNo + InternetServiceDSL + InternetServiceFiber.optic + 
##     OnlineSecurityNo + TechSupportNo + StreamingTVNo + StreamingMoviesNo + 
##     ContractMonth.to.month + ContractOne.year + PaperlessBillingNo + 
##     PaymentMethodElectronic.check + MonthlyCharges + TotalCharges, 
##     family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0193  -0.6826  -0.2956   0.6495   3.0753  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -3.252e+00  2.776e-01 -11.715  < 2e-16 ***
## SeniorCitizen0                -2.000e-01  9.285e-02  -2.154 0.031210 *  
## tenure.0.1.year                1.049e+00  1.434e-01   7.312 2.64e-13 ***
## tenure.1.2.years               1.975e-01  1.384e-01   1.427 0.153487    
## MultipleLinesNo               -3.400e-01  8.252e-02  -4.121 3.78e-05 ***
## InternetServiceDSL             1.061e+00  3.354e-01   3.163 0.001559 ** 
## InternetServiceFiber.optic     2.359e+00  5.024e-01   4.696 2.65e-06 ***
## OnlineSecurityNo               3.262e-01  9.982e-02   3.267 0.001085 ** 
## TechSupportNo                  3.021e-01  1.010e-01   2.992 0.002769 ** 
## StreamingTVNo                 -3.637e-01  1.049e-01  -3.467 0.000526 ***
## StreamingMoviesNo             -4.182e-01  1.040e-01  -4.022 5.76e-05 ***
## ContractMonth.to.month         1.703e+00  1.887e-01   9.029  < 2e-16 ***
## ContractOne.year               8.780e-01  1.926e-01   4.559 5.14e-06 ***
## PaperlessBillingNo            -3.053e-01  8.377e-02  -3.644 0.000268 ***
## PaymentMethodElectronic.check  3.043e-01  7.813e-02   3.895 9.81e-05 ***
## MonthlyCharges                -8.969e-03  5.609e-03  -1.599 0.109839    
## TotalCharges                  -1.266e-04  4.066e-05  -3.114 0.001846 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6477.0  on 5624  degrees of freedom
## Residual deviance: 4688.4  on 5608  degrees of freedom
## AIC: 4722.4
## 
## Number of Fisher Scoring iterations: 6

predTrainProb.log = predict(model.log, train, type = "response")
predTestProb.log = predict(model.log, test, type = "response")
train_pred <- factor(ifelse(predTrainProb.log > 0.5, "Yes", "No"))
train_actual <- factor(ifelse(train$Churn == 1, "Yes", "No"))
test_pred <- factor(ifelse(predTestProb.log  > 0.5, "Yes", "No"))
test_actual <- factor(ifelse(test$Churn == 1, "Yes", "No"))
confusionMatrix(data = train_pred, reference = train_actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  3789  729
##        Yes  359  748
##                                          
##                Accuracy : 0.8066         
##                  95% CI : (0.796, 0.8168)
##     No Information Rate : 0.7374         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4567         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9135         
##             Specificity : 0.5064         
##          Pos Pred Value : 0.8386         
##          Neg Pred Value : 0.6757         
##              Prevalence : 0.7374         
##          Detection Rate : 0.6736         
##    Detection Prevalence : 0.8032         
##       Balanced Accuracy : 0.7099         
##                                          
##        'Positive' Class : No             
##

confusionMatrix(data = test_pred, reference = test_actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  924 201
##        Yes  91 191
##                                           
##                Accuracy : 0.7925          
##                  95% CI : (0.7703, 0.8134)
##     No Information Rate : 0.7214          
##     P-Value [Acc > NIR] : 5.600e-10       
##                                           
##                   Kappa : 0.4351          
##  Mcnemar's Test P-Value : 1.785e-10       
##                                           
##             Sensitivity : 0.9103          
##             Specificity : 0.4872          
##          Pos Pred Value : 0.8213          
##          Neg Pred Value : 0.6773          
##              Prevalence : 0.7214          
##          Detection Rate : 0.6567          
##    Detection Prevalence : 0.7996          
##       Balanced Accuracy : 0.6988          
##                                           
##        'Positive' Class : No              
##

accuracyTrain.log = confusionMatrix(data = train_pred, reference = train_actual)$overall["Accuracy"]
accuracyTest.log = confusionMatrix(data = test_pred, reference = test_actual)$overall["Accuracy"]

So, for this model sensitivity (0.9103) and accuracy (0.7925) are quite high, while specificity (0.4872) could be better.
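The coefficients in the summary above are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate them into odds ratios. The sketch below does this for three estimates copied from the output; the selection of rows is arbitrary.

```r
# Selected estimates copied from the summary(model.log) output above.
est <- c(ContractMonth.to.month = 1.703,
         tenure.0.1.year = 1.049,
         MultipleLinesNo = -0.340)

# exp() turns a log-odds coefficient into an odds ratio.
odds_ratio <- exp(est)
print(round(odds_ratio, 2))
```

An odds ratio above 1 means higher odds of churn with the other predictors held fixed. For example, a month-to-month contract multiplies the odds of churn by about 5.5 in this model, while a ratio below 1 (MultipleLinesNo) lowers them.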

SVM

library(e1071)
model.svm = svm(Churn~., data = train, gamma=0.05, cost=10)
summary(model.svm)

## 
## Call:
## svm(formula = Churn ~ ., data = train, gamma = 0.05, cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.05 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  4141

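# Churn is numeric, so svm() fitted a regression (eps-regression)
# and predict() returns numeric scores that are thresholded below.
predTrain.svm = predict(model.svm, train)
predTest.svm = predict(model.svm, test)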
train_pred <- factor(ifelse(predTrain.svm > 0.5, "Yes", "No"))
train_actual <- factor(ifelse(train$Churn == 1, "Yes", "No"))
test_pred <- factor(ifelse(predTest.svm > 0.5, "Yes", "No"))
test_actual <- factor(ifelse(test$Churn == 1, "Yes", "No"))
confusionMatrix(data = train_pred, reference = train_actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  4014  146
##        Yes  134 1331
##                                           
##                Accuracy : 0.9502          
##                  95% CI : (0.9442, 0.9558)
##     No Information Rate : 0.7374          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8711          
##  Mcnemar's Test P-Value : 0.5109          
##                                           
##             Sensitivity : 0.9677          
##             Specificity : 0.9012          
##          Pos Pred Value : 0.9649          
##          Neg Pred Value : 0.9085          
##              Prevalence : 0.7374          
##          Detection Rate : 0.7136          
##    Detection Prevalence : 0.7396          
##       Balanced Accuracy : 0.9344          
##                                           
##        'Positive' Class : No              
##

confusionMatrix(data = test_pred, reference = test_actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  865 212
##        Yes 150 180
##                                          
##                Accuracy : 0.7427         
##                  95% CI : (0.719, 0.7654)
##     No Information Rate : 0.7214         
##     P-Value [Acc > NIR] : 0.038880       
##                                          
##                   Kappa : 0.3273         
##  Mcnemar's Test P-Value : 0.001346       
##                                          
##             Sensitivity : 0.8522         
##             Specificity : 0.4592         
##          Pos Pred Value : 0.8032         
##          Neg Pred Value : 0.5455         
##              Prevalence : 0.7214         
##          Detection Rate : 0.6148         
##    Detection Prevalence : 0.7655         
##       Balanced Accuracy : 0.6557         
##                                          
##        'Positive' Class : No             
##

accuracyTrain.svm = confusionMatrix(data = train_pred, reference = train_actual)$overall["Accuracy"]
accuracyTest.svm = confusionMatrix(data = test_pred, reference = test_actual)$overall["Accuracy"]

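For the SVM, sensitivity (0.8522) and accuracy (0.7427) are acceptable, while specificity (0.4592) is low. The gap between training accuracy (0.9502) and test accuracy (0.7427) also suggests that the model overfits.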

Choosing the best model

Accuracy

accuracyTest.svm

## Accuracy 
## 0.742715

accuracyTest.log

##  Accuracy 
## 0.7924662

accuracyTest.tree

##  Accuracy 
## 0.7874911

According to test accuracy, the best model is logistic regression (0.7924662), followed by the tree (0.7874911) and the SVM (0.742715). However, there are other metrics that we can use to evaluate the models.
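Putting the training and test accuracies side by side makes the comparison easier to read. The sketch below only reuses the accuracies printed by `confusionMatrix()` above; the `gap` column is an added convenience, not something computed in the report.

```r
# Accuracies copied from the confusionMatrix() outputs above.
acc <- data.frame(
  model = c("Tree", "Logistic regression", "SVM"),
  train = c(0.7980, 0.8066, 0.9502),
  test  = c(0.7875, 0.7925, 0.7427)
)
acc$gap <- acc$train - acc$test   # a large gap hints at overfitting
acc <- acc[order(-acc$test), ]    # best test accuracy first
print(acc, row.names = FALSE)
```

Logistic regression comes first on test accuracy, and the SVM shows by far the largest gap between training and test performance.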

ROC & AUC

library(pROC)
ROCtree = roc(response = test$Churn, predictor = predTest.tree)
pROC::auc(ROCtree)

## Area under the curve: 0.8062

ROClog = roc(response = test$Churn, predictor = predTestProb.log)
pROC::auc(ROClog)

## Area under the curve: 0.855

ROCsvm = roc(response = test$Churn, predictor = predTest.svm)
pROC::auc(ROCsvm)

## Area under the curve: 0.7325

We can see that the area under the curve is 0.8062 for the tree, 0.855 for logistic regression and 0.7325 for the SVM.

To sum up, logistic regression performs better than the SVM and the tree on both accuracy and AUC, so it is the best model.
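For intuition, AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it from ranks (the Mann-Whitney formulation) on made-up scores. `auc_rank` is a hypothetical helper written for this example and is not part of pROC.

```r
# AUC as the probability that a random positive outranks a random negative.
# auc_rank is a hypothetical helper name, not a pROC function.
auc_rank <- function(response, score) {
  r <- rank(score)  # average ranks handle ties
  n_pos <- sum(response == 1)
  n_neg <- sum(response == 0)
  (sum(r[response == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: four customers, two of whom churned (made-up scores).
churn <- c(0, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.80)
print(auc_rank(churn, score))  # 0.75: three of the four churner/non-churner pairs are ordered correctly
```

On the real test set the same function would take `test$Churn` and one of the prediction vectors as input, assuming the response is coded 0/1 as in the report.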

Global interpretation

Using global interpretation we can make conclusions on the relationships in the model as a whole, on the behavior and significance of variables.

Importance of variables

Logistic regression (the values match the absolute z statistics of the model summary):

##                                Overall
## SeniorCitizen0                2.154389
## tenure.0.1.year               7.311636
## tenure.1.2.years              1.427321
## MultipleLinesNo               4.120676
## InternetServiceDSL            3.163427
## InternetServiceFiber.optic    4.696191
## OnlineSecurityNo              3.267393
## TechSupportNo                 2.992261
## StreamingTVNo                 3.467116
## StreamingMoviesNo             4.022305
## ContractMonth.to.month        9.028807
## ContractOne.year              4.559113
## PaperlessBillingNo            3.644309
## PaymentMethodElectronic.check 3.895293
## MonthlyCharges                1.598918
## TotalCharges                  3.113997

Tree:

##                                           Overall
## ContractMonth.to.month                 0.16087290
## InternetServiceDSL                     0.07494298
## InternetServiceFiber.optic             0.17056739
## InternetServiceNo                      0.07494298
## MonthlyCharges                         0.05658639
## OnlineBackupNo                         0.02575336
## OnlineBackupNo.internet.service        0.07494298
## OnlineSecurityNo                       0.22182897
## OnlineSecurityNo.internet.service      0.07494298
## OnlineSecurityYes                      0.02554801
## PaymentMethodElectronic.check          0.04235429
## TechSupportNo                          0.27314757
## tenure.0.1.year                        0.21315584
## tenure.4.5.years                       0.02369595
## TotalCharges                           0.13512466
## genderFemale                           0.00000000
## genderMale                             0.00000000
## SeniorCitizen0                         0.00000000
## SeniorCitizen1                         0.00000000
## PartnerNo                              0.00000000
## PartnerYes                             0.00000000
## DependentsNo                           0.00000000
## DependentsYes                          0.00000000
## tenure.1.2.years                       0.00000000
## tenure.2.3.years                       0.00000000
## tenure.3.4.years                       0.00000000
## tenure.5.6.years                       0.00000000
## PhoneServiceNo                         0.00000000
## PhoneServiceYes                        0.00000000
## MultipleLinesNo                        0.00000000
## MultipleLinesNo.phone.service          0.00000000
## MultipleLinesYes                       0.00000000
## OnlineBackupYes                        0.00000000
## DeviceProtectionNo                     0.00000000
## DeviceProtectionNo.internet.service    0.00000000
## DeviceProtectionYes                    0.00000000
## TechSupportNo.internet.service         0.00000000
## TechSupportYes                         0.00000000
## StreamingTVNo                          0.00000000
## StreamingTVNo.internet.service         0.00000000
## StreamingTVYes                         0.00000000
## StreamingMoviesNo                      0.00000000
## StreamingMoviesNo.internet.service     0.00000000
## StreamingMoviesYes                     0.00000000
## ContractOne.year                       0.00000000
## ContractTwo.year                       0.00000000
## PaperlessBillingNo                     0.00000000
## PaperlessBillingYes                    0.00000000
## PaymentMethodBank.transfer..automatic. 0.00000000
## PaymentMethodCredit.card..automatic.   0.00000000
## PaymentMethodMailed.check              0.00000000

In the logistic regression the most important variables are Contract (month-to-month), tenure (0-1 year), InternetService (fiber optic), MultipleLines and StreamingMovies, while in the tree they are TechSupport, OnlineSecurity, tenure (0-1 year), InternetService (fiber optic) and Contract (month-to-month). Both models rank InternetService among the significant features. So, in most cases services are important if we would like to predict churn.

Unfortunately, variable importance is not directly available for the SVM.
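The call that produced the importance tables is not shown in the report. For the logistic regression, the listed values coincide with the absolute z statistics from the model summary, so a ranking can be reproduced from those alone. The sketch below uses a subset of the z values copied from the summary output; treating |z| as the importance measure is an assumption based on that match.

```r
# z values copied from the summary(model.log) output above (a subset).
z <- c(ContractMonth.to.month = 9.029,
       tenure.0.1.year = 7.312,
       InternetServiceFiber.optic = 4.696,
       MultipleLinesNo = -4.121,
       StreamingMoviesNo = -4.022,
       MonthlyCharges = -1.599)

# Rank predictors by the absolute z statistic, largest first.
importance <- sort(abs(z), decreasing = TRUE)
print(importance)
```

Note that this measure ignores the sign: it says how strongly a predictor is tied to churn in the model, not in which direction.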

ICEbox

ICEbox is a tool that allows us to trace how the prediction of the target variable changes when a predictor changes.

So, we can look at how the most important variables influence the model's predictions.

Let’s see how the prediction changes when the most significant variables change.

So, according to the plot, tenure.0.1.year is positively associated with Churn: customers in their first year are more likely to churn, so customers with longer tenure are more likely to stay.

MultipleLinesNo is negatively associated with Churn (its coefficient is -0.34); in other words, according to the model, customers who do not have MultipleLines are less likely to churn.

People who are not provided with StreamingMovies have a high chance of churning in the future.

Customers who have InternetServiceFiber.optic are more likely to stay with the company.
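The ICE idea described at the start of this section can be sketched without the ICEbox package: for every customer, one predictor is varied over a grid while the other predictors keep their observed values, and the model's prediction is recorded each time. Everything in the sketch below is an assumption made for illustration: the simulated data, the names `toy`, `charges` and `short_tenure`, and the coefficients used to generate churn.

```r
set.seed(42)  # arbitrary seed for reproducible toy data
n <- 200
toy <- data.frame(charges = runif(n, 20, 120),
                  short_tenure = rbinom(n, 1, 0.4))
# Simulated churn: higher charges and short tenure raise the churn probability.
p <- plogis(-3 + 0.03 * toy$charges + 1.0 * toy$short_tenure)
toy$churn <- rbinom(n, 1, p)

fit <- glm(churn ~ charges + short_tenure, data = toy, family = binomial)

# ICE: for every customer, vary one predictor over a grid and keep the rest fixed.
grid <- seq(20, 120, by = 20)
ice <- sapply(grid, function(g) {
  modified <- toy
  modified$charges <- g
  predict(fit, newdata = modified, type = "response")
})
colnames(ice) <- grid  # one row per customer, one column per grid value

print(round(head(ice, 3), 3))   # the first three individual curves
print(round(colMeans(ice), 3))  # their average is the partial dependence curve
```

Each row of `ice` is one customer's curve, and plotting the rows (for example with `matplot(grid, t(ice), type = "l")`) gives the familiar ICE picture. For a logistic regression every curve moves in the direction of the coefficient's sign.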

Final report

Conclusions:

More customers stay (73%) than churn (27%) in this period.
It is clear that customers who stay with the company longer are less likely to churn. Our analysis suggests that socio-demographic characteristics such as gender, senior-citizen status, and having a partner or dependents do not have a significant impact on churn. According to the models (logistic regression, SVM, tree), the types of services a customer uses significantly influence the decision to churn.
The most significant are StreamingMovies, InternetService and MultipleLines.
So, if a customer uses several services, he or she is less likely to churn.
We should raise customer engagement by using websites and social media to promote our services. We can also make temporary free offers that show customers how the services work in practice.


Lkhasaranova Yumzhana

02 12 2018