There was a problem uploading the HTML file, so here is a link to RPubs: http://rpubs.com/Yuma02/BAhomework2
str(c1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 21
## .. ..$ customerID : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ SeniorCitizen : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Partner : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Dependents : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ tenure : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PhoneService : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MultipleLines : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ InternetService : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ OnlineSecurity : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ OnlineBackup : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ DeviceProtection: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ TechSupport : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ StreamingTV : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ StreamingMovies : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Contract : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ PaperlessBilling: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ PaymentMethod : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MonthlyCharges : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ TotalCharges : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Churn : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
So the data types need to be reformatted. For example, SeniorCitizen should be a character variable, while tenure should be numeric. Let’s convert them. The other variables seem to be fine.
c1$SeniorCitizen = as.character(c1$SeniorCitizen)
c1$tenure = as.numeric(c1$tenure)
Unfortunately, there are some missing values in the dataset (in TotalCharges). This can influence the analysis, so I will delete the rows that contain them.
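The deletion step itself is not shown in the report. A minimal sketch of how rows with missing values could be dropped, using a small made-up data frame (the column names mirror `c1`, the values are invented) and only base R:

```r
# Toy data frame (hypothetical values) with one missing TotalCharges
toy <- data.frame(
  tenure         = c(0, 12, 30),
  MonthlyCharges = c(52.5, 70.0, 20.0),
  TotalCharges   = c(NA, 840.0, 600.0)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows that have no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

The same pattern applied to the real data would be `c1 <- c1[complete.cases(c1), ]`; whether the author used this or another function such as `na.omit()` is not known.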
hchart(c1$tenure, color = "blue") %>%
hc_title(text = "Tenure")
summary(c1$tenure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 29.00 32.42 55.00 72.00
From the table and the graph, the highest tenure bin is 70 to 75 (532 observations), while the lowest bin (0-5) contains 1371 observations; the mean tenure is 32.42.
hchart(c1$MonthlyCharges, color = "green") %>%
hc_title(text = "MonthlyCharges")
summary(c1$MonthlyCharges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.25 35.59 70.35 64.80 89.86 118.75
The lowest monthly charges bin is 0 to 20 (656 observations), while the highest bin (115-120) contains 64 observations; the mean monthly charge is 64.80.
hchart(c1$TotalCharges, color = "red") %>%
hc_title(text = "TotalCharges")
summary(c1$TotalCharges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.8 401.4 1397.5 2283.3 3794.7 8684.8
The lowest total charges bin is 0 to 500 (2000 observations), while the highest bin contains 70 observations; the mean total charges value is 2283.3.
We can see that none of these variables is normally distributed, so we will take this into account in the further analysis.
plot_correlation(c1, type = 'continuous')
From the correlation plot we can see that TotalCharges is positively correlated with MonthlyCharges (0.65) and tenure (0.83). As both TotalCharges and MonthlyCharges are positively correlated with tenure, we can delete one of the variables.
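A rough sketch of how such a correlation check can be done numerically with base R `cor()`. The data are simulated, and the relation `TotalCharges = tenure * MonthlyCharges` is an assumption made only for this toy example, not a property confirmed for the real dataset; the 0.6 threshold is also an arbitrary choice.

```r
set.seed(1)
n <- 500
tenure  <- sample(1:72, n, replace = TRUE)
monthly <- runif(n, 20, 120)
total   <- tenure * monthly   # assumption made only for this toy example

toy  <- data.frame(tenure, MonthlyCharges = monthly, TotalCharges = total)
cors <- cor(toy)
print(round(cors, 2))

# List the variable pairs whose absolute correlation exceeds a chosen threshold
threshold <- 0.6
high  <- which(abs(cors) > threshold & upper.tri(cors), arr.ind = TRUE)
pairs <- data.frame(var1 = rownames(cors)[high[, "row"]],
                    var2 = colnames(cors)[high[, "col"]],
                    r    = cors[high])
print(pairs)
```

Pairs listed in `pairs` are candidates for dropping one of the two variables.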
Almost equal numbers of males and females churn (about 1000 people each), so, judging by the bar charts, there is no noticeable association between gender and churn.
In this case we see that clients who do not have partners are more likely to churn (1200), but we cannot say this with complete certainty.
We cannot claim any association here, because the groups are of unequal size. There are more people who do not have dependents in our dataset, so there are more clients who churn in this group (1500 people).
The number of people who churn is higher among non-senior citizens (about 1500). However, there are more non-senior citizens overall, so we cannot claim an association.
The churn rate is much higher for Fiber optic InternetService (0.5) than for DSL or no internet service;
From the graphs it is clear that clients who do not have OnlineSecurity, OnlineBackup and TechSupport are the ones who tend to churn;
As for DeviceProtection and StreamingTV, we cannot say that customers who have these services churn more or less often;
The churn rates do not differ significantly between clients with and without PhoneService, StreamingMovies and MultipleLines;
A significant share of clients with a month-to-month contract left the company, compared with one-year or two-year contracts;
The churn rate is higher among clients with paperless billing;
Customers who use the Electronic check PaymentMethod are more likely to leave the company than those who use other options.
hcboxplot( x = c1$tenure, var = c1$Churn) %>%
hc_chart(type = "column")%>%
hc_title(text = "Churn and tenure")
As expected, people with longer tenure are less likely to churn: the median tenure of clients who did not churn is 38, while the median tenure of clients who churned is 10.
hcboxplot( x = c1$TotalCharges, var = c1$Churn) %>%
hc_chart(type = "column")%>%
hc_title(text = "Churn and Total Charges")
From the boxplots we can see that the numbers of observations are not equal for people who churn and people who do not. However, the median total charges of people who do not churn is higher (1683) than the median of people who churn (704).
hcboxplot( x = c1$MonthlyCharges, var = c1$Churn) %>%
hc_chart(type = "column")%>%
hc_title(text = "Churn and Monthly Charges")
These boxplots show that the median monthly charge of people who churn is higher: it is approximately 80, while for clients who do not churn the median monthly charge is close to 64.
Before starting the analysis we should recode tenure. It is a numeric variable, but in that form it is more difficult to interpret, so we turn it into a categorical one. Some models (SVM) do not work with categorical variables, so the categories are dummy-coded afterwards.
# cut() bins the numeric tenure in one step, so no comparison is made on a column
# that has already been turned into text; intervals are (lower, upper], 0 is included in the first bin
c1$tenure <- cut(c1$tenure,
                 breaks = c(0, 12, 24, 36, 48, 60, 72),
                 labels = c('0-1 year', '1-2 years', '2-3 years',
                            '3-4 years', '4-5 years', '5-6 years'),
                 include.lowest = TRUE)
Based on the EDA, tenure can be divided into several categories. I decided to use intervals of 1 year (12 months).
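library(caret) # provides dummyVars, createDataPartition and confusionMatrix
c1$Churn <- ifelse(c1$Churn == "Yes", 1, 0)
dmy <- dummyVars(" ~ .", data = c1)
dmy <- data.frame(predict(dmy, newdata = c1))
set.seed(123)
ind = createDataPartition(dmy$Churn, p = 0.2, list = FALSE) # 20% of the rows go to the test set
test = dmy[ind,]
train = dmy[-ind,]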
As we have some categorical variables, we should code them as dummy variables, because SVM works only with numeric variables.
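To illustrate what dummy coding does, here is a small sketch on invented data. It uses base R `model.matrix()` as a stand-in for the `dummyVars()` call above, so the resulting column names differ slightly from those in the report.

```r
# Toy data (hypothetical values): one categorical and one numeric column
toy <- data.frame(
  Contract       = factor(c("Month-to-month", "One year", "Two year", "One year")),
  MonthlyCharges = c(29.9, 57.0, 42.3, 70.7)
)

# "~ . - 1" drops the intercept, so the factor gets one 0/1 column per level
dummies <- model.matrix(~ . - 1, data = toy)
print(colnames(dummies))
print(dummies)
```

Each row has exactly one 1 among the three Contract columns, and the numeric column is passed through unchanged.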
library(partykit)
## Loading required package: grid
library(rpart)
model.tree = rpart::rpart(Churn ~., data = train)
predTrain.tree = predict(model.tree, train)
predTest.tree = predict(model.tree, test)
train_pred <- factor(ifelse(predTrain.tree > 0.5, "Yes", "No"))
train_actual <- factor(ifelse(train$Churn == 1, "Yes", "No"))
test_pred <- factor(ifelse(predTest.tree > 0.5, "Yes", "No"))
test_actual <- factor(ifelse(test$Churn == 1, "Yes", "No"))
confusionMatrix(data = train_pred, reference = train_actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3727 715
## Yes 421 762
##
## Accuracy : 0.798
## 95% CI : (0.7873, 0.8085)
## No Information Rate : 0.7374
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4428
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8985
## Specificity : 0.5159
## Pos Pred Value : 0.8390
## Neg Pred Value : 0.6441
## Prevalence : 0.7374
## Detection Rate : 0.6626
## Detection Prevalence : 0.7897
## Balanced Accuracy : 0.7072
##
## 'Positive' Class : No
##
confusionMatrix(data = test_pred, reference = test_actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 913 197
## Yes 102 195
##
## Accuracy : 0.7875
## 95% CI : (0.7652, 0.8086)
## No Information Rate : 0.7214
## P-Value [Acc > NIR] : 8.072e-09
##
## Kappa : 0.4289
## Mcnemar's Test P-Value : 5.444e-08
##
## Sensitivity : 0.8995
## Specificity : 0.4974
## Pos Pred Value : 0.8225
## Neg Pred Value : 0.6566
## Prevalence : 0.7214
## Detection Rate : 0.6489
## Detection Prevalence : 0.7889
## Balanced Accuracy : 0.6985
##
## 'Positive' Class : No
##
accuracyTrain.tree = confusionMatrix(data = train_pred, reference = train_actual)$overall["Accuracy"]
accuracyTest.tree = confusionMatrix(data = test_pred, reference = test_actual)$overall["Accuracy"]
As we can see, accuracy (0.7875) and sensitivity (0.8995) are good, but specificity is quite low (0.4974) on the test dataset. Since the positive class is "No", this means that only about half of the clients who actually churn are detected. Maybe this is because we did not use the optimal cutoff. However, we cannot use it here, because the models should be compared under the same settings.
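As a check on how these metrics follow from the confusion matrix, the test-set counts of the tree reported above can be recomputed by hand. This is a base R sketch; the counts are copied from the output, and "No" is treated as the positive class, as in that output.

```r
# Test-set confusion matrix of the tree (rows = prediction, columns = reference)
cm <- matrix(c(913, 102, 197, 195), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)               # correct predictions / all cases
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # non-churners predicted as "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # churners predicted as "Yes"

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The rounded values reproduce the 0.7875, 0.8995 and 0.4974 shown in the `confusionMatrix` output.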
library(MASS) # provides stepAIC
model.log = glm(Churn~., data = train, family = binomial)
model.log <- stepAIC(model.log, trace = 0)
summary(model.log) # stepAIC is used to select variables and improve the model. Unfortunately, this method cannot be used for the other models.
##
## Call:
## glm(formula = Churn ~ SeniorCitizen0 + tenure.0.1.year + tenure.1.2.years +
## MultipleLinesNo + InternetServiceDSL + InternetServiceFiber.optic +
## OnlineSecurityNo + TechSupportNo + StreamingTVNo + StreamingMoviesNo +
## ContractMonth.to.month + ContractOne.year + PaperlessBillingNo +
## PaymentMethodElectronic.check + MonthlyCharges + TotalCharges,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0193 -0.6826 -0.2956 0.6495 3.0753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.252e+00 2.776e-01 -11.715 < 2e-16 ***
## SeniorCitizen0 -2.000e-01 9.285e-02 -2.154 0.031210 *
## tenure.0.1.year 1.049e+00 1.434e-01 7.312 2.64e-13 ***
## tenure.1.2.years 1.975e-01 1.384e-01 1.427 0.153487
## MultipleLinesNo -3.400e-01 8.252e-02 -4.121 3.78e-05 ***
## InternetServiceDSL 1.061e+00 3.354e-01 3.163 0.001559 **
## InternetServiceFiber.optic 2.359e+00 5.024e-01 4.696 2.65e-06 ***
## OnlineSecurityNo 3.262e-01 9.982e-02 3.267 0.001085 **
## TechSupportNo 3.021e-01 1.010e-01 2.992 0.002769 **
## StreamingTVNo -3.637e-01 1.049e-01 -3.467 0.000526 ***
## StreamingMoviesNo -4.182e-01 1.040e-01 -4.022 5.76e-05 ***
## ContractMonth.to.month 1.703e+00 1.887e-01 9.029 < 2e-16 ***
## ContractOne.year 8.780e-01 1.926e-01 4.559 5.14e-06 ***
## PaperlessBillingNo -3.053e-01 8.377e-02 -3.644 0.000268 ***
## PaymentMethodElectronic.check 3.043e-01 7.813e-02 3.895 9.81e-05 ***
## MonthlyCharges -8.969e-03 5.609e-03 -1.599 0.109839
## TotalCharges -1.266e-04 4.066e-05 -3.114 0.001846 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6477.0 on 5624 degrees of freedom
## Residual deviance: 4688.4 on 5608 degrees of freedom
## AIC: 4722.4
##
## Number of Fisher Scoring iterations: 6
predTrainProb.log = predict(model.log, train, type = "response")
predTestProb.log = predict(model.log, test, type = "response")
train_pred <- factor(ifelse(predTrainProb.log > 0.5, "Yes", "No"))
train_actual <- factor(ifelse(train$Churn == 1, "Yes", "No"))
test_pred <- factor(ifelse(predTestProb.log > 0.5, "Yes", "No"))
test_actual <- factor(ifelse(test$Churn == 1, "Yes", "No"))
confusionMatrix(data = train_pred, reference = train_actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3789 729
## Yes 359 748
##
## Accuracy : 0.8066
## 95% CI : (0.796, 0.8168)
## No Information Rate : 0.7374
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4567
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9135
## Specificity : 0.5064
## Pos Pred Value : 0.8386
## Neg Pred Value : 0.6757
## Prevalence : 0.7374
## Detection Rate : 0.6736
## Detection Prevalence : 0.8032
## Balanced Accuracy : 0.7099
##
## 'Positive' Class : No
##
confusionMatrix(data = test_pred, reference = test_actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 924 201
## Yes 91 191
##
## Accuracy : 0.7925
## 95% CI : (0.7703, 0.8134)
## No Information Rate : 0.7214
## P-Value [Acc > NIR] : 5.600e-10
##
## Kappa : 0.4351
## Mcnemar's Test P-Value : 1.785e-10
##
## Sensitivity : 0.9103
## Specificity : 0.4872
## Pos Pred Value : 0.8213
## Neg Pred Value : 0.6773
## Prevalence : 0.7214
## Detection Rate : 0.6567
## Detection Prevalence : 0.7996
## Balanced Accuracy : 0.6988
##
## 'Positive' Class : No
##
accuracyTrain.log = confusionMatrix(data = train_pred, reference = train_actual)$overall["Accuracy"]
accuracyTest.log = confusionMatrix(data = test_pred, reference = test_actual)$overall["Accuracy"]
So, for this model sensitivity (0.9103) and accuracy (0.7925) are quite high, while specificity (0.4872) could be better.
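The effect of the cutoff on these numbers can be sketched with a few invented probabilities. The vectors `prob` and `actual` and the helper `rates_at()` are hypothetical and are not taken from the fitted model; the point is only that lowering the cutoff detects more churners at the cost of misclassifying more non-churners.

```r
# Hypothetical predicted churn probabilities and true labels (1 = churn)
prob   <- c(0.05, 0.10, 0.20, 0.25, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90)
actual <- c(0,    0,    0,    1,    0,    1,    0,    1,    1,    1)

# Share of churners and of non-churners classified correctly at a given cutoff
rates_at <- function(cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff            = cutoff,
    churn_detected    = mean(pred[actual == 1] == 1),
    nonchurn_detected = mean(pred[actual == 0] == 0))
}

res <- t(sapply(c(0.3, 0.5, 0.7), rates_at))
print(res)
```

With "No" as the positive class, `churn_detected` corresponds to specificity and `nonchurn_detected` to sensitivity in the outputs above.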
library(e1071)
model.svm = svm(Churn~., data = train, gamma=0.05, cost=10)
summary(model.svm)
##
## Call:
## svm(formula = Churn ~ ., data = train, gamma = 0.05, cost = 10)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 10
## gamma: 0.05
## epsilon: 0.1
##
##
## Number of Support Vectors: 4141
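# Churn is numeric, so svm() fits an eps-regression and predict() returns numeric scores;
# the type = "class" argument had no effect here and is dropped
predTrain.svm = predict(model.svm, train)
predTest.svm = predict(model.svm, test)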
train_pred <- factor(ifelse(predTrain.svm > 0.5, "Yes", "No"))
train_actual <- factor(ifelse(train$Churn == 1, "Yes", "No"))
test_pred <- factor(ifelse(predTest.svm > 0.5, "Yes", "No"))
test_actual <- factor(ifelse(test$Churn == 1, "Yes", "No"))
confusionMatrix(data = train_pred, reference = train_actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4014 146
## Yes 134 1331
##
## Accuracy : 0.9502
## 95% CI : (0.9442, 0.9558)
## No Information Rate : 0.7374
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8711
## Mcnemar's Test P-Value : 0.5109
##
## Sensitivity : 0.9677
## Specificity : 0.9012
## Pos Pred Value : 0.9649
## Neg Pred Value : 0.9085
## Prevalence : 0.7374
## Detection Rate : 0.7136
## Detection Prevalence : 0.7396
## Balanced Accuracy : 0.9344
##
## 'Positive' Class : No
##
confusionMatrix(data = test_pred, reference = test_actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 865 212
## Yes 150 180
##
## Accuracy : 0.7427
## 95% CI : (0.719, 0.7654)
## No Information Rate : 0.7214
## P-Value [Acc > NIR] : 0.038880
##
## Kappa : 0.3273
## Mcnemar's Test P-Value : 0.001346
##
## Sensitivity : 0.8522
## Specificity : 0.4592
## Pos Pred Value : 0.8032
## Neg Pred Value : 0.5455
## Prevalence : 0.7214
## Detection Rate : 0.6148
## Detection Prevalence : 0.7655
## Balanced Accuracy : 0.6557
##
## 'Positive' Class : No
##
accuracyTrain.svm = confusionMatrix(data = train_pred, reference = train_actual)$overall["Accuracy"]
accuracyTest.svm = confusionMatrix(data = test_pred, reference = test_actual)$overall["Accuracy"]
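For the SVM, sensitivity (0.8522) and accuracy (0.7427) are good, while specificity (0.4592) is low. Accuracy on the training set (0.9502) is much higher than on the test set, which suggests that the model overfits.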
accuracyTest.svm
## Accuracy
## 0.742715
accuracyTest.log
## Accuracy
## 0.7924662
accuracyTest.tree
## Accuracy
## 0.7874911
According to accuracy, the best model is logistic regression (0.7924662), while SVM scores 0.742715 and the tree 0.7874911. However, there are other metrics that we can use to evaluate the models.
library(pROC)
ROCtree = roc(response = test$Churn, predictor = predTest.tree)
pROC::auc(ROCtree)
## Area under the curve: 0.8062
ROClog = roc(response = test$Churn, predictor = predTestProb.log)
pROC::auc(ROClog)
## Area under the curve: 0.855
ROCsvm = roc(response = test$Churn, predictor = predTest.svm)
pROC::auc(ROCsvm)
## Area under the curve: 0.7325
We can see that the area under the curve is 0.8062 for the tree, 0.855 for logistic regression and 0.7325 for SVM.
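For intuition, AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A minimal base R sketch of this rank-based definition follows; the function `auc_rank()` and the toy vectors are hypothetical, and this is not how `pROC` is implemented internally.

```r
# Hypothetical scores and labels (1 = churn)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

auc_rank <- function(labels, scores) {
  r  <- rank(scores)            # average ranks are used for tied scores
  n1 <- sum(labels == 1)        # number of positives
  n0 <- sum(labels == 0)        # number of negatives
  # Mann-Whitney U statistic of the positives, scaled to [0, 1]
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_value <- auc_rank(labels, scores)
print(auc_value)
```

Here three of the four churner/non-churner pairs are ordered correctly, so the result is 0.75.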
To sum up, the results of logistic regression are better than those of SVM and the tree, so the best model is logistic regression according to both accuracy and AUC.
Global interpretation lets us draw conclusions about the relationships in the model as a whole, and about the behavior and significance of the variables.
## Overall
## SeniorCitizen0 2.154389
## tenure.0.1.year 7.311636
## tenure.1.2.years 1.427321
## MultipleLinesNo 4.120676
## InternetServiceDSL 3.163427
## InternetServiceFiber.optic 4.696191
## OnlineSecurityNo 3.267393
## TechSupportNo 2.992261
## StreamingTVNo 3.467116
## StreamingMoviesNo 4.022305
## ContractMonth.to.month 9.028807
## ContractOne.year 4.559113
## PaperlessBillingNo 3.644309
## PaymentMethodElectronic.check 3.895293
## MonthlyCharges 1.598918
## TotalCharges 3.113997
## Overall
## ContractMonth.to.month 0.16087290
## InternetServiceDSL 0.07494298
## InternetServiceFiber.optic 0.17056739
## InternetServiceNo 0.07494298
## MonthlyCharges 0.05658639
## OnlineBackupNo 0.02575336
## OnlineBackupNo.internet.service 0.07494298
## OnlineSecurityNo 0.22182897
## OnlineSecurityNo.internet.service 0.07494298
## OnlineSecurityYes 0.02554801
## PaymentMethodElectronic.check 0.04235429
## TechSupportNo 0.27314757
## tenure.0.1.year 0.21315584
## tenure.4.5.years 0.02369595
## TotalCharges 0.13512466
## genderFemale 0.00000000
## genderMale 0.00000000
## SeniorCitizen0 0.00000000
## SeniorCitizen1 0.00000000
## PartnerNo 0.00000000
## PartnerYes 0.00000000
## DependentsNo 0.00000000
## DependentsYes 0.00000000
## tenure.1.2.years 0.00000000
## tenure.2.3.years 0.00000000
## tenure.3.4.years 0.00000000
## tenure.5.6.years 0.00000000
## PhoneServiceNo 0.00000000
## PhoneServiceYes 0.00000000
## MultipleLinesNo 0.00000000
## MultipleLinesNo.phone.service 0.00000000
## MultipleLinesYes 0.00000000
## OnlineBackupYes 0.00000000
## DeviceProtectionNo 0.00000000
## DeviceProtectionNo.internet.service 0.00000000
## DeviceProtectionYes 0.00000000
## TechSupportNo.internet.service 0.00000000
## TechSupportYes 0.00000000
## StreamingTVNo 0.00000000
## StreamingTVNo.internet.service 0.00000000
## StreamingTVYes 0.00000000
## StreamingMoviesNo 0.00000000
## StreamingMoviesNo.internet.service 0.00000000
## StreamingMoviesYes 0.00000000
## ContractOne.year 0.00000000
## ContractTwo.year 0.00000000
## PaperlessBillingNo 0.00000000
## PaperlessBillingYes 0.00000000
## PaymentMethodBank.transfer..automatic. 0.00000000
## PaymentMethodCredit.card..automatic. 0.00000000
## PaymentMethodMailed.check 0.00000000
In the logistic regression the most important variables are tenure, MultipleLines, InternetService and StreamingMovies, while in the tree they are OnlineSecurity and ContractMonth; both of these models have InternetService as a significant feature. So, in most cases services are the more important variables if we would like to predict churn.
Unfortunately, variable importance is not available for the SVM.
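For the logistic regression, the importance values in the first table coincide with the absolute z values of the model summary (for example 2.154 for SeniorCitizen0 and 9.029 for ContractMonth.to.month), which is presumably how that table was produced. A sketch of this calculation on simulated data, with invented variable names `x1` and `x2`:

```r
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))   # x2 has no true effect

fit <- glm(y ~ x1 + x2, family = binomial)

# Importance as the absolute z statistic of each coefficient, intercept excluded
z   <- coef(summary(fit))[, "z value"]
imp <- sort(abs(z[names(z) != "(Intercept)"]), decreasing = TRUE)
print(imp)
```

A large absolute z value means the coefficient is estimated far from zero relative to its standard error; it does not by itself show the direction of the effect, which comes from the sign of the coefficient.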
ICEbox is a tool that allows us to trace how the prediction of the target variable changes when a predictor changes.
So, we can look at how the most important variables influence the model's predictions.
Let’s see how the prediction changes when the most significant variables change.
So, according to the plot, tenure.0.1.year has a positive relationship with Churn: clients in their first year are more likely to churn, while clients with longer tenure are more likely to stay.
MultipleLinesNo has a negative relationship with Churn (its coefficient in the model is -0.34); in other words, clients who do not have MultipleLines are less likely to churn.
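According to the model coefficients, StreamingMoviesNo is also negatively related to Churn (-0.42), so in this model clients who do not have StreamingMovies are less likely to churn.
Clients who have InternetServiceFiber.optic are more likely to churn (coefficient 2.36), which agrees with the high churn rate for fiber optic seen in the EDA.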
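The idea behind the ICE curves discussed above can be sketched by hand: for each observation, one predictor is varied over a grid while the other predictors are kept at their observed values, and the prediction is recorded. The example below uses simulated data with invented variable names and does not call the ICEbox package.

```r
set.seed(7)
n <- 200
toy <- data.frame(short_tenure = rbinom(n, 1, 0.4),
                  charges      = runif(n, 20, 120))
toy$churn <- rbinom(n, 1, plogis(-2 + 1.0 * toy$short_tenure + 0.01 * toy$charges))

fit <- glm(churn ~ short_tenure + charges, data = toy, family = binomial)

# For every observation, predict over a grid of `charges`, other columns unchanged
grid <- seq(20, 120, by = 20)
ice <- sapply(grid, function(g) {
  newdata <- toy
  newdata$charges <- g
  predict(fit, newdata = newdata, type = "response")
})
colnames(ice) <- grid

# Each row of `ice` is one ICE curve; the column means give the partial dependence curve
pd <- colMeans(ice)
print(round(pd, 3))
```

For a logistic regression without interactions every curve moves in the direction given by the sign of the coefficient, so the curves differ in level but not in direction.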
Conclusions:
More clients stay (about 73%) than churn (about 27%) in this period of time.
It is pretty clear that people who stay with the company longer are less likely to churn. In our analysis we conclude that socio-demographic characteristics such as gender, senior citizen status, and having a partner or dependents do not have a significant impact on the churn rate. According to the analysis (logistic regression, SVM, tree), the different types of services significantly influence customers' decisions.
The most significant are StreamingMovies, InternetService and MultipleLines.
So, if a customer uses several services, they are more likely not to churn.
We should raise client engagement with the company by using websites and social media to promote our services. Also, we can make temporary free offers that show clients how some services work in real life.