case_s=iteration
head(case_s)
The data shown above is analysed in order to predict Churn.
str(case_s)
case_s$Gender=factor(case_s$Gender)
case_s$Income=factor(case_s$Income)
case_s$Churn=factor(case_s$Churn)
The output above shows the structure of the data. Gender, Income and Churn are converted to factors because they are categorical variables.
summary(case_s)
The summary reports the central tendency (mean and median) and the range of each numeric variable.
The graph above shows the distribution of Age against Calls.
The graph above shows the relationship between Calls and Education.
Next, we look at the correlation between the variables.
Reddish colours indicate a strong positive correlation and blue indicates a strong negative correlation.
For model building and prediction, the data is split into a training set and a test set.
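The split itself is not shown in this report. The sketch below illustrates one common way to do it, under stated assumptions: the data frame `demo` and its values are simulated stand-ins (not the real `case_s` data), and the 75/25 proportion is inferred from the 120 training rows and 40 predictions reported further down.

```r
# Hypothetical stand-in data: 160 rows, so a 75/25 split gives 120/40
set.seed(42)
n <- 160
demo <- data.frame(
  Gender = factor(rbinom(n, 1, 0.5)),
  Age    = round(runif(n, 20, 70)),
  Calls  = rpois(n, 25)
)

# Simulated outcome, for illustration only
eta <- -3 + 1.2 * (demo$Gender == "1") + 0.1 * demo$Calls
demo$Churn <- factor(rbinom(n, 1, plogis(eta)))

# Random 75/25 split into training and test rows
train_idx  <- sample(seq_len(n), size = 0.75 * n)
train_demo <- demo[train_idx, ]
test_demo  <- demo[-train_idx, ]

# Logistic regression fitted on the training rows only
demo_glm <- glm(Churn ~ ., family = binomial(link = "logit"),
                data = train_demo)
```

Fixing the seed with `set.seed()` makes the split reproducible, so the same rows end up in the test set on every run.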
summary(case_glm)
Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = train_case)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.33731  -0.68973   0.09699   0.65465   2.54748
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.81025    2.42832  -4.452 8.52e-06 ***
Gender1       1.29634    0.49565   2.615 0.008911 **
Age          -0.01809    0.01528  -1.184 0.236330
Income1       1.96473    0.55288   3.554 0.000380 ***
FamilySize    1.31021    0.33967   3.857 0.000115 ***
Education     0.23322    0.10171   2.293 0.021846 *
Calls         0.06161    0.02244   2.746 0.006035 **
Visits        0.45354    0.16956   2.675 0.007478 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 166.22 on 119 degrees of freedom
Residual deviance: 105.83 on 112 degrees of freedom
AIC: 121.83
Number of Fisher Scoring iterations: 5
From the table above, the p-values show that every predictor except Age (p = 0.236) is significant at the 5% level. In logistic regression, however, the p-value is not the only criterion: the residual deviance and the AIC must also be considered.
summary(model_case2)
Call:
glm(formula = Churn ~ . - Education, family = "binomial", data = train_case)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5593  -0.7613   0.1213   0.7349   2.5640
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.33130    1.67112  -4.387 1.15e-05 ***
Gender1      1.36133    0.48545   2.804 0.005044 **
Age         -0.02168    0.01526  -1.421 0.155386
Income1      1.78174    0.52890   3.369 0.000755 ***
FamilySize   1.30276    0.32335   4.029 5.60e-05 ***
Calls        0.06945    0.02055   3.380 0.000726 ***
Visits       0.45257    0.16397   2.760 0.005779 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 166.22 on 119 degrees of freedom
Residual deviance: 111.46 on 113 degrees of freedom
AIC: 125.46
Number of Fisher Scoring iterations: 5
Age is not significant and Education is only weakly significant according to their p-values. However, the model without Education shown above has a higher AIC (125.46 against 121.83) and a higher residual deviance (111.46 against 105.83), so we do not exclude either of these variables from our model.
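The AIC values in the two summaries can be checked by hand. The sketch below copies the residual deviances from the output above and assumes the usual relation for a 0/1 response, where AIC equals the residual deviance plus twice the number of estimated coefficients. The likelihood-ratio test at the end is an addition for illustration and is not part of the original analysis.

```r
# Values copied from the two model summaries above
dev_full    <- 105.83; k_full    <- 8   # intercept + 7 slopes
dev_reduced <- 111.46; k_reduced <- 7   # Education dropped

# For an ungrouped binary response the saturated log-likelihood is zero,
# so AIC = residual deviance + 2 * (number of coefficients)
aic_full    <- dev_full    + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced
print(c(full = aic_full, reduced = aic_reduced))  # 121.83 and 125.46

# Likelihood-ratio test for dropping Education (1 degree of freedom)
lr_stat <- dev_reduced - dev_full
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(p_value)
```

Both computed AIC values match the summaries, and the likelihood-ratio p-value falls below 0.05, which points the same way as the AIC comparison: dropping Education makes the fit noticeably worse.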
predict_case
1 2 3 4 5
0.128561514 0.989731137 0.519249066 0.351355752 0.728860084
6 7 8 9 10
0.149581330 0.980514042 0.763882908 0.934801608 0.292808047
11 12 13 14 15
0.984283756 0.049200219 0.205409181 0.204013141 0.122600549
16 17 18 19 20
0.885635058 0.936579754 0.958000352 0.025657305 0.930563161
21 22 23 24 25
0.527723004 0.209611745 0.193657613 0.707139900 0.515624569
26 27 28 29 30
0.317410941 0.486299606 0.452784687 0.988710589 0.002240695
31 32 33 34 35
0.845787059 0.211410029 0.823854969 0.422863887 0.066522226
36 37 38 39 40
0.988104147 0.065648751 0.784132804 0.469726152 0.934879778
The table above shows the predicted probability of Churn for each of the 40 test observations.
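To see where such probabilities come from, the sketch below applies the inverse logit to a linear predictor built from the coefficients of the first model. The customer values in `x` are invented for illustration, and it is an assumption that `predict_case` was produced from that model with `predict(..., type = "response")`.

```r
# Coefficients copied from summary(case_glm) above
coefs <- c("(Intercept)" = -10.81025, Gender1 = 1.29634, Age = -0.01809,
           Income1 = 1.96473, FamilySize = 1.31021, Education = 0.23322,
           Calls = 0.06161, Visits = 0.45354)

# Hypothetical customer (values invented for illustration)
x <- c("(Intercept)" = 1, Gender1 = 1, Age = 40, Income1 = 1,
       FamilySize = 3, Education = 16, Calls = 30, Visits = 3)

# Linear predictor on the log-odds scale
eta <- sum(coefs * x[names(coefs)])

# Inverse logit turns log-odds into a probability between 0 and 1
p <- plogis(eta)
print(p)
```

`plogis(eta)` is the same as `1 / (1 + exp(-eta))`, which is the transformation a logit-link model applies when probabilities are requested.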
table_case
            predicted_value
actual_value FALSE TRUE
           0    17    3
           1     3   17
The confusion matrix above shows how many test observations were predicted correctly and incorrectly:
A11 = TRUE NEGATIVE (correctly predicted as Churn = 0)
A22 = TRUE POSITIVE (correctly predicted as Churn = 1)
A12 = FALSE POSITIVE (incorrectly predicted as Churn = 1)
A21 = FALSE NEGATIVE (incorrectly predicted as Churn = 0)
Next, we check the accuracy at the chosen threshold and look for the threshold value that gives the maximum accuracy.
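The four cells can be turned into summary metrics directly. The sketch below rebuilds the confusion matrix from the counts printed above; the object name `cm` is introduced here for illustration only.

```r
# Confusion matrix rebuilt from the counts in table_case
cm <- matrix(c(17, 3, 3, 17), nrow = 2,
             dimnames = list(actual_value    = c("0", "1"),
                             predicted_value = c("FALSE", "TRUE")))

tn <- cm["0", "FALSE"]  # A11, true negatives
fp <- cm["0", "TRUE"]   # A12, false positives
fn <- cm["1", "FALSE"]  # A21, false negatives
tp <- cm["1", "TRUE"]   # A22, true positives

accuracy    <- (tp + tn) / sum(cm)  # share of all correct predictions
sensitivity <- tp / (tp + fn)       # share of actual 1s that were caught
specificity <- tn / (tn + fp)       # share of actual 0s that were caught
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity))  # all three equal 0.85
```

Because the matrix is symmetric here, sensitivity and specificity coincide with the overall accuracy. With unbalanced classes the three numbers can differ considerably.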
print(paste("Accuracy of prediction is",acc_case*100,"percent"))
[1] "Accuracy of prediction is 85 percent"
The graph above is used to find the threshold that gives the maximum accuracy.
It shows that accuracy is highest for a threshold near 0.50.
The ROC curve shows the relationship between the true positive rate and the false positive rate.
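The code behind the threshold graph is not included in this report. A minimal sketch of such a sweep is given below; `actual` and `prob` are simulated stand-ins for the real test labels and predicted probabilities, so the best threshold found here says nothing about the real data.

```r
# Simulated stand-ins for the test labels and predicted probabilities
set.seed(1)
actual <- rbinom(40, 1, 0.5)
prob   <- plogis(rnorm(40, mean = ifelse(actual == 1, 1.5, -1.5)))

# Accuracy at each candidate threshold
thresholds <- seq(0.05, 0.95, by = 0.05)
accuracy <- sapply(thresholds, function(t) {
  mean((prob > t) == (actual == 1))
})

# Threshold with the highest accuracy (the first one in case of ties)
best <- thresholds[which.max(accuracy)]
print(best)

plot(thresholds, accuracy, type = "b",
     xlab = "Threshold", ylab = "Accuracy")
```

With only 40 test rows the accuracy curve is coarse, so several neighbouring thresholds often tie for the maximum.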
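The plotting code for the ROC curve is not shown. The sketch below computes the curve in base R rather than with a dedicated package, again on simulated stand-ins (`actual`, `prob`) for the real test labels and predicted probabilities.

```r
# Simulated stand-ins for the test labels and predicted probabilities
set.seed(1)
actual <- rbinom(40, 1, 0.5)
prob   <- plogis(rnorm(40, mean = ifelse(actual == 1, 1.5, -1.5)))

# Cut-offs from "predict nothing" (Inf) to "predict everything" (-Inf)
cuts <- c(Inf, sort(unique(prob), decreasing = TRUE), -Inf)

# True and false positive rate at each cut-off
tpr <- sapply(cuts, function(t) sum(prob > t & actual == 1) / sum(actual == 1))
fpr <- sapply(cuts, function(t) sum(prob > t & actual == 0) / sum(actual == 0))

plot(fpr, tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier

# Area under the curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```

The closer the curve runs to the top-left corner, and the closer the area under it is to 1, the better the model separates the two classes across all thresholds.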
plot(table_case1,col=c("yellow","blue"))
With a prediction accuracy of 85% (34 of the 40 test observations classified correctly), we conclude that the model predicts Churn well.