Here the Attrition dataset is used for a logistic regression analysis. Churn is the dependent variable, and we predict it from the independent variables. The code below imports the dataset and opens it in the viewer.
library(readxl)
attrition_excel <- read_excel("C:/Users/DELL/Desktop/Imarticus/Assignments excel/attrition excel.xlsx")
View(attrition_excel)
Next, assign the dataset to a shorter variable name, att, for the rest of the analysis.
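The assignment itself does not appear in the transcript; a minimal line consistent with the later code would be:
att = attrition_excel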
str() describes the structure of the dataset: it lists each variable and whether it is numeric or character.
str(att)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 160 obs. of 9 variables:
$ CustID : num 101901 102056 102522 103149 103866 ...
$ Gender : num 1 0 1 1 0 0 1 1 0 1 ...
$ Age : num 30 17 54 42 30 23 28 19 48 40 ...
$ Income : num 0 1 1 1 1 0 1 1 0 1 ...
$ FamilySize: num 5 1 4 2 2 4 2 5 4 5 ...
$ Education : num 20 12 18 17 12 16 18 16 15 16 ...
$ Calls : num 37 25 48 51 26 18 29 28 16 31 ...
$ Visits : num 3 1 3 2 1 0 2 1 3 3 ...
$ Churn : num 1 0 1 1 0 0 1 1 1 1 ...
summary() gives the minimum, maximum, quartiles, mean, and median of each variable, which provides a basic understanding of the dataset.
summary(att)
CustID Gender Age Income FamilySize
Min. :101901 Min. :0.0000 Min. :17.00 Min. :0.0000 Min. :1.000
1st Qu.:126719 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.:2.000
Median :151060 Median :1.0000 Median :31.00 Median :1.0000 Median :3.000
Mean :151144 Mean :0.5813 Mean :35.67 Mean :0.5062 Mean :3.131
3rd Qu.:176099 3rd Qu.:1.0000 3rd Qu.:46.00 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :199131 Max. :1.0000 Max. :82.00 Max. :1.0000 Max. :5.000
Education Calls Visits Churn
Min. :12.00 Min. : 3.00 Min. :0.000 Min. :0.0000
1st Qu.:12.00 1st Qu.:15.75 1st Qu.:1.000 1st Qu.:0.0000
Median :14.00 Median :22.00 Median :2.000 Median :1.0000
Mean :14.96 Mean :25.22 Mean :1.906 Mean :0.5312
3rd Qu.:17.00 3rd Qu.:32.00 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :20.00 Max. :65.00 Max. :5.000 Max. :1.0000
is.na() flags missing (NA, not available) values.
is.na(att)
CustID Gender Age Income FamilySize Education Calls Visits Churn
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[ remaining rows omitted -- every entry is FALSE ]
All values are FALSE (and summary() above reports no NA counts), so the dataset has no missing values.
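A more compact check, not part of the original transcript, is to count missing values per column:
colSums(is.na(att))   # each count is 0 when there are no NAs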
Converting the dependent variable and some of the other categorical variables to factors helps when plotting against the dependent variable and makes the model treat them as categorical.
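The conversion code is not shown in the transcript; a minimal sketch consistent with the model output below (where Income appears as the factor level Income1, while Gender stays numeric) would be:
att$Income = as.factor(att$Income)   # glm output below shows the level Income1
att$Churn = as.factor(att$Churn)     # 0/1 outcome treated as a class label
plot(Age ~ Churn, data = att)        # with Churn a factor, this draws boxplots of Age per class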
A correlogram shows the pairwise correlations between all the variables in the dataset.
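The plotting code is not in the transcript; one common way to draw a correlogram, assuming the corrplot package, is:
library(corrplot)
num_cols = sapply(att, is.numeric)              # cor() needs numeric columns only
corrplot(cor(att[num_cols]), method = "circle")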
Here we split the data into 80% for training and 20% for testing. The caTools package is used to split the data randomly while preserving the ratio of the dependent variable. Note that sample.split() must be given the outcome vector, not the whole data frame: passing the data frame returns one logical value per column (length 9), which is why the original subset() calls failed with "Length of logical index must be 1 or 160, not 9".
library(caTools)
sample_att = sample.split(att$Churn, SplitRatio = 0.8)   # split on the outcome, not the data frame
train_att = subset(att, sample_att == TRUE)
test_att = subset(att, sample_att == FALSE)
We train the model for prediction using all variables. Variables should not be removed on p-values alone; we also consider the AIC when deciding what to drop.
ctt_eq = glm(Churn ~ ., family = "binomial", data = train_att)
summary(ctt_eq)
Call:
glm(formula = Churn ~ ., family = "binomial", data = train_att)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2034 -0.8575 0.2465 0.7594 2.0255
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.131e+00 2.290e+00 -2.241 0.0250 *
CustID -1.279e-05 8.796e-06 -1.454 0.1459
Gender 1.134e+00 4.625e-01 2.451 0.0142 *
Age -2.520e-02 1.565e-02 -1.610 0.1075
Income1 1.303e+00 4.847e-01 2.687 0.0072 **
FamilySize 6.359e-01 2.559e-01 2.485 0.0130 *
Education 2.199e-01 9.391e-02 2.341 0.0192 *
Calls 3.612e-02 1.836e-02 1.967 0.0492 *
Visits 3.591e-01 1.578e-01 2.276 0.0229 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 172.32 on 124 degrees of freedom
Residual deviance: 123.78 on 116 degrees of freedom
AIC: 141.78
Number of Fisher Scoring iterations: 5
Next we remove variables one at a time and check the model's overall AIC (lower is better). Two variables were removed one by one and the AIC compared each time; a variable is dropped only if the AIC does not increase meaningfully. Since CustID is only an identifier and dropping it leaves the AIC essentially unchanged (141.9 vs 141.78), we remove CustID from the model.
ctt_final = glm(Churn ~ . - CustID, family = "binomial", data = train_att)
ctt_final
Call: glm(formula = Churn ~ . - CustID, family = "binomial", data = train_att)
Coefficients:
(Intercept) Gender Age Income1 FamilySize Education
-7.35642 1.18065 -0.02505 1.42304 0.65295 0.23682
Calls Visits
0.02934 0.38944
Degrees of Freedom: 124 Total (i.e. Null); 117 Residual
Null Deviance: 172.3
Residual Deviance: 125.9 AIC: 141.9
Now we generate predictions on the test data using the trained model and compare the predicted classes (at a chosen cutoff) with the actual values.
att_predict=predict(ctt_final,test_att,type = 'response')
att_predict
table(av=test_att$Churn,pv=att_predict>0.43)
After predicting, we need to check the accuracy of the predictions. The ROCR package is used to plot performance based on the actual and predicted values, and from the ROC curve we can pick a threshold (cutoff) value.
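The ROCR code is not shown in the transcript; a minimal sketch would be:
library(ROCR)
pred_obj = prediction(att_predict, test_att$Churn)   # pair predicted probabilities with actual labels
perf = performance(pred_obj, "tpr", "fpr")           # true positive rate vs false positive rate
plot(perf, colorize = TRUE)                          # ROC curve, colored by cutoff value
acc = performance(pred_obj, "acc")                   # accuracy as a function of the cutoff
plot(acc)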
For the chosen threshold we get an accuracy of about 83% (0.8333 below), so the predicted model is good to use.
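The accuracy computation is not shown in the transcript; one way to compute it from the confusion matrix above would be:
cm = table(av = test_att$Churn, pv = att_predict > 0.43)   # same cutoff as in the table() call above
att_accu = sum(diag(cm)) / sum(cm)                         # proportion of correct predictions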
att_accu
[1] 0.8333333