library(readxl)
attrition <- read_excel("C:/Users/Hxtreme/Desktop/attrition.xlsx")
my_attrition=attrition
The original data set is copied to my_attrition, a working copy, so that later changes leave attrition untouched.
View(my_attrition)
VIEW: Opens the entire data set in a spreadsheet-style viewer.
str(my_attrition)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 160 obs. of 9 variables:
$ CustID : num 101901 102056 102522 103149 103866 ...
$ Gender : num 1 0 1 1 0 0 1 1 0 1 ...
$ Age : num 30 17 54 42 30 23 28 19 48 40 ...
$ Income : num 0 1 1 1 1 0 1 1 0 1 ...
$ FamilySize: num 5 1 4 2 2 4 2 5 4 5 ...
$ Education : num 20 12 18 17 12 16 18 16 15 16 ...
$ Calls : num 37 25 48 51 26 18 29 28 16 31 ...
$ Visits : num 3 1 3 2 1 0 2 1 3 3 ...
$ Churn : num 1 0 1 1 0 0 1 1 1 1 ...
STR (Structure): A compact way to display the structure of an R object. Here it shows 160 observations of 9 variables, all stored as numeric.
summary(my_attrition)
CustID Gender Age Income FamilySize
Min. :101901 Min. :0.0000 Min. :17.00 Min. :0.0000 Min. :1.000
1st Qu.:126719 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:0.0000 1st Qu.:2.000
Median :151060 Median :1.0000 Median :31.00 Median :1.0000 Median :3.000
Mean :151144 Mean :0.5813 Mean :35.67 Mean :0.5062 Mean :3.131
3rd Qu.:176099 3rd Qu.:1.0000 3rd Qu.:46.00 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :199131 Max. :1.0000 Max. :82.00 Max. :1.0000 Max. :5.000
Education Calls Visits Churn
Min. :12.00 Min. : 3.00 Min. :0.000 Min. :0.0000
1st Qu.:12.00 1st Qu.:15.75 1st Qu.:1.000 1st Qu.:0.0000
Median :14.00 Median :22.00 Median :2.000 Median :1.0000
Mean :14.96 Mean :25.22 Mean :1.906 Mean :0.5312
3rd Qu.:17.00 3rd Qu.:32.00 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :20.00 Max. :65.00 Max. :5.000 Max. :1.0000
SUMMARY: A generic function. Applied to a data frame, as here, it reports the minimum, quartiles, median, mean and maximum of each numeric column; applied to a fitted model, it summarises the results of the fit.
my_attrition=my_attrition[,-1]
CustID is only an identifier and is not needed for the analysis, so the first column is removed.
my_attrition$Churn=as.factor(my_attrition$Churn)
my_attrition$Gender=as.factor(my_attrition$Gender)
my_attrition$Income=as.factor(my_attrition$Income)
FACTOR: Factors are data objects that categorize the data and store it as levels. They can store both strings and integers, and are useful for columns that have a limited number of unique values. Churn, Gender and Income are coded 0/1, so they are converted to factors.
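As a small illustration of the conversion above, the sketch below turns a 0/1 vector into a factor. The vector churn_codes is made up for the example and is not taken from the attrition data.

```r
# Hypothetical 0/1 codes, standing in for a column such as Churn
churn_codes <- c(1, 0, 1, 1, 0)

# as.factor() stores the distinct values as levels
churn_factor <- as.factor(churn_codes)

levels(churn_factor)      # "0" "1"
table(churn_factor)       # two 0s and three 1s
is.numeric(churn_factor)  # FALSE: it is now treated as categorical
```

Because the result is no longer numeric, glm() treats such a column as a category rather than as a quantity.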
library(ggplot2)
GGPLOT2 : The ggplot2 package offers a powerful graphics language for creating elegant and complex plots.
ggplot(my_attrition,aes(x=Income,y=Churn,col=Churn))+geom_jitter()+labs(title = "INCOME OF CHURN")
The plot above shows churn against income group. GEOM_JITTER: It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller data sets. Both variables here are 0/1 factors, so without jitter all 160 points would fall on at most four positions.
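The idea behind jittering can be sketched outside ggplot2 with base R's jitter(), which is a different function from geom_jitter() but adds the same kind of random noise. The data and the noise amount of 0.2 are arbitrary choices for the example.

```r
set.seed(1)  # arbitrary seed so the noise is reproducible

# Ten points that sit on only two distinct x positions
x <- rep(c(0, 1), each = 5)

# Add uniform noise of at most 0.2 in either direction
x_jittered <- jitter(x, amount = 0.2)

length(unique(x))           # 2: the raw points overlap
length(unique(x_jittered))  # the jittered points no longer coincide
max(abs(x_jittered - x))    # never more than 0.2
```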
ggplot(my_attrition,aes(Age,Calls))+stat_density2d(geom = "tile",aes(fill=..density..),contour = FALSE)+geom_point(colour="white")+labs(title = " 2D Plot ")
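The plot above shows the estimated 2D density of Age against Calls as filled tiles, with the individual observations overlaid as white points.
CONTOUR: Setting contour = FALSE makes stat_density2d() return the estimated density on a grid, which the tile geom fills, instead of drawing contour lines. The contour() function in base R is a separate generic function with only a default method; it creates a contour plot or adds contours to an existing plot.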
library("scatterplot3d")
SCATTERPLOT : a graph in which the values of two variables are plotted along two axes, the pattern of the resulting points revealing any correlation present.
# Columns 1:3 are Gender, Age and Income once CustID is dropped
scatterplot3d(my_attrition[,1:3],
main="3D Scatter Plot",
xlab = "Gender",
ylab = "Age",
zlab = "Income")
Unknown or uninitialised column: 'color'.
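SCATTERPLOT3D: Draws a three-dimensional scatter plot. The value it returns includes functions for adding supplementary points or regression planes to an already generated graphic. The message above is a warning, most likely raised because my_attrition is a tibble with no color column; the plot is still drawn.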
library(caTools)
CATOOLS : Contains several basic utility functions including: moving (rolling, running) window statistic functions, read/write for GIF and ENVI binary files, fast calculation of AUC, LogitBoost classifier, base64 encoder/decoder, round-off-error-free sum and cumsum, etc.
SAmple=sample.split(my_attrition,SplitRatio = 0.8)
SAmple
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
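SAMPLE.SPLIT: It is used to divide the data into two sets, a train set and a test set. It expects a vector of outcome labels, such as my_attrition$Churn, and returns one TRUE/FALSE value per element. Here the whole data frame was passed, so the result has only 8 elements, one per column (6 TRUE, 2 FALSE). The code below uses that vector to divide the data into train and test sets.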
Train=subset(my_attrition,SAmple=="TRUE")
Length of logical index must be 1 or 160, not 8
Train
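Train: The set used to fit the model. SplitRatio = 0.8 asks for an 80% share, but the 8-element split vector is recycled across the 160 rows (hence the message above), so every block of eight rows contributes six to Train: 120 rows, or 75%.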
Test=subset(my_attrition,SAmple=="FALSE")
Length of logical index must be 1 or 160, not 8
Test
Test: The remaining rows, held out to evaluate the model. With the recycled split vector this is 40 rows, or 25% of the data rather than the intended 20%.
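A split that yields one TRUE/FALSE per row can be sketched in base R as follows. This is an illustration under assumptions, not the code that produced the results in this document: the data frame toy, its class counts and the seed are invented. With caTools, passing the label vector, as in sample.split(my_attrition$Churn, SplitRatio = 0.8), should likewise return one value per row.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for my_attrition: 160 rows with a 0/1 outcome
toy <- data.frame(
  Calls = rpois(160, 25),
  Churn = factor(rep(c(0, 1), times = c(75, 85)))
)

# One TRUE/FALSE per ROW: sample 80% of the row indices within each class
in_train <- rep(FALSE, nrow(toy))
for (cls in levels(toy$Churn)) {
  rows <- which(toy$Churn == cls)
  n_train <- round(0.8 * length(rows))
  in_train[rows[sample.int(length(rows), n_train)]] <- TRUE
}

train <- toy[in_train, ]
test  <- toy[!in_train, ]

nrow(train)  # 128
nrow(test)   # 32
```

Sampling within each class keeps the proportion of churners roughly equal in both sets.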
my_attrition_eq=glm(Churn~.,Train,family="binomial")
my_attrition_eq
Call: glm(formula = Churn ~ ., family = "binomial", data = Train)
Coefficients:
(Intercept) Gender1 Age Income1 FamilySize Education Calls
-6.20407 1.16110 -0.02459 0.85539 0.58304 0.20786 0.02800
Visits
0.35617
Degrees of Freedom: 119 Total (i.e. Null); 112 Residual
Null Deviance: 164.2
Residual Deviance: 126.2 AIC: 142.2
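We fit a logistic regression model on Train to predict Churn from all the other variables. Accuracy is measured later on Test; here we also check the AIC and the residual deviance.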
AIC : The Akaike Information Criterion (AIC) provides a method for assessing the quality of your model through comparison of related models. It’s based on the Deviance, but penalizes you for making the model more complicated.
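Residual deviance: The residual deviance is 126.2 on 112 degrees of freedom, compared with a null deviance of 164.2 on 119. The deviance has reduced by about 38.0 with a loss of seven degrees of freedom.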
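The figures in the printed model can be checked with a few lines of arithmetic. The sketch below copies the values from the output above; the chi-squared comparison against the intercept-only model is an added illustration, not a step from the original analysis.

```r
# Values copied from the printed glm output
null_deviance     <- 164.2
residual_deviance <- 126.2
null_df           <- 119
residual_df       <- 112
n_coefficients    <- 8   # intercept plus seven slopes

# For a 0/1 response, AIC = residual deviance + 2 * number of coefficients
aic <- residual_deviance + 2 * n_coefficients
aic  # 142.2

# Drop in deviance and the degrees of freedom spent on it
deviance_drop <- null_deviance - residual_deviance  # about 38
df_used       <- null_df - residual_df              # 7

# Chi-squared test of the fitted model against the intercept-only model
p_value <- pchisq(deviance_drop, df = df_used, lower.tail = FALSE)
p_value < 0.001  # TRUE
```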
Test_pred=predict(my_attrition_eq,Test,type="response")
Test_pred
1 2 3 4 5 6 7 8 9
0.3525772 0.8257432 0.6993325 0.4568739 0.8009582 0.7789753 0.5361129 0.9024260 0.8756390
10 11 12 13 14 15 16 17 18
0.8638592 0.3061531 0.5834685 0.6953019 0.7384999 0.7844095 0.1266510 0.5755652 0.4039113
19 20 21 22 23 24 25 26 27
0.3725526 0.4299794 0.6883279 0.1810669 0.0806694 0.7729659 0.7611589 0.6509631 0.2653015
28 29 30 31 32 33 34 35 36
0.6293396 0.5150554 0.3710720 0.8393425 0.1324726 0.2440157 0.1660396 0.9607420 0.4591172
37 38 39 40
0.6796177 0.1493302 0.4410668 0.2303010
PREDICT: The predict() function applies the fitted model to Test to obtain predictions. TYPE = RESPONSE: The type = "response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.
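The relation between the logit and the probabilities can be sketched with the built-in mtcars data, used here only because it ships with R; the model am ~ wt is an arbitrary stand-in for the churn model.

```r
# mtcars ships with R; am (0/1) plays the role of Churn here
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Log-odds (the logit scale) and probabilities for the same rows
log_odds <- predict(fit, mtcars, type = "link")
probs    <- predict(fit, mtcars, type = "response")

# The probabilities are the inverse logit of the log-odds
all.equal(unname(plogis(log_odds)), unname(probs))  # TRUE

range(probs)  # always between 0 and 1
```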
library(ROCR)
ROCR (with obvious pronunciation) is a package for evaluating and visualizing classifier performance. It is easy to use, adding only three new commands to R, and flexible, integrating tightly with R's built-in graphics facilities.
ROCRPred=prediction(Test_pred,Test$Churn)
ROCR_Perfor=performance(ROCRPred,"acc")
plot(ROCR_Perfor)
The plot shows accuracy as a function of the classification cutoff. The highest point of the curve marks the cutoff at which the predictions on Test are most accurate.
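What the accuracy curve measures can be sketched in base R without ROCR. The probabilities and labels below are invented for the illustration and are not the Test set; accuracy_at() is a hypothetical helper, not a ROCR function.

```r
# Hypothetical predicted probabilities and true labels (not the Test set)
probs  <- c(0.12, 0.31, 0.37, 0.58, 0.66, 0.81, 0.93, 0.44)
labels <- c(0,    0,    1,    0,    1,    1,    1,    0)

# Accuracy at one cutoff: share of cases where the thresholded
# prediction agrees with the true label
accuracy_at <- function(cutoff) mean((probs > cutoff) == (labels == 1))

cutoffs    <- seq(0.05, 0.95, by = 0.05)
accuracies <- sapply(cutoffs, accuracy_at)

accuracy_at(0.5)  # 0.75
max(accuracies)   # 0.875, the peak of the accuracy curve
```

Plotting accuracies against cutoffs gives the same kind of curve that plot(ROCR_Perfor) draws.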
T=table(actualvalue=Test$Churn,predictedvalue=Test_pred>0.5)
T
           predictedvalue
actualvalue FALSE TRUE
          0    17    6
          1     1   16
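More generally in binary classification, a false positive is an error in which a test result improperly indicates the presence of a condition, such as a disease (the result is positive), when in reality it is not present, while a false negative is an error in which a test result improperly indicates the absence of a condition that is actually present.
TO CALCULATE ACCURACY: Write a for the true positives, b for the false positives, c for the false negatives and d for the true negatives; in the table above, a = 16, b = 6, c = 1 and d = 17. The false positive rate is calculated as b ÷ (b + d), or 1 – the specificity; the true positive rate (sensitivity) as a ÷ (a + c); the true negative rate (specificity) as d ÷ (b + d), or 1 – the false positive rate; and the positive predictive value as a ÷ (a + b). Accuracy is (a + d) ÷ (a + b + c + d), the diagonal of the table divided by its total.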
sum(diag(T))/sum(T)
[1] 0.825
At a cutoff of 0.5 the model classifies 33 of the 40 test observations correctly, an accuracy of 82.5% (about 83%).
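The remaining rates can be worked out from the same four counts. The sketch below rebuilds the confusion table by hand from the printed values; the variable names tp, fp, fn and tn are chosen for the example and correspond to a, b, c and d in the formulas.

```r
# Counts copied from the confusion table (rows = actual, columns = predicted)
conf <- matrix(c(17, 1, 6, 16), nrow = 2,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("FALSE", "TRUE")))

tp <- conf["1", "TRUE"]   # a: true positives  = 16
fp <- conf["0", "TRUE"]   # b: false positives = 6
fn <- conf["1", "FALSE"]  # c: false negatives = 1
tn <- conf["0", "FALSE"]  # d: true negatives  = 17

sensitivity <- tp / (tp + fn)  # 16/17, about 0.941
specificity <- tn / (tn + fp)  # 17/23, about 0.739
fpr         <- fp / (fp + tn)  # 6/23,  about 0.261
ppv         <- tp / (tp + fp)  # 16/22, about 0.727
accuracy    <- (tp + tn) / sum(conf)

accuracy  # 0.825
```

The model misses only one of the 17 actual churners, but it flags 6 of the 23 non-churners as churners, so the accuracy figure alone hides an imbalance between the two kinds of error.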