This is a report on the Human Resource key performance indicators I will use to measure the performance of an organization. The objective of the report is to build a machine learning model to predict churn while checking how well the model fits, which I assess through two model parameters: sensitivity and specificity. The original data is obtained from the Predictive Analytics Lab Website.
Importing data into RStudio
hr<-read.csv("data/hr_comma_sep.csv",stringsAsFactors = F,na.strings = c("","NA"))
First I convert three variables, left, department and salary, into factors, then view the structure of my data. The left variable will be my dependent variable, predicted from the other, independent variables. The churn problem is a classification problem (supervised learning).
hr$department=as.factor(hr$department)
hr$salary=as.factor(hr$salary)
hr$left=as.factor(hr$left)
str(hr)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ department : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
Then I generate a frequency table to show the distribution of the data, and create a visual by generating a mosaic plot of the distribution of departments by their salary.
ftable<-table(hr$department,hr$salary)
ftable
##
## high low medium
## accounting 74 358 335
## hr 45 335 359
## IT 83 609 535
## management 225 180 225
## marketing 80 402 376
## product_mng 68 451 383
## RandD 51 364 372
## sales 269 2099 1772
## support 141 1146 942
## technical 201 1372 1147
mosaicplot(ftable,main="Distribution of Departments by their Salary",color=TRUE)
Now I perform a chi-square test of my hypothesis that department and salary are dependent. Statistical theory states that a p-value below the significance level (conventionally 0.05) indicates dependence, so I surmise that my initial hypothesis stands. The chi-square output below provides this evidence.
chisq.test(ftable)
##
## Pearson's Chi-squared test
##
## data: ftable
## X-squared = 700.92, df = 18, p-value < 2.2e-16
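As an optional follow-up that quantifies the strength, not just the existence, of the association, a minimal sketch computing Cramér's V from the same table (values near 0 indicate a weak association, values near 1 a strong one):
##Cramer's V for the department-salary association
chi<-chisq.test(ftable)
sqrt(as.numeric(chi$statistic)/(sum(ftable)*(min(dim(ftable))-1)))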
To predict churn in the organisation I perform a logistic regression. For an in-sample error estimate, I make my predictions on the whole data set; I then compute an out-of-sample error estimate to see how good my prediction is when the algorithm is subjected to unseen data, which determines its efficacy and its readiness for deployment in a real-life scenario.
This is a classification problem because the output has two outcomes: stayed, denoted by 0, and left, denoted by 1. The family of distributions I will employ in my analysis is the binomial.
In-sample distribution
## in-sample distribution
logit_model<-glm(left~satisfaction_level+last_evaluation+number_project+
Work_accident+promotion_last_5years+average_montly_hours+
time_spend_company,data = hr, family = "binomial")
#summary(logit_model)
##predict churn
hr$left_pred<-predict(logit_model,newdata = hr,type="response")
##changing the probabilities into class (0 or 1)
hr$left_pred<-ifelse(hr$left_pred>0.5,1,0)
hr$left_pred=as.factor(hr$left_pred)
#str(hr)
#confusion matrix
table(hr$left,hr$left_pred)
##
## 0 1
## 0 10581 847
## 1 2670 901
misClassError<-mean(hr$left_pred!=hr$left)
print(paste('Accuracy=',1-misClassError))
## [1] "Accuracy= 0.765517701180079"
My prediction with the in-sample error is Accuracy= 0.765517701180079. With this baseline, I perform an out-of-sample error estimate to compare the two outcomes and see how well the algorithm generalizes. A good model is not only accurate on the data it was trained on (high in-sample accuracy can simply reflect overfitting) but is also robust on unseen data. I will evaluate my calculations later using a confusion matrix.
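Since sensitivity and specificity are the stated fitness criteria of this report, the following is a minimal sketch computing both directly from the in-sample confusion matrix above, under the assumption that class 1 (left) is the positive class (note that caret's output further below instead treats class 0 as positive):
##sensitivity and specificity from the in-sample confusion matrix
##assumption: class "1" (left) is the positive class
cm_in<-table(actual=hr$left,predicted=hr$left_pred)
sens<-cm_in["1","1"]/sum(cm_in["1",]) ##leavers correctly flagged
spec<-cm_in["0","0"]/sum(cm_in["0",]) ##stayers correctly identified
print(c(sensitivity=sens,specificity=spec))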
Out-of-sample distribution
Here I split my data into a training set and a test set. The idea is to train the model on the training set and then make predictions on unseen data that the algorithm is unfamiliar with, to see how robust my model is.
#Splitting the data in train and test data
#install.packages('caTools')
library(caTools)
set.seed(123)
split<-sample.split(hr,SplitRatio=0.75)
split
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
train<-subset(hr,split=="TRUE")
test<-subset(hr,split=="FALSE")
str(train)
## 'data.frame': 10909 obs. of 11 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.41 0.1 0.89 0.42 0.11 0.84 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.5 0.77 1 0.53 0.81 0.92 ...
## $ number_project : int 2 5 7 5 2 6 5 2 6 4 ...
## $ average_montly_hours : int 157 262 272 223 153 247 224 142 305 234 ...
## $ time_spend_company : int 3 6 4 5 3 4 5 3 4 5 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ department : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## $ left_pred : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 2 1 ...
str(test)
## 'data.frame': 4090 obs. of 11 variables:
## $ satisfaction_level : num 0.37 0.92 0.45 0.38 0.45 0.38 0.82 0.38 0.4 0.45 ...
## $ last_evaluation : num 0.52 0.85 0.54 0.54 0.51 0.55 0.87 0.5 0.51 0.5 ...
## $ number_project : int 2 5 2 2 2 2 4 2 2 2 ...
## $ average_montly_hours : int 159 259 135 143 160 147 239 132 145 126 ...
## $ time_spend_company : int 3 5 3 3 3 3 5 3 3 3 ...
## $ Work_accident : int 0 0 0 0 1 0 0 0 0 0 ...
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: int 0 0 0 0 1 0 0 0 0 0 ...
## $ department : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 1 2 10 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 2 2 2 2 2 2 2 2 2 ...
## $ left_pred : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
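As an aside, sample.split() is conventionally applied to the outcome vector rather than to the whole data frame: applied to a vector it stratifies the split by class, whereas the data-frame form used above recycles one logical value per column across the rows. A sketch of the conventional usage (the results reported here come from the form above):
##alternative: stratified split on the outcome itself
set.seed(123)
split_strat<-sample.split(hr$left,SplitRatio = 0.75)
train_strat<-subset(hr,split_strat==TRUE)
test_strat<-subset(hr,split_strat==FALSE)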
I train the model using the generalized linear model (glm) function and thereafter make my predictions on the test data based on the trained model.
##Train model using glm function
logit_model_tr<-glm(left~satisfaction_level+last_evaluation+number_project+
Work_accident+promotion_last_5years+average_montly_hours+
time_spend_company,data = train, family = "binomial")
#logit_model_tr
##Predicting test data based on trained model
predicted.result<-predict(logit_model_tr,test,type = "response")
predicted.result<-ifelse(predicted.result>0.5,1,0)
##Evaluating model accuracy using confusion matrix
#table(test$left,predicted.result)
misClassError<-mean(predicted.result!=test$left)
#print(paste('Accuracy=',1-misClassError))
Evaluating my model using a confusion matrix.
library(caret)
confusionMatrix(table(test$left,predicted.result))
## Confusion Matrix and Statistics
##
## predicted.result
## 0 1
## 0 2885 231
## 1 727 247
##
## Accuracy : 0.7658
## 95% CI : (0.7525, 0.7787)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2175
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7987
## Specificity : 0.5167
## Pos Pred Value : 0.9259
## Neg Pred Value : 0.2536
## Prevalence : 0.8831
## Detection Rate : 0.7054
## Detection Prevalence : 0.7619
## Balanced Accuracy : 0.6577
##
## 'Positive' Class : 0
##
ROC-AUC CURVE
A visual representation of the model using a Receiver Operating Characteristic (ROC) curve, with an Area Under the Curve (AUC) label that displays the actual value of the area under the curve.
N.B.: The best models have AUC values closer to 1.
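The chunk that produced the curve is not shown here; below is a minimal sketch, assuming the pROC package is installed (ROCR would serve equally well), that plots the ROC curve for the out-of-sample predictions and labels the AUC:
##ROC curve with AUC label for the out-of-sample logistic predictions
library(pROC)
roc_obj<-roc(test$left,predict(logit_model_tr,test,type = "response"))
plot(roc_obj,print.auc = TRUE,main = "ROC Curve")
auc(roc_obj)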
In my analysis I use the decision tree algorithm for illustrative purposes, to help us understand random forest, an algorithm based on decision trees. A random forest is an agglomerative process whereby you generate many decision trees, based on various decision nodes, to solve this classification problem; the approach is also useful for regression models.
Loading my data into RStudio
hr<-read.csv("data/hr_comma_sep.csv")
hr$left=as.factor(hr$left)
#unlist(lapply(hr,class))
library(caTools)
set.seed(1234)
split<-sample.split(hr,SplitRatio = 0.75)
split
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
train<-subset(hr,split=="TRUE")
test<-subset(hr,split=="FALSE")
#str(train)
#str(test)
A plot of the decision tree
library("rpart") ##recursive partioning
library("rpart.plot")
DecTreeModel<-rpart(left~.,data=train,method = "class")
rpart.plot(DecTreeModel)
The decision tree is constructed in a top-down manner and involves the following process:
1. Placing all the training examples at the root
2. Categorizing the attributes
3. Partitioning the examples recursively based on the selected attributes
4. Selecting test attributes on the basis of a heuristic or statistical measure
#summary(DecTreeModel)
The summary gives a more detailed description of the different nodes and their attributes.
The idea here is to allow the decision tree to grow fully and observe the CP value. The complexity parameter (CP) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding another variable to the decision tree from the current node is above the value of CP, then tree building does not continue.
printcp(DecTreeModel)
##
## Classification tree:
## rpart(formula = left ~ ., data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] average_montly_hours last_evaluation number_project
## [4] satisfaction_level time_spend_company
##
## Root node error: 2500/10499 = 0.23812
##
## n= 10499
##
## CP nsplit rel error xerror xstd
## 1 0.2424 0 1.0000 1.0000 0.0174572
## 2 0.1872 1 0.7576 0.7576 0.0157598
## 3 0.0744 3 0.3832 0.3832 0.0118023
## 4 0.0504 5 0.2344 0.2344 0.0094089
## 5 0.0316 6 0.1840 0.1884 0.0084841
## 6 0.0180 7 0.1524 0.1624 0.0079024
## 7 0.0128 8 0.1344 0.1416 0.0073980
## 8 0.0100 9 0.1216 0.1280 0.0070455
plotcp(DecTreeModel)
The printcp() and plotcp() functions provide the cross-validation error for each nsplit and can be used to prune the tree. The CP value with the least cross-validation error (xerror) is the optimal one, as given by the printcp() function.
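The tree above was grown with rpart's default CP of 0.01; to let it grow fully before pruning, as described earlier, the complexity parameter can be lowered. A minimal sketch (the value 0.001 is an arbitrary choice):
##grow a deeper tree by relaxing the default complexity parameter
full_tree<-rpart(left~.,data=train,method = "class",
                 control = rpart.control(cp = 0.001))
#printcp(full_tree) ##inspect its cross-validated errors as before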
test$left_Pred<-predict(DecTreeModel,newdata = test,type = "class")
table(test$left,test$left_Pred)
##
## 0 1
## 0 3391 38
## 1 93 978
library(caret)
confusionMatrix(table(test$left,test$left_Pred))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 3391 38
## 1 93 978
##
## Accuracy : 0.9709
## 95% CI : (0.9655, 0.9756)
## No Information Rate : 0.7742
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9183
##
## Mcnemar's Test P-Value : 2.382e-06
##
## Sensitivity : 0.9733
## Specificity : 0.9626
## Pos Pred Value : 0.9889
## Neg Pred Value : 0.9132
## Prevalence : 0.7742
## Detection Rate : 0.7536
## Detection Prevalence : 0.7620
## Balanced Accuracy : 0.9680
##
## 'Positive' Class : 0
##
I find the value of CP for which the cross-validation error is minimum.
min(DecTreeModel$cptable[,"xerror"])
## [1] 0.128
which.min(DecTreeModel$cptable[,"xerror"])
## 8
## 8
cpmin<-DecTreeModel$cptable[8,"CP"]
cpmin
## [1] 0.01
From the analysis, the minimum cross-validation error occurs at the default CP of 0.01, so the decision tree as grown is already optimal and doesn't need further pruning.
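Had the minimum xerror occurred at an earlier split, the tree could have been pruned back to that CP value. A minimal sketch of that step (a no-op here, since cpmin equals the default of 0.01):
##prune back to the CP with the lowest cross-validation error
pruned_tree<-prune(DecTreeModel,cp = cpmin)
#rpart.plot(pruned_tree) ##identical to the original tree in this case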
Random forest is one of the most preferred and widely used algorithms in data science projects. It is an ensemble of decision trees: it builds and combines multiple decision trees to give a more accurate prediction. Each decision tree model is weak when employed on its own, but the ensemble becomes stable when the trees are put together.
It is random because it operates by choosing predictors randomly at the time of training the model.
It is called a forest because it takes the output of multiple decision trees to make a decision.
After loading my data into RStudio, my next step is splitting the data.
library(caTools)
set.seed(123)
split <- sample.split(hr, SplitRatio = 0.75)
#split
training_set = subset(hr, split == TRUE)
test_set = subset(hr, split == FALSE)
# install.packages('randomForest')
library(randomForest)
classifier <- randomForest(x = training_set[-7], ##drop column 7, the outcome "left"
                           y = training_set$left,
                           ntree = 500)
#bestmtry<-tuneRF(training_set,training_set$left,stepFactor = 1.2,improve = 0.01,trace = T,plot = T)
I train my classifier using the randomForest() function; ntree = 500 is an arbitrary number of trees that will help me determine later whether my model still needs optimization or has stabilized.
y_pred = predict(classifier, newdata = test_set[-7])
#y_pred
cm <- table(test_set$left,y_pred)
misClassError<-mean(test_set$left!=y_pred)
#print(paste('Accuracy=',1-misClassError))
library(caret)
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## 0 1
## 0 3424 5
## 1 28 1043
##
## Accuracy : 0.9927
## 95% CI : (0.9897, 0.9949)
## No Information Rate : 0.7671
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9796
##
## Mcnemar's Test P-Value : 0.0001283
##
## Sensitivity : 0.9919
## Specificity : 0.9952
## Pos Pred Value : 0.9985
## Neg Pred Value : 0.9739
## Prevalence : 0.7671
## Detection Rate : 0.7609
## Detection Prevalence : 0.7620
## Balanced Accuracy : 0.9936
##
## 'Positive' Class : 0
##
The information I gather from the confusion matrix is a high degree of accuracy and generally high scores in sensitivity and specificity, which indicates a good model indeed. A quick comparison with the decision tree model shows that this random forest model is an improvement.
plot(classifier)
The above plot of the classifier illustrates the number of trees that best optimises the model. By growing more and more trees, you can see from the figure whether the error keeps decreasing. If it were still decreasing, I would change ntree in my classifier to accommodate this. It is not decreasing further; in fact, it decreased until around 30 trees, and there was no improvement from growing more trees.
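To read the same conclusion off numerically, a small sketch using the out-of-bag (OOB) error rates that randomForest stores per tree in the fitted object's err.rate matrix:
##number of trees at which the out-of-bag error bottoms out
which.min(classifier$err.rate[,"OOB"])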
importance(classifier)
## MeanDecreaseGini
## satisfaction_level 1311.316790
## last_evaluation 450.771887
## number_project 639.670532
## average_montly_hours 562.505808
## time_spend_company 699.796376
## Work_accident 21.226730
## promotion_last_5years 3.825953
## department 66.347186
## salary 30.727622
varImpPlot(classifier)
The best thing about random forest is that it has an inbuilt variable importance measure, plotted with varImpPlot(). In earlier models we checked p-values; in random forest, variable importance is the diagnostic tool that explains the WHY in a problem statement. The variables are displayed in the plot in order of importance, illustrating the significance of each individual attribute.
By looking at the plot you understand why this happened. For instance, why is the organization experiencing staff turnover? You realise that the satisfaction level in this organisation plays a major role in this churn problem. Time spent at the company, number of projects, and average monthly hours put in by the employees come next, respectively. Salary comes a distant 7th on the list.