Libraries used
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(arules)
## Warning: package 'arules' was built under R version 3.5.3
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 3.5.3
## Loading required package: grid
Sheet = “Revised_HTM_EA_Data” is read from the Excel Workbook.
data <- na.omit(read_excel("C:/Users/Sharathchandra/Desktop/Skill Assessment/Diamond_Resorts/Case-Fictional_HR-Sharathchandra.xlsx",sheet = "Revised_HTM_EA_Data"))
Data is cleansed and modified majorily in MS Excel as part of Excel Analytics - Pivot Tables & Analysis. The numerical values in columns except “EmployeeNumber” column has been put into bins and thus are characterized in Excel and then factrozied in R. Please see the formula applied for each column in Excel Sheet = “Revised_HTM_EA_Data”. This is done as a prerequiste to apply Apriori algorithm on the dataset. And, more importantly with words and bins it is easier to decode reasons for Attrition value and at the same time, to give a clear picture to the executives.
Attrition = No, 1233 observations Attrition = Yes, 237 observations Total = 1470 observations
data <- data %>% mutate_if(is.character, funs(as.factor))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
data <- data[,-c(1)]
First, I wanted to see the main factors affecting employees to stay and continue their employment in the company. Thus, I have checked for rules only with RHS equal to “Attrition=No”. I have tuned the algorithm according to the number of observations in the dataset.
I have made sure to remove the redundant rules and only take unique rules under consideration. These rules gives us a story of what factors are leading to employees to stay employed in the company. The visualization of the rules are sorted by confidence (With 90% confidence, these factors are important) and colored by lift (How many times these factors have occurred in rules?).
rulesHR1 <- apriori(data, parameter = list(supp = 0.1, conf = 0.9, maxlen = 5), appearance = list(default = "lhs", rhs = "Attrition=No"),control = list(verbose=F))
subsetRules1 <- which(colSums(is.subset(rulesHR1, rulesHR1)) > 1)
length(subsetRules1)
## [1] 29
rulesHRAttritionNo <- rulesHR1[-subsetRules1]
length(rulesHRAttritionNo)
## [1] 16
summary(rulesHRAttritionNo)
## set of 16 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 1 11 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 3.00 3.25 3.25 5.00
##
## summary of quality measures:
## support confidence lift count
## Min. :0.1143 Min. :0.9000 Min. :1.073 Min. :168.0
## 1st Qu.:0.1274 1st Qu.:0.9031 1st Qu.:1.077 1st Qu.:187.2
## Median :0.1418 Median :0.9088 Median :1.084 Median :208.5
## Mean :0.1573 Mean :0.9132 Mean :1.089 Mean :231.2
## 3rd Qu.:0.1755 3rd Qu.:0.9209 3rd Qu.:1.098 3rd Qu.:258.0
## Max. :0.3279 Max. :0.9447 Max. :1.126 Max. :482.0
##
## mining info:
## data ntransactions support confidence
## data 1470 0.1 0.9
rulesConf1 <- sort(rulesHRAttritionNo, by = "confidence", decreasing = T)
inspect(head(rulesConf1))
## lhs rhs support confidence lift count
## [1] {Department=Research & Development,
## MonthlyIncome=Income less than 10,000$} => {Attrition=No} 0.1394558 0.9447005 1.126285 205
## [2] {Age=Between 40 and 50 years,
## Department=Research & Development} => {Attrition=No} 0.1333333 0.9333333 1.112733 196
## [3] {Age=Between 30 and 40 years,
## JobSatisfaction=Rating 4} => {Attrition=No} 0.1258503 0.9250000 1.102798 185
## [4] {DistanceFromHome=Less than 5 Miles,
## MonthlyIncome=Income less than 10,000$} => {Attrition=No} 0.1149660 0.9234973 1.101006 169
## [5] {Age=Between 40 and 50 years,
## BusinessTravel=Travel_Rarely} => {Attrition=No} 0.1408163 0.9200000 1.096837 207
## [6] {Age=Between 30 and 40 years,
## YearsAtCompany=Between 6 and 10 years} => {Attrition=No} 0.1278912 0.9126214 1.088040 188
rulesHRAttritionNoViz <- plot(rulesConf1, method = "graph")
Then, Iwanted to check for main factors affecting employees to quit their employment and leave the company. Thus, I have checked for rules only with RHS equal to “Attrition=Yes”. I have tuned the algorithm according to the number of observations in the dataset.
Same principle as explained before.
rulesHR2 <- apriori(data, parameter = list(supp = 0.0025, conf = 0.9, maxlen = 5), appearance = list(default = "lhs", rhs = "Attrition=Yes"),control = list(verbose=F))
length(rulesHR2)
## [1] 39
subsetRules2 <- which(colSums(is.subset(rulesHR2, rulesHR2)) > 1)
length(subsetRules2)
## [1] 30
rulesHRAttritionYes <- rulesHR2[-subsetRules2]
length(rulesHRAttritionYes)
## [1] 9
summary(rulesHRAttritionYes)
## set of 9 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5
## 1 2 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 5.000 4.556 5.000 5.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.002721 Min. :1 Min. :6.203 Min. :4.000
## 1st Qu.:0.002721 1st Qu.:1 1st Qu.:6.203 1st Qu.:4.000
## Median :0.002721 Median :1 Median :6.203 Median :4.000
## Mean :0.002797 Mean :1 Mean :6.203 Mean :4.111
## 3rd Qu.:0.002721 3rd Qu.:1 3rd Qu.:6.203 3rd Qu.:4.000
## Max. :0.003401 Max. :1 Max. :6.203 Max. :5.000
##
## mining info:
## data ntransactions support confidence
## data 1470 0.0025 0.9
rulesConf2 <- sort(rulesHRAttritionYes, by = "confidence", decreasing = T)
inspect(head(rulesConf2))
## lhs rhs support confidence lift count
## [1] {Age=Less than 20 years,
## BusinessTravel=Travel_Frequently} => {Attrition=Yes} 0.003401361 1 6.202532 5
## [2] {BusinessTravel=Travel_Frequently,
## JobRole=Sales Representative,
## JobSatisfaction=Rating 1} => {Attrition=Yes} 0.002721088 1 6.202532 4
## [3] {BusinessTravel=Travel_Frequently,
## DistanceFromHome=Between 6 and 10 miles,
## JobRole=Sales Representative} => {Attrition=Yes} 0.002721088 1 6.202532 4
## [4] {BusinessTravel=Travel_Frequently,
## DistanceFromHome=Greater than 26 miles,
## JobLevel=Level 2,
## JobRole=Sales Executive} => {Attrition=Yes} 0.002721088 1 6.202532 4
## [5] {BusinessTravel=Travel_Frequently,
## Department=Sales,
## DistanceFromHome=Greater than 26 miles,
## JobLevel=Level 2} => {Attrition=Yes} 0.002721088 1 6.202532 4
## [6] {Age=Between 30 and 40 years,
## DistanceFromHome=Between 16 and 20 miles,
## JobLevel=Level 1,
## JobSatisfaction=Rating 1} => {Attrition=Yes} 0.002721088 1 6.202532 4
rulesHRAttritionYesViz <- plot(rulesConf2, method = "graph")
Apart from this R Notebook, please do see the attached Excel for Analytics, Visualizations & Pivot Table and Word Document for Executive summary and Visualizations.