df <- read.csv("HR_comma_sep.csv")
We will carry out exploratory data analysis (EDA) to examine the general structure of the data and decide whether any preprocessing is needed.
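The setup chunk of the knitted report is not shown. The functions used below (describe(), barchart(), OneR(), eval_model(), naiveBayes(), ctree()) come from the packages listed in this minimal sketch, which we assume was run beforehand; in particular we assume the party implementation of ctree(), since the call later in the report uses its controls argument.
library(Hmisc)    # describe()
library(lattice)  # barchart()
library(OneR)     # OneR(), eval_model()
library(e1071)    # naiveBayes()
library(party)    # ctree(), ctree_control()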
str(df)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Department : chr "sales" "sales" "sales" "sales" ...
## $ salary : chr "low" "medium" "medium" "low" ...
describe(df)
## df
##
## 10 Variables 14999 Observations
## --------------------------------------------------------------------------------
## satisfaction_level
## n missing distinct Info Mean Gmd .05 .10
## 14999 0 92 1 0.6128 0.2823 0.11 0.21
## .25 .50 .75 .90 .95
## 0.44 0.64 0.82 0.92 0.96
##
## lowest : 0.09 0.10 0.11 0.12 0.13, highest: 0.96 0.97 0.98 0.99 1.00
## --------------------------------------------------------------------------------
## last_evaluation
## n missing distinct Info Mean Gmd .05 .10
## 14999 0 65 1 0.7161 0.1973 0.46 0.49
## .25 .50 .75 .90 .95
## 0.56 0.72 0.87 0.95 0.98
##
## lowest : 0.36 0.37 0.38 0.39 0.40, highest: 0.96 0.97 0.98 0.99 1.00
## --------------------------------------------------------------------------------
## number_project
## n missing distinct Info Mean Gmd
## 14999 0 6 0.945 3.803 1.367
##
## lowest : 2 3 4 5 6, highest: 3 4 5 6 7
##
## Value 2 3 4 5 6 7
## Frequency 2388 4055 4365 2761 1174 256
## Proportion 0.159 0.270 0.291 0.184 0.078 0.017
## --------------------------------------------------------------------------------
## average_montly_hours
## n missing distinct Info Mean Gmd .05 .10
## 14999 0 215 1 201.1 57.48 130 137
## .25 .50 .75 .90 .95
## 156 200 245 267 275
##
## lowest : 96 97 98 99 100, highest: 306 307 308 309 310
## --------------------------------------------------------------------------------
## time_spend_company
## n missing distinct Info Mean Gmd
## 14999 0 8 0.905 3.498 1.43
##
## lowest : 2 3 4 5 6, highest: 5 6 7 8 10
##
## Value 2 3 4 5 6 7 8 10
## Frequency 3244 6443 2557 1473 718 188 162 214
## Proportion 0.216 0.430 0.170 0.098 0.048 0.013 0.011 0.014
## --------------------------------------------------------------------------------
## Work_accident
## n missing distinct Info Sum Mean Gmd
## 14999 0 2 0.371 2169 0.1446 0.2474
##
## --------------------------------------------------------------------------------
## left
## n missing distinct Info Sum Mean Gmd
## 14999 0 2 0.544 3571 0.2381 0.3628
##
## --------------------------------------------------------------------------------
## promotion_last_5years
## n missing distinct Info Sum Mean Gmd
## 14999 0 2 0.062 319 0.02127 0.04163
##
## --------------------------------------------------------------------------------
## Department
## n missing distinct
## 14999 0 10
##
## lowest : accounting hr IT management marketing
## highest: product_mng RandD sales support technical
##
## Value accounting hr IT management marketing
## Frequency 767 739 1227 630 858
## Proportion 0.051 0.049 0.082 0.042 0.057
##
## Value product_mng RandD sales support technical
## Frequency 902 787 4140 2229 2720
## Proportion 0.060 0.052 0.276 0.149 0.181
## --------------------------------------------------------------------------------
## salary
## n missing distinct
## 14999 0 3
##
## Value high low medium
## Frequency 1237 7316 6446
## Proportion 0.082 0.488 0.430
## --------------------------------------------------------------------------------
From the output of describe(df) above, we can see that there are no missing values in any of the 10 variables. Our target variable is left, which records whether an employee has left the company (1) or stayed (0); about 24% of the employees in the dataset have left.
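As a quick sanity check on this class balance (a sketch, not part of the original report; the counts should agree with the Sum and Mean that describe() reports for left):
table(df$left)                        # absolute counts of stayed (0) vs left (1)
round(prop.table(table(df$left)), 4)  # relative frequencies, roughly 0.76 vs 0.24
We now examine the predictors in turn, starting with satisfaction_level.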
hist(df$satisfaction_level,
     freq = FALSE,                  # plot densities so the density curve overlays correctly
     main = "Histogram for Satisfaction Level",
     xlab = "Satisfaction Level",
     ylab = "Density",
     border = "black",
     col = "lightblue1")
dens <- density(df$satisfaction_level)
lines(dens)
The histogram of satisfaction level shows that most employees report a score above 0.5. However, a noticeable group is not satisfied with the company: at least 5% of employees report a satisfaction level of 0.11 or lower.
hist(df$last_evaluation,
     freq = FALSE,
     main = "Histogram for Last Evaluation",
     xlab = "Last Evaluation",
     ylab = "Density",
     border = "black",
     col = "lightblue1")
dens <- density(df$last_evaluation)
lines(dens)
Most of the last-evaluation scores are also above 0.5, and only a small fraction fall below 0.5 (the minimum is 0.36). Still, the company has room for improvement so that more employees reach higher scores.
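To quantify the two groups mentioned above, a quick check (a sketch, not in the original report; 0.5 is the same informal cut-off used in the text):
mean(df$satisfaction_level < 0.5)  # share of employees with satisfaction below 0.5
mean(df$last_evaluation < 0.5)     # share of employees whose last evaluation is below 0.5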
# promotion_last_5years is a binary (0/1) indicator, so this histogram has only two bars
hist(df$promotion_last_5years,
     freq = FALSE,
     main = "Histogram for Promotion Last 5 Years",
     xlab = "Promotion Last 5 Years",
     ylab = "Density",
     border = "black",
     col = "lightblue1")
dens <- density(df$promotion_last_5years)
lines(dens)
cat("Percent of employees get promoted: ",
nrow(subset(df, promotion_last_5years == 1))/nrow(df)*100,
"%\n")
## Percent of employees get promoted: 2.126808 %
cat("Number of employees get promoted: ",
nrow(subset(df, promotion_last_5years == 1)), "\n")
## Number of employees get promoted: 319
cat("Number of employees does not get promoted: ",
nrow(df) - nrow(subset(df, promotion_last_5years == 1)))
## Number of employees does not get promoted: 14680
From the figures above, we can observe that the vast majority of employees (about 98%) were not promoted in the last 5 years. This might be one reason why many employees do not give a high score in the evaluation.
Next, we would like to examine the number of employees in each department.
# barchart() is from lattice; passing a table of counts gives one horizontal bar per department
barchart(table(df$Department),
         main = "Number of Employees In Each Department",
         xlab = "Number of Employees",
         ylab = "Department",
         col = "lightblue1")
counts <- table(df$promotion_last_5years, df$Department)
barplot(counts, main = "Promotion Last 5 Years By Department",
xlab = "Number of Employees", col = c("blue","red"),
legend = rownames(counts),
cex.names = 0.5)
cat("Percent promotion from Technical Department: ",
nrow(subset(df, (Department == 'technical' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'technical'))*100,
"%\n")
## Percent promotion from Technical Department: 1.029412 %
cat("Percent promotion from Support Department: ",
nrow(subset(df, (Department == 'support' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'support'))*100,
"%\n")
## Percent promotion from Support Department: 0.8972633 %
cat("Percent promotion from Sales Department: ",
nrow(subset(df, (Department == 'sales' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'sales'))*100,
"%\n")
## Percent promotion from Sales Department: 2.415459 %
cat("Percent promotion from RnD Department: ",
nrow(subset(df, (Department == 'RandD' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'RandD'))*100,
"%\n")
## Percent promotion from RnD Department: 3.43075 %
cat("Percent promotion from Product Management Department: ",
nrow(subset(df, (Department == 'product_mng' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'product_mng'))*100,
"%\n")
## Percent promotion from Product Management Department: 0 %
cat("Percent promotion from Marketing Department: ",
nrow(subset(df, (Department == 'marketing' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'marketing'))*100,
"%\n")
## Percent promotion from Marketing Department: 5.011655 %
cat("Percent promotion from Management Department: ",
nrow(subset(df, (Department == 'management' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'management'))*100,
"%\n")
## Percent promotion from Management Department: 10.95238 %
cat("Percent promotion from IT Department: ",
nrow(subset(df, (Department == 'IT' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'IT'))*100,
"%\n")
## Percent promotion from IT Department: 0.2444988 %
cat("Percent promotion from HR Department: ",
nrow(subset(df, (Department == 'hr' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'hr'))*100,
"%\n")
## Percent promotion from HR Department: 2.02977 %
cat("Percent promotion from Accounting Department: ",
nrow(subset(df, (Department == 'accounting' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'accounting'))*100,
"%")
## Percent promotion from Accounting Department: 1.825293 %
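The same per-department promotion rates can be computed in one step; this compact alternative is a sketch (not part of the original analysis) whose values should match the percentages printed above:
# promotion rate (%) within each department
round(prop.table(table(df$Department, df$promotion_last_5years), margin = 1)[, "1"] * 100, 2)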
counts <- table(df$salary, df$Department)
barplot(counts, main = "Salary Distribution By Department",
xlab = "Department", col = c("darkblue","red", "orange"),
legend = rownames(counts), beside = TRUE, cex.names = 0.5)
From the charts above, it is clear that the promotion rate over the last 5 years is low in every department; only the management department stands out, at about 11%. The salary chart also shows that the majority of employees do not earn a high salary (only about 8% are in the high band). This might be another reason why employees do not give the company high scores.
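Since our target is left, it is also worth relating the salary bands to attrition; a quick sketch (not shown in the original report):
# share of leavers (left = 1) within each salary band
round(prop.table(table(df$salary, df$left), margin = 1), 3)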
Before fitting any models, we discretize the continuous variables into coarse bins and convert the remaining integer columns into factors, since OneR works on categorical attributes.
# discretize the continuous variables into three ordered bands
df$satisfaction_level <- cut(df$satisfaction_level, breaks = c(0, 1/3, 2/3, 1), labels = c("unsatisfy", "neutral", "satisfy"))
df$last_evaluation <- cut(df$last_evaluation, breaks = c(0, 1/3, 2/3, 1), labels = c("unsatisfy", "neutral", "satisfy"))
df$average_montly_hours <- cut(df$average_montly_hours, breaks = c(95, 180, 250, 310), labels = c("less", "normal", "more"))
# convert the remaining integer-coded variables (and the target) into factors
df$number_project <- factor(df$number_project)
df$time_spend_company <- factor(df$time_spend_company)
df$Work_accident <- factor(df$Work_accident)
df$promotion_last_5years <- factor(df$promotion_last_5years)
df$left <- factor(df$left)
str(df)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : Factor w/ 3 levels "unsatisfy","neutral",..: 2 3 1 3 2 2 1 3 3 2 ...
## $ last_evaluation : Factor w/ 3 levels "unsatisfy","neutral",..: 2 3 3 3 2 2 3 3 3 2 ...
## $ number_project : Factor w/ 6 levels "2","3","4","5",..: 1 4 6 4 1 1 5 4 4 1 ...
## $ average_montly_hours : Factor w/ 3 levels "less","normal",..: 1 3 3 2 1 1 2 3 2 1 ...
## $ time_spend_company : Factor w/ 8 levels "2","3","4","5",..: 2 5 3 4 2 2 3 4 4 2 ...
## $ Work_accident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Department : chr "sales" "sales" "sales" "sales" ...
## $ salary : chr "low" "medium" "medium" "low" ...
set.seed(45)
# split the data into roughly 80% training and 20% test observations
pd <- sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[pd == 1, ]   # the empty column index keeps all columns
test  <- df[pd == 2, ]
cat("Number of train data: ", nrow(train), "\n")
## Number of train data: 12015
cat("Number of test data: ", nrow(test))
## Number of test data: 2984
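A quick check (a sketch, not in the original report) that the proportion of leavers is similar in the two splits:
round(prop.table(table(train$left)), 3)  # class balance in the training set
round(prop.table(table(test$left)), 3)   # class balance in the test set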
oner_model <- OneR(formula = left ~., data = train, verbose = TRUE)
## Warning in OneR.data.frame(x = data, ties.method = ties.method, verbose =
## verbose, : data contains unused factor levels
##
## Attribute Accuracy
## 1 * number_project 83.75%
## 2 time_spend_company 77.19%
## 3 satisfaction_level 76.1%
## 3 last_evaluation 76.1%
## 3 average_montly_hours 76.1%
## 3 Work_accident 76.1%
## 3 promotion_last_5years 76.1%
## 3 Department 76.1%
## 3 salary 76.1%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(oner_model)
##
## Call:
## OneR.formula(formula = left ~ ., data = train, verbose = TRUE)
##
## Rules:
## If number_project = 2 then left = 1
## If number_project = 3 then left = 0
## If number_project = 4 then left = 0
## If number_project = 5 then left = 0
## If number_project = 6 then left = 1
## If number_project = 7 then left = 1
##
## Accuracy:
## 10062 of 12015 instances classified correctly (83.75%)
##
## Contingency table:
## number_project
## left 2 3 4 5 6 7 Sum
## 0 646 * 3179 * 3163 * 1718 437 0 9143
## 1 * 1260 63 326 481 * 538 * 204 2872
## Sum 1906 3242 3489 2199 975 204 12015
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 4311.2, df = 5, p-value < 2.2e-16
plot(oner_model)
From the model above we can see that number_project gives the highest accuracy (lowest error rate) on the training data, 83.75%, so OneR builds its rules on that single attribute. The rules, shown in the summary above, predict left = 1 when an employee handles 2, 6 or 7 projects, and left = 0 for 3 to 5 projects. Next, we test the model on the test dataset:
prediction <- predict(oner_model, test)
eval_model(prediction, test$left)
##
## Confusion matrix (absolute):
## Actual
## Prediction 0 1 Sum
## 0 2028 223 2251
## 1 257 476 733
## Sum 2285 699 2984
##
## Confusion matrix (relative):
## Actual
## Prediction 0 1 Sum
## 0 0.68 0.07 0.75
## 1 0.09 0.16 0.25
## Sum 0.77 0.23 1.00
##
## Accuracy:
## 0.8391 (2504/2984)
##
## Error rate:
## 0.1609 (480/2984)
##
## Error rate reduction (vs. base rate):
## 0.3133 (p-value < 2.2e-16)
On the test set, the model achieves an accuracy of 83.91%, and the p-value below 2.2e-16 indicates that the error-rate reduction over the base rate is statistically significant. The rules suggest that an employee tends to leave the company when given too few projects (2) or too many (6 or more). Therefore, to reduce the chance of an employee leaving, they should be assigned around 3 to 5 projects.
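The rule can be sanity-checked directly against the training data; this sketch (not in the original report) shows the share of leavers within each project-count group and should mirror the contingency table above:
# proportion of leavers (left = 1) for each number_project level in the training set
round(prop.table(table(train$number_project, train$left), margin = 1), 3)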
In this section, we try the Naive Bayes model to examine whether it gives better accuracy than the OneR model. The OneR model shows a strong relationship between the number of projects and whether an employee leaves, but Naive Bayes can take all of the attributes into account rather than just one.
NaiveBayesModel <- naiveBayes(left ~., data = train)
NaiveBayesModel
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.7609655 0.2390345
##
## Conditional probabilities:
## satisfaction_level
## Y unsatisfy neutral satisfy
## 0 0.09865471 0.37066608 0.53067921
## 1 0.27646240 0.46169916 0.26183844
##
## last_evaluation
## Y unsatisfy neutral satisfy
## 0 0.0000000 0.4093842 0.5906158
## 1 0.0000000 0.4596100 0.5403900
##
## number_project
## Y 2 3 4 5 6 7
## 0 0.07065515 0.34769769 0.34594772 0.18790331 0.04779613 0.00000000
## 1 0.43871866 0.02193593 0.11350975 0.16747911 0.18732591 0.07103064
##
## average_montly_hours
## Y less normal more
## 0 0.3867440 0.4427431 0.1705130
## 1 0.4564763 0.1970752 0.3464485
##
## time_spend_company
## Y 2 3 4 5 6 7
## 0 0.28185497 0.41671224 0.14831018 0.05731160 0.04593678 0.01684349
## 1 0.01636490 0.44324513 0.25383008 0.22806407 0.05849582 0.00000000
## time_spend_company
## Y 8 10
## 0 0.01410915 0.01892158
## 1 0.00000000 0.00000000
##
## Work_accident
## Y 0 1
## 0 0.82598709 0.17401291
## 1 0.95543175 0.04456825
##
## promotion_last_5years
## Y 0 1
## 0 0.973750410 0.026249590
## 1 0.995821727 0.004178273
##
## Department
## Y accounting hr IT management marketing product_mng
## 0 0.04724926 0.04506180 0.08279558 0.04888986 0.05884283 0.06092092
## 1 0.05501393 0.05988858 0.07660167 0.02576602 0.05849582 0.05222841
## Department
## Y RandD sales support technical
## 0 0.05763972 0.27310511 0.14918517 0.17630975
## 1 0.03377437 0.27889972 0.16086351 0.19846797
##
## salary
## Y high low medium
## 0 0.10204528 0.44930548 0.44864924
## 1 0.02123955 0.60898329 0.36977716
First, the model is evaluated on the training dataset.
NBPredictions <- predict(NaiveBayesModel,train)
eval_model(NBPredictions,train$left)
##
## Confusion matrix (absolute):
## Actual
## Prediction 0 1 Sum
## 0 8369 819 9188
## 1 774 2053 2827
## Sum 9143 2872 12015
##
## Confusion matrix (relative):
## Actual
## Prediction 0 1 Sum
## 0 0.70 0.07 0.76
## 1 0.06 0.17 0.24
## Sum 0.76 0.24 1.00
##
## Accuracy:
## 0.8674 (10422/12015)
##
## Error rate:
## 0.1326 (1593/12015)
##
## Error rate reduction (vs. base rate):
## 0.4453 (p-value < 2.2e-16)
Next, the model is evaluated on the test dataset.
NBPredictions <- predict(NaiveBayesModel,test)
eval_model(NBPredictions,test$left)
##
## Confusion matrix (absolute):
## Actual
## Prediction 0 1 Sum
## 0 2096 206 2302
## 1 189 493 682
## Sum 2285 699 2984
##
## Confusion matrix (relative):
## Actual
## Prediction 0 1 Sum
## 0 0.70 0.07 0.77
## 1 0.06 0.17 0.23
## Sum 0.77 0.23 1.00
##
## Accuracy:
## 0.8676 (2589/2984)
##
## Error rate:
## 0.1324 (395/2984)
##
## Error rate reduction (vs. base rate):
## 0.4349 (p-value < 2.2e-16)
Here the Naive Bayes model reaches a test accuracy of 86.76%, better than the 83.91% of the OneR model, because it takes all of the attributes into account. Naive Bayes assumes that the attributes are conditionally independent given the class; although this assumption rarely holds in practice, the model still works surprisingly well for prediction.
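One detail worth noting in the conditional-probability tables above is that several entries are exactly zero (for example, time_spend_company values 7, 8 and 10 for leavers), so a test case with such a value receives a zero likelihood for that class. A hedged sketch of how Laplace smoothing could be applied with e1071 on the same split; the resulting accuracy is not reported here:
# refit with add-one (Laplace) smoothing so no conditional probability is exactly zero
NaiveBayesLaplace <- naiveBayes(left ~ ., data = train, laplace = 1)
eval_model(predict(NaiveBayesLaplace, test), test$left)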
In this section, we use a decision tree to make the predictions. We reload the original, non-discretized data so that the tree can split the continuous variables directly.
df <- read.csv("HR_comma_sep.csv")
df$left <- factor(df$left)
set.seed(45)
pd <- sample(2, nrow(df), replace = TRUE, prob = c(0.8,0.2))
train <- df[pd==1,]   # the empty column index keeps all columns
test <- df[pd==2,]
cat("Number of train data: ", nrow(train), "\n")
## Number of train data: 12015
cat("Number of test data: ", nrow(test))
## Number of test data: 2984
# ctree() from party: mincriterion = 0.99 requires 99% confidence (p < 0.01) before a split is made,
# and minsplit = 500 prevents nodes with fewer than 500 observations from being split further
treeModel <- ctree(left ~ satisfaction_level + last_evaluation + number_project +
                     average_montly_hours + time_spend_company + Work_accident +
                     promotion_last_5years,
                   controls = ctree_control(mincriterion = 0.99, minsplit = 500),
                   data = train)
plot(treeModel)
First, we evaluate the tree model on the training dataset.
treePrediction <- predict(treeModel, train)
eval_model(treePrediction, train$left)
##
## Confusion matrix (absolute):
## Actual
## Prediction 0 1 Sum
## 0 9105 353 9458
## 1 38 2519 2557
## Sum 9143 2872 12015
##
## Confusion matrix (relative):
## Actual
## Prediction 0 1 Sum
## 0 0.76 0.03 0.79
## 1 0.00 0.21 0.21
## Sum 0.76 0.24 1.00
##
## Accuracy:
## 0.9675 (11624/12015)
##
## Error rate:
## 0.0325 (391/12015)
##
## Error rate reduction (vs. base rate):
## 0.8639 (p-value < 2.2e-16)
Then we evaluate the tree model on the test dataset.
treePrediction <- predict(treeModel, test)
eval_model(treePrediction, test$left)
##
## Confusion matrix (absolute):
## Actual
## Prediction 0 1 Sum
## 0 2278 97 2375
## 1 7 602 609
## Sum 2285 699 2984
##
## Confusion matrix (relative):
## Actual
## Prediction 0 1 Sum
## 0 0.76 0.03 0.80
## 1 0.00 0.20 0.20
## Sum 0.77 0.23 1.00
##
## Accuracy:
## 0.9651 (2880/2984)
##
## Error rate:
## 0.0349 (104/2984)
##
## Error rate reduction (vs. base rate):
## 0.8512 (p-value < 2.2e-16)
From this section, we can see that the decision tree gives the best test accuracy of the three models, 96.51% (versus 83.91% for OneR and 86.76% for Naive Bayes), with only a small gap to its training accuracy of 96.75%. It is therefore the model we would choose to predict whether an employee will leave the company.
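As a usage sketch, the fitted tree can score a new employee; the profile below is hypothetical and its values are made up purely for illustration:
# hypothetical employee profile (illustrative values only)
new_employee <- data.frame(
  satisfaction_level    = 0.40,
  last_evaluation       = 0.55,
  number_project        = 2L,
  average_montly_hours  = 150L,
  time_spend_company    = 3L,
  Work_accident         = 0L,
  promotion_last_5years = 0L,
  Department            = "sales",
  salary                = "low"
)
predict(treeModel, newdata = new_employee)  # predicted class: 0 = stay, 1 = leave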