Load dataset

df <- read.csv("HR_comma_sep.csv")

Exploratory Data Analysis (EDA)

We will carry out EDA to see the general structure of the data and decide whether we need to preprocess the data.

str(df)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Department           : chr  "sales" "sales" "sales" "sales" ...
##  $ salary               : chr  "low" "medium" "medium" "low" ...
describe(df)
## df 
## 
##  10  Variables      14999  Observations
## --------------------------------------------------------------------------------
## satisfaction_level 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    14999        0       92        1   0.6128   0.2823     0.11     0.21 
##      .25      .50      .75      .90      .95 
##     0.44     0.64     0.82     0.92     0.96 
## 
## lowest : 0.09 0.10 0.11 0.12 0.13, highest: 0.96 0.97 0.98 0.99 1.00
## --------------------------------------------------------------------------------
## last_evaluation 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    14999        0       65        1   0.7161   0.1973     0.46     0.49 
##      .25      .50      .75      .90      .95 
##     0.56     0.72     0.87     0.95     0.98 
## 
## lowest : 0.36 0.37 0.38 0.39 0.40, highest: 0.96 0.97 0.98 0.99 1.00
## --------------------------------------------------------------------------------
## number_project 
##        n  missing distinct     Info     Mean      Gmd 
##    14999        0        6    0.945    3.803    1.367 
## 
## lowest : 2 3 4 5 6, highest: 3 4 5 6 7
##                                               
## Value          2     3     4     5     6     7
## Frequency   2388  4055  4365  2761  1174   256
## Proportion 0.159 0.270 0.291 0.184 0.078 0.017
## --------------------------------------------------------------------------------
## average_montly_hours 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    14999        0      215        1    201.1    57.48      130      137 
##      .25      .50      .75      .90      .95 
##      156      200      245      267      275 
## 
## lowest :  96  97  98  99 100, highest: 306 307 308 309 310
## --------------------------------------------------------------------------------
## time_spend_company 
##        n  missing distinct     Info     Mean      Gmd 
##    14999        0        8    0.905    3.498     1.43 
## 
## lowest :  2  3  4  5  6, highest:  5  6  7  8 10
##                                                           
## Value          2     3     4     5     6     7     8    10
## Frequency   3244  6443  2557  1473   718   188   162   214
## Proportion 0.216 0.430 0.170 0.098 0.048 0.013 0.011 0.014
## --------------------------------------------------------------------------------
## Work_accident 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    14999        0        2    0.371     2169   0.1446   0.2474 
## 
## --------------------------------------------------------------------------------
## left 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    14999        0        2    0.544     3571   0.2381   0.3628 
## 
## --------------------------------------------------------------------------------
## promotion_last_5years 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    14999        0        2    0.062      319  0.02127  0.04163 
## 
## --------------------------------------------------------------------------------
## Department 
##        n  missing distinct 
##    14999        0       10 
## 
## lowest : accounting  hr          IT          management  marketing  
## highest: product_mng RandD       sales       support     technical  
##                                                                       
## Value       accounting          hr          IT  management   marketing
## Frequency          767         739        1227         630         858
## Proportion       0.051       0.049       0.082       0.042       0.057
##                                                                       
## Value      product_mng       RandD       sales     support   technical
## Frequency          902         787        4140        2229        2720
## Proportion       0.060       0.052       0.276       0.149       0.181
## --------------------------------------------------------------------------------
## salary 
##        n  missing distinct 
##    14999        0        3 
##                                
## Value        high    low medium
## Frequency    1237   7316   6446
## Proportion  0.082  0.488  0.430
## --------------------------------------------------------------------------------

From the above output from describe(df), we can see that there are no missing values. Let’s conclude for each of the variable:

Our target is:

Distribution Analysis

hist(df$satisfaction_level, 
     freq = FALSE,
     main = "Histogram for Satisfaction Level", 
     xlab = " Satisfaction Level", 
     ylab = "Frequency",
     border = "black", 
     col = "lightblue1")
dens <- density(df$satisfaction_level)
lines(dens)

The histogram of satisfaction level shows that most of the employees gave a score of more than 0.5. However, there are still a number of employees is not satisfy with the company.

hist(df$last_evaluation, 
     freq = FALSE,
     main = "Histogram for Last Evaluation", 
     xlab = " Last Evaluation", 
     ylab = "Frequency",
     border = "black", 
     col = "lightblue1")
dens <- density(df$last_evaluation)
lines(dens)

In the last evaluation, most of the employees gave a score of more then 0.5. There is also low number of employees that gives a score less than 0.5. However, the company still have rooms to improve so that the employee would give higher score.

hist(df$promotion_last_5years, 
     freq = FALSE,
     main = "Histogram for Promotion Last 5 Years", 
     xlab = " Promotion Last 5 Years", 
     ylab = "Frequency",
     border = "black", 
     col = "lightblue1")
dens <- density(df$promotion_last_5years)
lines(dens)

cat("Percent of employees get promoted: ", 
    nrow(subset(df, promotion_last_5years == 1))/nrow(df)*100,
    "%\n")
## Percent of employees get promoted:  2.126808 %
cat("Number of employees get promoted: ", 
    nrow(subset(df, promotion_last_5years == 1)), "\n")
## Number of employees get promoted:  319
cat("Number of employees does not get promoted: ", 
    nrow(df) - nrow(subset(df, promotion_last_5years == 1)))
## Number of employees does not get promoted:  14680

From the histogram above, we can observe that majority of the employees does not get promoted in the last 5 years. This might be the reason that lots of employee does not give high score in the evaluation.

Nest, we would like to examine the number of employees per department.

barchart(df$Department, 
         main = "Number of Employees In Each Department", 
         xlab = "Number of Employee", 
         ylab = "Department", 
         col = "lightblue1")

counts <- table(df$promotion_last_5years, df$Department)
barplot(counts, main = "Promotion Last 5 Years By Department",
  xlab = "Number of Employees", col = c("blue","red"),
  legend = rownames(counts),
  cex.names = 0.5)

cat("Percent promotion from Technical Department: ", 
    nrow(subset(df, (Department == 'technical' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'technical'))*100,
    "%\n")
## Percent promotion from Technical Department:  1.029412 %
cat("Percent promotion from Support Department: ", 
    nrow(subset(df, (Department == 'support' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'support'))*100,
    "%\n")
## Percent promotion from Support Department:  0.8972633 %
cat("Percent promotion from Sales Department: ", 
    nrow(subset(df, (Department == 'sales' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'sales'))*100,
    "%\n")
## Percent promotion from Sales Department:  2.415459 %
cat("Percent promotion from RnD Department: ", 
    nrow(subset(df, (Department == 'RandD' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'RandD'))*100,
    "%\n")
## Percent promotion from RnD Department:  3.43075 %
cat("Percent promotion from Product Management Department: ", 
    nrow(subset(df, (Department == 'product_mng' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'product_mng'))*100,
    "%\n")
## Percent promotion from Product Management Department:  0 %
cat("Percent promotion from Marketing Department: ", 
    nrow(subset(df, (Department == 'marketing' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'marketing'))*100,
    "%\n")
## Percent promotion from Marketing Department:  5.011655 %
cat("Percent promotion from Management Department: ", 
    nrow(subset(df, (Department == 'management' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'management'))*100,
    "%\n")
## Percent promotion from Management Department:  10.95238 %
cat("Percent promotion from IT Department: ", 
    nrow(subset(df, (Department == 'IT' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'IT'))*100,
    "%\n")
## Percent promotion from IT Department:  0.2444988 %
cat("Percent promotion from HR Department: ", 
    nrow(subset(df, (Department == 'hr' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'hr'))*100,
    "%\n")
## Percent promotion from HR Department:  2.02977 %
cat("Percent promotion from Accounting Department: ", 
    nrow(subset(df, (Department == 'accounting' & promotion_last_5years == 1)))/nrow(subset(df, Department == 'accounting'))*100,
    "%")
## Percent promotion from Accounting Department:  1.825293 %
counts <- table(df$salary, df$Department)
barplot(counts, main = "Salary Distribution By Department",
  xlab = "Department", col = c("darkblue","red", "orange"),
  legend = rownames(counts), beside = TRUE, cex.names = 0.5)

From the charts above, it is clearly shown the percentage of getting promoted in the last 5 years are not more than 10%. Only the management department got more chance of promotion. The result also shows that in majority of the employees does not have a high salary. This might be another reason the employees are does not give high score to the company.

Descretization and Convert Data Into Factor (Categorical Data)

df$satisfaction_level <- cut(df$satisfaction_level, br = c(0, 1/3, 2/3, 1), labels = c("unsatisfy", "neutral", "satisfy"))
df$last_evaluation <- cut(df$last_evaluation, br = c(0, 1/3, 2/3, 1), labels = c("unsatisfy", "neutral", "satisfy"))
df$number_project <- factor(df$number_project)
df$average_montly_hours <- cut(df$average_montly_hours, br = c(95, 180, 250, 310), labels = c("less", "normal", "more"))
df$time_spend_company <- factor(df$time_spend_company)
df$Work_accident <- factor(df$Work_accident)
df$promotion_last_5years <- factor(df$promotion_last_5years)
df$left <- factor(df$left)
str(df)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : Factor w/ 3 levels "unsatisfy","neutral",..: 2 3 1 3 2 2 1 3 3 2 ...
##  $ last_evaluation      : Factor w/ 3 levels "unsatisfy","neutral",..: 2 3 3 3 2 2 3 3 3 2 ...
##  $ number_project       : Factor w/ 6 levels "2","3","4","5",..: 1 4 6 4 1 1 5 4 4 1 ...
##  $ average_montly_hours : Factor w/ 3 levels "less","normal",..: 1 3 3 2 1 1 2 3 2 1 ...
##  $ time_spend_company   : Factor w/ 8 levels "2","3","4","5",..: 2 5 3 4 2 2 3 4 4 2 ...
##  $ Work_accident        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ left                 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ promotion_last_5years: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Department           : chr  "sales" "sales" "sales" "sales" ...
##  $ salary               : chr  "low" "medium" "medium" "low" ...

Train Test Split

set.seed(45)
pd <- sample(2, nrow(df), replace = TRUE, prob = c(0.8,0.2))
train <- df[pd==1,]#, here means all column 
test <- df[pd==2,]
cat("Number of train data: ", nrow(train), "\n")
## Number of train data:  12015
cat("Number of test data: ", nrow(test))
## Number of test data:  2984

OneR model

oner_model <- OneR(formula = left ~., data = train, verbose = TRUE) 
## Warning in OneR.data.frame(x = data, ties.method = ties.method, verbose =
## verbose, : data contains unused factor levels
## 
##     Attribute             Accuracy
## 1 * number_project        83.75%  
## 2   time_spend_company    77.19%  
## 3   satisfaction_level    76.1%   
## 3   last_evaluation       76.1%   
## 3   average_montly_hours  76.1%   
## 3   Work_accident         76.1%   
## 3   promotion_last_5years 76.1%   
## 3   Department            76.1%   
## 3   salary                76.1%   
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(oner_model)
## 
## Call:
## OneR.formula(formula = left ~ ., data = train, verbose = TRUE)
## 
## Rules:
## If number_project = 2 then left = 1
## If number_project = 3 then left = 0
## If number_project = 4 then left = 0
## If number_project = 5 then left = 0
## If number_project = 6 then left = 1
## If number_project = 7 then left = 1
## 
## Accuracy:
## 10062 of 12015 instances classified correctly (83.75%)
## 
## Contingency table:
##      number_project
## left       2      3      4      5     6     7   Sum
##   0      646 * 3179 * 3163 * 1718   437     0  9143
##   1   * 1260     63    326    481 * 538 * 204  2872
##   Sum   1906   3242   3489   2199   975   204 12015
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 4311.2, df = 5, p-value < 2.2e-16
plot(oner_model)

From the model above we can see that:

  • number_project gives the highest accuracy (lowest error) which is 83.75%

  • The rules are given as:

    • If number_project = 2 then left = 1
    • If number_project = 3 then left = 0
    • If number_project = 4 then left = 0
    • If number_project = 5 then left = 0
    • If number_project = 6 then left = 1
    • If number_project = 7 then left = 1

Next, we would like to test the model using the test dataset:

prediction <- predict(oner_model, test)
eval_model(prediction, test$left)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    0    1  Sum
##        0   2028  223 2251
##        1    257  476  733
##        Sum 2285  699 2984
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.68 0.07 0.75
##        1   0.09 0.16 0.25
##        Sum 0.77 0.23 1.00
## 
## Accuracy:
## 0.8391 (2504/2984)
## 
## Error rate:
## 0.1609 (480/2984)
## 
## Error rate reduction (vs. base rate):
## 0.3133 (p-value < 2.2e-16)

From the prediction generated, we can see that the model gives an accuracy of 83.91%. The p-value is less than 2.2e-16 which means the model is significant! This shouws that an employee would leave the company if the have too little (2 projects) or too many projects (more than 5 projects) to handle! Therefore, to ensure an employee would not leave the company, the employees should be given around 3-5 projects to handle.

Naive Bayes Model

In this section, we would be trying the Naive Bayes model to examine whther it can gives a better accuracy as compared to the OneR model. The OneR model shows that there is a strong relationship between number of projects and whether an employee would leave the company. Naive Bayes can consider more attributes rather than one.

NaiveBayesModel <- naiveBayes(left ~., data = train)
NaiveBayesModel
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.7609655 0.2390345 
## 
## Conditional probabilities:
##    satisfaction_level
## Y    unsatisfy    neutral    satisfy
##   0 0.09865471 0.37066608 0.53067921
##   1 0.27646240 0.46169916 0.26183844
## 
##    last_evaluation
## Y   unsatisfy   neutral   satisfy
##   0 0.0000000 0.4093842 0.5906158
##   1 0.0000000 0.4596100 0.5403900
## 
##    number_project
## Y            2          3          4          5          6          7
##   0 0.07065515 0.34769769 0.34594772 0.18790331 0.04779613 0.00000000
##   1 0.43871866 0.02193593 0.11350975 0.16747911 0.18732591 0.07103064
## 
##    average_montly_hours
## Y        less    normal      more
##   0 0.3867440 0.4427431 0.1705130
##   1 0.4564763 0.1970752 0.3464485
## 
##    time_spend_company
## Y            2          3          4          5          6          7
##   0 0.28185497 0.41671224 0.14831018 0.05731160 0.04593678 0.01684349
##   1 0.01636490 0.44324513 0.25383008 0.22806407 0.05849582 0.00000000
##    time_spend_company
## Y            8         10
##   0 0.01410915 0.01892158
##   1 0.00000000 0.00000000
## 
##    Work_accident
## Y            0          1
##   0 0.82598709 0.17401291
##   1 0.95543175 0.04456825
## 
##    promotion_last_5years
## Y             0           1
##   0 0.973750410 0.026249590
##   1 0.995821727 0.004178273
## 
##    Department
## Y   accounting         hr         IT management  marketing product_mng
##   0 0.04724926 0.04506180 0.08279558 0.04888986 0.05884283  0.06092092
##   1 0.05501393 0.05988858 0.07660167 0.02576602 0.05849582  0.05222841
##    Department
## Y        RandD      sales    support  technical
##   0 0.05763972 0.27310511 0.14918517 0.17630975
##   1 0.03377437 0.27889972 0.16086351 0.19846797
## 
##    salary
## Y         high        low     medium
##   0 0.10204528 0.44930548 0.44864924
##   1 0.02123955 0.60898329 0.36977716

The model is tested on train dataset.

NBPredictions <- predict(NaiveBayesModel,train)
eval_model(NBPredictions,train$left)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction     0     1   Sum
##        0    8369   819  9188
##        1     774  2053  2827
##        Sum  9143  2872 12015
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.70 0.07 0.76
##        1   0.06 0.17 0.24
##        Sum 0.76 0.24 1.00
## 
## Accuracy:
## 0.8674 (10422/12015)
## 
## Error rate:
## 0.1326 (1593/12015)
## 
## Error rate reduction (vs. base rate):
## 0.4453 (p-value < 2.2e-16)

The model is tested on test dataset.

NBPredictions <- predict(NaiveBayesModel,test)
eval_model(NBPredictions,test$left)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    0    1  Sum
##        0   2096  206 2302
##        1    189  493  682
##        Sum 2285  699 2984
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.70 0.07 0.77
##        1   0.06 0.17 0.23
##        Sum 0.77 0.23 1.00
## 
## Accuracy:
## 0.8676 (2589/2984)
## 
## Error rate:
## 0.1324 (395/2984)
## 
## Error rate reduction (vs. base rate):
## 0.4349 (p-value < 2.2e-16)

In this section, we can see that the Naive Bayes gives an accuracy of 86.76%, which is more accurate than the One-R model! This is because the Naive Bayes model takes more attributes into account. The Naive Bayes model assume that all attributes are independent of each other. Even this is not a practical assumption, but the Naive Bayes model works surprisingly well in making predictions.

Decision Tree

In this section, we would use the decision tree to make the predictions.

df <- read.csv("HR_comma_sep.csv")
df$left <- factor(df$left)
set.seed(45)
pd <- sample(2, nrow(df), replace = TRUE, prob = c(0.8,0.2))
train <- df[pd==1,]#, here means all column 
test <- df[pd==2,]
cat("Number of train data: ", nrow(train), "\n")
## Number of train data:  12015
cat("Number of test data: ", nrow(test))
## Number of test data:  2984
treeModel <- ctree(left~satisfaction_level+last_evaluation+number_project+average_montly_hours+time_spend_company+Work_accident+promotion_last_5years, controls = ctree_control(mincriterion = 0.99,minsplit = 500), data = train)
plot(treeModel)

Test the tree model using train dataset.

treePrediction <- predict(treeModel, train)
eval_model(treePrediction, train$left)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction     0     1   Sum
##        0    9105   353  9458
##        1      38  2519  2557
##        Sum  9143  2872 12015
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.76 0.03 0.79
##        1   0.00 0.21 0.21
##        Sum 0.76 0.24 1.00
## 
## Accuracy:
## 0.9675 (11624/12015)
## 
## Error rate:
## 0.0325 (391/12015)
## 
## Error rate reduction (vs. base rate):
## 0.8639 (p-value < 2.2e-16)

Test the tree model using test dataset.

treePrediction <- predict(treeModel, test)
eval_model(treePrediction, test$left)
## 
## Confusion matrix (absolute):
##           Actual
## Prediction    0    1  Sum
##        0   2278   97 2375
##        1      7  602  609
##        Sum 2285  699 2984
## 
## Confusion matrix (relative):
##           Actual
## Prediction    0    1  Sum
##        0   0.76 0.03 0.80
##        1   0.00 0.20 0.20
##        Sum 0.77 0.23 1.00
## 
## Accuracy:
## 0.9651 (2880/2984)
## 
## Error rate:
## 0.0349 (104/2984)
## 
## Error rate reduction (vs. base rate):
## 0.8512 (p-value < 2.2e-16)

From this section, we can see that the decision tree gives us a better accuracy which is 96.51%. This is the best model to be used to predict whether an employee will leave the company or not.