This report presents employee promotion prediction using classification algorithms. The dataset used for modeling is real employee data from a company and is publicly available on Kaggle.
It can be downloaded here: https://www.kaggle.com/muhammadimran112233/employees-evaluation-for-promotion
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraries
library(dplyr)
library(ggplot2)
library(corrgram)
library(gridExtra)
library(e1071)
library(party)
library(randomForest)
library(caret)
library(ROSE)
Library dplyr: for data manipulation.
Library ggplot2: for graphics and visualization.
Library corrgram: for visualizing correlation coefficients.
Library gridExtra: for arranging multiple plots in one figure.
Library e1071: for the Support Vector Machine (SVM) model.
Library party: for the Decision Tree model.
Library randomForest: for the Random Forest model.
Library caret: for data partitioning and other modeling utilities.
Library ROSE: for handling imbalanced data with over- and under-sampling.
Read the employee promotion dataset from the .csv file into an R dataframe, then inspect its structure.
# read data to dataframe
employee_df <- read.csv("employee_promotion.csv", na.strings = "")
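For reference, the structure mentioned above can be inspected with str(); a minimal check (output omitted here):
# inspect column names and types
str(employee_df)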
The dataset has 54808 observations (rows) and 13 variables (columns). The target variable is is_promoted and the remaining variables are candidate features.
Compute a statistical summary of each variable.
# statistical summary
summary(employee_df)
## employee_id department region education
## Min. : 1 Length:54808 Length:54808 Length:54808
## 1st Qu.:19670 Class :character Class :character Class :character
## Median :39226 Mode :character Mode :character Mode :character
## Mean :39196
## 3rd Qu.:58731
## Max. :78298
##
## gender recruitment_channel no_of_trainings age
## Length:54808 Length:54808 Min. : 1.000 Min. :20.0
## Class :character Class :character 1st Qu.: 1.000 1st Qu.:29.0
## Mode :character Mode :character Median : 1.000 Median :33.0
## Mean : 1.253 Mean :34.8
## 3rd Qu.: 1.000 3rd Qu.:39.0
## Max. :10.000 Max. :60.0
##
## previous_year_rating length_of_service awards_won avg_training_score
## Min. :1.000 Min. : 1.000 Min. :0.00000 Min. :39.00
## 1st Qu.:3.000 1st Qu.: 3.000 1st Qu.:0.00000 1st Qu.:51.00
## Median :3.000 Median : 5.000 Median :0.00000 Median :60.00
## Mean :3.329 Mean : 5.866 Mean :0.02317 Mean :63.71
## 3rd Qu.:4.000 3rd Qu.: 7.000 3rd Qu.:0.00000 3rd Qu.:77.00
## Max. :5.000 Max. :37.000 Max. :1.00000 Max. :99.00
## NA's :4124 NA's :2560
## is_promoted
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.08517
## 3rd Qu.:0.00000
## Max. :1.00000
##
We can see the min, median, mean, and max values of each numeric variable.
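The per-column missing-value counts behind the NA's rows above can also be obtained directly; a small sketch:
# count missing values in each column
colSums(is.na(employee_df))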
Next, create a copy of the dataframe; this copy will be used for Exploratory Data Analysis.
# create a new dataframe
employee_df2 <- employee_df
Now, change the relevant column types to factors for Exploratory Data Analysis.
# change column type is_promoted
employee_df2$is_promoted <- factor(employee_df$is_promoted,
levels = c(0,1),
labels = c("Not Promoted", "Promoted"))
# change column type previous_year_rating
employee_df2$previous_year_rating <- factor(employee_df$previous_year_rating,
levels = c(1,2,3,4,5),
labels = c("Very Bad",
"Bad",
"Enough",
"Good",
"Very Good"))
# change column type awards_won
employee_df2$awards_won <- factor(employee_df$awards_won,
levels = c(0,1),
labels = c("None", "Won"))
# change column type gender
employee_df2$gender <- factor(employee_df$gender,
levels = c("f", "m"),
labels = c("Female", "Male"))
Plot the distribution of department using a bar plot.
# department distribution
ggplot(data = employee_df2, aes(x = department, fill = department)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Department")
Based on the bar plot above, we can see that the Sales & Marketing department is the department with the most employees.
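The same ranking can be read off a frequency table; for example:
# employee counts per department, largest first
sort(table(employee_df2$department), decreasing = TRUE)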
Plot the distribution of region using a bar plot.
# region distribution
ggplot(data = employee_df2, aes(x = region, fill = region)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Region")
Based on the bar plot above, we can see that region_2 has the most employees.
Plot the distribution of gender using a bar plot.
# gender distribution
ggplot(data = employee_df2, aes(x = gender, fill = gender)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Gender")
Based on the bar plot above, we can see that most employees are male.
Plot the distribution of previous_year_rating using a bar plot.
# previous_year_rating distribution
ggplot(data = employee_df2, aes(x = previous_year_rating,
fill = previous_year_rating)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Previous year rating")
Based on the bar plot above, we can see that most employees have a previous-year rating of 3 ("Enough"). We treat the missing (NA) ratings as employees who have just joined the company.
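One way to sanity-check this assumption (a sketch, not part of the original analysis) is to cross-tabulate missing ratings against first-year employees:
# TRUE/TRUE cell: employees with no rating who are in their first year of service
table(missing_rating = is.na(employee_df2$previous_year_rating),
first_year = employee_df2$length_of_service == 1)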
Plot the distribution of recruitment_channel using a bar plot.
# recruitment_channel distribution
ggplot(data = employee_df2, aes(x = recruitment_channel,
fill = recruitment_channel)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Recruitment channel")
Based on the bar plot above, we can see that the most common recruitment channel is "other".
Plot the distribution of education using a bar plot.
# education distribution
ggplot(data = employee_df2, aes(x = education, fill = education)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Education")
Based on the bar plot above, we can see that most employees hold a Bachelor's degree. We treat the NA values here as missing data and handle them in the data preparation section.
Plot the distribution of awards_won using a bar plot.
# awards_won distribution
ggplot(data = employee_df2, aes(x = awards_won, fill = awards_won)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Awards won")
Based on the bar plot above, we can see that most employees have not won an award.
Plot the distributions of the numeric features using histograms and a boxplot.
# no_of_trainings distribution
univariate1 <- ggplot(data = employee_df2,
aes (x = no_of_trainings)) +
geom_histogram(bins = 20, fill = '#107009AA') +
labs(title = "Distribution by Number of trainings")
# age distribution
univariate2 <- ggplot(data = employee_df2,
aes (y = age)) +
geom_boxplot(color = '#107009AA') +
labs(title = "Distribution by Age")
# length_of_service distribution
univariate3 <- ggplot(data = employee_df2,
aes (x = length_of_service)) +
geom_histogram(bins = 20, fill = '#107009AA') +
labs(title = "Distribution by Length of service")
# avg_training_score distribution
univariate4 <- ggplot(data = employee_df2,
aes (x = avg_training_score)) +
geom_histogram(bins = 20, fill = '#107009AA') +
labs(title = "Distribution by Avg training score")
Combine the histograms and the boxplot into one figure using grid.arrange.
univariate.plot <- grid.arrange(univariate1,
univariate2,
univariate3,
univariate4, nrow = 2)
Based on the histograms and boxplot above, we can see that the age variable has outliers. In addition, most employees attend only one training and most training scores are around 50. Many employees have a length of service between one and ten years.
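The ages flagged as outliers by the boxplot can be listed explicitly; for example:
# age values beyond the boxplot whiskers
unique(boxplot.stats(employee_df2$age)$out)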
Plot the distribution of gender and is_promoted using a bar plot.
# Promotion by gender
ggplot(data = employee_df2, aes(x = gender, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Gender")
Based on the bar plot above, we can see that more male employees are promoted than female employees. In addition, there are still more male employees who are not promoted than female employees.
Plot the distribution of no_of_trainings and is_promoted using a bar plot.
# Promotion by no_of_trainings
ggplot(data = employee_df2, aes(x = no_of_trainings, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Number of trainings")
Based on the bar plot above, we can see that among employees who attended one training, more were not promoted than promoted.
Plot the distribution of length_of_service and is_promoted using a bar plot.
# Promotion by length_of_service
ggplot(data = employee_df2, aes(x = length_of_service, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Length of service")
Based on the bar plot above, we can see that most of the promoted employees are in the range of 1-10 years of service.
Plot the distribution of awards_won and is_promoted using a bar plot.
# Promotion by awards_won
ggplot(data = employee_df2, aes(x = awards_won, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Awards won")
Based on the bar plot above, we can see that, in absolute counts, more of the promoted employees had not won an award than had won one.
Plot the distribution of previous_year_rating and is_promoted using a bar plot.
# Promotion by previous_year_rating
ggplot(data = employee_df2, aes(x = previous_year_rating, fill = is_promoted)) +
geom_bar(position = "stack") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Promotion by Previous year rating")
Based on the bar plot above, we can see that employees who get enough rating, good rating, and very good rating are more likely to be promoted.
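Because the rating groups have very different sizes, promotion rates per rating level make this comparison more precise; a quick sketch:
# promotion rate within each previous_year_rating level
prop.table(table(employee_df2$previous_year_rating,
employee_df2$is_promoted), margin = 1)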
Plot the distribution of avg_training_score and is_promoted using a bar plot.
# Promotion by avg_training_score
ggplot(data = employee_df2, aes(x = avg_training_score, fill = is_promoted)) +
geom_bar(position = "dodge") +
labs(title = "Promotion by Average training score")
Based on the bar plot above, we can see that many employees who received training scores in the 40-80 range were promoted, although many were not promoted.
Plot the relationship between is_promoted and several other variables.
# Relationship between is_promoted and avg_training_score
ggplot(data = employee_df2, aes(y = avg_training_score,
x = no_of_trainings,
color = awards_won)) +
geom_point(position = "jitter" ) +
facet_grid(gender ~ is_promoted) +
labs(title = "Relationship between Number of trainings vs Avg training score & Awards won",
x = "Number of trainings",
y = "Avg taining score",
color = "Awards won")
Based on the relationship above, we can see that both male and female employees who attended one training and won an award tend to be promoted.
# Relationship between length_of_service and age
ggplot(data = employee_df2, aes(y = age,
x = length_of_service,
color = awards_won)) +
geom_point(position = "jitter" ) +
facet_grid(gender ~ is_promoted) +
labs(title = "Relationship between Length of service vs Age & Awards won",
x = "Length of service",
y = "Age",
color = "Awards won")
Based on the relationship above, we can see that both male and female employees with one to ten years of service who won awards tend to be promoted.
Now, create a corrgram to inspect the correlation coefficients between the variables.
# Correlation Coefficient
corrgram(employee_df,
upper.panel = panel.cor)
Based on the corrgram above, we can see that previous_year_rating, awards_won, and avg_training_score have the largest correlation coefficients with is_promoted.
Other findings from the Exploratory Data Analysis:
- The education variable shows "N/A" values that are not read as missing because the column type is character.
- Employees with less than 10 years of service mostly have no_of_trainings in the range 0-1, which makes it difficult to relate these variables to the others.
- The age variable has outliers in the 55-60 year range.
- We must clean the missing values in previous_year_rating, avg_training_score, and education.
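The coefficients behind this reading can also be computed directly on the numeric columns (a sketch; complete cases only):
# correlation of selected numeric features with is_promoted
cor(employee_df[, c("previous_year_rating", "awards_won",
"avg_training_score", "is_promoted")],
use = "complete.obs")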
Count the missing values in the first dataframe before handling them.
# count missing values before cleaning
sum(is.na(employee_df))
## [1] 9093
We use the first dataframe to impute the missing values in previous_year_rating and avg_training_score with the column means, and we create the second dataframe by removing the rows with a missing education.
# Impute missing previous_year_rating with the column mean
employee_df$previous_year_rating = ifelse(is.na(employee_df$previous_year_rating),
ave(employee_df$previous_year_rating,
FUN = function(x) mean(x, na.rm = TRUE)),
employee_df$previous_year_rating)
# Impute missing avg_training_score with the column mean
employee_df$avg_training_score = ifelse(is.na(employee_df$avg_training_score),
ave(employee_df$avg_training_score,
FUN = function(x) mean(x, na.rm = TRUE)),
employee_df$avg_training_score)
# Drop rows with missing education
employee_df2 <- employee_df[!is.na(employee_df$education), ]
Now, check the number of missing values in the first dataframe again after cleaning.
# count missing values after cleaning
sum(is.na(employee_df))
## [1] 2409
Then, remove the employee_id column from the second dataframe, because employee_id is only an identifier and carries no predictive information.
# Remove employee_id column
employee_df2 <- select(employee_df2, -employee_id)
Now, we have the new second dataframe after removing the employee_id column.
str(employee_df2)
## 'data.frame': 52399 obs. of 12 variables:
## $ department : chr "Sales & Marketing" "Operations" "Sales & Marketing" "Sales & Marketing" ...
## $ region : chr "region_7" "region_22" "region_19" "region_23" ...
## $ education : chr "Master's & above" "Bachelor's" "Bachelor's" "Bachelor's" ...
## $ gender : chr "f" "m" "m" "m" ...
## $ recruitment_channel : chr "sourcing" "other" "sourcing" "other" ...
## $ no_of_trainings : int 1 1 1 2 1 2 1 1 1 1 ...
## $ age : int 35 30 34 39 45 31 31 33 28 32 ...
## $ previous_year_rating: num 5 5 3 1 3 3 3 3 4 5 ...
## $ length_of_service : int 8 4 7 10 2 7 5 6 5 5 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : num 49 60 50 50 73 85 59 63 83 54 ...
## $ is_promoted : int 0 0 0 0 0 0 0 0 0 0 ...
Change the relevant column types in the first dataframe to factors.
# Change is_promoted column type
employee_df$is_promoted <- factor(employee_df$is_promoted,
levels = c(0,1),
labels = c("Not Promoted", "Promoted"))
# Change previous_year_rating column type
employee_df$previous_year_rating <- factor(employee_df$previous_year_rating,
levels = c(0,1,2,3,4,5),
labels = c("Null",
"Very Bad",
"Bad",
"Enough",
"Good",
"Very Good"))
# Change awards_won column type
employee_df$awards_won <- factor(employee_df$awards_won,
levels = c(0,1),
labels = c("None", "Won"))
Now, check whether the data is imbalanced using the target variable (is_promoted).
ggplot(data = employee_df, aes(x = is_promoted, fill = is_promoted)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Promotion")
Based on the bar plot above, we can conclude that the data is imbalanced, so we resample it (over-sampling, under-sampling, and a combination of both) to balance the classes.
Now, partition the data before resampling.
set.seed(2021)
idx <- sample(2, nrow(employee_df), replace = TRUE)
train <- employee_df[idx==1, ]
test <- employee_df[idx==2,]
# class counts in the training data
table(train$is_promoted)
##
## Not Promoted Promoted
## 24997 2329
# class proportions in the training data before resampling
prop.table(table(train$is_promoted))
##
## Not Promoted Promoted
## 0.91476982 0.08523018
Once the data is partitioned, we can handle the class imbalance.
employee_df3 <- ovun.sample(is_promoted~., data = train, method = "over", N = 44154)$data
employee_df4 <- ovun.sample(is_promoted~., data = train, method = "under", N = 4340)$data
employee_df5 <- ovun.sample(is_promoted~., data = train, method = "both",
p = 0.5,
seed = 222,
N = 10000)$data
# class proportions in the over-sampled training data
prop.table(table(employee_df3$is_promoted))
##
## Not Promoted Promoted
## 0.5003624 0.4996376
Check the class distribution of the target variable after resampling.
ggplot(data = employee_df3, aes(x = is_promoted, fill = is_promoted)) +
geom_bar() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,
vjust = 0.5,
hjust = 1)) +
labs(title = "Distribution by Promotion")
Based on the bar plot above, we can conclude that after resampling the class proportions are much more balanced than before. With balanced classes, we can move on to the next step.
Next, we perform feature selection on the third dataframe (employee_df3), which is the over-sampled training data.
X <- select(employee_df3, -department, -region, -education, -gender,
-recruitment_channel, -is_promoted, -previous_year_rating, -awards_won)
After feature selection, we process the data with Principal Component Analysis (PCA), keeping the components that explain at least 80% of the variance.
pr.out <- prcomp(X, scale. = TRUE) # PCA
pr.var <- pr.out$sdev^2 # Variance
pve <- pr.var / sum(pr.var) # Proportion of Variance Explained (PVE)
cumsum(pve)
## [1] 0.3270913 0.5346093 0.7318913 0.9229744 1.0000000
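As a cross-check on the 80% threshold, the number of components can be chosen programmatically from pve (a sketch; num_pc is a hypothetical name):
# smallest number of principal components explaining at least 80% of the variance
num_pc <- which(cumsum(pve) >= 0.80)[1]
num_pc
With the proportions shown above, this gives 4, matching the four components kept below.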
# create the feature matrix, target vector, and a new dataframe
features <- data.frame(pr.out$x[ , 1:4])
target <- employee_df3$is_promoted
employee_df6 <- cbind(features, target)
Now, split the new dataframe into training and testing data using an 80:20 ratio.
# number of all samples and number of train samples
m <- nrow(employee_df6)
m_train <- 0.8 * m
m_train <- floor(m_train)
# random with set.seed for reproducible result
set.seed(2021)
train_index <- sample(m, m_train)
train_index[1:5]
## [1] 25018 17478 37189 16487 35763
# create dataframe for train and test
train_df <- employee_df6[train_index, ]
test_df <- employee_df6[-train_index, ]
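Since caret is already loaded, a stratified 80:20 split could be written as follows (a sketch only; train_df2 and test_df2 are hypothetical names and this is not the split used above):
# stratified split on the target using caret
set.seed(2021)
strat_index <- createDataPartition(employee_df6$target, p = 0.8, list = FALSE)
train_df2 <- employee_df6[strat_index, ]
test_df2 <- employee_df6[-strat_index, ]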
After creating training data and testing data, we can enter the modeling process.
In this modeling step, we fit four classification models on the training data: logistic regression, a decision tree, a random forest, and a support vector machine.
model.logit <- glm(formula = target ~ .,
data = train_df,
family = binomial)
model.logit
##
## Call: glm(formula = target ~ ., family = binomial, data = train_df)
##
## Coefficients:
## (Intercept) PC1 PC2 PC3 PC4
## -0.003941 0.074198 0.271251 0.305247 0.515660
##
## Degrees of Freedom: 35322 Total (i.e. Null); 35318 Residual
## Null Deviance: 48970
## Residual Deviance: 45550 AIC: 45560
model.tree <- ctree(formula = target ~ .,
data = train_df)
model.tree
##
## Conditional inference tree with 88 terminal nodes
##
## Response: target
## Inputs: PC1, PC2, PC3, PC4
## Number of observations: 35323
##
## 1) PC4 <= 0.1834015; criterion = 1, statistic = 1902.366
## 2) PC3 <= 0.3319995; criterion = 1, statistic = 243.112
## 3) PC2 <= 2.664413; criterion = 1, statistic = 44.368
## 4) PC4 <= -1.367108; criterion = 1, statistic = 93.044
## 5) PC2 <= 0.619071; criterion = 1, statistic = 34.198
## 6) PC3 <= -1.866225; criterion = 0.999, statistic = 14.217
## 7)* weights = 16
## 6) PC3 > -1.866225
## 8)* weights = 749
## 5) PC2 > 0.619071
## 9) PC3 <= -0.5424553; criterion = 0.99, statistic = 9.164
## 10) PC4 <= -3.136583; criterion = 1, statistic = 20.203
## 11)* weights = 13
## 10) PC4 > -3.136583
## 12) PC2 <= 2.072296; criterion = 0.992, statistic = 9.474
## 13)* weights = 201
## 12) PC2 > 2.072296
## 14) PC3 <= -1.074581; criterion = 1, statistic = 20.148
## 15)* weights = 7
## 14) PC3 > -1.074581
## 16)* weights = 29
## 9) PC3 > -0.5424553
## 17)* weights = 215
## 4) PC4 > -1.367108
## 18) PC2 <= 0.9170353; criterion = 1, statistic = 65.903
## 19) PC4 <= -0.2059072; criterion = 1, statistic = 22.109
## 20) PC3 <= -0.6977661; criterion = 0.998, statistic = 12.622
## 21) PC3 <= -1.052106; criterion = 0.99, statistic = 9.079
## 22) PC3 <= -1.849351; criterion = 0.998, statistic = 12.622
## 23)* weights = 270
## 22) PC3 > -1.849351
## 24)* weights = 1409
## 21) PC3 > -1.052106
## 25) PC4 <= -0.6731815; criterion = 1, statistic = 25.566
## 26) PC1 <= -3.660994; criterion = 1, statistic = 20.656
## 27) PC4 <= -0.757441; criterion = 0.987, statistic = 8.72
## 28)* weights = 10
## 27) PC4 > -0.757441
## 29)* weights = 10
## 26) PC1 > -3.660994
## 30) PC3 <= -0.7385885; criterion = 1, statistic = 14.807
## 31)* weights = 41
## 30) PC3 > -0.7385885
## 32) PC3 <= -0.7309253; criterion = 1, statistic = 14.954
## 33)* weights = 10
## 32) PC3 > -0.7309253
## 34)* weights = 11
## 25) PC4 > -0.6731815
## 35) PC3 <= -0.9562167; criterion = 0.959, statistic = 6.578
## 36)* weights = 136
## 35) PC3 > -0.9562167
## 37)* weights = 464
## 20) PC3 > -0.6977661
## 38) PC3 <= 0.1877901; criterion = 1, statistic = 15.938
## 39) PC1 <= -0.4132578; criterion = 0.992, statistic = 9.621
## 40) PC3 <= -0.6873006; criterion = 0.993, statistic = 9.741
## 41)* weights = 26
## 40) PC3 > -0.6873006
## 42) PC1 <= -0.6088672; criterion = 0.993, statistic = 9.691
## 43)* weights = 819
## 42) PC1 > -0.6088672
## 44)* weights = 101
## 39) PC1 > -0.4132578
## 45) PC1 <= -0.3961709; criterion = 0.974, statistic = 7.394
## 46)* weights = 35
## 45) PC1 > -0.3961709
## 47)* weights = 1839
## 38) PC3 > 0.1877901
## 48) PC4 <= -0.6006943; criterion = 0.998, statistic = 11.715
## 49)* weights = 209
## 48) PC4 > -0.6006943
## 50) PC4 <= -0.4597684; criterion = 0.999, statistic = 14.523
## 51)* weights = 122
## 50) PC4 > -0.4597684
## 52)* weights = 120
## 19) PC4 > -0.2059072
## 53) PC3 <= -0.3696245; criterion = 0.999, statistic = 12.993
## 54)* weights = 2996
## 53) PC3 > -0.3696245
## 55)* weights = 1537
## 18) PC2 > 0.9170353
## 56)* weights = 1102
## 3) PC2 > 2.664413
## 57) PC4 <= -3.936369; criterion = 0.999, statistic = 13.87
## 58) PC1 <= 2.261653; criterion = 0.961, statistic = 6.667
## 59)* weights = 24
## 58) PC1 > 2.261653
## 60)* weights = 7
## 57) PC4 > -3.936369
## 61)* weights = 199
## 2) PC3 > 0.3319995
## 62) PC4 <= -1.591272; criterion = 1, statistic = 158.546
## 63) PC2 <= 0.7507008; criterion = 1, statistic = 80.541
## 64) PC4 <= -1.973742; criterion = 0.999, statistic = 13.194
## 65)* weights = 214
## 64) PC4 > -1.973742
## 66) PC4 <= -1.966164; criterion = 1, statistic = 14.946
## 67)* weights = 15
## 66) PC4 > -1.966164
## 68) PC1 <= -2.738524; criterion = 0.998, statistic = 12.276
## 69)* weights = 13
## 68) PC1 > -2.738524
## 70)* weights = 177
## 63) PC2 > 0.7507008
## 71) PC4 <= -2.922163; criterion = 1, statistic = 28.703
## 72)* weights = 238
## 71) PC4 > -2.922163
## 73) PC2 <= 4.251806; criterion = 1, statistic = 16.713
## 74) PC3 <= 1.136422; criterion = 1, statistic = 15.196
## 75) PC1 <= 0.005669644; criterion = 0.963, statistic = 6.748
## 76) PC3 <= 0.5425239; criterion = 0.993, statistic = 9.85
## 77)* weights = 7
## 76) PC3 > 0.5425239
## 78) PC1 <= -0.3620893; criterion = 0.997, statistic = 11.377
## 79)* weights = 13
## 78) PC1 > -0.3620893
## 80)* weights = 15
## 75) PC1 > 0.005669644
## 81)* weights = 91
## 74) PC3 > 1.136422
## 82)* weights = 272
## 73) PC2 > 4.251806
## 83)* weights = 32
## 62) PC4 > -1.591272
## 84) PC2 <= 1.297146; criterion = 1, statistic = 112.001
## 85) PC1 <= -1.709617; criterion = 1, statistic = 24.551
## 86) PC4 <= 0.01179968; criterion = 1, statistic = 20.81
## 87)* weights = 569
## 86) PC4 > 0.01179968
## 88)* weights = 127
## 85) PC1 > -1.709617
## 89) PC3 <= 0.5565738; criterion = 0.996, statistic = 10.69
## 90) PC3 <= 0.4047064; criterion = 1, statistic = 20.925
## 91) PC2 <= -0.7849073; criterion = 1, statistic = 15.509
## 92)* weights = 459
## 91) PC2 > -0.7849073
## 93) PC1 <= -1.519028; criterion = 0.99, statistic = 9.225
## 94)* weights = 11
## 93) PC1 > -1.519028
## 95) PC3 <= 0.4043296; criterion = 1, statistic = 16.486
## 96)* weights = 48
## 95) PC3 > 0.4043296
## 97)* weights = 7
## 90) PC3 > 0.4047064
## 98) PC2 <= 1.005243; criterion = 0.991, statistic = 9.421
## 99)* weights = 728
## 98) PC2 > 1.005243
## 100) PC1 <= -0.5717216; criterion = 0.996, statistic = 11.051
## 101)* weights = 17
## 100) PC1 > -0.5717216
## 102)* weights = 16
## 89) PC3 > 0.5565738
## 103)* weights = 3438
## 84) PC2 > 1.297146
## 104) PC3 <= 1.168326; criterion = 1, statistic = 35.341
## 105) PC1 <= -0.5142853; criterion = 1, statistic = 29.728
## 106) PC2 <= 2.415267; criterion = 1, statistic = 16.952
## 107)* weights = 29
## 106) PC2 > 2.415267
## 108)* weights = 7
## 105) PC1 > -0.5142853
## 109)* weights = 419
## 104) PC3 > 1.168326
## 110) PC1 <= -1.231104; criterion = 0.998, statistic = 12.027
## 111) PC4 <= -0.856759; criterion = 0.983, statistic = 8.195
## 112) PC1 <= -2.595527; criterion = 0.988, statistic = 8.756
## 113)* weights = 14
## 112) PC1 > -2.595527
## 114)* weights = 13
## 111) PC4 > -0.856759
## 115)* weights = 17
## 110) PC1 > -1.231104
## 116) PC2 <= 1.51405; criterion = 0.978, statistic = 7.7
## 117) PC3 <= 1.518513; criterion = 0.995, statistic = 10.401
## 118)* weights = 16
## 117) PC3 > 1.518513
## 119)* weights = 55
## 116) PC2 > 1.51405
## 120)* weights = 109
## 1) PC4 > 0.1834015
## 121) PC4 <= 1.187093; criterion = 1, statistic = 619.878
## 122) PC3 <= 0.9999766; criterion = 1, statistic = 123.732
## 123) PC3 <= -1.651292; criterion = 1, statistic = 18.273
## 124)* weights = 216
## 123) PC3 > -1.651292
## 125)* weights = 9202
## 122) PC3 > 0.9999766
## 126) PC4 <= 0.9336104; criterion = 1, statistic = 83.055
## 127) PC2 <= 0.3869843; criterion = 0.972, statistic = 7.267
## 128) PC2 <= -0.2685459; criterion = 0.973, statistic = 7.298
## 129)* weights = 1339
## 128) PC2 > -0.2685459
## 130)* weights = 725
## 127) PC2 > 0.3869843
## 131)* weights = 184
## 126) PC4 > 0.9336104
## 132) PC3 <= 1.215289; criterion = 1, statistic = 28.505
## 133) PC1 <= 0.4961082; criterion = 0.999, statistic = 13.604
## 134)* weights = 19
## 133) PC1 > 0.4961082
## 135) PC1 <= 1.217222; criterion = 0.967, statistic = 6.974
## 136)* weights = 74
## 135) PC1 > 1.217222
## 137)* weights = 17
## 132) PC3 > 1.215289
## 138) PC4 <= 1.115847; criterion = 1, statistic = 15.367
## 139) PC1 <= -0.4169971; criterion = 0.997, statistic = 11.436
## 140)* weights = 116
## 139) PC1 > -0.4169971
## 141)* weights = 331
## 138) PC4 > 1.115847
## 142)* weights = 201
## 121) PC4 > 1.187093
## 143) PC3 <= -0.2128683; criterion = 1, statistic = 305.279
## 144) PC4 <= 1.514837; criterion = 1, statistic = 59.457
## 145) PC2 <= 0.8918199; criterion = 0.963, statistic = 6.729
## 146) PC3 <= -0.3881231; criterion = 1, statistic = 16.873
## 147) PC1 <= 1.680952; criterion = 0.994, statistic = 10.232
## 148)* weights = 146
## 147) PC1 > 1.680952
## 149)* weights = 9
## 146) PC3 > -0.3881231
## 150)* weights = 14
## 145) PC2 > 0.8918199
## 151) PC3 <= -0.804613; criterion = 1, statistic = 15.602
## 152) PC4 <= 1.190255; criterion = 1, statistic = 16.258
## 153)* weights = 17
## 152) PC4 > 1.190255
## 154)* weights = 127
## 151) PC3 > -0.804613
## 155)* weights = 538
## 144) PC4 > 1.514837
## 156) PC3 <= -0.655496; criterion = 0.996, statistic = 10.99
## 157) PC1 <= 0.376782; criterion = 1, statistic = 19.481
## 158)* weights = 15
## 157) PC1 > 0.376782
## 159)* weights = 7
## 156) PC3 > -0.655496
## 160)* weights = 228
## 143) PC3 > -0.2128683
## 161) PC4 <= 1.366997; criterion = 1, statistic = 103.474
## 162) PC3 <= 0.4799058; criterion = 1, statistic = 103.977
## 163) PC2 <= 1.071399; criterion = 1, statistic = 16.325
## 164)* weights = 210
## 163) PC2 > 1.071399
## 165)* weights = 91
## 162) PC3 > 0.4799058
## 166) PC3 <= 0.891597; criterion = 1, statistic = 22.893
## 167) PC2 <= 0.4328277; criterion = 0.998, statistic = 12.333
## 168)* weights = 9
## 167) PC2 > 0.4328277
## 169)* weights = 139
## 166) PC3 > 0.891597
## 170)* weights = 375
## 161) PC4 > 1.366997
## 171) PC3 <= 0.5383737; criterion = 0.99, statistic = 9.226
## 172) PC4 <= 1.451175; criterion = 0.997, statistic = 11.613
## 173)* weights = 152
## 172) PC4 > 1.451175
## 174)* weights = 485
## 171) PC3 > 0.5383737
## 175)* weights = 424
set.seed(2021)
model.forest <- randomForest(formula = target ~ .,
data = train_df)
model.forest
##
## Call:
## randomForest(formula = target ~ ., data = train_df)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 1.61%
## Confusion matrix:
## Not Promoted Promoted class.error
## Not Promoted 17153 563 0.0317791827
## Promoted 4 17603 0.0002271824
set.seed(2021)
model.svm <- svm(formula = target ~ .,
data = train_df)
model.svm
##
## Call:
## svm(formula = target ~ ., data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 27238
Now, compute the actual and predicted values so we can build a confusion matrix and measure the classification performance of each model.
# Actual values
actual <- test_df$target
# Probability [0-1] of predicted values
pred.logit.prob <- predict(model.logit, test_df, type = "response")
pred.logit.boolen <- pred.logit.prob > 0.5
# Change to factor (FALSE ==> "Not Promoted", TRUE ==> "Promoted")
pred.logit <- factor(pred.logit.boolen,
levels = c(FALSE, TRUE),
labels = c("Not Promoted", "Promoted"))
pred.tree <- predict(model.tree, test_df)
pred.forest <- predict(model.forest, test_df)
pred.svm <- predict(model.svm, test_df)
Create a confusion matrix for each model to compare actual and predicted values.
# Confusion Matrix : Logistic Regression
logit.perf <- table(test_df$target, pred.logit,
dnn = c("Actual","Predicted"))
logit.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 2902 1475
## Promoted 1900 2554
# Confusion Matrix : Decision Tree
ctree.perf <- table(test_df$target, pred.tree,
dnn = c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 2428 1949
## Promoted 1260 3194
# Confusion Matrix : Random Forest
forest.perf <- table(test_df$target, pred.forest,
dnn = c("Actual","Predicted"))
forest.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 4241 136
## Promoted 8 4446
# Confusion Matrix : Support Vector Machine
svm.perf <- table(na.omit(test_df)$target, pred.svm,
dnn = c("Actual","Predicted"))
svm.perf
## Predicted
## Actual Not Promoted Promoted
## Not Promoted 2777 1600
## Promoted 1699 2755
Now, create a classification performance function to evaluate each model.
performance_classification <- function(prediction, actual){
# Confusion matrix
conf.matrix <- table(actual, prediction,
dnn = c("Actual", "Predicted"))
conf.matrix
# Compute performance
tp <- conf.matrix[2,2]
tn <- conf.matrix[1,1]
fp <- conf.matrix[1,2]
fn <- conf.matrix[2,1]
accuracy <- (tp+tn) / (tp+tn+fp+fn)
precision <- tp / (tp+fp)
recall <- tp / (tp+fn)
f1score <- (2*precision*recall) / (precision+recall)
result <- paste("Accuracy = ", round(accuracy,3),
"\nPrecision = ", round(precision,3),
"\nRecall = ", round(recall,3),
"\nF1-Score = ", round(f1score,3),
"\n")
cat(result)
}
Now, we can check the classification performance of each model.
# Performance Classification : Logistic Regression
performance_classification(pred.logit, actual)
## Accuracy = 0.618
## Precision = 0.634
## Recall = 0.573
## F1-Score = 0.602
# Performance Classification : Decision Tree
performance_classification(pred.tree, actual)
## Accuracy = 0.637
## Precision = 0.621
## Recall = 0.717
## F1-Score = 0.666
# Performance Classification : Random Forest
performance_classification(pred.forest, actual)
## Accuracy = 0.984
## Precision = 0.97
## Recall = 0.998
## F1-Score = 0.984
# Performance Classification : Support Vector Machine
performance_classification(pred.svm, actual)
## Accuracy = 0.626
## Precision = 0.633
## Recall = 0.619
## F1-Score = 0.625
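To compare the four models side by side, the confusion matrices built above can be summarised in one table (a sketch; metrics is a hypothetical helper):
# accuracy, precision, and recall computed from each confusion matrix
metrics <- function(cm) {
tp <- cm[2, 2]; tn <- cm[1, 1]; fp <- cm[1, 2]; fn <- cm[2, 1]
c(accuracy = (tp + tn) / sum(cm),
precision = tp / (tp + fp),
recall = tp / (tp + fn))
}
round(rbind(Logistic = metrics(logit.perf),
`Decision Tree` = metrics(ctree.perf),
`Random Forest` = metrics(forest.perf),
SVM = metrics(svm.perf)), 3)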
Based on the evaluation above, the Random Forest model gives the best performance and is the recommended model for predicting is_promoted from avg_training_score, length_of_service, no_of_trainings, and age.
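To see which inputs the recommended model relies on most, the random forest's variable importance over the principal components used as features can be inspected; for example:
# mean decrease in Gini impurity for each principal component
importance(model.forest)
varImpPlot(model.forest)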