Neural Network: IBM HR Analytics Attrition

Introduction

This project was created to fulfill the Learning by Building (LBB) assignment on the Neural Network and Deep Learning material in the Machine Learning Specialization Course. It uses the IBM HR Analytics Employee Attrition & Performance dataset, a collection of employee records used to study the factors that influence attrition and performance within a company. The data helps identify reasons for employee turnover and analyze performance-related patterns to support talent retention strategies.

Neural networks and deep learning can be used in attrition analysis to determine the causes of employee attrition and forecast future attrition rates.

Import Library

library(neuralnet)
library(dplyr)
library(keras)
library(caret)

Read Data

employee <- read.csv("dataset/data-clean.csv")
glimpse(employee)

## Rows: 1,470
## Columns: 35
## $ attrition                  <chr> "yes", "no", "yes", "no", "no", "no", "no",…
## $ age                        <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ business_travel            <chr> "travel_rarely", "travel_frequently", "trav…
## $ daily_rate                 <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department                 <chr> "sales", "research_development", "research_…
## $ distance_from_home         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education                  <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3…
## $ education_field            <chr> "life_sciences", "life_sciences", "other", …
## $ employee_count             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ employee_number            <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 1…
## $ environment_satisfaction   <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3…
## $ gender                     <chr> "female", "male", "male", "female", "male",…
## $ hourly_rate                <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84,…
## $ job_involvement            <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2…
## $ job_level                  <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1…
## $ job_role                   <chr> "sales_executive", "research_scientist", "l…
## $ job_satisfaction           <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3…
## $ marital_status             <chr> "single", "married", "single", "married", "…
## $ monthly_income             <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ monthly_rate               <int> 19479, 24907, 2396, 23159, 16632, 11864, 99…
## $ num_companies_worked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ over_18                    <chr> "y", "y", "y", "y", "y", "y", "y", "y", "y"…
## $ over_time                  <chr> "yes", "no", "yes", "yes", "no", "no", "yes…
## $ percent_salary_hike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating         <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3…
## $ relationship_satisfaction  <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2…
## $ standard_hours             <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80,…
## $ stock_option_level         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0…
## $ total_working_years        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ training_times_last_year   <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4…
## $ work_life_balance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3…
## $ years_at_company           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_in_current_role      <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2…
## $ years_since_last_promotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
## $ years_with_curr_manager    <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3…

Description: The dataset has 1,470 rows with 35 columns.

attrition: Whether the employee leaves the organization or not.
age: Employee’s age in years.
gender: The gender of the employee.
business_travel: Frequency of employee business trips.
daily_rate: The employee’s daily salary.
department: The department where the employee works.
distance_from_home: The distance from the employee’s home to the workplace in miles.
education: The level of education achieved by employees.
education_field: Field of study of employees.
employee_count: The total number of employees in the organization.
employee_number: A unique identifier for each employee record.
environment_satisfaction: The level of employee satisfaction with their work environment.
hourly_rate: An employee’s hourly salary.
job_involvement: The level of involvement required for an employee’s job.
job_level: The level of employee positions.
job_role: The role of the employee in the organization.
job_satisfaction: Employee satisfaction with their jobs.
marital_status: Employee’s marital status.
monthly_income: Monthly income of employees.
monthly_rate: The employee’s monthly pay rate.
num_companies_worked: The number of companies the employee previously worked for.
over_18: Whether the employee is over 18 years of age.
over_time: Whether the employee works overtime.
percent_salary_hike: The rate of increase in an employee’s salary.
performance_rating: Assessment of employee performance.
relationship_satisfaction: Employee satisfaction with their relationship.
standard_hours: Standard hours of work for employees.
stock_option_level: The employee stock option level.
total_working_years: Total number of years the employee has worked.
training_times_last_year: The number of times the employee attended training in the last year.
work_life_balance: Employees’ perception of their work and life balance.
years_at_company: The number of years the employee has worked at this company.
years_in_current_role: The number of years the employee has been in the current role.
years_since_last_promotion: The number of years since the employee’s last promotion.
years_with_curr_manager: The number of years the employee has worked with the current manager.

Data Pre-processing

First of all we need to get rid of all unused columns.

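# Note: 'bussiness_travel' does not match any column name, so business_travel
# is not dropped here and still appears in the missing-value check further down.
employee <- employee[, !(names(employee) %in% c('bussiness_travel', 'daily_rate', 'employee_count', 'employee_number', 'hourly_rate', 'monthly_rate',
           'num_companies_worked', 'over_18', 'standard_hours', 'stock_option_level', 'training_times_last_year'))]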

head(employee)

Next, convert the columns whose data types are not yet suitable, in particular the ordinal columns that should be factors with the levels given in the dataset metadata.

employee <-  employee %>% 
  mutate(education = factor(education, levels = c(1, 2, 3, 4, 5),
                            labels = c("Below College", "College", "Bachelor", "Master", "Doctor")),
         environment_satisfaction = factor(environment_satisfaction, levels = c(1, 2, 3, 4),
                                          labels = c("Low", "Medium", "High", "Very High")),
         job_involvement = factor(job_involvement, levels = c(1, 2, 3, 4),
                                  labels = c("Low", "Medium", "High", "Very High")),
         job_level = factor(job_level, levels = c(1, 2, 3, 4, 5),
                            labels = c("Entry Level", "Junior Level", "Mid Level", "Senior Level", "Executive Level")),
         job_satisfaction = factor(job_satisfaction, levels = c(1, 2, 3, 4),
                                  labels = c("Low", "Medium", "High", "Very High")),
         performance_rating = factor(performance_rating, levels = c(1, 2, 3, 4),
                                    labels = c("Low", "Good", "Excellent", "Outstanding")),
         relationship_satisfaction = factor(relationship_satisfaction, levels = c(1, 2, 3, 4),
                                           labels = c("Low", "Medium", "High", "Very High")),
         work_life_balance = factor(work_life_balance, levels = c(1, 2, 3, 4),
                                  labels = c("Bad", "Good", "Better", "Best")))

Check whether there are any missing values or duplicated rows.

colSums(is.na(employee))

##                  attrition                        age 
##                          0                          0 
##            business_travel                 department 
##                          0                          0 
##         distance_from_home                  education 
##                          0                          0 
##            education_field   environment_satisfaction 
##                          0                          0 
##                     gender            job_involvement 
##                          0                          0 
##                  job_level                   job_role 
##                          0                          0 
##           job_satisfaction             marital_status 
##                          0                          0 
##             monthly_income                  over_time 
##                          0                          0 
##        percent_salary_hike         performance_rating 
##                          0                          0 
##  relationship_satisfaction        total_working_years 
##                          0                          0 
##          work_life_balance           years_at_company 
##                          0                          0 
##      years_in_current_role years_since_last_promotion 
##                          0                          0 
##    years_with_curr_manager 
##                          0

any(duplicated(employee))

## [1] FALSE

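With no missing values or duplicates, the data is ready for normalization of the numeric columns and encoding of the factor columns. Normalization puts the numeric attributes on a common scale so that each one has a balanced effect in the model. Note that select_if() keeps only numeric and factor columns, so character columns such as department or gender are not carried into the model input: the 9 numeric columns plus 34 dummy columns give the 43 predictors used later.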

employee_num <- employee %>% select_if(is.numeric)
employee_cat <- employee %>% select_if(is.factor)

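# Normalization: fit the centering and scaling parameters (applied later with predict())
employee_norm <- employee_num %>%
  preProcess(method = c("center", "scale"))

# Encoding: define one dummy column per factor level (applied later with predict())
employee_enc <- dummyVars(~., data = employee_cat)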
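To make the two transformations concrete, here is a minimal base-R sketch on a hypothetical toy data frame (the values are invented for illustration and are not taken from the dataset). It assumes that centering and scaling means subtracting the column mean and dividing by the standard deviation, and that encoding means one 0/1 column per factor level; model.matrix() is used here only as a stand-in for dummyVars().

```r
# Toy data (hypothetical values, not taken from the dataset)
toy <- data.frame(
  age = c(25, 35, 45),
  job_satisfaction = factor(c("Low", "High", "Low"), levels = c("Low", "High"))
)

# Centering and scaling: subtract the column mean, divide by the standard deviation
age_scaled <- (toy$age - mean(toy$age)) / sd(toy$age)

# One-hot encoding: one 0/1 column per factor level ("- 1" drops the intercept)
satisfaction_onehot <- model.matrix(~ job_satisfaction - 1, data = toy)

print(age_scaled)
print(satisfaction_onehot)
```

Each row of the encoded matrix contains exactly one 1, which marks the level that row belongs to.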

The normalized and encoded data are then combined, and the attrition target is added back as the last column.

employee_norm_encoded <- cbind(predict(employee_norm, newdata = employee_num), 
                                 predict(employee_enc, newdata = employee_cat))

employee_norm_encoded <- cbind(employee_norm_encoded, attrition = employee$attrition)

Cross Validation

The training set is 75% of the data, selected at random; the remaining 25% is held out as the test set.

set.seed(100)

train_size <- round(0.75 * nrow(employee_norm_encoded))
index <- sample(seq_len(nrow(employee_norm_encoded)))


train_index <- index[1:train_size]
test_index <- index[(train_size + 1):nrow(employee_norm_encoded)]


# Split to train and test data
train_data <- employee_norm_encoded[train_index, ]
test_data <- employee_norm_encoded[test_index, ]
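A shuffled-index split like this is easy to sanity-check. The sketch below repeats the same steps on a hypothetical row count of 20 (an assumption chosen only to keep the example small) and verifies that the two index sets do not overlap and together cover every row.

```r
set.seed(100)

n <- 20                            # hypothetical row count for illustration
train_size <- round(0.75 * n)
index <- sample(seq_len(n))        # shuffled row numbers

train_index <- index[1:train_size]
test_index  <- index[(train_size + 1):n]

# The two sets must not share any row ...
stopifnot(length(intersect(train_index, test_index)) == 0)
# ... and together they must cover every row exactly once
stopifnot(setequal(c(train_index, test_index), seq_len(n)))

length(train_index)
length(test_index)
```

The same two stopifnot() checks can be run on the real train_index and test_index by replacing n with nrow(employee_norm_encoded).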

After splitting, separate the target from the predictors in both the training and test data, and convert the predictors to a matrix with as.matrix().

train_x <- train_data %>% 
  select(-attrition) %>% 
  as.matrix()

# attrition is stored in lowercase ("yes"/"no"), so compare against "no"
train_y <- ifelse(train_data$attrition == "no", 0, 1)

test_x <- test_data %>% 
  select(-attrition) %>% 
  as.matrix()

test_y <- ifelse(test_data$attrition == "no", 0, 1)
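String comparison with == in R is case-sensitive, so the value used in ifelse() has to match the stored spelling exactly; the glimpse() output earlier shows attrition in lowercase ("yes", "no"). The sketch below uses a hypothetical toy vector to show what happens when the case does not match, and one possible way to make the coding robust.

```r
# Toy labels in the same lowercase form that glimpse() shows for attrition
attrition_toy <- c("yes", "no", "no", "yes", "no")

# Comparing against "No" never matches lowercase "no", so every label becomes 1
labels_wrong_case <- ifelse(attrition_toy == "No", 0, 1)

# Comparing against the stored spelling gives the intended 0/1 coding
labels_exact <- ifelse(attrition_toy == "no", 0, 1)

# Lowercasing first makes the comparison robust to mixed capitalization
labels_robust <- ifelse(tolower(attrition_toy) == "no", 0, 1)

table(labels_wrong_case)
table(labels_exact)
```

Running table(train_y) and table(test_y) on the real data is a quick way to confirm that both classes are present before training.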

Because this project uses the keras library, the training and test predictors need to be converted to arrays with array_reshape() before the model is built.

train_x <- array_reshape(x=train_x, dim= dim(train_x)) 
test_x <- array_reshape(x=test_x, dim= dim(test_x))

Build Model

The model is built with the following architecture:
- Input layer: 43 predictors, one per column of the predictor matrix
- Hidden layer 1: 64 neurons with ReLU activation
- Hidden layer 2: 32 neurons with ReLU activation
- Hidden layer 3: 16 neurons with ReLU activation
- Output layer: 1 neuron with sigmoid activation

input_dim <- dim(train_x)[2]

model <- keras_model_sequential()


# Add layers
model %>%
  layer_dense(units = 64, 
              activation = "relu", 
              input_shape = input_dim,
              name = "hidden_1") %>%
  
  layer_dense(units = 32, 
              activation = "relu",
              name = "hidden_2") %>%
  
  layer_dense(units = 16, 
              activation = "relu",
              name = "hidden_3") %>%
  
  layer_dense(units = 1, 
              activation = "sigmoid",
              name = "output")

model

## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden_1 (Dense)                   (None, 64)                      2816        
##  hidden_2 (Dense)                   (None, 32)                      2080        
##  hidden_3 (Dense)                   (None, 16)                      528         
##  output (Dense)                     (None, 1)                       17          
## ================================================================================
## Total params: 5441 (21.25 KB)
## Trainable params: 5441 (21.25 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
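The Param # column in the summary above can be reproduced by hand: a dense layer has one weight per input for each unit, plus one bias per unit. The helper name dense_params below is hypothetical and exists only for this illustration.

```r
# Parameters of a dense layer = (inputs + 1 bias) * units
dense_params <- function(n_inputs, n_units) (n_inputs + 1) * n_units

layer_inputs <- c(43, 64, 32, 16)   # inputs to hidden_1, hidden_2, hidden_3, output
layer_units  <- c(64, 32, 16, 1)    # units in each of those layers

params <- dense_params(layer_inputs, layer_units)
names(params) <- c("hidden_1", "hidden_2", "hidden_3", "output")

print(params)       # 2816 2080 528 17
print(sum(params))  # 5441
```

The first figure also confirms the input width: 2816 / 64 = 44, that is 43 predictors plus one bias.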

Compile Model

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)
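For intuition about the loss chosen here, binary cross-entropy can be written in a few lines of base R. This is a sketch of the standard formula, not the keras implementation; the function name, the eps clipping value, and the toy labels and probabilities are all assumptions made for illustration.

```r
# Binary cross-entropy averaged over observations
binary_crossentropy <- function(y_true, y_prob, eps = 1e-7) {
  y_prob <- pmin(pmax(y_prob, eps), 1 - eps)   # clip to avoid log(0)
  -mean(y_true * log(y_prob) + (1 - y_true) * log(1 - y_prob))
}

y_true <- c(1, 0, 1, 0)                        # hypothetical labels
confident_right <- c(0.99, 0.01, 0.99, 0.01)   # high probability on the true class
confident_wrong <- c(0.01, 0.99, 0.01, 0.99)   # high probability on the wrong class

loss_right <- binary_crossentropy(y_true, confident_right)
loss_wrong <- binary_crossentropy(y_true, confident_wrong)

print(loss_right)
print(loss_wrong)
```

The loss is small when the predicted probability of the true class is close to 1 and grows quickly when the model is confidently wrong.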

Model Fitting

history <- model %>% fit(
  train_x, train_y,
  epochs = 30,
  batch_size = 24,
  validation_data = list(test_x, test_y)
)

Model Evaluation

Let’s look at the loss and accuracy curves for training and validation.

plot(history$metrics$accuracy, type = "l", col = "#67d294", lwd = 2,
     xlab = "Epoch", ylab = "Accuracy", main = "Training and Validation Accuracy")
lines(history$metrics$val_accuracy, col = "#1f4260", lwd = 2)
legend("bottomright", legend = c("Train Accuracy", "Validation Accuracy"), col = c("#67d294", "#1f4260"), lwd = 2, lty = 1)

plot(history$metrics$loss, type = "l", col = "#67d294", lwd = 2,
     xlab = "Epoch", ylab = "Loss", main = "Training and Validation Loss")
lines(history$metrics$val_loss, col = "#1f4260", lwd = 2)
legend("topright", legend = c("Train Loss", "Validation Loss"), col = c("#67d294", "#1f4260"), lwd = 2, lty = 1)

train_loss_acc <- model %>% evaluate(train_x, train_y)
test_loss_acc <- model %>% evaluate(test_x, test_y)

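From the results above, the model reports a training loss of 1.1595e-05 with an accuracy of 1.00, and a test loss of 1.1385e-05 with an accuracy of 1.00. Scores this close to perfect on both sets deserve scrutiny: the attrition column stores lowercase "yes" and "no", and a target built by comparing against "No" codes every row as 1, in which case a model can reach these numbers simply by always predicting 1. Running table(train_y) and table(test_y) shows whether both classes are present before the scores are interpreted.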
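Accuracy alone can hide how a classifier treats the less frequent class, so a confusion matrix is a useful complement. The sketch below uses hypothetical labels and predicted probabilities; with the fitted model, the probabilities would presumably come from predict(model, test_x), and the 0.5 threshold is an assumption.

```r
# Hypothetical true labels (1 = attrition) and predicted probabilities
actual <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.2, 0.8, 0.1, 0.4)

# Convert probabilities to classes with a 0.5 threshold
predicted <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix with fixed level order so every cell is always present
cm <- table(
  predicted = factor(predicted, levels = c(0, 1)),
  actual    = factor(actual, levels = c(0, 1))
)

tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy  <- (tp + tn) / sum(cm)
recall    <- tp / (tp + fn)   # share of actual leavers that were caught
precision <- tp / (tp + fp)   # share of predicted leavers that really left

print(cm)
print(c(accuracy = accuracy, recall = recall, precision = precision))
```

In this toy case accuracy is 0.8 while recall is only 2/3, which is the kind of gap that matters when the goal is to catch employees who are likely to leave.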

Conclusion

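The reported scores look good, but they need to be read with care. Overfitting would show up as a training score well above the test score, whereas here both are 1.00, so the results point to a problem with how the target was coded rather than to overfitting. The attrition classes are also imbalanced, which is inherent in the available dataset. A possible next step is to resample the training data so that the classes are balanced, although this has to be weighed carefully because it changes the class proportions and therefore affects the business interpretation.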
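As an illustration of the resampling idea, the following base-R sketch upsamples the less frequent class in a hypothetical toy training set. The data values are invented, and only the training data would be resampled in practice so that the test set keeps its natural class proportions. The caret package also offers upSample() and downSample() for the same purpose.

```r
set.seed(100)

# Hypothetical imbalanced training set: 8 "no" rows and 2 "yes" rows
train_toy <- data.frame(
  age = c(41, 49, 37, 33, 27, 32, 59, 30, 38, 36),
  attrition = c("yes", "no", "no", "no", "no", "no", "no", "yes", "no", "no")
)

minority <- train_toy[train_toy$attrition == "yes", ]
majority <- train_toy[train_toy$attrition == "no", ]

# Upsampling: draw minority rows with replacement until both classes have the same size
minority_up <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]

train_balanced <- rbind(majority, minority_up)
table(train_balanced$attrition)
```

Upsampling repeats existing minority rows rather than adding new information, so the balanced set should be used for fitting only and the evaluation should still be done on untouched data.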