In this LBB project, we will analyze factors leading to employee attrition using a fictional data set created by IBM data scientists and made available on Kaggle.
The data used in this project is a clean version derived from the above source, pre-processed by another source on GitHub, which we renamed “ibm_attrition_data_clean.csv”.
The goal is to predict, as close to the ground truth as possible, whether an employee’s attrition status is “yes” or
“no” based on several contributing factors.
# Library setup: load the necessary packages
# data wrangling
library(dplyr)
# neural network
library(neuralnet)
library(keras)
# cross-validation
library(rsample)
library(caret)
library(recipes)
library(tensorflow)
# set graphic theme
theme_set(theme_minimal())
options(scipen = 999)
Before we proceed further, let us explore our dataset.
# Read dataset
ibm_raw <- read.csv("data_input/ibm_attrition_data_clean.csv")
head(ibm_raw) # Check dataset
We can also check the basic structure of our dataset:
#> 'data.frame': 1470 obs. of 35 variables:
#> $ attrition : chr "yes" "no" "yes" "no" ...
#> $ age : int 41 49 37 33 27 32 59 30 38 36 ...
#> $ business_travel : chr "travel_rarely" "travel_frequently" "travel_rarely" "travel_frequently" ...
#> $ daily_rate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
#> $ department : chr "sales" "research_development" "research_development" "research_development" ...
#> $ distance_from_home : int 1 8 2 3 2 2 3 24 23 27 ...
#> $ education : int 2 1 2 4 1 2 3 1 3 3 ...
#> $ education_field : chr "life_sciences" "life_sciences" "other" "life_sciences" ...
#> $ employee_count : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ employee_number : int 1 2 4 5 7 8 10 11 12 13 ...
#> $ environment_satisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
#> $ gender : chr "female" "male" "male" "female" ...
#> $ hourly_rate : int 94 61 92 56 40 79 81 67 44 94 ...
#> $ job_involvement : int 3 2 2 3 3 3 4 3 2 3 ...
#> $ job_level : int 2 2 1 1 1 1 1 1 3 2 ...
#> $ job_role : chr "sales_executive" "research_scientist" "laboratory_technician" "research_scientist" ...
#> $ job_satisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
#> $ marital_status : chr "single" "married" "single" "married" ...
#> $ monthly_income : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
#> $ monthly_rate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
#> $ num_companies_worked : int 8 1 6 1 9 0 4 1 0 6 ...
#> $ over_18 : chr "y" "y" "y" "y" ...
#> $ over_time : chr "yes" "no" "yes" "yes" ...
#> $ percent_salary_hike : int 11 23 15 11 12 13 20 22 21 13 ...
#> $ performance_rating : int 3 4 3 3 3 3 4 4 4 3 ...
#> $ relationship_satisfaction : int 1 4 2 3 4 3 1 2 2 2 ...
#> $ standard_hours : int 80 80 80 80 80 80 80 80 80 80 ...
#> $ stock_option_level : int 0 1 0 0 1 0 3 1 0 2 ...
#> $ total_working_years : int 8 10 7 8 6 8 12 1 10 17 ...
#> $ training_times_last_year : int 0 3 3 3 3 2 3 2 2 3 ...
#> $ work_life_balance : int 1 3 3 3 3 2 2 3 3 2 ...
#> $ years_at_company : int 6 10 0 8 2 7 1 1 9 7 ...
#> $ years_in_current_role : int 4 7 0 7 2 7 0 0 7 7 ...
#> $ years_since_last_promotion: int 0 1 0 3 2 3 0 0 1 7 ...
#> $ years_with_curr_manager : int 5 7 0 0 2 6 0 0 8 7 ...
Based on the information above, we can summarize that our dataset
contains 35 columns: the target variable attrition and
34 columns of contributing factors leading to the employee attrition status of
yes or no.
Our current dataset has two different data types,
character and integer. From our observation,
we note that all the character columns can be converted to
factor type.
This is a classification case with two outputs (attrition = yes/no).
We will remove two columns, employee_count and
employee_number, as they do not provide relevant
information for further analysis; a sketch of this step is shown below.
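The chunk that drops these columns is not shown in the original; a minimal sketch, assuming the result overwrites ibm_raw (the name the later initial_split() call still refers to), might look like this:
# Assumed reconstruction: drop the identifier-like columns and re-check the structure
ibm_raw <- ibm_raw %>%
  select(-employee_count, -employee_number)
str(ibm_raw)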
#> 'data.frame': 1470 obs. of 33 variables:
#> $ attrition : chr "yes" "no" "yes" "no" ...
#> $ age : int 41 49 37 33 27 32 59 30 38 36 ...
#> $ business_travel : chr "travel_rarely" "travel_frequently" "travel_rarely" "travel_frequently" ...
#> $ daily_rate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
#> $ department : chr "sales" "research_development" "research_development" "research_development" ...
#> $ distance_from_home : int 1 8 2 3 2 2 3 24 23 27 ...
#> $ education : int 2 1 2 4 1 2 3 1 3 3 ...
#> $ education_field : chr "life_sciences" "life_sciences" "other" "life_sciences" ...
#> $ environment_satisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
#> $ gender : chr "female" "male" "male" "female" ...
#> $ hourly_rate : int 94 61 92 56 40 79 81 67 44 94 ...
#> $ job_involvement : int 3 2 2 3 3 3 4 3 2 3 ...
#> $ job_level : int 2 2 1 1 1 1 1 1 3 2 ...
#> $ job_role : chr "sales_executive" "research_scientist" "laboratory_technician" "research_scientist" ...
#> $ job_satisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
#> $ marital_status : chr "single" "married" "single" "married" ...
#> $ monthly_income : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
#> $ monthly_rate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
#> $ num_companies_worked : int 8 1 6 1 9 0 4 1 0 6 ...
#> $ over_18 : chr "y" "y" "y" "y" ...
#> $ over_time : chr "yes" "no" "yes" "yes" ...
#> $ percent_salary_hike : int 11 23 15 11 12 13 20 22 21 13 ...
#> $ performance_rating : int 3 4 3 3 3 3 4 4 4 3 ...
#> $ relationship_satisfaction : int 1 4 2 3 4 3 1 2 2 2 ...
#> $ standard_hours : int 80 80 80 80 80 80 80 80 80 80 ...
#> $ stock_option_level : int 0 1 0 0 1 0 3 1 0 2 ...
#> $ total_working_years : int 8 10 7 8 6 8 12 1 10 17 ...
#> $ training_times_last_year : int 0 3 3 3 3 2 3 2 2 3 ...
#> $ work_life_balance : int 1 3 3 3 3 2 2 3 3 2 ...
#> $ years_at_company : int 6 10 0 8 2 7 1 1 9 7 ...
#> $ years_in_current_role : int 4 7 0 7 2 7 0 0 7 7 ...
#> $ years_since_last_promotion: int 0 1 0 3 2 3 0 0 1 7 ...
#> $ years_with_curr_manager : int 5 7 0 0 2 6 0 0 8 7 ...
First, let us confirm whether our dataset has any missing values or duplicated rows; a sketch of these checks follows.
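The original chunk is not shown; the output below presumably comes from checks along these lines:
# Assumed reconstruction: missing values per column and count of duplicated rows
colSums(is.na(ibm_raw))
sum(duplicated(ibm_raw))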
#> attrition age business_travel daily_rate department distance_from_home
#> 0 0 0 0 0 0
#> education education_field environment_satisfaction gender hourly_rate job_involvement
#> 0 0 0 0 0 0
#> job_level job_role job_satisfaction marital_status monthly_income monthly_rate
#> 0 0 0 0 0 0
#> num_companies_worked over_18 over_time percent_salary_hike performance_rating relationship_satisfaction
#> 0 0 0 0 0 0
#> standard_hours stock_option_level total_working_years training_times_last_year work_life_balance years_at_company
#> 0 0 0 0 0 0
#> years_in_current_role years_since_last_promotion years_with_curr_manager
#> 0 0 0
#> [1] 0
There are neither missing values nor duplicated rows in our dataset.
Let us split our prepared dataset into an 80:20 ratio of train:test data using stratified sampling, so that the class proportion of the target variable is preserved in both sets.
set.seed(100)
index <- initial_split(data = ibm_raw, # dataset used for training
prop = 0.8, # 80% for training dataset
strata = "attrition") Using library recipes, we will implemented the
Pre-processing Data to prepare for further analysis :
ibm_clean <- recipe(attrition ~ .,
data = training(index)) %>%
step_nzv(all_predictors()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
step_dummy(all_nominal(), -attrition, one_hot = FALSE) %>%
prep()
Here we extract the pre-processed training and testing data:
ibm_train <- juice(ibm_clean)
ibm_test <- bake(ibm_clean, testing(index))
# Check the proportion table of Training Data
prop.table(table(ibm_train$attrition))
#>
#> no yes
#> 0.8391489 0.1608511
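The matching chunk for the test data is not shown in the original; presumably it mirrors the one above:
# Check the proportion table of Testing Data (assumed; chunk not shown)
prop.table(table(ibm_test$attrition))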
#>
#> no yes
#> 0.8372881 0.1627119
Based on the above information, we note that our training and
testing datasets, ibm_train and ibm_test
respectively, maintain the same class proportion of roughly
84:16.
Next, we will start the model-building process with a neural network.
train_x <- ibm_train %>%
select(-attrition) %>% # predictor variables only in our training dataset
data.matrix() # change dataset into matrix type
train_y <- to_categorical(as.numeric(ibm_train$attrition) - 1) # target variable
test_x <- ibm_test %>%
select(-attrition) %>% # predictor variables only in our testing dataset
data.matrix() # change dataset into matrix type
test_y <- to_categorical(as.numeric(ibm_test$attrition) - 1) # target variable
As our target is a binary classification with two output values (“yes”/“no”), the appropriate loss is binary cross-entropy.
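For reference, for a single observation with true label y ∈ {0, 1} and predicted probability p, the binary cross-entropy loss is -(y * log(p) + (1 - y) * log(1 - p)), averaged over all observations.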
We will first build a model with the neuralnet function.
# Building a neural network with 2 hidden layers of 5 and 3 neurons
nn_ibm <- neuralnet(formula = attrition ~ .,
data = ibm_train,
hidden = c(5, 3),
err.fct = "ce",
act.fct = "logistic",
linear.output = FALSE
)
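Note that pred_nn, whose predicted probabilities are shown below the network plot and which is converted into classes further down, is never created in the chunks shown above. A minimal sketch, assuming neuralnet::compute() on the pre-processed test predictors:
# Assumed reconstruction (chunk not shown): score the test predictors with the
# trained network; compute() returns a list whose net.result element holds the
# predicted probability for each output node
pred_nn <- neuralnet::compute(nn_ibm, ibm_test %>% select(-attrition))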
plot(nn_ibm)
The first rows of pred_nn$net.result, the predicted probabilities for each output node on the test set, look like this:
#>           [,1]       [,2]
#> [1,] 0.9733027 0.02669739
#> [2,] 0.9732972 0.02670290
#> [3,] 0.9732981 0.02670197
#> [4,] 0.9733026 0.02669749
#> [5,] 0.9733027 0.02669739
#> [6,] 0.9733027 0.02669739
# Convert probability into class
pred_nn_class <- ifelse(pred_nn$net.result > 0.5,
1, # if pred value > 0.5, then the class value is 1
0) # otherwise, the class value is 0
pred_nn_class %>% head()
#>      [,1] [,2]
#> [1,] 1 0
#> [2,] 1 0
#> [3,] 1 0
#> [4,] 1 0
#> [5,] 1 0
#> [6,] 1 0
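As a possible follow-up (not part of the original chunks), the predicted probabilities could be mapped back to class labels and compared with the ground truth using caret. This assumes attrition was stored as a factor by the recipe and that the network's output columns follow its level order ("no", "yes"):
# Assumed sketch: convert node-wise probabilities back to labels and
# build a confusion matrix against the test labels
pred_label <- factor(levels(ibm_test$attrition)[max.col(pred_nn$net.result)],
                     levels = levels(ibm_test$attrition))
confusionMatrix(pred_label, ibm_test$attrition, positive = "yes")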
Let us first create an object input_dim to store the number of predictor columns, and an object num_class to store the number of categories of the target variable.
input_dim <- ncol(train_x) # number of columns of predictor variables
num_class <- n_distinct(ibm_train$attrition) # number of target classes
input_dim
#> [1] 44
num_class
#> [1] 2
The input layer will be equal to the number of predictor columns, which we defined above as input_dim.
The output layer of our model performs a binary classification with only two outputs, “yes” or “no”, therefore the loss function used will be Binary Cross-Entropy.
As it is a binary classification case, the activation function used will be logistic/sigmoid.
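For reference, the logistic/sigmoid function squashes any real-valued input into the (0, 1) range, which is what lets each output node be read as a probability. An illustrative one-liner (keras applies this internally when activation = "sigmoid"):
sigmoid <- function(x) 1 / (1 + exp(-x)) # e.g. sigmoid(0) returns 0.5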
In summary, our neural network model will use the following fixed parameters:
input layer = 44 predictors
output layer = 2 neurons
activation: “sigmoid”
tensorflow::set_random_seed(100)
# Create the model architecture
model1 <- keras_model_sequential(name="model1") %>%
# First Hidden Layer
layer_dense(input_shape = input_dim, # number of predictors
units = input_dim, # number of nodes in the first hidden layer
activation = "sigmoid",
name = "Hidden_layer") %>%
# Output layer
layer_dense(units = num_class,
activation = "sigmoid",
name = "output")
model1
#> Model: "model1"
#> ________________________________________________________________________________________________________________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ========================================================================================================================================================================================
#> Hidden_layer (Dense) (None, 44) 1980
#> output (Dense) (None, 2) 90
#> ========================================================================================================================================================================================
#> Total params: 2,070
#> Trainable params: 2,070
#> Non-trainable params: 0
#> ________________________________________________________________________________________________________________________________________________________________________________________
To compile the model, we need to define our error (loss) function, optimizer, and evaluation metrics with the compile() function.
In this project, the parameters used will be:
model1 %>%
compile(loss = "binary_crossentropy",
optimizer = optimizer_sgd(learning_rate = 0.5),
metrics = "accuracy")
model1
#> Model: "model1"
#> ________________________________________________________________________________________________________________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ========================================================================================================================================================================================
#> Hidden_layer (Dense) (None, 44) 1980
#> output (Dense) (None, 2) 90
#> ========================================================================================================================================================================================
#> Total params: 2,070
#> Trainable params: 2,070
#> Non-trainable params: 0
#> ________________________________________________________________________________________________________________________________________________________________________________________
Model fitting using the fit() function will take the following parameters:
x: predictors
y: target
epochs: number of iterations for training the model
batch_size: number of samples per gradient update
validation_data: unseen data (predictors and target) used to evaluate the metrics while the model is training
verbose: whether to display training progress
Checking the number of rows in our training data gives:
#> [1] 1175
Our training dataset has a total of 1,175 rows; let us use 5 batches per epoch so that the batch size is 235:
history <- model1 %>% fit(x = train_x,
y = train_y,
epochs = 10,
batch_size = 235,
validation_data = list(test_x, test_y),
verbose = 1
)
#> Epoch 1/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.9339 - accuracy: 0.1787
5/5 [==============================] - 0s 4ms/step - loss: 0.5469 - accuracy: 0.7106
#>
5/5 [==============================] - 1s 166ms/step - loss: 0.5469 - accuracy: 0.7106 - val_loss: 0.4269 - val_accuracy: 0.8373
#> Epoch 2/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4324 - accuracy: 0.8426
5/5 [==============================] - 0s 7ms/step - loss: 0.4308 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 41ms/step - loss: 0.4308 - accuracy: 0.8391 - val_loss: 0.4203 - val_accuracy: 0.8373
#> Epoch 3/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4184 - accuracy: 0.8468
5/5 [==============================] - 0s 5ms/step - loss: 0.4257 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 38ms/step - loss: 0.4257 - accuracy: 0.8391 - val_loss: 0.4145 - val_accuracy: 0.8373
#> Epoch 4/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4223 - accuracy: 0.8383
5/5 [==============================] - 0s 5ms/step - loss: 0.4199 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 38ms/step - loss: 0.4199 - accuracy: 0.8391 - val_loss: 0.4086 - val_accuracy: 0.8373
#> Epoch 5/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4590 - accuracy: 0.8170
5/5 [==============================] - 0s 4ms/step - loss: 0.4160 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 37ms/step - loss: 0.4160 - accuracy: 0.8391 - val_loss: 0.4035 - val_accuracy: 0.8373
#> Epoch 6/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4581 - accuracy: 0.8000
5/5 [==============================] - 0s 4ms/step - loss: 0.4123 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 36ms/step - loss: 0.4123 - accuracy: 0.8391 - val_loss: 0.3990 - val_accuracy: 0.8373
#> Epoch 7/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.3775 - accuracy: 0.8681
5/5 [==============================] - 0s 4ms/step - loss: 0.4086 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 36ms/step - loss: 0.4086 - accuracy: 0.8391 - val_loss: 0.3946 - val_accuracy: 0.8373
#> Epoch 8/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4943 - accuracy: 0.7872
5/5 [==============================] - 0s 5ms/step - loss: 0.4048 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 38ms/step - loss: 0.4048 - accuracy: 0.8391 - val_loss: 0.3909 - val_accuracy: 0.8373
#> Epoch 9/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4524 - accuracy: 0.8170
5/5 [==============================] - 0s 4ms/step - loss: 0.4005 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 37ms/step - loss: 0.4005 - accuracy: 0.8391 - val_loss: 0.3868 - val_accuracy: 0.8373
#> Epoch 10/10
#>
1/5 [=====>........................] - ETA: 0s - loss: 0.4012 - accuracy: 0.8383
5/5 [==============================] - 0s 4ms/step - loss: 0.3973 - accuracy: 0.8391
#>
5/5 [==============================] - 0s 37ms/step - loss: 0.3973 - accuracy: 0.8391 - val_loss: 0.3830 - val_accuracy: 0.8373
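The value printed below (0.18) is presumably the gap between the final training and validation accuracy, expressed in percentage points. The original chunk is not shown; one way it might have been computed from the fit() history:
# Assumed reconstruction: difference between the final training accuracy and
# the final validation accuracy, in percentage points
round((tail(history$metrics$accuracy, 1) -
       tail(history$metrics$val_accuracy, 1)) * 100, 2)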
#> [1] 0.18
Based on the result above, our model overfits slightly at the beginning but then reaches a fairly optimal state: the gap between the training accuracy (accuracy) and the validation accuracy on the test data (val_accuracy) is 0.18%, well below 20%. Our current model is therefore already optimal.
Let us create another model using a different optimizer with the following parameter tuning:
tensorflow::set_random_seed(8)
# Create the model architecture
model2 <- keras_model_sequential(name="model2") %>%
# First Hidden Layer
layer_dense(input_shape = input_dim, # number of predictors
units = input_dim, # number of nodes in the first hidden layer
activation = "sigmoid",
name = "Hidden_layer") %>%
# Output layer
layer_dense(units = num_class,
activation = "sigmoid",
name = "output")
model2 %>%
compile(loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.2),
metrics = "accuracy")
model2
#> Model: "model2"
#> ________________________________________________________________________________________________________________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ========================================================================================================================================================================================
#> Hidden_layer (Dense) (None, 44) 1980
#> output (Dense) (None, 2) 90
#> ========================================================================================================================================================================================
#> Total params: 2,070
#> Trainable params: 2,070
#> Non-trainable params: 0
#> ________________________________________________________________________________________________________________________________________________________________________________________
history2 <- model2 %>% fit(x = train_x,
y = train_y,
epochs = 10,
batch_size = 200,
validation_data = list(test_x, test_y),
verbose = 1
)
#> Epoch 1/10
#>
1/6 [====>.........................] - ETA: 1s - loss: 0.4732 - accuracy: 0.8300
6/6 [==============================] - 0s 12ms/step - loss: 0.5682 - accuracy: 0.8017
#>
6/6 [==============================] - 1s 143ms/step - loss: 0.5682 - accuracy: 0.8017 - val_loss: 0.4109 - val_accuracy: 0.8373
#> Epoch 2/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.4342 - accuracy: 0.8450
6/6 [==============================] - 0s 8ms/step - loss: 0.4258 - accuracy: 0.8451
#>
6/6 [==============================] - 0s 34ms/step - loss: 0.4258 - accuracy: 0.8451 - val_loss: 0.3736 - val_accuracy: 0.8373
#> Epoch 3/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.5125 - accuracy: 0.8050
6/6 [==============================] - 0s 6ms/step - loss: 0.3968 - accuracy: 0.8323
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.3968 - accuracy: 0.8323 - val_loss: 0.3759 - val_accuracy: 0.8339
#> Epoch 4/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.3050 - accuracy: 0.8700
6/6 [==============================] - 0s 6ms/step - loss: 0.3596 - accuracy: 0.8409
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.3596 - accuracy: 0.8409 - val_loss: 0.3594 - val_accuracy: 0.8305
#> Epoch 5/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.3825 - accuracy: 0.8150
6/6 [==============================] - 0s 6ms/step - loss: 0.3300 - accuracy: 0.8579
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.3300 - accuracy: 0.8579 - val_loss: 0.3498 - val_accuracy: 0.8407
#> Epoch 6/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.3232 - accuracy: 0.8600
6/6 [==============================] - 0s 5ms/step - loss: 0.2965 - accuracy: 0.8826
#>
6/6 [==============================] - 0s 31ms/step - loss: 0.2965 - accuracy: 0.8826 - val_loss: 0.3312 - val_accuracy: 0.8678
#> Epoch 7/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.2825 - accuracy: 0.9100
6/6 [==============================] - 0s 6ms/step - loss: 0.2669 - accuracy: 0.9064
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.2669 - accuracy: 0.9064 - val_loss: 0.3303 - val_accuracy: 0.8712
#> Epoch 8/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.2013 - accuracy: 0.9200
6/6 [==============================] - 0s 6ms/step - loss: 0.2350 - accuracy: 0.9183
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.2350 - accuracy: 0.9183 - val_loss: 0.3240 - val_accuracy: 0.8712
#> Epoch 9/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.1915 - accuracy: 0.9400
6/6 [==============================] - 0s 7ms/step - loss: 0.2136 - accuracy: 0.9251
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.2136 - accuracy: 0.9251 - val_loss: 0.3438 - val_accuracy: 0.8780
#> Epoch 10/10
#>
1/6 [====>.........................] - ETA: 0s - loss: 0.1626 - accuracy: 0.9400
6/6 [==============================] - 0s 6ms/step - loss: 0.1862 - accuracy: 0.9302
#>
6/6 [==============================] - 0s 32ms/step - loss: 0.1862 - accuracy: 0.9302 - val_loss: 0.3478 - val_accuracy: 0.8746
Using the Adam optimizer, the model reaches a higher validation accuracy (about 0.87 versus 0.84 for model1), but the gap between the final training accuracy (0.9302) and validation accuracy (0.8746) widens to roughly 5.6%, so this model tends to overfit slightly rather than underfit.
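To make that comparison concrete, one could evaluate both compiled models on the held-out test set (not done in the original chunks; evaluate() returns the loss and the metrics defined at compile time):
# Assumed follow-up: hold-out loss and accuracy for both models
model1 %>% evaluate(test_x, test_y)
model2 %>% evaluate(test_x, test_y)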