This project was created to fulfill the Learning by Building (LBB) assignment for the Neural Network and Deep Learning material in the Machine Learning Specialization Course. The dataset used in this project is IBM HR Analytics Employee Attrition & Performance, a collection of employee-related data used to study the factors that influence attrition and performance within a company. It helps identify reasons for employee turnover and analyze performance-related patterns for talent-retention strategies.
Neural networks and deep learning can be used in attrition analysis to determine the causes of employee attrition and forecast future attrition rates.
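The data-loading chunk does not appear in this output. Below is a minimal sketch of how the data could be loaded and inspected; the file name is an assumption (the standard Kaggle CSV), and the snake_case column names suggest a janitor::clean_names() step. The character values shown below were evidently standardized as well, by a step not shown here.
library(dplyr)
# A sketch only: the file name and cleaning step are assumptions.
employee <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv") %>%
  janitor::clean_names()  # snake_case column names, matching the glimpse below
glimpse(employee)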
## Rows: 1,470
## Columns: 35
## $ attrition <chr> "yes", "no", "yes", "no", "no", "no", "no",…
## $ age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ business_travel <chr> "travel_rarely", "travel_frequently", "trav…
## $ daily_rate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department <chr> "sales", "research_development", "research_…
## $ distance_from_home <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3…
## $ education_field <chr> "life_sciences", "life_sciences", "other", …
## $ employee_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ employee_number <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 1…
## $ environment_satisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3…
## $ gender <chr> "female", "male", "male", "female", "male",…
## $ hourly_rate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84,…
## $ job_involvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2…
## $ job_level <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1…
## $ job_role <chr> "sales_executive", "research_scientist", "l…
## $ job_satisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3…
## $ marital_status <chr> "single", "married", "single", "married", "…
## $ monthly_income <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ monthly_rate <int> 19479, 24907, 2396, 23159, 16632, 11864, 99…
## $ num_companies_worked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ over_18 <chr> "y", "y", "y", "y", "y", "y", "y", "y", "y"…
## $ over_time <chr> "yes", "no", "yes", "yes", "no", "no", "yes…
## $ percent_salary_hike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3…
## $ relationship_satisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2…
## $ standard_hours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80,…
## $ stock_option_level <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0…
## $ total_working_years <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ training_times_last_year <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4…
## $ work_life_balance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3…
## $ years_at_company <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_in_current_role <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2…
## $ years_since_last_promotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
## $ years_with_curr_manager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3…
Description: The dataset has 1,470 rows and 35 columns:

- attrition: Whether the employee leaves the organization or not.
- age: Employee’s age in years.
- gender: The gender of the employee.
- business_travel: Frequency of employee business trips.
- daily_rate: The employee’s daily salary.
- department: The department where the employee works.
- distance_from_home: The distance from the employee’s home to the workplace in miles.
- education: The level of education achieved by the employee.
- education_field: The employee’s field of study.
- employee_count: The total number of employees in the organization.
- employee_number: A unique identifier for each employee record.
- environment_satisfaction: The level of employee satisfaction with their work environment.
- hourly_rate: The employee’s hourly salary.
- job_involvement: The level of involvement required for the employee’s job.
- job_level: The level of the employee’s position.
- job_role: The role of the employee in the organization.
- job_satisfaction: Employee satisfaction with their job.
- marital_status: The employee’s marital status.
- monthly_income: The employee’s monthly income.
- monthly_rate: The employee’s monthly pay rate.
- num_companies_worked: The number of companies the employee previously worked for.
- over_18: Whether the employee is over 18 years of age.
- over_time: Whether the employee works overtime.
- percent_salary_hike: The rate of increase in the employee’s salary.
- performance_rating: Assessment of employee performance.
- relationship_satisfaction: Employee satisfaction with their relationships.
- standard_hours: Standard working hours for employees.
- stock_option_level: The employee’s stock option level.
- total_working_years: Total number of years the employee has worked.
- training_times_last_year: The number of times the employee attended training in the last year.
- work_life_balance: The employee’s perception of their work-life balance.
- years_at_company: The number of years the employee has worked at this company.
- years_in_current_role: The number of years the employee has been in their current role.
- years_since_last_promotion: The number of years since the employee’s last promotion.
- years_with_curr_manager: The number of years the employee has worked with their current manager.

First of all, we need to remove all unused columns.
employee <- employee[, !(names(employee) %in% c(
  'daily_rate', 'employee_count', 'employee_number', 'hourly_rate',
  'monthly_rate', 'num_companies_worked', 'over_18', 'standard_hours',
  'stock_option_level', 'training_times_last_year'))]
head(employee)
Next, convert the columns whose data types are still unsuitable, especially the ordinal variables that should be factors with levels matching the metadata provided.
employee <- employee %>%
mutate(education = factor(education, levels = c(1, 2, 3, 4, 5),
labels = c("Below College", "College", "Bachelor", "Master", "Doctor")),
environment_satisfaction = factor(environment_satisfaction, levels = c(1, 2, 3, 4),
labels = c("Low", "Medium", "High", "Very High")),
job_involvement = factor(job_involvement, levels = c(1, 2, 3, 4),
labels = c("Low", "Medium", "High", "Very High")),
job_level = factor(job_level, levels = c(1, 2, 3, 4, 5),
labels = c("Entry Level", "Junior Level", "Mid Level", "Senior Level", "Executive Level")),
job_satisfaction = factor(job_satisfaction, levels = c(1, 2, 3, 4),
labels = c("Low", "Medium", "High", "Very High")),
performance_rating = factor(performance_rating, levels = c(1, 2, 3, 4),
labels = c("Low", "Good", "Excellent", "Outstanding")),
relationship_satisfaction = factor(relationship_satisfaction, levels = c(1, 2, 3, 4),
labels = c("Low", "Medium", "High", "Very High")),
work_life_balance = factor(work_life_balance, levels = c(1, 2, 3, 4),
labels = c("Bad", "Good", "Better", "Best")))Check whether there are missing values and duplicated or not.
## attrition age
## 0 0
## business_travel department
## 0 0
## distance_from_home education
## 0 0
## education_field environment_satisfaction
## 0 0
## gender job_involvement
## 0 0
## job_level job_role
## 0 0
## job_satisfaction marital_status
## 0 0
## monthly_income over_time
## 0 0
## percent_salary_hike performance_rating
## 0 0
## relationship_satisfaction total_working_years
## 0 0
## work_life_balance years_at_company
## 0 0
## years_in_current_role years_since_last_promotion
## 0 0
## years_with_curr_manager
## 0
## [1] FALSE
The data is clean and ready for the next step: normalizing the numeric columns and encoding the factor columns. The purpose of this process is to rescale the numeric attributes so that each attribute contributes a balanced effect when used in the model.
employee_num <- employee %>% select_if(is.numeric)
employee_cat <- employee %>% select_if(is.factor)
# Normalization: learn the center/scale parameters, then apply them
norm_params <- preProcess(employee_num, method = c("center", "scale"))
employee_norm <- predict(norm_params, employee_num)
# Encoding: one-hot encode the factor columns
enc_params <- dummyVars(~ ., data = employee_cat)
employee_enc <- predict(enc_params, newdata = employee_cat)
After that, the normalized and encoded data are combined back together.
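The combining chunk is not shown. A minimal sketch, assuming the target is re-attached as a plain column and the remaining character columns are excluded; this composition (9 scaled numeric columns plus 34 dummy columns = 43 predictors) matches the input dimension of the model below.
# A sketch only: the column composition is inferred from input_dim = 43.
employee_norm_encoded <- data.frame(
  attrition = employee$attrition,  # target kept as-is
  employee_norm,                   # centered/scaled numeric predictors
  employee_enc                     # one-hot encoded factor predictors
)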
The proportion of training data is set to 75% of the total, split randomly.
set.seed(100)
train_size <- round(0.75 * nrow(employee_norm_encoded))
index <- sample(seq_len(nrow(employee_norm_encoded)))
train_index <- index[1:train_size]
test_index <- index[(train_size + 1):nrow(employee_norm_encoded)]
# Split to train and test data
train_data <- employee_norm_encoded[train_index, ]
test_data <- employee_norm_encoded[test_index, ]
After splitting, don’t forget to separate the target and predictor variables in the training and testing data and convert them into matrices using as.matrix().
train_x <- train_data %>%
  select(-attrition) %>%
  as.matrix()
train_y <- ifelse(train_data$attrition == "no", 0, 1)
test_x <- test_data %>%
  select(-attrition) %>%
  as.matrix()
test_y <- ifelse(test_data$attrition == "no", 0, 1)
Because this project uses the keras library, before creating the model, the train and test data need to be converted into arrays using array_reshape().
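The reshape chunk is not shown; here is a minimal sketch of the conversion described above, assuming the usual pattern of keeping the same dimensions:
# Convert the predictor matrices into arrays; dimensions are unchanged.
train_x <- array_reshape(train_x, dim = dim(train_x))
test_x <- array_reshape(test_x, dim = dim(test_x))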
Modeling will be carried out using the following criteria:

- Input Layer: 43 predictors, based on the total number of columns
- Hidden Layer 1: 64 neurons with ReLU activation
- Hidden Layer 2: 32 neurons with ReLU activation
- Hidden Layer 3: 16 neurons with ReLU activation
- Output Layer: 1 neuron with sigmoid activation
input_dim <- dim(train_x)[2]
model <- keras_model_sequential()
# Add the layers
model %>%
layer_dense(units = 64,
activation = "relu",
input_shape = input_dim,
name = "hidden_1") %>%
layer_dense(units = 32,
activation = "relu",
name = "hidden_2") %>%
layer_dense(units = 16,
activation = "relu",
name = "hidden_3") %>%
layer_dense(units = 1,
activation = "sigmoid",
name = "output")
model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## hidden_1 (Dense) (None, 64) 2816
## hidden_2 (Dense) (None, 32) 2080
## hidden_3 (Dense) (None, 16) 528
## output (Dense) (None, 1) 17
## ================================================================================
## Total params: 5441 (21.25 KB)
## Trainable params: 5441 (21.25 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
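The compile-and-fit chunk is not shown in this output. Here is a minimal sketch that would produce the history object plotted below; binary cross-entropy loss and the Adam optimizer are the standard choices for this setup, while the epoch count, batch size, and validation split are illustrative assumptions.
model %>% compile(
  loss = "binary_crossentropy",  # binary target: stays (0) vs. leaves (1)
  optimizer = optimizer_adam(),
  metrics = "accuracy"
)
history <- model %>% fit(
  x = train_x, y = train_y,
  epochs = 30,            # assumption: illustrative value
  batch_size = 32,        # assumption: illustrative value
  validation_split = 0.2  # assumption: produces the validation curves below
)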
Let’s look at the plots of loss and accuracy for training and validation.
# Training vs. validation accuracy per epoch
plot(history$metrics$accuracy, type = "l", col = "#67d294", lwd = 2,
     xlab = "Epoch", ylab = "Accuracy", main = "Training and Validation Accuracy")
lines(history$metrics$val_accuracy, col = "#1f4260", lwd = 2)
legend("bottomright", legend = c("Train Accuracy", "Validation Accuracy"),
       col = c("#67d294", "#1f4260"), lwd = 2, lty = 1)
# Training vs. validation loss per epoch
plot(history$metrics$loss, type = "l", col = "#67d294", lwd = 2,
     xlab = "Epoch", ylab = "Loss", main = "Training and Validation Loss")
lines(history$metrics$val_loss, col = "#1f4260", lwd = 2)
legend("topright", legend = c("Train Loss", "Validation Loss"),
       col = c("#67d294", "#1f4260"), lwd = 2, lty = 1)
train_loss_acc <- model %>% evaluate(train_x, train_y)
test_loss_acc <- model %>% evaluate(test_x, test_y)
From the results above, the model achieved a very low training loss of 1.1595e-05 with an accuracy of 1.00; on the test set, the loss is 1.1385e-05 and the accuracy is also 1.00.
The model’s metrics look good, but there are signs of overfitting, driven by the imbalanced class proportions in the training data; this imbalance occurs naturally in the available dataset. A future improvement could be to resample the data so that the classes are balanced, although this must be weighed carefully because it also affects the business side.
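To quantify the imbalance mentioned above, a quick check of the class proportions can be run on the data; in the original Kaggle dataset, roughly 16% of employees have attrition = "yes".
# Proportion of each attrition class; "yes" is a clear minority.
prop.table(table(employee$attrition))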