Problem Statement

The objective of this project is to use a Random Forest model to predict employee attrition within a company. Attrition, defined here as the voluntary turnover of employees, presents a significant challenge for organizations, affecting productivity, morale, and overall business performance. With this machine learning algorithm, we aim to classify employees into two categories: those likely to leave the company (Attrition = Yes) and those likely to stay (Attrition = No). These predictions allow proactive measures to be taken to retain valuable talent, reduce turnover costs, and maintain a stable and productive workforce.

Load Libraries

# Load necessary libraries
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine()  masks randomForest::combine()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ purrr::lift()     masks caret::lift()
## ✖ ggplot2::margin() masks randomForest::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

randomForest provides functions for training and applying random forest models, a robust ensemble learning method.

caret simplifies the process of building machine learning models, including data splitting, pre-processing, and training.

tidyverse includes packages like ggplot2, dplyr, and tidyr, facilitating data manipulation and visualization.

Load Data

read.csv reads the CSV file into a data frame. The file is assumed to contain HR attrition data.

# Load the dataset
data <- read.csv("hr_attrition_data.csv")
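Before modeling, it can help to confirm that the file was read as expected. The following is a minimal sketch, not part of the original analysis: the toy rows and the temporary file are hypothetical stand-ins for hr_attrition_data.csv, and the column names are the ones used later in this document.

```r
# Hypothetical stand-in for hr_attrition_data.csv (values invented for illustration)
toy <- data.frame(
  Age = c(29, 41, 35),
  Gender = c("Male", "Female", "Male"),
  Department = c("Sales", "IT", "Finance"),
  YearsAtCompany = c(2, 10, 5),
  MonthlyIncome = c(4200, 7100, 5000),
  PerformanceRating = c(3, 4, 2),
  Attrition = c("No", "No", "Yes")
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read the file back and inspect column names and types
hr <- read.csv(tmp)
str(hr)

# Confirm that every column the model needs is present
required <- c("Age", "Gender", "Department", "YearsAtCompany",
              "MonthlyIncome", "PerformanceRating", "Attrition")
missing_cols <- setdiff(required, names(hr))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Stopping early on a missing column gives a clearer error than the one the model formula would raise later.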

Data Wrangling

Convert the categorical variables to factors. randomForest fits a classification model only when the response (Attrition) is a factor, and categorical predictors such as Gender and Department should also be factors so that the model treats them as categories rather than as text.

# Convert Attrition, Gender, and Department to factors
data$Attrition <- as.factor(data$Attrition)
data$Department <- as.factor(data$Department)
data$Gender <- as.factor(data$Gender)
data$Attrition
##  [1] No  No  Yes No  No  Yes No  No  Yes No 
## Levels: No Yes
data$Department
##  [1] Sales           Human Resources IT              Marketing      
##  [5] Finance         Operations      IT              Sales          
##  [9] Finance         Marketing      
## Levels: Finance Human Resources IT Marketing Operations Sales
data$Gender
##  [1] Male   Female Male   Female Male   Female Male   Female Male   Female
## Levels: Female Male
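When several columns need the same conversion, they can be converted in one step instead of one assignment per column. This is a small sketch on a hypothetical toy data frame (the name hr and its values are assumptions made for illustration); the same pattern should apply to the data object used in this document.

```r
# Hypothetical toy data; in this document, `data` comes from read.csv()
hr <- data.frame(
  Attrition = c("No", "Yes", "No"),
  Department = c("Sales", "IT", "Sales"),
  Gender = c("Male", "Female", "Female"),
  Age = c(30, 45, 38),
  stringsAsFactors = FALSE
)

# Convert all categorical columns in one step
cat_cols <- c("Attrition", "Department", "Gender")
hr[cat_cols] <- lapply(hr[cat_cols], as.factor)

# Check the result: each categorical column should now be a factor
sapply(hr, class)
levels(hr$Attrition)
```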

Data Manipulation

Check for missing values in the dataset and drop any incomplete rows.

# Check for missing values and remove them if any
if (sum(is.na(data)) > 0) {
  data <- na.omit(data)
}

na.omit removes every row that contains at least one missing value, so the model is trained on complete cases only.
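Before dropping rows, it may be worth seeing where the missing values are and how many rows would be lost. The sketch below uses a hypothetical toy data frame with invented values; only base R functions are used.

```r
# Hypothetical toy data with two incomplete rows
hr <- data.frame(
  Age = c(29, NA, 35, 41),
  MonthlyIncome = c(4200, 7100, NA, 5000),
  Attrition = factor(c("No", "Yes", "No", "No"))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(hr))
print(na_per_column)

# Drop incomplete rows and report how many were removed
hr_clean <- na.omit(hr)
rows_removed <- nrow(hr) - nrow(hr_clean)
message(rows_removed, " of ", nrow(hr), " rows removed")
```

If a large share of rows would be removed, imputing values may be preferable to dropping them; that choice is outside the scope of this example.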

Splitting Data

# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(data$Attrition, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]

set.seed fixes the seed of the random number generator so that the same split is produced every time the code runs.

createDataPartition, from caret, splits the data into training (70%) and testing (30%) sets. The split is stratified on the target variable, Attrition, so the class proportions are approximately preserved in both sets.
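To illustrate what a stratified split does, here is a base-R sketch that samples 70% of the rows within each class. It is not caret's implementation, only an illustration of the idea, and the outcome vector is hypothetical.

```r
set.seed(123)

# Hypothetical outcome vector: 20 "No" and 10 "Yes"
attrition <- factor(rep(c("No", "Yes"), times = c(20, 10)))

# Sample 70% of the row indices within each class
idx_by_class <- split(seq_along(attrition), attrition)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_y <- attrition[train_idx]
test_y <- attrition[-train_idx]

# Class proportions should be similar in both sets
prop.table(table(train_y))
prop.table(table(test_y))
```

Because sampling happens per class, the minority class cannot end up entirely in one of the two sets, which can happen with a plain random split on a small dataset.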

Random Forest Model

# Train Random Forest model with selected features
set.seed(123)
rf_model <- randomForest(Attrition ~ Age + Gender + Department + YearsAtCompany +
                           MonthlyIncome + PerformanceRating, data = trainData)

randomForest fits a random forest model that predicts Attrition from the specified predictors (Age, Gender, Department, YearsAtCompany, MonthlyIncome, and PerformanceRating). It builds many decision trees (500 by default), each on a bootstrap sample of the training data, and aggregates their votes, which generally improves accuracy and robustness over a single tree.
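A fitted forest can be inspected before it is used for prediction. The sketch below is an illustration on simulated data, not on the document's dataset: the data frame sim, the rule that generates Attrition, and the object name rf_sim are all assumptions. It sets ntree explicitly and requests variable importance with importance = TRUE.

```r
library(randomForest)

set.seed(123)

# Hypothetical simulated data; not the document's dataset
n <- 200
sim <- data.frame(
  Age = round(runif(n, 22, 60)),
  YearsAtCompany = round(runif(n, 0, 20)),
  MonthlyIncome = round(runif(n, 2500, 12000))
)
# Invented rule: lower income and shorter tenure make leaving more likely
p_leave <- plogis(2 - sim$MonthlyIncome / 3000 - sim$YearsAtCompany / 10)
sim$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                        levels = c("No", "Yes"))

# ntree is set explicitly here; 500 is also the package default
rf_sim <- randomForest(Attrition ~ ., data = sim, ntree = 500, importance = TRUE)

print(rf_sim)        # out-of-bag error estimate and confusion matrix
importance(rf_sim)   # per-variable importance measures
```

The out-of-bag error printed by the model gives a rough estimate of generalization error without touching the test set.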

Using the Predict Function on the Trained Model

# Predict attrition for the held-out test set
predictions <- predict(rf_model, testData)
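Test-set predictions are usually compared with the true labels. The following sketch uses hypothetical label vectors in place of testData$Attrition and predictions, and builds a confusion matrix with base R only.

```r
# Hypothetical labels standing in for testData$Attrition and predictions
actual <- factor(c("No", "No", "Yes", "No", "Yes", "No"), levels = c("No", "Yes"))
predicted <- factor(c("No", "Yes", "Yes", "No", "No", "No"), levels = c("No", "Yes"))

# Cross-tabulate predictions against the true labels
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Share of correct predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

With imbalanced classes, accuracy alone can be misleading, so the individual cells of the confusion matrix are worth reading as well.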

Creating New Employee Data for Testing

# Define new employees for prediction
new_employees <- data.frame(
  Age = c(35, 40),
  Gender = factor(c("Female", "Male"), levels = levels(trainData$Gender)),
  # "Research & Development" is not a level of trainData$Department, so this entry becomes NA
  Department = factor(c("Sales", "Research & Development"), levels = levels(trainData$Department)),
  YearsAtCompany = c(5, 8),
  MonthlyIncome = c(5000, 5800),
  PerformanceRating = c(4, 2)
)

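data.frame creates a data frame for the new employees whose attrition you want to predict.

Passing levels = levels(trainData$Gender) and levels = levels(trainData$Department) to factor ensures that the categorical variables in the new data use the same levels as the training data, which is crucial for making predictions. Any value that is not one of the training levels, such as "Research & Development" here, becomes NA.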
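The level-matching step above can be checked explicitly before predicting. This is a minimal sketch using base R; train_levels repeats the Department levels printed earlier in this document, and the variable names are chosen for illustration.

```r
# Department levels as printed earlier for the dataset
train_levels <- c("Finance", "Human Resources", "IT",
                  "Marketing", "Operations", "Sales")

new_departments <- c("Sales", "Research & Development")

# Values that are not among the training levels
unseen <- setdiff(new_departments, train_levels)
if (length(unseen) > 0) {
  message("Unseen Department values: ", paste(unseen, collapse = ", "))
}

# factor() silently turns unseen values into NA
dept <- factor(new_departments, levels = train_levels)
is.na(dept)
```

Running a check like this first makes it clear why a later prediction is NA instead of leaving it to be discovered in the output.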

Applying the Trained Model on the New Employee Data

# Predict attrition for new employees
new_predictions <- predict(rf_model, new_employees)
print(new_predictions)
##    1 <NA> 
##  Yes <NA> 
## Levels: No Yes

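predict applies the trained model to the new employee data. The first employee is predicted to leave (Yes). The second prediction is NA because "Research & Development" is not among the Department levels in the training data, so that employee's Department is NA and the model cannot classify the row.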

Visualizing Predictions

# Visualize Predictions
testData$PredictedAttrition <- predictions
ggplot(testData, aes(x = Age, y = MonthlyIncome, color = PredictedAttrition)) +
  geom_point(alpha = 0.6) +
  labs(title = "Attrition Predictions", x = "Age", y = "Monthly Income") +
  scale_color_manual(values = c("No" = "blue", "Yes" = "red"))

The plot shows the predicted attrition status of the employees in testData by age and monthly income. Its key elements are:

1. Axes: The x-axis represents the Age of the employees. The y-axis represents their Monthly Income.

2. Data Points: Each point represents one employee from the test set. The color of each point indicates the predicted attrition status:

Red indicates employees predicted to leave the company (Attrition = “Yes”).

Blue indicates employees predicted to stay with the company (Attrition = “No”).

3. Interpretation:

The plot shows the distribution of employees’ age and monthly income, highlighting which employees are predicted to leave and which are predicted to stay.

The point at age 35 with a monthly income of around $5,000 is colored red, indicating this employee is predicted to leave.

The point at age 40 with a monthly income of around $5,800 is colored blue, indicating this employee is predicted to stay.

These points are rows of testData; the new_employees data frame is not drawn in this plot.

4. Usefulness: This visualization helps identify patterns in the predictions of employee attrition related to age and income.

It provides insights that can be used to understand factors influencing attrition and to take preventive measures for employees at risk of leaving.

Overall, the plot serves as a tool to visually assess how the predictive model’s outputs correlate with employee characteristics such as age and income.