Amid the era of globalization and relentless competitive pressures, sustaining strong employee performance has become a central priority and one of the biggest challenges for Human Resource (HR) departments. Organizations can no longer rely solely on intuition; instead, they are increasingly turning to data‑driven analytics to evaluate workforce outcomes, optimize talent management, and reduce turnover. Understanding the specific, underlying factors that influence productivity is essential for building sustainable, long‑term employee success.
This project uses R to explore a comprehensive HR dataset. By combining statistical methods with machine learning techniques, it aims to uncover the key drivers of high performance and Key Performance Indicator (KPI) achievement, translating raw HR data into actionable organizational insights.
The primary objective of this project is to conduct a comprehensive data analysis using R to identify key factors such as employee demographics, training effectiveness, length of service, and prior performance ratings that significantly influence the achievement of Key Performance Indicators (KPIs) exceeding 80%. By applying statistical techniques, data visualization, and predictive modeling, this study aims to generate actionable insights that can guide HR professionals in enhancing employee performance strategies, supporting talent development, and strengthening organizational decision‑making.
Specifically, the project seeks to:
- Identify key factors affecting performance through
statistical analysis and machine learning, focusing on variables such as
training, work experience, education level, and departmental
affiliation.
- Compare predictive models to determine the most
effective approach for forecasting KPI achievement above 80%, using
evaluation metrics such as confusion matrices, accuracy, sensitivity,
and specificity.
- Provide recommendations and actionable insights to HR
departments and stakeholders, supporting evidence‑based decisions in
talent management, training programs, and employee engagement
initiatives.
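The evaluation metrics named in the second objective can all be read off a confusion matrix. The sketch below is illustrative only: the `actual` and `predicted` vectors are made-up toy labels, not model output from this project, and class `1` is assumed to be the positive class (KPI above 80% met).

```r
# Toy labels only: these vectors are illustrative, not taken from the dataset.
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1), levels = c(0, 1))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["1", "1"]  # predicted 1, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

accuracy    <- (tp + tn) / sum(cm)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)       # share of true 1s that are recovered
specificity <- tn / (tn + fp)       # share of true 0s that are recovered
```

Because roughly two thirds of employees fall in the negative class, accuracy alone can look good for a model that rarely predicts class `1`, which is why sensitivity and specificity are reported alongside it.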
The dataset titled Employees Performance for HR Analytics was uploaded to Kaggle by Sanjana Chaudhari in 2023 and serves as the foundation for this analysis. It contains 17,417 employee records across 13 variables, stored in CSV format. The dataset captures a balanced mix of categorical and numerical variables, making it suitable for exploratory data analysis (EDA), correlation studies, and predictive modeling in HR analytics.
The variables included are as follows:
- employee_id: Unique identifier for each employee;
serves as the primary key for tracking records without revealing
personal information.
- department: Employee’s department (e.g., Sales &
Marketing, Technology); useful for performance segmentation and
departmental comparisons.
- region: Geographic region of employment.
- education: Highest education level attained (e.g.,
Bachelor’s, Master’s and above).
- gender: Employee gender (m = male, f = female).
- recruitment_channel: Hiring source (e.g., Referred,
Sourcing).
- no_of_trainings: Number of trainings attended.
- age: Employee age.
- previous_year_rating: Performance rating from the
prior year (1–5 scale).
- length_of_service: Number of years served in the
organization.
- KPIs_met_more_than_80: Binary indicator of whether
>80% KPIs were achieved (0 = No, 1 = Yes); this serves as the target
variable.
- awards_won: Indicator of whether the employee won
awards (0 = No, 1 = Yes).
- avg_training_score: Average score from trainings,
reflecting training quality.
By analyzing these variables, the study aims to uncover meaningful patterns that can guide HR strategies, improve productivity, and strengthen workforce management.
Data cleaning is a critical step in preparing the dataset for analysis. It involves handling missing values, correcting inconsistencies, removing duplicates, and ensuring that variables are properly formatted for statistical modeling. Clean data provides a reliable foundation for exploratory analysis and predictive modeling, reducing bias and improving the accuracy of insights.
The following packages were used in the data cleaning process:
- dplyr: Data manipulation and transformation (filter, mutate, select, distinct, summarise, case_when, across).
- tidyr: Handling missing values and tidying data (replace_na).
- stringr: Text cleaning and string processing (str_trim, str_to_lower).
- writexl: Exporting the cleaned dataset to Excel format (write_xlsx).
This step involves loading the raw dataset into R for inspection. The structure and summary of the data are examined to understand variable types.
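# Load the packages used throughout the cleaning steps
library(dplyr)
library(tidyr)
library(stringr)
employee_performance <- read.csv("Uncleaned_employees_final_dataset.csv")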
str(employee_performance)
## 'data.frame': 17417 obs. of 13 variables:
## $ employee_id : int 8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
## $ department : chr "Technology" "HR" "Sales & Marketing" "Procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int NA 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
summary(employee_performance)
## employee_id department region education
## Min. : 3 Length :17417 Length :17417 Length :17417
## 1st Qu.:19281 N.unique : 9 N.unique : 34 N.unique : 4
## Median :39122 N.blank : 0 N.blank : 0 N.blank : 771
## Mean :39083 Min.nchar: 2 Min.nchar: 8 Min.nchar: 0
## 3rd Qu.:58838 Max.nchar: 17 Max.nchar: 9 Max.nchar: 15
## Max. :78295
##
## gender recruitment_channel no_of_trainings age
## Length :17417 Length :17417 Min. :1.000 Min. :20.00
## N.unique : 2 N.unique : 3 1st Qu.:1.000 1st Qu.:29.00
## N.blank : 0 N.blank : 0 Median :1.000 Median :33.00
## Min.nchar: 1 Min.nchar: 5 Mean :1.251 Mean :34.81
## Max.nchar: 1 Max.nchar: 8 3rd Qu.:1.000 3rd Qu.:39.00
## Max. :9.000 Max. :60.00
##
## previous_year_rating length_of_service KPIs_met_more_than_80 awards_won
## Min. :1.000 Min. : 1.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:3.000 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :3.000 Median : 5.000 Median :0.0000 Median :0.00000
## Mean :3.345 Mean : 5.802 Mean :0.3588 Mean :0.02337
## 3rd Qu.:4.000 3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :5.000 Max. :34.000 Max. :1.0000 Max. :1.00000
## NAs :1363
## avg_training_score
## Min. :39.00
## 1st Qu.:51.00
## Median :60.00
## Mean :63.18
## 3rd Qu.:75.00
## Max. :99.00
##
Duplicate records and unnecessary columns are removed to ensure data integrity. Unique values are checked to identify inconsistencies in categorical variables.
unique(employee_performance$department)
## [1] "Technology" "HR" "Sales & Marketing"
## [4] "Procurement" "Finance" "Analytics"
## [7] "Operations" "Legal" "R&D"
unique(employee_performance$education)
## [1] "Bachelors" "Masters & above" "" "Below Secondary"
unique(employee_performance$gender)
## [1] "m" "f"
unique(employee_performance$recruitment_channel)
## [1] "sourcing" "other" "referred"
unique(employee_performance$region)
## [1] "region_26" "region_4" "region_13" "region_2" "region_29" "region_7"
## [7] "region_22" "region_16" "region_17" "region_24" "region_11" "region_27"
## [13] "region_9" "region_20" "region_34" "region_23" "region_8" "region_14"
## [19] "region_31" "region_19" "region_5" "region_28" "region_15" "region_3"
## [25] "region_25" "region_12" "region_21" "region_30" "region_10" "region_33"
## [31] "region_32" "region_6" "region_1" "region_18"
str(employee_performance)
## 'data.frame': 17417 obs. of 13 variables:
## $ employee_id : int 8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
## $ department : chr "Technology" "HR" "Sales & Marketing" "Procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int NA 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
# Show duplicated employee_id
employee_performance %>%
filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
arrange(employee_id)
# Remove exact duplicate rows only
employee_performance <- employee_performance %>%
distinct()
# Check duplicated employee_id again
employee_performance %>%
filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
arrange(employee_id)
#remove unnecessary column
employee_performance <- employee_performance %>%
select(-employee_id)
Text fields are standardized by trimming spaces and converting to lowercase. Missing values are handled using median imputation and categorical replacement to maintain data completeness.
#clean text column
employee_performance <- employee_performance %>%
mutate(
gender = str_to_lower(str_trim(gender)),
department = str_trim(department),
education = str_trim(education),
recruitment_channel = str_trim(recruitment_channel)
)
str(employee_performance)
## 'data.frame': 17415 obs. of 12 variables:
## $ department : chr "Technology" "HR" "Sales & Marketing" "Procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int NA 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
colSums(is.na(employee_performance))
## department region education
## 0 0 0
## gender recruitment_channel no_of_trainings
## 0 0 0
## age previous_year_rating length_of_service
## 0 1363 0
## KPIs_met_more_than_80 awards_won avg_training_score
## 0 0 0
employee_performance %>%
summarise(across(everything(), ~ sum(is.na(.) | trimws(as.character(.)) == "")))
#handling missing values
employee_performance <- employee_performance %>%
mutate(
previous_year_rating = ifelse(
is.na(previous_year_rating),
median(previous_year_rating, na.rm = TRUE),
previous_year_rating
),
education = ifelse(
is.na(education) | str_trim(education) == "",
"Unknown",
education
)
)
#safety net: replace any NA that still remains in education
employee_performance <- employee_performance %>%
mutate(education = replace_na(education, "Unknown"))
#clean any text column
clean_text <- function(x) {
x %>%
str_trim() %>%
str_to_lower()
}
employee_performance$department <- clean_text(employee_performance$department)
employee_performance <- employee_performance %>%
mutate(across(
c(department, education, recruitment_channel, region),
clean_text
))
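The imputation logic described above can be checked on a small example. The sketch below uses a hypothetical data frame named `toy` whose column names mirror the dataset but whose values are invented; it relies only on base R, so it is an equivalent of the approach rather than the exact pipeline used here.

```r
# Hypothetical toy data frame; column names mirror the dataset but the values are made up.
toy <- data.frame(
  previous_year_rating = c(NA, 3, 1, 5, NA, 4),
  education = c("Bachelors", "", NA, " Masters & above ", "Bachelors", "  "),
  stringsAsFactors = FALSE
)

# Median of the observed ratings (1, 3, 4, 5), ignoring NA
rating_median <- median(toy$previous_year_rating, na.rm = TRUE)
toy$previous_year_rating[is.na(toy$previous_year_rating)] <- rating_median

# Trim whitespace first, then treat NA and empty strings as missing
toy$education <- trimws(toy$education)
toy$education[is.na(toy$education) | toy$education == ""] <- "Unknown"
```

Trimming before the blank check matters: a value such as `"  "` is not equal to `""` until the surrounding whitespace is removed.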
New variables are created to enhance analytical insights. Age groups are categorized, and categorical variables are converted to factors for modeling compatibility.
#create age group
employee_performance <- employee_performance %>%
mutate(age_group = case_when(
age < 30 ~ "Young",
age >= 30 & age < 40 ~ "Mid",
TRUE ~ "Senior"
))
str(employee_performance)
## 'data.frame': 17415 obs. of 13 variables:
## $ department : chr "technology" "hr" "sales & marketing" "procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "bachelors" "bachelors" "bachelors" "bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : num 3 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
## $ age_group : chr "Young" "Mid" "Mid" "Mid" ...
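As an alternative to `case_when()`, the same binning can be sketched with base R's `cut()`. The `ages` vector below is illustrative, and the break points are assumed to match the rule above (under 30, 30 to 39, 40 and over).

```r
ages <- c(24, 29, 30, 39, 40, 51)  # illustrative values only

# right = FALSE makes each interval closed on the left: [0, 30), [30, 40), [40, Inf)
age_group <- cut(
  ages,
  breaks = c(0, 30, 40, Inf),
  labels = c("Young", "Mid", "Senior"),
  right = FALSE
)

# Same rule written as nested ifelse(), mirroring the case_when() logic
expected <- ifelse(ages < 30, "Young", ifelse(ages < 40, "Mid", "Senior"))
```

One design difference is worth noting: `cut()` keeps the levels in the order Young, Mid, Senior, whereas `as.factor()` on a character column sorts them alphabetically (Mid, Senior, Young), which affects the order of bars and model reference levels.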
#convert to factors
employee_performance <- employee_performance %>%
mutate(
department = as.factor(department),
gender = as.factor(gender),
education = as.factor(education),
recruitment_channel = as.factor(recruitment_channel),
region = as.factor(region),
age_group = as.factor(age_group)
)
#normalize score (as.numeric() drops the one-column matrix returned by scale())
employee_performance <- employee_performance %>%
mutate(avg_training_score_scaled = as.numeric(scale(avg_training_score)))
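The scaling step is a standard z-score. A minimal sketch follows, using the first ten training scores printed in the `str()` output as sample input; because the mean and standard deviation here come from only ten values, the results will not match the full-dataset column.

```r
scores <- c(77, 51, 47, 65, 61, 68, 57, 85, 75, 76)  # first ten scores shown in the str() output

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() returns a one-column matrix with attributes; as.numeric() flattens it
z_scaled <- as.numeric(scale(scores))
```

After scaling, the column has mean 0 and standard deviation 1, so a value of 1 means the score is one standard deviation above the average.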
After cleaning, the final step is to export the processed data for further analysis and reporting. The CSV format is used as input to the next step, EDA.
write.csv(employee_performance, "clean_employee_performance.csv", row.names = FALSE)
Before proceeding to the modelling stage, Exploratory Data
Analysis (EDA) was conducted to examine employee performance.
The steps performed in EDA:
The required libraries and the cleaned dataset df_clean were loaded and inspected to understand the structure before moving on to exploratory data analysis.
# Install packages (run only once if needed):
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("corrplot")
# install.packages("plotly")
# install.packages("reshape2")
# install.packages("kableExtra")
# Load required libraries:
library(dplyr) # Data manipulation and group_by() function
library(ggplot2) # Data visualization
library(tidyverse) # Collection of data science packages
library(knitr) # R Markdown table formatting
library(corrplot) # Correlation matrix visualization
library(plotly) # Interactive plots
library(reshape2) # Data reshaping
library(kableExtra) # Enhanced table styling
# Load cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv")
#convert to factors
df_clean <- df_clean %>%
mutate(
department = as.factor(department),
gender = as.factor(gender),
education = as.factor(education),
recruitment_channel = as.factor(recruitment_channel),
region = as.factor(region),
age_group = as.factor(age_group),
awards_won = as.factor(awards_won) # added conversion to factor for better analysis
)
The dataset structure confirms the variables are correctly formatted with appropriate data types.
str() shows that the dataset is a mix of numeric
and categorical variables. glimpse() and dim()
show that it contains 17,415 observations and 14
variables.
# Data structure overview inspection
head(df_clean)
str(df_clean)
## 'data.frame': 17415 obs. of 14 variables:
## $ department : Factor w/ 9 levels "analytics","finance",..: 9 3 8 6 2 6 2 1 9 9 ...
## $ region : Factor w/ 34 levels "region_1","region_10",..: 19 29 5 12 22 32 12 15 32 15 ...
## $ education : Factor w/ 4 levels "bachelors","below secondary",..: 1 1 1 1 1 1 1 1 3 1 ...
## $ gender : Factor w/ 2 levels "f","m": 2 1 2 1 2 2 2 2 2 2 ...
## $ recruitment_channel : Factor w/ 3 levels "other","referred",..: 3 1 1 1 3 3 1 3 1 3 ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int 3 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80 : int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
## $ age_group : Factor w/ 3 levels "Mid","Senior",..: 3 1 1 1 1 1 1 1 2 3 ...
## $ avg_training_score_scaled: num 1.03 -0.908 -1.206 0.136 -0.162 ...
glimpse(df_clean)
## Rows: 17,415
## Columns: 14
## $ department <fct> technology, hr, sales & marketing, procureme…
## $ region <fct> region_26, region_4, region_13, region_2, re…
## $ education <fct> bachelors, bachelors, bachelors, bachelors, …
## $ gender <fct> m, f, m, f, m, m, m, m, m, m, m, m, f, m, m,…
## $ recruitment_channel <fct> sourcing, other, other, other, sourcing, sou…
## $ no_of_trainings <int> 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ age <int> 24, 31, 31, 31, 30, 36, 33, 36, 51, 29, 40, …
## $ previous_year_rating <int> 3, 3, 1, 2, 4, 3, 5, 3, 4, 5, 5, 3, 3, 3, 5,…
## $ length_of_service <int> 1, 5, 4, 9, 7, 2, 3, 3, 11, 2, 12, 10, 4, 10…
## $ KPIs_met_more_than_80 <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ awards_won <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg_training_score <int> 77, 51, 47, 65, 61, 68, 57, 85, 75, 76, 50, …
## $ age_group <fct> Young, Mid, Mid, Mid, Mid, Mid, Mid, Mid, Se…
## $ avg_training_score_scaled <dbl> 1.03010551, -0.90754471, -1.20564475, 0.1358…
# Dataset dimensions
dim(df_clean)
## [1] 17415 14
The summary provides an overview of the central tendency and distribution of each variable.
avg_training_score was standardized during cleaning, creating the new column
avg_training_score_scaled to ease future
analysis.
# Summary statistics
df_clean %>%
select(age, previous_year_rating, KPIs_met_more_than_80,
length_of_service, no_of_trainings, avg_training_score,
avg_training_score_scaled) %>%
summary()
## age previous_year_rating KPIs_met_more_than_80 length_of_service
## Min. :20.00 Min. :1.000 Min. :0.0000 Min. : 1.000
## 1st Qu.:29.00 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.: 3.000
## Median :33.00 Median :3.000 Median :0.0000 Median : 5.000
## Mean :34.81 Mean :3.319 Mean :0.3589 Mean : 5.801
## 3rd Qu.:39.00 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.: 7.000
## Max. :60.00 Max. :5.000 Max. :1.0000 Max. :34.000
## no_of_trainings avg_training_score avg_training_score_scaled
## Min. :1.000 Min. :39.00 Min. :-1.8018
## 1st Qu.:1.000 1st Qu.:51.00 1st Qu.:-0.9075
## Median :1.000 Median :60.00 Median :-0.2368
## Mean :1.251 Mean :63.18 Mean : 0.0000
## 3rd Qu.:1.000 3rd Qu.:75.00 3rd Qu.: 0.8811
## Max. :9.000 Max. :99.00 Max. : 2.6697
A final quality check was conducted to check if any missing values or duplicated values remain.
# Check missing values
colSums(is.na(df_clean))
## department region education
## 0 0 0
## gender recruitment_channel no_of_trainings
## 0 0 0
## age previous_year_rating length_of_service
## 0 0 0
## KPIs_met_more_than_80 awards_won avg_training_score
## 0 0 0
## age_group avg_training_score_scaled
## 0 0
# Check missing or empty values
df_clean %>%
summarise(across(everything(), ~ sum(is.na(.) | . == "")))
# Check any duplicates
sum(duplicated(df_clean))
## [1] 16
janitor::get_dupes(df_clean)
## No variable names specified - using all columns.
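One possible explanation for the 16 duplicates reported above, offered as an assumption rather than a verified finding, is that `distinct()` was applied while `employee_id` was still present. Rows that differ only in their ID become identical once the ID column is dropped. The toy data frame below is hypothetical and simply illustrates the effect.

```r
# Hypothetical records: employees 102 and 103 share every attribute except the ID.
toy <- data.frame(
  employee_id = c(101, 102, 103),
  department  = c("hr", "finance", "finance"),
  age         = c(31, 28, 28),
  stringsAsFactors = FALSE
)

# With the ID column, every row is unique
dupes_with_id <- sum(duplicated(toy))

# Without the ID column, the second finance row repeats the first
toy_no_id <- toy[, setdiff(names(toy), "employee_id")]
dupes_without_id <- sum(duplicated(toy_no_id))
```

If this is the cause, such rows would represent different employees with matching attributes, so removing them would discard valid records.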
# Check whether missing values in the education field have been replaced with "unknown"
unique(df_clean$education)
## [1] bachelors masters & above unknown below secondary
## Levels: bachelors below secondary masters & above unknown
Outlier detection focuses on age, length_of_service, and
avg_training_score because these variables have meaningful
numerical ranges, and extreme values may reveal unusual or hidden
characteristics of employees.
# Select variables suitable for outlier detection
outlier_vars <- df_clean %>%
select(age, length_of_service, avg_training_score)
# Convert selected variables into long format
outlier_data <- df_clean %>%
select(age, length_of_service, avg_training_score) %>%
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Value")
# Boxplot visualization
ggplot(outlier_data,
aes(x = Variable,
y = Value,
fill = Variable)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = "Outlier Detection for Continuous Variables",
x = "Variables",
y = "Values") +
theme(legend.position = "none")
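Boxplots flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces that rule numerically with `quantile()`; the vector `x` is an invented set of tenure-like values, not data from this project, and the hinges drawn by a plotting function may differ slightly depending on how it computes quartiles.

```r
x <- c(1, 2, 3, 3, 4, 5, 5, 6, 7, 34)  # illustrative tenure values, not from the dataset

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
```

Applied to a column such as `length_of_service`, the same steps would list the flagged values so they can be inspected individually before deciding whether to keep them.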
age and length_of_service contain several
outliers beyond the upper fence of the boxplot, which require further
investigation to determine whether they represent valid extreme values
or anomalies in the dataset.
# Check class imbalance
target_dist <- df_clean %>%
count(KPIs_met_more_than_80) %>%
mutate(
percentage = round(n / sum(n) * 100, 1),
KPI_status = ifelse(KPIs_met_more_than_80 == 1,
"Met KPI >80%",
"Met KPI ≤80%")
)
# KPI distribution plot
ggplot(target_dist,
aes(x = KPI_status,
y = n,
fill = KPI_status)) +
geom_bar(stat = "identity",
width = 0.6,
alpha = 0.9) +
# Percentage + count labels
geom_text(aes(label = paste0(percentage,
"%\n(n = ",
scales::comma(n), ")")),
vjust = -0.35,
size = 4.3,
fontface = "bold") +
scale_fill_manual(values = c("Met KPI ≤80%" = "#E74C3C", "Met KPI >80%" = "#2ECC71")) +
expand_limits(y = max(target_dist$n) * 1.15) +
labs(title = "Distribution of KPI Achievement",
subtitle = "Class balance analysis of KPI performance",
x = NULL,
y = "Number of Employees") +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(face = "bold")
)
# Print imbalance ratio
imbalance_ratio <- max(target_dist$percentage) /
min(target_dist$percentage)
cat("Imbalance ratio (majority/minority):",
round(imbalance_ratio, 2), "\n")
## Imbalance ratio (majority/minority): 1.79
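The ratio above is computed from rounded percentages. As a cross-check, the sketch below recomputes it from the raw class counts reported in the training score summary table of this report (11,165 employees who did not meet the threshold and 6,250 who did); the variable names are illustrative.

```r
# Class counts as reported in the training score summary table of this report
n_not_met <- 11165
n_met     <- 6250

share_met <- n_met / (n_met + n_not_met)  # proportion of employees in the positive class
ratio_raw <- n_not_met / n_met            # majority / minority

round(100 * share_met, 1)  # percentage of employees who met the KPI threshold
round(ratio_raw, 2)        # imbalance ratio from unrounded counts
```

Working from counts avoids the small rounding error that can appear when the ratio is derived from percentages already rounded to one decimal place.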
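KPI achievement is relatively balanced: roughly 64% of employees did not meet the threshold and 36% did, a majority-to-minority ratio of 1.79.
# Select Variables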
continuous_long <- df_clean %>%
select(age,
length_of_service,
avg_training_score) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value")
# Plots
ggplot() +
# age
geom_histogram(
data = filter(continuous_long, variable == "age"),
aes(x = value,
y = after_stat(density)),
binwidth = 2,
fill = "#3498DB",
color = "white",
alpha = 0.7
) +
geom_density(
data = filter(continuous_long, variable == "age"),
aes(x = value),
color = "#E74C3C",
linewidth = 1.1,
adjust = 1.2
) +
# length_of_service
geom_histogram(
data = filter(continuous_long,
variable == "length_of_service"),
aes(x = value,
y = after_stat(density)),
binwidth = 1,
fill = "#2ECC71",
color = "white",
alpha = 0.7
) +
geom_density(
data = filter(continuous_long,
variable == "length_of_service"),
aes(x = value),
color = "#C0392B",
linewidth = 1.1,
adjust = 1.5
) +
# avg_training_score
geom_histogram(
data = filter(continuous_long,
variable == "avg_training_score"),
aes(x = value,
y = after_stat(density)),
binwidth = 5,
fill = "#9B59B6",
color = "white",
alpha = 0.7
) +
geom_density(
data = filter(continuous_long,
variable == "avg_training_score"),
aes(x = value),
color = "#E74C3C",
linewidth = 1.1,
adjust = 1.2
) +
facet_wrap(~ variable,
scales = "free",
ncol = 2) +
labs(
title = "Distribution of Continuous Variables",
subtitle = "Histogram with Density Overlay",
x = "Value",
y = "Density"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
strip.text = element_text(face = "bold", size = 12),
axis.title = element_text(face = "bold")
)
- The age distribution is slightly right-skewed, indicating a higher concentration of younger employees and fewer older employees. The age density curve peaks around the mid-30s, which suggests that most employees are in the early to mid-career stage.
- The avg_training_score distribution is multimodal, with major clusters visible around 50, 60, and 80 to 85. This suggests possible segmentation in employee performance or departmental training outcomes.
- The length_of_service distribution is heavily right-skewed. Most employees have 1 to 7 years of service, whereas very few exceed 15 years, so the workforce consists largely of short-tenure or relatively new employees.
# Select Variables
discrete_vars <- df_clean %>%
select(previous_year_rating,
no_of_trainings)
discrete_long <- discrete_vars %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "value")
#Plot
ggplot(discrete_long,
aes(x = factor(value))) +
geom_bar(fill = "#2ECC71",
alpha = 0.8) +
facet_wrap(~ variable,
scales = "free",
ncol = 2) +
labs(title = "Distribution of Discrete Variables",
subtitle = "Frequency distribution by category",
x = "Category",
y = "Count") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))
- no_of_trainings is strongly right-skewed: most employees attended only one training session. Counts decline sharply after 2 sessions, and very few employees received more than 4, which produces the long tail.
- previous_year_rating is slightly left-skewed, with a mode at rating 3. Most employees received ratings of 3, 4, or 5 (mid-scale to excellent), which suggests that past performance is generally positive across the workforce.
# Select Variables
categorical_vars <- df_clean %>%
select(department,
education,
recruitment_channel,
awards_won)
categorical_long <- categorical_vars %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "category")
# Plot
ggplot(categorical_long,
aes(x = category,
fill = variable)) +
geom_bar(alpha = 0.85,
show.legend = FALSE) +
facet_wrap(~ variable,
scales = "free",
ncol = 2) +
labs(title = "Distribution of Categorical Variables",
subtitle = "Frequency distribution across employee categories",
x = "",
y = "Number of Employees") +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45,
hjust = 1),
strip.text = element_text(face = "bold"),
plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
# Plot for region (top 10)
region_analysis <- df_clean %>%
count(region) %>%
mutate(percentage = n / sum(n) * 100) %>%
top_n(10, n)
ggplot(region_analysis,
aes(x = reorder(region, n), y = percentage, fill = percentage)) +
geom_bar(stat = "identity", width = 0.8) +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = -0.3, size = 3) +
scale_fill_gradient(low = "#A3E4D7", high = "#1ABC9C") +
labs(title = "Employee Distribution by Region (Top 10)",
x = "Region", y = "Percentage (%)") +
theme_minimal() +
theme(axis.text.x = element_text(size = 7))
- The bar charts for department, education, and recruitment_channel reveal that employees are unevenly distributed across departments, educational backgrounds, and recruitment channels.
- The awards_won distribution is extremely imbalanced: nearly all employees have no awards (the mean of awards_won in the raw data is about 0.023). Because award winners are rare and almost invisible on the chart, the variable may have limited predictive power owing to insufficient variation. However, further bivariate analysis is needed to determine whether award winners perform better on KPI achievement.
# Summary Statistics by KPI
training_score_summary <- df_clean %>%
group_by(KPIs_met_more_than_80) %>%
summarise(
count = n(),
mean_score = mean(avg_training_score, na.rm = TRUE),
median_score = median(avg_training_score, na.rm = TRUE),
sd_score = sd(avg_training_score, na.rm = TRUE),
min_score = min(avg_training_score, na.rm = TRUE),
max_score = max(avg_training_score, na.rm = TRUE),
q25 = quantile(avg_training_score, 0.25, na.rm = TRUE),
q75 = quantile(avg_training_score, 0.75, na.rm = TRUE)
) %>%
mutate(KPI_Status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet KPI"))
kable(training_score_summary %>%
select(-KPIs_met_more_than_80) %>%
mutate(across(where(is.numeric), ~round(., 2))),
caption = "Training Score Summary by KPI Achievement Status")
| count | mean_score | median_score | sd_score | min_score | max_score | q25 | q75 | KPI_Status |
|---|---|---|---|---|---|---|---|---|
| 11165 | 62.46 | 59 | 13.35 | 39 | 99 | 50 | 74 | Did Not Meet KPI |
| 6250 | 64.47 | 61 | 13.45 | 41 | 99 | 53 | 77 | Met KPI >80% |
# Boxplot Comparison
ggplot(df_clean, aes(x = factor(KPIs_met_more_than_80), y = avg_training_score,
fill = factor(KPIs_met_more_than_80))) +
geom_boxplot(alpha = 0.7, outlier.size = 0.8) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71"),
labels = c("Did Not Meet KPI", "Met KPI")) +
labs(
title = "Training Score Distribution Comparison",
subtitle = "High performers show higher median training scores",
x = "KPI Achievement Status",
y = "Average Training Score",
fill = "KPI Status"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "none"
)
# Previous year rating vs KPI
rating_analysis <- df_clean %>%
group_by(previous_year_rating, KPIs_met_more_than_80) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(previous_year_rating) %>%
mutate(percentage = count / sum(count) * 100,
KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI", "Did Not Meet"))
# Plot
ggplot(rating_analysis, aes(x = factor(previous_year_rating), y = percentage,
fill = KPI_status)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), size = 3) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
labs(title = "KPI Achievement by Previous Year Rating",
x = "Previous Year Rating", y = "Percentage (%)",
fill = "KPI Status")
# Plot
ggplot(df_clean,
aes(x = factor(previous_year_rating),
y = avg_training_score,
fill = factor(previous_year_rating))) +
geom_boxplot(alpha = 0.8,
outlier.color = "#E74C3C") +
labs(title = "Average Training Score by Previous Year Rating",
x = "Previous Year Rating",
y = "Average Training Score") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
# Function to create bar plots for categorical variables
plot_categorical_kpi <- function(data, var_name) {
data %>%
group_by(!!sym(var_name), KPIs_met_more_than_80) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(!!sym(var_name)) %>%
mutate(percentage = count / sum(count) * 100,
KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet")) %>%
ggplot(aes(x = reorder(!!sym(var_name), -percentage * (KPIs_met_more_than_80 == 1)),
y = percentage, fill = KPI_status)) +
geom_bar(stat = "identity", position = "stack", width = 0.7) +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), size = 3) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
labs(title = paste("KPI Achievement by", var_name),
x = var_name, y = "Percentage (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, face = "bold"))
}
# Plot for each categorical variable
cat_vars <- c("department", "education", "recruitment_channel", "gender", "awards_won")
for (var in cat_vars) {
print(plot_categorical_kpi(df_clean, var))
}
# H0 = There is no significant relationship between categorical variables and KPI achievement.
# H1 = There is a significant relationship between categorical variables and KPI achievement.
# Decision Rule: Reject H0, if p-value <0.05
# Chi-square tests
cat_chi_results <- map_df(cat_vars, function(var) {
tbl <- table(df_clean[[var]], df_clean$KPIs_met_more_than_80)
test <- chisq.test(tbl)
# Store numeric p-value for correct sorting
p_val <- test$p.value
# Format p-value for display only
p_formatted <- ifelse(
p_val < 0.0001,
"< 0.0001",
as.character(round(p_val, 4)) # keep P_Value character in both branches so rows bind cleanly
)
data.frame(
Variable = var,
Chi_Square = round(as.numeric(test$statistic), 2),
P_Value = p_formatted,
P_Value_Numeric = p_val,
Significant = ifelse(p_val < 0.05, "Yes", "No")
)
})
# Display results (with sorting)
kable(
cat_chi_results %>%
arrange(P_Value_Numeric) %>%
select(-P_Value_Numeric),
caption = "Chi-square Tests: Categorical Variables vs KPI Achievement"
)
| Variable | Chi_Square | P_Value | Significant |
|---|---|---|---|
| department | 292.33 | < 0.0001 | Yes |
| awards_won | 191.77 | < 0.0001 | Yes |
| recruitment_channel | 42.91 | < 0.0001 | Yes |
| education | 40.25 | < 0.0001 | Yes |
| gender | 28.61 | < 0.0001 | Yes |
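With a sample this large, very small p-values are expected even for weak associations, so an effect size helps with interpretation. The sketch below computes Cramér's V from the chi-square statistics in the table above. It assumes n = 17415 (the row count reported when the cleaned file is loaded) and uses the fact that the KPI outcome has two levels, so min(rows, cols) - 1 = 1. Note that `chisq.test()` applies a continuity correction to 2x2 tables by default, so the values for gender and awards_won are slightly conservative.

```r
# Chi-square statistics copied from the results table
chi_sq <- c(department = 292.33, awards_won = 191.77,
            recruitment_channel = 42.91, education = 40.25, gender = 28.61)

# Assumption: number of employees in the cleaned dataset
n <- 17415

# The KPI outcome is binary, so min(rows, cols) - 1 equals 1
k <- 1

# Cramer's V = sqrt(chi-square / (n * (min(r, c) - 1)))
cramers_v <- sqrt(chi_sq / (n * k))
print(round(cramers_v, 3))
```

On these inputs every value falls below roughly 0.13, which by common rules of thumb points to weak associations despite the highly significant tests.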
# Summary function
get_summary <- function(data, group_var) {
data %>%
group_by(.data[[group_var]]) %>%
summarise(
total_trainings = sum(no_of_trainings, na.rm = TRUE),
avg_train_score = mean(avg_training_score, na.rm = TRUE),
kpi = sum(KPIs_met_more_than_80 == 1, na.rm = TRUE),
avg_tenure = mean(length_of_service, na.rm = TRUE),
avg_rating = mean(previous_year_rating, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE),
.groups = "drop"
) %>%
rename(category = 1)
}
# Focus only: Gender + Department
groups <- c("gender", "department")
summary_list <- lapply(groups, function(g) {
get_summary(df_clean, g) %>%
mutate(group = g)
})
combined_perf <- bind_rows(summary_list)
# Split metrics (NO scale mixing)
# Workforce metrics
workforce <- combined_perf %>%
pivot_longer(
cols = c(total_trainings, kpi),
names_to = "metric",
values_to = "value"
)
# Performance metrics
performance <- combined_perf %>%
pivot_longer(
cols = c(avg_train_score, avg_tenure, avg_rating, avg_age),
names_to = "metric",
values_to = "value"
)
# =========================
# Plot 1: Workforce (Gender + Department)
# =========================
p1 <- ggplot(workforce,
aes(x = category,
y = value,
fill = metric)) +
geom_bar(stat = "identity",
position = "dodge",
alpha = 0.9) +
facet_wrap(~ group, scales = "free_x") +
labs(title = "Workforce Overview by Gender and Department",
x = "",
y = "Count",
fill = "Metric") +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.title = element_text(size = 9),
plot.title = element_text(face = "bold", hjust = 0.5)
)
# =========================
# Plot 2: Performance (Gender + Department)
# =========================
p2 <- ggplot(performance,
aes(x = category,
y = value,
fill = metric)) +
geom_bar(stat = "identity",
position = "dodge",
alpha = 0.9) +
facet_wrap(~ group, scales = "free_x") +
labs(title = "Performance Metrics by Gender and Department",
x = "",
y = "Average Value",
fill = "Metric") +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.title = element_text(size = 9),
plot.title = element_text(face = "bold", hjust = 0.5)
)
# Output
ggplotly(p1) %>%
layout(
legend = list(
orientation = "v",
x = 1,
y = 0,
font = list(size = 9),
itemwidth = 30
),
margin = list(b = 120)
)
ggplotly(p2) %>%
layout(
legend = list(
orientation = "v",
x = 1,
y = 0,
font = list(size = 9),
itemwidth = 30
),
margin = list(b = 120)
)
# ==========================================
# Multivariate Summary Table
# For Gender + Department Analysis
# ==========================================
# Create formatted summary table
summary_table <- combined_perf %>%
mutate(
avg_train_score = round(avg_train_score, 2),
avg_tenure = round(avg_tenure, 2),
avg_rating = round(avg_rating, 2),
avg_age = round(avg_age, 2)
) %>%
arrange(group, desc(avg_train_score)) %>%
rename(
Category = category,
Group = group,
`Total Trainings` = total_trainings,
`Avg Training Score` = avg_train_score,
`KPI Achieved` = kpi,
`Avg Tenure` = avg_tenure,
`Avg Rating` = avg_rating,
`Avg Age` = avg_age
)
# Display table
kable(
summary_table,
caption = "Multivariate Performance Summary by Gender and Department",
align = "c"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center"
) %>%
row_spec(
0,
bold = TRUE,
color = "white",
background = "#2C3E50"
) %>%
column_spec(1, bold = TRUE) %>%
collapse_rows(
columns = 8,
valign = "top"
)
| Category | Total Trainings | Avg Training Score | KPI Achieved | Avg Tenure | Avg Rating | Avg Age | Group |
|---|---|---|---|---|---|---|---|
| analytics | 2281 | 84.57 | 679 | 5.00 | 3.47 | 32.41 | department |
| r&d | 438 | 84.45 | 149 | 4.80 | 3.66 | 32.89 | |
| technology | 2740 | 79.85 | 783 | 5.84 | 3.14 | 35.03 | |
| procurement | 2993 | 70.18 | 836 | 6.19 | 3.23 | 36.17 | |
| operations | 4121 | 60.35 | 1553 | 6.43 | 3.63 | 36.15 | |
| finance | 1059 | 60.33 | 319 | 5.01 | 3.49 | 32.60 | |
| legal | 355 | 59.53 | 118 | 4.50 | 3.38 | 33.75 | |
| hr | 892 | 50.39 | 300 | 5.63 | 3.51 | 34.25 | |
| sales & marketing | 6903 | 50.06 | 1513 | 5.75 | 3.10 | 34.63 | |
| f | 5992 | 63.68 | 1986 | 5.86 | 3.37 | 35.04 | gender |
| m | 15790 | 62.97 | 4264 | 5.78 | 3.30 | 34.71 | |
# Select numeric variables
num_data <- df_clean %>%
select(
no_of_trainings,
age,
previous_year_rating,
length_of_service,
avg_training_score,
KPIs_met_more_than_80
)
# Correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")
cor_melt <- melt(cor_matrix)
# Correlation heatmap
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(
low = "#E74C3C", # -1 strong negative
mid = "white", # 0 no correlation
high = "#2ECC71", # +1 strong positive
midpoint = 0,
limits = c(-1, 1),
name = "Correlation"
) +
geom_text(aes(label = round(value, 2)), size = 3) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
) +
ggtitle("Correlation Matrix of Employee Performance Variables")
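A pairwise correlation can be translated into a variance inflation factor (VIF) to judge collinearity more directly. The sketch below assumes only two predictors are involved, in which case VIF = 1 / (1 - r^2). The value r = 0.64 is the age and length_of_service correlation noted in the findings of this analysis; a full check would compute VIFs from all predictors jointly.

```r
# Correlation between age and length_of_service as reported in the findings
r <- 0.64

# Two-predictor case: the R^2 from regressing one predictor on the other is r^2
vif_pair <- 1 / (1 - r^2)
print(round(vif_pair, 2))
```

A VIF of this size sits well under the 5 to 10 range often quoted as a warning level, which is consistent with treating the two variables as usable together.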
# Select numeric features
num_features <- df_clean %>%
select(age,
length_of_service,
avg_training_score,
no_of_trainings,
previous_year_rating)
# Compute correlation with KPI
cor_results <- sapply(num_features, function(x) {
cor(x, df_clean$KPIs_met_more_than_80, use = "complete.obs")
})
# Convert to data frame and rank
cor_ranked <- data.frame(
feature = names(cor_results),
correlation = cor_results
) %>%
arrange(desc(abs(correlation)))
cor_ranked
# Plot
ggplot(cor_ranked,
aes(x = reorder(feature, correlation),
y = correlation,
fill = correlation)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_gradient2(low = "#E74C3C",
mid = "white",
high = "#2ECC71") +
labs(title = "Feature Correlation with KPI Achievement",
x = "Feature",
y = "Correlation Strength") +
theme_minimal(base_size = 13)
- avg_training_score and previous_year_rating are positively correlated with KPI achievement. Employees with higher training scores and higher ratings from the previous year typically perform better on KPIs.
- length_of_service and no_of_trainings exhibit right-skewed distributions, indicating that most employees have shorter tenure and have attended fewer training sessions.
- KPI achievement differs across the categorical variables department, education, gender, awards_won, and recruitment_channel. Chi-square tests confirm statistically significant associations between each of these variables and KPI achievement (all p < 0.0001).
- The relationship between previous_year_rating and avg_training_score reveals a potential nonlinear pattern, suggesting that performance may vary across different rating groups.
- The correlation between age and length_of_service is approximately 0.64. Because this is below the commonly used threshold of 0.7, it is unlikely to cause serious multicollinearity.

This section applies statistical and machine learning techniques to uncover meaningful insights from the cleaned dataset. The goal is to identify key predictors, evaluate model performance, and generate reliable forecasts. By combining exploratory analysis with predictive modelling, we aim to transform raw data into actionable knowledge that supports decision‑making.
Question: Can employee KPI achievement (more than 80%) be predicted using demographic, training, and workplace-related variables?
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
# ============================================================
# Logistic Regression — Employee Performance
# ============================================================
library(caret)
library(ggplot2)
library(pROC)
library(dplyr)
library(gridExtra)
# STEP 1 — Load data
df_clean <- read.csv("clean_employee_performance.csv",
stringsAsFactors = FALSE)
# STEP 2 — Remove redundant columns
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
# STEP 3 — Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
# STEP 4 — Convert target to factor
df_clean$KPIs_met_more_than_80 <- factor(
df_clean$KPIs_met_more_than_80,
levels = c(0, 1),
labels = c("No", "Yes")
)
# STEP 5 — Train/Test split
set.seed(42)
train_idx <- createDataPartition(
df_clean$KPIs_met_more_than_80,
p = 0.80,
list = FALSE
)
train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]
cat(sprintf(
"\nTrain-test split: Train rows = %d | Test rows = %d\n",
nrow(train_data),
nrow(test_data)
))
##
## Train-test split: Train rows = 13932 | Test rows = 3483
# STEP 6 — Build Logistic Regression Model
log_model <- glm(
KPIs_met_more_than_80 ~ .,
data = train_data,
family = binomial
)
cat("\nLogistic Regression Model Successfully Built\n")
##
## Logistic Regression Model Successfully Built
# Logistic Regression Coefficients
coef_table <- coef(summary(log_model))
coef_table
## Estimate Std. Error z value
## (Intercept) -1.8726908529 0.431505245 -4.33990288
## departmentfinance -0.2105938388 0.147293437 -1.42975711
## departmenthr -0.5132270704 0.179556152 -2.85830958
## departmentlegal -0.5277242420 0.186081945 -2.83597768
## departmentoperations -0.0532980542 0.127460804 -0.41815250
## departmentprocurement -0.0458793980 0.103562297 -0.44301256
## departmentr&d 0.0161819184 0.143661783 0.11263899
## departmentsales & marketing -0.6022266990 0.161795217 -3.72215390
## departmenttechnology -0.0148637348 0.086201295 -0.17243053
## regionregion_10 0.0671279365 0.263124382 0.25511865
## regionregion_11 0.2313820004 0.235887456 0.98089998
## regionregion_12 0.2578670103 0.286428533 0.90028395
## regionregion_13 0.3659097187 0.219969861 1.66345388
## regionregion_14 0.1735354117 0.259410064 0.66896175
## regionregion_15 0.2028051695 0.221605549 0.91516286
## regionregion_16 0.0663256894 0.237037188 0.27981132
## regionregion_17 0.4803135904 0.253838560 1.89220105
## regionregion_18 0.8849238806 0.657066054 1.34678070
## regionregion_19 0.1907905709 0.249522039 0.76462412
## regionregion_2 0.3764650829 0.208562610 1.80504590
## regionregion_20 -0.0080864697 0.262748407 -0.03077647
## regionregion_21 0.1220179978 0.313709285 0.38895246
## regionregion_22 0.4826413301 0.210590211 2.29185074
## regionregion_23 0.1894869199 0.242369426 0.78181033
## regionregion_24 -0.1869001176 0.301773486 -0.61933910
## regionregion_25 0.0842851631 0.254194940 0.33157687
## regionregion_26 0.2460264710 0.223927289 1.09868910
## regionregion_27 0.2690104367 0.230338312 1.16789272
## regionregion_28 0.4819508206 0.234790104 2.05268796
## regionregion_29 0.5306506468 0.249550252 2.12642801
## regionregion_3 0.5671508208 0.310623474 1.82584662
## regionregion_30 0.1148503592 0.265673317 0.43229919
## regionregion_31 0.1907524303 0.225856913 0.84457202
## regionregion_32 -0.1357961886 0.251003424 -0.54101329
## regionregion_33 0.1529711524 0.340530017 0.44921489
## regionregion_34 0.2866883020 0.301546987 0.95072514
## regionregion_4 0.8356679163 0.226555755 3.68857509
## regionregion_5 -0.2321345830 0.268661634 -0.86404069
## regionregion_6 0.2394558697 0.268470704 0.89192551
## regionregion_7 0.4153636496 0.212857180 1.95137251
## regionregion_8 0.0877609175 0.265669036 0.33033928
## regionregion_9 -0.5273401108 0.345457214 -1.52649906
## educationbelow secondary 0.1800414620 0.146323260 1.23043637
## educationmasters & above 0.0434159765 0.048220704 0.90035965
## educationunknown -0.2307760401 0.103000502 -2.24053317
## genderm -0.0271444866 0.044163908 -0.61463053
## recruitment_channelreferred 0.4114742993 0.138767848 2.96519910
## recruitment_channelsourcing -0.0037838948 0.039001401 -0.09701946
## no_of_trainings -0.1446831935 0.034417659 -4.20374882
## age 0.0004432747 0.003607434 0.12287813
## previous_year_rating 0.6290713128 0.018128685 34.70032844
## length_of_service -0.0641080371 0.006283295 -10.20293218
## awards_won 1.4198891674 0.131798260 10.77320117
## avg_training_score -0.0071499976 0.004227949 -1.69112658
## Pr(>|z|)
## (Intercept) 1.425457e-05
## departmentfinance 1.527867e-01
## departmenthr 4.259046e-03
## departmentlegal 4.568564e-03
## departmentoperations 6.758356e-01
## departmentprocurement 6.577567e-01
## departmentr&d 9.103168e-01
## departmentsales & marketing 1.975306e-04
## departmenttechnology 8.630991e-01
## regionregion_10 7.986315e-01
## regionregion_11 3.266421e-01
## regionregion_12 3.679692e-01
## regionregion_13 9.622162e-02
## regionregion_14 5.035199e-01
## regionregion_15 3.601061e-01
## regionregion_16 7.796223e-01
## regionregion_17 5.846420e-02
## regionregion_18 1.780509e-01
## regionregion_19 4.444954e-01
## regionregion_2 7.106750e-02
## regionregion_20 9.754478e-01
## regionregion_21 6.973113e-01
## regionregion_22 2.191426e-02
## regionregion_23 4.343261e-01
## regionregion_24 5.356930e-01
## regionregion_25 7.402088e-01
## regionregion_26 2.719037e-01
## regionregion_27 2.428500e-01
## regionregion_28 4.010285e-02
## regionregion_29 3.346764e-02
## regionregion_3 6.787337e-02
## regionregion_30 6.655240e-01
## regionregion_31 3.983498e-01
## regionregion_32 5.884984e-01
## regionregion_33 6.532767e-01
## regionregion_34 3.417439e-01
## regionregion_4 2.255135e-04
## regionregion_5 3.875655e-01
## regionregion_6 3.724329e-01
## regionregion_7 5.101275e-02
## regionregion_8 7.411436e-01
## regionregion_9 1.268856e-01
## educationbelow secondary 2.185337e-01
## educationmasters & above 3.679289e-01
## educationunknown 2.505633e-02
## genderm 5.387987e-01
## recruitment_channelreferred 3.024871e-03
## recruitment_channelsourcing 9.227109e-01
## no_of_trainings 2.625303e-05
## age 9.022036e-01
## previous_year_rating 7.788889e-264
## length_of_service 1.923752e-24
## awards_won 4.606990e-27
## avg_training_score 9.081263e-02
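Logistic regression coefficients are on the log-odds scale, which is hard to read directly. As a minimal sketch, the block below converts a few of the estimates printed above into odds ratios with Wald 95% confidence intervals. The numbers are copied from the coefficient table; the selection of four predictors is only for illustration.

```r
# Estimates and standard errors copied from the coefficient table
est <- c(previous_year_rating = 0.6290713128,
         awards_won           = 1.4198891674,
         length_of_service    = -0.0641080371,
         no_of_trainings      = -0.1446831935)
se  <- c(previous_year_rating = 0.018128685,
         awards_won           = 0.131798260,
         length_of_service    = 0.006283295,
         no_of_trainings      = 0.034417659)

# Odds ratio and Wald 95% interval: exponentiate estimate +/- z * SE
z <- qnorm(0.975)
odds_ratio <- data.frame(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
print(round(odds_ratio, 3))
```

For instance, exp(0.629) is about 1.88, so each additional point of previous_year_rating multiplies the odds of meeting the KPI threshold by roughly 1.9, holding the other predictors fixed. An interval that stays entirely below 1, as for length_of_service, indicates a negative association.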
# STEP 7 — Predictions
pred_prob <- predict(log_model,
newdata = test_data,
type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")
pred_class <- factor(pred_class,
levels = c("No", "Yes"))
# STEP 8 — Evaluation
cm <- confusionMatrix(
pred_class,
test_data$KPIs_met_more_than_80,
positive = "Yes"
)
# Visual confusion matrix
cm_table <- as.data.frame(cm$table)
ggplot(cm_table,
aes(
x = Reference,
y = Prediction,
fill = Freq
)) +
geom_tile(color = "white", linewidth = 1) +
geom_text(
aes(label = Freq),
color = "white",
size = 8,
fontface = "bold"
) +
scale_fill_gradient(
low = "#9BBCE0",
high = "#1A5FA8"
) +
labs(
title = "Confusion Matrix",
subtitle = "Predicted vs Actual Classes",
x = "Actual Class",
y = "Predicted Class",
fill = "Count"
) +
theme_minimal(base_size = 16)
acc <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)
cat(
"Accuracy:", acc, "\n",
"Sensitivity:", sens, "\n",
"Specificity:", spec, "\n",
"F1 Score:", f1, "\n"
)
## Accuracy: 71.03
## Sensitivity: 43.36
## Specificity: 86.52
## F1 Score: 51.79
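A sensitivity of 43.36% at the default 0.5 cut-off means that more than half of the employees who met the KPI threshold are classified as "No". Lowering the cut-off trades specificity for sensitivity. The sketch below illustrates that trade-off on simulated scores; the data, the `threshold_metrics` helper, and the grid of cut-offs are all hypothetical and do not reproduce the model's actual predictions.

```r
set.seed(1)

# Simulated stand-ins for test-set labels and predicted probabilities
n      <- 1000
actual <- rbinom(n, 1, 0.36)
prob   <- plogis(-0.6 + 1.2 * actual + rnorm(n))

# Hypothetical helper: sensitivity and specificity at each cut-off
threshold_metrics <- function(prob, actual, thresholds) {
  rows <- lapply(thresholds, function(t) {
    pred <- as.integer(prob > t)
    tp <- sum(pred == 1 & actual == 1)
    fn <- sum(pred == 0 & actual == 1)
    tn <- sum(pred == 0 & actual == 0)
    fp <- sum(pred == 1 & actual == 0)
    data.frame(threshold   = t,
               sensitivity = tp / (tp + fn),
               specificity = tn / (tn + fp))
  })
  do.call(rbind, rows)
}

res <- threshold_metrics(prob, actual, seq(0.2, 0.8, by = 0.1))

# Youden's J = sensitivity + specificity - 1; larger is better
res$youden <- res$sensitivity + res$specificity - 1
print(res)
print(res[which.max(res$youden), ])
```

Applied to `pred_prob` and the test labels, the same kind of sweep would show how much sensitivity could be gained and at what cost in specificity. Youden's J is only one possible selection rule; a cost-based rule may suit an HR setting better.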
# STEP 9 — ROC-AUC
roc_obj <- roc(
response = test_data$KPIs_met_more_than_80,
predictor = pred_prob,
levels = c("No", "Yes"),
direction = "<" # "No" cases are expected to have lower predicted probabilities
)
auc_val <- round(auc(roc_obj), 4)
roc_df <- data.frame(
FPR = 1 - roc_obj$specificities,
TPR = roc_obj$sensitivities
)
ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line(colour = "#1A5FA8", linewidth = 1.1) +
geom_abline(
slope = 1,
intercept = 0,
linetype = "dashed",
colour = "grey60",
linewidth = 0.7
) +
annotate(
"text",
x = 0.65,
y = 0.12,
label = sprintf("AUC = %.4f", auc_val),
size = 5,
fontface = "bold",
colour = "#1A5FA8"
) +
labs(
title = "ROC Curve",
subtitle = "Receiver Operating Characteristic – Test Set",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)"
) +
theme_minimal(base_size = 14) +
coord_equal()
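The AUC has a direct interpretation: it is the probability that a randomly chosen "Yes" case receives a higher score than a randomly chosen "No" case. The sketch below computes it from ranks using the Mann-Whitney identity. The `auc_rank` helper and the four made-up scores are hypothetical and serve only to illustrate the calculation.

```r
# Hypothetical helper: AUC via the Mann-Whitney rank-sum identity
auc_rank <- function(score, label) {
  r  <- rank(score)        # average ranks, so ties count as one half
  n1 <- sum(label == 1)    # number of positives
  n0 <- sum(label == 0)    # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny illustrative example with made-up scores
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)
print(auc_rank(score, label))  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```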
# ============================================================
# Logistic Regression Feature Importance
# ============================================================
coef_df <- data.frame(
Feature = names(coef(log_model)),
Coefficient = coef(log_model)
)
coef_df <- coef_df %>%
filter(Feature != "(Intercept)") %>%
mutate(
Abs_Coefficient = abs(Coefficient),
Direction = ifelse(
Coefficient > 0,
"Positive",
"Negative"
)
) %>%
arrange(desc(Abs_Coefficient))
top_coef_df <- coef_df %>%
slice_max(
order_by = Abs_Coefficient,
n = 10
)
# ============================================================
# Visual: Logistic Regression Feature Importance
# ============================================================
p1 <- ggplot(
top_coef_df,
aes(
x = reorder(
Feature,
Abs_Coefficient
),
y = Abs_Coefficient,
fill = Direction
)
) +
geom_bar(
stat = "identity"
) +
coord_flip() +
labs(
title = "Logistic Regression Feature Importance",
subtitle = "Top 10 absolute coefficients",
x = "Feature",
y = "Absolute Coefficient"
) +
theme_minimal(base_size = 12)
p1
# ============================================================
# Visual: Class Distribution
# ============================================================
class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(
prop.table(class_tbl) * 100,
1
)
class_df <- data.frame(
Class = names(class_tbl),
Count = as.integer(class_tbl),
Percent = as.numeric(class_pct)
)
p2 <- ggplot(
class_df,
aes(
x = Class,
y = Count,
fill = Class
)
) +
geom_bar(
stat = "identity",
width = 0.6,
show.legend = FALSE
) +
geom_text(
aes(
label = paste0(
Count,
"\n(",
Percent,
"%)"
)
),
vjust = -0.3,
fontface = "bold",
size = 4.5
) +
labs(
title = "Class Distribution of Target Variable",
subtitle = "KPIs met more than 80%",
x = "KPIs Met > 80%",
y = "Count"
) +
ylim(0, max(class_df$Count) * 1.18) +
theme_minimal(base_size = 14)
p2
# ============================================================
# Metrics Table
# ============================================================
precision <- round(cm$byClass["Precision"] * 100, 2)
metrics_df <- data.frame(
Metric = c("Accuracy",
"F1 Score",
"Precision",
"Sensitivity",
"Specificity"),
Value = c(acc,
f1,
precision,
sens,
spec)
)
# ============================================================
# Visual: Performance Metrics
# ============================================================
p3 <- ggplot(
metrics_df,
aes(
x = Metric,
y = Value,
fill = Metric
)
) +
geom_bar(
stat = "identity",
show.legend = FALSE
) +
geom_text(
aes(
label = paste0(
Value,
"%"
)
),
vjust = -0.4,
fontface = "bold",
size = 6
) +
ylim(0, 110) +
labs(
title = "Model Performance Metrics",
subtitle = "Evaluated on test set",
x = "Metric",
y = "Score (%)"
) +
theme_minimal(base_size = 18)
p3
# ============================================================
# Predicted Probability Data
# ============================================================
prob_df <- data.frame(
Probability = pred_prob,
Actual_Class = test_data$KPIs_met_more_than_80
)
# ============================================================
# Visual: Predicted Probability Distribution
# ============================================================
p4 <- ggplot(
prob_df,
aes(
x = Probability,
fill = Actual_Class
)
) +
geom_histogram(
binwidth = 0.05,
alpha = 0.75,
position = "identity",
colour = "white"
) +
labs(
title = "Predicted Probability Distribution",
subtitle = "Probability of KPIs > 80% by actual class",
x = "Predicted probability (Yes)",
y = "Count",
fill = "Actual class"
) +
theme_minimal(base_size = 18)
p4
# ============================================================
# Visual: Confusion Matrix
# ============================================================
cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")
p5 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(colour = "white", linewidth = 1) +
geom_text(aes(label = Freq), colour = "white", fontface = "bold", size = 7) +
scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
labs(
title = "Confusion Matrix",
subtitle = "Predicted vs Actual classes",
x = "Actual",
y = "Predicted"
) +
theme_minimal(base_size = 18)
# ============================================================
# Visual: ROC Curve
# ============================================================
p6 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line(linewidth = 1.2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "ROC Curve",
subtitle = paste("AUC =", auc_val),
x = "FPR",
y = "TPR"
) +
theme_minimal(base_size = 18)
library(grid)
library(gridExtra)
# ============================================================
# Save Logistic Regression Metrics
# ============================================================
# acc, sens, spec, and f1 were computed in Step 8 and are reused here
logistic_acc <- acc
logistic_sens <- sens
logistic_spec <- spec
logistic_f1 <- f1
logistic_auc <- auc_val
# ============================================================
# Final Logistic Regression Dashboard
# ============================================================
logistic_dashboard <- arrangeGrob(
p2, p5, p1,
p3, p6, p4,
ncol = 3,
nrow = 2,
widths = c(1.1, 1.1, 1.4),
heights = c(1, 1)
)
grid::grid.draw(logistic_dashboard)
# ============================================================
# Save Dashboard
# ============================================================
ggsave(
"logistic_regression_dashboard.png",
plot = logistic_dashboard,
width = 22,
height = 14,
dpi = 300
)
# ============================================================
# Random Forest — Employee Performance
# ============================================================
# ── 0. Install & load packages ────────────────────────────
required_packages <- c("randomForest", "caret", "ggplot2",
"reshape2", "pROC", "dplyr", "gridExtra")
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(randomForest)
library(caret)
library(ggplot2)
library(reshape2)
library(pROC)
library(dplyr)
library(gridExtra)
# ============================================================
# STEP 1 — Load data
# ============================================================
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)
cat(sprintf("Loaded: %d rows x %d columns\n\n", nrow(df_clean), ncol(df_clean)))
## Loaded: 17415 rows x 14 columns
# ============================================================
# STEP 2 — Fix issue 1: drop redundant / derived columns
# ============================================================
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
cat("Step 2: Dropped redundant columns: avg_training_score_scaled, age_group\n")
## Step 2: Dropped redundant columns: avg_training_score_scaled, age_group
# ============================================================
# STEP 3 — Fix issue 2: convert character columns to factors
# ============================================================
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
cat("Step 3: Converted to factor:", paste(char_cols, collapse = ", "), "\n")
## Step 3: Converted to factor: department, region, education, gender, recruitment_channel
cat(" 'unknown' kept as a valid factor level in 'education'\n")
## 'unknown' kept as a valid factor level in 'education'
# ============================================================
# STEP 4 — Fix issue 3: convert target to factor
# ============================================================
df_clean$KPIs_met_more_than_80 <- factor(df_clean$KPIs_met_more_than_80,
levels = c(0, 1),
labels = c("No", "Yes"))
cat("Step 4: Target 'KPIs_met_more_than_80' converted to factor\n")
## Step 4: Target 'KPIs_met_more_than_80' converted to factor
# ============================================================
# STEP 5 — Handle NA values
# ============================================================
na_total <- sum(is.na(df_clean))
cat(sprintf("\nStep 5: Total NAs found: %d\n", na_total))
##
## Step 5: Total NAs found: 0
if (na_total > 0) {
df_clean <- na.roughfix(df_clean)
cat(" na.roughfix() applied\n")
} else {
cat(" No NAs — skipping imputation\n")
}
## No NAs — skipping imputation
# ============================================================
# STEP 6 — Class imbalance check & compute class weights
# ============================================================
cat("\nStep 6: Class distribution\n")
##
## Step 6: Class distribution
class_tbl <- table(df_clean$KPIs_met_more_than_80)
print(class_tbl)
##
## No Yes
## 11165 6250
class_pct <- round(prop.table(class_tbl) * 100, 1)
print(class_pct)
##
## No Yes
## 64.1 35.9
n_no <- as.integer(class_tbl["No"])
n_yes <- as.integer(class_tbl["Yes"])
wt_no <- 1
wt_yes <- round(n_no / n_yes, 2)
class_weights <- c("No" = wt_no, "Yes" = wt_yes)
cat(sprintf(" Class weights — No: %.2f | Yes: %.2f\n", wt_no, wt_yes))
## Class weights — No: 1.00 | Yes: 1.79
# ============================================================
# STEP 7 — Train / Test split
# ============================================================
set.seed(42)
train_idx <- createDataPartition(df_clean$KPIs_met_more_than_80,
p = 0.80, list = FALSE)
train_data <- df_clean[ train_idx, ]
test_data <- df_clean[-train_idx, ]
cat(sprintf("\nStep 7: Train rows: %d | Test rows: %d\n",
nrow(train_data), nrow(test_data)))
##
## Step 7: Train rows: 13932 | Test rows: 3483
# ============================================================
# STEP 8 — Build Random Forest model
# ============================================================
set.seed(42)
n_features <- ncol(train_data) - 1
mtry_val <- floor(sqrt(n_features))
cat(sprintf("\nStep 8: Training Random Forest (ntree=500, mtry=%d) ...\n",
mtry_val))
##
## Step 8: Training Random Forest (ntree=500, mtry=3) ...
rf_model <- randomForest(
KPIs_met_more_than_80 ~ .,
data = train_data,
ntree = 500,
mtry = mtry_val,
importance = TRUE,
classwt = class_weights
)
print(rf_model)
##
## Call:
## randomForest(formula = KPIs_met_more_than_80 ~ ., data = train_data, ntree = 500, mtry = mtry_val, importance = TRUE, classwt = class_weights)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 31.77%
## Confusion matrix:
## No Yes class.error
## No 6693 2239 0.2506717
## Yes 2187 2813 0.4374000
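The out-of-bag (OOB) figures in this printout follow directly from the OOB confusion matrix. As a minimal sketch, the block below recomputes them from the printed counts, reading rows as actual classes and columns as predicted classes.

```r
# OOB confusion matrix copied from print(rf_model)
oob <- matrix(c(6693, 2239,
                2187, 2813),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("No", "Yes"),
                              predicted = c("No", "Yes")))

# Per-class error = misclassified cases in a row / row total
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error = all misclassified cases / all training rows
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(oob_error * 100, 2))
```

The "Yes" class error of about 44% shows that, even with class weights, the forest misses a large share of employees who met the KPI threshold.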
# ============================================================
# STEP 9 — Predict on test set
# ============================================================
preds_class <- predict(rf_model, newdata = test_data)
preds_prob <- predict(rf_model, newdata = test_data, type = "prob")
# ============================================================
# STEP 10 — Performance Evaluation
# ============================================================
cm <- confusionMatrix(preds_class,
test_data$KPIs_met_more_than_80,
positive = "Yes")
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1734 538
## Yes 499 712
##
## Accuracy : 0.7023
## 95% CI : (0.6868, 0.7174)
## No Information Rate : 0.6411
## P-Value [Acc > NIR] : 1.346e-14
##
## Kappa : 0.3485
##
## Mcnemar's Test P-Value : 0.238
##
## Sensitivity : 0.5696
## Specificity : 0.7765
## Pos Pred Value : 0.5879
## Neg Pred Value : 0.7632
## Prevalence : 0.3589
## Detection Rate : 0.2044
## Detection Prevalence : 0.3477
## Balanced Accuracy : 0.6731
##
## 'Positive' Class : Yes
##
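Each statistic reported by `confusionMatrix()` can be derived by hand from the four cell counts, which is a useful check when comparing models. The sketch below recomputes the main ones from the test-set counts shown above, with "Yes" as the positive class. The variable names are chosen for illustration.

```r
# Counts copied from the test-set confusion matrix (positive class = "Yes")
tp <- 712; fn <- 538; fp <- 499; tn <- 1734
n  <- tp + fn + fp + tn

sensitivity <- tp / (tp + fn)   # share of actual "Yes" correctly found
specificity <- tn / (tn + fp)   # share of actual "No" correctly found
precision   <- tp / (tp + fp)   # share of predicted "Yes" that are correct
accuracy    <- (tp + tn) / n
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance  <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              f1 = f1, kappa = kappa_hat), 4))
```

The accuracy of about 70% is only modestly above the no-information rate of 64%, which is why kappa (about 0.35) gives a more sober picture of the model than accuracy alone.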
acc <- round(cm$overall["Accuracy"] * 100, 2)
kappa <- round(cm$overall["Kappa"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
precision <- round(cm$byClass["Precision"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)
roc_obj <- roc(response = test_data$KPIs_met_more_than_80,
predictor = preds_prob[, "Yes"],
levels = c("No", "Yes"),
direction = "<")
auc_val <- round(auc(roc_obj), 4)
# ============================================================
# Visualize Confusion Matrix
# ============================================================
# Reuses cm from Step 10; caret and ggplot2 are already loaded
# Convert confusion matrix to data frame
cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")
# Plot confusion matrix heatmap
ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(colour = "white", linewidth = 1) +
geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
labs(title = "Confusion Matrix Heatmap",
subtitle = "Model predictions vs actual classes",
x = "Actual Class", y = "Predicted Class",
fill = "Count") +
theme_minimal(base_size = 14) +
theme(legend.position = "right")
# ============================================================
# ROC Curve
# ============================================================
# roc_obj was computed in Step 10 and is reused here
roc_df <- data.frame(FPR = 1 - roc_obj$specificities,
TPR = roc_obj$sensitivities)
ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line(colour = "#1A5FA8", linewidth = 1.1) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", colour = "grey60", linewidth = 0.7) +
annotate("text", x = 0.65, y = 0.12,
label = sprintf("AUC = %.4f", auc(roc_obj)),
size = 5, fontface = "bold", colour = "#1A5FA8") +
labs(title = "ROC Curve",
subtitle = "Receiver Operating Characteristic — test set",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)") +
theme_minimal(base_size = 14) +
coord_equal()
# ============================================================
# Feature Importance (Mean Decrease Gini)
# ============================================================
imp_mat <- importance(rf_model)
imp_df <- data.frame(
Feature = rownames(imp_mat),
MeanDecreaseAccuracy = imp_mat[, "MeanDecreaseAccuracy"],
MeanDecreaseGini = imp_mat[, "MeanDecreaseGini"]
)
imp_df <- imp_df[order(imp_df$MeanDecreaseGini), ]
imp_df$Feature <- factor(imp_df$Feature, levels = imp_df$Feature)
ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
geom_text(aes(label = round(MeanDecreaseGini, 1)),
hjust = -0.15, size = 3.5) +
scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
coord_flip() +
labs(title = "Feature Importance (Mean Decrease Gini)",
subtitle = "Higher = more important for node purity",
y = "Mean Decrease Gini") +
theme_minimal(base_size = 14)
# ============================================================
# Class Distribution Bar Chart
# ============================================================
class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(prop.table(class_tbl) * 100, 1)
class_df <- data.frame(Class = names(class_tbl),
Count = as.integer(class_tbl),
Percent = as.numeric(class_pct))
ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
vjust = -0.3, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
labs(title = "Class Distribution of Target Variable",
subtitle = "KPIs met more than 80%",
x = "KPIs Met > 80%", y = "Count") +
theme_minimal(base_size = 12)
# ============================================================
# Performance Metrics Bar Chart
# ============================================================
metrics_df <- data.frame(
Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision", "F1 Score"),
Value = c(acc, sens, spec, precision, f1)
)
ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
geom_text(aes(label = sprintf("%.1f%%", Value)),
vjust = -0.4, size = 4, fontface = "bold") +
scale_fill_brewer(palette = "Set2") +
labs(title = "Model Performance Metrics",
subtitle = "Evaluated on test set",
y = "Score (%)") +
ylim(0, 110) +
theme_minimal(base_size = 14)
# ============================================================
# Predicted Probability Histogram
# ============================================================
prob_df <- data.frame(
Probability = preds_prob[, "Yes"],
Actual_Class = test_data$KPIs_met_more_than_80
)
ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
geom_histogram(binwidth = 0.05, alpha = 0.75,
position = "identity", colour = "white") +
scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
labs(title = "Predicted Probability Distribution",
subtitle = "Probability of KPIs > 80% by actual class",
x = "Predicted probability (Yes)", y = "Count",
fill = "Actual class") +
theme_minimal(base_size = 14)
# ============================================================
# Create a Dashboard
# ============================================================
library(gridExtra)
# p1: Class distribution
p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
vjust = -0.3, size = 3.2, fontface = "bold") +
scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
ylim(0, max(class_df$Count) * 1.18) +
labs(title = "Class Distribution",
x = "Class", y = "Count") +
theme_minimal(base_size = 11)
# p2: Confusion matrix
p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(colour = "white") +
geom_text(aes(label = Freq), colour = "white",
fontface = "bold", size = 4) +
scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
labs(title = "Confusion Matrix",
x = "Actual", y = "Predicted") +
theme_minimal(base_size = 11) +
theme(legend.position = "right")
# p3: Feature importance
p3 <- ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
geom_text(aes(label = round(MeanDecreaseGini, 1)),
hjust = -0.15, size = 2.8) +
scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
coord_flip() +
labs(title = "Feature Importance",
x = "Feature", y = "Mean Decrease Gini") +
theme_minimal(base_size = 10)
# p4: Performance metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
geom_bar(stat = "identity", width = 0.55, show.legend = FALSE) +
geom_text(aes(label = sprintf("%.1f%%", Value)),
vjust = -0.3, size = 3.2, fontface = "bold") +
scale_fill_brewer(palette = "Set2") +
ylim(0, 110) +
labs(title = "Performance Metrics",
x = NULL, y = "Score (%)") +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
# p5: ROC curve
p5 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line(linewidth = 0.9) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(title = "ROC Curve",
subtitle = paste("AUC =", auc_val),
x = "FPR", y = "TPR") +
theme_minimal(base_size = 11)
# p6: Probability histogram
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
geom_histogram(binwidth = 0.05, alpha = 0.75,
position = "identity", colour = "white") +
scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
labs(title = "Predicted Probability",
x = "Probability", y = "Count",
fill = "Actual Class") +
theme_minimal(base_size = 11)
combined_dashboard <- grid.arrange(
p1, p2, p3,
p4, p5, p6,
ncol = 3,
nrow = 2
)
ggsave(
"dashboard_overview.png",
plot = combined_dashboard,
width = 20,
height = 12,
dpi = 300
)
The class distribution analysis showed that the dataset was moderately imbalanced, with a larger proportion of employees not meeting more than 80% of KPIs (“No”) compared to employees who successfully achieved the KPI target (“Yes”). This imbalance may affect classification performance because models can become biased toward the majority class. Therefore, evaluation metrics such as sensitivity, specificity, F1-score, precision, and AUC were analysed alongside overall accuracy to ensure a more balanced and reliable assessment of model performance.
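One way to see why accuracy alone can mislead on imbalanced data is to compare it against a trivial model that always predicts the majority class. The sketch below uses hypothetical class counts (not the report's actual figures) and base R only.

```r
# Hypothetical class counts, chosen only for illustration (not the report's data)
class_counts <- c(No = 650, Yes = 350)

# Share of each class in the data
class_share <- class_counts / sum(class_counts)

# Accuracy of a trivial model that always predicts the majority class
baseline_acc <- max(class_share)

# Ratio of majority to minority class size
imbalance_ratio <- max(class_counts) / min(class_counts)

print(round(class_share, 3))
cat(sprintf("Majority-class baseline accuracy: %.1f%%\n", 100 * baseline_acc))
cat(sprintf("Imbalance ratio (majority:minority): %.2f\n", imbalance_ratio))
```

Any model accuracy should be read against this baseline, which is why sensitivity, F1-score, and AUC are reported alongside it.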
The dataset was divided into training and testing subsets using an 80:20 ratio, where 80% of the data was used to train the Logistic Regression and Random Forest models, while the remaining 20% was held out for testing and evaluation. A hold-out split does not prevent overfitting by itself, but it exposes it: because the models are evaluated on unseen data, the test-set metrics give a more realistic indication of predictive performance and generalisation capability in employee KPI prediction tasks.
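The split code for the classification models is not shown in this section, so the following is only a minimal base-R sketch of a stratified 80:20 split. The toy data frame and its `target` column are hypothetical stand-ins for the real data and `KPIs_met_more_than_80`.

```r
set.seed(42)

# Toy data frame; `target` stands in for KPIs_met_more_than_80 (synthetic values)
toy <- data.frame(
  id     = 1:1000,
  target = factor(rep(c("No", "Yes"), times = c(650, 350)))
)

# Sample 80% of the row indices within each class so both subsets keep the class mix
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$target),
  function(rows) rows[sample.int(length(rows), size = floor(0.8 * length(rows)))]
), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should match in both subsets
print(prop.table(table(toy_train$target)))
print(prop.table(table(toy_test$target)))
```

Stratifying by the target keeps the "Yes"/"No" proportions the same in both subsets, which matters when the classes are imbalanced.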
The confusion matrix analysis revealed clear differences between the Logistic Regression and Random Forest models. Logistic Regression produced more false negatives, meaning many employees who actually achieved more than 80% KPI performance were incorrectly classified as unsuccessful employees, showing that the model was more conservative in predicting high performers. In comparison, Random Forest reduced the number of false negatives and identified successful employees more effectively, although it produced slightly more false positives. Overall, Logistic Regression focused more heavily on reducing incorrect positive predictions, whereas Random Forest demonstrated more balanced and practically useful classification behaviour for identifying genuine high-performing employees.
Both models performed substantially better than random guessing based on their ROC curves and AUC values. Logistic Regression achieved a slightly higher AUC value of 0.7402 compared to 0.7327 for Random Forest, indicating marginally stronger overall class separation capability across different probability thresholds. However, despite the slightly lower AUC, Random Forest achieved stronger practical predictive performance through higher sensitivity and F1-score values, showing that it was more effective at correctly identifying employees who successfully achieved KPI targets, while Logistic Regression produced more conservative predictions and focused more on reducing false positives.
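AUC can be read as the probability that a randomly chosen "Yes" employee receives a higher predicted score than a randomly chosen "No" employee. The sketch below computes it from ranks in base R; `auc_rank` is a hypothetical helper and the scores are made up for illustration, so it is not the code that produced the AUC values quoted above.

```r
# AUC as the probability that a random positive outranks a random negative
# (ties count as one half, handled by average ranks)
auc_rank <- function(scores, is_positive) {
  r     <- rank(scores)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and labels for illustration only
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

auc_example <- auc_rank(scores, labels)
print(auc_example)  # 0.75: 12 of the 16 positive-negative pairs are ordered correctly
```

Because the calculation uses only the ordering of the scores, AUC is independent of any single classification threshold, which is why it can disagree with sensitivity or F1 measured at one cut-off.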
The model performance metrics highlighted the trade-offs between Logistic Regression and Random Forest. Logistic Regression achieved slightly higher overall accuracy (71.03% vs 70.23%) and substantially higher specificity (86.52% vs 77.65%), indicating stronger performance in correctly identifying employees who did not meet KPI targets. However, Random Forest achieved noticeably higher sensitivity (56.96% vs 43.36%) and a higher F1-score (57.86% vs 51.79%), demonstrating better balance between precision and recall. Random Forest also achieved slightly stronger Kappa and balanced accuracy values, suggesting more balanced classification performance across both employee classes. Overall, Logistic Regression prioritised reducing false positive classifications, while Random Forest provided stronger practical performance for identifying genuine high-performing employees.
comparison_df <- data.frame(
Model = c(
"Logistic Regression",
"Random Forest"
),
Accuracy = c(
logistic_acc,
acc
),
Sensitivity = c(
logistic_sens,
sens
),
Specificity = c(
logistic_spec,
spec
),
F1_Score = c(
logistic_f1,
f1
),
AUC = c(
logistic_auc,
auc_val
)
)
comparison_df
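The metrics collected in the comparison table all derive from the four confusion-matrix counts. As a rough illustration, the sketch below computes them in base R; `class_metrics` is a hypothetical helper and the counts are invented, so the output does not reproduce the report's figures. It returns proportions, whereas the report's charts show percentages.

```r
# Derive common classification metrics from confusion-matrix counts.
# "Yes" is treated as the positive class.
class_metrics <- function(tp, fp, fn, tn) {
  n           <- tp + fp + fn + tn
  accuracy    <- (tp + tn) / n
  sensitivity <- tp / (tp + fn)          # recall for the "Yes" class
  specificity <- tn / (tn + fp)
  precision   <- tp / (tp + fp)
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  # Cohen's kappa: agreement beyond what the marginal totals give by chance
  p_chance <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
  kappa    <- (accuracy - p_chance) / (1 - p_chance)
  c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
    precision = precision, f1 = f1,
    balanced_accuracy = (sensitivity + specificity) / 2, kappa = kappa)
}

# Hypothetical counts, not taken from the report's models
m <- class_metrics(tp = 40, fp = 20, fn = 30, tn = 110)
print(round(m, 3))
```

With these counts, accuracy is 0.75 while sensitivity is only about 0.57, the same pattern of high accuracy masking missed positives that the comparison above describes.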
The feature importance analysis showed that awards_won and previous_year_rating were among the strongest predictors of employees meeting more than 80% of their KPIs. Employees with awards or strong previous-year ratings were significantly more likely to achieve successful KPI outcomes. Several regional variables also showed relatively strong positive relationships with KPI achievement, while departments such as sales & marketing, legal, and HR demonstrated negative relationships with KPI success. Logistic Regression provided interpretable coefficient-based relationships that clearly showed whether variables increased or decreased KPI achievement probability, whereas Random Forest provided clearer feature importance rankings and captured more complex non-linear relationships between variables.
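To illustrate how logistic regression coefficients translate into the direction and size of an effect, the sketch below fits a model to simulated data and converts the coefficients to odds ratios. The column names mirror the report's variables, but the data and the assumed effect sizes are synthetic, so the numbers say nothing about the real dataset.

```r
set.seed(1)
n <- 2000

# Simulated predictors; names mirror the report's columns but values are synthetic
sim <- data.frame(
  awards_won           = rbinom(n, 1, 0.1),
  previous_year_rating = sample(1:5, n, replace = TRUE)
)

# Assumed true log-odds, chosen only for this illustration
log_odds    <- -2 + 1.2 * sim$awards_won + 0.4 * sim$previous_year_rating
sim$kpi_met <- rbinom(n, 1, plogis(log_odds))

fit <- glm(kpi_met ~ awards_won + previous_year_rating,
           data = sim, family = binomial)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 indicates a variable that raises the odds of meeting the KPI target, and a value below 1 indicates one that lowers them.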
The predicted probability distribution plots showed noticeable differences in prediction confidence between the two models. Logistic Regression produced more conservative probability estimates, with many predictions concentrated around lower and middle probability ranges, explaining its higher specificity and lower sensitivity because stronger evidence was required before classifying employees as successful. In contrast, Random Forest produced a wider probability spread and clearer separation between the “Yes” and “No” classes, indicating stronger capability in distinguishing successful and unsuccessful employees. This improved probability separation contributed to Random Forest’s higher sensitivity and F1-score performance, making it more effective for identifying genuine high-performing employees.
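The link between conservative probability estimates and low sensitivity comes down to where the classification threshold sits relative to the predicted probabilities. The sketch below sweeps a threshold over hypothetical probabilities in base R; `threshold_metrics` and the data are illustrative assumptions, not outputs of the report's models.

```r
# Sensitivity and specificity at a given probability cut-off
threshold_metrics <- function(prob_yes, actual_yes, threshold) {
  predicted_yes <- prob_yes >= threshold
  c(threshold   = threshold,
    sensitivity = sum(predicted_yes & actual_yes) / sum(actual_yes),
    specificity = sum(!predicted_yes & !actual_yes) / sum(!actual_yes))
}

# Hypothetical predicted probabilities and true labels
prob_yes   <- c(0.10, 0.25, 0.35, 0.45, 0.55, 0.60, 0.70, 0.85)
actual_yes <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)

# One row per threshold
threshold_sweep <- t(sapply(c(0.3, 0.5, 0.7), function(th) {
  threshold_metrics(prob_yes, actual_yes, th)
}))
print(threshold_sweep)
```

Raising the threshold lowers sensitivity and raises specificity, so a model whose probabilities cluster below 0.5 will look conservative at the default cut-off even if it ranks employees well.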
Question: Can employee average training score be predicted using demographic and workplace-related variables?
# ============================================================
# Linear Regression — Employee Training Score Prediction
# ============================================================
# ── 0. Install & load packages ─────────────────────────────
required_packages <- c(
"caret",
"ggplot2",
"dplyr",
"Metrics",
"gridExtra"
)
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
}
library(caret)
library(ggplot2)
library(dplyr)
library(Metrics)
library(gridExtra)
# ============================================================
# STEP 1 — Load data
# ============================================================
df_clean <- read.csv(
"clean_employee_performance.csv",
stringsAsFactors = FALSE
)
cat(sprintf(
"Loaded: %d rows x %d columns\n\n",
nrow(df_clean),
ncol(df_clean)
))
## Loaded: 17415 rows x 14 columns
# ============================================================
# STEP 2 — Remove redundant columns
# ============================================================
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
df_clean$KPIs_met_more_than_80 <- NULL
cat("Dropped redundant columns\n")
## Dropped redundant columns
# ============================================================
# STEP 3 — Convert character columns to factors
# ============================================================
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(
df_clean[char_cols],
as.factor
)
cat("Converted character columns to factors\n")
## Converted character columns to factors
# ============================================================
# STEP 4 — Handle missing values
# ============================================================
na_total <- sum(is.na(df_clean))
cat(sprintf("Total NAs found: %d\n", na_total))
## Total NAs found: 0
if (na_total > 0) {
for (col in names(df_clean)) {
if (is.numeric(df_clean[[col]])) {
df_clean[[col]][is.na(df_clean[[col]])] <-
median(df_clean[[col]], na.rm = TRUE)
}
}
cat("Median imputation applied\n")
} else {
cat("No missing values found\n")
}
## No missing values found
# ============================================================
# STEP 5 — Train/Test Split
# ============================================================
set.seed(42)
train_idx <- createDataPartition(
df_clean$avg_training_score,
p = 0.80,
list = FALSE
)
train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]
cat(sprintf(
"Train rows: %d | Test rows: %d\n",
nrow(train_data),
nrow(test_data)
))
## Train rows: 13934 | Test rows: 3481
# ============================================================
# STEP 6 — Build Linear Regression Model
# ============================================================
lm_model <- lm(
avg_training_score ~
age +
previous_year_rating +
length_of_service +
no_of_trainings +
department +
education +
gender +
recruitment_channel +
awards_won,
data = train_data
)
summary(lm_model)
##
## Call:
## lm(formula = avg_training_score ~ age + previous_year_rating +
## length_of_service + no_of_trainings + department + education +
## gender + recruitment_channel + awards_won, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.310 -2.277 -0.360 1.591 48.978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.629255 0.282900 295.614 <2e-16 ***
## age 0.002447 0.006932 0.353 0.7241
## previous_year_rating 0.215669 0.032133 6.712 2e-11 ***
## length_of_service 0.004121 0.011971 0.344 0.7307
## no_of_trainings -0.062389 0.066029 -0.945 0.3447
## departmentfinance -24.145277 0.218651 -110.429 <2e-16 ***
## departmenthr -34.128867 0.218810 -155.975 <2e-16 ***
## departmentlegal -24.980894 0.306405 -81.529 <2e-16 ***
## departmentoperations -24.332708 0.153957 -158.048 <2e-16 ***
## departmentprocurement -14.449422 0.168389 -85.810 <2e-16 ***
## departmentr&d -0.261169 0.299648 -0.872 0.3834
## departmentsales & marketing -34.344873 0.142121 -241.659 <2e-16 ***
## departmenttechnology -4.695710 0.168160 -27.924 <2e-16 ***
## educationbelow secondary 0.782761 0.314546 2.489 0.0128 *
## educationmasters & above 0.239945 0.093364 2.570 0.0102 *
## educationunknown 0.057675 0.193335 0.298 0.7655
## genderm -0.015187 0.088555 -0.172 0.8638
## recruitment_channelreferred -0.020663 0.285673 -0.072 0.9423
## recruitment_channelsourcing -0.152191 0.078444 -1.940 0.0524 .
## awards_won 6.303949 0.253829 24.835 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.536 on 13914 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8855
## F-statistic: 5671 on 19 and 13914 DF, p-value: < 2.2e-16
# ============================================================
# STEP 7 — Predictions
# ============================================================
predictions <- predict(
lm_model,
newdata = test_data
)
actual_values <- test_data$avg_training_score
# ============================================================
# STEP 8 — Regression Evaluation Metrics
# ============================================================
rmse_val <- round(
rmse(actual_values, predictions),
3
)
mae_val <- round(
mae(actual_values, predictions),
3
)
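# R² is computed here as the squared Pearson correlation between
# actual and predicted test-set values
r2_val <- round(
cor(actual_values, predictions)^2,
3
)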
Regression Model Performance
RMSE : 4.51
MAE : 2.74
R² : 0.888
# ============================================================
# STEP 9 — Actual vs Predicted Plot
# ============================================================
results_df <- data.frame(
Actual = actual_values,
Predicted = predictions
)
p1 <- ggplot(
results_df,
aes(x = Actual,
y = Predicted)
) +
geom_point(
alpha = 0.5,
colour = "#1A5FA8"
) +
geom_abline(
slope = 1,
intercept = 0,
colour = "red",
linetype = "dashed"
) +
labs(
title = "Actual vs Predicted Values",
subtitle = "Linear Regression",
x = "Actual Training Score",
y = "Predicted Training Score"
) +
theme_minimal(base_size = 14)
p1
# ============================================================
# STEP 10 — Residual Plot
# ============================================================
results_df$Residuals <- actual_values - predictions
p2 <- ggplot(
results_df,
aes(x = Predicted,
y = Residuals)
) +
geom_point(
alpha = 0.5,
colour = "#2E7D32"
) +
geom_hline(
yintercept = 0,
colour = "red",
linetype = "dashed"
) +
labs(
title = "Residual Plot",
x = "Predicted Values",
y = "Residuals"
) +
theme_minimal(base_size = 14)
p2
# ============================================================
# STEP 11 — Residual Distribution
# ============================================================
p3 <- ggplot(
results_df,
aes(x = Residuals)
) +
geom_histogram(
bins = 30,
fill = "#4C9BE8",
colour = "white",
alpha = 0.8
) +
labs(
title = "Residual Distribution",
x = "Residual",
y = "Count"
) +
theme_minimal(base_size = 14)
p3
# ============================================================
# STEP 12 — Feature Importance
# ============================================================
coef_df <- data.frame(
Feature = names(coef(lm_model)),
Coefficient = coef(lm_model)
)
coef_df <- coef_df %>%
filter(Feature != "(Intercept)") %>%
mutate(
Abs_Coefficient = abs(Coefficient),
Direction = ifelse(
Coefficient > 0,
"Positive",
"Negative"
)
) %>%
arrange(desc(Abs_Coefficient))
top_coef_df <- coef_df %>%
slice_max(
order_by = Abs_Coefficient,
n = 15
)
p4 <- ggplot(
top_coef_df,
aes(
x = reorder(
Feature,
Abs_Coefficient
),
y = Abs_Coefficient,
fill = Direction
)
) +
geom_bar(
stat = "identity"
) +
coord_flip() +
labs(
title = "Feature Importance",
subtitle = "Linear Regression Coefficients",
x = "Feature",
y = "Absolute Coefficient"
) +
theme_minimal(base_size = 8)
print(p4)
The linear regression coefficient analysis showed that
awards_won was one of the strongest positive predictors of
average training score (about 6.3 points, p < 0.001), indicating that
employees who received awards generally achieved higher training
performance. previous_year_rating also contributed positively
(about 0.22 points per rating level), and the below secondary and
masters & above education levels showed small positive coefficients
relative to the reference education level. In contrast, departments
including sales & marketing, HR, legal, operations, and finance showed
strong negative coefficients (roughly -24 to -34 points), indicating
much lower average training scores than the reference department.
The coefficients for no_of_trainings, length_of_service, age, gender,
and recruitment channel were close to zero and not statistically
significant at the 5% level, so attending more trainings or having
longer service was not associated with higher training scores.
Overall, the model suggests that departmental differences dominate,
with employee recognition and past performance playing smaller but
statistically significant roles in average training scores within the
organization.
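Department coefficients are always measured against a reference level, which R takes to be the first factor level unless told otherwise. The sketch below shows this on toy data in base R; the department labels `deptA`, `deptB`, and `deptC` and the scores are hypothetical.

```r
# Toy data: department labels and scores are hypothetical
toy <- data.frame(
  department = factor(c("deptA", "deptB", "deptC",
                        "deptA", "deptB", "deptC")),
  score      = c(84, 60, 50, 86, 62, 52)
)

# The first factor level is the baseline absorbed into the intercept
print(levels(toy$department)[1])

fit_default <- lm(score ~ department, data = toy)
print(coef(fit_default))

# Changing the baseline changes the sign and size of the dummy coefficients,
# but leaves the fitted values unchanged
toy$department_c_ref <- relevel(toy$department, ref = "deptC")
fit_releveled <- lm(score ~ department_c_ref, data = toy)
print(coef(fit_releveled))
```

A large negative department coefficient therefore means "lower than the reference department", not "low in absolute terms", and its sign can flip if a different baseline is chosen.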
# ============================================================
# STEP 13 — Regression Metrics Bar Chart
# ============================================================
metrics_df <- data.frame(
Metric = c(
"RMSE",
"MAE",
"R²"
),
Value = c(
rmse_val,
mae_val,
r2_val
)
)
p5 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
geom_bar(stat = "identity", show.legend = FALSE) +
geom_text(aes(label = Value), vjust = -0.4, fontface = "bold") +
labs(
title = "Regression Performance Metrics",
y = "Value"
) +
theme_minimal(base_size = 8)
print(p5)
The regression performance metrics indicate that the linear
regression model performs well in predicting employees’ average training
scores. The R² value of 0.888, computed as the squared
correlation between actual and predicted test-set scores, shows that
approximately 88.8% of the variation in training scores was
explained by the predictor variables, indicating a strong model
fit. The MAE value of 2.74 means that the predicted scores
differed from the actual scores by about 2.7 points on average.
The RMSE value of 4.51 is noticeably higher than the MAE.
Because RMSE penalises large errors more heavily, this gap points to
a small number of large prediction errors, consistent with the maximum
training residual of about 49 points in the model summary. Overall, the
combination of high R² and low MAE and RMSE
values shows that the model captured the relationships between
employee characteristics and training performance well, with the large
department coefficients suggesting that department accounts for much of
this fit.
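To make the relationship between the three metrics concrete, the sketch below computes them by hand in base R on hypothetical values containing one large miss. The helper names are illustrative, and `r2_manual` uses the 1 - SSE/SST definition, which can differ from the squared-correlation version used in the report's code.

```r
# Hand-rolled versions of the three metrics, using base R only
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))

# R² as 1 - SSE/SST (an assumption here; the report uses squared correlation)
r2_manual <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical values: four small errors and one large miss
actual    <- c(60, 70, 80, 90, 50)
predicted <- c(61, 69, 81, 89, 60)

cat(sprintf("RMSE: %.3f\n", rmse_manual(actual, predicted)))
cat(sprintf("MAE : %.3f\n", mae_manual(actual, predicted)))
cat(sprintf("R2  : %.3f\n", r2_manual(actual, predicted)))
```

RMSE can never be smaller than MAE, and a single 10-point error among 1-point errors is enough to push RMSE well above MAE, the same pattern seen in the reported 4.51 versus 2.74.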
This study demonstrates that KPI achievement is driven by a combination of individual performance indicators (especially previous_year_rating and awards_won), training quality (avg_training_score), and categorical factors such as department and recruitment channel. Random Forest identified high performers more effectively than Logistic Regression, with higher sensitivity and F1-score, and can capture nonlinear relationships, while Logistic Regression retained slightly higher accuracy and AUC. Future work should incorporate additional variables such as salary, absenteeism, and engagement scores; conduct longitudinal analysis to track performance trajectories over time; explore more advanced ML models such as XGBoost while preserving interpretability; and develop a production-ready HR decision-support dashboard that translates these insights into actionable workforce planning tools for proactive employee development and targeted training investments.