Amid the era of globalization and relentless competitive pressures, sustaining strong employee performance has become a central priority and one of the biggest challenges for Human Resource (HR) departments. Organizations can no longer rely solely on intuition; instead, they are increasingly turning to data‑driven analytics to evaluate workforce outcomes, optimize talent management, and reduce turnover. Understanding the specific, underlying factors that influence productivity is essential for building sustainable, long‑term employee success.
This project uses R to explore a comprehensive HR dataset. By applying statistical methods and machine learning techniques, it aims to uncover the key drivers of high performance and Key Performance Indicator (KPI) achievement, translating raw HR data into actionable organizational insights.
The primary objective of this project is to conduct a comprehensive data analysis in R to identify the factors, such as employee demographics, training effectiveness, length of service, and prior performance ratings, that significantly influence whether an employee meets more than 80% of their Key Performance Indicators (KPIs). By applying statistical techniques, data visualization, and predictive modeling, this study aims to generate actionable insights that can guide HR professionals in enhancing employee performance strategies, supporting talent development, and strengthening organizational decision‑making.
Specifically, the project seeks to:
- Identify key factors affecting performance through
statistical analysis and machine learning, focusing on variables such as
training, work experience, education level, and departmental
affiliation.
- Compare predictive models to determine the most
effective approach for forecasting KPI achievement above 80%, using
evaluation metrics such as confusion matrices, accuracy, sensitivity,
and specificity.
- Provide recommendations and actionable insights to HR
departments and stakeholders, supporting evidence‑based decisions in
talent management, training programs, and employee engagement
initiatives.
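The evaluation metrics named above can all be derived from a confusion matrix. The following is a minimal sketch on hypothetical labels (not drawn from the dataset), treating 1 (KPI > 80% met) as the positive class; the vector names are illustrative only.

```r
# Hypothetical labels: 1 = met more than 80% of KPIs, 0 = did not
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are predicted classes, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)

tp <- as.numeric(cm["1", "1"])  # predicted 1, actually 1
tn <- as.numeric(cm["0", "0"])  # predicted 0, actually 0
fp <- as.numeric(cm["1", "0"])  # predicted 1, actually 0
fn <- as.numeric(cm["0", "1"])  # predicted 0, actually 1

accuracy    <- (tp + tn) / sum(cm)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)       # share of actual positives that were found
specificity <- tn / (tn + fp)       # share of actual negatives that were found

print(cm)
cat("Accuracy:", accuracy, " Sensitivity:", sensitivity,
    " Specificity:", specificity, "\n")
```

Sensitivity and specificity are reported alongside accuracy because accuracy alone can look good when one class dominates.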
The dataset titled Employees Performance for HR Analytics was uploaded to Kaggle by Sanjana Chaudhari in 2023 and serves as the foundation for this analysis. It contains 17,417 employee records across 13 variables, stored in CSV format. The dataset captures a balanced mix of categorical and numerical variables, making it suitable for exploratory data analysis (EDA), correlation studies, and predictive modeling in HR analytics.
The variables included are as follows:
- employee_id: Unique identifier for each employee;
serves as the primary key for tracking records without revealing
personal information.
- department: Employee’s department (e.g., Sales &
Marketing, Technology); useful for performance segmentation and
departmental comparisons.
- region: Geographic region of employment.
- education: Highest education level attained (e.g.,
Bachelor’s, Master’s and above).
- gender: Employee gender (m = male, f = female).
- recruitment_channel: Hiring source (e.g., Referred,
Sourcing).
- no_of_trainings: Number of trainings attended.
- age: Employee age.
- previous_year_rating: Performance rating from the
prior year (1–5 scale).
- length_of_service: Number of years served in the
organization.
- KPIs_met_more_than_80: Binary indicator of whether
more than 80% of KPIs were achieved (0 = No, 1 = Yes); this serves as the
target variable.
- awards_won: Indicator of whether the employee won
awards (0 = No, 1 = Yes).
- avg_training_score: Average score from trainings,
reflecting training quality.
By analyzing these variables, the study aims to uncover meaningful patterns that can guide HR strategies, improve productivity, and strengthen workforce management.
Data cleaning is a critical step in preparing the dataset for analysis. It involves handling missing values, correcting inconsistencies, removing duplicates, and ensuring that variables are properly formatted for statistical modeling. Clean data provides a reliable foundation for exploratory analysis and predictive modeling, reducing bias and improving the accuracy of insights.
The following packages were used in the data cleaning process:
- dplyr: filter, mutate, select, distinct, summarise, case_when, across. Used for data manipulation and transformation.
- tidyr: replace_na. Used for handling missing values and tidying data.
- stringr: str_trim, str_to_lower. Used for text cleaning and string processing.
- writexl: write_xlsx. Used for exporting the cleaned dataset to Excel format.
This step involves loading the raw dataset into R for inspection. The structure and summary of the data are examined to understand variable types.
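library(dplyr)    # %>%, filter, mutate, select, distinct, across
library(tidyr)    # replace_na
library(stringr)  # str_trim, str_to_lower
employee_performance <- read.csv("Uncleaned_employees_final_dataset.csv")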
str(employee_performance)
## 'data.frame': 17417 obs. of 13 variables:
## $ employee_id : int 8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
## $ department : chr "Technology" "HR" "Sales & Marketing" "Procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int NA 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
summary(employee_performance)
## employee_id department region education
## Min. : 3 Length :17417 Length :17417 Length :17417
## 1st Qu.:19281 N.unique : 9 N.unique : 34 N.unique : 4
## Median :39122 N.blank : 0 N.blank : 0 N.blank : 771
## Mean :39083 Min.nchar: 2 Min.nchar: 8 Min.nchar: 0
## 3rd Qu.:58838 Max.nchar: 17 Max.nchar: 9 Max.nchar: 15
## Max. :78295
##
## gender recruitment_channel no_of_trainings age
## Length :17417 Length :17417 Min. :1.000 Min. :20.00
## N.unique : 2 N.unique : 3 1st Qu.:1.000 1st Qu.:29.00
## N.blank : 0 N.blank : 0 Median :1.000 Median :33.00
## Min.nchar: 1 Min.nchar: 5 Mean :1.251 Mean :34.81
## Max.nchar: 1 Max.nchar: 8 3rd Qu.:1.000 3rd Qu.:39.00
## Max. :9.000 Max. :60.00
##
## previous_year_rating length_of_service KPIs_met_more_than_80 awards_won
## Min. :1.000 Min. : 1.000 Min. :0.0000 Min. :0.00000
## 1st Qu.:3.000 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :3.000 Median : 5.000 Median :0.0000 Median :0.00000
## Mean :3.345 Mean : 5.802 Mean :0.3588 Mean :0.02337
## 3rd Qu.:4.000 3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :5.000 Max. :34.000 Max. :1.0000 Max. :1.00000
## NAs :1363
## avg_training_score
## Min. :39.00
## 1st Qu.:51.00
## Median :60.00
## Mean :63.18
## 3rd Qu.:75.00
## Max. :99.00
##
Duplicate records and unnecessary columns are removed to ensure data integrity. Unique values are checked to identify inconsistencies in categorical variables.
unique(employee_performance$department)
## [1] "Technology" "HR" "Sales & Marketing"
## [4] "Procurement" "Finance" "Analytics"
## [7] "Operations" "Legal" "R&D"
unique(employee_performance$education)
## [1] "Bachelors" "Masters & above" "" "Below Secondary"
unique(employee_performance$gender)
## [1] "m" "f"
unique(employee_performance$recruitment_channel)
## [1] "sourcing" "other" "referred"
unique(employee_performance$region)
## [1] "region_26" "region_4" "region_13" "region_2" "region_29" "region_7"
## [7] "region_22" "region_16" "region_17" "region_24" "region_11" "region_27"
## [13] "region_9" "region_20" "region_34" "region_23" "region_8" "region_14"
## [19] "region_31" "region_19" "region_5" "region_28" "region_15" "region_3"
## [25] "region_25" "region_12" "region_21" "region_30" "region_10" "region_33"
## [31] "region_32" "region_6" "region_1" "region_18"
str(employee_performance)
## 'data.frame': 17417 obs. of 13 variables:
## $ employee_id : int 8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
## $ department : chr "Technology" "HR" "Sales & Marketing" "Procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int NA 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
# Show duplicated employee_id
employee_performance %>%
filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
arrange(employee_id)
# Remove exact duplicate rows only
employee_performance <- employee_performance %>%
distinct()
# Check duplicated employee_id again
employee_performance %>%
filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
arrange(employee_id)
#remove unnecessary column
employee_performance <- employee_performance %>%
select(-employee_id)
Text fields are standardized by trimming spaces and converting to lowercase. Missing values are handled using median imputation and categorical replacement to maintain data completeness.
#clean text column
employee_performance <- employee_performance %>%
mutate(
gender = str_to_lower(str_trim(gender)),
department = str_trim(department),
education = str_trim(education),
recruitment_channel = str_trim(recruitment_channel)
)
str(employee_performance)
## 'data.frame': 17415 obs. of 12 variables:
## $ department : chr "Technology" "HR" "Sales & Marketing" "Procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int NA 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
colSums(is.na(employee_performance))
## department region education
## 0 0 0
## gender recruitment_channel no_of_trainings
## 0 0 0
## age previous_year_rating length_of_service
## 0 1363 0
## KPIs_met_more_than_80 awards_won avg_training_score
## 0 0 0
employee_performance %>%
summarise(across(everything(), ~ sum(is.na(.) | trimws(as.character(.)) == "")))
#handling missing values
employee_performance <- employee_performance %>%
mutate(
previous_year_rating = ifelse(
is.na(previous_year_rating),
median(previous_year_rating, na.rm = TRUE),
previous_year_rating
),
education = ifelse(
is.na(education) | str_trim(education) == "",
"Unknown",
education
)
)
#if missing exists
employee_performance <- employee_performance %>%
mutate(education = replace_na(education, "Unknown"))
#clean any text column
clean_text <- function(x) {
x %>%
str_trim() %>%
str_to_lower()
}
employee_performance$department <- clean_text(employee_performance$department)
employee_performance <- employee_performance %>%
mutate(across(
c(department, education, recruitment_channel, region),
clean_text
))
New variables are created to enhance analytical insights. Age groups are categorized, and categorical variables are converted to factors for modeling compatibility.
#create age group
employee_performance <- employee_performance %>%
mutate(age_group = case_when(
age < 30 ~ "Young",
age >= 30 & age < 40 ~ "Mid",
TRUE ~ "Senior"
))
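As an alternative sketch, the same binning can be written with base R's cut(). The ages below are hypothetical; right = FALSE closes each interval on the left so the boundaries match the case_when() rules above (under 30, 30 to 39, 40 and over).

```r
# Hypothetical ages chosen to sit on and around the group boundaries
ages <- c(24, 29, 30, 39, 40, 51)

# Intervals: [-Inf, 30) = Young, [30, 40) = Mid, [40, Inf) = Senior
age_group <- cut(
  ages,
  breaks = c(-Inf, 30, 40, Inf),
  labels = c("Young", "Mid", "Senior"),
  right  = FALSE
)

print(data.frame(age = ages, age_group = age_group))
print(levels(age_group))
```

One design difference: cut() keeps the levels in the order given (Young, Mid, Senior), whereas as.factor() on a character column orders them alphabetically (Mid, Senior, Young).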
str(employee_performance)
## 'data.frame': 17415 obs. of 13 variables:
## $ department : chr "technology" "hr" "sales & marketing" "procurement" ...
## $ region : chr "region_26" "region_4" "region_13" "region_2" ...
## $ education : chr "bachelors" "bachelors" "bachelors" "bachelors" ...
## $ gender : chr "m" "f" "m" "f" ...
## $ recruitment_channel : chr "sourcing" "other" "other" "other" ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : num 3 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80: int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
## $ age_group : chr "Young" "Mid" "Mid" "Mid" ...
#convert to factors
employee_performance <- employee_performance %>%
mutate(
department = as.factor(department),
gender = as.factor(gender),
education = as.factor(education),
recruitment_channel = as.factor(recruitment_channel),
region = as.factor(region),
age_group = as.factor(age_group)
)
#normalize score (as.numeric() drops the one-column matrix that scale() returns)
employee_performance <- employee_performance %>%
  mutate(avg_training_score_scaled = as.numeric(scale(avg_training_score)))
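For reference, scale() with its default arguments computes a z-score: each value minus the column mean, divided by the sample standard deviation. A small sketch using only the first five avg_training_score values shown in the str() output above (so the numbers differ from the full-data column, whose mean and standard deviation come from all rows):

```r
# First five avg_training_score values from the str() output above
scores <- c(77, 51, 47, 65, 61)

# Manual z-score: centre on the mean, divide by the sample standard deviation
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() does the same; as.numeric() turns its one-column matrix into a vector
z_scale <- as.numeric(scale(scores))

print(round(z_manual, 3))
print(all.equal(z_manual, z_scale))
```

After standardization the column has mean 0 and standard deviation 1, which puts it on a comparable footing with other scaled predictors.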
After cleaning the dataset, the final step is to export the processed data for later analysis and reporting. CSV format is used so that the file can be read directly in the EDA step.
write.csv(employee_performance, "clean_employee_performance.csv", row.names = FALSE)
Before the modelling stage, Exploratory Data Analysis (EDA) was conducted to examine employee performance. The steps performed in the EDA are described below.
The required libraries and the cleaned dataset, df_clean, were loaded and inspected to understand the structure of the data before further exploration.
# Install packages (run only once if needed):
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("corrplot")
# install.packages("plotly")
# install.packages("reshape2")
# install.packages("kableExtra")
# Load required libraries:
library(dplyr) # Data manipulation and group_by() function
library(ggplot2) # Data visualization
library(tidyverse) # Collection of data science packages
library(knitr) # R Markdown table formatting
library(corrplot) # Correlation matrix visualization
library(plotly) # Interactive plots
library(reshape2) # Data reshaping
library(kableExtra) # Enhanced table styling
# Load cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv")
#convert to factors
df_clean <- df_clean %>%
mutate(
department = as.factor(department),
gender = as.factor(gender),
education = as.factor(education),
recruitment_channel = as.factor(recruitment_channel),
region = as.factor(region),
age_group = as.factor(age_group),
awards_won = as.factor(awards_won) # added conversion to factor for better analysis
)
The dataset structure confirms the variables are correctly formatted with appropriate data types.
str() shows that the dataset is a mix of numeric and categorical variables. glimpse() and dim() show that it contains 17,415 observations and 14 variables.
# Data structure overview inspection
head(df_clean)
str(df_clean)
## 'data.frame': 17415 obs. of 14 variables:
## $ department : Factor w/ 9 levels "analytics","finance",..: 9 3 8 6 2 6 2 1 9 9 ...
## $ region : Factor w/ 34 levels "region_1","region_10",..: 19 29 5 12 22 32 12 15 32 15 ...
## $ education : Factor w/ 4 levels "bachelors","below secondary",..: 1 1 1 1 1 1 1 1 3 1 ...
## $ gender : Factor w/ 2 levels "f","m": 2 1 2 1 2 2 2 2 2 2 ...
## $ recruitment_channel : Factor w/ 3 levels "other","referred",..: 3 1 1 1 3 3 1 3 1 3 ...
## $ no_of_trainings : int 1 1 1 3 1 1 1 2 1 1 ...
## $ age : int 24 31 31 31 30 36 33 36 51 29 ...
## $ previous_year_rating : int 3 3 1 2 4 3 5 3 4 5 ...
## $ length_of_service : int 1 5 4 9 7 2 3 3 11 2 ...
## $ KPIs_met_more_than_80 : int 1 0 0 0 0 0 1 0 0 1 ...
## $ awards_won : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ avg_training_score : int 77 51 47 65 61 68 57 85 75 76 ...
## $ age_group : Factor w/ 3 levels "Mid","Senior",..: 3 1 1 1 1 1 1 1 2 3 ...
## $ avg_training_score_scaled: num 1.03 -0.908 -1.206 0.136 -0.162 ...
glimpse(df_clean)
## Rows: 17,415
## Columns: 14
## $ department <fct> technology, hr, sales & marketing, procureme…
## $ region <fct> region_26, region_4, region_13, region_2, re…
## $ education <fct> bachelors, bachelors, bachelors, bachelors, …
## $ gender <fct> m, f, m, f, m, m, m, m, m, m, m, m, f, m, m,…
## $ recruitment_channel <fct> sourcing, other, other, other, sourcing, sou…
## $ no_of_trainings <int> 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ age <int> 24, 31, 31, 31, 30, 36, 33, 36, 51, 29, 40, …
## $ previous_year_rating <int> 3, 3, 1, 2, 4, 3, 5, 3, 4, 5, 5, 3, 3, 3, 5,…
## $ length_of_service <int> 1, 5, 4, 9, 7, 2, 3, 3, 11, 2, 12, 10, 4, 10…
## $ KPIs_met_more_than_80 <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ awards_won <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg_training_score <int> 77, 51, 47, 65, 61, 68, 57, 85, 75, 76, 50, …
## $ age_group <fct> Young, Mid, Mid, Mid, Mid, Mid, Mid, Mid, Se…
## $ avg_training_score_scaled <dbl> 1.03010551, -0.90754471, -1.20564475, 0.1358…
# Dataset dimensions
dim(df_clean)
## [1] 17415 14
The summary provides an overview of the central tendency and distribution of each variable.
The cleaning step standardized avg_training_score and stored the result in a new column, avg_training_score_scaled, which eases future analysis.
# Summary statistics
df_clean %>%
select(age, previous_year_rating, KPIs_met_more_than_80,
length_of_service, no_of_trainings, avg_training_score,
avg_training_score_scaled) %>%
summary()
## age previous_year_rating KPIs_met_more_than_80 length_of_service
## Min. :20.00 Min. :1.000 Min. :0.0000 Min. : 1.000
## 1st Qu.:29.00 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.: 3.000
## Median :33.00 Median :3.000 Median :0.0000 Median : 5.000
## Mean :34.81 Mean :3.319 Mean :0.3589 Mean : 5.801
## 3rd Qu.:39.00 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.: 7.000
## Max. :60.00 Max. :5.000 Max. :1.0000 Max. :34.000
## no_of_trainings avg_training_score avg_training_score_scaled
## Min. :1.000 Min. :39.00 Min. :-1.8018
## 1st Qu.:1.000 1st Qu.:51.00 1st Qu.:-0.9075
## Median :1.000 Median :60.00 Median :-0.2368
## Mean :1.251 Mean :63.18 Mean : 0.0000
## 3rd Qu.:1.000 3rd Qu.:75.00 3rd Qu.: 0.8811
## Max. :9.000 Max. :99.00 Max. : 2.6697
A final quality check was conducted to confirm whether any missing or duplicated values remain.
# Check missing values
colSums(is.na(df_clean))
## department region education
## 0 0 0
## gender recruitment_channel no_of_trainings
## 0 0 0
## age previous_year_rating length_of_service
## 0 0 0
## KPIs_met_more_than_80 awards_won avg_training_score
## 0 0 0
## age_group avg_training_score_scaled
## 0 0
# Check missing or empty values
df_clean %>%
summarise(across(everything(), ~ sum(is.na(.) | . == "")))
# Check any duplicates
sum(duplicated(df_clean))
## [1] 16
janitor::get_dupes(df_clean)
## No variable names specified - using all columns.
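A likely reason duplicates reappear here, even though distinct() was applied during cleaning, is that employee_id was dropped afterwards: rows that differed only by their identifier become indistinguishable. The toy data frame below is hypothetical and only illustrates that mechanism.

```r
# Hypothetical records: two different employees with identical attributes
toy <- data.frame(
  employee_id = c(101, 102, 103),
  department  = c("hr", "hr", "finance"),
  age         = c(30, 30, 41)
)

# With the identifier present, every row is unique
dupes_with_id <- sum(duplicated(toy))

# Dropping the identifier makes rows 1 and 2 identical
toy_no_id <- subset(toy, select = -employee_id)
dupes_without_id <- sum(duplicated(toy_no_id))

cat("Duplicates with id:", dupes_with_id,
    " without id:", dupes_without_id, "\n")
```

If that is what happened, these rows represent distinct employees and can reasonably be kept, although whether to keep them is a judgment call for the analyst.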
# Check whether missing values in the education field have been replaced with "unknown"
unique(df_clean$education)
## [1] bachelors masters & above unknown below secondary
## Levels: bachelors below secondary masters & above unknown
Outlier detection focuses on age, length_of_service, and avg_training_score because these variables have meaningful numerical ranges, and extreme values may reveal unusual or hidden characteristics of employees.
# Select variables suitable for outlier detection
outlier_vars <- df_clean %>%
select(age, length_of_service, avg_training_score)
# Convert selected variables into long format
outlier_data <- df_clean %>%
select(age, length_of_service, avg_training_score) %>%
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Value")
# Boxplot visualization
ggplot(outlier_data,
aes(x = Variable,
y = Value,
fill = Variable)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = "Outlier Detection for Continuous Variables",
x = "Variables",
y = "Values") +
theme(legend.position = "none")
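The points a boxplot draws individually are those beyond the whiskers, which by default reach at most 1.5 × IQR from the quartiles. A minimal sketch of that rule on hypothetical tenure values (not taken from the dataset):

```r
# Hypothetical length-of-service values with one extreme case
x <- c(1, 2, 3, 3, 4, 5, 5, 6, 7, 34)

# Quartiles and interquartile range
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Flagged values:", outliers, "\n")
```

A flagged value is not necessarily an error; a long-serving employee is a valid record that simply sits far from the bulk of the distribution.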
age and length_of_service contain several outliers beyond the upper fence of the boxplot. These require further investigation to determine whether they are valid extreme values or anomalies in the dataset.
# Check class imbalance
target_dist <- df_clean %>%
count(KPIs_met_more_than_80) %>%
mutate(
percentage = round(n / sum(n) * 100, 1),
KPI_status = ifelse(KPIs_met_more_than_80 == 1,
"Met KPI >80%",
"Met KPI ≤80%")
)
# KPI distribution plot
ggplot(target_dist,
aes(x = KPI_status,
y = n,
fill = KPI_status)) +
geom_bar(stat = "identity",
width = 0.6,
alpha = 0.9) +
# Percentage + count labels
geom_text(aes(label = paste0(percentage,
"%\n(n = ",
scales::comma(n), ")")),
vjust = -0.35,
size = 4.3,
fontface = "bold") +
scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
expand_limits(y = max(target_dist$n) * 1.15) +
labs(title = "Distribution of KPI Achievement",
subtitle = "Class balance analysis of KPI performance",
x = NULL,
y = "Number of Employees") +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(face = "bold")
)
# Print imbalance ratio
imbalance_ratio <- max(target_dist$percentage) /
min(target_dist$percentage)
cat("Imbalance ratio (majority/minority):",
round(imbalance_ratio, 2), "\n")
## Imbalance ratio (majority/minority): 1.79
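Since accuracy is one of the planned evaluation metrics, it helps to know what a trivial model would score. The sketch below uses the class counts reported in the Training Score Summary table of this report (11,165 and 6,250); the variable names are illustrative.

```r
# Class counts as reported in the Training Score Summary table
counts <- c(did_not_meet = 11165, met = 6250)

# Class proportions
props <- counts / sum(counts)

# Majority-to-minority ratio computed from the raw counts
ratio <- max(counts) / min(counts)

# Accuracy of a model that always predicts the majority class
baseline_accuracy <- unname(max(props))

print(round(props * 100, 1))
cat("Ratio:", round(ratio, 2),
    " Majority-class baseline accuracy:", round(baseline_accuracy, 3), "\n")
```

Any predictive model should therefore exceed roughly 64% accuracy before it can be said to add value, which is another reason to report sensitivity and specificity as well.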
KPI achievement is relatively balanced: the majority class (KPI not met) is about 1.79 times the size of the minority class.
# Select Variables
continuous_long <- df_clean %>%
select(age,
length_of_service,
avg_training_score) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value")
# Plots
ggplot() +
# age
geom_histogram(
data = filter(continuous_long, variable == "age"),
aes(x = value,
y = after_stat(density)),
binwidth = 2,
fill = "#3498DB",
color = "white",
alpha = 0.7
) +
geom_density(
data = filter(continuous_long, variable == "age"),
aes(x = value),
color = "#E74C3C",
linewidth = 1.1,
adjust = 1.2
) +
# length_of_service
geom_histogram(
data = filter(continuous_long,
variable == "length_of_service"),
aes(x = value,
y = after_stat(density)),
binwidth = 1,
fill = "#2ECC71",
color = "white",
alpha = 0.7
) +
geom_density(
data = filter(continuous_long,
variable == "length_of_service"),
aes(x = value),
color = "#C0392B",
linewidth = 1.1,
adjust = 1.5
) +
# avg_training_score
geom_histogram(
data = filter(continuous_long,
variable == "avg_training_score"),
aes(x = value,
y = after_stat(density)),
binwidth = 5,
fill = "#9B59B6",
color = "white",
alpha = 0.7
) +
geom_density(
data = filter(continuous_long,
variable == "avg_training_score"),
aes(x = value),
color = "#E74C3C",
linewidth = 1.1,
adjust = 1.2
) +
facet_wrap(~ variable,
scales = "free",
ncol = 2) +
labs(
title = "Distribution of Continuous Variables",
subtitle = "Histogram with Density Overlay",
x = "Value",
y = "Density"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
strip.text = element_text(face = "bold", size = 12),
axis.title = element_text(face = "bold")
)
- The age distribution is slightly right-skewed, indicating a higher concentration of younger employees and fewer older ones. The density curve peaks around the mid-30s, which suggests that most employees are in the early to mid-career stage.
- The avg_training_score distribution is multimodal, with major clusters visible around 50, 60, and 80 to 85. This suggests possible segmentation in employee performance or departmental training outcomes.
- The length_of_service distribution is heavily right-skewed. Most employees have 1 to 7 years of service and very few exceed 15 years, which suggests a workforce with relatively short tenure.
# Select Variables
discrete_vars <- df_clean %>%
select(previous_year_rating,
no_of_trainings)
discrete_long <- discrete_vars %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "value")
#Plot
ggplot(discrete_long,
aes(x = factor(value))) +
geom_bar(fill = "#2ECC71",
alpha = 0.8) +
facet_wrap(~ variable,
scales = "free",
ncol = 2) +
labs(title = "Distribution of Discrete Variables",
subtitle = "Frequency distribution by category",
x = "Category",
y = "Count") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))
- no_of_trainings is strongly right-skewed: most employees attended only one training session. Counts drop sharply after two sessions, and very few employees attended more than four, which produces the long tail.
- previous_year_rating is slightly left-skewed, with a mode at rating 3. Most employees received ratings of 3, 4, or 5 (average to excellent), which suggests that past performance is generally positive across the workforce. Note that the 1,363 missing ratings were imputed with the median of 3, which inflates the count at that rating.
# Select Variables
categorical_vars <- df_clean %>%
select(department,
education,
recruitment_channel,
awards_won)
categorical_long <- categorical_vars %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "category")
# Plot
ggplot(categorical_long,
aes(x = category,
fill = variable)) +
geom_bar(alpha = 0.85,
show.legend = FALSE) +
facet_wrap(~ variable,
scales = "free",
ncol = 2) +
labs(title = "Distribution of Categorical Variables",
subtitle = "Frequency distribution across employee categories",
x = "",
y = "Number of Employees") +
scale_fill_brewer(palette = "Set2") +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45,
hjust = 1),
strip.text = element_text(face = "bold"),
plot.title = element_text(face = "bold",
hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
# Plot for region (top 10)
region_analysis <- df_clean %>%
count(region) %>%
mutate(percentage = n / sum(n) * 100) %>%
top_n(10, n)
ggplot(region_analysis,
aes(x = reorder(region, n), y = percentage, fill = percentage)) +
geom_bar(stat = "identity", width = 0.8) +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = -0.3, size = 3) +
scale_fill_gradient(low = "#A3E4D7", high = "#1ABC9C") +
labs(title = "Employee Distribution by Region (Top 10)",
x = "Region", y = "Percentage (%)") +
theme_minimal() +
theme(axis.text.x = element_text(size = 7))
- The bar charts for department, education, and recruitment_channel show that employees are unevenly distributed across departments, educational backgrounds, and recruitment channels.
- The awards_won distribution is extremely imbalanced: nearly all employees have no awards (the mean of 0.023 in the raw summary corresponds to about 2.3% of employees winning one), so award winners are almost invisible on the chart. With so little variation, the variable may have limited predictive power on its own. However, further bivariate analysis is needed to determine whether award winners perform better on KPIs.
# Summary Statistics by KPI
training_score_summary <- df_clean %>%
group_by(KPIs_met_more_than_80) %>%
summarise(
count = n(),
mean_score = mean(avg_training_score, na.rm = TRUE),
median_score = median(avg_training_score, na.rm = TRUE),
sd_score = sd(avg_training_score, na.rm = TRUE),
min_score = min(avg_training_score, na.rm = TRUE),
max_score = max(avg_training_score, na.rm = TRUE),
q25 = quantile(avg_training_score, 0.25, na.rm = TRUE),
q75 = quantile(avg_training_score, 0.75, na.rm = TRUE)
) %>%
mutate(KPI_Status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet KPI"))
kable(training_score_summary %>%
select(-KPIs_met_more_than_80) %>%
mutate(across(where(is.numeric), ~round(., 2))),
caption = "Training Score Summary by KPI Achievement Status")
| count | mean_score | median_score | sd_score | min_score | max_score | q25 | q75 | KPI_Status |
|---|---|---|---|---|---|---|---|---|
| 11165 | 62.46 | 59 | 13.35 | 39 | 99 | 50 | 74 | Did Not Meet KPI |
| 6250 | 64.47 | 61 | 13.45 | 41 | 99 | 53 | 77 | Met KPI >80% |
# Boxplot Comparison
ggplot(df_clean, aes(x = factor(KPIs_met_more_than_80), y = avg_training_score,
fill = factor(KPIs_met_more_than_80))) +
geom_boxplot(alpha = 0.7, outlier.size = 0.8) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71"),
labels = c("Did Not Meet KPI", "Met KPI")) +
labs(
title = "Training Score Distribution Comparison",
subtitle = "High performers show higher median training scores",
x = "KPI Achievement Status",
y = "Average Training Score",
fill = "KPI Status"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "none"
)
# Previous year rating vs KPI
rating_analysis <- df_clean %>%
group_by(previous_year_rating, KPIs_met_more_than_80) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(previous_year_rating) %>%
mutate(percentage = count / sum(count) * 100,
KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI", "Did Not Meet"))
# Plot
ggplot(rating_analysis, aes(x = factor(previous_year_rating), y = percentage,
fill = KPI_status)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), size = 3) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
labs(title = "KPI Achievement by Previous Year Rating",
x = "Previous Year Rating", y = "Percentage (%)",
fill = "KPI Status")
# Plot
ggplot(df_clean,
aes(x = factor(previous_year_rating),
y = avg_training_score,
fill = factor(previous_year_rating))) +
geom_boxplot(alpha = 0.8,
outlier.color = "#E74C3C") +
labs(title = "Average Training Score by Previous Year Rating",
x = "Previous Year Rating",
y = "Average Training Score") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
# Function to create bar plots for categorical variables
plot_categorical_kpi <- function(data, var_name) {
data %>%
group_by(!!sym(var_name), KPIs_met_more_than_80) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(!!sym(var_name)) %>%
mutate(percentage = count / sum(count) * 100,
KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet")) %>%
ggplot(aes(x = reorder(!!sym(var_name), -percentage * (KPIs_met_more_than_80 == 1)),
y = percentage, fill = KPI_status)) +
geom_bar(stat = "identity", position = "stack", width = 0.7) +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), size = 3) +
scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
labs(title = paste("KPI Achievement by", var_name),
x = var_name, y = "Percentage (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, face = "bold"))
}
# Plot for each categorical variable
cat_vars <- c("department", "education", "recruitment_channel", "gender", "awards_won")
for (var in cat_vars) {
print(plot_categorical_kpi(df_clean, var))
}
# H0: There is no significant relationship between the categorical variable and KPI achievement.
# H1: There is a significant relationship between the categorical variable and KPI achievement.
# Decision rule: reject H0 if p-value < 0.05
# Chi-square tests
cat_chi_results <- map_df(cat_vars, function(var) {
tbl <- table(df_clean[[var]], df_clean$KPIs_met_more_than_80)
test <- chisq.test(tbl)
# Store numeric p-value for correct sorting
p_val <- test$p.value
# Format p-value for display only (always character, so rows bind with a consistent type)
p_formatted <- ifelse(
p_val < 0.0001,
"< 0.0001",
as.character(round(p_val, 4))
)
data.frame(
Variable = var,
Chi_Square = round(as.numeric(test$statistic), 2),
P_Value = p_formatted,
P_Value_Numeric = p_val,
Significant = ifelse(p_val < 0.05, "Yes", "No")
)
})
# Display results (with sorting)
kable(
cat_chi_results %>%
arrange(P_Value_Numeric) %>%
select(-P_Value_Numeric),
caption = "Chi-square Tests: Categorical Variables vs KPI Achievement"
)
| Variable | Chi_Square | P_Value | Significant |
|---|---|---|---|
| department | 292.33 | < 0.0001 | Yes |
| awards_won | 191.77 | < 0.0001 | Yes |
| recruitment_channel | 42.91 | < 0.0001 | Yes |
| education | 40.25 | < 0.0001 | Yes |
| gender | 28.61 | < 0.0001 | Yes |
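To make the test statistic in the table above concrete, the sketch below recomputes a chi-square statistic by hand from expected counts and compares it with `chisq.test()`. The 2x2 counts are hypothetical and are not taken from the employee dataset. `correct = FALSE` switches off the Yates continuity correction that `chisq.test()` applies to 2x2 tables by default, so that the manual and built-in values agree.

```r
# Hypothetical counts (illustration only, not from the employee dataset)
tbl <- matrix(c(30, 70,
                50, 50),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              kpi   = c("Met", "Not met")))

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq_manual <- sum((tbl - expected)^2 / expected)

# Degrees of freedom and p-value from the chi-square distribution
dof <- (nrow(tbl) - 1) * (ncol(tbl) - 1)
p_manual <- pchisq(chi_sq_manual, df = dof, lower.tail = FALSE)

# Built-in test without continuity correction, for comparison
test <- chisq.test(tbl, correct = FALSE)

print(c(manual = chi_sq_manual, builtin = unname(test$statistic)))
print(c(manual = p_manual, builtin = test$p.value))
```

With a large sample, even small differences between groups can produce very small p-values, so the size of the statistic is best read alongside the percentage plots rather than on its own.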
# Summary function
get_summary <- function(data, group_var) {
data %>%
group_by(.data[[group_var]]) %>%
summarise(
total_trainings = sum(no_of_trainings, na.rm = TRUE),
avg_train_score = mean(avg_training_score, na.rm = TRUE),
kpi = sum(KPIs_met_more_than_80 == 1, na.rm = TRUE),
avg_tenure = mean(length_of_service, na.rm = TRUE),
avg_rating = mean(previous_year_rating, na.rm = TRUE),
avg_age = mean(age, na.rm = TRUE),
.groups = "drop"
) %>%
rename(category = 1)
}
# Focus only: Gender + Department
groups <- c("gender", "department")
summary_list <- lapply(groups, function(g) {
get_summary(df_clean, g) %>%
mutate(group = g)
})
combined_perf <- bind_rows(summary_list)
# Split metrics (NO scale mixing)
# Workforce metrics
workforce <- combined_perf %>%
pivot_longer(
cols = c(total_trainings, kpi),
names_to = "metric",
values_to = "value"
)
# Performance metrics
performance <- combined_perf %>%
pivot_longer(
cols = c(avg_train_score, avg_tenure, avg_rating, avg_age),
names_to = "metric",
values_to = "value"
)
# =========================
# Plot 1: Workforce (Gender + Department)
# =========================
p1 <- ggplot(workforce,
aes(x = category,
y = value,
fill = metric)) +
geom_bar(stat = "identity",
position = "dodge",
alpha = 0.9) +
facet_wrap(~ group, scales = "free_x") +
labs(title = "Workforce Overview by Gender and Department",
x = "",
y = "Count",
fill = "Metric") +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.title = element_text(size = 9),
plot.title = element_text(face = "bold", hjust = 0.5)
)
# =========================
# Plot 2: Performance (Gender + Department)
# =========================
p2 <- ggplot(performance,
aes(x = category,
y = value,
fill = metric)) +
geom_bar(stat = "identity",
position = "dodge",
alpha = 0.9) +
facet_wrap(~ group, scales = "free_x") +
labs(title = "Performance Metrics by Gender and Department",
x = "",
y = "Average Value",
fill = "Metric") +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
legend.title = element_text(size = 9),
plot.title = element_text(face = "bold", hjust = 0.5)
)
# Output
ggplotly(p1) %>%
layout(
legend = list(
orientation = "v",
x = 1,
y = 0,
font = list(size = 9),
itemwidth = 30
),
margin = list(b = 120)
)
ggplotly(p2) %>%
layout(
legend = list(
orientation = "v",
x = 1,
y = 0,
font = list(size = 9),
itemwidth = 30
),
margin = list(b = 120)
)
# ==========================================
# Multivariate Summary Table
# For Gender + Department Analysis
# ==========================================
# Create formatted summary table
summary_table <- combined_perf %>%
mutate(
avg_train_score = round(avg_train_score, 2),
avg_tenure = round(avg_tenure, 2),
avg_rating = round(avg_rating, 2),
avg_age = round(avg_age, 2)
) %>%
arrange(group, desc(avg_train_score)) %>%
rename(
Category = category,
Group = group,
`Total Trainings` = total_trainings,
`Avg Training Score` = avg_train_score,
`KPI Achieved` = kpi,
`Avg Tenure` = avg_tenure,
`Avg Rating` = avg_rating,
`Avg Age` = avg_age
)
# Display table
kable(
summary_table,
caption = "Multivariate Performance Summary by Gender and Department",
align = "c"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center"
) %>%
row_spec(
0,
bold = TRUE,
color = "white",
background = "#2C3E50"
) %>%
column_spec(1, bold = TRUE) %>%
collapse_rows(
columns = 8,
valign = "top"
)
| Category | Total Trainings | Avg Training Score | KPI Achieved | Avg Tenure | Avg Rating | Avg Age | Group |
|---|---|---|---|---|---|---|---|
| analytics | 2281 | 84.57 | 679 | 5.00 | 3.47 | 32.41 | department |
| r&d | 438 | 84.45 | 149 | 4.80 | 3.66 | 32.89 | |
| technology | 2740 | 79.85 | 783 | 5.84 | 3.14 | 35.03 | |
| procurement | 2993 | 70.18 | 836 | 6.19 | 3.23 | 36.17 | |
| operations | 4121 | 60.35 | 1553 | 6.43 | 3.63 | 36.15 | |
| finance | 1059 | 60.33 | 319 | 5.01 | 3.49 | 32.60 | |
| legal | 355 | 59.53 | 118 | 4.50 | 3.38 | 33.75 | |
| hr | 892 | 50.39 | 300 | 5.63 | 3.51 | 34.25 | |
| sales & marketing | 6903 | 50.06 | 1513 | 5.75 | 3.10 | 34.63 | |
| f | 5992 | 63.68 | 1986 | 5.86 | 3.37 | 35.04 | gender |
| m | 15790 | 62.97 | 4264 | 5.78 | 3.30 | 34.71 | |
# Select numeric variables
num_data <- df_clean %>%
select(
no_of_trainings,
age,
previous_year_rating,
length_of_service,
avg_training_score,
KPIs_met_more_than_80
)
# Correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")
cor_melt <- melt(cor_matrix)
# Correlation heatmap
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(
low = "#E74C3C", # -1 strong negative
mid = "white", # 0 no correlation
high = "#2ECC71", # +1 strong positive
midpoint = 0,
limits = c(-1, 1),
name = "Correlation"
) +
geom_text(aes(label = round(value, 2)), size = 3) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")
) +
ggtitle("Correlation Matrix of Employee Performance Variables")
# Select numeric features
num_features <- df_clean %>%
select(age,
length_of_service,
avg_training_score,
no_of_trainings,
previous_year_rating)
# Compute correlation with KPI
cor_results <- sapply(num_features, function(x) {
cor(x, df_clean$KPIs_met_more_than_80, use = "complete.obs")
})
# Convert to data frame and rank
cor_ranked <- data.frame(
feature = names(cor_results),
correlation = cor_results
) %>%
arrange(desc(abs(correlation)))
cor_ranked
# Plot
ggplot(cor_ranked,
aes(x = reorder(feature, correlation),
y = correlation,
fill = correlation)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_gradient2(low = "#E74C3C",
mid = "white",
high = "#2ECC71") +
labs(title = "Feature Correlation with KPI Achievement",
x = "Feature",
y = "Correlation Strength") +
theme_minimal(base_size = 13)
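The correlation matrix computed above can also be screened programmatically for predictor pairs whose absolute correlation exceeds a chosen cutoff. The following is a minimal sketch under stated assumptions: the data are simulated stand-ins rather than `df_clean`, the helper `flag_high_cor()` is hypothetical, and the 0.7 cutoff is a rule of thumb rather than a fixed standard.

```r
set.seed(1)

# Simulated stand-ins for numeric predictors (not the real data)
n <- 500
age <- rnorm(n, mean = 35, sd = 8)
length_of_service <- 0.5 * age + rnorm(n, sd = 3)  # built to correlate with age
no_of_trainings <- rpois(n, lambda = 1.5)          # independent of the others
sim <- data.frame(age, length_of_service, no_of_trainings)

# Hypothetical helper: list predictor pairs whose |r| exceeds `cutoff`
flag_high_cor <- function(data, cutoff = 0.7) {
  cmat <- cor(data, use = "complete.obs")
  # Upper triangle only, so each pair is reported once
  idx <- which(upper.tri(cmat) & abs(cmat) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cmat)[idx[, 1]],
    var2 = colnames(cmat)[idx[, 2]],
    r    = cmat[idx],
    stringsAsFactors = FALSE
  )
}

flagged <- flag_high_cor(sim, cutoff = 0.7)
print(flagged)
```

An empty result means no pair crosses the cutoff. Pairwise correlation is only a first screen; it does not detect collinearity that involves three or more predictors jointly.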
- avg_training_score and previous_year_rating are positively correlated with KPI achievement. Employees with higher training scores and higher ratings from the previous year typically perform better on KPIs.
- length_of_service and no_of_trainings exhibit right-skewed distributions, indicating that most employees have shorter tenure and have attended fewer training sessions.
- Chi-square tests confirm statistically significant associations between KPI achievement and each categorical variable: department, education, gender, awards_won and recruitment_channel.
- The relationship between previous_year_rating and avg_training_score reveals a potential nonlinear pattern, suggesting that performance may vary across different rating groups.
- The correlation between age and length_of_service is approximately 0.64; since this is below the commonly used threshold of 0.7, it is not treated as a multicollinearity concern.

This section applies statistical and machine learning techniques to uncover meaningful insights from the cleaned dataset. The goal is to identify key predictors, evaluate model performance, and generate reliable forecasts. By combining exploratory analysis with predictive modelling, we aim to transform raw data into actionable knowledge that supports decision‑making.
Question: Can employee KPI achievement (more than 80%) be predicted using demographic, training, and workplace-related variables?
# Load required packages
library(caret)
library(ggplot2)
library(pROC)
library(dplyr)
library(gridExtra)
# Load the cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)
# Drop redundant derived columns
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
# Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
# Encode the target as a two-level factor
df_clean$KPIs_met_more_than_80 <- factor(
  df_clean$KPIs_met_more_than_80,
  levels = c(0, 1),
  labels = c("No", "Yes")
)
set.seed(42)
# Stratified 80/20 train-test split
train_idx <- createDataPartition(
  df_clean$KPIs_met_more_than_80,
  p = 0.80,
  list = FALSE
)
train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]
# Fit logistic regression on all remaining predictors
log_model <- glm(
  KPIs_met_more_than_80 ~ .,
  data = train_data,
  family = binomial
)
summary(log_model)
# Predicted probabilities and class labels (0.5 cutoff) on the test set
pred_prob <- predict(log_model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")
pred_class <- factor(pred_class, levels = c("No", "Yes"))
# Confusion matrix with "Yes" as the positive class
cm <- confusionMatrix(
  pred_class,
  test_data$KPIs_met_more_than_80,
  positive = "Yes"
)
print(cm)
# Headline metrics, in percent
acc <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)
# ROC curve and AUC
roc_obj <- roc(
  response = test_data$KPIs_met_more_than_80,
  predictor = pred_prob,
  levels = c("No", "Yes")
)
auc_val <- round(auc(roc_obj), 4)
roc_df <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "F1 Score"),
  Value = c(acc, sens, spec, f1)
)
# Coefficients ranked by absolute size
coef_df <- data.frame(
  Feature = names(coef(log_model)),
  Coefficient = coef(log_model)
)
coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction = ifelse(Coefficient > 0, "Positive", "Negative")
  ) %>%
  arrange(desc(Abs_Coefficient))
top_coef_df <- coef_df %>%
  slice_max(order_by = Abs_Coefficient, n = 15)
print(coef_df)
prob_df <- data.frame( Probability = pred_prob, Actual_Class = test_data$KPIs_met_more_than_80 )
class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(prop.table(class_tbl) * 100, 1)
class_df <- data.frame( Class = names(class_tbl), Count = as.integer(class_tbl), Percent = as.numeric(class_pct) )
# Panel 1: class distribution
p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(Count, "\n(", Percent, "%)")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Class Distribution", subtitle = "KPIs Met More Than 80%",
       x = "Class", y = "Count") +
  theme_minimal(base_size = 14)
# Panel 2: confusion matrix heatmap
cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")
p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title = "Confusion Matrix", subtitle = "Predicted vs Actual Classes",
       x = "Actual Class", y = "Predicted Class", fill = "Count") +
  theme_minimal(base_size = 14)
# Panel 3: ROC curve
p3 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "ROC Curve", subtitle = paste("AUC =", auc_val),
       x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal(base_size = 14)
# Panel 4: performance metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(Value, "%")), vjust = -0.4, fontface = "bold") +
  ylim(0, 110) +
  labs(title = "Performance Metrics", y = "Score (%)") +
  theme_minimal(base_size = 14)
# Panel 5: largest coefficients by absolute value
p5 <- ggplot(top_coef_df,
             aes(x = reorder(Feature, Abs_Coefficient),
                 y = Abs_Coefficient, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance", subtitle = "Top 15 absolute coefficients",
       x = "Feature", y = "Absolute Coefficient") +
  theme_minimal(base_size = 14)
# Panel 6: predicted probabilities by actual class
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75, position = "identity",
                 colour = "white") +
  labs(title = "Predicted Probability", subtitle = "Probability of KPI Achievement",
       x = "Probability of Yes", y = "Count", fill = "Actual Class") +
  theme_minimal(base_size = 14)
# Arrange the six panels and save the dashboard
logistic_dashboard <- grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)
ggsave("logistic_regression_dashboard.png", plot = logistic_dashboard,
       width = 14, height = 16, dpi = 150)
# Recompute the metrics and store them for the later model comparison
acc <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)
logistic_acc <- acc
logistic_sens <- sens
logistic_spec <- spec
logistic_f1 <- f1
logistic_auc <- auc_val
# Install any missing packages, then load them
required_packages <- c("randomForest", "caret", "ggplot2", "reshape2",
                       "pROC", "dplyr", "gridExtra")
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(randomForest)
library(caret)
library(ggplot2)
library(reshape2)
library(pROC)
library(dplyr)
library(gridExtra)
# Reload the cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)
cat(sprintf("Loaded: %d rows x %d columns\n", nrow(df_clean), ncol(df_clean)))
# Drop redundant derived columns
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
cat("Step 2: Dropped redundant columns: avg_training_score_scaled, age_group\n")
# Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
cat("Step 3: Converted to factor:", paste(char_cols, collapse = ","), "\n")
cat("  'unknown' kept as a valid factor level in 'education'\n")
# Encode the target as a two-level factor
df_clean$KPIs_met_more_than_80 <- factor(df_clean$KPIs_met_more_than_80,
                                         levels = c(0, 1),
                                         labels = c("No", "Yes"))
cat("Step 4: Target 'KPIs_met_more_than_80' converted to factor\n")
# Impute missing values only if any are present
na_total <- sum(is.na(df_clean))
cat(sprintf("Total NAs found: %d\n", na_total))
if (na_total > 0) {
  df_clean <- na.roughfix(df_clean)
  cat("  na.roughfix() applied\n")
} else {
  cat("  No NAs, skipping imputation\n")
}
# Class distribution and inverse-frequency class weights
cat("Class distribution\n")
class_tbl <- table(df_clean$KPIs_met_more_than_80)
print(class_tbl)
class_pct <- round(prop.table(class_tbl) * 100, 1)
print(class_pct)
n_no <- as.integer(class_tbl["No"])
n_yes <- as.integer(class_tbl["Yes"])
wt_no <- 1
wt_yes <- round(n_no / n_yes, 2)
class_weights <- c("No" = wt_no, "Yes" = wt_yes)
cat(sprintf("  Class weights: No = %.2f | Yes = %.2f\n", wt_no, wt_yes))
# Stratified 80/20 train-test split
set.seed(42)
train_idx <- createDataPartition(df_clean$KPIs_met_more_than_80, p = 0.80, list = FALSE)
train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]
cat(sprintf("Train rows: %d | Test rows: %d\n", nrow(train_data), nrow(test_data)))
# mtry follows the usual sqrt(number of predictors) default for classification
set.seed(42)
n_features <- ncol(train_data) - 1
mtry_val <- floor(sqrt(n_features))
cat(sprintf("Training Random Forest (ntree=500, mtry=%d) ...\n", mtry_val))
rf_model <- randomForest(
  KPIs_met_more_than_80 ~ .,
  data = train_data,
  ntree = 500,
  mtry = mtry_val,
  importance = TRUE,
  classwt = class_weights
)
print(rf_model)
# Class and probability predictions on the test set
preds_class <- predict(rf_model, newdata = test_data)
preds_prob <- predict(rf_model, newdata = test_data, type = "prob")
cm <- confusionMatrix(preds_class, test_data$KPIs_met_more_than_80, positive = "Yes")
print(cm)
# Headline metrics, in percent
acc <- round(cm$overall["Accuracy"] * 100, 2)
kappa <- round(cm$overall["Kappa"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
precision <- round(cm$byClass["Precision"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)
# ROC curve and AUC from the predicted probability of "Yes"
roc_obj <- roc(response = test_data$KPIs_met_more_than_80,
               predictor = preds_prob[, "Yes"],
               levels = c("No", "Yes"),
               direction = "<")
auc_val <- round(auc(roc_obj), 4)
library(caret)
library(ggplot2)
# Confusion matrix heatmap
cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")
ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title = "Confusion Matrix Heatmap",
       subtitle = "Model predictions vs actual classes",
       x = "Actual Class", y = "Predicted Class", fill = "Count") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")
# ROC curve on the test set
roc_obj <- roc(response = test_data$KPIs_met_more_than_80,
               predictor = preds_prob[, "Yes"],
               levels = c("No", "Yes"),
               direction = "<")
roc_df <- data.frame(FPR = 1 - roc_obj$specificities,
                     TPR = roc_obj$sensitivities)
ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#1A5FA8", linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed",
              colour = "grey60", linewidth = 0.7) +
  annotate("text", x = 0.65, y = 0.12,
           label = sprintf("AUC = %.4f", auc(roc_obj)),
           size = 5, fontface = "bold", colour = "#1A5FA8") +
  labs(title = "ROC Curve",
       subtitle = "Receiver Operating Characteristic (test set)",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  theme_minimal(base_size = 14) +
  coord_equal()
# Feature importance, ordered by Mean Decrease Gini
imp_mat <- importance(rf_model)
imp_df <- data.frame(
  Feature = rownames(imp_mat),
  MeanDecreaseAccuracy = imp_mat[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = imp_mat[, "MeanDecreaseGini"]
)
imp_df <- imp_df[order(imp_df$MeanDecreaseGini), ]
imp_df$Feature <- factor(imp_df$Feature, levels = imp_df$Feature)
ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = round(MeanDecreaseGini, 1)), hjust = -0.15, size = 3.5) +
  scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
  coord_flip() +
  labs(title = "Feature Importance (Mean Decrease Gini)",
       subtitle = "Higher = more important for node purity",
       y = "Mean Decrease Gini") +
  theme_minimal(base_size = 14)
# Class distribution of the target
class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(prop.table(class_tbl) * 100, 1)
class_df <- data.frame(Class = names(class_tbl),
                       Count = as.integer(class_tbl),
                       Percent = as.numeric(class_pct))
ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
            vjust = -0.3, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Class Distribution of Target Variable",
       subtitle = "KPIs met more than 80%",
       x = "KPIs Met > 80%", y = "Count") +
  theme_minimal(base_size = 14)
# Performance metrics on the test set
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision", "F1 Score"),
  Value = c(acc, sens, spec, precision, f1)
)
ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", Value)),
            vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Model Performance Metrics",
       subtitle = "Evaluated on test set", y = "Score (%)") +
  ylim(0, 110) +
  theme_minimal(base_size = 14)
# Predicted probability of "Yes", split by actual class
prob_df <- data.frame(
  Probability = preds_prob[, "Yes"],
  Actual_Class = test_data$KPIs_met_more_than_80
)
ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75, position = "identity",
                 colour = "white") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Predicted Probability Distribution",
       subtitle = "Probability of KPIs > 80% by actual class",
       x = "Predicted probability (Yes)", y = "Count", fill = "Actual class") +
  theme_minimal(base_size = 14)
# STEP 1: Build simplified versions of each plot for the dashboard
# p1: Class distribution
p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  labs(title = "Class Distribution")
# p2: Confusion matrix
p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile() +
  labs(title = "Confusion Matrix")
# p3: Feature importance
p3 <- ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance (Gini)")
# p4: Performance metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity") +
  labs(title = "Performance Metrics")
# p5: ROC curve
p5 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line() +
  labs(title = "ROC Curve")
# p6: Predicted probability distribution
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05) +
  labs(title = "Predicted Probability Distribution")
# STEP 2: Arrange the graphs into a dashboard
library(gridExtra)
combined_dashboard <- grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)
ggsave("dashboard_overview.png", plot = combined_dashboard,
       width = 18, height = 10, dpi = 150)
When compared with the Random Forest model, the Logistic Regression model achieved a slightly higher overall accuracy (71.03% vs 70.23%). However, the difference in performance becomes clearer when examining sensitivity and specificity. Logistic Regression recorded a much lower sensitivity of 43% compared to Random Forest’s 57%, meaning it missed a larger number of employees who actually met the KPI target. In contrast, Logistic Regression achieved a substantially higher specificity of 87%, while Random Forest obtained 78%, indicating that Logistic Regression was better at correctly identifying employees who did not meet KPI expectations.
The Random Forest model demonstrated a more balanced classification performance overall. Although its accuracy was marginally lower, it produced a higher F1-score (57.86% compared to 51.79% for Logistic Regression), showing a better balance between precision and recall. Random Forest was therefore more effective at detecting true high performers, while Logistic Regression was more conservative and focused on minimizing false positives. This is also reflected in the confusion matrices, where Logistic Regression produced more false negatives (708) than Random Forest (538), meaning more actual high performers were overlooked.
Additionally, the Kappa statistic for Random Forest (0.35) was slightly higher than that for Logistic Regression (0.32), suggesting better agreement beyond chance. The Random Forest model also achieved a higher balanced accuracy (67% vs 65%), indicating stronger overall performance across both classes rather than favoring the majority class. While Logistic Regression excelled in identifying non-high performers, Random Forest provided a more even trade-off between identifying successful and unsuccessful employees.
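The metrics compared above (accuracy, sensitivity, specificity, F1-score, balanced accuracy and Kappa) all derive from the four cells of a confusion matrix. The sketch below works them out in base R. The counts are hypothetical round numbers chosen for easy arithmetic; they are not the confusion matrices produced by either model.

```r
# Hypothetical confusion-matrix counts (illustration only)
tp <- 40   # predicted Yes, actually Yes
fn <- 60   # predicted No,  actually Yes
fp <- 20   # predicted Yes, actually No
tn <- 180  # predicted No,  actually No
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # recall for the "Yes" class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed <- accuracy
p_expected <- ((tp + fp) / n) * ((tp + fn) / n) +
  ((tn + fn) / n) * ((tn + fp) / n)
kappa <- (p_observed - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              f1 = f1, balanced_accuracy = balanced_accuracy,
              kappa = kappa), 4))
```

In this toy case accuracy is about 73% even though only 40% of the positive class is found, which is the same pattern the comparison above describes: accuracy alone can hide weak detection of the minority class.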
Overall, Logistic Regression appears more suitable when the priority is reducing false claims of high performance, whereas Random Forest is more appropriate when the goal is to identify as many genuine high performers as possible. Since employee performance prediction often benefits from detecting successful employees accurately, the Random Forest model may be considered the more practical and reliable choice despite its slightly lower overall accuracy.
Both models performed better than random guessing, with Logistic Regression achieving a slightly higher AUC (0.7402) than Random Forest (0.7327), indicating marginally better overall class separation. However, Random Forest showed stronger practical performance by achieving higher sensitivity and F1-score, meaning it was better at identifying employees who actually met KPI targets. Logistic Regression was more conservative, focusing more on correctly identifying non-high performers and reducing false positives. Overall, Random Forest provides a more balanced model for employee performance prediction, while Logistic Regression is better when minimizing false positive predictions is the priority.
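AUC can be read as the probability that a randomly chosen employee who met the KPI target receives a higher predicted score than a randomly chosen employee who did not. The sketch below computes it two equivalent ways in base R, using a handful of hypothetical scores rather than the model outputs; it is meant only to illustrate the definition, not to replace `pROC::roc()`.

```r
# Hypothetical predicted probabilities (illustration only)
scores_yes <- c(0.9, 0.8, 0.4)        # employees who met the KPI target
scores_no  <- c(0.7, 0.3, 0.2, 0.1)   # employees who did not

# Pairwise definition: share of (Yes, No) pairs in which the Yes case
# gets the higher score; ties count as half
pairwise <- outer(scores_yes, scores_no,
                  function(y, n) (y > n) + 0.5 * (y == n))
auc_pairs <- mean(pairwise)

# Equivalent rank-based (Mann-Whitney) form
all_scores <- c(scores_yes, scores_no)
r <- rank(all_scores)                  # average ranks handle ties
n_yes <- length(scores_yes)
n_no  <- length(scores_no)
auc_rank <- (sum(r[seq_len(n_yes)]) - n_yes * (n_yes + 1) / 2) / (n_yes * n_no)

print(c(pairwise = auc_pairs, rank_based = auc_rank))
```

Because AUC depends only on how cases are ranked, it does not change with the classification cutoff, which is why two models can have nearly identical AUC yet quite different sensitivity and specificity at a fixed threshold.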
The Logistic Regression coefficient analysis shows that awards_won and previous_year_rating are the strongest positive predictors of employees meeting more than 80% of their KPIs. Employees who received awards or had strong previous performance ratings were much more likely to achieve high KPI outcomes. Several regional variables also had relatively strong positive effects, while some departments such as sales & marketing, legal, and HR showed negative relationships with KPI success. Features such as no_of_trainings and length_of_service had smaller negative coefficients, indicating weaker influence on the prediction outcome.
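Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to communicate. The sketch below illustrates this on simulated data; the variable names `award` and `rating` and the assumed effect sizes are hypothetical stand-ins and do not reproduce the fitted `log_model`.

```r
set.seed(42)

# Simulated data (not the employee dataset): one binary, one ordinal predictor
n <- 2000
award  <- rbinom(n, 1, 0.1)
rating <- sample(1:5, n, replace = TRUE)
log_odds <- -2 + 1.2 * award + 0.4 * rating   # assumed "true" effects
kpi <- rbinom(n, 1, plogis(log_odds))

fit <- glm(kpi ~ award + rating, family = binomial)

# Coefficients are log-odds; exponentiating gives odds ratios
odds_ratios <- exp(coef(fit))
print(round(cbind(coefficient = coef(fit), odds_ratio = odds_ratios), 3))
```

A coefficient of 1.2 corresponds to an odds ratio of roughly 3.3, meaning the odds of meeting the KPI target are about 3.3 times higher, holding the other predictors fixed. Ranking features by absolute coefficient, as in the plot above, is sensitive to each predictor's scale: a one-unit change in a 0/1 indicator is not directly comparable to a one-unit change in age or tenure.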
Compared with Random Forest, Logistic Regression provides coefficient-based interpretations that show whether each variable increases or decreases the likelihood of success. However, Random Forest offered clearer feature importance rankings and captured more complex nonlinear relationships between variables. Logistic Regression is therefore simpler and easier to interpret statistically, while Random Forest provides stronger practical insight into which factors most influence employee performance predictions.
# Side-by-side comparison of both models
# (acc, sens, spec, f1 and auc_val hold the Random Forest results at this point)
comparison_df <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  Accuracy = c(logistic_acc, acc),
  Sensitivity = c(logistic_sens, sens),
  Specificity = c(logistic_spec, spec),
  F1_Score = c(logistic_f1, f1),
  AUC = c(logistic_auc, auc_val)
)
comparison_df
Both Logistic Regression and Random Forest achieved similar accuracy, with Logistic Regression performing slightly better (71.03% vs 70.23%). However, Random Forest showed much higher sensitivity (56.96% vs 43.36%), meaning it was better at identifying employees who actually met KPI targets. Logistic Regression achieved higher specificity (86.52% vs 77.65%), indicating it was stronger at identifying employees who did not meet KPIs. Random Forest also obtained a higher F1-score (57.86% vs 51.79%), showing a better balance between precision and recall. Although Logistic Regression had a slightly higher AUC (0.7402 vs 0.7327), the difference was minimal. Overall, Logistic Regression is more conservative and better at avoiding false positives, while Random Forest provides a more balanced performance and is more effective at detecting genuine high performers.
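The conservative behaviour of Logistic Regression noted above depends partly on the 0.5 cutoff used to turn predicted probabilities into classes. The sketch below shows, on hypothetical probabilities and labels, how moving the cutoff trades specificity for sensitivity; `rates_at()` is an illustrative helper, not part of the analysis code.

```r
# Hypothetical predicted probabilities and true labels (illustration only)
prob   <- c(0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.30, 0.40, 0.60, 0.80)
actual <- c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")

# Hypothetical helper: sensitivity and specificity at a given cutoff
rates_at <- function(cutoff) {
  pred <- ifelse(prob > cutoff, "Yes", "No")
  c(cutoff      = cutoff,
    sensitivity = mean(pred[actual == "Yes"] == "Yes"),
    specificity = mean(pred[actual == "No"] == "No"))
}

# One row per cutoff
threshold_table <- t(sapply(c(0.3, 0.5, 0.7), rates_at))
print(round(threshold_table, 3))
```

Lowering the cutoff catches more true high performers at the cost of more false positives, so part of the sensitivity gap between two models can reflect where their probabilities sit relative to 0.5 rather than a difference in ranking quality.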
Question: Can employee average training score be predicted using demographic and workplace-related variables?
# Install any missing packages, then load them
required_packages <- c("caret", "ggplot2", "dplyr", "Metrics", "gridExtra")
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
library(caret)
library(ggplot2)
library(dplyr)
library(Metrics)
library(gridExtra)
# Reload the cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)
cat(sprintf("Loaded: %d rows x %d columns\n", nrow(df_clean), ncol(df_clean)))
# Drop derived columns and the classification target
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
df_clean$KPIs_met_more_than_80 <- NULL
cat("Dropped redundant columns\n")
# Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
cat("Converted character columns to factors\n")
# Median-impute numeric columns if any values are missing
na_total <- sum(is.na(df_clean))
cat(sprintf("Total NAs found: %d\n", na_total))
if (na_total > 0) {
  for (col in names(df_clean)) {
    if (is.numeric(df_clean[[col]])) {
      df_clean[[col]][is.na(df_clean[[col]])] <-
        median(df_clean[[col]], na.rm = TRUE)
    }
  }
  cat("Median imputation applied\n")
} else {
  cat("No missing values found\n")
}
# 80/20 train-test split on the regression target
set.seed(42)
train_idx <- createDataPartition(df_clean$avg_training_score, p = 0.80, list = FALSE)
train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]
cat(sprintf("Train rows: %d | Test rows: %d\n", nrow(train_data), nrow(test_data)))
# Fit the linear regression model
lm_model <- lm(
  avg_training_score ~ age + previous_year_rating + length_of_service +
    no_of_trainings + department + education + gender +
    recruitment_channel + awards_won,
  data = train_data
)
summary(lm_model)
# Predict on the test set
predictions <- predict(lm_model, newdata = test_data)
actual_values <- test_data$avg_training_score
# Error metrics; R² is the squared correlation of actual vs predicted values
rmse_val <- round(rmse(actual_values, predictions), 3)
mae_val <- round(mae(actual_values, predictions), 3)
r2_val <- round(cor(actual_values, predictions)^2, 3)
cat("Model Performance\n")
cat("RMSE :", rmse_val, "\n")
cat("MAE :", mae_val, "\n")
cat("R² :", r2_val, "\n")
results_df <- data.frame(Actual = actual_values, Predicted = predictions)
# Actual vs predicted
p1 <- ggplot(results_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5, colour = "#1A5FA8") +
  geom_abline(slope = 1, intercept = 0, colour = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Values", subtitle = "Linear Regression",
       x = "Actual Training Score", y = "Predicted Training Score") +
  theme_minimal(base_size = 14)
# Residuals vs predicted values
results_df$Residuals <- actual_values - predictions
p2 <- ggplot(results_df, aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5, colour = "#2E7D32") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed") +
  labs(title = "Residual Plot", x = "Predicted Values", y = "Residuals") +
  theme_minimal(base_size = 14)
# Residual distribution
p3 <- ggplot(results_df, aes(x = Residuals)) +
  geom_histogram(bins = 30, fill = "#4C9BE8", colour = "white", alpha = 0.8) +
  labs(title = "Residual Distribution", x = "Residual", y = "Count") +
  theme_minimal(base_size = 14)
# Coefficients ranked by absolute size
coef_df <- data.frame(
  Feature = names(coef(lm_model)),
  Coefficient = coef(lm_model)
)
coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction = ifelse(Coefficient > 0, "Positive", "Negative")
  ) %>%
  arrange(desc(Abs_Coefficient))
top_coef_df <- coef_df %>%
  slice_max(order_by = Abs_Coefficient, n = 15)
p4 <- ggplot(top_coef_df,
             aes(x = reorder(Feature, Abs_Coefficient),
                 y = Abs_Coefficient, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance", subtitle = "Linear Regression Coefficients",
       x = "Feature", y = "Absolute Coefficient") +
  theme_minimal(base_size = 14)
The linear regression coefficient plot shows how different employee characteristics influence the average training score. Features with positive coefficients increase the predicted training score, while negative coefficients reduce it. Among the positive predictors, awards_won has one of the strongest positive effects, suggesting that employees who received awards tend to achieve higher training scores. Variables such as previous_year_rating, education level, and referral-based recruitment also contribute positively, although their effects are smaller.
On the other hand, departments such as sales & marketing, HR, legal, operations, and finance have large negative coefficients, indicating that employees in these departments tend to have lower average training scores compared to the reference department. Variables such as no_of_trainings and length_of_service also show slight negative relationships, suggesting that attending more trainings or having longer service does not necessarily correspond to higher average training performance.
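The phrase "compared to the reference department" matters for reading these coefficients: with R's default treatment coding, each department coefficient is the difference from the first factor level. The sketch below demonstrates this on a hypothetical three-department toy dataset; the scores are invented for illustration. If the department labels match the summary table shown earlier, `as.factor()` would order them alphabetically and make `analytics` the reference level in the fitted model, though that is an inference from the labels rather than something the output states.

```r
# Hypothetical scores for three departments (illustration only)
toy <- data.frame(
  department = factor(rep(c("analytics", "hr", "sales"), each = 3)),
  score      = c(84, 85, 86, 50, 51, 49, 50, 52, 48)
)

fit <- lm(score ~ department, data = toy)
group_means <- tapply(toy$score, toy$department, mean)

# Intercept = mean of the reference level (first level: "analytics");
# other coefficients = that department's mean minus the reference mean
print(coef(fit))
print(group_means - group_means["analytics"])

# Changing the reference level changes the coefficients, not the fit
toy$department <- relevel(toy$department, ref = "hr")
fit_hr <- lm(score ~ department, data = toy)
print(coef(fit_hr))
```

A large negative department coefficient therefore says that the department scores lower than the reference group, not that it scores low in absolute terms; choosing a different reference with `relevel()` would flip some signs while leaving the predictions unchanged.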
Overall, the model suggests that employee recognition and past performance are associated with stronger training outcomes, while departmental differences appear to play a significant role in influencing average training scores. The coefficient directions also help explain which factors are linked to higher or lower training performance within the organization.
# Bar chart of the regression metrics
metrics_df <- data.frame(
  Metric = c("RMSE", "MAE", "R²"),
  Value = c(rmse_val, mae_val, r2_val)
)
p5 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Value), vjust = -0.4, fontface = "bold") +
  labs(title = "Regression Performance Metrics", y = "Value") +
  theme_minimal(base_size = 8)
The regression performance metrics indicate that the linear regression model predicts employees’ average training scores well. The R² value of 0.888, computed here as the squared correlation between actual and predicted scores on the test set, means that approximately 88.8% of the variation in training scores is accounted for by the model’s predictions. This suggests a strong fit, indicating that the selected features are effective in explaining employee training performance.
The Mean Absolute Error (MAE) of 2.74 shows that, on average, the model’s predictions differ from the actual training scores by about 2.7 points. Since MAE measures the average magnitude of prediction errors without considering direction, this relatively small value suggests that the model predictions are generally close to the true scores.
Similarly, the Root Mean Squared Error (RMSE) of 4.51 indicates that prediction errors are relatively low overall. RMSE is higher than MAE because it penalizes larger errors more heavily, so the gap between the two (4.51 vs 2.74) suggests that while most predictions are accurate, a few larger prediction errors remain in the test set.
Overall, these metrics indicate that the linear regression model has strong predictive performance and is effective for estimating employee average training scores. The high R² combined with relatively low MAE and RMSE values suggests that the model captures the underlying relationships in the data well and provides reliable predictions.
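The three metrics discussed above can be reproduced in base R without the `Metrics` package. The sketch below uses four hypothetical actual and predicted scores, and also contrasts the squared-correlation form of R² used in this analysis with the alternative 1 - SSE/SST form.

```r
# Hypothetical actual and predicted scores (illustration only)
actual    <- c(60, 70, 80, 90)
predicted <- c(62, 68, 83, 85)
errors <- actual - predicted

# MAE: average size of the errors, ignoring direction
mae_manual <- mean(abs(errors))

# RMSE: squares the errors first, so large misses weigh more
rmse_manual <- sqrt(mean(errors^2))

# Two common definitions of R-squared on held-out data
r2_cor <- cor(actual, predicted)^2                              # squared correlation
r2_sse <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)   # 1 - SSE/SST

print(round(c(MAE = mae_manual, RMSE = rmse_manual,
              R2_cor = r2_cor, R2_sse = r2_sse), 4))
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest. The squared correlation ignores any systematic offset in the predictions, so it cannot be lower than 1 - SSE/SST; reporting both is a simple check that a high R² is not masking biased predictions.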