1.0 Introduction

Amid the era of globalization and relentless competitive pressures, sustaining strong employee performance has become a central priority and one of the biggest challenges for Human Resource (HR) departments. Organizations can no longer rely solely on intuition; instead, they are increasingly turning to data‑driven analytics to evaluate workforce outcomes, optimize talent management, and reduce turnover. Understanding the specific, underlying factors that influence productivity is essential for building sustainable, long‑term employee success.

This project uses R to explore a comprehensive HR dataset. By combining statistical methods with machine learning techniques, the goal is to uncover the key drivers behind high performance and Key Performance Indicator (KPI) achievement, translating raw HR data into actionable organizational insights.

1.1 Objective of the Project

The primary objective of this project is to conduct a comprehensive data analysis in R to identify which factors, such as employee demographics, training effectiveness, length of service, and prior performance ratings, significantly influence whether an employee meets more than 80% of their Key Performance Indicators (KPIs). By applying statistical techniques, data visualization, and predictive modeling, this study aims to generate actionable insights that can guide HR professionals in enhancing employee performance strategies, supporting talent development, and strengthening organizational decision‑making.

Specifically, the project seeks to:
- Identify key factors affecting performance through statistical analysis and machine learning, focusing on variables such as training, work experience, education level, and departmental affiliation.
- Compare predictive models to determine the most effective approach for forecasting KPI achievement above 80%, using evaluation metrics such as confusion matrices, accuracy, sensitivity, and specificity.
- Provide recommendations and actionable insights to HR departments and stakeholders, supporting evidence‑based decisions in talent management, training programs, and employee engagement initiatives.

1.2 Dataset Description

The dataset titled Employees Performance for HR Analytics was uploaded to Kaggle by Sanjana Chaudhari in 2023 and serves as the foundation for this analysis. It contains 17,417 employee records across 13 variables, stored in CSV format. The dataset captures a balanced mix of categorical and numerical variables, making it suitable for exploratory data analysis (EDA), correlation studies, and predictive modeling in HR analytics.

The variables included are as follows:
- employee_id: Unique identifier for each employee; serves as the primary key for tracking records without revealing personal information.
- department: Employee’s department (e.g., Sales & Marketing, Technology); useful for performance segmentation and departmental comparisons.
- region: Geographic region of employment.
- education: Highest education level attained (e.g., Bachelor’s, Master’s and above).
- gender: Employee gender (m = male, f = female).
- recruitment_channel: Hiring source (e.g., Referred, Sourcing).
- no_of_trainings: Number of trainings attended.
- age: Employee age.
- previous_year_rating: Performance rating from the prior year (1–5 scale).
- length_of_service: Number of years served in the organization.
- KPIs_met_more_than_80: Binary indicator of whether more than 80% of KPIs were achieved (0 = No, 1 = Yes); this serves as the target variable.
- awards_won: Indicator of whether the employee won awards (0 = No, 1 = Yes).
- avg_training_score: Average score from trainings, reflecting training quality.

By analyzing these variables, the study aims to uncover meaningful patterns that can guide HR strategies, improve productivity, and strengthen workforce management.

2.0 Data Cleaning & Preparation

Data cleaning is a critical step in preparing the dataset for analysis. It involves handling missing values, correcting inconsistencies, removing duplicates, and ensuring that variables are properly formatted for statistical modeling. Clean data provides a reliable foundation for exploratory analysis and predictive modeling, reducing bias and improving the accuracy of insights.

2.1 Packages Used

The following packages were used in the data cleaning process:

- dplyr: filter, arrange, mutate, select, distinct, summarise, across, case_when. Purpose: data manipulation and transformation.
- tidyr: replace_na. Purpose: handling missing values and tidying data.
- stringr: str_trim, str_to_lower. Purpose: text cleaning and string processing.
- writexl: write_xlsx. Purpose: exporting the cleaned dataset to Excel format (the export in Section 2.6 itself uses base R's write.csv).

2.2 Data Importation

This step involves loading the raw dataset into R for inspection. The structure and summary of the data are examined to understand variable types.

employee_performance <- read.csv("Uncleaned_employees_final_dataset.csv")
str(employee_performance)

## 'data.frame':    17417 obs. of  13 variables:
##  $ employee_id          : int  8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...

summary(employee_performance)

##   employee_id        department          region          education    
##  Min.   :    3   Length   :17417   Length   :17417   Length   :17417  
##  1st Qu.:19281   N.unique :    9   N.unique :   34   N.unique :    4  
##  Median :39122   N.blank  :    0   N.blank  :    0   N.blank  :  771  
##  Mean   :39083   Min.nchar:    2   Min.nchar:    8   Min.nchar:    0  
##  3rd Qu.:58838   Max.nchar:   17   Max.nchar:    9   Max.nchar:   15  
##  Max.   :78295                                                        
##                                                                       
##        gender      recruitment_channel no_of_trainings      age       
##  Length   :17417   Length   :17417     Min.   :1.000   Min.   :20.00  
##  N.unique :    2   N.unique :    3     1st Qu.:1.000   1st Qu.:29.00  
##  N.blank  :    0   N.blank  :    0     Median :1.000   Median :33.00  
##  Min.nchar:    1   Min.nchar:    5     Mean   :1.251   Mean   :34.81  
##  Max.nchar:    1   Max.nchar:    8     3rd Qu.:1.000   3rd Qu.:39.00  
##                                        Max.   :9.000   Max.   :60.00  
##                                                                       
##  previous_year_rating length_of_service KPIs_met_more_than_80   awards_won     
##  Min.   :1.000        Min.   : 1.000    Min.   :0.0000        Min.   :0.00000  
##  1st Qu.:3.000        1st Qu.: 3.000    1st Qu.:0.0000        1st Qu.:0.00000  
##  Median :3.000        Median : 5.000    Median :0.0000        Median :0.00000  
##  Mean   :3.345        Mean   : 5.802    Mean   :0.3588        Mean   :0.02337  
##  3rd Qu.:4.000        3rd Qu.: 7.000    3rd Qu.:1.0000        3rd Qu.:0.00000  
##  Max.   :5.000        Max.   :34.000    Max.   :1.0000        Max.   :1.00000  
##  NAs    :1363                                                                  
##  avg_training_score
##  Min.   :39.00     
##  1st Qu.:51.00     
##  Median :60.00     
##  Mean   :63.18     
##  3rd Qu.:75.00     
##  Max.   :99.00     
##
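The summary above reports 771 blank education values, yet these are empty strings rather than NA, so is.na() does not count them. The following base R sketch, using a hypothetical three-row extract rather than the real file, shows how the na.strings argument of read.csv changes that. It is an alternative to the approach taken in this report, which keeps the default import and recodes blanks later.

```r
# Hypothetical three-row extract; the real file has 17,417 rows.
csv_text <- "employee_id,education,previous_year_rating
1,Bachelors,3
2,,NA
3,Masters & above,5"

# Default import: the blank education value stays as an empty string "".
raw_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Alternative import: treat empty strings as NA as well.
raw_na <- read.csv(text = csv_text, na.strings = c("", "NA"),
                   stringsAsFactors = FALSE)

colSums(is.na(raw_default))  # education is not counted as missing
colSums(is.na(raw_na))       # education is now counted as missing
```

Either route works as long as blanks are handled explicitly before modelling.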

2.3 Duplicate Removal & Consistency Checks

Unique values of the categorical variables are checked first to identify inconsistencies. Exact duplicate rows and the unnecessary employee_id column are then removed to ensure data integrity.

unique(employee_performance$department)

## [1] "Technology"        "HR"                "Sales & Marketing"
## [4] "Procurement"       "Finance"           "Analytics"        
## [7] "Operations"        "Legal"             "R&D"

unique(employee_performance$education)

## [1] "Bachelors"       "Masters & above" ""                "Below Secondary"

unique(employee_performance$gender)

## [1] "m" "f"

unique(employee_performance$recruitment_channel)

## [1] "sourcing" "other"    "referred"

unique(employee_performance$region)

##  [1] "region_26" "region_4"  "region_13" "region_2"  "region_29" "region_7" 
##  [7] "region_22" "region_16" "region_17" "region_24" "region_11" "region_27"
## [13] "region_9"  "region_20" "region_34" "region_23" "region_8"  "region_14"
## [19] "region_31" "region_19" "region_5"  "region_28" "region_15" "region_3" 
## [25] "region_25" "region_12" "region_21" "region_30" "region_10" "region_33"
## [31] "region_32" "region_6"  "region_1"  "region_18"

str(employee_performance)

## 'data.frame':    17417 obs. of  13 variables:
##  $ employee_id          : int  8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...

# Show duplicated employee_id
employee_performance %>%
  filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
  arrange(employee_id)

# Remove exact duplicate rows only
employee_performance <- employee_performance %>%
  distinct()

# Check duplicated employee_id again
employee_performance %>%
  filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
  arrange(employee_id)

#remove unnecessary column
employee_performance <- employee_performance %>%
  select(-employee_id)
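The difference between a repeated employee_id and an exact duplicate row can be illustrated with a small base R sketch. The toy data frame below is hypothetical and only mirrors the logic of the duplicated() and distinct() calls above.

```r
# Hypothetical toy records (not taken from the dataset).
toy <- data.frame(
  employee_id = c(101, 101, 102, 103, 104),
  department  = c("hr", "hr", "legal", "legal", "hr"),
  age         = c(30, 30, 41, 41, 25)
)

# Rows whose employee_id occurs more than once (both copies are flagged).
dup_id <- duplicated(toy$employee_id) |
  duplicated(toy$employee_id, fromLast = TRUE)
sum(dup_id)

# Exact duplicate rows: every column matches an earlier row.
sum(duplicated(toy))

# Base R equivalent of dplyr::distinct(): keep the first copy of each row.
toy_unique <- toy[!duplicated(toy), ]
nrow(toy_unique)
```

Here two rows share an ID, one of them is an exact copy, and four rows remain after deduplication.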

2.4 Data Transformation

Text fields are standardized by trimming spaces and converting to lowercase. Missing values are handled using median imputation and categorical replacement to maintain data completeness.

#clean text column
employee_performance <- employee_performance %>%
  mutate(
    gender = str_to_lower(str_trim(gender)),
    department = str_trim(department),
    education = str_trim(education),
    recruitment_channel = str_trim(recruitment_channel)
  )


str(employee_performance)

## 'data.frame':    17415 obs. of  12 variables:
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...

colSums(is.na(employee_performance))

##            department                region             education 
##                     0                     0                     0 
##                gender   recruitment_channel       no_of_trainings 
##                     0                     0                     0 
##                   age  previous_year_rating     length_of_service 
##                     0                  1363                     0 
## KPIs_met_more_than_80            awards_won    avg_training_score 
##                     0                     0                     0

employee_performance %>%
  summarise(across(everything(), ~ sum(is.na(.) | trimws(as.character(.)) == "")))

#handling missing values
employee_performance <- employee_performance %>%
  mutate(
    previous_year_rating = ifelse(
      is.na(previous_year_rating),
      median(previous_year_rating, na.rm = TRUE),
      previous_year_rating
    ),
    
    education = ifelse(
      is.na(education) | str_trim(education) == "",
      "Unknown",
      education
    )
  )

#if missing exists
employee_performance <- employee_performance %>%
  mutate(education = replace_na(education, "Unknown"))


#clean any text column
clean_text <- function(x) {
  x %>%
    str_trim() %>%
    str_to_lower()
}

employee_performance$department <- clean_text(employee_performance$department)


employee_performance <- employee_performance %>%
  mutate(across(
    c(department, education, recruitment_channel, region),
    clean_text
  ))
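The imputation and text-cleaning steps above can be traced on a small example. This is a base R sketch on hypothetical values, with trimws() and tolower() standing in for str_trim() and str_to_lower().

```r
# Hypothetical toy data with the same kinds of gaps as the real columns.
toy <- data.frame(
  previous_year_rating = c(NA, 3, 1, 5, 4),
  education = c("Bachelors", " ", "Masters & above", NA, "Bachelors "),
  stringsAsFactors = FALSE
)

# Median of the observed ratings (NA excluded): median(c(3, 1, 5, 4)) is 3.5.
rating_median <- median(toy$previous_year_rating, na.rm = TRUE)
toy$previous_year_rating[is.na(toy$previous_year_rating)] <- rating_median

# Blank or missing education becomes an explicit "Unknown" category.
is_blank <- is.na(toy$education) | trimws(toy$education) == ""
toy$education[is_blank] <- "Unknown"

# Trim surrounding whitespace and lowercase, as clean_text() does.
toy$education <- tolower(trimws(toy$education))

toy
```

In the toy data the median is 3.5, so the imputed rating is not a whole number. In the report's dataset the median is 3, which keeps imputed ratings on the original 1 to 5 scale.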

2.5 Feature Engineering

New variables are created to enhance analytical insights. Age groups are categorized, and categorical variables are converted to factors for modeling compatibility.

#create age group 
employee_performance <- employee_performance %>%
  mutate(age_group = case_when(
    age < 30 ~ "Young",
    age >= 30 & age < 40 ~ "Mid",
    TRUE ~ "Senior"
  ))

str(employee_performance)

## 'data.frame':    17415 obs. of  13 variables:
##  $ department           : chr  "technology" "hr" "sales & marketing" "procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "bachelors" "bachelors" "bachelors" "bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : num  3 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...
##  $ age_group            : chr  "Young" "Mid" "Mid" "Mid" ...
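As a cross-check on the age bands defined with case_when() above, the same boundaries can be expressed with base R's cut(). This is a sketch on a hypothetical vector of ages chosen to sit on the band edges; it is not the method used in this report.

```r
# Hypothetical ages sitting on and around the band boundaries.
ages <- c(24, 29, 30, 39, 40, 60)

age_group <- cut(
  ages,
  breaks = c(-Inf, 30, 40, Inf),
  labels = c("Young", "Mid", "Senior"),
  right  = FALSE  # intervals are [lower, upper): 30 is "Mid", 40 is "Senior"
)

data.frame(age = ages, age_group)
levels(age_group)
```

One design difference: cut() returns a factor whose levels follow the break order (Young, Mid, Senior), whereas as.factor() on a character column sorts levels alphabetically (Mid, Senior, Young). The order matters for plot axes and for the reference level in models.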

#convert to factors
employee_performance <- employee_performance %>%
  mutate(
    department = as.factor(department),
    gender = as.factor(gender),
    education = as.factor(education),
    recruitment_channel = as.factor(recruitment_channel),
    region = as.factor(region),
    age_group = as.factor(age_group)
  )


#standardize score (z-score); as.numeric() drops the one-column matrix that scale() returns
employee_performance <- employee_performance %>%
  mutate(avg_training_score_scaled = as.numeric(scale(avg_training_score)))
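What scale() computes can be verified by hand: each value minus the mean, divided by the sample standard deviation. The sketch below uses the first five avg_training_score values from the str() output as a toy vector, so the resulting z-scores are relative to those five values only, not to the full dataset.

```r
# First five avg_training_score values shown in the str() output above.
scores <- c(77, 51, 47, 65, 61)

# Manual z-score: centre on the mean, divide by the sample standard deviation.
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() returns a one-column matrix; as.numeric() gives a plain vector.
z_scale <- as.numeric(scale(scores))

all.equal(z_manual, z_scale)
is.matrix(scale(scores))
```

Standardized values have mean 0 and standard deviation 1, which is why the scaled column in the later summary output has a mean of 0.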

2.6 Data Exportation

After cleaning and preparing the dataset, the final step is to export the processed data for future analysis and reporting. The cleaned data is written to a CSV file, which is used in the EDA step that follows.

write.csv(employee_performance, "clean_employee_performance.csv", row.names = FALSE)
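A CSV file stores only text, so factor columns do not survive the export. The following base R sketch, on a hypothetical three-row data frame written to a temporary file, shows the round trip and why the factor conversion has to be repeated after re-importing.

```r
# Hypothetical data frame with factor columns, as produced in Section 2.5.
toy <- data.frame(
  department = factor(c("hr", "legal", "hr")),
  age_group  = factor(c("Young", "Mid", "Senior")),
  age        = c(24L, 35L, 51L)
)

tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# stringsAsFactors = FALSE is the default since R 4.0.0; set here to be explicit.
reloaded <- read.csv(tmp, stringsAsFactors = FALSE)

sapply(toy, class)       # department and age_group are factors
sapply(reloaded, class)  # both come back as character

unlink(tmp)
```

If preserving column types matters, saveRDS() and readRDS() are a base R alternative that keeps factors intact.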

3.0 Exploratory Data Analysis (EDA)

Before modelling, Exploratory Data Analysis (EDA) was conducted to examine employee performance.
The steps performed were as follows:

3.1 Data Inspection

The required libraries and the cleaned dataset df_clean were loaded and inspected to understand the data structure before moving on to exploratory data analysis.

# Install packages (run only once if needed):
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("corrplot")
# install.packages("plotly")
# install.packages("reshape2")
# install.packages("kableExtra")

# Load required libraries:
library(dplyr)        # Data manipulation and group_by() function
library(ggplot2)      # Data visualization
library(tidyverse)    # Collection of data science packages
library(knitr)        # R Markdown table formatting
library(corrplot)     # Correlation matrix visualization
library(plotly)       # Interactive plots
library(reshape2)     # Data reshaping
library(kableExtra)   # Enhanced table styling

# Load cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv")

#convert to factors
df_clean <- df_clean %>%
  mutate(
    department = as.factor(department),
    gender = as.factor(gender),
    education = as.factor(education),
    recruitment_channel = as.factor(recruitment_channel),
    region = as.factor(region),
    age_group = as.factor(age_group),
    awards_won = as.factor(awards_won) # added conversion to factor for better analysis
  )

The dataset structure confirms the variables are correctly formatted with appropriate data types.

str() and glimpse() give two views of the same dataset: str() describes the structure in detail, whereas glimpse() gives a compact overview of every variable.
str() shows a mix of numeric and categorical (factor) variables. glimpse() and dim() show that the dataset contains 17,415 observations and 14 variables.
With 17,415 rows, the dataset is large enough for reliable exploratory and statistical analysis.

# Data structure overview inspection
head(df_clean)

str(df_clean)

## 'data.frame':    17415 obs. of  14 variables:
##  $ department               : Factor w/ 9 levels "analytics","finance",..: 9 3 8 6 2 6 2 1 9 9 ...
##  $ region                   : Factor w/ 34 levels "region_1","region_10",..: 19 29 5 12 22 32 12 15 32 15 ...
##  $ education                : Factor w/ 4 levels "bachelors","below secondary",..: 1 1 1 1 1 1 1 1 3 1 ...
##  $ gender                   : Factor w/ 2 levels "f","m": 2 1 2 1 2 2 2 2 2 2 ...
##  $ recruitment_channel      : Factor w/ 3 levels "other","referred",..: 3 1 1 1 3 3 1 3 1 3 ...
##  $ no_of_trainings          : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                      : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating     : int  3 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service        : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80    : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won               : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_training_score       : int  77 51 47 65 61 68 57 85 75 76 ...
##  $ age_group                : Factor w/ 3 levels "Mid","Senior",..: 3 1 1 1 1 1 1 1 2 3 ...
##  $ avg_training_score_scaled: num  1.03 -0.908 -1.206 0.136 -0.162 ...

glimpse(df_clean)

## Rows: 17,415
## Columns: 14
## $ department                <fct> technology, hr, sales & marketing, procureme…
## $ region                    <fct> region_26, region_4, region_13, region_2, re…
## $ education                 <fct> bachelors, bachelors, bachelors, bachelors, …
## $ gender                    <fct> m, f, m, f, m, m, m, m, m, m, m, m, f, m, m,…
## $ recruitment_channel       <fct> sourcing, other, other, other, sourcing, sou…
## $ no_of_trainings           <int> 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ age                       <int> 24, 31, 31, 31, 30, 36, 33, 36, 51, 29, 40, …
## $ previous_year_rating      <int> 3, 3, 1, 2, 4, 3, 5, 3, 4, 5, 5, 3, 3, 3, 5,…
## $ length_of_service         <int> 1, 5, 4, 9, 7, 2, 3, 3, 11, 2, 12, 10, 4, 10…
## $ KPIs_met_more_than_80     <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ awards_won                <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg_training_score        <int> 77, 51, 47, 65, 61, 68, 57, 85, 75, 76, 50, …
## $ age_group                 <fct> Young, Mid, Mid, Mid, Mid, Mid, Mid, Mid, Se…
## $ avg_training_score_scaled <dbl> 1.03010551, -0.90754471, -1.20564475, 0.1358…

# Dataset dimensions
dim(df_clean)

## [1] 17415    14

The summary provides an overview of the central tendency and distribution of each variable.

Employees range in age from 20 to 60, with a mean of about 35 years. Every employee has completed at least one training session, and most have completed exactly one (the median and third quartile of no_of_trainings are both 1).
The dataset shows moderate employee performance, with an average previous-year rating of about 3.3.
The middle 50% of employees have 3 to 7 years of service. The mean of KPIs_met_more_than_80 is 0.36, meaning that only about 36% of employees met more than 80% of their KPIs.
Training scores range widely, from 39 to 99. avg_training_score was therefore standardized (z-score) into a new column, avg_training_score_scaled, to ease later analysis.

# Summary statistics
df_clean %>%
  select(age, previous_year_rating, KPIs_met_more_than_80,
         length_of_service, no_of_trainings, avg_training_score, 
         avg_training_score_scaled) %>%
  summary()

##       age        previous_year_rating KPIs_met_more_than_80 length_of_service
##  Min.   :20.00   Min.   :1.000        Min.   :0.0000        Min.   : 1.000   
##  1st Qu.:29.00   1st Qu.:3.000        1st Qu.:0.0000        1st Qu.: 3.000   
##  Median :33.00   Median :3.000        Median :0.0000        Median : 5.000   
##  Mean   :34.81   Mean   :3.319        Mean   :0.3589        Mean   : 5.801   
##  3rd Qu.:39.00   3rd Qu.:4.000        3rd Qu.:1.0000        3rd Qu.: 7.000   
##  Max.   :60.00   Max.   :5.000        Max.   :1.0000        Max.   :34.000   
##  no_of_trainings avg_training_score avg_training_score_scaled
##  Min.   :1.000   Min.   :39.00      Min.   :-1.8018          
##  1st Qu.:1.000   1st Qu.:51.00      1st Qu.:-0.9075          
##  Median :1.000   Median :60.00      Median :-0.2368          
##  Mean   :1.251   Mean   :63.18      Mean   : 0.0000          
##  3rd Qu.:1.000   3rd Qu.:75.00      3rd Qu.: 0.8811          
##  Max.   :9.000   Max.   :99.00      Max.   : 2.6697

Summary for 3.1:

This step ensures that all variables are correctly formatted after preprocessing and no inconsistencies remain.
Overall, the dataset is ready for analysis.

3.2 Data Quality Assessment

A final quality check was conducted to check if any missing values or duplicated values remain.

# Check missing values
colSums(is.na(df_clean))

##                department                    region                 education 
##                         0                         0                         0 
##                    gender       recruitment_channel           no_of_trainings 
##                         0                         0                         0 
##                       age      previous_year_rating         length_of_service 
##                         0                         0                         0 
##     KPIs_met_more_than_80                awards_won        avg_training_score 
##                         0                         0                         0 
##                 age_group avg_training_score_scaled 
##                         0                         0

# Check missing or empty values
df_clean %>%
  summarise(across(everything(), ~ sum(is.na(.) | . == "")))

# Check any duplicates
sum(duplicated(df_clean))

## [1] 16

janitor::get_dupes(df_clean)

## No variable names specified - using all columns.

# Check whether missing values in the education field have been replaced with "unknown"
unique(df_clean$education)

## [1] bachelors       masters & above unknown         below secondary
## Levels: bachelors below secondary masters & above unknown

Summary for 3.2:

This step confirmed that all variables were properly cleaned.
No missing or empty values remain. sum(duplicated(df_clean)) still reports 16 duplicated rows, but these arise only because employee_id was dropped after the exact duplicate rows had been removed in Section 2.3. They are rows with different employee_id values that happen to share identical values on every remaining variable, so they are kept as valid observations rather than treated as data duplicates.
All variables have been standardized, and the cleaned dataset is ready for reliable statistical analysis after preprocessing.
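How duplicated rows can appear once an identifier column is dropped is easiest to see on a small case. The base R sketch below uses hypothetical records: the rows are all distinct while employee_id is present, and two of them become identical once it is removed.

```r
# Hypothetical records: three different employees, two with identical attributes.
toy <- data.frame(
  employee_id = c(102, 103, 104),
  department  = c("legal", "legal", "hr"),
  age         = c(41, 41, 25)
)

sum(duplicated(toy))  # no duplicates while the ID is present

# Drop the identifier, as done before the EDA.
toy_no_id <- toy[, setdiff(names(toy), "employee_id")]
sum(duplicated(toy_no_id))  # one row now repeats an earlier one

# Count how many rows share each attribute combination.
combo_counts <- aggregate(n ~ department + age,
                          data = transform(toy_no_id, n = 1),
                          FUN = sum)
repeated <- combo_counts[combo_counts$n > 1, ]
repeated
```

Removing such rows would discard real employees, which is why they are retained in this analysis.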

3.3 Outlier Detection

3.3.1 Box Plot

Outlier detection focuses on the continuous variables age, length_of_service, and avg_training_score, because these variables have meaningful numerical ranges and extreme values may reveal unusual or hidden characteristics of employees.

# Select variables suitable for outlier detection
outlier_vars <- df_clean %>%
  select(age, length_of_service, avg_training_score)

# Convert selected variables into long format
outlier_data <- outlier_vars %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value")

# Boxplot visualization
ggplot(outlier_data,
       aes(x = Variable,
           y = Value,
           fill = Variable)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Outlier Detection for Continuous Variables",
       x = "Variables",
       y = "Values") +
  theme(legend.position = "none")

Summary for 3.3:

This step checks whether the continuous variables contain extreme values that could bias later analysis.
age and length_of_service contain several outliers beyond the upper fence of the boxplot, which require further investigation to determine whether they represent valid extreme values or anomalies in the dataset. avg_training_score shows no points beyond the whiskers.
Apart from these points, the overall data quality is satisfactory, and the dataset is suitable for further exploratory data analysis and modelling.
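The boxplot reading can be cross-checked numerically with the usual 1.5 × IQR rule, using the quartiles, minima, and maxima reported in the summary statistics of Section 3.1. This is a sketch of the arithmetic only; the column names in the helper data frame are illustrative.

```r
# Quartiles, minima and maxima copied from the summary() output in Section 3.1.
quartiles <- data.frame(
  variable = c("age", "length_of_service", "avg_training_score"),
  q1  = c(29, 3, 51),
  q3  = c(39, 7, 75),
  min = c(20, 1, 39),
  max = c(60, 34, 99)
)

# Interquartile range and the 1.5 * IQR fences.
quartiles$iqr         <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * quartiles$iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * quartiles$iqr

# A variable has outliers if its extremes fall outside the fences.
quartiles$has_high_outliers <- quartiles$max > quartiles$upper_fence
quartiles$has_low_outliers  <- quartiles$min < quartiles$lower_fence

quartiles[, c("variable", "lower_fence", "upper_fence",
              "has_high_outliers", "has_low_outliers")]
```

The upper fences come out at 54 for age, 13 for length_of_service, and 111 for avg_training_score. The maxima of 60 and 34 exceed the first two fences, while the training score maximum of 99 stays inside its fence, and no variable has values below its lower fence.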

3.4 Univariate Analysis

3.4.1 Target Variable Distribution

# Check class imbalance
target_dist <- df_clean %>%
  count(KPIs_met_more_than_80) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    KPI_status = ifelse(KPIs_met_more_than_80 == 1,
                        "Met KPI >80%",
                        "Met KPI ≤80%")
  )

# KPI distribution plot
ggplot(target_dist,
       aes(x = KPI_status,
           y = n,
           fill = KPI_status)) +

  geom_bar(stat = "identity",
           width = 0.6,
           alpha = 0.9) +

  # Percentage + count labels
  geom_text(aes(label = paste0(percentage,
                               "%\n(n = ",
                               scales::comma(n), ")")),
            vjust = -0.35,
            size = 4.3,
            fontface = "bold") +

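  # Named values so the colour mapping does not depend on how the labels sort
  scale_fill_manual(values = c("Met KPI ≤80%" = "#E74C3C",
                               "Met KPI >80%" = "#2ECC71")) +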

  expand_limits(y = max(target_dist$n) * 1.15) +

  labs(title = "Distribution of KPI Achievement",
       subtitle = "Class balance analysis of KPI performance",
       x = NULL,
       y = "Number of Employees") +

  theme_minimal(base_size = 13) +

  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(face = "bold")
  )

# Print imbalance ratio
imbalance_ratio <- max(target_dist$percentage) /
                   min(target_dist$percentage)

cat("Imbalance ratio (majority/minority):",
    round(imbalance_ratio, 2), "\n")

## Imbalance ratio (majority/minority): 1.79

Insights:

The analysis reveals that about 64.1% of employees did not meet more than 80% of their KPIs, while only 35.9% did.
The imbalance ratio of 1.79 indicates that the target variable is moderately imbalanced rather than severely skewed.
Because more than half of the employees did not reach the KPI target, organizational and operational factors may be affecting performance, which motivates further investigation into departmental performance, employee support systems, and related areas.
The class distribution is therefore acceptable for exploratory analysis and modelling without special handling of class imbalance.
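The imbalance ratio can be reproduced directly from the proportion of employees who met the target, taken here from the mean of KPIs_met_more_than_80 in the summary statistics. The sketch also derives the accuracy of a trivial model that always predicts the majority class. That baseline is an illustrative addition and not a result from this report.

```r
# Proportion meeting the target: mean of KPIs_met_more_than_80 in Section 3.1.
p_met     <- 0.3589
p_not_met <- 1 - p_met

# Majority share divided by minority share.
imbalance_ratio <- p_not_met / p_met
round(imbalance_ratio, 2)

# Accuracy of always predicting the majority class (illustrative baseline).
baseline_accuracy <- max(p_met, p_not_met)
round(100 * baseline_accuracy, 1)
```

The ratio rounds to 1.79, matching the output above. The 64.1% baseline is a useful yardstick when accuracy is later used to compare models.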

3.4.2 Continuous Variables Distribution

# Select Variables
continuous_long <- df_clean %>%
  select(age,
         length_of_service,
         avg_training_score) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value")

# Plots
ggplot() +

  # age
  geom_histogram(
    data = filter(continuous_long, variable == "age"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 2,
    fill = "#3498DB",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long, variable == "age"),
    aes(x = value),
    color = "#E74C3C",
    linewidth = 1.1,
    adjust = 1.2
  ) +

  # length_of_service
  geom_histogram(
    data = filter(continuous_long,
                  variable == "length_of_service"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 1,
    fill = "#2ECC71",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long,
                  variable == "length_of_service"),
    aes(x = value),
    color = "#C0392B",
    linewidth = 1.1,
    adjust = 1.5
  ) +

  # avg_training_score
  geom_histogram(
    data = filter(continuous_long,
                  variable == "avg_training_score"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 5,
    fill = "#9B59B6",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long,
                  variable == "avg_training_score"),
    aes(x = value),
    color = "#E74C3C",
    linewidth = 1.1,
    adjust = 1.2
  ) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(
    title = "Distribution of Continuous Variables",
    subtitle = "Histogram with Density Overlay",
    x = "Value",
    y = "Density"
  ) +

  theme_minimal(base_size = 14) +

  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    strip.text = element_text(face = "bold", size = 12),
    axis.title = element_text(face = "bold")
  )

Insights:

The age distribution is slightly right-skewed, indicating a higher concentration of younger employees and fewer older employees. The density curve peaks around the mid-30s, which suggests that most employees are at an early to mid-career stage.
The avg_training_score distribution is multimodal, with major clusters visible around 50, 60, and 80 to 85. This suggests possible segmentation in employee performance or departmental training outcomes.
The length_of_service distribution is heavily right-skewed. Most employees have 1 to 7 years of service, whereas very few exceed 15 years. The workforce therefore consists mainly of employees with relatively short tenure.
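The direction of skew read from the histograms can be sanity-checked against the summary statistics, since a mean above the median usually accompanies a right-skewed distribution. This is a rough heuristic, not a formal test, and the tenure_toy vector and sample_skewness() helper below are hypothetical illustrations.

```r
# Means and medians copied from the summary() output in Section 3.1.
stats <- data.frame(
  variable = c("age", "length_of_service", "avg_training_score"),
  mean     = c(34.81, 5.801, 63.18),
  median   = c(33, 5, 60)
)
stats$mean_minus_median   <- stats$mean - stats$median
stats$suggests_right_skew <- stats$mean_minus_median > 0
stats

# Moment-based sample skewness (hypothetical helper, not used in the report).
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Hypothetical tenure values with one long-serving employee.
tenure_toy <- c(1, 1, 2, 2, 3, 3, 4, 5, 7, 20)
sample_skewness(tenure_toy)  # positive value indicates a right tail
```

All three variables have a mean above the median, which is consistent with the right-skew described for age and length_of_service.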

3.4.3 Discrete Variables Distribution

# Select Variables
discrete_vars <- df_clean %>%
  select(previous_year_rating,
         no_of_trainings)

discrete_long <- discrete_vars %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value")
#Plot
ggplot(discrete_long,
       aes(x = factor(value))) +

  geom_bar(fill = "#2ECC71",
           alpha = 0.8) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(title = "Distribution of Discrete Variables",
       subtitle = "Frequency distribution by category",
       x = "Category",
       y = "Count") +

  theme_minimal() +

  theme(strip.text = element_text(face = "bold"))

Insights:

no_of_trainings is strongly right-skewed: most employees attended only one training session. Counts decline sharply after 2 sessions, and very few employees attended more than 4, which produces the long tail.
previous_year_rating is slightly left-skewed with a mode at rating 3. Most employees received ratings of 3, 4, or 5, that is, average to excellent, which suggests that past performance is generally positive across the workforce. Note that the 1,363 missing ratings were imputed with the median (3) in Section 2.4, which inflates the count at rating 3.

3.4.4 Categorical Variables Distribution

# Select Variables
categorical_vars <- df_clean %>%
  select(department,
         education,
         recruitment_channel,
         awards_won)

categorical_long <- categorical_vars %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "category")

# Plot
ggplot(categorical_long,
       aes(x = category,
           fill = variable)) +

  geom_bar(alpha = 0.85,
           show.legend = FALSE) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(title = "Distribution of Categorical Variables",
       subtitle = "Frequency distribution across employee categories",
       x = "",
       y = "Number of Employees") +

  scale_fill_brewer(palette = "Set2") +

  theme_minimal(base_size = 12) +

  theme(
    axis.text.x = element_text(angle = 45,
                               hjust = 1),
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

# Plot for region (top 10)
region_analysis <- df_clean %>%
  count(region) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  top_n(10, n)

ggplot(region_analysis,
       aes(x = reorder(region, n), y = percentage, fill = percentage)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            vjust = -0.3, size = 3) +
  scale_fill_gradient(low = "#A3E4D7", high = "#1ABC9C") +
  labs(title = "Employee Distribution by Region (Top 10)",
       x = "Region", y = "Percentage (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7))

Insights:

The faceted bar charts give only an overview of department, education, recruitment_channel, and awards_won; they show that employees are unevenly distributed across departments, educational backgrounds, and recruitment channels.
The awards_won distribution is extremely imbalanced: nearly all employees have no awards, and award winners are almost invisible on the chart. Such limited variation may restrict the variable's predictive power, so further bivariate analysis is needed to determine whether award winners achieve their KPIs more often.
Compared to other departments, the Sales and Marketing department has the largest number of employees and has become the dominant department within the organization.
Most employees hold a bachelor’s degree, indicating that a bachelor’s degree is the most common educational background among the workforce.
There are three distinct recruitment channels, reflecting the variety of hiring methods the company employs.
Region 2 has the largest share of employees (about 22.5%) among all regions.

3.5 Bivariate Analysis

3.5.1 Training Score vs KPI Achievement

This section examines the relationship between employee training performance (average training score) and KPI achievement status. This analysis helps determine whether training performance influences employees’ success in meeting their KPI targets.

# Summary Statistics by KPI 
training_score_summary <- df_clean %>%
  group_by(KPIs_met_more_than_80) %>%
  summarise(
    count = n(),
    mean_score = mean(avg_training_score, na.rm = TRUE),
    median_score = median(avg_training_score, na.rm = TRUE),
    sd_score = sd(avg_training_score, na.rm = TRUE),
    min_score = min(avg_training_score, na.rm = TRUE),
    max_score = max(avg_training_score, na.rm = TRUE),
    q25 = quantile(avg_training_score, 0.25, na.rm = TRUE),
    q75 = quantile(avg_training_score, 0.75, na.rm = TRUE)
  ) %>%
  mutate(KPI_Status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet KPI"))

kable(training_score_summary %>% 
        select(-KPIs_met_more_than_80) %>% 
        mutate(across(where(is.numeric), ~round(., 2))),
      caption = "Training Score Summary by KPI Achievement Status")

Training Score Summary by KPI Achievement Status
count	mean_score	median_score	sd_score	min_score	max_score	q25	q75	KPI_Status
11165	62.46	59	13.35	39	99	50	74	Did Not Meet KPI
6250	64.47	61	13.45	41	99	53	77	Met KPI >80%

# Boxplot Comparison
ggplot(df_clean, aes(x = factor(KPIs_met_more_than_80), y = avg_training_score, 
                      fill = factor(KPIs_met_more_than_80))) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.8) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71"),
                    labels = c("Did Not Meet KPI", "Met KPI")) +
  labs(
    title = "Training Score Distribution Comparison",
    subtitle = "High performers show higher median training scores",
    x = "KPI Achievement Status",
    y = "Average Training Score",
    fill = "KPI Status"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "none"
  )

Insights:

The average training score of employees who met their KPI targets (64.5) was only slightly higher than that of employees who did not (62.5), a difference of about 2 points.
The boxplot shows a small upward shift in the training score distribution among employees who met their KPI targets, consistent with the summary statistics.
The variability in scores was similar for both groups (standard deviation ≈ 13.4), indicating that the two groups have a comparable spread.
According to the summary table, the interquartile range (IQR) for employees who did not meet their KPI targets is approximately 50–74, while the IQR for employees who met their KPI targets is approximately 53–77.
There is substantial overlap between these two ranges, indicating that many employees in both groups achieved similar training scores; therefore, training performance alone is insufficient to fully explain KPI success.
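
One way to express how small the 2-point gap is relative to the spread is a standardized effect size. The sketch below computes Cohen's d from the summary statistics in the table above; it is an illustration added here, not a step of the original analysis, and the variable names are hypothetical.

```r
# Summary statistics copied from the table above
n0 <- 11165; mean0 <- 62.46; sd0 <- 13.35   # Did Not Meet KPI
n1 <- 6250;  mean1 <- 64.47; sd1 <- 13.45   # Met KPI >80%

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n0 - 1) * sd0^2 + (n1 - 1) * sd1^2) / (n0 + n1 - 2))

# Cohen's d: mean difference expressed in pooled-SD units
cohens_d <- (mean1 - mean0) / sd_pooled
round(cohens_d, 2)
```

The result is roughly 0.15, which falls below the conventional 0.2 benchmark for a "small" effect and is in line with the heavy overlap between the two boxplots.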

3.5.2 Previous Year Rating Analysis vs KPI Achievement

This section examines the relationship between employee previous year performance (previous year rating) and KPI achievement status. This analysis helps determine whether previous year performance influences employee success in meeting KPI targets.

# Previous year rating vs KPI
rating_analysis <- df_clean %>%
  group_by(previous_year_rating, KPIs_met_more_than_80) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(previous_year_rating) %>%
  mutate(percentage = count / sum(count) * 100,
         KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI", "Did Not Meet"))

# Plot
ggplot(rating_analysis, aes(x = factor(previous_year_rating), y = percentage, 
                            fill = KPI_status)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
  labs(title = "KPI Achievement by Previous Year Rating",
       x = "Previous Year Rating", y = "Percentage (%)",
       fill = "KPI Status")

Insights:

Employees with higher previous year ratings demonstrate a greater proportion of KPI achievement.
The distribution indicates that past performance is positively associated with KPI achievement: employees who received higher ratings in the previous year were more likely to meet their KPI targets.
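
Because previous_year_rating is ordered, the rising share of KPI achievers could also be checked with a trend test for proportions. The sketch below uses `prop.trend.test()` from base R's stats package on hypothetical counts (they are not taken from the dataset) purely to show the mechanics.

```r
# Hypothetical counts for illustration only (not taken from df_clean)
met_kpi <- c(10, 20, 30, 45, 60)   # employees meeting the KPI at ratings 1..5
total   <- rep(100, 5)             # employees at each rating

# Chi-squared test for a linear trend in proportions across ordered groups
trend <- prop.trend.test(x = met_kpi, n = total)
trend$p.value
```

With the real data, `met_kpi` and `total` would come from the counts behind `rating_analysis`.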

3.5.3 Average Training Score by Previous Year Rating

# Plot
ggplot(df_clean,
       aes(x = factor(previous_year_rating),
           y = avg_training_score,
           fill = factor(previous_year_rating))) +

  geom_boxplot(alpha = 0.8,
               outlier.color = "#E74C3C") +

  labs(title = "Average Training Score by Previous Year Rating",
       x = "Previous Year Rating",
       y = "Average Training Score") +

  theme_minimal(base_size = 13) +

  theme(legend.position = "none")

Insights:

Employees who received performance ratings of 3 to 5 in the previous year had higher average training scores.
Employees with a performance rating of 1 had the lowest median training score, which may indicate weaker capability development; however, the upper whisker reaches almost 95, which shows that high scores also occur in the lowest-rated group.
In summary, the boxplots imply that training results are not entirely dependent on prior performance evaluations, as employees with low ratings in the previous year can still perform well in training.

3.5.4 Categorical Variables vs Target

# Function to create bar plots for categorical variables
plot_categorical_kpi <- function(data, var_name) {
  data %>%
    group_by(!!sym(var_name), KPIs_met_more_than_80) %>%
    summarise(count = n(), .groups = 'drop') %>%
    group_by(!!sym(var_name)) %>%
    mutate(percentage = count / sum(count) * 100,
           KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet")) %>%
    ggplot(aes(x = reorder(!!sym(var_name), -percentage * (KPIs_met_more_than_80 == 1)), 
               y = percentage, fill = KPI_status)) +
    geom_bar(stat = "identity", position = "stack", width = 0.7) +
    geom_text(aes(label = paste0(round(percentage, 1), "%")), 
              position = position_stack(vjust = 0.5), size = 3) +
    scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
    labs(title = paste("KPI Achievement by", var_name),
         x = var_name, y = "Percentage (%)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5, face = "bold"))
}

# Plot for each categorical variable
cat_vars <- c("department", "education", "recruitment_channel", "gender", "awards_won")

for (var in cat_vars) {
  print(plot_categorical_kpi(df_clean, var))
}

# H0 = There is no significant relationship between categorical variables and KPI achievement.
# H1 = There is a significant relationship between categorical variables and KPI achievement.

# Decision Rule: Reject H0, if p-value <0.05

# Chi-square tests
cat_chi_results <- map_df(cat_vars, function(var) {
  
  tbl <- table(df_clean[[var]], df_clean$KPIs_met_more_than_80)
  
  test <- chisq.test(tbl)
  
  # Store numeric p-value for correct sorting
  p_val <- test$p.value
  
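  # Format p-value for display only; always a character value so that
  # rows with large and small p-values bind into one column type
  p_formatted <- ifelse(
    p_val < 0.0001,
    "< 0.0001",
    format(round(p_val, 4), nsmall = 4)
  )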
  
  data.frame(
    Variable = var,
    Chi_Square = round(as.numeric(test$statistic), 2),
    P_Value = p_formatted,
    P_Value_Numeric = p_val,
    Significant = ifelse(p_val < 0.05, "Yes", "No")
  )
})

# Display results (with sorting)
kable(
  cat_chi_results %>%
    arrange(P_Value_Numeric) %>%
    select(-P_Value_Numeric),
  caption = "Chi-square Tests: Categorical Variables vs KPI Achievement"
)

Chi-square Tests: Categorical Variables vs KPI Achievement
Variable	Chi_Square	P_Value	Significant
department	292.33	< 0.0001	Yes
awards_won	191.77	< 0.0001	Yes
recruitment_channel	42.91	< 0.0001	Yes
education	40.25	< 0.0001	Yes
gender	28.61	< 0.0001	Yes

Insights:

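The chi-square tests indicate that all categorical variables are significantly associated with KPI achievement (p < 0.05).
department has the largest test statistic (χ² = 292.33), followed by awards_won, recruitment_channel, and education. Because the variables have different numbers of categories, the statistics are not directly comparable as measures of strength.
gender has the smallest statistic but is still significantly associated with KPI achievement.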
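
With more than 17,000 rows, even weak associations produce very small p-values, so an effect size helps put the chi-square results in context. The sketch below derives Cramér's V from the statistics reported in the table. It assumes that all 17,415 rows entered every test, and it ignores the continuity correction that `chisq.test()` applies to 2×2 tables by default; both are assumptions of this illustration rather than details confirmed by the source.

```r
# Chi-square statistics copied from the table above
chi_sq <- c(department = 292.33, awards_won = 191.77,
            recruitment_channel = 42.91, education = 40.25, gender = 28.61)

# Assumption: every test used all rows (11,165 + 6,250 employees)
n_obs <- 11165 + 6250

# The KPI target has two levels, so min(rows - 1, cols - 1) = 1 for each table
cramers_v <- sqrt(chi_sq / (n_obs * 1))
round(cramers_v, 3)
```

Under these assumptions every value stays below 0.2, which points to statistically significant but modest associations.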

3.6 Multivariate Analysis

3.6.1 Performance by gender and department

# Summary function
get_summary <- function(data, group_var) {
  data %>%
    group_by(.data[[group_var]]) %>%
    summarise(
      total_trainings = sum(no_of_trainings, na.rm = TRUE),
      avg_train_score       = mean(avg_training_score, na.rm = TRUE),
      kpi             = sum(KPIs_met_more_than_80 == 1, na.rm = TRUE),
      avg_tenure      = mean(length_of_service, na.rm = TRUE),
      avg_rating      = mean(previous_year_rating, na.rm = TRUE),
      avg_age         = mean(age, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    rename(category = 1)
}

# Focus only: Gender + Department
groups <- c("gender", "department")

summary_list <- lapply(groups, function(g) {
  get_summary(df_clean, g) %>%
    mutate(group = g)
})

combined_perf <- bind_rows(summary_list)


# Split metrics (NO scale mixing)
# Workforce metrics
workforce <- combined_perf %>%
  pivot_longer(
    cols = c(total_trainings, kpi),
    names_to = "metric",
    values_to = "value"
  )

# Performance metrics
performance <- combined_perf %>%
  pivot_longer(
    cols = c(avg_train_score, avg_tenure, avg_rating, avg_age),
    names_to = "metric",
    values_to = "value"
  )

# =========================
# Plot 1: Workforce (Gender + Department)
# =========================
p1 <- ggplot(workforce,
             aes(x = category,
                 y = value,
                 fill = metric)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  
  facet_wrap(~ group, scales = "free_x") +
  
  labs(title = "Workforce Overview by Gender and Department",
       x = "",
       y = "Count",
       fill = "Metric") +
  
  theme_minimal(base_size = 13) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.title = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

# =========================
# Plot 2: Performance (Gender + Department)
# =========================
p2 <- ggplot(performance,
             aes(x = category,
                 y = value,
                 fill = metric)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  
  facet_wrap(~ group, scales = "free_x") +
  
  labs(title = "Performance Metrics by Gender and Department",
       x = "",
       y = "Average Value",
       fill = "Metric") +
  
  theme_minimal(base_size = 13) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.title = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )


# Output
ggplotly(p1) %>%
  layout(
    legend = list(
      orientation = "v",
      x = 1,
      y = 0,
      font = list(size = 9),
      itemwidth = 30
    ),
    margin = list(b = 120)
  )

ggplotly(p2) %>%
  layout(
    legend = list(
      orientation = "v",
      x = 1,
      y = 0,
      font = list(size = 9),
      itemwidth = 30
    ),
    margin = list(b = 120)
  )

# ==========================================
# Multivariate Summary Table
# For Gender + Department Analysis
# ==========================================

# Create formatted summary table
summary_table <- combined_perf %>%
  
  mutate(
    avg_train_score  = round(avg_train_score, 2),
    avg_tenure = round(avg_tenure, 2),
    avg_rating = round(avg_rating, 2),
    avg_age    = round(avg_age, 2)
  ) %>%
  
  arrange(group, desc(avg_train_score)) %>%
  
  rename(
    Category           = category,
    Group              = group,
    `Total Trainings`  = total_trainings,
    `Avg Training Score` = avg_train_score,
    `KPI Achieved`     = kpi,
    `Avg Tenure`       = avg_tenure,
    `Avg Rating`       = avg_rating,
    `Avg Age`          = avg_age
  )

# Display table
kable(
  summary_table,
  caption = "Multivariate Performance Summary by Gender and Department",
  align = "c"
) %>%
  
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#2C3E50"
  ) %>%
  
  column_spec(1, bold = TRUE) %>%
  
  collapse_rows(
    columns = 8,
    valign = "top"
  )

Multivariate Performance Summary by Gender and Department
Category	Total Trainings	Avg Training Score	KPI Achieved	Avg Tenure	Avg Rating	Avg Age	Group
analytics	2281	84.57	679	5.00	3.47	32.41	department
r&d	438	84.45	149	4.80	3.66	32.89
technology	2740	79.85	783	5.84	3.14	35.03
procurement	2993	70.18	836	6.19	3.23	36.17
operations	4121	60.35	1553	6.43	3.63	36.15
finance	1059	60.33	319	5.01	3.49	32.60
legal	355	59.53	118	4.50	3.38	33.75
hr	892	50.39	300	5.63	3.51	34.25
sales & marketing	6903	50.06	1513	5.75	3.10	34.63
f	5992	63.68	1986	5.86	3.37	35.04	gender
m	15790	62.97	4264	5.78	3.30	34.71	gender

Insights:

Plot 1: Workforce Overview by Gender and Department:
- Sales & Marketing (6,903) and Operations (4,121) had the highest total numbers of trainings.
- Operations had the highest KPI count (1,553), slightly more than Sales & Marketing (1,513).
- R&D (438 trainings) and Legal (355 trainings) had the lowest training totals.
- Male employees recorded more total trainings (15,790) and more KPI achievements (4,264) than female employees (5,992 trainings; 1,986 KPI achievements). These are raw totals, so they partly reflect group size rather than performance.
Plot 2: Performance Metrics by Gender and Department:
- Analytics (84.57) and R&D (84.45) had the highest average training scores, followed by Technology (79.85).
- Sales & Marketing (50.06) and HR (50.39) had the lowest average training scores.
- The departments with the highest average tenure were Operations (6.43 years) and Procurement (6.19 years).
- Gender differences were small: female employees were slightly higher than male employees in average training score (63.68 vs. 62.97), rating (3.37 vs. 3.30), and age (35.04 vs. 34.71).
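
Because total_trainings and kpi are sums, larger groups tend to show larger values regardless of how well they perform. A rate per group separates the two effects. The sketch below illustrates the idea in base R on a made-up data frame (`toy` is hypothetical and not drawn from the dataset).

```r
# Toy data for illustration only (not drawn from df_clean)
toy <- data.frame(
  department = c("a", "a", "a", "a", "a", "a", "b", "b"),
  KPIs_met_more_than_80 = c(1, 1, 0, 0, 0, 0, 1, 0)
)

# Headcount and KPI count per department
headcount <- tapply(toy$KPIs_met_more_than_80, toy$department, length)
kpi_count <- tapply(toy$KPIs_met_more_than_80, toy$department, sum)

# KPI rate = share of each department's employees who met the target
kpi_summary <- data.frame(
  department = names(headcount),
  headcount  = as.vector(headcount),
  kpi_count  = as.vector(kpi_count),
  kpi_rate   = as.vector(kpi_count / headcount)
)
kpi_summary
```

In this toy example department "a" has the higher KPI count but the lower KPI rate, which is the kind of reversal that raw totals can hide.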

3.6.2 Correlation Matrix

# Select numeric variables
num_data <- df_clean %>%
  select(
    no_of_trainings,
    age,
    previous_year_rating,
    length_of_service,
    avg_training_score,
    KPIs_met_more_than_80
  )

# Correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")

cor_melt <- melt(cor_matrix)

# Correlation heatmap
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
  
  geom_tile(color = "white") +
  
  scale_fill_gradient2(
    low = "#E74C3C",   # -1 strong negative
    mid = "white",     # 0 no correlation
    high = "#2ECC71",  # +1 strong positive
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  
  geom_text(aes(label = round(value, 2)), size = 3) +
  
  theme_minimal() +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  
  ggtitle("Correlation Matrix of Employee Performance Variables")

Insights:

The moderate correlation between age and length of service (r = 0.64) is expected, since older employees tend to have served longer. Because the value does not exceed the commonly used multicollinearity threshold of 0.70, multicollinearity is not considered a significant concern.
Regarding the target variable (KPI achievement):
- Previous year rating (r = 0.33) shows the strongest correlation with the target, suggesting some stability in employee performance over time.
- Training-related variables showed minimal correlation, implying a limited direct linear impact on KPI performance.
Overall, the correlation analysis reveals that most numerical variables have weak linear correlations, indicating that employee performance is driven by several interconnected factors rather than a single predictor.

3.6.3 Feature Correlation with KPI Achievement

# Select numeric features
num_features <- df_clean %>%
  select(age,
         length_of_service,
         avg_training_score,
         no_of_trainings,
         previous_year_rating)

# Compute correlation with KPI
cor_results <- sapply(num_features, function(x) {
  cor(x, df_clean$KPIs_met_more_than_80, use = "complete.obs")
})

# Convert to data frame and rank
cor_ranked <- data.frame(
  feature = names(cor_results),
  correlation = cor_results
) %>%
  arrange(desc(abs(correlation)))

cor_ranked

# Plot 
ggplot(cor_ranked,
       aes(x = reorder(feature, correlation),
           y = correlation,
           fill = correlation)) +

  geom_bar(stat = "identity") +

  coord_flip() +

  scale_fill_gradient2(low = "#E74C3C",
                       mid = "white",
                       high = "#2ECC71") +

  labs(title = "Feature Correlation with KPI Achievement",
       x = "Feature",
       y = "Correlation Strength") +

  theme_minimal(base_size = 13)

Insights:

Previous year rating (0.32) was the only variable that showed a notable correlation with KPI achievement.
The correlation coefficients for all other numerical variables were close to zero, indicating that these variables had little or no linear relationship with KPI achievement.
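
Since previous_year_rating is an ordinal 1 to 5 scale and the target is binary, a rank-based coefficient can serve as a cross-check on the Pearson values. The sketch below compares the two with base R's `cor()` on made-up vectors (hypothetical values, not taken from the dataset).

```r
# Toy vectors for illustration only (not taken from df_clean)
rating  <- c(1, 2, 2, 3, 3, 4, 5, 5)
kpi_met <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Pearson correlation with a 0/1 variable is the point-biserial correlation
r_pearson  <- cor(rating, kpi_met, method = "pearson")

# Spearman uses ranks, so it only assumes a monotonic relationship
r_spearman <- cor(rating, kpi_met, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

If the two coefficients differ noticeably on the real data, that would suggest the relationship is monotonic but not linear.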

3.7 Insights Before Modeling

Based on EDA findings:
- Numerical variables such as avg_training_score and previous_year_rating are positively associated with KPI achievement, although the association is much weaker for avg_training_score. Employees with higher ratings from the previous year typically perform better on KPIs.
- length_of_service and no_of_trainings exhibit a right-skewed distribution, indicating that most employees have shorter tenure and have attended fewer training sessions.
- Categorical analysis reveals differences in employee performance across department, education, gender, awards_won and recruitment_channel. Chi-square tests confirm statistically significant associations between categorical variables and KPI achievement.
- Bivariate analysis indicates that employees who met their KPI targets had only slightly higher average training scores (64.5 vs. 62.5), with substantial overlap between the two groups, so training score alone is a weak predictor of KPI achievement.
- The relationship between previous_year_rating and avg_training_score reveals a potential nonlinear pattern, suggesting that performance may vary across different rating groups.
- Correlation analysis revealed a moderate correlation between KPI achievement and previous_year_rating, while all other numerical variables show weak correlations. This suggests that KPI performance is likely driven by a combination of nonlinear effects, categorical factors (such as department and education), and other contextual or behavioral variables not covered by the correlation analysis. The correlation between age and length_of_service is approximately 0.64; since this does not exceed the common threshold of 0.7, multicollinearity is not considered a concern.
Overall, the results of the Exploratory Data Analysis (EDA) indicate that employee demographics, training performance, previous year rating, and departmental characteristics may influence KPI achievement and should therefore be taken into account during the modeling phase.

4.0 Data Analysis & Modelling

This section applies statistical and machine learning techniques to uncover meaningful insights from the cleaned dataset. The goal is to identify key predictors, evaluate model performance, and generate reliable forecasts. By combining exploratory analysis with predictive modelling, we aim to transform raw data into actionable knowledge that supports decision‑making.

Question: Can employee KPI achievement (more than 80%) be predicted using demographic, training, and workplace-related variables?

4.1 Logistic Regression (Classification)

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

# ============================================================
# Logistic Regression — Employee Performance
# ============================================================

library(caret)
library(ggplot2)
library(pROC)
library(dplyr)
library(gridExtra)

# STEP 1 — Load data
df_clean <- read.csv("clean_employee_performance.csv",
                     stringsAsFactors = FALSE)

# STEP 2 — Remove redundant columns
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL

# STEP 3 — Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)

# STEP 4 — Convert target to factor
df_clean$KPIs_met_more_than_80 <- factor(
  df_clean$KPIs_met_more_than_80,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

4.1.1 Train-test split

# STEP 5 — Train/Test split
set.seed(42)

train_idx <- createDataPartition(
  df_clean$KPIs_met_more_than_80,
  p = 0.80,
  list = FALSE
)

train_data <- df_clean[train_idx, ]
test_data  <- df_clean[-train_idx, ]

cat(sprintf(
  "\nTrain-test split: Train rows = %d | Test rows = %d\n",
  nrow(train_data),
  nrow(test_data)
))

## 
## Train-test split: Train rows = 13932 | Test rows = 3483
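
For a factor outcome, `createDataPartition()` samples within each class, so the class balance should be nearly identical in both partitions. The sketch below mimics that idea in base R on a made-up target with a similar share of "Yes" cases (about 36%); it is an approximation for illustration, not caret's actual implementation.

```r
set.seed(42)

# Toy target for illustration only: 640 "No" and 360 "Yes"
y <- factor(rep(c("No", "Yes"), times = c(640, 360)))

# Sample 80% of the row indices within each class (stratified split)
toy_train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

# Class proportions in the training and test partitions
prop_train <- prop.table(table(y[toy_train_idx]))
prop_test  <- prop.table(table(y[-toy_train_idx]))
rbind(train = prop_train, test = prop_test)
```

On the real split, the equivalent check would be `prop.table(table(train_data$KPIs_met_more_than_80))` and the same call on `test_data`.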

# STEP 6 — Build Logistic Regression Model
log_model <- glm(
  KPIs_met_more_than_80 ~ .,
  data = train_data,
  family = binomial
)

cat("\nLogistic Regression Model Successfully Built\n")

## 
## Logistic Regression Model Successfully Built

# Logistic Regression Coefficients
coef_table <- coef(summary(log_model))

coef_table

##                                  Estimate  Std. Error      z value
## (Intercept)                 -1.8726908529 0.431505245  -4.33990288
## departmentfinance           -0.2105938388 0.147293437  -1.42975711
## departmenthr                -0.5132270704 0.179556152  -2.85830958
## departmentlegal             -0.5277242420 0.186081945  -2.83597768
## departmentoperations        -0.0532980542 0.127460804  -0.41815250
## departmentprocurement       -0.0458793980 0.103562297  -0.44301256
## departmentr&d                0.0161819184 0.143661783   0.11263899
## departmentsales & marketing -0.6022266990 0.161795217  -3.72215390
## departmenttechnology        -0.0148637348 0.086201295  -0.17243053
## regionregion_10              0.0671279365 0.263124382   0.25511865
## regionregion_11              0.2313820004 0.235887456   0.98089998
## regionregion_12              0.2578670103 0.286428533   0.90028395
## regionregion_13              0.3659097187 0.219969861   1.66345388
## regionregion_14              0.1735354117 0.259410064   0.66896175
## regionregion_15              0.2028051695 0.221605549   0.91516286
## regionregion_16              0.0663256894 0.237037188   0.27981132
## regionregion_17              0.4803135904 0.253838560   1.89220105
## regionregion_18              0.8849238806 0.657066054   1.34678070
## regionregion_19              0.1907905709 0.249522039   0.76462412
## regionregion_2               0.3764650829 0.208562610   1.80504590
## regionregion_20             -0.0080864697 0.262748407  -0.03077647
## regionregion_21              0.1220179978 0.313709285   0.38895246
## regionregion_22              0.4826413301 0.210590211   2.29185074
## regionregion_23              0.1894869199 0.242369426   0.78181033
## regionregion_24             -0.1869001176 0.301773486  -0.61933910
## regionregion_25              0.0842851631 0.254194940   0.33157687
## regionregion_26              0.2460264710 0.223927289   1.09868910
## regionregion_27              0.2690104367 0.230338312   1.16789272
## regionregion_28              0.4819508206 0.234790104   2.05268796
## regionregion_29              0.5306506468 0.249550252   2.12642801
## regionregion_3               0.5671508208 0.310623474   1.82584662
## regionregion_30              0.1148503592 0.265673317   0.43229919
## regionregion_31              0.1907524303 0.225856913   0.84457202
## regionregion_32             -0.1357961886 0.251003424  -0.54101329
## regionregion_33              0.1529711524 0.340530017   0.44921489
## regionregion_34              0.2866883020 0.301546987   0.95072514
## regionregion_4               0.8356679163 0.226555755   3.68857509
## regionregion_5              -0.2321345830 0.268661634  -0.86404069
## regionregion_6               0.2394558697 0.268470704   0.89192551
## regionregion_7               0.4153636496 0.212857180   1.95137251
## regionregion_8               0.0877609175 0.265669036   0.33033928
## regionregion_9              -0.5273401108 0.345457214  -1.52649906
## educationbelow secondary     0.1800414620 0.146323260   1.23043637
## educationmasters & above     0.0434159765 0.048220704   0.90035965
## educationunknown            -0.2307760401 0.103000502  -2.24053317
## genderm                     -0.0271444866 0.044163908  -0.61463053
## recruitment_channelreferred  0.4114742993 0.138767848   2.96519910
## recruitment_channelsourcing -0.0037838948 0.039001401  -0.09701946
## no_of_trainings             -0.1446831935 0.034417659  -4.20374882
## age                          0.0004432747 0.003607434   0.12287813
## previous_year_rating         0.6290713128 0.018128685  34.70032844
## length_of_service           -0.0641080371 0.006283295 -10.20293218
## awards_won                   1.4198891674 0.131798260  10.77320117
## avg_training_score          -0.0071499976 0.004227949  -1.69112658
##                                  Pr(>|z|)
## (Intercept)                  1.425457e-05
## departmentfinance            1.527867e-01
## departmenthr                 4.259046e-03
## departmentlegal              4.568564e-03
## departmentoperations         6.758356e-01
## departmentprocurement        6.577567e-01
## departmentr&d                9.103168e-01
## departmentsales & marketing  1.975306e-04
## departmenttechnology         8.630991e-01
## regionregion_10              7.986315e-01
## regionregion_11              3.266421e-01
## regionregion_12              3.679692e-01
## regionregion_13              9.622162e-02
## regionregion_14              5.035199e-01
## regionregion_15              3.601061e-01
## regionregion_16              7.796223e-01
## regionregion_17              5.846420e-02
## regionregion_18              1.780509e-01
## regionregion_19              4.444954e-01
## regionregion_2               7.106750e-02
## regionregion_20              9.754478e-01
## regionregion_21              6.973113e-01
## regionregion_22              2.191426e-02
## regionregion_23              4.343261e-01
## regionregion_24              5.356930e-01
## regionregion_25              7.402088e-01
## regionregion_26              2.719037e-01
## regionregion_27              2.428500e-01
## regionregion_28              4.010285e-02
## regionregion_29              3.346764e-02
## regionregion_3               6.787337e-02
## regionregion_30              6.655240e-01
## regionregion_31              3.983498e-01
## regionregion_32              5.884984e-01
## regionregion_33              6.532767e-01
## regionregion_34              3.417439e-01
## regionregion_4               2.255135e-04
## regionregion_5               3.875655e-01
## regionregion_6               3.724329e-01
## regionregion_7               5.101275e-02
## regionregion_8               7.411436e-01
## regionregion_9               1.268856e-01
## educationbelow secondary     2.185337e-01
## educationmasters & above     3.679289e-01
## educationunknown             2.505633e-02
## genderm                      5.387987e-01
## recruitment_channelreferred  3.024871e-03
## recruitment_channelsourcing  9.227109e-01
## no_of_trainings              2.625303e-05
## age                          9.022036e-01
## previous_year_rating        7.788889e-264
## length_of_service            1.923752e-24
## awards_won                   4.606990e-27
## avg_training_score           9.081263e-02
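
The estimates above are on the log-odds scale, which is hard to interpret directly. Exponentiating a coefficient gives an odds ratio. The sketch below applies this to four of the coefficients with the smallest p-values, copied from the table; the vector name `log_odds` is hypothetical.

```r
# Selected coefficients (log-odds) copied from the table above
log_odds <- c(previous_year_rating = 0.6290713128,
              awards_won           = 1.4198891674,
              length_of_service    = -0.0641080371,
              no_of_trainings      = -0.1446831935)

# exp() converts a log-odds coefficient into an odds ratio
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)
```

Under this reading, each additional point of previous_year_rating multiplies the odds of meeting the KPI target by roughly 1.88, holding the other predictors fixed, while odds ratios below 1 (length_of_service, no_of_trainings) indicate lower odds. On the fitted model itself, the same conversion would be `exp(coef(log_model))`.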

4.1.2 Confusion Matrix

# STEP 7 — Predictions
pred_prob <- predict(log_model,
                     newdata = test_data,
                     type = "response")

pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")

pred_class <- factor(pred_class,
                     levels = c("No", "Yes"))

# STEP 8 — Evaluation
cm <- confusionMatrix(
  pred_class,
  test_data$KPIs_met_more_than_80,
  positive = "Yes"
)

# Visual confusion matrix

cm_table <- as.data.frame(cm$table)

ggplot(cm_table,
       aes(
         x = Reference,
         y = Prediction,
         fill = Freq
       )) +
  
  geom_tile(color = "white", linewidth = 1) +
  
  geom_text(
    aes(label = Freq),
    color = "white",
    size = 8,
    fontface = "bold"
  ) +
  
  scale_fill_gradient(
    low = "#9BBCE0",
    high = "#1A5FA8"
  ) +
  
  labs(
    title = "Confusion Matrix",
    subtitle = "Predicted vs Actual Classes",
    x = "Actual Class",
    y = "Predicted Class",
    fill = "Count"
  ) +
  
  theme_minimal(base_size = 16)

acc  <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1   <- round(cm$byClass["F1"] * 100, 2)


cat(
  "Accuracy:", acc, "\n",
  "Sensitivity:", sens, "\n",
  "Specificity:", spec, "\n",
  "F1 Score:", f1, "\n"
)

## Accuracy: 71.03 
##  Sensitivity: 43.36 
##  Specificity: 86.52 
##  F1 Score: 51.79

4.1.3 Receiver Operating Characteristic - Area Under the Curve (ROC-AUC)

# STEP 9 — ROC-AUC

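roc_obj <- roc(
  response = test_data$KPIs_met_more_than_80,
  predictor = pred_prob,
  levels = c("No", "Yes"),   # controls ("No") first, cases ("Yes") second
  direction = "<"            # higher predicted probability indicates "Yes"
)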

auc_val <- round(auc(roc_obj), 4)

roc_df <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#1A5FA8", linewidth = 1.1) +
  
  geom_abline(
    slope = 1,
    intercept = 0,
    linetype = "dashed",
    colour = "grey60",
    linewidth = 0.7
  ) +
  
  annotate(
    "text",
    x = 0.65,
    y = 0.12,
    label = sprintf("AUC = %.4f", auc_val),
    size = 5,
    fontface = "bold",
    colour = "#1A5FA8"
  ) +
  
  labs(
    title = "ROC Curve",
    subtitle = "Receiver Operating Characteristic – Test Set",
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)"
  ) +
  
  theme_minimal(base_size = 14) +
  coord_equal()

4.1.4 Model Evaluation

# ============================================================
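# Logistic Regression Feature Importance
# Note: raw coefficients depend on each predictor's scale (dummy variables,
# counts, years), so ranking by absolute value is indicative only.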
# ============================================================

coef_df <- data.frame(
  Feature = names(coef(log_model)),
  Coefficient = coef(log_model)
)

coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  
  mutate(
    Abs_Coefficient = abs(Coefficient),
    
    Direction = ifelse(
      Coefficient > 0,
      "Positive",
      "Negative"
    )
  ) %>%
  
  arrange(desc(Abs_Coefficient))

top_coef_df <- coef_df %>%
  slice_max(
    order_by = Abs_Coefficient,
    n = 10
  )

# ============================================================
# Visual: Logistic Regression Feature Importance
# ============================================================

p1 <- ggplot(
  top_coef_df,
  
  aes(
    x = reorder(
      Feature,
      Abs_Coefficient
    ),
    
    y = Abs_Coefficient,
    fill = Direction
  )
) +
  
  geom_bar(
    stat = "identity"
  ) +
  
  coord_flip() +
  
  labs(
    title = "Logistic Regression Feature Importance",
    subtitle = "Top 10 absolute coefficients",
    x = "Feature",
    y = "Absolute Coefficient"
  ) +
  
  theme_minimal(base_size = 12)

p1

# ============================================================
# Visual: Class Distribution
# ============================================================

class_tbl <- table(df_clean$KPIs_met_more_than_80)

class_pct <- round(
  prop.table(class_tbl) * 100,
  1
)

class_df <- data.frame(
  Class = names(class_tbl),
  Count = as.integer(class_tbl),
  Percent = as.numeric(class_pct)
)

p2 <- ggplot(
  class_df,
  
  aes(
    x = Class,
    y = Count,
    fill = Class
  )
) +
  
  geom_bar(
    stat = "identity",
    width = 0.6,
    show.legend = FALSE
  ) +
  
  geom_text(
    aes(
      label = paste0(
        Count,
        "\n(",
        Percent,
        "%)"
      )
    ),

    vjust = -0.3,
    fontface = "bold",
    size = 4.5
  ) +
  
  labs(
    title = "Class Distribution of Target Variable",
    subtitle = "KPIs met more than 80%",
    x = "KPIs Met > 80%",
    y = "Count"
  ) +
  
  ylim(0, max(class_df$Count) * 1.18) +
  theme_minimal(base_size = 14)

p2

# ============================================================
# Metrics Table
# ============================================================

precision <- round(cm$byClass["Precision"] * 100, 2)

metrics_df <- data.frame(
  Metric = c("Accuracy",
             "F1 Score",
             "Precision",
             "Sensitivity",
             "Specificity"),

  Value = c(acc,
            f1,
            precision,
            sens,
            spec)
)

# ============================================================
# Visual: Performance Metrics
# ============================================================

p3 <- ggplot(
  metrics_df,
  
  aes(
    x = Metric,
    y = Value,
    fill = Metric
  )
) +
  
  geom_bar(
    stat = "identity",
    show.legend = FALSE
  ) +
  
  geom_text(
    aes(
      label = paste0(
        Value,
        "%"
      )
    ),
    
    vjust = -0.4,
    fontface = "bold",
    size = 6
  ) +
  
  ylim(0, 110) +
  
  labs(
    title = "Model Performance Metrics",
    subtitle = "Evaluated on test set",
    x = "Metric",
    y = "Score (%)"
  ) +
  
  theme_minimal(base_size = 18)

p3

# ============================================================
# Predicted Probability Data
# ============================================================

prob_df <- data.frame(
  Probability = pred_prob,
  Actual_Class = test_data$KPIs_met_more_than_80
)

# ============================================================
# Visual: Predicted Probability Distribution
# ============================================================

p4 <- ggplot(
  prob_df,
  
  aes(
    x = Probability,
    fill = Actual_Class
  )
) +
  
  geom_histogram(
    binwidth = 0.05,
    alpha = 0.75,
    position = "identity",
    colour = "white"
  ) +
  
  labs(
    title = "Predicted Probability Distribution",
    subtitle = "Probability of KPIs > 80% by actual class",
    x = "Predicted probability (Yes)",
    y = "Count",
    fill = "Actual class"
  ) +
  
  theme_minimal(base_size = 18)

p4

# ============================================================
# Visual: Confusion Matrix
# ============================================================

cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")

p5 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), colour = "white", fontface = "bold", size = 7) +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(
    title = "Confusion Matrix",
    subtitle = "Predicted vs Actual classes",
    x = "Actual",
    y = "Predicted"
  ) +
  theme_minimal(base_size = 18)

# ============================================================
# Visual: ROC Curve
# ============================================================

p6 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "ROC Curve",
    subtitle = paste("AUC =", auc_val),
    x = "FPR",
    y = "TPR"
  ) +
  theme_minimal(base_size = 18)

library(grid)
library(gridExtra)
# ============================================================
# Save Logistic Regression Metrics
# ============================================================

acc  <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1   <- round(cm$byClass["F1"] * 100, 2)

logistic_acc  <- acc
logistic_sens <- sens
logistic_spec <- spec
logistic_f1   <- f1
logistic_auc  <- auc_val

# ============================================================
# Final Logistic Regression Dashboard
# ============================================================

logistic_dashboard <- arrangeGrob(
  p2, p5, p1,
  p3, p6, p4,
  ncol = 3,
  nrow = 2,
  widths = c(1.1, 1.1, 1.4),
  heights = c(1, 1)
)

grid::grid.draw(logistic_dashboard)

# ============================================================
# Save Dashboard
# ============================================================

ggsave(
  "logistic_regression_dashboard.png",
  plot = logistic_dashboard,
  width = 22,
  height = 14,
  dpi = 300
)

4.2 Random Forest (Classification)

# ============================================================
#  Random Forest — Employee Performance
# ============================================================

# ── 0. Install & load packages ────────────────────────────
required_packages <- c("randomForest", "caret", "ggplot2",
                       "reshape2", "pROC", "dplyr", "gridExtra")

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

library(randomForest)
library(caret)
library(ggplot2)
library(reshape2)
library(pROC)
library(dplyr)
library(gridExtra)


# ============================================================
#  STEP 1 — Load data
# ============================================================
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)
cat(sprintf("Loaded: %d rows  x  %d columns\n\n", nrow(df_clean), ncol(df_clean)))

## Loaded: 17415 rows  x  14 columns

# ============================================================
#  STEP 2 — Fix issue 1: drop redundant / derived columns
# ============================================================
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group                 <- NULL
cat("Step 2: Dropped redundant columns: avg_training_score_scaled, age_group\n")

## Step 2: Dropped redundant columns: avg_training_score_scaled, age_group

# ============================================================
#  STEP 3 — Fix issue 2: convert character columns to factors
# ============================================================
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
cat("Step 3: Converted to factor:", paste(char_cols, collapse = ", "), "\n")

## Step 3: Converted to factor: department, region, education, gender, recruitment_channel

cat("        'unknown' kept as a valid factor level in 'education'\n")

##         'unknown' kept as a valid factor level in 'education'

# ============================================================
#  STEP 4 — Fix issue 3: convert target to factor
# ============================================================
df_clean$KPIs_met_more_than_80 <- factor(df_clean$KPIs_met_more_than_80,
                                         levels = c(0, 1),
                                         labels = c("No", "Yes"))
cat("Step 4: Target 'KPIs_met_more_than_80' converted to factor\n")

## Step 4: Target 'KPIs_met_more_than_80' converted to factor

# ============================================================
#  STEP 5 — Handle NA values
# ============================================================
na_total <- sum(is.na(df_clean))
cat(sprintf("\nStep 5: Total NAs found: %d\n", na_total))

## 
## Step 5: Total NAs found: 0

if (na_total > 0) {
  df_clean <- na.roughfix(df_clean)
  cat("        na.roughfix() applied\n")
} else {
  cat("        No NAs — skipping imputation\n")
}

##         No NAs — skipping imputation

# ============================================================
#  STEP 6 — Class imbalance check & compute class weights
# ============================================================
cat("\nStep 6: Class distribution\n")

## 
## Step 6: Class distribution

class_tbl <- table(df_clean$KPIs_met_more_than_80)
print(class_tbl)

## 
##    No   Yes 
## 11165  6250

class_pct <- round(prop.table(class_tbl) * 100, 1)
print(class_pct)

## 
##   No  Yes 
## 64.1 35.9

n_no  <- as.integer(class_tbl["No"])
n_yes <- as.integer(class_tbl["Yes"])
wt_no  <- 1
wt_yes <- round(n_no / n_yes, 2)
class_weights <- c("No" = wt_no, "Yes" = wt_yes)
cat(sprintf("        Class weights — No: %.2f  |  Yes: %.2f\n", wt_no, wt_yes))

##         Class weights — No: 1.00  |  Yes: 1.79

4.2.1 Train-test split

# ============================================================
#  STEP 7 — Train / Test split
# ============================================================
set.seed(42)
train_idx  <- createDataPartition(df_clean$KPIs_met_more_than_80,
                                  p = 0.80, list = FALSE)
train_data <- df_clean[ train_idx, ]
test_data  <- df_clean[-train_idx, ]
cat(sprintf("\nStep 7: Train rows: %d  |  Test rows: %d\n",
            nrow(train_data), nrow(test_data)))

## 
## Step 7: Train rows: 13932  |  Test rows: 3483

# ============================================================
#  STEP 8 — Build Random Forest model
# ============================================================
set.seed(42)
n_features <- ncol(train_data) - 1
mtry_val   <- floor(sqrt(n_features))

cat(sprintf("\nStep 8: Training Random Forest  (ntree=500, mtry=%d) ...\n",
            mtry_val))

## 
## Step 8: Training Random Forest  (ntree=500, mtry=3) ...

rf_model <- randomForest(
  KPIs_met_more_than_80 ~ .,
  data       = train_data,
  ntree      = 500,
  mtry       = mtry_val,
  importance = TRUE,
  classwt    = class_weights
)

print(rf_model)

## 
## Call:
##  randomForest(formula = KPIs_met_more_than_80 ~ ., data = train_data,      ntree = 500, mtry = mtry_val, importance = TRUE, classwt = class_weights) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 31.77%
## Confusion matrix:
##       No  Yes class.error
## No  6693 2239   0.2506717
## Yes 2187 2813   0.4374000
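
The OOB figures printed above can be re-derived by hand. The following is a minimal sketch in base R that copies the four counts from the printed OOB confusion matrix (rows are actual classes, columns are predicted classes); the object names `oob`, `class_error` and `oob_error` are illustrative and do not come from the original script.

```r
# OOB confusion matrix copied from the print(rf_model) output above
oob <- matrix(
  c(6693, 2239,
    2187, 2813),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("No", "Yes"), predicted = c("No", "Yes"))
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all training rows that were misclassified
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
cat(sprintf("OOB error: %.2f%%\n", 100 * oob_error))
```

The per-class errors show that the forest misclassifies the minority "Yes" class far more often than the "No" class, even with class weights applied.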

4.2.2 Confusion Matrix

# ============================================================
#  STEP 9 — Predict on test set
# ============================================================
preds_class <- predict(rf_model, newdata = test_data)
preds_prob  <- predict(rf_model, newdata = test_data, type = "prob")


# ============================================================
#  STEP 10 — Performance Evaluation
# ============================================================
cm <- confusionMatrix(preds_class,
                      test_data$KPIs_met_more_than_80,
                      positive = "Yes")
print(cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1734  538
##        Yes  499  712
##                                           
##                Accuracy : 0.7023          
##                  95% CI : (0.6868, 0.7174)
##     No Information Rate : 0.6411          
##     P-Value [Acc > NIR] : 1.346e-14       
##                                           
##                   Kappa : 0.3485          
##                                           
##  Mcnemar's Test P-Value : 0.238           
##                                           
##             Sensitivity : 0.5696          
##             Specificity : 0.7765          
##          Pos Pred Value : 0.5879          
##          Neg Pred Value : 0.7632          
##              Prevalence : 0.3589          
##          Detection Rate : 0.2044          
##    Detection Prevalence : 0.3477          
##       Balanced Accuracy : 0.6731          
##                                           
##        'Positive' Class : Yes             
##

acc       <- round(cm$overall["Accuracy"]     * 100, 2)
kappa     <- round(cm$overall["Kappa"]        * 100, 2)
sens      <- round(cm$byClass["Sensitivity"]  * 100, 2)
spec      <- round(cm$byClass["Specificity"]  * 100, 2)
precision <- round(cm$byClass["Precision"]    * 100, 2)
f1        <- round(cm$byClass["F1"]           * 100, 2)

roc_obj <- roc(response  = test_data$KPIs_met_more_than_80,
               predictor = preds_prob[, "Yes"],
               levels    = c("No", "Yes"),
               direction = "<")
auc_val <- round(auc(roc_obj), 4)


# ============================================================
#  Visualize Confusion Matrix
# ============================================================
# Convert the confusion matrix from Step 10 to a data frame for plotting
# (caret and ggplot2 are already loaded in Step 0)
cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")

# Plot confusion matrix heatmap
ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title    = "Confusion Matrix Heatmap",
       subtitle = "Model predictions vs actual classes",
       x = "Actual Class", y = "Predicted Class",
       fill = "Count") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")

4.2.3 ROC Curve and AUC

# ============================================================
#  ROC Curve
# ============================================================
roc_obj <- roc(response = test_data$KPIs_met_more_than_80,
               predictor = preds_prob[, "Yes"],
               levels = c("No", "Yes"),
               direction = "<")

roc_df <- data.frame(FPR = 1 - roc_obj$specificities,
                     TPR = roc_obj$sensitivities)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#1A5FA8", linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey60", linewidth = 0.7) +
  annotate("text", x = 0.65, y = 0.12,
           label = sprintf("AUC = %.4f", auc(roc_obj)),
           size = 5, fontface = "bold", colour = "#1A5FA8") +
  labs(title = "ROC Curve",
       subtitle = "Receiver Operating Characteristic — test set",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  theme_minimal(base_size = 14) +
  coord_equal()

4.2.4 Model Evaluation

# ============================================================
#  Feature Importance (Mean Decrease Gini)
# ============================================================
imp_mat <- importance(rf_model)
imp_df  <- data.frame(
  Feature  = rownames(imp_mat),
  MeanDecreaseAccuracy = imp_mat[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini     = imp_mat[, "MeanDecreaseGini"]
)

imp_df <- imp_df[order(imp_df$MeanDecreaseGini), ]
imp_df$Feature <- factor(imp_df$Feature, levels = imp_df$Feature)

ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = round(MeanDecreaseGini, 1)),
            hjust = -0.15, size = 3.5) +
  scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
  coord_flip() +
  labs(title = "Feature Importance (Mean Decrease Gini)",
       subtitle = "Higher = more important for node purity",
       y = "Mean Decrease Gini") +
  theme_minimal(base_size = 14)

# ============================================================
#  Class Distribution Bar Chart
# ============================================================
class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(prop.table(class_tbl) * 100, 1)
class_df <- data.frame(Class = names(class_tbl),
                       Count = as.integer(class_tbl),
                       Percent = as.numeric(class_pct))

ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
            vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Class Distribution of Target Variable",
       subtitle = "KPIs met more than 80%",
       x = "KPIs Met > 80%", y = "Count") +
  theme_minimal(base_size = 12)

# ============================================================
#  Performance Metrics Bar Chart
# ============================================================
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision", "F1 Score"),
  Value  = c(acc, sens, spec, precision, f1)
)

ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", Value)),
            vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Model Performance Metrics",
       subtitle = "Evaluated on test set",
       y = "Score (%)") +
  ylim(0, 110) +
  theme_minimal(base_size = 14)

# ============================================================
#  Predicted Probability Histogram
# ============================================================
prob_df <- data.frame(
  Probability  = preds_prob[, "Yes"],
  Actual_Class = test_data$KPIs_met_more_than_80
)

ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75,
                 position = "identity", colour = "white") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Predicted Probability Distribution",
       subtitle = "Probability of KPIs > 80% by actual class",
       x = "Predicted probability (Yes)", y = "Count",
       fill = "Actual class") +
  theme_minimal(base_size = 14)

# ============================================================
#  Create a Dashboard
# ============================================================

library(gridExtra)

# p1: Class distribution
p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
            vjust = -0.3, size = 3.2, fontface = "bold") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  ylim(0, max(class_df$Count) * 1.18) +
  labs(title = "Class Distribution",
       x = "Class", y = "Count") +
  theme_minimal(base_size = 11)

# p2: Confusion matrix
p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = Freq), colour = "white",
            fontface = "bold", size = 4) +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title = "Confusion Matrix",
       x = "Actual", y = "Predicted") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "right")

# p3: Feature importance
p3 <- ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = round(MeanDecreaseGini, 1)),
            hjust = -0.15, size = 2.8) +
  scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
  coord_flip() +
  labs(title = "Feature Importance",
       x = "Feature", y = "Mean Decrease Gini") +
  theme_minimal(base_size = 10)

# p4: Performance metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", Value)),
            vjust = -0.3, size = 3.2, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  ylim(0, 110) +
  labs(title = "Performance Metrics",
       x = NULL, y = "Score (%)") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

# p5: ROC curve
p5 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(linewidth = 0.9) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "ROC Curve",
       subtitle = paste("AUC =", auc_val),
       x = "FPR", y = "TPR") +
  theme_minimal(base_size = 11)

# p6: Probability histogram
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75,
                 position = "identity", colour = "white") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Predicted Probability",
       x = "Probability", y = "Count",
       fill = "Actual Class") +
  theme_minimal(base_size = 11)

combined_dashboard <- grid.arrange(
  p1, p2, p3,
  p4, p5, p6,
  ncol = 3,
  nrow = 2
)

ggsave(
  "dashboard_overview.png",
  plot = combined_dashboard,
  width = 20,
  height = 12,
  dpi = 300
)

4.3 Insights and Comparison Table

4.3.1 Class Distribution

The class distribution analysis showed that the dataset was moderately imbalanced: 11,165 employees (64.1%) did not meet more than 80% of their KPIs (“No”), while 6,250 (35.9%) did (“Yes”). This imbalance can bias classifiers toward the majority class, and it means that overall accuracy on its own is a weak yardstick. Sensitivity, specificity, precision, F1-score, and AUC were therefore analysed alongside overall accuracy to give a more balanced and reliable assessment of model performance.
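
As a quick check on what this imbalance implies, the sketch below recomputes two quantities from the class counts reported in Section 4.2: the accuracy of a trivial model that always predicts “No”, and the inverse-frequency weight used for the “Yes” class. The variable names are illustrative; only the two counts come from the analysis.

```r
# Class counts reported in Step 6 of the Random Forest script
n_no  <- 11165
n_yes <- 6250

# Accuracy of a model that always predicts the majority class ("No")
baseline_accuracy <- n_no / (n_no + n_yes)

# Inverse-frequency weight for the minority class, relative to "No" = 1
weight_yes <- n_no / n_yes

cat(sprintf("Majority-class baseline accuracy: %.2f%%\n", 100 * baseline_accuracy))
cat(sprintf("Weight for 'Yes': %.2f\n", weight_yes))
```

Any accuracy figure reported for the models should be read against this baseline rather than against 50%.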

4.3.2 Train/Test Split

The dataset was divided into training and testing subsets using an 80:20 ratio: 80% of the data was used to train the Logistic Regression and Random Forest models, and the remaining 20% was held out for evaluation. For the Random Forest, this gave 13,932 training rows and 3,483 test rows, with createDataPartition() sampling within each class so that both subsets keep roughly the original 64:36 class ratio. A held-out test set does not prevent overfitting, but it exposes it, because the models are scored on data they did not see during training. The resulting metrics are therefore a more realistic indication of generalisation capability in employee KPI prediction tasks.
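
To make the idea of a stratified split concrete, here is a minimal base-R sketch that samples 80% of the row indices within each class. It uses synthetic labels with the same class counts as the dataset and is not a reimplementation of caret's createDataPartition(); the names `labels`, `train_idx`, `train_labels` and `test_labels` are assumptions made for illustration.

```r
set.seed(42)

# Synthetic target with the class counts reported for the dataset
labels <- factor(rep(c("No", "Yes"), times = c(11165, 6250)))

# Sample 80% of the row indices separately within each class
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    sample(idx, size = round(0.8 * length(idx)))
  }),
  use.names = FALSE
)

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

print(table(train_labels))
print(round(prop.table(table(test_labels)) * 100, 1))
```

Sampling within each class keeps the “Yes” share close to 35.9% in both subsets, which a purely random split only achieves approximately.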

4.3.3 Confusion Matrix

The confusion matrix analysis revealed clear differences between the Logistic Regression and Random Forest models at the default 0.5 classification threshold. Logistic Regression produced more false negatives: its sensitivity of 43.36% means that more than half of the employees who actually achieved more than 80% of their KPIs were classified as not having done so, so the model was conservative in predicting high performers. Random Forest reduced the number of false negatives (538 of the 1,250 actual “Yes” cases in the test set, a sensitivity of 56.96%) but produced more false positives (499 of the 2,233 actual “No” cases), with specificity of 77.65% against 86.52% for Logistic Regression. Overall, Logistic Regression made fewer incorrect positive predictions, whereas Random Forest showed more balanced classification behaviour and is the more useful of the two when the goal is to identify genuine high-performing employees.
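
The metrics discussed here all follow from the four cells of a confusion matrix. As a worked example, the sketch below recomputes them in base R from the Random Forest test-set counts printed in Section 4.2.2; the variable names are illustrative, and caret's confusionMatrix() remains the source of the reported values.

```r
# Counts from the Random Forest test-set confusion matrix ("Yes" is positive)
tp <- 712   # predicted Yes, actual Yes
fn <- 538   # predicted No,  actual Yes
fp <- 499   # predicted Yes, actual No
tn <- 1734  # predicted No,  actual No

accuracy    <- (tp + tn) / (tp + fn + fp + tn)
sensitivity <- tp / (tp + fn)          # recall for the "Yes" class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(100 * c(
  Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity,
  Precision = precision, F1 = f1, Balanced_Accuracy = balanced_accuracy
), 2))
```

The same arithmetic applies to the Logistic Regression matrix once its four counts are read from `cm$table`.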

4.3.4 ROC Curve and AUC

Both models performed clearly better than random guessing (AUC = 0.5) based on their ROC curves and AUC values. Logistic Regression achieved a slightly higher AUC of 0.7402 compared to 0.7327 for Random Forest, indicating marginally stronger class separation across all probability thresholds, although a difference of 0.0075 is small. AUC is threshold-independent, whereas sensitivity and F1-score are measured at a single threshold (0.5 here). This is why Random Forest, despite its slightly lower AUC, achieved higher sensitivity and F1-score and was more effective at identifying employees who achieved KPI targets, while Logistic Regression produced more conservative predictions with fewer false positives.
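
One way to read an AUC value is as the probability that a randomly chosen “Yes” employee receives a higher predicted score than a randomly chosen “No” employee. The sketch below illustrates that interpretation in base R on six made-up scores; `auc_rank()` is a hypothetical helper written for this example and is not part of pROC.

```r
# Hypothetical helper: AUC as the share of (positive, negative) pairs in which
# the positive case has the higher score; ties count as half a win
auc_rank <- function(scores, labels, positive = "Yes") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  wins <- outer(pos, neg, FUN = function(p, q) (p > q) + 0.5 * (p == q))
  mean(wins)
}

# Toy data, invented for illustration only
scores <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
labels <- c("Yes", "Yes", "Yes", "No", "No", "No")

toy_auc <- auc_rank(scores, labels)
cat(sprintf("Toy AUC: %.4f\n", toy_auc))
```

Here eight of the nine positive-negative pairs are ordered correctly, so the toy AUC is 8/9. Under this reading, an AUC of about 0.74 means roughly three in four such pairs are ranked correctly.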

4.3.5 Model Performance Metrics

The model performance metrics highlighted the trade-offs between Logistic Regression and Random Forest. Logistic Regression achieved slightly higher overall accuracy (71.03% vs 70.23%) and substantially higher specificity (86.52% vs 77.65%), indicating stronger performance in correctly identifying employees who did not meet KPI targets. Random Forest achieved noticeably higher sensitivity (56.96% vs 43.36%) and a higher F1-score (57.86% vs 51.79%), demonstrating a better balance between precision and recall. Random Forest also achieved slightly stronger Kappa and balanced accuracy values (0.3485 and 67.31% respectively), suggesting more even classification performance across both employee classes. Both accuracies are only 6 to 7 percentage points above the 64.11% no-information rate, so neither model should be judged on accuracy alone. Overall, Logistic Regression produced fewer false positives, while Random Forest was better at identifying genuine high-performing employees.

Comparison Table

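# Logistic Regression metrics were saved as logistic_* in Section 4.1.4;
# acc, sens, spec, f1 and auc_val still hold the Random Forest values from 4.2.
comparison_df <- data.frame(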
  Model = c(
    "Logistic Regression",
    "Random Forest"
  ),
  
  Accuracy = c(
    logistic_acc,
    acc
  ),
  
  Sensitivity = c(
    logistic_sens,
    sens
  ),
  
  Specificity = c(
    logistic_spec,
    spec
  ),
  
  F1_Score = c(
    logistic_f1,
    f1
  ),
  
  AUC = c(
    logistic_auc,
    auc_val
  )
)

comparison_df

4.3.6 Feature Importance

The feature importance analysis showed that awards_won and previous_year_rating were among the strongest predictors of employees meeting more than 80% of their KPIs: employees with awards or strong previous-year ratings were significantly more likely to achieve successful KPI outcomes. Several regional variables also showed relatively strong positive relationships with KPI achievement, while departments such as sales & marketing, legal, and HR showed negative relationships. The two models describe importance differently. Logistic Regression coefficients are interpretable and signed, so they show whether a variable increases or decreases the probability of KPI achievement, but their magnitudes depend on each variable's scale. Random Forest ranks variables by Mean Decrease Gini, which gives no direction of effect but can reflect non-linear relationships and interactions between variables.
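
Logistic Regression coefficients are easier to compare after two simple transformations: exponentiating them to obtain odds ratios, and multiplying them by each predictor's standard deviation to put them on a common scale. The sketch below shows both on simulated data; the predictors `rating` and `service`, their effect sizes, and the fitted object `fit` are invented for illustration and are not the project's `log_model`.

```r
set.seed(1)
n <- 2000

# Simulated predictors on different scales (illustrative only)
rating  <- sample(1:5,  n, replace = TRUE)   # 1 to 5 rating
service <- sample(1:30, n, replace = TRUE)   # years of service

# Simulated binary outcome with known log-odds effects
y <- rbinom(n, 1, plogis(-2 + 0.6 * rating + 0.02 * service))

fit <- glm(y ~ rating + service, family = binomial)

# Odds ratio: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coef(fit))[-1]

# Scale-adjusted coefficient: change in log-odds per one standard deviation
std_coef <- coef(fit)[-1] * c(sd(rating), sd(service))

print(round(cbind(Odds_Ratio = odds_ratios, Std_Coefficient = std_coef), 3))
```

Ranking by the scale-adjusted column rather than by the raw coefficient avoids favouring predictors that happen to be measured in small units.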

4.3.7 Predicted Probability Distribution

The predicted probability distribution plots showed noticeable differences in prediction confidence between the two models. Logistic Regression produced more conservative probability estimates, with many predictions concentrated in the lower and middle probability ranges. Fewer employees therefore crossed the 0.5 threshold, which explains its higher specificity and lower sensitivity. In contrast, Random Forest produced a wider probability spread and visibly clearer separation between the “Yes” and “No” classes, and the class weight of 1.79 applied to “Yes” makes positive predictions more likely. This contributed to Random Forest’s higher sensitivity and F1-score. Because the two AUC values are almost identical, the gain is better attributed to where the scores fall relative to the threshold than to a better overall ranking of employees.
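
The link between the probability distribution and the sensitivity/specificity trade-off can be seen by moving the classification threshold. The following sketch does this in base R on simulated scores; the data, the class rate of 0.36, and the names `actual`, `prob` and `threshold_tbl` are assumptions for illustration, not the models' actual predictions.

```r
set.seed(42)
n <- 1000

# Simulated outcome (about 36% positives) and scores that run higher for positives
actual <- rbinom(n, 1, 0.36)
prob   <- plogis(rnorm(n, mean = ifelse(actual == 1, 0, -1), sd = 1))

thresholds <- seq(0.2, 0.6, by = 0.1)

# Sensitivity and specificity at each candidate threshold
threshold_tbl <- t(sapply(thresholds, function(th) {
  pred <- as.integer(prob > th)
  c(threshold   = th,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}))

print(round(threshold_tbl, 3))
```

Lowering the threshold trades specificity for sensitivity, so a model with conservative probability estimates could be tuned toward higher sensitivity by choosing a threshold below 0.5 instead of changing the model.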

4.4 Linear Regression Model

Question: Can employee average training score be predicted using demographic and workplace-related variables?

# ============================================================
#  Linear Regression — Employee Training Score Prediction
# ============================================================

# ── 0. Install & load packages ─────────────────────────────
required_packages <- c(
  "caret",
  "ggplot2",
  "dplyr",
  "Metrics",
  "gridExtra"
)

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

library(caret)
library(ggplot2)
library(dplyr)
library(Metrics)
library(gridExtra)

# ============================================================
#  STEP 1 — Load data
# ============================================================

df_clean <- read.csv(
  "clean_employee_performance.csv",
  stringsAsFactors = FALSE
)

cat(sprintf(
  "Loaded: %d rows x %d columns\n\n",
  nrow(df_clean),
  ncol(df_clean)
))

## Loaded: 17415 rows x 14 columns

# ============================================================
#  STEP 2 — Remove redundant columns
# ============================================================

df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
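# Not redundant, but it is the classification target from Sections 4.1 and 4.2,
# so it is excluded from the predictors of this regression.
df_clean$KPIs_met_more_than_80 <- NULL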

cat("Dropped redundant columns\n")

## Dropped redundant columns

# ============================================================
#  STEP 3 — Convert character columns to factors
# ============================================================

char_cols <- names(df_clean)[sapply(df_clean, is.character)]

df_clean[char_cols] <- lapply(
  df_clean[char_cols],
  as.factor
)

cat("Converted character columns to factors\n")

## Converted character columns to factors

# ============================================================
#  STEP 4 — Handle missing values
# ============================================================

na_total <- sum(is.na(df_clean))

cat(sprintf("Total NAs found: %d\n", na_total))

## Total NAs found: 0

if (na_total > 0) {
  
  for (col in names(df_clean)) {
    
    if (is.numeric(df_clean[[col]])) {
      
      df_clean[[col]][is.na(df_clean[[col]])] <-
        median(df_clean[[col]], na.rm = TRUE)
      
    }
  }
  
  cat("Median imputation applied\n")
  
} else {
  
  cat("No missing values found\n")
}

## No missing values found

4.4.1 Train-test split

# ============================================================
#  STEP 5 — Train/Test Split
# ============================================================

set.seed(42)

train_idx <- createDataPartition(
  df_clean$avg_training_score,
  p = 0.80,
  list = FALSE
)

train_data <- df_clean[train_idx, ]
test_data  <- df_clean[-train_idx, ]

cat(sprintf(
  "Train rows: %d | Test rows: %d\n",
  nrow(train_data),
  nrow(test_data)
))

## Train rows: 13934 | Test rows: 3481
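
For a numeric outcome, createDataPartition() groups the values by percentile and samples within each group, which keeps the score distribution similar in the training and test sets. The base-R sketch below imitates that idea on simulated scores. It is not caret's exact algorithm, and the helper `stratified_split` and the simulated data are assumptions made for illustration.

```r
# Hypothetical helper: quantile-stratified training indices for a numeric outcome
stratified_split <- function(y, p = 0.8, groups = 5) {
  # Bin the outcome into quantile-based groups
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a fraction p of the row indices within each bin
  idx <- unlist(lapply(split(seq_along(y), bins), function(rows) {
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  }))
  sort(unname(idx))
}

set.seed(42)
scores <- round(rnorm(1000, mean = 63, sd = 13))  # simulated training scores

train_idx    <- stratified_split(scores, p = 0.8)
train_scores <- scores[train_idx]
test_scores  <- scores[-train_idx]

cat(sprintf("Train: %d | Test: %d\n", length(train_scores), length(test_scores)))
cat(sprintf("Mean train: %.2f | Mean test: %.2f\n",
            mean(train_scores), mean(test_scores)))
```

Because each bin is rounded up separately, the training share can land slightly above 80%, which matches the small deviation from an exact 80/20 split seen in the output above.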

# ============================================================
#  STEP 6 — Build Linear Regression Model
# ============================================================

lm_model <- lm(
  avg_training_score ~
    age +
    previous_year_rating +
    length_of_service +
    no_of_trainings +
    department +
    education +
    gender +
    recruitment_channel +
    awards_won,
  
  data = train_data
)

summary(lm_model)

## 
## Call:
## lm(formula = avg_training_score ~ age + previous_year_rating + 
##     length_of_service + no_of_trainings + department + education + 
##     gender + recruitment_channel + awards_won, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.310  -2.277  -0.360   1.591  48.978 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                  83.629255   0.282900  295.614   <2e-16 ***
## age                           0.002447   0.006932    0.353   0.7241    
## previous_year_rating          0.215669   0.032133    6.712    2e-11 ***
## length_of_service             0.004121   0.011971    0.344   0.7307    
## no_of_trainings              -0.062389   0.066029   -0.945   0.3447    
## departmentfinance           -24.145277   0.218651 -110.429   <2e-16 ***
## departmenthr                -34.128867   0.218810 -155.975   <2e-16 ***
## departmentlegal             -24.980894   0.306405  -81.529   <2e-16 ***
## departmentoperations        -24.332708   0.153957 -158.048   <2e-16 ***
## departmentprocurement       -14.449422   0.168389  -85.810   <2e-16 ***
## departmentr&d                -0.261169   0.299648   -0.872   0.3834    
## departmentsales & marketing -34.344873   0.142121 -241.659   <2e-16 ***
## departmenttechnology         -4.695710   0.168160  -27.924   <2e-16 ***
## educationbelow secondary      0.782761   0.314546    2.489   0.0128 *  
## educationmasters & above      0.239945   0.093364    2.570   0.0102 *  
## educationunknown              0.057675   0.193335    0.298   0.7655    
## genderm                      -0.015187   0.088555   -0.172   0.8638    
## recruitment_channelreferred  -0.020663   0.285673   -0.072   0.9423    
## recruitment_channelsourcing  -0.152191   0.078444   -1.940   0.0524 .  
## awards_won                    6.303949   0.253829   24.835   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.536 on 13914 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8855 
## F-statistic:  5671 on 19 and 13914 DF,  p-value: < 2.2e-16
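
Each factor coefficient in the summary is a difference from the factor's reference level, which R takes to be the first level and omits from the table. To judge how much a single factor such as department contributes, one option is to compare nested models with and without it. The sketch below does this on simulated data; the variable names, effect sizes, and department labels are assumptions chosen only to resemble the pattern above.

```r
set.seed(1)
n <- 500
sim <- data.frame(
  department = factor(sample(c("analytics", "hr", "sales"), n, replace = TRUE)),
  rating     = sample(1:5, n, replace = TRUE)
)

# Simulated score: large department shifts, small rating effect (assumed values)
dept_shift <- c(analytics = 0, hr = -30, sales = -25)
sim$score <- 85 + unname(dept_shift[as.character(sim$department)]) +
  0.2 * sim$rating + rnorm(n, sd = 4)

fit_full    <- lm(score ~ department + rating, data = sim)
fit_reduced <- lm(score ~ rating, data = sim)

# R-squared with and without the department factor
r2_full    <- summary(fit_full)$r.squared
r2_reduced <- summary(fit_reduced)$r.squared
cat(sprintf("R2 full: %.3f | R2 without department: %.3f\n", r2_full, r2_reduced))

# Partial F-test for the joint contribution of all department dummies
print(anova(fit_reduced, fit_full))

# The reference (baseline) level is the first factor level
cat("Reference level:", levels(sim$department)[1], "\n")
```

Applied to the project model, the same comparison would show how much of the reported R-squared depends on department alone.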

4.4.2 Model Evaluation: RMSE, MAE, RSquared

# ============================================================
#  STEP 7 — Predictions
# ============================================================

predictions <- predict(
  lm_model,
  newdata = test_data
)

actual_values <- test_data$avg_training_score

# ============================================================
#  STEP 8 — Regression Evaluation Metrics
# ============================================================

rmse_val <- round(
  rmse(actual_values, predictions),
  3
)

mae_val <- round(
  mae(actual_values, predictions),
  3
)

r2_val <- round(
  cor(actual_values, predictions)^2,
  3
)

Regression Model Performance
RMSE : 4.51
MAE : 2.74
R² : 0.888
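
For reference, the three metrics can be computed by hand in base R. The sketch below uses a small made-up pair of vectors, not the project data, and also contrasts the squared-correlation R² used above with the 1 - SSE/SST definition. The two agree for unbiased, well-calibrated predictions, but the squared correlation is never the smaller of the two.

```r
# Made-up actual and predicted scores for illustration
actual    <- c(60, 72, 85, 49, 90)
predicted <- c(62, 70, 80, 55, 88)

errors <- actual - predicted

rmse_manual <- sqrt(mean(errors^2))   # squares errors, so large misses weigh more
mae_manual  <- mean(abs(errors))      # average absolute error

# Two common R-squared definitions for held-out data
r2_cor <- cor(actual, predicted)^2
r2_sse <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

cat(sprintf("RMSE: %.3f | MAE: %.3f\n", rmse_manual, mae_manual))
cat(sprintf("R2 (squared correlation): %.3f | R2 (1 - SSE/SST): %.3f\n",
            r2_cor, r2_sse))
```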

4.4.3 Result Visualization: Actual vs. Predicted

# ============================================================
#  STEP 9 — Actual vs Predicted Plot
# ============================================================

results_df <- data.frame(
  Actual = actual_values,
  Predicted = predictions
)

p1 <- ggplot(
  results_df,
  aes(x = Actual,
      y = Predicted)
) +
  geom_point(
    alpha = 0.5,
    colour = "#1A5FA8"
  ) +
  geom_abline(
    slope = 1,
    intercept = 0,
    colour = "red",
    linetype = "dashed"
  ) +
  labs(
    title = "Actual vs Predicted Values",
    subtitle = "Linear Regression",
    x = "Actual Training Score",
    y = "Predicted Training Score"
  ) +
  theme_minimal(base_size = 14)

p1

4.4.4 Residual Plot and Distribution

# ============================================================
#  STEP 10 — Residual Plot
# ============================================================

results_df$Residuals <- actual_values - predictions

p2 <- ggplot(
  results_df,
  aes(x = Predicted,
      y = Residuals)
) +
  geom_point(
    alpha = 0.5,
    colour = "#2E7D32"
  ) +
  geom_hline(
    yintercept = 0,
    colour = "red",
    linetype = "dashed"
  ) +
  labs(
    title = "Residual Plot",
    x = "Predicted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 14)

p2

# ============================================================
#  STEP 11 — Residual Distribution
# ============================================================

p3 <- ggplot(
  results_df,
  aes(x = Residuals)
) +
  geom_histogram(
    bins = 30,
    fill = "#4C9BE8",
    colour = "white",
    alpha = 0.8
  ) +
  labs(
    title = "Residual Distribution",
    x = "Residual",
    y = "Count"
  ) +
  theme_minimal(base_size = 14)

p3
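
The histogram can be complemented with simple numeric checks of residual shape. The sketch below uses simulated residuals with a few large positive errors, loosely mirroring the asymmetric residual range in the model summary. The simulated values and the 3-standard-deviation cut-off are assumptions made for illustration.

```r
set.seed(7)
# Simulated residuals: mostly symmetric noise plus a few large positive errors
res <- c(rnorm(950, mean = 0, sd = 3), runif(50, min = 15, max = 45))
res <- res - mean(res)

# Moment-based sample skewness (close to 0 for a symmetric distribution)
skewness <- mean((res - mean(res))^3) / sd(res)^3

# Share of residuals more than 3 standard deviations from zero
tail_share <- mean(abs(res) > 3 * sd(res))

cat(sprintf("Skewness: %.2f | Share beyond 3 SD: %.3f\n", skewness, tail_share))

# Normal Q-Q coordinates; sample quantiles far above the theoretical ones
# in the upper tail indicate right skew
qq <- qqnorm(res, plot.it = FALSE)
head(data.frame(theoretical = qq$x, sample = qq$y))
```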

4.4.5 Feature Importance

# ============================================================
#  STEP 12 — Feature Importance
# ============================================================

coef_df <- data.frame(
  Feature = names(coef(lm_model)),
  Coefficient = coef(lm_model)
)

coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction = ifelse(
      Coefficient > 0,
      "Positive",
      "Negative"
    )
  ) %>%
  arrange(desc(Abs_Coefficient))

top_coef_df <- coef_df %>%
  slice_max(
    order_by = Abs_Coefficient,
    n = 15
  )

p4 <- ggplot(
  top_coef_df,
  aes(
    x = reorder(
      Feature,
      Abs_Coefficient
    ),
    y = Abs_Coefficient,
    fill = Direction
  )
) +
  geom_bar(
    stat = "identity"
  ) +
  coord_flip() +
  labs(
    title = "Feature Importance",
    subtitle = "Linear Regression Coefficients",
    x = "Feature",
    y = "Absolute Coefficient"
  ) +
  theme_minimal(base_size = 8)

print(p4)

Insights

The coefficient analysis shows that awards_won is the strongest positive predictor of average training score: holding the other variables constant, employees who received an award scored about 6.3 points higher (p < 0.001). previous_year_rating also contributed positively, at roughly 0.22 points per rating level, and the education levels below secondary (+0.78) and masters & above (+0.24) were associated with slightly higher scores than the reference education level (both p < 0.05). By far the largest effects are departmental. Relative to the reference department, sales & marketing (-34.3), HR (-34.1), legal (-25.0), operations (-24.3), finance (-24.1), and procurement (-14.4) showed strongly negative coefficients, technology a smaller one (-4.7), and R&D no significant difference. The remaining variables (age, length_of_service, no_of_trainings, gender, and recruitment channel) had coefficients close to zero that were not statistically significant at the 5% level, so the model gives no evidence that attending more trainings, longer service, or referral-based recruitment affects training scores. Overall, the model suggests that departmental differences, employee recognition, and past performance play the most important roles in explaining average training scores within the organization. Because the chart ranks raw coefficients, which depend on each variable’s scale, it should be read alongside the significance levels in the model summary.
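
Raw coefficients depend on the units of each predictor, so a scale-free ranking can be a useful cross-check. One common choice is the absolute t-statistic of each coefficient, which is also what caret's varImp() reports for linear models according to its documentation. The sketch below uses simulated data in which a rare binary feature has the larger raw coefficient while a wide-range numeric feature has the larger t-statistic. All names and effect sizes are illustrative assumptions.

```r
set.seed(3)
n <- 400
sim <- data.frame(
  awards = rbinom(n, 1, 0.05),       # rare binary feature
  age    = round(runif(n, 20, 60))   # wide-range numeric feature
)
# Assumed effects: +6 points for an award, +0.3 points per year of age
sim$score <- 60 + 6 * sim$awards + 0.3 * sim$age + rnorm(n, sd = 4)

fit   <- lm(score ~ awards + age, data = sim)
coefs <- summary(fit)$coefficients

# Drop the intercept row and keep estimate and absolute t-statistic
importance <- data.frame(
  Feature     = rownames(coefs)[-1],
  Coefficient = coefs[-1, "Estimate"],
  Abs_t_value = abs(coefs[-1, "t value"]),
  row.names   = NULL
)

# Rank by absolute t-statistic instead of raw coefficient size
importance <- importance[order(-importance$Abs_t_value), ]
print(importance)
```

Here the two rankings disagree, which shows why the choice of importance measure should be stated next to any such chart.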

4.4.6 Regression Model Performance

# ============================================================
#  STEP 13 — Regression Metrics Bar Chart
# ============================================================

metrics_df <- data.frame(
  Metric = c(
    "RMSE",
    "MAE",
    "R²"
  ),
  
  Value = c(
    rmse_val,
    mae_val,
    r2_val
  )
)

p5 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Value), vjust = -0.4, fontface = "bold") +
  labs(
    title = "Regression Performance Metrics",
    y = "Value"
  ) +
  theme_minimal(base_size = 8)

print(p5)

Insights

The regression performance metrics indicate that the linear regression model predicts employees’ average training scores well on the held-out test set. The R² value of 0.888, computed as the squared correlation between actual and predicted scores, shows that approximately 88.8% of the variation in training scores is explained by the predictor variables. This is close to the training-set R² of 0.886, which suggests little overfitting. The MAE of 2.74 means that predicted scores differed from actual scores by about 2.7 points on average.

The RMSE of 4.51 is noticeably higher than the MAE, which indicates a small number of large prediction errors; this is consistent with the asymmetric residual range in the model summary, where the largest residual is about 49 points. Overall, the high R² and the low MAE and RMSE show that the model captures the main relationships between employee characteristics and training performance, although the coefficient table suggests that most of this explanatory power comes from department.
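
Error metrics are easier to judge against a naive baseline. The sketch below, run on simulated scores rather than the project data, compares a model's test RMSE with that of always predicting the training mean. The group labels, group means, and noise level are illustrative assumptions.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(11)
n <- 1000
group <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
# Assumed group means plus noise
group_mean <- c(a = 85, b = 60, c = 50)
score <- unname(group_mean[as.character(group)]) + rnorm(n, sd = 4.5)
sim <- data.frame(group = group, score = score)

train <- sim[1:800, ]
test  <- sim[801:1000, ]

fit <- lm(score ~ group, data = train)

rmse_model    <- rmse(test$score, predict(fit, newdata = test))
rmse_baseline <- rmse(test$score, mean(train$score))   # mean-only predictor

cat(sprintf("Model RMSE: %.2f | Baseline RMSE: %.2f\n", rmse_model, rmse_baseline))
cat(sprintf("Error reduction vs baseline: %.0f%%\n",
            100 * (1 - rmse_model / rmse_baseline)))
```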

5.0 Conclusion

This study demonstrates that KPI achievement is driven by a combination of individual performance indicators (especially previous_year_rating and awards_won), training quality (avg_training_score), and categorical factors such as department and recruitment channel. Random Forest outperformed Logistic Regression in capturing nonlinear relationships and identifying high performers. Future work should incorporate additional variables such as salary, absenteeism, and engagement scores; conduct longitudinal analysis to track performance trajectories over time; explore more advanced interpretable ML models such as XGBoost; and develop a production-ready HR decision-support dashboard that translates these insights into actionable workforce planning tools for proactive employee development and targeted training investments.

6.0 References

Chaudhari, S. (2023). Employee’s Performance for HR Analytics. Kaggle. https://www.kaggle.com/datasets/sanjanchaudhari/employees-performance-for-hr-analytics
Hendri, M. I. (2025, July 20). Exploring factors shaping sustainable employee performance: A systematic literature review. ScienceDirect. https://doi.org/10.1016/j.ssaho.2025.101586
Singh Lather, A., Malhotra, R., Priya, S., Singh, P., & Mittal, S. Prediction of employee performance using machine learning techniques. ACM Digital Library. https://dl.acm.org/doi/10.1145/3373477.3373698
Visier. (2025). HR Dashboards: Definition, Benefits and Examples. visier.com. https://www.visier.com/hr-analytics/hr-dashboard/

Employee Performance Analysis Using R: Determining the Factors Influencing KPI Achievements

Melissa A/P Xavier, Dhewi Triesuleha binti Safei, Pan Hui Xin, Sheikh Emran Shirage, Nur Allysha Frankfort binti Izwan Lewis

29-05-2026
