1.0 Introduction

Amid the era of globalization and relentless competitive pressures, sustaining strong employee performance has become a central priority and one of the biggest challenges for Human Resource (HR) departments. Organizations can no longer rely solely on intuition; instead, they are increasingly turning to data‑driven analytics to evaluate workforce outcomes, optimize talent management, and reduce turnover. Understanding the specific, underlying factors that influence productivity is essential for building sustainable, long‑term employee success.

This project uses R to explore a comprehensive HR dataset. By combining statistical methods with machine learning techniques, it aims to uncover the key drivers of high performance and Key Performance Indicator (KPI) achievement, and to translate raw HR data into actionable organizational insights.

1.1 Objective of the Project

The primary objective of this project is to conduct a comprehensive data analysis in R to identify the factors, such as employee demographics, training effectiveness, length of service, and prior performance ratings, that significantly influence whether an employee meets more than 80% of their Key Performance Indicators (KPIs). By applying statistical techniques, data visualization, and predictive modeling, this study aims to generate actionable insights that can guide HR professionals in enhancing employee performance strategies, supporting talent development, and strengthening organizational decision‑making.

Specifically, the project seeks to:
- Identify key factors affecting performance through statistical analysis and machine learning, focusing on variables such as training, work experience, education level, and departmental affiliation.
- Compare predictive models to determine the most effective approach for forecasting KPI achievement above 80%, using evaluation metrics such as confusion matrices, accuracy, sensitivity, and specificity.
- Provide recommendations and actionable insights to HR departments and stakeholders, supporting evidence‑based decisions in talent management, training programs, and employee engagement initiatives.
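
The evaluation metrics named in the second objective can all be derived from a 2x2 confusion matrix. The sketch below is a minimal base R illustration on made-up labels; the vectors `actual` and `predicted` and the helper `classification_metrics()` are hypothetical and are not results from this project.

```r
# Hypothetical helper: derive metrics from actual vs. predicted 0/1 labels.
# "1" (more than 80% of KPIs met) is treated as the positive class.
classification_metrics <- function(actual, predicted) {
  actual    <- factor(actual, levels = c(0, 1))
  predicted <- factor(predicted, levels = c(0, 1))
  cm <- table(Predicted = predicted, Actual = actual)

  tp <- cm["1", "1"]; tn <- cm["0", "0"]
  fp <- cm["1", "0"]; fn <- cm["0", "1"]

  list(
    confusion_matrix = cm,
    accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),  # share of actual positives detected
    specificity = tn / (tn + fp)   # share of actual negatives detected
  )
}

# Toy labels for illustration only (not project data)
actual    <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0)

metrics <- classification_metrics(actual, predicted)
metrics$confusion_matrix
metrics$accuracy      # 0.8
metrics$sensitivity   # 0.75
metrics$specificity   # 5/6
```

Reporting sensitivity and specificity alongside accuracy matters here because the two KPI classes are not equally frequent, so accuracy alone can hide weak detection of the smaller class.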

1.2 Dataset Description

The dataset titled Employees Performance for HR Analytics was uploaded to Kaggle by Sanjana Chaudhari in 2023 and serves as the foundation for this analysis. It contains 17,417 employee records across 13 variables, stored in CSV format. The dataset captures a balanced mix of categorical and numerical variables, making it suitable for exploratory data analysis (EDA), correlation studies, and predictive modeling in HR analytics.

The variables included are as follows:
- employee_id: Unique identifier for each employee; serves as the primary key for tracking records without revealing personal information.
- department: Employee’s department (e.g., Sales & Marketing, Technology); useful for performance segmentation and departmental comparisons.
- region: Geographic region of employment.
- education: Highest education level attained (e.g., Bachelor’s, Master’s and above).
- gender: Employee gender (m = male, f = female).
- recruitment_channel: Hiring source (e.g., Referred, Sourcing).
- no_of_trainings: Number of trainings attended.
- age: Employee age.
- previous_year_rating: Performance rating from the prior year (1–5 scale).
- length_of_service: Number of years served in the organization.
- KPIs_met_more_than_80: Binary indicator of whether more than 80% of KPIs were achieved (0 = No, 1 = Yes); this serves as the target variable.
- awards_won: Indicator of whether the employee won awards (0 = No, 1 = Yes).
- avg_training_score: Average score from trainings, reflecting training quality.

By analyzing these variables, the study aims to uncover meaningful patterns that can guide HR strategies, improve productivity, and strengthen workforce management.

2.0 Data Cleaning & Preparation

Data cleaning is a critical step in preparing the dataset for analysis. It involves handling missing values, correcting inconsistencies, removing duplicates, and ensuring that variables are properly formatted for statistical modeling. Clean data provides a reliable foundation for exploratory analysis and predictive modeling, reducing bias and improving the accuracy of insights.

2.1 Packages Used

The following packages were used in the data cleaning process:

- dplyr: filter, mutate, select, distinct, arrange, summarise, across, case_when; data manipulation and transformation.
- tidyr: replace_na; handling missing values and tidying data.
- stringr: str_trim, str_to_lower; text cleaning and string processing.
- writexl: write_xlsx; exporting the cleaned dataset to Excel format.

2.2 Data Importation

This step involves loading the raw dataset into R for inspection. The structure and summary of the data are examined to understand variable types.

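# Load the packages used in the cleaning steps below
library(dplyr)
library(tidyr)
library(stringr)
employee_performance <- read.csv("Uncleaned_employees_final_dataset.csv")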
str(employee_performance)

## 'data.frame':    17417 obs. of  13 variables:
##  $ employee_id          : int  8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...

summary(employee_performance)

##   employee_id        department          region          education    
##  Min.   :    3   Length   :17417   Length   :17417   Length   :17417  
##  1st Qu.:19281   N.unique :    9   N.unique :   34   N.unique :    4  
##  Median :39122   N.blank  :    0   N.blank  :    0   N.blank  :  771  
##  Mean   :39083   Min.nchar:    2   Min.nchar:    8   Min.nchar:    0  
##  3rd Qu.:58838   Max.nchar:   17   Max.nchar:    9   Max.nchar:   15  
##  Max.   :78295                                                        
##                                                                       
##        gender      recruitment_channel no_of_trainings      age       
##  Length   :17417   Length   :17417     Min.   :1.000   Min.   :20.00  
##  N.unique :    2   N.unique :    3     1st Qu.:1.000   1st Qu.:29.00  
##  N.blank  :    0   N.blank  :    0     Median :1.000   Median :33.00  
##  Min.nchar:    1   Min.nchar:    5     Mean   :1.251   Mean   :34.81  
##  Max.nchar:    1   Max.nchar:    8     3rd Qu.:1.000   3rd Qu.:39.00  
##                                        Max.   :9.000   Max.   :60.00  
##                                                                       
##  previous_year_rating length_of_service KPIs_met_more_than_80   awards_won     
##  Min.   :1.000        Min.   : 1.000    Min.   :0.0000        Min.   :0.00000  
##  1st Qu.:3.000        1st Qu.: 3.000    1st Qu.:0.0000        1st Qu.:0.00000  
##  Median :3.000        Median : 5.000    Median :0.0000        Median :0.00000  
##  Mean   :3.345        Mean   : 5.802    Mean   :0.3588        Mean   :0.02337  
##  3rd Qu.:4.000        3rd Qu.: 7.000    3rd Qu.:1.0000        3rd Qu.:0.00000  
##  Max.   :5.000        Max.   :34.000    Max.   :1.0000        Max.   :1.00000  
##  NAs    :1363                                                                  
##  avg_training_score
##  Min.   :39.00     
##  1st Qu.:51.00     
##  Median :60.00     
##  Mean   :63.18     
##  3rd Qu.:75.00     
##  Max.   :99.00     
##

2.3 Duplicate Removal & Consistency Checks

Unique values are checked first to identify inconsistencies in the categorical variables. Duplicate records and unnecessary columns are then removed to ensure data integrity.

unique(employee_performance$department)

## [1] "Technology"        "HR"                "Sales & Marketing"
## [4] "Procurement"       "Finance"           "Analytics"        
## [7] "Operations"        "Legal"             "R&D"

unique(employee_performance$education)

## [1] "Bachelors"       "Masters & above" ""                "Below Secondary"

unique(employee_performance$gender)

## [1] "m" "f"

unique(employee_performance$recruitment_channel)

## [1] "sourcing" "other"    "referred"

unique(employee_performance$region)

##  [1] "region_26" "region_4"  "region_13" "region_2"  "region_29" "region_7" 
##  [7] "region_22" "region_16" "region_17" "region_24" "region_11" "region_27"
## [13] "region_9"  "region_20" "region_34" "region_23" "region_8"  "region_14"
## [19] "region_31" "region_19" "region_5"  "region_28" "region_15" "region_3" 
## [25] "region_25" "region_12" "region_21" "region_30" "region_10" "region_33"
## [31] "region_32" "region_6"  "region_1"  "region_18"

str(employee_performance)

## 'data.frame':    17417 obs. of  13 variables:
##  $ employee_id          : int  8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...

# Show duplicated employee_id
employee_performance %>%
  filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
  arrange(employee_id)

# Remove exact duplicate rows only
employee_performance <- employee_performance %>%
  distinct()

# Check duplicated employee_id again
employee_performance %>%
  filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
  arrange(employee_id)

#remove unnecessary column
employee_performance <- employee_performance %>%
  select(-employee_id)

2.4 Data Transformation

Text fields are standardized by trimming spaces and converting to lowercase. Missing values are handled using median imputation and categorical replacement to maintain data completeness.

#clean text column
employee_performance <- employee_performance %>%
  mutate(
    gender = str_to_lower(str_trim(gender)),
    department = str_trim(department),
    education = str_trim(education),
    recruitment_channel = str_trim(recruitment_channel)
  )


str(employee_performance)

## 'data.frame':    17415 obs. of  12 variables:
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...

colSums(is.na(employee_performance))

##            department                region             education 
##                     0                     0                     0 
##                gender   recruitment_channel       no_of_trainings 
##                     0                     0                     0 
##                   age  previous_year_rating     length_of_service 
##                     0                  1363                     0 
## KPIs_met_more_than_80            awards_won    avg_training_score 
##                     0                     0                     0

employee_performance %>%
  summarise(across(everything(), ~ sum(is.na(.) | trimws(as.character(.)) == "")))
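
The NA count above reports zero missing values for education even though the import summary lists blank entries, because read.csv() keeps a blank text field as an empty string rather than NA. The sketch below illustrates this on a small made-up CSV (not the project file) and shows one possible alternative, declaring "" as a missing-value marker through the `na.strings` argument at import time.

```r
# Toy CSV with one blank education field (illustrative data, not the project file)
csv_text <- "employee_id,education
1,Bachelors
2,
3,Masters & above"

# Default import: the blank field stays an empty string, so is.na() misses it
raw_default <- read.csv(text = csv_text)
sum(is.na(raw_default$education))   # 0
sum(raw_default$education == "")    # 1

# Declaring "" as a missing-value marker turns the blank into NA at import time
raw_na <- read.csv(text = csv_text, na.strings = c("", "NA"))
sum(is.na(raw_na$education))        # 1
```

Either approach works as long as blanks and NA values are both counted before imputation, which is what the combined check above does.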

#handling missing values
employee_performance <- employee_performance %>%
  mutate(
    previous_year_rating = ifelse(
      is.na(previous_year_rating),
      median(previous_year_rating, na.rm = TRUE),
      previous_year_rating
    ),
    
    education = ifelse(
      is.na(education) | str_trim(education) == "",
      "Unknown",
      education
    )
  )

# Safety net: replace any NA still left in education (none are expected after the step above)
employee_performance <- employee_performance %>%
  mutate(education = replace_na(education, "Unknown"))


#clean any text column
clean_text <- function(x) {
  x %>%
    str_trim() %>%
    str_to_lower()
}

employee_performance$department <- clean_text(employee_performance$department)


employee_performance <- employee_performance %>%
  mutate(across(
    c(department, education, recruitment_channel, region),
    clean_text
  ))

2.5 Feature Engineering

New variables are created to enhance analytical insights. Age groups are categorized, and categorical variables are converted to factors for modeling compatibility.

#create age group 
employee_performance <- employee_performance %>%
  mutate(age_group = case_when(
    age < 30 ~ "Young",
    age >= 30 & age < 40 ~ "Mid",
    TRUE ~ "Senior"
  ))

str(employee_performance)

## 'data.frame':    17415 obs. of  13 variables:
##  $ department           : chr  "technology" "hr" "sales & marketing" "procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "bachelors" "bachelors" "bachelors" "bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : num  3 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...
##  $ age_group            : chr  "Young" "Mid" "Mid" "Mid" ...

#convert to factors
employee_performance <- employee_performance %>%
  mutate(
    department = as.factor(department),
    gender = as.factor(gender),
    education = as.factor(education),
    recruitment_channel = as.factor(recruitment_channel),
    region = as.factor(region),
    age_group = as.factor(age_group)
  )
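
as.factor() orders levels alphabetically, so age_group ends up with the level order Mid, Senior, Young. Where a natural order is preferred for plots and model output, the levels can be set explicitly. The sketch below is one possible way to do this with base R's cut() on made-up ages; the break points mirror the case_when() rules above, and the vector `age` is illustrative only.

```r
# Toy ages for illustration; the cut points mirror the case_when() rules above
age <- c(24, 31, 30, 36, 51, 29, 40, 39)

age_group <- cut(
  age,
  breaks = c(-Inf, 30, 40, Inf),         # [-Inf, 30), [30, 40), [40, Inf)
  labels = c("Young", "Mid", "Senior"),
  right  = FALSE                         # intervals are closed on the left
)

levels(age_group)   # "Young" "Mid" "Senior"
table(age_group)
```

An equivalent option is to keep case_when() and call factor(age_group, levels = c("Young", "Mid", "Senior")) afterwards.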


# Standardize score (z-score); as.numeric() drops the matrix structure that scale() returns
employee_performance <- employee_performance %>%
  mutate(avg_training_score_scaled = as.numeric(scale(avg_training_score)))
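
scale() with its default arguments subtracts the mean and divides by the sample standard deviation, which is a z-score standardization rather than a min-max normalization. The sketch below checks this equivalence on a few made-up scores and shows that scale() returns a one-column matrix, which is why flattening it with as.numeric() can be convenient inside a data frame.

```r
# Toy scores for illustration only
score <- c(77, 51, 47, 65, 61, 68, 57, 85, 75, 76)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (score - mean(score)) / sd(score)

# scale() returns a one-column matrix; as.numeric() flattens it to a plain vector
z_scale <- as.numeric(scale(score))

all.equal(z_manual, z_scale)   # TRUE
is.matrix(scale(score))        # TRUE
```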

2.6 Data Exportation

After cleaning, the final step is to export the processed dataset for later analysis and reporting. CSV format is used so that the file can be loaded directly in the EDA step.

write.csv(employee_performance, "clean_employee_performance.csv", row.names = FALSE)

3.0 Exploratory Data Analysis (EDA)

Before modelling, Exploratory Data Analysis (EDA) was conducted to examine employee performance.
The following steps were performed:

3.1 Data Inspection

The required libraries and the cleaned dataset df_clean were loaded and inspected to understand the data structure before moving on to exploratory data analysis.

# Install packages (run only once if needed):
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("corrplot")
# install.packages("plotly")
# install.packages("reshape2")
# install.packages("kableExtra")

# Load required libraries:
library(dplyr)        # Data manipulation and group_by() function
library(ggplot2)      # Data visualization
library(tidyverse)    # Collection of data science packages
library(knitr)        # R Markdown table formatting
library(corrplot)     # Correlation matrix visualization
library(plotly)       # Interactive plots
library(reshape2)     # Data reshaping
library(kableExtra)   # Enhanced table styling

# Load cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv")

#convert to factors
df_clean <- df_clean %>%
  mutate(
    department = as.factor(department),
    gender = as.factor(gender),
    education = as.factor(education),
    recruitment_channel = as.factor(recruitment_channel),
    region = as.factor(region),
    age_group = as.factor(age_group),
    awards_won = as.factor(awards_won) # added conversion to factor for better analysis
  )

The dataset structure confirms that the variables are stored with appropriate data types.

str() and glimpse() provide complementary views of the dataset.
str() gives a detailed description of the structure, whereas glimpse() gives a compact overview of all variables and their first values.
str() shows that the dataset is a mix of numeric and categorical (factor) variables. glimpse() and dim() show that the dataset contains 17,415 observations and 14 variables.
This sample size is large enough for reliable exploratory and statistical analysis.

# Data structure overview inspection
head(df_clean)

str(df_clean)

## 'data.frame':    17415 obs. of  14 variables:
##  $ department               : Factor w/ 9 levels "analytics","finance",..: 9 3 8 6 2 6 2 1 9 9 ...
##  $ region                   : Factor w/ 34 levels "region_1","region_10",..: 19 29 5 12 22 32 12 15 32 15 ...
##  $ education                : Factor w/ 4 levels "bachelors","below secondary",..: 1 1 1 1 1 1 1 1 3 1 ...
##  $ gender                   : Factor w/ 2 levels "f","m": 2 1 2 1 2 2 2 2 2 2 ...
##  $ recruitment_channel      : Factor w/ 3 levels "other","referred",..: 3 1 1 1 3 3 1 3 1 3 ...
##  $ no_of_trainings          : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                      : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating     : int  3 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service        : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80    : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won               : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_training_score       : int  77 51 47 65 61 68 57 85 75 76 ...
##  $ age_group                : Factor w/ 3 levels "Mid","Senior",..: 3 1 1 1 1 1 1 1 2 3 ...
##  $ avg_training_score_scaled: num  1.03 -0.908 -1.206 0.136 -0.162 ...

glimpse(df_clean)

## Rows: 17,415
## Columns: 14
## $ department                <fct> technology, hr, sales & marketing, procureme…
## $ region                    <fct> region_26, region_4, region_13, region_2, re…
## $ education                 <fct> bachelors, bachelors, bachelors, bachelors, …
## $ gender                    <fct> m, f, m, f, m, m, m, m, m, m, m, m, f, m, m,…
## $ recruitment_channel       <fct> sourcing, other, other, other, sourcing, sou…
## $ no_of_trainings           <int> 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ age                       <int> 24, 31, 31, 31, 30, 36, 33, 36, 51, 29, 40, …
## $ previous_year_rating      <int> 3, 3, 1, 2, 4, 3, 5, 3, 4, 5, 5, 3, 3, 3, 5,…
## $ length_of_service         <int> 1, 5, 4, 9, 7, 2, 3, 3, 11, 2, 12, 10, 4, 10…
## $ KPIs_met_more_than_80     <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ awards_won                <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg_training_score        <int> 77, 51, 47, 65, 61, 68, 57, 85, 75, 76, 50, …
## $ age_group                 <fct> Young, Mid, Mid, Mid, Mid, Mid, Mid, Mid, Se…
## $ avg_training_score_scaled <dbl> 1.03010551, -0.90754471, -1.20564475, 0.1358…

# Dataset dimensions
dim(df_clean)

## [1] 17415    14

The summary provides an overview of the central tendency and distribution of each variable.

Employees range in age from 20 to 60, with a mean of around 35 years, and every employee has completed at least one training session.
The dataset shows moderate employee performance, with an average previous-year rating of about 3.3.
The middle 50% of employees have 3 to 7 years of service, and the mean KPI achievement rate of 0.36 indicates that fewer than half of the employees met more than 80% of their KPIs.
Training scores range widely, from 39 to 99, which points to a broad spread of skill levels. avg_training_score was therefore standardized into a new column, avg_training_score_scaled, to ease later analysis.

# Summary statistics
df_clean %>%
  select(age, previous_year_rating, KPIs_met_more_than_80,
         length_of_service, no_of_trainings, avg_training_score, 
         avg_training_score_scaled) %>%
  summary()

##       age        previous_year_rating KPIs_met_more_than_80 length_of_service
##  Min.   :20.00   Min.   :1.000        Min.   :0.0000        Min.   : 1.000   
##  1st Qu.:29.00   1st Qu.:3.000        1st Qu.:0.0000        1st Qu.: 3.000   
##  Median :33.00   Median :3.000        Median :0.0000        Median : 5.000   
##  Mean   :34.81   Mean   :3.319        Mean   :0.3589        Mean   : 5.801   
##  3rd Qu.:39.00   3rd Qu.:4.000        3rd Qu.:1.0000        3rd Qu.: 7.000   
##  Max.   :60.00   Max.   :5.000        Max.   :1.0000        Max.   :34.000   
##  no_of_trainings avg_training_score avg_training_score_scaled
##  Min.   :1.000   Min.   :39.00      Min.   :-1.8018          
##  1st Qu.:1.000   1st Qu.:51.00      1st Qu.:-0.9075          
##  Median :1.000   Median :60.00      Median :-0.2368          
##  Mean   :1.251   Mean   :63.18      Mean   : 0.0000          
##  3rd Qu.:1.000   3rd Qu.:75.00      3rd Qu.: 0.8811          
##  Max.   :9.000   Max.   :99.00      Max.   : 2.6697

Summary for 3.1:

This step ensures that all variables are correctly formatted after preprocessing and no inconsistencies remain.
Overall, the dataset is ready for analysis.

3.2 Data Quality Assessment

A final quality check was conducted to check if any missing values or duplicated values remain.

# Check missing values
colSums(is.na(df_clean))

##                department                    region                 education 
##                         0                         0                         0 
##                    gender       recruitment_channel           no_of_trainings 
##                         0                         0                         0 
##                       age      previous_year_rating         length_of_service 
##                         0                         0                         0 
##     KPIs_met_more_than_80                awards_won        avg_training_score 
##                         0                         0                         0 
##                 age_group avg_training_score_scaled 
##                         0                         0

# Check missing or empty values
df_clean %>%
  summarise(across(everything(), ~ sum(is.na(.) | . == "")))

# Check any duplicates
sum(duplicated(df_clean))

## [1] 16

janitor::get_dupes(df_clean)

## No variable names specified - using all columns.

# Check whether missing values in the education field have been replaced with “Unknown”
unique(df_clean$education)

## [1] bachelors       masters & above unknown         below secondary
## Levels: bachelors below secondary masters & above unknown

Summary for 3.2:

This step confirmed that the cleaned dataset contains no missing values and that blank education entries were recoded as "unknown".
Exact duplicate rows were removed during cleaning, while employee_id was still present. The 16 duplicates reported above appear only because employee_id was dropped afterwards: they are records with different employee IDs that share identical values on all remaining variables, so they are kept as valid observations rather than treated as data errors.
All variables have been standardized, so the dataset is suitable for reliable statistical analysis after preprocessing.
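
One way duplicated rows can appear once an identifier column is removed is sketched below on a made-up data frame. The object `toy` and its values are illustrative only and are not taken from the project data.

```r
# Toy records: two different employees who happen to share every attribute
toy <- data.frame(
  employee_id = c(101, 102, 103),
  department  = c("hr", "hr", "finance"),
  age         = c(30, 30, 45)
)

# With the identifier present, every row is unique
sum(duplicated(toy))                             # 0

# Without the identifier, employees 101 and 102 become indistinguishable
sum(duplicated(toy[, c("department", "age")]))   # 1
```

Rows of this kind describe different people, so removing them would discard valid observations.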

3.3 Outlier Detection

3.3.1 Box Plot

Outlier detection focuses on the continuous variables age, length_of_service, and avg_training_score because these variables have meaningful numerical ranges, and extreme values may reveal unusual or hidden characteristics of employees.

# Select variables suitable for outlier detection
outlier_vars <- df_clean %>%
  select(age, length_of_service, avg_training_score)

# Convert selected variables into long format
outlier_data <- outlier_vars %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value")

# Boxplot visualization
ggplot(outlier_data,
       aes(x = Variable,
           y = Value,
           fill = Variable)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Outlier Detection for Continuous Variables",
       x = "Variables",
       y = "Values") +
  theme(legend.position = "none")

Summary for 3.3:

This step checks whether extreme values in the continuous variables could bias later analysis.
age and length_of_service contain several outliers beyond the upper fence of the boxplot, which require further investigation to determine whether they represent valid extreme values or anomalies in the dataset.
The overall data quality is satisfactory and makes the dataset suitable for further exploratory data analysis and modelling.
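
The boxplot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. Using the quartiles in the summary output of Section 3.1 (29 and 39 for age, 3 and 7 for length_of_service), this rule gives upper fences of roughly 54 years of age and 13 years of service, while the fences for avg_training_score (about 15 and 111) enclose its full observed range of 39 to 99. The sketch below shows the same calculation on made-up values; the helper `iqr_fences()` is hypothetical, and boxplot hinges can differ slightly from quantile() results on small samples.

```r
# Hypothetical helper: Tukey fences (1.5 x IQR), the rule behind boxplot whiskers
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy tenure values for illustration only
tenure <- c(1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 20)

fences <- iqr_fences(tenure)
fences   # lower = -3, upper = 13

# Values outside the fences would be drawn as outlier points
flagged <- tenure[tenure < fences["lower"] | tenure > fences["upper"]]
flagged  # 20
```

Flagged values are candidates for review rather than automatic removal, since long tenure or higher age can be perfectly valid.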

3.4 Univariate Analysis

3.4.1 Target Variable Distribution

# Check class imbalance
target_dist <- df_clean %>%
  count(KPIs_met_more_than_80) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    KPI_status = ifelse(KPIs_met_more_than_80 == 1,
                        "Met KPI >80%",
                        "Met KPI ≤80%")
  )

# KPI distribution plot
ggplot(target_dist,
       aes(x = KPI_status,
           y = n,
           fill = KPI_status)) +

  geom_bar(stat = "identity",
           width = 0.6,
           alpha = 0.9) +

  # Percentage + count labels
  geom_text(aes(label = paste0(percentage,
                               "%\n(n = ",
                               scales::comma(n), ")")),
            vjust = -0.35,
            size = 4.3,
            fontface = "bold") +

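  # Named values make the colour mapping explicit instead of relying on sort order
  scale_fill_manual(values = c("Met KPI >80%" = "#2ECC71",
                               "Met KPI ≤80%" = "#E74C3C")) +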

  expand_limits(y = max(target_dist$n) * 1.15) +

  labs(title = "Distribution of KPI Achievement",
       subtitle = "Class balance analysis of KPI performance",
       x = NULL,
       y = "Number of Employees") +

  theme_minimal(base_size = 13) +

  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(face = "bold")
  )

# Print imbalance ratio
imbalance_ratio <- max(target_dist$percentage) /
                   min(target_dist$percentage)

cat("Imbalance ratio (majority/minority):",
    round(imbalance_ratio, 2), "\n")

## Imbalance ratio (majority/minority): 1.79

Insights:

About 64.1% of employees did not exceed the 80% KPI benchmark, while 35.9% did.
The imbalance ratio of 1.79 indicates that the target variable is only mildly imbalanced.
Because more than half of the employees did not reach the KPI target, organizational and operational factors may be affecting performance, which warrants further investigation into areas such as departmental performance and employee support systems.
Hence, the class distribution is acceptable for exploratory analysis and modelling without major class imbalance concerns.
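
The imbalance ratio also implies a useful reference point for later modelling: a naive classifier that always predicts the majority class would already be right for every majority-class employee. The short sketch below reproduces the ratio from the class shares reported above and derives that baseline; the variable names are illustrative.

```r
# Class shares as reported in the target distribution above (percent)
share_not_met <- 64.1
share_met     <- 35.9

# Majority-to-minority ratio
imbalance_ratio <- round(share_not_met / share_met, 2)
imbalance_ratio      # 1.79

# Accuracy of a naive model that always predicts the majority class;
# a candidate model should be judged against this floor
baseline_accuracy <- share_not_met / 100
baseline_accuracy    # 0.641
```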

3.4.2 Continuous Variables Distribution

# Select Variables
continuous_long <- df_clean %>%
  select(age,
         length_of_service,
         avg_training_score) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value")

# Plots
ggplot() +

  # age
  geom_histogram(
    data = filter(continuous_long, variable == "age"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 2,
    fill = "#3498DB",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long, variable == "age"),
    aes(x = value),
    color = "#E74C3C",
    linewidth = 1.1,
    adjust = 1.2
  ) +

  # length_of_service
  geom_histogram(
    data = filter(continuous_long,
                  variable == "length_of_service"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 1,
    fill = "#2ECC71",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long,
                  variable == "length_of_service"),
    aes(x = value),
    color = "#C0392B",
    linewidth = 1.1,
    adjust = 1.5
  ) +

  # avg_training_score
  geom_histogram(
    data = filter(continuous_long,
                  variable == "avg_training_score"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 5,
    fill = "#9B59B6",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long,
                  variable == "avg_training_score"),
    aes(x = value),
    color = "#E74C3C",
    linewidth = 1.1,
    adjust = 1.2
  ) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(
    title = "Distribution of Continuous Variables",
    subtitle = "Histogram with Density Overlay",
    x = "Value",
    y = "Density"
  ) +

  theme_minimal(base_size = 14) +

  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    strip.text = element_text(face = "bold", size = 12),
    axis.title = element_text(face = "bold")
  )

Insights:

The age distribution is slightly right-skewed, indicating a higher concentration of younger employees and fewer older employees. The density curve peaks around the mid-30s, which suggests that most employees are in the early to mid-career stage.
The avg_training_score distribution is multimodal, with major clusters visible around 50, 60, and 80 to 85. This suggests possible segmentation in employee performance or departmental training outcomes.
The length_of_service distribution is heavily right-skewed. Most employees have 1 to 7 years of service, whereas very few exceed 15 years, which suggests a workforce with relatively short tenure.
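
Visual impressions of skewness can be backed by a numeric measure. The sketch below computes the moment-based sample skewness in base R on made-up vectors; the helper `sample_skewness()` is hypothetical and was not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (base R, no extra packages).
# Values above 0 suggest a longer right tail, values below 0 a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy vectors for illustration only
right_tailed <- c(1, 1, 2, 2, 3, 3, 4, 5, 9, 20)
symmetric    <- c(1, 2, 3, 4, 5)

sample_skewness(right_tailed)   # positive
sample_skewness(symmetric)      # 0
```

Applied to a column such as df_clean$length_of_service, a clearly positive value would support the right-skew reading of the histogram.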

3.4.3 Discrete Variables Distribution

# Select Variables
discrete_vars <- df_clean %>%
  select(previous_year_rating,
         no_of_trainings)

discrete_long <- discrete_vars %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value")
#Plot
ggplot(discrete_long,
       aes(x = factor(value))) +

  geom_bar(fill = "#2ECC71",
           alpha = 0.8) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(title = "Distribution of Discrete Variables",
       subtitle = "Frequency distribution by category",
       x = "Category",
       y = "Count") +

  theme_minimal() +

  theme(strip.text = element_text(face = "bold"))

Insights:

no_of_trainings is strongly right-skewed: most employees attended only one training session. Counts decline sharply after 2 sessions, and very few employees attended more than 4, which produces the long tail.
previous_year_rating is slightly left-skewed, with a mode at rating 3. Most employees received ratings of 3, 4, or 5 (average to excellent), which suggests that past performance is generally positive across the workforce.

3.4.4 Categorical Variables Distribution

# Select Variables
categorical_vars <- df_clean %>%
  select(department,
         education,
         recruitment_channel,
         awards_won)

categorical_long <- categorical_vars %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "category")

# Plot
ggplot(categorical_long,
       aes(x = category,
           fill = variable)) +

  geom_bar(alpha = 0.85,
           show.legend = FALSE) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(title = "Distribution of Categorical Variables",
       subtitle = "Frequency distribution across employee categories",
       x = "",
       y = "Number of Employees") +

  scale_fill_brewer(palette = "Set2") +

  theme_minimal(base_size = 12) +

  theme(
    axis.text.x = element_text(angle = 45,
                               hjust = 1),
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

# Plot for region (top 10)
region_analysis <- df_clean %>%
  count(region) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  top_n(10, n)

ggplot(region_analysis,
       aes(x = reorder(region, n), y = percentage, fill = percentage)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            vjust = -0.3, size = 3) +
  scale_fill_gradient(low = "#A3E4D7", high = "#1ABC9C") +
  labs(title = "Employee Distribution by Region (Top 10)",
       x = "Region", y = "Percentage (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7))

Insights:

The categorical variable analysis provides an overview of department, education, and recruitment_channel, revealing that employees are unevenly distributed across departments, educational backgrounds, and recruitment channels.
The awards_won distribution is extremely imbalanced: nearly all employees have no awards. Because award winners are rare and almost invisible on the chart, the variable has little variation and may have limited predictive power. However, further bivariate analysis is needed to determine whether award winners achieve their KPIs more often.
The Sales and Marketing department has the largest number of employees, making it the dominant department within the organization.
Most employees hold a bachelor’s degree, making it the most common educational background in the workforce.
There are three distinct recruitment channels, reflecting the variety of hiring methods the company employs.
Region 2 has the largest share of employees (about 22.5%) compared to other regions.

3.5 Bivariate Analysis

3.5.1 Training Score vs KPI Achievement

This section examines the relationship between employee training performance (average training score) and KPI achievement status. This analysis helps determine whether training performance influences employees’ success in meeting their KPI targets.

# Summary Statistics by KPI 
training_score_summary <- df_clean %>%
  group_by(KPIs_met_more_than_80) %>%
  summarise(
    count = n(),
    mean_score = mean(avg_training_score, na.rm = TRUE),
    median_score = median(avg_training_score, na.rm = TRUE),
    sd_score = sd(avg_training_score, na.rm = TRUE),
    min_score = min(avg_training_score, na.rm = TRUE),
    max_score = max(avg_training_score, na.rm = TRUE),
    q25 = quantile(avg_training_score, 0.25, na.rm = TRUE),
    q75 = quantile(avg_training_score, 0.75, na.rm = TRUE)
  ) %>%
  mutate(KPI_Status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet KPI"))

kable(training_score_summary %>% 
        select(-KPIs_met_more_than_80) %>% 
        mutate(across(where(is.numeric), ~round(., 2))),
      caption = "Training Score Summary by KPI Achievement Status")

Training Score Summary by KPI Achievement Status
count	mean_score	median_score	sd_score	min_score	max_score	q25	q75	KPI_Status
11165	62.46	59	13.35	39	99	50	74	Did Not Meet KPI
6250	64.47	61	13.45	41	99	53	77	Met KPI >80%

# Boxplot Comparison
ggplot(df_clean, aes(x = factor(KPIs_met_more_than_80), y = avg_training_score, 
                      fill = factor(KPIs_met_more_than_80))) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.8) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71"),
                    labels = c("Did Not Meet KPI", "Met KPI")) +
  labs(
    title = "Training Score Distribution Comparison",
    subtitle = "High performers show higher median training scores",
    x = "KPI Achievement Status",
    y = "Average Training Score",
    fill = "KPI Status"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "none"
  )

Insights:

The average training score for employees who met their KPI targets (64.5) was only slightly higher than that of employees who did not (62.5), a difference of about 2 points.
The boxplot shows a small upward shift in the training score distribution among employees who met their KPI targets, consistent with the summary statistics.
The variability in scores was similar for both groups (standard deviation ≈ 13.4), indicating a comparable spread of scores.
According to the summary table and box plot, the interquartile range (IQR) for employees who did not meet their KPI targets is approximately 50–74, while the IQR for employees who met them is approximately 53–77.
These two ranges overlap substantially, indicating that many employees in both groups achieved similar training scores; therefore, training performance alone is insufficient to explain KPI success.
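
To put the 2-point gap on a standardized scale, the summary statistics in the table above can be turned into Cohen's d (difference in means divided by the pooled standard deviation). The sketch below is a minimal worked calculation that uses only the reported counts, means, and standard deviations; the variable names are chosen for illustration.

```r
# Summary statistics copied from the table above
n0 <- 11165; m0 <- 62.46; s0 <- 13.35   # did not meet KPI
n1 <- 6250;  m1 <- 64.47; s1 <- 13.45   # met KPI > 80%

# Pooled standard deviation across the two groups
s_pooled <- sqrt(((n0 - 1) * s0^2 + (n1 - 1) * s1^2) / (n0 + n1 - 2))

# Cohen's d: standardized mean difference
cohens_d <- (m1 - m0) / s_pooled
print(round(cohens_d, 2))   # 0.15
```

A value of about 0.15 is below the conventional benchmark of 0.2 for a "small" effect, which is consistent with the heavy overlap seen in the boxplot.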

3.5.2 Previous Year Rating Analysis vs KPI Achievement

This section examines the relationship between employee previous year performance (previous year rating) and KPI achievement status. This analysis helps determine whether previous year performance influences employee success in meeting KPI targets.

# Previous year rating vs KPI
rating_analysis <- df_clean %>%
  group_by(previous_year_rating, KPIs_met_more_than_80) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(previous_year_rating) %>%
  mutate(percentage = count / sum(count) * 100,
         KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI", "Did Not Meet"))

# Plot
ggplot(rating_analysis, aes(x = factor(previous_year_rating), y = percentage, 
                            fill = KPI_status)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
  labs(title = "KPI Achievement by Previous Year Rating",
       x = "Previous Year Rating", y = "Percentage (%)",
       fill = "KPI Status")

Insights:

Employees with higher previous year ratings show a greater proportion of KPI achievement.
The distribution indicates that past performance is positively associated with KPI achievement: employees who received higher ratings in the previous year were more likely to meet their KPI targets.

3.5.3 Average Training Score by Previous Year Rating

# Plot
ggplot(df_clean,
       aes(x = factor(previous_year_rating),
           y = avg_training_score,
           fill = factor(previous_year_rating))) +

  geom_boxplot(alpha = 0.8,
               outlier.color = "#E74C3C") +

  labs(title = "Average Training Score by Previous Year Rating",
       x = "Previous Year Rating",
       y = "Average Training Score") +

  theme_minimal(base_size = 13) +

  theme(legend.position = "none")

Insights:

Employees who received ratings of 3 to 5 in the previous year had higher average training scores.
Employees with a rating of 1 had the lowest median training score, indicating weaker general capability development; however, the upper whisker reaches almost 95, which shows that high scores exist even in the low-rated group.
In summary, the boxplot implies that training results do not depend entirely on prior performance evaluations, as employees who had low ratings the previous year can still perform well in training.

3.5.4 Categorical Variables vs Target

# Function to create bar plots for categorical variables
plot_categorical_kpi <- function(data, var_name) {
  data %>%
    group_by(!!sym(var_name), KPIs_met_more_than_80) %>%
    summarise(count = n(), .groups = 'drop') %>%
    group_by(!!sym(var_name)) %>%
    mutate(percentage = count / sum(count) * 100,
           KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet")) %>%
    ggplot(aes(x = reorder(!!sym(var_name), -percentage * (KPIs_met_more_than_80 == 1)), 
               y = percentage, fill = KPI_status)) +
    geom_bar(stat = "identity", position = "stack", width = 0.7) +
    geom_text(aes(label = paste0(round(percentage, 1), "%")), 
              position = position_stack(vjust = 0.5), size = 3) +
    scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
    labs(title = paste("KPI Achievement by", var_name),
         x = var_name, y = "Percentage (%)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5, face = "bold"))
}

# Plot for each categorical variable
cat_vars <- c("department", "education", "recruitment_channel", "gender", "awards_won")

for (var in cat_vars) {
  print(plot_categorical_kpi(df_clean, var))
}

# H0 = There is no significant relationship between categorical variables and KPI achievement.
# H1 = There is a significant relationship between categorical variables and KPI achievement.

# Decision Rule: Reject H0, if p-value <0.05

# Chi-square tests
cat_chi_results <- map_df(cat_vars, function(var) {
  
  tbl <- table(df_clean[[var]], df_clean$KPIs_met_more_than_80)
  
  test <- chisq.test(tbl)
  
  # Store numeric p-value for correct sorting
  p_val <- test$p.value
  
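  # Format p-value for display only; always return a character string
  # so that rows bind into a single column type
  p_formatted <- if (p_val < 0.0001) {
    "< 0.0001"
  } else {
    format(round(p_val, 4), nsmall = 4)
  }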
  
  data.frame(
    Variable = var,
    Chi_Square = round(as.numeric(test$statistic), 2),
    P_Value = p_formatted,
    P_Value_Numeric = p_val,
    Significant = ifelse(p_val < 0.05, "Yes", "No")
  )
})

# Display results (with sorting)
kable(
  cat_chi_results %>%
    arrange(P_Value_Numeric) %>%
    select(-P_Value_Numeric),
  caption = "Chi-square Tests: Categorical Variables vs KPI Achievement"
)

Chi-square Tests: Categorical Variables vs KPI Achievement
Variable	Chi_Square	P_Value	Significant
department	292.33	< 0.0001	Yes
awards_won	191.77	< 0.0001	Yes
recruitment_channel	42.91	< 0.0001	Yes
education	40.25	< 0.0001	Yes
gender	28.61	< 0.0001	Yes

Insights:

The chi-square tests indicate that all categorical variables are significantly associated with KPI achievement (p < 0.05).
Department has the largest test statistic (χ² = 292.33), followed by awards_won, recruitment channel, and education.
Gender shows the weakest, but still significant, association.
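
Because the chi-square statistic grows with sample size, a significant p-value says little about how strong an association is. One common effect size is Cramér's V, V = sqrt(χ² / (n × min(r − 1, c − 1))). Since the KPI target has only two levels, min(r − 1, c − 1) = 1 and V reduces to sqrt(χ² / n). The sketch below applies this to the statistics reported above, assuming n = 17,415 (the sum of the two group counts in Section 3.5.1) and that no rows were dropped from any contingency table. For the 2×2 tables (gender, awards_won), chisq.test() applies a continuity correction by default, so those values are approximate.

```r
# Chi-square statistics copied from the results table above
chi_sq <- c(department = 292.33, awards_won = 191.77,
            recruitment_channel = 42.91, education = 40.25,
            gender = 28.61)

# Assumed sample size: 11165 + 6250 employees (Section 3.5.1)
n <- 11165 + 6250

# Cramer's V for a binary target: sqrt(chi-square / n)
cramers_v <- sqrt(chi_sq / n)
print(round(cramers_v, 3))
```

Under these assumptions every value is at or below roughly 0.13, which suggests that the associations are statistically significant but weak.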

3.6 Multivariate Analysis

3.6.1 Performance by gender and department

# Summary function
get_summary <- function(data, group_var) {
  data %>%
    group_by(.data[[group_var]]) %>%
    summarise(
      total_trainings = sum(no_of_trainings, na.rm = TRUE),
      avg_train_score       = mean(avg_training_score, na.rm = TRUE),
      kpi             = sum(KPIs_met_more_than_80 == 1, na.rm = TRUE),
      avg_tenure      = mean(length_of_service, na.rm = TRUE),
      avg_rating      = mean(previous_year_rating, na.rm = TRUE),
      avg_age         = mean(age, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    rename(category = 1)
}

# Focus only: Gender + Department
groups <- c("gender", "department")

summary_list <- lapply(groups, function(g) {
  get_summary(df_clean, g) %>%
    mutate(group = g)
})

combined_perf <- bind_rows(summary_list)


# Split metrics (NO scale mixing)
# Workforce metrics
workforce <- combined_perf %>%
  pivot_longer(
    cols = c(total_trainings, kpi),
    names_to = "metric",
    values_to = "value"
  )

# Performance metrics
performance <- combined_perf %>%
  pivot_longer(
    cols = c(avg_train_score, avg_tenure, avg_rating, avg_age),
    names_to = "metric",
    values_to = "value"
  )

# =========================
# Plot 1: Workforce (Gender + Department)
# =========================
p1 <- ggplot(workforce,
             aes(x = category,
                 y = value,
                 fill = metric)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  
  facet_wrap(~ group, scales = "free_x") +
  
  labs(title = "Workforce Overview by Gender and Department",
       x = "",
       y = "Count",
       fill = "Metric") +
  
  theme_minimal(base_size = 13) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.title = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

# =========================
# Plot 2: Performance (Gender + Department)
# =========================
p2 <- ggplot(performance,
             aes(x = category,
                 y = value,
                 fill = metric)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  
  facet_wrap(~ group, scales = "free_x") +
  
  labs(title = "Performance Metrics by Gender and Department",
       x = "",
       y = "Average Value",
       fill = "Metric") +
  
  theme_minimal(base_size = 13) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.title = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )


# Output
ggplotly(p1) %>%
  layout(
    legend = list(
      orientation = "v",
      x = 1,
      y = 0,
      font = list(size = 9),
      itemwidth = 30
    ),
    margin = list(b = 120)
  )

ggplotly(p2) %>%
  layout(
    legend = list(
      orientation = "v",
      x = 1,
      y = 0,
      font = list(size = 9),
      itemwidth = 30
    ),
    margin = list(b = 120)
  )

# ==========================================
# Multivariate Summary Table
# For Gender + Department Analysis
# ==========================================

# Create formatted summary table
summary_table <- combined_perf %>%
  
  mutate(
    avg_train_score  = round(avg_train_score, 2),
    avg_tenure = round(avg_tenure, 2),
    avg_rating = round(avg_rating, 2),
    avg_age    = round(avg_age, 2)
  ) %>%
  
  arrange(group, desc(avg_train_score)) %>%
  
  rename(
    Category           = category,
    Group              = group,
    `Total Trainings`  = total_trainings,
    `Avg Training Score` = avg_train_score,
    `KPI Achieved`     = kpi,
    `Avg Tenure`       = avg_tenure,
    `Avg Rating`       = avg_rating,
    `Avg Age`          = avg_age
  )

# Display table
kable(
  summary_table,
  caption = "Multivariate Performance Summary by Gender and Department",
  align = "c"
) %>%
  
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#2C3E50"
  ) %>%
  
  column_spec(1, bold = TRUE) %>%
  
  collapse_rows(
    columns = 8,
    valign = "top"
  )

Multivariate Performance Summary by Gender and Department
Category	Total Trainings	Avg Training Score	KPI Achieved	Avg Tenure	Avg Rating	Avg Age	Group
analytics	2281	84.57	679	5.00	3.47	32.41	department
r&d	438	84.45	149	4.80	3.66	32.89
technology	2740	79.85	783	5.84	3.14	35.03
procurement	2993	70.18	836	6.19	3.23	36.17
operations	4121	60.35	1553	6.43	3.63	36.15
finance	1059	60.33	319	5.01	3.49	32.60
legal	355	59.53	118	4.50	3.38	33.75
hr	892	50.39	300	5.63	3.51	34.25
sales & marketing	6903	50.06	1513	5.75	3.10	34.63
f	5992	63.68	1986	5.86	3.37	35.04	gender
m	15790	62.97	4264	5.78	3.30	34.71	gender

Insights:

Plot 1: Workforce Overview by Gender and Department:
- Sales & Marketing (6,903) and Operations (4,121) recorded the highest total number of trainings.
- Operations had the highest KPI count (1,553), slightly more than Sales & Marketing (1,513).
- R&D (438 trainings) and Legal (355 trainings) recorded the lowest training totals.
- Male employees recorded more trainings in total (15,790) and more KPI achievers (4,264) than female employees (5,992 trainings; 1,986 KPI achievers). Because these are raw totals, they depend on the number of employees in each group.
Plot 2: Performance Metrics by Gender and Department:
- Analytics (84.57) and R&D (84.45) had the highest average training scores, followed by Technology (79.85).
- Sales & Marketing (50.06) and HR (50.39) had the lowest average training scores.
- The departments with the highest average tenure were Operations (6.43 years) and Procurement (6.19 years).
- Gender differences were small: female employees scored slightly higher than male employees in average training score (63.68 vs. 62.97) and rating (3.37 vs. 3.30), and were slightly older on average (35.04 vs. 34.71).
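
Raw KPI counts scale with group size, so a large group can post a high count even when a smaller share of its employees meets the target. The sketch below illustrates the difference between a count and a rate on a small toy data frame (made-up values, not the employee dataset); only the column names mirror those used in this report.

```r
# Toy data, NOT the employee dataset: two departments of different size
toy <- data.frame(
  department = c(rep("large_dept", 8), rep("small_dept", 4)),
  KPIs_met_more_than_80 = c(1, 1, 1, 0, 0, 0, 0, 0,   # 3 of 8 met KPI
                            1, 1, 1, 0)               # 3 of 4 met KPI
)

# Count of KPI achievers vs. share of KPI achievers per department
kpi_count <- tapply(toy$KPIs_met_more_than_80, toy$department, sum)
kpi_rate  <- tapply(toy$KPIs_met_more_than_80, toy$department, mean)

print(data.frame(
  department = names(kpi_count),
  kpi_count  = as.vector(kpi_count),
  kpi_rate   = as.vector(kpi_rate)
))
```

Both toy departments have the same count (3) but very different rates (0.375 vs. 0.75). In this report, a comparable rate could be obtained by adding a line such as kpi_rate = mean(KPIs_met_more_than_80 == 1, na.rm = TRUE) to the summarise() call in get_summary().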

3.6.2 Correlation Matrix

# Select numeric variables
num_data <- df_clean %>%
  select(
    no_of_trainings,
    age,
    previous_year_rating,
    length_of_service,
    avg_training_score,
    KPIs_met_more_than_80
  )

# Correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")

cor_melt <- melt(cor_matrix)

# Correlation heatmap
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
  
  geom_tile(color = "white") +
  
  scale_fill_gradient2(
    low = "#E74C3C",   # -1 strong negative
    mid = "white",     # 0 no correlation
    high = "#2ECC71",  # +1 strong positive
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  
  geom_text(aes(label = round(value, 2)), size = 3) +
  
  theme_minimal() +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  
  ggtitle("Correlation Matrix of Employee Performance Variables")

Insights:

The moderate correlation between age and length of service (r = 0.64) reflects expected career progression and tenure stability. Because the value does not exceed the commonly used multicollinearity threshold of 0.70, multicollinearity is not considered a significant concern.
Regarding the target variable (KPI achievement):
- Previous year rating (r = 0.33) has the strongest linear correlation with the target, suggesting some stability in employee performance over time.
- Training-related variables show minimal correlation, implying a limited direct linear relationship with KPI achievement.
Overall, the correlation analysis reveals that most numerical variables have weak linear correlations, indicating that employee performance is driven by several interconnected factors rather than a single predictor.
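
As a rough cross-check on the 0.70 rule of thumb, a pairwise correlation can be translated into a variance inflation factor (VIF). For a model with only two predictors, VIF = 1 / (1 − r²). The sketch below applies this to r = 0.64 under that simplifying two-predictor assumption; with more predictors, the VIF would need to be computed from the fitted model instead.

```r
# Correlation between age and length_of_service reported above
r <- 0.64

# VIF for the two-predictor case (a simplifying assumption)
vif_two_predictors <- 1 / (1 - r^2)
print(round(vif_two_predictors, 2))   # 1.69
```

A value of about 1.69 is well below the commonly quoted VIF cut-offs of 5 or 10.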

3.6.3 Feature Correlation with KPI Achievement

# Select numeric features
num_features <- df_clean %>%
  select(age,
         length_of_service,
         avg_training_score,
         no_of_trainings,
         previous_year_rating)

# Compute correlation with KPI
cor_results <- sapply(num_features, function(x) {
  cor(x, df_clean$KPIs_met_more_than_80, use = "complete.obs")
})

# Convert to data frame and rank
cor_ranked <- data.frame(
  feature = names(cor_results),
  correlation = cor_results
) %>%
  arrange(desc(abs(correlation)))

cor_ranked

# Plot 
ggplot(cor_ranked,
       aes(x = reorder(feature, correlation),
           y = correlation,
           fill = correlation)) +

  geom_bar(stat = "identity") +

  coord_flip() +

  scale_fill_gradient2(low = "#E74C3C",
                       mid = "white",
                       high = "#2ECC71") +

  labs(title = "Feature Correlation with KPI Achievement",
       x = "Feature",
       y = "Correlation Strength") +

  theme_minimal(base_size = 13)

Insights:

Previous year rating (0.32) was the only variable that showed a noticeable correlation with KPI achievement.
The correlation coefficients for all other numerical variables were close to zero, indicating that these variables had little or no linear relationship with KPI achievement.
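
Since KPI achievement is coded 0/1, the Pearson correlation computed above is the point-biserial correlation: it is a rescaled difference between the feature means of the two KPI groups. The sketch below checks this equivalence on simulated stand-in data (not the employee dataset); all variable names are illustrative.

```r
set.seed(42)
# Simulated stand-ins, NOT the employee data
y <- rbinom(200, size = 1, prob = 0.4)         # binary target (0/1)
x <- rnorm(200, mean = 60 + 5 * y, sd = 13)    # numeric feature

n  <- length(y)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
m1 <- mean(x[y == 1])
m0 <- mean(x[y == 0])

# Point-biserial form, using the sample SD of x (n - 1 denominator)
r_pb      <- (m1 - m0) / sd(x) * sqrt(n1 * n0 / (n * (n - 1)))
r_pearson <- cor(x, y)

print(c(point_biserial = r_pb, pearson = r_pearson))
```

The two values agree, which is why a small difference in group means relative to the overall spread translates into a correlation close to zero.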

3.7 Insights Before Modeling

Based on EDA findings:
- Numerical variables such as avg_training_score and previous_year_rating are positively associated with KPI achievement. Employees with higher ratings from the previous year, and to a lesser extent higher training scores, tend to perform better on KPIs.
- length_of_service and no_of_trainings exhibit a right-skewed distribution, indicating that most employees have shorter tenure and have attended fewer training sessions.
- Categorical analysis reveals differences in employee performance across department, education, gender, awards_won and recruitment_channel. Chi-square tests confirm statistically significant associations between categorical variables and KPI achievement.
- Bivariate analysis indicates that employees who met their KPI targets have slightly higher average training scores (64.5 vs. 62.5), but the two distributions overlap heavily, so training score on its own is a weak predictor of KPI achievement.
- The relationship between previous_year_rating and avg_training_score reveals a potential nonlinear pattern, suggesting that performance may vary across different rating groups.
- Correlation analysis revealed a moderate correlation between KPI achievement and the previous year’s rating, while all other numerical variables show a weak correlation. This suggests that KPI performance is likely driven by a combination of nonlinear effects, categorical factors (such as department and education), and other contextual or behavioral variables not covered in this correlation analysis. The correlation between age and length_of_service is approximately 0.64; since it does not exceed the common threshold of 0.7, multicollinearity is not considered a concern.
Overall, the results of the Exploratory Data Analysis (EDA) indicate that employee demographics, training performance, previous year rating, and departmental characteristics may influence the achievement of KPI and should therefore be taken into account during the modeling phase.

4.0 Data Analysis & Modelling

This section applies statistical and machine learning techniques to uncover meaningful insights from the cleaned dataset. The goal is to identify key predictors, evaluate model performance, and generate reliable forecasts. By combining exploratory analysis with predictive modelling, we aim to transform raw data into actionable knowledge that supports decision‑making.

Question: Can employee KPI achievement (more than 80%) be predicted using demographic, training, and workplace-related variables?

4.1 Logistic Regression (Classification)

# ============================================================
# Logistic Regression: Employee Performance
# ============================================================

library(caret)
library(ggplot2)
library(pROC)
library(dplyr)
library(gridExtra)

# STEP 1: Load data
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)

# STEP 2: Remove redundant columns
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL

# STEP 3: Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)

# STEP 4: Convert target to factor
df_clean$KPIs_met_more_than_80 <- factor(
  df_clean$KPIs_met_more_than_80,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

4.1.1 Train-test split

# STEP 5: Train/Test split
set.seed(42)

train_idx <- createDataPartition(
  df_clean$KPIs_met_more_than_80,
  p = 0.80,
  list = FALSE
)

train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]

# STEP 6: Build Logistic Regression Model
log_model <- glm(
  KPIs_met_more_than_80 ~ .,
  data = train_data,
  family = binomial
)

summary(log_model)

# STEP 7: Predictions
# Predicted probability of "Yes", then classification at a 0.5 threshold
pred_prob <- predict(log_model, newdata = test_data, type = "response")

pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")

pred_class <- factor(pred_class, levels = c("No", "Yes"))

4.1.2 Confusion Matrix

# STEP 8: Evaluation
cm <- confusionMatrix(
  pred_class,
  test_data$KPIs_met_more_than_80,
  positive = "Yes"
)

print(cm)

# Key metrics as percentages
acc <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)
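
The metrics extracted from the confusion matrix above are simple ratios of its four cells. The sketch below works them out by hand for hypothetical counts (not results from this report), with "Yes" as the positive class, to show what each metric measures.

```r
# Hypothetical confusion-matrix counts, NOT results from this report
tp <- 40; fn <- 10    # actual "Yes": predicted Yes / predicted No
fp <- 20; tn <- 130   # actual "No":  predicted Yes / predicted No

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
sensitivity <- tp / (tp + fn)                    # share of actual "Yes" found
specificity <- tn / (tn + fp)                    # share of actual "No" found
precision   <- tp / (tp + fp)                    # share of predicted "Yes" correct
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, f1 = f1), 3))
```

With these made-up counts, accuracy is 0.85 while sensitivity is 0.80 and F1 is about 0.727, which illustrates why accuracy alone can look better than the model's ability to identify the positive class.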

4.1.3 ROC, AUC

# STEP 9: ROC / AUC
roc_obj <- roc(
  response = test_data$KPIs_met_more_than_80,
  predictor = pred_prob,
  levels = c("No", "Yes")
)

auc_val <- round(auc(roc_obj), 4)

# Points of the ROC curve for plotting
roc_df <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)
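
The AUC has a direct interpretation: it is the probability that a randomly chosen "Yes" case receives a higher predicted probability than a randomly chosen "No" case, with ties counted as one half. The sketch below computes it by brute force for a few hypothetical probabilities (not results from this report).

```r
# Hypothetical predicted probabilities, NOT results from this report
prob_yes <- c(0.9, 0.8, 0.4)   # cases whose actual class is "Yes"
prob_no  <- c(0.7, 0.3, 0.2)   # cases whose actual class is "No"

# Compare every Yes/No pair: 1 if the Yes case ranks higher, 0.5 for ties
pair_score <- outer(prob_yes, prob_no,
                    function(p, q) (p > q) + 0.5 * (p == q))

auc_manual <- mean(pair_score)
print(round(auc_manual, 4))   # 0.8889
```

Eight of the nine pairs are ranked correctly, giving 8/9 ≈ 0.8889. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.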

4.1.4 Model Evaluation

# STEP 10: Metrics table
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "F1 Score"),
  Value = c(acc, sens, spec, f1)
)

# STEP 11: Feature Importance
coef_df <- data.frame(
  Feature = names(coef(log_model)),
  Coefficient = coef(log_model)
)

coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction = ifelse(Coefficient > 0, "Positive", "Negative")
  ) %>%
  arrange(desc(Abs_Coefficient))

top_coef_df <- coef_df %>%
  slice_max(order_by = Abs_Coefficient, n = 15)

print(coef_df)

# STEP 12: Predicted Probability
prob_df <- data.frame(
  Probability = pred_prob,
  Actual_Class = test_data$KPIs_met_more_than_80
)

# ============================================================
# STEP 13: Dashboard Plots
# ============================================================

# p1: Class Distribution
class_tbl <- table(df_clean$KPIs_met_more_than_80)

class_pct <- round(prop.table(class_tbl) * 100, 1)

class_df <- data.frame(
  Class = names(class_tbl),
  Count = as.integer(class_tbl),
  Percent = as.numeric(class_pct)
)

p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(Count, " (", Percent, "%)")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Class Distribution",
       subtitle = "KPIs Met More Than 80%",
       x = "Class", y = "Count") +
  theme_minimal(base_size = 14)

# p2: Confusion Matrix
cm_tbl <- as.data.frame(cm$table)

names(cm_tbl) <- c("Predicted", "Actual", "Freq")

p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title = "Confusion Matrix",
       subtitle = "Predicted vs Actual Classes",
       x = "Actual Class", y = "Predicted Class", fill = "Count") +
  theme_minimal(base_size = 14)

# p3: ROC Curve
p3 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "ROC Curve",
       subtitle = paste("AUC =", auc_val),
       x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal(base_size = 14)

# p4: Performance Metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(Value, "%")),
            vjust = -0.4, fontface = "bold") +
  ylim(0, 110) +
  labs(title = "Performance Metrics", y = "Score (%)") +
  theme_minimal(base_size = 14)

# p5: Feature Importance
p5 <- ggplot(top_coef_df,
             aes(x = reorder(Feature, Abs_Coefficient),
                 y = Abs_Coefficient,
                 fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance",
       subtitle = "Top 15 absolute coefficients",
       x = "Feature", y = "Absolute Coefficient") +
  theme_minimal(base_size = 14)

# p6: Predicted Probability
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75,
                 position = "identity", colour = "white") +
  labs(title = "Predicted Probability",
       subtitle = "Probability of KPI Achievement",
       x = "Probability of Yes", y = "Count", fill = "Actual Class") +
  theme_minimal(base_size = 14)

# Combine dashboard
logistic_dashboard <- grid.arrange(
  p1, p2, p3, p4, p5, p6,
  ncol = 2, nrow = 3
)

# Save dashboard
ggsave("logistic_regression_dashboard.png",
       plot = logistic_dashboard,
       width = 14, height = 16, dpi = 150)

# Metrics (same values as computed in STEP 8)
acc <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)

# Save Logistic Regression metrics
logistic_acc <- acc
logistic_sens <- sens
logistic_spec <- spec
logistic_f1 <- f1
logistic_auc <- auc_val
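
The feature-importance step above ranks predictors by the absolute value of their raw coefficients. Raw logistic coefficients are expressed in log-odds per one unit of each predictor, so their size depends on the predictor's scale; exponentiating them gives odds ratios, which are often easier to interpret. The sketch below illustrates this on simulated stand-in data (not the employee dataset); the variable names and the true effect sizes are made up for the example.

```r
set.seed(42)
# Simulated stand-ins, NOT the employee data
n      <- 2000
rating <- sample(1:5, n, replace = TRUE)     # predictor on a 1-5 scale
score  <- rnorm(n, mean = 63, sd = 13)       # predictor on a ~40-100 scale

# Assumed true model used only to generate the toy outcome
log_odds <- -3 + 0.6 * rating + 0.01 * score
kpi_met  <- rbinom(n, size = 1, prob = plogis(log_odds))

toy_model <- glm(kpi_met ~ rating + score, family = binomial)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coef(toy_model))[-1]   # drop the intercept
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the odds of meeting the KPI rise as the predictor increases. Because a one-unit change means something different for each predictor, comparing magnitudes across predictors is more meaningful after numeric predictors are put on a common scale.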

4.2 Random Forest (Classification)

# ============================================================
# Random Forest: Employee Performance
# ============================================================

# ── 0. Install & load packages ────────────────────────────
required_packages <- c("randomForest", "caret", "ggplot2", "reshape2",
                       "pROC", "dplyr", "gridExtra")

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

library(randomForest)
library(caret)
library(ggplot2)
library(reshape2)
library(pROC)
library(dplyr)
library(gridExtra)

# ============================================================
# STEP 1: Load data
# ============================================================
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)
cat(sprintf("Loaded: %d rows x %d columns\n", nrow(df_clean), ncol(df_clean)))

# ============================================================
# STEP 2: Fix issue 1: drop redundant / derived columns
# ============================================================
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
cat("Step 2: Dropped redundant columns: avg_training_score_scaled, age_group\n")

# ============================================================
# STEP 3: Fix issue 2: convert character columns to factors
# ============================================================
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)
cat("Step 3: Converted to factor:", paste(char_cols, collapse = ", "), "\n")
cat("  'unknown' kept as a valid factor level in 'education'\n")

# ============================================================
# STEP 4: Fix issue 3: convert target to factor
# ============================================================
df_clean$KPIs_met_more_than_80 <- factor(df_clean$KPIs_met_more_than_80,
                                         levels = c(0, 1),
                                         labels = c("No", "Yes"))
cat("Step 4: Target 'KPIs_met_more_than_80' converted to factor\n")

# ============================================================
# STEP 5: Handle NA values
# ============================================================
na_total <- sum(is.na(df_clean))
cat(sprintf("Total NAs found: %d\n", na_total))
if (na_total > 0) {
  # Median (numeric) / mode (factor) imputation from randomForest
  df_clean <- na.roughfix(df_clean)
  cat("  na.roughfix() applied\n")
} else {
  cat("  No NAs, skipping imputation\n")
}

# ============================================================
# STEP 6: Class imbalance check & compute class weights
# ============================================================
cat("Class distribution\n")
class_tbl <- table(df_clean$KPIs_met_more_than_80)
print(class_tbl)
class_pct <- round(prop.table(class_tbl) * 100, 1)
print(class_pct)

# Weight the minority class ("Yes") by the No/Yes count ratio
n_no <- as.integer(class_tbl["No"])
n_yes <- as.integer(class_tbl["Yes"])
wt_no <- 1
wt_yes <- round(n_no / n_yes, 2)
class_weights <- c("No" = wt_no, "Yes" = wt_yes)
cat(sprintf("  Class weights - No: %.2f | Yes: %.2f\n", wt_no, wt_yes))
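
The class-weight rule in Step 6 gives the minority class a weight equal to the ratio of majority to minority counts. The sketch below works through that arithmetic using the class counts from the summary table in Section 3.5.1, on the assumption that the modelling data has the same class split as the data used in the EDA.

```r
# Class counts from the summary table in Section 3.5.1
# (assumes the modelling data has the same class split)
n_no  <- 11165
n_yes <- 6250

# Weight for "Yes" = how many "No" cases there are per "Yes" case
wt_yes <- round(n_no / n_yes, 2)
class_weights <- c("No" = 1, "Yes" = wt_yes)
print(class_weights)
```

Under that assumption the "Yes" class receives a weight of about 1.79, so each KPI achiever counts for roughly 1.8 non-achievers when the forest is grown.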

4.2.1 Train-test split

============================================================

STEP 7 — Train / Test split

============================================================

set.seed(42)
# Stratified 80/20 split that preserves the class proportions of the target
train_idx <- createDataPartition(df_clean$KPIs_met_more_than_80, p = 0.80, list = FALSE)
train_data <- df_clean[ train_idx, ]
test_data <- df_clean[-train_idx, ]
cat(sprintf("Train rows: %d | Test rows: %d\n", nrow(train_data), nrow(test_data)))

============================================================

STEP 8 — Build Random Forest model

============================================================

set.seed(42)
n_features <- ncol(train_data) - 1       # all columns except the target
mtry_val <- floor(sqrt(n_features))      # common default for classification forests

cat(sprintf("Training Random Forest (ntree=500, mtry=%d) ...\n", mtry_val))

rf_model <- randomForest(
  KPIs_met_more_than_80 ~ .,
  data = train_data,
  ntree = 500,
  mtry = mtry_val,
  importance = TRUE,
  classwt = class_weights
)

print(rf_model)

4.2.2 Confusion Matrix

============================================================

STEP 9 — Predict on test set

============================================================

preds_class <- predict(rf_model, newdata = test_data)
preds_prob <- predict(rf_model, newdata = test_data, type = "prob")

============================================================

STEP 10 — Performance Evaluation

============================================================

cm <- confusionMatrix(preds_class, test_data$KPIs_met_more_than_80, positive = "Yes")
print(cm)

# Headline metrics, expressed as percentages
acc <- round(cm$overall["Accuracy"] * 100, 2)
kappa <- round(cm$overall["Kappa"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
precision <- round(cm$byClass["Precision"] * 100, 2)
f1 <- round(cm$byClass["F1"] * 100, 2)

# roc() and auc() come from the pROC package
roc_obj <- roc(
  response = test_data$KPIs_met_more_than_80,
  predictor = preds_prob[, "Yes"],
  levels = c("No", "Yes"),
  direction = "<"
)
auc_val <- round(auc(roc_obj), 4)

============================================================

Visualize Confusion Matrix

============================================================

Load required libraries

library(caret)
library(ggplot2)

Example: assume you already have predictions

preds_class <- predict(rf_model, newdata = test_data)

cm <- confusionMatrix(preds_class, test_data$KPIs_met_more_than_80, positive = "Yes")

Convert confusion matrix to data frame

cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")

Plot confusion matrix heatmap

ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title = "Confusion Matrix Heatmap",
       subtitle = "Model predictions vs actual classes",
       x = "Actual Class", y = "Predicted Class", fill = "Count") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")

4.2.3 ROC, AUC

============================================================

ROC Curve

============================================================

roc_obj <- roc(
  response = test_data$KPIs_met_more_than_80,
  predictor = preds_prob[, "Yes"],
  levels = c("No", "Yes"),
  direction = "<"
)

roc_df <- data.frame(FPR = 1 - roc_obj$specificities, TPR = roc_obj$sensitivities)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#1A5FA8", linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed",
              colour = "grey60", linewidth = 0.7) +
  annotate("text", x = 0.65, y = 0.12,
           label = sprintf("AUC = %.4f", auc(roc_obj)),
           size = 5, fontface = "bold", colour = "#1A5FA8") +
  labs(title = "ROC Curve",
       subtitle = "Receiver Operating Characteristic (test set)",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  theme_minimal(base_size = 14) +
  coord_equal()

4.2.4 Model Evaluation

============================================================

Feature Importance (Mean Decrease Gini)

============================================================

imp_mat <- importance(rf_model)
imp_df <- data.frame(
  Feature = rownames(imp_mat),
  MeanDecreaseAccuracy = imp_mat[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = imp_mat[, "MeanDecreaseGini"]
)

# Order features by importance so the flipped bar chart is sorted
imp_df <- imp_df[order(imp_df$MeanDecreaseGini), ]
imp_df$Feature <- factor(imp_df$Feature, levels = imp_df$Feature)

ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = round(MeanDecreaseGini, 1)), hjust = -0.15, size = 3.5) +
  scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
  coord_flip() +
  labs(title = "Feature Importance (Mean Decrease Gini)",
       subtitle = "Higher = more important for node purity",
       y = "Mean Decrease Gini") +
  theme_minimal(base_size = 14)

============================================================

Class Distribution Bar Chart

============================================================

class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(prop.table(class_tbl) * 100, 1)
class_df <- data.frame(
  Class = names(class_tbl),
  Count = as.integer(class_tbl),
  Percent = as.numeric(class_pct)
)

ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
            vjust = -0.3, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Class Distribution of Target Variable",
       subtitle = "KPIs met more than 80%",
       x = "KPIs Met > 80%", y = "Count") +
  theme_minimal(base_size = 14)

============================================================

Performance Metrics Bar Chart

============================================================

metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision", "F1 Score"),
  Value = c(acc, sens, spec, precision, f1)
)

ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", Value)),
            vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Model Performance Metrics",
       subtitle = "Evaluated on test set",
       y = "Score (%)") +
  ylim(0, 110) +
  theme_minimal(base_size = 14)

============================================================

Predicted Probability Histogram

============================================================

prob_df <- data.frame(
  Probability = preds_prob[, "Yes"],
  Actual_Class = test_data$KPIs_met_more_than_80
)

ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75, position = "identity", colour = "white") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(title = "Predicted Probability Distribution",
       subtitle = "Probability of KPIs > 80% by actual class",
       x = "Predicted probability (Yes)", y = "Count", fill = "Actual class") +
  theme_minimal(base_size = 14)

============================================================

Create a Dashboard

============================================================

# STEP-1: Create the plots
# --------------------------------------------
# p1: Class distribution
p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  labs(title = "Class Distribution")

# p2: Confusion matrix
p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile() +
  labs(title = "Confusion Matrix")

# p3: Feature importance
p3 <- ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance (Gini)")

# p4: Performance metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity") +
  labs(title = "Performance Metrics")

# p5: ROC curve
p5 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line() +
  labs(title = "ROC Curve")

# p6: Probability histogram
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05) +
  labs(title = "Predicted Probability Distribution")

# --------------------------------------------
# STEP-2: Arrange the Graphs into a Dashboard
# --------------------------------------------
library(gridExtra)

combined_dashboard <- grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)
ggsave("dashboard_overview.png", plot = combined_dashboard,
       width = 18, height = 10, dpi = 150)

4.3 Insights and Comparison Table

Confusion Matrix

When compared with the Random Forest model, the Logistic Regression model achieved a slightly higher overall accuracy (71.03% vs 70.23%). However, the difference in performance becomes clearer when examining sensitivity and specificity. Logistic Regression recorded a much lower sensitivity of 43% compared to Random Forest’s 57%, meaning it missed a larger number of employees who actually met the KPI target. In contrast, Logistic Regression achieved a substantially higher specificity of 87%, while Random Forest obtained 78%, indicating that Logistic Regression was better at correctly identifying employees who did not meet KPI expectations.

The Random Forest model demonstrated a more balanced classification performance overall. Although its accuracy was marginally lower, it produced a higher F1-score (57.86% compared to 51.79% for Logistic Regression), showing a better balance between precision and recall. Random Forest was therefore more effective at detecting true high performers, while Logistic Regression was more conservative and focused on minimizing false positives. This is also reflected in the confusion matrices, where Logistic Regression produced more false negatives (708) than Random Forest (538), meaning more actual high performers were overlooked.

Additionally, the Kappa statistic for Random Forest (0.35) was slightly higher than that for Logistic Regression (0.32), suggesting better agreement beyond chance. The Random Forest model also achieved a higher balanced accuracy (67% vs 65%), indicating stronger overall performance across both classes rather than favoring the majority class. While Logistic Regression excelled in identifying non-high performers, Random Forest provided a more even trade-off between identifying successful and unsuccessful employees.

Overall, Logistic Regression appears more suitable when the priority is reducing false claims of high performance, whereas Random Forest is more appropriate when the goal is to identify as many genuine high performers as possible. Since employee performance prediction often benefits from detecting successful employees accurately, the Random Forest model may be considered the more practical and reliable choice despite its slightly lower overall accuracy.
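The metrics compared above all derive from the four cells of a confusion matrix. The sketch below is a minimal illustration rather than the project's code: `cm_metrics` is a hypothetical helper and the toy counts are invented. The final lines recompute balanced accuracy from the sensitivity and specificity values reported in this section.

```r
# Derive the usual classification metrics from the four confusion-matrix cells.
# "Positive" is the class of interest (here: KPI target met).
cm_metrics <- function(tp, fp, fn, tn) {
  sensitivity <- tp / (tp + fn)   # share of actual positives that were found
  specificity <- tn / (tn + fp)   # share of actual negatives that were found
  precision   <- tp / (tp + fp)   # share of predicted positives that were right
  c(
    accuracy          = (tp + tn) / (tp + fp + fn + tn),
    sensitivity       = sensitivity,
    specificity       = specificity,
    precision         = precision,
    f1                = 2 * precision * sensitivity / (precision + sensitivity),
    balanced_accuracy = (sensitivity + specificity) / 2
  )
}

# Hypothetical counts, for illustration only
toy_metrics <- cm_metrics(tp = 40, fp = 10, fn = 20, tn = 30)
print(round(toy_metrics, 3))

# Balanced accuracy recomputed from the reported sensitivity and specificity (%)
reported_bal_acc <- c(
  "Random Forest"       = mean(c(56.96, 77.65)),
  "Logistic Regression" = mean(c(43.36, 86.52))
)
print(round(reported_bal_acc, 1))
```

The recomputed values (about 67.3% and 64.9%) agree with the rounded balanced accuracies quoted above.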

ROC, AUC

Both models performed better than random guessing, with Logistic Regression achieving a slightly higher AUC (0.7402) than Random Forest (0.7327), indicating marginally better overall class separation. However, Random Forest showed stronger practical performance by achieving higher sensitivity and F1-score, meaning it was better at identifying employees who actually met KPI targets. Logistic Regression was more conservative, focusing more on correctly identifying non-high performers and reducing false positives. Overall, Random Forest provides a more balanced model for employee performance prediction, while Logistic Regression is better when minimizing false positive predictions is the priority.
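One way to read an AUC value is as the probability that a randomly chosen employee who met the KPI target receives a higher predicted probability than a randomly chosen employee who did not. The sketch below computes AUC directly from that definition; `pairwise_auc` is a hypothetical helper and the probabilities are invented for illustration. The analysis itself uses `roc()` and `auc()` from pROC.

```r
# AUC as the share of (positive, negative) pairs ranked correctly;
# ties count as half a correct pair.
pairwise_auc <- function(scores_pos, scores_neg) {
  diffs <- outer(scores_pos, scores_neg, "-")
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

# Hypothetical predicted probabilities of "Yes", for illustration only
p_yes_actual_yes <- c(0.9, 0.8, 0.4)
p_yes_actual_no  <- c(0.7, 0.3, 0.2)

toy_auc <- pairwise_auc(p_yes_actual_yes, p_yes_actual_no)
print(toy_auc)  # 8 of the 9 pairs are ordered correctly
```

On this reading, an AUC of about 0.74 means roughly three out of four such pairs are ranked in the right order, regardless of the classification threshold.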

Model Evaluation

The Logistic Regression coefficient analysis shows that awards_won and previous_year_rating are the strongest positive predictors of employees meeting more than 80% of their KPIs. Employees who received awards or had strong previous performance ratings were much more likely to achieve high KPI outcomes. Several regional variables also had relatively strong positive effects, while some departments such as sales & marketing, legal, and HR showed negative relationships with KPI success. Features such as no_of_trainings and length_of_service had smaller negative coefficients, indicating weaker influence on the prediction outcome.

Compared with Random Forest, Logistic Regression provides coefficient-based interpretations that show whether each variable increases or decreases the likelihood of success. However, Random Forest offered clearer feature importance rankings and captured more complex non-linear relationships between variables. Logistic Regression is therefore simpler and easier to interpret statistically, while Random Forest provides stronger practical insight into which factors most influence employee performance predictions.

Comparison Table

# logistic_* values are the metrics saved from the Logistic Regression section
comparison_df <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  Accuracy = c(logistic_acc, acc),
  Sensitivity = c(logistic_sens, sens),
  Specificity = c(logistic_spec, spec),
  F1_Score = c(logistic_f1, f1),
  AUC = c(logistic_auc, auc_val)
)

comparison_df

Insights and Recommendations

Both Logistic Regression and Random Forest achieved similar accuracy, with Logistic Regression performing slightly better (71.03% vs 70.23%). However, Random Forest showed much higher sensitivity (56.96% vs 43.36%), meaning it was better at identifying employees who actually met KPI targets. Logistic Regression achieved higher specificity (86.52% vs 77.65%), indicating it was stronger at identifying employees who did not meet KPIs. Random Forest also obtained a higher F1-score (57.86% vs 51.79%), showing a better balance between precision and recall. Although Logistic Regression had a slightly higher AUC (0.7402 vs 0.7327), the difference was minimal. Overall, Logistic Regression is more conservative and better at avoiding false positives, while Random Forest provides a more balanced performance and is more effective at detecting genuine high performers.

4.4 Linear Regression Model

Question: Can employee average training score be predicted using demographic and workplace-related variables?

============================================================

Linear Regression — Employee Training Score Prediction

============================================================

── 0. Install & load packages ─────────────────────────────

required_packages <- c("caret", "ggplot2", "dplyr", "Metrics", "gridExtra")

# Install any package that is not yet available
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

library(caret)
library(ggplot2)
library(dplyr)
library(Metrics)
library(gridExtra)

============================================================

STEP 1 — Load data

============================================================

df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)

cat(sprintf("Loaded: %d rows x %d columns\n", nrow(df_clean), ncol(df_clean)))

============================================================

STEP 2 — Remove redundant columns

============================================================

df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
df_clean$KPIs_met_more_than_80 <- NULL

cat("Dropped redundant columns\n")

============================================================

STEP 3 — Convert character columns to factors

============================================================

char_cols <- names(df_clean)[sapply(df_clean, is.character)]

df_clean[char_cols] <- lapply( df_clean[char_cols], as.factor )

cat("Converted character columns to factors\n")

============================================================

STEP 4 — Handle missing values

============================================================

na_total <- sum(is.na(df_clean))

cat(sprintf("Total NAs found: %d\n", na_total))

if (na_total > 0) {
  # Median-impute numeric columns only; NAs in factor columns are left as-is
  for (col in names(df_clean)) {
    if (is.numeric(df_clean[[col]])) {
      df_clean[[col]][is.na(df_clean[[col]])] <-
        median(df_clean[[col]], na.rm = TRUE)
    }
  }
  cat("Median imputation applied\n")
} else {
  cat("No missing values found\n")
}

4.4.1 Train-test split

============================================================

STEP 5 — Train/Test Split

============================================================

set.seed(42)

train_idx <- createDataPartition( df_clean$avg_training_score, p = 0.80, list = FALSE )

train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]

cat(sprintf("Train rows: %d | Test rows: %d\n", nrow(train_data), nrow(test_data)))

============================================================

STEP 6 — Build Linear Regression Model

============================================================

lm_model <- lm( avg_training_score ~ age + previous_year_rating + length_of_service + no_of_trainings + department + education + gender + recruitment_channel + awards_won,

data = train_data )

summary(lm_model)

4.4.2 Model Evaluation: RMSE, MAE, RSquared

============================================================

STEP 7 — Predictions

============================================================

predictions <- predict( lm_model, newdata = test_data )

actual_values <- test_data$avg_training_score

============================================================

STEP 8 — Regression Evaluation Metrics

============================================================

rmse_val <- round( rmse(actual_values, predictions), 3 )

mae_val <- round( mae(actual_values, predictions), 3 )

r2_val <- round( cor(actual_values, predictions)^2, 3 )

cat("Model Performance\n")
cat("RMSE :", rmse_val, "\n")
cat("MAE  :", mae_val, "\n")
cat("R²   :", r2_val, "\n")

4.4.3 Result Visualization: Actual vs. Predicted

============================================================

STEP 9 — Actual vs Predicted Plot

============================================================

results_df <- data.frame( Actual = actual_values, Predicted = predictions )

p1 <- ggplot(results_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5, colour = "#1A5FA8") +
  geom_abline(slope = 1, intercept = 0, colour = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Values",
       subtitle = "Linear Regression",
       x = "Actual Training Score",
       y = "Predicted Training Score") +
  theme_minimal(base_size = 14)

4.4.4 Residual Plot and Distribution

============================================================

STEP 10 — Residual Plot

============================================================

results_df$Residuals <- actual_values - predictions

p2 <- ggplot(results_df, aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5, colour = "#2E7D32") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed") +
  labs(title = "Residual Plot", x = "Predicted Values", y = "Residuals") +
  theme_minimal(base_size = 14)

============================================================

STEP 11 — Residual Distribution

============================================================

p3 <- ggplot(results_df, aes(x = Residuals)) +
  geom_histogram(bins = 30, fill = "#4C9BE8", colour = "white", alpha = 0.8) +
  labs(title = "Residual Distribution", x = "Residual", y = "Count") +
  theme_minimal(base_size = 14)

4.4.5 Feature Importance

============================================================

STEP 12 — Feature Importance

============================================================

coef_df <- data.frame( Feature = names(coef(lm_model)), Coefficient = coef(lm_model) )

coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction = ifelse(Coefficient > 0, "Positive", "Negative")
  ) %>%
  arrange(desc(Abs_Coefficient))

top_coef_df <- coef_df %>% slice_max( order_by = Abs_Coefficient, n = 15 )

p4 <- ggplot(top_coef_df,
             aes(x = reorder(Feature, Abs_Coefficient),
                 y = Abs_Coefficient,
                 fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance",
       subtitle = "Linear Regression Coefficients",
       x = "Feature",
       y = "Absolute Coefficient") +
  theme_minimal(base_size = 14)

Insights

The linear regression coefficient plot shows how different employee characteristics influence the average training score. Features with positive coefficients increase the predicted training score, while negative coefficients reduce it. Among the positive predictors, awards_won has one of the strongest positive effects, suggesting that employees who received awards tend to achieve higher training scores. Variables such as previous_year_rating, education level, and referral-based recruitment also contribute positively, although their effects are smaller.

On the other hand, departments such as sales & marketing, HR, legal, operations, and finance have large negative coefficients, indicating that employees in these departments tend to have lower average training scores compared to the reference department. Variables such as no_of_trainings and length_of_service also show slight negative relationships, suggesting that attending more trainings or having longer service does not necessarily correspond to higher average training performance.

Overall, the model suggests that employee recognition and past performance are associated with stronger training outcomes, while departmental differences appear to play a significant role in influencing average training scores. The coefficient directions also help explain which factors are linked to higher or lower training performance within the organization.
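Because `department` enters `lm()` as a factor, each department coefficient is a difference from the reference level (by default the first level alphabetically), not an absolute score. The sketch below illustrates this on a small invented dataset; the department labels and scores are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical data, for illustration only: three departments, four employees each
toy <- data.frame(
  department = factor(rep(c("Analytics", "HR", "Legal"), each = 4)),
  score = c(84, 86, 85, 85,
            50, 52, 49, 49,
            60, 61, 59, 60)
)

# "Analytics" is the first factor level, so it becomes the reference category
fit <- lm(score ~ department, data = toy)
print(coef(fit))

# The dummy coefficients equal each group's mean minus the reference group's mean
group_means <- tapply(toy$score, toy$department, mean)
print(group_means - group_means["Analytics"])
```

Here the intercept is the reference group's mean (85), and the HR and Legal coefficients (-35 and -25) are gaps relative to it. A large negative department coefficient therefore says that department scores lower than the reference department, other predictors held equal.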

4.4.6 Regression Model Performance

============================================================

STEP 13 — Regression Metrics Bar Chart

============================================================

metrics_df <- data.frame(
  Metric = c("RMSE", "MAE", "R²"),
  Value = c(rmse_val, mae_val, r2_val)
)

p5 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Value), vjust = -0.4, fontface = "bold") +
  labs(title = "Regression Performance Metrics", y = "Value") +
  theme_minimal(base_size = 8)

Insights

The regression performance metrics indicate that the linear regression model performs well in predicting employees’ average training scores. The R² value of 0.888, computed here as the squared correlation between actual and predicted scores on the test set, means that approximately 88.8% of the variation in training scores can be explained by the predictor variables included in the model. This suggests a very strong fit, indicating that the selected features are highly effective in explaining employee training performance.

The Mean Absolute Error (MAE) of 2.74 shows that, on average, the model’s predictions differ from the actual training scores by about 2.7 points. Since MAE measures the average magnitude of prediction errors without considering direction, this relatively small value suggests that the model predictions are generally close to the true scores.

Similarly, the Root Mean Squared Error (RMSE) of 4.51 indicates that the model’s prediction errors are relatively low overall, although RMSE is slightly higher because it penalizes larger errors more heavily. The difference between RMSE and MAE suggests that while most predictions are accurate, there may still be a few larger prediction errors present in the dataset.

Overall, these metrics indicate that the linear regression model has strong predictive performance and is effective for estimating employee average training scores. The high R² combined with relatively low MAE and RMSE values suggests that the model captures the underlying relationships in the data well and provides reliable predictions.
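For reference, the three metrics can be computed from first principles in a few lines. The sketch below uses four invented actual and predicted scores and is not the project's data. It also shows the two common definitions of R²: the squared correlation used in Step 8, and 1 - SSE/SST, which can differ from it on held-out data.

```r
# Hypothetical actual and predicted scores, for illustration only
actual    <- c(60, 70, 80, 90)
predicted <- c(62, 69, 77, 94)
errors    <- actual - predicted          # -2, 1, 3, -4

mae_toy  <- mean(abs(errors))            # average absolute error
rmse_toy <- sqrt(mean(errors^2))         # squaring weights large errors more heavily

# Two definitions of R-squared
r2_cor <- cor(actual, predicted)^2                               # as in Step 8
r2_sse <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)     # 1 - SSE/SST

cat("MAE :", mae_toy, "\n")
cat("RMSE:", round(rmse_toy, 3), "\n")
cat("R2 (squared correlation):", round(r2_cor, 3), "\n")
cat("R2 (1 - SSE/SST)        :", round(r2_sse, 3), "\n")
```

RMSE is never smaller than MAE, and the gap widens as a few large errors dominate, which is the pattern described above for the reported values of 4.51 and 2.74.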

Employee Performance Analysis Using R: Determining the Factors Influencing KPI Achievements

MELISSA A/P XAVIER, DHEWI TRIESULEHA BINTI SAFEI, PAN HUI XIN, SHEIKH EMRAN SHIRAGE, NUR ALLYSHA FRANKFORT BINTI IZWAN LEWIS

2026-05-27
