1.0 Introduction

Amid the era of globalization and relentless competitive pressures, sustaining strong employee performance has become a central priority and one of the biggest challenges for Human Resource (HR) departments. Organizations can no longer rely solely on intuition; instead, they are increasingly turning to data‑driven analytics to evaluate workforce outcomes, optimize talent management, and reduce turnover. Understanding the specific, underlying factors that influence productivity is essential for building sustainable, long‑term employee success.

This project uses R to explore an HR dataset. By applying statistical methods and machine learning techniques, it aims to identify the key drivers of high performance and Key Performance Indicator (KPI) achievement, and to translate raw HR data into actionable organizational insights.

1.1 Objective of the Project

The primary objective of this project is to use R to identify the factors, such as employee demographics, training effectiveness, length of service, and prior performance ratings, that significantly influence whether an employee meets more than 80% of their Key Performance Indicators (KPIs). By applying statistical techniques, data visualization, and predictive modeling, this study aims to generate actionable insights that can guide HR professionals in enhancing employee performance strategies, supporting talent development, and strengthening organizational decision‑making.

Specifically, the project seeks to:
- Identify key factors affecting performance through statistical analysis and machine learning, focusing on variables such as training, work experience, education level, and departmental affiliation.
- Compare predictive models to determine the most effective approach for forecasting KPI achievement above 80%, using evaluation metrics such as confusion matrices, accuracy, sensitivity, and specificity.
- Provide recommendations and actionable insights to HR departments and stakeholders, supporting evidence‑based decisions in talent management, training programs, and employee engagement initiatives.

1.2 Dataset Description

The dataset titled Employees Performance for HR Analytics was uploaded to Kaggle by Sanjana Chaudhari in 2023 and serves as the foundation for this analysis. It contains 17,417 employee records across 13 variables, stored in CSV format. The dataset captures a balanced mix of categorical and numerical variables, making it suitable for exploratory data analysis (EDA), correlation studies, and predictive modeling in HR analytics.

The variables included are as follows:
- employee_id: Unique identifier for each employee; serves as the primary key for tracking records without revealing personal information.
- department: Employee’s department (e.g., Sales & Marketing, Technology); useful for performance segmentation and departmental comparisons.
- region: Geographic region of employment.
- education: Highest education level attained (e.g., Bachelor’s, Master’s and above).
- gender: Employee gender (m = male, f = female).
- recruitment_channel: Hiring source (e.g., Referred, Sourcing).
- no_of_trainings: Number of trainings attended.
- age: Employee age.
- previous_year_rating: Performance rating from the prior year (1–5 scale).
- length_of_service: Number of years served in the organization.
- KPIs_met_more_than_80: Binary indicator of whether more than 80% of KPIs were achieved (0 = No, 1 = Yes); this serves as the target variable. Note the capitalized "KPIs" in the column name.
- awards_won: Indicator of whether the employee won awards (0 = No, 1 = Yes).
- avg_training_score: Average score from trainings, reflecting training quality.

By analyzing these variables, the study aims to uncover meaningful patterns that can guide HR strategies, improve productivity, and strengthen workforce management.

2.0 Data Cleaning & Preparation

Data cleaning is a critical step in preparing the dataset for analysis. It involves handling missing values, correcting inconsistencies, removing duplicates, and ensuring that variables are properly formatted for statistical modeling. Clean data provides a reliable foundation for exploratory analysis and predictive modeling, reducing bias and improving the accuracy of insights.

2.1 Packages Used

The following packages were used in the data cleaning process:

  • dplyr
    Functions: filter, mutate, select, distinct, arrange, summarise, across, case_when
    Purpose: Data manipulation and transformation.

  • tidyr
    Functions: replace_na
    Purpose: Handling missing values and tidying data.

  • stringr
    Functions: str_trim, str_to_lower
    Purpose: Text cleaning and string processing.

  • writexl
    Functions: write_xlsx
    Purpose: Exporting cleaned dataset to Excel format.

2.2 Data Importation

This step involves loading the raw dataset into R for inspection. The structure and summary of the data are examined to understand variable types.

# Attach the packages used in the cleaning steps
library(dplyr)
library(tidyr)
library(stringr)

employee_performance <- read.csv("Uncleaned_employees_final_dataset.csv")
str(employee_performance)
## 'data.frame':    17417 obs. of  13 variables:
##  $ employee_id          : int  8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...
summary(employee_performance)
##   employee_id        department          region          education    
##  Min.   :    3   Length   :17417   Length   :17417   Length   :17417  
##  1st Qu.:19281   N.unique :    9   N.unique :   34   N.unique :    4  
##  Median :39122   N.blank  :    0   N.blank  :    0   N.blank  :  771  
##  Mean   :39083   Min.nchar:    2   Min.nchar:    8   Min.nchar:    0  
##  3rd Qu.:58838   Max.nchar:   17   Max.nchar:    9   Max.nchar:   15  
##  Max.   :78295                                                        
##                                                                       
##        gender      recruitment_channel no_of_trainings      age       
##  Length   :17417   Length   :17417     Min.   :1.000   Min.   :20.00  
##  N.unique :    2   N.unique :    3     1st Qu.:1.000   1st Qu.:29.00  
##  N.blank  :    0   N.blank  :    0     Median :1.000   Median :33.00  
##  Min.nchar:    1   Min.nchar:    5     Mean   :1.251   Mean   :34.81  
##  Max.nchar:    1   Max.nchar:    8     3rd Qu.:1.000   3rd Qu.:39.00  
##                                        Max.   :9.000   Max.   :60.00  
##                                                                       
##  previous_year_rating length_of_service KPIs_met_more_than_80   awards_won     
##  Min.   :1.000        Min.   : 1.000    Min.   :0.0000        Min.   :0.00000  
##  1st Qu.:3.000        1st Qu.: 3.000    1st Qu.:0.0000        1st Qu.:0.00000  
##  Median :3.000        Median : 5.000    Median :0.0000        Median :0.00000  
##  Mean   :3.345        Mean   : 5.802    Mean   :0.3588        Mean   :0.02337  
##  3rd Qu.:4.000        3rd Qu.: 7.000    3rd Qu.:1.0000        3rd Qu.:0.00000  
##  Max.   :5.000        Max.   :34.000    Max.   :1.0000        Max.   :1.00000  
##  NAs    :1363                                                                  
##  avg_training_score
##  Min.   :39.00     
##  1st Qu.:51.00     
##  Median :60.00     
##  Mean   :63.18     
##  3rd Qu.:75.00     
##  Max.   :99.00     
## 
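
To put the missing-value counts in the summary above into proportion, the short sketch below converts them to percentages of the 17,417 rows. The two counts (771 blank education entries and 1,363 NA ratings) are copied from the printed output rather than recomputed, and the object names are illustrative only.

```r
# Counts copied from the summary() output above (not recomputed from the file)
n_rows <- 17417
missing_counts <- c(education = 771, previous_year_rating = 1363)

# Share of rows affected, in percent
missing_pct <- round(100 * missing_counts / n_rows, 1)
print(missing_pct)
```

Both shares are below 10% of the rows.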

2.3 Duplicate Removal & Consistency Checks

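The unique values of each categorical variable are first inspected for inconsistencies. Exact duplicate rows are then removed, which reduces the dataset from 17,417 to 17,415 rows, and the employee_id column is dropped because it is not needed for the analysis.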

unique(employee_performance$department)
## [1] "Technology"        "HR"                "Sales & Marketing"
## [4] "Procurement"       "Finance"           "Analytics"        
## [7] "Operations"        "Legal"             "R&D"
unique(employee_performance$education)
## [1] "Bachelors"       "Masters & above" ""                "Below Secondary"
unique(employee_performance$gender)
## [1] "m" "f"
unique(employee_performance$recruitment_channel)
## [1] "sourcing" "other"    "referred"
unique(employee_performance$region)
##  [1] "region_26" "region_4"  "region_13" "region_2"  "region_29" "region_7" 
##  [7] "region_22" "region_16" "region_17" "region_24" "region_11" "region_27"
## [13] "region_9"  "region_20" "region_34" "region_23" "region_8"  "region_14"
## [19] "region_31" "region_19" "region_5"  "region_28" "region_15" "region_3" 
## [25] "region_25" "region_12" "region_21" "region_30" "region_10" "region_33"
## [31] "region_32" "region_6"  "region_1"  "region_18"
str(employee_performance)
## 'data.frame':    17417 obs. of  13 variables:
##  $ employee_id          : int  8724 74430 72255 38562 64486 46232 54542 67269 66174 76303 ...
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...
# Show duplicated employee_id
employee_performance %>%
  filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
  arrange(employee_id)
# Remove exact duplicate rows only
employee_performance <- employee_performance %>%
  distinct()

# Check duplicated employee_id again
employee_performance %>%
  filter(duplicated(employee_id) | duplicated(employee_id, fromLast = TRUE)) %>%
  arrange(employee_id)
# Remove the identifier column, which is not used in the analysis
employee_performance <- employee_performance %>%
  select(-employee_id)
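
One side effect of this ordering is worth illustrating: rows that differ only in their identifier become identical once the identifier is dropped. The toy data frame below is invented for illustration, and base R is used so that the sketch runs without extra packages.

```r
# Toy data (hypothetical): ids 1 and 2 share all attributes; id 3 is entered twice
toy <- data.frame(
  employee_id = c(1, 2, 3, 3),
  department  = c("hr", "hr", "legal", "legal"),
  age         = c(30, 30, 41, 41)
)

# Step 1: remove exact duplicate rows (base-R counterpart of dplyr::distinct())
dedup <- unique(toy)
print(nrow(dedup))        # the repeated row for id 3 is gone

# Step 2: drop the identifier column
no_id <- dedup[, setdiff(names(dedup), "employee_id")]

# The rows for ids 1 and 2 can no longer be told apart
n_new_dupes <- sum(duplicated(no_id))
print(n_new_dupes)
```

Whether such rows should be kept depends on whether they describe different people. In the toy example they do, so removing them would discard valid records.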

2.4 Data Transformation

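Whitespace is trimmed from the text fields and gender is lowercased; the remaining text columns are lowercased at the end of this section. The 1,363 missing previous_year_rating values are imputed with the median rating, and blank education entries are recoded as "Unknown", so that no records are dropped.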

#clean text column
employee_performance <- employee_performance %>%
  mutate(
    gender = str_to_lower(str_trim(gender)),
    department = str_trim(department),
    education = str_trim(education),
    recruitment_channel = str_trim(recruitment_channel)
  )


str(employee_performance)
## 'data.frame':    17415 obs. of  12 variables:
##  $ department           : chr  "Technology" "HR" "Sales & Marketing" "Procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "Bachelors" "Bachelors" "Bachelors" "Bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : int  NA 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...
colSums(is.na(employee_performance))
##            department                region             education 
##                     0                     0                     0 
##                gender   recruitment_channel       no_of_trainings 
##                     0                     0                     0 
##                   age  previous_year_rating     length_of_service 
##                     0                  1363                     0 
## KPIs_met_more_than_80            awards_won    avg_training_score 
##                     0                     0                     0
employee_performance %>%
  summarise(across(everything(), ~ sum(is.na(.) | trimws(as.character(.)) == "")))
#handling missing values
employee_performance <- employee_performance %>%
  mutate(
    previous_year_rating = ifelse(
      is.na(previous_year_rating),
      median(previous_year_rating, na.rm = TRUE),
      previous_year_rating
    ),
    
    education = ifelse(
      is.na(education) | str_trim(education) == "",
      "Unknown",
      education
    )
  )

# Safety net: replace any NA still left in education (none are expected after the step above)
employee_performance <- employee_performance %>%
  mutate(education = replace_na(education, "Unknown"))


#clean any text column
clean_text <- function(x) {
  x %>%
    str_trim() %>%
    str_to_lower()
}

employee_performance$department <- clean_text(employee_performance$department)


employee_performance <- employee_performance %>%
  mutate(across(
    c(department, education, recruitment_channel, region),
    clean_text
  ))
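
The two imputation rules used in this section can be checked on a small example. The vectors below are hypothetical, and base R is used so the sketch is self-contained.

```r
# Toy vectors (hypothetical values)
rating    <- c(NA, 3, 5, 3, 5, 3, NA)
education <- c("Bachelors", "", "  ", NA, "Masters & above")

# Median imputation: NA entries take the median of the observed ratings
rating_filled <- ifelse(is.na(rating), median(rating, na.rm = TRUE), rating)

# Categorical replacement: NA or blank strings become "Unknown"
education_filled <- ifelse(is.na(education) | trimws(education) == "",
                           "Unknown", education)

print(rating_filled)
print(education_filled)

# Imputing with the median pulls the mean towards the median
print(mean(rating, na.rm = TRUE))   # mean of the observed values only
print(mean(rating_filled))          # mean after imputation
```

The same effect is visible in the real data: the mean of previous_year_rating is 3.345 in the raw summary and 3.319 in the cleaned summary.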

2.5 Feature Engineering

New variables are created to enhance analytical insights. Employees are grouped into three age bands (under 30, 30 to 39, and 40 or older), categorical variables are converted to factors for modeling compatibility, and the training score is standardized.

#create age group 
employee_performance <- employee_performance %>%
  mutate(age_group = case_when(
    age < 30 ~ "Young",
    age >= 30 & age < 40 ~ "Mid",
    TRUE ~ "Senior"
  ))
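
As an alternative sketch, the same three bands can be produced with base R's cut(). The break points mirror the case_when() conditions above (under 30, 30 to 39, 40 and above); the sample ages are made up for illustration.

```r
# Hypothetical ages chosen to sit on and around the band boundaries
ages <- c(24, 29, 30, 39, 40, 51)

age_group <- cut(
  ages,
  breaks = c(-Inf, 30, 40, Inf),
  labels = c("Young", "Mid", "Senior"),
  right  = FALSE                 # intervals are [lower, upper)
)

print(data.frame(age = ages, age_group))
print(levels(age_group))
```

One design difference: cut() keeps the levels in the order given (Young, Mid, Senior), whereas as.factor() on a character column orders them alphabetically.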

str(employee_performance)
## 'data.frame':    17415 obs. of  13 variables:
##  $ department           : chr  "technology" "hr" "sales & marketing" "procurement" ...
##  $ region               : chr  "region_26" "region_4" "region_13" "region_2" ...
##  $ education            : chr  "bachelors" "bachelors" "bachelors" "bachelors" ...
##  $ gender               : chr  "m" "f" "m" "f" ...
##  $ recruitment_channel  : chr  "sourcing" "other" "other" "other" ...
##  $ no_of_trainings      : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                  : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating : num  3 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service    : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80: int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_training_score   : int  77 51 47 65 61 68 57 85 75 76 ...
##  $ age_group            : chr  "Young" "Mid" "Mid" "Mid" ...
#convert to factors
employee_performance <- employee_performance %>%
  mutate(
    department = as.factor(department),
    gender = as.factor(gender),
    education = as.factor(education),
    recruitment_channel = as.factor(recruitment_channel),
    region = as.factor(region),
    age_group = as.factor(age_group)
  )


# Standardize the score (z-score); as.numeric() drops the matrix wrapper that scale() returns
employee_performance <- employee_performance %>%
  mutate(avg_training_score_scaled = as.numeric(scale(avg_training_score)))
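
What scale() computes can be verified by hand. The sketch below uses the first five avg_training_score values shown in the str() output; because the mean and standard deviation are taken over these five values only, the resulting z-scores differ from those in the full dataset.

```r
# First five avg_training_score values from the str() output
scores <- c(77, 51, 47, 65, 61)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() performs the same calculation but returns a one-column matrix
z_scale <- scale(scores)

print(is.matrix(z_scale))
print(all.equal(z_manual, as.numeric(z_scale)))

# A standardized variable has mean 0 and standard deviation 1
print(round(mean(z_manual), 10))
print(sd(z_manual))
```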

2.6 Data Exportation

After cleaning, the final step is to export the processed data for later analysis and reporting. The data are written to a CSV file, which serves as the input to the EDA in Section 3.0.

write.csv(employee_performance, "clean_employee_performance.csv", row.names = FALSE)

3.0 Exploratory Data Analysis (EDA)

Before modelling, Exploratory Data Analysis (EDA) was conducted to examine employee performance.
The steps performed are described in the subsections below.

3.1 Data Inspection

The required libraries and the cleaned dataset, df_clean, were loaded and inspected to understand the data structure before moving on to the exploratory analysis.

# Install packages (run only once if needed):
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("corrplot")
# install.packages("plotly")
# install.packages("reshape2")
# install.packages("kableExtra")

# Load required libraries:
library(dplyr)        # Data manipulation and group_by() function
library(ggplot2)      # Data visualization
library(tidyverse)    # Collection of data science packages
library(knitr)        # R Markdown table formatting
library(corrplot)     # Correlation matrix visualization
library(plotly)       # Interactive plots
library(reshape2)     # Data reshaping
library(kableExtra)   # Enhanced table styling

# Load cleaned dataset
df_clean <- read.csv("clean_employee_performance.csv")

# Convert categorical columns to factors (read.csv imports them as character)
df_clean <- df_clean %>%
  mutate(
    department = as.factor(department),
    gender = as.factor(gender),
    education = as.factor(education),
    recruitment_channel = as.factor(recruitment_channel),
    region = as.factor(region),
    age_group = as.factor(age_group),
    awards_won = as.factor(awards_won) # added conversion to factor for better analysis
  )
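
The reason the factor conversion has to be repeated after loading is that a CSV file stores only text, so factor types do not survive the export. A minimal sketch with a made-up data frame and a temporary file, assuming R 4.0 or later (where read.csv() defaults to stringsAsFactors = FALSE):

```r
# Hypothetical data frame with one factor column
df <- data.frame(
  department = factor(c("hr", "legal", "hr")),
  score      = c(61, 75, 47)
)

# Round trip through a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
back <- read.csv(tmp)
unlink(tmp)

class_before <- class(df$department)     # factor
class_after  <- class(back$department)   # character under the R >= 4.0 default
print(c(before = class_before, after = class_after))

# Restore the factor type, as done for df_clean above
back$department <- as.factor(back$department)
print(levels(back$department))
```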

The dataset structure confirms the variables are correctly formatted with appropriate data types.

  • str() and glimpse() give two views of the same dataset: str() describes the structure in detail (including factor levels), whereas glimpse() gives a compact overview of every column.
  • str() shows that the dataset is a mix of numeric and categorical (factor) variables. glimpse() and dim() show that it contains 17,415 observations and 14 variables.
  • With 17,415 rows, the dataset is large enough for reliable exploratory and statistical analysis.
# Data structure overview inspection
head(df_clean)
str(df_clean)
## 'data.frame':    17415 obs. of  14 variables:
##  $ department               : Factor w/ 9 levels "analytics","finance",..: 9 3 8 6 2 6 2 1 9 9 ...
##  $ region                   : Factor w/ 34 levels "region_1","region_10",..: 19 29 5 12 22 32 12 15 32 15 ...
##  $ education                : Factor w/ 4 levels "bachelors","below secondary",..: 1 1 1 1 1 1 1 1 3 1 ...
##  $ gender                   : Factor w/ 2 levels "f","m": 2 1 2 1 2 2 2 2 2 2 ...
##  $ recruitment_channel      : Factor w/ 3 levels "other","referred",..: 3 1 1 1 3 3 1 3 1 3 ...
##  $ no_of_trainings          : int  1 1 1 3 1 1 1 2 1 1 ...
##  $ age                      : int  24 31 31 31 30 36 33 36 51 29 ...
##  $ previous_year_rating     : int  3 3 1 2 4 3 5 3 4 5 ...
##  $ length_of_service        : int  1 5 4 9 7 2 3 3 11 2 ...
##  $ KPIs_met_more_than_80    : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ awards_won               : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_training_score       : int  77 51 47 65 61 68 57 85 75 76 ...
##  $ age_group                : Factor w/ 3 levels "Mid","Senior",..: 3 1 1 1 1 1 1 1 2 3 ...
##  $ avg_training_score_scaled: num  1.03 -0.908 -1.206 0.136 -0.162 ...
glimpse(df_clean)
## Rows: 17,415
## Columns: 14
## $ department                <fct> technology, hr, sales & marketing, procureme…
## $ region                    <fct> region_26, region_4, region_13, region_2, re…
## $ education                 <fct> bachelors, bachelors, bachelors, bachelors, …
## $ gender                    <fct> m, f, m, f, m, m, m, m, m, m, m, m, f, m, m,…
## $ recruitment_channel       <fct> sourcing, other, other, other, sourcing, sou…
## $ no_of_trainings           <int> 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ age                       <int> 24, 31, 31, 31, 30, 36, 33, 36, 51, 29, 40, …
## $ previous_year_rating      <int> 3, 3, 1, 2, 4, 3, 5, 3, 4, 5, 5, 3, 3, 3, 5,…
## $ length_of_service         <int> 1, 5, 4, 9, 7, 2, 3, 3, 11, 2, 12, 10, 4, 10…
## $ KPIs_met_more_than_80     <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ awards_won                <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ avg_training_score        <int> 77, 51, 47, 65, 61, 68, 57, 85, 75, 76, 50, …
## $ age_group                 <fct> Young, Mid, Mid, Mid, Mid, Mid, Mid, Mid, Se…
## $ avg_training_score_scaled <dbl> 1.03010551, -0.90754471, -1.20564475, 0.1358…
# Dataset dimensions
dim(df_clean)
## [1] 17415    14

The summary provides an overview of the central tendency and distribution of each variable.

  • Employees range in age from 20 to 60, with a mean of about 35 years. Every employee has attended at least one training, and at least 75% attended exactly one (the third quartile of no_of_trainings is 1).
  • Past performance is moderate: the average previous-year rating is 3.3, and the KPI achievement rate is 0.36, meaning that only about 36% of employees met more than 80% of their KPIs.
  • The middle half of employees have between 3 and 7 years of service (first and third quartiles), while the maximum is 34 years.
  • Training scores vary widely, from 39 to 99. For this reason avg_training_score was standardized during data preparation, and the resulting avg_training_score_scaled column (mean 0) is available for later analysis.
# Summary statistics
df_clean %>%
  select(age, previous_year_rating, KPIs_met_more_than_80,
         length_of_service, no_of_trainings, avg_training_score, 
         avg_training_score_scaled) %>%
  summary()
##       age        previous_year_rating KPIs_met_more_than_80 length_of_service
##  Min.   :20.00   Min.   :1.000        Min.   :0.0000        Min.   : 1.000   
##  1st Qu.:29.00   1st Qu.:3.000        1st Qu.:0.0000        1st Qu.: 3.000   
##  Median :33.00   Median :3.000        Median :0.0000        Median : 5.000   
##  Mean   :34.81   Mean   :3.319        Mean   :0.3589        Mean   : 5.801   
##  3rd Qu.:39.00   3rd Qu.:4.000        3rd Qu.:1.0000        3rd Qu.: 7.000   
##  Max.   :60.00   Max.   :5.000        Max.   :1.0000        Max.   :34.000   
##  no_of_trainings avg_training_score avg_training_score_scaled
##  Min.   :1.000   Min.   :39.00      Min.   :-1.8018          
##  1st Qu.:1.000   1st Qu.:51.00      1st Qu.:-0.9075          
##  Median :1.000   Median :60.00      Median :-0.2368          
##  Mean   :1.251   Mean   :63.18      Mean   : 0.0000          
##  3rd Qu.:1.000   3rd Qu.:75.00      3rd Qu.: 0.8811          
##  Max.   :9.000   Max.   :99.00      Max.   : 2.6697
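
The KPI achievement rate quoted above is simply the mean of the 0/1 target column, because the mean of a binary variable equals the share of ones. The sketch below shows this on the first ten values printed by str(); the rate for these ten rows is not the rate for the full dataset.

```r
# First ten KPIs_met_more_than_80 values from the str() output
kpi <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1)

# Mean of a 0/1 vector = proportion of ones
kpi_rate <- mean(kpi)
print(kpi_rate)

# The same information as counts
print(table(kpi))
```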

Summary for 3.1:

  • This step ensures that all variables are correctly formatted after preprocessing and no inconsistencies remain.
  • Overall, the dataset is ready for analysis.

3.2 Data Quality Assessment

A final quality check was conducted to determine whether any missing or duplicated values remain.

# Check missing values
colSums(is.na(df_clean))
##                department                    region                 education 
##                         0                         0                         0 
##                    gender       recruitment_channel           no_of_trainings 
##                         0                         0                         0 
##                       age      previous_year_rating         length_of_service 
##                         0                         0                         0 
##     KPIs_met_more_than_80                awards_won        avg_training_score 
##                         0                         0                         0 
##                 age_group avg_training_score_scaled 
##                         0                         0
# Check missing or empty values
df_clean %>%
  summarise(across(everything(), ~ sum(is.na(.) | . == "")))
# Check any duplicates
sum(duplicated(df_clean))
## [1] 16
janitor::get_dupes(df_clean)
## No variable names specified - using all columns.
# Check whether missing values in the education field have been replaced with “Unknown”
unique(df_clean$education)
## [1] bachelors       masters & above unknown         below secondary
## Levels: bachelors below secondary masters & above unknown

Summary for 3.2:

  • This step confirmed that no missing or empty values remain and that blank education entries were recoded as "unknown".
  • Exact duplicate rows were removed during cleaning while employee_id was still present. After that column was dropped, duplicated() flags 16 rows that are identical across the 14 remaining columns. These most likely belong to different employees who share the same attributes, so they are kept as valid observations rather than treated as data-entry duplicates.
  • All variables have been standardized in format, and the cleaned dataset is suitable for reliable statistical analysis after preprocessing.
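
When reading the duplicate count, it helps to remember that sum(duplicated()) counts only the extra copies, not every row that has a twin. The sketch below uses an invented data frame to show the difference.

```r
# Toy data (hypothetical): three identical "hr" rows and two identical "sales" rows
toy <- data.frame(
  department = c("hr", "hr", "hr", "legal", "sales", "sales"),
  age        = c(30, 30, 30, 41, 28, 28)
)

# Extra copies only: the first occurrence of each row is not flagged
extra_copies <- sum(duplicated(toy))

# Every row that belongs to a group of identical rows
in_dup_groups <- sum(duplicated(toy) | duplicated(toy, fromLast = TRUE))

print(c(extra_copies = extra_copies, in_dup_groups = in_dup_groups))
```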

3.3 Outlier Detection

3.3.1 Box Plot

  • Outlier detection focuses on the continuous variables age, length_of_service, and avg_training_score, because these have meaningful numerical ranges and extreme values may reveal unusual or hidden employee characteristics.
# Select variables suitable for outlier detection
outlier_vars <- df_clean %>%
  select(age, length_of_service, avg_training_score)

# Convert selected variables into long format
outlier_data <- outlier_vars %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value")

# Boxplot visualization
ggplot(outlier_data,
       aes(x = Variable,
           y = Value,
           fill = Variable)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Outlier Detection for Continuous Variables",
       x = "Variables",
       y = "Values") +
  theme(legend.position = "none")

Summary for 3.3:

  • This step checks whether the continuous variables contain extreme values that could bias later analysis.
  • age and length_of_service contain several outliers beyond the upper fence of the boxplot, which require further investigation to determine whether they represent valid extreme values or anomalies in the dataset.
  • The overall data quality is satisfactory and makes the dataset suitable for further exploratory data analysis and modelling.
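
The boxplot findings can be cross-checked against the usual 1.5 × IQR rule using the quartiles printed in the Section 3.1 summary. This is a sketch under the assumption that the summary() quartiles are close to the hinges the boxplot uses; the values are copied from the output, not recomputed.

```r
# Quartiles, minimum and maximum copied from the summary() output in Section 3.1
quartiles <- data.frame(
  variable = c("age", "length_of_service", "avg_training_score"),
  min = c(20, 1, 39),
  q1  = c(29, 3, 51),
  q3  = c(39, 7, 75),
  max = c(60, 34, 99)
)

# 1.5 x IQR fences
iqr <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr

# A variable has outliers on a side if its extreme value lies beyond the fence
quartiles$low_outliers  <- quartiles$min < quartiles$lower_fence
quartiles$high_outliers <- quartiles$max > quartiles$upper_fence

print(quartiles)
```

Under this rule the upper fences are 54 for age, 13 for length_of_service and 111 for avg_training_score, so only the first two variables can have high outliers, and none of the three has low outliers.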

3.4 Univariate Analysis

3.4.1 Target Variable Distribution

# Check class imbalance
target_dist <- df_clean %>%
  count(KPIs_met_more_than_80) %>%
  mutate(
    percentage = round(n / sum(n) * 100, 1),
    KPI_status = ifelse(KPIs_met_more_than_80 == 1,
                        "Met KPI >80%",
                        "Met KPI ≤80%")
  )

# KPI distribution plot
ggplot(target_dist,
       aes(x = KPI_status,
           y = n,
           fill = KPI_status)) +

  geom_bar(stat = "identity",
           width = 0.6,
           alpha = 0.9) +

  # Percentage + count labels
  geom_text(aes(label = paste0(percentage,
                               "%\n(n = ",
                               scales::comma(n), ")")),
            vjust = -0.35,
            size = 4.3,
            fontface = "bold") +

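  # Named values keep each status tied to its colour regardless of level order
  # (green for "met" and red for "not met" is assumed to be the intended mapping)
  scale_fill_manual(values = c("Met KPI >80%" = "#2ECC71",
                               "Met KPI ≤80%" = "#E74C3C")) +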

  expand_limits(y = max(target_dist$n) * 1.15) +

  labs(title = "Distribution of KPI Achievement",
       subtitle = "Class balance analysis of KPI performance",
       x = NULL,
       y = "Number of Employees") +

  theme_minimal(base_size = 13) +

  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    axis.text.x = element_text(face = "bold")
  )

# Print imbalance ratio
imbalance_ratio <- max(target_dist$percentage) /
                   min(target_dist$percentage)

cat("Imbalance ratio (majority/minority):",
    round(imbalance_ratio, 2), "\n")
## Imbalance ratio (majority/minority): 1.79

Insights:

  • About 64.1% of employees did not meet the KPI benchmark (more than 80% of KPIs met), while 35.9% did.
  • The imbalance ratio of 1.79 indicates a mild class imbalance: the majority class is less than twice the size of the minority class.
  • Because more than half of the employees missed the target, organizational or operational factors may be affecting performance. This warrants further investigation into areas such as departmental performance and employee support systems.
  • The class distribution is therefore acceptable for exploratory analysis and modelling without resampling. However, a model that always predicts "not met" would already reach 64.1% accuracy, so accuracy should be read alongside sensitivity and specificity.
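
The imbalance ratio can be reproduced from the two reported percentages, together with the majority-class baseline, that is, the accuracy obtained by always predicting the larger class. The sketch below uses the rounded percentages from the insights above; the object names are illustrative.

```r
# Class shares in percent, as reported above
pct <- c(not_met = 64.1, met = 35.9)

# Majority share divided by minority share
imbalance_ratio <- round(max(pct) / min(pct), 2)

# Accuracy of a model that always predicts the majority class
baseline_accuracy <- max(pct) / 100

print(c(imbalance_ratio = imbalance_ratio,
        baseline_accuracy = baseline_accuracy))
```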

3.4.2 Continuous Variables Distribution

# Select Variables
continuous_long <- df_clean %>%
  select(age,
         length_of_service,
         avg_training_score) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value")

# Plots
ggplot() +

  # age
  geom_histogram(
    data = filter(continuous_long, variable == "age"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 2,
    fill = "#3498DB",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long, variable == "age"),
    aes(x = value),
    color = "#E74C3C",
    linewidth = 1.1,
    adjust = 1.2
  ) +

  # length_of_service
  geom_histogram(
    data = filter(continuous_long,
                  variable == "length_of_service"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 1,
    fill = "#2ECC71",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long,
                  variable == "length_of_service"),
    aes(x = value),
    color = "#C0392B",
    linewidth = 1.1,
    adjust = 1.5
  ) +

  # avg_training_score
  geom_histogram(
    data = filter(continuous_long,
                  variable == "avg_training_score"),
    aes(x = value,
        y = after_stat(density)),
    binwidth = 5,
    fill = "#9B59B6",
    color = "white",
    alpha = 0.7
  ) +

  geom_density(
    data = filter(continuous_long,
                  variable == "avg_training_score"),
    aes(x = value),
    color = "#E74C3C",
    linewidth = 1.1,
    adjust = 1.2
  ) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(
    title = "Distribution of Continuous Variables",
    subtitle = "Histogram with Density Overlay",
    x = "Value",
    y = "Density"
  ) +

  theme_minimal(base_size = 14) +

  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    strip.text = element_text(face = "bold", size = 12),
    axis.title = element_text(face = "bold")
  )

Insights:

  • The age distribution is slightly right-skewed, with a higher concentration of younger employees and fewer older ones. The density curve peaks around the mid-30s, which suggests that most employees are in the early to mid-career stage.
  • The avg_training_score distribution is multimodal, with major clusters visible around 50, 60, and 80 to 85. This suggests possible segmentation in employee performance or in departmental training outcomes.
  • The length_of_service distribution is heavily right-skewed. Most employees have 1 to 7 years of service and very few exceed 15 years, so the workforce consists largely of employees with relatively short tenure.
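
A quick numeric cross-check of the skew direction is to compare the mean with the median: a mean above the median is consistent with a right skew. The sketch below applies this heuristic to the values printed in the Section 3.1 summary. It is only a rule of thumb and can mislead for multimodal variables such as avg_training_score.

```r
# Means and medians copied from the summary() output in Section 3.1
stats <- data.frame(
  variable = c("age", "length_of_service", "avg_training_score"),
  mean     = c(34.81, 5.801, 63.18),
  median   = c(33, 5, 60)
)

# Positive difference: the mean is pulled upwards by a longer right tail
stats$mean_minus_median   <- stats$mean - stats$median
stats$suggests_right_skew <- stats$mean_minus_median > 0

print(stats)
```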

3.4.3 Discrete Variables Distribution

# Select Variables
discrete_vars <- df_clean %>%
  select(previous_year_rating,
         no_of_trainings)

discrete_long <- discrete_vars %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value")
#Plot
ggplot(discrete_long,
       aes(x = factor(value))) +

  geom_bar(fill = "#2ECC71",
           alpha = 0.8) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(title = "Distribution of Discrete Variables",
       subtitle = "Frequency distribution by category",
       x = "Category",
       y = "Count") +

  theme_minimal() +

  theme(strip.text = element_text(face = "bold"))

Insights:

  • no_of_trainings is strongly right-skewed: most employees attended only one training session. Counts drop sharply after 2 sessions, and very few employees attended more than 4, which produces the long right tail.
  • previous_year_rating is slightly left-skewed with a mode at rating 3. Most employees received ratings of 3, 4, or 5 (average to excellent), suggesting that past performance is generally positive across the workforce.

3.4.4 Categorical Variables Distribution

# Select Variables
categorical_vars <- df_clean %>%
  select(department,
         education,
         recruitment_channel,
         awards_won)

categorical_long <- categorical_vars %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "category")

# Plot
ggplot(categorical_long,
       aes(x = category,
           fill = variable)) +

  geom_bar(alpha = 0.85,
           show.legend = FALSE) +

  facet_wrap(~ variable,
             scales = "free",
             ncol = 2) +

  labs(title = "Distribution of Categorical Variables",
       subtitle = "Frequency distribution across employee categories",
       x = "",
       y = "Number of Employees") +

  scale_fill_brewer(palette = "Set2") +

  theme_minimal(base_size = 12) +

  theme(
    axis.text.x = element_text(angle = 45,
                               hjust = 1),
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold",
                              hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

# Plot for region (top 10)
region_analysis <- df_clean %>%
  count(region) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  top_n(10, n)

ggplot(region_analysis,
       aes(x = reorder(region, n), y = percentage, fill = percentage)) +
  geom_bar(stat = "identity", width = 0.8) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            vjust = -0.3, size = 3) +
  scale_fill_gradient(low = "#A3E4D7", high = "#1ABC9C") +
  labs(title = "Employee Distribution by Region (Top 10)",
       x = "Region", y = "Percentage (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 7))

Insights:

  • The categorical variable analysis gives an overview of department, education, and recruitment_channel, showing that employees are unevenly distributed across departments, educational backgrounds, and recruitment channels.
  • awards_won is extremely imbalanced: nearly all employees have no awards, and award winners are almost invisible on the chart. With so little variation, the variable may have limited predictive power, although bivariate analysis is needed to determine whether award winners perform better on KPIs.
  • Sales and Marketing has the largest number of employees and is the dominant department within the organization.
  • Most employees hold a bachelor’s degree, making it the most common educational background in the workforce.
  • There are three distinct recruitment channels, reflecting the variety of hiring methods the company uses.
  • Region 2 has the largest share of employees (about 22.5%) of any region.

3.5 Bivariate Analysis

3.5.1 Training Score vs KPI Achievement

  • This section examines the relationship between employee training performance (average training score) and KPI achievement status. This analysis helps determine whether training performance influences employees’ success in meeting their KPI targets.
# Summary Statistics by KPI 
training_score_summary <- df_clean %>%
  group_by(KPIs_met_more_than_80) %>%
  summarise(
    count = n(),
    mean_score = mean(avg_training_score, na.rm = TRUE),
    median_score = median(avg_training_score, na.rm = TRUE),
    sd_score = sd(avg_training_score, na.rm = TRUE),
    min_score = min(avg_training_score, na.rm = TRUE),
    max_score = max(avg_training_score, na.rm = TRUE),
    q25 = quantile(avg_training_score, 0.25, na.rm = TRUE),
    q75 = quantile(avg_training_score, 0.75, na.rm = TRUE)
  ) %>%
  mutate(KPI_Status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet KPI"))

kable(training_score_summary %>% 
        select(-KPIs_met_more_than_80) %>% 
        mutate(across(where(is.numeric), ~round(., 2))),
      caption = "Training Score Summary by KPI Achievement Status")
Training Score Summary by KPI Achievement Status

count  mean_score  median_score  sd_score  min_score  max_score  q25  q75  KPI_Status
11165  62.46       59            13.35     39         99         50   74   Did Not Meet KPI
 6250  64.47       61            13.45     41         99         53   77   Met KPI >80%

# Boxplot Comparison
ggplot(df_clean, aes(x = factor(KPIs_met_more_than_80), y = avg_training_score, 
                      fill = factor(KPIs_met_more_than_80))) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.8) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71"),
                    labels = c("Did Not Meet KPI", "Met KPI")) +
  labs(
    title = "Training Score Distribution Comparison",
    subtitle = "High performers show higher median training scores",
    x = "KPI Achievement Status",
    y = "Average Training Score",
    fill = "KPI Status"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "none"
  )

Insights:

  • The mean training score for employees who met their KPI targets (64.5) was only slightly higher than for those who did not (62.5), a difference of about 2 points.
  • The boxplot shows a slight upward shift in the training score distribution among employees who met their KPI targets, consistent with the summary statistics.
  • The variability in scores was similar for both groups (standard deviation ≈ 13.4), so the two distributions differ mainly in location rather than in spread.
  • According to the summary table, the interquartile range spans approximately 50–74 for employees who did not meet their KPI targets and approximately 53–77 for employees who met them.
  • There is significant overlap between these two ranges, indicating that many employees in both groups achieved similar training scores; therefore, training performance alone is insufficient to fully explain KPI success.
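
To put the size of the 2-point gap in context, a standardized effect size can be computed directly from the reported group summaries. This is a sketch using Cohen's d with a pooled standard deviation; it is an added illustration rather than part of the original analysis, and the variable names are hypothetical.

```r
# Group summaries as reported in the training score summary table
n_not <- 11165; mean_not <- 62.46; sd_not <- 13.35   # did not meet KPI
n_met <- 6250;  mean_met <- 64.47; sd_met <- 13.45   # met KPI > 80%

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n_not - 1) * sd_not^2 + (n_met - 1) * sd_met^2) /
                    (n_not + n_met - 2))

# Cohen's d: mean difference expressed in pooled-SD units
cohens_d <- (mean_met - mean_not) / pooled_sd
round(cohens_d, 2)   # about 0.15
```

A value of roughly 0.15 falls below the conventional "small" benchmark of 0.2, which is in line with the heavy overlap seen in the boxplot.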

3.5.2 Previous Year Rating Analysis vs KPI Achievement

  • This section examines the relationship between employee previous year performance (previous year rating) and KPI achievement status. This analysis helps determine whether previous year performance influences employee success in meeting KPI targets.
# Previous year rating vs KPI
rating_analysis <- df_clean %>%
  group_by(previous_year_rating, KPIs_met_more_than_80) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(previous_year_rating) %>%
  mutate(percentage = count / sum(count) * 100,
         KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI", "Did Not Meet"))

# Plot
ggplot(rating_analysis, aes(x = factor(previous_year_rating), y = percentage, 
                            fill = KPI_status)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), size = 3) +
  scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
  labs(title = "KPI Achievement by Previous Year Rating",
       x = "Previous Year Rating", y = "Percentage (%)",
       fill = "KPI Status")

Insights:

  • Employees with higher previous year ratings show a greater proportion of KPI achievement.
  • The distribution indicates that past performance is positively associated with KPI achievement: employees who received higher ratings in the previous year were more likely to meet their KPI targets.

3.5.3 Average Training Score by Previous Year Rating

# Plot
ggplot(df_clean,
       aes(x = factor(previous_year_rating),
           y = avg_training_score,
           fill = factor(previous_year_rating))) +

  geom_boxplot(alpha = 0.8,
               outlier.color = "#E74C3C") +

  labs(title = "Average Training Score by Previous Year Rating",
       x = "Previous Year Rating",
       y = "Average Training Score") +

  theme_minimal(base_size = 13) +

  theme(legend.position = "none")

Insights:

  • Employees who received performance ratings of 3 to 5 in the previous year had higher average training scores.
  • Employees with a performance rating of 1 had the lowest median training score, indicating weaker general capability development. However, the upper whisker for this group reaches almost 95, so high scores exist even among low-rated employees.
  • In summary, the boxplot distribution implies that training results are not entirely dependent on prior performance evaluations, as employees with low ratings in the previous year can still perform well in training.

3.5.4 Categorical Variables vs Target

# Function to create bar plots for categorical variables
plot_categorical_kpi <- function(data, var_name) {
  data %>%
    group_by(!!sym(var_name), KPIs_met_more_than_80) %>%
    summarise(count = n(), .groups = 'drop') %>%
    group_by(!!sym(var_name)) %>%
    mutate(percentage = count / sum(count) * 100,
           KPI_status = ifelse(KPIs_met_more_than_80 == 1, "Met KPI >80%", "Did Not Meet")) %>%
    ggplot(aes(x = reorder(!!sym(var_name), -percentage * (KPIs_met_more_than_80 == 1)), 
               y = percentage, fill = KPI_status)) +
    geom_bar(stat = "identity", position = "stack", width = 0.7) +
    geom_text(aes(label = paste0(round(percentage, 1), "%")), 
              position = position_stack(vjust = 0.5), size = 3) +
    scale_fill_manual(values = c("#E74C3C", "#2ECC71")) +
    labs(title = paste("KPI Achievement by", var_name),
         x = var_name, y = "Percentage (%)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5, face = "bold"))
}

# Plot for each categorical variable
cat_vars <- c("department", "education", "recruitment_channel", "gender", "awards_won")

for (var in cat_vars) {
  print(plot_categorical_kpi(df_clean, var))
}

# H0 = There is no significant relationship between the categorical variable and KPI achievement.
# H1 = There is a significant relationship between the categorical variable and KPI achievement.

# Decision Rule: Reject H0, if p-value <0.05

# Chi-square tests
cat_chi_results <- map_df(cat_vars, function(var) {
  
  tbl <- table(df_clean[[var]], df_clean$KPIs_met_more_than_80)
  
  test <- chisq.test(tbl)
  
  # Store numeric p-value for correct sorting
  p_val <- test$p.value
  
  # Format p-value for display only (always character, so rows bind without type conflicts)
  p_formatted <- if (p_val < 0.0001) {
    "< 0.0001"
  } else {
    format(round(p_val, 4), nsmall = 4)
  }
  
  data.frame(
    Variable = var,
    Chi_Square = round(as.numeric(test$statistic), 2),
    P_Value = p_formatted,
    P_Value_Numeric = p_val,
    Significant = ifelse(p_val < 0.05, "Yes", "No")
  )
})

# Display results (with sorting)
kable(
  cat_chi_results %>%
    arrange(P_Value_Numeric) %>%
    select(-P_Value_Numeric),
  caption = "Chi-square Tests: Categorical Variables vs KPI Achievement"
)
Chi-square Tests: Categorical Variables vs KPI Achievement

Variable             Chi_Square  P_Value   Significant
department           292.33      < 0.0001  Yes
awards_won           191.77      < 0.0001  Yes
recruitment_channel   42.91      < 0.0001  Yes
education             40.25      < 0.0001  Yes
gender                28.61      < 0.0001  Yes

Insights:

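  • The chi-square tests indicate that all categorical variables are significantly associated with KPI achievement (p < 0.05).
  • department has the largest test statistic (χ² = 292.33), followed by awards_won, recruitment_channel, and education. Because the χ² statistic also grows with sample size and the number of categories, it is only a rough guide to the strength of association.
  • gender has the smallest test statistic but is still significantly associated with KPI achievement.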
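
With a sample this large, even weak associations produce very small p-values, so an effect size is a useful complement. The sketch below computes Cramér's V from the reported chi-square statistics. It assumes the tests used n = 17,415 observations (the sum of the two KPI group counts reported in Section 3.5.1); the actual n behind each test may differ, and chisq.test applies a continuity correction to 2 x 2 tables by default, so the values are approximate.

```r
# Chi-square statistics as reported in the results table
chi_sq <- c(department = 292.33, awards_won = 191.77,
            recruitment_channel = 42.91, education = 40.25, gender = 28.61)

# ASSUMPTION: total sample size taken from the KPI group counts (11165 + 6250)
n <- 11165 + 6250

# The target has two levels, so min(rows - 1, cols - 1) = 1
# and Cramér's V reduces to sqrt(chi_sq / n).
cramers_v <- sqrt(chi_sq / (n * 1))
round(cramers_v, 3)
```

Under these assumptions every V is at or below roughly 0.13, which suggests that the associations are statistically significant but weak in practical terms.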

3.6 Multivariate Analysis

3.6.1 Performance by gender and department

# Summary function
get_summary <- function(data, group_var) {
  data %>%
    group_by(.data[[group_var]]) %>%
    summarise(
      total_trainings = sum(no_of_trainings, na.rm = TRUE),
      avg_train_score       = mean(avg_training_score, na.rm = TRUE),
      kpi             = sum(KPIs_met_more_than_80 == 1, na.rm = TRUE),
      avg_tenure      = mean(length_of_service, na.rm = TRUE),
      avg_rating      = mean(previous_year_rating, na.rm = TRUE),
      avg_age         = mean(age, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    rename(category = 1)
}

# Focus only: Gender + Department
groups <- c("gender", "department")

summary_list <- lapply(groups, function(g) {
  get_summary(df_clean, g) %>%
    mutate(group = g)
})

combined_perf <- bind_rows(summary_list)


# Split metrics (NO scale mixing)
# Workforce metrics
workforce <- combined_perf %>%
  pivot_longer(
    cols = c(total_trainings, kpi),
    names_to = "metric",
    values_to = "value"
  )

# Performance metrics
performance <- combined_perf %>%
  pivot_longer(
    cols = c(avg_train_score, avg_tenure, avg_rating, avg_age),
    names_to = "metric",
    values_to = "value"
  )

# =========================
# Plot 1: Workforce (Gender + Department)
# =========================
p1 <- ggplot(workforce,
             aes(x = category,
                 y = value,
                 fill = metric)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  
  facet_wrap(~ group, scales = "free_x") +
  
  labs(title = "Workforce Overview by Gender and Department",
       x = "",
       y = "Count",
       fill = "Metric") +
  
  theme_minimal(base_size = 13) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.title = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

# =========================
# Plot 2: Performance (Gender + Department)
# =========================
p2 <- ggplot(performance,
             aes(x = category,
                 y = value,
                 fill = metric)) +
  
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.9) +
  
  facet_wrap(~ group, scales = "free_x") +
  
  labs(title = "Performance Metrics by Gender and Department",
       x = "",
       y = "Average Value",
       fill = "Metric") +
  
  theme_minimal(base_size = 13) +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.title = element_text(size = 9),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )


# Output
ggplotly(p1) %>%
  layout(
    legend = list(
      orientation = "v",
      x = 1,
      y = 0,
      font = list(size = 9),
      itemwidth = 30
    ),
    margin = list(b = 120)
  )
ggplotly(p2) %>%
  layout(
    legend = list(
      orientation = "v",
      x = 1,
      y = 0,
      font = list(size = 9),
      itemwidth = 30
    ),
    margin = list(b = 120)
  )
# ==========================================
# Multivariate Summary Table
# For Gender + Department Analysis
# ==========================================

# Create formatted summary table
summary_table <- combined_perf %>%
  
  mutate(
    avg_train_score  = round(avg_train_score, 2),
    avg_tenure = round(avg_tenure, 2),
    avg_rating = round(avg_rating, 2),
    avg_age    = round(avg_age, 2)
  ) %>%
  
  arrange(group, desc(avg_train_score)) %>%
  
  rename(
    Category           = category,
    Group              = group,
    `Total Trainings`  = total_trainings,
    `Avg Training Score` = avg_train_score,
    `KPI Achieved`     = kpi,
    `Avg Tenure`       = avg_tenure,
    `Avg Rating`       = avg_rating,
    `Avg Age`          = avg_age
  )

# Display table
kable(
  summary_table,
  caption = "Multivariate Performance Summary by Gender and Department",
  align = "c"
) %>%
  
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#2C3E50"
  ) %>%
  
  column_spec(1, bold = TRUE) %>%
  
  collapse_rows(
    columns = 8,
    valign = "top"
  )
Multivariate Performance Summary by Gender and Department

Category           Total Trainings  Avg Training Score  KPI Achieved  Avg Tenure  Avg Rating  Avg Age  Group
analytics           2281            84.57                679          5.00        3.47        32.41    department
r&d                  438            84.45                149          4.80        3.66        32.89    department
technology          2740            79.85                783          5.84        3.14        35.03    department
procurement         2993            70.18                836          6.19        3.23        36.17    department
operations          4121            60.35               1553          6.43        3.63        36.15    department
finance             1059            60.33                319          5.01        3.49        32.60    department
legal                355            59.53                118          4.50        3.38        33.75    department
hr                   892            50.39                300          5.63        3.51        34.25    department
sales & marketing   6903            50.06               1513          5.75        3.10        34.63    department
f                   5992            63.68               1986          5.86        3.37        35.04    gender
m                  15790            62.97               4264          5.78        3.30        34.71    gender

Insights:

  • Plot 1: Workforce Overview by Gender and Department:
    • Sales & Marketing (6,903) and Operations (4,121) recorded the highest total number of trainings.
    • Operations had the highest KPI count (1,553), slightly more than Sales & Marketing (1,513).
    • R&D (438 trainings) and Legal (355 trainings) had the lowest training totals.
    • Male employees recorded more total trainings (15,790) and more KPI achievers (4,264) than female employees (5,992 trainings; 1,986 KPI achievers). These are raw totals, so they depend on group size as well as on performance.
  • Plot 2: Performance Metrics by Gender and Department:
    • Analytics (84.57) and R&D (84.45) had the highest average training scores, followed by Technology (79.85).
    • Sales & Marketing (50.06) and HR (50.39) had the lowest average training scores.
    • The departments with the highest average tenure were Operations (6.43 years) and Procurement (6.19 years).
    • Gender differences were small: female employees were slightly higher than male employees in average training score (63.68 vs. 62.97), rating (3.37 vs. 3.30), and age (35.04 vs. 34.71).
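
Because the KPI figures in the summary are counts, larger groups naturally show larger numbers. A rate per group makes departments comparable. The following base R sketch uses a small made-up data frame (toy, not the project dataset) to show how a count and a rate can rank groups differently.

```r
# Toy data for illustration only (not the project dataset)
toy <- data.frame(
  department = c("ops", "ops", "ops", "ops", "ops", "ops", "legal", "legal"),
  KPIs_met_more_than_80 = c(1, 0, 1, 0, 1, 0, 1, 1)
)

# Headcount, KPI count, and KPI rate per department
headcount <- tapply(toy$KPIs_met_more_than_80, toy$department, length)
kpi_count <- tapply(toy$KPIs_met_more_than_80, toy$department, sum)
kpi_rate  <- kpi_count / headcount

data.frame(
  department = names(headcount),
  headcount  = as.vector(headcount),
  kpi        = as.vector(kpi_count),
  kpi_rate   = as.vector(kpi_rate)
)
```

In this toy example "ops" has the higher KPI count but the lower KPI rate. In the dplyr pipeline used in this section, a comparable rate could be obtained by adding n = n() and kpi_rate = mean(KPIs_met_more_than_80 == 1) inside summarise(); that change is a suggestion, not something the original analysis reports.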

3.6.2 Correlation Matrix

# Select numeric variables
num_data <- df_clean %>%
  select(
    no_of_trainings,
    age,
    previous_year_rating,
    length_of_service,
    avg_training_score,
    KPIs_met_more_than_80
  )

# Correlation matrix
cor_matrix <- cor(num_data, use = "complete.obs")

cor_melt <- melt(cor_matrix)

# Correlation heatmap
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
  
  geom_tile(color = "white") +
  
  scale_fill_gradient2(
    low = "#E74C3C",   # -1 strong negative
    mid = "white",     # 0 no correlation
    high = "#2ECC71",  # +1 strong positive
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  
  geom_text(aes(label = round(value, 2)), size = 3) +
  
  theme_minimal() +
  
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_blank(),
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  
  ggtitle("Correlation Matrix of Employee Performance Variables")

Insights:

  • The moderate correlation between age and length of service (r = 0.64) is expected, since older employees have generally had more time to accumulate tenure. Because the value does not exceed the commonly used multicollinearity threshold of 0.70, multicollinearity is not considered a significant concern.
  • Regarding the target variable (KPI achievement):
    • Previous year rating (r = 0.33) has the strongest linear correlation with the target, suggesting some stability in employee performance over time.
    • Training-related variables show minimal correlation, implying a limited direct linear effect on KPI performance.
  • Overall, the correlation analysis reveals that most numerical variables have weak linear correlations, indicating that employee performance is driven by several interconnected factors rather than a single predictor.
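
The 0.70 rule of thumb can be cross-checked with a variance inflation factor (VIF). As a simplified sketch, assume age and length_of_service are the only two correlated predictors, in which case the VIF of either one is 1 / (1 - r^2). In a full model the VIF would be computed from each predictor's regression on all the others, so this is a lower-bound illustration only.

```r
r <- 0.64                    # reported correlation between age and length_of_service
r_squared <- r^2             # share of variance one predictor explains in the other

# Two-predictor case: VIF = 1 / (1 - r^2)
vif_two_predictors <- 1 / (1 - r_squared)
round(vif_two_predictors, 2)   # about 1.69
```

A VIF of about 1.69 is well below the cut-offs of 5 or 10 that are often used to flag problematic multicollinearity, which agrees with the conclusion drawn from the correlation threshold.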

3.6.3 Feature Correlation with KPI Achievement

# Select numeric features
num_features <- df_clean %>%
  select(age,
         length_of_service,
         avg_training_score,
         no_of_trainings,
         previous_year_rating)

# Compute correlation with KPI
cor_results <- sapply(num_features, function(x) {
  cor(x, df_clean$KPIs_met_more_than_80, use = "complete.obs")
})

# Convert to data frame and rank
cor_ranked <- data.frame(
  feature = names(cor_results),
  correlation = cor_results
) %>%
  arrange(desc(abs(correlation)))

cor_ranked
# Plot 
ggplot(cor_ranked,
       aes(x = reorder(feature, correlation),
           y = correlation,
           fill = correlation)) +

  geom_bar(stat = "identity") +

  coord_flip() +

  scale_fill_gradient2(low = "#E74C3C",
                       mid = "white",
                       high = "#2ECC71") +

  labs(title = "Feature Correlation with KPI Achievement",
       x = "Feature",
       y = "Correlation Strength") +

  theme_minimal(base_size = 13)

Insights:

  • Previous year rating (0.32) is the only variable that shows a notable correlation with KPI achievement.
  • The correlation coefficients for all other numerical variables were close to zero, indicating that these variables had little or no linear relationship with KPI achievement.

3.7 Insights Before Modeling

  • Based on EDA findings:
    • Among the numerical variables, previous_year_rating and avg_training_score are positively associated with KPI achievement, although the effect of avg_training_score is small. Employees with higher ratings from the previous year, and to a lesser extent higher training scores, tend to perform better on KPIs.
    • length_of_service and no_of_trainings exhibit right-skewed distributions, indicating that most employees have shorter tenure and have attended fewer training sessions.
    • Categorical analysis reveals differences in employee performance across department, education, gender, awards_won and recruitment_channel. Chi-square tests confirm statistically significant associations between each of these variables and KPI achievement.
    • Bivariate analysis indicates that employees who met their KPI targets have slightly higher average training scores (about 2 points), but the distributions of the two groups overlap heavily, so training score alone is a weak predictor.
    • The relationship between previous_year_rating and avg_training_score reveals a potential nonlinear pattern, suggesting that performance may vary across different rating groups.
    • Correlation analysis revealed a moderate correlation between KPI achievement and the previous year’s rating, while all other numerical variables show a weak correlation. This suggests that KPI performance is likely driven by a combination of nonlinear effects, categorical factors (such as department and education), and other contextual or behavioral variables not covered in this correlation analysis. The correlation between age and length_of_service is approximately 0.64; since it does not exceed the commonly used threshold of 0.7, multicollinearity is not considered a concern.
  • Overall, the results of the Exploratory Data Analysis (EDA) indicate that employee demographics, training performance, previous year rating, and departmental characteristics may influence KPI achievement and should therefore be taken into account during the modeling phase.

4.0 Data Analysis & Modelling

This section applies statistical and machine learning techniques to uncover meaningful insights from the cleaned dataset. The goal is to identify key predictors, evaluate model performance, and generate reliable forecasts. By combining exploratory analysis with predictive modelling, we aim to transform raw data into actionable knowledge that supports decision‑making.

Question: Can employee KPI achievement (more than 80%) be predicted using demographic, training, and workplace-related variables?

4.1 Logistic Regression (Classification)

# ============================================================
# Logistic Regression: Employee Performance
# ============================================================

library(caret)
library(ggplot2)
library(pROC)
library(dplyr)
library(gridExtra)

# STEP 1: Load data
df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)

# STEP 2: Remove redundant columns
df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL

# STEP 3: Convert character columns to factors
char_cols <- names(df_clean)[sapply(df_clean, is.character)]
df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor)

# STEP 4: Convert target to factor
df_clean$KPIs_met_more_than_80 <- factor(
  df_clean$KPIs_met_more_than_80,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

4.1.1 Train-test split

# STEP 5: Train/Test split
set.seed(42)

train_idx <- createDataPartition(
  df_clean$KPIs_met_more_than_80,
  p = 0.80,
  list = FALSE
)

train_data <- df_clean[train_idx, ]
test_data  <- df_clean[-train_idx, ]

# STEP 6: Build Logistic Regression Model
log_model <- glm(
  KPIs_met_more_than_80 ~ .,
  data = train_data,
  family = binomial
)

summary(log_model)

# STEP 7: Predictions
# Predicted probability of the "Yes" class
pred_prob <- predict(log_model, newdata = test_data, type = "response")

# Classify using a 0.5 probability cut-off
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")

pred_class <- factor(pred_class, levels = c("No", "Yes"))
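
The prediction step relies on two ideas: predict(..., type = "response") returns probabilities obtained by applying the inverse logit to the model's linear predictor, and the class label then depends on the chosen cut-off. The sketch below illustrates both on R's built-in mtcars data; the model toy_model and its predictors are stand-ins chosen for illustration and have nothing to do with the HR dataset.

```r
# Toy model on the built-in mtcars data (illustration only, not the HR dataset)
toy_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

log_odds <- predict(toy_model, type = "link")       # linear predictor (log-odds)
prob     <- predict(toy_model, type = "response")   # fitted probability

# The response scale is the inverse logit of the link scale
all.equal(prob, plogis(log_odds))

# Classify at a chosen cut-off; 0.5 is a convention, not a requirement
threshold  <- 0.5
toy_class  <- factor(ifelse(prob > threshold, "Yes", "No"), levels = c("No", "Yes"))
table(toy_class)
```

When the positive class is the minority, as it is for KPI achievement here, a cut-off other than 0.5 may trade specificity for sensitivity; whether that is desirable depends on the HR use case.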

4.1.2 Confusion Matrix

# STEP 8: Evaluation
cm <- confusionMatrix(
  pred_class,
  test_data$KPIs_met_more_than_80,
  positive = "Yes"
)

print(cm)

# Extract headline metrics as percentages
acc  <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1   <- round(cm$byClass["F1"] * 100, 2)
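
For readers less familiar with these metrics, the sketch below shows how each one is derived from the four cells of a 2 x 2 confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the model above.

```r
# Hypothetical confusion-matrix counts (illustration only)
tp <- 60;  fn <- 40   # actual "Yes": predicted "Yes" / predicted "No"
tn <- 150; fp <- 50   # actual "No":  predicted "No"  / predicted "Yes"

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)          # recall for the "Yes" class
specificity <- tn / (tn + fp)          # recall for the "No" class
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(100 * c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, F1 = f1), 2)
```

Here accuracy is 70% even though only 60% of the actual "Yes" cases are detected, which is why sensitivity and F1 are reported alongside accuracy.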

4.1.3 ROC, AUC

# STEP 9: ROC / AUC
roc_obj <- roc(
  response = test_data$KPIs_met_more_than_80,
  predictor = pred_prob,
  levels = c("No", "Yes")
)

auc_val <- round(auc(roc_obj), 4)

# Points of the ROC curve for plotting
roc_df <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)
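
The AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal base R sketch of this rank-based view follows; auc_by_ranks is a hypothetical helper written for illustration and is not part of pROC.

```r
# AUC from ranks (equivalent to the Mann-Whitney U statistic scaled to [0, 1])
auc_by_ranks <- function(score, is_positive) {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Four toy predictions: two negatives, two positives
score       <- c(0.10, 0.40, 0.35, 0.80)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)

# 3 of the 4 positive/negative pairs are ordered correctly, so AUC = 0.75
auc_by_ranks(score, is_positive)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is the scale on which the value reported by auc(roc_obj) should be read.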

4.1.4 Model Evaluation

# STEP 10: Metrics table
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "F1 Score"),
  Value = c(acc, sens, spec, f1)
)

# STEP 11: Feature Importance
coef_df <- data.frame(
  Feature = names(coef(log_model)),
  Coefficient = coef(log_model)
)

coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction = ifelse(Coefficient > 0, "Positive", "Negative")
  ) %>%
  arrange(desc(Abs_Coefficient))

top_coef_df <- coef_df %>%
  slice_max(order_by = Abs_Coefficient, n = 15)

print(coef_df)

# STEP 12: Predicted Probability
prob_df <- data.frame(
  Probability = pred_prob,
  Actual_Class = test_data$KPIs_met_more_than_80
)

# ============================================================
# STEP 13: Dashboard Plots
# ============================================================

# p1: Class Distribution
class_tbl <- table(df_clean$KPIs_met_more_than_80)

class_pct <- round(prop.table(class_tbl) * 100, 1)

class_df <- data.frame(
  Class = names(class_tbl),
  Count = as.integer(class_tbl),
  Percent = as.numeric(class_pct)
)

p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(Count, " (", Percent, "%)")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Class Distribution",
       subtitle = "KPIs Met More Than 80%",
       x = "Class", y = "Count") +
  theme_minimal(base_size = 14)

# p2: Confusion Matrix
cm_tbl <- as.data.frame(cm$table)

names(cm_tbl) <- c("Predicted", "Actual", "Freq")

p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(title = "Confusion Matrix",
       subtitle = "Predicted vs Actual Classes",
       x = "Actual Class", y = "Predicted Class", fill = "Count") +
  theme_minimal(base_size = 14)

# p3: ROC Curve
p3 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "ROC Curve",
       subtitle = paste("AUC =", auc_val),
       x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal(base_size = 14)

# p4: Performance Metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(Value, "%")), vjust = -0.4, fontface = "bold") +
  ylim(0, 110) +
  labs(title = "Performance Metrics", y = "Score (%)") +
  theme_minimal(base_size = 14)

# p5: Feature Importance
p5 <- ggplot(top_coef_df,
             aes(x = reorder(Feature, Abs_Coefficient),
                 y = Abs_Coefficient, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance",
       subtitle = "Top 15 absolute coefficients",
       x = "Feature", y = "Absolute Coefficient") +
  theme_minimal(base_size = 14)

# p6: Predicted Probability
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75,
                 position = "identity", colour = "white") +
  labs(title = "Predicted Probability",
       subtitle = "Probability of KPI Achievement",
       x = "Probability of Yes", y = "Count", fill = "Actual Class") +
  theme_minimal(base_size = 14)

# Combine dashboard
logistic_dashboard <- grid.arrange(
  p1, p2, p3, p4, p5, p6,
  ncol = 2, nrow = 3
)

# Save dashboard
ggsave("logistic_regression_dashboard.png",
       plot = logistic_dashboard,
       width = 14, height = 16, dpi = 150)

# Recompute the headline metrics from the confusion matrix
acc  <- round(cm$overall["Accuracy"] * 100, 2)
sens <- round(cm$byClass["Sensitivity"] * 100, 2)
spec <- round(cm$byClass["Specificity"] * 100, 2)
f1   <- round(cm$byClass["F1"] * 100, 2)

# Save Logistic Regression metrics
logistic_acc  <- acc
logistic_sens <- sens
logistic_spec <- spec
logistic_f1   <- f1
logistic_auc  <- auc_val
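
Ranking features by absolute coefficient is sensitive to the units of each predictor, so it helps to read coefficients as odds ratios and, where predictors are numeric, to compare them on a standardized scale. The sketch below illustrates both steps on the built-in mtcars data; the model and variables are illustrative stand-ins, not the project's log_model.

```r
# Toy logistic model on built-in data (illustration only)
toy_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Odds ratio: multiplicative change in the odds per one-unit increase in a predictor
odds_ratios <- exp(coef(toy_model))
round(odds_ratios[c("wt", "hp")], 4)

# Standardizing numeric predictors puts coefficients on a comparable scale
toy_scaled <- transform(mtcars,
                        wt = as.numeric(scale(wt)),
                        hp = as.numeric(scale(hp)))
toy_model_scaled <- glm(am ~ wt + hp, data = toy_scaled, family = binomial)
coef(toy_model_scaled)[c("wt", "hp")]
```

An odds ratio below 1 means the odds of the positive class fall as the predictor rises, and above 1 means they rise. For factor predictors such as department, each coefficient is relative to the reference level, so its size is not directly comparable with that of a numeric predictor.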

4.2 Random Forest (Classification)

============================================================

Random Forest — Employee Performance

============================================================

── 0. Install & load packages ────────────────────────────

required_packages <- c(“randomForest”, “caret”, “ggplot2”, “reshape2”, “pROC”, “dplyr”, “gridExtra”)

for (pkg in required_packages) { if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg) }

library(randomForest) library(caret) library(ggplot2) library(reshape2) library(pROC) library(dplyr) library(gridExtra)

============================================================

STEP 1 — Load data

============================================================

df_clean <- read.csv(“clean_employee_performance.csv”, stringsAsFactors = FALSE) cat(sprintf(“Loaded: %d rows x %d columns”, nrow(df_clean), ncol(df_clean)))

============================================================

STEP 2 — Fix issue 1: drop redundant / derived columns

============================================================

df_clean\(avg_training_score_scaled <- NULL df_clean\)age_group <- NULL cat(“Step 2: Dropped redundant columns: avg_training_score_scaled, age_group”)

============================================================

STEP 3 — Fix issue 2: convert character columns to factors

============================================================

char_cols <- names(df_clean)[sapply(df_clean, is.character)] df_clean[char_cols] <- lapply(df_clean[char_cols], as.factor) cat(“Step 3: Converted to factor:”, paste(char_cols, collapse = “,”), “”) cat(” ‘unknown’ kept as a valid factor level in ‘education’“)

============================================================

STEP 4 — Fix issue 3: convert target to factor

============================================================

df_clean\(KPIs_met_more_than_80 <- factor(df_clean\)KPIs_met_more_than_80, levels = c(0, 1), labels = c(“No”, “Yes”)) cat(“Step 4: Target ‘KPIs_met_more_than_80’ converted to factor”)

============================================================

STEP 5 — Handle NA values

============================================================

na_total <- sum(is.na(df_clean)) cat(sprintf(“: Total NAs found: %d”, na_total)) if (na_total > 0) { df_clean <- na.roughfix(df_clean) cat(” na.roughfix() applied“) } else { cat(” No NAs — skipping imputation“) }

============================================================

STEP 6 — Class imbalance check & compute class weights

============================================================

cat("Step 6: Class distribution\n")
class_tbl <- table(df_clean$KPIs_met_more_than_80)
print(class_tbl)
class_pct <- round(prop.table(class_tbl) * 100, 1)
print(class_pct)

n_no   <- as.integer(class_tbl["No"])
n_yes  <- as.integer(class_tbl["Yes"])
wt_no  <- 1
wt_yes <- round(n_no / n_yes, 2)   # weight "Yes" by the No:Yes count ratio
class_weights <- c("No" = wt_no, "Yes" = wt_yes)
cat(sprintf("  Class weights: No = %.2f | Yes = %.2f\n", wt_no, wt_yes))

4.2.1 Train-test split

============================================================

STEP 7 — Train / Test split

============================================================

set.seed(42)
train_idx <- createDataPartition(df_clean$KPIs_met_more_than_80, p = 0.80, list = FALSE)
train_data <- df_clean[train_idx, ]
test_data  <- df_clean[-train_idx, ]
cat(sprintf("Step 7: Train rows: %d | Test rows: %d\n", nrow(train_data), nrow(test_data)))
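When the outcome is a factor, createDataPartition() samples within each class, so the train and test sets keep roughly the same Yes/No proportions as the full data. The idea can be sketched in base R as follows. The label vector `y_demo` and the helper `stratified_split()` are illustrative names invented for this sketch and are not part of the project code.

```r
# Hypothetical labels: 70% "No", 30% "Yes" (illustrative only)
set.seed(42)
y_demo <- factor(rep(c("No", "Yes"), times = c(700, 300)))

# Sample a fraction p of the row indices separately within each class
stratified_split <- function(y, p = 0.80) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }))
  sort(unname(train_idx))
}

train_idx_demo <- stratified_split(y_demo)

# Class proportions in each partition match the full data
prop_train <- prop.table(table(y_demo[train_idx_demo]))
prop_test  <- prop.table(table(y_demo[-train_idx_demo]))
print(round(prop_train, 3))
print(round(prop_test, 3))
```

A plain random split would only match these proportions on average, which matters more when one class is much rarer than the other.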

============================================================

STEP 8 — Build Random Forest model

============================================================

set.seed(42)
n_features <- ncol(train_data) - 1       # every column except the target
mtry_val   <- floor(sqrt(n_features))    # square-root rule for classification

cat(sprintf("Step 8: Training Random Forest (ntree=500, mtry=%d) ...\n", mtry_val))

rf_model <- randomForest(
  KPIs_met_more_than_80 ~ .,
  data       = train_data,
  ntree      = 500,
  mtry       = mtry_val,
  importance = TRUE,
  classwt    = class_weights
)

print(rf_model)

4.2.2 Confusion Matrix

============================================================

STEP 9 — Predict on test set

============================================================

preds_class <- predict(rf_model, newdata = test_data)
preds_prob  <- predict(rf_model, newdata = test_data, type = "prob")

============================================================

STEP 10 — Performance Evaluation

============================================================

cm <- confusionMatrix(preds_class, test_data$KPIs_met_more_than_80, positive = "Yes")
print(cm)

acc       <- round(cm$overall["Accuracy"] * 100, 2)
kappa     <- round(cm$overall["Kappa"] * 100, 2)
sens      <- round(cm$byClass["Sensitivity"] * 100, 2)
spec      <- round(cm$byClass["Specificity"] * 100, 2)
precision <- round(cm$byClass["Precision"] * 100, 2)
f1        <- round(cm$byClass["F1"] * 100, 2)

roc_obj <- roc(
  response  = test_data$KPIs_met_more_than_80,
  predictor = preds_prob[, "Yes"],
  levels    = c("No", "Yes"),
  direction = "<"
)
auc_val <- round(as.numeric(auc(roc_obj)), 4)

============================================================

Visualize Confusion Matrix

============================================================

# Load required libraries
library(caret)
library(ggplot2)

# Recompute predictions and the confusion matrix (same as Steps 9 and 10)
preds_class <- predict(rf_model, newdata = test_data)
cm <- confusionMatrix(preds_class, test_data$KPIs_met_more_than_80, positive = "Yes")

# Convert confusion matrix to data frame
cm_tbl <- as.data.frame(cm$table)
names(cm_tbl) <- c("Predicted", "Actual", "Freq")

# Plot confusion matrix heatmap
ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 6, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#A8C8E8", high = "#1A5FA8") +
  labs(
    title    = "Confusion Matrix Heatmap",
    subtitle = "Model predictions vs actual classes",
    x = "Actual Class", y = "Predicted Class", fill = "Count"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")

4.2.3 ROC, AUC

============================================================

ROC Curve

============================================================

roc_obj <- roc(
  response  = test_data$KPIs_met_more_than_80,
  predictor = preds_prob[, "Yes"],
  levels    = c("No", "Yes"),
  direction = "<"
)

roc_df <- data.frame(
  FPR = 1 - roc_obj$specificities,
  TPR = roc_obj$sensitivities
)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#1A5FA8", linewidth = 1.1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed",
              colour = "grey60", linewidth = 0.7) +
  annotate("text", x = 0.65, y = 0.12,
           label = sprintf("AUC = %.4f", auc(roc_obj)),
           size = 5, fontface = "bold", colour = "#1A5FA8") +
  labs(
    title    = "ROC Curve",
    subtitle = "Receiver Operating Characteristic, test set",
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)"
  ) +
  theme_minimal(base_size = 14) +
  coord_equal()

4.2.4 Model Evaluation

============================================================

Feature Importance (Mean Decrease Gini)

============================================================

imp_mat <- importance(rf_model)
imp_df <- data.frame(
  Feature              = rownames(imp_mat),
  MeanDecreaseAccuracy = imp_mat[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini     = imp_mat[, "MeanDecreaseGini"]
)

# Sort by Gini importance and fix the factor order so bars plot in that order
imp_df <- imp_df[order(imp_df$MeanDecreaseGini), ]
imp_df$Feature <- factor(imp_df$Feature, levels = imp_df$Feature)

ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity", width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = round(MeanDecreaseGini, 1)), hjust = -0.15, size = 3.5) +
  scale_fill_gradient(low = "#A8D5A2", high = "#2E7D32") +
  coord_flip() +
  labs(
    title    = "Feature Importance (Mean Decrease Gini)",
    subtitle = "Higher = more important for node purity",
    y = "Mean Decrease Gini"
  ) +
  theme_minimal(base_size = 14)
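Mean Decrease Gini measures how much splits on a feature reduce node impurity, accumulated over the forest. The building block is the Gini impurity of a node and the drop in impurity produced by one split. The sketch below works through a single split with made-up class counts. The node counts and the `gini()` helper are assumptions for illustration, not values from the fitted model.

```r
# Gini impurity of a node, given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical parent node and the two child nodes produced by one split
parent <- c(No = 60, Yes = 40)
left   <- c(No = 50, Yes = 10)
right  <- c(No = 10, Yes = 30)

# Impurity after the split, weighted by the share of rows in each child
n <- sum(parent)
gini_after <- (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

# The impurity decrease credited to the feature used for this split
gini_decrease <- gini(parent) - gini_after
print(c(parent = gini(parent), after = gini_after, decrease = gini_decrease))
```

A feature that often produces large decreases like this one ends up near the top of the importance chart.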

============================================================

Class Distribution Bar Chart

============================================================

class_tbl <- table(df_clean$KPIs_met_more_than_80)
class_pct <- round(prop.table(class_tbl) * 100, 1)
class_df  <- data.frame(
  Class   = names(class_tbl),
  Count   = as.integer(class_tbl),
  Percent = as.numeric(class_pct)
)

ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percent)),
            vjust = -0.3, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(
    title    = "Class Distribution of Target Variable",
    subtitle = "KPIs met more than 80%",
    x = "KPIs Met > 80%", y = "Count"
  ) +
  theme_minimal(base_size = 14)

============================================================

Performance Metrics Bar Chart

============================================================

metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity", "Specificity", "Precision", "F1 Score"),
  Value  = c(acc, sens, spec, precision, f1)
)

ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.1f%%", Value)), vjust = -0.4, size = 4, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Model Performance Metrics",
    subtitle = "Evaluated on test set",
    y = "Score (%)"
  ) +
  ylim(0, 110) +
  theme_minimal(base_size = 14)

============================================================

Predicted Probability Histogram

============================================================

prob_df <- data.frame(
  Probability  = preds_prob[, "Yes"],
  Actual_Class = test_data$KPIs_met_more_than_80
)

ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05, alpha = 0.75, position = "identity", colour = "white") +
  scale_fill_manual(values = c("No" = "#E07B54", "Yes" = "#4C9BE8")) +
  labs(
    title    = "Predicted Probability Distribution",
    subtitle = "Probability of KPIs > 80% by actual class",
    x = "Predicted probability (Yes)", y = "Count", fill = "Actual class"
  ) +
  theme_minimal(base_size = 14)

============================================================

Create a Dashboard

============================================================

# STEP-1: Build the individual panels

# p1: Class distribution
p1 <- ggplot(class_df, aes(x = Class, y = Count, fill = Class)) +
  geom_bar(stat = "identity") +
  labs(title = "Class Distribution")

# p2: Confusion matrix
p2 <- ggplot(cm_tbl, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile() +
  labs(title = "Confusion Matrix")

# p3: Feature importance
p3 <- ggplot(imp_df, aes(x = Feature, y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature Importance (Gini)")

# p4: Performance metrics
p4 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity") +
  labs(title = "Performance Metrics")

# p5: ROC curve
p5 <- ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line() +
  labs(title = "ROC Curve")

# p6: Probability histogram
p6 <- ggplot(prob_df, aes(x = Probability, fill = Actual_Class)) +
  geom_histogram(binwidth = 0.05) +
  labs(title = "Predicted Probability Distribution")

# STEP-2: Arrange the graphs into a dashboard
library(gridExtra)

combined_dashboard <- grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)
ggsave("dashboard_overview.png", plot = combined_dashboard,
       width = 18, height = 10, dpi = 150)

4.3 Insights and Comparison Table

Confusion Matrix

When compared with the Random Forest model, the Logistic Regression model achieved a slightly higher overall accuracy (71.03% vs 70.23%). However, the difference in performance becomes clearer when examining sensitivity and specificity. Logistic Regression recorded a much lower sensitivity of 43% compared to Random Forest’s 57%, meaning it missed a larger number of employees who actually met the KPI target. In contrast, Logistic Regression achieved a substantially higher specificity of 87%, while Random Forest obtained 78%, indicating that Logistic Regression was better at correctly identifying employees who did not meet KPI expectations.

The Random Forest model demonstrated a more balanced classification performance overall. Although its accuracy was marginally lower, it produced a higher F1-score (57.86% compared to 51.79% for Logistic Regression), showing a better balance between precision and recall. Random Forest was therefore more effective at detecting true high performers, while Logistic Regression was more conservative and focused on minimizing false positives. This is also reflected in the confusion matrices, where Logistic Regression produced more false negatives (708) than Random Forest (538), meaning more actual high performers were overlooked.
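The metrics discussed above all follow from the four cells of the confusion matrix. The sketch below shows the arithmetic for the Random Forest model. Only the false-negative count (538) is quoted in the text. The other three counts are back-calculated from the reported percentages and should be read as an assumption, not as output copied from confusionMatrix().

```r
# Counts for the Random Forest test-set confusion matrix.
# fn = 538 is quoted in the text; tp, fp and tn are back-calculated from the
# reported metrics (an assumption for illustration).
tp <- 712; fn <- 538; fp <- 499; tn <- 1734

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual "Yes" that were detected
specificity <- tn / (tn + fp)   # share of actual "No" that were detected
precision   <- tp / (tp + fp)   # share of predicted "Yes" that were correct
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(100 * c(accuracy = accuracy, sensitivity = sensitivity,
                    specificity = specificity, precision = precision,
                    f1 = f1), 2))
```

With these counts the accuracy, sensitivity, specificity and F1-score round to the Random Forest figures reported in this section (70.23%, 56.96%, 77.65% and 57.86%).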

Additionally, the Kappa statistic for Random Forest (0.35) was slightly higher than that of Logistic Regression (0.32), suggesting better agreement beyond chance. The Random Forest model also achieved a higher balanced accuracy (67% vs 65%), indicating stronger overall performance across both classes rather than favoring the majority class. While Logistic Regression excelled at identifying non-high performers, Random Forest provided a more even trade-off between identifying successful and unsuccessful employees.
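Kappa and balanced accuracy can also be computed by hand from the confusion matrix. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals, and balanced accuracy is the mean of sensitivity and specificity. The sketch below uses Random Forest counts that are back-calculated from the reported metrics (only the 538 false negatives are quoted in the text), so the inputs are an assumption.

```r
# Back-calculated Random Forest counts (assumption; only fn = 538 is quoted)
tp <- 712; fn <- 538; fp <- 499; tn <- 1734
n  <- tp + fn + fp + tn

# Observed agreement is simply the accuracy
observed_agreement <- (tp + tn) / n

# Agreement expected by chance, from the actual and predicted class totals
expected_agreement <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n^2

kappa_demo <- (observed_agreement - expected_agreement) / (1 - expected_agreement)

# Balanced accuracy: average of sensitivity and specificity
balanced_accuracy <- (tp / (tp + fn) + tn / (tn + fp)) / 2

print(round(c(kappa = kappa_demo, balanced_accuracy = balanced_accuracy), 4))
```

Both values round to the Random Forest figures given above (Kappa of about 0.35 and balanced accuracy of about 67%).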

Overall, Logistic Regression appears more suitable when the priority is reducing false claims of high performance, whereas Random Forest is more appropriate when the goal is to identify as many genuine high performers as possible. Since employee performance prediction often benefits from detecting successful employees accurately, the Random Forest model may be considered the more practical and reliable choice despite its slightly lower overall accuracy.

ROC, AUC

Both models performed better than random guessing, with Logistic Regression achieving a slightly higher AUC (0.7402) than Random Forest (0.7327), indicating marginally better overall class separation. However, Random Forest showed stronger practical performance at the default classification threshold, achieving higher sensitivity and F1-score, meaning it was better at identifying employees who actually met KPI targets. Logistic Regression was more conservative, focusing more on correctly identifying non-high performers and reducing false positives. Overall, Random Forest provides a more balanced model for employee performance prediction, while Logistic Regression is better when minimizing false positive predictions is the priority.
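One way to read an AUC value is as the probability that a randomly chosen employee who met the KPI target receives a higher predicted score than a randomly chosen employee who did not. The sketch below computes AUC from that definition using ranks (the Mann-Whitney identity) on a toy example. The `auc_rank()` helper and the four scores are made up for illustration and are independent of pROC.

```r
# AUC from ranks: (sum of positive ranks - minimum possible sum) / number of pairs
auc_rank <- function(scores, labels) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)   # average ranks, so tied scores count as half a win
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up): 1 = met the KPI target, 0 = did not
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

auc_demo <- auc_rank(scores, labels)
print(auc_demo)   # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

Because AUC summarises ranking quality over all thresholds, a model can have a slightly lower AUC and still show higher sensitivity at one particular cut-off.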

Model Evaluation

The Logistic Regression coefficient analysis shows that awards_won and previous_year_rating are the strongest positive predictors of employees meeting more than 80% of their KPIs. Employees who received awards or had strong previous performance ratings were much more likely to achieve high KPI outcomes. Several regional variables also had relatively strong positive effects, while some departments such as sales & marketing, legal, and HR showed negative relationships with KPI success. Features such as no_of_trainings and length_of_service had smaller negative coefficients, indicating weaker influence on the prediction outcome.
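Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to communicate. The sketch below fits a small model to simulated data. The variables `award`, `rating` and `kpi_met` and the effect sizes are invented stand-ins, not estimates from the project dataset.

```r
set.seed(1)
n <- 2000

# Simulated predictors (illustrative only)
award  <- rbinom(n, 1, 0.1)               # 1 = received an award
rating <- sample(1:5, n, replace = TRUE)  # previous rating, 1 to 5

# Simulated outcome with known positive effects on the log-odds scale
log_odds <- -2 + 1.5 * award + 0.4 * rating
kpi_met  <- rbinom(n, 1, plogis(log_odds))

fit <- glm(kpi_met ~ award + rating, family = binomial)

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds of meeting the KPI target), and a value below 1 corresponds to a negative coefficient.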

Compared with Random Forest, Logistic Regression provides coefficient-based interpretations that show whether each variable increases or decreases the likelihood of success. However, Random Forest offered clearer feature importance rankings and can capture more complex non-linear relationships between variables. Logistic Regression is therefore simpler and easier to interpret statistically, while Random Forest provides stronger practical insight into which factors most influence employee performance predictions.

Comparison Table

# The logistic_* values are assumed to come from the Logistic Regression
# evaluation earlier in the report
comparison_df <- data.frame(
  Model       = c("Logistic Regression", "Random Forest"),
  Accuracy    = c(logistic_acc, acc),
  Sensitivity = c(logistic_sens, sens),
  Specificity = c(logistic_spec, spec),
  F1_Score    = c(logistic_f1, f1),
  AUC         = c(logistic_auc, auc_val)
)

comparison_df

Insights and Recommendations

Both Logistic Regression and Random Forest achieved similar accuracy, with Logistic Regression performing slightly better (71.03% vs 70.23%). However, Random Forest showed much higher sensitivity (56.96% vs 43.36%), meaning it was better at identifying employees who actually met KPI targets. Logistic Regression achieved higher specificity (86.52% vs 77.65%), indicating it was stronger at identifying employees who did not meet KPIs. Random Forest also obtained a higher F1-score (57.86% vs 51.79%), showing a better balance between precision and recall. Although Logistic Regression had a slightly higher AUC (0.7402 vs 0.7327), the difference was minimal. Overall, Logistic Regression is more conservative and better at avoiding false positives, while Random Forest provides a more balanced performance and is more effective at detecting genuine high performers.

4.4 Linear Regression Model

Question: Can employee average training score be predicted using demographic and workplace-related variables?

============================================================

Linear Regression — Employee Training Score Prediction

============================================================

── 0. Install & load packages ─────────────────────────────

required_packages <- c("caret", "ggplot2", "dplyr", "Metrics", "gridExtra")

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

library(caret)
library(ggplot2)
library(dplyr)
library(Metrics)
library(gridExtra)

============================================================

STEP 1 — Load data

============================================================

df_clean <- read.csv("clean_employee_performance.csv", stringsAsFactors = FALSE)

cat(sprintf("Loaded: %d rows x %d columns\n", nrow(df_clean), ncol(df_clean)))

============================================================

STEP 2 — Remove redundant columns

============================================================

df_clean$avg_training_score_scaled <- NULL
df_clean$age_group <- NULL
df_clean$KPIs_met_more_than_80 <- NULL

cat("Dropped redundant columns\n")

============================================================

STEP 3 — Convert character columns to factors

============================================================

char_cols <- names(df_clean)[sapply(df_clean, is.character)]

df_clean[char_cols] <- lapply( df_clean[char_cols], as.factor )

cat("Converted character columns to factors\n")

============================================================

STEP 4 — Handle missing values

============================================================

na_total <- sum(is.na(df_clean))

cat(sprintf("Total NAs found: %d\n", na_total))

if (na_total > 0) {
  # Median imputation for numeric columns only; NAs in factor columns are left as-is
  for (col in names(df_clean)) {
    if (is.numeric(df_clean[[col]])) {
      df_clean[[col]][is.na(df_clean[[col]])] <-
        median(df_clean[[col]], na.rm = TRUE)
    }
  }
  cat("Median imputation applied\n")
} else {
  cat("No missing values found\n")
}

4.4.1 Train-test split

============================================================

STEP 5 — Train/Test Split

============================================================

set.seed(42)

train_idx <- createDataPartition( df_clean$avg_training_score, p = 0.80, list = FALSE )

train_data <- df_clean[train_idx, ]
test_data  <- df_clean[-train_idx, ]

cat(sprintf("Train rows: %d | Test rows: %d\n", nrow(train_data), nrow(test_data)))

============================================================

STEP 6 — Build Linear Regression Model

============================================================

lm_model <- lm( avg_training_score ~ age + previous_year_rating + length_of_service + no_of_trainings + department + education + gender + recruitment_channel + awards_won,

data = train_data )

summary(lm_model)

4.4.2 Model Evaluation: RMSE, MAE, RSquared

============================================================

STEP 7 — Predictions

============================================================

predictions <- predict( lm_model, newdata = test_data )

actual_values <- test_data$avg_training_score

============================================================

STEP 8 — Regression Evaluation Metrics

============================================================

rmse_val <- round( rmse(actual_values, predictions), 3 )

mae_val <- round( mae(actual_values, predictions), 3 )

r2_val <- round( cor(actual_values, predictions)^2, 3 )

cat("Model Performance\n")
cat("RMSE :", rmse_val, "\n")
cat("MAE  :", mae_val, "\n")
cat("R²   :", r2_val, "\n")

4.4.3 Result Visualization: Actual vs. Predicted

============================================================

STEP 9 — Actual vs Predicted Plot

============================================================

results_df <- data.frame( Actual = actual_values, Predicted = predictions )

p1 <- ggplot(results_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5, colour = "#1A5FA8") +
  geom_abline(slope = 1, intercept = 0, colour = "red", linetype = "dashed") +
  labs(
    title    = "Actual vs Predicted Values",
    subtitle = "Linear Regression",
    x = "Actual Training Score",
    y = "Predicted Training Score"
  ) +
  theme_minimal(base_size = 14)

4.4.4 Residual Plot and Distribution

============================================================

STEP 10 — Residual Plot

============================================================

results_df$Residuals <- actual_values - predictions

p2 <- ggplot(results_df, aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5, colour = "#2E7D32") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed") +
  labs(
    title = "Residual Plot",
    x = "Predicted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 14)

============================================================

STEP 11 — Residual Distribution

============================================================

p3 <- ggplot(results_df, aes(x = Residuals)) +
  geom_histogram(bins = 30, fill = "#4C9BE8", colour = "white", alpha = 0.8) +
  labs(
    title = "Residual Distribution",
    x = "Residual",
    y = "Count"
  ) +
  theme_minimal(base_size = 14)

4.4.5 Feature Importance

============================================================

STEP 12 — Feature Importance

============================================================

coef_df <- data.frame( Feature = names(coef(lm_model)), Coefficient = coef(lm_model) )

coef_df <- coef_df %>%
  filter(Feature != "(Intercept)") %>%
  mutate(
    Abs_Coefficient = abs(Coefficient),
    Direction       = ifelse(Coefficient > 0, "Positive", "Negative")
  ) %>%
  arrange(desc(Abs_Coefficient))

top_coef_df <- coef_df %>%
  slice_max(order_by = Abs_Coefficient, n = 15)

p4 <- ggplot(
  top_coef_df,
  aes(x = reorder(Feature, Abs_Coefficient), y = Abs_Coefficient, fill = Direction)
) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title    = "Feature Importance",
    subtitle = "Linear Regression Coefficients",
    x = "Feature",
    y = "Absolute Coefficient"
  ) +
  theme_minimal(base_size = 14)

Insights

The linear regression coefficient plot shows how different employee characteristics are associated with the average training score. Features with positive coefficients increase the predicted training score, while negative coefficients reduce it. Among the positive predictors, awards_won has one of the strongest effects, suggesting that employees who received awards tend to achieve higher training scores. Variables such as previous_year_rating, education level, and referral-based recruitment also contribute positively, although their effects are smaller.

On the other hand, departments such as sales & marketing, HR, legal, operations, and finance have large negative coefficients, indicating that employees in these departments tend to have lower average training scores compared to the reference department. Variables such as no_of_trainings and length_of_service also show slight negative relationships, suggesting that attending more trainings or having longer service does not necessarily correspond to higher average training performance.
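Coefficients for a factor such as department are always measured against a reference level, which lm() takes to be the first factor level by default. With a single factor predictor, each coefficient equals that group's mean minus the reference group's mean. The sketch below shows this with invented data. The department names and scores are hypothetical and are not taken from the project dataset.

```r
# Made-up scores for three hypothetical departments (illustrative only)
demo <- data.frame(
  department = factor(rep(c("analytics", "hr", "sales"), each = 4)),
  score      = c(84, 86, 85, 85,  50, 52, 49, 49,  51, 50, 48, 51)
)

fit <- lm(score ~ department, data = demo)
print(coef(fit))

# "analytics" is the first level, so it is the reference group:
# the intercept is its mean and every other coefficient is a difference from it
group_means <- tapply(demo$score, demo$department, mean)
print(group_means - group_means["analytics"])
```

A large negative department coefficient therefore means "lower than the reference department on average, other predictors held equal", not "low in absolute terms". Calling relevel() on the factor before fitting changes which department serves as the baseline.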

Overall, the model suggests that employee recognition and past performance are associated with stronger training outcomes, while departmental differences appear to play a significant role in influencing average training scores. The coefficient directions also help explain which factors are linked to higher or lower training performance within the organization.

4.4.6 Regression Model Performance

============================================================

STEP 13 — Regression Metrics Bar Chart

============================================================

metrics_df <- data.frame(
  Metric = c("RMSE", "MAE", "R²"),
  Value  = c(rmse_val, mae_val, r2_val)
)

p5 <- ggplot(metrics_df, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Value), vjust = -0.4, fontface = "bold") +
  labs(
    title = "Regression Performance Metrics",
    y = "Value"
  ) +
  theme_minimal(base_size = 8)

Insights

The regression performance metrics indicate that the linear regression model predicts employees’ average training scores well on the test set. The R² value of 0.888, computed here as the squared correlation between actual and predicted scores, means that approximately 88.8% of the variation in training scores is accounted for by the predictor variables included in the model. This is a strong fit, indicating that the selected features explain most of the variation in employee training performance.
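The script computes R² as the squared correlation between actual and predicted values. For an ordinary least squares model with an intercept this matches the usual 1 - SSE/SST definition on the training data, but on held-out data the two can differ, because correlation ignores any systematic offset in the predictions. The toy example below makes the difference visible. The numbers are made up for illustration.

```r
# Made-up actual and predicted values (illustrative only)
actual    <- c(50, 60, 70, 80, 90)
predicted <- c(55, 65, 75, 85, 95)   # every prediction is 5 points too high

# Squared correlation: unaffected by the constant offset
r2_cor <- cor(actual, predicted)^2

# 1 - SSE/SST: penalizes the offset
sse <- sum((actual - predicted)^2)
sst <- sum((actual - mean(actual))^2)
r2_sse <- 1 - sse / sst

print(c(r2_cor = r2_cor, r2_sse = r2_sse))   # 1.000 and 0.875
```

Reporting both versions on the test set is a quick check that the predictions are not systematically shifted.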

The Mean Absolute Error (MAE) of 2.74 shows that, on average, the model’s predictions differ from the actual training scores by about 2.7 points. Since MAE measures the average magnitude of prediction errors without considering direction, this relatively small value suggests that the model predictions are generally close to the true scores.

Similarly, the Root Mean Squared Error (RMSE) of 4.51 indicates that the model’s prediction errors are relatively low overall. RMSE is noticeably higher than MAE (4.51 vs 2.74) because it penalizes larger errors more heavily. The gap between the two suggests that while most predictions are close to the actual scores, a minority of predictions carry considerably larger errors.
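The effect of large errors on RMSE can be seen with two sets of predictions that share the same MAE. The helper functions and numbers below are made up for illustration. They follow the standard definitions of the two metrics but are separate from the Metrics package calls used in the script.

```r
rmse_demo <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_demo  <- function(actual, predicted) mean(abs(actual - predicted))

actual <- c(60, 60, 60, 60)

# Same total absolute error (8 points), distributed differently
even_errors  <- c(62, 58, 62, 58)   # four errors of 2 points each
one_big_miss <- c(60, 60, 60, 68)   # a single error of 8 points

print(c(mae_even = mae_demo(actual, even_errors),
        mae_big  = mae_demo(actual, one_big_miss),
        rmse_even = rmse_demo(actual, even_errors),
        rmse_big  = rmse_demo(actual, one_big_miss)))
```

Both sets have an MAE of 2, but the RMSE doubles from 2 to 4 when the error is concentrated in one prediction, which is why an RMSE well above the MAE points to a few large misses.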

Overall, these metrics indicate that the linear regression model has strong predictive performance and is effective for estimating employee average training scores. The high R² combined with relatively low MAE and RMSE values suggests that the model captures the underlying relationships in the data well and provides reliable predictions.