This tutorial will illustrate how to use the K-Nearest Neighbors Classifier in R using the caret package (Kuhn, 2021). In this example, our goal is to use a number of HR-related variables to predict which performance group new employees are likely to fall into.
NOTE: If you’re already familiar with EDA & data viz, you can skip ahead to the Modeling Pipeline section.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readr)
library(readxl)
library(caret)
library(janitor)
library(DescTools)
library(RColorBrewer)
library(patchwork)
library(sjPlot)
library(GGally)
options(scipen = 999)
hr1 <- read_csv("HR_Sample_Data.csv") # Read in Data
hr1[-c(1,6,7,13)] -> hr1 # Remove predictors we don't need
hr1 %>%
map_lgl(is.character) -> char_cols # Identifies Character Variables
hr1[char_cols] %>%
map_df(as.factor) -> hr1[char_cols] # Converts to Factors
hr1$salary <- round(hr1$salary/1000, digits = 0) # Salary in thousands

Before we begin the modeling process, we first need to take a look at our dataset to understand its structure, examine important characteristics such as missingness, and identify any relevant trends or relationships.
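As a very first pass, a couple of quick one-liners (using packages we’ve already loaded) can confirm the column types and verify that nothing is missing:

glimpse(hr1) # column types plus a preview of each variable's values
colSums(is.na(hr1)) # count of missing values per column

For a richer variable-by-variable summary, sjPlot’s view_df produces the table below: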
view_df(hr1,show.type = T,
show.na = T,
show.frq = T,
show.prc = T,
        show.values = T)

| ID | Name | Type | Missings | Values / Range | Value Labels | Freq. | % |
|---|---|---|---|---|---|---|---|
| 1 | salary | numeric | 0 (0.00%) | range: 45-220 | | | |
| 2 | gender | categorical | 0 (0.00%) | | F; M | 163; 126 | 56.40; 43.60 |
| 3 | marital_desc | categorical | 0 (0.00%) | | Divorced; Married; Separated; Single; Widowed | 29; 114; 12; 126; 8 | 10.03; 39.45; 4.15; 43.60; 2.77 |
| 4 | hispanic_latino | categorical | 0 (0.00%) | | No; Yes | 263; 26 | 91.00; 9.00 |
| 5 | emp_status | categorical | 0 (0.00%) | | Active; Terminated for Cause; Voluntarily Terminated | 191; 14; 84 | 66.09; 4.84; 29.07 |
| 6 | department | categorical | 0 (0.00%) | | IT/IS; Production; Sales | 50; 208; 31 | 17.30; 71.97; 10.73 |
| 7 | performance_rating | categorical | 0 (0.00%) | | Fail; Pass | 30; 259 | 10.38; 89.62 |
| 8 | engage_survey | numeric | 0 (0.00%) | range: 1.1-5.0 | | | |
| 9 | emp_satisfaction | numeric | 0 (0.00%) | range: 1-5 | | | |
| 10 | absences | numeric | 0 (0.00%) | range: 1-20 | | | |
| 11 | tenure_yr | numeric | 0 (0.00%) | range: 0-17 | | | |
summary(hr1)

## salary gender marital_desc hispanic_latino
## Min. : 45.00 F:163 Divorced : 29 No :263
## 1st Qu.: 55.00 M:126 Married :114 Yes: 26
## Median : 62.00 Separated: 12
## Mean : 67.26 Single :126
## 3rd Qu.: 70.00 Widowed : 8
## Max. :220.00
## emp_status department performance_rating
## Active :191 IT/IS : 50 Fail: 30
## Terminated for Cause : 14 Production:208 Pass:259
## Voluntarily Terminated: 84 Sales : 31
##
##
##
## engage_survey emp_satisfaction absences tenure_yr
## Min. :1.12 Min. :1.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:3.66 1st Qu.:3.0 1st Qu.: 5.00 1st Qu.: 5.00
## Median :4.28 Median :4.0 Median :11.00 Median : 8.00
## Mean :4.10 Mean :3.9 Mean :10.37 Mean : 7.27
## 3rd Qu.:4.70 3rd Qu.:5.0 3rd Qu.:15.00 3rd Qu.: 9.00
## Max. :5.00 Max. :5.0 Max. :20.00 Max. :17.00
Some important characteristics to look for:

- Missingness: none of our variables have missing values.
- Class balance: our outcome is heavily imbalanced, with 259 employees rated Pass versus only 30 rated Fail.
- Skew and outliers: salary, for example, has a median of 62 but a maximum of 220 (in thousands).
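To quantify the class-balance point, janitor’s tabyl (loaded above) gives quick counts and proportions; a minimal sketch:

# Quick frequency table for the outcome classes
hr1 %>%
  tabyl(performance_rating)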
The code below selects all the numeric variables in your dataset, pivots them to long form, and then creates a ggplot faceted by variable. I highly recommend running the code one line at a time to see how it works. Note that geom_density can be swapped for another univariate geom.
hr1 %>%
select_if(is.numeric) %>%
pivot_longer(names_to = "variable",
values_to = "value",
cols = 1:5) %>%
ggplot(aes(x = value,
color = variable,
fill = variable)) +
facet_wrap(~ variable, scales = "free") +
  geom_density(alpha = 0.5)
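For example, swapping in geom_histogram (just one of several options) gives faceted histograms instead of density curves:

# Histogram variant of the same faceted plot; bins = 20 is an arbitrary choice
hr1 %>%
  select_if(is.numeric) %>%
  pivot_longer(names_to = "variable",
               values_to = "value",
               cols = 1:5) %>%
  ggplot(aes(x = value, fill = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_histogram(bins = 20, alpha = 0.5)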
For visualizing relationships among our variables, the GGally package has a fantastic function called ggpairs, shown below.

hr1 %>%
select(c(salary,
engage_survey,
emp_satisfaction,
performance_rating)) %>%
  ggpairs(aes(colour = performance_rating, alpha = 0.4))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also recreate the same plots ourselves:
ggplot(data = hr1,
aes(x = engage_survey,
y = salary,
color = performance_rating)) +
  geom_jitter(size = 0.9) + # jittered points; adding geom_point too would draw each point twice
scale_color_brewer(palette = "Set1") +
theme_bw() +
  labs(title = "Salary & Engagement Level by Performance Rating")

Faceting the same plot by performance rating makes the group comparison easier to see:

ggplot(data = hr1,
aes(x = engage_survey,
y = salary,
color = performance_rating)) +
  geom_jitter(size = 0.9) + # jittered points, as above
facet_wrap(~performance_rating) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
  labs(title = "Salary & Engagement Level by Performance Rating")

A stacked bar chart shows how the performance groups break down by department:

ggplot(data = hr1,
aes(x = department,
fill = performance_rating)) +
geom_bar(alpha = 0.7) +
theme_bw() +
  labs(title = "Stacked Bar Chart of Performance Group by Department")

Finally, boxplots let us compare the salary distributions of the two performance groups:

ggplot(data = hr1,
aes(x = salary,
fill = performance_rating)) +
geom_boxplot(alpha = 0.5) +
theme_bw() +
  labs(title = "Boxplot of Salaries by Performance Group")

With the caret package, there are a TON of customization options for your modeling pipeline. We’re going to stick to the simpler, more generalizable pipeline methods, but be aware that there are many ways you can tweak this as desired.
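For example, rather than letting tuneLength pick the candidate values of k automatically (as we do below), you can pass an explicit grid to train via its tuneGrid argument. A minimal sketch; the odd-only values of k are just an illustrative choice:

# Illustrative alternative: search only odd values of k from 3 to 21
knn_grid <- expand.grid(k = seq(3, 21, by = 2))
# ...then pass tuneGrid = knn_grid to train() instead of tuneLength = 20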
We’ll start by using caret’s createDataPartition to split our main dataset into a training set and a testing set.
set.seed(825) # Set Seed
# Create index for split
## Specify outcome Y, split ratio, and say list = FALSE
index_trn <- createDataPartition(y = hr1$performance_rating,
p = .7,
list = FALSE,
times = 1)
# Use index to create train & test subsets
hr_trn <- hr1[index_trn,]
hr_test <- hr1[-index_trn,]
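Because createDataPartition performs a stratified split on the outcome, both subsets should preserve the original Pass/Fail ratio; a quick sanity check:

# Outcome proportions should be roughly equal across the two subsets
prop.table(table(hr_trn$performance_rating))
prop.table(table(hr_test$performance_rating))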
We use trainControl to specify our hyperparameter tuning method. There are a LOT of different options, but a simple default is to use repeated k-fold cross-validation.

For the KNN model, we need to tune for the optimal value of k. We’ll do so using accuracy-based 10-fold cross-validation, repeated 3 times.
# Specify method, number of folds, and number of repeats
trn_Control <- trainControl(method = "repeatedcv",
number = 10,
                            repeats = 3)

We have a couple of data pre-processing steps we need to carry out: because KNN relies on distance calculations, our numeric predictors should be centered and scaled so that no single variable dominates.
Good news! The caret package allows us to pre-process our data, tune our hyperparameters, and train our model all in the same step!
We’ll see how this works below in the model training section.
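(If you ever want to run pre-processing as its own step, caret’s standalone preProcess function covers the same ground; a minimal sketch:)

# Standalone alternative: learn centering/scaling parameters, then apply them
pre_proc <- preProcess(hr_trn, method = c("center", "scale"))
hr_trn_scaled <- predict(pre_proc, hr_trn) # numeric columns centered & scaled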
We use caret’s train function to create our fitted model.
IMPORTANT: Using formula notation will lead caret to automatically convert your factor variables to dummies internally. We don’t need to do anything to explicitly convert categorical variables to dummies.
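(For reference, if you ever do need explicit dummy columns, caret’s dummyVars function will build them; a minimal sketch:)

# Manual alternative: expand the factor predictors into dummy columns
dv <- dummyVars(performance_rating ~ ., data = hr_trn)
head(predict(dv, newdata = hr_trn)) # numeric predictor matrix with dummies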
After fitting your model, printing the fitted object will give an overview of the model selected, along with the optimal hyperparameter values chosen.
# Create fitted model
knn_fit1 <- train(performance_rating ~ .,
data = hr_trn,
method = "knn",
trControl = trn_Control,
preProcess = c("center", "scale"),
tuneLength = 20)
knn_fit1 # Overview

## k-Nearest Neighbors
##
## 203 samples
## 10 predictor
## 2 classes: 'Fail', 'Pass'
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 182, 182, 183, 183, 183, 182, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9099206 0.2284155
## 7 0.9033333 0.1178866
## 9 0.8966667 0.0000000
## 11 0.8966667 0.0000000
## 13 0.8966667 0.0000000
## 15 0.8966667 0.0000000
## 17 0.8966667 0.0000000
## 19 0.8966667 0.0000000
## 21 0.8966667 0.0000000
## 23 0.8966667 0.0000000
## 25 0.8966667 0.0000000
## 27 0.8966667 0.0000000
## 29 0.8966667 0.0000000
## 31 0.8966667 0.0000000
## 33 0.8966667 0.0000000
## 35 0.8966667 0.0000000
## 37 0.8966667 0.0000000
## 39 0.8966667 0.0000000
## 41 0.8966667 0.0000000
## 43 0.8966667 0.0000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
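The winning hyperparameter can also be pulled out of the fitted object programmatically:

knn_fit1$bestTune # one-row data frame holding the selected k (here, k = 5)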
We can also plot the hyperparameter tuning process:
ggplot(knn_fit1) # Plot hyperparameter tuning

We can also get a summary of the final model. This is kind of pointless for KNN, but it’s very useful for algorithms like random forests, where we might want to see what the final decision rules were:
knn_fit1$finalModel

## 5-nearest neighbor model
## Training set outcome distribution:
##
## Fail Pass
## 21 182
Now that we’ve fit our model, it’s time to evaluate its predictive performance. We need to generate predictions using our test set.

First we’ll generate probabilistic predictions, where we’re given a probability estimate for each outcome class.
# Probabilistic Preds first
knn_pred_prob1 <- predict(knn_fit1,
newdata = hr_test,
type="prob")
head(knn_pred_prob1)

## Fail Pass
## 1 0.0 1.0
## 2 0.0 1.0
## 3 0.2 0.8
## 4 0.0 1.0
## 5 0.2 0.8
## 6 0.2 0.8
Next we’ll generate “raw” predictions, or fitted values, where we get a simple class membership prediction rather than a probabilistic estimate.
# "Raw"/Fitted Preds Second
knn_pred_fitted1 <- predict(knn_fit1,
                            newdata = hr_test,
                            type = "raw")
head(knn_pred_fitted1)

## [1] Pass Pass Pass Pass Pass Pass
## Levels: Fail Pass
Next we’ll compare the predictions generated to the actual test set performance group values.
# Evaluate Predictive Performance upon Test Set
postResample(pred = knn_pred_fitted1,
             obs = hr_test$performance_rating)

## Accuracy Kappa
## 0.9186047 0.3384615
We see that when applied to our test set, our KNN model yields an accuracy of ~92%. Not bad!
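One caveat before celebrating: with an outcome this imbalanced, a “model” that always predicts Pass already scores highly. A quick sketch of that baseline (reported as the No Information Rate in the confusion matrix below):

# Baseline accuracy from always predicting the majority class ("Pass")
mean(hr_test$performance_rating == "Pass") # ~0.90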
We can also generate a confusion matrix:
# Confusion Matrix
confusionMatrix(data = knn_pred_fitted1,
                reference = hr_test$performance_rating)

## Confusion Matrix and Statistics
##
## Reference
## Prediction Fail Pass
## Fail 2 0
## Pass 7 77
##
## Accuracy : 0.9186
## 95% CI : (0.8395, 0.9666)
## No Information Rate : 0.8953
## P-Value [Acc > NIR] : 0.31087
##
## Kappa : 0.3385
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.22222
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.91667
## Prevalence : 0.10465
## Detection Rate : 0.02326
## Detection Prevalence : 0.02326
## Balanced Accuracy : 0.61111
##
## 'Positive' Class : Fail
##
It looks like our model does a really good job of predicting the high performers (Pass), but not so great a job with the low performers (Fail): sensitivity for the Fail class is only about 0.22, meaning we caught just 2 of the 9 actual Fails in the test set.
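If catching low performers matters more than overall accuracy, one simple (and purely illustrative) option is to classify from the probabilistic predictions with a lower cutoff for Fail; the 0.3 threshold below is an assumption, not a recommendation:

# Illustrative only: flag Fail whenever its predicted probability exceeds 0.3
pred_adjusted <- factor(ifelse(knn_pred_prob1$Fail > 0.3, "Fail", "Pass"),
                        levels = c("Fail", "Pass"))
confusionMatrix(data = pred_adjusted,
                reference = hr_test$performance_rating)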
It’s great that our model performs well on our test set, but now that we know that, we want to use our fitted model to predict new data. Fortunately, the process is exactly the same and simply involves calling R’s predict function again.
We don’t have future data, so let’s create a subset of our data and pretend it’s new.
hr1[1:50,] -> hr_new # subset pretending to be new data for tutorial
hr_new[-7] -> hr_new # Remove performance outcome from "new" data

Predicting performance classes for these “new” data points is simple!
predict(knn_fit1, newdata = hr_new) -> new_preds
head(new_preds)

## [1] Pass Pass Pass Pass Pass Pass
## Levels: Fail Pass
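From here you would typically attach the predictions back onto the new records, e.g.:

# Pair each "new" employee record with its predicted performance group
hr_new %>%
  mutate(predicted_rating = new_preds) %>%
  head()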
We’ve covered the following steps:

- Reading in and cleaning the HR data
- Exploratory data analysis and visualization
- Splitting the data into training and test sets with createDataPartition
- Setting up repeated k-fold cross-validation with trainControl
- Pre-processing, tuning, and training the KNN model with train
- Evaluating predictive performance with postResample and confusionMatrix
- Generating predictions for new data with predict
Thanks for reading! Please feel free to reach out with any questions.