Predicting Employee Performance with the KNN Algorithm

People Analytics Applications in R

K-Nearest Neighbors Classifier

This tutorial will illustrate how to use the K-Nearest Neighbors classifier in R using the caret package (Kuhn, 2021). In this example, our goal is to use a number of HR-related variables to predict which performance group new employees are likely to fall into.
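
As a quick refresher, KNN classifies a new observation by finding the k training observations closest to it (typically by Euclidean distance) and taking a majority vote of their classes. Here is a toy sketch of that idea using the class package, shown purely for intuition; the actual pipeline below uses caret:

library(class)

# Toy training data: two numeric predictors, binary outcome
train_x <- data.frame(x1 = c(1, 2, 8, 9),
                      x2 = c(1, 2, 8, 9))
train_y <- factor(c("Fail", "Fail", "Pass", "Pass"))

# Classify a new point by majority vote of its 3 nearest neighbors
knn(train = train_x,
    test = data.frame(x1 = 7, x2 = 8),
    cl = train_y,
    k = 3) # two of the three nearest neighbors are "Pass"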

NOTE: If you’re already familiar with EDA & data viz, you can skip ahead to the Modeling Pipeline section.


Data Import & Set-Up

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(readr)
library(readxl)
library(caret)
library(janitor)
library(DescTools)
library(RColorBrewer)
library(patchwork)
library(sjPlot)
library(GGally)

options(scipen = 999)

hr1 <- read_csv("HR_Sample_Data.csv") # Read in Data

hr1[-c(1,6,7,13)] -> hr1 # Remove predictors we don't need

hr1 %>% 
  map_lgl(is.character) -> char_cols # Identifies Character Variables

hr1[char_cols] %>% 
  map_df(as.factor) -> hr1[char_cols] # Converts to Factors

hr1$salary <- round(hr1$salary/1000, digits = 0) # Salary in Thousands Units

Exploratory Data Analysis

Before we begin the modeling process, we need to first take a look at our dataset to understand its structure, examine important characteristics such as data missingness, and identify any relevant trends or relationships.

Structure & Descriptive Statistics
view_df(hr1, show.type = T,
        show.na = T, 
        show.frq = T, 
        show.prc = T, 
        show.values = T)
Data frame: hr1

ID  Name                Type         Missings   Values                  Freq.   %
1   salary              numeric      0 (0.00%)  range: 45-220
2   gender              categorical  0 (0.00%)  F                        163    56.40
                                                M                        126    43.60
3   marital_desc        categorical  0 (0.00%)  Divorced                  29    10.03
                                                Married                  114    39.45
                                                Separated                 12     4.15
                                                Single                   126    43.60
                                                Widowed                    8     2.77
4   hispanic_latino     categorical  0 (0.00%)  No                       263    91.00
                                                Yes                       26     9.00
5   emp_status          categorical  0 (0.00%)  Active                   191    66.09
                                                Terminated for Cause      14     4.84
                                                Voluntarily Terminated    84    29.07
6   department          categorical  0 (0.00%)  IT/IS                     50    17.30
                                                Production               208    71.97
                                                Sales                     31    10.73
7   performance_rating  categorical  0 (0.00%)  Fail                      30    10.38
                                                Pass                     259    89.62
8   engage_survey       numeric      0 (0.00%)  range: 1.1-5.0
9   emp_satisfaction    numeric      0 (0.00%)  range: 1-5
10  absences            numeric      0 (0.00%)  range: 1-20
11  tenure_yr           numeric      0 (0.00%)  range: 0-17
summary(hr1)
##      salary       gender     marital_desc hispanic_latino
##  Min.   : 45.00   F:163   Divorced : 29   No :263        
##  1st Qu.: 55.00   M:126   Married  :114   Yes: 26        
##  Median : 62.00           Separated: 12                  
##  Mean   : 67.26           Single   :126                  
##  3rd Qu.: 70.00           Widowed  :  8                  
##  Max.   :220.00                                          
##                   emp_status       department  performance_rating
##  Active                :191   IT/IS     : 50   Fail: 30          
##  Terminated for Cause  : 14   Production:208   Pass:259          
##  Voluntarily Terminated: 84   Sales     : 31                     
##                                                                  
##                                                                  
##                                                                  
##  engage_survey  emp_satisfaction    absences       tenure_yr    
##  Min.   :1.12   Min.   :1.0      Min.   : 1.00   Min.   : 0.00  
##  1st Qu.:3.66   1st Qu.:3.0      1st Qu.: 5.00   1st Qu.: 5.00  
##  Median :4.28   Median :4.0      Median :11.00   Median : 8.00  
##  Mean   :4.10   Mean   :3.9      Mean   :10.37   Mean   : 7.27  
##  3rd Qu.:4.70   3rd Qu.:5.0      3rd Qu.:15.00   3rd Qu.: 9.00  
##  Max.   :5.00   Max.   :5.0      Max.   :20.00   Max.   :17.00

Some important characteristics to look for:

- Missingness: none of our variables have missing values, so no imputation is needed.
- Variable types: the character variables have been converted to factors; everything else is numeric.
- Class balance: the outcome is imbalanced (259 Pass vs. 30 Fail), which will matter when we interpret accuracy later on.
- Scale: the numeric predictors sit on very different scales (e.g., salary ranges 45-220 while emp_satisfaction ranges 1-5), which is why we'll center and scale before running KNN.


Data Visualization
Univariate Plots

The code below selects all the numeric variables in your dataset, pivots them to long form, and then creates a ggplot faceted by variable. I highly recommend running the code one line at a time to see how it works. Note that geom_density() can be swapped for another geom; a histogram variant follows the block below.

hr1 %>%
  select_if(is.numeric) %>% 
  pivot_longer(names_to = "variable",
               values_to = "value",
               cols = 1:5) %>% 
  ggplot(aes(x = value,
             color = variable,
             fill = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_density(alpha = 0.5)
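
For example, here is the same block with geom_histogram() swapped in (an extra variant, not in the original output):

hr1 %>%
  select_if(is.numeric) %>% 
  pivot_longer(names_to = "variable",
               values_to = "value",
               cols = everything()) %>% 
  ggplot(aes(x = value, fill = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_histogram(bins = 20, alpha = 0.7) # histograms instead of density curves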

Multivariate Plots

For visualizing relationships among our variables, the GGally package has a fantastic function called ggpairs(), shown below.

hr1 %>% 
  select(c(salary,
           engage_survey,
           emp_satisfaction,
           performance_rating)) %>% 
  ggpairs(aes(colour = performance_rating, alpha = 0.4))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also recreate the same plots ourselves:

ggplot(data = hr1, 
       aes(x = engage_survey,
           y = salary,
           color = performance_rating)) +
  geom_jitter(size = 0.9) + # jittered points; adding geom_point() too would plot every point twice
  scale_color_brewer(palette = "Set1") +
  theme_bw() + 
  labs(title = "Salary & Engagement Level by Performance Rating")

ggplot(data = hr1, 
       aes(x = engage_survey,
           y = salary,
           color = performance_rating)) +
  geom_jitter(size = 0.9) +
  facet_wrap(~performance_rating) + # one panel per performance group
  scale_color_brewer(palette = "Set1") +
  theme_bw() + 
  labs(title = "Salary & Engagement Level by Performance Rating")

ggplot(data = hr1, 
       aes(x = department,
           fill = performance_rating)) +
  geom_bar(alpha = 0.7) +
  theme_bw() +
  labs(title = "Stacked Bar Chart of Performance Group by Department")

ggplot(data = hr1, 
       aes(x = salary,
           fill = performance_rating)) +
  geom_boxplot(alpha = 0.5) +
  theme_bw() +
  labs(title = "Boxplot of Salaries by Performance Group")


Modeling Pipeline

With the caret package, there are a TON of customization options for your modeling pipeline. We’re going to stick to the simpler, more generalizable pipeline methods, but just be aware there are many ways you can tweak this as desired.


Partition Data into Train & Test

We’ll start by using caret’s createDataPartition() to split our main dataset into a training set and a testing set.

set.seed(825) # Set Seed

# Create index for split
## Specify outcome Y, split ratio, and say list = FALSE

index_trn <- createDataPartition(y = hr1$performance_rating, 
                                 p = .7, 
                                 list = FALSE,
                                 times = 1)

# Use index to create train & test subsets
hr_trn <- hr1[index_trn,]
hr_test <- hr1[-index_trn,]
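
Because createDataPartition() performs a stratified split on the outcome, both subsets should preserve the roughly 90/10 Pass/Fail ratio. A quick sanity check (an extra step, not part of the original output):

# Verify the outcome distribution is similar in both subsets
prop.table(table(hr_trn$performance_rating))
prop.table(table(hr_test$performance_rating))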

Hyperparameter Tuning

We use trainControl() to specify our hyperparameter tuning method. There are a LOT of different options, but a simple default is to use repeated k-fold cross-validation.

For the KNN model, we need to tune for the optimal value of k. We’ll do so using accuracy-based 10-fold cross-validation repeated 3 times.

# Specify method, number of folds, and number of repeats
trn_Control <- trainControl(method = "repeatedcv",
                            number = 10,
                            repeats = 3)
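
If you’d prefer to tune on a metric better suited to imbalanced classes, trainControl() can also compute class probabilities and ROC-based summaries. A sketch using caret’s built-in twoClassSummary (you would then pass metric = "ROC" to train()):

# Alternative: tune on ROC AUC instead of accuracy
trn_Control_roc <- trainControl(method = "repeatedcv",
                                number = 10,
                                repeats = 3,
                                classProbs = TRUE, # needed for ROC-based metrics
                                summaryFunction = twoClassSummary)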

Data Pre-Processing

We have a couple of data pre-processing steps we need to carry out. Because KNN is distance-based, our numeric predictors must be centered and scaled so that large-scale variables like salary don’t dominate the distance calculation, and our categorical predictors need to be converted to dummy variables.

Good news! The caret package allows us to pre-process our data, tune our hyperparameters, and train our model all in the same step!

We’ll see how this works below in the model training section.
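
For reference, here is what centering and scaling looks like as a standalone step with caret’s preProcess() function (equivalent in spirit to the preProcess argument used inside train() below; preProcess() leaves factor columns untouched):

# Learn centering/scaling parameters from the TRAINING data only
pp <- preProcess(hr_trn, method = c("center", "scale"))

hr_trn_scaled  <- predict(pp, hr_trn)  # apply to training set
hr_test_scaled <- predict(pp, hr_test) # apply the SAME transformation to the test set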


Model Training

We use caret’s train function to create our fitted model.

IMPORTANT: Using formula notation leads caret to automatically convert your factor variables to dummies internally, so we don’t need to do anything to explicitly convert categorical variables to dummies.

After fitting your model, printing the fitted object will give an overview of the model selected, along with the optimal hyperparameter values chosen.

# Create fitted model
knn_fit1 <- train(performance_rating ~ ., 
                  data = hr_trn,
                  method = "knn",
                  trControl = trn_Control,
                  preProcess = c("center", "scale"),
                  tuneLength = 20)

knn_fit1 # Overview
## k-Nearest Neighbors 
## 
## 203 samples
##  10 predictor
##   2 classes: 'Fail', 'Pass' 
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 182, 182, 183, 183, 183, 182, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9099206  0.2284155
##    7  0.9033333  0.1178866
##    9  0.8966667  0.0000000
##   11  0.8966667  0.0000000
##   13  0.8966667  0.0000000
##   15  0.8966667  0.0000000
##   17  0.8966667  0.0000000
##   19  0.8966667  0.0000000
##   21  0.8966667  0.0000000
##   23  0.8966667  0.0000000
##   25  0.8966667  0.0000000
##   27  0.8966667  0.0000000
##   29  0.8966667  0.0000000
##   31  0.8966667  0.0000000
##   33  0.8966667  0.0000000
##   35  0.8966667  0.0000000
##   37  0.8966667  0.0000000
##   39  0.8966667  0.0000000
##   41  0.8966667  0.0000000
##   43  0.8966667  0.0000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

We can also plot the hyperparameter tuning process:

ggplot(knn_fit1) # Plot Hyperparameter tuning 
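
If you want explicit control over which k values are evaluated (rather than letting tuneLength choose them), you can pass a grid via train()'s tuneGrid argument. A sketch:

# Explicitly specify the candidate values of k
knn_grid <- expand.grid(k = seq(3, 41, by = 2))

knn_fit2 <- train(performance_rating ~ ., 
                  data = hr_trn,
                  method = "knn",
                  trControl = trn_Control,
                  preProcess = c("center", "scale"),
                  tuneGrid = knn_grid)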

We can also get a summary of the final model. This is kind of pointless for KNN, but it’s very useful for algorithms like random forests, where we might want to see what the final decision rules were:

knn_fit1$finalModel
## 5-nearest neighbor model
## Training set outcome distribution:
## 
## Fail Pass 
##   21  182

Evaluating Predictive Performance

Now that we’ve fit our model, it’s time to evaluate its predictive performance. First, we need to generate predictions using our test set.

Generating Predictions for Test Set

First we’ll generate probabilistic predictions, where we’re given a probability estimate for each outcome class.

# Probabilistic Preds first
knn_pred_prob1 <- predict(knn_fit1,
                          newdata = hr_test,
                          type="prob")

head(knn_pred_prob1)
##   Fail Pass
## 1  0.0  1.0
## 2  0.0  1.0
## 3  0.2  0.8
## 4  0.0  1.0
## 5  0.2  0.8
## 6  0.2  0.8
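
These probabilities also let us apply a custom decision threshold instead of the default majority vote. With so few Fail cases, lowering the cutoff for predicting Fail trades some accuracy for better sensitivity; the 0.2 cutoff below is purely illustrative:

# Flag an employee as "Fail" if at least 20% of their neighbors failed
knn_pred_custom <- factor(ifelse(knn_pred_prob1$Fail >= 0.2, "Fail", "Pass"),
                          levels = c("Fail", "Pass"))

head(knn_pred_custom)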

Next we’ll generate “raw” predictions, or fitted values, where we get a simple class membership prediction rather than a probabilistic estimate.

# "Raw"/Fitted Preds Second
knn_pred_fitted1<-predict(knn_fit1, 
                          newdata=hr_test,
                          type = "raw")

head(knn_pred_fitted1)
## [1] Pass Pass Pass Pass Pass Pass
## Levels: Fail Pass

Evaluating Performance

Next we’ll compare the predictions generated to the actual test set performance group values.

# Evaluate Predictive Performance upon Test Set
postResample(pred = knn_pred_fitted1, 
             obs = hr_test$performance_rating)
##  Accuracy     Kappa 
## 0.9186047 0.3384615

We see that when used on our testing set, our KNN model yields an accuracy of ~92%. Not bad! Keep in mind, though, that ~90% of employees are in the Pass group (the No Information Rate below), so a model that always predicted Pass would score nearly as well; the Kappa of ~0.34 and the confusion matrix tell a fuller story.

We can also generate a confusion matrix:

# Confusion Matrix
confusionMatrix(data = knn_pred_fitted1,
                reference = hr_test$performance_rating)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fail Pass
##       Fail    2    0
##       Pass    7   77
##                                           
##                Accuracy : 0.9186          
##                  95% CI : (0.8395, 0.9666)
##     No Information Rate : 0.8953          
##     P-Value [Acc > NIR] : 0.31087         
##                                           
##                   Kappa : 0.3385          
##                                           
##  Mcnemar's Test P-Value : 0.02334         
##                                           
##             Sensitivity : 0.22222         
##             Specificity : 1.00000         
##          Pos Pred Value : 1.00000         
##          Neg Pred Value : 0.91667         
##              Prevalence : 0.10465         
##          Detection Rate : 0.02326         
##    Detection Prevalence : 0.02326         
##       Balanced Accuracy : 0.61111         
##                                           
##        'Positive' Class : Fail            
## 

It looks like our model does a really good job of identifying the Pass group, but not so great with the Fail group: sensitivity for the Fail class is only 0.22, meaning we caught just 2 of the 9 failing employees in the test set.
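
One common remedy for this kind of class imbalance is to resample the training data within each fold, which recent versions of caret support through trainControl()'s sampling argument. A sketch (assuming a caret version with this option):

# Down-sample the majority (Pass) class during resampling
trn_Control_down <- trainControl(method = "repeatedcv",
                                 number = 10,
                                 repeats = 3,
                                 sampling = "down")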


Generating Predictions for Future Data

It’s great that our model performs well on our test set, but now that we know that, we want to use our fitted model to predict new data. Fortunately, the process is exactly the same, and simply involves using R’s predict function again.

We don’t have future data, so let’s create a subset of our data and pretend it’s new.

hr1[1:50,] -> hr_new   # subset pretending to be new data for tutorial

hr_new[-7] -> hr_new # Remove performance outcome from "new" data

Predicting performance classes for these “new” data points is simple!

predict(knn_fit1, newdata = hr_new) -> new_preds

head(new_preds)
## [1] Pass Pass Pass Pass Pass Pass
## Levels: Fail Pass
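
In practice, you’d likely attach these predictions back to the employee records; a small usage example:

# Attach predicted performance group to the "new" employee data
hr_new %>% 
  mutate(pred_performance = new_preds) -> hr_new_scored

head(hr_new_scored$pred_performance)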

Recap

We’ve covered the following steps:

- Importing and setting up the data
- Exploratory data analysis and visualization
- Partitioning the data into training and testing sets
- Specifying hyperparameter tuning with repeated k-fold cross-validation
- Pre-processing (centering and scaling) inside the training step
- Training the KNN model and inspecting the tuning results
- Evaluating predictive performance on the test set
- Generating predictions for new data


Thanks for reading! Please feel free to reach out with any questions.