kMeans Clustering with HR-related Data

Introduction

In this presentation and example, I demonstrate how kMeans clustering can be applied to an HR-related data set to find clusters of employees based on pay and age. Although pay and age may not be directly related, plotting variables/features like these against each other can reveal new insights or confirm what you already know.

Background and Data Set

The data set, developed by Dr. Carla Patalano and Dr. Rich Huebner, contains a variety of HR-related information, including fictitious employee names, pay rates, age, gender, race/ethnicity, performance scores, and the like.

We published the data set on Kaggle, where it is available: https://www.kaggle.com/rhuebner/human-resources-data-set/data

If you want to download the data, the most current version of the data set is: HRDataset_v9.csv

Step-by-Step Approach

1. Load data.

This data has already been cleaned and tidied, so no additional work or transforms are needed for the kMeans analysis.

library(ggplot2)
library(cluster)
# To replicate this study, change this to your own working directory and keep the same seed.
setwd("D:/Data/kMeans_hr")
set.seed(02045)   # R reads this as 2045

# Grab the data.
hr <- read.csv('HRDataset_v9.csv', sep=",")

2. Get a subset of the HR dataset.

The subset keeps only the numerically coded fields. We don’t need every feature/field for this analysis, so we subset the data so that the data frame contains only the features we need: fields #3 through #10. Working with fewer columns also makes the algorithm run more efficiently.

# First, subset the data.
# MarriedID, MaritalStatusID, GenderID, EmpStatus_ID, DeptID, Perf_ScoreID, Age, Pay.Rate
hr2 <- hr[ , c(3:10)]  # Grab the 3rd through 10th feature/column.

head(hr2)

##   MarriedID MaritalStatusID GenderID EmpStatus_ID DeptID Perf_ScoreID Age
## 1         1               1        0            1      1            3  30
## 2         0               2        1            1      1            3  34
## 3         0               0        1            1      1            3  31
## 4         1               1        0            1      1            9  32
## 5         0               0        0            1      1            9  30
## 6         1               1        0            5      1            3  30
##   Pay.Rate
## 1    28.50
## 2    23.00
## 3    29.00
## 4    21.50
## 5    16.56
## 6    20.50
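
Selecting columns by position works, but it silently picks the wrong fields if the column order of the file ever changes. As a minimal sketch, the same kind of subset can be taken by name and then checked for the two things kmeans() needs: numeric columns and no missing values. The data frame `hr_demo` and its values are made up for illustration; only the column names come from the data set.

```r
# Hypothetical stand-in for the full data set (made-up rows, fewer columns).
hr_demo <- data.frame(
  Employee.Name = c("A", "B", "C"),
  GenderID      = c(0, 1, 1),
  Age           = c(30, 34, 31),
  Pay.Rate      = c(28.50, 23.00, 29.00)
)

# Select the features by name instead of by position.
keep   <- c("GenderID", "Age", "Pay.Rate")
hr_sub <- hr_demo[, keep]

# kmeans() needs numeric columns and no missing values.
all_numeric <- all(sapply(hr_sub, is.numeric))
has_missing <- anyNA(hr_sub)

names(hr_sub)
all_numeric
has_missing
```

With the real data, `keep` would list the eight column names shown in the head(hr2) output.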

3. Apply the kMeans algorithm to the subset of data.

In this case, I selected 4 clusters. In kMeans clustering, we generally pre-select the number of clusters that we want for our data. For this dataset, we are aiming to see four different clusters (or types) of employees who share similar characteristics.

With the number of clusters fixed at k, k-means clustering can be defined as an optimization problem: find k cluster centers and assign each object to its nearest center, so that the sum of squared distances from each object to its assigned center is minimized.
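
To make that objective concrete, the sketch below fits kmeans() to made-up two-feature data and recomputes the quantity being minimized by hand: the sum of squared Euclidean distances from each point to its assigned center. The data and the object names (`x`, `fit`, `manual_wss`) are assumptions for illustration and are not part of the HR analysis.

```r
set.seed(1)

# Made-up data: two loose groups of 20 points in two dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

fit <- kmeans(x, centers = 2, nstart = 10)

# Look up the center assigned to each point, one row per point.
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]

# Sum of squared distances from every point to its own center.
manual_wss <- sum((x - assigned_centers)^2)

# Compare with the value kmeans() reports.
all.equal(manual_wss, fit$tot.withinss)
```

The comparison should agree up to floating-point tolerance, since `tot.withinss` is the sum of the per-cluster values printed under "Within cluster sum of squares by cluster".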

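Keep in mind that kMeans is a distance-based algorithm: it assigns each point to the nearest center by Euclidean distance, so features measured on larger scales have more influence on the result. Also, the x- and y-axes in the plot should be based on continuous variables.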

hrCluster <- kmeans(hr2, 4, iter.max = 20, nstart = 20)

# This will display the results of the kMeans clustering algorithm. It includes
# cluster means for the 4 clusters, each cluster size, a vector containing the
# cluster that each data point belongs to, and the within cluster sum of squares.
hrCluster

## K-means clustering with 4 clusters of sizes 145, 59, 69, 37
## 
## Cluster means:
##   MarriedID MaritalStatusID  GenderID EmpStatus_ID   DeptID Perf_ScoreID
## 1 0.3931034       0.7862069 0.4000000     2.482759 4.682759     3.648276
## 2 0.4576271       0.6949153 0.5084746     1.813559 4.186441     3.355932
## 3 0.3768116       0.9855072 0.3333333     2.971014 4.869565     3.275362
## 4 0.3513514       0.7567568 0.5945946     1.918919 4.486486     3.216216
##        Age Pay.Rate
## 1 34.06897 21.77828
## 2 32.50847 50.39915
## 3 49.53623 21.43536
## 4 47.91892 56.42838
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 2 2 1 1 4 4 2 4 2 1 2 1 2 2 2 2 2 3 1 4 4 3 2 4 1 3 1 3 2
##  [36] 2 2 2 2 1 2 1 4 4 2 4 4 4 2 2 4 4 2 4 4 2 4 2 4 2 2 2 2 4 4 4 3 1 1 1
##  [71] 1 1 3 3 3 1 1 3 1 3 3 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1
## [106] 1 3 1 1 1 1 1 1 1 3 3 1 1 3 1 3 1 1 1 3 3 1 1 3 3 3 3 1 3 3 3 1 1 1 1
## [141] 1 1 3 3 3 1 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 3 3 1 1 3 3 1 1
## [176] 3 1 3 3 1 3 1 3 3 3 1 1 1 3 3 3 1 3 1 1 1 1 1 1 3 3 1 1 1 3 1 3 1 1 1
## [211] 1 1 1 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 3 1 3 1 3
## [246] 1 1 1 1 3 1 1 3 3 3 1 3 3 1 1 4 4 4 2 2 2 4 4 4 2 4 4 4 2 2 4 2 4 2 2
## [281] 2 2 4 2 4 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 3 4 2 2 4 4 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 7414.319 3983.894 5191.887 3141.162
##  (between_SS / total_SS =  80.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
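
In the cluster means above, the clusters differ mostly in Age and Pay.Rate. That is expected, because kmeans() computes Euclidean distances on the raw values, so columns with a wide numeric range outweigh the 0/1 ID columns. One common option, which this analysis does not use, is to standardize the columns with scale() before clustering. A minimal sketch on made-up data (the `demo` data frame and its columns are hypothetical):

```r
set.seed(1)

# Made-up data: a 0/1 flag and a pay column on a much larger scale.
demo <- data.frame(
  Flag = rep(c(0, 1), each = 10),
  Pay  = c(rnorm(10, mean = 20, sd = 2), rnorm(10, mean = 55, sd = 2))
)

# Standardize each column to mean 0 and standard deviation 1.
scaled <- scale(demo)

col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)

# Cluster on the standardized values so both columns count equally.
fit_scaled <- kmeans(scaled, centers = 2, nstart = 10)
fit_scaled$size
```

Whether to scale is a judgment call: here the unscaled result is easy to read precisely because age and pay dominate, which is what the plot in this walkthrough is about.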

4. Examine a table of clusters.

# This is a table of clusters per Employee Gender. 0 = Female, 1 = Male
table(hrCluster$cluster, hr2$GenderID)

##    
##      0  1
##   1 87 58
##   2 29 30
##   3 46 23
##   4 15 22

Cluster 3 contains half as many males (23) as females (46). Overall, the data set also contains more females (177) than males (133).
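
Raw counts are hard to compare across clusters of different sizes, so it can help to convert each row to proportions. The sketch below re-enters the counts from the gender table above and applies prop.table(); the object name `counts` is an assumption for illustration.

```r
# Counts copied from the gender table above (rows = cluster, columns = GenderID).
counts <- matrix(
  c(87, 58,
    29, 30,
    46, 23,
    15, 22),
  ncol = 2, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3", "4"), GenderID = c("0", "1"))
)

# Share of each gender within each cluster; every row sums to 1.
row_share <- prop.table(counts, margin = 1)
round(row_share, 2)

# Totals per gender across all clusters.
colSums(counts)
```

With the objects from this walkthrough, the same proportions would come from prop.table(table(hrCluster$cluster, hr2$GenderID), margin = 1).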

We could also run a table for the different departments and performance scores.

# Departments
table(hrCluster$cluster, hr2$DeptID)

##    
##       1   2   3   4   5   6
##   1   8   0   7   0 130   0
##   2   2   0  24   9   7  17
##   3   0   0   4   1  64   0
##   4   0   1  15   0   7  14

# Performance score.
table(hrCluster$cluster, hr2$Perf_ScoreID)

##    
##      0  1  2  3  4  5  9
##   1 15  4  8 81 11  3 23
##   2  8  1  4 32  5  2  7
##   3  6  3  1 42 10  2  5
##   4  2  1  2 26  2  2  2

In every cluster, the most common performance score is “3”, which means Fully Meets Expectations.

Next, let’s take a look at the cluster centers.

hrCluster$centers

##   MarriedID MaritalStatusID  GenderID EmpStatus_ID   DeptID Perf_ScoreID
## 1 0.3931034       0.7862069 0.4000000     2.482759 4.682759     3.648276
## 2 0.4576271       0.6949153 0.5084746     1.813559 4.186441     3.355932
## 3 0.3768116       0.9855072 0.3333333     2.971014 4.869565     3.275362
## 4 0.3513514       0.7567568 0.5945946     1.918919 4.486486     3.216216
##        Age Pay.Rate
## 1 34.06897 21.77828
## 2 32.50847 50.39915
## 3 49.53623 21.43536
## 4 47.91892 56.42838

hrCluster$betweenss

## [1] 81439.95
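
The between-cluster sum of squares is easier to interpret as a share of the total. As a worked check, the sketch below reproduces the "between_SS / total_SS" percentage from the kmeans() printout using only the numbers shown in this walkthrough; the variable names are assumptions for illustration.

```r
# Values copied from the kmeans() output in this walkthrough.
withinss  <- c(7414.319, 3983.894, 5191.887, 3141.162)
betweenss <- 81439.95

# Total sum of squares = between-cluster part + within-cluster part.
totss <- betweenss + sum(withinss)

# Share of the total variation explained by the cluster assignment, in percent.
pct_between <- round(100 * betweenss / totss, 1)
pct_between
```

With the fitted object, the same ratio is hrCluster$betweenss / hrCluster$totss.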

5. Results and Plot

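# Convert the cluster assignments to a factor so ggplot2 treats them as discrete groups.
hrCluster$cluster <- as.factor(hrCluster$cluster)
# Pay rate on the x-axis, age on the y-axis, colored by cluster.
plot1 <- ggplot(hr2, aes(Pay.Rate, Age, color = hrCluster$cluster))
# p1 <- plot1 + geom_point() + stat_density2d(aes(alpha=..density..), geom="raster", contour=FALSE)
# Hexagonal bins show where the data points are densest.
p1 <- plot1 + geom_hex()
p1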

6. Interpretation and Further Work

Due to its popularity, I chose kMeans for this analysis. The results show several clusters of employees. Most notably, the clusters separate above and below the $40 per hour mark, and above and below 40 years of age. The cluster centers reflect this: two clusters have a low average pay rate (about $21 to $22 per hour) and two have a high one (about $50 to $56 per hour), and within each pay level one cluster is younger (average age about 32 to 34) and one is older (about 48 to 50). There is a high density of data points in the low wage/low age cluster (bottom left of the chart), which is also the largest cluster, with 145 of the 310 employees.

By plotting the clusters, we can easily see how employees can be grouped by similar characteristics.

This is a fairly basic/exploratory kMeans analysis. Further exploration can be done on this data set by using other clustering algorithms or by changing the number of clusters that we start with.
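
One common way to explore the number of clusters is the "elbow" heuristic: run kmeans() for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses made-up data with three well-separated groups; the data and the names `ks` and `wss` are assumptions, and this was not part of the original analysis.

```r
set.seed(1)

# Made-up data: three groups of 30 points in two dimensions.
x <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1 through 6.
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

data.frame(k = ks, wss = round(wss, 1))

# Uncomment to draw the elbow curve:
# plot(ks, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

For the HR data, `x` would be replaced by hr2, and the value of k chosen this way could then be compared against the four clusters used here.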

Further Information

Dr. Rich Huebner is a Principal Data Architect for Houghton Mifflin Harcourt (HMH) and is based in Boston, Massachusetts. Dr. Huebner also teaches graduate and doctoral courses for several colleges and universities.

Please contact Rich.Huebner@yahoo.com for further information about the HR data set.

Thank you!