This is a mini data science project, done in collaboration with Lim Lee Wen, based on the Human Resources Data Set by Dr. Rich at https://www.kaggle.com/rhuebner/human-resources-data-set

Introduction

To reduce loss of talent, the HR department can seek to understand its active employees by segmenting them into groups. The members of each group are similar to one another, yet dissimilar to members of other groups. By doing so, the company can gain insights into how employees resemble and differ from one another, and devise strategies for engaging each group.

library(gower)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(clustertend)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Make model repeatable
set.seed(315)

# Load data and select only active employees
df_active <- readRDS('HRDataset_v14-cleaned.rds') %>%
  filter(
    Termd == 0
  )

row.names(df_active) <- df_active$EmpID

str(df_active)
## 'data.frame':    207 obs. of  38 variables:
##  $ Employee_Name             : chr  "Adinolfi, Wilson  K" "Alagbe,Trina" "Anderson, Linda  " "Andreola, Colby" ...
##  $ EmpID                     : chr  "10026" "10088" "10002" "10194" ...
##  $ MarriedID                 : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ MaritalStatusID           : int  0 1 0 0 4 0 2 2 0 1 ...
##  $ GenderID                  : int  1 0 0 0 1 0 1 1 1 0 ...
##  $ EmpStatusID               : int  1 1 1 1 1 3 1 1 1 2 ...
##  $ DeptID                    : int  5 5 5 4 5 5 3 3 5 5 ...
##  $ PerfScoreID               : int  4 3 4 3 3 3 3 4 3 4 ...
##  $ FromDiversityJobFairID    : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Salary                    : int  62506 64991 57568 95660 59365 47837 50178 92328 58709 70131 ...
##  $ Termd                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PositionID                : int  19 19 19 24 19 19 14 9 19 20 ...
##  $ Position                  : Factor w/ 29 levels "Accountant I",..: 20 20 20 25 20 20 15 7 20 21 ...
##  $ State                     : Factor w/ 28 levels "AL","AZ","CA",..: 11 11 11 11 11 11 11 24 11 11 ...
##  $ Zip                       : chr  "1960" "1886" "1844" "2110" ...
##  $ DOB                       : Date, format: "1983-07-10" "1988-09-27" ...
##  $ Sex                       : Factor w/ 2 levels "F","M ": 2 1 1 1 2 1 2 2 2 1 ...
##  $ MaritalDesc               : Factor w/ 5 levels "Divorced","Married",..: 4 2 4 4 5 4 1 1 4 2 ...
##  $ CitizenDesc               : Factor w/ 3 levels "Eligible NonCitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ HispanicLatino            : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RaceDesc                  : Factor w/ 6 levels "American Indian or Alaska Native",..: 6 6 6 6 6 3 6 3 5 6 ...
##  $ DateofHire                : Date, format: "2011-07-05" "2008-01-07" ...
##  $ DateofTermination         : Date, format: NA NA ...
##  $ TermReason                : Factor w/ 18 levels "Another position",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ EmploymentStatus          : Factor w/ 3 levels "Active","Terminated for Cause",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Department                : Factor w/ 6 levels "Admin Offices",..: 4 4 4 6 4 4 3 3 4 4 ...
##  $ ManagerName               : Factor w/ 21 levels "Alex Sweetwater",..: 18 9 2 1 15 5 19 20 14 14 ...
##  $ ManagerID                 : num  22 16 11 10 19 12 7 4 18 18 ...
##  $ RecruitmentSource         : Factor w/ 9 levels "CareerBuilder",..: 6 5 6 6 3 2 5 2 4 3 ...
##  $ PerformanceScore          : Factor w/ 4 levels "Exceeds","Fully Meets",..: 1 2 1 2 2 2 2 1 2 1 ...
##  $ EngagementSurvey          : num  4.6 4.84 5 3.04 5 4.46 5 4.28 4.6 4.4 ...
##  $ EmpSatisfaction           : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 3 4 3 5 4 4 3 ...
##  $ SpecialProjectsCount      : int  0 0 0 4 0 0 6 5 0 0 ...
##  $ LastPerformanceReview_Date: Date, format: "2019-01-17" "2019-01-03" ...
##  $ DaysLateLast30            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Absences                  : int  1 15 15 19 19 4 16 9 7 16 ...
##  $ TempYrOfService           : int  9 13 9 6 7 11 6 6 9 4 ...
##  $ TempAge                   : int  37 32 44 42 38 51 33 32 37 55 ...
head(df_active)
##             Employee_Name EmpID MarriedID MaritalStatusID GenderID EmpStatusID
## 10026 Adinolfi, Wilson  K 10026         0               0        1           1
## 10088        Alagbe,Trina 10088         1               1        0           1
## 10002   Anderson, Linda   10002         0               0        0           1
## 10194     Andreola, Colby 10194         0               0        0           1
## 10062         Athwal, Sam 10062         0               4        1           1
## 10114    Bachiochi, Linda 10114         0               0        0           3
##       DeptID PerfScoreID FromDiversityJobFairID Salary Termd PositionID
## 10026      5           4                      0  62506     0         19
## 10088      5           3                      0  64991     0         19
## 10002      5           4                      0  57568     0         19
## 10194      4           3                      0  95660     0         24
## 10062      5           3                      0  59365     0         19
## 10114      5           3                      1  47837     0         19
##                      Position State  Zip        DOB Sex MaritalDesc CitizenDesc
## 10026 Production Technician I    MA 1960 1983-07-10  M       Single  US Citizen
## 10088 Production Technician I    MA 1886 1988-09-27   F     Married  US Citizen
## 10002 Production Technician I    MA 1844 1977-05-22   F      Single  US Citizen
## 10194       Software Engineer    MA 2110 1979-05-24   F      Single  US Citizen
## 10062 Production Technician I    MA 2199 1983-02-18  M      Widowed  US Citizen
## 10114 Production Technician I    MA 1902 1970-02-11   F      Single  US Citizen
##       HispanicLatino                  RaceDesc DateofHire DateofTermination
## 10026             No                     White 2011-07-05              <NA>
## 10088             No                     White 2008-01-07              <NA>
## 10002             No                     White 2012-01-09              <NA>
## 10194             No                     White 2014-11-10              <NA>
## 10062             No                     White 2013-09-30              <NA>
## 10114             No Black or African American 2009-07-06              <NA>
##              TermReason EmploymentStatus           Department     ManagerName
## 10026 N/A-StillEmployed           Active    Production         Michael Albert
## 10088 N/A-StillEmployed           Active    Production           Elijiah Gray
## 10002 N/A-StillEmployed           Active    Production               Amy Dunn
## 10194 N/A-StillEmployed           Active Software Engineering Alex Sweetwater
## 10062 N/A-StillEmployed           Active    Production          Ketsia Liebig
## 10114 N/A-StillEmployed           Active    Production         Brannon Miller
##       ManagerID  RecruitmentSource PerformanceScore EngagementSurvey
## 10026        22           LinkedIn          Exceeds             4.60
## 10088        16             Indeed      Fully Meets             4.84
## 10002        11           LinkedIn          Exceeds             5.00
## 10194        10           LinkedIn      Fully Meets             3.04
## 10062        19  Employee Referral      Fully Meets             5.00
## 10114        12 Diversity Job Fair      Fully Meets             4.46
##       EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date
## 10026               5                    0                 2019-01-17
## 10088               5                    0                 2019-01-03
## 10002               5                    0                 2019-01-07
## 10194               3                    4                 2019-01-02
## 10062               4                    0                 2019-02-25
## 10114               3                    0                 2019-01-25
##       DaysLateLast30 Absences TempYrOfService TempAge
## 10026              0        1               9      37
## 10088              0       15              13      32
## 10002              0       15               9      44
## 10194              0       19               6      42
## 10062              0       19               7      38
## 10114              0        4              11      51

Data Preparation

Select the variables to cluster on and standardize the numeric ones (mean 0, standard deviation 1) so that no variable dominates because of its units. The salary variable is log-transformed first, as it is highly skewed.

df_scaled <- as.data.frame(df_active)
df_scaled$Salary <- log10(df_scaled$Salary)

df_scaled <- df_scaled %>%
  dplyr::select(
    Salary,
    Position,
    State,
    Sex,
    MaritalDesc,
    CitizenDesc,
    HispanicLatino,
    RaceDesc,
    Department,
    ManagerName,
    RecruitmentSource,
    PerformanceScore,
    EngagementSurvey,
    EmpSatisfaction,
    SpecialProjectsCount,
    DaysLateLast30,
    Absences,
    TempAge,
    TempYrOfService
  )

num_cols <- colnames(df_scaled)[unlist(sapply(df_scaled, is.numeric))] 
df_scaled[num_cols] <- sapply(df_scaled[num_cols], scale)

str(df_scaled)
## 'data.frame':    207 obs. of  19 variables:
##  $ Salary              : num  -0.254 -0.119 -0.539 1.219 -0.433 ...
##  $ Position            : Factor w/ 29 levels "Accountant I",..: 20 20 20 25 20 20 15 7 20 21 ...
##  $ State               : Factor w/ 28 levels "AL","AZ","CA",..: 11 11 11 11 11 11 11 24 11 11 ...
##  $ Sex                 : Factor w/ 2 levels "F","M ": 2 1 1 1 2 1 2 2 2 1 ...
##  $ MaritalDesc         : Factor w/ 5 levels "Divorced","Married",..: 4 2 4 4 5 4 1 1 4 2 ...
##  $ CitizenDesc         : Factor w/ 3 levels "Eligible NonCitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ HispanicLatino      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RaceDesc            : Factor w/ 6 levels "American Indian or Alaska Native",..: 6 6 6 6 6 3 6 3 5 6 ...
##  $ Department          : Factor w/ 6 levels "Admin Offices",..: 4 4 4 6 4 4 3 3 4 4 ...
##  $ ManagerName         : Factor w/ 21 levels "Alex Sweetwater",..: 18 9 2 1 15 5 19 20 14 14 ...
##  $ RecruitmentSource   : Factor w/ 9 levels "CareerBuilder",..: 6 5 6 6 3 2 5 2 4 3 ...
##  $ PerformanceScore    : Factor w/ 4 levels "Exceeds","Fully Meets",..: 1 2 1 2 2 2 2 1 2 1 ...
##  $ EngagementSurvey    : num  0.615 0.922 1.126 -1.382 1.126 ...
##  $ EmpSatisfaction     : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 3 4 3 5 4 4 3 ...
##  $ SpecialProjectsCount: num  -0.578 -0.578 -0.578 1.001 -0.578 ...
##  $ DaysLateLast30      : num  -0.274 -0.274 -0.274 -0.274 -0.274 ...
##  $ Absences            : num  -1.51 0.884 0.884 1.568 1.568 ...
##  $ TempAge             : num  -0.5059 -1.1009 0.3271 0.0891 -0.3869 ...
##  $ TempYrOfService     : num  0.865 2.901 0.865 -0.661 -0.152 ...
head(df_scaled)
##           Salary                Position State Sex MaritalDesc CitizenDesc
## 10026 -0.2541211 Production Technician I    MA  M       Single  US Citizen
## 10088 -0.1191572 Production Technician I    MA   F     Married  US Citizen
## 10002 -0.5390151 Production Technician I    MA   F      Single  US Citizen
## 10194  1.2190195       Software Engineer    MA   F      Single  US Citizen
## 10062 -0.4326054 Production Technician I    MA  M      Widowed  US Citizen
## 10114 -1.1800341 Production Technician I    MA   F      Single  US Citizen
##       HispanicLatino                  RaceDesc           Department
## 10026             No                     White    Production       
## 10088             No                     White    Production       
## 10002             No                     White    Production       
## 10194             No                     White Software Engineering
## 10062             No                     White    Production       
## 10114             No Black or African American    Production       
##           ManagerName  RecruitmentSource PerformanceScore EngagementSurvey
## 10026  Michael Albert           LinkedIn          Exceeds        0.6145370
## 10088    Elijiah Gray             Indeed      Fully Meets        0.9216818
## 10002        Amy Dunn           LinkedIn          Exceeds        1.1264450
## 10194 Alex Sweetwater           LinkedIn      Fully Meets       -1.3819044
## 10062   Ketsia Liebig  Employee Referral      Fully Meets        1.1264450
## 10114  Brannon Miller Diversity Job Fair      Fully Meets        0.4353691
##       EmpSatisfaction SpecialProjectsCount DaysLateLast30   Absences
## 10026               5           -0.5779309     -0.2739534 -1.5104966
## 10088               5           -0.5779309     -0.2739534  0.8841528
## 10002               5           -0.5779309     -0.2739534  0.8841528
## 10194               3            1.0013653     -0.2739534  1.5683384
## 10062               4           -0.5779309     -0.2739534  1.5683384
## 10114               3           -0.5779309     -0.2739534 -0.9973574
##           TempAge TempYrOfService
## 10026 -0.50589473       0.8653762
## 10088 -1.10089592       2.9009771
## 10002  0.32710693       0.8653762
## 10194  0.08910646      -0.6613244
## 10062 -0.38689449      -0.1524242
## 10114  1.16010860       1.8831767
# Confirm there is no missing value
sum(is.na(df_scaled))
## [1] 0
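
The effect of the log transform and the scaling step can be illustrated on a small example. The salaries below are hypothetical values chosen to be skewed, not taken from the HR data set; this is a minimal sketch of the idea rather than a reproduction of the chunk above.

```r
# Hypothetical skewed salaries (illustrative values, not from the HR data set)
salary <- c(45000, 48000, 52000, 55000, 60000, 75000, 250000)

# On the raw scale one large salary pulls the mean far above the median;
# after log10 the mean and median are much closer together
raw_gap <- mean(salary) / median(salary)
log_gap <- mean(log10(salary)) / median(log10(salary))

# Standardize the logged values: subtract the mean, divide by the standard deviation
z <- as.numeric(scale(log10(salary)))

c(raw_gap = raw_gap, log_gap = log_gap)
round(z, 3)
```

After scaling, `z` has mean 0 and standard deviation 1, which puts it on the same footing as the other standardized numeric variables.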

Gower Distance

The dissimilarity matrix is first generated using the Gower distance. Gower is used because the data mixes numeric and categorical variables; Euclidean or Manhattan distance would only be suitable if all the variables were numeric.

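# Start from a 1-row data frame with one column per employee;
# further rows are added by the loop below
dist <- matrix(0, ncol = nrow(df_scaled))
dist <- as.data.frame(dist)

# Row i holds the Gower distance from employee i to every employee
for (i in seq_len(nrow(df_scaled))) {
  dist[i, ] <- gower_dist(df_scaled[i, ], df_scaled)
}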

saveRDS(dist, 'dist.rds')
dist_mat <- as.matrix(dist)
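
To make the Gower distance concrete: for each pair of records it averages one dissimilarity per variable, using the absolute difference divided by the variable's range for numeric columns and a 0/1 mismatch for categorical columns. The sketch below computes this by hand on a hypothetical three-row data frame and compares it with `daisy(metric = "gower")` from the `cluster` package, which is swapped in here for `gower_dist` only because it returns the whole matrix in one call. The helper `gower_pair` is an illustrative name, not part of any package.

```r
library(cluster)

# Hypothetical mixed-type records (not from the HR data set)
toy <- data.frame(
  salary = c(50000, 60000, 90000),
  dept   = factor(c("Production", "Production", "Sales")),
  sex    = factor(c("F", "M", "F"))
)

# Illustrative helper: Gower distance between rows i and j of df
gower_pair <- function(df, i, j) {
  parts <- sapply(df, function(col) {
    if (is.numeric(col)) {
      # numeric: absolute difference scaled by the column's range
      abs(col[i] - col[j]) / diff(range(col))
    } else {
      # categorical: 0 if the two values match, 1 otherwise
      as.numeric(as.character(col[i]) != as.character(col[j]))
    }
  })
  mean(parts)
}

d_manual <- gower_pair(toy, 1, 3)
d_daisy  <- as.matrix(daisy(toy, metric = "gower"))

d_manual
d_daisy[1, 3]
```

Rows 1 and 3 differ by the full salary range (1), differ in department (1) and share the same sex (0), so their distance is 2/3. Every Gower distance lies between 0 and 1.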

Clustering Tendency

The Hopkins statistic (Lawson and Jurs, 1990) and data visualization are used to assess whether the data has enough structure to be worth clustering. If the Hopkins statistic is close to 0.5, or the heat map of the distance matrix shows a random scatter of blue rather than distinct blocks, then there are no meaningful clusters.

The Hopkins statistic tests the hypothesis that the data is generated by a uniform random distribution. In this case the statistic is about 0.20, well away from 0.5, so the data appears cohesive enough for clustering.

hopkins(dist_mat, n = nrow(dist_mat) - 1)
## $H
## [1] 0.196678
fviz_dist(as.dist(dist), show_labels = FALSE)
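
The idea behind the Hopkins statistic can be sketched in base R. The function below is an illustrative re-implementation, not the `clustertend` code: it assumes the convention H = sum(w) / (sum(u) + sum(w)), where w are nearest-neighbour distances of sampled real points and u are nearest-neighbour distances of points drawn uniformly from the data's bounding box. Under that convention, values near 0.5 suggest no structure and values near 0 suggest clustered data. The name `hopkins_sketch` and the simulated data are assumptions made for the example.

```r
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)

  # w: distance from each of m sampled real points to its nearest other real point
  D <- as.matrix(stats::dist(X))
  diag(D) <- Inf
  idx <- sample(n, m)
  w <- apply(D[idx, , drop = FALSE], 1, min)

  # u: distance from each of m uniform random points to its nearest real point
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- matrix(runif(m * p, rep(lo, each = m), rep(hi, each = m)), nrow = m)
  u <- apply(U, 1, function(pt) min(sqrt(colSums((t(X) - pt)^2))))

  sum(w) / (sum(u) + sum(w))
}

set.seed(315)

# Two tight, well-separated blobs versus points spread uniformly
clustered <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 0.1), ncol = 2)
)
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 50)
h_uniform   <- hopkins_sketch(uniform, m = 50)

c(clustered = h_clustered, uniform = h_uniform)
```

The clustered data should give a value close to 0 and the uniform data a value close to 0.5, which is the contrast the interpretation above relies on.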

Optimum Number of Clusters

Three methods are used to find the best number of clusters: the Silhouette method, the Elbow method (within-cluster sum of squares, or WSS), and the Gap Statistic.

In this case, the Silhouette method suggests 2 clusters, the Elbow method suggests 3 clusters, and the Gap Statistic suggests 4 clusters. As a middle ground, 3 clusters are chosen, since that value fares reasonably well under all three methods.

fviz_nbclust(dist_mat, pam, method = 'silhouette')

fviz_nbclust(dist_mat, pam, method = 'wss')

fviz_nbclust(dist_mat, pam, method = 'gap_stat')
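
The silhouette criterion can also be computed directly, which shows what the silhouette plot is summarizing: for each candidate k, run PAM and record the average silhouette width, then pick the k with the largest value. The sketch below uses simulated data with three obvious groups, an assumption made so the result is easy to check; it is not the HR data.

```r
library(cluster)

set.seed(1)

# Simulated data: three well-separated groups of 30 points each
blobs <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.3), rnorm(30, mean = 4, sd = 0.3))
)
d <- stats::dist(blobs)

# Average silhouette width of the PAM solution for each candidate k
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)

# The suggested number of clusters is the k with the widest average silhouette
best_k <- ks[which.max(avg_sil)]

round(setNames(avg_sil, ks), 3)
best_k
```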

Clustering Using PAM

The data is then grouped into 3 clusters using Partitioning Around Medoids (PAM). PAM is a k-medoids algorithm: each cluster is built around a medoid, an actual data point that is the most central member of the cluster, rather than around the mean of its members as in the k-means algorithm.

Because medoids are real observations rather than averages, PAM is more robust to noise and less sensitive to outliers than k-means.

pam.res <- pam(dist_mat, k = 3)

fviz_cluster(
  pam.res,
  data = df_scaled,
  palette = 'jco',
  ellipse.type = 't',
  geom = 'point',
  star.plot = TRUE,
  repel = TRUE,
  ggtheme = theme_minimal()
)

saveRDS(pam.res, 'pam.rds')
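
One detail of `pam()` is worth illustrating: its `diss` argument defaults to `inherits(x, "dist")`, so a plain square matrix is treated as a data matrix (one row per observation) unless it is converted with `as.dist()` or `diss = TRUE` is set. The sketch below, on hypothetical records and using `daisy()` from the `cluster` package for the Gower step, shows how precomputed dissimilarities could be passed so that they are used as-is.

```r
library(cluster)

# Hypothetical mixed-type records (not from the HR data set)
toy <- data.frame(
  salary = c(48, 50, 52, 90, 95, 100),
  dept   = factor(c("Production", "Production", "Production",
                    "IT/IS", "IT/IS", "IT/IS"))
)

# Gower dissimilarities, then the same values as a plain square matrix
d <- daisy(toy, metric = "gower")
m <- as.matrix(d)

inherits(m, "dist")   # FALSE: pam(m, k) would treat m as a data matrix

# Convert back to a 'dist' object so pam() uses the values as dissimilarities
fit <- pam(as.dist(m), k = 2)

fit$clustering        # cluster label of each record
fit$id.med            # row numbers of the two medoids
```

When the input is a dissimilarity object, the dissimilarities reported in `fit$clusinfo` stay on the Gower scale of 0 to 1.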

Evaluation of Clusters

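The generated clusters are then evaluated using silhouette widths. Most data points have positive widths, although the average width per cluster is modest (0.33, 0.22 and 0.14) and 15 of the 207 data points have negative silhouette widths, meaning they sit closer to a neighbouring cluster than to their own.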

sil <- silhouette(pam.res$clustering, dist)
plot(sil)

fviz_silhouette(pam.res, palette = 'jco', ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1  122          0.33
## 2       2   54          0.22
## 3       3   31          0.14

sil[sil[, 'sil_width'] < 0, ]
##       cluster neighbor   sil_width
##  [1,]       3        2 -0.11622271
##  [2,]       3        2 -0.13619274
##  [3,]       3        1 -0.22002013
##  [4,]       2        1 -0.12499086
##  [5,]       2        1 -0.20620159
##  [6,]       2        1 -0.18638566
##  [7,]       2        1 -0.02514931
##  [8,]       3        1 -0.06906233
##  [9,]       2        3 -0.18589211
## [10,]       3        1 -0.14614078
## [11,]       3        2 -0.10678000
## [12,]       2        1 -0.03695352
## [13,]       3        2 -0.07534500
## [14,]       2        1 -0.15527599
## [15,]       3        2 -0.12992515

Adding Cluster Info

The cluster number of each employee is then added back to the dataset. The mean of each numeric variable is calculated per cluster to give an initial assessment of the resulting clusters.

df_clustered <- cbind(Cluster = pam.res$clustering, df_active)
df_clustered$Cluster <- as.factor(df_clustered$Cluster)
head(df_clustered)
##   Cluster       Employee_Name EmpID MarriedID MaritalStatusID GenderID
## 1       1 Adinolfi, Wilson  K 10026         0               0        1
## 2       1        Alagbe,Trina 10088         1               1        0
## 3       1   Anderson, Linda   10002         0               0        0
## 4       2     Andreola, Colby 10194         0               0        0
## 5       1         Athwal, Sam 10062         0               4        1
## 6       1    Bachiochi, Linda 10114         0               0        0
##   EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary Termd PositionID
## 1           1      5           4                      0  62506     0         19
## 2           1      5           3                      0  64991     0         19
## 3           1      5           4                      0  57568     0         19
## 4           1      4           3                      0  95660     0         24
## 5           1      5           3                      0  59365     0         19
## 6           3      5           3                      1  47837     0         19
##                  Position State  Zip        DOB Sex MaritalDesc CitizenDesc
## 1 Production Technician I    MA 1960 1983-07-10  M       Single  US Citizen
## 2 Production Technician I    MA 1886 1988-09-27   F     Married  US Citizen
## 3 Production Technician I    MA 1844 1977-05-22   F      Single  US Citizen
## 4       Software Engineer    MA 2110 1979-05-24   F      Single  US Citizen
## 5 Production Technician I    MA 2199 1983-02-18  M      Widowed  US Citizen
## 6 Production Technician I    MA 1902 1970-02-11   F      Single  US Citizen
##   HispanicLatino                  RaceDesc DateofHire DateofTermination
## 1             No                     White 2011-07-05              <NA>
## 2             No                     White 2008-01-07              <NA>
## 3             No                     White 2012-01-09              <NA>
## 4             No                     White 2014-11-10              <NA>
## 5             No                     White 2013-09-30              <NA>
## 6             No Black or African American 2009-07-06              <NA>
##          TermReason EmploymentStatus           Department     ManagerName
## 1 N/A-StillEmployed           Active    Production         Michael Albert
## 2 N/A-StillEmployed           Active    Production           Elijiah Gray
## 3 N/A-StillEmployed           Active    Production               Amy Dunn
## 4 N/A-StillEmployed           Active Software Engineering Alex Sweetwater
## 5 N/A-StillEmployed           Active    Production          Ketsia Liebig
## 6 N/A-StillEmployed           Active    Production         Brannon Miller
##   ManagerID  RecruitmentSource PerformanceScore EngagementSurvey
## 1        22           LinkedIn          Exceeds             4.60
## 2        16             Indeed      Fully Meets             4.84
## 3        11           LinkedIn          Exceeds             5.00
## 4        10           LinkedIn      Fully Meets             3.04
## 5        19  Employee Referral      Fully Meets             5.00
## 6        12 Diversity Job Fair      Fully Meets             4.46
##   EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date
## 1               5                    0                 2019-01-17
## 2               5                    0                 2019-01-03
## 3               5                    0                 2019-01-07
## 4               3                    4                 2019-01-02
## 5               4                    0                 2019-02-25
## 6               3                    0                 2019-01-25
##   DaysLateLast30 Absences TempYrOfService TempAge
## 1              0        1               9      37
## 2              0       15              13      32
## 3              0       15               9      44
## 4              0       19               6      42
## 5              0       19               7      38
## 6              0        4              11      51
saveRDS(df_clustered, 'HRDataset_v14-clustered.rds')
# Show summary statistics for each cluster, then the medoid (most central employee) of each
pam.res$clusinfo
##      size max_diss  av_diss diameter separation
## [1,]  122 2.011804 1.229564 2.583674  0.8256592
## [2,]   54 1.953180 1.293951 2.361562  0.6966623
## [3,]   31 2.848821 1.404985 3.787515  0.6966623
df_clustered[pam.res$id.med, ]
##     Cluster      Employee_Name EmpID MarriedID MaritalStatusID GenderID
## 163       1     Rose, Ashley   10054         0               3        0
## 7         2 Bacong, Alejandro  10250         0               2        1
## 118       3   Leruth, Giovanni 10103         0               3        1
##     EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary Termd
## 163           1      5           3                      0  60627     0
## 7             1      3           3                      0  50178     0
## 118           1      6           3                      0  70468     0
##     PositionID                Position State   Zip        DOB Sex MaritalDesc
## 163         19 Production Technician I    MA  1886 1974-12-05   F   Separated
## 7           14              IT Support    MA  1886 1988-01-07  M     Divorced
## 118          3      Area Sales Manager    UT 84111 1988-12-27  M    Separated
##     CitizenDesc HispanicLatino                  RaceDesc DateofHire
## 163  US Citizen             No                     White 2014-01-06
## 7    US Citizen             No                     White 2015-01-05
## 118  US Citizen             No Black or African American 2012-04-30
##     DateofTermination        TermReason EmploymentStatus        Department
## 163              <NA> N/A-StillEmployed           Active Production       
## 7                <NA> N/A-StillEmployed           Active             IT/IS
## 118              <NA> N/A-StillEmployed           Active             Sales
##       ManagerName ManagerID RecruitmentSource PerformanceScore EngagementSurvey
## 163 David Stanley        14           Website      Fully Meets             5.00
## 7    Peter Monroe         7            Indeed      Fully Meets             5.00
## 118    John Smith        17           Website      Fully Meets             4.53
##     EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date
## 163               4                    0                 2019-01-31
## 7                 5                    6                 2019-02-18
## 118               3                    0                 2019-01-29
##     DaysLateLast30 Absences TempYrOfService TempAge
## 163              0        8               7      46
## 7                0       16               6      33
## 118              0       16               9      32
# Show the means of numeric variables of each cluster
aggregate(df_active[, num_cols], by = list(CLUSTER = pam.res$clustering), mean)
##   CLUSTER   Salary EngagementSurvey SpecialProjectsCount DaysLateLast30
## 1       1 59336.82         4.199836           0.03278689      0.2049180
## 2       2 92270.85         4.061111           5.05555556      0.2407407
## 3       3 77804.74         3.907097           0.83870968      0.7096774
##    Absences  TempAge TempYrOfService
## 1  9.475410 41.55738        7.442623
## 2  9.833333 40.53704        6.500000
## 3 11.225806 41.29032        8.129032
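
What makes a record a medoid can be checked directly: within its cluster, the medoid is the member with the smallest total dissimilarity to the other members. The sketch below verifies this on simulated data (two well-separated groups, an assumption made for the example) by recomputing the most central member of each cluster and comparing it with `id.med`.

```r
library(cluster)

set.seed(42)

# Simulated data: two well-separated groups of 10 points each
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(20, mean = 8, sd = 1), ncol = 2)
)
d  <- stats::dist(x)
dm <- as.matrix(d)

fit <- pam(d, k = 2)

# For each cluster, find the member with the smallest total distance to the others
central <- sapply(1:2, function(cl) {
  members <- which(fit$clustering == cl)
  unname(members[which.min(rowSums(dm[members, members, drop = FALSE]))])
})

central
fit$id.med
```

The two vectors should agree, which is why the medoid rows can be read as a typical employee of each cluster.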

Visualizing Clusters

The clusters of employees are now visualized using bar plots, histograms and box plots. Each plot shows how the clusters differ on one variable.

theme_set(theme_bw())

ggplot(data = df_clustered, aes(x = Salary, fill = Cluster)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df_clustered, aes(x = Salary, fill = Cluster)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df_clustered, aes(y = Position, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(y = State, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(y = Zip, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = Sex, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = MaritalDesc, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = CitizenDesc, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = HispanicLatino, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = RaceDesc, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(y = Department, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(y = ManagerName, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(y = RecruitmentSource, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = PerformanceScore, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = EngagementSurvey, fill = Cluster)) +
  geom_histogram(binwidth = 0.5)

ggplot(data = df_clustered, aes(x = EmpSatisfaction, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = SpecialProjectsCount, fill = Cluster)) +
  geom_bar()

ggplot(data = df_clustered, aes(x = LastPerformanceReview_Date, fill = Cluster)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df_clustered, aes(x = DaysLateLast30, fill = Cluster)) +
  geom_histogram(binwidth = 1)

ggplot(data = df_clustered, aes(x = Absences, fill = Cluster)) +
  geom_histogram(binwidth = 1)

ggplot(data = df_clustered, aes(x = TempAge, fill = Cluster)) +
  geom_histogram(binwidth = 2)

ggplot(data = df_clustered, aes(x = TempYrOfService, fill = Cluster)) +
  geom_histogram(binwidth = 2)

ggplot(data = df_clustered, aes(x = Cluster, y = Salary, fill = Cluster)) +
  geom_boxplot()

ggplot(data = df_clustered, aes(x = Cluster, y = EngagementSurvey, fill = Cluster)) +
  geom_boxplot()

ggplot(data = df_clustered, aes(x = Cluster, y = SpecialProjectsCount, fill = Cluster)) +
  geom_boxplot()

ggplot(data = df_clustered, aes(x = Cluster, y = DaysLateLast30, fill = Cluster)) +
  geom_boxplot()

ggplot(data = df_clustered, aes(x = Cluster, y = TempAge, fill = Cluster)) +
  geom_boxplot()

ggplot(data = df_clustered, aes(x = Cluster, y = TempYrOfService, fill = Cluster)) +
  geom_boxplot()

ggplot(data = df_clustered, aes(x = Cluster, y = Absences, fill = Cluster)) +
  geom_boxplot()

Interpreting Clusters

From the data visualization above, the variables below are found to differentiate the clusters from one another.

  1. Salary - Cluster 2 has the highest average salary, followed by clusters 3 and 1.
  2. Position - Cluster 1 tends to hold production-related positions, cluster 3 sales-related positions, and cluster 2 the remaining positions.
  3. Department - Cluster 1 tends to come from the production department, cluster 3 from sales and the executive office, and cluster 2 from the remaining departments.
  4. Special projects - Cluster 2 has the highest average number of special projects, followed by clusters 3 and 1.
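
Statements about categorical variables such as Position and Department can be backed by a cross-tabulation as well as by the bar plots. The sketch below shows the pattern on a hypothetical seven-row data frame; the column names mirror the data set, but the values are made up for illustration.

```r
# Hypothetical cluster assignments and departments (not the real results)
toy <- data.frame(
  Cluster    = factor(c(1, 1, 1, 2, 2, 3, 3)),
  Department = c("Production", "Production", "Production",
                 "IT/IS", "Admin Offices", "Sales", "Sales")
)

# Counts of employees per cluster and department
counts <- table(Cluster = toy$Cluster, Department = toy$Department)

# Row-wise shares: what fraction of each cluster sits in each department
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

Each row of `shares` sums to 1, so a value close to 1 in a single column means that cluster is dominated by one department.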

In summary, the clustering model has segmented the employees into 3 clusters. Cluster 1 tends to be the blue-collar production workers, Cluster 3 tends to be in sales, and Cluster 2 covers the remaining, mostly white-collar, positions.