This is a mini data science project, done in collaboration with Lim Lee Wen, based on the Human Resources Data Set by Dr. Rich at https://www.kaggle.com/rhuebner/human-resources-data-set
To reduce the loss of talent, the HR department can seek to understand its active employees by segmenting them into groups. Members of each group are similar to one another, yet dissimilar to members of other groups. This gives the company insight into how employees resemble and differ from one another, so that it can devise strategies tailored to each group.
library(gower)
library(factoextra)
library(cluster)
library(clustertend)
library(dplyr)
# Make model repeatable
set.seed(315)
# Load data and select only active employees
df_active <- readRDS('HRDataset_v14-cleaned.rds') %>%
  filter(
    Termd == 0
  )
row.names(df_active) <- df_active$EmpID
str(df_active)
## 'data.frame': 207 obs. of 38 variables:
## $ Employee_Name : chr "Adinolfi, Wilson K" "Alagbe,Trina" "Anderson, Linda " "Andreola, Colby" ...
## $ EmpID : chr "10026" "10088" "10002" "10194" ...
## $ MarriedID : int 0 1 0 0 0 0 0 0 0 1 ...
## $ MaritalStatusID : int 0 1 0 0 4 0 2 2 0 1 ...
## $ GenderID : int 1 0 0 0 1 0 1 1 1 0 ...
## $ EmpStatusID : int 1 1 1 1 1 3 1 1 1 2 ...
## $ DeptID : int 5 5 5 4 5 5 3 3 5 5 ...
## $ PerfScoreID : int 4 3 4 3 3 3 3 4 3 4 ...
## $ FromDiversityJobFairID : int 0 0 0 0 0 1 0 1 0 0 ...
## $ Salary : int 62506 64991 57568 95660 59365 47837 50178 92328 58709 70131 ...
## $ Termd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PositionID : int 19 19 19 24 19 19 14 9 19 20 ...
## $ Position : Factor w/ 29 levels "Accountant I",..: 20 20 20 25 20 20 15 7 20 21 ...
## $ State : Factor w/ 28 levels "AL","AZ","CA",..: 11 11 11 11 11 11 11 24 11 11 ...
## $ Zip : chr "1960" "1886" "1844" "2110" ...
## $ DOB : Date, format: "1983-07-10" "1988-09-27" ...
## $ Sex : Factor w/ 2 levels "F","M ": 2 1 1 1 2 1 2 2 2 1 ...
## $ MaritalDesc : Factor w/ 5 levels "Divorced","Married",..: 4 2 4 4 5 4 1 1 4 2 ...
## $ CitizenDesc : Factor w/ 3 levels "Eligible NonCitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ HispanicLatino : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ RaceDesc : Factor w/ 6 levels "American Indian or Alaska Native",..: 6 6 6 6 6 3 6 3 5 6 ...
## $ DateofHire : Date, format: "2011-07-05" "2008-01-07" ...
## $ DateofTermination : Date, format: NA NA ...
## $ TermReason : Factor w/ 18 levels "Another position",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ EmploymentStatus : Factor w/ 3 levels "Active","Terminated for Cause",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Department : Factor w/ 6 levels "Admin Offices",..: 4 4 4 6 4 4 3 3 4 4 ...
## $ ManagerName : Factor w/ 21 levels "Alex Sweetwater",..: 18 9 2 1 15 5 19 20 14 14 ...
## $ ManagerID : num 22 16 11 10 19 12 7 4 18 18 ...
## $ RecruitmentSource : Factor w/ 9 levels "CareerBuilder",..: 6 5 6 6 3 2 5 2 4 3 ...
## $ PerformanceScore : Factor w/ 4 levels "Exceeds","Fully Meets",..: 1 2 1 2 2 2 2 1 2 1 ...
## $ EngagementSurvey : num 4.6 4.84 5 3.04 5 4.46 5 4.28 4.6 4.4 ...
## $ EmpSatisfaction : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 3 4 3 5 4 4 3 ...
## $ SpecialProjectsCount : int 0 0 0 4 0 0 6 5 0 0 ...
## $ LastPerformanceReview_Date: Date, format: "2019-01-17" "2019-01-03" ...
## $ DaysLateLast30 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Absences : int 1 15 15 19 19 4 16 9 7 16 ...
## $ TempYrOfService : int 9 13 9 6 7 11 6 6 9 4 ...
## $ TempAge : int 37 32 44 42 38 51 33 32 37 55 ...
head(df_active)
## Employee_Name EmpID MarriedID MaritalStatusID GenderID EmpStatusID
## 10026 Adinolfi, Wilson K 10026 0 0 1 1
## 10088 Alagbe,Trina 10088 1 1 0 1
## 10002 Anderson, Linda 10002 0 0 0 1
## 10194 Andreola, Colby 10194 0 0 0 1
## 10062 Athwal, Sam 10062 0 4 1 1
## 10114 Bachiochi, Linda 10114 0 0 0 3
## DeptID PerfScoreID FromDiversityJobFairID Salary Termd PositionID
## 10026 5 4 0 62506 0 19
## 10088 5 3 0 64991 0 19
## 10002 5 4 0 57568 0 19
## 10194 4 3 0 95660 0 24
## 10062 5 3 0 59365 0 19
## 10114 5 3 1 47837 0 19
## Position State Zip DOB Sex MaritalDesc CitizenDesc
## 10026 Production Technician I MA 1960 1983-07-10 M Single US Citizen
## 10088 Production Technician I MA 1886 1988-09-27 F Married US Citizen
## 10002 Production Technician I MA 1844 1977-05-22 F Single US Citizen
## 10194 Software Engineer MA 2110 1979-05-24 F Single US Citizen
## 10062 Production Technician I MA 2199 1983-02-18 M Widowed US Citizen
## 10114 Production Technician I MA 1902 1970-02-11 F Single US Citizen
## HispanicLatino RaceDesc DateofHire DateofTermination
## 10026 No White 2011-07-05 <NA>
## 10088 No White 2008-01-07 <NA>
## 10002 No White 2012-01-09 <NA>
## 10194 No White 2014-11-10 <NA>
## 10062 No White 2013-09-30 <NA>
## 10114 No Black or African American 2009-07-06 <NA>
## TermReason EmploymentStatus Department ManagerName
## 10026 N/A-StillEmployed Active Production Michael Albert
## 10088 N/A-StillEmployed Active Production Elijiah Gray
## 10002 N/A-StillEmployed Active Production Amy Dunn
## 10194 N/A-StillEmployed Active Software Engineering Alex Sweetwater
## 10062 N/A-StillEmployed Active Production Ketsia Liebig
## 10114 N/A-StillEmployed Active Production Brannon Miller
## ManagerID RecruitmentSource PerformanceScore EngagementSurvey
## 10026 22 LinkedIn Exceeds 4.60
## 10088 16 Indeed Fully Meets 4.84
## 10002 11 LinkedIn Exceeds 5.00
## 10194 10 LinkedIn Fully Meets 3.04
## 10062 19 Employee Referral Fully Meets 5.00
## 10114 12 Diversity Job Fair Fully Meets 4.46
## EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date
## 10026 5 0 2019-01-17
## 10088 5 0 2019-01-03
## 10002 5 0 2019-01-07
## 10194 3 4 2019-01-02
## 10062 4 0 2019-02-25
## 10114 3 0 2019-01-25
## DaysLateLast30 Absences TempYrOfService TempAge
## 10026 0 1 9 37
## 10088 0 15 13 32
## 10002 0 15 9 44
## 10194 0 19 6 42
## 10062 0 19 7 38
## 10114 0 4 11 51
Select the variables for clustering and standardize the numeric ones. Salary is log-transformed first because it is highly skewed.
df_scaled <- as.data.frame(df_active)
df_scaled$Salary <- log10(df_scaled$Salary)
df_scaled <- df_scaled %>%
  dplyr::select(
    Salary,
    Position,
    State,
    Sex,
    MaritalDesc,
    CitizenDesc,
    HispanicLatino,
    RaceDesc,
    Department,
    ManagerName,
    RecruitmentSource,
    PerformanceScore,
    EngagementSurvey,
    EmpSatisfaction,
    SpecialProjectsCount,
    DaysLateLast30,
    Absences,
    TempAge,
    TempYrOfService
  )
num_cols <- colnames(df_scaled)[unlist(sapply(df_scaled, is.numeric))]
df_scaled[num_cols] <- sapply(df_scaled[num_cols], scale)
str(df_scaled)
## 'data.frame': 207 obs. of 19 variables:
## $ Salary : num -0.254 -0.119 -0.539 1.219 -0.433 ...
## $ Position : Factor w/ 29 levels "Accountant I",..: 20 20 20 25 20 20 15 7 20 21 ...
## $ State : Factor w/ 28 levels "AL","AZ","CA",..: 11 11 11 11 11 11 11 24 11 11 ...
## $ Sex : Factor w/ 2 levels "F","M ": 2 1 1 1 2 1 2 2 2 1 ...
## $ MaritalDesc : Factor w/ 5 levels "Divorced","Married",..: 4 2 4 4 5 4 1 1 4 2 ...
## $ CitizenDesc : Factor w/ 3 levels "Eligible NonCitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ HispanicLatino : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ RaceDesc : Factor w/ 6 levels "American Indian or Alaska Native",..: 6 6 6 6 6 3 6 3 5 6 ...
## $ Department : Factor w/ 6 levels "Admin Offices",..: 4 4 4 6 4 4 3 3 4 4 ...
## $ ManagerName : Factor w/ 21 levels "Alex Sweetwater",..: 18 9 2 1 15 5 19 20 14 14 ...
## $ RecruitmentSource : Factor w/ 9 levels "CareerBuilder",..: 6 5 6 6 3 2 5 2 4 3 ...
## $ PerformanceScore : Factor w/ 4 levels "Exceeds","Fully Meets",..: 1 2 1 2 2 2 2 1 2 1 ...
## $ EngagementSurvey : num 0.615 0.922 1.126 -1.382 1.126 ...
## $ EmpSatisfaction : Factor w/ 5 levels "1","2","3","4",..: 5 5 5 3 4 3 5 4 4 3 ...
## $ SpecialProjectsCount: num -0.578 -0.578 -0.578 1.001 -0.578 ...
## $ DaysLateLast30 : num -0.274 -0.274 -0.274 -0.274 -0.274 ...
## $ Absences : num -1.51 0.884 0.884 1.568 1.568 ...
## $ TempAge : num -0.5059 -1.1009 0.3271 0.0891 -0.3869 ...
## $ TempYrOfService : num 0.865 2.901 0.865 -0.661 -0.152 ...
head(df_scaled)
## Salary Position State Sex MaritalDesc CitizenDesc
## 10026 -0.2541211 Production Technician I MA M Single US Citizen
## 10088 -0.1191572 Production Technician I MA F Married US Citizen
## 10002 -0.5390151 Production Technician I MA F Single US Citizen
## 10194 1.2190195 Software Engineer MA F Single US Citizen
## 10062 -0.4326054 Production Technician I MA M Widowed US Citizen
## 10114 -1.1800341 Production Technician I MA F Single US Citizen
## HispanicLatino RaceDesc Department
## 10026 No White Production
## 10088 No White Production
## 10002 No White Production
## 10194 No White Software Engineering
## 10062 No White Production
## 10114 No Black or African American Production
## ManagerName RecruitmentSource PerformanceScore EngagementSurvey
## 10026 Michael Albert LinkedIn Exceeds 0.6145370
## 10088 Elijiah Gray Indeed Fully Meets 0.9216818
## 10002 Amy Dunn LinkedIn Exceeds 1.1264450
## 10194 Alex Sweetwater LinkedIn Fully Meets -1.3819044
## 10062 Ketsia Liebig Employee Referral Fully Meets 1.1264450
## 10114 Brannon Miller Diversity Job Fair Fully Meets 0.4353691
## EmpSatisfaction SpecialProjectsCount DaysLateLast30 Absences
## 10026 5 -0.5779309 -0.2739534 -1.5104966
## 10088 5 -0.5779309 -0.2739534 0.8841528
## 10002 5 -0.5779309 -0.2739534 0.8841528
## 10194 3 1.0013653 -0.2739534 1.5683384
## 10062 4 -0.5779309 -0.2739534 1.5683384
## 10114 3 -0.5779309 -0.2739534 -0.9973574
## TempAge TempYrOfService
## 10026 -0.50589473 0.8653762
## 10088 -1.10089592 2.9009771
## 10002 0.32710693 0.8653762
## 10194 0.08910646 -0.6613244
## 10062 -0.38689449 -0.1524242
## 10114 1.16010860 1.8831767
# Confirm there is no missing value
sum(is.na(df_scaled))
## [1] 0
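As a rough illustration of why the log transform and standardization help, here is a minimal sketch on a made-up salary vector. The numbers and the skewness() helper are assumptions for illustration only and are not taken from the HR data set. It compares the sample skewness before and after log10(), then standardizes with scale() as in the step above.

```r
# Hypothetical right-skewed salaries (not taken from the HR data set)
salary <- c(40000, 45000, 50000, 55000, 60000, 70000, 90000, 150000, 250000)

# Simple moment-based sample skewness (illustrative helper)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# The log transform pulls in the long right tail
log_salary <- log10(salary)

# scale() centers to mean 0 and rescales to standard deviation 1
z_salary <- as.numeric(scale(log_salary))

cat("skewness raw:", round(skewness(salary), 2), "\n")
cat("skewness log:", round(skewness(log_salary), 2), "\n")
cat("mean and sd after scaling:", round(mean(z_salary), 3), round(sd(z_salary), 3), "\n")
```

Standardizing after the log transform puts salary on a scale comparable to the other numeric variables.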
The dissimilarity matrix is first generated using the Gower distance. Gower distance is used because the data mixes numeric and categorical variables; Euclidean or Manhattan distance would only be suitable if all the variables were numeric.
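# Build the pairwise Gower distance matrix one row at a time:
# row i holds the distance from employee i to every employee
dist <- matrix(0, ncol = nrow(df_scaled))
dist <- as.data.frame(dist)
for (i in seq_len(nrow(df_scaled))) {
  dist[i, ] <- gower_dist(df_scaled[i, ], df_scaled)
}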
saveRDS(dist, 'dist.rds')
dist_mat <- as.matrix(dist)
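To make the Gower distance concrete, the sketch below computes it by hand on a tiny made-up data frame and compares the result with daisy() from the cluster package. The toy data and the gower_manual() helper are assumptions for illustration. The helper covers only the simple case of numeric and nominal columns with no missing values and no constant numeric column. Each numeric variable contributes the absolute difference divided by its range, each categorical variable contributes 0 for a match and 1 for a mismatch, and the contributions are averaged over the variables.

```r
library(cluster)

# Hypothetical mixed-type data (not taken from the HR data set)
toy <- data.frame(
  salary = c(1, 2, 5),
  dept   = factor(c("A", "A", "B"))
)

# Hand-rolled Gower distance for numeric and nominal columns only
gower_manual <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (col in names(df)) {
    v <- df[[col]]
    if (is.numeric(v)) {
      # Range-normalized absolute difference, always within [0, 1]
      part <- abs(outer(v, v, "-")) / diff(range(v))
    } else {
      # Simple matching: 0 if the categories agree, 1 otherwise
      part <- outer(as.character(v), as.character(v), "!=") * 1
    }
    total <- total + part
  }
  total / ncol(df)
}

d_manual <- gower_manual(toy)
d_daisy  <- unname(as.matrix(daisy(toy, metric = "gower")))

print(d_manual)
print(all.equal(d_manual, d_daisy))
```

Because every per-variable contribution lies between 0 and 1, the averaged Gower distance also lies between 0 and 1.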
The Hopkins statistic (Lawson and Jurs, 1990) and a visualization of the distance matrix are used to assess whether the data is suitable for cluster analysis. If the Hopkins statistic is close to 0.5, or the visualization shows a random distribution of blue color, then there are no meaningful clusters.
The Hopkins statistic tests the spatial randomness of the data, that is, how plausible it is that the data was generated by a uniform distribution. Here the statistic is about 0.20, well below 0.5, which under the convention used by hopkins() in clustertend indicates that the data is cohesive enough for clustering.
hopkins(dist_mat, n = nrow(dist_mat) - 1)
## $H
## [1] 0.196678
fviz_dist(as.dist(dist), show_labels = FALSE)
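To show what the statistic measures, the sketch below computes a Hopkins-type statistic from scratch in base R. It is an illustrative simplification and not the implementation in clustertend. The hopkins_sketch() helper is hypothetical, it uses Euclidean distance on made-up numeric data, and it is oriented so that values near 0 suggest clustered data while values near 0.5 suggest uniform data, matching how the statistic is read above. Some references report the complement instead.

```r
hopkins_sketch <- function(x, m) {
  x <- as.matrix(x)
  n <- nrow(x)

  # Nearest-neighbour Euclidean distance from point p to the rows of pts
  nn_dist <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))

  # w: distance from each of m sampled real points to its nearest other real point
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))

  # u: distance from each of m uniform random points (drawn inside the
  # bounding box of the data) to its nearest real point
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  unif <- matrix(
    runif(m * ncol(x), min = rep(lo, each = m), max = rep(hi, each = m)),
    nrow = m
  )
  u <- apply(unif, 1, nn_dist, pts = x)

  sum(w) / (sum(u) + sum(w))
}

set.seed(315)
# Two tight, well-separated blobs versus uniformly scattered points
clustered <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)
uniform <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform   <- hopkins_sketch(uniform, m = 20)
cat("clustered:", round(h_clustered, 3), " uniform:", round(h_uniform, 3), "\n")
```

In clustered data the real points sit much closer to each other than random points sit to the data, which is what drives the statistic towards 0.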
Three methods are used to find the best number of clusters: the Silhouette method, the Elbow method (within-cluster sum of squares, or WSS), and the Gap Statistic.
In this case, the Silhouette method suggests 2 clusters, the Elbow method suggests 3 clusters, and the Gap Statistic suggests 4 clusters. A middle ground of 3 clusters is chosen as it fares well enough under all three methods.
fviz_nbclust(dist_mat, pam, method = 'silhouette')
fviz_nbclust(dist_mat, pam, method = 'wss')
fviz_nbclust(dist_mat, pam, method = 'gap_stat')
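The silhouette criterion can also be scanned by hand, which is similar in spirit to what the silhouette plot summarizes. The sketch below is a minimal example on made-up data with three well-separated blobs; the toy data and the variable names are assumptions for illustration. It runs pam() for several values of k on a dist object and keeps the k with the largest average silhouette width.

```r
library(cluster)

set.seed(315)
# Hypothetical data: three tight blobs of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
  cbind(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 5, sd = 0.3))
)
d <- dist(toy)  # a dist object, so pam() treats it as dissimilarities

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))
cat("best k by average silhouette width:", best_k, "\n")
```

On real data the criteria often disagree, as they do above, so the final choice remains a judgment call.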
The data is then grouped into 3 clusters using Partitioning Around Medoids (PAM). PAM is a k-medoids algorithm in which data points are clustered around the most central data point (the medoid) of each cluster, rather than around the mean value as in the k-means algorithm.
The PAM algorithm is more robust to noise and less sensitive to outliers than the k-means algorithm.
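# Note: dist_mat is a plain matrix rather than a dist object, so pam() treats
# its rows as observations (diss defaults to inherits(x, "dist")) and clusters on
# Euclidean distances between those rows. Passing as.dist(dist_mat) would cluster
# on the Gower distances directly, but would not reproduce the outputs shown here.
pam.res <- pam(dist_mat, k = 3)
fviz_cluster(
  pam.res,
  data = df_scaled,
  palette = 'jco',
  ellipse.type = 't',
  geom = 'point',
  star.plot = TRUE,
  repel = TRUE,
  ggtheme = theme_minimal()
)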
saveRDS(pam.res, 'pam.rds')
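The difference between a medoid and a mean can be seen on a tiny made-up example. The vector below is an assumption for illustration, not HR data. The medoid is the observed point with the smallest total distance to all other points, so a single outlier barely moves it, while the mean is pulled far towards the outlier.

```r
library(cluster)

# Hypothetical 1-D data with one extreme outlier
x <- c(1, 2, 3, 4, 100)

# Brute-force medoid: the observation minimizing the sum of distances to the rest
d <- as.matrix(dist(x))
medoid <- x[which.min(rowSums(d))]

# pam() with a single cluster should pick the same observation as its medoid
fit <- pam(dist(x), k = 1)

cat("mean:", mean(x), "\n")
cat("brute-force medoid:", medoid, "\n")
cat("pam medoid:", x[fit$id.med], "\n")
```

Because a medoid is always an actual observation, it can also be inspected directly as a representative member of its cluster.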
The generated clusters are then evaluated using the Silhouette width method. The average silhouette widths are modest (0.33, 0.22 and 0.14 for clusters 1 to 3), and 15 of the 207 data points have negative Silhouette widths, so the clusters are only moderately separated.
sil <- silhouette(pam.res$clustering, as.dist(dist_mat))
plot(sil)
fviz_silhouette(pam.res, palette = 'jco', ggtheme = theme_classic())
## cluster size ave.sil.width
## 1 1 122 0.33
## 2 2 54 0.22
## 3 3 31 0.14
sil[sil[, 'sil_width'] < 0, ]
## cluster neighbor sil_width
## [1,] 3 2 -0.11622271
## [2,] 3 2 -0.13619274
## [3,] 3 1 -0.22002013
## [4,] 2 1 -0.12499086
## [5,] 2 1 -0.20620159
## [6,] 2 1 -0.18638566
## [7,] 2 1 -0.02514931
## [8,] 3 1 -0.06906233
## [9,] 2 3 -0.18589211
## [10,] 3 1 -0.14614078
## [11,] 3 2 -0.10678000
## [12,] 2 1 -0.03695352
## [13,] 3 2 -0.07534500
## [14,] 2 1 -0.15527599
## [15,] 3 2 -0.12992515
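For reference, the silhouette width of a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean distance to the other members of its own cluster and b(i) is its mean distance to the members of the nearest other cluster. The sketch below computes this by hand on made-up 1-D data and compares it with silhouette() from the cluster package. The toy vector and cluster labels are assumptions for illustration, and the hand computation assumes every cluster has at least two members.

```r
library(cluster)

# Hypothetical data: two obvious groups
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))

sil_manual <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  # a: mean distance to the other members of the same cluster (self excluded)
  a <- sum(d[i, own]) / (sum(own) - 1)
  # b: smallest mean distance to the members of any other cluster
  b <- min(tapply(d[i, !own], cl[!own], mean))
  (b - a) / max(a, b)
})

sil_pkg <- silhouette(cl, dist(x))[, "sil_width"]

print(round(sil_manual, 3))
print(all.equal(unname(sil_manual), unname(sil_pkg)))
```

A width close to 1 means the point sits firmly inside its cluster, while a negative width means it is on average closer to a neighbouring cluster than to its own.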
The cluster number of each employee is then added back to the dataset. The mean of each numeric variable is calculated per cluster to give an initial assessment of the resulting clusters.
df_clustered <- cbind(Cluster = pam.res$clustering, df_active)
df_clustered$Cluster <- as.factor(df_clustered$Cluster)
head(df_clustered)
## Cluster Employee_Name EmpID MarriedID MaritalStatusID GenderID
## 1 1 Adinolfi, Wilson K 10026 0 0 1
## 2 1 Alagbe,Trina 10088 1 1 0
## 3 1 Anderson, Linda 10002 0 0 0
## 4 2 Andreola, Colby 10194 0 0 0
## 5 1 Athwal, Sam 10062 0 4 1
## 6 1 Bachiochi, Linda 10114 0 0 0
## EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary Termd PositionID
## 1 1 5 4 0 62506 0 19
## 2 1 5 3 0 64991 0 19
## 3 1 5 4 0 57568 0 19
## 4 1 4 3 0 95660 0 24
## 5 1 5 3 0 59365 0 19
## 6 3 5 3 1 47837 0 19
## Position State Zip DOB Sex MaritalDesc CitizenDesc
## 1 Production Technician I MA 1960 1983-07-10 M Single US Citizen
## 2 Production Technician I MA 1886 1988-09-27 F Married US Citizen
## 3 Production Technician I MA 1844 1977-05-22 F Single US Citizen
## 4 Software Engineer MA 2110 1979-05-24 F Single US Citizen
## 5 Production Technician I MA 2199 1983-02-18 M Widowed US Citizen
## 6 Production Technician I MA 1902 1970-02-11 F Single US Citizen
## HispanicLatino RaceDesc DateofHire DateofTermination
## 1 No White 2011-07-05 <NA>
## 2 No White 2008-01-07 <NA>
## 3 No White 2012-01-09 <NA>
## 4 No White 2014-11-10 <NA>
## 5 No White 2013-09-30 <NA>
## 6 No Black or African American 2009-07-06 <NA>
## TermReason EmploymentStatus Department ManagerName
## 1 N/A-StillEmployed Active Production Michael Albert
## 2 N/A-StillEmployed Active Production Elijiah Gray
## 3 N/A-StillEmployed Active Production Amy Dunn
## 4 N/A-StillEmployed Active Software Engineering Alex Sweetwater
## 5 N/A-StillEmployed Active Production Ketsia Liebig
## 6 N/A-StillEmployed Active Production Brannon Miller
## ManagerID RecruitmentSource PerformanceScore EngagementSurvey
## 1 22 LinkedIn Exceeds 4.60
## 2 16 Indeed Fully Meets 4.84
## 3 11 LinkedIn Exceeds 5.00
## 4 10 LinkedIn Fully Meets 3.04
## 5 19 Employee Referral Fully Meets 5.00
## 6 12 Diversity Job Fair Fully Meets 4.46
## EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date
## 1 5 0 2019-01-17
## 2 5 0 2019-01-03
## 3 5 0 2019-01-07
## 4 3 4 2019-01-02
## 5 4 0 2019-02-25
## 6 3 0 2019-01-25
## DaysLateLast30 Absences TempYrOfService TempAge
## 1 0 1 9 37
## 2 0 15 13 32
## 3 0 15 9 44
## 4 0 19 6 42
## 5 0 19 7 38
## 6 0 4 11 51
saveRDS(df_clustered, 'HRDataset_v14-clustered.rds')
# Show info of the medoids of each cluster
pam.res$clusinfo
## size max_diss av_diss diameter separation
## [1,] 122 2.011804 1.229564 2.583674 0.8256592
## [2,] 54 1.953180 1.293951 2.361562 0.6966623
## [3,] 31 2.848821 1.404985 3.787515 0.6966623
df_clustered[pam.res$id.med, ]
## Cluster Employee_Name EmpID MarriedID MaritalStatusID GenderID
## 163 1 Rose, Ashley 10054 0 3 0
## 7 2 Bacong, Alejandro 10250 0 2 1
## 118 3 Leruth, Giovanni 10103 0 3 1
## EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary Termd
## 163 1 5 3 0 60627 0
## 7 1 3 3 0 50178 0
## 118 1 6 3 0 70468 0
## PositionID Position State Zip DOB Sex MaritalDesc
## 163 19 Production Technician I MA 1886 1974-12-05 F Separated
## 7 14 IT Support MA 1886 1988-01-07 M Divorced
## 118 3 Area Sales Manager UT 84111 1988-12-27 M Separated
## CitizenDesc HispanicLatino RaceDesc DateofHire
## 163 US Citizen No White 2014-01-06
## 7 US Citizen No White 2015-01-05
## 118 US Citizen No Black or African American 2012-04-30
## DateofTermination TermReason EmploymentStatus Department
## 163 <NA> N/A-StillEmployed Active Production
## 7 <NA> N/A-StillEmployed Active IT/IS
## 118 <NA> N/A-StillEmployed Active Sales
## ManagerName ManagerID RecruitmentSource PerformanceScore EngagementSurvey
## 163 David Stanley 14 Website Fully Meets 5.00
## 7 Peter Monroe 7 Indeed Fully Meets 5.00
## 118 John Smith 17 Website Fully Meets 4.53
## EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date
## 163 4 0 2019-01-31
## 7 5 6 2019-02-18
## 118 3 0 2019-01-29
## DaysLateLast30 Absences TempYrOfService TempAge
## 163 0 8 7 46
## 7 0 16 6 33
## 118 0 16 9 32
# Show the means of numeric variables of each cluster
aggregate(df_active[, num_cols], by = list(CLUSTER = pam.res$clustering), mean)
## CLUSTER Salary EngagementSurvey SpecialProjectsCount DaysLateLast30
## 1 1 59336.82 4.199836 0.03278689 0.2049180
## 2 2 92270.85 4.061111 5.05555556 0.2407407
## 3 3 77804.74 3.907097 0.83870968 0.7096774
## Absences TempAge TempYrOfService
## 1 9.475410 41.55738 7.442623
## 2 9.833333 40.53704 6.500000
## 3 11.225806 41.29032 8.129032
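The same profiling idea can be sketched on a small made-up table. The toy data frame below is an assumption for illustration, not HR data. aggregate() gives the per-cluster means of the numeric variables, and table() gives the per-cluster counts of a categorical variable, which is the tabular counterpart of a stacked bar plot.

```r
# Hypothetical clustered data (not taken from the HR data set)
toy <- data.frame(
  cluster  = c(1, 1, 2, 2, 2),
  salary   = c(50000, 60000, 90000, 95000, 100000),
  projects = c(0, 0, 4, 5, 6),
  dept     = c("Production", "Production", "IT", "IT", "Sales")
)

# Per-cluster means of the numeric variables
profile <- aggregate(
  toy[, c("salary", "projects")],
  by = list(CLUSTER = toy$cluster),
  FUN = mean
)
print(profile)

# Per-cluster counts of a categorical variable
dept_counts <- table(cluster = toy$cluster, dept = toy$dept)
print(dept_counts)
```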
The clusters of employees are now visualized using histograms, bar plots and box plots. Each plot shows how the clusters differ from one another on one variable.
theme_set(theme_bw())
ggplot(data = df_clustered, aes(x = Salary, fill = Cluster)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = df_clustered, aes(x = Salary, fill = Cluster)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = df_clustered, aes(y = Position, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(y = State, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(y = Zip, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = Sex, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = MaritalDesc, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = CitizenDesc, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = HispanicLatino, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = RaceDesc, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(y = Department, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(y = ManagerName, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(y = RecruitmentSource, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = PerformanceScore, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = EngagementSurvey, fill = Cluster)) +
  geom_histogram(binwidth = 0.5)
ggplot(data = df_clustered, aes(x = EmpSatisfaction, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = SpecialProjectsCount, fill = Cluster)) +
  geom_bar()
ggplot(data = df_clustered, aes(x = LastPerformanceReview_Date, fill = Cluster)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = df_clustered, aes(x = DaysLateLast30, fill = Cluster)) +
  geom_histogram(binwidth = 1)
ggplot(data = df_clustered, aes(x = Absences, fill = Cluster)) +
  geom_histogram(binwidth = 1)
ggplot(data = df_clustered, aes(x = TempAge, fill = Cluster)) +
  geom_histogram(binwidth = 2)
ggplot(data = df_clustered, aes(x = TempYrOfService, fill = Cluster)) +
  geom_histogram(binwidth = 2)
ggplot(data = df_clustered, aes(x = Cluster, y = Salary, fill = Cluster)) +
  geom_boxplot()
ggplot(data = df_clustered, aes(x = Cluster, y = EngagementSurvey, fill = Cluster)) +
  geom_boxplot()
ggplot(data = df_clustered, aes(x = Cluster, y = SpecialProjectsCount, fill = Cluster)) +
  geom_boxplot()
ggplot(data = df_clustered, aes(x = Cluster, y = DaysLateLast30, fill = Cluster)) +
  geom_boxplot()
ggplot(data = df_clustered, aes(x = Cluster, y = TempAge, fill = Cluster)) +
  geom_boxplot()
ggplot(data = df_clustered, aes(x = Cluster, y = TempYrOfService, fill = Cluster)) +
  geom_boxplot()
ggplot(data = df_clustered, aes(x = Cluster, y = Absences, fill = Cluster)) +
  geom_boxplot()
From the data visualization above, the variables below are found to differentiate the clusters from one another.
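In summary, the clustering model has segmented the employees into 3 clusters. Judging by the medoids and cluster means above, Cluster 1 tends to be blue-collar production workers, Cluster 2 tends to be white-collar workers (its medoid is in IT/IS, and it has the highest mean salary and special projects count), and Cluster 3 tends to be in sales (its medoid is an Area Sales Manager).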