Introduction

The data included 21 variables: individual identification, the reason for absence, the month of absence, the day of the week, seasons, transportation expenses, distance from residence to work, service time, age, average work load per day, hit target, disciplinary failure, education, number of children, social drinking and smoking habits, number of pets, weight, height, body mass index, and finally, the absenteeism time in hours, which was the target of the study.

The reason for absence was classified into 21 categories based on the International Classification of Diseases (ICD), plus 7 additional categories for absences without an ICD code. These categories covered a wide range of health conditions and circumstances, such as infectious diseases, mental and behavioral disorders, injuries, and external factors that could influence health status.

By analyzing this data, I aimed to gain insight into the factors that contributed to employee absenteeism and to develop strategies to minimize it. For example, absence patterns could be monitored and working hours scheduled so that employees at high risk of absenteeism are spread across shifts.

library(corrplot)    # correlation heatmap
## corrplot 0.92 loaded
library(factoextra)  # PCA and clustering visualization
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)   # needed for grid.arrange later in the report
library(knitr)
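Since the chunk that reads the data is not shown, here is a minimal sketch of how data and employees_id_tab might have been created; the file name, separator, and column renaming below are assumptions based on the publicly available UCI absenteeism file rather than the original code.

# Hypothetical import step: file name, separator and renamed columns are assumptions
data <- read.csv("Absenteeism_at_work.csv", sep = ";", stringsAsFactors = FALSE)
colnames(data) <- c("ID", "ReasonForAbsence", "MonthOfAbsence", "DayOfTheWeek", "Seasons",
                    "TransportationExpense", "DistanceFromResidenceToWork", "ServiceTime",
                    "Age", "WorkLoadAverageAday", "HitTarget", "DisciplinaryFailure",
                    "Education", "Son", "SocialDrinker", "SocialSmoker", "Pet",
                    "Weight", "Height", "BodyMassIndex", "AbsenteeismTimeInHours")
# One entry per distinct employee ID; its length gives the number of unique individuals
employees_id_tab <- table(data$ID)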
cat("Number of unique individuals in the dataset: ", length(employees_id_tab),".",
    "Total number of rows in the dataset: ", nrow(data),".",
    "Number of Independent variables in the dataset: ", ncol(data)-1)
## Number of unique individuals in the dataset:  36 . Total number of rows in the dataset:  740 . Number of Independent variables in the dataset:  20

Before diving into the analysis, it was important to check whether the dataset had any missing values. The presence of missing data in the independent variables could potentially skew results and impact the analysis of the dependent variable.

I ran the check and was relieved to find that there were no missing values in the dataset. This meant that I could proceed with the observations and analysis without any concerns about missing data affecting the results.

any(is.na(data))
## [1] FALSE
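Had there been gaps, a per-column count would show where they sit; a one-line check along these lines (not part of the original output) could be used:

# Number of missing values in each column (all zero for this dataset)
colSums(is.na(data))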

Now it was time to get a better understanding of the dataset by checking the summary and distribution of each variable. The only feature stored as text was WorkLoadAverageAday; all other features were read in as numeric, even though several of them (such as Seasons, Education, and the reason codes) are really categorical.

I wanted to understand the central tendencies, spread, and shape of each feature in the dataset, which would give me a better idea of how each feature could impact the dependent variable of absenteeism time.

By examining the distribution of each feature, I would be able to identify any potential outliers or patterns that could help me in the analysis.

summary(data)
##        ID        ReasonForAbsence MonthOfAbsence    DayOfTheWeek  
##  Min.   : 1.00   Min.   : 0.00    Min.   : 0.000   Min.   :2.000  
##  1st Qu.: 9.00   1st Qu.:13.00    1st Qu.: 3.000   1st Qu.:3.000  
##  Median :18.00   Median :23.00    Median : 6.000   Median :4.000  
##  Mean   :18.02   Mean   :19.22    Mean   : 6.324   Mean   :3.915  
##  3rd Qu.:28.00   3rd Qu.:26.00    3rd Qu.: 9.000   3rd Qu.:5.000  
##  Max.   :36.00   Max.   :28.00    Max.   :12.000   Max.   :6.000  
##     Seasons      TransportationExpense DistanceFromResidenceToWork
##  Min.   :1.000   Min.   :118.0         Min.   : 5.00              
##  1st Qu.:2.000   1st Qu.:179.0         1st Qu.:16.00              
##  Median :3.000   Median :225.0         Median :26.00              
##  Mean   :2.545   Mean   :221.3         Mean   :29.63              
##  3rd Qu.:4.000   3rd Qu.:260.0         3rd Qu.:50.00              
##  Max.   :4.000   Max.   :388.0         Max.   :52.00              
##   ServiceTime         Age        WorkLoadAverageAday   HitTarget     
##  Min.   : 1.00   Min.   :27.00   Length:740          Min.   : 81.00  
##  1st Qu.: 9.00   1st Qu.:31.00   Class :character    1st Qu.: 93.00  
##  Median :13.00   Median :37.00   Mode  :character    Median : 95.00  
##  Mean   :12.55   Mean   :36.45                       Mean   : 94.59  
##  3rd Qu.:16.00   3rd Qu.:40.00                       3rd Qu.: 97.00  
##  Max.   :29.00   Max.   :58.00                       Max.   :100.00  
##  DisciplinaryFailure   Education          Son        SocialDrinker   
##  Min.   :0.00000     Min.   :1.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.00000     1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.00000     Median :1.000   Median :1.000   Median :1.0000  
##  Mean   :0.05405     Mean   :1.292   Mean   :1.019   Mean   :0.5676  
##  3rd Qu.:0.00000     3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.00000     Max.   :4.000   Max.   :4.000   Max.   :1.0000  
##   SocialSmoker          Pet             Weight           Height     
##  Min.   :0.00000   Min.   :0.0000   Min.   : 56.00   Min.   :163.0  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 69.00   1st Qu.:169.0  
##  Median :0.00000   Median :0.0000   Median : 83.00   Median :170.0  
##  Mean   :0.07297   Mean   :0.7459   Mean   : 79.04   Mean   :172.1  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.: 89.00   3rd Qu.:172.0  
##  Max.   :1.00000   Max.   :8.0000   Max.   :108.00   Max.   :196.0  
##  BodyMassIndex   AbsenteeismTimeInHours
##  Min.   :19.00   Min.   :  0.000       
##  1st Qu.:24.00   1st Qu.:  2.000       
##  Median :25.00   Median :  3.000       
##  Mean   :26.68   Mean   :  6.924       
##  3rd Qu.:31.00   3rd Qu.:  8.000       
##  Max.   :38.00   Max.   :120.000
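To illustrate the distribution check described above, the strongly right-skewed target variable (median 3 hours, mean about 6.9, maximum 120) could be inspected with a quick histogram and boxplot; this is a sketch and not part of the original output:

# Shape of the target variable and its extreme values
hist(data$AbsenteeismTimeInHours, breaks = 40,
     main = "Absenteeism time in hours", xlab = "Hours")
boxplot(data$AbsenteeismTimeInHours, horizontal = TRUE, xlab = "Hours")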

Now I will take a closer look at the correlation between the features. There is a clear positive correlation between the age and service time of an individual, but only a weak correlation between the number of children and the education level. To understand these relationships better, I will use a heatmap.

A heatmap is a graphical representation of data that uses color to show the relationship between variables. In this case, both axes list the selected numeric variables of the dataset, and each cell shows the strength of the relationship between the intersecting pair, here measured by the linear correlation.

I use this heatmap in an exploratory role to gain insight into the relationships in the data: the more intense the color, the stronger the correlation between the variables. Looking at the heatmap gives me a descriptive picture of the data before any modeling.

subset_data_1 <- data.frame(data$ID, data$ReasonForAbsence, data$Seasons,
                            data$DistanceFromResidenceToWork, data$ServiceTime, data$Age,
                            data$Education, data$Son, data$SocialDrinker,
                            data$SocialSmoker, data$Pet)

correlation_df <- cor(subset_data_1)

# Shorter labels for both axes of the plot
colnames(correlation_df) <- rownames(correlation_df) <- c("ID", "Reason\n absence", "Seasons",
                                                          "Distance", "Service\n Time", "Age",
                                                          "Education", "Son", "Social\n Drinker",
                                                          "Social\n Smoker", "Pet")
corrplot.mixed(correlation_df, tl.col = "blue", lower.col = "blue", number.cex = .5, tl.cex = 0.5)

Dimension Reduction

pca <- prcomp(subset_data_1, center = TRUE, scale. = TRUE)
eigen(correlation_df)$values
##  [1] 2.4465874 1.8259131 1.3783901 1.1301231 1.0297561 0.8378868 0.8124494
##  [8] 0.5997981 0.4220521 0.3363494 0.1806943
fviz_eig(pca, choice = "eigenvalue", ncp = 11, barfill = "hotpink3", barcolor = "hotpink4",
         linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")

Based on Kaiser's rule, components with eigenvalues above 1 should be retained. Five eigenvalues exceed 1 here, although the fifth (1.03) does so only marginally, so I keep the first four components for the rest of the analysis.
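The same information, together with the individual and cumulative percentage of variance explained by each component, can be pulled directly from factoextra (not shown in the original output):

# Eigenvalues with percentage and cumulative percentage of explained variance
get_eigenvalue(pca)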

library("pdp")
pca_var <- get_pca_var(pca)

fviz_contrib(pca, "var", axes = 1:4, fill = "tomato3", color = "tomato4")

Based on the graph above, ServiceTime, Age, DistanceFromResidenceToWork, SocialDrinker, SocialSmoker, ID, and Education contribute the most to the first four principal components.
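To complement the contribution bar chart, a variable map of the first two components can show how these variables relate to one another; an illustrative call that is not part of the original output:

# Variable correlation circle, colored by contribution to the first two components
fviz_pca_var(pca, col.var = "contrib", repel = TRUE)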

Clustering

In the second part of the analysis, I will focus on clustering. I will use the reduced set of variables selected earlier to assess the clustering tendency of the data.

The correlogram created earlier already shows how these variables relate to each other, so the next step is to check whether the data is clusterable at all.

For this I will use the Hopkins statistic, which compares the dataset to a uniformly distributed random sample in order to evaluate the presence of clusters.

The result can be judged against a threshold of 0.5: a value close to (or below) 0.5 suggests the data is essentially uniformly distributed and unlikely to contain statistically meaningful clusters, whereas a value close to 1 lets us reject the null hypothesis of uniformity and conclude that the dataset is significantly clusterable; in other words, the data has a clear grouping pattern.

# Hopkins statistic, based on a small random sample of points and a fixed seed for reproducibility
temporary_df <- get_clust_tendency(subset_data_1, 3, seed = 1234, graph = FALSE)$hopkins_stat

print(temporary_df)
## [1] 0.8610555

According to the Hopkins test, the dataset has a strong tendency to form clusters. With 0.5 as the reference value for uniformly distributed data, the result of 0.861 shows that the data is highly clusterable: the data points have a strong tendency to group together into distinct, recognizable clusters.
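As a sanity check on this interpretation, the same statistic computed on uniformly random data of the same dimensions should land near 0.5; an illustrative comparison, not part of the original analysis:

# Hopkins statistic for a random reference set drawn uniformly within each variable's range
set.seed(1234)
random_ref <- as.data.frame(apply(subset_data_1, 2,
                                  function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_ref, 3, seed = 1234, graph = FALSE)$hopkins_stat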

Choosing the Number of Clusters

To measure the quality of clustering, I use the average silhouette method. It evaluates how well each object fits in its cluster, with a higher average silhouette width implying better clustering.

The average silhouette method determines the optimal number of clusters by calculating the average silhouette of the observations for various values of k and choosing the k that maximizes it, as described by Kaufman and Rousseeuw (1990). By computing the average silhouette across candidate values of k, I aim to find the best number of clusters.

k.means1 <- fviz_nbclust(subset_data_1, FUNcluster = kmeans, method = "silhouette") + ggtitle("K-means")
PAM.1 <- fviz_nbclust(subset_data_1, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")

grid.arrange(k.means1, PAM.1, ncol = 2, top = "Number of clusters 'Silhouette'")

When it comes to clustering methods like k-means, the goal is to divide the data into clusters in a way that minimizes the total intra-cluster variation. This is usually measured by the total within-cluster sum of squares (WSS), which I want to keep as low as possible.

One popular approach to finding the optimal number of clusters is the Elbow method, which looks at how the total WSS changes as the number of clusters grows. Essentially, I want to choose the number of clusters at the "elbow" of the curve, the point beyond which adding another cluster no longer produces a substantial drop in the total WSS.

From the Elbow plots below, the bend appears at three clusters, so I will divide the data into three distinct clusters.

k.mean2 <- fviz_nbclust(subset_data_1, FUNcluster = kmeans, method = "wss") + 
  ggtitle("K-means")
PAM.2 <- fviz_nbclust(subset_data_1, FUNcluster = cluster::pam, method = "wss") + 
  ggtitle(" PAM")

grid.arrange(k.mean2,PAM.2, ncol=2, top = "Number of clusters 'Sum of squares'")

K-means Clustering

Clustering is the process of grouping similar data points together based on their characteristics, which helps me analyze and understand the data better.

In the K-Means algorithm, I specify the number of clusters I want to create, represented by the variable 'k'. Each cluster is represented by a central point called a centroid. The algorithm assigns each data point to the closest centroid, then recomputes each centroid as the mean of the points assigned to it, and repeats these two steps until the assignments stop changing.

In summary, K-Means uses the average of the data points to identify the centroids and group the data points into clusters based on their proximity to the centroids.

kmeans_cluster <- eclust(subset_data_1, k=3, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
temp.1 <- fviz_silhouette(kmeans_cluster)
##   cluster size ave.sil.width
## 1       1  186          0.23
## 2       2  249          0.44
## 3       3  305          0.28
temp.2 <- fviz_cluster(kmeans_cluster, data = subset_data_1, ellipse.type = "convex") + theme_minimal()
grid.arrange(temp.1, temp.2, ncol=2)

The average silhouette width is 0.32, and the silhouette average width for each cluster (1, 2, and 3) is 0.23, 0.44, and 0.28, respectively. The silhouette average width provides information about the quality of the clustering, with higher values indicating better clustering. By examining the silhouette average width for each cluster, we can see which clusters are well-separated and which may need further refinement.
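To see what actually distinguishes the three k-means groups, the per-cluster means of the selected variables can be inspected; this illustrative step is not part of the original output:

# Average value of each variable within each k-means cluster
aggregate(subset_data_1, by = list(cluster = kmeans_cluster$cluster), FUN = mean)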

PAM Clustering

PAM, or "Partitioning Around Medoids", is an algorithm used in clustering, a technique for grouping together data points that have similar characteristics. The main idea behind PAM is to find a set of objects, called medoids, that serve as central representatives for each cluster.

The algorithm starts by selecting a random set of potential medoids from the data points, and then iteratively improves this selection until a final set of medoids is found. During each iteration, the algorithm reassigns each data point to the closest medoid and updates the set of medoids to the objects that minimize the sum of the dissimilarities between each data point and its closest medoid.

This process continues until the set of medoids stops changing, at which point the algorithm has found the optimal representation of the clusters in the data. The final result of the PAM algorithm is a partition of the data into clusters, where each cluster is defined by its medoid and the data points assigned to it.

PAM.clustering <- eclust(subset_data_1, k=3, FUNcluster="pam", hc_metric="pearson", graph=FALSE)
temp.1 <- fviz_silhouette(PAM.clustering)
##   cluster size ave.sil.width
## 1       1  250          0.07
## 2       2  317          0.21
## 3       3  173          0.54
temp.2 <- fviz_cluster(PAM.clustering, data = subset_data_1, ellipse.type = "convex") + theme_minimal()
grid.arrange(temp.1, temp.2, ncol=2)

The average silhouette width for the dataset is 0.24, and the average silhouette widths for clusters 1, 2, and 3 are 0.07, 0.21, and 0.54, respectively. The silhouette width measures the quality of the clustering solution, with higher values indicating a better result; the very low value for cluster 1 suggests that its points are poorly matched to their cluster.
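One advantage of PAM is that each cluster is represented by an actual observation; the medoid rows can be printed to profile a typical employee in each group (not part of the original output):

# The three medoids: real rows of subset_data_1 that represent each cluster
PAM.clustering$medoids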

Hierarchical clustering

Ward’s linkage

Ward linkage hierarchical clustering is a method of hierarchical clustering, a type of unsupervised learning in which the aim is to group similar objects into clusters. In Ward linkage, each merge is chosen so as to minimize the resulting increase in the total within-cluster variance.

The algorithm starts by treating each data point as a single cluster. It then repeatedly merges the two closest clusters into one until the desired number of clusters is reached. Under the Ward criterion, the distance between two clusters is the increase in the total within-cluster sum of squares that merging them would cause, so at each step the pair of clusters with the smallest such increase is merged.

The algorithm continues this process of merging the closest clusters until all the data points are in one cluster or until the desired number of clusters is reached. The result is a tree-like diagram called a dendrogram, which shows the hierarchical relationships between the clusters.

One of the advantages of ward linkage hierarchical clustering is that it can produce more balanced and compact clusters compared to other linkage methods. However, it is more computationally expensive than other linkage methods and is sensitive to the scale of the data.

h.ward.clustering <- eclust(subset_data_1, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="ward.D")
plot(h.ward.clustering, cex=0.5, hang=-1, main = "Dendrogram Ward's Method")
rect.hclust(h.ward.clustering, k=3, border='red')

By analyzing the heights at which these clusters join, we can determine the number of clusters and their sub-structures. In this case, we observe three main clusters at a height of approximately 1400.

The first cluster, called “cluster 1”, breaks down into three smaller sub-clusters. Meanwhile, the second cluster, “cluster 2”, is a single entity. Finally, the third cluster, “cluster 3”, is made up of two separate sub-clusters. These findings allow us to further understand the underlying structure of the data.
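To turn the dendrogram into concrete group memberships, the Ward tree can be cut at three clusters and the resulting group sizes tabulated; a small illustrative step consistent with the code above:

# Cut the Ward tree into the three clusters marked on the dendrogram
ward.groups <- cutree(h.ward.clustering, k = 3)
table(ward.groups)   # number of employees in each Ward cluster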

Single linkage

Single linkage is a type of hierarchical clustering method used to group data points based on their proximity to each other. In single linkage, the similarity between two clusters is determined by the distance between their closest data points, meaning the two closest data points in each cluster. The distance between two clusters is calculated as the minimum distance between any two data points in each cluster.

Once the distance is calculated, the two closest clusters are merged into one cluster, and the process is repeated until all data points are grouped into a single cluster or until the desired number of clusters is reached.

Single linkage is one of the simplest hierarchical clustering methods and is useful for capturing elongated or irregular structures in the data. However, it is also sensitive to noise and outliers and is prone to the "chaining" effect, in which clusters get strung together through single close pairs of points.

h.single.clustering <- eclust(subset_data_1, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method="single")
plot(h.single.clustering, cex=0.5, hang=-1, main = "Dendrogram Single linkage")
rect.hclust(h.single.clustering, k=3, border='orange')

In this scenario, cluster 3 absorbs most of the linkages, while cluster 2 is barely separable and cluster 1 is not clearly separated at all. This chaining behavior results in poorly defined clusters, so single linkage is not the most suitable method for this dataset.

Summary

My goal was to find a way to better understand the causes of employee absenteeism in a company.

After exploring various methods, I eventually settled on hierarchical clustering with Euclidean distance and Ward linkage. This approach gave me a clearer picture of which factors had the biggest impact on employee absenteeism.

As I delved deeper, I found that certain features, such as the physical profile, stress levels, and alcohol consumption, played a significant role in an employee's tendency to take time off. However, using this information to target individual employees would be legally and ethically unacceptable for the HR department.

But that does not mean the findings were useless. On the contrary, they could be valuable in shaping company policies and programs that promote overall employee well-being and reduce absenteeism. By looking at the cluster of employees who tend to miss work the least, the company could identify and try to replicate the factors that contribute to their reliable attendance.

References