Employee Absenteeism

Final Project DACSS-601
Organizational Development
Employee turnover
Absenteeism
ggplot2
Author

Felix Betancourt

Published

August 18, 2022

Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

Introduction

Absenteeism is one of topics that has an economic impact in the companies.

Absenteeism takes place when employees do not report to work. Absenteeism can take two forms, authorized or unauthorized. Management is primarily interested in reducing unauthorized absenteeism since it is more likely to cause hardships on employees and guests alike (Abraham Pizam, A. & Thornburg, S. 2000) (2).

As Henkerson and Persson (2004) (1) said, absence from work can have multiple causes, but there is little doubt that the quantitatively most important one is sick leave.

The data available in this study is related only to the population reported absent in a courier company, therefore we will try to find what might explain the total number of hours absent to work.

About the Dataset:

The dataset was found in:

https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism

It contain real data from a Courier Company, and it was created to explain the absenteeism.

Variables:

  1. Individual identification (ID)
  2. Reason for absence (ICD).

Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases

II Neoplasms

III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism

IV Endocrine, nutritional and metabolic diseases

V Mental and behavioural disorders

VI Diseases of the nervous system

VII Diseases of the eye and adnexa

VIII Diseases of the ear and mastoid process

IX Diseases of the circulatory system

X Diseases of the respiratory system

XI Diseases of the digestive system

XII Diseases of the skin and subcutaneous tissue

XIII Diseases of the musculoskeletal system and connective tissue

XIV Diseases of the genitourinary system

XV Pregnancy, childbirth and the puerperium

XVI Certain conditions originating in the perinatal period

XVII Congenital malformations, deformations and chromosomal abnormalities

XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified

XIX Injury, poisoning and certain other consequences of external causes

XX External causes of morbidity and mortality

XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

  1. Month of absence
  2. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
  3. Seasons (summer (1), autumn (2), winter (3), spring (4))
  4. Transportation expense
  5. Distance from Residence to Work (kilometers)
  6. Service time
  7. Age
  8. Work load Average/day
  9. Hit target
  10. Disciplinary failure (yes=1; no=0)
  11. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
  12. Son (number of children)
  13. Social Drinker (yes=1; no=0)
  14. Social smoker (yes=1; no=0)
  15. Pet (number of pet)
  16. Weight
  17. Height
  18. Body mass index
  19. Absenteeism time in hours (target)

Setting up and Reading the data

Let’s load packages and data.

Code
# Loading packages

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyverse))
library(formattable)
suppressPackageStartupMessages(library(kableExtra))
library(ggplot2)

# Setting working directory and loading dataset.

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")

courier <- read.csv("Absenteeism_at_work_Project.csv")

Exploring the data and wrangling it

Let’s see the structure of the dataset

Code
str(courier)
'data.frame':   740 obs. of  21 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ work_load_average_day    : chr  "239,554" "239,554" "239,554" "239,554" ...
 $ hit_target               : int  97 97 97 97 97 97 97 97 97 97 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

The dataset has 21 variables and 740 observations to explore potential explanations to the absenteeism.

In terms of the type of data we have:

  • Nine (9) Categorical variables:

    - "reason_for_absence"
    - "month_of_absence"
    - "day_of_the_week"
    - "seasons"    
    - "disciplinary_failure"  
    - "education"            
    - "social_drinker"    
    - "social_smoker"     
    - "pet"     
  • Three (3) numeric discrete:

    - "ID"
    - "work_load_average_day" 
    - "hit_target"            
  • Nine (9) numeric continuous:

    - "transportation_expense" 
    - "distance_to_work"      
    - "service_time"         
    - "age"                 
    - "son"                
    - "weight"            
    - "height"            
    - "bmi"               
    - "absenteeism_time_in_hours"

The dataset seems very clean in general, but I will adjust a few things:

  1. Variable “work_load_average_day” and “hit_target” are not clear in the description of the file. I rather to eliminate those variables from the dataset
Code
courier2 <- courier %>%
  dplyr::select(-work_load_average_day, -hit_target)
  1. On the other hand weight and height are used to calculate BMI, so I will eliminate also those 2 variables to simplify the dataset.
Code
courier2 <- courier2 %>%
  dplyr::select(-weight, -height)
  1. I will re code back to the categories name for a few relevant categorical variables to use it as character in tables, and will create categorical type of variable for Son (number of children) and ID. But I’ll save it as new variables in case I need to use it as numeric too at some point.
Code
courier2 <-courier2%>%
  mutate(Reason = case_when(
        reason_for_absence == 1 ~ "Parasitic",
        reason_for_absence == 2 ~ "Neoplasms",
        reason_for_absence == 3 ~ "Blood and Immune",
        reason_for_absence == 4 ~ "Metabolic",
        reason_for_absence == 5 ~ "Mental",
        reason_for_absence == 6 ~ "Nervous system",
        reason_for_absence == 7 ~ "Eye and Adnexa",
        reason_for_absence == 8 ~ "Ear and Mastoid",
        reason_for_absence == 9 ~ "Circulatory",
        reason_for_absence == 10 ~ "Respiratory",
        reason_for_absence == 11 ~ "Digestive",
        reason_for_absence == 12 ~ "Skin",
        reason_for_absence == 13 ~ "Musculoskeletal",
        reason_for_absence == 14 ~ "Genitourinary",
        reason_for_absence == 15 ~ "Pregnancy",
        reason_for_absence == 16 ~ "Perinatal period",
        reason_for_absence == 17 ~ "Deformations and Chromosomal",
        reason_for_absence == 18 ~ "Laboratory findings",
        reason_for_absence == 19 ~ "Injury, Poisoning",
        reason_for_absence == 20 ~ "External causes and mortality",
        reason_for_absence == 21 ~ "Health status",
        )) %>%
  mutate(Education.Level = case_when(
         education == 1 ~ "High School",
         education == 2 ~ "Bachelor",
         education == 3 ~ "Post-graduate",
         education == 4 ~ "Master or Phd",
         )) %>%
  mutate(Day.of.Week = case_when(
         day_of_the_week == 2 ~ "Monday",
         day_of_the_week == 3 ~ "Tuesday",
         day_of_the_week == 4 ~ "Wednesday",
         day_of_the_week == 5 ~ "Thursday",
         day_of_the_week == 6 ~ "Friday",
         ))%>%
  mutate(Season = case_when(
         seasons == 1 ~ "Summer",
         seasons == 2 ~ "Autunm",
         seasons == 3 ~ "Winter",
         seasons == 4 ~ "Spring",
         ))%>%
  mutate(Disciplined = case_when(
         disciplinary_failure == 1 ~ "Yes",
         disciplinary_failure == 0 ~ "No",
         ))%>%
  mutate(Drinker = case_when(
         social_drinker == 1 ~ "Yes",
         social_drinker == 0 ~ "No",
        ))%>%
  mutate(Children = case_when(
         son == 0 ~ "No Children",
         son == 1 ~ "One",
         son == 2 ~ "Two",
         son == 3 ~ "Three",
         son == 4 ~ "Four",
        ))%>%
  mutate(Smoker = case_when(
         social_smoker == 1 ~ "Yes",
         social_smoker == 0 ~ "No",
         ))

courier2$ID <- sub("^", "EE", courier$ID )

Let’s check the new Dimensions of the dataset

Code
dim.data.frame(courier2)
[1] 740  25

Now we have 25 variables (including the new recoded ones) in the dataset and 740 cases/rows.

But probably we have several observations for each employee in different rows (one entry for each absence). Let’s check how many distinct employees we have in the dataset.

Code
courier3 <- courier2 %>%
  filter(!is.na(ID) & ID != "NULL") %>%
  group_by(ID) %>%
  tally(sort = T) %>%
  arrange(desc(n))
courier3
# A tibble: 36 × 2
   ID        n
   <chr> <int>
 1 EE3     113
 2 EE28     76
 3 EE34     55
 4 EE22     46
 5 EE20     42
 6 EE11     40
 7 EE15     37
 8 EE36     34
 9 EE24     30
10 EE14     29
# … with 26 more rows

Certainly we have only 36 unique employees reported absent to work at least 1 time.

Let’s create a new dataframe with total number of absenteeism hours and grouped by Employee (ID) and Reason

Code
suppressPackageStartupMessages(library(data.table))
dt <- data.table(courier2)
dt2 <- dt[,list(Total.Hours = sum(absenteeism_time_in_hours, na.rm=T), freq = .N), by = c("ID", "Reason", "bmi", "distance_to_work", "age", "service_time", "Season", "Drinker", "Smoker", "Children", "Disciplined", "Day.of.Week", "Education.Level", "transportation_expense")]

Now let’s see a summary of total hours of absence for each Employee

Code
  ID_table <- dt2 %>% 
  group_by(ID) %>%
  summarise(Total.Hours.r = sum(Total.Hours, na.rm=TRUE), Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE))
ID_table <- ID_table %>%
  mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
  ID_table <- ID_table[with (ID_table, order(-Total.Hours.Percent)),]

ID_table <- ID_table%>%
  mutate(Cum_Percent = cumsum(Total.Hours.Percent))
  
ID_table_o <- ID_table[with (ID_table, order(-Total.Hours.Percent)),]
formattable(ID_table_o) %>% 
  kable("html", escape = F, caption = "Summary of Total Hours Absent by Employee", align = c("l", "c", "c", "c", "c", "c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Summary of Total Hours Absent by Employee
ID Total.Hours.r Hours.Mean Hours.Median Total.Hours.Percent Cum_Percent
EE11 450 14.062500 8.0 9.00 9.00
EE3 448 9.531915 8.0 8.96 17.96
EE14 356 15.478261 8.0 7.12 25.08
EE28 345 8.414634 4.0 6.90 31.98
EE34 341 8.119048 3.0 6.82 38.80
EE36 311 12.958333 2.0 6.22 45.02
EE20 295 12.291667 8.0 5.90 50.92
EE9 262 32.750000 8.0 5.24 56.16
EE15 253 12.047619 8.0 5.06 61.22
EE22 253 8.433333 3.5 5.06 66.28
EE24 238 9.916667 8.0 4.76 71.04
EE13 175 14.583333 8.0 3.50 74.54
EE10 170 7.727273 8.0 3.40 77.94
EE7 150 25.000000 6.0 3.00 80.94
EE17 126 6.631579 4.0 2.52 83.46
EE1 121 5.761905 7.0 2.42 85.88
EE18 110 6.875000 8.0 2.20 88.08
EE5 104 6.500000 8.0 2.08 90.16
EE26 83 16.600000 8.0 1.66 91.82
EE33 73 3.476190 3.0 1.46 93.28
EE6 72 10.285714 8.0 1.44 94.72
EE25 42 4.666667 3.0 0.84 95.56
EE12 34 4.857143 3.0 0.68 96.24
EE23 32 5.333333 4.0 0.64 96.88
EE30 31 4.428571 3.0 0.62 97.50
EE2 25 5.000000 8.0 0.50 98.00
EE29 21 4.200000 3.0 0.42 98.42
EE27 19 3.166667 3.0 0.38 98.80
EE16 16 8.000000 8.0 0.32 99.12
EE21 16 5.333333 8.0 0.32 99.44
EE31 16 5.333333 8.0 0.32 99.76
EE32 16 3.200000 2.0 0.32 100.08
EE19 6 2.000000 2.0 0.12 100.20
EE35 0 0.000000 0.0 0.00 100.20
EE4 0 0.000000 0.0 0.00 100.20
EE8 0 0.000000 0.0 0.00 100.20

It does seems interesting that 50% of the total hours of absence are concentrated in 7 employees only (EE11, EE3, EE14, EE28, EE34, EE36, EE20).

As most of the bibliography indicate, the most frequent reason for absence is medical reasons. So we will put some focus on what are the reasons of absence for those specific employees in order to understand their situation and help with it.

Let’s visualize Reasons for absence.

Code
box_plot_crop<-ggplot(data=dt2, aes(Reason,Total.Hours, fill=Reason)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 60))+
  labs(title="Box Plot - Total Absenteeism Hours by Reason of Absence",
        x ="Reason of Absence", y = "Total Hours")

Clearly some reasons (Circulatory, Injury-Poisoning, Musculoeskeletal and Skin) represent more absence hours than others.

  1. Let’s summarize this variable on a table to see it more in deep.
Code
  ID_table2 <- dt2 %>% 
  group_by(Reason) %>%
  summarise(Total.Hours.r = sum(Total.Hours, na.rm=TRUE), Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE))
ID_table2 <- ID_table2 %>%
  mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
  ID_table2 <- ID_table2[with (ID_table2, order(-Total.Hours.Percent)),]

ID_table2 <- ID_table2%>%
  mutate(Cum_Percent = cumsum(Total.Hours.Percent))
  
ID_table_o2 <- ID_table2[with (ID_table2, order(-Total.Hours.Percent)),]
formattable(ID_table_o2) %>% 
  kable("html", escape = F, caption = "Summary of Total Hours Absent by Reason", align = c("l", "c", "c", "c", "c", "c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Summary of Total Hours Absent by Reason
Reason Total.Hours.r Hours.Mean Hours.Median Total.Hours.Percent Cum_Percent
NA 1651 5.875445 4.0 33.02 33.02
Musculoskeletal 810 18.000000 8.0 16.20 49.22
Injury, Poisoning 729 18.692308 8.0 14.58 63.80
Respiratory 260 10.833333 8.0 5.20 69.00
Laboratory findings 217 10.850000 8.0 4.34 73.34
Skin 187 23.375000 10.0 3.74 77.08
Parasitic 182 12.133333 8.0 3.64 80.72
Digestive 177 7.375000 8.0 3.54 84.26
Circulatory 168 42.000000 24.0 3.36 87.62
Nervous system 163 20.375000 8.0 3.26 90.88
Genitourinary 159 8.368421 4.0 3.18 94.06
Eye and Adnexa 150 10.714286 8.0 3.00 97.06
Health status 35 5.833333 8.0 0.70 97.76
Ear and Mastoid 32 5.333333 5.5 0.64 98.40
Neoplasms 24 24.000000 24.0 0.48 98.88
Mental 19 6.333333 8.0 0.38 99.26
Pregnancy 16 8.000000 8.0 0.32 99.58
Metabolic 9 4.500000 4.5 0.18 99.76
Blood and Immune 8 8.000000 8.0 0.16 99.92
Deformations and Chromosomal 8 8.000000 8.0 0.16 100.08
Perinatal period 6 2.000000 2.0 0.12 100.20

We can note that 1/3 of the hours absent does not report a reason or not report what was the medical specific reason.

Then the following medical reasons concentrate about 55% (resting the 33% NA) of the total hours absent:

Musculoskeletal Injury, Poisoning Respiratory Laboratory findings Skin
Parasitic
Digestive
Circulatory

Now let’s see what are the main reasons for absence for the top 7 employees

Code
group_cols <- c("Reason", "ID")

 RID_table <- courier2 %>% 
  group_by_at(.vars = vars(one_of(group_cols))) %>% 
  summarise(Total.Hours.r = sum(absenteeism_time_in_hours, na.rm=TRUE))
`summarise()` has grouped output by 'Reason'. You can override using the
`.groups` argument.
Code
 RID_table <- RID_table %>%
  mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
    RID_table <- RID_table[with (RID_table, order(Reason, -Total.Hours.r, ID)),]
  
  RID_table <- RID_table%>%
  mutate(Cum_Percent = cumsum(Total.Hours.Percent))

formattable(RID_table) %>% 
  kable("html", escape = F, caption = "Summary of Total Hours Absent by Reason and Employee", align = c("l", "c", "c", "c", "c", "c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Summary of Total Hours Absent by Reason and Employee
Reason ID Total.Hours.r Total.Hours.Percent Cum_Percent
Blood and Immune EE15 8 100.2000000 100.20000
Circulatory EE28 112 66.8000000 66.80000
Circulatory EE24 48 28.6285714 95.42857
Circulatory EE34 8 4.7714286 100.20000
Deformations and Chromosomal EE17 8 100.2000000 100.20000
Digestive EE17 42 23.7762712 23.77627
Digestive EE28 33 18.6813559 42.45763
Digestive EE3 29 16.4169492 58.87458
Digestive EE30 16 9.0576271 67.93220
Digestive EE34 13 7.3593220 75.29153
Digestive EE20 8 4.5288136 79.82034
Digestive EE21 8 4.5288136 84.34915
Digestive EE31 8 4.5288136 88.87797
Digestive EE6 8 4.5288136 93.40678
Digestive EE1 4 2.2644068 95.67119
Digestive EE19 4 2.2644068 97.93559
Digestive EE25 3 1.6983051 99.63390
Digestive EE18 1 0.5661017 100.20000
Digestive EE14 0 0.0000000 100.20000
Ear and Mastoid EE34 13 40.7062500 40.70625
Ear and Mastoid EE17 11 34.4437500 75.15000
Ear and Mastoid EE10 8 25.0500000 100.20000
Eye and Adnexa EE15 40 26.7200000 26.72000
Eye and Adnexa EE11 24 16.0320000 42.75200
Eye and Adnexa EE13 24 16.0320000 58.78400
Eye and Adnexa EE28 19 12.6920000 71.47600
Eye and Adnexa EE10 8 5.3440000 76.82000
Eye and Adnexa EE16 8 5.3440000 82.16400
Eye and Adnexa EE21 8 5.3440000 87.50800
Eye and Adnexa EE22 5 3.3400000 90.84800
Eye and Adnexa EE27 4 2.6720000 93.52000
Eye and Adnexa EE7 4 2.6720000 96.19200
Eye and Adnexa EE1 3 2.0040000 98.19600
Eye and Adnexa EE34 3 2.0040000 100.20000
Genitourinary EE15 80 50.4150943 50.41509
Genitourinary EE7 24 15.1245283 65.53962
Genitourinary EE33 10 6.3018868 71.84151
Genitourinary EE11 8 5.0415094 76.88302
Genitourinary EE24 8 5.0415094 81.92453
Genitourinary EE29 8 5.0415094 86.96604
Genitourinary EE36 5 3.1509434 90.11698
Genitourinary EE1 4 2.5207547 92.63774
Genitourinary EE10 4 2.5207547 95.15849
Genitourinary EE28 3 1.8905660 97.04906
Genitourinary EE32 3 1.8905660 98.93962
Genitourinary EE26 2 1.2603774 100.20000
Health status EE17 16 45.8057143 45.80571
Health status EE3 9 25.7657143 71.57143
Health status EE1 8 22.9028571 94.47429
Health status EE22 2 5.7257143 100.20000
Injury, Poisoning EE11 208 28.5893004 28.58930
Injury, Poisoning EE34 153 21.0296296 49.61893
Injury, Poisoning EE20 88 12.0954733 61.71440
Injury, Poisoning EE26 64 8.7967078 70.51111
Injury, Poisoning EE14 56 7.6971193 78.20823
Injury, Poisoning EE24 32 4.3983539 82.60658
Injury, Poisoning EE36 27 3.7111111 86.31770
Injury, Poisoning EE28 24 3.2987654 89.61646
Injury, Poisoning EE10 16 2.1991770 91.81564
Injury, Poisoning EE12 12 1.6493827 93.46502
Injury, Poisoning EE1 8 1.0995885 94.56461
Injury, Poisoning EE23 8 1.0995885 95.66420
Injury, Poisoning EE33 8 1.0995885 96.76379
Injury, Poisoning EE5 8 1.0995885 97.86337
Injury, Poisoning EE6 8 1.0995885 98.96296
Injury, Poisoning EE30 6 0.8246914 99.78765
Injury, Poisoning EE29 3 0.4123457 100.20000
Laboratory findings EE14 96 44.3281106 44.32811
Laboratory findings EE11 24 11.0820276 55.41014
Laboratory findings EE2 16 7.3880184 62.79816
Laboratory findings EE22 16 7.3880184 70.18618
Laboratory findings EE3 16 7.3880184 77.57419
Laboratory findings EE9 16 7.3880184 84.96221
Laboratory findings EE12 8 3.6940092 88.65622
Laboratory findings EE18 8 3.6940092 92.35023
Laboratory findings EE34 8 3.6940092 96.04424
Laboratory findings EE17 5 2.3087558 98.35300
Laboratory findings EE28 3 1.3852535 99.73825
Laboratory findings EE1 1 0.4617512 100.20000
Mental EE3 8 42.1894737 42.18947
Mental EE34 8 42.1894737 84.37895
Mental EE36 3 15.8210526 100.20000
Metabolic EE24 8 89.0666667 89.06667
Metabolic EE32 1 11.1333333 100.20000
Musculoskeletal EE36 243 30.0600000 30.06000
Musculoskeletal EE3 99 12.2466667 42.30667
Musculoskeletal EE13 80 9.8962963 52.20296
Musculoskeletal EE11 56 6.9274074 59.13037
Musculoskeletal EE20 56 6.9274074 66.05778
Musculoskeletal EE10 40 4.9481481 71.00593
Musculoskeletal EE15 40 4.9481481 75.95407
Musculoskeletal EE24 40 4.9481481 80.90222
Musculoskeletal EE22 39 4.8244444 85.72667
Musculoskeletal EE14 32 3.9585185 89.68519
Musculoskeletal EE1 25 3.0925926 92.77778
Musculoskeletal EE5 24 2.9688889 95.74667
Musculoskeletal EE28 15 1.8555556 97.60222
Musculoskeletal EE18 8 0.9896296 98.59185
Musculoskeletal EE6 8 0.9896296 99.58148
Musculoskeletal EE33 2 0.2474074 99.82889
Musculoskeletal EE34 2 0.2474074 100.07630
Musculoskeletal EE26 1 0.1237037 100.20000
Neoplasms EE18 24 100.2000000 100.20000
Nervous system EE9 128 78.6846626 78.68466
Nervous system EE11 8 4.9177914 83.60245
Nervous system EE14 8 4.9177914 88.52025
Nervous system EE25 8 4.9177914 93.43804
Nervous system EE3 8 4.9177914 98.35583
Nervous system EE28 3 1.8441718 100.20000
Nervous system EE27 0 0.0000000 100.20000
Parasitic EE22 77 42.3923077 42.39231
Parasitic EE14 16 8.8087912 51.20110
Parasitic EE18 16 8.8087912 60.00989
Parasitic EE20 16 8.8087912 68.81868
Parasitic EE1 8 4.4043956 73.22308
Parasitic EE23 8 4.4043956 77.62747
Parasitic EE24 8 4.4043956 82.03187
Parasitic EE25 8 4.4043956 86.43626
Parasitic EE31 8 4.4043956 90.84066
Parasitic EE33 8 4.4043956 95.24505
Parasitic EE36 8 4.4043956 99.64945
Parasitic EE9 1 0.5505495 100.20000
Perinatal period EE17 3 50.1000000 50.10000
Perinatal period EE25 3 50.1000000 100.20000
Pregnancy EE20 8 50.1000000 50.10000
Pregnancy EE26 8 50.1000000 100.20000
Respiratory EE14 88 33.9138462 33.91385
Respiratory EE24 56 21.5815385 55.49538
Respiratory EE13 24 9.2492308 64.74462
Respiratory EE34 21 8.0930769 72.83769
Respiratory EE3 12 4.6246154 77.46231
Respiratory EE10 8 3.0830769 80.54538
Respiratory EE17 8 3.0830769 83.62846
Respiratory EE18 8 3.0830769 86.71154
Respiratory EE20 8 3.0830769 89.79462
Respiratory EE25 8 3.0830769 92.87769
Respiratory EE32 8 3.0830769 95.96077
Respiratory EE33 8 3.0830769 99.04385
Respiratory EE28 3 1.1561538 100.20000
Skin EE9 112 60.0128342 60.01283
Skin EE22 40 21.4331551 81.44599
Skin EE14 24 12.8598930 94.30588
Skin EE15 4 2.1433155 96.44920
Skin EE12 3 1.6074866 98.05668
Skin EE28 3 1.6074866 99.66417
Skin EE3 1 0.5358289 100.20000
NA EE3 266 16.1436705 16.14367
NA EE28 127 7.7076923 23.85136
NA EE11 122 7.4042399 31.25560
NA EE7 122 7.4042399 38.65984
NA EE34 112 6.7973349 45.45718
NA EE20 111 6.7366445 52.19382
NA EE10 86 5.2193822 57.41320
NA EE15 81 4.9159297 62.32913
NA EE22 74 4.4910963 66.82023
NA EE5 72 4.3697153 71.18995
NA EE1 60 3.6414294 74.83137
NA EE6 48 2.9131435 77.74452
NA EE13 47 2.8524531 80.59697
NA EE18 45 2.7310721 83.32804
NA EE24 38 2.3062386 85.63428
NA EE33 37 2.2455482 87.87983
NA EE14 36 2.1848577 90.06469
NA EE17 33 2.0027862 92.06747
NA EE36 25 1.5172623 93.58474
NA EE23 16 0.9710478 94.55578
NA EE27 15 0.9103574 95.46614
NA EE25 12 0.7282859 96.19443
NA EE12 11 0.6675954 96.86202
NA EE29 10 0.6069049 97.46893
NA EE2 9 0.5462144 98.01514
NA EE30 9 0.5462144 98.56136
NA EE16 8 0.4855239 99.04688
NA EE26 8 0.4855239 99.53240
NA EE9 5 0.3034525 99.83586
NA EE32 4 0.2427620 100.07862
NA EE19 2 0.1213810 100.20000
NA EE21 0 0.0000000 100.20000
NA EE31 0 0.0000000 100.20000
NA EE35 0 0.0000000 100.20000
NA EE4 0 0.0000000 100.20000
NA EE8 0 0.0000000 100.20000

In this table the Cummulative Percentage is grouped by Reason.

It means that for example 30% of the absence due to Musculoskeletal reasons is concentrated in 1 employee (EE36), and Injury, Poisoning is concentrated in 2 employees (EE11 and EE34).

Let’s visualize and summarize other categorical variables.

Education Level

Code
suppressPackageStartupMessages(library(epiDisplay)) #Using epiDisplay package to get useful tables with frequency and percentages

# Box Plot - Education Level

box_plot_crop<-ggplot(data=dt2, aes(Education.Level,Total.Hours, fill=Education.Level)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Education Level",
        x ="Education Level", y = "Total Hours")

Code
# Frequency table for Education Level
tab1(courier2$Education.Level, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Education.Level : 
              Frequency   %(NA+)   %(NA-)
High School         601     81.2     82.3
Post-graduate        79     10.7     10.8
Bachelor             46      6.2      6.3
<NA>                 10      1.4      0.0
Master or Phd         4      0.5      0.5
  Total             740    100.0    100.0

Day of the week:

Code
# Box Plot - Day of the week

box_plot_crop<-ggplot(data=dt2, aes(Day.of.Week,Total.Hours, fill=Day.of.Week)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Day of the Week",
        x ="Day of the Week", y = "Total Hours")

Code
# Frequency table for Day of the Week
tab1(courier2$Day.of.Week, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Day.of.Week : 
          Frequency Percent
Monday          161    21.8
Wednesday       156    21.1
Tuesday         154    20.8
Friday          144    19.5
Thursday        125    16.9
  Total         740   100.0

Season of the year:

Code
# Box Plot -Season

box_plot_crop<-ggplot(data=dt2, aes(Season,Total.Hours, fill=Season)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Season",
        x ="Season", y = "Total Hours")

Code
# Frequency table for Season
tab1(courier2$Season, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Season : 
        Frequency Percent
Spring        195    26.4
Autunm        192    25.9
Winter        183    24.7
Summer        170    23.0
  Total       740   100.0

Has been Disciplined or not:

Code
# Box Plot Disciplined

box_plot_crop<-ggplot(data=dt2, aes(Disciplined,Total.Hours, fill=Disciplined)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Disciplined",
        x ="Has been Disciplined?", y = "Total Hours")

Code
# Frequency table for Disciplined or not
tab1(courier2$Disciplined, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Disciplined : 
        Frequency   %(NA+)   %(NA-)
No            695     93.9     94.7
Yes            39      5.3      5.3
<NA>            6      0.8      0.0
  Total       740    100.0    100.0

Is Drinker?:

Code
# Box Plot - Drinker

box_plot_crop<-ggplot(data=dt2, aes(Drinker,Total.Hours, fill=Drinker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 40))+
  labs(title="Box Plot - Total Absenteeism Hours by Drinker",
        x ="Is Drinker?", y = "Total Hours")

Code
# Frequency table for Drinker or not

tab1(courier2$Drinker, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Drinker : 
        Frequency   %(NA+)   %(NA-)
Yes           418     56.5     56.7
No            319     43.1     43.3
<NA>            3      0.4      0.0
  Total       740    100.0    100.0

Is Smoker?:

Code
# Box Plot Smoker

box_plot_crop<-ggplot(data=dt2, aes(Smoker,Total.Hours, fill=Smoker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Smoker",
        x ="Is Smoker?", y = "Total Hours")

Code
# Frequency table for Smoker or not
tab1(courier2$Smoker, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Smoker : 
        Frequency   %(NA+)   %(NA-)
No            682     92.2     92.7
Yes            54      7.3      7.3
<NA>            4      0.5      0.0
  Total       740    100.0    100.0

Number of Children:

Code
# Box Plot - Children

box_plot_crop<-ggplot(data=dt2, aes(Children,Total.Hours, fill=Children)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Children",
        x ="How many Children", y = "Total Hours")

Code
# Frequency table for Smoker or not
tab1(courier2$Children, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)
courier2$Children : 
            Frequency   %(NA+)   %(NA-)
No Children       295     39.9     40.2
One               228     30.8     31.1
Two               155     20.9     21.1
Four               41      5.5      5.6
Three              15      2.0      2.0
<NA>                6      0.8      0.0
  Total           740    100.0    100.0

I found interesting that 94% of the employees who reported absent have not been Disciplined; 92% are not Smokers, and 81% holds a High School Degree as higher education level.

It is hard to make a conclusion from those variables because we don’t know how is the population not reported absent regarding Education Level, academic degree or whether smoker or not. Still those proportions might be relevant to consider.

However in the case of Children Seems that having between 3 to 4 children may have some impact on the absence incidence.

For the rest of the categorical variables there are not visible differences that can explain Absence.

Now let’s check any relation between Total Hours of Absence and other continuos variables.

Let’s use Scatter Plots for this

Code
# Tenure

scatter1 <- ggplot(dt2, aes(service_time,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,150))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Tenure",
        x ="Years of Service", y = "Total Hours")
scatter1

Code
#Age

scatter2 <- ggplot(dt2, aes(age,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Age",
        x ="Age", y = "Total Hours")
scatter2

Code
#BMI

scatter3 <- ggplot(dt2, aes(bmi,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs BMI",
        x ="Body Mass Index", y = "Total Hours")
scatter3

Code
#Distance to Work 

scatter5 <- ggplot(dt2, aes(distance_to_work,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Distance to work",
        x ="Distance to work in KM", y = "Total Hours")
scatter5

Code
# Transportation Expenses 

scatter6 <- ggplot(dt2, aes(transportation_expense,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Transportation Expense",
        x ="Transportation Expense", y = "Total Hours")
scatter6

I can’t see any relationship between any of the continuous variables and total hours of abcense.

Conclusions

Seems that the most closer impact to absence are some specific type of medical conditions and having 3-4 children also may increase chances to be absent.

Several bibliography related to Absenteeism suggest that Sick leave is the most common reason for being absent. The data here seems that are consistent with this conclusion, as we don’t see other variables affecting the absenteeism.

I wish I had other relevant variables in the dataset, like salary/wages, gender. Those 2 variables in particular are very relevant to predict absenteeism according to certain researches around this topic.

References