Employee Absenteeism

Final Project DACSS-601

Organizational Development

Employee turnover

Absenteeism

ggplot2

Author

Felix Betancourt

Published

August 18, 2022

Code

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

Introduction

Absenteeism is one of topics that has an economic impact in the companies.

Absenteeism takes place when employees do not report to work. Absenteeism can take two forms, authorized or unauthorized. Management is primarily interested in reducing unauthorized absenteeism since it is more likely to cause hardships on employees and guests alike (Abraham Pizam, A. & Thornburg, S. 2000) (2).

As Henkerson and Persson (2004) (1) said, absence from work can have multiple causes, but there is little doubt that the quantitatively most important one is sick leave.

The data available in this study is related only to the population reported absent in a courier company, therefore we will try to find what might explain the total number of hours absent to work.

About the Dataset:

The dataset was found in:

https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism

It contain real data from a Courier Company, and it was created to explain the absenteeism.

Variables:

Individual identification (ID)
Reason for absence (ICD).

Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases

II Neoplasms

III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism

IV Endocrine, nutritional and metabolic diseases

V Mental and behavioural disorders

VI Diseases of the nervous system

VII Diseases of the eye and adnexa

VIII Diseases of the ear and mastoid process

IX Diseases of the circulatory system

X Diseases of the respiratory system

XI Diseases of the digestive system

XII Diseases of the skin and subcutaneous tissue

XIII Diseases of the musculoskeletal system and connective tissue

XIV Diseases of the genitourinary system

XV Pregnancy, childbirth and the puerperium

XVI Certain conditions originating in the perinatal period

XVII Congenital malformations, deformations and chromosomal abnormalities

XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified

XIX Injury, poisoning and certain other consequences of external causes

XX External causes of morbidity and mortality

XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

Month of absence
Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
Seasons (summer (1), autumn (2), winter (3), spring (4))
Transportation expense
Distance from Residence to Work (kilometers)
Service time
Age
Work load Average/day
Hit target
Disciplinary failure (yes=1; no=0)
Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
Son (number of children)
Social Drinker (yes=1; no=0)
Social smoker (yes=1; no=0)
Pet (number of pet)
Weight
Height
Body mass index
Absenteeism time in hours (target)

Setting up and Reading the data

Let’s load packages and data.

Code

# Loading packages

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyverse))
library(formattable)
suppressPackageStartupMessages(library(kableExtra))
library(ggplot2)

# Setting working directory and loading dataset.

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")

courier <- read.csv("Absenteeism_at_work_Project.csv")

Exploring the data and wrangling it

Let’s see the structure of the dataset

Code

str(courier)

'data.frame':   740 obs. of  21 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ work_load_average_day    : chr  "239,554" "239,554" "239,554" "239,554" ...
 $ hit_target               : int  97 97 97 97 97 97 97 97 97 97 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

The dataset has 21 variables and 740 observations to explore potential explanations to the absenteeism.

In terms of the type of data we have:

Nine (9) Categorical variables:

- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"    
- "disciplinary_failure"  
- "education"            
- "social_drinker"    
- "social_smoker"     
- "pet"

Three (3) numeric discrete:

- "ID"
- "work_load_average_day" 
- "hit_target"

Nine (9) numeric continuous:

- "transportation_expense" 
- "distance_to_work"      
- "service_time"         
- "age"                 
- "son"                
- "weight"            
- "height"            
- "bmi"               
- "absenteeism_time_in_hours"

The dataset seems very clean in general, but I will adjust a few things:

Variable “work_load_average_day” and “hit_target” are not clear in the description of the file. I rather to eliminate those variables from the dataset

Code

courier2 <- courier %>%
  dplyr::select(-work_load_average_day, -hit_target)

On the other hand weight and height are used to calculate BMI, so I will eliminate also those 2 variables to simplify the dataset.

Code

courier2 <- courier2 %>%
  dplyr::select(-weight, -height)

I will re code back to the categories name for a few relevant categorical variables to use it as character in tables, and will create categorical type of variable for Son (number of children) and ID. But I’ll save it as new variables in case I need to use it as numeric too at some point.

Code

courier2 <-courier2%>%
  mutate(Reason = case_when(
        reason_for_absence == 1 ~ "Parasitic",
        reason_for_absence == 2 ~ "Neoplasms",
        reason_for_absence == 3 ~ "Blood and Immune",
        reason_for_absence == 4 ~ "Metabolic",
        reason_for_absence == 5 ~ "Mental",
        reason_for_absence == 6 ~ "Nervous system",
        reason_for_absence == 7 ~ "Eye and Adnexa",
        reason_for_absence == 8 ~ "Ear and Mastoid",
        reason_for_absence == 9 ~ "Circulatory",
        reason_for_absence == 10 ~ "Respiratory",
        reason_for_absence == 11 ~ "Digestive",
        reason_for_absence == 12 ~ "Skin",
        reason_for_absence == 13 ~ "Musculoskeletal",
        reason_for_absence == 14 ~ "Genitourinary",
        reason_for_absence == 15 ~ "Pregnancy",
        reason_for_absence == 16 ~ "Perinatal period",
        reason_for_absence == 17 ~ "Deformations and Chromosomal",
        reason_for_absence == 18 ~ "Laboratory findings",
        reason_for_absence == 19 ~ "Injury, Poisoning",
        reason_for_absence == 20 ~ "External causes and mortality",
        reason_for_absence == 21 ~ "Health status",
        )) %>%
  mutate(Education.Level = case_when(
         education == 1 ~ "High School",
         education == 2 ~ "Bachelor",
         education == 3 ~ "Post-graduate",
         education == 4 ~ "Master or Phd",
         )) %>%
  mutate(Day.of.Week = case_when(
         day_of_the_week == 2 ~ "Monday",
         day_of_the_week == 3 ~ "Tuesday",
         day_of_the_week == 4 ~ "Wednesday",
         day_of_the_week == 5 ~ "Thursday",
         day_of_the_week == 6 ~ "Friday",
         ))%>%
  mutate(Season = case_when(
         seasons == 1 ~ "Summer",
         seasons == 2 ~ "Autunm",
         seasons == 3 ~ "Winter",
         seasons == 4 ~ "Spring",
         ))%>%
  mutate(Disciplined = case_when(
         disciplinary_failure == 1 ~ "Yes",
         disciplinary_failure == 0 ~ "No",
         ))%>%
  mutate(Drinker = case_when(
         social_drinker == 1 ~ "Yes",
         social_drinker == 0 ~ "No",
        ))%>%
  mutate(Children = case_when(
         son == 0 ~ "No Children",
         son == 1 ~ "One",
         son == 2 ~ "Two",
         son == 3 ~ "Three",
         son == 4 ~ "Four",
        ))%>%
  mutate(Smoker = case_when(
         social_smoker == 1 ~ "Yes",
         social_smoker == 0 ~ "No",
         ))

courier2$ID <- sub("^", "EE", courier$ID )

Let’s check the new Dimensions of the dataset

Code

dim.data.frame(courier2)

[1] 740  25

Now we have 25 variables (including the new recoded ones) in the dataset and 740 cases/rows.

But probably we have several observations for each employee in different rows (one entry for each absence). Let’s check how many distinct employees we have in the dataset.

Code

courier3 <- courier2 %>%
  filter(!is.na(ID) & ID != "NULL") %>%
  group_by(ID) %>%
  tally(sort = T) %>%
  arrange(desc(n))
courier3

# A tibble: 36 × 2
   ID        n
   <chr> <int>
 1 EE3     113
 2 EE28     76
 3 EE34     55
 4 EE22     46
 5 EE20     42
 6 EE11     40
 7 EE15     37
 8 EE36     34
 9 EE24     30
10 EE14     29
# … with 26 more rows

Certainly we have only 36 unique employees reported absent to work at least 1 time.

Let’s create a new dataframe with total number of absenteeism hours and grouped by Employee (ID) and Reason

Code

suppressPackageStartupMessages(library(data.table))
dt <- data.table(courier2)
dt2 <- dt[,list(Total.Hours = sum(absenteeism_time_in_hours, na.rm=T), freq = .N), by = c("ID", "Reason", "bmi", "distance_to_work", "age", "service_time", "Season", "Drinker", "Smoker", "Children", "Disciplined", "Day.of.Week", "Education.Level", "transportation_expense")]

Now let’s see a summary of total hours of absence for each Employee

Code

  ID_table <- dt2 %>% 
  group_by(ID) %>%
  summarise(Total.Hours.r = sum(Total.Hours, na.rm=TRUE), Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE))
ID_table <- ID_table %>%
  mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
  ID_table <- ID_table[with (ID_table, order(-Total.Hours.Percent)),]

ID_table <- ID_table%>%
  mutate(Cum_Percent = cumsum(Total.Hours.Percent))
  
ID_table_o <- ID_table[with (ID_table, order(-Total.Hours.Percent)),]
formattable(ID_table_o) %>% 
  kable("html", escape = F, caption = "Summary of Total Hours Absent by Employee", align = c("l", "c", "c", "c", "c", "c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria")

Summary of Total Hours Absent by Employee
ID	Total.Hours.r	Hours.Mean	Hours.Median	Total.Hours.Percent	Cum_Percent
EE11	450	14.062500	8.0	9.00	9.00
EE3	448	9.531915	8.0	8.96	17.96
EE14	356	15.478261	8.0	7.12	25.08
EE28	345	8.414634	4.0	6.90	31.98
EE34	341	8.119048	3.0	6.82	38.80
EE36	311	12.958333	2.0	6.22	45.02
EE20	295	12.291667	8.0	5.90	50.92
EE9	262	32.750000	8.0	5.24	56.16
EE15	253	12.047619	8.0	5.06	61.22
EE22	253	8.433333	3.5	5.06	66.28
EE24	238	9.916667	8.0	4.76	71.04
EE13	175	14.583333	8.0	3.50	74.54
EE10	170	7.727273	8.0	3.40	77.94
EE7	150	25.000000	6.0	3.00	80.94
EE17	126	6.631579	4.0	2.52	83.46
EE1	121	5.761905	7.0	2.42	85.88
EE18	110	6.875000	8.0	2.20	88.08
EE5	104	6.500000	8.0	2.08	90.16
EE26	83	16.600000	8.0	1.66	91.82
EE33	73	3.476190	3.0	1.46	93.28
EE6	72	10.285714	8.0	1.44	94.72
EE25	42	4.666667	3.0	0.84	95.56
EE12	34	4.857143	3.0	0.68	96.24
EE23	32	5.333333	4.0	0.64	96.88
EE30	31	4.428571	3.0	0.62	97.50
EE2	25	5.000000	8.0	0.50	98.00
EE29	21	4.200000	3.0	0.42	98.42
EE27	19	3.166667	3.0	0.38	98.80
EE16	16	8.000000	8.0	0.32	99.12
EE21	16	5.333333	8.0	0.32	99.44
EE31	16	5.333333	8.0	0.32	99.76
EE32	16	3.200000	2.0	0.32	100.08
EE19	6	2.000000	2.0	0.12	100.20
EE35	0	0.000000	0.0	0.00	100.20
EE4	0	0.000000	0.0	0.00	100.20
EE8	0	0.000000	0.0	0.00	100.20

It does seems interesting that 50% of the total hours of absence are concentrated in 7 employees only (EE11, EE3, EE14, EE28, EE34, EE36, EE20).

As most of the bibliography indicate, the most frequent reason for absence is medical reasons. So we will put some focus on what are the reasons of absence for those specific employees in order to understand their situation and help with it.

Let’s visualize Reasons for absence.

Code

box_plot_crop<-ggplot(data=dt2, aes(Reason,Total.Hours, fill=Reason)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 60))+
  labs(title="Box Plot - Total Absenteeism Hours by Reason of Absence",
        x ="Reason of Absence", y = "Total Hours")

Clearly some reasons (Circulatory, Injury-Poisoning, Musculoeskeletal and Skin) represent more absence hours than others.

Let’s summarize this variable on a table to see it more in deep.

Code

  ID_table2 <- dt2 %>% 
  group_by(Reason) %>%
  summarise(Total.Hours.r = sum(Total.Hours, na.rm=TRUE), Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE))
ID_table2 <- ID_table2 %>%
  mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
  ID_table2 <- ID_table2[with (ID_table2, order(-Total.Hours.Percent)),]

ID_table2 <- ID_table2%>%
  mutate(Cum_Percent = cumsum(Total.Hours.Percent))
  
ID_table_o2 <- ID_table2[with (ID_table2, order(-Total.Hours.Percent)),]
formattable(ID_table_o2) %>% 
  kable("html", escape = F, caption = "Summary of Total Hours Absent by Reason", align = c("l", "c", "c", "c", "c", "c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria")

Summary of Total Hours Absent by Reason
Reason	Total.Hours.r	Hours.Mean	Hours.Median	Total.Hours.Percent	Cum_Percent
NA	1651	5.875445	4.0	33.02	33.02
Musculoskeletal	810	18.000000	8.0	16.20	49.22
Injury, Poisoning	729	18.692308	8.0	14.58	63.80
Respiratory	260	10.833333	8.0	5.20	69.00
Laboratory findings	217	10.850000	8.0	4.34	73.34
Skin	187	23.375000	10.0	3.74	77.08
Parasitic	182	12.133333	8.0	3.64	80.72
Digestive	177	7.375000	8.0	3.54	84.26
Circulatory	168	42.000000	24.0	3.36	87.62
Nervous system	163	20.375000	8.0	3.26	90.88
Genitourinary	159	8.368421	4.0	3.18	94.06
Eye and Adnexa	150	10.714286	8.0	3.00	97.06
Health status	35	5.833333	8.0	0.70	97.76
Ear and Mastoid	32	5.333333	5.5	0.64	98.40
Neoplasms	24	24.000000	24.0	0.48	98.88
Mental	19	6.333333	8.0	0.38	99.26
Pregnancy	16	8.000000	8.0	0.32	99.58
Metabolic	9	4.500000	4.5	0.18	99.76
Blood and Immune	8	8.000000	8.0	0.16	99.92
Deformations and Chromosomal	8	8.000000	8.0	0.16	100.08
Perinatal period	6	2.000000	2.0	0.12	100.20

We can note that 1/3 of the hours absent does not report a reason or not report what was the medical specific reason.

Then the following medical reasons concentrate about 55% (resting the 33% NA) of the total hours absent:

Musculoskeletal Injury, Poisoning Respiratory Laboratory findings Skin
Parasitic
Digestive
Circulatory

Now let’s see what are the main reasons for absence for the top 7 employees

Code

group_cols <- c("Reason", "ID")

 RID_table <- courier2 %>% 
  group_by_at(.vars = vars(one_of(group_cols))) %>% 
  summarise(Total.Hours.r = sum(absenteeism_time_in_hours, na.rm=TRUE))

`summarise()` has grouped output by 'Reason'. You can override using the
`.groups` argument.

Code

 RID_table <- RID_table %>%
  mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
    RID_table <- RID_table[with (RID_table, order(Reason, -Total.Hours.r, ID)),]
  
  RID_table <- RID_table%>%
  mutate(Cum_Percent = cumsum(Total.Hours.Percent))

formattable(RID_table) %>% 
  kable("html", escape = F, caption = "Summary of Total Hours Absent by Reason and Employee", align = c("l", "c", "c", "c", "c", "c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria")

Summary of Total Hours Absent by Reason and Employee
Reason	ID	Total.Hours.r	Total.Hours.Percent	Cum_Percent
Blood and Immune	EE15	8	100.2000000	100.20000
Circulatory	EE28	112	66.8000000	66.80000
Circulatory	EE24	48	28.6285714	95.42857
Circulatory	EE34	8	4.7714286	100.20000
Deformations and Chromosomal	EE17	8	100.2000000	100.20000
Digestive	EE17	42	23.7762712	23.77627
Digestive	EE28	33	18.6813559	42.45763
Digestive	EE3	29	16.4169492	58.87458
Digestive	EE30	16	9.0576271	67.93220
Digestive	EE34	13	7.3593220	75.29153
Digestive	EE20	8	4.5288136	79.82034
Digestive	EE21	8	4.5288136	84.34915
Digestive	EE31	8	4.5288136	88.87797
Digestive	EE6	8	4.5288136	93.40678
Digestive	EE1	4	2.2644068	95.67119
Digestive	EE19	4	2.2644068	97.93559
Digestive	EE25	3	1.6983051	99.63390
Digestive	EE18	1	0.5661017	100.20000
Digestive	EE14	0	0.0000000	100.20000
Ear and Mastoid	EE34	13	40.7062500	40.70625
Ear and Mastoid	EE17	11	34.4437500	75.15000
Ear and Mastoid	EE10	8	25.0500000	100.20000
Eye and Adnexa	EE15	40	26.7200000	26.72000
Eye and Adnexa	EE11	24	16.0320000	42.75200
Eye and Adnexa	EE13	24	16.0320000	58.78400
Eye and Adnexa	EE28	19	12.6920000	71.47600
Eye and Adnexa	EE10	8	5.3440000	76.82000
Eye and Adnexa	EE16	8	5.3440000	82.16400
Eye and Adnexa	EE21	8	5.3440000	87.50800
Eye and Adnexa	EE22	5	3.3400000	90.84800
Eye and Adnexa	EE27	4	2.6720000	93.52000
Eye and Adnexa	EE7	4	2.6720000	96.19200
Eye and Adnexa	EE1	3	2.0040000	98.19600
Eye and Adnexa	EE34	3	2.0040000	100.20000
Genitourinary	EE15	80	50.4150943	50.41509
Genitourinary	EE7	24	15.1245283	65.53962
Genitourinary	EE33	10	6.3018868	71.84151
Genitourinary	EE11	8	5.0415094	76.88302
Genitourinary	EE24	8	5.0415094	81.92453
Genitourinary	EE29	8	5.0415094	86.96604
Genitourinary	EE36	5	3.1509434	90.11698
Genitourinary	EE1	4	2.5207547	92.63774
Genitourinary	EE10	4	2.5207547	95.15849
Genitourinary	EE28	3	1.8905660	97.04906
Genitourinary	EE32	3	1.8905660	98.93962
Genitourinary	EE26	2	1.2603774	100.20000
Health status	EE17	16	45.8057143	45.80571
Health status	EE3	9	25.7657143	71.57143
Health status	EE1	8	22.9028571	94.47429
Health status	EE22	2	5.7257143	100.20000
Injury, Poisoning	EE11	208	28.5893004	28.58930
Injury, Poisoning	EE34	153	21.0296296	49.61893
Injury, Poisoning	EE20	88	12.0954733	61.71440
Injury, Poisoning	EE26	64	8.7967078	70.51111
Injury, Poisoning	EE14	56	7.6971193	78.20823
Injury, Poisoning	EE24	32	4.3983539	82.60658
Injury, Poisoning	EE36	27	3.7111111	86.31770
Injury, Poisoning	EE28	24	3.2987654	89.61646
Injury, Poisoning	EE10	16	2.1991770	91.81564
Injury, Poisoning	EE12	12	1.6493827	93.46502
Injury, Poisoning	EE1	8	1.0995885	94.56461
Injury, Poisoning	EE23	8	1.0995885	95.66420
Injury, Poisoning	EE33	8	1.0995885	96.76379
Injury, Poisoning	EE5	8	1.0995885	97.86337
Injury, Poisoning	EE6	8	1.0995885	98.96296
Injury, Poisoning	EE30	6	0.8246914	99.78765
Injury, Poisoning	EE29	3	0.4123457	100.20000
Laboratory findings	EE14	96	44.3281106	44.32811
Laboratory findings	EE11	24	11.0820276	55.41014
Laboratory findings	EE2	16	7.3880184	62.79816
Laboratory findings	EE22	16	7.3880184	70.18618
Laboratory findings	EE3	16	7.3880184	77.57419
Laboratory findings	EE9	16	7.3880184	84.96221
Laboratory findings	EE12	8	3.6940092	88.65622
Laboratory findings	EE18	8	3.6940092	92.35023
Laboratory findings	EE34	8	3.6940092	96.04424
Laboratory findings	EE17	5	2.3087558	98.35300
Laboratory findings	EE28	3	1.3852535	99.73825
Laboratory findings	EE1	1	0.4617512	100.20000
Mental	EE3	8	42.1894737	42.18947
Mental	EE34	8	42.1894737	84.37895
Mental	EE36	3	15.8210526	100.20000
Metabolic	EE24	8	89.0666667	89.06667
Metabolic	EE32	1	11.1333333	100.20000
Musculoskeletal	EE36	243	30.0600000	30.06000
Musculoskeletal	EE3	99	12.2466667	42.30667
Musculoskeletal	EE13	80	9.8962963	52.20296
Musculoskeletal	EE11	56	6.9274074	59.13037
Musculoskeletal	EE20	56	6.9274074	66.05778
Musculoskeletal	EE10	40	4.9481481	71.00593
Musculoskeletal	EE15	40	4.9481481	75.95407
Musculoskeletal	EE24	40	4.9481481	80.90222
Musculoskeletal	EE22	39	4.8244444	85.72667
Musculoskeletal	EE14	32	3.9585185	89.68519
Musculoskeletal	EE1	25	3.0925926	92.77778
Musculoskeletal	EE5	24	2.9688889	95.74667
Musculoskeletal	EE28	15	1.8555556	97.60222
Musculoskeletal	EE18	8	0.9896296	98.59185
Musculoskeletal	EE6	8	0.9896296	99.58148
Musculoskeletal	EE33	2	0.2474074	99.82889
Musculoskeletal	EE34	2	0.2474074	100.07630
Musculoskeletal	EE26	1	0.1237037	100.20000
Neoplasms	EE18	24	100.2000000	100.20000
Nervous system	EE9	128	78.6846626	78.68466
Nervous system	EE11	8	4.9177914	83.60245
Nervous system	EE14	8	4.9177914	88.52025
Nervous system	EE25	8	4.9177914	93.43804
Nervous system	EE3	8	4.9177914	98.35583
Nervous system	EE28	3	1.8441718	100.20000
Nervous system	EE27	0	0.0000000	100.20000
Parasitic	EE22	77	42.3923077	42.39231
Parasitic	EE14	16	8.8087912	51.20110
Parasitic	EE18	16	8.8087912	60.00989
Parasitic	EE20	16	8.8087912	68.81868
Parasitic	EE1	8	4.4043956	73.22308
Parasitic	EE23	8	4.4043956	77.62747
Parasitic	EE24	8	4.4043956	82.03187
Parasitic	EE25	8	4.4043956	86.43626
Parasitic	EE31	8	4.4043956	90.84066
Parasitic	EE33	8	4.4043956	95.24505
Parasitic	EE36	8	4.4043956	99.64945
Parasitic	EE9	1	0.5505495	100.20000
Perinatal period	EE17	3	50.1000000	50.10000
Perinatal period	EE25	3	50.1000000	100.20000
Pregnancy	EE20	8	50.1000000	50.10000
Pregnancy	EE26	8	50.1000000	100.20000
Respiratory	EE14	88	33.9138462	33.91385
Respiratory	EE24	56	21.5815385	55.49538
Respiratory	EE13	24	9.2492308	64.74462
Respiratory	EE34	21	8.0930769	72.83769
Respiratory	EE3	12	4.6246154	77.46231
Respiratory	EE10	8	3.0830769	80.54538
Respiratory	EE17	8	3.0830769	83.62846
Respiratory	EE18	8	3.0830769	86.71154
Respiratory	EE20	8	3.0830769	89.79462
Respiratory	EE25	8	3.0830769	92.87769
Respiratory	EE32	8	3.0830769	95.96077
Respiratory	EE33	8	3.0830769	99.04385
Respiratory	EE28	3	1.1561538	100.20000
Skin	EE9	112	60.0128342	60.01283
Skin	EE22	40	21.4331551	81.44599
Skin	EE14	24	12.8598930	94.30588
Skin	EE15	4	2.1433155	96.44920
Skin	EE12	3	1.6074866	98.05668
Skin	EE28	3	1.6074866	99.66417
Skin	EE3	1	0.5358289	100.20000
NA	EE3	266	16.1436705	16.14367
NA	EE28	127	7.7076923	23.85136
NA	EE11	122	7.4042399	31.25560
NA	EE7	122	7.4042399	38.65984
NA	EE34	112	6.7973349	45.45718
NA	EE20	111	6.7366445	52.19382
NA	EE10	86	5.2193822	57.41320
NA	EE15	81	4.9159297	62.32913
NA	EE22	74	4.4910963	66.82023
NA	EE5	72	4.3697153	71.18995
NA	EE1	60	3.6414294	74.83137
NA	EE6	48	2.9131435	77.74452
NA	EE13	47	2.8524531	80.59697
NA	EE18	45	2.7310721	83.32804
NA	EE24	38	2.3062386	85.63428
NA	EE33	37	2.2455482	87.87983
NA	EE14	36	2.1848577	90.06469
NA	EE17	33	2.0027862	92.06747
NA	EE36	25	1.5172623	93.58474
NA	EE23	16	0.9710478	94.55578
NA	EE27	15	0.9103574	95.46614
NA	EE25	12	0.7282859	96.19443
NA	EE12	11	0.6675954	96.86202
NA	EE29	10	0.6069049	97.46893
NA	EE2	9	0.5462144	98.01514
NA	EE30	9	0.5462144	98.56136
NA	EE16	8	0.4855239	99.04688
NA	EE26	8	0.4855239	99.53240
NA	EE9	5	0.3034525	99.83586
NA	EE32	4	0.2427620	100.07862
NA	EE19	2	0.1213810	100.20000
NA	EE21	0	0.0000000	100.20000
NA	EE31	0	0.0000000	100.20000
NA	EE35	0	0.0000000	100.20000
NA	EE4	0	0.0000000	100.20000
NA	EE8	0	0.0000000	100.20000

In this table the Cummulative Percentage is grouped by Reason.

It means that for example 30% of the absence due to Musculoskeletal reasons is concentrated in 1 employee (EE36), and Injury, Poisoning is concentrated in 2 employees (EE11 and EE34).

Let’s visualize and summarize other categorical variables.

Education Level

Code

suppressPackageStartupMessages(library(epiDisplay)) #Using epiDisplay package to get useful tables with frequency and percentages

# Box Plot - Education Level

box_plot_crop<-ggplot(data=dt2, aes(Education.Level,Total.Hours, fill=Education.Level)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Education Level",
        x ="Education Level", y = "Total Hours")

Code

# Frequency table for Education Level
tab1(courier2$Education.Level, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Education.Level : 
              Frequency   %(NA+)   %(NA-)
High School         601     81.2     82.3
Post-graduate        79     10.7     10.8
Bachelor             46      6.2      6.3
<NA>                 10      1.4      0.0
Master or Phd         4      0.5      0.5
  Total             740    100.0    100.0

Day of the week:

Code

# Box Plot - Day of the week

box_plot_crop<-ggplot(data=dt2, aes(Day.of.Week,Total.Hours, fill=Day.of.Week)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Day of the Week",
        x ="Day of the Week", y = "Total Hours")

Code

# Frequency table for Day of the Week
tab1(courier2$Day.of.Week, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Day.of.Week : 
          Frequency Percent
Monday          161    21.8
Wednesday       156    21.1
Tuesday         154    20.8
Friday          144    19.5
Thursday        125    16.9
  Total         740   100.0

Season of the year:

Code

# Box Plot -Season

box_plot_crop<-ggplot(data=dt2, aes(Season,Total.Hours, fill=Season)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Season",
        x ="Season", y = "Total Hours")

Code

# Frequency table for Season
tab1(courier2$Season, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Season : 
        Frequency Percent
Spring        195    26.4
Autunm        192    25.9
Winter        183    24.7
Summer        170    23.0
  Total       740   100.0

Has been Disciplined or not:

Code

# Box Plot Disciplined

box_plot_crop<-ggplot(data=dt2, aes(Disciplined,Total.Hours, fill=Disciplined)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Disciplined",
        x ="Has been Disciplined?", y = "Total Hours")

Code

# Frequency table for Disciplined or not
tab1(courier2$Disciplined, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Disciplined : 
        Frequency   %(NA+)   %(NA-)
No            695     93.9     94.7
Yes            39      5.3      5.3
<NA>            6      0.8      0.0
  Total       740    100.0    100.0

Is Drinker?:

Code

# Box Plot - Drinker

box_plot_crop<-ggplot(data=dt2, aes(Drinker,Total.Hours, fill=Drinker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 40))+
  labs(title="Box Plot - Total Absenteeism Hours by Drinker",
        x ="Is Drinker?", y = "Total Hours")

Code

# Frequency table for Drinker or not

tab1(courier2$Drinker, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Drinker : 
        Frequency   %(NA+)   %(NA-)
Yes           418     56.5     56.7
No            319     43.1     43.3
<NA>            3      0.4      0.0
  Total       740    100.0    100.0

Is Smoker?:

Code

# Box Plot Smoker

box_plot_crop<-ggplot(data=dt2, aes(Smoker,Total.Hours, fill=Smoker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Smoker",
        x ="Is Smoker?", y = "Total Hours")

Code

# Frequency table for Smoker or not
tab1(courier2$Smoker, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Smoker : 
        Frequency   %(NA+)   %(NA-)
No            682     92.2     92.7
Yes            54      7.3      7.3
<NA>            4      0.5      0.0
  Total       740    100.0    100.0

Number of Children:

Code

# Box Plot - Children

box_plot_crop<-ggplot(data=dt2, aes(Children,Total.Hours, fill=Children)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Children",
        x ="How many Children", y = "Total Hours")

Code

# Frequency table for Smoker or not
tab1(courier2$Children, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)

courier2$Children : 
            Frequency   %(NA+)   %(NA-)
No Children       295     39.9     40.2
One               228     30.8     31.1
Two               155     20.9     21.1
Four               41      5.5      5.6
Three              15      2.0      2.0
<NA>                6      0.8      0.0
  Total           740    100.0    100.0

I found interesting that 94% of the employees who reported absent have not been Disciplined; 92% are not Smokers, and 81% holds a High School Degree as higher education level.

It is hard to make a conclusion from those variables because we don’t know how is the population not reported absent regarding Education Level, academic degree or whether smoker or not. Still those proportions might be relevant to consider.

However in the case of Children Seems that having between 3 to 4 children may have some impact on the absence incidence.

For the rest of the categorical variables there are not visible differences that can explain Absence.

Now let’s check any relation between Total Hours of Absence and other continuos variables.

Let’s use Scatter Plots for this

Code

# Tenure

scatter1 <- ggplot(dt2, aes(service_time,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,150))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Tenure",
        x ="Years of Service", y = "Total Hours")
scatter1

Code

#Age

scatter2 <- ggplot(dt2, aes(age,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Age",
        x ="Age", y = "Total Hours")
scatter2

Code

#BMI

scatter3 <- ggplot(dt2, aes(bmi,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs BMI",
        x ="Body Mass Index", y = "Total Hours")
scatter3

Code

#Distance to Work 

scatter5 <- ggplot(dt2, aes(distance_to_work,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Distance to work",
        x ="Distance to work in KM", y = "Total Hours")
scatter5

Code

# Transportation Expenses 

scatter6 <- ggplot(dt2, aes(transportation_expense,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Transportation Expense",
        x ="Transportation Expense", y = "Total Hours")
scatter6

I can’t see any relationship between any of the continuous variables and total hours of abcense.

Conclusions

Seems that the most closer impact to absence are some specific type of medical conditions and having 3-4 children also may increase chances to be absent.

Several bibliography related to Absenteeism suggest that Sick leave is the most common reason for being absent. The data here seems that are consistent with this conclusion, as we don’t see other variables affecting the absenteeism.

I wish I had other relevant variables in the dataset, like salary/wages, gender. Those 2 variables in particular are very relevant to predict absenteeism according to certain researches around this topic.

Introduction

About the Dataset:

Setting up and Reading the data

Exploring the data and wrangling it

Let’s see the structure of the dataset

Let’s visualize and summarize other categorical variables.

Education Level

Day of the week:

Season of the year:

Has been Disciplined or not:

Is Drinker?:

Is Smoker?:

Number of Children:

Now let’s check any relation between Total Hours of Absence and other continuos variables.

Let’s use Scatter Plots for this

Conclusions

References