Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)knitr::opts_chunk$set(echo = TRUE, warning = FALSE)Absenteeism is one of topics that has an economic impact in the companies.
Absenteeism takes place when employees do not report to work. Absenteeism can take two forms, authorized or unauthorized. Management is primarily interested in reducing unauthorized absenteeism since it is more likely to cause hardships on employees and guests alike (Abraham Pizam, A. & Thornburg, S. 2000) (2).
As Henkerson and Persson (2004) (1) said, absence from work can have multiple causes, but there is little doubt that the quantitatively most important one is sick leave.
The data available in this study is related only to the population reported absent in a courier company, therefore we will try to find what might explain the total number of hours absent to work.
The dataset was found in:
https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism
It contain real data from a Courier Company, and it was created to explain the absenteeism.
Variables:
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:
I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services.
And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
Let’s load packages and data.
# Loading packages
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyverse))
library(formattable)
suppressPackageStartupMessages(library(kableExtra))
library(ggplot2)
# Setting working directory and loading dataset.
setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")
courier <- read.csv("Absenteeism_at_work_Project.csv")str(courier)'data.frame': 740 obs. of 21 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ work_load_average_day : chr "239,554" "239,554" "239,554" "239,554" ...
$ hit_target : int 97 97 97 97 97 97 97 97 97 97 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ weight : int 90 98 89 68 90 89 80 65 95 88 ...
$ height : int 172 178 170 168 172 170 172 168 196 172 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
The dataset has 21 variables and 740 observations to explore potential explanations to the absenteeism.
In terms of the type of data we have:
Nine (9) Categorical variables:
- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"
- "disciplinary_failure"
- "education"
- "social_drinker"
- "social_smoker"
- "pet" Three (3) numeric discrete:
- "ID"
- "work_load_average_day"
- "hit_target" Nine (9) numeric continuous:
- "transportation_expense"
- "distance_to_work"
- "service_time"
- "age"
- "son"
- "weight"
- "height"
- "bmi"
- "absenteeism_time_in_hours"The dataset seems very clean in general, but I will adjust a few things:
courier2 <- courier %>%
dplyr::select(-work_load_average_day, -hit_target)courier2 <- courier2 %>%
dplyr::select(-weight, -height)courier2 <-courier2%>%
mutate(Reason = case_when(
reason_for_absence == 1 ~ "Parasitic",
reason_for_absence == 2 ~ "Neoplasms",
reason_for_absence == 3 ~ "Blood and Immune",
reason_for_absence == 4 ~ "Metabolic",
reason_for_absence == 5 ~ "Mental",
reason_for_absence == 6 ~ "Nervous system",
reason_for_absence == 7 ~ "Eye and Adnexa",
reason_for_absence == 8 ~ "Ear and Mastoid",
reason_for_absence == 9 ~ "Circulatory",
reason_for_absence == 10 ~ "Respiratory",
reason_for_absence == 11 ~ "Digestive",
reason_for_absence == 12 ~ "Skin",
reason_for_absence == 13 ~ "Musculoskeletal",
reason_for_absence == 14 ~ "Genitourinary",
reason_for_absence == 15 ~ "Pregnancy",
reason_for_absence == 16 ~ "Perinatal period",
reason_for_absence == 17 ~ "Deformations and Chromosomal",
reason_for_absence == 18 ~ "Laboratory findings",
reason_for_absence == 19 ~ "Injury, Poisoning",
reason_for_absence == 20 ~ "External causes and mortality",
reason_for_absence == 21 ~ "Health status",
)) %>%
mutate(Education.Level = case_when(
education == 1 ~ "High School",
education == 2 ~ "Bachelor",
education == 3 ~ "Post-graduate",
education == 4 ~ "Master or Phd",
)) %>%
mutate(Day.of.Week = case_when(
day_of_the_week == 2 ~ "Monday",
day_of_the_week == 3 ~ "Tuesday",
day_of_the_week == 4 ~ "Wednesday",
day_of_the_week == 5 ~ "Thursday",
day_of_the_week == 6 ~ "Friday",
))%>%
mutate(Season = case_when(
seasons == 1 ~ "Summer",
seasons == 2 ~ "Autunm",
seasons == 3 ~ "Winter",
seasons == 4 ~ "Spring",
))%>%
mutate(Disciplined = case_when(
disciplinary_failure == 1 ~ "Yes",
disciplinary_failure == 0 ~ "No",
))%>%
mutate(Drinker = case_when(
social_drinker == 1 ~ "Yes",
social_drinker == 0 ~ "No",
))%>%
mutate(Children = case_when(
son == 0 ~ "No Children",
son == 1 ~ "One",
son == 2 ~ "Two",
son == 3 ~ "Three",
son == 4 ~ "Four",
))%>%
mutate(Smoker = case_when(
social_smoker == 1 ~ "Yes",
social_smoker == 0 ~ "No",
))
courier2$ID <- sub("^", "EE", courier$ID )Let’s check the new Dimensions of the dataset
dim.data.frame(courier2)[1] 740 25
Now we have 25 variables (including the new recoded ones) in the dataset and 740 cases/rows.
But probably we have several observations for each employee in different rows (one entry for each absence). Let’s check how many distinct employees we have in the dataset.
courier3 <- courier2 %>%
filter(!is.na(ID) & ID != "NULL") %>%
group_by(ID) %>%
tally(sort = T) %>%
arrange(desc(n))
courier3# A tibble: 36 × 2
ID n
<chr> <int>
1 EE3 113
2 EE28 76
3 EE34 55
4 EE22 46
5 EE20 42
6 EE11 40
7 EE15 37
8 EE36 34
9 EE24 30
10 EE14 29
# … with 26 more rows
Certainly we have only 36 unique employees reported absent to work at least 1 time.
Let’s create a new dataframe with total number of absenteeism hours and grouped by Employee (ID) and Reason
suppressPackageStartupMessages(library(data.table))
dt <- data.table(courier2)
dt2 <- dt[,list(Total.Hours = sum(absenteeism_time_in_hours, na.rm=T), freq = .N), by = c("ID", "Reason", "bmi", "distance_to_work", "age", "service_time", "Season", "Drinker", "Smoker", "Children", "Disciplined", "Day.of.Week", "Education.Level", "transportation_expense")]Now let’s see a summary of total hours of absence for each Employee
ID_table <- dt2 %>%
group_by(ID) %>%
summarise(Total.Hours.r = sum(Total.Hours, na.rm=TRUE), Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE))
ID_table <- ID_table %>%
mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
ID_table <- ID_table[with (ID_table, order(-Total.Hours.Percent)),]
ID_table <- ID_table%>%
mutate(Cum_Percent = cumsum(Total.Hours.Percent))
ID_table_o <- ID_table[with (ID_table, order(-Total.Hours.Percent)),]
formattable(ID_table_o) %>%
kable("html", escape = F, caption = "Summary of Total Hours Absent by Employee", align = c("l", "c", "c", "c", "c", "c")) %>%
kable_classic(full_width = F, html_font = "Cambria")| ID | Total.Hours.r | Hours.Mean | Hours.Median | Total.Hours.Percent | Cum_Percent |
|---|---|---|---|---|---|
| EE11 | 450 | 14.062500 | 8.0 | 9.00 | 9.00 |
| EE3 | 448 | 9.531915 | 8.0 | 8.96 | 17.96 |
| EE14 | 356 | 15.478261 | 8.0 | 7.12 | 25.08 |
| EE28 | 345 | 8.414634 | 4.0 | 6.90 | 31.98 |
| EE34 | 341 | 8.119048 | 3.0 | 6.82 | 38.80 |
| EE36 | 311 | 12.958333 | 2.0 | 6.22 | 45.02 |
| EE20 | 295 | 12.291667 | 8.0 | 5.90 | 50.92 |
| EE9 | 262 | 32.750000 | 8.0 | 5.24 | 56.16 |
| EE15 | 253 | 12.047619 | 8.0 | 5.06 | 61.22 |
| EE22 | 253 | 8.433333 | 3.5 | 5.06 | 66.28 |
| EE24 | 238 | 9.916667 | 8.0 | 4.76 | 71.04 |
| EE13 | 175 | 14.583333 | 8.0 | 3.50 | 74.54 |
| EE10 | 170 | 7.727273 | 8.0 | 3.40 | 77.94 |
| EE7 | 150 | 25.000000 | 6.0 | 3.00 | 80.94 |
| EE17 | 126 | 6.631579 | 4.0 | 2.52 | 83.46 |
| EE1 | 121 | 5.761905 | 7.0 | 2.42 | 85.88 |
| EE18 | 110 | 6.875000 | 8.0 | 2.20 | 88.08 |
| EE5 | 104 | 6.500000 | 8.0 | 2.08 | 90.16 |
| EE26 | 83 | 16.600000 | 8.0 | 1.66 | 91.82 |
| EE33 | 73 | 3.476190 | 3.0 | 1.46 | 93.28 |
| EE6 | 72 | 10.285714 | 8.0 | 1.44 | 94.72 |
| EE25 | 42 | 4.666667 | 3.0 | 0.84 | 95.56 |
| EE12 | 34 | 4.857143 | 3.0 | 0.68 | 96.24 |
| EE23 | 32 | 5.333333 | 4.0 | 0.64 | 96.88 |
| EE30 | 31 | 4.428571 | 3.0 | 0.62 | 97.50 |
| EE2 | 25 | 5.000000 | 8.0 | 0.50 | 98.00 |
| EE29 | 21 | 4.200000 | 3.0 | 0.42 | 98.42 |
| EE27 | 19 | 3.166667 | 3.0 | 0.38 | 98.80 |
| EE16 | 16 | 8.000000 | 8.0 | 0.32 | 99.12 |
| EE21 | 16 | 5.333333 | 8.0 | 0.32 | 99.44 |
| EE31 | 16 | 5.333333 | 8.0 | 0.32 | 99.76 |
| EE32 | 16 | 3.200000 | 2.0 | 0.32 | 100.08 |
| EE19 | 6 | 2.000000 | 2.0 | 0.12 | 100.20 |
| EE35 | 0 | 0.000000 | 0.0 | 0.00 | 100.20 |
| EE4 | 0 | 0.000000 | 0.0 | 0.00 | 100.20 |
| EE8 | 0 | 0.000000 | 0.0 | 0.00 | 100.20 |
It does seems interesting that 50% of the total hours of absence are concentrated in 7 employees only (EE11, EE3, EE14, EE28, EE34, EE36, EE20).
As most of the bibliography indicate, the most frequent reason for absence is medical reasons. So we will put some focus on what are the reasons of absence for those specific employees in order to understand their situation and help with it.
Let’s visualize Reasons for absence.
box_plot_crop<-ggplot(data=dt2, aes(Reason,Total.Hours, fill=Reason))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 60))+
labs(title="Box Plot - Total Absenteeism Hours by Reason of Absence",
x ="Reason of Absence", y = "Total Hours")Clearly some reasons (Circulatory, Injury-Poisoning, Musculoeskeletal and Skin) represent more absence hours than others.
ID_table2 <- dt2 %>%
group_by(Reason) %>%
summarise(Total.Hours.r = sum(Total.Hours, na.rm=TRUE), Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE))
ID_table2 <- ID_table2 %>%
mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
ID_table2 <- ID_table2[with (ID_table2, order(-Total.Hours.Percent)),]
ID_table2 <- ID_table2%>%
mutate(Cum_Percent = cumsum(Total.Hours.Percent))
ID_table_o2 <- ID_table2[with (ID_table2, order(-Total.Hours.Percent)),]
formattable(ID_table_o2) %>%
kable("html", escape = F, caption = "Summary of Total Hours Absent by Reason", align = c("l", "c", "c", "c", "c", "c")) %>%
kable_classic(full_width = F, html_font = "Cambria")| Reason | Total.Hours.r | Hours.Mean | Hours.Median | Total.Hours.Percent | Cum_Percent |
|---|---|---|---|---|---|
| NA | 1651 | 5.875445 | 4.0 | 33.02 | 33.02 |
| Musculoskeletal | 810 | 18.000000 | 8.0 | 16.20 | 49.22 |
| Injury, Poisoning | 729 | 18.692308 | 8.0 | 14.58 | 63.80 |
| Respiratory | 260 | 10.833333 | 8.0 | 5.20 | 69.00 |
| Laboratory findings | 217 | 10.850000 | 8.0 | 4.34 | 73.34 |
| Skin | 187 | 23.375000 | 10.0 | 3.74 | 77.08 |
| Parasitic | 182 | 12.133333 | 8.0 | 3.64 | 80.72 |
| Digestive | 177 | 7.375000 | 8.0 | 3.54 | 84.26 |
| Circulatory | 168 | 42.000000 | 24.0 | 3.36 | 87.62 |
| Nervous system | 163 | 20.375000 | 8.0 | 3.26 | 90.88 |
| Genitourinary | 159 | 8.368421 | 4.0 | 3.18 | 94.06 |
| Eye and Adnexa | 150 | 10.714286 | 8.0 | 3.00 | 97.06 |
| Health status | 35 | 5.833333 | 8.0 | 0.70 | 97.76 |
| Ear and Mastoid | 32 | 5.333333 | 5.5 | 0.64 | 98.40 |
| Neoplasms | 24 | 24.000000 | 24.0 | 0.48 | 98.88 |
| Mental | 19 | 6.333333 | 8.0 | 0.38 | 99.26 |
| Pregnancy | 16 | 8.000000 | 8.0 | 0.32 | 99.58 |
| Metabolic | 9 | 4.500000 | 4.5 | 0.18 | 99.76 |
| Blood and Immune | 8 | 8.000000 | 8.0 | 0.16 | 99.92 |
| Deformations and Chromosomal | 8 | 8.000000 | 8.0 | 0.16 | 100.08 |
| Perinatal period | 6 | 2.000000 | 2.0 | 0.12 | 100.20 |
We can note that 1/3 of the hours absent does not report a reason or not report what was the medical specific reason.
Then the following medical reasons concentrate about 55% (resting the 33% NA) of the total hours absent:
Musculoskeletal Injury, Poisoning Respiratory Laboratory findings Skin
Parasitic
Digestive
Circulatory
Now let’s see what are the main reasons for absence for the top 7 employees
group_cols <- c("Reason", "ID")
RID_table <- courier2 %>%
group_by_at(.vars = vars(one_of(group_cols))) %>%
summarise(Total.Hours.r = sum(absenteeism_time_in_hours, na.rm=TRUE))`summarise()` has grouped output by 'Reason'. You can override using the
`.groups` argument.
RID_table <- RID_table %>%
mutate(Total.Hours.Percent = Total.Hours.r/(sum(Total.Hours.r))*100.2)
RID_table <- RID_table[with (RID_table, order(Reason, -Total.Hours.r, ID)),]
RID_table <- RID_table%>%
mutate(Cum_Percent = cumsum(Total.Hours.Percent))
formattable(RID_table) %>%
kable("html", escape = F, caption = "Summary of Total Hours Absent by Reason and Employee", align = c("l", "c", "c", "c", "c", "c")) %>%
kable_classic(full_width = F, html_font = "Cambria")| Reason | ID | Total.Hours.r | Total.Hours.Percent | Cum_Percent |
|---|---|---|---|---|
| Blood and Immune | EE15 | 8 | 100.2000000 | 100.20000 |
| Circulatory | EE28 | 112 | 66.8000000 | 66.80000 |
| Circulatory | EE24 | 48 | 28.6285714 | 95.42857 |
| Circulatory | EE34 | 8 | 4.7714286 | 100.20000 |
| Deformations and Chromosomal | EE17 | 8 | 100.2000000 | 100.20000 |
| Digestive | EE17 | 42 | 23.7762712 | 23.77627 |
| Digestive | EE28 | 33 | 18.6813559 | 42.45763 |
| Digestive | EE3 | 29 | 16.4169492 | 58.87458 |
| Digestive | EE30 | 16 | 9.0576271 | 67.93220 |
| Digestive | EE34 | 13 | 7.3593220 | 75.29153 |
| Digestive | EE20 | 8 | 4.5288136 | 79.82034 |
| Digestive | EE21 | 8 | 4.5288136 | 84.34915 |
| Digestive | EE31 | 8 | 4.5288136 | 88.87797 |
| Digestive | EE6 | 8 | 4.5288136 | 93.40678 |
| Digestive | EE1 | 4 | 2.2644068 | 95.67119 |
| Digestive | EE19 | 4 | 2.2644068 | 97.93559 |
| Digestive | EE25 | 3 | 1.6983051 | 99.63390 |
| Digestive | EE18 | 1 | 0.5661017 | 100.20000 |
| Digestive | EE14 | 0 | 0.0000000 | 100.20000 |
| Ear and Mastoid | EE34 | 13 | 40.7062500 | 40.70625 |
| Ear and Mastoid | EE17 | 11 | 34.4437500 | 75.15000 |
| Ear and Mastoid | EE10 | 8 | 25.0500000 | 100.20000 |
| Eye and Adnexa | EE15 | 40 | 26.7200000 | 26.72000 |
| Eye and Adnexa | EE11 | 24 | 16.0320000 | 42.75200 |
| Eye and Adnexa | EE13 | 24 | 16.0320000 | 58.78400 |
| Eye and Adnexa | EE28 | 19 | 12.6920000 | 71.47600 |
| Eye and Adnexa | EE10 | 8 | 5.3440000 | 76.82000 |
| Eye and Adnexa | EE16 | 8 | 5.3440000 | 82.16400 |
| Eye and Adnexa | EE21 | 8 | 5.3440000 | 87.50800 |
| Eye and Adnexa | EE22 | 5 | 3.3400000 | 90.84800 |
| Eye and Adnexa | EE27 | 4 | 2.6720000 | 93.52000 |
| Eye and Adnexa | EE7 | 4 | 2.6720000 | 96.19200 |
| Eye and Adnexa | EE1 | 3 | 2.0040000 | 98.19600 |
| Eye and Adnexa | EE34 | 3 | 2.0040000 | 100.20000 |
| Genitourinary | EE15 | 80 | 50.4150943 | 50.41509 |
| Genitourinary | EE7 | 24 | 15.1245283 | 65.53962 |
| Genitourinary | EE33 | 10 | 6.3018868 | 71.84151 |
| Genitourinary | EE11 | 8 | 5.0415094 | 76.88302 |
| Genitourinary | EE24 | 8 | 5.0415094 | 81.92453 |
| Genitourinary | EE29 | 8 | 5.0415094 | 86.96604 |
| Genitourinary | EE36 | 5 | 3.1509434 | 90.11698 |
| Genitourinary | EE1 | 4 | 2.5207547 | 92.63774 |
| Genitourinary | EE10 | 4 | 2.5207547 | 95.15849 |
| Genitourinary | EE28 | 3 | 1.8905660 | 97.04906 |
| Genitourinary | EE32 | 3 | 1.8905660 | 98.93962 |
| Genitourinary | EE26 | 2 | 1.2603774 | 100.20000 |
| Health status | EE17 | 16 | 45.8057143 | 45.80571 |
| Health status | EE3 | 9 | 25.7657143 | 71.57143 |
| Health status | EE1 | 8 | 22.9028571 | 94.47429 |
| Health status | EE22 | 2 | 5.7257143 | 100.20000 |
| Injury, Poisoning | EE11 | 208 | 28.5893004 | 28.58930 |
| Injury, Poisoning | EE34 | 153 | 21.0296296 | 49.61893 |
| Injury, Poisoning | EE20 | 88 | 12.0954733 | 61.71440 |
| Injury, Poisoning | EE26 | 64 | 8.7967078 | 70.51111 |
| Injury, Poisoning | EE14 | 56 | 7.6971193 | 78.20823 |
| Injury, Poisoning | EE24 | 32 | 4.3983539 | 82.60658 |
| Injury, Poisoning | EE36 | 27 | 3.7111111 | 86.31770 |
| Injury, Poisoning | EE28 | 24 | 3.2987654 | 89.61646 |
| Injury, Poisoning | EE10 | 16 | 2.1991770 | 91.81564 |
| Injury, Poisoning | EE12 | 12 | 1.6493827 | 93.46502 |
| Injury, Poisoning | EE1 | 8 | 1.0995885 | 94.56461 |
| Injury, Poisoning | EE23 | 8 | 1.0995885 | 95.66420 |
| Injury, Poisoning | EE33 | 8 | 1.0995885 | 96.76379 |
| Injury, Poisoning | EE5 | 8 | 1.0995885 | 97.86337 |
| Injury, Poisoning | EE6 | 8 | 1.0995885 | 98.96296 |
| Injury, Poisoning | EE30 | 6 | 0.8246914 | 99.78765 |
| Injury, Poisoning | EE29 | 3 | 0.4123457 | 100.20000 |
| Laboratory findings | EE14 | 96 | 44.3281106 | 44.32811 |
| Laboratory findings | EE11 | 24 | 11.0820276 | 55.41014 |
| Laboratory findings | EE2 | 16 | 7.3880184 | 62.79816 |
| Laboratory findings | EE22 | 16 | 7.3880184 | 70.18618 |
| Laboratory findings | EE3 | 16 | 7.3880184 | 77.57419 |
| Laboratory findings | EE9 | 16 | 7.3880184 | 84.96221 |
| Laboratory findings | EE12 | 8 | 3.6940092 | 88.65622 |
| Laboratory findings | EE18 | 8 | 3.6940092 | 92.35023 |
| Laboratory findings | EE34 | 8 | 3.6940092 | 96.04424 |
| Laboratory findings | EE17 | 5 | 2.3087558 | 98.35300 |
| Laboratory findings | EE28 | 3 | 1.3852535 | 99.73825 |
| Laboratory findings | EE1 | 1 | 0.4617512 | 100.20000 |
| Mental | EE3 | 8 | 42.1894737 | 42.18947 |
| Mental | EE34 | 8 | 42.1894737 | 84.37895 |
| Mental | EE36 | 3 | 15.8210526 | 100.20000 |
| Metabolic | EE24 | 8 | 89.0666667 | 89.06667 |
| Metabolic | EE32 | 1 | 11.1333333 | 100.20000 |
| Musculoskeletal | EE36 | 243 | 30.0600000 | 30.06000 |
| Musculoskeletal | EE3 | 99 | 12.2466667 | 42.30667 |
| Musculoskeletal | EE13 | 80 | 9.8962963 | 52.20296 |
| Musculoskeletal | EE11 | 56 | 6.9274074 | 59.13037 |
| Musculoskeletal | EE20 | 56 | 6.9274074 | 66.05778 |
| Musculoskeletal | EE10 | 40 | 4.9481481 | 71.00593 |
| Musculoskeletal | EE15 | 40 | 4.9481481 | 75.95407 |
| Musculoskeletal | EE24 | 40 | 4.9481481 | 80.90222 |
| Musculoskeletal | EE22 | 39 | 4.8244444 | 85.72667 |
| Musculoskeletal | EE14 | 32 | 3.9585185 | 89.68519 |
| Musculoskeletal | EE1 | 25 | 3.0925926 | 92.77778 |
| Musculoskeletal | EE5 | 24 | 2.9688889 | 95.74667 |
| Musculoskeletal | EE28 | 15 | 1.8555556 | 97.60222 |
| Musculoskeletal | EE18 | 8 | 0.9896296 | 98.59185 |
| Musculoskeletal | EE6 | 8 | 0.9896296 | 99.58148 |
| Musculoskeletal | EE33 | 2 | 0.2474074 | 99.82889 |
| Musculoskeletal | EE34 | 2 | 0.2474074 | 100.07630 |
| Musculoskeletal | EE26 | 1 | 0.1237037 | 100.20000 |
| Neoplasms | EE18 | 24 | 100.2000000 | 100.20000 |
| Nervous system | EE9 | 128 | 78.6846626 | 78.68466 |
| Nervous system | EE11 | 8 | 4.9177914 | 83.60245 |
| Nervous system | EE14 | 8 | 4.9177914 | 88.52025 |
| Nervous system | EE25 | 8 | 4.9177914 | 93.43804 |
| Nervous system | EE3 | 8 | 4.9177914 | 98.35583 |
| Nervous system | EE28 | 3 | 1.8441718 | 100.20000 |
| Nervous system | EE27 | 0 | 0.0000000 | 100.20000 |
| Parasitic | EE22 | 77 | 42.3923077 | 42.39231 |
| Parasitic | EE14 | 16 | 8.8087912 | 51.20110 |
| Parasitic | EE18 | 16 | 8.8087912 | 60.00989 |
| Parasitic | EE20 | 16 | 8.8087912 | 68.81868 |
| Parasitic | EE1 | 8 | 4.4043956 | 73.22308 |
| Parasitic | EE23 | 8 | 4.4043956 | 77.62747 |
| Parasitic | EE24 | 8 | 4.4043956 | 82.03187 |
| Parasitic | EE25 | 8 | 4.4043956 | 86.43626 |
| Parasitic | EE31 | 8 | 4.4043956 | 90.84066 |
| Parasitic | EE33 | 8 | 4.4043956 | 95.24505 |
| Parasitic | EE36 | 8 | 4.4043956 | 99.64945 |
| Parasitic | EE9 | 1 | 0.5505495 | 100.20000 |
| Perinatal period | EE17 | 3 | 50.1000000 | 50.10000 |
| Perinatal period | EE25 | 3 | 50.1000000 | 100.20000 |
| Pregnancy | EE20 | 8 | 50.1000000 | 50.10000 |
| Pregnancy | EE26 | 8 | 50.1000000 | 100.20000 |
| Respiratory | EE14 | 88 | 33.9138462 | 33.91385 |
| Respiratory | EE24 | 56 | 21.5815385 | 55.49538 |
| Respiratory | EE13 | 24 | 9.2492308 | 64.74462 |
| Respiratory | EE34 | 21 | 8.0930769 | 72.83769 |
| Respiratory | EE3 | 12 | 4.6246154 | 77.46231 |
| Respiratory | EE10 | 8 | 3.0830769 | 80.54538 |
| Respiratory | EE17 | 8 | 3.0830769 | 83.62846 |
| Respiratory | EE18 | 8 | 3.0830769 | 86.71154 |
| Respiratory | EE20 | 8 | 3.0830769 | 89.79462 |
| Respiratory | EE25 | 8 | 3.0830769 | 92.87769 |
| Respiratory | EE32 | 8 | 3.0830769 | 95.96077 |
| Respiratory | EE33 | 8 | 3.0830769 | 99.04385 |
| Respiratory | EE28 | 3 | 1.1561538 | 100.20000 |
| Skin | EE9 | 112 | 60.0128342 | 60.01283 |
| Skin | EE22 | 40 | 21.4331551 | 81.44599 |
| Skin | EE14 | 24 | 12.8598930 | 94.30588 |
| Skin | EE15 | 4 | 2.1433155 | 96.44920 |
| Skin | EE12 | 3 | 1.6074866 | 98.05668 |
| Skin | EE28 | 3 | 1.6074866 | 99.66417 |
| Skin | EE3 | 1 | 0.5358289 | 100.20000 |
| NA | EE3 | 266 | 16.1436705 | 16.14367 |
| NA | EE28 | 127 | 7.7076923 | 23.85136 |
| NA | EE11 | 122 | 7.4042399 | 31.25560 |
| NA | EE7 | 122 | 7.4042399 | 38.65984 |
| NA | EE34 | 112 | 6.7973349 | 45.45718 |
| NA | EE20 | 111 | 6.7366445 | 52.19382 |
| NA | EE10 | 86 | 5.2193822 | 57.41320 |
| NA | EE15 | 81 | 4.9159297 | 62.32913 |
| NA | EE22 | 74 | 4.4910963 | 66.82023 |
| NA | EE5 | 72 | 4.3697153 | 71.18995 |
| NA | EE1 | 60 | 3.6414294 | 74.83137 |
| NA | EE6 | 48 | 2.9131435 | 77.74452 |
| NA | EE13 | 47 | 2.8524531 | 80.59697 |
| NA | EE18 | 45 | 2.7310721 | 83.32804 |
| NA | EE24 | 38 | 2.3062386 | 85.63428 |
| NA | EE33 | 37 | 2.2455482 | 87.87983 |
| NA | EE14 | 36 | 2.1848577 | 90.06469 |
| NA | EE17 | 33 | 2.0027862 | 92.06747 |
| NA | EE36 | 25 | 1.5172623 | 93.58474 |
| NA | EE23 | 16 | 0.9710478 | 94.55578 |
| NA | EE27 | 15 | 0.9103574 | 95.46614 |
| NA | EE25 | 12 | 0.7282859 | 96.19443 |
| NA | EE12 | 11 | 0.6675954 | 96.86202 |
| NA | EE29 | 10 | 0.6069049 | 97.46893 |
| NA | EE2 | 9 | 0.5462144 | 98.01514 |
| NA | EE30 | 9 | 0.5462144 | 98.56136 |
| NA | EE16 | 8 | 0.4855239 | 99.04688 |
| NA | EE26 | 8 | 0.4855239 | 99.53240 |
| NA | EE9 | 5 | 0.3034525 | 99.83586 |
| NA | EE32 | 4 | 0.2427620 | 100.07862 |
| NA | EE19 | 2 | 0.1213810 | 100.20000 |
| NA | EE21 | 0 | 0.0000000 | 100.20000 |
| NA | EE31 | 0 | 0.0000000 | 100.20000 |
| NA | EE35 | 0 | 0.0000000 | 100.20000 |
| NA | EE4 | 0 | 0.0000000 | 100.20000 |
| NA | EE8 | 0 | 0.0000000 | 100.20000 |
In this table the Cummulative Percentage is grouped by Reason.
It means that for example 30% of the absence due to Musculoskeletal reasons is concentrated in 1 employee (EE36), and Injury, Poisoning is concentrated in 2 employees (EE11 and EE34).
suppressPackageStartupMessages(library(epiDisplay)) #Using epiDisplay package to get useful tables with frequency and percentages
# Box Plot - Education Level
box_plot_crop<-ggplot(data=dt2, aes(Education.Level,Total.Hours, fill=Education.Level))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 20))+
labs(title="Box Plot - Total Absenteeism Hours by Education Level",
x ="Education Level", y = "Total Hours")# Frequency table for Education Level
tab1(courier2$Education.Level, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Education.Level :
Frequency %(NA+) %(NA-)
High School 601 81.2 82.3
Post-graduate 79 10.7 10.8
Bachelor 46 6.2 6.3
<NA> 10 1.4 0.0
Master or Phd 4 0.5 0.5
Total 740 100.0 100.0
# Box Plot - Day of the week
box_plot_crop<-ggplot(data=dt2, aes(Day.of.Week,Total.Hours, fill=Day.of.Week))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 20))+
labs(title="Box Plot - Total Absenteeism Hours by Day of the Week",
x ="Day of the Week", y = "Total Hours")# Frequency table for Day of the Week
tab1(courier2$Day.of.Week, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Day.of.Week :
Frequency Percent
Monday 161 21.8
Wednesday 156 21.1
Tuesday 154 20.8
Friday 144 19.5
Thursday 125 16.9
Total 740 100.0
# Box Plot -Season
box_plot_crop<-ggplot(data=dt2, aes(Season,Total.Hours, fill=Season))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 20))+
labs(title="Box Plot - Total Absenteeism Hours by Season",
x ="Season", y = "Total Hours")# Frequency table for Season
tab1(courier2$Season, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Season :
Frequency Percent
Spring 195 26.4
Autunm 192 25.9
Winter 183 24.7
Summer 170 23.0
Total 740 100.0
# Box Plot Disciplined
box_plot_crop<-ggplot(data=dt2, aes(Disciplined,Total.Hours, fill=Disciplined))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 30))+
labs(title="Box Plot - Total Absenteeism Hours by Disciplined",
x ="Has been Disciplined?", y = "Total Hours")# Frequency table for Disciplined or not
tab1(courier2$Disciplined, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Disciplined :
Frequency %(NA+) %(NA-)
No 695 93.9 94.7
Yes 39 5.3 5.3
<NA> 6 0.8 0.0
Total 740 100.0 100.0
# Box Plot - Drinker
box_plot_crop<-ggplot(data=dt2, aes(Drinker,Total.Hours, fill=Drinker))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 40))+
labs(title="Box Plot - Total Absenteeism Hours by Drinker",
x ="Is Drinker?", y = "Total Hours")# Frequency table for Drinker or not
tab1(courier2$Drinker, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Drinker :
Frequency %(NA+) %(NA-)
Yes 418 56.5 56.7
No 319 43.1 43.3
<NA> 3 0.4 0.0
Total 740 100.0 100.0
# Box Plot Smoker
box_plot_crop<-ggplot(data=dt2, aes(Smoker,Total.Hours, fill=Smoker))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 30))+
labs(title="Box Plot - Total Absenteeism Hours by Smoker",
x ="Is Smoker?", y = "Total Hours")# Frequency table for Smoker or not
tab1(courier2$Smoker, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Smoker :
Frequency %(NA+) %(NA-)
No 682 92.2 92.7
Yes 54 7.3 7.3
<NA> 4 0.5 0.0
Total 740 100.0 100.0
# Box Plot - Children
box_plot_crop<-ggplot(data=dt2, aes(Children,Total.Hours, fill=Children))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 30))+
labs(title="Box Plot - Total Absenteeism Hours by Children",
x ="How many Children", y = "Total Hours")# Frequency table for Smoker or not
tab1(courier2$Children, sort.group = "decreasing", cum.percent = F, missing = FALSE, graph = F)courier2$Children :
Frequency %(NA+) %(NA-)
No Children 295 39.9 40.2
One 228 30.8 31.1
Two 155 20.9 21.1
Four 41 5.5 5.6
Three 15 2.0 2.0
<NA> 6 0.8 0.0
Total 740 100.0 100.0
I found interesting that 94% of the employees who reported absent have not been Disciplined; 92% are not Smokers, and 81% holds a High School Degree as higher education level.
It is hard to make a conclusion from those variables because we don’t know how is the population not reported absent regarding Education Level, academic degree or whether smoker or not. Still those proportions might be relevant to consider.
However in the case of Children Seems that having between 3 to 4 children may have some impact on the absence incidence.
For the rest of the categorical variables there are not visible differences that can explain Absence.
# Tenure
scatter1 <- ggplot(dt2, aes(service_time,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,150))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Tenure",
x ="Years of Service", y = "Total Hours")
scatter1#Age
scatter2 <- ggplot(dt2, aes(age,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Age",
x ="Age", y = "Total Hours")
scatter2#BMI
scatter3 <- ggplot(dt2, aes(bmi,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs BMI",
x ="Body Mass Index", y = "Total Hours")
scatter3#Distance to Work
scatter5 <- ggplot(dt2, aes(distance_to_work,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Distance to work",
x ="Distance to work in KM", y = "Total Hours")
scatter5# Transportation Expenses
scatter6 <- ggplot(dt2, aes(transportation_expense,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Transportation Expense",
x ="Transportation Expense", y = "Total Hours")
scatter6I can’t see any relationship between any of the continuous variables and total hours of abcense.
Seems that the most closer impact to absence are some specific type of medical conditions and having 3-4 children also may increase chances to be absent.
Several bibliography related to Absenteeism suggest that Sick leave is the most common reason for being absent. The data here seems that are consistent with this conclusion, as we don’t see other variables affecting the absenteeism.
I wish I had other relevant variables in the dataset, like salary/wages, gender. Those 2 variables in particular are very relevant to predict absenteeism according to certain researches around this topic.