Introduction

Absenteeism is one of topics that has an economic impact in the companies.

Absenteeism takes place when employees do not report to work. Absenteeism can take two forms, authorized or unauthorized. Management is primarily interested in reducing unauthorized absenteeism since it is more likely to cause hardships on employees and guests alike (Abraham Pizam, A. & Thornburg, S. 2000) (2).

As Henkerson and Persson (2004) (1) said, absence from work can have multiple causes, but there is little doubt that the quantitatively most important one is sick leave.

In this study I will try to find causes of absenteeism on a Courier company.

About the Dataset:

The dataset was found in:

https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism

It contain real data from a Courier Company, and it was created to explain the absenteeism.

Variables:

Individual identification (ID)
Reason for absence (ICD).

Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases

II Neoplasms

III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism

IV Endocrine, nutritional and metabolic diseases

V Mental and behavioural disorders

VI Diseases of the nervous system

VII Diseases of the eye and adnexa

VIII Diseases of the ear and mastoid process

IX Diseases of the circulatory system

X Diseases of the respiratory system

XI Diseases of the digestive system

XII Diseases of the skin and subcutaneous tissue

XIII Diseases of the musculoskeletal system and connective tissue

XIV Diseases of the genitourinary system

XV Pregnancy, childbirth and the puerperium

XVI Certain conditions originating in the perinatal period

XVII Congenital malformations, deformations and chromosomal abnormalities

XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified

XIX Injury, poisoning and certain other consequences of external causes

XX External causes of morbidity and mortality

XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

Month of absence
Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
Seasons (summer (1), autumn (2), winter (3), spring (4))
Transportation expense
Distance from Residence to Work (kilometers)
Service time
Age
Work load Average/day
Hit target
Disciplinary failure (yes=1; no=0)
Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
Son (number of children)
Social Drinker (yes=1; no=0)
Social smoker (yes=1; no=0)
Pet (number of pet)
Weight
Height
Body mass index
Absenteeism time in hours (target)

Setting up and Reading the data

Let’s load packages and data.

# Loading packages

library(dplyr)
library(tidyverse)
library(formattable)
library(ggplot2)

# Setting working directory and loading dataset.

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")

courier <- read.csv("Absenteeism_at_work_Project.csv")

Exploring the data and wrangling it

Let’s see the structure of the dataset

str(courier)

'data.frame':   740 obs. of  21 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ work_load_average_day    : chr  "239,554" "239,554" "239,554" "239,554" ...
 $ hit_target               : int  97 97 97 97 97 97 97 97 97 97 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

The dataset has 21 variables and 740 observations to explore potential explanations to the absenteeism.

In terms of the type of data we have:

Nine (9) Categorical variables:

- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"    
- "disciplinary_failure"  
- "education"            
- "social_drinker"    
- "social_smoker"     
- "pet"

Three (3) numeric discrete:

- "ID"
- "work_load_average_day" 
- "hit_target"

Nine (9) numeric continuous:

- "transportation_expense" 
- "distance_to_work"      
- "service_time"         
- "age"                 
- "son"                
- "weight"            
- "height"            
- "bmi"               
- "absenteeism_time_in_hours"

The dataset seems very clean in general, but I will adjust a few things:

Variable “work_load_average_day” and “hit_target” are not clear in the description of the file. I rather to eliminate those variables from the dataset

courier2 <- select(courier, - c("work_load_average_day", "hit_target"))

On the other hand weight and height are used to calculate BMI, so I will eliminate also those 2 variables to simplify the dataset.

courier2 <- select(courier2, - c("weight", "height"))

I will re code back to the categories name for a few relevant categorical variables to use it as character in tables, and will create categorical type of variable for Son (number of children). But I’ll save it as new variables in case I need to use it as numeric too at some point.

courier2 <-courier2%>%
  mutate(Reason = case_when(
        reason_for_absence == 1 ~ "Parasitic",
        reason_for_absence == 2 ~ "Neoplasms",
        reason_for_absence == 3 ~ "Blood and Immune",
        reason_for_absence == 4 ~ "Metabolic",
        reason_for_absence == 5 ~ "Mental",
        reason_for_absence == 6 ~ "Nervous system",
        reason_for_absence == 7 ~ "Eye and Adnexa",
        reason_for_absence == 8 ~ "Ear and Mastoid",
        reason_for_absence == 9 ~ "Circulatory",
        reason_for_absence == 10 ~ "Respiratory",
        reason_for_absence == 11 ~ "Digestive",
        reason_for_absence == 12 ~ "Skin",
        reason_for_absence == 13 ~ "Musculoskeletal",
        reason_for_absence == 14 ~ "Genitourinary",
        reason_for_absence == 15 ~ "Pregnancy",
        reason_for_absence == 16 ~ "Perinatal period",
        reason_for_absence == 17 ~ "Deformations and Chromosomal",
        reason_for_absence == 18 ~ "Laboratory findings",
        reason_for_absence == 19 ~ "Injury, Poisoning",
        reason_for_absence == 20 ~ "External causes and mortality",
        reason_for_absence == 21 ~ "Health status",
        )) %>%
  mutate(Education.Level = case_when(
         education == 1 ~ "High School",
         education == 2 ~ "Bachelor",
         education == 3 ~ "Post-graduate",
         education == 4 ~ "Master or Phd",
         )) %>%
  mutate(Day.of.Week = case_when(
         day_of_the_week == 2 ~ "Monday",
         day_of_the_week == 3 ~ "Tuesday",
         day_of_the_week == 4 ~ "Wednesday",
         day_of_the_week == 5 ~ "Thursday",
         day_of_the_week == 6 ~ "Friday",
         ))%>%
  mutate(Season = case_when(
         seasons == 1 ~ "Summer",
         seasons == 2 ~ "Autunm",
         seasons == 3 ~ "Winter",
         seasons == 4 ~ "Spring",
         ))%>%
  mutate(Disciplined = case_when(
         disciplinary_failure == 1 ~ "Yes",
         disciplinary_failure == 0 ~ "No",
         ))%>%
  mutate(Drinker = case_when(
         social_drinker == 1 ~ "Yes",
         social_drinker == 0 ~ "No",
        ))%>%
  mutate(Children = case_when(
         son == 0 ~ "No Children",
         son == 1 ~ "One",
         son == 2 ~ "Two",
         son == 3 ~ "Three",
         son == 4 ~ "Four",
        ))%>%
  mutate(Smoker = case_when(
         social_smoker == 1 ~ "Yes",
         social_smoker == 0 ~ "No",
         ))

# Confirming all the changes

str(courier2)

'data.frame':   740 obs. of  25 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...
 $ Reason                   : chr  NA NA NA "Eye and Adnexa" ...
 $ Education.Level          : chr  "High School" "High School" "High School" "High School" ...
 $ Day.of.Week              : chr  "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
 $ Season                   : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ Disciplined              : chr  "No" "Yes" "No" "No" ...
 $ Drinker                  : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Children                 : chr  "Two" "One" "No Children" "Two" ...
 $ Smoker                   : chr  "No" "No" "No" "Yes" ...

Descriptive statistics

Let’s summarize Mean, median, and standard deviation for numeric continuous variables, and frequencies for categorical variables.

#Numerical continuous variables (except ID as it is only for identification purposes)

desc_summary <- courier2 %>%
  summarise(Transport.Mean = mean(transportation_expense, na.rm=TRUE), Transport_Median = median(transportation_expense, na.rm=TRUE), Transport.SD = sd(transportation_expense, na.rm=TRUE), Distance.Mean = mean(distance_to_work, na.rm=TRUE), Distance.Median = median(distance_to_work, na.rm=TRUE), Distance.SD = sd(distance_to_work, na.rm=TRUE), Service.Mean = mean(service_time, na.rm=TRUE), Service.Median = median(service_time, na.rm=TRUE), Service.SD = sd(service_time, na.rm=TRUE), Age.Mean = mean(age, na.rm=TRUE), Age.Median = median(age, na.rm=TRUE), Age.SD = sd(age, na.rm=TRUE), Son.Mean = mean(son, na.rm=TRUE), Son.Median = median(son, na.rm=TRUE), Son.SD = sd(son, na.rm=TRUE), Pet.Mean = mean(pet, na.rm=TRUE), Pet.Median = median(pet, na.rm=TRUE), Pet.SD = sd(pet, na.rm=TRUE), BMI.Mean = mean(bmi, na.rm=TRUE), BMI.Median = median(bmi, na.rm=TRUE), BMI.SD = sd(bmi, na.rm=TRUE), Hours.Mean = mean(absenteeism_time_in_hours, na.rm=TRUE), Hours.Median = median(absenteeism_time_in_hours, na.rm=TRUE), Hours.SD = sd(absenteeism_time_in_hours, na.rm=TRUE))
print(desc_summary)

  Transport.Mean Transport_Median Transport.SD Distance.Mean
1       221.0355              225     66.95418      29.66757
  Distance.Median Distance.SD Service.Mean Service.Median Service.SD
1              26    14.84812     12.56581             13   4.389813
  Age.Mean Age.Median   Age.SD Son.Mean Son.Median   Son.SD  Pet.Mean
1 36.44912         37 6.480148 1.017711          1 1.094928 0.7466125
  Pet.Median   Pet.SD BMI.Mean BMI.Median   BMI.SD Hours.Mean
1          0 1.319726 26.68406         25 4.292819   6.977716
  Hours.Median Hours.SD
1            3 13.47696

#(Formattable didn't work well with a lot of columns)

library(janitor) #Using Janitor package to get useful tables with frequency and percentages

# Frequency table for Reason
tabyl(courier2$Reason, sort = T)

              courier2$Reason   n     percent valid_percent
             Blood and Immune   1 0.001351351   0.003816794
                  Circulatory   4 0.005405405   0.015267176
 Deformations and Chromosomal   1 0.001351351   0.003816794
                    Digestive  26 0.035135135   0.099236641
              Ear and Mastoid   6 0.008108108   0.022900763
               Eye and Adnexa  15 0.020270270   0.057251908
                Genitourinary  19 0.025675676   0.072519084
                Health status   6 0.008108108   0.022900763
            Injury, Poisoning  40 0.054054054   0.152671756
          Laboratory findings  21 0.028378378   0.080152672
                       Mental   3 0.004054054   0.011450382
                    Metabolic   2 0.002702703   0.007633588
              Musculoskeletal  55 0.074324324   0.209923664
                    Neoplasms   1 0.001351351   0.003816794
               Nervous system   8 0.010810811   0.030534351
                    Parasitic  16 0.021621622   0.061068702
             Perinatal period   3 0.004054054   0.011450382
                    Pregnancy   2 0.002702703   0.007633588
                  Respiratory  25 0.033783784   0.095419847
                         Skin   8 0.010810811   0.030534351
                         <NA> 478 0.645945946            NA

# Since about 65% of the cases doesn't have an specific Reason for Absence, let's see the proportions excuding NA
tabyl(courier2$Reason, sort = T, show_na = FALSE)

              courier2$Reason  n     percent
             Blood and Immune  1 0.003816794
                  Circulatory  4 0.015267176
 Deformations and Chromosomal  1 0.003816794
                    Digestive 26 0.099236641
              Ear and Mastoid  6 0.022900763
               Eye and Adnexa 15 0.057251908
                Genitourinary 19 0.072519084
                Health status  6 0.022900763
            Injury, Poisoning 40 0.152671756
          Laboratory findings 21 0.080152672
                       Mental  3 0.011450382
                    Metabolic  2 0.007633588
              Musculoskeletal 55 0.209923664
                    Neoplasms  1 0.003816794
               Nervous system  8 0.030534351
                    Parasitic 16 0.061068702
             Perinatal period  3 0.011450382
                    Pregnancy  2 0.007633588
                  Respiratory 25 0.095419847
                         Skin  8 0.030534351

# Frequency table for Education Level
tabyl(courier2$Education.Level, sort = T)

 courier2$Education.Level   n     percent valid_percent
                 Bachelor  46 0.062162162   0.063013699
              High School 601 0.812162162   0.823287671
            Master or Phd   4 0.005405405   0.005479452
            Post-graduate  79 0.106756757   0.108219178
                     <NA>  10 0.013513514            NA

# Frequency table for Day of the Week
tabyl(courier2$Day.of.Week, sort = T)

 courier2$Day.of.Week   n   percent
               Friday 144 0.1945946
               Monday 161 0.2175676
             Thursday 125 0.1689189
              Tuesday 154 0.2081081
            Wednesday 156 0.2108108

#sort(table(courier2$Day.of.Week), decreasing=TRUE)

# Frequency table for Season
tabyl(courier2$Season, sort = T)

 courier2$Season   n   percent
          Autunm 192 0.2594595
          Spring 195 0.2635135
          Summer 170 0.2297297
          Winter 183 0.2472973

# Frequency table for Disciplined or not
tabyl(courier2$Disciplined, sort = T)

 courier2$Disciplined   n     percent valid_percent
                   No 695 0.939189189    0.94686649
                  Yes  39 0.052702703    0.05313351
                 <NA>   6 0.008108108            NA

# Frequency table for Drinker or not
tabyl(courier2$Drinker, sort = T)

 courier2$Drinker   n     percent valid_percent
               No 319 0.431081081     0.4328358
              Yes 418 0.564864865     0.5671642
             <NA>   3 0.004054054            NA

# Frequency table for Smoker or not
tabyl(courier2$Smoker, sort = T)

 courier2$Smoker   n     percent valid_percent
              No 682 0.921621622    0.92663043
             Yes  54 0.072972973    0.07336957
            <NA>   4 0.005405405            NA

I can see that the top 5 reasons for absence are (in frequency):

Musculoskeletal
Injury, Poising
Digestive
Respiratory
Lab findings

However it is relevant to note that 65% of the absences does not provide an specific medical reason. So there is still an important number of employees in this company that are absent for unknown reasons.

Also found interesting that 94% of the employees who reported absent have not been Disciplined; 92% are not Smokers, and 81% holds a High School Degree as higher education level.

It means that those variables may not help too much to explain well the hours absent as they almost do not vary in this dataset.

Now let’s Summarise basic descriptive for Absenteeism hours grouped by all Categorical variables.

# Let's create a new dataframe with total number of absenteeism hours

library(data.table)
dt <- data.table(courier2)
dt2 <- dt[,list(Total.Hours = sum(absenteeism_time_in_hours, na.rm=T), freq = .N), by = c("ID", "Reason", "bmi", "distance_to_work", "age", "service_time", "Season", "Drinker", "Smoker", "Children", "Disciplined", "Day.of.Week", "Education.Level", "transportation_expense")]
str(dt2)

Classes 'data.table' and 'data.frame':  526 obs. of  16 variables:
 $ ID                    : int  11 36 3 7 11 3 10 20 14 1 ...
 $ Reason                : chr  NA NA NA "Eye and Adnexa" ...
 $ bmi                   : int  30 31 31 24 30 31 27 23 25 29 ...
 $ distance_to_work      : int  36 13 51 5 36 51 52 50 12 11 ...
 $ age                   : int  33 50 38 39 33 38 28 36 34 37 ...
 $ service_time          : int  13 18 18 14 13 18 3 11 14 14 ...
 $ Season                : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ Drinker               : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Smoker                : chr  "No" "No" "No" "Yes" ...
 $ Children              : chr  "Two" "One" "No Children" "Two" ...
 $ Disciplined           : chr  "No" "Yes" "No" "No" ...
 $ Day.of.Week           : chr  "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
 $ Education.Level       : chr  "High School" "High School" "High School" "High School" ...
 $ transportation_expense: int  289 118 179 279 289 179 NA 260 155 235 ...
 $ Total.Hours           : int  25 0 2 4 2 5 8 20 40 8 ...
 $ freq                  : int  5 1 1 1 1 3 1 6 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

# Reason table

reason_table <- dt2 %>% 
  group_by(Reason) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
reason_table <- reason_table[with (reason_table, order(-Hours.Median)),]
formattable(reason_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Reason	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
Circulatory	42.000000	24.0	47.721414	168
Neoplasms	24.000000	24.0	NA	24
Skin	23.375000	10.0	37.063411	187
Blood and Immune	8.000000	8.0	NA	8
Deformations and Chromosomal	8.000000	8.0	NA	8
Digestive	7.375000	8.0	8.197627	177
Eye and Adnexa	10.714286	8.0	11.013478	150
Health status	5.833333	8.0	3.371449	35
Injury, Poisoning	18.692308	8.0	22.291317	729
Laboratory findings	10.850000	8.0	18.313857	217
Mental	6.333333	8.0	2.886751	19
Musculoskeletal	18.000000	8.0	37.188708	810
Nervous system	20.375000	8.0	40.369499	163
Parasitic	12.133333	8.0	15.986006	182
Pregnancy	8.000000	8.0	0.000000	16
Respiratory	10.833333	8.0	10.821342	260
Ear and Mastoid	5.333333	5.5	2.943920	32
Metabolic	4.500000	4.5	4.949747	9
Genitourinary	8.368421	4.0	10.462688	159
NA	5.875445	4.0	9.012816	1651
Perinatal period	2.000000	2.0	1.000000	6

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Reason,Total.Hours, fill=Reason)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 60))+
  labs(title="Box Plot - Total Absenteeism Hours by Reason of Absence",
        x ="Reason of Absence", y = "Total Hours")

# Disciplined table

Disciplined_table <- dt2 %>% 
  group_by(Disciplined) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Disciplined_table <- Disciplined_table[with (Disciplined_table, order(-Hours.Median)),]
formattable(Disciplined_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Disciplined	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
No	10.024742	8	17.635643	4862
NA	4.666667	5	3.723797	28
Yes	3.428571	0	20.283702	120

# Smoker table

Smoker_table <- dt2 %>% 
  group_by(Smoker) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Smoker_table <- Smoker_table[with (Smoker_table, order(-Hours.Median)),]
formattable(Smoker_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Smoker	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
No	9.557203	8.0	17.72427	4511
NA	9.000000	5.5	10.42433	36
Yes	9.260000	4.0	19.09034	463

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Smoker,Total.Hours, fill=Smoker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Smoker",
        x ="Is Smoker?", y = "Total Hours")

# Drinker table
Drinker_table <- dt2 %>% 
  group_by(Drinker) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Drinker_table <- Drinker_table[with (Drinker_table, order(-Hours.Median)),]
formattable(Drinker_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Drinker	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
Yes	11.100000	8	20.10949	3108
NA	13.666667	8	16.25833	41
No	7.658436	4	14.54640	1861

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Drinker,Total.Hours, fill=Drinker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 40))+
  labs(title="Box Plot - Total Absenteeism Hours by Drinker",
        x ="Is Drinker?", y = "Total Hours")

# Day of the week table

Day_table <- dt2 %>% 
  group_by(Day.of.Week) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Day_table <- Day_table[with (Day_table, order(-Hours.Median)),]
formattable(Day_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Day.of.Week	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
Monday	11.130081	8	14.924927	1369
Wednesday	9.918182	8	24.079049	1091
Tuesday	11.872727	5	23.742931	1306
Friday	7.351064	4	10.087475	691
Thursday	6.213483	4	5.339544	553

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Day.of.Week,Total.Hours, fill=Day.of.Week)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Day of the Week",
        x ="Day of the Week", y = "Total Hours")

#Season table

s_table <- dt2 %>% 
  group_by(Season) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
s_table <- s_table[with (s_table, order(-Hours.Median)),]
formattable(s_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Season	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
Autunm	9.088000	8	14.29133	1136
Summer	9.356589	8	16.48001	1207
Winter	11.014925	8	23.91306	1476
Spring	8.630435	4	14.69251	1191

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Season,Total.Hours, fill=Season)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Season",
        x ="Season", y = "Total Hours")

#Education Level table

e_table <- dt2 %>% 
  group_by(Education.Level) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
e_table <- e_table[with (e_table, order(-Hours.Median)),]
formattable(e_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Education.Level	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
Bachelor	6.355556	8.0	6.839576	286
High School	10.393120	8.0	19.678894	4230
Master or Phd	5.250000	5.5	3.201562	21
NA	5.700000	5.0	4.762119	57
Post-graduate	6.933333	4.5	9.434640	416

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Education.Level,Total.Hours, fill=Education.Level)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Education Level",
        x ="Education Level", y = "Total Hours")

#Children table

ch_table <- dt2 %>% 
  group_by(Children) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
ch_table <- ch_table[with (ch_table, order(-Hours.Median)),]
formattable(ch_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Children	Hours.Mean	Hours.Median	Hours.SD	Total.Hours
Four	12.695652	8	12.89973	292
Three	14.583333	8	22.33814	175
Two	12.007463	8	21.46531	1609
No Children	7.973958	5	10.15195	1531
One	8.559748	4	21.53496	1361
NA	7.000000	2	12.26377	42

# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Children,Total.Hours, fill=Children)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Children",
        x ="How many Children", y = "Total Hours")

From these tables and charts, I can see that:

In terms of (medical) Reasons of absence, the top 4 in terms of Total Hours are (in order):

Circulatory
Skin
Injury, Poisoning
Musculoskeletal

Seems that having between 3 to 4 children may have some impact on the absence incidence.
For the rest of the categorical variables there are not visible differences that can explain Absence.

Now let’s check any relation between Total Hours of Absence and other continuos variables.

Histogram for Total hours of absenteeism

#Absenteesism histogram - saved as ggplot object

hours_hist <- ggplot(dt2, aes(x=Total.Hours)) +
 geom_histogram(binwidth = 2,
                fill='steelblue',
                col='black') +
  labs(title="Histogram - Total Absenteeism Hours",
        x ="Total Hours", y = "Frequency")

hours_hist

Let’s use Scatter Plots

# Tenure

scatter1 <- ggplot(dt2, aes(service_time,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,150))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Tenure",
        x ="Years of Service", y = "Total Hours")
scatter1

#Age

scatter2 <- ggplot(dt2, aes(age,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Age",
        x ="Age", y = "Total Hours")
scatter2

#BMI

scatter3 <- ggplot(dt2, aes(bmi,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs BMI",
        x ="Body Mass Index", y = "Total Hours")
scatter3

#Distance to Work 

scatter5 <- ggplot(dt2, aes(distance_to_work,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Distance to work",
        x ="Distance to work in KM", y = "Total Hours")
scatter5

# Transportation Expenses 

scatter6 <- ggplot(dt2, aes(transportation_expense,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Transportation Expense",
        x ="Transportation Expense", y = "Total Hours")
scatter6

I can’t find any relationship between any of the continuous variables and total hours of abcense.

Seems that the most closer impact to absence are some specific type of medical conditions and having 3-4 children also may increase chances to be absent.

Several bibliography related to Absenteeism suggest that Sick leave is the most common reason for being absent. The data here seems that are consistent with this conclusion, as we don’t see other variables affecting the absenteeism.

Final Comments

What is missing from your final project?

I wish I had other relevant variables in the dataset, like salary/wages, gender. Those 2 variables in particular are very relevant to predict absenteeism according to certain researches around this topic.

What do you hope to accomplish between now and submission time?

I need to complement it with theory around absenteeism, and why is it relevant to study in organizational context.

On the other hand, I would like to improve more the tables and maybe think on other possible combination of variables to explore relationships to the total hours of absence.

HW-6