HW-6

Employee Absenteeism

Felix Betancourt
2022-08-09

Introduction

Absenteeism is one of topics that has an economic impact in the companies.

Absenteeism takes place when employees do not report to work. Absenteeism can take two forms, authorized or unauthorized. Management is primarily interested in reducing unauthorized absenteeism since it is more likely to cause hardships on employees and guests alike (Abraham Pizam, A. & Thornburg, S. 2000) (2).

As Henkerson and Persson (2004) (1) said, absence from work can have multiple causes, but there is little doubt that the quantitatively most important one is sick leave.

In this study I will try to find causes of absenteeism on a Courier company.

About the Dataset:

The dataset was found in:

https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism

It contain real data from a Courier Company, and it was created to explain the absenteeism.

Variables:

  1. Individual identification (ID)
  2. Reason for absence (ICD).

Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:

I Certain infectious and parasitic diseases

II Neoplasms

III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism

IV Endocrine, nutritional and metabolic diseases

V Mental and behavioural disorders

VI Diseases of the nervous system

VII Diseases of the eye and adnexa

VIII Diseases of the ear and mastoid process

IX Diseases of the circulatory system

X Diseases of the respiratory system

XI Diseases of the digestive system

XII Diseases of the skin and subcutaneous tissue

XIII Diseases of the musculoskeletal system and connective tissue

XIV Diseases of the genitourinary system

XV Pregnancy, childbirth and the puerperium

XVI Certain conditions originating in the perinatal period

XVII Congenital malformations, deformations and chromosomal abnormalities

XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified

XIX Injury, poisoning and certain other consequences of external causes

XX External causes of morbidity and mortality

XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

  1. Month of absence
  2. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
  3. Seasons (summer (1), autumn (2), winter (3), spring (4))
  4. Transportation expense
  5. Distance from Residence to Work (kilometers)
  6. Service time
  7. Age
  8. Work load Average/day
  9. Hit target
  10. Disciplinary failure (yes=1; no=0)
  11. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
  12. Son (number of children)
  13. Social Drinker (yes=1; no=0)
  14. Social smoker (yes=1; no=0)
  15. Pet (number of pet)
  16. Weight
  17. Height
  18. Body mass index
  19. Absenteeism time in hours (target)

Setting up and Reading the data

Let’s load packages and data.

# Loading packages

library(dplyr)
library(tidyverse)
library(formattable)
library(ggplot2)

# Setting working directory and loading dataset.

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")

courier <- read.csv("Absenteeism_at_work_Project.csv")

Exploring the data and wrangling it

Let’s see the structure of the dataset

str(courier)
'data.frame':   740 obs. of  21 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ work_load_average_day    : chr  "239,554" "239,554" "239,554" "239,554" ...
 $ hit_target               : int  97 97 97 97 97 97 97 97 97 97 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

The dataset has 21 variables and 740 observations to explore potential explanations to the absenteeism.

In terms of the type of data we have:

The dataset seems very clean in general, but I will adjust a few things:

  1. Variable “work_load_average_day” and “hit_target” are not clear in the description of the file. I rather to eliminate those variables from the dataset
courier2 <- select(courier, - c("work_load_average_day", "hit_target"))
  1. On the other hand weight and height are used to calculate BMI, so I will eliminate also those 2 variables to simplify the dataset.
courier2 <- select(courier2, - c("weight", "height"))
  1. I will re code back to the categories name for a few relevant categorical variables to use it as character in tables, and will create categorical type of variable for Son (number of children). But I’ll save it as new variables in case I need to use it as numeric too at some point.
courier2 <-courier2%>%
  mutate(Reason = case_when(
        reason_for_absence == 1 ~ "Parasitic",
        reason_for_absence == 2 ~ "Neoplasms",
        reason_for_absence == 3 ~ "Blood and Immune",
        reason_for_absence == 4 ~ "Metabolic",
        reason_for_absence == 5 ~ "Mental",
        reason_for_absence == 6 ~ "Nervous system",
        reason_for_absence == 7 ~ "Eye and Adnexa",
        reason_for_absence == 8 ~ "Ear and Mastoid",
        reason_for_absence == 9 ~ "Circulatory",
        reason_for_absence == 10 ~ "Respiratory",
        reason_for_absence == 11 ~ "Digestive",
        reason_for_absence == 12 ~ "Skin",
        reason_for_absence == 13 ~ "Musculoskeletal",
        reason_for_absence == 14 ~ "Genitourinary",
        reason_for_absence == 15 ~ "Pregnancy",
        reason_for_absence == 16 ~ "Perinatal period",
        reason_for_absence == 17 ~ "Deformations and Chromosomal",
        reason_for_absence == 18 ~ "Laboratory findings",
        reason_for_absence == 19 ~ "Injury, Poisoning",
        reason_for_absence == 20 ~ "External causes and mortality",
        reason_for_absence == 21 ~ "Health status",
        )) %>%
  mutate(Education.Level = case_when(
         education == 1 ~ "High School",
         education == 2 ~ "Bachelor",
         education == 3 ~ "Post-graduate",
         education == 4 ~ "Master or Phd",
         )) %>%
  mutate(Day.of.Week = case_when(
         day_of_the_week == 2 ~ "Monday",
         day_of_the_week == 3 ~ "Tuesday",
         day_of_the_week == 4 ~ "Wednesday",
         day_of_the_week == 5 ~ "Thursday",
         day_of_the_week == 6 ~ "Friday",
         ))%>%
  mutate(Season = case_when(
         seasons == 1 ~ "Summer",
         seasons == 2 ~ "Autunm",
         seasons == 3 ~ "Winter",
         seasons == 4 ~ "Spring",
         ))%>%
  mutate(Disciplined = case_when(
         disciplinary_failure == 1 ~ "Yes",
         disciplinary_failure == 0 ~ "No",
         ))%>%
  mutate(Drinker = case_when(
         social_drinker == 1 ~ "Yes",
         social_drinker == 0 ~ "No",
        ))%>%
  mutate(Children = case_when(
         son == 0 ~ "No Children",
         son == 1 ~ "One",
         son == 2 ~ "Two",
         son == 3 ~ "Three",
         son == 4 ~ "Four",
        ))%>%
  mutate(Smoker = case_when(
         social_smoker == 1 ~ "Yes",
         social_smoker == 0 ~ "No",
         ))

# Confirming all the changes

str(courier2)
'data.frame':   740 obs. of  25 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...
 $ Reason                   : chr  NA NA NA "Eye and Adnexa" ...
 $ Education.Level          : chr  "High School" "High School" "High School" "High School" ...
 $ Day.of.Week              : chr  "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
 $ Season                   : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ Disciplined              : chr  "No" "Yes" "No" "No" ...
 $ Drinker                  : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Children                 : chr  "Two" "One" "No Children" "Two" ...
 $ Smoker                   : chr  "No" "No" "No" "Yes" ...

Descriptive statistics

  1. Let’s summarize Mean, median, and standard deviation for numeric continuous variables, and frequencies for categorical variables.
#Numerical continuous variables (except ID as it is only for identification purposes)

desc_summary <- courier2 %>%
  summarise(Transport.Mean = mean(transportation_expense, na.rm=TRUE), Transport_Median = median(transportation_expense, na.rm=TRUE), Transport.SD = sd(transportation_expense, na.rm=TRUE), Distance.Mean = mean(distance_to_work, na.rm=TRUE), Distance.Median = median(distance_to_work, na.rm=TRUE), Distance.SD = sd(distance_to_work, na.rm=TRUE), Service.Mean = mean(service_time, na.rm=TRUE), Service.Median = median(service_time, na.rm=TRUE), Service.SD = sd(service_time, na.rm=TRUE), Age.Mean = mean(age, na.rm=TRUE), Age.Median = median(age, na.rm=TRUE), Age.SD = sd(age, na.rm=TRUE), Son.Mean = mean(son, na.rm=TRUE), Son.Median = median(son, na.rm=TRUE), Son.SD = sd(son, na.rm=TRUE), Pet.Mean = mean(pet, na.rm=TRUE), Pet.Median = median(pet, na.rm=TRUE), Pet.SD = sd(pet, na.rm=TRUE), BMI.Mean = mean(bmi, na.rm=TRUE), BMI.Median = median(bmi, na.rm=TRUE), BMI.SD = sd(bmi, na.rm=TRUE), Hours.Mean = mean(absenteeism_time_in_hours, na.rm=TRUE), Hours.Median = median(absenteeism_time_in_hours, na.rm=TRUE), Hours.SD = sd(absenteeism_time_in_hours, na.rm=TRUE))
print(desc_summary)
  Transport.Mean Transport_Median Transport.SD Distance.Mean
1       221.0355              225     66.95418      29.66757
  Distance.Median Distance.SD Service.Mean Service.Median Service.SD
1              26    14.84812     12.56581             13   4.389813
  Age.Mean Age.Median   Age.SD Son.Mean Son.Median   Son.SD  Pet.Mean
1 36.44912         37 6.480148 1.017711          1 1.094928 0.7466125
  Pet.Median   Pet.SD BMI.Mean BMI.Median   BMI.SD Hours.Mean
1          0 1.319726 26.68406         25 4.292819   6.977716
  Hours.Median Hours.SD
1            3 13.47696
#(Formattable didn't work well with a lot of columns)

library(janitor) #Using Janitor package to get useful tables with frequency and percentages

# Frequency table for Reason
tabyl(courier2$Reason, sort = T)
              courier2$Reason   n     percent valid_percent
             Blood and Immune   1 0.001351351   0.003816794
                  Circulatory   4 0.005405405   0.015267176
 Deformations and Chromosomal   1 0.001351351   0.003816794
                    Digestive  26 0.035135135   0.099236641
              Ear and Mastoid   6 0.008108108   0.022900763
               Eye and Adnexa  15 0.020270270   0.057251908
                Genitourinary  19 0.025675676   0.072519084
                Health status   6 0.008108108   0.022900763
            Injury, Poisoning  40 0.054054054   0.152671756
          Laboratory findings  21 0.028378378   0.080152672
                       Mental   3 0.004054054   0.011450382
                    Metabolic   2 0.002702703   0.007633588
              Musculoskeletal  55 0.074324324   0.209923664
                    Neoplasms   1 0.001351351   0.003816794
               Nervous system   8 0.010810811   0.030534351
                    Parasitic  16 0.021621622   0.061068702
             Perinatal period   3 0.004054054   0.011450382
                    Pregnancy   2 0.002702703   0.007633588
                  Respiratory  25 0.033783784   0.095419847
                         Skin   8 0.010810811   0.030534351
                         <NA> 478 0.645945946            NA
# Since about 65% of the cases doesn't have an specific Reason for Absence, let's see the proportions excuding NA
tabyl(courier2$Reason, sort = T, show_na = FALSE)
              courier2$Reason  n     percent
             Blood and Immune  1 0.003816794
                  Circulatory  4 0.015267176
 Deformations and Chromosomal  1 0.003816794
                    Digestive 26 0.099236641
              Ear and Mastoid  6 0.022900763
               Eye and Adnexa 15 0.057251908
                Genitourinary 19 0.072519084
                Health status  6 0.022900763
            Injury, Poisoning 40 0.152671756
          Laboratory findings 21 0.080152672
                       Mental  3 0.011450382
                    Metabolic  2 0.007633588
              Musculoskeletal 55 0.209923664
                    Neoplasms  1 0.003816794
               Nervous system  8 0.030534351
                    Parasitic 16 0.061068702
             Perinatal period  3 0.011450382
                    Pregnancy  2 0.007633588
                  Respiratory 25 0.095419847
                         Skin  8 0.030534351
# Frequency table for Education Level
tabyl(courier2$Education.Level, sort = T)
 courier2$Education.Level   n     percent valid_percent
                 Bachelor  46 0.062162162   0.063013699
              High School 601 0.812162162   0.823287671
            Master or Phd   4 0.005405405   0.005479452
            Post-graduate  79 0.106756757   0.108219178
                     <NA>  10 0.013513514            NA
# Frequency table for Day of the Week
tabyl(courier2$Day.of.Week, sort = T)
 courier2$Day.of.Week   n   percent
               Friday 144 0.1945946
               Monday 161 0.2175676
             Thursday 125 0.1689189
              Tuesday 154 0.2081081
            Wednesday 156 0.2108108
#sort(table(courier2$Day.of.Week), decreasing=TRUE)

# Frequency table for Season
tabyl(courier2$Season, sort = T)
 courier2$Season   n   percent
          Autunm 192 0.2594595
          Spring 195 0.2635135
          Summer 170 0.2297297
          Winter 183 0.2472973
# Frequency table for Disciplined or not
tabyl(courier2$Disciplined, sort = T)
 courier2$Disciplined   n     percent valid_percent
                   No 695 0.939189189    0.94686649
                  Yes  39 0.052702703    0.05313351
                 <NA>   6 0.008108108            NA
# Frequency table for Drinker or not
tabyl(courier2$Drinker, sort = T)
 courier2$Drinker   n     percent valid_percent
               No 319 0.431081081     0.4328358
              Yes 418 0.564864865     0.5671642
             <NA>   3 0.004054054            NA
# Frequency table for Smoker or not
tabyl(courier2$Smoker, sort = T)
 courier2$Smoker   n     percent valid_percent
              No 682 0.921621622    0.92663043
             Yes  54 0.072972973    0.07336957
            <NA>   4 0.005405405            NA

I can see that the top 5 reasons for absence are (in frequency):

However it is relevant to note that 65% of the absences does not provide an specific medical reason. So there is still an important number of employees in this company that are absent for unknown reasons.

Also found interesting that 94% of the employees who reported absent have not been Disciplined; 92% are not Smokers, and 81% holds a High School Degree as higher education level.

It means that those variables may not help too much to explain well the hours absent as they almost do not vary in this dataset.

  1. Now let’s Summarise basic descriptive for Absenteeism hours grouped by all Categorical variables.
# Let's create a new dataframe with total number of absenteeism hours

library(data.table)
dt <- data.table(courier2)
dt2 <- dt[,list(Total.Hours = sum(absenteeism_time_in_hours, na.rm=T), freq = .N), by = c("ID", "Reason", "bmi", "distance_to_work", "age", "service_time", "Season", "Drinker", "Smoker", "Children", "Disciplined", "Day.of.Week", "Education.Level", "transportation_expense")]
str(dt2)
Classes 'data.table' and 'data.frame':  526 obs. of  16 variables:
 $ ID                    : int  11 36 3 7 11 3 10 20 14 1 ...
 $ Reason                : chr  NA NA NA "Eye and Adnexa" ...
 $ bmi                   : int  30 31 31 24 30 31 27 23 25 29 ...
 $ distance_to_work      : int  36 13 51 5 36 51 52 50 12 11 ...
 $ age                   : int  33 50 38 39 33 38 28 36 34 37 ...
 $ service_time          : int  13 18 18 14 13 18 3 11 14 14 ...
 $ Season                : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ Drinker               : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ Smoker                : chr  "No" "No" "No" "Yes" ...
 $ Children              : chr  "Two" "One" "No Children" "Two" ...
 $ Disciplined           : chr  "No" "Yes" "No" "No" ...
 $ Day.of.Week           : chr  "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
 $ Education.Level       : chr  "High School" "High School" "High School" "High School" ...
 $ transportation_expense: int  289 118 179 279 289 179 NA 260 155 235 ...
 $ Total.Hours           : int  25 0 2 4 2 5 8 20 40 8 ...
 $ freq                  : int  5 1 1 1 1 3 1 6 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 
# Reason table

reason_table <- dt2 %>% 
  group_by(Reason) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
reason_table <- reason_table[with (reason_table, order(-Hours.Median)),]
formattable(reason_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Reason Hours.Mean Hours.Median Hours.SD Total.Hours
Circulatory 42.000000 24.0 47.721414 168
Neoplasms 24.000000 24.0 NA 24
Skin 23.375000 10.0 37.063411 187
Blood and Immune 8.000000 8.0 NA 8
Deformations and Chromosomal 8.000000 8.0 NA 8
Digestive 7.375000 8.0 8.197627 177
Eye and Adnexa 10.714286 8.0 11.013478 150
Health status 5.833333 8.0 3.371449 35
Injury, Poisoning 18.692308 8.0 22.291317 729
Laboratory findings 10.850000 8.0 18.313857 217
Mental 6.333333 8.0 2.886751 19
Musculoskeletal 18.000000 8.0 37.188708 810
Nervous system 20.375000 8.0 40.369499 163
Parasitic 12.133333 8.0 15.986006 182
Pregnancy 8.000000 8.0 0.000000 16
Respiratory 10.833333 8.0 10.821342 260
Ear and Mastoid 5.333333 5.5 2.943920 32
Metabolic 4.500000 4.5 4.949747 9
Genitourinary 8.368421 4.0 10.462688 159
NA 5.875445 4.0 9.012816 1651
Perinatal period 2.000000 2.0 1.000000 6
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Reason,Total.Hours, fill=Reason)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 60))+
  labs(title="Box Plot - Total Absenteeism Hours by Reason of Absence",
        x ="Reason of Absence", y = "Total Hours")
# Disciplined table

Disciplined_table <- dt2 %>% 
  group_by(Disciplined) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Disciplined_table <- Disciplined_table[with (Disciplined_table, order(-Hours.Median)),]
formattable(Disciplined_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Disciplined Hours.Mean Hours.Median Hours.SD Total.Hours
No 10.024742 8 17.635643 4862
NA 4.666667 5 3.723797 28
Yes 3.428571 0 20.283702 120
# Smoker table

Smoker_table <- dt2 %>% 
  group_by(Smoker) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Smoker_table <- Smoker_table[with (Smoker_table, order(-Hours.Median)),]
formattable(Smoker_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Smoker Hours.Mean Hours.Median Hours.SD Total.Hours
No 9.557203 8.0 17.72427 4511
NA 9.000000 5.5 10.42433 36
Yes 9.260000 4.0 19.09034 463
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Smoker,Total.Hours, fill=Smoker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Smoker",
        x ="Is Smoker?", y = "Total Hours")
# Drinker table
Drinker_table <- dt2 %>% 
  group_by(Drinker) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Drinker_table <- Drinker_table[with (Drinker_table, order(-Hours.Median)),]
formattable(Drinker_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Drinker Hours.Mean Hours.Median Hours.SD Total.Hours
Yes 11.100000 8 20.10949 3108
NA 13.666667 8 16.25833 41
No 7.658436 4 14.54640 1861
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Drinker,Total.Hours, fill=Drinker)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 40))+
  labs(title="Box Plot - Total Absenteeism Hours by Drinker",
        x ="Is Drinker?", y = "Total Hours")
# Day of the week table

Day_table <- dt2 %>% 
  group_by(Day.of.Week) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Day_table <- Day_table[with (Day_table, order(-Hours.Median)),]
formattable(Day_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Day.of.Week Hours.Mean Hours.Median Hours.SD Total.Hours
Monday 11.130081 8 14.924927 1369
Wednesday 9.918182 8 24.079049 1091
Tuesday 11.872727 5 23.742931 1306
Friday 7.351064 4 10.087475 691
Thursday 6.213483 4 5.339544 553
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Day.of.Week,Total.Hours, fill=Day.of.Week)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Day of the Week",
        x ="Day of the Week", y = "Total Hours")
#Season table

s_table <- dt2 %>% 
  group_by(Season) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
s_table <- s_table[with (s_table, order(-Hours.Median)),]
formattable(s_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Season Hours.Mean Hours.Median Hours.SD Total.Hours
Autunm 9.088000 8 14.29133 1136
Summer 9.356589 8 16.48001 1207
Winter 11.014925 8 23.91306 1476
Spring 8.630435 4 14.69251 1191
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Season,Total.Hours, fill=Season)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Season",
        x ="Season", y = "Total Hours")
#Education Level table

e_table <- dt2 %>% 
  group_by(Education.Level) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
e_table <- e_table[with (e_table, order(-Hours.Median)),]
formattable(e_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Education.Level Hours.Mean Hours.Median Hours.SD Total.Hours
Bachelor 6.355556 8.0 6.839576 286
High School 10.393120 8.0 19.678894 4230
Master or Phd 5.250000 5.5 3.201562 21
NA 5.700000 5.0 4.762119 57
Post-graduate 6.933333 4.5 9.434640 416
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Education.Level,Total.Hours, fill=Education.Level)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 20))+
  labs(title="Box Plot - Total Absenteeism Hours by Education Level",
        x ="Education Level", y = "Total Hours")
#Children table

ch_table <- dt2 %>% 
  group_by(Children) %>%
  summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
ch_table <- ch_table[with (ch_table, order(-Hours.Median)),]
formattable(ch_table,
            align =c("l","c","c","c","c"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))
Children Hours.Mean Hours.Median Hours.SD Total.Hours
Four 12.695652 8 12.89973 292
Three 14.583333 8 22.33814 175
Two 12.007463 8 21.46531 1609
No Children 7.973958 5 10.15195 1531
One 8.559748 4 21.53496 1361
NA 7.000000 2 12.26377 42
# Box Plot

box_plot_crop<-ggplot(data=dt2, aes(Children,Total.Hours, fill=Children)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 30))+
  labs(title="Box Plot - Total Absenteeism Hours by Children",
        x ="How many Children", y = "Total Hours")

From these tables and charts, I can see that:

  1. In terms of (medical) Reasons of absence, the top 4 in terms of Total Hours are (in order):
  1. Seems that having between 3 to 4 children may have some impact on the absence incidence.

  2. For the rest of the categorical variables there are not visible differences that can explain Absence.

Now let’s check any relation between Total Hours of Absence and other continuos variables.

Histogram for Total hours of absenteeism

#Absenteesism histogram - saved as ggplot object

hours_hist <- ggplot(dt2, aes(x=Total.Hours)) +
 geom_histogram(binwidth = 2,
                fill='steelblue',
                col='black') +
  labs(title="Histogram - Total Absenteeism Hours",
        x ="Total Hours", y = "Frequency")

hours_hist

Let’s use Scatter Plots

# Tenure

scatter1 <- ggplot(dt2, aes(service_time,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,150))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Tenure",
        x ="Years of Service", y = "Total Hours")
scatter1
#Age

scatter2 <- ggplot(dt2, aes(age,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Age",
        x ="Age", y = "Total Hours")
scatter2
#BMI

scatter3 <- ggplot(dt2, aes(bmi,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs BMI",
        x ="Body Mass Index", y = "Total Hours")
scatter3
#Distance to Work 

scatter5 <- ggplot(dt2, aes(distance_to_work,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Distance to work",
        x ="Distance to work in KM", y = "Total Hours")
scatter5
# Transportation Expenses 

scatter6 <- ggplot(dt2, aes(transportation_expense,Total.Hours))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))+
  labs(title="Scatter Plot - Total Absenteeism Hours vs Transportation Expense",
        x ="Transportation Expense", y = "Total Hours")
scatter6

I can’t find any relationship between any of the continuous variables and total hours of abcense.

Seems that the most closer impact to absence are some specific type of medical conditions and having 3-4 children also may increase chances to be absent.

Several bibliography related to Absenteeism suggest that Sick leave is the most common reason for being absent. The data here seems that are consistent with this conclusion, as we don’t see other variables affecting the absenteeism.

Final Comments

What is missing from your final project?

I wish I had other relevant variables in the dataset, like salary/wages, gender. Those 2 variables in particular are very relevant to predict absenteeism according to certain researches around this topic.

What do you hope to accomplish between now and submission time?

I need to complement it with theory around absenteeism, and why is it relevant to study in organizational context.

On the other hand, I would like to improve more the tables and maybe think on other possible combination of variables to explore relationships to the total hours of absence.