About Dataset:

The dataser was found in:

https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism

It contain real data from a Courier Company, and it was created to explain the absenteeism.

Variables:

Individual identification (ID)
Reason for absence (ICD). Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows: I Certain infectious and parasitic diseases II Neoplasms III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism IV Endocrine, nutritional and metabolic diseases V Mental and behavioural disorders VI Diseases of the nervous system VII Diseases of the eye and adnexa VIII Diseases of the ear and mastoid process IX Diseases of the circulatory system X Diseases of the respiratory system XI Diseases of the digestive system XII Diseases of the skin and subcutaneous tissue XIII Diseases of the musculoskeletal system and connective tissue XIV Diseases of the genitourinary system XV Pregnancy, childbirth and the puerperium XVI Certain conditions originating in the perinatal period XVII Congenital malformations, deformations and chromosomal abnormalities XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified XIX Injury, poisoning and certain other consequences of external causes XX External causes of morbidity and mortality XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

Month of absence
Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
Seasons (summer (1), autumn (2), winter (3), spring (4))
Transportation expense
Distance from Residence to Work (kilometers)
Service time
Age
Work load Average/day
Hit target
Disciplinary failure (yes=1; no=0)
Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
Son (number of children)
Social drinker (yes=1; no=0)
Social smoker (yes=1; no=0)
Pet (number of pet)
Weight
Height
Body mass index
Absenteeism time in hours (target)

Now let’s start working with the data set.

Setting up and Reading the data

First let’s set the working directory and load the data.

#First of all let's load the Dplyr and Tidyverse package

library(ggplot2)
library(dplyr)
library(tidyverse)

# Now let's set the working directory

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")

courier <- read.csv("Absenteeism_at_work_Project.csv")

Exploring the data

dim.data.frame(courier)

[1] 740  21

courier_t <- as_tibble(courier)

head(courier_t)

# A tibble: 6 × 21
     ID reason_for_absence month_of_absence day_of_the_week seasons
  <int>              <int>            <int>           <int>   <int>
1    11                 26                7               3       1
2    36                  0                7               3       1
3     3                 23                7               4       1
4     7                  7                7               5       1
5    11                 23                7               5       1
6     3                 23                7               6       1
# … with 16 more variables: transportation_expense <int>,
#   distance_to_work <int>, service_time <int>, age <int>,
#   work_load_average_day <chr>, hit_target <int>,
#   disciplinary_failure <int>, education <int>, son <int>,
#   social_drinker <int>, social_smoker <int>, pet <int>,
#   weight <int>, height <int>, bmi <int>,
#   absenteeism_time_in_hours <int>

# Want to see all the variables
head(courier)

  ID reason_for_absence month_of_absence day_of_the_week seasons
1 11                 26                7               3       1
2 36                  0                7               3       1
3  3                 23                7               4       1
4  7                  7                7               5       1
5 11                 23                7               5       1
6  3                 23                7               6       1
  transportation_expense distance_to_work service_time age
1                    289               36           13  33
2                    118               13           18  50
3                    179               51           18  38
4                    279                5           14  39
5                    289               36           13  33
6                    179               51           18  38
  work_load_average_day hit_target disciplinary_failure education son
1               239,554         97                    0         1   2
2               239,554         97                    1         1   1
3               239,554         97                    0         1   0
4               239,554         97                    0         1   2
5               239,554         97                    0         1   2
6               239,554         97                    0         1   0
  social_drinker social_smoker pet weight height bmi
1              1             0   1     90    172  30
2              1             0   0     98    178  31
3              1             0   0     89    170  31
4              1             1   0     68    168  24
5              1             0   1     90    172  30
6              1             0   0     89    170  31
  absenteeism_time_in_hours
1                         4
2                         0
3                         2
4                         4
5                         2
6                        NA

colnames(courier)

 [1] "ID"                        "reason_for_absence"       
 [3] "month_of_absence"          "day_of_the_week"          
 [5] "seasons"                   "transportation_expense"   
 [7] "distance_to_work"          "service_time"             
 [9] "age"                       "work_load_average_day"    
[11] "hit_target"                "disciplinary_failure"     
[13] "education"                 "son"                      
[15] "social_drinker"            "social_smoker"            
[17] "pet"                       "weight"                   
[19] "height"                    "bmi"                      
[21] "absenteeism_time_in_hours"

Let’s see the structure more clearly by using str()

str(courier)

'data.frame':   740 obs. of  21 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ work_load_average_day    : chr  "239,554" "239,554" "239,554" "239,554" ...
 $ hit_target               : int  97 97 97 97 97 97 97 97 97 97 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

I found this dataset very comprehensive. There are 21 variables to explore potential explanations to the absenteeism.

In terms of the data type, We have:

Nine (9) Categorical variables:

- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"    
- "disciplinary_failure"  
- "education"            
- "social_drinker"    
- "social_smoker"     
- "pet"

Three (3) numeric discrete:

- "ID"
- "work_load_average_day" 
- "hit_target"

Nine (9) numeric continuous:

- "transportation_expense" 
- "distance_to_work"      
- "service_time"         
- "age"                 
- "son"                
- "weight"            
- "height"            
- "bmi"               
- "absenteeism_time_in_hours"

The dataset seems very clean in general, but I will adjust a few things:

Variable “work_load_average_day” and “hit_target” are not clear in the description of the file. I rather to eliminate those variables from the dataset

courier2 <- select(courier, - c("work_load_average_day", "hit_target"))
str(courier2)

'data.frame':   740 obs. of  19 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

On the other hand weight and height are used to calculate BMI, so I will eliminate also those 2 variables to simplify the dataset.

courier2 <- select(courier2, - c("weight", "height"))
str(courier2)

'data.frame':   740 obs. of  17 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

I think worth recoding back to the categories name for a few relevant categorical variables but saving it in new variables in case I need to use it as character in tables.

#recode key categorical variables to use its labels and will create categorical type of variable for Son (number of children)

courier2 <-courier2%>%
  mutate(reason_c = case_when(
        reason_for_absence == 1 ~ "parasitic",
        reason_for_absence == 2 ~ "Neoplasms",
        reason_for_absence == 3 ~ "blood and immune",
        reason_for_absence == 4 ~ "metabolic",
        reason_for_absence == 5 ~ "Mental",
        reason_for_absence == 6 ~ "nervous system",
        reason_for_absence == 7 ~ "eye and adnexa",
        reason_for_absence == 8 ~ "ear and mastoid",
        reason_for_absence == 9 ~ "circulatory",
        reason_for_absence == 10 ~ "respiratory",
        reason_for_absence == 11 ~ "digestive",
        reason_for_absence == 12 ~ "skin",
        reason_for_absence == 13 ~ "musculoskeletal",
        reason_for_absence == 14 ~ "genitourinary",
        reason_for_absence == 15 ~ "Pregnancy",
        reason_for_absence == 16 ~ "perinatal period",
        reason_for_absence == 17 ~ "deformations and chromosomal",
        reason_for_absence == 18 ~ "laboratory findings",
        reason_for_absence == 19 ~ "Injury, poisoning",
        reason_for_absence == 20 ~ "External causes and mortality",
        reason_for_absence == 21 ~ "health status",
        )) %>%
  mutate(education_c = case_when(
         education == 1 ~ "High School",
         education == 2 ~ "Bachelor",
         education == 3 ~ "Post-graduate",
         education == 4 ~ "Master or Phd",
         )) %>%
  mutate(day_week_c = case_when(
         day_of_the_week == 2 ~ "Monday",
         day_of_the_week == 3 ~ "Tuesday",
         day_of_the_week == 4 ~ "Wednesday",
         day_of_the_week == 5 ~ "Thursday",
         day_of_the_week == 6 ~ "Friday",
         ))%>%
  mutate(seasons_c = case_when(
         seasons == 1 ~ "Summer",
         seasons == 2 ~ "Autunm",
         seasons == 3 ~ "Winter",
         seasons == 4 ~ "Spring",
         ))%>%
  mutate(disciplined = case_when(
         disciplinary_failure == 1 ~ "Yes",
         disciplinary_failure == 0 ~ "No",
         ))%>%
  mutate(drinker = case_when(
         social_drinker == 1 ~ "Yes",
         social_drinker == 0 ~ "No",
        ))%>%
  mutate(children_c = case_when(
         son == 0 ~ "No Children",
         son == 1 ~ "One",
         son == 2 ~ "Two",
         son == 3 ~ "Three",
         son == 4 ~ "Four",
        ))%>%
  mutate(smoker = case_when(
         social_smoker == 1 ~ "Yes",
         social_smoker == 0 ~ "No",
         ))
str(courier2)

'data.frame':   740 obs. of  25 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...
 $ reason_c                 : chr  NA NA NA "eye and adnexa" ...
 $ education_c              : chr  "High School" "High School" "High School" "High School" ...
 $ day_week_c               : chr  "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
 $ seasons_c                : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ disciplined              : chr  "No" "Yes" "No" "No" ...
 $ drinker                  : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ children_c               : chr  "Two" "One" "No Children" "Two" ...
 $ smoker                   : chr  "No" "No" "No" "Yes" ...

Descriptive statistics

Mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.

#Numerical variables (except ID as it is only for identification purposes)

courier2 %>%
  summarise(transport_mean = mean(transportation_expense, na.rm=TRUE), transport_median = median(transportation_expense, na.rm=TRUE), transport_sd = sd(transportation_expense, na.rm=TRUE), distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE), service_mean = mean(service_time, na.rm=TRUE), service_median = median(service_time, na.rm=TRUE), service_sd = sd(service_time, na.rm=TRUE), age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE), son_mean = mean(son, na.rm=TRUE), son_median = median(son, na.rm=TRUE), son_sd = sd(son, na.rm=TRUE), pet_mean = mean(pet, na.rm=TRUE), pet_median = median(pet, na.rm=TRUE), pet_sd = sd(pet, na.rm=TRUE), bmi_mean = mean(bmi, na.rm=TRUE), bmi_median = median(bmi, na.rm=TRUE), bmi_sd = sd(bmi, na.rm=TRUE), hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))

  transport_mean transport_median transport_sd distance_mean
1       221.0355              225     66.95418      29.66757
  distance_median distance_sd service_mean service_median service_sd
1              26    14.84812     12.56581             13   4.389813
  age_mean age_median   age_sd son_mean son_median   son_sd  pet_mean
1 36.44912         37 6.480148 1.017711          1 1.094928 0.7466125
  pet_median   pet_sd bmi_mean bmi_median   bmi_sd hours_mean
1          0 1.319726 26.68406         25 4.292819   6.977716
  hours_median hours_sd
1            3 13.47696

# Frequency for categorical

table(courier2$reason_c)


            blood and immune                  circulatory 
                           1                            4 
deformations and chromosomal                    digestive 
                           1                           26 
             ear and mastoid               eye and adnexa 
                           6                           15 
               genitourinary                health status 
                          19                            6 
           Injury, poisoning          laboratory findings 
                          40                           21 
                      Mental                    metabolic 
                           3                            2 
             musculoskeletal                    Neoplasms 
                          55                            1 
              nervous system                    parasitic 
                           8                           16 
            perinatal period                    Pregnancy 
                           3                            2 
                 respiratory                         skin 
                          25                            8

table(courier2$education_c)


     Bachelor   High School Master or Phd Post-graduate 
           46           601             4            79

table(courier2$day_week_c)


   Friday    Monday  Thursday   Tuesday Wednesday 
      144       161       125       154       156

table(courier2$seasons_c)


Autunm Spring Summer Winter 
   192    195    170    183

table(courier2$disciplined)


 No Yes 
695  39

table(courier2$drinker)


 No Yes 
319 418

table(courier2$smoker)


 No Yes 
682  54

In addition to overall means, medians, and SDs, use group_by() and summarise() to compute mean/median/SD for any relevant groupings

# Absenteeism in hours descriptive grouped by reason

reason_table <- courier2 %>% 
  group_by(reason_c) %>%
  summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE), hours_total = sum(absenteeism_time_in_hours, na.rm=TRUE))
print(reason_table,n=Inf)

# A tibble: 21 × 5
   reason_c               hours_mean hours_median hours_sd hours_total
   <chr>                       <dbl>        <dbl>    <dbl>       <int>
 1 blood and immune             8             8      NA              8
 2 circulatory                 42            24      47.7          168
 3 deformations and chro…       8             8      NA              8
 4 digestive                    7.08          8       8.08         177
 5 ear and mastoid              5.33          5.5     2.94          32
 6 eye and adnexa              10             8      10.9          150
 7 genitourinary                8.83          4      10.6          159
 8 health status                5.83          8       3.37          35
 9 Injury, poisoning           18.2           8      22.1          729
10 laboratory findings         10.3           8      16.1          217
11 Mental                       6.33          8       2.89          19
12 metabolic                    4.5           4.5     4.95           9
13 musculoskeletal             15             8      22.0          810
14 Neoplasms                   24            24      NA             24
15 nervous system              23.3           8      42.7          163
16 parasitic                   11.4           8      14.3          182
17 perinatal period             2             2       1              6
18 Pregnancy                    8             8       0             16
19 respiratory                 11.3           8      10.7          260
20 skin                        23.4          10      37.1          187
21 <NA>                         3.57          2       6.17        1651

# Order table by the Sum of absenteeism hours

reason_t_o <- reason_table[with (reason_table, order(hours_total)),]
print(reason_t_o,n=Inf)

# A tibble: 21 × 5
   reason_c               hours_mean hours_median hours_sd hours_total
   <chr>                       <dbl>        <dbl>    <dbl>       <int>
 1 perinatal period             2             2       1              6
 2 blood and immune             8             8      NA              8
 3 deformations and chro…       8             8      NA              8
 4 metabolic                    4.5           4.5     4.95           9
 5 Pregnancy                    8             8       0             16
 6 Mental                       6.33          8       2.89          19
 7 Neoplasms                   24            24      NA             24
 8 ear and mastoid              5.33          5.5     2.94          32
 9 health status                5.83          8       3.37          35
10 eye and adnexa              10             8      10.9          150
11 genitourinary                8.83          4      10.6          159
12 nervous system              23.3           8      42.7          163
13 circulatory                 42            24      47.7          168
14 digestive                    7.08          8       8.08         177
15 parasitic                   11.4           8      14.3          182
16 skin                        23.4          10      37.1          187
17 laboratory findings         10.3           8      16.1          217
18 respiratory                 11.3           8      10.7          260
19 Injury, poisoning           18.2           8      22.1          729
20 musculoskeletal             15             8      22.0          810
21 <NA>                         3.57          2       6.17        1651

# Now let's focus on the reasons with a total of hours above 150, and will eliminate NA.

courier_f <- filter(courier2, reason_c %in%  c("eye and adnexa", "genitourinary", "nervous system", "circulatory", "digestive", "parasitic", "skin", "laboratory findings", "respiratory", "Injury, poisoning", "musculoskeletal" ))

str(courier_f)

'data.frame':   237 obs. of  25 variables:
 $ ID                       : int  7 14 20 20 20 3 24 6 18 3 ...
 $ reason_for_absence       : int  7 19 1 1 11 11 14 11 10 11 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 8 8 ...
 $ day_of_the_week          : int  5 2 2 3 4 4 6 5 4 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  279 155 260 260 260 179 246 189 330 179 ...
 $ distance_to_work         : int  5 12 50 50 50 51 25 29 16 51 ...
 $ service_time             : int  14 14 11 11 11 18 16 13 4 18 ...
 $ age                      : int  39 34 36 36 36 38 41 33 28 38 ...
 $ disciplinary_failure     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 2 1 ...
 $ son                      : int  2 2 4 4 4 0 0 2 0 0 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 0 0 1 ...
 $ social_smoker            : int  1 0 0 0 0 0 0 0 0 0 ...
 $ pet                      : int  0 0 0 0 0 0 0 2 0 0 ...
 $ bmi                      : int  24 25 23 23 23 31 NA 25 25 31 ...
 $ absenteeism_time_in_hours: int  4 40 8 8 8 1 NA 8 NA 1 ...
 $ reason_c                 : chr  "eye and adnexa" "Injury, poisoning" "parasitic" "parasitic" ...
 $ education_c              : chr  "High School" "High School" "High School" "High School" ...
 $ day_week_c               : chr  "Thursday" "Monday" "Monday" "Tuesday" ...
 $ seasons_c                : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ disciplined              : chr  "No" "No" "No" "No" ...
 $ drinker                  : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ children_c               : chr  "Two" "Two" "Four" "Four" ...
 $ smoker                   : chr  "Yes" "No" "No" "No" ...

# The number of cases went down to 237 from 740

# Now let's summarize descriptives for reasons

reason_table2 <- courier_f %>% 
  group_by(reason_c) %>%
  summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE), hours_total = sum(absenteeism_time_in_hours, na.rm=TRUE))
print(reason_table2,n=Inf)

# A tibble: 11 × 5
   reason_c            hours_mean hours_median hours_sd hours_total
   <chr>                    <dbl>        <dbl>    <dbl>       <int>
 1 circulatory              42              24    47.7          168
 2 digestive                 7.08            8     8.08         177
 3 eye and adnexa           10               8    10.9          150
 4 genitourinary             8.83            4    10.6          159
 5 Injury, poisoning        18.2             8    22.1          729
 6 laboratory findings      10.3             8    16.1          217
 7 musculoskeletal          15               8    22.0          810
 8 nervous system           23.3             8    42.7          163
 9 parasitic                11.4             8    14.3          182
10 respiratory              11.3             8    10.7          260
11 skin                     23.4            10    37.1          187

# Age descriptives grouped by reason

courier_f %>%
  group_by(reason_c) %>%
  summarise(age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE))

# A tibble: 11 × 4
   reason_c            age_mean age_median age_sd
   <chr>                  <dbl>      <dbl>  <dbl>
 1 circulatory             36.8       39     6.13
 2 digestive               35.1       37     5.15
 3 eye and adnexa          32.9       31     5.73
 4 genitourinary           41.1       41     6.53
 5 Injury, poisoning       35.3       34     5.75
 6 laboratory findings     37.6       34     8.68
 7 musculoskeletal         36.5       37     6.65
 8 nervous system          38.5       33.5  12.5 
 9 parasitic               37.7       36     9.07
10 respiratory             35.8       36.5   5.71
11 skin                    36.1       32.5   9.78

# Distance to work descriptive vs reason

courier_f %>%
  group_by(reason_c) %>%
  summarise(distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE))

# A tibble: 11 × 4
   reason_c            distance_mean distance_median distance_sd
   <chr>                       <dbl>           <dbl>       <dbl>
 1 circulatory                  21.5            25          7.68
 2 digestive                    29.7            26         16.7 
 3 eye and adnexa               24              26         12.9 
 4 genitourinary                24.6            25         12.6 
 5 Injury, poisoning            30.4            27         14.5 
 6 laboratory findings          25.8            26         13.6 
 7 musculoskeletal              29.5            26         14.3 
 8 nervous system               26.4            21         14.9 
 9 parasitic                    24.4            20.5       13.7 
10 respiratory                  24              17         14.5 
11 skin                         29.6            26         14.7

# Crosstab Reason vs disciplined

xtabs(~ reason_c + disciplined, courier_f)

                     disciplined
reason_c              No
  circulatory          4
  digestive           26
  eye and adnexa      15
  genitourinary       19
  Injury, poisoning   40
  laboratory findings 20
  musculoskeletal     55
  nervous system       8
  parasitic           16
  respiratory         25
  skin                 8

Let’s do Visualization

# First let's create a new dataframe with total number of absenteeism hours

library(data.table)
dt <- data.table(courier_f)
dt2 <- dt[,list(hourstotal = sum(absenteeism_time_in_hours), freq = .N), by = c("ID", "reason_c", "bmi", "smoker", "distance_to_work", "age", "service_time", "seasons_c", "drinker", "smoker", "son", "children_c")]
str(dt2)

Classes 'data.table' and 'data.frame':  179 obs. of  14 variables:
 $ ID              : int  7 14 20 20 3 24 6 18 10 11 ...
 $ reason_c        : chr  "eye and adnexa" "Injury, poisoning" "parasitic" "digestive" ...
 $ bmi             : int  24 25 23 23 31 NA 25 25 27 NA ...
 $ smoker          : chr  "Yes" "No" "No" "No" ...
 $ distance_to_work: int  5 12 50 50 51 25 29 16 52 36 ...
 $ age             : int  39 34 36 36 38 41 33 28 28 33 ...
 $ service_time    : int  14 14 11 11 18 16 13 4 3 13 ...
 $ seasons_c       : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ drinker         : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ smoker.1        : chr  "Yes" "No" "No" "No" ...
 $ son             : int  2 2 4 4 0 0 2 0 1 2 ...
 $ children_c      : chr  "Two" "Two" "Four" "Four" ...
 $ hourstotal      : int  4 56 16 8 10 NA 8 NA 40 8 ...
 $ freq            : int  1 2 2 1 3 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

Histogram for Total hours of absenteeism

#Absenteesism histogram - saved as ggplot object

hours_hist <- ggplot(dt2, aes(x=hourstotal)) +
 geom_histogram(binwidth = 2,fill='steelblue',col='black')
hours_hist

Let’s visualize categorical variables vs Total Absenteeism Hours using Box Plots.

# Box Plot / Season vs Absenteeism hours

boxplot(hourstotal~seasons_c,data=dt2, main="Total Hours Absenteeism",
   xlab="Reason", ylab="Total Hours")

# Let's eliminate outliners

box_plot_crop<-ggplot(data=dt2, aes(seasons_c,hourstotal)) 
box_plot_crop+ geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim =  c(0, 50))

# Box Plot / Reason vs Absenteeism hours

box_plot_crop<-ggplot(data=dt2, aes(reason_c,hourstotal)) 
box_plot_crop+ geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim =  c(0, 120))

# Box Plot / Reason vs Absenteeism hours again + colored and legend to see it more clearly

box_plot_crop<-ggplot(data=dt2, aes(reason_c,hourstotal, fill=reason_c)) 
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
  theme(legend.position = "right") +
  theme (axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  coord_cartesian(ylim =  c(0, 120))

# Here seems that employees with circulatory issues, sum more total absenteeism hours vs other reasons.

Let’s dig with some statistics

# Means
tapply(dt2$hourstotal,dt2$reason_c,mean, na.rm=T)

        circulatory           digestive      eye and adnexa 
           56.00000            10.41176            12.50000 
      genitourinary   Injury, poisoning laboratory findings 
           10.60000            23.51613            11.42105 
    musculoskeletal      nervous system           parasitic 
           23.35294            27.16667            15.16667 
        respiratory                skin 
           15.29412            26.71429

#Anova
anova(lm(dt2$hourstotal~dt2$reason_c))

Analysis of Variance Table

Response: dt2$hourstotal
              Df Sum Sq Mean Sq F value Pr(>F)
dt2$reason_c  10  10462 1046.17  1.2951 0.2372
Residuals    162 130866  807.82

summary(lm(dt2$hourstotal~dt2$reason_c))


Call:
lm(formula = dt2$hourstotal ~ dt2$reason_c)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.000 -15.353  -7.167   0.588 216.647 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                        56.00      16.41   3.413 0.000813
dt2$reason_cdigestive             -45.59      17.80  -2.561 0.011339
dt2$reason_ceye and adnexa        -43.50      18.35  -2.371 0.018914
dt2$reason_cgenitourinary         -45.40      17.98  -2.526 0.012509
dt2$reason_cInjury, poisoning     -32.48      17.19  -1.890 0.060514
dt2$reason_claboratory findings   -44.58      17.66  -2.525 0.012543
dt2$reason_cmusculoskeletal       -32.65      17.12  -1.907 0.058270
dt2$reason_cnervous system        -28.83      20.10  -1.435 0.153307
dt2$reason_cparasitic             -40.83      18.35  -2.226 0.027415
dt2$reason_crespiratory           -40.71      17.80  -2.287 0.023489
dt2$reason_cskin                  -29.29      19.61  -1.493 0.137338
                                   
(Intercept)                     ***
dt2$reason_cdigestive           *  
dt2$reason_ceye and adnexa      *  
dt2$reason_cgenitourinary       *  
dt2$reason_cInjury, poisoning   .  
dt2$reason_claboratory findings *  
dt2$reason_cmusculoskeletal     .  
dt2$reason_cnervous system         
dt2$reason_cparasitic           *  
dt2$reason_crespiratory         *  
dt2$reason_cskin                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.42 on 162 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.07402,   Adjusted R-squared:  0.01687 
F-statistic: 1.295 on 10 and 162 DF,  p-value: 0.2372

# fit the ANOVA model by skipping NAs
fit1 <- lm(hourstotal ~ 1 + reason_c, data = dt2)
summary(fit1)


Call:
lm(formula = hourstotal ~ 1 + reason_c, data = dt2)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.000 -15.353  -7.167   0.588 216.647 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    56.00      16.41   3.413 0.000813 ***
reason_cdigestive             -45.59      17.80  -2.561 0.011339 *  
reason_ceye and adnexa        -43.50      18.35  -2.371 0.018914 *  
reason_cgenitourinary         -45.40      17.98  -2.526 0.012509 *  
reason_cInjury, poisoning     -32.48      17.19  -1.890 0.060514 .  
reason_claboratory findings   -44.58      17.66  -2.525 0.012543 *  
reason_cmusculoskeletal       -32.65      17.12  -1.907 0.058270 .  
reason_cnervous system        -28.83      20.10  -1.435 0.153307    
reason_cparasitic             -40.83      18.35  -2.226 0.027415 *  
reason_crespiratory           -40.71      17.80  -2.287 0.023489 *  
reason_cskin                  -29.29      19.61  -1.493 0.137338    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.42 on 162 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.07402,   Adjusted R-squared:  0.01687 
F-statistic: 1.295 on 10 and 162 DF,  p-value: 0.2372

Let’s see if we can find more significant relationships

# Box Plot / Drinker vs Absenteeism hours

box_plot_crop<-ggplot(data=dt2, aes(drinker,hourstotal)) 
box_plot_crop+ geom_boxplot(color="red", fill="orange", alpha=0.2,outlier.shape = NA) +
  coord_cartesian(ylim =  c(0, 50))

# The total amount of hours looks higher on Drinkers,let's run a T-test

t.test(hourstotal ~ drinker, data = dt2)


    Welch Two Sample t-test

data:  hourstotal by drinker
t = -1.7792, df = 164.82, p-value = 0.07705
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -16.159996   0.840755
sample estimates:
 mean in group No mean in group Yes 
         14.32927          21.98889

# Seems there is a difference but not statistically significant

# What about smokers?

# Box Plot / Drinker vs Absenteeism hours

box_plot_crop<-ggplot(data=dt2, aes(smoker,hourstotal)) 
box_plot_crop+ geom_boxplot(color="black", fill="pink", alpha=0.5, outlier.shape = NA, na.rm=T) +
  coord_cartesian(ylim =  c(0, 35))

# Also seems there is some difference

t.test(hourstotal ~ smoker, data = dt2)


    Welch Two Sample t-test

data:  hourstotal by smoker
t = 1.9245, df = 37.904, p-value = 0.06183
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -0.4326298 17.0704827
sample estimates:
 mean in group No mean in group Yes 
         19.47682          11.15789

# But not statistically significant either



# Let's see Children as categorical variable

box_plot_crop<-ggplot(data=dt2, aes(children_c,hourstotal)) 
box_plot_crop+ geom_boxplot(color="blue", fill="pink", alpha=5, outlier.shape = NA) +
  coord_cartesian(ylim =  c(0, 50))

# Mean

tapply(dt2$hourstotal,dt2$children_c,mean)

       Four No Children         One       Three         Two 
   19.55556          NA    19.22222    32.00000          NA

#Anova

anova(lm(dt2$hourstotal~dt2$children_c))

Analysis of Variance Table

Response: dt2$hourstotal
                Df Sum Sq Mean Sq F value Pr(>F)
dt2$children_c   4   2490  622.61  0.7499 0.5594
Residuals      167 138652  830.25

summary(lm(dt2$hourstotal~dt2$children_c))


Call:
lm(formula = dt2$hourstotal ~ dt2$children_c)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.000 -13.500 -10.984   0.337 220.778 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)  
(Intercept)                19.5556     9.6047   2.036   0.0433 *
dt2$children_cNo Children  -5.5717    10.2782  -0.542   0.5885  
dt2$children_cOne          -0.3333    10.5214  -0.032   0.9748  
dt2$children_cThree        12.4444    17.3151   0.719   0.4733  
dt2$children_cTwo           1.9444    10.4027   0.187   0.8520  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.81 on 167 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.01764,   Adjusted R-squared:  -0.005884 
F-statistic: 0.7499 on 4 and 167 DF,  p-value: 0.5594

#Nothing here...

Now let’s see how are the relationships between Total Absenteeism hours and other continues variables

For this let’s use Scatter Plots

# Tenure

scatter1 <- ggplot(dt2, aes(service_time,hourstotal))+
  geom_point()+
  scale_y_continuous(limits=c(0,150))
scatter1

#Age

scatter2 <- ggplot(dt2, aes(age,hourstotal))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))
scatter2

scatter2+
  facet_wrap(vars(reason_c))

#BMI

scatter3 <- ggplot(dt2, aes(bmi,hourstotal))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))
scatter3

#Distance to Work 

scatter5 <- ggplot(dt2, aes(distance_to_work,hourstotal))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))
scatter5

Facetting

# Facet Wrap with Reason

scatter3+
  facet_wrap(vars(reason_c))

# Facet wrap bivariate (Reason-Drinker)

scatter3+
  facet_wrap(vars(reason_c,drinker))

# (Chart seems too small)

# Facet wrap bivariate (Season -Drinker)

scatter3+
  facet_wrap(vars(seasons_c,drinker))

# Also let's try a Scatter (BMI and hours absent) coloring if smoker or not

scatter3 <- ggplot(dt2, aes(bmi,hourstotal, col=drinker))+
  geom_point()+
  scale_y_continuous(limits=c(0,100))
scatter3

grid_hours <- scatter2 +
  facet_grid(children_c~drinker)
grid_hours

A little bit of customization

final_grid <- grid_hours +
  labs(x="Employee Age",
       y="Total Hours Absent",
       title="Scatter between Age and Hours absent",
       subtitle="Grid Drinker vs Children")+
  theme_bw()+
  theme(plot.title = element_text(hjust=0.5),
        plot.subtitle=element_text(hjust=0.5))
final_grid

Final Comments

– What is missing (if anything) in your analysis process so far?

Maybe I should try to exhaust all possible relationships among variables in the dataframe, there are some variables that I haven’t explore at all, like Transportation Expenses or Disciplinary action.

– What conclusions can you make about your research questions at this point?

Has been difficult to find solid significant relationships among the variables and total absenteeism hours, but there are some variables that might provide some explanation like Drinkers seems to produce a more hours of absenteeism than no drinkers, however is not an statistically significant difference.

Also certain medical conditions tend to produce more absenteeism hours, like Circulatory or Digestive issues, still none of those statistically significant.

– What do you think a naive reader would need to fully understand your graphs?

I need to work a lot more on the customization of the graph, titles, legends, colors, etc. that will help others to understand the graphs.

– Is there anything you want to answer with your dataset, but can’t?

Not really, I think graph are very useful to explore what are the relationships among variables.

However, in some cases I felt the need to jump into specific statistics (T-test, Anova, etc) to confirm the significancy of certain visual relationships.

HW-5