Employee Absenteeism
Absenteeism is one of topics that has an economic impact in the companies.
Absenteeism takes place when employees do not report to work. Absenteeism can take two forms, authorized or unauthorized. Management is primarily interested in reducing unauthorized absenteeism since it is more likely to cause hardships on employees and guests alike (Abraham Pizam, A. & Thornburg, S. 2000) (2).
As Henkerson and Persson (2004) (1) said, absence from work can have multiple causes, but there is little doubt that the quantitatively most important one is sick leave.
In this study I will try to find causes of absenteeism on a Courier company.
The dataset was found in:
https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism
It contain real data from a Courier Company, and it was created to explain the absenteeism.
Variables:
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows:
I Certain infectious and parasitic diseases
II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services.
And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
Let’s load packages and data.
str(courier)
'data.frame': 740 obs. of 21 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ work_load_average_day : chr "239,554" "239,554" "239,554" "239,554" ...
$ hit_target : int 97 97 97 97 97 97 97 97 97 97 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ weight : int 90 98 89 68 90 89 80 65 95 88 ...
$ height : int 172 178 170 168 172 170 172 168 196 172 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
The dataset has 21 variables and 740 observations to explore potential explanations to the absenteeism.
In terms of the type of data we have:
Nine (9) Categorical variables:
- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"
- "disciplinary_failure"
- "education"
- "social_drinker"
- "social_smoker"
- "pet" Three (3) numeric discrete:
- "ID"
- "work_load_average_day"
- "hit_target" Nine (9) numeric continuous:
- "transportation_expense"
- "distance_to_work"
- "service_time"
- "age"
- "son"
- "weight"
- "height"
- "bmi"
- "absenteeism_time_in_hours"The dataset seems very clean in general, but I will adjust a few things:
courier2 <-courier2%>%
mutate(Reason = case_when(
reason_for_absence == 1 ~ "Parasitic",
reason_for_absence == 2 ~ "Neoplasms",
reason_for_absence == 3 ~ "Blood and Immune",
reason_for_absence == 4 ~ "Metabolic",
reason_for_absence == 5 ~ "Mental",
reason_for_absence == 6 ~ "Nervous system",
reason_for_absence == 7 ~ "Eye and Adnexa",
reason_for_absence == 8 ~ "Ear and Mastoid",
reason_for_absence == 9 ~ "Circulatory",
reason_for_absence == 10 ~ "Respiratory",
reason_for_absence == 11 ~ "Digestive",
reason_for_absence == 12 ~ "Skin",
reason_for_absence == 13 ~ "Musculoskeletal",
reason_for_absence == 14 ~ "Genitourinary",
reason_for_absence == 15 ~ "Pregnancy",
reason_for_absence == 16 ~ "Perinatal period",
reason_for_absence == 17 ~ "Deformations and Chromosomal",
reason_for_absence == 18 ~ "Laboratory findings",
reason_for_absence == 19 ~ "Injury, Poisoning",
reason_for_absence == 20 ~ "External causes and mortality",
reason_for_absence == 21 ~ "Health status",
)) %>%
mutate(Education.Level = case_when(
education == 1 ~ "High School",
education == 2 ~ "Bachelor",
education == 3 ~ "Post-graduate",
education == 4 ~ "Master or Phd",
)) %>%
mutate(Day.of.Week = case_when(
day_of_the_week == 2 ~ "Monday",
day_of_the_week == 3 ~ "Tuesday",
day_of_the_week == 4 ~ "Wednesday",
day_of_the_week == 5 ~ "Thursday",
day_of_the_week == 6 ~ "Friday",
))%>%
mutate(Season = case_when(
seasons == 1 ~ "Summer",
seasons == 2 ~ "Autunm",
seasons == 3 ~ "Winter",
seasons == 4 ~ "Spring",
))%>%
mutate(Disciplined = case_when(
disciplinary_failure == 1 ~ "Yes",
disciplinary_failure == 0 ~ "No",
))%>%
mutate(Drinker = case_when(
social_drinker == 1 ~ "Yes",
social_drinker == 0 ~ "No",
))%>%
mutate(Children = case_when(
son == 0 ~ "No Children",
son == 1 ~ "One",
son == 2 ~ "Two",
son == 3 ~ "Three",
son == 4 ~ "Four",
))%>%
mutate(Smoker = case_when(
social_smoker == 1 ~ "Yes",
social_smoker == 0 ~ "No",
))
# Confirming all the changes
str(courier2)
'data.frame': 740 obs. of 25 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
$ Reason : chr NA NA NA "Eye and Adnexa" ...
$ Education.Level : chr "High School" "High School" "High School" "High School" ...
$ Day.of.Week : chr "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
$ Season : chr "Summer" "Summer" "Summer" "Summer" ...
$ Disciplined : chr "No" "Yes" "No" "No" ...
$ Drinker : chr "Yes" "Yes" "Yes" "Yes" ...
$ Children : chr "Two" "One" "No Children" "Two" ...
$ Smoker : chr "No" "No" "No" "Yes" ...
#Numerical continuous variables (except ID as it is only for identification purposes)
desc_summary <- courier2 %>%
summarise(Transport.Mean = mean(transportation_expense, na.rm=TRUE), Transport_Median = median(transportation_expense, na.rm=TRUE), Transport.SD = sd(transportation_expense, na.rm=TRUE), Distance.Mean = mean(distance_to_work, na.rm=TRUE), Distance.Median = median(distance_to_work, na.rm=TRUE), Distance.SD = sd(distance_to_work, na.rm=TRUE), Service.Mean = mean(service_time, na.rm=TRUE), Service.Median = median(service_time, na.rm=TRUE), Service.SD = sd(service_time, na.rm=TRUE), Age.Mean = mean(age, na.rm=TRUE), Age.Median = median(age, na.rm=TRUE), Age.SD = sd(age, na.rm=TRUE), Son.Mean = mean(son, na.rm=TRUE), Son.Median = median(son, na.rm=TRUE), Son.SD = sd(son, na.rm=TRUE), Pet.Mean = mean(pet, na.rm=TRUE), Pet.Median = median(pet, na.rm=TRUE), Pet.SD = sd(pet, na.rm=TRUE), BMI.Mean = mean(bmi, na.rm=TRUE), BMI.Median = median(bmi, na.rm=TRUE), BMI.SD = sd(bmi, na.rm=TRUE), Hours.Mean = mean(absenteeism_time_in_hours, na.rm=TRUE), Hours.Median = median(absenteeism_time_in_hours, na.rm=TRUE), Hours.SD = sd(absenteeism_time_in_hours, na.rm=TRUE))
print(desc_summary)
Transport.Mean Transport_Median Transport.SD Distance.Mean
1 221.0355 225 66.95418 29.66757
Distance.Median Distance.SD Service.Mean Service.Median Service.SD
1 26 14.84812 12.56581 13 4.389813
Age.Mean Age.Median Age.SD Son.Mean Son.Median Son.SD Pet.Mean
1 36.44912 37 6.480148 1.017711 1 1.094928 0.7466125
Pet.Median Pet.SD BMI.Mean BMI.Median BMI.SD Hours.Mean
1 0 1.319726 26.68406 25 4.292819 6.977716
Hours.Median Hours.SD
1 3 13.47696
#(Formattable didn't work well with a lot of columns)
library(janitor) #Using Janitor package to get useful tables with frequency and percentages
# Frequency table for Reason
tabyl(courier2$Reason, sort = T)
courier2$Reason n percent valid_percent
Blood and Immune 1 0.001351351 0.003816794
Circulatory 4 0.005405405 0.015267176
Deformations and Chromosomal 1 0.001351351 0.003816794
Digestive 26 0.035135135 0.099236641
Ear and Mastoid 6 0.008108108 0.022900763
Eye and Adnexa 15 0.020270270 0.057251908
Genitourinary 19 0.025675676 0.072519084
Health status 6 0.008108108 0.022900763
Injury, Poisoning 40 0.054054054 0.152671756
Laboratory findings 21 0.028378378 0.080152672
Mental 3 0.004054054 0.011450382
Metabolic 2 0.002702703 0.007633588
Musculoskeletal 55 0.074324324 0.209923664
Neoplasms 1 0.001351351 0.003816794
Nervous system 8 0.010810811 0.030534351
Parasitic 16 0.021621622 0.061068702
Perinatal period 3 0.004054054 0.011450382
Pregnancy 2 0.002702703 0.007633588
Respiratory 25 0.033783784 0.095419847
Skin 8 0.010810811 0.030534351
<NA> 478 0.645945946 NA
# Since about 65% of the cases doesn't have an specific Reason for Absence, let's see the proportions excuding NA
tabyl(courier2$Reason, sort = T, show_na = FALSE)
courier2$Reason n percent
Blood and Immune 1 0.003816794
Circulatory 4 0.015267176
Deformations and Chromosomal 1 0.003816794
Digestive 26 0.099236641
Ear and Mastoid 6 0.022900763
Eye and Adnexa 15 0.057251908
Genitourinary 19 0.072519084
Health status 6 0.022900763
Injury, Poisoning 40 0.152671756
Laboratory findings 21 0.080152672
Mental 3 0.011450382
Metabolic 2 0.007633588
Musculoskeletal 55 0.209923664
Neoplasms 1 0.003816794
Nervous system 8 0.030534351
Parasitic 16 0.061068702
Perinatal period 3 0.011450382
Pregnancy 2 0.007633588
Respiratory 25 0.095419847
Skin 8 0.030534351
# Frequency table for Education Level
tabyl(courier2$Education.Level, sort = T)
courier2$Education.Level n percent valid_percent
Bachelor 46 0.062162162 0.063013699
High School 601 0.812162162 0.823287671
Master or Phd 4 0.005405405 0.005479452
Post-graduate 79 0.106756757 0.108219178
<NA> 10 0.013513514 NA
# Frequency table for Day of the Week
tabyl(courier2$Day.of.Week, sort = T)
courier2$Day.of.Week n percent
Friday 144 0.1945946
Monday 161 0.2175676
Thursday 125 0.1689189
Tuesday 154 0.2081081
Wednesday 156 0.2108108
#sort(table(courier2$Day.of.Week), decreasing=TRUE)
# Frequency table for Season
tabyl(courier2$Season, sort = T)
courier2$Season n percent
Autunm 192 0.2594595
Spring 195 0.2635135
Summer 170 0.2297297
Winter 183 0.2472973
# Frequency table for Disciplined or not
tabyl(courier2$Disciplined, sort = T)
courier2$Disciplined n percent valid_percent
No 695 0.939189189 0.94686649
Yes 39 0.052702703 0.05313351
<NA> 6 0.008108108 NA
# Frequency table for Drinker or not
tabyl(courier2$Drinker, sort = T)
courier2$Drinker n percent valid_percent
No 319 0.431081081 0.4328358
Yes 418 0.564864865 0.5671642
<NA> 3 0.004054054 NA
# Frequency table for Smoker or not
tabyl(courier2$Smoker, sort = T)
courier2$Smoker n percent valid_percent
No 682 0.921621622 0.92663043
Yes 54 0.072972973 0.07336957
<NA> 4 0.005405405 NA
I can see that the top 5 reasons for absence are (in frequency):
However it is relevant to note that 65% of the absences does not provide an specific medical reason. So there is still an important number of employees in this company that are absent for unknown reasons.
Also found interesting that 94% of the employees who reported absent have not been Disciplined; 92% are not Smokers, and 81% holds a High School Degree as higher education level.
It means that those variables may not help too much to explain well the hours absent as they almost do not vary in this dataset.
# Let's create a new dataframe with total number of absenteeism hours
library(data.table)
dt <- data.table(courier2)
dt2 <- dt[,list(Total.Hours = sum(absenteeism_time_in_hours, na.rm=T), freq = .N), by = c("ID", "Reason", "bmi", "distance_to_work", "age", "service_time", "Season", "Drinker", "Smoker", "Children", "Disciplined", "Day.of.Week", "Education.Level", "transportation_expense")]
str(dt2)
Classes 'data.table' and 'data.frame': 526 obs. of 16 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ Reason : chr NA NA NA "Eye and Adnexa" ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ Season : chr "Summer" "Summer" "Summer" "Summer" ...
$ Drinker : chr "Yes" "Yes" "Yes" "Yes" ...
$ Smoker : chr "No" "No" "No" "Yes" ...
$ Children : chr "Two" "One" "No Children" "Two" ...
$ Disciplined : chr "No" "Yes" "No" "No" ...
$ Day.of.Week : chr "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
$ Education.Level : chr "High School" "High School" "High School" "High School" ...
$ transportation_expense: int 289 118 179 279 289 179 NA 260 155 235 ...
$ Total.Hours : int 25 0 2 4 2 5 8 20 40 8 ...
$ freq : int 5 1 1 1 1 3 1 6 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
# Reason table
reason_table <- dt2 %>%
group_by(Reason) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
reason_table <- reason_table[with (reason_table, order(-Hours.Median)),]
formattable(reason_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Reason | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| Circulatory | 42.000000 | 24.0 | 47.721414 | 168 |
| Neoplasms | 24.000000 | 24.0 | NA | 24 |
| Skin | 23.375000 | 10.0 | 37.063411 | 187 |
| Blood and Immune | 8.000000 | 8.0 | NA | 8 |
| Deformations and Chromosomal | 8.000000 | 8.0 | NA | 8 |
| Digestive | 7.375000 | 8.0 | 8.197627 | 177 |
| Eye and Adnexa | 10.714286 | 8.0 | 11.013478 | 150 |
| Health status | 5.833333 | 8.0 | 3.371449 | 35 |
| Injury, Poisoning | 18.692308 | 8.0 | 22.291317 | 729 |
| Laboratory findings | 10.850000 | 8.0 | 18.313857 | 217 |
| Mental | 6.333333 | 8.0 | 2.886751 | 19 |
| Musculoskeletal | 18.000000 | 8.0 | 37.188708 | 810 |
| Nervous system | 20.375000 | 8.0 | 40.369499 | 163 |
| Parasitic | 12.133333 | 8.0 | 15.986006 | 182 |
| Pregnancy | 8.000000 | 8.0 | 0.000000 | 16 |
| Respiratory | 10.833333 | 8.0 | 10.821342 | 260 |
| Ear and Mastoid | 5.333333 | 5.5 | 2.943920 | 32 |
| Metabolic | 4.500000 | 4.5 | 4.949747 | 9 |
| Genitourinary | 8.368421 | 4.0 | 10.462688 | 159 |
| NA | 5.875445 | 4.0 | 9.012816 | 1651 |
| Perinatal period | 2.000000 | 2.0 | 1.000000 | 6 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Reason,Total.Hours, fill=Reason))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 60))+
labs(title="Box Plot - Total Absenteeism Hours by Reason of Absence",
x ="Reason of Absence", y = "Total Hours")
# Disciplined table
Disciplined_table <- dt2 %>%
group_by(Disciplined) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Disciplined_table <- Disciplined_table[with (Disciplined_table, order(-Hours.Median)),]
formattable(Disciplined_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Disciplined | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| No | 10.024742 | 8 | 17.635643 | 4862 |
| NA | 4.666667 | 5 | 3.723797 | 28 |
| Yes | 3.428571 | 0 | 20.283702 | 120 |
# Smoker table
Smoker_table <- dt2 %>%
group_by(Smoker) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Smoker_table <- Smoker_table[with (Smoker_table, order(-Hours.Median)),]
formattable(Smoker_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Smoker | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| No | 9.557203 | 8.0 | 17.72427 | 4511 |
| NA | 9.000000 | 5.5 | 10.42433 | 36 |
| Yes | 9.260000 | 4.0 | 19.09034 | 463 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Smoker,Total.Hours, fill=Smoker))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 30))+
labs(title="Box Plot - Total Absenteeism Hours by Smoker",
x ="Is Smoker?", y = "Total Hours")
# Drinker table
Drinker_table <- dt2 %>%
group_by(Drinker) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Drinker_table <- Drinker_table[with (Drinker_table, order(-Hours.Median)),]
formattable(Drinker_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Drinker | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| Yes | 11.100000 | 8 | 20.10949 | 3108 |
| NA | 13.666667 | 8 | 16.25833 | 41 |
| No | 7.658436 | 4 | 14.54640 | 1861 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Drinker,Total.Hours, fill=Drinker))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 40))+
labs(title="Box Plot - Total Absenteeism Hours by Drinker",
x ="Is Drinker?", y = "Total Hours")
# Day of the week table
Day_table <- dt2 %>%
group_by(Day.of.Week) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
Day_table <- Day_table[with (Day_table, order(-Hours.Median)),]
formattable(Day_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Day.of.Week | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| Monday | 11.130081 | 8 | 14.924927 | 1369 |
| Wednesday | 9.918182 | 8 | 24.079049 | 1091 |
| Tuesday | 11.872727 | 5 | 23.742931 | 1306 |
| Friday | 7.351064 | 4 | 10.087475 | 691 |
| Thursday | 6.213483 | 4 | 5.339544 | 553 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Day.of.Week,Total.Hours, fill=Day.of.Week))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 20))+
labs(title="Box Plot - Total Absenteeism Hours by Day of the Week",
x ="Day of the Week", y = "Total Hours")
#Season table
s_table <- dt2 %>%
group_by(Season) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
s_table <- s_table[with (s_table, order(-Hours.Median)),]
formattable(s_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Season | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| Autunm | 9.088000 | 8 | 14.29133 | 1136 |
| Summer | 9.356589 | 8 | 16.48001 | 1207 |
| Winter | 11.014925 | 8 | 23.91306 | 1476 |
| Spring | 8.630435 | 4 | 14.69251 | 1191 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Season,Total.Hours, fill=Season))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 20))+
labs(title="Box Plot - Total Absenteeism Hours by Season",
x ="Season", y = "Total Hours")
#Education Level table
e_table <- dt2 %>%
group_by(Education.Level) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
e_table <- e_table[with (e_table, order(-Hours.Median)),]
formattable(e_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Education.Level | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| Bachelor | 6.355556 | 8.0 | 6.839576 | 286 |
| High School | 10.393120 | 8.0 | 19.678894 | 4230 |
| Master or Phd | 5.250000 | 5.5 | 3.201562 | 21 |
| NA | 5.700000 | 5.0 | 4.762119 | 57 |
| Post-graduate | 6.933333 | 4.5 | 9.434640 | 416 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Education.Level,Total.Hours, fill=Education.Level))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 20))+
labs(title="Box Plot - Total Absenteeism Hours by Education Level",
x ="Education Level", y = "Total Hours")
#Children table
ch_table <- dt2 %>%
group_by(Children) %>%
summarise(Hours.Mean = mean(Total.Hours, na.rm=TRUE), Hours.Median = median(Total.Hours, na.rm=TRUE), Hours.SD = sd(Total.Hours, na.rm=TRUE), Total.Hours = sum(Total.Hours, na.rm=TRUE))
ch_table <- ch_table[with (ch_table, order(-Hours.Median)),]
formattable(ch_table,
align =c("l","c","c","c","c"),
list(`Indicator Name` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
| Children | Hours.Mean | Hours.Median | Hours.SD | Total.Hours |
|---|---|---|---|---|
| Four | 12.695652 | 8 | 12.89973 | 292 |
| Three | 14.583333 | 8 | 22.33814 | 175 |
| Two | 12.007463 | 8 | 21.46531 | 1609 |
| No Children | 7.973958 | 5 | 10.15195 | 1531 |
| One | 8.559748 | 4 | 21.53496 | 1361 |
| NA | 7.000000 | 2 | 12.26377 | 42 |
# Box Plot
box_plot_crop<-ggplot(data=dt2, aes(Children,Total.Hours, fill=Children))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 30))+
labs(title="Box Plot - Total Absenteeism Hours by Children",
x ="How many Children", y = "Total Hours")
From these tables and charts, I can see that:
Seems that having between 3 to 4 children may have some impact on the absence incidence.
For the rest of the categorical variables there are not visible differences that can explain Absence.
#Absenteesism histogram - saved as ggplot object
hours_hist <- ggplot(dt2, aes(x=Total.Hours)) +
geom_histogram(binwidth = 2,
fill='steelblue',
col='black') +
labs(title="Histogram - Total Absenteeism Hours",
x ="Total Hours", y = "Frequency")
hours_hist
# Tenure
scatter1 <- ggplot(dt2, aes(service_time,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,150))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Tenure",
x ="Years of Service", y = "Total Hours")
scatter1
#Age
scatter2 <- ggplot(dt2, aes(age,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Age",
x ="Age", y = "Total Hours")
scatter2
#BMI
scatter3 <- ggplot(dt2, aes(bmi,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs BMI",
x ="Body Mass Index", y = "Total Hours")
scatter3
#Distance to Work
scatter5 <- ggplot(dt2, aes(distance_to_work,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Distance to work",
x ="Distance to work in KM", y = "Total Hours")
scatter5
# Transportation Expenses
scatter6 <- ggplot(dt2, aes(transportation_expense,Total.Hours))+
geom_point()+
scale_y_continuous(limits=c(0,100))+
labs(title="Scatter Plot - Total Absenteeism Hours vs Transportation Expense",
x ="Transportation Expense", y = "Total Hours")
scatter6
I can’t find any relationship between any of the continuous variables and total hours of abcense.
Seems that the most closer impact to absence are some specific type of medical conditions and having 3-4 children also may increase chances to be absent.
Several bibliography related to Absenteeism suggest that Sick leave is the most common reason for being absent. The data here seems that are consistent with this conclusion, as we don’t see other variables affecting the absenteeism.
I wish I had other relevant variables in the dataset, like salary/wages, gender. Those 2 variables in particular are very relevant to predict absenteeism according to certain researches around this topic.
I need to complement it with theory around absenteeism, and why is it relevant to study in organizational context.
On the other hand, I would like to improve more the tables and maybe think on other possible combination of variables to explore relationships to the total hours of absence.