Clean Data and Plot - Absenteeism
The dataser was found in:
https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism
It contain real data from a Courier Company, and it was created to explain the absenteeism.
Variables:
And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
First let’s set the working directory and load the data.
dim.data.frame(courier)
[1] 740 21
# A tibble: 6 × 21
ID reason_for_absence month_of_absence day_of_the_week seasons
<int> <int> <int> <int> <int>
1 11 26 7 3 1
2 36 0 7 3 1
3 3 23 7 4 1
4 7 7 7 5 1
5 11 23 7 5 1
6 3 23 7 6 1
# … with 16 more variables: transportation_expense <int>,
# distance_to_work <int>, service_time <int>, age <int>,
# work_load_average_day <chr>, hit_target <int>,
# disciplinary_failure <int>, education <int>, son <int>,
# social_drinker <int>, social_smoker <int>, pet <int>,
# weight <int>, height <int>, bmi <int>,
# absenteeism_time_in_hours <int>
# Want to see all the variables
head(courier)
ID reason_for_absence month_of_absence day_of_the_week seasons
1 11 26 7 3 1
2 36 0 7 3 1
3 3 23 7 4 1
4 7 7 7 5 1
5 11 23 7 5 1
6 3 23 7 6 1
transportation_expense distance_to_work service_time age
1 289 36 13 33
2 118 13 18 50
3 179 51 18 38
4 279 5 14 39
5 289 36 13 33
6 179 51 18 38
work_load_average_day hit_target disciplinary_failure education son
1 239,554 97 0 1 2
2 239,554 97 1 1 1
3 239,554 97 0 1 0
4 239,554 97 0 1 2
5 239,554 97 0 1 2
6 239,554 97 0 1 0
social_drinker social_smoker pet weight height bmi
1 1 0 1 90 172 30
2 1 0 0 98 178 31
3 1 0 0 89 170 31
4 1 1 0 68 168 24
5 1 0 1 90 172 30
6 1 0 0 89 170 31
absenteeism_time_in_hours
1 4
2 0
3 2
4 4
5 2
6 NA
colnames(courier)
[1] "ID" "reason_for_absence"
[3] "month_of_absence" "day_of_the_week"
[5] "seasons" "transportation_expense"
[7] "distance_to_work" "service_time"
[9] "age" "work_load_average_day"
[11] "hit_target" "disciplinary_failure"
[13] "education" "son"
[15] "social_drinker" "social_smoker"
[17] "pet" "weight"
[19] "height" "bmi"
[21] "absenteeism_time_in_hours"
Let’s see the structure more clearly by using str()
str(courier)
'data.frame': 740 obs. of 21 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ work_load_average_day : chr "239,554" "239,554" "239,554" "239,554" ...
$ hit_target : int 97 97 97 97 97 97 97 97 97 97 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ weight : int 90 98 89 68 90 89 80 65 95 88 ...
$ height : int 172 178 170 168 172 170 172 168 196 172 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
I found this dataset very comprehensive. There are 21 variables to explore potential explanations to the absenteeism.
In terms of the data type, We have:
Nine (9) Categorical variables:
- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"
- "disciplinary_failure"
- "education"
- "social_drinker"
- "social_smoker"
- "pet" Three (3) numeric discrete:
- "ID"
- "work_load_average_day"
- "hit_target" Nine (9) numeric continuous:
- "transportation_expense"
- "distance_to_work"
- "service_time"
- "age"
- "son"
- "weight"
- "height"
- "bmi"
- "absenteeism_time_in_hours"The dataset seems very clean in general, but I will adjust a few things:
'data.frame': 740 obs. of 19 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ weight : int 90 98 89 68 90 89 80 65 95 88 ...
$ height : int 172 178 170 168 172 170 172 168 196 172 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
'data.frame': 740 obs. of 17 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
#recode key categorical variables to use its labels
courier2 <-courier2%>%
mutate(reason_c = case_when(
reason_for_absence == 1 ~ "parasitic",
reason_for_absence == 2 ~ "Neoplasms",
reason_for_absence == 3 ~ "blood and immune",
reason_for_absence == 4 ~ "metabolic",
reason_for_absence == 5 ~ "Mental",
reason_for_absence == 6 ~ "nervous system",
reason_for_absence == 7 ~ "eye and adnexa",
reason_for_absence == 8 ~ "ear and mastoid",
reason_for_absence == 9 ~ "circulatory",
reason_for_absence == 10 ~ "respiratory",
reason_for_absence == 11 ~ "digestive",
reason_for_absence == 12 ~ "skin",
reason_for_absence == 13 ~ "musculoskeletal",
reason_for_absence == 14 ~ "genitourinary",
reason_for_absence == 15 ~ "Pregnancy",
reason_for_absence == 16 ~ "perinatal period",
reason_for_absence == 17 ~ "deformations and chromosomal",
reason_for_absence == 18 ~ "laboratory findings",
reason_for_absence == 19 ~ "Injury, poisoning",
reason_for_absence == 20 ~ "External causes and mortality",
reason_for_absence == 21 ~ "health status",
)) %>%
mutate(education_c = case_when(
education == 1 ~ "High School",
education == 2 ~ "Bachelor",
education == 3 ~ "Post-graduate",
education == 4 ~ "Master or Phd",
)) %>%
mutate(day_week_c = case_when(
day_of_the_week == 2 ~ "Monday",
day_of_the_week == 3 ~ "Tuesday",
day_of_the_week == 4 ~ "Wednesday",
day_of_the_week == 5 ~ "Thursday",
day_of_the_week == 6 ~ "Friday",
))%>%
mutate(seasons_c = case_when(
seasons == 1 ~ "Summer",
seasons == 2 ~ "Autunm",
seasons == 3 ~ "Winter",
seasons == 4 ~ "Spring",
))%>%
mutate(disciplined = case_when(
disciplinary_failure == 1 ~ "Yes",
disciplinary_failure == 0 ~ "No",
))%>%
mutate(drinker = case_when(
social_drinker == 1 ~ "Yes",
social_drinker == 0 ~ "No",
))%>%
mutate(smoker = case_when(
social_smoker == 1 ~ "Yes",
social_smoker == 0 ~ "No",
))
str(courier2)
'data.frame': 740 obs. of 24 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
$ reason_c : chr NA NA NA "eye and adnexa" ...
$ education_c : chr "High School" "High School" "High School" "High School" ...
$ day_week_c : chr "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
$ seasons_c : chr "Summer" "Summer" "Summer" "Summer" ...
$ disciplined : chr "No" "Yes" "No" "No" ...
$ drinker : chr "Yes" "Yes" "Yes" "Yes" ...
$ smoker : chr "No" "No" "No" "Yes" ...
#Numerical variables (except ID as it is only for identification purposes)
courier2 %>%
summarise(transport_mean = mean(transportation_expense, na.rm=TRUE), transport_median = median(transportation_expense, na.rm=TRUE), transport_sd = sd(transportation_expense, na.rm=TRUE), distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE), service_mean = mean(service_time, na.rm=TRUE), service_median = median(service_time, na.rm=TRUE), service_sd = sd(service_time, na.rm=TRUE), age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE), son_mean = mean(son, na.rm=TRUE), son_median = median(son, na.rm=TRUE), son_sd = sd(son, na.rm=TRUE), pet_mean = mean(pet, na.rm=TRUE), pet_median = median(pet, na.rm=TRUE), pet_sd = sd(pet, na.rm=TRUE), bmi_mean = mean(bmi, na.rm=TRUE), bmi_median = median(bmi, na.rm=TRUE), bmi_sd = sd(bmi, na.rm=TRUE), hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
transport_mean transport_median transport_sd distance_mean
1 221.0355 225 66.95418 29.66757
distance_median distance_sd service_mean service_median service_sd
1 26 14.84812 12.56581 13 4.389813
age_mean age_median age_sd son_mean son_median son_sd pet_mean
1 36.44912 37 6.480148 1.017711 1 1.094928 0.7466125
pet_median pet_sd bmi_mean bmi_median bmi_sd hours_mean
1 0 1.319726 26.68406 25 4.292819 6.977716
hours_median hours_sd
1 3 13.47696
# Frequency for categorical
table(courier2$reason_c)
blood and immune circulatory
1 4
deformations and chromosomal digestive
1 26
ear and mastoid eye and adnexa
6 15
genitourinary health status
19 6
Injury, poisoning laboratory findings
40 21
Mental metabolic
3 2
musculoskeletal Neoplasms
55 1
nervous system parasitic
8 16
perinatal period Pregnancy
3 2
respiratory skin
25 8
table(courier2$education_c)
Bachelor High School Master or Phd Post-graduate
46 601 4 79
table(courier2$day_week_c)
Friday Monday Thursday Tuesday Wednesday
144 161 125 154 156
table(courier2$seasons_c)
Autunm Spring Summer Winter
192 195 170 183
table(courier2$disciplined)
No Yes
695 39
table(courier2$drinker)
No Yes
319 418
table(courier2$smoker)
No Yes
682 54
# Absenteeism in hours descriptives grouped by reason
courier2 %>%
group_by(reason_c) %>%
summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
# A tibble: 21 × 4
reason_c hours_mean hours_median hours_sd
<chr> <dbl> <dbl> <dbl>
1 blood and immune 8 8 NA
2 circulatory 42 24 47.7
3 deformations and chromosomal 8 8 NA
4 digestive 7.08 8 8.08
5 ear and mastoid 5.33 5.5 2.94
6 eye and adnexa 10 8 10.9
7 genitourinary 8.83 4 10.6
8 health status 5.83 8 3.37
9 Injury, poisoning 18.2 8 22.1
10 laboratory findings 10.3 8 16.1
# … with 11 more rows
# Age descriptives grouped by reason
courier2 %>%
group_by(reason_c) %>%
summarise(age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE))
# A tibble: 21 × 4
reason_c age_mean age_median age_sd
<chr> <dbl> <dbl> <dbl>
1 blood and immune 40 40 NA
2 circulatory 36.8 39 6.13
3 deformations and chromosomal 40 40 NA
4 digestive 35.1 37 5.15
5 ear and mastoid 36.5 37 4.42
6 eye and adnexa 32.9 31 5.73
7 genitourinary 41.1 41 6.53
8 health status 37.2 38 3.71
9 Injury, poisoning 35.3 34 5.75
10 laboratory findings 37.6 34 8.68
# … with 11 more rows
# Distance to work descriptive vs reason
courier2 %>%
group_by(reason_c) %>%
summarise(distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE))
# A tibble: 21 × 4
reason_c distance_mean distance_median distance_sd
<chr> <dbl> <dbl> <dbl>
1 blood and immune 31 31 NA
2 circulatory 21.5 25 7.68
3 deformations and chromos… 22 22 NA
4 digestive 29.7 26 16.7
5 ear and mastoid 21 16 16.3
6 eye and adnexa 24 26 12.9
7 genitourinary 24.6 25 12.6
8 health status 30.5 24 16.6
9 Injury, poisoning 30.4 27 14.5
10 laboratory findings 25.8 26 13.6
# … with 11 more rows
# Crosstab Reason vs disciplined
xtabs(~ reason_c + disciplined, courier2)
disciplined
reason_c No
blood and immune 1
circulatory 4
deformations and chromosomal 1
digestive 26
ear and mastoid 5
eye and adnexa 15
genitourinary 19
health status 6
Injury, poisoning 40
laboratory findings 20
Mental 3
metabolic 2
musculoskeletal 55
Neoplasms 1
nervous system 8
parasitic 16
perinatal period 3
Pregnancy 2
respiratory 25
skin 8
# Absenteeism in hours descriptives grouped by Disciplined
courier2 %>%
group_by(disciplined, na.rm=TRUE) %>%
summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
# A tibble: 3 × 5
# Groups: disciplined [3]
disciplined na.rm hours_mean hours_median hours_sd
<chr> <lgl> <dbl> <dbl> <dbl>
1 No TRUE 7.15 3 13.1
2 Yes TRUE 3.64 0 20.9
3 <NA> TRUE 5.6 8 3.29
# Absenteesism histogram - saved as ggplot object
hours_hist <- ggplot(courier2, aes(x=absenteeism_time_in_hours)) +
geom_histogram(binwidth = 2,fill='steelblue',col='black')
hours_hist
What variable(s) you are visualizing: here we are looking at the hours of absense
What question(s) you are attempting to answer with the visualization: I want to see how is the amount of hours of absenteeism are distributed across the population
What conclusions you can make from the visualization: the very first thing that caught my eye is the dispersion of the amount of absenteeism hours. Which I anticipated with the descriptives above (SD=13).
It seems that most of the population with absenteeism acummulate between 0 to less than 12 hours.Then the large amounts of absenteeism hours are concentrated in a few employees.
# use geom_bar() to visualize if there is a pattern related to an specific day of the week
day_week_bar <- ggplot(courier2,aes(x=day_week_c))+
geom_bar()+
scale_fill_manual(values = c("red","steelblue"))+
labs(x="Day of the Week", y="Frequency",title="Absenteeism by Day of the Week")
day_week_bar
What variable(s) you are visualizing: total number of absenteeism hours by each day of the week
What question(s) you are attempting to answer with the visualization: I want to see if there is a pattern or a day with more absenteeism than others
What conclusions you can make from the visualization: it seems that the distribution of hours of absense are pretty even during the week, except for Thursdays which is a bit lower.
Seems that there is not a “preferred” day to be absent but we can see if there is a significant (statistically) difference between Thursdays and the rest of the week.
(I don’t know why the colors in the bar don’t show up, I will have to look into it)
# use geom_bar visualize the absenteeism time distribution per day of the week
ggplot(courier2, aes(absenteeism_time_in_hours)) +
geom_histogram(binwidth = 4) +
labs(title = "Absenteeism time in hours by Day of the week") +
theme_bw() +
facet_wrap(vars(day_week_c), scales = "free")
ggplot(courier2, aes(absenteeism_time_in_hours)) +
geom_histogram(binwidth = 1) +
labs(title = "Absenteeism time in hours by Reasons") +
theme_bw() +
facet_wrap(vars(reason_c), scales = "free")
What variable(s) you are visualizing: distribution of absenteeism hours by each day of the week and by reason.
What question(s) you are attempting to answer with the visualization: I want to see how is distribution of absenteeism hours per day of the week and reasons. Trying to find a pattern.
What conclusions you can make from the visualization: I found interesting that the amount of hours is less disperse on Thursdays, so employees who do not show up to work that day are absent for less hours than the rest of the days. The distribution on Thursdays is a bit more homogeneous.
On the other hand, it is difficult to have an insight with charts by Reason, probably I should group reasons by wider categories.
• What questions are left unanswered with your visualizations: I need to think on a few more hypothesis around the absenteeism and try more bivariate plots. For example: I what if I create an scatter plot between age and absenteeism time?, or distance from work and absenteeism time?
• What about the visualizations may be unclear to a naive viewer: probably some of the histograms when grouping by reasons for example might be unclear for the cases where the count is low. Too many categories in reasons for plotting in the way I did.
• How could you improve the visualizations for the final project: I would need to clean a bit the data by grouping reasons in bigger categories and I will use scatter plots, also I will need to fix the colors.