More Visualization - To explain Absenteeism
The dataser was found in:
https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism
It contain real data from a Courier Company, and it was created to explain the absenteeism.
Variables:
And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
First let’s set the working directory and load the data.
dim.data.frame(courier)
[1] 740 21
# A tibble: 6 × 21
ID reason_for_absence month_of_absence day_of_the_week seasons
<int> <int> <int> <int> <int>
1 11 26 7 3 1
2 36 0 7 3 1
3 3 23 7 4 1
4 7 7 7 5 1
5 11 23 7 5 1
6 3 23 7 6 1
# … with 16 more variables: transportation_expense <int>,
# distance_to_work <int>, service_time <int>, age <int>,
# work_load_average_day <chr>, hit_target <int>,
# disciplinary_failure <int>, education <int>, son <int>,
# social_drinker <int>, social_smoker <int>, pet <int>,
# weight <int>, height <int>, bmi <int>,
# absenteeism_time_in_hours <int>
# Want to see all the variables
head(courier)
ID reason_for_absence month_of_absence day_of_the_week seasons
1 11 26 7 3 1
2 36 0 7 3 1
3 3 23 7 4 1
4 7 7 7 5 1
5 11 23 7 5 1
6 3 23 7 6 1
transportation_expense distance_to_work service_time age
1 289 36 13 33
2 118 13 18 50
3 179 51 18 38
4 279 5 14 39
5 289 36 13 33
6 179 51 18 38
work_load_average_day hit_target disciplinary_failure education son
1 239,554 97 0 1 2
2 239,554 97 1 1 1
3 239,554 97 0 1 0
4 239,554 97 0 1 2
5 239,554 97 0 1 2
6 239,554 97 0 1 0
social_drinker social_smoker pet weight height bmi
1 1 0 1 90 172 30
2 1 0 0 98 178 31
3 1 0 0 89 170 31
4 1 1 0 68 168 24
5 1 0 1 90 172 30
6 1 0 0 89 170 31
absenteeism_time_in_hours
1 4
2 0
3 2
4 4
5 2
6 NA
colnames(courier)
[1] "ID" "reason_for_absence"
[3] "month_of_absence" "day_of_the_week"
[5] "seasons" "transportation_expense"
[7] "distance_to_work" "service_time"
[9] "age" "work_load_average_day"
[11] "hit_target" "disciplinary_failure"
[13] "education" "son"
[15] "social_drinker" "social_smoker"
[17] "pet" "weight"
[19] "height" "bmi"
[21] "absenteeism_time_in_hours"
Let’s see the structure more clearly by using str()
str(courier)
'data.frame': 740 obs. of 21 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ work_load_average_day : chr "239,554" "239,554" "239,554" "239,554" ...
$ hit_target : int 97 97 97 97 97 97 97 97 97 97 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ weight : int 90 98 89 68 90 89 80 65 95 88 ...
$ height : int 172 178 170 168 172 170 172 168 196 172 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
I found this dataset very comprehensive. There are 21 variables to explore potential explanations to the absenteeism.
In terms of the data type, We have:
Nine (9) Categorical variables:
- "reason_for_absence"
- "month_of_absence"
- "day_of_the_week"
- "seasons"
- "disciplinary_failure"
- "education"
- "social_drinker"
- "social_smoker"
- "pet" Three (3) numeric discrete:
- "ID"
- "work_load_average_day"
- "hit_target" Nine (9) numeric continuous:
- "transportation_expense"
- "distance_to_work"
- "service_time"
- "age"
- "son"
- "weight"
- "height"
- "bmi"
- "absenteeism_time_in_hours"The dataset seems very clean in general, but I will adjust a few things:
'data.frame': 740 obs. of 19 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ weight : int 90 98 89 68 90 89 80 65 95 88 ...
$ height : int 172 178 170 168 172 170 172 168 196 172 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
'data.frame': 740 obs. of 17 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
#recode key categorical variables to use its labels and will create categorical type of variable for Son (number of children)
courier2 <-courier2%>%
mutate(reason_c = case_when(
reason_for_absence == 1 ~ "parasitic",
reason_for_absence == 2 ~ "Neoplasms",
reason_for_absence == 3 ~ "blood and immune",
reason_for_absence == 4 ~ "metabolic",
reason_for_absence == 5 ~ "Mental",
reason_for_absence == 6 ~ "nervous system",
reason_for_absence == 7 ~ "eye and adnexa",
reason_for_absence == 8 ~ "ear and mastoid",
reason_for_absence == 9 ~ "circulatory",
reason_for_absence == 10 ~ "respiratory",
reason_for_absence == 11 ~ "digestive",
reason_for_absence == 12 ~ "skin",
reason_for_absence == 13 ~ "musculoskeletal",
reason_for_absence == 14 ~ "genitourinary",
reason_for_absence == 15 ~ "Pregnancy",
reason_for_absence == 16 ~ "perinatal period",
reason_for_absence == 17 ~ "deformations and chromosomal",
reason_for_absence == 18 ~ "laboratory findings",
reason_for_absence == 19 ~ "Injury, poisoning",
reason_for_absence == 20 ~ "External causes and mortality",
reason_for_absence == 21 ~ "health status",
)) %>%
mutate(education_c = case_when(
education == 1 ~ "High School",
education == 2 ~ "Bachelor",
education == 3 ~ "Post-graduate",
education == 4 ~ "Master or Phd",
)) %>%
mutate(day_week_c = case_when(
day_of_the_week == 2 ~ "Monday",
day_of_the_week == 3 ~ "Tuesday",
day_of_the_week == 4 ~ "Wednesday",
day_of_the_week == 5 ~ "Thursday",
day_of_the_week == 6 ~ "Friday",
))%>%
mutate(seasons_c = case_when(
seasons == 1 ~ "Summer",
seasons == 2 ~ "Autunm",
seasons == 3 ~ "Winter",
seasons == 4 ~ "Spring",
))%>%
mutate(disciplined = case_when(
disciplinary_failure == 1 ~ "Yes",
disciplinary_failure == 0 ~ "No",
))%>%
mutate(drinker = case_when(
social_drinker == 1 ~ "Yes",
social_drinker == 0 ~ "No",
))%>%
mutate(children_c = case_when(
son == 0 ~ "No Children",
son == 1 ~ "One",
son == 2 ~ "Two",
son == 3 ~ "Three",
son == 4 ~ "Four",
))%>%
mutate(smoker = case_when(
social_smoker == 1 ~ "Yes",
social_smoker == 0 ~ "No",
))
str(courier2)
'data.frame': 740 obs. of 25 variables:
$ ID : int 11 36 3 7 11 3 10 20 14 1 ...
$ reason_for_absence : int 26 0 23 7 23 23 22 23 19 22 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_the_week : int 3 3 4 5 5 6 6 6 2 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 289 118 179 279 289 179 NA 260 155 235 ...
$ distance_to_work : int 36 13 51 5 36 51 52 50 12 11 ...
$ service_time : int 13 18 18 14 13 18 3 11 14 14 ...
$ age : int 33 50 38 39 33 38 28 36 34 37 ...
$ disciplinary_failure : int 0 1 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 1 3 ...
$ son : int 2 1 0 2 2 0 1 4 2 1 ...
$ social_drinker : int 1 1 1 1 1 1 1 1 1 0 ...
$ social_smoker : int 0 0 0 1 0 0 0 0 0 0 ...
$ pet : int 1 0 0 0 1 0 4 0 0 1 ...
$ bmi : int 30 31 31 24 30 31 27 23 25 29 ...
$ absenteeism_time_in_hours: int 4 0 2 4 2 NA 8 4 40 8 ...
$ reason_c : chr NA NA NA "eye and adnexa" ...
$ education_c : chr "High School" "High School" "High School" "High School" ...
$ day_week_c : chr "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
$ seasons_c : chr "Summer" "Summer" "Summer" "Summer" ...
$ disciplined : chr "No" "Yes" "No" "No" ...
$ drinker : chr "Yes" "Yes" "Yes" "Yes" ...
$ children_c : chr "Two" "One" "No Children" "Two" ...
$ smoker : chr "No" "No" "No" "Yes" ...
#Numerical variables (except ID as it is only for identification purposes)
courier2 %>%
summarise(transport_mean = mean(transportation_expense, na.rm=TRUE), transport_median = median(transportation_expense, na.rm=TRUE), transport_sd = sd(transportation_expense, na.rm=TRUE), distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE), service_mean = mean(service_time, na.rm=TRUE), service_median = median(service_time, na.rm=TRUE), service_sd = sd(service_time, na.rm=TRUE), age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE), son_mean = mean(son, na.rm=TRUE), son_median = median(son, na.rm=TRUE), son_sd = sd(son, na.rm=TRUE), pet_mean = mean(pet, na.rm=TRUE), pet_median = median(pet, na.rm=TRUE), pet_sd = sd(pet, na.rm=TRUE), bmi_mean = mean(bmi, na.rm=TRUE), bmi_median = median(bmi, na.rm=TRUE), bmi_sd = sd(bmi, na.rm=TRUE), hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
transport_mean transport_median transport_sd distance_mean
1 221.0355 225 66.95418 29.66757
distance_median distance_sd service_mean service_median service_sd
1 26 14.84812 12.56581 13 4.389813
age_mean age_median age_sd son_mean son_median son_sd pet_mean
1 36.44912 37 6.480148 1.017711 1 1.094928 0.7466125
pet_median pet_sd bmi_mean bmi_median bmi_sd hours_mean
1 0 1.319726 26.68406 25 4.292819 6.977716
hours_median hours_sd
1 3 13.47696
# Frequency for categorical
table(courier2$reason_c)
blood and immune circulatory
1 4
deformations and chromosomal digestive
1 26
ear and mastoid eye and adnexa
6 15
genitourinary health status
19 6
Injury, poisoning laboratory findings
40 21
Mental metabolic
3 2
musculoskeletal Neoplasms
55 1
nervous system parasitic
8 16
perinatal period Pregnancy
3 2
respiratory skin
25 8
table(courier2$education_c)
Bachelor High School Master or Phd Post-graduate
46 601 4 79
table(courier2$day_week_c)
Friday Monday Thursday Tuesday Wednesday
144 161 125 154 156
table(courier2$seasons_c)
Autunm Spring Summer Winter
192 195 170 183
table(courier2$disciplined)
No Yes
695 39
table(courier2$drinker)
No Yes
319 418
table(courier2$smoker)
No Yes
682 54
# Absenteeism in hours descriptive grouped by reason
reason_table <- courier2 %>%
group_by(reason_c) %>%
summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE), hours_total = sum(absenteeism_time_in_hours, na.rm=TRUE))
print(reason_table,n=Inf)
# A tibble: 21 × 5
reason_c hours_mean hours_median hours_sd hours_total
<chr> <dbl> <dbl> <dbl> <int>
1 blood and immune 8 8 NA 8
2 circulatory 42 24 47.7 168
3 deformations and chro… 8 8 NA 8
4 digestive 7.08 8 8.08 177
5 ear and mastoid 5.33 5.5 2.94 32
6 eye and adnexa 10 8 10.9 150
7 genitourinary 8.83 4 10.6 159
8 health status 5.83 8 3.37 35
9 Injury, poisoning 18.2 8 22.1 729
10 laboratory findings 10.3 8 16.1 217
11 Mental 6.33 8 2.89 19
12 metabolic 4.5 4.5 4.95 9
13 musculoskeletal 15 8 22.0 810
14 Neoplasms 24 24 NA 24
15 nervous system 23.3 8 42.7 163
16 parasitic 11.4 8 14.3 182
17 perinatal period 2 2 1 6
18 Pregnancy 8 8 0 16
19 respiratory 11.3 8 10.7 260
20 skin 23.4 10 37.1 187
21 <NA> 3.57 2 6.17 1651
# Order table by the Sum of absenteeism hours
reason_t_o <- reason_table[with (reason_table, order(hours_total)),]
print(reason_t_o,n=Inf)
# A tibble: 21 × 5
reason_c hours_mean hours_median hours_sd hours_total
<chr> <dbl> <dbl> <dbl> <int>
1 perinatal period 2 2 1 6
2 blood and immune 8 8 NA 8
3 deformations and chro… 8 8 NA 8
4 metabolic 4.5 4.5 4.95 9
5 Pregnancy 8 8 0 16
6 Mental 6.33 8 2.89 19
7 Neoplasms 24 24 NA 24
8 ear and mastoid 5.33 5.5 2.94 32
9 health status 5.83 8 3.37 35
10 eye and adnexa 10 8 10.9 150
11 genitourinary 8.83 4 10.6 159
12 nervous system 23.3 8 42.7 163
13 circulatory 42 24 47.7 168
14 digestive 7.08 8 8.08 177
15 parasitic 11.4 8 14.3 182
16 skin 23.4 10 37.1 187
17 laboratory findings 10.3 8 16.1 217
18 respiratory 11.3 8 10.7 260
19 Injury, poisoning 18.2 8 22.1 729
20 musculoskeletal 15 8 22.0 810
21 <NA> 3.57 2 6.17 1651
# Now let's focus on the reasons with a total of hours above 150, and will eliminate NA.
courier_f <- filter(courier2, reason_c %in% c("eye and adnexa", "genitourinary", "nervous system", "circulatory", "digestive", "parasitic", "skin", "laboratory findings", "respiratory", "Injury, poisoning", "musculoskeletal" ))
str(courier_f)
'data.frame': 237 obs. of 25 variables:
$ ID : int 7 14 20 20 20 3 24 6 18 3 ...
$ reason_for_absence : int 7 19 1 1 11 11 14 11 10 11 ...
$ month_of_absence : int 7 7 7 7 7 7 7 7 8 8 ...
$ day_of_the_week : int 5 2 2 3 4 4 6 5 4 2 ...
$ seasons : int 1 1 1 1 1 1 1 1 1 1 ...
$ transportation_expense : int 279 155 260 260 260 179 246 189 330 179 ...
$ distance_to_work : int 5 12 50 50 50 51 25 29 16 51 ...
$ service_time : int 14 14 11 11 11 18 16 13 4 18 ...
$ age : int 39 34 36 36 36 38 41 33 28 38 ...
$ disciplinary_failure : int 0 0 0 0 0 0 0 0 0 0 ...
$ education : int 1 1 1 1 1 1 1 1 2 1 ...
$ son : int 2 2 4 4 4 0 0 2 0 0 ...
$ social_drinker : int 1 1 1 1 1 1 1 0 0 1 ...
$ social_smoker : int 1 0 0 0 0 0 0 0 0 0 ...
$ pet : int 0 0 0 0 0 0 0 2 0 0 ...
$ bmi : int 24 25 23 23 23 31 NA 25 25 31 ...
$ absenteeism_time_in_hours: int 4 40 8 8 8 1 NA 8 NA 1 ...
$ reason_c : chr "eye and adnexa" "Injury, poisoning" "parasitic" "parasitic" ...
$ education_c : chr "High School" "High School" "High School" "High School" ...
$ day_week_c : chr "Thursday" "Monday" "Monday" "Tuesday" ...
$ seasons_c : chr "Summer" "Summer" "Summer" "Summer" ...
$ disciplined : chr "No" "No" "No" "No" ...
$ drinker : chr "Yes" "Yes" "Yes" "Yes" ...
$ children_c : chr "Two" "Two" "Four" "Four" ...
$ smoker : chr "Yes" "No" "No" "No" ...
# The number of cases went down to 237 from 740
# Now let's summarize descriptives for reasons
reason_table2 <- courier_f %>%
group_by(reason_c) %>%
summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE), hours_total = sum(absenteeism_time_in_hours, na.rm=TRUE))
print(reason_table2,n=Inf)
# A tibble: 11 × 5
reason_c hours_mean hours_median hours_sd hours_total
<chr> <dbl> <dbl> <dbl> <int>
1 circulatory 42 24 47.7 168
2 digestive 7.08 8 8.08 177
3 eye and adnexa 10 8 10.9 150
4 genitourinary 8.83 4 10.6 159
5 Injury, poisoning 18.2 8 22.1 729
6 laboratory findings 10.3 8 16.1 217
7 musculoskeletal 15 8 22.0 810
8 nervous system 23.3 8 42.7 163
9 parasitic 11.4 8 14.3 182
10 respiratory 11.3 8 10.7 260
11 skin 23.4 10 37.1 187
# Age descriptives grouped by reason
courier_f %>%
group_by(reason_c) %>%
summarise(age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE))
# A tibble: 11 × 4
reason_c age_mean age_median age_sd
<chr> <dbl> <dbl> <dbl>
1 circulatory 36.8 39 6.13
2 digestive 35.1 37 5.15
3 eye and adnexa 32.9 31 5.73
4 genitourinary 41.1 41 6.53
5 Injury, poisoning 35.3 34 5.75
6 laboratory findings 37.6 34 8.68
7 musculoskeletal 36.5 37 6.65
8 nervous system 38.5 33.5 12.5
9 parasitic 37.7 36 9.07
10 respiratory 35.8 36.5 5.71
11 skin 36.1 32.5 9.78
# Distance to work descriptive vs reason
courier_f %>%
group_by(reason_c) %>%
summarise(distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE))
# A tibble: 11 × 4
reason_c distance_mean distance_median distance_sd
<chr> <dbl> <dbl> <dbl>
1 circulatory 21.5 25 7.68
2 digestive 29.7 26 16.7
3 eye and adnexa 24 26 12.9
4 genitourinary 24.6 25 12.6
5 Injury, poisoning 30.4 27 14.5
6 laboratory findings 25.8 26 13.6
7 musculoskeletal 29.5 26 14.3
8 nervous system 26.4 21 14.9
9 parasitic 24.4 20.5 13.7
10 respiratory 24 17 14.5
11 skin 29.6 26 14.7
# Crosstab Reason vs disciplined
xtabs(~ reason_c + disciplined, courier_f)
disciplined
reason_c No
circulatory 4
digestive 26
eye and adnexa 15
genitourinary 19
Injury, poisoning 40
laboratory findings 20
musculoskeletal 55
nervous system 8
parasitic 16
respiratory 25
skin 8
# First let's create a new dataframe with total number of absenteeism hours
library(data.table)
dt <- data.table(courier_f)
dt2 <- dt[,list(hourstotal = sum(absenteeism_time_in_hours), freq = .N), by = c("ID", "reason_c", "bmi", "smoker", "distance_to_work", "age", "service_time", "seasons_c", "drinker", "smoker", "son", "children_c")]
str(dt2)
Classes 'data.table' and 'data.frame': 179 obs. of 14 variables:
$ ID : int 7 14 20 20 3 24 6 18 10 11 ...
$ reason_c : chr "eye and adnexa" "Injury, poisoning" "parasitic" "digestive" ...
$ bmi : int 24 25 23 23 31 NA 25 25 27 NA ...
$ smoker : chr "Yes" "No" "No" "No" ...
$ distance_to_work: int 5 12 50 50 51 25 29 16 52 36 ...
$ age : int 39 34 36 36 38 41 33 28 28 33 ...
$ service_time : int 14 14 11 11 18 16 13 4 3 13 ...
$ seasons_c : chr "Summer" "Summer" "Summer" "Summer" ...
$ drinker : chr "Yes" "Yes" "Yes" "Yes" ...
$ smoker.1 : chr "Yes" "No" "No" "No" ...
$ son : int 2 2 4 4 0 0 2 0 1 2 ...
$ children_c : chr "Two" "Two" "Four" "Four" ...
$ hourstotal : int 4 56 16 8 10 NA 8 NA 40 8 ...
$ freq : int 1 2 2 1 3 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
#Absenteesism histogram - saved as ggplot object
hours_hist <- ggplot(dt2, aes(x=hourstotal)) +
geom_histogram(binwidth = 2,fill='steelblue',col='black')
hours_hist
# Box Plot / Season vs Absenteeism hours
boxplot(hourstotal~seasons_c,data=dt2, main="Total Hours Absenteeism",
xlab="Reason", ylab="Total Hours")
# Let's eliminate outliners
box_plot_crop<-ggplot(data=dt2, aes(seasons_c,hourstotal))
box_plot_crop+ geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim = c(0, 50))
# Box Plot / Reason vs Absenteeism hours
box_plot_crop<-ggplot(data=dt2, aes(reason_c,hourstotal))
box_plot_crop+ geom_boxplot(outlier.shape = NA) +
coord_cartesian(ylim = c(0, 120))
# Box Plot / Reason vs Absenteeism hours again + colored and legend to see it more clearly
box_plot_crop<-ggplot(data=dt2, aes(reason_c,hourstotal, fill=reason_c))
box_plot_crop+ geom_boxplot(alpha=0.7, outlier.shape = NA) +
theme(legend.position = "right") +
theme (axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
coord_cartesian(ylim = c(0, 120))
# Here seems that employees with circulatory issues, sum more total absenteeism hours vs other reasons.
# Means
tapply(dt2$hourstotal,dt2$reason_c,mean, na.rm=T)
circulatory digestive eye and adnexa
56.00000 10.41176 12.50000
genitourinary Injury, poisoning laboratory findings
10.60000 23.51613 11.42105
musculoskeletal nervous system parasitic
23.35294 27.16667 15.16667
respiratory skin
15.29412 26.71429
Analysis of Variance Table
Response: dt2$hourstotal
Df Sum Sq Mean Sq F value Pr(>F)
dt2$reason_c 10 10462 1046.17 1.2951 0.2372
Residuals 162 130866 807.82
Call:
lm(formula = dt2$hourstotal ~ dt2$reason_c)
Residuals:
Min 1Q Median 3Q Max
-48.000 -15.353 -7.167 0.588 216.647
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.00 16.41 3.413 0.000813
dt2$reason_cdigestive -45.59 17.80 -2.561 0.011339
dt2$reason_ceye and adnexa -43.50 18.35 -2.371 0.018914
dt2$reason_cgenitourinary -45.40 17.98 -2.526 0.012509
dt2$reason_cInjury, poisoning -32.48 17.19 -1.890 0.060514
dt2$reason_claboratory findings -44.58 17.66 -2.525 0.012543
dt2$reason_cmusculoskeletal -32.65 17.12 -1.907 0.058270
dt2$reason_cnervous system -28.83 20.10 -1.435 0.153307
dt2$reason_cparasitic -40.83 18.35 -2.226 0.027415
dt2$reason_crespiratory -40.71 17.80 -2.287 0.023489
dt2$reason_cskin -29.29 19.61 -1.493 0.137338
(Intercept) ***
dt2$reason_cdigestive *
dt2$reason_ceye and adnexa *
dt2$reason_cgenitourinary *
dt2$reason_cInjury, poisoning .
dt2$reason_claboratory findings *
dt2$reason_cmusculoskeletal .
dt2$reason_cnervous system
dt2$reason_cparasitic *
dt2$reason_crespiratory *
dt2$reason_cskin
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28.42 on 162 degrees of freedom
(6 observations deleted due to missingness)
Multiple R-squared: 0.07402, Adjusted R-squared: 0.01687
F-statistic: 1.295 on 10 and 162 DF, p-value: 0.2372
# fit the ANOVA model by skipping NAs
fit1 <- lm(hourstotal ~ 1 + reason_c, data = dt2)
summary(fit1)
Call:
lm(formula = hourstotal ~ 1 + reason_c, data = dt2)
Residuals:
Min 1Q Median 3Q Max
-48.000 -15.353 -7.167 0.588 216.647
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.00 16.41 3.413 0.000813 ***
reason_cdigestive -45.59 17.80 -2.561 0.011339 *
reason_ceye and adnexa -43.50 18.35 -2.371 0.018914 *
reason_cgenitourinary -45.40 17.98 -2.526 0.012509 *
reason_cInjury, poisoning -32.48 17.19 -1.890 0.060514 .
reason_claboratory findings -44.58 17.66 -2.525 0.012543 *
reason_cmusculoskeletal -32.65 17.12 -1.907 0.058270 .
reason_cnervous system -28.83 20.10 -1.435 0.153307
reason_cparasitic -40.83 18.35 -2.226 0.027415 *
reason_crespiratory -40.71 17.80 -2.287 0.023489 *
reason_cskin -29.29 19.61 -1.493 0.137338
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28.42 on 162 degrees of freedom
(6 observations deleted due to missingness)
Multiple R-squared: 0.07402, Adjusted R-squared: 0.01687
F-statistic: 1.295 on 10 and 162 DF, p-value: 0.2372
# Box Plot / Drinker vs Absenteeism hours
box_plot_crop<-ggplot(data=dt2, aes(drinker,hourstotal))
box_plot_crop+ geom_boxplot(color="red", fill="orange", alpha=0.2,outlier.shape = NA) +
coord_cartesian(ylim = c(0, 50))
# The total amount of hours looks higher on Drinkers,let's run a T-test
t.test(hourstotal ~ drinker, data = dt2)
Welch Two Sample t-test
data: hourstotal by drinker
t = -1.7792, df = 164.82, p-value = 0.07705
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-16.159996 0.840755
sample estimates:
mean in group No mean in group Yes
14.32927 21.98889
# Seems there is a difference but not statistically significant
# What about smokers?
# Box Plot / Drinker vs Absenteeism hours
box_plot_crop<-ggplot(data=dt2, aes(smoker,hourstotal))
box_plot_crop+ geom_boxplot(color="black", fill="pink", alpha=0.5, outlier.shape = NA, na.rm=T) +
coord_cartesian(ylim = c(0, 35))
# Also seems there is some difference
t.test(hourstotal ~ smoker, data = dt2)
Welch Two Sample t-test
data: hourstotal by smoker
t = 1.9245, df = 37.904, p-value = 0.06183
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-0.4326298 17.0704827
sample estimates:
mean in group No mean in group Yes
19.47682 11.15789
# But not statistically significant either
# Let's see Children as categorical variable
box_plot_crop<-ggplot(data=dt2, aes(children_c,hourstotal))
box_plot_crop+ geom_boxplot(color="blue", fill="pink", alpha=5, outlier.shape = NA) +
coord_cartesian(ylim = c(0, 50))
# Mean
tapply(dt2$hourstotal,dt2$children_c,mean)
Four No Children One Three Two
19.55556 NA 19.22222 32.00000 NA
Analysis of Variance Table
Response: dt2$hourstotal
Df Sum Sq Mean Sq F value Pr(>F)
dt2$children_c 4 2490 622.61 0.7499 0.5594
Residuals 167 138652 830.25
Call:
lm(formula = dt2$hourstotal ~ dt2$children_c)
Residuals:
Min 1Q Median 3Q Max
-24.000 -13.500 -10.984 0.337 220.778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.5556 9.6047 2.036 0.0433 *
dt2$children_cNo Children -5.5717 10.2782 -0.542 0.5885
dt2$children_cOne -0.3333 10.5214 -0.032 0.9748
dt2$children_cThree 12.4444 17.3151 0.719 0.4733
dt2$children_cTwo 1.9444 10.4027 0.187 0.8520
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28.81 on 167 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.01764, Adjusted R-squared: -0.005884
F-statistic: 0.7499 on 4 and 167 DF, p-value: 0.5594
#Nothing here...
For this let’s use Scatter Plots
# Tenure
scatter1 <- ggplot(dt2, aes(service_time,hourstotal))+
geom_point()+
scale_y_continuous(limits=c(0,150))
scatter1
#Age
scatter2 <- ggplot(dt2, aes(age,hourstotal))+
geom_point()+
scale_y_continuous(limits=c(0,100))
scatter2
scatter2+
facet_wrap(vars(reason_c))
#BMI
scatter3 <- ggplot(dt2, aes(bmi,hourstotal))+
geom_point()+
scale_y_continuous(limits=c(0,100))
scatter3
#Distance to Work
scatter5 <- ggplot(dt2, aes(distance_to_work,hourstotal))+
geom_point()+
scale_y_continuous(limits=c(0,100))
scatter5
# Facet Wrap with Reason
scatter3+
facet_wrap(vars(reason_c))
# Facet wrap bivariate (Reason-Drinker)
scatter3+
facet_wrap(vars(reason_c,drinker))
# (Chart seems too small)
# Facet wrap bivariate (Season -Drinker)
scatter3+
facet_wrap(vars(seasons_c,drinker))
# Also let's try a Scatter (BMI and hours absent) coloring if smoker or not
scatter3 <- ggplot(dt2, aes(bmi,hourstotal, col=drinker))+
geom_point()+
scale_y_continuous(limits=c(0,100))
scatter3
grid_hours <- scatter2 +
facet_grid(children_c~drinker)
grid_hours
final_grid <- grid_hours +
labs(x="Employee Age",
y="Total Hours Absent",
title="Scatter between Age and Hours absent",
subtitle="Grid Drinker vs Children")+
theme_bw()+
theme(plot.title = element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5))
final_grid
Maybe I should try to exhaust all possible relationships among variables in the dataframe, there are some variables that I haven’t explore at all, like Transportation Expenses or Disciplinary action.
Has been difficult to find solid significant relationships among the variables and total absenteeism hours, but there are some variables that might provide some explanation like Drinkers seems to produce a more hours of absenteeism than no drinkers, however is not an statistically significant difference.
Also certain medical conditions tend to produce more absenteeism hours, like Circulatory or Digestive issues, still none of those statistically significant.
I need to work a lot more on the customization of the graph, titles, legends, colors, etc. that will help others to understand the graphs.
Not really, I think graph are very useful to explore what are the relationships among variables.
However, in some cases I felt the need to jump into specific statistics (T-test, Anova, etc) to confirm the significancy of certain visual relationships.