HW-4

Clean Data and Plot - Absenteeism

Felix Betancourt
2022-07-12

About Dataset:

The dataser was found in:

https://www.kaggle.com/datasets/tonypriyanka2913/employee-absenteeism

It contain real data from a Courier Company, and it was created to explain the absenteeism.

Variables:

  1. Individual identification (ID)
  2. Reason for absence (ICD). Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to XXI) as follows: I Certain infectious and parasitic diseases II Neoplasms III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism IV Endocrine, nutritional and metabolic diseases V Mental and behavioural disorders VI Diseases of the nervous system VII Diseases of the eye and adnexa VIII Diseases of the ear and mastoid process IX Diseases of the circulatory system X Diseases of the respiratory system XI Diseases of the digestive system XII Diseases of the skin and subcutaneous tissue XIII Diseases of the musculoskeletal system and connective tissue XIV Diseases of the genitourinary system XV Pregnancy, childbirth and the puerperium XVI Certain conditions originating in the perinatal period XVII Congenital malformations, deformations and chromosomal abnormalities XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified XIX Injury, poisoning and certain other consequences of external causes XX External causes of morbidity and mortality XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).

  1. Month of absence
  2. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
  3. Seasons (summer (1), autumn (2), winter (3), spring (4))
  4. Transportation expense
  5. Distance from Residence to Work (kilometers)
  6. Service time
  7. Age
  8. Work load Average/day
  9. Hit target
  10. Disciplinary failure (yes=1; no=0)
  11. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
  12. Son (number of children)
  13. Social drinker (yes=1; no=0)
  14. Social smoker (yes=1; no=0)
  15. Pet (number of pet)
  16. Weight
  17. Height
  18. Body mass index
  19. Absenteeism time in hours (target)

Now let’s start working with the data set.

Setting up and Reading the data

First let’s set the working directory and load the data.

#First of all let's load the Dplyr and Tidyverse package

library(ggplot2)
library(dplyr)
library(tidyverse)

# Now let's set the working directory

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/DataSets/Absenteeism/")

courier <- read.csv("Absenteeism_at_work_Project.csv")

Exploring the data

dim.data.frame(courier)
[1] 740  21
courier_t <- as_tibble(courier)

head(courier_t)
# A tibble: 6 × 21
     ID reason_for_absence month_of_absence day_of_the_week seasons
  <int>              <int>            <int>           <int>   <int>
1    11                 26                7               3       1
2    36                  0                7               3       1
3     3                 23                7               4       1
4     7                  7                7               5       1
5    11                 23                7               5       1
6     3                 23                7               6       1
# … with 16 more variables: transportation_expense <int>,
#   distance_to_work <int>, service_time <int>, age <int>,
#   work_load_average_day <chr>, hit_target <int>,
#   disciplinary_failure <int>, education <int>, son <int>,
#   social_drinker <int>, social_smoker <int>, pet <int>,
#   weight <int>, height <int>, bmi <int>,
#   absenteeism_time_in_hours <int>
# Want to see all the variables
head(courier)
  ID reason_for_absence month_of_absence day_of_the_week seasons
1 11                 26                7               3       1
2 36                  0                7               3       1
3  3                 23                7               4       1
4  7                  7                7               5       1
5 11                 23                7               5       1
6  3                 23                7               6       1
  transportation_expense distance_to_work service_time age
1                    289               36           13  33
2                    118               13           18  50
3                    179               51           18  38
4                    279                5           14  39
5                    289               36           13  33
6                    179               51           18  38
  work_load_average_day hit_target disciplinary_failure education son
1               239,554         97                    0         1   2
2               239,554         97                    1         1   1
3               239,554         97                    0         1   0
4               239,554         97                    0         1   2
5               239,554         97                    0         1   2
6               239,554         97                    0         1   0
  social_drinker social_smoker pet weight height bmi
1              1             0   1     90    172  30
2              1             0   0     98    178  31
3              1             0   0     89    170  31
4              1             1   0     68    168  24
5              1             0   1     90    172  30
6              1             0   0     89    170  31
  absenteeism_time_in_hours
1                         4
2                         0
3                         2
4                         4
5                         2
6                        NA
colnames(courier)
 [1] "ID"                        "reason_for_absence"       
 [3] "month_of_absence"          "day_of_the_week"          
 [5] "seasons"                   "transportation_expense"   
 [7] "distance_to_work"          "service_time"             
 [9] "age"                       "work_load_average_day"    
[11] "hit_target"                "disciplinary_failure"     
[13] "education"                 "son"                      
[15] "social_drinker"            "social_smoker"            
[17] "pet"                       "weight"                   
[19] "height"                    "bmi"                      
[21] "absenteeism_time_in_hours"

Let’s see the structure more clearly by using str()

str(courier)
'data.frame':   740 obs. of  21 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ work_load_average_day    : chr  "239,554" "239,554" "239,554" "239,554" ...
 $ hit_target               : int  97 97 97 97 97 97 97 97 97 97 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...

I found this dataset very comprehensive. There are 21 variables to explore potential explanations to the absenteeism.

In terms of the data type, We have:

The dataset seems very clean in general, but I will adjust a few things:

  1. Variable “work_load_average_day” and “hit_target” are not clear in the description of the file. I rather to eliminate those variables from the dataset
courier2 <- select(courier, - c("work_load_average_day", "hit_target"))
str(courier2)
'data.frame':   740 obs. of  19 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ weight                   : int  90 98 89 68 90 89 80 65 95 88 ...
 $ height                   : int  172 178 170 168 172 170 172 168 196 172 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...
  1. On the other hand weight and height are used to calculate BMI, so I will eliminate also those 2 variables to simplify the dataset.
courier2 <- select(courier2, - c("weight", "height"))
str(courier2)
'data.frame':   740 obs. of  17 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...
  1. I think worth recoding back to the categories name for a few relevant categorical variables but saving it in new variables in case I need to use it as character in tables.
#recode key categorical variables to use its labels

courier2 <-courier2%>%
  mutate(reason_c = case_when(
        reason_for_absence == 1 ~ "parasitic",
        reason_for_absence == 2 ~ "Neoplasms",
        reason_for_absence == 3 ~ "blood and immune",
        reason_for_absence == 4 ~ "metabolic",
        reason_for_absence == 5 ~ "Mental",
        reason_for_absence == 6 ~ "nervous system",
        reason_for_absence == 7 ~ "eye and adnexa",
        reason_for_absence == 8 ~ "ear and mastoid",
        reason_for_absence == 9 ~ "circulatory",
        reason_for_absence == 10 ~ "respiratory",
        reason_for_absence == 11 ~ "digestive",
        reason_for_absence == 12 ~ "skin",
        reason_for_absence == 13 ~ "musculoskeletal",
        reason_for_absence == 14 ~ "genitourinary",
        reason_for_absence == 15 ~ "Pregnancy",
        reason_for_absence == 16 ~ "perinatal period",
        reason_for_absence == 17 ~ "deformations and chromosomal",
        reason_for_absence == 18 ~ "laboratory findings",
        reason_for_absence == 19 ~ "Injury, poisoning",
        reason_for_absence == 20 ~ "External causes and mortality",
        reason_for_absence == 21 ~ "health status",
        )) %>%
  mutate(education_c = case_when(
         education == 1 ~ "High School",
         education == 2 ~ "Bachelor",
         education == 3 ~ "Post-graduate",
         education == 4 ~ "Master or Phd",
         )) %>%
  mutate(day_week_c = case_when(
         day_of_the_week == 2 ~ "Monday",
         day_of_the_week == 3 ~ "Tuesday",
         day_of_the_week == 4 ~ "Wednesday",
         day_of_the_week == 5 ~ "Thursday",
         day_of_the_week == 6 ~ "Friday",
         ))%>%
  mutate(seasons_c = case_when(
         seasons == 1 ~ "Summer",
         seasons == 2 ~ "Autunm",
         seasons == 3 ~ "Winter",
         seasons == 4 ~ "Spring",
         ))%>%
  mutate(disciplined = case_when(
         disciplinary_failure == 1 ~ "Yes",
         disciplinary_failure == 0 ~ "No",
         ))%>%
  mutate(drinker = case_when(
         social_drinker == 1 ~ "Yes",
         social_drinker == 0 ~ "No",
        ))%>%
  mutate(smoker = case_when(
         social_smoker == 1 ~ "Yes",
         social_smoker == 0 ~ "No",
         ))
str(courier2)
'data.frame':   740 obs. of  24 variables:
 $ ID                       : int  11 36 3 7 11 3 10 20 14 1 ...
 $ reason_for_absence       : int  26 0 23 7 23 23 22 23 19 22 ...
 $ month_of_absence         : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_the_week          : int  3 3 4 5 5 6 6 6 2 2 ...
 $ seasons                  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ transportation_expense   : int  289 118 179 279 289 179 NA 260 155 235 ...
 $ distance_to_work         : int  36 13 51 5 36 51 52 50 12 11 ...
 $ service_time             : int  13 18 18 14 13 18 3 11 14 14 ...
 $ age                      : int  33 50 38 39 33 38 28 36 34 37 ...
 $ disciplinary_failure     : int  0 1 0 0 0 0 0 0 0 0 ...
 $ education                : int  1 1 1 1 1 1 1 1 1 3 ...
 $ son                      : int  2 1 0 2 2 0 1 4 2 1 ...
 $ social_drinker           : int  1 1 1 1 1 1 1 1 1 0 ...
 $ social_smoker            : int  0 0 0 1 0 0 0 0 0 0 ...
 $ pet                      : int  1 0 0 0 1 0 4 0 0 1 ...
 $ bmi                      : int  30 31 31 24 30 31 27 23 25 29 ...
 $ absenteeism_time_in_hours: int  4 0 2 4 2 NA 8 4 40 8 ...
 $ reason_c                 : chr  NA NA NA "eye and adnexa" ...
 $ education_c              : chr  "High School" "High School" "High School" "High School" ...
 $ day_week_c               : chr  "Tuesday" "Tuesday" "Wednesday" "Thursday" ...
 $ seasons_c                : chr  "Summer" "Summer" "Summer" "Summer" ...
 $ disciplined              : chr  "No" "Yes" "No" "No" ...
 $ drinker                  : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ smoker                   : chr  "No" "No" "No" "Yes" ...

Descriptive statistics

  1. Mean, median, and standard deviation for numerical variables, and frequencies for categorical variables.
#Numerical variables (except ID as it is only for identification purposes)

courier2 %>%
  summarise(transport_mean = mean(transportation_expense, na.rm=TRUE), transport_median = median(transportation_expense, na.rm=TRUE), transport_sd = sd(transportation_expense, na.rm=TRUE), distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE), service_mean = mean(service_time, na.rm=TRUE), service_median = median(service_time, na.rm=TRUE), service_sd = sd(service_time, na.rm=TRUE), age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE), son_mean = mean(son, na.rm=TRUE), son_median = median(son, na.rm=TRUE), son_sd = sd(son, na.rm=TRUE), pet_mean = mean(pet, na.rm=TRUE), pet_median = median(pet, na.rm=TRUE), pet_sd = sd(pet, na.rm=TRUE), bmi_mean = mean(bmi, na.rm=TRUE), bmi_median = median(bmi, na.rm=TRUE), bmi_sd = sd(bmi, na.rm=TRUE), hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
  transport_mean transport_median transport_sd distance_mean
1       221.0355              225     66.95418      29.66757
  distance_median distance_sd service_mean service_median service_sd
1              26    14.84812     12.56581             13   4.389813
  age_mean age_median   age_sd son_mean son_median   son_sd  pet_mean
1 36.44912         37 6.480148 1.017711          1 1.094928 0.7466125
  pet_median   pet_sd bmi_mean bmi_median   bmi_sd hours_mean
1          0 1.319726 26.68406         25 4.292819   6.977716
  hours_median hours_sd
1            3 13.47696
# Frequency for categorical

table(courier2$reason_c)

            blood and immune                  circulatory 
                           1                            4 
deformations and chromosomal                    digestive 
                           1                           26 
             ear and mastoid               eye and adnexa 
                           6                           15 
               genitourinary                health status 
                          19                            6 
           Injury, poisoning          laboratory findings 
                          40                           21 
                      Mental                    metabolic 
                           3                            2 
             musculoskeletal                    Neoplasms 
                          55                            1 
              nervous system                    parasitic 
                           8                           16 
            perinatal period                    Pregnancy 
                           3                            2 
                 respiratory                         skin 
                          25                            8 
table(courier2$education_c)

     Bachelor   High School Master or Phd Post-graduate 
           46           601             4            79 
table(courier2$day_week_c)

   Friday    Monday  Thursday   Tuesday Wednesday 
      144       161       125       154       156 
table(courier2$seasons_c)

Autunm Spring Summer Winter 
   192    195    170    183 
table(courier2$disciplined)

 No Yes 
695  39 
table(courier2$drinker)

 No Yes 
319 418 
table(courier2$smoker)

 No Yes 
682  54 
  1. In addition to overall means, medians, and SDs, use group_by() and summarise() to compute mean/median/SD for any relevant groupings
# Absenteeism in hours descriptives grouped by reason

courier2 %>%
  group_by(reason_c) %>%
  summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
# A tibble: 21 × 4
   reason_c                     hours_mean hours_median hours_sd
   <chr>                             <dbl>        <dbl>    <dbl>
 1 blood and immune                   8             8      NA   
 2 circulatory                       42            24      47.7 
 3 deformations and chromosomal       8             8      NA   
 4 digestive                          7.08          8       8.08
 5 ear and mastoid                    5.33          5.5     2.94
 6 eye and adnexa                    10             8      10.9 
 7 genitourinary                      8.83          4      10.6 
 8 health status                      5.83          8       3.37
 9 Injury, poisoning                 18.2           8      22.1 
10 laboratory findings               10.3           8      16.1 
# … with 11 more rows
# Age descriptives grouped by reason

courier2 %>%
  group_by(reason_c) %>%
  summarise(age_mean = mean(age, na.rm=TRUE), age_median = median(age, na.rm=TRUE), age_sd = sd(age, na.rm=TRUE))
# A tibble: 21 × 4
   reason_c                     age_mean age_median age_sd
   <chr>                           <dbl>      <dbl>  <dbl>
 1 blood and immune                 40           40  NA   
 2 circulatory                      36.8         39   6.13
 3 deformations and chromosomal     40           40  NA   
 4 digestive                        35.1         37   5.15
 5 ear and mastoid                  36.5         37   4.42
 6 eye and adnexa                   32.9         31   5.73
 7 genitourinary                    41.1         41   6.53
 8 health status                    37.2         38   3.71
 9 Injury, poisoning                35.3         34   5.75
10 laboratory findings              37.6         34   8.68
# … with 11 more rows
# Distance to work descriptive vs reason

courier2 %>%
  group_by(reason_c) %>%
  summarise(distance_mean = mean(distance_to_work, na.rm=TRUE), distance_median = median(distance_to_work, na.rm=TRUE), distance_sd = sd(distance_to_work, na.rm=TRUE))
# A tibble: 21 × 4
   reason_c                  distance_mean distance_median distance_sd
   <chr>                             <dbl>           <dbl>       <dbl>
 1 blood and immune                   31                31       NA   
 2 circulatory                        21.5              25        7.68
 3 deformations and chromos…          22                22       NA   
 4 digestive                          29.7              26       16.7 
 5 ear and mastoid                    21                16       16.3 
 6 eye and adnexa                     24                26       12.9 
 7 genitourinary                      24.6              25       12.6 
 8 health status                      30.5              24       16.6 
 9 Injury, poisoning                  30.4              27       14.5 
10 laboratory findings                25.8              26       13.6 
# … with 11 more rows
# Crosstab Reason vs disciplined

xtabs(~ reason_c + disciplined, courier2)
                              disciplined
reason_c                       No
  blood and immune              1
  circulatory                   4
  deformations and chromosomal  1
  digestive                    26
  ear and mastoid               5
  eye and adnexa               15
  genitourinary                19
  health status                 6
  Injury, poisoning            40
  laboratory findings          20
  Mental                        3
  metabolic                     2
  musculoskeletal              55
  Neoplasms                     1
  nervous system                8
  parasitic                    16
  perinatal period              3
  Pregnancy                     2
  respiratory                  25
  skin                          8
# Absenteeism in hours descriptives grouped by Disciplined

courier2 %>%
  group_by(disciplined, na.rm=TRUE) %>%
  summarise(hours_mean = mean(absenteeism_time_in_hours, na.rm=TRUE), hours_median = median(absenteeism_time_in_hours, na.rm=TRUE), hours_sd = sd(absenteeism_time_in_hours, na.rm=TRUE))
# A tibble: 3 × 5
# Groups:   disciplined [3]
  disciplined na.rm hours_mean hours_median hours_sd
  <chr>       <lgl>      <dbl>        <dbl>    <dbl>
1 No          TRUE        7.15            3    13.1 
2 Yes         TRUE        3.64            0    20.9 
3 <NA>        TRUE        5.6             8     3.29

Let’s do Visualization

# Absenteesism histogram - saved as ggplot object

hours_hist <- ggplot(courier2, aes(x=absenteeism_time_in_hours)) +
 geom_histogram(binwidth = 2,fill='steelblue',col='black')
hours_hist

It seems that most of the population with absenteeism acummulate between 0 to less than 12 hours.Then the large amounts of absenteeism hours are concentrated in a few employees.

# use geom_bar() to visualize if there is a pattern related to an specific day of the week

day_week_bar <- ggplot(courier2,aes(x=day_week_c))+
  geom_bar()+
  scale_fill_manual(values = c("red","steelblue"))+
  labs(x="Day of the Week", y="Frequency",title="Absenteeism by Day of the Week")
day_week_bar

Seems that there is not a “preferred” day to be absent but we can see if there is a significant (statistically) difference between Thursdays and the rest of the week.

(I don’t know why the colors in the bar don’t show up, I will have to look into it)

Bivariate

# use geom_bar visualize the absenteeism time distribution per day of the week

ggplot(courier2, aes(absenteeism_time_in_hours)) + 
  geom_histogram(binwidth = 4) + 
  labs(title = "Absenteeism time in hours by Day of the week") + 
  theme_bw() +
  facet_wrap(vars(day_week_c), scales = "free")
ggplot(courier2, aes(absenteeism_time_in_hours)) + 
  geom_histogram(binwidth = 1) + 
  labs(title = "Absenteeism time in hours by Reasons") + 
  theme_bw() +
  facet_wrap(vars(reason_c), scales = "free")

On the other hand, it is difficult to have an insight with charts by Reason, probably I should group reasons by wider categories.

Final comments

• What questions are left unanswered with your visualizations: I need to think on a few more hypothesis around the absenteeism and try more bivariate plots. For example: I what if I create an scatter plot between age and absenteeism time?, or distance from work and absenteeism time?

• What about the visualizations may be unclear to a naive viewer: probably some of the histograms when grouping by reasons for example might be unclear for the cases where the count is low. Too many categories in reasons for plotting in the way I did.

• How could you improve the visualizations for the final project: I would need to clean a bit the data by grouping reasons in bigger categories and I will use scatter plots, also I will need to fix the colors.