Introduction

Why re-encode ‘Reason for absence’?

There was a mild complication with ‘Reason for absence’: 0s are present in the data, but Martiniano et al. (2012) do not list ‘0’ as a possible code. Conversely, ‘20’ is a listed code but never appears in the data. Based on other papers that use this data set, it seems that the codes for 1 through 20 should be applied to the data values 0 through 19 (21 through 28 are fine).
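The re-encoding described above can be sketched as a small mapping function. This is only an illustration of the stated rule, not the code used for the analysis; the function name `recode_reason` and the sample values are hypothetical.

```python
def recode_reason(raw_code: int) -> int:
    """Map a raw 'Reason for absence' value to the published code list.

    Assumption (taken from the text above): raw values 0-19 correspond to
    published codes 1-20, and raw values 21-28 are already correct.
    """
    if 0 <= raw_code <= 19:
        return raw_code + 1
    if 21 <= raw_code <= 28:
        return raw_code
    # A raw 20 (or anything outside 0-28) is not expected in the data.
    raise ValueError(f"unexpected raw reason code: {raw_code}")


# Hypothetical sample of raw values, for illustration only.
raw_values = [0, 13, 19, 23, 28]
recoded = [recode_reason(v) for v in raw_values]
print(recoded)  # [1, 14, 20, 23, 28]
```

Raising an error for unexpected values, rather than passing them through, makes it obvious if the data ever contains a code the rule does not cover.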

Distribution of Reasons

Data Post-Feature Engineering

Exploratory Data Analysis

Reasons for Absenteeism

Distribution of Age

Age & Absenteeism

Reasons for Absenteeism

Top 4 most common reasons for absence (by frequency):

  • Medical consultations

  • Dental consultations

  • Physiotherapy (PT)

  • Genitourinary system diseases

Top 5 most time-consuming reasons (by total hours):

  • Genitourinary system diseases

  • External causes of morbidity & mortality

  • Medical consultations

  • Dental consultations

  • Integumentary system (skin & subcutaneous tissue) diseases
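The two rankings above answer different questions: one counts absence records per reason, the other sums hours per reason. A minimal sketch of both, assuming each absence is a `(reason, hours)` pair; the records below are made up for illustration and are not the study's data, and `rank_reasons` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical (reason, hours) records, for illustration only.
records = [
    ("Medical consultations", 2),
    ("Medical consultations", 3),
    ("Medical consultations", 2),
    ("Dental consultations", 2),
    ("Dental consultations", 1),
    ("Physiotherapy (PT)", 1),
    ("Genitourinary system diseases", 40),
]


def rank_reasons(records, top_n):
    """Return the top reasons by number of absences and by total hours."""
    counts = Counter(reason for reason, _ in records)
    hours = defaultdict(int)
    for reason, h in records:
        hours[reason] += h
    by_frequency = [reason for reason, _ in counts.most_common(top_n)]
    by_hours = sorted(hours, key=hours.get, reverse=True)[:top_n]
    return by_frequency, by_hours


by_frequency, by_hours = rank_reasons(records, top_n=2)
print(by_frequency)  # ['Medical consultations', 'Dental consultations']
print(by_hours)      # ['Genitourinary system diseases', 'Medical consultations']
```

A reason with few but long absences can top the hours ranking while sitting low in the frequency ranking, which is why the two lists differ.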

Age & Absenteeism

  • Age is evenly distributed from the mid-20s to the early 40s.

  • The majority of employees who take leaves of absence do so for 8 hours or less. There is a distinct gap between half-day absences (roughly 3 to 4 hours or less) and full-day absences (8 hours). Beyond that, absence lengths increase in 8-hour increments.
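The pattern above suggests natural cut points for grouping absence length. A minimal sketch, assuming thresholds of 4 and 8 hours; these cut points and the bucket names are assumptions chosen to mirror the gap described above, not values taken from the analysis.

```python
from collections import Counter


def absence_bucket(hours: float) -> str:
    """Assign an absence to a coarse length bucket.

    The 4-hour and 8-hour thresholds are assumptions for illustration.
    """
    if hours <= 4:
        return "half-day"
    if hours <= 8:
        return "full-day"
    return "multi-day"


# Hypothetical absence lengths in hours, for illustration only.
sample_hours = [1, 2, 3, 8, 8, 16, 24, 40]
bucket_counts = Counter(absence_bucket(h) for h in sample_hours)
print(bucket_counts["half-day"], bucket_counts["full-day"], bucket_counts["multi-day"])  # 3 2 3
```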

Feature Importance

ATW (Original Data)

ATW2 (Re-leveled data)

ATW2 without ‘Reasons’

Original Data set

  • Reasons for absence

  • Disciplinary failure

  • Children (‘Son’)

  • Education

  • Age

ATW2

  • Reasons for absence:

    • Respiratory system diseases

    • External causes of morbidity & mortality

    • Genitourinary system diseases

    • Musculoskeletal system diseases

    • Eye diseases

    • Integumentary system diseases

    • Digestive system diseases

  • Children

  • Day of the week

ATW2 without ‘Reasons’

  • Disciplinary failure

  • Day of the week

  • Age

  • Season

  • Children

Clustering, Classification, Results, & Analysis

How many clusters?

2 clusters

3 Clusters

4 Clusters

5 Clusters

LogitBoost

‘Absent Type’

  • Created a new categorical variable for length of absence (‘One day’, ‘Multi-day’) based on clustering results & EDA (see scatter plot of ‘Age’ & ‘Absenteeism’)

  • When cluster assignment was included:

    • \(\text{Accuracy} = 0.905\)

    • \(\text{Specificity} = 0.971\)

    • \(\text{AUC} = 0.808\)

  • While accuracy & specificity were high, sensitivity was low. The boosted logit model is good at identifying the negative class (‘One day’ absences), but it is too cautious about the positive class and under-predicts ‘Multi-day’ absences.
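For reference, accuracy, sensitivity, and specificity all come from the same 2×2 confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the model's actual confusion matrix, and `binary_metrics` is an illustrative helper. Which class counts as "positive" depends on how the modeling software orders the class labels, so that is worth checking before interpreting the numbers.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts.

    tp/fn: positive-class cases predicted correctly / missed.
    tn/fp: negative-class cases predicted correctly / mislabeled as positive.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall on the positive class
        "specificity": tn / (tn + fp),  # recall on the negative class
    }


# Hypothetical counts for an imbalanced problem, for illustration only.
metrics = binary_metrics(tp=5, fn=5, fp=2, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, accuracy can stay high even when sensitivity is poor, because the majority class dominates the total.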

Results & Recommendations

  • Two clusters seem appropriate for k-means clustering
  • Classification models were much more successful than regression models. Future models of absenteeism could focus on classifying time periods (rather than treating time as a continuous variable); however, further tuning is required to improve the metrics.
  • The combination of EDA, data knowledge, and clustering could be an avenue for binning absence length into categories
  • Some reasons for absence may not be controllable, but the company can address the most prevalent ones by promoting wellness practices & by tailoring benefits to prevent disorders such as those relating to the genitourinary system (e.g., PT, regular movement in and around the work environment)
    • In terms of frequency, about 50% of absences are due to medical & dental consultations, PT, and genitourinary system diseases.

    • Genitourinary system diseases, external causes of morbidity & mortality, and medical & dental consultations also make up 50% of total hours absent.