People miss work when they are sick. Sometimes, people miss work and say it is because they are sick when they are really not sick at all. If we assume people are equally likely to get sick on any given day of the week, we can plot the data and run a statistical test to compare the number of observed sick days vs. the number of expected sick days. If we observe meaningful differences between the observed and expected counts, it may suggest that people are faking sick and missing work for a different reason.
Sick leave data were collected between 2011 and Q1 2015:
## Set Working Directory
setwd("~/Documents/HRMeasured/Sick Leave")
## Read data into the R session. stringsAsFactors = FALSE keeps the date strings as character instead of converting them to factors
sickData <- read.csv("sickLeave.csv", header = TRUE, stringsAsFactors = FALSE)
## Select the single variable of interest for the study: sickLeaveTaken (column 5)
sickData <- sickData[c(5)]
dim(sickData)
## [1] 21264 1
head(sickData)
## sickLeaveTaken
## 1 3/7/14
## 2 3/10/14
## 3 6/25/13
## 4 6/23/14
## 5 9/8/14
## 6 7/12/13
Dates when employees used sick days were recoded into two variables:
## Convert dates from Character to POSIX
wDate <- strptime(as.character(sickData$sickLeaveTaken), "%m/%d/%y")
## Save reformatted Date variables to dataframe
sickData <- data.frame(sickData, wDate)
## Create new date variables - 'Day of the Week', and 'month'
sickDotw <- weekdays(wDate)
sickMonth <- months(wDate)
## Save new date variables to dataframe
sickData <- data.frame(sickData, sickDotw, sickMonth)
head(sickDotw)
## [1] "Friday" "Monday" "Tuesday" "Monday" "Monday" "Friday"
head(sickMonth)
## [1] "March" "March" "June" "June" "September" "July"
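One caveat on the recoding step: `weekdays()` and `months()` return names in the session's locale, so the English level names used later only match when R runs in an English locale. The following is a minimal sketch of a locale-independent alternative, assuming the same m/d/yy format. The names `sampleDates`, `sickDotwFixed`, and `sickMonthFixed` are hypothetical and not part of the original analysis.

```r
# Hypothetical sample in the same m/d/yy format as sickLeaveTaken
sampleDates <- c("3/7/14", "3/10/14", "6/25/13")

# as.Date() is enough here because the values carry no time of day
d <- as.Date(sampleDates, format = "%m/%d/%y")

# %u gives the ISO weekday number (1 = Monday ... 7 = Sunday) regardless of locale
dayNumber <- as.integer(format(d, "%u"))

# Map the numbers to fixed English labels instead of relying on weekdays()
dayNames <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
sickDotwFixed <- dayNames[dayNumber]

# month.name is a built-in constant holding the English month names
sickMonthFixed <- month.name[as.integer(format(d, "%m"))]

sickDotwFixed
sickMonthFixed
```

The three sample dates are the first three rows shown by `head(sickData)`, so the results can be compared directly with the `weekdays()` and `months()` output above.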
Some reformatting and sorting were needed to get the plots to display Monday through Friday and January through December:
## Order days of the week Monday through Friday and months January through December; dates that fall on a weekend become NA
sickData$sickDotw <- factor(sickData$sickDotw, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
sickData$sickMonth <- factor(sickData$sickMonth, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
## Subset Days of the week to only include observations equal to Monday through Friday
sickData <- subset(sickData, sickDotw %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
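It may help to see why these two steps remove weekend dates. When `factor()` is given an explicit `levels` vector, any value not in that vector becomes `NA`, and `subset()` drops rows whose condition is not `TRUE`. In the data above, 21,264 rows were read and the weekday counts reported below sum to 21,184, so 80 rows appear to have been dropped this way. The toy example below is a sketch with made-up values; `days`, `weekdayLevels`, `toy`, and `toyWeekdays` are hypothetical names.

```r
# Hypothetical day names, including two weekend entries
days <- c("Monday", "Saturday", "Friday", "Sunday", "Tuesday")
weekdayLevels <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Values not listed in `levels` are converted to NA
f <- factor(days, levels = weekdayLevels)

toy <- data.frame(day = f)
# subset() keeps only rows where the condition is TRUE, so the NA rows are dropped
toyWeekdays <- subset(toy, day %in% weekdayLevels)

sum(is.na(f))      # number of weekend entries turned into NA
nrow(toyWeekdays)  # number of rows that remain
```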
The number of sick days by day of the week (DOTW) was plotted:
## Create barplots: Count of Sick Days by Day of the Week
##install.packages("ggplot2")
library(ggplot2)
qplot(factor(sickDotw), data=sickData, xlab = "", ylab = "", main = "Sick Days taken by DOTW",geom="bar")
table(sickData$sickDotw)
##
## Monday Tuesday Wednesday Thursday Friday
## 5118 4221 3908 3697 4240
The number of sick days by month was plotted:
## Create barplots: Count of Sick Days by Month
library(ggplot2)
qplot(factor(sickMonth), data=sickData, xlab = "", ylab = "", main = "Sick Days taken by Month",geom="bar")
table(sickData$sickMonth)
##
## January February March April May June July
## 2541 2090 2132 1505 1297 1426 1440
## August September October November December
## 1532 1632 1919 1683 1987
A chi-squared goodness-of-fit test was run on sick days by day of the week to test whether the observed counts differed significantly from the counts expected if sick days were spread evenly across the five weekdays:
## Chi-squared test: do sick-day counts differ significantly across days of the week?
chiSquareWeekDay <- table(sickData$sickDotw)
chiSquareWeekDay
##
## Monday Tuesday Wednesday Thursday Friday
## 5118 4221 3908 3697 4240
chisq.test(table(sickData$sickDotw))
##
## Chi-squared test for given probabilities
##
## data: table(sickData$sickDotw)
## X-squared = 277.63, df = 4, p-value < 2.2e-16
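To make the test less of a black box, the statistic can be reproduced by hand. The sketch below re-enters the observed counts from the table, assumes (as the test does by default) that each weekday is equally likely, and applies the Pearson formula, the sum of (observed - expected)^2 / expected. The variable names are illustrative only.

```r
# Observed weekday counts copied from the table above
observed <- c(Monday = 5118, Tuesday = 4221, Wednesday = 3908,
              Thursday = 3697, Friday = 4240)

# Under the null hypothesis each weekday gets an equal share of the total
expected <- rep(sum(observed) / length(observed), length(observed))

# Pearson chi-squared statistic: sum of (O - E)^2 / E
chiSq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
degFree <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
pValue <- pchisq(chiSq, df = degFree, lower.tail = FALSE)

round(expected[1], 1)  # expected sick days per weekday
round(chiSq, 2)        # should agree with the X-squared value from chisq.test()
```

With 21,184 weekday sick days in total, the expected count is 4,236.8 per day, and Monday's surplus over that figure contributes most of the statistic.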
Monday is the day on which people in the sample most often called in sick. Mondays account for roughly 20% more sick days than either of the next two most frequent days, Friday (4,240) and Tuesday (4,221), compared with 5,118 on Monday. A chi-squared test with 4 degrees of freedom gave X-squared = 277.63, p < .01, suggesting the differences in sick days across weekdays are unlikely to be due to chance.
No statistical test was planned for sick days by month. Only the bar plot was used to look at the frequency of sick days reported in each month and see whether any patterns emerged.
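The chi-squared test only says that the weekdays differ somewhere. It does not say which days drive the result. A rough way to check, sketched below with the counts re-entered by hand, is to compute Monday's excess over each other day and to inspect the Pearson residuals that `chisq.test()` returns in its `residuals` component. The names `mondayExcess` and `pearson` are illustrative.

```r
# Observed weekday counts copied from the table above
observed <- c(Monday = 5118, Tuesday = 4221, Wednesday = 3908,
              Thursday = 3697, Friday = 4240)

# Percent by which Monday exceeds each other weekday
mondayExcess <- round(100 * (observed["Monday"] / observed[-1] - 1), 1)

# Pearson residuals, (observed - expected) / sqrt(expected), from chisq.test()
pearson <- round(chisq.test(observed)$residuals, 2)

mondayExcess
pearson
```

Positive residuals mark days with more sick days than an even split would predict, and negative residuals mark days with fewer. Here only Monday is well above expectation, while Wednesday and Thursday are well below it.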
Monday accounted for 20% more sick days in the sample than the next two most frequent days, Tuesday and Friday. People were less likely to call in sick on Wednesday and Thursday compared with the other weekdays. Based on these results, it seems people do fake sick and miss work, usually on a Monday.
When looking at the number of sick days by month, a different pattern emerges, one that suggests people miss work when they really are sick. The months with the most sick days cluster in winter (January, February, and March), while the fewest sick days were reported from late spring through summer (May, June, and July).
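One thing worth checking in the monthly totals is how often each month was observed. If the data run from January 2011 through March 2015 without gaps, then January, February, and March each appear five times and the other months four times, which would inflate the raw first-quarter counts. That coverage is an assumption here, since the source only says "between 2011 and Q1 2015". The sketch below divides each total by its assumed number of occurrences. `timesObserved` and `perYear` are hypothetical names.

```r
# Monthly totals copied from the table above
monthly <- c(January = 2541, February = 2090, March = 2132, April = 1505,
             May = 1297, June = 1426, July = 1440, August = 1532,
             September = 1632, October = 1919, November = 1683, December = 1987)

# ASSUMPTION: data cover January 2011 through March 2015 without gaps,
# so January-March occur five times and the other months four times
timesObserved <- c(rep(5, 3), rep(4, 9))

# Average sick days per occurrence of each calendar month
perYear <- round(monthly / timesObserved, 1)

# Months ranked from most to least sick days per year
sort(perYear, decreasing = TRUE)
```

Under that assumption, January still has the most sick days per year and May the fewest, so the cold-season pattern holds, although December and October move ahead of February and March.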