The kitchen areas at FreeAgent are usually very clean but in this scenario, let’s imagine that we have been very messy recently! We haven’t been putting our mugs in the dishwasher or generally keeping the kitchen clean so we’ve designed a daily rota to make sure that one person is responsible for giving the kitchen a five-minute blitz after lunchtime.
The data science team are interested in analyzing the data to find out why people became so messy in the first place, so we asked everyone to keep details of their cleaning duty in a shared spreadsheet. We asked them:
their name
the date when they last cleaned the kitchen
whether the dishwasher was full at the time
the number of mugs they found on the side
whether they wiped the sides or sink
mydata <- read.csv("https://raw.githubusercontent.com/ajbentley/cuny_ms_ds/master/607/cleaning_data.csv",
header = TRUE, colClasses = "character")
dim(mydata)
## [1] 9 6
date_cleaned | name | sides_and_sink | dishwasher_full | no_of_mugs | notes |
---|---|---|---|---|---|
16-07-18 | Charlotte | sides and sink | NO | 16 | new cleaning schedule begins today! |
17-7-18 | David E | NA | NA | 2 - 6 | NA |
18/07/18 | Hannah | sides | YES | 15(ish) | all fine |
19 07 18 | Dave | sink, side | YES | 18 | so many mugs |
20-7-18 | Charlie | didn’t need doing | YES | 12-14 (i think!) | can’t remember very well |
23-07-2018 | Davida | NA | NA | 0 | All of us were at conference but I still had to put 0s in form! |
24-07-18 | David A | both | NO | 4-14 | NA |
25 7 2018 | Hanson | just the side | NO | 3 | not v messy |
260718 | Davie | sink only | YES | 6 | NA |
This looks like a great data set for cleaning!
date_cleaned: an onerous process before I discovered the joys of lubridate
name: This is a really tricky one, not because of the complexity of cleaning but because I have to assume what is meant. For example, “Davida” is a female name, though an uncommon one, but it could also be David A. Similarly I have no way of knowing whether Dave or Davie are one of the Davids or even if Davie is a misspelling of Dave. Honestly, if I was working on this professionally I would have to get in touch with the client to verify the names. As is I’m not going to touch it since assuming things about people’s names seems unwise.
sides_and_sink: For this I’ll create two columns, one for sides and one for sink.
Dishwasher Full: Doesn’t look like it needs cleaning.
no_of_mugs: interesting decisions to be made. I can either add min and max columns, populating both with the same number when there is just one number, create buckets, or take the average of the ranges. Of particular concern is David A’s 4-14, which is somewhat nonsensical and I’m tempted to just NA it.
Notes: Another column that I could do some cleaning on but unless I know that some sort of analysis will be done requiring it to have proper spelling and grammar I don’t see a reason why I should.
Please note, while the article this is taken from (which includes instructions on how to clean it, but I am not going to look at them) mentions that there is a duplicated row this is not evident in the data provided on the site to download. It also makes some note about how Davida should have put an NA in instead of a 0 because everyone was out of the office–that makes no sense at all to me so I am not doing it.
It looks like all of the dates follow the European format (day/month/year), so that makes things easier than if they were mixed formats, especially with lubridate, my favorite function ever.
x |
---|
2018-07-16 |
2018-07-17 |
2018-07-18 |
2018-07-19 |
2018-07-20 |
2018-07-23 |
2018-07-24 |
2018-07-25 |
2018-07-26 |
Knowing what day of the week each of these days corresponds to will also be useful
df$dow <- weekdays(as.Date(df$date_cleaned)) # add column that reflects DoW
df$dow <- as.factor(df$dow)
df <- df[,c(1,7,2:6)] # Move DoW column next to Date
colnames(df)
## [1] "date_cleaned" "dow" "name" "sides_and_sink"
## [5] "dishwasher_full" "no_of_mugs" "notes"
Creating columns for sides and sink that check whether that column has the words “side”, “sink”, or “both”. These are logical vectors and depending on what was going to be done with the data I might transform these to binary (also might do the same with dishwasher full).
df$sides <- str_detect(df$sides_and_sink,("side|both"))
df$sink <- str_detect(df$sides_and_sink,"sink|both")
kable(df[,8:9])
sides | sink |
---|---|
TRUE | TRUE |
NA | NA |
TRUE | FALSE |
TRUE | TRUE |
FALSE | FALSE |
NA | NA |
TRUE | TRUE |
TRUE | FALSE |
FALSE | TRUE |
The easiest thing to do here is to assume that either the largest or smallest number written is accurate enough. A little harder would be to take the average of the high and low (if present). Why not have some fun?
df$mugslow <- as.numeric(str_extract(df$no_of_mugs,"[:digit:]+")) # extract the first set of uninterupted digits
df$mugshigh <- as.numeric(str_extract(df$no_of_mugs,"(?<=\\D)[:digit:]+")) # extract uninterupted digits following non-digits
n <- which(is.na(df$mugshigh)) # replace NA in mugshigh with the value in mugslow
df[n, "mugshigh"] <- df[n, "mugslow"]
df$no_of_mugs <- rowMeans(df[c('mugshigh', 'mugslow')]) # calculate the average of the mugs
mugcount <- c(df$mugslow-df$mugshigh) # creating a vector that will be used later to determine if the column
# for number of columns is being filled out correctly
kable(df[,6:11])
no_of_mugs | notes | sides | sink | mugslow | mugshigh |
---|---|---|---|---|---|
16 | new cleaning schedule begins today! | TRUE | TRUE | 16 | 16 |
4 | NA | NA | NA | 2 | 6 |
15 | all fine | TRUE | FALSE | 15 | 15 |
18 | so many mugs | TRUE | TRUE | 18 | 18 |
13 | can’t remember very well | FALSE | FALSE | 12 | 14 |
0 | All of us were at conference but I still had to put 0s in form! | NA | NA | 0 | 0 |
9 | NA | TRUE | TRUE | 4 | 14 |
3 | not v messy | TRUE | FALSE | 3 | 3 |
6 | NA | FALSE | TRUE | 6 | 6 |
Reduce and reorder columns
df <- df %>% select("date_cleaned", "dow", "name", "no_of_mugs", "dishwasher_full", "sides", "sink", "notes")
kable(df)
date_cleaned | dow | name | no_of_mugs | dishwasher_full | sides | sink | notes |
---|---|---|---|---|---|---|---|
2018-07-16 | Monday | Charlotte | 16 | NO | TRUE | TRUE | new cleaning schedule begins today! |
2018-07-17 | Tuesday | David E | 4 | NA | NA | NA | NA |
2018-07-18 | Wednesday | Hannah | 15 | YES | TRUE | FALSE | all fine |
2018-07-19 | Thursday | Dave | 18 | YES | TRUE | TRUE | so many mugs |
2018-07-20 | Friday | Charlie | 13 | YES | FALSE | FALSE | can’t remember very well |
2018-07-23 | Monday | Davida | 0 | NA | NA | NA | All of us were at conference but I still had to put 0s in form! |
2018-07-24 | Tuesday | David A | 9 | NO | TRUE | TRUE | NA |
2018-07-25 | Wednesday | Hanson | 3 | NO | TRUE | FALSE | not v messy |
2018-07-26 | Thursday | Davie | 6 | YES | FALSE | TRUE | NA |
What day are there the most mugs?
df_day <- df %>%
group_by(dow) %>%
summarize(mugs = sum(wday(no_of_mugs))) # group by day of week
df_day$dow <- factor(df_day$dow, levels= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) #set day of week factors in order
df_day <- df_day[order(df_day$dow), ] # Order days chronologically
df_day
## # A tibble: 5 x 2
## dow mugs
## <fct> <dbl>
## 1 Monday 16
## 2 Tuesday 13
## 3 Wednesday 18
## 4 Thursday 24
## 5 Friday 13
Thursdays are clearly the biggest days for mug usage with Wednesdays close behind.
Lets graph it
p<-ggplot(data=df_day, aes(x=dow, y=mugs), main="Dirty Mugs by Day of Week") +
geom_bar(stat="identity", fill = "darksalmon") + geom_text(aes(label=mugs), vjust=1.6, size=5)
p
I guess on Monday folks are rested from the weekend, then they need some help getting through the rest of the week until Friday. As these data were collected in the summer, maybe the company had summer Fridays.
Are men or women more likely to wash either the sides or sink?
Gender wasn’t collected with this information but can be determined. There are resources available to determine likely gender if it were needed for a larger data set.
dfga <- df # create new dataframe
dfga$gender <- as.factor(c('F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M' )) # create a vector with genders
gcount <- data.frame("gender" = c("F", "M"), "count" =c(sum(dfga$gender == 'F'),sum(dfga$gender == 'M'))) #create a df of gender and counts of gender
dfga$anywash <- (dfga$sides + dfga$sink) # create a vector that sums the sink and sides columns
dfga$anytrue <- dfga$anywash>0 # create a logical vector where TRUE is any cell that is over 0
dfga$anytrue <- as.integer(as.logical(dfga$anytrue)) # coerce the anywtrue vector to integer
dfgb <- dfga[,c(9:11)] # subset to just have the gender count and anywtrue columns
dfgb <- na.omit(dfgb) # remove NA from analysis
dfgwash <- dfgb %>%
group_by(gender) %>%
summarize("washed" = sum(anytrue)) # group by gender for anywtrue
dfgwash <- data.frame(dfgwash) # make df of prior group
dfw <- join(gcount, dfgwash) # join the washed and gender count dataframes
## Joining by: gender
## gender count washed pct
## 1 F 3 2 0.6666667
## 2 M 6 4 0.6666667
Are men or women more likely to fill out the form completely?
dffa <- df # create new dataframe
dffa$gender <- as.factor(c('F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M' )) # create a vector with genders
# gcount <- data.frame("gender" = c("F", "M"), "count" =c(sum(dfga$gender == 'F'),sum(dfga$gender == 'M'))) #create a df of gender and counts of gender
dffa$mugcount <- mugcount # create a vector that reflects whether there was a range put into the mug count instead of a number
dffa$mugcount <- dffa$mugcount!=0 # create a logical vector where TRUE is any cell does not equal 0 (where there was a diff b/n high and low count)
dffa$na_count <- apply(dffa, 1, function(x) sum(is.na(x)))
dffa$mugcount <- as.integer(as.logical(dffa$mugcount)) # coerce the anywtrue vector to integer
dffb <- dffa[,c(10:11)] # subset to just have the mugcount and na_count columns
dffb$complete <- rowSums (dffb) # create vector to add mugcount and na_count
dffb$gender <- dffa[,c(9)] # bring in gender vector
dffb$complete <- as.integer(as.logical(dffb$complete)) # coerce the complete vector to integer
#dffb$complete <- dffb$complete==0 # create a logical vector where TRUE is any cell equals 0
dffb
## mugcount na_count complete gender
## 1 0 0 0 F
## 2 1 4 1 M
## 3 0 0 0 F
## 4 0 0 0 M
## 5 1 0 1 M
## 6 0 3 1 F
## 7 1 1 1 M
## 8 0 0 0 M
## 9 0 1 1 M
dffcomplete <- dffb %>%
group_by(gender) %>%
summarize("complete" = sum(complete)) # group by gender for complete
dffcomplete
## # A tibble: 2 x 2
## gender complete
## <fct> <int>
## 1 F 1
## 2 M 4
dffcomplete <- data.frame(dffcomplete) # make df of prior group
dfc <- join(gcount, dffcomplete) # join the complete and gender count dataframes
## Joining by: gender
## gender count complete pct
## 1 F 3 1 0.3333333
## 2 M 6 4 0.6666667
Employees clearly need more direction in how to fill out these forms. The most successful method would probably to provide some sort of incentive for correctly filling it out.
Thursdays have the most dirty mugs–should make sure to keep on top of it these days.
Men and woman are equally likely (about 2/3) to clean either the sides or the sink.
Men are more likely to completely fill the form.