This dataset contains information about all the restaurants that have received a health inspection from the NYC Dept of Health and Mental Hygiene since 2001. Originally, when I inspected the data, I thought it had multiple violations recorded on each row and was in a “wide” format. After closer inspection, I realized the data is actually already organized in long format, with a row for each violation for each restaurant.
I have created a MySQL database for the project and will connect to its tables using src_mysql.
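For reproducibility, here is a minimal sketch of the setup the rest of this document assumes: the package loads plus the MySQL connection. The database name, host, and credentials below are placeholders.
suppressWarnings({
  library(dplyr)
  library(dbplyr)
  library(knitr)
  library(ggplot2)
  library(tidyr)
  library(lubridate)  # floor_date() is used later for the monthly OT summary
})
# placeholder connection details -- substitute your own
project2db <- src_mysql(dbname = "project2", host = "localhost",
                        username = "yourusername", password = "yourpassword")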
foodgrades <- tbl(project2db, "foodgrades")
foodgrades <- foodgrades %>% tbl_df()
glimpse(foodgrades)
## Observations: 377,258
## Variables: 18
## $ CAMIS <int> 41708362, 50047773, 41561944, 41563537, ...
## $ DBA <chr> "MAGIC MIX JUICERY", "METRO MARKET", "HE...
## $ BORO <chr> "MANHATTAN", "BROOKLYN", "BROOKLYN", "MA...
## $ BUILDING <chr> "102", "438", "517", "174", "24932", "72...
## $ STREET <chr> "FULTON STREET", "UTICA AVE", "51 STREET...
## $ ZIPCODE <chr> "10038", "11213", "11220", "10013", "113...
## $ PHONE <chr> "6464540680", "7189758223", "3475810351"...
## $ CUISINE_DESCRIPTION <chr> "Juice, Smoothies, Fruit Salads", "Ameri...
## $ INSPECTION <chr> "9/3/15", "9/18/18", "3/28/16", "2/10/15...
## $ ACTION <chr> "Violations were cited in the following ...
## $ VIOLATION_CODE <chr> "10F", "04N", "10F", "02G", "10H", "08A"...
## $ VIOLATION_DESCRIPTION <chr> "Non-food contact surface improperly con...
## $ CRITICAL_FLAG <chr> "Not Critical", "Critical", "Not Critica...
## $ SCORE <int> 13, 66, 2, 59, 45, 12, 11, 45, 10, 27, 2...
## $ GRADE <chr> "A", "", "A", "", "", "A", "A", "", "A",...
## $ GRADE_DATE <chr> "9/3/15", "", "3/28/16", "", "", "9/10/1...
## $ RECORD_DATE <chr> "9/25/18", "9/25/18", "9/25/18", "9/25/1...
## $ INSPECTION_TYPE <chr> "Cycle Inspection / Initial Inspection",...
It appears that when the data was imported from SQL, blanks in the SCORE variable were interpreted as 0. I will address this later and replace them with NA.
To begin tidying the dataframe I will rename CAMIS and DBA and drop unnecessary columns. I then change the INSPECTION variable to a date type. I am interested in the current standing score for each restaurant in each borough, so I need the inspection date to be a proper date in order to filter the data correctly.
small <- foodgrades %>%
rename(id = CAMIS,name = DBA) %>%
select(-c(STREET,
BUILDING,
PHONE,
CUISINE_DESCRIPTION,
ACTION,
VIOLATION_CODE,
CRITICAL_FLAG,
GRADE_DATE:INSPECTION_TYPE
))
small$INSPECTION <- as.Date(small$INSPECTION,"%m/%d/%y")
To narrow the dataframe down to just the most recent scores for each restaurant, I group by id, filter out rows that are missing a borough, arrange by descending inspection date, and keep only the most recent inspection date for each restaurant. I then replace all 0 SCORE values with NA.
idgrp <-small %>%
group_by(id) %>%
select(id,name, INSPECTION,BORO,ZIPCODE,SCORE,VIOLATION_DESCRIPTION) %>%
filter(BORO != 'Missing') %>%
arrange(desc(INSPECTION)) %>%
top_n(1, INSPECTION)
idgrp[idgrp == 0] <- NA
idgrp
## # A tibble: 66,425 x 7
## # Groups: id [26,759]
## id name INSPECTION BORO ZIPCODE SCORE VIOLATION_DESCRIPTION
## <int> <chr> <date> <chr> <chr> <int> <chr>
## 1 4.10e7 THE SOUL… 2018-09-23 BROO… 11201 NA ""
## 2 4.14e7 BUILDING… 2018-09-23 BROO… 11217 NA ""
## 3 5.00e7 MASALA G… 2018-09-23 BROO… 11217 NA ""
## 4 5.01e7 POQUITO … 2018-09-23 BROO… 11217 NA ""
## 5 5.01e7 A Summer… 2018-09-22 MANH… 10013 12 Proper sanitization n…
## 6 5.01e7 New Gaya… 2018-09-22 QUEE… 11354 65 Accurate thermometer …
## 7 5.01e7 CELEBRAT… 2018-09-22 STAT… 10310 40 "Filth flies or food/…
## 8 5.01e7 New Gaya… 2018-09-22 QUEE… 11354 65 Hand washing facility…
## 9 5.00e7 TONEL RE… 2018-09-22 BROO… 11226 27 Non-food contact surf…
## 10 5.01e7 KING'S C… 2018-09-22 BRONX 10459 23 "Filth flies or food/…
## # ... with 66,415 more rows
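As an aside, the base-R replacement above compares every column of idgrp against 0; a more targeted sketch using dplyr's na_if() would touch only SCORE:
idgrp <- idgrp %>%
  mutate(SCORE = na_if(SCORE, 0L))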
Let's compare the five boroughs using a simple barplot of the mean score for each borough.
idgrp$BORO[idgrp$BORO == 'STATEN ISLAND'] <- 'STAT ISL'
idgrp$BORO[idgrp$BORO == 'MANHATTAN'] <- 'MAN.'
boromean <- idgrp %>%
group_by(BORO) %>%
select(BORO, SCORE) %>%
summarize(avg_score = mean(SCORE, na.rm=T))
boromean <- boromean %>%
filter(BORO != 'Missing')
barplot(boromean$avg_score,ylim = c(0,20), names = boromean$BORO)
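Since ggplot2 is already loaded, roughly the same chart could also be drawn with geom_col() (a sketch):
ggplot(boromean, aes(x = BORO, y = avg_score)) +
  geom_col() +
  coord_cartesian(ylim = c(0, 20))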
The barplot does not serve as a very useful visualization in this case because the means are so similar for each borough. What we can see is that the average score for every borough is above 14, which means that all boroughs would collectively receive a B grade. Not very appetizing.
Let's see what the scores look like using a boxplot and check the medians.
boxplot(SCORE~BORO, data=idgrp, main = "Recent grades by borough")
boromedian <- idgrp %>%
group_by(BORO) %>%
filter(BORO != 'Missing') %>%
select(BORO, SCORE) %>%
summarise(med_score = median(SCORE, na.rm=T))
boromedian
## # A tibble: 5 x 2
## BORO med_score
## <chr> <dbl>
## 1 BRONX 12
## 2 BROOKLYN 12
## 3 MAN. 12
## 4 QUEENS 12
## 5 STAT ISL 12
The boxplots make it clear that the boroughs do not have significantly different median scores; however, there are frequent outliers with extremely high scores. The boxplots also show that the data is skewed, with most restaurants having low scores. This can be explained by the process the DOHMH uses to grade restaurants. Usually, if a restaurant earns lower than a C grade it does not receive a score upon inspection and is closed down until reinspection at a later date. This system allows restaurants to avoid having high scores or bad grades posted in their shop windows.
Creating a density plot, we can see that despite the average score for each borough falling in the B range, the most frequent score is 12 points, which earns a letter grade of A. The density plot also has a strong right skew. The outliers with extremely high scores likely had an impact on the mean score, so I would prefer using the median to summarise the data. The steep drop-off in the frequency of scores just before 14, together with the medians of 12, might be indicative of a bias that the health inspectors have when grading restaurants. Perhaps they feel inclined to keep most restaurants at an A letter grade and knowingly avoid giving restaurants enough points to earn a B grade.
counts <- table(idgrp$SCORE)
idgrp %>%
filter(SCORE > 0) %>%
ggplot(aes(SCORE))+geom_density()+scale_x_continuous(breaks = pretty(idgrp$SCORE, n=20))
counts[which.max(counts)]
## 12
## 10391
Let's now take a look at the most frequent violations for current restaurants. To do so, I will have to select all rows for the most recent inspection for each restaurant.
allvio <-small %>%
group_by(id) %>%
select(id,name, INSPECTION,BORO,ZIPCODE,SCORE,VIOLATION_DESCRIPTION) %>%
filter(BORO != 'Missing' & min_rank(desc(INSPECTION))==1) %>%
arrange(desc(INSPECTION))
allvio[allvio == 0] <- NA
allvio$VIOLATION_DESCRIPTION[allvio$VIOLATION_DESCRIPTION == ""] <- NA
freqviolations<- tbl_df(table(allvio$VIOLATION_DESCRIPTION))
top10 <-freqviolations %>%
arrange(desc(n)) %>%
top_n(10)
## Selecting by n
kable(top10)
| Var1 | n |
|---|---|
| Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit. | 12257 |
| Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist. | 6283 |
| Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred. | 4962 |
| Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly. | 4544 |
| Food not protected from potential source of contamination during storage, preparation, transportation, display or service. | 4174 |
| Evidence of mice or live mice present in facility’s food and/or non-food areas. | 3729 |
| Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. | 3489 |
| Filth flies or food/refuse/sewage-associated (FRSA) flies present in facilitys food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies. | 2979 |
| Hot food item not held at or above 140º F. | 2977 |
| Food contact surface not properly maintained. | 1582 |
Looking at the list of the top 10 most frequent violations has me worried. Many of them are violations that could cause foodborne illness, vermin problems, and poor sanitation directly involving food. These are things I like to pretend never happen in restaurants but I know are the norm in NYC. Using nrow() and filtering, I will find the proportion of restaurants with a score below 14 (receiving an A) that still have a violation in the top 10 most frequent.
topvio <- c(top10$Var1)
countAwtop10 <-allvio %>%
filter(SCORE < 14 & VIOLATION_DESCRIPTION %in% topvio) %>%
nrow()
countAwtop10
## [1] 35037
countA <- allvio %>%
filter(SCORE < 14) %>%
nrow()
countA
## [1] 45414
countAwtop10/countA
## [1] 0.7715022
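The same proportion can also be computed in a single pipeline, without the intermediate counts (a sketch):
allvio %>%
  ungroup() %>%
  filter(SCORE < 14) %>%
  summarise(prop_top10 = mean(VIOLATION_DESCRIPTION %in% topvio))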
77% of A grade restaurants receive one or more of the top 10 violations. That isn't too surprising considering A is the most frequent letter grade. I am more interested in the number of restaurants that received an A letter grade but have violations involving vermin, mice, rats, or roaches.
countvermin <- allvio %>%
filter(SCORE < 14 & grepl(' mouse | mice | vermin | rat(s\\b|\\b) | roach(es\\b|\\b) | flies | pest(s\\b|\\b)',ignore.case = T, VIOLATION_DESCRIPTION)) %>%
nrow()
countvermin
## [1] 8361
countvermin/countA
## [1] 0.1841062
Somewhat reassuringly, only approximately 18.4% of restaurants with an A grade have reported violations involving pests. An A grade, in my mind and I believe in the minds of most New Yorkers, means that the restaurant is clean and safe to eat at. In the future I will definitely keep in mind that, despite having an A letter grade, a restaurant has approximately an 18.4% chance of having pest problems.
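As a sanity check on the regular expression used above, one could list the distinct violation descriptions it actually matches (a sketch):
pest_pattern <- ' mouse | mice | vermin | rat(s\\b|\\b) | roach(es\\b|\\b) | flies | pest(s\\b|\\b)'
allvio %>%
  ungroup() %>%
  filter(grepl(pest_pattern, VIOLATION_DESCRIPTION, ignore.case = TRUE)) %>%
  distinct(VIOLATION_DESCRIPTION)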
This dataset contains data on Djokovic's performances taken from: https://en.wikipedia.org/wiki/Novak_Djokovic_career_statistics. Djokovic does not appear to compete in doubles very often, so a good part of the data is missing from the dataset. I created a table in the project2 MySQL database to be used. The table already records the number of wins and losses at each tournament type, so I removed the rows for totals when entering the data into MySQL, and I will remove the win ratio column so that I can calculate it myself.
tennis <- tbl(project2db,'tennis')
tennis <- tbl_df(tennis)
kable(tennis)
| dicipline | type | won | lost | total | Wratio |
|---|---|---|---|---|---|
| Singles | Grand Slam tournaments | 14 | 9 | 23 | 0.61 |
| NA | Year-End Championships | 5 | 1 | 6 | 0.83 |
| NA | ATP Masters 1000 | 31 | 14 | 45 | 0.69 |
| NA | Olympic Games | 0 | 0 | 0 | 0 |
| NA | ATP Tour 500 | 12 | 3 | 15 | 0.8 |
| NA | ATP Tour 250 | 9 | 4 | 13 | 0.67 |
| Doubles | Grand Slam tournaments | - | - | - | - |
| NA | Year-End Championships | - | - | - | - |
| NA | ATP Masters 1000 | - | - | - | - |
| NA | Olympic Games | - | - | - | - |
| NA | ATP Tour 500 | - | - | - | - |
| NA | ATP Tour 250 | 1 | 1 | 2 | 0.5 |
glimpse(tennis)
## Observations: 12
## Variables: 6
## $ dicipline <chr> "Singles", NA, NA, NA, NA, NA, "Doubles", NA, NA, NA...
## $ type <chr> "Grand Slam tournaments", "Year-End Championships", ...
## $ won <chr> "14", "5", "31", "0", "12", "9", "-", "-", "-", "-",...
## $ lost <chr> "9", "1", "14", "0", "3", "4", "-", "-", "-", "-", "...
## $ total <chr> "23", "6", "45", "0", "15", "13", "-", "-", "-", "-"...
## $ Wratio <chr> "0.61", "0.83", "0.69", "0", "0.8", "0.67", "-", "-"...
I will drop the Wratio col, fill the dicipline values down, and replace “-” with NA.
tennis <- tennis %>%
select(-Wratio)%>%
fill(dicipline)
tennis[tennis == '-'] <- NA
tennis$won <- as.integer(tennis$won)
tennis$lost <- as.integer(tennis$lost)
tennis$total <- as.integer(tennis$total)
kable(tennis)
| dicipline | type | won | lost | total |
|---|---|---|---|---|
| Singles | Grand Slam tournaments | 14 | 9 | 23 |
| Singles | Year-End Championships | 5 | 1 | 6 |
| Singles | ATP Masters 1000 | 31 | 14 | 45 |
| Singles | Olympic Games | 0 | 0 | 0 |
| Singles | ATP Tour 500 | 12 | 3 | 15 |
| Singles | ATP Tour 250 | 9 | 4 | 13 |
| Doubles | Grand Slam tournaments | NA | NA | NA |
| Doubles | Year-End Championships | NA | NA | NA |
| Doubles | ATP Masters 1000 | NA | NA | NA |
| Doubles | Olympic Games | NA | NA | NA |
| Doubles | ATP Tour 500 | NA | NA | NA |
| Doubles | ATP Tour 250 | 1 | 1 | 2 |
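As an aside, the three as.integer() conversions above could be collapsed into a single mutate_at() call (a sketch equivalent to the separate assignments):
tennis %>%
  mutate_at(vars(won, lost, total), as.integer)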
### Analysis
Let's recreate the subtotal and grand total rows and calculate the Wratio column.
# combine the df, subtotals, and grand totals together
tennis <- bind_rows(
tennis,
tennis %>%
group_by(dicipline) %>%
summarise_at(vars(won, lost, total), funs(sum), na.rm = T),
tennis %>%
summarise_at(vars(won,lost,total), funs(sum),na.rm=T)
)
# rearrange the rows
tennis <-tennis %>%
arrange(desc(dicipline))
# change NA values in type and dicipline cols
tennis$type[is.na(tennis$type) & tennis$dicipline %in% c('Doubles', 'Singles')] <- 'Total'
tennis$dicipline[is.na(tennis$dicipline)] <- 'Grand Total'
# calculate the Wratio using mutate
tennis <- tennis %>%
mutate(Wratio = won / total)
tennis$Wratio <- tennis$Wratio %>%
round(2)
kable(tennis)
| dicipline | type | won | lost | total | Wratio |
|---|---|---|---|---|---|
| Singles | Grand Slam tournaments | 14 | 9 | 23 | 0.61 |
| Singles | Year-End Championships | 5 | 1 | 6 | 0.83 |
| Singles | ATP Masters 1000 | 31 | 14 | 45 | 0.69 |
| Singles | Olympic Games | 0 | 0 | 0 | NaN |
| Singles | ATP Tour 500 | 12 | 3 | 15 | 0.80 |
| Singles | ATP Tour 250 | 9 | 4 | 13 | 0.69 |
| Singles | Total | 71 | 31 | 102 | 0.70 |
| Doubles | Grand Slam tournaments | NA | NA | NA | NA |
| Doubles | Year-End Championships | NA | NA | NA | NA |
| Doubles | ATP Masters 1000 | NA | NA | NA | NA |
| Doubles | Olympic Games | NA | NA | NA | NA |
| Doubles | ATP Tour 500 | NA | NA | NA | NA |
| Doubles | ATP Tour 250 | 1 | 1 | 2 | 0.50 |
| Doubles | Total | 1 | 1 | 2 | 0.50 |
| Grand Total | NA | 72 | 32 | 104 | 0.69 |
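The Olympic Games row shows NaN because both won and total are 0 there; if that is undesirable, the ratio could be guarded when it is computed (a sketch, not applied to the table above):
tennis %>%
  mutate(Wratio = ifelse(total > 0, round(won / total, 2), NA))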
Well, that seemed like a lot of work to basically bring the table back to where we started. However, at least now things are a little tidier. I find the totals and the Wratio useful for quickly summarising the data within the same dataframe.
Let's grab the rows with the max and min Wratio.
bind_rows(
tennis %>%
slice(which.max(Wratio)),
tennis %>%
slice(which.min(Wratio))
)
## # A tibble: 2 x 6
## dicipline type won lost total Wratio
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Singles Year-End Championships 5 1 6 0.83
## 2 Doubles ATP Tour 250 1 1 2 0.5
Djokovic performed the worst in the one Doubles tournament he competed in. Perhaps that is the reason why he prefers to play alone in competitions and doesn’t compete in Doubles often.
This dataset is particularly messy, with the columns and rows poorly organized. The original data had columns that were entirely null, which I already removed when creating the table in MySQL.
### Questions:
- For each month, what was the total amount of OT worked?
### Tidying the data
First, let's save the data to a tbl_df and then replace all empty strings with NA.
ot <- tbl(project2db,'ot')
ot <- tbl_df(ot)
ot[ot == ''] <- NA
kable(ot)
| employee | day_of_week | date | pay_code | hours |
|---|---|---|---|---|
| Employee: | Gokool, Joel | NA | NA | ID: |
| Transactions: | NA | NA | Pay Code | Hours |
| NA | Wed | 1/6/2016 | OTBANK | 1:00 |
| NA | Thu | 1/14/2016 | OTBANK | 0:30 |
| NA | Wed | 1/20/2016 | OTBANK | 0:30 |
| NA | Wed | 2/3/2016 | OTBANK | 0:30 |
| NA | Thu | 2/11/2016 | OTBANK | 1:00 |
| NA | Fri | 2/19/2016 | OTBANK | 0:30 |
| NA | Tue | 2/23/2016 | OTBANK | 0:30 |
| NA | Fri | 2/26/2016 | OTBANK | 0:45 |
| NA | Wed | 3/2/2016 | OTBANK | 1:00 |
The dataframe is still looking pretty messy with a ton of NAs. I will fill in the employee column with the employee's name and remove the top two rows to tidy things up.
# Move the employees name to the employee col
ot$employee[is.na(ot$employee)] <- ot %>%
select(day_of_week) %>%
head(1) %>%
unlist()
kable(ot)
| employee | day_of_week | date | pay_code | hours |
|---|---|---|---|---|
| Employee: | Gokool, Joel | NA | NA | ID: |
| Transactions: | NA | NA | Pay Code | Hours |
| Gokool, Joel | Wed | 1/6/2016 | OTBANK | 1:00 |
| Gokool, Joel | Thu | 1/14/2016 | OTBANK | 0:30 |
| Gokool, Joel | Wed | 1/20/2016 | OTBANK | 0:30 |
| Gokool, Joel | Wed | 2/3/2016 | OTBANK | 0:30 |
| Gokool, Joel | Thu | 2/11/2016 | OTBANK | 1:00 |
| Gokool, Joel | Fri | 2/19/2016 | OTBANK | 0:30 |
| Gokool, Joel | Tue | 2/23/2016 | OTBANK | 0:30 |
| Gokool, Joel | Fri | 2/26/2016 | OTBANK | 0:45 |
| Gokool, Joel | Wed | 3/2/2016 | OTBANK | 1:00 |
ot <- ot %>%
na.omit()
ot
## # A tibble: 9 x 5
## employee day_of_week date pay_code hours
## <chr> <chr> <chr> <chr> <chr>
## 1 Gokool, Joel Wed 1/6/2016 OTBANK 1:00
## 2 Gokool, Joel Thu 1/14/2016 OTBANK 0:30
## 3 Gokool, Joel Wed 1/20/2016 OTBANK 0:30
## 4 Gokool, Joel Wed 2/3/2016 OTBANK 0:30
## 5 Gokool, Joel Thu 2/11/2016 OTBANK 1:00
## 6 Gokool, Joel Fri 2/19/2016 OTBANK 0:30
## 7 Gokool, Joel Tue 2/23/2016 OTBANK 0:30
## 8 Gokool, Joel Fri 2/26/2016 OTBANK 0:45
## 9 Gokool, Joel Wed 3/2/2016 OTBANK 1:00
A few more things we can tidy up before analyzing the data are the data types. Let's change the date column to a Date type and convert the hours column to a decimal.
ot$date <- as.Date(ot$date,"%m/%d/%Y")
ot
## # A tibble: 9 x 5
## employee day_of_week date pay_code hours
## <chr> <chr> <date> <chr> <chr>
## 1 Gokool, Joel Wed 2016-01-06 OTBANK 1:00
## 2 Gokool, Joel Thu 2016-01-14 OTBANK 0:30
## 3 Gokool, Joel Wed 2016-01-20 OTBANK 0:30
## 4 Gokool, Joel Wed 2016-02-03 OTBANK 0:30
## 5 Gokool, Joel Thu 2016-02-11 OTBANK 1:00
## 6 Gokool, Joel Fri 2016-02-19 OTBANK 0:30
## 7 Gokool, Joel Tue 2016-02-23 OTBANK 0:30
## 8 Gokool, Joel Fri 2016-02-26 OTBANK 0:45
## 9 Gokool, Joel Wed 2016-03-02 OTBANK 1:00
ot$hours <- sapply(
ot$hours %>%
strsplit(':'),
function(x) {
x<-as.numeric(x)
(x[1]*60 + x[2])/60
}
)
ot
## # A tibble: 9 x 5
## employee day_of_week date pay_code hours
## <chr> <chr> <date> <chr> <dbl>
## 1 Gokool, Joel Wed 2016-01-06 OTBANK 1
## 2 Gokool, Joel Thu 2016-01-14 OTBANK 0.5
## 3 Gokool, Joel Wed 2016-01-20 OTBANK 0.5
## 4 Gokool, Joel Wed 2016-02-03 OTBANK 0.5
## 5 Gokool, Joel Thu 2016-02-11 OTBANK 1
## 6 Gokool, Joel Fri 2016-02-19 OTBANK 0.5
## 7 Gokool, Joel Tue 2016-02-23 OTBANK 0.5
## 8 Gokool, Joel Fri 2016-02-26 OTBANK 0.75
## 9 Gokool, Joel Wed 2016-03-02 OTBANK 1
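As an aside, the same "h:mm"-to-decimal conversion could have been done with lubridate's hm() parser instead of the sapply() call (a sketch on example values, since hours is already numeric at this point):
lubridate::period_to_seconds(lubridate::hm(c("1:00", "0:30", "0:45"))) / 3600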
I printed the last few tables above as plain tibbles rather than with kable so that the data types for each column are shown and the changes I made are observable.
Now that the table has been tidied up, I can easily find the number of overtime hours worked in each month.
monthot <- ot %>%
group_by(month = format(floor_date(date,unit = "month"),"%m-%Y")) %>%
summarise(Monthly_OT = sum(hours))
kable(monthot)
| month | Monthly_OT |
|---|---|
| 01-2016 | 2.00 |
| 02-2016 | 3.25 |
| 03-2016 | 1.00 |
Perhaps the manager wants this table summarized in one number. In that case, here is the average number of overtime hours worked per month for the three months of data given.
monthot %>%
summarise(avg_ot = mean(Monthly_OT))
## # A tibble: 1 x 1
## avg_ot
## <dbl>
## 1 2.08
In conclusion, we have learned a couple of things about each dataset. The first dataset indicates that there may be a bias towards giving A grades to NYC restaurants, and that having an A grade does not necessarily mean the restaurant is pest free. The second dataset showed that Djokovic prefers to compete in singles tournaments and performs better alone. In the final dataset we found that Joel's number of overtime hours worked each month might not be enough to warrant hiring another employee.