Dataset 1: NYC Restaurant Inspection Grades

This dataset contains information about all the restaurants that have received health inspection from the NYC Dept of Health and Mental Hygiene since 2001. Originally, when I inspected the data I thought it had mutliple violations recorded on each row and was in a “wide” format. After closer inspection, I realized the the data is actually already organized in long format with a row for each violation for each restaurant.

Questions:

  • How do scores compare between the boroughs?
  • What is the most frequent violation?

Loading data from MySQL

I have created a mySQL DB for the project and I will connect to the tables using src_mysql.

## Warning: package 'dplyr' was built under R version 3.4.4
## 
## The downloaded binary packages are in
##  /var/folders/3g/qjtz0bws0_xfv3746wcy58000000gn/T//Rtmpr9vbSv/downloaded_packages
## Warning: package 'dbplyr' was built under R version 3.4.4
## Warning: package 'knitr' was built under R version 3.4.3
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4

Taking a glimpse at the dataframe

foodgrades <- tbl(project2db, "foodgrades")
foodgrades <- foodgrades %>% tbl_df()
glimpse(foodgrades)
## Observations: 377,258
## Variables: 18
## $ CAMIS                 <int> 41708362, 50047773, 41561944, 41563537, ...
## $ DBA                   <chr> "MAGIC MIX JUICERY", "METRO MARKET", "HE...
## $ BORO                  <chr> "MANHATTAN", "BROOKLYN", "BROOKLYN", "MA...
## $ BUILDING              <chr> "102", "438", "517", "174", "24932", "72...
## $ STREET                <chr> "FULTON STREET", "UTICA AVE", "51 STREET...
## $ ZIPCODE               <chr> "10038", "11213", "11220", "10013", "113...
## $ PHONE                 <chr> "6464540680", "7189758223", "3475810351"...
## $ CUISINE_DESCRIPTION   <chr> "Juice, Smoothies, Fruit Salads", "Ameri...
## $ INSPECTION            <chr> "9/3/15", "9/18/18", "3/28/16", "2/10/15...
## $ ACTION                <chr> "Violations were cited in the following ...
## $ VIOLATION_CODE        <chr> "10F", "04N", "10F", "02G", "10H", "08A"...
## $ VIOLATION_DESCRIPTION <chr> "Non-food contact surface improperly con...
## $ CRITICAL_FLAG         <chr> "Not Critical", "Critical", "Not Critica...
## $ SCORE                 <int> 13, 66, 2, 59, 45, 12, 11, 45, 10, 27, 2...
## $ GRADE                 <chr> "A", "", "A", "", "", "A", "A", "", "A",...
## $ GRADE_DATE            <chr> "9/3/15", "", "3/28/16", "", "", "9/10/1...
## $ RECORD_DATE           <chr> "9/25/18", "9/25/18", "9/25/18", "9/25/1...
## $ INSPECTION_TYPE       <chr> "Cycle Inspection / Initial Inspection",...

It appears the way the data imported from SQL that blanks for the SCORE variable were interpreted as 0. I will adress this later and replace them with NA.

Tidying up

To begin tidying the dataframe I will rename CAMIS and DBA as well as drop unnecessary cols. I then change the data type for the INSPECTION variable to be a date type. I am interested in the current standing score for each restaurant in each borough, therefore, I need the inspection date to be a date type to allow me to filter the data properly.

small <- foodgrades %>% 
  rename(id = CAMIS,name = DBA) %>% 
  select(-c(STREET,
            BUILDING,
            PHONE,
            CUISINE_DESCRIPTION,
            ACTION,
            VIOLATION_CODE,
            CRITICAL_FLAG,
            GRADE_DATE:INSPECTION_TYPE
            ))
small$INSPECTION <- as.Date(small$INSPECTION,"%m/%d/%y")
## Warning in strptime(x, format, tz = "GMT"): unknown timezone 'zone/tz/
## 2018e.1.0/zoneinfo/America/New_York'

To narrow the dataframe down to just the most recent scores for each restaurant, I group by id, filter out rows that are missing a borough, and arrange the by desc inspection date, and filter for only the most recent inspection date. I then replace all 0 SCORE values with NA

idgrp <-small %>% 
  group_by(id) %>% 
  select(id,name, INSPECTION,BORO,ZIPCODE,SCORE,VIOLATION_DESCRIPTION) %>% 
  filter(BORO != 'Missing') %>% 
  arrange(desc(INSPECTION)) %>% 
  top_n(1, INSPECTION)
## Warning: package 'bindrcpp' was built under R version 3.4.4
idgrp[idgrp == 0] <- NA

idgrp
## # A tibble: 66,425 x 7
## # Groups:   id [26,759]
##         id name      INSPECTION BORO  ZIPCODE SCORE VIOLATION_DESCRIPTION 
##      <int> <chr>     <date>     <chr> <chr>   <int> <chr>                 
##  1  4.10e7 THE SOUL… 2018-09-23 BROO… 11201      NA ""                    
##  2  4.14e7 BUILDING… 2018-09-23 BROO… 11217      NA ""                    
##  3  5.00e7 MASALA G… 2018-09-23 BROO… 11217      NA ""                    
##  4  5.01e7 POQUITO … 2018-09-23 BROO… 11217      NA ""                    
##  5  5.01e7 A Summer… 2018-09-22 MANH… 10013      12 Proper sanitization n…
##  6  5.01e7 New Gaya… 2018-09-22 QUEE… 11354      65 Accurate thermometer …
##  7  5.01e7 CELEBRAT… 2018-09-22 STAT… 10310      40 "Filth flies or food/…
##  8  5.01e7 New Gaya… 2018-09-22 QUEE… 11354      65 Hand washing facility…
##  9  5.00e7 TONEL RE… 2018-09-22 BROO… 11226      27 Non-food contact surf…
## 10  5.01e7 KING'S C… 2018-09-22 BRONX 10459      23 "Filth flies or food/…
## # ... with 66,415 more rows

Vizualization and Analysis

Lets compare the five boroughs using a simple barplot and mean score for each borough.

idgrp$BORO[idgrp$BORO == 'STATEN ISLAND'] <- 'STAT ISL'
idgrp$BORO[idgrp$BORO == 'MANHATTAN'] <- 'MAN.'

boromean <- idgrp %>% 
  group_by(BORO) %>% 
  select(BORO, SCORE) %>% 
  summarize(avg_score = mean(SCORE, na.rm=T))

boromean <- boromean %>% 
  filter(BORO != 'Missing')

barplot(boromean$avg_score,ylim = c(0,20), names = boromean$BORO)

The barplot is does not serve as a very useful visualization in this care because the means are so similar for each borough. What we can see is the the average score for all boroughs is above 14 which means that all boroughs would collective receive a B grade. Not very appetizing.

Lets see what the scores look like using a boxplot and check the medians

boxplot(SCORE~BORO, data=idgrp, main = "Recent grades by borough")

boromedian <- idgrp %>% 
  group_by(BORO) %>% 
  filter(BORO != 'Missing') %>% 
  select(BORO, SCORE) %>% 
  summarise(med_score = median(SCORE, na.rm=T))
boromedian
## # A tibble: 5 x 2
##   BORO     med_score
##   <chr>        <dbl>
## 1 BRONX           12
## 2 BROOKLYN        12
## 3 MAN.            12
## 4 QUEENS          12
## 5 STAT ISL        12

The boxplots make it clear that each boro does not have significantly different median grades, however, there are frequent outliers with extremely high scores. Also from the boxplot we can notice that the data seems to be scewed to many restaurants having low scores. This can be explained by the process the DOHMH uses to give restaurants grades. Usually if the restaurant earns lower than a C grade they do not receive a score upon inspection and are closed down until reinspection at another time. This system allows the restaurants to avoid have high scores or bad grades posted in their shop windows.

Creating a density plot we can see that despite the average score for each borough being a B grade, the most frequent grade is slightly above average at 12 points, earning a letter A grade. The densitly plot also has a strong right skew. The outliers with extremely high scores likey had an impact on the mean score so I would prefer using the median to summarise the data. The steep drop off in frequency of scores just before 14 and the medians of 12 might be indicative of a bias that the health inspectors have when grading restaurants. Perhaps they feel inclined to keep most restaurants at an A letter grade and knowlingly avoid giving restaurants enough points to earn a B grade.

counts <- table(idgrp$SCORE)
idgrp %>% 
  filter(SCORE > 0) %>% 
  ggplot(aes(SCORE))+geom_density()+scale_x_continuous(breaks = pretty(idgrp$SCORE, n=20))

counts[which.max(counts)]
##    12 
## 10391

Most frequent violations

Lets now take a look at the most frequent violations for current restaurants. To do so I will have to select all rows for the most recent inspection for each restaurant.

allvio <-small %>% 
  group_by(id) %>% 
  select(id,name, INSPECTION,BORO,ZIPCODE,SCORE,VIOLATION_DESCRIPTION) %>% 
  filter(BORO != 'Missing' & min_rank(desc(INSPECTION))==1) %>% 
  arrange(desc(INSPECTION))

allvio[allvio == 0] <- NA
allvio$VIOLATION_DESCRIPTION[allvio$VIOLATION_DESCRIPTION == ""] <- NA
freqviolations<- tbl_df(table(allvio$VIOLATION_DESCRIPTION))
top10 <-freqviolations %>% 
  arrange(desc(n)) %>% 
  top_n(10)
## Selecting by n
kable(top10)
Var1 n
Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit. 12257
Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist. 6283
Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred. 4962
Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly. 4544
Food not protected from potential source of contamination during storage, preparation, transportation, display or service. 4174
Evidence of mice or live mice present in facility’s food and/or non-food areas. 3729
Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. 3489
Filth flies or food/refuse/sewage-associated (FRSA) flies present in facilitys food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies. 2979
Hot food item not held at or above 140º F. 2977
Food contact surface not properly maintained. 1582

Looking at the list of the top 10 most frequent violations has me worried. Many of them are related to violations that could cause food born illness, vermin, and poor sanitation directly invovling food. These are things I like to pretend never happen in restaurants but I know are the norm in NYC. Using nrows and filtering I will find the proportion of restaurants with a score below 14, receiving an A, but still having a violation in the top 10 most frequent.

topvio <- c(top10$Var1)
countAwtop10 <-allvio %>% 
  filter(SCORE < 14 & VIOLATION_DESCRIPTION %in% topvio) %>% 
  nrow()
countAwtop10
## [1] 35037
countA <- allvio %>%
  filter(SCORE < 14) %>% 
  nrow()
countA
## [1] 45414
countAwtop10/countA
## [1] 0.7715022

77% of A grade restaurants receive one or more of the top 10 violations. That isn’t too suprising considering A is the most frequent letter grade. I am more interested in the number of restaurants that received an A letter grade but have violations with vermin, mice, rats, or roaches.

countvermin <- allvio %>% 
  filter(SCORE < 14 & grepl(' mouse | mice | vermin | rat(s\\b|\\b) | roach(es\\b|\\b) | flies | pest(s\\b|\\b)',ignore.case = T, VIOLATION_DESCRIPTION)) %>% 
  nrow()
countvermin
## [1] 8361
countvermin/countA
## [1] 0.1841062

Somwhat reassuringly, only approximately 18.4% of restaurants with an A grade have reported evidence and violations involving pests. An A grade in my mind, and I believe most New Yorkers, means that the restaurant is clean and safe to eat at. In the future will definitely keep in mind when trying new restaurants that despite having an A letter grade, the restauran has approximately 18.4% chance of having pest problems.

Dataset 2: Tennis

This dataset contains data of Djokovic’s performances taken from: https://en.wikipedia.org/wiki/Novak_Djokovic_career_statistics. Djokovic does not appear to compete in doubles very often so a good part of the data is missing from the dataset. I created a table in the project2 MySQL database to be used. The table already is a recording of the number of wins an losses at each tournament, so I have removed the rows for totals when entering the data into MySQL and will remove to win ratio column so that I can calculate them myself.

Questions:

  • Which competition had Djokovic performed best in throughout his career?
  • Which competition has been his worst?

Taking a glimpse at the data

tennis <- tbl(project2db,'tennis')
tennis <- tbl_df(tennis)
kable(tennis)
dicipline type won lost total Wratio
Singles Grand Slam tournaments 14 9 23 0.61
NA Year-End Championships 5 1 6 0.83
NA ATP Masters 1000 31 14 45 0.69
NA Olympic Games 0 0 0 0
NA ATP Tour 500 12 3 15 0.8
NA ATP Tour 250 9 4 13 0.67
Doubles Grand Slam tournaments - - - -
NA Year-End Championships - - - -
NA ATP Masters 1000 - - - -
NA Olympic Games - - - -
NA ATP Tour 500 - - - -
NA ATP Tour 250 1 1 2 0.5
glimpse(tennis)
## Observations: 12
## Variables: 6
## $ dicipline <chr> "Singles", NA, NA, NA, NA, NA, "Doubles", NA, NA, NA...
## $ type      <chr> "Grand Slam tournaments", "Year-End Championships", ...
## $ won       <chr> "14", "5", "31", "0", "12", "9", "-", "-", "-", "-",...
## $ lost      <chr> "9", "1", "14", "0", "3", "4", "-", "-", "-", "-", "...
## $ total     <chr> "23", "6", "45", "0", "15", "13", "-", "-", "-", "-"...
## $ Wratio    <chr> "0.61", "0.83", "0.69", "0", "0.8", "0.67", "-", "-"...

Tidying up the data before analysis

I will drop the Wratio col, fill the dicipline values down, and replace “-” with NA.

tennis <- tennis %>% 
  select(-Wratio)%>% 
  fill(dicipline)
tennis[tennis == '-'] <- NA
tennis$won <- as.integer(tennis$won)
tennis$lost <- as.integer(tennis$lost)
tennis$total <- as.integer(tennis$total)
kable(tennis)
dicipline type won lost total
Singles Grand Slam tournaments 14 9 23
Singles Year-End Championships 5 1 6
Singles ATP Masters 1000 31 14 45
Singles Olympic Games 0 0 0
Singles ATP Tour 500 12 3 15
Singles ATP Tour 250 9 4 13
Doubles Grand Slam tournaments NA NA NA
Doubles Year-End Championships NA NA NA
Doubles ATP Masters 1000 NA NA NA
Doubles Olympic Games NA NA NA
Doubles ATP Tour 500 NA NA NA
Doubles ATP Tour 250 1 1 2
### Analysis
Lets recreat e the subtotals and grand total s rows and calculate the Wratio col.
# combine the df, subtotals, and grand totals together
tennis <- bind_rows( 
  tennis,
  tennis %>%
    group_by(dicipline) %>% 
    summarise_at(vars(won, lost, total), funs(sum), na.rm = T),
  tennis %>% 
    summarise_at(vars(won,lost,total), funs(sum),na.rm=T)
)
# rearrange the rows
tennis <-tennis %>% 
  arrange(desc(dicipline))
# change NA values in type and dicipline cols
tennis$type[is.na(tennis$type) & tennis$dicipline %in% c('Doubles', 'Singles')] <- 'Total'
tennis$dicipline[is.na(tennis$dicipline)] <- 'Grand Total'

# calculate the Wratio using mutate
tennis <- tennis %>% 
  mutate(Wratio = won / total)

tennis$Wratio <- tennis$Wratio %>% 
  round(2)

kable(tennis)
dicipline type won lost total Wratio
Singles Grand Slam tournaments 14 9 23 0.61
Singles Year-End Championships 5 1 6 0.83
Singles ATP Masters 1000 31 14 45 0.69
Singles Olympic Games 0 0 0 NaN
Singles ATP Tour 500 12 3 15 0.80
Singles ATP Tour 250 9 4 13 0.69
Singles Total 71 31 102 0.70
Doubles Grand Slam tournaments NA NA NA NA
Doubles Year-End Championships NA NA NA NA
Doubles ATP Masters 1000 NA NA NA NA
Doubles Olympic Games NA NA NA NA
Doubles ATP Tour 500 NA NA NA NA
Doubles ATP Tour 250 1 1 2 0.50
Doubles Total 1 1 2 0.50
Grand Total NA 72 32 104 0.69
Well that seem ed like a lot of work to basica lly bri ng the t able back to where we start. However, atleast now things are a little tidyer. I find the totals and the Wratio useful to quickly summarise the data within the same dataframe.

Lets take grab the rows with the max and min Wratio.

bind_rows(
  tennis %>% 
    slice(which.max(Wratio)),
  tennis %>% 
    slice(which.min(Wratio))
)
## # A tibble: 2 x 6
##   dicipline type                     won  lost total Wratio
##   <chr>     <chr>                  <int> <int> <int>  <dbl>
## 1 Singles   Year-End Championships     5     1     6   0.83
## 2 Doubles   ATP Tour 250               1     1     2   0.5

Djokovic performed the worst in the one Doubles tournament he competed in. Perhaps that is the reason why he prefers to play alone in competitions and doesn’t compete in Doubles often.

Dataset 3: Overtime

This dataset is particularly messy with the columns and rows poorly organized. The original data had columns that were entirely null and I already removed when creating the table in MySQL. ### Questions: - For each month, what was the total amount of OT worked? ### Tidying the data First step lets save the data to a tbl_df and then replace all empty strings with NA.

ot <- tbl(project2db,'ot')
ot <- tbl_df(ot)
ot[ot == ''] <- NA
kable(ot)
employee day_of_week date pay_code hours
Employee: Gokool, Joel NA NA ID:
Transactions: NA NA Pay Code Hours
NA Wed 1/6/2016 OTBANK 1:00
NA Thu 1/14/2016 OTBANK 0:30
NA Wed 1/20/2016 OTBANK 0:30
NA Wed 2/3/2016 OTBANK 0:30
NA Thu 2/11/2016 OTBANK 1:00
NA Fri 2/19/2016 OTBANK 0:30
NA Tue 2/23/2016 OTBANK 0:30
NA Fri 2/26/2016 OTBANK 0:45
NA Wed 3/2/2016 OTBANK 1:00
The dataframe is still looking pretty messy with a ton of NAs. I will create a col with the Employee name on the left and remove the top 2 rows to tidy things up.
# Move the employees name to the employee col
ot$employee[is.na(ot$employee)] <- ot %>% 
  select(day_of_week) %>% 
  head(1) %>% 
  unlist()

kable(ot)
employee day_of_week date pay_code hours
Employee: Gokool, Joel NA NA ID:
Transactions: NA NA Pay Code Hours
Gokool, Joel Wed 1/6/2016 OTBANK 1:00
Gokool, Joel Thu 1/14/2016 OTBANK 0:30
Gokool, Joel Wed 1/20/2016 OTBANK 0:30
Gokool, Joel Wed 2/3/2016 OTBANK 0:30
Gokool, Joel Thu 2/11/2016 OTBANK 1:00
Gokool, Joel Fri 2/19/2016 OTBANK 0:30
Gokool, Joel Tue 2/23/2016 OTBANK 0:30
Gokool, Joel Fri 2/26/2016 OTBANK 0:45
Gokool, Joel Wed 3/2/2016 OTBANK 1:00
ot <- ot %>% 
  na.omit()
ot
## # A tibble: 9 x 5
##   employee     day_of_week date      pay_code hours
##   <chr>        <chr>       <chr>     <chr>    <chr>
## 1 Gokool, Joel Wed         1/6/2016  OTBANK   1:00 
## 2 Gokool, Joel Thu         1/14/2016 OTBANK   0:30 
## 3 Gokool, Joel Wed         1/20/2016 OTBANK   0:30 
## 4 Gokool, Joel Wed         2/3/2016  OTBANK   0:30 
## 5 Gokool, Joel Thu         2/11/2016 OTBANK   1:00 
## 6 Gokool, Joel Fri         2/19/2016 OTBANK   0:30 
## 7 Gokool, Joel Tue         2/23/2016 OTBANK   0:30 
## 8 Gokool, Joel Fri         2/26/2016 OTBANK   0:45 
## 9 Gokool, Joel Wed         3/2/2016  OTBANK   1:00

A few more things we can tidy up before analyzing the data are the data types. Lets change the date col to Date data type and lets convert the hours col to a decimal.

ot$date <- as.Date(ot$date,"%m/%d/%Y")
ot
## # A tibble: 9 x 5
##   employee     day_of_week date       pay_code hours
##   <chr>        <chr>       <date>     <chr>    <chr>
## 1 Gokool, Joel Wed         2016-01-06 OTBANK   1:00 
## 2 Gokool, Joel Thu         2016-01-14 OTBANK   0:30 
## 3 Gokool, Joel Wed         2016-01-20 OTBANK   0:30 
## 4 Gokool, Joel Wed         2016-02-03 OTBANK   0:30 
## 5 Gokool, Joel Thu         2016-02-11 OTBANK   1:00 
## 6 Gokool, Joel Fri         2016-02-19 OTBANK   0:30 
## 7 Gokool, Joel Tue         2016-02-23 OTBANK   0:30 
## 8 Gokool, Joel Fri         2016-02-26 OTBANK   0:45 
## 9 Gokool, Joel Wed         2016-03-02 OTBANK   1:00
ot$hours <- sapply(
  ot$hours %>% 
    strsplit(':'),
  function(x) {
    x<-as.numeric(x)
    (x[1]*60 + x[2])/60
  }
)
ot
## # A tibble: 9 x 5
##   employee     day_of_week date       pay_code hours
##   <chr>        <chr>       <date>     <chr>    <dbl>
## 1 Gokool, Joel Wed         2016-01-06 OTBANK    1   
## 2 Gokool, Joel Thu         2016-01-14 OTBANK    0.5 
## 3 Gokool, Joel Wed         2016-01-20 OTBANK    0.5 
## 4 Gokool, Joel Wed         2016-02-03 OTBANK    0.5 
## 5 Gokool, Joel Thu         2016-02-11 OTBANK    1   
## 6 Gokool, Joel Fri         2016-02-19 OTBANK    0.5 
## 7 Gokool, Joel Tue         2016-02-23 OTBANK    0.5 
## 8 Gokool, Joel Fri         2016-02-26 OTBANK    0.75
## 9 Gokool, Joel Wed         2016-03-02 OTBANK    1

I printed the few tables above just as a table and without kable so that the datatypes for each col are printed and the changes I made are observable.

Analysis

Now that the table has been tidyed up I am able to easily find the number of hourse worked over time for each month.

monthot <- ot %>% 
  group_by(month = format(floor_date(date,unit = "month"),"%m-%Y")) %>% 
  summarise(Monthly_OT = sum(hours))
kable(monthot)
month Monthly_OT
01-2016 2.00
02-2016 3.25
03-2016 1.00
Perhaps th e manager wants this table summarized in one number. In that case, here is the avg number of hours worked overtime for the three months of data given.
monthot %>% 
  summarise(avg_ot = mean(Monthly_OT))
## # A tibble: 1 x 1
##   avg_ot
##    <dbl>
## 1   2.08

Conclusions:

In conclusion we have learned a couple things about each dataset. The first dataset indicates that there may be a bias towards giving A grades to NYC restuarants and having an A grade does not necessarily mean the restaurant is pest free. The second dataset showed Djokovic prefers to compete in singles tournalments and performs better alone. In the final dataset we found Joel’s number of hours worked overtime each month might not be enough to warrent hiring another employee.