DATA 606 Data Project Proposal

Data Preparation

# Load a few libraries
library(dplyr)

library(magrittr)
library(ggplot2)
library(ggthemes)

# This dataset will require extensive cleanup and tidying. At present, only about
# 4,500 of the roughly 40,000 observations read into R, and many of those have
# problems. I attempted some troubleshooting, but full cleanup is beyond the
# scope of the project proposal phase.
# NOTE: inspections.source (the location of the comma-separated data file) must be defined before this call.
inspections <- read.table(inspections.source, header = TRUE, stringsAsFactors = TRUE, sep = ",", quote = "", fill = TRUE)
head(inspections)

##      CAMIS                     DBA          BORO BUILDING        STREET
## 1 40511702       NOTARO RESTAURANT     MANHATTAN      635 SECOND AVENUE
## 2 40511702       NOTARO RESTAURANT     MANHATTAN      635 SECOND AVENUE
## 3 50046354                VITE BAR        QUEENS     2507      BROADWAY
## 4 50061389 TACK'S CHINESE TAKE OUT STATEN ISLAND      11C   HOLDEN BLVD
## 5 41516263              NO QUARTER      BROOKLYN     8015      5 AVENUE
## 6 50015855         KABAB HOUSE NYC        QUEENS     4339       MAIN ST
##   ZIPCODE      PHONE CUISINE.DESCRIPTION INSPECTION.DATE
## 1   10016 2126863400             Italian       6/15/2015
## 2   10016 2126863400             Italian      11/25/2014
## 3   11106 3478134702             Italian       10/3/2016
## 4   10314 7189839854             Chinese       5/17/2017
## 5   11209 7187019180            American       3/30/2017
## 6   11355 9172852796           Pakistani        3/3/2015
##                                            ACTION VIOLATION.CODE
## 1 Violations were cited in the following area(s).            02B
## 2 Violations were cited in the following area(s).            20F
## 3 Violations were cited in the following area(s).            10F
## 4 Violations were cited in the following area(s).            02G
## 5 Violations were cited in the following area(s).            04M
## 6 Violations were cited in the following area(s).            10F
##   CRITICAL.FLAG SCORE GRADE GRADE.DATE RECORD.DATE
## 1      Critical    30                    8/28/2017
## 2  Not Critical                          8/28/2017
## 3  Not Critical     2                    8/28/2017
## 4      Critical    46                    8/28/2017
## 5      Critical    18                    8/28/2017
## 6  Not Critical    19                    8/28/2017
##                                     INSPECTION.TYPE
## 1             Cycle Inspection / Initial Inspection
## 2 Administrative Miscellaneous / Initial Inspection
## 3     Pre-permit (Operational) / Initial Inspection
## 4     Pre-permit (Operational) / Initial Inspection
## 5             Cycle Inspection / Initial Inspection
## 6  Pre-permit (Operational) / Compliance Inspection
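One possible cause of the dropped rows, offered as an untested guess rather than a diagnosis: `quote = ""` turns off quote handling, so a comma inside a quoted field is read as a column separator, and `read.table` treats `#` as a comment character by default. The sketch below uses a small invented file (not the real data; `toy.path` and its rows are hypothetical) to show how `count.fields` can expose ragged rows and how `read.csv`, which keeps double-quote handling on and disables comment characters, parses them.

```r
# Toy file standing in for the real export; every row is invented for illustration.
toy.path <- tempfile(fileext = ".csv")
writeLines(c(
  'CAMIS,DBA,BORO,GRADE',
  '1,"JOE PIZZA, INC",MANHATTAN,A',
  '2,CAFE #9,QUEENS,B',
  '3,NOODLE BAR,BROOKLYN,'
), toy.path)

# Fields per line under the settings used above (no quoting, '#' starts a comment).
# Lines whose count differs from the header are the ones that will be misread.
count.fields(toy.path, sep = ",", quote = "", comment.char = "#")

# read.csv defaults: quote = "\"" and comment.char = "", so both rows parse intact.
fixed <- read.csv(toy.path, stringsAsFactors = FALSE)
fixed
```

If the field counts on the real file are uneven in the same way, rereading it with `read.csv` may recover more rows, though other problems in the data could remain.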

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is cuisine type predictive of A-F restaurant grade (i.e. are certain types of cuisine more likely to get high or low grades)?
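As a rough sketch of how this question might eventually be tested, assuming cleaned data with one grade per inspection, a chi-square test of independence between cuisine and grade is one option. The data below are simulated, the three cuisines and three grade levels are placeholders, and nothing here reflects results from the real dataset.

```r
set.seed(606)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data: cuisine and grade are drawn independently here.
n <- 600
toy <- data.frame(
  CUISINE.DESCRIPTION = sample(c("Italian", "Chinese", "American"), n, replace = TRUE),
  GRADE = sample(c("A", "B", "C"), n, replace = TRUE, prob = c(0.7, 0.2, 0.1))
)

# Contingency table of cuisine by grade
grade.by.cuisine <- table(toy$CUISINE.DESCRIPTION, toy$GRADE)

# Chi-square test of independence; a small p-value would suggest an association
result <- chisq.test(grade.by.cuisine)
result$p.value
```

With the real data, sparse cuisines would likely need to be pooled or dropped first, since the chi-square approximation is unreliable when expected cell counts are small.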

Cases

What are the cases, and how many are there?

Each case represents a restaurant inspection conducted between January 2010 and August 2017. There are 39,918 inspections in total, not all of which have a grade between A and F.
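Once the data load cleanly, the case count and date range could be checked along these lines. This is a minimal sketch on a hypothetical three-row data frame: the dates are borrowed from the preview above, but the grades are invented for illustration.

```r
# Hypothetical rows; column names mirror the real data, grade values are made up.
toy <- data.frame(
  INSPECTION.DATE = c("6/15/2015", "11/25/2014", "10/3/2016"),
  GRADE = c("A", "", "B"),
  stringsAsFactors = FALSE
)

# Dates arrive as month/day/year text, so they need an explicit format.
inspection.date <- as.Date(toy$INSPECTION.DATE, format = "%m/%d/%Y")

n.cases <- nrow(toy)                                        # total inspections
n.graded <- sum(toy$GRADE %in% c("A", "B", "C", "D", "F"))  # inspections with a letter grade
date.range <- range(inspection.date)                        # earliest and latest inspection

n.cases
n.graded
date.range
```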

Data collection

Describe the method of data collection.

The data was collected by the NYC Department of Health and includes restaurant inspections, violations, grades, and information on adjudication.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Data is collected by the NYC Department of Health, which conducts unannounced inspections of restaurants annually to check for compliance in food handling, food temperature, personal hygiene, and vermin control. Violations accrue points, and a higher point total produces a less favorable grade, from A (best) to F (worst).

http://www1.nyc.gov/site/doh/services/restaurant-grades.page
http://www1.nyc.gov/assets/doh/downloads/pdf/rii/how-we-score-grade.pdf

The data is sourced through the NYC Open Data initiative and is available to the public online: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is restaurant grade and it is categorical (ordinal).
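Because the grade is ordinal, it may help to store it as an ordered factor so that comparisons and sorted summaries respect the ranking. The sketch below uses invented values and assumes the A-to-F ordering described in this proposal; `raw.grade` is a hypothetical stand-in for the `GRADE` column.

```r
# Example values only; "" stands in for an inspection with no grade recorded.
raw.grade <- c("A", "C", "", "B", "A", "F")

# Assumed ordering: A is best and F is worst, so levels run from worst to best.
grade <- factor(raw.grade, levels = c("F", "D", "C", "B", "A"), ordered = TRUE)

table(grade, useNA = "ifany")  # values outside the listed levels become NA
grade[1] > grade[2]            # comparisons follow the level order (here A vs. C)
```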

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variable is cuisine and it is categorical.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

# Bar chart of inspection counts per letter grade (rows without a letter grade are dropped)
inspections %>% 
  filter(GRADE %in% c("A", "B", "C", "D", "F")) %>% 
  count(GRADE) %>% 
  ggplot(aes(x = GRADE, y = n, fill = GRADE)) +
  geom_col() + # Equivalent to geom_bar(stat = "identity")
  ggtitle("Restaurant Grades") +
  theme_economist() +
  theme(axis.title.y = element_blank(),
        legend.position = "none")

# Bar chart of row counts per cuisine, with the most common cuisine at the top
inspections %>% 
  count(CUISINE.DESCRIPTION) %>% 
  filter(n > 30) %>% # Drop cuisines with 30 or fewer rows for legibility
  ggplot(aes(x = reorder(CUISINE.DESCRIPTION, n), y = n)) + # Cuisine labels still need cleaning
  geom_col() + # Equivalent to geom_bar(stat = "identity")
  ggtitle("Restaurants by Cuisine") +
  coord_flip() +
  theme_economist() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())
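Since the research question compares grades across cuisines, a table of grade proportions within each cuisine, with the cell counts alongside, would be a natural companion to the two charts. The following is a minimal sketch on invented rows, not output from the real data; `toy` and `grade.share` are hypothetical names.

```r
library(dplyr)

# Invented rows for illustration only; the column names mirror the real data.
toy <- data.frame(
  CUISINE.DESCRIPTION = c("Italian", "Italian", "Italian", "Chinese",
                          "Chinese", "Chinese", "Chinese", "American"),
  GRADE = c("A", "A", "B", "A", "B", "C", "", "A"),
  stringsAsFactors = FALSE
)

grade.share <- toy %>%
  filter(GRADE %in% c("A", "B", "C", "D", "F")) %>%  # drop rows without a letter grade
  count(CUISINE.DESCRIPTION, GRADE) %>%               # sample size per cuisine/grade cell
  group_by(CUISINE.DESCRIPTION) %>%
  mutate(share = n / sum(n)) %>%                      # proportion of each grade within a cuisine
  ungroup()

grade.share
```

Reporting `n` next to `share` keeps the group sizes visible, which matters because a proportion based on a handful of inspections is much less stable than one based on hundreds.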