# Load a few libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(ggplot2)
library(ggthemes)
# This dataset needs extensive cleanup and tidying. At present only about 4,500 of
# the roughly 40,000 observations are read into R, and many of those have problems.
# I attempted some troubleshooting, but full cleanup is beyond the scope of the
# project proposal phase.
inspections <- read.table(inspections.source, header = TRUE, stringsAsFactors = TRUE, sep = ",", quote = "", fill = TRUE)
head(inspections)
##      CAMIS                     DBA          BORO BUILDING        STREET
## 1 40511702       NOTARO RESTAURANT     MANHATTAN      635 SECOND AVENUE
## 2 40511702       NOTARO RESTAURANT     MANHATTAN      635 SECOND AVENUE
## 3 50046354                VITE BAR        QUEENS     2507      BROADWAY
## 4 50061389 TACK'S CHINESE TAKE OUT STATEN ISLAND      11C   HOLDEN BLVD
## 5 41516263              NO QUARTER      BROOKLYN     8015      5 AVENUE
## 6 50015855         KABAB HOUSE NYC        QUEENS     4339       MAIN ST
##   ZIPCODE      PHONE CUISINE.DESCRIPTION INSPECTION.DATE
## 1   10016 2126863400             Italian       6/15/2015
## 2   10016 2126863400             Italian      11/25/2014
## 3   11106 3478134702             Italian       10/3/2016
## 4   10314 7189839854             Chinese       5/17/2017
## 5   11209 7187019180            American       3/30/2017
## 6   11355 9172852796           Pakistani        3/3/2015
##                                            ACTION VIOLATION.CODE
## 1 Violations were cited in the following area(s).            02B
## 2 Violations were cited in the following area(s).            20F
## 3 Violations were cited in the following area(s).            10F
## 4 Violations were cited in the following area(s).            02G
## 5 Violations were cited in the following area(s).            04M
## 6 Violations were cited in the following area(s).            10F
##   CRITICAL.FLAG SCORE GRADE GRADE.DATE RECORD.DATE
## 1      Critical    30                    8/28/2017
## 2  Not Critical                          8/28/2017
## 3  Not Critical     2                    8/28/2017
## 4      Critical    46                    8/28/2017
## 5      Critical    18                    8/28/2017
## 6  Not Critical    19                    8/28/2017
##                                     INSPECTION.TYPE
## 1             Cycle Inspection / Initial Inspection
## 2 Administrative Miscellaneous / Initial Inspection
## 3     Pre-permit (Operational) / Initial Inspection
## 4     Pre-permit (Operational) / Initial Inspection
## 5             Cycle Inspection / Initial Inspection
## 6  Pre-permit (Operational) / Compliance Inspection
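One possible contributor to the missing rows, offered as an assumption rather than a confirmed diagnosis, is that `quote = ""` turns off quote handling, so a quoted field that contains a comma is counted as extra columns. The sketch below uses a made-up three-line CSV (not the real file) to show how `count.fields()` can reveal this, and how `read.csv()`, which treats double quotes as quoting by default, keeps such a field intact.

```r
# Hypothetical miniature CSV for illustration; the second record's DBA
# contains a comma inside double quotes.
csv_text <- c(
  "CAMIS,DBA,BORO",
  "1,PLAIN NAME,QUEENS",
  "2,\"NAME, WITH COMMA\",BROOKLYN"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Fields per line with quote handling disabled, then with double quotes honored.
fields_no_quote <- count.fields(tmp, sep = ",", quote = "")
fields_quoted   <- count.fields(tmp, sep = ",", quote = "\"")

# read.csv() uses quote = "\"" by default, so the quoted comma is not a separator.
with_quotes <- read.csv(tmp, stringsAsFactors = FALSE)

print(fields_no_quote)
print(fields_quoted)
print(with_quotes)
```

If the per-line field counts on the real file differ between the two `count.fields()` calls, quoting is a likely place to start the cleanup; other causes (stray line breaks, embedded `#` characters) would need separate checks.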
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is cuisine type predictive of A-F restaurant grade (i.e. are certain types of cuisine more likely to get high or low grades)?
What are the cases, and how many are there?
Each case represents a restaurant inspection administered between Jan 2010 and Aug 2017. There are 39,918 inspections in total (not all of which are associated with a grade between A and F).
Describe the method of data collection.
The data was collected by the NYC Department of Health and includes restaurant inspections, violations, grades, and information on adjudication.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
Data is collected by the NYC Department of Health, which conducts unannounced annual inspections of restaurants to check compliance in food handling, food temperature, personal hygiene, and vermin control. Violations accrue points, and a higher point total produces a less favorable grade, from A (best) to F (worst). See http://www1.nyc.gov/site/doh/services/restaurant-grades.page and http://www1.nyc.gov/assets/doh/downloads/pdf/rii/how-we-score-grade.pdf
The data is sourced through the NYCOpenData initiative and is available to the public online: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
What is the response variable, and what type is it (numerical/categorical)?
The response variable is restaurant grade and it is categorical (ordinal).
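Since the grade is ordinal, it may help to store it as an ordered factor so that R respects the ranking. This is a minimal sketch with made-up grade values; the level set A, B, C, D, F mirrors the filter used in the plotting code, and the names `grades` and `grade_ord` are hypothetical.

```r
# Made-up grade values for illustration; the blank stands in for an ungraded row.
grades <- c("B", "A", "C", "A", "")

# Levels are listed from best (A) to worst (F); values outside the set become NA.
grade_levels <- c("A", "B", "C", "D", "F")
grade_ord <- factor(grades, levels = grade_levels, ordered = TRUE)

# Counts per level, including unused levels and the NA from the blank entry.
print(table(grade_ord, useNA = "ifany"))

# Comparisons follow level order, so "greater" here means a worse grade.
print(grade_ord[1] > grade_ord[2])
```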
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is cuisine and it is categorical.
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
# Number of graded inspections at each letter grade (ungraded rows are dropped)
inspections %>%
  filter(GRADE %in% c("A", "B", "C", "D", "F")) %>%
  count(GRADE) %>%
  ggplot(aes(x = GRADE, y = n, fill = GRADE)) +
  geom_bar(stat = "identity") +
  ggtitle("Restaurant Grades") +
  theme_economist() +
  theme(axis.title.y = element_blank(),
        legend.position = "none")
# Number of inspection records per cuisine
inspections %>%
  count(CUISINE.DESCRIPTION) %>%
  filter(n > 30) %>%  # Drop low counts for legibility / tidiness
  ggplot(aes(x = reorder(CUISINE.DESCRIPTION, n), y = n)) +  # Cuisine labels still need cleaning
  geom_bar(stat = "identity") +
  ggtitle("Restaurants by Cuisine") +
  coord_flip() +
  theme_economist() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())
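Because the research question relates cuisine to grade, a cross-tabulation of the two may be a more direct summary than the separate counts. The following is a minimal sketch in base R on a made-up data frame, `toy`, whose column names mirror the dataset's; the rows are invented for illustration and say nothing about the real data.

```r
# Made-up rows for illustration only; the blank GRADE stands in for an ungraded record.
toy <- data.frame(
  CUISINE.DESCRIPTION = c("Italian", "Italian", "Italian",
                          "Chinese", "Chinese", "American"),
  GRADE = c("A", "A", "B", "A", "C", ""),
  stringsAsFactors = FALSE
)

# Keep only rows with a letter grade, matching the filter used for the grade plot.
graded <- toy[toy$GRADE %in% c("A", "B", "C", "D", "F"), ]

# Counts of each grade within each cuisine (these are the group sample sizes).
counts <- table(graded$CUISINE.DESCRIPTION, graded$GRADE)

# Row proportions: the share of each grade within a cuisine.
props <- prop.table(counts, margin = 1)

print(counts)
print(round(props, 2))
```

On the real data, the same two tables would give per-cuisine sample sizes and grade shares, which is the categorical analogue of the group means and SDs the prompt asks for.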