This R Markdown Notebook is my report for the Data Wrangling with R class assignment for Week 7.
The following report demonstrates the use of iteration and functions to minimize duplication and write efficient code.
For this report we are looking at the NYC Restaurant Data, which contains restaurant inspection results from across New York City.
library('RSocrata')
library('readr')
library("purrr")
library('dplyr')
library('tibble')
library('stringr')
library('lubridate')
library('ggplot2')
# Note: Data download can take several minutes
# Uncomment the next 3 lines to download and store data locally
#url <- 'https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59'
#nyc_rest <- read.socrata(url = url)
#write_rds(nyc_rest, path = '../data/nyc_restaurant')
nyc_rest <- read_rds('../data/nyc_restaurant')
Use the map function to identify the class of each variable.
nyc_rest %>% map(class)
## $CAMIS
## [1] "integer"
##
## $DBA
## [1] "factor"
##
## $BORO
## [1] "factor"
##
## $BUILDING
## [1] "character"
##
## $STREET
## [1] "factor"
##
## $ZIPCODE
## [1] "integer"
##
## $PHONE
## [1] "character"
##
## $CUISINE.DESCRIPTION
## [1] "factor"
##
## $INSPECTION.DATE
## [1] "POSIXlt" "POSIXt"
##
## $ACTION
## [1] "factor"
##
## $VIOLATION.CODE
## [1] "factor"
##
## $VIOLATION.DESCRIPTION
## [1] "factor"
##
## $CRITICAL.FLAG
## [1] "factor"
##
## $SCORE
## [1] "integer"
##
## $GRADE
## [1] "factor"
##
## $GRADE.DATE
## [1] "POSIXlt" "POSIXt"
##
## $RECORD.DATE
## [1] "POSIXlt" "POSIXt"
##
## $INSPECTION.TYPE
## [1] "factor"
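As a small illustration of the same idea, the sketch below runs map over a made-up data frame (toy and its columns are hypothetical, not part of the restaurant data). It also shows one possible way to collapse each class vector into a single string, which can be handy because columns such as the POSIXlt dates report more than one class.

```r
library(purrr)

# Hypothetical toy data frame standing in for nyc_rest
toy <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(10.5, 7, 12),
  stringsAsFactors = FALSE
)

# map() returns a list with one element per column
toy %>% map(class)

# Collapsing each class vector yields a compact named character vector;
# a multi-class column would read like "POSIXlt/POSIXt"
col_classes <- toy %>% map_chr(~ paste(class(.x), collapse = "/"))
col_classes
```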
Notice how the date variables are in POSIXlt form. Create a function that takes a single argument (“x”) and checks if it is of POSIXlt class. If it is, have the function change the input to a simple Date class with as.Date. If not, the function should keep the input class as is. Apply this function to each of the columns in the NYC restaurant data set by using the map function. Be sure the final output is a tibble and not a list.
# TRUE when x carries the POSIXlt class
checkPosixlt <- function(x){
  inherits(x, "POSIXlt")
}
# Convert POSIXlt input to Date; return anything else unchanged
convertPOSIXltToDate <- function(x){
  if(!checkPosixlt(x)){
    return(x)
  }
  return(as.Date(x))
}
nyc_rest <- nyc_rest %>% map(convertPOSIXltToDate) %>% as_tibble()
nyc_rest
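To make the conversion step easier to check in isolation, here is a minimal sketch on a made-up two-column list. The names to_date_if_posixlt, toy, and toy_tbl are hypothetical, and the helper only mirrors the logic of convertPOSIXltToDate. Because map returns a plain list, the final as_tibble call is what turns the result back into a tibble.

```r
library(purrr)
library(tibble)

# Hypothetical helper mirroring convertPOSIXltToDate
to_date_if_posixlt <- function(x) {
  if (inherits(x, "POSIXlt")) as.Date(x) else x
}

# Made-up input: one integer column and one POSIXlt column
toy <- list(
  id        = 1:2,
  inspected = as.POSIXlt(c("2016-01-15", "2016-03-02"), tz = "UTC")
)

# map() applies the helper column by column; as_tibble() rebuilds the table
toy_tbl <- toy %>% map(to_date_if_posixlt) %>% as_tibble()

# Only the POSIXlt column should now report class "Date"
toy_tbl %>% map(class)
```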
Using this reformatted tibble, identify how many restaurants in 2016 had a violation regarding “mice”? How about “hair”? What about “sewage”? Hint: the VIOLATION.DESCRIPTION and INSPECTION.DATE variables will be useful here.
deduplicate_df <- function(df){
  return(df[!duplicated(df),])
}
total_violations_nyc <- function(search_pattern, search_year){
  nyc_rest %>%
    deduplicate_df %>%
    filter(year(INSPECTION.DATE) == search_year) %>%
    summarize(total_issues = sum(str_detect(tolower(VIOLATION.DESCRIPTION),
                                            search_pattern)))
}
search_texts <- c("mice", "hair", "sewage")
search_years <- 2016
violation_counts <- search_texts %>%
  map2(search_years, total_violations_nyc)
violation_counts
## [[1]]
## # A tibble: 1 × 1
## total_issues
## <int>
## 1 8146
##
## [[2]]
## # A tibble: 1 × 1
## total_issues
## <int>
## 1 2103
##
## [[3]]
## # A tibble: 1 × 1
## total_issues
## <int>
## 1 13445
Violations in 2016 regarding “mice”: 8146; “hair”: 2103; “sewage”: 13445.
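The counting step can also be sketched on a handful of made-up descriptions, which makes it easy to verify by eye. Everything here (descriptions, count_matches, patterns, counts) is hypothetical; the sketch uses set_names with map_int as one possible way to get a named vector, so each count stays attached to its search term instead of sitting in an unnamed list.

```r
library(purrr)
library(stringr)

# Made-up violation descriptions, not taken from the real data
descriptions <- c(
  "Evidence of mice or live mice present",
  "Hair not properly restrained",
  "Sewage disposal system in disrepair",
  "Evidence of MICE in food area"
)

# Count the descriptions that match a pattern, ignoring case
count_matches <- function(pattern, text) {
  sum(str_detect(tolower(text), pattern))
}

patterns <- c("mice", "hair", "sewage")

# set_names() labels each pattern with itself, so the result is named
counts <- patterns %>%
  set_names() %>%
  map_int(count_matches, text = descriptions)
counts
```

Note that str_detect flags each description at most once, so a description mentioning a term twice still counts as a single match.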
Create a function to apply to this tibble that takes a year and a regular expression (e.g. “mice”) and returns a ggplot bar chart of the top 20 restaurants with the most violations. Make sure the restaurants are properly rank-ordered in the bar chart.
top_violations_nyc <- function(search_pattern, search_year){
  nyc_rest %>%
    deduplicate_df %>%
    # Keep rows from the requested year whose description matches the pattern
    filter(year(INSPECTION.DATE) == search_year,
           str_detect(tolower(VIOLATION.DESCRIPTION),
                      search_pattern)) %>%
    # Count matching violations per restaurant name
    count(DBA) %>%
    arrange(desc(n)) %>%
    # top_n() keeps ties, so more than 20 rows can be returned
    top_n(20, n) %>%
    ggplot() +
    # reorder() ranks the bars by violation count
    geom_bar(mapping = aes(x = reorder(DBA, n),
                           y = n),
             stat = "identity") +
    coord_flip() +
    theme(text = element_text(size = 7)) +
    labs(x = "Restaurant", y = "Violations") +
    ggtitle(paste0("Top 20 Restaurants with '",
                   search_pattern,
                   "' violations for ",
                   search_year))
}
top_violations_nyc("mice", 2016)
top_violations_nyc("sewage", 2015)
top_violations_nyc("flies", 2016)
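The rank ordering in these charts comes from reorder, which rebuilds the restaurant names as a factor whose levels are sorted by count. The sketch below shows that step alone on made-up data (toy and ranked are hypothetical names), without drawing a plot.

```r
library(dplyr)

# Made-up restaurant names: "A" appears 3 times, "B" once, "C" twice
toy <- data.frame(
  DBA = c("A", "A", "A", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Count rows per restaurant, then order the factor levels by that count
ranked <- toy %>%
  count(DBA) %>%
  mutate(DBA = reorder(DBA, n))

# Levels run from the smallest count to the largest
levels(ranked$DBA)
```

With coord_flip, the last factor level is drawn at the top of the chart, so the restaurant with the most violations should appear first when reading downward.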