Functions and Iterations

Synopsis

This R Markdown Notebook is my report for the Data Wrangling with R class assignment for Week 7.
The report demonstrates how iteration and functions can reduce duplication and make code more efficient.

Data Source

For this report we use the NYC Restaurant Inspection Results data, which records restaurant inspections across New York City.

Packages required

library('RSocrata')
library('readr')
library('purrr')
library('dplyr')
library('tibble')
library('stringr')
library('lubridate')
library('ggplot2')

Importing the data

# Note: Data download can take several minutes
# Uncomment the next 3 lines to download and store data locally

#url <- 'https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59'
#nyc_rest <- read.socrata(url = url)
#write_rds(nyc_rest, '../data/nyc_restaurant')

# Read the locally cached copy (same path the download step writes to)
nyc_rest <- read_rds('../data/nyc_restaurant')

Questions

Problem 1

Use the map function to identify the class of each variable.

nyc_rest %>% map(class)

## $CAMIS
## [1] "integer"
## 
## $DBA
## [1] "factor"
## 
## $BORO
## [1] "factor"
## 
## $BUILDING
## [1] "character"
## 
## $STREET
## [1] "factor"
## 
## $ZIPCODE
## [1] "integer"
## 
## $PHONE
## [1] "character"
## 
## $CUISINE.DESCRIPTION
## [1] "factor"
## 
## $INSPECTION.DATE
## [1] "POSIXlt" "POSIXt" 
## 
## $ACTION
## [1] "factor"
## 
## $VIOLATION.CODE
## [1] "factor"
## 
## $VIOLATION.DESCRIPTION
## [1] "factor"
## 
## $CRITICAL.FLAG
## [1] "factor"
## 
## $SCORE
## [1] "integer"
## 
## $GRADE
## [1] "factor"
## 
## $GRADE.DATE
## [1] "POSIXlt" "POSIXt" 
## 
## $RECORD.DATE
## [1] "POSIXlt" "POSIXt" 
## 
## $INSPECTION.TYPE
## [1] "factor"
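
The list output above is long because each column gets its own element. As a minimal sketch of a more compact alternative, map_chr can collapse each class vector into a single string and return one named character vector. The toy data frame below is hypothetical and only stands in for nyc_rest.

```r
library(purrr)

# Hypothetical toy data standing in for nyc_rest
toy <- data.frame(
  CAMIS = c(1L, 2L),
  DBA = factor(c("CAFE A", "DINER B"))
)
# Assigning with $ keeps the POSIXlt class on the column
toy$INSPECTION.DATE <- as.POSIXlt(c("2016-01-05", "2016-02-10"), tz = "UTC")

# One string per column; multi-class columns are joined with ", "
class_summary <- map_chr(toy, ~ paste(class(.x), collapse = ", "))
class_summary
```

Columns with more than one class, such as the POSIXlt dates, appear as a single comma-separated entry.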

Problem 2

Notice that the date variables are stored as POSIXlt. Create a function that takes a single argument (“x”) and checks whether it is of class POSIXlt. If it is, the function should convert the input to a simple Date class with as.Date. If not, the function should leave the input class unchanged. Apply this function to each column of the NYC restaurant data set using the map function. Be sure the final output is a tibble and not a list.

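# Returns TRUE if x inherits from the POSIXlt class
checkPosixlt <- function(x){
  inherits(x, "POSIXlt")
}

# Converts POSIXlt input to Date; returns any other input unchanged
convertPOSIXltToDate <- function(x){
  if(!checkPosixlt(x)){
    return(x)
  }
  return(as.Date(x))
}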

nyc_rest <- nyc_rest %>% map(convertPOSIXltToDate) %>% as_tibble()
nyc_rest
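
To check the conversion without loading the full data set, the same idea can be tried on a small example. This is only a sketch: the helper name posixlt_to_date and the toy data frame are hypothetical, and the helper is a condensed version of the two functions above.

```r
library(purrr)
library(tibble)

# Hypothetical condensed helper: POSIXlt becomes Date, anything else is untouched
posixlt_to_date <- function(x) {
  if (inherits(x, "POSIXlt")) as.Date(x) else x
}

# Hypothetical toy data with one POSIXlt column
toy <- data.frame(id = 1:2)
toy$when <- as.POSIXlt(c("2016-01-05 10:30:00", "2016-02-10 23:15:00"), tz = "UTC")

# map() returns a named list of columns; as_tibble() rebuilds the table
converted <- toy %>% map(posixlt_to_date) %>% as_tibble()
converted
```

The time of day is dropped by as.Date, so only the calendar date remains in the converted column.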

Problem 3

Using this reformatted tibble, identify how many restaurants had a violation regarding “mice” in 2016. How about “hair”? What about “sewage”? Hint: the VIOLATION.DESCRIPTION and INSPECTION.DATE variables will be useful here.

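# Drop rows that are exact duplicates of an earlier row
deduplicate_df <- function(df){
  return(df[!duplicated(df),])
}

# Count the violation records in search_year whose description
# matches search_pattern (case-insensitive)
total_violations_nyc <- function(search_pattern, search_year){
  nyc_rest %>%
    deduplicate_df %>%
    filter(year(INSPECTION.DATE) == search_year) %>%
    summarize(total_issues = sum(str_detect(tolower(VIOLATION.DESCRIPTION),
                                            search_pattern)))
}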

search_texts <- c("mice", "hair", "sewage")
search_years <- 2016

# search_years has length 1, so map2 recycles it across every pattern
violation_counts <- search_texts %>% 
  map2(search_years, total_violations_nyc)
violation_counts

## [[1]]
## # A tibble: 1 × 1
##   total_issues
##          <int>
## 1         8146
## 
## [[2]]
## # A tibble: 1 × 1
##   total_issues
##          <int>
## 1         2103
## 
## [[3]]
## # A tibble: 1 × 1
##   total_issues
##          <int>
## 1        13445

Violations regarding:

Mice: 8146
Hair: 2103
Sewage: 13445
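
The figures above count matching violation records, so a restaurant cited several times in the year is counted several times. If the number of distinct restaurants is wanted instead, one possible approach is to count unique CAMIS identifiers. The sketch below assumes CAMIS identifies a restaurant uniquely; the function name count_restaurants and the toy data are hypothetical.

```r
library(dplyr)
library(stringr)
library(lubridate)
library(purrr)

# Hypothetical toy data using the same column names as nyc_rest
toy <- tibble(
  CAMIS = c(1L, 1L, 2L, 3L, 3L),
  INSPECTION.DATE = as.Date(c("2016-03-01", "2016-09-01", "2016-05-01",
                              "2015-05-01", "2016-06-01")),
  VIOLATION.DESCRIPTION = c("Evidence of mice or live mice present",
                            "Evidence of mice or live mice present",
                            "Hair not effectively restrained",
                            "Evidence of mice",
                            "Sewage disposal system improper")
)

# Hypothetical helper: distinct restaurants and raw record count for one pattern
count_restaurants <- function(df, pattern, yr) {
  df %>%
    filter(year(INSPECTION.DATE) == yr,
           str_detect(tolower(VIOLATION.DESCRIPTION), pattern)) %>%
    summarize(restaurants = n_distinct(CAMIS), records = n())
}

# Name the patterns by themselves so bind_rows() can label each result row
result <- c("mice", "hair", "sewage") %>%
  set_names() %>%
  map(~ count_restaurants(toy, .x, 2016)) %>%
  bind_rows(.id = "pattern")
result
```

In the toy data, restaurant 1 has two mice records in 2016 but counts as one restaurant, which is the distinction between the two columns.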

Problem 4

Create a function to apply to this tibble that takes a year and a regular expression (e.g. “mice”) and returns a ggplot bar chart of the top 20 restaurants with the most violations. Make sure the restaurants are properly rank-ordered in the bar chart.

# Bar chart of the restaurants with the most violation records matching
# search_pattern in search_year
top_violations_nyc <- function(search_pattern, search_year){
  nyc_rest %>%
    deduplicate_df %>%
    filter(year(INSPECTION.DATE) == search_year,
           str_detect(tolower(VIOLATION.DESCRIPTION),
                      search_pattern)) %>%
    count(DBA) %>%
    arrange(desc(n)) %>%
    top_n(20, n) %>%  # keeps ties, so more than 20 rows are possible
    ggplot() +
      # geom_col() is equivalent to geom_bar(stat = "identity");
      # reorder() ranks the bars by count
      geom_col(mapping = aes(x = reorder(DBA, n),
                             y = n)) +
      coord_flip() +
      theme(text = element_text(size = 7)) +
      labs(x = "Restaurant", y = "Violations") +
      ggtitle(paste0("Top 20 Restaurants with '",
                     search_pattern,
                     "' violations for ",
                     search_year))
}

top_violations_nyc("mice", 2016)

top_violations_nyc("sewage", 2015)

top_violations_nyc("flies", 2016)
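
The rank ordering in these charts comes from reorder, which sorts the factor levels of DBA by count. The sketch below isolates that step on hypothetical toy counts. It also swaps in slice_max with with_ties = FALSE as one way to cap the result at an exact number of rows, whereas top_n keeps ties; this is an alternative, not the approach used above.

```r
library(dplyr)

# Hypothetical counts per restaurant (no ties, to keep the ordering unambiguous)
counts <- tibble(
  DBA = c("CAFE A", "DINER B", "GRILL C", "DELI D"),
  n = c(3L, 7L, 5L, 9L)
)

top <- counts %>%
  slice_max(n, n = 3, with_ties = FALSE) %>%  # exactly 3 rows, largest first
  mutate(DBA = reorder(DBA, n))               # levels sorted by ascending n

levels(top$DBA)
```

Because the levels run from the smallest to the largest count, coord_flip places the restaurant with the most violations at the top of the chart.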