Synopsis

This report covers the week 7 assignment on data exploration. It practises using functions and iteration to reduce duplication and write more efficient code. The dataset analysed contains the New York City Restaurant Inspection Results.

Packages Required

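The tidyverse bundles several packages for data manipulation and visualisation; tibble, readr, purrr and ggplot2 are part of it and are loaded explicitly here only for clarity. RSocrata is used to read the data from the open data portal, and lubridate is used to work with dates.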

library(tidyverse)
library(RSocrata)
library(tibble)
library(readr)
library(purrr)
library(ggplot2)
library(lubridate)

Importing data

url <- 'https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59'
newyork_restaurants <- read.socrata(url = url)
#write_rds(newyork_restaurants, path = 'C:/Users/Anitha/Documents/Data Wrangling with R (BANA 8090)/data/newyork_restaurants')
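
Downloading the full inspection table on every run is slow, so one option is to cache it locally after the first download. The sketch below is only an illustration: `load_cached()` is a hypothetical helper that is not part of the original report, the cache lives in a temporary file, and the download is replaced by a stand-in function so the example runs offline.

```r
# Hypothetical helper: read a cached copy if present, otherwise fetch and save it.
load_cached <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  data <- fetch()
  saveRDS(data, cache_file)
  data
}

# Stand-in for the real download so the sketch runs without a network call.
fake_fetch <- function() {
  data.frame(CAMIS = 1:3, DBA = c("A", "B", "C"), stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")
first  <- load_cached(cache_file, fake_fetch)  # fetches and writes the cache
second <- load_cached(cache_file, fake_fetch)  # reads the cached file
```

In the report itself, `fetch` could be `function() read.socrata(url = url)` and `cache_file` a path inside the project folder.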

Assignment Problems

1. Use the map function to identify the class of each variable.

newyork_restaurants %>% map(class) 
## $CAMIS
## [1] "integer"
## 
## $DBA
## [1] "character"
## 
## $BORO
## [1] "character"
## 
## $BUILDING
## [1] "character"
## 
## $STREET
## [1] "character"
## 
## $ZIPCODE
## [1] "integer"
## 
## $PHONE
## [1] "character"
## 
## $CUISINE.DESCRIPTION
## [1] "character"
## 
## $INSPECTION.DATE
## [1] "POSIXct" "POSIXt" 
## 
## $ACTION
## [1] "character"
## 
## $VIOLATION.CODE
## [1] "character"
## 
## $VIOLATION.DESCRIPTION
## [1] "character"
## 
## $CRITICAL.FLAG
## [1] "character"
## 
## $SCORE
## [1] "integer"
## 
## $GRADE
## [1] "character"
## 
## $GRADE.DATE
## [1] "POSIXct" "POSIXt" 
## 
## $RECORD.DATE
## [1] "POSIXct" "POSIXt" 
## 
## $INSPECTION.TYPE
## [1] "character"
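
Because `map()` returns a list, the output above takes one block per column. A more compact view is possible with `map_chr()`, which returns a named character vector. The sketch below uses a small made-up data frame (`toy`, an assumption for illustration) that borrows a few column names from the inspection data.

```r
library(purrr)

# Toy stand-in for the inspection data; the values are invented.
toy <- data.frame(
  CAMIS = c(1L, 2L),
  DBA = c("A", "B"),
  INSPECTION.DATE = as.POSIXct(c("2016-01-05", "2016-02-10"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Collapse multi-class columns such as POSIXct/POSIXt into one string each.
classes <- map_chr(toy, ~ paste(class(.x), collapse = "/"))
classes
```

Collapsing with `paste()` is needed because `map_chr()` expects exactly one string per column, while date-time columns report two classes.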

2. Notice how the date variables are in POSIXlt form. Create a function that takes a single argument (“x”) and checks if it is of POSIXlt class. If it is, have the function change the input to a simple Date class with as.Date. If not, the function should keep the input class as is. Apply this function to each of the columns in the NY restaurant data set by using the map function. Be sure the final output is a tibble and not a list.

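# Convert date-time columns to Date; leave every other column unchanged.
# The date columns here carry two classes ("POSIXct" "POSIXt"), so the test uses
# inherits(), which returns a single TRUE/FALSE and also covers POSIXlt.
POSIXltToDate <- function(x) 
  {
  if (inherits(x, "POSIXt"))
    {
    return(as.Date(x))
    } 
  else
    {
    return(x)
    }
  }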


newyork_res <- newyork_restaurants %>% map(POSIXltToDate) %>% as_tibble()
newyork_res %>% map(class)
## $CAMIS
## [1] "integer"
## 
## $DBA
## [1] "character"
## 
## $BORO
## [1] "character"
## 
## $BUILDING
## [1] "character"
## 
## $STREET
## [1] "character"
## 
## $ZIPCODE
## [1] "integer"
## 
## $PHONE
## [1] "character"
## 
## $CUISINE.DESCRIPTION
## [1] "character"
## 
## $INSPECTION.DATE
## [1] "Date"
## 
## $ACTION
## [1] "character"
## 
## $VIOLATION.CODE
## [1] "character"
## 
## $VIOLATION.DESCRIPTION
## [1] "character"
## 
## $CRITICAL.FLAG
## [1] "character"
## 
## $SCORE
## [1] "integer"
## 
## $GRADE
## [1] "character"
## 
## $GRADE.DATE
## [1] "Date"
## 
## $RECORD.DATE
## [1] "Date"
## 
## $INSPECTION.TYPE
## [1] "character"
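
A date-time column reports two classes, so a comparison such as `class(x) == "POSIXct"` yields two values rather than one, which is awkward inside `if`. The sketch below shows the difference on a made-up tibble (`toy` and `is_datetime()` are illustrative names, not part of the assignment) and uses `purrr::modify_if()` as one possible alternative to `map()` followed by `as_tibble()`.

```r
library(purrr)
library(tibble)

# Toy data with one date-time column; the values are invented.
toy <- tibble(
  DBA = c("A", "B"),
  INSPECTION.DATE = as.POSIXct(c("2016-01-05 10:00:00", "2016-02-10 15:30:00"), tz = "UTC")
)

# class() returns two values here, so `==` gives a length-2 logical vector.
class_check <- class(toy$INSPECTION.DATE) == "POSIXct"

# inherits() always returns a single TRUE/FALSE.
is_datetime <- function(col) inherits(col, "POSIXt")

# modify_if() converts only the matching columns and returns a data frame, not a list.
toy_dates <- modify_if(toy, is_datetime, as.Date)
```

The time zone is set to UTC in the toy data so that the conversion to a calendar date is unambiguous.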


3. Using this reformatted tibble, identify how many restaurants in 2016 had a violation regarding “mice”? How about “hair”? What about “sewage”? Hint: the VIOLATION.DESCRIPTION and INSPECTION.DATE variables will be useful here.

newyork_res %>% 
  filter(year(INSPECTION.DATE) == 2016) %>% 
  # Label each violation record by the first keyword it matches (hair, then mice, then sewage)
  mutate(Violation_Type = ifelse(grepl("hair", tolower(VIOLATION.DESCRIPTION)), "hair",
                          ifelse(grepl("mice", tolower(VIOLATION.DESCRIPTION)), "mice",
                          ifelse(grepl("sewage", tolower(VIOLATION.DESCRIPTION)), "sewage", "NA")))) %>% 
  filter(Violation_Type %in% c("hair", "mice", "sewage")) %>%
  group_by(Violation_Type, Year = year(INSPECTION.DATE)) %>%
  # Count violation records (rows) per keyword; a restaurant cited several times is counted each time
  summarise(ViolationCount = n())
## Source: local data frame [3 x 3]
## Groups: Violation_Type [?]
## 
##   Violation_Type  Year ViolationCount
##            <chr> <dbl>          <int>
## 1           hair  2016           2132
## 2           mice  2016           8283
## 3         sewage  2016          13671
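
The counts above are numbers of violation records. If the question is read as the number of distinct restaurants, one possible approach is to count unique CAMIS values per keyword, assuming CAMIS identifies a restaurant. The sketch below runs on a small invented tibble; `toy`, `keywords` and `count_restaurants()` are illustrative names, and the year is extracted with `format()` only to keep the example free of extra packages.

```r
library(dplyr)
library(purrr)
library(tibble)

# Invented records: restaurant 1 has two mice violations in 2016.
toy <- tibble(
  CAMIS = c(1L, 1L, 2L, 2L, 3L, 3L),
  INSPECTION.DATE = as.Date(c("2016-01-05", "2016-03-01", "2016-02-10",
                              "2016-06-15", "2015-07-04", "2016-08-20")),
  VIOLATION.DESCRIPTION = c("Evidence of mice", "Evidence of mice",
                            "Hair not restrained", "Evidence of mice",
                            "Evidence of mice", "Sewage disposal problem")
)

keywords <- c(mice = "mice", hair = "hair", sewage = "sewage")

# Count distinct restaurants whose description matches the pattern in a given year.
count_restaurants <- function(pattern, data, yr) {
  data %>%
    filter(format(INSPECTION.DATE, "%Y") == as.character(yr),
           grepl(pattern, VIOLATION.DESCRIPTION, ignore.case = TRUE)) %>%
    summarise(n = n_distinct(CAMIS)) %>%
    pull(n)
}

# One count per keyword, returned as a named integer vector.
restaurant_counts <- map_int(keywords, count_restaurants, data = toy, yr = 2016)
restaurant_counts
```

Running each keyword separately also means a description that mentions two keywords is counted under both, unlike the nested `ifelse()` labelling.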


4. Create a function to apply to this tibble that takes a year and a regular expression (i.e. “mice”) and returns a ggplot bar chart of the top 20 restaurants with the most violations. Make sure the restaurants are properly rank-ordered in the bar chart.

ResTopViolations <- function(dataset, year, violation) 
  {
  dataset %>% 
  # Keep records from the requested year whose description matches the regular expression
  filter(year(INSPECTION.DATE) == year,
         grepl(violation, tolower(VIOLATION.DESCRIPTION))) %>% 
  # Count matching violation records per restaurant name
  group_by(DBA) %>%
  summarise(count = n()) %>% 
  arrange(desc(count)) %>% 
  # Rank explicitly by count; top_n() keeps ties, so more than 20 bars can appear
  top_n(20, count) %>% 
  # reorder() sorts the bars by count so the chart is rank-ordered
  ggplot() + geom_bar(mapping = aes(x = reorder(DBA, count), y = count), stat = "identity") +
  coord_flip() +
  labs(x = "Restaurant", y = "Number of violations") +
  ggtitle(paste("New York restaurants with the most", violation, "violations in", year))
  }
                               
                            
ResTopViolations(newyork_res,2016,"mice")

ResTopViolations(newyork_res,2016,"hair")

ResTopViolations(newyork_res,2016,"sewage")
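
One detail worth knowing about the top-20 step: `top_n()` keeps every row tied at the cutoff, so a chart can show more than 20 restaurants. The sketch below contrasts it with `slice_max(with_ties = FALSE)` on an invented table of counts (`counts` is an illustrative name); it is a possible refinement, not something the assignment requires.

```r
library(dplyr)
library(tibble)

# Invented counts: B and C are tied for second place.
counts <- tibble(
  DBA = c("A", "B", "C", "D"),
  count = c(5L, 3L, 3L, 1L)
)

# top_n() keeps both tied rows, so asking for 2 returns 3 rows.
with_ties <- counts %>% top_n(2, count)

# slice_max() with with_ties = FALSE returns exactly 2 rows.
without_ties <- counts %>% slice_max(count, n = 2, with_ties = FALSE)
```

Which behaviour is preferable depends on whether dropping one of several tied restaurants is acceptable for the chart.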