Week 7

SYNOPSIS Exploratory data analysis of NYC restaurant inspection data and use of functions.

PACKAGES
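
No packages are listed under this heading in this copy. Judging from the functions called in the rest of the document, something like the following would be needed; this list is inferred from the code, not taken from the original report.

```r
# Assumed package list, inferred from the functions used in this document
library(readr)      # read_rds(), write_rds()
library(purrr)      # map()
library(tibble)     # as_tibble()
library(dplyr)      # %>%, filter(), summarize(), count(), arrange(), top_n()
library(stringr)    # str_detect()
library(lubridate)  # year()
library(ggplot2)    # ggplot(), geom_bar()
# library(RSocrata) # read.socrata(); only needed for the commented-out download
```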

NYC DATA

#url <- 'https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59'
#nyc_data <- read.socrata(url=url)
#write_rds(nyc_data, path = 'C:\\Study\\R\\Data Wrangling with R BANA 8090\\NYC.rds')
NY <- read_rds('C:\\Study\\R\\Data Wrangling with R BANA 8090\\NYC.rds')

1. Use the map function to identify the class of each variable.

NY %>% map(class)

## $CAMIS
## [1] "integer"
## 
## $DBA
## [1] "factor"
## 
## $BORO
## [1] "factor"
## 
## $BUILDING
## [1] "character"
## 
## $STREET
## [1] "factor"
## 
## $ZIPCODE
## [1] "integer"
## 
## $PHONE
## [1] "character"
## 
## $CUISINE.DESCRIPTION
## [1] "factor"
## 
## $INSPECTION.DATE
## [1] "POSIXlt" "POSIXt" 
## 
## $ACTION
## [1] "factor"
## 
## $VIOLATION.CODE
## [1] "factor"
## 
## $VIOLATION.DESCRIPTION
## [1] "factor"
## 
## $CRITICAL.FLAG
## [1] "factor"
## 
## $SCORE
## [1] "integer"
## 
## $GRADE
## [1] "factor"
## 
## $GRADE.DATE
## [1] "POSIXlt" "POSIXt" 
## 
## $RECORD.DATE
## [1] "POSIXlt" "POSIXt" 
## 
## $INSPECTION.TYPE
## [1] "factor"
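
The list printed by map(class) is long. As a possible alternative, map_chr can collapse each class vector into one string so the result fits in a single named character vector. The sketch below uses a small made-up list (toy) in place of NY, so the names and values are assumptions for illustration only.

```r
library(purrr)

# Hypothetical stand-in for NY: one integer column and one POSIXlt column
toy <- list(
  id   = 1:2,
  when = as.POSIXlt(c("2016-01-05", "2016-03-10"), tz = "UTC")
)

# Collapse multi-element class vectors (e.g. POSIXlt + POSIXt) into one string
class_summary <- map_chr(toy, ~ paste(class(.x), collapse = "/"))
class_summary
```

Multi-class columns such as the POSIXlt dates show up as a single slash-separated entry, which makes them easy to spot.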

2. Notice that the date variables are of class POSIXlt. Create a function that takes a single argument (“x”) and checks whether it is of class POSIXlt. If it is, the function should convert the input to a simple Date class with as.Date. If not, the function should return the input unchanged. Apply this function to each column of the NY restaurant data set with the map function. Be sure the final output is a tibble and not a list.

new_posix <- function(x){
  # inherits() tests class membership directly, whatever the length of class(x)
  if(inherits(x, "POSIXlt")){
    return(as.Date(x))
  }
  return(x)
}

NY <- NY %>% map(new_posix) %>% as_tibble()
NY %>% map(class)

## $CAMIS
## [1] "integer"
## 
## $DBA
## [1] "factor"
## 
## $BORO
## [1] "factor"
## 
## $BUILDING
## [1] "character"
## 
## $STREET
## [1] "factor"
## 
## $ZIPCODE
## [1] "integer"
## 
## $PHONE
## [1] "character"
## 
## $CUISINE.DESCRIPTION
## [1] "factor"
## 
## $INSPECTION.DATE
## [1] "Date"
## 
## $ACTION
## [1] "factor"
## 
## $VIOLATION.CODE
## [1] "factor"
## 
## $VIOLATION.DESCRIPTION
## [1] "factor"
## 
## $CRITICAL.FLAG
## [1] "factor"
## 
## $SCORE
## [1] "integer"
## 
## $GRADE
## [1] "factor"
## 
## $GRADE.DATE
## [1] "Date"
## 
## $RECORD.DATE
## [1] "Date"
## 
## $INSPECTION.TYPE
## [1] "factor"
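
The same conversion can also be written without a hand-written if: purrr's modify_if applies a function only to the elements that satisfy a predicate. This is a different technique from the one used above, shown on a small made-up list (toy) rather than NY.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for NY: one integer column and one POSIXlt column
toy <- list(
  id   = 1:2,
  when = as.POSIXlt(c("2016-01-05 10:30:00", "2016-03-10 14:00:00"), tz = "UTC")
)

# Convert only the POSIXlt elements to Date, then rebuild a tibble
result <- toy %>%
  modify_if(~ inherits(.x, "POSIXlt"), as.Date) %>%
  as_tibble()

map(result, class)
```

Columns that are not POSIXlt pass through untouched, so the predicate plays the role of the if test in the function above.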

3. Using this reformatted tibble, identify how many restaurants in 2016 had a violation regarding “mice”. How about “hair”? What about “sewage”? Hint: the VIOLATION.DESCRIPTION and INSPECTION.DATE variables will be useful here.

find_viol <- function(x){
  # Counts matching violation records (rows) from 2016
  NY %>% filter(year(INSPECTION.DATE) == 2016) %>%
    summarize(viol_cnt = sum(str_detect(tolower(VIOLATION.DESCRIPTION), x)))
}


a <- find_viol("mice")
b <- find_viol("hair")
c <- find_viol("sewage")
a
b
c

## # A tibble: 1 × 1
##   viol_cnt
##      <int>
## 1     8283

## # A tibble: 1 × 1
##   viol_cnt
##      <int>
## 1     2132

## # A tibble: 1 × 1
##   viol_cnt
##      <int>
## 1    13671

boxplot(c(a,b,c))
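
The counts above are numbers of matching violation records. A restaurant inspected several times can contribute more than one record, so the number of distinct restaurants may be smaller. The sketch below reports both figures on a small made-up tibble; the helper name count_violations, the toy data, and the use of CAMIS as the per-restaurant identifier are assumptions for illustration.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical toy data with the same column names as NY
toy <- tibble(
  CAMIS = c(1L, 1L, 2L, 3L),
  INSPECTION.DATE = as.Date(c("2016-02-01", "2016-06-01",
                              "2016-03-15", "2015-12-31")),
  VIOLATION.DESCRIPTION = c("Evidence of mice", "Evidence of MICE",
                            "Hair not restrained", "Evidence of mice")
)

# Hypothetical helper: matching rows and distinct restaurants for one year
count_violations <- function(data, pattern, yr) {
  data %>%
    filter(year(INSPECTION.DATE) == yr,
           str_detect(VIOLATION.DESCRIPTION,
                      regex(pattern, ignore_case = TRUE))) %>%
    summarize(rows = n(), restaurants = n_distinct(CAMIS))
}

mice_2016 <- count_violations(toy, "mice", 2016)
mice_2016
```

In the toy data, restaurant 1 has two matching records in 2016, so the row count and the restaurant count differ.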

4. Create a function to apply to this tibble that takes a year and a regular expression (e.g. “mice”) and returns a ggplot bar chart of the top 20 restaurants with the most violations. Make sure the restaurants are properly rank-ordered in the bar chart.

viol_year <- function(violation, yr){
  # yr is named to avoid confusion with lubridate's year() function
  NY_20 <- NY %>%
    filter(year(INSPECTION.DATE) == yr,
           str_detect(tolower(VIOLATION.DESCRIPTION), violation)) %>%
    count(DBA) %>% arrange(desc(n)) %>%
    top_n(20, n)   # ties at the cutoff can return more than 20 rows
  ggplot(NY_20, aes(x = reorder(DBA, -n), y = n)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Restaurant", y = "Violations Count")
}

viol_year("mice",2016)
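
The rank ordering in the chart comes from reorder, which sorts the factor levels of DBA by the count rather than alphabetically. A minimal sketch of that step on made-up data follows; it uses slice_max with with_ties = FALSE (available in dplyr 1.0.0 and later) to cap the result at an exact number of rows, which is a substitution for top_n and not the approach used above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per violation record
toy <- tibble(DBA = c("A", "B", "B", "C", "C", "C"))

ranked <- toy %>%
  count(DBA) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # exactly the top 2 rows
  mutate(DBA = reorder(DBA, -n))              # levels sorted by descending count

levels(ranked$DBA)

# Bars are drawn in factor-level order, so the largest count comes first
p <- ggplot(ranked, aes(x = DBA, y = n)) +
  geom_col() +
  labs(x = "Restaurant", y = "Violations Count")
```

Because ggplot2 draws discrete axes in factor-level order, fixing the levels is what fixes the bar order.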

Zeeshan

December 3, 2016