This report is my submission for the week-7 assignment. It works through the four homework questions, which cover functions and iteration, using the NYC restaurant inspection dataset.
library(RSocrata)
library(readr)
library(tidyverse)
library(lubridate)
library(stringr)
library(purrr)
NYC_res_data<-read.socrata("https://nycopendata.socrata.com/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59")
Use the ‘map’ function to identify the class of each variable.
# A data frame is a list of columns, so map() can iterate over all 18 directly.
map(NYC_res_data, class)
## $CAMIS
## [1] "integer"
##
## $DBA
## [1] "factor"
##
## $BORO
## [1] "factor"
##
## $BUILDING
## [1] "character"
##
## $STREET
## [1] "factor"
##
## $ZIPCODE
## [1] "integer"
##
## $PHONE
## [1] "character"
##
## $CUISINE.DESCRIPTION
## [1] "factor"
##
## $INSPECTION.DATE
## [1] "POSIXlt" "POSIXt"
##
## $ACTION
## [1] "factor"
##
## $VIOLATION.CODE
## [1] "factor"
##
## $VIOLATION.DESCRIPTION
## [1] "factor"
##
## $CRITICAL.FLAG
## [1] "factor"
##
## $SCORE
## [1] "integer"
##
## $GRADE
## [1] "factor"
##
## $GRADE.DATE
## [1] "POSIXlt" "POSIXt"
##
## $RECORD.DATE
## [1] "POSIXlt" "POSIXt"
##
## $INSPECTION.TYPE
## [1] "factor"
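As a small self-contained illustration of the step above, the sketch below runs `map()` over a made-up data frame (`toy` and its columns are hypothetical, not part of the NYC data). It shows why a POSIXlt column reports two classes, and how `map_chr()` could collapse each result into a single string when a compact summary is preferred.

```r
library(purrr)

# Hypothetical toy data; the column names are made up for illustration.
toy <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
toy$when <- as.POSIXlt(c("2016-01-05", "2016-03-09"), tz = "UTC")

# map() returns a list with one element per column.
classes <- map(toy, class)

# map_chr() needs exactly one string per column, so multi-class results are collapsed.
class_summary <- map_chr(toy, ~ paste(class(.x), collapse = ", "))
print(class_summary)
```

The list returned by `map()` keeps both class names for the `when` column, while the `map_chr()` version returns a named character vector that is easier to scan.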
Notice how the date variables are in POSIXlt form. Create a function that takes a single argument (“x”) and checks whether it is of POSIXlt class. If it is, have the function change the input to a simple Date class with as.Date. If not, the function should keep the input class as is. Apply this function to each of the columns in the NYC restaurant data set by using the map function. Be sure the final output is a tibble and not a list.
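# Convert a POSIXlt vector to Date; return any other input unchanged.
# inherits() is used because class() returns two values for POSIXlt columns.
to_date <- function(x) {
  if (inherits(x, "POSIXlt")) {
    as.Date(x)
  } else {
    x
  }
}
# map() applies the function to every column and returns a list;
# as_tibble() turns that list back into a tibble.
NYC_res_data <- NYC_res_data %>% map(to_date) %>% as_tibble()
NYC_res_data %>% map(class)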
## $CAMIS
## [1] "integer"
##
## $DBA
## [1] "factor"
##
## $BORO
## [1] "factor"
##
## $BUILDING
## [1] "character"
##
## $STREET
## [1] "factor"
##
## $ZIPCODE
## [1] "integer"
##
## $PHONE
## [1] "character"
##
## $CUISINE.DESCRIPTION
## [1] "factor"
##
## $INSPECTION.DATE
## [1] "Date"
##
## $ACTION
## [1] "factor"
##
## $VIOLATION.CODE
## [1] "factor"
##
## $VIOLATION.DESCRIPTION
## [1] "factor"
##
## $CRITICAL.FLAG
## [1] "factor"
##
## $SCORE
## [1] "integer"
##
## $GRADE
## [1] "factor"
##
## $GRADE.DATE
## [1] "Date"
##
## $RECORD.DATE
## [1] "Date"
##
## $INSPECTION.TYPE
## [1] "factor"
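The conversion can be checked on a small made-up data frame before it is run on the full data set. The sketch below is a minimal version under stated assumptions: `posixlt_to_date` and `toy` are hypothetical names introduced only for this example.

```r
library(purrr)
library(tibble)

# Hypothetical helper: convert POSIXlt input to Date, return anything else unchanged.
posixlt_to_date <- function(x) {
  if (inherits(x, "POSIXlt")) as.Date(x) else x
}

# Made-up data with one POSIXlt column.
toy <- data.frame(id = 1:2)
toy$when <- as.POSIXlt(c("2016-01-05", "2016-03-09"), tz = "UTC")

# map() returns a named list of columns; as_tibble() rebuilds the table.
toy_fixed <- as_tibble(map(toy, posixlt_to_date))
print(map(toy_fixed, class))
```

Testing with `inherits()` rather than comparing `class(x)` to a string avoids problems when an object carries more than one class, as POSIXlt values do.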
Using this reformatted tibble, identify how many restaurants in 2016 had a violation regarding “mice”? How about “hair”? What about “sewage”? Hint: the VIOLATION.DESCRIPTION and INSPECTION.DATE variables will be useful here.
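# Count 2016 inspection records whose violation description matches `pattern`.
# Note: this counts matching records (rows), not distinct restaurants,
# and the match is case-sensitive.
violations_count <- function(pattern) {
  NYC_res_data %>%
    filter(year(INSPECTION.DATE) == 2016) %>%
    summarize(total_violations = sum(str_detect(VIOLATION.DESCRIPTION, pattern)))
}
issues <- c("mice", "hair", "sewage")
number_of_violations <- issues %>% map(violations_count)
number_of_violations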
## [[1]]
## # A tibble: 1 × 1
## total_violations
## <int>
## 1 8283
##
## [[2]]
## # A tibble: 1 × 1
## total_violations
## <int>
## 1 2132
##
## [[3]]
## # A tibble: 1 × 1
## total_violations
## <int>
## 1 13646
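These totals are counts of matching violation records. Since the question asks about restaurants, it may also be worth counting distinct establishments. The sketch below contrasts the two on made-up data; `toy` and its rows are hypothetical, and it assumes `CAMIS` identifies a restaurant, as it does in the NYC data set.

```r
library(dplyr)
library(stringr)

# Made-up inspection records; CAMIS is assumed to be the restaurant identifier.
toy <- tibble(
  CAMIS = c(1L, 1L, 2L, 3L),
  VIOLATION.DESCRIPTION = c("Evidence of mice", "Evidence of mice",
                            "Hair not restrained", "Evidence of mice")
)

mice_rows <- toy %>% filter(str_detect(VIOLATION.DESCRIPTION, "mice"))

n_records <- nrow(mice_rows)                  # matching violation records
n_restaurants <- n_distinct(mice_rows$CAMIS)  # distinct restaurants among them
print(c(records = n_records, restaurants = n_restaurants))
```

Here restaurant 1 was cited twice, so there are three matching records but only two restaurants. Patterns are also case-sensitive by default; wrapping one in `regex("hair", ignore_case = TRUE)` would match "Hair" as well.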
Create a function to apply to this tibble that takes a year and a regular expression (i.e. “mice”) and returns a ggplot bar chart of the top 20 restaurants with the most violations. Make sure the restaurants are properly rank-ordered in the bar chart.
# Bar chart of the 20 restaurants with the most violations matching `type` in `yr`.
# The year argument is named `yr` so it does not shadow lubridate's year().
violations <- function(type, yr) {
  top20 <- NYC_res_data %>%
    filter(year(INSPECTION.DATE) == yr,
           str_detect(VIOLATION.DESCRIPTION, type)) %>%
    count(DBA) %>%
    top_n(20, n)  # keeps ties, so more than 20 rows are possible
  # reorder() ranks the bars by count; coord_flip() keeps long names readable.
  ggplot(top20, aes(x = reorder(DBA, n), y = n)) +
    geom_col() +
    coord_flip() +
    labs(x = "Name of Restaurant", y = "Total Violations") +
    ggtitle(paste0("Top 20 Restaurants with '", type, "' violations for ", yr))
}
violations("mice", 2016)
violations("hair", 2016)
violations("sewage", 2016)
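The rank ordering in these charts comes from `reorder()`, which sorts the factor levels of the restaurant names by their counts. The sketch below shows that mechanism in isolation on made-up data; the restaurant names and counts are hypothetical.

```r
library(ggplot2)

# Made-up counts; the restaurant names are hypothetical.
toy <- data.frame(DBA = c("Cafe A", "Diner B", "Grill C"), n = c(2, 7, 4))

# reorder() sorts the factor levels by n, which sets the bar order on the axis.
toy$DBA <- reorder(toy$DBA, toy$n)
print(levels(toy$DBA))

p <- ggplot(toy, aes(x = DBA, y = n)) +
  geom_col() +
  coord_flip()  # after flipping, the largest count is drawn at the top
```

Without `reorder()`, the bars would follow the default alphabetical level order, regardless of how the rows were sorted beforehand.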