library(tidyverse) # tidyverse
library(DataExplorer) #makes EDA easy
library(lubridate) #used for dates
library(anytime) #helps with date/time
library(openxlsx) #open xlsx format files
library(wordcloud2) #for word cloud
library(tm) # text manipulation
library(leaflet) #interactive map for zips
library(plotly)
library(tmap) # interactive maps for community areas
library(sf)

Introduction

Chicago is a world-class city with a vibrant atmosphere and one of the best food scenes in the country. I used to live in the city proper, and the one question that always came up was: where should I eat?

That question is easily answered nowadays. Go to Yelp, Google Maps, or a search engine and type away. You can find ratings for restaurants, wait times, and a plethora of other information. The one bit of information that most of these platforms don’t provide is food safety. It can be the best restaurant in the world, but how well does it do in health inspections? If I visit this restaurant, will I get food poisoning?

Luckily the City of Chicago provides a list of all the city’s health inspections. The dataset features inspections going all the way back to 2010 and up to the present; every establishment that serves food and has been inspected is here. The catch is that the Excel file has over 240,000 observations in a long format. Key information is provided, including address, zip code, results, business name, risk category, and a few other variables. Of particular importance are the risk category, results, and zip codes. The data was downloaded in October 2022.

I am interested in generalizing and seeing what I can discover from this dataset.

Questions to answer:

  1. I suspect that since the pandemic and subsequent reopening we probably have seen an increase in failure rates/a decrease in passing rates from inspections. Is this true?
  2. Where are the highest failure rates in the city? How about the lowest?
  3. Besides restaurants, what other types of establishments are subject to food safety inspections?

Source for dataset: https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5

Data

df = read.xlsx("/Users/jz/Local Desktop/local - personal projects/Excel/chicago restaurants/Food_Inspections.xlsx")

We have a fairly complete dataset: 72.2% of all rows are complete, and only 1.8% of all observations are missing.

plot_str(df) # overview of the data structure

introduce(df) # row/column/missing-value summary
plot_intro(df) # visualize the introduce() summary

plot_missing(df) # missing-value profile by column

#visualize all discrete vars

plot_bar(df)
## 8 columns ignored with more than 50 categories.
## DBA.Name: 30496 categories
## AKA.Name: 28992 categories
## Facility.Type: 509 categories
## Address: 19117 categories
## City: 77 categories
## Inspection.Type: 110 categories
## Violations: 174154 categories
## Location: 17752 categories

We see that the Violations column accounts for most of the missing data. Since it consists of free text, we are not too concerned with this. The next two columns with the most missing data are Facility.Type and AKA.Name at 2.08% and 1.04% respectively; again, these are very small percentages and not too concerning.

Data Cleaning and Preprocessing

We would like to clean the data a bit and change some variable types.

We are interested in the Results variable which has 7 different levels/possibilities:

“Business Not Located” ,“Fail”, “No Entry”, “Not Ready”, “Out of Business”, “Pass”, and “Pass w/ Conditions”

For this analysis we will only be looking at establishments which received a “Pass”, “Pass w/ Conditions”, or “Fail” rating, so we will drop the rest of the observations.

There are 4 risk categories: “All”, “Risk 1 (High)”, “Risk 2 (Medium)”, and “Risk 3 (Low)”.

We will be dropping the “All” category since it is redundant.

# data cleaning: drop the Location column since the coordinates carry the same info

df_clean = df %>% drop_columns(c("Location"))
#plot_bar(df_clean, by = "Results")
#levels for Results
#levels(as.factor(df$Results))

#get dates as date type

# if Inspection.Date was read in as an Excel serial number, as.Date(..., origin = "1900-01-01")
# lands two days late; openxlsx::convertToDate() applies the correct offset
df_clean$Inspection.Date = openxlsx::convertToDate(df_clean$Inspection.Date)

#make categories/factors into factors
#str(df_clean)
df_clean$State = as.factor(df_clean$State)
df_clean$Risk = as.factor(df_clean$Risk)
df_clean$Facility.Type = as.factor(df_clean$Facility.Type)
df_clean$City = as.factor(df_clean$City)
df_clean$Inspection.Type = as.factor(df_clean$Inspection.Type)

#keep only observations with an explicit risk level
df_clean = df_clean[df_clean$Risk %in% c("Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"), ]
df_clean$Results = as.factor(df_clean$Results)

#levels(df_clean$Risk)
#drops the "All" level from Risk
df_clean$Risk = droplevels(df_clean$Risk)
#now we filter by the categories we are interested in looking at Pass, Fail, and Pass w/conditions
levels(df_clean$Results)
## [1] "Business Not Located" "Fail"                 "No Entry"            
## [4] "Not Ready"            "Out of Business"      "Pass"                
## [7] "Pass w/ Conditions"
df_clean = droplevels(filter(df_clean, Results %in% c("Pass", "Pass w/ Conditions", "Fail")))
levels(df_clean$Results)
## [1] "Fail"               "Pass"               "Pass w/ Conditions"
plot_bar(df_clean, by = "Results")
## 8 columns ignored with more than 50 categories.
## DBA.Name: 26434 categories
## AKA.Name: 25082 categories
## Facility.Type: 492 categories
## Address: 17839 categories
## City: 72 categories
## Inspection.Date: 3173 categories
## Inspection.Type: 104 categories
## Violations: 173485 categories

#success

We also see that this dataset includes establishments that are not explicitly in Chicago. We will focus only on those located in the city, so we filter for “Chicago” and any variant spellings of it present in the dataset.

#data cleaning continued

levels(df_clean$City)
##  [1] "312CHICAGO"           "ALGONQUIN"            "ALSIP"               
##  [4] "BANNOCKBURNDEERFIELD" "BERWYN"               "BLOOMINGDALE"        
##  [7] "BLUE ISLAND"          "BOLINGBROOK"          "BRIDGEVIEW"          
## [10] "BROADVIEW"            "BURNHAM"              "CALUMET CITY"        
## [13] "CCHICAGO"             "CHARLES A HAYES"      "CHCHICAGO"           
## [16] "CHCICAGO"             "chicago"              "Chicago"             
## [19] "CHicago"              "CHICAGO"              "CHICAGO HEIGHTS"     
## [22] "CHICAGO."             "CHICAGOC"             "CHICAGOCHICAGO"      
## [25] "CHICAGOI"             "CHICAGOO"             "CICERO"              
## [28] "COUNTRY CLUB HILLS"   "DES PLAINES"          "EAST HAZEL CREST"    
## [31] "ELK GROVE VILLAGE"    "ELMHURST"             "EVANSTON"            
## [34] "EVERGREEN PARK"       "FRANKFORT"            "GLEN ELLYN"          
## [37] "GLENCOE"              "GRIFFITH"             "HIGHLAND PARK"       
## [40] "JUSTICE"              "LAKE BLUFF"           "LAKE ZURICH"         
## [43] "LANSING"              "LOMBARD"              "LOS ANGELES"         
## [46] "Maywood"              "MAYWOOD"              "MERRIVILLE"          
## [49] "MORTON GROVE"         "NAPERVILLE"           "NEW HOLSTEIN"        
## [52] "NEW YORK"             "NILES NILES"          "Norridge"            
## [55] "OAK LAWN"             "OAK PARK"             "OLYMPIA FIELDS"      
## [58] "OOLYMPIA FIELDS"      "PALOS PARK"           "PLAINFIELD"          
## [61] "ROSEMONT"             "SCHAUMBURG"           "SCHILLER PARK"       
## [64] "SKOKIE"               "STREAMWOOD"           "SUMMIT"              
## [67] "TINLEY PARK"          "WADSWORTH"            "WHEATON"             
## [70] "WHITING"              "WORTH"
#lots of non-Chicago cities... keep only Chicago and its misspellings

df_clean = droplevels(filter(df_clean,
                             City %in% c("312CHICAGO", "CCHICAGO", "CHCHICAGO",
                                         "CHCICAGO", "chicago", "Chicago", "CHicago",
                                         "CHICAGO", "CHICAGOC", "CHICAGOCHICAGO",
                                         "CHICAGOI", "CHICAGOO")))
#check to see if only Chicago values are present
levels(df_clean$City)
##  [1] "312CHICAGO"     "CCHICAGO"       "CHCHICAGO"      "CHCICAGO"      
##  [5] "chicago"        "Chicago"        "CHicago"        "CHICAGO"       
##  [9] "CHICAGOC"       "CHICAGOCHICAGO" "CHICAGOI"       "CHICAGOO"
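
Alternatively, the remaining variant spellings could be collapsed into a single level. A sketch using forcats (part of the tidyverse); this step is optional since City is not used further downstream:

# collapse every remaining variant spelling into one "CHICAGO" level
df_clean$City = forcats::fct_collapse(
  df_clean$City,
  CHICAGO = c("312CHICAGO", "CCHICAGO", "CHCHICAGO", "CHCICAGO",
              "chicago", "Chicago", "CHicago", "CHICAGO",
              "CHICAGOC", "CHICAGOCHICAGO", "CHICAGOI", "CHICAGOO"))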

The dataset goes back to 2010. We are only interested in the recent results and will be filtering out any observations before 2017.

#year(df_clean$Inspection.Date)
#we are only interested in 2017-2022
df_clean = filter(df_clean, year(Inspection.Date) %in% 2017:2022)
#check years
min(year(df_clean$Inspection.Date))
## [1] 2017
max(year(df_clean$Inspection.Date))
## [1] 2022

Data Analysis

There are many different types of establishments that are subject to the city’s food inspectors. In fact, there are 508 different categories!

#Check levels of facility type
levels(df_clean$Facility.Type) %>% head(n=10)
##  [1] "(convenience store)"                "(gas station)"                     
##  [3] "1005 NURSING HOME"                  "1023"                              
##  [5] "1023 CHILDERN'S SERVICE FACILITY"   "1023 CHILDERN'S SERVICE S FACILITY"
##  [7] "1023 CHILDERN'S SERVICES FACILITY"  "1023 CHILDREN'S SERVICES FACILITY" 
##  [9] "1023-CHILDREN'S SERVICES FACILITY"  "1475 LIQUOR"
#508 different categories, including hair salon

#group by type of facility and order by count in descending order
df_clean %>% group_by(Facility.Type) %>% summarise(count = length(Facility.Type)) %>%
             ungroup() %>% arrange(desc(count)) %>% head(n=20)

We can see that Restaurant is by far the most common category with 57,072 entries, followed by Grocery Store and School. For the purposes of this analysis we will keep only categories with counts above 200 and lump the rest into an “Other” level. The cutoff of 200 was chosen to keep the number of categories manageable.

df_clean$Facility.Type = df_clean$Facility.Type %>%
  forcats::fct_lump_min(min = 200, other_level = "Other")


levels(df_clean$Facility.Type)
##  [1] "Bakery"                          "Catering"                       
##  [3] "Children's Services Facility"    "Daycare (2 - 6 Years)"          
##  [5] "Daycare Above and Under 2 Years" "Golden Diner"                   
##  [7] "Grocery Store"                   "Hospital"                       
##  [9] "Liquor"                          "Long Term Care"                 
## [11] "Mobile Food Preparer"            "Restaurant"                     
## [13] "School"                          "Other"
#save grouped data to variable 
bar = df_clean %>% group_by(Facility.Type) %>% summarise(count = length(Facility.Type)) %>%
             ungroup() %>% arrange(desc(count))

bar
#success: we have collapsed the facility types from 508 to 14
ggplot(bar, aes(x = Facility.Type, y = count)) + geom_col(color = "red", fill = "#B3DDF2") +
  labs(title = "Number of inspections by type of Facility") + 
  theme_minimal() + 
  labs(x = "Facility Type", y = "Count") +
  theme(axis.title = element_text(size = 9), plot.title = element_text(size = 9, face = "bold"),
        axis.text.x = element_text(angle = 70, hjust = 1), text = element_text(size=8)) 

#we have the raw counts, but we would like percentages

bar2 = df_clean %>% group_by(Facility.Type) %>% summarise(count = n()) %>% 
  mutate(percent = round(count / sum(count), 3)) %>% ungroup() %>% arrange(desc(percent))

bar2

As a percentage, restaurants make up around 68.6% of inspections, followed by grocery stores at 11.1% and schools at 7.3%. This is intuitive, but it is also a bit of a surprise just how many other types of establishments are subject to food safety inspections.

We now narrow our focus to restaurants and track their failure rate by year.

bar4 = df_clean[df_clean$Facility.Type == "Restaurant",] %>%
  group_by(Year = year(Inspection.Date), Results) %>%
  summarise(count = n())
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
bar4
#data is long, we need it in a wide format
#long to wide
bar5 = bar4 %>% spread(Results, count) 
#
bar5$Total = rowSums(bar5[,-1], na.rm = TRUE)
bar5
#we have an NA year row and an NA results column (66 observations in total); drop both

bar5 = subset(bar5, select = -c(`<NA>`))

bar5 = bar5[-c(7), ]  # row 7 is the NA year

#now we are interested in percentages per year by results
bar5 = subset(bar5, select = -c(`Total`) )
bar5 = subset(bar5, select = -c(`Year`) )
bar5
YearofInspection = c("2017", "2018", "2019", "2020", "2021" , "2022")

pct_table = as.data.frame(cbind(YearofInspection,round(prop.table(as.matrix(bar5), margin = 1), digits = 3)))

pct_table$YearofInspection = as.integer(pct_table$YearofInspection)
pct_table$Fail = as.double(pct_table$Fail)
pct_table$Pass = as.double(pct_table$Pass)
pct_table$`Pass w/ Conditions` = as.double(pct_table$`Pass w/ Conditions`)

pct_table
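
For reference, the same yearly proportions can be computed directly with dplyr. A sketch over the same df_clean, kept in long format (the wide pct_table above is what the plot below uses):

# per-year result proportions for restaurants, computed in one pipeline
pct_long = df_clean %>%
  filter(Facility.Type == "Restaurant", !is.na(Inspection.Date)) %>%
  mutate(Year = year(Inspection.Date)) %>%
  count(Year, Results) %>%
  group_by(Year) %>%
  mutate(percent = round(n / sum(n), 3)) %>%
  ungroup()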

Now we will plot the trends over the years.

ggplot(pct_table, aes(x=YearofInspection)) + 
  geom_line(aes(y = Fail, color = "red")) + 
  geom_line(aes(y = Pass, color="darkgreen")) + 
  geom_line(aes(y = `Pass w/ Conditions`, color = "darkgoldenrod1")) + 
  labs(title = "Results of Inspection Trends 2017-2022", colour="Results", x = "Year", y  = "Percent")  + 
  scale_color_identity(name = "Results",
                          breaks = c("red", "darkgreen", "darkgoldenrod1"),
                          labels = c("Fail", "Pass", "Pass w/ Conditions"),
                          guide = "legend") + theme_minimal() + 
  theme(axis.title = element_text(size = 9),
        plot.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(size = 9))

These results are very surprising! Labor shortages across industries became very apparent after the peak of the COVID pandemic, so one might expect failure rates to increase drastically in 2020, yet we only see a slight increase of around 2 percentage points. With fewer staff to manage establishments and address the issues that cause inspections to fail, a much bigger jump would have seemed likely.

Another surprising trend is in the “Pass” rating: a sharp decline from 2017 into 2019 before it begins to recover. The “Pass w/ Conditions” rating shows the opposite pattern, a sharp increase from 2017 into 2019 before it begins to drop.

What could be the cause? Upon further research, we found that the City of Chicago implemented changes to its inspection criteria starting on July 1, 2018. This helps explain the drastic increase in “Pass w/ Conditions” results: inspections that previously would have earned a “Pass” rating were now downgraded to “Pass w/ Conditions”.

Zip Codes

Chicago is a big city with many different neighborhoods that all help make it a great place to live. We are interested in where in the city we find the most failed inspections, so we will find the top 10 and bottom 10 zip codes by count of “Fail” results for 2021 and 2022.

We start by finding the frequencies of each rating of interest for each zip code. Since these are raw counts, areas with many establishments will see more “Fail” results than areas with fewer establishments.

We find that the top zip code in terms of failed inspections has a failure rate of almost 50%! The next-highest zip codes have failure rates of around 30-35%, so there is quite a gap at the top.

The zip codes with the lowest failure rates are all below 20%, so while there is a spread between the top and bottom zip codes, it is not as pronounced as one might imagine.

#first subset into years
df_clean$yearIns = year(df_clean$Inspection.Date)
df_clean$monthIns = month(df_clean$Inspection.Date)

zipdf = filter(df_clean, yearIns %in% c(2021, 2022))

by_zip = zipdf %>% group_by(Zip, Results) %>% summarise(count = length(Zip)) %>%
             ungroup() %>% arrange(desc(count))
## `summarise()` has grouped output by 'Zip'. You can override using the `.groups`
## argument.
top10_zip = by_zip %>% spread(Results, count) %>% ungroup() %>% arrange(desc(Fail)) %>% head(n = 10)
bottom10_zip = by_zip %>% spread(Results, count) %>% ungroup() %>% arrange((Fail)) %>% head(n = 10)

top10_zip
bottom10_zip
by_zip = by_zip %>% spread(Results, count) %>% ungroup() %>% arrange(desc(Fail)) 
by_zip$Zip = as.factor(by_zip$Zip)
zip_list = by_zip$Zip

#need to remove zip factor variable otherwise prop.table will not run
by_zip = subset(by_zip, select = -c(`Zip`) )
#add zips back to list
by_zip_pct  = cbind(zip_list, as.data.frame(round(prop.table(as.matrix(by_zip), margin = 1), digits = 3)))
by_zip_pct
#gather top 10 and bottom 10 by percentage
top10_zip = by_zip_pct %>% arrange(desc(Fail)) %>% head(n = 10)
bottom10_zip = by_zip_pct %>% arrange((Fail)) %>% head(n = 10)
#show results
top10_zip
bottom10_zip
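
For reference, the same failure-rate ranking can be computed in a single pass, since the mean of a logical vector is a proportion. A sketch using the zipdf built above:

# one-pass failure rate per zip: mean(Results == "Fail") is the share of fails
fail_by_zip = zipdf %>%
  group_by(Zip) %>%
  summarise(inspections = n(),
            fail_rate = round(mean(Results == "Fail"), 3),
            .groups = "drop") %>%
  arrange(desc(fail_rate))
head(fail_by_zip, 10)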
#zip code results visualized
ggplot(top10_zip, aes(x = zip_list, y = Fail, fill = zip_list)) + geom_col(alpha = 0.9) +
  labs(title = "Failure rates by Zip - Top 10", x = "Zip Code" ,  y = "Percent") + 
  theme_minimal() + 
  theme(axis.title = element_text(size = 9), plot.title = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(angle = 70, hjust = 1), text = element_text(size=8), legend.position = "none")

ggplot(bottom10_zip, aes(x = zip_list, y = `Fail`, fill = zip_list)) + geom_col() +
  labs(title = "Failure rates by Zip - Bottom 10", x = "Zip Code" ,  y = "Percent") + 
  theme_minimal() + 
  theme(axis.title = element_text(size = 9), plot.title = element_text(size = 15, face = "bold"),
        axis.text.x = element_text(angle = 70, hjust = 1), text = element_text(size=8), legend.position = "none")

Statistical Testing

In this section we will explore whether these risk categories see different failure rates.

We have 3 designated risk categories, Risk 1 (High), Risk 2 (Medium), and Risk 3 (Low), which are inspected twice per year, once per year, and once every other year respectively. We are interested in 3 possible outcomes: Pass, Fail, and Pass w/ Conditions.

We want to know whether the failure rate differs between categories. Since we are interested in failure, we will combine Pass and Pass w/ Conditions into a single result. We will limit our results to 2021.

#filter on year and restaurants
prop_df = filter(df_clean, yearIns == 2021, Facility.Type == "Restaurant")

by_risk = prop_df %>% group_by(Risk, Results) %>% summarise(count = length(Risk)) %>%
             ungroup() %>% arrange(desc(count))
## `summarise()` has grouped output by 'Risk'. You can override using the
## `.groups` argument.
by_risk = by_risk %>% spread(Results, count) 

by_risk
# now we will obtain the same table except collapsing Pass w/ Conditions and Pass into Pass

df_clean$Results  = recode_factor(df_clean$Results, "Pass w/ Conditions" = "Pass")

prop_df2 = filter(df_clean, yearIns == 2021, Facility.Type == "Restaurant")


by_risk2 = prop_df2 %>% group_by(Risk, Results) %>% summarise(count = length(Risk)) %>%
             ungroup() %>% arrange(desc(count)) %>% spread(Results, count) 
## `summarise()` has grouped output by 'Risk'. You can override using the
## `.groups` argument.
by_risk2
#success, counts match

#now we will obtain proportions

risk = by_risk2$Risk

by_risk2 = subset(by_risk2, select = -c(`Risk`) )

by_risk2_pct = cbind(risk, as.data.frame(round(prop.table(as.matrix(by_risk2), margin = 1), digits = 3)))
by_risk2_pct

We have obtained the proportions of Pass/Fail results for the year 2021 by risk category, and two peculiar insights stand out. First, there is not a huge difference in Pass/Fail percentages between categories: the biggest gap is between Risk 1 and Risk 2/Risk 3, and it is only around 3 percentage points.

Second, when we conduct tests for equal proportions between the groups, we find some surprising results. The only statistically significant difference is between Risk 1 and Risk 2. The difference between Risk 2 and Risk 3 is not statistically significant, and, more surprisingly, neither is the difference between Risk 1 and Risk 3, even though the failure rates of Risk 2 and Risk 3 differ by only 0.001. The likely culprit is sample size: Risk 3 has far fewer inspections, so its confidence interval is much wider.

It should be noted that the confidence interval for Risk 1 vs Risk 2 would include 0 if we rounded the bounds, so while statistically significant, this conclusion should be accepted with caution.
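
Before the pairwise comparisons, we can also run a single omnibus test of equal proportions across all three groups. A sketch, with the fail counts and group totals read off the by_risk2 table above:

# omnibus chi-squared test that all three risk categories share one failure rate
prop.test(x = c(1759, 274, 42), n = c(5663 + 1759, 1060 + 274, 164 + 42))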

# 1 vs 2
# difference is statistically significant (p-value = 0.01308)
prop.test(c(1759, 274), c(5663+1759, 1060+274))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(1759, 274) out of c(5663 + 1759, 1060 + 274)
## X-squared = 6.158, df = 1, p-value = 0.01308
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.007418798 0.055782826
## sample estimates:
##    prop 1    prop 2 
## 0.2369981 0.2053973
# 1 vs 3
# difference is not statistically significant (p-value > 0.05)
prop.test(c(1759, 42), c(5663+1759, 164+42))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(1759, 42) out of c(5663 + 1759, 164 + 42)
## X-squared = 1.042, df = 1, p-value = 0.3074
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.02524066  0.09146990
## sample estimates:
##    prop 1    prop 2 
## 0.2369981 0.2038835
# 2 vs 3
# difference is not statistically significant (p-value > 0.05)
prop.test(c(1060, 164), c(1060+274, 164+42))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(1060, 164) out of c(1060 + 274, 164 + 42)
## X-squared = 3.346e-29, df = 1, p-value = 1
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.06216147  0.05913386
## sample estimates:
##    prop 1    prop 2 
## 0.7946027 0.7961165

Conclusion and comments

We set out to answer 3 main questions at the beginning of this analysis.

  1. I suspect that since the pandemic and subsequent reopening we probably have seen an increase in failure rates from inspections. Is this true?

There had been a decrease in Pass ratings, but it happened well before the pandemic! We see a decrease in Pass ratings from 2017 to 2019, very likely due to the change in inspection criteria that took effect in July 2018; in fact, Pass rates are now starting to recover. Failure rates did increase, but only slightly, nowhere near what we would have expected given the situation caused by the pandemic (closures, etc.) and the labor shortage that followed across the food industry.

  2. Where are the highest failure rates in the city? How about the lowest? The zip codes with the highest failure rates see failures between 30 and 50%. The zip codes with the lowest failure rates all see rates lower than 20%.

  3. Besides restaurants, what other types of establishments are subject to food safety inspections? There is an incredible variety of establishment types subject to food safety inspections, but the overwhelming majority of inspections happen at restaurants, grocery stores, and schools.

There is further exploration of this dataset that could be done! While we primarily focused on “recent” data, it goes back to 2010, so further analysis of historical trends is possible. We can also plot a map* of the establishments and see where they are geographically located.

Note: See the Tableau link; an R interactive map is coming soon! https://public.tableau.com/app/profile/jose.zavala1055/viz/ChicagoRestaurantsInspections/Map

top_map_data = dplyr::select(by_zip_pct, 'zip_list', 'Fail', 'Pass', 'Pass w/ Conditions')
top_map_data$State = 'IL'
#install.packages("rgdal")
zcta_shapefile =  rgdal::readOGR("/Users/jz/Local Desktop/local - personal projects/maps/cb_2018_us_zcta510_500k/cb_2018_us_zcta510_500k.shp")
## Warning: OGR support is provided by the sf and terra packages among others
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/jz/Local Desktop/local - personal projects/maps/cb_2018_us_zcta510_500k/cb_2018_us_zcta510_500k.shp", layer: "cb_2018_us_zcta510_500k"
## with 33144 features
## It has 5 fields
## Integer64 fields read as strings:  ALAND10 AWATER10
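
As an aside, rgdal has since been retired; the same shapefile could be read with sf, which is already loaded. A sketch (the downstream code would then use ordinary data-frame joins instead of the @data slot):

# sf::st_read is the maintained replacement for rgdal::readOGR;
# it returns an sf data frame rather than a SpatialPolygonsDataFrame
zcta_sf = sf::st_read("/Users/jz/Local Desktop/local - personal projects/maps/cb_2018_us_zcta510_500k/cb_2018_us_zcta510_500k.shp")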
zcta_data = data.table::copy(zcta_shapefile@data)

top_map_data$zip_list = as.numeric(as.character(top_map_data$zip_list))
zcta_data$ZCTA5CE10 = as.numeric(zcta_data$ZCTA5CE10)
zcta_data = zcta_data %>% left_join(top_map_data, by = c("ZCTA5CE10" = "zip_list"))

# Now reattach this data file back to the SpatialPolygonsDataFrame data slot
zcta_shapefile@data = zcta_data

# (optional) There are 33k ZCTAs in the US; consider reducing these to a particular region of interest.
# I'm using the `State` value that I just joined for demo purposes. You can skip this if you want to do the whole country.


zcta_shapefile = zcta_shapefile[!is.na(zcta_shapefile@data$State) & zcta_shapefile@data$State=='IL',]

nrow(zcta_shapefile@data) == length(zcta_shapefile@polygons)
## [1] TRUE
nrow(zcta_shapefile@data)
## [1] 58
length(zcta_shapefile@polygons)
## [1] 58
labels = sprintf("<strong>Zip Code: %s</strong><br/> Failure Rate: %s",
  zcta_shapefile@data$ZCTA5CE10, zcta_shapefile@data$Fail) %>% lapply(htmltools::HTML)

map_palette = colorNumeric('YlOrRd', zcta_shapefile@data$Fail)

zcta_map = zcta_shapefile %>% 
  leaflet::leaflet(options = leaflet::leafletOptions(preferCanvas = TRUE)) %>%
  leaflet::addProviderTiles(providers$OpenStreetMap,
                            options = providerTileOptions(updateWhenZooming = FALSE, updateWhenIdle = TRUE )) %>%
  leaflet::setView(lat = 41.881832, lng = -87.623177, zoom = 10) %>%

  leaflet::addPolygons(fillColor = ~map_palette(Fail), weight = 2, opacity = 1, color = "white", dashArray = "1",
    stroke = TRUE, fillOpacity = 0.5, 
    highlight = highlightOptions(weight = 5, color = "#667", dashArray = "", fillOpacity = 0.7, bringToFront = TRUE),
    label = labels,
    labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "15px", direction = "auto")) %>%
  
  #use leaflegend: leaflet's built-in legend sorts values ascending and cannot be reversed
  leaflegend::addLegendNumeric(pal = map_palette, values = ~Fail, decreasing = TRUE, orientation = 'vertical',
                   width = 50, height = 200, position = 'topright', title = 'Failure Rates')
zcta_map

Challenges with zip choropleth

Producing a zip choropleth is much harder than it seems! For one, zip codes are not true geographic polygons; they are postal routing constructs, approximated here by Census ZCTA boundaries. Ultimately, zips are being used as a proxy for the neighborhoods of Chicago; the end goal of the analysis will be to create a map of Chicago’s community areas and their failure rates.

Chi_map2 = st_read("/Users/jz/Downloads/Chicago Zip Code and Neighborhood Map/geo_export_981e810a-4808-44fc-9252-b722e7367d97.shp")
## Reading layer `geo_export_981e810a-4808-44fc-9252-b722e7367d97' from data source `/Users/jz/Downloads/Chicago Zip Code and Neighborhood Map/geo_export_981e810a-4808-44fc-9252-b722e7367d97.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 61 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS84(DD)
Chi_Zipsf <- merge(Chi_map2, top_map_data, by.x = "zip", by.y = "zip_list")

tmap_mode("view")
## tmap mode set to interactive viewing
comm_map = tm_shape(Chi_Zipsf ) +
  tm_polygons("Fail", style="cont", pal="YlOrRd", alpha = .8,
              title = "Restaurant Failure Rates by Community Area", 
              breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5), id = "zip",
              popup.vars = c(" Failure Rate = "="Fail"),
              popup.format = list(), legend.show = FALSE) +
  tm_add_legend('fill', col = RColorBrewer::brewer.pal(6, "YlOrRd"), 
                labels = c('0.5', '0.4', '0.3', '0.2', '0.1', '0'),
                title="Restaurant Failure Rates by Community Area",
                reverse = TRUE)
comm_map
#tmap_save(comm_map, "com_map.html")

Finally, we build word clouds from the free-text Violations field, comparing the language of “Pass”, “Pass w/ Conditions”, and “Fail” inspections for 2022.

df_cloud = df %>% drop_columns(c("Inspection.ID", "License.#", "Address", "Location"))
#get dates as date type

# convertToDate() corrects the two-day offset of Excel serial dates (see above)
df_cloud$Inspection.Date = openxlsx::convertToDate(df_cloud$Inspection.Date)

#make categories/factors into factors
#str(df_cloud)
df_cloud$State = as.factor(df_cloud$State)
df_cloud$Risk = as.factor(df_cloud$Risk)
df_cloud$Facility.Type = as.factor(df_cloud$Facility.Type)
df_cloud$City = as.factor(df_cloud$City)
df_cloud$Inspection.Type = as.factor(df_cloud$Inspection.Type)

df_cloud = df_cloud[df_cloud$Risk %in% c("Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"), ]
df_cloud$Results = as.factor(df_cloud$Results)

#levels(df_cloud$Risk)
#drops the "All" level from Risk
df_cloud$Risk = droplevels(df_cloud$Risk)


df_cloud = droplevels(filter(df_cloud, Results %in% c("Pass", "Pass w/ Conditions", "Fail")))
df_cloud = droplevels(filter(df_cloud,
                             City %in% c("312CHICAGO", "CCHICAGO", "CHCHICAGO",
                                         "CHCICAGO", "chicago", "Chicago", "CHicago",
                                         "CHICAGO", "CHICAGOC", "CHICAGOCHICAGO",
                                         "CHICAGOI", "CHICAGOO")))

#keep only 2022 for the word clouds
df_cloud = filter(df_cloud, year(Inspection.Date) == 2022)

df_cloud1 = filter(df_cloud, Results %in% c("Pass"))
df_cloud2 = filter(df_cloud, Results %in% c("Pass w/ Conditions"))
df_cloud3 = filter(df_cloud, Results %in% c("Fail"))

cloud_corpus1 = Corpus(VectorSource(df_cloud1$Violations))

cloud_corpus1  = cloud_corpus1 %>% tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% tm_map(stripWhitespace) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) %>% 
  tm_map(removeWords, stopwords("SMART"))
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("SMART")):
## transformation drops documents
cloud_corpus2 = Corpus(VectorSource(df_cloud2$Violations))

cloud_corpus2  = cloud_corpus2 %>% tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% tm_map(stripWhitespace) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) %>% 
  tm_map(removeWords, stopwords("SMART"))
cloud_corpus3 = Corpus(VectorSource(df_cloud3$Violations))

cloud_corpus3  = cloud_corpus3 %>% tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% tm_map(stripWhitespace) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) %>% 
  tm_map(removeWords, stopwords("SMART"))
tdm1 = TermDocumentMatrix(cloud_corpus1) %>% as.matrix()
words1 = sort(rowSums(tdm1), decreasing = TRUE)
cloud1 = data.frame(word1 = names(words1), freq1 = words1)

tdm2 = TermDocumentMatrix(cloud_corpus2) %>% as.matrix()
words2 = sort(rowSums(tdm2), decreasing = TRUE)
cloud2 = data.frame(word2 = names(words2), freq2 = words2)

tdm3 = TermDocumentMatrix(cloud_corpus3) %>% as.matrix()
words3 = sort(rowSums(tdm3), decreasing = TRUE)
cloud3 = data.frame(word3 = names(words3), freq3 = words3)
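
The corpus pipeline above runs three times verbatim; a small helper would remove the duplication. A sketch (word_freqs is a name of my choosing, wrapping the same tm transformations used above):

# hypothetical helper: cleans a corpus and returns a word-frequency table
word_freqs = function(violations) {
  corpus = Corpus(VectorSource(violations)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("english")) %>%
    tm_map(removeWords, stopwords("SMART"))
  tdm = TermDocumentMatrix(corpus) %>% as.matrix()
  freqs = sort(rowSums(tdm), decreasing = TRUE)
  data.frame(word = names(freqs), freq = freqs)
}
#e.g. cloud1 = word_freqs(df_cloud1$Violations)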

#manually remove a few words that do not tell us much and keep only words appearing more than 50 times
cloud1 = cloud1 %>%
  filter(freq1 > 50,
         !(word1 %in% c("comments", "instructed", "food", "observed")))
cloud2 = cloud2 %>%
  filter(freq2 > 50,
         !(word2  %in% c("comments", "instructed", "food", "observed")))
cloud3 = cloud3 %>%
  filter(freq3 > 50,
         !(word3  %in% c("comments", "instructed", "food", "observed")))

#3 word clouds for Pass, Pass w/ Conditions, and Fail ratings
wordcloud2(cloud1)
wordcloud2(cloud2)
wordcloud2(cloud3)