1. Setting Up Libraries

    I will mainly be using tidyr and dplyr here to do the analysis.

  2. Analysis Questions

    Thanks to Vyanna Hill for the analysis prompt:
    Find out if there is a trend of type of facilities and their risk level given by the inspector.

  3. Pull in the Data from Github

    To answer the questions above I will be doing a lot of sorting and group_by + filtering.

my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/Project2/Food/food-inspections.csv")
insp_raw <- read.csv(text = my_git_url)
head(insp_raw, 10)
  1. Subsetting Columns

    Ill select only the columns that I want to explore.Im going to see if there are any trends in risk common to individial restaurant chains, and I will also see if theres a commonality among restaurants by zipcodes.
insp_cleaned = insp_raw %>% 
  select(2,5,6,7,10)

head(insp_cleaned)
  1. Grouping by Restaurant

    Here I separate the risk column to isolate the numerical values associate with different levels of risk. I know I only want the high risk establishments, and this will make it easier to filter for those. First, I group by the company name, and then I create 3 columns, each that summarize one risk value, from 1 to 3. I then create a new column that stores the total number of observations per restaurant. At that point I am ready to compare the high risk establishments vs the total number of establishment per brand/restaurant/company. This dataset has a large number of companies that have only one branch in Chicago, and there are also many that are not high risk. To get useful information I filter out the ones with no high risk observation and arrange my data to show the companies with the lowest rates of high risk establishments. The table below shows that the Illinois Sportservice Inc. for example has 200 total establishments but only 2 that are at high risk.
res_grp <- insp_cleaned %>%
  separate(Risk, c("Risk", "risknum", "risk_lvl")) %>%
  mutate(`risknum` = as.integer(risknum)) %>%
  select(1,2,4,6,7) %>%
  group_by(DBA.Name) %>%
  summarize(
    count_highrisk = sum(risknum[risknum==1]),
    count_medrisk = sum(risknum[risknum==2]),
    count_lowrisk = sum(risknum[risknum==3]),
    count_total = sum(risknum[risknum])
  ) %>% 
  mutate(`rate_high_risk` = (count_highrisk/count_total) ) %>%
  filter(count_highrisk > 0) %>% 
  arrange(rate_high_risk) %>% 
  select(1,2,5,6)
## Warning: Expected 3 pieces. Additional pieces discarded in 196725 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 100 rows [21,
## 81, 136, 1562, 1961, 2178, 2439, 2751, 3000, 3485, 5459, 5828, 5881, 6728, 6877,
## 8041, 9856, 10119, 13610, 14581, ...].
head(res_grp)