Dataset available here

https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i

The Health Department has produced an inspection report and scoring system. After conducting an inspection the Health Inspector calculates a score basedon the violations observed.

The scores are described as followed:

Score	Operating Condition Category	Inspection Finding
>90	Good	Typically, only lower-risk health and safety violations observed
		May have high-risk violations
86-90	Adequate	Several violations observed
		May have high-risk violations
71-85	Needs Improvement	Multiple violations observed
		Typically, several high-risk violations
<71	Poor	Multiple violations observed
		Typically, several high-risk violations

Description of variables

Variable name	Description
business_id	Unique identifier of the business for which this inspection was done.
business_name	Common name of the business.
business_address	Street address of the business.
business_city	City of the business. This field must be included if the file contains businesses from multiple cities.
business_state	State or province for the business. In the U.S. this should be the two-letter code for the state.
business_postal_code	Zip code or other postal code.
business_latitude	Latitude of the business. This field must be a valid WGS 84 latitude.
business_longitude	Longitude of the business. This field must be a valid WGS 84 longitude.
business_location	Longitude + Latitude
business_phone_number	Phone number for a business including country specific dialing information.
inspection_id	Identifying value for each individual inspection
inspection_date	Date of the inspection in YYYYMMDD format.
inspection_score	Inspection score on a 0-100 scale. 100 is the highest score.
inspection_type	String representing the type of inspection, Must be one of: initial, routine, followup, complaint
violation_id	Code for the violation. It is recommended that this be based on the FDA Food Code. However, municipalities can decide to use pre-existing codes for this field. 11 character max.
violation_description	One line description of the violation. 200 character max.
risk_category	Factor of risk, from Low, Medium, to High

library(tidyverse)
library(plyr)
library(AMR)

Read in data

df <- read.csv("Restaurant_Scores_-_LIVES_Standard.csv")

Examining the data

str(df)

## 'data.frame':    52315 obs. of  17 variables:
##  $ business_id          : int  1922 67448 1757 93022 4864 79782 73840 76437 27350 91811 ...
##  $ business_name        : Factor w/ 5574 levels "100% Dessert Cafe",..: 253 4393 1321 5445 1426 1340 2627 4692 9 4363 ...
##  $ business_address     : Factor w/ 5517 levels "001 WEST PORTAL Ave",..: 5264 3334 805 4166 2048 2222 3883 5285 3032 482 ...
##  $ business_city        : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
##  $ business_state       : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ business_postal_code : Factor w/ 55 levels "","00000","64110",..: 26 31 17 10 18 18 11 16 18 18 ...
##  $ business_latitude    : num  37.8 37.7 37.8 NA 37.8 ...
##  $ business_longitude   : num  -122 -122 -122 NA -122 ...
##  $ business_location    : Factor w/ 2537 levels "","(0, 0)","(37.668824, -122.409411)",..: 1432 323 1908 1 650 1 1 1 747 1 ...
##  $ business_phone_number: num  NA NA NA NA 1.42e+10 ...
##  $ inspection_id        : Factor w/ 25116 levels "1000_20160919",..: 1656 11199 1191 23368 7002 15904 13381 14317 2937 22740 ...
##  $ inspection_date      : Factor w/ 834 levels "01/01/2017 12:00:00 AM",..: 124 296 634 582 780 284 784 200 470 597 ...
##  $ inspection_score     : int  86 92 86 92 84 92 71 76 NA 93 ...
##  $ inspection_type      : Factor w/ 15 levels "Administrative or Document Review",..: 13 13 13 13 13 13 13 13 11 13 ...
##  $ violation_id         : Factor w/ 39600 levels "","1000_20160919_103103",..: 3122 20012 2198 38327 13122 28628 24021 25669 1 37726 ...
##  $ violation_description: Factor w/ 68 levels "","Consumer advisory not provided for raw or undercooked foods",..: 61 67 39 22 7 38 12 54 1 62 ...
##  $ risk_category        : Factor w/ 4 levels "","High Risk",..: 3 3 4 4 3 4 2 2 1 2 ...

There are 12 factor variables with large amounts of levels which might be hard to use as many are simply descriptive, and one incorrectly labeled as it is a variable of time.

There are 5 numeric variables, two being coordinates, one being a phone number, one being a business ID, and one being inspection score.

There are a fair amount of missing values across the entire data set but it is expected with 52,315 observations.

Make ID variable

We need to make an ID variable to serve as a key for the rest of this project.

health <- df %>% mutate(ID = id(df)) %>%
                select(ID, everything())
health

I noticed this gave me incorrect values when trying to give an ID variable to my rows, as the last number would end with 52268 when I had 52315 rows so I had to try something else to establish an ID variable.

ID Variable attempt #2

id <- rownames(df)
healthx <- cbind(id=id, df)

Splitting data

I split the data into two new data frames, one containing individual information on businesses and the other being the results of the inspections.

business_info <- healthx %>% select(id, business_id, business_name, business_address, business_city, business_state, business_postal_code, business_longitude, business_latitude)

inspection_info <- healthx %>% select(id, inspection_id,inspection_date, inspection_score,inspection_type, violation_id, violation_description,risk_category)

Full Join

Then I used full_join() to merge my data back together using my “ID” variable as my key.

combined <- full_join(business_info, inspection_info, by = 'id')

Looking at risk levels according to the data

levels(combined$risk_category)

## [1] ""              "High Risk"     "Low Risk"      "Moderate Risk"

First thing I noticed is that there is there are 4 factors, with one named “”, which looks like just missing information.

To make this less confusing, I decided to give it a name rather than leaving it blank.

levels(combined$risk_category) <- c('missing', 'High Risk', "Low Risk", "Medium Risk")

Next I wanted to observe the frequency of each risk category

freqtable <- combined %>% freq(risk_category)
freqtable

Next, I wanted to observe the missing values to see if they were actually scored but simply missing a classification after the fact.

combined %>% select(business_name, risk_category, inspection_score) %>% 
  filter(risk_category == "missing", !is.na(inspection_score))

I noticed that most of the values with non missing scores score very high, I then summarised the data to see how many actually qualified as Low Risk

combined %>% select(business_name, risk_category, inspection_score) %>% 
  filter(risk_category == "missing", !is.na(inspection_score)) %>%
  summarise('Low Risk' = inspection_score >= 90) %>%
  count()

Surprisingly, almost all of the rows with “missing” risk categories actually fall into Low risk.

Graphing the data

combined %>% ggplot(aes(x=risk_category, fill=risk_category)) +
  labs(x="Risk Level") +
  geom_bar()

It is important to note that most of the missing data actually falls into the “Low Risk” category

Graphing the data without the “missing” category

I filter out all rows with “missing” and simply make the same graph

graph1 <- combined %>% select(business_name, risk_category, inspection_score) %>% 
  filter(!risk_category == "missing")

graph1 %>% ggplot(aes(x=risk_category, fill=risk_category)) +
  labs(x="Risk Level") +
  geom_bar()

This second plot definitely less accurate as it seems to change what the overall data actually claims.

If we do not take into account the missing data which is mostly “Low Risk”, it definitely drops the average score of the entirety of the data.

Box plot

combined %>% ggplot(aes(x=risk_category,y=inspection_score)) +
                      geom_boxplot()

## Warning: Removed 13725 rows containing non-finite values (stat_boxplot).

Looking at a boxplot of the data, we find that most of the values are actually greater than 80, and of the high risk category, most observations below 75 seem to fall within only the first quartile.

More surprising, we find there are observations marked “Low Risk” and “Medium Risk” falling under their appropriate categories, >=90 being low risk, and medium risk 86-90.

Not surprisingly, as we discovered earlier, almost all of the data we found to be missing scored 100, leaving only a few observations below as outliers.

Stat450_Project

Gene Dalusong

November 28, 2018