https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i
The Health Department has produced an inspection report and scoring system. After conducting an inspection the Health Inspector calculates a score basedon the violations observed.
The scores are described as followed:
| Score | Operating Condition Category | Inspection Finding |
|---|---|---|
| >90 | Good | Typically, only lower-risk health and safety violations observed |
| May have high-risk violations | ||
| 86-90 | Adequate | Several violations observed |
| May have high-risk violations | ||
| 71-85 | Needs Improvement | Multiple violations observed |
| Typically, several high-risk violations | ||
| <71 | Poor | Multiple violations observed |
| Typically, several high-risk violations |
| Variable name | Description |
|---|---|
| business_id | Unique identifier of the business for which this inspection was done. |
| business_name | Common name of the business. |
| business_address | Street address of the business. |
| business_city | City of the business. This field must be included if the file contains businesses from multiple cities. |
| business_state | State or province for the business. In the U.S. this should be the two-letter code for the state. |
| business_postal_code | Zip code or other postal code. |
| business_latitude | Latitude of the business. This field must be a valid WGS 84 latitude. |
| business_longitude | Longitude of the business. This field must be a valid WGS 84 longitude. |
| business_location | Longitude + Latitude |
| business_phone_number | Phone number for a business including country specific dialing information. |
| inspection_id | Identifying value for each individual inspection |
| inspection_date | Date of the inspection in YYYYMMDD format. |
| inspection_score | Inspection score on a 0-100 scale. 100 is the highest score. |
| inspection_type | String representing the type of inspection, Must be one of: initial, routine, followup, complaint |
| violation_id | Code for the violation. It is recommended that this be based on the FDA Food Code. However, municipalities can decide to use pre-existing codes for this field. 11 character max. |
| violation_description | One line description of the violation. 200 character max. |
| risk_category | Factor of risk, from Low, Medium, to High |
library(tidyverse)
library(plyr)
library(AMR)
df <- read.csv("Restaurant_Scores_-_LIVES_Standard.csv")
str(df)
## 'data.frame': 52315 obs. of 17 variables:
## $ business_id : int 1922 67448 1757 93022 4864 79782 73840 76437 27350 91811 ...
## $ business_name : Factor w/ 5574 levels "100% Dessert Cafe",..: 253 4393 1321 5445 1426 1340 2627 4692 9 4363 ...
## $ business_address : Factor w/ 5517 levels "001 WEST PORTAL Ave",..: 5264 3334 805 4166 2048 2222 3883 5285 3032 482 ...
## $ business_city : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
## $ business_state : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
## $ business_postal_code : Factor w/ 55 levels "","00000","64110",..: 26 31 17 10 18 18 11 16 18 18 ...
## $ business_latitude : num 37.8 37.7 37.8 NA 37.8 ...
## $ business_longitude : num -122 -122 -122 NA -122 ...
## $ business_location : Factor w/ 2537 levels "","(0, 0)","(37.668824, -122.409411)",..: 1432 323 1908 1 650 1 1 1 747 1 ...
## $ business_phone_number: num NA NA NA NA 1.42e+10 ...
## $ inspection_id : Factor w/ 25116 levels "1000_20160919",..: 1656 11199 1191 23368 7002 15904 13381 14317 2937 22740 ...
## $ inspection_date : Factor w/ 834 levels "01/01/2017 12:00:00 AM",..: 124 296 634 582 780 284 784 200 470 597 ...
## $ inspection_score : int 86 92 86 92 84 92 71 76 NA 93 ...
## $ inspection_type : Factor w/ 15 levels "Administrative or Document Review",..: 13 13 13 13 13 13 13 13 11 13 ...
## $ violation_id : Factor w/ 39600 levels "","1000_20160919_103103",..: 3122 20012 2198 38327 13122 28628 24021 25669 1 37726 ...
## $ violation_description: Factor w/ 68 levels "","Consumer advisory not provided for raw or undercooked foods",..: 61 67 39 22 7 38 12 54 1 62 ...
## $ risk_category : Factor w/ 4 levels "","High Risk",..: 3 3 4 4 3 4 2 2 1 2 ...
There are 12 factor variables with large amounts of levels which might be hard to use as many are simply descriptive, and one incorrectly labeled as it is a variable of time.
There are 5 numeric variables, two being coordinates, one being a phone number, one being a business ID, and one being inspection score.
There are a fair amount of missing values across the entire data set but it is expected with 52,315 observations.
We need to make an ID variable to serve as a key for the rest of this project.
health <- df %>% mutate(ID = id(df)) %>%
select(ID, everything())
health
I noticed this gave me incorrect values when trying to give an ID variable to my rows, as the last number would end with 52268 when I had 52315 rows so I had to try something else to establish an ID variable.
id <- rownames(df)
healthx <- cbind(id=id, df)
I split the data into two new data frames, one containing individual information on businesses and the other being the results of the inspections.
business_info <- healthx %>% select(id, business_id, business_name, business_address, business_city, business_state, business_postal_code, business_longitude, business_latitude)
inspection_info <- healthx %>% select(id, inspection_id,inspection_date, inspection_score,inspection_type, violation_id, violation_description,risk_category)
Then I used full_join() to merge my data back together using my “ID” variable as my key.
combined <- full_join(business_info, inspection_info, by = 'id')
levels(combined$risk_category)
## [1] "" "High Risk" "Low Risk" "Moderate Risk"
First thing I noticed is that there is there are 4 factors, with one named “”, which looks like just missing information.
To make this less confusing, I decided to give it a name rather than leaving it blank.
levels(combined$risk_category) <- c('missing', 'High Risk', "Low Risk", "Medium Risk")
freqtable <- combined %>% freq(risk_category)
freqtable
Next, I wanted to observe the missing values to see if they were actually scored but simply missing a classification after the fact.
combined %>% select(business_name, risk_category, inspection_score) %>%
filter(risk_category == "missing", !is.na(inspection_score))
I noticed that most of the values with non missing scores score very high, I then summarised the data to see how many actually qualified as Low Risk
combined %>% select(business_name, risk_category, inspection_score) %>%
filter(risk_category == "missing", !is.na(inspection_score)) %>%
summarise('Low Risk' = inspection_score >= 90) %>%
count()
Surprisingly, almost all of the rows with “missing” risk categories actually fall into Low risk.
combined %>% ggplot(aes(x=risk_category, fill=risk_category)) +
labs(x="Risk Level") +
geom_bar()
It is important to note that most of the missing data actually falls into the “Low Risk” category
I filter out all rows with “missing” and simply make the same graph
graph1 <- combined %>% select(business_name, risk_category, inspection_score) %>%
filter(!risk_category == "missing")
graph1 %>% ggplot(aes(x=risk_category, fill=risk_category)) +
labs(x="Risk Level") +
geom_bar()
This second plot definitely less accurate as it seems to change what the overall data actually claims.
If we do not take into account the missing data which is mostly “Low Risk”, it definitely drops the average score of the entirety of the data.
combined %>% ggplot(aes(x=risk_category,y=inspection_score)) +
geom_boxplot()
## Warning: Removed 13725 rows containing non-finite values (stat_boxplot).
Looking at a boxplot of the data, we find that most of the values are actually greater than 80, and of the high risk category, most observations below 75 seem to fall within only the first quartile.
More surprising, we find there are observations marked “Low Risk” and “Medium Risk” falling under their appropriate categories, >=90 being low risk, and medium risk 86-90.
Not surprisingly, as we discovered earlier, almost all of the data we found to be missing scored 100, leaving only a few observations below as outliers.