0. Results Summary

I investigated the relationship between:

Initially I planned to use daily restaurant closures as a variable but I incorrectly assumed that there would be more restaurants closed daily. I switched halway through to daily red violations.

Findings

I encourage you to check out some of the graphs in the report below, but my major discoveries were:

  1. The Health Department didn’t close many restaurants during the time period in question. Weekends are a large determinant in both the number of jail bookings as well as the number of red violations given by the Health Department.
  2. Less people are booked into jail during July & September, or they get arrested while camping trips outside of King county.
  3. Daily Red Violations generally do not vary much throughout the year (excluding weekends)

1. Define Project Scope

The Task

Using whatever tools you have at your disposal (though ideally Power BI) we would like you to find two datasets that you can integrate (i.e. they have a meaningful primary key in common) at the site below. https://data.kingcounty.gov/browse

We’d like you to integrate, analyze, define insights, and visualize the output. No more than 2-4 hours of time should go into this – we’re looking for insight into your ‘quick take’ on data and storytelling rather than a polished solution.

Based on the description of the task above, I divided the project into the six steps listed below.

Outline of Work

  1. Define Project Scope (15 min)
  2. Acquire Data (30 min)
  3. Clean & “Massage” Data (30 min)
  4. Merge Data (10 min)
  5. Explore Synthesized Data for Insights (45 min)
  6. Summarize Insights (15 min)
  7. Evaluation (15 min)
  8. Share Findings (15 min)

Task Objectives

A successful project will:

  1. Include report sharing my methodology and findings
  2. Use clear, reproducible code
  3. Be completed in less than four hours

I plan to evaluate my work against these objectives at the end.

2. Acquire Data

Explore Data.KingCounty.gov

I spent a lot more time than I intended exploring the possibilities. The most obvious thing to do for me would have been to combine 2012 and 2016 general election results by precinct, but I decided to do something more interesting.

I ultimately decided on integrating the following two databases:

I downloaded both databases as csv to be used in R.

My plan is to integrate the two by the date and explore if there is any relationship between the occurence of restaurant closures and jail bookings over time

Set-Up

First load necessary packages

#load necessary packages
library(lubridate)
library(dplyr)
library(knitr)

Then load the data into memory.

#read databases into memory

food <-read.csv("data/Food_Establishment_Inspection_Data.csv")
jail <- read.csv("data/Adult_Jail_Booking_Apr_1__2016_to_Mar_31__2017_as_of_Apr_6__2017.csv")

3. Clean & “Massage” Data

Food Inspection Database

  1. Convert date variable for easy manipulation
  2. Filter for time range that matches jail booking dataset: 4/1/16-3/31/17
  3. Filter only for restaurant closures (i.e. Inspection.Closed.Business == TRUE)
  4. Make each row signify one restaurant closure
  5. Summarise by date and create count of restaurant closures
  6. Rename Inspection.Date column to just date
closures <- food %>%
    #convert times
    mutate(Inspection.Date = mdy(Inspection.Date)) %>%
    #filter for time range
    filter(Inspection.Date >= "2016-04-01" & Inspection.Date <= "2017-03-31") %>%
    #filter for restaurant closures
    filter(Inspection.Closed.Business == "true") %>% #gross not boolean
    #one row = 1 restaurant closure
    select(Name:Inspection.Closed.Business) %>%
    distinct(Program.Identifier, Inspection.Date) %>%
    #summarise by date %>%
    group_by(Inspection.Date) %>%
    summarise(Close.Count = n_distinct(Program.Identifier)) %>%
    #rename Inspection.Date
    rename(Date = Inspection.Date)

Jail Booking Database

  1. Convert date variables for easy manipulation
  2. Create new variable called “Date”
  3. Summarise by date and create count of bookings
bookings <- jail %>%
    #convert times
    mutate(Booking.Date.Time = mdy_hms(Booking.Date.Time), Release.Date.Time = mdy_hms(Release.Date.Time)) %>%
    #create Date variable
    mutate(Date = date(Booking.Date.Time)) %>%
    #summarise by date %>%
    group_by(Date) %>%
    summarise(Book.Count = n_distinct(Book.of.Arrest.Number))

4. Merge Data

jail.food <- bookings %>% left_join(closures, by = "Date")

Big problem here. As you can see in the table below, there are considerably more jail bookings happening than restaurant closures in King County. Specifically, within the timeframe there are 42 restaurants closed compared to 35,988 bookings made.

kable(head(jail.food, n = 12), caption = "Jail Bookings and Restaurant Closures By Date")
Jail Bookings and Restaurant Closures By Date
Date Book.Count Close.Count
2016-04-01 104 NA
2016-04-02 68 NA
2016-04-03 61 NA
2016-04-04 107 NA
2016-04-05 112 NA
2016-04-06 143 NA
2016-04-07 110 NA
2016-04-08 111 NA
2016-04-09 73 NA
2016-04-10 70 NA
2016-04-11 101 NA
2016-04-12 110 2

4.5 Re-Munge & Re-Merge

Perhaps it might work better to use RED (i.e. critical) violations at restaurants instead of restaurant closures

criticals <- food %>%
    #convert times
    mutate(Inspection.Date = mdy(Inspection.Date)) %>%
    #filter for time range
    filter(Inspection.Date >= "2016-04-01" & Inspection.Date <= "2017-03-31") %>%
    #filter for RED violations
    filter(Violation.Type == "red") %>%
    #summarise by date %>%
    group_by(Inspection.Date) %>%
    summarise(Red.Count = n_distinct(Violation_Record_ID)) %>%
    #rename Inspection.Date
    rename(Date = Inspection.Date)
jail.red <- bookings %>% left_join(criticals, by = "Date")
kable(head(jail.red, n = 12), caption = "Jail Bookings and Red Violations By Date")
Jail Bookings and Red Violations By Date
Date Book.Count Red.Count
2016-04-01 104 29
2016-04-02 68 NA
2016-04-03 61 NA
2016-04-04 107 32
2016-04-05 112 68
2016-04-06 143 70
2016-04-07 110 60
2016-04-08 111 29
2016-04-09 73 4
2016-04-10 70 NA
2016-04-11 101 34
2016-04-12 110 47

5. Explore Synthesized Data for Insights

Matrix Scatterplot

To start, I always like to make matrix scatterplots to look for a direction to move in.

plot(jail.red, pch=16, col="blue", main="Matrix Scatterplot of Date, Bookings, and Red Violations")

The most useful scatterplots are the three in the bottom left corner.

3D Plot

Another plot I like to make when there are three variables involved is a 3D scatterplot. It helps me see if there is any clustering happening.

TO EXPLORE THIS PLOT YOURSELF, CLICK AND DRAG

library(plotly)

q <- plot_ly(jail.red, x = ~Date, y = ~Red.Count, z = ~Book.Count,  marker = list(size = 6))
q

Note 58 observations where 0 red violations occurred are excluded from this and plots below

Here I actually DO see two clusters. There is a main cluster and another cluster visible that appears to exist when the number of bookings is less than 90 and the number of red violations is less than 12. Maybe it has to do with days of the week?

low.red.cluster <- jail.red %>%
    filter(Red.Count < 12 & Book.Count < 100) %>%
    mutate(Day = wday(Date, label = TRUE)) %>%
    group_by(Day) %>%
    summarise(Count = n_distinct(Date))

kable(low.red.cluster)
Day Count
Sun 16
Mon 2
Fri 2
Sat 35

Suspicion confirmed. With the exception of four points in the lower cluster, the cluster represents weekends with low booking and low (non-zero) red violations.

Bookings and Red Violations by Time

library(reshape2)
library(ggplot2)

#reshape data to long for graphing
jail.red.long <- jail.red %>%
    rename(Bookings = Book.Count, Red.Violations = Red.Count) %>%
    melt(id.vars = "Date", variable.name = "Type", value.name = "Count") %>%
    #add Weekday & Weekend columns
    mutate(Weekday = wday(Date, label = TRUE)) %>%
    mutate(Weekend = ifelse((Weekday == "Sat" | Weekday == "Sun"), TRUE, FALSE))

#plot
ggplot(jail.red.long, aes(Date, Count)) +
    geom_point(aes(color = Weekend)) +
    facet_grid(.~Type)
## Warning: Removed 58 rows containing missing values (geom_point).

From this figure we can infer:

  1. Both Red Violations and Jail Bookings are LOWER on the weekends.
  2. We see a summertime reduction in # of bookings

Predicting the # of Bookings

#model predicting Booking Count using Red Count
book.count.lm1 <- lm(Book.Count ~ Red.Count, data = jail.red)
r1 <- summary(book.count.lm1)$r.squared

#model predicting Booking Count using Red Count and Date
book.count.lm2 <- lm(Book.Count ~ Red.Count + Date, data = jail.red)
r2 <- summary(book.count.lm2)$adj.r.squared

jail.red.days <- jail.red %>%
    #add Weekday & Weekend columns
    mutate(Weekday = wday(Date, label = TRUE)) %>%
    mutate(Weekend = ifelse((Weekday == "Sat" | Weekday == "Sun"), TRUE, FALSE))

#model predicting Booking Count using Red Count and Weekend
book.count.lm3 <- lm(Book.Count ~ Red.Count + factor(Weekend), data = jail.red.days)
r3 <- summary(book.count.lm3)$adj.r.squared


#model predicting Booking Count using Red Count and Weekend
book.count.lm4 <- lm(Book.Count ~ Red.Count + factor(Weekend) + Date, data = jail.red.days)
r4 <- summary(book.count.lm4)$adj.r.squared

Predictors <- c("Red.Count", "Red.Count + Date", "Red.Count + Weekend", "Red.Count + Weekend + Date")
r.squared <- c(r1, r2, r3, r4)
kable(data.frame(Predictors, r.squared))
Predictors r.squared
Red.Count 0.3154746
Red.Count + Date 0.3291538
Red.Count + Weekend 0.5730011
Red.Count + Weekend + Date 0.5785350

From this table we can see that the models that the variables that have the most value in predicting how many jail bookings will be in a summers day are how many restaurant red violations there are as well as whether or not it is the weekend.

6. Summarize Insights

The big takeaways are as follows:

  1. The Health Department didn’t close many restaurants during the time period in question. Weekends are a large determinant in both the number of jail bookings as well as the number of red violations given by the Health Department.
  2. Less people are booked into jail during July & September, or they get arrested while camping trips outside of King county.
  3. Daily Red Violations generally do not vary much throughout the year (excluding weekends)

More Questions:

  1. What other factors might influence the number of jail bookings?
  2. How is this data distrubted geographically across Seattle?
  3. How can I use the city of Chicago’s findings to predict where red violations might next occur in Seattle?

Next Steps:

  1. I am very willing and interested to repeat this task using Microsoft BI, if necessary.
  2. I would love to build a predictive model integrating other data sets and map the results geographically.
  3. Also I bet a breakdown of types of crimes would be similarly interesting.

7. Evaluation

1. Include report sharing my methodology and findings

Check.

2. Use clear, reproducible code

Check. The code is hidden by default but can revealed by clicking the “Code” buttons at the top of the page and throughout the report. I could have commented out my code better. ###3. Be completed in less than four hours

I kept track of my findings in an excel workbook. Here are my results.

time <- read.csv("timesheet.csv") 
kable(data.frame(time))
Step Detail Plan..min. Reality..min.
1.0 Define Project Scope 15 15
2.0 Acquire Data 30 45
3.0 Clean & “Massage” Data 30 45
4.0 Merge Data 10 5
4.5 Re-Munge & Re-Merge 0 10
5.0 Explore Synthesized Data for Insights 45 60
6.0 Summarize Insights 15 15
7.0 Evaluation 15 15
8.0 Share Findings 15 15
9.0 Editing & Polishing 0 45
NA First Draft Total Minutes 175 225
NA Final Draft Total Minutes 175 270

I am under four hours, but just barely. I planned for three hours and the extra time came from extracting a variable without enough data (restaurant closures) as well as getting lost browsing databases on kingcounty.data.gov.

8. Share Findings

I documented this task with the rmarkdown package so all I have to do is publish it and share the link. I also spent 10 minutes writing a summary so it isn’t necessary to read my entire process.