Lab 3 - 11.S939

Objective

We will use 311 data in combination with neighborhood census data, to determine how many pothole requests were made in each neighborhood.

We will also determine the number of miles of road in each neighborhood, using OSM data, so we can normalize the number of potholes by the amount of road in each neighborhood (e.g., potholes reported per mile of road).

We will pay attention to how many requests were closed vs. how many remain open, and how quickly requests were closed, since this can tell us about how attentive the city is to pothole requests in different neighborhoods.

Setup

# load libraries
library(sf)
library(tidyverse)
library(osmdata)
library(DT)
library(scales)
library(mapview)
library(lubridate)

# set default ggplot theme
theme_set(theme_minimal())

Neighborhoods 🏘

First, we need to load up neighborhood data to associate with potholes and roads.

Data Source: “Boston Neighborhood Boundaries Approximated by 2020 Census Block Groups,” was downloaded from Analyze Boston

# load neighborhood data
neighborhood <- st_read(
  "data/Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Block_Groups/Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Block_Groups.shp"
)

Reading layer `Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Block_Groups' from data source `/Users/dustinmichels/MIT_class/mit-data-science/labs/lab3/data/Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Block_Groups/Boston_Neighborhood_Boundaries_Approximated_by_2020_Census_Block_Groups.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 24 features and 37 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 739715.8 ymin: 2908294 xmax: 812981.4 ymax: 2970217
Projected CRS: NAD83 / Massachusetts Mainland (ftUS)

# select cols of interest
neighborhood <- neighborhood |>
  select(nbh_name = blockgr202, population = tot_pop_al)

# remove harbor islands, since there is no pothole data
neighborhood <- neighborhood |> filter(nbh_name != "Harbor Islands")

# preview table
head(neighborhood)

Roads 🚦

Next, we will query OpenStreetMap, using the Overpass API, to get road data for Boston.

Based on OSM documentation, the seven values provided here (motorway, trunk, primary, secondary, tertiary, unclassified, and residential) should represent most types of roads.

# set a bounding box for Boston
bbox = c(-71.188, 42.238, -70.924, 42.393)

# query the api for roads
roads <- opq(bbox) |>
  add_osm_feature(
    key = "highway",
    value = c(
      "motorway",
      "trunk",
      "primary",
      "secondary",
      "tertiary",
      "unclassified",
      "residential"
    )
  ) |>
  osmdata_sf() |>
  pluck("osm_lines") |>
  select(name)|>
  st_transform(2249)

rm(bbox)

Count road miles per neighborhood

# split roads at neighborhood boundaries
road_data <-
  st_intersection(roads, neighborhood)

# calculate length of each road segment
road_data <- road_data |>
  mutate(length = as.numeric(st_length(geometry)))

# summarize total road length in each neighborhood
road_data <- road_data |>
  group_by(nbh_name) |>
  summarise(road_length_ft = sum(length)) |>
  st_drop_geometry()

# convert to miles
road_data <- road_data |>
  mutate(road_length_miles = road_length_ft / 5280) |>
  select(-road_length_ft)

# free up space
#rm(roads)

road_data |> arrange(desc(road_length_miles))

Table 1: Road miles per neighborhood

Potholes 🚧

Finally, we can move on to potholes.

Data Source: 311 data for 2024 was downloaded from Analyze Boston.

Prepare pothole data

# read data
service <- read_csv("data/311_Service_Requests_2024.csv")

# filter down to cases involving potholes
pat = regex("pothole", ignore_case = TRUE)
pothole <- service |>
  filter(str_detect(type, pat) |
           str_detect(reason, pat) |
           str_detect(case_title, pat))

# select cols of interest
pothole <- pothole |>
  select(case_enquiry_id,
         open_dt,
         closed_dt,
         on_time,
         case_status,
         latitude,
         longitude)

# remove any rows with missing coordinates, and print how many were removed
before = nrow(pothole)
pothole <- pothole |>
  filter(!is.na(latitude) & !is.na(longitude))
sprintf("Removed %s row(s) with missing coordinates",
        before - nrow(pothole))

[1] "Removed 1 row(s) with missing coordinates"

# convert to sf object
pothole <- pothole |>
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) |>
  st_transform(2249)

# cleanup vars
rm(pat, service, before)

# preview data
head(pothole)

Count potholes per neighborhood

We now have a dataset of pothole requests. We want to count the number of potholes per neighborhood.

# associate neighborhood data with points
pothole_data <- 
  st_join(pothole, neighborhood)

# count number of potholes in each neighborhood
pothole_count <- 
  pothole_data |> 
  filter(!is.na(nbh_name)) |> 
  count(nbh_name, name="pothole_count") |> 
  st_drop_geometry()

pothole_count |> arrange(desc(pothole_count))

Table 2: Potholes reported by neighborhood

Let’s also count how many open vs. closed requests there are per neighborhood, and visualize this data.

pothole_data |>
  filter(!is.na(nbh_name)) |>
  count(nbh_name, case_status) |>
  st_drop_geometry() |>
  ggplot(aes(
    x = reorder(nbh_name, -n),
    y = n,
    fill = case_status
  )) +
  geom_col() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Neighborhood", y = "Number of Potholes", fill = "Case status", ) +
  coord_flip()

Figure 1: Potholes reported by neighborhood, considering open vs. closed status

Based on raw counts, it is clear Dorchester has the most reported potholes, and also closes the majority of them. What does this tell us? It may just be because Dorchester has the most roads. This is why we need to normalize by miles of road in the next step, to find any discrepancies in road management by neighborhood.

Normalize pothole counts

We already have the pothole counts and the miles of road for each neighborhood. Now we just need to combine.

# merge tables together
result <-
  neighborhood |>
  left_join(pothole_count, by = "nbh_name") |>
  left_join(road_data, by = "nbh_name")

# compute density
result <- result |>
  mutate(pothole_per_mile = pothole_count / road_length_miles) |>
  select(nbh_name, pothole_per_mile)

Here is the data in table form.

result |>
  st_drop_geometry() |> arrange(desc(pothole_per_mile))

Table 3: Potholes per mile of road, by neighborhood

And here it is as a map.

result |>
  ggplot() +
  geom_sf(aes(fill = pothole_per_mile)) +
  ggrepel::geom_label_repel(
    aes(label = nbh_name, geometry = geometry),
    force = 10,
    alpha = 0.8,
    stat = "sf_coordinates",
    size = 2.5,
    segment.color = "white"
  ) +
  labs(fill = "Potholes per mile of road") +
  scale_fill_viridis_c() +
  theme_void()

Figure 2: Potholes per mile of road, by neighborhood

It is clear the greatest number of potholes per mile of road are found in Back Bay, followed by Downtown, South End, and Beacon Hill. These are all relatively small, high traffic neighborhoods, which may explain the high number of potholes per mile of road.

Time to close requests

Another question we can ask is how long it takes to repair a pothole. This can tell us about the efficiency of the city in addressing potholes in different neighborhoods.

Many requests are closed in a matter of seconds, probably by an automated system. Requests that are closed in under one hour are likely being closed because they are duplicates or otherwise invalid. Since we are interested in how quickly potholes are being repaired, we will filter out requests that are closed in less than one hour.

# add column for duration in hours
# filter out instances where issue is closed in less than 1 minute
# count mean hours to close per neighborhood
pothole_hours <-
  pothole_data |>
  mutate(hours = as.double(closed_dt - open_dt, "hours")) |>
  filter(hours > 1) |>
  filter(!is.na(nbh_name)) |>
  group_by(nbh_name) |>
  summarise(mean_hours = mean(hours, na.rm = TRUE), ) |>
  st_drop_geometry()

pothole_hours |> arrange(desc(mean_hours))

Table 4: Mean hours to close pothole requests, by neighborhood

Here it is as a map.

# merge tables together
neighborhood |>
  left_join(pothole_hours, by = "nbh_name") |>
  ggplot() +
  geom_sf(aes(fill = mean_hours)) +
  ggrepel::geom_label_repel(
    aes(label = nbh_name, geometry = geometry),
    force = 10,
    alpha = 0.8,
    stat = "sf_coordinates",
    size = 2.5,
    segment.color = "white"
  ) +
  labs(fill = "Mean hours to close request") +
  scale_fill_viridis_c() +
  theme_void()

Figure 3: Mean hours to close pothole requests, by neighborhood

Where the number of potholes per road mile seemed highly concentrated in the city center, the time to close requests is more evenly distributed across neighborhoods. Longwood is the slowest to close requests, with a mean response time of 164 hours, while Fenway and Mission Hill are the fastest, with mean response times of 53 and 52 hours, respectively.

Summary

Potholes count

Based on raw counts, Dorchester has the most reported potholes (Figure 1). However, it also has the greatest number of road miles (Table 1).
When we normalize by miles of road, we find the greatest number of potholes per mile of road are found in Back Bay, followed by Downtown, South End, and Beacon Hill (Table 3).

Time to close requests

The time to close requests ranges from 52 to 164 hours, with Longwood being the slowest and Fenway and Mission Hill being the fastest (Table 4).
There is a bit of a cluster of slow response times in the center of the city, but the distribution is more even than the potholes per mile of road (Figure 3). Fenway is a bit of an outlier, since it is near the city center but has fast response times.

Next steps

With more time, it would be interesting to assess how response time relates to the demographics of each neighborhood (e.g., income, percent white).

It would also be interesting to try to determine if there is a systematic bias in reporting. For example, maybe more potholes are reported in wealthy neighborhoods because residents are more likely to report issues, not because there actually are more potholes.

Finally, it would be interesting to look at the distribution of potholes and response times over time. For example, are there more potholes in the winter? Are response times longer in the summer when there are more requests?