Avery Richards, UC Berkeley School of Public Health
Background:
The following explores a coding environment and methodology for documenting the surveillance of open defecation on the streets of San Francisco, California. This work is supported by the San Francisco Department of Public Works, which provides a robust dataset of city records through San Francisco’s Open Data Portal.
Open defecation is a known risk to human health worldwide. It is estimated that in the urban United States, from 2017 to 2019, "up to 460,000 individuals lacked access to basic water and sanitation services" (Capone et al., 2020). Recent global COVID-19 outbreaks have prompted research identifying human feces as a potential vector of SARS-CoV-2 transmission (Xiao et al., 2020), further emphasizing the importance of public health surveillance and of effective sanitary interventions to prevent open defecation in urban areas of the United States.
Information communication technologies (ICT) are being implemented by cities in the United States to organize and improve the quality of interaction between residents and government service providers. Recent innovations in ICT include 311 systems: centralized open data platforms where residents can request services, file complaints, and obtain information about where they live. City records, like San Francisco’s 311, “have been used to predict the emergence of unsafe or unhealthy conditions.” (Kontokosta, Hong, & Korsberg, 2017). City government can disclose 311 reporting through public facing Open Data Portals. Levels of informational transparency vary between cities, municipalities, and systems. San Francisco’s open data portal includes geospatial coordinates that supplement much of its public facing 311 data, making a large volume of aggregate information available as public record.
In this article we walk through a workflow spanning API sourcing, preprocessing, mapping, and controls to adjust for information bias. We then model a point pattern density analysis. To do this, we incorporate RSocrata, dplyr, and base-R functions. Along with data cleaning to subset and remove explicit non-incident reports, we use the sf and tmap packages to visually explore our geospatial coordinates. The spatstat, raster, and leaflet packages are used to model the density functions of our adjusted output as a rasterized layer over an interactive street map.
Data Acquisition:
A first step requires that we import our dataset from the SF Open Data Portal with the RSocrata package. Note that importing a raw dataset from the API endpoint is memory-intensive and may require patience while your computer crunches through all the data at once.
# Import and load library to access API.
library(RSocrata)
# Reads in complete dataset to local environment, run time >= 30 minutes on most machines.
sanfran_calls <- read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json",
stringsAsFactors = FALSE)
# Saving an imported dataset is strongly recommended.
write.csv(sanfran_calls, "sanfran_311_calls.csv", row.names = FALSE)
Our output is a data.frame class object with around \(4 * 10^6\) rows. Saving and working from an initial import on your local computer is strongly recommended! Because the full API output is large, it can overwhelm memory and computational limits. Next we use a few base-R functions to help us identify our targeted parameters when searching for reports of open defecation.
311 Reports:
Structural organization varies between cities and systems. While accessing 311 API endpoints with RSocrata is fairly boilerplate, the following code fits uniquely to San Francisco’s 311, which may itself change over time. The principal methods outlined in this section are largely base functions of the R programming language, and may be applied to other 311 system API outputs accordingly.
## [1] "X" "service_request_id"
## [3] "requested_datetime" "closed_date"
## [5] "updated_datetime" "status_description"
## [7] "status_notes" "agency_responsible"
## [9] "service_name" "service_subtype"
## [11] "service_details" "address"
## [13] "street" "supervisor_district"
## [15] "neighborhoods_sffind_boundaries" "police_district"
## [17] "lat" "long"
## [19] "point.latitude" "point.longitude"
## [21] "point.human_address" "source"
## [23] "url" "date_only"
# Generate a vector of unique service subtypes.
service_types <- unique(sanfran_calls$service_subtype)
# Preview list of service subtypes.
head(service_types, 10)
## [1] "Bulky Items"
## [2] "Hazardous Materials"
## [3] "Graffiti on City_receptacle"
## [4] "Graffiti on Other_enter_additional_details_below"
## [5] "Graffiti on Building_other"
## [6] "Trees - Damaged_Tree"
## [7] "City_garbage_can_overflowing"
## [8] "General Cleaning"
## [9] "Blocking_Driveway_Cite_Tow"
## [10] "request_for_service"
In the sanfran_calls object, we can locate parameters that outline our area of inquiry. The service_subtype column is a good first step for locating “Waste” related reports. To narrow our search for evidence of open defecation reports, we apply the grep() function to character strings in the service_subtype column, looking for rows in the sanfran_calls$service_subtype vector that contain categorical evidence of Waste related reports.
Subsetting:
# The grep() function allows us to condition on "Waste" calls.
waste_calls <- sanfran_calls[grep("Waste", sanfran_calls$service_subtype), ]
# Remove "Medical Waste" reports and reduce the volume of reports.
not_medical_waste <- waste_calls[waste_calls$service_subtype != "Medical Waste", ]
# Evaluate the size of the new data.frame object.
nrow(not_medical_waste)
## [1] 174740
Once we generate a new data.frame of "Waste" related service_subtype reports, we remove any reports categorized as "Medical Waste": syringes or other medically related hazards unrelated to open defecation. This is the first step towards narrowing down a curated set of relevant reports. We have reached an area of initial inquiry with the not_medical_waste data object. Once we have imported and subset a portion of this dataset into reports of interest, we explore the data for explicit errors signalling non-incidence of open defecation reports: rows present in our new data.frame that can be ruled out of our analysis.
In the not_medical_waste data object, there is a wide variety of narrative and systematic information, most of which lives in the status_notes and agency_responsible columns. These are the two columns we use to partition our data between evidence of positive incidence and false positive incidence of open defecation, i.e., reports that are not open defecation events. Along with exploration of the API output and consultations with City of San Francisco 311 administrative staff, we determined that these two columns provide the most reliable source (outside of photographic spot checking and ground truth sampling) for validating a 311 report as an open defecation event.
It is worth noting that close to half of San Francisco’s 311 “Human and Animal Waste” reports contain URL addresses linking to photos taken by users. These photographs may confound automated visual analysis, but are worth exploring with random sampling and statistical estimation to assign a calculated uncertainty to the overall population of reports.
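As an illustration of that idea, the sketch below draws a random sample of reports that include a photo URL and estimates the proportion of confirmed events with a confidence interval. This is not part of the original workflow: it assumes the url column is a character vector of photo links, and the sample size and confirmed count are hypothetical placeholders standing in for a manual photo review.
# Hypothetical validation sketch: sample reports with photo URLs for manual review.
set.seed(2020)
has_photo <- not_medical_waste[!is.na(not_medical_waste$url) & not_medical_waste$url != "", ]
photo_sample <- has_photo[sample(nrow(has_photo), 100), c("service_request_id", "url")]
# Suppose a manual review of the 100 sampled photos confirms 62 open defecation events (placeholder value).
confirmed <- 62
# Estimate the confirmed proportion with a 95% confidence interval.
prop.test(confirmed, n = 100)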
Narrative text data in the status_notes column is a valuable tool for an analysis of open defecation reports, as the data represents narrative testimony from staff present at the reported event. From exploration of the status_notes column, we observe that much of the reporting narrative represents external events of non-incidence. To adjust our data to account for these narrative clues, we filter reports out of the not_medical_waste data object based upon unique narrative strings visible in the status_notes and agency_responsible columns.
# Two columns with narrative information relevant to our analysis.
waste_select_col <- not_medical_waste[, c("status_notes", "agency_responsible")]
# Observing the variety of unique narrative input in "status_notes" column.
unique_nar_note <- as.data.frame(unique(not_medical_waste$status_notes))
nrow(unique_nar_note)
## [1] 11838
# Observe variety of systemic categories in "agency_responsible" column
unique_ag_resp <- as.data.frame(unique(not_medical_waste$agency_responsible))
nrow(unique_ag_resp)
## [1] 67
Due to the non-uniformity of narrative between various staff and agencies over time, the process of ruling out false positive reports from assorted status_notes must be rigorous, but may never be exhaustive without ongoing maintenance. See the example code below as a sample to illustrate the process of ruling out reports based on the narrative data in status_notes and/or categorical data in agency_responsible. Please visit the supplementary code repository for a complete list of narrative features the mapped data is eventually adjusted for.
# Use grep() functions to remove reports based on agency_responsible.
rem_agency_1 <- not_medical_waste[grep("Animal Care and Control - G",
not_medical_waste$agency_responsible, invert = TRUE), ]
rem_agency_2 <- rem_agency_1[grep('AT and T - Graffiti Queue',
rem_agency_1$agency_responsible, invert = TRUE), ]
The agency_responsible categories unrelated to open defecation reports are ruled out with the grep() calls above, and the same process is applied to status_notes narrative information.
# Use grep() functions to remove reports based on string data in status_notes.
sanfran_calls_no_na_1 <- sanfran_calls_no_na[grep("Duplicate",
sanfran_calls_no_na$status_notes, invert = TRUE), ]
sanfran_calls_no_na_2 <- sanfran_calls_no_na_1[grep("no feces",
sanfran_calls_no_na_1$status_notes, invert = TRUE), ]
# (Intermediate filtering steps omitted; see the supplementary code repository.)
sanfran_clean_calls <- sanfran_calls_no_na_102[grep("nothing",
sanfran_calls_no_na_102$status_notes, invert = TRUE), ]
Applying a rule-based exclusion of “Human or Animal Waste” reports directly from staff narrative reporting is an intensive process. In this workflow, over 100 unique narrative cues are identified in the status_notes column. After ruling out non-incident events, we subset the “Human or Animal Waste” reports to map.
We remove "NA" values from the point.longitude and point.latitude coordinate columns and subset specific time intervals to evaluate specific prevalence. Mapping coordinates is not possible without "NA" values being removed.
# Map functions will break if "NA" exists.
clean_coords <- subset(sanfran_clean_calls, point.longitude != "NA"
& point.latitude != "NA")
# Select a date range to plot: January 2020.
subset_raw_calls <- subset(clean_coords,
                           as.Date(requested_datetime) >= "2020-01-01" &
                           as.Date(requested_datetime) < "2020-02-01")
Mapping:
We are now ready to begin evaluating geospatial outputs. We create our initial maps with the sf (simple features) and tmap packages, transforming our data.frame class object using its latitude and longitude columns. Here we map an interactive template to observe point prevalence.
## [1] "data.frame"
# Transform dataset into an sf object.
map_raw_calls <- st_as_sf(subset_raw_calls, coords
= c('long','lat'), crs = 4326)
# Raw calls is now an sf class object
class(map_raw_calls)## [1] "sf" "data.frame"
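The interactive map below is not reproduced by code in this excerpt; a minimal sketch, mirroring the qtm() call used later for the distinct points (the object name m1 is assumed), would be:
# Render the raw January 2020 points as an interactive map (sketch).
tmap_mode("view")
m1 <- qtm(map_raw_calls, dots.col = "red", symbols.size = .002)
m1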
Human & Animal Waste Reports, Raw Data: January 2020
Take a moment to explore this map. You may zoom in and click on individual points to produce an attribute table associated with each report. If you feel especially brave, locate a URL link at the bottom of the attribute table and have a look at the event being reported.
Information bias in crowdsourced data:
Now that we have imported, subset, and adjusted for explicit non-incidence reports through the agency_responsible and status_notes columns, there are other steps we can take to constrain the volume of 311 reports. Listed below are two separate approaches that seek to reduce excess reporting from crowdsourced data. Both methods are context dependent on the biology and social perception of the incidence event in question, and they are by no means exhaustive. Experience with crowdsourced data evaluation may contribute to more advanced methods for reducing noise and misclassification in aggregated 311 outputs. These approaches offer a starting point towards thinking about how to apply constraints to aggregate data models generated from public observations of health related events.
The distinct() function:
A function in the dplyr library, distinct() allows us to adjust for situations where two or more reports refer to any single event. The difference is hardly observable in the case of these Human or Animal Waste data, however: out of a few thousand reports generated in January 2020, we observe close to 150 reports excluded.
# Load the dplyr package.
library(dplyr)
# Remove all rows that are not distinct with regard to the address, longitude, and latitude columns.
map_dist_calls = map_raw_calls %>%
distinct(address, .keep_all = TRUE) %>%
distinct(point.latitude, .keep_all = TRUE) %>%
distinct(point.longitude, .keep_all = TRUE)
# Map after adjusting for distinct parameters.
m2 <- qtm(map_dist_calls, dots.col = "red", symbols.size = .002)
m2Human & Animal Waste Reports, Distinct Points: January 2020The
distinct()approach works better when applied to the prevalence of events with longer biological duration than waste on the street.distinct()is also useful for smaller time intervals of analysis, social situations where a single event may become weighted with reports from multiple users, or a single user filing excessive reports.
# The row length of each data.frame object reflects the frequency of reports.
raw_sum <- nrow(map_raw_calls)
dist_sum <- nrow(map_dist_calls)
# A difference of about 150 reports in a month.
raw_sum - dist_sum
## [1] 154
The duplicated() function:
Another approach to adjusting for implicit bias uses the duplicated() function. The output of a layered sequence of duplicated() calls creates clustered coordinates, representing specific areas of reported prevalence. Where the distinct() function flattens each coordinate to equal weight in a map, duplicated() takes an alternate approach by stacking points that have approximate spatial equivalency.
# Adjusting for duplicated values only in the latitude column.
dupl_lat <- duplicated(subset_raw_calls$lat)
# Creating new object with only those duplicated latitude values.
lat_select <- subset_raw_calls[dupl_lat, ]
# Adjusting for duplicated values in the longitude column, from the lat_select object.
dupl_long <- duplicated(lat_select$long)
# Creating a new object with only those duplicated latitude and longitude values.
latlong_dupl <- lat_select[dupl_long, ]
# Transforming the duplicated point into an sf object.
map_dupl_calls <- st_as_sf(latlong_dupl, coords = c('long','lat'), crs = 4326)
The duplicated() method applies a sharp constraint to the volume of reports; the trade-off is stronger specificity of patterns in unique areas at the cost of sensitivity to the prevalence of events across larger areas. In the case of waste and open defecation on the streets, there are beneficial effects to employing a duplicated() approach to crowdsourced data. The duplicated() method supports interventions that focus on areas impacted by an intensity of reports.
# Unadjusted points in the January 2020 interval.
raw_sum <- nrow(map_raw_calls)
# Duplicated points in the January 2020 interval.
dupl_sum <- nrow(map_dupl_calls)
# A difference of reports for the January 2020 interval.
raw_sum - dupl_sum
## [1] 2966
Duplicated Human & Animal Waste Reports: 3 month interval, Jan-Mar 2020
The duplicated() method yields a small fraction of points when compared to the distinct() method. In this case, these two approaches to adjusting for information bias are mutually exclusive, and they outline fundamental challenges to data quality when creating models from public facing crowdsourced data. In the context of open defecation, distinct() may underfit, accounting only for bias from excess reporting of an event. On the other hand, duplicated() may overfit, identifying only locations where spatial patterns of reporting occur over time, or indeed where user participation in reporting occurs with frequency. In the case of waste on the street, distinct() is better at capturing incidence in a specific area during short periods, and duplicated() at capturing larger patterns of prevalence.
Where distinct() generates a better cross-sectional analysis of the reports, duplicated() works to account for prevalence in impacted areas, scaling efficiently over time. Considering the biological mechanism, social nature and behavior around waste and open defecation, along with the volume of data generated by 311 reports, creating a global model of waste reports with the duplicated() approach is preferable.
As the coordinate count increases, clusters of spatial prevalence emerge over time. Evaluation of the following maps reveals how an increased duration of analysis creates a spatially hierarchical sorting of volume with the duplicated() coordinates. We have now imported, subset, adjusted, and begun to map Human and Animal Waste reports. In the next section, we model a Point Pattern Analysis visualization of the data we have mapped.
Point Pattern Analysis:
There is more than one component we need to see our model visualization process through. First we create a "ppp" class of coordinates from our sf object, and then we create a Window from a shapefile of the City and County of San Francisco. With that shapefile we are able to constrain our coordinates to a specified area in which to compute density. To begin, we subset() a year of duplicated() reports, reflecting the prevalence of waste reports in the year 2017.
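The map_these_2017 object used below is built outside of this excerpt; a minimal sketch of one way to derive it, assuming the same duplicated() filtering applied to calendar year 2017 (the object name and exact steps are assumptions), would be:
# Subset the cleaned coordinates to calendar year 2017 (sketch).
calls_2017 <- subset(clean_coords,
                     as.Date(requested_datetime) >= "2017-01-01" &
                     as.Date(requested_datetime) < "2018-01-01")
# Apply the duplicated() steps to retain clustered coordinates.
lat_2017 <- calls_2017[duplicated(calls_2017$lat), ]
map_these_2017 <- lat_2017[duplicated(lat_2017$long), ]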
# Detach all metadata from the object, leaving only the spatial data in 'long' and 'lat'.
zone_raw_coords_2017 <- subset(map_these_2017,
select = c('long', 'lat'))
# Setting Coordinate Reference System with Simple Features.
zone_points_2017 = st_as_sf(zone_raw_coords_2017, coords = c('long','lat'), crs = 26910)
# Load the libraries needed to traverse package-specific spatial formats.
library(maptools)
library(spatstat)
library(raster)
# Restructure the simple features object as a 'ppp' class point pattern.
zone_ppp_2017 <- as(as_Spatial(zone_points_2017), "ppp")
# Bring the downloaded shapefile into the R environment.
nhoods <- readShapeSpatial('sf_nhoods.shp')
This workflow restructures several package-specific data formats, traversing dependencies between the sf, maptools, raster, and spatstat libraries. Downloading a shapefile of San Francisco is required. The shapefile will become the spatial window within which we compute our point density.
# Connect all San Francisco neighborhood shapes into a solid spatial object.
nhoods_asone <- unionSpatialPolygons(nhoods, nhoods@data[["name"]])
# Restructure object as a SpatialPolygon type.
zone_polygon <- as(nhoods_asone, "SpatialPolygons")
# Restructure again as an "owin" type object.
zone_owin_2017 <- as(zone_polygon, "owin")
# Set the window of the point density model with the owin object.
spatstat::Window(zone_ppp_2017) <- zone_owin_2017
We now have a point pattern that plots visually similar to the interactive qtm maps seen earlier. In the next lines we divide the map into a fairly dense array of partitions (quadrats), each reflecting an approximation of around two city blocks. We count the points inside these partitions and assign values to the quadrats based upon the density count.
# Partition map into quadrats.
Q17 <- spatstat::quadratcount(zone_ppp_2017, nx= 50, ny=50)
# Parameters passed to the 'nx' and 'ny' arguments specify the resolution: how small the areas being computed will be.
# Evaluate the above step with a plotting function.
plot(zone_ppp_2017, pch=20, cols="grey70", main=NULL)
plot(Q17, add=TRUE)
# Compute the intensity of the point pattern model visualized above.
Q17.d <- spatstat::intensity(Q17)
Here we get a rough sense of how the points are distributed throughout our Window object. We transform this model into a raster image format, where density values are assigned colors from a designated palette. We can then layer our raster image above an interactive leaflet street map.
# Generate a raster image from our point process model.
zone_rast_2017 <- raster::raster(intensity(Q17, image=TRUE))
# Apply a proj4 string as the coordinate reference system to our raster object.
crs(zone_rast_2017) <- sp::CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# Load the leaflet library for color palettes and interactive mapping.
library(leaflet)
# Set a palette for how we view our raster object.
pal <- colorNumeric(c("blue", "orange", "red"), values(zone_rast_2017),
                    na.color = "#00000000")
# Function to view the interactive map, Human and Animal Waste Reports: prevalence, 2017.
citywide_prev_2017 =
leaflet() %>% addTiles() %>%
addProviderTiles(providers$CartoDB.Voyager) %>%
addRasterImage(zone_rast_2017, colors = pal, opacity = 0.5) %>%
addResetMapButton()
citywide_prev_2017
PPA of duplicated Human and Animal Waste Reports: San Francisco, 2017
Density models like these are not limited to Human and Animal Waste reports; with careful interpretation of input, they can be used to visually ascertain specific prevalence from a wide variety of events surveilled through crowdsourced data systems.
The output of our model allows us to view the magnitude of prevalence in defined areas, while the transparency of our raster() image plots those areas onto an accurate and recognizable street map. Aggregation of prevalence between intervals of time allows us to observe variance of intensity between samples, alerting public health intervention strategy to areas of specific interest over time within city limits.
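One way to make that between-interval comparison concrete is to difference two intensity rasters built on the same window and quadrat grid; the sketch below assumes a hypothetical zone_rast_2018 constructed with the same steps as zone_rast_2017.
# Difference two intensity surfaces built on the same quadrat grid (zone_rast_2018 is hypothetical).
prev_change <- zone_rast_2018 - zone_rast_2017
# Diverging palette: blue where intensity fell, red where it rose.
pal_diff <- colorNumeric(c("blue", "white", "red"), values(prev_change),
                         na.color = "#00000000")
leaflet() %>% addTiles() %>%
  addRasterImage(prev_change, colors = pal_diff, opacity = 0.5)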
Limitations:
As a helping professional and a student of public health epidemiology and data science, I cannot over-emphasize how crowdsourced data can sing a siren’s song to those of us who get excited around large and spatially intensive datasets. Data quantity has never translated into quality, especially when evaluating crowdsourced data like these. In the case explored above, we are evaluating what are essentially complaints by active users of a public system, ascertained through complex pathways riven with potential systematic bias and errors. We are never really looking at conventional observations with crowdsourced data, but at observations of observations. We must apply scrutiny and skepticism to our models, and seek to reflect precise, rather than socially perceived, realities.
Participation bias can be seen as a sort of non-differential bias: variability in user reporting contributes to the system’s output, distorting the weight of reports with regard to an estimate of events in an area. Before evaluating the output of crowdsourced data over longer intervals, we must consider trends of reporting that contribute to changes in the frequency of reports over time.
As observed here, frequency of reports over longer intervals can be distorted by trends in total 311 user participation. Rather than a true relationship of event growth, we see observations of observations inflating as the base of 311 system participation increases.
A GIF of Human and Animal Waste Reports: 4 year window
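A quick way to inspect that trend, assuming the objects built earlier in this workflow, is to tabulate report counts by year for the full 311 feed alongside the waste subset; a minimal sketch:
# Tabulate yearly report counts to inspect participation trends (sketch).
all_by_year   <- table(format(as.Date(sanfran_calls$requested_datetime), "%Y"))
waste_by_year <- table(format(as.Date(not_medical_waste$requested_datetime), "%Y"))
# Compare growth of the waste subset against growth of total 311 reporting.
round(waste_by_year / all_by_year[names(waste_by_year)], 4)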
Overreporting is an obvious challenge, but underreporting poses the greatest threat to the ongoing validity of crowdsourced models. As researchers utilizing public facing data, we cannot determine the identity or relevant characteristics of those who contribute to these city records. There are starkly disparate social conditions existing in urban areas of the United States, including San Francisco. Not everyone alive in America’s cities considers interaction or reporting through government channels a valid, trustworthy, or socially acceptable behavior. Covariate data sources may shed light on the social characteristics of areas related to reporting, but without an equally distributed participant base, regional bias will be carried into the density output of reports.
It is possible for a system administrator to access user ID metrics and discern the population of unique users contributing to the aggregate. This sort of query may allow for inference around frequencies of the user population over time, but is reserved for limited case use with San Francisco’s 311. As researchers external to the system’s maintenance, we are left to work with anonymized reports through the public facing API, as the system is designed. Without knowledge of the demography, regional characteristics, and variation of the user base over time, ground truthing and predictive estimation methods may be required to improve ascertainment of events within communities socially disconnected from ICT reporting systems like 311.
In this case study, systematic error can also be extended to the category in question. As we have seen, Human or Animal Waste can be interpreted as a variety of events, from dead rodents to neglected garbage bags. We rule out a great deal of misclassified reports in our pre-processing and evaluation of the agency_responsible and status_notes columns, as photo evidence visible from View(not_medical_waste$url) can testify. There is constraint on what is represented within the duplicated() reporting framework. Our results are by no means perfect, and differential misclassification between fecal contaminant events and non-events is an ongoing threat to the internal validity of a study design like this.
At the end of the day, crowdsourced 311 data is just plain noisy. Along with underreporting and regional bias, finding ways to identify and adjust for misclassification, minimizing implicit bias, and reducing systematic errors are the main challenges we face when creating geospatial models from data like these.
Discussion / Future Directions:
Modeling data from crowdsourced outputs has limitations, some of which we have been able to adjust for here; others require further observation, exploration, and experimentation. A next step to leveraging crowdsourced geospatial models for effective public health intervention is the ascertainment of counterfactual zones. This may be accomplished by using covariate data, geocoded to illustrate a dimensionally larger social landscape between and within impacted areas. Matching areas within city limits can promote a sound evaluation of local, regional, and global interventions within a city.
Evaluating unique intervention strategies with Interrupted Time Series analysis may help us optimize resources and human energy committed to public and community sanitation efforts. Lastly, ground truth measurement of fecal incidence, along with analysis of the chemical composition of fecal matter, may allow researchers to estimate under- and over-reporting in sensitive areas, as well as evaluate the risk of multiple disease outcomes from exposure to fecal matter on the streets of urban areas around the world.
Acknowledgements
This case study would not have been completed without the alliance and academic mentorship received from Douglas Martin and Dr. Jay Graham, the attention and partnership of the San Francisco Department of Public Works, not to mention the ongoing support and spiritual guidance from all my friends at the Dlab.
Thank you for reading!
Full code repository: https://github.com/Averysaurus/SF_ODM/blob/master/Mapping_OD_Prevelence_sup.R
Shapefile of San Francisco: https://data.sfgov.org/Geographic-Locations-and-Boundaries/SF-Find-Neighborhoods/pty2-tcw4