Mapping Recent Boil Water Advisories

MUSA 500, Homework #6

Author

Minwook Kang, Nissim Lebovits, and Ann Zhang

Published

December 4, 2022

Background

Boil Water Advisories

Boil water advisories (BWAs), also called boil water notices or warnings, are public health directives issued by governmental or other health authorities to alert consumers when a community’s drinking water is or could be contaminated by pathogens.1 Regulations for when and how boil water advisories are issued vary across the United States. Although the Centers for Disease Control and Prevention (CDC) has developed guidance for BWAs, research indicates that these advisories are communicated inconsistently. In particular, local news media do not consistently incorporate CDC guidelines.2

Current data collection on BWAs seems to rely on specific case studies or analyses of news articles.3 At least some state health departments (e.g., the New York State Department of Health) collect data on start dates and end dates for BWAs in specific geographies.4 However, there does not seem to be a consistent repository for data on BWAs nationwide. It would therefore be helpful to identify data that can supplement analyses of the incidence and impact of boil water advisories across the United States and potentially support predictive modeling for risk assessment and mitigation.

Twitter Data

Analyzing social media data is an emerging field of research with broad applications. Twitter is one of the most popular sources of social media data, as tweets can be downloaded via Twitter’s API for analysis. Tweets contain a range of valuable information for analysis, including text, dates, users, and sometimes geolocation. To analyze boil water advisories, Twitter data could potentially help to:

  • catalogue BWAs across the country
  • analyze how local media are reporting on BWAs
  • analyze the public’s reception of and compliance with BWAs

In the steps below, we experiment with using Twitter data to explore BWAs across the United States. We use the Twitter API to import a set of tweets containing terms related to BWAs. We then explore those tweets in various ways, plotting them across time and space and performing basic text analysis to understand any patterns in their content. Finally, we comment on the potential and limitations of this approach, as well as possible future steps.

Data Importing and Processing

In this project, we import our data via the Twitter API using the twitteR package. Documentation and an appropriate workflow are available here. This approach has significant limitations, however: standard API access is rate-limited (on the order of 18,000 tweets per 15-minute window), and the standard search endpoint only returns tweets from roughly the past week. For convenience’s sake, we’ve downloaded roughly a week’s worth of data (tweets containing the terms “boil water advisory”, “boil water notice”, or “boil water order”) and then written it to a .csv file to avoid having to repeatedly re-import data. Our analysis below is carried out with these data, comprising about 2,100 tweets posted between November 29th, 2022 and December 2nd, 2022.
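As a rough illustration of the filter-then-cache step described above, the following Python sketch (not the R code behind this report) keeps only tweets that contain one of the three search phrases and round-trips them through CSV. The field names `created_at` and `text`, the sample tweets, and the in-memory buffer standing in for the .csv file are all assumptions made for the example.

```python
import csv
import io

# The three search phrases named in the text.
SEARCH_PHRASES = ("boil water advisory", "boil water notice", "boil water order")


def matches_search(text):
    """Return True if the tweet text contains any search phrase (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SEARCH_PHRASES)


def write_tweets(tweets, handle):
    """Write tweet records to an open file-like object as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["created_at", "text"])
    writer.writeheader()
    writer.writerows(tweets)


def read_tweets(handle):
    """Read tweet records back from CSV as a list of dicts."""
    return list(csv.DictReader(handle))


# Hypothetical tweets, invented for illustration only.
downloaded = [
    {"created_at": "2022-11-29", "text": "Boil water notice issued for the city"},
    {"created_at": "2022-11-30", "text": "How to boil an egg"},
    {"created_at": "2022-12-01", "text": "The BOIL WATER ADVISORY has been lifted"},
]

kept = [t for t in downloaded if matches_search(t["text"])]

# An in-memory buffer stands in for the .csv file on disk.
buffer = io.StringIO()
write_tweets(kept, buffer)
buffer.seek(0)
cached = read_tweets(buffer)
print(len(cached), "tweets cached")
```

Caching the filtered records once means later analysis steps can reload them without spending further API calls.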

Data Import and Cleaning

For the purpose of our analysis, data downloaded from Twitter presents a few challenges, mostly related to text cleaning. Given the limited scope of this assignment, we’ve refrained from extensive manipulation of these strings; for a more in-depth analysis, it would be best to create a custom dictionary for use in removing stopwords from the strings. For our purposes, though, we’ve used the tidytext and textdata packages to remove common stopwords. We also filtered out terms related to boil water advisories (e.g., “boil”, “water”, “notice”, “advisory”, “lifted”), days of the week, and strings commonly appearing on Twitter such as “https” and “t.co”.
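The cleaning described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the tidytext workflow used for the report: the general stopword set here is a tiny hypothetical sample rather than a real stopword list, and URLs are stripped with a regular expression instead of filtering the “https” and “t.co” tokens after tokenizing.

```python
import re

# Tiny stand-in for a general-purpose stopword list (real lists are far longer).
COMMON_STOPWORDS = {"a", "an", "the", "is", "in", "for", "of", "on", "has", "been", "to", "and"}

# Topic terms quoted in the text, which would otherwise dominate every count.
BWA_TERMS = {"boil", "water", "notice", "advisory", "lifted"}

DAYS_OF_WEEK = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"}

REMOVE = COMMON_STOPWORDS | BWA_TERMS | DAYS_OF_WEEK


def clean_tokens(text):
    """Lowercase the text, drop URLs, split into words, and remove unwanted terms."""
    without_urls = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z]+", without_urls)
    return [t for t in tokens if t not in REMOVE]


# Hypothetical tweet text, invented for illustration.
print(clean_tokens("Boil water notice lifted in Houston on Tuesday https://t.co/abc123"))
```

Keeping the three removal lists separate makes it easy to extend one of them (for example, a custom BWA dictionary) without touching the others.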

Data Exploration

Having cleaned our data, we proceeded to an exploratory analysis, considering tweet content over time, key words, sentiments, and geographic dispersion of tweets.

Tweets Over Time

First, we used ggplot’s density plot to look at the distribution of tweets over time. We can see a peak around November 29th, 2022, right after the issuance of a boil water advisory in Houston, Texas on the 28th. Because our data only cover about a week, we cannot see what these trends look like over the long term. It would be valuable to explore changes in tweet content over the course of at least a year in order to examine seasonality, such as the higher incidence of BWAs in the fall through spring, when freezing temperatures increase the likelihood of water main breaks.5 These data could then potentially be connected to known weather events.
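A simpler relative of the density plot is a count of tweets per calendar day. The Python sketch below shows that idea with invented timestamps; the timestamp format is an assumption, and daily counts are a swapped-in technique here, not the kernel density estimate that ggplot draws.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count tweets per calendar day; returns {ISO date: count} in date order."""
    days = [datetime.strptime(ts, fmt).date().isoformat() for ts in timestamps]
    return dict(sorted(Counter(days).items()))


# Hypothetical timestamps, invented for illustration.
timestamps = [
    "2022-11-29 08:15:00", "2022-11-29 09:40:00", "2022-11-29 17:05:00",
    "2022-11-30 11:20:00", "2022-12-01 07:55:00", "2022-12-01 21:30:00",
]

daily = tweets_per_day(timestamps)
for day, n in daily.items():
    # One '#' per tweet gives a crude text histogram.
    print(day, "#" * n)
```

With a year or more of data, the same counts could be aggregated by week or month to look for seasonal patterns.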

Top Mentions

Next, we used the tidytext package to analyze the most common words in our Twitter data. The top 25 are shown in this graph. Unsurprisingly, “Houston” was first, corresponding to the known BWA on November 28th. Other terms on the list, such as “city”, “issued”, “drink”, “lifted”, and so forth are similarly logical and could be used to construct a dictionary of BWA-related stopwords if further analysis of Twitter data were pursued.
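The ranking behind a top-terms chart is a frequency count over all cleaned tokens. The sketch below is a minimal Python illustration under assumed inputs: the token lists are invented, and ties are broken alphabetically, which is a choice made here for reproducibility rather than anything the report specifies.

```python
from collections import Counter


def top_terms(token_lists, n=25):
    """Return the n most frequent terms as (term, count), ties broken alphabetically."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical cleaned tweets (one token list per tweet), invented for illustration.
cleaned = [
    ["houston", "city", "issued"],
    ["houston", "drink"],
    ["houston", "city"],
]

for term, count in top_terms(cleaned, n=3):
    print(term, count)
```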

Word Cloud

Similarly, we created a word cloud of these data in order to highlight the most common terms. Note, however, that a word cloud, like our graph above, is a useful tool for getting a general impression of text data, but not for substantive analysis of the relative importance of certain search terms. Moreover, a word cloud is not the best way to ascertain where these events are happening. Because it is limited by the minimum number of occurrences of a word (set here to 50 occurrences), it filters out events that are mentioned only rarely, including any location with a single event. It also splits names into individual words, so a place like “Lower Merion” would show up as “Lower” and “Merion”, not as a proper place name. Crucially, both our frequency plot and our word cloud can actually obscure the incidence of BWAs in smaller towns, whose names might show up only once in our text, and are therefore not useful for granular analysis of where BWAs are happening.
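Both effects described above (the minimum-frequency cutoff and the splitting of multi-word names) can be reproduced with a toy example. The Python sketch below uses invented texts and a cutoff of 2 rather than 50; it only mimics the filtering step of a word cloud and draws nothing.

```python
from collections import Counter


def word_cloud_terms(texts, min_freq):
    """Return {word: count} for single words that occur at least min_freq times."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: count for word, count in counts.items() if count >= min_freq}


# Hypothetical texts: three mentions of a large city, one of a small township.
texts = [
    "Houston advisory",
    "Houston advisory",
    "Houston advisory",
    "Lower Merion advisory",
]

frequent = word_cloud_terms(texts, min_freq=2)
everything = word_cloud_terms(texts, min_freq=1)

print(sorted(frequent))    # the township has dropped out entirely
print(sorted(everything))  # the township appears only as two separate words
```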

Sentiment Analysis

Using tidytext’s sentiment analysis tools and the textdata package’s library of sentiment “dictionaries”, we can parse our Twitter data to see the “sentiment” expressed in recent tweets about boil water advisories. Because of the limitations on basic-level Twitter API access, we did not have enough data to draw more than coarse conclusions from our sentiment analysis; we were only able to compare positive and negative sentiment. Unsurprisingly, the top 25 terms by frequency were predominantly associated with negative feelings. Access to the full content of the Twitter API through academic researcher privileges might reveal more interesting patterns, though.
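Dictionary-based sentiment analysis of this kind amounts to looking each token up in a word-to-label table and tallying the labels. The Python sketch below illustrates the mechanism only: the six-word lexicon is invented for the example and is not drawn from any published sentiment dictionary, and the tokens are hypothetical.

```python
from collections import Counter

# Invented mini-lexicon for illustration; NOT taken from a real sentiment dictionary.
TOY_LEXICON = {
    "contaminated": "negative",
    "unsafe": "negative",
    "outage": "negative",
    "safe": "positive",
    "restored": "positive",
    "thank": "positive",
}


def sentiment_counts(tokens, lexicon):
    """Tally sentiment labels for the tokens that appear in the lexicon."""
    return Counter(lexicon[token] for token in tokens if token in lexicon)


# Hypothetical cleaned tokens from a handful of tweets.
tokens = ["water", "contaminated", "unsafe", "city", "restored"]

tally = sentiment_counts(tokens, TOY_LEXICON)
print(tally["negative"], "negative;", tally["positive"], "positive")
```

Tokens missing from the lexicon (here “water” and “city”) are simply ignored, which is one reason small samples yield coarse results.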

Mapping

Lastly, we attempted to take advantage of geolocations for tweets to map BWA-related tweets. Unfortunately, only 3 out of our 2,100 tweets were geocoded: one in Texas and two in the United Kingdom. This low rate of geocoding suggests that geolocation data from tweets is not, in isolation, a good tool for understanding the distribution of BWAs. While there is some potential to use text mining to identify locations referenced in the body of the tweet, that would be a more labor-intensive process than simply mapping geocoded tweets, so we did not attempt it here.
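To make the scale of the problem concrete, the Python sketch below filters a list of tweet records down to those with coordinates and computes the geocoded share from the figures reported above (3 of roughly 2,100). The `latitude` and `longitude` field names and the sample records are assumptions for the example.

```python
def geocoded(tweets):
    """Keep only tweets that carry both a latitude and a longitude."""
    return [
        t for t in tweets
        if t.get("latitude") is not None and t.get("longitude") is not None
    ]


# Hypothetical records; field names and values are invented for illustration.
sample = [
    {"text": "boil water notice", "latitude": 29.76, "longitude": -95.37},
    {"text": "boil water advisory lifted", "latitude": None, "longitude": None},
    {"text": "boil water order"},
]

print(len(geocoded(sample)), "of", len(sample), "sample tweets are geocoded")

# Share implied by the counts reported in the text: 3 geocoded out of about 2,100.
share = 3 / 2100
print(f"{share:.2%}")
```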

Due to a currently unresolved package issue, leaflet is not working, meaning that mapview and tmap will not load properly. As a result, we’ve built a static map here in ggplot, rather than an interactive one.

Map of geocoded tweets

Reflections

Overall Usefulness

Twitter data may have some utility for understanding BWAs. One major advantage it has over other data sources is that it may offer insight into the reaction of residents to BWAs, whereas conventional sources are typically government or news releases. With full access to historical Twitter data and appropriate, in-depth string cleaning, text mining and sentiment analysis may yield useful results.

Limitations

As it currently stands, it would take a significant amount of work to render Twitter data useful. First, the lack of consistent geocoding is a major disadvantage; geocoded data would be useful in various kinds of modeling and predictive analysis. It may be possible to cross-reference location terms like “Houston” with approximate coordinates, but doing so would be time-consuming. Moreover, it’s not clear based on our cursory analysis here whether our text mining has yielded any new insights. The BWA in Houston was already a known incident, and initial sentiment analysis confirmed the intuitive expectation that people react negatively to water contamination.
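The cross-referencing idea mentioned above can be sketched as a lookup of known place names in tweet text. The Python example below is a minimal illustration under stated assumptions: the two-entry gazetteer is hypothetical, its coordinates are rough, rounded values that should be verified before any real use, and matching whole phrases is one simple way to keep multi-word names such as “Lower Merion” intact.

```python
import re

# Hypothetical gazetteer; coordinates are approximate, rounded (lat, lon) values.
GAZETTEER = {
    "houston": (29.76, -95.37),
    "lower merion": (40.0, -75.3),
}


def locate(text, gazetteer):
    """Return {place: (lat, lon)} for gazetteer names found as whole phrases in the text."""
    lowered = text.lower()
    return {
        name: coords
        for name, coords in gazetteer.items()
        if re.search(r"\b" + re.escape(name) + r"\b", lowered)
    }


# Hypothetical tweet texts, invented for illustration.
print(locate("Boil water notice issued in Houston", GAZETTEER))
print(locate("Advisory lifted for Lower Merion residents", GAZETTEER))
```

A real gazetteer would also need to disambiguate places that share a name, which is part of why this step is time-consuming.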

Conclusions and Next Steps

In sum, parsing Twitter data to understand BWAs would demand a much larger quantity of data to be useful. Insight into sentiment and geolocation would both be valuable supplements to current data on BWAs, but would be very labor-intensive to glean from Twitter data. Given the likely imminent collapse of Twitter, it is unclear whether this is worth pursuing further.

Footnotes

  1. https://en.wikipedia.org/wiki/Boil-water_advisory

  2. Sydney O’Shay, Ashleigh M. Day, Khairul Islam, Shawn P. McElmurry & Matthew W. Seeger (2022) Boil Water Advisories as Risk Communication: Consistency between CDC Guidelines and Local News Media Articles, Health Communication, 37:2, 152-162, DOI: 10.1080/10410236.2020.1827540

  3. Sydney O’Shay, Ashleigh M. Day, Khairul Islam, Shawn P. McElmurry & Matthew W. Seeger (2022) Boil Water Advisories as Risk Communication: Consistency between CDC Guidelines and Local News Media Articles, Health Communication, 37:2, 152-162, DOI: 10.1080/10410236.2020.1827540; Sridhar Vedachalam, Kyra T. Spotte-Smith, Susan J. Riha, A meta-analysis of public compliance to boil water advisories, Water Research, Volume 94, 2016, Pages 136-145, ISSN 0043-1354, https://doi.org/10.1016/j.watres.2016.02.014

  4. Sridhar Vedachalam, Mary E. John, Susan J. Riha, Spatial analysis of boil water advisories issued during an extreme weather event in the Hudson River Watershed, USA, Applied Geography, Volume 48, 2014, Pages 112-121, ISSN 0143-6228, https://doi.org/10.1016/j.apgeog.2014.02.001

  5. Sridhar Vedachalam, Kyra T. Spotte-Smith, Susan J. Riha, A meta-analysis of public compliance to boil water advisories, Water Research, Volume 94, 2016, Pages 136-145, ISSN 0043-1354, https://doi.org/10.1016/j.watres.2016.02.014