Background and Motivation

The following fulfills a data visualization request from /r/DataVizRequests, a subreddit devoted to creating visualizations of datasets. This particular request involved not only visualizing a dataset but also generating it in the first place. The dataset consists of area codes mentioned in a heroin request-and-review subreddit, and the visualization is a collection of shaded regions superimposed on a Google map.

This was done for free as an exercise for my own edification. Kudos to the requester /u/VroooMoose for identifying an interesting line of research.

/u/VroooMoose’s Objective:

Hello! I’m an EMT in one of the United States’ hotbeds for heroin and opiate use, addiction, and overdoses. I rarely go a couple of years without being affected within ~3 degrees of separation, I rarely go a shift of EMS without having an overdose, and I can’t go through watching anything on TV without seeing at least 3 commercials addressing addiction. It is a massive problem.

I originally went to r/opiates so that I could get some insight from users from their perspectives since many of my coworkers have a blanket-attitude that they are all pieces of shit, but I believe that is painfully short-sighted. From there, I discovered r/glassine, which is a board of Reviews, Discussions, or HAT (Has anyone tried?) threads on specific bags or “brands” of heroin. Glassine refers to the bags that heroin used to primarily come in, whereas plain plastic baggies has been more common in my experience in EMS. Glance through this Vice article for a quick visual idea.

I’m interested in creating a heatmap by Area Code (the most common way r/glassine identifies locations) to see how the locations vary for threads. While opiate use and addiction has become widely accepted as an epidemic around the country, I think it would be interesting to see where some geographic concentrations are.

Let me know if you have any questions!

Working Hypothesis

For now I am operating under the assumption that /r/glassine area code mentions are indicative of an area code’s level of opiate use and abuse, because the redditors frequenting the sub are representative of US opiate users. This assumption will fall apart. I will characterize the dataset’s biases and evaluate the extent to which it can provide useful insights. From there I will develop a new working hypothesis.

Count Data

Because the dataset did not yet exist, I had to begin by collecting the area codes mentioned in /r/glassine. I used the RedditExtractoR R package to scrape post titles and the stringr R package to extract candidate area codes from those titles.

Load the relevant libraries:

library(RedditExtractoR)
library(stringr)

The following code chunk checks whether the raw data is available in the working directory. If not, it downloads the data from the GitHub repo. Alternatively, uncomment the scraping line below to pull fresh data from Reddit.

if (file.exists("glassine_urls.csv")) {
        glassine_urls <- read.csv("glassine_urls.csv", stringsAsFactors = FALSE)
} else {
        download.file("https://github.com/areeves87/glassine-area-codes/blob/master/glassine_urls.csv","glassine_urls.csv",mode="wb")
        glassine_urls <- read.csv("glassine_urls.csv", stringsAsFactors = FALSE)
}

# Uncomment to re-run the Reddit scrape:
# glassine_urls <- reddit_urls(subreddit = "glassine", page_threshold = 40)


if (file.exists("area_codes_by_state.csv")) {
        area_codes_by_state <- read.csv("area_codes_by_state.csv")
} else {
        download.file("https://github.com/areeves87/glassine-area-codes/blob/master/area_codes_by_state.csv","area_codes_by_state.csv",mode="wb")
        area_codes_by_state <- read.csv("area_codes_by_state.csv")
}

Examine the glassine data to decide how to parse the content.

str(glassine_urls,strict.width="cut")
## 'data.frame':    998 obs. of  5 variables:
##  $ date        : chr  "27-11-17" "27-11-17" "27-11-17" "27-11-17" ...
##  $ num_comments: int  0 0 0 1 1 3 0 0 1 12 ...
##  $ title       : chr  "(R) Birthday purple stamp with cupcake" "(HAT) Tu"..
##  $ subreddit   : chr  "glassine" "glassine" "glassine" "glassine" ...
##  $ URL         : chr  "http://www.reddit.com/r/glassine/comments/7fyih9/"..

The glassine data has 998 rows and 5 columns, with each row containing data for a single thread and each column representing a feature across all 998 threads. The features include:

  • a “date” column, which is the date the thread was posted
  • a “num_comments” column, which is the number of comments the thread had
  • a “title” column, which is the title of the thread
  • a “subreddit” column, which is the subreddit the thread was posted to
  • a “URL” column, which is the URL of the thread

Of these columns, the “title” column is the most pertinent, since it often contains the area code the poster wants information about. To generate a dataset of area code mentions, we will need to extract any possible area codes mentioned in the titles.

The following parsing procedure uses the stringr library’s str_extract function to find and extract the first three-digit string in each title. If a title contains no three-digit string, the function returns ‘NA’ for that row. The result is an array of candidate area code mentions.

We can validate the array of candidate strings against an external dataset of US area codes. If a candidate appears among the US area codes, it’s a valid area code. We’ll keep only the valid candidates for further analysis, but we’ll also print a frequency table of the invalid candidates to see what we excluded.

# extract the first three-digit string from each title (NA if none)
candidate_mentions <- str_extract(glassine_urls$title, "[0-9]{3}")
# keep only candidates that match a real US area code
valid_mentions <- candidate_mentions[candidate_mentions %in% area_codes_by_state$Area.code]
# tabulate what gets excluded, including the NAs
table(candidate_mentions[!candidate_mentions %in% area_codes_by_state$Area.code],
      useNA = "always", dnn = 'Invalid Candidates')
## Invalid Candidates
##  007  100  173  357  738  777  983 <NA> 
##    1    4    1    2    1    1    1  393

A few three-digit strings that are not area codes occasionally show up in thread titles. In addition, nearly 40% of extractions returned ‘NA’, meaning the title contained no three-digit string at all. Still, about 60% of threads yielded a candidate, and 98% of those candidates matched a real US area code. This leaves 594 area code mentions with which to shade our choropleth map.
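As a quick sanity check, those figures can be recomputed directly from the vectors produced above (a minimal sketch, assuming candidate_mentions and valid_mentions are still in the workspace):

mean(is.na(candidate_mentions))                          # fraction of titles with no three-digit string (~0.39)
mean(!is.na(candidate_mentions))                         # fraction of titles that yielded a candidate (~0.61)
length(valid_mentions) / sum(!is.na(candidate_mentions)) # fraction of candidates that are real area codes (~0.98)
length(valid_mentions)                                   # total usable mentions (594)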

After removing the invalid candidates, the valid area code mentions are ready to be turned into a frequency table, which we will later use to shade our choropleth maps. The next code chunk builds the frequency table, sorted from most to fewest mentions. Then, to give us an idea of how evenly the mentions are distributed among the area codes, we generate a cumulative fraction plot.

tally<-sort(table(valid_mentions, dnn = 'Area Code Tallies'),decreasing=TRUE)
tally
## Area Code Tallies
## 412 973 215 609 724 201 908 315 570 267 413 480 540 585 631 708 757 845 
## 263 205  52  38  13   3   3   2   2   1   1   1   1   1   1   1   1   1 
## 862 864 872 978 
##   1   1   1   1
plot(cumsum(as.vector(tally))/sum(as.vector(tally)),ylim=c(0,1),
     ylab = "Cumulative Fraction of Mentions",
     xlab = "nth-most Mentioned Area Code")
text(x = cumsum(as.vector(tally))/sum(as.vector(tally)),
     labels = row.names(tally), 
     cex=0.6, pos=1, col="red")

The sorted table shows that the two most frequently mentioned area codes are ‘412’ and ‘973’, and the cumulative fraction plot illustrates that nearly 80% of all mentions come from those two codes alone. When we make our map, we’ll be able to see where these two area codes are located.
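Before mapping, we can already look up the top two codes in the reference table loaded earlier (a small sketch; it relies only on the Area.code column, since the other column names depend on that CSV):

# rows of the reference table for the two most-mentioned area codes
subset(area_codes_by_state, Area.code %in% c("412", "973"))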

Boundary Data

Choropleth maps shade map regions according to a variable of interest. In this case, the shading will reflect the number of mentions a given area code has in the glassine dataset. But to shade a map region we first have to define its boundaries, which means finding data that delineates the extent of each area code within the USA. Finding a shapefile for US area codes took some googling, but eventually I found one here. I’ve uploaded the zip file to the GitHub repo to make it easy for others to use.

I use a number of R packages for working with the shapefile: rgdal::readOGR reads the shapefile into R; rgeos::gCentroid finds the centroid of each area code territory; the sp and rgdal packages provide spTransform and CRS for map projection and datum transformation; and the broom package converts spatial polygon data frames into tidy data frames for ggplot2.

Load map-making libraries:

# load up libraries:
library(sp)
library(rgdal)
## rgdal: version: 1.2-16, (SVN revision 701)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.0, released 2017/04/28
##  Path to GDAL shared files: C:/Users/Reeves/Documents/R/win-library/3.3/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/Reeves/Documents/R/win-library/3.3/rgdal/proj
##  Linking to sp version: 1.2-5
library(broom)
library(rgeos)
## rgeos version: 0.3-26, (SVN revision 560)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 r0 
##  Linking to sp version: 1.2-5 
##  Polygon checking: TRUE

Once we’ve loaded the rgdal package, we can read in the shapefile after extracting it from the zipped folder. The following code chunk checks whether AreaCode.shp is already in the working directory. If not, it looks for AreaCode.zip, downloading it from the GitHub repo if necessary. It then unzips AreaCode.zip to extract AreaCode.shp, which delineates the extent of each area code in lat/lon coordinates, reads the .shp file into R, and finally removes the extracted files to free up disk space.

if (file.exists("AreaCode.shp")) {
        area <- readOGR("AreaCode.shp")
} else if (file.exists("AreaCode.zip")) {
        unzip("AreaCode.zip")
        area <- readOGR("AreaCode.shp")
        
        file.remove("AreaCode.dbf","AreaCode.prj","AreaCode.sbn",
                    "AreaCode.sbx","AreaCode.shx","AreaCode.shp","AreaCode.shp.xml")
} else {
        download.file("https://github.com/areeves87/glassine-area-codes/blob/master/AreaCode.zip","AreaCode.zip",mode="wb")
        unzip("AreaCode.zip")
        area <- readOGR("AreaCode.shp")
        
        file.remove("AreaCode.dbf","AreaCode.prj","AreaCode.sbn",
                    "AreaCode.sbx","AreaCode.shx","AreaCode.shp","AreaCode.shp.xml")
}
## OGR data source with driver: ESRI Shapefile 
## Source: "AreaCode.shp", layer: "AreaCode"
## with 336 features
## It has 12 fields
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

(I’m getting ahead of myself a bit here, but I’m including one more processing step on the frequency table because of a plotting issue I run into later with the maps. The essence of the problem is that two low-count overlay area codes (267 overlays 215, and 862 overlays 973) cover exactly the same territory as high-count area codes, making it difficult to see the shading for the high-count codes. There is probably a more elegant way of handling the problem. For now, here’s the hard-coded step:)

tally<-tally[!(names(tally) %in% c("267","862"))]
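A more systematic alternative (just a sketch, and it assumes the overlay polygons really do share identical geometry in this particular shapefile) would be to detect the overlapping pairs directly from the boundary data before deciding which codes to drop:

# find mentioned area codes whose polygons are geometrically identical in the
# shapefile, i.e. overlay codes sitting on top of another mentioned code
mentioned <- area[area$NPA %in% unique(valid_mentions), ]
same_shape <- gEquals(mentioned, byid = TRUE)   # pairwise geometric-equality matrix
diag(same_shape) <- FALSE                       # ignore self-comparisons
pairs <- which(same_shape, arr.ind = TRUE)
data.frame(code    = mentioned$NPA[pairs[, 1]],
           same_as = mentioned$NPA[pairs[, 2]])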

This shapefile includes area code boundaries for the entire USA, but we only need a subset of them. Let’s select shapes only for the area codes mentioned in the frequency table.

glassine_area<-area[area$NPA %in% names(tally),]

Now let’s add a column of data indicating number of mentions per area code.

# look up each polygon's NPA (area code) in the tally and attach the mention count
ac_ref <- match(glassine_area@data$NPA, names(tally))
glassine_area@data$TALLY <- as.vector(tally[ac_ref])

Let’s also add columns for where to plot the area code labels. We’ll use the centroid of each area code, computed with rgeos::gCentroid. Find the lat/lon of each centroid and add them as two columns of data.

glassine_area$CENTER.x<-tidy(gCentroid(glassine_area, byid=TRUE))[,1]
## Warning in tidy.default(gCentroid(glassine_area, byid = TRUE)): No method
## for tidying an S3 object of class SpatialPoints , using as.data.frame
glassine_area$CENTER.y<-tidy(gCentroid(glassine_area, byid=TRUE))[,2]
## Warning in tidy.default(gCentroid(glassine_area, byid = TRUE)): No method
## for tidying an S3 object of class SpatialPoints , using as.data.frame

(Reproject to WGS84 and flatten the polygons into a data frame for ggplot2. I copied this next part from another tutorial, and I’m not sure whether the reprojection step is strictly necessary:)

# reproject to WGS84 (EPSG:4326) and flatten the polygons for ggplot2
glassine_WGS84 <- spTransform(glassine_area, CRS("+init=epsg:4326"))
glassine_df_WGS84 <- tidy(glassine_WGS84)
## Regions defined for each Polygons
# re-attach the attribute data (NPA, TALLY, centroids) to the flattened polygons by polygon ID
glassine_WGS84$polyID <- sapply(slot(glassine_WGS84, "polygons"), function(x) slot(x, "ID"))
glassine_df_WGS84 <- merge(glassine_df_WGS84, glassine_WGS84, by.x = "id", by.y = "polyID")

Okay, now we have a dataset of each area code’s boundary and the number of times it was mentioned in the /r/glassine threads. We are ready to make our choropleth maps!

Choropleth Maps

For this step I use RColorBrewer to make the color gradient for the shading, ggmap for retrieving Google map images, and ggplot2 for making the shaded overlays.

# load up libraries:
library(RColorBrewer)
library(ggplot2)
library(ggmap)

We’ll start by creating a choropleth map of the mainland USA to get an overview of how the area codes are distributed. I’ve stored a .png of the map from the last time I ran the code, so this next chunk starts by trying to load that .png. If it isn’t present, the chunk repeats the original map-making procedure: use ggmap to grab a satellite image of the United States and geom_polygon to overlay the area codes. The polygons are filled according to their TALLY value, with light blue for the lowest tally and red for the highest.

if (file.exists("usa_glassine.png")) {
  knitr::include_graphics('usa_glassine.png')
} else {
  usa_basemap <- get_map(location="United States", zoom=4, maptype = 'satellite')

ggmap(usa_basemap) +
        geom_polygon(data = glassine_df_WGS84, 
                     aes(x=long, y=lat, group = group, # coordinates, and group them by polygons
                         fill = TALLY), alpha = 0.5) + # variable to use for filling
        scale_fill_gradient(low="#bfefff",high="red")+
        ggtitle("Area Code Mentions in /r/glassine")
}

It is clear from this map that the mentioned area codes mostly lie in the northeastern US. Let’s zoom in on Pennsylvania to get a clearer view of where they are.

We’ll repeat the same procedure. If the map has to be regenerated, we’ll set the location to “PA” and the zoom to 6, then fill in the shading for the area codes. This time, we’ll also label any area code with more than 20 mentions.

if (file.exists("pa_glassine.png")) {
  knitr::include_graphics('pa_glassine.png')
} else {
  pa_basemap <- get_map(location="PA", zoom=6, maptype = 'satellite')

ggmap(pa_basemap) +
        geom_polygon(data = glassine_df_WGS84, 
                    aes(x=long, y=lat, group = group, # coordinates, and group them by polygons
                        fill = TALLY), alpha = .8) + # variable to use for filling
        scale_fill_gradient(low="#bfefff",high="red")+
        geom_text(data = glassine_df_WGS84,aes(x=CENTER.x,y=CENTER.y,
                        label=ifelse(TALLY>20,as.character(NPA),'')))+
        ggtitle("Area Code Mentions in /r/glassine")
}

Discussion

Out of roughly the 1000 most recent threads on /r/glassine, just four area codes (412, 973, 215, and 609) account for the overwhelming majority of mentions. 412, which covers Pittsburgh, is one hotspot; 973, which covers Newark, is another. Because only a few area codes get mentioned in the subreddit at all, the idea that the subreddit is representative of US opiate users is highly questionable. Instead, it appears we are looking at a population biased towards Pittsburgh and Newark.

Although this subreddit probably won’t tell us much about opiate users across the US, it may still be informative for the hotspots I’ve identified. Future research might compare this dataset against a publicly available map of overdoses or heroin-related drug busts in order to validate the findings. It would also be informative to examine how many distinct redditors post to /r/glassine; we may be looking at posts from just a few individuals, which would further bias the results.
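As a first pass at that last question, something like the following might work (a sketch only; it assumes the RedditExtractoR version used above provides reddit_content() and that its output includes an author column for each thread, so check the returned column names before trusting the count):

# pull content for a small sample of the scraped threads and count distinct posters
# (keep the sample small to be gentle with reddit's API)
sample_content <- reddit_content(glassine_urls$URL[1:25])
length(unique(sample_content$author))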

Feedback, requests, etc. are welcome and should be directed to the GitHub repo.