Purposes

So, you have some data… perhaps you want to look at how your patients are responding to an exercise intervention for obesity, and see whether differences in compliance and weight-loss success could be related to violence near home that makes it unsafe to play outside.

Obtain Geographic Data

The City of Philadelphia supplies information about shootings (including officer-involved shootings), which includes data about the shooting victim and the location. Here, we’re really interested in the locations of shootings over the past few years, to understand which parts of Philadelphia are more prone to this specific kind of violence.

To see more information about this dataset, please visit https://www.opendataphilly.org/dataset/shooting-victims/resource/a6240077-cbc7-46fb-b554-39417be606ee?inner_span=True.

For our purposes, we’re going to get the bare minimum of information: latitude and longitude. The API endpoint is described at the link above and accepts a SQL query so that we can select only the data we care about. Because our query contains spaces and other special characters, we need to “encode” it before making the request.

The data will come back as JSON, which we’ll parse with the jsonlite package.

library(jsonlite)
url <- URLencode('https://www.opendataphilly.org/api/action/datastore_search_sql?sql=SELECT lat, lng from "a6240077-cbc7-46fb-b554-39417be606ee"')
shooting_data <- fromJSON(url)
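
Before digging into the payload, it’s worth confirming that the request actually succeeded; the CKAN response includes a success flag (this guard is our addition, not part of the original workflow):

# Stop early if the API call failed
stopifnot(isTRUE(shooting_data$success))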

We can examine the shooting data by using R’s str (structure) command:

str(shooting_data)
## List of 3
##  $ help   : chr "https://www.opendataphilly.org/api/3/action/help_show?name=datastore_search_sql"
##  $ success: logi TRUE
##  $ result :List of 3
##   ..$ records:'data.frame':  2500 obs. of  2 variables:
##   .. ..$ lat: chr [1:2500] "39.922968633" "40.0296637640001" "40.037804839" "39.9682945860001" ...
##   .. ..$ lng: chr [1:2500] "-75.18870837" "-75.1205859089999" "-75.139286339" "-75.223324033" ...
##   ..$ fields :'data.frame':  2 obs. of  2 variables:
##   .. ..$ type: chr [1:2] "numeric" "numeric"
##   .. ..$ id  : chr [1:2] "lat" "lng"
##   ..$ sql    : chr "SELECT lat, lng from \"a6240077-cbc7-46fb-b554-39417be606ee\""

Here we see that we have a data frame, accessible at shooting_data$result$records:

head(shooting_data$result$records, 6)
##                lat               lng
## 1     39.922968633      -75.18870837
## 2 40.0296637640001 -75.1205859089999
## 3     40.037804839     -75.139286339
## 4 39.9682945860001     -75.223324033
## 5     39.995426509     -75.126076954
## 6 40.0429494850001 -75.1617552229999

Mapping Points

If we wanted to, we could easily create a map of these shootings based just on latitude and longitude. Since latitude and longitude currently arrive in “chr” (character) format, we’ll convert them to numeric so that we can do math on them. We’ll create a map centered on the mean latitude and longitude of all our shootings, zoomed in appropriately (you may have to experiment with the zoom level).

We’re going to add a standard road-map base layer underneath, using addTiles(), to give more context.

library(leaflet)
library(leaflet.extras)
library(dplyr)

shootings <- shooting_data$result$records
shootings$lat <- as.numeric(shootings$lat)
shootings$lng <- as.numeric(shootings$lng)

shootings %>% 
  leaflet() %>% 
  addTiles() %>%
  setView(lng = mean(shootings$lng, na.rm = TRUE), 
          lat = mean(shootings$lat, na.rm = TRUE), zoom = 10) %>%
  addMarkers(clusterOptions = markerClusterOptions()) %>%
  suspendScroll()
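
One caveat: as.numeric() silently turns any malformed coordinate into NA, and leaflet will typically warn about (and drop) markers with missing coordinates. A quick count of affected rows (our addition) tells you whether that matters here:

# How many rows lost a usable coordinate in the character-to-numeric conversion?
sum(is.na(shootings$lat) | is.na(shootings$lng))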

Mapping Polygons

What’s more likely, however, is that we want to use polygon data to create a map that shows how much a particular area is affected. This is because we want to create combined data: we want to pair information about our patients or research subjects with the level of violence they are exposed to. Instead of using raw latitude and longitude, we’ll count the number of shootings per Census tract, which we can then use as a proxy for violence exposure for the patients and subjects who live in that Census tract. It’s a sort of “binning”, but using the existing “bins” of Census tracts.

Let’s start by getting a map:

library(rgdal)
philadelphiaCensusTracts <- readOGR("http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson")
## OGR data source with driver: GeoJSON 
## Source: "http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson", layer: "OGRGeoJSON"
## with 384 features
## It has 14 fields
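
A note for newer setups: rgdal was retired from CRAN in 2023, so readOGR() may not be available to you. A roughly equivalent read using the sf package, coerced back to the sp class that the rest of this walkthrough (over(), @data, proj4string()) expects, might look like this sketch:

library(sf)
library(sp)
# Read the GeoJSON with sf, then coerce to a SpatialPolygonsDataFrame
philadelphiaCensusTracts_sf <- st_read("http://data.phl.opendata.arcgis.com/datasets/8bc0786524a4486bb3cf0f9862ad0fbf_0.geojson")
philadelphiaCensusTracts <- as(philadelphiaCensusTracts_sf, "Spatial")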

Mapping Point Data to Polygons

Now what we’d like to do is get the shootings-per-tract data, which we can then combine with our research or clinical data to see if violence near home has any effect on our outcomes. To do this, we take the latitude and longitude of our shootings and convert them into spatial points, so that they are understood as geographic coordinates rather than just pairs of numbers. We’ll tag them with the same coordinate reference system (CRS) used by our original philadelphiaCensusTracts.

library(sp)
coordinates <- SpatialPoints(shootings[c("lng", "lat")])
proj4string(coordinates) <- proj4string(philadelphiaCensusTracts)
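
One subtlety: proj4string<- only labels the points with a CRS; it does not reproject them. That’s fine here because the API already returns WGS84 latitude/longitude, which is what the tract file uses. If the two ever differed, you would reproject instead; a hypothetical sketch (assuming rgdal, already loaded above, or a recent sp/sf stack is available):

# Only needed if the points' CRS differed from the polygons' CRS;
# spTransform() actually reprojects, whereas proj4string<- only relabels.
# With identical CRSs, as here, this is effectively a no-op.
coordinates <- spTransform(coordinates, CRS(proj4string(philadelphiaCensusTracts)))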

Let’s now take what we know about our polygons (from philadelphiaCensusTracts) and apply it to our points. We’ll end up with a table that has one row for each shooting coordinate. Essentially, we’re taking each point, lining it up with the polygon that contains it, and then pulling in the data about that polygon, which came along with the geoJSON file we downloaded.

shooting_tract_data <- over(coordinates, philadelphiaCensusTracts)
head(shooting_tract_data)
##   OBJECTID STATEFP10 COUNTYFP10 TRACTCE10     GEOID10 NAME10
## 1      198        42        101    003600 42101003600     36
## 2      112        42        101    029000 42101029000    290
## 3      356        42        101    028200 42101028200    282
## 4       57        42        101    010300 42101010300    103
## 5      141        42        101    017602 42101017602 176.02
## 6       98        42        101    024700 42101024700    247
##            NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10
## 1     Census Tract 36   G5020          S  964539        0 +39.9279255
## 2    Census Tract 290   G5020          S  818133    20001 +40.0296096
## 3    Census Tract 282   G5020          S  855812        0 +40.0348704
## 4    Census Tract 103   G5020          S  281357        0 +39.9678322
## 5 Census Tract 176.02   G5020          S  400458        0 +39.9967182
## 6    Census Tract 247   G5020          S  941266        0 +40.0417034
##     INTPTLON10 LOGRECNO
## 1 -075.1920206    10377
## 2 -075.1166058    10605
## 3 -075.1403327    10596
## 4 -075.2254383    10437
## 5 -075.1305528    10507
## 6 -075.1645311    10560

We see the first few lines of the Census data for each of our shootings. For example, the first shooting in our shooting data corresponds to Census tract 36, which is in State 42 (Pennsylvania) and County 101 (Philadelphia County). We can use this to find out how many shootings take place in each Census tract.

shootings_by_census_tract <- shooting_tract_data %>% 
                                 group_by(GEOID10) %>% 
                                 summarise(num_shootings = n()) %>% 
                                 ungroup() 
head(shootings_by_census_tract)
## # A tibble: 6 x 2
##   GEOID10     num_shootings
##   <fct>               <int>
## 1 42101000100             2
## 2 42101000200             1
## 3 42101000300             1
## 4 42101000600             2
## 5 42101000700             1
## 6 42101000804             1
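
One thing to watch for: any shooting whose coordinates are missing or fall outside every tract comes back from over() as NA, and those NAs become their own “tract” in this summary. Assuming you’d rather exclude them, a small tweak (our addition) would be:

# Drop rows where the point did not fall inside any Census tract
shootings_by_census_tract <- shootings_by_census_tract %>% 
                                 filter(!is.na(GEOID10))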

Handling Empty Data

Don’t forget that some Census tracts aren’t represented at all in our shooting_tract_data data frame, so let’s enrich our summary with the tracts that had no shootings. We can get those by taking the full list of Philadelphia Census tracts from the map data, keeping only the tracts that don’t appear in shootings_by_census_tract, and giving each of them num_shootings equal to 0.

non_shooting_tracts <- philadelphiaCensusTracts@data %>% 
                       select(GEOID10) %>%
                       filter(!GEOID10 %in% shootings_by_census_tract$GEOID10) %>%
                       mutate(num_shootings = 0)
head(non_shooting_tracts)
##       GEOID10 num_shootings
## 1 42101014300             0
## 2 42101020600             0
## 3 42101001400             0
## 4 42101001900             0
## 5 42101002900             0
## 6 42101005000             0

We can now combine the tracts-with-shootings and the tracts-with-no-shootings to get an overall picture of violence by census tract:

shootings_by_census_tract <- rbind(shootings_by_census_tract, non_shooting_tracts)
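
A quick sanity check (our addition) that the combined table now covers every tract exactly once:

# Both should be TRUE: one row per polygon, and no tract listed twice
nrow(shootings_by_census_tract) == nrow(philadelphiaCensusTracts@data)
anyDuplicated(shootings_by_census_tract$GEOID10) == 0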

Let’s map this! We need to combine the data we aggregated with the data in the map file.

WARNING: PITFALL AHEAD!

This looks like a really simple merge, right? The Census tract ID is contained in both datasets, so we have a shared key, a “hinge”, to join on:

merged_data <- merge(philadelphiaCensusTracts@data, shootings_by_census_tract, by = "GEOID10")

Let’s peek!

head(merged_data) 
##       GEOID10 OBJECTID STATEFP10 COUNTYFP10 TRACTCE10 NAME10
## 1 42101000100      190        42        101    000100      1
## 2 42101000200      191        42        101    000200      2
## 3 42101000300      192        42        101    000300      3
## 4 42101000401      145        42        101    000401   4.01
## 5 42101000402      144        42        101    000402   4.02
## 6 42101000500      193        42        101    000500      5
##          NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10
## 1    Census Tract 1   G5020          S  704909        0 +39.9523827
## 2    Census Tract 2   G5020          S  382478        0 +39.9553999
## 3    Census Tract 3   G5020          S  548342      356 +39.9568780
## 4 Census Tract 4.01   G5020          S  214766    14981 +39.9541871
## 5 Census Tract 4.02   G5020          S  303680        0 +39.9532973
## 6    Census Tract 5   G5020          S  428783        0 +39.9519534
##     INTPTLON10 LOGRECNO num_shootings
## 1 -075.1466628    10335             2
## 2 -075.1569775    10336             1
## 3 -075.1716655    10337             1
## 4 -075.1758082    10338             0
## 5 -075.1686952    10339             0
## 6 -075.1581776    10340             0

Looks great, and, as merge() is wont to do, it has ordered the new data frame by the merge variable, GEOID10.

Can we just put that back into the map?

No! While keeping track of row order is not something we’re used to doing in R, we absolutely must preserve the original order of rows when we’re working with spatial polygon data. That’s because the rows of the data frame correspond 1:1, in order, to the polygons.
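
As an aside, one workaround is sp’s own merge() method for Spatial*DataFrame objects, which attaches a plain data frame while leaving the polygon order untouched. A sketch of that alternative (instead of, not in addition to, the manual steps below):

# merge() dispatched on a SpatialPolygonsDataFrame preserves polygon order
philadelphiaCensusTracts <- merge(philadelphiaCensusTracts,
                                  shootings_by_census_tract,
                                  by = "GEOID10", all.x = TRUE)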

There are a few ways to work around this problem, but our data gives us a great, easy-to-use solution. Let’s take a peek at our original order:

head(philadelphiaCensusTracts@data)
##   OBJECTID STATEFP10 COUNTYFP10 TRACTCE10     GEOID10 NAME10
## 0        1        42        101    009400 42101009400     94
## 1        2        42        101    009500 42101009500     95
## 2        3        42        101    009600 42101009600     96
## 3        4        42        101    013800 42101013800    138
## 4        5        42        101    013900 42101013900    139
## 5        6        42        101    014000 42101014000    140
##         NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10
## 0  Census Tract 94   G5020          S  366717        0 +39.9632709
## 1  Census Tract 95   G5020          S  319070        0 +39.9658709
## 2  Census Tract 96   G5020          S  405273        0 +39.9655396
## 3 Census Tract 138   G5020          S  341256        0 +39.9764504
## 4 Census Tract 139   G5020          S  562934        0 +39.9750563
## 5 Census Tract 140   G5020          S  439802        0 +39.9735358
##     INTPTLON10 LOGRECNO
## 0 -075.2322437    10429
## 1 -075.2379140    10430
## 2 -075.2435075    10431
## 3 -075.1771771    10468
## 4 -075.1711846    10469
## 5 -075.1630966    10470

Yes, there’s a handy “OBJECTID” column that’s numbered incrementally. So it’s very simple for us to take our merged data frame and reinstate that order:

merged_data <- merged_data[order(merged_data$OBJECTID),]
head(merged_data)
##         GEOID10 OBJECTID STATEFP10 COUNTYFP10 TRACTCE10 NAME10
## 95  42101009400        1        42        101    009400     94
## 96  42101009500        2        42        101    009500     95
## 97  42101009600        3        42        101    009600     96
## 134 42101013800        4        42        101    013800    138
## 135 42101013900        5        42        101    013900    139
## 136 42101014000        6        42        101    014000    140
##           NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10
## 95   Census Tract 94   G5020          S  366717        0 +39.9632709
## 96   Census Tract 95   G5020          S  319070        0 +39.9658709
## 97   Census Tract 96   G5020          S  405273        0 +39.9655396
## 134 Census Tract 138   G5020          S  341256        0 +39.9764504
## 135 Census Tract 139   G5020          S  562934        0 +39.9750563
## 136 Census Tract 140   G5020          S  439802        0 +39.9735358
##       INTPTLON10 LOGRECNO num_shootings
## 95  -075.2322437    10429            15
## 96  -075.2379140    10430            13
## 97  -075.2435075    10431            10
## 134 -075.1771771    10468            10
## 135 -075.1711846    10469             7
## 136 -075.1630966    10470             9
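
Before writing this back, it doesn’t hurt to verify that the reordered rows really do line up with the polygons (the stopifnot() guard is our addition):

# The OBJECTID sequence should now match the polygons' original row order
stopifnot(all(merged_data$OBJECTID == philadelphiaCensusTracts@data$OBJECTID))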

Now, this data frame we can add back into the spatial polygon data frame safely:

philadelphiaCensusTracts@data <- merged_data

And then mapping is a snap:

shooting_palette <- colorBin("Reds", domain = philadelphiaCensusTracts@data$num_shootings, bins = 5, na.color = "#808080")

leaflet() %>%
    addPolygons(
    data = philadelphiaCensusTracts,
    fillColor = ~shooting_palette(philadelphiaCensusTracts@data$num_shootings),
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "white", # border color
    fillOpacity = 1,
    label = paste("Number of Shootings: ", philadelphiaCensusTracts@data$num_shootings, sep = "")) %>%
  suspendScroll()
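
If you’d like viewers to be able to decode the shading, leaflet’s addLegend() can be piped onto the same chain; a minimal variation (the legend is our addition, not part of the original map):

leaflet() %>%
    addPolygons(
    data = philadelphiaCensusTracts,
    fillColor = ~shooting_palette(philadelphiaCensusTracts@data$num_shootings),
    weight = 1,
    opacity = 0.5,
    color = "white",
    fillOpacity = 1) %>%
  addLegend(pal = shooting_palette,
            values = philadelphiaCensusTracts@data$num_shootings,
            title = "Shootings per Tract",
            position = "bottomright") %>%
  suspendScroll()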

We could also plot the actual shootings on top. That might seem duplicative, but it’s a helpful check that nothing is obviously wrong in the mapping.

leaflet() %>%
    addPolygons(
    data = philadelphiaCensusTracts,
    fillColor = ~shooting_palette(philadelphiaCensusTracts@data$num_shootings),
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "white", # border color
    fillOpacity = 1,
    label = paste("Number of Shootings: ", philadelphiaCensusTracts@data$num_shootings, sep = "")) %>%
  addCircles(lat = shootings$lat, lng = shootings$lng, color = "darkgrey", weight = 1, radius = 100) %>%
  suspendScroll()

So what would that map look like if we hadn’t handled the row ordering problem? Let’s compare:

badPhilly <- philadelphiaCensusTracts
badPhilly@data <- merged_data[order(merged_data$GEOID10),]


leaflet() %>%
    addPolygons(
    data = badPhilly,
    fillColor = ~shooting_palette(badPhilly@data$num_shootings),
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "white", # border color
    fillOpacity = 1,
    label = paste("Number of Shootings: ", 
                  badPhilly@data$num_shootings, sep = "")) %>%
  addCircles(lat = shootings$lat, lng = shootings$lng, color = "darkgrey", weight = 1, radius = 100) %>%
  suspendScroll()

You can see that the extra diagnostic layer is helpful: it shows clearly that the shooting points no longer line up with the shaded polygons.

Adding Some Proprietary Data To The Mix

Important aside: this data is a work of fiction, so:

Any resemblance to actual persons, living or dead, or actual events is purely coincidental.

Let’s take a peek at our fake data, which purports to give the number of minutes various patients on an obesity-related research project reported exercising.

fake_exercise_data <- read.csv("../Data/fake_exercise_data.csv", stringsAsFactors = FALSE)
head(fake_exercise_data)
##       mrn census_tract daily_exercise_minutes
## 1 3755308       274.02                     17
## 2 8144402       279.01                     25
## 3 4819425       247.00                     32
## 4 8974594       141.00                     32
## 5 9478166       279.01                     73
## 6 1633889       177.01                     31
exercise_per_tract <- fake_exercise_data %>% 
                      group_by(census_tract) %>%
                      summarise(mean_exercise = mean(daily_exercise_minutes)) %>%
                      ungroup()
head(exercise_per_tract)
## # A tibble: 6 x 2
##   census_tract mean_exercise
##          <dbl>         <dbl>
## 1         1             27  
## 2         2             93  
## 3         3             87  
## 4         4.01          74  
## 5         6             61  
## 6         7             49.5

Combine with Map Data Frame!

WARNING WARNING PITFALL AHEAD!

First, we’ll combine exercise_per_tract with the data found in philadelphiaCensusTracts@data. This time, we’re joining on the short tract name (NAME10), not the fully qualified GEOID (who knows why?!).

census_tracts <- merge(x=philadelphiaCensusTracts@data, y=exercise_per_tract, by.x="NAME10", by.y="census_tract", all.x = TRUE)
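
Because we’re matching a tract name column against a numeric column from the CSV, it’s worth confirming that the join actually matched some rows (our addition); with all.x = TRUE, unmatched tracts simply get NA for mean_exercise:

# How many of the tracts picked up an exercise value?
sum(!is.na(census_tracts$mean_exercise))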

Then we’ll add our enriched data back into the spatial polygon data frame, so that in addition to the fields it came with, it now contains the exercise and shooting data we gathered. It’s important to order this data by OBJECTID so that the correct polygon is associated with the correct data!

philadelphiaCensusTracts@data <- census_tracts[order(census_tracts$OBJECTID),]

Now, let’s create an interactive map! We’ll color the polygons by exercise amount to begin with.

exercise_palette <- colorBin("Blues", domain = philadelphiaCensusTracts@data$mean_exercise, bins = 8, na.color = "#cccccc")

interactive_map <- leaflet(philadelphiaCensusTracts) %>%
  addPolygons(
    data = philadelphiaCensusTracts,
    fillColor = ~exercise_palette(philadelphiaCensusTracts@data$mean_exercise),
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "white", # border color
    fillOpacity = 1) %>%
  suspendScroll()
interactive_map 

Now, let’s add some labels. We’ll interpolate values into HTML strings to create labels that show which Census tract each polygon is, along with the exercise and shooting data for that tract:

labels <- sprintf(
  "<strong>%s</strong><br/>
  Exercise in Minutes: %g <br/>
  Number of Shootings: %g",
  philadelphiaCensusTracts@data$NAMELSAD10, 
  philadelphiaCensusTracts@data$mean_exercise,
  philadelphiaCensusTracts@data$num_shootings
) %>% lapply(htmltools::HTML)

Then we’ll create the map again, but with labels. This allows the viewer to see at a glance the violence and exercise metrics for each tract!

interactive_map <- leaflet(philadelphiaCensusTracts) %>%
  setView(lng = mean(shootings$lng, na.rm = TRUE), 
          lat = mean(shootings$lat, na.rm = TRUE), zoom = 11) %>%
  addPolygons(
    data = philadelphiaCensusTracts,
    fillColor = ~exercise_palette(philadelphiaCensusTracts@data$mean_exercise),
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "white", # border color
    fillOpacity = 1,
    label = labels,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto")) %>%
  suspendScroll()
interactive_map 

Combine it All!

So, we have a couple of polygon maps and a point map. How can we combine them all in Leaflet? We’re going to use a layers control.

my_map <- leaflet() %>%
  setView(lng = mean(philadelphiaCensusTracts@bbox['x',], na.rm=TRUE), 
          lat = mean(philadelphiaCensusTracts@bbox['y',], na.rm=TRUE), zoom = 11) %>%
addPolygons(data = philadelphiaCensusTracts,
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "grey", # border color
    fillColor = ~shooting_palette(philadelphiaCensusTracts@data$num_shootings),
    fillOpacity = 1,
    label = labels,
    labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto"),
    group = "Frequency of Shootings With Injury"
  ) %>%
    addPolygons(
    data = philadelphiaCensusTracts,
    weight = 1,  # border thickness
    opacity = 0.5, # border opacity
    color = "grey", # border color
    fillColor = ~exercise_palette(philadelphiaCensusTracts@data$mean_exercise),
    fillOpacity = 1,
    label = labels,
    labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto"),
      group = "Mean Exercise in Minutes"
  ) %>%
  addCircles(lat = shootings$lat, lng = shootings$lng, color = "darkgrey", weight = 1, radius = 100, group = "Frequency of Shootings With Injury") %>%
    addCircles(lat = shootings$lat, lng = shootings$lng, color = "darkgrey", weight = 1, radius = 100, group = "Mean Exercise in Minutes") %>%
    addLayersControl(
    baseGroups = c("Frequency of Shootings With Injury", "Mean Exercise in Minutes"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  suspendScroll()

my_map
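
If you want to share the finished map outside of an R session, one option (not part of the original walkthrough) is to save it as a standalone HTML file with the htmlwidgets package:

library(htmlwidgets)
# Writes a self-contained HTML file that can be opened in any browser
saveWidget(my_map, file = "philadelphia_violence_exercise_map.html", selfcontained = TRUE)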
