• A case study of the Scrambled-data problem
    • Plotting and viewing data before merge()
    • Plotting and viewing data after a merge()
    • Fixing the Problem
    • Using wmap() to never have to worry about this

A case study of the Scrambled-data problem

One of the bread-and-butter tasks of data analysis is relating one data source or type to another. In R, merging two data sets is straightforward: the merge() function uses one or more identifier columns to join two objects. However, things get complicated when you start dealing with R’s SpatialPolygonsDataFrame objects, and problems created by merges can be difficult to discover. Long story short, the only link between the @data slot and the @polygons slot is the order of the rows. If the order of the rows is altered (and the order of the polygons is not changed identically), the mapped data will no longer represent the values of the correct geographies.
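To make the row-order link concrete, here is a minimal base-R sketch. It uses a plain list as a stand-in for the @polygons slot and a data frame as a stand-in for @data; both toy objects are assumptions for illustration, not real sp classes. Nothing but position ties the two together, so sorting one without the other silently mislabels the shapes.

```r
# Toy stand-ins (hypothetical): a list plays the role of @polygons,
# a data frame plays the role of @data.
shapes <- list("shape of Massachusetts", "shape of Rhode Island", "shape of Maine")
attrs  <- data.frame(state_name = c("Massachusetts", "Rhode Island", "Maine"),
                     stringsAsFactors = FALSE)

# Row i of attrs describes element i of shapes; position is the only link.
shape_for <- function(attrs, shapes, name) {
  shapes[[which(attrs$state_name == name)]]
}
before <- shape_for(attrs, shapes, "Massachusetts")

# Sort the attributes alone (as merge() does by default) and look up again.
attrs_sorted <- attrs[order(attrs$state_name), , drop = FALSE]
after <- shape_for(attrs_sorted, shapes, "Massachusetts")

print(before)  # "shape of Massachusetts"
print(after)   # "shape of Rhode Island"
```

The lookup by name still succeeds after sorting, which is exactly why this kind of error is easy to miss.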

Let’s say that you are interested in New England, and you have a shapefile for these states.

Plotting and viewing data before merge()

plot(new_england, main="New England")

With the following ID fields/attributes:

new_england@data
##    state    state_name polygon_order
## 1:    25 Massachusetts             1
## 2:    44  Rhode Island             2
## 3:    23         Maine             3
## 4:    50       Vermont             4
## 5:    36      New York             5
## 6:    33 New Hampshire             6
## 7:     9   Connecticut             7

Note that the field polygon_order has been added. No such field normally exists; it is included here as a reference for the original order of the data rows as they relate to the polygons.
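A reference column like this has to be created before any merge takes place. The sketch below shows the idea on a plain data frame standing in for new_england@data (the stand-in is an assumption; on the real object the same assignment would target new_england@data$polygon_order).

```r
# Stand-in for new_england@data, with values copied from the table above.
attrs <- data.frame(
  state      = c(25, 44, 23, 50, 36, 33, 9),
  state_name = c("Massachusetts", "Rhode Island", "Maine", "Vermont",
                 "New York", "New Hampshire", "Connecticut"),
  stringsAsFactors = FALSE
)

# Record the current row position. This matches the polygon order
# only as long as nothing has been reordered yet.
attrs$polygon_order <- seq_len(nrow(attrs))
print(attrs)
```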

If you would like, you can single out a specific state to make sure your IDs and map polygons match up:

plot(new_england[new_england@data$state_name=="Massachusetts",],main="Massachusetts")

You are interested in mapping the amount of hard alcohol consumed per capita in these areas, and you have data on this from AEDS in 2014.

spirits
##       state_name spirits_pc
## 1: Massachusetts      0.912
## 2:         Maine      0.930
## 3:       Vermont      0.722
## 4: New Hampshire      1.958
## 5:   Connecticut      0.922
## 6:  Rhode Island      1.008
## 7:      New York      0.788

Plotting and viewing data after a merge()

You want to merge this data onto the attributes (@data slot) of the SpatialPolygonsDataFrame. Let’s make a copy of the new_england SpatialPolygonsDataFrame to experiment.

new_england_copy<-copy(new_england)
new_england_copy@data<-merge(new_england_copy@data,spirits,by="state_name")

We can use the wmap() function (contained within the woodson pallettes package) to map the values that exist within the data of the SpatialPolygonsDataFrame:

wmap(chloropleth_map=new_england_copy,
     geog_id="state_name",
     variable="spirits_pc",
     map_title="Spirits per capita in New England (after Merge)")

Looks plausible! But wait, isn’t New Hampshire, not Vermont, the state with no liquor tax and the highest value in the table? What’s happening here?! Let’s do a spot-check:

plot(new_england_copy[new_england_copy@data$state_name=="Massachusetts",],main="Massachusetts.. or is it?")

OH NO! That’s Maine, not Massachusetts!!

Fixing the Problem

A little-known consequence of using the merge() function is that, by default (sort = TRUE), the result is re-ordered based on the fields used for the merge. We can see this if we check the data of new_england_copy after the merge: the data is now alphabetized on state name, and polygon_order is no longer sequential.

new_england_copy@data
##       state_name state polygon_order spirits_pc
## 1:   Connecticut     9             7      0.922
## 2:         Maine    23             3      0.930
## 3: Massachusetts    25             1      0.912
## 4: New Hampshire    33             6      1.958
## 5:      New York    36             5      0.788
## 6:  Rhode Island    44             2      1.008
## 7:       Vermont    50             4      0.722
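The reordering can be reproduced without any spatial objects. This sketch rebuilds the two tables above as plain data frames (an assumption; the originals print like data.table objects) and shows that base merge() returns rows sorted on the by column. Comparing polygon_order against 1..n is a cheap way to detect the scramble.

```r
# Stand-ins for new_england@data and spirits, values copied from above.
attrs <- data.frame(
  state         = c(25, 44, 23, 50, 36, 33, 9),
  state_name    = c("Massachusetts", "Rhode Island", "Maine", "Vermont",
                    "New York", "New Hampshire", "Connecticut"),
  polygon_order = 1:7,
  stringsAsFactors = FALSE
)
spirits <- data.frame(
  state_name = c("Massachusetts", "Maine", "Vermont", "New Hampshire",
                 "Connecticut", "Rhode Island", "New York"),
  spirits_pc = c(0.912, 0.930, 0.722, 1.958, 0.922, 1.008, 0.788),
  stringsAsFactors = FALSE
)

# merge() sorts the result on the "by" column unless told otherwise.
merged <- merge(attrs, spirits, by = "state_name")

# TRUE only if the rows still line up with the polygons.
still_aligned <- identical(merged$polygon_order, seq_len(nrow(merged)))
print(merged$polygon_order)  # 7 3 1 6 5 2 4
print(still_aligned)         # FALSE
```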

Luckily, we can fix this by re-ordering the data to correspond with the polygons’ ordering (which we saved in our polygon_order column).

new_england_copy@data<-copy(new_england_copy@data[order(polygon_order)])
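The line above uses data.table syntax, which works because @data here is a data.table. For a plain data frame, an equivalent base-R sketch (on a stand-in table, not the real spatial object) could look like the following, with an assertion that fails loudly if the order is still wrong.

```r
# Stand-in for the merged, alphabetized @data table shown earlier.
merged <- data.frame(
  state_name    = c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
                    "New York", "Rhode Island", "Vermont"),
  polygon_order = c(7L, 3L, 1L, 6L, 5L, 2L, 4L),
  spirits_pc    = c(0.922, 0.930, 0.912, 1.958, 0.788, 1.008, 0.722),
  stringsAsFactors = FALSE
)

# Put the rows back into the original polygon order.
restored <- merged[order(merged$polygon_order), ]

# Stop immediately if the rows do not run 1, 2, ..., n.
stopifnot(identical(restored$polygon_order, seq_len(nrow(restored))))
print(restored$state_name)
```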

Now we can plot Massachusetts again:

plot(new_england_copy[new_england_copy@data$state_name=="Massachusetts",],main="Massachusetts is Back")

Now, we can map with confidence:

wmap(chloropleth_map=new_england_copy,
     geog_id="state_name",
     variable="spirits_pc",
     map_title="Spirits per capita in New England (after ordering)")

For more solutions and strategies to solve and/or prevent this problem, please see the post on bringing spatial data into R.
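One common prevention strategy, sketched here as an illustration and not necessarily the approach taken in that post, is to skip merge() altogether and pull the new column in with match(), which never reorders the target table. The data frames below are stand-ins for new_england@data and spirits.

```r
# Stand-in for new_england@data, in polygon order.
attrs <- data.frame(
  state_name = c("Massachusetts", "Rhode Island", "Maine", "Vermont",
                 "New York", "New Hampshire", "Connecticut"),
  stringsAsFactors = FALSE
)
spirits <- data.frame(
  state_name = c("Massachusetts", "Maine", "Vermont", "New Hampshire",
                 "Connecticut", "Rhode Island", "New York"),
  spirits_pc = c(0.912, 0.930, 0.722, 1.958, 0.922, 1.008, 0.788),
  stringsAsFactors = FALSE
)

# For each row of attrs, find the row of spirits with the same name,
# then copy the value across. The row order of attrs is untouched.
idx <- match(attrs$state_name, spirits$state_name)
attrs$spirits_pc <- spirits$spirits_pc[idx]
print(attrs)
```

Because only a column is added, the rows stay aligned with the polygons and no re-sorting step is needed afterwards.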

Using wmap() to never have to worry about this

Also worth noting: the wmap() function can merge on your data (provided a unique ID) and map your variables, taking care of all of this under the hood. Let’s try passing the spirits data to wmap() separately (not merged in):

wmap(chloropleth_map=new_england,
     data=spirits,
     geog_id="state_name",
     variable="spirits_pc",
     map_title="Spirits per capita in New England (using wmap(), no merge needed)")
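How wmap() handles this internally is not shown here. The following is only a hypothetical sketch of the kind of checks an order-preserving join could perform when it relies on a unique ID; the helper name attach_by_id and its arguments are invented for illustration and are not part of wmap() or any package.

```r
# Hypothetical helper: adds one column from new_data to attrs by ID,
# without changing the row order of attrs.
attach_by_id <- function(attrs, new_data, id, variable) {
  # A duplicated ID would make the lookup ambiguous, so refuse it.
  if (anyDuplicated(new_data[[id]]) > 0) {
    stop("'", id, "' is not unique in new_data")
  }
  idx <- match(attrs[[id]], new_data[[id]])
  if (anyNA(idx)) {
    warning("some rows of attrs have no match in new_data")
  }
  attrs[[variable]] <- new_data[[variable]][idx]
  attrs
}

attrs   <- data.frame(state_name = c("Vermont", "Maine"),
                      stringsAsFactors = FALSE)
spirits <- data.frame(state_name = c("Maine", "Vermont"),
                      spirits_pc = c(0.930, 0.722),
                      stringsAsFactors = FALSE)

result <- attach_by_id(attrs, spirits, "state_name", "spirits_pc")
print(result)
```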