Geo-spatial data is among the easiest kinds of data to make a methodological mistake with. We’ll be layering data points on top of a map to indicate location, and we need to make sure our visualizations show the right data in the right place. It’s also common to see errors in the data itself. When I map crimes in Philadelphia, I often get a cluster of data points concentrated in one area of Florida. Why? Someone made a mistake when entering the latitude and longitude points, and repeated the mistake multiple times. The most common errors with latitude and longitude are reversing the two values and getting the sign wrong (positive versus negative).
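
A quick sanity check can catch both kinds of mistake before anything is mapped. The sketch below is only an illustration, not a general-purpose validator: the three sample points are made up, the bounding box around Philadelphia is a rough assumption, and the helper name in_box is hypothetical.

```r
# Made-up sample points: one correct, one with lat/lng reversed,
# one with the minus sign missing from the longitude
points <- data.frame(
  lat = c(39.9526, -75.1652, 39.9526),
  lng = c(-75.1652, 39.9526, 75.1652)
)

# ASSUMPTION: an approximate bounding box around Philadelphia
in_box <- function(lat, lng) {
  lat >= 39.86 & lat <= 40.14 & lng >= -75.29 & lng <= -74.95
}

# Does the point fall where we expect it to?
points$ok <- in_box(points$lat, points$lng)

# Would it fall inside the box if the two values were swapped?
points$swapped <- !points$ok & in_box(points$lng, points$lat)

# Would it fall inside the box with the signs corrected?
# (Philadelphia has a positive latitude and a negative longitude.)
points$sign_flipped <- !points$ok & in_box(abs(points$lat), -abs(points$lng))

points
```

Rows flagged as swapped or sign_flipped are good candidates for a closer look before they end up on the map.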

Working With Geo-Spatial Data

Most data software has the ability to recognize cities, states, and countries by name. In R, you can simply type:

state.name
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

R will print them all out for you. It also has state abbreviations built in, with ‘state.abb’.
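
Because state.name and state.abb are stored in the same order, they can be paired into a small lookup table, which helps when one dataset uses full names and another uses abbreviations. A minimal sketch (the data frame and column names here are my own choices):

```r
# Pair full state names with their two-letter abbreviations.
# state.name and state.abb ship with base R in matching order.
state_lookup <- data.frame(
  State = state.name,
  Abbrev = state.abb,
  stringsAsFactors = FALSE
)

# Look up the abbreviation for a single state
state_lookup$Abbrev[state_lookup$State == "Pennsylvania"]
```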

With just the names of the states and a data point for each one, we could create what’s called a choropleth map, one made up of colored regions that indicate the data level. Whereas an election map will show each state as either red or blue, a choropleth map usually instead uses a range of one color to indicate a level: the more ‘red’ a state is, the more homicides per capita happened there, for instance. Which leads us to the most common geo-spatial error: not accounting for population. If I told you there were more gun deaths in California in 2020 than in Mississippi, I could show data like this:

State        Firearm_Deaths
California   3,449
Mississippi  818

(Data from the CDC)

Does California have a bigger gun problem than Mississippi? That would be the incorrect conclusion. The two states have very different populations, so their raw counts can’t be compared directly. To compare states with different populations, we have to calculate a ‘rate per x’ for our data - in this case, the rate of firearm deaths per capita:

firearm_deaths / population = rate

The same would go for countries, cities, or any other data related to a particular location.
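
To make the arithmetic concrete, here is a small worked sketch for the two states above. The death counts come from the CDC table; the population figures are an assumption on my part (2020 Census resident counts) and are not part of the CDC data shown here.

```r
# Firearm deaths in 2020, from the CDC table above
firearm_deaths <- c(California = 3449, Mississippi = 818)

# ASSUMPTION: 2020 Census resident population counts (not part of the CDC table)
population <- c(California = 39538223, Mississippi = 2961279)

# Crude rate per 100,000 residents: deaths divided by population, then scaled
rate_per_100k <- firearm_deaths / population * 100000
round(rate_per_100k, 1)
```

Mississippi’s rate comes out roughly three times California’s, even though its raw count is much smaller. These are crude rates, so they may not exactly match published figures that are adjusted for age.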

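So where can we get population data? It’s usually quite easy to find - Wikipedia is one place - and some of it is built in to R. The ‘state.x77’ dataset includes a population column, although its figures are estimates from 1975, recorded in thousands: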

state_stats <- as.data.frame(state.x77)

state_stats$Population
##  [1]  3615   365  2212  2110 21198  2541  3100   579  8277  4931   868   813
## [13] 11197  5313  2861  2280  3387  3806  1058  4122  5814  9111  3921  2341
## [25]  4767   746  1544   590   812  7333  1144 18076  5441   637 10735  2715
## [37]  2284 11860   931  2816   681  4173 12237  1203   472  4981  3559  1799
## [49]  4589   376

We could, in theory, merge these state statistics with our firearm_deaths dataframe - the only requirement is that each dataset has a shared column (in this case, State; in state_stats the state names are stored as row names, so they would first need to be copied into a column). Once merged, we could use the ‘mutate()’ function to calculate the rate.
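
Here is a sketch of what that merge might look like, using base R’s merge() so that no extra packages are needed. It shows the mechanics only: the population figures in state.x77 are 1975 estimates recorded in thousands, so dividing 2020 deaths by them does not give a meaningful rate. The column name Rate_per_100k is my own.

```r
# Deaths table from above (2020, CDC)
firearm_deaths <- data.frame(
  State = c("California", "Mississippi"),
  Firearm_Deaths = c(3449, 818)
)

# state.x77 is a matrix with state names as row names; copy them into a column
state_stats <- as.data.frame(state.x77)
state_stats$State <- rownames(state_stats)

# Join on the shared State column, keeping only the columns we need
merged <- merge(
  firearm_deaths,
  state_stats[, c("State", "Population")],
  by = "State"
)

# Population is recorded in thousands, so convert before dividing.
# CAUTION: 1975 population with 2020 deaths - mechanics only, not a real rate.
merged$Rate_per_100k <- merged$Firearm_Deaths / (merged$Population * 1000) * 100000
merged
```

With dplyr loaded, the last step could be written with mutate() instead of assigning to the column directly.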

In the case of this data, though, the CDC has already calculated a rate based on population. Looking at their table shows that California ranks 44th in firearm death rate, and Mississippi is 1st.

Geo-Spatial Data Types

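Geo-spatial data can come in a number of formats. The ones that come up in this tutorial are place names (such as the state names above), latitude/longitude pairs, geoJSON files, and Shapefiles.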

Geo-spatial plotting also requires a base layer - the map itself - along with data that is superimposed on top of the map. The location information is either already built in to the dataset you’re using, or can be merged in from another source.

Geo-Spatial R Packages

There are loads of different approaches to visualizing geo-spatial data in R. For consistency and ease, we’ll use Leaflet, which allows us to chain together steps using the ‘%>%’ pipe operator, just like the tidyverse. Let’s also load a package called ‘sf,’ or ‘simple features,’ which makes it very easy to import geoJSON files as dataframes, another called ‘sp’ that makes loading Shapefiles a breeze, and the ‘maps’ package that helps with our map ‘projection,’ or how our map deals with the curvature of the Earth.
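
The loading step itself isn’t shown here, so this is one way it might look. It is a sketch, assuming the four packages named above; it loads only the ones that are already installed and reports any that still need install.packages().

```r
# Packages named in the paragraph above
pkgs <- c("leaflet", "sf", "sp", "maps")

# Check which ones are available without attaching them yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed
if (any(!installed)) {
  message("Not installed yet: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Loading a package can print messages like the one below, which simply warns that two loaded packages share a function name.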

## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map

And here’s the example code from the Leaflet website:

library(leaflet)

m <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=174.768, lat=-36.852, popup="The birthplace of R")

m  # Print the map

What does that say? Well, first of all, we’re assigning the map to a variable - so to see it, we just need to type the name of the variable. We tell R to use leaflet(), then add a background layer of map tiles - these are supplied by OpenStreetMap - then add a marker to the map at a specified location, with a popup indicating that it’s the location where R was created. Note that the example lists the longitude (lng) before the latitude (lat), which is exactly the kind of ordering that makes reversed coordinates such a common error.

Great! If that works, let’s change the values and make our own map. I’ll center mine around the Golden Gate Bridge, and add a marker for it as well.

What’s the latitude/longitude pair for this location? I like to use Google Maps to get that info. If I go to Google Maps and enter ‘Golden Gate Bridge,’ I get this: https://www.google.com/maps/place/Golden+Gate+Bridge/@37.8199286,-122.4804438,17z/data=!3m1!4b1!4m5!3m4!1s0x808586deffffffc3:0xcded139783705509!8m2!3d37.8199286!4d-122.4782551

The latitude and longitude are literally inside the URL, although it can be hard to see them among all the other code (they are just after the ‘@’ symbol). Another way to get the lat/long pair is by clicking on the map itself to create a marker; a popup at the bottom of the screen should show the latitude / longitude pair.
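
If you’d rather not pick the numbers out by eye, a regular expression can pull them from the URL. This is a minimal sketch using base R only; it assumes the URL has the same shape as the one above, with the pair written as ‘@latitude,longitude’, and the variable names are my own.

```r
url <- "https://www.google.com/maps/place/Golden+Gate+Bridge/@37.8199286,-122.4804438,17z/data=!3m1!4b1!4m5!3m4!1s0x808586deffffffc3:0xcded139783705509!8m2!3d37.8199286!4d-122.4782551"

# Capture the two comma-separated numbers that follow the '@'.
# Element 1 of the result is the full match; 2 and 3 are the captured groups.
match <- regmatches(url, regexec("@(-?[0-9.]+),(-?[0-9.]+)", url))[[1]]

# Latitude comes first in the URL, longitude second
coords <- c(lat = as.numeric(match[2]), lng = as.numeric(match[3]))
coords
```

Notice that the longitude after the ‘@’ differs slightly from the one at the very end of the URL (after ‘!4d’), so it is worth checking the result against the map before relying on it.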

Let’s try re-using the above Leaflet code, but changing the location and popup content:

m <- leaflet() %>% 
  addTiles() %>% 
  addMarkers(lng=-122.478534, lat=37.819988, popup="Golden Gate Bridge")

m

Okay, we are able to reproduce the demo code with a new location. Now let’s get into plotting data on maps.

Continue to Part II.