
Libraries needed for this section are:

Data needed:

If you haven’t already, create a directory R_Workshop on your Desktop. Then set R_Workshop as your working directory in RStudio (Session > Set Working Directory > Choose Directory...), and download the files above.

About geocoding

What is Geocoding?1

  • “Geocoding is the process of transforming a description of a location (such as an address, name of a place, or coordinates) to a location on the earth’s surface.”
  • “A geocoder is a piece of software or a (web) service that implements a geocoding process i.e. a set of inter-related components in the form of operations, algorithms, and data sources that work together to produce a spatial representation for descriptive locational references.”
  • “Reverse geocoding uses geographic coordinates to find a description of the location, most typically a postal address or place name.” (I have rarely needed to do this.)

How is it done?

There are a number of ways, for example:

  • Interpolation for street addresses

Interpolation for street address geocoding

  • Rooftop level for street addresses

Rooftop geocoding
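
To make the interpolation idea concrete, here is a minimal sketch in R. It assumes a straight street segment with known coordinates at both ends and a known house-number range. The function name interpolate_address and all numbers are made up for illustration; real geocoders also deal with odd/even street sides, curved segments, and offsets from the centerline.

```r
# Hypothetical helper: estimate the position of a house number by linear
# interpolation along a straight street segment.
interpolate_address <- function(house_no, from_no, to_no,
                                from_lon, from_lat, to_lon, to_lat) {
  stopifnot(to_no != from_no)
  # fraction of the way along the segment (0 = start, 1 = end)
  frac <- (house_no - from_no) / (to_no - from_no)
  data.frame(lon = from_lon + frac * (to_lon - from_lon),
             lat = from_lat + frac * (to_lat - from_lat))
}

# Made-up segment: house numbers 100 to 200 between two made-up points
est <- interpolate_address(150, 100, 200,
                           from_lon = -75.10, from_lat = 39.90,
                           to_lon = -75.00, to_lat = 40.00)
print(est)
```

The estimate is only as good as the assumption that houses are evenly spaced along the segment, which is one reason interpolated results can be less precise than rooftop-level ones.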

Issues

  • Quality of input data or: how specific is your location information?
  • Quality of output data: return rate, accuracy and precision
  • Regional differences: Geocoding addresses in the US is a different beast than in Nigeria. Geocoding addresses in suburban Chicago is different from geocoding in rural Alabama.
  • Limitations of geocoding services: Building and maintaining an accurate global geocoding service is very resource intensive. So is running a geocoding server, particularly if it is hit with millions of requests per second. For that reason geocoding services are usually very limited in what they offer for free.

About APIs

What is an API and what does it have to do with geocoding?


Exercise 1

  1. Go to your browser.
  2. Point it to: http://maps.googleapis.com/maps/api/geocode/xml?address=Stanford+CA&sensor=false
  3. What just happened?
  4. Now write the URL to return the location of the city of Santiago – what do we get here?
  5. Now we want only the city in Spain. Extra bonus: Request the results in JSON format.
    If you need, here is some documentation: https://developers.google.com/maps/documentation/geocoding/

What have we learned?

  • We can use an API to access a service (or tools) provided by someone else without knowing the details of how this service or tool is implemented.

  • A geocoding API provides a direct way to access these services via an HTTP request (simply speaking: a URL).

  • A geocoding service API request must be in a particular form as specified by the service provider.

  • Geocoding service responses are returned in a structured format, which is typically XML or JSON, sometimes also KMZ.
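
As a rough illustration of these points in R, the sketch below builds a request URL by hand and then pulls coordinates out of a JSON response. The response text is made up and heavily shortened; it only mimics the general shape of a geocoding reply, so treat the field names as an assumption to check against your provider's documentation. Nothing is sent over the network here.

```r
library(jsonlite)

# A request is the endpoint, a "?", and parameters joined by "&".
# URLencode() escapes spaces, commas, and other special characters.
base_url <- "http://maps.googleapis.com/maps/api/geocode/json"
address  <- "Santiago, Spain"
request  <- paste0(base_url, "?address=", URLencode(address, reserved = TRUE))
print(request)

# Made-up, shortened response (assumption: real replies have many more fields)
response_text <- '{
  "status": "OK",
  "results": [
    {"formatted_address": "Example Place",
     "geometry": {"location": {"lat": 1.5, "lng": -2.5}}}
  ]
}'

# Parse the JSON text into nested R lists and extract the coordinates
parsed <- fromJSON(response_text, simplifyVector = FALSE)
lat <- parsed$results[[1]]$geometry$location$lat
lon <- parsed$results[[1]]$geometry$location$lng
print(c(lat = lat, lon = lon))
```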

Our goal now is to do from R what we just did in a web browser. For this we also have to take into account:

  • Authentication: Using the geocoders often requires a valid API key. The key can usually be requested for free by registering at the website of the provider.
  • Rate Limiting: Geocode service providers typically use a rate limiting mechanism to ensure that the service stays available to all users.
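
Both points can be handled with a few lines of R. The sketch below is one possible pattern, not a provider-specific recipe: the environment variable name GEOCODER_API_KEY, the throttle helper, and the pause length are all made up, and a stand-in geocoder is used so the code runs without network access.

```r
# Read the key from an environment variable rather than hard-coding it.
# GEOCODER_API_KEY is a made-up name for this sketch.
api_key <- Sys.getenv("GEOCODER_API_KEY", unset = NA)

# Wrap any single-address geocoding function so that calls are spaced out.
# `pause` is in seconds; check your provider's terms for the actual limit.
throttle <- function(geocode_one, pause = 0.2) {
  function(address, ...) {
    Sys.sleep(pause)            # wait before each request
    geocode_one(address, ...)
  }
}

# Stand-in geocoder that returns NA coordinates (no request is made)
fake_geocoder <- function(address, key = NA) {
  data.frame(address = address, lat = NA_real_, lon = NA_real_,
             stringsAsFactors = FALSE)
}

slow_geocoder <- throttle(fake_geocoder, pause = 0.1)
started <- Sys.time()
res <- do.call(rbind, lapply(c("A St", "B St", "C St"), slow_geocoder,
                             key = api_key))
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
print(res)
```

Keeping the key out of the script also means the script can be shared without leaking credentials.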

There are many geocoding providers2. They vary in terms of format specifications, access and use(!) restrictions, quality of results, and more. So choosing a geocoder for a research project depends on the specifics of your situation.

Generic structure of a geocoding script

  1. Load address data. Clean them up if necessary.

  2. Send each address to the geocoding service (typically: create a URL request for each address)

  3. Process results. Extract lat/lon and any other values you are interested in and turn them into a convenient format, usually a table.

  4. Save the output.
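
The four steps above can be sketched as a skeleton in R. Everything here is a placeholder: the inline addresses are made up, and geocode_one is a stand-in that returns NA coordinates where a real script would send a request to a geocoding service.

```r
# 1. Load address data and clean it up (made-up inline data)
addresses <- data.frame(
  id = 1:3,
  address = c(" 380 New York St, Redlands, CA ", "Stanford, CA", ""),
  stringsAsFactors = FALSE
)
addresses$address <- trimws(addresses$address)       # strip stray whitespace
addresses <- addresses[addresses$address != "", ]    # drop empty addresses

# 2. Send each address to the geocoding service.
# `geocode_one` is a stand-in; replace its body with a real request.
geocode_one <- function(address) {
  data.frame(lat = NA_real_, lon = NA_real_)
}
results <- lapply(addresses$address, geocode_one)

# 3. Process results: one table, bound back to the input rows
coords <- do.call(rbind, results)
geocoded <- cbind(addresses, coords)

# 4. Save the output (to a temporary folder in this sketch)
out_file <- file.path(tempdir(), "geocoded.csv")
write.csv(geocoded, out_file, row.names = FALSE)
print(geocoded)
```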

Geocoding with the Google Maps API

We will start by using the geocode command from the ggmap library.


Exercise 2

  1. Install (if you haven’t) and load the ggmap library.
  2. Using the geocode command, how would you search for the location of the city of Santiago?
  3. How does your result compare to what you got when you did the query through the web browser? (Hint: check out the output= option of the command)
  4. Now geocode this address: 380 New York St, Redlands, CA

R implementation example of a geocoding script

Now let’s write an R script to process an entire list of addresses for geocoding this way. We will use the generic script above and implement it in R:

  1. Load address data.
    To save a step, we will make use of the readr library, which allows us to read the csv into a data frame without downloading it to the desktop, like so:
banks <- read_csv(url("https://www.dropbox.com/s/z0el6vfg1vtmxw5/PhillyBanks_sm.csv?dl=1"))  # we need the `readr` library for this!
  2. Send each address to the geocoding service.
    The nice thing is that geocode can take a vector of addresses. So all we have to do is find out where the addresses are in our banks data frame and then submit them to the function.
banksCoords <- geocode([PUT THE ADDRESS VECTOR HERE])
  3. Process results.
    Most of this is already taken care of by the geocode function. We only need to bind the lat/lon coordinates back to our original data frame. We use the cbind function for this, like:
banksCoords <- data.frame(cbind(banks, banksCoords))
  4. Save the output.
    Saving as csv is pretty easy; we can use write.table, for example. If we wanted to save it as a shapefile, we’d need to convert the data frame to a spatial object first, as we did in an earlier session, and then save with writeOGR.

Exercise 3

  1. Taking the steps outlined above, put together a script that will geocode the Philly Bank addresses and save the output to a shapefile.
    One way to do this is here.

Geocoding with the ArcGIS API (Stanford Affiliates Only)

Thanks to our fabulous geospatial manager Stace Maples, who is tirelessly working to make our GIS lives easier, we have our own geolocator at Stanford at

http://locator.stanford.edu/arcgis/rest/services/geocode

The services available here cover the US only. The good news is that there is no limit on how many addresses you can throw at this server. However, you should let Stace know if you intend to run a major job!

To use this service:

Now let’s put together a URL that will determine the location for 380 New York St, Redlands, CA.3

Here is what we need:

ArcGIS also requires the input addresses to be in JSON format, which means they need to look like this:

  addresses=
  {
    "records": [
      {
        "attributes": {
          "OBJECTID": 1,
          "SingleLine": "380 New York St., Redlands, CA, 92373"
        }
      }
    ]
  }

We attach all the request parameters to the request URL after a ?

That makes for this very convoluted URL:

http://locator.stanford.edu/arcgis/rest/services/geocode/Composite_NorthAmerica/GeocodeServer/geocodeAddresses?addresses={"records":[{"attributes":{"OBJECTID":1,"SingleLine":"380 New York St., Redlands, CA"}}]}&token=<YOUR TOKEN>&f=pjson

What a mess.
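
Typing such a URL by hand is error-prone, so it can help to let R assemble it. The sketch below builds the addresses parameter from nested R lists with jsonlite and then URL-encodes it. The variable names are made up, the token stays a placeholder, and nothing is sent to the server.

```r
library(jsonlite)

# Build the "addresses" parameter from R objects instead of typing JSON.
# Named lists become JSON objects, unnamed lists become JSON arrays.
payload <- list(records = list(
  list(attributes = list(OBJECTID = 1,
                         SingleLine = "380 New York St., Redlands, CA"))
))
addresses_json <- as.character(toJSON(payload, auto_unbox = TRUE))
print(addresses_json)

# Escape braces, quotes, spaces, etc. so they are safe inside a URL
encoded <- URLencode(addresses_json, reserved = TRUE)

# Assemble the request; <YOUR TOKEN> is left as a placeholder
endpoint <- paste0("http://locator.stanford.edu/arcgis/rest/services/geocode/",
                   "Composite_NorthAmerica/GeocodeServer/geocodeAddresses")
request <- paste0(endpoint, "?addresses=", encoded,
                  "&token=", "<YOUR TOKEN>", "&f=pjson")
print(request)
```

The auto_unbox = TRUE argument matters here: without it, jsonlite writes single values as one-element arrays such as [1], which is not the shape shown above.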

ArcGIS takes addresses in single-line and multiline mode. The addresses in your table can be stored in a single field (as used above) or in multiple fields, one for each address component (Street, City, etc). Batch geocoding performance is better when the address parts are stored in separate fields (multiline). However, if there is an error in your batch, all the addresses in that batch that have already been geocoded will be dropped.
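
Because an error costs you the whole batch, one way to limit the damage is to keep batches small. The sketch below splits a made-up address vector into batches and builds the records structure shown above for each one. The batch size of 3 is arbitrary, and restarting OBJECTID at 1 within each batch is an assumption of this sketch.

```r
# Made-up address vector for illustration
addresses <- paste(1:7, "Example St, Philadelphia, PA")

# Split into batches of at most `batch_size` addresses
batch_size <- 3
batch_id <- ceiling(seq_along(addresses) / batch_size)
batches <- split(addresses, batch_id)

# Turn one batch into the "records" structure used for single-line input
make_records <- function(batch) {
  list(records = lapply(seq_along(batch), function(i) {
    list(attributes = list(OBJECTID = i, SingleLine = batch[[i]]))
  }))
}
records <- lapply(batches, make_records)

# Number of addresses in each batch
print(lengths(batches))
```

Each element of records could then be serialized with jsonlite::toJSON(..., auto_unbox = TRUE) and sent as one request.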


Exercise 4

  1. Request a token.
  2. Copy the URL from above, replace the placeholder with your token, and paste it into a browser.
  3. Try to understand the result.

Now let’s run the same addresses from above with the ArcGIS geocoder.

Here, again are our steps.

  1. Load address data.
    Like above. Check.

  2. Send each address to the geocoding service.
    For this service we don’t have a convenient function to do this, so we have to write our own.

  3. Process results.
    We will do this in the same function. Here it is:

## begin geocode function
## Takes a token and an address as a single line, one at a time (SingleLine API).
## Needs more work for errors: e.g. what if no results are returned? etc.
geocodeSL <- function(address, token) {
    # load the libraries
    require(httr)
    require(jsonlite)
    
    # the server URL
    gserver <- "http://locator.stanford.edu/arcgis/rest/services/geocode/Composite_NorthAmerica/GeocodeServer/"
    
    # template for SingleLine format
    pref <- "{'records':[{'attributes':{'OBJECTID':1,'SingleLine':'"
    suff <- "'}}]}"
    
    # make a valid URL
    url <- URLencode(paste0(gserver, "geocodeAddresses?addresses=", pref, address, 
        suff, "&token=", token, "&f=json"))
    
    # submit the request
    rawdata <- GET(url)
    
    # parse JSON to get the content
    res <- content(rawdata, "parsed", "application/json")
    # process the result
    resdf <- with(res$locations[[1]], {
        data.frame(lat = attributes$Y, lon = attributes$X, status = attributes$Status, 
            score = attributes$Score, side = attributes$Side, matchAdr = attributes$Match_addr)
    })
    # return as data frame
    return(resdf)
}
## end geocode function

I have uploaded this function here, so to use it from within R, you can “source” it like this:

source("https://www.dropbox.com/s/k520ukglnrhzyj3/geocodeSL.R?dl=1")
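
The comment at the top of geocodeSL notes that it needs more work for the case where no results are returned. One possible way to handle that, as a sketch only: a helper that takes the parsed response and returns a row of NAs when there is no location. The name extract_location is made up, and the two example responses are hand-written lists that imitate the fields geocodeSL reads; they are not real server output.

```r
# Hypothetical helper: pull the fields out of one parsed response, returning
# a row of NAs when there is no location instead of failing.
extract_location <- function(res) {
  empty <- data.frame(lat = NA_real_, lon = NA_real_, status = NA_character_,
                      score = NA_real_, side = NA_character_,
                      matchAdr = NA_character_, stringsAsFactors = FALSE)
  if (is.null(res$locations) || length(res$locations) == 0) return(empty)
  a <- res$locations[[1]]$attributes
  # replace a missing (NULL) field with NA so data.frame() does not fail
  pick <- function(x, na) if (is.null(x)) na else x
  data.frame(lat = pick(a$Y, NA_real_), lon = pick(a$X, NA_real_),
             status = pick(a$Status, NA_character_),
             score = pick(a$Score, NA_real_),
             side = pick(a$Side, NA_character_),
             matchAdr = pick(a$Match_addr, NA_character_),
             stringsAsFactors = FALSE)
}

# Made-up parsed responses: one with a match, one without
hit  <- list(locations = list(list(attributes = list(
  Y = 34.05, X = -117.19, Status = "M", Score = 100, Side = "R",
  Match_addr = "Example match"))))
miss <- list(locations = list())

both <- rbind(extract_location(hit), extract_location(miss))
print(both)
```

Returning an NA row keeps the output aligned with the input, so the results can still be bound back to the address table row by row.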

Unfortunately, the geocodeSL function is not as convenient as the geocode function we used earlier. So we have to loop through our addresses ourselves and save the result to a data frame. Before that you should set myToken to the value of your token and make sure that you have the httr and jsonlite libraries installed.

Once that’s taken care of, we can do:

banksCoords <- do.call("rbind", sapply(banks$Address, function(x) geocodeSL(x, 
    myToken), simplify = FALSE))
  4. Save the output.
    As in the prior exercise.
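
When looping over many addresses with geocodeSL, a single failed request stops the whole run. A possible safeguard, sketched here with a stand-in function so it runs offline, is to wrap each call in tryCatch and record NA for failures. The name fake_geocodeSL, the addresses, and the token are all made up.

```r
# Stand-in for geocodeSL() so the sketch runs offline: fails on one input
fake_geocodeSL <- function(address, token) {
  if (address == "bad address") stop("no result")
  data.frame(lat = 1, lon = 2)
}

addresses <- c("good address", "bad address", "another good address")

# Geocode each address; on error, keep the address and fill in NA
safe_rows <- lapply(addresses, function(x) {
  tryCatch(
    data.frame(address = x, fake_geocodeSL(x, "dummy-token"),
               stringsAsFactors = FALSE),
    error = function(e) data.frame(address = x, lat = NA_real_,
                                   lon = NA_real_, stringsAsFactors = FALSE)
  )
})
coords <- do.call(rbind, safe_rows)
print(coords)
```

Rows with NA coordinates can then be inspected and resubmitted after the addresses are cleaned up.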

Exercise 5

  1. Using the provided function geocodeSL, try to geocode the same address table as above.
    One way to do this is here.

A word about Open Data Science Toolkit (DSK)

The open Data Science Toolkit (DSK) is available as a self-contained Vagrant VM or EC2 AMI that you can deploy yourself. It includes a Google-style geocoder which emulates Google’s geocoding API. This API uses data from the US Census and OpenStreetMap, along with code from GeoIQ and Schuyler Erle’s Modular Street Address Geocoder.

Instructions for how to run DSK on Amazon or Vagrant are here: http://www.datasciencetoolkit.org/developerdocs#amazon

Note that geocode from ggmap also has the option to access DSK, but it will use their public server, which is often slow or unavailable.

Geocoding IP addresses

If you are interested in doing this in R, see here: https://github.com/cengel/r_IPgeocode


  1. All from: https://en.wikipedia.org/wiki/Geocoding

  2. For a quite comprehensive overview look here and here.

  3. I found this helpful. Even though it is about ESRI’s World Geocoder it is very applicable for other ESRI geocoders.