Geocoding, per Google, “is the process of converting addresses (like”1600 Amphitheatre Parkway, Mountain View, CA”) into geographic coordinates (like latitude 37.423021 and longitude -122.083739), which you can use to place markers on a map, or position the map.” This demo introduces a couple of geocoding options in R.
ggmap has a function, mutate_geocode(), which geocodes using Google’s API. See the package readme here: https://cran.r-project.org/web/packages/ggmap/readme/README.html .
tmap has a function, geocode_OSM(), which geocodes using OpenStreetMap’s Nominatim API: https://www.rdocumentation.org/packages/tmap/versions/1.6-1/topics/geocode_OSM .
Before we begin, let’s load those two packages. (Remember to install any packages that are not yet installed in R.)
#install.packages("ggmap")
#install.packages("tmap")
library(ggmap)
library(tmap)
library(tmaptools)
library(tidyverse)
library(readxl) #for importing the pharmacy CSVs below
library(sf)
library(mapview)
Using each tool, we will geocode a list of addresses of CVS pharmacies in Atlanta. Another cool feature of each of these geocoders is they will, like Google Maps, accept a free-text name of a place, like, say, Variety Playhouse. We will also geocode a list of names of concert venues in Atlanta using their names only, without addresses.
Before we begin those fun things, some logistics.
Google’s API used to be free up to 25,000 queries per day. Now it’s not. Here are some articles (1, 2) reacting to the price change. Also see the above readme by the ggmap package author. For a project with a limited number of geocodes, like this demo, it is still free as far as I can tell. You get “$200 in free usage every month”. I’m not sure what $200 translates to, exactly.
To use geocode via Google’s API, a user must first register for an API key. Begin here: https://console.cloud.google.com/google/maps-apis/. Note, at this writing, that link does not load on on my CDC laptop.
This link should load, though, which describes the process: https://developers.google.com/maps/documentation/embed/get-api-key
Here are some screenshots from registering in the Cloud Console on a different device.
You do, unfortunately, have to enter a credit card number.
After going through a step asking some survey questions, this is the landing page, where you can select from various products. Click on Geocoding API, and then click Enable.
Then go to the Credentials tab and copy the API key. It will be a string of letters and numbers.
To register your Google API key, use the register_google() function from gmap. The write=TRUE argument means you don’t have to re-enter the key every time you load R on this computer. Substitute your own key for y0uR_ap1_k3y_h3r3.
ggmap::register_google("y0uR_ap1_k3y_h3r3", write=TRUE)
Let’s do a quick test to make sure the key is working.
tibble(address = c("Variety Playhouse Atlanta", "The Earl Atlanta")) %>%
mutate_geocode(address) %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326) %>%
mapview()
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Variety+Playhouse+Atlanta&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=The+Earl+Atlanta&key=xxx
The first line of code creates a tibble out of those two free-text names.
tibble(address = c("Variety Playhouse Atlanta", "The Earl Atlanta"))
The second line of code adds two new columns to that tibble, one with a latitude of each place and another with a longitude of each place. mutate_geocode(), like mutate() from dplyr, adds a new column to an existing dataframe.
tibble(address = c("Variety Playhouse Atlanta", "The Earl Atlanta")) %>%
mutate_geocode(address)
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Variety+Playhouse+Atlanta&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=The+Earl+Atlanta&key=xxx
The package also has a function simply called geocode(), which returns a tibble of lon-lat values without adding those columns to an existing dataset.
geocode(c("Variety Playhouse Atlanta", "The Earl Atlanta"))
The third line in the original code chunk converts the tibble to an sf object. st_as_sf() is a function from the sf package. The coords argument can either take a vector of numbers specifying the coordinates or a vector of the column names where the coordinates are found. In this case, the coordinates are stored in the columns named ‘lon’ and ‘lat’. The crs argument specifies the coordinate reference system. Note that after the st_as_sf() function is applied, the lon and lat values no longer appear. They are now contained in the geometry column.
tibble(address = c("Variety Playhouse Atlanta", "The Earl Atlanta")) %>%
ggmap::mutate_geocode(address) %>%
sf::st_as_sf(coords = c("lon", "lat"),crs = 4326)
And then the last line visualizes those points.
tibble(address = c("Variety Playhouse Atlanta", "The Earl Atlanta")) %>%
mutate_geocode(address) %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326) %>%
mapview()
Now let’s try importing a longer list of addresses. Here’s a list of CVS pharmacies in Atlanta (https://www.cvs.com/store-locator/cvs-pharmacy-locations/Georgia/Atlanta). I put the list on github (https://github.com/michaeldgarber/cvs-atl/blob/main/cvs_atl.csv) to make it easier to read in.
cvs_atl = "https://raw.githubusercontent.com/michaeldgarber/cvs-atl/main/cvs_atl.csv" %>%
url() %>%
read_csv()
Examine the imported data. Some of the addresses as copied directly from the CVS website failed the first time around. I removed the suite numbers and changed the one that had the phase “Across from Piedmont Hospital.”
The corrected column is called cvs_address_corrected. The original is called cvs_address_from_website.
cvs_atl
Geocode using mutate_geocode() and convert to an sf object.
cvs_atl_sf = cvs_atl %>%
mutate_geocode(cvs_address_corrected) %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326) %>%
mutate(geocode_type = "Google")
After those corrections, all of the addresses were successfully geocoded. Map the CVSs.
cvs_atl_sf %>% mapview()
Create a tibble of venues. Try a few different things with the precision of the name, like not specifying the city for Variety Playhouse and The Tabernacle, and using lowercase for some of them.
Note : I tried not specifying the city for The Tabernacle, and it sent me to a Tabernacle in Oklahoma (not shown), so I’m including the word atlanta on that one.
venues_atl = tibble(venue_name = c(
"The Earl Atlanta",
"Variety Playhouse",
"Tabernacle atlanta",
"the masquerade atlanta",
"Eddie's Attic Decatur",
"Smith's Olde Bar",
"Terminal West",
"chastain park ampitheater",
"drunken unicorn atlanta",
"state farm arena",
"fox theater atlanta"
))
venues_atl
Geocode them using mutate_geocode(). The geocoder accepts some flexibility with respect to letter case and whether or not the city is listed.
venues_atl_sf = venues_atl %>%
mutate_geocode(venue_name) %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326) %>%
mutate(geocode_type = "Google")
venues_atl_sf %>% mapview()
It’s pretty flexible in that it will take lowercase values and for some of the venues, the city didn’t need to be specified. It did squawk when I included some of the suite numbers for the CVS addresses as pasted directly from the CVS website. It’s also seemingly pretty fast, although I haven’t tried it on a super large dataset.
Cost could be a downside. I will see if my credit card gets charged for this. :).
Now let’s shift to OpenStreetMap’s geocoder. As a reminder, we’re using the geocode_OSM() function from the tmap package.
There’s no need to register a key for OpenStreetMap. Geocoding is done via the Nominatim API. See this page, which describes the policy for processing a large amount of queries:
https://wiki.openstreetmap.org/wiki/Nominatim_usage_policy
Another cool thing about the OpenStreetMap geocoder compared with Google is it returns a full bounding box, where applicable, in addition to just point locations. For example, let’s geocode a few cities using the metro dataset that comes with the tmap package. See https://rdrr.io/cran/tmap/man/metro.html.
Load and explore the metro dataset first. Note that it’s an sf object, so the point locations of the cities (centroids, presumably) are already geocoded. To make it more manageable, let’s grab the ten cities with the highest population in 2020.
data(metro) #load the metro dataset from tmap.
nrow(metro)
## [1] 436
class(metro)
## [1] "sf" "data.frame"
metro
cities_10 = metro %>%
dplyr::arrange(desc(pop2020)) %>%
slice(1:10)
cities_10 %>% mapview()
We can use geocode_osm() to return a bounding box for each city in our sample. Unfortunately, unlike mutate_geocode(), geocode_OSM() doesn’t appear to work in a pipe.
names(cities_10)
## [1] "name" "name_long" "iso_a3" "pop1950" "pop1960" "pop1970"
## [7] "pop1980" "pop1990" "pop2000" "pop2010" "pop2020" "pop2030"
## [13] "geometry"
#This piped version fails.
# cities_10_geocode = cities_10 %>%
# geocode_OSM(name_long)
#This works.
cities_10_geocode = geocode_OSM(cities_10$name_long)
Check out the output. Note that there is a point geocode (lat, lon) as well as a bounding box (lat_min, lat_max, lon_min, and lon_max).
cities_10_geocode
Look at the southwesternmost corners and the northeasternmost corners of each city, in case that’s of interest…
cities_sw = cities_10_geocode %>%
rename(city = query) %>%
dplyr::select(city, lat_min, lon_min) %>%
st_as_sf(coords = c("lon_min", "lat_min"), crs=4326) %>%
mutate(corner = "sw")
cities_ne = cities_10_geocode %>%
rename(city = query) %>%
dplyr::select(city, lat_max, lon_max) %>%
st_as_sf(coords = c("lon_max", "lat_max"), crs=4326) %>%
mutate(corner ="ne")
cities_sw %>%
bind_rows(cities_ne) %>%
mapview(zcol = "city")
Observations: * Why is Tokyo coded in the ocean? * The two points for Mexico City don’t seem to have worked. They’re both located in the center of the city, not at the corners.
cvs_atl_sf_osm = geocode_OSM(cvs_atl$cvs_address_corrected) %>%
as_tibble() %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326) %>%
mutate(geocode_type = "OSM")
## No results found for "6370 Powers Ferry Rd. Atlanta, GA 30339".
## No results found for "1554 N Decatur Road, Emory Village Atlanta, GA 30307".
## No results found for "100 Perimeter Center Pl Ne Atlanta, GA 30346".
## No results found for "2237 Cascade Road Sw Atlanta, GA 30311".
## No results found for "4300 Paces Ferry Rd Se, Suite 170, Vinings Jubilee Shopping Center Atlanta, GA 30339".
## No results found for "2014 Powers Ferry Rd. Nw, Parkwood Circle Shopping Plaza Atlanta, GA 30339".
## No results found for "3030 Headland Dr. Atlanta, GA 30311".
## No results found for "1520 Avenue Place, Suite B-100 Atlanta, GA 30329".
## No results found for "1610 Mt. Vernon Rd, Atlanta, GA 30338".
## No results found for "3615 Clairmont Road Atlanta, GA 30319".
## No results found for "2400 N Druid Hills Rd Ne Atlanta, GA".
## No results found for "2429 Martin Luther King Jr. Dr. Atlanta, GA 30311".
## No results found for "235 Peachtree St. Ne, Suite C24 Atlanta, GA 30303".
## No results found for "1455 Moreland Ave. Atlanta, GA 30316".
## No results found for "2830 N. Druid Hills Rd. Ne Atlanta, GA 30329".
mapview(cvs_atl_sf_osm)
Quite a few returned no results. How many were geocoded? Perhaps the proportion geocoded would increase if we spent some more time cleaning up the address format? The “Emory Village Atlanta” text and maybe a few of the suite numbers could be causing issues.
nrow(cvs_atl) #the original list
## [1] 32
nrow(cvs_atl_sf_osm) #the list that was geocoded by OSM
## [1] 17
#What proportion geocoded?
nrow(cvs_atl_sf_osm)/nrow(cvs_atl)
## [1] 0.53125
Moving on, though:
venues_atl_sf_osm = geocode_OSM(venues_atl$venue_name) %>%
as_tibble() %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326)
## No results found for "chastain park ampitheater".
## No results found for "drunken unicorn atlanta".
venues_atl
mapview(venues_atl_sf_osm)
nrow(venues_atl_sf_osm) #the list that was geocoded
## [1] 9
nrow(venues_atl) #the original list
## [1] 11
nrow(venues_atl_sf_osm)/nrow(venues_atl) #proportion geocoded
## [1] 0.8181818
Two were not geocoded, and Terminal West was placed at the Denver airport :).
In general, geocode_OSM() apparently works better with free-text names than with addresses. Let’s try to clean up the names a bit for the three that failed, and see if the OSM geocoder will get up to 100%.
venues_atl_cleaner = venues_atl %>%
mutate(venue_name_clean = case_when(
venue_name == "chastain park ampitheater" ~ "Chastain Park Amphitheater, Atlanta",
venue_name == "drunken unicorn atlanta" ~ "Drunken Unicorn, Atlanta",
venue_name == "Terminal West" ~ "Terminal West, Atlanta",
TRUE ~ venue_name)
)
venues_atl_sf_osm_take2 = geocode_OSM(venues_atl_cleaner$venue_name_clean) %>%
as_tibble() %>%
st_as_sf(coords = c("lon", "lat"),crs = 4326) %>%
mutate(geocode_type = "OSM")
## No results found for "Chastain Park Amphitheater, Atlanta".
## No results found for "Drunken Unicorn, Atlanta".
mapview(venues_atl_sf_osm_take2)
That change successfully moved Terminal West from Denver to Atlanta. It still isn’t in the correct spot, though – too far to the west. No improvement on the two that were missing…they’re still missing. I’m not surprised that the Drunken Unicorn is unknown to OSM, but I thought Chastain Park Ampitheater would have been picked up.
Take a look at the final output for each geocoder type to test the accuracy of OSM (assuming Google is right).
cvs_atl_sf %>%
bind_rows(cvs_atl_sf_osm) %>%
mapview(
zcol = "geocode_type",
col.regions = c("#4285F4", "#E38F9E")
)
In addition to missing some, OSM is off for a few of them. See, for example:
venues_atl_sf_osm_take2 %>%
bind_rows(venues_atl_sf) %>%
mapview(
zcol = "geocode_type",
col.regions = c("#4285F4", "#E38F9E")
)
Google’s is quite a bit more reliable. It works well for both address-based and free-text-based queries. OpenStreetMap returned many null results for the pharmacies, which were address-based queries. It performed better for the concert venues, which were free-text based, but it still missed a few. The bounding-box geocoder also didn’t seem to be entirely reliable based on the results for Tokyo and Mexico City. Additionally, I prefer the mutate_geocode() function to the geocode_osm() function because it can be used in a pipe. A downside of Google of course is that it costs money, although the cost could be free to very low for a small number of queries.