Several options for reverse geocoding (i.e. determining a specific state and county from (latitude, longitude) coordinate pairs) are explored for both performance and accuracy. Direct reverse-geocode API calls, which take about 200 msec per point, are compared to computation via "point-in-polygon" as well as machine-learning `randomForest` and `nnet` classification models.

A Random Forest model with accuracy approaching 98% improves throughput by a factor of 10^4 over a web-based API call. A neural network model has faster prediction times, but its accuracy is lower and its modeling times were prohibitively long.
Reverse geo-coded GPS coordinates are useful in cases where GPS coordinates from, for instance, a tweet or photograph are to be assigned to a specific political boundary. While reverse lookup of coordinates can be done using various web services, such as the Google Maps API, it takes on the order of 200 msec per (latitude, longitude) pair, severely limiting computing throughput. In addition, only 2500 free queries per day are allowed, further limiting data processing.

The purpose here is to dramatically improve computation times while maintaining high (~98%) accuracy.
Performance results can be difficult to communicate because they can be misinterpreted and claims overstated. In this case I standardized data collection as much as possible to ensure some level of comparability. Each measurement was taken in a standardized run framework in which the data were divided into training and test data sets and the modeling and prediction steps were individually timed.

In some cases these timing measurements differ slightly from those observed when computations were done in loops with varying numbers of data points.

For the API calls I used elapsed-time measurements, since these better reflect actual usage.
I collected data on reverse geocode times using the Google Maps API and the `{googleway}` package. Geo-locations are reverse coded using the following call,

```r
library(googleway)
data <- google_reverse_geocode(location = c(as.numeric(x["lat"]), as.numeric(x["lon"])), key = secret.key)
```

where `secret.key` is a private key generated from my Google Maps API account.
The time for each API call is taken as the elapsed time as measured by the `proc.time()` function, since this reflects the usage model (processing time is relatively short; throughput is limited by moving data across the network). It is this network time penalty which makes the calls impractical for my application.
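As a concrete illustration, a minimal sketch of how each call's elapsed time can be captured is shown below; the `time_one_call` helper and the example coordinates are illustrative rather than the exact code used.

```r
## Minimal sketch: measure the elapsed (wall-clock) time of one reverse
## geocode call, in msec. `secret.key` is a placeholder for a private key.
library(googleway)

time_one_call <- function(lat, lon, key) {
  start <- proc.time()
  google_reverse_geocode(location = c(lat, lon), key = key)
  as.numeric(1000 * (proc.time() - start)["elapsed"])
}

## e.g. time_one_call(45.52, -122.68, secret.key)   # a Portland-area point
```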
The data represent 8602 separate API calls made at 21 different times of day from three different geo-locations (Portland, Bend, and Seattle). The distribution is clearly multimodal, with peaks corresponding to the different geo-locations. The Seattle area had the fastest access, while Bend had the slowest. Internet delays are the key limiter on data throughput.

The median measured time per reverse geocoded data point is 176 msec with a 95th quantile of 275 msec, representing a throughput of 0.00568 points/msec.
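These summary statistics follow directly from the vector of per-call times; a sketch, assuming the measured times are stored in a vector `api.times` (in msec):

```r
## api.times: assumed vector of the 8602 per-call elapsed times, in msec
median(api.times)            # 176 msec
quantile(api.times, 0.95)    # 275 msec
1 / median(api.times)        # ~0.00568 points/msec throughput
```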
Point-in-Polygon (P-i-P) is a standard method for computing whether a given point lies inside a given perimeter, such as a political boundary. An advantage is that this can be computed directly, with potentially substantial performance and accuracy improvements.

To compute P-i-P I used the `{SDMTools}` package, which calculates the winding number using a "sum of angles" approach.

For the case at hand, membership in each of several political boundaries (i.e. each county) needs to be computed. This means multiple polygons must be tested for each data point, increasing the complexity of the calculation and, in principle, slowing the computation.
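A minimal sketch of this per-county test, using `pnt.in.poly()` from `{SDMTools}`, might look like the following; the `map_data()` county outlines and the `classify_county()` helper are stand-ins for the actual boundary data and code used.

```r
## Sketch: classify points by Oregon county with pnt.in.poly().
## Counties made of multiple polygon pieces are treated as one ring here
## for simplicity; the actual boundary data used may differ.
library(SDMTools)   # pnt.in.poly()
library(ggplot2)    # map_data()

or_counties <- subset(map_data("county"), region == "oregon")

classify_county <- function(pts, counties) {
  result <- rep("outside", nrow(pts))
  for (cty in unique(counties$subregion)) {
    poly   <- counties[counties$subregion == cty, c("long", "lat")]
    inside <- pnt.in.poly(pts[, c("lon", "lat")], poly)$pip == 1
    result[inside] <- cty
  }
  result
}

pts <- data.frame(lon = c(-122.68, -121.31), lat = c(45.52, 44.06))
classify_county(pts, or_counties)   # e.g. "multnomah" "deschutes"
```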
In practice, P-i-P is much faster than the API call. Based on a stand-alone benchmark run of 2000 points, the throughput is 2.26 points per msec, an improvement of roughly a factor of 400 over the API calls.

Note that as the number of points gets large, the throughput (points/time) asymptotes to a constant value of roughly 5000 points/msec.
Initially I had assumed the API method would provide 100% accurate results. However, as I investigated, I was surprised to learn it did not. Indeed, errors in the API data contributed about 2% error to the randomForest classifier developed below.

In order to establish a "grounded truth" for developing a supervised machine learning capability, it's useful to compare the results of the P-i-P and API methods. This is best seen visually by mapping points of agreement and disagreement between the two methods.
The code chunk below creates the base maps used in this work.
```r
## Create state map with county boundaries overlayed
library(ggmap)

map <- get_googlemap(center = c(lon = -120.31619, lat = 44),
                     zoom = 6, size = c(450, 340), scale = 2,
                     source = "google", maptype = "roadmap")  #, key = my.secret.key)
map.plot <- ggmap(map)

## get Oregon counties
counties <- map_data("county")
state_county <- subset(counties, region == "oregon")

## plot map with county boundaries
map.plot +
  geom_polygon(data = state_county, aes(x = long, y = lat, group = group),
               fill = NA, color = "darkblue")
```
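For reference, a sketch of how the points of disagreement could be overlaid on that base map is shown below; the comparison data frame `compare.df` (one row per point, with columns `lon`, `lat`, `api_county`, and `pip_county`) is assumed and is not part of the original code.

```r
## Sketch: flag and plot points where the API and P-i-P labels differ.
## `compare.df` (lon, lat, api_county, pip_county) is assumed here.
compare.df$agree <- compare.df$api_county == compare.df$pip_county

map.plot +
  geom_polygon(data = state_county, aes(x = long, y = lat, group = group),
               fill = NA, color = "darkblue") +
  geom_point(data = subset(compare.df, !agree),
             aes(x = lon, y = lat), color = "red", size = 0.8)
```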
Points of disagreement between the Google Maps API and the P-i-P method clearly show that the problem lies with the API calls. Note that data points along the coast are excluded, as there is a systematic (non-random) difference in how these are classified by the API and P-i-P methods.

Several points within county interiors don't agree. At this point it's not clear whether the disagreement is inherent to the Google Maps API reverse geocoding algorithms, or to the way I have interpreted the resulting data (which are returned in a very "untidy" format). This could be investigated with some more effort in data collection.
The accuracy of the API, as graded against the P-i-P method, is 97.8%.
Since high speed and (not necessarily perfect) accuracy are the goals, the question arises: can classification machine learning algorithms be trained well enough to provide adequate accuracy (e.g. close to 98%) while offering a substantial speed advantage?
Supervised machine learning requires a large number of random (latitude, longitude) “grounded truth” data points. Based on the above analysis I generated a set of random (lat, lon) pairs and used the P-i-P method for classification.
To match the intended use-case, data points are based on random sampling within a (latitude, longitude) “rectangle” that exceeds the size of Oregon. Points are then classified by county if they lie within Oregon, or as “outside” if they do not.
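A sketch of this sampling scheme, assuming the `classify_county()` helper and `or_counties` outlines from the earlier P-i-P sketch, is shown below; the sample size and the bounding-box limits are placeholders.

```r
## Sketch: generate labeled data by random sampling inside a bounding box
## larger than Oregon. Box limits are approximate placeholders.
set.seed(8675309)
n_points <- 20000   # illustrative; not the exact count used

random.pts <- data.frame(
  lon = runif(n_points, min = -125.2, max = -116.2),
  lat = runif(n_points, min =   41.8, max =   46.4)
)

## label each point with its county, or "outside" if not in Oregon
random.pts$county <- factor(classify_county(random.pts, or_counties))
```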
For this work, 18698 data points serve as both the training and test data sets.

Since speed and accuracy are goals, it makes sense to take a broad look at the accuracy and throughput of a `randomForest` model as a function of the training-set size `n_train`. To standardize evaluation, a subset of `n_test` = 2000 test data points is used for each accuracy evaluation, while the value of `n_train` is allowed to vary.
The number of trees, `ntree` = 21, was fixed after some rudimentary exploration. Within fairly wide bounds this produced good results.
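A sketch of the kind of sweep used to collect these data is shown below; it reuses the illustrative `random.pts` data frame from above and, for simplicity, scores all held-out points rather than the fixed 2000-point subset.

```r
## Sketch: sweep n_train with ntree fixed at 21, timing each fit and
## scoring accuracy on held-out points. Variable names are illustrative.
library(randomForest)

sweep_n_train <- function(data, n_train, ntree = 21) {
  train.idx <- sample(nrow(data), n_train)
  train.set <- data[train.idx, ]
  test.set  <- data[-train.idx, ]

  t0 <- proc.time()
  fit <- randomForest(county ~ ., data = train.set, ntree = ntree)
  model_time <- as.numeric(1000 * (proc.time() - t0)["elapsed"])

  t0 <- proc.time()
  pred <- predict(fit, test.set)
  predict_time <- as.numeric(1000 * (proc.time() - t0)["elapsed"])

  data.frame(n_train, model_time, predict_time,
             accuracy = 100 * mean(pred == test.set$county),
             n_test   = nrow(test.set))
}

## e.g. results <- do.call(rbind, lapply(seq(1000, 15000, by = 2000),
##                                       function(n) sweep_n_train(random.pts, n)))
```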
Here is a sample of the data collected (times in msec):

n_train | model_time | predict_time | accuracy (%) | thruput | n_test |
---|---|---|---|---|---|
1221.00 | 67.00 | 125.00 | 91.68 | 9.77 | 19802.00 |
4065.00 | 243.00 | 122.00 | 95.67 | 33.32 | 16958.00 |
6909.00 | 505.00 | 96.00 | 96.63 | 71.97 | 14114.00 |
9753.00 | 700.00 | 82.00 | 96.89 | 118.94 | 11270.00 |
12597.00 | 886.00 | 51.00 | 97.36 | 247.00 | 8426.00 |
15441.00 | 1094.00 | 44.00 | 98.06 | 350.93 | 5582.00 |
The accuracy of the `randomForest` model improves with the number of points, exceeding 95% above about 4000 data points. Note that while the model time increases with the number of data points, the time to predict is relatively constant (meaning the predict time per point scales well with the number of points predicted).

Clearly `n_train` has a large influence on the accuracy of the model. Since accuracy near 98% is desired, I will choose about 15000 data points.

Computation times (both total and normalized to the number of points) are shown below. Of interest, the modeling time per point for the `randomForest` is relatively constant.
Performance was measured as shown in the code snippet below.
```r
library(randomForest)
library(magrittr)   # for the %>% pipe

set.seed(8675309)

## set number of trees
ntree.x <- 29

## time the model fit
start.time <- proc.time()
rf.model <- randomForest(county ~ ., train.set, ntree = ntree.x)
rf.model.time <- (1000. * (proc.time() - start.time)[1]) %>% round(1) %>% as.numeric
rf.model.thruput <- (nrow(train.set) / rf.model.time) %>% round(1)

## time the prediction
start.time <- proc.time()
predicted.test <- predict(rf.model, test.set)
rf.predict.time <- (1000. * ((proc.time() - start.time)[1])) %>% as.numeric %>% round(2)
rf.predict.thruput <- (nrow(test.set) / rf.predict.time) %>% round(1)
```
In a standardized run it took 1110 msec to model the 14700 data points. Predictions made on the remaining 5000 data points took 40 msec, for a throughput of 125 data points per msec.
Model accuracy is 97.6%.
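For reference, the accuracy figure follows directly from comparing the predicted and P-i-P county labels, as in this sketch (variable names as in the snippet above).

```r
## Sketch: accuracy as the fraction of test points whose predicted county
## matches the P-i-P label.
rf.accuracy <- 100 * mean(predicted.test == test.set$county)
round(rf.accuracy, 1)   # ~97.6 in the standardized run

## a confusion table shows which counties are confused with one another:
## table(actual = test.set$county, predicted = predicted.test)
```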
The results of applying a test data set to the model are mapped below. For the `randomForest`, errors appear on the boundaries of counties; no points lie within the interiors. Note that errors occur along both straight and complex boundaries, though error rates appear to increase with boundary complexity.
Interest in neural networks as a multiple-classification machine learning model is growing. The R package `{nnet}` is a popular neural network implementation based on a feed-forward multi-layer perceptron.

In practice, the model required substantial tuning with every parameter adjustment. For this case a hidden layer of `size = 12` produced reasonably stable results, though in some cases the `decay` and `rang` parameters were tweaked as the number of modeled points in `train.set` was changed.
The training set is modeled with the following function call.

```r
library(nnet)

## fit a single-hidden-layer network (12 units) to the training set
decay.x <- 2e-4
nn.model <- nnet(county.class ~ ., train.set[, -3], size = 12, rang = 0.02,
                 decay = decay.x, maxit = 6000, softmax = TRUE, trace = FALSE)
```
The model converged after about 860 iterations with the above parameters.
Performance was measured in the same way as for the randomForest. Model times were substantially longer: it took 152.08 seconds to model 10001 data points, while prediction of 5000 points took 30 msec.
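A sketch of how the `nnet` prediction step can be timed and scored, mirroring the `randomForest` snippet, is shown below; the `test.set` column layout and the `county.class` label column are assumptions on my part.

```r
## Sketch: time and score the nnet prediction step.
## test.set is assumed to share the layout of train.set, with the
## county.class label in the same position.
start.time <- proc.time()
nn.pred <- predict(nn.model, test.set[, -3], type = "class")
nn.predict.time <- 1000 * (proc.time() - start.time)[1]

nn.accuracy <- 100 * mean(nn.pred == test.set$county.class)
```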
Model accuracy, at 94.6%, is well below our goal of 98%. The model might be improved through more extensive tuning and by modeling more data points. However, the modeling times were already prohibitively long and model stability with a training set approaching 15,000 data points was problematic, so that effort was abandoned.
The accuracy of the modeling results is shown in the map below.

Note that in the case of the `nnet` model there is, as with the other models, substantial error along county boundaries. However, for the `nnet` model the inaccuracy extends into the county interiors in many places, adding yet another reason to conclude that the `nnet` model is less desirable than `randomForest` for this application.
Various reverse-geocoding schemes were explored for both performance and accuracy, including API calls, P-i-P, and classification machine learning.

The table below, which summarizes the results, shows that a `randomForest` model gives accuracy comparable to that of the Google Maps API, nearly 98%, with a prediction throughput improvement of over 10^4.
method | accuracy (%) | predict.thruput (points/msec) | model.thruput (points/msec) |
---|---|---|---|
nnet | 94.60 | 166.67 | 0.07 |
randomForest | 97.60 | 125.00 | 13.20 |
P-i-P | 100.00 | 2.26 | |
API | 97.80 | 0.01 | |
Modeling throughput times of the `randomForest` are also reasonable (taking several seconds for on the order of 15000 data points on an Intel-based MacBook Air), making it a good choice for the intended application of rapid "on-prem" reverse geocoding.

While this method has been applied here to counties within state boundaries, the approach should be easily extensible to other reverse geocoding problems, such as zip codes, congressional districts, and state and national boundaries. This makes the randomForest reverse geocoding approach an attractive option for high-performance, high-precision reverse geocoding.