Several options for reverse geocoding (i.e. determining a specific state and county from (latitude, longitude) coordinate pairs) are explored for both performance and accuracy. Direct reverse-geocode API calls, which take about 200 msec per point, are compared to computation via "point-in-polygon" as well as machine-learning `randomForest` and `nnet` classification models.

A Random Forest model with accuracy approaching 98% improves throughput by a factor of 10^4 over a web-based API call. A neural network model has faster prediction times, but its accuracy is lower and its modeling times were prohibitively long.
Reverse geo-coded GPS coordinates are useful in cases where GPS coordinates from, for instance, a tweet or photograph are to be assigned to a specific political boundary. While reverse lookup of coordinates can be done using various web services, such as the Google Maps API, it takes on the order of 200 msec per (latitude, longitude) pair, severely limiting computing throughput. In addition, only 2500 free queries per day are allowed, further limiting data processing.

The purpose here is to dramatically improve computation times while maintaining high (~98%) accuracy.
Performance results can be difficult to communicate because they can be misinterpreted and claims overstated. In this case I standardized data collection as much as possible to ensure some level of comparability. Each measurement was taken in a standardized run framework in which the data were divided into training and test data sets and the modeling and prediction steps were individually timed.

In some cases these timing measurements differ slightly from those observed when computations were done in loops with varying numbers of data points.

For the API calls I used elapsed-time measurements, since these better reflect actual usage.
I collected data on reverse geocode times using the Google Maps API and the `{googleway}` package. Geo-locations are reverse coded using the following call,

```r
library(googleway)
data <- google_reverse_geocode(location = c(as.numeric(x["lat"]), as.numeric(x["lon"])), key = secret.key)
```

where `secret.key` is a private key generated from my Google Maps API account.
The time for each API call is taken as the elapsed time as measured by the `proc.time()` function, since this reflects the usage model (processing time is relatively short; throughput is limited by moving data across the network). It is this network time penalty which makes the calls impractical for my application.
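As a concrete illustration, a minimal sketch of how each call's elapsed time can be captured is shown below; the `time_one_call` helper and the example coordinates are illustrative rather than the exact code used.

```r
## Minimal sketch: measure the elapsed (wall-clock) time of one reverse
## geocode call, in msec. `secret.key` is a placeholder for a private key.
library(googleway)

time_one_call <- function(lat, lon, key) {
  start <- proc.time()
  google_reverse_geocode(location = c(lat, lon), key = key)
  as.numeric(1000 * (proc.time() - start)["elapsed"])
}

## e.g. time_one_call(45.52, -122.68, secret.key)   # a Portland-area point
```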
The data represent 8602 separate API calls made at 21 different times of day from three different geo-locations (Portland, Bend, and Seattle). The distribution is clearly multimodal, with peaks corresponding to the different geo-locations. The Seattle area had the fastest access, while Bend had the slowest. Internet delays are the key limiter on data throughput.

The median measured time per reverse geocoded data point is 176 msec with a 95th quantile of 275 msec, representing a throughput of 0.00568 points/msec.
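These summary statistics follow directly from the vector of per-call times; a sketch, assuming the measured times are stored in a vector `api.times` (in msec):

```r
## api.times: assumed vector of the 8602 per-call elapsed times, in msec
median(api.times)            # 176 msec
quantile(api.times, 0.95)    # 275 msec
1 / median(api.times)        # ~0.00568 points/msec throughput
```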
Point-in-Polygon (P-i-P) is a standard method for computing whether a given point lies inside a given perimeter, such as a political boundary. An advantage is that this can be computed directly, with potentially substantial performance and accuracy improvements.

To compute P-i-P I used the `{SDMTools}` package, which calculates the winding number using a "sum of angles" approach.

For the case at hand, membership in each of several political boundaries (i.e. each county) needs to be computed. This means multiple polygons must be tested for each data point, increasing the complexity of the calculation and, in principle, slowing the computation.
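A minimal sketch of this per-county test, using `pnt.in.poly()` from `{SDMTools}`, might look like the following; the `map_data()` county outlines and the `classify_county()` helper are stand-ins for the actual boundary data and code used.

```r
## Sketch: classify points by Oregon county with pnt.in.poly().
## Counties made of multiple polygon pieces are treated as one ring here
## for simplicity; the actual boundary data used may differ.
library(SDMTools)   # pnt.in.poly()
library(ggplot2)    # map_data()

or_counties <- subset(map_data("county"), region == "oregon")

classify_county <- function(pts, counties) {
  result <- rep("outside", nrow(pts))
  for (cty in unique(counties$subregion)) {
    poly   <- counties[counties$subregion == cty, c("long", "lat")]
    inside <- pnt.in.poly(pts[, c("lon", "lat")], poly)$pip == 1
    result[inside] <- cty
  }
  result
}

pts <- data.frame(lon = c(-122.68, -121.31), lat = c(45.52, 44.06))
classify_county(pts, or_counties)   # e.g. "multnomah" "deschutes"
```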
In practice, P-i-P is much faster than the API call. Based on a stand-alone benchmark run of 2000 points, the throughput is 2.26 points per msec, an improvement of roughly a factor of 400 over the API calls.

Note that as the number of points gets large, the throughput (points/time) asymptotes to a constant value of roughly 5000 points/msec.
Initially I had assumed the API method would provide 100% accurate results. However, as I investigated, I was surprised to learn it did not. Indeed, errors in the API data contributed about 2% error to the randomForest classifier developed below.

In order to establish a "grounded truth" for developing a supervised machine learning capability, it's useful to compare the results of the P-i-P and API methods. This is best seen visually by mapping points of agreement and disagreement between the two methods.
The code chunk below creates the base maps used in this work.
```r
## Create state map with county boundaries overlayed
library(ggmap)

map <- get_googlemap(center = c(lon = -120.31619, lat = 44),
                     zoom = 6, size = c(450, 340), scale = 2,
                     source = "google", maptype = "roadmap")  #, key = my.secret.key)
map.plot <- ggmap(map)

## get Oregon counties
counties <- map_data("county")
state_county <- subset(counties, region == "oregon")

## plot map with county boundaries
map.plot +
  geom_polygon(data = state_county, aes(x = long, y = lat, group = group),
               fill = NA, color = "darkblue")
```
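For reference, a sketch of how the points of disagreement could be overlaid on that base map is shown below; the comparison data frame `compare.df` (one row per point, with columns `lon`, `lat`, `api_county`, and `pip_county`) is assumed and is not part of the original code.

```r
## Sketch: flag and plot points where the API and P-i-P labels differ.
## `compare.df` (lon, lat, api_county, pip_county) is assumed here.
compare.df$agree <- compare.df$api_county == compare.df$pip_county

map.plot +
  geom_polygon(data = state_county, aes(x = long, y = lat, group = group),
               fill = NA, color = "darkblue") +
  geom_point(data = subset(compare.df, !agree),
             aes(x = lon, y = lat), color = "red", size = 0.8)
```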
Points of disagreement between the Google Maps API and the P-i-P method clearly show that the problem lies with the API calls. Note that data points along the coast are excluded, as there is a systematic (non-random) difference in how these are classified by the API and P-i-P methods.

Several points within county interiors don't agree. At this point it's not clear whether the disagreement is inherent to the Google Maps API reverse geocoding algorithms, or to the way I have interpreted the resulting data (which are returned in a very "untidy" format). This could be investigated with some more effort in data collection.
The accuracy of the API, as graded against the P-i-P method, is 97.8%.
Since high speed and (not necessarily perfect) accuracy are the goals, the question arises: can classification machine learning algorithms be trained well enough to provide adequate accuracy (e.g. close to 98%) while offering a substantial speed advantage?
Supervised machine learning requires a large number of random (latitude, longitude) “grounded truth” data points. Based on the above analysis I generated a set of random (lat, lon) pairs and used the P-i-P method for classification.
To match the intended use-case, data points are based on random sampling within a (latitude, longitude) “rectangle” that exceeds the size of Oregon. Points are then classified by county if they lie within Oregon, or as “outside” if they do not.
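A sketch of this sampling scheme, assuming the `classify_county()` helper and `or_counties` outlines from the earlier P-i-P sketch, is shown below; the sample size and the bounding-box limits are placeholders.

```r
## Sketch: generate labeled data by random sampling inside a bounding box
## larger than Oregon. Box limits are approximate placeholders.
set.seed(8675309)
n_points <- 20000   # illustrative; not the exact count used

random.pts <- data.frame(
  lon = runif(n_points, min = -125.2, max = -116.2),
  lat = runif(n_points, min =   41.8, max =   46.4)
)

## label each point with its county, or "outside" if not in Oregon
random.pts$county <- factor(classify_county(random.pts, or_counties))
```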
For this work, 18698 data points serve as both the training and test data sets.

Since speed and accuracy are goals, it makes sense to take a broad look at the accuracy and throughput of a `randomForest` model as a function of the training-set size `n_train`. To standardize evaluation, a subset of `n_test` = 2000 test data points is used for each accuracy evaluation, while the value of `n_train` is allowed to vary.
The number of trees, `ntree` = 21, was fixed after some rudimentary exploration. Within fairly wide bounds this produced good results.
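A sketch of the kind of sweep used to collect these data is shown below; it reuses the illustrative `random.pts` data frame from above and, for simplicity, scores all held-out points rather than the fixed 2000-point subset.

```r
## Sketch: sweep n_train with ntree fixed at 21, timing each fit and
## scoring accuracy on held-out points. Variable names are illustrative.
library(randomForest)

sweep_n_train <- function(data, n_train, ntree = 21) {
  train.idx <- sample(nrow(data), n_train)
  train.set <- data[train.idx, ]
  test.set  <- data[-train.idx, ]

  t0 <- proc.time()
  fit <- randomForest(county ~ ., data = train.set, ntree = ntree)
  model_time <- as.numeric(1000 * (proc.time() - t0)["elapsed"])

  t0 <- proc.time()
  pred <- predict(fit, test.set)
  predict_time <- as.numeric(1000 * (proc.time() - t0)["elapsed"])

  data.frame(n_train, model_time, predict_time,
             accuracy = 100 * mean(pred == test.set$county),
             n_test   = nrow(test.set))
}

## e.g. results <- do.call(rbind, lapply(seq(1000, 15000, by = 2000),
##                                       function(n) sweep_n_train(random.pts, n)))
```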
Here is a sample of the data collected (times in msec):

n_train | model_time | predict_time | accuracy (%) | thruput | n_test |
---|---|---|---|---|---|
1221.00 | 67.00 | 125.00 | 91.68 | 9.77 | 19802.00 |
4065.00 | 243.00 | 122.00 | 95.67 | 33.32 | 16958.00 |
6909.00 | 505.00 | 96.00 | 96.63 | 71.97 | 14114.00 |
9753.00 | 700.00 | 82.00 | 96.89 | 118.94 | 11270.00 |
12597.00 | 886.00 | 51.00 | 97.36 | 247.00 | 8426.00 |
15441.00 | 1094.00 | 44.00 | 98.06 | 350.93 | 5582.00 |
The accuracy of the `randomForest` model improves with the number of points, exceeding 95% above about 4000 data points. Note that while the model time increases with the number of data points, the time to predict is relatively constant (meaning the predict time per point scales well with the number of points predicted).

Clearly `n_train` has a large influence on the accuracy of the model. Since accuracy near 98% is desired, I will choose about 15000 data points.

Computation times (both total and normalized to the number of points) are shown below. Of interest, the modeling time per point for the `randomForest` is relatively constant.
Performance was measured as shown in the code snippet below.
```r
library(randomForest)
library(magrittr)   # for the %>% pipe

set.seed(8675309)

## set number of trees
ntree.x <- 29

## time the model fit
start.time <- proc.time()
rf.model <- randomForest(county ~ ., train.set, ntree = ntree.x)
rf.model.time <- (1000. * (proc.time() - start.time)[1]) %>% round(1) %>% as.numeric
rf.model.thruput <- (nrow(train.set) / rf.model.time) %>% round(1)

## time the prediction
start.time <- proc.time()
predicted.test <- predict(rf.model, test.set)
rf.predict.time <- (1000. * ((proc.time() - start.time)[1])) %>% as.numeric %>% round(2)
rf.predict.thruput <- (nrow(test.set) / rf.predict.time) %>% round(1)
```
In a standardized run it took 1110 msec to model the 14700 data points. Predictions made on the remaining 5000 data points took 40 msec, for a throughput of 125 data points per msec.
Model accuracy is 97.6%.
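For reference, the accuracy figure follows directly from comparing the predicted and P-i-P county labels, as in this sketch (variable names as in the snippet above).

```r
## Sketch: accuracy as the fraction of test points whose predicted county
## matches the P-i-P label.
rf.accuracy <- 100 * mean(predicted.test == test.set$county)
round(rf.accuracy, 1)   # ~97.6 in the standardized run

## a confusion table shows which counties are confused with one another:
## table(actual = test.set$county, predicted = predicted.test)
```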
The results of applying a test data set to the model are mapped below. For the `randomForest`, errors appear on the boundaries of counties; no points lie within the interiors. Note that errors occur along both straight and complex boundaries, though error rates appear to increase with boundary complexity.
Interest in neural networks as a multiple-classification machine learning model is growing. The R package `{nnet}` is a popular neural network implementation based on a feed-forward multi-layer perceptron.

In practice, the model required substantial tuning with every parameter adjustment. For this case a hidden layer of `size = 12` produced reasonably stable results, though in some cases the `decay` and `rang` parameters were tweaked as the number of modeled points in `train.set` was changed.
The training set is modeled with the following function call.

```r
library(nnet)

## fit a single-hidden-layer network (12 units) to the training set
decay.x <- 2e-4
nn.model <- nnet(county.class ~ ., train.set[, -3], size = 12, rang = 0.02,
                 decay = decay.x, maxit = 6000, softmax = TRUE, trace = FALSE)
```
The model converged after about 860 iterations with the above parameters.
Performance was measured in the same way as for the randomForest. Model times were substantially longer: it took 152.08 seconds to model 10001 data points, while prediction of 5000 points took 30 msec.
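A sketch of how the `nnet` prediction step can be timed and scored, mirroring the `randomForest` snippet, is shown below; the `test.set` column layout and the `county.class` label column are assumptions on my part.

```r
## Sketch: time and score the nnet prediction step.
## test.set is assumed to share the layout of train.set, with the
## county.class label in the same position.
start.time <- proc.time()
nn.pred <- predict(nn.model, test.set[, -3], type = "class")
nn.predict.time <- 1000 * (proc.time() - start.time)[1]

nn.accuracy <- 100 * mean(nn.pred == test.set$county.class)
```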
Model accuracy, at 94.6%, is well below our goal of 98%. The model might be improved through more extensive tuning and by modeling more data points. However, the modeling times were already prohibitively long and model stability with a training set approaching 15,000 data points was problematic, so that effort was abandoned.
The accuracy of the modeling results is shown in the map below.

Note that in the case of the `nnet` model there is, as with the other models, substantial error along county boundaries. However, for the `nnet` model the inaccuracy extends into the county interiors in many places, adding yet another reason to conclude that the `nnet` model is less desirable than `randomForest` for this application.
Various reverse-geocoding schemes were explored for both performance and accuracy, including API calls, P-i-P, and classification machine learning.

The table below, which summarizes the results, shows that a `randomForest` model gives accuracy comparable to that of the Google Maps API, nearly 98%, with a prediction throughput improvement of over 10^4.
method | accuracy (%) | predict.thruput (points/msec) | model.thruput (points/msec) |
---|---|---|---|
nnet | 94.60 | 166.67 | 0.07 |
randomForest | 97.60 | 125.00 | 13.20 |
P-i-P | 100.00 | 2.26 | |
API | 97.80 | 0.01 | |
Modeling throughput times of the `randomForest` are also reasonable (taking several seconds for on the order of 15000 data points on an Intel-based MacBook Air), making it a good choice for the intended application of rapid "on-prem" reverse geocoding.

While this method has been applied here to counties within state boundaries, the approach should be easily extensible to other reverse geocoding problems, such as zip codes, congressional districts, and state and national boundaries. This makes the randomForest reverse geocoding approach an attractive option for high-performance, high-precision reverse geocoding.