During an evening in my local pub I was chatting to a friend who is a local tour guide. During the evening I asked him how he decides to include various venues when planning his route. The answer being ‘…all my years of experience’.
Having recently finished modules on the Johns Hopkins Data Specialisation course I wondered if it would be possible to select venues using the wisdom of the crowd and then create a route using web services underpinned by a little R code.
Eager to practice my newly acquired R skills, I created this code. I must warn you, I have around 6 months experience of R and I’m not a professional coder. In the spirit of continual improvement, I welcome any constructive feedback.
Here’s the rough plan I sketched out on the back of a beer mat.
The first challenge I had was authenticating with the Yelp service. OAuth works on the basis of encoding a URL using private keys to create an encrypted signature.Adding this signature along with keys into the URL request allows authentication with no password. Very neat.
For demonstration purposes I’ve selected bars in Chester. Chester is my home town and there seem to be more bars and restaurants than anything else here.
If you wish to pick a different Yelp category you can pick one from this page.
Yelp’s API returns a rich set of data. For my purposes I selected the business name and the latitude/longitude values (for mapping later). Their API returns JSON which is a data structure presented by name, value pairs. I used an R package called RJSONIO along with Regular Expressions to extract the key data into an R data structure. JSON is very easy to work with.
#Yelp API - Search for 10 bars in Chester.
#Add your own keys here from the yelp site
#CONSUMER_KEY = ""
#CONSUMER_SECRET = ""
#TOKEN = ""
#TOKEN_SECRET = ""
yelp.url <- function(category, city,radius,cons_key, cons_sec, tok, tok_sec) {
#Add escape codes to the search parameter
category1 <- gsub(pattern = " ", replacement = "%2520", x = category)
category2 <- gsub(pattern = " ", replacement = "+", x = category)
city1 <- gsub(pattern = " ", replacement = "%2520", x = city)
city2 <- gsub(pattern = " ", replacement = "+", x = city)
#Create random string for use in encoding
nonce <- paste(sample(x = c(0:9, letters, LETTERS, "_"), size = 32, replace = TRUE), collapse= "")
#Time stamp
tm <- format(x = Sys.time(), "%s")
#Prepare URL for authenticating
#Tried CurlEscape() but needed finer control over escape strings
api.url <- paste("GET\u0026http%3A%2F%2Fapi.yelp.com%2Fv2%2Fsearch\u0026",
"category_filter%3D", category1,"%26",
"limit%3D10%26location%3D",city1,"%26",
"oauth_consumer_key%3D", cons_key, "%26",
"oauth_nonce%3D", nonce, "%26",
"oauth_signature_method%3DHMAC-SHA1%26",
"oauth_timestamp%3D", tm, "%26",
"oauth_token%3D", tok, "%26",
"radius_filter%3D",radius,"%26",
"sort%3D2",
sep = "")
#Encode signature
signature <- as.character(curlPercentEncode(base64(
hmac(key=paste(cons_sec, tok_sec, sep="&"),
object=api.url, algo="sha1", serialize=FALSE, raw=TRUE))))
#Build the URL to be used in the request
api.url <- paste("http://api.yelp.com/v2/search?",
"category_filter=", category2,"&",
"limit=10&location=",city2,"&",
"oauth_consumer_key=", cons_key, "&",
"oauth_nonce=", nonce, "&",
"oauth_signature=", signature, "&",
"oauth_signature_method=HMAC-SHA1&",
"oauth_timestamp=", tm, "&",
"oauth_token=", tok, "&",
"radius_filter=",radius,"&",
"sort=2",
sep = "")
return(api.url)
}
yelp.data <- function(category="bars", city="Chester", radius=3500) {
require(package="digest")
require(package="RJSONIO")
require(package="RCurl")
require(package="stringr")
require("dplyr")
#Create the signed URL
y.url <- yelp.url(category = category,city=city,
radius=radius,
cons_key = CONSUMER_KEY,
cons_sec = CONSUMER_SECRET,
tok = TOKEN,
tok_sec = TOKEN_SECRET)
#Post URL to Yelp and process the JSON returned
x <- getURL(y.url) %>% fromJSON()
#Extract lat/long pairs and concatenate
businesses <- unlist(x$businesses)
business.names <- unname(businesses[grep('^name$',names(businesses))])
latlong <- businesses[grep('^location.coordinate.',names(businesses))]
latlong.matrix <- apply(matrix(latlong, ncol = 2, byrow = TRUE), 1, paste, collapse = ",")
latlong.matrix2 <- matrix(latlong, ncol = 2, byrow = TRUE)
o.latlong <- paste(latlong.matrix,sep="|",collapse="|")
df <- data.frame(business.names,latlong.matrix2)
output <- list(LatLong = o.latlong, "Names" = business.names, df=df)
return(output)
}
We call this function by passing in parameters as follows.
city <- "Chester UK"
radius <- "3500" #metres
category <- "bars"
#Find the the top 10 bars from Yelp.com
y <- yelp.data(category=category, city=city,radius=radius)
y$df
## business.names X1 X2
## 1 Telfords Warehouse 53.1940384 -2.8972299
## 2 Old Harkers Arms 53.193418 -2.881248
## 3 Bear & Billet 53.186744 -2.890165
## 4 Albion Inn 53.1884117 -2.88802
## 5 Ye Olde Boot Inn 53.1905138631287 -2.89043889005541
## 6 Bar Lounge 53.189836 -2.893896
## 7 The Liverpool Arms 53.1932812 -2.8929561
## 8 The Botanist 53.1912196 -2.8902453
## 9 The Mount Inn 53.1912378 -2.8709356
## 10 Amber Lounge 53.1902965 -2.8919373
Thanks to the R package TSP we can calculate the optimal route between points using different algorithms that are the product of computational research carried out over many years. I didn’t want to concern myself with the details of each algorithm so for the purpose of this experiment I ran each of them and picked the one generating the shortest distance.
Before I could call the TSP package I had to create a symmetrical distance matrix.
I queried the distances between venues using Google Maps Distance API. This is a useful API enabling data acquisition in 1 call.
Much like Yelp, Google returns data in the JSON format. One problem I discovered here is that Google don’t always give back a symmetrical matrix. I guess this is related to their route planning algorithms and 1 way streets. In any case, we can solve this using forceSymmetric().
construct.distance.url <- function(origins, return.call = "json", sensor = "false") {
root <- "https://maps.googleapis.com/maps/api/distancematrix/"
u <- paste(root, return.call, "?origins=", origins, "&destinations=",
origins,"&mode=walking", sep = "")
return(URLencode(u))
}
distance.matrix <- function(address,verbose=FALSE) {
require("RCurl")
require("Matrix")
require("dplyr")
if(verbose) cat(address,"\n")
u <- construct.distance.url(address) %>% getURL(ssl.verifypeer = FALSE) %>% fromJSON()
if(u$status=="OK") {
#Turn JSON into 2 dim matrix, measure=Distance Value in Metres
x2 <- unlist(u$rows)
output<-matrix(as.numeric(unname(x2[grep('distance.value',names(x2))])),
ncol=as.numeric(nrow(y$df))) %>% forceSymmetric()
return(output)
} else {
return("There was a problem with the web query")
}
}
x <- distance.matrix(y$LatLong)
x
## 10 x 10 Matrix of class "dsyMatrix"
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 1181 1150 1197 784 907 381 701 2026 769
## [2,] 1181 0 1183 912 774 1018 972 826 849 880
## [3,] 1150 1183 0 271 486 556 770 612 1788 418
## [4,] 1197 912 271 0 459 602 816 479 1516 464
## [5,] 784 774 486 459 0 243 403 126 1415 105
## [6,] 907 1018 556 602 243 0 527 369 1658 138
## [7,] 381 972 770 816 403 527 0 320 1767 389
## [8,] 701 826 612 479 126 369 320 0 1435 231
## [9,] 2026 849 1788 1516 1415 1658 1767 1435 0 1520
## [10,] 769 880 418 464 105 138 389 231 1520 0
This distance matrix is all we need for TSP. Notice the different TSP methods are being passed to solveTSP() using sapply. After the solutions have been created I pick the route with the shortest distance.
#Calculate the minimal distances using algorithms described in 'methods'
#Returns the shortest route
tsp.route <- function(places,names){
require("TSP")
items <- as.numeric(NROW(names))
city.matrix <- matrix(places,nrow=items, ncol=items, dimnames=list(names,names))
tsp <- TSP(city.matrix)
methods <- c("nearest_insertion", "farthest_insertion",
"cheapest_insertion","arbitrary_insertion","nn",
"repetitive_nn", "2-opt")
tours <- sapply(methods, FUN = function(m)
solve_TSP(tsp,method = m),simplify=FALSE)
best <- tours[which.min(c(sapply(tours, FUN = attr, "tour_length")))]
best.route <- names(best[[1]])
best.distance <- tour_length(tsp,best[[1]])
output <- list(route = best.route, distance.travelled = best.distance)
return(output)
}
To call this function I pass in the distance matrix along with names of the venue (for displaying on the map)
w.route<-tsp.route(x,y$Names)
w.route
## $route
## [1] "Albion Inn" "Bear & Billet" "Bar Lounge"
## [4] "Amber Lounge" "Ye Olde Boot Inn" "The Botanist"
## [7] "The Liverpool Arms" "Telfords Warehouse" "Old Harkers Arms"
## [10] "The Mount Inn"
##
## $distance.travelled
## [1] 5443
A little function coupling code sneaked in that was needed to sort the Yelp venues using the route returned by TSP. I did this as the Yelp data already has the location data I need for the mapping. Of course, this can be eliminated with a little code refactoring.
#Sort the Yelp data using optimal route (to be refactored)
sorted.route <- y$df[match(w.route$route, y$df$business.names),]
We’re near the end. The final step is to plot the route on Google Maps. I used the excellent GGMAP package for this. I needed to plot a route, waypoints, and display a legend.
The Route class within GGMAP queries the Google Directions API. We push in the various lat/long pairs and Google conveniently returns coordinates that represent a leg of the route. The Route class combines each leg of the journey, making a route. I then plot the completed route to a map layer. The points are placed on a layer using the lat/long pairs for each venue.
I hit a thorny problem when it came to adding a legend. GGMAP is usually good at this but in my case I didn’t really need to associate the legend with a data series. Thanks to the wizards in the community for suggesting adding a dummy layer and associating the legend with that.
create.map<-function(lst){
require("ggmap")
require("plyr")
require("dplyr")
#Create DF and prevent factors from being created.
way.points <- data.frame(lapply(lst[,1:3], as.character), stringsAsFactors=FALSE)
#Combine the row number with the business names to related points to legend labels
way.points <- mutate(way.points, business.names = paste(seq_along(X1), business.names, sep = " "))
#Call Route() in 1 pass
rte.from <- apply(way.points[-nrow(way.points),2:3],1,paste,collapse=",")
rte.to <- apply(way.points[-1,2:3],1,paste,collapse=",")
rte <- do.call(rbind,
mapply(route, rte.from, rte.to, SIMPLIFY=FALSE,
MoreArgs=list(mode="walking",output="simple",structure="leg")))
#Work out the rough centre point of the map
map.centre <- c(mean(as.numeric(way.points$X2)),mean(as.numeric(way.points$X1)))
#Load the coordinates from Route() to be used to plot the paths
coords <- rbind(as.matrix(rte[,7:8]),as.matrix(rte[nrow(rte),9:10])) %>%
as.data.frame()
#Create the Map - first 2 layers are the path and point. The second geom_point is a
#a dummy one used to define the legend.
ggm <- qmap(location=map.centre,zoom = 15,maptype = "road",legend="bottomleft")
ggm +
geom_path(data=coords,aes(x=startLon,y=startLat),color="blue",size=2)+
geom_point(data=way.points,aes(x=as.numeric(X2),y=as.numeric(X1)),
size=10,color="yellow")+
geom_point(data=way.points,
aes(x=as.numeric(X2),y=as.numeric(X1),color =
factor(business.names, levels=unique(business.names))),
alpha = 0) +
geom_text(data=way.points,
aes(x=as.numeric(X2),y=as.numeric(X1), label=seq_along(X1)))+
scale_color_discrete(name = "Venues") +
labs(title=paste("The optimal route for the top rated Venues to visit in ",city,
" within a ",as.numeric(radius)/1000,"km radius",sep=""))+
theme(legend.key = element_rect(fill = NA),legend.position = c(-0.28, 0.61),
plot.title = element_text(hjust = 0, vjust = 1, face = c("bold")))
}
create.map(sorted.route)
In my day job as a data analyst I’m more accustomed to querying traditional data sources using methods such as SQL, DAX, and MDX. In my opinion, as more and more business applications move to the cloud, we’ll see the need to use techniques demonstrated in this experiment increase.
It’s thanks to the amazing R community (in particular, Stack contributors’ & package authors) that I was able to solve this problem. The R community is a strong foundation of the R eco-system.
If you want to reproduce this you’ll need access to the following packages.
You’ll also need to set up a Yelp developer account to source your own keys.