I’m currently looking for a new apartment. One important criterion for me is proximity to services and entertainment, in particular restaurants. Or maybe I was just looking for a pretext to do some R…

Anyhow, I thought I would do some web scraping + mapping in R and see which neighborhoods in Montreal are the best when it comes to dining out.

The first step was to decide where to get the data from. Well, that decision was easy, because the first sites I looked at served JavaScript-rendered pages (instead of static HTML), which meant standard web scraping would not work. Then I tried Yelp, and fortunately enough it uses static HTML (although it limits any search to 1,000 results).
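As a quick sanity check that static scraping will work, one can fetch the first results page and count the listing nodes. This is just a sketch; the .main-attributes selector reflects Yelp's markup at the time of writing and may well have changed since.

library(rvest)

# if the listings are rendered server-side, they appear in the raw html
page = read_html("http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=0&cflt=food")
length(html_nodes(page, ".main-attributes"))  # 10 per page when static scraping works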

Below is the code to scrape all 1,000 results for Montreal and put them in a data frame. I then calculate points for each neighborhood as the sumproduct of the ratings and the number of reviews:

library(rvest)
library(stringr)
library(magrittr)

scrape_page = function(start){
  
  restos = read_html(paste0("http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=", 
                            start, "&cflt=food"))
  
  n1 = restos %>% html_nodes(".main-attributes")
  n2 = restos %>% html_nodes(".secondary-attributes")
  
  df = data.frame(neighborhood=factor(), rating=numeric(), votes=numeric())
  
  #we have to extract the nodes one by one because sometimes the address field can be missing
  for(i in 1:10){
    
    main_attr = n1 %>% extract2(i)
    sec_attr = n2 %>% extract2(i)
    
    if(length(sec_attr %>% html_nodes(".neighborhood-str-list")) == 1
       && length(main_attr %>% html_nodes(".star-img")) == 1
       && length(main_attr %>% html_nodes(".review-count")) == 1){
      
      neighborhood = iconv(sec_attr %>% html_nodes(".neighborhood-str-list") %>% html_text() %>% str_trim(), 
                            from="UTF-8", to="UTF-8")
      
      # rename the neighborhoods so the names match those found in the spatial data (see below)
      if(str_detect(neighborhood, "Notre-Dame") || str_detect(neighborhood, "des-Neiges")){
        neighborhood = "Côte-des-Neiges-Notre-Dame-de-Grâce"
      }
      else if(str_detect(neighborhood, "Plateau")){
        neighborhood = "Le Plateau-Mont-Royal"
      }
      else if(str_detect(neighborhood, "Sud")){
        neighboorhood = "Le Sud-Ouest"
      }
      else if(str_detect(neighborhood, "Ville-Marie")){
        neighboorhood = "Ville-Marie"
      }
      else if(str_detect(neighborhood, "Villeray")){
        neighborhood = "Villeray-Saint-Michel-Parc-Extension"
      }
      else if(str_detect(neighborhood, "Saint-Luc")){
        neighborhood = "Côte-Saint-Luc"
      }
      else if(str_detect(neighborhood, "Trembles")){
        neighborhood = "Rivière-des-Prairies-Pointe-aux-Trembles"
      }
      else if(str_detect(neighborhood, "onard")){
        neighborhood = "Saint-Léonard"
      }
      else if(str_detect(neighborhood, "Montr")){
        neighborhood = "Montréal-Ouest"
      }
      else if(str_detect(neighborhood, "Ahun")){
        neighborhood = "Ahuntsic-Cartierville"
      }
      
      rating = as.numeric(main_attr %>% html_nodes(".star-img") %>% 
                          html_attr("title") %>% str_extract("[0-9]\\.[0-9]"))
      
      votes = as.numeric(main_attr %>% html_nodes(".review-count") %>% html_text() %>% str_extract("[0-9]+"))
      
      df = rbind(df, data.frame(neighborhood=neighborhood, rating=rating, votes=votes))
    }    
  }
  
  df
}

all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page)
all_ratings_df = do.call(rbind, all_ratings)
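
This fires 100 requests in quick succession. A politer variant (not what I actually ran, and the one-second delay is an arbitrary choice) would pause between pages:

all_ratings = lapply(seq(from=0, to=990, by=10), function(start){
  Sys.sleep(1)  # pause between requests to go easy on the server
  scrape_page(start)
})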

Now we tally up the points by neighborhood, and we’re going to score each neighborhood according to the average quality of its restaurants AND to its total number of restaurants, applying the square root function to the latter to give it less weight (the idea being that I would rather live in a neighborhood with 10 good restaurants than in one with 100 crappy ones).
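A quick back-of-the-envelope check of the damping: with a linear count, 100 restaurants would outweigh 10 by a factor of 10, but under the square root the gap shrinks to about 3.2, so average rating carries relatively more weight.

sqrt(100) / sqrt(10)  # ~3.16: the count advantage of 100 vs 10 restaurants, after damping

With that in mind, here's the actual computation: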

all_ratings_df$points = all_ratings_df$rating * all_ratings_df$votes

points = aggregate(points ~ neighborhood, data=all_ratings_df, sum)

restaurant_count = aggregate(rating ~ neighborhood, data=all_ratings_df, length)
colnames(restaurant_count)[2] = "restaurant_count"

votes = aggregate(votes ~ neighborhood, data=all_ratings_df, sum)

scores = data.frame(neighborhood=points[,1],
                    score=points[,2]/votes[,2] * sqrt(restaurant_count[,2]))
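
To eyeball the ranking before putting it on a map, one can simply sort the result:

# neighborhoods from best to worst restaurant score
scores[order(scores$score, decreasing=TRUE),]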

So now that we have the scores aggregated by neighborhood, we need a map to draw them on:

library(maptools)

tmp_dir = tempdir()
url_data = "http://donnees.ville.montreal.qc.ca/dataset/00bd85eb-23aa-4669-8f1b-ba9a000e3dd8/resource/62f7ce10-36ce-4bbd-b419-8f0a10d3b280/download/limadmin-shp.zip"
zip_file = paste0(tmp_dir, "/montreal.zip")
download.file(url_data, zip_file)
unzip(zip_file, exdir=tmp_dir)

montreal = readShapeSpatial(paste0(tmp_dir, "/LIMADMIN.shp"))
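
Before merging, it helps to list the borough names in the shapefile, since those are the exact spellings the renaming step in scrape_page() has to produce:

sort(montreal@data$NOM)  # official borough names, e.g. "Le Plateau-Mont-Royal"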

The next step is to merge the points from the web scraping step onto the shape data. Note that unfortunately, not all neighborhoods are represented on the Yelp website.

df = merge(montreal@data, scores, by.x="NOM", by.y="neighborhood", all.x=TRUE)

#reorder df to match montreal@data
df = df[rank(montreal@data$NOM),]
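
The rank() trick works because merge() returns its rows sorted by the key (NOM here), so rank(montreal@data$NOM) tells us, for each polygon, which row of df holds its name. An equivalent, perhaps more explicit version uses match():

df = df[match(montreal@data$NOM, df$NOM),]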

Now we can assign a category to each neighborhood based on its score (I chose to make 3 categories), and attribute a color to each category.

library(RColorBrewer)
library(classInt)

ci = classIntervals(df$score, n=3, style="quantile")

palette = brewer.pal(8, "YlOrRd")[c(2,4,6)]

colors = findColours(ci, palette)
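
To see where the quantile cut points landed and how many neighborhoods fall in each class, one can simply print the classIntervals object:

print(ci)  # the three intervals and the count of neighborhoods in each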

Finally, we draw the map, add a legend, and write the names of the neighborhoods using the gCentroid function from the rgeos package:

library(rgeos)

plot(montreal, col=colors)
legend('topleft', legend=c("Starvation area", "Survivable", "Foodie's heaven"), fill=palette)

centroids = gCentroid(montreal, byid=TRUE)[!is.na(df$score)]
text(centroids$x, centroids$y, labels=montreal@data$NOM[!is.na(df$score)], cex=.7)

Voilà! It is quite unfortunate that we only have the central neighborhoods of Montreal. To be continued…