Web scraping, a brief tutorial

Your last challenge was to sift through my example code and take what you need. You should have something like this:

rm(list=ls())
library(tidyverse)
library(rvest)
grab_wiki_links <- function(url, keep_duplicates = FALSE){
  links <- read_html(url) %>% # read page
      html_nodes("a") %>% # link nodes
      html_attr("href") %>% # links
      str_subset(":", negate = TRUE) %>% # remove links with colons
      str_subset("^#", negate = TRUE) %>% # remove anchor links for this page
      str_subset("^/wiki/") %>% # only keep links to wiki pages
      str_subset("/wiki/Main_Page", negate = TRUE) # exclude main page
  if(keep_duplicates){
    return(links)
  } else {
    return(unique(links))
  }
}

rel_to_abs_link <- function(relative_link){
  base_url <- "https://en.wikipedia.org"
  paste(base_url,
        relative_link,
        sep="")
}

link_to_title <- function(relative_link){
  # '.*' ensures this also works with absolute links
  str_replace(relative_link, ".*/wiki/", "")
}

link_to_row <- function(relative_link){
  page <- link_to_title(relative_link)
  link <- relative_link %>% 
    rel_to_abs_link() %>%
    grab_wiki_links()
  link_name <- link_to_title(link)
  out <- tibble(page = page,
                link = link,
                link_name = link_name) %>%
    nest(links = c(link, link_name))
  return(out)
}

df_to_df <- function(first_df){
  first_df %>%
    unnest(links) %>%
    # note: `link` holds "/wiki/..." paths while `page` holds bare titles, so
    # this filter never actually drops anything; `!link_name %in% page` would
    # skip pages that have already been scraped
    filter(!link %in% page) %>%
    pull(link) %>%
    unique() %>%
    map_df(link_to_row)
}


page_to_df_recursive <- function(url, depth = 3){
  # to make sure we can plug in an absolute url...
  url <- str_replace(url,".*/wiki/","/wiki/")
  out <- link_to_row(url)
  for(layer in 1:depth){
    out <- df_to_df(out)
  }
  return(out)
}

add_degree <- function(df){
  df %>% 
    mutate(degree = map(links, nrow)) %>%
    unnest(degree)
}

first_url <- "https://en.wikipedia.org/wiki/Graph_theory"

graph_theory <- page_to_df_recursive(first_url, 1) %>%
  add_degree()

Part 2

Now let’s do something interesting with the data. To start, make sure you’ve got ggraph and tidygraph installed.

rm(list = ls())
# pacman is a handy library I recently learned about. This code will load 
# the libraries and install any that aren't already installed.
# Of course you have to start with 'install.packages("pacman")'
pacman::p_load(netrankr, ggraph, tidygraph, tidyverse) 

I’m going to use the data I’ve already downloaded and saved because I don’t want to have to wait again.

df <- read_rds("data_graph_3.rds")

To make this useful, I want to create a dataframe to represent all the links from one topic to another. Given the current setup, this shouldn’t be too bad…

edges <- df %>% 
  unnest(links) %>%
  select(-link) %>%
  rename(from = page,
         to = link_name)
dim(edges)
## [1] 10161289        2

Okay, with 10 million connections coming out of roughly 45,000 scraped pages, I’ve got far too much data to plot the whole network, but that’s actually the least interesting part of this anyway. Let’s see what tidygraph can do for us. By default…

graph <- as_tbl_graph(edges)
graph
## # A tbl_graph: 1110134 nodes and 10161289 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 1,110,134 x 1 (active)
##   name                        
##   <chr>                       
## 1 Plot_(graphics)             
## 2 Graph_(discrete_mathematics)
## 3 Functional_graph            
## 4 Mathematics                 
## 5 Function_(mathematics)      
## 6 Ordered_pair                
## # … with 1,110,128 more rows
## #
## # Edge Data: 10,161,289 x 2
##    from    to
##   <int> <int>
## 1     1 44747
## 2     1 44748
## 3     1 44749
## # … with 10,161,286 more rows

That converts our list of links (aka "edges") between nodes (aka "vertices") into a graph object. It’s telling me I’ve got 1.1 million nodes and 10.1 million connections, so the average node has an in-degree of roughly nine.
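
If you want to double-check that back-of-the-envelope number, a tbl_graph is also an igraph object under the hood, so (as a quick sketch) you can divide the edge count by the node count directly:

# With the counts printed above (10,161,289 edges over 1,110,134 nodes),
# this works out to roughly 9.
igraph::gsize(graph) / igraph::gorder(graph)

But let’s take a closer look.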

We’ll activate() the nodes because tidygraph splits our data into two tidy dataframes (one for nodes and one for edges). If tidygraph didn’t exist, we could still calculate a simple centrality from edges by grouping and counting our data, but why not let someone else’s code do the work for us?
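
For reference, here’s a minimal dplyr-only sketch of that group-and-count idea, using the edges tibble we built above (so from and to are the columns we created earlier):

# In-degree by hand: count how many edges point at each page.
edges %>%
  count(to, name = "in_links", sort = TRUE) %>%
  head(10)

Now back to tidygraph: what are the pages with the most connections?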

graph %>% 
  activate(nodes) %>%
  mutate(centrality = centrality_degree(mode = "in")) %>%
  arrange(desc(centrality))
## # A tbl_graph: 1110134 nodes and 10161289 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 1,110,134 x 2 (active)
##   name              centrality
##   <chr>                  <dbl>
## 1 ISBN_(identifier)      29109
## 2 Doi_(identifier)       22798
## 3 GND_(identifier)       11314
## 4 LCCN_(identifier)      10926
## 5 ISSN_(identifier)       9514
## 6 Wayback_Machine         8719
## # … with 1,110,128 more rows
## #
## # Edge Data: 10,161,289 x 2
##    from     to
##   <int>  <int>
## 1  8772  60914
## 2  8772 201699
## 3  8772  88221
## # … with 10,161,286 more rows

Huh. That’s not what I wanted. It looks like two-thirds of my pages have a really boring reference section that says something like "this book has some ISBN… here’s the page explaining what an ISBN is…", even though that’s rarely relevant to the topic itself. Let’s dig through the documentation and try to filter out anything with "(identifier)" in it.

graph %>%
  activate(nodes) %>%
  # the parentheses form a regex group here, so this matches any name containing
  # "identifier"; wrap the pattern in fixed() to match the literal "(identifier)"
  filter(str_detect(name,"(identifier)", negate = TRUE)) %>%
  mutate(centrality = centrality_degree(mode = "in")) %>%
  arrange(desc(centrality))
## # A tbl_graph: 1109967 nodes and 9909401 edges
## #
## # A directed multigraph with 1887 components
## #
## # Node Data: 1,109,967 x 2 (active)
##   name                       centrality
##   <chr>                           <dbl>
## 1 Wayback_Machine                  8701
## 2 Mathematics                      5745
## 3 United_States                    3773
## 4 Computer_science                 3079
## 5 Operating_system                 3064
## 6 Cambridge_University_Press       2608
## # … with 1,109,961 more rows
## #
## # Edge Data: 9,909,401 x 2
##    from     to
##   <int>  <int>
## 1  8701  60691
## 2  8701 201327
## 3  8701  88035
## # … with 9,909,398 more rows

Much better! The Wayback Machine is really interesting. It’s probably in there for the same reason as the identifiers ("here’s a citation… here’s where it is on the Wayback Machine…"), but I’m going to leave it in because it’s a magical thing that deserves to be more widely known.

The tidygraph package takes our messy network data and handles it as two tidy dataframes in the background: one for nodes (Wikipedia pages) and one for edges (links between pages). activate(nodes) tells R to work with the node data, so we can filter out all the pages that have "(identifier)" in their title. I also use centrality to sort the data: here I’m arranging the pages in descending order of the number of other pages that link to them (their "in" links).

I’m also going to restrict myself to a small subset of my dataset because the code below will take a long time to run otherwise. I’ll work with the smaller subset, get my code working, then apply that code to the larger data set.

set.seed(12345)
graph_small <- edges %>% 
  sample_frac(1/100) %>%
  as_tbl_graph()
graph_small
## # A tbl_graph: 79651 nodes and 101613 edges
## #
## # A directed multigraph with 4035 components
## #
## # Node Data: 79,651 x 1 (active)
##   name               
##   <chr>              
## 1 Toshiba            
## 2 Gallica            
## 3 PDF/VT             
## 4 Lamar_University   
## 5 Genome_sequencing  
## 6 Minkowski_spacetime
## # … with 79,645 more rows
## #
## # Edge Data: 101,613 x 2
##    from    to
##   <int> <int>
## 1     1 30270
## 2     2 30271
## 3     3 23717
## # … with 101,610 more rows

That should be easier to deal with. While setting up this coding challenge I initially tried to forge ahead with the full dataset and had to leave R running overnight. It failed because I hadn’t loaded a necessary package.

I’m going to calculate some centrality measures using convenient functions from tidygraph. Since I’ll want to apply the same procedure to the larger graph, I’m going to set up a function to save myself time (and reduce the risk of mistakes) later on. I’m also going to have it skip the random walk measure by default, just to save a little more time.

calc_centrality <- function(graph, rand_walk = FALSE){
  out <- graph %>%
    activate(nodes) %>%
    mutate(in_degree = centrality_degree(mode = "in"),
           out_degree = centrality_degree(mode = "out"),
           betweenness = centrality_betweenness(),
           page_rank = centrality_pagerank())
  if(rand_walk){
    out <- out %>%
      mutate(rand_walk = centrality_random_walk())
  }
  return(out)
}

system.time({
  graph_small <- graph_small %>% 
    calc_centrality()
})
##    user  system elapsed 
## 161.326   0.005 161.348

Okay, even then it’s taking a ton of time (about 2 minutes on my machine). I’m going to make an even smaller graph. Since I’ve already got graph_small, I’ll use that for the plots, but it’ll be nice to have an even smaller network handy if I’m doing any more calculating.

set.seed(12345)
graph_smaller <- edges %>% 
  sample_frac(1/10000) %>%
  as_tbl_graph()
graph_smaller
## # A tbl_graph: 1963 nodes and 1016 edges
## #
## # A directed multigraph with 954 components
## #
## # Node Data: 1,963 x 1 (active)
##   name               
##   <chr>              
## 1 Toshiba            
## 2 Gallica            
## 3 PDF/VT             
## 4 Lamar_University   
## 5 Genome_sequencing  
## 6 Minkowski_spacetime
## # … with 1,957 more rows
## #
## # Edge Data: 1,016 x 2
##    from    to
##   <int> <int>
## 1     1   987
## 2     2   988
## 3     3   989
## # … with 1,013 more rows
system.time({
  graph_smaller <- graph_smaller %>% 
    calc_centrality(rand_walk = FALSE)
})
##    user  system elapsed 
##    0.04    0.00    0.04

Alright! I’m willing to wait around for 4 hundredths of a second. FYI, I ran that last bit of code with rand_walk set to TRUE and it took waaaayyy longer (I didn’t even wait for it to finish before stopping R so I could move on). If you’re going to try this with your own network data, either a) make sure your dataset isn’t too big, or b) be prepared to let it run overnight.

What is centrality?

Let’s take a quick digression through graph theory.

Kevin Bacon has been in a lot of movies with a lot of other actors. And those actors have been in a lot of other movies with a lot of other actors. We can calculate a Bacon number for any actor as the number of degrees of separation between them and Kevin Bacon. In network terms, this is a question of shortest path length. For example, to connect Elvis Presley to Kevin Bacon we can observe that Elvis and Ed Asner were in a movie together, and Ed Asner was in a movie with Kevin Bacon. Asner has a Bacon number of 1 and Elvis a Bacon number of 2. The hypothesis is that nobody has a Bacon number greater than 6; that is, there are no more than six degrees of separation between any actor and Kevin Bacon.
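
Just to make the degrees-of-separation idea concrete with our Wikipedia data, here’s a hedged sketch: a tbl_graph is also an igraph object, so igraph::distances() gives shortest path lengths between nodes. The two page names below are assumptions for illustration; swap in pages you know are in your scrape.

# A "Bacon number" for Wikipedia pages: the shortest path length between two
# nodes, following links in the "out" direction.
node_names <- graph %>% activate(nodes) %>% as_tibble() %>% pull(name)
from_id <- which(node_names == "Graph_theory")
to_id   <- which(node_names == "Kevin_Bacon")
if (length(from_id) == 1 && length(to_id) == 1) {
  igraph::distances(graph, v = from_id, to = to_id, mode = "out")
}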

But is there something special about Kevin Bacon? Maybe… But maybe it’s really the case that a lot of actors have been in movies with Harvey Keitel.

I really hope Six Degrees of Kevin Bacon is in my dataset, but I can’t look till R finishes calculating all those centrality measures. Centrality is just a way of asking which Wikipedia page is most important. The first two measures are the simplest: how many links does a page have? Out-links are less impressive; I could make a Wikipedia page called "list of pages I decided to add to this list" and just add a link to every page, but that wouldn’t make my new page important. In-links are a better signal: if many pages link to a page, that page probably matters, and a page that no other page links to probably doesn’t.

Degree is simple, easy to calculate and understand. And so it’s a natural starting point, but there are a lot of other more sophisticated ways to think about how important a node is in a network.

Betweenness is a question of how often a page sits on the path connecting two other pages. Like Ed Asner linking Elvis to Kevin Bacon, a page that connects many different concepts is probably important. Now, we can’t just look at any old connections. I could fly to London via Winnipeg, but I wouldn’t, because New York has direct flights. We want to restrict our view to shortest paths. If I were flying to London from Winnipeg I could very well end up connecting through JFK, so JFK probably has high betweenness: no matter where you’re flying from, there’s a pretty good chance your shortest route passes through it.
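
Here’s a tiny made-up hub-and-spoke example (the airports and routes are invented for illustration, not real flight data): the hub sits on nearly every shortest path, so it gets by far the highest betweenness.

# Toy network: every route touches the hub, so the hub scores highest.
toy <- tibble::tribble(
  ~from,      ~to,
  "Winnipeg", "JFK",
  "Boston",   "JFK",
  "Toronto",  "JFK",
  "JFK",      "London",
  "JFK",      "Paris"
) %>%
  as_tbl_graph(directed = FALSE)
toy %>%
  activate(nodes) %>%
  mutate(betweenness = centrality_betweenness()) %>%
  arrange(desc(betweenness))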

You’ve heard of Page Rank, and it has almost certainly biased my idea of what it means for a node in a network to be important. It’s a fancier way of scoring a page than simply counting its links to get a degree: each incoming link is weighted by how important the page it comes from is. A page like ISBN_(identifier) might have an enormous number of incoming links, but most of those links are boring boilerplate, and Page Rank is an attempt to make pages like that less important.
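
As a rough sanity check (a sketch that reuses the centrality columns calc_centrality() already added to graph_small), you can compare how pages rank by raw in-degree versus by page rank; pages propped up by lots of low-value links tend to slide down the page rank ordering.

# Compare each node's rank by in-degree with its rank by page rank.
graph_small %>%
  activate(nodes) %>%
  as_tibble() %>%
  select(name, in_degree, page_rank) %>%
  mutate(rank_by_in_degree = min_rank(desc(in_degree)),
         rank_by_page_rank = min_rank(desc(page_rank))) %>%
  arrange(rank_by_page_rank) %>%
  head(10)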

The random walk measure is interesting. The documentation says it’s "centrality based on the inverse sum of expected random walk length between nodes". Here’s my guess at what that means: 1) start at some page and ask how many clicks through random links it would take, on average, to reach some other page; 2) if it would take a lot of clicks, that’s a bad sign for the starting page; 3) repeat for the other pages and combine.
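
Here’s a very rough simulation of that guess, strictly for intuition. This is not what tidygraph actually computes, and the page names in the usage comment are assumptions that may not have survived the sampling.

# Click random out-links from `start_name` until we land on `target_name`
# (or give up), and return how many clicks it took.
walk_length <- function(graph, start_name, target_name, max_steps = 1000){
  node_names <- graph %>% activate(nodes) %>% as_tibble() %>% pull(name)
  current <- which(node_names == start_name)
  target  <- which(node_names == target_name)
  if(length(current) != 1 || length(target) != 1) return(NA_integer_)
  for(step in 1:max_steps){
    nbrs <- as.integer(igraph::neighbors(graph, current, mode = "out"))
    if(length(nbrs) == 0) return(NA_integer_) # dead end
    current <- nbrs[sample(length(nbrs), 1)]
    if(current == target) return(step)
  }
  NA_integer_ # gave up
}
# e.g. mean(replicate(20, walk_length(graph_small, "OK_Go", "Mathematics")),
#      na.rm = TRUE)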

An important page like “Mathematics” will connect to a lot of mathy pages directly. I’m going to do my own random walk from the band “OK Go” and see how long it takes to get back to Math.

That wasn’t very scientific, and it took entirely too long. There were 20 steps in that path, but we could have found shorter or longer ones. Still, if you try a bunch of walks and average them out, you get some sort of estimate of how connected any one page is to a lot of other pages. It obviously takes way more computation than just calculating degree, which is why I’ve been able to type a page and a half without R finishing, even though I took a break to trace a path from a band that makes fun videos to mathematics via one of my favorite candies and Jeff Goldblum. I regret not using the Budapest page to link you to John von Neumann.

Tl;dr: we can think systematically about how important any one part of a network is, and the answer is bound to be interesting. It might even help you make money (e.g. by figuring out which Snapstagram influencers you should offer a sponsorship deal to). But it takes time to calculate, so you need the computer to do the hard work for you. If you do this sort of thing professionally you might even want to set the job up to run on a video card. But I work for the state, so I haven’t got money for a video card, and I will instead leave this alone overnight and hope for the best.

Back to the data…

Let’s start by looking at our various centrality measures. It might end up looking a bit funny since we cut away so much of our dataset. But the focus right now is getting the code working.

Normally I’d think of using pivot_longer to put all of our centrality measures into one column, then facet_wrap to make all four histograms at once. But that isn’t working here (a tbl_graph isn’t a plain tibble), so we’ll make our graphs one at a time.
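
For what it’s worth, here’s a sketch of the pivot_longer route that should work if you first pull the node data out as a plain tibble with as_tibble() (I haven’t leaned on it below):

# Convert the node data to a regular tibble, then pivot and facet as usual.
graph_small %>%
  activate(nodes) %>%
  as_tibble() %>%
  pivot_longer(c(in_degree, out_degree, betweenness, page_rank),
               names_to = "measure", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ measure, scales = "free")

Either way, I’m going to go a step further and put the one-at-a-time plots together into a function: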

make_the_plots <- function(graph){
  in_degree_hist <- graph %>%
    activate(nodes) %>%
    ggplot(aes(in_degree)) + geom_histogram(bins = 10)

  out_degree_hist <- graph %>%
    activate(nodes) %>%
    ggplot(aes(out_degree)) + geom_histogram(bins = 10)

  between_hist <- graph %>%
    activate(nodes) %>%
    ggplot(aes(betweenness)) + geom_histogram(bins = 10)

  page_rank_hist <- graph %>%
    activate(nodes) %>%
    ggplot(aes(page_rank)) + geom_histogram(bins = 10)

  # scatter of page rank against in-degree
  pr_vs_deg_point <- graph %>%
    activate(nodes) %>%
    ggplot(aes(page_rank, in_degree)) + geom_point()
  
  pr_col <- graph %>%
    activate(nodes) %>%
    top_n(30, page_rank) %>%
    ggplot(aes(fct_reorder(name,page_rank), page_rank)) + 
    geom_col() + 
    coord_flip() +
    labs(x = "page", y = "page rank")
  
  return(list(in_degree_hist,
              out_degree_hist,
              between_hist,
              page_rank_hist,
              pr_vs_deg_point,
              pr_col))
}
plots <- make_the_plots(graph_small)
plots
(Six plots print here: histograms of in-degree, out-degree, betweenness, and page rank; a scatter of page rank against in-degree; and a bar chart of the top 30 pages by page rank.)

Finally…

Now let’s run through this whole mess with the full data set.

graph_central <- graph %>% 
  activate(nodes) %>%
  calc_centrality()
graph <- graph_central %>%
  filter(str_detect(name,"(identifier)", negate = TRUE))
plots <- graph %>% make_the_plots()
plots
(The same six plots print here for the full dataset, with the "(identifier)" pages filtered out.)

And the actual challenge…

This challenge won’t be terribly challenging, but hopefully it will be interesting. All I want is your equivalent of that last plot. That is, using the data you scraped (starting from some page other than Graph Theory), show me your top 30 pages in terms of page rank. As usual, include the .R file you used to get that plot.

If you don’t like mice (that is, if you’d rather save the plot with code than click through an export menu), you can tweak the following code as necessary to write the plot to a PNG file:

png("top_30_pagerank.png")
plots[[6]]
dev.off()
## png 
##   2

Aside

I started running the script for this page around 12:30 today and it finished just before 5. Then I noticed some stuff I wanted to update, so I’m rerunning it and leaving it alone for the night. What time will it finish? I’m starting the run at 5:05 pm; the time below will show when it finished:

Sys.time()
## [1] "2020-11-16 20:41:00 EST"