Web scraping, a brief tutorial

This week we’ll be scraping data off the web. It turns out there’s a tidyverse package that will make this very easy: rvest. First, load in the appropriate libraries.

rm(list = ls())
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# install.packages("rvest")
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding

We’ll use Wikipedia and explore the links between articles. This will give a bit of insight into what mathematicians call “Graph Theory”.

rvest starts with a call to read_html on a url. I’ll start with the page for Graph Theory. In the questions below you’ll choose your own starting point.

first_url <- "https://en.wikipedia.org/wiki/Graph_theory"
first_page <- read_html(first_url)

Now I need to find the links. You might recall that in html, links look something like this: <a href="http://example.com">Link text</a>. We’ll pull together all those tags, then extract the links.

hrefs <- first_page %>% html_nodes("a")
links <- hrefs %>% html_attr("href")
head(links)
## [1] NA                             "#mw-head"                    
## [3] "#searchInput"                 "/wiki/Graph_of_a_function"   
## [5] "/wiki/Graph_(disambiguation)" "/wiki/File:6n-graf.svg"
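
To see what html_nodes and html_attr are doing without touching the network, here’s a minimal sketch on a hand-written HTML string. The snippet and the name toy_links are made up for illustration; read_html accepts literal HTML as well as a url. Notice that an <a> tag with no href comes back as NA, which is why the first entry above is NA.

```r
library(rvest)

# Made-up HTML standing in for a downloaded page
snippet <- '<html><body>
  <a href="/wiki/Graph_drawing">Graph drawing</a>
  <a href="#History">History</a>
  <a>No href here</a>
</body></html>'

toy_links <- read_html(snippet) %>%
  html_nodes("a") %>%     # every <a> tag
  html_attr("href")       # the href of each tag, NA when it is missing
toy_links
## [1] "/wiki/Graph_drawing" "#History"            NA
```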

I’m getting 1016 links, but I only want the ones that link to another Wikipedia article, so I’ll need to filter the links appropriately. That’s a job for stringr!

Try running the code below without the pipe to head and you’ll see a ton of things, and you’ll also start to see some patterns.

links %>% str_subset("/wiki/") %>% head()
## [1] "/wiki/Graph_of_a_function"    "/wiki/Graph_(disambiguation)"
## [3] "/wiki/File:6n-graf.svg"       "/wiki/File:6n-graf.svg"      
## [5] "/wiki/Graph_drawing"          "/wiki/Mathematics"

It looks like the ideal link takes the form "/wiki/Topic". One thing that seems common to the pages I don’t want is that they have : in them. So let’s start by pulling out the links that contain a colon to see what we’d be dropping.

links %>% str_subset(":") %>% head()
## [1] "/wiki/File:6n-graf.svg"    "/wiki/File:6n-graf.svg"   
## [3] "/wiki/File:Undirected.svg" "/wiki/File:Undirected.svg"
## [5] "/wiki/File:Directed.svg"   "/wiki/File:Directed.svg"

I could make a list of everything I don’t want and remove it from what I do, but the documentation shows a feature that will save me a step.

links %>% str_subset(":", negate = TRUE) %>% head()
## [1] "#mw-head"                     "#searchInput"                
## [3] "/wiki/Graph_of_a_function"    "/wiki/Graph_(disambiguation)"
## [5] "/wiki/Graph_drawing"          "/wiki/Mathematics"

Those links starting with # are anchor links (linking to a subsection of this page, not a new page). Let’s get rid of those.

links %>% str_subset(":", negate = TRUE) %>% str_subset("#", negate = TRUE) %>% head()
## [1] "/wiki/Graph_of_a_function"          "/wiki/Graph_(disambiguation)"      
## [3] "/wiki/Graph_drawing"                "/wiki/Mathematics"                 
## [5] "/wiki/Graph_(discrete_mathematics)" "/wiki/Vertex_(graph_theory)"

We’re getting closer, but commenting out the head call shows some stuff further down that we want to deal with appropriately. There are also some links with # that we want to keep (i.e. links to a subsection of another page). What we really want to get rid of are links that start with #, which we can identify with the ^ symbol. In a regular expression, ^ means “the start of the string”. So the code below excludes all links that start with # but keeps any links that have # later in the link.

links %>% str_subset(":", negate = TRUE) %>% str_subset("^#", negate = TRUE) %>% tail()
## [1] "/w/index.php?title=Graph_theory&printable=yes"                                      
## [2] "//creativecommons.org/licenses/by-sa/3.0/"                                          
## [3] "//foundation.wikimedia.org/wiki/Terms_of_Use"                                       
## [4] "//foundation.wikimedia.org/wiki/Privacy_policy"                                     
## [5] "//www.wikimediafoundation.org/"                                                     
## [6] "//en.m.wikipedia.org/w/index.php?title=Graph_theory&mobileaction=toggle_view_mobile"
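
Here’s the difference between "#" and "^#" on a tiny made-up vector (toy is hypothetical, not taken from the page). It’s only a sketch, but it shows why the anchor matters.

```r
library(stringr)

toy <- c("#History",
         "/wiki/Graph_(discrete_mathematics)#Definitions",
         "/wiki/Mathematics")

# Unanchored: drops every link containing '#', including the subsection link
no_hash <- str_subset(toy, "#", negate = TRUE)
no_hash
## [1] "/wiki/Mathematics"

# Anchored: drops only the links that begin with '#'
no_leading_hash <- str_subset(toy, "^#", negate = TRUE)
no_leading_hash
## [1] "/wiki/Graph_(discrete_mathematics)#Definitions"
## [2] "/wiki/Mathematics"
```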

Looking at the bottom of our list (by using tail instead of head) we can see that we’ll want to get rid of the links in the footer of first_page. Let’s do that by keeping only the links that follow the pattern we saw earlier.

links %>% 
  str_subset(":", negate = TRUE) %>%
  str_subset("^#", negate = TRUE) %>%
  str_subset("/wiki/") %>%
  tail()
## [1] "/wiki/Graph_theory"                            
## [2] "/wiki/Graph_theory"                            
## [3] "/wiki/Main_Page"                               
## [4] "/wiki/Main_Page"                               
## [5] "//foundation.wikimedia.org/wiki/Terms_of_Use"  
## [6] "//foundation.wikimedia.org/wiki/Privacy_policy"

Almost there! A quick adjustment:

links %>% 
  str_subset(":", negate = TRUE) %>%
  str_subset("^#", negate = TRUE) %>%
  str_subset("^/wiki/") %>%
  tail()
## [1] "/wiki/LCCN_(identifier)" "/wiki/NDL_(identifier)" 
## [3] "/wiki/Graph_theory"      "/wiki/Graph_theory"     
## [5] "/wiki/Main_Page"         "/wiki/Main_Page"

Finally, let’s make sure we exclude any links to the main page. That way we explore the things related to graph theory first. If we don’t exclude the main page, we’ll start on our page, check the link to the main page, then start getting a bunch of pages unrelated to Graph Theory.

first_links <- links %>% 
  str_subset(":", negate = TRUE) %>%
  str_subset("^#", negate = TRUE) %>%
  str_subset("^/wiki/") %>%
  str_subset("/wiki/Main_Page", negate = TRUE)
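
As an aside, the first three filters can probably be folded into a single regular expression. The sketch below is an alternative, not what the rest of this tutorial uses. toy_hrefs is a made-up vector, and the pattern "^/wiki/[^:]*$" reads as “starts with /wiki/ and has no colon anywhere”.

```r
library(stringr)

# Made-up hrefs covering each case we filtered on
toy_hrefs <- c("#mw-head",
               "/wiki/File:6n-graf.svg",
               "/wiki/Graph_drawing",
               "//foundation.wikimedia.org/wiki/Terms_of_Use",
               "/wiki/Main_Page",
               "/wiki/Graph_(discrete_mathematics)#Definitions")

# The chain of filters used in the tutorial
chained <- toy_hrefs %>%
  str_subset(":", negate = TRUE) %>%
  str_subset("^#", negate = TRUE) %>%
  str_subset("^/wiki/") %>%
  str_subset("/wiki/Main_Page", negate = TRUE)

# One pattern for "starts with /wiki/ and has no colon", then drop the main page
single <- toy_hrefs %>%
  str_subset("^/wiki/[^:]*$") %>%
  str_subset("/wiki/Main_Page", negate = TRUE)

identical(chained, single)
## [1] TRUE
```

On this toy vector the two approaches agree. The step-by-step chain is easier to debug, which is why I’m sticking with it.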

At this point, I’ve got 623 links (down from 1016!) but some are duplicates. I might want to know about those duplicates (maybe multiple links tell us about a stronger connection between two topics), but I’m not sure yet. So for the time being, I’m going to wrap up the code I made into a function and leave open the possibility of keeping duplicates.

grab_wiki_links <- function(url, keep_duplicates = FALSE){
  links <- read_html(url) %>% # read page
      html_nodes("a") %>% # link nodes
      html_attr("href") %>% # links
      str_subset(":", negate = TRUE) %>% # remove links with colons
      str_subset("^#", negate = TRUE) %>% # remove anchor links for this page
      str_subset("^/wiki/") %>% # only keep links to wiki pages
      str_subset("/wiki/Main_Page", negate = TRUE) # exclude main page
  if(keep_duplicates){
    return(links)
  } else {
    return(unique(links))
  }
}
grab_wiki_links(first_url) %>% head()
## [1] "/wiki/Graph_of_a_function"          "/wiki/Graph_(disambiguation)"      
## [3] "/wiki/Graph_drawing"                "/wiki/Mathematics"                 
## [5] "/wiki/Graph_(discrete_mathematics)" "/wiki/Vertex_(graph_theory)"
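
If you want to check the filtering and the keep_duplicates switch without downloading anything, one option is to split the filtering into its own helper that takes a vector of hrefs. filter_wiki_links and toy_hrefs below are hypothetical names I’m introducing for this sketch; the filters themselves are the same ones used in grab_wiki_links.

```r
library(stringr)

# Hypothetical helper: the filtering half of grab_wiki_links(),
# working on a character vector instead of a url
filter_wiki_links <- function(hrefs, keep_duplicates = FALSE){
  kept <- hrefs %>%
    str_subset(":", negate = TRUE) %>%               # remove links with colons
    str_subset("^#", negate = TRUE) %>%              # remove anchor links
    str_subset("^/wiki/") %>%                        # only keep wiki pages
    str_subset("/wiki/Main_Page", negate = TRUE)     # exclude main page
  if(keep_duplicates) kept else unique(kept)
}

toy_hrefs <- c("/wiki/Mathematics", "/wiki/File:6n-graf.svg",
               "/wiki/Mathematics", "#mw-head", "/wiki/Main_Page")

unique_links <- filter_wiki_links(toy_hrefs)
unique_links
## [1] "/wiki/Mathematics"

all_links <- filter_wiki_links(toy_hrefs, keep_duplicates = TRUE)
all_links
## [1] "/wiki/Mathematics" "/wiki/Mathematics"
```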

Alright, what do we do with these links? First, we want to note that Graph Theory is linked to these pages. Second, we want to see what those pages are linked to.

I’ll start with the second problem because it’s a similar problem to the one I just solved. I’ll need to take the relative links and tack them on to the base of the url.

rel_to_abs_link <- function(relative_link){
  base_url <- "https://en.wikipedia.org"
  paste(base_url,
        relative_link,
        sep="")
}
grab_wiki_links(first_url) %>%
  head(1) %>%
  rel_to_abs_link()
## [1] "https://en.wikipedia.org/wiki/Graph_of_a_function"
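
paste works element-wise, so rel_to_abs_link can also take a whole vector of links at once. Here’s a small sketch with made-up links; it rewrites the function with paste0, which is shorthand for paste(..., sep = "").

```r
# Same behaviour as rel_to_abs_link, written with paste0
rel_to_abs_link <- function(relative_link){
  base_url <- "https://en.wikipedia.org"
  paste0(base_url, relative_link)
}

# A vector in, a vector of the same length out
abs_links <- rel_to_abs_link(c("/wiki/Graph_drawing", "/wiki/Mathematics"))
abs_links
## [1] "https://en.wikipedia.org/wiki/Graph_drawing"
## [2] "https://en.wikipedia.org/wiki/Mathematics"
```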

That looks about right. Let’s see if I can pipe that right into the first function…

grab_wiki_links(first_url) %>%
  head(1) %>%
  rel_to_abs_link() %>%
  grab_wiki_links() %>%
  head()
## [1] "/wiki/Plot_(graphics)"              "/wiki/Graph_(discrete_mathematics)"
## [3] "/wiki/Functional_graph"             "/wiki/Mathematics"                 
## [5] "/wiki/Function_(mathematics)"       "/wiki/Ordered_pair"

Cool! What I’ve got is code that will take a Wikipedia page, extract links to other Wikipedia pages, and extract the links from one of those pages.

Now the tricky part: we’ve got to think about how to hold on to this information about the links. It’s a good thing I’m starting with Graph Theory because there’s a section on that page about this very problem!

Since I’m looking at hundreds of links, I’ll start by thinking about list structures. I’m thinking the sensible way to deal with this is to use a nested tibble where each row represents a Wikipedia page and one of the columns holds the list of links we get from grab_wiki_links.

For this to work nicely, I’m going to want to extract titles from the URLs. Since I’m going to do that repeatedly, I’ll build a simple function.

# convert relative links to a page title
link_to_title <- function(relative_link){
  str_replace(relative_link,".*/wiki/","") # '.*' ensures it will also work with 
  # absolute links
}
link_to_title(first_links) %>% head()
## [1] "Graph_of_a_function"          "Graph_(disambiguation)"      
## [3] "Graph_drawing"                "Mathematics"                 
## [5] "Graph_(discrete_mathematics)" "Vertex_(graph_theory)"
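
A quick check of the claim in the comment: because the pattern starts with .*, the same function should give the same title for a relative link and an absolute one. The two example links are made up for this sketch.

```r
library(stringr)

link_to_title <- function(relative_link){
  str_replace(relative_link, ".*/wiki/", "")  # drop everything up to and including /wiki/
}

titles <- link_to_title(c("/wiki/Graph_drawing",
                          "https://en.wikipedia.org/wiki/Graph_drawing"))
titles
## [1] "Graph_drawing" "Graph_drawing"
```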

There are about a half million ways I could have done that, but after looking through the stringr documentation and trying some stuff, I like that one best for this problem.

Now it’s time to put together our data.

tibble(page = "Graph_theory", 
       link = grab_wiki_links(first_url),
       link_name = link_to_title(link)) %>%
  head()
## # A tibble: 6 x 3
##   page         link                               link_name                   
##   <chr>        <chr>                              <chr>                       
## 1 Graph_theory /wiki/Graph_of_a_function          Graph_of_a_function         
## 2 Graph_theory /wiki/Graph_(disambiguation)       Graph_(disambiguation)      
## 3 Graph_theory /wiki/Graph_drawing                Graph_drawing               
## 4 Graph_theory /wiki/Mathematics                  Mathematics                 
## 5 Graph_theory /wiki/Graph_(discrete_mathematics) Graph_(discrete_mathematics)
## 6 Graph_theory /wiki/Vertex_(graph_theory)        Vertex_(graph_theory)

That’s close to what I want, but I want it to be nested.

first_df <- tibble(page = "Graph_theory",
       link = grab_wiki_links(first_url),
       link_name = link_to_title(link)) %>%
  nest(links = c(link, link_name)) 
first_df
## # A tibble: 1 x 2
##   page         links             
##   <chr>        <list>            
## 1 Graph_theory <tibble [509 × 2]>
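
If nesting is new to you, it may help to try it on something small enough to see all at once. This is a sketch on a made-up two-link page (toy_df is hypothetical): nest collapses the link columns into one list-column, and unnest puts them back.

```r
library(tibble)
library(tidyr)

toy_df <- tibble(page = "A",
                 link = c("/wiki/B", "/wiki/C"),
                 link_name = c("B", "C"))

# One row per page; the two link columns move into a nested tibble
nested <- toy_df %>% nest(links = c(link, link_name))
nrow(nested)
## [1] 1

# The nested tibble still holds both links
nrow(nested$links[[1]])
## [1] 2

# unnest reverses the operation: back to one row per link
flat <- nested %>% unnest(links)
```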

Alright. Now I want to add rows based on the links.

link_to_row <- function(relative_link){
  page <- link_to_title(relative_link)
  link <- relative_link %>% 
    rel_to_abs_link() %>%
    grab_wiki_links()
  link_name <- link_to_title(link)
  out <- tibble(page = page,
                link = link,
                link_name = link_name) %>%
    nest(links = c(link, link_name))
  return(out)
}
link_to_row(first_links[1])
## # A tibble: 1 x 2
##   page                links             
##   <chr>               <list>            
## 1 Graph_of_a_function <tibble [138 × 2]>

That looks fine, but let’s look closer…

link_to_row(first_links[1]) %>% unnest(links) %>% head()
## # A tibble: 6 x 3
##   page               link                             link_name                 
##   <chr>              <chr>                            <chr>                     
## 1 Graph_of_a_functi… /wiki/Plot_(graphics)            Plot_(graphics)           
## 2 Graph_of_a_functi… /wiki/Graph_(discrete_mathemati… Graph_(discrete_mathemati…
## 3 Graph_of_a_functi… /wiki/Functional_graph           Functional_graph          
## 4 Graph_of_a_functi… /wiki/Mathematics                Mathematics               
## 5 Graph_of_a_functi… /wiki/Function_(mathematics)     Function_(mathematics)    
## 6 Graph_of_a_functi… /wiki/Ordered_pair               Ordered_pair

Good! Now I just need to do this (without the unnesting) for all the links in first_df.

I take the dataframe, unnest it so I can grab the links, and select what I want (the links).

links_to_get <- first_df %>% 
  unnest(links) %>% 
  pull(link) # pull the relevant vector out of the tibble
# this is similar to select, but returns a vector instead of a tibble

Next I’ll have to map link_to_row over that. This might take a while, so while I’m prototyping, I’m going to just look at the first few links.

second_df <- links_to_get[1:5] %>%
  map(link_to_row)
second_df
## [[1]]
## # A tibble: 1 x 2
##   page                links             
##   <chr>               <list>            
## 1 Graph_of_a_function <tibble [138 × 2]>
## 
## [[2]]
## # A tibble: 1 x 2
##   page                   links            
##   <chr>                  <list>           
## 1 Graph_(disambiguation) <tibble [22 × 2]>
## 
## [[3]]
## # A tibble: 1 x 2
##   page          links             
##   <chr>         <list>            
## 1 Graph_drawing <tibble [148 × 2]>
## 
## [[4]]
## # A tibble: 1 x 2
##   page        links             
##   <chr>       <list>            
## 1 Mathematics <tibble [431 × 2]>
## 
## [[5]]
## # A tibble: 1 x 2
##   page                         links             
##   <chr>                        <list>            
## 1 Graph_(discrete_mathematics) <tibble [112 × 2]>

map returns a list by default, but I want a dataframe. Fortunately, map_df will do what I want!

second_df <- links_to_get[1:5] %>%
  map_df(link_to_row)
second_df
## # A tibble: 5 x 2
##   page                         links             
##   <chr>                        <list>            
## 1 Graph_of_a_function          <tibble [138 × 2]>
## 2 Graph_(disambiguation)       <tibble [22 × 2]> 
## 3 Graph_drawing                <tibble [148 × 2]>
## 4 Mathematics                  <tibble [431 × 2]>
## 5 Graph_(discrete_mathematics) <tibble [112 × 2]>
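
Here’s the same map versus map_df contrast on a toy function that needs no network access. toy_row is a made-up stand-in for link_to_row that returns a one-row tibble. (In purrr 1.0.0 and later, map_df is marked as superseded in favour of map followed by list_rbind, but it still works.)

```r
library(purrr)
library(tibble)

# Made-up stand-in for link_to_row: one row per input
toy_row <- function(title){
  tibble(page = title, n_chars = nchar(title))
}

pages <- c("Mathematics", "Graph_drawing")

as_list <- map(pages, toy_row)    # a list of two 1-row tibbles
length(as_list)
## [1] 2

as_df <- map_df(pages, toy_row)   # the same rows bound into one tibble
nrow(as_df)
## [1] 2
```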

I might want to get rid of the disambiguation pages. But I’ll think about that after I get the basic procedure set up. I’m most of the way there, but I’d like to roll together everything I’ve learned above into something nice and easy to use, which calls for yet another function.

The function below (df_to_df) should do the trick, but since my wifi isn’t great it takes a long time.

df_to_df <- function(first_df){
  second_df <- first_df %>% 
    unnest(links) %>% 
    pull(link) %>% 
    map_df(link_to_row)
  return(full_join(first_df, second_df))
}

I’m going to rewrite df_to_df so that it double checks that first_df doesn’t already include any of the pages I’m about to look up…

already_done <- function(link, first_df){
  link_to_title(link) %in% first_df$page
}

df_to_df <- function(first_df){
  links <- first_df %>%
    unnest(links) %>%
    pull(link)
  keep <- map_lgl(links, function(link){
    !already_done(link, first_df)
  })
  second_df <- links[keep] %>% 
    map_df(link_to_row)
  return(full_join(first_df, second_df))
}

second_df <- df_to_df(first_df)
## Joining, by = c("page", "links")
head(second_df)
## # A tibble: 6 x 2
##   page                         links             
##   <chr>                        <list>            
## 1 Graph_theory                 <tibble [509 × 2]>
## 2 Graph_of_a_function          <tibble [138 × 2]>
## 3 Graph_(disambiguation)       <tibble [22 × 2]> 
## 4 Graph_drawing                <tibble [148 × 2]>
## 5 Mathematics                  <tibble [431 × 2]>
## 6 Graph_(discrete_mathematics) <tibble [112 × 2]>
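
To make the already_done check concrete, here’s a sketch on made-up data (toy_done and candidates are hypothetical). Since %in% is vectorised, already_done can be called on the whole vector of links at once.

```r
library(stringr)
library(tibble)

link_to_title <- function(relative_link){
  str_replace(relative_link, ".*/wiki/", "")
}
already_done <- function(link, first_df){
  link_to_title(link) %in% first_df$page
}

# Pretend these two pages have already been scraped
toy_done <- tibble(page = c("Graph_theory", "Mathematics"))
candidates <- c("/wiki/Mathematics", "/wiki/Graph_drawing", "/wiki/Graph_drawing")

done <- already_done(candidates, toy_done)
done
## [1]  TRUE FALSE FALSE

# Only fetch what is new, and only once
to_fetch <- unique(candidates[!done])
to_fetch
## [1] "/wiki/Graph_drawing"
```

The unique call at the end is my own addition: it guards against fetching the same page twice when several pages link to it.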

It works! Now let’s extract a bit of information about how connected each page is.

second_df %>% 
  mutate(degree = map(links,nrow)) %>%
  unnest(degree) %>%
  arrange(desc(degree))
## # A tibble: 509 x 3
##    page                                  links                degree
##    <chr>                                 <list>                <int>
##  1 Gottfried_Wilhelm_Leibniz             <tibble [1,754 × 2]>   1754
##  2 Artificial_intelligence               <tibble [1,310 × 2]>   1310
##  3 History_of_mathematics                <tibble [1,011 × 2]>   1011
##  4 Computer_animation                    <tibble [987 × 2]>      987
##  5 Leonhard_Euler                        <tibble [986 × 2]>      986
##  6 Game_theory                           <tibble [888 × 2]>      888
##  7 Software_maintenance                  <tibble [852 × 2]>      852
##  8 Sociology                             <tibble [829 × 2]>      829
##  9 Philosophy_of_artificial_intelligence <tibble [821 × 2]>      821
## 10 System_on_a_chip                      <tibble [811 × 2]>      811
## # … with 499 more rows
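
The degree column is just the number of rows in each nested tibble, i.e. the number of distinct outgoing links found on that page. Here’s a minimal sketch on a made-up nested tibble (toy_nested is hypothetical). It uses map_int, which returns an integer vector directly, so no unnest step is needed.

```r
library(tibble)
library(purrr)

toy_nested <- tibble(
  page  = c("A", "B"),
  links = list(tibble(link = c("/wiki/B", "/wiki/C", "/wiki/D")),
               tibble(link = "/wiki/A"))
)

# Count the rows of each nested tibble
degree <- map_int(toy_nested$links, nrow)
degree
## [1] 3 1
```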

Go Leibniz! Of the pages linked to Graph Theory (including the Graph Theory page itself), he’s connected to the most Wikipedia pages.

Let’s take our df_to_df function and make it a little more useful…

page_to_df_recursive <- function(url, depth = 3){
  # to make sure we can plug in an absolute url...
  url <- str_replace(url,".*/wiki/","/wiki/")
  out <- link_to_row(url)
  for(layer in seq_len(depth)){ # seq_len(0) is empty, so depth = 0 adds no layers
    out <- df_to_df(out)
  }
  return(out)
}
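
To get a feel for what “depth” means without waiting on downloads, here’s a simplified stand-in that crawls a tiny made-up graph stored in a list. toy_wiki and crawl_toy are hypothetical and only track page names, not the nested tibbles. The loop uses seq_len(depth) because in R 1:0 is c(1, 0), so a loop over 1:depth would run twice when depth is 0.

```r
# Made-up "wiki": each name is a page, each value is its outgoing links
toy_wiki <- list(A = c("B", "C"),
                 B = c("C", "D"),
                 C = character(0),
                 D = "A")

crawl_toy <- function(start, depth = 3){
  visited  <- start
  frontier <- start
  for(layer in seq_len(depth)){
    # every page linked from the current layer
    linked   <- unique(unlist(toy_wiki[frontier], use.names = FALSE))
    # skip pages we have already seen
    frontier <- setdiff(linked, visited)
    visited  <- c(visited, frontier)
  }
  visited
}

crawl_toy("A", depth = 0)
## [1] "A"
crawl_toy("A", depth = 1)
## [1] "A" "B" "C"
crawl_toy("A", depth = 2)
## [1] "A" "B" "C" "D"
```

Each extra layer can only add pages that haven’t been seen yet, so on a small graph the result stops growing after a few layers.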

Now we can start at any Wikipedia page, check out everything it’s linked to, and everything those pages are linked to, to whatever depth we want to go (or have time to wait for our computer to go).

Alright, that all took longer than I expected, so let’s make this challenge a two-parter. We’ll update the assignment for Wednesday’s class.

Part I

Your first task is to figure out what code from above is essential, then run it using a Wikipedia page of your choice. Turn in an R script (your_name_wikipedia_pt1.R) to the appropriate place in Blackboard. Include a starting url (your choice, just something other than Graph Theory) and the minimum amount of code necessary to work (e.g. your code shouldn’t involve the first_links object or anything else that isn’t strictly necessary).