Web scraping

Web scraping is the process of programmatically loading a large number of static HTML pages into R and turning the tables, text, or other data on those pages into tabular data we can use for statistics.

Scraping is a legal grey area. Many sites expressly forbid the practice: while the owners of a data set don’t mind you using it to dynamically look up (for instance) real estate prices for a particular property, or the racial composition of a particular elementary school, they’re very averse to letting someone else rebuild the underlying data table in its entirety.

Scraping is also receding as a data collection method, in response to changing web rendering technologies. Static HTML pages were the typical online presentation from the 1990s until around the 2010s. Pages have since become more dynamic and capable, with the unintended consequence that they’re also more difficult to scrape. Companies also increasingly use technologies like Cloudflare to expressly stop scraping.

Finally, the person who owns a website is an unwitting collaborator with you, the person doing the scraping. They built their site for one purpose, and you adapted your scraper to their design. They won’t email you when the page’s design changes and breaks your scraper.

You’re getting a data set at no financial cost. You’re entitled to very generous refund terms if anything goes wrong.

Something that’s not web scraping

There are other ways to get data from an external database directly into your R session without scraping, by using a technology called an API (application programming interface). If a data provider has an API, use it to gather their data.

A quick example, using the World Bank’s API.

library(tidyverse)
library(magrittr)
library(showtext)
# if needed, install this with
# install.packages("wbstats")
library(wbstats)

# two indicators -- women's BA attainment and total fertility -- for all countries
t1 <- wbstats::wb_data(
  country = "all",
  indicator = c(
    "women_ba" = "SE.TER.CUAT.BA.FE.ZS",
    "women_fert" = "SP.DYN.TFRT.IN"
  ),
  start_date = 1980,
  end_date = 2023, 
  return_wide = T
) %>% 
  as_tibble
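A quick glimpse() is a useful sanity check on what the API returned; because we supplied a named indicator vector and return_wide = T, the two series should arrive as the women_ba and women_fert columns we named above.

# inspect the columns the API returned
t1 %>% 
  glimpse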

This is not a contrived example: all US government agencies have data APIs, as do basically all the international agencies. If there’s an API for your data, you should use it.

When there’s no API, we should scrape

Scraping is more bespoke than it is systematic. The general process is

  1. You load one element of the eventual vector of pages you intend to scrape
  2. On this one page, you attempt to identify the html tags for the elements you want to scrape
  3. You then figure out how the page URLs are indexed on the site, so that you can apply the same scrape to all the pages (a skeletal version of these steps is sketched just below)
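In rvest terms (the scraping library we introduce just below), those three steps look roughly like this minimal sketch. The URL and the .some-selector tag are placeholders for illustration; you’d swap in whatever SelectorGadget finds on your target site.

library(tidyverse)
library(rvest)

# step 1: load a single page
p1 <- read_html("https://example.com/listing?page=1")

# step 2: identify and extract the elements you care about on that one page
p1 %>% 
  html_nodes(".some-selector") %>%   # placeholder selector
  html_text2

# step 3: once the selector works, map the same steps over every indexed url
str_c("https://example.com/listing?page=", 1:10) %>% 
  map(read_html, .progress = T)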

We’ll start with an example from OpenSecrets, a non-profit that aggregates FEC campaign finance and federal lobbying data to make it more useful to users. We’re interested in seeing how much each industry spends on lobbying.

We need two new pieces of software. First, rvest



It lets you programmatically open websites from R (some people call this headless browsing). We’ll get a look at the library’s functions in a minute.
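rvest is installed alongside the tidyverse but isn’t attached by library(tidyverse), so load it separately (and install it first if you somehow don’t have it):

# if needed, install this with
# install.packages("rvest")
library(rvest)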

The other thing we’ll need is a Chrome extension, SelectorGadget, to identify tagged elements on a website. To install it, visit the Chrome Web Store and search for the extension by name.

If you search for it correctly it will look like this:



Armed with both of these tools, we’re ready to get scraping.

Open Chrome and navigate to

https://www.opensecrets.org/federal-lobbying/alphabetical-list?type=s

which will look like this:



Hover over any of the industry links and you’ll see a URL populate at the bottom of your screen. We’re going to have R load each of these links.

To aid this process, open SelectorGadget. Click on one of the industry links so that it is highlighted, like this



The reference .color-category is actually a CSS selector.

A CSS selector is the first part of a CSS rule. It is a pattern of elements and other terms that tells the browser which HTML elements should be selected.
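If CSS selectors are new to you, here’s a toy illustration with an invented HTML fragment: a selector beginning with a period matches elements by their class attribute.

library(rvest)

# an invented fragment, just to show what a class selector matches
toy <- minimal_html('
  <div class="color-category"><a href="/industries/A">Agribusiness</a></div>
  <div class="some-other-class"><a href="/about">About</a></div>
')

# ".color-category" picks out only the element carrying that class
toy %>% 
  html_nodes(".color-category")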

We’ll load the page and use this selector

library(rvest)

h0 <- "https://www.opensecrets.org/federal-lobbying/alphabetical-list?type=s" %>% 
  read_html

h0 %>% 
  html_nodes(
    ".color-category"
  )

Let’s look at the results of this function

These 141 results include some graphical properties, but the thing we’re interested in is the href attribute, which holds the links to the summary pages we’re after. To get at those, we’re going to need to append an a to our selector.

h0 %>% 
  html_nodes(
    ".color-category a"
  )

This is looking better! html_attr() will return the precise attribute we’re interested in

h0 %>% 
  html_nodes(
    ".color-category a"
  ) %>% 
  html_attr("href")

We’ll use list columns so that R will do all the housekeeping for us.

# build full urls from the hrefs, keeping only the sector summary links
t0 <- tibble(
  url = str_c(
    "https://www.opensecrets.org",
    h0 %>% 
      html_nodes(
        ".color-category a"
      ) %>% 
      html_attr("href")
  )
) %>% 
  filter(
    url %>% 
      str_detect("sectors")
  )

Now we can use map and rvest::read_html() to load a vector of these pages

t0 %<>% 
  mutate(
    pags = url %>% 
      map(
        read_html, 
        .progress = T
      )
  )

What do these sectoral pages look like?



We don’t want to report 13 sectors’ lobbying; we want precision down to the industry level. So we’re going to open one of these sector summaries (the health sector is here: https://www.opensecrets.org/federal-lobbying/sectors/summary?cycle=2024&id=H) and use SelectorGadget again to get the precise tag for these industry links.

Mess with SG until you get the following selection



Again, we’ll append "a" so that we can readily access the hyperlink for each industry. Armed with this selector, we’ll map over each sector page and prepend the OpenSecrets root URL.

t1 <- tibble(
  url = t0$pags %>% 
    map(
      \(i)
      i %>% 
        html_nodes(".color-category a") %>% 
        html_attr("href")
    ) %>% 
    unlist %>% 
    str_c(
      "https://www.opensecrets.org",
      .
    )
)

Mmmm, that cycle=2024 is interesting. While the bar chart shows each year since 1998, the chart itself isn’t scrapeable, so we’ll request each cycle separately by varying that URL parameter.

Try it yourself: set cycle=1998 in the URL for Agricultural Services/Products, in the following way

https://www.opensecrets.org/federal-lobbying/industries/summary?cycle=1998&id=A07

So, what function is useful for us, to get every unique combination of cycle and industry id? Right, clever imaginary interlocutor, our old friend expand_grid!

eg1 <- expand_grid(
  cycle = 1998:2024,
  id = t1$url %>% 
    str_sub(
      t1$url %>% 
        str_locate("id=") %>% 
        extract(, 2) %>% 
        add(1)
    )
) %>% 
  mutate(
    url = str_c(
      "https://www.opensecrets.org/industries/lobbying?cycle=",
      cycle, 
      "&ind=",
      id
    )
  )

Now, rvest is pretty fast and efficient, but it can only scrape about 100 pages per minute, so scraping all ~2,565 unique industry-year combinations would take around 25 minutes.
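If you’re worried about hitting the site too hard (or about getting rate limited), you can slow things down yourself. A minimal sketch, not part of the original scrape, that simply pauses before each request:

# a slower, politer reader: wait a second before each request
slow_read <- function(u) {
  Sys.sleep(1)
  read_html(u)
}

# swap slow_read for read_html in the map() call below to throttle the scrape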

To make this tractable as an exercise, let’s just do the first 200.

Scraping the pages for each industry year

eg1 %<>%
  slice(
    1:200
  )

# possibly() returns NULL, rather than erroring, when a page fails to load
eg1$pag <- eg1$url  %>% 
  map(
    possibly(
      \(i)
      i %>%
        read_html, 
      otherwise = NULL
    ),
    .progress = T
  )

What are the CSS tags for each relevant piece of the page that we’d like as variables in an eventual table?

Let’s open the first URL and mess with SG

What do we want in an eventual table?

  1. The industry title
  2. The lobbying total

Exercise

On your own, figure out which CSS tags correspond with each property described in the list above











Armed with these tags, we can do the following

# pull the headline lobbying total (as text) from each industry page
eg1$tot <- eg1$pag %>% 
  map(
    possibly(
      \(i)
      i %>% 
        html_nodes(".l-grid .f-strata-title") %>%
        html_text2, 
      otherwise = NULL
    ),
    .progress = T
  )

# pages that failed to load give a zero-length result; code those as NA,
# otherwise keep the first match
eg1$tot2 <- eg1$tot %>% 
  map_chr(
    \(i)
    i %>% 
      length %>% 
      equals(0) %>% 
      ifelse(
        NA,
        i %>% 
          extract2(1)
      )
  )

eg1$title <- eg1$pag %>% 
  map_chr(
    possibly(
      \(i)
      i %>%
        html_nodes(".Hero-title") %>%
        html_text2, 
      otherwise = ""
    ),
    .progress = T
  )

Now we’ll load the version which I very helpfully scraped for all 2565 industry/year rows, and do some housekeeping.

eg2 <- "https://github.com/thomasjwood/code_lab/raw/main/data/eg1_scrape.rds" %>% 
  url %>% 
  readRDS %>% 
  mutate(
    # strip dollar signs and commas, then convert to numeric
    tot2  = tot2 %>%
      str_remove_all(
        "\\$|,"
      ) %>% 
      as.numeric,
    title = title %>% 
      str_remove(" Lobbying")
  ) %>% 
  filter(
    # drop rows where no total was scraped
    tot2 %>% 
      is.na %>% 
      not
  )

Mmmm I want to make my typical overly elaborate figure, so let’s change some of the silly long industry titles into something shorter

# the 20 longest industry titles
hs <- eg2$title %>% 
  unique %>% 
  extract(
    order(
      eg2$title %>% 
        unique %>% 
        str_length 
    ) %>% 
      rev
  ) %>% 
  extract(1:20)

eg2$title  %<>%
  plyr::mapvalues(
    hs, 
    c("Non-profits", "Alt Energy", "Electronics", "Pro-Abortion", "Crop Production",
      "Misc Manufacturing", "Pharmaceuticals", "Religious Orgs", "Chemicals",
      "Ag Services", "Public Officials", "Entertainment", "Building Materials",
      "Anti-Abortion", "Telecom", "Finance", "Forestry", "Hospitals", "Trade Contractors",
      "Defense Policy")
  )

Now we’ll use tidyquant to adjust all the dollar amounts into 2023 dollars

cpi <- tidyquant::tq_get(
  "CPIAUCSL", 
  get = "economic.data", 
  from = "1998-01-01"
) %>% 
  mutate(year = date %>% year) %>% 
  group_by(year) %>% 
  summarize(
    pr = price %>% mean
  )

# adjustment factor: each year's average CPI relative to 2023
cpi$af <- cpi$pr %>% 
  divide_by(
    cpi$pr[cpi$year == 2023]
  )

t2 <- eg2 %>% 
  left_join(
    cpi %>% 
      select(-pr) %>% 
      rename(
        cycle = year
      )
  ) %>% 
  mutate(
    tot3 = tot2 %>% 
      divide_by(af),
    title = title %>% 
      fct_reorder(
        tot3,
        .na_rm = T,
        .desc = T)
  )

We’ll have industries ordered by the amounts they spend on lobbying

t2$title %<>%
  factor(
    t2 %>% 
      group_by(title) %>% 
      summarize(mu = tot3 %>% mean) %>% 
      arrange(desc(mu)) %>% 
      use_series(title)
  )

Now we plot

font_add_google("Public Sans")
showtext_auto()

t2 %>% 
  ggplot(
    aes(
      title, tot3
    )
  ) +
  geom_point(
    size = 1.75,
    shape = 19,
    alpha = .5
  ) +
  labs(
    title = "Federal lobbying by industry, 1998-2024",
    subtitle = "Points depict total annual lobbying, adjusted to 2023 dollars. Pro-Israel, Gun Rights, Gun Control, Abortion Rights, and Anti-Abortion lobbyists highlighted.",
    x = "",
    y = "",
    caption = "Data source: opensecrets.org"
  ) +
  scale_y_continuous(
    breaks = seq(0, 6e8, length.out = 4),
    labels = c(
      "$0",
      str_c(
        "$",
        seq(200, 600, 200),
        "M"
      )
    )
  ) +
  theme_minimal(base_family = "Public Sans") + 
  theme(
    plot.margin = margin(
      t = .25,
      r = .25,
      l = .7, 
      b = .1,
      unit = "cm"
    ),
    plot.title = element_text(face = "bold"),
    plot.caption = element_text(
      margin = margin(-.3, unit = "cm"),
      face = "italic"),
    panel.grid = element_blank(),
    axis.text.x = element_text(
      margin = margin(t = -.6, unit = "cm"), 
      hjust = 1,
      vjust = 1,
      size = 6.75,
      angle = 45/1.6
    ),
    plot.title.position = "plot",
    legend.position = "none"
  )