Web scraping is the process of programmatically loading a large number of static HTML pages into R and turning tables, text, or other data on those pages into tabular data we can use for statistics.
Scraping is a legal grey area. Many sites expressly forbid the practice: while the owners of a data set are happy to serve up (for instance) real estate prices for a particular property, or the racial composition of a particular elementary school, they’re very averse to letting someone else rebuild the underlying data table in its entirety.
Scraping is also receding as a data collection method, in response to changing web rendering technologies. Static HTML pages were the typical form of online presentation from the 1990s until around the 2010s. Pages have since become more dynamic and capable, with the unintended consequence that they’re also more difficult to scrape. Companies are also increasingly using technologies like Cloudflare to expressly stop scraping.
Finally, the person who owns a website is an unwitting collaborator with you, the person doing the scraping. They built their site for one purpose, and you adapted your scraper to their design. They won’t email you when the page’s design changes and breaks your scraper.
You’re getting a data set at no financial cost. You’re entitled to very generous refund terms if anything goes wrong.
There are other ways to get data directly from an external database into your R session without scraping, using a technology called an API. If a data provider has an API, use it to gather their data.
A quick example, using the World Bank’s API.
library(tidyverse)
library(magrittr)
library(showtext)
# if needed, install this with
# install.packages("wbstats")
library(wbstats)
t1 <- wbstats::wb_data(
  country = "all",
  indicator = c(
    "women_ba" = "SE.TER.CUAT.BA.FE.ZS",
    "women_fert" = "SP.DYN.TFRT.IN"
  ),
  start_date = 1980,
  end_date = 2023,
  return_wide = T
) %>%
  as_tibble
This is not a contrived example: all US government agencies have data APIs, as do basically all the international agencies. If there’s an API for your data, you should use it.
Scraping is more bespoke than it is systematic. The general process is: load a page into R, identify the CSS selectors that mark the elements you want, pull those elements out as text or attributes, and tidy the results into a table.
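In skeleton form, that workflow looks roughly like this. This is just a sketch against a stand-in page (example.com), not part of the OpenSecrets exercise below; every real scrape swaps in its own URL and selectors.
library(tidyverse)
library(rvest)
# 1. load a page into R
pg <- read_html("https://example.com")
# 2. select the elements you care about with a CSS selector
#    (here, every <a> tag) and pull out an attribute or the text
links <- pg %>%
  html_nodes("a") %>%
  html_attr("href")
# 3. tidy the results into a table you can keep working with
tibble(url = links)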
We’ll start with an example from OpenSecrets, a non-profit that aggregates FEC campaign donation data to make it more useful. We’re interested in seeing how much each industry spends on federal lobbying.
We need two new pieces of software. First, rvest, which lets you programmatically open websites from R (some people call this headless browsing). We’ll get a look at the library’s functions in a minute.
The other thing we’ll need is a Chrome extension, SelectorGadget, which identifies tagged elements on a website. To install it, visit the Chrome Web Store and search for the extension by name.
If you search for it correctly it will look like this:
Armed with both of these tools, we’re ready to get scraping.
Open Chrome and navigate to
https://www.opensecrets.org/federal-lobbying/alphabetical-list?type=s
which will look like this
Hover over any of the industry links and you’ll see a URL populate at the bottom of your screen. We’re going to have R load each of these links.
To aid this process, open SelectorGadget. Click on one of the industry links so it is highlighted, like this
The reference .color-category is actually a CSS selector. A CSS selector is the first part of a CSS rule: a pattern of elements and other terms that tells the browser which HTML elements should be selected.
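To see a selector in action without a live site, here’s a tiny self-contained example; the HTML fragment is made up for illustration, but the class name mirrors the one SelectorGadget just found.
library(rvest)
toy <- minimal_html('
  <div class="color-category"><a href="/industries/summary?id=A">Agribusiness</a></div>
  <div class="something-else"><a href="/elsewhere">Not this one</a></div>
')
# ".color-category a" means: <a> tags nested inside any element
# whose class includes "color-category"
toy %>%
  html_nodes(".color-category a") %>%
  html_attr("href")
# [1] "/industries/summary?id=A"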
We’ll load the page and use this selector
library(rvest)
h0 <- "https://www.opensecrets.org/federal-lobbying/alphabetical-list?type=s" %>%
read_html
h0 %>%
html_nodes(
".color-category"
)
Let’s look at the results of this function
These 141 results include some graphical properties, but the thing we’re interested in is the href attribute, which holds the links to the pages that summarize lobbying by industry. To get at that, we’re going to need to append an a to our selector.
h0 %>%
html_nodes(
".color-category a"
)
This is looking better! html_attr() will return the precise attribute we’re interested in
h0 %>%
html_nodes(
".color-category a"
) %>%
html_attr("href")
We’ll use list columns so that R does all the housekeeping for us.
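If list columns are new to you, here’s the idea in miniature: a tibble column can hold arbitrary R objects (whole vectors here, and shortly, whole parsed HTML documents), and each object stays lined up with its row. This toy tibble is purely illustrative.
# a toy list column: each cell holds an entire vector
tibble(
  industry = c("a", "b"),
  links = list(
    c("/page-1", "/page-2"),
    c("/page-3")
  )
)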
t0 <- tibble(
url = str_c(
"https://www.opensecrets.org",
h0 %>%
html_nodes(
".color-category a"
) %>%
html_attr("href")
)
) %>%
filter(
url %>%
str_detect("sectors", negate = F)
)
Now we can use map and rvest::read_html() to load a vector of these pages
t0 %<>%
mutate(
pags = url %>%
map(
read_html,
.progress = T
)
)
What do these sectoral pages look like?
We don’t want to report lobbying for just 13 sectors; we want precision down to the industry level. So we’re going to open one of these sector summaries (the health sector is here: https://www.opensecrets.org/federal-lobbying/sectors/summary?cycle=2024&id=H) and use SelectorGadget again to get the precise tag for these industry links.
Mess with SG until you get the following selection
Again, we’ll append "a" so that we can readily access the hyperlink for each industry. Armed with these links, we’ll map over each sector and attach the Open Secrets root URL
t1 <- tibble(
url = t0$pags %>%
map(
\(i)
i %>%
html_nodes(".color-category a") %>%
html_attr("href")
) %>%
unlist %>%
str_c(
"https://www.opensecrets.org",
.
)
)
Mmmm, that cycle=2024 is interesting. While the bar chart shows each year since 1998, the chart itself isn’t scrapeable. Try it yourself: vary cycle=1998 and drop it into the Agricultural Services/Products URL, in the following way
https://www.opensecrets.org/federal-lobbying/industries/summary?cycle=1998&id=A07
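If you’d rather build that variant from R than edit the address bar, swapping the cycle parameter with str_replace() does it. A small sketch, reusing the Agricultural Services/Products id (A07) from the URL above:
ag_url <- "https://www.opensecrets.org/federal-lobbying/industries/summary?cycle=2024&id=A07"
# swap in an earlier cycle and load that page
ag_url %>%
  str_replace("cycle=2024", "cycle=1998") %>%
  read_html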
So, what function is useful for us, to get every unique combination of cycle and industry id? Right, clever imaginary interlocutor: our old friend expand_grid!
eg1 <- expand_grid(
cycle = 1998:2024,
id = t1$url %>%
str_sub(
t1$url %>%
str_locate("id=") %>%
extract(, 2) %>%
add(1)
)
) %>%
mutate(
url = str_c(
"https://www.opensecrets.org/industries/lobbying?cycle=",
cycle,
"&ind=",
id
)
)
Now, rvest is pretty fast and efficient, but it can only scrape about 100 pages a minute, so scraping every unique industry/year combination would take around 25 minutes.
To make this tractable as an exercise, let’s just scrape the first 200 industry/year pages.
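The back-of-the-envelope behind that estimate, if you want to check it before committing:
nrow(eg1)       # every cycle x industry combination (the ~2,565 rows scraped in full later)
nrow(eg1) / 100 # at roughly 100 pages a minute, about 25 minutes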
eg1 %<>%
slice(
1:200
)
eg1$pag <- eg1$url %>%
map(
possibly(
\(i)
i %>%
read_html,
otherwise = NULL
),
.progress = T
)
What are the CSS tags for each relevant piece of the page that we’d like as variables in an eventual table?
Let’s open the first URL and mess with SG
What do we want in an eventual table? At a minimum, the total annual lobbying amount and the industry’s name. On your own, figure out which CSS tags correspond with each of those properties.
Armed with these tags, we can do the following
eg1$tot <- eg1$pag %>%
map(
possibly(
\(i)
i %>%
html_nodes(".l-grid .f-strata-title") %>%
html_text2,
otherwise = NULL
),
.progress = T
)
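# some pages fail to load or match nothing; where the result has length
# zero, record an NA, otherwise keep the first matched total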
eg1$tot2 <- eg1$tot %>%
map_chr(
\(i)
i %>%
length %>%
equals(0) %>%
ifelse(
NA,
i %>%
extract2(1)
)
)
eg1$title <- eg1$pag %>%
map_chr(
possibly(
\(i)
i %>%
html_nodes(".Hero-title") %>%
html_text2,
otherwise = ""
),
.progress = T
)
Now we’ll load the version which I very helpfully scraped for all 2565 industry/year rows, and do some housekeeping.
eg2 <- "https://github.com/thomasjwood/code_lab/raw/main/data/eg1_scrape.rds" %>%
url %>%
readRDS %>%
mutate(
tot2 = tot2 %>%
str_remove_all(
"\\$|,"
) %>%
as.numeric,
title = title %>%
str_remove(" Lobbying")
) %>%
filter(
tot2 %>%
is.na %>%
not
)
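To see what that cleaning step is doing, here’s the same str_remove_all() call run on a single made-up dollar figure:
"$4,128,750" %>%
  str_remove_all("\\$|,") %>%
  as.numeric
# [1] 4128750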
Mmmm I want to make my typical overly elaborate figure, so let’s change some of the silly long industry titles into something shorter
hs <- eg2$title %>%
unique %>%
extract(
order(
eg2$title %>%
unique %>%
str_length
) %>%
rev
) %>%
extract(1:20)
eg2$title %<>%
plyr::mapvalues(
hs,
c("Non-profits", "Alt Energy", "Electronics", "Pro-Abortion", "Crop Production",
"Misc Manufacturing", "Pharmaceuticals", "Religious Orgs", "Chemicals",
"Ag Services", "Public Officials", "Entertainment", "Building Materials",
"Anti-Abortion", "Telecom", "Finance", "Forestry", "Hospitals", "Trade Contractors",
"Defense Policy")
)
Now we’ll use tidyquant to adjust all the dollar amounts into 2023 dollars
cpi <- tidyquant::tq_get(
"CPIAUCSL",
get = "economic.data",
from = "1998-01-01"
) %>%
mutate(year = date %>% year) %>%
group_by(year) %>%
summarize(
pr = price %>% mean
)
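# af is each year's average CPI relative to 2023; dividing a nominal dollar
# amount by af re-expresses it in 2023 dollars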
cpi$af <- cpi$pr %>%
divide_by(
cpi$pr[cpi$year == 2023]
)
t2 <- eg2 %>%
left_join(
cpi %>%
select(-pr) %>%
rename(
cycle = year
)
) %>%
mutate(
tot3 = tot2 %>%
divide_by(af),
title = title %>%
fct_reorder(
tot3,
.na_rm = T,
.desc = T)
)
We’ll order industries by the amount they spend on lobbying
t2$title %<>%
factor(
t2 %>%
group_by(title) %>%
summarize(mu = tot3 %>% mean) %>%
arrange(desc(mu)) %>%
use_series(title)
)
Now we plot
font_add_google("Public Sans")
showtext_auto()
t2 %>%
ggplot(
aes(
title, tot3
)
) +
geom_point(
size = 1.75,
shape = 19,
alpha = .5
) +
labs(
title = "Federal lobbying by industry, 1998-2024",
subtitle = "Points depict total annual lobbying, adjusted to 2023 dollars. Pro-Israel, Gun Rights, Gun Conrol, Abortion Rights,and Anti-Abortion lobbyists highlighted.",
x = "",
y = "",
caption = "Data source: opensecrets.org"
) +
scale_y_continuous(
breaks = seq(0, 6e8, length.out = 4),
labels = c(
"$0",
str_c(
"$",
seq(200, 600, 200),
"M"
)
)
) +
theme_minimal(base_family = "Public Sans") +
theme(
plot.margin = margin(
t = .25,
r = .25,
l = .7,
b = .1,
unit = "cm"
),
plot.title = element_text(face = "bold"),
plot.caption = element_text(
margin = margin(-.3, unit = "cm"),
face = "italic"),
panel.grid = element_blank(),
axis.text.x = element_text(
margin = margin(t = -.6, unit = "cm"),
hjust = 1,
vjust = 1,
size = 6.75,
angle = 45/1.6
),
plot.title.position = "plot",
legend.position = "none"
)