We learned to scrape the web a while ago. Recently, I had a task I thought was a good opportunity to put the rusty skill to use, but as is often the case the real world was messier than I had hoped it to be. With Tom’s help, I got where I wanted to be in the end. Below I lay out the process so that you guys find your future scraping exercise to be a breeze!
Foreign Relations of the United States (FRUS) is an “official documentary historical record of major U.S. foreign policy decisions and significant diplomatic activity.” It is a rich source of data, where you can see the diplomatic notes and telegrams related to major historical events. You can access the documents for the past administrations, all the way to the Clinton administration.
For my purposes, I was interested in documents that were related to specific conflicts. Helpfully, FRUS maintains certain tags for major historical events. You can see the whole list of tags at the following link: https://history.state.gov/tags/all
What I specifically wanted to do was automate the downloading process for all the documents tagged “World War II”. You can access the list of documents at the following link: https://history.state.gov/tags/world-war-ii. You can see that there are 48 volumes! And within each, there is a list of different chapters of documents. Phew! Will I ever get through downloading all of these before I enter the job market? What about all the bookkeeping - how do I save them so that I retain the information about which volume each document comes from? Am I doomed? This is when I noticed that the Office of the Historian (OTH) very helpfully provided the option of downloading each volume as a single file. Eureka! If only I could automate the process of accessing each volume under the tag “World War II” and downloading all the EPUB files…
I will walk through how I managed to do this below, the outcome being a folder on my machine containing all 48 volumes, with the whole process barely taking a minute! (When I did it manually for another tag with fewer volumes (38 total), it took a full 15 minutes. It also gave me eye strain. And I was unsure whether I had missed anything.) Maybe we could see if we can tweak the process to automate the downloading for documents classified under another tag that would be of interest to you? (Tags include Malawi, Antisemitism, Elections…)
I first saved the address to the top-most page of interest, which was the page containing the list of files tagged “World War II”.
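library(tidyverse) # tibble(), str_c(), mutate(), map(), possibly()
library(rvest)     # read_html(), html_elements(), html_attr()
h1 <- "https://history.state.gov/tags/world-war-ii" %>%
  read_html()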
Remember this guy?
I used our friend with a reliable mechanical arm to look for the CSS selector for what I want: the link to each volume. We can see that it is #content-inner a.
With this information, I can get the URLs for all of the pages on the list.
h1_t <- tibble(
  url = str_c(
    "https://history.state.gov",
    h1 %>%
      html_nodes(
        "#content-inner a"
      ) %>%
      html_attr("href")
  )
)
The result is like below. 48 rows seems exactly right!
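To see what this step does without touching the live site, here is a minimal sketch on a made-up HTML snippet. The snippet, its `href` values, and the names `toy_page` and `toy_urls` are invented for illustration; only the selector and the site root come from the step above.

```r
library(rvest)   # minimal_html(), html_elements(), html_attr()
library(stringr) # str_c()
library(tibble)  # tibble()

# A made-up page that imitates the structure of the tag page (assumption)
toy_page <- minimal_html('
  <div id="content-inner">
    <a href="/historicaldocuments/volume-a">Volume A</a>
    <a href="/historicaldocuments/volume-b">Volume B</a>
  </div>
  <div id="footer"><a href="/about">About</a></div>
')

# Keep only the links inside #content-inner, then prepend the site root
toy_urls <- tibble(
  url = str_c(
    "https://history.state.gov",
    toy_page %>%
      html_elements("#content-inner a") %>%
      html_attr("href")
  )
)

toy_urls
```

The footer link is left out because it sits outside `#content-inner`. Pasting the site root in front only works when every `href` is relative, as it appears to be here; if a page mixed relative and absolute links, `xml2::url_absolute()` would probably be the safer choice.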
Now, using a map approach, you can retrieve the HTML content of each URL into a new column, pags. Interestingly, when I ran the code below without Sys.sleep(3), the number of NULL entries in pags kept changing from run to run. Tom enlightened me that this was because I was firing off requests far too quickly for a normal human, so it was obvious that I was scraping and some of the requests failed. So we give it a human-like break before each map iteration.
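library(magrittr) # provides the %<>% assignment pipe
h1_t %<>%
  mutate(
    pags = url %>%
      map(
        possibly(
          \(i) {
            Sys.sleep(3) # pause 3 seconds before each request to avoid rate limiting
            i %>%
              read_html()
          },
          otherwise = NULL # a failed request becomes NULL instead of stopping the loop
        ),
        .progress = TRUE
      )
  )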
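Because failed requests come back as NULL, it is worth counting them and retrying only those. The sketch below shows the idea offline: `flaky_fetch()` is a hypothetical stand-in for `read_html()` that fails the first time one of the URLs is requested, and the `example.org` addresses are placeholders.

```r
library(purrr) # map(), map_lgl(), possibly()

# Hypothetical stand-in for read_html(): the URL ending in "b" fails on its
# first request and succeeds afterwards, imitating a rate-limited server
attempts <- new.env()
flaky_fetch <- function(u) {
  n <- if (is.null(attempts[[u]])) 0 else attempts[[u]]
  attempts[[u]] <- n + 1
  if (n == 0 && grepl("b$", u)) stop("simulated rate limit")
  paste("page for", u)
}

safe_fetch <- possibly(flaky_fetch, otherwise = NULL)

urls <- c("https://example.org/a", "https://example.org/b")
pages <- map(urls, safe_fetch)

# Which requests came back empty?
failed <- map_lgl(pages, is.null)
sum(failed)

# Retry only the failures, pausing before each request
pages[failed] <- map(urls[failed], \(u) {
  Sys.sleep(1)
  safe_fetch(u)
})

# Count what is still missing after the retry
sum(map_lgl(pages, is.null))
```

With the real table, the same pattern would apply to `h1_t$url` and `h1_t$pags`, with `read_html()` in place of the toy fetcher.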
Then, I needed to find the links for those little epub buttons that you could download the files with. Admittedly, this was tricky, as I could not use the CSS selector for it. I think a more robust way of trying this would be to look at what links are present in the page you are looking at, and try to identify which could be the one you want. You can do this like the following:
# Let's check using the first page
sample <- h1_t$pags[[1]]
# Look for one that has .epub
sample %>%
  html_elements("a") %>%
  html_attr("href")
# You can also do this
sample %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("\\.epub")
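Here is the same filtering idea on a made-up page, so you can see what each step returns without hitting the site. The HTML, the `example.org` links, and the names `toy_volume`, `hrefs`, `epub_links` and `css_links` are all invented for illustration.

```r
library(rvest)   # minimal_html(), html_elements(), html_attr()
library(stringr) # str_subset()

# Made-up volume page (assumption): the file names are invented
toy_volume <- minimal_html('
  <a href="/historicaldocuments/volume-a/ch1">Chapter 1</a>
  <a href="https://example.org/ebooks/volume-a.epub">EPUB</a>
  <a href="https://example.org/ebooks/volume-a.mobi">Mobi</a>
  <a>Anchor without href</a>
')

# Every link target on the page; anchors without an href give NA
hrefs <- toy_volume %>%
  html_elements("a") %>%
  html_attr("href")

# str_subset() keeps the matching strings and drops the NA
epub_links <- str_subset(hrefs, "\\.epub$")
epub_links

# Alternative: let a CSS attribute selector do the filtering
css_links <- toy_volume %>%
  html_elements("a[href$='.epub']") %>%
  html_attr("href")
css_links
```

On this toy snippet the attribute selector gives the same answer as the regular expression; whether it behaves the same on the real volume pages is untested here.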
Now, we can use the map approach to get the .epub links for all pages.
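h1_t$links <- h1_t$pags %>%
  map_chr(
    \(i) {
      if (is.null(i)) return(NA_character_) # the page failed to load earlier
      i %>%
        html_elements("a") %>%
        html_attr("href") %>%
        str_subset("\\.epub") %>%
        first() # map_chr() needs exactly one value per page; NA if no link was found
    }
  )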
All that is left for us to do is use download.file() to do the actual downloading! tryCatch will let me know where it failed to perform, so that I don’t have to strain my eyes to find out if I have all the files.
# Specify the folder name you want to save your .epub files in
library(fs) # provides dir_create()
download_folder <- "World-War-ii"
dir_create(download_folder)
# Download one file, reporting success or failure instead of stopping
download_epub <- function(url, dest_folder) {
  filename <- basename(url) # Extract the filename from URL
  dest_path <- file.path(dest_folder, filename)
  # Download the file
  tryCatch(
    {
      download.file(url, dest_path, mode = "wb") # "wb" keeps the binary file intact on Windows
      message("Downloaded: ", filename)
    },
    error = function(e) {
      message("Failed to download: ", url)
    }
  )
}
# Walk over all links to download the files (walk() is map() for side effects)
h1_t$links %>%
  walk(~ download_epub(.x, download_folder))
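If you would rather not rely on scrolling through the messages, one possible check is to compare the file names you expected with what is actually in the folder. This is a minimal sketch using only base R; the links, the temporary folder, and the pretend "successful download" are made up so that it runs on its own.

```r
# Hypothetical inputs: the links you tried to download and the target folder
links <- c(
  "https://example.org/ebooks/volume-a.epub",
  "https://example.org/ebooks/volume-b.epub"
)
folder <- file.path(tempdir(), "epub-check")
dir.create(folder, showWarnings = FALSE)

# Pretend that only the first download succeeded
file.create(file.path(folder, "volume-a.epub"))

expected <- basename(links)
on_disk <- list.files(folder, pattern = "\\.epub$")

# File names that were expected but never arrived
missing_files <- setdiff(expected, on_disk)
missing_files
```

In the real workflow, `links` would be `h1_t$links` and `folder` would be `download_folder`; an empty result would mean every volume made it to disk.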
And then I end up with the beautiful result below:
Do you want to try the above process, tweaking it as necessary, to get all the FRUS files with the tag of your interest?
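As a starting point, the only thing that should need to change is the address of the tag page. The helper below is a hypothetical convenience (the name `tag_url()` is made up), and it assumes that other tags follow the same URL pattern as the World War II one. Look up the exact slug for your tag on https://history.state.gov/tags/all rather than guessing it.

```r
library(stringr) # str_c()

# Hypothetical helper: build the listing URL for a tag slug,
# assuming every tag page lives under /tags/<slug>
tag_url <- function(slug) {
  str_c("https://history.state.gov/tags/", slug)
}

tag_url("world-war-ii")
```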