Let’s try scraping another page, this time containing maps on the Syrian conflict (around 80 total). The page we’re working from can be accessed here: https://commons.wikimedia.org/w/index.php?title=File:Syrian_Civil_War_map.svg&offset=&limit=500#filehistory

Refer back to the tutorial to reuse the code whenever appropriate. Working code is also provided in the code chunks, which you can view by clicking on “Code.” But try to write your own code if you can.

Step 1: Install and load the required packages

#install.packages("tidyverse")
#install.packages("rvest")
#install.packages("stringr")
#install.packages("httr")

library(tidyverse)
library(rvest)
library(stringr)
library(httr)

Step 2: Obtain the URL

The URL is the one given at the top of this exercise. It is assigned to url in Step 3.

Step 3: Read in the html data from the web page

url <- "https://commons.wikimedia.org/w/index.php?title=File:Syrian_Civil_War_map.svg&offset=&limit=500#filehistory"
wiki <- read_html(url)

Step 5: Obtain upload information on each map

Extract information on each map, including the upload date and time and uploader username. The information should be stored in a dataframe with a single column.

map_info <- wiki %>%
  html_nodes("tr") %>%
  html_text() %>%
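  as.data.frame() %>% #store in a dataframe; the single column is named `.`
  transform(`.` = as.character(`.`)) %>% #make sure the text is character, not factor
  filter(grepl("×", ., perl=TRUE)) #keep rows containing "×", which appears in the image dimensions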
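To see why filtering on "×" helps, here is a minimal sketch on two made-up rows. The row texts are assumptions about what html_text() might return (a header row and one map row), not copies from the page, and the sketch uses base R only so it runs without extra packages.

```r
# Hypothetical row texts (made up for illustration; real rows will differ).
# "\u00d7" is the multiplication sign that appears in the image dimensions.
rows <- data.frame(
  text = c("Date/TimeThumbnailDimensionsUserComment",
           "09:05, 7 March 20191,100 \u00d7 950 (1.2 MB)Example User (talk | contribs)"),
  stringsAsFactors = FALSE
)

# TRUE for rows that contain the multiplication sign
keep <- grepl("\u00d7", rows$text, fixed = TRUE)

# Keep only those rows; drop = FALSE keeps the result a dataframe
map_rows <- rows[keep, , drop = FALSE]
```

Only the second row survives, because only it contains image dimensions.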

Separate out date and time information from the rest of the text, and store them as separate variables.

map_info$time <- str_extract(map_info$`.`, "\\d{2}:\\d{2}")
map_info$date <- str_extract(map_info$`.`, "\\d{1,2} \\w+ \\d{4}")
map_info <- separate(map_info, date, into = c("day", "month", "year"), sep = " ")
map_info$month <- match(map_info$month, month.name)
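To check what these patterns pick up, here is a minimal sketch on a single made-up row. The row text is an assumption about what one row might look like, and first_match() is a hypothetical helper that stands in for str_extract() using base R, so the sketch runs without extra packages.

```r
# Hypothetical row text (made up for illustration; real rows will differ)
row_text <- "current09:05, 7 March 20191,100 \u00d7 950 (1.2 MB)Example User (talk | contribs)"

# Hypothetical helper: base R stand-in for str_extract().
# Unlike str_extract(), it returns character(0) rather than NA on no match.
first_match <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

time <- first_match(row_text, "\\d{2}:\\d{2}")         # two digits, colon, two digits
date <- first_match(row_text, "\\d{1,2} \\w+ \\d{4}")  # day, month name, four-digit year

# Split the date on spaces, much as separate() does
parts <- strsplit(date, " ")[[1]]
day   <- parts[1]
month <- match(parts[2], month.name)  # month name to month number
year  <- parts[3]
```

Note that the year pattern stops after four digits, so it still works when the image width follows the year with no space in between.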

Extracting the usernames should be much easier here than it was in the tutorial. Examine the data to see if there are any useful patterns.

Hint 1: Always place \\ in front of special characters like brackets.
Hint 2: [\\d\\w ] will capture numbers, letters, and spaces. Note that usernames may contain spaces in addition to letters and numbers.
Hint 3: Putting + after something lets R know you want one or more of the thing, e.g. [\\d ]+ for one or more numbers or spaces.

map_info$uploader <- str_extract(map_info$`.`, "\\)[\\d\\w ]+ \\(")

Make sure to remove any extra text before and after the usernames. Hint: If all the entries begin or end with a specific text, using gsub() with ^text and text$ will do the trick, e.g. gsub("^text", "", map_info$uploader)

map_info$uploader <- gsub("^\\)", "", map_info$uploader)
map_info$uploader <- gsub(" \\($", "", map_info$uploader)
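The extract-then-trim steps can be tried on one made-up row. This is a sketch under assumptions: the row text and the username "Example User 42" are invented, and base R's regmatches() stands in for str_extract() so nothing needs to be installed.

```r
# Hypothetical row text (made up for illustration)
row_text <- "09:05, 7 March 20191,100 \u00d7 950 (1.2 MB)Example User 42 (talk | contribs)"

# Step 1: grab ")username (" with the pattern from the exercise
pattern <- "\\)[\\d\\w ]+ \\("
raw <- regmatches(row_text, regexpr(pattern, row_text, perl = TRUE))

# Step 2: strip the leading ")" and the trailing " ("
uploader <- gsub("^\\)", "", raw)
uploader <- gsub(" \\($", "", uploader)
```

Here raw is ")Example User 42 (" and uploader ends up as "Example User 42". A username containing other characters, such as a hyphen or a dot, would not match this character class, so it is worth checking the result for missing values.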

Now format the date variables and paste them with the username to get the file labels.

map_info <- map_info %>%
  #make sure the variables are character variables
  transform(month = as.character(month)) %>%
  #make sure the month and day are in two digits
  transform(month = ifelse(nchar(month) == 1, paste0("0", month), month)) %>%
  transform(day = ifelse(nchar(day) == 1, paste0("0", day), day)) %>%
  #paste the date and username info together
  unite(labels, year, month, day, uploader, sep = "_", remove = FALSE)
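If you want to check your labels, here is a minimal base R sketch of the same idea on made-up values. It swaps in sprintf() for the ifelse()/nchar() padding and paste() for unite(); the dataframe dates and its contents are hypothetical.

```r
# Hypothetical values standing in for map_info (made up for illustration)
dates <- data.frame(
  year     = c("2019", "2017"),
  month    = c(3, 11),
  day      = c("7", "23"),
  uploader = c("Example User", "AnotherUser"),
  stringsAsFactors = FALSE
)

# "%02d" pads integers with a leading zero to two digits
dates$month <- sprintf("%02d", as.integer(dates$month))
dates$day   <- sprintf("%02d", as.integer(dates$day))

# paste() with sep = "_" builds the same kind of label as unite()
dates$labels <- paste(dates$year, dates$month, dates$day, dates$uploader, sep = "_")
```

The labels come out as "2019_03_07_Example User" and "2017_11_23_AnotherUser". Putting the year first and padding to two digits means the labels sort in date order.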

Now combine the dataframe containing the upload information with the dataframe containing the links.

map_info <- cbind(map_info, links)
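cbind() pairs rows purely by position, so links needs one row per row of map_info, in the same order, and (judging from the download step) a column named url. A minimal sketch of a sanity check, using made-up stand-ins for both dataframes:

```r
# Hypothetical stand-ins for map_info and links (made up for illustration)
info  <- data.frame(labels = c("2019_03_07_A", "2017_11_23_B"),
                    stringsAsFactors = FALSE)
links <- data.frame(url = c("https://example.org/a.svg",
                            "https://example.org/b.svg"),
                    stringsAsFactors = FALSE)

# Stop early if the two dataframes do not line up row for row
stopifnot(nrow(info) == nrow(links))

# Row i of info is glued to row i of links
combined <- cbind(info, links)
```

The check only compares row counts. It cannot tell you whether the rows are in the same order, so it is still worth eyeballing a few rows after combining.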

Find and remove same-day same-uploader duplicates.

map_info <- map_info %>%
  mutate(dup = duplicated(labels)) %>%
  filter(dup == FALSE)
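duplicated() marks every repeat of a value after its first appearance, so the filter keeps the first row for each label. A quick sketch on made-up labels:

```r
# Hypothetical labels (made up for illustration)
labels <- c("2019_03_07_A", "2019_03_07_A", "2019_03_06_A", "2019_03_07_B")

# TRUE only for the second and later occurrences of a label
dup <- duplicated(labels)

# Keep the first occurrence of each label
kept <- labels[!dup]
```

Here only the second entry is flagged. If the rows are still in page order, the row that is kept is whichever upload appears first on the page for that day and uploader.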

Just for funsies, let's remove all the maps uploaded in 2018. Hint: The expression for “not equal to” is !=

map_info <- filter(map_info, year != 2018) 

Step 6: Download the maps

Download the maps. Hint: These images are SVG files, so save them with a .svg extension. write_disk(paste0(label, ".png")) would mislabel them as PNG files.

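for (i in seq_along(map_info$labels)) {
  #file label and download link for the i-th map
  label <- map_info$labels[i]
  link <- map_info$url[i]
  
  #save the file in the working directory as label.svg
  GET(link, write_disk(paste0(label, ".svg"), overwrite = TRUE))
}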
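If you want a slightly more careful version of the loop, here is one possible sketch. Both svg_path() and download_maps() are hypothetical helpers, not part of the tutorial. Replacing spaces and other unusual characters in the file name, pausing between requests, and warning on failed downloads are all optional choices made here as assumptions. download_maps() needs the httr package and is only defined, not run.

```r
# Hypothetical helper: build the destination file name for one label
svg_path <- function(label, dir = ".") {
  # Replace anything other than letters, digits, "_" and "-" with "_"
  # (an assumption about which characters are safe in file names)
  safe <- gsub("[^A-Za-z0-9_-]", "_", label)
  file.path(dir, paste0(safe, ".svg"))
}

# Hypothetical helper: download every map, pausing between requests.
# Requires httr; it is defined here but not called.
download_maps <- function(labels, urls, dir = ".", pause = 1) {
  for (i in seq_along(labels)) {
    resp <- httr::GET(urls[i],
                      httr::write_disk(svg_path(labels[i], dir), overwrite = TRUE))
    # http_error() is TRUE for status codes of 400 and above
    if (httr::http_error(resp)) {
      warning("Download failed for ", labels[i])
    }
    Sys.sleep(pause)  # wait a moment before the next request
  }
  invisible(NULL)
}

# The file name helper can be tried without any network access
svg_path("2019_03_07_Example User")
```

With your own data, the call would look like download_maps(map_info$labels, map_info$url).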