Let’s try scraping another page, this time one containing maps of the Syrian conflict (around 80 in total). The page we’re working from can be accessed here: https://commons.wikimedia.org/w/index.php?title=File:Syrian_Civil_War_map.svg&offset=&limit=500#filehistory
Refer back to the tutorial to reuse the code whenever appropriate. Working code is also provided in the code chunks, which you can view by clicking on “Code.” But try to write your own code if you can.
#install.packages("tidyverse")
#install.packages("rest")
#install.packages("stringr")
#install.packages("httr")
library(tidyverse)
library(rvest)
library(stringr)
library(httr)
The URL is already set.
url <- "https://commons.wikimedia.org/w/index.php?title=File:Syrian_Civil_War_map.svg&offset=&limit=500#filehistory"
wiki <- read_html(url)
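To confirm that the right page was fetched, an optional sanity check is to print the page title:
wiki %>% html_node("title") %>% html_text()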
Extract the links to the image files and store them in a dataframe. Make sure to remove any unnecessary links, such as those for the images preceding the “File history” section. Hint: use head() to take a look at the top rows to see which links are duplicates.
links <- wiki %>%
  html_nodes("img") %>%
  html_attr("src") %>%
  as.data.frame() %>%
  transform(`.` = as.character(`.`)) %>%
  filter(grepl("Syrian_Civil_War_map", ., perl = TRUE))
links <- as.data.frame(links[4:nrow(links), 1]) #the first three images are duplicates
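A quick head(links) at this point confirms that only links to versions of the map remain:
head(links)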
Rename the column header for the links to something usable using names(), e.g. names([dataframe]) <- c("url")
names(links) <- c("url")
Clean up the links to make them usable. Hint: Try copying some of the links into your browser and take a look at the images that show up (or don’t show up). What happens when a link contains “/thumb”?
links$url <- gsub("\\/thumb", "", links$url) #get rid of "/thumb" referring to the thumbnail versions of the maps
links$url <- gsub("\\/\\d+px.+$", "", links$url) #get rid of everything after ".svg
Extract information on each map, including the upload date and time and uploader username. The information should be stored in a dataframe with a single column.
map_info <- wiki %>%
  html_nodes("tr") %>%
  html_text() %>%
  as.data.frame() %>% #store in dataframe
  transform(`.` = as.character(`.`)) %>%
  filter(grepl("×", ., perl = TRUE))
Separate out date and time information from the rest of the text, and store them as separate variables.
map_info$time <- str_extract(map_info$`.`, "\\d{2}:\\d{2}")
map_info$date <- str_extract(map_info$`.`, "\\d{1,2} \\w+ \\d{4}")
map_info <- separate(map_info, date, into = c("day", "month", "year"), sep = " ")
map_info$month <- match(map_info$month, month.name)
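The match() call works because month.name is a built-in vector of English month names, so matching against it returns each month’s numeric position:
match("September", month.name) #returns 9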
Compared to the example in the tutorial, extracting the username data should be much easier here. Examine the data to see if there are any useful patterns.
Hint 1: Always place \\ in front of special characters like brackets.
Hint 2: [\\d\\w ] will capture numbers, letters, and spaces. Note that usernames may contain spaces in addition to letters and numbers.
Hint 3: Putting + after something lets R know you want one or more of the thing, e.g. [\\d ]+ for one or more numbers or spaces.
map_info$uploader <- str_extract(map_info$`.`, "\\)[\\d\\w ]+ \\(")
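To see what this pattern captures, here it is applied to a made-up fragment shaped like the real rows (the username is hypothetical; html_text() runs the table cells together, which is why the name sits right after the file size’s closing parenthesis):
str_extract("(2.5 MB)ExampleUser123 (talk | contribs)", "\\)[\\d\\w ]+ \\(")
#returns ")ExampleUser123 ("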
Make sure to remove any extra text before and after the usernames. Hint: If all the entries begin or end with specific text, gsub() with ^text and text$ will do the trick, e.g. gsub("^text", "", map_info$uploader)
map_info$uploader <- gsub("^\\)", "", map_info$uploader)
map_info$uploader <- gsub(" \\($", "", map_info$uploader)
Now format the date variables and paste them with the username to get the file labels.
map_info <- map_info %>%
  #make sure the variables are character variables
  transform(month = as.character(month)) %>%
  #make sure the month and day are in two digits
  transform(month = ifelse(nchar(month) == 1, paste0("0", month), month)) %>%
  transform(day = ifelse(nchar(day) == 1, paste0("0", day), day)) %>%
  #paste the date and username info together
  unite(labels, year, month, day, uploader, sep = "_", remove = FALSE)
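Each label should now follow the year_month_day_uploader pattern; a quick look at the first few entries confirms it:
head(map_info$labels)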
Now combine the dataframe containing the upload information with the dataframe containing the links.
map_info <- cbind(map_info, links)
Find and remove same-day same-uploader duplicates.
map_info <- map_info %>%
  mutate(dup = duplicated(labels)) %>%
  filter(dup == FALSE)
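If you prefer a one-liner, dplyr’s distinct() does the same thing without the helper column:
#alternative: map_info <- distinct(map_info, labels, .keep_all = TRUE)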
Just for funsies, let’s filter out all the maps uploaded in 2018. Hint: The expression for “not equal to” is !=
map_info <- filter(map_info, year != 2018)
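Note that year is a character variable after separate(), but R coerces 2018 to "2018" for the comparison, so this works as expected. To confirm no 2018 maps remain:
table(map_info$year)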
Download the maps. Hint: These images are svg files, so write_disk(paste0(label, ".png")) won’t work; it would just save svg data under a misleading png extension.
for (i in seq_along(map_info$labels)) {
  label <- map_info$labels[i]
  link <- map_info$url[i]
  GET(link, write_disk(paste0(label, ".svg"), overwrite = TRUE))
}
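When downloading this many files, it is polite to pause between requests. A minimal variant of the same loop (the one-second pause is an arbitrary choice):
for (i in seq_along(map_info$labels)) {
  GET(map_info$url[i], write_disk(paste0(map_info$labels[i], ".svg"), overwrite = TRUE))
  Sys.sleep(1) #brief pause between requests to go easy on the Wikimedia servers
}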