Crash course in web scraping?

Nah.

Without going into all the hairy details, the basic thing to understand is that all of the necessary pieces of information are contained in some structured way in the HTML encoding of the web page. Think of it as kind of like reaching into a giant Excel sheet and trying to grab specific pieces of information without being able to actually see the data while reaching in. The point is, everything is structured, not random. This means we should be able to programmatically extract the right data in a fairly reliable way without things getting all jumbled up in the process (at least, not in a way we can’t programmatically address).
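
To make that concrete, here is a tiny, self-contained illustration (it uses the rvest functions loaded in Step 1, and a made-up HTML snippet rather than a real page): because the HTML is structured, we can ask for exactly the cells we want.

snippet <- read_html("<table>
  <tr><td>16:32, 15 May 2017</td><td>Kzl55</td></tr>
  <tr><td>17:24, 18 April 2017</td><td>Kzl55</td></tr>
</table>")

html_nodes(snippet, "td") %>% html_text() #returns the text of each <td> cell, in document order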

Step 1: Install and load the required packages

Note: The primary package used for scraping here is the rvest package.

#install.packages("tidyverse")
#install.packages("rest")
#install.packages("stringr")
#install.packages("httr")

library(tidyverse)
library(rvest)
library(stringr)
library(httr)

Step 2: Obtain the URL

Go to the archive page and set the view setting to 500 images per page. The URL should then include “&offset=&limit=500#filehistory” at the end. For example: “https://en.wikipedia.org/w/index.php?title=File:Somalia_map_states_regions_districts.png&limit=500#filehistory”.

Step 3: Read in the html data from the web page

url <- "https://en.wikipedia.org/w/index.php?title=File:Somalia_map_states_regions_districts.png&limit=500#filehistory"
wiki <- read_html(url)

Step 5: Obtain upload information on each map

To label each file in the “date_username” format, we need the upload dates and the usernames of the people who uploaded the maps. The following code extracts this information for each map.

map_info <- wiki %>%
  html_nodes("tr") %>%
  html_text() %>%
  as.data.frame() %>% #store in dataframe
  transform(`.` = as.character(`.`)) %>%
  filter(grepl("×", ., perl=TRUE))

This is what the data looks like:

## [1] "current16:32, 15 May 2017952 × 981 (263 KB)Kzl55As of April 2017"                                                                   
## [2] "17:24, 18 April 2017852 × 900 (265 KB)Kzl55An updated map. By Nicolay Sidorov."                                                     
## [3] "22:32, 16 November 20143,000 × 3,169 (1.64 MB)AcidSnowUpdated territory inhabited by ethnic Somalis and fixed key."                 
## [4] "17:54, 15 November 20143,000 × 3,169 (1.71 MB)AcidSnowAs of October 14, 2014."                                                      
## [5] "02:33, 13 October 20143,000 × 3,169 (1.7 MB)AcidSnowUpdated territorial control and fixed the boundaries of the regions of Somalia."
## [6] "01:19, 10 September 20143,000 × 3,169 (1.73 MB)AcidSnowWTF, piracy has been died for years now."

As you can see, the dates and usernames are sandwiched between some unnecessary text. We can start by pulling out the timestamps at the beginning of each row. The timestamps will come in handy later, when we need to find files uploaded on the same day and keep only the latest upload for each date.

map_info$time <- str_extract(map_info$`.`, "\\d{2}:\\d{2}")

Now to get the upload date:

map_info$date <- str_extract(map_info$`.`, "\\d{1,2} \\w+ \\d{4}")

regex translation: one or two digits (the day), a space, one or more word characters (the month name), another space, and four digits (the year).

We want to convert the date into “yyyy_mm_dd” format. To do this, first separate the year, month, and day into distinct variables.

map_info <- separate(map_info, date, into = c("day", "month", "year"), sep = " ")

Then convert the month names into numbers.

map_info$month <- match(map_info$month, month.name)

Once we have the usernames, we can paste the information together in the “yyyy_mm_dd_username” format.

So now to get the usernames. The usernames come just after the image size and file size information and can be easily separated from these, since the file size information is enclosed in brackets and the usernames only contain letters and numbers. But just after the usernames there is a bunch of text consisting of letters and numbers, just like the usernames themselves. Separating the usernames from the unnecessary text is not 100% straightforward in terms of regex commands.

One option is to obtain a list of all the possible usernames and use this list to filter out irrelevant text. The following code extracts the usernames for all of the maps on the page, albeit in a messy way.

usernames <- wiki %>%
  html_nodes("td") %>%
  html_text() %>%
  as.data.frame() %>% #make dataframe
  transform(`.` = as.character(`.`))

The first 30 rows of the data look something like this:

##                   .
## 1                  
## 2   This is a file 
## 3                  
## 4   This locator ma
## 5   DescriptionSoma
## 6  \nAfrikaans: Hie
## 7              Date
## 8     12 April 2010
## 9            Source
## 10         Own work
## 11           Author
## 12  Ingoman (James 
## 13  Public domainPu
## 14                 
## 15  I, the copyrigh
## 16          current
## 17  16:32, 15 May 2
## 18                 
## 19  952 × 981 (263 
## 20            Kzl55
## 21  As of April 201
## 22                 
## 23  17:24, 18 April
## 24                 
## 25  852 × 900 (265 
## 26            Kzl55
## 27  An updated map.
## 28                 
## 29  22:32, 16 Novem
## 30

We can see that the first 19 rows are pretty much useless. The “Kzl55” in row 20 is one of the usernames, associated with the uploader of a map that was uploaded on May 15, 2017 as indicated by the date above it. Also, looking at the rest of the data will show that there is a username on every sixth row starting on row 20.
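
As a quick sanity check, seq() with a step of 6 generates exactly those row numbers:

seq(20, 50, 6)
## [1] 20 26 32 38 44 50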

The code usernames[seq(20, nrow(usernames), 6), ] will extract the username for each file. To get the list of unique usernames:

usernames_list <- unique(usernames[seq(20, nrow(usernames), 6), ])
usernames_regex <- paste(usernames_list, collapse = "|")

usernames_regex
## [1] "Kzl55|AcidSnow|Spesh531|Wolfiukas|Turkmenistan|NordNordWest|Ingoman|Malus Catulus|Aotearoa|Quibik"

The only purpose of storing the usernames in a long string separated by “|” is to make searching for usernames in text easier. Placed in a regex search phrase, “|” is the equivalent of “or.” So, “Kzl55|AcidSnow|Spesh531|Wolfiukas|Turkmenistan|NordNordWest|Ingoman|Malus Catulus|Aotearoa|Quibik” can be used to tell R to search for “Kzl55” OR “AcidSnow” OR “Spesh531,” and so on.
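
As a quick illustration (the text here is made up, not taken from the page), str_extract() will pull out whichever of the listed usernames appears in a string:

str_extract("22:32, 16 November 2014 ... AcidSnow Updated key", usernames_regex)
## [1] "AcidSnow"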

Now we can get the usernames.

map_info$uploader <- str_extract(map_info$`.`, usernames_regex)

And now paste the year, month, day, and username together with "_" in between.

map_info <- map_info %>%
  #make sure the variables are character variables
  transform(month = as.character(month)) %>%
  #make sure the month and day are in two digits
  transform(month = ifelse(nchar(month) == 1, paste0("0", month), month)) %>%
  transform(day = ifelse(nchar(day) == 1, paste0("0", day), day)) %>%
  #paste the date and username info together
  unite(labels, year, month, day, uploader, sep = "_", remove = FALSE)
## [1] "2017_05_15_Kzl55"    "2017_04_18_Kzl55"    "2014_11_16_AcidSnow"
## [4] "2014_11_15_AcidSnow" "2014_10_13_AcidSnow" "2014_09_10_AcidSnow"

Finally, we have all the information needed to download and label the files. Combine the dataframe containing the upload information with the dataframe containing the links.
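
The links dataframe comes from an earlier step that is not reproduced here, and cbind() assumes it has one row per map, in the same order as map_info. If you need to rebuild it, here is a rough sketch of one possible approach (the exact filtering will depend on what other links appear on the page, and protocol-relative links starting with "//" would need "https:" pasted in front before downloading):

links <- wiki %>%
  html_nodes("a") %>% #grab every link on the page
  html_attr("href") %>% #keep just the URLs
  str_subset("upload\\.wikimedia\\.org") %>% #keep only links pointing to the uploaded image files
  unique() %>%
  as.data.frame() %>% #make dataframe
  setNames("url") #name the column "url" so map_info$url works in the download loop below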

map_info <- cbind(map_info, links)

Remember that our naming convention assumes only one map uploaded per person per date. So to avoid files having the same name, we have to drop any map that was uploaded on the same day, by the same person, as another map. Naturally, we want to keep the latest upload.

We can confirm by looking at the dates and time stamps that the latest upload always comes before older uploads in the dataframe. So now that we have the label variable, we can use it to check for duplicates and filter out instances of the same label that come after the first.

##                labels  time
## 57 2011_02_10_Ingoman 20:30
## 58  2011_02_10_Quibik 19:56
## 59 2011_02_10_Ingoman 18:10
## 60 2011_02_10_Ingoman 17:50
## 61  2011_02_06_Quibik 00:19
## 62 2011_01_08_Ingoman 01:41
## 63 2011_01_08_Ingoman 01:31
## 64 2011_01_07_Ingoman 03:10
## 65 2011_01_06_Ingoman 22:54
## 66 2011_01_06_Ingoman 22:06
## 67 2011_01_06_Ingoman 21:47
## 68 2011_01_06_Ingoman 20:48
## 69 2010_12_28_Ingoman 18:09
## 70 2010_12_28_Ingoman 18:03
## 71 2010_12_16_Ingoman 23:00

Find duplicates:

map_info <- mutate(map_info, dup = duplicated(labels))
##                labels  time   dup
## 57 2011_02_10_Ingoman 20:30 FALSE
## 58  2011_02_10_Quibik 19:56  TRUE
## 59 2011_02_10_Ingoman 18:10  TRUE
## 60 2011_02_10_Ingoman 17:50  TRUE
## 61  2011_02_06_Quibik 00:19 FALSE
## 62 2011_01_08_Ingoman 01:41 FALSE
## 63 2011_01_08_Ingoman 01:31  TRUE
## 64 2011_01_07_Ingoman 03:10 FALSE
## 65 2011_01_06_Ingoman 22:54 FALSE
## 66 2011_01_06_Ingoman 22:06  TRUE
## 67 2011_01_06_Ingoman 21:47  TRUE
## 68 2011_01_06_Ingoman 20:48  TRUE
## 69 2010_12_28_Ingoman 18:09 FALSE
## 70 2010_12_28_Ingoman 18:03  TRUE
## 71 2010_12_16_Ingoman 23:00 FALSE

Remove the duplicates:

map_info <- filter(map_info, dup == FALSE)

But wait, what if an intern had already downloaded some maps by hand??? Say, all of the maps uploaded before 2009? We can filter out maps by year, like so:

map_info <- filter(map_info, year >= 2009) #keeps maps where the upload year is greater than or equal to 2009

Step 6: Download the maps

for (i in seq_along(map_info$labels)) {
  label <- map_info$labels[i] #file name in "yyyy_mm_dd_username" format
  link <- map_info$url[i] #direct link to the image file
  
  #download the image and save it in the working directory under its label
  GET(link, write_disk(paste0(label, ".png"), overwrite = TRUE))
}

Final comments

Is there an easier way to do this? Probably, yes. There are multiple ways to get at the same result.