Crash course in web scraping?

Nah.

Without going into all the hairy details, the basic thing to understand is that all of the necessary pieces of information are contained in some structured way in the HTML encoding of the web page. Think of it as kind of like reaching into a giant Excel sheet and trying to grab specific pieces of information without being able to actually see the data while reaching in. The point is, everything is structured, not random. This means we should be able to programmatically extract the right data in a fairly reliable way without things getting all jumbled up in the process (at least, not in a way we can’t programmatically address).
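
To make that concrete, here is a tiny, self-contained illustration (it uses the rvest functions loaded in Step 1, and a made-up HTML snippet rather than a real page): because the HTML is structured, we can ask for exactly the cells we want.

snippet <- read_html("<table>
  <tr><td>16:32, 15 May 2017</td><td>Kzl55</td></tr>
  <tr><td>17:24, 18 April 2017</td><td>Kzl55</td></tr>
</table>")

html_nodes(snippet, "td") %>% html_text() #returns the text of each <td> cell, in document order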

Step 1: Install and load the required packages

Note: The primary package used for scraping here is the rvest package.

#install.packages("tidyverse")
#install.packages("rest")
#install.packages("stringr")
#install.packages("httr")

library(tidyverse)
library(rvest)
library(stringr)
library(httr)

Step 2: Obtain the URL

Go to the archive page and set the view setting to 500 images per page. The URL should then include “&offset=&limit=500#filehistory” at the end. For example: “https://en.wikipedia.org/w/index.php?title=File:Somalia_map_states_regions_districts.png&limit=500#filehistory”.

Step 3: Read in the html data from the web page

url <- "https://en.wikipedia.org/w/index.php?title=File:Somalia_map_states_regions_districts.png&limit=500#filehistory"
wiki <- read_html(url)

Step 5: Obtain upload information on each map

To label each file in the “date_username” format, we need the upload dates and the usernames of the people who uploaded the maps. The following code extracts this information for each map.

map_info <- wiki %>%
  html_nodes("tr") %>%
  html_text() %>%
  as.data.frame() %>% #store in dataframe
  transform(`.` = as.character(`.`)) %>%
  filter(grepl("×", ., perl=TRUE))

This is what the data looks like:

## [1] "current16:32, 15 May 2017952 × 981 (263 KB)Kzl55As of April 2017"                                                                   
## [2] "17:24, 18 April 2017852 × 900 (265 KB)Kzl55An updated map. By Nicolay Sidorov."                                                     
## [3] "22:32, 16 November 20143,000 × 3,169 (1.64 MB)AcidSnowUpdated territory inhabited by ethnic Somalis and fixed key."                 
## [4] "17:54, 15 November 20143,000 × 3,169 (1.71 MB)AcidSnowAs of October 14, 2014."                                                      
## [5] "02:33, 13 October 20143,000 × 3,169 (1.7 MB)AcidSnowUpdated territorial control and fixed the boundaries of the regions of Somalia."
## [6] "01:19, 10 September 20143,000 × 3,169 (1.73 MB)AcidSnowWTF, piracy has been died for years now."

As you can see, the dates and usernames are sandwiched between some unnecessary text. We can start by pulling out the timestamps at the beginning of each row. The timestamps will come in handy later, when we need to find files uploaded on the same day and keep only the latest upload for each date.

map_info$time <- str_extract(map_info$`.`, "\\d{2}:\\d{2}")

Now to get the upload date:

map_info$date <- str_extract(map_info$`.`, "\\d{1,2} \\w+ \\d{4}")

regex translation: one or two digits (the day), a space, one or more word characters (the month name), another space, and four digits (the year).

We want to convert the date into “yyyy_mm_dd” format. To do this, first separate the year, month, and day into distinct variables.

map_info <- separate(map_info, date, into = c("day", "month", "year"), sep = " ")

Then convert the month names into numbers.

map_info$month <- match(map_info$month, month.name)

Once we have the usernames, we can paste the information together in the “yyyy_mm_dd_username” format.

So now to get the usernames. The usernames come just after the image size and file size information and can be easily separated from these, since the file size information is enclosed in brackets and the usernames only contain letters and numbers. But just after the usernames there is a bunch of text consisting of letters and numbers, just like the usernames themselves. Separating the usernames from the unnecessary text is not 100% straightforward in terms of regex commands.

One option is to obtain a list of all the possible usernames and use this list to filter out irrelevant text. The following code extracts the usernames for all of the maps on the page, albeit in a messy way.

usernames <- wiki %>%
  html_nodes("td") %>%
  html_text() %>%
  as.data.frame() %>% #make dataframe
  transform(`.` = as.character(`.`))

The first 30 rows of the data look something like this:

##                   .
## 1                  
## 2   This is a file 
## 3                  
## 4   This locator ma
## 5   DescriptionSoma
## 6  \nAfrikaans: Hie
## 7              Date
## 8     12 April 2010
## 9            Source
## 10         Own work
## 11           Author
## 12  Ingoman (James 
## 13  Public domainPu
## 14                 
## 15  I, the copyrigh
## 16          current
## 17  16:32, 15 May 2
## 18                 
## 19  952 × 981 (263 
## 20            Kzl55
## 21  As of April 201
## 22                 
## 23  17:24, 18 April
## 24                 
## 25  852 × 900 (265 
## 26            Kzl55
## 27  An updated map.
## 28                 
## 29  22:32, 16 Novem
## 30

We can see that the first 19 rows are pretty much useless. The “Kzl55” in row 20 is one of the usernames, associated with the uploader of a map that was uploaded on May 15, 2017 as indicated by the date above it. Also, looking at the rest of the data will show that there is a username on every sixth row starting on row 20.
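
As a quick sanity check, seq() with a step of 6 generates exactly those row numbers:

seq(20, 50, 6)
## [1] 20 26 32 38 44 50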

The code usernames[seq(20, nrow(usernames), 6), ] will extract the username for each file. To get the list of unique usernames:

usernames_list <- unique(usernames[seq(20, nrow(usernames), 6), ])
usernames_regex <- paste(usernames_list, collapse = "|")

usernames_regex
## [1] "Kzl55|AcidSnow|Spesh531|Wolfiukas|Turkmenistan|NordNordWest|Ingoman|Malus Catulus|Aotearoa|Quibik"

The only purpose of storing the usernames in a long string separated by “|” is to make searching for usernames in text easier. Placed in a regex search phrase, “|” is the equivalent of “or.” So, “Kzl55|AcidSnow|Spesh531|Wolfiukas|Turkmenistan|NordNordWest|Ingoman|Malus Catulus|Aotearoa|Quibik” can be used to tell R to search for “Kzl55” OR “AcidSnow” OR “Spesh531,” and so on.
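
As a quick illustration (the text here is made up, not taken from the page), str_extract() will pull out whichever of the listed usernames appears in a string:

str_extract("22:32, 16 November 2014 ... AcidSnow Updated key", usernames_regex)
## [1] "AcidSnow"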

Now we can get the usernames.

map_info$uploader <- str_extract(map_info$`.`, usernames_regex)

And now paste the year, month, day, and username together with "_" in between.

map_info <- map_info %>%
  #make sure the variables are character variables
  transform(month = as.character(month)) %>%
  #make sure the month and day are in two digits
  transform(month = ifelse(nchar(month) == 1, paste0("0", month), month)) %>%
  transform(day = ifelse(nchar(day) == 1, paste0("0", day), day)) %>%
  #paste the date and username info together
  unite(labels, year, month, day, uploader, sep = "_", remove = FALSE)
## [1] "2017_05_15_Kzl55"    "2017_04_18_Kzl55"    "2014_11_16_AcidSnow"
## [4] "2014_11_15_AcidSnow" "2014_10_13_AcidSnow" "2014_09_10_AcidSnow"

Finally, we have all the information needed to download and label the files. Combine the dataframe containing the upload information with the dataframe containing the links.
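
The links dataframe comes from an earlier step that is not reproduced here, and cbind() assumes it has one row per map, in the same order as map_info. If you need to rebuild it, here is a rough sketch of one possible approach (the exact filtering will depend on what other links appear on the page, and protocol-relative links starting with "//" would need "https:" pasted in front before downloading):

links <- wiki %>%
  html_nodes("a") %>% #grab every link on the page
  html_attr("href") %>% #keep just the URLs
  str_subset("upload\\.wikimedia\\.org") %>% #keep only links pointing to the uploaded image files
  unique() %>%
  as.data.frame() %>% #make dataframe
  setNames("url") #name the column "url" so map_info$url works in the download loop below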

map_info <- cbind(map_info, links)

Remember that our naming convention assumes only one map uploaded per person per date. So to avoid files having the same name, we have to drop any map that was uploaded on the same day, by the same person, as another map. Naturally, we want to keep the latest upload.

We can confirm by looking at the dates and time stamps that the latest upload always comes before older uploads in the dataframe. So now that we have the label variable, we can use it to check for duplicates and filter out instances of the same label that come after the first.

##                labels  time
## 57 2011_02_10_Ingoman 20:30
## 58  2011_02_10_Quibik 19:56
## 59 2011_02_10_Ingoman 18:10
## 60 2011_02_10_Ingoman 17:50
## 61  2011_02_06_Quibik 00:19
## 62 2011_01_08_Ingoman 01:41
## 63 2011_01_08_Ingoman 01:31
## 64 2011_01_07_Ingoman 03:10
## 65 2011_01_06_Ingoman 22:54
## 66 2011_01_06_Ingoman 22:06
## 67 2011_01_06_Ingoman 21:47
## 68 2011_01_06_Ingoman 20:48
## 69 2010_12_28_Ingoman 18:09
## 70 2010_12_28_Ingoman 18:03
## 71 2010_12_16_Ingoman 23:00

Find duplicates:

map_info <- mutate(map_info, dup = duplicated(labels))
##                labels  time   dup
## 57 2011_02_10_Ingoman 20:30 FALSE
## 58  2011_02_10_Quibik 19:56  TRUE
## 59 2011_02_10_Ingoman 18:10  TRUE
## 60 2011_02_10_Ingoman 17:50  TRUE
## 61  2011_02_06_Quibik 00:19 FALSE
## 62 2011_01_08_Ingoman 01:41 FALSE
## 63 2011_01_08_Ingoman 01:31  TRUE
## 64 2011_01_07_Ingoman 03:10 FALSE
## 65 2011_01_06_Ingoman 22:54 FALSE
## 66 2011_01_06_Ingoman 22:06  TRUE
## 67 2011_01_06_Ingoman 21:47  TRUE
## 68 2011_01_06_Ingoman 20:48  TRUE
## 69 2010_12_28_Ingoman 18:09 FALSE
## 70 2010_12_28_Ingoman 18:03  TRUE
## 71 2010_12_16_Ingoman 23:00 FALSE

Remove the duplicates:

map_info <- filter(map_info, dup == FALSE)

But wait, what if an intern had already downloaded some maps by hand??? Say, all of the maps uploaded before 2009? We can filter out maps by year, like so:

map_info <- filter(map_info, year >= 2009) #keeps maps where the upload year is greater than or equal to 2009

Step 6: Download the maps

for (i in seq_along(map_info$labels)) {
  label <- map_info$labels[i] #file name in "yyyy_mm_dd_username" format
  link <- map_info$url[i] #direct link to the image file
  
  #download the image and save it in the working directory under its label
  GET(link, write_disk(paste0(label, ".png"), overwrite = TRUE))
}

Final comments

Is there an easier way to do this? Probably, yes. There are multiple ways to get at the same result.