According to the developer:
Selectorgadget is a javascript bookmarklet that allows you to interactively figure out what css selector you need to extract desired components from a page.
To learn how SelectorGadget works, enter vignette("selectorgadget") into the R console.
Note: The primary package used for scraping here is the rvest package.
#install.packages("tidyverse")
#install.packages("rvest")
#install.packages("stringr")
#install.packages("httr")
library(tidyverse)
library(rvest)
library(stringr)
library(httr)
For this tutorial, we will scrape the maps from this file history archive: https://en.wikipedia.org/w/index.php?title=File:Somalia_map_states_regions_districts.png
Go to the archive page and set the view setting to 500 images per page. The URL should then include “&offset=&limit=500#filehistory” at the end.
url <- "https://en.wikipedia.org/w/index.php?title=File:Somalia_map_states_regions_districts.png&offset=&limit=500#filehistory"
page <- read_html(url)
Let's test out SelectorGadget to see how it works. On the web page, scroll down to the File history section and click anywhere on the large table containing the maps. SelectorGadget should highlight the entire table in yellow, along with some other elements on the page. Items highlighted yellow or green are the items included in the CSS selector shown in the SelectorGadget bar.
In this case the selector is "td", or table data, which covers every cell inside a table.
To exclude items from the selection, click on them so that they are highlighted red or no longer highlighted at all. Sometimes this also de-selects items we want; click those again so that they turn yellow or green. Keep toggling the items we want and the items we don't want until the selection looks right, and scroll up and down the page to make sure no unwanted items are still highlighted.
Now let's try scraping the entire table.
table <- page %>%
html_nodes("#mw-imagepage-section-filehistory td") %>% # <- the selector code goes here
html_text() %>%
as.data.frame()
The CSS selector, #mw-imagepage-section-filehistory td, goes inside html_nodes(), which retrieves the nodes in the page that match the supplied selector. html_text() converts the node data into text, and as.data.frame() converts the result into a dataframe in R.
Now let's view the data.
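One way to do this is with head(); printing ten rows here corresponds to the output shown below:
head(table, 10)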
## .
## 1 current
## 2 16:32, 15 May 2017
## 3
## 4 952 × 981 (263 KB)
## 5 Kzl55
## 6 As of April 2017
## 7
## 8 17:24, 18 April 2017
## 9
## 10 852 × 900 (265 KB)
Yuck! The data is actually usable with some tweaks, but we don't need to go to all that trouble. There is a better way!
Before scraping the maps themselves, let's download the information in the Date/Time and User columns. We will use these data to create unique names for each of the maps so that the image files are easily identifiable. The data are also useful for identifying the maps we don't need to download. For example, sometimes a user uploads a map and then an updated version of the same map later the same day. We want to keep the latest upload and not the older versions.
Let's begin by scraping the Date/Time column. Use SelectorGadget to get the CSS selector for the Date/Time column, which is #mw-imagepage-section-filehistory td:nth-child(2). We then reuse most of the same code from before, with minor adjustments.
datetime <- page %>%
html_nodes("#mw-imagepage-section-filehistory td:nth-child(2)") %>%
html_text() %>%
as.data.frame()
#name the column
names(datetime) <- "time"
## time
## 1 16:32, 15 May 2017
## 2 17:24, 18 April 2017
## 3 22:32, 16 November 2014
## 4 17:54, 15 November 2014
## 5 02:33, 13 October 2014
## 6 01:19, 10 September 2014
Now repeat the steps to scrape the User column.
user <- page %>%
html_nodes("td:nth-child(5)") %>%
html_text() %>%
as.data.frame()
#name the column
names(user) <- "user"
## user
## 1 Kzl55
## 2 Kzl55
## 3 AcidSnow
## 4 AcidSnow
## 5 AcidSnow
## 6 AcidSnow
Since the datetime and user dataframes have the same number of rows, we can bind them together horizontally to make one dataframe.
file_history <- cbind(datetime, user)
## time user
## 1 16:32, 15 May 2017 Kzl55
## 2 17:24, 18 April 2017 Kzl55
## 3 22:32, 16 November 2014 AcidSnow
## 4 17:54, 15 November 2014 AcidSnow
## 5 02:33, 13 October 2014 AcidSnow
## 6 01:19, 10 September 2014 AcidSnow
Now let's split the time column into separate columns containing the time and the date, respectively.
file_history <- file_history %>%
transform(time = as.character(time)) %>% #convert from factor to character
separate(time, into = c("time", "date"), sep = ", ")
Note that we use pipe operations with %>% throughout this tutorial. The piped code above is equivalent to the following non-piped commands:
file_history <- transform(file_history, time = as.character(time))
file_history <- separate(file_history, time, into = c("time", "date"), sep = ", ")
## time date user
## 1 16:32 15 May 2017 Kzl55
## 2 17:24 18 April 2017 Kzl55
## 3 22:32 16 November 2014 AcidSnow
## 4 17:54 15 November 2014 AcidSnow
## 5 02:33 13 October 2014 AcidSnow
## 6 01:19 10 September 2014 AcidSnow
To make labeling easier, convert the date information into YYYY-MM-DD format. The lubridate package is useful for this, but here we’ll use other means. First, use the code below to extract the year, month, and day information and store them in separate columns.
file_history <- file_history %>%
mutate(year = str_extract(date, "\\d{4}")) %>%
mutate(month = str_extract(date, "[A-Za-z]+")) %>%
mutate(day = str_extract(date, "^\\d{1,2}"))
## time date user year month day
## 1 16:32 15 May 2017 Kzl55 2017 May 15
## 2 17:24 18 April 2017 Kzl55 2017 April 18
## 3 22:32 16 November 2014 AcidSnow 2014 November 16
## 4 17:54 15 November 2014 AcidSnow 2014 November 15
## 5 02:33 13 October 2014 AcidSnow 2014 October 13
## 6 01:19 10 September 2014 AcidSnow 2014 September 10
The above code uses regular expressions ("regex" for short) along with the stringr package. Regular expressions are useful for parsing text, which also makes them useful for web scraping. In regex, \\d or, equivalently, [0-9] matches any digit. You can use curly brackets to specify how many digits to match, so \\d{4} means "exactly four digits"; here that corresponds to the upload year. Similarly, \\d{1,2} means "one or two digits." ^ means "starting with," so ^\\d{1,2} means "one or two digits at the start of the text," which corresponds to the upload day. [A-Za-z] matches any letter and the + sign means "one or more," so [A-Za-z]+ means "one or more letters," which in this case corresponds to the month.
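To see these patterns in isolation, here is a quick check against a single date string taken from the output above:
example_date <- "15 May 2017"
str_extract(example_date, "\\d{4}")    #"2017" - the four-digit year
str_extract(example_date, "[A-Za-z]+") #"May" - the first run of letters
str_extract(example_date, "^\\d{1,2}") #"15" - one or two digits at the start of the string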
The next step is to convert the month names to numbers, and add “0” in front of the single-digit days or months, e.g. “5” to “05”.
file_history <- file_history %>%
transform(month = match(month, month.name)) %>% #change month name to number
transform(month = ifelse(nchar(month) == 1, paste0("0", month), month)) %>% #add zero
transform(day = ifelse(nchar(day) == 1, paste0("0", day), day))
## time date user year month day
## 1 16:32 15 May 2017 Kzl55 2017 05 15
## 2 17:24 18 April 2017 Kzl55 2017 04 18
## 3 22:32 16 November 2014 AcidSnow 2014 11 16
## 4 17:54 15 November 2014 AcidSnow 2014 11 15
## 5 02:33 13 October 2014 AcidSnow 2014 10 13
## 6 01:19 10 September 2014 AcidSnow 2014 09 10
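As an aside, the lubridate package mentioned earlier could handle this date conversion in a single step. A minimal sketch, assuming lubridate is installed (it is not used in the rest of this tutorial):
library(lubridate)
dmy("15 May 2017")                      #parses day-month-year text into a Date: 2017-05-15
format(dmy("15 May 2017"), "%Y-%m-%d")  #the same value as a "YYYY-MM-DD" character string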
Now we can create the labels for the image files.
file_history <- file_history %>%
mutate(file_name = paste0(year, "-", month, "-", day, "_", user))
## time date user year month day file_name
## 1 16:32 15 May 2017 Kzl55 2017 05 15 2017-05-15_Kzl55
## 2 17:24 18 April 2017 Kzl55 2017 04 18 2017-04-18_Kzl55
## 3 22:32 16 November 2014 AcidSnow 2014 11 16 2014-11-16_AcidSnow
## 4 17:54 15 November 2014 AcidSnow 2014 11 15 2014-11-15_AcidSnow
## 5 02:33 13 October 2014 AcidSnow 2014 10 13 2014-10-13_AcidSnow
## 6 01:19 10 September 2014 AcidSnow 2014 09 10 2014-09-10_AcidSnow
You can see that some of the rows share the same label. As discussed earlier, we want to download only the latest upload by each user for each day. From the time stamps you can see that later uploads appear earlier in the data. Taking advantage of this ordering, we can flag the older uploads with the duplicated function, which treats the first instance of a value in a dataset as the original and later instances as duplicates.
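To make the behaviour of duplicated() concrete, here is a toy example on a made-up vector:
duplicated(c("a", "b", "a", "a")) #returns FALSE FALSE TRUE TRUE: only the first "a" counts as the original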
file_history <- file_history %>%
mutate(dup = duplicated(file_name))
## time date user year month day file_name
## 1 16:32 15 May 2017 Kzl55 2017 05 15 2017-05-15_Kzl55
## 2 17:24 18 April 2017 Kzl55 2017 04 18 2017-04-18_Kzl55
## 3 22:32 16 November 2014 AcidSnow 2014 11 16 2014-11-16_AcidSnow
## 4 17:54 15 November 2014 AcidSnow 2014 11 15 2014-11-15_AcidSnow
## 5 02:33 13 October 2014 AcidSnow 2014 10 13 2014-10-13_AcidSnow
## 6 01:19 10 September 2014 AcidSnow 2014 09 10 2014-09-10_AcidSnow
## dup
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
Later, when downloading the image files, we can use the dup column to weed out the “duplicate” images.
To scrape the maps, we need the associated URLs. Use SelectorGadget to find the CSS selector for just the column containing the maps, which is #mw-imagepage-section-filehistory img. We then use this selector in the following code:
links <- page %>%
html_nodes("#mw-imagepage-section-filehistory img") %>%
html_attr("src") %>%
as.data.frame()
names(links) <- "url"
This code differs from the previous examples in a few ways. Most importantly, instead of html_text(), we use the html_attr function to extract each image node's src attribute, which is where the URL is stored.
## url
## 1 //upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Somalia_map_states_regions_districts.png/116px-Somalia_map_states_regions_districts.png
## 2 //upload.wikimedia.org/wikipedia/commons/thumb/archive/9/9f/20170515163157%21Somalia_map_states_regions_districts.png/114px-Somalia_map_states_regions_districts.png
## 3 //upload.wikimedia.org/wikipedia/commons/thumb/archive/9/9f/20170418172438%21Somalia_map_states_regions_districts.png/114px-Somalia_map_states_regions_districts.png
## 4 //upload.wikimedia.org/wikipedia/commons/thumb/archive/9/9f/20141116223232%21Somalia_map_states_regions_districts.png/114px-Somalia_map_states_regions_districts.png
## 5 //upload.wikimedia.org/wikipedia/commons/thumb/archive/9/9f/20141115175414%21Somalia_map_states_regions_districts.png/114px-Somalia_map_states_regions_districts.png
## 6 //upload.wikimedia.org/wikipedia/commons/thumb/archive/9/9f/20141013023337%21Somalia_map_states_regions_districts.png/114px-Somalia_map_states_regions_districts.png
At this point, the links are not usable as-is because they are wrapped in extra text. Clean it up using the following code:
links$url <- gsub("^\\/\\/", "", links$url) #get rid of the "//" in the beginning
links$url <- gsub("\\/thumb", "", links$url) #get rid of "/thumb" referring to the thumbnail versions of the maps
links$url <- gsub("\\/\\d+px.+$", "", links$url) #get rid of everything after ".png"
Again, we use regex commands to parse the text. ^ means "starting with." The backslashes (\\) tell R to treat the following character as literal text rather than as a special regex character (strictly speaking, / is not a special character, so escaping it is optional here, but it does no harm). Hence, ^\\/\\/ means "// at the start of the text." \\d matches any digit, + means "one or more," . is a wildcard matching any single character, and $ means "ending with." Thus, \\/\\d+px.+$ means "a / followed by one or more digits, followed by px, followed by one or more characters up to the end of the text."
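As a sanity check, here is the same three-step cleanup applied to a single URL copied from the output above:
example <- "//upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Somalia_map_states_regions_districts.png/116px-Somalia_map_states_regions_districts.png"
example <- gsub("^\\/\\/", "", example)  #drop the leading "//"
example <- gsub("\\/thumb", "", example) #drop "/thumb"
gsub("\\/\\d+px.+$", "", example)        #drop the trailing "/116px-..." thumbnail suffix
## [1] "upload.wikimedia.org/wikipedia/commons/9/9f/Somalia_map_states_regions_districts.png"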
Now the dataframe looks like this:
## url
## 1 upload.wikimedia.org/wikipedia/commons/9/9f/Somalia_map_states_regions_districts.png
## 2 upload.wikimedia.org/wikipedia/commons/archive/9/9f/20170515163157%21Somalia_map_states_regions_districts.png
## 3 upload.wikimedia.org/wikipedia/commons/archive/9/9f/20170418172438%21Somalia_map_states_regions_districts.png
## 4 upload.wikimedia.org/wikipedia/commons/archive/9/9f/20141116223232%21Somalia_map_states_regions_districts.png
## 5 upload.wikimedia.org/wikipedia/commons/archive/9/9f/20141115175414%21Somalia_map_states_regions_districts.png
## 6 upload.wikimedia.org/wikipedia/commons/archive/9/9f/20141013023337%21Somalia_map_states_regions_districts.png
Attach the links to the dataframe with the time, date, and user information.
file_history <- cbind(file_history, links)
Remove the duplicate rows.
file_history <- file_history %>%
filter(dup == FALSE)
Now we’re ready to download the maps.
…
But wait, what if an intern had already downloaded some of the maps by hand? Say, all of the maps uploaded before 2009? We can filter out maps by year, like so:
file_history <- filter(file_history, as.numeric(year) >= 2009) #keep maps whose upload year is 2009 or later (year was extracted as text, so convert it to numeric first)
Rather than simply dump dozens of files into the current working directory, it's more convenient to create a folder within that directory for the downloads to go into automatically.
if (! "wiki maps" %in% list.files()) {
dir.create("wiki maps")
}
The above code looks for a file or folder named "wiki maps" in the current working directory and, if none exists, creates a folder with that name.
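A slightly more direct alternative uses base R's dir.exists(), which checks specifically for a directory:
if (!dir.exists("wiki maps")) {
  dir.create("wiki maps")
}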
To download each image, we use the GET function from the httr package. It takes the image URL along with write_disk(), which saves the response to a file at the path we supply. That path is built from the image label we created earlier, prefixed with the "wiki maps" folder so the files go into that folder without altering the labels themselves. We use GET in a for loop to cycle through all of the rows in the dataframe.
for (i in seq_along(file_history$file_name)) {
  label <- file_history$file_name[i]
  link <- file_history$url[i]
  GET(link, write_disk(paste0("wiki maps/", label, ".png"), overwrite = TRUE)) #overwrite = TRUE overwrites any pre-existing file with the same name
  Sys.sleep(3) #3 second pause after each download
}
Note: It’s good practice to insert a pause after each download so that the scraper doesn’t overwhelm the target server. The Sys.sleep() function causes the program to pause for a specified number of seconds.
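Once the loop finishes, a quick way to confirm the downloads landed where we expect is to list the contents of the new folder:
list.files("wiki maps") #should show one .png file per row kept in file_history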