Reading through Nathan Yau’s book, Visualize This, I came across this awesome tutorial to scrape weather data from wunderground.com using Python. I like Python, but I also like R, so I was inspired to adapt his tutorial to R. As Nathan points out, if you wanted to get the average daily temperature for a single day, you could simply look at the page for that day. However, if you wanted data for 1 year, 5 years, or 10 years, that would be a lot of pages to load and would be very tedious. Here we will be using R to get the average daily temperature in San Francisco from www.wunderground.com for every day of 2015. Any web scraping task breaks down into three general steps:

    1. Figure out the pattern of the website you are interested in (e.g. View Page Source, use SelectorGadget)
    2. Iterate over all the pages you want data from & load them into R
    3. Format & store the data

This is pretty general but a good place to start. Before you start running the code below, set your working directory to a convenient directory (folder).

setwd('Your directory path')

I used the rvest package by Hadley Wickham. You can install it and load it by running the commands below.

install.packages('rvest')
require(rvest)

I also just found out about this application called SelectorGadget, and it’s awesome! It allows you to interactively find the exact HTML element you want by simply selecting components on the page. To install it, go to this link and follow the instructions: https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html

If you look at the page for January 1st, 2015, you can see that the value we want is 48. http://www.wunderground.com/history/airport/KSFO/2015/1/1/DailyHistory.html

Using SelectorGadget, you can quickly identify the CSS selector you need to extract the data. In this case it’s ‘.wx-value’.

Now if you want the data for just January 1st, 2015, you can quickly read it in using the commands below.

weather_data <- read_html("http://www.wunderground.com/history/airport/KSFO/2015/1/1/DailyHistory.html")
mean_daily_temp <- weather_data %>% 
        html_node(".wx-value") %>%
        html_text() %>%
        as.numeric()

html_node() matches the first instance of the selector it was given, and html_text() extracts the text. as.numeric() converts the text to a number. In this case, the average temperature happens to be the first node, but if it’s not, you can use html_nodes() to store all matching values in a variable and select the one you are looking for with an index inside square brackets [].
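
For example, here is a quick sketch of how you could grab every ‘.wx-value’ match on the page and pick one out by position. The index 2 below is just for illustration; check the page to see which position you actually need.

all_values <- weather_data %>%
        html_nodes(".wx-value") %>%
        html_text()
second_value <- as.numeric(all_values[2])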

Well done. That’s one day. Now if you want data for all 365 days, you need to iterate over the URL using the code below. The very first line of code opens a file in your current working directory. This is where your data will eventually be stored. The next few lines use a combination of for loops and if/else if control structures to control how the code iterates over the URL. In a nutshell, it iterates by substituting each month and day value into the URL. Notice how it accounts for February, which only has 28 days, and the months which only have 30 days. The second part formats the data so that it can be cleanly saved to the text file you opened in the first line of code.

f <- file('wunder-data.txt','w')
for (m in 1:12) {
        for (d in 1:31) {
                # Skip days that don't exist: February has 28 days in 2015,
                # and April, June, September & November have 30
                if (m==2) {
                        if (d>28){
                                break
                        }
                } else if (m %in% c(4,6,9,11)){
                        if (d > 30){
                                break
                        }
                }
                url <- paste("http://www.wunderground.com/history/airport/KSFO/2015/",
                             as.character(m),"/",as.character(d),"/DailyHistory.html",sep = '')
                # Load the page for this day and pull out the mean temperature
                weather_data <- read_html(url)
                mean_daily_temp <- weather_data %>% 
                        html_node(".wx-value") %>%
                        html_text() %>%
                        as.numeric()
                # Zero-pad the month and day so the date stamp is always YYYYMMDD
                if (nchar(m)<2){
                        mStamp <- paste('0',as.character(m),sep = '')
                } else{
                        mStamp<-as.character(m)
                }
                
                if (nchar(d)<2){
                        dStamp <- paste('0',as.character(d),sep='')
                } else {
                        dStamp <- as.character(d)
                }
                timestamp <- paste('2015',mStamp,dStamp,sep = '')
                timestampPlusTemp <- paste(timestamp,',',as.character(mean_daily_temp),sep = '')
                writeLines(timestampPlusTemp,f)
        }
}
close(f)

There you go. You can run this code as it stands and see your results. It might take a while since you are loading a lot of pages. Patience, young grasshopper. You can also edit it to fit your needs, as it is by no means perfect. Now that the data is saved in your working directory as a comma-delimited text file, you can read it back into R to do additional analysis.
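
As a quick sketch of that last step, you could read the file back in with read.csv(). This assumes the two columns are the date stamp and the temperature; the column names below are just placeholders.

sf_temps <- read.csv('wunder-data.txt', header = FALSE,
                     col.names = c('date', 'temp'),
                     colClasses = c('character', 'numeric'))
head(sf_temps)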

My main takeaway from working through Nathan’s tutorial and adapting it to R is that data often won’t be available as a nice comma-delimited file, but that shouldn’t stop you from trying to get it. Trial and error, baby!