andycatlin@maddogdatascience.com 19-Oct-2014
Here, we’ll learn how to use Hadley Wickham’s rvest package to scrape information from a web page.
Prior to scraping information from a web page, you need to determine how to identify the content that you want. This often means studying the HTML page source and figuring out which CSS selector(s) identify that content.
Having some knowledge of both HTML tags and CSS selectors is necessary for all but the most trivial of web scraping assignments. Even armed with this knowledge, figuring out the proper CSS selectors to use for even moderately complex web pages can be challenging.
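To make the idea concrete, here is a minimal sketch that runs a CSS selector against a toy HTML fragment of my own (not from SourceForge). It assumes that rvest’s html() accepts a literal HTML string; later versions of rvest use read_html() for this.
library(rvest)
# The selector ".titled h2" means: any <h2> element inside an element
# whose class attribute includes "titled".
fragment <- '<div class="titled"><h2>Featured</h2><h2>Staff Picks</h2></div><div class="sidebar"><h2>Not selected</h2></div>'
fragment %>%
  html() %>%
  html_nodes(".titled h2") %>%
  html_text()
[1] "Featured"    "Staff Picks"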
SelectorGadget.com provides a great tool to help with this. Hadley Wickham tweeted, “If you’re doing web scraping (in any language) and don’t know what SelectorGadget is, go to http://selectorgadget.com/ RIGHT NOW.”
This five-minute video shows how to use SelectorGadget.com and rvest to scrape a web page.
Specifically, we’ll look at how to build lists of project categories and projects from sourceforge.net’s home page.
Using SelectorGadget.com, as shown in this five-minute video, we have determined that we can identify the SourceForge.net home page project categories with the CSS selector .titled h2, and the listed projects with the CSS selector .project-info a.
Below is some basic R code that uses the rvest package and these CSS selectors to build lists of project categories and projects.
Note that Hadley Wickham’s rvest package is not yet on CRAN, so you need to follow the installation instructions in the README.md at https://github.com/hadley/rvest; you should also restart R and RStudio after you install the devtools package in the instructions.
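At the time of writing, those instructions amount to something like the following sketch (defer to the README itself if it has changed):
install.packages("devtools")              # provides install_github()
devtools::install_github("hadley/rvest")  # install rvest from GitHub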
require(rvest)
Loading required package: rvest
Attaching package: 'rvest'
The following object is masked from 'package:utils':
history
Using the selectors identified by SelectorGadget, we use rvest’s verbs to build lists of categories and projects.
url <- "http://sourceforge.net/"
categories <- url %>%
  html() %>%                    # download and parse the page
  html_nodes(".titled h2") %>%  # select the category headings
  html_text()                   # extract the headings' text
categories
[1] "Projects Of The Month" "Editor's Choice" "Featured"
projects <- url %>%
  html() %>%
  html_nodes(".project-info a") %>%  # select the project links
  html_text()
projects
[1] "\n Staff Choice\n Miranda IM\n "
[2] "\n Community Choice\n PortableApps.com: Portable Software/USB\n "
[3] "\n BIRT iHub F-Type\n "
[4] "\n eXo Platform - Social Collaboration\n "
[5] "\n winPenPack\n "
[6] "\n TEncoder Video Converter\n "
[7] "\n wxWidgets\n "
[8] "\n shadowsocks-gui\n "
[9] "\n gnuplot\n "
[10] "\n OS X Portable Applications\n "
[11] "\n gretl\n "
[12] "\n Nullsoft Scriptable Install System\n "
[13] "\n The FreeType Project\n "
Here is a small function that uses a regular expression to trim leading and trailing whitespace:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
categories <- trim(categories)
categories
[1] "Projects Of The Month" "Editor's Choice" "Featured"
projects <- trim(projects)
Note that the first two projects are still messy.
projects[1:2]
[1] "Staff Choice\n Miranda IM"
[2] "Community Choice\n PortableApps.com: Portable Software/USB"
Next, collapse runs of two or more whitespace characters: the lookbehind (?<=\s)\s* matches any whitespace that follows another whitespace character, so each run keeps only its first character, and ^\s+$ matches strings that are entirely whitespace.
projects <- gsub("(?<=[\\s])\\s*|^\\s+$", "", projects, perl=TRUE)
projects
[1] "Staff Choice\nMiranda IM"
[2] "Community Choice\nPortableApps.com: Portable Software/USB"
[3] "BIRT iHub F-Type"
[4] "eXo Platform - Social Collaboration"
[5] "winPenPack"
[6] "TEncoder Video Converter"
[7] "wxWidgets"
[8] "shadowsocks-gui"
[9] "gnuplot"
[10] "OS X Portable Applications"
[11] "gretl"
[12] "Nullsoft Scriptable Install System"
[13] "The FreeType Project"
Finally, replace each remaining newline character with a colon:
projects <- gsub("\\n",":", projects)
knitr::kable(projects)
|x |
|:--|
| Staff Choice:Miranda IM |
| Community Choice:PortableApps.com: Portable Software/USB |
| BIRT iHub F-Type |
| eXo Platform - Social Collaboration |
| winPenPack |
| TEncoder Video Converter |
| wxWidgets |
| shadowsocks-gui |
| gnuplot |
| OS X Portable Applications |
| gretl |
| Nullsoft Scriptable Install System |
| The FreeType Project |