This project scrapes the R-bloggers search results for “web scraping”, pages 1 to 17 under “http://www.r-bloggers.com/search/web%20scraping/page/”, and stores them in an R data frame. According to the site, there are 166 results in total.
library(XML)
library(RCurl)
library(knitr)
To scrape the date, title and author of each post, I wrote a ‘scraper’ function.
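scraper <- function(page) {
  # Build the URL of the requested results page
  the_url <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", page, sep = "")
  SOURCE <- getURL(the_url)
  PARSED <- htmlParse(SOURCE)
  # Keep at most 10 entries per field; missing positions become NA
  date <- xpathSApply(PARSED, "//div[@class='date']", xmlValue)[1:10]
  title <- xpathSApply(PARSED, "//h2/a[@title]", xmlValue)[3:12]  # matches 3 to 12
  author <- xpathSApply(PARSED, "//a[@rel='author']", xmlValue)[1:10]
  # Strip any /* ... */ block from the author text
  author <- gsub("\\/\\*.+\\*\\/", "", author)
  # Drop rows that contain NA
  na.omit(data.frame(date, title, author, page))
}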
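Two details of `scraper` are easy to miss. Subsetting with `[1:10]` pads a shorter vector with `NA`, and `na.omit()` then drops the padded rows, so a page with fewer than 10 posts yields fewer rows. The sketch below reproduces that behaviour and the `gsub()` cleanup on made-up values; the vectors and the `raw_author` string are illustrative assumptions, not scraped data.

```r
# Illustrative stand-ins for what xpathSApply() might return on a short page
# (hypothetical values, not real scraped data).
date  <- c("June 29, 2010", "December 28, 2009")[1:4]  # only 2 matches, so 2 NAs
title <- c("Post A", "Post B")[1:4]

# Rows that contain NA are dropped, leaving only the complete ones.
short_page <- na.omit(data.frame(date, title, page = 17))
nrow(short_page)

# The pattern removes everything from the first "/*" to the last "*/".
raw_author <- "Some Author/* injected script text */"
clean_author <- gsub("\\/\\*.+\\*\\/", "", raw_author)
clean_author
```

Because `.+` is greedy, a string with several `/* ... */` blocks loses everything between the first opener and the last closer.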
To scrape a single page, pass a number between 1 and 17 as the ‘page’ argument.
kable(scraper(2))
| date | title | author | page |
|---|---|---|---|
| November 8, 2011 | Web Scraping Google Scholar: Part 2 (Complete Success) | Tony Breyal | 2 |
| November 8, 2011 | Web Scraping Google Scholar (Partial Success) | Tony Breyal | 2 |
| November 7, 2011 | Web Scraping Google URLs | Tony Breyal | 2 |
| November 5, 2011 | Next Level Web Scraping | Kay Cichini | 2 |
| November 1, 2011 | Web Scraping Google Scholar & Show Result as Word Cloud Using R | Kay Cichini | 2 |
| April 15, 2015 | Scraping Web Pages With R | Tony Hirst | 2 |
| November 30, 2014 | FOMC Dates – Scraping Data From Web Pages | Peter Chan | 2 |
| June 27, 2014 | Scraping Fantasy Football Projections from the Web | Isaac Petersen | 2 |
| February 19, 2014 | Web-Scraping: the Basics | Rolf Fredheim | 2 |
| January 4, 2014 | Relenium, Selenium for R. A new tool for webscraping. | aleixrvr | 2 |
kable(scraper(14))
| date | title | author | page |
|---|---|---|---|
| December 22, 2012 | Chocolate and nobel prize – a true story? | Max Gordon | 14 |
| October 28, 2012 | Animated map of 2012 US election campaigning, with R and ffmpeg | civilstat | 14 |
| October 3, 2012 | Tips on accessing data from various sources with R | David Smith | 14 |
| September 25, 2012 | R Helper Functions | bryan | 14 |
| September 16, 2012 | The R-Podcast Episode 10: Adventures in Data Munging Part 2 | Eric | 14 |
| June 20, 2012 | UseR 2012 highlights | David Smith | 14 |
| May 17, 2012 | Visualizing the CRAN: Graphing Package Dependencies | wrathematics | 14 |
| April 22, 2012 | 118 years of US State Weather Data | drunksandlampposts | 14 |
| April 5, 2012 | The 50 most used R packages | flodel | 14 |
| March 23, 2012 | RStudio Development Environment | bryan | 14 |
To scrape all the pages, I used a ‘for’ loop to fill a list and then converted the list to a data frame.
output <- list()
for (i in 1:17) {
  output[[i]] <- scraper(i)
}
output[[1:2]]  # recursive index: same as output[[1]][[2]], the title column of page 1
## [1] rvest: easy web scraping with R
## [2] Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples
## [3] Web Scraping: working with APIs
## [4] Web Scraping: Scaling up Digital Data Collection
## [5] Web Scraping part2: Digging deeper
## [6] A Little Web Scraping Exercise with XML-Package
## [7] R: Web Scraping R-bloggers Facebook Page
## [8] Web scraping with Python – the dark side of data
## [9] Web Scraping Google+ via XPath
## [10] Web Scraping Yahoo Search Page via XPath
## 10 Levels: A Little Web Scraping Exercise with XML-Package ...
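The output above is a single column rather than two data frames because `[[ ]]` with a vector index works recursively on lists. A minimal sketch on a made-up list (`pages` and its values are hypothetical stand-ins for `output`):

```r
# A tiny list of data frames standing in for `output` (hypothetical values).
pages <- list(
  data.frame(date = c("d1", "d2"), title = c("t1", "t2")),
  data.frame(date = "d3", title = "t3")
)

# With a vector index, [[ ]] indexes recursively:
# pages[[1:2]] is the 2nd element (the title column) of the 1st list element.
a <- pages[[1:2]]
b <- pages[[1]][[2]]
identical(a, b)

# To get the first two data frames instead, use single brackets.
first_two <- pages[1:2]
length(first_two)
```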
df <- do.call(rbind, output) #if you want a dataframe, not a list
kable(df)
| date | title | author | page |
|---|---|---|---|
| November 24, 2014 | rvest: easy web scraping with R | hadleywickham | 1 |
| September 17, 2014 | Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples | Bob Rudis (@hrbrmstr) | 1 |
| March 12, 2014 | Web Scraping: working with APIs | Rolf Fredheim | 1 |
| March 5, 2014 | Web Scraping: Scaling up Digital Data Collection | Rolf Fredheim | 1 |
| February 25, 2014 | Web Scraping part2: Digging deeper | Rolf Fredheim | 1 |
| April 5, 2012 | A Little Web Scraping Exercise with XML-Package | Kay Cichini | 1 |
| January 6, 2012 | R: Web Scraping R-bloggers Facebook Page | Tony Breyal | 1 |
| December 27, 2011 | Web scraping with Python – the dark side of data | axiomOfChoice | 1 |
| November 11, 2011 | Web Scraping Google+ via XPath | Tony Breyal | 1 |
| November 10, 2011 | Web Scraping Yahoo Search Page via XPath | Tony Breyal | 1 |
| November 8, 2011 | Web Scraping Google Scholar: Part 2 (Complete Success) | Tony Breyal | 2 |
| November 8, 2011 | Web Scraping Google Scholar (Partial Success) | Tony Breyal | 2 |
| November 7, 2011 | Web Scraping Google URLs | Tony Breyal | 2 |
| November 5, 2011 | Next Level Web Scraping | Kay Cichini | 2 |
| November 1, 2011 | Web Scraping Google Scholar & Show Result as Word Cloud Using R | Kay Cichini | 2 |
| April 15, 2015 | Scraping Web Pages With R | Tony Hirst | 2 |
| November 30, 2014 | FOMC Dates – Scraping Data From Web Pages | Peter Chan | 2 |
| June 27, 2014 | Scraping Fantasy Football Projections from the Web | Isaac Petersen | 2 |
| February 19, 2014 | Web-Scraping: the Basics | Rolf Fredheim | 2 |
| January 4, 2014 | Relenium, Selenium for R. A new tool for webscraping. | aleixrvr | 2 |
| August 23, 2012 | R and the web (for beginners), Part III: Scraping MPs’ expenses in detail from the web | GivenTheData | 3 |
| April 2, 2012 | Web-Scraping in R | diffuseprior | 3 |
| January 15, 2012 | Scraping table from any web page with R or CloudStat | PR | 3 |
| January 12, 2012 | Scraping table from html web with CloudStat | CloudStat | 3 |
| October 22, 2011 | A Little Webscraping-Exercise… | Kay Cichini | 3 |
| August 10, 2011 | Scraping web data in R | Zach Mayer | 3 |
| April 14, 2009 | Webscraping using readLines and RCurl | bryan | 3 |
| April 14, 2009 | Webscraping using readLines and RCurl | bryan | 3 |
| March 15, 2015 | Short R tutorial: Scraping Javascript Generated Data with R | DataCamp | 3 |
| January 21, 2015 | FOMC Dates – Full History Web Scrape | Peter Chan | 3 |
| May 15, 2014 | Scraping XML Tables with R | jgreenb1 | 4 |
| April 29, 2014 | Scraping SSL Labs Server Test Results With R | Bob Rudis (@hrbrmstr) | 4 |
| April 14, 2014 | Interfacing R with Web technologies | David Smith | 4 |
| April 4, 2014 | Scraping organism metadata for Treebase repositories from GOLD using Python and R | What is this? David Springate’s personal blog :: R | 4 |
| April 6, 2012 | R-Bloggers’ Web-Presence | Kay Cichini | 4 |
| February 18, 2012 | How-to Extract Text From Multiple Websites with R | Christopher Gandrud | 4 |
| January 27, 2012 | Scraping Flora of North America | Recology - R | 4 |
| January 5, 2012 | Scraping R-bloggers with Python – Part 2 | The PolStat R Feed | 4 |
| January 4, 2012 | Scraping R-Bloggers with Python | The PolStat R Feed | 4 |
| November 9, 2011 | R-Function GScholarScraper to Webscrape Google Scholar Search Result | Kay Cichini | 4 |
| September 8, 2011 | Interacting with bioinformatics webservers using R | nsaunders | 5 |
| February 18, 2011 | R Screen Scraping: 105 Counties of Election Data | Earl Glynn | 5 |
| February 18, 2011 | Simple R Screen Scraping Example | Earl Glynn | 5 |
| August 13, 2010 | Scrape Web data using R | – | 5 |
| March 20, 2015 | Digital Data Collection course | Rolf Fredheim | 5 |
| March 6, 2015 | Getting Data From An Online Source | Robert Norberg | 5 |
| February 28, 2015 | Playing around with #rstats twitter data | [email protected] | 5 |
| December 19, 2014 | 50 years of Christmas at the Windsors | Dominic Nyhuis | 5 |
| November 27, 2014 | Power Outage Impact Choropleths In 5 Steps in R (featuring rvest & RStudio “Projects”) | hrbrmstr | 5 |
| November 26, 2014 | Slightly Advanced rvest with Help from htmltools + XML + pipeR | klr | 5 |
| November 14, 2014 | What size will you be after you lose weight? | dan | 6 |
| September 28, 2014 | A bioinformatics walk-through: Accessing protein-protein interaction interfaces for all known protein structures with PDBe PISA | biochemistries | 6 |
| August 28, 2014 | R User Group Roundup | Joseph Rickert | 6 |
| April 30, 2014 | Automatically Scrape Flight Ticket Data Using R and Phantomjs | Huidong Tian | 6 |
| March 13, 2014 | Text Mining Gun Deaths Data | Francis Smart | 6 |
| March 13, 2014 | Better handling of JSON data in R? | Rolf Fredheim | 6 |
| March 10, 2014 | Upcoming NYC R Programming Classes | vivian | 6 |
| February 1, 2014 | Introduction | steadyfish | 6 |
| July 29, 2013 | Programming instrumental music from scratch | Vik Paruchuri | 6 |
| July 29, 2013 | Programming instrumental music from scratch | - r | 7 |
| July 29, 2013 | Programming instrumental music from scratch | Vik Paruchuri | 7 |
| May 6, 2013 | xkcd: Visualized | Myles | 7 |
| April 30, 2013 | Has R-help gotten meaner over time? And what does Mancur Olson have to say about it? | Trey Causey | 7 |
| December 15, 2012 | Data Science, Data Analysis, R and Python | Ron Pearson (aka TheNoodleDoodler) | 7 |
| October 27, 2012 | .Rhistory | distantobserver | 7 |
| July 28, 2012 | Hangman in R: A learning experience | tylerrinker | 7 |
| March 20, 2012 | Data Analysis Training | prasoonsharma | 7 |
| January 11, 2012 | Making an R Package: Not as hard as you think | markbulling | 7 |
| January 3, 2012 | Plotting Doctor Who Ratings (1963-2011) with R | Tony Breyal | 7 |
| November 13, 2011 | GScholarXScraper: Hacking the GScholarScraper function with XPath | Tony Breyal | 8 |
| November 10, 2011 | Facebook Graph API Explorer with R | Tony Breyal | 8 |
| September 29, 2010 | UCLA Statistics: Analyzing Thesis/Dissertation Lengths | Ryan Rosario | 8 |
| September 4, 2010 | Cricket data analysis | prasoonsharma | 8 |
| January 22, 2010 | What to Expect? | Ryan | 8 |
| April 18, 2015 | Analysing The Rock ‘n’ Roll Madrid Marathon | aschinchon | 8 |
| April 8, 2015 | Monitoring Price Fluctuations of Book Trade-In Values on Amazon | Andrew Landgraf | 8 |
| March 31, 2015 | More Airline Crashes via the Hadleyverse | hrbrmstr | 8 |
| March 23, 2015 | Knitr’s best hidden gem: spin | Dean Attali’s R Blog | 8 |
| February 26, 2015 | Fuzzy String Matching – a survival skill to tackle unstructured information | Bigdata Doc | 8 |
| February 20, 2015 | Who Has the Best Fantasy Football Projections? 2015 Update | Isaac Petersen | 9 |
| February 4, 2015 | Predicting the six nations | Mango Solutions | 9 |
| January 19, 2015 | Building a choropleth map of Italy using mapIT | Davide Massidda | 9 |
| January 16, 2015 | New updates to the rNOMADS package and big changes in the GFS model | glossarch | 9 |
| December 23, 2014 | Explore Kaggle Competition Data with R | notesofdabbler | 9 |
| December 16, 2014 | How to analyze a new dataset (or, analyzing ‘supercar’ data, part 1) | Sharpsight Admin | 9 |
| December 14, 2014 | FOMC Dates – Price Data Exploration | Peter Chan | 9 |
| November 17, 2014 | A Letter of Recommendation for Nan Xiao | Yihui Xie | 9 |
| November 1, 2014 | Leveraging R for Job Openings for Economists | Thiemo Fetzer | 9 |
| October 30, 2014 | Wrangling F1 Data With R – F1DataJunkie Book | Tony Hirst | 9 |
| October 23, 2014 | How to Download and Run R Scripts from this Site | Isaac Petersen | 10 |
| September 26, 2014 | FIFA 15 Analysis with R | The Clerk | 10 |
| September 18, 2014 | “Do You Want to Steal a Snowman?” – A Look (with R) At TorrentFreak’s Top 10 PiRated Movies List #TLAPD | Bob Rudis (@hrbrmstr) | 10 |
| August 12, 2014 | Visit of Di Cook | Rob J Hyndman | 10 |
| July 6, 2014 | Identify Fantasy Football Sleepers with this Shiny App | Isaac Petersen | 10 |
| June 30, 2014 | Time to Accept It: publishing in the Journal of Statistical Software | brobar | 10 |
| June 5, 2014 | 2014 World Cup Squads | gjabel | 10 |
| June 2, 2014 | Basketball Data Part II – Length of Career by Position | jgreenb1 | 10 |
| May 26, 2014 | Using sentiment analysis to predict ratings of popular tv series | tlfvincent | 10 |
| April 28, 2014 | On the trade history and dynamics of NBA teams | tlfvincent | 10 |
| April 11, 2014 | Rblogger Posting Patterns Analyzed with R | Mark T Patterson | 11 |
| April 10, 2014 | BARUG talks highlight R’s diverse applications | Joseph Rickert | 11 |
| April 4, 2014 | Mapping academic collaborations in Evolutionary Biology | What is this? David Springate’s personal blog :: R | 11 |
| March 29, 2014 | President Approval Ratings from Roosevelt to Obama | tlfvincent | 11 |
| March 27, 2014 | Evolution of Code | Educate-R - R | 11 |
| February 13, 2014 | Terms | Tal Galili | 11 |
| February 11, 2014 | Live Google Spreadsheet For Keeping Track Of Sochi Medals | hrbrmstr | 11 |
| January 22, 2014 | Using One Programming Language In the Context of Another – Python and R | Tony Hirst | 11 |
| January 20, 2014 | Statistics meets rhetoric: A text analysis of “I Have a Dream” in R | Max Ghenis | 11 |
| January 20, 2014 | Statistics meets rhetoric: A text analysis of “I Have a Dream” in R | Max Ghenis | 11 |
| January 20, 2014 | Second NYC R classes(announcement and teaching experience) | Tal Galili | 12 |
| January 13, 2014 | Calling Python from R with rPython | bryan | 12 |
| January 13, 2014 | Why R is Better Than Excel for Fantasy Football (and most other) Data Analysis | Isaac Petersen | 12 |
| November 7, 2013 | College Basketball: Presence in the NBA over Time | Mark T Patterson | 12 |
| September 21, 2013 | Creating your personal, portable R code library with GitHub | bryan | 12 |
| August 31, 2013 | MLB Rankings Using the Bradley-Terry Model | John Ramey | 12 |
| July 4, 2013 | ggplot2 Chloropleth of Supreme Court Decisions: A Tutorial | tylerrinker | 12 |
| July 2, 2013 | Which airline should you be loyal to? | dan | 12 |
| June 24, 2013 | Opel Corsa Diesel Usage | Wingfeet | 12 |
| May 26, 2013 | Logging Data in R Loops: Applied to Twitter. | Alistair Leak | 12 |
| May 13, 2013 | Shiny App for CRAN packages | pssguy | 13 |
| May 12, 2013 | The Guerilla Guide to R | Nikhil Gopal | 13 |
| April 19, 2013 | Presentations of the third Milano R net meeting | Milano R net | 13 |
| April 10, 2013 | Milano (Italy). April 18, 2013. Third Milano R net meeting: agenda | Milano R net | 13 |
| March 25, 2013 | April 18, 2013Third Milano R net meeting: agenda | Milano R net | 13 |
| February 4, 2013 | Generating Labels for Supervised Text Classification using CAT and R | Solomon | 13 |
| January 29, 2013 | Hilary: the most poisoned baby name in US history | hilaryparker | 13 |
| January 25, 2013 | R and foreign characters | Rolf Fredheim | 13 |
| January 23, 2013 | SPARQL with R in less than 5 minutes | bryan | 13 |
| January 1, 2013 | Multiple Classification and Authorship of the Hebrew Bible | inkhorn82 | 13 |
| December 22, 2012 | Chocolate and nobel prize – a true story? | Max Gordon | 14 |
| October 28, 2012 | Animated map of 2012 US election campaigning, with R and ffmpeg | civilstat | 14 |
| October 3, 2012 | Tips on accessing data from various sources with R | David Smith | 14 |
| September 25, 2012 | R Helper Functions | bryan | 14 |
| September 16, 2012 | The R-Podcast Episode 10: Adventures in Data Munging Part 2 | Eric | 14 |
| June 20, 2012 | UseR 2012 highlights | David Smith | 14 |
| May 17, 2012 | Visualizing the CRAN: Graphing Package Dependencies | wrathematics | 14 |
| April 22, 2012 | 118 years of US State Weather Data | drunksandlampposts | 14 |
| April 5, 2012 | The 50 most used R packages | flodel | 14 |
| March 23, 2012 | RStudio Development Environment | bryan | 14 |
| January 13, 2012 | R: A Quick Scrape of Top Grossing Films from boxofficemojo.com | Tony Breyal | 15 |
| January 10, 2012 | Installing quantstrat from R-forge and source | bryan | 15 |
| January 6, 2012 | Analyzing R-bloggers | The PolStat R Feed | 15 |
| January 4, 2012 | Mapping the Iowa GOP 2012 Caucus Results | jjh | 15 |
| December 20, 2011 | Outliers in the European Parliament | The PolStat Feed | 15 |
| December 7, 2011 | Subscriptions Feature Added | bryan | 15 |
| November 13, 2011 | Google Scholar (still) sucks | bbolker | 15 |
| October 31, 2011 | Power Tools for Aspiring Data Journalists: R | Tony Hirst | 15 |
| August 9, 2011 | Forecasting recessions | Zach Mayer | 15 |
| August 4, 2011 | CHCN: Canadian Historical Climate Network | Steven Mosher | 15 |
| July 30, 2011 | hacking .gov shortened links | Harlan | 16 |
| June 29, 2011 | roll calls, ideal points, 112th Congress | jackman | 16 |
| June 9, 2011 | Automating R Scripts on Amazon EC2 | Travis Nelson | 16 |
| May 14, 2011 | Friday fun projects | nsaunders | 16 |
| April 25, 2011 | Further Adventures in Visualisation with ggplot2 | hayward | 16 |
| April 15, 2011 | Friday Function: setInternet2 | richierocks | 16 |
| April 2, 2011 | Find NHL Players with 30 Goals and 100 PIM using R | btibert3 | 16 |
| March 21, 2011 | NBA Analysis: Coming Soon! | Ryan | 16 |
| February 6, 2011 | Clustering NHL Skaters | – | 16 |
| January 16, 2011 | Dial-a-statistic! Featuring R and Estonia | Ethan Brown | 16 |
| October 31, 2010 | How to buy a used car with R (part 1) | Dan Knoepfle’s Blog | 17 |
| October 31, 2010 | How to buy a used car with R (part 1) | Dan Knoepfle’s Blog | 17 |
| August 31, 2010 | Using XML package vs. BeautifulSoup | Ryan | 17 |
| August 5, 2010 | Are MLB Games Getting Longer? | Ryan | 17 |
| June 29, 2010 | Analyze Gold Demand and Investments using R | C | 17 |
| December 28, 2009 | tooltips in R graphics; nytR package | jackman | 17 |
str(df)
## 'data.frame': 165 obs. of 4 variables:
## $ date : Factor w/ 149 levels "April 5, 2012",..: 9 10 5 6 3 1 4 2 8 7 ...
## $ title : Factor w/ 161 levels "A Little Web Scraping Exercise with XML-Package",..: 4 2 10 9 6 1 3 7 5 8 ...
## $ author: Factor w/ 97 levels "axiomOfChoice",..: 3 2 5 5 5 4 6 1 6 6 ...
## $ page : int 1 1 1 1 1 1 1 1 1 1 ...
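The `Factor` columns reported by `str()` come from `data.frame()` converting strings to factors, which was the default before R 4.0.0. The sketch below shows the same collect-then-combine pattern written with `lapply()` and with text kept as character; `fake_scraper` is a hypothetical offline stand-in for `scraper`, so nothing here touches the network.

```r
# Hypothetical offline stand-in for scraper(): returns a small data frame
# per page without any network access.
fake_scraper <- function(page) {
  data.frame(
    title = paste("post", 1:3, "on page", page),
    page  = page,
    stringsAsFactors = FALSE  # keep text as character instead of factor
  )
}

pages    <- lapply(1:17, fake_scraper)  # one data frame per page
combined <- do.call(rbind, pages)       # stack them into a single data frame

dim(combined)
class(combined$title)
```

`lapply()` returns the list directly, so there is no need to create an empty list and fill it inside a loop.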
The result has 165 rows, which differs from the 166 that the web site reports. Each page has 10 articles, except page 6, which has only 9, and page 17, which has 6. Therefore 165 (10*15 + 9 + 6 = 165) is the correct result.
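The count can be checked with a few lines of R. This sketch rebuilds a stand-in for the `page` column from the per-page counts stated above (it does not use the scraped data); on the real data frame the same check would be `table(df$page)`.

```r
# Rows per page as described in the text: 10 everywhere except
# page 6 (9 rows) and page 17 (6 rows).
rows_per_page <- rep(10, 17)
rows_per_page[6]  <- 9
rows_per_page[17] <- 6

# Stand-in for df$page, rebuilt from those counts.
page <- rep(1:17, times = rows_per_page)

sum(rows_per_page)     # total number of rows
counts <- table(page)  # rows per page
counts[counts < 10]    # pages with fewer than 10 rows
```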