The site R-bloggers is a team blog with a lot of great how-to content on various R topics. The page http://www.r-bloggers.com/search/web%20scraping lists posts related to web scraping, which is also the topic of this project!
Part 1: For each of the blog entries referenced on the first page, pull out the title, date, and author, and store these in an R data frame. Your code should be on GitHub and published to rpubs.com.
library(rvest)
## Warning: package 'rvest' was built under R version 3.1.3
library(XML)
## Warning: package 'XML' was built under R version 3.1.3
##
## Attaching package: 'XML'
##
## The following object is masked from 'package:rvest':
##
## xml
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.3
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gsubfn)
## Warning: package 'gsubfn' was built under R version 3.1.3
## Loading required package: proto
library(tools)
##
## Attaching package: 'tools'
##
## The following object is masked from 'package:XML':
##
## toHTML
#Parse the first page of search results
r_bloggers <- html("http://www.r-bloggers.com/search/web%20scraping")
#Select every div whose id contains "post" (one per blog entry)
posts <- r_bloggers %>%
  html_nodes(xpath = '//div[contains(@id,"post")]')
#Title text from the heading link
titles <- posts %>%
  html_nodes(xpath = 'h2/a/text()')
t_titles <- data.frame(sapply(titles, xmlValue))
#Date from the nested div inside the first child div
dates <- posts %>%
  html_nodes(xpath = 'div[1]/div')
t_dates <- data.frame(sapply(dates, xmlValue))
#Author from the link inside the first child div
authors <- posts %>%
  html_nodes(xpath = 'div[1]/a')
t_authors <- data.frame(sapply(authors, xmlValue))
t_posts <- cbind(Title = t_titles, Date = t_dates, Author = t_authors) #merge tables
colnames(t_posts) <- c("Title", "Date", "Author") #add colnames to table
#View table of first page posts
kable(t_posts)
| Title | Date | Author |
|---|---|---|
| rvest: easy web scraping with R | November 24, 2014 | hadleywickham |
| Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples | September 17, 2014 | Bob Rudis (@hrbrmstr) |
| Web Scraping: working with APIs | March 12, 2014 | Rolf Fredheim |
| Web Scraping: Scaling up Digital Data Collection | March 5, 2014 | Rolf Fredheim |
| Web Scraping part2: Digging deeper | February 25, 2014 | Rolf Fredheim |
| A Little Web Scraping Exercise with XML-Package | April 5, 2012 | Kay Cichini |
| R: Web Scraping R-bloggers Facebook Page | January 6, 2012 | Tony Breyal |
| Web scraping with Python – the dark side of data | December 27, 2011 | axiomOfChoice |
| Web Scraping Google+ via XPath | November 11, 2011 | Tony Breyal |
| Web Scraping Yahoo Search Page via XPath | November 10, 2011 | Tony Breyal |
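The three XPath queries above each return one flat vector, so a post that lacks a date or an author link would shift every later row out of alignment. A minimal sketch of a per-post extraction follows, assuming a current rvest (built on xml2, unlike the XML-based version used above). The `extract_posts` helper and the inline `sample_html` are hypothetical illustrations that only mimic the structure implied by the XPath expressions; they are not the site's real markup.

```r
library(rvest)

# Hypothetical markup mirroring the XPath expressions used above;
# the second post deliberately has no author link.
sample_html <- '
<html><body>
  <div id="post-1">
    <h2><a href="#">First post</a></h2>
    <div class="meta"><div class="date">January 1, 2015</div><a href="#">Author A</a></div>
  </div>
  <div id="post-2">
    <h2><a href="#">Second post</a></h2>
    <div class="meta"><div class="date">January 2, 2015</div></div>
  </div>
</body></html>'

# Hypothetical helper: one row per post, NA where a field is missing.
extract_posts <- function(page) {
  posts <- html_nodes(page, xpath = '//div[contains(@id,"post")]')
  # html_node() (singular) returns exactly one result per post
  field <- function(xp) html_text(html_node(posts, xpath = xp))
  data.frame(
    Title  = field("h2/a"),
    Date   = field("div[1]/div"),
    Author = field("div[1]/a"),
    stringsAsFactors = FALSE
  )
}

first_page <- extract_posts(read_html(sample_html))
```

Because every field is looked up once per post, the three columns always have the same length, and a missing value shows up as `NA` in its own row instead of displacing the rows below it.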
#Read the total number of result pages from the element whose class contains "last"
baseurl <- htmlParse("http://www.r-bloggers.com/search/web%20scraping")
xpath <- '//*[contains(@class, "last")]'
total_pages <- as.numeric(xpathSApply(baseurl, xpath, xmlValue))
total_pages
## [1] 17
#Substitute page number in a loop to set other search page URLs
substitute_url_args <- function(url, list_args) {
  gsubfn("%\\((.*?)\\)s", x = url, env = list_args)
}
n <- total_pages
for (i in 2:n) {
  #Page number substituted here
  s <- "http://www.r-bloggers.com/search/web%20scraping/page/%(id)s"
  L <- list(id = i)
  newurl <- substitute_url_args(s, L)
  #Post data pulled from page
  r_bloggers <- html(newurl)
  posts <- r_bloggers %>%
    html_nodes(xpath = '//div[contains(@id,"post")]')
  titles <- posts %>%
    html_nodes(xpath = 'h2/a/text()')
  t_titles <- data.frame(sapply(titles, xmlValue))
  dates <- posts %>%
    html_nodes(xpath = 'div[1]/div')
  t_dates <- data.frame(sapply(dates, xmlValue))
  authors <- posts %>%
    html_nodes(xpath = 'div[1]/a')
  t_authors <- data.frame(sapply(authors, xmlValue))
  t_posts_new <- cbind(Title = t_titles, Date = t_dates, Author = t_authors) #merge tables
  colnames(t_posts_new) <- c("Title", "Date", "Author") #add colnames to table
  t_posts <- bind_rows(t_posts, t_posts_new)
}
## Warning in rbind_all(list(x, ...)): Unequal factor levels: coercing to
## character
## Warning in rbind_all(list(x, ...)): Unequal factor levels: coercing to
## character
## Warning in rbind_all(list(x, ...)): Unequal factor levels: coercing to
## character
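The "Unequal factor levels" warnings appear because each page's columns are created as factors, and the levels differ from page to page. One way to avoid them, sketched here under stated assumptions, is to keep the columns as character vectors and bind all pages once at the end. The `fetch_page` function is a hypothetical stand-in that returns placeholder rows so the example runs offline; in practice it would hold the scraping steps from the loop. The page URLs are built with base R's `paste0` rather than template substitution.

```r
base_url <- "http://www.r-bloggers.com/search/web%20scraping"
total_pages <- 17  # value printed for total_pages earlier

# Pages 2..n; seq_len(n)[-1] is empty when n is 1, unlike 2:n
page_urls <- paste0(base_url, "/page/", seq_len(total_pages)[-1])

# Hypothetical stand-in for the scraping steps: returns one placeholder row
fetch_page <- function(url) {
  data.frame(
    Title  = paste("placeholder title from", url),
    Date   = NA_character_,
    Author = NA_character_,
    stringsAsFactors = FALSE  # keep text as character, not factor
  )
}

# Collect one data frame per page, then bind them in a single step
pages <- lapply(page_urls, fetch_page)
all_posts <- do.call(rbind, pages)
```

Binding once at the end also avoids growing the data frame on every iteration, which matters little for 17 pages but scales better for larger searches.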
kable(t_posts)
| Title | Date | Author |
|---|---|---|
| rvest: easy web scraping with R | November 24, 2014 | hadleywickham |
| Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples | September 17, 2014 | Bob Rudis (@hrbrmstr) |
| Web Scraping: working with APIs | March 12, 2014 | Rolf Fredheim |
| Web Scraping: Scaling up Digital Data Collection | March 5, 2014 | Rolf Fredheim |
| Web Scraping part2: Digging deeper | February 25, 2014 | Rolf Fredheim |
| A Little Web Scraping Exercise with XML-Package | April 5, 2012 | Kay Cichini |
| R: Web Scraping R-bloggers Facebook Page | January 6, 2012 | Tony Breyal |
| Web scraping with Python – the dark side of data | December 27, 2011 | axiomOfChoice |
| Web Scraping Google+ via XPath | November 11, 2011 | Tony Breyal |
| Web Scraping Yahoo Search Page via XPath | November 10, 2011 | Tony Breyal |
| Web Scraping Google Scholar: Part 2 (Complete Success) | November 8, 2011 | Tony Breyal |
| Web Scraping Google Scholar (Partial Success) | November 8, 2011 | Tony Breyal |
| Web Scraping Google URLs | November 7, 2011 | Tony Breyal |
| Next Level Web Scraping | November 5, 2011 | Kay Cichini |
| Web Scraping Google Scholar & Show Result as Word Cloud Using R | November 1, 2011 | Kay Cichini |
| Scraping Web Pages With R | April 15, 2015 | Tony Hirst |
| FOMC Dates – Scraping Data From Web Pages | November 30, 2014 | Peter Chan |
| Scraping Fantasy Football Projections from the Web | June 27, 2014 | Isaac Petersen |
| Web-Scraping: the Basics | February 19, 2014 | Rolf Fredheim |
| Relenium, Selenium for R. A new tool for webscraping. | January 4, 2014 | aleixrvr |
| R and the web (for beginners), Part III: Scraping MPs’ expenses in detail from the web | August 23, 2012 | GivenTheData |
| Web-Scraping in R | April 2, 2012 | diffuseprior |
| Scraping table from any web page with R or CloudStat | January 15, 2012 | PR |
| Scraping table from html web with CloudStat | January 12, 2012 | CloudStat |
| A Little Webscraping-Exercise… | October 22, 2011 | Kay Cichini |
| Scraping web data in R | August 10, 2011 | Zach Mayer |
| Webscraping using readLines and RCurl | April 14, 2009 | bryan |
| Webscraping using readLines and RCurl | April 14, 2009 | bryan |
| Short R tutorial: Scraping Javascript Generated Data with R | March 15, 2015 | DataCamp |
| FOMC Dates – Full History Web Scrape | January 21, 2015 | Peter Chan |
| Scraping XML Tables with R | May 15, 2014 | jgreenb1 |
| Scraping SSL Labs Server Test Results With R | April 29, 2014 | Bob Rudis (@hrbrmstr) |
| Interfacing R with Web technologies | April 14, 2014 | David Smith |
| Scraping organism metadata for Treebase repositories from GOLD using Python and R | April 4, 2014 | What is this? David Springate’s personal blog :: R |
| R-Bloggers’ Web-Presence | April 6, 2012 | Kay Cichini |
| How-to Extract Text From Multiple Websites with R | February 18, 2012 | Christopher Gandrud |
| Scraping Flora of North America | January 27, 2012 | Recology - R |
| Scraping R-bloggers with Python – Part 2 | January 5, 2012 | The PolStat R Feed |
| Scraping R-Bloggers with Python | January 4, 2012 | The PolStat R Feed |
| R-Function GScholarScraper to Webscrape Google Scholar Search Result | November 9, 2011 | Kay Cichini |
| Interacting with bioinformatics webservers using R | September 8, 2011 | nsaunders |
| R Screen Scraping: 105 Counties of Election Data | February 18, 2011 | Earl Glynn |
| Simple R Screen Scraping Example | February 18, 2011 | Earl Glynn |
| Scrape Web data using R | August 13, 2010 | – |
| Digital Data Collection course | March 20, 2015 | Rolf Fredheim |
| Getting Data From An Online Source | March 6, 2015 | Robert Norberg |
| Playing around with #rstats twitter data | February 28, 2015 | [email protected] |
| 50 years of Christmas at the Windsors | December 19, 2014 | Dominic Nyhuis |
| Power Outage Impact Choropleths In 5 Steps in R (featuring rvest & RStudio “Projects”) | November 27, 2014 | hrbrmstr |
| Slightly Advanced rvest with Help from htmltools + XML + pipeR | November 26, 2014 | klr |
| What size will you be after you lose weight? | November 14, 2014 | dan |
| A bioinformatics walk-through: Accessing protein-protein interaction interfaces for all known protein structures with PDBe PISA | September 28, 2014 | biochemistries |
| R User Group Roundup | August 28, 2014 | Joseph Rickert |
| Automatically Scrape Flight Ticket Data Using R and Phantomjs | April 30, 2014 | Huidong Tian |
| Text Mining Gun Deaths Data | March 13, 2014 | Francis Smart |
| Better handling of JSON data in R? | March 13, 2014 | Rolf Fredheim |
| Upcoming NYC R Programming Classes | March 10, 2014 | vivian |
| Introduction | February 1, 2014 | steadyfish |
| Programming instrumental music from scratch | July 29, 2013 | Vik Paruchuri |
| Programming instrumental music from scratch | July 29, 2013 | - r |
| Programming instrumental music from scratch | July 29, 2013 | Vik Paruchuri |
| xkcd: Visualized | May 6, 2013 | Myles |
| Has R-help gotten meaner over time? And what does Mancur Olson have to say about it? | April 30, 2013 | Trey Causey |
| Data Science, Data Analysis, R and Python | December 15, 2012 | Ron Pearson (aka TheNoodleDoodler) |
| .Rhistory | October 27, 2012 | distantobserver |
| Hangman in R: A learning experience | July 28, 2012 | tylerrinker |
| Data Analysis Training | March 20, 2012 | prasoonsharma |
| Making an R Package: Not as hard as you think | January 11, 2012 | markbulling |
| Plotting Doctor Who Ratings (1963-2011) with R | January 3, 2012 | Tony Breyal |
| GScholarXScraper: Hacking the GScholarScraper function with XPath | November 13, 2011 | Tony Breyal |
| Facebook Graph API Explorer with R | November 10, 2011 | Tony Breyal |
| UCLA Statistics: Analyzing Thesis/Dissertation Lengths | September 29, 2010 | Ryan Rosario |
| Cricket data analysis | September 4, 2010 | prasoonsharma |
| What to Expect? | January 22, 2010 | Ryan |
| Analysing The Rock ‘n’ Roll Madrid Marathon | April 18, 2015 | aschinchon |
| Monitoring Price Fluctuations of Book Trade-In Values on Amazon | April 8, 2015 | Andrew Landgraf |
| More Airline Crashes via the Hadleyverse | March 31, 2015 | hrbrmstr |
| Knitr’s best hidden gem: spin | March 23, 2015 | Dean Attali’s R Blog |
| Fuzzy String Matching – a survival skill to tackle unstructured information | February 26, 2015 | Bigdata Doc |
| Who Has the Best Fantasy Football Projections? 2015 Update | February 20, 2015 | Isaac Petersen |
| Predicting the six nations | February 4, 2015 | Mango Solutions |
| Building a choropleth map of Italy using mapIT | January 19, 2015 | Davide Massidda |
| New updates to the rNOMADS package and big changes in the GFS model | January 16, 2015 | glossarch |
| Explore Kaggle Competition Data with R | December 23, 2014 | notesofdabbler |
| How to analyze a new dataset (or, analyzing ‘supercar’ data, part 1) | December 16, 2014 | Sharpsight Admin |
| FOMC Dates – Price Data Exploration | December 14, 2014 | Peter Chan |
| A Letter of Recommendation for Nan Xiao | November 17, 2014 | Yihui Xie |
| Leveraging R for Job Openings for Economists | November 1, 2014 | Thiemo Fetzer |
| Wrangling F1 Data With R – F1DataJunkie Book | October 30, 2014 | Tony Hirst |
| How to Download and Run R Scripts from this Site | October 23, 2014 | Isaac Petersen |
| FIFA 15 Analysis with R | September 26, 2014 | The Clerk |
| “Do You Want to Steal a Snowman?” – A Look (with R) At TorrentFreak’s Top 10 PiRated Movies List #TLAPD | September 18, 2014 | Bob Rudis (@hrbrmstr) |
| Visit of Di Cook | August 12, 2014 | Rob J Hyndman |
| Identify Fantasy Football Sleepers with this Shiny App | July 6, 2014 | Isaac Petersen |
| Time to Accept It: publishing in the Journal of Statistical Software | June 30, 2014 | brobar |
| 2014 World Cup Squads | June 5, 2014 | gjabel |
| Basketball Data Part II – Length of Career by Position | June 2, 2014 | jgreenb1 |
| Using sentiment analysis to predict ratings of popular tv series | May 26, 2014 | tlfvincent |
| On the trade history and dynamics of NBA teams | April 28, 2014 | tlfvincent |
| Rblogger Posting Patterns Analyzed with R | April 11, 2014 | Mark T Patterson |
| BARUG talks highlight R’s diverse applications | April 10, 2014 | Joseph Rickert |
| Mapping academic collaborations in Evolutionary Biology | April 4, 2014 | What is this? David Springate’s personal blog :: R |
| President Approval Ratings from Roosevelt to Obama | March 29, 2014 | tlfvincent |
| Evolution of Code | March 27, 2014 | Educate-R - R |
| Terms | February 13, 2014 | Tal Galili |
| Live Google Spreadsheet For Keeping Track Of Sochi Medals | February 11, 2014 | hrbrmstr |
| Using One Programming Language In the Context of Another – Python and R | January 22, 2014 | Tony Hirst |
| Statistics meets rhetoric: A text analysis of “I Have a Dream” in R | January 20, 2014 | Max Ghenis |
| Statistics meets rhetoric: A text analysis of “I Have a Dream” in R | January 20, 2014 | Max Ghenis |
| Second NYC R classes(announcement and teaching experience) | January 20, 2014 | Tal Galili |
| Calling Python from R with rPython | January 13, 2014 | bryan |
| Why R is Better Than Excel for Fantasy Football (and most other) Data Analysis | January 13, 2014 | Isaac Petersen |
| College Basketball: Presence in the NBA over Time | November 7, 2013 | Mark T Patterson |
| Creating your personal, portable R code library with GitHub | September 21, 2013 | bryan |
| MLB Rankings Using the Bradley-Terry Model | August 31, 2013 | John Ramey |
| ggplot2 Chloropleth of Supreme Court Decisions: A Tutorial | July 4, 2013 | tylerrinker |
| Which airline should you be loyal to? | July 2, 2013 | dan |
| Opel Corsa Diesel Usage | June 24, 2013 | Wingfeet |
| Logging Data in R Loops: Applied to Twitter. | May 26, 2013 | Alistair Leak |
| Shiny App for CRAN packages | May 13, 2013 | pssguy |
| The Guerilla Guide to R | May 12, 2013 | Nikhil Gopal |
| Presentations of the third Milano R net meeting | April 19, 2013 | Milano R net |
| Milano (Italy). April 18, 2013. Third Milano R net meeting: agenda | April 10, 2013 | Milano R net |
| April 18, 2013Third Milano R net meeting: agenda | March 25, 2013 | Milano R net |
| Generating Labels for Supervised Text Classification using CAT and R | February 4, 2013 | Solomon |
| Hilary: the most poisoned baby name in US history | January 29, 2013 | hilaryparker |
| R and foreign characters | January 25, 2013 | Rolf Fredheim |
| SPARQL with R in less than 5 minutes | January 23, 2013 | bryan |
| Multiple Classification and Authorship of the Hebrew Bible | January 1, 2013 | inkhorn82 |
| Chocolate and nobel prize – a true story? | December 22, 2012 | Max Gordon |
| Animated map of 2012 US election campaigning, with R and ffmpeg | October 28, 2012 | civilstat |
| Tips on accessing data from various sources with R | October 3, 2012 | David Smith |
| R Helper Functions | September 25, 2012 | bryan |
| The R-Podcast Episode 10: Adventures in Data Munging Part 2 | September 16, 2012 | Eric |
| UseR 2012 highlights | June 20, 2012 | David Smith |
| Visualizing the CRAN: Graphing Package Dependencies | May 17, 2012 | wrathematics |
| 118 years of US State Weather Data | April 22, 2012 | drunksandlampposts |
| The 50 most used R packages | April 5, 2012 | flodel |
| RStudio Development Environment | March 23, 2012 | bryan |
| R: A Quick Scrape of Top Grossing Films from boxofficemojo.com | January 13, 2012 | Tony Breyal |
| Installing quantstrat from R-forge and source | January 10, 2012 | bryan |
| Analyzing R-bloggers | January 6, 2012 | The PolStat R Feed |
| Mapping the Iowa GOP 2012 Caucus Results | January 4, 2012 | jjh |
| Outliers in the European Parliament | December 20, 2011 | The PolStat Feed |
| Subscriptions Feature Added | December 7, 2011 | bryan |
| Google Scholar (still) sucks | November 13, 2011 | bbolker |
| Power Tools for Aspiring Data Journalists: R | October 31, 2011 | Tony Hirst |
| Forecasting recessions | August 9, 2011 | Zach Mayer |
| CHCN: Canadian Historical Climate Network | August 4, 2011 | Steven Mosher |
| hacking .gov shortened links | July 30, 2011 | Harlan |
| roll calls, ideal points, 112th Congress | June 29, 2011 | jackman |
| Automating R Scripts on Amazon EC2 | June 9, 2011 | Travis Nelson |
| Friday fun projects | May 14, 2011 | nsaunders |
| Further Adventures in Visualisation with ggplot2 | April 25, 2011 | hayward |
| Friday Function: setInternet2 | April 15, 2011 | richierocks |
| Find NHL Players with 30 Goals and 100 PIM using R | April 2, 2011 | btibert3 |
| NBA Analysis: Coming Soon! | March 21, 2011 | Ryan |
| Clustering NHL Skaters | February 6, 2011 | – |
| Dial-a-statistic! Featuring R and Estonia | January 16, 2011 | Ethan Brown |
| How to buy a used car with R (part 1) | October 31, 2010 | Dan Knoepfle’s Blog |
| How to buy a used car with R (part 1) | October 31, 2010 | Dan Knoepfle’s Blog |
| Using XML package vs. BeautifulSoup | August 31, 2010 | Ryan |
| Are MLB Games Getting Longer? | August 5, 2010 | Ryan |
| Analyze Gold Demand and Investments using R | June 29, 2010 | C |
| tooltips in R graphics; nytR package | December 28, 2009 | jackman |
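The Date column in the table is plain text. As a sketch of a possible follow-up step, the strings can be converted to R's `Date` class so that posts sort chronologically. The three rows below are copied from the table above; setting `LC_TIME` to `"C"` is an assumption made here so that English month names parse regardless of the system locale.

```r
# Three rows copied from the table above
posts <- data.frame(
  Title = c("rvest: easy web scraping with R",
            "Web Scraping: working with APIs",
            "Webscraping using readLines and RCurl"),
  Date = c("November 24, 2014", "March 12, 2014", "April 14, 2009"),
  Author = c("hadleywickham", "Rolf Fredheim", "bryan"),
  stringsAsFactors = FALSE
)

# Assumption: force English month names so that %B matches "November" etc.
Sys.setlocale("LC_TIME", "C")

# %B = full month name, %d = day of month, %Y = four-digit year
posts$Date <- as.Date(posts$Date, format = "%B %d, %Y")

# Sort from oldest to newest
oldest_first <- posts[order(posts$Date), ]
```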