The site r-bloggers is a team blog, with a lot of great how-to content on various R topics. The page http://www.r-bloggers.com/search/web%20scraping provides a list of topics related to web scraping, which is also the topic of this project!
Grading rubric:
- For each of the reference blog entries on the first page, you should pull out the title, date, and author, and store these in an R data frame. Your code should be on GitHub, and published to rpubs.com. You’ll receive a maximum of 90% for completing this base assignment.
- To earn the full 100%, you must do some kind of further data extraction and/or analysis. Here are four sample ideas. You don’t need to do more than one of these, and you are free to instead choose your own area for further analysis. Maximum additional points: 10%.
1- Extend your scraper to include the base information for blog entries on all of the tagged pages. Your R data frame should include any necessary additional rows.
# Turn off warnings, since all of them pertain to R version 3.1.3
options(warn=-1)
# Loading necessary libraries
library(XML)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:XML':
##
## xml
library(stringr)
library(knitr)
1- For the reference blog entries on the first page, we store the title, date, and author in a data frame, then display the stored data using the “kable” function from the “knitr” package.
url = 'http://www.r-bloggers.com/search/web%20scraping'
doc <- htmlParse(url)
url_post<- html_nodes(doc, xpath='//div[contains(@id,"post")]')
# length(url_post)
titles<- html_nodes(url_post,xpath='h2/a/text()')
dates <- html_nodes(url_post, xpath='div[1]/div')
authors<- html_nodes(url_post, xpath='div[1]/a')
titles<- sapply(titles,xmlValue)
dates<- sapply(dates,xmlValue)
authors<- sapply(authors,xmlValue)
out_df <- data.frame("Date" = str_trim(dates,side = "both"), "Title"= str_trim(titles,side = "both"),"Author"= str_trim(authors,side = "both"), "Page"= 1)
kable(out_df)
| Date | Title | Author | Page |
|---|---|---|---|
| November 24, 2014 | rvest: easy web scraping with R | hadleywickham | 1 |
| September 17, 2014 | Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples | Bob Rudis (@hrbrmstr) | 1 |
| March 12, 2014 | Web Scraping: working with APIs | Rolf Fredheim | 1 |
| March 5, 2014 | Web Scraping: Scaling up Digital Data Collection | Rolf Fredheim | 1 |
| February 25, 2014 | Web Scraping part2: Digging deeper | Rolf Fredheim | 1 |
| April 5, 2012 | A Little Web Scraping Exercise with XML-Package | Kay Cichini | 1 |
| January 6, 2012 | R: Web Scraping R-bloggers Facebook Page | Tony Breyal | 1 |
| December 27, 2011 | Web scraping with Python the dark side of data | axiomOfChoice | 1 |
| November 11, 2011 | Web Scraping Google+ via XPath | Tony Breyal | 1 |
| November 10, 2011 | Web Scraping Yahoo Search Page via XPath | Tony Breyal | 1 |
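The extraction above pulls titles, dates, and authors as three separate vectors, which only line up if every post has all three fields. A minimal sketch of a per-post variant follows, assuming rvest 1.0.0 or later (which provides `html_elements()` and `html_element()`). The inline HTML is made up to mimic the structure that the XPath expressions above assume; it is not the real r-bloggers markup.

```r
library(rvest)

# Made-up markup mimicking the assumed page structure (hypothetical, not the live site).
html <- '
<div id="leftcontent">
  <div id="post-1">
    <h2><a href="#">First post</a></h2>
    <div class="meta"><div class="date"> January 1, 2015 </div>By <a href="#">Alice</a></div>
  </div>
  <div id="post-2">
    <h2><a href="#">Second post</a></h2>
    <div class="meta"><div class="date">January 2, 2015</div></div>
  </div>
</div>'

doc <- read_html(html)

# One node per blog entry, using the same XPath as the code above.
posts <- html_elements(doc, xpath = '//div[contains(@id, "post")]')

# html_element() (singular) returns exactly one result per post,
# so a post without an author yields NA instead of shifting later rows.
out <- data.frame(
  Date   = html_text(html_element(posts, xpath = "div[1]/div"), trim = TRUE),
  Title  = html_text(html_element(posts, xpath = "h2/a"), trim = TRUE),
  Author = html_text(html_element(posts, xpath = "div[1]/a"), trim = TRUE),
  Page   = 1,
  stringsAsFactors = FALSE
)
print(out)
```

In this sketch the second post has no author link, so its `Author` value is `NA` while its date and title stay on the correct row.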
out_df <- data.frame("Date" = str_trim(dates,side = "both"), "Title"= str_trim(titles,side = "both"),"Author"= str_trim(authors,side = "both"), "Page"= 1)
## Find the total number of pages and store it in a variable named "count"
pages<- html_nodes(url_post, xpath='//*[@id="leftcontent"]/div[11]/span[1]')
pages<-sapply(pages,xmlValue)
# keep the trailing number of the pager text (the last page number)
pages<-as.numeric(str_extract(pages,"[0-9]+$"))
count<- pages[1]
# traverse every page and store title, date, and author for every blog
for ( i in 2:count) {
url <- paste("http://www.r-bloggers.com/search/web%20scraping/page/",i,"/",sep="")
doc1 <- htmlParse(url)
url_post<- html_nodes(doc1, xpath='//div[contains(@id,"post")]')
titles<- html_nodes(url_post,xpath='h2/a/text()')
dates <- html_nodes(url_post, xpath='div[1]/div')
authors<- html_nodes(url_post, xpath='div[1]/a')
titles<- sapply(titles,xmlValue)
dates<- sapply(dates,xmlValue)
authors<- sapply(authors,xmlValue)
out_df <- rbind(out_df, data.frame(
"Date" = dates,
"Title"= titles,
"Author"= authors,
"Page"= i)
)
# Note: grepl() could be used to find control data (CDATA script residue) in authors
# and replace it with the word "Unknown" (alternative to the rbind() call above):
# out_df <- rbind(out_df, data.frame(
# "Date" = dates,
# "Title"= titles,
# "Author"= ifelse(grepl("CDATA", authors), "Unknown", authors),
# "Page"= i)
# )
}
kable(out_df)
| Date | Title | Author | Page |
|---|---|---|---|
| November 24, 2014 | rvest: easy web scraping with R | hadleywickham | 1 |
| September 17, 2014 | Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples | Bob Rudis (@hrbrmstr) | 1 |
| March 12, 2014 | Web Scraping: working with APIs | Rolf Fredheim | 1 |
| March 5, 2014 | Web Scraping: Scaling up Digital Data Collection | Rolf Fredheim | 1 |
| February 25, 2014 | Web Scraping part2: Digging deeper | Rolf Fredheim | 1 |
| April 5, 2012 | A Little Web Scraping Exercise with XML-Package | Kay Cichini | 1 |
| January 6, 2012 | R: Web Scraping R-bloggers Facebook Page | Tony Breyal | 1 |
| December 27, 2011 | Web scraping with Python the dark side of data | axiomOfChoice | 1 |
| November 11, 2011 | Web Scraping Google+ via XPath | Tony Breyal | 1 |
| November 10, 2011 | Web Scraping Yahoo Search Page via XPath | Tony Breyal | 1 |
| November 8, 2011 | Web Scraping Google Scholar: Part 2 (Complete Success) | Tony Breyal | 2 |
| November 8, 2011 | Web Scraping Google Scholar (Partial Success) | Tony Breyal | 2 |
| November 7, 2011 | Web Scraping Google URLs | Tony Breyal | 2 |
| November 5, 2011 | Next Level Web Scraping | Kay Cichini | 2 |
| November 1, 2011 | Web Scraping Google Scholar & Show Result as Word Cloud Using R | Kay Cichini | 2 |
| April 15, 2015 | Scraping Web Pages With R | Tony Hirst | 2 |
| November 30, 2014 | FOMC Dates Scraping Data From Web Pages | Peter Chan | 2 |
| June 27, 2014 | Scraping Fantasy Football Projections from the Web | Isaac Petersen | 2 |
| February 19, 2014 | Web-Scraping: the Basics | Rolf Fredheim | 2 |
| January 4, 2014 | Relenium, Selenium for R. A new tool for webscraping. | aleixrvr | 2 |
| August 23, 2012 | R and the web (for beginners), Part III: Scraping MPs expenses in detail from the web | GivenTheData | 3 |
| April 2, 2012 | Web-Scraping in R | diffuseprior | 3 |
| January 15, 2012 | Scraping table from any web page with R or CloudStat | PR | 3 |
| January 12, 2012 | Scraping table from html web with CloudStat | CloudStat | 3 |
| October 22, 2011 | A Little Webscraping-Exercise… | Kay Cichini | 3 |
| August 10, 2011 | Scraping web data in R | Zach Mayer | 3 |
| April 14, 2009 | Webscraping using readLines and RCurl | bryan | 3 |
| April 14, 2009 | Webscraping using readLines and RCurl | bryan | 3 |
| March 15, 2015 | Short R tutorial: Scraping Javascript Generated Data with R | DataCamp | 3 |
| January 21, 2015 | FOMC Dates Full History Web Scrape | Peter Chan | 3 |
| May 15, 2014 | Scraping XML Tables with R | jgreenb1 | 4 |
| April 29, 2014 | Scraping SSL Labs Server Test Results With R | Bob Rudis (@hrbrmstr) | 4 |
| April 14, 2014 | Interfacing R with Web technologies | David Smith | 4 |
| April 4, 2014 | Scraping organism metadata for Treebase repositories from GOLD using Python and R | What is this? David Springate’s personal blog :: R | 4 |
| April 6, 2012 | R-Bloggers Web-Presence | Kay Cichini | 4 |
| February 18, 2012 | How-to Extract Text From Multiple Websites with R | Christopher Gandrud | 4 |
| January 27, 2012 | Scraping Flora of North America | Recology - R | 4 |
| January 5, 2012 | Scraping R-bloggers with Python Part 2 | The PolStat R Feed | 4 |
| January 4, 2012 | Scraping R-Bloggers with Python | The PolStat R Feed | 4 |
| November 9, 2011 | R-Function GScholarScraper to Webscrape Google Scholar Search Result | Kay Cichini | 4 |
| September 8, 2011 | Interacting with bioinformatics webservers using R | nsaunders | 5 |
| February 18, 2011 | R Screen Scraping: 105 Counties of Election Data | Earl Glynn | 5 |
| February 18, 2011 | Simple R Screen Scraping Example | Earl Glynn | 5 |
| August 13, 2010 | Scrape Web data using R | – | 5 |
| March 20, 2015 | Digital Data Collection course | Rolf Fredheim | 5 |
| March 6, 2015 | Getting Data From An Online Source | Robert Norberg | 5 |
| February 28, 2015 | Playing around with #rstats twitter data | [email protected] | 5 |
| December 19, 2014 | 50 years of Christmas at the Windsors | Dominic Nyhuis | 5 |
| November 27, 2014 | Power Outage Impact Choropleths In 5 Steps in R (featuring rvest & RStudio “Projects”) | hrbrmstr | 5 |
| November 26, 2014 | Slightly Advanced rvest with Help from htmltools + XML + pipeR | klr | 5 |
| November 14, 2014 | What size will you be after you lose weight? | dan | 6 |
| September 28, 2014 | A bioinformatics walk-through: Accessing protein-protein interaction interfaces for all known protein structures with PDBe PISA | biochemistries | 6 |
| August 28, 2014 | R User Group Roundup | Joseph Rickert | 6 |
| April 30, 2014 | Automatically Scrape Flight Ticket Data Using R and Phantomjs | Huidong Tian | 6 |
| March 13, 2014 | Text Mining Gun Deaths Data | Francis Smart | 6 |
| March 13, 2014 | Better handling of JSON data in R? | Rolf Fredheim | 6 |
| March 10, 2014 | Upcoming NYC R Programming Classes | vivian | 6 |
| February 1, 2014 | Introduction | steadyfish | 6 |
| July 29, 2013 | Programming instrumental music from scratch | Vik Paruchuri | 6 |
| July 29, 2013 | Programming instrumental music from scratch | - r | 7 |
| July 29, 2013 | Programming instrumental music from scratch | Vik Paruchuri | 7 |
| May 6, 2013 | xkcd: Visualized | Myles | 7 |
| April 30, 2013 | Has R-help gotten meaner over time? And what does Mancur Olson have to say about it? | Trey Causey | 7 |
| December 15, 2012 | Data Science, Data Analysis, R and Python | Ron Pearson (aka TheNoodleDoodler) | 7 |
| October 27, 2012 | .Rhistory | distantobserver | 7 |
| July 28, 2012 | Hangman in R: A learning experience | tylerrinker | 7 |
| March 20, 2012 | Data Analysis Training | prasoonsharma | 7 |
| January 11, 2012 | Making an R Package: Not as hard as you think | markbulling | 7 |
| January 3, 2012 | Plotting Doctor Who Ratings (1963-2011) with R | Tony Breyal | 7 |
| November 13, 2011 | GScholarXScraper: Hacking the GScholarScraper function with XPath | Tony Breyal | 8 |
| November 10, 2011 | Facebook Graph API Explorer with R | Tony Breyal | 8 |
| September 29, 2010 | UCLA Statistics: Analyzing Thesis/Dissertation Lengths | Ryan Rosario | 8 |
| September 4, 2010 | Cricket data analysis | prasoonsharma | 8 |
| January 22, 2010 | What to Expect? | Ryan | 8 |
| April 18, 2015 | Analysing The Rock ‘n’ Roll Madrid Marathon | aschinchon | 8 |
| April 8, 2015 | Monitoring Price Fluctuations of Book Trade-In Values on Amazon | Andrew Landgraf | 8 |
| March 31, 2015 | More Airline Crashes via the Hadleyverse | hrbrmstr | 8 |
| March 23, 2015 | Knitrs best hidden gem: spin | Dean Attali’s R Blog | 8 |
| February 26, 2015 | Fuzzy String Matching a survival skill to tackle unstructured information | Bigdata Doc | 8 |
| February 20, 2015 | Who Has the Best Fantasy Football Projections? 2015 Update | Isaac Petersen | 9 |
| February 4, 2015 | Predicting the six nations | Mango Solutions | 9 |
| January 19, 2015 | Building a choropleth map of Italy using mapIT | Davide Massidda | 9 |
| January 16, 2015 | New updates to the rNOMADS package and big changes in the GFS model | glossarch | 9 |
| December 23, 2014 | Explore Kaggle Competition Data with R | notesofdabbler | 9 |
| December 16, 2014 | How to analyze a new dataset (or, analyzing ‘supercar’ data, part 1) | Sharpsight Admin | 9 |
| December 14, 2014 | FOMC Dates Price Data Exploration | Peter Chan | 9 |
| November 17, 2014 | A Letter of Recommendation for Nan Xiao | Yihui Xie | 9 |
| November 1, 2014 | Leveraging R for Job Openings for Economists | Thiemo Fetzer | 9 |
| October 30, 2014 | Wrangling F1 Data With R F1DataJunkie Book | Tony Hirst | 9 |
| October 23, 2014 | How to Download and Run R Scripts from this Site | Isaac Petersen | 10 |
| September 26, 2014 | FIFA 15 Analysis with R | The Clerk | 10 |
| September 18, 2014 | “Do You Want to Steal a Snowman?” A Look (with R) At TorrentFreaks Top 10 PiRated Movies List #TLAPD | Bob Rudis (@hrbrmstr) | 10 |
| August 12, 2014 | Visit of Di Cook | Rob J Hyndman | 10 |
| July 6, 2014 | Identify Fantasy Football Sleepers with this Shiny App | Isaac Petersen | 10 |
| June 30, 2014 | Time to Accept It: publishing in the Journal of Statistical Software | brobar | 10 |
| June 5, 2014 | 2014 World Cup Squads | gjabel | 10 |
| June 2, 2014 | Basketball Data Part II Length of Career by Position | jgreenb1 | 10 |
| May 26, 2014 | Using sentiment analysis to predict ratings of popular tv series | tlfvincent | 10 |
| April 28, 2014 | On the trade history and dynamics of NBA teams | tlfvincent | 10 |
| April 11, 2014 | Rblogger Posting Patterns Analyzed with R | Mark T Patterson | 11 |
| April 10, 2014 | BARUG talks highlight Rs diverse applications | Joseph Rickert | 11 |
| April 4, 2014 | Mapping academic collaborations in Evolutionary Biology | What is this? David Springate’s personal blog :: R | 11 |
| March 29, 2014 | President Approval Ratings from Roosevelt to Obama | tlfvincent | 11 |
| March 27, 2014 | Evolution of Code | Educate-R - R | 11 |
| February 13, 2014 | Terms | Tal Galili | 11 |
| February 11, 2014 | Live Google Spreadsheet For Keeping Track Of Sochi Medals | hrbrmstr | 11 |
| January 22, 2014 | Using One Programming Language In the Context of Another Python and R | Tony Hirst | 11 |
| January 20, 2014 | Statistics meets rhetoric: A text analysis of “I Have a Dream” in R | Max Ghenis | 11 |
| January 20, 2014 | Statistics meets rhetoric: A text analysis of “I Have a Dream” in R | Max Ghenis | 11 |
| January 20, 2014 | Second NYC R classes(announcement and teaching experience) | Tal Galili | 12 |
| January 13, 2014 | Calling Python from R with rPython | bryan | 12 |
| January 13, 2014 | Why R is Better Than Excel for Fantasy Football (and most other) Data Analysis | Isaac Petersen | 12 |
| November 7, 2013 | College Basketball: Presence in the NBA over Time | Mark T Patterson | 12 |
| September 21, 2013 | Creating your personal, portable R code library with GitHub | bryan | 12 |
| August 31, 2013 | MLB Rankings Using the Bradley-Terry Model | John Ramey | 12 |
| July 4, 2013 | ggplot2 Chloropleth of Supreme Court Decisions: A Tutorial | tylerrinker | 12 |
| July 2, 2013 | Which airline should you be loyal to? | dan | 12 |
| June 24, 2013 | Opel Corsa Diesel Usage | Wingfeet | 12 |
| May 26, 2013 | Logging Data in R Loops: Applied to Twitter. | Alistair Leak | 12 |
| May 13, 2013 | Shiny App for CRAN packages | pssguy | 13 |
| May 12, 2013 | The Guerilla Guide to R | Nikhil Gopal | 13 |
| April 19, 2013 | Presentations of the third Milano R net meeting | Milano R net | 13 |
| April 10, 2013 | Milano (Italy). April 18, 2013. Third Milano R net meeting: agenda | Milano R net | 13 |
| March 25, 2013 | April 18, 2013Third Milano R net meeting: agenda | Milano R net | 13 |
| February 4, 2013 | Generating Labels for Supervised Text Classification using CAT and R | Solomon | 13 |
| January 29, 2013 | Hilary: the most poisoned baby name in US history | hilaryparker | 13 |
| January 25, 2013 | R and foreign characters | Rolf Fredheim | 13 |
| January 23, 2013 | SPARQL with R in less than 5 minutes | bryan | 13 |
| January 1, 2013 | Multiple Classification and Authorship of the Hebrew Bible | inkhorn82 | 13 |
| December 22, 2012 | Chocolate and nobel prize a true story? | Max Gordon | 14 |
| October 28, 2012 | Animated map of 2012 US election campaigning, with R and ffmpeg | civilstat | 14 |
| October 3, 2012 | Tips on accessing data from various sources with R | David Smith | 14 |
| September 25, 2012 | R Helper Functions | bryan | 14 |
| September 16, 2012 | The R-Podcast Episode 10: Adventures in Data Munging Part 2 | Eric | 14 |
| June 20, 2012 | UseR 2012 highlights | David Smith | 14 |
| May 17, 2012 | Visualizing the CRAN: Graphing Package Dependencies | wrathematics | 14 |
| April 22, 2012 | 118 years of US State Weather Data | drunksandlampposts | 14 |
| April 5, 2012 | The 50 most used R packages | flodel | 14 |
| March 23, 2012 | RStudio Development Environment | bryan | 14 |
| January 13, 2012 | R: A Quick Scrape of Top Grossing Films from boxofficemojo.com | Tony Breyal | 15 |
| January 10, 2012 | Installing quantstrat from R-forge and source | bryan | 15 |
| January 6, 2012 | Analyzing R-bloggers | The PolStat R Feed | 15 |
| January 4, 2012 | Mapping the Iowa GOP 2012 Caucus Results | jjh | 15 |
| December 20, 2011 | Outliers in the European Parliament | The PolStat Feed | 15 |
| December 7, 2011 | Subscriptions Feature Added | bryan | 15 |
| November 13, 2011 | Google Scholar (still) sucks | bbolker | 15 |
| October 31, 2011 | Power Tools for Aspiring Data Journalists: R | Tony Hirst | 15 |
| August 9, 2011 | Forecasting recessions | Zach Mayer | 15 |
| August 4, 2011 | CHCN: Canadian Historical Climate Network | Steven Mosher | 15 |
| July 30, 2011 | hacking .gov shortened links | Harlan | 16 |
| June 29, 2011 | roll calls, ideal points, 112th Congress | jackman | 16 |
| June 9, 2011 | Automating R Scripts on Amazon EC2 | Travis Nelson | 16 |
| May 14, 2011 | Friday fun projects | nsaunders | 16 |
| April 25, 2011 | Further Adventures in Visualisation with ggplot2 | hayward | 16 |
| April 15, 2011 | Friday Function: setInternet2 | richierocks | 16 |
| April 2, 2011 | Find NHL Players with 30 Goals and 100 PIM using R | btibert3 | 16 |
| March 21, 2011 | NBA Analysis: Coming Soon! | Ryan | 16 |
| February 6, 2011 | Clustering NHL Skaters | – | 16 |
| January 16, 2011 | Dial-a-statistic! Featuring R and Estonia | Ethan Brown | 16 |
| October 31, 2010 | How to buy a used car with R (part 1) | Dan Knoepfle’s Blog | 17 |
| October 31, 2010 | How to buy a used car with R (part 1) | Dan Knoepfle’s Blog | 17 |
| August 31, 2010 | Using XML package vs. BeautifulSoup | Ryan | 17 |
| August 5, 2010 | Are MLB Games Getting Longer? | Ryan | 17 |
| June 29, 2010 | Analyze Gold Demand and Investments using R | C | 17 |
| December 28, 2009 | tooltips in R graphics; nytR package | jackman | 17 |
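The note inside the loop mentions control data (CDATA) ending up in the author field. This appears to come from an email-obfuscation script embedded in the author link, which also leaves the placeholder text "[email protected]". The sketch below shows one possible clean-up using only base R. The helper name `clean_authors` and the sample rows are hypothetical, and the choice to label such entries "Unknown" follows the note in the loop. It also collects per-page data frames in a list and binds them once, rather than growing `out_df` on every iteration.

```r
# Hypothetical helper: replace script residue in author strings with "Unknown".
clean_authors <- function(authors) {
  authors <- trimws(authors)
  # Flag entries containing CDATA script text or the obfuscation placeholder.
  is_residue <- grepl("CDATA", authors, fixed = TRUE) |
    grepl("[email protected]", authors, fixed = TRUE)
  # Also treat missing or empty author strings as unknown.
  authors[is_residue | is.na(authors) | authors == ""] <- "Unknown"
  authors
}

# Made-up per-page results standing in for the vectors scraped in the loop.
page_results <- list(
  data.frame(Date = "January 1, 2015", Title = "Post A",
             Author = " Alice ", Page = 2, stringsAsFactors = FALSE),
  data.frame(Date = "January 2, 2015", Title = "Post B",
             Author = "[email protected] /* <![CDATA[ */ script text /* ]]> */",
             Page = 3, stringsAsFactors = FALSE)
)

# Bind all pages once at the end, then clean the author column.
out <- do.call(rbind, page_results)
out$Author <- clean_authors(out$Author)
print(out)
```

Cleaning after binding keeps the scraping loop simple and makes the rule easy to adjust, for example to keep the original string in a separate column for later inspection.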