Project Description:

The site r-bloggers is a team blog, with a lot of great how-to content on various R topics. The page http://www.r-bloggers.com/search/web%20scraping provides a list of topics related to web scraping, which is also the topic of this project!

Grading rubric:

- For each of the reference blog entries on the first page, you should pull out the title, date, and author, and store these in an R data frame. Your code should be in GitHub, and published to rpubs.com. You’ll receive a maximum of 90% for completing this base assignment.

- To earn the full 100 points, you must do some kind of further data extraction and/or analysis. Here are four sample ideas. You don’t need to do more than one of these, and you are free to instead choose your own area for further analysis. Maximum additional points: 10%.

1- Extend your scraper to include the base information for blog entries on all of the tagged pages. Your R data frame should include any necessary additional rows.

# Turn off warnings, since all of them pertain to R version 3.1.3
options(warn=-1)

# Load the necessary libraries
library(XML)
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:XML':
## 
##     xml
library(stringr)
library(knitr)

1- For the reference blog entries on the first page, we store the title, date, and author in a data frame, then display the stored data using the “kable” function from the “knitr” package.

url <- 'http://www.r-bloggers.com/search/web%20scraping'
doc <- htmlParse(url)

# Each blog entry on the results page sits in a div whose id contains "post"
url_post <- html_nodes(doc, xpath='//div[contains(@id,"post")]')
# length(url_post)

# Locate the title, date, and author nodes relative to each entry
titles  <- html_nodes(url_post, xpath='h2/a/text()')
dates   <- html_nodes(url_post, xpath='div[1]/div')
authors <- html_nodes(url_post, xpath='div[1]/a')

# Convert the nodes to character vectors
titles  <- sapply(titles, xmlValue)
dates   <- sapply(dates, xmlValue)
authors <- sapply(authors, xmlValue)

# Trim surrounding whitespace and record that these rows come from page 1
out_df <- data.frame("Date" = str_trim(dates, side = "both"),
                     "Title" = str_trim(titles, side = "both"),
                     "Author" = str_trim(authors, side = "both"),
                     "Page" = 1)

kable(out_df)
| Date | Title | Author | Page |
|:--|:--|:--|--:|
| November 24, 2014 | rvest: easy web scraping with R | hadleywickham | 1 |
| September 17, 2014 | Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples | Bob Rudis (@hrbrmstr) | 1 |
| March 12, 2014 | Web Scraping: working with APIs | Rolf Fredheim | 1 |
| March 5, 2014 | Web Scraping: Scaling up Digital Data Collection | Rolf Fredheim | 1 |
| February 25, 2014 | Web Scraping part2: Digging deeper | Rolf Fredheim | 1 |
| April 5, 2012 | A Little Web Scraping Exercise with XML-Package | Kay Cichini | 1 |
| January 6, 2012 | R: Web Scraping R-bloggers Facebook Page | Tony Breyal | 1 |
| December 27, 2011 | Web scraping with Python – the dark side of data | axiomOfChoice | 1 |
| November 11, 2011 | Web Scraping Google+ via XPath | Tony Breyal | 1 |
| November 10, 2011 | Web Scraping Yahoo Search Page via XPath | Tony Breyal | 1 |
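
The Date column above is stored as text such as "November 24, 2014". If the dates are later needed for sorting or analysis, they could be converted to R's Date class. The following is a minimal base-R sketch using sample values copied from the first three rows of the table above; `parse_post_date` is a hypothetical helper and not part of the original code. It matches month names against the built-in `month.name` vector, so it does not depend on the session locale.

```r
# Sample values copied from the first three rows of the table above
dates <- c("November 24, 2014", "September 17, 2014", "March 12, 2014")

# parse_post_date() is a hypothetical helper (not part of the original script).
# It turns "Month D, YYYY" strings into Date values and returns NA for
# anything that does not follow that pattern.
parse_post_date <- function(x) {
  m <- regmatches(x, regexec("^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4})$", x))
  iso <- vapply(m, function(p) {
    month <- if (length(p) == 4) match(p[2], month.name) else NA_integer_
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))
  as.Date(iso)
}

parsed <- parse_post_date(dates)
format(parsed)
```

With a real Date column, `out_df[order(parsed), ]` style indexing would sort the rows chronologically.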
1- Extending the scraper to include the title, date, and author for blog entries on all of the tagged pages.

# Start again from the page-1 results collected above
out_df <- data.frame("Date" = str_trim(dates, side = "both"),
                     "Title" = str_trim(titles, side = "both"),
                     "Author" = str_trim(authors, side = "both"),
                     "Page" = 1)


## Find the total number of pages and store it in a variable named "count"

pages <- html_nodes(url_post, xpath='//*[@id="leftcontent"]/div[11]/span[1]')
pages <- sapply(pages, xmlValue)
# Keep the trailing digits of the matched text as the page count
pages <- as.numeric(str_extract(pages, "[0-9]+$"))
x <- data.frame(pages)
count <- x[1,1]
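
The page-count step above keeps the trailing digits of the matched text. Here is a minimal base-R sketch of the same idea, assuming a pager string such as "Page 1 of 17" (hypothetical; the actual text of the matched span is not shown in this report). It also guards the single-page case, because `2:count` would count down (2, 1) when `count` is 1.

```r
# Hypothetical pager text; the real content of the matched span is an assumption
pager <- "Page 1 of 17"

# Base-R equivalent of str_extract(pager, "[0-9]+$"): keep the trailing digits
count <- as.numeric(regmatches(pager, regexpr("[0-9]+$", pager)))

# Fall back to a single page if no trailing number was found
if (length(count) == 0 || is.na(count)) count <- 1

# 2:count would run backwards (2, 1) when count is 1, so build the
# sequence of remaining pages explicitly
remaining_pages <- if (count >= 2) seq(2, count) else integer(0)
length(remaining_pages)
```

Looping over `remaining_pages` instead of `2:count` would then simply do nothing when the search returns only one page.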

# Traverse every remaining page and store the title, date, and author of every blog entry

for (i in 2:count) {
  url <- paste("http://www.r-bloggers.com/search/web%20scraping/page/", i, "/", sep="")
  doc1 <- htmlParse(url)

  url_post <- html_nodes(doc1, xpath='//div[contains(@id,"post")]')

  titles  <- html_nodes(url_post, xpath='h2/a/text()')
  dates   <- html_nodes(url_post, xpath='div[1]/div')
  authors <- html_nodes(url_post, xpath='div[1]/a')

  titles  <- sapply(titles, xmlValue)
  dates   <- sapply(dates, xmlValue)
  authors <- sapply(authors, xmlValue)

  # Append this page's rows, trimmed the same way as the page-1 rows
  out_df <- rbind(out_df, data.frame(
                "Date" = str_trim(dates, side = "both"),
                "Title" = str_trim(titles, side = "both"),
                "Author" = str_trim(authors, side = "both"),
                "Page" = i)
                )


  # Note: the commented-out code below uses the grepl function to find control
  # data (CDATA) in authors and replace it with the word "Unknown"

  #  d <- as.list(authors)
  #  if (grepl("CDATA", d[7])) authors <- NULL
  #  out_df <- rbind(out_df, data.frame(
  #                "Date" =  dates,
  #                "Title"=  titles,
  #                "Author"= if (grepl("CDATA", d[7])) "Unknown" else authors,
  #                "Page"= i)
  #               )
    
}
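
The note about control data in the author field can also be handled element by element, rather than by checking one fixed position. This is a minimal sketch under stated assumptions: the `authors` values are hypothetical stand-ins (the second one mimics a CDATA script fragment attached to an author name), and `authors_clean` and `authors_stripped` are illustrative names, not part of the original code.

```r
# Hypothetical author values; the second one mimics script residue
# wrapped in a CDATA section
authors <- c("Rolf Fredheim",
             "[email protected] /* <![CDATA[ */!function(){}();/* ]]> */",
             "Dominic Nyhuis")

# Option 1: replace any entry containing "CDATA" with the word "Unknown"
authors_clean <- ifelse(grepl("CDATA", authors, fixed = TRUE), "Unknown", authors)

# Option 2: keep the visible text and drop everything from the CDATA
# comment onwards
authors_stripped <- sub("[[:space:]]*/\\*[[:space:]]*<!\\[CDATA\\[.*$", "", authors)

authors_clean
authors_stripped
```

Because `ifelse` and `sub` are vectorized, either option only changes the affected entries and leaves the other authors on the page untouched.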

 kable(out_df)
Date Title Author Page
November 24, 2014 rvest: easy web scraping with R hadleywickham 1
September 17, 2014 Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples Bob Rudis (@hrbrmstr) 1
March 12, 2014 Web Scraping: working with APIs Rolf Fredheim 1
March 5, 2014 Web Scraping: Scaling up Digital Data Collection Rolf Fredheim 1
February 25, 2014 Web Scraping part2: Digging deeper Rolf Fredheim 1
April 5, 2012 A Little Web Scraping Exercise with XML-Package Kay Cichini 1
January 6, 2012 R: Web Scraping R-bloggers Facebook Page Tony Breyal 1
December 27, 2011 Web scraping with Python – the dark side of data axiomOfChoice 1
November 11, 2011 Web Scraping Google+ via XPath Tony Breyal 1
November 10, 2011 Web Scraping Yahoo Search Page via XPath Tony Breyal 1
November 8, 2011 Web Scraping Google Scholar: Part 2 (Complete Success) Tony Breyal 2
November 8, 2011 Web Scraping Google Scholar (Partial Success) Tony Breyal 2
November 7, 2011 Web Scraping Google URLs Tony Breyal 2
November 5, 2011 Next Level Web Scraping Kay Cichini 2
November 1, 2011 Web Scraping Google Scholar & Show Result as Word Cloud Using R Kay Cichini 2
April 15, 2015 Scraping Web Pages With R Tony Hirst 2
November 30, 2014 FOMC Dates – Scraping Data From Web Pages Peter Chan 2
June 27, 2014 Scraping Fantasy Football Projections from the Web Isaac Petersen 2
February 19, 2014 Web-Scraping: the Basics Rolf Fredheim 2
January 4, 2014 Relenium, Selenium for R. A new tool for webscraping. aleixrvr 2
August 23, 2012 R and the web (for beginners), Part III: Scraping MPs’ expenses in detail from the web GivenTheData 3
April 2, 2012 Web-Scraping in R diffuseprior 3
January 15, 2012 Scraping table from any web page with R or CloudStat PR 3
January 12, 2012 Scraping table from html web with CloudStat CloudStat 3
October 22, 2011 A Little Webscraping-Exercise… Kay Cichini 3
August 10, 2011 Scraping web data in R Zach Mayer 3
April 14, 2009 Webscraping using readLines and RCurl bryan 3
April 14, 2009 Webscraping using readLines and RCurl bryan 3
March 15, 2015 Short R tutorial: Scraping Javascript Generated Data with R DataCamp 3
January 21, 2015 FOMC Dates – Full History Web Scrape Peter Chan 3
May 15, 2014 Scraping XML Tables with R jgreenb1 4
April 29, 2014 Scraping SSL Labs Server Test Results With R Bob Rudis (@hrbrmstr) 4
April 14, 2014 Interfacing R with Web technologies David Smith 4
April 4, 2014 Scraping organism metadata for Treebase repositories from GOLD using Python and R What is this? David Springate’s personal blog :: R 4
April 6, 2012 R-Bloggers’ Web-Presence Kay Cichini 4
February 18, 2012 How-to Extract Text From Multiple Websites with R Christopher Gandrud 4
January 27, 2012 Scraping Flora of North America Recology - R 4
January 5, 2012 Scraping R-bloggers with Python – Part 2 The PolStat R Feed 4
January 4, 2012 Scraping R-Bloggers with Python The PolStat R Feed 4
November 9, 2011 R-Function GScholarScraper to Webscrape Google Scholar Search Result Kay Cichini 4
September 8, 2011 Interacting with bioinformatics webservers using R nsaunders 5
February 18, 2011 R Screen Scraping: 105 Counties of Election Data Earl Glynn 5
February 18, 2011 Simple R Screen Scraping Example Earl Glynn 5
August 13, 2010 Scrape Web data using R 5
March 20, 2015 Digital Data Collection course Rolf Fredheim 5
March 6, 2015 Getting Data From An Online Source Robert Norberg 5
February 28, 2015 Playing around with #rstats twitter data [email protected] 5
December 19, 2014 50 years of Christmas at the Windsors Dominic Nyhuis 5
November 27, 2014 Power Outage Impact Choropleths In 5 Steps in R (featuring rvest & RStudio “Projects”) hrbrmstr 5
November 26, 2014 Slightly Advanced rvest with Help from htmltools + XML + pipeR klr 5
November 14, 2014 What size will you be after you lose weight? dan 6
September 28, 2014 A bioinformatics walk-through: Accessing protein-protein interaction interfaces for all known protein structures with PDBe PISA biochemistries 6
August 28, 2014 R User Group Roundup Joseph Rickert 6
April 30, 2014 Automatically Scrape Flight Ticket Data Using R and Phantomjs Huidong Tian 6
March 13, 2014 Text Mining Gun Deaths Data Francis Smart 6
March 13, 2014 Better handling of JSON data in R? Rolf Fredheim 6
March 10, 2014 Upcoming NYC R Programming Classes vivian 6
February 1, 2014 Introduction steadyfish 6
July 29, 2013 Programming instrumental music from scratch Vik Paruchuri 6
July 29, 2013 Programming instrumental music from scratch - r 7
July 29, 2013 Programming instrumental music from scratch Vik Paruchuri 7
May 6, 2013 xkcd: Visualized Myles 7
April 30, 2013 Has R-help gotten meaner over time? And what does Mancur Olson have to say about it? Trey Causey 7
December 15, 2012 Data Science, Data Analysis, R and Python Ron Pearson (aka TheNoodleDoodler) 7
October 27, 2012 .Rhistory distantobserver 7
July 28, 2012 Hangman in R: A learning experience tylerrinker 7
March 20, 2012 Data Analysis Training prasoonsharma 7
January 11, 2012 Making an R Package: Not as hard as you think markbulling 7
January 3, 2012 Plotting Doctor Who Ratings (1963-2011) with R Tony Breyal 7
November 13, 2011 GScholarXScraper: Hacking the GScholarScraper function with XPath Tony Breyal 8
November 10, 2011 Facebook Graph API Explorer with R Tony Breyal 8
September 29, 2010 UCLA Statistics: Analyzing Thesis/Dissertation Lengths Ryan Rosario 8
September 4, 2010 Cricket data analysis prasoonsharma 8
January 22, 2010 What to Expect? Ryan 8
April 18, 2015 Analysing The Rock ‘n’ Roll Madrid Marathon aschinchon 8
April 8, 2015 Monitoring Price Fluctuations of Book Trade-In Values on Amazon Andrew Landgraf 8
March 31, 2015 More Airline Crashes via the Hadleyverse hrbrmstr 8
March 23, 2015 Knitr’s best hidden gem: spin Dean Attali’s R Blog 8
February 26, 2015 Fuzzy String Matching – a survival skill to tackle unstructured information Bigdata Doc 8
February 20, 2015 Who Has the Best Fantasy Football Projections? 2015 Update Isaac Petersen 9
February 4, 2015 Predicting the six nations Mango Solutions 9
January 19, 2015 Building a choropleth map of Italy using mapIT Davide Massidda 9
January 16, 2015 New updates to the rNOMADS package and big changes in the GFS model glossarch 9
December 23, 2014 Explore Kaggle Competition Data with R notesofdabbler 9
December 16, 2014 How to analyze a new dataset (or, analyzing ‘supercar’ data, part 1) Sharpsight Admin 9
December 14, 2014 FOMC Dates – Price Data Exploration Peter Chan 9
November 17, 2014 A Letter of Recommendation for Nan Xiao Yihui Xie 9
November 1, 2014 Leveraging R for Job Openings for Economists Thiemo Fetzer 9
October 30, 2014 Wrangling F1 Data With R – F1DataJunkie Book Tony Hirst 9
October 23, 2014 How to Download and Run R Scripts from this Site Isaac Petersen 10
September 26, 2014 FIFA 15 Analysis with R The Clerk 10
September 18, 2014 “Do You Want to Steal a Snowman?” – A Look (with R) At TorrentFreak’s Top 10 PiRated Movies List #TLAPD Bob Rudis (@hrbrmstr) 10
August 12, 2014 Visit of Di Cook Rob J Hyndman 10
July 6, 2014 Identify Fantasy Football Sleepers with this Shiny App Isaac Petersen 10
June 30, 2014 Time to Accept It: publishing in the Journal of Statistical Software brobar 10
June 5, 2014 2014 World Cup Squads gjabel 10
June 2, 2014 Basketball Data Part II – Length of Career by Position jgreenb1 10
May 26, 2014 Using sentiment analysis to predict ratings of popular tv series tlfvincent 10
April 28, 2014 On the trade history and dynamics of NBA teams tlfvincent 10
April 11, 2014 Rblogger Posting Patterns Analyzed with R Mark T Patterson 11
April 10, 2014 BARUG talks highlight R’s diverse applications Joseph Rickert 11
April 4, 2014 Mapping academic collaborations in Evolutionary Biology What is this? David Springate’s personal blog :: R 11
March 29, 2014 President Approval Ratings from Roosevelt to Obama tlfvincent 11
March 27, 2014 Evolution of Code Educate-R - R 11
February 13, 2014 Terms Tal Galili 11
February 11, 2014 Live Google Spreadsheet For Keeping Track Of Sochi Medals hrbrmstr 11
January 22, 2014 Using One Programming Language In the Context of Another – Python and R Tony Hirst 11
January 20, 2014 Statistics meets rhetoric: A text analysis of “I Have a Dream” in R Max Ghenis 11
January 20, 2014 Statistics meets rhetoric: A text analysis of “I Have a Dream” in R Max Ghenis 11
January 20, 2014 Second NYC R classes(announcement and teaching experience) Tal Galili 12
January 13, 2014 Calling Python from R with rPython bryan 12
January 13, 2014 Why R is Better Than Excel for Fantasy Football (and most other) Data Analysis Isaac Petersen 12
November 7, 2013 College Basketball: Presence in the NBA over Time Mark T Patterson 12
September 21, 2013 Creating your personal, portable R code library with GitHub bryan 12
August 31, 2013 MLB Rankings Using the Bradley-Terry Model John Ramey 12
July 4, 2013 ggplot2 Chloropleth of Supreme Court Decisions: A Tutorial tylerrinker 12
July 2, 2013 Which airline should you be loyal to? dan 12
June 24, 2013 Opel Corsa Diesel Usage Wingfeet 12
May 26, 2013 Logging Data in R Loops: Applied to Twitter. Alistair Leak 12
May 13, 2013 Shiny App for CRAN packages pssguy 13
May 12, 2013 The Guerilla Guide to R Nikhil Gopal 13
April 19, 2013 Presentations of the third Milano R net meeting Milano R net 13
April 10, 2013 Milano (Italy). April 18, 2013. Third Milano R net meeting: agenda Milano R net 13
March 25, 2013 April 18, 2013Third Milano R net meeting: agenda Milano R net 13
February 4, 2013 Generating Labels for Supervised Text Classification using CAT and R Solomon 13
January 29, 2013 Hilary: the most poisoned baby name in US history hilaryparker 13
January 25, 2013 R and foreign characters Rolf Fredheim 13
January 23, 2013 SPARQL with R in less than 5 minutes bryan 13
January 1, 2013 Multiple Classification and Authorship of the Hebrew Bible inkhorn82 13
December 22, 2012 Chocolate and nobel prize – a true story? Max Gordon 14
October 28, 2012 Animated map of 2012 US election campaigning, with R and ffmpeg civilstat 14
October 3, 2012 Tips on accessing data from various sources with R David Smith 14
September 25, 2012 R Helper Functions bryan 14
September 16, 2012 The R-Podcast Episode 10: Adventures in Data Munging Part 2 Eric 14
June 20, 2012 UseR 2012 highlights David Smith 14
May 17, 2012 Visualizing the CRAN: Graphing Package Dependencies wrathematics 14
April 22, 2012 118 years of US State Weather Data drunksandlampposts 14
April 5, 2012 The 50 most used R packages flodel 14
March 23, 2012 RStudio Development Environment bryan 14
January 13, 2012 R: A Quick Scrape of Top Grossing Films from boxofficemojo.com Tony Breyal 15
January 10, 2012 Installing quantstrat from R-forge and source bryan 15
January 6, 2012 Analyzing R-bloggers The PolStat R Feed 15
January 4, 2012 Mapping the Iowa GOP 2012 Caucus Results jjh 15
December 20, 2011 Outliers in the European Parliament The PolStat Feed 15
December 7, 2011 Subscriptions Feature Added bryan 15
November 13, 2011 Google Scholar (still) sucks bbolker 15
October 31, 2011 Power Tools for Aspiring Data Journalists: R Tony Hirst 15
August 9, 2011 Forecasting recessions Zach Mayer 15
August 4, 2011 CHCN: Canadian Historical Climate Network Steven Mosher 15
July 30, 2011 hacking .gov shortened links Harlan 16
June 29, 2011 roll calls, ideal points, 112th Congress jackman 16
June 9, 2011 Automating R Scripts on Amazon EC2 Travis Nelson 16
May 14, 2011 Friday fun projects nsaunders 16
April 25, 2011 Further Adventures in Visualisation with ggplot2 hayward 16
April 15, 2011 Friday Function: setInternet2 richierocks 16
April 2, 2011 Find NHL Players with 30 Goals and 100 PIM using R btibert3 16
March 21, 2011 NBA Analysis: Coming Soon! Ryan 16
February 6, 2011 Clustering NHL Skaters 16
January 16, 2011 Dial-a-statistic! Featuring R and Estonia Ethan Brown 16
October 31, 2010 How to buy a used car with R (part 1) Dan Knoepfle’s Blog 17
October 31, 2010 How to buy a used car with R (part 1) Dan Knoepfle’s Blog 17
August 31, 2010 Using XML package vs. BeautifulSoup Ryan 17
August 5, 2010 Are MLB Games Getting Longer? Ryan 17
June 29, 2010 Analyze Gold Demand and Investments using R C 17
December 28, 2009 tooltips in R graphics; nytR package jackman 17
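
The combined output contains a few repeated rows (for example, the April 14, 2009 entry on page 3 appears twice). Below is a minimal sketch, using a handful of rows copied from the output above, of how such repeats could be dropped and posts counted per author. `unique_df` and `posts_per_author` are hypothetical names that are not part of the original code, and the small data frame here only stands in for the full scraped result.

```r
# A few rows copied from the output above, including one repeated entry
out_df <- data.frame(
  Date   = c("April 14, 2009", "April 14, 2009",
             "August 10, 2011", "August 9, 2011"),
  Title  = c("Webscraping using readLines and RCurl",
             "Webscraping using readLines and RCurl",
             "Scraping web data in R",
             "Forecasting recessions"),
  Author = c("bryan", "bryan", "Zach Mayer", "Zach Mayer"),
  Page   = c(3, 3, 3, 15),
  stringsAsFactors = FALSE
)

# Drop rows whose date, title, and author repeat an earlier row
unique_df <- out_df[!duplicated(out_df[, c("Date", "Title", "Author")]), ]

# Count the remaining posts per author, most frequent first
posts_per_author <- sort(table(unique_df$Author), decreasing = TRUE)
posts_per_author
```

The Page column is left out of the duplicate check on purpose, so an entry that shows up on two different result pages would also be counted once.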