IS 607 Project 4

In this exercise, we will scrape data about blog posts from R-bloggers, a website about R.

library(rvest)
library(magrittr)
library(stringr)

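# Download and parse the search results page (read_html() was called html() in rvest versions before 0.3.0)
rBlog <- read_html("http://www.r-bloggers.com/search/web%20scraping/")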

title <- rBlog %>%
  html_nodes("#leftcontent h2") %>%
  html_text()

date <- rBlog %>%
  html_nodes(".date") %>%
  html_text()

meta <- rBlog %>%
  html_nodes(".meta") %>%
  html_text()

meta

##  [1] "November 24, 2014By hadleywickham"         
##  [2] "September 17, 2014By Bob Rudis (@hrbrmstr)"
##  [3] "March 12, 2014By Rolf Fredheim"            
##  [4] "March 5, 2014By Rolf Fredheim"             
##  [5] "February 25, 2014By Rolf Fredheim"         
##  [6] "April 5, 2012By Kay Cichini"               
##  [7] "January 6, 2012By Tony Breyal"             
##  [8] "December 27, 2011By axiomOfChoice"         
##  [9] "November 11, 2011By Tony Breyal"           
## [10] "November 10, 2011By Tony Breyal"

The .meta selector we used to pull author data also returns the date. Here we use the stringr package to split off the date and keep only the author.

meta <- str_split(string=meta,pattern="By ") # split each string at "By ", which separates the date from the author
author <- character(length(meta)) # initialize the vector, one entry per post
for (i in seq_along(meta)){
  author[i] <- meta[[i]][2] # the second piece of each split is the author
}
meta

## [[1]]
## [1] "November 24, 2014" "hadleywickham"    
## 
## [[2]]
## [1] "September 17, 2014"    "Bob Rudis (@hrbrmstr)"
## 
## [[3]]
## [1] "March 12, 2014" "Rolf Fredheim" 
## 
## [[4]]
## [1] "March 5, 2014" "Rolf Fredheim"
## 
## [[5]]
## [1] "February 25, 2014" "Rolf Fredheim"    
## 
## [[6]]
## [1] "April 5, 2012" "Kay Cichini"  
## 
## [[7]]
## [1] "January 6, 2012" "Tony Breyal"    
## 
## [[8]]
## [1] "December 27, 2011" "axiomOfChoice"    
## 
## [[9]]
## [1] "November 11, 2011" "Tony Breyal"      
## 
## [[10]]
## [1] "November 10, 2011" "Tony Breyal"
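The same split can be done without an explicit loop. As a minimal sketch, str_split_fixed() from stringr returns a character matrix with one row per input string, so the date and author columns can be pulled out directly. The sample strings are copied from the meta output shown earlier; the variable names sample_meta and parts are illustrative only.

```r
library(stringr)

# Three of the strings returned by the ".meta" selector earlier in this report
sample_meta <- c("November 24, 2014By hadleywickham",
                 "September 17, 2014By Bob Rudis (@hrbrmstr)",
                 "March 12, 2014By Rolf Fredheim")

# Split each string into exactly two pieces at the first "By "
parts <- str_split_fixed(sample_meta, "By ", 2)

post_date <- parts[, 1]  # text before "By "
author    <- parts[, 2]  # text after "By "
author
```

Limiting the split to two pieces keeps an author name intact even if it happens to contain "By " itself.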

Now we will store the title, author, and date of each blog post in a data frame:

blogPosts <- data.frame(title,author,date)
blogPosts

##                                                                                  title
## 1                                                      rvest: easy web scraping with R
## 2  Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples
## 3                                                      Web Scraping: working with APIs
## 4                                     Web Scraping: Scaling up Digital Data Collection
## 5                                                   Web Scraping part2: Digging deeper
## 6                                      A Little Web Scraping Exercise with XML-Package
## 7                                             R: Web Scraping R-bloggers Facebook Page
## 8                                     Web scraping with Python – the dark side of data
## 9                                                       Web Scraping Google+ via XPath
## 10                                            Web Scraping Yahoo Search Page via XPath
##                   author               date
## 1          hadleywickham  November 24, 2014
## 2  Bob Rudis (@hrbrmstr) September 17, 2014
## 3          Rolf Fredheim     March 12, 2014
## 4          Rolf Fredheim      March 5, 2014
## 5          Rolf Fredheim  February 25, 2014
## 6            Kay Cichini      April 5, 2012
## 7            Tony Breyal    January 6, 2012
## 8          axiomOfChoice  December 27, 2011
## 9            Tony Breyal  November 11, 2011
## 10           Tony Breyal  November 10, 2011
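The steps so far (select the titles, select the .meta text, split it, and assemble a data frame) can be wrapped in a single function so they can be reused on another page. The following is only a sketch: extract_posts is a hypothetical helper that is not part of the original script, and the inline HTML is a made-up stand-in that assumes the date sits inside the .meta element, as the scraped text suggests.

```r
library(rvest)
library(stringr)

# Hypothetical helper: takes a parsed page and returns one row per post
extract_posts <- function(page) {
  title <- page %>% html_nodes("#leftcontent h2") %>% html_text()
  meta  <- page %>% html_nodes(".meta") %>% html_text()
  parts <- str_split_fixed(meta, "By ", 2)  # column 1: date, column 2: author
  data.frame(title = title, author = parts[, 2], date = parts[, 1],
             stringsAsFactors = FALSE)
}

# Made-up page fragment that mimics the assumed structure of the search results
sample_page <- read_html('<div id="leftcontent">
  <h2>rvest: easy web scraping with R</h2>
  <div class="meta"><div class="date">November 24, 2014</div>By hadleywickham</div>
  <h2>Web Scraping Google+ via XPath</h2>
  <div class="meta"><div class="date">November 11, 2011</div>By Tony Breyal</div>
</div>')

posts <- extract_posts(sample_page)
posts
```

On the live site the function would be called as extract_posts(read_html(url)), assuming the page layout still matches the selectors used in this report.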

We can use the same code to scrape the search results for posts about Twitter. The only change is that we store the new web content in “twitterBlog”. The one issue we face is that the first author name is an email address, which appears to be protected from web-scraping operations:

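# Download and parse the Twitter search results (read_html() was called html() in rvest versions before 0.3.0)
twitterBlog <- read_html("http://www.r-bloggers.com/search/twitter/")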

title <- twitterBlog %>%
  html_nodes("#leftcontent h2") %>%
  html_text()

date <- twitterBlog %>%
  html_nodes(".date") %>%
  html_text()

meta <- twitterBlog %>%
  html_nodes(".meta") %>%
  html_text()
  
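meta <- str_split(string=meta,pattern="By ") # split each string at "By " into date and author
author <- character(length(meta)) # one entry per post
for (i in seq_along(meta)){
  author[i] <- meta[[i]][2] # the second piece of each split is the author
}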
blogPosts2 <- data.frame(title,author,date)
blogPosts2

##                                                                        title
## 1                                   Playing around with #rstats twitter data
## 2            Programming a Twitter bot – and the rescue from procrastination
## 3                              Twitter’s new R package for anomaly detection
## 4                 Twitter’s R package for detecting breakouts in time series
## 5                                                   Twitter Pop-up Analytics
## 6                     Twitter’s REST API v1.1 with R (for Linux and Windows)
## 7  How to reveal anyone’s interests on Twitter using social network analysis
## 8                 What your twitter friends say about you and your interests
## 9                                          R Job Notifications Using Twitter
## 10                                 Talking to Twitter’s REST API v1.1 with R
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            author
## 1  [email protected]\n/* <![CDATA[ */!function(){try{var t="currentScript"in document?document.currentScript:function(){for(var t=document.getElementsByTagName("script"),e=t.length;e--;)if(t[e].getAttribute("cf-hash"))return t[e]}();if(t&&t.previousSibling){var e,r,n,i,c=t.previousSibling,a=c.getAttribute("data-cfemail");if(a){for(e="",r=parseInt(a.substr(0,2),16),n=2;a.length-n;n+=2)i=parseInt(a.substr(n,2),16)^r,e+=String.fromCharCode(i);e=document.createTextNode(e),c.parentNode.replaceChild(e,c)}}}catch(u){}}();/* ]]> */
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Simon Munzert
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     David Smith
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     David Smith
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Myles Harrison
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Raffael Vogler
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Brian Rowe
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Brian Rowe
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Deciphering life: One bit at a time :: R
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Raffael Vogler
##                  date
## 1   February 28, 2015
## 2    January 19, 2015
## 3     January 7, 2015
## 4   November 24, 2014
## 5    October 20, 2014
## 6  September 22, 2014
## 7  September 22, 2014
## 8     August 10, 2014
## 9       June 30, 2014
## 10      June 10, 2014
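The script that ends up in the first author field hints at how the address is hidden: it reads a hexadecimal string from a data-cfemail attribute, takes the first byte as a key, and XORs every remaining byte with that key to recover the characters. A minimal sketch of the same decoding in R is shown here. decode_cfemail is a hypothetical helper, and the input string is a made-up value rather than anything taken from the site.

```r
# Hypothetical helper: decode a hex string of the kind the script above
# reads from the data-cfemail attribute
decode_cfemail <- function(hex) {
  starts <- seq(1, nchar(hex), by = 2)
  # Convert each pair of hex digits to an integer byte
  bytes <- strtoi(substring(hex, starts, starts + 1), base = 16L)
  key <- bytes[1]  # the first byte is the XOR key
  # XOR the remaining bytes with the key and turn them back into text
  rawToChar(as.raw(bitwXor(bytes[-1], key)))
}

# Made-up example: key 0x2a followed by five encoded bytes
decoded <- decode_cfemail("2a4b6a480449")
decoded  # "a@b.c"
```

On the real page, the encoded string could presumably be read with html_attr("data-cfemail") on the element whose text shows up as [email protected], but that selector has not been verified here. A simpler option is to replace the affected author entry with NA.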

David Stern

April 26, 2015