Web Scraping with R

This vignette explores web scraping in R by extracting the news headlines and their short descriptions from the news.com.au portal.

Packages:

The packages required for this exercise are rvest and dplyr.

Load required packages

library(rvest)
## Loading required package: xml2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reading the data from a website

SelectorGadget, an open-source browser add-in, is used to identify which parts of the website to scrape. In this vignette I used it in Google Chrome. Clicking on a part of the page makes the gadget report the CSS selector that matches that part, and that selector can then be passed to rvest. The tool is simple to use and is well documented online.

https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html provides more detail on this topic.
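To illustrate how a CSS selector picks out parts of a page, here is a minimal sketch that runs rvest against a small inline HTML fragment instead of a live site. The markup, the story paths, and the names sample_page, sample_headlines and sample_descriptions are invented for illustration; only the class names echo the selectors used later in this vignette.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a news page (invented markup)
sample_page <- read_html('
<html><body>
  <div class="story-block">
    <h4 class="heading"><a href="/story/1">First headline</a></h4>
    <p class="standfirst">First description.</p>
  </div>
  <div class="story-block">
    <h4 class="heading"><a href="/story/2">Second headline</a></h4>
    <p class="standfirst">Second description.</p>
  </div>
</body></html>')

# ".story-block > .heading a" matches every link inside an element of class
# "heading" that is a direct child of an element of class "story-block"
sample_headlines <- html_text(html_nodes(sample_page, ".story-block > .heading a"))

# ".standfirst" matches every element of class "standfirst"
sample_descriptions <- html_text(html_nodes(sample_page, ".standfirst"))

sample_headlines
sample_descriptions
```

Each selector returns one node per match, in document order, so the two vectors line up story by story as long as every story has both a headline and a description.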

The following code reads the content from the news.com.au portal.

# Store the website address in the variable web_address
web_address <- 'http://www.news.com.au/'

# Read the HTML code of the webpage into the variable webpage_code
webpage_code <- read_html(web_address)

# Read the headlines of the news.com.au portal using the CSS selector identified with SelectorGadget
news_headlines <- html_nodes(webpage_code, '.widget_newscorpau_capi_sync_collection:nth-child(1) .story-block > .heading a')

# Convert the news headlines to text and view the first six entries
news_headlines_text <- html_text(news_headlines)
head(news_headlines_text)
## [1] "MOTHER’S SECRET: How a daggy item made woman $250K"
## [2] "Has Trump met his match?"
## [3] "Nanny ‘there to keep tabs on me’"
## [4] "AFL star slammed for false act"
## [5] "Russia warns: One step from war"
## [6] "Trump’s options include killing Kim"
# Read the descriptions from the news.com.au portal using the CSS selector identified with SelectorGadget and view the first six entries
news_description <- html_text(html_nodes(webpage_code, '.widget_newscorpau_capi_sync_collection:nth-child(1) .standfirst'))
head(news_description)
## [1] "\n\t\t\t\tQUEENSLAND mum Simone Taylor decided to give an ordinary bathroom item a makeover. Within two years she had made $250,000.\n\t"
## [2] "\n\t\t\t\tTRUMP reckons he’s the master deal maker, but he may have underestimated China’s ‘princeling’ president and his ‘sweet seduction’.\n\t"
## [3] "\n\t\t\t\tMEL B’s court documents details tale of her ‘prison’ of a marriage and claims the nanny who slept with her husband also spied on her. \n\t"
## [4] "\n\t\t\t\tAFTER being praised by TV commentators, AFL superstar Nat Fyfe has been criticised for his response to a rival’s sickening injury. \n\t"
## [5] "\n\t\t\t\tRUSSIA intensified its response to the US air strikes on Syria, sending its most advanced warship to confront American vessels.\n\t"
## [6] "\n\t\t\t\tTHE National Security Council has told Donald Trump that his options on North Korea include assassinating Kim Jong-un, report says.\n\t"
# Remove the leading and trailing newline and tab characters from the descriptions
news_description <- gsub("\n\t\t\t\t", "", news_description)
news_description <- gsub("\n\t", "", news_description)

# Let's have another look at the description data
head(news_description)
## [1] "QUEENSLAND mum Simone Taylor decided to give an ordinary bathroom item a makeover. Within two years she had made $250,000."
## [2] "TRUMP reckons he’s the master deal maker, but he may have underestimated China’s ‘princeling’ president and his ‘sweet seduction’."
## [3] "MEL B’s court documents details tale of her ‘prison’ of a marriage and claims the nanny who slept with her husband also spied on her. "
## [4] "AFTER being praised by TV commentators, AFL superstar Nat Fyfe has been criticised for his response to a rival’s sickening injury. "
## [5] "RUSSIA intensified its response to the US air strikes on Syria, sending its most advanced warship to confront American vessels."
## [6] "THE National Security Council has told Donald Trump that his options on North Korea include assassinating Kim Jong-un, report says."
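As an alternative to the two gsub() calls, base R's trimws() strips leading and trailing whitespace in a single step. The sketch below uses invented strings shaped like the scraped descriptions; raw_description and clean_description are hypothetical names, not part of the original code.

```r
# Hypothetical raw values shaped like the scraped descriptions
raw_description <- c(
  "\n\t\t\t\tFirst description.\n\t",
  "\n\t\t\t\tSecond description. \n\t"
)

# trimws() removes leading and trailing spaces, tabs and newlines from both ends
clean_description <- trimws(raw_description)
clean_description
```

Unlike the gsub() approach, this also drops the trailing space that some descriptions carry before the final newline. The html_text() function in rvest accepts a trim = TRUE argument that should have a similar effect at extraction time.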
# Create a data frame with the headlines and the descriptions of the news
Need_to_know <- data.frame(news_headlines_text, news_description)

# View the structure of the data frame
str(Need_to_know)
## 'data.frame':    11 obs. of  2 variables:
##  $ news_headlines_text: Factor w/ 11 levels "AFL star slammed for false act",..: 6 3 7 1 8 10 4 11 5 9 ...
##  $ news_description   : Factor w/ 11 levels "AFTER being praised by TV commentators, AFL superstar Nat Fyfe has been criticised for his response to a rival’s sickening inju"| __truncated__,..: 6 10 4 1 8 9 2 5 3 7 ...
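The Factor columns in the output above reflect the behaviour of data.frame() in R versions before 4.0.0, where character vectors were converted to factors by default. For free text such as headlines, character columns are usually more convenient. The following is a minimal sketch, with invented stand-in vectors and a hypothetical data frame name news, of how to request character columns explicitly and check the vector lengths first.

```r
# Hypothetical stand-ins for the scraped headline and description vectors
sample_headlines <- c("First headline", "Second headline")
sample_descriptions <- c("First description.", "Second description.")

# If a selector matches a different number of nodes for each vector,
# data.frame() would either recycle values or fail, so check up front
stopifnot(length(sample_headlines) == length(sample_descriptions))

# stringsAsFactors = FALSE keeps the text as character columns
# (this is already the default from R 4.0.0 onwards)
news <- data.frame(
  headline = sample_headlines,
  description = sample_descriptions,
  stringsAsFactors = FALSE
)

str(news)
```

Naming the columns in the data.frame() call is a design choice made here for readability; the original code lets R derive the column names from the variable names.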