For Activity 10, I’ve chosen to create a web scraper for BBC News’s most popularly read articles and to use the Twitter API to map the geolocation of tweets in California.
BBC News has a “Most Read” section on its website and mobile app that is updated regularly to reflect which stories are trending. I used the rvest package to develop the web scraper, which gathers the rank, article title, and link for each story.
First, load the rvest package and store the target and base URLs in variables that we will use later.
#import rvest
library(rvest)
## Loading required package: xml2
#url to top 10 most read page
bbc_most_read <- "http://www.bbc.com/news/popular/read"
#base url for bbc website
#we'll need this later to complete our links
bbc_base_url <- "http://www.bbc.com"
Next, we’re ready to scrape the website for the most popularly read news articles.
#scrape the ranking and article titles as a single list
most_read <- bbc_most_read %>%
  read_html() %>%
  html_nodes(".most-popular-page-list-item span") %>%
  html_text()
#scrape the page links corresponding to the article titles
links <- bbc_most_read %>%
  read_html() %>%
  html_nodes(".most-popular-page-list-item a") %>%
  html_attr("href")
Now that we have the rankings, article titles, and relative page links, let’s put the data in a more useful and presentable format.
#complete the links by pasting the base url to the
#page url extension returned above
links <- paste0(bbc_base_url, links)
#extract the odd numbered elements in the list
#which are all number rankings
rank <- most_read[seq(1, 20, 2)]
#extract the even numbered elements in the list
#which are all article titles
title <- most_read[seq(2, 20, 2)]
#summarize our results in a data.frame
data.frame(rank, title, links)
## rank
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## title
## 1 Microsoft accused of Windows 10 upgrade 'nasty trick'
## 2 Yellowstone National Park in 1871 and today
## 3 Six things about the $6 Bourdain-Obama meal
## 4 US election: Why has Trump caught Clinton in the polls?
## 5 Syria conflict: IS 'destroyed helicopters' at Russian base
## 6 Greece bailout: Eurozone agrees 'breakthrough' debt deal
## 7 Why is India's Taj Mahal turning green?
## 8 Mullah Mansour: The trail of clues after Taliban leader's death
## 9 Australia's deputy PM 'pulling strings' in Depp's head like 'Hannibal Lecter'
## 10 US seeks death penalty over Charleston church shooting
## links
## 1 http://www.bbc.com/news/technology-36367221
## 2 http://www.bbc.com/news/election-us-2016-36372929
## 3 http://www.bbc.com/news/world-asia-india-36366733
## 4 http://www.bbc.com/news/world-us-canada-36362634
## 5 http://www.bbc.com/news/world-middle-east-36368346
## 6 http://www.bbc.com/news/world-asia-36369236
## 7 http://www.bbc.com/news/world-asia-36365988
## 8 http://www.bbc.com/news/world-middle-east-36371226
## 9 http://www.bbc.com/news/world-us-canada-36375672
## 10 http://www.bbc.com/news/world-europe-36375973
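As an aside, the steps above could be wrapped into a small helper function so the scrape can be rerun in a single call. This is only a sketch of the same code (the function name scrape_bbc_most_read is my own, and it assumes the page keeps its current structure):
#a minimal sketch: wrap the scraping steps above into one reusable function
scrape_bbc_most_read <- function(url = "http://www.bbc.com/news/popular/read",
                                 base = "http://www.bbc.com") {
  page      <- read_html(url)
  most_read <- html_text(html_nodes(page, ".most-popular-page-list-item span"))
  links     <- html_attr(html_nodes(page, ".most-popular-page-list-item a"), "href")
  data.frame(rank  = most_read[seq(1, length(most_read), 2)],
             title = most_read[seq(2, length(most_read), 2)],
             links = paste0(base, links),
             stringsAsFactors = FALSE)
}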
I chose the Twitter API for the second part of the assignment after learning that many of the functions in the Rlinkedin package (an R wrapper for the LinkedIn API) are no longer functional due to changes LinkedIn made to its API (see GitHub for more info: https://github.com/mpiccirilli/Rlinkedin). You can still access your basic profile information, but that’s about all I was able to do.
Moving on, I came across an interesting post published fairly recently (Nov 2015) and used its code to reproduce the results. The original post can be found at http://politicaldatascience.blogspot.com/2015/12/rtutorial-using-r-to-harvest-twitter.html.
The post is very thorough in explaining each step, so I am only including the code and a few comments below. For a full walkthrough, I recommend visiting the website.
#load all of the packages
x <- c('ggplot2','grid','stringr','ROAuth','streamR','dplyr')
suppressMessages(lapply(x, require, character.only = TRUE))
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
#needed for API call
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
An important note here: at this point you should assign your Consumer Key and Consumer Secret to the variables “consumerKey” and “consumerSecret”, respectively. I’ve done this in the background since these keys are private.
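One way to keep the keys out of the script entirely, shown here only as a sketch, is to read them from environment variables (the names TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET are placeholders of my own choosing):
#hypothetical: pull the credentials from environment variables
#so they never appear in the script itself
consumerKey    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumerSecret <- Sys.getenv("TWITTER_CONSUMER_SECRET")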
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
So far, we have everything we need to make a connection to Twitter’s API. Next, we’ll actually make the connection and collect tweets. Specifically, we’ll collect tweets from California and its immediately surrounding states and countries for a 30-second snapshot in real time.
file = "tweets.json"
track = NULL
follow = NULL
loc = c(-125, 30, -114, 42)
lang = NULL
minutes = 0.5
time = 60*minutes
tweets = NULL
filterStream(file.name = file,
track = track,
follow = follow,
locations = loc,
language = lang,
timeout = time,
tweets = tweets,
oauth = my_oauth,
verbose = TRUE)
tweets.df <- parseTweets(file)
Now that we’ve collected our tweets, you could dive straight into the data. To take this a step further, though, let’s generate a map of California and visualize the geographic distribution of the tweets we’ve just collected.
#snapshot of the data
head(select(tweets.df, c(text, retweeted, statuses_count, followers_count, full_name)), 3)
## text
## 1 Do my hair somewhere in sunset boulevard and Wilaman \xed\xa0\xbd\xed\xb8\x98\xed\xa0\xbd\xed\xb8\x98
## 2 Nothing beats staring at something beautiful ,with headphones up real loud, listening to your favorite song..
## 3 THANK YOU @olivetheorange 4 being THE best road trip buddy ever! You're my kind of crazy & I'm so grateful for YOU! https://t.co/4CznkxUAmz
## retweeted statuses_count followers_count full_name
## 1 FALSE 3158 195 Los Angeles, CA
## 2 FALSE 31236 1454 Banning, CA
## 3 FALSE 10860 1035 San Francisco, CA
#extract the first hashtag (if any) from each tweet
tweets.df$hashtags <- str_extract(tweets.df$text, "#[:alnum:]+")
#convert to a factor so summary() tabulates the counts
tweets.df$hashtags <- as.factor(tweets.df$hashtags)
summary(tweets.df$hashtags)
## #Aye #Blessings #bobatea
## 1 1 1
## #CFAOne #DBWV #doyertips
## 1 1 1
## #Duarte #exposed #Finally
## 1 1 1
## #glutenfree #Hospitality #Marketing
## 1 2 1
## #ohhappyday #openmic #Reno
## 1 1 1
## #Repost #Risen #separationanxiety
## 1 1 1
## #SiliconBeach #suprised #Trump2016
## 1 1 1
## #WV #XQSuperSchoolBus #yalargate
## 1 1 1
## NA's
## 132
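Note that str_extract only keeps the first hashtag in each tweet. If you wanted every hashtag, a quick sketch using stringr’s str_extract_all on the same tweets.df would be:
#collect every hashtag from every tweet, then tabulate the counts
all_tags <- str_extract_all(tweets.df$text, "#[:alnum:]+")
sort(table(unlist(all_tags)), decreasing = TRUE)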
#tabulate tweets by the place name Twitter attached to them
tweets.df$full_name <- as.factor(tweets.df$full_name)
summary(tweets.df$full_name)
## Alameda, CA Arizona, USA
## 1 2
## Bakersfield, CA Banning, CA
## 4 1
## Barstow, CA Berkeley, CA
## 1 1
## Bloomington, CA Brentwood, CA
## 1 1
## Burbank, CA Calexico, CA
## 2 1
## California, USA Camp Pendleton South, CA
## 9 1
## Campbell, CA Canada
## 1 1
## Carson, CA Chula Vista, CA
## 1 1
## Citrus Heights, CA Claremont, CA
## 1 1
## Coachella, CA Corona, CA
## 1 1
## Costa Mesa, CA Daly City, CA
## 1 1
## Diamond Bar, CA Downey, CA
## 1 1
## Duarte, CA East la Mirada, CA
## 1 1
## El Cerrito, CA El Monte, CA
## 1 1
## Elk Grove, CA Enterprise, NV
## 1 1
## Fontana, CA Fresno, CA
## 1 6
## Fullerton, CA Hayward, CA
## 2 1
## Henderson, NV Hercules, CA
## 2 1
## Huntington Beach, CA Irvine, CA
## 1 2
## Jurupa Valley, CA Kennedy, CA
## 1 1
## Lake Havasu City, AZ Lakewood, CA
## 1 2
## Las Vegas, NV Lathrop, CA
## 7 1
## Long Beach, CA Los Angeles, CA
## 2 18
## Madera, CA Marina, CA
## 1 1
## Menlo Park, CA Mexicali, Baja California
## 1 1
## Mexico México
## 1 2
## Milpitas, CA Modesto, CA
## 1 1
## Montebello, CA Monterey, CA
## 1 1
## Napa, CA Nevada, USA
## 1 2
## North Las Vegas, NV Oakland, CA
## 1 1
## Ontario, CA Oregon, USA
## 1 1
## Oxnard, CA Paradise, NV
## 1 1
## Perris, CA Poway, CA
## 1 1
## Rancho Cordova, CA Rancho San Diego, CA
## 1 1
## Reno, NV Rialto, CA
## 1 1
## Richmond, CA Ridgecrest, CA
## 1 1
## Riverside, CA Sacramento, CA
## 2 1
## Salinas, CA San Diego, CA
## 1 4
## San Francisco, CA San Jose, CA
## 5 3
## Santa Ana, CA Santa Barbara, CA
## 2 1
## Santa Clara, CA Santa Cruz, CA
## 1 1
## Santa Monica, CA Saratoga, CA
## 1 1
## Suisun City, CA Sunnyvale, CA
## 1 2
## Sunrise Manor, NV Temecula, CA
## 1 1
## Ventura, CA Vineyard, CA
## 1 1
## Vista, CA Walnut Park, CA
## 1 1
## Walnut, CA West Covina, CA
## 1 1
## West Hollywood, CA Yuma, AZ
## 1 2
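That summary is long, so as a quick aside (not part of the original post), you could sort the counts and keep only the ten most common locations:
#top ten tweet locations by count
head(sort(table(tweets.df$full_name), decreasing = TRUE), 10)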
#start from the centroid of each tweet's place bounding box
points <- data.frame(x = as.numeric(tweets.df$place_lon),
                     y = as.numeric(tweets.df$place_lat))
points$hashtags <- tweets.df$hashtags
#when a tweet carries exact coordinates, use those instead
points[!is.na(tweets.df$lon), "x"] <- as.numeric(tweets.df$lon)[!is.na(tweets.df$lon)]
points[!is.na(tweets.df$lat), "y"] <- as.numeric(tweets.df$lat)[!is.na(tweets.df$lat)]
#drop any points that fall outside our bounding box
points <- points[(points$y > 25 & points$y < 42), ]
points <- points[points$x < -114, ]
map.data <- map_data("state", region = c("california"))
##
## # maps v3.1: updated 'world': all lakes moved to separate new #
## # 'lakes' database. Type '?world' or 'news(package="maps")'. #
mapPlot <- ggplot(map.data) + # ggplot is the basic plotting function used.
  # The following lines define the map areas.
  geom_map(aes(map_id = region),
           map = map.data,
           fill = "white",
           color = "grey20",
           size = 0.25) +
  expand_limits(x = map.data$long,
                y = map.data$lat) +
  # The following parameters could be altered to insert axes, title, etc.
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        plot.background = element_blank(),
        plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) +
  # The next line plots a point for each tweet. Size, transparency (alpha)
  # and color could be altered.
  geom_point(data = points,
             aes(x = x, y = y),
             size = 2,
             alpha = 1/20,
             color = "red")
mapPlot
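If you want to keep a copy of the map, ggplot2’s ggsave can write it out (the filename and dimensions here are just illustrative choices):
#save the map to disk; the filename and size are arbitrary choices
ggsave("california_tweets.png", plot = mapPlot, width = 6, height = 6, dpi = 150)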