CIS 4730
Unstructured Data Management

Lab: Web scraping

Rongen Zhang

Getting data from the Web

There are many ways to obtain data from the Internet; let’s consider four categories:

Click-and-download: the data is available on the internet as a “flat” file, such as .csv or .xls
Install-and-play: an API for which someone has written a handy R package
API-query: a published API that has no R wrapper, so you query it directly
Scraping: the data is implicit in an HTML website

Click-and-Download

In the simplest case, the data you need is already on the internet in a tabular format. The most direct strategy is:

Use read.csv or read.table to read the data straight into R.

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)

##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827
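Another option is to download the file to disk first and read the local copy, which helps when the file is large or you want to keep it. The sketch below uses base R only; the helper name read_csv_via_download is made up for illustration, and the call is wrapped in tryCatch() so that a failed download gives a message instead of stopping the script.

```r
# Hypothetical helper: download a remote CSV to a temporary file, then read it.
read_csv_via_download <- function(url) {
  local_copy <- tempfile(fileext = ".csv")
  on.exit(unlink(local_copy))  # remove the temporary file when done
  download.file(url, destfile = local_copy, mode = "wb", quiet = TRUE)
  read.csv(local_copy, header = TRUE, stringsAsFactors = FALSE)
}

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- tryCatch(
  read_csv_via_download(url),
  error = function(e) {
    message("Download failed: ", conditionMessage(e))
    NULL  # signal failure to the caller
  }
)
```

To keep the file, replace tempfile() with a path of your choice and drop the on.exit() line.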

Install-and-Play

Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.

Package Rfacebook
Package twitteR
Package GoTr - R wrapper for An API of Ice And Fire

install.packages("devtools") # lets you install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)

## [1] "list"

characters_583

## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"

API-Query

When no wrapper package exists, you interact with the web API yourself: you send HTTP requests to its URLs and parse the responses.

Package `httr`

httr is designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:

GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
PUT: update an existing resource. The payload may contain the updated data for the resource.
DELETE: delete an existing resource.

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).

install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content <- content(characters_583)
class(characters_583_content)

## [1] "list"

characters_583_content

## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
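The result is the same list that GoTr returned, which suggests what a wrapper does: it builds the URL, sends the request, and parses the response. The following is a minimal sketch of that idea, not GoTr's actual implementation. The names base_url, build_url and fetch_resource are hypothetical; GET(), stop_for_status() and content() are httr functions.

```r
# Hypothetical helpers illustrating what an API wrapper does.
base_url <- "https://anapioficeandfire.com/api"

# Build the URL of one resource, e.g. type = "characters", id = 583.
build_url <- function(type, id) {
  paste(base_url, type, id, sep = "/")
}

# Send the request, fail loudly on HTTP errors, and parse the body.
fetch_resource <- function(type, id) {
  response <- httr::GET(build_url(type, id))
  httr::stop_for_status(response)  # turns 4xx/5xx responses into R errors
  httr::content(response, as = "parsed")
}

build_url("characters", 583)
# fetch_resource("characters", 583)$name  # needs an internet connection
```

Checking the status before parsing matters because content() will happily parse an error page too.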

Your turn

We are interested in the following repository:

Web URL: https://github.com/tidyverse/dplyr
API URL: https://api.github.com/repos/tidyverse/dplyr

Use httr to retrieve repository data from the GitHub API, and print the following information:

The number of watchers
The number of subscribers
The number of open issues
The primary language of the repository
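A hint on navigating the result: for a JSON response, content() returns a nested list, and individual fields are reached with $ or [[ ]]. The sketch below imitates that with jsonlite on a made-up JSON string, so the field names are hypothetical and are not the ones the GitHub API uses.

```r
library(jsonlite)

# Made-up JSON standing in for a response body (hypothetical field names).
body <- '{"name": "example-repo", "stars": 42, "owner": {"login": "someone"}}'

# simplifyVector = FALSE keeps everything as nested lists.
parsed <- fromJSON(body, simplifyVector = FALSE)

names(parsed)       # which fields are available
parsed$name         # a top-level field
parsed[["stars"]]   # the same kind of access with [[ ]]
parsed$owner$login  # a field nested one level down
```

Calling names() on the parsed response is usually the quickest way to find the fields you need.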

Scraping

What if data is present on a website but is not offered through an API at all? It is still possible to grab that information. How easy that is depends largely on the quality and structure of the website we are scraping.

Two useful tools:

rvest: an R package for harvesting (scraping) web pages
SelectorGadget: a point-and-click tool, installed in your browser, for finding CSS selectors

Package `rvest` overview

install.packages("rvest")
library(rvest)
library(stringr)
library(tidyverse)

The most important functions in rvest are:

Retrieve an HTML document from a URL, a file on disk, or a string containing HTML with read_html().
Select parts of a document with html_nodes():
- using CSS selectors: html_nodes(doc, css = "table td")
- using XPath selectors: html_nodes(doc, xpath = "//table//td")
Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
Parse HTML tables into data frames with html_table().
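These functions can be tried without touching the network, because read_html() also accepts a string. The sketch below runs each of them on a small, made-up HTML document; the table contents and the id "films" are invented for illustration.

```r
library(rvest)

# A small, made-up HTML document (not taken from any real site).
doc <- read_html('
  <html><body>
    <table id="films">
      <tr><th>title</th><th>rating</th></tr>
      <tr><td><a href="/film/1">First Film</a></td><td>8.1</td></tr>
      <tr><td><a href="/film/2">Second Film</a></td><td>6.4</td></tr>
    </table>
  </body></html>')

# Select the same table cells with a CSS selector and with XPath.
cells_css   <- html_nodes(doc, css = "table td")
cells_xpath <- html_nodes(doc, xpath = "//table//td")
length(cells_css) == length(cells_xpath)

# Extract components from the two links inside the table.
links <- html_nodes(doc, css = "td a")
html_name(links)          # tag names
html_text(links)          # text inside each tag
html_attr(links, "href")  # value of a single attribute

# Parse the table into a data frame; html_table() returns one per table.
films <- html_table(html_nodes(doc, css = "table"))[[1]]
films
```

In rvest 1.0.0 and later, html_elements() is the newer name for html_nodes(); both work.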

Use rvest to retrieve an html document

popular_movies <- read_html("https://www.imdb.com/chart/moviemeter/")
popular_movies

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Use rvest to select parts of a document using CSS selectors
- Use SelectorGadget to identify the CSS selector string

CSS selector for ratings: screenshot

ratings <- popular_movies %>% 
  html_nodes(css="strong") %>%
  html_text() %>%
  as.numeric()
ratings

##  [1] 8.3 5.8 7.6 7.7 6.4 6.4 7.2 6.3 7.2 5.7 7.7 7.5 9.1 6.8 6.6 6.3 8.5 6.9 7.3
## [20] 6.4 5.0 6.9 7.9 6.9 6.7 5.8 6.3 7.3 7.3 8.0 6.8 6.3 6.3 5.4 7.4 6.1 7.2 6.3
## [39] 6.6 1.3 5.3 7.5 6.6 7.2 7.5 7.8 6.5 7.1 7.9 7.6 6.4 3.3 8.0 6.5 7.5 7.6 6.6
## [58] 6.3 8.6 8.4 6.9 8.0 8.4 8.6 5.2 7.6 6.0 6.1 6.6 5.4 9.3 7.6 9.0 7.4 4.8 5.8
## [77] 7.3 7.4 5.8 7.8 7.3 8.2

CSS selector for movie titles: screenshot

MoviesTitles <- popular_movies %>%
  html_nodes(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)

## [1] ""           ""           "Dune"       ""           "The Batman"
## [6] ""

# drop the empty strings (str_trim() has already removed surrounding whitespace)
MoviesTitles <- MoviesTitles[MoviesTitles != ""]
MoviesTitles

##   [1] "Dune"                                     
##   [2] "The Batman"                               
##   [3] "Halloween Kills"                          
##   [4] "No Time to Die"                           
##   [5] "The Last Duel"                            
##   [6] "Dune"                                     
##   [7] "Venom: Let There Be Carnage"              
##   [8] "Free Guy"                                 
##   [9] "Eternals"                                 
##  [10] "The Forgotten Battle"                     
##  [11] "Rust"                                     
##  [12] "Night Teeth"                              
##  [13] "Halloween"                                
##  [14] "The French Dispatch"                      
##  [15] "Sardar Udham"                             
##  [16] "The Flash"                                
##  [17] "Black Widow"                              
##  [18] "Halloween"                                
##  [19] "Red Notice"                               
##  [20] "Uncharted"                                
##  [21] "The Guilty"                               
##  [22] "The Black Phone"                          
##  [23] "The Trip"                                 
##  [24] "Last Night in Soho"                       
##  [25] "Black Adam"                               
##  [26] "The Many Saints of Newark"                
##  [27] "After We Fell"                            
##  [28] "Hocus Pocus"                              
##  [29] "Shang-Chi and the Legend of the Ten Rings"
##  [30] "Scream"                                   
##  [31] "Titane"                                   
##  [32] "Venom"                                    
##  [33] "Old"                                      
##  [34] "Copshop"                                  
##  [35] "Scream"                                   
##  [36] "Halloween Ends"                           
##  [37] "The Suicide Squad"                        
##  [38] "Spider-Man: No Way Home"                  
##  [39] "Casino Royale"                            
##  [40] "Spectre"                                  
##  [41] "Injustice"                                
##  [42] "Antlers"                                  
##  [43] "Warning"                                  
##  [44] "Cruella"                                  
##  [45] "Ghostbusters: Afterlife"                  
##  [46] "Halloween"                                
##  [47] "Ron's Gone Wrong"                         
##  [48] "Malignant"                                
##  [49] "The Green Knight"                         
##  [50] "The Cost of Deception"                    
##  [51] "The Addams Family 2"                      
##  [52] "A Nightmare on Elm Street"                
##  [53] "Resident Evil: Welcome to Raccoon City"   
##  [54] "Lamb"                                     
##  [55] "Old Henry"                                
##  [56] "Beetlejuice"                              
##  [57] "Ambulance"                                
##  [58] "Skyfall"                                  
##  [59] "The Night House"                          
##  [60] "Dune: Part Two"                           
##  [61] "Midsommar"                                
##  [62] "Knives Out"                               
##  [63] "Once Upon a Time... In Hollywood"         
##  [64] "Friday the 13th"                          
##  [65] "365 Days"                                 
##  [66] "Blade Runner 2049"                        
##  [67] "Halloween II"                             
##  [68] "Promising Young Woman"                    
##  [69] "Being the Ricardos"                       
##  [70] "Harry Potter and the Sorcerer's Stone"    
##  [71] "The Lost Daughter"                        
##  [72] "The Little Things"                        
##  [73] "Parasite"                                 
##  [74] "Joker"                                    
##  [75] "The Addams Family"                        
##  [76] "The Nightmare Before Christmas"           
##  [77] "Avengers: Endgame"                        
##  [78] "Untitled the Munsters Reboot"             
##  [79] "Interstellar"                             
##  [80] "F9: The Fast Saga"                        
##  [81] "Belfast"                                  
##  [82] "Candyman"                                 
##  [83] "Wonka"                                    
##  [84] "Dear Evan Hansen"                         
##  [85] "Quantum of Solace"                        
##  [86] "Snake Eyes"                               
##  [87] "The Shawshank Redemption"                 
##  [88] "The Crow"                                 
##  [89] "Hocus Pocus 2"                            
##  [90] "The Dark Knight"                          
##  [91] "The Rocky Horror Picture Show"            
##  [92] "There's Someone Inside Your House"        
##  [93] "The Addams Family"                        
##  [94] "It"                                       
##  [95] "Tenet"                                    
##  [96] "Halloween H20: 20 Years Later"            
##  [97] "Titanic"                                  
##  [98] "Home Sweet Home Alone"                    
##  [99] "Poltergeist"                              
## [100] "The Wolf of Wall Street"
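The empty strings appear because a broad selector such as "#main a" also matches links that contain no text, for example links wrapped around an image. A narrower selector avoids the cleanup step. The sketch below shows the effect on made-up markup; the class names "poster" and "title" are hypothetical and are not IMDb's actual class names, so use SelectorGadget to find the right selector on the live page.

```r
library(rvest)

# Hypothetical markup: each row has an image link (no text) and a title link.
page <- read_html('
  <div id="main"><table>
    <tr><td class="poster"><a href="/t/1"><img src="p1.jpg"></a></td>
        <td class="title"><a href="/t/1">First Film</a></td></tr>
    <tr><td class="poster"><a href="/t/2"><img src="p2.jpg"></a></td>
        <td class="title"><a href="/t/2">Second Film</a></td></tr>
  </table></div>')

# Broad selector: matches the image links too, which have empty text.
all_links <- html_text(html_nodes(page, css = "#main a"))
all_links

# Narrow selector: matches only the links inside the title cells.
title_links <- html_text(html_nodes(page, css = "#main td.title a"))
title_links
```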

Use rvest to extract html tags and their attributes

CSS selector for the poster of the first movie: screenshot

poster_img_source <- popular_movies %>%
  html_nodes(css="tr:nth-child(1) img") %>%
  html_attr("src")
poster_img_source

## [1] "https://m.media-amazon.com/images/M/MV5BN2FjNmEyNWMtYzM0ZS00NjIyLTg5YzYtYThlMGVjNzE1OGViXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UY67_CR0,0,45,67_AL_.jpg"

Use rvest to parse html tables into data frames

top_movie_list <- popular_movies %>%
  html_nodes(css="table") %>%
  html_table()
# a list whose first element is the table of the 100 most popular movies
top_movie_list

## [[1]]
## # A tibble: 100 x 5
##    ``    `Rank & Title`            `IMDb Rating` `Your Rating`             ``   
##    <lgl> <chr>                             <dbl> <chr>                     <lgl>
##  1 NA    "Dune\n        (2021)\n …           8.3 "12345678910\n        \n… NA   
##  2 NA    "The Batman\n        (20…          NA   "12345678910\n        \n… NA   
##  3 NA    "Halloween Kills\n      …           5.8 "12345678910\n        \n… NA   
##  4 NA    "No Time to Die\n       …           7.6 "12345678910\n        \n… NA   
##  5 NA    "The Last Duel\n        …           7.7 "12345678910\n        \n… NA   
##  6 NA    "Dune\n        (1984)\n …           6.4 "12345678910\n        \n… NA   
##  7 NA    "Venom: Let There Be Car…           6.4 "12345678910\n        \n… NA   
##  8 NA    "Free Guy\n        (2021…           7.2 "12345678910\n        \n… NA   
##  9 NA    "Eternals\n        (2021…           6.3 "12345678910\n        \n… NA   
## 10 NA    "The Forgotten Battle\n …           7.2 "12345678910\n        \n… NA   
## # … with 90 more rows

movie_list_table <- top_movie_list[[1]] # the first element is the movie table
head(movie_list_table)

## # A tibble: 6 x 5
##   ``    `Rank & Title`            `IMDb Rating` `Your Rating`              ``   
##   <lgl> <chr>                             <dbl> <chr>                      <lgl>
## 1 NA    "Dune\n        (2021)\n …           8.3 "12345678910\n        \n … NA   
## 2 NA    "The Batman\n        (20…          NA   "12345678910\n        \n … NA   
## 3 NA    "Halloween Kills\n      …           5.8 "12345678910\n        \n … NA   
## 4 NA    "No Time to Die\n       …           7.6 "12345678910\n        \n … NA   
## 5 NA    "The Last Duel\n        …           7.7 "12345678910\n        \n … NA   
## 6 NA    "Dune\n        (1984)\n …           6.4 "12345678910\n        \n … NA

# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3]
colnames(movie_list_table) <- c("title", "rating")

# collapse newlines and repeated whitespace into single spaces
movie_list_table$title <- str_squish(movie_list_table$title)
# drop the ranking info that follows the "(year)" part of each title
movie_list_table$title <- str_remove(movie_list_table$title,
                                     "(?<=\\(\\d{4}\\)).*$")

movie_list_table

## # A tibble: 100 x 2
##    title                              rating
##    <chr>                               <dbl>
##  1 Dune (2021)                           8.3
##  2 The Batman (2022)                    NA  
##  3 Halloween Kills (2021)                5.8
##  4 No Time to Die (2021)                 7.6
##  5 The Last Duel (2021)                  7.7
##  6 Dune (1984)                           6.4
##  7 Venom: Let There Be Carnage (2021)    6.4
##  8 Free Guy (2021)                       7.2
##  9 Eternals (2021)                       6.3
## 10 The Forgotten Battle (2020)           7.2
## # … with 90 more rows
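It is worth trying a cleaning pattern on a few strings before applying it to the whole column, especially on titles that contain digits. The sketch below uses made-up strings that only imitate the shape of the scraped cells (title, year in parentheses, then ranking text); the exact text on the live page may differ. The pattern removes everything that follows a four-digit year in parentheses, using a lookbehind so the year itself is kept.

```r
library(stringr)

# Made-up cell contents imitating the scraped "Rank & Title" column.
raw_titles <- c(
  "Dune\n        (2021)\n        1\n(no change)",
  "Blade Runner 2049\n        (2017)\n        66\n(3)"
)

# Collapse all whitespace, then cut everything after the "(year)" part.
clean_titles <- str_remove(str_squish(raw_titles), "(?<=\\(\\d{4}\\)).*$")
clean_titles
```

Anchoring on the parenthesised year means digits inside the title itself, as in the second string, are left alone.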

Lab assignment

This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment3.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).

You will lose 50% of the points if you use a different filename or put your code in multiple files.

In addition, lab assignments will be graded based on:

Accuracy: whether the R script achieves the objectives
Readability: whether the R script is clean, well-formatted, and easily readable
- You risk losing 10 points if your code is not properly indented or has lines longer than 80 characters.

Lab Assignment 1/2

https://editorial.rottentomatoes.com/guide/best-wide-release-2016/

Use ‘rvest’ and ‘SelectorGadget’ to…

Get the name of the 6th most popular movie in 2016
Get the image source (URL) of the movie “Arrival”
Get the description paragraph of the movie “Moana”
Get the URLs of all 10 movies
Create a table containing the titles of all 10 movies, the first listed director of each, and the movie rating. Store and clean the table in a data.frame so it looks like the next slide.

Table

Lab Assignment 2/2

Write a regular expression to extract email addresses from the following text:

@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk

z <- "@GeorgiaStateU: My two email addresses are smith77@gsu.edu and 
alex.smith@yahoo.com.uk"
my_regex = "put your regex here"
stringr::str_extract_all(z, my_regex)[[1]]

## [1] "smith77@gsu.edu"         "alex.smith@yahoo.com.uk"
