CIS 4730
Unstructured Data Management

Lab: Web scraping

Rongen Zhang

Getting data from the Web

There are many ways to obtain data from the Internet; let’s consider four categories:

Click-and-Download

In the simplest case, the data you need is already on the internet in a tabular format. There are a couple of strategies here:

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)
##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827
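
The effect of the header and stringsAsFactors arguments can be checked without a network connection by reading a CSV from an inline string. The sketch below is illustrative only: the two rows are copied from the output above, and the names csv_text, df_small and df_factor are made up for this example.

```r
# Two rows taken from the output above, written as an inline CSV string
csv_text <- "make,model,mpg\namc,concord,22\nbuick,century,20"

# header = TRUE: the first line supplies the column names
# stringsAsFactors = FALSE: text columns stay character vectors
df_small <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
str(df_small)

# For comparison: the same data with text columns converted to factors
df_factor <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
class(df_factor$make)
```

Keeping text as character vectors is usually the more convenient choice when the strings will be cleaned later.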

Install-and-Play

Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.

install.packages("devtools") # devtools lets you install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)
## [1] "list"
characters_583
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
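
The result is a nested list, so individual fields are reached with $ and [[ ]]. The sketch below works on a small hand-built list that mirrors part of the structure printed above, so it runs without the API. The object name jon and the choice to treat empty strings as missing values are assumptions made for illustration.

```r
# Hand-built list mirroring part of the structure printed above
jon <- list(
  name     = "Jon Snow",
  died     = "",
  aliases  = list("Lord Snow", "Ned Stark's Bastard", "Lord Crow"),
  tvSeries = list("Season 1", "Season 2")
)

jon$name           # a single field
jon$aliases[[2]]   # one element of a nested list

# Flatten a list of strings into an ordinary character vector
alias_vec <- unlist(jon$aliases)

# Assumption: an empty string means the value is unknown, so store NA instead
died <- if (nzchar(jon$died)) jon$died else NA_character_

# Summarise the record as a one-row data frame
summary_row <- data.frame(
  name      = jon$name,
  n_aliases = length(alias_vec),
  n_seasons = length(jon$tvSeries),
  stringsAsFactors = FALSE
)
summary_row
```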

API-query

When no wrapper package exists, you can query a web API directly by sending HTTP requests to its URLs.
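
An API URL is typically a base address followed by an endpoint and an identifier. The sketch below assembles such a URL with base R only. The helper character_url is hypothetical; the URL pattern is the one that appears in the output of this lab.

```r
# Base address of the API used in this lab
base_url <- "https://anapioficeandfire.com/api"

# Hypothetical helper: build the URL for one character id
character_url <- function(id) {
  sprintf("%s/characters/%d", base_url, id)
}

character_url(583)
```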

Package httr

httr is designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).

install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content = content(characters_583)
class(characters_583_content)
## [1] "list"
characters_583_content
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
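
A request can fail or return something other than JSON, so it is often worth checking the response before parsing it. The sketch below wraps those checks in a helper. fetch_json and its seconds argument are hypothetical names introduced here, not part of httr; the httr calls used are GET(), timeout(), stop_for_status(), http_type() and content().

```r
library(httr)

# Hypothetical helper (not part of httr): GET a URL and return the parsed JSON,
# stopping with an error on HTTP failures or unexpected content types.
fetch_json <- function(url, seconds = 10) {
  resp <- GET(url, timeout(seconds))   # give up after `seconds` seconds
  stop_for_status(resp)                # turn 4xx/5xx responses into R errors
  if (!identical(http_type(resp), "application/json")) {
    stop("Expected JSON but got ", http_type(resp))
  }
  content(resp, as = "parsed")         # nested list, as printed above
}

# Example call (needs network access):
# jon <- fetch_json("https://anapioficeandfire.com/api/characters/583")
# jon$name
```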

Your turn

We are interested in the following repository:

Use httr to retrieve repository data from the GitHub API, and print the following information:

Scraping

What if data is present on a website but is not provided through an API at all? It is possible to grab that information too. How easy that is depends a lot on the quality and structure of the website that we are scraping.

Two useful tools:

Package rvest overview

install.packages("rvest")
library(rvest)
library(stringr)
library(tidyverse)

The most important functions in rvest are:

popular_movies <- read_html("https://www.imdb.com/chart/moviemeter/")
popular_movies
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

CSS selector for ratings: strong

ratings <- popular_movies %>% 
  html_nodes(css="strong") %>%
  html_text() %>%
  as.numeric()
ratings
##  [1] 8.3 5.8 7.6 7.7 6.4 6.4 7.2 6.3 7.2 5.7 7.7 7.5 9.1 6.8 6.6 6.3 8.5 6.9 7.3
## [20] 6.4 5.0 6.9 7.9 6.9 6.7 5.8 6.3 7.3 7.3 8.0 6.8 6.3 6.3 5.4 7.4 6.1 7.2 6.3
## [39] 6.6 1.3 5.3 7.5 6.6 7.2 7.5 7.8 6.5 7.1 7.9 7.6 6.4 3.3 8.0 6.5 7.5 7.6 6.6
## [58] 6.3 8.6 8.4 6.9 8.0 8.4 8.6 5.2 7.6 6.0 6.1 6.6 5.4 9.3 7.6 9.0 7.4 4.8 5.8
## [77] 7.3 7.4 5.8 7.8 7.3 8.2
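
The ratings vector above has 82 entries although the chart lists 100 movies, because a movie without a rating has no matching element, so a page-wide selection cannot be lined up with the titles. One possible way around this, sketched below on a made-up two-row table, is to select the rows first and then call html_node() (singular) inside each row, which yields NA where the element is absent. The HTML snippet and the object names are assumptions for illustration.

```r
library(rvest)

# Made-up two-row table: the second movie has no rating yet
page <- read_html('<table>
  <tr><td><a href="/t1">Dune</a></td><td><strong>8.3</strong></td></tr>
  <tr><td><a href="/t2">The Batman</a></td><td></td></tr>
</table>')

# Page-wide selection silently skips the unrated movie
all_ratings <- page %>% html_nodes("strong") %>% html_text() %>% as.numeric()
length(all_ratings)

# Row-wise selection: html_node() returns one result per row, missing if absent
rows <- page %>% html_nodes("tr")
aligned <- data.frame(
  title  = rows %>% html_node("a") %>% html_text(),
  rating = rows %>% html_node("strong") %>% html_text() %>% as.numeric(),
  stringsAsFactors = FALSE
)
aligned
```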

CSS selector for movie titles: #main a

# Extract the text of every link inside #main (titles plus some empty strings)
MoviesTitles <- popular_movies %>%
  html_nodes(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)
## [1] ""           ""           "Dune"       ""           "The Batman"
## [6] ""
MoviesTitles <- MoviesTitles[MoviesTitles != ""] # drop the empty strings left after trimming
MoviesTitles
##   [1] "Dune"                                     
##   [2] "The Batman"                               
##   [3] "Halloween Kills"                          
##   [4] "No Time to Die"                           
##   [5] "The Last Duel"                            
##   [6] "Dune"                                     
##   [7] "Venom: Let There Be Carnage"              
##   [8] "Free Guy"                                 
##   [9] "Eternals"                                 
##  [10] "The Forgotten Battle"                     
##  [11] "Rust"                                     
##  [12] "Night Teeth"                              
##  [13] "Halloween"                                
##  [14] "The French Dispatch"                      
##  [15] "Sardar Udham"                             
##  [16] "The Flash"                                
##  [17] "Black Widow"                              
##  [18] "Halloween"                                
##  [19] "Red Notice"                               
##  [20] "Uncharted"                                
##  [21] "The Guilty"                               
##  [22] "The Black Phone"                          
##  [23] "The Trip"                                 
##  [24] "Last Night in Soho"                       
##  [25] "Black Adam"                               
##  [26] "The Many Saints of Newark"                
##  [27] "After We Fell"                            
##  [28] "Hocus Pocus"                              
##  [29] "Shang-Chi and the Legend of the Ten Rings"
##  [30] "Scream"                                   
##  [31] "Titane"                                   
##  [32] "Venom"                                    
##  [33] "Old"                                      
##  [34] "Copshop"                                  
##  [35] "Scream"                                   
##  [36] "Halloween Ends"                           
##  [37] "The Suicide Squad"                        
##  [38] "Spider-Man: No Way Home"                  
##  [39] "Casino Royale"                            
##  [40] "Spectre"                                  
##  [41] "Injustice"                                
##  [42] "Antlers"                                  
##  [43] "Warning"                                  
##  [44] "Cruella"                                  
##  [45] "Ghostbusters: Afterlife"                  
##  [46] "Halloween"                                
##  [47] "Ron's Gone Wrong"                         
##  [48] "Malignant"                                
##  [49] "The Green Knight"                         
##  [50] "The Cost of Deception"                    
##  [51] "The Addams Family 2"                      
##  [52] "A Nightmare on Elm Street"                
##  [53] "Resident Evil: Welcome to Raccoon City"   
##  [54] "Lamb"                                     
##  [55] "Old Henry"                                
##  [56] "Beetlejuice"                              
##  [57] "Ambulance"                                
##  [58] "Skyfall"                                  
##  [59] "The Night House"                          
##  [60] "Dune: Part Two"                           
##  [61] "Midsommar"                                
##  [62] "Knives Out"                               
##  [63] "Once Upon a Time... In Hollywood"         
##  [64] "Friday the 13th"                          
##  [65] "365 Days"                                 
##  [66] "Blade Runner 2049"                        
##  [67] "Halloween II"                             
##  [68] "Promising Young Woman"                    
##  [69] "Being the Ricardos"                       
##  [70] "Harry Potter and the Sorcerer's Stone"    
##  [71] "The Lost Daughter"                        
##  [72] "The Little Things"                        
##  [73] "Parasite"                                 
##  [74] "Joker"                                    
##  [75] "The Addams Family"                        
##  [76] "The Nightmare Before Christmas"           
##  [77] "Avengers: Endgame"                        
##  [78] "Untitled the Munsters Reboot"             
##  [79] "Interstellar"                             
##  [80] "F9: The Fast Saga"                        
##  [81] "Belfast"                                  
##  [82] "Candyman"                                 
##  [83] "Wonka"                                    
##  [84] "Dear Evan Hansen"                         
##  [85] "Quantum of Solace"                        
##  [86] "Snake Eyes"                               
##  [87] "The Shawshank Redemption"                 
##  [88] "The Crow"                                 
##  [89] "Hocus Pocus 2"                            
##  [90] "The Dark Knight"                          
##  [91] "The Rocky Horror Picture Show"            
##  [92] "There's Someone Inside Your House"        
##  [93] "The Addams Family"                        
##  [94] "It"                                       
##  [95] "Tenet"                                    
##  [96] "Halloween H20: 20 Years Later"            
##  [97] "Titanic"                                  
##  [98] "Home Sweet Home Alone"                    
##  [99] "Poltergeist"                              
## [100] "The Wolf of Wall Street"

CSS selector for the poster of the first movie: tr:nth-child(1) img

poster_img_source <- popular_movies %>%
  html_nodes(css="tr:nth-child(1) img") %>%
  html_attr("src")
poster_img_source
## [1] "https://m.media-amazon.com/images/M/MV5BN2FjNmEyNWMtYzM0ZS00NjIyLTg5YzYtYThlMGVjNzE1OGViXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UY67_CR0,0,45,67_AL_.jpg"
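
html_attr() reads any attribute, not just src. The sketch below, run on a made-up HTML snippet, pulls the href of each link, the src of the image inside it (NA when a link has no image), and turns the relative links into absolute ones with xml2::url_absolute(). The title ids and the example.com image address are invented for illustration.

```r
library(rvest)

# Made-up snippet: one link wraps a poster image, the other is text only
page <- read_html('<div id="main">
  <a href="/title/tt0000001/"><img src="https://example.com/poster1.jpg" alt="First film"></a>
  <a href="/title/tt0000002/">Second film</a>
</div>')

links <- page %>% html_nodes("#main a")

# Attribute of each link
hrefs <- links %>% html_attr("href")

# Image inside each link; html_node() keeps one entry per link
posters <- links %>% html_node("img") %>% html_attr("src")

# Resolve relative links against the site address
abs_links <- xml2::url_absolute(hrefs, "https://www.imdb.com/")
abs_links
```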

top_movie_list <- popular_movies %>%
  html_nodes(css="table") %>%
  html_table()
top_movie_list # a list holding one data frame: the 100 most popular movies
## [[1]]
## # A tibble: 100 x 5
##    ``    `Rank & Title`            `IMDb Rating` `Your Rating`             ``   
##    <lgl> <chr>                             <dbl> <chr>                     <lgl>
##  1 NA    "Dune\n        (2021)\n …           8.3 "12345678910\n        \n… NA   
##  2 NA    "The Batman\n        (20…          NA   "12345678910\n        \n… NA   
##  3 NA    "Halloween Kills\n      …           5.8 "12345678910\n        \n… NA   
##  4 NA    "No Time to Die\n       …           7.6 "12345678910\n        \n… NA   
##  5 NA    "The Last Duel\n        …           7.7 "12345678910\n        \n… NA   
##  6 NA    "Dune\n        (1984)\n …           6.4 "12345678910\n        \n… NA   
##  7 NA    "Venom: Let There Be Car…           6.4 "12345678910\n        \n… NA   
##  8 NA    "Free Guy\n        (2021…           7.2 "12345678910\n        \n… NA   
##  9 NA    "Eternals\n        (2021…           6.3 "12345678910\n        \n… NA   
## 10 NA    "The Forgotten Battle\n …           7.2 "12345678910\n        \n… NA   
## # … with 90 more rows

movie_list_table <- top_movie_list[[1]] # the first element is the table of movies
head(movie_list_table)
## # A tibble: 6 x 5
##   ``    `Rank & Title`            `IMDb Rating` `Your Rating`              ``   
##   <lgl> <chr>                             <dbl> <chr>                      <lgl>
## 1 NA    "Dune\n        (2021)\n …           8.3 "12345678910\n        \n … NA   
## 2 NA    "The Batman\n        (20…          NA   "12345678910\n        \n … NA   
## 3 NA    "Halloween Kills\n      …           5.8 "12345678910\n        \n … NA   
## 4 NA    "No Time to Die\n       …           7.6 "12345678910\n        \n … NA   
## 5 NA    "The Last Duel\n        …           7.7 "12345678910\n        \n … NA   
## 6 NA    "Dune\n        (1984)\n …           6.4 "12345678910\n        \n … NA

# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3] 
colnames(movie_list_table) = c("title", "rating")

# collapse newlines and repeated whitespace into single spaces
movie_list_table$title = str_squish(movie_list_table$title)
# drop the ranking info after "(year)"; anchoring on the year keeps digits inside titles such as "Blade Runner 2049"
movie_list_table$title = str_replace(movie_list_table$title, "(\\(\\d{4}\\)).*$", "\\1")

movie_list_table
## # A tibble: 100 x 2
##    title                              rating
##    <chr>                               <dbl>
##  1 Dune (2021)                           8.3
##  2 The Batman (2022)                    NA  
##  3 Halloween Kills (2021)                5.8
##  4 No Time to Die (2021)                 7.6
##  5 The Last Duel (2021)                  7.7
##  6 Dune (1984)                           6.4
##  7 Venom: Let There Be Carnage (2021)    6.4
##  8 Free Guy (2021)                       7.2
##  9 Eternals (2021)                       6.3
## 10 The Forgotten Battle (2020)           7.2
## # … with 90 more rows
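
The title cleanup can be tried on a few hand-written strings before running it on the scraped table. The sketch below shows one possible approach: squish the whitespace, then cut everything after the four-digit year in parentheses. The raw strings are assumptions, since the scraped cells are truncated in the output above and their tails are guessed here.

```r
library(stringr)

# Hand-written cells; the text after the year is a guess at the ranking info
raw <- c(
  "Dune\n        (2021)\n        1\n(no change)",
  "Blade Runner 2049\n        (2017)\n        66\n(2)"
)

# Collapse newlines and runs of spaces into single spaces
squished <- str_squish(raw)

# Keep everything up to and including "(year)", drop the rest
titles <- str_replace(squished, "(\\(\\d{4}\\)).*$", "\\1")
titles

# Optionally split title and year into separate columns
parts <- str_match(titles, "^(.*) \\((\\d{4})\\)$")
parts[, 2]               # title
as.integer(parts[, 3])   # year
```

Anchoring on the year rather than on the first digit keeps titles that contain numbers intact.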

Lab assignment

This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment3.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).

You will lose 50% of the points if you use a different filename or put your code in multiple files.

In addition, lab assignments will be graded based on:

Lab Assignment 1/2

https://editorial.rottentomatoes.com/guide/best-wide-release-2016/

Use ‘rvest’ and ‘SelectorGadget’ to…

Table

Lab Assignment 2/2

Write a regular expression to extract email addresses from the following text:

@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk

z<- "@GeorgiaStateU: My two email addresses are smith77@gsu.edu and 
alex.smith@yahoo.com.uk"
my_regex = "put your regex here"
stringr::str_extract_all(z, my_regex)[[1]]
## [1] "smith77@gsu.edu"         "alex.smith@yahoo.com.uk"
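
To see how str_extract_all() behaves before writing the email pattern, it may help to try it on a simpler, unrelated task. The sketch below extracts hashtags from a made-up sentence; the text and the pattern are illustrative assumptions and are not the answer to this task.

```r
library(stringr)

# Made-up text for illustration
note <- "Reading #rstats and #webscraping notes"

# "#\\w+" matches a "#" followed by one or more letters, digits or underscores
matches <- str_extract_all(note, "#\\w+")

class(matches)   # a list with one element per input string
matches[[1]]     # the matches found in the first (and only) string
```

The [[1]] at the end is needed because str_extract_all() returns a list, one element per input string.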