CIS 4730
Unstructured Data Management

Lab: Web scraping

Rongen Zhang

Getting data from the Web

There are many ways to obtain data from the Internet; let’s consider four categories:

Click-and-Download

In the simplest case, the data you need is already on the Internet in a tabular format such as CSV. One strategy is to read the file straight from its URL:

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)
##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827
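Another strategy (an assumption here, since the handout does not list the alternatives) is to download the file once and read the local copy afterwards, so repeated runs do not hit the server again. The helper name read_csv_cached and the temp-directory location are hypothetical; download.file() and read.csv() are base R.

```r
# Hypothetical helper: download a CSV once, then reuse the local copy.
read_csv_cached <- function(url, dest = file.path(tempdir(), basename(url))) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the file byte-for-byte identical on Windows
    download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  }
  read.csv(dest, header = TRUE, stringsAsFactors = FALSE)
}

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
# df <- read_csv_cached(url)   # uncomment to run; needs an Internet connection
```

The first call downloads the file; later calls in the same R session only read the saved copy.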

Install-and-Play

Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.

install.packages("devtools") # devtools lets you install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)
## [1] "list"
characters_583
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
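To make "wrapped" concrete, here is a minimal sketch of what such a wrapper could look like. It is not the actual GoTr source: the function names api_url and my_got_api are hypothetical, and the URL pattern is inferred from the $url field in the output above. GET(), stop_for_status() and content() are real httr functions.

```r
# Hypothetical wrapper, NOT the GoTr implementation: it only illustrates the idea.
api_url <- function(type, id) {
  # URL pattern assumed from the $url field shown in the output above
  sprintf("https://anapioficeandfire.com/api/%s/%s", type, id)
}

my_got_api <- function(type, id) {
  response <- httr::GET(api_url(type, id))  # send the query to the server
  httr::stop_for_status(response)           # turn HTTP errors into R errors
  httr::content(response)                   # parse the JSON body into an R list
}

api_url("characters", 583)
## [1] "https://anapioficeandfire.com/api/characters/583"
# jon <- my_got_api("characters", 583)   # needs httr and an Internet connection
```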

API-Query

Here no wrapper package exists, so you interact with the web API yourself by sending requests to its URLs.

Package httr

httr is designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, such as GET, POST, PUT, PATCH, DELETE, and HEAD.

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).

install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content <- content(characters_583)
class(characters_583_content)
## [1] "list"
characters_583_content
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
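The parsed content is a nested list, so individual fields are reached with $ and [[ ]]. The sketch below uses a small hand-built list that mimics a few of the fields shown above (an assumption made so the example runs without a network call); the same expressions apply to characters_583_content.

```r
# Hand-built stand-in for a few fields of the parsed API response
character_info <- list(
  name     = "Jon Snow",
  aliases  = list("Lord Snow", "Lord Crow"),
  playedBy = list("Kit Harington")
)

character_info$name               # a single field
## [1] "Jon Snow"
character_info$aliases[[2]]       # one element of a nested list
## [1] "Lord Crow"
aliases <- unlist(character_info$aliases)  # flatten the list into a character vector
aliases
## [1] "Lord Snow" "Lord Crow"
```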

Your turn

We are interested in the following repository:

Use httr to retrieve repository data from the GitHub API, and print the following information:

Scraping

What if the data is present on a website but is not available through an API at all? It is possible to grab that information too. How easy that is depends a lot on the quality and structure of the website we are scraping.

Two useful tools:

Package rvest overview

install.packages("rvest")
library(rvest)
library(stringr)
library(tidyverse)

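The most important functions in rvest, all of which are used in this lab, are read_html(), html_nodes(), html_text(), html_attr(), and html_table().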

popular_movies <- read_html("https://www.imdb.com/chart/moviemeter/")
popular_movies
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

CSS selector for ratings: strong

ratings <- popular_movies %>% 
  html_nodes(css="strong") %>%
  html_text() %>%
  as.numeric()
ratings
##  [1] 6.7 5.5 7.5 6.2 5.5 7.5 6.6 6.7 7.5 7.4 5.6 7.4 6.7 7.4 6.3 6.0 5.9 6.4 5.7
## [20] 7.2 7.5 5.4 6.8 4.7 5.8 6.3 7.4 7.6 5.4 5.8 7.1 3.3 6.4 6.1 6.9 6.8 9.3 7.2
## [39] 7.5 8.1 8.4 7.6 6.4 6.2 7.1 6.2 8.3 7.9 8.9 7.6 9.2 7.4 4.8 5.3 7.1 4.9 7.1
## [58] 5.7 7.3 6.6 6.9 4.2 5.9 7.7 5.4 8.4 7.0 7.8 6.0 6.1 6.8 5.5 7.9 7.5 6.4 5.4
## [77] 9.0
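A bare tag selector such as "strong" matches every <strong> element on the page, which is why the choice of selector matters. The sketch below shows the difference on a tiny inline page; the HTML and its class names are made up for illustration and are not IMDb's actual markup.

```r
library(rvest)

# Made-up page: two ratings plus an unrelated <strong> in a note
page <- read_html('
  <table>
    <tr><td class="title"><a href="/t1">Movie A</a></td>
        <td class="rating"><strong>6.7</strong></td></tr>
    <tr><td class="title"><a href="/t2">Movie B</a></td>
        <td class="rating"><strong>7.5</strong></td></tr>
  </table>
  <p><strong>Note</strong>: ratings change daily.</p>')

# "strong" matches every <strong> tag, including the one in the note
all_strong <- page %>% html_nodes(css = "strong") %>% html_text()
all_strong
## [1] "6.7"  "7.5"  "Note"

# a narrower selector keeps only the <strong> tags inside rating cells
ratings <- page %>%
  html_nodes(css = "td.rating strong") %>%
  html_text() %>%
  as.numeric()
ratings
## [1] 6.7 7.5
```

With the broad selector, as.numeric() would turn "Note" into NA with a warning, so a narrower selector gives cleaner results.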

CSS selector for movie titles: #main a

MoviesTitles <- popular_movies %>%
  html_nodes(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)
## [1] ""                  ""                  "The Tomorrow War" 
## [4] ""                  "F9: The Fast Saga" ""
MoviesTitles <- MoviesTitles[MoviesTitles != ""] # drop the empty strings left by links without text
MoviesTitles
##   [1] "The Tomorrow War"                                 
##   [2] "F9: The Fast Saga"                                
##   [3] "The Many Saints of Newark"                        
##   [4] "Luca"                                             
##   [5] "Fear Street Part 1: 1994"                         
##   [6] "The Ice Road"                                     
##   [7] "A Quiet Place Part II"                            
##   [8] "No Sudden Move"                                   
##   [9] "Black Widow"                                      
##  [10] "In the Heights"                                   
##  [11] "Cruella"                                          
##  [12] "Good on Paper"                                    
##  [13] "Raya and the Last Dragon"                         
##  [14] "Fatherhood"                                       
##  [15] "Jolt"                                             
##  [16] "The Suicide Squad"                                
##  [17] "Nobody"                                           
##  [18] "The Hitman's Wife's Bodyguard"                    
##  [19] "The Boss Baby: Family Business"                   
##  [20] "The Forever Purge"                                
##  [21] "The Conjuring: The Devil Made Me Do It"           
##  [22] "Shang-Chi and the Legend of the Ten Rings"        
##  [23] "America: The Motion Picture"                      
##  [24] "Wrath of Man"                                     
##  [25] "A Quiet Place"                                    
##  [26] "Infinite"                                         
##  [27] "Zola"                                             
##  [28] "False Positive"                                   
##  [29] "Till Death"                                       
##  [30] "The Little Things"                                
##  [31] "Sing 2"                                           
##  [32] "Tenet"                                            
##  [33] "Once Upon a Time... In Hollywood"                 
##  [34] "Gaia"                                             
##  [35] "Don't Breathe 2"                                  
##  [36] "Army of the Dead"                                 
##  [37] "Midsommar"                                        
##  [38] "365 Days"                                         
##  [39] "Godzilla vs. Kong"                                
##  [40] "Werewolves Within"                                
##  [41] "Blood Red Sky"                                    
##  [42] "Old"                                              
##  [43] "Haseen Dillruba"                                  
##  [44] "Spider-Man: No Way Home"                          
##  [45] "Halloween Kills"                                  
##  [46] "The Fast and the Furious"                         
##  [47] "Space Jam: A New Legacy"                          
##  [48] "The Shawshank Redemption"                         
##  [49] "The Harder They Fall"                             
##  [50] "Candyman"                                         
##  [51] "Wish Dragon"                                      
##  [52] "Dune"                                             
##  [53] "Promising Young Woman"                            
##  [54] "Beckett"                                          
##  [55] "Zack Snyder's Justice League"                     
##  [56] "Avengers: Endgame"                                
##  [57] "Gone Baby Gone"                                   
##  [58] "Jurassic World: Dominion"                         
##  [59] "Lansky"                                           
##  [60] "Clifford the Big Red Dog"                         
##  [61] "Mortal Kombat"                                    
##  [62] "Don't Breathe"                                    
##  [63] "Peter Rabbit 2: The Runaway"                      
##  [64] "The Father"                                       
##  [65] "Cinderella"                                       
##  [66] "The Green Knight"                                 
##  [67] "Thor: Ragnarok"                                   
##  [68] "Pulp Fiction"                                     
##  [69] "Harry Potter and the Sorcerer's Stone"            
##  [70] "The Godfather"                                    
##  [71] "Nomadland"                                        
##  [72] "Awake"                                            
##  [73] "Spiral"                                           
##  [74] "Snake Eyes"                                       
##  [75] "Furious 7"                                        
##  [76] "The Superdeep"                                    
##  [77] "Silver Skates"                                    
##  [78] "The Woman in the Window"                          
##  [79] "Rurouni Kenshin: Final Chapter Part I - The Final"
##  [80] "The Fate of the Furious"                          
##  [81] "The Hitman's Bodyguard"                           
##  [82] "The Misfits"                                      
##  [83] "Censor"                                           
##  [84] "The Mitchells vs the Machines"                    
##  [85] "Spirit Untamed"                                   
##  [86] "Joker"                                            
##  [87] "Thor: Love and Thunder"                           
##  [88] "Death Proof"                                      
##  [89] "Another Round"                                    
##  [90] "Top Gun: Maverick"                                
##  [91] "The Fast and the Furious: Tokyo Drift"            
##  [92] "Fear Street Part Two: 1978"                       
##  [93] "Love"                                             
##  [94] "Yesterday"                                        
##  [95] "The Dead Don't Die"                               
##  [96] "Knives Out"                                       
##  [97] "The Conjuring"                                    
##  [98] "Fast & Furious Presents: Hobbs & Shaw"            
##  [99] "Wonder Woman 1984"                                
## [100] "The Dark Knight"

CSS selector for the poster of the first movie: tr:nth-child(1) img

poster_img_source <- popular_movies %>%
  html_nodes(css="tr:nth-child(1) img") %>%
  html_attr("src")
poster_img_source
## [1] "https://m.media-amazon.com/images/M/MV5BNTI2YTI0MWEtNGQ4OS00ODIzLWE1MWEtZGJiN2E3ZmM1OWI1XkEyXkFqcGdeQXVyODk4OTc3MTY@._V1_UY67_CR0,0,45,67_AL_.jpg"
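html_attr() reads any attribute, not just src. The sketch below pulls two attributes from a made-up inline page and, because some sites use relative paths, turns them into full URLs with xml2::url_absolute(); the base URL https://example.com/chart/ is a placeholder.

```r
library(rvest)

# Made-up page with relative image paths
page <- read_html('
  <table>
    <tr><td><img src="/images/poster-a.jpg" alt="Movie A"></td></tr>
    <tr><td><img src="/images/poster-b.jpg" alt="Movie B"></td></tr>
  </table>')

images <- page %>% html_nodes(css = "img")
poster_src <- images %>% html_attr("src")   # value of the src attribute
poster_alt <- images %>% html_attr("alt")   # value of the alt attribute

# relative paths need the site's base URL before they can be downloaded
poster_url <- xml2::url_absolute(poster_src, base = "https://example.com/chart/")
poster_url
# download.file(poster_url[1], "poster.jpg", mode = "wb")  # would save the first poster
```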

top_movie_list <- popular_movies %>%
  html_nodes(css="table") %>%
  html_table()
top_movie_list # a list whose single element is the data frame of the 100 most popular movies
## [[1]]
## # A tibble: 100 x 5
##    ``    `Rank & Title`            `IMDb Rating` `Your Rating`             ``   
##    <lgl> <chr>                             <dbl> <chr>                     <lgl>
##  1 NA    "The Tomorrow War\n     …           6.7 "12345678910\n        \n… NA   
##  2 NA    "F9: The Fast Saga\n    …           5.5 "12345678910\n        \n… NA   
##  3 NA    "The Many Saints of Newa…          NA   "12345678910\n        \n… NA   
##  4 NA    "Luca\n        (2021)\n …           7.5 "12345678910\n        \n… NA   
##  5 NA    "Fear Street Part 1: 199…           6.2 "12345678910\n        \n… NA   
##  6 NA    "The Ice Road\n        (…           5.5 "12345678910\n        \n… NA   
##  7 NA    "A Quiet Place Part II\n…           7.5 "12345678910\n        \n… NA   
##  8 NA    "No Sudden Move\n       …           6.6 "12345678910\n        \n… NA   
##  9 NA    "Black Widow\n        (2…           6.7 "12345678910\n        \n… NA   
## 10 NA    "In the Heights\n       …           7.5 "12345678910\n        \n… NA   
## # … with 90 more rows

movie_list_table <- top_movie_list[[1]] # the first (and only) element is the movie chart table
head(movie_list_table)
## # A tibble: 6 x 5
##   ``    `Rank & Title`            `IMDb Rating` `Your Rating`              ``   
##   <lgl> <chr>                             <dbl> <chr>                      <lgl>
## 1 NA    "The Tomorrow War\n     …           6.7 "12345678910\n        \n … NA   
## 2 NA    "F9: The Fast Saga\n    …           5.5 "12345678910\n        \n … NA   
## 3 NA    "The Many Saints of Newa…          NA   "12345678910\n        \n … NA   
## 4 NA    "Luca\n        (2021)\n …           7.5 "12345678910\n        \n … NA   
## 5 NA    "Fear Street Part 1: 199…           6.2 "12345678910\n        \n … NA   
## 6 NA    "The Ice Road\n        (…           5.5 "12345678910\n        \n … NA
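html_table() applied to a set of nodes returns a list with one table per matched <table>, which is why the [[1]] step is needed even when the page has a single table. A minimal sketch on a made-up inline table:

```r
library(rvest)

# Made-up page containing one small table
page <- read_html('
  <table>
    <tr><th>Title</th><th>Rating</th></tr>
    <tr><td>Movie A (2021)</td><td>6.7</td></tr>
    <tr><td>Movie B (2020)</td><td>7.5</td></tr>
  </table>')

tables <- page %>% html_nodes(css = "table") %>% html_table()
length(tables)        # one list element per matched <table>
## [1] 1

chart <- tables[[1]]  # extract the table itself
names(chart)          # the <th> cells become the column names
## [1] "Title"  "Rating"
```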

# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3]
colnames(movie_list_table) <- c("title", "rating")

# collapse newlines and runs of whitespace into single spaces
movie_list_table$title <- str_squish(movie_list_table$title)
# drop the ranking info that follows the release year, e.g. everything after "(2021)"
movie_list_table$title <- str_remove(movie_list_table$title, "(?<=\\(\\d{4}\\)).*")

movie_list_table
## # A tibble: 100 x 2
##    title                            rating
##    <chr>                             <dbl>
##  1 The Tomorrow War (2021)             6.7
##  2 F9: The Fast Saga (2021)            5.5
##  3 The Many Saints of Newark (2021)   NA  
##  4 Luca (2021)                         7.5
##  5 Fear Street Part 1: 1994 (2021)     6.2
##  6 The Ice Road (2021)                 5.5
##  7 A Quiet Place Part II (2020)        7.5
##  8 No Sudden Move (2021)               6.6
##  9 Black Widow (2021)                  6.7
## 10 In the Heights (2021)               7.5
## # … with 90 more rows
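When a title has the form "Name (year)", the name and the year can be split into separate columns with capture groups. This is a sketch on three titles copied from the chart above; the column names title and year are chosen here for illustration.

```r
library(stringr)

titles <- c("The Tomorrow War (2021)", "Luca (2021)", "A Quiet Place Part II (2020)")

# group 1 = everything before the year, group 2 = the four-digit year
parts <- str_match(titles, "^(.*) \\((\\d{4})\\)$")

# column 1 of the match matrix is the full match; groups start at column 2
movies <- data.frame(
  title = parts[, 2],
  year  = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
movies
```

Titles that do not end in a parenthesised year produce NA in both columns, which makes unexpected formats easy to spot.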

Lab assignment

This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment3.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).

You will lose 50% of the points if you use a different filename or put your code in multiple files.

In addition, lab assignments will be graded based on:

Lab Assignment 1/2

https://en.wikipedia.org/wiki/Atlanta

Use rvest and SelectorGadget to…

## # A tibble: 18 x 3
##    Census      Pop.    `%±`  
##  * <chr>       <chr>   <chr> 
##  1 1850        2,572   —     
##  2 1860        9,554   271.5%
##  3 1870        21,789  128.1%
##  4 1880        37,409  71.7% 
##  5 1890        65,533  75.2% 
##  6 1900        89,872  37.1% 
##  7 1910        154,839 72.3% 
##  8 1920        200,616 29.6% 
##  9 1930        270,366 34.8% 
## 10 1940        302,288 11.8% 
## 11 1950        331,314 9.6%  
## 12 1960        487,455 47.1% 
## 13 1970        495,039 1.6%  
## 14 1980        425,022 −14.1%
## 15 1990        394,017 −7.3% 
## 16 2000        416,474 5.7%  
## 17 2010        420,003 0.8%  
## 18 2019 (est.) 506,811 20.7%

Lab Assignment 2/2

Write a regular expression to extract email addresses from the following text:

@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk

z <- "@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk"
my_regex <- "put your regex here"
stringr::str_extract_all(z, my_regex)[[1]]
## [1] "smith77@gsu.edu"         "alex.smith@yahoo.com.uk"
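As a reminder of how str_extract_all() behaves, the sketch below applies it to the same text with a different pattern: it extracts the account handle at the start rather than the email addresses. The variable name handle_regex is chosen here for illustration.

```r
library(stringr)

z <- "@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk"

# (?<!\\w) is a negative lookbehind: the "@" must not follow a letter, digit, or "_",
# which excludes the "@" inside the email addresses
handle_regex <- "(?<!\\w)@\\w+"

# str_extract_all() returns a list with one element per input string
handles <- str_extract_all(z, handle_regex)[[1]]
handles
## [1] "@GeorgiaStateU"
```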