Click-and-Download

In the simplest case, the data you need is already on the internet in a tabular format. There are a couple of strategies here:

Use read.csv or read.table to read the data straight into R.

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)

##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827
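
read.csv is essentially read.table with CSV-friendly defaults. The sketch below illustrates this on a small, made-up CSV string (so it runs without internet access); the sample rows only imitate the data shown above.

```r
# A tiny CSV held in a string; the values are illustrative only
csv_text <- "make,model,mpg\namc,concord,22\nbuick,century,20"

# read.csv() assumes a comma separator and a header row
df_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.table() needs both spelled out to read the same data
df_tab <- read.table(text = csv_text, sep = ",", header = TRUE,
                     stringsAsFactors = FALSE)

str(df_csv)
all.equal(df_csv, df_tab)
```

For files that use another separator (tabs, semicolons), read.table with the matching sep argument is usually the more convenient choice.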

install-and-play

Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.

Package Rfacebook
Package twitteR
Package GoTr - R wrapper for An API of Ice And Fire

install.packages("devtools") # lets you install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)

## [1] "list"

characters_583

## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
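
The result is a nested list, so individual fields are reached with $ and [[ ]]. A minimal sketch, using a hand-built list (character_info is a made-up name) that copies only a few fields from the output above:

```r
# Stand-in for part of the list returned by the wrapper
character_info <- list(
  name = "Jon Snow",
  aliases = list("Lord Snow", "Lord Crow"),
  povBooks = list(
    "https://anapioficeandfire.com/api/books/1",
    "https://anapioficeandfire.com/api/books/2"
  )
)

character_info$name          # a single field
character_info$aliases[[2]]  # the second alias

# Flatten a list of strings into a character vector
aliases <- unlist(character_info$aliases)

# Keep only the last part of each URL, i.e. the book id
book_ids <- basename(unlist(character_info$povBooks))
book_ids
```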

API-query

Here you interact with a web API directly by sending requests to its URLs.

Package `httr`

httr is designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:

GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
PUT: update an existing resource. The payload may contain the updated data for the resource.
DELETE: delete an existing resource.

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).
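
A request can fail (wrong URL, server error), so it is usually worth checking the response before using it. The sketch below wraps GET() in a small helper; fetch_parsed is a hypothetical name, not an httr function, while stop_for_status() and content() are part of httr.

```r
library(httr)

# Hypothetical helper: fetch a URL and return the parsed body,
# raising an R error if the server reports a failure
fetch_parsed <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # turns 4xx/5xx responses into R errors
  content(resp)          # parses the body based on its content type
}

# Usage (needs internet access):
# jon <- fetch_parsed("https://anapioficeandfire.com/api/characters/583")
# jon$name
```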

install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content = content(characters_583)
class(characters_583_content)

## [1] "list"

characters_583_content

## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
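
The nested lists above are what JSON arrays look like when they are not simplified. Assuming the jsonlite package is installed (httr depends on it), the following sketch shows the difference on a short, hand-written JSON string in the style of the API response:

```r
library(jsonlite)

# A shortened, made-up JSON body
json_text <- '{"name": "Jon Snow", "aliases": ["Lord Snow", "Lord Crow"]}'

# Keep JSON arrays as R lists, as in the output above
as_list <- fromJSON(json_text, simplifyVector = FALSE)
class(as_list$aliases)

# Let jsonlite turn arrays of strings into character vectors
as_vector <- fromJSON(json_text, simplifyVector = TRUE)
class(as_vector$aliases)
```

The simplified form is often easier to work with when the goal is a data frame.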

Your turn

We are interested in the following repository:

Web URL: https://github.com/tidyverse/dplyr
API URL: https://api.github.com/repos/tidyverse/dplyr

Use httr to retrieve repository data from the GitHub API, and print the following information:

The number of watchers
The number of subscribers
The number of open issues
The language of the repository

Scraping

What if the data is present on a website but is not available through an API at all? It is possible to grab that information too. How easy that is depends a lot on the quality and structure of the website that we are scraping.

Two useful tools:

rvest: R package to easily harvest (scrape) web pages
SelectorGadget: Install in your browser

Package `rvest` overview

install.packages("rvest")
library(rvest)
library(stringr)
library(tidyverse)

The most important functions in rvest are:

Retrieve an html document from a URL, a file on disk or a string containing html with read_html().
Select parts of a document
- using css selectors: html_nodes(doc, css="table td")
- using xpath selectors: html_nodes(doc, xpath = "//table//td")
Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
Parse HTML tables into data frames with html_table().
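
These functions can be tried without internet access by passing read_html() a string. A minimal sketch; the page content and the example.com link are made up for illustration:

```r
library(rvest)

# A small, made-up page passed in as a string
doc <- read_html('
  <html><body>
    <a href="https://example.com/luca">Luca</a>
    <table>
      <tr><th>title</th><th>rating</th></tr>
      <tr><td>Luca</td><td>7.5</td></tr>
      <tr><td>Nobody</td><td>7.4</td></tr>
    </table>
  </body></html>')

# Select with a css selector, then extract components
link <- html_nodes(doc, css = "a")
html_name(link)          # name of the tag
html_text(link)          # text inside the tag
html_attr(link, "href")  # a single attribute

# Select with an xpath selector
cells <- html_nodes(doc, xpath = "//table//td")
html_text(cells)

# Parse the table; html_table() returns one element per table
movies <- html_table(html_nodes(doc, css = "table"))[[1]]
movies
```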

Use rvest to retrieve an html document

popular_movies <- read_html("https://www.imdb.com/chart/moviemeter/")
popular_movies

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Use rvest to select parts of a document using css selectors
- Use SelectorGadget to identify the css string

css selector for ratings: screenshot

ratings <- popular_movies %>% 
  html_nodes(css="strong") %>%
  html_text() %>%
  as.numeric()
ratings

##  [1] 6.7 5.5 7.5 6.2 5.5 7.5 6.6 7.4 7.5 7.4 5.6 7.4 6.7 7.4 6.3 6.0 5.9 6.4 5.7
## [20] 7.2 7.5 5.4 6.8 4.6 5.8 6.3 7.4 7.6 5.4 5.8 7.1 3.3 6.4 6.1 6.9 6.8 9.3 7.2
## [39] 7.5 8.1 8.4 7.6 6.5 6.2 7.1 6.2 8.3 7.9 8.9 7.6 9.2 7.4 4.8 5.3 7.1 4.9 7.1
## [58] 5.7 7.3 6.6 6.9 4.2 5.9 7.7 5.3 8.4 7.0 7.8 6.0 6.9 6.1 6.8 5.5 7.9 7.5 6.4
## [77] 5.4 9.0

css selector for Movie Titles: screenshot

MoviesTitles <- popular_movies %>%
  html_nodes(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)

## [1] ""                  ""                  "The Tomorrow War" 
## [4] ""                  "F9: The Fast Saga" ""

# drop the empty strings left by links that contain no text
MoviesTitles <- MoviesTitles[MoviesTitles != ""]
MoviesTitles

##   [1] "The Tomorrow War"                                 
##   [2] "F9: The Fast Saga"                                
##   [3] "The Many Saints of Newark"                        
##   [4] "Luca"                                             
##   [5] "Fear Street Part 1: 1994"                         
##   [6] "The Ice Road"                                     
##   [7] "A Quiet Place Part II"                            
##   [8] "No Sudden Move"                                   
##   [9] "Black Widow"                                      
##  [10] "In the Heights"                                   
##  [11] "Cruella"                                          
##  [12] "Good on Paper"                                    
##  [13] "Raya and the Last Dragon"                         
##  [14] "Fatherhood"                                       
##  [15] "Jolt"                                             
##  [16] "The Suicide Squad"                                
##  [17] "Nobody"                                           
##  [18] "The Hitman's Wife's Bodyguard"                    
##  [19] "The Boss Baby: Family Business"                   
##  [20] "The Forever Purge"                                
##  [21] "The Conjuring: The Devil Made Me Do It"           
##  [22] "Shang-Chi and the Legend of the Ten Rings"        
##  [23] "America: The Motion Picture"                      
##  [24] "Wrath of Man"                                     
##  [25] "A Quiet Place"                                    
##  [26] "Infinite"                                         
##  [27] "Zola"                                             
##  [28] "False Positive"                                   
##  [29] "Till Death"                                       
##  [30] "The Little Things"                                
##  [31] "Sing 2"                                           
##  [32] "Tenet"                                            
##  [33] "Once Upon a Time... In Hollywood"                 
##  [34] "Gaia"                                             
##  [35] "Don't Breathe 2"                                  
##  [36] "Army of the Dead"                                 
##  [37] "Midsommar"                                        
##  [38] "365 Days"                                         
##  [39] "Godzilla vs. Kong"                                
##  [40] "Werewolves Within"                                
##  [41] "Blood Red Sky"                                    
##  [42] "Old"                                              
##  [43] "Haseen Dillruba"                                  
##  [44] "Spider-Man: No Way Home"                          
##  [45] "Halloween Kills"                                  
##  [46] "The Fast and the Furious"                         
##  [47] "Space Jam: A New Legacy"                          
##  [48] "The Shawshank Redemption"                         
##  [49] "The Harder They Fall"                             
##  [50] "Candyman"                                         
##  [51] "Wish Dragon"                                      
##  [52] "Dune"                                             
##  [53] "Promising Young Woman"                            
##  [54] "Beckett"                                          
##  [55] "Zack Snyder's Justice League"                     
##  [56] "Avengers: Endgame"                                
##  [57] "Gone Baby Gone"                                   
##  [58] "Jurassic World: Dominion"                         
##  [59] "Lansky"                                           
##  [60] "Clifford the Big Red Dog"                         
##  [61] "Mortal Kombat"                                    
##  [62] "Don't Breathe"                                    
##  [63] "Peter Rabbit 2: The Runaway"                      
##  [64] "The Father"                                       
##  [65] "Cinderella"                                       
##  [66] "The Green Knight"                                 
##  [67] "Thor: Ragnarok"                                   
##  [68] "Pulp Fiction"                                     
##  [69] "Harry Potter and the Sorcerer's Stone"            
##  [70] "The Godfather"                                    
##  [71] "Nomadland"                                        
##  [72] "Awake"                                            
##  [73] "Spiral"                                           
##  [74] "Snake Eyes"                                       
##  [75] "Furious 7"                                        
##  [76] "The Superdeep"                                    
##  [77] "Silver Skates"                                    
##  [78] "The Woman in the Window"                          
##  [79] "Rurouni Kenshin: Final Chapter Part I - The Final"
##  [80] "The Fate of the Furious"                          
##  [81] "The Hitman's Bodyguard"                           
##  [82] "The Misfits"                                      
##  [83] "Censor"                                           
##  [84] "The Mitchells vs the Machines"                    
##  [85] "Spirit Untamed"                                   
##  [86] "Joker"                                            
##  [87] "Thor: Love and Thunder"                           
##  [88] "Death Proof"                                      
##  [89] "Another Round"                                    
##  [90] "Top Gun: Maverick"                                
##  [91] "The Fast and the Furious: Tokyo Drift"            
##  [92] "Fear Street Part Two: 1978"                       
##  [93] "Love"                                             
##  [94] "Yesterday"                                        
##  [95] "The Dead Don't Die"                               
##  [96] "Knives Out"                                       
##  [97] "The Conjuring"                                    
##  [98] "Fast & Furious Presents: Hobbs & Shaw"            
##  [99] "Wonder Woman 1984"                                
## [100] "The Dark Knight"
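
Note that the ratings vector above has 78 entries while there are 100 titles, so the two cannot simply be combined: movies without a rating are silently skipped. One possible way around this, sketched on a made-up two-row table (the class name "t" is an assumption, not IMDb's markup), is to select the rows first and then use html_node(), which returns one result per row and NA where nothing matches:

```r
library(rvest)

# Made-up table: the second movie has no rating
doc <- read_html('<table>
  <tr><td class="t">Movie A</td><td><strong>6.7</strong></td></tr>
  <tr><td class="t">Movie B</td><td></td></tr>
</table>')

rows <- html_nodes(doc, css = "tr")

# html_node() (singular) keeps one entry per row
titles <- rows %>% html_node(css = "td.t") %>% html_text()
ratings <- rows %>% html_node(css = "strong") %>% html_text() %>% as.numeric()

data.frame(title = titles, rating = ratings)
```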

Use rvest to extract html tags and their attributes

css selector for the poster of first movie: screenshot

poster_img_source <- popular_movies %>%
  html_nodes(css="tr:nth-child(1) img") %>%
  html_attr("src")
poster_img_source

## [1] "https://m.media-amazon.com/images/M/MV5BNTI2YTI0MWEtNGQ4OS00ODIzLWE1MWEtZGJiN2E3ZmM1OWI1XkEyXkFqcGdeQXVyODk4OTc3MTY@._V1_UY67_CR0,0,45,67_AL_.jpg"

Use rvest to parse html tables into data frames

top_movie_list <- popular_movies %>%
  html_nodes(css="table") %>%
  html_table()
top_movie_list # a list containing the table of the 100 most popular movies

## [[1]]
## # A tibble: 100 x 5
##    ``    `Rank & Title`            `IMDb Rating` `Your Rating`             ``   
##    <lgl> <chr>                             <dbl> <chr>                     <lgl>
##  1 NA    "The Tomorrow War\n     …           6.7 "12345678910\n        \n… NA   
##  2 NA    "F9: The Fast Saga\n    …           5.5 "12345678910\n        \n… NA   
##  3 NA    "The Many Saints of Newa…          NA   "12345678910\n        \n… NA   
##  4 NA    "Luca\n        (2021)\n …           7.5 "12345678910\n        \n… NA   
##  5 NA    "Fear Street Part 1: 199…           6.2 "12345678910\n        \n… NA   
##  6 NA    "The Ice Road\n        (…           5.5 "12345678910\n        \n… NA   
##  7 NA    "A Quiet Place Part II\n…           7.5 "12345678910\n        \n… NA   
##  8 NA    "No Sudden Move\n       …           6.6 "12345678910\n        \n… NA   
##  9 NA    "Black Widow\n        (2…           7.4 "12345678910\n        \n… NA   
## 10 NA    "In the Heights\n       …           7.5 "12345678910\n        \n… NA   
## # … with 90 more rows

movie_list_table <- top_movie_list[[1]] # the first element is the movie table
head(movie_list_table)

## # A tibble: 6 x 5
##   ``    `Rank & Title`            `IMDb Rating` `Your Rating`              ``   
##   <lgl> <chr>                             <dbl> <chr>                      <lgl>
## 1 NA    "The Tomorrow War\n     …           6.7 "12345678910\n        \n … NA   
## 2 NA    "F9: The Fast Saga\n    …           5.5 "12345678910\n        \n … NA   
## 3 NA    "The Many Saints of Newa…          NA   "12345678910\n        \n … NA   
## 4 NA    "Luca\n        (2021)\n …           7.5 "12345678910\n        \n … NA   
## 5 NA    "Fear Street Part 1: 199…           6.2 "12345678910\n        \n … NA   
## 6 NA    "The Ice Road\n        (…           5.5 "12345678910\n        \n … NA

# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3]
colnames(movie_list_table) <- c("title", "rating")

# collapse newlines and runs of whitespace into single spaces
movie_list_table$title <- str_squish(movie_list_table$title)

# remove ranking info; note that this pattern also cuts titles that
# contain digits (see rows 2 and 5 in the output below)
movie_list_table$title <- str_remove(
  movie_list_table$title,
  "([^(\\d\\d\\d\\d)]\\d.+)"
)

movie_list_table

## # A tibble: 100 x 2
##    title                              rating
##    <chr>                               <dbl>
##  1 "The Tomorrow War (2021)"             6.7
##  2 ""                                    5.5
##  3 "The Many Saints of Newark (2021)"   NA  
##  4 "Luca (2021)"                         7.5
##  5 "Fear Street Part"                    6.2
##  6 "The Ice Road (2021)"                 5.5
##  7 "A Quiet Place Part II (2020)"        7.5
##  8 "No Sudden Move (2021)"               6.6
##  9 "Black Widow (2021)"                  7.4
## 10 "In the Heights (2021)"               7.5
## # … with 90 more rows
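
Rows 2 and 5 above come out empty or truncated because those titles contain digits ("F9", "Part 1: 1994"). A possible alternative, sketched on sample strings, is to keep everything up to the first "(yyyy)" group instead of removing what follows a digit. The strings below only imitate the squished table text; the ranking details after the year are made up.

```r
library(stringr)

# Sample strings modeled on the table text; rank details are invented
raw_titles <- c(
  "The Tomorrow War (2021) 1 ( no change)",
  "F9: The Fast Saga (2021) 2 ( 1)",
  "Fear Street Part 1: 1994 (2021) 5 ( 3)"
)

# Keep the shortest prefix that ends with a four-digit year in parentheses
clean_titles <- str_extract(raw_titles, "^.*?\\(\\d{4}\\)")
clean_titles
```

str_extract() returns NA for a string without a "(yyyy)" group, which makes rows that do not fit the pattern easy to spot.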

Lab assignment

This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment3.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).

You will lose 50% of the points if you use a different filename or put your code in multiple files.

In addition, lab assignments will be graded based on:

Accuracy: whether the R script achieves the objectives
Readability: whether the R script is clean, well-formatted, and easily readable
- You risk losing 10 points if your code is not properly indented or has lines longer than 80 characters.

Lab Assignment 1/2

https://en.wikipedia.org/wiki/Atlanta

Use rvest and SelectorGadget to…

Get the Mayor’s name
Get the image source (URL) for the Atlanta City Hall
Get the paragraph for the 1996 Summer Olympic Games
Get the URLs of all the links in the External links section
Get the table “Historical population”, store it as a data.frame, and clean the data.frame so that it looks like the data.frame in the next slide
- Hint 1: Recall colnames() and rownames()
- Hint 2: df[ rows_you_want , columns_you_want ]

CIS 4730
Unstructured Data Management

Lab: Web scraping

Getting data from the Web
