Rongen Zhang
There are many ways to obtain data from the Internet; let’s consider four categories:
In the simplest case, the data you need is already on the internet in a tabular format. There are a couple of strategies here:
Use read.csv or read.table to read the data straight into R.
url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)
##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827
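read.table can read the same kind of file once it is told the separator. The sketch below writes a few of the rows shown above to a temporary file (a stand-in for the URL, so it runs offline) and reads it both ways. The temporary file and the variable names are illustrative only.

```r
# A local stand-in for the remote CSV (rows copied from the output above)
csv_lines <- c(
  "make,model,mpg,weight,price",
  "amc,concord,22,2930,4099",
  "amc,spirit,22,2640,3799",
  "buick,century,20,3250,4816"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# read.csv() is read.table() with defaults suited to comma-separated files
df_csv   <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)
df_table <- read.table(file = tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)

identical(df_csv, df_table)  # compare the two results
str(df_csv)                  # column types chosen by R
```

Replacing tmp with a URL string should work the same way, since both functions accept a URL in the file argument.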
Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.
Rfacebook
twitteR
GoTr - R wrapper for An API of Ice And Fire
install.packages("devtools") # allows you to install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)
## [1] "list"
characters_583
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
##
## $name
## [1] "Jon Snow"
##
## $gender
## [1] "Male"
##
## $culture
## [1] "Northmen"
##
## $born
## [1] "In 283 AC"
##
## $died
## [1] ""
##
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
##
##
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
##
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
##
## $aliases[[3]]
## [1] "The Snow of Winterfell"
##
## $aliases[[4]]
## [1] "The Crow-Come-Over"
##
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
##
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
##
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
##
## $aliases[[8]]
## [1] "Lord Crow"
##
##
## $father
## [1] ""
##
## $mother
## [1] ""
##
## $spouse
## [1] ""
##
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
##
##
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
##
##
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
##
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
##
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
##
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
##
##
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
##
## $tvSeries[[2]]
## [1] "Season 2"
##
## $tvSeries[[3]]
## [1] "Season 3"
##
## $tvSeries[[4]]
## [1] "Season 4"
##
## $tvSeries[[5]]
## [1] "Season 5"
##
## $tvSeries[[6]]
## [1] "Season 6"
##
##
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
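The result is a nested list, so individual fields are reached with $ and [[ ]]. A minimal sketch on a hand-built list that mirrors part of the output above; it is built locally (the name jon is illustrative) so that it does not need the GoTr package or a network connection.

```r
# Hand-built copy of a few fields from the response shown above
jon <- list(
  name     = "Jon Snow",
  culture  = "Northmen",
  aliases  = list("Lord Snow", "Ned Stark's Bastard", "Lord Crow"),
  playedBy = list("Kit Harington")
)

jon$name            # a single field
jon$aliases[[2]]    # one element of a nested list

# Flatten a list of strings into a character vector
aliases <- unlist(jon$aliases)

# Collect a few scalar values into a one-row data frame
summary_df <- data.frame(
  name      = jon$name,
  culture   = jon$culture,
  n_aliases = length(aliases),
  stringsAsFactors = FALSE
)
summary_df
```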
This is when you use URLs to interact with a web API.
httr
httr is designed to facilitate all things HTTP from within R, including the major HTTP verbs such as GET and POST.
httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).
install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content <- content(characters_583)
class(characters_583_content)
## [1] "list"
characters_583_content
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
##
## $name
## [1] "Jon Snow"
##
## $gender
## [1] "Male"
##
## $culture
## [1] "Northmen"
##
## $born
## [1] "In 283 AC"
##
## $died
## [1] ""
##
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
##
##
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
##
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
##
## $aliases[[3]]
## [1] "The Snow of Winterfell"
##
## $aliases[[4]]
## [1] "The Crow-Come-Over"
##
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
##
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
##
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
##
## $aliases[[8]]
## [1] "Lord Crow"
##
##
## $father
## [1] ""
##
## $mother
## [1] ""
##
## $spouse
## [1] ""
##
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
##
##
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
##
##
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
##
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
##
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
##
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
##
##
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
##
## $tvSeries[[2]]
## [1] "Season 2"
##
## $tvSeries[[3]]
## [1] "Season 3"
##
## $tvSeries[[4]]
## [1] "Season 4"
##
## $tvSeries[[5]]
## [1] "Season 5"
##
## $tvSeries[[6]]
## [1] "Season 6"
##
##
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
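In practice it helps to check that a request succeeded before using its content. A sketch, assuming the httr package is installed: get_character() is a hypothetical helper written for this example, not part of httr, and the call at the end needs network access, so it is left commented out.

```r
# Hypothetical helper: fetch one character from An API of Ice And Fire
get_character <- function(id) {
  url  <- paste0("https://anapioficeandfire.com/api/characters/", id)
  resp <- httr::GET(url)
  httr::stop_for_status(resp)         # raise an R error on HTTP 4xx/5xx
  httr::content(resp, as = "parsed")  # JSON body parsed into a nested list
}

# Usage (requires a network connection):
# jon <- get_character(583)
# jon$name
```

Wrapping the request in a function keeps the URL construction and the error check in one place when several ids are needed.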
We are interested in the following repository:
Use httr to retrieve repository data from the GitHub API, and print the following information:
What if data is present on a website, but is not provided in an API at all? It is possible to grab that information too. How easy that is depends a lot on the quality and structure of the website that we are scraping.
Two useful tools:
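rvest: R package to easily harvest (scrape) web pages
SelectorGadget: point-and-click tool to identify the css selector for parts of a page
rvest overview
The most important functions in rvest are:
Create an html document from a url, a file on disk or a string containing html with read_html().
Select parts of a document using css selectors, html_nodes(doc, css="table td"), or xpath selectors, html_nodes(doc, xpath = "//table//td").
Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
Parse tables into data frames with html_table().
Use rvest to retrieve an html document
## {html_document}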
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
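The rvest functions named in the overview can be tried on a small inline document before pointing them at a real page. A sketch, assuming rvest is installed; the html string and the variable names are made up for illustration.

```r
library(rvest)

# A tiny document supplied as a string instead of a URL
html <- '<html><body>
  <table>
    <tr><th>title</th><th>rating</th></tr>
    <tr><td><a href="/title/a">Movie A</a></td><td>8.3</td></tr>
    <tr><td><a href="/title/b">Movie B</a></td><td>5.8</td></tr>
  </table>
</body></html>'
doc <- read_html(html)

# Select the same cells with a css selector and with an xpath selector
cells_css   <- doc %>% html_nodes(css = "table td") %>% html_text()
cells_xpath <- doc %>% html_nodes(xpath = "//table//td") %>% html_text()

# Extract the tag name and one attribute of each link
links <- doc %>% html_nodes(css = "td a")
tags  <- html_name(links)
hrefs <- html_attr(links, "href")

# Parse the table; html_table() on a node set returns a list of tables
tbl <- (doc %>% html_nodes(css = "table") %>% html_table())[[1]]
tbl
```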
Use rvest to select parts of a document using css selectors
Use SelectorGadget to identify the css string
css selector for ratings: screenshot
## [1] 8.3 5.8 7.6 7.7 6.4 6.4 7.2 6.3 7.2 5.7 7.7 7.5 9.1 6.8 6.6 6.3 8.5 6.9 7.3
## [20] 6.4 5.0 6.9 7.9 6.9 6.7 5.8 6.3 7.3 7.3 8.0 6.8 6.3 6.3 5.4 7.4 6.1 7.2 6.3
## [39] 6.6 1.3 5.3 7.5 6.6 7.2 7.5 7.8 6.5 7.1 7.9 7.6 6.4 3.3 8.0 6.5 7.5 7.6 6.6
## [58] 6.3 8.6 8.4 6.9 8.0 8.4 8.6 5.2 7.6 6.0 6.1 6.6 5.4 9.3 7.6 9.0 7.4 4.8 5.8
## [77] 7.3 7.4 5.8 7.8 7.3 8.2
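Ratings come back from the page as text and need converting to numbers. The sketch below runs a select, extract, convert chain on a small inline document. The markup and the css string ".rating strong" are made up for illustration and are not the selector used for the page above.

```r
library(rvest)

# Made-up markup: three rows, the last one without a rating
page <- read_html('<table>
  <tr><td class="rating"><strong>8.3</strong></td></tr>
  <tr><td class="rating"><strong>5.8</strong></td></tr>
  <tr><td class="rating"></td></tr>
</table>')

ratings <- page %>%
  html_nodes(css = ".rating strong") %>%  # hypothetical selector
  html_text() %>%                         # character vector
  as.numeric()                            # numeric vector
ratings
```

A row with no matching node contributes nothing to the result, which may be why a ratings vector can be shorter than the number of movies on a page.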
css selector for Movie Titles: screenshot
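library(rvest)   # html_nodes(), html_text(), %>%
library(stringr) # str_trim()
# select every link inside the #main element and keep its trimmed text
MoviesTitles <- popular_movies %>%
  html_nodes(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)
## [1] "" "" "Dune" "" "The Batman"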
## [6] ""
# str_trim() has already turned whitespace-only entries into "", so one filter is enough
MoviesTitles <- MoviesTitles[MoviesTitles != ""]
MoviesTitles
## [1] "Dune"
## [2] "The Batman"
## [3] "Halloween Kills"
## [4] "No Time to Die"
## [5] "The Last Duel"
## [6] "Dune"
## [7] "Venom: Let There Be Carnage"
## [8] "Free Guy"
## [9] "Eternals"
## [10] "The Forgotten Battle"
## [11] "Rust"
## [12] "Night Teeth"
## [13] "Halloween"
## [14] "The French Dispatch"
## [15] "Sardar Udham"
## [16] "The Flash"
## [17] "Black Widow"
## [18] "Halloween"
## [19] "Red Notice"
## [20] "Uncharted"
## [21] "The Guilty"
## [22] "The Black Phone"
## [23] "The Trip"
## [24] "Last Night in Soho"
## [25] "Black Adam"
## [26] "The Many Saints of Newark"
## [27] "After We Fell"
## [28] "Hocus Pocus"
## [29] "Shang-Chi and the Legend of the Ten Rings"
## [30] "Scream"
## [31] "Titane"
## [32] "Venom"
## [33] "Old"
## [34] "Copshop"
## [35] "Scream"
## [36] "Halloween Ends"
## [37] "The Suicide Squad"
## [38] "Spider-Man: No Way Home"
## [39] "Casino Royale"
## [40] "Spectre"
## [41] "Injustice"
## [42] "Antlers"
## [43] "Warning"
## [44] "Cruella"
## [45] "Ghostbusters: Afterlife"
## [46] "Halloween"
## [47] "Ron's Gone Wrong"
## [48] "Malignant"
## [49] "The Green Knight"
## [50] "The Cost of Deception"
## [51] "The Addams Family 2"
## [52] "A Nightmare on Elm Street"
## [53] "Resident Evil: Welcome to Raccoon City"
## [54] "Lamb"
## [55] "Old Henry"
## [56] "Beetlejuice"
## [57] "Ambulance"
## [58] "Skyfall"
## [59] "The Night House"
## [60] "Dune: Part Two"
## [61] "Midsommar"
## [62] "Knives Out"
## [63] "Once Upon a Time... In Hollywood"
## [64] "Friday the 13th"
## [65] "365 Days"
## [66] "Blade Runner 2049"
## [67] "Halloween II"
## [68] "Promising Young Woman"
## [69] "Being the Ricardos"
## [70] "Harry Potter and the Sorcerer's Stone"
## [71] "The Lost Daughter"
## [72] "The Little Things"
## [73] "Parasite"
## [74] "Joker"
## [75] "The Addams Family"
## [76] "The Nightmare Before Christmas"
## [77] "Avengers: Endgame"
## [78] "Untitled the Munsters Reboot"
## [79] "Interstellar"
## [80] "F9: The Fast Saga"
## [81] "Belfast"
## [82] "Candyman"
## [83] "Wonka"
## [84] "Dear Evan Hansen"
## [85] "Quantum of Solace"
## [86] "Snake Eyes"
## [87] "The Shawshank Redemption"
## [88] "The Crow"
## [89] "Hocus Pocus 2"
## [90] "The Dark Knight"
## [91] "The Rocky Horror Picture Show"
## [92] "There's Someone Inside Your House"
## [93] "The Addams Family"
## [94] "It"
## [95] "Tenet"
## [96] "Halloween H20: 20 Years Later"
## [97] "Titanic"
## [98] "Home Sweet Home Alone"
## [99] "Poltergeist"
## [100] "The Wolf of Wall Street"
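The trimming and filtering steps can be checked on a small hand-written vector (made up for illustration). After str_trim(), whitespace-only entries become empty strings, so a single comparison against "" removes them.

```r
library(stringr)

# Made-up link texts: empty, padded with whitespace, or whitespace only
raw <- c("", "\n   Dune\n", " ", "The Batman", "")

trimmed <- str_trim(raw)          # strip leading and trailing whitespace
titles  <- trimmed[trimmed != ""] # drop entries that are now empty
titles
```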
Use rvest to extract html tags and their attributes
css selector for the poster of the first movie: screenshot
poster_img_source <- popular_movies %>%
  html_nodes(css="tr:nth-child(1) img") %>%
  html_attr("src")
poster_img_source
## [1] "https://m.media-amazon.com/images/M/MV5BN2FjNmEyNWMtYzM0ZS00NjIyLTg5YzYtYThlMGVjNzE1OGViXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_UY67_CR0,0,45,67_AL_.jpg"
Use rvest to parse html tables into data frames
top_movie_list <- popular_movies %>%
  html_nodes(css="table") %>%
  html_table()
top_movie_list # a list whose first element is the data frame of the top 100 movies by popularity
## [[1]]
## # A tibble: 100 x 5
## `` `Rank & Title` `IMDb Rating` `Your Rating` ``
## <lgl> <chr> <dbl> <chr> <lgl>
## 1 NA "Dune\n (2021)\n … 8.3 "12345678910\n \n… NA
## 2 NA "The Batman\n (20… NA "12345678910\n \n… NA
## 3 NA "Halloween Kills\n … 5.8 "12345678910\n \n… NA
## 4 NA "No Time to Die\n … 7.6 "12345678910\n \n… NA
## 5 NA "The Last Duel\n … 7.7 "12345678910\n \n… NA
## 6 NA "Dune\n (1984)\n … 6.4 "12345678910\n \n… NA
## 7 NA "Venom: Let There Be Car… 6.4 "12345678910\n \n… NA
## 8 NA "Free Guy\n (2021… 7.2 "12345678910\n \n… NA
## 9 NA "Eternals\n (2021… 6.3 "12345678910\n \n… NA
## 10 NA "The Forgotten Battle\n … 7.2 "12345678910\n \n… NA
## # … with 90 more rows
movie_list_table <- top_movie_list[[1]] # take the data frame out of the list
head(movie_list_table)
## # A tibble: 6 x 5
## `` `Rank & Title` `IMDb Rating` `Your Rating` ``
## <lgl> <chr> <dbl> <chr> <lgl>
## 1 NA "Dune\n (2021)\n … 8.3 "12345678910\n \n … NA
## 2 NA "The Batman\n (20… NA "12345678910\n \n … NA
## 3 NA "Halloween Kills\n … 5.8 "12345678910\n \n … NA
## 4 NA "No Time to Die\n … 7.6 "12345678910\n \n … NA
## 5 NA "The Last Duel\n … 7.7 "12345678910\n \n … NA
## 6 NA "Dune\n (1984)\n … 6.4 "12345678910\n \n … NA
# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3]
colnames(movie_list_table) <- c("title", "rating")
# collapse newlines and repeated whitespace into single spaces
movie_list_table$title <- str_squish(movie_list_table$title)
# keep the text up to the "(year)" and drop the ranking info that follows it
movie_list_table$title <- str_replace(movie_list_table$title, "(\\(\\d{4}\\)).*$", "\\1")
movie_list_table
## # A tibble: 100 x 2
## title rating
## <chr> <dbl>
## 1 Dune (2021) 8.3
## 2 The Batman (2022) NA
## 3 Halloween Kills (2021) 5.8
## 4 No Time to Die (2021) 7.6
## 5 The Last Duel (2021) 7.7
## 6 Dune (1984) 6.4
## 7 Venom: Let There Be Carnage (2021) 6.4
## 8 Free Guy (2021) 7.2
## 9 Eternals (2021) 6.3
## 10 The Forgotten Battle (2020) 7.2
## # … with 90 more rows
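The title clean-up can be tried on hand-written cells. The raw strings below are made up to resemble the table cells above, and the pattern, which keeps everything up to the four-digit year in parentheses, is one possible choice rather than the only one.

```r
library(stringr)

# Made-up cells: title, year, rank and rank change separated by newlines
raw_titles <- c(
  "Dune\n        (2021)\n        1\n(no change)",
  "Blade Runner 2049\n        (2017)\n        66\n(\n3)"
)

# Collapse newlines and runs of spaces into single spaces
squished <- str_squish(raw_titles)

# Keep the text up to and including "(yyyy)", drop whatever follows
clean <- str_replace(squished, "(\\(\\d{4}\\)).*$", "\\1")
clean
```

Anchoring on the parenthesised year leaves digits inside a title, such as the 2049 in the second cell, untouched.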
This lab assignment involves 2 tasks (see the following slides). Once you finish them, put everything in a single R file named assignment3.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).
You will lose 50% of the points if you use a different filename or put your code in multiple files.
In addition, lab assignments will be graded based on:
https://editorial.rottentomatoes.com/guide/best-wide-release-2016/
Use ‘rvest’ and ‘SelectorGadget’ to…
Write a regular expression to extract email addresses from the following text:
@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk
z <- "@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk"
my_regex <- "put your regex here"
stringr::str_extract_all(z, my_regex)[[1]]
## [1] "smith77@gsu.edu" "alex.smith@yahoo.com.uk"