Rongen Zhang
There are many ways to obtain data from the Internet; let’s consider four categories:
In the simplest case, the data you need is already on the internet in a tabular format. There are a couple of strategies here:
Use read.csv or read.table to read the data straight into R.
url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)
##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827
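To see how read.csv and read.table relate without needing a network connection, here is a minimal sketch. It writes a few of the rows printed above to a temporary file (the file is a stand-in, not the UCLA file itself) and reads it back both ways. read.table needs header and sep spelled out, because read.csv is essentially read.table with comma-separated defaults.

```r
# Temporary stand-in file; the rows are copied from the output shown above
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "make,model,mpg,weight,price",
  "amc,concord,22,2930,4099",
  "buick,century,20,3250,4816",
  "buick,electra,15,4080,7827"
), csv_path)

# read.csv assumes a header row and a comma separator
df_csv <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)

# read.table needs both stated explicitly
df_table <- read.table(file = csv_path, header = TRUE, sep = ",",
                       stringsAsFactors = FALSE)

all.equal(df_csv, df_table)  # compare the two results
str(df_csv)                  # inspect the column types R guessed
```

The same calls accept a URL in place of csv_path, as in the example above.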
Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.
- Rfacebook
- twitteR
- GoTr - R wrapper for An API of Ice And Fire
install.packages("devtools") # lets you install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)
## [1] "list"
characters_583
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
##
## $name
## [1] "Jon Snow"
##
## $gender
## [1] "Male"
##
## $culture
## [1] "Northmen"
##
## $born
## [1] "In 283 AC"
##
## $died
## [1] ""
##
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
##
##
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
##
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
##
## $aliases[[3]]
## [1] "The Snow of Winterfell"
##
## $aliases[[4]]
## [1] "The Crow-Come-Over"
##
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
##
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
##
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
##
## $aliases[[8]]
## [1] "Lord Crow"
##
##
## $father
## [1] ""
##
## $mother
## [1] ""
##
## $spouse
## [1] ""
##
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
##
##
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
##
##
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
##
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
##
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
##
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
##
##
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
##
## $tvSeries[[2]]
## [1] "Season 2"
##
## $tvSeries[[3]]
## [1] "Season 3"
##
## $tvSeries[[4]]
## [1] "Season 4"
##
## $tvSeries[[5]]
## [1] "Season 5"
##
## $tvSeries[[6]]
## [1] "Season 6"
##
##
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
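Because the wrapper returns a nested list, the individual fields have to be pulled out before they can be used in an analysis. The sketch below rebuilds a small part of the structure printed above by hand (so it runs without the GoTr package or a network connection; the object name jon is hypothetical) and shows the usual access patterns.

```r
# Hand-built list mirroring part of the structure printed above
jon <- list(
  name    = "Jon Snow",
  born    = "In 283 AC",
  died    = "",
  titles  = list("Lord Commander of the Night's Watch"),
  aliases = list("Lord Snow", "Ned Stark's Bastard", "Lord Crow")
)

jon$name       # a single field via $
jon[["born"]]  # the same idea via [[ ]]

# A list of strings becomes an ordinary character vector with unlist()
aliases <- unlist(jon$aliases)
length(aliases)

# Missing values arrive as empty strings, so test with nzchar()
is_dead <- nzchar(jon$died)
is_dead
```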
In this case, you build the URLs yourself to interact with a web API directly.
httr
httr is designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, such as GET, POST, PUT, and DELETE.
httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).
install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content <- content(characters_583)
class(characters_583_content)
## [1] "list"
characters_583_content
## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
##
## $name
## [1] "Jon Snow"
##
## $gender
## [1] "Male"
##
## $culture
## [1] "Northmen"
##
## $born
## [1] "In 283 AC"
##
## $died
## [1] ""
##
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
##
##
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
##
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
##
## $aliases[[3]]
## [1] "The Snow of Winterfell"
##
## $aliases[[4]]
## [1] "The Crow-Come-Over"
##
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
##
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
##
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
##
## $aliases[[8]]
## [1] "Lord Crow"
##
##
## $father
## [1] ""
##
## $mother
## [1] ""
##
## $spouse
## [1] ""
##
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
##
##
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
##
##
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
##
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
##
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
##
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
##
##
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
##
## $tvSeries[[2]]
## [1] "Season 2"
##
## $tvSeries[[3]]
## [1] "Season 3"
##
## $tvSeries[[4]]
## [1] "Season 4"
##
## $tvSeries[[5]]
## [1] "Season 5"
##
## $tvSeries[[6]]
## [1] "Season 6"
##
##
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
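The request above assumes that the call succeeds and that the server answers with JSON. A slightly more defensive version might look like the following sketch. The helper names build_character_url and fetch_character are hypothetical, and the code assumes the jsonlite package is installed alongside httr. The last lines parse a small hand-written JSON string, so that part can be tried offline.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: build the endpoint URL for one character id
build_character_url <- function(id) {
  sprintf("https://anapioficeandfire.com/api/characters/%d", as.integer(id))
}

# Hypothetical helper: request, check, and parse one character
fetch_character <- function(id) {
  resp <- GET(build_character_url(id), timeout(10))
  stop_for_status(resp)  # turn HTTP errors (404, 500, ...) into R errors
  if (http_type(resp) != "application/json") {
    stop("The API did not return JSON", call. = FALSE)
  }
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

build_character_url(583)

# Offline illustration of the parsing step on a hand-written JSON string
sample_json <- '{"name": "Jon Snow", "aliases": ["Lord Snow", "Lord Crow"]}'
parsed <- fromJSON(sample_json)
parsed$aliases
```

With a network connection, fetch_character(583)$name would then return the character's name. Checking the status before parsing makes a failed request stop with a clear error instead of producing a confusing list further down.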
We are interested in the following repository:
Use httr to retrieve repository data from the GitHub API, and print the following information:
What if data is present on a website, but is not provided in an API at all? It is possible to grab that information too. How easy that is depends a lot on the quality and structure of the website that we are scraping.
Two useful tools:
rvest: R package to easily harvest (scrape) web pages
rvest overview
The most important functions in rvest are:
- read_html(): create an html document.
- html_nodes(doc, css="table td") or html_nodes(doc, xpath = "//table//td"): select parts of a document using css or XPath selectors.
- html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes): extract components of the selected nodes.
- html_table(): parse html tables into data frames.
Use rvest to retrieve an html document
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
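The functions listed above can be tried without touching a live website, because read_html() also accepts a string of literal html. The following sketch uses a tiny hand-written page (hypothetical content, not the IMDb page) to show that a css selector and the equivalent XPath expression select the same nodes.

```r
library(rvest)

# Hand-written page used only for illustration
html_string <- '<html><body>
  <table>
    <tr><th>Title</th><th>Rating</th></tr>
    <tr><td class="title">Movie A</td><td class="rating">7.5</td></tr>
    <tr><td class="title">Movie B</td><td class="rating">6.2</td></tr>
  </table>
</body></html>'

doc <- read_html(html_string)
class(doc)

# The same cells selected two ways
cells_css   <- html_nodes(doc, css = "table td")
cells_xpath <- html_nodes(doc, xpath = "//table//td")
identical(html_text(cells_css), html_text(cells_xpath))

# Text comes back as character, so convert before doing arithmetic
ratings <- doc %>%
  html_nodes(css = "td.rating") %>%
  html_text() %>%
  as.numeric()
mean(ratings)
```

On a real page the only change is that read_html() receives the page URL, and the css string comes from inspecting the page, for example with SelectorGadget.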
Use rvest to select parts of a document using css selectors
Use SelectorGadget to identify the css string
css selector for ratings: screenshot
## [1] 6.7 5.5 7.5 6.2 5.5 7.5 6.6 7.4 7.5 7.4 5.6 7.4 6.7 7.4 6.3 6.0 5.9 6.4 5.7
## [20] 7.2 7.5 5.4 6.8 4.6 5.8 6.3 7.4 7.6 5.4 5.8 7.1 3.3 6.4 6.1 6.9 6.8 9.3 7.2
## [39] 7.5 8.1 8.4 7.6 6.5 6.2 7.1 6.2 8.3 7.9 8.9 7.6 9.2 7.4 4.8 5.3 7.1 4.9 7.1
## [58] 5.7 7.3 6.6 6.9 4.2 5.9 7.7 5.3 8.4 7.0 7.8 6.0 6.9 6.1 6.8 5.5 7.9 7.5 6.4
## [77] 5.4 9.0
css selector for Movie Titles: screenshot
library(stringr) # for str_trim()
MoviesTitles <- popular_movies %>%
  html_nodes(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)
## [1] "" "" "The Tomorrow War"
## [4] "" "F9: The Fast Saga" ""
MoviesTitles <- MoviesTitles[MoviesTitles != ""] # drop the empty strings left by links without text
MoviesTitles
## [1] "The Tomorrow War"
## [2] "F9: The Fast Saga"
## [3] "The Many Saints of Newark"
## [4] "Luca"
## [5] "Fear Street Part 1: 1994"
## [6] "The Ice Road"
## [7] "A Quiet Place Part II"
## [8] "No Sudden Move"
## [9] "Black Widow"
## [10] "In the Heights"
## [11] "Cruella"
## [12] "Good on Paper"
## [13] "Raya and the Last Dragon"
## [14] "Fatherhood"
## [15] "Jolt"
## [16] "The Suicide Squad"
## [17] "Nobody"
## [18] "The Hitman's Wife's Bodyguard"
## [19] "The Boss Baby: Family Business"
## [20] "The Forever Purge"
## [21] "The Conjuring: The Devil Made Me Do It"
## [22] "Shang-Chi and the Legend of the Ten Rings"
## [23] "America: The Motion Picture"
## [24] "Wrath of Man"
## [25] "A Quiet Place"
## [26] "Infinite"
## [27] "Zola"
## [28] "False Positive"
## [29] "Till Death"
## [30] "The Little Things"
## [31] "Sing 2"
## [32] "Tenet"
## [33] "Once Upon a Time... In Hollywood"
## [34] "Gaia"
## [35] "Don't Breathe 2"
## [36] "Army of the Dead"
## [37] "Midsommar"
## [38] "365 Days"
## [39] "Godzilla vs. Kong"
## [40] "Werewolves Within"
## [41] "Blood Red Sky"
## [42] "Old"
## [43] "Haseen Dillruba"
## [44] "Spider-Man: No Way Home"
## [45] "Halloween Kills"
## [46] "The Fast and the Furious"
## [47] "Space Jam: A New Legacy"
## [48] "The Shawshank Redemption"
## [49] "The Harder They Fall"
## [50] "Candyman"
## [51] "Wish Dragon"
## [52] "Dune"
## [53] "Promising Young Woman"
## [54] "Beckett"
## [55] "Zack Snyder's Justice League"
## [56] "Avengers: Endgame"
## [57] "Gone Baby Gone"
## [58] "Jurassic World: Dominion"
## [59] "Lansky"
## [60] "Clifford the Big Red Dog"
## [61] "Mortal Kombat"
## [62] "Don't Breathe"
## [63] "Peter Rabbit 2: The Runaway"
## [64] "The Father"
## [65] "Cinderella"
## [66] "The Green Knight"
## [67] "Thor: Ragnarok"
## [68] "Pulp Fiction"
## [69] "Harry Potter and the Sorcerer's Stone"
## [70] "The Godfather"
## [71] "Nomadland"
## [72] "Awake"
## [73] "Spiral"
## [74] "Snake Eyes"
## [75] "Furious 7"
## [76] "The Superdeep"
## [77] "Silver Skates"
## [78] "The Woman in the Window"
## [79] "Rurouni Kenshin: Final Chapter Part I - The Final"
## [80] "The Fate of the Furious"
## [81] "The Hitman's Bodyguard"
## [82] "The Misfits"
## [83] "Censor"
## [84] "The Mitchells vs the Machines"
## [85] "Spirit Untamed"
## [86] "Joker"
## [87] "Thor: Love and Thunder"
## [88] "Death Proof"
## [89] "Another Round"
## [90] "Top Gun: Maverick"
## [91] "The Fast and the Furious: Tokyo Drift"
## [92] "Fear Street Part Two: 1978"
## [93] "Love"
## [94] "Yesterday"
## [95] "The Dead Don't Die"
## [96] "Knives Out"
## [97] "The Conjuring"
## [98] "Fast & Furious Presents: Hobbs & Shaw"
## [99] "Wonder Woman 1984"
## [100] "The Dark Knight"
Use rvest to extract html tags and their attributes
css selector for the poster of the first movie: screenshot
poster_img_source <- popular_movies %>%
  html_nodes(css="tr:nth-child(1) img") %>%
  html_attr("src")
poster_img_source
## [1] "https://m.media-amazon.com/images/M/MV5BNTI2YTI0MWEtNGQ4OS00ODIzLWE1MWEtZGJiN2E3ZmM1OWI1XkEyXkFqcGdeQXVyODk4OTc3MTY@._V1_UY67_CR0,0,45,67_AL_.jpg"
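The difference between the tag name, one attribute, and all attributes is easier to see on a small example. This is a sketch on hand-written html (the file names and alt texts are made up), using the same tr:nth-child(1) img selector as above.

```r
library(rvest)

# Hand-written table with one poster image per row
doc <- read_html('<html><body><table>
  <tr><td><img src="poster-a.jpg" alt="Movie A" width="45"></td></tr>
  <tr><td><img src="poster-b.jpg" alt="Movie B" width="45"></td></tr>
</table></body></html>')

first_img <- doc %>% html_nodes(css = "tr:nth-child(1) img")

html_name(first_img)           # the tag name
html_attr(first_img, "src")    # one attribute
html_attrs(first_img)          # all attributes, one named vector per node
html_attr(first_img, "title")  # an attribute that is absent gives NA

# Dropping the row restriction returns the attribute for every image
all_src <- doc %>% html_nodes(css = "img") %>% html_attr("src")
all_src
```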
Use rvest to parse html tables into data frames
top_movie_list <- popular_movies %>%
  html_nodes(css="table") %>%
  html_table()
top_movie_list # a list whose first element is the data frame of the top 100 movies by popularity
## [[1]]
## # A tibble: 100 x 5
## `` `Rank & Title` `IMDb Rating` `Your Rating` ``
## <lgl> <chr> <dbl> <chr> <lgl>
## 1 NA "The Tomorrow War\n … 6.7 "12345678910\n \n… NA
## 2 NA "F9: The Fast Saga\n … 5.5 "12345678910\n \n… NA
## 3 NA "The Many Saints of Newa… NA "12345678910\n \n… NA
## 4 NA "Luca\n (2021)\n … 7.5 "12345678910\n \n… NA
## 5 NA "Fear Street Part 1: 199… 6.2 "12345678910\n \n… NA
## 6 NA "The Ice Road\n (… 5.5 "12345678910\n \n… NA
## 7 NA "A Quiet Place Part II\n… 7.5 "12345678910\n \n… NA
## 8 NA "No Sudden Move\n … 6.6 "12345678910\n \n… NA
## 9 NA "Black Widow\n (2… 7.4 "12345678910\n \n… NA
## 10 NA "In the Heights\n … 7.5 "12345678910\n \n… NA
## # … with 90 more rows
movie_list_table <- top_movie_list[[1]] # extract the data frame from the list
head(movie_list_table)
## # A tibble: 6 x 5
## `` `Rank & Title` `IMDb Rating` `Your Rating` ``
## <lgl> <chr> <dbl> <chr> <lgl>
## 1 NA "The Tomorrow War\n … 6.7 "12345678910\n \n … NA
## 2 NA "F9: The Fast Saga\n … 5.5 "12345678910\n \n … NA
## 3 NA "The Many Saints of Newa… NA "12345678910\n \n … NA
## 4 NA "Luca\n (2021)\n … 7.5 "12345678910\n \n … NA
## 5 NA "Fear Street Part 1: 199… 6.2 "12345678910\n \n … NA
## 6 NA "The Ice Road\n (… 5.5 "12345678910\n \n … NA
# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3]
colnames(movie_list_table) <- c("title", "rating")
movie_list_table$title <- str_remove(movie_list_table$title,"[\r\\n]") # remove the first line break
movie_list_table$title <- str_squish(movie_list_table$title) # collapse repeated whitespace within the string
# remove ranking info; note that this pattern also cuts titles that contain digits (rows 2 and 5 below)
movie_list_table$title <- str_remove(movie_list_table$title,"([^(\\d\\d\\d\\d)]\\d.+)")
movie_list_table
## # A tibble: 100 x 2
## title rating
## <chr> <dbl>
## 1 "The Tomorrow War (2021)" 6.7
## 2 "" 5.5
## 3 "The Many Saints of Newark (2021)" NA
## 4 "Luca (2021)" 7.5
## 5 "Fear Street Part" 6.2
## 6 "The Ice Road (2021)" 5.5
## 7 "A Quiet Place Part II (2020)" 7.5
## 8 "No Sudden Move (2021)" 6.6
## 9 "Black Widow (2021)" 7.4
## 10 "In the Heights (2021)" 7.5
## # … with 90 more rows
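Rows 2 and 5 of the output above lose part or all of their title, because the pattern starts removing text at the first digit that follows a character other than a digit or parenthesis, which happens inside "F9" and "Part 1". One possible alternative, sketched below on made-up strings modelled on the "Rank & Title" column (the years and ranking text are assumptions, not scraped values), is to keep everything up to and including the first four-digit year in parentheses instead of deleting what follows it.

```r
library(stringr)

# Hypothetical raw cells modelled on the "Rank & Title" column
raw_titles <- c(
  "The Tomorrow War\n        (2021)\n        1\n(no change)",
  "F9: The Fast Saga\n        (2021)\n        2\n(up 1)",
  "Fear Street Part 1: 1994\n        (2021)\n        5\n(up 3)"
)

# Collapse line breaks and repeated whitespace first
squished <- str_squish(raw_titles)

# Keep the shortest prefix that ends in "(dddd)"
clean_titles <- str_extract(squished, "^.*?\\(\\d{4}\\)")
clean_titles
```

The lazy quantifier .*? stops at the first parenthesised year, so digits inside the title itself are left alone. A title without a year in parentheses would come back as NA, which makes such rows easy to spot.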
This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment3.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).
You will lose 50% of the points if you use a different filename or put your code in multiple files.
In addition, lab assignments will be graded based on:
https://en.wikipedia.org/wiki/Atlanta
Use rvest and SelectorGadget to…
colnames() and rownames()
df[ rows_you_want , columns_you_want ]
## # A tibble: 18 x 3
## Census Pop. `%±`
## * <chr> <chr> <chr>
## 1 1850 2,572 —
## 2 1860 9,554 271.5%
## 3 1870 21,789 128.1%
## 4 1880 37,409 71.7%
## 5 1890 65,533 75.2%
## 6 1900 89,872 37.1%
## 7 1910 154,839 72.3%
## 8 1920 200,616 29.6%
## 9 1930 270,366 34.8%
## 10 1940 302,288 11.8%
## 11 1950 331,314 9.6%
## 12 1960 487,455 47.1%
## 13 1970 495,039 1.6%
## 14 1980 425,022 −14.1%
## 15 1990 394,017 −7.3%
## 16 2000 416,474 5.7%
## 17 2010 420,003 0.8%
## 18 2019 (est.) 506,811 20.7%
Write a regular expression to extract email addresses from the following text:
@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk
z <- "@GeorgiaStateU: My two email addresses are smith77@gsu.edu and alex.smith@yahoo.com.uk"
my_regex <- "put your regex here"
stringr::str_extract_all(z, my_regex)[[1]]
## [1] "smith77@gsu.edu" "alex.smith@yahoo.com.uk"