CIS 4730 Unstructured Data Management

Getting data from the Web

There are many ways to obtain data from the Internet; let’s consider four categories:

click-and-download: the data is available on the internet as a “flat” file, such as .csv or .xls
install-and-play: the data sits behind an API for which someone has written a handy R package
API-query: the data is published through an API that has no R wrapper
Scraping: the data is implicit in an html website

Click-and-Download

In the simplest case, the data you need is already on the internet in a tabular format. There are a couple of strategies here:

Use read.csv or read.table from base R, or read_csv from readr (part of the tidyverse), to read the data straight into R.

url <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
df <- read.csv(file=url, header=TRUE, stringsAsFactors=FALSE)
head(df)

##    make   model mpg weight price
## 1   amc concord  22   2930  4099
## 2   amc   oacer  17   3350  4749
## 3   amc  spirit  22   2640  3799
## 4 buick century  20   3250  4816
## 5 buick electra  15   4080  7827

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

df <- read_csv(file=url)

## Rows: 5 Columns: 5

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): make, model
## dbl (3): mpg, weight, price
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

df

## # A tibble: 5 x 5
##   make  model     mpg weight price
##   <chr> <chr>   <dbl>  <dbl> <dbl>
## 1 amc   concord    22   2930  4099
## 2 amc   oacer      17   3350  4749
## 3 amc   spirit     22   2640  3799
## 4 buick century    20   3250  4816
## 5 buick electra    15   4080  7827
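To see what the header and stringsAsFactors arguments do without depending on a remote file, here is a minimal sketch that runs the same read.csv call on in-memory text. The csv_text string and the df_demo name are made up for illustration; the rows are copied from the output above.

```r
# Hand-written CSV text (illustrative; two rows copied from the output above)
csv_text <- "make,model,mpg,weight,price
amc,concord,22,2930,4099
buick,century,20,3250,4816"

# header = TRUE uses the first line as column names;
# stringsAsFactors = FALSE keeps text columns as character vectors
df_demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Inspect the column types that read.csv inferred
sapply(df_demo, class)
```

Replacing the text argument with file = url gives the call used earlier in this section.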

install-and-play

Many common web services and APIs have been “wrapped”, i.e. R functions have been written around them which send your query to the server and format the response.

Package Rfacebook
Package twitteR
Package GoTr - R wrapper for An API of Ice And Fire

install.packages("devtools") # devtools lets you install R packages from GitHub
devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
characters_583 <- got_api(type = "characters", id = 583)
class(characters_583)

## [1] "list"

characters_583

## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"

API-query

Here there is no wrapper package: you build the request URLs yourself and interact with the web API directly.

Package `httr`

httr is designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:

GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
PUT: update an existing resource. The payload may contain the updated data for the resource.
DELETE: delete an existing resource.
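The point that a GET URL carries everything the server needs can be made concrete by building and taking apart a URL, with no request sent. This is a sketch only: the name query parameter is an assumption about this particular API, used for illustration, and query_url and parts are made-up variable names.

```r
library(httr)

base_url <- "https://anapioficeandfire.com/api/characters"

# Add a query string to the base URL. The `name` filter is an assumption
# about this API, used only to show how parameters travel in the URL.
query_url <- modify_url(base_url, query = list(name = "Jon Snow"))
query_url

# Split the URL back into the parts the server uses to locate the resource
parts <- parse_url(query_url)
parts$hostname
parts$path
names(parts$query)

# Sending the request would then be a single call (needs network access):
# response <- GET(query_url)
# status_code(response)
```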

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).

install.packages("httr")
library(httr)
characters_583 <- GET("https://anapioficeandfire.com/api/characters/583")
characters_583_content = content(characters_583)
class(characters_583_content)

## [1] "list"

characters_583_content

## $url
## [1] "https://anapioficeandfire.com/api/characters/583"
## 
## $name
## [1] "Jon Snow"
## 
## $gender
## [1] "Male"
## 
## $culture
## [1] "Northmen"
## 
## $born
## [1] "In 283 AC"
## 
## $died
## [1] ""
## 
## $titles
## $titles[[1]]
## [1] "Lord Commander of the Night's Watch"
## 
## 
## $aliases
## $aliases[[1]]
## [1] "Lord Snow"
## 
## $aliases[[2]]
## [1] "Ned Stark's Bastard"
## 
## $aliases[[3]]
## [1] "The Snow of Winterfell"
## 
## $aliases[[4]]
## [1] "The Crow-Come-Over"
## 
## $aliases[[5]]
## [1] "The 998th Lord Commander of the Night's Watch"
## 
## $aliases[[6]]
## [1] "The Bastard of Winterfell"
## 
## $aliases[[7]]
## [1] "The Black Bastard of the Wall"
## 
## $aliases[[8]]
## [1] "Lord Crow"
## 
## 
## $father
## [1] ""
## 
## $mother
## [1] ""
## 
## $spouse
## [1] ""
## 
## $allegiances
## $allegiances[[1]]
## [1] "https://anapioficeandfire.com/api/houses/362"
## 
## 
## $books
## $books[[1]]
## [1] "https://anapioficeandfire.com/api/books/5"
## 
## 
## $povBooks
## $povBooks[[1]]
## [1] "https://anapioficeandfire.com/api/books/1"
## 
## $povBooks[[2]]
## [1] "https://anapioficeandfire.com/api/books/2"
## 
## $povBooks[[3]]
## [1] "https://anapioficeandfire.com/api/books/3"
## 
## $povBooks[[4]]
## [1] "https://anapioficeandfire.com/api/books/8"
## 
## 
## $tvSeries
## $tvSeries[[1]]
## [1] "Season 1"
## 
## $tvSeries[[2]]
## [1] "Season 2"
## 
## $tvSeries[[3]]
## [1] "Season 3"
## 
## $tvSeries[[4]]
## [1] "Season 4"
## 
## $tvSeries[[5]]
## [1] "Season 5"
## 
## $tvSeries[[6]]
## [1] "Season 6"
## 
## 
## $playedBy
## $playedBy[[1]]
## [1] "Kit Harington"
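The nested list printed above is what you get when a JSON body is parsed; httr relies on the jsonlite package for this. The sketch below reproduces that shape offline and pulls fields out of it. The json_text string is a hand-written imitation of part of the response (the real one has more fields), and character_info and aliases are illustrative names.

```r
library(jsonlite)

# Hand-written JSON imitating part of the response above (an assumption;
# the real response has more fields)
json_text <- '{
  "name": "Jon Snow",
  "gender": "Male",
  "aliases": ["Lord Snow", "Lord Crow"],
  "playedBy": ["Kit Harington"]
}'

# simplifyVector = FALSE keeps JSON arrays as R lists, matching the
# nested-list shape printed above
character_info <- fromJSON(json_text, simplifyVector = FALSE)

class(character_info)                      # a plain R list
character_info$name                        # a single field
aliases <- unlist(character_info$aliases)  # flatten a list of strings
aliases
```

The same $ and unlist() steps apply to a list returned by content().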

Your turn

We are interested in the following repository:

Web URL: https://github.com/tidyverse/dplyr
API URL: https://api.github.com/repos/tidyverse/dplyr

Use httr to retrieve repository data from the GitHub API, and print the following information:

The number of watchers
The number of subscribers
The number of open issues
The language of the repository

## Select info from dplyr github repository

## watchers_count:  4038

## subscribers_count:  252

## open_issues_count:  123

## language:  R

Scraping

What if data is present on a website, but is not provided in an API at all? It is possible to grab that information too. How easy that is depends a lot on the quality and structure of the website that we are scraping.

Two useful tools:

rvest: R package to easily harvest (scrape) web pages
SelectorGadget: Install in your browser

Package `rvest` overview

install.packages("rvest")
library(rvest)
library(stringr)
library(tidyverse)

The most important functions in rvest are:

Retrieve an html document from a URL, a file on disk, or a string containing html with read_html().
Select parts of a document with html_elements():
- Using css selectors: html_elements(doc, css = "table td")
- Using xpath selectors: html_elements(doc, xpath = "//table//td")
Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
Parse HTML tables into data frames with html_table().
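The functions listed above can be tried on a small html string before pointing them at a real website. The page below is hand-written for illustration and is not taken from any site; page, cells_css, cells_xpath, links and movies are made-up names.

```r
library(rvest)

# A tiny hand-written page (an assumption for illustration, not a real site)
page <- read_html('
  <html><body>
    <h1 id="title">Movies</h1>
    <table>
      <tr><th>title</th><th>rating</th></tr>
      <tr><td><a href="/batman">The Batman</a></td><td>8.4</td></tr>
      <tr><td><a href="/dune">Dune</a></td><td>8.1</td></tr>
    </table>
  </body></html>')

# Select the same cells with a css selector and with an xpath selector
cells_css   <- html_elements(page, css = "table td")
cells_xpath <- html_elements(page, xpath = "//table//td")

html_text(cells_css)             # text inside each <td>
html_name(cells_css)             # tag name of each selected node

links <- html_elements(page, css = "td a")
html_attr(links, "href")         # a single attribute
html_attrs(links)                # all attributes, one named vector per node

movies <- html_table(page)[[1]]  # html_table() on a document returns a list
movies
```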

Use rvest to retrieve an html document

popular_movies <- read_html("https://www.imdb.com/chart/moviemeter/")
popular_movies

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Use rvest to select parts of a document using css selectors

Use SelectorGadget to identify the css string

css selector for ratings: screenshot

ratings <- popular_movies %>% 
  html_elements(css="strong") %>%
  html_text() %>%
  as.numeric()
ratings

##  [1] 8.4 6.8 7.1 8.3 8.5 7.4 5.4 7.4 6.9 6.7 8.1 7.1 6.5 6.5 7.4 6.7 7.3 6.3 7.2
## [20] 9.1 7.0 7.3 8.0 5.7 9.1 5.7 7.2 6.6 3.9 5.6 6.7 7.0 7.5 6.4 6.6 7.3 7.2 7.6
## [39] 7.7 8.4 8.3 9.2 9.3 5.8 8.3 7.6 7.9 8.4 6.3 7.2 6.9 8.7 4.7 5.7 7.3 7.3 6.5
## [58] 5.2 7.5 7.5 7.4 7.3 6.1 7.2 8.4 7.2 5.2 8.4 6.7 7.5 7.4 7.8 4.7 6.7 6.9 7.5
## [77] 7.3 5.4 4.8 7.4 8.2 5.8 8.6 7.4 7.4 7.7 6.3 7.3 3.7 7.4

css selector for Movie Titles: screenshot

MoviesTitles <- popular_movies %>%
  html_elements(css="#main a") %>%
  html_text() %>%
  str_trim()
head(MoviesTitles)

## [1] ""                 ""                 "The Batman"       ""                
## [5] "The Adam Project" ""

# str_trim() already stripped the whitespace, so only empty strings remain to drop
MoviesTitles = MoviesTitles[MoviesTitles != ""]
MoviesTitles

##   [1] "The Batman"                                 
##   [2] "The Adam Project"                           
##   [3] "Turning Red"                                
##   [4] "The Kashmir Files"                          
##   [5] "Spider-Man: No Way Home"                    
##   [6] "X"                                          
##   [7] "Deep Water"                                 
##   [8] "West Side Story"                            
##   [9] "The Power of the Dog"                       
##  [10] "Fresh"                                      
##  [11] "Dune"                                       
##  [12] "Nightmare Alley"                            
##  [13] "Dog"                                        
##  [14] "Scream"                                     
##  [15] "Licorice Pizza"                             
##  [16] "Uncharted"                                  
##  [17] "Encanto"                                    
##  [18] "The King's Man"                             
##  [19] "Free Guy"                                   
##  [20] "The Dark Knight"                            
##  [21] "The Lost City"                              
##  [22] "Belfast"                                    
##  [23] "CODA"                                       
##  [24] "Black Crab"                                 
##  [25] "Sonic the Hedgehog 2"                       
##  [26] "The Unbearable Weight of Massive Talent"    
##  [27] "Windfall"                                   
##  [28] "Don't Look Up"                              
##  [29] "Death on the Nile"                          
##  [30] "Cheaper by the Dozen"                       
##  [31] "Doctor Strange in the Multiverse of Madness"
##  [32] "The Weekend Away"                           
##  [33] "House of Gucci"                             
##  [34] "Gangubai Kathiawadi"                        
##  [35] "Sing 2"                                     
##  [36] "Eternals"                                   
##  [37] "Ambulance"                                  
##  [38] "The Shadow in My Eye"                       
##  [39] "The French Dispatch"                        
##  [40] "Once Upon a Time... In Hollywood"           
##  [41] "Drive My Car"                               
##  [42] "Everything Everywhere All at Once"          
##  [43] "Morbius"                                    
##  [44] "Batman Begins"                              
##  [45] "The Godfather"                              
##  [46] "The Shawshank Redemption"                   
##  [47] "Studio 666"                                 
##  [48] "Jujutsu Kaisen 0: The Movie"                
##  [49] "West Side Story"                            
##  [50] "The Worst Person in the World"              
##  [51] "The Dark Knight Rises"                      
##  [52] "The Cursed"                                 
##  [53] "Ghostbusters: Afterlife"                    
##  [54] "Radhe Shyam"                                
##  [55] "The Tashkent Files"                         
##  [56] "Blacklight"                                 
##  [57] "The Matrix Resurrections"                   
##  [58] "Fantastic Beasts: The Secrets of Dumbledore"
##  [59] "Kiss of the Spider Woman"                   
##  [60] "No Time to Die"                             
##  [61] "Against the Ice"                            
##  [62] "F9: The Fast Saga"                          
##  [63] "The Outfit"                                 
##  [64] "King Richard"                               
##  [65] "The Last Duel"                              
##  [66] "Good Time"                                  
##  [67] "No Exit"                                    
##  [68] "The Suicide Squad"                          
##  [69] "Joker"                                      
##  [70] "Red Rocket"                                 
##  [71] "Umma"                                       
##  [72] "Avengers: Endgame"                          
##  [73] "Puss in Boots: The Last Wish"               
##  [74] "Spencer"                                    
##  [75] "Batman"                                     
##  [76] "Body Heat"                                  
##  [77] "Top Gun: Maverick"                          
##  [78] "Dunkirk"                                    
##  [79] "Master"                                     
##  [80] "The Lost Daughter"                          
##  [81] "Top Gun"                                    
##  [82] "tick, tick...BOOM!"                         
##  [83] "Badhaai Do"                                 
##  [84] "Thor: Love and Thunder"                     
##  [85] "Batman Forever"                             
##  [86] "Texas Chainsaw Massacre"                    
##  [87] "Nobody"                                     
##  [88] "The Wolf of Wall Street"                    
##  [89] "Old"                                        
##  [90] "Jhund"                                      
##  [91] "Scream"                                     
##  [92] "The Northman"                               
##  [93] "Avatar 2"                                   
##  [94] "Spider-Man"                                 
##  [95] "Kingsman: The Secret Service"               
##  [96] "Bullet Train"                               
##  [97] "Cyrano"                                     
##  [98] "After Love"                                 
##  [99] "Batman & Robin"                             
## [100] "Tenet"

Use html_attrs() and html_attr() to extract html tags and their attributes

css selector for the poster of first movie: screenshot

popular_movies %>%
  html_elements(css="tr:nth-child(1) img") %>%
  html_attrs() %>% # html_attrs() returns all available attributes
  data.frame()

##        c.src....https...m.media.amazon.com.images.M.MV5BMDdmMTBiNTYtMDIzNi00NGVlLWIzMDYtZTk3MTQ3NGQxZGEwXkEyXkFqcGdeQXVyMzMwOTU5MDk.._V1_UY67_CR0.0.45.67_AL_.jpg...
## src                https://m.media-amazon.com/images/M/MV5BMDdmMTBiNTYtMDIzNi00NGVlLWIzMDYtZTk3MTQ3NGQxZGEwXkEyXkFqcGdeQXVyMzMwOTU5MDk@._V1_UY67_CR0,0,45,67_AL_.jpg
## width                                                                                                                                                             45
## height                                                                                                                                                            67
## alt                                                                                                                                                       The Batman

poster_img_source <- popular_movies %>%
  html_elements(css="tr:nth-child(1) img") %>%
  html_attr("src") # select "src" from the result above
poster_img_source

## [1] "https://m.media-amazon.com/images/M/MV5BMDdmMTBiNTYtMDIzNi00NGVlLWIzMDYtZTk3MTQ3NGQxZGEwXkEyXkFqcGdeQXVyMzMwOTU5MDk@._V1_UY67_CR0,0,45,67_AL_.jpg"

Use rvest to parse html tables into data frames

top_movie_list <- popular_movies %>%
  html_elements(css="table") %>%
  html_table()
# this is a list holding the data frame (tibble)
# for the top 100 movies based on popularity
top_movie_list

## [[1]]
## # A tibble: 100 x 5
##    ``    `Rank & Title`                        `IMDb Rating` `Your Rating` ``   
##    <lgl> <chr>                                         <dbl> <chr>         <lgl>
##  1 NA    "The Batman\n        (2022)\n       ~           8.4 "12345678910~ NA   
##  2 NA    "The Adam Project\n        (2022)\n ~           6.8 "12345678910~ NA   
##  3 NA    "Turning Red\n        (2022)\n      ~           7.1 "12345678910~ NA   
##  4 NA    "The Kashmir Files\n        (2022)\n~           8.3 "12345678910~ NA   
##  5 NA    "Spider-Man: No Way Home\n        (2~           8.5 "12345678910~ NA   
##  6 NA    "X\n        (2022)\n            6\n(~           7.4 "12345678910~ NA   
##  7 NA    "Deep Water\n        (2022)\n       ~           5.4 "12345678910~ NA   
##  8 NA    "West Side Story\n        (2021)\n  ~           7.4 "12345678910~ NA   
##  9 NA    "The Power of the Dog\n        (2021~           6.9 "12345678910~ NA   
## 10 NA    "Fresh\n        (2022)\n            ~           6.7 "12345678910~ NA   
## # ... with 90 more rows

class(top_movie_list)

## [1] "list"

movie_list_table <- top_movie_list[[1]] # the first (and only) element is the movie table
head(movie_list_table)

## # A tibble: 6 x 5
##   ``    `Rank & Title`                         `IMDb Rating` `Your Rating` ``   
##   <lgl> <chr>                                          <dbl> <chr>         <lgl>
## 1 NA    "The Batman\n        (2022)\n        ~           8.4 "12345678910~ NA   
## 2 NA    "The Adam Project\n        (2022)\n  ~           6.8 "12345678910~ NA   
## 3 NA    "Turning Red\n        (2022)\n       ~           7.1 "12345678910~ NA   
## 4 NA    "The Kashmir Files\n        (2022)\n ~           8.3 "12345678910~ NA   
## 5 NA    "Spider-Man: No Way Home\n        (20~           8.5 "12345678910~ NA   
## 6 NA    "X\n        (2022)\n            6\n(\~           7.4 "12345678910~ NA

# keep only the title and rating columns
movie_list_table <- movie_list_table[2:3]
colnames(movie_list_table) = c("title", "rating")

# collapse line breaks and repeated whitespace into single spaces
movie_list_table$title = str_squish(
  movie_list_table$title)

# remove the ranking info that follows the four-digit year;
# anchoring on "(yyyy)" leaves digits inside titles untouched
movie_list_table$title = str_remove(
  movie_list_table$title, "(?<=\\(\\d{4}\\)).*")

movie_list_table

## # A tibble: 100 x 2
##    title                          rating
##    <chr>                           <dbl>
##  1 The Batman (2022)                 8.4
##  2 The Adam Project (2022)           6.8
##  3 Turning Red (2022)                7.1
##  4 The Kashmir Files (2022)          8.3
##  5 Spider-Man: No Way Home (2021)    8.5
##  6 X (2022)                          7.4
##  7 Deep Water (2022)                 5.4
##  8 West Side Story (2021)            7.4
##  9 The Power of the Dog (2021)       6.9
## 10 Fresh (2022)                      6.7
## # ... with 90 more rows
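The title clean-up can be checked offline on a few strings before running it on the scraped table. In this sketch the raw_titles values are hand-made imitations of the table cells, not scraped data, and the pattern is one possible choice: it keeps everything up to the four-digit year in parentheses, so titles that contain digits are left alone.

```r
library(stringr)

# Hand-made imitations of the raw "Rank & Title" cells (an assumption)
raw_titles <- c(
  "The Batman\n        (2022)\n            1\n(no change)",
  "Sing 2\n        (2021)\n            35\n(\n7)",
  "F9: The Fast Saga\n        (2021)\n            62\n(\n3)"
)

# Step 1: collapse line breaks and repeated spaces into single spaces
squished <- str_squish(raw_titles)

# Step 2: drop everything that follows the "(yyyy)" year marker
clean_titles <- str_remove(squished, "(?<=\\(\\d{4}\\)).*")
clean_titles
```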

Lab assignment

This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name <Your_first_name>.R (.R is the file extension) and upload it to iCollege (Lab Assignment 3).

In addition, lab assignments will be graded based on:

Accuracy: whether the R script achieves the objectives
Readability: whether the R script is clean, well-formatted, and easily readable
- You risk losing 10 points if your code has no proper indentation or has more than 80 characters in a line.

Lab Assignment 1/2

https://en.wikipedia.org/wiki/Atlanta

Use rvest and SelectorGadget to…

Get the Mayor’s name
Get the image source (URL) for the Atlanta City Hall
Get the paragraph for the 1996 Summer Olympic Games
Get the URLs of all the links in the External links section
Get the table “Historical population”, store it as a data.frame, and clean the data.frame so that it looks like the data.frame in the next slide
- Hint 1: Use the html_attrs() function to see which tag and attribute to extract (see p.15)
- Hint 2: Notice that html_table() returns a list with a data.frame/tibble as its element (see p.16-18)
- Hint 3: Recall colnames() and rownames() from Lab-02-Data-Types to rename the columns and rows
- Hint 4: df[ rows_you_want , columns_you_want ]

## Atlanta Mayor:  Andre Dickens

## Atlanta city hall image address:  //upload.wikimedia.org/wikipedia/commons/thumb/9/90/Atlanta_City_Hall%2C_Atlanta%2C_GA_%2847474768451%29.jpg/220px-Atlanta_City_Hall%2C_Atlanta%2C_GA_%2847474768451%29.jpg

## 1996 Summer Olympic Games paragraph:  Atlanta was selected as the site for the 1996 Summer Olympic Games. Following the announcement, the city government undertook several major construction projects to improve Atlanta's parks, sporting venues, and transportation infrastructure; however, for the first time, none of the $1.7 billion cost of the games was governmentally funded. While the games experienced transportation and accommodation problems and, despite extra security precautions, there was the Centennial Olympic Park bombing,[65] the spectacle was a watershed event in Atlanta's history. For the first time in Olympic history, every one of the record 197 national Olympic committees invited to compete sent athletes, sending more than 10,000 contestants participating in a record 271 events. The related projects such as Atlanta's Olympic Legacy Program and civic effort initiated a fundamental transformation of the city in the following decade.[64]

## external links (13):  http://www.atlantaga.gov/ http://www.atlantawatershed.org/ http://www.atlantapd.org http://www.discoveratlanta.com/ http://www.georgiaencyclopedia.org/articles/counties-cities-neighborhoods/atlanta /wiki/New_Georgia_Encyclopedia http://dlg.galileo.usg.edu/atlnewspapers /wiki/Digital_Library_of_Georgia http://album.atlantahistorycenter.com/cdm/landingpage/collection/athpc /wiki/Atlanta_History_Center http://www.nps.gov/history/nr/travel/atlanta/ /wiki/Scientific_American https://books.google.com/books?id=YIE9AQAAIAAJ&printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q=carbonic%20oxide&f=false

## # A tibble: 18 x 3
##    Census Pop.    `%±`  
##  * <chr>  <chr>   <chr> 
##  1 1850   2,572   —     
##  2 1860   9,554   271.5%
##  3 1870   21,789  128.1%
##  4 1880   37,409  71.7% 
##  5 1890   65,533  75.2% 
##  6 1900   89,872  37.1% 
##  7 1910   154,839 72.3% 
##  8 1920   200,616 29.6% 
##  9 1930   270,366 34.8% 
## 10 1940   302,288 11.8% 
## 11 1950   331,314 9.6%  
## 12 1960   487,455 47.1% 
## 13 1970   495,039 1.6%  
## 14 1980   425,022 -14.1%
## 15 1990   394,017 -7.3% 
## 16 2000   416,474 5.7%  
## 17 2010   420,003 0.8%  
## 18 2020   498,715 18.7%

Lab Assignment 2/2

Write a regular expression to extract email addresses from the following text:

@GeorgiaStateU: My two email addresses are jlee469@gsu.edu and jeremy.lee.4093@gmail.com

z <- paste("@GeorgiaStateU: My two email addresses are jlee469@gsu.edu and",
           "jeremy.lee.4093@gmail.com")
my_regex = "put your regex here"
stringr::str_extract_all(z, my_regex)[[1]]

## [1] "jlee469@gsu.edu"           "jeremy.lee.4093@gmail.com"