A7 Web Data

Introduction

Suppose you choose three of your favorite books and write an HTML file containing some metadata about all three: the authors' names, the number of pages, and so on. You then store the same data in an XML file and in a JSON file. If you want to analyze the data, which of the three files is the least painful to load into your R environment?

Load libraries

library(httr)
library(rjson)
library(XML)

HTML Table

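url <- "https://raw.githubusercontent.com/djunga/A7-Web-Data/main/mybooks.html"

# Download the raw HTML file
myhtml <- GET(url)

x <- rawToChar(myhtml$content)  # response body as a single string
x <- htmlParse(x)
x <- readHTMLTable(x)           # a list of tables; the single table here is named "NULL"
x <- data.frame(x)              # column names become "NULL.Title", "NULL.Author1", ...
# Strip the list-name prefix; the anchor and [.] match a literal "NULL." at the start only
colnames(x) <- gsub("^NULL[.]", "", colnames(x))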
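
The "NULL." prefix appears because readHTMLTable returns a list of tables, and data.frame() folds the list element's name into every column name. A minimal sketch of one way to avoid the clean-up step is to pull the first table out of the list directly. The inline HTML string and the names html_text, doc, and books_html are hypothetical stand-ins for illustration, not the contents of mybooks.html.

```r
library(XML)

# Hypothetical two-column table standing in for the real file
html_text <- '
<html><body>
<table>
  <tr><th>Title</th><th>Pages</th></tr>
  <tr><td>Neuromancer</td><td>271</td></tr>
  <tr><td>1 the Road</td><td>171</td></tr>
</table>
</body></html>'

# Parse the string itself rather than treating it as a file name
doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable() returns a list; [[1]] extracts the first table as a data frame,
# so no list name is prepended to the column names
books_html <- readHTMLTable(doc)[[1]]

# Values arrive as text (or factors in older R versions), so convert explicitly
books_html$Pages <- as.integer(as.character(books_html$Pages))
```

With the real file, the same idea would be to replace data.frame(x) with x[[1]], assuming the page contains a single table.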

XML

url <- "https://raw.githubusercontent.com/djunga/A7-Web-Data/main/mybooks.xml"
myxml <- GET(url)
a <- rawToChar(myxml$content)
a <- xmlParse(a)
a <- xmlToDataFrame(a)
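
xmlToDataFrame works in one step here because it treats each child of the root element as a row and each of that child's fields as a column. The sketch below illustrates that mapping on a small inline document. The element layout (a books root with book children) and the names xml_text, doc, and books_xml are assumptions for illustration; the real mybooks.xml may be structured differently.

```r
library(XML)

# Hypothetical document: one <book> element per row, one child element per column
xml_text <- '
<books>
  <book><title>Neuromancer</title><pages>271</pages></book>
  <book><title>1 the Road</title><pages>171</pages></book>
</books>'

# Parse the string itself rather than treating it as a file name
doc <- xmlParse(xml_text, asText = TRUE)

# Each <book> becomes a row; <title> and <pages> become columns
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column is read as text, so numeric fields need an explicit conversion
books_xml$pages <- as.integer(books_xml$pages)
```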

JSON

url <- "https://raw.githubusercontent.com/djunga/A7-Web-Data/main/mybooks.json"
b <- GET(url)
b <- rawToChar(b$content)
b <- fromJSON(b)
b <- unlist(b)
b

##                           book1.Title                         book1.Author1 
##                         "Neuromancer"                      "William Gibson" 
##                         book1.Author2                           book1.Genre 
##                                  "NA"                              "Sci-Fi" 
##                       book1.Published                           book1.Pages 
##                                "1984"                                 "271" 
##                           book2.Title                         book2.Author1 
##                          "1 the Road"                        "Ross Goodwin" 
##                         book2.Author2                           book2.Genre 
##                     "Kenric McDowell"                              "Poetry" 
##                       book2.Published                           book2.Pages 
##                                "2018"                                 "171" 
##                           book3.Title                         book3.Author1 
## "AI 2041: Ten Visions for Our Future"                         "Chen Qiufan" 
##                         book3.Author2                           book3.Genre 
##                          "Kai-Fu Lee"                              "Sci-Fi" 
##                       book3.Published                           book3.Pages 
##                                "2021"                                 "480"

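# The field names repeat for every book, so the first six names
# (with the "bookN." prefix removed) serve as the column names
mycolnames <- gsub("book[0-9][.]", "", names(b))[1:6]

# Build one column per book (six fields each), then transpose so each book is a row
w <- data.frame(b[1:6], b[7:12], b[13:18])
w <- data.frame(t(w))
colnames(w) <- mycolnames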
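
The reshaping above relies on hard-coded index ranges, so it would need editing if a book or a field were added. As a rough alternative sketch, the parsed list can be bound row by row before it is flattened with unlist. The list below is a hypothetical stand-in for what fromJSON might return, reconstructed from the printed output above; the names books, rows, and w2 are illustrative, and the real file may store some fields as numbers rather than strings.

```r
# Hypothetical stand-in for the parsed JSON: one named list per book
books <- list(
  book1 = list(Title = "Neuromancer", Author1 = "William Gibson", Author2 = "NA",
               Genre = "Sci-Fi", Published = "1984", Pages = "271"),
  book2 = list(Title = "1 the Road", Author1 = "Ross Goodwin", Author2 = "Kenric McDowell",
               Genre = "Poetry", Published = "2018", Pages = "171"),
  book3 = list(Title = "AI 2041: Ten Visions for Our Future", Author1 = "Chen Qiufan",
               Author2 = "Kai-Fu Lee", Genre = "Sci-Fi", Published = "2021", Pages = "480")
)

# Turn each book into a one-row data frame, then stack the rows
rows <- lapply(books, function(book) as.data.frame(book, stringsAsFactors = FALSE))
w2 <- do.call(rbind, rows)
rownames(w2) <- NULL

# Convert the numeric fields and turn the literal string "NA" into a missing value
w2$Published <- as.integer(w2$Published)
w2$Pages <- as.integer(w2$Pages)
w2$Author2[w2$Author2 == "NA"] <- NA
```

This version uses only base R once the list is parsed, and it does not depend on the number of books or on the order of the flattened vector.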

Display each data frame.

HTML data frame

head(x)

##                                 Title        Author1         Author2  Genre
## 1                         Neuromancer William Gibson              NA Sci-Fi
## 2                          1 the Road   Ross Goodwin Kenric McDowell Poetry
## 3 AI 2041: Ten Visions for Our Future    Chen Qiufan      Kai-Fu Lee Sci-Fi
##   Published Pages
## 1      1984   271
## 2      2018   171
## 3      2021   480

XML data frame

head(a)

##                                 title        author1         author2  genre
## 1                         Neuromancer William Gibson              NA Sci-Fi
## 2                          1 the Road   Ross Goodwin Kenric McDowell Poetry
## 3 AI 2041: Ten Visions for Our Future    Chen Qiufan      Kai-Fu Lee Sci-Fi
##   published pages
## 1      1984   271
## 2      2018   171
## 3      2021   480

JSON data frame

head(w)

##                                        Title        Author1         Author2
## b.1.6.                           Neuromancer William Gibson              NA
## b.7.12.                           1 the Road   Ross Goodwin Kenric McDowell
## b.13.18. AI 2041: Ten Visions for Our Future    Chen Qiufan      Kai-Fu Lee
##           Genre Published Pages
## b.1.6.   Sci-Fi      1984   271
## b.7.12.  Poetry      2018   171
## b.13.18. Sci-Fi      2021   480

Conclusion

The HTML and XML files needed very little processing to become data frames. The JSON file, in contrast, needed several more steps: flattening the parsed list, using gsub to recover the column names, and transposing the result so that each book is a row. A JSON file on its own may look friendlier to read, but you may have a less frustrating time loading an HTML or XML file into your R environment.