Introduction

This week focuses on the data file types that are commonly passed or retrieved when web scraping or working with APIs. The example uses an HTML, a JSON, and an XML file that I created by hand to get a better feel for the structure of each format.

Each file records the title, authors, publisher, and publication date of the same three data science books.

URLs for the different files:

HTML: https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%207%20-%20Web%20Data%20%26%20APIs/html2.html
JSON: https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%207%20-%20Web%20Data%20%26%20APIs/json.json
XML: https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%207%20-%20Web%20Data%20%26%20APIs/xml.xml

HTML

The biggest takeaways in this area are rvest's html_element() and html_table(). You can read the HTML directly, but I thought it was better practice to give the table element the class ‘mytable’ so that html_element() can pick it out with the CSS selector ‘.mytable’. This is quite handy when an HTML file is large and only one portion of it is needed. Without the selector, isolating that portion would take string matching and much heavier cleanup.

library(rvest)  # provides read_html(), html_element(), html_table(), and the %>% pipe

html <- read_html("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%207%20-%20Web%20Data%20%26%20APIs/html2.html")

table <- html %>% html_element('.mytable') %>% html_table()

table

## # A tibble: 3 × 6
##   Title                                   Author Author2 Author3 Publi…¹ Publi…²
##   <chr>                                   <chr>  <chr>   <chr>   <chr>   <chr>  
## 1 R for Data Science (1e)                 Hadle… Garret… <NA>    O'Reil… 2017-0…
## 2 Deep Learning (Adaptive Computation an… Ian G… Yoshua… Aaron … O'Reil… 2016-1…
## 3 Python Data Science Handbook: Essentia… Jake … <NA>    <NA>    O'Reil… 2023-0…
## # … with abbreviated variable names ¹Publisher, ²Published
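
To show why the class selector matters, here is a minimal sketch that runs without a network connection. The inline page, its two tables, and the book names are made up for illustration and are not the contents of html2.html; only the `.mytable` class name is taken from the example above.

```r
library(rvest)

# Hypothetical page with two tables; only the second carries class "mytable".
page <- minimal_html('
  <table class="other">
    <tr><th>x</th></tr>
    <tr><td>1</td></tr>
  </table>
  <table class="mytable">
    <tr><th>Title</th><th>Author</th></tr>
    <tr><td>Book A</td><td>First Author</td></tr>
    <tr><td>Book B</td><td>Second Author</td></tr>
  </table>
')

# Count every <table> on the page to confirm there is more than one.
n_tables <- length(html_elements(page, "table"))

# ".mytable" is a CSS class selector, so only the second table is returned.
books <- page %>%
  html_element(".mytable") %>%
  html_table()

books
```

Selecting by class keeps the parsing step independent of where the table sits in the page, which is the main benefit when the file is large.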

JSON

I’m most familiar with JSON. I really like that once jsonlite parses a JSON object, its contents can be reached with normal R nomenclature (i.e. using $ to access an item inside a data frame or list). Even so, I remember a painful experience of pulling a massive JSON object and then having to chop up pieces of its many lists to group the information I needed. Maybe this was due to poor relational structure in the database I was pulling from. Notice here that fromJSON() returns a list, and I had to combine several data frames to create the single frame I was looking for.

# Note that JSON arrives differently from the other formats: fromJSON() returns a list.
json <- jsonlite::fromJSON("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%207%20-%20Web%20Data%20%26%20APIs/json.json")

# Each book is its own one-row data frame, so stack them into a single frame.
jsonFrame <- rbind(json$`Favorite Books`$Book1, json$`Favorite Books`$Book2, json$`Favorite Books`$Book3)
jsonFrame

##    Published
## 1 2016-11-18
## 2 2023-01-17
## 3 2017-01-31
##                                                                       Name
## 1         Deep Learning (Adaptive Computation and Machine Learning series)
## 2 Python Data Science Handbook: Essential Tools for Working with Data (2e)
## 3                                                  R for Data Science (1e)
##           Author1          Author2         Author3      Publisher
## 1  Ian Goodfellow    Yoshua Bengio Aaron Courville O'Reilly Media
## 2 Jake VanderPlas             <NA>            <NA> O'Reilly Media
## 3  Hadley Wickham Garret Grolemund            <NA> O'Reilly Media
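
When the number of entries is not known in advance, naming each one inside rbind() gets tedious. The sketch below shows one possible alternative, do.call(rbind, ...), applied to the whole named list. The inline JSON string and its values are hypothetical; its layout (one array holding a single object per book) is only an assumption about how json.json might be structured.

```r
library(jsonlite)

# Hypothetical JSON text; the real json.json may be laid out differently.
txt <- '{
  "Favorite Books": {
    "Book1": [{"Name": "Book A", "Author1": "First Author",  "Publisher": "Some Press"}],
    "Book2": [{"Name": "Book B", "Author1": "Second Author", "Publisher": "Some Press"}]
  }
}'

json <- fromJSON(txt)

# Each book parses to a one-row data frame inside a named list.
pieces <- json[["Favorite Books"]]

# Stack every element of the list without naming Book1, Book2, ... by hand.
books <- do.call(rbind, pieces)

books
```

This only works cleanly when every piece has the same column names, which is worth checking before stacking real API responses.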

XML

I prefer XML the most. It makes the most sense to me when writing it by hand; it is simple and straightforward. My prior experience comes from an old job, where I had to create XML formats for applications to fill in when passing flat files between software. To make sure the xmlToDataFrame() call worked, I needed to load the “methods” library after the XML library. Also note that I had to fetch the content with getURL() (from the RCurl package) rather than with something purpose-built for XML files. I print the object so that one can see what it looks like before xmlToDataFrame() is applied.

library(RCurl)  # provides getURL()
library(XML)    # provides xmlToDataFrame()

xml <- getURL("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Week%207%20-%20Web%20Data%20%26%20APIs/xml.xml")
xml

## [1] "<Records>\n  <book>\n  \t  <title>Python Data Science Handbook: Essential Tools for Working with Data (2e)</title>\n\t  <author>Jake VanderPlas</author>\n        <author2>NA</author2>\n        <author3>NA</author3>\n\t  <publisher>O'Reilly Media</publisher>\n\t  <published>2023-01-17</published>\n  </book>\n  <book>\n        <title>R for Data Science (1e)</title>\n\t  <author>Hadley Wickham</author>\n        <author2>Garret Grolemund</author2>\n        <author3>NA</author3>\n\t  <publisher>O'Reilly Media</publisher>\n\t  <published>2017-01-31</published>\n  </book>\n  <book>\n        <title>Deep Learning (Adaptive Computation and Machine Learning series)</title>\n\t  <author>Ian Goodfellow</author>\n        <author2>Yoshua Bengio</author2>\n        <author3>Aaron Courville</author3>\n\t  <publisher>O'Reilly Media</publisher>\n\t  <published>2016-11-18</published>\n  </book>\n</Records>\n"

# Load the "methods" library before running xmlToDataFrame(), or the XML element names will not be kept as column names.
library(methods)
xmlData <- xmlToDataFrame(xml)

print(xmlData)

##                                                                      title
## 1 Python Data Science Handbook: Essential Tools for Working with Data (2e)
## 2                                                  R for Data Science (1e)
## 3         Deep Learning (Adaptive Computation and Machine Learning series)
##            author          author2         author3      publisher  published
## 1 Jake VanderPlas               NA              NA O'Reilly Media 2023-01-17
## 2  Hadley Wickham Garret Grolemund              NA O'Reilly Media 2017-01-31
## 3  Ian Goodfellow    Yoshua Bengio Aaron Courville O'Reilly Media 2016-11-18
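
One detail in the printed frame is that missing authors appear as NA rather than <NA>, because the XML file stores the literal text "NA" and xmlToDataFrame() reads it as an ordinary string. The sketch below, using a made-up inline document rather than xml.xml, shows one way to parse XML text explicitly with xmlParse(asText = TRUE) and then turn those strings into real missing values.

```r
library(XML)
library(methods)

# Hypothetical XML text mirroring the <Records>/<book> layout shown above.
txt <- "<Records>
  <book>
    <title>Book A</title>
    <author>First Author</author>
    <author2>NA</author2>
  </book>
  <book>
    <title>Book B</title>
    <author>Second Author</author>
    <author2>Third Author</author2>
  </book>
</Records>"

# asText = TRUE tells the parser that txt holds XML content, not a file name.
doc <- xmlParse(txt, asText = TRUE)

# Keep columns as character so the "NA" strings can be compared directly.
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Replace the literal string "NA" with a true missing value.
books[books == "NA"] <- NA

books
```

After the replacement, is.na() and similar functions treat the missing authors the same way as in the HTML and JSON results.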

Week 7 - Data File Types

Daniel Craig

2023-03-11
