The purpose of this week’s assignment was to practice working with JSON, XML, and HTML data in R by manually creating three copies of a simple data table, one in each of these three languages, and then importing them into R dataframes. As this was my first experience with any of these languages, it was a challenging but beneficial exercise.
For my content, I chose three books on how dogs perceive the world. All three were written by canine cognition researchers at Barnard College and Duke University and are based on their years of research into the capabilities and psychology of dogs. In particular, I am fascinated by their descriptions of dogs’ sense of smell and how that shapes their perception and cognition (disclaimer: we have several dogs at home including a young bloodhound mix with an otherworldly sense of smell). They are well-written and engaging casual reads, woven through with stories of dogs the authors have personally known and loved, which makes them some of my favorites.
For these three books, I referred to Amazon and pulled in nine variables, creating a small table in Excel as a model for my JSON, XML, and HTML code.
Then I began with the libraries:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
library(RCurl)
Attaching package: 'RCurl'
The following object is masked from 'package:tidyr':
complete
library(XML)library(xml2)library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
JSON
First I used Notepad++ to create a JSON file with my Excel table as a reference: https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Week_7/Books.json
Then I brought it into R as follows. There were a number of packages available but the fromJSON function in the jsonlite package brought this particular table directly into a dataframe in a single step, whereas others required a few more steps and/or transformations.
I did notice that R brought in all data elements as character type, so depending on the desired analysis, one might need to convert some numeric or date fields before using.
path_json <-"https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Week_7/Books.json"#-------- Use function from jsonlite package to create dataframe directly from JSON file in one stepdf_json <- jsonlite::fromJSON(path_json) str(df_json)
'data.frame': 3 obs. of 9 variables:
$ title : chr "The Genius of Dogs: How Dogs Are Smarter Than You Think" "Inside of a Dog: What Dogs See, Smell, and Know" "Being a Dog: Following the Dog Into a World of Smell"
$ author1 : chr "Brian Hare" "Alexandra Horowitz" "Alexandra Horowitz"
$ author2 : chr "Vanessa Woods" NA NA
$ publisher : chr "Plume" "Scribner" "Scribner"
$ edition : chr "1" "1" "1"
$ date : chr "10/29/2023" "09/28/2010" "10/17/2017"
$ format : chr "Paperback" "Paperback" "Paperback"
$ isbn-10 : chr "142180467" "1416583432" "1476796025"
$ amazon_rating: chr "4.5" "4.4" "4.5"
df_json
title author1
1 The Genius of Dogs: How Dogs Are Smarter Than You Think Brian Hare
2 Inside of a Dog: What Dogs See, Smell, and Know Alexandra Horowitz
3 Being a Dog: Following the Dog Into a World of Smell Alexandra Horowitz
author2 publisher edition date format isbn-10 amazon_rating
1 Vanessa Woods Plume 1 10/29/2023 Paperback 142180467 4.5
2 <NA> Scribner 1 09/28/2010 Paperback 1416583432 4.4
3 <NA> Scribner 1 10/17/2017 Paperback 1476796025 4.5
XML
To create the same data table in XML, I used Notepad++ again: https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Week_7/Books2.xml
XML was more complicated to bring into R as there was no simple one-step function like jsonlite provided above, and I tried a few XML packages and functions. After experimentation, I used functions from the RCurl and the XML packages.
First, I used getURL which created an “External Pointer” and xmlParse to bring in the data in a format that looked just like the original XML format (below). This two-step process seemed to be the key, and then I was able to use to some intuitive xml-specific functions like xmlToList and xmlToDataFrame to work with the data without any issue.
As happened with the JSON file, R brought in all data elements from my XML file as character type.
path_xml <-"https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Week_7/Books2.xml"# -------- Get data with getURL and parse the XML code with xmlParsedata_xml <-getURL(path_xml) %>%xmlParse()# -------- Create dataframe from XMLdf_xml<-xmlToDataFrame(data_xml) str(df_xml)
'data.frame': 3 obs. of 9 variables:
$ title : chr "The Genius of Dogs: How Dogs Are Smarter Than You Think" "Inside of a Dog: What Dogs See, Smell, and Know" "Being a Dog: Following the Dog Into a World of Smell"
$ author1 : chr "Brian Hare" "Alexandra Horowitz" "Alexandra Horowitz"
$ author2 : chr "Vanessa Woods" NA NA
$ publisher : chr "Plume" "Scribner" "Scribner"
$ edition : chr "1" "1" "1"
$ date : chr "10/29/2023" "09/28/2010," "10/17/2017,"
$ format : chr "Paperback" "Paperback" "Paperback"
$ isbn-10 : chr "142180467" "1416583432" "1476796025"
$ amazon_rating: chr "4.5" "4.4" "4.5"
df_xml
title author1
1 The Genius of Dogs: How Dogs Are Smarter Than You Think Brian Hare
2 Inside of a Dog: What Dogs See, Smell, and Know Alexandra Horowitz
3 Being a Dog: Following the Dog Into a World of Smell Alexandra Horowitz
author2 publisher edition date format isbn-10
1 Vanessa Woods Plume 1 10/29/2023 Paperback 142180467
2 <NA> Scribner 1 09/28/2010, Paperback 1416583432
3 <NA> Scribner 1 10/17/2017, Paperback 1476796025
amazon_rating
1 4.5
2 4.4
3 4.5
HTML
Finally, I created an HTML file using Notepad++ and my XML file for reference: https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Week_7/Books2.html
Bringing it into R was conceptually similar to XML, and I used the rvest package in this case.
Unlike the XML or JSON import processes, the HTML import process coerced all fields with numeric values to int or num (but left dates as chr). In the case of Amazon ratings, this is appropriate; in the case of ISBN numbers, I would probably use double quotes in HTML or otherwise try to force R to consider it a character field so as not to lose any leading zeroes or cause issues with joins to other tables.
Also unlike the JSON- and XML-based dataframes, the HTML-based dataframe was the only one where the nulls were not populated with NA.
path_html <-"https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Week_7/Books2.html"#-------- Similar to XMLdata_html <-getURL(path_html) %>%htmlParse()#-------- From rvest packagedf_html<-read_html(path_html)%>%html_node("table") %>%html_table()str(df_html)
tibble [3 × 9] (S3: tbl_df/tbl/data.frame)
$ title : chr [1:3] "The Genius of Dogs: How Dogs Are Smarter Than You Think" "Inside of a Dog: What Dogs See, Smell, and Know" "Being a Dog: Following the Dog Into a World of Smell"
$ author1 : chr [1:3] "Brian Hare" "Alexandra Horowitz" "Alexandra Horowitz"
$ author2 : chr [1:3] "Vanessa Woods" "" ""
$ publisher : chr [1:3] "Plume" "Scribner" "Scribner"
$ edition : int [1:3] 1 1 1
$ date : chr [1:3] "10/29/2023" "09/28/2010" "10/17/2017"
$ format : chr [1:3] "Paperback" "Paperback" "Paperback"
$ isbn-10 : int [1:3] 142180467 1416583432 1476796025
$ amazon_rating: num [1:3] 4.5 4.4 4.5
df_html
# A tibble: 3 × 9
title author1 author2 publisher edition date format `isbn-10` amazon_rating
<chr> <chr> <chr> <chr> <int> <chr> <chr> <int> <dbl>
1 The Ge… Brian … "Vanes… Plume 1 10/2… Paper… 1.42e8 4.5
2 Inside… Alexan… "" Scribner 1 09/2… Paper… 1.42e9 4.4
3 Being … Alexan… "" Scribner 1 10/1… Paper… 1.48e9 4.5
In the end, the JSON and XML dataframes were identical, while the HTML dataframe did not have NAs in the null data elements and the numeric fields were all forced to num or int by default.