Week 7 Assignment

Week 7 Assignment - Working with XML and JSON in R

For this assignment, I began by picking three of my favorite books related to one of my favorite topics, Marketing. For each book, I created a data structure that included the following:

Title
Subtitle
Author(s)
Publisher
Subject(s)

I created the data structures to hold this information in the following three formats:

JSON
XML
HTML

Finally, for this assignment, I wrote the code necessary to import the data from each of these separate files, and load them into R data frames using the appropriate libraries and associated methods.

Load Libraries

In addition to the tidyverse, this assignment uses the “XML” and “rjson” libraries to load the XML and JSON files, respectively. The function needed to read the HTML table, readHTMLTable, is also included in the XML library.

knitr::opts_chunk$set(echo = TRUE)

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("rjson")
library("XML")
library("methods")
library("knitr")
library("kableExtra")

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

json_file = 'http://www.korymartin.com/books.json'
xml_file = 'http://www.korymartin.com/books.xml'
html_file = 'http://www.korymartin.com/books.html'

JSON Data

In this step we load the .json file using the fromJSON function from the rjson library. Because the .json file has a nested structure, fromJSON returns the data as a nested list rather than a data frame. To get it into a data frame, I wrote code that converts each book into its own data frame and appends it to a newly created, initially empty data frame.

books_json = fromJSON(file=json_file)

json_df = data.frame()
json_df = rbind(json_df,as.data.frame(books_json[1]))
json_df = rbind(json_df,as.data.frame(books_json[2]))
json_df = rbind(json_df,as.data.frame(books_json[3]))
json_df %>% kbl(col.names = c("Title", "Subtitle", "Author(s)", "Publisher", "Subject(s)"))

Title	Subtitle	Author(s)	Publisher	Subject(s)
Predictive Marketing	Easy Ways Every Marketer Can Use Customer Analytics and Big Data	Omer Artun, PhD., Dominique Levin	Wiley	Marketing Analytics, Marketing
Hooked	How to Build Habit-Forming Products	Nir Eyal, Ryan Hoover	Portfolio/Penguin	Product Development, Consumer Behavior, Marketing
Targeted	How Technology is Revolutionizing Advertising and the Way Companies Reach Consumers	Mike Smith	AMACOM	Marketing Analytics, Marketing
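The three rbind calls above could also be written so that they do not depend on the number of books. The following is a minimal sketch, not the code used for this assignment: `books_list` is a hypothetical stand-in for the nested list that fromJSON returns, and the field names `title` and `publisher` are assumptions about the file layout.

```r
# Hypothetical stand-in for the parsed JSON: one named list per book.
# The field names are assumed, not taken from the actual books.json file.
books_list <- list(
  list(title = "Predictive Marketing", publisher = "Wiley"),
  list(title = "Hooked", publisher = "Portfolio/Penguin"),
  list(title = "Targeted", publisher = "AMACOM")
)

# Convert every book to a one-row data frame, then stack the rows.
book_rows <- lapply(books_list, as.data.frame, stringsAsFactors = FALSE)
json_df_sketch <- do.call(rbind, book_rows)

print(json_df_sketch)
```

With this pattern, adding a fourth book to the file would not require another line of code.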

XML Data

In this step we are loading the .xml file into a data frame using the XML library and the xmlParse function. The xmlToDataFrame function was used to convert the ingested data directly into a data frame named xml_df.

books_xml = xmlParse(file = xml_file)
xml_df = xmlToDataFrame(nodes=getNodeSet(books_xml, "//book"))

xml_df %>% kbl(col.names = c("Title", "Subtitle", "Author(s)", "Publisher", "Subject(s)"))

Title	Subtitle	Author(s)	Publisher	Subject(s)
Predictive Marketing	Easy Ways Every Marketer Can Use Customer Analytics and Big Data	Omer Artun PhD., Dominique Levin	Wiley	Marketing Analytics, Marketing
Hooked	How to Build Habit-Forming Products	Nir Eyal, Ryan Hoover	Portfolio/Penguin	Product Development, Consumer Behavior, Marketing
Targeted	How Technology is Revolutionizing Advertising and the Way Companies Reach Consumers	Mike Smith	AMACOM	Marketing Analytics, Marketing
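To make the conversion easier to follow, here is a small self-contained sketch that runs the same functions on an inline XML string. The element names (`books`, `book`, `title`, `publisher`) are assumptions for illustration and may not match the actual books.xml file.

```r
library(XML)

# Hypothetical XML with one <book> element per record.
xml_text <- '<books>
  <book><title>Hooked</title><publisher>Portfolio/Penguin</publisher></book>
  <book><title>Targeted</title><publisher>AMACOM</publisher></book>
</books>'

# Parse the string, select every <book> node, and turn each node into a row.
# Each child element of <book> becomes a column named after the element.
doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")
xml_df_sketch <- xmlToDataFrame(nodes = book_nodes, stringsAsFactors = FALSE)

print(xml_df_sketch)
```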

HTML Data

In this last step, we load the table from our .html file using the readHTMLTable function, which is also included in the XML library.

books_html <- readHTMLTable(html_file)

books_html %>% kbl(col.names = c("Title", "Subtitle", "Author(s)", "Publisher", "Subject(s)"))

Title	Subtitle	Author(s)	Publisher	Subject(s)
Predictive Marketing	Easy Ways Every Marketer Can Use Customer Analytics and Big Data	Omer Artun PhD., Dominique Levin	Wiley	Marketing Analytics, Marketing
Hooked	How to Build Habit-Forming Products	Nir Eyal, Ryan Hoover	Portfolio/Penguin	Product Development, Consumer Behavior,Marketing
Targeted	How Technology is Revolutionizing Advertising and the Way Companies Reach Consumers	Mike Smith	AMACOM	Marketing Analytics, Marketing
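One detail worth knowing: when readHTMLTable is called on a whole document without the `which` argument, it returns a list with one element per table found on the page. The sketch below, which uses a hypothetical inline HTML string rather than the actual books.html file, shows how `which = 1` could be used to get the first table back as a plain data frame.

```r
library(XML)

# Hypothetical stand-in for an HTML page containing a single table.
html_text <- '<html><body><table>
<tr><th>Title</th><th>Publisher</th></tr>
<tr><td>Hooked</td><td>Portfolio/Penguin</td></tr>
<tr><td>Targeted</td><td>AMACOM</td></tr>
</table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# Without `which`, the result is a list holding one entry per <table>.
all_tables <- readHTMLTable(doc, header = TRUE)

# With `which = 1`, only the first table is returned, as a data frame.
first_table <- readHTMLTable(doc, header = TRUE, which = 1,
                             stringsAsFactors = FALSE)

print(first_table)
```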

Conclusion

In this instance, the three data frames contain the same records. Initially, however, when creating my JSON data structure, I used a list to hold the Authors and the Subjects. When this data was converted into a data frame, the result had multiple rows for each book to hold the different author and subject values, which was not the intended outcome. Therefore, I modified the underlying source data to store each of these fields as a single value. While this was a necessary hack, I would expect to need a different solution if the data came from a source I could not edit. Once I made this fix, all three data frames matched.
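One possible way to handle list-valued fields without editing the source file is to collapse each field into a single comma-separated string before building the data frame. This is only a sketch under the assumption that authors and subjects arrive as character vectors inside each parsed book; the `book` list and the `flatten_book` helper are hypothetical and were not part of the assignment code.

```r
# Hypothetical parsed book in which two fields hold several values.
book <- list(
  title = "Hooked",
  authors = c("Nir Eyal", "Ryan Hoover"),
  subjects = c("Product Development", "Consumer Behavior", "Marketing")
)

# Collapse every field to one string so the book fits in a single row.
flatten_book <- function(book) {
  collapsed <- lapply(book, function(field) {
    paste(unlist(field), collapse = ", ")
  })
  as.data.frame(collapsed, stringsAsFactors = FALSE)
}

flat_df <- flatten_book(book)
print(flat_df)
```

Applying `flatten_book` to every book and stacking the results with rbind would then give one row per book regardless of how many authors or subjects each one has.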

One observation is that only the .html data could be passed straight from the import function to the table output without any additional conversion steps.

This was a fun exercise that allowed me to gain additional practice working with different data formats that can be found across the web. While I’ve done this type of work in the Python programming language, this was the first time I wrote code in R to import data from these various data structures.

Kory Martin

3/8/2023