For this assignment, I began by picking three of my favorite books on one of my favorite topics, marketing. For each book, I created a data structure that included the following: title, subtitle, author(s), publisher, and subject(s).
I created the data structures to hold this information in the following three formats: JSON, XML, and HTML.
Finally, I wrote the code necessary to import the data from each of these separate files and load it into R data frames using the appropriate libraries and their functions.
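To make the three formats concrete, here is a minimal sketch of how a single book record might be laid out in each one. The field names (`title`, `subtitle`, `authors`, `publisher`, `subjects`) and the element layout are assumptions inferred from the table columns in this report, not a copy of the actual source files.

```r
# Hypothetical layout of one book record in each format.
# Field and element names are assumed for illustration; the real files may differ.

json_txt <- '[
  {
    "title": "Hooked",
    "subtitle": "How to Build Habit-Forming Products",
    "authors": "Nir Eyal, Ryan Hoover",
    "publisher": "Portfolio/Penguin",
    "subjects": "Product Development, Consumer Behavior, Marketing"
  }
]'

xml_txt <- '<books>
  <book>
    <title>Hooked</title>
    <subtitle>How to Build Habit-Forming Products</subtitle>
    <authors>Nir Eyal, Ryan Hoover</authors>
    <publisher>Portfolio/Penguin</publisher>
    <subjects>Product Development, Consumer Behavior, Marketing</subjects>
  </book>
</books>'

html_txt <- '<table>
  <tr><th>Title</th><th>Subtitle</th><th>Author(s)</th><th>Publisher</th><th>Subject(s)</th></tr>
  <tr><td>Hooked</td><td>How to Build Habit-Forming Products</td><td>Nir Eyal, Ryan Hoover</td><td>Portfolio/Penguin</td><td>Product Development, Consumer Behavior, Marketing</td></tr>
</table>'

# Print the three representations one after another for comparison
cat(json_txt, xml_txt, html_txt, sep = "\n\n")
```

Each representation carries the same five fields, which is what makes it possible to end up with matching data frames later on.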
In addition to the main standard libraries, this assignment uses the “XML” and “rjson” libraries to load the XML and JSON files, respectively. The function needed to read in the HTML table is also included in the XML library.
knitr::opts_chunk$set(echo = TRUE)
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("rjson")
library("XML")
library("methods")
library("knitr")
library("kableExtra")
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
json_file = 'http://www.korymartin.com/books.json'
xml_file = 'http://www.korymartin.com/books.xml'
html_file = 'http://www.korymartin.com/books.html'
In this step we load the .json file using the rjson library and its fromJSON function. Because the .json file has a nested structure, the ingested data initially comes back as a nested list rather than a data frame. To get it into a data frame, I wrote code that converts each book into its own data frame and appends it to a newly created, initially empty data frame.
books_json = fromJSON(file=json_file)
json_df = data.frame()
json_df = rbind(json_df,as.data.frame(books_json[1]))
json_df = rbind(json_df,as.data.frame(books_json[2]))
json_df = rbind(json_df,as.data.frame(books_json[3]))
json_df %>% kbl(col.names = c("Title", "Subtitle", "Author(s)", "Publisher", "Subject(s)"))
| Title | Subtitle | Author(s) | Publisher | Subject(s) |
|---|---|---|---|---|
| Predictive Marketing | Easy Ways Every Marketer Can Use Customer Analytics and Big Data | Omer Artun, PhD., Dominique Levin | Wiley | Marketing Analytics, Marketing |
| Hooked | How to Build Habit-Forming Products | Nir Eyal, Ryan Hoover | Portfolio/Penguin | Product Development, Consumer Behavior, Marketing |
| Targeted | How Technology is Revolutionizing Advertising and the Way Companies Reach Consumers | Mike Smith | AMACOM | Marketing Analytics, Marketing |
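The per-book `rbind` calls above work for three books but would need another line for every additional record. A possible generalization is sketched below. It assumes the JSON is a top-level array of flat objects, and the inline `json_txt` string and its field names are made up for illustration so the example runs without network access.

```r
library(rjson)

# Hypothetical stand-in for the remote file: an array of flat book objects
json_txt <- '[
  {"title": "Hooked",   "publisher": "Portfolio/Penguin"},
  {"title": "Targeted", "publisher": "AMACOM"}
]'

# fromJSON returns a list with one element (itself a named list) per book
books <- fromJSON(json_str = json_txt)

# Convert every book to a one-row data frame, then stack all rows at once
book_rows <- lapply(books, as.data.frame, stringsAsFactors = FALSE)
json_df <- do.call(rbind, book_rows)

print(json_df)
```

This handles any number of records with the same code, at the cost of assuming every record has the same set of fields.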
In this step we are loading the .xml file into a data frame using the XML library and the xmlParse function. The xmlToDataFrame function was used to convert the ingested data directly into a data frame named xml_df.
books_xml = xmlParse(file = xml_file)
xml_df = xmlToDataFrame(nodes=getNodeSet(books_xml, "//book"))
xml_df %>% kbl(col.names = c("Title", "Subtitle", "Author(s)", "Publisher", "Subject(s)"))
| Title | Subtitle | Author(s) | Publisher | Subject(s) |
|---|---|---|---|---|
| Predictive Marketing | Easy Ways Every Marketer Can Use Customer Analytics and Big Data | Omer Artun PhD., Dominique Levin | Wiley | Marketing Analytics, Marketing |
| Hooked | How to Build Habit-Forming Products | Nir Eyal, Ryan Hoover | Portfolio/Penguin | Product Development, Consumer Behavior, Marketing |
| Targeted | How Technology is Revolutionizing Advertising and the Way Companies Reach Consumers | Mike Smith | AMACOM | Marketing Analytics, Marketing |
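For readers who want to try the XML step without the remote file, the same calls can be run against an inline document. This is only a sketch: the `<books>`/`<book>` layout and the two child elements are assumptions for illustration, and `xpathSApply` is added to show how a single field could be pulled out with XPath.

```r
library(XML)

# Hypothetical stand-in for books.xml with two flat <book> records
xml_txt <- '<books>
  <book>
    <title>Hooked</title>
    <publisher>Portfolio/Penguin</publisher>
  </book>
  <book>
    <title>Targeted</title>
    <publisher>AMACOM</publisher>
  </book>
</books>'

# Parse the string (asText = TRUE means "this is XML content, not a file name")
doc <- xmlParse(xml_txt, asText = TRUE)

# Select every <book> node and turn each one into a data frame row
book_nodes <- getNodeSet(doc, "//book")
xml_df <- xmlToDataFrame(nodes = book_nodes, stringsAsFactors = FALSE)

# XPath can also extract one field directly as a character vector
titles <- xpathSApply(doc, "//book/title", xmlValue)

print(xml_df)
print(titles)
```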
In this last step, we load the table from our .html file using the readHTMLTable function, which is also included in the XML library.
books_html <- readHTMLTable(html_file)
books_html %>% kbl(col.names = c("Title", "Subtitle", "Author(s)", "Publisher", "Subject(s)"))
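One detail worth checking when using `readHTMLTable` on a whole page: to my understanding, it returns a list with one data frame per `<table>` unless the `which` argument selects a single table. The sketch below illustrates the difference on an inline HTML string. The table contents are a made-up two-column stand-in, not the real books.html.

```r
library(XML)

# Hypothetical stand-in for books.html containing a single table
html_txt <- '<html><body>
<table>
  <tr><th>Title</th><th>Publisher</th></tr>
  <tr><td>Hooked</td><td>Portfolio/Penguin</td></tr>
  <tr><td>Targeted</td><td>AMACOM</td></tr>
</table>
</body></html>'

doc <- htmlParse(html_txt, asText = TRUE)

# Without `which`, the result is a list holding one data frame per table
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# With `which = 1`, the first table comes back directly as a data frame
html_df <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

print(class(all_tables))
print(html_df)
```

Selecting the table explicitly keeps the result a plain data frame, which makes it easier to compare with the JSON and XML results.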
In this instance, the three resulting data frames are identical. Initially, however, when creating my JSON data structure, I used a list to hold the authors and the subjects. When this data was converted into a data frame, the result had multiple rows for each book to hold the different author and subject values, which was not the intended outcome. Therefore, I modified the underlying source data. While this was a necessary hack, I would expect to need a different solution if the data came from a source I could not edit. Once I made this fix, all three data frames were identical.
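If the source data could not be edited, one alternative would be to collapse each list-valued field into a single string in R before building the data frame. The following is a sketch of that idea, not the approach used in this assignment. The JSON string, its field names, and the helper `flatten_book` are all hypothetical.

```r
library(rjson)

# Hypothetical JSON where authors and subjects are arrays rather than strings
json_txt <- '[
  {"title": "Hooked",
   "authors": ["Nir Eyal", "Ryan Hoover"],
   "subjects": ["Product Development", "Consumer Behavior", "Marketing"]},
  {"title": "Targeted",
   "authors": ["Mike Smith"],
   "subjects": ["Marketing Analytics", "Marketing"]}
]'

books <- fromJSON(json_str = json_txt)

# Hypothetical helper: join every field's values into one comma-separated
# string, so each book becomes exactly one data frame row
flatten_book <- function(book) {
  as.data.frame(lapply(book, paste, collapse = ", "), stringsAsFactors = FALSE)
}

flat_df <- do.call(rbind, lapply(books, flatten_book))

print(flat_df)
```

The trade-off is that the individual authors and subjects are no longer separate values, so this suits display tables better than further analysis.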
One further observation is that only the .html data was imported directly into a data frame without requiring any additional steps.
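Whether data frames built from different formats really match can also be checked in code rather than by eye. Below is a small sketch using base R only. The three data frames are tiny made-up stand-ins, and the helper `normalize` is hypothetical. It assumes that differences in column type (for example factor versus character) should not count as a mismatch.

```r
# Made-up stand-ins for the data frames produced by the three import steps
json_df <- data.frame(title = c("Hooked", "Targeted"),
                      publisher = c("Portfolio/Penguin", "AMACOM"),
                      stringsAsFactors = FALSE)
xml_df  <- json_df
html_df <- json_df
html_df$title <- factor(html_df$title)  # simulate a type difference

# Hypothetical helper: convert all columns to character and reset row names
normalize <- function(df) {
  df[] <- lapply(df, as.character)
  rownames(df) <- NULL
  df
}

same_json_xml  <- identical(json_df, xml_df)
same_json_html <- identical(json_df, html_df)
same_after_fix <- identical(normalize(json_df), normalize(html_df))

print(c(same_json_xml, same_json_html, same_after_fix))
```

A strict `identical` comparison flags the factor column as a difference, while comparing the normalized copies checks only the values.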
This was a fun exercise that allowed me to gain additional practice working with different data formats that can be found across the web. While I’ve done this type of work in the Python programming language, this was the first time I wrote code in R to import data from these various data structures.