For this week’s assignment, I had to pick three books and store their information in HTML, XML, and JSON formats, then load the three files into R data frames. I can’t remember the last time I picked up a book, so I just picked three data science books I came across in the past year. With the help of several YouTube videos, I created the three files in Visual Studio Code and uploaded them to GitHub.

Loading the Libraries

library(tidyverse)
library(rvest)
library(XML)
library(xml2)
library(jsonlite)

Importing file from GitHub - HTML

html <- read_html("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/Week7%20books.html")

Converting the HTML into an R data frame

# From the rvest library: parse the HTML tables into a list of data frames
df_html <- html_table(html)
# Keep the first table in that list
df_html <- df_html[[1]]
knitr::kable(df_html)

Title	Authors	Authors2	Interesting Attributes	Interesting Attributes2
Data Science for Business	Foster Provost	Tom Fawcett	Conversational style make it an easy read.	Cover a wide range of topics
Data Science for Data Science	Hadley Wickham	Garret Grolemund	Provides step by step instructions and code examples	Covers a wide range of R packages
Python for Data Analysis	Wes McKinney	N/A	Available online for free	Written by the creator of Pandas Library
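The same html_table() step can be tried without the GitHub file. The following is a minimal sketch, assuming a made-up two-column table passed in as a literal string (the book and author values are placeholders, not the assignment data):

```r
library(rvest)

# Hypothetical HTML snippet standing in for the real books file
html_text <- "<table>
  <tr><th>Title</th><th>Authors</th></tr>
  <tr><td>Book A</td><td>Author One</td></tr>
  <tr><td>Book B</td><td>Author Two</td></tr>
</table>"

# read_html() accepts a literal HTML string as well as a URL or path
page <- read_html(html_text)

# html_table() on a whole document returns a list with one entry per <table>
tables <- html_table(page)

# Pull out the first table, as in the assignment code
books_html <- tables[[1]]
print(books_html)
```

Because html_table() returns a list when given a whole page, the `[[1]]` indexing is needed even when the page holds a single table.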

Importing file from GitHub - XML

#From the xml2 library
xml_file <- read_xml("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/Week%207%20books.xml")

Converting XML to an R data frame

# Had to parse the XML file before I could use the function xmlToDataFrame()
xml_parse <- xmlParse(xml_file)
# From the XML library
df_xml <- xmlToDataFrame(xml_parse)
knitr::kable(df_xml)

title	authors	authors2	interesting_attributes	interesting_attributes2
Data Science for Business	Foster Provost	Tom Fawcett	Conversational style make it an easy read.	Cover a wide range of topics
Data Science for Data Science	Hadley Wickham	Garret Grolemund	Provides step by step instructions and code examples	Covers a wide range of R packages
Python for Data Analysis	Wes McKinney	N/A	Available online for free	Written by the creator of Pandas Library
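The parse-then-convert step can also be sketched with the XML package alone. This example assumes a hypothetical layout where each book is a child of the root element and each field is a child of the book; the real file may be organized differently:

```r
library(XML)

# Hypothetical XML standing in for the real books file
xml_text <- "<books>
  <book><title>Book A</title><authors>Author One</authors></book>
  <book><title>Book B</title><authors>Author Two</authors></book>
</books>"

# asText = TRUE tells xmlParse() the argument is XML content, not a file name
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row; each grandchild becomes a column
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(books_xml)
```

xmlToDataFrame() works best on flat, regular records like these. Deeper nesting usually needs manual extraction first.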

Importing file from GitHub - JSON

json_file <- fromJSON("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/week%207%20books.json")

Converting JSON to an R data frame

df_json <- as.data.frame(json_file)
knitr::kable(df_json)

title	authors	authors2	interesting attributes	interesting attributes2
Data Science for Business	Foster Provost	Tom Fawcett	Conversational style make it an easy read.	Cover a wide range of topics
Data Science for Data Science	Hadley Wickham	Garret Grolemund	Provides step by step instructions and code examples	Covers a wide range of R packages
Python for Data Analysis	Wes McKinney	N/A	Available online for free	Written by the creator of Pandas Library
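For JSON, fromJSON() often does the conversion by itself. A minimal sketch, assuming the file is an array of objects with one object per book (an assumption about the layout, using placeholder values):

```r
library(jsonlite)

# Hypothetical JSON standing in for the real books file
json_text <- '[
  {"title": "Book A", "authors": "Author One"},
  {"title": "Book B", "authors": "Author Two"}
]'

# With the default simplifyDataFrame = TRUE, an array of objects
# is returned directly as a data frame
books_json <- fromJSON(json_text)

is.data.frame(books_json)
print(books_json)
```

If the JSON has that shape, the as.data.frame() call is harmless but not strictly needed. A file that wraps the array in a named top-level key would come back as a list instead, and the conversion step would then matter.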

Conclusion

All three data frames are very similar: each has 3 rows and 5 columns, and the data is the same. The most visible difference is the column names, which I forgot to capitalize when I was creating the XML and JSON files. Below, I copied the HTML column names onto the other two data frames and checked whether the three are identical.

colnames(df_xml) <- colnames(df_html)
colnames(df_json) <- colnames(df_html)
identical(df_html,df_xml)

## [1] FALSE

identical(df_html,df_json)

## [1] FALSE

identical(df_xml,df_json)

## [1] TRUE

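The data frames from the XML and JSON files are identical, but the data frame from the HTML file differs from both. I used the all.equal() function to identify the difference. It reports a mismatch only in the class attribute: the HTML data frame has three class strings while the other two have one. That is consistent with html_table() returning a tibble while the XML and JSON results are plain data frames, so the difference is in the object type rather than in the data.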

all.equal(df_html,df_xml)

## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"

all.equal(df_html,df_json)

## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
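The class mismatch in the output above can be reproduced on a small scale. This is a sketch with made-up data, assuming the only difference between the two objects is tibble versus plain data frame:

```r
library(tibble)

# A tibble and a plain data frame holding the same hypothetical values
tb <- tibble(title = c("Book A", "Book B"),
             authors = c("Author One", "Author Two"))
df <- data.frame(title = c("Book A", "Book B"),
                 authors = c("Author One", "Author Two"),
                 stringsAsFactors = FALSE)

class(tb)  # "tbl_df" "tbl" "data.frame"
class(df)  # "data.frame"

# identical() is strict about attributes, so the extra classes make it FALSE
identical(tb, df)

# Dropping the tibble classes leaves only the data to compare
tb_plain <- as.data.frame(tb)
class(tb_plain)
all.equal(tb_plain, df)
```

Applying as.data.frame() to the HTML data frame before the comparison would be one way to check whether the three results match once the class difference is set aside.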

Week 7 Assignment: Working with XML and JSON in R

Jian Quan Chen

2023-03-12