For this week’s assignment, I had to pick three books and store their information in HTML, XML, and JSON formats, then load the three files into R data frames. I can’t remember the last time I picked up a book, so I just picked three data science books I came across in the past year. With the help of several YouTube videos, I created the three files in Visual Studio Code and uploaded them to GitHub.

Loading the Libraries

library(tidyverse)
library(rvest)
library(XML)
library(xml2)
library(jsonlite)

Importing file from GitHub - HTML

html <- read_html("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/Week7%20books.html")

Converting the HTML into an R data frame

# From the rvest library: html_table() parses every HTML table on the page and returns a list
df_html <- html_table(html)
# The page has a single table, so keep the first list element
df_html <- df_html[[1]]
knitr::kable(df_html)
| Title | Authors | Authors2 | Interesting Attributes | Interesting Attributes2 |
|---|---|---|---|---|
| Data Science for Business | Foster Provost | Tom Fawcett | Conversational style make it an easy read. | Cover a wide range of topics |
| Data Science for Data Science | Hadley Wickham | Garret Grolemund | Provides step by step instructions and code examples | Covers a wide range of R packages |
| Python for Data Analysis | Wes McKinney | N/A | Available online for free | Written by the creator of Pandas Library |
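To make the list-then-index step above easier to follow, here is a minimal sketch on a tiny inline table. The HTML string and the names `html_text`, `page`, `tables`, and `toy_html` are made up for illustration and are not the contents of my file; it also assumes rvest 1.0.0 or later, where `html_table()` returns tibbles.

```r
library(rvest)

# Hypothetical inline HTML standing in for the hosted file
html_text <- '
<table>
  <tr><th>Title</th><th>Authors</th></tr>
  <tr><td>Python for Data Analysis</td><td>Wes McKinney</td></tr>
</table>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(html_text)

# html_table() on a whole page returns a list with one element per <table>
tables <- html_table(page)
length(tables)

# Pick the first (and here only) table out of the list
toy_html <- tables[[1]]
class(toy_html)
```

Checking `class()` here is worth the extra line, because the result is not a plain data frame, which matters when comparing it with the other two formats later on.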

Importing file from GitHub - XML

#From the xml2 library
xml_file <- read_xml("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/Week%207%20books.xml")

Converting XML to an R data frame

# Had to parse the XML document with xmlParse() before I could use xmlToDataFrame()
xml_parse <- xmlParse(xml_file)
# From the XML library: builds one row per record and one column per child element
df_xml <- xmlToDataFrame(xml_parse)
knitr::kable(df_xml)
| title | authors | authors2 | interesting_attributes | interesting_attributes2 |
|---|---|---|---|---|
| Data Science for Business | Foster Provost | Tom Fawcett | Conversational style make it an easy read. | Cover a wide range of topics |
| Data Science for Data Science | Hadley Wickham | Garret Grolemund | Provides step by step instructions and code examples | Covers a wide range of R packages |
| Python for Data Analysis | Wes McKinney | N/A | Available online for free | Written by the creator of Pandas Library |
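As an alternative to mixing the xml2 and XML packages, the same kind of data frame can probably be built with xml2 alone. This is only a sketch, not what I ran above: the inline XML and the `<books>`/`<book>` element names are assumptions for illustration, since the real file's root and record names may differ.

```r
library(xml2)

# Hypothetical inline XML; the <books>/<book> wrapper names are assumed
xml_string <- '<books>
  <book><title>Python for Data Analysis</title><authors>Wes McKinney</authors></book>
  <book><title>Data Science for Business</title><authors>Foster Provost</authors></book>
</books>'

doc <- read_xml(xml_string)

# One node per record
books <- xml_find_all(doc, ".//book")

# xml_find_first() returns one match per record, so the columns stay aligned
# even if a record is missing a field (the text would then be NA)
toy_xml <- data.frame(
  title   = xml_text(xml_find_first(books, "./title")),
  authors = xml_text(xml_find_first(books, "./authors")),
  stringsAsFactors = FALSE
)
toy_xml
```

The trade-off is that every column has to be named by hand, whereas `xmlToDataFrame()` infers the columns from the child elements.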

Importing file from GitHub - JSON

json_file <- fromJSON("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/week%207%20books.json")

Converting JSON to an R data frame

df_json <- as.data.frame(json_file)
knitr::kable(df_json)
| title | authors | authors2 | interesting attributes | interesting attributes2 |
|---|---|---|---|---|
| Data Science for Business | Foster Provost | Tom Fawcett | Conversational style make it an easy read. | Cover a wide range of topics |
| Data Science for Data Science | Hadley Wickham | Garret Grolemund | Provides step by step instructions and code examples | Covers a wide range of R packages |
| Python for Data Analysis | Wes McKinney | N/A | Available online for free | Written by the creator of Pandas Library |
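For a sense of what `fromJSON()` does on its own, here is a small sketch with a made-up JSON string. It assumes the file is a top-level array of objects, which is a guess about the layout rather than something shown above; in that case `fromJSON()` already simplifies the array into a data frame, so the `as.data.frame()` call mostly acts as a safety net.

```r
library(jsonlite)

# Hypothetical JSON: a top-level array with one object per book
json_text <- '[
  {"title": "Python for Data Analysis", "authors": "Wes McKinney"},
  {"title": "Data Science for Business", "authors": "Foster Provost"}
]'

# An array of objects with the same keys is simplified to a data frame
toy_json <- fromJSON(json_text)
class(toy_json)
names(toy_json)
```

If the books were nested under a top-level key instead, `fromJSON()` would return a list, and the data frame would have to be pulled out of that list first.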

Conclusion

All three data frames are very similar: each has 3 rows and 5 columns, and the data is exactly the same. The most visible difference is the column names, which I forgot to capitalize when I was creating the XML and JSON files. Below, I set the column names to match and checked whether the data frames are identical.

colnames(df_xml) <- colnames(df_html)
colnames(df_json) <- colnames(df_html)
identical(df_html,df_xml)
## [1] FALSE
identical(df_html,df_json)
## [1] FALSE
identical(df_xml,df_json)
## [1] TRUE

It looks like the data frames from the XML and JSON files are identical. However, the data frame from the HTML file seems to differ from the other two. I used the all.equal() function to identify the difference, and it is not in the data itself: the mismatch is in the class attribute. html_table() returns a tibble, whose class vector has three entries, while the other two are plain data frames with a single class.

all.equal(df_html,df_xml)
## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(df_html,df_json)
## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
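The class mismatch reported by all.equal() can be reproduced on toy data. The sketch below uses made-up values and hypothetical names (`toy_tbl`, `toy_df`, `toy_plain`), and it suggests one possible fix that I did not run on the book data: converting the tibble with `as.data.frame()` before comparing.

```r
library(tibble)

# Same made-up data stored as a tibble and as a plain data frame
toy_tbl <- tibble(Title = c("A", "B"), Authors = c("X", "Y"))
toy_df  <- data.frame(Title = c("A", "B"), Authors = c("X", "Y"),
                      stringsAsFactors = FALSE)

class(toy_tbl)  # "tbl_df" "tbl" "data.frame"
class(toy_df)   # "data.frame"

# The values match, but the class attribute does not
identical(toy_tbl, toy_df)  # FALSE

# Dropping the tibble classes leaves only the data to compare
toy_plain <- as.data.frame(toy_tbl)
all.equal(toy_plain, toy_df)
```

Applied here, that would mean comparing `as.data.frame(df_html)` against `df_xml` and `df_json` instead of `df_html` directly.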