Load libraries to read all files

#install.packages("jsonlite") #install package if it is not already on your machine
#rjson is not needed here: read_json() comes from jsonlite, and attaching both
#packages masks functions they share, such as fromJSON() and toJSON()
library(jsonlite)
library(xml2)
library(tidyverse)
Pick 3 favorite books and load each book's information into a separate data frame

Book #1 - an XML file

#Load the XML file Book1.xml
book1 <- read_xml("Book1.xml")
book1
## {xml_document}
## <note>
## [1] <title>A Knight in Shining Armor</title>
## [2] <author>Jude Deveraux</author>
## [3] <pages>480 pages</pages>
## [4] <genre>Contemporary Romance</genre>
## [5] <rank>235127</rank>
#Convert book1 data to a dataframe using tibble
book1.df <- tibble(book1)
book1.df
## # A tibble: 2 x 1
##   book1        
##   <named list> 
## 1 <externalptr>
## 2 <externalptr>
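
The tibble above holds only the two internal pointers of the xml2 document, not the book fields. One possible way to get a one-row data frame is to pull the name and text of each child element first. This is a minimal sketch, not the only approach; it assumes Book1.xml has the shape printed above, and it parses an inline copy (the hypothetical `xml_src` string) so that it runs without the file.

```r
library(xml2)
library(tibble)

# Inline copy of the structure printed above (assumption: Book1.xml looks like this)
xml_src <- "<note>
  <title>A Knight in Shining Armor</title>
  <author>Jude Deveraux</author>
  <pages>480 pages</pages>
  <genre>Contemporary Romance</genre>
  <rank>235127</rank>
</note>"

book1 <- read_xml(xml_src)

# Each child of the root node is one field of the book
fields <- xml_children(book1)

# Build a named list: element name -> element text
values <- setNames(as.list(xml_text(fields)), xml_name(fields))

# as_tibble() turns a named list of length-1 vectors into a one-row tibble
book1.df <- as_tibble(values)
book1.df
```

With the real file, replacing `read_xml(xml_src)` with `read_xml("Book1.xml")` should give the same five columns.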

Book #2 - an HTML file

#Load HTML file Book2.html
book2 <- read_html("Book2.html")
book2
## {html_document}
## <html>
## [1] <body><table><tr>\n<td>Run Fast. Cook Fast. Eat Slow: Quick-Fix Reci ...
#Convert book2 data to a dataframe using tibble
book2.df <- tibble(book2)
book2.df
## # A tibble: 2 x 1
##   book2        
##   <named list> 
## 1 <externalptr>
## 2 <externalptr>
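
As with the XML file, the tibble above contains pointers rather than text. A rough sketch of one way to extract the table cells is shown here. Because the printed output of Book2.html is truncated, everything in the inline `html_src` string is a placeholder, and both the layout (a single row of five cells) and the column names in `col_names` are assumptions borrowed from the other two files.

```r
library(xml2)
library(tibble)

# Placeholder HTML (assumption: one table row with five <td> cells)
html_src <- "<html><body><table><tr>
<td>Example Title</td><td>Example Author</td><td>100 pages</td>
<td>Example Genre</td><td>1</td>
</tr></table></body></html>"

book2 <- read_html(html_src)

# Find every table cell anywhere in the document and take its text
cells <- xml_text(xml_find_all(book2, ".//td"))

# Hypothetical column names, assumed to match the order of the cells
col_names <- c("title", "author", "pages", "genre", "rank")

# One column per cell, one row for the book
book2.df <- as_tibble(setNames(as.list(cells), col_names))
book2.df
```

If the real table also has a header row of `<th>` cells, those could be read the same way and used in place of `col_names`.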

Book #3 - a JSON file

#Load JSON file Book3.json
book3 <- read_json("Book3.json")
book3
## $title
## [1] "The MindBody Code"
## 
## $author
## [1] "Dr Mario Martinez"
## 
## $page
## [1] "328"
## 
## $genre
## [1] "Personal Transformation Self-Help"
## 
## $rank
## [1] "226399"
#Convert book3 to a dataframe using tibble
book3.df <- tibble(book3)
book3.df
## # A tibble: 5 x 1
##   book3       
##   <named list>
## 1 <chr [1]>   
## 2 <chr [1]>   
## 3 <chr [1]>   
## 4 <chr [1]>   
## 5 <chr [1]>
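
The parsed JSON is already a named list of single values, so it needs the least work. A minimal sketch, assuming Book3.json contains the flat object printed above (reproduced inline as the hypothetical `json_src` string so the example runs without the file):

```r
library(jsonlite)
library(tibble)

# Inline copy of the object printed above (assumption: Book3.json looks like this)
json_src <- '{
  "title": "The MindBody Code",
  "author": "Dr Mario Martinez",
  "page": "328",
  "genre": "Personal Transformation Self-Help",
  "rank": "226399"
}'

# A flat JSON object is parsed into a named list
book3 <- fromJSON(json_src)

# as_tibble() spreads the named list into one column per field
book3.df <- as_tibble(book3)

# The numbers are stored as strings in the file, so convert them explicitly
book3.df$page <- as.integer(book3.df$page)
book3.df$rank <- as.integer(book3.df$rank)
book3.df
```

The same `as_tibble()` call should work on the list returned by `read_json("Book3.json")`.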

Findings:

R reads and prints each file differently, although the data is very similar. The XML file prints a separate line for each part of the file (title, author, pages, and so on). The HTML file prints all of its data on just one line, because everything sits inside a single body node, so each part of the file is not shown as its own section. The JSON file seems to be the most organized: it is read as a named list, with clear separation between each piece of information in the file.

Loading into a data frame is the most difficult part. You cannot just pass the parsed object to tibble() without any sort of manipulation. The resulting data frame does not contain the text of the book information. Instead it holds a single list column, labeled “named list”, whose entries print as type summaries such as “chr [1]” for the JSON values and “externalptr” for the internal pointers of the XML and HTML documents.
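
The behavior described above comes from how `tibble()` treats a list, and it can be reproduced without any file. The sketch below uses a small made-up list (`book`, with two of the JSON fields) to contrast `tibble()`, which wraps the list as one column, with `as_tibble()`, which spreads it into one column per element.

```r
library(tibble)

# A small named list, similar in shape to the parsed JSON book
book <- list(title = "The MindBody Code", author = "Dr Mario Martinez")

# tibble() treats the list as a single list column: one row per element
wrapped <- tibble(book)

# as_tibble() treats each element as its own column: one row in total
spread <- as_tibble(book)

dim(wrapped)
dim(spread)
```

Here `wrapped` has two rows and one column, while `spread` has one row and two columns, which is the layout a data frame of book information needs.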