Assignment - Working with XML and JSON in R:

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

install.packages("XML")

library(RCurl)
## Loading required package: bitops
library(XML)
library(RJSONIO)
library(stringr)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Below are the three files:

HTML: https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.html

XML: https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.xml

JSON: https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.json

Loading HTML file:

books.html <- htmlParse(getURL("https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.html"))
books.html
## <!DOCTYPE html>
## <html>
## <head><title>Favorite Books</title></head>
## <body>
##    <table>
## <tr align="left">
## <th>Title</th> <th>Author</th> <th>Subject</th> <th align="right">Edition</th> <th align="right">Price</th> <th align="right">Year</th> </tr>
## <tr align="left">
## <td>Machine Learning Yearning</td> <td>Andrew NG</td> <td>Deep Learning</td> <td align="right">3.0</td> <td align="right">$260.00</td> <td align="right">2017</td> </tr>
## <tr align="left">
## <td>Predictive Analytics</td> <td>Eric Siegel</td> <td>Statistics</td> <td align="right">6.0</td> <td align="right">$100.00</td> <td align="right">2016</td> </tr>
## <tr align="left">
## <td>Storytelling With Data</td> <td>Kole Nussbaumer Knaflic</td> <td>Data Transformation</td> <td align="right">5.0</td> <td align="right">$70.00</td> <td align="right">2015</td> </tr>
## <tr align="left">
## <td>Introduction to Machine Learning with Python</td> <td>Andreas C Muller &amp; Sarah Guido</td> <td>Machine Learning</td> <td align="right">8.0</td> <td align="right">$160.00</td> <td align="right">2018</td> </tr>
## </table>
## </body>
## </html>
## 
The column names are in the <th> elements; using xpathSApply we can get the column names:

column <- xpathSApply(doc = books.html, path = "//th", fun = xmlValue)
column
## [1] "Title"   "Author"  "Subject" "Edition" "Price"   "Year"
The values are in the <td> elements; using xpathSApply we can get the values:

To get the titles:

Title <- xpathSApply(doc = books.html, path = "//td[position()=1]", fun = xmlValue)
Title
## [1] "Machine Learning Yearning"                   
## [2] "Predictive Analytics"                        
## [3] "Storytelling With Data"                      
## [4] "Introduction to Machine Learning with Python"
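
As a self-contained illustration of the two XPath queries above, the following sketch parses a small inline table instead of the remote file. The table contents (Book A, Book B) are made-up placeholders, and //td[1] is used as the standard XPath shorthand for //td[position()=1].

```r
library(XML)

# Hypothetical two-row table standing in for books.html
html_text <- '<html><body><table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Book A</td><td>First Author</td></tr>
  <tr><td>Book B</td><td>Second Author</td></tr>
</table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# Header cells give the column names
col_names <- xpathSApply(doc, "//th", xmlValue)

# First <td> of every row, written two equivalent ways
titles_long  <- xpathSApply(doc, "//td[position()=1]", xmlValue)
titles_short <- xpathSApply(doc, "//td[1]", xmlValue)
```

Both title queries select the first cell of each row, so they return the same vector.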

Likewise, use a loop to create the final output data frame:

books_html_out <- data.frame(c(1:4))
for (i in 1:length(column)){
  path_string <- paste("//td[position()=", i, "]", sep = "")
  books_html_out[,i] <- xpathSApply(doc = books.html, path = path_string, fun = xmlValue)
}
names(books_html_out) <- column
books_html_out
##                                          Title
## 1                    Machine Learning Yearning
## 2                         Predictive Analytics
## 3                       Storytelling With Data
## 4 Introduction to Machine Learning with Python
##                           Author             Subject Edition   Price Year
## 1                      Andrew NG       Deep Learning     3.0 $260.00 2017
## 2                    Eric Siegel          Statistics     6.0 $100.00 2016
## 3        Kole Nussbaumer Knaflic Data Transformation     5.0  $70.00 2015
## 4 Andreas C Muller & Sarah Guido    Machine Learning     8.0 $160.00 2018
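
The loop above builds the data frame one column at a time. As a possible alternative, the XML package also provides readHTMLTable, which converts a whole table in one call. The sketch below uses a made-up inline table rather than the real books.html, and passes header = TRUE on the assumption that the first row holds the column labels.

```r
library(XML)

# Hypothetical inline table standing in for books.html
html_text <- '<html><body><table>
  <tr><th>Title</th><th>Author</th><th>Year</th></tr>
  <tr><td>Book A</td><td>First Author</td><td>2017</td></tr>
  <tr><td>Book B</td><td>Second Author</td><td>2018</td></tr>
</table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables
books_tbl <- readHTMLTable(doc, header = TRUE, which = 1,
                           stringsAsFactors = FALSE)
books_tbl
```

Every column still arrives as character, so values such as prices or years would need explicit conversion before any numeric work.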

Loading XML file:

books.xml <- xmlParse(getURL("https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.xml"))
books.xml
## <?xml version="1.0" encoding="ISO-8859-1"?>
## <favorite_books>
##   <book id="1">
##     <title>Machine Learning Yearning</title>
##     <author>Andrew NG</author>
##     <subject>Deep Learning</subject>
##     <edition>3.0</edition>
##     <price>$260.00</price>
##     <year>2017</year>
##   </book>
##   <book id="2">
##     <title>Predictive Analytics</title>
##     <author>Eric Siegel</author>
##     <subject>Statistics</subject>
##     <edition>6.0</edition>
##     <price>$100.00</price>
##     <year>2016</year>
##   </book>
##   <book id="3">
##     <title>Storytelling With Data</title>
##     <author>Kole Nussbaumer Knaflic</author>
##     <subject>Data Transformation</subject>
##     <edition>5.0</edition>
##     <price>$70.00</price>
##     <year>2015</year>
##   </book>
##   <book id="4">
##     <title>Introduction to Machine Learning with Python</title>
##     <author>Andreas C Muller and Sarah Guido</author>
##     <subject>Machine Learning</subject>
##     <edition>8.0</edition>
##     <price>$160.00</price>
##     <year>2018</year>
##   </book>
## </favorite_books>
## 

Using xmlName, look at the child nodes under the first <book> element:

column <- xpathSApply(doc = books.xml, path = "//book[@id='1']/child::*", fun = xmlName)
column
## [1] "title"   "author"  "subject" "edition" "price"   "year"

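Likewise, use a loop to create the final output data frame. Because the columns are assigned by name here rather than by position, the placeholder column created by data.frame(c(1:4)) is not overwritten and remains in the output as c.1.4.: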

books_xml_out <- data.frame(c(1:4))
for (i in column){
  path_string <- paste("//", i, sep = "")
  books_xml_out[, i] <- xpathSApply(doc = books.xml, path = path_string, fun = xmlValue)
}
books_xml_out
##   c.1.4.                                        title
## 1      1                    Machine Learning Yearning
## 2      2                         Predictive Analytics
## 3      3                       Storytelling With Data
## 4      4 Introduction to Machine Learning with Python
##                             author             subject edition   price
## 1                        Andrew NG       Deep Learning     3.0 $260.00
## 2                      Eric Siegel          Statistics     6.0 $100.00
## 3          Kole Nussbaumer Knaflic Data Transformation     5.0  $70.00
## 4 Andreas C Muller and Sarah Guido    Machine Learning     8.0 $160.00
##   year
## 1 2017
## 2 2016
## 3 2015
## 4 2018
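
One way to avoid the leftover c.1.4. column is to collect the columns in a named list and convert that list to a data frame in one step. This is only a sketch on a made-up inline document; the element names and book values are placeholders, not the contents of books.xml.

```r
library(XML)

# Hypothetical inline document standing in for books.xml
xml_text <- '<favorite_books>
  <book id="1"><title>Book A</title><author>First Author</author><year>2017</year></book>
  <book id="2"><title>Book B</title><author>Second Author</author><year>2018</year></book>
</favorite_books>'

doc <- xmlParse(xml_text, asText = TRUE)

# Child element names of the first book become the column names
cols <- xpathSApply(doc, "//book[@id='1']/child::*", xmlName)

# One character vector per column, named after the element it came from
col_values <- lapply(setNames(cols, cols), function(nm) {
  xpathSApply(doc, paste0("//", nm), xmlValue)
})

books_df <- data.frame(col_values, stringsAsFactors = FALSE)
books_df
```

Since no placeholder data frame is created first, the result contains only the columns found in the XML.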

Loading the JSON file after checking that it is valid JSON:

url_json <- "https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.json"
isValidJSON(url_json)
## [1] TRUE
htmlParse(getURL(url_json))
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><p>{"favorite books" :[
##  {
##  "title" : "Machine Learning Yearning",
##  "author" : ["Andrew NG"],
##  "subject" : ["Deep Learning"],
##  "edition" : "3.0",
##  "price" : "$260.00",
##  "year" : "2017"
##  },
##         {
##  "title" : "Predictive Analytics",
##  "author" : ["Eric Siegel"],
##  "subject" : ["Statistics"],
##  "edition" : "6.0",
##  "price" : "$100.00",
##  "year" : "2016"
##  },
##         {
##  "title" : "Storytelling With Data",
##  "author" : ["Kole Nussbaumer Knaflic"],
##  "subject" : ["Data Transformation"],
##  "edition" : "5.0",
##  "price" : "$70.00",
##  "year" : "2015"
##  },
##         {
##  "title" : "Introduction to Machine Learning with Python",
##  "author" : ["Andreas C Muller", "Sarah Guido"],
##  "subject" : ["Machine Learning"],
##  "edition" : "8.0",
##  "price" : "$160.00",
##  "year" : "2018"
##  }]
## }</p></body></html>
## 
books_json <- fromJSON(url_json)
books_json
## $`favorite books`
## $`favorite books`[[1]]
## $`favorite books`[[1]]$title
## [1] "Machine Learning Yearning"
## 
## $`favorite books`[[1]]$author
## [1] "Andrew NG"
## 
## $`favorite books`[[1]]$subject
## [1] "Deep Learning"
## 
## $`favorite books`[[1]]$edition
## [1] "3.0"
## 
## $`favorite books`[[1]]$price
## [1] "$260.00"
## 
## $`favorite books`[[1]]$year
## [1] "2017"
## 
## 
## $`favorite books`[[2]]
## $`favorite books`[[2]]$title
## [1] "Predictive Analytics"
## 
## $`favorite books`[[2]]$author
## [1] "Eric Siegel"
## 
## $`favorite books`[[2]]$subject
## [1] "Statistics"
## 
## $`favorite books`[[2]]$edition
## [1] "6.0"
## 
## $`favorite books`[[2]]$price
## [1] "$100.00"
## 
## $`favorite books`[[2]]$year
## [1] "2016"
## 
## 
## $`favorite books`[[3]]
## $`favorite books`[[3]]$title
## [1] "Storytelling With Data"
## 
## $`favorite books`[[3]]$author
## [1] "Kole Nussbaumer Knaflic"
## 
## $`favorite books`[[3]]$subject
## [1] "Data Transformation"
## 
## $`favorite books`[[3]]$edition
## [1] "5.0"
## 
## $`favorite books`[[3]]$price
## [1] "$70.00"
## 
## $`favorite books`[[3]]$year
## [1] "2015"
## 
## 
## $`favorite books`[[4]]
## $`favorite books`[[4]]$title
## [1] "Introduction to Machine Learning with Python"
## 
## $`favorite books`[[4]]$author
## [1] "Andreas C Muller" "Sarah Guido"     
## 
## $`favorite books`[[4]]$subject
## [1] "Machine Learning"
## 
## $`favorite books`[[4]]$edition
## [1] "8.0"
## 
## $`favorite books`[[4]]$price
## [1] "$160.00"
## 
## $`favorite books`[[4]]$year
## [1] "2018"
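
The same validate-then-parse steps can be tried without network access by passing the JSON as text. The sketch below uses a shortened, made-up document with the same shape as books.json (an object holding an array of books, with author stored as an array); asText = TRUE tells RJSONIO that the argument is the JSON content itself rather than a file name or URL.

```r
library(RJSONIO)

# Hypothetical JSON with the same structure as books.json
json_text <- '{"favorite books": [
  {"title": "Book A", "author": ["First Author"], "year": "2017"},
  {"title": "Book B", "author": ["Second Author", "Third Author"], "year": "2018"}
]}'

# Validate first, then parse
is_valid <- isValidJSON(json_text, asText = TRUE)
parsed   <- fromJSON(json_text, asText = TRUE)

# The top-level object has one entry holding the list of books
books <- parsed[["favorite books"]]
second_authors <- books[[2]][["author"]]
```

The author entry of the second book comes back as a character vector of length two, which is what later causes the extra author columns.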

Use sapply to unlist each book; the two authors of the fourth book become the separate entries author1 and author2:

books_json <- sapply(books_json[[1]], unlist)
books_json
## [[1]]
##                       title                      author 
## "Machine Learning Yearning"                 "Andrew NG" 
##                     subject                     edition 
##             "Deep Learning"                       "3.0" 
##                       price                        year 
##                   "$260.00"                      "2017" 
## 
## [[2]]
##                  title                 author                subject 
## "Predictive Analytics"          "Eric Siegel"           "Statistics" 
##                edition                  price                   year 
##                  "6.0"              "$100.00"                 "2016" 
## 
## [[3]]
##                     title                    author 
##  "Storytelling With Data" "Kole Nussbaumer Knaflic" 
##                   subject                   edition 
##     "Data Transformation"                     "5.0" 
##                     price                      year 
##                  "$70.00"                    "2015" 
## 
## [[4]]
##                                          title 
## "Introduction to Machine Learning with Python" 
##                                        author1 
##                             "Andreas C Muller" 
##                                        author2 
##                                  "Sarah Guido" 
##                                        subject 
##                             "Machine Learning" 
##                                        edition 
##                                          "8.0" 
##                                          price 
##                                      "$160.00" 
##                                           year 
##                                         "2018"
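
The author1 and author2 names come from how unlist flattens a list: an element that holds more than one value is split into one entry per value, and a running number is appended to its name. A minimal base R sketch with made-up values:

```r
# Hypothetical book record with two authors
book <- list(
  title  = "Book B",
  author = c("Second Author", "Third Author"),
  year   = "2018"
)

# unlist() flattens the list into a named character vector
flat <- unlist(book)
names(flat)
```

A book with a single author keeps the plain name author, which is why the flattened books end up with different sets of names.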

Now use lapply to transpose each vector into a one-row matrix:

books_json <- lapply(books_json, t)
books_json
## [[1]]
##      title                       author      subject         edition
## [1,] "Machine Learning Yearning" "Andrew NG" "Deep Learning" "3.0"  
##      price     year  
## [1,] "$260.00" "2017"
## 
## [[2]]
##      title                  author        subject      edition price    
## [1,] "Predictive Analytics" "Eric Siegel" "Statistics" "6.0"   "$100.00"
##      year  
## [1,] "2016"
## 
## [[3]]
##      title                    author                   
## [1,] "Storytelling With Data" "Kole Nussbaumer Knaflic"
##      subject               edition price    year  
## [1,] "Data Transformation" "5.0"   "$70.00" "2015"
## 
## [[4]]
##      title                                          author1           
## [1,] "Introduction to Machine Learning with Python" "Andreas C Muller"
##      author2       subject            edition price     year  
## [1,] "Sarah Guido" "Machine Learning" "8.0"   "$160.00" "2018"

Create separate data frames for each book:

books_json <- lapply(books_json, data.frame, stringsAsFactors = FALSE)
books_json
## [[1]]
##                       title    author       subject edition   price year
## 1 Machine Learning Yearning Andrew NG Deep Learning     3.0 $260.00 2017
## 
## [[2]]
##                  title      author    subject edition   price year
## 1 Predictive Analytics Eric Siegel Statistics     6.0 $100.00 2016
## 
## [[3]]
##                    title                  author             subject
## 1 Storytelling With Data Kole Nussbaumer Knaflic Data Transformation
##   edition  price year
## 1     5.0 $70.00 2015
## 
## [[4]]
##                                          title          author1
## 1 Introduction to Machine Learning with Python Andreas C Muller
##       author2          subject edition   price year
## 1 Sarah Guido Machine Learning     8.0 $160.00 2018

Use rbind.fill to merge all the data frames into one final data frame:

books_json_out <- do.call("rbind.fill", books_json)
books_json_out
##                                          title                  author
## 1                    Machine Learning Yearning               Andrew NG
## 2                         Predictive Analytics             Eric Siegel
## 3                       Storytelling With Data Kole Nussbaumer Knaflic
## 4 Introduction to Machine Learning with Python                    <NA>
##               subject edition   price year          author1     author2
## 1       Deep Learning     3.0 $260.00 2017             <NA>        <NA>
## 2          Statistics     6.0 $100.00 2016             <NA>        <NA>
## 3 Data Transformation     5.0  $70.00 2015             <NA>        <NA>
## 4    Machine Learning     8.0 $160.00 2018 Andreas C Muller Sarah Guido
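
If a single author column is preferred, so that the JSON result has the same shape as the HTML and XML data frames, one option is to paste multiple authors together before building the rows. This is a sketch in base R on made-up records; the helper name collapse_fields and the " & " separator are assumptions chosen for illustration.

```r
# Hypothetical parsed books, shaped like the list returned by fromJSON
books <- list(
  list(title = "Book A", author = c("First Author"), year = "2017"),
  list(title = "Book B", author = c("Second Author", "Third Author"), year = "2018")
)

# Collapse every field to a single string, then build a one-row data frame
collapse_fields <- function(book) {
  as.data.frame(lapply(book, paste, collapse = " & "),
                stringsAsFactors = FALSE)
}

# All rows now share the same columns, so plain rbind is enough
books_df <- do.call(rbind, lapply(books, collapse_fields))
books_df
```

With every row having identical columns, rbind.fill and the resulting NA padding are no longer needed.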

Comparison:

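The three data frames are not identical. The HTML and XML data frames look similar: in both, the authors are treated as one string and loaded into a single column (“Author” in HTML, “author” in XML). They still differ in small ways: the HTML column names are capitalized, the XML data frame keeps the extra placeholder column c.1.4., and the two authors of the fourth book are joined with “&” in the HTML file but with “and” in the XML file. In the JSON file the authors are stored as an array, so when a book has more than one author they end up in separate columns, “author1” and “author2”, and the “author” column is NA for that book. The JSON data frame is therefore in a wide format.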
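
The comparison can also be made programmatically with identical and all.equal. The sketch below uses two tiny made-up data frames that differ only in the capitalization of their column names, as the HTML and XML results partly do; it is not run on the real data.

```r
# Hypothetical one-row stand-ins for the HTML and XML data frames
html_df <- data.frame(Title = "Book A", Author = "First Author",
                      stringsAsFactors = FALSE)
xml_df  <- data.frame(title = "Book A", author = "First Author",
                      stringsAsFactors = FALSE)

# Different column names are enough to make the data frames non-identical
same_before <- identical(html_df, xml_df)

# After lower-casing the HTML names, the two frames can be compared again
names(html_df) <- tolower(names(html_df))
same_after <- isTRUE(all.equal(html_df, xml_df))
```

On the real data, the extra c.1.4. column and the differing author strings would also have to be reconciled before such a check could succeed.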