Assignment - Working with XML and JSON in R:
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
library(RCurl)
## Loading required package: bitops
library(XML)
library(RJSONIO)
library(stringr)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Below are the three files:
HTML: https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.html
XML: https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.xml
JSON: https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.json
Loading HTML file:
books.html <- htmlParse(getURL("https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.html"))
books.html
## <!DOCTYPE html>
## <html>
## <head><title>Favorite Books</title></head>
## <body>
## <table>
## <tr align="left">
## <th>Title</th> <th>Author</th> <th>Subject</th> <th align="right">Edition</th> <th align="right">Price</th> <th align="right">Year</th> </tr>
## <tr align="left">
## <td>Machine Learning Yearning</td> <td>Andrew NG</td> <td>Deep Learning</td> <td align="right">3.0</td> <td align="right">$260.00</td> <td align="right">2017</td> </tr>
## <tr align="left">
## <td>Predictive Analytics</td> <td>Eric Siegel</td> <td>Statistics</td> <td align="right">6.0</td> <td align="right">$100.00</td> <td align="right">2016</td> </tr>
## <tr align="left">
## <td>Storytelling With Data</td> <td>Kole Nussbaumer Knaflic</td> <td>Data Transformation</td> <td align="right">5.0</td> <td align="right">$70.00</td> <td align="right">2015</td> </tr>
## <tr align="left">
## <td>Introduction to Machine Learning with Python</td> <td>Andreas C Muller & Sarah Guido</td> <td>Machine Learning</td> <td align="right">8.0</td> <td align="right">$160.00</td> <td align="right">2018</td> </tr>
## </table>
## </body>
## </html>
##
The column names are in the <th> elements; using xpathSApply we can get the column names:
column <- xpathSApply(doc = books.html, path = "//th", fun = xmlValue)
column
## [1] "Title" "Author" "Subject" "Edition" "Price" "Year"
The values are in the <td> elements; using xpathSApply we can get the values.
To get the titles:
Title <- xpathSApply(doc = books.html, path = "//td[position()=1]", fun = xmlValue)
Title
## [1] "Machine Learning Yearning"
## [2] "Predictive Analytics"
## [3] "Storytelling With Data"
## [4] "Introduction to Machine Learning with Python"
Likewise, use a loop over the columns to build the final data frame:
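# Start with a placeholder column of row numbers; it is overwritten when i = 1
books_html_out <- data.frame(c(1:4))
for (i in seq_along(column)) {
  # Select the i-th <td> of every row, i.e. the i-th table column
  path_string <- paste("//td[position()=", i, "]", sep = "")
  books_html_out[, i] <- xpathSApply(doc = books.html, path = path_string, fun = xmlValue)
}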
names(books_html_out) <- column
books_html_out
## Title
## 1 Machine Learning Yearning
## 2 Predictive Analytics
## 3 Storytelling With Data
## 4 Introduction to Machine Learning with Python
## Author Subject Edition Price Year
## 1 Andrew NG Deep Learning 3.0 $260.00 2017
## 2 Eric Siegel Statistics 6.0 $100.00 2016
## 3 Kole Nussbaumer Knaflic Data Transformation 5.0 $70.00 2015
## 4 Andreas C Muller & Sarah Guido Machine Learning 8.0 $160.00 2018
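As an alternative to looping over the columns, the XML package also provides readHTMLTable, which reads a whole table in one call. The following is a minimal sketch, not the method used above. It assumes a small inline HTML string (html_text, a hypothetical stand-in for books.html with only two rows and three columns) so that it runs without network access:

```r
library(XML)

# Hypothetical inline copy of part of the table, used instead of the URL
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Author</th><th>Year</th></tr>",
  "<tr><td>Predictive Analytics</td><td>Eric Siegel</td><td>2016</td></tr>",
  "<tr><td>Storytelling With Data</td><td>Kole Nussbaumer Knaflic</td><td>2015</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a data frame rather than a list of tables;
# header = TRUE takes the column names from the <th> row
books_tbl <- readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)
books_tbl
```

With the real file, the same call could be applied to the books.html object parsed earlier.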
Loading XML file:
books.xml <- xmlParse(getURL("https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.xml"))
books.xml
## <?xml version="1.0" encoding="ISO-8859-1"?>
## <favorite_books>
## <book id="1">
## <title>Machine Learning Yearning</title>
## <author>Andrew NG</author>
## <subject>Deep Learning</subject>
## <edition>3.0</edition>
## <price>$260.00</price>
## <year>2017</year>
## </book>
## <book id="2">
## <title>Predictive Analytics</title>
## <author>Eric Siegel</author>
## <subject>Statistics</subject>
## <edition>6.0</edition>
## <price>$100.00</price>
## <year>2016</year>
## </book>
## <book id="3">
## <title>Storytelling With Data</title>
## <author>Kole Nussbaumer Knaflic</author>
## <subject>Data Transformation</subject>
## <edition>5.0</edition>
## <price>$70.00</price>
## <year>2015</year>
## </book>
## <book id="4">
## <title>Introduction to Machine Learning with Python</title>
## <author>Andreas C Muller and Sarah Guido</author>
## <subject>Machine Learning</subject>
## <edition>8.0</edition>
## <price>$160.00</price>
## <year>2018</year>
## </book>
## </favorite_books>
##
Using xmlName, look at the child nodes under the first <book> element:
column <- xpathSApply(doc = books.xml, path = "//book[@id='1']/child::*", fun = xmlName)
column
## [1] "title" "author" "subject" "edition" "price" "year"
Likewise, use a loop to build the final data frame:
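# Placeholder column of row numbers; it stays in the result as "c.1.4."
books_xml_out <- data.frame(c(1:4))
for (i in column) {
  # Collect the text of every element named i (e.g. every <title>)
  path_string <- paste("//", i, sep = "")
  books_xml_out[, i] <- xpathSApply(doc = books.xml, path = path_string, fun = xmlValue)
}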
books_xml_out
## c.1.4. title
## 1 1 Machine Learning Yearning
## 2 2 Predictive Analytics
## 3 3 Storytelling With Data
## 4 4 Introduction to Machine Learning with Python
## author subject edition price
## 1 Andrew NG Deep Learning 3.0 $260.00
## 2 Eric Siegel Statistics 6.0 $100.00
## 3 Kole Nussbaumer Knaflic Data Transformation 5.0 $70.00
## 4 Andreas C Muller and Sarah Guido Machine Learning 8.0 $160.00
## year
## 1 2017
## 2 2016
## 3 2015
## 4 2018
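The loop above leaves the placeholder column c.1.4. in the result because columns are assigned by name. One possible way to avoid this is xmlToDataFrame from the XML package, which turns each child of the root node into a row. This is a sketch under assumptions: xml_text is a hypothetical inline stand-in for books.xml with two books and three fields, used so the example runs offline.

```r
library(XML)

# Hypothetical inline copy of part of books.xml
xml_text <- paste0(
  "<favorite_books>",
  "<book id=\"1\"><title>Predictive Analytics</title>",
  "<author>Eric Siegel</author><year>2016</year></book>",
  "<book id=\"2\"><title>Storytelling With Data</title>",
  "<author>Kole Nussbaumer Knaflic</author><year>2015</year></book>",
  "</favorite_books>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# Each <book> becomes a row and each of its child elements a column
books_xml_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
books_xml_df
```

The resulting data frame has only the element names as columns, with no extra index column.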
Loading the JSON file after checking that it is valid JSON:
url_json <- "https://raw.githubusercontent.com/Riteshlohiya/DATA607-Week-7-Assignment/master/books.json"
isValidJSON(url_json)
## [1] TRUE
htmlParse(getURL(url_json))
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><p>{"favorite books" :[
## {
## "title" : "Machine Learning Yearning",
## "author" : ["Andrew NG"],
## "subject" : ["Deep Learning"],
## "edition" : "3.0",
## "price" : "$260.00",
## "year" : "2017"
## },
## {
## "title" : "Predictive Analytics",
## "author" : ["Eric Siegel"],
## "subject" : ["Statistics"],
## "edition" : "6.0",
## "price" : "$100.00",
## "year" : "2016"
## },
## {
## "title" : "Storytelling With Data",
## "author" : ["Kole Nussbaumer Knaflic"],
## "subject" : ["Data Transformation"],
## "edition" : "5.0",
## "price" : "$70.00",
## "year" : "2015"
## },
## {
## "title" : "Introduction to Machine Learning with Python",
## "author" : ["Andreas C Muller", "Sarah Guido"],
## "subject" : ["Machine Learning"],
## "edition" : "8.0",
## "price" : "$160.00",
## "year" : "2018"
## }]
## }</p></body></html>
##
books_json <- fromJSON(url_json)
books_json
## $`favorite books`
## $`favorite books`[[1]]
## $`favorite books`[[1]]$title
## [1] "Machine Learning Yearning"
##
## $`favorite books`[[1]]$author
## [1] "Andrew NG"
##
## $`favorite books`[[1]]$subject
## [1] "Deep Learning"
##
## $`favorite books`[[1]]$edition
## [1] "3.0"
##
## $`favorite books`[[1]]$price
## [1] "$260.00"
##
## $`favorite books`[[1]]$year
## [1] "2017"
##
##
## $`favorite books`[[2]]
## $`favorite books`[[2]]$title
## [1] "Predictive Analytics"
##
## $`favorite books`[[2]]$author
## [1] "Eric Siegel"
##
## $`favorite books`[[2]]$subject
## [1] "Statistics"
##
## $`favorite books`[[2]]$edition
## [1] "6.0"
##
## $`favorite books`[[2]]$price
## [1] "$100.00"
##
## $`favorite books`[[2]]$year
## [1] "2016"
##
##
## $`favorite books`[[3]]
## $`favorite books`[[3]]$title
## [1] "Storytelling With Data"
##
## $`favorite books`[[3]]$author
## [1] "Kole Nussbaumer Knaflic"
##
## $`favorite books`[[3]]$subject
## [1] "Data Transformation"
##
## $`favorite books`[[3]]$edition
## [1] "5.0"
##
## $`favorite books`[[3]]$price
## [1] "$70.00"
##
## $`favorite books`[[3]]$year
## [1] "2015"
##
##
## $`favorite books`[[4]]
## $`favorite books`[[4]]$title
## [1] "Introduction to Machine Learning with Python"
##
## $`favorite books`[[4]]$author
## [1] "Andreas C Muller" "Sarah Guido"
##
## $`favorite books`[[4]]$subject
## [1] "Machine Learning"
##
## $`favorite books`[[4]]$edition
## [1] "8.0"
##
## $`favorite books`[[4]]$price
## [1] "$160.00"
##
## $`favorite books`[[4]]$year
## [1] "2018"
Use sapply to unlist:
books_json <- sapply(books_json[[1]], unlist)
books_json
## [[1]]
## title author
## "Machine Learning Yearning" "Andrew NG"
## subject edition
## "Deep Learning" "3.0"
## price year
## "$260.00" "2017"
##
## [[2]]
## title author subject
## "Predictive Analytics" "Eric Siegel" "Statistics"
## edition price year
## "6.0" "$100.00" "2016"
##
## [[3]]
## title author
## "Storytelling With Data" "Kole Nussbaumer Knaflic"
## subject edition
## "Data Transformation" "5.0"
## price year
## "$70.00" "2015"
##
## [[4]]
## title
## "Introduction to Machine Learning with Python"
## author1
## "Andreas C Muller"
## author2
## "Sarah Guido"
## subject
## "Machine Learning"
## edition
## "8.0"
## price
## "$160.00"
## year
## "2018"
Now use lapply to transpose each vector into a one-row matrix:
books_json <- lapply(books_json, t)
books_json
## [[1]]
## title author subject edition
## [1,] "Machine Learning Yearning" "Andrew NG" "Deep Learning" "3.0"
## price year
## [1,] "$260.00" "2017"
##
## [[2]]
## title author subject edition price
## [1,] "Predictive Analytics" "Eric Siegel" "Statistics" "6.0" "$100.00"
## year
## [1,] "2016"
##
## [[3]]
## title author
## [1,] "Storytelling With Data" "Kole Nussbaumer Knaflic"
## subject edition price year
## [1,] "Data Transformation" "5.0" "$70.00" "2015"
##
## [[4]]
## title author1
## [1,] "Introduction to Machine Learning with Python" "Andreas C Muller"
## author2 subject edition price year
## [1,] "Sarah Guido" "Machine Learning" "8.0" "$160.00" "2018"
Create separate data frames for each book:
books_json <- lapply(books_json, data.frame, stringsAsFactors = FALSE)
books_json
## [[1]]
## title author subject edition price year
## 1 Machine Learning Yearning Andrew NG Deep Learning 3.0 $260.00 2017
##
## [[2]]
## title author subject edition price year
## 1 Predictive Analytics Eric Siegel Statistics 6.0 $100.00 2016
##
## [[3]]
## title author subject
## 1 Storytelling With Data Kole Nussbaumer Knaflic Data Transformation
## edition price year
## 1 5.0 $70.00 2015
##
## [[4]]
## title author1
## 1 Introduction to Machine Learning with Python Andreas C Muller
## author2 subject edition price year
## 1 Sarah Guido Machine Learning 8.0 $160.00 2018
Use rbind.fill to merge all the data frames into one final data frame:
books_json_out <- do.call("rbind.fill", books_json)
books_json_out
## title author
## 1 Machine Learning Yearning Andrew NG
## 2 Predictive Analytics Eric Siegel
## 3 Storytelling With Data Kole Nussbaumer Knaflic
## 4 Introduction to Machine Learning with Python <NA>
## subject edition price year author1 author2
## 1 Deep Learning 3.0 $260.00 2017 <NA> <NA>
## 2 Statistics 6.0 $100.00 2016 <NA> <NA>
## 3 Data Transformation 5.0 $70.00 2015 <NA> <NA>
## 4 Machine Learning 8.0 $160.00 2018 Andreas C Muller Sarah Guido
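If a single author column is preferred, so that the JSON result lines up with the HTML and XML ones, the authors of each book can be collapsed into one string before the data frames are built. The sketch below uses base R only. The list books_raw is a hypothetical, shortened imitation of the structure that fromJSON printed above, and the " & " separator is an assumption chosen to match the HTML file.

```r
# Hypothetical nested list mimicking the parsed JSON (two books, three fields)
books_raw <- list(
  list(title = "Predictive Analytics",
       author = "Eric Siegel",
       year = "2016"),
  list(title = "Introduction to Machine Learning with Python",
       author = c("Andreas C Muller", "Sarah Guido"),
       year = "2018")
)

# Collapse every field to a single string, then make a one-row data frame
collapse_book <- function(book) {
  fields <- lapply(book, paste, collapse = " & ")
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# All books now share the same columns, so plain rbind is enough
books_json_single <- do.call(rbind, lapply(books_raw, collapse_book))
books_json_single
```

Because every book ends up with the same set of columns, no NA padding is needed and the author1 and author2 columns do not appear.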
Comparison:
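The HTML and XML outputs look similar but are not identical: the XML data frame keeps the placeholder column c.1.4. and uses lowercase column names, and the co-authored book reads “Andreas C Muller and Sarah Guido” in XML versus “Andreas C Muller & Sarah Guido” in HTML. In both, the authors are treated as one string and loaded into a single “author” column. In the JSON file, a book with more than one author has each author stored in its own column (“author1” and “author2”), leaving “author” as NA for that row. So the JSON data frame ends up in a wider format than the other two.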
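The comparison can also be checked in code rather than by eye. The following base R sketch uses two hypothetical miniature data frames (html_df and xml_df) that imitate the differences seen above. The helper normalise is an illustrative name, not part of any package.

```r
# Hypothetical miniature versions of the HTML and XML data frames
html_df <- data.frame(Title = c("Predictive Analytics", "Storytelling With Data"),
                      Year = c("2016", "2015"),
                      stringsAsFactors = FALSE)
xml_df <- data.frame(c.1.4. = 1:2,
                     title = c("Predictive Analytics", "Storytelling With Data"),
                     year = c("2016", "2015"),
                     stringsAsFactors = FALSE)

# Strict comparison of the raw data frames
same_raw <- identical(html_df, xml_df)

# Drop the placeholder column and lowercase the names before comparing
normalise <- function(df) {
  df <- df[, names(df) != "c.1.4.", drop = FALSE]
  names(df) <- tolower(names(df))
  df
}
same_after_cleanup <- isTRUE(all.equal(normalise(html_df), normalise(xml_df)))

same_raw
same_after_cleanup
```

In this toy case the frames differ as loaded but agree once the extra column and the name case are accounted for. With the real data, the “&” versus “and” difference in the author text would still remain.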