0.1 Assignment Overview

This assignment stores information about three selected books in three file formats (HTML, XML, and JSON) and parses each file into an R data frame. Using attributes such as title and author, create three files that hold the books’ information, one per format (e.g. “books.html”, “books.xml”, and “books.json”).

The goal of this assignment is to use R code and any R packages of choice to load the information from each of the three sources into separate R data frames.

0.1.1 Setup

This assignment requires the following R packages:

  • XML
  • RCurl
  • plyr
  • jsonlite
  • knitr
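A minimal sketch of the setup step, assuming the packages come from CRAN. R package names are case sensitive, so the sketch uses the CRAN spellings RCurl and plyr. The variable names `pkgs`, `installed`, and `missing_pkgs` are illustrative and do not appear in the original code.

```r
# Packages required by this assignment (exact CRAN capitalisation).
pkgs <- c("XML", "RCurl", "plyr", "jsonlite", "knitr")

# Check which packages are installed; this loads namespaces but attaches nothing.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Tell the user what to install instead of failing later in the script.
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  # Attach everything so getURL(), xmlParse(), ldply(), fromJSON() and kable() are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

Checking before attaching keeps the failure message in one place rather than at the first unresolved function call.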

The following files are hosted on GitHub:

  • books.html
  • books.xml
  • books.json

0.1.2 Parsing XML file

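Load the file books.xml from GitHub.

# Download the raw XML text, then parse it into an internal XML document
x <- "https://raw.githubusercontent.com/raghu74us/607_1/master/books.xml"
x <- getURL(url = x)
x1 <- xmlParse(x, useInternalNodes = TRUE, validate = FALSE)
x1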
## <?xml version="1.0"?>
## <books>
##   <book id="1">
##     <Title>The Power of Positive Living</Title>
##     <Authors>John Norman, Vincent Peale</Authors>
##     <OriginallyPublished>1952</OriginallyPublished>
##     <Genres>Inspirational fiction, Philosophy, Children's literature</Genres>
##   </book>
##   <book id="2">
##     <Title>Mindset: The New Psychology of Success</Title>
##     <Authors>Carol Dweck</Authors>
##     <OriginallyPublished>2006</OriginallyPublished>
##     <Genres>Motivation</Genres>
##   </book>
##   <book id="3">
##     <Title>Strengths Finder 2.0</Title>
##     <Authors>Tom Rath</Authors>
##     <OriginallyPublished>2001</OriginallyPublished>
##     <Genres>Leadership</Genres>
##   </book>
## </books>
## 
# Turn each <book> node into a one-row data frame and stack the rows
x2 <- ldply(xmlToList(x1), data.frame)
str(x2)
## 'data.frame':    3 obs. of  6 variables:
##  $ .id                : chr  "book" "book" "book"
##  $ Title              : Factor w/ 3 levels "The Power of Positive Living",..: 1 2 3
##  $ Authors            : Factor w/ 3 levels "John Norman, Vincent Peale",..: 1 2 3
##  $ OriginallyPublished: Factor w/ 3 levels "1952","2006",..: 1 2 3
##  $ Genres             : Factor w/ 3 levels "Inspirational fiction, Philosophy, Children's literature",..: 1 2 3
##  $ .attrs             : Factor w/ 3 levels "1","2","3": 1 2 3
kable(x2)
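| .id  | Title                                  | Authors                    | OriginallyPublished | Genres                                                   | .attrs |
|:-----|:---------------------------------------|:---------------------------|:--------------------|:---------------------------------------------------------|:-------|
| book | The Power of Positive Living           | John Norman, Vincent Peale | 1952                | Inspirational fiction, Philosophy, Children’s literature | 1      |
| book | Mindset: The New Psychology of Success | Carol Dweck                | 2006                | Motivation                                               | 2      |
| book | Strengths Finder 2.0                   | Tom Rath                   | 2001                | Leadership                                               | 3      |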
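The str() output above shows most XML columns stored as factors. Converting a factor of years directly with as.integer() returns level codes rather than the years, so a conversion step is useful before any numeric work. The following is a sketch on a small stand-in data frame (`x2_demo` is a hypothetical name, not part of the original code), so it runs without a network connection.

```r
# Stand-in for x2: factor columns mirroring two of the columns in the str() output above.
x2_demo <- data.frame(
  Title = factor(c("The Power of Positive Living",
                   "Mindset: The New Psychology of Success",
                   "Strengths Finder 2.0")),
  OriginallyPublished = factor(c("1952", "2006", "2001"))
)

# Pitfall: as.integer() on a factor returns the internal level codes, not the years.
codes <- as.integer(x2_demo$OriginallyPublished)

# Going through character first recovers the actual values.
years <- as.integer(as.character(x2_demo$OriginallyPublished))

# Convert every factor column to character in one pass.
is_fac <- vapply(x2_demo, is.factor, logical(1))
x2_demo[is_fac] <- lapply(x2_demo[is_fac], as.character)
```

The same pattern applies to x2 itself. The year column can then be stored as an integer if numeric comparisons are needed.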

0.1.3 Parsing JSON

Load the file books.json from GitHub.

j <- "https://raw.githubusercontent.com/raghu74us/607_1/master/books.json"
j2 <- fromJSON(j)

j2
##                                    Title                    Authors
## 1           The Power of Positive Living John Norman, Vincent Peale
## 2 Mindset: The New Psychology of Success                Carol Dweck
## 3                   Strengths Finder 2.0                   Tom Rath
##   OriginallyPublished
## 1                1952
## 2                2006
## 3                2001
##                                                     Genres
## 1 Inspirational fiction, Philosophy, Children's literature
## 2                                               Motivation
## 3                                               Leadership
# Inspect the raw text: line 5 shows the year stored as a quoted JSON string
j3 <- readLines(j)
j3[5]
## [1] "        \"OriginallyPublished\": \"1952\","
str(j2)
## 'data.frame':    3 obs. of  4 variables:
##  $ Title              : chr  "The Power of Positive Living" "Mindset: The New Psychology of Success" "Strengths Finder 2.0"
##  $ Authors            : chr  "John Norman, Vincent Peale" "Carol Dweck" "Tom Rath"
##  $ OriginallyPublished: chr  "1952" "2006" "2001"
##  $ Genres             : chr  "Inspirational fiction, Philosophy, Children's literature" "Motivation" "Leadership"
kable(j2)
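| Title                                  | Authors                    | OriginallyPublished | Genres                                                   |
|:---------------------------------------|:---------------------------|:--------------------|:---------------------------------------------------------|
| The Power of Positive Living           | John Norman, Vincent Peale | 1952                | Inspirational fiction, Philosophy, Children’s literature |
| Mindset: The New Psychology of Success | Carol Dweck                | 2006                | Motivation                                               |
| Strengths Finder 2.0                   | Tom Rath                   | 2001                | Leadership                                               |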

0.1.4 Parsing HTML

Load the file books.html from GitHub.

# Download the raw HTML text, then parse it as HTML into an R-level tree
h <- "https://raw.githubusercontent.com/raghu74us/607_1/master/books.html"
t <- getURL(url = h)
h1 <- xmlParse(t, isHTML = TRUE, useInternalNodes = FALSE, validate = FALSE)
h1
## $file
## [1] "<buffer>"
## 
## $version
## [1] ""
## 
## $children
## $children$html
## <html>
##  <body>
##   <table border="1">
##    <thead>
##     <tr>
##      <th>Title</th>
##      <th>Author(s)</th>
##      <th>OriginallyPublished</th>
##      <th>Genres</th>
##     </tr>
##    </thead>
##    <tr>
##     <td>The Power of Positive Living</td>
##     <td>John Norman, Vincent Peale</td>
##     <td>1952</td>
##     <td>Inspirational fiction, Philosophy, Children&apos;s literature</td>
##    </tr>
##    <tr>
##     <td>Mindset: The New Psychology of Success</td>
##     <td>Carol Dweck</td>
##     <td>2006</td>
##     <td>Motivation</td>
##    </tr>
##    <tr>
##     <td>Strengths Finder 2.0</td>
##     <td>Tom Rath</td>
##     <td>2001</td>
##     <td>Leadership</td>
##    </tr>
##   </table>
##  </body>
## </html>
## 
## 
## attr(,"class")
## [1] "XMLDocumentContent"
html <- readHTMLTable(t)
str(html)
## List of 1
##  $ NULL:'data.frame':    4 obs. of  4 variables:
##   ..$ Title              : Factor w/ 4 levels "Mindset: The New Psychology of Success",..: 4 3 1 2
##   ..$ Author(s)          : Factor w/ 4 levels "Author(s)","Carol Dweck",..: 1 3 2 4
##   ..$ OriginallyPublished: Factor w/ 4 levels "1952","2001",..: 4 1 3 2
##   ..$ Genres             : Factor w/ 4 levels "Genres","Inspirational fiction, Philosophy, Children's literature",..: 1 2 4 3
html2DF <- as.data.frame(html)
html2DF
##                               NULL.Title             NULL.Author.s.
## 1                                  Title                  Author(s)
## 2           The Power of Positive Living John Norman, Vincent Peale
## 3 Mindset: The New Psychology of Success                Carol Dweck
## 4                   Strengths Finder 2.0                   Tom Rath
##   NULL.OriginallyPublished
## 1      OriginallyPublished
## 2                     1952
## 3                     2006
## 4                     2001
##                                                NULL.Genres
## 1                                                   Genres
## 2 Inspirational fiction, Philosophy, Children's literature
## 3                                               Motivation
## 4                                               Leadership
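In the output above, the table header appears as the first data row, so the data frame has four observations instead of three. One possible cleanup is sketched below on a stand-in data frame, assuming the table has first been extracted from the list with something like `html[[1]]`. The names `tbl` and `books_html` are hypothetical and only three of the four columns are reproduced to keep the example short.

```r
# Stand-in for html[[1]]: header text repeated as the first data row, stored as factors.
tbl <- data.frame(
  Title = c("Title", "The Power of Positive Living",
            "Mindset: The New Psychology of Success", "Strengths Finder 2.0"),
  Authors = c("Author(s)", "John Norman, Vincent Peale", "Carol Dweck", "Tom Rath"),
  OriginallyPublished = c("OriginallyPublished", "1952", "2006", "2001"),
  stringsAsFactors = TRUE
)

# Work on character columns so later conversions act on the text, not on factor codes.
tbl[] <- lapply(tbl, as.character)

# Promote the first row to syntactically valid column names, then drop that row.
names(tbl) <- make.names(unlist(tbl[1, ]))
books_html <- tbl[-1, , drop = FALSE]
rownames(books_html) <- NULL

# With the header text gone, the year column converts cleanly to integer.
books_html$OriginallyPublished <- as.integer(books_html$OriginallyPublished)
```

make.names() turns “Author(s)” into “Author.s.”, the same transformation visible in the NULL.Author.s. column name above.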

0.1.5 Conclusion

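The three sources produce data frames that differ in column types and in shape. Columns parsed from XML and HTML are stored as factors, whereas columns parsed from JSON are character vectors. The XML data frame also carries two extra columns, .id and .attrs, which hold the node name and the book id attribute. The HTML data frame contains the table header as a fourth observation, and its column names are prefixed with NULL. The file formats themselves differ as well: JSON stores the records as key-value pairs, while HTML and XML are tag-based markup.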
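To check that the sources agree once these differences are removed, the data frames can be brought to a common shape and compared. The sketch below uses small stand-in data frames. `normalize_books()`, `from_xml`, and `from_json` are hypothetical names introduced for illustration, not part of the original code.

```r
# Bring a parsed books data frame to a common shape:
# drop the XML-only columns and store every column as character.
normalize_books <- function(df) {
  keep <- setdiff(names(df), c(".id", ".attrs"))
  df <- df[, keep, drop = FALSE]
  df[] <- lapply(df, as.character)
  rownames(df) <- NULL
  df
}

titles <- c("The Power of Positive Living",
            "Mindset: The New Psychology of Success",
            "Strengths Finder 2.0")

# Stand-in for the XML result: factor columns plus .id and .attrs.
from_xml <- data.frame(.id = "book",
                       Title = factor(titles),
                       .attrs = factor(c("1", "2", "3")),
                       check.names = FALSE)

# Stand-in for the JSON result: plain character column.
from_json <- data.frame(Title = titles, stringsAsFactors = FALSE)

# TRUE when both sources hold the same content after normalization.
same <- isTRUE(all.equal(normalize_books(from_xml), normalize_books(from_json)))
```

The HTML result would need its header row removed before it could be passed to the same helper.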