Week7 Assignment

Overview

This assignment explores the abilty of the R language to interact with various file types. Specifically, it will attempt to load the following file types to R dataframes:

.html
.xml
.json

This will be done using three files, one for each of the data types listed above. Each of the files will contain the same information, but will be constructed in the format requirement of that specific file type. The files will contain information on a set of three books and include data for each of the follwoing areas:

Title
Author
Number of Pages
Publish Year

HTML

Load and parse the HTML file:

library(XML)
library(stringr)

fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/02_Books.html"
parsed_html <- htmlParse(file = fileLocation)
parsed_html

## <!DOCTYPE HTML>
## <html>
## <head><title>Books</title></head>
## <body> 
##           <table>
## <tr>
## <th>Title</th>   
##                   <th>Author_1</th>
##                   <th>Author_2</th> 
##                   <th>Author_3</th> 
##                   <th>Author_4</th>    
##                   <th>Number_of_Pages</th>   
##                   <th>Publish_Year</th>
##               </tr>
## <tr>
## <td>R for Everyone</td>       
##                   <td>Jared P. Lander</td> 
##                   <td></td>
##                   <td></td>  
##                   <td></td>            
##                   <td>464</td>
##                   <td>2013</td>
##               </tr>
## <tr>
## <td>Automated Data Collection with R</td>       
##                   <td>Simon Munzert</td>
##                   <td>Christian Rubba</td>
##                   <td>Peter Meibner</td>
##                   <td>Dominic Nyhuis</td>           
##                   <td>474</td>
##                   <td>2015</td>
##               </tr>
## <tr>
## <td>Sams Teach Yourself R</td>       
##                   <td>Andy Nicholls</td>
##                   <td>Richard Pugh</td>
##                   <td>Aimee Gott</td> 
##                   <td></td>          
##                   <td>624</td>
##                   <td>2015</td>
##               </tr>
## </table>
## </body>
## </html>
##

Load the parsed data to a dataframe

books_html <- as.data.frame(readHTMLTable(doc = parsed_html))
books_html

##                         NULL.Title   NULL.Author_1   NULL.Author_2
## 1                   R for Everyone Jared P. Lander                
## 2 Automated Data Collection with R   Simon Munzert Christian Rubba
## 3            Sams Teach Yourself R   Andy Nicholls    Richard Pugh
##   NULL.Author_3  NULL.Author_4 NULL.Number_of_Pages NULL.Publish_Year
## 1                                               464              2013
## 2 Peter Meibner Dominic Nyhuis                  474              2015
## 3    Aimee Gott                                 624              2015

Cleanup the names of the dataframe columns

names(books_html) <- str_replace_all(names(books_html), "NULL.", "")
names(books_html) <- str_replace_all(names(books_html), "\\.", "_")
names(books_html)

## [1] "Title"           "Author_1"        "Author_2"        "Author_3"       
## [5] "Author_4"        "Number_of_Pages" "Publish_Year"

XML

XML Method1

Load and parse the XML file:

fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/03_Books.xml"
parsed_xml <- xmlParse(file = fileLocation)
parsed_xml

## <?xml version="1.0"?>
## <Books>
##   <book id="1">
##     <Title>R for Everyone</Title>
##     <Authors>Jared P. Lander</Authors>
##     <Number_of_Pages>464</Number_of_Pages>
##     <Publish_Year>2013</Publish_Year>
##   </book>
##   <book id="2">
##     <Title>Automated Data Collection with R</Title>
##     <Authors>Simon Munzert; Christian Rubba; Peter Meibner; Dominic Nyhuis</Authors>
##     <Number_of_Pages>474</Number_of_Pages>
##     <Publish_Year>2015</Publish_Year>
##   </book>
##   <book id="3">
##     <Title>Sams Teach Yourself R</Title>
##     <Authors>Andy Nicholls; Richard Pugh; Aimee Gott</Authors>
##     <Number_of_Pages>624</Number_of_Pages>
##     <Publish_Year>2015</Publish_Year>
##   </book>
## </Books>
##

Load the parsed XML to a dataframe

xmlToDataFrame(parsed_xml)

##                              Title
## 1                   R for Everyone
## 2 Automated Data Collection with R
## 3            Sams Teach Yourself R
##                                                         Authors
## 1                                               Jared P. Lander
## 2 Simon Munzert; Christian Rubba; Peter Meibner; Dominic Nyhuis
## 3                       Andy Nicholls; Richard Pugh; Aimee Gott
##   Number_of_Pages Publish_Year
## 1             464         2013
## 2             474         2015
## 3             624         2015

While this structure does contain all the data, it has all the authors listed in a single element seperated by semicolons. This does not fully take advantage of XML’s ability to have each of the individual authors recorded as their own node within a parent node of “Authors”. To explore the changes need to accomodate this more complex structure, the next section will use a new XML file where each author is stored is its own node.

XML Method2

Load the updated xml file which now has seperate notes for each author

fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/04_Books_sep_authors.xml"
parsed_xml <- xmlParse(file = fileLocation)
parsed_xml

## <?xml version="1.0"?>
## <Books>
##   <book id="1">
##     <Title>R for Everyone</Title>
##     <Authors>
##       <Author id="1">Jared P. Lander</Author>
##     </Authors>
##     <Number_of_Pages>464</Number_of_Pages>
##     <Publish_Year>2013</Publish_Year>
##   </book>
##   <book id="2">
##     <Title>Automated Data Collection with R</Title>
##     <Authors>
##       <Author id="1">Simon Munzert</Author>
##       <Author id="2">Christian Rubba</Author>
##       <Author id="3">Peter Meibner</Author>
##       <Author id="4">Dominic Nyhuis</Author>
##     </Authors>
##     <Number_of_Pages>474</Number_of_Pages>
##     <Publish_Year>2015</Publish_Year>
##   </book>
##   <book id="3">
##     <Title>Sams Teach Yourself R</Title>
##     <Authors>
##       <Author id="1">Andy Nicholls</Author>
##       <Author id="2">Richard Pugh</Author>
##       <Author id="3">Aimee Gott</Author>
##     </Authors>
##     <Number_of_Pages>624</Number_of_Pages>
##     <Publish_Year>2015</Publish_Year>
##   </book>
## </Books>
##

Try using xmlToDataframe again

xmlToDataFrame(parsed_xml)

##                              Title
## 1                   R for Everyone
## 2 Automated Data Collection with R
## 3            Sams Teach Yourself R
##                                                   Authors Number_of_Pages
## 1                                         Jared P. Lander             464
## 2 Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis             474
## 3                     Andy NichollsRichard PughAimee Gott             624
##   Publish_Year
## 1         2013
## 2         2015
## 3         2015

While this technically did load the data to an R dataframe, it changed the structure of the data by collapsing all of the authors into a single “Authors” column. This may not have been ok if during the collasping it had also added an identifable seperator between each author in the final output but it did not.

XML Method3

This section will now explore a way to load to a dataframe with each author remaining as a unique element.

Load the parsed XML data to a list

library(plyr)
book_list <- xmlToList(parsed_xml)

Use unlist, apply functions, and rbind.fill to reshape the data and load into a dataframe. (The rbind.fill is needed to compensate for the varying number of values in the “Authors” nodes.)

book_unlist <- sapply(book_list, unlist)
book_df <- do.call("rbind.fill", lapply(lapply(book_unlist, t), data.frame, stringsAsFactors = FALSE))
book_df

##                              Title Authors.Author.text
## 1                   R for Everyone     Jared P. Lander
## 2 Automated Data Collection with R       Simon Munzert
## 3            Sams Teach Yourself R       Andy Nicholls
##   Authors.Author..attrs.id Number_of_Pages Publish_Year .attrs.id
## 1                        1             464         2013         1
## 2                        1             474         2015         2
## 3                        1             624         2015         3
##   Authors.Author.text.1 Authors.Author..attrs.id.1 Authors.Author.text.2
## 1                  <NA>                       <NA>                  <NA>
## 2       Christian Rubba                          2         Peter Meibner
## 3          Richard Pugh                          2            Aimee Gott
##   Authors.Author..attrs.id.2 Authors.Author.text.3
## 1                       <NA>                  <NA>
## 2                          3        Dominic Nyhuis
## 3                          3                  <NA>
##   Authors.Author..attrs.id.3
## 1                       <NA>
## 2                          4
## 3                       <NA>

This successsfully loaded the data to a dataframe but does require the cleanup of column names and the removal of the unnecessary attribute columns.

Cleanup column names

names(book_df) <- str_replace_all(names(book_df), "\\.", "_")
names(book_df) <- str_replace_all(names(book_df), "Authors_", "")
names(book_df) <- str_replace_all(names(book_df), "_text", "")
names(book_df) <- str_replace_all(names(book_df), "_3", "_4")
names(book_df) <- str_replace_all(names(book_df), "_2", "_3")
names(book_df) <- str_replace_all(names(book_df), "_1", "_2")
names(book_df) <- str_replace_all(names(book_df), "Author\\b", "Author_1")
names(book_df)

##  [1] "Title"              "Author_1"           "Author__attrs_id"  
##  [4] "Number_of_Pages"    "Publish_Year"       "_attrs_id"         
##  [7] "Author_2"           "Author__attrs_id_2" "Author_3"          
## [10] "Author__attrs_id_3" "Author_4"           "Author__attrs_id_4"

Drop attribute columns and reorder columns

col_names <- names(book_df)
attr_cols <-str_detect(col_names, "attr")
book_df <- book_df[,!attr_cols]

library(dplyr)
book_df <- book_df %>% select(Title
                   ,Author_1
                   ,Author_2
                   ,Author_3
                   ,Author_4
                   ,Number_of_Pages
                   ,Publish_Year)
book_df

##                              Title        Author_1        Author_2
## 1                   R for Everyone Jared P. Lander            <NA>
## 2 Automated Data Collection with R   Simon Munzert Christian Rubba
## 3            Sams Teach Yourself R   Andy Nicholls    Richard Pugh
##        Author_3       Author_4 Number_of_Pages Publish_Year
## 1          <NA>           <NA>             464         2013
## 2 Peter Meibner Dominic Nyhuis             474         2015
## 3    Aimee Gott           <NA>             624         2015

This method was able to load the XML data to a dataframe and maintain each author as a seperate element. However, this is a strong example where it may have been better to load the data to multiple dataframes or other R objects rather force it into a single dataframe.

JSON

This seciton makes use of the RJSONIO package and assumes that package has been installed.

Load and parse the JSON file:

library(RJSONIO)
fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/05_Books.json"
parsed_json <- fromJSON(content = fileLocation)
parsed_json

## $Books
## $Books[[1]]
## $Books[[1]]$Title
## [1] "R for Everyone"
## 
## $Books[[1]]$Authors
##            Author 
## "Jared P. Lander" 
## 
## $Books[[1]]$Number_of_Pages
## [1] 464
## 
## $Books[[1]]$Publish_Year
## [1] 2013
## 
## 
## $Books[[2]]
## $Books[[2]]$Title
## [1] "Automated Data Collection with R"
## 
## $Books[[2]]$Authors
##            Author            Author            Author            Author 
##   "Simon Munzert" "Christian Rubba"   "Peter Meißner"  "Dominic Nyhuis" 
## 
## $Books[[2]]$Number_of_Pages
## [1] 474
## 
## $Books[[2]]$Publish_Year
## [1] 2015
## 
## 
## $Books[[3]]
## $Books[[3]]$Title
## [1] "Sams Teach Yourself R"
## 
## $Books[[3]]$Authors
##          Author          Author          Author 
## "Andy Nicholls"  "Richard Pugh"    "Aimee Gott" 
## 
## $Books[[3]]$Number_of_Pages
## [1] 464
## 
## $Books[[3]]$Publish_Year
## [1] 2015

The structure of the JSON file and how it is parsed into R is similar enough that only slight modifications to the above XML method3 are needed to load the JSON file to an R dataframe.

Again, use unlist, apply functions, and rbind.fill to reshape the data and load into a dataframe. (The rbind.fill is needed to compensate for the varying number of values in the “Authors” nodes.)

book_list <- parsed_json
book_unlist <- sapply(book_list[[1]], unlist)
book_df <- do.call("rbind.fill", lapply(lapply(book_unlist, t), data.frame, stringsAsFactors = FALSE))
book_df

##                              Title  Authors.Author Number_of_Pages
## 1                   R for Everyone Jared P. Lander             464
## 2 Automated Data Collection with R   Simon Munzert             474
## 3            Sams Teach Yourself R   Andy Nicholls             464
##   Publish_Year Authors.Author.1 Authors.Author.2 Authors.Author.3
## 1         2013             <NA>             <NA>             <NA>
## 2         2015  Christian Rubba    Peter Meißner   Dominic Nyhuis
## 3         2015     Richard Pugh       Aimee Gott             <NA>

This successsfully loaded the data to a dataframe but does require the cleanup of column names. However, unlike XML there are no attribute columns needing to be removed.

Cleanup column names

names(book_df) <- str_replace_all(names(book_df), "\\.", "_")
names(book_df) <- str_replace_all(names(book_df), "Authors_", "")
names(book_df) <- str_replace_all(names(book_df), "_3", "_4")
names(book_df) <- str_replace_all(names(book_df), "_2", "_3")
names(book_df) <- str_replace_all(names(book_df), "_1", "_2")
names(book_df) <- str_replace_all(names(book_df), "Author\\b", "Author_1")
names(book_df)

## [1] "Title"           "Author_1"        "Number_of_Pages" "Publish_Year"   
## [5] "Author_2"        "Author_3"        "Author_4"

Drop attribute columns and reorder columns

col_names <- names(book_df)
attr_cols <-str_detect(col_names, "attr")
book_df <- book_df[,!attr_cols]

library(dplyr)
book_df <- book_df %>% select(Title
                   ,Author_1
                   ,Author_2
                   ,Author_3
                   ,Author_4
                   ,Number_of_Pages
                   ,Publish_Year)
book_df

##                              Title        Author_1        Author_2
## 1                   R for Everyone Jared P. Lander            <NA>
## 2 Automated Data Collection with R   Simon Munzert Christian Rubba
## 3            Sams Teach Yourself R   Andy Nicholls    Richard Pugh
##        Author_3       Author_4 Number_of_Pages Publish_Year
## 1          <NA>           <NA>             464         2013
## 2 Peter Meißner Dominic Nyhuis             474         2015
## 3    Aimee Gott           <NA>             464         2015

This method was able to load the JSON data to a dataframe and maintain each author as a seperate element. However, like the XML example, this case shows it may have been better to load the data to multiple dataframes or other R objects rather force it into a single dataframe.

Conclusions

Ultimately, it was possible to produce the same dataframe from all three source files (.html, .xml, and .json). However, it became clear that in cases where the number of records within a variable was not a constant, the file types of .xml or .json more naturally accomodated that situation. Additionally, this assignment reaffirmed that in some cases it may be better to break to break a dataset apart to be stored across multiple dataframes to allow for more tidy tables.