This assignment explores the abilty of the R language to interact with various file types. Specifically, it will attempt to load the following file types to R dataframes:
This will be done using three files, one for each of the data types listed above. Each of the files will contain the same information, but will be constructed in the format requirement of that specific file type. The files will contain information on a set of three books and include data for each of the follwoing areas:
Load and parse the HTML file:
library(XML)
library(stringr)
fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/02_Books.html"
parsed_html <- htmlParse(file = fileLocation)
parsed_html
## <!DOCTYPE HTML>
## <html>
## <head><title>Books</title></head>
## <body>
## <table>
## <tr>
## <th>Title</th>
## <th>Author_1</th>
## <th>Author_2</th>
## <th>Author_3</th>
## <th>Author_4</th>
## <th>Number_of_Pages</th>
## <th>Publish_Year</th>
## </tr>
## <tr>
## <td>R for Everyone</td>
## <td>Jared P. Lander</td>
## <td></td>
## <td></td>
## <td></td>
## <td>464</td>
## <td>2013</td>
## </tr>
## <tr>
## <td>Automated Data Collection with R</td>
## <td>Simon Munzert</td>
## <td>Christian Rubba</td>
## <td>Peter Meibner</td>
## <td>Dominic Nyhuis</td>
## <td>474</td>
## <td>2015</td>
## </tr>
## <tr>
## <td>Sams Teach Yourself R</td>
## <td>Andy Nicholls</td>
## <td>Richard Pugh</td>
## <td>Aimee Gott</td>
## <td></td>
## <td>624</td>
## <td>2015</td>
## </tr>
## </table>
## </body>
## </html>
##
Load the parsed data to a dataframe
books_html <- as.data.frame(readHTMLTable(doc = parsed_html))
books_html
## NULL.Title NULL.Author_1 NULL.Author_2
## 1 R for Everyone Jared P. Lander
## 2 Automated Data Collection with R Simon Munzert Christian Rubba
## 3 Sams Teach Yourself R Andy Nicholls Richard Pugh
## NULL.Author_3 NULL.Author_4 NULL.Number_of_Pages NULL.Publish_Year
## 1 464 2013
## 2 Peter Meibner Dominic Nyhuis 474 2015
## 3 Aimee Gott 624 2015
Cleanup the names of the dataframe columns
names(books_html) <- str_replace_all(names(books_html), "NULL.", "")
names(books_html) <- str_replace_all(names(books_html), "\\.", "_")
names(books_html)
## [1] "Title" "Author_1" "Author_2" "Author_3"
## [5] "Author_4" "Number_of_Pages" "Publish_Year"
Load and parse the XML file:
fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/03_Books.xml"
parsed_xml <- xmlParse(file = fileLocation)
parsed_xml
## <?xml version="1.0"?>
## <Books>
## <book id="1">
## <Title>R for Everyone</Title>
## <Authors>Jared P. Lander</Authors>
## <Number_of_Pages>464</Number_of_Pages>
## <Publish_Year>2013</Publish_Year>
## </book>
## <book id="2">
## <Title>Automated Data Collection with R</Title>
## <Authors>Simon Munzert; Christian Rubba; Peter Meibner; Dominic Nyhuis</Authors>
## <Number_of_Pages>474</Number_of_Pages>
## <Publish_Year>2015</Publish_Year>
## </book>
## <book id="3">
## <Title>Sams Teach Yourself R</Title>
## <Authors>Andy Nicholls; Richard Pugh; Aimee Gott</Authors>
## <Number_of_Pages>624</Number_of_Pages>
## <Publish_Year>2015</Publish_Year>
## </book>
## </Books>
##
Load the parsed XML to a dataframe
xmlToDataFrame(parsed_xml)
## Title
## 1 R for Everyone
## 2 Automated Data Collection with R
## 3 Sams Teach Yourself R
## Authors
## 1 Jared P. Lander
## 2 Simon Munzert; Christian Rubba; Peter Meibner; Dominic Nyhuis
## 3 Andy Nicholls; Richard Pugh; Aimee Gott
## Number_of_Pages Publish_Year
## 1 464 2013
## 2 474 2015
## 3 624 2015
While this structure does contain all the data, it has all the authors listed in a single element seperated by semicolons. This does not fully take advantage of XML’s ability to have each of the individual authors recorded as their own node within a parent node of “Authors”. To explore the changes need to accomodate this more complex structure, the next section will use a new XML file where each author is stored is its own node.
Load the updated xml file which now has seperate notes for each author
fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/04_Books_sep_authors.xml"
parsed_xml <- xmlParse(file = fileLocation)
parsed_xml
## <?xml version="1.0"?>
## <Books>
## <book id="1">
## <Title>R for Everyone</Title>
## <Authors>
## <Author id="1">Jared P. Lander</Author>
## </Authors>
## <Number_of_Pages>464</Number_of_Pages>
## <Publish_Year>2013</Publish_Year>
## </book>
## <book id="2">
## <Title>Automated Data Collection with R</Title>
## <Authors>
## <Author id="1">Simon Munzert</Author>
## <Author id="2">Christian Rubba</Author>
## <Author id="3">Peter Meibner</Author>
## <Author id="4">Dominic Nyhuis</Author>
## </Authors>
## <Number_of_Pages>474</Number_of_Pages>
## <Publish_Year>2015</Publish_Year>
## </book>
## <book id="3">
## <Title>Sams Teach Yourself R</Title>
## <Authors>
## <Author id="1">Andy Nicholls</Author>
## <Author id="2">Richard Pugh</Author>
## <Author id="3">Aimee Gott</Author>
## </Authors>
## <Number_of_Pages>624</Number_of_Pages>
## <Publish_Year>2015</Publish_Year>
## </book>
## </Books>
##
Try using xmlToDataframe again
xmlToDataFrame(parsed_xml)
## Title
## 1 R for Everyone
## 2 Automated Data Collection with R
## 3 Sams Teach Yourself R
## Authors Number_of_Pages
## 1 Jared P. Lander 464
## 2 Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis 474
## 3 Andy NichollsRichard PughAimee Gott 624
## Publish_Year
## 1 2013
## 2 2015
## 3 2015
While this technically did load the data to an R dataframe, it changed the structure of the data by collapsing all of the authors into a single “Authors” column. This may not have been ok if during the collasping it had also added an identifable seperator between each author in the final output but it did not.
This section will now explore a way to load to a dataframe with each author remaining as a unique element.
Load the parsed XML data to a list
library(plyr)
book_list <- xmlToList(parsed_xml)
Use unlist, apply functions, and rbind.fill to reshape the data and load into a dataframe. (The rbind.fill is needed to compensate for the varying number of values in the “Authors” nodes.)
book_unlist <- sapply(book_list, unlist)
book_df <- do.call("rbind.fill", lapply(lapply(book_unlist, t), data.frame, stringsAsFactors = FALSE))
book_df
## Title Authors.Author.text
## 1 R for Everyone Jared P. Lander
## 2 Automated Data Collection with R Simon Munzert
## 3 Sams Teach Yourself R Andy Nicholls
## Authors.Author..attrs.id Number_of_Pages Publish_Year .attrs.id
## 1 1 464 2013 1
## 2 1 474 2015 2
## 3 1 624 2015 3
## Authors.Author.text.1 Authors.Author..attrs.id.1 Authors.Author.text.2
## 1 <NA> <NA> <NA>
## 2 Christian Rubba 2 Peter Meibner
## 3 Richard Pugh 2 Aimee Gott
## Authors.Author..attrs.id.2 Authors.Author.text.3
## 1 <NA> <NA>
## 2 3 Dominic Nyhuis
## 3 3 <NA>
## Authors.Author..attrs.id.3
## 1 <NA>
## 2 4
## 3 <NA>
This successsfully loaded the data to a dataframe but does require the cleanup of column names and the removal of the unnecessary attribute columns.
Cleanup column names
names(book_df) <- str_replace_all(names(book_df), "\\.", "_")
names(book_df) <- str_replace_all(names(book_df), "Authors_", "")
names(book_df) <- str_replace_all(names(book_df), "_text", "")
names(book_df) <- str_replace_all(names(book_df), "_3", "_4")
names(book_df) <- str_replace_all(names(book_df), "_2", "_3")
names(book_df) <- str_replace_all(names(book_df), "_1", "_2")
names(book_df) <- str_replace_all(names(book_df), "Author\\b", "Author_1")
names(book_df)
## [1] "Title" "Author_1" "Author__attrs_id"
## [4] "Number_of_Pages" "Publish_Year" "_attrs_id"
## [7] "Author_2" "Author__attrs_id_2" "Author_3"
## [10] "Author__attrs_id_3" "Author_4" "Author__attrs_id_4"
Drop attribute columns and reorder columns
col_names <- names(book_df)
attr_cols <-str_detect(col_names, "attr")
book_df <- book_df[,!attr_cols]
library(dplyr)
book_df <- book_df %>% select(Title
,Author_1
,Author_2
,Author_3
,Author_4
,Number_of_Pages
,Publish_Year)
book_df
## Title Author_1 Author_2
## 1 R for Everyone Jared P. Lander <NA>
## 2 Automated Data Collection with R Simon Munzert Christian Rubba
## 3 Sams Teach Yourself R Andy Nicholls Richard Pugh
## Author_3 Author_4 Number_of_Pages Publish_Year
## 1 <NA> <NA> 464 2013
## 2 Peter Meibner Dominic Nyhuis 474 2015
## 3 Aimee Gott <NA> 624 2015
This method was able to load the XML data to a dataframe and maintain each author as a seperate element. However, this is a strong example where it may have been better to load the data to multiple dataframes or other R objects rather force it into a single dataframe.
This seciton makes use of the RJSONIO package and assumes that package has been installed.
Load and parse the JSON file:
library(RJSONIO)
fileLocation <- "C:/Users/cbailey/Documents/GitHub/607-Week7-Assignment/05_Books.json"
parsed_json <- fromJSON(content = fileLocation)
parsed_json
## $Books
## $Books[[1]]
## $Books[[1]]$Title
## [1] "R for Everyone"
##
## $Books[[1]]$Authors
## Author
## "Jared P. Lander"
##
## $Books[[1]]$Number_of_Pages
## [1] 464
##
## $Books[[1]]$Publish_Year
## [1] 2013
##
##
## $Books[[2]]
## $Books[[2]]$Title
## [1] "Automated Data Collection with R"
##
## $Books[[2]]$Authors
## Author Author Author Author
## "Simon Munzert" "Christian Rubba" "Peter Meißner" "Dominic Nyhuis"
##
## $Books[[2]]$Number_of_Pages
## [1] 474
##
## $Books[[2]]$Publish_Year
## [1] 2015
##
##
## $Books[[3]]
## $Books[[3]]$Title
## [1] "Sams Teach Yourself R"
##
## $Books[[3]]$Authors
## Author Author Author
## "Andy Nicholls" "Richard Pugh" "Aimee Gott"
##
## $Books[[3]]$Number_of_Pages
## [1] 464
##
## $Books[[3]]$Publish_Year
## [1] 2015
The structure of the JSON file and how it is parsed into R is similar enough that only slight modifications to the above XML method3 are needed to load the JSON file to an R dataframe.
Again, use unlist, apply functions, and rbind.fill to reshape the data and load into a dataframe. (The rbind.fill is needed to compensate for the varying number of values in the “Authors” nodes.)
book_list <- parsed_json
book_unlist <- sapply(book_list[[1]], unlist)
book_df <- do.call("rbind.fill", lapply(lapply(book_unlist, t), data.frame, stringsAsFactors = FALSE))
book_df
## Title Authors.Author Number_of_Pages
## 1 R for Everyone Jared P. Lander 464
## 2 Automated Data Collection with R Simon Munzert 474
## 3 Sams Teach Yourself R Andy Nicholls 464
## Publish_Year Authors.Author.1 Authors.Author.2 Authors.Author.3
## 1 2013 <NA> <NA> <NA>
## 2 2015 Christian Rubba Peter Meißner Dominic Nyhuis
## 3 2015 Richard Pugh Aimee Gott <NA>
This successsfully loaded the data to a dataframe but does require the cleanup of column names. However, unlike XML there are no attribute columns needing to be removed.
Cleanup column names
names(book_df) <- str_replace_all(names(book_df), "\\.", "_")
names(book_df) <- str_replace_all(names(book_df), "Authors_", "")
names(book_df) <- str_replace_all(names(book_df), "_3", "_4")
names(book_df) <- str_replace_all(names(book_df), "_2", "_3")
names(book_df) <- str_replace_all(names(book_df), "_1", "_2")
names(book_df) <- str_replace_all(names(book_df), "Author\\b", "Author_1")
names(book_df)
## [1] "Title" "Author_1" "Number_of_Pages" "Publish_Year"
## [5] "Author_2" "Author_3" "Author_4"
Drop attribute columns and reorder columns
col_names <- names(book_df)
attr_cols <-str_detect(col_names, "attr")
book_df <- book_df[,!attr_cols]
library(dplyr)
book_df <- book_df %>% select(Title
,Author_1
,Author_2
,Author_3
,Author_4
,Number_of_Pages
,Publish_Year)
book_df
## Title Author_1 Author_2
## 1 R for Everyone Jared P. Lander <NA>
## 2 Automated Data Collection with R Simon Munzert Christian Rubba
## 3 Sams Teach Yourself R Andy Nicholls Richard Pugh
## Author_3 Author_4 Number_of_Pages Publish_Year
## 1 <NA> <NA> 464 2013
## 2 Peter Meißner Dominic Nyhuis 474 2015
## 3 Aimee Gott <NA> 464 2015
This method was able to load the JSON data to a dataframe and maintain each author as a seperate element. However, like the XML example, this case shows it may have been better to load the data to multiple dataframes or other R objects rather force it into a single dataframe.
Ultimately, it was possible to produce the same dataframe from all three source files (.html, .xml, and .json). However, it became clear that in cases where the number of records within a variable was not a constant, the file types of .xml or .json more naturally accomodated that situation. Additionally, this assignment reaffirmed that in some cases it may be better to break to break a dataset apart to be stored across multiple dataframes to allow for more tidy tables.