In Assignment 5, three files were created separately in a text editor, one each in HTML (using an HTML table), XML, and JSON format. The files were then uploaded to a GitHub repository and read into R, where each was transformed into a data frame.
#Library used
library(jsonlite)
# fromJSON() from jsonlite converts JSON content into R objects (here, a data frame).
json_data <- fromJSON("https://raw.githubusercontent.com/melbow2424/Data-607-Assignment-5/main/books.json")
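As a small offline illustration of what `fromJSON()` does with an array of objects, the sketch below parses an inline JSON string instead of the URL. The string is a hypothetical stand-in modeled on two rows of the printed output further down; the actual contents of books.json may be laid out differently.

```r
library(jsonlite)

# Hypothetical JSON modeled on the printed output; the real books.json may differ.
json_string <- '[
  {"title": "Coraline", "author": "Neil Gaiman",
   "pages": 176, "genre": "Dark Fantasy", "rating": 4.09},
  {"title": "R for Data Science", "author": "Garrett Grolemund, Hadley Wickham",
   "pages": 522, "genre": "Education", "rating": 4.57}
]'

# An array of objects that share the same keys is simplified to a data frame,
# with one row per object and one column per key.
books <- fromJSON(json_string)
str(books)
```

Working from an inline string like this can make it easier to check how a given JSON layout maps to columns before pointing the same call at the hosted file.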
#Libraries used
library(XML)
library(plyr)
library("xml2")
# read_xml() from the xml2 package reads the XML file from the URL
xml_file <- read_xml("https://raw.githubusercontent.com/melbow2424/Data-607-Assignment-5/main/books.xml")
# xmlParse() from the XML package could not read the file directly from the URL, so the document read by read_xml() is parsed instead.
xml_format <- xmlParse(file = xml_file)
# ldply() from plyr applies a function to every element of a list and combines the results into a data frame. Here xmlToList() converts the XML nodes into a list, and data.frame() is applied to each element.
xml_data <- ldply(xmlToList(xml_format), data.frame)
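An alternative that stays within xml2 is sketched below. It is not the approach used in this assignment. It assumes a hypothetical XML layout inferred from the printed output, in which a book with several authors nests `<name>` elements inside `<author>`; the real books.xml may be structured differently. The helper `get_authors()` is introduced here only for illustration.

```r
library(xml2)

# Hypothetical XML modeled on the printed output; the real books.xml may differ.
xml_string <- '<books>
  <book><title>Coraline</title><author>Neil Gaiman</author><pages>176</pages></book>
  <book><title>R for Data Science</title>
    <author><name>Garrett Grolemund</name><name>Hadley Wickham</name></author>
    <pages>522</pages></book>
</books>'

doc <- read_xml(xml_string)
books <- xml_find_all(doc, ".//book")

# Illustrative helper: collapse one or more author names into one string per book.
get_authors <- function(book) {
  nm <- xml_text(xml_find_all(book, "./author/name"))
  # Fall back to the text of <author> when there are no nested <name> elements.
  if (length(nm) == 0) nm <- xml_text(xml_find_first(book, "./author"))
  paste(nm, collapse = ", ")
}

xml_books <- data.frame(
  title  = xml_text(xml_find_first(books, "./title")),
  author = vapply(books, get_authors, character(1)),
  pages  = as.integer(xml_text(xml_find_first(books, "./pages"))),
  stringsAsFactors = FALSE
)
print(xml_books)
```

Building the columns explicitly keeps all authors in a single `author` column, at the cost of having to name each field by hand.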
library(rvest)
library(RCurl)
library(data.table)
# getURL() from the RCurl package downloads the content at the URL as a string
html_file <- getURL('https://raw.githubusercontent.com/melbow2424/Data-607-Assignment-5/main/books.html')
# readHTMLTable() from the XML package reads every HTML table in the document and returns them as a list of data frames.
tables <- readHTMLTable(html_file, as.data.frame = TRUE)
# rbindlist() from data.table binds that list into a single data table
html_data <- rbindlist(tables)
html_data <- rbindlist(tables)
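The rvest package is loaded above but the table is read with `readHTMLTable()`. For comparison, the sketch below shows how the same kind of table could be read with rvest's `html_table()`. The inline HTML is a hypothetical stand-in modeled on the printed output, not the actual books.html.

```r
library(rvest)

# Hypothetical HTML table modeled on the printed output; the real books.html may differ.
html_string <- '<table>
  <tr><th>title</th><th>author</th><th>pages</th></tr>
  <tr><td>Coraline</td><td>Neil Gaiman</td><td>176</td></tr>
  <tr><td>R for Data Science</td><td>Garrett Grolemund|Hadley Wickham</td><td>522</td></tr>
</table>'

page <- read_html(html_string)

# html_table() on a whole document returns a list with one element per <table>;
# the first (and only) table is taken here.
html_books <- html_table(page)[[1]]
print(html_books)
```

As with `readHTMLTable()`, the `|` separator in the author cell is carried through unchanged because it is part of the cell text.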
print(json_data)
##                               title                            author pages
## 1 We Are Never Meeting in Real Life                     Samantha Irby   288
## 2                          Coraline                       Neil Gaiman   176
## 3                R for Data Science Garrett Grolemund, Hadley Wickham   522
##          genre rating
## 1        Humor   3.91
## 2 Dark Fantasy   4.09
## 3    Education   4.57
print(xml_data)
##    .id                             title        author pages        genre
## 1 book We Are Never Meeting in Real Life Samantha Irby   288        Humor
## 2 book                          Coraline   Neil Gaiman   176 Dark Fantasy
## 3 book                R for Data Science          <NA>   522    Education
##   rating              name         name.1
## 1   3.91              <NA>           <NA>
## 2   4.09              <NA>           <NA>
## 3   4.57 Garrett Grolemund Hadley Wickham
print(html_data)
##                                title                           author pages
## 1: We Are Never Meeting in Real Life                    Samantha Irby   288
## 2:                          Coraline                      Neil Gaiman   176
## 3:                R for Data Science Garrett Grolemund|Hadley Wickham   522
##           genre rating
## 1:        Humor   3.91
## 2: Dark Fantasy   4.09
## 3:    Education   4.57
Each file had to be imported into R differently because of its structure, and the resulting data frames did not have identical layouts. In the JSON data frame, the two authors of R for Data Science appear in a single author value, separated by a comma. In the XML data frame, the two author names were placed in new columns (name and name.1), leaving author as NA for that book, and an extra .id column was added. In the HTML data frame, the two authors are separated by a |, but that separator was written into the source file and is not a result of the R import code.
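One possible way to bring the author columns into the same layout is sketched below in base R. This step is not part of the original assignment. The three small data frames are stand-ins reconstructed from two rows of the printed output above, with only the title and author-related columns. The real xml_data also has an .id column and the real html_data is a data.table, so the code would need adapting.

```r
# Stand-ins reconstructed from the printed output (title and author columns only).
json_data <- data.frame(
  title  = c("Coraline", "R for Data Science"),
  author = c("Neil Gaiman", "Garrett Grolemund, Hadley Wickham"),
  stringsAsFactors = FALSE
)
xml_data <- data.frame(
  title  = c("Coraline", "R for Data Science"),
  author = c("Neil Gaiman", NA),
  name   = c(NA, "Garrett Grolemund"),
  name.1 = c(NA, "Hadley Wickham"),
  stringsAsFactors = FALSE
)
html_data <- data.frame(
  title  = c("Coraline", "R for Data Science"),
  author = c("Neil Gaiman", "Garrett Grolemund|Hadley Wickham"),
  stringsAsFactors = FALSE
)

# XML: where author is missing, build it from the two name columns, then drop them.
multi <- is.na(xml_data$author)
xml_data$author[multi] <- paste(xml_data$name[multi], xml_data$name.1[multi], sep = ", ")
xml_data$name <- NULL
xml_data$name.1 <- NULL

# HTML: replace the "|" separator with ", " (fixed = TRUE treats "|" literally,
# not as the regular-expression alternation operator).
html_data$author <- gsub("|", ", ", html_data$author, fixed = TRUE)

# Check that the author columns now agree across the three sources.
identical(json_data$author, xml_data$author)
identical(json_data$author, html_data$author)
```

The comma-separated form is used as the target only because it matches the JSON data frame; any single convention would serve equally well.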