For this assignment, I have created three files, which are listed below (click on the links to open the files). These files were published as google documents.
Our main objective is to parse the above three documents, extract the books information from these documents and tidy the data where ever needed.
We need the following libraries of R to parse and tidy the above files:
In order to parse google docs, where the HTML Table file is published, we have to use the GET function of httpr package. The readHTML() function is not working to read the data directly from google docs (perhaps, because of the “https” and due to incompatible format). So I am readung the document content to a HTMLInternalDocument object, and converting this object’s data to character. The character object data is then fed into the readHTML() function, to display the data frame containing the books information. The code is given below:
URL <- "https://docs.google.com/document/d/1j4Bod6elIm3pvd7qVeRsdLNMxzYsk8IoYBxKQAVTlpw/pub"
my_URL <- GET(URL)
html <- content(my_URL)
y <- as(html,"character")
readHTMLTable(y, which = 1, stringsAsFactors = FALSE, head=TRUE)
## Book Name
## 1 DB2 Developer's Guide
## 2 Understanding DB2
## 3 Probability: With Applications and R
## Author ISBN Price(USD)
## 1 Craig Mullins 978-0132836425 85
## 2 Raul F. Chong; Michael Dong; Dwaine R. Snow B00EKYU6DI 120
## 3 Robert P. Dobrow 978-1118241257 101
In second row of the above data frame (for the book “Understanding DB2”), there are three authors, and they are separated by semi-colons. This data frame is not “tidy”. Let us separate the three authors into three separate rows
y <- readHTMLTable(y, which = 1, stringsAsFactors = FALSE, head=TRUE)
names(y) <- c("Book_Name","Author","ISBN", "Price")
z <- separate(y[y[1]=="Understanding DB2",],Author,into=c("Author_1","Author_2","Author_3"),sep=";")
#Delete the row from y. The deleted row was collected into z, already
y <- y[-which(y[1]=="Understanding DB2"),]
#Transform the z
z <- gather(z,Author_no,Author,-Book_Name, -ISBN, -Price)
z <- data.frame(Book_Name=z$Book_Name,Author=z$Author,ISBN=z$ISBN,Price=z$Price)
#Add rows of z to y
y <- rbind(y,z)
Here is the transformed data frame.
y
## Book_Name Author ISBN
## 1 DB2 Developer's Guide Craig Mullins 978-0132836425
## 3 Probability: With Applications and R Robert P. Dobrow 978-1118241257
## 31 Understanding DB2 Raul F. Chong B00EKYU6DI
## 4 Understanding DB2 Michael Dong B00EKYU6DI
## 5 Understanding DB2 Dwaine R. Snow B00EKYU6DI
## Price
## 1 85
## 3 101
## 31 120
## 4 120
## 5 120
Let us add a new variable to the above data frame, so that we can represent the author number for a book.
a <- group_by(y,Book_Name)
a <- mutate(a, Author_Number = row_number())
#y <- data.frame(cbind(Book_Name=b$Book_Name,Author=b$Author,Author_Number=b$Author_Number,ISBN=b$ISBN,Price=a$Price))
y <- data.frame(list(Book_Name=a$Book_Name,Author=a$Author,Author_Number=a$Author_Number,ISBN=a$ISBN,Price=a$Price))
y
## Book_Name Author Author_Number
## 1 DB2 Developer's Guide Craig Mullins 1
## 2 Probability: With Applications and R Robert P. Dobrow 1
## 3 Understanding DB2 Raul F. Chong 1
## 4 Understanding DB2 Michael Dong 2
## 5 Understanding DB2 Dwaine R. Snow 3
## ISBN Price
## 1 978-0132836425 85
## 2 978-1118241257 101
## 3 B00EKYU6DI 120
## 4 B00EKYU6DI 120
## 5 B00EKYU6DI 120
Display 1: Data Frame obtained after parsing the Books HTML Table published at google documents
In the above data frame the “Author_Number” variable represents the number of the author (if there are more than one authors for a book). Observe that for the “Understanding DB2” book we have three authors, and each author is assigned an author number, to represent multiple authors.
NOTE that the cleanup we performed above (converting multiple authors books to multiple rows is needed again, after we successfully parse the XML and JSON documents). So let us create a function for reusability. The function name will be “books_transform()”, and this functions takes the data frame as input and gives a tidied data frame as output
books_transform <- function(y)
{
z <- separate(y[y[1]=="Understanding DB2",],Author,into=c("Author_1","Author_2","Author_3"),sep=";")
#Delete the row from y. The deleted row was collected into z, already
y <- y[-which(y[1]=="Understanding DB2"),]
#Transform the z
z <- gather(z,Author_no,Author,-Book_Name, -ISBN, -Price)
z <- data.frame(Book_Name=z$Book_Name,Author=z$Author,ISBN=z$ISBN,Price=z$Price)
#Add rows of z to y
y <- rbind(y,z)
a <- group_by(y,Book_Name)
a <- mutate(a, Author_Number = row_number())
a <- as.data.frame(a)
y <- data.frame(list(Book_Name=a$Book_Name,Author=a$Author,Author_Number=a$Author_Number,ISBN=a$ISBN,Price=a$Price))
return(y)
}
In order to parse google docs, where the XML file file is published, we have to use the GET function of httpr package. We cannot apply the xmlParse function directly, since the XML Code is embedded as text in the webpage. We have to cleanup the html code first, and extract the XML code from the html code. Then the extracted XML ode will be parsed by xmlParse() function. The complete code is given below:
URL <- "https://docs.google.com/document/d/13rs6sVlmECdLl8vPp7o6cejZRxIgYxkvnjh30XydEH4/pub"
my_URL <- GET(URL)
html <- content(my_URL)
y <- as(html,"character")
a <- regexpr("xml version=?",y)[[1]]
b <- regexpr("/Books></span>",y)[[1]]
b <- (b + nchar("/Books></span>") - 1)
y <- paste("<?",substr(y,a,b),sep="")
y <- gsub(">",">",y)
y <- gsub("<","<",y)
y <- gsub("(<span>|</span>|\n|<p class=\"c0\">|</p>|<p class=\"c1\">)","",y)
#Finally "y" contains the XML code in the form of a character string.
y
## [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?><Books><Book><Book_Name>DB2 Developer's Guide</Book_Name><Author>Craig Mullins</Author><ISBN>0132836425</ISBN><Price>85</Price></Book><Book><Book_Name>Understanding DB2</Book_Name><Author>Raul F. Chong; Michael Dong; Dwaine R. Snow</Author><ISBN>B00EKYU6DI</ISBN><Price>120</Price></Book><Book><Book_Name>Probability: With Applications and R</Book_Name><Author>Robert P. Dobrow</Author><ISBN>978-1118241257</ISBN><Price>101</Price></Book></Books>"
Once the XML string is obtained, let us use the cat() function to verify if the XML is well formed. If you just display the “y” variable (as shown above), which has the XML code, we will get the escape sequence character (“\”) embedded in front of a qutote. Such display will be difficult to read. So the cat() function will help us to display the string without the escape characters.
cat(y)
## <?xml version="1.0" encoding="UTF-8"?><Books><Book><Book_Name>DB2 Developer's Guide</Book_Name><Author>Craig Mullins</Author><ISBN>0132836425</ISBN><Price>85</Price></Book><Book><Book_Name>Understanding DB2</Book_Name><Author>Raul F. Chong; Michael Dong; Dwaine R. Snow</Author><ISBN>B00EKYU6DI</ISBN><Price>120</Price></Book><Book><Book_Name>Probability: With Applications and R</Book_Name><Author>Robert P. Dobrow</Author><ISBN>978-1118241257</ISBN><Price>101</Price></Book></Books>
Extracted XML Code
Hence we obtained the XML code into the variable “y”. Now we can apply XML functions to read the contents of XML data. Applying the xmlParse() function to parse the XML present in “y”, and this function will throw an error, if the supplied XML is not well formed.
xmlParse(y)
## <?xml version="1.0" encoding="UTF-8"?>
## <Books>
## <Book>
## <Book_Name>DB2 Developer's Guide</Book_Name>
## <Author>Craig Mullins</Author>
## <ISBN>0132836425</ISBN>
## <Price>85</Price>
## </Book>
## <Book>
## <Book_Name>Understanding DB2</Book_Name>
## <Author>Raul F. Chong; Michael Dong; Dwaine R. Snow</Author>
## <ISBN>B00EKYU6DI</ISBN>
## <Price>120</Price>
## </Book>
## <Book>
## <Book_Name>Probability: With Applications and R</Book_Name>
## <Author>Robert P. Dobrow</Author>
## <ISBN>978-1118241257</ISBN>
## <Price>101</Price>
## </Book>
## </Books>
##
Output of xmlParse() function showing the code as well formed
The above display confirms that the supplied XML is error free, and now we can apply xmlToDataFrame() function to collect all the child nodes information (in our case study, the books information is present in the child nodes).
y <- xmlToDataFrame(y)
y
## Book_Name
## 1 DB2 Developer's Guide
## 2 Understanding DB2
## 3 Probability: With Applications and R
## Author ISBN Price
## 1 Craig Mullins 0132836425 85
## 2 Raul F. Chong; Michael Dong; Dwaine R. Snow B00EKYU6DI 120
## 3 Robert P. Dobrow 978-1118241257 101
books_transform(y)
## Book_Name Author Author_Number
## 1 DB2 Developer's Guide Craig Mullins 1
## 2 Probability: With Applications and R Robert P. Dobrow 1
## 3 Understanding DB2 Raul F. Chong 1
## 4 Understanding DB2 Michael Dong 2
## 5 Understanding DB2 Dwaine R. Snow 3
## ISBN Price
## 1 0132836425 85
## 2 978-1118241257 101
## 3 B00EKYU6DI 120
## 4 B00EKYU6DI 120
## 5 B00EKYU6DI 120
Display 2: Data Frame obtained after parsing the Books XML file published at google documents
URL <- "https://docs.google.com/document/d/1GbDBIIC5mNI0ufj0GWBj2pwrfNOVideP-ZfpjcB7jL8/pub"
my_URL <- GET(URL)
html <- content(my_URL)
y <- as(html,"character")
#regexpr("<span>",y)[[1]]
#regexpr("</span>",y)[[1]]
y <- substr(y,regexpr("<span>",y)[[1]],(regexpr("</span>",y)[[1]] + nchar("</span>") - 1))
#Substituting spaces in the place of <br>, <span> and </span>
y <- gsub(("<br>|<span>|</span>"),"",y,ignore.case=FALSE)
cat(y)
## { "BookName": [ "DB2 Developer's Guide", "Understanding DB2", "Probability: With Applications and R" ],"Author": [ "Craig Mullins", "Raul F. Chong; Michael Dong; Dwaine R. Snow", "Robert P. Dobrow" ],"ISBN": [ "978-0132836425", "B00EKYU6DI", "978-1118241257" ],"Price": [ 85, 120, 101 ] }
Extracted JSON code
The above display shows the extracted JSON document text from the JSON file located at google documents
Let us parse the extracted JSON document using fromJSON() function. The fromJSON() function outputs a list. The extracted list is converted to data frame, in the following code.
y <- fromJSON(y)
y <- data.frame(y)
names(y) <- c("Book_Name","Author","ISBN","Price")
y
## Book_Name
## 1 DB2 Developer's Guide
## 2 Understanding DB2
## 3 Probability: With Applications and R
## Author ISBN Price
## 1 Craig Mullins 978-0132836425 85
## 2 Raul F. Chong; Michael Dong; Dwaine R. Snow B00EKYU6DI 120
## 3 Robert P. Dobrow 978-1118241257 101
The above display shows how the extracted JSON document text from JSON file located at google documents is finally converted to data frame. Let us apply the books_transform() function to tidy up the data in the above data frame.
books_transform(y)
## Book_Name Author Author_Number
## 1 DB2 Developer's Guide Craig Mullins 1
## 2 Probability: With Applications and R Robert P. Dobrow 1
## 3 Understanding DB2 Raul F. Chong 1
## 4 Understanding DB2 Michael Dong 2
## 5 Understanding DB2 Dwaine R. Snow 3
## ISBN Price
## 1 978-0132836425 85
## 2 978-1118241257 101
## 3 B00EKYU6DI 120
## 4 B00EKYU6DI 120
## 5 B00EKYU6DI 120
Display 3: Data Frame obtained after parsing the Books JSON file published at google documents