We load the data below.
books.url <- file(paste0(url, "books.csv"), open = "r")
# read.csv already defaults to sep = "," and header = TRUE
books.data <- read.csv(books.url, stringsAsFactors = FALSE, encoding = "UTF-8")
close(books.url)
This is what books.data, read from the CSV file, looks like:
books.data
## Title
## 1 Discovering Statistics Using R
## 2 R Cookbook
## 3 R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics)
##       Author_1     Author_2  Author_3 Published_Date     Weight      Type
## 1   Andy Field Jeremy Miles Zoe Field           2012   5 pounds paperback
## 2  Paul Teetor                                  2011 1.6 pounds paperback
## 3 Jared Lander                                  2013    1 pound paperback
Now let's look at the structure of the XML file.
cat(books.xml)
## <?xml version="1.0"?>
## <doc>
## <books>
## <book>
## <Title>Discovering Statistics Using R</Title>
## <Author_1>Andy Field</Author_1>
## <Author_2>Jeremy Miles</Author_2>
## <Author_3>Zoe Field</Author_3>
## <Published_Date>2012</Published_Date>
## <Weight>5 pounds</Weight>
## <Type>paperback</Type>
## </book>
## <book>
## <Title>R Cookbook </Title>
## <Author_1>Paul Teetor</Author_1>
## <Author_2> </Author_2>
## <Author_3> </Author_3>
## <Published_Date>2011</Published_Date>
## <Weight>1.6 pounds</Weight>
## <Type>paperback</Type>
## </book>
## <book>
## <Title>R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics)</Title>
## <Author_1>Jared Lander</Author_1>
## <Author_2> </Author_2>
## <Author_3> </Author_3>
## <Published_Date>2013</Published_Date>
## <Weight>1 pound</Weight>
## <Type>paperback</Type>
## </book>
## </books>
## </doc>
Now that we have the data in XML, we can convert it to a data frame.
books.xml.list <- xmlToList(books.xml)
books.xml.df <- as.data.frame(rbindlist(books.xml.list, fill = TRUE))
row.names(books.xml.df) <- names(books.xml.list$books$book)
kable(books.xml.df)
| Title | Discovering Statistics Using R | R Cookbook | R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics) |
| Author_1 | Andy Field | Paul Teetor | Jared Lander |
| Author_2 | Jeremy Miles | | |
| Author_3 | Zoe Field | | |
| Published_Date | 2012 | 2011 | 2013 |
| Weight | 5 pounds | 1.6 pounds | 1 pound |
| Type | paperback | paperback | paperback |
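The table above has one row per field and one column per book, apparently because the whole parsed list was bound at once. As a minimal sketch of an alternative, assuming the XML package is installed, the `<book>` nodes can be passed to `xmlToDataFrame` to get one row per book, matching the CSV layout. The inline string `books.xml.sample` and the names `book.nodes` and `books.by.row` are illustrative and not part of the original analysis; Author_2 and Author_3 are left out to keep the sample short.

```r
library(XML)

# Illustrative inline sample: two of the records shown earlier,
# without the Author_2 / Author_3 fields (an assumption made for brevity).
books.xml.sample <- '<?xml version="1.0"?>
<doc>
<books>
<book>
<Title>Discovering Statistics Using R</Title>
<Author_1>Andy Field</Author_1>
<Published_Date>2012</Published_Date>
<Weight>5 pounds</Weight>
<Type>paperback</Type>
</book>
<book>
<Title>R Cookbook </Title>
<Author_1>Paul Teetor</Author_1>
<Published_Date>2011</Published_Date>
<Weight>1.6 pounds</Weight>
<Type>paperback</Type>
</book>
</books>
</doc>'

# Parse the string and select every <book> element with an XPath query
doc <- xmlParse(books.xml.sample, asText = TRUE)
book.nodes <- getNodeSet(doc, "//book")

# Each <book> node becomes one row; its child elements become columns
books.by.row <- xmlToDataFrame(nodes = book.nodes, stringsAsFactors = FALSE)

# Strip stray whitespace such as the trailing space in "R Cookbook "
books.by.row[] <- lapply(books.by.row, trimws)

str(books.by.row)
```

With this layout each column holds one kind of value, so a column such as Published_Date can later be converted with `as.integer` if numeric years are needed.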
Overall, I noticed that many of the results were returned as lists. It seems that, whatever the source format, the parsed result is often a list that then has to be reshaped into a data frame. The packages are extremely useful in producing these results. I would say that XML required the most effort to code and understand, whereas JSON felt the easiest to understand. I had not previously worked with data where UTF-8 encoding was so critical to being able to parse it; I now have a new understanding of this encoding.
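The list-to-data-frame step mentioned above can be sketched in base R alone. This is a minimal illustration rather than the code used in this report: the `records` list is a hypothetical stand-in for what a parser might return, and `books.df` is an illustrative name.

```r
# Hypothetical parsed result: one named list per book
records <- list(
  list(Title = "R Cookbook", Author_1 = "Paul Teetor", Published_Date = "2011"),
  list(Title = "Discovering Statistics Using R", Author_1 = "Andy Field", Published_Date = "2012")
)

# Turn each record into a one-row data frame, then stack the rows
books.df <- do.call(
  rbind,
  lapply(records, function(rec) as.data.frame(rec, stringsAsFactors = FALSE))
)

# Parsers usually return text, so numeric fields are converted explicitly
books.df$Published_Date <- as.integer(books.df$Published_Date)

str(books.df)
```

This works when every record has the same fields; records with missing fields need a fill step first, which is what `rbindlist(..., fill = TRUE)` provides.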