For this week’s assignment, I had to pick three books, store their information in HTML, XML, and JSON files, and then load the three files into R data frames. I can’t remember the last time I picked up a book, so I just picked three data science books I came across in the past year. With the help of several YouTube videos, I created the three files in Visual Studio Code and uploaded them to GitHub.
library(tidyverse)
library(rvest)
library(XML)
library(xml2)
library(jsonlite)
html <- read_html("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/Week7%20books.html")
# From the rvest library: html_table() parses every HTML table on the page into a list of data frames
df_html <- html_table(html)
# The page has a single table, so keep the first list element
df_html <- df_html[[1]]
knitr::kable(df_html)
Title | Authors | Authors2 | Interesting Attributes | Interesting Attributes2 |
---|---|---|---|---|
Data Science for Business | Foster Provost | Tom Fawcett | Conversational style make it an easy read. | Cover a wide range of topics |
Data Science for Data Science | Hadley Wickham | Garret Grolemund | Provides step by step instructions and code examples | Covers a wide range of R packages |
Python for Data Analysis | Wes McKinney | N/A | Available online for free | Written by the creator of Pandas Library |
# From the xml2 library
xml_file <- read_xml("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/Week%207%20books.xml")
# Had to parse the XML file with xmlParse() before I could use xmlToDataFrame()
xml_parse <- xmlParse(xml_file)
# Both xmlParse() and xmlToDataFrame() come from the XML library
df_xml <- xmlToDataFrame(xml_parse)
knitr::kable(df_xml)
title | authors | authors2 | interesting_attributes | interesting_attributes2 |
---|---|---|---|---|
Data Science for Business | Foster Provost | Tom Fawcett | Conversational style make it an easy read. | Cover a wide range of topics |
Data Science for Data Science | Hadley Wickham | Garret Grolemund | Provides step by step instructions and code examples | Covers a wide range of R packages |
Python for Data Analysis | Wes McKinney | N/A | Available online for free | Written by the creator of Pandas Library |
json_file <- fromJSON("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/Week%207%20Assignment/week%207%20books.json")
df_json <- as.data.frame(json_file)
knitr::kable(df_json)
title | authors | authors2 | interesting attributes | interesting attributes2 |
---|---|---|---|---|
Data Science for Business | Foster Provost | Tom Fawcett | Conversational style make it an easy read. | Cover a wide range of topics |
Data Science for Data Science | Hadley Wickham | Garret Grolemund | Provides step by step instructions and code examples | Covers a wide range of R packages |
Python for Data Analysis | Wes McKinney | N/A | Available online for free | Written by the creator of Pandas Library |
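For anyone following along without the GitHub file, here is a rough sketch of how fromJSON() can turn JSON text into a data frame. The JSON string below is a made-up stand-in (an array with one object per book), not the actual contents of my file, and the names json_text and books are placeholders.

```r
library(jsonlite)

# Hypothetical JSON text: an array with one object per book
json_text <- '[
  {"title": "Book A", "authors": "Author X", "authors2": "Author Y"},
  {"title": "Book B", "authors": "Author Z", "authors2": "N/A"}
]'

# fromJSON() simplifies an array of same-shaped objects into a data frame
books <- fromJSON(json_text)

dim(books)    # rows and columns
names(books)  # column names come from the JSON keys
```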
All three data frames are very similar. Each has three rows and five columns, and the data is exactly the same. The difference that stands out most is the column names, which I forgot to capitalize when I created the XML and JSON files. I changed the column names below to match the HTML version and then checked whether the data frames are identical.
colnames(df_xml) <- colnames(df_html)
colnames(df_json) <- colnames(df_html)
identical(df_html,df_xml)
## [1] FALSE
identical(df_html,df_json)
## [1] FALSE
identical(df_xml,df_json)
## [1] TRUE
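One reason the renaming step matters: identical() compares column names as well as values, so two data frames holding the same data still fail the check if their headers differ in case. A minimal base R sketch with made-up data (the names upper and lower are placeholders, not the assignment data):

```r
# Hypothetical stand-in data with headers that differ only in case
upper <- data.frame(Title = c("Book A", "Book B"))
lower <- data.frame(title = c("Book A", "Book B"))

# Same values, different column names
before <- identical(upper, lower)

# Copy the headers across, as done for df_xml and df_json
colnames(lower) <- colnames(upper)
after <- identical(upper, lower)

before
after
```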
The data frames from the XML and JSON files are identical, but the one from the HTML file differs from both. I used the all.equal() function to pinpoint the difference, and it does not appear to be in the data: the mismatch is in the class attribute, which has three entries for the HTML data frame and one for the others. That is consistent with html_table() returning a tibble while the other two are plain data frames.
all.equal(df_html,df_xml)
## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
all.equal(df_html,df_json)
## [1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
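To check that a class attribute alone can produce this kind of result, here is a minimal base R sketch. The two-row data frame and the names plain, tibble_like, and converted are made up for illustration, and the tibble is only imitated by setting the class vector by hand, so this is an assumption-laden stand-in rather than the assignment data.

```r
# Hypothetical stand-in data; not the contents of the book files
plain <- data.frame(Title = c("Book A", "Book B"),
                    Authors = c("Author X", "Author Y"))

# Imitate a tibble by giving a copy the three-element class vector
tibble_like <- plain
class(tibble_like) <- c("tbl_df", "tbl", "data.frame")

# Same data, different class attribute
same_before <- identical(tibble_like, plain)
class_report <- all.equal(tibble_like, plain)

# as.data.frame() drops the extra classes and keeps the data
converted <- as.data.frame(tibble_like)
same_after <- identical(converted, plain)

same_before
class_report
same_after
```

If the same holds for the real objects, wrapping df_html in as.data.frame() before the comparison would be one way to test whether the class is the only difference.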