For this task, I created an excel file that contained data points for three books that were available on my house. Once the excel was created, I converted the file into xml using the excel developer tool and mapping it to a xml template I had created. For the Json file, I used online file converter (https://www.convertcsv.com/csv-to-json.htm) and uploaded my csv input to get Json output. For the html file, I used another file converter (https://www.convertcsv.com/csv-to-html.htm)) and used the csv format of my excel to get the html output. I then added all the three outputs to Github for my project.
This is to load the R packages needed for this project.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'stringr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(dplyr)
library(lemon)
## Warning: package 'lemon' was built under R version 4.1.3
##
## Attaching package: 'lemon'
## The following object is masked from 'package:purrr':
##
## %||%
## The following objects are masked from 'package:ggplot2':
##
## CoordCartesian, element_render
library("XML")
library("methods")
library("rjson")
This is to load the XML into a R dataframe.
df_xml<-xmlToDataFrame("file:///C:/Users/dkbs0/OneDrive/Desktop/Class 607/Week 7/XML/booknew.xml")
df_xml
## title author1 author2
## 1 Matched Ally Condle Null
## 2 Catching Fire Suzane Collins Null
## 3 Data Science for Business Foster Provost Tom Fawcett
## publisher year isbn pages price
## 1 Allyson braithwaite Condie 2010 9780142419779 369 9.99
## 2 Scholastic Press 2009 9780439023498 391 17.99
## 3 O'Reily Media Inc. 2013 9781449361327 386 49.99
This code is to load the html file into R.
rawHTML <- paste(readLines("https://raw.githubusercontent.com/deepasharma06/Data-607/main/Books.htm"), collapse="\n")
## Warning in readLines("https://raw.githubusercontent.com/deepasharma06/
## Data-607/main/Books.htm"): incomplete final line found on 'https://
## raw.githubusercontent.com/deepasharma06/Data-607/main/Books.htm'
rawHTML
## [1] "<table class=\"table table-bordered table-hover table-condensed\">\n<thead><tr><th title=\"Field #1\">title</th>\n<th title=\"Field #2\">author1</th>\n<th title=\"Field #3\">author 2</th>\n<th title=\"Field #4\">publisher</th>\n<th title=\"Field #5\">year</th>\n<th title=\"Field #6\">isbn</th>\n<th title=\"Field #7\">pages</th>\n<th title=\"Field #8\">price</th>\n</tr></thead>\n<tbody><tr>\n<td>Matched</td>\n<td>Ally Condle</td>\n<td>Null</td>\n<td>Allyson braithwaite Condie</td>\n<td align=\"right\">2010</td>\n<td align=\"right\">9780142419779</td>\n<td align=\"right\">369</td>\n<td align=\"right\">9.99</td>\n</tr>\n<tr>\n<td>Catching Fire</td>\n<td>Suzane Collins</td>\n<td>Null</td>\n<td>Scholastic Press</td>\n<td align=\"right\">2009</td>\n<td align=\"right\">9780439023498</td>\n<td align=\"right\">391</td>\n<td align=\"right\">17.99</td>\n</tr>\n<tr>\n<td>Data Science for Business</td>\n<td>Foster Provost</td>\n<td>Tom Fawcett</td>\n<td>O'Reily Media Inc.</td>\n<td align=\"right\">2013</td>\n<td align=\"right\">9781449361327</td>\n<td align=\"right\">386</td>\n<td align=\"right\">49.99</td>\n</tr>\n</tbody></table>"
The html file loaded above is loaded to a R dataframe using the code below.
df_html <- as.data.frame(read_html(rawHTML) %>% html_table(fill=TRUE))
df_html
## title author1 author.2
## 1 Matched Ally Condle Null
## 2 Catching Fire Suzane Collins Null
## 3 Data Science for Business Foster Provost Tom Fawcett
## publisher year isbn pages price
## 1 Allyson braithwaite Condie 2010 9.780142e+12 369 9.99
## 2 Scholastic Press 2009 9.780439e+12 391 17.99
## 3 O'Reily Media Inc. 2013 9.781449e+12 386 49.99
This is to load the Json file into a R dataframe.
url <- "https://raw.githubusercontent.com/deepasharma06/Data-607/main/booknew.json"
mydata <- jsonlite::read_json(url)
df_json <-as.data.frame(do.call("cbind", mydata))
df_json
## V1 V2 V3
## title Matched Catching Fire Data Science for Business
## author1 Ally Condle Suzane Collins Foster Provost
## author2 NULL NULL Tom Fawcett
## publisher Allyson braithwaite Condie Scholastic Press O'Reily Media Inc.
## year 2010 2009 2013
## isbn 9.78014e+12 9.78044e+12 9.78145e+12
## pages 369 391 386
## price 9.99 17.99 49.99
I loaded the xml file into Github (https://github.com/deepasharma06/Data-607/blob/main/booknew.xml). However, when I used the Github link on the code, I got an error saying the format may not be xml. I checked to make sure the xml was in the correct format using online editor (https://jsonformatter.org/xml-validator) and I did not find any issues with the xml. So, I loaded by file from my local machine. The html portion was easier and I was able to pull my html from Github and create a dataframe in R. I was able to pull the JSON file from Github as well and to create a dataframe in R. However, the outpur looked off even though the actual table (df_json) did have all the elements from the original excel file. Overall, I came to the conclusion that the way the xml, html and json formats stores data is quite different from each other. The xml format seem to be the most straightforward because the data is laid in the form of records and each record contains the data from a specific row in my excel. The Json format also laid the data in the form of records but each record seems to be unique from the other even thought the whole transaction is enclosed in ([]) bracket. The html format looked very different from the xml and json. The html file first defines the fields from the excel and then lays out the elements for these fields which makes it very hard to read. Source files are available in Github in the following link. 1. XML - https://raw.githubusercontent.com/deepasharma06/Data-607/main/booknew.xml 2. HTML - https://raw.githubusercontent.com/deepasharma06/Data-607/main/Books.htm 3. Json - https://raw.githubusercontent.com/deepasharma06/Data-607/main/booknew.json