Data 607 HW 7

Introduction

For this task, I created an excel file that contained data points for three books that were available on my house. Once the excel was created, I converted the file into xml using the excel developer tool and mapping it to a xml template I had created. For the Json file, I used online file converter (https://www.convertcsv.com/csv-to-json.htm) and uploaded my csv input to get Json output. For the html file, I used another file converter (https://www.convertcsv.com/csv-to-html.htm)) and used the csv format of my excel to get the html output. I then added all the three outputs to Github for my project.

Load Libraries:

This is to load the R packages needed for this project.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'stringr' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(dplyr)
library(lemon)

## Warning: package 'lemon' was built under R version 4.1.3

## 
## Attaching package: 'lemon'

## The following object is masked from 'package:purrr':
## 
##     %||%

## The following objects are masked from 'package:ggplot2':
## 
##     CoordCartesian, element_render

library("XML")
library("methods")
library("rjson")

Load XML file to R dataframe

This is to load the XML into a R dataframe.

df_xml<-xmlToDataFrame("file:///C:/Users/dkbs0/OneDrive/Desktop/Class 607/Week 7/XML/booknew.xml")
df_xml

##                       title        author1     author2
## 1                   Matched    Ally Condle        Null
## 2             Catching Fire Suzane Collins        Null
## 3 Data Science for Business Foster Provost Tom Fawcett
##                    publisher year          isbn pages price
## 1 Allyson braithwaite Condie 2010 9780142419779   369  9.99
## 2           Scholastic Press 2009 9780439023498   391 17.99
## 3         O'Reily Media Inc. 2013 9781449361327   386 49.99

First Load Raw HTML file to R and create dataframe

This code is to load the html file into R.

rawHTML <- paste(readLines("https://raw.githubusercontent.com/deepasharma06/Data-607/main/Books.htm"), collapse="\n")

## Warning in readLines("https://raw.githubusercontent.com/deepasharma06/
## Data-607/main/Books.htm"): incomplete final line found on 'https://
## raw.githubusercontent.com/deepasharma06/Data-607/main/Books.htm'

rawHTML

## [1] "<table class=\"table table-bordered table-hover table-condensed\">\n<thead><tr><th title=\"Field #1\">title</th>\n<th title=\"Field #2\">author1</th>\n<th title=\"Field #3\">author 2</th>\n<th title=\"Field #4\">publisher</th>\n<th title=\"Field #5\">year</th>\n<th title=\"Field #6\">isbn</th>\n<th title=\"Field #7\">pages</th>\n<th title=\"Field #8\">price</th>\n</tr></thead>\n<tbody><tr>\n<td>Matched</td>\n<td>Ally Condle</td>\n<td>Null</td>\n<td>Allyson braithwaite Condie</td>\n<td align=\"right\">2010</td>\n<td align=\"right\">9780142419779</td>\n<td align=\"right\">369</td>\n<td align=\"right\">9.99</td>\n</tr>\n<tr>\n<td>Catching Fire</td>\n<td>Suzane Collins</td>\n<td>Null</td>\n<td>Scholastic Press</td>\n<td align=\"right\">2009</td>\n<td align=\"right\">9780439023498</td>\n<td align=\"right\">391</td>\n<td align=\"right\">17.99</td>\n</tr>\n<tr>\n<td>Data Science for Business</td>\n<td>Foster Provost</td>\n<td>Tom Fawcett</td>\n<td>O&#39;Reily Media Inc.</td>\n<td align=\"right\">2013</td>\n<td align=\"right\">9781449361327</td>\n<td align=\"right\">386</td>\n<td align=\"right\">49.99</td>\n</tr>\n</tbody></table>"

create dataframe for raw HTML

The html file loaded above is loaded to a R dataframe using the code below.

df_html <- as.data.frame(read_html(rawHTML) %>% html_table(fill=TRUE))
df_html

##                       title        author1    author.2
## 1                   Matched    Ally Condle        Null
## 2             Catching Fire Suzane Collins        Null
## 3 Data Science for Business Foster Provost Tom Fawcett
##                    publisher year         isbn pages price
## 1 Allyson braithwaite Condie 2010 9.780142e+12   369  9.99
## 2           Scholastic Press 2009 9.780439e+12   391 17.99
## 3         O'Reily Media Inc. 2013 9.781449e+12   386 49.99

Load JSON file to R dataframe

This is to load the Json file into a R dataframe.

url <- "https://raw.githubusercontent.com/deepasharma06/Data-607/main/booknew.json"
mydata <- jsonlite::read_json(url)


df_json <-as.data.frame(do.call("cbind", mydata))
df_json

##                                   V1               V2                        V3
## title                        Matched    Catching Fire Data Science for Business
## author1                  Ally Condle   Suzane Collins            Foster Provost
## author2                         NULL             NULL               Tom Fawcett
## publisher Allyson braithwaite Condie Scholastic Press        O'Reily Media Inc.
## year                            2010             2009                      2013
## isbn                     9.78014e+12      9.78044e+12               9.78145e+12
## pages                            369              391                       386
## price                           9.99            17.99                     49.99

Conclusion:

I loaded the xml file into Github (https://github.com/deepasharma06/Data-607/blob/main/booknew.xml). However, when I used the Github link on the code, I got an error saying the format may not be xml. I checked to make sure the xml was in the correct format using online editor (https://jsonformatter.org/xml-validator) and I did not find any issues with the xml. So, I loaded by file from my local machine. The html portion was easier and I was able to pull my html from Github and create a dataframe in R. I was able to pull the JSON file from Github as well and to create a dataframe in R. However, the outpur looked off even though the actual table (df_json) did have all the elements from the original excel file. Overall, I came to the conclusion that the way the xml, html and json formats stores data is quite different from each other. The xml format seem to be the most straightforward because the data is laid in the form of records and each record contains the data from a specific row in my excel. The Json format also laid the data in the form of records but each record seems to be unique from the other even thought the whole transaction is enclosed in ([]) bracket. The html format looked very different from the xml and json. The html file first defines the fields from the excel and then lays out the elements for these fields which makes it very hard to read. Source files are available in Github in the following link. 1. XML - https://raw.githubusercontent.com/deepasharma06/Data-607/main/booknew.xml 2. HTML - https://raw.githubusercontent.com/deepasharma06/Data-607/main/Books.htm 3. Json - https://raw.githubusercontent.com/deepasharma06/Data-607/main/booknew.json