DATA 607 Week 7 Assignment

Introduction

In this assignment, I load the same book data from three file types (.html, .xml, and .json) into R and compare how each one behaves once loaded.

Load Packages

Here, I load the packages needed to read the files into R.

library(rvest)
library(xml2)
library(dplyr)
library(XML)
library(methods)
library(rjson)

Load .html file

First, I will load the .html file from GitHub.

# specify the url
htmlurl <- "https://raw.githubusercontent.com/kristinlussi/DATA_607/main/Week%207/books.html"

# read the html file using the rvest package
book_html <- read_html(htmlurl) %>%
  html_table()

# select the first table and convert to data frame
book_html <- book_html[[1]] %>%
  as.data.frame()

# show raw data frame
book_html

##                            X1                          X2           X3
## 1                   Book Name                      Author  Attribute 1
## 2     A Brief History of Time             Stephen Hawking      physics
## 3 Weapons of Math Destruction                Cathy O'Neil     big data
## 4   Data Science for Business Foster Provost, Tom Fawcett data science
##            X4          X5
## 1 Attribute 2 Attribute 3
## 2       space        math
## 3        math  inequality
## 4    business   analytics

As the output shows, when the file is loaded into R and converted into a data frame, the header row of the .html table is read as data rather than as column names. The data frame gets the default column names X1, X2, X3, X4, and X5, and the real headers end up in row 1.

I fix this in the next step:

# rename the columns
colnames(book_html) <- c("Book Name", "Author", "Attribute 1", "Attribute 2", "Attribute 3")

# remove the first row
book_html <- book_html[-1,]

# show final data frame
book_html

##                     Book Name                      Author  Attribute 1
## 2     A Brief History of Time             Stephen Hawking      physics
## 3 Weapons of Math Destruction                Cathy O'Neil     big data
## 4   Data Science for Business Foster Provost, Tom Fawcett data science
##   Attribute 2 Attribute 3
## 2       space        math
## 3        math  inequality
## 4    business   analytics
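
As an alternative to renaming the columns by hand, html_table() in rvest accepts a header argument that promotes the first row to column names. The sketch below uses a small inline table as a stand-in for books.html; it assumes, as the output above suggests, that the file marks its header cells with <td> rather than <th>. The names demo_html and demo_tbl are illustrative and not part of the original assignment.

```r
library(rvest)

# Inline stand-in for books.html (assumption: header cells use <td>, not <th>)
demo_html <- '
<table>
  <tr><td>Book Name</td><td>Author</td><td>Attribute 1</td></tr>
  <tr><td>A Brief History of Time</td><td>Stephen Hawking</td><td>physics</td></tr>
  <tr><td>Weapons of Math Destruction</td><td>Cathy O\'Neil</td><td>big data</td></tr>
</table>'

# header = TRUE tells html_table() to use the first row as column names
demo_tbl <- read_html(demo_html) %>%
  html_element("table") %>%
  html_table(header = TRUE)

names(demo_tbl)
```

With header = TRUE, the separate rename and row-removal steps should no longer be needed for this kind of table.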

Load .xml file

Next, I will load the .xml file from GitHub.

# specify the url
xmlurl <- "https://raw.githubusercontent.com/kristinlussi/DATA_607/main/Week%207/book.xml"

# read the xml content
book_xml <- readLines(xmlurl, warn = FALSE)

# join the lines into a single string
book_xml <- paste(book_xml, collapse = "\n")

# parse the XML string (asText = TRUE marks the input as XML content, not a file path)
book_xml <- xmlParse(book_xml, asText = TRUE)

# convert to a data frame
book_xml <- xmlToDataFrame(book_xml)

# show raw results
book_xml

##                            td                          NA           NA
## 1                   Book Name                      Author  Attribute 1
## 2     A Brief History of Time             Stephen Hawking      physics
## 3 Weapons of Math Destruction                Cathy O'Neil     big data
## 4   Data Science for Business Foster Provost, Tom Fawcett data science
##            NA          NA
## 1 Attribute 2 Attribute 3
## 2       space        math
## 3        math  inequality
## 4    business   analytics

As the output shows, the same problem occurs with the .xml file: the header row is read as data rather than as column names. The default column names are “td” for the first column and NA for the remaining columns.

I fix this in the next step:

# rename the columns
colnames(book_xml) <- c("Book Name", "Author", "Attribute 1", "Attribute 2", "Attribute 3")

# remove the first row
book_xml <- book_xml[-1,]

# show final data frame
book_xml

##                     Book Name                      Author  Attribute 1
## 2     A Brief History of Time             Stephen Hawking      physics
## 3 Weapons of Math Destruction                Cathy O'Neil     big data
## 4   Data Science for Business Foster Provost, Tom Fawcett data science
##   Attribute 2 Attribute 3
## 2       space        math
## 3        math  inequality
## 4    business   analytics
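
The .html and .xml results needed the same two-step cleanup: use the first row as column names, then drop it. A minimal base R sketch of that step as a reusable function follows. The helper name promote_header and the small raw data frame are hypothetical and only illustrate the idea; the helper also resets the row names, which otherwise start at 2 as in the output above.

```r
# Hypothetical helper: use the first row of a data frame as its column names
promote_header <- function(df) {
  # as.character() keeps the labels even if a column was read as a factor
  names(df) <- vapply(df[1, , drop = FALSE], as.character, character(1),
                      USE.NAMES = FALSE)
  df <- df[-1, , drop = FALSE]  # drop the header row from the data
  rownames(df) <- NULL          # renumber rows from 1
  df
}

# Small stand-in for a freshly loaded table with placeholder column names
raw <- data.frame(
  X1 = c("Book Name", "A Brief History of Time", "Weapons of Math Destruction"),
  X2 = c("Author", "Stephen Hawking", "Cathy O'Neil"),
  stringsAsFactors = FALSE
)

clean <- promote_header(raw)
clean
```

Because the names are taken from the data itself, this approach avoids retyping the header and the typos that retyping can introduce.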

Load .json file

Finally, I will load the .json file from GitHub.

# specify the url
jsonurl <- "https://raw.githubusercontent.com/kristinlussi/DATA_607/main/Week%207/books.json"

# read the json file
book_json <- fromJSON(file = jsonurl) 

# show raw result
book_json

## [[1]]
## [[1]]$`Book Name`
## [1] "A Brief History of Time"
## 
## [[1]]$Author
## [1] "Stephen Hawking"
## 
## [[1]]$`Attribute 1`
## [1] "physics"
## 
## [[1]]$`Attribute 2`
## [1] "space"
## 
## [[1]]$`Attribute 3`
## [1] "math"
## 
## 
## [[2]]
## [[2]]$`Book Name`
## [1] "Weapons of Math Destruction"
## 
## [[2]]$Author
## [1] "Cathy O'Neil"
## 
## [[2]]$`Attribute 1`
## [1] "big data"
## 
## [[2]]$`Attribute 2`
## [1] "math"
## 
## [[2]]$Attribute
## [1] "inequality"
## 
## 
## [[3]]
## [[3]]$`Book Name`
## [1] "Data Science for Business"
## 
## [[3]]$Author
## [1] "Foster Provost, Tom Fawcett"
## 
## [[3]]$`Attribute 1`
## [1] "data science"
## 
## [[3]]$`Attribute 2`
## [1] "business"
## 
## [[3]]$`Attribute 3`
## [1] "analytics"

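# convert each list element (one book) into a one-row data frame
book_json1 <- as.data.frame(book_json[1])
# the second record uses the key "Attribute" instead of "Attribute 3",
# so rename that column to match the other two
book_json2 <- as.data.frame(book_json[2]) %>%
  rename(
    "Attribute.3" = "Attribute"
  )
book_json3 <- as.data.frame(book_json[3])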

# merge data frames into one data frame
book_json <- bind_rows(book_json1, book_json2, book_json3)

# show final data frame
book_json

##                     Book.Name                      Author  Attribute.1
## 1     A Brief History of Time             Stephen Hawking      physics
## 2 Weapons of Math Destruction                Cathy O'Neil     big data
## 3   Data Science for Business Foster Provost, Tom Fawcett data science
##   Attribute.2 Attribute.3
## 1       space        math
## 2        math  inequality
## 3    business   analytics

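When the .json file is read into R, fromJSON() returns a list of three elements, one per book, rather than a table. Each element is converted into a one-row data frame, and the three data frames are then bound into one. The column names match the keys in the .json file, except that as.data.frame() replaces each space with a “.”. The second record also uses the key “Attribute” where the others use “Attribute 3”, so that column is renamed before binding.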
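
The dots in the column names come from R's default name checking, which passes names through make.names(). The sketch below shows one way to keep the original keys and to handle all records in a loop instead of one by one. It uses a short inline list as a stand-in for the parsed JSON; the names records and record_to_df are hypothetical, and the approach is an alternative to the dplyr code above, not a description of it.

```r
# Stand-in for the parsed JSON: a list of records, one with a mismatched key
records <- list(
  list(`Book Name` = "A Brief History of Time", `Attribute 3` = "math"),
  list(`Book Name` = "Weapons of Math Destruction", Attribute = "inequality")
)

# Hypothetical helper: fix the stray "Attribute" key, then build a one-row data frame
record_to_df <- function(rec) {
  names(rec)[names(rec) == "Attribute"] <- "Attribute 3"
  # check.names = FALSE keeps the spaces in the column names
  data.frame(rec, check.names = FALSE, stringsAsFactors = FALSE)
}

# Convert every record and stack the rows
books <- do.call(rbind, lapply(records, record_to_df))
names(books)

# What the default name checking does to a name containing a space
make.names("Book Name")
```

Fixing the key before conversion means every one-row data frame has the same columns, so the rows stack without a separate rename step.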

Conclusion

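In conclusion, each file type behaves differently once loaded into R. The .html and .xml files each produced a data frame directly, but the header row was read as data, so the columns had to be renamed and the first row dropped. The .json file was read as a nested list, which had to be converted into data frames and bound together, with the spaces in the column names replaced by “.”. After cleanup, all three sources contain the same three books and five columns.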
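
One way to check that the cleaned results really hold the same data is to harmonize the column names and row labels and then compare. The sketch below does this in base R with two small stand-in data frames that mimic the differences seen above (spaces versus dots in names, row labels starting at 2 versus 1). The names from_html, from_json, and normalize are hypothetical.

```r
# Stand-in for a cleaned .html/.xml result: spaces in names, rows labelled 2 and 3
from_html <- data.frame(
  `Book Name` = c("A Brief History of Time", "Weapons of Math Destruction"),
  Author = c("Stephen Hawking", "Cathy O'Neil"),
  check.names = FALSE, stringsAsFactors = FALSE
)
rownames(from_html) <- c("2", "3")

# Stand-in for a cleaned .json result: dots in names, rows labelled 1 and 2
from_json <- data.frame(
  Book.Name = c("A Brief History of Time", "Weapons of Math Destruction"),
  Author = c("Stephen Hawking", "Cathy O'Neil"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: make names syntactic and renumber the rows
normalize <- function(df) {
  names(df) <- make.names(names(df))
  rownames(df) <- NULL
  df
}

same <- isTRUE(all.equal(normalize(from_html), normalize(from_json)))
same
```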