library(tidyverse)
The goal of this assignment is to begin building the ability to process data from web sources that doesn't come as a convenient CSV download or other tabular format. The formats of focus in this assignment are HTML files, which are typical of direct scraping, along with XML and JSON files, which are more likely to be retrieved through an API.
To get familiar with these formats, we will represent information about three books of a chosen genre in each of the three formats. Once the data has been created, we will use various R packages to load the information as dataframes.
For the data, I went with my favorite science fiction books. Beyond capturing the name and authors of each book, I also added the year of publication, the average reviewer rating on Goodreads, and the number of reviewers who have voted on the book on Goodreads. This information was then formatted into XML, JSON, and HTML files by hand. For HTML, I included the data a second time, since when scraping HTML there is usually other information that has to be filtered out to get to the data you want. After creation, these files were uploaded to GitHub. We will set their URLs as variables below.
jurl <- r"(https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%207%20Assignment/books.json)"
xurl <- r"(https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%207%20Assignment/books.xml)"
hurl <- r"(https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%207%20Assignment/books.html)"
Now we have to consider how to load this data into R for each file type. For JSON, we will use the rjson package to parse the file at the URL. After the initial parsing, jinfo is a list of lists with one redundant outer layer, so we need to peel back that layer by accessing the first index. That leaves a list of 3 separate lists of character vectors. To unite these into a single object we row-bind them with do.call() and rbind(), which yields a matrix of list elements. Finally, jinfo can be converted into a dataframe through as_tibble(); it does, however, need to be unnested to flatten the character vectors inside the dataframe. If we had attempted to flatten the lists before this point, we would have lost the inherent structure of the data.
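For reference, here is a hypothetical sketch of the shape being described. The wrapper key is an assumption (only the fact that there is one redundant outer layer matters); the field names match the columns in the output below:
{
  "books": [
    {
      "Name": "Roadside Picnic",
      "Year Published": "1972",
      "Authors": ["Arkady Strugatsky", "Boris Strugatsky"],
      "Goodreads Average Rating": "4.16",
      "Goodreads Voters": "58353"
    }
  ]
}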
library(rjson)
jinfo <- fromJSON(file = jurl) # Parse the JSON file from the URL; the result is a list of lists with one redundant outer layer
jinfo <- jinfo[[1]] # Peel back that layer; we now have a list of 3 lists, which causes an error if fed to as_tibble() as is
(jinfo <- do.call(rbind, jinfo)) # Row binding gives a 3x5 matrix of list elements, which as_tibble() handles properly after unnesting
## Name Year Published Authors
## [1,] "Roadside Picnic" "1972" character,2
## [2,] "Do Androids Dream of Electric Sheep?" "1968" "Philip K. Dick"
## [3,] "The Handmaid’s Tale" "1985" "Margaret Atwood"
## Goodreads Average Rating Goodreads Voters
## [1,] "4.16" "58353"
## [2,] "4.09" "415797"
## [3,] "4.13" "1869506"
(jframe <- unnest(as_tibble(jinfo), cols = colnames(jinfo)))
## # A tibble: 4 × 5
## Name `Year Published` Authors Goodr…¹ Goodr…²
## <chr> <chr> <chr> <chr> <chr>
## 1 Roadside Picnic 1972 Arkady … 4.16 58353
## 2 Roadside Picnic 1972 Boris S… 4.16 58353
## 3 Do Androids Dream of Electric Sheep? 1968 Philip … 4.09 415797
## 4 The Handmaid’s Tale 1985 Margare… 4.13 1869506
## # … with abbreviated variable names ¹`Goodreads Average Rating`,
## # ²`Goodreads Voters`
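As an aside, the jsonlite package offers its own fromJSON() that simplifies an array of records into a dataframe automatically, which would shorten this process considerably. A minimal sketch, assuming the same single wrapper key as above:
jbooks <- jsonlite::fromJSON(jurl)[[1]] # simplification turns the records into a data.frame; Authors stays a list column
(jframe2 <- unnest(as_tibble(jbooks), cols = Authors)) # flatten the list column into one row per author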
For XML, we will use the XML package to parse the file and then convert it to a dataframe, with the methods package loaded to support the conversion. As the XML package does not provide native URL support, we will also use httr to retrieve the file. We are able to get a dataframe here in a two-step process. However, the output is less tidy than what we got from JSON: multiple authors arrive squished together as a single value. xmlToDataFrame() also returns a plain data.frame rather than a tibble, so we wrap the result in as_tibble().
library(XML)
library(httr)
library(methods)
xinfo <- xmlParse(
  rawToChar(GET(xurl)$content) # retrieve the raw file with httr's GET() and convert the bytes to a string for xmlParse()
)
(xframe <- as_tibble(xmlToDataFrame(xinfo)))
## # A tibble: 3 × 5
## Name Year_Published Authors Goodr…¹ Goodr…²
## <chr> <chr> <chr> <chr> <chr>
## 1 Roadside Picnic 1972 Arkady St… 4.16 58353
## 2 Do Androids Dream of Electric Sheep? 1968 Philip K.… 4.09 415797
## 3 The Handmaid’s Tale 1985 Margaret … 4.13 1869506
## # … with abbreviated variable names ¹Goodreads_Average_Rating,
## # ²Goodreads_Voters
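If the separate authors were needed, the XML package also allows working node by node rather than going through xmlToDataFrame(). A minimal sketch, assuming the file uses book elements with repeated Author children (the actual element names are a guess):
books <- getNodeSet(xinfo, "//book") # one node per book
authors <- lapply(books, function(b) xpathSApply(b, ".//Author", xmlValue)) # character vector of authors for each book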
For HTML, we will use rvest to load the data into a frame. The process is even simpler than it was for XML. We simply read the URL as HTML, navigate to the table we are looking for (in this case there is only one), and cast it as a dataframe with html_table(). Although the result still isn't tidy, as we again have multiple authors in one cell, the column types are properly formatted, unlike with the other two formats.
suppressMessages(
library(rvest)
)
hinfo <- read_html(hurl)
(hframe <- hinfo |>
html_element("table") |>
html_table())
## # A tibble: 3 × 5
## Name `Year Published` Authors Goodr…¹ Goodr…²
## <chr> <int> <chr> <dbl> <int>
## 1 Roadside Picnic 1972 Arkady … 4.16 58353
## 2 Do Androids Dream of Electric Sheep? 1968 Philip … 4.09 415797
## 3 The Handmaid’s Tale 1985 Margare… 4.13 1869506
## # … with abbreviated variable names ¹`Goodreads Average Rating`,
## # ²`Goodreads Voters`
Although all of this information started out in different formats, it was fairly simple to transform each into a dataframe in R. The resulting dataframes do have their differences. The JSON route was the most labor intensive, as the initial shape of the parsed data was not ideal for a dataframe, but it did end in tidy data. For XML we got a dataframe with little effort, though by default nested values seem to get squished together into a single cell instead of being put into separate columns. Both XML and HTML were much simpler to process, and HTML leaves us with the best end result for the least work put in, as all of the columns are even correctly typed. In the end, with a bit more work we could get identical results from every format, as sketched below.
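As a sketch of that extra work, we could split the HTML frame's authors into separate rows with tidyr and re-parse the XML frame's column types with readr (both loaded via tidyverse). The comma delimiter is an assumption about how multiple authors appear in the HTML cell:
(hframe_tidy <- hframe |>
  separate_rows(Authors, sep = ",\\s*")) # one row per author, assuming comma-separated names
(xframe_typed <- type_convert(xframe)) # re-parse the character columns into numeric/integer types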
This just goes to show that R can be quite useful for collecting data from APIs or from scraping.