This week’s assignment focused on creating four files in different formats—JSON, HTML, XML, and Parquet. Each file contained the same structured table of product information, with attributes such as category, item name, item ID, brand, price, and variation details. The goal was to demonstrate how to create data tables in these formats and to practice importing them into R as data frames. Working with multiple formats gave us insight into the strengths of each format and its use cases in data storage and analysis, and strengthened our skills in reading and handling diverse data formats in R.
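Before the import steps below, here is a minimal sketch of how files like these might be generated in R. The sample rows and the write-out calls are illustrative assumptions, not the exact code used to produce the assignment files; the sketch assumes the jsonlite, knitr, and arrow packages are installed.
# Illustrative sketch: writing one small table out to all four formats.
# The sample data below is made up for demonstration purposes.
library(jsonlite)
library(knitr)
library(arrow)
items <- data.frame(
  Category  = c("Electronics", "Electronics"),
  Item_Name = c("Smartphone", "Laptop"),
  Item_ID   = c(101, 102),
  Price     = c(699.99, 1099.99)
)
write_json(items, "data.json", pretty = TRUE)            # JSON
writeLines(kable(items, format = "html"), "data.html")   # HTML table
write_parquet(items, "data.parquet")                     # Parquet
# XML has no one-line writer here, so assemble it by hand,
# one <item> element per row of the data frame
rows <- apply(items, 1, function(r) paste0(
  "  <item>\n",
  paste0("    <", names(items), ">", trimws(r), "</", names(items), ">\n",
         collapse = ""),
  "  </item>"
))
writeLines(c("<items>", rows, "</items>"), "data.xml")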
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
To import the HTML data, I first created the HTML file containing the information and uploaded it to GitHub. I used the rvest package in R to handle the extraction. I passed the GitHub URL to the read_html() function, which retrieves the raw HTML and stores it as an object of class xml_document/xml_node. Then, using the html_nodes() function from the rvest package, I located the table within the HTML document by passing "table" as the selector. This efficiently extracts the table from the HTML structure for further analysis. After extracting the data from the table, I placed it in the data frame html_data.
# Load the necessary library for web scraping
library(rvest)
## Warning: package 'rvest' was built under R version 4.3.3
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
# URL to the HTML file on GitHub containing the table of data
html_url = "https://raw.githubusercontent.com/Shriyanshh/Week-7-Assignment/refs/heads/main/data.html"
# Read the HTML content from the GitHub URL
git_html = read_html(html_url)
# Check the class of the object to ensure it has been read correctly as HTML
class(git_html)
## [1] "xml_document" "xml_node"
# Extract the first table from the HTML document into a data frame
html_data = git_html %>%
html_nodes("table") %>% # Find all the tables in the HTML document
.[[1]] %>% # Select the first table (if there are multiple tables)
html_table() # Convert the HTML table into a data frame
# Display the extracted data frame
html_data
To import the JSON data, I utilized both the jsonlite and httr packages. First, I used httr’s GET() function to retrieve the JSON data from GitHub, storing the result in git_json as a response object. Next, I applied httr’s content() function to extract the body of the response as a character vector. Following that, I used jsonlite’s fromJSON() function to parse the JSON; setting flatten = TRUE flattens any nested structures into a single-level, wide layout. Finally, I converted the parsed list into a data frame for further analysis.
# Load the necessary libraries for working with JSON data and making web requests
library(jsonlite) # For parsing JSON data
## Warning: package 'jsonlite' was built under R version 4.3.3
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
library(httr) # For handling HTTP requests
## Warning: package 'httr' was built under R version 4.3.3
# JSON file URL from GitHub
json_url = "https://raw.githubusercontent.com/Shriyanshh/Week-7-Assignment/refs/heads/main/data.json"
# Retrieve the JSON data from the GitHub URL using GET request
git_json = GET(json_url)
# Check the class of the response to confirm it's an HTTP response object
class(git_json)
## [1] "response"
# Extract the content from the HTTP response and convert it into a text format
json_content = content(git_json, "text")
# Check the class of the extracted content (it should be a character vector)
class(json_content)
## [1] "character"
# Parse the JSON content into an R object (list by default) and flatten it into a wide structure
json_data = fromJSON(json_content, flatten = TRUE)
# Check the class of the parsed JSON data (it should now be a list or data frame)
class(json_data)
## [1] "list"
# Convert the JSON data (list) into a data frame for easier analysis
json_data = as.data.frame(json_data)
# Display the JSON data frame
json_data
To import the XML data, I used the xml2 package. I utilized the read_xml() function to read the XML data from the URL, which stores it as an object of class xml_document/xml_node. After retrieving the XML data, I converted it into a list and then transformed the list into a data frame for further analysis, as sketched below.
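A minimal sketch of that XML import, assuming a data.xml file uploaded to the same GitHub repository (the URL below is illustrative, not confirmed):
# Load the xml2 package for reading XML data
library(xml2)
# Illustrative URL to the XML file on GitHub
xml_url <- "https://raw.githubusercontent.com/Shriyanshh/Week-7-Assignment/refs/heads/main/data.xml"
# Read the XML content from the URL
git_xml <- read_xml(xml_url)
# Confirm the object classes ("xml_document" "xml_node")
class(git_xml)
# Convert the XML tree into a nested list
xml_list <- as_list(git_xml)
# Flatten each record under the root into one row of a data frame
xml_data <- bind_rows(lapply(xml_list[[1]], function(rec) {
  as.data.frame(lapply(rec, unlist), stringsAsFactors = FALSE)
}))
# Display the resulting data frame
xml_data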
To import the Parquet data, I used the arrow package in
R. First, I specified the URL to the Parquet file that I uploaded to
GitHub. Since Parquet files cannot be read directly from a URL, I used
download.file() to temporarily download the file to my
local system. The downloaded file was saved as
"data.parquet". After downloading the file, I used the
read_parquet() function from the arrow package
to load the Parquet data into R and store it as a data frame. Finally, I
printed the data frame to inspect the imported data.
# Install (if needed) and load the arrow package for reading and writing Parquet files
install.packages("arrow", repos = "https://cloud.r-project.org")
library(arrow)
## Warning: package 'arrow' was built under R version 4.3.3
##
## Attaching package: 'arrow'
## The following object is masked from 'package:lubridate':
##
## duration
## The following object is masked from 'package:utils':
##
## timestamp
# URL to the Parquet file stored on GitHub
parquet_url <- "https://raw.githubusercontent.com/Shriyanshh/Week-7-Assignment/main/data.parquet"
# Download the Parquet file temporarily and save it locally as "data.parquet"
# Parquet files cannot be directly read from a URL, so we need to download it first
download.file(parquet_url, destfile = "data.parquet", mode = "wb")
# Read the downloaded Parquet file into R as a data frame
git_parquet <- read_parquet("data.parquet")
# Display the contents of the Parquet file, now stored as a data frame
print(git_parquet)
## # A tibble: 20 × 7
## Category Item_Name Item_ID Brand Price Variation_ID Variation_Details
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 Electronics Smartpho… 101 Tech… 700. 101-A Color: Black, St…
## 2 Electronics Smartpho… 101 Tech… 700. 101-B Color: White, St…
## 3 Electronics Laptop 102 Comp… 1100. 102-A Color: Silver, S…
## 4 Electronics Laptop 102 Comp… 1100. 102-B Color: Space Gra…
## 5 Home Appliances Refriger… 201 Home… 900. 201-A Color: Stainless…
## 6 Home Appliances Refriger… 201 Home… 900. 201-B Color: White, Ca…
## 7 Home Appliances Washing … 202 Clea… 500. 202-A Type: Front Load…
## 8 Home Appliances Washing … 202 Clea… 500. 202-B Type: Top Load, …
## 9 Clothing T-Shirt 301 Fash… 20.0 301-A Color: Blue, Siz…
## 10 Clothing T-Shirt 301 Fash… 20.0 301-B Color: Red, Size…
## 11 Clothing T-Shirt 301 Fash… 20.0 301-C Color: Green, Si…
## 12 Clothing Jeans 302 Deni… 50.0 302-A Color: Dark Blue…
## 13 Clothing Jeans 302 Deni… 50.0 302-B Color: Light Blu…
## 14 Books Fiction … 401 - 15.0 401-A Format: Hardcove…
## 15 Books Fiction … 401 - 15.0 401-B Format: Paperbac…
## 16 Books Non-Fict… 402 - 25.0 402-A Format: eBook, L…
## 17 Books Non-Fict… 402 - 25.0 402-B Format: Paperbac…
## 18 Sports Equipme… Basketba… 501 Spor… 30.0 501-A Size: Size 7, Co…
## 19 Sports Equipme… Tennis R… 502 Rack… 90.0 502-A Material: Graphi…
## 20 Sports Equipme… Tennis R… 502 Rack… 90.0 502-B Material: Alumin…
Each format has its own advantages and disadvantages depending on how the data is stored and analyzed. Below is an overview of each format along with its pros and cons.
JSON

Pros:
- Simple and human-readable format.
- Supports nested structures, which makes it versatile.
- Commonly used in APIs and web development for data exchange.

Cons:
- May not be ideal for very large datasets.
- Slower than binary formats like Parquet for large-scale data processing.

HTML

Pros:
- Easily viewable in web browsers, making it accessible for visual inspection.
- Great for displaying tabular data on web pages.

Cons:
- Not optimized for analytical or large-scale data manipulation.
- Requires parsing when converting to data frames in R.

XML

Pros:
- Structured and self-descriptive, ideal for storing hierarchical information.
- Used widely in web services for data interchange.

Cons:
- More verbose than JSON and less efficient for large datasets.
- Parsing XML can be slower and more complex.

Parquet

Pros:
- Efficient columnar storage format that is excellent for big data analytics.
- Supports compression and is optimized for queries, reducing both storage and processing time.

Cons:
- Not human-readable and requires specialized tools like arrow to access and manipulate.
- Primarily useful for large datasets rather than small data exchange.
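To make the storage trade-off concrete, one quick check (assuming the data.json and data.parquet files from the steps above are present in the working directory) is to compare their sizes on disk:
# Compare on-disk size (in bytes) of the same table in two formats
file.size("data.json")
file.size("data.parquet")
For a table this small the difference is negligible—Parquet's metadata overhead can even make the file larger—but for large datasets Parquet's columnar layout and compression typically yield much smaller files.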
In this assignment, we explored various file formats—JSON, HTML, XML, and Parquet—by creating and importing data tables for each format into R. Through this process, we gained practical experience in handling different data formats, uploading files to GitHub, and utilizing R packages to read and convert the data into data frames for analysis.
For JSON, we utilized the jsonlite and
httr packages to retrieve the file from GitHub, parse it
into a list, and convert it to a wide data frame. For
HTML, we used the rvest package to extract
the table data from the raw HTML content stored on GitHub and converted
it into a data frame. For XML, the xml2
package was used to read the XML data, convert it into a list, and
further transform it into a data frame. Lastly, for
Parquet, we leveraged the arrow package,
downloading the file temporarily from GitHub and reading it into R using
the read_parquet() function.
Overall, this assignment provided us with valuable insights into how different file formats can be used to store structured data and how we can efficiently work with them in R. It also reinforced our understanding of integrating data from different sources, such as GitHub, into a consistent data analysis workflow.