Download data and understand the Arrow mindset
Class: DSA 406-001
Using READY + SCAN Frameworks with Arrow for Efficient EDA
By the end of this module, you will be able to:
Apply the READY framework to plan your big data investigation
Use the SCAN framework to systematically explore large datasets
Understand when to use Arrow vs traditional R for data exploration
Build your first “detective’s workflow” for any new dataset
Navigate the critical first 10 minutes with a massive dataset
Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!
Traditional R thinking:
# Load EVERYTHING into memory first
data <- read_csv("huge_file.csv") # 💥 Crash!
data |>
filter(year == 2023)
Arrow thinking:
# Create a "view" of the data, then filter
data <- open_dataset("huge_file.csv") # ✅ Instant!
data |> filter(year == 2023) |>
collect() # Only brings filtered results to memory
When we say “big data,” we’re talking about datasets that challenge traditional approaches:
Your laptop typically has 8-16GB RAM
Operating system uses 2-4GB
Other programs use 1-2GB
Available for R: Maybe 4-8GB
Our dataset: 9GB = Memory overflow without Arrow!
Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.
Is your data < 100MB? → Use traditional R. Does your data approach or exceed your available RAM (often only 4-8GB) and involve mostly single-table operations? → Use Arrow.
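Here is a minimal sketch of that decision rule in code, assuming the CSV path used later in this module and a rough, illustrative figure of 8GB of usable RAM - adjust both for your own setup:
# Compare the size of the file on disk with the RAM R can realistically use
csv_path <- "data/seattle-library-checkouts.csv"   # downloaded later in this module
file_gb  <- file.size(csv_path) / 1024^3           # NA if the file isn't there yet
ram_gb   <- 8                                      # assumed usable RAM in GB; purely illustrative
if (is.na(file_gb)) {
  message("File not downloaded yet - see the download step below")
} else if (file_gb < 0.1) {
  message("Small file: readr::read_csv() is fine")
} else if (file_gb < ram_gb) {
  message("Fits in memory, but arrow::open_dataset() keeps things comfortable")
} else {
  message("Approaches or exceeds available RAM: use arrow::open_dataset()")
}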
Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect() (a short code sketch of this follows the analogy below).
Think of it like:
Traditional R: “Cook the entire meal, then throw away what you don’t want”
Arrow: “Plan the meal, shop for only what you need, then cook just that”
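Here is what that plan-then-cook workflow looks like in code - a minimal sketch, assuming the seattle_csv object created with open_dataset() later in this module and the CheckoutYear, CheckoutMonth, and Checkouts columns shown in its schema:
# Build a query plan - Arrow reads nothing from disk at this point
monthly_2022 <- seattle_csv |>
  filter(CheckoutYear == 2022) |>
  group_by(CheckoutMonth) |>
  summarise(total_checkouts = sum(Checkouts))
monthly_2022               # prints the lazy query, not the results
monthly_2022 |> collect()  # now Arrow executes the plan and returns a tibble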
We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!
First, let’s create a special folder to store our data (you may already have one, but if you don’t, un-comment the dir.create() call and run the chunk below).
# create a "data" directory if it doesn't exist already
# Using showWarnings = FALSE to suppress warning if directory already exists
dir.create("data",showWarnings = FALSE)
Now for the fun part! We’ll download the Seattle Library dataset: item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.
⚠️ Important: This is a 9GB file, so the download will take a while - run it once and avoid re-downloading it.
file_path <- "data/seattle-library-checkouts.csv"
# Download the file
curl::multi_download(
urls = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
destfiles = file_path,
resume = TRUE
)
# Download Seattle library checkout dataset:
# Instructions:
# 1. get data from AWS S3 URL
# 2. save to local drive
# 3. use resume = TRUE to allow continuing downloads
# IMPORTANT: commented out the download path to avoid re-downloading the dataset
#curl::multi_download("https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv", "data/seattle-library-checkouts.csv", resume = TRUE )
Why use curl::multi_download()? It shows a progress bar for large downloads and, with resume = TRUE, can pick up where it left off if the connection drops.
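One convenient pattern is to guard the download so that re-running the document never pulls the 9GB file twice - a small sketch using the same URL and path as above:
# Only download if the CSV isn't already on disk
csv_path <- "data/seattle-library-checkouts.csv"
if (!file.exists(csv_path)) {
  curl::multi_download(
    "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
    csv_path,
    resume = TRUE
  )
}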
After the download completes, let’s make sure everything worked:
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3
file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
file.size("data/seattle-library-checkouts.csv") / 1024^3 # Size in GB
[1] 8.579315
#load in the packages and install if needed with code below
# Required packages to check and install
required_packages <- c("tidyverse", "arrow")
# Install missing packages
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
#library(pkg, character.only = TRUE)
}
}
#Load libraries
lapply(required_packages, library, character.only = TRUE)
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'tibble' was built under R version 4.5.1
Warning: package 'tidyr' was built under R version 4.5.1
Warning: package 'readr' was built under R version 4.5.1
Warning: package 'purrr' was built under R version 4.5.1
Warning: package 'dplyr' was built under R version 4.5.1
Warning: package 'stringr' was built under R version 4.5.1
Warning: package 'forcats' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'arrow' was built under R version 4.5.1
Attaching package: 'arrow'
The following object is masked from 'package:lubridate':
duration
The following object is masked from 'package:utils':
timestamp
[[1]]
[1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
[13] "grDevices" "utils" "datasets" "methods" "base"
[[2]]
[1] "arrow" "lubridate" "forcats" "stringr" "dplyr" "purrr"
[7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
[13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
open_dataset() Magic
Now let’s see the fundamental difference between read_csv() and open_dataset():
DON’T RUN!
# regular approach - would crash most computers!
# DON'T RUN!!
#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv")
RUN
# Arrow approach - creates a "view" without loading
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")
# View Object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
library(glue) # string interpolation - cleaner alternative to paste()
Warning: package 'glue' was built under R version 4.5.1
# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
# Let's see what file size we are actually working with
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3) # Convert to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 8.58 GB
What open_dataset() does:
Creates an Arrow dataset object that “points to” your CSV file
Doesn’t actually load the data into memory yet
Acts like a “view” or “window” into your data
When you run this code with open_dataset(), Arrow does something clever:
It peeks at the first few thousand rows
Figures out what kind of data is in each column
Creates a roadmap of the data
Then… it stops!
That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:
Creates an index of where everything is
Only gets books (data) when you specifically ask for them
Keeps track of what’s where without moving everything
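You can look at the librarian’s “index” yourself - a small sketch, assuming the seattle_csv object created above:
# Inspect the metadata Arrow gathered without reading the full 9GB file
seattle_csv$schema   # the inferred column names and types
names(seattle_csv)   # just the column names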
Apply WHILE you’re exploring data
# Peek at the structure without loading
seattle_csv |>
  glimpse()
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…
👉 Your Turn: Before going on, fill in the table below:
Variable Name | Data Type (class) | What do we notice from the output? Things to keep an eye on? |
---|---|---|
UsageClass | String | Describes the type of media the item is, with values like “Physical” and “Digital”. |
CheckoutType | String | The type of checkout used to access the item, such as “Horizon” or “OverDrive”. |
MaterialType | String | The format of the item being checked out, including “BOOK”, “EBOOK”, and “SOUNDDISC”. |
CheckoutYear | Integer (int64) | The year of checkout as a numeric value. |
CheckoutMonth | Integer (int64) | The month of checkout stored as a number rather than as a text label. |
Checkouts | Integer (int64) | Ranging from 1-25 based on the glimpse of the data, this appears to be the number of checkouts per item. |
Title | String | The name of each reading/book. |
ISBN | Null or NA | ISBNs should be 10- to 13-digit identification numbers, but this column appears to contain only NULL/NA values. That may require troubleshooting, or deciding whether the field is still needed. |
Creator | String | Appears to be the author of the book/reading; some entries also contain dates. |
Subjects | String | The subject or genre of the reading. |
Publisher | String | The publishing group or company of each reading. |
PublicationYear | String | The year the reading was published, stored as text rather than as a number as expected. |
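To follow up on observations like the MaterialType row above, you can ask Arrow to tabulate a single column without ever loading the full file - a small sketch, again assuming the seattle_csv object from earlier:
# Count how often each material type appears; only the summary comes back to R
seattle_csv |>
  count(MaterialType, sort = TRUE) |>
  collect()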
Apply BEFORE you start investigating data
Let’s work through the READY framework for our Seattle Library dataset:
👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:
👉Your Investigation Questions:
# Add your R questions as comments:
# example: Do we have all library branches or just some?
# 1. Is all the relevant information present in regard to each item in the dataset?
# 2. Does the dataset encompass all possible items that can be checked out from a library?
# 3. Is this version of the dataset the most up to date and relevant to perform the desired analysis?
Detective Assessment: We’re about to download 9GB of Seattle Library checkout data - that’s 40+ million individual checkout records!
👉 Your Investigation Questions:
# Add your E questions as comments:
# example: Library Directors: "How do we optimize our collection?"
# 1. What aspects of the dataset should be considered that have an effect on measuring specific metrics such as checkout rate?
# 2. What genres or subjects of specific readings are considered the most popular?
# 3. What is the total number of items/values present in the library dataset?
# 4. What considerations regarding the fields (variables) included in the dataset are most relevant for a library to be aware of in order to stay relevant to the public?
Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”
Your Investigation Questions:
# Add your A questions as comments(Read through these):
# First contact: What IS this data?
# Data quality: Can we trust it?
# Scope assessment: What can we investigate?
# Pattern hunting: What stories emerge?
# Stakeholder insights: What's actionable?
# 1. How many types of variables are present in the data to consider?
# 2. What are the expectations and overall sentiment regarding what can possibly be found or interpreted in this dataset?
# 3. What is the significance of the data, and how useful would it be for the type of analysis we expect to conduct?
👉Your Investigation Questions:
# Add your group's D questions as comments:
# EXAMPLE: Are there missing values in key fields?
# 1. Are all values that are part of each field or variable correctly formatted in relation to the variable's data type?
# 2. Are there any instances where existing fields may require modification (such as splitting a variable) to get the most benefit from the information present?
# 3. How well formatted is the dataset? Would it require additional cleaning or editing to increase readability?
👉Your Investigation Questions:
# Add your Y questions as comments:
# Example: Usage patterns across time (seasonal, pandemic impact?)
# 1. How can the subject or genre of a book or reading affect how frequently it gets checked out in the library?
# 2. Does the class or format of a reading or book affect how popular it is or how frequently it gets checked out?
# 3. Are there characteristics based on field values such as "publisher" or "creator" that affect how often a reading or book is checked out?
# 4. Are there any other relevant insights that can be extracted from the dataset that do not rely only on checkout rate?
What Arrow Actually Does (see the code sketch after this list):
Schema Detection: Reads just enough rows to understand data types
Metadata Storage: Creates an index of where data lives on disk
Lazy Operations: Builds query plans without executing them
Columnar Processing: Only reads columns you actually need
Predicate Pushdown: Applies filters before reading data
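The list above maps directly onto how a query actually runs. Here is a small sketch (again assuming seattle_csv and the columns from its schema) where only three columns are kept and the filter is applied during Arrow's scan:
# Columnar processing + predicate pushdown in action
seattle_csv |>
  select(CheckoutYear, MaterialType, Checkouts) |>         # only these columns are materialized
  filter(CheckoutYear == 2022, MaterialType == "BOOK") |>  # applied during the scan, before data reaches R
  summarise(total_checkouts = sum(Checkouts)) |>
  collect()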
Think of Arrow like a smart restaurant:
Traditional R Restaurant:
Brings you the entire menu of food at once
You pick what you want and throw away the rest
Kitchen overwhelmed, customers wait, food wasted
Arrow Restaurant:
Shows you a menu (schema)
Takes your order (query plan)
Cooks only what you ordered (lazy evaluation)
Delivers exactly what you need (collect())
When You’ll Need These Skills:
Academic Research: Census data, genomics, climate models
Business Analytics: Customer transactions, web logs, sensor data
Public Policy: Government datasets, health records, economic indicators
Machine Learning: Training datasets often exceed memory limits