Download data and understand the Arrow mindset
Using READY + SCAN Frameworks with Arrow for Efficient EDA
By the end of this module, you will be able to:
Apply the READY framework to plan your big data investigation
Use the SCAN framework to systematically explore large datasets
Understand when to use Arrow vs traditional R for data exploration
Build your first “detective’s workflow” for any new dataset
Navigate the critical first 10 minutes with a massive dataset
Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!
Traditional R thinking:
# Load EVERYTHING into memory first
data <- read_csv("huge_file.csv") # 💥 Crash!
data |>
  filter(year == 2023)
Arrow thinking:
# Create a "view" of the data, then filter
data <- open_dataset("huge_file.csv") # ✅ Instant!
data |>
  filter(year == 2023) |>
  collect() # Only brings filtered results to memory
When we say “big data,” we’re talking about datasets that challenge traditional approaches:
Your laptop typically has 8-16GB RAM
Operating system uses 2-4GB
Other programs use 1-2GB
Available for R: Maybe 4-8GB
Our dataset: 9GB = Memory overflow without Arrow!
Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.
Is your data < 100MB? → Use traditional R
Is your data < 8GB and mostly single-table operations? → Use Arrow
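As a quick sanity check before choosing an approach, you can compare a file's size on disk to the RAM you realistically have free. A minimal sketch, using the hypothetical huge_file.csv from above:
# Rough pre-flight check (sketch; "huge_file.csv" is a placeholder file name)
file_gb <- file.size("huge_file.csv") / 1024^3 # size on disk in GB
file_gb
# If this approaches your free RAM (often only 4-8GB of it),
# reach for arrow::open_dataset() instead of readr::read_csv()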
Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect().
Think of it like:
Traditional R: “Cook the entire meal, then throw away what you don’t want”
Arrow: “Plan the meal, shop for only what you need, then cook just that”
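Here is a minimal sketch of that idea, assuming `data` was created with open_dataset() as above; nothing is read from disk until collect() runs:
# Build the "recipe" - no data is read yet
query <- data |>
  filter(year == 2023) |>
  summarise(n = n())

class(query) # an Arrow query object, not a data frame

# Execute the plan and pull only the result into memory
result <- query |> collect()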
We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!
First, let's create a special folder to store our data (you may already have one, but if you don't, uncomment the dir.create() line below and run the chunk).
# Create a "data" directory if it doesn't exist already
# Using showWarnings = FALSE to suppress warning if directory already exists
#dir.create("data", showWarnings = FALSE)
Now for the fun part! We'll download the Seattle Library dataset: item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.
⚠️ Important: This is a 9GB file, so the download will take a while and you'll need enough free disk space:
# Download Seattle library checkout dataset:
# 1. Fetch data from AWS S3 bucket URL
# 2. Save to local data directory
# 3. Use resume = TRUE to allow continuing interrupted downloads
curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
Why use curl::multi_download()? For a file this size it is more robust than a plain download: it shows a progress bar and, with resume = TRUE, can pick up an interrupted download where it left off.
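One practical habit (a sketch, not part of the original instructions): wrap the call in a file.exists() check so re-running the chunk doesn't start a fresh 9GB download.
# Only download if the file isn't already in place
if (!file.exists("data/seattle-library-checkouts.csv")) {
  curl::multi_download(
    "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
    "data/seattle-library-checkouts.csv",
    resume = TRUE
  )
}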
After the download completes, let’s make sure everything worked:
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3
file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
file.size("data/seattle-library-checkouts.csv") / 1024^3 # Size in GB
[1] 8.579315
# Load the packages below, installing them first if needed
required_packages <- c("tidyverse", "arrow")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries
lapply(required_packages, library, character.only = TRUE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'arrow'
The following object is masked from 'package:lubridate':
duration
The following object is masked from 'package:utils':
timestamp
[[1]]
[1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
[13] "grDevices" "utils" "datasets" "methods" "base"
[[2]]
[1] "arrow" "lubridate" "forcats" "stringr" "dplyr" "purrr"
[7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
[13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
open_dataset() Magic
Now let's see the fundamental difference between read_csv() and open_dataset():
DON’T RUN!
# Traditional approach - would crash most computers!
#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv") # DON'T RUN!
RUN
# Arrow approach - creates a "view" without loading
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")
# View Object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
library(glue) # string interpolation - cleaner alternative to paste()
# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
# Let's see what file size we are actually working with
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3) # Convert to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 8.58 GB
What open_dataset() does:
Creates an Arrow dataset object that "points to" your CSV file
Doesn't actually load the data into memory yet
Acts like a "view" or "window" into your data
When you run this code with open_dataset(), Arrow does something clever:
It peeks at the first few thousand rows
Figures out what kind of data is in each column
Creates a roadmap of the data
Then… it stops!
That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:
Creates an index of where everything is
Only gets books (data) when you specifically ask for them
Keeps track of what’s where without moving everything
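You can check this yourself; a quick sketch using the seattle_csv object created above:
# The dataset object holds metadata, not the 9GB of data
class(seattle_csv)  # a FileSystemDataset, not a data frame
seattle_csv$schema  # the column names and types Arrow inferred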
Apply WHILE you’re exploring data
# Peek at the structure without loading
seattle_csv |>
  glimpse()
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…
👉 Your Turn: Before going on, fill in the table below:
Variable Name | Data Type (class) | What do we notice from the output? Things to keep an eye on? |
---|---|---|
UsageClass | String | Physical/Digital are the two types |
CheckoutType | String | OverDrive and Horizon are the two types |
MaterialType | String | BOOK, EBOOK, SOUNDDISC, and AUDIOBOOK seem to be the four types |
CheckoutYear | int64 | Chronological |
CheckoutMonth | int64 | Chronological |
Checkouts | int64 | The number of checkouts is pretty small, mostly less than 10 |
Title | String | Title followed by slash followed by author |
ISBN | null | NA for all |
Creator | String | Last name, First name, occasionally year |
Subjects | String | A few phrases separated by commas indicating the general topic |
Publisher | String | Publishing company |
PublicationYear | String | cYear |
Apply BEFORE you start investigating data
Let’s work through the READY framework for our Seattle Library dataset:
👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:
👉Your Investigation Questions:
# Add your R questions as comments:
# example: Do we have all library branches or just some?
# Do we have all different types of genres for books?
# What is the most common genre or topic of book
Detective Assessment: We’re about to download 9GB of Seattle Library checkout data - that’s 40+ million individual checkout records!
👉 Your Investigation Questions:
# Add your E questions as comments:
# example: Library Directors: "How do we optimize our collection?"
# Librarians/Directors: What is the optimal time to switch out book collections?
# Customers: How often will the collection be updated?
Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”
Your Investigation Questions:
# Add your A questions as comments (read through these):
# 1. First contact: What IS this data?
# 2. Data quality: Can we trust it?
# 3. Scope assessment: What can we investigate?
# 4. Pattern hunting: What stories emerge?
# 5. Stakeholder insights: What's actionable?
# analyze all missing values
# analyze values that look different from others or the general format
# What are questions we can answer using the given data
👉Your Investigation Questions:
# Add your group's D questions as comments:
# EXAMPLE: Are there missing values in key fields?
# Are there any repetitions in the book title?
# For some of the publications, there are [] and years are written in a different format - how are those interpreted?
👉Your Investigation Questions:
# Add your Y questions as comments:
# Example: Usage patterns across time (seasonal, pandemic impact?)
# Most checked out type of genre
# Most and least read authors
What Arrow Actually Does:
Schema Detection: Reads just enough rows to understand data types
Metadata Storage: Creates an index of where data lives on disk
Lazy Operations: Builds query plans without executing them
Columnar Processing: Only reads columns you actually need
Predicate Pushdown: Applies filters before reading data
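A short sketch of these ideas in action with the seattle_csv object from earlier; because the source is a CSV, Arrow still scans the file, but only the requested columns and matching rows are materialized in memory:
# Lazy plan: select two columns, filter, aggregate - nothing runs yet
seattle_csv |>
  select(CheckoutYear, Checkouts) |> # columnar: only these columns are needed
  filter(CheckoutYear == 2021) |> # filter applied while scanning
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect() # the plan executes only here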
Think of Arrow like a smart restaurant:
Traditional R Restaurant:
Brings you the entire menu of food at once
You pick what you want and throw away the rest
Kitchen overwhelmed, customers wait, food wasted
Arrow Restaurant:
Shows you a menu (schema)
Takes your order (query plan)
Cooks only what you ordered (lazy evaluation)
Delivers exactly what you need (collect())
When You’ll Need These Skills:
Academic Research: Census data, genomics, climate models
Business Analytics: Customer transactions, web logs, sensor data
Public Policy: Government datasets, health records, economic indicators
Machine Learning: Training datasets often exceed memory limits