Download data and understand the Arrow mindset
Using READY + SCAN Frameworks with Arrow for Efficient EDA
By the end of this module, you will be able to:
Apply the READY framework to plan your big data investigation
Use the SCAN framework to systematically explore large datasets
Understand when to use Arrow vs traditional R for data exploration
Build your first “detective’s workflow” for any new dataset
Navigate the critical first 10 minutes with a massive dataset
Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!
Traditional R thinking:
# Load EVERYTHING into memory first
data <- read_csv("huge_file.csv") # 💥 Crash!
data |>
filter(year == 2023)
Arrow thinking:
# Create a "view" of the data, then filter
data <- open_dataset("huge_file.csv") # ✅ Instant!
data |>
  filter(year == 2023) |>
  collect() # Only brings filtered results to memory
When we say “big data,” we’re talking about datasets that challenge traditional approaches:
Your laptop typically has 8-16GB RAM
Operating system uses 2-4GB
Other programs use 1-2GB
Available for R: Maybe 4-8GB
Our dataset: 9GB = Memory overflow without Arrow!
Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.
Is your data < 100MB? → Use traditional R
Is your data < 8GB and mostly single-table operations? → Use Arrow
Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect().
Think of it like:
Traditional R: “Cook the entire meal, then throw away what you don’t want”
Arrow: “Plan the meal, shop for only what you need, then cook just that”
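Here’s a minimal sketch of what that looks like in code, reusing the hypothetical huge_file.csv from the example above (nothing touches the real data until collect()):
# Build a lazy query - Arrow records the steps but reads nothing yet
library(arrow)
library(dplyr)

data <- open_dataset("huge_file.csv", format = "csv")

query <- data |>
  filter(year == 2023) |>   # added to the query plan, not executed
  summarise(n = n())        # still not executed

query |>
  collect()                 # only now does Arrow scan the file and compute the result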
We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!
First, let’s create a special folder to store our data. (You may already have one; if you don’t, run the dir.create() chunk below.)
# Create a "data" directory if it doesn't exist already
# Using showWarnings = FALSE to suppress warning if directory already exists
dir.create("data", showWarnings = FALSE)
Now for the fun part! We’ll download the Seattle Library dataset: item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.
⚠️ Important: This is a 9GB file, so the download will take a while and requires a stable connection and enough free disk space:
# Download Seattle library checkout dataset:
# 1. Fetch data from AWS S3 bucket URL
# 2. Save to local data directory
# 3. Use resume = TRUE to allow continuing interrupted downloads
curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
Why use curl::multi_download()? It is built for large files: it shows a progress bar and, with resume = TRUE, can pick up an interrupted download where it left off instead of starting over.
After the download completes, let’s make sure everything worked:
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3
file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
file.size("data/seattle-library-checkouts.csv") / 1024^3 # Size in GB
[1] 2.433585
# Load the packages, installing any that are missing
required_packages <- c("tidyverse", "arrow")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries
lapply(required_packages, library, character.only = TRUE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'arrow'
The following object is masked from 'package:lubridate':
duration
The following object is masked from 'package:utils':
timestamp
[[1]]
[1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
[13] "grDevices" "utils" "datasets" "methods" "base"
[[2]]
[1] "arrow" "lubridate" "forcats" "stringr" "dplyr" "purrr"
[7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
[13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
open_dataset() Magic
Now let’s see the fundamental difference between read_csv() and open_dataset():
DON’T RUN!
# Traditional approach - would crash most computers!
#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv") # DON'T RUN!
RUN
# Arrow approach - creates a "view" without loading
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")

# View the object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
library(glue) # string interpolation - cleaner alternative to paste()
# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
# Let's see the size of the file we are actually working with
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3) # Convert bytes to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 2.43 GB
Here’s what open_dataset() does:
Creates an Arrow dataset object that “points to” your CSV file
Doesn’t actually load the data into memory yet
Acts like a “view” or “window” into your data
When you run this code with open_dataset(), Arrow does something clever:
It peeks at the first few thousand rows
Figures out what kind of data is in each column
Creates a roadmap of the data
Then… it stops!
That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:
Creates an index of where everything is
Only gets books (data) when you specifically ask for them
Keeps track of what’s where without moving everything
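You can peek at that “index” yourself. The sketch below (assuming the seattle_csv object created above) inspects only metadata, so no data is read from disk:
# Inspect the metadata Arrow collected when it peeked at the file
seattle_csv$schema   # column names plus the types Arrow inferred
names(seattle_csv)   # just the column names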
Apply WHILE you’re exploring data
# Peek at the structure without loading everything into memory
seattle_csv |>
  glimpse()
👉 Your Turn: Before going on, fill in the table below:
Variable Name | Data Type (class) | What do we notice from the output? Things to keep an eye on? |
---|---|---|
UsageClass | string | Ensure consistency |
CheckoutType | string | Ensure consistency |
MaterialType | string | Ensure consistency |
CheckoutYear | int64 | Ensure all years are valid |
CheckoutMonth | int64 | Ensure all months are valid (1-12) |
Checkouts | int64 | Counts should be positive; watch for extreme outliers |
Title | string | Ensure consistency |
ISBN | null | Are they optional or missing entirely? |
Creator | string | Ensure consistency |
Subjects | string | Ensure consistency |
Publisher | string | Ensure consistency |
PublicationYear | string | Stored as a string, not a number; ensure consistency |
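To verify some of the “keep an eye on” notes in the table, here is a hedged sketch of the kinds of Arrow queries you could run (it assumes the seattle_csv object from earlier; each collect() returns only a tiny summary table):
# Are the usage classes spelled consistently?
seattle_csv |>
  count(UsageClass) |>
  collect()

# Are all checkout months in the valid 1-12 range?
seattle_csv |>
  summarise(
    min_month = min(CheckoutMonth),
    max_month = max(CheckoutMonth)
  ) |>
  collect()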
Apply BEFORE you start investigating data
Let’s work through the READY framework for our Seattle Library dataset:
👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:
👉Your Investigation Questions:
# Add your R questions as comments:
# example: Do we have all library branches or just some?
# Are there records for all types of items (books, DVDs, e-books, etc.)?
# Does the dataset include both in-library checkouts and online checkouts (if applicable)?
Detective Assessment: We’re about to download 9GB of Seattle Library checkout data - that’s 40+ million individual checkout records!
👉 Your Investigation Questions:
# Add your E questions as comments:
# example: Library Directors: "How do we optimize our collection?"
# What are the most frequently checked-out items and categories?
# How can we improve services based on checkout frequency and trends?
Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”
👉 Your Investigation Questions:
# Add your A questions as comments(Read through these):
# 1. First contact: What IS this data?
# What fields are included in the dataset
# 2. Data quality: Can we trust it?
# Are there missing or duplicate values?
# 3. Scope assessment: What can we investigate?
# Are there enough records to represent long-term trends?
# 4. Pattern hunting: What stories emerge?
# What patterns can we find in checkouts?
# 5. Stakeholder insights: What's actionable?
# What actionable insights can we derive to guide library programming or collection expansion?
👉Your Investigation Questions:
# Add your group's D questions as comments:
# EXAMPLE: Are there missing values in key fields?
# Are there outliers?
👉Your Investigation Questions:
# Add your Y questions as comments:
# Example: Usage patterns across time (seasonal, pandemic impact?)
# Do certain geographic areas check out different types of materials?
# What are the most popular genres or types of items?
What Arrow Actually Does:
Schema Detection: Reads just enough rows to understand data types
Metadata Storage: Creates an index of where data lives on disk
Lazy Operations: Builds query plans without executing them
Columnar Processing: Only reads columns you actually need
Predicate Pushdown: Applies filters before reading data
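To make those ideas concrete, here is a small sketch using the seattle_csv object from earlier (the column names come from the schema we printed above):
# Lazy operations: this only builds a query plan
query <- seattle_csv |>
  select(CheckoutYear, Checkouts) |>   # columnar processing: only these two columns get read
  filter(CheckoutYear == 2021)         # predicate pushdown: the filter is applied at scan time

query   # printing shows the plan - nothing has been computed yet

# Execution happens only when you call collect()
query |>
  summarise(total_checkouts = sum(Checkouts)) |>
  collect()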
Think of Arrow like a smart restaurant:
Traditional R Restaurant:
Brings you the entire menu of food at once
You pick what you want and throw away the rest
Kitchen overwhelmed, customers wait, food wasted
Arrow Restaurant:
Shows you a menu (schema)
Takes your order (query plan)
Cooks only what you ordered (lazy evaluation)
Delivers exactly what you need (collect())
When You’ll Need These Skills:
Academic Research: Census data, genomics, climate models
Business Analytics: Customer transactions, web logs, sensor data
Public Policy: Government datasets, health records, economic indicators
Machine Learning: Training datasets often exceed memory limits