Week 5 - Arrow Part 1 - Key

Download data and understand the Arrow mindset

Author

Emma McCue

Arrow Part 1: Detective’s Guide to Big Data - First Contact with Your Dataset

Using READY + SCAN Frameworks with Arrow for Efficient EDA

Learning Objectives

By the end of this module, you will be able to:

  • Apply the READY framework to plan your big data investigation

  • Use the SCAN framework to systematically explore large datasets

  • Understand when to use Arrow vs traditional R for data exploration

  • Build your first “detective’s workflow” for any new dataset

  • Navigate the critical first 10 minutes with a massive dataset

🕵️ The Detective’s Dilemma: When Data is Too Big to “See”

Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!

Traditional R thinking:

# Load EVERYTHING into memory first   

data <- read_csv("huge_file.csv")  # 💥 Crash!   

data |>    
  filter(year == 2023)

Arrow thinking:

# Create a "view" of the data, then filter   

data <- open_dataset("huge_file.csv", format = "csv")  # ✅ Instant! 

data |> 
  filter(year == 2023) |> 
  collect()  # Only brings the filtered results into memory

🧠 Understanding Big Data: When Memory Becomes the Bottleneck

The Scale Problem in Data Science

When we say “big data,” we’re talking about datasets that challenge traditional approaches:

  • Your laptop typically has 8-16GB RAM

  • Operating system uses 2-4GB

  • Other programs use 1-2GB

  • Available for R: Maybe 4-8GB

  • Our dataset: 9GB = Memory overflow without Arrow!

Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.

  • Is your data < 100 MB? → Use traditional R.

  • Is your data in the multi-GB range (or larger than your available RAM) and mostly single-table operations? → Use Arrow.

🔍 Key Concept: Lazy Evaluation

Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect().

Think of it like:

  • Traditional R: “Cook the entire meal, then throw away what you don’t want”

  • Arrow: “Plan the meal, shop for only what you need, then cook just that”
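
Here's what that looks like in code - a minimal sketch using the hypothetical huge_file.csv from the example above. Nothing is executed until collect():

library(arrow)
library(dplyr)

# A "view" of the (hypothetical) file - no rows are read yet
data <- open_dataset("huge_file.csv", format = "csv")

# Build up a query plan - still nothing is executed
plan <- data |>
  filter(year == 2023) |>
  summarise(n_rows = n())

plan                           # printing shows the query plan, not the rows

result <- plan |> collect()    # execution happens here; only the summary enters memory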

🚀 Arrow Basics (10 minutes)

Setting Up Our Big Data Playground - done in pre-workshop materials

We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!

Step 1: Create a Directory

First, let’s create a special folder to store our data (you may already have one; if you don’t, run the dir.create() chunk below).

Code
# Create a "data" directory if it doesn't exist already  

# Using showWarnings = FALSE to suppress warning if directory already exists    

dir.create("data", showWarnings = FALSE)

Step 2: Download the Dataset

Now for the fun part! We’ll download the Seattle Library dataset: item-level checkout records from the Seattle Public Library, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.

⚠️ Important: This is a 9GB file, so:

  • Make sure you have enough disk space
Code
# Download Seattle library checkout dataset: 

# 1. Fetch data from AWS S3 bucket URL 
# 2. Save to local data directory 
# 3. Use resume = TRUE to allow continuing interrupted downloads  

curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)

Why use curl::multi_download()?

  • Shows a progress bar (great for tracking large downloads)
  • Can resume if interrupted (super helpful for big files!)
  • More reliable than base R download methods

Step 3: Verify the Download

After the download completes, let’s make sure everything worked:

Code
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3

file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
Code
file.size("data/seattle-library-checkouts.csv") / 1024^3  # Size in GB
[1] 2.433585

Before we move on, let’s install (if needed) and load the packages we’ll use.

Code
# Load the packages we need, installing any that are missing

# Packages required for this workshop
required_packages <- c("tidyverse", "arrow")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the libraries
lapply(required_packages, library, character.only = TRUE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'arrow'


The following object is masked from 'package:lubridate':

    duration


The following object is masked from 'package:utils':

    timestamp
[[1]]
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "arrow"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
[13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

🔬 The open_dataset() Magic

Now let’s see the fundamental difference between read_csv() and open_dataset():

Who thinks their computer could handle loading 9GB into memory?

DON’T RUN!

Code
# Traditional approach - would crash most computers!

#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv") # DON'T RUN! 

RUN

Code
# Arrow approach - creates a "view" without loading 
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")

# View Object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
Code
library(glue) # string interpolation - cleaner alternative to paste()

# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
Code
# Let's see what the file size we are actually working with 
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3)  # Convert to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 2.43 GB

What’s Actually Happening? 🤔

The Magic of Lazy Loading and what open_dataset() does:

  • Creates an Arrow dataset object that “points to” your CSV file

  • Doesn’t actually load the data into memory yet

  • Acts like a “view” or “window” into your data

  • When you run this code with open_dataset(), Arrow does something clever:

    1. It peeks at the first few thousand rows

    2. Figures out what kind of data is in each column

    3. Creates a roadmap of the data

    4. Then… it stops!

    That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:

    • Creates an index of where everything is

    • Only gets books (data) when you specifically ask for them

    • Keeps track of what’s where without moving everything
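
If you want to see the “index” the librarian built, you can inspect the dataset object itself. A small sketch (the exact fields available may vary slightly by arrow version):

Code
# Inspect the metadata Arrow gathered - no data is read
seattle_csv$schema   # the column names and types Arrow inferred
seattle_csv$files    # the file(s) on disk this dataset points to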

📊 SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

Code
# Peek at the structure without loading the data
# seattle_csv |>
#   glimpse()

👉 Your Turn: Before going on, fill in the table below:

Variable Name      Data Type (class)   What do we notice / things to keep an eye on?
UsageClass         string              Categorical - ensure consistent values
CheckoutType       string              Categorical - ensure consistent values
MaterialType       string              Categorical - ensure consistent values
CheckoutYear       int64               Ensure all years are valid
CheckoutMonth      int64               Ensure all months are valid (1-12)
Checkouts          int64               Counts - watch for outliers or impossible values
Title              string              Free text - ensure consistency
ISBN               null                Inferred as null - are ISBNs optional or missing entirely?
Creator            string              Free text - ensure consistency
Subjects           string              Free text - ensure consistency
Publisher          string              Free text - ensure consistency
PublicationYear    string              Stored as a string, not a number - may need cleaning
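
One way to start checking the notes above is to summarise a single column. Because Arrow is lazy, only the small summary table comes back into memory (a sketch; the CSV still has to be scanned once, so expect it to take a moment):

Code
# Count the distinct UsageClass values - only the tiny summary is collected
seattle_csv |> 
  count(UsageClass) |> 
  collect()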

📋 READY Framework - Your Strategic Case Planning

Apply BEFORE you start investigating data

Let’s work through the READY framework for our Seattle Library dataset:

R - Representative Data: Do we have what we need?

👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:

👉Your Investigation Questions:

# Add your R questions as comments: 

# example: Do we have all library branches or just some? 

# Are there records for all types of items (books, DVDs, e-books, etc.)?
# Does the dataset include both in-library checkouts and online checkouts (if applicable)? 

Detective Assessment: We’re working with 9GB of Seattle Library checkout data - that’s 40+ million individual checkout records!
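
If you’d like to verify that scale yourself, you can ask Arrow to count the rows; only the single-row result is brought into memory (a sketch - scanning the whole CSV takes a little while):

Code
# Count the total number of checkout records without loading them
seattle_csv |> 
  summarise(total_records = n()) |> 
  collect()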

E - Executive-Driven Questions: What do stakeholders want to know?

👉 Your Investigation Questions:

# Add your E questions as comments: 

# example: Library Directors: "How do we optimize our collection?" 

# What are the most frequently checked-out items and categories?
# How can we improve services based on checkout frequency and trends?

Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”

A - Analytical Framework: What’s our exploration strategy?

Your Investigation Questions:

# Add your A questions as comments (read through these): 

# 1. First contact: What IS this data? 
# What fields are included in the dataset?
# 2. Data quality: Can we trust it?
# Are there missing or duplicate values?
# 3. Scope assessment: What can we investigate?
# Are there enough records to represent long-term trends?
# 4. Pattern hunting: What stories emerge?
# What patterns can we find in checkouts?
# 5. Stakeholder insights: What's actionable?
# What actionable insights can we derive to guide library programming or collection expansion?

D - Data Best Practices: What quality unknowns should we check?

👉Your Investigation Questions:

# Add your group's D questions as comments: 

# EXAMPLE: Are there missing values in key fields?

# Are there missing values in key fields?
# Are there outliers?

Y - Your Insights: What story might emerge?

👉Your Investigation Questions:

# Add your Y questions as comments: 
# Example: Usage patterns across time (seasonal, pandemic impact?) 

# Do certain geographic areas check out different types of materials?
# What are the most popular genres or types of items?

What Arrow Actually Does:

  1. Schema Detection: Reads just enough rows to understand data types

  2. Metadata Storage: Creates an index of where data lives on disk

  3. Lazy Operations: Builds query plans without executing them

  4. Columnar Processing: Only reads columns you actually need

  5. Predicate Pushdown: Applies filters before reading data
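
Here is a small sketch of points 3-5 in action. (With a CSV source the column-reading savings are smaller than with Parquet, but the filter and aggregation still happen before anything lands in your R session; the year used here is just an example.)

Code
# Only the columns referenced below are needed, and the filter is part of
# the query plan rather than being applied after loading everything
seattle_csv |> 
  filter(CheckoutYear == 2021) |> 
  group_by(MaterialType) |> 
  summarise(total_checkouts = sum(Checkouts)) |> 
  collect()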

The Lazy Evaluation Advantage

Think of Arrow like a smart restaurant:

Traditional R Restaurant:

  • Brings you the entire menu of food at once

  • You pick what you want and throw away the rest

  • Kitchen overwhelmed, customers wait, food wasted

Arrow Restaurant:

  • Shows you a menu (schema)

  • Takes your order (query plan)

  • Cooks only what you ordered (lazy evaluation)

  • Delivers exactly what you need (collect())

Real-World Big Data Scenarios

When You’ll Need These Skills:

  • Academic Research: Census data, genomics, climate models

  • Business Analytics: Customer transactions, web logs, sensor data

  • Public Policy: Government datasets, health records, economic indicators

  • Machine Learning: Training datasets often exceed memory limits