Week 5 - Arrow Part 1 - Key

Download data and understand the Arrow mindset

Author

Amy Schaap

Arrow Part 1: Detective’s Guide to Big Data - First Contact with Your Dataset

Using READY + SCAN Frameworks with Arrow for Efficient EDA

Learning Objectives

By the end of this module, you will be able to:

  • Apply the READY framework to plan your big data investigation

  • Use the SCAN framework to systematically explore large datasets

  • Understand when to use Arrow vs traditional R for data exploration

  • Build your first “detective’s workflow” for any new dataset

  • Navigate the critical first 10 minutes with a massive dataset

🕵️ The Detective’s Dilemma: When Data is Too Big to “See”

Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!

Traditional R thinking:

# Load EVERYTHING into memory first   

data <- read_csv("huge_file.csv")  # 💥 Crash!   

data |>    
  filter(year == 2023)

Arrow thinking:

# Create a "view" of the data, then filter   

data <- open_dataset("huge_file.csv", format = "csv")  # ✅ Instant!

data |> 
  filter(year == 2023) |> 
  collect()  # Only brings filtered results to memory

🧠 Understanding Big Data: When Memory Becomes the Bottleneck

The Scale Problem in Data Science

When we say “big data,” we’re talking about datasets that challenge traditional approaches:

  • Your laptop typically has 8-16GB RAM

  • Operating system uses 2-4GB

  • Other programs use 1-2GB

  • Available for R: Maybe 4-8GB

  • Our dataset: 9GB = Memory overflow without Arrow!

Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.

Is your data < 100MB? → Traditional R works fine.
Is your data too big to comfortably fit in memory (like our 9GB file) and mostly single-table operations? → Use Arrow.
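
To make the Detective Rule concrete, here is a quick back-of-envelope check (a minimal sketch; the RAM figure is just the rough estimate from the list above, so adjust it for your own machine):

available_for_r_gb <- 8   # upper end of the "maybe 4-8GB" estimate above
dataset_gb <- 9           # approximate size of the Seattle checkouts CSV on disk

dataset_gb > available_for_r_gb  # TRUE: the file alone exceeds the memory available to R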

🔍 Key Concept: Lazy Evaluation

Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect().

Think of it like:

  • Traditional R: “Cook the entire meal, then throw away what you don’t want”

  • Arrow: “Plan the meal, shop for only what you need, then cook just that”
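
Here is a minimal sketch of what lazy evaluation looks like in code, using the Seattle file we download later in this module (the object name seattle and the chosen columns are just for illustration; nothing is read from disk until the final collect()):

library(arrow)
library(dplyr)

seattle <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")

query <- seattle |> 
  filter(CheckoutYear == 2021) |>   # recorded in the query plan, not executed yet
  select(Title, Checkouts)          # still no data read from disk

query |> 
  collect()                         # only now does Arrow read rows and return a tibble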

🚀 Arrow Basics (10 minutes)

Setting Up Our Big Data Playground - done in pre-workshop materials

We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!

Step 1: Create a Directory

First, let’s create a special folder to store our data (you may already have one; if you don’t, run the chunk below to create it).

Code
# Create a "data" directory if it doesn't exist already  

# Using showWarnings = FALSE to suppress warning if directory already exists    

dir.create("data", showWarnings = FALSE)

Step 2: Download the Dataset

Now for the fun part! We’ll download the Seattle Library dataset: item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.

⚠️ Important: This is a 9GB file, so:

  • Make sure you have enough disk space
Code
# Download Seattle library checkout dataset: 

# 1. Fetch data from AWS S3 bucket URL 
# 2. Save to local data directory 
# 3. Use resume = TRUE to allow continuing interrupted downloads  

curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)

Why use curl::multi_download()?

  • Shows a progress bar (great for tracking large downloads)
  • Can resume if interrupted (super helpful for big files!)
  • More reliable than base R download methods

Step 3: Verify the Download

After the download completes, let’s make sure everything worked:

Code
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3

file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
Code
file.size("data/seattle-library-checkouts.csv") / 1024^3  # Size in GB
[1] 8.579315

Before we move on, let’s install and load the packages we need.

Code
# Load the packages we need, installing any that are missing

# Required packages
required_packages <- c("tidyverse", "arrow")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

#Load libraries
lapply(required_packages, library, character.only = TRUE)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'arrow' was built under R version 4.3.3

Attaching package: 'arrow'

The following object is masked from 'package:lubridate':

    duration

The following object is masked from 'package:utils':

    timestamp
[[1]]
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "arrow"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
[13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

🔬 The open_dataset() Magic

Now let’s see the fundamental difference between read_csv() and open_dataset():

Who thinks their computer could handle loading 9GB into memory?

DON’T RUN!

Code
# Traditional approach - would crash most computers!

#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv") # DON'T RUN! 

RUN

Code
# Arrow approach - creates a "view" without loading 
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")

# View Object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
Code
library(glue) # string interpolation - cleaner alternative to paste()

# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
Code
# Let's see what file size we're actually working with 
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3)  # Convert to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 8.58 GB

What’s Actually Happening? 🤔

The Magic of Lazy Loading - what open_dataset() does:

  • Creates an Arrow dataset object that “points to” your CSV file

  • Doesn’t actually load the data into memory yet

  • Acts like a “view” or “window” into your data

  • When you run this code with open_dataset(), Arrow does something clever:

    1. It peeks at the first few thousand rows

    2. Figures out what kind of data is in each column

    3. Creates a roadmap of the data

    4. Then… it stops!

    That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:

    • Creates an index of where everything is

    • Only gets books (data) when you specifically ask for them

    • Keeps track of what’s where without moving everything
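
You can ask this librarian for its index directly - a minimal sketch, assuming the seattle_csv object created above (both calls print stored metadata; neither one reads the 9GB of rows):

seattle_csv$schema   # the "roadmap": column names and the types Arrow guessed

names(seattle_csv)   # just the column names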

📊 SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

Code
# Peek at the structure without loading 
seattle_csv |> 
  glimpse()
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass      <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear     <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth    <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title           <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN              <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects        <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…

👉 Your Turn: Before going on, fill in the table below:

Variable Name   | Data Type (class) | What do we notice from the output? | Things to keep an eye on?
UsageClass      |                   |                                    |
CheckoutType    |                   |                                    |
MaterialType    |                   |                                    |
CheckoutYear    |                   |                                    |
CheckoutMonth   |                   |                                    |
Checkouts       |                   |                                    |
Title           |                   |                                    |
ISBN            |                   |                                    |
Creator         |                   |                                    |
Subjects        |                   |                                    |
Publisher       |                   |                                    |
PublicationYear |                   |                                    |

📋 READY Framework - Your Strategic Case Planning

Apply BEFORE you start investigating data

Let’s work through the READY framework for our Seattle Library dataset:

R - Representative Data: Do we have what we need?

👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:

👉Your Investigation Questions:

# Add your R questions as comments: 

# example: Do we have all library branches or just some? 

# Are digital and physical materials both represented?
# Does the dataset include checkouts across all years of operation or only recent years?

Detective Assessment: We’re working with 9GB of Seattle Library checkout data - that’s 40+ million individual checkout records!

E - Executive Driven Questions: What stakeholders want to know?

👉 Your Investigation Questions:

# Add your E questions as comments: 

# example: Library Directors: "How do we optimize our collection?" 

# How do library usage trends reflect community engagement?
# Where should resources be allocated - digital vs. physical?

Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”

A - Analytical Framework: What’s our exploration strategy?

Your Investigation Questions:

# Add your A questions as comments (read through these): 

# 1. First contact: What IS this data? 
# 2. Data quality: Can we trust it?
# 3. Scope assessment: What can we investigate? 
# 4. Pattern hunting: What stories emerge? 
# 5. Stakeholder insights: What's actionable?

D - Data Best Practices: What quality unknowns should we check?

👉Your Investigation Questions:

# Add your group's D questions as comments: 

# EXAMPLE: Are there missing values in key fields?

# Do duplicate records exist (same book, same time, same user)?
# Are dates and times formatted consistently?

Y - Your Insights: What story might emerge?

👉Your Investigation Questions:

# Add your Y questions as comments: 
# Example: Usage patterns across time (seasonal, pandemic impact?) 

# Are digital checkouts rising compared to physical?
# Which authors or subjects show sudden spikes in popularity?

What Arrow Actually Does:

  1. Schema Detection: Reads just enough rows to understand data types

  2. Metadata Storage: Creates an index of where data lives on disk

  3. Lazy Operations: Builds query plans without executing them

  4. Columnar Processing: Only reads columns you actually need

  5. Predicate Pushdown: Applies filters before reading data
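
Here is what steps 3-5 look like in practice - a minimal sketch using the seattle_csv object from earlier (the specific columns and filter are just for illustration): only the columns named in the query are read, the filter is applied as the file is scanned, and nothing runs until collect().

seattle_csv |> 
  filter(CheckoutYear == 2020, MaterialType == "EBOOK") |>  # predicate pushdown
  group_by(CheckoutMonth) |> 
  summarise(TotalCheckouts = sum(Checkouts)) |>             # still just a query plan
  arrange(CheckoutMonth) |> 
  collect()                            # execute; only the small monthly summary lands in memory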

The Lazy Evaluation Advantage

Think of Arrow like a smart restaurant:

Traditional R Restaurant:

  • Cooks every dish on the menu and brings it all to you at once

  • You pick what you want and throw away the rest

  • Kitchen overwhelmed, customers wait, food wasted

Arrow Restaurant:

  • Shows you a menu (schema)

  • Takes your order (query plan)

  • Cooks only what you ordered (lazy evaluation)

  • Delivers exactly what you need (collect())

Real-World Big Data Scenarios

When You’ll Need These Skills:

  • Academic Research: Census data, genomics, climate models

  • Business Analytics: Customer transactions, web logs, sensor data

  • Public Policy: Government datasets, health records, economic indicators

  • Machine Learning: Training datasets often exceed memory limits