Week 5 - Arrow Part 1 - Key

Download data and understand the Arrow mindset

Author

Sriya Venkat

Arrow Part 1: Detective’s Guide to Big Data - First Contact with Your Dataset

Using READY + SCAN Frameworks with Arrow for Efficient EDA

Learning Objectives

By the end of this module, you will be able to:

  • Apply the READY framework to plan your big data investigation

  • Use the SCAN framework to systematically explore large datasets

  • Understand when to use Arrow vs traditional R for data exploration

  • Build your first “detective’s workflow” for any new dataset

  • Navigate the critical first 10 minutes with a massive dataset

🕵️ The Detective’s Dilemma: When Data is Too Big to “See”

Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!

Traditional R thinking:

# Load EVERYTHING into memory first   

data <- read_csv("huge_file.csv")  # 💥 Crash!   

data |>    
  filter(year == 2023)

Arrow thinking:

# Create a "view" of the data, then filter   

data <- open_dataset("huge_file.csv")  # ✅ Instant! 

data |> filter(year == 2023)|> 
  collect()  # Only brings filtered results to memory

🧠 Understanding Big Data: When Memory Becomes the Bottleneck

The Scale Problem in Data Science

When we say “big data,” we’re talking about datasets that challenge traditional approaches:

  • Your laptop typically has 8-16GB RAM

  • Operating system uses 2-4GB

  • Other programs use 1-2GB

  • Available for R: Maybe 4-8GB

  • Our dataset: 9GB = Memory overflow without Arrow!

Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.

Is your data < 100MB? → Use traditional R
Does your data approach or exceed the RAM you have free, with mostly single-table operations? → Use Arrow
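
To make that arithmetic concrete, here is a minimal sketch of a check you can run once the Seattle file is downloaded (Step 2 below); the rule of thumb in the comments is approximate, not an exact figure:

Code
# Compare the file's size on disk to the RAM you realistically have free
file_gb <- file.size("data/seattle-library-checkouts.csv") / 1024^3
file_gb  # ~8.6 GB for the Seattle data

# Rough rule of thumb: read_csv() often needs at least as much RAM as the
# file's size on disk, and frequently more, so ~8.6 GB of CSV easily
# overwhelms the 4-8 GB typically free on a laptop. That's the signal to
# reach for Arrow.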

🔍 Key Concept: Lazy Evaluation

Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect().

Think of it like:

  • Traditional R: “Cook the entire meal, then throw away what you don’t want”

  • Arrow: “Plan the meal, shop for only what you need, then cook just that”
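
As a minimal sketch of this idea (huge_file.csv and its year column are placeholders carried over from the examples above), note that nothing touches the disk until collect():

Code
library(arrow)
library(dplyr)

# Nothing is loaded yet: Arrow just records where the data lives and its schema
data <- open_dataset("huge_file.csv", format = "csv")

# Still nothing is executed: this only builds up a query plan
query <- data |>
  filter(year == 2023)

query               # prints the plan and schema, not the rows
query |> collect()  # only now does Arrow scan the file and return matching rows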

🚀 Arrow Basics (10 minutes)

Setting Up Our Big Data Playground - done in pre-workshop materials

We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!

Step 1: Create a Directory

First, let’s create a special folder to store our data (you may already have one; if you don’t, uncomment the dir.create() line and run the chunk).

Code
# Create a "data" directory if it doesn't exist already  

# Using showWarnings = FALSE to suppress warning if directory already exists    

#dir.create("data", showWarnings = FALSE)

Step 2: Download the Dataset

Now for the fun part! We’ll download the Seattle Library dataset: item checkouts from Seattle Public Libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.

⚠️ Important: This is a 9GB file, so:

  • Make sure you have enough disk space
  • Expect the download to take a while (resume = TRUE below lets you pick up an interrupted download where it left off)
Code
# Download Seattle library checkout dataset: 

# 1. Fetch data from AWS S3 bucket URL 
# 2. Save to local data directory 
# 3. Use resume = TRUE to allow continuing interrupted downloads  

curl::multi_download("https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv", "data/seattle-library-checkouts.csv", resume = TRUE )

Why use curl::multi_download()?

  • Shows a progress bar (great for tracking large downloads)
  • Can resume if interrupted (super helpful for big files!)
  • More reliable than base R download methods

Step 3: Verify the Download

After the download completes, let’s make sure everything worked:

Code
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3

file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
Code
file.size("data/seattle-library-checkouts.csv") / 1024^3  # Size in GB
[1] 8.579315

Before we move on, let’s install (if needed) and load the packages we’ll use.

Code
# Load the packages we need, installing any that are missing

# Packages required for this workshop
required_packages <- c("tidyverse", "arrow")

# Install any packages that aren't already installed
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the libraries
lapply(required_packages, library, character.only = TRUE)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'arrow'


The following object is masked from 'package:lubridate':

    duration


The following object is masked from 'package:utils':

    timestamp
[[1]]
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "arrow"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
[13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

🔬 The open_dataset() Magic

Now let’s see the fundamental difference between read_csv() and open_dataset():

Who thinks their computer could handle loading 9GB into memory?

DON’T RUN!

Code
# Traditional approach - would crash most computers!

#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv") # DON'T RUN! 

RUN

Code
# Arrow approach - creates a "view" without loading 
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")

# View Object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
Code
library(glue) # string interpolation - cleaner alternative to paste()

# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
Code
# Let's see the file size we're actually working with
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3)  # Convert to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 8.58 GB

What’s Actually Happening? 🤔

The magic of lazy loading - here’s what open_dataset() does:

  • Creates an Arrow dataset object that “points to” your CSV file

  • Doesn’t actually load the data into memory yet

  • Acts like a “view” or “window” into your data

  • When you run this code with open_dataset(), Arrow does something clever:

    1. It peeks at the first few thousand rows

    2. Figures out what kind of data is in each column

    3. Creates a roadmap of the data

    4. Then… it stops!

    That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:

    • Creates an index of where everything is

    • Only gets books (data) when you specifically ask for them

    • Keeps track of what’s where without moving everything
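
Here’s a minimal sketch of that “index without the books”, using the seattle_csv object we created above; everything below comes from Arrow’s metadata, not from a scan of the 9GB file:

Code
# The "catalog" Arrow built when we opened the dataset
seattle_csv$schema   # column names and the types detected from the first rows
names(seattle_csv)   # just the column names
class(seattle_csv)   # a FileSystemDataset: a pointer to the file, not the rows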

📊 SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

Code
# Peek at the structure without loading 
seattle_csv |> 
  glimpse()
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass      <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear     <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth    <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title           <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN              <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects        <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…
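
If you want to eyeball a few real values as part of your SCAN, a small sketch: head() stays lazy, and collect() brings only those rows into memory.

Code
# Peek at the first 10 rows; only these rows are materialized
seattle_csv |>
  head(10) |>
  collect()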

👉 Your Turn: Before going on, fill in the table below:

Variable Name Data Type (class) What do we notice from the output? Things to keep an eye on?
UsageClass String Physical/Digital are the two types
CheckoutType String OverDrive and Horizon are the two types
MaterialType String Book, Ebook, sounddisc, audiobook seem to be the four types
CheckoutYear int64 Chronological
CheckoutMonth int64 Chronological
Checkouts int64 The number of checkouts is pretty small, mostly less than 10
Title String Title followed by slash followed by author
ISBN null NA for all
Creator String Last name, First name, occasionally year
Subjects String A few phrases separated by commas indicating the general topic
Publisher String Publishing company
PublicationYear String cYear

📋 READY Framework - Your Strategic Case Planning

Apply BEFORE you start investigating data

Let’s work through the READY framework for our Seattle Library dataset:

R - Representative Data: Do we have what we need?

👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:

👉Your Investigation Questions:

# Add your R questions as comments: 

# example: Do we have all library branches or just some? 

# Do we have all different types of genres for books?
# What is the most common genre or topic of book

Detective Assessment: We’re about to download 9GB of Seattle Library checkout data - that’s 40+ million individual checkout records!

E - Executive Driven Questions: What stakeholders want to know?

👉 Your Investigation Questions:

# Add your E questions as comments: 

# example: Library Directors: "How do we optimize our collection?" 

# Librarians/Directors: What is the optimal time to switch out book collections?
# Customers: How often will the collection be updated?

Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”

A - Analytical Framework: What’s our exploration strategy?

Your Investigation Questions:

# Add your A questions as comments(Read through these): 

# 1. First contact: What IS this data? 
# 2. Data quality: Can we trust it?
# 3. Scope assessment: What can we investigate? 
# 4. Pattern hunting: What stories emerge? 
# 5. Stakeholder insights: What's actionable?

# analyze all missing values
# analyze values that look different from others or the general format
# What are questions we can answer using the given data

D - Data Best Practices: What quality unknowns should we check?

👉Your Investigation Questions:

# Add your group's D questions as comments: 

# EXAMPLE: Are there missing values in key fields?

# Are there any repetitions in the book title?
# For some of the publications, there are [] and years are written in a different format - how are those interpreted?

Y - Your Insights: What story might emerge?

👉Your Investigation Questions:

# Add your Y questions as comments: 
# Example: Usage patterns across time (seasonal, pandemic impact?) 

# Most checked out type of genre
# Most and least read authors

What Arrow Actually Does:

  1. Schema Detection: Reads just enough rows to understand data types

  2. Metadata Storage: Creates an index of where data lives on disk

  3. Lazy Operations: Builds query plans without executing them

  4. Columnar Processing: Only reads columns you actually need

  5. Predicate Pushdown: Applies filters before reading data
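
A minimal sketch that exercises points 3-5 in one pipeline (the 2021/MaterialType grouping is just an illustrative choice, not part of the workshop tasks):

Code
seattle_csv |>
  filter(CheckoutYear == 2021) |>        # predicate pushdown: rows filtered while scanning
  select(MaterialType, Checkouts) |>     # columnar: only these columns are read from disk
  group_by(MaterialType) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  arrange(desc(total_checkouts)) |>
  collect()                              # nothing above runs until this call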

The Lazy Evaluation Advantage

Think of Arrow like a smart restaurant:

Traditional R Restaurant:

  • Brings you every dish on the menu at once

  • You pick what you want and throw away the rest

  • Kitchen overwhelmed, customers wait, food wasted

Arrow Restaurant:

  • Shows you a menu (schema)

  • Takes your order (query plan)

  • Cooks only what you ordered (lazy evaluation)

  • Delivers exactly what you need (collect())
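
To put rough numbers on the restaurant analogy, a sketch (timings will vary by machine): placing the order is nearly instant, and the kitchen only starts cooking at collect().

Code
meal_order <- seattle_csv |>
  filter(CheckoutYear == 2022) |>
  summarise(total_checkouts = sum(Checkouts))

system.time(print(meal_order))    # taking the order: near-instant, nothing is read
system.time(collect(meal_order))  # cooking: Arrow scans the 9GB CSV now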

Real-World Big Data Scenarios

When You’ll Need These Skills:

  • Academic Research: Census data, genomics, climate models

  • Business Analytics: Customer transactions, web logs, sensor data

  • Public Policy: Government datasets, health records, economic indicators

  • Machine Learning: Training datasets often exceed memory limits