Week 5 - ARROW-part 1

Download data and understand the Arrow mindset

Author

Luis Tapia

Class: DSA 406-001

Arrow Part 1: Detective’s Guide to Big Data - First Contact with Your Dataset

Using READY + SCAN Frameworks with Arrow for Efficient EDA

Learning Objectives

By the end of this module, you will be able to:

  • Apply the READY framework to plan your big data investigation

  • Use the SCAN framework to systematically explore large datasets

  • Understand when to use Arrow vs traditional R for data exploration

  • Build your first “detective’s workflow” for any new dataset

  • Navigate the critical first 10 minutes with a massive dataset

🕵️ The Detective’s Dilemma: When Data is Too Big to “See”

Imagine you’re a detective who just received 40 million pieces of evidence. You can’t spread them all on your desk - your desk (computer memory) isn’t big enough!

Traditional R thinking:

# Load EVERYTHING into memory first
data <- read_csv("huge_file.csv")  # 💥 Crash!

data |>
  filter(year == 2023)

Arrow thinking:

# Create a "view" of the data, then filter
data <- open_dataset("huge_file.csv")  # ✅ Instant!

data |>
  filter(year == 2023) |>
  collect()  # Only brings filtered results to memory

🧠 Understanding Big Data: When Memory Becomes the Bottleneck

The Scale Problem in Data Science

When we say “big data,” we’re talking about datasets that challenge traditional approaches:

  • Your laptop typically has 8-16GB RAM

  • Operating system uses 2-4GB

  • Other programs use 1-2GB

  • Available for R: Maybe 4-8GB

  • Our dataset: 9GB = Memory overflow without Arrow!

Detective Rule: When your dataset approaches your available RAM, Arrow becomes essential for investigation.

Is your data < 100 MB? → Use traditional R. Is your data larger than your available RAM (like our 9 GB file) and mostly single-table operations? → Use Arrow.
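
Here is a minimal sketch of that rule of thumb in code. The helper name and thresholds are illustrative (not from any package), and you supply your own estimate of the RAM available to R:

Code
# Hypothetical helper that applies the Detective Rule
choose_reader <- function(path, available_ram_gb = 8) {
  size_gb <- file.size(path) / 1024^3
  if (is.na(size_gb)) return("File not found - check the path")
  if (size_gb < 0.1) {
    "Small file: read_csv() can load it fully into memory"
  } else if (size_gb < available_ram_gb) {
    "Medium file: read_csv() may work, but keep an eye on memory"
  } else {
    "Bigger than available RAM: use open_dataset() and let Arrow scan it on disk"
  }
}

choose_reader("data/seattle-library-checkouts.csv")  # once downloaded below, ~8.6 GB -> open_dataset()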

🔍 Key Concept: Lazy Evaluation

Arrow uses lazy evaluation - it builds up a query plan without actually executing it until you call collect().

Think of it like:

  • Traditional R: “Cook the entire meal, then throw away what you don’t want”

  • Arrow: “Plan the meal, shop for only what you need, then cook just that”
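
Here is a minimal sketch of lazy evaluation in action, reusing the hypothetical huge_file.csv and its year/month columns from the example above:

Code
library(arrow)
library(dplyr)

# Hypothetical file and columns, mirroring the intro example
data <- open_dataset("huge_file.csv", format = "csv")

# Each step below only adds to the query plan - nothing is read from disk yet
query <- data |>
  filter(year == 2023) |>
  group_by(month) |>
  summarise(checkout_count = n())

query               # prints the pending query; still no rows in memory
query |> collect()  # NOW Arrow executes the plan and returns a tibble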

🚀 Arrow Basics (10 minutes)

Setting Up Our Big Data Playground - done in pre-workshop materials

We’re working with the real Seattle Library dataset - over 40 million rows of checkout data!

Step 1: Create a Directory

First, let’s create a special folder to store our data (you may already have one; if not, run the dir.create() chunk below).

Code
# create a "data" directory if it doesn't exist already  

# Using showWarnings = FALSE to suppress warning if directory already exists    

dir.create("data", showWarnings = FALSE)

Step 2: Download the Dataset

Now for the fun part! We’ll download the Seattle Library dataset: item-level checkout records from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.

⚠️ Important: This is a 9GB file, so:

  • Make sure you have enough disk space

file_path <- "data/seattle-library-checkouts.csv"

# Download the file
curl::multi_download(
  urls = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  destfiles = file_path,
  resume = TRUE
)

Code
# Download Seattle library checkout dataset: 

# Instructions:
# 1. get data from AWS S3 URL 
# 2. save to local drive
# 3. use resume = TRUE to allow continuing downloads  

# IMPORTANT: the download call below is commented out to avoid re-downloading the dataset

#curl::multi_download("https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv", "data/seattle-library-checkouts.csv", resume = TRUE )

Why use curl::multi_download()?

  • Shows a progress bar (great for tracking large downloads)
  • Can resume if interrupted (super helpful for big files!)
  • More reliable than base R download methods

Step 3: Verify the Download

After the download completes, let’s make sure everything worked:

Code
# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3

file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
Code
file.size("data/seattle-library-checkouts.csv") / 1024^3  # Size in GB
[1] 8.579315

Before we move on, let’s load the packages we need, installing them first if necessary.

Code
# Load the required packages, installing any that are missing

# Required packages
required_packages <- c("tidyverse", "arrow")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries
lapply(required_packages, library, character.only = TRUE)
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'tibble' was built under R version 4.5.1
Warning: package 'tidyr' was built under R version 4.5.1
Warning: package 'readr' was built under R version 4.5.1
Warning: package 'purrr' was built under R version 4.5.1
Warning: package 'dplyr' was built under R version 4.5.1
Warning: package 'stringr' was built under R version 4.5.1
Warning: package 'forcats' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'arrow' was built under R version 4.5.1

Attaching package: 'arrow'

The following object is masked from 'package:lubridate':

    duration

The following object is masked from 'package:utils':

    timestamp
[[1]]
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
 [1] "arrow"     "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
 [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
[13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

🔬 The open_dataset() Magic

Now let’s see the fundamental difference between read_csv() and open_dataset():

Who thinks their computer could handle loading 9GB into memory?

DON’T RUN!

Code
# regular approach - would crash most computers!
# DON'T RUN!!
#seattle_library_checkouts <- read_csv("data/seattle-library-checkouts.csv")

RUN

Code
# Arrow approach - creates a "view" without loading 
seattle_csv <- open_dataset("data/seattle-library-checkouts.csv", format = "csv")

# View Object
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: null
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
Code
library(glue) # string interpolation - cleaner alternative to paste()
Warning: package 'glue' was built under R version 4.5.1
Code
# Check out how much memory this is using.
glue("Memory used by Arrow object: {format(object.size(seattle_csv), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
Code
# Let's see what the file size we are actually working with 
file_size_bytes <- file.size("data/seattle-library-checkouts.csv")
file_size_gb <- file_size_bytes / (1024^3)  # Convert to GB
glue("Estimated file size: {round(file_size_gb, 2)} GB")
Estimated file size: 8.58 GB

What’s Actually Happening? 🤔

The Magic of Lazy Loading: what open_dataset() actually does:

  • Creates an Arrow dataset object that “points to” your CSV file

  • Doesn’t actually load the data into memory yet

  • Acts like a “view” or “window” into your data

  • When you run this code with open_dataset(), Arrow does something clever:

    1. It peeks at the first few thousand rows

    2. Figures out what kind of data is in each column

    3. Creates a roadmap of the data

    4. Then… it stops!

    That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:

    • Creates an index of where everything is

    • Only gets books (data) when you specifically ask for them

    • Keeps track of what’s where without moving everything
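
You can watch this “efficient librarian” at work directly. A minimal sketch, assuming the seattle_csv object created above:

Code
# The "roadmap": column names and types, inferred from a small sample of rows
seattle_csv$schema

# Building a request is also instant - Arrow just records what you asked for
plan <- seattle_csv |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  select(Title, Checkouts)

plan                 # still a lazy query; no checkout data in memory yet
# plan |> collect()  # un-comment to actually pull the filtered rows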

📊 SCAN Framework - Your Field Investigation Guide

Apply WHILE you’re exploring data

Code
# Peek at the structure without loading 
seattle_csv |> 
  glimpse()
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass      <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear     <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth    <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title           <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN              <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects        <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…

👉 Your Turn: Before going on, fill in the table below:

Variable Name | Data Type (class) | What do we notice from the output? Things to keep an eye on?
UsageClass | String | Describes whether the item is “Physical” or “Digital”.
CheckoutType | String | The checkout system used to access the item, such as “Horizon” or “OverDrive”.
MaterialType | String | The format of the item being checked out, including “BOOK”, “EBOOK”, and “SOUNDDISC”.
CheckoutYear | Integer (int64) | The year of checkout as a numeric value.
CheckoutMonth | Integer (int64) | The month of checkout, stored as a number rather than a month name.
Checkouts | Integer (int64) | Ranges from roughly 1-25 in the glimpse; appears to be the number of times the item was checked out.
Title | String | The name of each reading/book.
ISBN | Null (NA) | ISBNs should be 10- to 13-digit identifiers, but this column contains only NULL/NA values; worth troubleshooting, or deciding whether the field is still needed.
Creator | String | Appears to be the author of the book/reading; some entries also contain dates.
Subjects | String | The subject or genre of the reading.
Publisher | String | The publishing group or company for each reading.
PublicationYear | String | The year of publication stored as text (e.g., “c2011.”), not numeric as expected.
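
A couple of these observations can be spot-checked without loading the full file. A minimal sketch, assuming the seattle_csv object from above (only tiny summary tables come back to memory):

Code
# Which UsageClass values actually appear?
seattle_csv |>
  distinct(UsageClass) |>
  collect()

# How often does each MaterialType show up?
seattle_csv |>
  count(MaterialType) |>
  arrange(desc(n)) |>
  collect()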

📋 READY Framework - Your Strategic Case Planning

Apply BEFORE you start investigating data

Let’s work through the READY framework for our Seattle Library dataset:

R - Representative Data: Do we have what we need?

👥 Group Activity (3 minutes): Work with your neighbor to brainstorm:

👉Your Investigation Questions:

# Add your R questions as comments: 

# example: Do we have all library branches or just some? 

# 1. Is all the relevant information present for each item in the dataset?

# 2. Does the dataset encompass all possible items that can be checked out from a library? 

# 3. Is this version of the dataset the most up to date and relevant to perform the desired analysis?

Detective Assessment: We’re working with 9 GB of Seattle Library checkout data - that’s 40+ million individual checkout records!

E - Executive Driven Questions: What stakeholders want to know?

👉 Your Investigation Questions:

# Add your E questions as comments: 

# example: Library Directors: "How do we optimize our collection?" 

# 1. What aspects of the dataset affect how we measure specific metrics such as checkout rate?

# 2. Which genres or subjects are considered the most popular?

# 3. What is the total number of items/values present in the library dataset?

# 4. Which fields (variables) in the dataset are most relevant for a library to monitor in order to stay relevant to the public?

Our Primary Investigation Question: “What patterns exist in library usage that could inform collection and service decisions?”

A - Analytical Framework: What’s our exploration strategy?

Your Investigation Questions:

# Add your A questions as comments(Read through these): 

# First contact: What IS this data? 
# Data quality: Can we trust it?
# Scope assessment: What can we investigate? 
# Pattern hunting: What stories emerge? 
# Stakeholder insights: What's actionable?

# 1. How many types of variables are present in the data to consider?

# 2. What are the expectations and overall sentiment about what can be found or interpreted in this dataset?

# 3. What is the significance of the data, and how useful will it be for the type of analysis we expect to conduct?

D - Data Best Practices: What quality unknowns should we check?

👉Your Investigation Questions:

# Add your group's D questions as comments: 

# EXAMPLE: Are there missing values in key fields?

# 1. Are all values in each field correctly formatted for that variable's data type?

# 2. Are there fields that may require modification (such as splitting a variable) to get the most out of the information present?

# 3. How well formatted is the dataset? Will it require additional cleaning or editing to improve readability?

Y - Your Insights: What story might emerge?

👉Your Investigation Questions:

# Add your Y questions as comments: 
# Example: Usage patterns across time (seasonal, pandemic impact?) 

# 1. How can the subject or genre of a book or reading impact how frequently it gets checked out in the library?

# 2. Does the class or format of a reading or book affect how popular it is or how frequently it gets checked out?

# 3. Are there characteristics in fields such as "Publisher" or "Creator" that impact how often a reading or book gets checked out?

# 4. Are there other relevant insights in the dataset that do not rely only on checkout rate?
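
As a preview of how one of these Y questions might translate into an Arrow query, here is a minimal sketch using the seattle_csv object from above (the real analysis comes in later parts):

Code
# Y question 2: does the format (MaterialType) affect checkout frequency?
seattle_csv |>
  group_by(CheckoutYear, MaterialType) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear, desc(total_checkouts)) |>
  collect()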

What Arrow Actually Does:

  1. Schema Detection: Reads just enough rows to understand data types

  2. Metadata Storage: Creates an index of where data lives on disk

  3. Lazy Operations: Builds query plans without executing them

  4. Columnar Processing: Only reads columns you actually need

  5. Predicate Pushdown: Applies filters before reading data
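
Here is a minimal sketch of points 3-5, assuming the seattle_csv object from above. (Because our source is a CSV, Arrow still has to scan the file; the biggest predicate-pushdown wins come with the Parquet format, but the lazy, column-selective behavior is the same.)

Code
# Columnar processing + predicate pushdown + lazy evaluation in one pipeline
seattle_csv |>
  select(CheckoutYear, Checkouts) |>   # keep only 2 of the 12 columns
  filter(CheckoutYear >= 2020) |>      # filter applied while scanning
  group_by(CheckoutYear) |>
  summarise(total = sum(Checkouts)) |>
  collect()                            # nothing runs until this line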

The Lazy Evaluation Advantage

Think of Arrow like a smart restaurant:

Traditional R Restaurant:

  • Cooks every dish on the menu and brings it all to you at once

  • You pick what you want and throw away the rest

  • Kitchen overwhelmed, customers wait, food wasted

Arrow Restaurant:

  • Shows you a menu (schema)

  • Takes your order (query plan)

  • Cooks only what you ordered (lazy evaluation)

  • Delivers exactly what you need (collect())

Real-World Big Data Scenarios

When You’ll Need These Skills:

  • Academic Research: Census data, genomics, climate models

  • Business Analytics: Customer transactions, web logs, sensor data

  • Public Policy: Government datasets, health records, economic indicators

  • Machine Learning: Training datasets often exceed memory limits