Week 5 - Arrow package_student

For larger than life datasets

Author

DSA406_001_SP25_wk5_cisrael

Today’s Lesson: Efficient Big Data Analysis with Arrow

Objectives

By the end of this lesson, you will be able to:

  • Understand why Arrow is crucial for big data analysis
  • Practice with real-world data
  • Apply data science communication principles to your analysis

Why Arrow?

Arrow is designed for handling very large datasets (100GB+) with:

  • Zero-copy reads
  • Columnar format efficiency
  • Cross-language compatibility
  • Native parallel processing

🤔 Data Scientist Thinking: When working with big data, we need to consider both computational efficiency and memory management. How might these features impact our daily workflow?

Resource: Apache coockbook

#Load Library
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
Warning: package 'forcats' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(arrow)
Warning: package 'arrow' was built under R version 4.4.2

Attaching package: 'arrow'

The following object is masked from 'package:lubridate':

    duration

The following object is masked from 'package:utils':

    timestamp

Downloading the Data

A dataset of item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6.

Step 1: Create a Directory

First, let’s create a special folder to store our data:

# Create a "data" directory if it doesn't exist already
# Using showWarnings = FALSE to suppress warning if directory already exists
dir.create("data", showWarnings = FALSE)

Step 2: Download the Dataset

Now for the fun part! We’ll download the Seattle Library dataset (9GB).

⚠️ Important: This is a 9GB file, so:

  • Make sure you have enough disk space
# Download Seattle library checkout dataset:
# 1. Fetch data from AWS S3 bucket URL
# 2. Save to local data directory
# 3. Use resume = TRUE to allow continuing interrupted downloads
curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "data/seattle-library-checkouts.csv",
  resume = TRUE
)
# A tibble: 1 × 10
  success status_code resumefrom url    destfile error type  modified
  <lgl>         <int>      <dbl> <chr>  <chr>    <chr> <chr> <dttm>  
1 TRUE            416          0 https… "C:\\Us… <NA>  appl… NA      
# ℹ 2 more variables: time <dbl>, headers <list>

Why USE: curl::multi_download()

  • Shows a progress bar (great for tracking large downloads)
  • Can resume if interrupted (super helpful for big files!)
  • More reliable than base R download methods

Resource: CURL

🤔 While the file downloads, let’s think about:

  1. Why do we need special tools for such large datasets?

  2. What challenges might we face with traditional R methods?

  3. How might a library use this kind of data?

Step 3: Verify the Download

After the download completes, let’s make sure everything worked:

# Check if the Seattle library dataset file exists and print its size:
# 1. Verify file exists at specified path
# 2. Calculate file size in gigabytes by dividing bytes by 1024^3

file.exists("data/seattle-library-checkouts.csv")
[1] TRUE
file.size("data/seattle-library-checkouts.csv") / 1024^3  # Size in GB
[1] 8.579315

Loading Our Data with Arrow 🚀

Step 1: Opening the Dataset

# Load Seattle library checkout data into Arrow dataset:
# 1. Specify the CSV file path
# 2. Set ISBN column to be read as string type to preserve leading zeros
# 3. Define CSV as the file format
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv", 
  col_types = schema(ISBN = string()),
  format = "csv"
)

What’s Actually Happening? 🤔

The Magic of Lazy Loading

When you run this code with open_dataset(), Arrow does something clever:

  1. It peeks at the first few thousand rows

  2. Figures out what kind of data is in each column

  3. Creates a roadmap of the data

  4. Then… it stops!

That’s right - it doesn’t load the whole 9GB file. Imagine Arrow as a really efficient librarian who:

  • Creates an index of where everything is

  • Only gets books (data) when you specifically ask for them

  • Keeps track of what’s where without moving everything

Why Specify ISBN as String?

  • The first 80,000 rows have blank ISBNs
  • Without our help, Arrow might get confused about what type of data this is
  • By telling Arrow “this is definitely text”, we prevent any confusion

Run seattle_csv

You’ll see something interesting:

  • Information about where the data is stored

  • The structure Arrow discovered

  • Column names and types

  • But NO actual data yet!

#Inspect the data
seattle_csv
FileSystemDataset with 1 csv file
12 columns
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: string
Creator: string
Subjects: string
Publisher: string
PublicationYear: string

🎯 Key Takeaways

  1. Arrow is lazy (in a good way!)
  • Only does work when necessary

  • Saves memory and time

  1. Arrow is smart
  • Figures out data types automatically

  • But accepts our help when needed

  1. Arrow is efficient

    • Keeps track of data without loading it

    • Ready to fetch exactly what we need

DPLYR functions

Arrow package provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.

Using collect() we can force arrow to perform the computation and return some data.

# View this will take a long time so don't run it in class
#dplyr::collect(seattle_csv)

🔍 Investigation

👉 Now You Try!** (3 min)

Check out the dataset to understand the data structure

#Create below:

# Clue 1: Basic structure
glimpse(seattle_csv)
FileSystemDataset with 1 csv file
41,389,465 rows x 12 columns
$ UsageClass      <string> "Physical", "Physical", "Digital", "Physical", "Physi…
$ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Horizo…
$ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOOK",…
$ CheckoutYear     <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ CheckoutMonth    <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ Checkouts        <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2, 3,…
$ Title           <string> "Super rich : a guide to having it all / Russell Simm…
$ ISBN            <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
$ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim Par…
$ Subjects        <string> "Self realization, Conduct of life, Attitude Psycholo…
$ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Dial …
$ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c2005.…
# Clue 2: Dataset dimensions
dim(seattle_csv)
[1] 41389465       12
# Clue 3: Column names 
colnames(seattle_csv)
 [1] "UsageClass"      "CheckoutType"    "MaterialType"    "CheckoutYear"   
 [5] "CheckoutMonth"   "Checkouts"       "Title"           "ISBN"           
 [9] "Creator"         "Subjects"        "Publisher"       "PublicationYear"
#Clue 4: Structure and types
str(seattle_csv)
Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
  Inherits from: <Dataset>
  Public:
    .:xp:.: externalptr
    .class_title: function () 
    .unsafe_delete: function () 
    class_title: function () 
    clone: function (deep = FALSE) 
    files: active binding
    filesystem: active binding
    format: active binding
    initialize: function (xp) 
    metadata: active binding
    NewScan: function () 
    num_cols: active binding
    num_rows: active binding
    pointer: function () 
    print: function (...) 
    schema: active binding
    set_pointer: function (xp) 
    ToString: function () 
    type: active binding
    WithSchema: function (schema)  

👉 Now You Try! Create a write up:

So this dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022. Each column provided details about the books that have been checked out. The dataset contains String and Numerical (Integer) type variables. Some columns that may require further investigation are ISBN and PublicationYear. ISBN has many empty values, which could impact the analysis. Some of the years for PublicationYear are formatted normally like “2025”, while others are formatted like “c2025.” or “2025.”. This could alter the results of any grouping or calculations using PublicationYear, so this column may need to be cleaned.

Part 2: Practical Arrow Usage

Working with summarize() in Arrow

  1. The summarise() function is like creating a summary report of your data. With Arrow, it’s especially powerful because it can create these summaries without loading all the data into memory.

    Common summary functions:

  • sum():

    • Adds up all values

    • Best for: Totals, like total checkouts

    • Example: sum(5, 3, 2) = 10

  • mean():

    • Calculates the average

    • Best for: Typical or central values

    • Example: mean(5, 3, 2) = (5+3+2)/3 = 3.33

  • median():

    • Finds the middle value

    • Best for: Central tendency when you have outliers

    • Example: median(5, 3, 2) = 3

  • max():

    • Finds the highest value

    • Best for: Peak values

    • Example: max(5, 3, 2) = 5

  • min():

    • Finds the lowest value

    • Best for: Minimum values

    • Example: min(5, 3, 2) = 2

  1. Let’s start with just sum

I wonder how many checkouts there are in a year?

# Calculate total checkouts per year by:
# 1. Grouping the data by checkout year
# 2. Summing all checkouts within each year
# 3. Collecting results from the lazy query into memory
yearly_checkouts <- seattle_csv %>%
  group_by(CheckoutYear) %>%
  summarise(
    total_checkouts = sum(Checkouts)
  ) %>%
  collect()

yearly_checkouts
# A tibble: 18 × 2
   CheckoutYear total_checkouts
          <int>           <int>
 1         2016         9021051
 2         2022         7001989
 3         2017         9231648
 4         2018         9149176
 5         2019         9199083
 6         2020         6053717
 7         2021         7361031
 8         2005         3798685
 9         2006         6599318
10         2007         7126627
11         2008         8438486
12         2009         9135167
13         2010         8608966
14         2011         8321732
15         2012         8163046
16         2013         9057096
17         2014         9136081
18         2015         9084179

💭 Analysis Questions:

  • What trends do you notice in the yearly checkouts?

  • Can you spot any unusual years? What might explain them?

    ggplot(yearly_checkouts, aes(x = CheckoutYear, y = total_checkouts)) +
      geom_line(color = "#2c5282", linewidth = 1) +
      geom_point(color = "#2c5282", size = 3) +
      labs(
        title = "Total Library Checkouts by Year",
        x = "Year",
        y = "Total Checkouts",
        caption = "Data source: Seattle Public Library"
      ) +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        panel.grid.minor = element_blank()
      ) +
      scale_y_continuous(labels = scales::comma)

I wonder what different types of material types there are?

# Get unique material types in the dataset by:
# 1. Selecting only the MaterialType column
# 2. Removing duplicate values
# 3. Collecting results from the lazy query into memory

material_count <- seattle_csv %>%
  group_by(MaterialType) %>%
  summarise(count = n()) %>%
  collect() %>%
  arrange(desc(count)) %>%
  filter(count > 20000)

material_count
# A tibble: 18 × 2
   MaterialType            count
   <chr>                   <int>
 1 BOOK                 21950747
 2 EBOOK                 6386372
 3 SOUNDDISC             4317838
 4 VIDEODISC             3446900
 5 AUDIOBOOK             2776302
 6 SONG                   969118
 7 VIDEOCASS              499789
 8 MUSIC                  276475
 9 SOUNDCASS              199754
10 MIXED                  132474
11 MOVIE                   92192
12 TELEVISION              57730
13 CR                      51167
14 VIDEO                   49298
15 SOUNDDISC, VIDEODISC    33265
16 REGPRINT                24822
17 SOUNDREC                23319
18 COMIC                   21492
ggplot(material_count, aes(x = reorder(MaterialType, count), y = count)) +
  geom_bar(stat = "identity", fill = "#2c5282", alpha = 0.8) +
  coord_flip() +
  labs(
    title = "Distribution of Material Types in Seattle Library",
    x = "Material Type",
    y = "Number of Items",
    caption = "Data source: Seattle Public Library"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma)

💭 After the Results:

  • Were there any material types that surprised you?

  • How do these categories reflect changes in technology and media consumption?

  • What questions could we investigate about these different material types?

I wonder in 2020 what were the totals for the materials types?

# Calculate total checkouts per material type in 2020 by:
# 1. Filtering for checkouts made in 2020
# 2. Grouping the data by material type (e.g., books, DVDs)
# 3. Summing checkouts within each group
# 4. Collecting results from the lazy query into memory
checkouts_2020 <- seattle_csv %>%
  filter(CheckoutYear == 2020) %>%
  group_by(MaterialType) %>%
  summarise(total_checkouts = sum(Checkouts)) %>%
  collect() %>%
  arrange(desc(total_checkouts))

checkouts_2020
# A tibble: 43 × 2
   MaterialType         total_checkouts
   <chr>                          <int>
 1 EBOOK                        2793961
 2 AUDIOBOOK                    1513625
 3 BOOK                         1241999
 4 VIDEODISC                     361587
 5 SOUNDDISC                     116221
 6 MIXED                           9118
 7 REGPRINT                        7573
 8 VIDEO                           2430
 9 MUSIC                           2404
10 SOUNDDISC, VIDEODISC            1049
# ℹ 33 more rows
# Calculate the summary statistics for the checkouts in 2020
summary_stats <- checkouts_2020 %>%
  summarise(
    total_all_checkouts = sum(total_checkouts),
    avg_checkouts_per_material = mean(total_checkouts),
    median_checkouts = median(total_checkouts),
    n_material_types = n()
  )

summary_stats
# A tibble: 1 × 4
  total_all_checkouts avg_checkouts_per_mate…¹ median_checkouts n_material_types
                <int>                    <dbl>            <int>            <int>
1             6053717                  140784.               19               43
# ℹ abbreviated name: ¹​avg_checkouts_per_material

Critical Thinking:

  • Why might we be particularly interested in 2020 data?

  • What hypotheses can we form about different material types?

👉 Now You Try! (3 minutes)

Modify the code above to find the total checkouts for a different year.

checkouts_2010 <- seattle_csv %>%
  filter(CheckoutYear == 2010) %>%
  group_by(MaterialType) %>%
  summarise(total_checkouts = sum(Checkouts)) %>%
  collect() %>%
  arrange(desc(total_checkouts))

checkouts_2010
# A tibble: 45 × 2
   MaterialType         total_checkouts
   <chr>                          <int>
 1 BOOK                         4217881
 2 VIDEODISC                    2824185
 3 SOUNDDISC                    1274448
 4 AUDIOBOOK                      87551
 5 EBOOK                          84328
 6 CR                             17918
 7 VIDEOCASS                      15742
 8 SOUNDDISC, VIDEODISC           15740
 9 MUSIC                          15368
10 VISUAL                         13009
# ℹ 35 more rows

If time allows work in a group and come up with a wondering and investigate it.

#ADD COMMENTS AND CODE BELOW

🎯 Key Takeaways

  1. Arrow is lazy (in a good way!)
-    Only does work when necessary

-    Saves memory and time
  1. Arrow is smart

    • Figures out data types automatically

    • But accepts our help when needed

  2. Arrow is efficient

    • Keeps track of data without loading it

    • Ready to fetch exactly what we need

Homework Assignment 👉

  1. What is the average number of checkouts per year?
# Find the avearge number of checkouts and arrange in descending order
avg_checkouts_per_year <- seattle_csv %>%
  group_by(CheckoutYear) %>%
  summarise(avg_checkouts = mean(Checkouts, na.rm = TRUE))%>%
  collect()%>%
  arrange(desc(avg_checkouts))

avg_checkouts_per_year
# A tibble: 18 × 2
   CheckoutYear avg_checkouts
          <int>         <dbl>
 1         2009          4.10
 2         2008          3.95
 3         2010          3.93
 4         2013          3.60
 5         2011          3.56
 6         2019          3.55
 7         2020          3.52
 8         2014          3.50
 9         2007          3.49
10         2018          3.43
11         2012          3.40
12         2006          3.40
13         2015          3.33
14         2022          3.31
15         2021          3.28
16         2017          3.27
17         2016          3.24
18         2005          2.85
ggplot(avg_checkouts_per_year, aes(x = CheckoutYear, y = avg_checkouts)) +
  geom_line(color = "#2c5282", linewidth = 1) +
  geom_point(color = "#2c5282", size = 3) +
  labs(
    title = "Average Library Checkouts by Year",
    x = "Year",
    y = "Average Checkouts",
    caption = "Data source: Seattle Public Library"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    panel.grid.minor = element_blank()
  ) +
  scale_y_continuous(labels = scales::comma)

  1. What is the average number of checkouts for each material type in 2021?
# Find the avearge number of checkouts in 2021
# for each material type and arrange in descending order
avg_material_checkouts_2021 <- seattle_csv %>%
  filter(CheckoutYear == 2021) %>%
  group_by(MaterialType) %>%
  summarise(avg_checkouts = mean(Checkouts, na.rm = TRUE))%>%
  collect()%>%
  arrange(desc(avg_checkouts))

avg_material_checkouts_2021
# A tibble: 46 × 2
   MaterialType       avg_checkouts
   <chr>                      <dbl>
 1 MIXED                     163.  
 2 AUDIOBOOK                   4.69
 3 REGPRINT                    3.93
 4 UNSPECIFIED                 3.6 
 5 EBOOK                       3.55
 6 VIDEODISC                   3.19
 7 REGPRINT, VIDEOREC          2.9 
 8 BOOK                        2.69
 9 CR                          2.62
10 ER, NONPROJGRAPH            2.2 
# ℹ 36 more rows

Write a brief analysis (150-300 words) answering these questions:

  • What trends do you notice in the yearly averages?
    • There doesn’t seem to be a clear trend for the yearly averages. The checkouts have fluctuated over the years, having increasing and decreasing over time. It is interesting to note that the top 3 years are 2008-2010, with 2009 having the highest average. 2005 has the lowest average by far. It could be said the average checkouts have started to normalize between about 3.2 and 3.6, but I would want more data before confidently stating that as a trend.
  • Which material types had the highest and lowest average checkouts in 2021?
    • The MIXED Material type had the highest average by far at 163 checkouts. No other material had average checkouts in the triple or even double digits. The next two highest were AUDIOBOOK at 4.7 and REGPRINT at 3.9. There were many material types that had the lowest checkout average of 1. Some notable ones were SOUNDDISC, SOUNDCASSS, VIDEOCASS, and PICTURE.
  • What might explain these differences in average checkouts?
    • I think the main two factors impacting the average checkouts could be the popularity of the media type and effects of the Covid-19 pandemic. For example, the least popular types included cassette tapes and cds, both which were pretty outdated forms of media by 2021. By contrast, audio books were a much more popular form of consuming media. Their increase in popularity was impacted by the pandemic. I looked up what Mixed media materials could be, and found that it included media such as school curriculum materials, letters, diaries, and packages that include multiple different materials. I can also see how this would be very popular. People who were doing online school during quarantine would need materials. And when you don’t have anything to do, getting multiple types of media with one checkout gives you more to consume.