Project Part 2: Quarto document

Project Assignment 1: First Contact with Your Dataset Using Arrow

Assignment Overview

This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your dataset systematically.

Learning Objectives

By completing this assignment, you will:

  • Apply the READY framework to plan your data investigation

  • Use the SCAN framework to systematically explore your dataset

  • Practice using Arrow for memory-efficient data loading

  • Document your initial findings and develop investigation questions

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

# Load required libraries

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(glue)

For ZIP files containing CSV(s):

# Set up and extract your ZIP file
zip_path <- "/Users/sriyavenkat/Downloads/archive.zip"
outdir <- file.path(dirname(zip_path), "extracted_data")
dir.create(outdir, showWarnings = FALSE)
unzip(zip_path, exdir = outdir, overwrite = TRUE)

# Get list of CSV files (search subfolders in case the ZIP had an internal directory)
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Open with Arrow - specify the main extracted file you want to work with
my_dataset <- open_dataset(csv_files[["orders"]], format = "csv")

# Check memory usage
glue("Memory used by Arrow object: {format(object.size(my_dataset), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
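
As a small, optional memory-efficiency demonstration (a sketch using the orders data loaded above): the Arrow object is only a handle to data on disk, so even a modest collected slice is far larger in R's memory.

# Compare the tiny Arrow handle with an actual in-memory slice;
# nothing is read from disk until collect() is called
sample_rows <- my_dataset |>
  head(10000) |>
  collect()

glue("Arrow handle: {format(object.size(my_dataset), units = 'KB')}")
glue("10,000 collected rows: {format(object.size(sample_rows), units = 'KB')}")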

For compressed CSV files (.csv.gz):

# csv_gz_path <- "/Users/sriyavenkat/Downloads/archive.zip"  # UPDATE THIS PATH to your .csv.gz file
#
# # Option 1: Auto-generate column names if the header is problematic
# fmt <- CsvFileFormat$create(autogenerate_column_names = TRUE)
# my_dataset <- open_dataset(csv_gz_path, format = fmt)
#
# # Rename columns as needed (adjust based on your data)
# my_dataset <- my_dataset %>%
#   transmute(
#     col1 = f0,  # Give meaningful names
#     col2 = f1,
#     col3 = f2
#     # Add more columns as needed
#   )
#
# # Option 2: If you have a good header, use the standard format
# # my_dataset <- open_dataset(csv_gz_path, format = "csv")

For text files with custom delimiters:

# # For semicolon-separated or other delimiters
# txt_path <- "/Users/sriyavenkat/Downloads/archive.zip"  # UPDATE THIS PATH
# 
# # Specify your delimiter
# fmt <- CsvFileFormat$create(delimiter = ";")  # Change ";" to your delimiter
# my_dataset <- open_dataset(txt_path, format = fmt)
# 

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

R - Representative Data

Document your thoughts as comments:

What is the scope of your data?

The dataset contains anonymized online grocery transactions from Instacart customers across the United States. Each record represents either an individual order or a specific product purchased within that order.

Time period covered: no actual timestamps are provided; instead, the data includes relative timing variables such as order day of week, hour of day, and days since the prior order.

Geographic coverage: the entire United States (no regional detail is included)

Population represented: Active Instacart users who placed at least one order during the study period

Potential biases or limitations: the population skews toward online shoppers, who tend to be higher-income and more tech-savvy.

Example questions to consider:

  • Do we have complete coverage of what we’re studying? Yes

  • Are there any obvious gaps in the data? Yes: there is no demographic data and no pricing data.

  • What might be missing? Exact timestamps, coupons, discounts

E - Executive Driven Questions

Who would care about insights from your data?

Key stakeholders include Instacart’s marketing and analytics teams, retail partners, and supply chain planners. These groups rely on customer purchase data to understand shopping habits, optimize recommendations, and improve operational efficiency.

Primary stakeholders:

  • Marketing & Personalization Teams – to design targeted recommendations and promotions.

  • Retail & Brand Managers – to understand which products drive reorders and cross-selling.

  • Supply Chain & Inventory Teams – to forecast demand and manage stock levels efficiently.

Your stakeholder questions:

  1. Which products or departments are most frequently reordered, and what factors influence reorder likelihood?

  2. What product combinations commonly appear in the same basket, and how can these insights improve cross-selling or product placement?

  3. At what times or days do customers place the largest orders, and how can scheduling and marketing be optimized to match peak activity?

What decisions could this data inform?

  • Marketing campaigns and personalized product recommendations

  • Inventory forecasting and warehouse logistics planning.

  • Store-level assortment planning and pricing strategy adjustments.

A - Analytical Framework

Your exploration strategy:

Phase 1: Data Quality Assessment - Check for missing values - Identify data types and consistency - Look for outliers or anomalies

Phase 2: Descriptive Analysis - What are the key variables? - What’s the distribution of important metrics? - What time patterns exist?

Phase 3: Pattern Investigation - What relationships might exist between variables? - Are there seasonal or temporal patterns? - What groupings or segments emerge?

Your specific analytical approach:

Phase 1: Data Quality Assessment

  • Check for missing values, especially in days_since_prior_order.

  • Verify that categorical variables (aisle, department, reordered) are consistent.

  • Confirm that IDs are unique and that there are no duplicate order-product pairs.

Phase 2: Descriptive Analysis

  • Identify key variables such as order_hour_of_day and reordered.

  • Explore the distribution of basket sizes and reorder frequencies.

  • Examine ordering patterns by day of week and hour of day to detect customer shopping habits.

Phase 3: Pattern Investigation

  • Analyze product co-occurrence to identify commonly paired items.

  • Investigate how order timing (day/hour) affects basket size and reorder likelihood.

  • Segment customers by order frequency or department preferences.

My specific approach:

  1. Join orders, products, aisles, and departments into one dataset for holistic analysis (see the sketch after this list).

  2. Calculate reorder rates, basket sizes, and department-level purchase shares.

  3. Use association rules and correlation analysis to uncover product relationships and frequent basket combinations.
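
A minimal sketch of steps 1 and 2 above, written against Arrow so the joins stay out of memory. It assumes the extracted ZIP also contains order_products__prior.csv, products.csv, and departments.csv (adjust the csv_files[[...]] names to whatever your archive actually holds; aisles.csv can be joined the same way).

# Open the related tables lazily (file names here are assumptions)
order_products <- open_dataset(csv_files[["order_products__prior"]], format = "csv")
products <- open_dataset(csv_files[["products"]], format = "csv")
departments <- open_dataset(csv_files[["departments"]], format = "csv")

# Department-level reorder rates and item counts, computed inside Arrow
reorder_by_department <- order_products |>
  left_join(products, by = "product_id") |>
  left_join(departments, by = "department_id") |>
  group_by(department) |>
  summarise(
    n_items = n(),
    reorder_rate = mean(reordered)
  ) |>
  arrange(desc(reorder_rate)) |>
  collect()

reorder_by_department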

D - Data Best Practices

Quality checks to perform:

  • Assess missing values in all variables and document expected vs. unexpected gaps.

  • Check for duplicate keys.

  • Validate consistency between related tables.

Missing data assessment:

  • Expect days_since_prior_order to be missing for first orders.

  • Confirm that reordered is complete and encoded as 0 or 1.

  • Verify that no critical categorical values (aisle, department) are missing.

Data type verification:

  • Ensure all ID variables are integers.

  • Confirm that order_dow and order_hour_of_day are numeric and fall within their valid ranges (a quick check is sketched after this list).

  • Treat product, aisle, and department names as categorical (factor or string).
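
A minimal sketch of that range check, run lazily in Arrow. It assumes order_dow runs 0-6, order_hour_of_day runs 0-23, and days_since_prior_order is capped at 30; the cap is an assumption to verify, not a given.

# Count values outside the expected ranges; zeros would confirm the expected encodings
range_check <- my_dataset |>
  summarise(
    bad_dow = sum(as.integer(order_dow < 0 | order_dow > 6), na.rm = TRUE),
    bad_hour = sum(as.integer(order_hour_of_day < 0 | order_hour_of_day > 23), na.rm = TRUE),
    bad_days = sum(as.integer(days_since_prior_order < 0 | days_since_prior_order > 30), na.rm = TRUE)
  ) |>
  collect()

range_check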

My concerns:

  1. Missing or inconsistent values in relative timing variables.

  2. Lack of explicit calendar dates may limit trend or seasonal analysis.

  3. Potential bias since data only includes active Instacart users (not all grocery shoppers).

Y - Your Insights

Based on general shopping behavior, I expect to see clear temporal patterns (e.g., people placing more orders on weekends and evenings). Staples such as milk, bananas, and eggs are likely to appear as frequent reorders. Departments like produce and dairy may anchor larger baskets.

My predictions:

  1. Reordering behavior: High reorder rates for staple items and household essentials.

  2. Temporal trends: Peak order volumes on weekends and evenings (after work hours).

  3. Basket composition: Items from similar aisles (e.g., produce + dairy) frequently co-occur in the same orders.

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

my_dataset |>
  head(1000) |>
  collect() |>
  glimpse()
Rows: 1,000
Columns: 7
$ order_id               <int> 2539329, 2398795, 473747, 2254736, 431534, 3367…
$ user_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
$ eval_set               <chr> "prior", "prior", "prior", "prior", "prior", "p…
$ order_number           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, …
$ order_dow              <int> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, 1, 2, 3,…
$ order_hour_of_day      <int> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11, 10, 10…
$ days_since_prior_order <dbl> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, 14, NA, …

After examining the data structure, who else might be interested?

Beyond data analysts, teams such as marketing, supply chain planning, retail partners, and product management would be highly interested in this dataset. Marketing and personalization teams can use order timing and reorder data to understand customer behavior, while supply chain teams can use it to forecast demand and optimize inventory.

What specific questions would they have?

  • When do customers place most of their grocery orders, and which days or hours are busiest?

  • How frequently do customers reorder products, and what is the typical time gap between orders?

  • Are there customer segments (based on order frequency or basket size) that show distinctive shopping patterns?

  • How do reorder rates vary across different product departments (e.g., produce vs. frozen foods)?

What concerns might they have about data quality?

  • Missing values for days since prior order in first orders may complicate time-based analyses.

  • The lack of geographic or demographic details limits personalization and regional insights.

  • Because prices and promotions aren’t included, analyses might overlook economic factors driving reorders.

  • Data represent only Instacart users, not all grocery shoppers, which introduces potential sampling bias.

C - Columns and Coverage

Create a summary table of your variables:

dataset_overview <- my_dataset |>
  summarise(
    total_records = n(),
    earliest_order_number = min(order_number, na.rm = TRUE),
    latest_order_number = max(order_number, na.rm = TRUE),
    earliest_hour = min(order_hour_of_day, na.rm = TRUE),
    latest_hour = max(order_hour_of_day, na.rm = TRUE),
    min_days_since_prior = min(days_since_prior_order, na.rm = TRUE),
    max_days_since_prior = max(days_since_prior_order, na.rm = TRUE)
  ) |>
  collect()  

dataset_overview
# A tibble: 1 × 7
  total_records earliest_order_number latest_order_number earliest_hour
          <int>                 <int>               <int>         <int>
1       3421083                     1                 100             0
# ℹ 3 more variables: latest_hour <int>, min_days_since_prior <dbl>,
#   max_days_since_prior <dbl>
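
As a companion to the overview above, a minimal sketch of a per-variable summary table: it collects a single row so only the column names and their R types come into memory.

# Peek at one row to record each column's name and R class
first_row <- my_dataset |>
  head(1) |>
  collect()

variable_summary <- data.frame(
  variable = names(first_row),
  type = vapply(first_row, function(x) class(x)[1], character(1)),
  row.names = NULL
)

variable_summary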

A - Aggregates: Overall Picture

# Get comprehensive dataset statistics
# A – Aggregates: overall picture
data_summary <- my_dataset |>
  summarise(
    total_rows = n(),
    across(where(is.character), ~n_distinct(.x, na.rm = TRUE)),
    across(where(is.numeric),
           list(min = ~min(.x, na.rm = TRUE),
                max = ~max(.x, na.rm = TRUE)))
  ) |>
  collect()

data_summary
# A tibble: 1 × 14
  total_rows eval_set order_id_min order_id_max user_id_min user_id_max
       <int>    <int>        <int>        <int>       <int>       <int>
1    3421083        3            1      3421083           1      206209
# ℹ 8 more variables: order_number_min <int>, order_number_max <int>,
#   order_dow_min <int>, order_dow_max <int>, order_hour_of_day_min <int>,
#   order_hour_of_day_max <int>, days_since_prior_order_min <dbl>,
#   days_since_prior_order_max <dbl>

N - Notable Segments

# N – Notable Segments: analyze key categorical variables
# Modify based on your specific data

orders_by_day <- my_dataset |>
  group_by(order_dow) |>
  summarise(total_orders = n()) |>
  arrange(order_dow) |>
  collect()

orders_by_day
# A tibble: 7 × 2
  order_dow total_orders
      <int>        <int>
1         0       600905
2         1       587478
3         2       467260
4         3       436972
5         4       426339
6         5       453368
7         6       448761
orders_by_hour <- my_dataset |>
  group_by(order_hour_of_day) |>
  summarise(total_orders = n()) |>
  arrange(order_hour_of_day) |>
  collect()

orders_by_hour
# A tibble: 24 × 2
   order_hour_of_day total_orders
               <int>        <int>
 1                 0        22758
 2                 1        12398
 3                 2         7539
 4                 3         5474
 5                 4         5527
 6                 5         9569
 7                 6        30529
 8                 7        91868
 9                 8       178201
10                 9       257812
# ℹ 14 more rows
orders_completeness <- my_dataset |>
  summarise(
    total_rows = n(),
    across(everything(),
           list(
             missing = ~sum(is.na(.x)),
             missing_pct = ~mean(is.na(.x)) * 100
           ))
  ) |>
  collect()

orders_completeness
# A tibble: 1 × 15
  total_rows order_id_missing order_id_missing_pct user_id_missing
       <int>            <int>                <dbl>           <int>
1    3421083                0                    0               0
# ℹ 11 more variables: user_id_missing_pct <dbl>, eval_set_missing <int>,
#   eval_set_missing_pct <dbl>, order_number_missing <int>,
#   order_number_missing_pct <dbl>, order_dow_missing <int>,
#   order_dow_missing_pct <dbl>, order_hour_of_day_missing <int>,
#   order_hour_of_day_missing_pct <dbl>, days_since_prior_order_missing <int>,
#   days_since_prior_order_missing_pct <dbl>

Complete this comprehensive assessment:

DATASET OVERVIEW:

  • Records: ~3.4 million rows, each representing an Instacart grocery order placed by an individual user

  • Time span: several months of ordering history; no explicit start and end dates are available because the data uses relative timing variables rather than calendar dates

  • Key metrics (from data_summary above): 3,421,083 rows; 3 eval_set values; order_id from 1 to 3,421,083; user_id from 1 to 206,209; order_number from 1 to 100; order_dow from 0 to 6; order_hour_of_day from 0 to 23; days_since_prior_order minimum of 0

DATA COMPLETENESS:

  • Core fields (order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day): 100% complete

  • days_since_prior_order: approximately 94% complete (missing only for each user's first order)

DATA QUALITY STRENGTHS:

What aspects are high quality? The dataset is highly complete and structurally sound: all identifiers are consistently present and properly formatted, and categorical and numeric variables fall within valid, expected ranges.

What makes this reliable? The data was collected directly from Instacart’s transactional system, minimizing entry or reporting errors. Because the tables are internally consistent (for example, user IDs match across orders), the statistical patterns can be trusted.
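
A small sketch of one such internal-consistency check, assuming order_number counts up from 1 for each user: the highest order_number per user should equal that user's number of recorded orders.

# Users whose order count does not match their maximum order_number;
# an empty result supports the internal-consistency claim above
order_sequence_check <- my_dataset |>
  group_by(user_id) |>
  summarise(
    n_orders = n(),
    max_order_number = max(order_number, na.rm = TRUE)
  ) |>
  collect() |>
  filter(n_orders != max_order_number)

nrow(order_sequence_check)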

What coverage is excellent? The dataset captures detailed information across a large base of active U.S. users. It provides strong coverage of shopping frequency, timing, and product categories, allowing for accurate analysis of consumer behavioral patterns.

DATA QUALITY CONCERNS:

  • Missing values in days_since_prior_order for each user's first order.

  • No explicit timestamps or calendar dates, only relative timing variables.

  • No pricing or revenue data.

MISSING DATA IMPACT:

  • Most missing: days_since_prior_order, at roughly 6% of rows

  • Impact on analysis: affects calculations of time between purchases

  • Handling strategy: treat missing values as “first order” and create an indicator variable (first_order = is.na(days_since_prior_order)), as sketched below
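
A sketch of that handling strategy, keeping the computation in Arrow until the final summary; the ~6% figure above can be confirmed this way.

# Create the first-order indicator without dropping any rows
orders_flagged <- my_dataset |>
  mutate(first_order = is.na(days_since_prior_order))

# Share of rows flagged as first orders (expected to be roughly 6%)
orders_flagged |>
  summarise(
    total_orders = n(),
    first_orders = sum(as.integer(first_order), na.rm = TRUE)
  ) |>
  collect() |>
  mutate(pct_first_orders = 100 * first_orders / total_orders)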

RELIABILITY ASSESSMENT:

  • Most reliable: order_id, user_id, order_number, order_dow, order_hour_of_day

  • Variables needing caution: days_since_prior_order

  • Overall confidence level: High

JUSTIFICATION: The dataset is internally consistent and nearly complete across all key fields. The only systematic missingness is days_since_prior_order on first orders, and the remaining variables show valid ranges and logical relationships.

Deliverables Checklist

Ensure your submission includes:

  • Complete READY framework analysis with thoughtful responses

  • Systematic SCAN framework exploration with specific findings

  • Successful data loading with Arrow

  • Professional data description and summary statistics

  • Comprehensive missing value analysis with percentages

  • Variable summary table documenting key fields

  • Memory efficiency demonstration

  • 3-5 well-defined, specific exploratory research questions

  • Data quality assessment with honest evaluation

  • Professional summary with clear next steps

Grading Criteria

  • READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach

  • Data Loading (15%): Successful Arrow implementation with proper documentation

  • SCAN Framework (25%): Systematic exploration with specific, meaningful findings

  • Data Quality Assessment (20%): Comprehensive evaluation with specific evidence

  • Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities

  • Professional Communication (5%): Clear, honest, well-organized presentation throughout

Tips for Success

  • Be specific in your observations - avoid vague statements

  • Think like a stakeholder - what would decision-makers actually want to know?

  • Document your reasoning for all assessment decisions

  • Be honest about limitations - this builds credibility

  • Focus on actionable insights - what can actually be learned from this data?

  • Ask for help if your data format doesn’t match the provided templates

Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.