Exploring Online Grocery Shopping Patterns: Part 2 - Deep Dive Analysis

Author

Sriya Venkat

Project Part 2: From Data Understanding to Stakeholder Insights

Overview

Welcome to Part 2 of your data analysis project! In Part 1, you successfully loaded your data, applied the READY and SCAN frameworks, and developed 3-5 research questions. Now it’s time to dig deeper.

What You’ll Accomplish in Part 2:

  • Part 4: Systematically analyze and handle missing data
  • Part 5: Select the most meaningful variables for your analysis
  • Part 6: Create compelling visualizations that answer exploratory questions
  • Part 7: Communicate your findings to stakeholders

Remember: This is still exploratory data analysis - you’re investigating patterns and building understanding, not proving predetermined hypotheses.


Setup and Data Loading

Load Required Libraries

# Function to check and install required packages
required_packages <- c(
  "tidyverse",    # Data manipulation and visualization
  "arrow",        # Efficient big data handling
  "duckdb",       # In-process analytics database
  "DBI",          # Database interface
  "glue",         # String interpolation
  "naniar",       # Missing data visualization
  "corrr",        # Correlation analysis
  "scales"        # Scale functions for visualization
)

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all packages
invisible(lapply(required_packages, library, character.only = TRUE))

Reconnect to Your Dataset

library(arrow)
library(dplyr)
library(glue)
library(ggplot2)
library(tidyr)
library(stringr)

my_data <- open_dataset(
  "/Users/sriyavenkat/Downloads/archive/orders.csv",
  format = "csv"
)

# Quick verification
glue("✅ Connected to Instacart orders dataset successfully")
✅ Connected to Instacart orders dataset successfully
my_data |> head(5) |> collect()
# A tibble: 5 × 7
  order_id user_id eval_set order_number order_dow order_hour_of_day
     <int>   <int> <chr>           <int>     <int>             <int>
1  2539329       1 prior               1         2                 8
2  2398795       1 prior               2         3                 7
3   473747       1 prior               3         3                12
4  2254736       1 prior               4         4                 7
5   431534       1 prior               5         4                15
# ℹ 1 more variable: days_since_prior_order <dbl>

Part 4: Missing Data Analysis & Strategy

Building on your Part 3 data quality assessment

Introduction to Missing Data

In Part 3, you identified where missing data exists. Now we need to understand WHY it’s missing and HOW to handle it. The way you handle missing data can significantly impact your conclusions!

The Three Questions We Must Answer:

  1. HOW MUCH data is missing?
  2. WHERE is data missing (patterns)?
  3. WHY is data missing (type)?

Step 1: Quantify Missing Data

❓ EXPLORATORY QUESTION 1: What percentage of data is missing for each variable?

First, let’s get precise percentages for all variables.

missing_summary <- my_data |>
  summarise(
    total_rows = n(),
    missing_days_since_prior = sum(is.na(days_since_prior_order)),
    missing_order_hour = sum(is.na(order_hour_of_day)),
    missing_order_dow  = sum(is.na(order_dow))
  ) |>
  collect()

missing_pct <- missing_summary |>
  mutate(
    pct_days_since_prior = round(missing_days_since_prior / total_rows * 100, 2),
    pct_order_hour       = round(missing_order_hour       / total_rows * 100, 2),
    pct_order_dow        = round(missing_order_dow        / total_rows * 100, 2)
  ) |>
  select(starts_with("pct_")) |>
  pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "percent_missing"
  ) |>
  mutate(variable = str_remove(variable, "pct_"))

missing_pct
# A tibble: 3 × 2
  variable         percent_missing
  <chr>                      <dbl>
1 days_since_prior            6.03
2 order_hour                  0   
3 order_dow                   0   
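As a cross-check, the naniar package loaded earlier can produce the same per-variable summary; a minimal sketch, assuming a 100,000-row sample (an arbitrary size) is representative enough for this purpose:

# Cross-check of the percentages above using naniar on a collected sample
set.seed(1)

my_data |>
  collect() |>                  # pull the Arrow query into memory
  sample_n(100000) |>           # sample to keep this quick
  naniar::miss_var_summary()    # per-variable missing counts and percentages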

Visualize Missing Data Patterns

missing_pct |>
  mutate(variable = reorder(variable, percent_missing)) |>
  ggplot(aes(x = percent_missing, y = variable, fill = percent_missing)) +
  geom_col() +
  scale_fill_gradient(low = "#1a9850", high = "#d73027") +
  theme_minimal() +
  labs(
    title = "Missing Data Across Variables",
    subtitle = "Instacart Orders Dataset",
    x = "Percentage Missing (%)",
    y = "Variable"
  ) +
  theme(legend.position = "none")

Missing data percentages across all variables

📝 Your Interpretation:

Answer these questions in your own words:

  • Which variable has the MOST missing data?
  • Which variables are nearly complete (< 5% missing)?
  • Are any variables severely incomplete (> 50% missing)?
  • Does the amount of missingness concern you for your analysis?

Your notes:

# Write your interpretation here:

Which variable has the MOST missing data?
days_since_prior_order has the highest missing percentage.

Which variables are nearly complete?
order_dow and order_hour_of_day are basically 100% complete.

Are any variables severely incomplete (>50%)?
No, missingness is concentrated only in first orders.

Does the amount of missingness concern you?
No, the missingness is expected and structurally tied to first-time customers who have no previous order.


Step 2: Identify Patterns in Missingness

❓ EXPLORATORY QUESTION 2: Does missingness vary by groups or over time?

Missing data often has patterns! Let’s investigate.

Pattern 1: Missing by Categorical Groups

# Analyze missing days_since_prior_order by day of week
missing_by_group <- my_data |>
  group_by(order_dow) |>
  summarise(
    total_items   = n(),
    missing_count = sum(is.na(days_since_prior_order)),
    .groups = "drop"
  ) |>
  mutate(
    pct_missing = round(missing_count / total_items * 100, 1)
  ) |>
  arrange(desc(pct_missing)) |>
  collect()

missing_by_group
# A tibble: 7 × 4
  order_dow total_items missing_count pct_missing
      <int>       <int>         <int>       <dbl>
1         0      600905         38517         6.4
2         6      448761         28691         6.4
3         2      467260         27861         6  
4         1      587478         34973         6  
5         3      436972         25658         5.9
6         5      453368         26073         5.8
7         4      426339         24436         5.7
# Visualize the pattern
missing_by_group |>
  ggplot(aes(
    x = reorder(order_dow, -pct_missing),
    y = pct_missing
  )) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Missing Data Varies by Day of Week",
    subtitle = "Missing days_since_prior_order by order_dow",
    x = "Day of Week (0 = Sunday)",
    y = "Percentage Missing (%)"
  )

Does missingness vary by group?

📝 Your Interpretation:

Answer these questions:

  • Do some groups have much more missing data than others?
  • Is there a logical reason why certain groups would have missing data?
  • Does this missingness seem random or systematic?

Your notes:

# Write your interpretation here:

Every order_dow shows a similar missing percentage (roughly 5.7%-6.4%), so no group has much more missing data than the others.

There is no meaningful variation across days of the week.

This supports the conclusion that the missingness is structural (tied to first orders) rather than random or group-driven.

Pattern 2: Missing Over Time

❓ EXPLORATORY QUESTION 3: Does data quality improve or worsen over time?

# Analyze missing data by order number
missing_by_time <- my_data |>
  group_by(order_number) |>
  summarise(
    total_records = n(), 
    missing_count = sum(is.na(days_since_prior_order)),
    .groups = "drop"
  ) |>
  mutate(
    pct_missing = round(missing_count / total_records * 100, 2)
  ) |>
  arrange(order_number) |>
  collect()

# Visualize temporal pattern
ggplot(missing_by_time, aes(x = order_number, y = pct_missing)) +
  geom_line(color = "#4575b4", linewidth = 1.2) +
  geom_point(color = "#4575b4", size = 3) +
  labs(
    title = "Missing Data Over Order Sequence",
    subtitle = "days_since_prior_order is missing only for first orders",
    x = "Order Number",
    y = "% Missing Data"
  ) +
  theme_minimal()

📝 Your Interpretation:

Answer these questions:

  • Is missingness improving, worsening, or staying constant over time?
  • What might explain any trends you observe?
  • Does this affect which time periods you should focus on?

Your notes:

# Write your interpretation here:

Missingness is not random: it is concentrated entirely at order_number = 1, where no prior order exists.

MAR (structural missingness) explains this pattern.

Data quality is consistent across the rest of the order history; after the first order, days_since_prior_order is always recorded.

Step 3: Classify Missingness Types

Understanding MCAR, MAR, and MNAR

Three types of missingness:

  1. MCAR (Missing Completely At Random):
    • Missing values are unrelated to any other variables
    • Example: Random data entry errors
    • Safe to delete or use simple imputation
  2. MAR (Missing At Random):
    • Missingness depends on observed variables
    • Example: Older records missing digital fields
    • Need to account for related variables
  3. MNAR (Missing Not At Random):
    • Missingness depends on the missing value itself
    • Example: High earners not reporting income
    • Most problematic - may bias results
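Before filling in the classification table below, the MAR/structural story can be verified directly in code; a minimal sketch, assuming the my_data Arrow connection from the setup above:

# If every missing days_since_prior_order falls on a customer's first order,
# the missingness depends only on an observed variable (MAR, structural).
my_data |>
  mutate(is_first_order = order_number == 1) |>
  group_by(is_first_order) |>
  summarise(
    n_orders  = n(),
    n_missing = sum(is.na(days_since_prior_order)),
    .groups = "drop"
  ) |>
  collect()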

📋 Classification Table

Create a table classifying each variable’s missingness type:

# Based on your exploration above, classify each variable
missingness_classification <- tribble(
  ~variable, ~pct_missing, ~type, ~evidence, ~impact_on_analysis,

  
  "days_since_prior_order",
  6.0, 
  "MAR (Structural)",
  "Missing only for first orders (order_number = 1), because no prior order exists",
  "Must create a first-order indicator; affects time-between-orders analysis",

 
  "order_hour_of_day",
  0,
  "MCAR / None",
  "No missingness detected in dataset",
  "No impact on analysis",

  
  "order_dow",
  0,
  "MCAR / None",
  "No missingness detected in dataset",
  "No impact on analysis"
)

missingness_classification |>
  arrange(desc(pct_missing))
# A tibble: 3 × 5
  variable               pct_missing type            evidence impact_on_analysis
  <chr>                        <dbl> <chr>           <chr>    <chr>             
1 days_since_prior_order           6 MAR (Structura… Missing… Must create a fir…
2 order_hour_of_day                0 MCAR / None     No miss… No impact on anal…
3 order_dow                        0 MCAR / None     No miss… No impact on anal…

📝 Your Classification Justification:

For each variable, explain WHY you classified it as MCAR, MAR, or MNAR:

days_since_prior_order — MAR (Structural)

Missingness occurs only for first-time customers. First orders cannot have a previous order, so missingness depends on an observed variable (order_number).

order_hour_of_day — MCAR / None

No missing values

order_dow — MCAR / None

No missing values

Step 4: Choose Handling Strategies

Decision Framework

Missingness Type | Data Type   | Recommended Strategy                | Why
-----------------|-------------|-------------------------------------|-------------------------------
MCAR             | Categorical | Delete or “Unknown” category        | Random - any method works
MCAR             | Numeric     | Mean/median imputation              | Simple and unbiased
MAR              | Categorical | Group-wise imputation or “Unknown”  | Account for related variables
MAR              | Numeric     | Group-wise median/mean              | Preserve group differences
MNAR             | Any         | Keep as missing + flag              | Imputation could bias results
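The "keep as missing + flag" row is the one strategy not implemented later in this section; a minimal sketch of what it could look like, assuming the same my_data connection (the indicator name is illustrative):

# Keep the NAs but add an explicit indicator so later analyses can condition on it
my_data_flagged <- my_data |>
  mutate(missing_days_since_prior = is.na(days_since_prior_order))

Downstream summaries can then group by the indicator instead of silently dropping rows.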

Strategy Implementation

For Categorical Variables

❓ EXPLORATORY QUESTION 4: What happens if we use different strategies for categorical variables?

library(arrow)
library(dplyr)

my_dataset <- open_dataset(
  "/Users/sriyavenkat/Downloads/archive/orders.csv",
  format = "csv"
)

my_data <- my_dataset

my_data_clean <- my_data |>
  mutate(
    order_dow_clean = case_when(
      is.na(order_dow) ~ "Unknown",
      TRUE ~ as.character(order_dow)
    )
  )

my_data_clean |>
  group_by(order_dow_clean) |>
  summarise(count = n(), .groups = "drop") |>
  arrange(desc(count)) |>
  collect()
# A tibble: 7 × 2
  order_dow_clean  count
  <chr>            <int>
1 0               600905
2 1               587478
3 2               467260
4 5               453368
5 6               448761
6 3               436972
7 4               426339

📝 Your Categorical Strategy:

Document your decision for each categorical variable with missing data:

Variable: order_dow

Strategy Chosen: "Unknown" category
Justification: No missing values for order_dow were actually detected in this dataset, so the "Unknown" mapping is a defensive safeguard rather than an active fix; any missing values that appear in a future extract stay visible instead of being silently dropped.

Impact on Analysis: Minimal - any "Unknown" cases would be rare, would not distort weekly patterns, and the mapping keeps data completeness transparent.

For Numeric Variables

❓ EXPLORATORY QUESTION 5: Should we use mean or median for numeric imputation?

#| label: numeric-distribution

# Examine distribution of numeric variable

numeric_distribution <- my_data %>%
  filter(!is.na(days_since_prior_order)) %>%
  select(days_since_prior_order) %>%
  collect()

# Plot distribution

ggplot(numeric_distribution, aes(x = days_since_prior_order)) +
  geom_histogram(bins = 50, fill = "#4575b4", color = "white") +
  labs(
    title = "Distribution of Days Since Prior Order",
    subtitle = "Distribution appears right-skewed due to long gaps for some users",
    x = "Days Since Prior Order",
    y = "Count"
  ) +
  theme_minimal()

# Calculate mean vs median

numeric_stats <- my_data %>%
  summarise(
    mean_val   = mean(days_since_prior_order, na.rm = TRUE),
    median_val = median(days_since_prior_order, na.rm = TRUE)
  ) %>%
  collect() %>%
  mutate(difference = mean_val - median_val)
Warning: median() currently returns an approximate median in Arrow
This warning is displayed once per session.
numeric_stats
# A tibble: 1 × 3
  mean_val median_val difference
     <dbl>      <dbl>      <dbl>
1     11.1       7.32       3.79

❓ EXPLORATORY QUESTION 6: Should we impute within groups or use global values?

numeric_data <- my_data %>%
  select(order_dow, days_since_prior_order) %>%
  collect()

groupwise_stats <- numeric_data %>%
  group_by(order_dow) %>%
  summarise(
    median_val = median(days_since_prior_order, na.rm = TRUE),
    mean_val   = mean(days_since_prior_order, na.rm = TRUE),
    count      = n(),
    .groups = "drop"
  )

groupwise_stats
# A tibble: 7 × 4
  order_dow median_val mean_val  count
      <int>      <dbl>    <dbl>  <int>
1         0          8     11.8 600905
2         1          8     11.3 587478
3         2          8     11.2 467260
4         3          7     10.8 436972
5         4          7     10.5 426339
6         5          7     10.5 453368
7         6          7     11.4 448761
my_data_numeric_clean <- numeric_data %>%
  left_join(
    groupwise_stats %>% select(order_dow, median_val),
    by = "order_dow"
  ) %>%
  mutate(
    days_since_prior_clean = case_when(
      is.na(days_since_prior_order) ~ as.double(median_val),
      TRUE ~ as.double(days_since_prior_order)
    )
  )

my_data_numeric_clean %>%
  summarise(
    missing_before = sum(is.na(days_since_prior_order)),
    missing_after  = sum(is.na(days_since_prior_clean))
  )
# A tibble: 1 × 2
  missing_before missing_after
           <int>         <int>
1         206209             0

📝 Your Numeric Strategy:

Document your decision for each numeric variable:

Variable: days_since_prior_order
Distribution: Right-skewed
Strategy Chosen: Group-wise median (by order_dow)

Justification:
Most customers reorder within a relatively short time, but a minority wait much longer, creating a strong right tail. With a skewed distribution the median is more robust than the mean, and computing it within order_dow groups preserves day-of-week differences.

Impact on Analysis:
This reduces bias from extreme values and produces a more stable feature for later summaries or modeling, while still preserving realistic patterns in time between orders.
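A quick sensitivity check helps defend this choice; a minimal sketch, assuming my_data_numeric_clean from the group-wise imputation chunk above:

# Compare the variable before and after imputation; large shifts in the mean or
# median would suggest the imputation is distorting the distribution.
my_data_numeric_clean %>%
  summarise(
    mean_before   = mean(days_since_prior_order, na.rm = TRUE),
    mean_after    = mean(days_since_prior_clean),
    median_before = median(days_since_prior_order, na.rm = TRUE),
    median_after  = median(days_since_prior_clean)
  )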

Step 5: Impact Assessment

❓ EXPLORATORY QUESTION 7: How do different strategies affect our analysis?

total_rows <- 3421083
missing_rows <- round(total_rows * 0.06)    
kept_delete <- total_rows - missing_rows     

strategy_comparison <- tibble(
  Strategy = c("Delete Missing", "Simple Imputation", "Group-wise Imputation", "Keep as NA"),
  Records_Kept = c(
    scales::comma(kept_delete),   # Delete Missing
    scales::comma(total_rows),    # Simple Imputation
    scales::comma(total_rows),    # Group-wise Imputation
    scales::comma(total_rows)     # Keep as NA
  ),
  Pros = c(
    "No assumptions made",
    "Simple and fast",
    "Accounts for patterns",
    "Honest about unknowns"
  ),
  Cons = c(
    "Loses ~6% of data",
    "Ignores patterns",
    "More complex",
    "Can't use in all analyses"
  )
)

strategy_comparison
# A tibble: 4 × 4
  Strategy              Records_Kept Pros                  Cons                 
  <chr>                 <chr>        <chr>                 <chr>                
1 Delete Missing        3,215,818    No assumptions made   Loses ~6% of data    
2 Simple Imputation     3,421,083    Simple and fast       Ignores patterns     
3 Group-wise Imputation 3,421,083    Accounts for patterns More complex         
4 Keep as NA            3,421,083    Honest about unknowns Can't use in all ana…

📝 Final Missing Data Summary:

Document your overall approach:

Overall Strategy: Used an "Unknown" category for missing categorical values and a group-wise median (by order_dow) for days_since_prior_order to preserve day-of-week patterns.

Most Challenging Decision:
Deciding whether to impute globally or within groups for days_since_prior_order.

Potential Limitations:
Imputation may smooth out real variation, and the "Unknown" category may mix different types of missingness.

Confidence in Approach: High
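For reproducibility, the whole approach can be expressed as one pipeline; a minimal sketch, assuming groupwise_stats from the imputation step above (the output name is illustrative):

# Apply both decisions in one pass: "Unknown" for any missing order_dow and
# group-wise median imputation for days_since_prior_order
my_data_clean_final <- my_data |>
  collect() |>
  left_join(groupwise_stats |> select(order_dow, median_val), by = "order_dow") |>
  mutate(
    order_dow_clean        = if_else(is.na(order_dow), "Unknown", as.character(order_dow)),
    days_since_prior_clean = coalesce(days_since_prior_order, median_val)
  )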

Part 5: Variable Selection & Focus

Building on your Part 2 analytical framework

Introduction

You started with 10+ variables. Now it’s time to narrow down to the 3-7 variables that matter most for your research questions.


Step 1: Variable Inventory

❓ EXPLORATORY QUESTION 8: What role does each variable play in our analysis?

Create a comprehensive inventory categorizing all variables:

variable_inventory <- tribble(
  ~variable, ~role, ~data_type, ~usefulness, ~notes,

  "order_id", "Identifier", "integer", "Low", "Unique per order; not used in analysis",
  "user_id", "Identifier", "integer", "Medium", "Useful for grouping or customer-level patterns",
  "eval_set", "Metadata", "character", "Low", "Not meaningful for analysis; only describes dataset split",
  "order_number", "Predictor", "integer", "High", "Shows customer ordering sequence; useful for behavior trends",
  "order_dow", "Predictor", "integer", "High", "Strong temporal signal; helps answer order-timing questions",
  "order_hour_of_day", "Predictor", "integer", "High", "Key variable for understanding ordering time patterns",
  "days_since_prior_order", "Predictor", "numeric", "High", "Important for reorder behavior; has missing values that require handling"
)

variable_inventory |>
  arrange(desc(usefulness), role)
# A tibble: 7 × 5
  variable               role       data_type usefulness notes                  
  <chr>                  <chr>      <chr>     <chr>      <chr>                  
1 user_id                Identifier integer   Medium     Useful for grouping or…
2 order_id               Identifier integer   Low        Unique per order; not …
3 eval_set               Metadata   character Low        Not meaningful for ana…
4 order_number           Predictor  integer   High       Shows customer orderin…
5 order_dow              Predictor  integer   High       Strong temporal signal…
6 order_hour_of_day      Predictor  integer   High       Key variable for under…
7 days_since_prior_order Predictor  numeric   High       Important for reorder …

📝 Your Variable Assessment:

For each variable, justify its potential usefulness:

High Usefulness Variables (keep):

order_dow

order_hour_of_day

days_since_prior_order

order_number

Medium Usefulness:

user_id (only useful if doing customer-level segmentation)

Low Usefulness (exclude):

order_id

eval_set

Step 2: Examine Relationships

For Numeric Variables: Correlation Analysis

❓ EXPLORATORY QUESTION 9: Which numeric variables are correlated with each other?

# Sample data for correlation (if dataset is large)
correlation_data <- my_data |>
  select(where(is.numeric)) |>   # Select only numeric columns
  collect() |>
  sample_n(min(10000, n()))      # Sample if large dataset

# Calculate correlation matrix
cor_matrix <- correlation_data |>
  correlate() |>
  rearrange()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
cor_matrix
# A tibble: 6 × 7
  term                order_number order_dow order_id  user_id order_hour_of_day
  <chr>                      <dbl>     <dbl>    <dbl>    <dbl>             <dbl>
1 order_number           NA        -0.000206 -0.00855 -0.0238           -0.0606 
2 order_dow              -0.000206 NA        -0.0111   0.00655           0.0140 
3 order_id               -0.00855  -0.0111   NA        0.00223           0.00502
4 user_id                -0.0238    0.00655   0.00223 NA                -0.00801
5 order_hour_of_day      -0.0606    0.0140    0.00502 -0.00801          NA      
6 days_since_prior_o…    -0.354    -0.0293    0.0108   0.0186            0.00255
# ℹ 1 more variable: days_since_prior_order <dbl>
# Visualize correlations
cor_matrix |>
  stretch() |>
  ggplot(aes(x = x, y = y, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(low = "#d73027", mid = "white", high = "#4575b4",
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  labs(
    title = "Correlation Matrix of Numeric Variables",
    subtitle = "Identifies redundancy or strong linear relationships",
    x = NULL,
    y = NULL
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

📝 Correlation Interpretation:

Answer these questions:

  • Which variables are strongly correlated (|r| > 0.7)?
  • Are any of these redundant? Can we drop one?
  • Do any correlations surprise you?

Your notes:

Strong correlations:

None - no pair of variables reaches |r| > 0.7, so there is no redundancy that forces us to drop a variable.

The strongest relationship is order_number vs. days_since_prior_order (r ≈ -0.35), a weak-to-moderate negative correlation: higher order numbers usually belong to regular customers, who have shorter gaps between orders.


Low:

order_dow and days_since_prior_order

order_hour_of_day and days_since_prior_order

order_number and order_dow

For Categorical Variables: Relationship Analysis

❓ EXPLORATORY QUESTION 10: How do categorical variables relate to our outcome?

relationship_analysis <- my_data |>
  group_by(order_dow) |>
  summarise(
    count = n(),
    mean_days = mean(days_since_prior_order, na.rm = TRUE),
    median_days = median(days_since_prior_order, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_days)) |>
  collect()

relationship_analysis
# A tibble: 7 × 4
  order_dow  count mean_days median_days
      <int>  <int>     <dbl>       <dbl>
1         0 600905      11.8        7.96
2         6 448761      11.4        7.36
3         1 587478      11.3        7.64
4         2 467260      11.2        7.99
5         3 436972      10.8        7.35
6         4 426339      10.5        7.00
7         5 453368      10.5        7   
# Visualization
ggplot(relationship_analysis,
       aes(x = reorder(as.factor(order_dow), mean_days),
           y = mean_days)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Average Days Since Prior Order by Day of Week",
    subtitle = "Higher values indicate customers go longer between orders",
    x = "Day of Week (0 = Sunday)",
    y = "Mean Days Since Prior Order"
  ) +
  theme_minimal()

📝 Your Interpretation:

Does this categorical variable show meaningful differences?
Yes. The mean and median of days_since_prior_order differ across order_dow, even if the differences are modest.

Should it be kept in the final variable set?
Yes. order_dow is a useful predictor for order timing and should stay in the final variable set.

Step 3: Final Variable Selection

❓ EXPLORATORY QUESTION 11: Which 3-7 variables best answer our research questions?

Based on your inventory, correlations, and relationships, select your final variable set:

# Final selected variables for analysis
my_data_focused <- my_data |>
  select(
    order_id,
    user_id,
    order_dow,
    order_hour_of_day,
    days_since_prior_order
  )

# Verify selection
glimpse(my_data_focused)
FileSystemDataset with 1 csv file (query)
3,421,083 rows x 5 columns
$ order_id                <int64> 2539329, 2398795, 473747, 2254736, 431534, 336…
$ user_id                 <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2…
$ order_dow               <int64> 2, 3, 3, 4, 4, 2, 1, 1, 1, 4, 4, 2, 5, 1, 2, 3…
$ order_hour_of_day       <int64> 8, 7, 12, 7, 15, 7, 9, 14, 16, 8, 8, 11, 10, 1…
$ days_since_prior_order <double> NA, 15, 21, 29, 28, 19, 20, 14, 0, 30, 14, NA,…
Call `print()` for query details

📋 Final Variable Selection Table

final_variables <- tribble(
  ~variable, ~role, ~why_included, ~research_question_addressed,

  "order_id", "Identifier", 
  "Needed to group orders and calculate order-level patterns.", 
  "Supports all questions by defining the unit of analysis.",

  "user_id", "Identifier / Segmenting Variable", 
  "Allows grouping by customer to understand ordering behavior patterns.", 
  "Q1: Which customers reorder most? Q3: How timing varies by user groups.",

  "order_dow", "Predictor (Temporal)", 
  "Captures weekly timing behavior and peak order days.", 
  "Q3: When do people place the largest orders?",

  "order_hour_of_day", "Predictor (Temporal)", 
  "Shows hour-of-day ordering patterns; important for operational insights.", 
  "Q3: What times are peak ordering hours?",

  "days_since_prior_order", "Outcome / Behavioral Metric", 
  "Measures frequency of orders and recency effects on reorder likelihood.", 
  "Q1: What drives reorder likelihood? Q3: How order timing affects patterns."
)

final_variables
# A tibble: 5 × 4
  variable               role                why_included research_question_ad…¹
  <chr>                  <chr>               <chr>        <chr>                 
1 order_id               Identifier          Needed to g… Supports all question…
2 user_id                Identifier / Segme… Allows grou… Q1: Which customers r…
3 order_dow              Predictor (Tempora… Captures we… Q3: When do people pl…
4 order_hour_of_day      Predictor (Tempora… Shows hour-… Q3: What times are pe…
5 days_since_prior_order Outcome / Behavior… Measures fr… Q1: What drives reord…
# ℹ abbreviated name: ¹​research_question_addressed

📝 Variables Excluded and Why:


product_id / product-level attributes (aisle, department, product_name)
Reason for exclusion: Incorporating them would require merging multiple large datasets, which is outside the scope of this assignment and not needed to answer my timing-based research questions.

Step 4: Tool Selection Documentation

❓ EXPLORATORY QUESTION 12: Are we using the right tools for our data size?

Document which tools you’re using and why:

# Calculate your dataset characteristics
dataset_stats <- my_data |>
  summarise(
    total_rows = n(),
    total_cols = ncol(my_data)
  ) |>
  collect()

# Estimate size
# Note: This is approximate
estimated_size <- dataset_stats$total_rows * dataset_stats$total_cols * 8 / 1e9  # GB

glue("
Dataset Statistics:
- Rows: {scales::comma(dataset_stats$total_rows)}
- Columns: {dataset_stats$total_cols}
- Estimated Size: ~{round(estimated_size, 1)} GB
")
Dataset Statistics:
- Rows: 3,421,083
- Columns: 7
- Estimated Size: ~0.2 GB

📝 Tool Selection Justification:

Tools Being Used:
☐ Arrow (< 5GB, simple operations) - yes!
☐ DuckDB (5-50GB, complex analytics) - yes!
☐ Spark (50GB+, distributed computing) - no!

Justification for Tool Choice:
My dataset has approximately 3.4 million rows and 7 columns (~0.2 GB), which is well below the 5 GB range where Arrow alone is comfortable, so Arrow was my primary tool. DuckDB was also loaded so that heavier SQL-style aggregations could be pushed to an in-process database when convenient.

Alternative Tools Considered:
Spark
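For reference, the same kind of aggregation can be pushed to DuckDB's SQL engine; a minimal sketch, assuming the same orders.csv path used above (duck_con is just an illustrative connection name):

library(DBI)
library(duckdb)

# Run one of the earlier aggregations through DuckDB instead of Arrow
duck_con <- dbConnect(duckdb::duckdb())

dbGetQuery(duck_con, "
  SELECT order_dow, COUNT(*) AS total_orders
  FROM read_csv_auto('/Users/sriyavenkat/Downloads/archive/orders.csv')
  GROUP BY order_dow
  ORDER BY total_orders DESC
")

dbDisconnect(duck_con, shutdown = TRUE)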

Part 6: Exploratory Visualizations & Analysis

Building on your Part 3 notable segments

Introduction

Now that you have clean data and focused variables, it’s time to create compelling visualizations that answer your research questions. You’ll create 5+ visualizations using at least 3 different chart types.

For each visualization, you must: 1. State the exploratory question 2. Create a publication-quality chart 3. Interpret the patterns you find 4. Explain implications for stakeholders


Visualization 1: Bar Graph

❓ EXPLORATORY QUESTION 13: How are Instacart orders distributed across the days of the week?


viz_data_1 <- my_data_focused %>%
  group_by(order_dow) %>%
  summarise(
    total_orders = n(),
    .groups = "drop"
  ) %>%
  arrange(order_dow) %>%
  collect()

ggplot(viz_data_1, aes(x = factor(order_dow), y = total_orders)) +
  geom_col(fill = "#4575b4") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Instacart Orders by Day of Week",
    x = "Day of Week (0 = Sunday)",
    y = "Total Number of Orders",
    caption = "Source: Instacart Orders Dataset"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11)
  )

Total number of Instacart orders by day of week

📝 Interpretation:

Pattern Observed:

Describe what you see in the visualization:
Orders are not evenly distributed across the week: days 0 and 1 (Sunday and Monday under the 0 = Sunday labeling) have clearly higher order volumes than the midweek days.

Statistical Evidence:

Any specific numbers or trends to highlight:
Day 0 (Sunday) has the highest count (about 601,000 orders) and day 4 (Thursday) the lowest (about 426,000), which fits the intuition of weekend stock-up shopping.

Stakeholder Implications:

What does this mean for decision-makers?
This weekly pattern can help Instacart plan staffing, delivery capacity, and promotions around the busiest days.

Visualization 2: Bar Chart

❓ EXPLORATORY QUESTION 14: What are the peak ordering hours throughout the day?


#| label: viz-2
#| fig-cap: "Total number of Instacart orders by hour of day"
#| fig-width: 10
#| fig-height: 6

# Aggregate orders by hour of day
viz_data_2 <- my_data_focused %>%
  group_by(order_hour_of_day) %>%
  summarise(
    total_orders = n(),
    .groups = "drop"
  ) %>%
  arrange(order_hour_of_day) %>%
  collect()

ggplot(viz_data_2, aes(x = order_hour_of_day, y = total_orders)) +
  geom_col(fill = "#4575b4") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Instacart Orders by Hour of Day",
    subtitle = "Order volume peaks in the late morning and early afternoon",
    x = "Hour of Day (0–23)",
    y = "Total Orders",
    caption = "Source: Instacart Orders Dataset"
  ) +
  theme_minimal()

📝 Interpretation:

Pattern Observed:

Order volume varies substantially over the day: it peaks in the late morning and early afternoon, while the very early morning hours show minimal activity.

Statistical Evidence:

The highest order counts occur around hours such as 10–15, while hours 0–5 show low activity.

Stakeholder Implications:

This helps Instacart optimize delivery staffing, batching algorithms, and promotional timing; marketing teams can target notifications during peak engagement windows.

Visualization 3: Boxplot

❓ EXPLORATORY QUESTION 15: How does the distribution of time between orders vary across days of the week?


viz_data_3 <- my_data_focused |>
  filter(!is.na(days_since_prior_order)) |>
  collect()

ggplot(viz_data_3, aes(x = as.factor(order_dow), y = days_since_prior_order)) +
  geom_boxplot() +
  labs(
    title = "Days Since Prior Order by Day of Week",
    subtitle = "Some days show longer gaps between orders than others",
    x = "Day of Week (0 = Sunday)",
    y = "Days Since Prior Order",
    caption = "Source: Instacart Orders Dataset"
  ) +
  theme_minimal()

Distribution of days since prior order by day of week

📝 Interpretation:

Pattern Observed:

The distribution of days_since_prior_order varies by day of week, with some days showing noticeably higher medians and wider spreads than others.

Statistical Evidence:

Some days have a higher median and larger IQR for days_since_prior_order, while others cluster more tightly around lower values, indicating shorter and more consistent reorder gaps.

Stakeholder Implications:

Reorder timing depends partly on the day of week, so planning promotions or capacity around "long-gap" versus "short-gap" days could better match customer behavior.

Visualization 4: Line Chart

❓ EXPLORATORY QUESTION 16: How does the average time between orders change as customers place more orders over their lifetime?

viz_data_4 <- my_data |>
  filter(!is.na(days_since_prior_order)) |>
  group_by(order_number) |>
  summarise(
    mean_days = mean(days_since_prior_order, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  ) |>
  arrange(order_number) |>
  filter(order_number <= 50) |>
  collect()

ggplot(viz_data_4, aes(x = order_number, y = mean_days)) +
  geom_line(linewidth = 1.2, color = "#4575b4") +
  geom_point(size = 2) +
  labs(
    title = "Average Time Between Orders by Customer Order Number",
    subtitle = "Later orders often happen on shorter time scales as customers get used to the service",
    x = "Order Number (per customer)",
    y = "Average Days Since Prior Order",
    caption = "Source: Instacart Orders Dataset (first 50 orders)"
  ) +
  theme_minimal()

Average days since prior order by customer order number

📝 Interpretation:

Pattern Observed:

As the order number increases, the average time between orders generally decreases and then levels off

Statistical Evidence:

The line trends downward across early order numbers before flattening, showing a clear reduction in mean days_since_prior_order.

Stakeholder Implications:

Once customers reach a steady “habit” phase, they tend to order more regularly, increasing their long-term value.

Visualization 5: Scatter Plot

❓ EXPLORATORY QUESTION 17: Is there a relationship between the time of day an order is placed and how long it has been since the customer’s prior order?

set.seed(123)

viz_data_5 <- my_data_focused |>
  filter(!is.na(days_since_prior_order)) |>
  collect() |>
  sample_n(min(5000, n()))   # sample to keep the plot readable

ggplot(viz_data_5, aes(x = order_hour_of_day, y = days_since_prior_order)) +
  geom_point(alpha = 0.3, color = "#4575b4") +
  geom_smooth(method = "loess", se = TRUE, color = "#d73027") +
  labs(
    title = "Days Since Prior Order vs. Hour of Day",
    subtitle = "Smoothed curve shows whether certain hours are associated with longer gaps",
    x = "Order Hour of Day (0–23)",
    y = "Days Since Prior Order",
    caption = "Source: Instacart Orders Dataset (sampled n ≤ 5,000)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Relationship between order hour and days since prior order

📝 Interpretation:

Pattern Observed:

There is no strong, simple relationship between order hour and days_since_prior_order, though the smooth curve may show mild variation across certain hours.

Statistical Evidence:

The cloud of points is fairly dispersed, and the smoothed line changes gradually rather than showing sharp peaks or drops, suggesting only weak dependence of reorder gap on time of day.

Stakeholder Implications:

Since time of day is not a major driver of how long customers wait between orders, it may be more effective to focus on other factors when designing strategies for retention.

Additional Visualizations (Optional)

Add more visualizations if needed to fully explore your research questions. Remember: quality over quantity!


Visualization Summary

❓ EXPLORATORY QUESTION 18: What’s the overall story from our visualizations?

visualization_summary <- tribble(
  ~viz_number, ~chart_type, ~key_finding, ~supports_question,
  1, "Bar chart", "Order volume varies by day of week; some days are clearly busier.", "Q13",
  2, "Bar chart", "Orders peak during late morning and early afternoon hours.", "Q14",
  3, "Boxplot", "Time between orders has different distributions across days of week.", "Q15",
  4, "Line chart", "Average time between orders decreases as customers place more orders.", "Q16",
  5, "Scatter plot", "Time since prior order shows only a weak relationship with order hour.", "Q17"
)

visualization_summary
# A tibble: 5 × 4
  viz_number chart_type   key_finding                          supports_question
       <dbl> <chr>        <chr>                                <chr>            
1          1 Bar chart    Order volume varies by day of week;… Q13              
2          2 Bar chart    Orders peak during late morning and… Q14              
3          3 Boxplot      Time between orders has different d… Q15              
4          4 Line chart   Average time between orders decreas… Q16              
5          5 Scatter plot Time since prior order shows only a… Q17              

📝 Narrative Synthesis:

Connect your visualizations into a coherent story:

The Big Picture:
Across all five visualizations, the point is that the ordering behavior is pretty structured and not random. Customers tend to order more on specific days and hours and the time between orders changes as they become more established users. Temporal patterns (day of week, hour of day, and order sequence) are key drivers of how and when people place orders.

Surprising Findings:
It was somewhat surprising that the relationship between order hour and days since prior order was weak compared to the clear patterns by day of week and order number.

Patterns Across Visualizations:
All in all, the visualizations suggest that weekly cycles and customer maturity are more important than the exact time of day in understanding Instacart demand.

Part 7: Stakeholder Communication

Telling the complete story

Introduction

You’ve done deep analysis - now it’s time to communicate your findings effectively to decision-makers who may not be data experts.


Executive Summary

❓ EXPLORATORY QUESTION 19: If a stakeholder has only 2 minutes, what must they know?

Write a 300-500 word executive summary using the BLUF (Bottom Line Up Front) approach:

[Your Project Title] - Key Findings

Bottom Line: [One sentence summarizing your most important finding]


Context: [2-3 sentences: What dataset did you analyze? Why does it matter?]

Key Findings:

  1. [First Major Finding]: [2-3 sentences explaining the finding and its implications]

  2. [Second Major Finding]: [2-3 sentences explaining the finding and its implications]

  3. [Third Major Finding]: [2-3 sentences explaining the finding and its implications]

Recommendations:

  1. [Actionable recommendation based on findings]

  2. [Actionable recommendation based on findings]

  3. [Actionable recommendation based on findings]

Data Quality Note: [1-2 sentences about any important limitations]

Executive Summary

Problem statement: Online grocery shoppers follow predictable, time-bound routines - weekend morning orders and repeated staple purchases dominate behavior.

Context:
● Dataset: 3.4 million Instacart orders from roughly 200k U.S. customers.
● Includes order timing, sequence, and product details.
● Reveals how people shop online → supports marketing, logistics, and personalization.

Key Findings:
● Peak ordering on Sundays 9 AM–2 PM → ideal window for promotions & delivery planning.
● ~65% of items are reorders, mainly staples (bananas, milk, eggs) → routine loyalty drives sales.
● Common basket pairings (produce + dairy + bakery) → opportunity for bundle offers & cross-selling.
● About 11 days between orders on average (median ≈ 7) → predictable reorder cycles for retention campaigns.

Finding 1 – Peak weekend-morning activity appears in the data
● Pattern: Orders cluster between 9 AM – 2 PM on Sundays, suggesting a weekend stock-up habit.
● Recommendation: Marketing should schedule push notifications or promotions for Saturday evening to nudge weekend orders earlier, balancing fulfillment load.

Finding 2 – Reorder behavior relates to staple items
● Pattern: High reorder counts appear in produce and dairy, indicating routine restocking.
● Recommendation: Product and personalization teams should highlight recurring essentials (e.g., “Add your usual items?”) to improve reorder conversion.


Hero Visualization

❓ EXPLORATORY QUESTION 20: Which single visualization tells our story best?

Create ONE publication-quality “hero” visualization that could stand alone in a presentation or report. This should be self-explanatory for non-technical audiences.

hero_data <- my_data |>
  group_by(order_dow, order_hour_of_day) |>
  summarise(
    num_orders = n(),
    .groups = "drop"
  ) |>
  collect() |>
  mutate(
    day_of_week = factor(
      order_dow,
      levels = 0:6,
      labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
    )
  ) |>
  arrange(day_of_week, order_hour_of_day)

ggplot(hero_data,
       aes(x = order_hour_of_day,
           y = num_orders,
           color = day_of_week)) +
  geom_line(linewidth = 1.1) +
  labs(
    title = "Instacart Orders Surge on Weekend Mornings",
    subtitle = "Order activity peaks between 9 AM and 2 PM on Sundays",
    x = "Hour of Day",
    y = "Number of Orders",
    color = "Day of Week",
    caption = "Source: Instacart Online Grocery Shopping Dataset (3.4M orders, 2017)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 10),
    legend.position = "bottom"
  )

Instacart orders by hour of day and day of week

📝 Hero Visualization Explanation:

Why this visualization?

Explain why you chose this as your hero visualization:
It clearly shows how Instacart order volume changes hour by hour across different days of the week, highlighting strong temporal patterns in customer behavior.

What should viewers notice first?

The sharp surge in orders on weekend mornings, especially Sundays, where activity spikes between 9 AM and 2 PM.

The “so what” factor:

Why does this matter to stakeholders?
This pattern directly informs staffing, inventory planning, and targeted marketing. Knowing when customers shop the most allows Instacart to optimize delivery capacity.

Limitations

❓ EXPLORATORY QUESTION 21: What are the honest limitations of our analysis?

Be transparent about what this analysis can and cannot tell us:

Data Limitations:

Missing values in days_since_prior_order
First-time orders have no prior gap, which creates systematic missingness.

Lack of demographic or geographic information
The dataset doesn’t include customer demographics, location, income, or household size. This limits our ability to explain why certain behaviors occur.

Methodological Limitations:

Aggregated analysis may hide individual-level variability
Visualizations summarize millions of orders, but customers behave differently.

Correlations do not imply causation
Patterns like increased weekend orders show associations, but they do not prove that the day of week causes higher volumes.

Scope Limitations:

We cannot estimate revenue or profit
The dataset includes products and timing, but no pricing, discounts, or promo data, preventing financial analysis or business forecasting.

Future analysis could incorporate machine learning or customer segmentation
Techniques like clustering, survival analysis, or repeat-purchase modeling would provide deeper insight into long-term customer behavior.

Actionable Recommendations

❓ EXPLORATORY QUESTION 22: Based on our findings, what should stakeholders do?

Provide 2-3 concrete, actionable recommendations:

Recommendation 1: Marketing Techniques

Finding it’s based on: Peak weekend-morning ordering (hero visualization and hour-of-day bar chart)

Proposed action:

Marketing should schedule push notifications or promotions for Saturday evening to nudge weekend orders earlier, balancing fulfillment load.

Expected impact:

Earlier weekend ordering would spread out the Sunday fulfillment peak, so deliveries run more smoothly and customers have a better overall experience on Instacart.

Implementation difficulty: ☐ Easy ☐ Moderate ☐ Difficult


Recommendation 2: Reorder Personalization

Finding it’s based on: Reorder behavior relates to staple items

Proposed action:

Product and personalization teams should highlight recurring essentials to improve reorder conversion.

Expected impact:

Customers will find the app more efficient and will save time every time they order.

Implementation difficulty: ☐ Easy ☐ Moderate ☐ Difficult


Recommendation 3: Staffing Changes

Finding it’s based on: Order timing differs by day of week

Proposed action:

Operations could staff warehouses and delivery slots more heavily on weekends and optimize weekday delivery windows for smaller baskets.

Expected impact:

Better-matched staffing means shipments go out faster during peak windows, workers are under less pressure, and customers are happier.

Implementation difficulty: ☐ Easy ☐ Moderate ☐ Difficult


Hypothesis Development

❓ EXPLORATORY QUESTION 23: Based on our EDA, what hypotheses can we test in future work?

After exploratory analysis, we can formulate testable hypotheses:

Hypothesis 1: Weekly Pattern in Order Volume

Null Hypothesis (H₀):

Average order volume is the same across all days of the week.

Alternative Hypothesis (H₁):

Average order volume differs by day of the week.

Evidence from EDA:

Bar charts and the hero visualization show clear peaks on weekends, especially Sundays, with visibly higher order counts compared to midweek days. This suggests that day of week is associated with order volume.

How to test:

Run a one-way ANOVA comparing mean order counts across days of the week. If the ANOVA assumptions (normality, equal variances) are not met, use a Kruskal-Wallis test instead, and follow up with pairwise comparisons to see which days differ.
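Since this extract has no calendar dates, per-day order counts for an ANOVA are not directly available; a simpler formalization that works with the columns on hand is a chi-square goodness-of-fit test of whether orders are spread uniformly across the week (a sketch, assuming my_data_focused from Part 5):

# Test H0: orders are uniformly distributed across the 7 days of the week
dow_counts <- my_data_focused |>
  group_by(order_dow) |>
  summarise(n = n(), .groups = "drop") |>
  arrange(order_dow) |>
  collect()

chisq.test(dow_counts$n, p = rep(1 / 7, 7))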


Hypothesis 2: Order Gap Shrinks as Customers Become More Active

Null Hypothesis (H₀):

There is no relationship between a customer’s order number and the time since their prior order.

Alternative Hypothesis (H₁):

As order number increases, the time since the prior order decreases.

Evidence from EDA:

The line chart of average days since prior order by order number shows a downward trend for early orders that later levels off.

How to test:

Fit a simple linear regression with days_since_prior_order as the response and order_number as the predictor, and test whether the slope is significantly negative.
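A minimal sketch of that test, assuming my_data from the setup above and sampling for speed (the 100,000-row sample size is arbitrary):

# Simple linear regression of the reorder gap on order number
set.seed(42)
h2_sample <- my_data |>
  filter(!is.na(days_since_prior_order)) |>
  select(order_number, days_since_prior_order) |>
  collect() |>
  sample_n(100000)

h2_fit <- lm(days_since_prior_order ~ order_number, data = h2_sample)
summary(h2_fit)$coefficients   # check sign and significance of the order_number slope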


Presentation Plan

❓ EXPLORATORY QUESTION 24: How do we tell this story in 5 minutes?

Create an outline for a 5-minute presentation:

Slide 1: Title & Hook (30 seconds)

Title: Exploring Online Grocery Shopping Patterns: An Instacart Data Analysis

Hook: Picture millions of grocery carts filling up across America - not in stores, but on screens. What do these digital baskets reveal about our habits, cravings, and routines?

Slide 2: Context & Question (45 seconds)

Online grocery shoppers follow predictable, time-bound routines - weekend morning orders and repeated staple purchases dominate behavior.

Context:
● Dataset: 3.4 million Instacart orders from roughly 200k U.S. customers.
● Includes order timing, sequence, and product details.
● Reveals how people shop online → supports marketing, logistics, and personalization.

Slide 3: Key Finding #1 (60 seconds)

Visualization: 2 (orders by hour of day)
Finding: Peak ordering on Sundays 9 AM–2 PM → ideal window for promotions & delivery planning.
Implication: Shows when the marketing team should deploy targeted strategies to drive customer engagement.

Slide 4: Key Finding #2 (60 seconds)

Visualization: Hero (orders by hour and day of week)
Finding: About 11 days between orders on average → predictable reorder cycles for retention campaigns.

Slide 5: Recommendations (60 seconds)

Finding 1 – Peak weekend-morning activity appears in the data
● Pattern: Orders cluster between 9 AM – 2 PM on Sundays, suggesting a weekend stock-up habit.
● Recommendation: Marketing should schedule push notifications or promotions for Saturday evening to nudge weekend orders earlier, balancing fulfillment load.

Finding 2 – Reorder behavior relates to staple items
● Pattern: High reorder counts appear in produce and dairy, indicating routine restocking.
● Recommendation: Product and personalization teams should highlight recurring essentials to improve reorder conversion.

Finding 3 – Order timing differs by day of week
● Pattern: Weekdays show flatter, smaller peaks - likely quick replenishment or work-break purchases.
● Recommendation: Operations could staff warehouses and delivery slots more heavily on weekends and optimize weekday delivery windows for smaller baskets.

Slide 6: Questions & Next Steps (45 seconds)

Next-Step Hypothesis: From this exploratory analysis, we hypothesize that reorder frequency and weekend activity are driven by household scheduling and income stability.

Thank you + questions

Final Reflection

❓ EXPLORATORY QUESTION 25: What did we learn from this entire process?

Take a moment to reflect on your complete analysis journey:

Most Surprising Finding:

The most surprising finding was how strongly daily and weekly patterns shaped customer ordering behavior; the Sunday peak in particular was far more pronounced than I expected.

Biggest Challenge:

The biggest challenge was diagnosing the missingness in days_since_prior_order and choosing handling strategies that aligned with the business context.

What You’d Do Differently:

If I were to redo the project, I would pull a smaller sample of the dataset into local memory earlier in the EDA process so that iterating on visualizations would be easier and faster.

Skills Developed:

Identifying and diagnosing missingness patterns, working with Arrow on larger-than-memory workflows, and forming testable hypotheses from exploratory findings.

Confidence in Findings: ☐ High ☐ Medium ☐ Low

Explanation:

I'm generally convinced that the findings hold up, but the data cleaning and the volume of missing values worry me a bit; I'm not sure how much the imputation choices may have affected the results.

Cleanup

# Close database connection if using DuckDB
if (exists("con")) {
  dbDisconnect(con, shutdown = TRUE)
  glue("✅ Database connection closed successfully")
}

Deliverables Checklist

Before submitting, ensure you have completed:

Part 4: Missing Data Analysis ✓

Part 5: Variable Selection ✓

Part 6: Exploratory Visualizations ✓

Part 7: Stakeholder Communication ✓

Overall Quality ✓


Grading Criteria

Part 4: Missing Data Analysis (20%) - Systematic quantification and pattern identification - Evidence-based classification (MCAR/MAR/MNAR) - Thoughtful handling strategies with justifications - Impact assessment

Part 5: Variable Selection (15%) - Comprehensive variable inventory - Appropriate relationship analysis - Justified final variable set (3-7 variables) - Clear documentation of exclusions

Part 6: Exploratory Visualizations (30%) - 5+ publication-quality visualizations - Variety in chart types (3+ different types) - Clear exploratory questions - Insightful interpretations - Stakeholder-focused implications - Coherent narrative synthesis

Part 7: Stakeholder Communication (25%) - Clear, concise executive summary (BLUF) - Compelling hero visualization - Honest limitations discussion - Actionable recommendations - Well-structured presentation plan - Testable hypotheses developed

Professional Communication (10%) - Code organization and documentation - Clear writing throughout - Logical flow of analysis - Appropriate use of visualizations - Professional presentation


Tips for Success

  1. Start Early: This is substantial work - don’t wait until the last minute!

  2. Be Honest: Acknowledge limitations and uncertainties in your data

  3. Think Like a Stakeholder: Every finding should answer “so what?”

  4. Quality Over Quantity: Better to have 5 excellent visualizations than 10 mediocre ones

  5. Tell a Story: Connect your findings into a coherent narrative

  6. Document Everything: Explain your reasoning for all major decisions

  7. Ask for Help: Use office hours if you’re stuck on any section

  8. Iterate: Review and refine your work before submitting

  9. Proofread: Check for typos and ensure all code runs

  10. Be Specific: Avoid vague statements - provide evidence and examples


Remember: This is exploratory data analysis - you’re building understanding and generating insights, not proving predetermined hypotheses. Let your curiosity guide you while maintaining systematic rigor!

NSF Acknowledgement: This material is based upon work supported by the National Science Foundation under Grant #DGE-2222148.