Phillies Associate Software Engineer, Data Engineering – Interview Study Guide


PART 1: YOUR CODE – FUNCTION-BY-FUNCTION BREAKDOWN

Know every function, every decision. They will ask you to walk through your code.


pipeline.py

load_raw_data(filepath: str) -> list[dict]

What it does: Opens batch_raw.json and parses it into a list of Python dictionaries using json.load().

Why this approach: json.load() reads the entire file into memory at once. This is fine for 1,254 records but would not scale to millions. If asked “what would you change at scale,” say: streaming JSON parser like ijson, or switch to a columnar format like Parquet.

Be ready to discuss: - json.load() vs json.loads() – load reads from a file object, loads reads from a string - Why you used a context manager (with open(...)) – ensures the file handle is closed even if an exception occurs


clean_value(value: Any) -> Any

What it does: Converts empty dicts {} and empty lists [] to None. Returns everything else unchanged.

Why it exists: The raw JSON data uses {} to represent missing values in some fields (e.g., "action": {}). Pandas and SQL handle None/NaN natively but would treat {} as a valid value, which would corrupt downstream analysis.

Design decision: You chose to centralize this cleaning logic in one function rather than repeating if value == {} checks everywhere. This is the DRY principle (Don’t Repeat Yourself).


deduplicate_samples(samples: list[dict]) -> list[dict]

What it does: Removes duplicate tracking samples that share the same timestamp within a single pitch. Keeps the first occurrence of each timestamp.

Why it’s needed: The raw tracking data contained 4,254 duplicate samples across all 1,254 records. These are sensor artifacts – the tracking system sometimes records the same frame twice.

How it works: Uses a set to track seen timestamps. Iterates through samples and only keeps those with timestamps not already in the set. Sets have O(1) lookup, making this O(n) overall.

Be ready to discuss: - Why per-pitch deduplication (not global) – two different pitches can legitimately have the same relative timestamp - Why keep first occurrence – assumption is the first recording is the original; the duplicate is the artifact - Weakness: if the second occurrence were a corrected value, you’d lose it. You could mitigate this by comparing position data between duplicates.
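A minimal sketch of this dedup pattern, in case you're asked to reproduce it on a whiteboard (the "time" key is an assumption, not necessarily the exact field name in the pipeline):

def deduplicate_samples(samples):
    seen = set()        # timestamps already kept
    unique = []
    for sample in samples:
        ts = sample.get("time")    # assumed timestamp field
        if ts not in seen:
            seen.add(ts)
            unique.append(sample)  # keep the first occurrence only
    return unique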


extract_nested_value(data: dict, *keys, default=None) -> Any

What it does: Safely navigates deeply nested dictionaries/lists. For example, extract_nested_value(data, "summary_acts", "pitch", "speed", "mph") walks through 4 levels of nesting without crashing if any level is missing.

Why it exists: The raw JSON has deeply nested structures (e.g., record["summary_acts"]["pitch"]["speed"]["mph"]). Direct access would throw KeyError or TypeError if any level is missing. This function returns None (or a custom default) instead.

How it works:
  1. Iterates through the provided keys.
  2. At each level, checks whether the current value is a dict (key lookup) or a list (index lookup).
  3. If any key is missing, stops and returns the default.
  4. Calls clean_value() on the final result to convert {} to None.

Be ready to discuss: - The *keys syntax (variadic arguments) – allows any number of keys to be passed - How it handles both dicts and lists (checks isinstance) - Alternative approaches: dict.get() chaining (data.get("a", {}).get("b", {}).get("c")) – works but is less readable and doesn’t handle lists
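If asked to sketch it live, something along these lines captures the idea (illustrative, not the exact implementation):

def clean_value(value):
    # Empty dicts/lists represent missing data in the raw JSON
    return None if value == {} or value == [] else value

def extract_nested_value(data, *keys, default=None):
    current = data
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and isinstance(key, int) and 0 <= key < len(current):
            current = current[key]
        else:
            return default
    return clean_value(current)

# extract_nested_value(record, "summary_acts", "pitch", "speed", "mph") -> float or None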


calculate_attack_angle(samples_bat: list[dict]) -> Optional[float]

What it does: Calculates the vertical angle of the bat’s path through the hitting zone in degrees. Positive = uppercut, negative = downward chop, zero = flat/level swing.

Algorithm:
  1. Filter samples to those with valid head position data (x, y, z coordinates).
  2. Sort by time.
  3. For each consecutive pair of samples, calculate horizontal speed: sqrt(dx^2 + dy^2) / dt.
  4. Find the segment with maximum horizontal speed (where the bat is moving fastest through the zone – the contact point).
  5. At that segment, calculate attack_angle = arctan2(dz, horizontal_distance) in degrees.

Why measure at peak horizontal speed: This represents the bat’s path at the point where contact would occur. Earlier in the swing the bat is accelerating; after the contact zone it’s decelerating. The angle at peak speed is the most baseball-relevant measurement.

Why atan2 instead of atan: atan2(y, x) handles all quadrants correctly and avoids division-by-zero when horizontal distance is 0. atan(y/x) would crash or give wrong signs.

Why attack angle is robust: It uses a ratio (vertical/horizontal), so timing noise affects both numerator and denominator equally and partially cancels out. Bat speed, by contrast, divides distance by a small time interval, amplifying noise.

Validation: 98.6% coverage with only 9 records filtered. Mean of ~2 degrees for confirmed swings, which is reasonable (MLB average is +8 to +12, but this data may include different swing contexts).

Be ready to discuss: - The coordinate system: x, y are horizontal, z is vertical - Why you need >= 3 samples (need at least 2 segments to compare speeds and find the maximum) - Edge cases: what if all horizontal speeds are 0? best_segment_idx stays None, function returns None
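A condensed sketch of the algorithm above, assuming each sample carries a time and head position under illustrative keys ("time", "x", "y", "z" – the real field names may differ):

import math

def calculate_attack_angle(samples):
    pts = sorted(
        (s for s in samples if all(k in s for k in ("time", "x", "y", "z"))),
        key=lambda s: s["time"],
    )
    if len(pts) < 3:
        return None
    best = None  # (horizontal_speed, dz, horizontal_distance)
    for a, b in zip(pts, pts[1:]):
        dt = b["time"] - a["time"]
        if dt <= 0:
            continue
        dx, dy, dz = b["x"] - a["x"], b["y"] - a["y"], b["z"] - a["z"]
        horizontal = math.hypot(dx, dy)
        speed = horizontal / dt
        if best is None or speed > best[0]:
            best = (speed, dz, horizontal)
    if best is None or best[0] == 0:
        return None              # no usable segment -> no angle
    _, dz, horizontal = best
    return math.degrees(math.atan2(dz, horizontal))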


calculate_bat_speed(samples_bat: list[dict]) -> tuple[Optional[float], str]

What it does: Calculates bat speed at the sweet spot (6 inches from the barrel head) in mph. Returns a tuple of (speed, confidence).

Algorithm:
  1. For each frame, calculate the sweet spot position: 6 inches from the head toward the handle.
  2. Use 3-frame windows (~10ms at 300Hz) to smooth out single-frame noise.
  3. Filter out windows with bad timing (outside the expected 8-15ms range) or unreasonable distance (>2 feet in one window).
  4. Calculate speed = distance / time and convert feet/sec to mph (multiply by 0.681818).
  5. Return the maximum speed from the valid windows.

Why sweet spot, not barrel head: MLB Statcast measures bat speed at the sweet spot (6 inches from barrel head) because that’s where batters typically make contact. The barrel tip moves faster but isn’t the contact point.

Why 3-frame smoothing: At 300Hz, each frame is ~3.33ms. A single frame-to-frame measurement is noisy because small position errors get divided by tiny time intervals. A 3-frame window (~10ms) averages out single-frame jitter while still capturing peak speed.

The get_sweet_spot inner function: - Calculates the vector from head to handle - Validates bat length (1.5-4.0 feet; real bats are ~2.5-3 feet) - Interpolates 6 inches (0.5 feet) along that vector from the head

Confidence classification: - “reliable”: >= 5 valid windows, <= 20% questionable windows, max speed <= 95 mph - “questionable”: everything else

Validation: Median of 70.1 mph vs MLB’s 71.5 mph average. Range of 60-85 mph vs MLB’s 62-81 mph. Close alignment validates the methodology.

Be ready to discuss: - Why the conversion factor 0.681818 (1 ft/s = 0.681818 mph) - Why cap at 95 mph for “reliable” – MLB max is ~80 mph (Stanton); anything above 95 is almost certainly a tracking artifact - Why >= 10 samples required – need enough data points for the 3-frame window approach to be meaningful
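The sweet-spot interpolation and the ft/s-to-mph conversion are the two pieces most worth being able to reproduce from memory. A rough sketch, assuming coordinates are (x, y, z) tuples in feet (names and structure illustrative):

import math

FPS_TO_MPH = 0.681818  # 1 ft/s = 3600/5280 mph

def sweet_spot(head, handle, offset_ft=0.5):
    # Interpolate a point 6 inches (0.5 ft) from the head toward the handle.
    vec = tuple(hn - hd for hd, hn in zip(head, handle))
    bat_length = math.sqrt(sum(c * c for c in vec))
    if not 1.5 <= bat_length <= 4.0:     # implausible bat length -> drop this frame
        return None
    t = offset_ft / bat_length
    return tuple(hd + c * t for hd, c in zip(head, vec))

def window_speed_mph(p1, p2, dt_sec):
    # Speed of the sweet spot across one smoothing window, converted to mph
    return (math.dist(p1, p2) / dt_sec) * FPS_TO_MPH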


is_swing(samples_bat: list[dict]) -> bool

What it does: Determines if a swing occurred based on whether bat tracking data contains actual position data (not just [{"event": "No"}]).

Logic: Returns True if samples_bat has entries AND they aren’t just the single {"event": "No"} marker AND at least one sample has head or handle position data.

Why this approach: The tracking system records [{"event": "No"}] when no swing was detected. Any other tracking data indicates bat movement. The additional check for “head” or “handle” keys ensures we have actual position tracking, not just metadata.

Limitation you should mention: This detects bat movement, not necessarily a true swing attempt. Check swings, bat waggles, and sensor noise can all trigger tracking data. That’s why you added the swing_confidence field in flatten_record.
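A compact version of that logic (the "event", "head", and "handle" keys follow the structure described above):

def is_swing(samples_bat):
    if not samples_bat:
        return False
    if samples_bat == [{"event": "No"}]:    # explicit "no swing" marker
        return False
    # Require actual position tracking, not just metadata
    return any("head" in s or "handle" in s for s in samples_bat)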


validate_exit_velocity(value) -> Optional[float]

What it does: Returns the value if between 0-125 mph, otherwise returns None and logs a warning.

Why 125 mph: The hardest-hit ball in MLB Statcast history was ~122 mph (Giancarlo Stanton). 125 gives a small buffer. Anything above is physically impossible and indicates a tracking error.
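All of the range validators below follow the same pattern; a representative sketch, assuming a module-level logger:

import logging

logger = logging.getLogger(__name__)

def validate_exit_velocity(value):
    if value is None:
        return None
    try:
        mph = float(value)
    except (TypeError, ValueError):
        logger.warning("Non-numeric exit velocity: %r", value)
        return None
    if 0 <= mph <= 125:
        return mph
    logger.warning("Exit velocity out of physical range: %s mph", mph)
    return None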


validate_attack_angle(value) -> Optional[float]

What it does: Returns the value if between -45 and +45 degrees, otherwise returns None.

Why +/- 45: Even extreme swings (Ted Williams-style uppercuts or old-school chop swings) don’t exceed ~30 degrees. +/- 45 is generous – anything beyond is a tracking error where the system is measuring random bat movement, not a real swing path.


validate_bat_speed(value) -> Optional[float]

What it does: Returns the value if between 0-120 mph, otherwise returns None.

Why 120: MLB max is ~80 mph. 120 is extremely generous to avoid filtering edge cases, with the speed_confidence flag handling the gray area above 95 mph.


validate_spin_rate(value) -> Optional[float]

What it does: Returns the value if between 500-4000 rpm, otherwise returns None.

Why these bounds: Lowest realistic spin is ~500 rpm (some splitters/changeups). Highest is ~3500 rpm (high-spin curveballs). 4000 provides a buffer.


validate_pitch_speed(value) -> Optional[float]

What it does: Returns the value if between 50-110 mph, otherwise returns None.

Why these bounds: Slowest MLB pitches (eephus) are ~55-60 mph. Fastest (Chapman, deGrom) are ~102-105 mph. 50-110 provides buffer for unusual situations.


validate_pitch_result(value) -> Optional[str]

What it does: Checks that pitch result is one of: Strike, Ball, HitIntoPlay, Pickoff. Returns None for anything else.

Why a whitelist approach: Safer than a blacklist. If new result types appear in future data, they’ll be flagged rather than silently accepted. You’ll know about schema changes immediately.


validate_batter_id(value) -> Optional[int]

What it does: Attempts to convert the value to an integer. Returns None if conversion fails.

Why: Batter IDs should always be integers (MLB player IDs). String representations like “12345” are converted. Non-numeric values are rejected.


flatten_record(record: dict, index: int) -> dict

What it does: This is the core transformation function. Takes one raw JSON record (deeply nested) and produces one flat dictionary row suitable for a DataFrame/CSV.

What it extracts: - Pitch info: type, result, action, speed, spin (from summary_acts.pitch) - Hit info: exit velocity (from summary_acts.hit.speed.mph) - Score info: balls, strikes, outs (from summary_score) - Batter ID: from events[0].personId.mlbId

What it calculates (feature engineering):
  • is_swing: calls is_swing() on bat tracking samples
  • is_contact: True if exit velocity exists OR action is “Foul”
  • is_whiff: swing detected AND no contact
  • attack_angle_deg: calls calculate_attack_angle()
  • bat_speed_mph + speed_confidence: calls calculate_bat_speed()
  • swing_confidence: “confirmed” if action is Swinging/Foul or result is HitIntoPlay; “suspected” if tracking detected movement but action was something else (like “Called”)

Key design decisions: - Deduplicates samples_bat before any calculations - Handles empty dict {} for action field explicitly - Foul balls count as contact even without exit velocity (the bat made contact with the ball) - The swing_confidence system distinguishes sensor-confirmed swings from possible false positives

Be ready to discuss: - Why is_contact includes fouls without exit velocity: the tracking system doesn’t always capture exit velocity on foul tips, but contact clearly occurred - Why “suspected” swings matter: they have more extreme attack angles (mean -1.5 vs +2.1 for confirmed), higher std deviation (16.0 vs 11.2), suggesting they’re measuring random bat movement, not real swings
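A trimmed-down sketch of the shape of this function, showing only a few fields. It leans on the helpers described earlier, and the key paths (e.g., "samples_bat") and the no-swing confidence value are simplified assumptions, not the exact schema:

def flatten_record(record, index):
    samples_bat = deduplicate_samples(record.get("samples_bat", []))

    action = extract_nested_value(record, "summary_acts", "pitch", "action")
    exit_velo = extract_nested_value(record, "summary_acts", "hit", "speed", "mph")
    swing = is_swing(samples_bat)
    contact = exit_velo is not None or action == "Foul"
    bat_speed, speed_conf = calculate_bat_speed(samples_bat) if swing else (None, "no_swing")

    return {
        "pitch_index": index,
        "pitch_result": extract_nested_value(record, "summary_acts", "pitch", "result"),
        "exit_velocity_mph": exit_velo,
        "is_swing": swing,
        "is_contact": contact,
        "is_whiff": swing and not contact,
        "attack_angle_deg": calculate_attack_angle(samples_bat) if swing else None,
        "bat_speed_mph": bat_speed,
        "speed_confidence": speed_conf,
    }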


validate_dataframe(df: pd.DataFrame) -> pd.DataFrame

What it does: Applies all validation functions across the entire DataFrame. Logs warnings for invalid values, then replaces them with None.

Design decision: Validation is separate from extraction. This follows the principle of separation of concerns – flatten_record extracts and computes; validate_dataframe enforces data quality rules. This makes it easy to adjust validation rules without touching extraction logic.

Be ready to discuss: - Why validate after creating the DataFrame (not during flattening): allows batch logging of issues, easier to report summary statistics of data quality problems - Why set invalid values to None instead of dropping rows: preserves partial data. A pitch with an invalid exit velocity still has valid pitch speed, type, and result data.
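A sketch of how the column-wise application might look, assuming the validator functions described above and column names from the flattened schema (the mapping itself is illustrative):

import pandas as pd

VALIDATORS = {
    "exit_velocity_mph": validate_exit_velocity,
    "attack_angle_deg": validate_attack_angle,
    "bat_speed_mph": validate_bat_speed,
    "spin_rate_rpm": validate_spin_rate,
    "pitch_speed_mph": validate_pitch_speed,
}

def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    for column, validator in VALIDATORS.items():
        if column in df.columns:
            # Missing values stay missing; everything else goes through its range check
            df[column] = df[column].map(lambda v, f=validator: None if pd.isna(v) else f(v))
    return df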


process_pipeline(input_path: str, output_path: str) -> pd.DataFrame

What it does: Orchestrates the entire pipeline: load -> flatten -> DataFrame -> validate -> save CSV.

Key design decisions: - Performance timing on every step (load, flatten, create DataFrame, validate, save) – demonstrates observability thinking - Summary statistics logged after processing (total pitches, unique games, swing counts, coverage percentages) - Sequential processing chosen over parallel after benchmarking showed parallel was 4.7x slower for this dataset size

Parallelization analysis (they will likely ask about this):
  • You tested multiprocessing.Pool with 16 workers.
  • Sequential: 0.8s flatten time. Parallel: 3.8s flatten time.
  • Why slower: Windows uses “spawn” (not “fork”), requiring ~3-4s just to create worker processes, plus pickle serialization overhead to send records to the workers.
  • Breakeven point: ~5,000-10,000 records. Above 10,000, parallel wins.
  • Production alternatives at true scale: Apache Spark, Dask, or async I/O for I/O-bound work.
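The benchmark is easy to recreate from memory; a minimal sketch of the comparison, assuming flatten_record and load_raw_data as described above (worker count and timing approach are illustrative):

import time
from multiprocessing import Pool

def flatten_sequential(records):
    return [flatten_record(r, i) for i, r in enumerate(records)]

def flatten_parallel(records, workers=16):
    # On Windows each worker is spawned (a fresh interpreter), so process
    # startup plus pickling every record dominates at this dataset size.
    with Pool(processes=workers) as pool:
        return pool.starmap(flatten_record, [(r, i) for i, r in enumerate(records)])

if __name__ == "__main__":   # required on Windows: "spawn" re-imports this module
    raw_data = load_raw_data("batch_raw.json")
    start = time.perf_counter()
    flatten_sequential(raw_data)
    print(f"sequential: {time.perf_counter() - start:.2f}s")
    start = time.perf_counter()
    flatten_parallel(raw_data)
    print(f"parallel:   {time.perf_counter() - start:.2f}s")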


aggregate.py

aggregate_batter_stats(input_path: str, output_path: str) -> pd.DataFrame

What it does: Loads the cleaned CSV into DuckDB (in-memory), runs a SQL aggregation query, outputs per-batter statistics.

The SQL query:

SELECT
    batter_id,
    COUNT(*) FILTER (WHERE is_swing = TRUE) AS swing_count,
    ROUND(100.0 * COUNT(*) FILTER (WHERE is_whiff = TRUE) /
        NULLIF(COUNT(*) FILTER (WHERE is_swing = TRUE), 0), 2) AS whiff_rate_pct,
    MAX(exit_velocity_mph) AS max_exit_velocity_mph
FROM pitches
WHERE batter_id IS NOT NULL
GROUP BY batter_id
ORDER BY batter_id

Be ready to explain every part:
  • COUNT(*) FILTER (WHERE ...): DuckDB/Postgres syntax for conditional counting. More readable than SUM(CASE WHEN ... THEN 1 ELSE 0 END).
  • NULLIF(..., 0): Prevents division by zero. If a batter has 0 swings, NULLIF returns NULL instead of 0, so the division returns NULL instead of erroring.
  • ROUND(..., 2): Rounds to 2 decimal places for clean output.
  • WHERE batter_id IS NOT NULL: Excludes pitches where we couldn’t identify the batter.

Why DuckDB instead of pandas: - SQL is more readable for aggregation logic - DuckDB is a columnar analytical database – optimized for exactly this kind of query - DuckDB can query CSV files directly, reducing boilerplate - Demonstrates SQL proficiency (relevant to the job requirements) - At scale, the same SQL could run on Postgres, Redshift, or BigQuery with minimal changes
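In Python this takes only a few lines; a sketch assuming the cleaned CSV from stage 1 (file names are illustrative):

import duckdb

con = duckdb.connect()   # in-memory database
result = con.execute("""
    SELECT
        batter_id,
        COUNT(*) FILTER (WHERE is_swing = TRUE) AS swing_count,
        MAX(exit_velocity_mph) AS max_exit_velocity_mph
    FROM read_csv_auto('pitches_clean.csv')
    WHERE batter_id IS NOT NULL
    GROUP BY batter_id
    ORDER BY batter_id
""").df()
result.to_csv('batter_stats.csv', index=False)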


run_additional_analysis(input_path: str)

What it does: Runs exploratory queries: overall dataset statistics, bat speed stats for reliable swings, pitch type breakdown.

Purpose: Shows how you initially explored the data to understand distributions and validate your feature engineering. This is a good data engineering practice – always explore your output to catch issues.

Notable queries: - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY bat_speed_mph): Calculates the median. This is an ordered-set aggregate function, more advanced than simple AVG(). - Pitch type breakdown with per-type whiff rates: shows analytical thinking about how different pitch types perform.


test_pipeline.py

Test structure: 60 tests across 11 test classes

Why this testing approach matters: You organized tests by function using classes (TestIsSwing, TestCalculateAttackAngle, etc.). Each class tests the happy path, edge cases, and error conditions for one function.

Key test patterns you used: - Boundary testing: validate_exit_velocity(0), validate_exit_velocity(125) (testing exact boundaries) - Null/empty handling: Every function is tested with None, empty list, and empty dict inputs - Physical reasoning in test data: The attack angle tests use samples that move purely horizontally (expect ~0), upward (expect positive), and downward (expect negative) - Integration-style tests: TestFlattenRecord and TestSwingConfidence test the full flattening logic with realistic record structures

Be ready to discuss: - Why you chose pytest over unittest: simpler syntax, better assertion messages, fixtures, parametrize support - What you would add with more time: integration tests that run the full pipeline end-to-end, property-based tests with Hypothesis, tests for process_pipeline and aggregate_batter_stats


PART 2: INTERVIEW QUESTIONS WITH ANSWERS


Section A: Questions About Your Project

Q1: Walk us through your pipeline architecture. Why did you structure it this way?

Answer: The pipeline has two stages. Stage 1 (pipeline.py) loads raw JSON, flattens nested structures into tabular rows, engineers features (is_swing, is_contact, is_whiff, attack_angle, bat_speed), validates data quality, and outputs a clean CSV. Stage 2 (aggregate.py) loads that CSV into DuckDB and runs SQL aggregation to produce per-batter statistics. I separated them because the cleaning/enrichment step and the analysis step have different concerns – the CSV serves as a stable intermediate format that any tool can consume.


Q2: Why did you choose DuckDB for aggregation instead of pandas?

Answer: Three reasons. First, SQL is more readable for GROUP BY aggregations with conditional counts – the FILTER (WHERE ...) syntax makes the intent clear. Second, DuckDB is a columnar analytical database designed for exactly this workload, so it performs well on larger datasets. Third, the same SQL could migrate to Postgres, Redshift, or BigQuery with minimal changes, which matters if this pipeline moves to production. Pandas groupby operations would be harder to port to a SQL-based warehouse.


Q3: You tested parallelization and found it slower. Explain.

Answer: I tested multiprocessing.Pool with 16 workers for the flattening step. It was 4.7x slower (3.8s vs 0.8s sequential). The overhead came from three sources: Windows uses “spawn” not “fork” to create processes, costing ~3-4 seconds upfront. Each record must be serialized with pickle to send to workers and deserialized back. And each flatten_record call only takes ~0.6ms, so the overhead dominates. The breakeven is around 5,000-10,000 records. For production with 100K+ records, parallelization would provide significant speedup. At true scale (millions), I’d consider Spark or Dask instead of multiprocessing.


Q4: Why did you provide quality flags instead of just filtering out bad data?

Answer: Filtering out data is a destructive decision that removes optionality for downstream consumers. An analyst studying swing mechanics might want only “confirmed” swings. An analyst studying the tracking system’s accuracy might specifically want the “suspected” swings. By providing swing_confidence and speed_confidence flags, I give consumers the information to filter appropriately for their use case without making that decision for them. This is a core data engineering principle – expose data quality information, don’t hide it.


Q5: How would you handle streaming data instead of batch JSON?

Answer: I’d use a message broker like Kafka or AWS Kinesis as the ingestion layer. Each pitch event would be published as a message. A consumer service would apply the same flatten/validate/enrich logic to individual records as they arrive, then write to a database (Postgres or a warehouse). The flatten_record function wouldn’t change much since it already processes one record at a time. The main architectural change is replacing file I/O with message consumption and database writes, plus adding exactly-once processing guarantees and dead-letter queues for failed records.


Q6: Explain your attack angle calculation. Why measure at peak horizontal bat speed?

Answer: Attack angle is arctan2(vertical_velocity, horizontal_velocity) in degrees. I measure it at the point of maximum horizontal bat speed because that’s when the bat is moving fastest through the hitting zone – the moment where contact would occur. Earlier in the swing, the bat is still accelerating; later, it’s decelerating and wrapping around. The angle at peak speed is the most baseball-relevant measurement of the bat’s path. I use atan2 instead of atan because it handles all quadrants correctly and avoids division by zero.


Q7: Your bat speed uses 3-frame smoothing. Why 3 frames?

Answer: The tracking system runs at 300Hz, so each frame is ~3.33ms. A single frame-to-frame speed calculation divides a small distance by a tiny time interval (~3.3ms), which amplifies any position tracking error. A 3-frame window (~10ms) smooths out single-frame noise while still capturing the peak speed accurately. 2 frames would still be too noisy. 5+ frames would over-smooth and underestimate the true peak. 3 frames is the standard compromise in motion analysis – long enough to reduce noise, short enough to capture dynamics.


Q8: How does your deduplication work, and what could go wrong?

Answer: I use a set to track seen timestamps and keep only the first occurrence of each duplicate. This is O(n) with O(1) set lookups. The risk is that the second occurrence might be a corrected value rather than a true duplicate. To mitigate this, I could compare position data between duplicates – if positions differ, keep the one with more plausible values. For this dataset, the duplicates had identical position data, so keeping the first was correct.


Q9: Your validation ranges – how did you choose them?

Answer: Based on physical limits and MLB data. Exit velocity: MLB max is ~122 mph (Stanton), so I cap at 125. Pitch speed: slowest MLB pitches are ~55 mph (eephus), fastest ~105 mph, so 50-110. Attack angle: even extreme swings are within +/-30 degrees, so +/-45 is generous. Spin rate: splitters are ~500 rpm, curveballs peak at ~3500, so 500-4000. The goal is to catch tracking errors (values that are physically impossible) without filtering legitimate outliers.


Q10: What would you do differently if you were starting this project over?

Answer: Three things. First, I’d use Pydantic models to define the input schema rather than manual nested extraction – it would validate the raw JSON structure upfront and make the code more self-documenting. Second, I’d add end-to-end integration tests that run the full pipeline and verify the output CSV against expected values. Third, I’d output to Parquet instead of CSV – it’s columnar, compressed, and preserves types (CSV loses the distinction between None and empty string, and doesn’t enforce numeric types).


Section B: SQL Questions

Q11: Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK().

Answer: All three assign rankings within a partition. For values [100, 100, 90]:
  • ROW_NUMBER(): 1, 2, 3 – always unique, ties broken arbitrarily
  • RANK(): 1, 1, 3 – ties get the same rank, and the next rank skips (no rank 2)
  • DENSE_RANK(): 1, 1, 2 – ties get the same rank, and the next rank doesn’t skip

Use ROW_NUMBER when you need exactly N results per group (like top-3). Use RANK or DENSE_RANK when ties matter (like “who had the highest exit velocity” where two players might tie).


Q12: What’s the difference between a CTE and a subquery?

Answer: A CTE (Common Table Expression) is defined with WITH ... AS (...) before the main query. A subquery is nested inside the main query. CTEs are more readable, can be referenced multiple times in the same query, and are required for recursive queries. Subqueries can sometimes be optimized differently by the query planner. For complex analytics, CTEs are almost always preferred for readability. I used a CTE in my top-3 hardest hits query to separate the ranking logic from the filtering logic.
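A sketch of what that top-3 query might look like, run through DuckDB from Python (column and file names follow the project’s schema; this is illustrative, not the exact query):

import duckdb

top3 = duckdb.sql("""
    WITH ranked AS (
        SELECT
            batter_id,
            exit_velocity_mph,
            ROW_NUMBER() OVER (
                PARTITION BY batter_id
                ORDER BY exit_velocity_mph DESC
            ) AS rn
        FROM read_csv_auto('pitches_clean.csv')
        WHERE exit_velocity_mph IS NOT NULL
    )
    SELECT batter_id, exit_velocity_mph
    FROM ranked
    WHERE rn <= 3
""").df()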


Q13: Explain window functions beyond ROW_NUMBER.

Answer:
  • LAG(column, n) / LEAD(column, n): Access the value from the previous/next row. Baseball example: compare a pitcher’s velocity on the current pitch to their previous pitch to detect fatigue.
  • SUM() OVER (ORDER BY ...): Running total. Example: cumulative strikeouts through a game.
  • AVG() OVER (ROWS BETWEEN 9 PRECEDING AND CURRENT ROW): Rolling average. Example: 10-pitch rolling average of spin rate.
  • NTILE(n): Divides rows into n roughly equal buckets. Example: NTILE(4) to split batters into quartiles by exit velocity.


Q14: What are indexes and when would you create them?

Answer: An index is a data structure (usually B-tree) that speeds up lookups on a column. For the pitches table, I’d index batter_id because it’s used in GROUP BY and WHERE clauses frequently. A composite index on (game_id, batter_id) would speed up queries that filter by game and group by batter. Tradeoff: indexes speed up reads but slow down writes (every INSERT/UPDATE must also update the index) and consume storage. For an analytical workload with infrequent writes and frequent reads, indexes are almost always worth it.


Q15: What’s the difference between WHERE and HAVING?

Answer: WHERE filters rows before grouping. HAVING filters groups after aggregation. Example: WHERE batter_id IS NOT NULL removes rows before the GROUP BY, while HAVING COUNT(*) FILTER (WHERE is_swing) > 5 removes groups (batters) with five or fewer swings after the count is calculated (some engines, including DuckDB, also let you reference a column alias like swing_count there). You can’t use aggregate functions in WHERE.


Q16: Explain JOINs – INNER, LEFT, RIGHT, FULL, CROSS.

Answer:
  • INNER JOIN: Only rows that match in both tables. If a pitch has no matching player record, it’s excluded.
  • LEFT JOIN: All rows from the left table, NULLs for non-matching right-table rows. Use when you want all pitches even if some don’t have player info.
  • RIGHT JOIN: Opposite of LEFT. Rarely used – just swap the table order and use LEFT.
  • FULL OUTER JOIN: All rows from both tables, NULLs where there’s no match.
  • CROSS JOIN: Every row in table A paired with every row in table B (Cartesian product). Useful for generating combinations.

Baseball example: pitches LEFT JOIN players ON pitches.batter_id = players.mlb_id – get all pitches with player names, keeping pitches where the batter is unknown (batter_id is NULL).


Section C: Python & Data Engineering

Q17: What’s the difference between a list comprehension and a generator?

Answer: A list comprehension [x for x in items] builds the entire list in memory at once. A generator (x for x in items) produces values lazily, one at a time. For my pipeline, I used a list comprehension for flattening: [flatten_record(record, i) for i, record in enumerate(raw_data)]. This loads all 1,254 rows into memory simultaneously. If the dataset were millions of records, I’d switch to a generator and process in chunks to avoid memory exhaustion.


Q18: How would you handle schema evolution?

Answer: Three approaches depending on severity. For new fields added to the source: my extract_nested_value already handles missing keys gracefully by returning None, so new fields are ignored unless I explicitly extract them. For removed fields: the pipeline would produce None/NULL values, which downstream consumers should already handle. For breaking changes (renamed fields, restructured nesting): I’d use schema validation at ingestion (e.g., Pydantic or JSON Schema) to fail fast and alert rather than silently producing bad data. In production, I’d version the pipeline and schema together.
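A minimal sketch of what schema validation at ingestion could look like with Pydantic (the model and field names are assumptions about the raw structure, not the actual schema):

from typing import Optional
from pydantic import BaseModel, ValidationError

class PitchSpeed(BaseModel):
    mph: Optional[float] = None

class RawPitch(BaseModel):
    pitch_type: Optional[str] = None
    speed: Optional[PitchSpeed] = None

def parse_record(raw: dict) -> Optional[RawPitch]:
    try:
        return RawPitch(**raw)
    except ValidationError as exc:
        # Fail fast / alert instead of silently producing bad rows
        print(f"Schema violation: {exc}")
        return None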


Q19: What’s the difference between pandas and polars?

Answer: Pandas is the established standard – single-threaded, eager evaluation, with intermediate results materialized in memory. Polars is newer – written in Rust, multi-threaded by default, lazy evaluation (it builds a query plan and optimizes it before executing), and often 5-50x faster than pandas on large datasets. For this project, pandas was fine at 1,254 records. At scale, I’d consider polars or move the transformation into SQL (DuckDB or a warehouse).


Q20: How would you schedule this pipeline to run daily?

Answer: For a simple daily schedule, I’d use Apache Airflow. I’d define a DAG with three tasks: ingest_task >> transform_task >> aggregate_task. Each task runs a Python function. Airflow handles scheduling (cron-like), retries on failure, dependency management (aggregate only runs if transform succeeds), and provides a web UI for monitoring. Alternatives: Dagster (better for data assets), Prefect (simpler API), or cloud-native options like AWS Step Functions.
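A skeleton of the DAG described above (task callables and ids are illustrative, and exact operator imports and the schedule argument vary by Airflow version):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(): ...
def transform(): ...
def aggregate(): ...

with DAG(
    dag_id="pitch_tracking_daily",
    start_date=datetime(2025, 1, 1),
    schedule="0 6 * * *",     # every morning at 6:00
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    aggregate_task = PythonOperator(task_id="aggregate", python_callable=aggregate)

    ingest_task >> transform_task >> aggregate_task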


Q21: What is ETL vs ELT?

Answer: ETL (Extract, Transform, Load): data is transformed before loading into the destination. My pipeline does this – I clean and enrich the JSON before saving the CSV. ELT (Extract, Load, Transform): raw data is loaded into a warehouse first, then transformed in place using SQL/dbt. ELT is the modern standard because cloud warehouses (BigQuery, Snowflake, Redshift) are powerful and cheap, so it’s more efficient to transform data where it lives rather than in a separate process. The Phillies likely use ELT for their warehouse.


Q22: What is idempotency and why does it matter?

Answer: An idempotent operation produces the same result regardless of how many times you run it. My pipeline is idempotent – running it twice on the same batch_raw.json produces identical output. This matters for reliability: if a pipeline run fails halfway and gets retried, you don’t end up with duplicate data or corrupted results. In practice, this means using UPSERT instead of INSERT, writing to temp files then atomically moving, and designing transformations as pure functions.
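One concrete example of the “write to temp, then atomically move” pattern – a sketch, not taken from the project code:

import os
import tempfile

def atomic_write_csv(df, path: str) -> None:
    # Write to a temp file in the same directory, then atomically replace the target.
    # A retried run either fully overwrites the file or leaves the old one intact.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        df.to_csv(tmp_path, index=False)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise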


Q23: What is a star schema vs a snowflake schema?

Answer: Star schema has a central fact table surrounded by dimension tables, each joined directly to the fact table. Snowflake schema normalizes dimensions further (dimension tables have their own sub-dimensions). For the Phillies data:

Star schema example: - Fact table: fact_pitches (pitch_id, game_id, batter_id, pitcher_id, exit_velocity, bat_speed, attack_angle, …) - Dimension tables: dim_player (player_id, name, team, position), dim_game (game_id, date, venue, home_team, away_team), dim_pitch_type (pitch_type_id, name, category)

Star schemas are preferred for analytics because queries require fewer JOINs. Snowflake schemas save storage through normalization but are slower to query.


Q24: What are slowly changing dimensions? Give a baseball example.

Answer: An SCD tracks how dimension attributes change over time. The job description mentions “player contract information” – this is a classic SCD.

  • Type 1 (Overwrite): Update the player’s team to their current team. Lose history. Simple but you can’t analyze past trades.
  • Type 2 (Add row): Add a new row with effective_start_date and effective_end_date when a player is traded. Player has multiple rows. Full history preserved. Most common for analytics.
  • Type 3 (Add column): Add a previous_team column. Only tracks one change. Rarely used.

For the Phillies, Type 2 would let analysts query “what was this player’s performance when they played for Team X vs Team Y.”


Section D: Cloud & Infrastructure

Q25: How would you deploy this pipeline to the cloud?

Answer: On AWS: Raw JSON lands in S3. An Airflow DAG (running on MWAA or ECS) triggers on file arrival. The pipeline runs in a Docker container on ECS/Fargate. Cleaned output writes to S3 (Parquet, not CSV). A dbt model transforms it in Redshift/Athena for the analytical layer. Dashboards read from Redshift. Monitoring via CloudWatch alerts for pipeline failures or data quality issues. On GCP: substitute GCS, Cloud Composer, Cloud Run, BigQuery.


Q26: What is S3 and how does it differ from a database?

Answer: S3 is object storage – you store files (JSON, CSV, Parquet, images) as objects in buckets. No schema, no indexes, no query optimization. Essentially infinite scale, very cheap. A database (Postgres, Redshift) has structured tables with schemas, indexes, query planners, and transactional guarantees. Data lakes use S3 for cheap raw storage; data warehouses use databases for fast analytical queries. Modern architecture uses both: land raw data in S3, transform and load into a warehouse.


Q27: What is Docker and why is it useful for data engineering?

Answer: Docker packages an application and all its dependencies into a container – a lightweight, isolated environment that runs identically on any machine. For my pipeline, a Dockerfile would specify Python 3.11, install pandas/duckdb/pyarrow from requirements.txt, and copy my source code. Anyone can docker run it and get identical results regardless of their local Python version or installed packages. In production, containers run on Kubernetes or ECS, making it easy to scale, deploy, and rollback.


Q28: What is dbt and how does it fit into a data pipeline?

Answer: dbt (data build tool) handles the “T” in ELT. It runs SQL transformations inside the data warehouse. You write SELECT statements (called “models”), and dbt materializes them as tables or views. Key features: ref() function for dependency management between models, built-in testing (not_null, unique, accepted_values), auto-generated documentation, and incremental materializations for large tables. For my project, my aggregate.py SQL query would become a dbt model, and the validations would become dbt tests.


Section E: Data Quality & Observability

Q29: How would you detect a spike in missing bat tracking data?

Answer (from your short answers, know this cold): Three components. First, calculate missing_rate = swings_without_tracking / total_swings * 100 after each pipeline run and log it to a time-series database (Prometheus). Second, use a rolling 7-day baseline comparison rather than a static threshold – alert when the current rate exceeds the baseline by more than 5 percentage points. This catches the 1% to 15% spike while avoiding false alarms from normal variance. Third, route alerts through PagerDuty to the on-call engineer with context: current rate, baseline rate, affected game IDs, and a link to the runbook (common causes: camera occlusion, sensor calibration drift, venue-specific issues).
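The rolling-baseline comparison is simple to express; a sketch using the threshold and window from the answer above (the function itself is illustrative):

def should_alert(current_rate: float, recent_rates: list[float], threshold_pts: float = 5.0) -> bool:
    # recent_rates: daily missing-tracking percentages from the last 7 runs
    if not recent_rates:
        return False
    baseline = sum(recent_rates) / len(recent_rates)
    return (current_rate - baseline) > threshold_pts

# should_alert(15.0, [1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]) -> True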


Q30: What other data quality checks would you add?

Answer: Row count checks (did we get the expected number of records?), schema validation (did the JSON structure change?), freshness checks (is the data actually from today’s games?), distribution checks (did the average pitch speed shift dramatically – might indicate a unit change), referential integrity (do all batter_ids exist in the player master table?), and completeness checks (what percentage of pitches have exit velocity, spin rate, etc. and is that percentage stable over time?).


Section F: Baseball Domain Knowledge

Q31: What is Statcast and what data does it produce?

Answer: Statcast is MLB’s tracking system (currently using Hawk-Eye cameras since 2020, previously TrackMan radar). It produces three categories of data: (1) Ball tracking – pitch velocity, spin rate, spin axis, movement, release point, and on contact: exit velocity, launch angle, hit distance. (2) Bat tracking – bat speed, attack angle, swing path (added in 2024). (3) Player tracking – sprint speed, route efficiency, arm strength, jump distance. The data powers advanced metrics like expected batting average (xBA), expected slugging (xSLG), and barrel rate.


Q32: What’s the difference between launch angle and attack angle?

Answer: Launch angle is the vertical angle the BALL leaves the bat (measured by Statcast from ball flight). Attack angle is the vertical angle of the BAT’S PATH through the zone (measured from bat tracking). They’re related but different – a steep positive attack angle tends to produce higher launch angles, but the relationship depends on where on the bat the ball makes contact, the pitch’s incoming angle, and timing. Attack angle is a property of the swing; launch angle is a property of the batted ball outcome.


Q33: What is whiff rate and why do teams care?

Answer: Whiff rate = swings and misses / total swings. For pitchers, a high whiff rate on a specific pitch means it’s effective at generating strikeouts. For batters, a high whiff rate indicates poor contact ability. Teams use it for: pitcher evaluation (which pitches are most effective), batter scouting (which batters are vulnerable to specific pitch types), and game planning (throw more sliders to batters with high whiff rates on breaking balls). My pipeline calculates this per batter in the aggregation step.


Q34: What is exit velocity and why is it considered predictive?

Answer: Exit velocity is the speed of the ball coming off the bat. It’s one of the most predictive offensive metrics because it’s “sticky” – consistent for individual hitters across seasons, unlike batting average, which fluctuates with luck. High exit velocity correlates with extra-base hits and higher expected batting average (xBA). A ball hit at 95+ mph counts as “hard-hit”; pair high exit velocity (98+ mph) with an optimal launch angle and it’s a “barrel,” which has a high probability of being a hit. Teams use it to evaluate hitters independent of luck and defensive positioning.


Q35: What other baseball analytics terms should you know?

Answer:
  • wOBA (weighted on-base average): Weights different outcomes (single, double, HR, walk) by their run value. Better than batting average because not all hits are equal.
  • xBA / xSLG (expected batting average / slugging): Based on exit velocity and launch angle. Tells you what a player “should” have hit based on contact quality, stripping out luck and defense.
  • Barrel rate: Percentage of batted balls with optimal exit velocity (98+ mph) and launch angle (26-30 degrees). High barrel rate = high power.
  • Spin rate: Revolutions per minute of a pitched ball. Higher-spin fastballs appear to “rise”; higher-spin curveballs break more. Teams use spin rate to evaluate a pitcher’s stuff.
  • Hard-hit rate: Percentage of batted balls with exit velocity 95+ mph.


Section G: Behavioral / Culture Fit

Q36: Why do you want to work for the Phillies specifically?

Prep notes (personalize this): Research the Phillies R&D department. Know that they invest in analytics and technology. Mention specific things: their use of Statcast data, their development pipeline for young players, specific personnel moves that were analytically driven. Show that you follow baseball and understand how data drives decisions in the sport.


Q37: Tell us about a time you dealt with messy data.

Answer: Use your project. “The raw tracking data had 4,254 duplicate samples across 1,254 records, timing glitches where frames had 1ms intervals instead of the expected 3.33ms, and position jumps where the tracking system lost the bat momentarily. I handled duplicates by deduplicating on timestamp within each pitch. For timing and position artifacts, rather than discarding all affected records, I implemented quality flags – speed_confidence and swing_confidence – so analysts can decide their own filtering threshold. I validated my approach by comparing my bat speed calculations against MLB Statcast averages and found close alignment (70.1 vs 71.5 mph median).”


Q38: How do you handle working with non-technical stakeholders?

Answer: “I focus on translating technical constraints into their language. For example, instead of saying ‘the tracking data has timing artifacts that inflate speed calculations,’ I’d say ‘some bat speed readings may be higher than reality due to sensor issues – I’ve flagged which ones are trustworthy and which should be used with caution.’ In my project, I designed the output schema with analysts in mind: clear column names, confidence flags with documented meanings, and a README that explains what each field means and how to filter appropriately.”


Q39: How do you handle feedback in code reviews?

Answer: “I treat code reviews as a learning opportunity. If someone suggests a different approach, I want to understand why – it might be a pattern I haven’t encountered or a performance consideration I didn’t think of. I also try to make my code easy to review: descriptive variable names, logical function decomposition, and test coverage so reviewers can understand the intended behavior. In my project, I organized tests into clear classes by function, which would make it easy for a reviewer to verify each component.”


Q40: How transparent are you about using AI tools?

Answer: “I documented my AI usage in the README and in docstrings on specific functions where Claude assisted. I used it for researching MLB Statcast methodology, validating my attack angle formula against industry standards, and some code formatting. I validated everything it suggested against MLB data – for example, confirming that my bat speed calculations aligned with known league averages. I believe AI tools are valuable for accelerating research and catching errors, but the engineer needs to understand and own every line of code.”


Section H: Technical Scenarios They Might Pose

Q41: “A data scientist tells you their model’s predictions got worse after your last pipeline update. How do you debug this?”

Answer: First, check what changed in the pipeline update – diff the code. Then compare output distributions before and after: did value ranges shift? Did null rates change? Did a validation rule get too aggressive and filter out valid data? Check row counts, column means/medians, and null percentages between the two versions. If the schema changed (new columns, renamed columns, type changes), the model might be ingesting data differently. I’d provide the data scientist with a comparison report and work backward from the specific predictions that degraded to the specific data inputs that changed.


Q42: “How would you backfill 5 years of historical Statcast data into a new data warehouse?”

Answer: Break it into manageable batches (by season or month). Run the pipeline for each batch with idempotent writes (UPSERT, not INSERT). Monitor progress and data quality per batch. Use parallelism across batches since they’re independent. Validate each batch before moving to the next: row counts should match source, distributions should be consistent. For the warehouse, load into staging tables first, then swap into production tables after validation passes. Log everything for audit trail.


Q43: “Our pipeline takes 4 hours to run. How would you make it faster?”

Answer: Profile first – identify which step is the bottleneck (ingestion? transformation? loading?). Common optimizations: switch from CSV to Parquet (columnar, compressed, faster reads). Use batch/bulk loading instead of row-by-row inserts. Parallelize independent transformations. Move heavy transformations into the warehouse (ELT vs ETL). Use incremental processing – only process new/changed data instead of reprocessing everything. Cache intermediate results. If I/O bound, use async I/O. If CPU bound, use multiprocessing or Spark.


Q44: “Design a system to serve real-time player statistics to a mobile app during a live game.”

Answer: The pipeline ingests Statcast events via a streaming platform (Kafka). A stream processor (Flink or a simple consumer) updates pre-computed aggregates in a fast key-value store (Redis) or a low-latency database (DynamoDB). The mobile app queries an API layer that reads from Redis. Aggregates are updated incrementally (not recomputed from scratch) for speed. Fallback: if streaming fails, the app shows cached data with a staleness indicator. This separates the analytical pipeline (batch, warehouse) from the serving layer (real-time, cache).


Section I: Questions From Your Resume

Q45: Tell us about your work at Push Performance AZ.

Answer: “I started as a Data Science intern building a web app that analyzed 1.4 million MLB pitches to generate count-specific pitch recommendations. I worked directly with pitching coaches and presented findings to MLB starters like Garrett Crochet, Bryan Woo, and Logan Webb. That internship turned into a full Data Engineer role where I built an athlete assessment report product that’s now generating over $100K in annual revenue. The platform processes 2,000+ athlete profiles with real-time integration to VALD ForceDecks force plate systems. I built the Python data pipelines for biomechanical data ingestion, feature engineering, and composite scoring, and I created the React + TypeScript dashboards for visualization.”

Why this matters for the Phillies: This directly maps to the job description – they work with biomechanical time-series data, player tracking data, and need pipelines that feed dashboards and internal applications. You’ve already done all of this in a baseball-adjacent context.


Q46: You increased ETL throughput 3-4x at Push Performance. Walk us through how.

Answer: “The original pipeline took over 3 hours to process all athlete data. I identified three bottlenecks. First, API calls to VALD ForceDecks were sequential – I switched to multithreading with adaptive rate limiting so we could make concurrent requests without hitting API limits. Second, repeated API calls for the same data – I added BigQuery caching so we only fetch data that’s changed since the last run. Third, the processing itself – I optimized the data transformation logic. Together these brought it down to 45-60 minutes, a 3-4x improvement.”

Be ready to discuss: - Why multithreading (not multiprocessing) for API calls: API calls are I/O-bound (waiting for network responses), so threading is appropriate. Multiprocessing is for CPU-bound work. - Adaptive rate limiting: you likely tracked response codes (429 = rate limited) and backed off dynamically rather than using a fixed sleep. - BigQuery caching strategy: store processed results with timestamps, only re-fetch records modified after the last run timestamp.


Q47: Tell us about the Pitch Recommendation Engine.

Answer: “I trained gradient-boosted models (XGBoost) using in-zone whiff rate, chase rate, and wOBA to optimize pitch sequencing decisions – essentially, given the current count and situation, what pitch should the pitcher throw next? I built interactive visualizations with Dash and Plotly including pitch-tree diagrams and heatmaps, and deployed the app via Flask on Render.”

Be ready to discuss:
  • Why XGBoost: handles non-linear relationships, feature importance is interpretable, and it performs well on tabular data with moderate feature counts.
  • What wOBA is and why you used it: weighted on-base average – weights outcomes by run value. Better than batting average for evaluating pitch effectiveness because it captures the severity of the outcome, not just hit/no-hit.
  • What chase rate is: the percentage of pitches outside the strike zone that batters swing at. A high chase rate means the pitch is deceptive.
  • How count affects pitch selection: in a pitcher’s count (like 0-2) you can throw more chase pitches out of the zone; in a hitter’s count (like 3-1) the hitter is more selective, so you may have to challenge with a fastball in the zone.


Q48: Tell us about the Runner Advancement Prediction project.

Answer: “I led a team of 5 building a computer vision + ML system to predict whether a baserunner would advance on a sacrifice fly. We used YOLOv8 and OpenCV to extract features from broadcast footage – things like runner position, fielder position, throw distance – replicating proprietary data that’s normally only available to MLB teams via Statcast. We trained an ensemble model (Gradient Boosting, Random Forest, Logistic Regression) on 18K+ labeled plays with 45 engineered features and achieved 82% accuracy with 0.81 F1 score on 100 test videos.”

Be ready to discuss: - Why an ensemble: combining multiple model types reduces variance and captures different patterns. Gradient Boosting handles complex interactions, Random Forest reduces overfitting, Logistic Regression provides a strong baseline. - Why F1 score matters here: the classes (advance vs stay) might be imbalanced. F1 balances precision and recall. Accuracy alone could be misleading if one class dominates. - What features you engineered from video: runner lead distance, fielder depth, throw angle, time from catch to throw, runner speed estimate, etc. - How YOLOv8 fits in: object detection to identify and track players, ball, and bases in each frame.


Q49: You modeled fastball velocity from force plate data in your senior project. Explain.

Answer: “I used data from 80 NCAA Division I athletes collected on Hawkin Dynamics force plates – 5,500+ data points measuring ground reaction forces during various movements. I built regression models (in R using xgboost and ggplot2) to predict fastball velocity from force plate metrics. I used VIF analysis to remove multicollinear features, stepwise AIC selection for model simplification, and 10-fold cross-validation to evaluate generalization. I presented the findings to 200+ people at the ORU Senior Design Symposium.”

Why this matters for the Phillies: The job description explicitly mentions “biomechanical time-series data” as a dataset you’ll work with. You’ve already modeled biomechanical data in a baseball context. This is rare and directly relevant.

Be ready to discuss: - VIF (Variance Inflation Factor): measures multicollinearity. VIF > 5-10 means features are highly correlated and you should remove one. Important because correlated features inflate coefficient standard errors. - AIC (Akaike Information Criterion): balances model fit with complexity. Lower AIC = better. Stepwise AIC removes features that don’t improve the model enough to justify the added complexity. - 10-fold cross-validation: split data into 10 folds, train on 9, test on 1, rotate. Gives a robust estimate of out-of-sample performance.


Q50: You played D1 baseball as a pitcher. How does that inform your data engineering work?

Answer: “Playing pitcher at the D1 level gave me firsthand experience with the data I’m engineering. I used Trackman data to analyze my own pitch movement and spin rates, and ForceDecks to track my physical development. I led scouting and film sessions, which taught me how to translate data into actionable insights for non-technical audiences (teammates and coaches). When I’m building a data pipeline for baseball data, I understand what the end users actually need because I’ve been that end user. I know what whiff rate means in practice – not just as a number, but as the feeling of throwing a slider that a hitter can’t catch up to.”

This is your biggest differentiator. Most data engineering candidates understand data. You understand baseball AND data. The job description says “A demonstrated passion for baseball and sports analytics” – you lived it for 17 years, 3 at the D1 level.


Q51: You’re pursuing a Master’s in Data Science while working as a Data Engineer. How do you balance that?

Answer: “The two complement each other well. My coursework gives me deeper theoretical foundations in statistics and machine learning, while my work at Push Performance gives me practical engineering experience deploying those concepts in production. For example, I learned about ensemble methods in class and applied them directly in the Runner Advancement project. The time management skills I developed as a student-athlete – balancing 20+ hours of practice with a full course load – transfer directly to managing work and graduate school.”


Q52: Why are you interested in leaving Push Performance AZ for the Phillies?

Answer (personalize, but here’s a framework): “Push Performance has been an incredible experience – I built a product from scratch that generates real revenue, and I got to work with MLB-caliber athletes. But the Phillies represent the opportunity to work at the highest level of the sport I’ve played my whole life. The data challenges are bigger (more data sources, higher stakes, more complex infrastructure), the engineering team is more established (I’d learn from senior engineers), and the impact is direct – my work would help an MLB team make better decisions. I’m also excited about working with the specific datasets mentioned in the job description – Statcast, biomechanical time-series, and contract data – because I’ve already worked with similar data at a smaller scale.”


Q53: You’ve worked with ForceDecks/VALD data and Trackman data. How do they differ?

Answer: “Trackman/Hawk-Eye (Statcast) tracks the ball and bat in flight – pitch velocity, spin rate, movement, exit velocity, launch angle. It’s event-level data: one measurement per pitch. ForceDecks/force plates measure ground reaction forces during physical movements – jumps, squats, pushes. It’s time-series data: hundreds of measurements per second during each test. The engineering challenges differ: Statcast data is structured and event-based, so the pipeline is mostly about ingestion, cleaning, and aggregation. Force plate data is high-frequency time-series, so you need signal processing techniques (filtering, peak detection, integration) before you can extract meaningful features.”


Q54: You presented data findings to MLB pitchers (Crochet, Woo, Webb). How did you approach that?

Answer: “These are elite athletes, not data scientists. I focused on three things: keeping it visual (charts over tables), keeping it actionable (what should they DO differently, not just what the data says), and keeping it concise (5-10 minutes, not an hour). For example, instead of showing a model’s coefficient table, I’d show a heatmap of where hitters are weakest against their specific pitch types in specific counts. The key insight was that the data has to serve the athlete’s goals – more strikeouts, fewer hard-hit balls – not demonstrate how clever the analysis was.”

This maps directly to the job requirement: “Collaborate with data scientists and analysts to understand their data requirements and provide support.” You’ve already done this with an even harder audience (professional athletes who don’t care about SQL).


Q55: Your resume mentions BigQuery. Compare it to other data warehouses.

Answer: “BigQuery is Google Cloud’s serverless data warehouse. It’s columnar, scales automatically, and you pay per query (no infrastructure to manage). Compared to AWS Redshift, which requires provisioning clusters and managing nodes, BigQuery is simpler operationally but gives you less control over performance tuning. Compared to Snowflake, which separates compute and storage and runs on any cloud, BigQuery is tightly integrated with the GCP ecosystem. I used BigQuery at Push Performance for caching API responses and storing processed athlete data. For the Phillies, the choice would depend on which cloud provider the org uses.”



PART 3: QUICK REFERENCE CHEAT SHEET

SQL Syntax You Should Know Cold

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2 DESC)
RANK() OVER (PARTITION BY col ORDER BY col2 DESC)
LAG(column, 1) OVER (ORDER BY timestamp)
SUM(column) OVER (ORDER BY timestamp ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)

-- Conditional aggregation (DuckDB/Postgres)
COUNT(*) FILTER (WHERE condition)
-- Equivalent in MySQL:
SUM(CASE WHEN condition THEN 1 ELSE 0 END)

-- Prevent division by zero
NULLIF(denominator, 0)

-- CTE
WITH cte_name AS (
    SELECT ...
)
SELECT ... FROM cte_name

-- UPSERT (Postgres)
INSERT INTO table (...) VALUES (...)
ON CONFLICT (primary_key) DO UPDATE SET col = EXCLUDED.col;

Python Patterns You Should Know

# Context manager
with open(file) as f:
    data = json.load(f)

# List comprehension vs generator
results = [process(x) for x in items]      # all in memory
results = (process(x) for x in items)      # lazy, one at a time

# Type hints
def func(x: int) -> Optional[str]:

# *args and **kwargs
def extract(*keys, default=None):  # keys is a tuple of positional args

# Set for O(1) lookup (used in deduplication)
seen = set()
if item not in seen:
    seen.add(item)

Your Pipeline’s Key Numbers (Memorize These)

Metric | Value
Total records | 1,254
Duplicate samples removed | 4,254
Clean samples after dedup | 96,943
Unique batters | 31
Confirmed swings | ~340
Suspected swings | ~90
Attack angle coverage | 98.6% of swings
Median bat speed (reliable) | 70.1 mph
MLB avg bat speed | 71.5 mph
Pipeline total runtime | ~2.8 seconds
Sequential vs parallel flatten | 0.8s vs 3.8s (sequential wins)
Total tests | 60

Their Stack (From Job Description) – Mapped to YOUR Experience

Technology | Your experience | What to say
Python | Primary language: pipelines at Push Performance, ML projects, this technical project | “Python is my primary language. I’ve built production pipelines, ML models, and data processing systems with it.”
TypeScript | Built React + TS dashboards at Push Performance | “I built the React + TypeScript dashboards at Push Performance – spider charts, percentile comparisons, PDF reporting.”
Go | No direct experience | “I haven’t used Go yet, but I’m comfortable picking up new languages – I work across Python, R, TypeScript, and C/C++.”
SQL | BigQuery at Push, DuckDB in this project, MySQL listed on resume | “I use SQL daily – BigQuery for caching and analytics at Push Performance, and DuckDB/MySQL for project work. Comfortable with window functions, CTEs, aggregations.”
Airflow/Dagster/Prefect | No direct experience (but you did ETL orchestration at Push) | “I haven’t used Airflow specifically, but I built and orchestrated ETL pipelines at Push Performance with scheduling and API rate limiting. I understand DAG concepts and can ramp up on the specific tool quickly.”
Docker | Be honest about your level | Know the basics at minimum: Dockerfile, images, containers.
dbt | No direct experience | “I understand the ELT model and how dbt fits in for SQL-based transformations in the warehouse.”
Cloud (GCP/AWS) | GCP + BigQuery at Push Performance | “I’ve worked with Google Cloud and BigQuery in production at Push Performance – caching, data storage, and ETL pipelines.”

Your Key Differentiators (Mention These Throughout the Interview)

  1. You played D1 baseball as a pitcher. You understand the data from the athlete’s perspective. You’ve used Trackman and ForceDecks data on yourself.
  2. You’ve already presented data to MLB pitchers. Crochet, Woo, Webb. The Phillies will care about this – it shows you can bridge the gap between data and baseball operations.
  3. You’ve built a revenue-generating data product. The Push Performance assessment report generates $100K+/year. You’re not just academic – you’ve shipped production software that makes money.
  4. You’ve worked with the exact data types this job mentions. Statcast/Trackman data (1.4M pitches), biomechanical time-series (ForceDecks), and player performance data.
  5. You have both ML and engineering skills. Most candidates are one or the other. You can build the pipeline AND understand the models it feeds.
  6. MS in Data Science + BS in CS. Strong technical foundation in both engineering and statistics.