BigQuery Window Functions & Advanced Features - Beginner’s Guide

Getting Started

What You’ll Need

  1. A Google account
  2. Access to BigQuery (free tier is fine!)
  3. Go to: https://console.cloud.google.com/bigquery

Dataset: USA Baby Names (1910-Current)

Public Dataset: bigquery-public-data.usa_names.usa_1910_current

This dataset contains baby name registrations from US Social Security records: - name: Baby name (e.g., “Emma”, “Michael”) - gender: M (male) or F (female) - number: How many babies were given this name - year: Year of birth (1910-2023) - state: US state code (e.g., “CA”, “NY”)

Example row: name=‘Emma’, gender=‘F’, number=20355, year=2020, state=‘CA’ means 20,355 baby girls were named Emma in California in 2020.


Part 1: Window Functions with OVER Clause

What Are Window Functions?

Think of window functions as a way to perform calculations across a set of rows that are related to the current row, WITHOUT collapsing those rows like GROUP BY does. You get to see individual rows AND aggregate calculations at the same time!

1.1 Basic OVER() - Running Total

Goal: Calculate the cumulative number of babies named “Emma” over the years

SELECT 
  year,                                                    -- Show the year
  SUM(number) as yearly_total,                           -- Total Emmas born this year
  SUM(SUM(number)) OVER (ORDER BY year) as cumulative_total  -- Running total up to this year
FROM `bigquery-public-data.usa_names.usa_1910_current`   -- From the public dataset
WHERE name = 'Emma'                                       -- Only look at name "Emma"
GROUP BY year                                             -- One row per year
ORDER BY year;                                            -- Sort by year

Line-by-line explanation: 1. SELECT year - We want to see which year we’re talking about 2. SUM(number) as yearly_total - Add up all the Emmas born in each year (across all states). This is a regular aggregate function 3. SUM(SUM(number)) OVER (ORDER BY year) as cumulative_total - This is the magic! - Inner SUM(number): First, we sum the numbers for each year (same as line 2) - Outer SUM(...) OVER (ORDER BY year): Now, take those yearly totals and create a running sum - OVER (ORDER BY year): This creates a “window” that starts at the earliest year and grows as we move forward in time - Result: Each row shows the total number of Emmas born from the beginning of time up to that year 4. WHERE name = 'Emma' - Filter to only Emma (before grouping) 5. GROUP BY year - Combine all states into one row per year 6. ORDER BY year - Show results chronologically

What you’ll see: The results will show each year with two totals: the number of Emmas born that specific year, and the cumulative total from 1910 up to and including that year. The cumulative total should always increase (or stay the same) as you go down the rows, since you’re adding more years to the sum.

Key concept: The OVER clause doesn’t reduce the number of rows. You still get one row per year, but each row can “see” other rows to perform calculations.

1.2 PARTITION BY - Ranking Within Groups

Goal: Find the top 5 most popular names for boys and girls in 2020

SELECT 
  year,                                                     -- The year (2020)
  gender,                                                   -- M or F
  name,                                                     -- Name like "Olivia" or "Liam"
  SUM(number) as total_count,                              -- Total babies with this name across all states
  RANK() OVER (                                            -- Assign a rank (1st, 2nd, 3rd...)
    PARTITION BY year, gender                              -- Separate rankings for M and F
    ORDER BY SUM(number) DESC                              -- Highest count gets rank 1
  ) as popularity_rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020                                          -- Only look at 2020 data
GROUP BY year, gender, name                                -- One row per unique combination
QUALIFY popularity_rank <= 5                               -- Only keep ranks 1-5
ORDER BY gender, popularity_rank;                          -- Sort by gender, then rank

Line-by-line explanation: 1. year, gender, name - The columns we want to display 2. SUM(number) as total_count - Add up how many babies got this name across all 50 states - Example: “Olivia” might appear in California (5,000), Texas (4,000), etc. This adds them all up 3. RANK() OVER (...) - This is where the magic happens! We’re assigning a ranking 4. PARTITION BY year, gender - CRITICAL CONCEPT! This divides our data into separate groups - Think of it as creating two separate ranking lists: one for boys, one for girls - Without this, boys and girls would compete in the same ranking - It’s like saying “reset the ranking to 1 every time the gender changes” 5. ORDER BY SUM(number) DESC - Within each partition (M and F), rank by highest count first - DESC means descending (biggest to smallest) - The name with the most babies gets rank 1 6. WHERE year = 2020 - We only want one year’s data for this example 7. GROUP BY year, gender, name - Combine all states into one row per name 8. QUALIFY popularity_rank <= 5 - Special BigQuery feature! Like WHERE but for window functions - Regular WHERE doesn’t work with window functions (they haven’t been calculated yet) - QUALIFY runs AFTER the window function is calculated - This keeps only ranks 1, 2, 3, 4, and 5

What you’ll see: The results will show the top 5 names for each gender. Girls and boys will have separate rankings - the most popular girl name will be rank 1, and separately, the most popular boy name will also be rank 1. Each row shows the name, how many babies received that name across all states, and its ranking within that gender.

Key concept - PARTITION BY: Imagine you have a deck of cards. PARTITION BY is like separating them into suits (hearts, diamonds, clubs, spades). Then, within each suit, you can rank them (Ace is 1st, King is 2nd, etc.). Without PARTITION BY, you’d be ranking all 52 cards together!

1.3 LAG and LEAD - Comparing with Previous/Next Rows

Goal: See how the name “Sophia” grew or declined in popularity year over year

SELECT 
  year,                                                    -- The year
  name,                                                    -- Name (will always be "Sophia")
  SUM(number) as total,                                   -- How many Sophias this year
  LAG(SUM(number)) OVER (                                 -- Get the value from the PREVIOUS row
    PARTITION BY name                                      -- Stay within the same name
    ORDER BY year                                          -- Order by year (so we know what "previous" means)
  ) as prev_year_total,                                   -- Total from last year
  SUM(number) - LAG(SUM(number)) OVER (                   -- Current year minus last year
    PARTITION BY name 
    ORDER BY year
  ) as change                                             -- How much it increased or decreased
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'Sophia' AND year BETWEEN 2010 AND 2020     -- Only Sophia, only these years
GROUP BY year, name                                       -- One row per year
ORDER BY year;                                            -- Show chronologically

Line-by-line explanation: 1. year, name - What we’re displaying 2. SUM(number) as total - Add up all Sophias across all states for this year 3. LAG(SUM(number)) OVER (...) - LAG means “look at the previous row” - Think of LAG as “looking backward” in time - This gets the total from the year before - Example: When processing 2015, LAG gives us the value from 2014 4. PARTITION BY name - Keep calculations within the same name - Since we’re only looking at “Sophia”, this doesn’t change much here - But it’s good practice in case you want to analyze multiple names 5. ORDER BY year - This is crucial! It tells LAG which direction “previous” is - Without this, LAG wouldn’t know which row is “previous” 6. SUM(number) - LAG(SUM(number)) OVER (...) - Subtract last year from this year - If positive: the name grew in popularity - If negative: the name declined in popularity - If NULL: there’s no previous year (happens in first row)

What you’ll see: The results will show each year between 2010 and 2020, with the total number of Sophias born that year, the previous year’s total (which will be NULL for the first year since there’s no prior year in our filtered dataset), and the year-over-year change (positive numbers indicate growth, negative indicate decline).

Key concept - LAG vs LEAD: - LAG = look backward (previous row) - useful for “compared to last year” - LEAD = look forward (next row) - useful for “compared to next year” - Both can take a second parameter: LAG(value, 2) means “look back 2 rows”

Part 2: Common Table Expressions (CTEs)

What Are CTEs?

A CTE (Common Table Expression) is like creating a temporary named result set that you can reference in your main query. Think of it as: - Creating a “scratch table” that only exists for this one query - Breaking a complex query into readable steps - Giving meaningful names to intermediate results

Syntax: Start with WITH name_of_cte AS (query)

2.1 Simple CTE - Breaking Down Complex Logic

Goal: Calculate year-over-year change in total births by gender

-- Step 1: Create a CTE named "yearly_totals"
WITH yearly_totals AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    SUM(number) as total_births                           -- Add up ALL names for this year/gender
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  GROUP BY year, gender                                   -- One row per year/gender combo
)
-- Step 2: Now use the CTE in the main query
SELECT 
  year,                                                    -- The year
  total_births,                                           -- Total births (from the CTE)
  total_births - LAG(total_births) OVER (                -- Subtract last year's total
    PARTITION BY gender                                   -- Keep M and F separate
    ORDER BY year                                         -- Order by year
  ) as yearly_change                                      -- The year-over-year change
FROM yearly_totals                                        -- We're querying the CTE, not the original table!
WHERE gender = 'F'                                        -- Only female births
ORDER BY year DESC                                        -- Most recent first
LIMIT 10;                                                 -- Just show the last 10 years

Line-by-line explanation:

Part 1 - The CTE: 1. WITH yearly_totals AS (...) - Create a temporary result set called “yearly_totals” 2. SELECT year, gender, SUM(number) - In the CTE, calculate total births per year/gender 3. FROM bigquery-public-data... - Source data 4. GROUP BY year, gender - Aggregate all names into one total per year/gender

Part 2 - The Main Query: 5. SELECT year, total_births - Pull data from the CTE 6. total_births - LAG(total_births) OVER (...) - Compare this year to last year 7. PARTITION BY gender - Keep male and female calculations separate 8. ORDER BY year - Tell LAG which row is “previous” 9. FROM yearly_totals - Important! We’re now querying the CTE, not the original table 10. WHERE gender = 'F' - Filter to only females 11. ORDER BY year DESC - Show newest years first 12. LIMIT 10 - Only show 10 rows

Why use a CTE here? Without the CTE, we’d have to calculate SUM(number) multiple times in the same query, making it harder to read:

-- Without CTE (harder to read):
SELECT 
  year,
  SUM(number),
  SUM(number) - LAG(SUM(number)) OVER (...)
FROM ...
GROUP BY year, gender

With the CTE, we calculate the sum ONCE and give it a clear name (total_births), making the logic easier to follow.

2.2 Multiple CTEs - Name Diversity Analysis

Goal: Calculate how many different names account for 50% of all births in different years

This is more complex, so we’ll use multiple CTEs to break it down into steps.

-- STEP 1: Calculate what percentage each name represents
WITH name_percentages AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    name,                                                  -- Individual name
    number,                                                -- How many babies got this name
    SUM(number) OVER (PARTITION BY year, gender) as total_births,  -- Total births for this year/gender
    number / SUM(number) OVER (PARTITION BY year, gender) * 100 as percentage  -- What % this name is
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year IN (1950, 2020)                              -- Compare two different years
),

-- STEP 2: Rank names and calculate cumulative percentage
ranked_names AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    name,                                                  -- Individual name
    percentage,                                            -- What % this name represents (from CTE 1)
    SUM(percentage) OVER (                                -- Running sum of percentages
      PARTITION BY year, gender                           -- Separate for each year/gender
      ORDER BY number DESC                                -- Start with most popular names
    ) as cumulative_percentage,                           -- Running total: 1st name = 5%, 1st+2nd = 8%, etc.
    ROW_NUMBER() OVER (                                   -- Assign row numbers
      PARTITION BY year, gender 
      ORDER BY number DESC
    ) as rank                                             -- 1 = most popular, 2 = 2nd most popular, etc.
  FROM name_percentages
)

-- STEP 3: Find where cumulative percentage crosses 50%
SELECT 
  year,                                                    -- The year
  gender,                                                  -- M or F
  MIN(rank) as names_for_50_percent                       -- How many names to reach 50%
FROM ranked_names
WHERE cumulative_percentage >= 50                         -- Only rows where we've reached 50%
GROUP BY year, gender                                     -- One result per year/gender
ORDER BY year, gender;                                    -- Sort the results

Line-by-line explanation:

CTE 1 - name_percentages: 1. WITH name_percentages AS (...) - Create first temporary result 2. year, gender, name, number - Basic fields we need 3. SUM(number) OVER (PARTITION BY year, gender) as total_births - Window function! - For EVERY row with year=1950 and gender=F, this calculates the same total - It’s the denominator for our percentage calculation - Example: If 2 million girls were born in 1950, every row for 1950/F shows 2,000,000 4. number / SUM(number) OVER (...) * 100 as percentage - Calculate what % each name is - Example: “Mary” with 50,000 births ÷ 2,000,000 total × 100 = 2.5% 5. WHERE year IN (1950, 2020) - We’re comparing these two years

CTE 2 - ranked_names: 6. WITH ..., ranked_names AS (...) - Create second CTE (note the comma after first CTE) 7. SELECT year, gender, name, percentage - Pull data from the first CTE 8. SUM(percentage) OVER (...) - Create a running sum of percentages - First row (most popular): might be 5% - Second row: 5% + 3% = 8% - Third row: 8% + 2.5% = 10.5% - Eventually one row crosses 50%! 9. ORDER BY number DESC - Start with the most popular name - This ensures our cumulative sum builds from #1 ranked name onward 10. ROW_NUMBER() OVER (...) - Assign sequential numbers (1, 2, 3, 4…) - This tells us “how many names” we’ve included

Main Query: 11. SELECT year, gender, MIN(rank) - Now use the second CTE 12. WHERE cumulative_percentage >= 50 - Only look at rows where we’ve reached 50% - Multiple names might cross 50% (like 49.8%, 51.2%, 53%, etc.) 13. MIN(rank) - Get the FIRST name that crossed 50% - This tells us how many names it took to reach 50% of all births

What you’ll see: The results will show four rows (one for each year/gender combination: 1950 F, 1950 M, 2020 F, 2020 M). Each row shows how many different names were needed to account for 50% of all births in that year/gender. Generally, you’ll see that older years (like 1950) needed fewer names to reach 50%, indicating less diversity - parents chose from a smaller pool of popular names. More recent years typically need many more names to reach 50%, showing greater diversity in naming choices.

Why this matters: This measures name diversity. In 1950, parents chose from fewer names (everyone named their daughters Mary, Susan, Linda, etc.). By 2020, parents picked more unique names, so you need to count more names to reach 50% of births.

Key concept - Multiple CTEs: You can chain CTEs by separating them with commas:

WITH first_cte AS (...),
     second_cte AS (...),
     third_cte AS (...)
SELECT ... FROM third_cte;

Each CTE can reference the ones above it!


Part 3: Wildcard Tables

What Are Wildcard Tables?

Sometimes your data is split across multiple tables with similar names (like sales_2020, sales_2021, sales_2022). Instead of querying each one separately, you can use a wildcard (*) to query them all at once!

Syntax: FROM dataset.table_prefix* where * matches any characters

Important Note About Our Dataset

The USA names dataset is actually a single table, not multiple tables. So for this section, I’ll show you the concept using examples, and you can apply this when you encounter datasets that ARE split into multiple tables.

3.1 Basic Wildcard Concept

Example structure (hypothetical - not our actual dataset):

bigquery-public-data.usa_names.usa_1910
bigquery-public-data.usa_names.usa_1911
bigquery-public-data.usa_names.usa_1912
...
bigquery-public-data.usa_names.usa_2020

How you’d query all of them:

SELECT 
  _TABLE_SUFFIX as year,           -- Special column! Contains the part after usa_
  COUNT(*) as row_count             -- Count rows in each table
FROM `bigquery-public-data.usa_names.usa_*`  -- The * matches all tables starting with usa_
WHERE _TABLE_SUFFIX BETWEEN '1910' AND '2020'  -- Filter which tables to include
GROUP BY year
ORDER BY year;

Line-by-line explanation: 1. _TABLE_SUFFIX - Special BigQuery column that only exists with wildcard queries - It contains the part of the table name AFTER the * - Example: from table usa_1950, _TABLE_SUFFIX = ‘1950’ - This is a STRING, not a number! 2. COUNT(*) - Count rows in each table 3. FROM bigquery-public-data.usa_names.usa_* - The wildcard query - usa_* matches: usa_1910, usa_1911, usa_1950, usa_anything 4. WHERE _TABLE_SUFFIX BETWEEN '1910' AND '2020' - Filter which tables to scan - This is important for cost! Don’t scan tables you don’t need - Remember: _TABLE_SUFFIX is a string, so ‘1910’ < ‘1950’ < ‘2020’ in alphabetical order

Wildcard tables in BigQuery let you query many similarly named tables at once using a single query, instead of repeating the same code for each table. They are especially useful for time-series data, where each day or year is stored in its own table.

What a table wildcard is

  • A table wildcard is the * character placed at the end of a table prefix in the FROM clause.
    • Example pattern: `project.dataset.table_prefix*`.
  • BigQuery expands this pattern into all tables that share that prefix (for example gsod1940, gsod1941, …, gsod1949).
  • When you use a wildcard, BigQuery automatically provides a special column called _TABLE_SUFFIX that contains the part of the table name that matched the * (the “changing” part of the name).

Why table wildcards are useful

  • They avoid running the same query many times on different tables (for example one table per day or per year).
  • They make it easier to analyze time‑series data (daily logs, yearly summaries, etc.) stored in multiple tables.
  • They let you limit cost and scan size by filtering which tables you include via _TABLE_SUFFIX, instead of querying an entire dataset.

3.2 NOAA GSOD example – explained line by line

The query

SELECT
  max,
  ROUND((max - 32) * 5/9, 1) AS celsius,
  mo,
  da,
  year,
  _TABLE_SUFFIX AS suffix
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  max != 9999.9
  AND ( _TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4' )
ORDER BY
  max DESC
LIMIT 100;

Step‑by‑step explanation

  1. FROM clause with wildcard

    FROM `bigquery-public-data.noaa_gsod.gsod194*`
    • bigquery-public-data → project that hosts public datasets.
    • noaa_gsod → dataset containing weather observations.
    • gsod194* → table prefix plus *.
      • This matches all tables whose names start with gsod194, such as gsod1940, gsod1941, … gsod1949.
    • Because of the *, you now have access to _TABLE_SUFFIX.
  2. What _TABLE_SUFFIX means here

    • For table gsod1940, prefix is gsod194 and suffix is '0'.
    • For table gsod1944, suffix is '4', and so on.
    • _TABLE_SUFFIX will therefore be '0', '1', … '9' depending on which table the current row came from.
  3. SELECT list

    SELECT
      max,
      ROUND((max - 32) * 5/9, 1) AS celsius,
      mo,
      da,
      year,
      _TABLE_SUFFIX AS suffix
    • max → maximum daily temperature (in Fahrenheit).
    • ROUND((max - 32) * 5/9, 1) AS celsius → converts Fahrenheit to Celsius and rounds to 1 decimal place.
    • mo, da, year → month, day, and year columns for the observation date.
    • _TABLE_SUFFIX AS suffix → shows which specific table the row came from (which final digit the table name has).
  4. WHERE clause

    WHERE
      max != 9999.9
      AND ( _TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4' )
    • max != 9999.9 → filters out rows where max is an invalid placeholder (often used as a missing-value flag).
    • (_TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4') → keeps only tables whose names end in 0 or 4.
      • That means only years like 1940, 1944, 1948, etc. are included.
  5. ORDER and LIMIT

    ORDER BY
      max DESC
    LIMIT 100;
    • ORDER BY max DESC → sorts results from hottest to coolest.
    • LIMIT 100 → shows only the top 100 hottest records.

You can see that one query works across several year‑tables at once, instead of writing a separate query for each gsod1940, gsod1941, etc.

GitHub Archive example – explained line by line

The query

SELECT
  _TABLE_SUFFIX AS day_suffix,
  COUNT(*) AS event_count
FROM
  `githubarchive.day.2017*`
WHERE
  type = "WatchEvent"
  AND _TABLE_SUFFIX BETWEEN '0101' AND '0131'
GROUP BY
  day_suffix
ORDER BY
  day_suffix;

Step‑by‑step explanation

  1. FROM clause with wildcard

    FROM `githubarchive.day.2017*`
    • githubarchive → public GitHub Archive project.
    • day.2017* → matches tables like day.20170101, day.20170102, … for each day in 2017.
    • Again, because of *, _TABLE_SUFFIX becomes available.
  2. What _TABLE_SUFFIX represents here

    • The prefix here is 2017.
    • For table day.20170101, the suffix after the prefix 2017 is '0101'.
    • For day.20171231, the suffix is '1231'.
    • _TABLE_SUFFIX therefore represents the month and day (MMDD) for that table.
  3. SELECT list

    SELECT
      _TABLE_SUFFIX AS day_suffix,
      COUNT(*) AS event_count
    • _TABLE_SUFFIX AS day_suffix → shows the day for each result row (e.g. '0101' for Jan 1).
    • COUNT(*) AS event_count → counts how many events exist for that day.
  4. WHERE clause

    WHERE
      type = "WatchEvent"
      AND _TABLE_SUFFIX BETWEEN '0101' AND '0131'
    • type = "WatchEvent" → keeps only GitHub events where a user “stars” a repository.
    • _TABLE_SUFFIX BETWEEN '0101' AND '0131' → restricts the tables scanned to those whose suffix (MMDD) is between '0101' (Jan 1) and '0131' (Jan 31).
      • This effectively means “only January 2017 tables”.
  5. GROUP BY and ORDER BY

    GROUP BY
      day_suffix
    ORDER BY
      day_suffix;
    • GROUP BY day_suffix → aggregates all events for each day.
    • ORDER BY day_suffix → sorts days in calendar order from '0101' to '0131'.

This gives you one row per day in January with the number of “watch” events, using many daily tables in a single query.

3.3 How to write your own wildcard query

  1. Find the repeated table pattern

    • Look for tables that share a common prefix and differ by dates or numbers, such as:
      • events_20230101, events_20230102, …
      • sales_2020, sales_2021, sales_2022, …
    • Decide what the common prefix is (events_, sales_, day.2017, etc.).
  2. Write the FROM clause with *

    • Use backticks around the full identifier:
      • FROM \project_id.dataset_id.table_prefix*``
    • Example:
      • FROM \my_project.analytics.events_2023*``
  3. Use _TABLE_SUFFIX in WHERE to limit which tables you scan

    • For date‑like suffixes:
      • WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230331'
    • For simple numeric suffixes:
      • WHERE _TABLE_SUFFIX IN ('0', '4')
    • Combine with normal column filters, such as event type, country, or status.
  4. Include _TABLE_SUFFIX in SELECT when helpful

    • To see which table each result row came from or to group by it:

      SELECT
        _TABLE_SUFFIX AS source_suffix,
        COUNT(*) AS row_count
      FROM
        `project.dataset.tableprefix*`
      GROUP BY
        source_suffix;
  5. Avoid common beginner mistakes

    • Always use backticks (`) for table names, not quotes:
      • Correct: `bigquery-public-data.noaa_gsod.gsod194*`
      • Incorrect: ’bigquery-public-data.noaa_gsod.gsod194*’
    • Remember that _TABLE_SUFFIX only works when there is a wildcard * in the table reference. It will not work if you query a single table with no wildcard.
    • Make sure the values you compare _TABLE_SUFFIX to (like '0101', '20230101', '0') match the actual format of the suffix in the table names.

3.4 When to Use Wildcard Tables

Use wildcard tables when: - Your data is partitioned into separate tables by date, region, or category - You want to query across all (or some) of those tables - You want to save typing many UNION ALL statements

Example without wildcards (tedious):

SELECT * FROM sales_2020
UNION ALL
SELECT * FROM sales_2021
UNION ALL
SELECT * FROM sales_2022
-- Imagine doing this for 100 tables!

Example with wildcards (simple):

SELECT * FROM sales_*
WHERE _TABLE_SUFFIX BETWEEN '2020' AND '2022'

3.5 Best Practices

  1. **Always use WHERE _TABLE_SUFFIX** to limit which tables you scan (saves money!)
  2. **Remember _TABLE_SUFFIX is a string**, so use string comparisons
  3. Check the table structure first - not all datasets use multiple tables
  4. Be specific with your prefix - usa_* is better than * to avoid matching unintended tables

Why This Matters

In BigQuery, you pay based on how much data you scan. If you have 100 tables with data from 100 different days, and you only need data from 5 days, wildcard tables with proper filtering help you scan only 5 tables instead of 100!

Cost example: - Query without filter: scans all 100 tables = $$ - Query with WHERE _TABLE_SUFFIX IN ('20240101', '20240102', ...): scans only 5 tables = $


Part 4: Arrays

What Are Arrays?

An array is a list of values stored in a single field. Think of it like a shopping list stored in one cell of a spreadsheet.

Example: Instead of having separate rows for: - Row 1: “Emma” - Row 2: “Olivia”
- Row 3: “Ava”

You can have one row with: [“Emma”, “Olivia”, “Ava”]

4.1 Creating Arrays - ARRAY_AGG

Goal: For each year and gender, create an array containing the top 5 most popular names

-- Step 1: Find the top 5 names
WITH top_names AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    name,                                                  -- Individual name
    SUM(number) as total,                                 -- Total babies with this name
    RANK() OVER (                                         -- Rank names within each group
      PARTITION BY year, gender                           -- Separate ranking for each year/gender
      ORDER BY SUM(number) DESC                           -- Highest count = rank 1
    ) as rank
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year = 2020                                       -- Just look at 2020
  GROUP BY year, gender, name                             -- One row per name
  QUALIFY rank <= 5                                       -- Keep only top 5
)

-- Step 2: Aggregate those names into arrays
SELECT 
  year,                                                    -- The year (2020)
  gender,                                                  -- M or F
  ARRAY_AGG(name ORDER BY rank) as top_5_names,          -- Create an array of names
  ARRAY_AGG(total ORDER BY rank) as top_5_counts         -- Create an array of their counts
FROM top_names
GROUP BY year, gender                                     -- One row per year/gender combination
ORDER BY gender;                                          -- Show F first, then M

Line-by-line explanation:

CTE - top_names: 1-6. Same as before: rank names by popularity, keep top 5

Main Query: 7. ARRAY_AGG(name ORDER BY rank) - This is the magic! - ARRAY_AGG collects multiple values into a single array - Takes all the name values in the group and puts them in a list - ORDER BY rank ensures they’re in order: [1st place, 2nd place, 3rd place, …] - Result: Instead of 5 rows with one name each, you get 1 row with 5 names in an array 8. ARRAY_AGG(total ORDER BY rank) - Same thing, but for the counts 9. GROUP BY year, gender - Collapse multiple name rows into one row with arrays

What you’ll see: Two rows - one for females (F) and one for males (M) from 2020. Each row contains two arrays: one with the top 5 most popular names in order, and another with the corresponding counts for those names, also in order. The first element in each array corresponds to the most popular name, the second element to the second most popular, and so on.

Before ARRAY_AGG (5 rows for females): | year | gender | name | total | |——|——–|———–|——–| | 2020 | F | Olivia | 17,535 | | 2020 | F | Emma | 15,581 | | 2020 | F | Ava | 13,084 | | … 2 more rows …

After ARRAY_AGG (1 row for females): | year | gender | top_5_names | top_5_counts | |——|——–|————-|————–| | 2020 | F | [Olivia, Emma, Ava, Charlotte, Sophia] | [17535, 15581, 13084, …] |

When to use ARRAY_AGG: - You want to group related items together - You’re building a summary with multiple values per category - You want to reduce the number of rows in your result

4.2 UNNEST - Expanding Arrays Back Into Rows

Goal: Take an array and convert it back into separate rows

Think of UNNEST as the opposite of ARRAY_AGG. It “unpacks” an array.

-- Step 1: Create an array of sample names
WITH name_arrays AS (
  SELECT 
    year,                                                  -- The year
    ARRAY_AGG(                                            -- Create an array
      DISTINCT name                                       -- Only unique names (no duplicates)
      ORDER BY name                                       -- Sort alphabetically
      LIMIT 10                                            -- Only take 10 names
    ) as sample_names                                     -- Name of the array column
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year = 2020 AND gender = 'F'                     -- 2020, females only
  GROUP BY year                                           -- One row (with one array)
)

-- Step 2: UNNEST the array back into rows
SELECT 
  year,                                                    -- The year (same for all rows)
  name                                                     -- Individual name (from the array)
FROM name_arrays,                                         -- From our CTE
  UNNEST(sample_names) as name;                          -- UNNEST creates new rows

Line-by-line explanation:

CTE - name_arrays: 1. ARRAY_AGG(DISTINCT name ORDER BY name LIMIT 10) - Build an array - DISTINCT removes duplicates (some names appear in multiple states) - ORDER BY name sorts alphabetically - LIMIT 10 keeps only 10 names in the array 2. Result: ONE row with year=2020 and an array of 10 names

Main Query: 3. FROM name_arrays, UNNEST(sample_names) as name - The UNNEST magic! - UNNEST(sample_names) takes the array and creates one new row per element - as name gives the extracted value a column name - The comma acts like a CROSS JOIN - it pairs each row in name_arrays with each element in the array

Visual example:

Before UNNEST (1 row): | year | sample_names | |——|————–| | 2020 | [Ava, Charlotte, Emma, …] |

After UNNEST (10 rows): | year | name | |——|———–| | 2020 | Ava | | 2020 | Charlotte | | 2020 | Emma | | 2020 | … (7 more rows) |

When to use UNNEST: - You have data stored in arrays and need to work with individual elements - You want to join array elements with other tables - You need to filter or aggregate individual array items

Real-world example: Imagine a table of orders where each order has an array of product IDs: - Order 1: products = [101, 102, 103] - Order 2: products = [104, 105]

UNNEST lets you create one row per product so you can count how many times each product was ordered.

4.3 Array Functions - Working with Array Data

Goal: Use built-in functions to work with arrays

SELECT 
  year,                                                    -- The year
  gender,                                                  -- M or F
  ARRAY_AGG(                                              -- Create an array of top 5 names
    name 
    ORDER BY number DESC                                  -- Sort by popularity
    LIMIT 5                                               -- Take only top 5
  ) as top_names,
  ARRAY_LENGTH(                                           -- Count items in the array
    ARRAY_AGG(name ORDER BY number DESC LIMIT 5)
  ) as array_size,                                        -- Should be 5
  ARRAY_AGG(name ORDER BY number DESC LIMIT 5)[OFFSET(0)] as most_popular_name  -- Get first element
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020                                         -- Just 2020 data
GROUP BY year, gender;                                    -- One row per year/gender

Line-by-line explanation:

  1. ARRAY_AGG(name ORDER BY number DESC LIMIT 5) - Create the array (same as before)
  2. ARRAY_LENGTH(...) - Counts how many elements are in an array
    • Takes an array as input
    • Returns a number (the count)
    • Example: ARRAY_LENGTH([1,2,3,4,5]) returns 5
  3. [OFFSET(0)] - Access a specific element in the array
    • Arrays in BigQuery are ZERO-INDEXED (they start counting at 0, not 1)
    • OFFSET(0) = first element
    • OFFSET(1) = second element
    • OFFSET(4) = fifth element
    • Example: [“Emma”, “Olivia”, “Ava”][OFFSET(0)] returns “Emma”

Important: Array Indexing Starts at 0! | Position | Index | Name in Array | |———-|——-|—————| | 1st | OFFSET(0) | Emma | | 2nd | OFFSET(1) | Olivia | | 3rd | OFFSET(2) | Ava | | 4th | OFFSET(3) | Charlotte | | 5th | OFFSET(4) | Sophia |

Other useful array functions:

-- Check if an array contains a value
SELECT 'Emma' IN UNNEST(['Emma', 'Olivia', 'Ava']) as contains_emma;  -- Returns TRUE

-- Get array of only specific values
SELECT ARRAY(SELECT name FROM UNNEST(['Emma', 'Olivia', 'Ava']) as name WHERE name LIKE 'E%');
-- Returns ['Emma']

-- Concatenate two arrays
SELECT ARRAY_CONCAT(['Emma', 'Olivia'], ['Liam', 'Noah']);
-- Returns ['Emma', 'Olivia', 'Liam', 'Noah']

-- Get last element (using length)
SELECT ['Emma', 'Olivia', 'Ava'][OFFSET(ARRAY_LENGTH(['Emma', 'Olivia', 'Ava']) - 1)];
-- Returns 'Ava' (length=3, so index 2 is the last element)

Common mistakes: - Using [1] instead of [OFFSET(1)] - BigQuery requires the OFFSET keyword - Forgetting arrays start at 0 - OFFSET(1) is the SECOND element, not the first - Accessing an index that doesn’t exist - OFFSET(10) on a 5-element array causes an error


Part 5: Structs

What Are Structs?

A STRUCT is like a mini-table stored in a single field. It groups related pieces of data together.

Think of it like this: - Array: A list of similar items [apple, banana, orange] - Struct: A collection of different items {name: “apple”, color: “red”, price: 1.50}

Real-world analogy: An address is a struct: {street: “123 Main St”, city: “Boston”, zip: “02101”}

5.2 Arrays of Structs - The Ultimate Combo

Goal: Create arrays where each element is a struct (combining both concepts!)

This is very powerful and commonly used in BigQuery.

-- Step 1: Rank the names
WITH ranked_names AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    name,                                                  -- Name
    SUM(number) as total,                                 -- Total across all states
    RANK() OVER (                                         -- Calculate rank
      PARTITION BY year, gender 
      ORDER BY SUM(number) DESC
    ) as rank
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year IN (2010, 2015, 2020)                       -- Three different years
  GROUP BY year, gender, name
  QUALIFY rank <= 3                                       -- Keep top 3
)

-- Step 2: Create array of structs
SELECT 
  year,                                                    -- The year
  gender,                                                  -- M or F
  ARRAY_AGG(                                              -- Create an array
    STRUCT(name, total, rank)                             -- Each element is a struct
    ORDER BY rank                                         -- Order by rank
  ) as top_names                                          -- Name of the array column
FROM ranked_names
GROUP BY year, gender                                     -- One row per year/gender
ORDER BY year, gender;                                    -- Sort results

Line-by-line explanation:

CTE - ranked_names: 1-9. Find top 3 names per year/gender (we’ve seen this before)

Main Query: 10. ARRAY_AGG(STRUCT(name, total, rank) ORDER BY rank) - Array of structs! - STRUCT(name, total, rank) creates a struct for each name - ARRAY_AGG(...) collects all those structs into an array - Result: Instead of 3 rows (one per name), you get 1 row with an array of 3 structs

What you’ll see: Six rows total - one for each year/gender combination (2010 F, 2010 M, 2015 F, 2015 M, 2020 F, 2020 M). Each row contains an array in the top_names column. Each element in that array is a struct containing a name, its total count, and its rank. This compact format puts all three top names for a given year/gender into a single row.

Before (regular rows): | year | gender | name | total | rank | |——|——–|———|——-|——| | 2020 | F | Olivia | 17535 | 1 | | 2020 | F | Emma | 15581 | 2 | | 2020 | F | Ava | 13084 | 3 |

After (array of structs): | year | gender | top_names | |——|——–|———–| | 2020 | F | [{name: Olivia, total: 17535, rank: 1}, {name: Emma, total: 15581, rank: 2}, {name: Ava, total: 13084, rank: 3}] |

Why use arrays of structs? - Extremely compact data representation - Perfect for exporting to JSON - Common in APIs and modern data formats - One row can contain complex, nested information

Real-world examples: - Orders table: each order has an array of products (structs with product_id, quantity, price) - User table: each user has an array of addresses (structs with street, city, zip) - Events table: each event has an array of attendees (structs with name, email, status)

5.3 Accessing Struct Fields - Working with Structured Data

Goal: Query data from struct fields using dot notation

-- Step 1: Create structs with statistics
WITH name_stats AS (
  SELECT 
    year,                                                  -- The year
    STRUCT(                                               -- Create a struct of statistics
      COUNT(DISTINCT name) as unique_names,               -- Field 1: count of unique names
      SUM(number) as total_births,                       -- Field 2: total births
      AVG(number) as avg_per_name                        -- Field 3: average per name
    ) as stats                                            -- Name the struct "stats"
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE gender = 'M'                                      -- Only males
  GROUP BY year                                           -- One row per year
)

-- Step 2: Access the struct fields
SELECT 
  year,                                                    -- The year
  stats.unique_names,                                     -- Access field with dot notation
  stats.total_births,                                     -- Access field with dot notation
  ROUND(stats.avg_per_name, 2) as avg_per_name           -- Access and round
FROM name_stats
WHERE year BETWEEN 2010 AND 2020                         -- Recent years only
ORDER BY year;                                            -- Sort chronologically

Line-by-line explanation:

CTE - name_stats: 1. STRUCT(COUNT(...), SUM(...), AVG(...)) as stats - Create a struct with 3 fields - unique_names: how many different names exist - total_births: total number of boys born - avg_per_name: average number of boys per name (total ÷ unique) 2. GROUP BY year - Calculate these stats for each year

Main Query: 3. stats.unique_names - Dot notation to access struct fields - stats is the struct column name - .unique_names accesses the unique_names field inside the struct - It’s like saying “from the stats struct, give me the unique_names value” 4. ROUND(stats.avg_per_name, 2) - You can use struct fields in functions - Access the field with dot notation - Then apply any function you want

Visual representation:

What’s stored in the stats struct:

stats: {
  unique_names: 12500,
  total_births: 2000000,
  avg_per_name: 160.0
}

How to access it: - stats.unique_names → returns 12500 - stats.total_births → returns 2000000 - stats.avg_per_name → returns 160.0

What you’ll see: One row per year between 2010 and 2020 (11 rows total). Each row shows the year and three separate columns extracted from the stats struct: the number of unique male names that year, total male births, and the average number of babies per name (rounded to 2 decimal places).

Accessing nested structs: If you have a struct inside a struct, use multiple dots:

-- Example: address struct inside person struct
person.address.city
person.address.zip

-- Reading from left to right:
-- 1. Get the "person" struct
-- 2. From that struct, get the "address" struct
-- 3. From that struct, get the "city" field

Best practice: When working with structs, always use meaningful field names. Instead of:

STRUCT(a, b, c)  -- Bad: what do these mean?

Use:

STRUCT(name, count, rank)  -- Good: clear meaning

Part 6: Putting It All Together - Advanced Example

Now let’s combine everything you’ve learned: CTEs, window functions, arrays, and structs!

Goal: Find names that are growing fastest in popularity, with complete year-by-year analysis

-- STEP 1: Get yearly totals for each name
WITH yearly_stats AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    name,                                                  -- Individual name
    SUM(number) as total                                  -- Total across all states
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year BETWEEN 2015 AND 2020                       -- Focus on recent years
  GROUP BY year, gender, name                             -- One row per year/gender/name combo
),

-- STEP 2: Calculate year-over-year growth
growth_calc AS (
  SELECT 
    year,                                                  -- The year
    gender,                                                -- M or F
    name,                                                  -- Name
    total,                                                 -- Total this year (from CTE 1)
    LAG(total) OVER (                                     -- Get previous year's total
      PARTITION BY name, gender                           -- Keep same name/gender
      ORDER BY year                                       -- Order by year
    ) as prev_year,                                       -- Last year's total
    total - LAG(total) OVER (                            -- Absolute growth (difference)
      PARTITION BY name, gender 
      ORDER BY year
    ) as absolute_growth,
    SAFE_DIVIDE(                                          -- Percentage growth
      total - LAG(total) OVER (                          -- Change
        PARTITION BY name, gender 
        ORDER BY year
      ),
      LAG(total) OVER (                                   -- Divided by previous year
        PARTITION BY name, gender 
        ORDER BY year
      )
    ) * 100 as pct_growth                                 -- Convert to percentage
  FROM yearly_stats
),

-- STEP 3: Aggregate growth data into arrays of structs
trending AS (
  SELECT 
    name,                                                  -- The name
    gender,                                                -- M or F
    ARRAY_AGG(                                            -- Create an array
      STRUCT(year, total, pct_growth)                     -- Each element is a struct
      ORDER BY year                                       -- Chronological order
    ) as yearly_data,                                     -- Array of year-by-year data
    AVG(pct_growth) as avg_growth_rate,                  -- Average growth rate
    COUNT(*) as years_tracked                             -- How many years we have data for
  FROM growth_calc
  WHERE pct_growth IS NOT NULL                            -- Ignore first year (no previous year)
  GROUP BY name, gender                                   -- One row per name/gender
  HAVING AVG(pct_growth) > 10                            -- Only names growing >10% per year on average
    AND COUNT(*) >= 4                                     -- Must have at least 4 years of data
)

-- STEP 4: Final output with array indexing
SELECT 
  name,                                                    -- The name
  gender,                                                  -- M or F
  ROUND(avg_growth_rate, 2) as avg_growth_pct,           -- Average % growth per year
  yearly_data[OFFSET(0)].total as total_2015,            -- Access first array element's total field
  yearly_data[OFFSET(ARRAY_LENGTH(yearly_data)-1)].total as total_2020  -- Access last element's total
FROM trending
ORDER BY avg_growth_rate DESC                             -- Fastest growing first
LIMIT 20;                                                 -- Top 20 trending names

Detailed explanation of each CTE:

CTE 1 - yearly_stats: - Purpose: Get the basic data we need (yearly totals per name) - Output: One row per year/gender/name with the total count - Why: Simplifies the next step by pre-aggregating

CTE 2 - growth_calc: - Purpose: Calculate how much each name grew from year to year - Key parts: - LAG(total) gets last year’s number - total - LAG(total) calculates absolute change (+1000 babies, -500 babies, etc.) - SAFE_DIVIDE(...) calculates percentage change - SAFE_DIVIDE instead of / prevents division by zero errors - PARTITION BY name, gender keeps each name’s calculations separate - Output: Each year for each name has growth metrics

CTE 3 - trending: - Purpose: Identify names that are consistently growing - Key parts: - ARRAY_AGG(STRUCT(...)) creates one row per name with all years in an array - Each array element is a struct containing year, total, and growth% - AVG(pct_growth) calculates average growth rate across all years - HAVING AVG(pct_growth) > 10 filters to only fast-growing names - Output: One row per trending name with their entire history in an array

Final Query: - Purpose: Display the results in a readable format - Key parts: - yearly_data[OFFSET(0)].total - Get the total from the FIRST year - [OFFSET(0)] accesses first array element - .total accesses the total field from that struct - yearly_data[OFFSET(ARRAY_LENGTH(yearly_data)-1)].total - Get total from LAST year - ARRAY_LENGTH(yearly_data) counts how many years we have - -1 because arrays start at 0 (if length is 5, last index is 4) - Result: Shows which names went from small to big, and how much they grew

What you’ll see: Up to 20 rows showing the fastest-growing names. Each row displays the name, gender, average annual growth percentage, the total count from 2015 (the first year in our dataset), and the total count from 2020 (the last year). Names with higher average growth percentages appear first, showing which names increased most dramatically in popularity over this period.

Key techniques used: 1. ✅ CTEs to break complex logic into steps 2. ✅ Window functions (LAG) to compare years 3. ✅ PARTITION BY to keep data organized 4. ✅ Arrays to store multiple years’ data 5. ✅ Structs to group related fields 6. ✅ Array indexing to access specific elements 7. ✅ Dot notation to access struct fields


Practice Exercises

Now it’s your turn! Try these exercises to reinforce what you’ve learned. Start simple and work your way up.

Exercise 1: Window Functions Basics

Task: Find the top 10 names for boys in 2019, showing each name’s rank and what percentage of the total they represent.

Hints: - Use RANK() to assign rankings - Use SUM() OVER (PARTITION BY year, gender) to get total births - Calculate percentage as: (individual count / total) * 100 - Use QUALIFY to filter to top 10

Click to see solution
SELECT 
  name,
  SUM(number) as count,
  SUM(number) / SUM(SUM(number)) OVER (PARTITION BY year, gender) * 100 as percentage,
  RANK() OVER (PARTITION BY year, gender ORDER BY SUM(number) DESC) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2019 AND gender = 'M'
GROUP BY name, year, gender
QUALIFY rank <= 10
ORDER BY rank;

Exercise 2: LAG and LEAD Practice

Task: For the name “Emma”, show each year from 2000-2020 with the current year’s count, previous year’s count, next year’s count, and whether it increased or decreased.

Hints: - Use LAG() to get previous year - Use LEAD() to get next year - Use CASE WHEN to create a text description

Click to see solution
SELECT 
  year,
  SUM(number) as current_year,
  LAG(SUM(number)) OVER (ORDER BY year) as previous_year,
  LEAD(SUM(number)) OVER (ORDER BY year) as next_year,
  CASE 
    WHEN SUM(number) > LAG(SUM(number)) OVER (ORDER BY year) THEN 'Increased'
    WHEN SUM(number) < LAG(SUM(number)) OVER (ORDER BY year) THEN 'Decreased'
    ELSE 'Same'
  END as trend
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'Emma' AND year BETWEEN 2000 AND 2020
GROUP BY year
ORDER BY year;

Exercise 3: CTEs for Readability

Task: Find which states have the most diverse baby names in 2020 (measured by the number of unique names used).

Hints: - First CTE: count unique names per state - Second CTE: rank states - Use COUNT(DISTINCT name) for unique names

Click to see solution
WITH state_diversity AS (
  SELECT 
    state,
    COUNT(DISTINCT name) as unique_names,
    SUM(number) as total_births
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year = 2020
  GROUP BY state
),
ranked_states AS (
  SELECT 
    state,
    unique_names,
    total_births,
    RANK() OVER (ORDER BY unique_names DESC) as diversity_rank
  FROM state_diversity
)
SELECT 
  state,
  unique_names,
  total_births,
  diversity_rank
FROM ranked_states
ORDER BY diversity_rank
LIMIT 10;

Exercise 4: Arrays

Task: For each state, create an array of the top 3 most popular girl names in 2020.

Hints: - First, rank names within each state - Then use ARRAY_AGG to collect the top 3 - Remember to ORDER BY inside ARRAY_AGG

Click to see solution
WITH ranked_names AS (
  SELECT 
    state,
    name,
    number,
    RANK() OVER (PARTITION BY state ORDER BY number DESC) as rank
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year = 2020 AND gender = 'F'
  QUALIFY rank <= 3
)
SELECT 
  state,
  ARRAY_AGG(name ORDER BY rank) as top_3_names
FROM ranked_names
GROUP BY state
ORDER BY state;

Exercise 5: Structs

Task: Create a struct containing statistics about names starting with ‘A’ vs names starting with ‘Z’ in 2020.

Hints: - Use SUBSTR(name, 1, 1) to get first letter - Create separate STRUCTs for A names and Z names - Include count of names and total babies in each struct

Click to see solution
SELECT 
  year,
  STRUCT(
    COUNT(DISTINCT CASE WHEN SUBSTR(name, 1, 1) = 'A' THEN name END) as unique_names,
    SUM(CASE WHEN SUBSTR(name, 1, 1) = 'A' THEN number ELSE 0 END) as total_babies
  ) as a_names,
  STRUCT(
    COUNT(DISTINCT CASE WHEN SUBSTR(name, 1, 1) = 'Z' THEN name END) as unique_names,
    SUM(CASE WHEN SUBSTR(name, 1, 1) = 'Z' THEN number ELSE 0 END) as total_babies
  ) as z_names
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020
GROUP BY year;

Exercise 6: Combining Everything (Advanced!)

Task: Find names that were popular (top 100) in 1950 but are NOT in the top 100 in 2020. Show their 1950 rank, 2020 rank, and create an array of their yearly totals from 1950 to 2020 (in 10-year intervals).

Hints: - This needs multiple CTEs - Use window functions to rank in both years - Use ARRAY_AGG with a struct containing year and total - Filter to names in top 100 in 1950 but not 2020

Click to see solution
WITH ranks_1950 AS (
  SELECT 
    name,
    gender,
    SUM(number) as total,
    RANK() OVER (PARTITION BY gender ORDER BY SUM(number) DESC) as rank
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year = 1950
  GROUP BY name, gender
),
ranks_2020 AS (
  SELECT 
    name,
    gender,
    SUM(number) as total,
    RANK() OVER (PARTITION BY gender ORDER BY SUM(number) DESC) as rank
  FROM `bigquery-public-data.usa_names.usa_1910_current`
  WHERE year = 2020
  GROUP BY name, gender
),
declined_names AS (
  SELECT 
    a.name,
    a.gender,
    a.rank as rank_1950,
    COALESCE(b.rank, 999) as rank_2020
  FROM ranks_1950 a
  LEFT JOIN ranks_2020 b ON a.name = b.name AND a.gender = b.gender
  WHERE a.rank <= 100 
    AND (b.rank > 100 OR b.rank IS NULL)
),
yearly_totals AS (
  SELECT 
    d.name,
    d.gender,
    d.rank_1950,
    d.rank_2020,
    ARRAY_AGG(
      STRUCT(y.year, y.total)
      ORDER BY y.year
    ) as decade_history
  FROM declined_names d
  LEFT JOIN (
    SELECT 
      name,
      gender,
      year,
      SUM(number) as total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    WHERE year IN (1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)
    GROUP BY name, gender, year
  ) y ON d.name = y.name AND d.gender = y.gender
  GROUP BY d.name, d.gender, d.rank_1950, d.rank_2020
)
SELECT 
  name,
  gender,
  rank_1950,
  rank_2020,
  decade_history
FROM yearly_totals
ORDER BY rank_1950
LIMIT 20;

Next Steps

Now that you understand the basics:

  1. Practice! Try the exercises above with the USA names dataset
  2. Explore: Look at other public datasets in BigQuery
  3. Read docs: Check out the official BigQuery documentation at https://cloud.google.com/bigquery/docs
  4. Build projects: Apply these concepts to real data problems
  5. Join communities: Stack Overflow and Reddit’s r/bigquery are great resources

Helpful Resources


Final Tips for Beginners

  1. Start simple: Don’t try to write the perfect query on the first try. Build it step by step.

  2. Use CTEs liberally: They make debugging SO much easier. You can run just the CTE to see if it’s working.

  3. Comment your code: Future you will thank present you!

-- Get the top 5 names per year
WITH top_names AS (
  ...
)
  1. Preview before running: Use LIMIT 100 while developing to avoid scanning huge amounts of data.

  2. Check your logic with small datasets: Filter to one year or one state first, make sure your logic is correct, then expand.

  3. Window functions are your friend: They solve so many problems that would otherwise require self-joins or complex subqueries.

  4. Arrays and structs are powerful: They might seem weird at first, but they’re incredibly useful for organizing complex data.

  5. Don’t be afraid of errors: Every error message is teaching you something about SQL!


Happy querying!

Remember: The best way to learn is by doing. Open BigQuery, copy these queries, modify them, break them, fix them, and make them your own!