Public Dataset:
bigquery-public-data.usa_names.usa_1910_current
This dataset contains baby name registrations from US Social Security
records: - name: Baby name (e.g., “Emma”, “Michael”) -
gender: M (male) or F (female) - number: How
many babies were given this name - year: Year of birth
(1910-2023) - state: US state code (e.g., “CA”, “NY”)
Example row: name=‘Emma’, gender=‘F’, number=20355, year=2020, state=‘CA’ means 20,355 baby girls were named Emma in California in 2020.
Think of window functions as a way to perform calculations across a set of rows that are related to the current row, WITHOUT collapsing those rows like GROUP BY does. You get to see individual rows AND aggregate calculations at the same time!
Goal: Calculate the cumulative number of babies named “Emma” over the years
SELECT
year, -- Show the year
SUM(number) as yearly_total, -- Total Emmas born this year
SUM(SUM(number)) OVER (ORDER BY year) as cumulative_total -- Running total up to this year
FROM `bigquery-public-data.usa_names.usa_1910_current` -- From the public dataset
WHERE name = 'Emma' -- Only look at name "Emma"
GROUP BY year -- One row per year
ORDER BY year; -- Sort by year
Line-by-line explanation: 1.
SELECT year - We want to see which year we’re talking about
2. SUM(number) as yearly_total - Add up all the Emmas born
in each year (across all states). This is a regular aggregate function
3.
SUM(SUM(number)) OVER (ORDER BY year) as cumulative_total -
This is the magic! - Inner SUM(number): First, we sum the
numbers for each year (same as line 2) - Outer
SUM(...) OVER (ORDER BY year): Now, take those yearly
totals and create a running sum - OVER (ORDER BY year):
This creates a “window” that starts at the earliest year and grows as we
move forward in time - Result: Each row shows the total number of Emmas
born from the beginning of time up to that year 4.
WHERE name = 'Emma' - Filter to only Emma (before grouping)
5. GROUP BY year - Combine all states into one row per year
6. ORDER BY year - Show results chronologically
What you’ll see: The results will show each year with two totals: the number of Emmas born that specific year, and the cumulative total from 1910 up to and including that year. The cumulative total should always increase (or stay the same) as you go down the rows, since you’re adding more years to the sum.
Key concept: The OVER clause doesn’t
reduce the number of rows. You still get one row per year, but each row
can “see” other rows to perform calculations.
Goal: Find the top 5 most popular names for boys and girls in 2020
SELECT
year, -- The year (2020)
gender, -- M or F
name, -- Name like "Olivia" or "Liam"
SUM(number) as total_count, -- Total babies with this name across all states
RANK() OVER ( -- Assign a rank (1st, 2nd, 3rd...)
PARTITION BY year, gender -- Separate rankings for M and F
ORDER BY SUM(number) DESC -- Highest count gets rank 1
) as popularity_rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020 -- Only look at 2020 data
GROUP BY year, gender, name -- One row per unique combination
QUALIFY popularity_rank <= 5 -- Only keep ranks 1-5
ORDER BY gender, popularity_rank; -- Sort by gender, then rank
Line-by-line explanation: 1.
year, gender, name - The columns we want to display 2.
SUM(number) as total_count - Add up how many babies got
this name across all 50 states - Example: “Olivia” might appear in
California (5,000), Texas (4,000), etc. This adds them all up 3.
RANK() OVER (...) - This is where the magic happens! We’re
assigning a ranking 4. PARTITION BY year, gender -
CRITICAL CONCEPT! This divides our data into separate
groups - Think of it as creating two separate ranking lists: one for
boys, one for girls - Without this, boys and girls would compete in the
same ranking - It’s like saying “reset the ranking to 1 every time the
gender changes” 5. ORDER BY SUM(number) DESC - Within each
partition (M and F), rank by highest count first - DESC means descending
(biggest to smallest) - The name with the most babies gets rank 1 6.
WHERE year = 2020 - We only want one year’s data for this
example 7. GROUP BY year, gender, name - Combine all states
into one row per name 8. QUALIFY popularity_rank <= 5 -
Special BigQuery feature! Like WHERE but for window
functions - Regular WHERE doesn’t work with window functions (they
haven’t been calculated yet) - QUALIFY runs AFTER the window function is
calculated - This keeps only ranks 1, 2, 3, 4, and 5
What you’ll see: The results will show the top 5 names for each gender. Girls and boys will have separate rankings - the most popular girl name will be rank 1, and separately, the most popular boy name will also be rank 1. Each row shows the name, how many babies received that name across all states, and its ranking within that gender.
Key concept - PARTITION BY: Imagine you have a deck of cards. PARTITION BY is like separating them into suits (hearts, diamonds, clubs, spades). Then, within each suit, you can rank them (Ace is 1st, King is 2nd, etc.). Without PARTITION BY, you’d be ranking all 52 cards together!
Goal: See how the name “Sophia” grew or declined in popularity year over year
SELECT
year, -- The year
name, -- Name (will always be "Sophia")
SUM(number) as total, -- How many Sophias this year
LAG(SUM(number)) OVER ( -- Get the value from the PREVIOUS row
PARTITION BY name -- Stay within the same name
ORDER BY year -- Order by year (so we know what "previous" means)
) as prev_year_total, -- Total from last year
SUM(number) - LAG(SUM(number)) OVER ( -- Current year minus last year
PARTITION BY name
ORDER BY year
) as change -- How much it increased or decreased
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'Sophia' AND year BETWEEN 2010 AND 2020 -- Only Sophia, only these years
GROUP BY year, name -- One row per year
ORDER BY year; -- Show chronologically
Line-by-line explanation: 1. year, name
- What we’re displaying 2. SUM(number) as total - Add up
all Sophias across all states for this year 3.
LAG(SUM(number)) OVER (...) - LAG means “look at
the previous row” - Think of LAG as “looking backward” in time
- This gets the total from the year before - Example: When processing
2015, LAG gives us the value from 2014 4. PARTITION BY name
- Keep calculations within the same name - Since we’re only looking at
“Sophia”, this doesn’t change much here - But it’s good practice in case
you want to analyze multiple names 5. ORDER BY year - This
is crucial! It tells LAG which direction “previous” is - Without this,
LAG wouldn’t know which row is “previous” 6.
SUM(number) - LAG(SUM(number)) OVER (...) - Subtract last
year from this year - If positive: the name grew in popularity - If
negative: the name declined in popularity - If NULL: there’s no previous
year (happens in first row)
What you’ll see: The results will show each year between 2010 and 2020, with the total number of Sophias born that year, the previous year’s total (which will be NULL for the first year since there’s no prior year in our filtered dataset), and the year-over-year change (positive numbers indicate growth, negative indicate decline).
Key concept - LAG vs LEAD: - LAG = look
backward (previous row) - useful for “compared to last year” -
LEAD = look forward (next row) - useful for “compared to
next year” - Both can take a second parameter:
LAG(value, 2) means “look back 2 rows”
Goal: Calculate a 5-year moving average to see long-term trends for the name “James”
SELECT
year, -- The year
name, -- Name (always "James")
SUM(number) as yearly_total, -- How many James' this year
AVG(SUM(number)) OVER ( -- Calculate an average
PARTITION BY name -- Keep within same name
ORDER BY year -- Order by year
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING -- Define the "window frame"
) as five_year_avg -- The 5-year moving average
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'James' AND year BETWEEN 1950 AND 1960 -- Only James, only these years
GROUP BY year, name -- One row per year
ORDER BY year; -- Show chronologically
Line-by-line explanation: 1. year, name
- What we’re displaying 2. SUM(number) as yearly_total -
Regular aggregation: total James’ born this year 3.
AVG(SUM(number)) OVER (...) - Calculate the average across
a window of rows 4. PARTITION BY name - Keep calculations
within the same name 5. ORDER BY year - Order matters for
window frames! 6.
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING -
This is the window frame - 2 PRECEDING: Include the 2 rows
before the current row - AND 2 FOLLOWING: Include the 2
rows after the current row - Plus the current row = 5 rows total -
Example: For 1955, it includes 1953, 1954, 1955, 1956, 1957 - This
creates a “sliding window” that moves through your data
What you’ll see: Each row shows a year, the total number of babies named James that year, and the five-year moving average. For years in the middle of the dataset (like 1955), the moving average will include data from two years before, two years after, and the current year. For years at the edges (like 1950 or 1960), the moving average may be based on fewer than 5 years since there aren’t enough surrounding years in the filtered dataset.
Visualizing the window frame:
Years: 1953 1954 1955 1956 1957
^^^^
Current Row
[-------- Window --------]
2 back 1 back NOW 1 ahead 2 ahead
Why use moving averages? They smooth out year-to-year fluctuations to reveal underlying trends. Single years might be “noisy” (random ups and downs), but the average shows the real pattern.
Other window frame options: -
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - From
beginning to now (running total) -
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW - Just previous
and current row -
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING - From now
to the end
A CTE (Common Table Expression) is like creating a temporary named result set that you can reference in your main query. Think of it as: - Creating a “scratch table” that only exists for this one query - Breaking a complex query into readable steps - Giving meaningful names to intermediate results
Syntax: Start with
WITH name_of_cte AS (query)
Goal: Calculate year-over-year change in total births by gender
-- Step 1: Create a CTE named "yearly_totals"
WITH yearly_totals AS (
SELECT
year, -- The year
gender, -- M or F
SUM(number) as total_births -- Add up ALL names for this year/gender
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY year, gender -- One row per year/gender combo
)
-- Step 2: Now use the CTE in the main query
SELECT
year, -- The year
total_births, -- Total births (from the CTE)
total_births - LAG(total_births) OVER ( -- Subtract last year's total
PARTITION BY gender -- Keep M and F separate
ORDER BY year -- Order by year
) as yearly_change -- The year-over-year change
FROM yearly_totals -- We're querying the CTE, not the original table!
WHERE gender = 'F' -- Only female births
ORDER BY year DESC -- Most recent first
LIMIT 10; -- Just show the last 10 years
Line-by-line explanation:
Part 1 - The CTE: 1.
WITH yearly_totals AS (...) - Create a temporary result set
called “yearly_totals” 2. SELECT year, gender, SUM(number)
- In the CTE, calculate total births per year/gender 3.
FROM bigquery-public-data... - Source data 4.
GROUP BY year, gender - Aggregate all names into one total
per year/gender
Part 2 - The Main Query: 5.
SELECT year, total_births - Pull data from the CTE 6.
total_births - LAG(total_births) OVER (...) - Compare this
year to last year 7. PARTITION BY gender - Keep male and
female calculations separate 8. ORDER BY year - Tell LAG
which row is “previous” 9. FROM yearly_totals -
Important! We’re now querying the CTE, not the original
table 10. WHERE gender = 'F' - Filter to only females 11.
ORDER BY year DESC - Show newest years first 12.
LIMIT 10 - Only show 10 rows
Why use a CTE here? Without the CTE, we’d have to
calculate SUM(number) multiple times in the same query,
making it harder to read:
-- Without CTE (harder to read):
SELECT
year,
SUM(number),
SUM(number) - LAG(SUM(number)) OVER (...)
FROM ...
GROUP BY year, gender
With the CTE, we calculate the sum ONCE and give it a clear name
(total_births), making the logic easier to follow.
Goal: Calculate how many different names account for 50% of all births in different years
This is more complex, so we’ll use multiple CTEs to break it down into steps.
-- STEP 1: Calculate what percentage each name represents
WITH name_percentages AS (
SELECT
year, -- The year
gender, -- M or F
name, -- Individual name
number, -- How many babies got this name
SUM(number) OVER (PARTITION BY year, gender) as total_births, -- Total births for this year/gender
number / SUM(number) OVER (PARTITION BY year, gender) * 100 as percentage -- What % this name is
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year IN (1950, 2020) -- Compare two different years
),
-- STEP 2: Rank names and calculate cumulative percentage
ranked_names AS (
SELECT
year, -- The year
gender, -- M or F
name, -- Individual name
percentage, -- What % this name represents (from CTE 1)
SUM(percentage) OVER ( -- Running sum of percentages
PARTITION BY year, gender -- Separate for each year/gender
ORDER BY number DESC -- Start with most popular names
) as cumulative_percentage, -- Running total: 1st name = 5%, 1st+2nd = 8%, etc.
ROW_NUMBER() OVER ( -- Assign row numbers
PARTITION BY year, gender
ORDER BY number DESC
) as rank -- 1 = most popular, 2 = 2nd most popular, etc.
FROM name_percentages
)
-- STEP 3: Find where cumulative percentage crosses 50%
SELECT
year, -- The year
gender, -- M or F
MIN(rank) as names_for_50_percent -- How many names to reach 50%
FROM ranked_names
WHERE cumulative_percentage >= 50 -- Only rows where we've reached 50%
GROUP BY year, gender -- One result per year/gender
ORDER BY year, gender; -- Sort the results
Line-by-line explanation:
CTE 1 - name_percentages: 1.
WITH name_percentages AS (...) - Create first temporary
result 2. year, gender, name, number - Basic fields we need
3.
SUM(number) OVER (PARTITION BY year, gender) as total_births
- Window function! - For EVERY row with year=1950 and
gender=F, this calculates the same total - It’s the denominator for our
percentage calculation - Example: If 2 million girls were born in 1950,
every row for 1950/F shows 2,000,000 4.
number / SUM(number) OVER (...) * 100 as percentage -
Calculate what % each name is - Example: “Mary” with 50,000 births ÷
2,000,000 total × 100 = 2.5% 5. WHERE year IN (1950, 2020)
- We’re comparing these two years
CTE 2 - ranked_names: 6.
WITH ..., ranked_names AS (...) - Create second CTE (note
the comma after first CTE) 7.
SELECT year, gender, name, percentage - Pull data from the
first CTE 8. SUM(percentage) OVER (...) - Create a running
sum of percentages - First row (most popular): might be 5% - Second row:
5% + 3% = 8% - Third row: 8% + 2.5% = 10.5% - Eventually one row crosses
50%! 9. ORDER BY number DESC - Start with the most popular
name - This ensures our cumulative sum builds from #1 ranked name onward
10. ROW_NUMBER() OVER (...) - Assign sequential numbers (1,
2, 3, 4…) - This tells us “how many names” we’ve included
Main Query: 11.
SELECT year, gender, MIN(rank) - Now use the second CTE 12.
WHERE cumulative_percentage >= 50 - Only look at rows
where we’ve reached 50% - Multiple names might cross 50% (like 49.8%,
51.2%, 53%, etc.) 13. MIN(rank) - Get the FIRST name that
crossed 50% - This tells us how many names it took to reach 50% of all
births
What you’ll see: The results will show four rows (one for each year/gender combination: 1950 F, 1950 M, 2020 F, 2020 M). Each row shows how many different names were needed to account for 50% of all births in that year/gender. Generally, you’ll see that older years (like 1950) needed fewer names to reach 50%, indicating less diversity - parents chose from a smaller pool of popular names. More recent years typically need many more names to reach 50%, showing greater diversity in naming choices.
Why this matters: This measures name diversity. In 1950, parents chose from fewer names (everyone named their daughters Mary, Susan, Linda, etc.). By 2020, parents picked more unique names, so you need to count more names to reach 50% of births.
Key concept - Multiple CTEs: You can chain CTEs by separating them with commas:
WITH first_cte AS (...),
second_cte AS (...),
third_cte AS (...)
SELECT ... FROM third_cte;
Each CTE can reference the ones above it!
Sometimes your data is split across multiple tables with similar
names (like sales_2020, sales_2021,
sales_2022). Instead of querying each one separately, you
can use a wildcard (*) to query them all at once!
Syntax: FROM dataset.table_prefix*
where * matches any characters
The USA names dataset is actually a single table, not multiple tables. So for this section, I’ll show you the concept using examples, and you can apply this when you encounter datasets that ARE split into multiple tables.
Example structure (hypothetical - not our actual dataset):
bigquery-public-data.usa_names.usa_1910
bigquery-public-data.usa_names.usa_1911
bigquery-public-data.usa_names.usa_1912
...
bigquery-public-data.usa_names.usa_2020
How you’d query all of them:
SELECT
_TABLE_SUFFIX as year, -- Special column! Contains the part after usa_
COUNT(*) as row_count -- Count rows in each table
FROM `bigquery-public-data.usa_names.usa_*` -- The * matches all tables starting with usa_
WHERE _TABLE_SUFFIX BETWEEN '1910' AND '2020' -- Filter which tables to include
GROUP BY year
ORDER BY year;
Line-by-line explanation: 1.
_TABLE_SUFFIX - Special BigQuery column
that only exists with wildcard queries - It contains the part of the
table name AFTER the * - Example: from table
usa_1950, _TABLE_SUFFIX = ‘1950’ - This is a STRING, not a
number! 2. COUNT(*) - Count rows in each table 3.
FROM bigquery-public-data.usa_names.usa_* - The wildcard
query - usa_* matches: usa_1910, usa_1911, usa_1950,
usa_anything 4.
WHERE _TABLE_SUFFIX BETWEEN '1910' AND '2020' - Filter
which tables to scan - This is important for cost! Don’t scan tables you
don’t need - Remember: _TABLE_SUFFIX is a string, so ‘1910’ < ‘1950’
< ‘2020’ in alphabetical order
Wildcard tables in BigQuery let you query many similarly named tables at once using a single query, instead of repeating the same code for each table. They are especially useful for time-series data, where each day or year is stored in its own table.
* character
placed at the end of a table prefix in the FROM clause.
gsod1940, gsod1941, …,
gsod1949)._TABLE_SUFFIX that contains the part of the
table name that matched the * (the “changing” part of the
name)._TABLE_SUFFIX, instead of
querying an entire dataset.SELECT
max,
ROUND((max - 32) * 5/9, 1) AS celsius,
mo,
da,
year,
_TABLE_SUFFIX AS suffix
FROM
`bigquery-public-data.noaa_gsod.gsod194*`
WHERE
max != 9999.9
AND ( _TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4' )
ORDER BY
max DESC
LIMIT 100;
FROM clause with wildcard
FROM `bigquery-public-data.noaa_gsod.gsod194*`
bigquery-public-data → project that hosts public
datasets.noaa_gsod → dataset containing weather
observations.gsod194* → table prefix plus *.
gsod194,
such as gsod1940, gsod1941, …
gsod1949.*, you now have access to
_TABLE_SUFFIX.What _TABLE_SUFFIX means here
gsod1940, prefix is gsod194 and
suffix is '0'.gsod1944, suffix is '4', and so
on._TABLE_SUFFIX will therefore be '0',
'1', … '9' depending on which table the
current row came from.SELECT list
SELECT
max,
ROUND((max - 32) * 5/9, 1) AS celsius,
mo,
da,
year,
_TABLE_SUFFIX AS suffix
max → maximum daily temperature (in Fahrenheit).ROUND((max - 32) * 5/9, 1) AS celsius → converts
Fahrenheit to Celsius and rounds to 1 decimal place.mo, da, year → month, day,
and year columns for the observation date._TABLE_SUFFIX AS suffix → shows which specific table
the row came from (which final digit the table name has).WHERE clause
WHERE
max != 9999.9
AND ( _TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4' )
max != 9999.9 → filters out rows where max
is an invalid placeholder (often used as a missing-value flag).(_TABLE_SUFFIX = '0' OR _TABLE_SUFFIX = '4') → keeps
only tables whose names end in 0 or 4.
ORDER and LIMIT
ORDER BY
max DESC
LIMIT 100;
ORDER BY max DESC → sorts results from hottest to
coolest.LIMIT 100 → shows only the top 100 hottest
records.You can see that one query works across several year‑tables at once,
instead of writing a separate query for each gsod1940,
gsod1941, etc.
SELECT
_TABLE_SUFFIX AS day_suffix,
COUNT(*) AS event_count
FROM
`githubarchive.day.2017*`
WHERE
type = "WatchEvent"
AND _TABLE_SUFFIX BETWEEN '0101' AND '0131'
GROUP BY
day_suffix
ORDER BY
day_suffix;
FROM clause with wildcard
FROM `githubarchive.day.2017*`
githubarchive → public GitHub Archive project.day.2017* → matches tables like
day.20170101, day.20170102, … for each day in
2017.*, _TABLE_SUFFIX becomes
available.What _TABLE_SUFFIX represents
here
2017.day.20170101, the suffix after the prefix
2017 is '0101'.day.20171231, the suffix is
'1231'._TABLE_SUFFIX therefore represents the month and day
(MMDD) for that table.SELECT list
SELECT
_TABLE_SUFFIX AS day_suffix,
COUNT(*) AS event_count
_TABLE_SUFFIX AS day_suffix → shows the day for each
result row (e.g. '0101' for Jan 1).COUNT(*) AS event_count → counts how many events exist
for that day.WHERE clause
WHERE
type = "WatchEvent"
AND _TABLE_SUFFIX BETWEEN '0101' AND '0131'
type = "WatchEvent" → keeps only GitHub events where a
user “stars” a repository._TABLE_SUFFIX BETWEEN '0101' AND '0131' → restricts the
tables scanned to those whose suffix (MMDD) is between
'0101' (Jan 1) and '0131' (Jan 31).
GROUP BY and ORDER BY
GROUP BY
day_suffix
ORDER BY
day_suffix;
GROUP BY day_suffix → aggregates all events for each
day.ORDER BY day_suffix → sorts days in calendar order from
'0101' to '0131'.This gives you one row per day in January with the number of “watch” events, using many daily tables in a single query.
Find the repeated table pattern
events_20230101, events_20230102, …sales_2020, sales_2021,
sales_2022, …events_,
sales_, day.2017, etc.).Write the FROM clause with *
FROM \project_id.dataset_id.table_prefix*``FROM \my_project.analytics.events_2023*``Use _TABLE_SUFFIX in WHERE to limit which
tables you scan
WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230331'WHERE _TABLE_SUFFIX IN ('0', '4')Include _TABLE_SUFFIX in SELECT when
helpful
To see which table each result row came from or to group by it:
SELECT
_TABLE_SUFFIX AS source_suffix,
COUNT(*) AS row_count
FROM
`project.dataset.tableprefix*`
GROUP BY
source_suffix;Avoid common beginner mistakes
_TABLE_SUFFIX only works when there is a
wildcard * in the table reference. It will not work if you
query a single table with no wildcard._TABLE_SUFFIX to (like
'0101', '20230101', '0') match
the actual format of the suffix in the table
names.Use wildcard tables when: - Your data is partitioned into separate tables by date, region, or category - You want to query across all (or some) of those tables - You want to save typing many UNION ALL statements
Example without wildcards (tedious):
SELECT * FROM sales_2020
UNION ALL
SELECT * FROM sales_2021
UNION ALL
SELECT * FROM sales_2022
-- Imagine doing this for 100 tables!
Example with wildcards (simple):
SELECT * FROM sales_*
WHERE _TABLE_SUFFIX BETWEEN '2020' AND '2022'
usa_*
is better than * to avoid matching unintended tablesIn BigQuery, you pay based on how much data you scan. If you have 100 tables with data from 100 different days, and you only need data from 5 days, wildcard tables with proper filtering help you scan only 5 tables instead of 100!
Cost example: - Query without filter: scans all 100
tables = $$ - Query with
WHERE _TABLE_SUFFIX IN ('20240101', '20240102', ...): scans
only 5 tables = $
An array is a list of values stored in a single field. Think of it like a shopping list stored in one cell of a spreadsheet.
Example: Instead of having separate rows for: - Row
1: “Emma” - Row 2: “Olivia”
- Row 3: “Ava”
You can have one row with: [“Emma”, “Olivia”, “Ava”]
Goal: For each year and gender, create an array containing the top 5 most popular names
-- Step 1: Find the top 5 names
WITH top_names AS (
SELECT
year, -- The year
gender, -- M or F
name, -- Individual name
SUM(number) as total, -- Total babies with this name
RANK() OVER ( -- Rank names within each group
PARTITION BY year, gender -- Separate ranking for each year/gender
ORDER BY SUM(number) DESC -- Highest count = rank 1
) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020 -- Just look at 2020
GROUP BY year, gender, name -- One row per name
QUALIFY rank <= 5 -- Keep only top 5
)
-- Step 2: Aggregate those names into arrays
SELECT
year, -- The year (2020)
gender, -- M or F
ARRAY_AGG(name ORDER BY rank) as top_5_names, -- Create an array of names
ARRAY_AGG(total ORDER BY rank) as top_5_counts -- Create an array of their counts
FROM top_names
GROUP BY year, gender -- One row per year/gender combination
ORDER BY gender; -- Show F first, then M
Line-by-line explanation:
CTE - top_names: 1-6. Same as before: rank names by popularity, keep top 5
Main Query: 7.
ARRAY_AGG(name ORDER BY rank) - This is the
magic! - ARRAY_AGG collects multiple values into a
single array - Takes all the name values in the group and
puts them in a list - ORDER BY rank ensures they’re in
order: [1st place, 2nd place, 3rd place, …] - Result: Instead of 5 rows
with one name each, you get 1 row with 5 names in an array 8.
ARRAY_AGG(total ORDER BY rank) - Same thing, but for the
counts 9. GROUP BY year, gender - Collapse multiple name
rows into one row with arrays
What you’ll see: Two rows - one for females (F) and one for males (M) from 2020. Each row contains two arrays: one with the top 5 most popular names in order, and another with the corresponding counts for those names, also in order. The first element in each array corresponds to the most popular name, the second element to the second most popular, and so on.
Before ARRAY_AGG (5 rows for females): | year | gender | name | total | |——|——–|———–|——–| | 2020 | F | Olivia | 17,535 | | 2020 | F | Emma | 15,581 | | 2020 | F | Ava | 13,084 | | … 2 more rows …
After ARRAY_AGG (1 row for females): | year | gender | top_5_names | top_5_counts | |——|——–|————-|————–| | 2020 | F | [Olivia, Emma, Ava, Charlotte, Sophia] | [17535, 15581, 13084, …] |
When to use ARRAY_AGG: - You want to group related items together - You’re building a summary with multiple values per category - You want to reduce the number of rows in your result
Goal: Take an array and convert it back into separate rows
Think of UNNEST as the opposite of ARRAY_AGG. It “unpacks” an array.
-- Step 1: Create an array of sample names
WITH name_arrays AS (
SELECT
year, -- The year
ARRAY_AGG( -- Create an array
DISTINCT name -- Only unique names (no duplicates)
ORDER BY name -- Sort alphabetically
LIMIT 10 -- Only take 10 names
) as sample_names -- Name of the array column
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020 AND gender = 'F' -- 2020, females only
GROUP BY year -- One row (with one array)
)
-- Step 2: UNNEST the array back into rows
SELECT
year, -- The year (same for all rows)
name -- Individual name (from the array)
FROM name_arrays, -- From our CTE
UNNEST(sample_names) as name; -- UNNEST creates new rows
Line-by-line explanation:
CTE - name_arrays: 1.
ARRAY_AGG(DISTINCT name ORDER BY name LIMIT 10) - Build an
array - DISTINCT removes duplicates (some names appear in
multiple states) - ORDER BY name sorts alphabetically -
LIMIT 10 keeps only 10 names in the array 2. Result: ONE
row with year=2020 and an array of 10 names
Main Query: 3.
FROM name_arrays, UNNEST(sample_names) as name -
The UNNEST magic! - UNNEST(sample_names)
takes the array and creates one new row per element -
as name gives the extracted value a column name - The comma
acts like a CROSS JOIN - it pairs each row in name_arrays with each
element in the array
Visual example:
Before UNNEST (1 row): | year | sample_names | |——|————–| | 2020 | [Ava, Charlotte, Emma, …] |
After UNNEST (10 rows): | year | name | |——|———–| | 2020 | Ava | | 2020 | Charlotte | | 2020 | Emma | | 2020 | … (7 more rows) |
When to use UNNEST: - You have data stored in arrays and need to work with individual elements - You want to join array elements with other tables - You need to filter or aggregate individual array items
Real-world example: Imagine a table of orders where each order has an array of product IDs: - Order 1: products = [101, 102, 103] - Order 2: products = [104, 105]
UNNEST lets you create one row per product so you can count how many times each product was ordered.
Goal: Use built-in functions to work with arrays
SELECT
year, -- The year
gender, -- M or F
ARRAY_AGG( -- Create an array of top 5 names
name
ORDER BY number DESC -- Sort by popularity
LIMIT 5 -- Take only top 5
) as top_names,
ARRAY_LENGTH( -- Count items in the array
ARRAY_AGG(name ORDER BY number DESC LIMIT 5)
) as array_size, -- Should be 5
ARRAY_AGG(name ORDER BY number DESC LIMIT 5)[OFFSET(0)] as most_popular_name -- Get first element
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020 -- Just 2020 data
GROUP BY year, gender; -- One row per year/gender
Line-by-line explanation:
ARRAY_AGG(name ORDER BY number DESC LIMIT 5) - Create
the array (same as before)ARRAY_LENGTH(...) - Counts how many elements
are in an array
[OFFSET(0)] - Access a specific element in the
array
OFFSET(0) = first elementOFFSET(1) = second elementOFFSET(4) = fifth elementImportant: Array Indexing Starts at 0! | Position | Index | Name in Array | |———-|——-|—————| | 1st | OFFSET(0) | Emma | | 2nd | OFFSET(1) | Olivia | | 3rd | OFFSET(2) | Ava | | 4th | OFFSET(3) | Charlotte | | 5th | OFFSET(4) | Sophia |
Other useful array functions:
-- Check if an array contains a value
SELECT 'Emma' IN UNNEST(['Emma', 'Olivia', 'Ava']) as contains_emma; -- Returns TRUE
-- Get array of only specific values
SELECT ARRAY(SELECT name FROM UNNEST(['Emma', 'Olivia', 'Ava']) as name WHERE name LIKE 'E%');
-- Returns ['Emma']
-- Concatenate two arrays
SELECT ARRAY_CONCAT(['Emma', 'Olivia'], ['Liam', 'Noah']);
-- Returns ['Emma', 'Olivia', 'Liam', 'Noah']
-- Get last element (using length)
SELECT ['Emma', 'Olivia', 'Ava'][OFFSET(ARRAY_LENGTH(['Emma', 'Olivia', 'Ava']) - 1)];
-- Returns 'Ava' (length=3, so index 2 is the last element)
Common mistakes: - Using [1] instead of
[OFFSET(1)] - BigQuery requires the OFFSET keyword -
Forgetting arrays start at 0 - OFFSET(1) is the SECOND
element, not the first - Accessing an index that doesn’t exist -
OFFSET(10) on a 5-element array causes an error
A STRUCT is like a mini-table stored in a single field. It groups related pieces of data together.
Think of it like this: - Array: A list of similar items [apple, banana, orange] - Struct: A collection of different items {name: “apple”, color: “red”, price: 1.50}
Real-world analogy: An address is a struct: {street: “123 Main St”, city: “Boston”, zip: “02101”}
Goal: Create arrays where each element is a struct (combining both concepts!)
This is very powerful and commonly used in BigQuery.
-- Step 1: Rank the names
WITH ranked_names AS (
SELECT
year, -- The year
gender, -- M or F
name, -- Name
SUM(number) as total, -- Total across all states
RANK() OVER ( -- Calculate rank
PARTITION BY year, gender
ORDER BY SUM(number) DESC
) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year IN (2010, 2015, 2020) -- Three different years
GROUP BY year, gender, name
QUALIFY rank <= 3 -- Keep top 3
)
-- Step 2: Create array of structs
SELECT
year, -- The year
gender, -- M or F
ARRAY_AGG( -- Create an array
STRUCT(name, total, rank) -- Each element is a struct
ORDER BY rank -- Order by rank
) as top_names -- Name of the array column
FROM ranked_names
GROUP BY year, gender -- One row per year/gender
ORDER BY year, gender; -- Sort results
Line-by-line explanation:
CTE - ranked_names: 1-9. Find top 3 names per year/gender (we’ve seen this before)
Main Query: 10.
ARRAY_AGG(STRUCT(name, total, rank) ORDER BY rank) -
Array of structs! -
STRUCT(name, total, rank) creates a struct for each name -
ARRAY_AGG(...) collects all those structs into an array -
Result: Instead of 3 rows (one per name), you get 1 row with an array of
3 structs
What you’ll see: Six rows total - one for each year/gender combination (2010 F, 2010 M, 2015 F, 2015 M, 2020 F, 2020 M). Each row contains an array in the top_names column. Each element in that array is a struct containing a name, its total count, and its rank. This compact format puts all three top names for a given year/gender into a single row.
Before (regular rows): | year | gender | name | total | rank | |——|——–|———|——-|——| | 2020 | F | Olivia | 17535 | 1 | | 2020 | F | Emma | 15581 | 2 | | 2020 | F | Ava | 13084 | 3 |
After (array of structs): | year | gender | top_names | |——|——–|———–| | 2020 | F | [{name: Olivia, total: 17535, rank: 1}, {name: Emma, total: 15581, rank: 2}, {name: Ava, total: 13084, rank: 3}] |
Why use arrays of structs? - Extremely compact data representation - Perfect for exporting to JSON - Common in APIs and modern data formats - One row can contain complex, nested information
Real-world examples: - Orders table: each order has an array of products (structs with product_id, quantity, price) - User table: each user has an array of addresses (structs with street, city, zip) - Events table: each event has an array of attendees (structs with name, email, status)
Goal: Query data from struct fields using dot notation
-- Step 1: Create structs with statistics
WITH name_stats AS (
SELECT
year, -- The year
STRUCT( -- Create a struct of statistics
COUNT(DISTINCT name) as unique_names, -- Field 1: count of unique names
SUM(number) as total_births, -- Field 2: total births
AVG(number) as avg_per_name -- Field 3: average per name
) as stats -- Name the struct "stats"
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE gender = 'M' -- Only males
GROUP BY year -- One row per year
)
-- Step 2: Access the struct fields
SELECT
year, -- The year
stats.unique_names, -- Access field with dot notation
stats.total_births, -- Access field with dot notation
ROUND(stats.avg_per_name, 2) as avg_per_name -- Access and round
FROM name_stats
WHERE year BETWEEN 2010 AND 2020 -- Recent years only
ORDER BY year; -- Sort chronologically
Line-by-line explanation:
CTE - name_stats: 1.
STRUCT(COUNT(...), SUM(...), AVG(...)) as stats - Create a
struct with 3 fields - unique_names: how many different
names exist - total_births: total number of boys born -
avg_per_name: average number of boys per name (total ÷
unique) 2. GROUP BY year - Calculate these stats for each
year
Main Query: 3. stats.unique_names -
Dot notation to access struct fields -
stats is the struct column name -
.unique_names accesses the unique_names field inside the
struct - It’s like saying “from the stats struct, give me the
unique_names value” 4. ROUND(stats.avg_per_name, 2) - You
can use struct fields in functions - Access the field with dot notation
- Then apply any function you want
Visual representation:
What’s stored in the stats struct:
stats: {
unique_names: 12500,
total_births: 2000000,
avg_per_name: 160.0
}
How to access it: - stats.unique_names
→ returns 12500 - stats.total_births → returns 2000000 -
stats.avg_per_name → returns 160.0
What you’ll see: One row per year between 2010 and 2020 (11 rows total). Each row shows the year and three separate columns extracted from the stats struct: the number of unique male names that year, total male births, and the average number of babies per name (rounded to 2 decimal places).
Accessing nested structs: If you have a struct inside a struct, use multiple dots:
-- Example: address struct inside person struct
person.address.city
person.address.zip
-- Reading from left to right:
-- 1. Get the "person" struct
-- 2. From that struct, get the "address" struct
-- 3. From that struct, get the "city" field
Best practice: When working with structs, always use meaningful field names. Instead of:
STRUCT(a, b, c) -- Bad: what do these mean?
Use:
STRUCT(name, count, rank) -- Good: clear meaning
Now let’s combine everything you’ve learned: CTEs, window functions, arrays, and structs!
Goal: Find names that are growing fastest in popularity, with complete year-by-year analysis
-- STEP 1: Get yearly totals for each name
WITH yearly_stats AS (
SELECT
year, -- The year
gender, -- M or F
name, -- Individual name
SUM(number) as total -- Total across all states
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year BETWEEN 2015 AND 2020 -- Focus on recent years
GROUP BY year, gender, name -- One row per year/gender/name combo
),
-- STEP 2: Calculate year-over-year growth
growth_calc AS (
SELECT
year, -- The year
gender, -- M or F
name, -- Name
total, -- Total this year (from CTE 1)
LAG(total) OVER ( -- Get previous year's total
PARTITION BY name, gender -- Keep same name/gender
ORDER BY year -- Order by year
) as prev_year, -- Last year's total
total - LAG(total) OVER ( -- Absolute growth (difference)
PARTITION BY name, gender
ORDER BY year
) as absolute_growth,
SAFE_DIVIDE( -- Percentage growth
total - LAG(total) OVER ( -- Change
PARTITION BY name, gender
ORDER BY year
),
LAG(total) OVER ( -- Divided by previous year
PARTITION BY name, gender
ORDER BY year
)
) * 100 as pct_growth -- Convert to percentage
FROM yearly_stats
),
-- STEP 3: Aggregate growth data into arrays of structs
trending AS (
SELECT
name, -- The name
gender, -- M or F
ARRAY_AGG( -- Create an array
STRUCT(year, total, pct_growth) -- Each element is a struct
ORDER BY year -- Chronological order
) as yearly_data, -- Array of year-by-year data
AVG(pct_growth) as avg_growth_rate, -- Average growth rate
COUNT(*) as years_tracked -- How many years we have data for
FROM growth_calc
WHERE pct_growth IS NOT NULL -- Ignore first year (no previous year)
GROUP BY name, gender -- One row per name/gender
HAVING AVG(pct_growth) > 10 -- Only names growing >10% per year on average
AND COUNT(*) >= 4 -- Must have at least 4 years of data
)
-- STEP 4: Final output with array indexing
SELECT
name, -- The name
gender, -- M or F
ROUND(avg_growth_rate, 2) as avg_growth_pct, -- Average % growth per year
yearly_data[OFFSET(0)].total as total_2015, -- Access first array element's total field
yearly_data[OFFSET(ARRAY_LENGTH(yearly_data)-1)].total as total_2020 -- Access last element's total
FROM trending
ORDER BY avg_growth_rate DESC -- Fastest growing first
LIMIT 20; -- Top 20 trending names
Detailed explanation of each CTE:
CTE 1 - yearly_stats: - Purpose: Get the basic data we need (yearly totals per name) - Output: One row per year/gender/name with the total count - Why: Simplifies the next step by pre-aggregating
CTE 2 - growth_calc: - Purpose: Calculate how much
each name grew from year to year - Key parts: - LAG(total)
gets last year’s number - total - LAG(total) calculates
absolute change (+1000 babies, -500 babies, etc.) -
SAFE_DIVIDE(...) calculates percentage change -
SAFE_DIVIDE instead of / prevents division by
zero errors - PARTITION BY name, gender keeps each name’s
calculations separate - Output: Each year for each name has growth
metrics
CTE 3 - trending: - Purpose: Identify names that are
consistently growing - Key parts: - ARRAY_AGG(STRUCT(...))
creates one row per name with all years in an array - Each array element
is a struct containing year, total, and growth% -
AVG(pct_growth) calculates average growth rate across all
years - HAVING AVG(pct_growth) > 10 filters to only
fast-growing names - Output: One row per trending name with their entire
history in an array
Final Query: - Purpose: Display the results in a
readable format - Key parts: - yearly_data[OFFSET(0)].total
- Get the total from the FIRST year - [OFFSET(0)] accesses
first array element - .total accesses the total field from
that struct -
yearly_data[OFFSET(ARRAY_LENGTH(yearly_data)-1)].total -
Get total from LAST year - ARRAY_LENGTH(yearly_data) counts
how many years we have - -1 because arrays start at 0 (if
length is 5, last index is 4) - Result: Shows which names went from
small to big, and how much they grew
What you’ll see: Up to 20 rows showing the fastest-growing names. Each row displays the name, gender, average annual growth percentage, the total count from 2015 (the first year in our dataset), and the total count from 2020 (the last year). Names with higher average growth percentages appear first, showing which names increased most dramatically in popularity over this period.
Key techniques used: 1. ✅ CTEs to break complex logic into steps 2. ✅ Window functions (LAG) to compare years 3. ✅ PARTITION BY to keep data organized 4. ✅ Arrays to store multiple years’ data 5. ✅ Structs to group related fields 6. ✅ Array indexing to access specific elements 7. ✅ Dot notation to access struct fields
Now it’s your turn! Try these exercises to reinforce what you’ve learned. Start simple and work your way up.
Task: Find the top 10 names for boys in 2019, showing each name’s rank and what percentage of the total they represent.
Hints: - Use RANK() to assign rankings - Use SUM() OVER (PARTITION BY year, gender) to get total births - Calculate percentage as: (individual count / total) * 100 - Use QUALIFY to filter to top 10
SELECT
name,
SUM(number) as count,
SUM(number) / SUM(SUM(number)) OVER (PARTITION BY year, gender) * 100 as percentage,
RANK() OVER (PARTITION BY year, gender ORDER BY SUM(number) DESC) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2019 AND gender = 'M'
GROUP BY name, year, gender
QUALIFY rank <= 10
ORDER BY rank;
Task: For the name “Emma”, show each year from 2000-2020 with the current year’s count, previous year’s count, next year’s count, and whether it increased or decreased.
Hints: - Use LAG() to get previous year - Use LEAD() to get next year - Use CASE WHEN to create a text description
SELECT
year,
SUM(number) as current_year,
LAG(SUM(number)) OVER (ORDER BY year) as previous_year,
LEAD(SUM(number)) OVER (ORDER BY year) as next_year,
CASE
WHEN SUM(number) > LAG(SUM(number)) OVER (ORDER BY year) THEN 'Increased'
WHEN SUM(number) < LAG(SUM(number)) OVER (ORDER BY year) THEN 'Decreased'
ELSE 'Same'
END as trend
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE name = 'Emma' AND year BETWEEN 2000 AND 2020
GROUP BY year
ORDER BY year;
Task: Find which states have the most diverse baby names in 2020 (measured by the number of unique names used).
Hints: - First CTE: count unique names per state - Second CTE: rank states - Use COUNT(DISTINCT name) for unique names
WITH state_diversity AS (
SELECT
state,
COUNT(DISTINCT name) as unique_names,
SUM(number) as total_births
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020
GROUP BY state
),
ranked_states AS (
SELECT
state,
unique_names,
total_births,
RANK() OVER (ORDER BY unique_names DESC) as diversity_rank
FROM state_diversity
)
SELECT
state,
unique_names,
total_births,
diversity_rank
FROM ranked_states
ORDER BY diversity_rank
LIMIT 10;
Task: For each state, create an array of the top 3 most popular girl names in 2020.
Hints: - First, rank names within each state - Then use ARRAY_AGG to collect the top 3 - Remember to ORDER BY inside ARRAY_AGG
WITH ranked_names AS (
SELECT
state,
name,
number,
RANK() OVER (PARTITION BY state ORDER BY number DESC) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020 AND gender = 'F'
QUALIFY rank <= 3
)
SELECT
state,
ARRAY_AGG(name ORDER BY rank) as top_3_names
FROM ranked_names
GROUP BY state
ORDER BY state;
Task: Create a struct containing statistics about names starting with ‘A’ vs names starting with ‘Z’ in 2020.
Hints: - Use SUBSTR(name, 1, 1) to get first letter - Create separate STRUCTs for A names and Z names - Include count of names and total babies in each struct
SELECT
year,
STRUCT(
COUNT(DISTINCT CASE WHEN SUBSTR(name, 1, 1) = 'A' THEN name END) as unique_names,
SUM(CASE WHEN SUBSTR(name, 1, 1) = 'A' THEN number ELSE 0 END) as total_babies
) as a_names,
STRUCT(
COUNT(DISTINCT CASE WHEN SUBSTR(name, 1, 1) = 'Z' THEN name END) as unique_names,
SUM(CASE WHEN SUBSTR(name, 1, 1) = 'Z' THEN number ELSE 0 END) as total_babies
) as z_names
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020
GROUP BY year;
Task: Find names that were popular (top 100) in 1950 but are NOT in the top 100 in 2020. Show their 1950 rank, 2020 rank, and create an array of their yearly totals from 1950 to 2020 (in 10-year intervals).
Hints: - This needs multiple CTEs - Use window functions to rank in both years - Use ARRAY_AGG with a struct containing year and total - Filter to names in top 100 in 1950 but not 2020
WITH ranks_1950 AS (
SELECT
name,
gender,
SUM(number) as total,
RANK() OVER (PARTITION BY gender ORDER BY SUM(number) DESC) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 1950
GROUP BY name, gender
),
ranks_2020 AS (
SELECT
name,
gender,
SUM(number) as total,
RANK() OVER (PARTITION BY gender ORDER BY SUM(number) DESC) as rank
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year = 2020
GROUP BY name, gender
),
declined_names AS (
SELECT
a.name,
a.gender,
a.rank as rank_1950,
COALESCE(b.rank, 999) as rank_2020
FROM ranks_1950 a
LEFT JOIN ranks_2020 b ON a.name = b.name AND a.gender = b.gender
WHERE a.rank <= 100
AND (b.rank > 100 OR b.rank IS NULL)
),
yearly_totals AS (
SELECT
d.name,
d.gender,
d.rank_1950,
d.rank_2020,
ARRAY_AGG(
STRUCT(y.year, y.total)
ORDER BY y.year
) as decade_history
FROM declined_names d
LEFT JOIN (
SELECT
name,
gender,
year,
SUM(number) as total
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year IN (1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)
GROUP BY name, gender, year
) y ON d.name = y.name AND d.gender = y.gender
GROUP BY d.name, d.gender, d.rank_1950, d.rank_2020
)
SELECT
name,
gender,
rank_1950,
rank_2020,
decade_history
FROM yearly_totals
ORDER BY rank_1950
LIMIT 20;
Now that you understand the basics:
bigquery-public-data in the BigQuery consoleStart simple: Don’t try to write the perfect query on the first try. Build it step by step.
Use CTEs liberally: They make debugging SO much easier. You can run just the CTE to see if it’s working.
Comment your code: Future you will thank present you!
-- Get the top 5 names per year
WITH top_names AS (
...
)
Preview before running: Use LIMIT 100 while developing to avoid scanning huge amounts of data.
Check your logic with small datasets: Filter to one year or one state first, make sure your logic is correct, then expand.
Window functions are your friend: They solve so many problems that would otherwise require self-joins or complex subqueries.
Arrays and structs are powerful: They might seem weird at first, but they’re incredibly useful for organizing complex data.
Don’t be afraid of errors: Every error message is teaching you something about SQL!
Happy querying!
Remember: The best way to learn is by doing. Open BigQuery, copy these queries, modify them, break them, fix them, and make them your own!