This document provides an interactive demonstration of fundamental
and advanced SQL concepts from the DataRockie Data Science
Bootcamp. Each section features an explanation, the executable
SQL code, and a fully interactive table displaying the live query
results from the chinook database.
First, we prepare our R environment. This involves loading the
necessary libraries, downloading the chinook.db sample
database, and establishing a connection.
# Install packages if they are not already installed
if (!require(DBI)) install.packages("DBI")
if (!require(RSQLite)) install.packages("RSQLite")
if (!require(DT)) install.packages("DT") # For interactive tables
# Load libraries
library(DBI)
library(RSQLite)
library(DT)
# Download the Chinook database if it doesn't exist
db_file <- "chinook.db"
if (!file.exists(db_file)) {
download.file("https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip", "chinook.zip")
unzip("chinook.zip")
}
# Establish a connection to the database
con <- dbConnect(RSQLite::SQLite(), db_file)
The JOIN clause combines rows from multiple tables based
on related columns. This query builds a complete music track overview by
joining four tables: Artists, Albums,
Tracks, and Genres.
SELECT
ar.Name AS ArtistName,
al.Title as AlbumTitle,
tr.Name AS TrackName,
ge.Name AS Genre
FROM Artists ar
JOIN Albums al ON ar.ArtistId = al.ArtistId
JOIN Tracks tr ON al.AlbumId = tr.AlbumId
JOIN Genres ge ON tr.GenreId = ge.GenreId
LIMIT 100;
Result: Explore the first 100 tracks from the joined tables below. You can sort by any column or use the search box to filter.
datatable(join_result,
options = list(pageLength = 5, autoWidth = TRUE),
caption = "Interactive table of joined music data.",
rownames = FALSE)
Aggregate functions like COUNT() summarize data across
multiple rows. This query demonstrates how COUNT(*) (total
rows) differs from COUNT(column) (total non-empty values in
that column).
SELECT
COUNT(*) as TotalCustomers,
COUNT(Company) as B2BCustomers
FROM Customers;
Result: The query shows the total customer count versus the count of customers who have a company associated with them (B2B).
datatable(count_result,
options = list(dom = 't'), ## 't' means table only, no search/paging
caption = "Summary of total vs. B2B customers.",
rownames = FALSE)
GROUP BY, HAVING)Combine aggregates with GROUP BY and HAVING
for powerful, filtered analysis. This query finds all non-USA countries
with 5 or more customers and sorts them.
SELECT
Country,
COUNT(*) AS NumberOfCustomers
FROM Customers
WHERE Country <> 'USA'
GROUP BY Country
HAVING COUNT(*) >= 5
ORDER BY NumberOfCustomers DESC;
Result: A ranked list of top customer countries (excluding the USA).
datatable(filtered_agg_result,
options = list(pageLength = 10),
caption = "Countries (non-USA) with 5 or more customers.",
rownames = FALSE)
CASE WHEN)The CASE statement embeds if-then-else logic directly
into your query. Here, we create a Segment column to
classify customers as B2B (has a company) or B2C (no company).
SELECT
FirstName || ' ' || LastName AS FullName,
Company,
Country,
CASE
WHEN Company IS NULL THEN 'B2C (Consumer)'
ELSE 'B2B (Business)'
END AS Segment
FROM Customers
LIMIT 100;
Result: The table below includes the dynamically generated
Segmentcolumn. Try searching for “B2B” or “B2C”.
datatable(case_result,
options = list(pageLength = 5),
caption = "Customer segmentation using CASE WHEN.",
rownames = FALSE)
WITH)Common Table Expressions (CTEs) make complex queries clean, readable, and modular by creating temporary, named result sets. Below, we compare a standard complex join with a much cleaner CTE-based approach to calculate total revenue from USA customers in 2009.
This single query combines filtering and joining in one block. It works, but it can be difficult to read and debug.
SELECT
SUM(i.Total) as total_revenue_usa_cust_2009
FROM Customers c
JOIN Invoices i ON c.CustomerId = i.CustomerId
WHERE
c.Country = 'USA' AND
STRFTIME('%Y', i.InvoiceDate) = '2009';
This version breaks the problem down into logical steps.
usa_customers and invoices_y2009 are defined
first, making the final SELECT statement simple and
self-explanatory.
WITH usa_customers AS (
SELECT * FROM Customers
WHERE Country = 'USA'
),
invoices_y2009 AS (
SELECT * FROM Invoices
WHERE STRFTIME('%Y', InvoiceDate) = '2009'
)
SELECT
SUM(t2.Total) as TotalRevenueUsa2009
FROM usa_customers t1
JOIN invoices_y2009 t2
ON t1.CustomerId = t2.CustomerId;
Both queries produce the same result, but the CTE approach is far more maintainable.
# Format the result nicely
cte_result$TotalRevenueUsa2009 <- paste0("$", round(cte_result$TotalRevenueUsa2009, 2))
datatable(cte_result,
options = list(dom = 't'),
caption = "Total 2009 revenue from USA customers.",
rownames = FALSE)