This document provides an interactive demonstration of fundamental and advanced SQL concepts from the DataRockie Data Science Bootcamp. Each section features an explanation, the executable SQL code, and a fully interactive table displaying the live query results from the chinook database.


1. Environment Setup

First, we prepare our R environment. This involves loading the necessary libraries, downloading the chinook.db sample database, and establishing a connection.

# Install packages if they are not already installed
if (!require(DBI)) install.packages("DBI")
if (!require(RSQLite)) install.packages("RSQLite")
if (!require(DT)) install.packages("DT") # For interactive tables

# Load libraries
library(DBI)
library(RSQLite)
library(DT)

# Download the Chinook database if it doesn't exist
db_file <- "chinook.db"
if (!file.exists(db_file)) {
  download.file("https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip", "chinook.zip")
  unzip("chinook.zip")
}

# Establish a connection to the database
con <- dbConnect(RSQLite::SQLite(), db_file)

2. JOIN Clause

The JOIN clause combines rows from multiple tables based on related columns. This query builds a complete music track overview by joining four tables: Artists, Albums, Tracks, and Genres.

SELECT
    ar.Name AS ArtistName,
    al.Title as AlbumTitle,
    tr.Name AS TrackName,
    ge.Name AS Genre
FROM Artists ar
JOIN Albums  al ON ar.ArtistId = al.ArtistId
JOIN Tracks  tr ON al.AlbumId = tr.AlbumId
JOIN Genres  ge ON tr.GenreId = ge.GenreId
LIMIT 100;

Result: Explore the first 100 tracks from the joined tables below. You can sort by any column or use the search box to filter.

datatable(join_result, 
          options = list(pageLength = 5, autoWidth = TRUE),
          caption = "Interactive table of joined music data.",
          rownames = FALSE)

3. Aggregate Functions

Aggregate functions like COUNT() summarize data across multiple rows. This query demonstrates how COUNT(*) (total rows) differs from COUNT(column) (total non-empty values in that column).

SELECT
    COUNT(*) as TotalCustomers,
    COUNT(Company) as B2BCustomers
FROM Customers;

Result: The query shows the total customer count versus the count of customers who have a company associated with them (B2B).

datatable(count_result, 
          options = list(dom = 't'), ## 't' means table only, no search/paging
          caption = "Summary of total vs. B2B customers.",
          rownames = FALSE)

4. Advanced Aggregation (GROUP BY, HAVING)

Combine aggregates with GROUP BY and HAVING for powerful, filtered analysis. This query finds all non-USA countries with 5 or more customers and sorts them.

SELECT
    Country,
    COUNT(*) AS NumberOfCustomers
FROM Customers
WHERE Country <> 'USA'
GROUP BY Country
HAVING COUNT(*) >= 5
ORDER BY NumberOfCustomers DESC;

Result: A ranked list of top customer countries (excluding the USA).

datatable(filtered_agg_result,
          options = list(pageLength = 10),
          caption = "Countries (non-USA) with 5 or more customers.",
          rownames = FALSE)

5. Conditional Logic (CASE WHEN)

The CASE statement embeds if-then-else logic directly into your query. Here, we create a Segment column to classify customers as B2B (has a company) or B2C (no company).

SELECT
    FirstName || ' ' || LastName AS FullName,
    Company,
    Country,
    CASE
        WHEN Company IS NULL THEN 'B2C (Consumer)'
        ELSE 'B2B (Business)'
    END AS Segment
FROM Customers
LIMIT 100;

Result: The table below includes the dynamically generated Segment column. Try searching for “B2B” or “B2C”.

datatable(case_result,
          options = list(pageLength = 5),
          caption = "Customer segmentation using CASE WHEN.",
          rownames = FALSE)

6. Common Table Expression (WITH)

Common Table Expressions (CTEs) make complex queries clean, readable, and modular by creating temporary, named result sets. Below, we compare a standard complex join with a much cleaner CTE-based approach to calculate total revenue from USA customers in 2009.

Comparison: Standard Join vs. CTE

The Hard Way (Complex Join)

This single query combines filtering and joining in one block. It works, but it can be difficult to read and debug.

SELECT
    SUM(i.Total) as total_revenue_usa_cust_2009
FROM Customers c
JOIN Invoices i ON c.CustomerId = i.CustomerId
WHERE
    c.Country = 'USA' AND
    STRFTIME('%Y', i.InvoiceDate) = '2009';

The Better Way (with CTE)

This version breaks the problem down into logical steps. usa_customers and invoices_y2009 are defined first, making the final SELECT statement simple and self-explanatory.

WITH usa_customers AS (
    SELECT * FROM Customers
    WHERE Country = 'USA'
), 
invoices_y2009 AS (
    SELECT * FROM Invoices
    WHERE STRFTIME('%Y', InvoiceDate) = '2009'
)
SELECT
    SUM(t2.Total) as TotalRevenueUsa2009
FROM usa_customers t1
JOIN invoices_y2009 t2
    ON t1.CustomerId = t2.CustomerId;

Query Result

Both queries produce the same result, but the CTE approach is far more maintainable.

# Format the result nicely
cte_result$TotalRevenueUsa2009 <- paste0("$", round(cte_result$TotalRevenueUsa2009, 2))

datatable(cte_result,
          options = list(dom = 't'),
          caption = "Total 2009 revenue from USA customers.",
          rownames = FALSE)