SQL demo: reducing the result set

In the following I focus on reducing vertical duplication in the result set with DISTINCT, DISTINCT ON and GROUP BY … HAVING

The document is created in Quarto accessed from RStudio.

1. Loading libraries

library(tidyverse)  
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)  
library(RPostgres)  
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect(      
  RPostgres::Postgres(),      
  dbname = 'postgres',      
  host = 'localhost',      
  port = 5432,      
  user = 'postgres',      
  password = 'abcd'  )

3. Creating a table


CREATE TABLE racing (
   make VARCHAR,
   model VARCHAR,
   year INTEGER,
   price INTEGER
);

INSERT INTO racing (make, model, year, price)
VALUES
('Nissan', 'Stanza', 1990, 2000),
('Dodge', 'Neon', 1995, 800),
('Dodge', 'Neon', 1998, 2500),
('Dodge', 'Neon', 1999, 3000),
('Ford', 'Mustang', 2001, 1000),
('Ford', 'Mustang', 2005, 2000),
('Subaru', 'Impreza', 1997, 1000),
('Mazda', 'Miata', 2001, 5000),
('Mazda', 'Miata', 2001, 3000),
('Mazda', 'Miata', 2001, 2500),
('Mazda', 'Miata', 2002, 5500),
('Opel', 'GT', 1972, 1500),
('Opel', 'GT', 1969, 7500),
('Opel', 'Cadet', 1973, 500)
;

SELECT * from racing;
Displaying records 1 - 10
make model year price
Nissan Stanza 1990 2000
Dodge Neon 1995 800
Dodge Neon 1998 2500
Dodge Neon 1999 3000
Ford Mustang 2001 1000
Ford Mustang 2005 2000
Subaru Impreza 1997 1000
Mazda Miata 2001 5000
Mazda Miata 2001 3000
Mazda Miata 2001 2500

4. Query examples

DISTINCT clause

DISTINCT removes duplicate rows from the result set


SELECT DISTINCT model, make FROM racing;
7 records
model make
GT Opel
Mustang Ford
Neon Dodge
Miata Mazda
Cadet Opel
Impreza Subaru
Stanza Nissan

DISTINCT ON clause

DISTINCT ON limits duplicate removal to a set of columns, without necessarily including the column that is used for the removal


SELECT DISTINCT ON (model) make,model,year 
FROM racing 
WHERE year > 1975
ORDER BY model, year DESC LIMIT 4;
4 records
make model year
Subaru Impreza 1997
Mazda Miata 2002
Ford Mustang 2005
Dodge Neon 1999

Here is the same statement with model used as a selection criterion but not displayed in the results


SELECT DISTINCT ON (model) make,year 
FROM racing 
WHERE year > 1975
ORDER BY model, year DESC LIMIT 4;
4 records
make year
Subaru 1997
Mazda 2002
Ford 2005
Dodge 1999

GROUP BY clause

GROUP BY tends to be combined with aggregate functions like COUNT(), MAX(), MIN(), AVG(), SUM()


SELECT COUNT(model), model FROM racing
GROUP BY model;
7 records
count model
3 Neon
4 Miata
2 Mustang
2 GT
1 Stanza
1 Cadet
1 Impreza

HAVING clause

used in combination with GROUP BY and aggregate functions


SELECT COUNT(model), model 
FROM racing
GROUP BY model HAVING COUNT(model) > 1;
4 records
count model
3 Neon
4 Miata
2 Mustang
2 GT

We can also include WHERE clause which is executed before GROUP BY



SELECT COUNT(model), model 
FROM racing
WHERE year > 1973
GROUP BY model HAVING COUNT(model) > 1;
3 records
count model
3 Neon
4 Miata
2 Mustang

SELECT AVG(price), make, COUNT(model) FROM racing
GROUP BY make HAVING AVG(price) >1000
ORDER BY AVG(price) DESC;
5 records
avg make count
4000.000 Mazda 4
3166.667 Opel 3
2100.000 Dodge 3
2000.000 Nissan 1
1500.000 Ford 2

The results of AVG clause can be rounded to one place after comma


SELECT ROUND(AVG(price),1) AS avg_price, make, COUNT(model) FROM racing
GROUP BY make HAVING AVG(price) >1000
ORDER BY AVG(price) DESC;
5 records
avg_price make count
4000.0 Mazda 4
3166.7 Opel 3
2100.0 Dodge 3
2000.0 Nissan 1
1500.0 Ford 2