A SQL demo: designing a relational database model: Unesco World Heritage Cultural Sites

Author

Beata Sirowy

The task is to design a database model representing flat data across multiple linked tables using “third-normal form” (3NF), i.e. columns that have vertical duplication need to be placed in their own table and linked into the main table using a foreign key.

 

Third normal form (3NF) is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. It was defined in 1971 by Edgar F. Codd, an English computer scientist who invented the relational model for database management. (Wikipedia)

 

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)      
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)      
library(RPostgres)      
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                  
  RPostgres::Postgres(),                  
  dbname = 'postgres',                  
  host = 'localhost',                  
  port = 5432,                  
  user = 'postgres',                  
  password = 'abcd'     
)

3. Data

This is a preview of the original csv table

4. Database design

The file includes following columns:

name

year

longitude

latitude

area_hectares

category

states

region

ISO

The column names written in bold include a lot of vertical replication, which needs to be eliminated - the data will be moved to separate tables.

The data model will include 5 linked tables:

The main table - including data that is not vertically replicated in the original csv file

  1. site (name, year. longitude, latitude, area_hectares)

4 tables including dtat that is vertically replicated

  1. category
  2. state
  3. region
  4. ISO

a. Creating a data model diagram

The diagram is created with drawSQL tool

b. Creating tables


CREATE TABLE category (
id SERIAL,
name VARCHAR(64) UNIQUE NOT NULL,
PRIMARY KEY(id)
);

CREATE TABLE state(
id SERIAL,
name VARCHAR(64) UNIQUE NOT NULL,
PRIMARY KEY(id)
);


CREATE TABLE region (
id SERIAL,
name VARCHAR(64) UNIQUE NOT NULL,
PRIMARY KEY(id)
);

CREATE TABLE ISO (
id SERIAL,
code CHAR(4) UNIQUE,
PRIMARY KEY(id)
);

CREATE TABLE site ( 
id SERIAL, 
name VARCHAR(256) UNIQUE NOT NULL, 
year NUMERIC,
longtitude NUMERIc,
latitude NUMERIC,
area_hectares NUMERIC,
category_id INTEGER REFERENCES category(id) ON DELETE CASCADE,
state_id INTEGER REFERENCES state(id) ON DELETE CASCADE,
region_id INTEGER REFERENCES region(id) ON DELETE CASCADE,
ISO_id INTEGER REFERENCES ISO(id) ON DELETE CASCADE,
PRIMARY KEY(id)
);

SELECT * FROM site;
0 records
id name year longtitude latitude area_hectares category_id state_id region_id iso_id

Here are table details (accessed with command from psql shell):

We also create a working table - a temporary “container” for all data imported from the csv file (raw_unesco). This table will be used for further operations in SQL and then discarded.


CREATE TABLE raw_unesco ( 
name VARCHAR(1024)UNIQUE NOT NULL, 
year INTEGER,
longtitude NUMERIC,
latitude NUMERIC,
area_hectares NUMERIC,
category VARCHAR(64),
states VARCHAR (128),
region VARCHAR (128),
ISO char(4),
category_id INTEGER,
state_id INTEGER,
region_id INTEGER,
ISO_id INTEGER
);

 

c. Importing data into tables

The order of columns in the destination table (site) has to be the same as in the source table (csv file).


COPY raw_unesco(name, year, longtitude, latitude, area_hectares, category, states, region, ISO) 
FROM  'D:\Datasets\whc-sites-2018-reduced.csv' 
WITH DELIMITER ',' CSV HEADER;

INSERT INTO category (name) 
SELECT DISTINCT category FROM raw_unesco;

SELECT * FROM category;
3 records
id name
1 Cultural
2 Mixed
3 Natural


INSERT INTO state (name) 
SELECT DISTINCT states FROM raw_unesco;

SELECT * FROM state;
Displaying records 1 - 10
id name
1 Indonesia
2 Bangladesh
3 Jerusalem (Site proposed by Jordan)
4 Iran (Islamic Republic of)
5 Kiribati
6 Cameroon
7 Luxembourg
8 Sweden
9 Viet Nam
10 Uganda

INSERT INTO region (name) 
SELECT DISTINCT region FROM raw_unesco;

SELECT * FROM region;
5 records
id name
1 Asia and the Pacific
2 Arab States
3 Europe and North America
4 Latin America and the Caribbean
5 Africa

INSERT INTO ISO (code) 
SELECT DISTINCT ISO
FROM raw_unesco;
Select * FROM ISO;
Displaying records 1 - 10
id code
1 sm
2 fj
3 bd
4 np
5 vu
6 fr
7 bh
8 sk
9 pa
10 NA

Next, we update the “raw_unesco” table adding category_id, state_id, region_id, and ISO:_id column. The values in these columns will be taken from newly created tables (category, state, region, ISO) ,


UPDATE raw_unesco  
SET category_id = (SELECT category.id FROM category WHERE category.name = raw_unesco.category); 

UPDATE raw_unesco  
SET region_id = (SELECT region.id FROM region WHERE region.name = raw_unesco.region); 

UPDATE raw_unesco  
SET state_id = (SELECT state.id FROM state WHERE state.name = raw_unesco.states); 

UPDATE raw_unesco  
SET ISO_id = (SELECT ISO.id FROM ISO WHERE ISO.code = raw_unesco.ISO); 

As we can see the columns in the “raw_unesco” have been updated.


SELECT name, category_id, state_id, region_id, iso_id
FROM raw_unesco
WHERE states = 'Sweden';
Displaying records 1 - 10
name category_id state_id region_id iso_id
Birka and Hovgården 1 8 3 37
Engelsberg Ironworks 1 8 3 37
Rock Carvings in Tanum 1 8 3 37
Skogskyrkogården 1 8 3 37
Royal Domain of Drottningholm 1 8 3 37
Hanseatic Town of Visby 1 8 3 37
Church Town of Gammelstad, Luleå 1 8 3 37
Laponian Area 2 8 3 37
Naval Port of Karlskrona 1 8 3 37
Agricultural Landscape of Southern Öland 1 8 3 37

In the next step I copy the values from categoryt_id, state_id, region_id, and ISO_id column in the “raw_unesco” table into the “site” table - the ultimate destination table.


INSERT INTO site (
name, 
year, 
longtitude, 
latitude, 
area_hectares, 
category_id, 
state_id, 
region_id, 
ISO_id)  
SELECT 
name, 
year, 
longtitude, 
latitude, 
area_hectares, 
category_id, 
state_id, 
region_id, 
ISO_id  
FROM raw_unesco;

SELECT * FROM site
WHERE name LIKE 'A%'
LIMIT 10;
Displaying records 1 - 10
id name year longtitude latitude area_hectares category_id state_id region_id iso_id
9 Al Qal’a of Beni Hammad 1980 4.7868 35.8184 150.00 1 33 2 69
19 Antigua Naval Dockyard and Related Archaeological Sites 2016 -61.7617 17.0069 255.00 1 162 4 117
47 Australian Fossil Mammal Sites (Riversleigh / Naracoorte) 1994 138.7167 -19.0833 10326.00 3 63 1 78
55 Australian Convict Sites 2010 150.9944 -33.3783 1502.51 1 63 1 78
72 Architectural, Residential and Cultural Complex of the Radziwill Family at Nesvizh 2005 26.6914 53.2228 0.00 1 57 3 110
104 Atlantic Forest South-East Reserves 1999 -48.0000 -24.1667 468193.00 3 136 4 72
119 Ancient City of Nessebar 1983 27.7300 42.6561 27.10 1 144 3 15
125 Angkor 1992 103.8333 13.4333 40100.00 1 16 1 34
165 Ancient Building Complex in the Wudang Mountains 1994 111.0000 32.4667 0.00 1 55 1 22
170 Ancient City of Ping Yao 1997 112.1544 37.2014 245.62 1 55 1 22

We can now discard the temporary table “raw_unesco”


DROP TABLE raw_unesco;

 

d. Joining tables

We can join tables to display the combined data


SELECT site.name, site.year, category.name AS category, state.name AS state, region.name AS region, ISO.code AS ISO
FROM site
JOIN category ON category.id = site.category_id
JOIN state ON state.id = site.state_id
JOIN region ON region.id = site.region_id
JOIN ISO ON ISO.id = site.iso_id
LIMIT 10
;
Displaying records 1 - 10
name year category state region iso
Royal Palaces of Abomey 1985 Cultural Benin Africa bj
Wadi Al-Hitan (Whale Valley) 2005 Natural Egypt Arab States eg
Taxila 1980 Cultural Pakistan Asia and the Pacific pk
Robben Island 1999 Cultural South Africa Africa za
Cultural Landscape and Archaeological Remains of the Bamiyan Valley 2003 Cultural Afghanistan Asia and the Pacific af
Minaret and Archaeological Remains of Jam 2002 Cultural Afghanistan Asia and the Pacific af
Historic Centres of Berat and Gjirokastra 2005 Cultural Albania Europe and North America al
Butrint 1992 Cultural Albania Europe and North America al
Al Qal’a of Beni Hammad 1980 Cultural Algeria Arab States dz
Tassili n’Ajjer 1982 Mixed Algeria Arab States dz