A SQL demo: designing a relational database model: Unesco World Heritage Cultural Sites
Author
Beata Sirowy
The task is to design a database model representing flat data across multiple linked tables using “third-normal form” (3NF), i.e. columns that have vertical duplication need to be placed in their own table and linked into the main table using a foreign key.
Third normal form (3NF) is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. It was defined in 1971 by Edgar F. Codd, an English computer scientist who invented the relational model for database management. (Wikipedia)
The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.
1. Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI) library(RPostgres) library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect( RPostgres::Postgres(), dbname ='postgres', host ='localhost', port =5432, user ='postgres', password ='abcd')
3. Data
This is a preview of the original csv table
4. Database design
The file includes following columns:
name
year
longitude
latitude
area_hectares
category
states
region
ISO
The column names written in bold include a lot of vertical replication, which needs to be eliminated - the data will be moved to separate tables.
The data model will include 5 linked tables:
The main table - including data that is not vertically replicated in the original csv file
site (name, year. longitude, latitude, area_hectares)
4 tables including dtat that is vertically replicated
category
state
region
ISO
a. Creating a data model diagram
The diagram is created with drawSQL tool
b. Creating tables
CREATETABLEcategory (id SERIAL,name VARCHAR(64) UNIQUENOTNULL,PRIMARYKEY(id));
CREATETABLE region (id SERIAL,name VARCHAR(64) UNIQUENOTNULL,PRIMARYKEY(id));
CREATETABLE ISO (id SERIAL,code CHAR(4) UNIQUE,PRIMARYKEY(id));
CREATETABLE site ( id SERIAL, name VARCHAR(256) UNIQUENOTNULL, yearNUMERIC,longtitude NUMERIc,latitude NUMERIC,area_hectares NUMERIC,category_id INTEGERREFERENCEScategory(id) ONDELETECASCADE,state_id INTEGERREFERENCES state(id) ONDELETECASCADE,region_id INTEGERREFERENCES region(id) ONDELETECASCADE,ISO_id INTEGERREFERENCES ISO(id) ONDELETECASCADE,PRIMARYKEY(id));
SELECT*FROM site;
0 records
id
name
year
longtitude
latitude
area_hectares
category_id
state_id
region_id
iso_id
Here are table details (accessed with command from psql shell):
We also create a working table - a temporary “container” for all data imported from the csv file (raw_unesco). This table will be used for further operations in SQL and then discarded.
INSERTINTO state (name) SELECTDISTINCT states FROM raw_unesco;
SELECT*FROM state;
Displaying records 1 - 10
id
name
1
Indonesia
2
Bangladesh
3
Jerusalem (Site proposed by Jordan)
4
Iran (Islamic Republic of)
5
Kiribati
6
Cameroon
7
Luxembourg
8
Sweden
9
Viet Nam
10
Uganda
INSERTINTO region (name) SELECTDISTINCT region FROM raw_unesco;
SELECT*FROM region;
5 records
id
name
1
Asia and the Pacific
2
Arab States
3
Europe and North America
4
Latin America and the Caribbean
5
Africa
INSERTINTO ISO (code) SELECTDISTINCT ISOFROM raw_unesco;
Select*FROM ISO;
Displaying records 1 - 10
id
code
1
sm
2
fj
3
bd
4
np
5
vu
6
fr
7
bh
8
sk
9
pa
10
NA
Next, we update the “raw_unesco” table adding category_id, state_id, region_id, and ISO:_id column. The values in these columns will be taken from newly created tables (category, state, region, ISO) ,
UPDATE raw_unesco SET category_id = (SELECTcategory.idFROMcategoryWHEREcategory.name = raw_unesco.category);
UPDATE raw_unesco SET region_id = (SELECT region.idFROM region WHERE region.name = raw_unesco.region);
UPDATE raw_unesco SET state_id = (SELECT state.idFROM state WHERE state.name = raw_unesco.states);
UPDATE raw_unesco SET ISO_id = (SELECT ISO.idFROM ISO WHERE ISO.code = raw_unesco.ISO);
As we can see the columns in the “raw_unesco” have been updated.
SELECT name, category_id, state_id, region_id, iso_idFROM raw_unescoWHERE states ='Sweden';
Displaying records 1 - 10
name
category_id
state_id
region_id
iso_id
Birka and Hovgården
1
8
3
37
Engelsberg Ironworks
1
8
3
37
Rock Carvings in Tanum
1
8
3
37
Skogskyrkogården
1
8
3
37
Royal Domain of Drottningholm
1
8
3
37
Hanseatic Town of Visby
1
8
3
37
Church Town of Gammelstad, Luleå
1
8
3
37
Laponian Area
2
8
3
37
Naval Port of Karlskrona
1
8
3
37
Agricultural Landscape of Southern Öland
1
8
3
37
In the next step I copy the values from categoryt_id, state_id, region_id, and ISO_id column in the “raw_unesco” table into the “site” table - the ultimate destination table.
Antigua Naval Dockyard and Related Archaeological Sites
2016
-61.7617
17.0069
255.00
1
162
4
117
47
Australian Fossil Mammal Sites (Riversleigh / Naracoorte)
1994
138.7167
-19.0833
10326.00
3
63
1
78
55
Australian Convict Sites
2010
150.9944
-33.3783
1502.51
1
63
1
78
72
Architectural, Residential and Cultural Complex of the Radziwill Family at Nesvizh
2005
26.6914
53.2228
0.00
1
57
3
110
104
Atlantic Forest South-East Reserves
1999
-48.0000
-24.1667
468193.00
3
136
4
72
119
Ancient City of Nessebar
1983
27.7300
42.6561
27.10
1
144
3
15
125
Angkor
1992
103.8333
13.4333
40100.00
1
16
1
34
165
Ancient Building Complex in the Wudang Mountains
1994
111.0000
32.4667
0.00
1
55
1
22
170
Ancient City of Ping Yao
1997
112.1544
37.2014
245.62
1
55
1
22
We can now discard the temporary table “raw_unesco”
DROPTABLE raw_unesco;
d. Joining tables
We can join tables to display the combined data
SELECT site.name, site.year, category.name AScategory, state.name AS state, region.name AS region, ISO.code AS ISOFROM siteJOINcategoryONcategory.id= site.category_idJOIN state ON state.id= site.state_idJOIN region ON region.id= site.region_idJOIN ISO ON ISO.id= site.iso_idLIMIT10;
Displaying records 1 - 10
name
year
category
state
region
iso
Royal Palaces of Abomey
1985
Cultural
Benin
Africa
bj
Wadi Al-Hitan (Whale Valley)
2005
Natural
Egypt
Arab States
eg
Taxila
1980
Cultural
Pakistan
Asia and the Pacific
pk
Robben Island
1999
Cultural
South Africa
Africa
za
Cultural Landscape and Archaeological Remains of the Bamiyan Valley