The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.
1. Loading libraries
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)
library(RPostgres)
library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect(
RPostgres::Postgres(),
dbname = 'postgres',
host = 'localhost',
port = 5432,
user = 'postgres',
password = 'abcd' )
3. Normalizing data
We have the following csv table with vertical replication in column B

The task is to import csv file into PostgreSQL table and normalize it.
First we create a new table - a destination for csv data:
CREATE TABLE xy_raw(x TEXT, y TEXT, y_id INTEGER);
Then, we copy csv data into the new table
COPY xy_raw(x,y)
FROM 'D:\Datasets\03-Techniques(2).csv'
WITH DELIMITER ',' CSV;
4 records
Zap |
A |
NA |
Zip |
A |
NA |
One |
B |
NA |
Two |
B |
NA |
This data needs to be spread into two tables to avoid vertical replication.
I create the new tables.
CREATE TABLE y (
id SERIAL,
PRIMARY KEY(id),
y TEXT
);
CREATE TABLE xy (
id SERIAL,
PRIMARY KEY(id),
x TEXT,
y_id INTEGER,
UNIQUE(x,y_id)
);
As the next step I insert data from the table xy_row into the new tables.
I start from identufying distinct values in column y of xy_raw table
SELECT DISTINCT y from xy_raw;
I insert this data into the y table
INSERT INTO y (y)
SELECT DISTINCT y FROM xy_raw;
Next, we update the xy_raw table adding y_id column. The values in this column will be taken from id column in y table ,
UPDATE xy_raw
SET y_id = (SELECT y.id FROM y WHERE y.y = xy_raw.y);
4 records
Zap |
A |
2 |
Zip |
A |
2 |
One |
B |
1 |
Two |
B |
1 |
In the next step I copy the values from y_id column in the xy_raw table into the table xy - the ultimate destination table.
INSERT INTO xy (x, y_id)
SELECT x, y_id
FROM xy_raw;
We have now two tables each row in the table xy has a primary keuy and a fcorresponding foreign key that points at the table y.
4 records
1 |
Zap |
2 |
2 |
Zip |
2 |
3 |
One |
1 |
4 |
Two |
1 |
We can join tables to get combined results in a normalized table:
SELECT *
FROM xy JOIN y ON xy.y_id = y.id;
4 records
1 |
Zap |
2 |
2 |
A |
2 |
Zip |
2 |
2 |
A |
3 |
One |
1 |
1 |
B |
4 |
Two |
1 |
1 |
B |