A SQL demo: loading and normalizing csv data

Author

Beata Sirowy

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)     
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)     
library(RPostgres)     
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(               
  RPostgres::Postgres(),               
  dbname = 'postgres',               
  host = 'localhost',               
  port = 5432,               
  user = 'postgres',               
  password = 'abcd'     )

3. Normalizing data

We have the following csv table with vertical replication in column B

The task is to import csv file into PostgreSQL table and normalize it.

First we create a new table - a destination for csv data:


CREATE TABLE xy_raw(x TEXT, y TEXT, y_id INTEGER);

Then, we copy csv data into the new table


COPY xy_raw(x,y) 
FROM  'D:\Datasets\03-Techniques(2).csv' 
WITH DELIMITER ',' CSV;

SELECT * FROM xy_raw;
4 records
x y y_id
Zap A NA
Zip A NA
One B NA
Two B NA

This data needs to be spread into two tables to avoid vertical replication.

I create the new tables.


CREATE TABLE y (
id SERIAL, 
PRIMARY KEY(id), 
y TEXT
);

CREATE TABLE xy ( 
id SERIAL, 
PRIMARY KEY(id), 
x TEXT, 
y_id INTEGER,
UNIQUE(x,y_id)
);

As the next step I insert data from the table xy_row into the new tables.

I start from identufying distinct values in column y of xy_raw table


SELECT DISTINCT y from xy_raw;
2 records
y
B
A

I insert this data into the y table


INSERT INTO y (y) 
SELECT DISTINCT y FROM xy_raw;

SELECT * FROM y;
2 records
id y
1 B
2 A

Next, we update the xy_raw table adding y_id column. The values in this column will be taken from id column in y table ,


UPDATE xy_raw 
SET y_id = (SELECT y.id FROM y WHERE y.y = xy_raw.y);

SELECT * FROM xy_raw;
4 records
x y y_id
Zap A 2
Zip A 2
One B 1
Two B 1

In the next step I copy the values from y_id column in the xy_raw table into the table xy - the ultimate destination table.


INSERT INTO xy (x, y_id) 
SELECT x, y_id 
FROM xy_raw;

We have now two tables each row in the table xy has a primary keuy and a fcorresponding foreign key that points at the table y.


select * from xy;
4 records
id x y_id
1 Zap 2
2 Zip 2
3 One 1
4 Two 1

select * from y;
2 records
id y
1 B
2 A

We can join tables to get combined results in a normalized table:


SELECT *
FROM xy JOIN y ON xy.y_id = y.id;
4 records
id x y_id id y
1 Zap 2 2 A
2 Zip 2 2 A
3 One 1 1 B
4 Two 1 1 B