A SQL demo: loading and normalizing csv data

Author

Beata Sirowy

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DBI)     
library(RPostgres)     
library(dbplyr)


Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(               
  RPostgres::Postgres(),               
  dbname = 'postgres',               
  host = 'localhost',               
  port = 5432,               
  user = 'postgres',               
  password = 'abcd'     )

3. Normalizing data

We have the following csv table with vertical replication in column B

The task is to import csv file into PostgreSQL table and normalize it.

First we create a new table - a destination for csv data:


CREATE TABLE xy_raw(x TEXT, y TEXT, y_id INTEGER);

Then, we copy csv data into the new table


COPY xy_raw(x,y) 
FROM  'D:\Datasets\03-Techniques(2).csv' 
WITH DELIMITER ',' CSV;


SELECT * FROM xy_raw;

4 records
x	y	y_id
Zap	A	NA
Zip	A	NA
One	B	NA
Two	B	NA

This data needs to be spread into two tables to avoid vertical replication.

I create the new tables.


CREATE TABLE y (
id SERIAL, 
PRIMARY KEY(id), 
y TEXT
);


CREATE TABLE xy ( 
id SERIAL, 
PRIMARY KEY(id), 
x TEXT, 
y_id INTEGER,
UNIQUE(x,y_id)
);

As the next step I insert data from the table xy_row into the new tables.

I start from identufying distinct values in column y of xy_raw table


SELECT DISTINCT y from xy_raw;

2 records
y
B
A

I insert this data into the y table


INSERT INTO y (y) 
SELECT DISTINCT y FROM xy_raw;


SELECT * FROM y;

2 records
id	y
1	B
2	A

Next, we update the xy_raw table adding y_id column. The values in this column will be taken from id column in y table ,


UPDATE xy_raw 
SET y_id = (SELECT y.id FROM y WHERE y.y = xy_raw.y);


SELECT * FROM xy_raw;

4 records
x	y	y_id
Zap	A	2
Zip	A	2
One	B	1
Two	B	1

In the next step I copy the values from y_id column in the xy_raw table into the table xy - the ultimate destination table.


INSERT INTO xy (x, y_id) 
SELECT x, y_id 
FROM xy_raw;

We have now two tables each row in the table xy has a primary keuy and a fcorresponding foreign key that points at the table y.


select * from xy;

4 records
id	x	y_id
1	Zap	2
2	Zip	2
3	One	1
4	Two	1


select * from y;

2 records
id	y
1	B
2	A

We can join tables to get combined results in a normalized table:


SELECT *
FROM xy JOIN y ON xy.y_id = y.id;

4 records
id	x	y_id	id	y
1	Zap	2	2	A
2	Zip	2	2	A
3	One	1	1	B
4	Two	1	1	B