Flat text files and Regex (a SQL demo)

Author

Beata Sirowy

I import into a SQL database a text document .- an archive of e-mail correspondence - and search through it using regular expressions.

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)         
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)         
library(RPostgres)         
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                           
  RPostgres::Postgres(),                           
  dbname = 'postgres',                           
  host = 'localhost',                           
  port = 5432,                           
  user = 'postgres',                           
  password = 'abcd'        
  )

3. Copying a flat text file into a table

The file includes a long text document - an archive of email correspondence. Importing text files, the database looks for a delimiter to split the text into columns. To keep the entire document as it is (without splitting lines into columns) I specified a delimiter that does not appear in this text - E’\007’ is the BEL character and not in the data so each row of the text is one row of the same column (the table has just one column).


CREATE TABLE mbox (line TEXT);

SET CLIENT_ENCODING TO 'utf8';

copy mbox FROM 'D:\Datasets\mbox.txt' with delimiter E'\007';
The table has just one column - every line of the text in the original document is one row in the column 'line'. 


SELECT * FROM mbox 
LIMIT 10;
Displaying records 1 - 10
line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Return-Path:
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])

4. Searching through the text with regular expressions

 

Finding lines starting from ‘From’


SELECT line FROM mbox WHERE line ~ '^From ';
Displaying records 1 - 10
line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From louis@media.berkeley.edu Fri Jan 4 18:10:48 2008
From zqian@umich.edu Fri Jan 4 16:10:39 2008
From rjlowe@iupui.edu Fri Jan 4 15:46:24 2008
From zqian@umich.edu Fri Jan 4 15:03:18 2008
From rjlowe@iupui.edu Fri Jan 4 14:50:18 2008
From cwen@iupui.edu Fri Jan 4 11:37:30 2008
From cwen@iupui.edu Fri Jan 4 11:35:08 2008
From gsilver@umich.edu Fri Jan 4 11:12:37 2008
From gsilver@umich.edu Fri Jan 4 11:11:52 2008

 

Extracting email addresses - the extraction starts immediately after a space followed by one or more characters and @ character followed by anything but a space,


SELECT substring(line, ' (.+@[^ ]+) ') 
FROM mbox 
WHERE line ~ '^From ';
Displaying records 1 - 10
substring
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
gsilver@umich.edu

 

Selecting email addresses, the count of email addresses, and ordering by the count of email addresses descending.


SELECT substring(line, ' (.+@[^ ]+) '), 
count(substring(line, ' (.+@[^ ]+) ')) 
FROM mbox 
WHERE line ~ '^From ' 
GROUP BY substring(line, ' (.+@[^ ]+) ') 
ORDER BY count(substring(line, ' (.+@[^ ]+) ')) 
DESC;
Displaying records 1 - 10
substring count
zqian@umich.edu 195
mmmay@indiana.edu 161
cwen@iupui.edu 158
chmaurer@iupui.edu 111
aaronz@vt.edu 110
ian@caret.cam.ac.uk 96
jimeng@umich.edu 93
rjlowe@iupui.edu 90
dlhaines@umich.edu 84
david.horwitz@uct.ac.za 67

 

In the following, we get the same result by using a sub-select. This is a shorter query but slower - should be avoided in high-performance situations like online applications - where speed is important.


SELECT email, count(email) 
FROM( 
SELECT substring(line, ' (.+@[^ ]+) ') AS email 
FROM mbox 
WHERE line ~ '^From '
) AS badsub
GROUP BY email 
ORDER BY count(email) DESC;
Displaying records 1 - 10
email count
zqian@umich.edu 195
mmmay@indiana.edu 161
cwen@iupui.edu 158
chmaurer@iupui.edu 111
aaronz@vt.edu 110
ian@caret.cam.ac.uk 96
jimeng@umich.edu 93
rjlowe@iupui.edu 90
dlhaines@umich.edu 84
david.horwitz@uct.ac.za 67