Flat text files and Regex (a SQL demo)

Author

Beata Sirowy

I import into a SQL database a text document .- an archive of e-mail correspondence - and search through it using regular expressions.

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DBI)         
library(RPostgres)         
library(dbplyr)


Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                           
  RPostgres::Postgres(),                           
  dbname = 'postgres',                           
  host = 'localhost',                           
  port = 5432,                           
  user = 'postgres',                           
  password = 'abcd'        
  )

3. Copying a flat text file into a table

The file includes a long text document - an archive of email correspondence. Importing text files, the database looks for a delimiter to split the text into columns. To keep the entire document as it is (without splitting lines into columns) I specified a delimiter that does not appear in this text - E’\007’ is the BEL character and not in the data so each row of the text is one row of the same column (the table has just one column).


CREATE TABLE mbox (line TEXT);


SET CLIENT_ENCODING TO 'utf8';


copy mbox FROM 'D:\Datasets\mbox.txt' with delimiter E'\007';

The table has just one column - every line of the text in the original document is one row in the column 'line'.


SELECT * FROM mbox 
LIMIT 10;

Displaying records 1 - 10
line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
Return-Path: postmaster@collab.sakaiproject.org
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])

4. Searching through the text with regular expressions

Finding lines starting from ‘From’


SELECT line FROM mbox WHERE line ~ '^From ';

Displaying records 1 - 10
line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From louis@media.berkeley.edu Fri Jan 4 18:10:48 2008
From zqian@umich.edu Fri Jan 4 16:10:39 2008
From rjlowe@iupui.edu Fri Jan 4 15:46:24 2008
From zqian@umich.edu Fri Jan 4 15:03:18 2008
From rjlowe@iupui.edu Fri Jan 4 14:50:18 2008
From cwen@iupui.edu Fri Jan 4 11:37:30 2008
From cwen@iupui.edu Fri Jan 4 11:35:08 2008
From gsilver@umich.edu Fri Jan 4 11:12:37 2008
From gsilver@umich.edu Fri Jan 4 11:11:52 2008

Extracting email addresses - the extraction starts immediately after a space followed by one or more characters and @ character followed by anything but a space,


SELECT substring(line, ' (.+@[^ ]+) ') 
FROM mbox 
WHERE line ~ '^From ';

Displaying records 1 - 10
substring
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
gsilver@umich.edu

Selecting email addresses, the count of email addresses, and ordering by the count of email addresses descending.


SELECT substring(line, ' (.+@[^ ]+) '), 
count(substring(line, ' (.+@[^ ]+) ')) 
FROM mbox 
WHERE line ~ '^From ' 
GROUP BY substring(line, ' (.+@[^ ]+) ') 
ORDER BY count(substring(line, ' (.+@[^ ]+) ')) 
DESC;

Displaying records 1 - 10
substring	count
zqian@umich.edu	195
mmmay@indiana.edu	161
cwen@iupui.edu	158
chmaurer@iupui.edu	111
aaronz@vt.edu	110
ian@caret.cam.ac.uk	96
jimeng@umich.edu	93
rjlowe@iupui.edu	90
dlhaines@umich.edu	84
david.horwitz@uct.ac.za	67

In the following, we get the same result by using a sub-select. This is a shorter query but slower - should be avoided in high-performance situations like online applications - where speed is important.


SELECT email, count(email) 
FROM( 
SELECT substring(line, ' (.+@[^ ]+) ') AS email 
FROM mbox 
WHERE line ~ '^From '
) AS badsub
GROUP BY email 
ORDER BY count(email) DESC;

Displaying records 1 - 10
email	count
zqian@umich.edu	195
mmmay@indiana.edu	161
cwen@iupui.edu	158
chmaurer@iupui.edu	111
aaronz@vt.edu	110
ian@caret.cam.ac.uk	96
jimeng@umich.edu	93
rjlowe@iupui.edu	90
dlhaines@umich.edu	84
david.horwitz@uct.ac.za	67