I import into a SQL database a text document .- an archive of e-mail correspondence - and search through it using regular expressions.
The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.
1. Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI) library(RPostgres) library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect( RPostgres::Postgres(), dbname ='postgres', host ='localhost', port =5432, user ='postgres', password ='abcd' )
3. Copying a flat text file into a table
The file includes a long text document - an archive of email correspondence. Importing text files, the database looks for a delimiter to split the text into columns. To keep the entire document as it is (without splitting lines into columns) I specified a delimiter that does not appear in this text - E’\007’ is the BEL character and not in the data so each row of the text is one row of the same column (the table has just one column).
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
4. Searching through the text with regular expressions
Finding lines starting from ‘From’
SELECT line FROM mbox WHERE line ~ '^From ';
Displaying records 1 - 10
line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From louis@media.berkeley.edu Fri Jan 4 18:10:48 2008
From zqian@umich.edu Fri Jan 4 16:10:39 2008
From rjlowe@iupui.edu Fri Jan 4 15:46:24 2008
From zqian@umich.edu Fri Jan 4 15:03:18 2008
From rjlowe@iupui.edu Fri Jan 4 14:50:18 2008
From cwen@iupui.edu Fri Jan 4 11:37:30 2008
From cwen@iupui.edu Fri Jan 4 11:35:08 2008
From gsilver@umich.edu Fri Jan 4 11:12:37 2008
From gsilver@umich.edu Fri Jan 4 11:11:52 2008
Extracting email addresses - the extraction starts immediately after a space followed by one or more characters and @ character followed by anything but a space,
SELECT substring(line, ' (.+@[^ ]+) ') FROM mbox WHERE line ~ '^From ';
Displaying records 1 - 10
substring
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
gsilver@umich.edu
Selecting email addresses, the count of email addresses, and ordering by the count of email addresses descending.
SELECT substring(line, ' (.+@[^ ]+) '), count(substring(line, ' (.+@[^ ]+) ')) FROM mbox WHERE line ~ '^From 'GROUPBY substring(line, ' (.+@[^ ]+) ') ORDERBYcount(substring(line, ' (.+@[^ ]+) ')) DESC;
Displaying records 1 - 10
substring
count
zqian@umich.edu
195
mmmay@indiana.edu
161
cwen@iupui.edu
158
chmaurer@iupui.edu
111
aaronz@vt.edu
110
ian@caret.cam.ac.uk
96
jimeng@umich.edu
93
rjlowe@iupui.edu
90
dlhaines@umich.edu
84
david.horwitz@uct.ac.za
67
In the following, we get the same result by using a sub-select. This is a shorter query but slower - should be avoided in high-performance situations like online applications - where speed is important.
SELECT email, count(email) FROM( SELECT substring(line, ' (.+@[^ ]+) ') AS email FROM mbox WHERE line ~ '^From ') AS badsubGROUPBY email ORDERBYcount(email) DESC;