The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.
1. Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI) library(RPostgres) library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect( RPostgres::Postgres(), dbname ='postgres', host ='localhost', port =5432, user ='postgres', password ='abcd' )
3. Regular Expressions: an overview
a text-based programming language;
used for text matching and parsing: specifying patterns of characters, numbers, and symbols, which can be used to identify, replace or split data;
can be used in many programming languages;
marker characters are keywords - very compact (developed in 1960s/70s)
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times
(non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times
(non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
WHERE CLAUSE OPERATORS
~¨ matches
~* matches case sensitive
!~ does not match
!~* does not match case insensitive
~ is different than LIKE. ~‘DOG’ is equivalent to LIKE ‘%DOG%’
We will use the previously created textfun table to test some expressions.
Finding strings including any of the letters in in the range a - c
SELECT content FROM textfunWHERE content ~'[a-c]'LIMIT5;
5 records
content
https://wikipedia.org/two/10000
https://wikipedia.org/two/10001
https://wikipedia.org/two/10002
https://wikipedia.org/two/10003
https://wikipedia.org/two/10004
Finding strings including the word ‘one’
SELECT content FROM textfunWHERE content ~'one'LIMIT5;
5 records
content
https://wikipedia.org/one/10005
https://wikipedia.org/one/10006
https://wikipedia.org/one/10007
https://wikipedia.org/one/10008
https://wikipedia.org/one/10010
Finding strings not including the word ‘one’
SELECT content FROM textfunWHERE content !~'one'LIMIT5;
5 records
content
https://wikipedia.org/two/10000
https://wikipedia.org/two/10001
https://wikipedia.org/two/10002
https://wikipedia.org/two/10003
https://wikipedia.org/two/10004
Finding strings ending with ‘124’
SELECT content FROM textfunWHERE content ~'124$'LIMIT5;
5 records
content
https://wikipedia.org/one/10124
https://wikipedia.org/two/11124
https://wikipedia.org/one/12124
https://wikipedia.org/two/13124
https://wikipedia.org/one/14124
Finding strings beginning with ‘h’
SELECT content FROM textfunWHERE content ~ '^(h)'LIMIT5;
5 records
content
https://wikipedia.org/two/10000
https://wikipedia.org/two/10001
https://wikipedia.org/two/10002
https://wikipedia.org/two/10003
https://wikipedia.org/two/10004
Finding strings including a digit from range 2-3 immediately followed by a digit from the range 8-9.
SELECT content FROM textfunWHERE content ~ '[2-3][8-9]'LIMIT10;
Displaying records 1 - 10
content
https://wikipedia.org/one/10028
https://wikipedia.org/two/10029
https://wikipedia.org/one/10038
https://wikipedia.org/one/10039
https://wikipedia.org/one/10128
https://wikipedia.org/two/10129
https://wikipedia.org/two/10138
https://wikipedia.org/two/10139
https://wikipedia.org/two/10228
https://wikipedia.org/one/10229
Finding strings including a digit from the range 2-3, separated by any number of characters, followed by a digit from the range 8-9.
SELECT content FROM textfunWHERE content ~ '[2-3].*[8-9]'LIMIT10;
Displaying records 1 - 10
content
https://wikipedia.org/one/10028
https://wikipedia.org/two/10029
https://wikipedia.org/one/10038
https://wikipedia.org/one/10039
https://wikipedia.org/one/10128
https://wikipedia.org/two/10129
https://wikipedia.org/two/10138
https://wikipedia.org/two/10139
https://wikipedia.org/two/10208
https://wikipedia.org/two/10209
The following table will be used in further examples
CREATETABLE em (id serial, primarykey(id), email text UNIQUE);
INSERTINTO em (email) VALUES ('sev@umich.edu');
INSERTINTO em (email) VALUES ('coleen@umich.edu');
INSERTINTO em (email) VALUES ('sally@uiuc.edu');
INSERTINTO em (email) VALUES ('ted79@umuc.edu');
INSERTINTO em (email) VALUES ('glenn1@apple.com');
INSERTINTO em (email) VALUES ('nbody@apple.com');
select*from em;
6 records
id
email
1
sev@umich.edu
2
coleen@umich.edu
3
sally@uiuc.edu
4
ted79@umuc.edu
5
glenn1@apple.com
6
nbody@apple.com
Selecting strings matching specific criteria
SELECT email FROM em WHERE email ~ 'umich';
2 records
email
sev@umich.edu
coleen@umich.edu
SELECT email FROM em WHERE email ~ '^c';
1 records
email
coleen@umich.edu
SELECT email FROM em WHERE email ~ 'edu$';
4 records
email
sev@umich.edu
coleen@umich.edu
sally@uiuc.edu
ted79@umuc.edu
SELECT email FROM em WHERE email ~ '^[gnt]';
3 records
email
ted79@umuc.edu
glenn1@apple.com
nbody@apple.com
SELECT email FROM em WHERE email ~ '[0-9]';
2 records
email
ted79@umuc.edu
glenn1@apple.com
SELECT email FROM em WHERE email ~ '[0-9][0-9]';
1 records
email
ted79@umuc.edu
4. Single and multiple matches
SUBSTRING() gets the first match in the text column
REGEXP_ MATCHES() gets an array of matches
SUBSTRING() - the first match in a column
Pulling out digits from email column (one or more contiguous digits). The last statement limits the result list - columns with N/A aren’t displayed.
SELECT substring(email FROM'[0-9]+') FROM emWHERE email ~ '[0-9]';
2 records
substring
79
1
Pulling out domain names from email addresses.
SELECT substring(email FROM'.+@(.*)$')FROM em;
6 records
substring
umich.edu
umich.edu
uiuc.edu
umuc.edu
apple.com
apple.com
Pulling out distinct domain names from email addresses, including @ character.
SELECTDISTINCT substring(email FROM'.+(@.*)$') FROM em;
4 records
substring
@umich.edu
@uiuc.edu
@apple.com
@umuc.edu
Pulling out distinct domain names from email addresses, counting the occurrences of each.
SELECT substring(email FROM'.+@(.*)$'), count(substring(email FROM'.+@(.*)$')) FROM em GROUPBY substring(email FROM'.+@(.*)$');
4 records
substring
count
apple.com
2
uiuc.edu
1
umich.edu
2
umuc.edu
1
Finding all records where email address includes domain name ‘umich.edu’-
SELECT*FROM em WHERE substring(email FROM'.+@(.*)$') ='umich.edu';
2 records
id
email
1
sev@umich.edu
2
coleen@umich.edu
REGEXP_MATCHES () - an array of matches within a column
CREATETABLE tw (id serial, primarykey(id), tweet text);
INSERTINTO tw (tweet) VALUES ('This is #SQL and #FUN stuff');
INSERTINTO tw (tweet) VALUES ('More people should learn #SQL FROM #PROF_ABC')
INSERTINTO tw (tweet) VALUES ('#PROF_ABC also teaches #PYTHON and #R');
select*from tw;
3 records
id
tweet
1
This is #SQL and #FUN stuff
2
More people should learn #SQL FROM #PROF_ABC
3
#PROF_ABC also teaches #PYTHON and #R
SELECTid, tweet FROM tw WHERE tweet ~ '#SQL';
2 records
id
tweet
1
This is #SQL and #FUN stuff
2
More people should learn #SQL FROM #PROF_ABC
SELECT regexp_matches(tweet,'#([A-Za-z0-9_]+)', 'g') FROM tw;
7 records
regexp_matches
{SQL}
{FUN}
{SQL}
{PROF_ABC}
{PROF_ABC}
{PYTHON}
{R}
SELECTDISTINCT regexp_matches(tweet,'#([A-Za-z0-9_]+)', 'g') FROM tw;
5 records
regexp_matches
{R}
{PROF_ABC}
{FUN}
{SQL}
{PYTHON}
SELECTid, regexp_matches(tweet,'#([A-Za-z0-9_]+)', 'g') FROM tw;
7 records
id
regexp_matches
1
{SQL}
1
{FUN}
2
{SQL}
2
{PROF_ABC}
3
{PROF_ABC}
3
{PYTHON}
3
{R}
select substring(email FROM'[0-9]+') FROM em WHERE email ~'[0-9]';