Refular Expressions (a SQL demo)

Author

Beata Sirowy

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)        
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)        
library(RPostgres)        
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                        
  RPostgres::Postgres(),                        
  dbname = 'postgres',                        
  host = 'localhost',                        
  port = 5432,                        
  user = 'postgres',                        
  password = 'abcd'        )

3. Regular Expressions: an overview

  • a text-based programming language;

  • used for text matching and parsing: specifying patterns of characters, numbers, and symbols, which can be used to identify, replace or split data;

  • can be used in many programming languages;

  • marker characters are keywords - very compact (developed in 1960s/70s)

    Regular Expression Quick Guide
    
    ^        Matches the beginning of a line
    
    $        Matches the end of the line
    
    .        Matches any character
    
    \s       Matches whitespace
    
    \S       Matches any non-whitespace character
    
    *        Repeats a character zero or more times
    
    *?       Repeats a character zero or more times 
             (non-greedy)
    
    +        Repeats a character one or more times
    
    +?       Repeats a character one or more times 
             (non-greedy)
    
    [aeiou]  Matches a single character in the listed set
    
    [^XYZ]   Matches a single character not in the listed set
    
    [a-z0-9] The set of characters can include a range
    
    (        Indicates where string extraction is to start
    
    )        Indicates where string extraction is to end
    
    
     WHERE CLAUSE OPERATORS
    
     ~¨  matches 
    
     ~*  matches case sensitive  
    
     !~  does not match
    
     !~* does not match case insensitive
    

~ is different than LIKE. ~‘DOG’ is equivalent to LIKE ‘%DOG%’

We will use the previously created textfun table to test some expressions.

 

Finding strings including any of the letters in in the range a - c


SELECT content FROM textfun
WHERE content ~'[a-c]'
LIMIT 5;
5 records
content
https://wikipedia.org/two/10000
https://wikipedia.org/two/10001
https://wikipedia.org/two/10002
https://wikipedia.org/two/10003
https://wikipedia.org/two/10004


Finding strings including the word ‘one’


SELECT content FROM textfun
WHERE content ~'one'
LIMIT 5;
5 records
content
https://wikipedia.org/one/10005
https://wikipedia.org/one/10006
https://wikipedia.org/one/10007
https://wikipedia.org/one/10008
https://wikipedia.org/one/10010

 

Finding strings not including the word ‘one’


SELECT content FROM textfun
WHERE content !~'one'
LIMIT 5;
5 records
content
https://wikipedia.org/two/10000
https://wikipedia.org/two/10001
https://wikipedia.org/two/10002
https://wikipedia.org/two/10003
https://wikipedia.org/two/10004

 

Finding strings ending with ‘124’


SELECT content FROM textfun
WHERE content ~'124$'
LIMIT 5;
5 records
content
https://wikipedia.org/one/10124
https://wikipedia.org/two/11124
https://wikipedia.org/one/12124
https://wikipedia.org/two/13124
https://wikipedia.org/one/14124

 

Finding strings beginning with ‘h’


SELECT content FROM textfun
WHERE content ~ '^(h)'
LIMIT 5;
5 records
content
https://wikipedia.org/two/10000
https://wikipedia.org/two/10001
https://wikipedia.org/two/10002
https://wikipedia.org/two/10003
https://wikipedia.org/two/10004

 

Finding strings including a digit from range 2-3 immediately followed by a digit from the range 8-9.


SELECT content FROM textfun
WHERE content ~ '[2-3][8-9]'
LIMIT 10;
Displaying records 1 - 10
content
https://wikipedia.org/one/10028
https://wikipedia.org/two/10029
https://wikipedia.org/one/10038
https://wikipedia.org/one/10039
https://wikipedia.org/one/10128
https://wikipedia.org/two/10129
https://wikipedia.org/two/10138
https://wikipedia.org/two/10139
https://wikipedia.org/two/10228
https://wikipedia.org/one/10229

 

Finding strings including a digit from the range 2-3, separated by any number of characters, followed by a digit from the range 8-9.


SELECT content FROM textfun
WHERE content ~ '[2-3].*[8-9]'
LIMIT 10;
Displaying records 1 - 10
content
https://wikipedia.org/one/10028
https://wikipedia.org/two/10029
https://wikipedia.org/one/10038
https://wikipedia.org/one/10039
https://wikipedia.org/one/10128
https://wikipedia.org/two/10129
https://wikipedia.org/two/10138
https://wikipedia.org/two/10139
https://wikipedia.org/two/10208
https://wikipedia.org/two/10209

  

The following table will be used in further examples


CREATE TABLE em (
id serial, 
primary key(id), 
email text UNIQUE);

INSERT INTO em (email) VALUES ('sev@umich.edu');

INSERT INTO em (email) VALUES ('coleen@umich.edu');

INSERT INTO em (email) VALUES ('sally@uiuc.edu');

INSERT INTO em (email) VALUES ('ted79@umuc.edu');

INSERT INTO em (email) VALUES ('glenn1@apple.com');

INSERT INTO em (email) VALUES ('nbody@apple.com');

select * from em;
6 records
id email
1 sev@umich.edu
2 coleen@umich.edu
3 sally@uiuc.edu
4 ted79@umuc.edu
5 glenn1@apple.com
6 nbody@apple.com

Selecting strings matching specific criteria


SELECT email 
FROM em 
WHERE email ~ 'umich';
2 records
email
sev@umich.edu
coleen@umich.edu

SELECT email 
FROM em 
WHERE email ~ '^c';
1 records
email
coleen@umich.edu

SELECT email 
FROM em 
WHERE email ~ 'edu$';
4 records
email
sev@umich.edu
coleen@umich.edu
sally@uiuc.edu
ted79@umuc.edu

SELECT email 
FROM em 
WHERE email ~ '^[gnt]';
3 records
email
ted79@umuc.edu
glenn1@apple.com
nbody@apple.com

SELECT email 
FROM em 
WHERE email ~ '[0-9]';
2 records
email
ted79@umuc.edu
glenn1@apple.com

SELECT email 
FROM em 
WHERE email ~ '[0-9][0-9]';
1 records
email
ted79@umuc.edu

4. Single and multiple matches

  • SUBSTRING() gets the first match in the text column

  • REGEXP_ MATCHES() gets an array of matches

SUBSTRING() - the first match in a column

 

Pulling out digits from email column (one or more contiguous digits). The last statement limits the result list - columns with N/A aren’t displayed.


SELECT substring(email FROM '[0-9]+') 
FROM em
WHERE email ~ '[0-9]';
2 records
substring
79
1

 

Pulling out domain names from email addresses.



SELECT substring(email FROM '.+@(.*)$')
FROM em;
6 records
substring
umich.edu
umich.edu
uiuc.edu
umuc.edu
apple.com
apple.com

 

Pulling out distinct domain names from email addresses, including @ character.


SELECT DISTINCT substring(email FROM '.+(@.*)$') 
FROM em;
4 records
substring
@umich.edu
@uiuc.edu
@apple.com
@umuc.edu

 

Pulling out distinct domain names from email addresses, counting the occurrences of each.


SELECT substring(email FROM '.+@(.*)$'), 
    count(substring(email FROM '.+@(.*)$')) 
FROM em 
GROUP BY substring(email FROM '.+@(.*)$');
4 records
substring count
apple.com 2
uiuc.edu 1
umich.edu 2
umuc.edu 1


Finding all records where email address includes domain name ‘umich.edu’-


SELECT * FROM em 
WHERE substring(email 
FROM '.+@(.*)$') = 'umich.edu';
2 records
id email
1 sev@umich.edu
2 coleen@umich.edu

 

REGEXP_MATCHES () - an array of matches within a column


CREATE TABLE tw (id serial, primary key(id), tweet text);

INSERT INTO tw (tweet) 
VALUES ('This is #SQL and #FUN stuff');

INSERT INTO tw (tweet) 
VALUES ('More people should learn #SQL FROM #PROF_ABC')

INSERT INTO tw (tweet) 
VALUES ('#PROF_ABC also teaches #PYTHON and #R');

select * from tw;
3 records
id tweet
1 This is #SQL and #FUN stuff
2 More people should learn #SQL FROM #PROF_ABC
3 #PROF_ABC also teaches #PYTHON and #R

SELECT id, tweet FROM tw WHERE tweet ~ '#SQL';
2 records
id tweet
1 This is #SQL and #FUN stuff
2 More people should learn #SQL FROM #PROF_ABC

SELECT regexp_matches(tweet,'#([A-Za-z0-9_]+)', 'g') FROM tw;
7 records
regexp_matches
{SQL}
{FUN}
{SQL}
{PROF_ABC}
{PROF_ABC}
{PYTHON}
{R}


SELECT DISTINCT regexp_matches(tweet,'#([A-Za-z0-9_]+)', 'g') FROM tw;
5 records
regexp_matches
{R}
{PROF_ABC}
{FUN}
{SQL}
{PYTHON}

SELECT id, regexp_matches(tweet,'#([A-Za-z0-9_]+)', 'g') FROM tw;
7 records
id regexp_matches
1 {SQL}
1 {FUN}
2 {SQL}
2 {PROF_ABC}
3 {PROF_ABC}
3 {PYTHON}
3 {R}

select substring(email FROM '[0-9]+') FROM em WHERE email ~'[0-9]';
2 records
substring
79
1