Text functions and query performance (a SQL demo)

Author

Beata Sirowy

The document showcases various text functions, evaluated on a randomly generated dataset, along with an analysis of query performance.

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DBI)       
library(RPostgres)       
library(dbplyr)


Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                     
  RPostgres::Postgres(),                     
  dbname = 'postgres',                     
  host = 'localhost',                     
  port = 5432,                     
  user = 'postgres',                     
  password = 'abcd'     
  )

3. An overview - text functions

a. Pattern matching: WHERE clause operators:

LIKE / ILIKE / NOT LIKE / NOT ILIKE - we can use wildcards with these expressions. ILIKE operator is case -insensitive. A query including ILIKE takes more time to execute than a query with LIKE operator.
SIMILAR TO / NOT SIMILAR TO (works like a ‘regular expression’ - presented further in this document)
= . > , <=, >= , BETWEEN IN

b. String functions

‘string_a_’ || ‘string_b_’ Result: string_a_string_b_
*string_a_’ || 45 Result: string_a_45
char_length(‘tom’) - number of characters in a string
lower(‘TOM’) , upper(‘tom’) - converts a string to a lower/upper case
overlay(‘Thxxxxs’ placing ‘oma’ from 3 for 4) Result: Thomas - replaces a substring
pg_relation_size(‘table_name’). pg_indexes_size(‘table_name’) - provides size (bits)
etc.

3. Generating random data

The following PostgreSQL functions are useful in data generation:

REPEAT () - generating long strings (horizontal)
GENERATE_SERIES() - generating multiple rows (vertical), like Python’s or R’s range
RANDOM() makes rows unique

Floating point 0 <= RANDOM() <= 1.0


SELECT RANDOM(), RANDOM()*100, TRUNC(RANDOM()*100);

1 records
random	?column?	trunc
0.5851044	50.44067	24


SELECT REPEAT('Lemon', 5);

1 records
repeat
LemonLemonLemonLemonLemon


SELECT GENERATE_SERIES(2, 5);

4 records
generate_series
2
3
4
5

Combining these function, we can quickly generate a data frame with lots of random data (e.g. for training purposes)


SELECT 'https://wikipedia.org' || 
TRUNC(RANDOM()*10000) || 
REPEAT('Pink', 2) || 
GENERATE_SERIES(1,6);

6 records
?column?
https://wikipedia.org9252PinkPink1
https://wikipedia.org4443PinkPink2
https://wikipedia.org2717PinkPink3
https://wikipedia.org3032PinkPink4
https://wikipedia.org9096PinkPink5
https://wikipedia.org9222PinkPink6

4. Creating a random dataset

a. creating a table


CREATE TABLE textfun (
content TEXT
);


CREATE INDEX textfun_b ON textfun (content);

The default type of index is b-tree.

In the following I use pg_relation_size(‘table_name’) and pg_indexes_size(‘table_name’) functions - their output is size (bits).


SELECT pg_relation_size ('textfun'), pg_indexes_size ('textfun');

1 records
pg_relation_size	pg_indexes_size
0	8192

Index - trades space for speed.

b. Inserting randomly generated data into the table

We first specify the way the data is generated:

The following statement randomly generates one of two strings: 

CASE WHEN (RANDOM() < 0.5) 
THEN 'https://wikipedia.org/one/'
ELSE 'https://wikipedia.org/two/'
END

This can be combined with: 

GENERATE_SERIES(10000, 20000)


INSERT INTO textfun (content)
SELECT(
CASE WHEN (RANDOM() < 0.5) 
THEN 'https://wikipedia.org/one/'
ELSE 'https://wikipedia.org/two/'
END
) ||
GENERATE_SERIES(10000, 20000);

This creates a table with 10000 rows, here is a preview:


SELECT pg_relation_size ('textfun'), pg_indexes_size ('textfun');

1 records
pg_relation_size	pg_indexes_size
606208	712704

The index is growing faster than the actual data content - as the table includes only one column (with TEXT variable type that takes a lot of space) and the column content is completely replicated in the index.

With ‘real’ tables including multiple columns, the size of the table would exceed the size of the index.

c. Using text functions on the generated table


SELECT content 
FROM textfun
WHERE content LIKE '%1505%';

Displaying records 1 - 10
content
https://wikipedia.org/one/11505
https://wikipedia.org/two/15050
https://wikipedia.org/two/15051
https://wikipedia.org/one/15052
https://wikipedia.org/two/15053
https://wikipedia.org/one/15054
https://wikipedia.org/two/15055
https://wikipedia.org/one/15056
https://wikipedia.org/one/15057
https://wikipedia.org/one/15058

UPPER() and LOWER() function return the content written in the upper / lower case


SELECT UPPER(content) 
FROM textfun 
WHERE content LIKE '%1505%';

Displaying records 1 - 10
upper
HTTPS://WIKIPEDIA.ORG/ONE/11505
HTTPS://WIKIPEDIA.ORG/TWO/15050
HTTPS://WIKIPEDIA.ORG/TWO/15051
HTTPS://WIKIPEDIA.ORG/ONE/15052
HTTPS://WIKIPEDIA.ORG/TWO/15053
HTTPS://WIKIPEDIA.ORG/ONE/15054
HTTPS://WIKIPEDIA.ORG/TWO/15055
HTTPS://WIKIPEDIA.ORG/ONE/15056
HTTPS://WIKIPEDIA.ORG/ONE/15057
HTTPS://WIKIPEDIA.ORG/ONE/15058

RIGHT() and LEFT() functions return a specified number of characters on the right/ left end of the string.


SELECT RIGHT(content, 4) 
FROM textfun 
WHERE content LIKE '%1505%';

Displaying records 1 - 10
right
1505
5050
5051
5052
5053
5054
5055
5056
5057
5058


SELECT LEFT(content, 4) 
FROM textfun 
WHERE content LIKE '%1505%';

Displaying records 1 - 10
left
http
http
http
http
http
http
http
http
http
http

The following query selects a substring from strings in the column *content* . In this case, the substring is 7 characters long and starts from a second string.


SELECT substr(content, 2, 7) 
FROM textfun
LIMIT 5;

5 records
substr
ttps://
ttps://
ttps://
ttps://
ttps://

The following query splits content with a specified divider and returns the content of n-th division. In this case, the divider is “/” and we want to get the string in rhe fourth part - the word one or two.


SELECT content, 
split_part(content, '/', 4) FROM textfun;

Displaying records 1 - 10
content	split_part
https://wikipedia.org/two/10000	two
https://wikipedia.org/two/10001	two
https://wikipedia.org/two/10002	two
https://wikipedia.org/two/10003	two
https://wikipedia.org/two/10004	two
https://wikipedia.org/one/10005	one
https://wikipedia.org/one/10006	one
https://wikipedia.org/one/10007	one
https://wikipedia.org/one/10008	one
https://wikipedia.org/two/10009	two

The following query - a “character translation map” - replaces the exact characters given in the first string with the characters given in the second string,


SELECT translate (content, 'tpw:', 'TPx!'), content
FROM textfun
LIMIT 5

5 records
translate	content
hTTPs!//xikiPedia.org/Txo/10000	https://wikipedia.org/two/10000
hTTPs!//xikiPedia.org/Txo/10001	https://wikipedia.org/two/10001
hTTPs!//xikiPedia.org/Txo/10002	https://wikipedia.org/two/10002
hTTPs!//xikiPedia.org/Txo/10003	https://wikipedia.org/two/10003
hTTPs!//xikiPedia.org/Txo/10004	https://wikipedia.org/two/10004

4. B-tree index performance

We use EXPLAIN ANALYSE command to check the performance / compare performance of different queries.


EXPLAIN ANALYSE 
SELECT content 
FROM textfun 
WHERE content LIKE '%153%';

5 records
QUERY PLAN
Seq Scan on textfun (cost=0.00..199.80 rows=403 width=32) (actual time=0.034..1.139 rows=120 loops=1)
Filter: (content ~~ ‘%153%’::text)
Rows Removed by Filter: 9881
Planning Time: 0.055 ms
Execution Time: 1.148 ms


EXPLAIN ANALYSE
SELECT content 
FROM textfun
WHERE content LIKE '_wo';

5 records
QUERY PLAN
Seq Scan on textfun (cost=0.00..199.80 rows=403 width=32) (actual time=1.074..1.074 rows=0 loops=1)
Filter: (content ~~ ’_wo’::text)
Rows Removed by Filter: 10001
Planning Time: 0.065 ms
Execution Time: 1.085 ms


EXPLAIN ANALYSE 
SELECT content 
FROM textfun 
WHERE content ILIKE '%153%';

5 records
QUERY PLAN
Seq Scan on textfun (cost=0.00..199.80 rows=403 width=32) (actual time=0.274..16.586 rows=120 loops=1)
Filter: (content ~~* ‘%153%’::text)
Rows Removed by Filter: 9881
Planning Time: 0.061 ms
Execution Time: 16.596 ms

EXPLAIN ANALYSE
SELECT content 
FROM textfun 
WHERE content LIKE 'one';

6 records
QUERY PLAN
Bitmap Heap Scan on textfun (cost=4.67..75.61 rows=50 width=32) (actual time=0.022..0.022 rows=0 loops=1)
Filter: (content ~~ ‘one’::text)
-> Bitmap Index Scan on textfun_b (cost=0.00..4.66 rows=50 width=0) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (content = ‘one’::text)
Planning Time: 0.231 ms
Execution Time: 0.038 ms


EXPLAIN ANALYZE
SELECT content FROM textfun
WHERE content IN ('one', 'two');

6 records
QUERY PLAN
Bitmap Heap Scan on textfun (cost=9.35..88.46 rows=101 width=32) (actual time=0.018..0.019 rows=0 loops=1)
Recheck Cond: (content = ANY (‘{one,two}’::text[]))
-> Bitmap Index Scan on textfun_b (cost=0.00..9.32 rows=101 width=0) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (content = ANY (‘{one,two}’::text[]))
Planning Time: 0.174 ms
Execution Time: 0.027 ms

As we can see, the last two scans are “Index only scans” , where index condition is automatically specified. The other ones are sequential ones (i.e. all records needs to be checked), this makes them far less efficient. time-wise
The query including ILIKE operator took the most time to execute and the difference in performance is rather substantial - even with our very simple data set.
B-tree indexes perform best with sorting, exact matches and prefix look up.

A way to speed up a sequential scan is to limit the result set with LIMIT clause - when relevant. Se below:


EXPLAIN ANALYSE 
SELECT content 
FROM textfun 
WHERE content ILIKE '%153%'
LIMIT 10;

6 records
QUERY PLAN
Limit (cost=0.00..4.96 rows=10 width=32) (actual time=0.276..2.609 rows=10 loops=1)
-> Seq Scan on textfun (cost=0.00..199.80 rows=403 width=32) (actual time=0.275..2.608 rows=10 loops=1)
Filter: (content ~~* ‘%153%’::text)
Rows Removed by Filter: 1528
Planning Time: 0.055 ms
Execution Time: 2.616 ms

5. Hashing functions in indexes

Using PostgreSQL in-built hashing functions can improve the performance of indexing. MD5 hashing function is usually sufficient in situations where we check uniqueness (it is simpler than SHA256).

In the following we create a MD5 index on content column of textfun table.


CREATE UNIQUE INDEX textfun_b_MD5 ON textfun (md5(content));


EXPLAIN ANALYZE
SELECT * FROM textfun
WHERE content = 'one%';

6 records
QUERY PLAN
Bitmap Heap Scan on textfun (cost=4.67..75.61 rows=50 width=32) (actual time=0.018..0.018 rows=0 loops=1)
Recheck Cond: (content = ‘one%’::text)
-> Bitmap Index Scan on textfun_b (cost=0.00..4.66 rows=50 width=0) (actual time=0.010..0.010 rows=0 loops=1)
Index Cond: (content = ‘one%’::text)
Planning Time: 1.508 ms
Execution Time: 0.028 ms


EXPLAIN ANALYZE
SELECT * FROM textfun
WHERE md5(content) = md5('one');

4 records
QUERY PLAN
Index Scan using textfun_b_md5 on textfun (cost=0.29..8.30 rows=1 width=32) (actual time=0.044..0.044 rows=0 loops=1)
Index Cond: (md5(content) = ‘f97c5d29941bfb1b2fdab0874906ab82’::text)
Planning Time: 0.094 ms
Execution Time: 0.051 ms

We can see that in our simple database the execution actually took longer to perform the search with MD5 hash function than with ordinary index scan using default b-tree index- but in more complex databases it is often a way to improve the performance. The overall time of the query (planning + execution) was shorter with MD5 index search.

Hashing index can be also created with the following command (vithout specifying ‘USING HASH’ the default method is b-tree):

CREATE INDEX textfun_hash 
ON textfun USING HASH (content);

Hashing with a separate column

Another option would be to add a column including just the outputs of a MD5 hashing function on the content column.


ALTER TABLE textfun 
ADD MD5_content uuid UNIQUE;


UPDATE textfun 
SET MD5_content = md5(content)::uuid;


SELECT * FROM textfun LIMIT 5;

5 records
content	md5_content
https://wikipedia.org/two/10000	0cc98221-828b-cffd-ec83-e1c6c91ce597
https://wikipedia.org/two/10001	bd49a077-c574-fe32-4743-adc16adcc034
https://wikipedia.org/two/10002	c96218e8-7ebf-145c-8e04-b695258a876a
https://wikipedia.org/two/10003	80f621ee-58e4-5a16-628b-43b7d7968234
https://wikipedia.org/two/10004	4e343795-6e28-893f-f1d7-565b05eebbd2

We can now perform search on the column including hashed data (MD5_content)


EXPLAIN ANALYZE
SELECT * FROM textfun
WHERE MD5_content = md5('one')::uuid;

4 records
QUERY PLAN
Index Scan using textfun_md5_content_key on textfun (cost=0.29..8.30 rows=1 width=48) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (md5_content = ‘f97c5d29-941b-fb1b-2fda-b0874906ab82’::uuid)
Planning Time: 0.163 ms
Execution Time: 0.012 ms

Summing up,

No index: slowest look-up, but we save space;
B-tree indexes - usually preferred, help with exact look-up, predix look-up, sorting, >.<, range, sorting;
Hash indexes - smaller, help only on exact look-up (quicker than b-tree indexes).

6. Pattern matching: regular expressions (REGEXP)

Beyond the text functions already discussed, we can also use regular expression for pattern matching purposes.

Regular expression: a method of specifying patterns of characters, numbers, and symbols, which can be used to identify, replace or split data.

They can help with tasks like data cleaning, validation, and extraction, which can be difficult to accomplish with standard SQL functions.

Not every database supports them (e.g. supported in PostgreSQL, MySQL, Oracle). Should be used with caution - may affect database performance.

Three main types:

REGEXP_LIKE: used to match the input string with the regular expression pattern

The query returns e-mail addresses that match the regular expression.
XP_REPLACE: used to replace the string matching a regular expression pattern

The query cleans up a string that should only contain numbers
REGEXP_SUBSTR: used to extract a substring from a string that matches a regular expression pattern.

The query extracts URL from the text - the following expression matches URL: https?://[^ ]+

In this statement, ? - denotes repetition of the previous item zero or more times; + denotes repetition of the previous item zero or more times, ^ matches the start of a string, […] specifies a character class.