The document showcases various text functions, evaluated on a randomly generated dataset, along with an analysis of query performance.
The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.
1. Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI) library(RPostgres) library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect( RPostgres::Postgres(), dbname ='postgres', host ='localhost', port =5432, user ='postgres', password ='abcd' )
3. An overview - text functions
a. Pattern matching: WHERE clause operators:
LIKE / ILIKE / NOT LIKE / NOT ILIKE - we can use wildcards with these expressions. ILIKE operator is case -insensitive. A query including ILIKE takes more time to execute than a query with LIKE operator.
SIMILAR TO / NOT SIMILAR TO (works like a ‘regular expression’ - presented further in this document)
b. Inserting randomly generated data into the table
We first specify the way the data is generated:
Thefollowing statement randomly generates one of two strings: CASEWHEN (RANDOM() <0.5) THEN'https://wikipedia.org/one/'ELSE'https://wikipedia.org/two/'ENDThis can be combined with: GENERATE_SERIES(10000, 20000)
The index is growing faster than the actual data content - as the table includes only one column (with TEXT variable type that takes a lot of space) and the column content is completely replicated in the index.
With ‘real’ tables including multiple columns, the size of the table would exceed the size of the index.
c. Using text functions on the generated table
SELECT content FROM textfunWHERE content LIKE'%1505%';
Displaying records 1 - 10
content
https://wikipedia.org/one/11505
https://wikipedia.org/two/15050
https://wikipedia.org/two/15051
https://wikipedia.org/one/15052
https://wikipedia.org/two/15053
https://wikipedia.org/one/15054
https://wikipedia.org/two/15055
https://wikipedia.org/one/15056
https://wikipedia.org/one/15057
https://wikipedia.org/one/15058
UPPER() and LOWER() function return the content written in the upper / lower case
SELECTUPPER(content) FROM textfun WHERE content LIKE'%1505%';
Displaying records 1 - 10
upper
HTTPS://WIKIPEDIA.ORG/ONE/11505
HTTPS://WIKIPEDIA.ORG/TWO/15050
HTTPS://WIKIPEDIA.ORG/TWO/15051
HTTPS://WIKIPEDIA.ORG/ONE/15052
HTTPS://WIKIPEDIA.ORG/TWO/15053
HTTPS://WIKIPEDIA.ORG/ONE/15054
HTTPS://WIKIPEDIA.ORG/TWO/15055
HTTPS://WIKIPEDIA.ORG/ONE/15056
HTTPS://WIKIPEDIA.ORG/ONE/15057
HTTPS://WIKIPEDIA.ORG/ONE/15058
RIGHT() and LEFT() functions return a specified number of characters on the right/ left end of the string.
SELECTRIGHT(content, 4) FROM textfun WHERE content LIKE'%1505%';
Displaying records 1 - 10
right
1505
5050
5051
5052
5053
5054
5055
5056
5057
5058
SELECTLEFT(content, 4) FROM textfun WHERE content LIKE'%1505%';
Displaying records 1 - 10
left
http
http
http
http
http
http
http
http
http
http
The following query selects a substring from strings in the column *content* . In this case, the substring is 7 characters long and starts from a second string.
SELECTsubstr(content, 2, 7) FROM textfunLIMIT5;
5 records
substr
ttps://
ttps://
ttps://
ttps://
ttps://
The following query splits content with a specified divider and returns the content of n-th division. In this case, the divider is “/” and we want to get the string in rhe fourth part - the word one or two.
SELECT content, split_part(content, '/', 4) FROM textfun;
Displaying records 1 - 10
content
split_part
https://wikipedia.org/two/10000
two
https://wikipedia.org/two/10001
two
https://wikipedia.org/two/10002
two
https://wikipedia.org/two/10003
two
https://wikipedia.org/two/10004
two
https://wikipedia.org/one/10005
one
https://wikipedia.org/one/10006
one
https://wikipedia.org/one/10007
one
https://wikipedia.org/one/10008
one
https://wikipedia.org/two/10009
two
The following query - a “character translation map” - replaces the exact characters given in the first string with the characters given in the second string,
Recheck Cond: (content = ANY (‘{one,two}’::text[]))
-> Bitmap Index Scan on textfun_b (cost=0.00..9.32 rows=101 width=0) (actual time=0.011..0.011 rows=0 loops=1)
Index Cond: (content = ANY (‘{one,two}’::text[]))
Planning Time: 0.174 ms
Execution Time: 0.027 ms
As we can see, the last two scans are “Index only scans” , where index condition is automatically specified. The other ones are sequential ones (i.e. all records needs to be checked), this makes them far less efficient. time-wise
The query including ILIKE operator took the most time to execute and the difference in performance is rather substantial - even with our very simple data set.
B-tree indexes perform best with sorting, exact matches and prefix look up.
A way to speed up a sequential scan is to limit the result set with LIMIT clause - when relevant. Se below:
EXPLAIN ANALYSE SELECT content FROM textfun WHERE content ILIKE '%153%'LIMIT10;
Using PostgreSQL in-built hashing functions can improve the performance of indexing. MD5 hashing function is usually sufficient in situations where we check uniqueness (it is simpler than SHA256).
In the following we create a MD5 index on content column of textfun table.
CREATEUNIQUEINDEX textfun_b_MD5 ON textfun (md5(content));
Index Scan using textfun_b_md5 on textfun (cost=0.29..8.30 rows=1 width=32) (actual time=0.044..0.044 rows=0 loops=1)
Index Cond: (md5(content) = ‘f97c5d29941bfb1b2fdab0874906ab82’::text)
Planning Time: 0.094 ms
Execution Time: 0.051 ms
We can see that in our simple database the execution actually took longer to perform the search with MD5 hash function than with ordinary index scan using default b-tree index- but in more complex databases it is often a way to improve the performance. The overall time of the query (planning + execution) was shorter with MD5 index search.
Hashing index can be also created with the following command (vithout specifying ‘USING HASH’ the default method is b-tree):
CREATEINDEX textfun_hash ON textfun USINGHASH (content);
Hashing with a separate column
Another option would be to add a column including just the outputs of a MD5 hashing function on the content column.
ALTERTABLE textfun ADD MD5_content uuid UNIQUE;
UPDATE textfun SET MD5_content = md5(content)::uuid;
SELECT*FROM textfun LIMIT5;
5 records
content
md5_content
https://wikipedia.org/two/10000
0cc98221-828b-cffd-ec83-e1c6c91ce597
https://wikipedia.org/two/10001
bd49a077-c574-fe32-4743-adc16adcc034
https://wikipedia.org/two/10002
c96218e8-7ebf-145c-8e04-b695258a876a
https://wikipedia.org/two/10003
80f621ee-58e4-5a16-628b-43b7d7968234
https://wikipedia.org/two/10004
4e343795-6e28-893f-f1d7-565b05eebbd2
We can now perform search on the column including hashed data (MD5_content)
Index Scan using textfun_md5_content_key on textfun (cost=0.29..8.30 rows=1 width=48) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: (md5_content = ‘f97c5d29-941b-fb1b-2fda-b0874906ab82’::uuid)
Planning Time: 0.163 ms
Execution Time: 0.012 ms
Summing up,
No index: slowest look-up, but we save space;
B-tree indexes - usually preferred, help with exact look-up, predix look-up, sorting, >.<, range, sorting;
Hash indexes - smaller, help only on exact look-up (quicker than b-tree indexes).
6. Pattern matching: regular expressions (REGEXP)
Beyond the text functions already discussed, we can also use regular expression for pattern matching purposes.
Regular expression: a method of specifying patterns of characters, numbers, and symbols, which can be used to identify, replace or split data.
They can help with tasks like data cleaning, validation, and extraction, which can be difficult to accomplish with standard SQL functions.
Not every database supports them (e.g. supported in PostgreSQL, MySQL, Oracle). Should be used with caution - may affect database performance.
Three main types:
REGEXP_LIKE: used to match the input string with the regular expression pattern
The query returns e-mail addresses that match the regular expression.
XP_REPLACE: used to replace the string matching a regular expression pattern
The query cleans up a string that should only contain numbers
REGEXP_SUBSTR: used to extract a substring from a string that matches a regular expression pattern.
The query extracts URL from the text - the following expression matches URL: https?://[^ ]+
In this statement, ? - denotes repetition of the previous item zero or more times; + denotes repetition of the previous item zero or more times, ^ matches the start of a string, […] specifies a character class.