Building a natural language index (a SQL demo)

Author

Beata Sirowy

I build a natural language index on a document - first by hand and then using PostgreSQL inbuilt features.

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)         
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)         
library(RPostgres)         
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                           
  RPostgres::Postgres(),                           
  dbname = 'postgres',                          
  host = 'localhost',                          
  port = 5432,                           
  user = 'postgres',                           
  password = 'abcd'           )

3. Building a natural language index “manually”

Taking advantage of some features of natural language we can make indexes quicker and more efficient.

To optimize the performance of an index:

  • we don’t index ‘stop words’ (such as and, or, but, why - words that don’t have meaning on their own);

  • we ignore capitalization (upper case) / lower case distinction);

  • we reduce variations of words with equivalent meanings down to a single “stem word”.

Stemming and stop words depend on which language the document is stored in - the rules for new languages can be installed in PostgreSQL:


SELECT cfgname FROM pg_ts_config;
Displaying records 1 - 10
cfgname
simple
arabic
armenian
basque
catalan
danish
dutch
english
finnish
french

Defining the key steps in a “manual” building of a natural language index

  • creating a table of stop words used in the document;

  • creating a table of stems (mapping the words in the text document to their stems, removing capitalization);

  • creating the index table.

We create a table to be used in the demo


CREATE TABLE docs (
id SERIAL PRIMARY KEY,
doc TEXT
);

INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from Prof_Chuck'),
('Prof_Chuck also teaches Python and also SQL');

select * from docs;
3 records
id doc
1 This is SQL and Python and other fun teaching stuff
2 More people should learn SQL from Prof_Chuck
3 Prof_Chuck also teaches Python and also SQL

Breaking the document column into one row per word + primary key


SELECT id, new.keyword AS keyword
FROM docs AS D, 
unnest(string_to_array(D.doc, ' ')) new(keyword)
ORDER BY id;
Displaying records 1 - 10
id keyword
1 This
1 is
1 SQL
1 and
1 Python
1 and
1 other
1 fun
1 teaching
1 stuff

Discarding duplicate rows


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;
Displaying records 1 - 10
id keyword
1 and
1 fun
1 is
1 other
1 Python
1 SQL
1 stuff
1 teaching
1 This
2 from

Converting all words to lower case


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
ORDER BY id;
Displaying records 1 - 10
id keyword
1 and
1 fun
1 is
1 other
1 python
1 sql
1 stuff
1 teaching
1 this
2 from

Creating table with our GIS index


CREATE TABLE docs_gin (
  keyword TEXT,
  doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE
  );

Creating table with ‘stop words’


CREATE TABLE stop_words (word TEXT unique);

INSERT INTO stop_words (word) VALUES ('is'), ('this'), ('and');

SELECT * FROM stop_words;
3 records
word
is
this
and

Eliminating the ‘stop words’ from the query


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
WHERE s.keyword NOT IN (SELECT word FROM stop_words)
ORDER BY id;
Displaying records 1 - 10
id keyword
1 fun
1 other
1 python
1 sql
1 stuff
1 teaching
2 from
2 learn
2 more
2 people

Inserting the stop-word free list into the GIN table


INSERT INTO docs_gin (doc_id, keyword)
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
WHERE s.keyword NOT IN (SELECT word FROM stop_words)
ORDER BY id;


SELECT * FROM docs_gin;
Displaying records 1 - 10
keyword doc_id
fun 1
other 1
python 1
sql 1
stuff 1
teaching 1
from 2
learn 2
more 2
people 2

Queries

 

One word query


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = lower('SQL');
3 records
doc
More people should learn SQL from Prof_Chuck
Prof_Chuck also teaches Python and also SQL
This is SQL and Python and other fun teaching stuff

 

A multi word query


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword =
  ANY(string_to_array(lower('Meet fun people'), ' '));
2 records
doc
More people should learn SQL from Prof_Chuck
This is SQL and Python and other fun teaching stuff

 

A stop-word query - does not return any results


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = lower('and');
0 records
doc

Creating a table with stems of words


CREATE TABLE docs_stem (word TEXT, stem TEXT);

INSERT INTO docs_stem (word, stem) VALUES
('teaching', 'teach'), ('teaches', 'teach');

SELECT * from docs_stem;
2 records
word stem
teaching teach
teaches teach

Moving the initial word extraction into a sub-query


SELECT id, keyword FROM (
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS X;
Displaying records 1 - 10
id keyword
2 from
2 learn
1 other
3 sql
1 python
3 python
1 and
1 stuff
2 more
1 this

 

Adding the stems as third column (may or may not exist)


SELECT id, keyword, stem FROM (
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word
ORDER BY keyword DESC;
Displaying records 1 - 10
id keyword stem
1 this NA
1 teaching teach
3 teaches teach
1 stuff NA
1 sql NA
3 sql NA
2 sql NA
2 should NA
1 python NA
3 python NA

 

Defining the course of action: if the stem is there, use it


SELECT id,
CASE WHEN stem IS NOT NULL THEN stem 
ELSE keyword 
END AS awesome,keyword, stem
FROM (
SELECT DISTINCT id, lower(s.keyword) AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word
ORDER BY keyword DESC;
Displaying records 1 - 10
id awesome keyword stem
1 this this NA
1 teach teaching teach
3 teach teaches teach
1 stuff stuff NA
1 sql sql NA
3 sql sql NA
2 sql sql NA
2 should should NA
1 python python NA
3 python python NA

 

Null Coalescing - returns the first non-null in a list, the command will be further integrated in our query


SELECT COALESCE(NULL, NULL, 'Prof_Chuck');
1 records
coalesce
Prof_Chuck

SELECT COALESCE('Prof_Chuck', NULL, 'SQL');
1 records
coalesce
Prof_Chuck

Defining the course of action:: if the stem is there, use it instead of the keyword


SELECT id, COALESCE(stem, keyword) AS keyword
FROM (
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word
ORDER BY keyword DESC;
Displaying records 1 - 10
id keyword
1 this
1 teach
3 teach
1 stuff
1 sql
3 sql
2 sql
2 should
1 python
3 python

Inserting the stems in the index table docs_gin


DELETE FROM docs_gin;

INSERT INTO docs_gin (doc_id, keyword)
SELECT id, COALESCE(stem, keyword)
FROM (
  SELECT DISTINCT id, s.keyword AS keyword
  FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word;

SELECT * FROM docs_gin;
Displaying records 1 - 10
keyword doc_id
also 3
and 1
and 3
from 2
fun 1
is 1
learn 2
more 2
other 1
people 2

 

Lets do stop words and stems


DELETE FROM docs_gin;

INSERT INTO docs_gin (doc_id, keyword)
SELECT id, COALESCE(stem, keyword)
FROM (
  SELECT DISTINCT id, s.keyword AS keyword
  FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
  WHERE s.keyword NOT IN (SELECT word FROM stop_words)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word;

SELECT * FROM docs_gin
LIMIT 20;
Displaying records 1 - 10
keyword doc_id
also 3
from 2
fun 1
learn 2
more 2
other 1
people 2
prof_chuck 2
prof_chuck 3
python 3

Examples of queries


SELECT COALESCE((SELECT stem FROM docs_stem WHERE word=lower('SQL')), lower('SQL'));
1 records
coalesce
sql

Handling the stems in queries. Use the keyword if there is no stem


SELECT DISTINCT id, doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = COALESCE((SELECT stem FROM docs_stem WHERE word=lower('SQL')), lower('SQL'));
3 records
id doc
1 This is SQL and Python and other fun teaching stuff
2 More people should learn SQL from Prof_Chuck
3 Prof_Chuck also teaches Python and also SQL

 

Prefer the stem over the actual keyword. The technical term for converting search term to their stems is called “conflation


SELECT COALESCE((SELECT stem FROM docs_stem WHERE word=lower('teaching')), lower('teaching'));
1 records
coalesce
teach

SELECT DISTINCT id, doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = COALESCE((SELECT stem FROM docs_stem WHERE word=lower('teaching')), lower('teaching'));
2 records
id doc
1 This is SQL and Python and other fun teaching stuff
3 Prof_Chuck also teaches Python and also SQL

4. Building a natural language index with PostgreSQL built-in features (much easier and more efficient)

  • ts_vector is an special “array” of stemmed words, passed throug a stop-word filter + positions within the document. We need to specify the language of the document.

SELECT to_tsvector('english', 'This is SQL and Python and other fun teaching stuff');
1 records
to_tsvector
‘fun’:8 ‘python’:5 ‘sql’:3 ‘stuff’:10 ‘teach’:9

SELECT to_tsvector('english', 'More people should learn SQL from Prof_Chuck');
1 records
to_tsvector
‘chuck’:8 ‘learn’:4 ‘peopl’:2 ‘prof’:7 ‘sql’:5

SELECT to_tsvector('english', 'Prof_Chuck also teaches Python and also SQL');
1 records
to_tsvector
‘also’:3,7 ‘chuck’:2 ‘prof’:1 ‘python’:5 ‘sql’:8 ‘teach’:4

 

  • ts_query is an “array” of lower case, stemmed words with stop words removed plus logical operators & = and, ! = not, | = or. We need to specify the language of the document.

SELECT to_tsquery('english', 'teaching');
1 records
to_tsquery
‘teach’

SELECT to_tsquery('english', 'teaches');
1 records
to_tsquery
‘teach’

SELECT to_tsquery('english', 'and');
1 records
to_tsquery

SELECT to_tsquery('german', 'SQL');
1 records
to_tsquery
‘sql’

SELECT to_tsquery('english', 'Teach | teaches | teaching | and | the | if');
1 records
to_tsquery
‘teach’ | ‘teach’ | ‘teach’
  • plainto_tsquery just pulls out the keywords - plain text

SELECT plainto_tsquery('english', 'SQL Python');
1 records
plainto_tsquery
‘sql’ & ‘python’

SELECT plainto_tsquery('english', 'Teach teaches teaching and the if');
1 records
plainto_tsquery
‘teach’ & ‘teach’ & ‘teach’

 

  • phraseto_tsquery: words that come in order (a phrase)

SELECT phraseto_tsquery('english', 'SQL Python');
1 records
phraseto_tsquery
‘sql’ <-> ‘python’

 

  • websearch_to_tsquery is in PostgreSQL >= 11 and a bit like Google advanced websearch

SELECT websearch_to_tsquery('english', 'SQL -not Python');
1 records
websearch_to_tsquery
‘sql’ & ‘python’

SELECT to_tsquery('english', 'teaching') @@
  to_tsvector('english', 'Prof_Chuck also teaches Python and also SQL');
1 records
?column?
TRUE

Creating an english language inverted index using a tsvector index.


DROP TABLE docs cascade;

CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));

CREATE INDEX gin1 ON docs USING gin(to_tsvector('english', doc));

INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from Prof_Chuck'),
('Prof_Chuck also teaches Python and also SQL');

Inserting filler rows


INSERT INTO docs (doc) SELECT 'Neon ' || generate_series(10000,20000);

SELECT id, doc FROM docs WHERE
    to_tsquery('english', 'learn') @@ to_tsvector('english', doc);
1 records
id doc
2 More people should learn SQL from Prof_Chuck

EXPLAIN SELECT id, doc FROM docs WHERE
    to_tsquery('english', 'learn') @@ to_tsvector('english', doc);
4 records
QUERY PLAN
Bitmap Heap Scan on docs (cost=221.22..281.66 rows=35 width=36)
Recheck Cond: (’‘’learn’’‘::tsquery @@ to_tsvector(’english’::regconfig, doc))
-> Bitmap Index Scan on gin1 (cost=0.00..221.21 rows=35 width=0)
Index Cond: (to_tsvector(‘english’::regconfig, doc) @@ ’‘’learn’’’::tsquery)

 

Checking the operation types for the various indexes in PostgreSQL


SELECT am.amname AS index_method, opc.opcname AS opclass_name
    FROM pg_am am, pg_opclass opc
    WHERE opc.opcmethod = am.oid
    ORDER BY index_method, opclass_name;
Displaying records 1 - 10
index_method opclass_name
brin bit_minmax_ops
brin box_inclusion_ops
brin bpchar_bloom_ops
brin bpchar_minmax_ops
brin bytea_bloom_ops
brin bytea_minmax_ops
brin char_bloom_ops
brin char_minmax_ops
brin date_bloom_ops
brin date_minmax_multi_ops