Building a natural language index (a SQL demo)

Author

Beata Sirowy

I build a natural language index on a document - first by hand and then using PostgreSQL inbuilt features.

The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DBI)         
library(RPostgres)         
library(dbplyr)


Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                           
  RPostgres::Postgres(),                           
  dbname = 'postgres',                          
  host = 'localhost',                          
  port = 5432,                           
  user = 'postgres',                           
  password = 'abcd'           )

3. Building a natural language index “manually”

Taking advantage of some features of natural language we can make indexes quicker and more efficient.

To optimize the performance of an index:

we don’t index ‘stop words’ (such as and, or, but, why - words that don’t have meaning on their own);
we ignore capitalization (upper case) / lower case distinction);
we reduce variations of words with equivalent meanings down to a single “stem word”.

Stemming and stop words depend on which language the document is stored in - the rules for new languages can be installed in PostgreSQL:


SELECT cfgname FROM pg_ts_config;

Displaying records 1 - 10
cfgname
simple
arabic
armenian
basque
catalan
danish
dutch
english
finnish
french

Defining the key steps in a “manual” building of a natural language index

creating a table of stop words used in the document;
creating a table of stems (mapping the words in the text document to their stems, removing capitalization);
creating the index table.

We create a table to be used in the demo


CREATE TABLE docs (
id SERIAL PRIMARY KEY,
doc TEXT
);


INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from Prof_Chuck'),
('Prof_Chuck also teaches Python and also SQL');


select * from docs;

3 records
id	doc
1	This is SQL and Python and other fun teaching stuff
2	More people should learn SQL from Prof_Chuck
3	Prof_Chuck also teaches Python and also SQL

Breaking the document column into one row per word + primary key


SELECT id, new.keyword AS keyword
FROM docs AS D, 
unnest(string_to_array(D.doc, ' ')) new(keyword)
ORDER BY id;

Displaying records 1 - 10
id	keyword
1	This
1	is
1	SQL
1	and
1	Python
1	and
1	other
1	fun
1	teaching
1	stuff

Discarding duplicate rows


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;

Displaying records 1 - 10
id	keyword
1	and
1	fun
1	is
1	other
1	Python
1	SQL
1	stuff
1	teaching
1	This
2	from

Converting all words to lower case


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
ORDER BY id;

Displaying records 1 - 10
id	keyword
1	and
1	fun
1	is
1	other
1	python
1	sql
1	stuff
1	teaching
1	this
2	from

Creating table with our GIS index


CREATE TABLE docs_gin (
  keyword TEXT,
  doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE
  );

Creating table with ‘stop words’


CREATE TABLE stop_words (word TEXT unique);


INSERT INTO stop_words (word) VALUES ('is'), ('this'), ('and');


SELECT * FROM stop_words;

3 records
word
is
this
and

Eliminating the ‘stop words’ from the query


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
WHERE s.keyword NOT IN (SELECT word FROM stop_words)
ORDER BY id;

Displaying records 1 - 10
id	keyword
1	fun
1	other
1	python
1	sql
1	stuff
1	teaching
2	from
2	learn
2	more
2	people

Inserting the stop-word free list into the GIN table


INSERT INTO docs_gin (doc_id, keyword)
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
WHERE s.keyword NOT IN (SELECT word FROM stop_words)
ORDER BY id;



SELECT * FROM docs_gin;

Displaying records 1 - 10
keyword	doc_id
fun	1
other	1
python	1
sql	1
stuff	1
teaching	1
from	2
learn	2
more	2
people	2

Queries

One word query


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = lower('SQL');

3 records
doc
More people should learn SQL from Prof_Chuck
Prof_Chuck also teaches Python and also SQL
This is SQL and Python and other fun teaching stuff

A multi word query


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword =
  ANY(string_to_array(lower('Meet fun people'), ' '));

2 records
doc
More people should learn SQL from Prof_Chuck
This is SQL and Python and other fun teaching stuff

A stop-word query - does not return any results


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = lower('and');

0 records
doc

Creating a table with stems of words


CREATE TABLE docs_stem (word TEXT, stem TEXT);


INSERT INTO docs_stem (word, stem) VALUES
('teaching', 'teach'), ('teaches', 'teach');


SELECT * from docs_stem;

2 records
word	stem
teaching	teach
teaches	teach

Moving the initial word extraction into a sub-query


SELECT id, keyword FROM (
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS X;

Displaying records 1 - 10
id	keyword
2	from
2	learn
1	other
3	sql
1	python
3	python
1	and
1	stuff
2	more
1	this

Adding the stems as third column (may or may not exist)


SELECT id, keyword, stem FROM (
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word
ORDER BY keyword DESC;

Displaying records 1 - 10
id	keyword	stem
1	this	NA
1	teaching	teach
3	teaches	teach
1	stuff	NA
1	sql	NA
3	sql	NA
2	sql	NA
2	should	NA
1	python	NA
3	python	NA

Defining the course of action: if the stem is there, use it


SELECT id,
CASE WHEN stem IS NOT NULL THEN stem 
ELSE keyword 
END AS awesome,keyword, stem
FROM (
SELECT DISTINCT id, lower(s.keyword) AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word
ORDER BY keyword DESC;

Displaying records 1 - 10
id	awesome	keyword	stem
1	this	this	NA
1	teach	teaching	teach
3	teach	teaches	teach
1	stuff	stuff	NA
1	sql	sql	NA
3	sql	sql	NA
2	sql	sql	NA
2	should	should	NA
1	python	python	NA
3	python	python	NA

Null Coalescing - returns the first non-null in a list, the command will be further integrated in our query


SELECT COALESCE(NULL, NULL, 'Prof_Chuck');

1 records
coalesce
Prof_Chuck


SELECT COALESCE('Prof_Chuck', NULL, 'SQL');

1 records
coalesce
Prof_Chuck

Defining the course of action:: if the stem is there, use it instead of the keyword


SELECT id, COALESCE(stem, keyword) AS keyword
FROM (
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word
ORDER BY keyword DESC;

Displaying records 1 - 10
id	keyword
1	this
1	teach
3	teach
1	stuff
1	sql
3	sql
2	sql
2	should
1	python
3	python

Inserting the stems in the index table docs_gin


DELETE FROM docs_gin;


INSERT INTO docs_gin (doc_id, keyword)
SELECT id, COALESCE(stem, keyword)
FROM (
  SELECT DISTINCT id, s.keyword AS keyword
  FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word;


SELECT * FROM docs_gin;

Displaying records 1 - 10
keyword	doc_id
also	3
and	1
and	3
from	2
fun	1
is	1
learn	2
more	2
other	1
people	2

Lets do stop words and stems


DELETE FROM docs_gin;


INSERT INTO docs_gin (doc_id, keyword)
SELECT id, COALESCE(stem, keyword)
FROM (
  SELECT DISTINCT id, s.keyword AS keyword
  FROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)
  WHERE s.keyword NOT IN (SELECT word FROM stop_words)
) AS K
LEFT JOIN docs_stem AS S ON K.keyword = S.word;


SELECT * FROM docs_gin
LIMIT 20;

Displaying records 1 - 10
keyword	doc_id
also	3
from	2
fun	1
learn	2
more	2
other	1
people	2
prof_chuck	2
prof_chuck	3
python	3

Examples of queries


SELECT COALESCE((SELECT stem FROM docs_stem WHERE word=lower('SQL')), lower('SQL'));

1 records
coalesce
sql

Handling the stems in queries. Use the keyword if there is no stem


SELECT DISTINCT id, doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = COALESCE((SELECT stem FROM docs_stem WHERE word=lower('SQL')), lower('SQL'));

3 records
id	doc
1	This is SQL and Python and other fun teaching stuff
2	More people should learn SQL from Prof_Chuck
3	Prof_Chuck also teaches Python and also SQL

Prefer the stem over the actual keyword. The technical term for converting search term to their stems is called “conflation”


SELECT COALESCE((SELECT stem FROM docs_stem WHERE word=lower('teaching')), lower('teaching'));

1 records
coalesce
teach


SELECT DISTINCT id, doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = COALESCE((SELECT stem FROM docs_stem WHERE word=lower('teaching')), lower('teaching'));

2 records
id	doc
1	This is SQL and Python and other fun teaching stuff
3	Prof_Chuck also teaches Python and also SQL

4. Building a natural language index with PostgreSQL built-in features (much easier and more efficient)

ts_vector is an special “array” of stemmed words, passed throug a stop-word filter + positions within the document. We need to specify the language of the document.


SELECT to_tsvector('english', 'This is SQL and Python and other fun teaching stuff');

1 records
to_tsvector
‘fun’:8 ‘python’:5 ‘sql’:3 ‘stuff’:10 ‘teach’:9


SELECT to_tsvector('english', 'More people should learn SQL from Prof_Chuck');

1 records
to_tsvector
‘chuck’:8 ‘learn’:4 ‘peopl’:2 ‘prof’:7 ‘sql’:5


SELECT to_tsvector('english', 'Prof_Chuck also teaches Python and also SQL');

1 records
to_tsvector
‘also’:3,7 ‘chuck’:2 ‘prof’:1 ‘python’:5 ‘sql’:8 ‘teach’:4

ts_query is an “array” of lower case, stemmed words with stop words removed plus logical operators & = and, ! = not, | = or. We need to specify the language of the document.


SELECT to_tsquery('english', 'teaching');

1 records
to_tsquery
‘teach’


SELECT to_tsquery('english', 'teaches');

1 records
to_tsquery
‘teach’


SELECT to_tsquery('english', 'and');

1 records
to_tsquery


SELECT to_tsquery('german', 'SQL');

1 records
to_tsquery
‘sql’


SELECT to_tsquery('english', 'Teach | teaches | teaching | and | the | if');

1 records
to_tsquery
‘teach’ \| ‘teach’ \| ‘teach’

plainto_tsquery just pulls out the keywords - plain text


SELECT plainto_tsquery('english', 'SQL Python');

1 records
plainto_tsquery
‘sql’ & ‘python’


SELECT plainto_tsquery('english', 'Teach teaches teaching and the if');

1 records
plainto_tsquery
‘teach’ & ‘teach’ & ‘teach’

phraseto_tsquery: words that come in order (a phrase)


SELECT phraseto_tsquery('english', 'SQL Python');

1 records
phraseto_tsquery
‘sql’ <-> ‘python’

websearch_to_tsquery is in PostgreSQL >= 11 and a bit like Google advanced websearch


SELECT websearch_to_tsquery('english', 'SQL -not Python');

1 records
websearch_to_tsquery
‘sql’ & ‘python’


SELECT to_tsquery('english', 'teaching') @@
  to_tsvector('english', 'Prof_Chuck also teaches Python and also SQL');

1 records
?column?
TRUE

Creating an english language inverted index using a tsvector index.


DROP TABLE docs cascade;


CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));


CREATE INDEX gin1 ON docs USING gin(to_tsvector('english', doc));


INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from Prof_Chuck'),
('Prof_Chuck also teaches Python and also SQL');

Inserting filler rows


INSERT INTO docs (doc) SELECT 'Neon ' || generate_series(10000,20000);


SELECT id, doc FROM docs WHERE
    to_tsquery('english', 'learn') @@ to_tsvector('english', doc);

1 records
id	doc
2	More people should learn SQL from Prof_Chuck


EXPLAIN SELECT id, doc FROM docs WHERE
    to_tsquery('english', 'learn') @@ to_tsvector('english', doc);

4 records
QUERY PLAN
Bitmap Heap Scan on docs (cost=221.22..281.66 rows=35 width=36)
Recheck Cond: (’‘’learn’’‘::tsquery @@ to_tsvector(’english’::regconfig, doc))
-> Bitmap Index Scan on gin1 (cost=0.00..221.21 rows=35 width=0)
Index Cond: (to_tsvector(‘english’::regconfig, doc) @@ ’‘’learn’’’::tsquery)

Checking the operation types for the various indexes in PostgreSQL


SELECT am.amname AS index_method, opc.opcname AS opclass_name
    FROM pg_am am, pg_opclass opc
    WHERE opc.opcmethod = am.oid
    ORDER BY index_method, opclass_name;

Displaying records 1 - 10
index_method	opclass_name
brin	bit_minmax_ops
brin	box_inclusion_ops
brin	bpchar_bloom_ops
brin	bpchar_minmax_ops
brin	bytea_bloom_ops
brin	bytea_minmax_ops
brin	char_bloom_ops
brin	char_minmax_ops
brin	date_bloom_ops
brin	date_minmax_multi_ops