I build a natural language index on a document - first by hand and then using PostgreSQL inbuilt features.
The document is created in Quarto accessed from RStudio connected to a locally hosted PostgreSQL database.
1. Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI) library(RPostgres) library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect( RPostgres::Postgres(), dbname ='postgres', host ='localhost', port =5432, user ='postgres', password ='abcd' )
3. Building a natural language index “manually”
Taking advantage of some features of natural language we can make indexes quicker and more efficient.
To optimize the performance of an index:
we don’t index ‘stop words’ (such as and, or, but, why - words that don’t have meaning on their own);
we ignore capitalization (upper case) / lower case distinction);
we reduce variations of words with equivalent meanings down to a single “stem word”.
Stemming and stop words depend on which language the document is stored in - the rules for new languages can be installed in PostgreSQL:
SELECT cfgname FROM pg_ts_config;
Displaying records 1 - 10
cfgname
simple
arabic
armenian
basque
catalan
danish
dutch
english
finnish
french
Defining the key steps in a “manual” building of a natural language index
creating a table of stop words used in the document;
creating a table of stems (mapping the words in the text document to their stems, removing capitalization);
creating the index table.
We create a table to be used in the demo
CREATETABLE docs (id SERIAL PRIMARYKEY,doc TEXT);
INSERTINTO docs (doc) VALUES('This is SQL and Python and other fun teaching stuff'),('More people should learn SQL from Prof_Chuck'),('Prof_Chuck also teaches Python and also SQL');
select*from docs;
3 records
id
doc
1
This is SQL and Python and other fun teaching stuff
2
More people should learn SQL from Prof_Chuck
3
Prof_Chuck also teaches Python and also SQL
Breaking the document column into one row per word + primary key
SELECTid, new.keyword AS keywordFROM docs AS D, unnest(string_to_array(D.doc, ' ')) new(keyword)ORDERBYid;
Displaying records 1 - 10
id
keyword
1
This
1
is
1
SQL
1
and
1
Python
1
and
1
other
1
fun
1
teaching
1
stuff
Discarding duplicate rows
SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)ORDERBYid;
Displaying records 1 - 10
id
keyword
1
and
1
fun
1
is
1
other
1
Python
1
SQL
1
stuff
1
teaching
1
This
2
from
Converting all words to lower case
SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)ORDERBYid;
SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)WHERE s.keyword NOTIN (SELECT word FROM stop_words)ORDERBYid;
Displaying records 1 - 10
id
keyword
1
fun
1
other
1
python
1
sql
1
stuff
1
teaching
2
from
2
learn
2
more
2
people
Inserting the stop-word free list into the GIN table
INSERTINTO docs_gin (doc_id, keyword)SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)WHERE s.keyword NOTIN (SELECT word FROM stop_words)ORDERBYid;
SELECT*FROM docs_gin;
Displaying records 1 - 10
keyword
doc_id
fun
1
other
1
python
1
sql
1
stuff
1
teaching
1
from
2
learn
2
more
2
people
2
Queries
One word query
SELECTDISTINCT doc FROM docs AS DJOIN docs_gin AS G ON D.id= G.doc_idWHERE G.keyword =lower('SQL');
3 records
doc
More people should learn SQL from Prof_Chuck
Prof_Chuck also teaches Python and also SQL
This is SQL and Python and other fun teaching stuff
A multi word query
SELECTDISTINCT doc FROM docs AS DJOIN docs_gin AS G ON D.id= G.doc_idWHERE G.keyword =ANY(string_to_array(lower('Meet fun people'), ' '));
2 records
doc
More people should learn SQL from Prof_Chuck
This is SQL and Python and other fun teaching stuff
A stop-word query - does not return any results
SELECTDISTINCT doc FROM docs AS DJOIN docs_gin AS G ON D.id= G.doc_idWHERE G.keyword =lower('and');
Moving the initial word extraction into a sub-query
SELECTid, keyword FROM (SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)) AS X;
Displaying records 1 - 10
id
keyword
2
from
2
learn
1
other
3
sql
1
python
3
python
1
and
1
stuff
2
more
1
this
Adding the stems as third column (may or may not exist)
SELECTid, keyword, stem FROM (SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)) AS KLEFTJOIN docs_stem AS S ON K.keyword = S.wordORDERBY keyword DESC;
Displaying records 1 - 10
id
keyword
stem
1
this
NA
1
teaching
teach
3
teaches
teach
1
stuff
NA
1
sql
NA
3
sql
NA
2
sql
NA
2
should
NA
1
python
NA
3
python
NA
Defining the course of action: if the stem is there, use it
SELECTid,CASEWHEN stem ISNOTNULLTHEN stem ELSE keyword ENDAS awesome,keyword, stemFROM (SELECTDISTINCTid, lower(s.keyword) AS keywordFROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)) AS KLEFTJOIN docs_stem AS S ON K.keyword = S.wordORDERBY keyword DESC;
Displaying records 1 - 10
id
awesome
keyword
stem
1
this
this
NA
1
teach
teaching
teach
3
teach
teaches
teach
1
stuff
stuff
NA
1
sql
sql
NA
3
sql
sql
NA
2
sql
sql
NA
2
should
should
NA
1
python
python
NA
3
python
python
NA
Null Coalescing - returns the first non-null in a list, the command will be further integrated in our query
SELECTCOALESCE(NULL, NULL, 'Prof_Chuck');
1 records
coalesce
Prof_Chuck
SELECTCOALESCE('Prof_Chuck', NULL, 'SQL');
1 records
coalesce
Prof_Chuck
Defining the course of action:: if the stem is there, use it instead of the keyword
SELECTid, COALESCE(stem, keyword) AS keywordFROM (SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)) AS KLEFTJOIN docs_stem AS S ON K.keyword = S.wordORDERBY keyword DESC;
Displaying records 1 - 10
id
keyword
1
this
1
teach
3
teach
1
stuff
1
sql
3
sql
2
sql
2
should
1
python
3
python
Inserting the stems in the index table docs_gin
DELETEFROM docs_gin;
INSERTINTO docs_gin (doc_id, keyword)SELECTid, COALESCE(stem, keyword)FROM (SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)) AS KLEFTJOIN docs_stem AS S ON K.keyword = S.word;
SELECT*FROM docs_gin;
Displaying records 1 - 10
keyword
doc_id
also
3
and
1
and
3
from
2
fun
1
is
1
learn
2
more
2
other
1
people
2
Lets do stop words and stems
DELETEFROM docs_gin;
INSERTINTO docs_gin (doc_id, keyword)SELECTid, COALESCE(stem, keyword)FROM (SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(lower(D.doc), ' ')) s(keyword)WHERE s.keyword NOTIN (SELECT word FROM stop_words)) AS KLEFTJOIN docs_stem AS S ON K.keyword = S.word;
SELECT*FROM docs_ginLIMIT20;
Displaying records 1 - 10
keyword
doc_id
also
3
from
2
fun
1
learn
2
more
2
other
1
people
2
prof_chuck
2
prof_chuck
3
python
3
Examples of queries
SELECTCOALESCE((SELECT stem FROM docs_stem WHERE word=lower('SQL')), lower('SQL'));
1 records
coalesce
sql
Handling the stems in queries. Use the keyword if there is no stem
SELECTDISTINCTid, doc FROM docs AS DJOIN docs_gin AS G ON D.id= G.doc_idWHERE G.keyword =COALESCE((SELECT stem FROM docs_stem WHERE word=lower('SQL')), lower('SQL'));
3 records
id
doc
1
This is SQL and Python and other fun teaching stuff
2
More people should learn SQL from Prof_Chuck
3
Prof_Chuck also teaches Python and also SQL
Prefer the stem over the actual keyword. The technical term for converting search term to their stems is called “conflation”
SELECTCOALESCE((SELECT stem FROM docs_stem WHERE word=lower('teaching')), lower('teaching'));
1 records
coalesce
teach
SELECTDISTINCTid, doc FROM docs AS DJOIN docs_gin AS G ON D.id= G.doc_idWHERE G.keyword =COALESCE((SELECT stem FROM docs_stem WHERE word=lower('teaching')), lower('teaching'));
2 records
id
doc
1
This is SQL and Python and other fun teaching stuff
3
Prof_Chuck also teaches Python and also SQL
4. Building a natural language index with PostgreSQL built-in features (much easier and more efficient)
ts_vector is an special “array” of stemmed words, passed throug a stop-word filter + positions within the document. We need to specify the language of the document.
SELECT to_tsvector('english', 'This is SQL and Python and other fun teaching stuff');
1 records
to_tsvector
‘fun’:8 ‘python’:5 ‘sql’:3 ‘stuff’:10 ‘teach’:9
SELECT to_tsvector('english', 'More people should learn SQL from Prof_Chuck');
1 records
to_tsvector
‘chuck’:8 ‘learn’:4 ‘peopl’:2 ‘prof’:7 ‘sql’:5
SELECT to_tsvector('english', 'Prof_Chuck also teaches Python and also SQL');
ts_query is an “array” of lower case, stemmed words with stop words removed plus logical operators & = and, ! = not, | = or. We need to specify the language of the document.
SELECT to_tsquery('english', 'teaching');
1 records
to_tsquery
‘teach’
SELECT to_tsquery('english', 'teaches');
1 records
to_tsquery
‘teach’
SELECT to_tsquery('english', 'and');
1 records
to_tsquery
SELECT to_tsquery('german', 'SQL');
1 records
to_tsquery
‘sql’
SELECT to_tsquery('english', 'Teach | teaches | teaching | and | the | if');
1 records
to_tsquery
‘teach’ | ‘teach’ | ‘teach’
plainto_tsquery just pulls out the keywords - plain text
SELECT plainto_tsquery('english', 'SQL Python');
1 records
plainto_tsquery
‘sql’ & ‘python’
SELECT plainto_tsquery('english', 'Teach teaches teaching and the if');
1 records
plainto_tsquery
‘teach’ & ‘teach’ & ‘teach’
phraseto_tsquery: words that come in order (a phrase)
SELECT phraseto_tsquery('english', 'SQL Python');
1 records
phraseto_tsquery
‘sql’ <-> ‘python’
websearch_to_tsquery is in PostgreSQL >= 11 and a bit like Google advanced websearch
SELECT to_tsquery('english', 'teaching') @@ to_tsvector('english', 'Prof_Chuck also teaches Python and also SQL');
1 records
?column?
TRUE
Creating an english language inverted index using a tsvector index.
DROPTABLE docs cascade;
CREATETABLE docs (id SERIAL, doc TEXT, PRIMARYKEY(id));
CREATEINDEX gin1 ON docs USING gin(to_tsvector('english', doc));
INSERTINTO docs (doc) VALUES('This is SQL and Python and other fun teaching stuff'),('More people should learn SQL from Prof_Chuck'),('Prof_Chuck also teaches Python and also SQL');