Building an inverted string index (a SQL demo)

Author

Beata Sirowy

The document begins with an overview of index types in PostgreSQL and their use cases.

Then, I built an inverted index on a document—first by hand and then using PostgreSQL’s built-in features.

The document is created in Quarto accessed from RStudio, connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DBI)        
library(RPostgres)        
library(dbplyr)


Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                        
  RPostgres::Postgres(),                        
  dbname = 'postgres',                        
  host = 'localhost',                        
  port = 5432,                        
  user = 'postgres',                        
  password = 'abcd'        
  )

3. PostgreSQL index types

Getting the right type of index is crucial for database performance.

B-Tree - The default for many applications - automatically balanced as it grows

Image source: Wikipedia.org
BRIN - Block Range Index - Smaller / faster if data is mostly sorted; often used in data mining applications

Image source: Wikipedia.org
Hash - Quick lookup of long key strings - no prefix lookup, only direct matches

Image source: Wikipedia.org
GIN - Generalized Inverted Indexes - multiple values in a column, usually preferred type
GiST - Generalized Search Tree - uses a hashing function; due to it the index is lossy, i.e. it might produce false matches when two words hash to the same bit position.
SP-GiST - Space Partitioned Generalized Search Tree - used with GIS data

Forward and Inverted Indexes

in general, there are two categories of indexes:

Forward indexes - we give the index a logical key and it tells us where to find the row that contains the key. (B-Tree, BRIN, Hash)
Inverted indexes - we give the index a string (query) and the index gives us a list of all the rows that match the query. (GIN, GiST), works a bit like a wildcard

Image source: https://www.pg4e.com/lectures/05-FullText-images/inverted-index.png

The division is not always strict - B-tree indexes are stored in sorted order (forward indexes), but we give a B-Tree the prefix of a logical key, it can give us a set of rows (like an inverted index).

A typical use case for an inverse index: to quickly search text documents

inverted indexes are generally used for : e.g. blog posts, websites
similar to Google Search
- Crawl: Retrieve documents, parse them and create an inverted index
- Search: Take keywords, find the documents with the words then rank them and present results

References:

Severance, Ch. (2019) PostgreSQL Lecture Notes. Available: https://www.pg4e.com/lectures/05-FullText

4. Building an inverted index (GIN) ‘by hand’ - using only SQL

PostgreSQL has an inbuilt index function that can create indexes automatically. Here I create an inverted index “by hand”, using SQL commands, to explore the mechanics of this type of index.

We can split long text columns into space-delimited words using PostgreSQL’s split-like function string_to_array().

then we can use the PostgresSQL unnest() function to turn the resulting array into separate rows - this is somehow similar to generate_series() function

Creating a table to be used further in the demo


CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));


INSERT INTO docs (doc) VALUES
('This is SQL and Python and other fun teaching stuff'),
('More people should learn SQL from Prof_Chuck'),
('Prof_Chuck also teaches Python and also SQL');

SELECT


SELECT * FROM  docs;

3 records
id	doc
1	This is SQL and Python and other fun teaching stuff
2	More people should learn SQL from Prof_Chuck
3	Prof_Chuck also teaches Python and also SQL

Break the document column into one row per word + primary key


SELECT id, s.keyword AS keyword
FROM docs AS D, 
unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;

Displaying records 1 - 10
id	keyword
1	This
1	is
1	SQL
1	and
1	Python
1	and
1	other
1	fun
1	teaching
1	stuff

Discard duplicate rows


SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;

Displaying records 1 - 10
id	keyword
1	and
1	fun
1	is
1	other
1	Python
1	SQL
1	stuff
1	teaching
1	This
2	from

Create the keyword table


CREATE TABLE docs_gin (
  keyword TEXT,
  doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE
);

Insert the keyword / primary key rows into a table


INSERT INTO docs_gin (doc_id, keyword)
SELECT DISTINCT id, s.keyword AS keyword
FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
ORDER BY id;


SELECT * FROM docs_gin ORDER BY doc_id;

Displaying records 1 - 10
keyword	doc_id
and	1
fun	1
is	1
other	1
Python	1
SQL	1
stuff	1
teaching	1
This	1
from	2

Find all the distinct documents that match a keyword


SELECT DISTINCT keyword, doc_id FROM docs_gin AS G
WHERE G.keyword = 'SQL';

3 records
keyword	doc_id
SQL	1
SQL	2
SQL	3

Find all the distinct documents that match a keyword


SELECT DISTINCT id, doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = 'Python';

2 records
id	doc
1	This is SQL and Python and other fun teaching stuff
3	Prof_Chuck also teaches Python and also SQL

Remove duplicates and have more than one keyword


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword IN ('fun', 'people');

2 records
doc
More people should learn SQL from Prof_Chuck
This is SQL and Python and other fun teaching stuff

Find a phrase including any of the keywords


SELECT DISTINCT doc FROM docs AS D
JOIN docs_gin AS G ON D.id = G.doc_id
WHERE G.keyword = ANY(string_to_array('I want to learn', ' '));

1 records
doc
More people should learn SQL from Prof_Chuck

5. Creating an inverted index (GIN) with PostgreSQL in-built features

The basic syntax of the command:

CREATE INDEX name ON table USING GIN ([operator class]column);


CREATE INDEX GIN_docs ON docs USING gin(string_to_array(doc, ' ') array_ops);

The <@ is looking for an intersection between two arrays (a set theory concept of intersection - like in inner join)


SELECT id, doc FROM docs 
WHERE '{learn}' <@ string_to_array(doc, ' ');

1 records
id	doc
2	More people should learn SQL from Prof_Chuck


EXPLAIN ANALYZE
SELECT id, doc FROM docs 
WHERE '{learn}' <@ string_to_array(doc, ' ');

5 records
QUERY PLAN
Seq Scan on docs (cost=0.00..1.04 rows=1 width=36) (actual time=0.014..0.015 rows=1 loops=1)
Filter: (‘{learn}’::text[] <@ string_to_array(doc, ’ ’::text))
Rows Removed by Filter: 2
Planning Time: 0.061 ms
Execution Time: 0.024 ms

As we can see, the type of index scan performed was a sequential scan - not using our GIN index. This is due to the small size of the table. I will extend it with some randomly generated rows and try again.


INSERT INTO docs (doc) 
SELECT 'Pink_Panther ' || generate_series(1000,2000);


SELECT * FROM docs
LIMIT 20;

Displaying records 1 - 10
id	doc
1	This is SQL and Python and other fun teaching stuff
2	More people should learn SQL from Prof_Chuck
3	Prof_Chuck also teaches Python and also SQL
4	Pink_Panther 1000
5	Pink_Panther 1001
6	Pink_Panther 1002
7	Pink_Panther 1003
8	Pink_Panther 1004
9	Pink_Panther 1005
10	Pink_Panther 1006

We perform the search again and check the type.



EXPLAIN ANALYZE
SELECT id, doc FROM docs 
WHERE '{learn}' <@ string_to_array(doc, ' ');

5 records
QUERY PLAN
Seq Scan on docs (cost=0.00..7.32 rows=1 width=36) (actual time=0.017..0.373 rows=1 loops=1)
Filter: (‘{learn}’::text[] <@ string_to_array(doc, ’ ’::text))
Rows Removed by Filter: 1003
Planning Time: 0.071 ms
Execution Time: 0.384 ms

As we can see, GIN index was used in this case (a bitmap heap scan was performed - not a sequential scan as in the previous case) .