Building an inverted string index (a SQL demo)

Author

Beata Sirowy

The document begins with an overview of index types in PostgreSQL and their use cases.

Then, I built an inverted index on a document—first by hand and then using PostgreSQL’s built-in features.

The document is created in Quarto accessed from RStudio, connected to a locally hosted PostgreSQL database.

1. Loading libraries

library(tidyverse)        
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)        
library(RPostgres)        
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

2. Establishing a database connection (PostgreSQL)

I use RPostgres package to connent to a locally hosted PostgreSQL database.

con <- DBI::dbConnect(                        
  RPostgres::Postgres(),                        
  dbname = 'postgres',                        
  host = 'localhost',                        
  port = 5432,                        
  user = 'postgres',                        
  password = 'abcd'        
  )

3. PostgreSQL index types

Getting the right type of index is crucial for database performance.

  • B-Tree - The default for many applications - automatically balanced as it grows

    Image source: Wikipedia.org
  • BRIN - Block Range Index - Smaller / faster if data is mostly sorted; often used in data mining applications

    Image source: Wikipedia.org
  • Hash - Quick lookup of long key strings - no prefix lookup, only direct matches

    Image source: Wikipedia.org
  • GIN - Generalized Inverted Indexes - multiple values in a column, usually preferred type

  • GiST - Generalized Search Tree - uses a hashing function; due to it the index is lossy, i.e. it might produce false matches when two words hash to the same bit position.

  • SP-GiST - Space Partitioned Generalized Search Tree - used with GIS data

Forward and Inverted Indexes

in general, there are two categories of indexes:

  • Forward indexes - we give the index a logical key and it tells us where to find the row that contains the key. (B-Tree, BRIN, Hash)

  • Inverted indexes - we give the index a string (query) and the index gives us a list of all the rows that match the query. (GIN, GiST), works a bit like a wildcard

    Image source: https://www.pg4e.com/lectures/05-FullText-images/inverted-index.png

The division is not always strict - B-tree indexes are stored in sorted order (forward indexes), but we give a B-Tree the prefix of a logical key, it can give us a set of rows (like an inverted index).

A typical use case for an inverse index: to quickly search text documents

  • inverted indexes are generally used for : e.g. blog posts, websites

  • similar to Google Search

    • Crawl: Retrieve documents, parse them and create an inverted index

    • Search: Take keywords, find the documents with the words then rank them and present results

References:

4. Building an inverted index (GIN) ‘by hand’ - using only SQL

PostgreSQL has an inbuilt index function that can create indexes automatically. Here I create an inverted index “by hand”, using SQL commands, to explore the mechanics of this type of index.

  • We can split long text columns into space-delimited words using PostgreSQL’s split-like function string_to_array().

  • then we can use the PostgresSQL unnest() function to turn the resulting array into separate rows - this is somehow similar to generate_series() function

     

    Creating a table to be used further in the demo

    
    CREATE TABLE docs (id SERIAL, doc TEXT, PRIMARY KEY(id));
    
    INSERT INTO docs (doc) VALUES
    ('This is SQL and Python and other fun teaching stuff'),
    ('More people should learn SQL from Prof_Chuck'),
    ('Prof_Chuck also teaches Python and also SQL');

    SELECT

    
    SELECT * FROM  docs; 
    3 records
    id doc
    1 This is SQL and Python and other fun teaching stuff
    2 More people should learn SQL from Prof_Chuck
    3 Prof_Chuck also teaches Python and also SQL

    Break the document column into one row per word + primary key

    
    SELECT id, s.keyword AS keyword
    FROM docs AS D, 
    unnest(string_to_array(D.doc, ' ')) s(keyword)
    ORDER BY id;
    Displaying records 1 - 10
    id keyword
    1 This
    1 is
    1 SQL
    1 and
    1 Python
    1 and
    1 other
    1 fun
    1 teaching
    1 stuff

    Discard duplicate rows

    
    SELECT DISTINCT id, s.keyword AS keyword
    FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
    ORDER BY id;
    Displaying records 1 - 10
    id keyword
    1 and
    1 fun
    1 is
    1 other
    1 Python
    1 SQL
    1 stuff
    1 teaching
    1 This
    2 from

    Create the keyword table

    
    CREATE TABLE docs_gin (
      keyword TEXT,
      doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE
    );

    Insert the keyword / primary key rows into a table

    
    INSERT INTO docs_gin (doc_id, keyword)
    SELECT DISTINCT id, s.keyword AS keyword
    FROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)
    ORDER BY id;
    
    SELECT * FROM docs_gin ORDER BY doc_id;
    Displaying records 1 - 10
    keyword doc_id
    and 1
    fun 1
    is 1
    other 1
    Python 1
    SQL 1
    stuff 1
    teaching 1
    This 1
    from 2

    Find all the distinct documents that match a keyword

    
    SELECT DISTINCT keyword, doc_id FROM docs_gin AS G
    WHERE G.keyword = 'SQL';
    3 records
    keyword doc_id
    SQL 1
    SQL 2
    SQL 3

    Find all the distinct documents that match a keyword

    
    SELECT DISTINCT id, doc FROM docs AS D
    JOIN docs_gin AS G ON D.id = G.doc_id
    WHERE G.keyword = 'Python';
    2 records
    id doc
    1 This is SQL and Python and other fun teaching stuff
    3 Prof_Chuck also teaches Python and also SQL

    Remove duplicates and have more than one keyword

    
    SELECT DISTINCT doc FROM docs AS D
    JOIN docs_gin AS G ON D.id = G.doc_id
    WHERE G.keyword IN ('fun', 'people');
    2 records
    doc
    More people should learn SQL from Prof_Chuck
    This is SQL and Python and other fun teaching stuff

    Find a phrase including any of the keywords

    
    SELECT DISTINCT doc FROM docs AS D
    JOIN docs_gin AS G ON D.id = G.doc_id
    WHERE G.keyword = ANY(string_to_array('I want to learn', ' '));
    1 records
    doc
    More people should learn SQL from Prof_Chuck

5. Creating an inverted index (GIN) with PostgreSQL in-built features

The basic syntax of the command:

CREATE INDEX name ON table USING GIN ([operator class]column);


CREATE INDEX GIN_docs ON docs USING gin(string_to_array(doc, ' ') array_ops);


The <@ is looking for an intersection between two arrays (a set theory concept of intersection - like in inner join)


SELECT id, doc FROM docs 
WHERE '{learn}' <@ string_to_array(doc, ' ');
1 records
id doc
2 More people should learn SQL from Prof_Chuck

EXPLAIN ANALYZE
SELECT id, doc FROM docs 
WHERE '{learn}' <@ string_to_array(doc, ' ');
5 records
QUERY PLAN
Seq Scan on docs (cost=0.00..1.04 rows=1 width=36) (actual time=0.014..0.015 rows=1 loops=1)
Filter: (‘{learn}’::text[] <@ string_to_array(doc, ’ ’::text))
Rows Removed by Filter: 2
Planning Time: 0.061 ms
Execution Time: 0.024 ms


As we can see, the type of index scan performed was a sequential scan - not using our GIN index. This is due to the small size of the table. I will extend it with some randomly generated rows and try again.


INSERT INTO docs (doc) 
SELECT 'Pink_Panther ' || generate_series(1000,2000);

SELECT * FROM docs
LIMIT 20;
Displaying records 1 - 10
id doc
1 This is SQL and Python and other fun teaching stuff
2 More people should learn SQL from Prof_Chuck
3 Prof_Chuck also teaches Python and also SQL
4 Pink_Panther 1000
5 Pink_Panther 1001
6 Pink_Panther 1002
7 Pink_Panther 1003
8 Pink_Panther 1004
9 Pink_Panther 1005
10 Pink_Panther 1006


We perform the search again and check the type.



EXPLAIN ANALYZE
SELECT id, doc FROM docs 
WHERE '{learn}' <@ string_to_array(doc, ' ');
5 records
QUERY PLAN
Seq Scan on docs (cost=0.00..7.32 rows=1 width=36) (actual time=0.017..0.373 rows=1 loops=1)
Filter: (‘{learn}’::text[] <@ string_to_array(doc, ’ ’::text))
Rows Removed by Filter: 1003
Planning Time: 0.071 ms
Execution Time: 0.384 ms


As we can see, GIN index was used in this case (a bitmap heap scan was performed - not a sequential scan as in the previous case) .