The document begins with an overview of index types in PostgreSQL and their use cases.
Then, I built an inverted index on a document—first by hand and then using PostgreSQL’s built-in features.
The document is created in Quarto accessed from RStudio, connected to a locally hosted PostgreSQL database.
1. Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI) library(RPostgres) library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
2. Establishing a database connection (PostgreSQL)
I use RPostgres package to connent to a locally hosted PostgreSQL database.
con <- DBI::dbConnect( RPostgres::Postgres(), dbname ='postgres', host ='localhost', port =5432, user ='postgres', password ='abcd' )
3. PostgreSQL index types
Getting the right type of index is crucial for database performance.
B-Tree - The default for many applications - automatically balanced as it grows
Image source: Wikipedia.org
BRIN - Block Range Index - Smaller / faster if data is mostly sorted; often used in data mining applications
Image source: Wikipedia.org
Hash - Quick lookup of long key strings - no prefix lookup, only direct matches
Image source: Wikipedia.org
GIN - Generalized Inverted Indexes - multiple values in a column, usually preferred type
GiST - Generalized Search Tree - uses a hashing function; due to it the index is lossy, i.e. it might produce false matches when two words hash to the same bit position.
SP-GiST - Space Partitioned Generalized Search Tree - used with GIS data
Forward and Inverted Indexes
in general, there are two categories of indexes:
Forward indexes - we give the index a logical key and it tells us where to find the row that contains the key. (B-Tree, BRIN, Hash)
Inverted indexes - we give the index a string (query) and the index gives us a list of all the rows that match the query. (GIN, GiST), works a bit like a wildcard
The division is not always strict - B-tree indexes are stored in sorted order (forward indexes), but we give a B-Tree the prefix of a logical key, it can give us a set of rows (like an inverted index).
A typical use case for an inverse index: to quickly search text documents
inverted indexes are generally used for : e.g. blog posts, websites
similar to Google Search
Crawl: Retrieve documents, parse them and create an inverted index
Search: Take keywords, find the documents with the words then rank them and present results
4. Building an inverted index (GIN) ‘by hand’ - using only SQL
PostgreSQL has an inbuilt index function that can create indexes automatically. Here I create an inverted index “by hand”, using SQL commands, to explore the mechanics of this type of index.
We can split long text columns into space-delimited words using PostgreSQL’s split-like function string_to_array().
then we can use the PostgresSQL unnest() function to turn the resulting array into separate rows - this is somehow similar to generate_series() function
Creating a table to be used further in the demo
CREATETABLE docs (id SERIAL, doc TEXT, PRIMARYKEY(id));
INSERTINTO docs (doc) VALUES('This is SQL and Python and other fun teaching stuff'),('More people should learn SQL from Prof_Chuck'),('Prof_Chuck also teaches Python and also SQL');
SELECT
SELECT*FROM docs;
3 records
id
doc
1
This is SQL and Python and other fun teaching stuff
2
More people should learn SQL from Prof_Chuck
3
Prof_Chuck also teaches Python and also SQL
Break the document column into one row per word + primary key
SELECTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)ORDERBYid;
Displaying records 1 - 10
id
keyword
1
This
1
is
1
SQL
1
and
1
Python
1
and
1
other
1
fun
1
teaching
1
stuff
Discard duplicate rows
SELECTDISTINCTid, s.keyword AS keywordFROM docs AS D, unnest(string_to_array(D.doc, ' ')) s(keyword)ORDERBYid;
As we can see, the type of index scan performed was a sequential scan - not using our GIN index. This is due to the small size of the table. I will extend it with some randomly generated rows and try again.