Written by Ming Ying on November 12, 2023

With the release of v0.7.0, sparse vector support was added to pgvector. As a result, we have deprecated pg_sparse in favour of the upstream pgvector implementation.

We’re excited to unveil pg_sparse: the first extension to enable efficient storage and retrieval of sparse vectors using HNSW inside Postgres. pg_sparse is to sparse vectors what pgvector is to dense vectors.

Generated by new models like SPLADE, sparse vectors can detect the presence of exact keywords while also capturing semantic similarity between terms. Unlike dense vectors, sparse vectors contain significantly more entries, most of which are zero. For instance, OpenAI’s text-embedding-ada-002 model outputs dense vectors with 1536 entries, whereas SPLADE outputs sparse vectors with over 30,000 entries.

pg_sparse is a fork of pgvector, which means that it leverages pgvector’s existing vector storage and HNSW implementation. It consists of two major changes:

  1. A new Postgres data type called svector that stores sparse vectors by their non-zero entries
  2. A modification to the way that pgvector allocates Postgres pages to support vectors with variable numbers of non-zero entries

You can easily get started by installing pg_sparse inside an existing, self-hosted Postgres instance or by running our Docker image. Run the following query to get started:

-- Load extension
CREATE EXTENSION svector;
-- Create test data
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));
INSERT INTO items (embedding) VALUES ('[1,0,3,0]'), ('[0,0,5,6]');
-- Create HNSW index for cosine similarity
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);
-- Run query
SELECT * FROM items ORDER BY embedding <=> '[3,0,1,0]';

SPLADE Inside Postgres: An Example

Next, let’s walk through a more complex example that inserts, indexes, and searches sparse vectors generated by SPLADE using pg_sparse. To create our sparse vectors, let’s install the dependencies:

!pip install -U transformers torch datasets pandas tqdm

Next, let’s run Python code that

  1. Loads an example Huggingface dataset containing the descriptions of 50,000 movies
  2. Generates a SPLADE vector for each description
  3. Saves the dataset as a CSV file
from transformers import AutoModelForMaskedLM, AutoTokenizer
from tqdm import tqdm
from datasets import load_dataset
import pandas as pd
import re
import torch

model_id = 'naver/splade-cocondenser-ensembledistil'
dataset_name = 'SandipPalit/Movie_Dataset'

def create_splade(text):
    tokens = tokenizer(text, return_tensors='pt')
    output = model(**tokens)
    vec = torch.max(
        torch.log(
            1 + torch.relu(output.logits)
        ) * tokens.attention_mask.unsqueeze(-1),
    dim=1)[0].squeeze()

    return vec

def clean_text(text):
    return re.sub(r'[\r\n\t]+', ' ', text)

# Initialize SPLADE model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Load dataset
dataset = load_dataset(dataset_name)
training_dataset = dataset["train"]

# Generate SPLADE vectors
# Note this will take a long time, consider reducing the size of
# training_dataset to reduce the time
texts = []
vectors = []

for example in tqdm(training_dataset, desc="Processing..."):
    text = clean_text(example['Overview'])
    texts.append(text)
    vector = create_splade(text)
    vectors.append(vector.tolist())

# Save as dataframe
df = pd.DataFrame({
    'text': texts,
    'splade_vector': vectors
})

# Generate another SPLADE vector for querying
query = "Space exploration"
df.at[0, 'text'] = query
df.at[0, 'splade_vector'] = create_splade(query).tolist()

# Save to CSV
df.to_csv("splade_vectors.csv", index=False)

Using a Postgres client like psql, we can now load in our movie dataset. This will take a few minutes due to the size of the dataset.

CREATE TABLE movies (description text, splade_vector svector(30522));
\copy movies FROM '/path/to/splade_vectors.csv' DELIMITER ',' CSV HEADER;

Let’s also enable timing to examine the performance of our queries.

\timing

Now we’re ready to execute our first search. To begin, we’ll look for the ten rows that are most similar to the first row (”Space exploration”) based on cosine similarity:

SELECT description
FROM movies
ORDER BY splade_vector <=> (SELECT splade_vector FROM movies LIMIT 1)
LIMIT 10;

This query executed a Postgres sequential scan that took 155ms to return the first ten rows. Now, let’s create an HNSW index to speed up the search results. This may take a few minutes to run.

CREATE INDEX ON movies
USING shnsw (splade_vector svector_cosine_ops);

With the index created, let’s re-run the same search query as above. We get back the same results, but with a query time of 2.740 ms — a 50X speedup in query time!

Benchmark Results

We measured index creation and query time over a dataset of 100K sparse vectors generated by SPLADE, each with 30,522 dimensions. We set Postgres’ maintenance_working_mem configuration to 512MB so that the entire HNSW graph could fit into memory during the build.

ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();

ef_construction is a parameter that creates a higher quality graph and more accurate search results, at the expense of longer index creation times. At the default ef_construction value of 64, the index takes 200 seconds to build (500 vectors / second).

ef_constructionIndex TimeNumber of Vectors
32114s100,000
64200s100,000
128362s100,000

Next, we compared the time to return the top 10 results with and without the HNSW index. We set m=16, ef_construction=64, and ef_search=40. With HNSW, this search took 6ms. Without HNSW, this search took 150ms.

Wrapping Up

pg_sparse is the latest installment in our roadmap to building an Elasticsearch alternative on Postgres. Now, Postgres is capable of full-text search (via pg_search), dense vector search (via pgvector), and sparse vector search.

pg_sparse is compatible alongside both pg_search and pgvector. In the future, we’d love to have our work merged upstream as part of one pgvector extension. A unified extension for all types of vector search is simpler to use and maintain, and would also enable sparse vector search in managed Postgres databases like AWS RDS that already support pgvector.

Until then, the best way to use pg_sparse is by installing it inside an existing, self-hosted Postgres instance, or running the Postgres Docker image.

pg_sparse is licensed under the PostgreSQL license in accordance with pgvector’s license. If you’d like to contribute, the best place to start is our Slack community. And please don’t hesitate to show your support by giving us a star!