pg_sparse: Similarity Search with SPLADE Inside Postgres
Written by Ming Ying on November 12, 2023
With the release of v0.7.0
, sparse vector support was added to pgvector
. As a
result, we have deprecated pg_sparse
in favour of the upstream pgvector
implementation.
We’re excited to unveil pg_sparse
: the first extension to enable efficient storage and retrieval of sparse vectors using HNSW inside Postgres. pg_sparse
is to sparse vectors what pgvector
is to dense vectors.
Generated by new models like SPLADE, sparse vectors can detect the presence of exact keywords while also capturing semantic similarity between terms. Unlike dense vectors, sparse vectors contain significantly more entries, most of which are zero. For instance, OpenAI’s text-embedding-ada-002
model outputs dense vectors with 1536 entries, whereas SPLADE outputs sparse vectors with over 30,000 entries.
pg_sparse
is a fork of pgvector
, which means that it leverages pgvector
’s existing vector storage and HNSW implementation. It consists of two major changes:
- A new Postgres data type called
svector
that stores sparse vectors by their non-zero entries - A modification to the way that
pgvector
allocates Postgres pages to support vectors with variable numbers of non-zero entries
You can easily get started by installing pg_sparse
inside an existing, self-hosted Postgres instance
or by running our Docker image. Run the following query to get started:
-- Load extension
CREATE EXTENSION svector;
-- Create test data
CREATE TABLE items (id bigserial PRIMARY KEY, embedding svector(4));
INSERT INTO items (embedding) VALUES ('[1,0,3,0]'), ('[0,0,5,6]');
-- Create HNSW index for cosine similarity
CREATE INDEX ON items USING shnsw (embedding svector_cosine_ops);
-- Run query
SELECT * FROM items ORDER BY embedding <=> '[3,0,1,0]';
SPLADE Inside Postgres: An Example
Next, let’s walk through a more complex example that inserts, indexes, and searches sparse vectors generated by SPLADE using pg_sparse
. To create our sparse vectors, let’s install the dependencies:
!pip install -U transformers torch datasets pandas tqdm
Next, let’s run Python code that
- Loads an example Huggingface dataset containing the descriptions of 50,000 movies
- Generates a SPLADE vector for each description
- Saves the dataset as a CSV file
from transformers import AutoModelForMaskedLM, AutoTokenizer
from tqdm import tqdm
from datasets import load_dataset
import pandas as pd
import re
import torch
model_id = 'naver/splade-cocondenser-ensembledistil'
dataset_name = 'SandipPalit/Movie_Dataset'
def create_splade(text):
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)
vec = torch.max(
torch.log(
1 + torch.relu(output.logits)
) * tokens.attention_mask.unsqueeze(-1),
dim=1)[0].squeeze()
return vec
def clean_text(text):
return re.sub(r'[\r\n\t]+', ' ', text)
# Initialize SPLADE model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Load dataset
dataset = load_dataset(dataset_name)
training_dataset = dataset["train"]
# Generate SPLADE vectors
# Note this will take a long time, consider reducing the size of
# training_dataset to reduce the time
texts = []
vectors = []
for example in tqdm(training_dataset, desc="Processing..."):
text = clean_text(example['Overview'])
texts.append(text)
vector = create_splade(text)
vectors.append(vector.tolist())
# Save as dataframe
df = pd.DataFrame({
'text': texts,
'splade_vector': vectors
})
# Generate another SPLADE vector for querying
query = "Space exploration"
df.at[0, 'text'] = query
df.at[0, 'splade_vector'] = create_splade(query).tolist()
# Save to CSV
df.to_csv("splade_vectors.csv", index=False)
Using a Postgres client like psql
, we can now load in our movie dataset. This will take a few minutes due to the size of the dataset.
CREATE TABLE movies (description text, splade_vector svector(30522));
\copy movies FROM '/path/to/splade_vectors.csv' DELIMITER ',' CSV HEADER;
Let’s also enable timing to examine the performance of our queries.
\timing
Now we’re ready to execute our first search. To begin, we’ll look for the ten rows that are most similar to the first row (”Space exploration”) based on cosine similarity:
SELECT description
FROM movies
ORDER BY splade_vector <=> (SELECT splade_vector FROM movies LIMIT 1)
LIMIT 10;
This query executed a Postgres sequential scan that took 155ms to return the first ten rows. Now, let’s create an HNSW index to speed up the search results. This may take a few minutes to run.
CREATE INDEX ON movies
USING shnsw (splade_vector svector_cosine_ops);
With the index created, let’s re-run the same search query as above. We get back the same results, but with a query time of 2.740 ms — a 50X speedup in query time!
Benchmark Results
We measured index creation and query time over a dataset of 100K sparse vectors generated by SPLADE, each with 30,522 dimensions.
We set Postgres’ maintenance_working_mem
configuration to 512MB so that the entire HNSW graph could fit into memory during the build.
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();
ef_construction
is a parameter that creates a higher quality graph and more accurate search results, at the expense of longer
index creation times. At the default ef_construction
value of 64
, the index takes 200 seconds to build (500 vectors / second).
ef_construction | Index Time | Number of Vectors |
---|---|---|
32 | 114s | 100,000 |
64 | 200s | 100,000 |
128 | 362s | 100,000 |
Next, we compared the time to return the top 10 results with and without the HNSW index. We set m=16
, ef_construction=64
,
and ef_search=40
. With HNSW, this search took 6ms. Without HNSW, this search took 150ms.
Wrapping Up
pg_sparse
is the latest installment in our roadmap to building an Elasticsearch alternative on Postgres. Now, Postgres
is capable of full-text search (via pg_search
), dense vector search (via pgvector
), and sparse vector
search.
pg_sparse
is compatible alongside both pg_search
and pgvector
. In the future, we’d love to have our work merged upstream as part of one pgvector
extension. A unified extension for all types of vector search is simpler to use and maintain, and would also enable sparse vector search in managed Postgres databases like AWS RDS that already support pgvector
.
Until then, the best way to use pg_sparse
is by installing it inside an existing, self-hosted Postgres instance,
or running the Postgres Docker image.
pg_sparse
is licensed under the PostgreSQL license in
accordance with pgvector
’s license. If you’d like to contribute, the best place to start is our
Slack community. And please don’t hesitate to show your support by
giving us a star!