Baked Search: Building semantic search quickly for toy use cases
Dec 2, 2024
Decent-quality semantic search has become much easier and cheaper to ship yourself in the last couple of years. I thought I’d try and write a super quick guide that gets a search backend up and running as quickly and cheaply as possible.
The guide assumes that you have a toy use case - you’re building as a hobbyist. The example I’ve chosen is writing search for a blog - specifically a blog built using a static site generator like Hugo, Jekyll, Gatsby etc. (like this one!). To do this, we’re going to use a tweak on a pattern called baked data - there’ll be a read-only copy of our data deployed alongside our code, which we’ll use to build and manage our search index.
This means we can bundle everything into a single application image and then throw everything at a scale-to-zero hosting provider (Cloud Run in our case). This should keep things simple, cheap (potentially near zero) and easy to maintain (we just rebuild our index when our content changes).
OK, so if you are still with me, start your timers! Let’s try and rattle through this quickly.
Let’s start by creating our vector store and seeding it with some data. We want to get up and running quickly so we can start to play about with search, so we’re going to start by using a turnkey solution from ChromaDB. Under the hood Chroma uses its own fork of hnswlib for indexing and searching vectors (HNSW is a fast approximate nearest neighbour algorithm). SQLite is used as a storage engine and there’s a neat API for generating and storing embeddings. By default the model used for generating the embeddings under the hood is all-MiniLM-L6-v2. This means we only deal with text on the way in and on the way out, which is the simplest way to get started. ChromaDB can be run in a client/server mode, but we’re going to use the PersistentClient - this creates a ChromaDB client but stores all the data on the local filesystem. ChromaDB is Apache 2.0 licensed and there’s a beta program for a (paid) hosted version, but at our scale we’re not going to need that, we’re local filesystem all the way. Here’s a sketch of some code to create the client and then seed it with data from an RSS feed - which we’ll stick in a file named seed_db.py:
import os

import chromadb


def setup_chroma_collection(
    rss_url: str, collection_name: str = "blog_collection"
) -> None:
    """Fetch RSS feed, parse items, and load into a ChromaDB collection.

    Args:
        rss_url: URL of the RSS feed.
        collection_name: Name of the ChromaDB collection.
    """
    # Set the path at which chroma should persist its data and instantiate the client
    chroma_path = os.getenv("CHROMA_PATH", "/app/chroma_data")
    client = chromadb.PersistentClient(path=chroma_path)
    # Create a new collection
    collection = client.create_collection(name=collection_name)
    # Fetch and parse RSS feed for our blog, the actual work here is skipped for brevity
    rss_root = fetch_rss_feed(rss_url)
    # documents are the data that we generate our embeddings for e.g. the body text of a blog post
    # metadata is useful context about the document, for example the title, publication date or a synopsis
    # ids are a unique id for each document, for example, the filename.
    documents, metadatas, ids = parse_rss_items(rss_root)
    # Add the data to the chromadb collection
    collection.add(documents=documents, metadatas=metadatas, ids=ids)


if __name__ == "__main__":
    # Replace the placeholder with your own feed URL
    setup_chroma_collection("https://example.com/index.xml")
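Before wiring anything else up, it’s worth a quick sanity check from a Python shell - a minimal sketch, assuming you’ve run the seed script and reuse the same path and collection name:

import chromadb

client = chromadb.PersistentClient(path="/app/chroma_data")
collection = client.get_collection("blog_collection")
# The query text is embedded with the same model used at seed time
results = collection.query(query_texts=["static site generators"], n_results=3)
print(results["ids"], results["documents"])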
Next, we’ll write a small Flask app which connects to the same datastore and provides a “search” route for us to throw queries at:
from typing import Any

from flask import Flask, jsonify, request

# To keep it brief I'll omit some code here (including the create_app() factory
# that builds `app`), but to get the collection we just use
# client = chromadb.PersistentClient(path=chroma_path_on_filesystem)
# collection = client.get_collection(collection_name)


@app.route("/search", methods=["GET"])
def search() -> Any:
    """Search endpoint that accepts a query and returns search results.

    Returns:
        JSON response containing the search results with flattened document data.
    """
    query = request.args.get("q")
    if not query:
        return jsonify({"error": "No query provided"}), 400
    # Query the collection
    raw_results = collection.query(query_texts=[query], n_results=3)
    # We'll flatten and restructure the db response, it looks a little like this:
    # {
    #     'documents': [[
    #         'This is a blog post about python',
    #         'This is a blog post about vector databases'
    #     ]],
    #     'ids': [['id1', 'id2']],
    #     'distances': [[1.0404009819030762, 1.243080496788025]],
    #     'uris': None,
    #     'data': None,
    #     'metadatas': [[None, None]],
    #     'embeddings': None,
    # }
    formatted_results = format_results(raw_results)
    return jsonify({"results": formatted_results})
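format_results is left undefined above; here’s a minimal sketch of one way to flatten the nested response shown in the comment (the output shape is my own choice, not a fixed API):

def format_results(raw_results: dict) -> list[dict]:
    """Flatten ChromaDB's nested per-query lists into one dict per hit."""
    # Each field is a list of lists - one inner list per query - and we only sent one query
    ids = raw_results["ids"][0]
    documents = raw_results["documents"][0]
    distances = raw_results["distances"][0]
    metadatas = raw_results["metadatas"][0]
    return [
        {"id": id_, "document": doc, "distance": dist, "metadata": meta}
        for id_, doc, dist, meta in zip(ids, documents, distances, metadatas)
    ]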
Now let’s dockerise this little service we’ve written. To keep things cheap (free!) and simple (one image only!) we’ll bake the data into the container filesystem. We’ll construct the image so that the minimum number of layers is busted each time we rebuild the index. This means that most layers (e.g. the model used for generating the embeddings, the app) are not rebuilt each time the docker build command is run. It also means we’re not continually re-downloading our dependencies every time we recreate the image. Instead we just rebuild the search index. Here’s the Dockerfile:
# Use the official Python image with a slim version of Debian Linux
FROM python:3.12-slim
# Upgrade installed packages to pick up any security fixes, then clear the apt cache
RUN apt-get update && apt-get upgrade -y && rm -rf /var/lib/apt/lists/*
# Set the working directory in the container
WORKDIR /app
# Copy the requirements file into the container
COPY requirements.txt .
# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt
# Set environment variables
ENV PORT=8080
ENV CHROMA_PATH=/app/chroma_data
# Trigger ChromaDB model download so this is baked into the image and infrequently busted.
RUN python -c "import chromadb; \
client = chromadb.PersistentClient(path='$CHROMA_PATH'); \
collection = client.create_collection(name='test_collection'); \
collection.add(documents=['Trigger lazy load of embedding model'], ids=['id1'])"
# Copy the rest of the application code into the container
COPY . .
# Run the ChromaDB seed script; when the ARG changes we'll bust the cache at this layer
ARG SEED_DB_TIMESTAMP
RUN python seed_db.py
# Expose the port that Gunicorn runs on
EXPOSE 8080
# Run the application with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "app:create_app()"]
You can build and run this Dockerfile quickly and easily using:
docker build --build-arg SEED_DB_TIMESTAMP=$(date +%s) -t blog-search:latest .
docker run -d -p 8080:8080 --name blog-search blog-search:latest
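Once the container is running you can smoke-test the endpoint:

curl "http://localhost:8080/search?q=vector+databases"

And to get it onto Cloud Run, a one-off deploy is roughly the following - a sketch assuming you’ve authenticated the gcloud CLI and enabled the Cloud Run API (the service name and region are placeholders):

gcloud run deploy blog-search --source . --region europe-west1 --allow-unauthenticated

With no minimum instances configured, Cloud Run scales the service to zero between requests, which is what keeps this near-free.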
Lastly, you can set up a GitHub Action which rebuilds your image both on push to the repo and on a weekly schedule - this will keep the search index current even if you’re not working on the backend.
name: Publish to Google Cloud
on:
push:
branches:
- main
schedule:
# Run every Sunday at 01:00 GMT
- cron: "0 1 * * 0"
# rest of file omitted.
And that’s it! After that you’re just left with the frontend stuff, but I leave that to the reader.