dspy unpacked: continuous prompt optimisation
Omar Khattab, Chris Potts, Matei Zaharia | 2023 | Paper | Github | Docs
A lot of work with LLMs today involves working through a loop: you break a problem into steps, write a prompt for each step, then put the whole thing together by adjusting each prompt so it feeds cleanly into the next one.
dspy simplifies this process. It gives you a framework to structure your pipeline - forcing you to architect the application so your program flow is split from the variable stuff - the prompts and model weights that get fed to the LLM.
What’s neat is that this separation means dspy can algorithmically tune the variable stuff (prompts and weights) to get you more reliable, higher-quality output. The prompts are treated more like parameters than parts of the program.
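To make that concrete, here’s a minimal sketch (names are illustrative, loosely following the patterns in the docs): the module pins down the program flow, while the actual prompt text for each step is something dspy generates and can later tune.

import dspy

# A signature declares *what* a step does; dspy turns it into a prompt.
class GenerateAnswer(dspy.Signature):
    """Answer the question using the supplied context."""
    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

# The module fixes the control flow; no prompt strings live in this code.
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, context, question):
        return self.generate_answer(context=context, question=question)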
The process works via a simple optimisation loop. Imagine a metric function:
def metric(handcrafted_example, model_prediction, trace=None):
    return handcrafted_example.answer.lower() == model_prediction.answer.lower()
Until the prediction matches the handcrafted example, we keep going, tweaking our prompts and weights until we get a stable, consistent answer (the boilerplate for this loop is lifted away by an evaluator class, which also gives you straightforward parallelisation).
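Roughly, that looks like the sketch below - assuming a devset of dspy.Example objects and a dspy program called qa_program (both hypothetical names):

from dspy.evaluate import Evaluate

# Evaluate handles the looping, scoring and parallelism for us.
evaluator = Evaluate(devset=devset, metric=metric, num_threads=8, display_progress=True)
score = evaluator(qa_program)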
Of course, we can make the evaluations much more complex - for example, by using gpt-4 to score our answers. Here’s an example (credit: lifted from the docs) which assesses whether a tweet is any good using three criteria: whether it’s engaging, whether it’s factually correct, and whether it stays within a tweet’s length limit:
import dspy

# Use a stronger model as the judge, separate from whatever the pipeline itself runs on.
gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=1000, model_type='chat')

def metric(example, prediction, trace=None):
    question, answer, tweet = example.question, example.answer, prediction.output

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    correct = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    # Ask gpt-4 to grade the tweet against each assessment question.
    with dspy.context(lm=gpt4T):
        correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct)
        engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)

    correct, engaging = [m.assessment_answer.lower() == 'yes' for m in [correct, engaging]]
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    if trace is not None: return score >= 2
    return score / 2.0
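The Assess signature the judge uses isn’t shown in that snippet; in the docs it’s defined roughly like this:

class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")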
What we’ve done here is let gpt-4 (or llama 3 or whatever) bootstrap our pipeline for us. The prompts are just a parameter that we optimise for, and gpt-4 is judging the quality of each output generated by our pipeline.
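Wiring this up to an optimiser looks something like the sketch below - TweetPipeline and trainset are hypothetical stand-ins for your own module and examples:

from dspy.teleprompt import BootstrapFewShot

# The optimiser runs the pipeline over the training examples, scores each run
# with the gpt-4-backed metric above, and keeps demonstrations from runs that pass.
optimizer = BootstrapFewShot(metric=metric, max_bootstrapped_demos=4)
compiled_pipeline = optimizer.compile(TweetPipeline(), trainset=trainset)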
What’s cool is that we can start nesting here: rather than using simple heuristics and gpt-4, we can use a dspy program itself as the metric. That’s where the third argument to the function comes in - trace. trace captures the inputs and outputs of each predictor, and we can use it to optimise each step. This lets us bootstrap our way to much more complex interactions, focusing as we go both on whether we’ve achieved the overarching task and on how each step in our pipeline is performing.
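As a sketch of what that looks like, here’s a metric for a hypothetical multi-hop pipeline whose intermediate predictions expose a query field - during compilation, trace holds a (predictor, inputs, outputs) tuple for every step:

def validate_answer_and_hops(example, pred, trace=None):
    answer_ok = example.answer.lower() in pred.answer.lower()
    if trace is None:
        return answer_ok  # plain evaluation: just grade the final answer
    # During compilation we can also reject runs where an intermediate query ballooned.
    queries = [outputs.query for *_, outputs in trace if 'query' in outputs]
    return answer_ok and all(len(q) <= 100 for q in queries)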
Let’s think through the implications of this architecture. The cool thing about something like LangChain is that it contains lots of prebuilt components with templated prompts to do different tasks. You then chain these modules together, plug in your data and bish bash bosh you have a pipeline.
This is great if you just want to be fast and scrappy and you’re not going to try and scale out your approach. The downside is that what you’ve built is highly coupled to those templated strings which are buried pretty deep in your code.
These prompts are brittle: if your dependencies shift (the model changes, the data changes, the application flow grows to capture more domain complexity), your results will vary wildly and you’re back in an eval loop. This approach won’t really scale to a production setting where you’re optimising for cost, trying to stay close to the model frontier, or dealing with domain complexity that keeps ramping up.
Because dspy splits these variable bits out from your code explicitly upfront, you can focus on managing the application flow and let dspy optimise the prompts for you given your metric. When you change your data, tweak your control flow, or change your target model, the dspy compiler can map your program onto a new set of prompts. Being able to quickly re-compile for a different target model matters: many application-level uses of AI seem blocked today on a reasoning advance that feels close, so having an architecture that can quickly adapt to base model advances makes sense.
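In practice, retargeting looks something like this sketch (reusing the hypothetical TweetPipeline, metric and trainset from above; the model name is illustrative):

import dspy
from dspy.teleprompt import BootstrapFewShot

# Moving to an open-source model is a configuration change plus a re-compile,
# not a rewrite of prompt strings buried inside the pipeline.
dspy.settings.configure(lm=dspy.HFModel(model='meta-llama/Llama-2-13b-chat-hf'))
compiled_for_llama = BootstrapFewShot(metric=metric).compile(TweetPipeline(), trainset=trainset)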
The downside, of course, is the complexity - we’re expending all this effort just to generate a bit of natural language, which seems insane. It’s wise to watch costs when using a proprietary model to score your prompts (nice critique here). Getting the optimiser working well also requires an example dataset in the hundreds, not the tens (the docs suggest 300-500). I think this shifts how we execute during product discovery - a good understanding of the feedback loops that build these datasets for you matters, as does the ability to prototype rapidly on one framework and then shift to another (perhaps supporting that evolution is the end state for these frameworks?). There’s a nice FAQ in the repo covering which scenarios dspy is and isn’t suited for.
But still, all this feels far more viable - as the sands continue to shift, having a framework that lets you switch model targets quickly makes sense. It also fits the product development curve better (start with a closed-source model like gpt-3.5, then switch to an open-source finetune as you look to cut costs and scale out).
Perhaps it’s too early to call, but one to watch over the next little while.