You should have private evals

May 9, 2025

I think this is a very good post. Taking the time to test models yourself, and to understand how each model generation is useful to you in your own context, is clearly going to be a big advantage. So much of the assessment of LLMs is vibes-based that your own vibes matter most, so it's worth spending some time defining what they are. The post offers a framework, with examples, for doing just that.