Open-source, at every layer of the stack — benchmarking, prompt optimization, and fine-tuning infrastructure.
Benchmark any LLM against your data. Pick the best model, then make it better. Run side-by-side evals across OpenAI, Anthropic, and open-source models — then fine-tune and measure improvement on your own task distribution.
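The side-by-side eval idea can be sketched as a small harness. This is a hedged illustration, not the project's actual API: `evaluate`, `side_by_side`, and the stub models here are all hypothetical, standing in for real provider clients (OpenAI, Anthropic, or a local model behind the same callable interface).

```python
from typing import Callable

def evaluate(model: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts where the model's answer matches the expected one."""
    correct = sum(1 for prompt, expected in dataset if model(prompt).strip() == expected)
    return correct / len(dataset)

def side_by_side(models: dict[str, Callable[[str], str]],
                 dataset: list[tuple[str, str]]) -> dict[str, float]:
    """Run the same eval over every model so scores are directly comparable."""
    return {name: fn_score for name, fn_score in
            ((name, evaluate(fn, dataset)) for name, fn in models.items())}

# Toy dataset and stub "models" in place of real API clients.
dataset = [("2+2=", "4"), ("capital of France?", "Paris")]
models = {
    "stub-a": lambda p: {"2+2=": "4", "capital of France?": "Paris"}[p],
    "stub-b": lambda p: "4",  # always answers "4"
}
scores = side_by_side(models, dataset)
best = max(scores, key=scores.get)  # the model to fine-tune next
```

Because every model is scored on the same task distribution, the winner of the eval is also the natural baseline to fine-tune against.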
Agentic prompt optimization. Reflex takes your dataset and prompt, runs evals, diagnoses why scores fall short, and rewrites the prompt, iterating until it converges. Works with any provider, and supports MLflow and W&B for experiment tracking.
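The eval-diagnose-rewrite loop can be sketched generically. This is a minimal illustration of the control flow, not Reflex's real interface: `optimize_prompt` and the toy `evaluate`/`diagnose`/`rewrite` callables are hypothetical placeholders for LLM-backed steps.

```python
def optimize_prompt(prompt, evaluate, diagnose, rewrite, max_iters=5, target=0.95):
    """Evaluate, diagnose failures, rewrite; stop at the target score or on convergence."""
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(max_iters):
        if best_score >= target:
            break
        failures = diagnose(best_prompt)            # why are scores falling short?
        candidate = rewrite(best_prompt, failures)  # propose an improved prompt
        score = evaluate(candidate)
        if score <= best_score:                     # no improvement: converged
            break
        best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Deterministic toy stand-ins for the eval and rewrite models.
evals = {"v1": 0.5, "v1+fix": 0.8, "v1+fix+fix": 0.8}
chain = {"v1": "v1+fix", "v1+fix": "v1+fix+fix"}
best, score = optimize_prompt(
    "v1",
    evaluate=lambda p: evals.get(p, 0.0),
    diagnose=lambda p: ["answers are too vague"],
    rewrite=lambda p, failures: chain.get(p, p),
)
```

The loop keeps the best prompt seen so far, so a bad rewrite can never make the result worse than the starting point.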
Closing the loop between evaluation and training. Good fine-tuning requires good data curation, careful eval design, and knowing when to stop — none of which scales with manual iteration. A pipeline that automates the cycle: evaluate, curate, train, repeat.
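The evaluate-curate-train cycle, including a stopping rule, can be sketched like this. The function and the toy `evaluate`/`curate`/`train` callables are hypothetical illustrations of the automated cycle, not the pipeline's actual API.

```python
def training_loop(model, dataset, evaluate, curate, train, rounds=3, min_gain=0.01):
    """Automate evaluate -> curate -> train -> repeat; stop when gains plateau."""
    score = evaluate(model, dataset)
    for _ in range(rounds):
        hard = curate(model, dataset)       # keep the examples the model still misses
        model = train(model, hard)
        new_score = evaluate(model, dataset)
        if new_score - score < min_gain:    # knowing when to stop
            break
        score = new_score
    return model, score

# Toy setup: the "model" is just the set of answers it has learned so far.
dataset = ["a", "b", "c", "d"]
model, score = training_loop(
    set(),
    dataset,
    evaluate=lambda m, d: sum(x in m for x in d) / len(d),
    curate=lambda m, d: [x for x in d if x not in m],
    train=lambda m, hard: m | set(hard[:2]),  # each round fixes up to two failures
)
```

Curating only the still-failing examples is what makes each training round cheap, and the `min_gain` check is the automated version of "knowing when to stop".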
These are highlights — there's more on GitHub, including experiments, forks, and works-in-progress that didn't make the cut here.