Verdict

Benchmark any LLM against your data. Pick the best model, then make it better. Run side-by-side evals across OpenAI, Anthropic, and open-source models — then fine-tune and measure improvement on your own task distribution.

Python · LLM Evals · Fine-tuning · Open Source
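To make the workflow concrete, here is a minimal sketch of the side-by-side comparison step. It is not Verdict's actual API: `call_model`, the exact-match metric, the one-example eval set, and the candidate model names are all placeholders standing in for real provider SDK calls, a task-specific scorer, and your own data.

```python
# Sketch only: replace call_model with real OpenAI / Anthropic / open-source
# client calls, and exact_match with whatever metric fits your task.

def call_model(model_name: str, prompt: str) -> str:
    # Placeholder for a provider-specific completion call.
    return f"[{model_name}] response to: {prompt}"

def exact_match(prediction: str, expected: str) -> float:
    return 1.0 if prediction.strip() == expected.strip() else 0.0

def run_eval(model_name: str, examples: list[dict]) -> float:
    """Mean score for one model over the shared eval set."""
    scores = [exact_match(call_model(model_name, ex["prompt"]), ex["expected"])
              for ex in examples]
    return sum(scores) / len(scores)

examples = [{"prompt": "2 + 2 =", "expected": "4"}]          # your task distribution
candidates = ["gpt-4o", "claude-sonnet-4", "llama-3.1-70b"]  # illustrative names only

results = {m: run_eval(m, examples) for m in candidates}
best = max(results, key=results.get)
print(f"best base model: {best} ({results[best]:.0%})")
```

The loop itself is the point: one eval set, one scoring function, every model measured the same way, so a fine-tuned checkpoint can later be scored against the same baseline.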

Reflex

Agentic prompt optimization. Reflex takes your dataset and prompt, runs evals, diagnoses why scores are falling short, and rewrites the prompt, iterating until it converges. Works with any provider and supports MLflow and W&B for experiment tracking.

Python · Prompt Optimization · Open Source
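A minimal sketch of that evaluate, diagnose, rewrite loop, under the assumption that `evaluate`, `diagnose`, and `rewrite` are hypothetical stand-ins for an eval harness, an LLM critic, and an LLM rewriter; this shows the shape of the idea, not Reflex's actual interface.

```python
# Sketch only: the three helpers below are hypothetical stand-ins.

def evaluate(prompt: str, dataset: list[dict]) -> float:
    # Stand-in: run the prompt over the dataset and return a mean score.
    return 0.72

def diagnose(prompt: str, dataset: list[dict]) -> str:
    # Stand-in: ask a critic model why the low-scoring examples failed.
    return "the prompt never states the required output format"

def rewrite(prompt: str, critique: str) -> str:
    # Stand-in: ask a rewriter model to revise the prompt against the critique.
    return prompt + "\nRespond with a single JSON object."

def optimize(prompt: str, dataset: list[dict],
             max_iters: int = 5, min_gain: float = 0.01) -> str:
    """Keep a rewritten prompt only if it scores meaningfully better."""
    best_prompt, best_score = prompt, evaluate(prompt, dataset)
    for _ in range(max_iters):
        critique = diagnose(best_prompt, dataset)    # why are scores falling short?
        candidate = rewrite(best_prompt, critique)   # propose a revised prompt
        score = evaluate(candidate, dataset)
        if score - best_score < min_gain:            # no meaningful gain: converged
            break
        best_prompt, best_score = candidate, score
    return best_prompt
```

The convergence check doubles as the stopping rule: once a rewrite stops buying measurable improvement, the loop returns the best prompt seen so far.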

Fine-tuning Infrastructure

Closing the loop between evaluation and training. Good fine-tuning requires good data curation, careful eval design, and knowing when to stop — none of which scales with manual iteration. A pipeline that automates the cycle: evaluate, curate, train, repeat.

Python · Fine-tuning · Open Source
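A minimal sketch of that closed loop, with hypothetical `evaluate`, `curate`, and `finetune` helpers standing in for the eval harness, the data-curation step, and the training job.

```python
# Sketch only: all three helpers are hypothetical stand-ins for real components.

def evaluate(model: str, eval_set: list[dict]) -> tuple[float, list[dict]]:
    # Stand-in: return the mean score and the examples the model got wrong.
    return 0.80, []

def curate(failures: list[dict], pool: list[dict]) -> list[dict]:
    # Stand-in: select new training examples that target the observed failures.
    return pool[: len(failures) * 10]

def finetune(model: str, train_set: list[dict]) -> str:
    # Stand-in: launch a fine-tuning job and return the new checkpoint name.
    return model + "-ft"

def closed_loop(model: str, eval_set: list[dict], pool: list[dict],
                target: float = 0.95, max_rounds: int = 3) -> str:
    """Evaluate, curate, train, repeat until the target score or the round cap."""
    for _ in range(max_rounds):
        score, failures = evaluate(model, eval_set)
        if score >= target or not failures:    # knowing when to stop
            break
        train_set = curate(failures, pool)     # curation driven by eval failures
        model = finetune(model, train_set)     # train, then loop back to evaluate
    return model
```

Stopping on either the target score or the round cap is the "knowing when to stop" part: the loop never trains past the point where the evals justify it.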

More on GitHub

These are highlights — there's more on GitHub, including experiments, forks, and works-in-progress that didn't make the cut here.

View GitHub Profile ↗