Open-source, at every layer of the stack — agent tracing, benchmarking, failure attribution, and prompt optimization.
Failure attribution. When an agent fails, Origin reads the trace and tells you which span caused it, with a severity, a confidence, and a fix type that says whether the problem lies in a prompt, a retrieval index, a tool schema, or a routing decision.
Agent tracing. Records every step of an agent pipeline — inputs, outputs, timing, and how steps relate to each other — in a single structured object. Zero runtime dependencies. Works with any LLM framework.
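To make "a single structured object" concrete, here is a minimal sketch of what such a trace could look like using only the Python standard library. The names `Span`, `Trace`, `record`, and `children` are hypothetical and chosen for illustration; they are not the project's actual API.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Span:
    """One step of the pipeline (hypothetical shape, for illustration only)."""
    span_id: int
    name: str
    parent_id: Optional[int]  # how this step relates to the others
    inputs: Any
    outputs: Any = None
    started_at: float = 0.0
    ended_at: float = 0.0

    @property
    def duration(self) -> float:
        return self.ended_at - self.started_at


@dataclass
class Trace:
    """The single object that holds every recorded step."""
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str, fn: Callable[[Any], Any], inputs: Any,
               parent_id: Optional[int] = None) -> Span:
        # Register the span first so it is kept even if fn raises.
        span = Span(len(self.spans), name, parent_id, inputs)
        self.spans.append(span)
        span.started_at = time.perf_counter()
        try:
            span.outputs = fn(inputs)
        finally:
            span.ended_at = time.perf_counter()
        return span

    def children(self, span_id: int) -> list[Span]:
        return [s for s in self.spans if s.parent_id == span_id]


# Stand-in steps; a real pipeline would call an LLM or a tool here.
trace = Trace()
root = trace.record("plan", lambda q: q.upper(), "find docs")
child = trace.record("retrieve", lambda q: [q, q], root.outputs,
                     parent_id=root.span_id)
```

Because each span stores its parent's id, the flat list can be walked as a tree, which is the kind of structure a failure-attribution step would need in order to point at one span.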
Agentic prompt optimization. Reflex takes your dataset and prompt, runs evals, diagnoses why scores are falling short, and rewrites the prompt, iterating until it converges. Works with any provider and supports MLflow and W&B for experiment tracking.
Benchmark any LLM against your data. Pick the best model, then make it better. Run side-by-side evals across OpenAI, Anthropic, and open-source models — then fine-tune and measure improvement on your own task distribution.
Autonomous vLLM deployment optimizer. Give it a model, a GPU, and a workload trace. It runs overnight: it proposes a config, boots the server, benchmarks it against your real traffic, keeps or reverts the change, and repeats. Wake up to a deployment recipe that beats hand-tuned defaults, with a full audit trail of every experiment.
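The propose, benchmark, keep-or-revert cycle can be sketched as a simple greedy search. This is an illustrative toy under stated assumptions, not the project's implementation: `propose`, `optimize`, and `toy_benchmark` are hypothetical names, the config keys are made up, and the benchmark is a synthetic scoring function standing in for booting a real server and replaying traffic.

```python
import random


def propose(config, rng):
    """Nudge one setting up or down by one step (hypothetical mutation rule)."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] = max(1, candidate[key] + rng.choice([-1, 1]))
    return candidate


def optimize(config, benchmark, steps, seed=0):
    """Greedy keep-or-revert search that logs every experiment."""
    rng = random.Random(seed)
    best, best_score = config, benchmark(config)
    audit = [{"config": best, "score": best_score, "kept": True}]
    for _ in range(steps):
        candidate = propose(best, rng)
        score = benchmark(candidate)  # stand-in for boot + replay traffic
        kept = score > best_score
        audit.append({"config": candidate, "score": score, "kept": kept})
        if kept:
            best, best_score = candidate, score
        # Otherwise revert: the next proposal starts from `best` again.
    return best, best_score, audit


def toy_benchmark(config):
    """Synthetic score (higher is better); not a real throughput measurement."""
    return -((config["batch"] - 8) ** 2) - ((config["workers"] - 4) ** 2)


start = {"batch": 2, "workers": 1}
best, score, audit = optimize(start, toy_benchmark, steps=200)
```

The `audit` list holds one entry per experiment, including the rejected ones, so every kept or reverted change can be reviewed afterwards.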
These are highlights — there's more on GitHub, including experiments, forks, and works-in-progress that didn't make the cut here.