Agentic AI · Prompt Optimization

Prompt Optimization for Agentic AI Systems — and Where It Breaks Down

Most teams only optimize the prompt. An agentic system has five other things that affect output quality.

By Ankith Gunapal · Aevyra · April 21, 2026 · 9 min read

On LinkedIn last week, an AI practitioner I know made an observation I keep thinking about: hill-climbing on evals tends to leak information specific to those evals rather than improve the system. Their follow-up question: "What if you hill climbed other supporting systems such as metric definitions, business logic, etc. that you may be using as part of the agentic AI?" It's the right question. It deserves a longer answer than a comment.

Most teams think of optimization as "improve the prompt." You write a prompt, run some examples, it doesn't quite work, you tweak it. Repeat until it's good enough. This works at low volume and for simple tasks.

An agentic system breaks this model immediately.

You're not optimizing one prompt in isolation. You have a retrieval layer — the step that fetches relevant context, typically via vector search over a knowledge base — a reasoning step, tool definitions, a judge or scoring mechanism, and business rules that determine what "correct" even means. Tuning the prompt while holding everything else fixed gets you partway there. But the right answer isn't "optimize everything else too." It's "optimize the prompt in full awareness of everything else, and deliberately don't hill-climb the supporting systems themselves." This post develops why that's the right scope, and how Reflex's pipeline mode implements it.

What Optimization Actually Means in an Agentic System

In a single-turn task the optimization surface is contained — one prompt, one output, one judge. In an agentic system the surface is much larger. Here's what's actually in play:

| Component | Manual | Single-prompt mode (Promptim, Reflex standard) | Framework pipeline (DSPy/GEPA, AdalFlow) | Trace-aware pipeline (Reflex pipeline mode) |
|---|---|---|---|---|
| Prompt | manual | optimized | optimized | optimized |
| Retrieval logic | — | — | partial | signal only |
| Tool definitions | — | — | partial | signal only |
| Judge / metric | fixed | fixed | fixed | fixed |
| Business logic | human loop | human loop | human loop | human loop |

Each of these affects output quality. Most teams only optimize the first row.

The Landscape of What People Actually Do

Manual prompt engineering

Still the most common approach in production. Iterate by intuition — try a few variants, pick the one that seems better, ship it. Works until it doesn't. At scale, with complex agentic pipelines, intuition stops being reliable.

Single-prompt optimizers — Reflex (standard mode), Promptim

Bring a dataset of (input, ideal output) pairs and an existing prompt. The optimizer runs evals, diagnoses where scores are falling short, and rewrites the prompt iteratively until it converges. Right tool when the task has one prompt and a clean input/output contract — classification, summarisation, structured extraction. Reflex's standard mode and Promptim (LangChain, tightly integrated with LangSmith) both follow this pattern.
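Concretely, the contract is just static pairs — no trace, no tools. The field names below are illustrative, not Reflex's exact dataset schema:

# A single-prompt optimization dataset: static (input, ideal output)
# pairs. Field names here are illustrative, not Reflex's exact schema.
dataset = [
    {"input": "Summarise this support ticket: <ticket text>",
     "ideal_output": "User cannot log in after password reset."},
    {"input": "Summarise this support ticket: <ticket text>",
     "ideal_output": "Billing page returns a 500 for EU accounts."},
]
# The loop: run evals over the dataset, diagnose low-scoring cases,
# rewrite the prompt, repeat until scores converge.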

Framework pipeline optimizers — DSPy (incl. GEPA), AdalFlow

Treats your whole pipeline as a program and optimizes across it jointly. You define your pipeline declaratively in the framework's abstractions and the optimizer tunes prompts and few-shot examples at multiple nodes simultaneously. DSPy is the most mature option, and its GEPA optimizer (ICLR 2026 Oral) goes further than earlier approaches like MIPROv2: it samples full execution trajectories and reflects on them in natural language to guide prompt evolution — the strongest published work on trace-aware optimization. AdalFlow follows a similar declarative approach with a lighter footprint. The tradeoff common to all of them: you need to restructure your pipeline into the framework's abstractions upfront.
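To make the structural ask concrete, here is roughly what that restructuring looks like in DSPy. This is a sketch only: search_docs, grounding_metric, and trainset are assumed to exist, and exact optimizer arguments vary across DSPy versions.

import dspy

class AnswerFromDocs(dspy.Signature):
    """Answer a developer question from retrieved documentation."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class DevAssistant(dspy.Module):
    def __init__(self, search_docs):
        super().__init__()
        self.search_docs = search_docs            # your existing retrieval fn
        self.answer = dspy.ChainOfThought(AnswerFromDocs)

    def forward(self, question):
        context = self.search_docs(question)      # tool call becomes a pipeline node
        return self.answer(context=context, question=question)

# GEPA samples full execution trajectories and reflects on them in
# natural language to evolve the prompts at every node jointly.
optimizer = dspy.GEPA(metric=grounding_metric)    # grounding_metric: your scoring fn
optimized = optimizer.compile(DevAssistant(search_docs), trainset=trainset)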

Trace-aware pipeline optimizer — Reflex (pipeline mode)

Same optimization scope as single-prompt mode — one system prompt, everything else fixed — but fundamentally different feedback: Reflex re-runs your pipeline_fn on every candidate, so the judge scores the live execution trace rather than a stored input/output pair (no stale signal). Same trace-level semantics as GEPA, with a much smaller structural ask. You keep your existing agent code and expose a single function, pipeline_fn(prompt, input) → AgentTrace. You mark which node is the optimization target with optimize=True and the rest of the trace — tool calls, retrieved context, intermediate classifications — becomes evaluation signal for the judge. Right tool when you have an existing agentic system you can't or don't want to rewrite into a declarative framework.

Agentic eval platforms — LangSmith, DeepEval, Braintrust

Purpose-built for evaluating agents: trajectory evaluators, trace-level metrics, step-by-step observability. These give you rich diagnostic signal across intermediate steps but are primarily evaluation tools, not optimization loops. You get the score; you still need to close the loop back into prompt rewrites yourself. That gap is what Reflex pipeline mode is designed to close.

Textual gradients & LLM-as-optimizer — TextGrad, OPRO

Two related but distinct research directions. TextGrad (Stanford, published in Nature) treats LLM feedback as "textual gradients" backpropagated through a computation graph — the differentiation-via-text framing. OPRO (Google DeepMind) is gradient-free: the LLM itself acts as the optimizer, generating new candidate prompts from a history of (prompt, score) pairs. Both shift prompt optimization closer to ML-style loops, though production adoption outside research settings is still limited.
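For intuition, an OPRO-style step fits in a few lines. The llm (text-in, text-out call) and score (runs your eval set) helpers below are hypothetical stand-ins:

def opro_step(llm, score, history, n_candidates=4):
    """One gradient-free OPRO-style step: the LLM proposes new prompts
    from a meta-prompt containing past (prompt, score) pairs."""
    ranked = sorted(history, key=lambda pair: pair[1])   # lowest score first
    meta_prompt = (
        "Below are prompts with their scores, lowest to highest:\n"
        + "\n".join(f"score={s:.2f}: {p}" for p, s in ranked)
        + "\nWrite a new prompt that achieves a higher score."
    )
    for _ in range(n_candidates):
        candidate = llm(meta_prompt)                     # the LLM is the optimizer
        history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])        # best (prompt, score) so far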

What None of Them Solve

All of the approaches above assume the judge is fixed. You define what "good" looks like, and the optimizer works to score well against that definition. But what if your metric definition is wrong? What if your business logic is encoding assumptions that don't hold?

If you hill-climb on the judge criteria at the same time as the prompt, you risk Goodhart's Law: the metric becomes easier to game rather than more accurate. The judge needs to stay stable as a ground-truth signal.

Business logic optimization is more like program synthesis than prompt optimization — a different problem space. And for multi-agent systems, joint optimization across the whole system while preserving coherence is genuinely an open problem.

The evaluation side of this is increasingly well-tooled. LangSmith's trajectory evaluators, DeepEval's trace-level metrics, and Braintrust's agentic eval flows all give you observability into intermediate steps — you can see where a pipeline falls down, not just that it did. A cluster of recent research formalizes this as an optimization direction: the Agent-as-a-Judge paper (Oct 2024), AutoPDL (Apr 2025, IBM), GEPA (Jul 2025, ICLR 2026 Oral), and JudgeFlow (Jan 2026). GEPA closes the loop inside DSPy — sampling trajectories and reflecting on them in natural language to propose prompt updates. Reflex's pipeline mode closes the same loop for existing agentic pipelines that haven't been restructured into a declarative framework.

A Concrete Approach: Reflex Pipeline Mode

Reflex's pipeline mode ships exactly this pattern as a first-class feature. You write a pipeline_fn that takes a prompt and an input, runs your agent end-to-end, and returns an AgentTrace — a list of TraceNodes, one per step, each marked either optimize=True (the node whose prompt should be tuned) or optimize=False (nodes the judge should see for grounding but not try to rewrite). Reflex calls pipeline_fn with every candidate prompt, scores the full trace with a judge, and tunes just the target node.

Here's what it looks like for a developer assistant that must call tools (search_docs, calculate, get_date) before answering — a case where a static Q&A dataset would miss the most important failure mode: the model bypassing tools entirely and answering from training knowledge.

[Figure: Reflex pipeline mode — an AgentTrace with per-node optimize flags. The user's question flows through tools_called (optimize=False) and tool_results (optimize=False) to the answer node (optimize=True), which produces the output. An LLM judge scores the full AgentTrace — tool calls, tool results, and final answer — against a grounding rubric, and that trace-aware signal is what Reflex uses to tune the answer prompt.]

You describe your pipeline once, as code, and return an AgentTrace. Reflex handles the rest: splitting the dataset, running the pipeline per candidate, judging the trace, and selecting the best-val prompt.

pipeline.py
import json

from aevyra_reflex import AgentTrace, TraceNode

# client, MODEL, TOOL_SCHEMAS, TOOL_REGISTRY, and MAX_TOOL_ROUNDS are
# module-level setup defined elsewhere in the tutorial: an OpenAI-compatible
# client, the tool schemas, and a name -> callable registry.

def pipeline_fn(prompt: str, question: str) -> AgentTrace:
    messages = [{"role": "system", "content": prompt},
                {"role": "user",   "content": question}]

    all_calls, all_results = [], []
    final_answer = None  # stays None if the tool-round budget is exhausted

    for _round in range(MAX_TOOL_ROUNDS):
        response = client.chat.completions.create(
            model=MODEL, messages=messages,
            tools=TOOL_SCHEMAS, tool_choice="auto",
            temperature=0.0,   # required for deterministic variant comparison
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            final_answer = msg.content
            break

        messages.append(msg)
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = TOOL_REGISTRY[tc.function.name](**args)
            all_calls.append({"name": tc.function.name, "args": args})
            all_results.append({"name": tc.function.name, "result": result})
            messages.append({"role": "tool",
                             "tool_call_id": tc.id,
                             "content": result})

    return AgentTrace(nodes=[
        TraceNode("tools_called", input=question,
                  output=all_calls, optimize=False),
        TraceNode("tool_results", input=all_calls,
                  output=all_results, optimize=False),
        TraceNode("answer",
                  input={"question": question, "tool_results": all_results},
                  output=final_answer, optimize=True),
    ])

Three things make this an optimizer hook, not just a tracer. First, optimize=False on tools_called and tool_results tells Reflex: the judge should see these for grounding, but don't try to rewrite them. Second, optimize=True on answer marks the one node whose prompt is being tuned. Third, because Reflex calls pipeline_fn with each candidate prompt and re-runs the full agentic loop, the tools called, the intermediate results, and the final answer all reflect the prompt being evaluated — no stale traces, no mismatch between the trace and the prompt that produced it.

Setting temperature=0.0 inside pipeline_fn is not optional. Reflex compares variants by running the same inputs against the same pipeline; at provider default temperatures, the same prompt scores differently on different runs and the optimizer ends up chasing sampling noise instead of real gains.
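One way to see the problem before optimizing anything: measure how much the judge's score for a single fixed prompt moves across repeated runs. The run_and_judge helper below is a hypothetical stand-in for one pipeline execution plus judging:

import statistics

def score_spread(run_and_judge, prompt, questions, repeats=5):
    """Judge the SAME prompt several times; the spread is your noise floor."""
    per_run = [statistics.mean(run_and_judge(prompt, q) for q in questions)
               for _ in range(repeats)]
    return statistics.mean(per_run), statistics.stdev(per_run)

# At temperature 0 the spread collapses toward zero; at provider defaults,
# candidate-vs-candidate deltas smaller than this spread are just noise.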

The judge rubric is built around the failure mode that only the full trace reveals — the agent answering from training knowledge even when the right tool was available:

judge.md
Score the response from 1 to 5 based on the FULL PIPELINE TRACE shown above.

5 — Correct answer, fully grounded in tool results.
    For doc questions:  answer is drawn from search_docs output.
    For math questions: calculate was called and the stated figure
                        matches its output.
    For date questions: get_date was called and the arithmetic
                        is correct.

4 — Correct with one minor gap or reasonable inference from tools.

3 — Partially grounded. Some info from tools but also unsupported
    details, or misses a key figure the tool returned.

2 — Technically correct but the agent ignored available tools and
    answered from training knowledge.

1 — Contradicts tool results, fabricates details, or gives up when
    tools clearly contain the answer.

IMPORTANT: An answer that is factually correct but bypasses
available tools should score 2, not 4.

Then run Reflex from the command line. The pipeline's own model is baked into pipeline_fn, so there is no --model flag; --reasoning-model sets the model Reflex uses to reason about and rewrite candidate prompts, and --judge sets the model that scores the trace:

shell
aevyra-reflex optimize \
  --pipeline-file   pipeline.py \
  --inputs-file     questions.json \
  prompt.md \
  --reasoning-model openrouter/qwen/qwen3-8b \
  --judge           openrouter/qwen/qwen3-30b-a3b \
  --judge-criteria  judge.md \
  --strategy auto \
  -o best_prompt.md

The auto strategy picks between three optimization modes — structural (adds headers, checklists, and explicit phases to vague prompts), iterative (diagnoses the worst-scoring traces and proposes targeted revisions), and PDO (pairwise tournament for fine-tuning when the prompt is nearly correct) — and sequences them based on where the prompt currently is. Reflex splits the dataset 3-way (train / val / test) and always selects the prompt that held up best on the unseen val set. That makes overfitting visible: a prompt that wins training duels but degrades on val is dropped in favour of one that generalises.
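In pseudocode, the selection rule looks roughly like this — Reflex's internals aren't public API, so split_3way, propose, and evaluate are illustrative names:

def optimize(prompt, dataset, propose, evaluate, iterations=10):
    train, val, test = split_3way(dataset)          # hypothetical 3-way splitter
    best_prompt, best_val = prompt, evaluate(prompt, val)
    for _ in range(iterations):
        candidate = propose(best_prompt, train)     # structural / iterative / PDO step
        candidate_val = evaluate(candidate, val)    # judged on the unseen val split
        if candidate_val > best_val:                # a train-winner that degrades on
            best_prompt, best_val = candidate, candidate_val   # val is dropped here
    return best_prompt, evaluate(best_prompt, test) # report the held-out test score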

On the dev assistant example above, this ran in about 53 minutes against 30 questions, moved the test score from 0.65 to 0.725, and cost under $1. The full tutorial walks through the phase-by-phase logs — including an over-optimization checkpoint at iteration 6 where the prompt grew to 3,251 characters and train scores started dropping while val had already peaked, and how best-val selection recovers the right prompt.

The Honest Recommendation

| Your situation | Best approach |
|---|---|
| Single prompt, static input/output dataset, no tool use | Reflex standard mode. Bring your dataset, define your judge, let it iterate — no pipeline_fn needed. |
| Tool-calling agent where correctness depends on intermediate steps | Reflex pipeline mode. Wrap your agentic loop in pipeline_fn returning AgentTrace, mark the node to optimize, and let the judge score the full trace. |
| Building a multi-step pipeline from scratch, open to restructuring | DSPy with GEPA. The upfront investment in a declarative pipeline pays off for complex multi-hop reasoning or RAG at scale, and GEPA is the most mature trace-aware optimizer. |
| Need observability into an existing agentic system | LangSmith, DeepEval, or Braintrust for trace-level evaluation — they tell you where the pipeline falls down. Pair with Reflex pipeline mode to turn that signal into a prompt update. |
| Need to validate your judge criteria | Do this manually before you optimize anything. A bad metric sends any optimizer the wrong way. |
| Business logic and metric definitions | Human-in-the-loop. Propose changes, have a domain expert verify, then re-score. |

The Open Problem

Pipeline mode closes the trace-level optimization loop for existing agentic systems: the judge sees every tool call and result, the optimizer tunes a prompt against grounded traces rather than static outputs, and you don't have to rewrite your stack into a declarative framework to get there. But it deliberately doesn't close everything.

By design, Reflex optimizes only nodes you mark optimize=True. It doesn't rewrite your tool schemas, doesn't mutate your retrieval index, doesn't edit the judge rubric. Those stay fixed because they encode intent — what good means in your domain — and the moment the optimizer starts hill-climbing on them, Goodhart's Law takes over: the metric becomes easier to game rather than more accurate. Someone still has to define the rubric and the tools correctly and keep them stable. You've moved the human validation requirement from "review every output" to "validate the rubric once" — a much better investment — but you haven't eliminated it.

Joint, automatic optimization of prompts, retrieval, tool definitions, metrics, and business logic together — while preserving coherence — is still the frontier. Multi-agent systems amplify the problem. For the near-term, the practical answer is to pick the right layer: pipeline mode for prompts against a stable judge and stable tools, human-in-the-loop for everything else, and don't try to optimize the judge and the prompt in the same run.

The pattern that works right now

Wrap your agent in a pipeline_fn returning AgentTrace. Mark the node to optimize with optimize=True; everything else stays optimize=False so the judge sees it for grounding. Set temperature=0. Write a judge rubric that penalises tool-bypass, not just wrong answers. Run aevyra-reflex optimize --pipeline-file pipeline.py with --strategy auto. Let best-val selection pick the prompt. Validate rubric changes manually before propagating. Repeat.

Reflex pipeline mode is open source

Write a pipeline_fn, return an AgentTrace, and let Reflex re-run your pipeline against every candidate prompt.

pip install aevyra-reflex