A single LLM call doesn’t always produce output that meets a precise quality bar — syllable counts, JSON schemas, style guides, factual constraints. The evaluator_loop node type solves this by running a generator/evaluator cycle inside a single DAG node: the generator produces a candidate, the evaluator scores it, and if the evaluator rejects it, the generator tries again with the feedback in context. This repeats until the evaluator approves or max_iterations is reached.

How the loop works

Each iteration consists of exactly two LLM calls:
  1. Generator call — produces a candidate output. On the first iteration it receives only the original prompt and DAG inputs. On subsequent iterations it also receives previous_output and evaluator_feedback.
  2. Evaluator call — receives the candidate and must return a JSON object with at least {"approved": bool, "feedback": str}. If approved, the loop exits and the final candidate becomes the node’s artifact. If not approved, the cycle repeats.
The loop exits when either the evaluator returns "approved": true or max_iterations attempts have been exhausted. The final artifact is always the last generator output, regardless of whether the evaluator ultimately approved it.
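The control flow above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: `call_generator` and `call_evaluator` are hypothetical stand-ins for the two LLM calls per iteration.

```python
def evaluator_loop(prompt, inputs, call_generator, call_evaluator, max_iterations=3):
    """Sketch of the generator/evaluator cycle (hypothetical helpers)."""
    previous_output, feedback = None, None
    for iteration in range(1, max_iterations + 1):
        # Iteration 1 sees only the prompt and inputs; retries also get
        # previous_output and evaluator_feedback in context.
        candidate = call_generator(
            prompt, inputs, iteration,
            previous_output=previous_output,
            evaluator_feedback=feedback,
        )
        verdict = call_evaluator(candidate, inputs, iteration)
        if verdict.get("approved"):
            return candidate  # evaluator approved: exit early
        previous_output, feedback = candidate, verdict.get("feedback", "")
    # max_iterations exhausted: the last candidate is still the artifact.
    return candidate
```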

Complete example: haiku with syllable enforcement

The haiku_evaluator.yaml example uses a cheap Haiku model to generate the poem and a more capable Sonnet model to rigorously check the 5-7-5 syllable constraint — a judgment call that’s harder to get right than the generation itself.
name: haiku_with_evaluator
description: Write a haiku on a topic; a separate evaluator checks the form and iterates until approved.

budget:
  max_tokens: 20000
  max_usd: 2.00

nodes:
  - id: haiku
    type: evaluator_loop
    max_iterations: 3          # at most 3 generator+evaluator call pairs

    generator:
      model: claude-haiku-4-5-20251001
      max_output_tokens: 400
      prompt: |
        Write a haiku about "{{ topic }}".

        A haiku is a three-line poem with the syllable structure 5-7-5.
        Return only the three lines, nothing else.

        {% if iteration > 1 %}
        Your previous attempt:
        {{ previous_output }}

        Evaluator feedback:
        {{ evaluator_feedback }}

        Revise the haiku based on the feedback.
        {% endif %}

    evaluator:
      model: claude-sonnet-4-6
      max_output_tokens: 500
      prompt: |
        You are a strict haiku reviewer. Evaluate the following candidate
        against these criteria:
          1. Exactly three lines.
          2. Syllable structure 5-7-5 (first line 5, second 7, third 5).
          3. Evokes the topic "{{ topic }}" with imagery, not abstract words.

        Candidate:
        {{ candidate }}

        Reply with ONLY a JSON object of the form:
        {"approved": true|false, "feedback": "<short critique>", "score": <0.0 to 1.0>}

        Do not include any other text. Do not wrap the JSON in backticks.
Run it:
agentgraph run haiku_evaluator.yaml --input topic="morning fog"

Template variables reference

Generator prompt variables

Variable                      Available        Description
{{ topic }} (any DAG input)   All iterations   Standard DAG inputs and upstream node outputs
{{ iteration }}               All iterations   Current iteration number, 1-indexed
{{ previous_output }}         Iteration 2+     The generator's output from the previous iteration
{{ evaluator_feedback }}      Iteration 2+     The feedback field from the evaluator's JSON response
Use {% if iteration > 1 %} guards to keep the first iteration’s prompt clean and only inject feedback context on retries.
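These are Jinja-style templates, so the guard can be exercised directly. The sketch below uses the third-party jinja2 library and an abridged version of the generator prompt above to show how the first render stays clean while retries pick up the feedback context:

```python
from jinja2 import Template  # third-party; matches the template syntax above

# Abridged generator prompt with the iteration guard.
tmpl = Template(
    'Write a haiku about "{{ topic }}".\n'
    "{% if iteration > 1 %}"
    "Previous attempt:\n{{ previous_output }}\n"
    "Feedback:\n{{ evaluator_feedback }}\n"
    "{% endif %}"
)

# Iteration 1: no feedback context is injected.
first = tmpl.render(topic="morning fog", iteration=1)

# Iteration 2+: previous output and feedback are in scope.
retry = tmpl.render(
    topic="morning fog",
    iteration=2,
    previous_output="Fog rolls over hills",
    evaluator_feedback="line 2 has 8 syllables",
)
```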

Evaluator prompt variables

Variable           Description
{{ candidate }}    The generator's current output (the text to evaluate)
{{ iteration }}    Current iteration number, 1-indexed
Any DAG input      Topic, constraints, or other context from the original inputs

Evaluator JSON contract

The evaluator must return a JSON object. The minimum required shape is:
{"approved": true, "feedback": "Syllable counts correct, strong imagery."}
You can add extra fields like score; agentgraph ignores them, but they appear in traces. If the evaluator returns malformed JSON or non-JSON text, agentgraph treats the iteration as not approved and uses the raw text as feedback for the next round.
Instruct your evaluator prompt to return only the JSON object with no surrounding text, backticks, or markdown fencing. Wrapping the JSON in a code block is the most common cause of parse failures.
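The fallback behavior can be sketched as a lenient parser. This is an illustration of the contract described above, not the tool's actual code: any response that fails to parse as a JSON object counts as a rejection, with the raw text carried forward as feedback.

```python
import json

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Sketch: parse an evaluator response per the JSON contract.

    Malformed JSON (including code-fenced JSON) is treated as
    not approved, and the raw text becomes the next round's feedback.
    """
    try:
        obj = json.loads(raw)
        return bool(obj.get("approved")), str(obj.get("feedback", ""))
    except (json.JSONDecodeError, AttributeError):
        return False, raw
```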

Cost and iteration budget

Each iteration consumes one generator call plus one evaluator call. With max_iterations: 3 and an approval on the second attempt, you pay for four LLM calls in total (two iterations × two calls each); the third iteration never runs.
Use a cheaper, faster model for the evaluator when your evaluation criteria are well-defined and mechanical (counting, schema validation, format checks). Reserve the more capable model for generation. In the haiku example, Haiku generates and Sonnet evaluates — the reverse of what you might expect — because syllable counting is harder than haiku writing for a smaller model.
Per-node budget caps apply across all iterations combined. Set a budget on the node if you want a hard ceiling on how much a single evaluator loop can spend:
- id: haiku
  type: evaluator_loop
  max_iterations: 5
  budget:
    max_tokens: 5000   # covers all gen+eval calls across all 5 iterations
  generator:
    ...
  evaluator:
    ...

Building your own evaluator prompt

The evaluator prompt is the most important part of the pattern. Keep these principles in mind:
  • Be specific about criteria. Vague instructions like “make it good” produce inconsistent approved decisions. List numbered, testable criteria.
  • Ask for a reason even when approving. The feedback field on an approved response still gets stored in the trace and helps you understand what worked.
  • Tell the evaluator what to ignore. If you only care about format and not style, say so explicitly to prevent false rejections.

Parallel agents

Run independent evaluator loops in parallel across multiple candidates.

Multi-provider fallback

Keep evaluator loops running even during provider outages with fallback chains.