Control workflow costs with budgets and retry policies

dagraph gives you two independent mechanisms to keep workflows safe: budget caps that stop a run the moment it would exceed your spending limit, and retry policies that automatically recover from transient provider errors like rate limits or network timeouts. Both work at the DAG level (applying to the whole run) and at the node level (applying to a single node’s total spend or failure handling).

Budget caps

DAG-level budget

Set a budget: field at the top of your YAML to apply a global cap to the entire run. If the cap is exceeded at any point — mid-node, mid-iteration — dagraph raises BudgetExceededError and stops immediately. No additional charges accumulate after the cap is hit.

name: research
budget:
  max_tokens: 50000
  max_usd: 2.00

nodes:
  - id: researcher
    type: agent
    model: claude-haiku-4-5-20251001
    prompt: "Research {{ topic }}."

budget.max_tokens (integer)

Maximum billable tokens for the run. Billable tokens are input tokens plus output tokens. Cache read and cache write tokens are excluded — this matters on the claude_code backend, which loads a large system-prompt cache for each subprocess. That overhead does not count against your token cap.

budget.max_usd (number)

Maximum spend in US dollars. The engine calculates cost using published per-million-token prices for each model. You can set max_tokens, max_usd, or both — each is checked independently.
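
To make the two limits concrete, here is a minimal Python sketch of how billable tokens and dollar cost could be computed and checked independently. It is not dagraph's implementation: the Usage class, the helper names, the example-model entry, and its prices are all hypothetical, and the sketch assumes cache tokens are left out of the dollar figure as well, which the text above only states for the token cap.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. Real values come from the
# published pricing table for each model.
PRICE_PER_MTOK = {"example-model": {"input": 1.00, "output": 5.00}}


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0   # recorded, but not counted against max_tokens
    cache_write_tokens: int = 0  # recorded, but not counted against max_tokens


def billable_tokens(usage: Usage) -> int:
    """Input plus output tokens; cache reads and writes are excluded."""
    return usage.input_tokens + usage.output_tokens


def cost_usd(usage: Usage, model: str) -> float:
    """Dollar cost from per-million-token prices (assumption: cache ignored)."""
    price = PRICE_PER_MTOK[model]
    total = (usage.input_tokens * price["input"]
             + usage.output_tokens * price["output"])
    return total / 1_000_000


def exceeded_limits(usage, model, max_tokens=None, max_usd=None):
    """Return the names of the limits that the usage has gone past."""
    hit = []
    if max_tokens is not None and billable_tokens(usage) > max_tokens:
        hit.append("max_tokens")
    if max_usd is not None and cost_usd(usage, model) > max_usd:
        hit.append("max_usd")
    return hit
```

With these made-up prices, a run that used 40,000 input tokens, 12,000 output tokens, and 900,000 cache-read tokens would trip a max_tokens: 50000 cap (52,000 billable tokens) while staying far below max_usd: 2.00, which is why setting both limits can be worthwhile.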

The budget cap is a hard stop. When BudgetExceededError is raised, the run ends immediately. Completed node outputs are preserved in the artifact store, but the run cannot be resumed. Design your DAG-level caps with enough headroom that normal runs stay well under the limit.

Per-node budget

Override the budget for a single node by adding a budget: field directly on the node. This is useful when one expensive node (such as a synthesizer using claude-opus-4-7) should be capped independently while lighter nodes share the DAG-level budget. A per-node cap covers the total spend of everything that node does — for an evaluator_loop with three iterations, the cap applies across all six LLM calls (three generator + three evaluator calls combined).

budget:
  max_tokens: 100000
  max_usd: 5.00

nodes:
  - id: cheap_researcher
    type: agent
    model: claude-haiku-4-5-20251001
    prompt: "Summarize {{ topic }} briefly."

  - id: expensive_synthesizer
    type: agent
    model: claude-opus-4-7
    depends_on: [cheap_researcher]
    budget:
      max_usd: 1.50       # This node alone won't exceed $1.50
    prompt: "Write a detailed report based on: {{ cheap_researcher }}"
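
The cumulative behavior of a per-node cap can be pictured with a short Python sketch. This is an illustration under assumptions, not dagraph's code: NodeBudget and run_evaluator_loop are hypothetical names, BudgetExceededError is a local stand-in for the error named above, and every call is given the same made-up cost.

```python
class BudgetExceededError(Exception):
    """Local stand-in for the error named in the docs."""


class NodeBudget:
    """Accumulates spend for one node across all of its LLM calls."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0
        self.calls = 0

    def charge(self, usd: float) -> None:
        """Record one call and stop as soon as the running total passes the cap."""
        self.spent_usd += usd
        self.calls += 1
        if self.spent_usd > self.max_usd:
            raise BudgetExceededError(
                f"node spent ${self.spent_usd:.2f} after {self.calls} calls, "
                f"cap is ${self.max_usd:.2f}"
            )


def run_evaluator_loop(budget: NodeBudget, iterations: int, cost_per_call: float) -> int:
    """Charge one generator call and one evaluator call per iteration."""
    for _ in range(iterations):
        budget.charge(cost_per_call)  # generator call
        budget.charge(cost_per_call)  # evaluator call
    return budget.calls
```

Under a $1.50 cap, three iterations at a hypothetical $0.25 per call finish all six calls, while at $0.50 per call the fourth call pushes the total to $2.00 and the node stops there.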

If you use claude_code as your backend, the USD cost shown is an approximation based on the API pricing table. The claude_code backend draws from your Claude Code subscription plan, not your API balance. Use max_tokens for a more reliable cap on that backend.

Retry policies

Configuring retries

Add a retry: field to any node to automatically re-run it on failure. Without retry:, every node runs exactly once.

- id: fetch_summary
  type: agent
  model: claude-sonnet-4-6
  retry:
    max_attempts: 3
    backoff_seconds: 5
    retry_on: ["*"]
  prompt: "Summarize the latest news about {{ topic }}."

retry.max_attempts (integer, default: 1)

Total number of attempts including the first. Minimum 1, maximum 20. max_attempts: 3 means one initial attempt and up to two retries.

retry.backoff_seconds (number, default: 0)

Seconds to wait between attempts. Use a non-zero value when retrying after rate-limit errors so you don’t immediately hit the same limit again.

retry.retry_on (string[], default: ["*"])

A list of exception class names to retry on. "*" matches any exception. Names are matched against the full class hierarchy, so ["OSError"] also catches FileNotFoundError. Narrow this list to avoid retrying on errors that indicate bad inputs or invalid prompts.
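
One plausible way to match names against the class hierarchy in Python is to walk the exception's method resolution order. The sketch below shows that idea only; matches_retry_on is a hypothetical helper, not a dagraph function.

```python
def matches_retry_on(exc: BaseException, retry_on: list[str]) -> bool:
    """Return True if exc, or any of its base classes, is named in retry_on."""
    if "*" in retry_on:
        return True
    # type(exc).__mro__ lists the exception's class followed by every ancestor.
    ancestry = {cls.__name__ for cls in type(exc).__mro__}
    return not ancestry.isdisjoint(retry_on)
```

Because FileNotFoundError is a subclass of OSError in Python, matches_retry_on(FileNotFoundError(), ["OSError"]) is true, while a ValueError does not match that list.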

Exceptions that are never retried

Two exceptions bypass retry_on entirely and are never retried, regardless of your configuration:

BudgetExceededError — the budget cap is a hard limit; retrying would immediately exceed it again.
ApprovalPending — this is a control-flow signal from an approval_gate node, not a failure.
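
The retry rules described in this section can be sketched as a small Python loop. This is an illustration under assumptions rather than dagraph's implementation: run_with_retry and its sleep parameter are hypothetical, and the two exception classes are local stand-ins for the ones named above.

```python
import time


class BudgetExceededError(Exception):
    """Local stand-in for the hard budget stop."""


class ApprovalPending(Exception):
    """Local stand-in for the approval_gate control-flow signal."""


NEVER_RETRIED = (BudgetExceededError, ApprovalPending)


def run_with_retry(fn, max_attempts=1, backoff_seconds=0.0,
                   retry_on=("*",), sleep=time.sleep):
    """Call fn up to max_attempts times, waiting backoff_seconds between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NEVER_RETRIED:
            raise  # bypasses retry_on entirely
        except Exception as exc:
            names = {cls.__name__ for cls in type(exc).__mro__}
            retryable = "*" in retry_on or not names.isdisjoint(retry_on)
            if not retryable or attempt == max_attempts:
                raise  # not configured for retry, or attempts exhausted
            sleep(backoff_seconds)
```

With max_attempts=3 and backoff_seconds=5, a call that fails twice with a timeout and then succeeds waits five seconds after each failure and returns on the third attempt; a BudgetExceededError propagates after the first attempt regardless of retry_on.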

For transient provider errors (rate limits, network timeouts, intermittent 5xx responses), use retry_on: ["*"] with max_attempts: 3 and backoff_seconds: 5. This covers the common case without requiring you to know the exact exception class name.

Conditional execution

Use the when: field on any node to skip it entirely based on inputs or upstream outputs. The expression is a Jinja template evaluated against all inputs and the outputs of completed upstream nodes. If the expression is truthy, the node runs normally. If falsy, the node is skipped — its status is set to skipped and an empty string is passed downstream.

nodes:
  - id: check_topic
    type: agent
    model: claude-haiku-4-5-20251001
    prompt: "Is '{{ topic }}' a valid research topic? Reply yes or no."

  - id: deep_research
    type: agent
    model: claude-sonnet-4-6
    depends_on: [check_topic]
    when: "{{ topic | length > 10 }}"
    prompt: "Research {{ topic }} in depth."

Skipped nodes are not retried and do not consume budget. Downstream nodes that depend on a skipped node still run — they just receive an empty string for that node’s output variable.

Putting it all together

The following example combines a DAG-level budget, a per-node budget override, and a retry policy on the node most likely to hit rate limits:

name: resilient_research
description: Research with cost controls and automatic retries.

budget:
  max_tokens: 60000
  max_usd: 3.00

nodes:
  - id: research_a
    type: agent
    model: claude-haiku-4-5-20251001
    retry:
      max_attempts: 3
      backoff_seconds: 5
      retry_on: ["*"]
    prompt: "Research '{{ topic }}' — technical angle. 5-8 bullets."

  - id: research_b
    type: agent
    model: claude-haiku-4-5-20251001
    retry:
      max_attempts: 3
      backoff_seconds: 5
      retry_on: ["*"]
    prompt: "Research '{{ topic }}' — economic angle. 5-8 bullets."

  - id: synthesizer
    type: agent
    model: claude-sonnet-4-6
    depends_on: [research_a, research_b]
    budget:
      max_usd: 1.00
    prompt: |
      Synthesize these research findings into a unified report.
      Technical: {{ research_a }}
      Economic: {{ research_b }}
