JamJet

Research Guide

Run reproducible multi-agent experiments with JamJet — from scaffold to publication-ready results.

Multi-Agent Research with JamJet

JamJet gives researchers a complete experiment infrastructure out of the box: durable execution for reproducibility, six reasoning strategies for ablation studies, an experiment grid for parameter sweeps, built-in evaluation with custom scorers, and publication export (LaTeX, CSV, JSON) with statistical tests.

This guide walks through a complete research workflow — from project scaffold to paper-ready results.

Tip: Already familiar with JamJet? Jump to "Run experiments" or "Export for publication".


Setup

  1. Install JamJet

    pip install jamjet
  2. Scaffold a research project

    jamjet init my-study --template research
    cd my-study
  3. Review the scaffolded structure

    my-study/
    ├── agents/
    │   └── researcher.py       # Agent definition with tools
    ├── baselines/
    │   └── baseline.py         # Baseline comparison stubs
    ├── experiments/
    │   ├── config.yaml          # Model, seed, strategy config
    │   └── runner.py            # Experiment loop
    ├── evals/
    │   ├── dataset.jsonl        # Evaluation dataset
    │   └── scorers.py           # Custom scorer definitions
    ├── results/                 # Output directory (.gitkeep)
    ├── workflow.yaml            # Workflow definition
    └── README.md

Define your agents

Agents are Python functions decorated with @task. Each agent can use a different reasoning strategy.

from jamjet import task, tool

@tool
async def web_search(query: str) -> str:
    """Search the web for current information."""
    ...

@task(model="claude-sonnet-4-6", tools=[web_search])
async def researcher(question: str) -> str:
    """Research a question using web search."""

Six built-in strategies

JamJet compiles high-level strategy names into explicit IR sub-DAGs. Swap with a single parameter — same agent, different reasoning:

  • react — Reason → Act → Observe loop. Best for tool-heavy tasks.
  • plan_and_execute — Plan steps → execute each → synthesize. Best for multi-step decomposition.
  • critic — Generate → critique → revise. Best for quality-sensitive output.
  • reflection — Execute → reflect → gate → revise loop. Best for self-improving agents.
  • consensus — N agents → vote → judge → finalize. Best for reducing variance.
  • debate — Propose → counter → judge → settle loop. Best for adversarial reasoning.

# workflow.yaml — change strategy to compare
agents:
  researcher:
    model: claude-sonnet-4-6
    strategy: debate        # swap to: react, reflection, consensus...
    tools: [web_search]
    max_iterations: 6
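To make the react pattern above concrete, here is a minimal, self-contained sketch of a Reason → Act → Observe loop. This is illustrative only, not JamJet's internal implementation; `react_loop` and `toy_reason` are hypothetical names introduced for the example.

```python
from typing import Any, Callable, Dict, Optional, Tuple

def react_loop(
    question: str,
    reason: Callable[[str, list], Tuple[str, Optional[Tuple[str, Any]]]],
    tools: Dict[str, Callable[[Any], str]],
    max_iterations: int = 6,
) -> str:
    """Reason -> Act -> Observe: the model 'reasons', optionally picks a
    tool call ('acts'), and the tool result is fed back ('observes')."""
    observations: list = []
    answer = ""
    for _ in range(max_iterations):
        answer, action = reason(question, observations)
        if action is None:                 # model is confident enough to answer
            return answer
        name, args = action
        observations.append(tools[name](args))  # observe the tool result
    return answer                          # iteration budget exhausted

# Toy demo: one web search, then answer.
def toy_reason(question, obs):
    if not obs:
        return "searching...", ("web_search", question)
    return f"answer based on: {obs[-1]}", None

result = react_loop("capital of France?", toy_reason,
                    {"web_search": lambda q: "Paris"})
print(result)  # -> answer based on: Paris
```

The `max_iterations` parameter plays the same role as `max_iterations` in the workflow config: it caps how many Act/Observe rounds run before the loop returns its best answer.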

Run experiments

ExperimentGrid runs every combination of conditions and seeds as durable workflow executions. If a run crashes, it resumes from checkpoint — no re-running prior steps.

from jamjet.eval import ExperimentGrid

grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "plan_and_execute", "critic",
                      "reflection", "consensus", "debate"],
        "model": ["claude-sonnet-4-6", "gpt-4o"],
    },
    seeds=[42, 123, 456],
    dataset="evals/dataset.jsonl",
    scorers=["llm_judge", "factuality"],
)

results = await grid.run()
results.summary()  # Rich table in terminal

This runs 6 strategies × 2 models × 3 seeds = 36 durable executions, each with full event traces and checkpoints.

Note: Every execution is event-sourced. If the experiment crashes at run 22 of 36, it resumes from run 22 — no tokens wasted re-running completed conditions.
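The resume behavior can be sketched with a toy checkpoint: record the key of every finished run, and skip any key already recorded on restart. This is a simplification of JamJet's event-sourced runtime, introduced only to show the idea; `run_grid` here is a hypothetical helper, not the library API.

```python
import itertools

def run_grid(conditions, seeds, run_one, completed=None):
    """Run every condition x seed combination, skipping any key already in
    `completed` -- a toy stand-in for checkpointed resume."""
    completed = completed if completed is not None else set()
    results = {}
    for combo in itertools.product(*conditions.values()):
        for seed in seeds:
            key = (combo, seed)
            if key in completed:
                continue                  # resume: don't re-run finished work
            results[key] = run_one(dict(zip(conditions, combo)), seed)
            completed.add(key)
    return results

# 6 strategies x 2 models x 3 seeds = 36 runs; pretend 22 already finished.
conds = {"strategy": ["react", "plan_and_execute", "critic",
                      "reflection", "consensus", "debate"],
         "model": ["claude-sonnet-4-6", "gpt-4o"]}
all_keys = [(c, s) for c in itertools.product(*conds.values())
            for s in [42, 123, 456]]
done = set(all_keys[:22])
fresh = run_grid(conds, [42, 123, 456], lambda cfg, seed: 0.0, completed=done)
print(len(fresh))  # 14 remaining runs
```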


Evaluate results

Built-in scorers

JamJet includes four scorer types out of the box:

  • llm_judge — LLM evaluates output against a rubric (0-5 scale)
  • assertion — Boolean check against output structure or content
  • latency — Scores based on execution time vs threshold
  • cost — Scores based on token cost vs budget
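To illustrate the threshold-based scorers (latency, cost), here is one plausible scoring curve: full marks under the threshold, decaying linearly to zero at twice the threshold. The actual JamJet scorers may use a different curve; this is an assumption made for the example, and `latency_score` is a hypothetical name.

```python
def latency_score(elapsed_s: float, threshold_s: float) -> float:
    """Hypothetical threshold scorer: 1.0 at or under the threshold,
    decaying linearly to 0.0 at twice the threshold."""
    if elapsed_s <= threshold_s:
        return 1.0
    return max(0.0, 1.0 - (elapsed_s - threshold_s) / threshold_s)

print(latency_score(0.8, 1.0))  # 1.0 -- under threshold
print(latency_score(1.5, 1.0))  # 0.5 -- halfway to the 2x cutoff
```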

Custom scorers

Register domain-specific scorers with the @scorer decorator:

from jamjet import scorer
from jamjet.eval import ScorerResult

@scorer(name="factuality", description="Check factual accuracy")
async def factuality_scorer(input: dict, output: dict, context: dict) -> ScorerResult:
    # Your custom scoring logic here
    return ScorerResult(
        score=0.85,
        passed=True,
        reason="All claims verified against source",
    )

Custom scorers are automatically available in ExperimentGrid and EvalNode by name.

Eval nodes in workflows

Evaluation can also run inline as a workflow node — during execution, not after:

nodes:
  check-quality:
    type: eval
    scorers:
      - type: llm_judge
        rubric: "Is the response accurate and well-sourced?"
        min_score: 4
    on_fail: retry_with_feedback
    max_retries: 2

When the score is too low, retry_with_feedback re-runs the upstream node with the scorer's reasoning as additional context.
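The retry_with_feedback mechanism can be sketched as an eval-gated loop: generate, judge, and on failure regenerate with the judge's reasoning appended to the prompt. `eval_gated`, `gen`, and `judge` below are hypothetical stand-ins, not JamJet APIs.

```python
def eval_gated(generate, judge, prompt, min_score=4, max_retries=2):
    """Re-run `generate` with the judge's reason as feedback until the
    score clears `min_score` or retries are exhausted."""
    feedback = None
    output = ""
    for _ in range(max_retries + 1):
        output = generate(prompt if feedback is None
                          else f"{prompt}\n\nReviewer feedback: {feedback}")
        score, reason = judge(output)
        if score >= min_score:
            return output
        feedback = reason          # retry_with_feedback: feed reasoning back
    return output                  # retries exhausted; return best effort

# Toy demo: first draft fails the rubric, second passes.
attempts = []
def gen(p):
    attempts.append(p)
    return f"draft {len(attempts)}"
def judge(out):
    return (5, "ok") if out == "draft 2" else (2, "cite sources")

final = eval_gated(gen, judge, "Summarize X")
print(final)  # -> draft 2
```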


Export for publication

LaTeX tables

results.to_latex("table1.tex", caption="Strategy comparison across models")

Outputs a booktabs-formatted table with mean ± std per condition — ready to \input{} in your paper.
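For orientation, the exported table has roughly this booktabs shape (the exact layout may differ, and the numbers below are made up for illustration):

```latex
% Illustrative shape of the export -- values are invented.
\begin{table}[t]
  \centering
  \caption{Strategy comparison across models}
  \begin{tabular}{lcc}
    \toprule
    Strategy & claude-sonnet-4-6 & gpt-4o \\
    \midrule
    react  & $0.71 \pm 0.04$ & $0.68 \pm 0.05$ \\
    debate & $0.79 \pm 0.03$ & $0.74 \pm 0.04$ \\
    \bottomrule
  \end{tabular}
\end{table}
```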

CSV for R / pandas

results.to_csv("results.csv")

JSON for further analysis

results.to_json("results.json")

Statistical comparison

Compare any two conditions with built-in significance tests:

comparison = results.compare("debate", "react", test="auto", alpha=0.05)

Returns a ComparisonResult with:

  • test_name — Test used (welch, wilcoxon, mann_whitney)
  • statistic — Test statistic
  • p_value — p-value
  • effect_size — Cohen's d
  • ci_lower, ci_upper — 95% confidence interval
  • significant — True if p < alpha
  • mean_a, mean_b — Group means

Available tests:

  • welch — Welch's t-test (default, independent samples)
  • wilcoxon — Wilcoxon signed-rank (paired samples)
  • mann_whitney — Mann-Whitney U (non-parametric, independent)
  • auto — Picks based on sample size and pairing
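For intuition about what the welch test reports, here is a stdlib-only sketch of the Welch t statistic and a pooled-SD Cohen's d. This is not JamJet code; `results.compare` also returns a p-value, which needs a t-distribution CDF that the standard library does not provide, so it is omitted here.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic (unequal variances) plus Cohen's d
    computed with the pooled sample standard deviation."""
    ma, mb = mean(a), mean(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2        # sample variances (n-1)
    na, nb = len(a), len(b)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / pooled                       # Cohen's d
    return t, d

# Toy per-seed scores for two conditions.
debate = [0.81, 0.78, 0.80]
react  = [0.70, 0.73, 0.69]
t, d = welch_t(debate, react)
print(round(t, 2), round(d, 2))
```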

CLI export

jamjet eval export results.json --format latex --caption "Table 1"
jamjet eval compare results.json --conditions "debate,react" --test auto

Replay and fork

Exact replay

Reproduce any execution from its checkpoint — same inputs, same execution path:

jamjet replay exec_abc123

The runtime fetches the original execution events, extracts the workflow and input, and creates a new execution with the same parameters. Useful for verification and debugging.

Fork for ablation studies

Fork from a completed execution with modified inputs — change one variable while keeping everything else identical:

jamjet fork exec_abc123 --override-input '{"model": "gpt-4o"}'

The original input is preserved and the override is merged in. This is the fastest path to ablation studies — no re-configuring, no re-running the full grid.
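Assuming the merge is a shallow key-by-key overlay (the most common semantics for this kind of override; the exact behavior is JamJet's to define), the fork-time input merge looks like this:

```python
# Shallow merge: override keys win, everything else is preserved.
original = {"model": "claude-sonnet-4-6", "strategy": "debate", "seed": 42}
override = {"model": "gpt-4o"}
forked = {**original, **override}
print(forked)
# {'model': 'gpt-4o', 'strategy': 'debate', 'seed': 42}
```

Only `model` changes; `strategy` and `seed` carry over from the original execution, which is exactly what makes a fork a one-variable ablation.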


Provenance

Every node completion in JamJet carries ProvenanceMetadata:

  • model_id — Which model produced this output
  • model_version — Model version string
  • confidence — Self-reported confidence (0.0–1.0)
  • verified — Whether the output passed verification
  • source — Origin identifier

Provenance is attached automatically — no extra configuration needed. It flows through to event traces, audit logs, and exported results.
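As a reading aid, the documented fields map naturally onto a dataclass. The real ProvenanceMetadata class may differ in shape; this sketch only restates the fields listed above.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceMetadata:
    model_id: str          # which model produced this output
    model_version: str     # model version string
    confidence: float      # self-reported confidence, 0.0-1.0
    verified: bool         # whether the output passed verification
    source: str            # origin identifier

p = ProvenanceMetadata("claude-sonnet-4-6", "2025-05", 0.92, True,
                       "node:researcher")
print(p.verified)  # True
```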

Tip: Provenance metadata is especially valuable for multi-model experiments where you need to attribute which model produced which intermediate result.


Inspect workflows

Use jamjet inspect to examine compiled workflows, including strategy-specific details:

jamjet inspect workflow.yaml

For strategy-aware agents, this shows:

  • Strategy name and type
  • Iteration count and plan steps
  • Critic verdicts (for critic/debate strategies)
  • Cost per iteration

Example: complete research afternoon

from jamjet.eval import ExperimentGrid

# 1. Define the grid
grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "plan_and_execute", "critic",
                      "reflection", "consensus", "debate"],
    },
    seeds=[42, 123, 456],
    dataset="evals/dataset.jsonl",
    scorers=["llm_judge"],
)

# 2. Run (durable — resumes on crash)
results = await grid.run()

# 3. Export LaTeX table
results.to_latex("table1.tex", caption="Strategy comparison")

# 4. Statistical test
comp = results.compare("debate", "react")
print(f"p = {comp.p_value:.4f}, Cohen's d = {comp.effect_size:.2f}")

# 5. Replay a specific run
# $ jamjet replay exec_debate_seed42

# 6. Fork for ablation
# $ jamjet fork exec_debate_seed42 --override-input '{"model":"gpt-4o"}'

Next steps

  • Browse research examples — DCI deliberation and LDP routing patterns
  • See the /research page for capabilities overview
  • Read the eval harness docs for deeper eval configuration
  • Try jamjet init my-study --template research to get started
