Research Guide
Run reproducible multi-agent experiments with JamJet — from scaffold to publication-ready results.
Multi-Agent Research with JamJet
JamJet gives researchers a complete experiment infrastructure out of the box: durable execution for reproducibility, six reasoning strategies for ablation studies, an experiment grid for parameter sweeps, built-in evaluation with custom scorers, and publication export (LaTeX, CSV, JSON) with statistical tests.
This guide walks through a complete research workflow — from project scaffold to paper-ready results.
Tip: Already familiar with JamJet? Jump to Run experiments or Export for publication.
Setup
1. Install JamJet

   pip install jamjet

2. Scaffold a research project

   jamjet init my-study --template research
   cd my-study

3. Review the scaffolded structure

   my-study/
   ├── agents/
   │   └── researcher.py       # Agent definition with tools
   ├── baselines/
   │   └── baseline.py         # Baseline comparison stubs
   ├── experiments/
   │   ├── config.yaml         # Model, seed, strategy config
   │   └── runner.py           # Experiment loop
   ├── evals/
   │   ├── dataset.jsonl       # Evaluation dataset
   │   └── scorers.py          # Custom scorer definitions
   ├── results/                # Output directory (.gitkeep)
   ├── workflow.yaml           # Workflow definition
   └── README.md
Define your agents
Agents are Python functions decorated with @task. Each agent can use a different reasoning strategy.
from jamjet import task, tool
@tool
async def web_search(query: str) -> str:
    """Search the web for current information."""
    ...

@task(model="claude-sonnet-4-6", tools=[web_search])
async def researcher(question: str) -> str:
    """Research a question using web search."""
    ...

Six built-in strategies
JamJet compiles high-level strategy names into explicit IR sub-DAGs. Swap with a single parameter — same agent, different reasoning:
| Strategy | Pattern | Best for |
|---|---|---|
| react | Reason → Act → Observe loop | Tool-heavy tasks |
| plan_and_execute | Plan steps → execute each → synthesize | Multi-step decomposition |
| critic | Generate → critique → revise | Quality-sensitive output |
| reflection | Execute → reflect → gate → revise loop | Self-improving agents |
| consensus | N agents → vote → judge → finalize | Reducing variance |
| debate | Propose → counter → judge → settle loop | Adversarial reasoning |
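For intuition, the react pattern can be sketched as a plain loop. This is an illustrative stand-in with a stubbed model and tool (the names react_loop, fake_model, and fake_tools are hypothetical); JamJet actually compiles the strategy into an explicit IR sub-DAG rather than running a Python loop like this:

```python
# Illustrative ReAct loop: reason, act, observe, repeat until a final answer.
def react_loop(question, model, tools, max_iterations=6):
    observations = []
    for _ in range(max_iterations):
        # Reason: the model chooses an action given what it has seen so far
        thought = model(question, observations)
        if thought["action"] == "final_answer":
            return thought["content"]
        # Act + Observe: run the chosen tool and record its result
        result = tools[thought["action"]](thought["content"])
        observations.append(result)
    return None  # iteration budget exhausted

# Stubbed model: search once, then answer from the observation
def fake_model(question, observations):
    if not observations:
        return {"action": "web_search", "content": question}
    return {"action": "final_answer", "content": f"Answer based on: {observations[0]}"}

fake_tools = {"web_search": lambda q: f"results for '{q}'"}
answer = react_loop("What is JamJet?", fake_model, fake_tools)
```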
# workflow.yaml — change strategy to compare
agents:
  researcher:
    model: claude-sonnet-4-6
    strategy: debate   # swap to: react, reflection, consensus...
    tools: [web_search]
    max_iterations: 6

Run experiments
ExperimentGrid runs every combination of conditions and seeds as durable workflow executions. If a run crashes, it resumes from checkpoint — no re-running prior steps.
from jamjet.eval import ExperimentGrid
grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "plan_and_execute", "critic",
                     "reflection", "consensus", "debate"],
        "model": ["claude-sonnet-4-6", "gpt-4o"],
    },
    seeds=[42, 123, 456],
    dataset="evals/dataset.jsonl",
    scorers=["llm_judge", "factuality"],
)

results = await grid.run()
results.summary()  # Rich table in terminal

This runs 6 strategies × 2 models × 3 seeds = 36 durable executions, each with full event traces and checkpoints.
Note: Every execution is event-sourced. If the experiment crashes at run 22 of 36, it resumes from run 22 — no tokens wasted re-running completed conditions.
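The grid arithmetic itself is just a Cartesian product of conditions and seeds. A standalone sketch of the expansion (plain Python, not JamJet internals):

```python
from itertools import product

conditions = {
    "strategy": ["react", "plan_and_execute", "critic",
                 "reflection", "consensus", "debate"],
    "model": ["claude-sonnet-4-6", "gpt-4o"],
}
seeds = [42, 123, 456]

# One run per (strategy, model, seed) combination
runs = [
    dict(zip(conditions, combo), seed=seed)
    for combo in product(*conditions.values())
    for seed in seeds
]
print(len(runs))  # 6 strategies × 2 models × 3 seeds = 36
```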
Evaluate results
Built-in scorers
JamJet includes four scorer types out of the box:
- llm_judge — LLM evaluates output against a rubric (0-5 scale)
- assertion — Boolean check against output structure or content
- latency — Scores based on execution time vs threshold
- cost — Scores based on token cost vs budget
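As an illustration of how a threshold-style scorer like latency might map a raw measurement onto a score, here is a self-contained sketch. The decay formula and the latency_score name are assumptions for illustration, not JamJet's documented behavior:

```python
def latency_score(elapsed_s: float, threshold_s: float) -> dict:
    """Score 1.0 at or below threshold, decaying linearly to 0.0 at 2x threshold."""
    if elapsed_s <= threshold_s:
        score = 1.0
    else:
        score = max(0.0, 1.0 - (elapsed_s - threshold_s) / threshold_s)
    return {"score": round(score, 2), "passed": elapsed_s <= threshold_s}

print(latency_score(1.5, 2.0))  # comfortably under threshold
print(latency_score(3.0, 2.0))  # halfway between threshold and 2x threshold
```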
Custom scorers
Register domain-specific scorers with the @scorer decorator:
from jamjet import scorer
from jamjet.eval import ScorerResult
@scorer(name="factuality", description="Check factual accuracy")
async def factuality_scorer(input: dict, output: dict, context: dict) -> ScorerResult:
    # Your custom scoring logic here
    return ScorerResult(
        score=0.85,
        passed=True,
        reason="All claims verified against source",
    )

Custom scorers are automatically available in ExperimentGrid and EvalNode by name.
Eval nodes in workflows
Evaluation can also run inline as a workflow node — during execution, not after:
nodes:
  check-quality:
    type: eval
    scorers:
      - type: llm_judge
        rubric: "Is the response accurate and well-sourced?"
        min_score: 4
    on_fail: retry_with_feedback
    max_retries: 2

When the score is too low, retry_with_feedback re-runs the upstream node with the scorer's reasoning as additional context.
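That retry-with-feedback control flow can be sketched in plain Python. The node, scorer, and run_with_feedback names here are stubs for illustration; the real behavior lives inside JamJet's eval node:

```python
def run_with_feedback(node, score_fn, max_retries=2):
    """Re-run `node`, feeding the scorer's reasoning back in on each failure."""
    feedback = None
    for _ in range(max_retries + 1):
        output = node(feedback)
        result = score_fn(output)
        if result["passed"]:
            return output
        feedback = result["reason"]  # becomes extra context for the retry
    return output  # retries exhausted; return the last attempt

# Stub node that only succeeds once it has seen feedback
def stub_node(feedback):
    return "sourced answer" if feedback else "unsourced answer"

def stub_scorer(output):
    ok = output.split()[0] == "sourced"
    return {"passed": ok, "reason": "Add citations to your sources."}

print(run_with_feedback(stub_node, stub_scorer))
```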
Export for publication
LaTeX tables
results.to_latex("table1.tex", caption="Strategy comparison across models")

Outputs a booktabs-formatted table with mean ± std per condition — ready to \input{} in your paper.
CSV for R / pandas
results.to_csv("results.csv")

JSON for further analysis
results.to_json("results.json")

Statistical comparison
Compare any two conditions with built-in significance tests:
comparison = results.compare("debate", "react", test="auto", alpha=0.05)

Returns a ComparisonResult with:
| Field | Description |
|---|---|
| test_name | Test used (welch, wilcoxon, mann_whitney) |
| statistic | Test statistic |
| p_value | p-value |
| effect_size | Cohen's d |
| ci_lower, ci_upper | 95% confidence interval |
| significant | True if p < alpha |
| mean_a, mean_b | Group means |
Available tests:
- welch — Welch's t-test (default, independent samples)
- wilcoxon — Wilcoxon signed-rank (paired samples)
- mann_whitney — Mann-Whitney U (non-parametric, independent)
- auto — Picks based on sample size and pairing
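To make the numbers concrete, here is what Welch's t statistic and Cohen's d work out to in pure Python — a standalone illustration of the math on made-up per-seed scores, not JamJet's implementation:

```python
from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    # Welch's t: independent samples, unequal variances allowed
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def cohens_d(a, b):
    # Pooled standard deviation, weighted by degrees of freedom
    pooled = sqrt(((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b))
                  / (len(a) + len(b) - 2))
    return (mean(a) - mean(b)) / pooled

# Hypothetical llm_judge means per seed for two strategies
debate = [4.2, 4.5, 4.1]
react = [3.1, 3.4, 3.0]
print(f"t = {welch_t(debate, react):.2f}, d = {cohens_d(debate, react):.2f}")
```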
CLI export
jamjet eval export results.json --format latex --caption "Table 1"
jamjet eval compare results.json --conditions "debate,react" --test auto

Replay and fork
Exact replay
Reproduce any execution from its checkpoint — same inputs, same execution path:
jamjet replay exec_abc123

The runtime fetches the original execution events, extracts the workflow and input, and creates a new execution with the same parameters. Useful for verification and debugging.
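Conceptually, replay just reads the recorded events back and re-seeds a fresh execution with what it finds. A minimal sketch of that extraction step, using an entirely hypothetical event shape (the "execution_started" record and its fields are assumptions, not JamJet's actual event schema):

```python
# Hypothetical event log: a list of typed dicts, oldest first
events = [
    {"type": "execution_started", "workflow": "workflow.yaml",
     "input": {"question": "What is JamJet?", "seed": 42}},
    {"type": "node_completed", "node": "researcher", "output": "..."},
]

def extract_replay_params(events):
    """Pull the original workflow and input back out of the event trace."""
    start = next(e for e in events if e["type"] == "execution_started")
    return {"workflow": start["workflow"], "input": start["input"]}

print(extract_replay_params(events))
```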
Fork for ablation studies
Fork from a completed execution with modified inputs — change one variable while keeping everything else identical:
jamjet fork exec_abc123 --override-input '{"model": "gpt-4o"}'

The original input is preserved and the override is merged in. This is the fastest path to ablation studies — no re-configuring, no re-running the full grid.
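The merge described above behaves like a shallow dict merge: original keys are carried over, overridden keys are replaced. A sketch of those assumed semantics:

```python
original = {"question": "What is JamJet?", "model": "claude-sonnet-4-6", "seed": 42}
override = {"model": "gpt-4o"}

# Later keys win: only "model" changes, everything else is preserved
forked_input = {**original, **override}
print(forked_input)
```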
Provenance
Every node completion in JamJet carries ProvenanceMetadata:
| Field | Description |
|---|---|
| model_id | Which model produced this output |
| model_version | Model version string |
| confidence | Self-reported confidence (0.0–1.0) |
| verified | Whether the output passed verification |
| source | Origin identifier |
Provenance is attached automatically — no extra configuration needed. It flows through to event traces, audit logs, and exported results.
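The fields above map naturally onto a small dataclass. A sketch of the shape — field names come from the table, but the class layout and the example values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ProvenanceMetadata:
    model_id: str        # which model produced this output
    model_version: str   # model version string
    confidence: float    # self-reported confidence, 0.0-1.0
    verified: bool       # whether the output passed verification
    source: str          # origin identifier

# Hypothetical record attached to one node completion
prov = ProvenanceMetadata(
    model_id="claude-sonnet-4-6",
    model_version="2025-05",
    confidence=0.92,
    verified=True,
    source="node:researcher",
)
print(prov.model_id, prov.confidence)
```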
Tip: Provenance metadata is especially valuable for multi-model experiments where you need to attribute which model produced which intermediate result.
Inspect workflows
Use jamjet inspect to examine compiled workflows, including strategy-specific details:
jamjet inspect workflow.yaml

For strategy-aware agents, this shows:
- Strategy name and type
- Iteration count and plan steps
- Critic verdicts (for critic/debate strategies)
- Cost per iteration
Example: complete research afternoon
from jamjet.eval import ExperimentGrid
# 1. Define the grid
grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "plan_and_execute", "critic",
                     "reflection", "consensus", "debate"],
    },
    seeds=[42, 123, 456],
    dataset="evals/dataset.jsonl",
    scorers=["llm_judge"],
)
# 2. Run (durable — resumes on crash)
results = await grid.run()
# 3. Export LaTeX table
results.to_latex("table1.tex", caption="Strategy comparison")
# 4. Statistical test
comp = results.compare("debate", "react")
print(f"p = {comp.p_value:.4f}, Cohen's d = {comp.effect_size:.2f}")
# 5. Replay a specific run
# $ jamjet replay exec_debate_seed42
# 6. Fork for ablation
# $ jamjet fork exec_debate_seed42 --override-input '{"model":"gpt-4o"}'

Next steps
- Browse research examples — DCI deliberation and LDP routing patterns
- See the /research page for capabilities overview
- Read the eval harness docs for deeper eval configuration
- Try jamjet init my-study --template research to get started