JamJet

Research Guide

Run reproducible multi-agent experiments with JamJet — from scaffold to publication-ready results.

Multi-Agent Research with JamJet

JamJet gives researchers a complete experiment infrastructure out of the box: durable execution for reproducibility, six reasoning strategies for ablation studies, an experiment grid for parameter sweeps, built-in evaluation with custom scorers, and publication export (LaTeX, CSV, JSON) with statistical tests.

This guide walks through a complete research workflow — from project scaffold to paper-ready results.

Tip: Already familiar with JamJet? Jump to "Run experiments" or "Export for publication".


Setup

  1. Install JamJet

    pip install jamjet
  2. Scaffold a research project

    jamjet init my-study --template research
    cd my-study
  3. Review the scaffolded structure

    my-study/
    ├── agents/
    │   └── researcher.py       # Agent definition with tools
    ├── baselines/
    │   └── baseline.py         # Baseline comparison stubs
    ├── experiments/
    │   ├── config.yaml          # Model, seed, strategy config
    │   └── runner.py            # Experiment loop
    ├── evals/
    │   ├── dataset.jsonl        # Evaluation dataset
    │   └── scorers.py           # Custom scorer definitions
    ├── results/                 # Output directory (.gitkeep)
    ├── workflow.yaml            # Workflow definition
    └── README.md

Define your agents

Agents are Python functions decorated with @task. Each agent can use a different reasoning strategy.

from jamjet import task, tool

@tool
async def web_search(query: str) -> str:
    """Search the web for current information."""
    ...

@task(model="claude-sonnet-4-6", tools=[web_search])
async def researcher(question: str) -> str:
    """Research a question using web search."""

Six built-in strategies

JamJet compiles high-level strategy names into explicit IR sub-DAGs. Swap with a single parameter — same agent, different reasoning:

  • react — Reason → Act → Observe loop. Best for tool-heavy tasks.
  • plan_and_execute — Plan steps → execute each → synthesize. Best for multi-step decomposition.
  • critic — Generate → critique → revise. Best for quality-sensitive output.
  • reflection — Execute → reflect → gate → revise loop. Best for self-improving agents.
  • consensus — N agents → vote → judge → finalize. Best for reducing variance.
  • debate — Propose → counter → judge → settle loop. Best for adversarial reasoning.

# workflow.yaml — change strategy to compare
agents:
  researcher:
    model: claude-sonnet-4-6
    strategy: debate        # swap to: react, reflection, consensus...
    tools: [web_search]
    max_iterations: 6
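To make the react pattern above concrete, here is a minimal, self-contained sketch of a Reason → Act → Observe loop. This is illustrative only, not JamJet's internal implementation; `react_loop` and `toy_reason` are hypothetical names introduced for the example.

```python
from typing import Any, Callable, Dict, Optional, Tuple

def react_loop(
    question: str,
    reason: Callable[[str, list], Tuple[str, Optional[Tuple[str, Any]]]],
    tools: Dict[str, Callable[[Any], str]],
    max_iterations: int = 6,
) -> str:
    """Reason -> Act -> Observe: the model 'reasons', optionally picks a
    tool call ('acts'), and the tool result is fed back ('observes')."""
    observations: list = []
    answer = ""
    for _ in range(max_iterations):
        answer, action = reason(question, observations)
        if action is None:                 # model is confident enough to answer
            return answer
        name, args = action
        observations.append(tools[name](args))  # observe the tool result
    return answer                          # iteration budget exhausted

# Toy demo: one web search, then answer.
def toy_reason(question, obs):
    if not obs:
        return "searching...", ("web_search", question)
    return f"answer based on: {obs[-1]}", None

result = react_loop("capital of France?", toy_reason,
                    {"web_search": lambda q: "Paris"})
print(result)  # -> answer based on: Paris
```

The `max_iterations` parameter plays the same role as `max_iterations` in the workflow config: it caps how many Act/Observe rounds run before the loop returns its best answer.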

Run experiments

ExperimentGrid runs every combination of conditions and seeds as durable workflow executions. If a run crashes, it resumes from checkpoint — no re-running prior steps.

from jamjet.eval import ExperimentGrid

grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "plan_and_execute", "critic",
                      "reflection", "consensus", "debate"],
        "model": ["claude-sonnet-4-6", "gpt-4o"],
    },
    seeds=[42, 123, 456],
    dataset="evals/dataset.jsonl",
    scorers=["llm_judge", "factuality"],
)

results = await grid.run()
results.summary()  # Rich table in terminal

This runs 6 strategies × 2 models × 3 seeds = 36 durable executions, each with full event traces and checkpoints.

Note: Every execution is event-sourced. If the experiment crashes at run 22 of 36, it resumes from run 22 — no tokens wasted re-running completed conditions.
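The resume behavior can be sketched with a toy checkpoint: record the key of every finished run, and skip any key already recorded on restart. This is a simplification of JamJet's event-sourced runtime, introduced only to show the idea; `run_grid` here is a hypothetical helper, not the library API.

```python
import itertools

def run_grid(conditions, seeds, run_one, completed=None):
    """Run every condition x seed combination, skipping any key already in
    `completed` -- a toy stand-in for checkpointed resume."""
    completed = completed if completed is not None else set()
    results = {}
    for combo in itertools.product(*conditions.values()):
        for seed in seeds:
            key = (combo, seed)
            if key in completed:
                continue                  # resume: don't re-run finished work
            results[key] = run_one(dict(zip(conditions, combo)), seed)
            completed.add(key)
    return results

# 6 strategies x 2 models x 3 seeds = 36 runs; pretend 22 already finished.
conds = {"strategy": ["react", "plan_and_execute", "critic",
                      "reflection", "consensus", "debate"],
         "model": ["claude-sonnet-4-6", "gpt-4o"]}
all_keys = [(c, s) for c in itertools.product(*conds.values())
            for s in [42, 123, 456]]
done = set(all_keys[:22])
fresh = run_grid(conds, [42, 123, 456], lambda cfg, seed: 0.0, completed=done)
print(len(fresh))  # 14 remaining runs
```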


Evaluate results

Built-in scorers

JamJet includes four scorer types out of the box:

  • llm_judge — LLM evaluates output against a rubric (0-5 scale)
  • assertion — Boolean check against output structure or content
  • latency — Scores based on execution time vs threshold
  • cost — Scores based on token cost vs budget
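To illustrate the threshold-based scorers (latency, cost), here is one plausible scoring curve: full marks under the threshold, decaying linearly to zero at twice the threshold. The actual JamJet scorers may use a different curve; this is an assumption made for the example, and `latency_score` is a hypothetical name.

```python
def latency_score(elapsed_s: float, threshold_s: float) -> float:
    """Hypothetical threshold scorer: 1.0 at or under the threshold,
    decaying linearly to 0.0 at twice the threshold."""
    if elapsed_s <= threshold_s:
        return 1.0
    return max(0.0, 1.0 - (elapsed_s - threshold_s) / threshold_s)

print(latency_score(0.8, 1.0))  # 1.0 -- under threshold
print(latency_score(1.5, 1.0))  # 0.5 -- halfway to the 2x cutoff
```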

Custom scorers

Register domain-specific scorers with the @scorer decorator:

from jamjet import scorer
from jamjet.eval import ScorerResult

@scorer(name="factuality", description="Check factual accuracy")
async def factuality_scorer(input: dict, output: dict, context: dict) -> ScorerResult:
    # Your custom scoring logic here
    return ScorerResult(
        score=0.85,
        passed=True,
        reason="All claims verified against source",
    )

Custom scorers are automatically available in ExperimentGrid and EvalNode by name.

Eval nodes in workflows

Evaluation can also run inline as a workflow node — during execution, not after:

nodes:
  check-quality:
    type: eval
    scorers:
      - type: llm_judge
        rubric: "Is the response accurate and well-sourced?"
        min_score: 4
    on_fail: retry_with_feedback
    max_retries: 2

When the score is too low, retry_with_feedback re-runs the upstream node with the scorer's reasoning as additional context.
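The retry_with_feedback mechanism can be sketched as an eval-gated loop: generate, judge, and on failure regenerate with the judge's reasoning appended to the prompt. `eval_gated`, `gen`, and `judge` below are hypothetical stand-ins, not JamJet APIs.

```python
def eval_gated(generate, judge, prompt, min_score=4, max_retries=2):
    """Re-run `generate` with the judge's reason as feedback until the
    score clears `min_score` or retries are exhausted."""
    feedback = None
    output = ""
    for _ in range(max_retries + 1):
        output = generate(prompt if feedback is None
                          else f"{prompt}\n\nReviewer feedback: {feedback}")
        score, reason = judge(output)
        if score >= min_score:
            return output
        feedback = reason          # retry_with_feedback: feed reasoning back
    return output                  # retries exhausted; return best effort

# Toy demo: first draft fails the rubric, second passes.
attempts = []
def gen(p):
    attempts.append(p)
    return f"draft {len(attempts)}"
def judge(out):
    return (5, "ok") if out == "draft 2" else (2, "cite sources")

final = eval_gated(gen, judge, "Summarize X")
print(final)  # -> draft 2
```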


Export for publication

LaTeX tables

results.to_latex("table1.tex", caption="Strategy comparison across models")

Outputs a booktabs-formatted table with mean ± std per condition — ready to \input{} in your paper.
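For orientation, the exported table has roughly this booktabs shape (the exact layout may differ, and the numbers below are made up for illustration):

```latex
% Illustrative shape of the export -- values are invented.
\begin{table}[t]
  \centering
  \caption{Strategy comparison across models}
  \begin{tabular}{lcc}
    \toprule
    Strategy & claude-sonnet-4-6 & gpt-4o \\
    \midrule
    react  & $0.71 \pm 0.04$ & $0.68 \pm 0.05$ \\
    debate & $0.79 \pm 0.03$ & $0.74 \pm 0.04$ \\
    \bottomrule
  \end{tabular}
\end{table}
```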

CSV for R / pandas

results.to_csv("results.csv")

JSON for further analysis

results.to_json("results.json")

Statistical comparison

Compare any two conditions with built-in significance tests:

comparison = results.compare("debate", "react", test="auto", alpha=0.05)

Returns a ComparisonResult with:

  • test_name — Test used (welch, wilcoxon, mann_whitney)
  • statistic — Test statistic
  • p_value — p-value
  • effect_size — Cohen's d
  • ci_lower, ci_upper — 95% confidence interval
  • significant — True if p < alpha
  • mean_a, mean_b — Group means

Available tests:

  • welch — Welch's t-test (default, independent samples)
  • wilcoxon — Wilcoxon signed-rank (paired samples)
  • mann_whitney — Mann-Whitney U (non-parametric, independent)
  • auto — Picks based on sample size and pairing
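For intuition about what the welch test reports, here is a stdlib-only sketch of the Welch t statistic and a pooled-SD Cohen's d. This is not JamJet code; `results.compare` also returns a p-value, which needs a t-distribution CDF that the standard library does not provide, so it is omitted here.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic (unequal variances) plus Cohen's d
    computed with the pooled sample standard deviation."""
    ma, mb = mean(a), mean(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2        # sample variances (n-1)
    na, nb = len(a), len(b)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / pooled                       # Cohen's d
    return t, d

# Toy per-seed scores for two conditions.
debate = [0.81, 0.78, 0.80]
react  = [0.70, 0.73, 0.69]
t, d = welch_t(debate, react)
print(round(t, 2), round(d, 2))
```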

CLI export

jamjet eval export results.json --format latex --caption "Table 1"
jamjet eval compare results.json --conditions "debate,react" --test auto

Replay and fork

Exact replay

Reproduce any execution from its checkpoint — same inputs, same execution path:

jamjet replay exec_abc123

The runtime fetches the original execution events, extracts the workflow and input, and creates a new execution with the same parameters. Useful for verification and debugging.

Fork for ablation studies

Fork from a completed execution with modified inputs — change one variable while keeping everything else identical:

jamjet fork exec_abc123 --override-input '{"model": "gpt-4o"}'

The original input is preserved and the override is merged in. This is the fastest path to ablation studies — no re-configuring, no re-running the full grid.
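Assuming the merge is a shallow key-by-key overlay (the most common semantics for this kind of override; the exact behavior is JamJet's to define), the fork-time input merge looks like this:

```python
# Shallow merge: override keys win, everything else is preserved.
original = {"model": "claude-sonnet-4-6", "strategy": "debate", "seed": 42}
override = {"model": "gpt-4o"}
forked = {**original, **override}
print(forked)
# {'model': 'gpt-4o', 'strategy': 'debate', 'seed': 42}
```

Only `model` changes; `strategy` and `seed` carry over from the original execution, which is exactly what makes a fork a one-variable ablation.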


Provenance

Every node completion in JamJet carries ProvenanceMetadata:

  • model_id — Which model produced this output
  • model_version — Model version string
  • confidence — Self-reported confidence (0.0–1.0)
  • verified — Whether the output passed verification
  • source — Origin identifier

Provenance is attached automatically — no extra configuration needed. It flows through to event traces, audit logs, and exported results.
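As a reading aid, the documented fields map naturally onto a dataclass. The real ProvenanceMetadata class may differ in shape; this sketch only restates the fields listed above.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceMetadata:
    model_id: str          # which model produced this output
    model_version: str     # model version string
    confidence: float      # self-reported confidence, 0.0-1.0
    verified: bool         # whether the output passed verification
    source: str            # origin identifier

p = ProvenanceMetadata("claude-sonnet-4-6", "2025-05", 0.92, True,
                       "node:researcher")
print(p.verified)  # True
```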

Tip: Provenance metadata is especially valuable for multi-model experiments where you need to attribute which model produced which intermediate result.


Inspect workflows

Use jamjet inspect to examine compiled workflows, including strategy-specific details:

jamjet inspect workflow.yaml

For strategy-aware agents, this shows:

  • Strategy name and type
  • Iteration count and plan steps
  • Critic verdicts (for critic/debate strategies)
  • Cost per iteration

Example: complete research afternoon

from jamjet.eval import ExperimentGrid

# 1. Define the grid
grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "plan_and_execute", "critic",
                      "reflection", "consensus", "debate"],
    },
    seeds=[42, 123, 456],
    dataset="evals/dataset.jsonl",
    scorers=["llm_judge"],
)

# 2. Run (durable — resumes on crash)
results = await grid.run()

# 3. Export LaTeX table
results.to_latex("table1.tex", caption="Strategy comparison")

# 4. Statistical test
comp = results.compare("debate", "react")
print(f"p = {comp.p_value:.4f}, Cohen's d = {comp.effect_size:.2f}")

# 5. Replay a specific run
# $ jamjet replay exec_debate_seed42

# 6. Fork for ablation
# $ jamjet fork exec_debate_seed42 --override-input '{"model":"gpt-4o"}'

Next steps

  • Browse research examples — DCI deliberation and LDP routing patterns
  • See the /research page for capabilities overview
  • Read the eval harness docs for deeper eval configuration
  • Try jamjet init my-study --template research to get started
