Evaluation Framework
Evaluate output quality with JamJet eval, run regression test suites, and gate CI pipelines.
Evaluation Tooling
JamJet ships with a built-in evaluation system for measuring and safeguarding output quality, from quick ad-hoc checks to full CI regression suites.
Why evals matter
LLM outputs are probabilistic. The same workflow can produce excellent results on most inputs yet fail on edge cases. JamJet evals give you:
- LLM judges: an independent model scores output quality against a rubric
- Assertion checks: structured checks (length, field presence, format)
- Latency and cost thresholds: enforce SLAs in CI
- Regression suites: catch regressions before they reach production
Inline evals (workflows)
Add an eval node to a workflow to score the output and retry on failure:
```yaml
nodes:
  check-quality:
    type: eval
    scorers:
      - type: llm_judge
        rubric: "Is the answer accurate, complete, and under 200 words?"
        min_score: 4  # on a 1-5 scale
        model: claude-haiku-4-5-20251001
      - type: assertion
        check: "len(output.answer) > 0"
      - type: latency
        max_ms: 5000
    on_fail: retry_with_feedback
    max_retries: 2
    next: end
```

With `on_fail: retry_with_feedback` set, the scorer's feedback is automatically injected into the prompt of the next model call, creating a self-improvement loop.
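The retry-with-feedback loop can be pictured in plain Python. A minimal sketch, assuming a `call_model` function and a `score` function (both hypothetical, not part of the JamJet API):

```python
def retry_with_feedback(call_model, score, prompt, max_retries=2):
    """Run the model, score the output, and on failure inject the
    scorer's feedback into the next attempt's prompt."""
    feedback = None
    for attempt in range(max_retries + 1):
        full_prompt = prompt if feedback is None else (
            f"{prompt}\n\nA previous attempt failed review: {feedback}\n"
            "Please address this feedback."
        )
        output = call_model(full_prompt)
        passed, feedback = score(output)
        if passed:
            return output
    return output  # last attempt, returned even if it still fails

# Toy stand-ins: the "model" only succeeds once it has seen feedback.
def call_model(prompt):
    return "a long, corrected answer" if "feedback" in prompt else "hmm"

def score(output):
    ok = len(output) > 5
    return ok, None if ok else "answer too short"

result = retry_with_feedback(call_model, score, "Answer the question.")
```

The real runtime does the same thing one node-level: feedback from failing scorers is appended to the model's context before the retry.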
Dataset evals (CLI)
For batch evaluation, create a JSONL dataset:

```jsonl
{"id": "q1", "input": {"query": "What is JamJet?"}, "expected": {"topic": "runtime"}}
{"id": "q2", "input": {"query": "How do I install it?"}, "expected": {"topic": "install"}}
{"id": "q3", "input": {"query": "Which models does it support?"}, "expected": {}}
```

Run the eval:
```bash
jamjet eval run dataset.jsonl \
  --workflow workflow.yaml \
  --rubric "Is the answer accurate and helpful?" \
  --min-score 4 \
  --assert "len(output.answer) >= 50" \
  --latency-ms 3000 \
  --concurrency 10 \
  --fail-below 0.9
```

```
Running 50 eval rows... ████████████████████ 50/50
┌─────────┬────────────┬───────┬──────────┬────────────────────┐
│ Row     │ Status     │ Score │ Latency  │ Notes              │
├─────────┼────────────┼───────┼──────────┼────────────────────┤
│ q1      │ ✓ pass     │ 4.8   │ 512ms    │                    │
│ q2      │ ✓ pass     │ 4.2   │ 623ms    │                    │
│ q3      │ ✗ fail     │ 2.1   │ 891ms    │ answer too vague   │
└─────────┴────────────┴───────┴──────────┴────────────────────┘
Results: 49/50 passed (98.0%) — above the 90.0% threshold ✓
```

The command exits with code 0 on pass and 1 on failure, so it plugs directly into CI.
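Dataset files like the one above are plain JSONL, so they can also be generated programmatically. A minimal sketch using Python's standard library (the `id`, `input`, and `expected` field names follow the examples above):

```python
import json

# Rows mirroring the hand-written dataset above.
rows = [
    {"id": "q1", "input": {"query": "What is JamJet?"}, "expected": {"topic": "runtime"}},
    {"id": "q2", "input": {"query": "How do I install it?"}, "expected": {"topic": "install"}},
]

# One JSON object per line, newline-terminated: that is all JSONL is.
with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

This is handy when the dataset is derived from production logs or an existing test fixture.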
CI Integration
Add this to your GitHub Actions workflow:

```yaml
- name: Run eval suite
  run: |
    jamjet eval run evals/core.jsonl \
      --workflow workflow.yaml \
      --rubric "Is the answer accurate and complete?" \
      --fail-below 0.85
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    JAMJET_URL: http://localhost:7700
```

Tip: start the JamJet dev runtime in the CI job before running the evals:

```bash
jamjet dev &
```

then `sleep 2` to give it time to initialize.
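As a sketch, the runtime startup can be its own step ahead of the eval step (the step name and the fixed two-second wait are illustrative, not prescribed by JamJet):

```yaml
- name: Start JamJet dev runtime
  run: |
    jamjet dev &
    sleep 2  # give the runtime time to initialize before evals hit it
```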
Python eval API
For custom evaluation logic, use the Python eval package:

```python
import asyncio
from jamjet.eval import EvalDataset, EvalRunner
from jamjet.eval.scorers import LlmJudgeScorer, AssertionScorer, LatencyScorer

dataset = EvalDataset.from_file("evals/core.jsonl")

runner = EvalRunner(
    workflow_path="workflow.yaml",
    runtime_url="http://localhost:7700",
    scorers=[
        LlmJudgeScorer(
            rubric="Is the answer accurate and helpful?",
            model="claude-haiku-4-5-20251001",
            min_score=4,
        ),
        AssertionScorer(check="len(output['answer']) >= 50"),
        LatencyScorer(max_ms=3000),
    ],
    concurrency=10,
)

results = asyncio.run(runner.run(dataset))
runner.print_summary(results)

# Check the overall pass rate
pass_rate = sum(1 for r in results if r.passed) / len(results)
assert pass_rate >= 0.9, f"Eval failed: {pass_rate:.0%} pass rate"
```

Custom scorers
Write your own scorer by subclassing BaseScorer:

```python
from jamjet.eval.scorers import BaseScorer, ScorerResult

class ExactMatchScorer(BaseScorer):
    async def score(
        self,
        output: dict,
        *,
        expected: dict,
        duration_ms: float,
        cost_usd: float,
        input_data: dict,
    ) -> ScorerResult:
        # Compare the produced answer to the expected one,
        # ignoring case and surrounding whitespace.
        answer = output.get("answer", "")
        expected_answer = expected.get("answer", "")
        passed = answer.strip().lower() == expected_answer.strip().lower()
        return ScorerResult(
            scorer="exact_match",
            passed=passed,
            score=1.0 if passed else 0.0,
            message=None if passed else f"Expected '{expected_answer}', got '{answer}'",
        )
```

Pass an instance in EvalRunner's `scorers` list to use it alongside the built-in scorers.

Scorer types
LLM judge
Scores the output against a 1–5 rubric using an independent model:

```yaml
- type: llm_judge
  rubric: "Is the answer accurate, complete, and under 200 words?"
  min_score: 4  # 1–5 (5 = perfect)
  model: claude-haiku-4-5-20251001
```

The judge receives the input, the output, and the rubric, and returns a JSON object with a score (1–5) and a reason.
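A judge response of that shape is straightforward to validate downstream. A minimal sketch (the `score`/`reason` fields follow the description above; the parsing helper itself is hypothetical, not part of JamJet):

```python
import json

def parse_judge_response(raw: str, min_score: int = 4):
    """Parse a judge reply like '{"score": 4, "reason": "..."}' and
    decide pass/fail against the configured minimum score."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return score >= min_score, score, data.get("reason", "")

passed, score, reason = parse_judge_response('{"score": 3, "reason": "missing sources"}')
```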
Assertions
Python expressions evaluated against output and expected:

```yaml
- type: assertion
  check: "len(output.answer) > 0"

# Multiple assertions
- type: assertion
  check: "'sources' in output"
- type: assertion
  check: "output.confidence >= 0.7"
```

Latency
Checks that the execution completed within the time budget:

```yaml
- type: latency
  max_ms: 3000
```

Cost
Checks that the execution stayed within the cost budget:

```yaml
- type: cost
  max_usd: 0.05
```

Output format
Save results to a file for further analysis:

```bash
jamjet eval run dataset.jsonl \
  --workflow workflow.yaml \
  --output results.json
```

```json
{
  "summary": {
    "total": 50,
    "passed": 47,
    "failed": 3,
    "pass_rate": 0.94,
    "avg_latency_ms": 612,
    "avg_cost_usd": 0.0003
  },
  "rows": [
    {
      "id": "q1",
      "passed": true,
      "scorers": [
        { "scorer": "llm_judge", "passed": true, "score": 4.8 }
      ],
      "duration_ms": 512,
      "cost_usd": 0.00023
    }
  ]
}
```
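The results file can then be post-processed with standard tooling. A minimal sketch that lists failing rows (field names follow the schema above; the inline dict stands in for the loaded file):

```python
import json

# Stand-in for: results = json.load(open("results.json"))
results = {
    "summary": {"total": 50, "passed": 47, "failed": 3, "pass_rate": 0.94},
    "rows": [
        {"id": "q1", "passed": True, "duration_ms": 512},
        {"id": "q3", "passed": False, "duration_ms": 891},
    ],
}

# Collect the ids of every row that failed any scorer.
failed_ids = [row["id"] for row in results["rows"] if not row["passed"]]
print(f"{results['summary']['failed']} failed: {failed_ids}")
```

From here it is a short step to posting failures as a PR comment or filing them into a triage dashboard.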