Self-Evaluation
Self-evaluation enables automatic scoring of agent output during sweeps and orchestration eval mode. This page covers how self-evaluation works, how to configure it, and how to interpret results.
How It Works
After each sweep iteration (or eval orchestration), the agent output is evaluated against defined criteria. The evaluation can be:
- Built-in: The same provider re-evaluates its own output
- Cross-provider: A separate provider evaluates the output
Agent Output → Evaluation Prompt → Score (0.0–1.0) per criterion
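The flow above can be pictured as a small pipeline. The sketch below is illustrative only: `Evaluator`, `buildEvaluationPrompt`, `clampScore`, and `evaluateOutput` are hypothetical names rather than part of the prompts-gpt SDK, and the prompt wording is an assumption. It shows how one score per criterion, kept within 0.0–1.0, might be produced by either the same provider (built-in) or a different one (cross-provider).

```typescript
// Hypothetical types and helpers; none of these names come from the prompts-gpt SDK.
interface CriterionScore {
  name: string;
  score: number; // expected range: 0.0 to 1.0
}

// An evaluator receives an evaluation prompt and returns a raw score per criterion.
// For built-in eval it would wrap the provider that produced the output;
// for cross-provider eval it would wrap a different provider.
type Evaluator = (prompt: string, criteria: string[]) => Promise<Record<string, number>>;

// Build the prompt that asks the evaluator to score the agent output (assumed wording).
function buildEvaluationPrompt(agentOutput: string, criteria: string[]): string {
  return [
    "Score the following output from 0.0 to 1.0 on each criterion:",
    ...criteria.map((c) => `- ${c}`),
    "",
    "Output:",
    agentOutput,
  ].join("\n");
}

// Keep scores inside the 0.0 to 1.0 range; NaN is treated as 0.
function clampScore(value: number): number {
  if (Number.isNaN(value)) return 0;
  return Math.min(1, Math.max(0, value));
}

// Agent Output -> Evaluation Prompt -> Score per criterion.
async function evaluateOutput(
  agentOutput: string,
  criteria: string[],
  evaluator: Evaluator,
): Promise<CriterionScore[]> {
  const prompt = buildEvaluationPrompt(agentOutput, criteria);
  const raw = await evaluator(prompt, criteria);
  // Criteria the evaluator did not score default to 0.
  return criteria.map((name) => ({ name, score: clampScore(raw[name] ?? 0) }));
}

// Stub evaluator used only for demonstration; a real one would call a provider.
const stubEvaluator: Evaluator = async () => ({ correctness: 0.8, completeness: 1.3 });

evaluateOutput("example output", ["correctness", "completeness"], stubEvaluator).then(
  (scores) => console.log(scores),
);
```

Swapping `stubEvaluator` for a function backed by a second provider is all that would distinguish cross-provider evaluation from built-in evaluation in this sketch.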
Configuring Self-Eval in Sweeps
Via YAML Frontmatter
---
title: Security Audit
sweep:
  defaultIterations: 5
  eval:
    criteria:
      - correctness
      - completeness
      - thoroughness
      - code-quality
    passThreshold: 0.8
---
Via CLI
prompts-gpt sweep audit.md -n 5 \
  --eval \
  --eval-criteria correctness,completeness,thoroughness \
  --eval-threshold 0.8
Via SDK
const result = await sweepPrompt({
  promptFile: "audit.md",
  iterations: 5,
  provider: "codex",
  eval: {
    criteria: ["correctness", "completeness", "thoroughness"],
    passThreshold: 0.8,
  },
});
Evaluation Criteria
Common criteria and what they measure:
| Criterion | What It Measures |
|---|---|
| correctness | Are the changes/findings accurate? |
| completeness | Were all aspects of the task addressed? |
| thoroughness | Depth of analysis or implementation |
| code-quality | Code style, patterns, best practices |
| security | Security implications of changes |
| performance | Performance impact of changes |
| test-coverage | Quality of test additions |
| documentation | Documentation completeness |
You can define custom criteria — any string is accepted.
Score Interpretation
| Score | Meaning |
|---|---|
| 0.0–0.3 | Poor — significant issues |
| 0.3–0.5 | Below average — notable gaps |
| 0.5–0.7 | Average — acceptable with improvements |
| 0.7–0.85 | Good — meets expectations |
| 0.85–1.0 | Excellent — exceeds expectations |
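A score can be mapped to its band with a small helper. This is a sketch, not SDK functionality: `interpretScore` is a hypothetical name, and because adjacent ranges in the table share their boundary values, the code assumes each band includes its lower bound and excludes its upper bound.

```typescript
// Band labels taken from the table above.
type Band = "Poor" | "Below average" | "Average" | "Good" | "Excellent";

// Assumption: a boundary value (e.g. 0.3) belongs to the higher band.
function interpretScore(score: number): Band {
  if (Number.isNaN(score) || score < 0 || score > 1) {
    throw new RangeError(`score must be within 0.0 and 1.0, got ${score}`);
  }
  if (score < 0.3) return "Poor";
  if (score < 0.5) return "Below average";
  if (score < 0.7) return "Average";
  if (score < 0.85) return "Good";
  return "Excellent";
}

console.log(interpretScore(0.72)); // "Good"
console.log(interpretScore(0.87)); // "Excellent"
```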
Pass/Fail Threshold
The passThreshold determines whether an iteration "passes":
eval:
  passThreshold: 0.8 # Overall weighted average must be ≥ 0.8
When an iteration fails the threshold, the sweep continues to the next iteration, providing the evaluation feedback as additional context.
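The threshold check can be sketched as a weighted average followed by a comparison. This page does not say how weights are assigned, so the `weights` parameter below is hypothetical and every criterion defaults to a weight of 1; `overallScore` and `passes` are illustrative names, not SDK functions. The sample scores are those of the first iteration in the eval-scores.json example under Eval Results.

```typescript
interface CriterionScore {
  name: string;
  score: number;
}

// Weighted average of criterion scores. Assumption: unlisted criteria get weight 1,
// which reduces to a plain mean when no weights are supplied.
function overallScore(
  criteria: CriterionScore[],
  weights: Record<string, number> = {},
): number {
  if (criteria.length === 0) {
    throw new Error("at least one criterion is required");
  }
  let weightedSum = 0;
  let totalWeight = 0;
  for (const { name, score } of criteria) {
    const weight = weights[name] ?? 1;
    weightedSum += weight * score;
    totalWeight += weight;
  }
  if (totalWeight <= 0) {
    throw new Error("total weight must be positive");
  }
  return weightedSum / totalWeight;
}

// An iteration passes when its overall score is at least passThreshold.
function passes(
  criteria: CriterionScore[],
  passThreshold: number,
  weights: Record<string, number> = {},
): boolean {
  return overallScore(criteria, weights) >= passThreshold;
}

const iteration1: CriterionScore[] = [
  { name: "correctness", score: 0.8 },
  { name: "completeness", score: 0.65 },
  { name: "thoroughness", score: 0.7 },
];

console.log(overallScore(iteration1).toFixed(2)); // "0.72"
console.log(passes(iteration1, 0.8)); // false
```

With equal weights the result matches the `overallScore` of 0.72 recorded for that iteration, which suggests, but does not confirm, that the example was produced with an unweighted mean.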
Eval Results
In Artifacts
Results are stored in eval-scores.json:
{
  "scores": [
    {
      "iteration": 1,
      "overallScore": 0.72,
      "criteria": [
        { "name": "correctness", "score": 0.8 },
        { "name": "completeness", "score": 0.65 },
        { "name": "thoroughness", "score": 0.7 }
      ],
      "passed": false
    },
    {
      "iteration": 2,
      "overallScore": 0.87,
      "criteria": [
        { "name": "correctness", "score": 0.9 },
        { "name": "completeness", "score": 0.85 },
        { "name": "thoroughness", "score": 0.85 }
      ],
      "passed": true
    }
  ],
  "passThreshold": 0.8
}
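For tooling that reads the artifact directly, the structure above can be given types and queried. The interfaces below are inferred from this example alone and may not cover every field the tool writes; `firstPassingIteration` and `weakestCriterion` are hypothetical helpers. The JSON is inlined to keep the sketch self-contained; in practice it would be read from eval-scores.json.

```typescript
// Shapes inferred from the example artifact; treat them as assumptions.
interface CriterionScore {
  name: string;
  score: number;
}

interface IterationScore {
  iteration: number;
  overallScore: number;
  criteria: CriterionScore[];
  passed: boolean;
}

interface EvalScoresFile {
  scores: IterationScore[];
  passThreshold: number;
}

// First iteration flagged as passed, or undefined if none passed.
function firstPassingIteration(file: EvalScoresFile): IterationScore | undefined {
  return file.scores.find((s) => s.passed);
}

// Lowest-scoring criterion of an iteration: a hint at what to improve next.
function weakestCriterion(iteration: IterationScore): CriterionScore | undefined {
  return iteration.criteria.reduce<CriterionScore | undefined>(
    (worst, c) => (worst === undefined || c.score < worst.score ? c : worst),
    undefined,
  );
}

// Same data as the example above, compacted.
const raw = `{
  "scores": [
    { "iteration": 1, "overallScore": 0.72, "passed": false, "criteria": [
      { "name": "correctness", "score": 0.8 },
      { "name": "completeness", "score": 0.65 },
      { "name": "thoroughness", "score": 0.7 } ] },
    { "iteration": 2, "overallScore": 0.87, "passed": true, "criteria": [
      { "name": "correctness", "score": 0.9 },
      { "name": "completeness", "score": 0.85 },
      { "name": "thoroughness", "score": 0.85 } ] }
  ],
  "passThreshold": 0.8
}`;

const file = JSON.parse(raw) as EvalScoresFile;
console.log(firstPassingIteration(file)?.iteration); // 2
console.log(weakestCriterion(file.scores[0])?.name); // "completeness"
```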
Programmatic Access
const result = await sweepPrompt({ /* ... */ });
for (const score of result.evalScores ?? []) {
  console.log(`Iteration ${score.iteration}: ${score.overallScore}`);
  if (!score.passed) {
    console.log(" Below threshold — context fed to next iteration");
  }
}
Eval in Orchestration
Use orchestrate --mode eval for one-shot evaluation:
prompts-gpt orchestrate --mode eval \
  --prompt review.md --provider codex \
  --criteria correctness,completeness,quality
This runs the prompt once and evaluates the output, returning scores without iteration.
Best Practices
- Start with 3–5 criteria: too many dilute the signal
- Set the threshold to 0.7–0.8: this sits in the "Good" band, so iterations can pass without needing a near-perfect score
- Use cross-provider eval to reduce the bias of a provider scoring its own output
- Review score trends across iterations to gauge convergence
- Combine with sweeps to get progressive improvement with quality gates
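Reviewing score trends can be partly automated. The sketch below is one possible heuristic, not something the tool provides: `scoreDeltas` and `hasConverged` are hypothetical names, and the `epsilon` and `window` defaults are arbitrary assumptions. The first two scores come from the eval-scores.json example; the later ones are invented for illustration.

```typescript
// Change in overall score between consecutive iterations, rounded to 2 decimals
// to avoid floating-point noise such as 0.15000000000000002.
function scoreDeltas(overallScores: number[]): number[] {
  const deltas: number[] = [];
  for (let i = 1; i < overallScores.length; i++) {
    const delta = overallScores[i] - overallScores[i - 1];
    deltas.push(Math.round(delta * 100) / 100);
  }
  return deltas;
}

// Assumed heuristic: scores have converged when each of the last `window`
// changes is smaller in magnitude than `epsilon`.
function hasConverged(overallScores: number[], epsilon = 0.02, window = 2): boolean {
  const deltas = scoreDeltas(overallScores);
  if (deltas.length < window) return false;
  return deltas.slice(-window).every((d) => Math.abs(d) < epsilon);
}

console.log(scoreDeltas([0.72, 0.87])); // [ 0.15 ]
console.log(hasConverged([0.72, 0.87])); // false (too few iterations to judge)
console.log(hasConverged([0.72, 0.87, 0.88, 0.88])); // true
```

A flat trend below the pass threshold would suggest that further iterations are unlikely to help and that the prompt or criteria need revisiting.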