Self-Evaluation

Self-evaluation enables automatic scoring of agent output during sweeps and orchestration eval mode. This page covers how self-evaluation works, how to configure it, and how to interpret results.

How It Works

After each sweep iteration (or eval orchestration), the agent output is evaluated against defined criteria. The evaluation can be:

Built-in: The same provider re-evaluates its own output
Cross-provider: A separate provider evaluates the output

Agent Output → Evaluation Prompt → Score (0.0–1.0) per criterion

Configuring Self-Eval in Sweeps

Via YAML Frontmatter

---
title: Security Audit
sweep:
  defaultIterations: 5
  eval:
    criteria:
      - correctness
      - completeness
      - thoroughness
      - code-quality
    passThreshold: 0.8
---

Via CLI

prompts-gpt sweep audit.md -n 5 \
  --eval \
  --eval-criteria correctness,completeness,thoroughness \
  --eval-threshold 0.8

Via SDK

const result = await sweepPrompt({
  promptFile: "audit.md",
  iterations: 5,
  provider: "codex",
  eval: {
    criteria: ["correctness", "completeness", "thoroughness"],
    passThreshold: 0.8,
  },
});

Evaluation Criteria

Common criteria and what they measure:

Criterion	What It Measures
`correctness`	Are the changes/findings accurate?
`completeness`	Were all aspects of the task addressed?
`thoroughness`	Depth of analysis or implementation
`code-quality`	Code style, patterns, best practices
`security`	Security implications of changes
`performance`	Performance impact of changes
`test-coverage`	Quality of test additions
`documentation`	Documentation completeness

You can define custom criteria — any string is accepted.

Score Interpretation

Score	Meaning
0.0–0.3	Poor — significant issues
0.3–0.5	Below average — notable gaps
0.5–0.7	Average — acceptable with improvements
0.7–0.85	Good — meets expectations
0.85–1.0	Excellent — exceeds expectations
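
The bands above can be applied in code when post-processing scores. The following is a minimal sketch, not part of the SDK: `scoreBand` and `ScoreBand` are hypothetical names, and because the table's ranges share their endpoints, it assumes each band includes its lower bound and excludes its upper bound (with 1.0 falling in the top band).

```typescript
// Hypothetical helper, not part of the SDK.
// Assumption: each band is [lower, upper), except the last, which includes 1.0.
type ScoreBand = "Poor" | "Below average" | "Average" | "Good" | "Excellent";

function scoreBand(score: number): ScoreBand {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`score must be between 0.0 and 1.0, got ${score}`);
  }
  if (score < 0.3) return "Poor";
  if (score < 0.5) return "Below average";
  if (score < 0.7) return "Average";
  if (score < 0.85) return "Good";
  return "Excellent";
}
```

Under that boundary assumption, a score of exactly 0.3 is reported as "Below average" rather than "Poor".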

Pass/Fail Threshold

The passThreshold determines whether an iteration "passes":

eval:
  passThreshold: 0.8  # Overall weighted average must be ≥ 0.8

When an iteration fails the threshold, the sweep continues to the next iteration and passes the evaluation feedback to it as additional context.
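
The threshold check can be illustrated with a short sketch. This page does not describe how criteria are weighted, so the code below assumes equal weights (a plain mean); that assumption is consistent with the sample `eval-scores.json` on this page, where 0.8, 0.65, and 0.7 average to roughly 0.72. `overallScore` and `passes` are hypothetical helper names, not SDK functions.

```typescript
interface CriterionScore {
  name: string;
  score: number; // 0.0–1.0
}

// Assumption: every criterion carries the same weight.
function overallScore(criteria: CriterionScore[]): number {
  if (criteria.length === 0) {
    throw new Error("at least one criterion is required");
  }
  const total = criteria.reduce((sum, c) => sum + c.score, 0);
  return total / criteria.length;
}

// An iteration passes when its overall score is at or above the threshold.
function passes(criteria: CriterionScore[], passThreshold: number): boolean {
  return overallScore(criteria) >= passThreshold;
}
```

If the tool applies non-uniform weights, the mean would need to be replaced by a weighted sum; the comparison against `passThreshold` stays the same.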

Eval Results

In Artifacts

Results are stored in eval-scores.json:

{
  "scores": [
    {
      "iteration": 1,
      "overallScore": 0.72,
      "criteria": [
        { "name": "correctness", "score": 0.8 },
        { "name": "completeness", "score": 0.65 },
        { "name": "thoroughness", "score": 0.7 }
      ],
      "passed": false
    },
    {
      "iteration": 2,
      "overallScore": 0.87,
      "criteria": [
        { "name": "correctness", "score": 0.9 },
        { "name": "completeness", "score": 0.85 },
        { "name": "thoroughness", "score": 0.85 }
      ],
      "passed": true
    }
  ],
  "passThreshold": 0.8
}
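
When consuming this file from a script, it can help to give it a type. The interfaces below are inferred from the sample above rather than taken from a published schema, so treat the field list as an assumption; `firstPassingIteration` and `weakestCriterion` are hypothetical helpers.

```typescript
// Shape inferred from the sample eval-scores.json; the real file may carry more fields.
interface CriterionScore {
  name: string;
  score: number;
}

interface IterationScore {
  iteration: number;
  overallScore: number;
  criteria: CriterionScore[];
  passed: boolean;
}

interface EvalScoresFile {
  scores: IterationScore[];
  passThreshold: number;
}

// Returns the number of the first iteration that passed, or undefined if none did.
function firstPassingIteration(file: EvalScoresFile): number | undefined {
  return file.scores.find((s) => s.passed)?.iteration;
}

// Returns the lowest-scoring criterion of an iteration, or undefined if it has none.
function weakestCriterion(entry: IterationScore): CriterionScore | undefined {
  return entry.criteria.reduce<CriterionScore | undefined>(
    (worst, c) => (worst === undefined || c.score < worst.score ? c : worst),
    undefined,
  );
}
```

A script would typically obtain the object with `JSON.parse(text) as EvalScoresFile` after reading the artifact from disk.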

Programmatic Access

const result = await sweepPrompt({ /* ... */ });

for (const score of result.evalScores ?? []) {
  console.log(`Iteration ${score.iteration}: ${score.overallScore}`);
  if (!score.passed) {
    console.log("  Below threshold — context fed to next iteration");
  }
}

Eval in Orchestration

Use orchestrate --mode eval for one-shot evaluation:

prompts-gpt orchestrate --mode eval \
  --prompt review.md --provider codex \
  --criteria correctness,completeness,quality

This runs the prompt once, evaluates the output, and returns the scores without iterating.

Best Practices

Start with 3–5 criteria; too many dilute the signal
Set the threshold to 0.7–0.8 (the "Good" band in the score table); this leaves room for improvement
Use cross-provider eval to reduce the bias of a provider scoring its own output
Review score trends across iterations to gauge convergence
Combine with sweeps to get progressive improvement with quality gates
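
One way to review score trends is to look at the change in overall score between consecutive iterations. The sketch below is illustrative only: `scoreDeltas` and `hasPlateaued` are hypothetical helpers, and the default window of 2 and tolerance of 0.02 are arbitrary assumptions, not values used by the tool.

```typescript
// Differences between consecutive overall scores, e.g. iteration 2 minus iteration 1.
function scoreDeltas(overallScores: number[]): number[] {
  const deltas: number[] = [];
  for (let i = 1; i < overallScores.length; i++) {
    deltas.push(overallScores[i] - overallScores[i - 1]);
  }
  return deltas;
}

// Assumption: scores have "plateaued" when the last `window` changes
// are all smaller in magnitude than `epsilon`.
function hasPlateaued(overallScores: number[], window = 2, epsilon = 0.02): boolean {
  const deltas = scoreDeltas(overallScores);
  if (deltas.length < window) return false;
  return deltas.slice(-window).every((d) => Math.abs(d) < epsilon);
}
```

The input can be built from `result.evalScores` by mapping each entry to its `overallScore`. A plateau below the pass threshold suggests that further iterations are unlikely to help without changing the prompt or the criteria.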