
Self-Evaluation

Self-evaluation enables automatic scoring of agent output during sweeps and orchestration eval mode. This page covers how self-evaluation works, how to configure it, and how to interpret results.

How It Works

After each sweep iteration (or eval orchestration), the agent output is evaluated against defined criteria. The evaluation can be:

  1. Built-in: The same provider re-evaluates its own output
  2. Cross-provider: A separate provider evaluates the output

Agent Output → Evaluation Prompt → Score (0.0–1.0) per criterion
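
The flow above can be sketched in a few lines. This is only an illustration under stated assumptions: `Evaluator`, `buildEvalPrompt`, `normalizeScores`, and `evaluateOutput` are hypothetical names, not part of the prompts-gpt API, and the prompt wording is invented for the example. The evaluator is injected as a function, so the same code covers both built-in and cross-provider evaluation.

```typescript
// Hypothetical sketch: none of these names come from the prompts-gpt package.
type RawScores = Record<string, number | undefined>;
type Evaluator = (prompt: string) => Promise<RawScores>;

// Step 1: wrap the agent output in an evaluation prompt.
function buildEvalPrompt(output: string, criteria: string[]): string {
  return [
    "Score the following agent output from 0.0 to 1.0 on each criterion.",
    `Criteria: ${criteria.join(", ")}`,
    "--- OUTPUT ---",
    output,
  ].join("\n");
}

// Step 2: turn the evaluator's reply into one score per criterion.
// Scores are clamped into the 0.0-1.0 range; missing or non-finite values count as 0.
function normalizeScores(
  raw: RawScores,
  criteria: string[],
): { name: string; score: number }[] {
  return criteria.map((name) => {
    const value = raw[name];
    const score =
      typeof value === "number" && Number.isFinite(value)
        ? Math.min(1, Math.max(0, value))
        : 0;
    return { name, score };
  });
}

// Full pipeline: agent output -> evaluation prompt -> per-criterion scores.
async function evaluateOutput(
  output: string,
  criteria: string[],
  evaluator: Evaluator,
): Promise<{ name: string; score: number }[]> {
  const raw = await evaluator(buildEvalPrompt(output, criteria));
  return normalizeScores(raw, criteria);
}
```

Passing an evaluator backed by the same provider corresponds to built-in evaluation; passing one backed by a different provider corresponds to cross-provider evaluation.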

Configuring Self-Eval in Sweeps

Via YAML Frontmatter

---
title: Security Audit
sweep:
  defaultIterations: 5
  eval:
    criteria:
      - correctness
      - completeness
      - thoroughness
      - code-quality
    passThreshold: 0.8
---

Via CLI

prompts-gpt sweep audit.md -n 5 \
  --eval \
  --eval-criteria correctness,completeness,thoroughness \
  --eval-threshold 0.8

Via SDK

const result = await sweepPrompt({
  promptFile: "audit.md",
  iterations: 5,
  provider: "codex",
  eval: {
    criteria: ["correctness", "completeness", "thoroughness"],
    passThreshold: 0.8,
  },
});

Evaluation Criteria

Common criteria and what they measure:

Criterion       What It Measures
correctness     Are the changes/findings accurate?
completeness    Were all aspects of the task addressed?
thoroughness    Depth of analysis or implementation
code-quality    Code style, patterns, best practices
security        Security implications of changes
performance     Performance impact of changes
test-coverage   Quality of test additions
documentation   Documentation completeness

You can define custom criteria — any string is accepted.

Score Interpretation

Score       Meaning
0.0–0.3     Poor: significant issues
0.3–0.5     Below average: notable gaps
0.5–0.7     Average: acceptable with improvements
0.7–0.85    Good: meets expectations
0.85–1.0    Excellent: exceeds expectations
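
For reporting, the bands above can be mapped to labels with a small helper. This is a sketch, not part of the package: `interpretScore` and `ScoreBand` are hypothetical names. The table does not say which band a boundary value such as 0.3 belongs to, so the code assumes each band includes its lower bound and excludes its upper bound, with 1.0 counted as Excellent.

```typescript
// Hypothetical helper: maps a 0.0-1.0 score to the band names used on this page.
type ScoreBand = "Poor" | "Below average" | "Average" | "Good" | "Excellent";

// Assumption: bands are lower-inclusive and upper-exclusive, except the last,
// which includes 1.0. The source table leaves boundary handling unspecified.
function interpretScore(score: number): ScoreBand {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`score must be within 0.0 to 1.0, got ${score}`);
  }
  if (score < 0.3) return "Poor";
  if (score < 0.5) return "Below average";
  if (score < 0.7) return "Average";
  if (score < 0.85) return "Good";
  return "Excellent";
}
```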

Pass/Fail Threshold

The passThreshold determines whether an iteration "passes":

eval:
  passThreshold: 0.8  # Overall weighted average must be ≥ 0.8

When an iteration scores below the threshold, the sweep continues to the next iteration and passes the evaluation feedback to it as additional context.
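
The pass/fail check can be sketched as follows. The page does not say how criteria are weighted, so this example assumes equal weights; the sample overall scores shown in eval-scores.json on this page (0.72 and 0.87) are consistent with that assumption after rounding to two decimals. `CriterionScore`, `overallScore`, and `passes` are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch of the threshold check; not the package's implementation.
interface CriterionScore {
  name: string;
  score: number;
}

// Assumption: all criteria carry equal weight, so the overall score is a plain mean.
function overallScore(criteria: CriterionScore[]): number {
  if (criteria.length === 0) {
    throw new Error("at least one criterion is required");
  }
  const sum = criteria.reduce((acc, c) => acc + c.score, 0);
  return sum / criteria.length;
}

// An iteration passes when its overall score is at or above the threshold.
function passes(criteria: CriterionScore[], passThreshold: number): boolean {
  return overallScore(criteria) >= passThreshold;
}

// Scores taken from iteration 1 of the sample eval-scores.json on this page.
const iterationOne: CriterionScore[] = [
  { name: "correctness", score: 0.8 },
  { name: "completeness", score: 0.65 },
  { name: "thoroughness", score: 0.7 },
];
console.log(overallScore(iterationOne).toFixed(2), passes(iterationOne, 0.8));
```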

Eval Results

In Artifacts

Results are stored in eval-scores.json:

{
  "scores": [
    {
      "iteration": 1,
      "overallScore": 0.72,
      "criteria": [
        { "name": "correctness", "score": 0.8 },
        { "name": "completeness", "score": 0.65 },
        { "name": "thoroughness", "score": 0.7 }
      ],
      "passed": false
    },
    {
      "iteration": 2,
      "overallScore": 0.87,
      "criteria": [
        { "name": "correctness", "score": 0.9 },
        { "name": "completeness", "score": 0.85 },
        { "name": "thoroughness", "score": 0.85 }
      ],
      "passed": true
    }
  ],
  "passThreshold": 0.8
}
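
When consuming this file from a script, it can help to give it a type. The interfaces below are inferred from the sample above and are an assumption: the real file may carry additional fields, and `parseEvalScores` and `firstPassingIteration` are hypothetical helpers, not exports of the package. The parser takes the file contents as a string, so reading it from disk is left to the caller.

```typescript
// Shapes inferred from the sample eval-scores.json; the real file may contain more fields.
interface CriterionResult {
  name: string;
  score: number;
}

interface IterationResult {
  iteration: number;
  overallScore: number;
  criteria: CriterionResult[];
  passed: boolean;
}

interface EvalScoresFile {
  scores: IterationResult[];
  passThreshold: number;
}

// Parses the file contents and performs a minimal shape check.
function parseEvalScores(json: string): EvalScoresFile {
  const data = JSON.parse(json) as Partial<EvalScoresFile>;
  if (!Array.isArray(data.scores) || typeof data.passThreshold !== "number") {
    throw new Error("unexpected eval-scores.json shape");
  }
  return data as EvalScoresFile;
}

// Returns the number of the first iteration that passed, or undefined if none did.
function firstPassingIteration(file: EvalScoresFile): number | undefined {
  return file.scores.find((s) => s.passed)?.iteration;
}
```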

Programmatic Access

const result = await sweepPrompt({ /* ... */ });

for (const score of result.evalScores ?? []) {
  console.log(`Iteration ${score.iteration}: ${score.overallScore}`);
  if (!score.passed) {
    console.log("  Below threshold — context fed to next iteration");
  }
}

Eval in Orchestration

Use orchestrate --mode eval for one-shot evaluation:

prompts-gpt orchestrate --mode eval \
  --prompt review.md --provider codex \
  --criteria correctness,completeness,quality

This runs the prompt once and evaluates the output, returning scores without iteration.

Best Practices

  1. Start with 3–5 criteria — too many dilute the signal
  2. Set threshold to 0.7–0.8 — leaves room for improvement
  3. Use cross-provider eval for unbiased scoring
  4. Review score trends across iterations to gauge convergence
  5. Combine with sweeps to get progressive improvement with quality gates
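
Reviewing score trends (item 4) can be partly automated. The sketch below computes the change in overall score between consecutive iterations and applies a simple convergence heuristic. Both helpers are hypothetical, and the default `epsilon` and `window` values are arbitrary choices for illustration, not recommendations from the package.

```typescript
// Change in overall score from each iteration to the next.
function scoreDeltas(overallScores: number[]): number[] {
  return overallScores.slice(1).map((score, i) => score - overallScores[i]);
}

// Hypothetical heuristic: treat the sweep as converged when the last `window`
// deltas are all smaller than `epsilon` in absolute value.
function hasConverged(
  overallScores: number[],
  epsilon = 0.02,
  window = 2,
): boolean {
  const deltas = scoreDeltas(overallScores);
  if (deltas.length < window) return false;
  return deltas.slice(-window).every((d) => Math.abs(d) < epsilon);
}
```

The input would typically be the `overallScore` values collected from `result.evalScores`, in iteration order.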
