Fix a broken LLM pipeline in half a day. Then it just works.

PromptPotter is LLM-driven program evolution for prompts and pipeline parameters. Point it at your backend, run one command, walk away — the Potter critiques, regenerates, and converges on a prompt that holds up under measurement.

Compatible with major model providers:
OpenAI, Anthropic, Groq, OpenRouter

Four layers, one command. The optimizer escalates only when it has to.

  • 01 Candidate generation
  • 02 Context refinement
  • 03 Plan revision
  • 04 Self-optimization

Production-grade prompts at round speed.

Every round, L1 reads the prior winners and the critique log, then proposes a fresh population of candidate prompts. PoBB elimination (ε=0.05, n_min=4) culls the weak so spend tracks signal — no flat sweeps, no wasted queries.

round_0004/candidates.json
winner_id   "6a674185"
composite   0.94   // +0.31 vs origin
parent      "3c1d9020"  (round 3 winner)
eliminated  11 of 16  // PoBB ε=0.05, n_min=4
critique    "add explicit combinatorial count cue"
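
One way to picture the elimination step, as a sketch rather than PromptPotter's actual implementation. It assumes PoBB means "probability of being best" and estimates it by sampling a Beta posterior over each candidate's pass rate. The function names, the (passes, fails) record shape, and the sampling scheme are hypothetical; only ε=0.05 and n_min=4 come from the text above.

```python
import random


def probability_of_being_best(records, draws=2000, seed=0):
    """Estimate, per candidate, how often it has the highest sampled pass rate.

    records maps a candidate id to (passes, fails). This shape is an
    assumption made for illustration, not a documented PromptPotter format.
    """
    rng = random.Random(seed)
    wins = {cid: 0 for cid in records}
    for _ in range(draws):
        # Beta(1 + passes, 1 + fails) is the posterior under a uniform prior.
        sampled = {
            cid: rng.betavariate(1 + passes, 1 + fails)
            for cid, (passes, fails) in records.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {cid: count / draws for cid, count in wins.items()}


def eliminate(records, epsilon=0.05, n_min=4, draws=2000, seed=0):
    """Return ids that have at least n_min evaluations and PoBB below epsilon."""
    pobb = probability_of_being_best(records, draws=draws, seed=seed)
    return sorted(
        cid
        for cid, (passes, fails) in records.items()
        if passes + fails >= n_min and pobb[cid] < epsilon
    )
```

Called as eliminate({"a": (9, 1), "b": (1, 9), "c": (1, 1)}), this drops only "b": candidate "c" has been evaluated fewer than n_min times, so it is kept regardless of its estimate.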

Don’t just tune the prompt — refine what the task even is.

When L1 plateaus across rounds, L2 fires once. It reads the stalled critique stream and rewrites task_context — never pipeline_params. The next L1 round inherits the new framing and goes again, often with a step-change in fitness.

// L2 fired after 3 stalled rounds on hard samples
task_context   "This dataset rewards step-by-step combinatorial
              enumeration. Candidates that skip the count step
              consistently miss multi-hop entries — make the
              counting explicit before the answer."

pipeline_params (unchanged — L2 never edits these)
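
A minimal sketch of that escalation rule, under stated assumptions. The stall test (no gain in the best winner score over the last three rounds), the min_gain threshold, the dict layout, and the "temperature" parameter are all hypothetical. The only behavior taken from the text is that L2 rewrites task_context and leaves pipeline_params alone.

```python
from copy import deepcopy


def l1_has_stalled(winner_scores, stalled_rounds=3, min_gain=0.005):
    """True when the last `stalled_rounds` winners did not beat the earlier best.

    The window of 3 mirrors the "3 stalled rounds" in the example above;
    min_gain is an illustrative threshold, not a documented default.
    """
    if len(winner_scores) <= stalled_rounds:
        return False  # not enough history to call it a plateau
    baseline = max(winner_scores[:-stalled_rounds])
    recent_best = max(winner_scores[-stalled_rounds:])
    return recent_best - baseline < min_gain


def refine_task_context(campaign, new_task_context):
    """Return a copy of the campaign with only task_context replaced."""
    updated = deepcopy(campaign)
    updated["task_context"] = new_task_context
    return updated  # pipeline_params is carried over unchanged
```

With winner scores of 0.27, 0.31, 0.31, 0.31, 0.31 the stall check returns True, and the refined campaign differs from the original only in its task_context.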

When refinement isn’t enough, replan.

L3 fires only when both L1 and L2 have stalled. It scraps the plan and writes a new one: different sample budget, different scoring composite, different exploration policy. The new plan becomes the contract for every round that follows.

round_0007/l3_plan.json
verdict      "replan"
budget       samples_per_round: 64 → 128
composite    accuracy + latency_penalty * 0.15
policy       "prefer wide exploration over fine sweep"
rationale    "L2's task_context edit didn't move composite
              past 0.71 — the scoring shape is wrong, not the prompt"
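
The escalation ladder and the composite from the plan above can be sketched as follows. This is one reading of the rules, not the shipped logic: next_layer and its two boolean inputs are hypothetical, and latency_penalty is assumed to be zero or negative so that slower candidates score lower.

```python
def next_layer(l1_stalled, l2_already_fired):
    """Pick which layer acts next, escalating only when the lower one stalled.

    Assumption: "L2 has stalled" is read here as "L1 is still stalled
    after L2 already fired once".
    """
    if not l1_stalled:
        return "L1"  # candidate generation is still making progress
    if not l2_already_fired:
        return "L2"  # rewrite task_context once
    return "L3"      # both lower layers exhausted: replan


def composite(accuracy, latency_penalty, weight=0.15):
    """Composite as written in the example plan: accuracy + latency_penalty * 0.15."""
    return accuracy + latency_penalty * weight
```

Under these assumptions, a candidate with accuracy 0.80 and a latency penalty of -0.40 scores 0.80 - 0.06 = 0.74.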

A recursive layer. The Potter improves its own meta-prompt.

Recursive optimization
  • Inner cycle exposed via the promptpotter connector
  • Outer Potter mutates the inner’s L1 / L2 / L3 templates
  • One command, with no second runtime to deploy

Observability
  • Every round on disk in human-readable JSON
  • Live dashboard at /ui polls every two seconds
  • Full Langfuse trace for every LLM call

Cost & safety
  • The dashboard leads with spend, not elapsed time
  • Atomic ledger: Ctrl+C loses zero work
  • Multi-tenant identity, OIDC at the API boundary
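
For readers wondering how an on-disk ledger can survive Ctrl+C, here is a common pattern, offered as a sketch and not as PromptPotter's own code: write to a temporary file in the same directory, then swap it into place with os.replace. The function name and the ledger contents are hypothetical.

```python
import json
import os
import tempfile


def write_ledger_atomic(path, ledger):
    """Write `ledger` as JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(ledger, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the swap
        os.replace(tmp_path, path)     # swap the finished file into place
    except BaseException:
        # KeyboardInterrupt lands here too: drop the partial temp file
        # and leave the previous ledger untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

An interrupt before os.replace leaves the previous ledger in place, so the most recent completed write is never half-overwritten.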

The benchmarks the Potter has earned its keep on.

Six datasets, each with its own failure mode. Open one to read the campaign log, the critique trail, and the prompt that won.

TermNorm

Entity normalization on a five-step backend.

Origin: 0.63 composite
Round 4 winner: 0.94 composite
Hard samples: 12 / 312 still missing
Spend: $2.41

AIME

Competition math — 60% from a single-axis edit.

Round 1 winner: 6a674185
Pass rate: 60% (+24)
Edit: "combinatorial count" cue

BBEH

Big-Bench Extra Hard — the headline benchmark.

Public reference: 14.8%
PromptPotter: in validation
Protocol: held-out test split

GSM8K

Grade-school math, round-1 convergence.

Rounds to ≥95%: 1
Provider: Groq · gpt-oss-120b
Spend: $0.34

JustLogic

Logical depth ≥ 6 — where L2 earns its place.

Origin: 27% pass
After L2 fired (round 4): 44% pass
L1 alone (stalled): 31% pass

HotpotQA

Multi-hop QA across paragraphs.

Status: M11 Track 1
Provider: Groq · gpt-oss-120b
Comparison: vs. promptfoo, promptolution

Resources to read before your first campaign.

Manual

Install → first run → reading the dashboard

A thirty-minute path from pip install to your first converged campaign, in plain language.

Concepts

How the three-layer loop actually escalates

Why L2 fires only after L1 stalls, why L3 is the last resort, and what the recursive L4 layer is for.

Research

Benchmarks & ablation studies

BBEH, AIME, GSM8K, JustLogic, HotpotQA — with protocols, with held-out splits, with the receipts.

“It fixed a six-month-stuck pipeline in an afternoon. Then it just worked.”

— Maintainer dogfooding TermNorm, sprint 14

Start your first campaign.

Free. One command. Read every round in your editor as it lands.

Get started — it’s free
BYO model & backend
Groq, OpenAI, Anthropic, OpenRouter, Langfuse, Python 3.13, TermNorm