Basic concepts
Deep dive into HyperAgents: agents, the evolutionary loop, archive, evaluation, and execution. Workflow diagrams are included below alongside the narrative so you can read everything in one place.
Overview
HyperAgents combines evolutionary computation and Quality-Diversity (QD) ideas: keep an archive of agent versions, score each one, and use strong ancestors as parents for the next mutation (MetaAgent edits).
Workflow diagrams
[Diagrams: Evolutionary loop (outer); One generation (sequence); TaskAgent vs MetaAgent (programs); Execution mode]
The two agents
TaskAgent — the worker
- Input: formatted task string (from domain.formatInput).
- Output: prediction string (and optional structured result).
- Tools: domain-specific, optional (e.g. calculator, bash).
- Implementation: src/agent/task_agent.ts.
Behavior is mostly prompt + tools — both can be edited by the MetaAgent.
MetaAgent — the improver
- Input: repository path, evaluation paths, parent score context.
- Output: modified files / diffs on disk.
- Tools: framework bash + editor only.
- Implementation: src/agent/meta_agent.ts.
The MetaAgent is the mutation operator: it does not solve tasks directly; it rewrites what does.
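How they cooperate
The MetaAgent edits the prompt, tools, and code that the TaskAgent runs with; the TaskAgent is then evaluated on the domain's tasks, and the resulting scores feed into the MetaAgent's context for the next generation.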
The evolutionary loop
Implemented in src/core/generate_loop.ts. Each generation typically:
- Select parent from the archive (select_parent.ts).
- Set up executor (local or Docker).
- Apply lineage: replay patches so the workspace matches the parent.
- Run MetaAgent: produce a new patch from failures and context.
- Run TaskAgent via harness (staged eval may run first if configured).
- Evaluate: domain scores predictions; reports written under output dir.
- Update archive: append JSONL snapshot with new genId, scores, patch list.
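The steps above can be read as the following control flow. This is a simplified sketch, not the contents of src/core/generate_loop.ts: the `LoopEntry` type, the `LoopSteps` callbacks, and `runLoopSketch` are hypothetical names, and executor setup, lineage replay, and the MetaAgent run are folded into a single `improve` callback.

```typescript
// Hypothetical, simplified shapes; the real framework types are richer.
interface LoopEntry {
  genId: string;
  parentId: string | null;
  score: number;
}

interface LoopSteps {
  selectParent: (archive: LoopEntry[]) => LoopEntry; // select parent
  improve: (parent: LoopEntry) => string;            // executor + lineage + MetaAgent, returns a patch
  evaluate: (patch: string) => number;               // TaskAgent + domain scoring, returns a score in [0, 1]
}

function runLoopSketch(initial: LoopEntry, steps: LoopSteps, maxGenerations: number): LoopEntry[] {
  const archive: LoopEntry[] = [initial];
  for (let gen = 1; gen <= maxGenerations; gen++) {
    // Stop once some entry already has a perfect score.
    if (archive.some((e) => e.score >= 1.0)) break;
    const parent = steps.selectParent(archive);
    const patch = steps.improve(parent);
    const score = steps.evaluate(patch);
    // Record the child with a pointer back to its parent.
    archive.push({ genId: String(gen), parentId: parent.genId, score });
  }
  return archive;
}

// Toy run: the "patch" is just the parent's score plus 0.25, capped at 1.
const loopResult = runLoopSketch(
  { genId: "initial", parentId: null, score: 0.5 },
  {
    selectParent: (a) => a[a.length - 1],
    improve: (p) => String(Math.min(1, p.score + 0.25)),
    evaluate: (patch) => Number(patch),
  },
  5,
);
```

In the toy run the loop ends after two generations, because the second child reaches a score of 1.0 before the generation budget is spent.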
Configuration sketch
// Assumes myDomain, metaAgent, and model are already defined.
const config: GenerateLoopConfig = {
  domains: [myDomain],
  metaAgent,
  taskAgentFactory: (t) => new TaskAgent({ model, tools: t }),
  tools: getFrameworkTools(),
  outputDir: "./outputs/evolution",
  repoPath: ".",
  maxGenerations: 5,
  executionMode: "local", // see Executors
  parentSelection: "score_child_prop", // see Parent selection strategies
  evalSamples: 10,
};
The archive
The archive is an append-only JSONL file: each line is a full snapshot { archive, entries }. Read the last line for current state.
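A minimal sketch of the append and read-last-line pattern follows. It works on the file contents as a string so that it stays self-contained; `Snapshot`, `appendSnapshot`, and `latestSnapshot` are hypothetical names, and the inner types of `archive` and `entries` are deliberately left open because they are not specified here.

```typescript
// Assumed snapshot shape; the inner types are left open on purpose.
interface Snapshot {
  archive: unknown;
  entries: unknown;
}

// Appending never rewrites earlier lines: serialize and add one line.
function appendSnapshot(jsonl: string, snapshot: Snapshot): string {
  return jsonl + JSON.stringify(snapshot) + "\n";
}

// Current state is the last non-empty line of the file.
function latestSnapshot(jsonl: string): Snapshot | undefined {
  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) return undefined;
  return JSON.parse(lines[lines.length - 1]) as Snapshot;
}

// Illustrative data only.
let archiveText = "";
archiveText = appendSnapshot(archiveText, { archive: ["initial"], entries: [] });
archiveText = appendSnapshot(archiveText, { archive: ["initial", "1"], entries: [] });
const currentSnapshot = latestSnapshot(archiveText);
```

Earlier lines stay in the file, so any previous state can be recovered by parsing the corresponding line instead of the last one.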
Entry shape (conceptual)
| Field | Meaning |
|---|---|
| genId | Unique generation id |
| parentId | Parent generation (tree edge) |
| patchFiles | Patch paths in lineage |
| scores | Per-domain numeric scores |
| validParent | Can future gens use this as parent? |
| metadata | Run metadata (e.g. run_eval) |
Lineage is a tree
Branches are normal: parentId points to the actual ancestor, not necessarily the latest id.
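To make the tree structure concrete, the sketch below walks parentId links from one generation back to the root. `LineageEntry` and `lineageOf` are hypothetical names, and the root is assumed to carry a null parentId.

```typescript
// Hypothetical minimal entry: only the two fields needed to walk the tree.
interface LineageEntry {
  genId: string;
  parentId: string | null; // assumed null for the root
}

// Returns the chain of genIds from the root down to `genId`.
function lineageOf(entries: LineageEntry[], genId: string): string[] {
  const byId = new Map<string, LineageEntry>();
  for (const e of entries) byId.set(e.genId, e);

  const chain: string[] = [];
  const seen = new Set<string>();
  let cursor: string | null = genId;
  while (cursor !== null) {
    if (seen.has(cursor)) throw new Error(`cycle at ${cursor}`);
    seen.add(cursor);
    const entry: LineageEntry | undefined = byId.get(cursor);
    if (entry === undefined) throw new Error(`unknown genId ${cursor}`);
    chain.unshift(entry.genId); // prepend so the root ends up first
    cursor = entry.parentId;
  }
  return chain;
}

// Generation 3 branches from generation 1, not from the latest generation 2.
const lineageTree: LineageEntry[] = [
  { genId: "initial", parentId: null },
  { genId: "1", parentId: "initial" },
  { genId: "2", parentId: "1" },
  { genId: "3", parentId: "1" },
];
const chainFor3 = lineageOf(lineageTree, "3");
```

Here `chainFor3` skips generation 2 entirely, which is the order in which a lineage replay would need to visit ancestors for generation 3.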
Why JSONL?
Appending a line is cheap; you keep history of every snapshot without rewriting a giant JSON file. See also JSONL vs JSON in the table below.
| | JSON | JSONL |
|---|---|---|
| Structure | One object per file | One object per line |
| Append | Rewrite file | Append line |
| Latest state | Parse all | Read last line |
| Typical use here | report.json, predictions.json | archive.jsonl |
Parent selection strategies
From src/core/select_parent.ts, chosen once in config for the whole run:
| Strategy | Behavior |
|---|---|
| random | Uniform over valid parents; maximum exploration |
| latest | Most recent valid parent; simple chain |
| best | Highest score; pure exploitation |
| score_prop | Random, weighted by score |
| score_child_prop | Score-weighted with child penalty (default); explores under-used parents |
Why not always best? You can get stuck in a local maximum; sometimes a weaker parent opens a path to a better global solution.
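The child penalty (default strategy) uses weight = (score + 0.01) × 1 / (1 + numChildren), so a parent's weight shrinks as it accumulates children, and the 0.01 offset keeps zero-score parents selectable.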
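A worked example of the weight formula, followed by one way to sample from the weights. The formula is taken from the text; the roulette-wheel sampling in `pickWeighted` is an assumption about how such weights are typically used, not a copy of select_parent.ts, and `Candidate` is a hypothetical type.

```typescript
interface Candidate {
  genId: string;
  score: number;       // assumed to be in [0, 1]
  numChildren: number;
}

// weight = (score + 0.01) * 1 / (1 + numChildren)
function childPenaltyWeight(c: Candidate): number {
  return (c.score + 0.01) * (1 / (1 + c.numChildren));
}

// Roulette-wheel sampling; `rand` is injectable for reproducible runs.
function pickWeighted(candidates: Candidate[], rand: () => number = Math.random): Candidate {
  if (candidates.length === 0) throw new Error("no valid parents");
  const weights = candidates.map(childPenaltyWeight);
  const total = weights.reduce((a, b) => a + b, 0);
  let threshold = rand() * total;
  for (let i = 0; i < candidates.length; i++) {
    threshold -= weights[i];
    if (threshold < 0) return candidates[i];
  }
  return candidates[candidates.length - 1]; // guard against rounding
}

const parentPool: Candidate[] = [
  { genId: "1", score: 0.9, numChildren: 4 }, // weight = 0.91 / 5 = 0.182
  { genId: "2", score: 0.6, numChildren: 0 }, // weight = 0.61 / 1 = 0.61
];
const sampledParent = pickWeighted(parentPool);
```

With these numbers the weaker but childless generation 2 is drawn roughly 77% of the time (0.61 out of a total weight of 0.792), which is the exploration effect the penalty is meant to produce.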
Domains and evaluation
A Domain (src/domains/base.ts) defines your benchmark:
- config: name, splits, score keys, sample counts.
- loadTasks: async load of DomainTask[].
- evaluate: score one prediction (usually 0–1).
- formatInput: task → model prompt.
- report: aggregate EvalResult[] into summary.
Example domains in the repo include bash, scoring, calculator, factcheck, paper review, and git evolution demos.
Evaluators
src/domains/evaluators.ts provides three patterns:
- staticEvaluator: normalized string equality; free and deterministic.
- llmJudgeEvaluator: rubric-based model scoring; costs tokens.
- humanFeedbackEvaluator: map user ratings to the 0–1 interval.
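The static and human-feedback patterns are simple enough to sketch directly. The functions below are illustrations with hypothetical names, not the signatures exported by src/domains/evaluators.ts; the normalization rules (trim, collapse whitespace, lowercase) and the 1 to 5 rating scale are assumptions.

```typescript
// Assumed normalization: trim, collapse whitespace, lowercase.
function normalizeAnswer(text: string): string {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

// Static pattern: 1 if the normalized strings match, else 0.
function exactMatchScore(prediction: string, expected: string): number {
  return normalizeAnswer(prediction) === normalizeAnswer(expected) ? 1 : 0;
}

// Human-feedback pattern: map a rating on [min, max] linearly onto [0, 1].
function ratingToScore(rating: number, min = 1, max = 5): number {
  if (max <= min) throw new Error("max must exceed min");
  const clamped = Math.min(max, Math.max(min, rating));
  return (clamped - min) / (max - min);
}

const matchScore = exactMatchScore("  Paris \n", "paris");
const midRating = ratingToScore(3);
```

An LLM judge follows the same contract (prediction in, number between 0 and 1 out) but needs a model call, so it is omitted here.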
Pick the one that matches task objectivity and budget.
The harness
src/domains/harness.ts connects TaskAgent to tasks: it runs the TaskAgent over a domain's tasks so the predictions can be scored and reported.
It is used for one-off evals and inside runGenerateLoop.
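In outline, a harness of this kind loads tasks, formats each one, calls the agent, and scores the result. The sketch below assumes hypothetical, cut-down shapes (`MiniTask`, `MiniDomain`, `AgentFn`, `runHarnessSketch`) that loosely follow the Domain fields above; it is not the API of src/domains/harness.ts.

```typescript
// Hypothetical minimal shapes for illustration.
interface MiniTask { id: string; question: string; expected: string; }
interface MiniDomain {
  loadTasks: () => Promise<MiniTask[]>;
  formatInput: (task: MiniTask) => string;
  evaluate: (prediction: string, task: MiniTask) => number;
}
type AgentFn = (input: string) => Promise<string>;

interface HarnessResult {
  predictions: Record<string, string>;
  scores: Record<string, number>;
  mean: number;
}

async function runHarnessSketch(domain: MiniDomain, agent: AgentFn): Promise<HarnessResult> {
  const tasks = await domain.loadTasks();
  const predictions: Record<string, string> = {};
  const scores: Record<string, number> = {};
  let total = 0;
  for (const task of tasks) {
    const prediction = await agent(domain.formatInput(task));
    predictions[task.id] = prediction;
    scores[task.id] = domain.evaluate(prediction, task);
    total += scores[task.id];
  }
  return { predictions, scores, mean: tasks.length > 0 ? total / tasks.length : 0 };
}

const toyDomain: MiniDomain = {
  loadTasks: async () => [
    { id: "t1", question: "2+2", expected: "4" },
    { id: "t2", question: "3+3", expected: "6" },
  ],
  formatInput: (task) => `Answer with a number only: ${task.question}`,
  evaluate: (prediction, task) => (prediction.trim() === task.expected ? 1 : 0),
};
// Stand-in agent that always answers "4".
const alwaysFour: AgentFn = async () => "4";
const harnessRun = runHarnessSketch(toyDomain, alwaysFour);
```

Keeping predictions and scores in separate maps mirrors the split between predictions.json and report.json.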
Predictions vs scores
| | Score | Prediction |
|---|---|---|
| What | Number from 0 to 1 | Model output string |
| Typical files | report.json | predictions.json |
| Used for | Parent selection, ranking | User-facing output, debugging |
ensemble (src/core/ensemble.ts) picks a high-scoring generation and returns its prediction for a given question.
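One way to read that description is "rank generations by score, then return the first available prediction for the question". The sketch below implements that reading; `GenerationResult` and `ensemblePredict` are hypothetical names, and falling back to the next-best generation when the best one has no answer is an assumption rather than documented behavior of src/core/ensemble.ts.

```typescript
interface GenerationResult {
  genId: string;
  score: number;
  predictions: Record<string, string>; // question id -> prediction
}

function ensemblePredict(gens: GenerationResult[], questionId: string): string | undefined {
  // Consider generations best-first; skip any that never answered this question.
  const ranked = [...gens].sort((a, b) => b.score - a.score);
  for (const g of ranked) {
    const prediction = g.predictions[questionId];
    if (prediction !== undefined) return prediction;
  }
  return undefined;
}

// Illustrative data only.
const generationResults: GenerationResult[] = [
  { genId: "1", score: 0.4, predictions: { q1: "ls" } },
  { genId: "2", score: 0.9, predictions: { q1: "ls -la" } },
  { genId: "3", score: 0.95, predictions: {} },
];
const answerForQ1 = ensemblePredict(generationResults, "q1");
```

Generation 3 scores highest but has no prediction for q1, so the answer comes from generation 2.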
Executors
src/utils/executor.ts — same interface, two modes:
- Local — temp directory, fastest for development; host must trust generated code.
- Docker — per-generation container; slower, safer for untrusted codegen.
Output layout (evolution)
Typical tree under outputDir:
outputs/bash_evolution/
├── archive.jsonl
├── gen_initial/metadata.json
├── gen_1/
│ ├── metadata.json
│ ├── agent_output/model_patch.diff
│ └── bash_eval/
│ ├── predictions.json
│ └── report.json
└── gen_2/ ...
A single eval (no loop) may produce only predictions.json and report.json.
Self-referential improvement (prompt files)
If HyperAgents is installed from npm, framework TypeScript in node_modules is not what you mutate. Instead, point agents at files in your repo:
const metaAgent = new MetaAgent({ model, promptFile: "./prompts/meta_agent.txt" });
// or
const config: GenerateLoopConfig = {
  // ...
  promptsDir: "./prompts",
};
With promptsDir, the loop can scaffold meta_agent.txt and task_agent.txt. Template placeholders such as {{repoPath}}, {{evalPath}}, {{scoreContext}} are filled at runtime (see main repo docs/concepts.md).
Without promptsDir, built-in templates are used — the MetaAgent still edits your domain code and separate files, but not its packaged default prompt text.
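The placeholder substitution mentioned above can be illustrated with a small helper. `fillTemplate` is a hypothetical function, not the framework's implementation, and the values passed in are invented for the example; leaving unknown placeholders untouched is a design choice of this sketch.

```typescript
// Replaces {{name}} placeholders; names without a value are left as-is.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match,
  );
}

// Illustrative values only.
const filledPrompt = fillTemplate(
  "Repo: {{repoPath}}\nEval: {{evalPath}}\nScores: {{scoreContext}}",
  {
    repoPath: ".",
    evalPath: "./outputs/evolution/gen_1",
    scoreContext: "parent score 0.6",
  },
);
```

Leaving unmatched placeholders visible makes a missing value easy to spot in the rendered prompt.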
Early termination
- If the best archive score reaches 1.0, the loop stops (no wasted compute).
- The MetaAgent prompt includes score context so it avoids needless edits when already at 100%.
Examples overview
| Example | Demonstrates | Loop |
|---|---|---|
| scoring | Prompt / grading logic | Manual or demo script |
| calculator | Fixing a buggy tool | Manual iterations |
| bash | Command generation | eval / evolve |
| factcheck | Classification | eval / evolve |
| paper_review | Accept/reject | Single eval in script |
| git_evolution | Git-native patches | Full loop |
Glossary
| Term | Definition |
|---|---|
| Archive | JSONL history of generations and scores |
| Domain | Task suite + evaluation for one benchmark |
| Evaluator | static / LLM judge / human scoring helper |
| Executor | Local or Docker workspace for one generation |
| Generation | One improve + evaluate cycle |
| Harness | Runs TaskAgent over domain tasks |
| MetaAgent | Edits code to improve TaskAgent |
| Parent | Archive node used as base for a child |
| Patch | Diff capturing MetaAgent changes |
| Prediction | Raw TaskAgent output for a task |
| Selection strategy | Rule for picking the next parent |
