harness engineering — an exploration

Three Repos.
One Bot Account.
A Lot of Questions.

A live exploration of agent-first development — based on a real Go repo, a real runner, and real failure modes. Nothing is settled yet.

01 The three-repo topology: TASK, TARGET, LOG
02 What kind of agent are you actually using?
03 The adversarial problem: two agents, one PR
04 Context window: what do you put in?
05 Web search is a liability, or is it?
06 Subagents: isolated windows, ephemeral state
07 Quick-fire: test yourself

01 — topology

Three Repos.
One Sandbox Account.

task repo · control

spec-socialpredict-tasks

/ (orchestration)
├── AGENTS.md
├── TASKS.json
├── codex-runner.sh
├── defaults.env
├── .codex/
│   └── agents, profiles
├── .codex-runs/
│   ├── events/*.ndjson
│   ├── messages/*.txt
│   ├── context/*.ndjson
│   └── RUNLOG.ndjson
└── scripts/

Runner polls TASKS.json. Picks next ready task by dependency order. Dispatches a Codex session. Monitors context every 15s. Checkpoints at 70%, resumes with session ID. Writes all artifacts here.
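A minimal sketch of "picks next ready task by dependency order". The real TASKS.json schema is not shown here, so the shape below (a list of objects with `id`, `status`, and `depends_on`) and the `done` status name are assumptions made for illustration.

```python
import json

# Hypothetical TASKS.json content; the real schema is not shown in this document.
TASKS_JSON = """
[
  {"id": "SP-001", "status": "done",    "depends_on": []},
  {"id": "SP-002", "status": "pending", "depends_on": ["SP-001"]},
  {"id": "SP-003", "status": "pending", "depends_on": ["SP-002"]}
]
"""

def next_ready_task(tasks):
    """Return the first pending task whose dependencies are all done, or None."""
    done = {t["id"] for t in tasks if t["status"] == "done"}
    for task in tasks:
        if task["status"] == "pending" and all(d in done for d in task["depends_on"]):
            return task
    return None

tasks = json.loads(TASKS_JSON)
picked = next_ready_task(tasks)
print(picked["id"] if picked else "nothing ready")
```

SP-003 is skipped because SP-002 is not done yet, so a polling runner built this way would dispatch one task per dependency level at a time.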

target repo · fork

socialpredict

/ (fork of real repo)
├── backend/
│   └── scripts/
│       └── guardrails.sh
├── .codex-reports/
│   └── tasks/
│       └── SP-001/
│           ├── meta.json
│           ├── summary.json
│           ├── conversation.ndjson
│           └── decisions.ndjson
└── AGENTS.md

Code changes happen here. Agent opens PRs against this fork. Guardrails run pre-commit. Reports written to .codex-reports/. Human reviews and either merges or closes entirely.

sandbox account

pwdel-auto

Separate GitHub identity. Not your main account. All automated commits, PRs, pushes, and reviews come from this identity only.

Naming pattern: username-auto, username-harness, username-bot

The real repo is forked under this account.

The human in the loop is exactly one thing: PR review. You see the diff. You approve or you close it completely. Closing means thrown out — no partial merges, no "fix it later." The task goes back to pending.
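The all-or-nothing review rule can be sketched as a tiny state transition. This is an illustration only: the `done` and `in_review` status names and the `attempts` counter are assumptions, not fields confirmed by the harness.

```python
def apply_review(task, decision):
    """Update a task dict after the human's PR decision: 'merge' or 'close'."""
    if decision == "merge":
        task["status"] = "done"  # assumed name for the terminal state
    elif decision == "close":
        # Closed means thrown out: no partial merge, the task is queued again.
        task["status"] = "pending"
        task["attempts"] = task.get("attempts", 0) + 1  # hypothetical retry counter
    else:
        raise ValueError(f"unknown decision: {decision!r}")
    return task

closed = apply_review({"id": "SP-001", "status": "in_review"}, "close")
print(closed["status"], closed["attempts"])
```

There is deliberately no third branch for "merge with changes"; the only two outcomes mirror the approve-or-close rule described above.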

02 — agent types

What Kind of Agent
Are You Actually Using?

type 01

Gate

Makes or blocks a decision. Pass/fail criteria layered on top of policy. Blocker / high / follow-up severity levels. Work cannot proceed until gates clear.

require 2 reviews when backend/ changes

type 02

Policy

Follows explicit rules. Naming patterns, required artifacts, concrete triggers. Good for consistency and repo-specific discipline. Predictable but rigid.

all handlers must have OpenAPI annotations

type 03

Heuristic

Use good judgement. Leaves things to the LLM. Flexible where rules are hard to specify — but can drift, hallucinate preferences, or conflict silently with policy agents.

prefer idiomatic Go patterns where applicable

In the socialpredict harness, many agents blend all three orientations; the interesting question is which type is dominant and whether that matches your intent.
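The gate example above ("require 2 reviews when backend/ changes") can be sketched as a pure pass/fail function. The function name, the return shape, and the choice to report a failure as a blocker are assumptions for illustration, not the harness's actual interface.

```python
def review_gate(changed_files, approvals, required=2):
    """Gate: block unless enough approvals exist when backend/ is touched."""
    touches_backend = any(path.startswith("backend/") for path in changed_files)
    if touches_backend and approvals < required:
        return {
            "passed": False,
            "severity": "blocker",
            "reason": f"backend/ changed: {approvals}/{required} reviews",
        }
    return {"passed": True, "severity": None, "reason": "gate clear"}

result = review_gate(["backend/scripts/guardrails.sh", "README.md"], approvals=1)
print(result["passed"], result["reason"])
```

A gate like this has no room for judgement, which is what separates it from the heuristic type: the same inputs always produce the same verdict.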

03 — the adversarial problem

Two Agents.
Same PR. No Coordination.

policy + gate · non-flexible

Go Best Practices Agent

cyclomatic complexity: 12 → FAIL (max 10)

go vet: PASS

function length: 87 lines → FAIL (max 60)

SOLID: interface segregation OK

heuristic · style-oriented

Go Style Guide Agent

readability: clear and idiomatic

naming conventions: OK

complexity: "acceptable for this domain"

error handling: idiomatic

Both agents reviewed the same function. The Go Best Practices Agent flags cyclomatic complexity as a hard blocker. The Go Style Guide Agent calls it acceptable for the domain context. Neither agent knows the other exists. How do you resolve this?

Choose how the human resolves this conflict. None of these options are obviously correct — that's the point.
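Before any resolution strategy is chosen, the disagreement has to be made visible. The sketch below only detects conflicts and leaves the decision to the human; the finding format, agent identifiers, and verdict labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical findings; agent names and verdict labels are illustrative.
findings = [
    {"agent": "go-best-practices", "check": "complexity", "verdict": "fail"},
    {"agent": "go-style-guide", "check": "complexity", "verdict": "pass"},
    {"agent": "go-best-practices", "check": "go vet", "verdict": "pass"},
    {"agent": "go-style-guide", "check": "naming", "verdict": "pass"},
]

def find_conflicts(findings):
    """Return checks on which at least two agents reached different verdicts."""
    by_check = defaultdict(dict)
    for f in findings:
        by_check[f["check"]][f["agent"]] = f["verdict"]
    return {
        check: verdicts
        for check, verdicts in by_check.items()
        if len(set(verdicts.values())) > 1
    }

conflicts = find_conflicts(findings)
print(conflicts)
```

This only works if both agents report against a shared check name, which is itself a form of coordination the two agents do not have by default.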

04 — context window

The Context Budget.
What Do You Put In?

Interactive panel: total context used, pre-compaction frequency (never to aggressive), context saved, reasoning quality.

        soft-threshold: 60% → wrap up

        hard-threshold: 70% → SIGTERM

        → session_id stored → resume

        context poll: every 15s

Different inputs compete for the same limited window. In the real runner, once you cross 70% the session is interrupted with SIGTERM and resumed from a checkpoint using the stored session ID.
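The two thresholds can be read as a three-way decision made on every poll. A minimal sketch, assuming usage is measured as used tokens over window size and that both thresholds are inclusive; the function and action names are invented here.

```python
SOFT_THRESHOLD = 0.60  # ask the agent to wrap up
HARD_THRESHOLD = 0.70  # interrupt the session (SIGTERM in the runner)

def context_action(used_tokens, window_tokens):
    """Map current context usage to the runner's next action."""
    usage = used_tokens / window_tokens
    if usage >= HARD_THRESHOLD:
        return "terminate_and_resume"
    if usage >= SOFT_THRESHOLD:
        return "wrap_up"
    return "continue"

for used in (40_000, 65_000, 80_000):
    print(used, context_action(used, 100_000))
```

The gap between the soft and hard thresholds is the agent's budget for finishing its current step cleanly before the runner cuts it off.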

05 — information sources

Web Search Is
A Liability.

Or a tool. Depends entirely on whether you control what goes in.

Open web search under an agent framework introduces non-determinism — results change run-to-run, content can be adversarial, and sources lie. In this harness: web search is off by default. Anything that goes in must be on an explicit allowlist. For Go style rules: scrape the official guide once, verify it, version it in a knowledge base, and never touch the open web again.
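An explicit allowlist can be as small as a host check in front of every fetch. The sketch below is an assumption about how such a check might look; the listed hosts are placeholders, not the harness's real allowlist.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the harness's real list is not shown in this document.
ALLOWED_HOSTS = {"go.dev", "pkg.go.dev"}

def is_allowed(url):
    """Allow only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS

for url in (
    "https://go.dev/doc/effective_go",
    "http://go.dev/doc/effective_go",
    "https://go.dev.evil.example/doc",
):
    print(url, is_allowed(url))
```

Matching the exact hostname rather than a substring is the important design choice: a substring test would accept the third URL.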


06 — subagents and parallel context windows

Every Agent Gets
Its Own Window.

Each Codex or Claude agent runs in an isolated environment with its own full context budget. Spawning subagents doesn't share your main window — it opens new ones. When an agent shuts down, its window is cleared. So how do you keep state alive?

main dispatcher

dispatcher_agent

Context: task prompt + AGENTS.md + summary.json

window: 45%

specialist A

go-lint agent

Fresh window. Loads only what it needs.

window: 30%

specialist B

test-runner agent

Separate process. Own token budget.

window: 55%

↑ These three agents run in parallel. None share state. Each closes and clears when done.

how to keep an agent alive / persist state

approach 01

Run persistently

Host as a long-running process. Supervisor restarts on crash.

approach 02 · recommended

Serialize state externally

Checkpoint to storage. Reload on restart with summarization.

approach 03

Retrieval-augmented rehydration

Embeddings in a vector DB. Fetch only what's relevant.

approach 04

Keep-alive / heartbeat

Periodic pings prevent idle shutdown on supported platforms.

The key insight: context windows are ephemeral by default. Any state you want to survive shutdown must be explicitly serialized somewhere outside the agent process.
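A minimal sketch of approach 02, assuming state fits in a JSON document on local disk. The file name and the state fields (`session_id`, `summary`) are hypothetical; the summarization step mentioned above is left out.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return saved state, or None when no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None

with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    first_load = load_checkpoint(checkpoint)  # nothing saved yet
    save_checkpoint(checkpoint, {"session_id": "example-session", "summary": "step 3 of 5"})
    restored = load_checkpoint(checkpoint)

print(first_load, restored)
```

Writing to a temporary file and renaming means an agent killed mid-write leaves the previous checkpoint intact rather than a half-written one.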

07 — test yourself

Quick-Fire.
Six Questions.

