Agent evaluation, in the open

See how agents
actually work.

We run LLM agents on real workflows in sandboxed workspaces and grade the result. Every file read, every edit, every verdict — an auditable trace you can open.

Explore a sample trace →

01 · Prompt

The task and the files the agent is handed — a real, bounded workspace.

02 · Transcript

Every message, tool call, and result, in order. Nothing summarized away.

03 · Verdict

The edits the agent made and how the rubric graded them. Auditable, not asserted.
