Agent evaluation, in the open
See how agents actually work.
We run LLM agents on real workflows in sandboxed workspaces and grade the result. Every file read, every edit, every verdict — an auditable trace you can open.
Explore a sample trace →
01 · Prompt
The task and the files the agent is handed — a real, bounded workspace.
02 · Transcript
Every message, tool call, and result, in order. Nothing summarized away.
03 · Verdict
The edits it made and how the rubric graded them. Auditable, not asserted.