Context bloat is real, but the architecture has the potential to solve it.

You need clever naming for the filesystem and an exploration policy in AGENTS.md (not trivial!).
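
To make that concrete, here is a rough sketch of what a name-driven exploration policy could look like. Everything in it is my assumption, not an existing API: the `explore` function, the keyword matching, and the character budget are all hypothetical. The idea is that the agent scores paths by name alone, which is cheap, and only reads the best matches until the budget runs out.

```python
from pathlib import Path


def explore(root: Path, keywords: list[str], budget_chars: int) -> dict[str, str]:
    """Hypothetical policy: rank files by keyword hits in their path,
    then read them best-first until the character budget is spent."""

    def score(path: Path) -> int:
        # Only the relative path is inspected, so scoring never opens a file.
        name = path.relative_to(root).as_posix().lower()
        return sum(kw.lower() in name for kw in keywords)

    candidates = [p for p in root.rglob("*") if p.is_file()]
    # Highest score first; ties broken by path for deterministic order.
    ranked = sorted(candidates, key=lambda p: (-score(p), p.as_posix()))

    context: dict[str, str] = {}
    used = 0
    for path in ranked:
        if score(path) == 0:
            break  # nothing left whose name matches the task
        text = path.read_text(encoding="utf-8", errors="replace")
        if used + len(text) > budget_chars:
            continue  # too large for the remaining budget, try smaller files
        context[path.relative_to(root).as_posix()] = text
        used += len(text)
    return context
```

This only works if the names carry meaning, which is why the naming scheme and the policy have to be designed together.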

The benchmark is definitely the core bottleneck. I don't know of any good benchmark for this; it's probably an open research question in itself.