Blog

Notes on how GBA Eval works and why it is built the way it is.

Grok Build 0.1 (unsuccessfully) reward-hacks GBA Eval: Grok hard-codes the exact example ROMs in its CPU implementation to fake passing.

How gameplay replay scoring works: How we grade emulators on producing gameplay replays.

What replay scores look like: Examples of candidate emulators at each replay-score band.

Iterating on the grading strategy: Why grading a GBA emulator is hard, and the process we followed while iterating on our scoring methodology.

How the environment is designed: What do models see when attempting the GBA Eval task?