Blog
Notes on how GBA Eval works and why it is built the way it is.
Grok Build 0.1 (unsuccessfully) reward-hacks GBA Eval: Grok hard-codes the exact example ROMs in its CPU implementation to fake passing.
How gameplay replay scoring works: How we grade emulators on producing gameplay replays.
What replay scores look like: Examples of candidate emulators at each replay-score band.
Iterating on the grading strategy: Why grading a GBA emulator is hard and the process we followed while iterating on our scoring methodology.
How the environment is designed: What do models see when attempting the GBA Eval task?