This post is the first in a theoretically unbounded series on the Bochs code cache. In it, I describe the architecture of the Bochs code cache and examine how sensitive its performance is to the cache's size parameters.
Why have a code cache?
Bochs is a free, open-source x86 system emulator that’s available for both Windows and Linux. Like most modern high-performance system emulators, Bochs employs sophisticated techniques under the hood to accelerate fetching and decoding guest x86 instructions.
Fetching and decoding instructions is hard in x86, and the task is painfully slow in software. The higher the Bochs icache hit rate, the less time Bochs has to spend fetching and decoding instructions and the more time it can spend actually doing real work. Whereas a real, physical system has a hardware instruction cache (and, if you're fancy, a micro-op cache), dynamic systems like the JVM, Pin, DynamoRIO, QEMU, and Bochs employ a software-managed code cache to accelerate instruction fetch and decode.
Architecture and Operation of the Bochs Code Cache
- The entry[] array
  - maps from physical addresses to mpool[] entries
  - indexed as a direct-mapped cache
  - 64K bxICacheEntry_c entries by default
- The mpool[] array
  - maps decoded instructions to decoded instruction handlers
  - next pointers define traces
  - allocated in ascending order until full, then the whole cache is flushed
  - 576K bxInstruction_c entries by default
On each instruction fetch, the entry[] slot selected by the physical address is consulted, and one of two things happens:

- The entry is invalid or has a mismatching address (i.e., a miss)
  - The selected entry is (re-)allocated for the new physical address
  - The Bochs front-end fetches and decodes up to BX_MAX_TRACE_LENGTH instructions (default is 32) starting at that physical address, looking for new entry-array hits as it goes
  - If mpool[] is in danger of filling, the whole code cache is flushed
  - Decoded instructions are placed into consecutive mpool[] entries, starting immediately after the previous insertion
  - Bochs' CPU model begins execution at the first newly-inserted instruction
- The entry is valid and has a matching address (i.e., a hit)
  - The selected entry's instruction pointer into the mpool[] table is dereferenced
  - Bochs' CPU model begins execution at the dereferenced mpool[] entry and continues through the execute1 and execute2/next pointers through the rest of the trace
Analysis, TBD
- What is the code cache’s hit rate, and how is it affected by BxICacheEntries and BxICacheMemPool?
- How often is the code cache flushed?
- How does a modern CPU fare branch-wise when executing out of mpool?
- Would the code cache benefit at all from a nursery?
- How about having an associative cache?
- What about a smarter mpool[] replacement policy?