Understanding the Bochs Code Cache, Part 1

This post is the first in a theoretically unbounded series on the Bochs code cache.  In it, I describe the architecture of the Bochs code cache and examine how sensitive its performance is to the cache’s size parameters.

Why have a code cache?

Bochs is a free, open-source x86 system emulator that’s available for both Windows and Linux.  Like most modern high-performance system emulators, Bochs employs sophisticated techniques under the hood to accelerate fetching and decoding guest x86 instructions.

Fetching and decoding x86 instructions is hard, and doing it in software is painfully slow.  The higher Bochs’ icache hit rate, the less time it spends fetching and decoding instructions and the more time it can spend actually doing real work.   Whereas a real, physical system has a hardware instruction cache (and, if you’re fancy, a micro-op cache), dynamic systems like the JVM, Pin, DynamoRIO, QEMU, and Bochs employ a software-managed code cache to accelerate instruction fetch and decode.

Architecture and Operation of the Bochs Code Cache

The code cache comprises two structures:

  1. The entry[] array
    • maps from physical addresses to mpool[] entries
    • indexed as a direct-mapped cache
    • 64K bxICacheEntry_c entries by default
  2. The mpool[] array
    • maps decoded instructions to decoded instruction handlers
    • next pointers define traces
    • allocated in ascending order until full, at which point the whole cache is flushed
    • 576K bxInstruction_c entries by default
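As a rough sketch, the two structures look something like the following.  The field names here are invented for illustration; the real bxICacheEntry_c and bxInstruction_c definitions in Bochs carry considerably more state.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch of the two code-cache structures; field names are
// illustrative, not the actual Bochs definitions.
struct Insn {                     // stands in for bxInstruction_c
    void (*execute1)(Insn *);     // handler for the decoded instruction
    Insn *next;                   // links consecutive mpool entries into a trace
};

struct Entry {                    // stands in for bxICacheEntry_c
    uint64_t pAddr;               // guest physical address this entry maps
    Insn    *i;                   // first instruction of the trace in mpool[]
};

constexpr size_t kEntries = 64 * 1024;   // default: 64K entry[] slots
constexpr size_t kMemPool = 576 * 1024;  // default: 576K mpool[] slots

Entry entry[kEntries];            // direct-mapped: hash(paddr) -> Entry
Insn  mpool[kMemPool];            // bump-allocated; flushed wholesale when full
```

The key relationship is that entry[] only holds mappings, while all decoded instruction state lives contiguously in mpool[].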
When the instruction fetch routine fetchdecode32() is called in Bochs, the entry array is indexed with a hash of the physical address and the fetch mode, which is just a status word populated with bits like long mode, SSE-enabled, etc.  The lookup has one of two results:
  1. The entry is invalid or has a mismatching address (i.e. miss)
    • The selected entry is (re-)allocated for the new physical address
    • The Bochs front-end fetches and decodes up to BX_MAX_TRACE_LENGTH instructions (default is 32) starting at that physical address, looking for new entry-array hits as it goes
    • If mpool is in danger of filling, the whole code cache is flushed
    • Decoded instructions are placed into consecutive mpool entries, starting immediately after the previous insertion
    • Bochs’ CPU model begins execution at the first newly-inserted instruction
  2. The entry is valid and has a matching address (i.e. hit)
    • The selected entry’s instruction pointer into the mpool table is dereferenced
    • Bochs’ CPU model begins execution at the dereferenced mpool entry and continues through the execute1 and execute2/next pointers through the rest of the trace
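The hit path above can be sketched as a simple hashed lookup.  The hash function and structure layout here are invented stand-ins; the real fetchdecode32() path in Bochs differs in detail.

```cpp
#include <cstdint>

// Simplified stand-ins for bxInstruction_c / bxICacheEntry_c.
struct Insn {
    void (*execute1)(Insn *);
    Insn *next;
};

struct Entry {
    uint64_t pAddr;               // physical address the entry maps, or stale
    Insn *i;                      // first mpool[] instruction of the trace
};

enum { kEntries = 64 * 1024 };
static Entry entry[kEntries];

// Illustrative hash mixing the physical address with the fetch-mode bits;
// the actual Bochs hash is different.
static inline unsigned hashIndex(uint64_t pAddr, uint32_t fetchMode) {
    return (unsigned)((pAddr ^ fetchMode) & (kEntries - 1));
}

// Returns the trace on a hit, or nullptr on a miss (invalid entry or a
// direct-mapped conflict), in which case the caller decodes a new trace.
Insn *lookup(uint64_t pAddr, uint32_t fetchMode) {
    Entry *e = &entry[hashIndex(pAddr, fetchMode)];
    if (e->i != nullptr && e->pAddr == pAddr)
        return e->i;              // hit: execute straight out of mpool[]
    return nullptr;               // miss: (re-)allocate this entry
}
```

Note that because the cache is direct-mapped, two hot physical addresses that hash to the same slot will evict each other back and forth.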
The structures are shoddily depicted here:

Analysis, TBD

For no particular reason, I believe the most important things to look at in assessing the code cache’s performance are:
  1. What is the code cache’s hit rate, and how is it affected by BxICacheEntries?  BxICacheMemPool?
  2. How often is the code cache flushed?
  3. How does a modern CPU fare branch-wise when executing out of mpool?
Numbers 1 and 2 above are obvious from the architecture, but 3 is less so.  The execute1/execute2 pointers let the Bochs CPU model know immediately which subroutine to execute for each instruction, but it still has to branch there to execute it.  This is potentially problematic because it generates a lot of indirect branches, which is bad for branch prediction, the iTLB, and L1I locality.  Compared with something like QEMU, Pin, or DynamoRIO, this approach is substantially less efficient but is also much easier to implement.  Ideally the code would be inlined somehow, so that the processor doesn’t incur so many indirect branches (which are typically already the bane of a dynamic compiler).
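Concretely, the dispatch concern boils down to a loop of calls through function pointers, one indirect branch per guest instruction.  The handlers below are invented; they stand in for Bochs’ per-instruction subroutines.

```cpp
// Simplified stand-in for bxInstruction_c.
struct Insn {
    void (*execute1)(Insn *);     // per-instruction handler
    Insn *next;                   // next instruction in the trace
};

// Invented handlers standing in for Bochs' real instruction subroutines.
static int gAcc = 0;
void doAdd(Insn *) { gAcc += 1; }
void doNop(Insn *) { }

// Each iteration performs an indirect call through execute1 -- exactly the
// kind of branch that stresses the host's indirect-branch predictor.
void runTrace(Insn *i) {
    for (; i != nullptr; i = i->next)
        i->execute1(i);
}
```

A binary-translating system like QEMU instead emits host code for the whole trace, so the per-instruction indirect call disappears, at the cost of a far more complex implementation.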
Some other possible avenues of discovery are:
  1. Would the code cache benefit at all from a nursery?
  2. How about having an associative cache?
  3. What about a smarter mpool[] replacement policy?
It’s worth noting that #3 here seems unlikely to help, since the current Bochs policy is very widely employed in these kinds of systems.  But who knows?
