HPCA 2000
In Summary
The authors propose propose the eXtended Block Cache (XBC), a new structure to replace the Trace Cache (TC). XBC has similar latency and bandwidth to TC, but due to reduced fragmentation and redundancy, its hit rate can be higher for the same sized cache (or the same for a 33% smaller cache).
An XB is a multiple-entry, single-exit sequence of instructions terminated by a conditional branch, an indirect branch, or the kth sequential instruction, where k is the maximum number of instructions in an XB. XBs are not terminated by unconditional JMPs. XBs are tagged in the XBC by the IP of the last instruction in the block, and instructions are stored in reverse program order, which allows the Fill Unit (XFU) to lengthen XBs in the XBC by adding instructions as a prefix without evicting/overwriting or incurring redundancy. The XBTB predicts the next n XBs to be executed from the XBC.
The XBC caches instructions in fill mode, and supplies them to the decoder in supply mode. The XBC transitions into supply mode when we execute an instruction that is a tag hit in the XBC; the XBTB predicts the subsequent XBs to be executed, at which point Instruction Fetch can be turned off.
Takeaways
This is clearly an improvement over TC, and it’s interesting in that it’s almost a superset of a simple uop cache, which was proposed later. The chief difference between the XBC and the first-proposed UC is the organization of the cached uops. Both XBC and UC cache decoded instructions (uops), but the XBC forms XBs which are multiple-entry, single-exit. UC caches uops as well, but simplifies things by only caching basic blocks (single-entry, single-exit). That the UC caches single-entry blocks of uops indexed by the first instruction of the block makes lookups much easier than in XBC, where XBs are indexed by the final EIP in the sequence.
One interesting rule of thumb which I think is still relevant:
[Mich99] describes analytically some of these transitions [between steady-state and stall phases]. As a rule of thumb, the CPU spends about 50% of the time in steady state execution, 30% in transition phase, and is stalled for 20% of the time. During steady state execution we need a high-bandwidth front-end structure, while latency is less important (hidden by pipelining). However to minimize the time spent in transition, a fast ramp-up is required, which implies low latency. This shows the importance of a high-bandwidth low-latency front-end structure.
This is likely still correct, and isn’t in any way specific to XBCs, except in that it’s the motivation for the authors’ interest in the area. But it’s as concise a summary I’ve seen of why bandwidth and latency both matter in the front-end, even if they’re each critical in non-overlapping program phases.