Mark Smotherman
Last updated: June 2008
Summary: Eager execution deals with the uncertain nature of branches by applying the design principle of "late select" to the paths in a program. In their 1972 paper, Riseman and Foster demonstrated an impressive speedup was available from this approach. However, it has been too expensive to actually implement eager execution, so other branch-handling alternatives have been employed in commercial processors over the years, including prefetching and branch prediction with speculative execution. Research into eager execution has continued to refine certain aspects of the idea, including work like Uht's Disjoint Eager Execution as well as speculative multithreading.
.. under construction ..
"The IBM 370/165 employed a branching strategy that fetched two paths, the inline path and the target path [FLO 74]. Two buffers were maintained, the main instruction buffer (MIB) and the auxiliary instruction buffer (AIB). When the effective address of the target instruction stream is known at stage n', the target stream is fetched into the AIB. Both streams are processed until the outcome is known."page 77 (emphasis added), Harvey Cragon, Branch Strategy Taxonomy and Performance Models. Los Alamitos, CA: IEEE Computer Society Press, 1992.
Like the passage given above, I've seen it alleged many times that certain commercial processors from the 1960's executed down both paths beyond a branch; but, as far as I'm aware, while some academic papers and designs have proposed this, multiple-path execution has never been done by a commercial processor.
[Conjecture: I think it may be traced to a misreading of the 1984 Lee and Smith IEEE Computer paper. E.g., see Perleberg and Smith.]
What you will find in commercial processors is a combination of these techniques:
IBM Stretch (1961)
The IBM Stretch (later marketed as the 7030) was a remarkable high-performance design. Instructions in Stretch flowed through two processing elements:
The indexing and instruction unit of Stretch directly executed instructions related to the index registers and "prepared" the arithmetic instructions by calculating effective addresses and starting any memory operand loads. Unconditional branches and conditional branches that depended on the index registers were fully executed in the indexing and instruction unit.
Conditional branches that depended on the arithmetic registers were predicted untaken, and the untaken path was speculatively executed. Misprediction recovery was accomplished through the Stretch's four-entry "lookahead unit", which functioned as a history buffer for the index register values.
[Conjecture: because branch misprediction recovery time was found to be one of the culprits that reduced Stretch performance, I think speculative execution was ruled out in subsequent IBM minaframe designs, up until the 3090.]
[Nine Stretch/7030's were built.]
IBM S/360 Model 75 (1966)
The Model 75 was the intended high-performance member of the New Processor Line (NPL) and was originally named the 501. It was announced as the S/360 Model 70 in April 1964 and then renamed as the 75 in April 1965. It had an I-unit and a separate E-unit for instruction overlap. The I-unit had two doubleword instruction prefetch registers and prefetched operands into a doubleword operand prefetch register for the E-unit. The I-unit also executed the I/O instructions.
Buffers and functional units for the Model 75 include:
IBM S/360 Model 91 (1967)
The Model 91 developed from the high-performance Project X. It became a part of the NPL in September 1963 (after the formal announcement of the CDC 6600), and was known then as the 604. A footnote in the April 1964 S/360 announcement mentioned a Model 90 that was under development, and the Model 92 was formally announced in August. The model name changed to the 91 in November 1964.
The Model 91 processor featured a deep pipeline and out-of-order execution within the floating-point unit using Tomasulo's algorithm (this was helpful since there were only four architected FP registers). Like the Stretch and Model 75, separate I and E units were used. Data prefetch was performed but there was no instruction preexecution.
There is an 8-doubleword instruction stack that provides instruction prefetch. A backward branch to a target address within 8 doublewords of the branch is called a "short loop" and causes the processor to enter into loop mode. In loop mode, the decoder tags the taken-path instructions as conditional and sends them to wait at the execution units until the branch is resolved. In normal mode, the decoder tags the untaken-path instructions as conditional and sends them to wait at the execution units until the branch is resolved.
There is one sentence in D. Anderson, F. Sparacio, and R. Tomasulo, "The IBM System/360 Model 91: Machine Philosophy and Instruction- Handling," (see Bell & Newell) which may at first glance suggest dual-path (or eager) execution:
"The detrimental performance effect which stems from short loops led to a dual branch philosophy."
However, the authors mean their philosophy has two aspects, specifically the normal mode and loop mode of branch handling. The paper continues:
"The first aspect deals with branches which are either forward into the instruction stream, beyond the prefetched instructions, or if backward from the branch instruction, greater than eight double-words back. In these situations, the branch storage-delay is unavoidable. As a hedge against such a branch being taken, the branch sequencing (Fig. 12) initiates fetches for the first two double words down the target path. Two branch buffers are provided (Fig. 10-the instruction supply data flow) to receive these words, in order that the instruction buffer array will be unaffected if the result is a no-branch decision."
...
"The second aspect of the branch philosophy treats the case for which the target is backward within eight double words of the branch instruction. A separation of eight double words or less defines a "short" loop-this number being chosen as a hardware/ performance compromise. Part of the housekeeping required in the branch sequencing is a "back eight" test. If this test is satisfied the instruction unit enters what is termed "loop mode." Two beneficial results derive from loop mode. First, the complete loop is fetched into the instruction buffer array, after which instruction fetching ceases. Consequently, the address port to storage is totally available for operand fetching and a one instruction per cycle issue rate is possible. The second advantage gained by loop mode is a reduction by a factor of two to three in the time required to sequence the loop- establishing branch instruction. (For example, the "branch on index" instruction normally requires eight cycles for a successful branch, while in loop mode three cycles are sufficient.) In many significant programs it is estimated that the CPU will be in loop mode up to 30% of the time."
Buffers and functional units for the Model 91 include:
The Model 195 adds a cache, a decimal functional unit in the fixed-point unit, and a extended-precision functional unit (with one reservation station) in the floating-point unit. [Does US 3,588,829 relate to adding the cache to the 195?]
[15 Model 91's and 2 Model 95's were built. Approximately twenty Model 195's were built.]
IBM S/360 Model 85 (1969) and S/370 Models 165/168 (1971/1973)
Following the same approach as the Models 75 and 91, separate I and E units with data prefetch but without instruction preexecution were used in the S/360 Model 85 and the S/370 Model 165 (and later the Model 168). Note, however, that the Tomasulo out-of-order approach was not used. Also, instructions after an unresolved branch are stalled in a decoded instruction queue rather than at the execution units. This can be traced back to the amount of circuitry required in the Model 91:
"In retrospect, the conditional philosophy and its effect on loop mode, although significant to the performance of the CPU and conceptually simple, were found to require numerous interlocks throughout the CPU. The complications of conditional mode, coupled with the fact that it is primarily aimed at circumventing storage access delays, indicate that a careful re-examination of its usefulness will be called for as the access time decreases."
Thus, I believe Harvey Cragon is mistaken in his description of the Model 165. Moreover, in the article by Flores that is cited by Cragon above, Flores states in discussing branching:
"If the lookahead unit anticipates instructions only through the [sequential] instruction stream, then, when a true branch occurs, all the instructions that it has preprocessed are useless and must be discarded."
...
"To improve matters, more of a hedge is possible -- even though we preprocess only one set of instructions, it is still possible to prefetch instructions in the other leg of the branch. Then, should our guess prove wrong, memory cycles are saved, since other instructions are already available to preprocess. To make possible this hedge, the Model 165 includes two instruction buffers, instead of just one, called the main instruction buffer (IBM) and the auxiliary instruction buffer (IBA). When the guess is to accept the branch, instructions for the new target address are put into the IBM while instructions from the other (sequential) leg are brought to the IBA."pages 32-33 (emphasis added)
Connors, Florkowski, and Patton describe the 370/168 and 3033 in a May 1979 Datamation article (note they are IBM Poughkeepsie folks, so I trust their description as being first-hand). They clearly describe one path:
"If the decoded instruction is a branch, the IPPF then begins fetching instructions along the branch path into the inactive instruction buffer, and for a conditional branch designates (by means of a guess) which of the instruction buffers is now active and which is inactive. Thus both buffers are kept filled with instructions, but only the active one continues to decode."(emphasis added)
Buffers and functional units for the Model 85 and Model 165 include:
The Model 168 added virtual address translation, a fourth instruction queue entry, and a second result register.
Connors, Florkowski, and Patton note that the 3033 adds a third instruction fetch buffer to handle the case when a second branch is seen before the current one is resolved. (Again, only one is actively decoded.) The instruction buffer size was doubled so that all three buffers were 32 bytes. The number of operand buffers was tripled (to six), and the number of result registers was doubled (to four). Instruction decoding performance was improved with a reduction from two cycles to one cycle. [See US Patent 4,200,927, Hughes, et al., "Multi-instruction stream branch processing mechanism," filed 1978, granted 1980. Also, US Patent 4,189,768, Liptay and Rymarczyk, "Operand fetch control improvement," filed 1978, granted 1980.]
[About thirty Model 85's were built; they apparently had a reputation for poor reliability/availability.]
Later IBM processors
The 3081 (1981) was IBM's first LSI-based mainframe design, and design decisions were made in ways that were felt would reduce design errors. Consequently, a decision was made to pursue a sequential instruction-execution design and avoid instruction pipelining and overlap. (See Gustafson and Sparacio, IBM JRD, v26:1, 1982.)
The 3090 (1985) began as an attempt to remap the 3033 into LSI technology. Several improvements to the 3033 design were made based on workload studies. Although unintentional, the resulting 3090 operated in a manner somewhat reminiscent of Stretch, with prefetch of data operands and pre-execution of loop-closing and load-address instructions by the I element. A decode history table is used for branch-on-condition prediction when there is an unresolved condition-code-setting instruction in the instruction queue; otherwise the I element uses the current condition code to resolve the branch. (See Tucker, IBM SJ, v25:1, 1986.)
The z900 (2000) also pre-executed branches and loads in the I-unit. (See Schwarz, et al., IBM JRD v46:4/5, 2002.)
hardwired processors
microcoded processors
CMOS generations
zSeries (64-bit addressing)
| pre-execution | inst. prefetch paths | branch prediction | speculative execution | |
|---|---|---|---|---|
| Stretch | indexing insts. (including index branches) | one | non-index conditional branches predicted untaken | yes, recovery using lookahead rollback |
| 91 | two (instruction stack and "BTB" prefetch buffers) | predicted insts. stall at execution units | no | |
| 85/165/168 | two | stall in IQ | no | |
| 3033 | three | stall in IQ | no | |
| 3090 | LA, BCT, BXH, BXLE | three | BHT (called "DHT") | yes, flush to recover |
| 9000 | one | 4K-entry BTAC (called "BHT") | yes, insts. tagged with two conditional bits | |
| z900 | LA, LD, index and some linking branches | one | 8K-entry BTB | yes, flush to recover |
| z990 | LA | five | 8K-entry BTB | yes, flush to recover |
... these designs go beyond normal speculative execution down a predicted branch path ...
... what is earliest reference? e.g., Stone in 1970 suggests following both paths (cloning a virtual stack machine) and stopping only on a store ...
... discuss limit studies involving eager execution: Riseman and Foster; Lam and Wilson; etc. ...
Gus Uht has recently written up a survey of multipath execution: A.K. Uht, "Multipath Execution," chapter 6 in D. Kaeli and P.-C. Yew (Eds.), Speculative Execution in High Performance Computer Architectures, CRC Press, August 2004. Table 6.1 gives an overview comparison of the following schemes, and he provides further description of each scheme.
Other schemes I hope to describe here include:
[AHU 98] P. Ahuja, K. Skadron, M. Martonosi, and D. Clark, "Multi-path execution: Opportunities and limits," ICS, July 1998.
[CHE 01] C.-Y. Cher and T.N. Vijaykumar, "Skipper: A microarchitecture for exploiting control-flow independence," Micro-34, Dec. 2001, pp. 4-15.
[FLO 74] I. Flores, "Lookahead Control in the IBM System 370 Model 165," IEEE Computer, November 1974, pp. 24-38.
[FUK 91] A. Fukuda, K. Murakami, S. Tomita, "Toward Advanced Parallel Processing: Exploiting Parallelism at Task and Instruction Levels, IEEE Micro, vol.11, no.4, pp.16-19 and 50-61, Aug. 1991.
[HEI 96] T. Heil and J. Smith, "Selective dual path execution," technical report, University of Wisconsin at Madison, November 1996.
[KLA 98a] A. Klauser, T. Austin, D. Grunwald, and B. Calder, "Dynamic hammock predication for non-predicated instruction set architectures," PACT, 1998, pp. 278-285.
[KLA 98b] A. Klauser, A. Paithankar, and D. Grunwald, "Selective eager execution on the PolyPath architecture," ISCA, June 1998, pp. 250-259.
[KUG 91] M. Kuga, K. Murakami, S. Tomita,"DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar): Yet Another Superscalar Processor Architecture," ACM-SIGARCH Computer Architecture News, vol.19, no.4, pp.14-29, June 1991.
[LAM 92] M. Lam and R. Wilson, "Limits of control flow on parallelism," ISCA, May 1992, pp. 46-57.
[MUR 89] K. Murakami, N. Irie, M. Kuga, and S. Tomita, "SIMP: A novel high-speed single-processor architecture," ISCA, 1989.
[PIE 94] J. Pierce, and T. Mudge, "Wrong-path instruction prefetching," technical report CSE-222-94, University of Michigan at Ann Arbor, 1994.
[ROT 99a] E. Rotenberg, Q. Jacobson, and J. Smith, "A study of control i independence in superscalar processors," HPCA-5, 1999.
[ROT 99b] E. Rotenberg and J. Smith, "Control independence in trace processors," Micro-32, 1999.
[TYS 97] G. Tyson, K. Lick and M. Farrens, "Limited dual path execution," technical report CSE-TR-346-97, University of Michigan at Ann Arbor, 1997.
[UHT 95] A. Uht, V. Sindagi, and K. Hall, "Disjoint eager execution: An optimal form of speculative execution," Micro-28, Dec. 1995, pp. 313-325.
[WAL 98] S. Wallace, B. Calder, and D. Tullsen, "Threaded multiple path execution," ISCA, June 1998, pp. 238-249.
My thanks to Julian Thomas and Stuart Tucker for their help with information on IBM mainframes.
[History page] [Mark's homepage] [CPSC homepage] [Clemson Univ. homepage]
mark@cs.clemson.edu