Clemson University CSPC 464/664 Fall 2002 Mark Smotherman Conditional Branching Four aspects: 1. Specifying comparands A. explicit compare instruction (basically a subtraction): RR or RI i. 87% of integer compares are RI (register vs. immediate value) ii. 77% of flt. pt. compares are RI [DLX SPEC92 stats, H&P 2e, fig 2.8] B. side effect of ALU operation i. setting according to result of ALU op allows elimination of compare [although Cragon cites a study that found this ineffective] ii. setting by ALU op is optional in SPARC C. storing result of subtraction (condition codes / indicators / flags) i. single set of bits in PSR or flags register for integer conditions (e.g., NVZC in SPARC, OF/SF/ZF/AF/PF/CF in IA-32) ii. second set of bits in FP status register for flt. pt. conditions (e.g., eq, lt, gt, unordered) D. _BUT_ presents serialization problem for high-perf. implementation [cf. Sites on design of Alpha] i. no storage if compare and branch packaged together (e.g., MIPS) ii. multiple sets of condition code bits (e.g., IBM ACS, RS/6000, PPC) iii. use general registers to hold condition codes (e.g., MC88110) 2. Decision - logical relation between comparands A. basic two: equal, not equal i. e.g., equal if Z=1 ii. 86% of integer compares use equal/not-equal [MIPS SPEC92 stats, H&P 2e, fig. 2.15] B. other four: ls, le, gt, ge i. difference in interpreting the results for signed vs. unsigned integers ii. e.g., signed compare is less than or equal if Z=1 or N!=V iii. e.g., unsigned compare is "below or equal" if Z=1 or C=1 C. flt. pt. compare can result in "unordered" (e.g., comparing with NaN) 3. Branch target address A. "jump" - absolute address (requires relocation adjustments for link/load) B. "branch" - explicit, sign-extended relative offset from PC (98% of offsets fit in 8 bits [MIPS SPEC92 stats, H&P 2e, fig 2.32]) C. "skip" - implicit relative offset from PC, typically +1 in inst. words D. register indirect (sometimes used to implement subroutine return and jump tables for case statements, also used by JSR for shared libraries) E. trap address (e.g., conditional trap on SPARC and MC88110) 4. Change PC A. immediate effect B. delayed effect - next one or so sequential instructions have already been fetched and will be executed regardless of branch decision C. delayed effect with anulling/squashing - sequential instructions already fetched and may be executed or optionally purged on untaken (e.g., SPARC) D. _BUT_ fixed number of delay slots presents problem for more-deeply piped or superscalar implementations [cf. 8-stage MIPS R4000, and Sites on design of Alpha] Packaging: * compare (1+2), then conditional branch (3+4) * ALUop side effect (1), then conditional branch (2+3+4) * compare and branch (1+2+3+4) - no storage (condition codes) needed, but need multiple comparand specifiers plus the branch address field; thus to keep instruction length reasonable often use reg. vs. implicit value of 0 * compare (1+2), prepare to branch ([2+]3), then exit (4) - allows multiway branching so that multiple condition/decision/BTA triples can be specified but only one change of PC occurs. Hardware support maintains table of allowed-taken branches in priority (e.g., FIFO) order, takes highest priority true branch on executing the exit inst., and then flushes the table. [IBM ACS, 1960s] CEQX 1,10,11 // compare index regs X10 and X11 and // set condition C1 if X1==X2 ... BOR 1,2,0,ALPHA // establish the first branch predicate // of (C1 or C2) and set corresponding // branch target address to ALPHA BAND 3,4,0,BRAVO // establish the second branch predicate // of (C3 and C4) and set corresponding // branch target address to BRAVO BEQ 5,6,0,BRAVO // establish the third branch predicate // of (C5 == C6) and set corresponding // branch target address to BRAVO also // - thus BRAVO if (C3andC4)or(C5==C6) BAND 7,7,0,CHARLIE // establish the fourth branch predicate // of (C7) and set corresponding // branch target address to CHARLIE EXIT // perform transfer of control to a branch // target address based on the first true // branch predicate; if none are true, // then fall through; as part of this // instruction the exit table is flushed Note ISA add-ons called "prepare to branch" include: [TI ASC] prepare to branch (redundant specification of 3 - for prefetch) [PIPE] prepare to branch (specify 4B - variable number of delay slots) IA64 provides multiway branching by use of predicates cmp.eq p1,p2 = r1,r2 cmp.ne p3,p4 = r1,r2 cmp.lt p5,p6 = r1,r2 ;; (p1) br.cond label1 // take first true branch (p3) br.cond label2 (p5) br.cond label3 ;; Of course, IA64 allows branches to be replaced by predicated code (the compiler technique is called "if-conversion") cmp.eq p1,p2 = r1,r2 ;; (p1) add r5 = r3,r4 // predicated execution of then part (p2) sub r8 = r6,r7 // predicated execution of else part // -- without branching -- Packaging for flt. pt. branches may differ from packaging for integer branches. Loop control branches ("fast loop closers") - combine loop count increment/ decrement with conditional branch IBM Stretch (ca. 1955-1961) - COUNT BRANCH AND REFILL PLUS IBM S/360 - bxh - (increment and) branch on index high bxle - (increment and) branch on index less or equal DEC PDP-11 - sob - subtract one and branch if not equal zero DEC VAX - acb - add compare and conditionally branch aoblss - add one and branch if less aobleq - add one and branch if less or equal sobgeq - subtract one and branch if greater or equal sobgtr - subtract one and branch if greater Intel x86 - loop - decrement cx and branch if cx not zero loopz - decrement cx and branch if cx not zero and ZF is 1 loopnz - decrement cx and branch if cx not zero and ZF is 0 rep - string prefix to repeat string inst. until cx is zero (rep decrements cx each time) repz - repeat string inst. while cx not zero and ZF is 1 repnz - repeat string inst. while cx not zero and ZF is 0 RS/6000 - special loop count register ctr in branch unit that can be decremented and tested for zero by conditional branch inst. bc loop,ctr!=0 IA64 - special loop count register lc and epilog count register ec with what at first appear to be strange semantics br.cloop - counted loop branch - branch if lc not 0; decrement lc if not 0 (it doesn't decrement lc before testing, as earlier instruction sets would do; thus, the number of iterations of the loop is lc+1; possibly this is implementation showing through since testing after decrementing would consume more time) br.ctop/cexit - modulo-scheduled counted loop branch - for br.ctop, the branch is taken if either lc is non-zero or EC > 1 - for br.cexit, the branch is not taken ifeither lc is non-zero or EC > 1 (typically used for software pipelining and combined with register rotation - during the prolog and kernel phase, lc counts down; and during epilog phase, lc is zero and ec counts down; register rotation stops when both lc and ec are zero) (bp) br.wtop/wexit - modulo-scheduled while loop branch