delayed branches seemed like a good idea in the 1980's, but they complicate programming and only work well on a simple 5-stage pipeline to keep a deeper pipeline busy, you need more delay slot instructions; however, the idea of a single delay slot as already frozen into the ISA (shows that it's a poor idea to expose a given implementation in an ISA) instead, do branch prediction in the IF stage .---------------------------------------- correct PC value | | .------------------------------------- misprediction signal | | (also flushes pipeline) | | | | | | +-+ | | IF stage | | ID stage | | | | | V | | `.---. +------+ +--------+ | | |mux|-->| PC |-*-->| icache |-->| | .---' +------+ | +--------+ | | | | | | .---. | | | .-->|mux| | | | | `---' .----. | | | | ' `------| +4 |<--* | | | | `----' | | | | | | | | | | BTA +------+ | | | | `---<----| BTAC |<--* | | | +------+ | | | AND<------- hit in BTAC | | | ^ | | | | | | | | prediction +------+ | | | `-----<------| BHT |<--' | | +------+ +-+ examples: let the branch being predicted reside at "X" let the branch target address be "Y" correctly predict untaken X: bne r0,r1,Y IF ID EX MEM WB X+4: add r4,r2,r3 IF ID EX MEM WB X+8: ... IF ID EX MEM WB .---.X+4+------+ +--------+ |.--|-->| X |-*-->| icache |-> "bne r0,r1,Y" X+4.---' +------+ | +--------+ .---. | .-->| \| | | `---' X+4 .----. X | | ' `------| +4 |<--* | | `----' | | Y +------+ X | if this instruction has been seen before, | ` - < - -| BTAC |<--* the BTAC has an entry | +------+ | |pred untaken+------+ X | `-----<------| BHT |<--' BHT predicts untaken for branch address X +------+ (or no match for address X in BTAC) correctly predict taken X: bne r0,r1,Y IF ID EX MEM WB Y: sub r7,r5,r6 IF ID EX MEM WB Y+4: ... IF ID EX MEM WB .---. Y +------+ +--------+ |.--|-->| X |-*-->| icache |-> "bne r0,r1,Y" Y .---' +------+ | +--------+ .---. | .-->|/ | | | `---' X+4 .----. X | | ' ` - - -| +4 |<--* | | `----' | | | Y +------+ X | | `---<----| BTAC |<--* if BTAC has entry | +------+ | | pred taken +------+ X | `-----<------| BHT |<--' BHT predicts taken for branch address X +------+ mispredict => must recovery by flushing wrong instructions out of pipeline and redirecting PC to correct path example of predict taken where branch is actually untaken (misprediction can sometimes be determined in the ID stage, whereas on other microprocessors it may be 10-20 stages later) X: bne r0,r1,Y IF ID! EX MEM WB ! = determine mispredict Y: sub r7,r5,r6 IF x-- --- --- --- x = flush (becomes nop) X+4: add r4,r2,r3 IF ID EX MEM WB static branch prediction 1. predict untaken is easiest to implement since it doesn't need BTA 2. can predict based on the opcode or hint bits in the instruction 3. can predict based on the sign of the branch displacement - e.g., predict taken for backward branches since they are likely loop branches ("The CDC 7600 used this predictor, it[']s cheap and fairly effective, garnering something on the order of 70% correct predictions." Mitch Alsup, comp.arch, Feb. 2011) dynamic branch prediction 1. hardware keeps branch history in icache or BHT (branch history table) a. one-bit history reflects last occurrence b. two-bit saturating counter (2bc) or similar FSM helps with loops c. BHT is usually indexed by low-order bits of PC 2. hardware keeps branch target addresses in BTAC (branch target address cache) 3. BHT and BTAC may be combined into a single BTB (branch target buffer) 4. hardware return address stack deals with returns to different call sites (invisible to software) for less-predictable branches, we can use more complex prediction using multiple counters/FSMs per branch two-level adaptive prediction 1. note that the branch history within a counter/FSM is "lumped" - that is, from the current state alone, you cannot tell whether the last branch was taken or untaken 2. branch history shift register (BHSR) keeps taken/not-taken pattern of last set of branches encountered 3. pattern history table (PHT) keeps history for each specific pattern from first-level E.g., a single bit could keep track of the last branch direction first second level level BH .--PHT--. BHT +---+----+----+ ... | | | | +----+ +---+----+----+ | PC |------------------>| 1 | 11 | 00 | +----+ think of PC as +---+----+----+ think of the branch history selecting a row | | | | bit as selecting the column ... of the PHT +---+----+----+ that is, each row has: 1st) an s-bit selection index, which can itself be an FSM-predictor 2nd) 2^s n-bit FSM-predictors, one of which is selected the arrangement above has a local BHSR and a local PHT for each different PC index; either the BHSR or the PHT, or both, could instead be designed as a single (i.e., global) component or as a small number of components, shared by defined subsets the Intel Pentium II/III uses a 512-entry BTB (128x4) indexed by bits in the the EIP, with entries containing a branch target address and a 4-bit local BHSR which is used to index into a 16-entry PHT (the four branches per set share the same PHT) |<---- BTAC ---->|1st | | 2nd | | part |lvl | |level| | |(BH)| |(PHT)| V tag T BTA BHSR P BTB +-+----+----+----+----+----+...+------------+-----+ ... |1| 9 | 2 | 32 | 4 | 1 | | 2 | 32 | +-----+ +-+----+----+----+----+----+ +------------+-----+ | EIP |----->|v|tag |type|addr|hist|pred|...|replace info| PHT | +-----+ +-+----+----+----+----+----+ +------------+-----+ | | | | | | | | | | ... \_____ four per set _____/ \_ one per set __/ to reduce table space, xor'ing bits from the PC and a single, global BHSR has been found to be effective - this method is called "gshare" (a modified form of gshare is used in the UltraSPARC-III) E.g., consider using 4 bits from a 16-bit PC (bits 4-7 when numbered in little endian), a 4-bit global BHSR, and a PHT of saturating 2bc values. Give the PHT entry index, value, and taken/not-taken prediction for the following branch addresses and BHSR values (use the msb of the indexed 2bc value for the prediction, with 1=taken). 15 7654 0 +----------------+ PC |--------....----| +----------------+ `--' (xor)-------> 4-bit selection index into shared PHT .--. (i.e., the first-level result) +----+ | global BHSR |....| | +----+ | 3210 | v +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ shared PHT |10|00|10|11|01|00|01|00|10|01|10|01|00|10|01|11| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ 0 1 2 3 4 5 6 7 8 9 a b c d e f a) PC = 0x0124, BHSR = 0x5, so 0x2 xor 0x5 = 0x7 PHT index 0x7 => value 00 => untaken b) PC = 0x0124, BHSR = 0xa, so 0x2 xor 0xa = 0x8 PHT index 0x8 => value 10 => taken c) PC = 0x2588, BHSR = 0x7, so 0x8 xor 0x7 = 0xf PHT index 0xf => value 11 => taken speculative execution 1. not just fetch down predicted path, but start execution 2. write results to buffers but not to registers or cache until you are sure that the branch was correctly predicted, e.g., P6 reorder buffer and in-order retirement performance -- we can write the average CPI calculation in this form: avg. CPI = base CPI + penalties if each instruction and correctly predicted branch takes one cycle, then avg. CPI = 1 + misprediction cycles = 1 + (branch freq.)*(misprediction freq.)*(mispredict penalty) thus, the average CPI can be reduced by a) reducing the frequency of branching, b) increasing the accuracy of the prediction, and/or c) reducing the misprediction penalty