DYNAMIC INSTRUCTION ISSUE Idea is to try to issue (i.e., start execution) for up to N instructions each clock cycle. This may be from the decode stage in an in-order machine or from instructions buffers in an out-of-order machine. Dependencies ------------ For each instruction pair in the decode stage for a multiple-instruction-issue machine, five possible dependencies need to be checked (the logic to compare register numbers is similar to the hazard checking logic needed for interlocks and forwarding in a scalar pipeline). +----+-------+-------+-------+ The number of dependencies varies as | op | Rdest | Rsrc1 | Rsrc2 | the square of number of instructions +----+-------+-------+-------+ that must be compared in single cycle | WAR=? WAR=? of decoding: | | | | .-----*-------' 2 insts. => 5 comparators | | *- | ----*-------. 3 insts. => 15 comparators | | | | WAW=? | RAW=? RAW=? 4 insts. => 30 comparators +----+-------+-------+-------+ | op | Rdest | Rsrc1 | Rsrc2 | ... etc.. => 5*n*(n-1)/2 comparators +----+-------+-------+-------+ Steps in dynamic scheduling --------------------------- --- in program order --- ------ out of order ------ in order / \ / \ / \ .------. .------. .----. .------. .------. .------. |inst. | .---------. |result| |reg.| |memory|-->|icache|-->|decode|-->|window|-->|execution|-->|window|-->|file| `------' \ `------' \ `------' \ `------' \ `---------' \ `------' \ `----' refill fetch dispatch issue complete retire or ("issue") ("start") ("finish") ("commit") instructions are assigned a slot in the result window upon "dispatch" - provides for precise exceptions even with out-of-order completion instructions wait in instruction window with either operand values or names - only when all its operand values are available is an instructions ready to execute and thus allowed to "issue" instructions write results into the result window - values in the result window are "retired" into the register file in program order Instruction window terms ------------------------ Key problem is dispatching instructions based on both the fetch ___| incoming i-stream |___ in-order and lookahead state \ / (some insts. not completed) => decode---\--- +--------------------+ / must use tags or reg. numbers \ | not yet dispatched | / along with any available values. dispatch----- +---+----------------+ ------------------------------------------- | not yet issued | A A +--------+-------+---+ | | | executing | | lookahead state +---+-------+---+ in | | completed | window (not all result +------------+---+-------+ | values are out-of-order | not yet issued | | available) execution +--------+-------+---+ | | | executing | | V +---+-------+---+ ------- | ----------------------- | completed | | A +-----------+------+ | | | completed & next | | in-order state | to be retired | V | in-order---------------------- +------------------+ ------- (values available) retirement __| |__ | \ retired i-stream / V Restricted data flow -------------------- The instruction window and result window provide a resticted form of data flow execution (a non-von Neumann approach to computing) by linking insts. with true data dependencies to each other using names. ................ . value value . . v v . . .--------. . . |multiply| . this data flow node is ready to execute . `--------' . since both values are available . v . . NAME1 .\ ................ \ shared use of NAME1 respresents true data dependency ............... / . value NAME1 . / . v v . . .-----. . . | add | . this data flow node is not ready to execute . `-----' . since only one value is available . v . . NAME2 . ............... Upon completion, the multiply will broadcast its result along with "NAME1" to the waiting instructions in the instruction window and to the result window. At this point, the needed operand for the waiting add will match with "NAME1" and the result will be copied into the instruction window buffer (i.e., the data flow node) for the add. The add will then be marked as ready to execute. Dependency detection/removal ---------------------------- 1. in-order issue - register scoreboard (dynamic issue but static scheduling) +--------+ +---------+ | decode |--issue-->| execute | insts. leave decode with operand VALUES +--------+ +---------+ register scoreboard in decode stage uses a single bit for each register: - bit indicates a pending update - if bit set for a source or dest. register, then stall instruction - else set bit corresponding to the destination register and issue - reset bit for that register when instruction completes (write-back) can add forwarding logic to reduce impact of stalls RAW causes a stall in the decode stage WAR is eliminated by reading operands in decode stage WAW causes a stall in the decode stage 2. historical scoreboard - minimal level that supports dynamic scheduling +--------+ +---------+ +---------+ insts. leave | decode |--dispatch-->| buffers |--issue-->| execute | decode with +--------+ +---------+ +---------+ operand VALUES or NAMES decode uses busy bits like a simple register scoreboard - if bit set for the destination register, then stall instruction - if bit set for a source register, send register number with inst. - set bit corresponding to the destination register - reset bit for that register when instruction completes (write-back) allows one pending update per register dispatch w/ available operands to avoid WAR or check and stall writes until WAR-dependent insts. read the register and issue dispatch w/ register numbers to handle RAW on completion, broadcast register number and result value register numbers in inst. buffers serve as trivial result tags since there can be only one pending update per register RAW causes a stall in the instruction buffers WAR removed by reading operands in decode stage or stalling writes WAW causes a stall in the decode stage 3. full tagging or renaming - supports dynamic scheduling +--------+ +---------+ +---------+ insts. leave | decode |--dispatch-->| buffers |--issue-->| execute | decode with +--------+ +---------+ +---------+ operand VALUES or NAMES implement with tags (reorder buffer or future file) or physical regs. to assign a new tag or physical register number for each pending update allows multiple pending update per register (good for ISAs w/ few regs.) RAW causes a stall in the instruction buffers WAR removed WAW removed Historical scoreboard (as in CDC 6600) -- names = architected registers ----------------------------------------------------------------------- register busy centralized reservation station ("scoreboard") file bits +---+ +---+ /--operand1--\ /--operand2--\ fn result R0 | - | | 0 | ready reg/value ready reg/value unit reg. R1 | - | | 1 | +-----+---------+-----+---------+-----+------+ R2 | - | | 1 | mul R1,R5,R5 | 1 |=? 9 | 1 |=? 9 | mul | R1 | R3 | - | | 0 | \ +-----+---------+-----+---------+-----+------+ R4 | - | | 0 | add R2,R1,R5 | 0 |=? R1 | 1 |=? 9 | add | R2 | R5 | 9 | | 0 | +-----+---------+-----+---------+-----+------+ R6 | - | | 0 | R7 | - | | 0 | * the "add" is dependent on the "mul", but the "mul" can +---+ +---+ execute since both of its operands are ready * add has to wait until multiplier broadcasts and scoreboard associatively matches ("=?") the "R1" on the result bus with any "R1"s in operand register fields active fn units A result result Upon completion, send | operand1 operand2 reg. value excep < result reg., value > | +--------+--------+------+------+-----+ to register file and | mul | 9 | 9 | R1 | ---- | --- | to the scoreboard; | +--------+--------+------+------+-----+ reset busy bit for | \___________/ result register. | | | `-----------------------------------------' In this approach, register renaming is not performed; instead, waiting operands are "tagged" with the register number. It does allow for out-of-order issue, but it does not provide for recovery from exceptions or mispredicted branches. Allowing multiple pending updates to a given architected register is motivation to do renaming in hardware. Consider the code sequence R1 dependencies (dest. reg. listed first) mul R1,_,_ e.g., 20: mul R1,R5,R5 // first update to R1 add _,R1,_ 24: add R2,R1,R5 // (let R5 contain, e.g., 9) sub _,R1,_ 28: sub R3,R1,R5 xor R1,_,_ 32: xor R1,R5,R5 // second update to R1 and _,R1,_ 36: and R4,R1,R5 CDC historical scoreboard, 1964 (as shown above) mul R1,_,_ decode, mark R1 busy, dispatch .\ add _,R1,_ decode, dispatch with one operand = R1 and marked not ready .. sub _,R1,_ decode, dispatch with one operand = R1 and marked not ready . >>> WAW <<< (WAW is detected by checking busy bit of dest. reg.) . xor R1,_,_ decode and stall since WAW on R1 \ and _,R1,_ * if "xor" executed and wrote R1 before "mul" finished, then "add" and "sub" would get the wrong input value * WAR either not checked (if available operands read in decode) or checked and causes stall until all insts. dependent on that register have issued Tomasulo-like tagging (see below), dates from 1967, modern example is AMD K5 mul T,_,_ decode, assign tag of T for first pending update of R1, dispatch | . (register-fetch/rob-search will return R1 -> T) .\ add _,T,_ decode, dispatch with one operand = T and marked not ready . sub _,T,_ decode, dispatch with one operand = T and marked not ready xor W,_,_ decode, assign tag of W for second pending update of R1, dispatch | . (register-fetch/rob-search will return R1 -> W) \ and _,W,_ decode, dispatch with one operand = W and marked not ready * the "xor" instruction can execute and complete before "mul" finishes but the "add" and "sub" will still wait on a result tagged with "T" * future file avoids tag matching in reorder buffer (UltraSPARC-III) Register renaming (see below) - done when there are more physical regs. than architected regs. - avoids tag matching in reorder buffer; adds reg. renaming table to logic, modern examples are MIPS R10000 and Pentium 4 mul PR5,_,_ decode, assign phy. reg. (e.g. PR5) to architected R1, dispatch | . (mapping table now has R1 -> PR5) .\ add _,PR5,_ decode, dispatch with PR5 and marked not ready . sub _,PR5,_ decode, dispatch with PR5 and marked not ready xor PR8,_,_ decode, assign phy. reg. PR8 to architected R1, dispatch | . (mapping table now has R1 -> PR8) \ and _,PR8,_ decode, dispatch with PR8 and marked not ready * "xor" can execute and complete prior to "mul" ending * when "mul" ends, the busy bit for PR5 is reset and all instructions waiting for PR5 will have that operand marked as ready * operand values are read from the physical reg. file upon inst. issue * since "xor" redefines R1 and thus marks end-of-life for previous R1 value (i.e., from "mul"), "xor" retirement releases PR5 back to allocation pool Reorder buffer with tagging -- names = tags ------------------------------------------- (after four insts., see code example above) reg. file reorder buffer Upon exception, flush +---+ valid reg. cmpl. tag value PC excep buffer from that inst. R0 | - | +-----+------+-----+-----+-----+----+-----+ upwards, save that PC, R1 | - | | 1 |=? R1 | 0 |=? W | --- | 32 | --- | and restart decoder with R2 | - | +-----+------+-----+-----+-----+----+-----+ excep. handler address. R3 | - | | 1 |=? R3 | 0 |=? V | --- | 28 | --- | Upon misprediction, flush R4 | - | +-----+------+-----+-----+-----+----+-----+ above that branch and R5 | 9 | | 1 |=? R2 | 0 |=? U | --- | 24 | --- | restart decoder with the R6 | - | +-----+------+-----+-----+-----+----+-----+ correct address. (Or, R7 | - | | 1 |=? R1 | 0 |=? T | --- | 20 | --- | wait until retirement to +---+ +-----+------+-----+-----+-----+----+-----+ check for exceptions and | \_|_______________________________/ mispredicts.) | reg. | | | access | prioritized | retirement into register file `------.------' assoc. search V upon successful completion | | register fetch for decoder -- use the most recent value/tag | when the register number finds hit(s) in the reorder buffer, V e.g., a fetch to R1 returns tag W (not a value and not tag T) waiting instructions in instruction window /--operand 1--\ /--operand 2--\ fn result ready tag/value ready tag/value unit tag +-----+---------+-----+---------+-----+------+ Waiting instructions receive | 0 |=? T | 1 |=? 9 | sub | V | missing operands by assoc. +-----+---------+-----+---------+-----+------+ matching ("=?") on operand 1 | 0 |=? T | 1 |=? 9 | add | U | and operand 2 tags for each +-----+---------+-----+---------+-----+------+ result generated by fn units. fn units result result result operand1 operand2 tag value excep Upon completion, forward +--------+--------+-----+-------+-----+ < tag, value > xor | 9 | 9 | W | --- | --- | to the inst. window. Upon +--------+--------+-----+-------+-----+ completion/exception, forward +--------+--------+-----+-------+-----+ < tag, value, excep > mul | 9 | 9 | T | --- | --- | to the reorder buffer. +--------+--------+-----+-------+-----+ \_________________/ A `------------------------------' Dispatch actions: (1) Fetch operand register values from the register file and also tag/values from the reorder buffer. Use the most recent tag/value entry from the reorder buffer, if available; otherwise, use the register file value. (2) Assign a new tag for the operation result. (3) Allocate slot in reorder buffer, and enter register, tag, and PC fields. (4) Allocate a slot in the instruction window, and fill in the operand, functional unit, and tag fields. Issue actions: (1) Look for ready instructions that have both operands valid and the specified functional unit available. (2) Issue as many ready instructions as possible. Completion actions: (1) Send the result and its tag to the reorder buffer and to the other entries in the instruction window. (2) Make sure that the dispatch of an instruction in a given cycle does not miss the completed result distribution of a needed operand in that cycle. (3) If an exception occurred, send the tag and type to the reorder buffer and record the type in the reorder buffer. Retirement actions: (1) Examine the oldest entry in the reorder buffer and, if valid and no exceptions, write the value into the specified register; if it is a store instruction, write the data to the cache. (2) If an exception is indicated, record the PC of that instruction, flush the buffer, and invoke an exception handler. Future file -- names = tags --------------------------- (after four insts., see code example above) Keep lookahead state in special register file - this avoids the expensive, prioritized associative search of the reorder buffer for operands future file register reorder buffer MR cmpl. tag value file reg. cmpl. value PC excep +--+-----+---+-----+ +---+ +----+-----+-----+----+-----+ Use index R0 | 0| - | - | - | R0 | - | 3 | R1 | 0 | --- | 32 | --- | rather R1 | 1| 0 | W | - | R1 | - | 2 | R3 | 0 | --- | 28 | --- | than tag R2 | 1| 0 | U | - | R2 | - | 1 | R2 | 0 | --- | 24 | --- | match for R3 | 1| 0 | V | - | R3 | - | 0 | R1 | 0 | --- | 20 | --- | result. R4 | 0| - | - | - | R4 | - | +----+-----+-----+----+-----+ R5 | 0| - | - | - | R5 | 9 | \_________________________/ R6 | 0| - | - | - | R6 | - | | R7 | 0| - | - | - | R7 | - |<--------------' retirement - send reg. number +--+-----+---+-----+ +---+ and value to register file. | reg. reg. | | access access | `------.-----------------' | register access rather than associative search in decode stage | use future file value/tag when MR=1 (i.e., "more recent" is true) | future file provides a tag if register's completion flag = 0 | (a new tag is assigned and written into the future file for V each destination register during dispatch) waiting instructions in instruction window /--operand 1--\ /--operand 2--\ fn result result buffer ready tag/value ready tag/value unit reg. tag index +-----+---------+-----+---------+-----+------+------+------+ Obtain missing | 0 |=? T | 1 |=? 9 | sub | R3 | V | 2 | operands by assoc. +-----+---------+-----+---------+-----+------+------+------+ matching on tags | 0 |=? T | 1 |=? 9 | add | R2 | U | 1 | for each result +-----+---------+-----+---------+-----+------+------+------+ from fn. units. fn units result result result buffer Upon exception or operand1 operand2 reg. tag value index excep misprediction, note +--------+--------+------+------+------+------+-----+ the problem in the xor | 9 | 9 | R1 | W | ---- | 3 | --- | particular reorder +--------+--------+------+------+------+------+-----+ buffer entry and wait +--------+--------+------+------+------+------+-----+ until retirement. mul | 9 | 9 | R1 | T | ---- | 0 | --- | +--------+--------+------+------+------+------+-----+ \________________________________/ A | | `----------------------------------------' On completion, send: < result tag, value > to inst. window, < result value, buffer index, excep > to reorder buffer, < result reg num., result tag, value > to future file Note that future file may already have recorded a more recent update; so, store the forwarded value in the future file only if the result tag matches the current tag for that register, and if so, also set the completion flag. When retirement finds an exception or misprediction, flush the reorder buffer, future file, and inst. window; then restart with the correct branch address or with the exception handler address. Register renaming -- names = physical registers ----------------------------------------------- (after four insts., see code example above) Keep lookahead state in larger physical register file - avoids the prioritized associative search of reorder buffer for operands. Architected register file may be a separate structure or may exist as subset of the physical register file. In the latter case, the ROB doesn't need a value field (and this eliminates an arch. register file write on retirement). A register map ("register alias table") is used for renaming. Rename the source registers in an inst. to their mapped physical registers. Assign a new physical register for the destination register and update the map. The new physical register can be obtained from an allocation table. Set a busy bit for the allocated physical register. (Note that the Pentium 4 keeps both a frontend RAT and a retirement RAT.) Previous arch. reg. -> phy. reg. mapping recorded in ROB ("oldmap") so that phy. reg. can be released when inst. that redefines arch. reg. retires. physical reg. file with architected reorder buffer busy bits reg. map reg. file reg. cmpl. value PC excep oldmap +---+ +-+ +-----+ +---+ +----+-----+-----+----+-----+-----+ Access PR0| - | |0| R0| - | R0| - | 3| R1 | 0 | --- | 32 | --- | PR5 | w/index PR1| - | |0| R1| PR8 | R1| - | 2| R3 | 0 | --- | 28 | --- | - | rather PR2| - | |0| R2| PR6 | R2| - | 1| R2 | 0 | --- | 24 | --- | - | than PR3| - | |0| R3| PR7 | R3| - | 0| R1 | 0 | --- | 20 | --- | - | w/tag. PR4| - | |0| R4| - | R4| - | +----+-----+-----+----+-----+-----+ PR5| - | |1| R5| - | R5| 9 | \_________________________/ | PR6| - | |1| R6| - | R6| - | | | PR7| - | |1| R7| - | R7| - |<------------' | PR8| - | |1| +-----+ +---+ | ... ... ... A .------------------------------' PRn| - | |0| | V retirement - if separated, send reg. +---+ +-+ +------------------+ num. and value to arch reg. file; | reg. alloc. list | release old phy. reg. num. to the +------------------+ physical register allocation list. list may be impl. w/ a pair of cntrs. waiting instructions in instruction window /-operand 1-\ /-operand 2-\ fn result buffer ready reg. ready reg. unit reg. index +-----+-------+-----+-------+-----+------+------+ Watch for availability of | 0 |=? PR5 | 1 |=? R5 | sub | PR7 | 2 | missing operands by assoc. +-----+-------+-----+-------+-----+------+------+ matching ("=?") register | 0 |=? PR5 | 1 |=? R5 | add | PR6 | 1 | numbers for each result +-----+-------+-----+-------+-----+------+------+ produced by fn. units. register fetch done by special stage in fn. unit pipeline (after issue) (reduces result bus complexity and width of inst. window entries) fn units result result buffer Upon exception, can either operand1 operand2 reg. value index excep record the problem using +--------+--------+------+------+------+-----+ excep. field, or can stop xor | 9 | 9 | PR8 | ---- | 3 | --- | dispatch, let pipes drain, +--------+--------+------+------+------+-----+ and then "unmap" registers +--------+--------+------+------+------+-----+ from top of ROB down (i.e., mul | 9 | 9 | PR5 | ---- | 0 | --- | reverse program order) until +--------+--------+------+------+------+-----+ inst. w/ exception found \_________________________/ A | | `---------------------------------' On completion, send < reg num > to inst. window, < reg num., value > to reorder buffer and to phy. reg. file; also reset phy. reg. busy bit To handle branches, can either wait until retirement, or save copies of the register map and allocation table on each branch; and, on mispredict, restore the saved copies and execute down the correct path.