Clemson University
CPSC 464/664
Mark Smotherman


historical implementation - hardwired accumulator machine

     +------------------------------+
     |    hardwired control unit    |<----------clock
     |        (random logic)        |
     +------------------------------+
        ^ op       |  | ... |
        | code     V  V     V
        |
        |          control signals
        |
   .....|..........................
   .    |                         .  datapath
   .    |      ||   +-----+       .
   .    |      ||<->| ACC |<---.  .
   .    |      ||   +-----+    |  .
   .    |      ||    |         |  .
   .    |      ||    | --.     |  .
   .    |      ||    `->| \    |  .
   .    |      ||          > |-'  .
   .    |      ||    .->| /  ALU  .
   .    |      ||    | --'        .
   .    |      ||    |            .
   . +-----+   ||   +-----+       .       +--------+
   . | IR  |<->||<->| MDR |<------------->|        |
   . +-----+   ||   +-----+       .       | memory |
   . +-----+   ||   +-----+       .       |        |
   . | PC  |<->||-->| MAR |-------------->|        |
   . +-----+   ||   +-----+       .       +--------+
   .                              .
   ................................


datapath activity for ADD mem_addr
(i.e., the sequence of register transfers, or micro-operations)

        1: MAR <- PC
        .2: read (MDR <- memory[MAR])
        ..3: PC <- PC + 1
        ...                  4: IR <- MDR
        ...                  .5: decode IR
        ...                  ..6: MAR <- IR_addr_field
        ...                  ...7: read (MDR <- memory[MAR])
        ...                  ....                   8: ACC <- ACC + MDR
PC     |1.3                  ....                   .
MAR    |12.                  ..67                   .
memory |.22222222222222222222...77777777777777777777.
MDR    |...                 24...                  78
IR     |...                 .456.                  ..
decode |...                 ..5..                  ..
ACC    |...                 ...6.                  .8
       +---------------------------------------------
            memory latency         memory latency

the control signals are produced relatively quickly by random logic, but
performance is dominated by the two accesses per inst. to main memory

faster? => (1) fetch two instructions at once
           (2) simple pipelining - overlap inst. fetch and data access


historical implementation - microprogrammed general register machine

  + - - - - - - - - - - - - - - - - - - -+
  |     microprogrammed control unit     |   microinstructions in a control
                                             store replace hardwired logic
    +--------+  +------+  +------------+
    | decode |->| CSAR |->| cntl store |     encourages more complicated
    +--------+  +------+  +------------+     instructions (i.e., more work
        ^          ^         |  | ... |      specified per instruction) to
  |     |          |      +------------+     reduce the number of trips to
  |     |          `------|    CSIR    |     main memory for instructions
  + - - | - - - - - - - - +------------+
        | op                 |  | ... |
        | code               V  V     V
        |
        |                    control signals
        |
   .....|..........................
   .    |                         .  datapath
   .    |      ||   +-----+       .
   .    |      ||   | R_0 |       .   general registers replace the single
   .    |      ||<->| ... |<---.  .   accumulator register to allow more
   .    |      ||   | R_n |    |  .   data to be held in the CPU and thus
   .    |      ||   +-----+    |  .   reduce the number of trips to main
   .    |      ||    |         |  .   memory for data
   .    |      ||    | --.     |  .
   .    |      ||    `->| \    |  .
   .    |      ||          > |-'  .
   .    |      ||    .->| /  ALU  .
   .    |      ||    | --'        .
   .    |      ||    |            .
   . +-----+   ||   +-----+       .       +--------+
   . | IR  |<->||<->| MDR |<------------->|        |
   . +-----+   ||   +-----+       .       | memory |
   . +-----+   ||   +-----+       .       |        |
   . | PC  |<->||-->| MAR |-------------->|        |
   . +-----+   ||   +-----+       .       +--------+
   .                              .
   ................................

long sequences of control signals produced for complex instructions;
performance is dominated by the access time of the control store

faster? => (1) move to two or three internal busses in datapath
           (2) pipelining of datapath
           (3) pipelining of microprogrammed control unit, that is,
               overlap the next fetch from the control store with the
               previous set of control signals


modern implementation - hardwired load/store machine

     +------------------------------+
     |    hardwired control unit    |<----------clock
     |    (PLA - compact logic)     |
     +------------------------------+   control unit once again hardwired
        ^ op       |  | ... |
        | code     V  V     V
        |
        |          control signals
        |
   .....|..........................
   .    |                         .  datapath
   .    |      ||   +-----+       .
   .    |      ||   | R_0 |       .   simple instructions can be fetched from
   .    |      ||<->| ... |<---.  .   the inst. cache and thus avoid trips to
   .    |      ||   | R_n |    |  .   main memory
   .    |      ||   +-----+    |  .
   .    |      ||    |         |  .   data accesses can be made to the data
   .    |      ||    | --.     |  .   cache and thus avoid trips to main
   .    |      ||    `->| \    |  .   memory
   .    |      ||          > |-'  .
   .    |      ||    .->| /  ALU  .
   .    |      ||    | --'        .
   .    |      ||    |            .
   . +-----+   ||   +-----+       .      +---------+         +--------+
   . | IR  |<->||<->| MDR |<------------>| d cache |<------->|        |
   . +-----+   ||   +-----+       . .-|->+---------+--|-.    | memory |
   . +-----+   ||   +-----+       . | `--+---------+<-' |    |        |
   . | PC  |<->||-->| MAR |------------->| i cache |-------->|        |
   . +-----+   ||   +-----+       .      +---------+  miss   +--------+
   .                              .
   ................................

the control signals once again are produced relatively quickly by
hardwired logic; performance is dominated by the hit rates of the
inst. and data caches

can think of it as "compile to microcode" where the icache replaces
the control store

faster? => (1) move to two or three internal busses in datapath
           (2) pipelining of datapath
           (3) larger caches
           (4) multiple instructions per cycle (superscalar or VLIW)


datapath activity for LOAD R_dest,[R_src+displ]
register transfers divide into five pipeline stages

        1: iAR <- PC                                     |
        .2: read (iDR <- icache[iAR])                    | instruction
        ..3: PC <- PC + 1                                | fetch
        ...4: IR <- iDR                                  |
        ....
        ....  5: decode IR                               | decode and
        ....  .6: temp_1 <- R_src                        | register read
        ....  ..
        ....  ..    7: temp_2 <- temp_1 + displ          | execution
        ....  ..    .
        ....  ..    .     8: dAR <- temp_2               | memory access
        ....  ..    .     .9: read (dDR <- dcache[dAR])  |
        ....  ..    .     ...
        ....  ..    .     ...   0: R_dest <- dDR         | result write back
PC     |1.3.  ..    .     ...   .
iAR    |12..  ..    .     ...   .
icache |.22.  ..    .     ...   .     (=> icache hit)
iDR    |..24  ..    .     ...   .
IR     |...4  ..    .     ...   .
decode |....  5.    .     ...   .
dAR    |....  ..    .     89.   .
dcache |....  ..    .     .99   .     (=> dcache hit)
dDR    |....  ..    .     ..9   0
R_src  |....  .6    .     ...   .
temp_1 |....  .6    7     ...   .
temp_2 |....  ..    7     8..   .
R_dest |....  ..    .     ...   0
       +-----|-----|-----|-----|-----
         IF     D    EX    MEM   WB

similar timing (based on cache hits) allows pipeline overlap of these
five stages

   IF  D   EX  MEM WB
       IF  D   EX  MEM WB
           IF  D   EX  MEM WB
               IF  D   EX  MEM WB
                   IF  D   EX  MEM WB
                       IF  D   EX  MEM WB
                           ....
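
The overlap pattern above can be reproduced with a short sketch. It is only an illustration under simplifying assumptions that go beyond the notes: every stage takes exactly one cycle, every cache access hits, and there are no hazards or stalls. The names pipeline_schedule, total_cycles, and render are hypothetical helpers, not part of the course material.

```python
# Idealized five-stage pipeline: one cycle per stage, all cache hits,
# no hazards or stalls (assumptions made for this sketch only).
STAGES = ["IF", "D", "EX", "MEM", "WB"]


def pipeline_schedule(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage active in cycle c, or None."""
    n_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [None] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


def total_cycles(n_instructions, depth=len(STAGES), pipelined=True):
    """Cycles to finish n instructions with or without stage overlap."""
    if pipelined:
        return n_instructions + depth - 1  # fill the pipe once, then one per cycle
    return n_instructions * depth          # each instruction runs start to finish alone


def render(rows, width=4):
    """Format the schedule as the staggered IF D EX MEM WB diagram."""
    lines = []
    for row in rows:
        cells = [(cell or "").ljust(width) for cell in row]
        lines.append("".join(cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(pipeline_schedule(6)))
    print("pipelined cycles:  ", total_cycles(6))
    print("unpipelined cycles:", total_cycles(6, pipelined=False))
```

Under these assumptions, six instructions finish in 6 + 5 - 1 = 10 cycles with overlap, versus 6 * 5 = 30 cycles if each instruction had to complete all five stages before the next one started. A cache miss or a dependence between instructions would add stall cycles that this sketch does not model.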