[Note: used different textbook in Fall 2008] 330 Fall 2008 -- Final Exam (with answers) 1. Fill in with the power of ten. (1 pt. each) pico __-12__ micro __-6___ kilo __+3___ giga __+9___ peta __+15__ nano __-9___ milli __-3___ mega __+6___ tera __+12__ exa __+18__ 2. Give the CPU time equation in terms of clock frequency and define the terms you use. (4 pts.) CPU time = ( IC * CPI ) / CF IC = instruction count CPI = cycles per instruction CF = clock frequency 3. a. Which term in the CPU time equation is measured in Hertz? (1 pt.) CF (clock frequency) b. What is the definition of Hertz in simpler units? (1 pt.) cycles per second 4. Which term in the CPU time equation is most influenced by a multiple- issue processor (superscalar or VLIW)? Explain your answer. (3 pts.) CPI (cycles per instruction) multiple-issue designs attempt to increase the number of instructions executed per cycle, leading to an increased IPC and therefore decreased CPI 5. a. Find the execution time of a program that executes 5 billion instructions on a processor with an average CPI of 1.2 and a clock cycle time of 250 psec. (2 pts.) 5 * 10^9 insts. * 1.2 cycles/inst. * 250 * 10^-12 sec/cycle = 5*1.2*250 * 10^(9-12) sec = 1500 * 10^-3 sec = 1.5 sec b. What is the MIPS value for the machine described in part (a)? (2 pts.) ( 5 * 10^9 insts ) / ( 1.5 sec * 10^6 ) = 5/1.5 * 10^(9-6) MIPS = 3.333 * 10^3 MIPS = 3333 MIPS 6. Give the equation for Amdahl's law in terms of "f" and "s" and define the terms you use. (6 pts.) 1 / ( (1-f) + f/s ) f = fraction of program execution time that can be improved s = speedup of improvement 7. What is the value of "f" in Amdahl's law for an "embarrassingly parallel" program? (1 pt.) f approximately equal to 1 8. What fraction of normal execution time must an 10x enhancement be used to achieve an overall speedup of 5? Show the result as a fraction. (5 pts.) 5 = 1 / ( (1-f) + f/10 ) multiply both sides by ((1-f)+f/s)/5 (1-f) + f/10 = 1/5 multiply both sides by 10 10 - 10*f + f = 10/5 simplify 10 - 9*f = 2 subtract 10 from both sides -9*f = -8 divide both sides by -9 and cancel signs f = 8/9 9. What fraction of normal execution time must an 10x enhancement be used to achieve an overall speedup of 7.5? Show the result as a fraction. (5 pts.) 7.5 = 1 / ( (1-f) + f/10 ) (1-f) + f/10 = 1/7.5 10 - 10*f + f = 10/7.5 10 - 9*f = 20/15 -9*f = (20 - 150)/15 f = (-130/15)/-9 f = -130/(-135) f = 26/27 10. Give the truth table for a half-adder and the circuit implementation. (4 pts.) ___ a b | c_out sum a------*---| ` -------+------------- | | |--- c_out 0 0 | 0 0 b---*------|___' 0 1 | 0 1 | | and 1 0 | 0 1 | | ___ 1 1 | 1 0 | `---\\ \ | || >--- sum c_out = a and b `------//___/ sum = a xor b xor 11. Give the truth table for a full-adder and the circuit implementation. (You can use half-adders as components.) (4 pts.) a b c_in | c_out sum a b c_in -------------+------------- | | | 0 0 0 | 0 0 +----+ | where HA is 0 0 1 | 0 1 | HA | | a half adder 0 1 0 | 0 1 +----+ | with its carry 0 1 1 | 1 0 | `-. | output on left 1 0 0 | 0 1 | | | and sum output 1 0 1 | 1 0 | +----+ on right 1 1 0 | 1 0 | | HA | 1 1 1 | 1 1 | +----+ | .--' | c_out = a*b + c_in*(a^b) | | | sum = (a^b)^c_in |\___/| | or | | | \ / | \ / | | | c_out sum 12. A bit-serial adder has one D flip-flop, two inputs A and B, and one output S. The combinational logic consists of a full adder with the carry out being used as the next state. (Ignore the initial clear/set of the carry.) (8 pts.) +---------+ A ----->| | sum | full |-----------------------> S B ----->| | | adder | carry = D_in +---+ Q .---->| |------------->| |---. | +---------+ | D | | | CLK->| | | | +---+ | `--------------------------------------' a. Give the state transition table with inputs A, B, and Q(t), and outputs S and Q(t+1). A B Q(t) | Q(t+i) S -------------+------------ 0 0 0 | 0 0 note: same as full adder where 0 0 1 | 0 1 Q(t) == c_in and Q(t+1) == c_out 0 1 0 | 0 1 0 1 1 | 1 0 1 0 0 | 0 1 1 0 1 | 1 0 1 1 0 | 1 0 1 1 1 | 1 1 b. Give the state diagram. 11/0 .---. .------. .---. 00/0, / v.===./ v.===./ \ 01/0, 01/1, | | 0 | | 1 | | 10/0, 10/1 \ /`==='^ /`==='^ / 11/1 `---' `------' `---' 00/1 13. Show a four-bit parallel adder using full-adders as components. (3 pts.) A_3 B_3 A_2 B_2 A_1 B_1 A_0 B_0 | | .---. | | .---. | | .---. | | .-- 0 | | | | | | | | | | | | | | | V___V___V | V___V___V | V___V___V | V___V___V | full | | | full | | | full | | | full | |__adder__| | |__adder__| | |__adder__| | |__adder__| | | | | | | | | | | | V | `----' | `----' | `----' | carry_out V V V V sum_3 sum_2 sum_1 sum_0 14. Briefly explain how a 64-bit integer adder can be used to execute a SIMD add instruction that performs eight independent byte-wide adds. (2 pts.) break the carry chain between each group of eight bits XC-1. There are two simple ALUs in the Pentium 4 that run twice as fast as other circuitry so that they can each execute in one-half the normal clock cycle. Explain why this would be an advantage in a superscalar system. (4 pts.) two half-cycle ALUs allow the execution of one simple-ALU-op instruction in the first half cycle and then a data dependent simple-ALU-op instruction in the second half cycle [e.g., on a P4-like processor, the instruction pair add R3,R1,R2 ; R3 <- R1+R2 add R5,R3,R4 ; R5 <- R3+R4 and depends on the first add for R3 could be executed in one clock cycle note that the data dependency on R3 is a true data dependency, i.e., RAW -- the read-after-write ordering of accesses to R3 must be preserved for the program to be correct] 15. Consider the instructions I1-I8. Destination registers are listed first. Assume that there are no condition codes. I1: clear R2 // ALU op, treat as or, R2 = 0 I2: move R1, list // ALU op, treat as or, R1 = address of list loop: I3: brz R1, next // branch to next if R1 == 0 I4: nop I5: add R2, R2, 1 I6: load R1, 4(R1) I7: jump loop I8: nop next: Assuming no forwarding hardware, show the 5-stage pipeline cycle diagram (that is, stairstep diagram) for the execution of I1-I2 and one iteration of the loop body (assume "list" != 0). Assume that the branches are delayed and that for brz the value must be available for comparison in the ID stage; ID can overlap with WB. (8 pts.) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 I1:clear R2 IF ID EX MM WB I2:move R1,list ___ _IF _ID _EX _MM _WB ___ ___ ___ ___ ___ ___ ___ ___ ___ I3:brz R1,next ___ ___ _IF _ID _EX _MM _WB ___ ___ ___ ___ ___ ___ I4:nop ___ ___ ___ ___ ___ _IF _ID _EX _MM _WB ___ ___ ___ ___ ___ I5:add R2,R2,1 ___ ___ ___ ___ ___ ___ _IF _ID _EX _MM _WB ___ ___ ___ ___ I6:load R1,4(R1) ___ ___ ___ ___ ___ ___ ___ _IF _ID _EX _MM _WB ___ ___ ___ I7:jump loop ___ ___ ___ ___ ___ ___ ___ ___ _IF _ID _EX _MM _WB ___ ___ I8:nop ___ ___ ___ ___ ___ ___ ___ ___ ___ _IF _ID _EX _MM _WB ___ 16. Identify examples of locality of reference, if clearly present, in the instructions in question 15. Be sure to explain your chosen examples in enough detail that the reference patterns are clearly evident. (4 pts.) spatial - sequential execution of instructions temporal - repeated execution of instructions in loop 17. Identify examples of locality of reference, if clearly present, in the data accessed by the code in question 15. Be sure to explain your chosen examples in enough detail that the reference patterns are clearly evident. (3 pts.) no clear reference pattern of data XC-2. Identify and explain what the code segment does. (4 pts.) walks down a linked list counting the number of nodes; next-node pointer in node is in second word (offset 4) [e.g., the HLL equivalent might be something like this: struct node{ ... // first word struct node* next; // second word (at node_addr+4) is link }; int count; struct node *list, *p; ... count = 0; // count allocated to R2 p = list; // p allocated to R1 while( p != 0 ){ count++; p = p->next; } ] 18. Assume a four-line fully-associative cache, eight bytes per line, and LRU replacement. The cache is initially empty. For the byte-addressable memory address reference stream given below determine how many of the references are hits. (5 pts.) +---------+ +---------+ +---------+ +---------+ | | | | | | | | +---------+ +---------+ +---------+ +---------+ fa: 0, 4, 8, 20, 36, 96, 32, 4, 12, 16, 24, 48, 44, 8, 0 hit: _ H _ _ _ _ H _ _ _ _ _ _ _ _ total = __2_ miss: M _ M M M M _ M M M M M M M M total = _13_ name the four lines A,B,C,D and allocate them in that order until full, then the replacement info is MRU A A B C D A D B C A D B C A D A B C D A D B C A D B C A A B C C A D B C A D B C LRU A B B C A D B C A D B final contents +---------+ +---------+ +---------+ +---------+ A| 8 - 15 | B| 48 - 55 | C| 40 - 47 | D| 0 - 7 | +---------+ +---------+ +---------+ +---------+ 19. Identify superscalar, EPIC, and VLIW (leaving one blank empty). (6 pts.) dependency fn. unit time to start checking assignment execution ---------- ---------- ------------- _superscalar_ hardware hardware hardware _EPIC________ software hardware hardware _____________ software software hardware _VLIW________ software software software 20. An HP processor uses a three-bit branch history shift register in each branch history table entry and uses a "majority vote" algorithm to predict whether the next branch is taken (T) or untaken (U). Assume an entry is initialized to UUU. Determine the prediction accuracy on the following branch trace; include all trace entries in your calculation. (4 pts.) (a) T T T T U T T T T U T T T T U T T T T U BHSR youngest U T T T T U T T T T U T T T T U T T T T U ... U U T T T T U T T T T U T T T T U T T T T oldest U U U T T T T U T T T T U T T T T U T T T \ predict U U T T T T T T T T T T T T T T T T T T (a) T T T T U T T T T U T T T T U T T T T U correct? n n y y n y y y y n y y y y n y y y y n = 14/20 accuracy (b) T U T U T U T U T U T U T U T U T U T U BHSR youngest U T U T U T U T U T U T U T U T U T U T U ... U U T U T U T U T U T U T U T U T U T U T oldest U U U T U T U T U T U T U T U T U T U T U \ predict U U U T U T U T U T U T U T U T U T U T (b) T U T U T U T U T U T U T U T U T U T U correct? n y n n n n n n n n n n n n n n n n n n = 1/20 accuracy 21. Identify the point in the instruction pipeline for an out-of-order superscalar processor when an entry in the reorder buffer is reserved for an instruction. Why is it important to make this reservation at that point? (3 pts.) [recall that an o-o-o pipeline consists of an in-order front-end, an out-of-order middle, and an in-order back-end. the front-end and the middle are coupled by the instruction window or reservation stations, and the middle and the back-end are coupled by the ROB] the ROB entry should be allocated by one of the last stages of the in-order front-end (e.g., the one that places the instruction in the inst. window) this allows the ROB entries to be allocated in program order and thus inst. retirement can be done in program order 22. Explain what the term "false dependency" means and how it can arise in an out-of-order superscalar processor. What can be done in hardware to eliminate false dependencies? (3 pts.) a false dependency is an ordering required between two instructions to maintain program correctness because of register reuse (or memory location reuse) rather than because of a data value producer-consumer ordering [the two types of false dependencies are: WAR (write-after-read, a.k.a. anti-dependency), and WAW (write-after-write, a.k.a. output dependency) e.g., a WAR dependency is add R3,R1,R2 ; R3 <- R1+R2 sub R1,R4,R5 ; R1 <- R4+R5 and must allow the add to read R1 ; before this inst. writes R1 ] for register-related false dependencies, hardware can perform register renaming 23. Which of the following loops is "embarrassingly parallel"? Explain your answer. (3 pts.) (a) for( i = 0; i < N; i++ ){ a[ i ] = b[ i ] + c [ i ]; } (b) a[ 0 ] = c [ 0 ]; for( i = 1; i < N; i++ ){ a[ i ] = a[ i-1 ] + c [ i ]; } (a) is embarassingly parallel because each iteration is independent [the second through last iteration of (b) depends on the previous iteration]