CPSC 330 - Fall 2008 Homework 5 Due Monday, November 24 Example Download WinDLX on a Windows computer. Copy test.s to another file, e.g., test2.s, and change the contents to the following: .data value: .word 1 .text .global main main: lw r1,value addi r1,r1,1 addi r1,r1,2 finish: trap 0 This program will execute as follows: lw r1,value ; r1 <- memory[value] addi r1,r1,1 ; r1 <- r1 + 1 addi r1,r1,2 ; r1 <- r1 + 2 trap 0 ; end execution Run WinDLX. I suggest tiling the code box in the upper left, the pipeline box in the upper middle, the statistics box in the upper right, and the clock cycle diagram across the middle and bottom. (See screen capture http://www.cs.clemson.edu/~mark/330/windlx.gif) Click configuration on the upper menu bar; set absolute cycle count and disable forwarding. Click file on the upper menu bar; click load code or data; you should see test2.s in a menu; click test2.s; click select; click load; info box should appear asking to reset DLX; click yes. Step through the program using F7 until info box appear stating trap #0 occured. Note register stalls between lw and first addi, and between first and second addi's. You should see that these four instructions take 12 cycles. (The trap invokes the OS starting on cycle 12.) How can the lw's WB and the first addi's ID overlap? The lw writes the results into R1 in the first phase of cycle 5, and the addi stalls until it can read R1 in the second phase of cycle 5. Click configuration on the upper menu bar; enable forwarding; an info box should ask if you want to reset DLX; click yes. Step through the program using F7 until info box appear stating trap #0 occured. Note a register stall between lw and first addi but not between the first and second addi's. Explain. There is a load-use penalty of one for the lw and first addi pair, since the loaded value is not available for forwarding until the end of the MEM stage. How many cycles are saved by the use of forwarding for the program? Three. 1. Run the following program on WinDLX. .data a: .word 1,2,3,4 b: .word 0,0,0,0 .text .global main main: addi r1,r0,a addi r3,r0,b addi r4,r0,4 loop: lw r2,0(r1) addi r2,r2,1 sw 0(r3),r2 addi r1,r1,4 addi r3,r3,4 subi r4,r4,1 seqi r5,r4,0 beqz r5,loop finish: trap 0 The program calculates b[i] = a[i] + 1 for i = 0 to 3 (a) Compare the cycles required to run this program with and without forwarding. (b) For the beqz instruction, in what stage is the branch target address calculated? How do you know this? (c) For the beqz instruction, in what stage is the branch decision (taken or untaken) made? How do you know this? (d) Why is there a register stall between the seqi and beqz even with forwarding enabled? (e) Explain the forwarding required to prevent a stall between the addi r2,r2,1 and the sw 0(r3),r2. (f) Can you reorder the instructions in a way that preserves program correctness to reduce the cycle count even in the case of forwarding enabled? If so, show the reordered program and give the cycle count. Example Consider a one-bit history for branch prediction. It records the state of the last branch as taken (T) or untaken (U) and predicts the next branch will be the same. Assume the bit is initialized to U. Determine the prediction accuracy on the following branch trace; include all trace entries in your calculation. T T T T U T T T T U T T T T U T T T T U T T T T U ^ ^ ^ ^ ^ (carets mark entry into loop) trace: T T T T U T T T T U T T T T U T T T T U T T T T U history: U T T T T U T T T T U T T T T U T T T T U T T T T prediction? M M M M M M M M M M There are two mispredicts per loop execution. 2. Download the program http://www.cs.clemson.edu/~mark/330/predictors.c Run the predictors with these three input files: trace1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 -1 trace2 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 -1 trace3 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 -1 (a) Draw the state diagram for the encoding 0x3127b. (b) From comparing the results of trace1 and trace2, what is the training time of the form of gshare implemented in the program? Explain your reasoning. (c) For trace3, draw the state diagram for the encoding 0x81000. (d) For trace3, explain how the eight gshare PHT entries end up as they do.