CPSC 330 - Spring 2009 Study Guide for Final Exam The final will be approximately 75% cumulative through Exam 2, and approximately 25% on parallel processing and Homework 6 material such as Amdahl's Law. 1. Be able to define or match these terms Amdahl's Law MIPS MFLOPS coarse-grained parallelism fine-grained parallelism compute-dominant workload communication-dominant workload embarrassingly parallel multiprocessor multithreading simultaneous multithreading (SMT) multicore many-core Flynn notation (SISD, SIMD, MIMD) single-instruction-multiple-data (SIMD) - muliple processor units operating in lockstep - subword parallelism (partitioned ALU) vector processor shared memory distributed memory symmetric multiprocessor (SMP) non-uniform memory access (NUMA) cache-coherent non-uniform memory access (CC-NUMA) massively parallel processing (MPP) cache coherency bus snooping MESI protocol write-update write-invalidate false sharing directory-based coherency (non-broadcast) memory consistency write buffer message passing cluster autotuner roofline model UC Berkeley dwarf mine (common kernels / programming models) 2. Be able to: A. Show how Amdahl's Law is derived. B. Calculate overall speedup using Amdahl's Law, or the fraction of the original program execution time that must be enhanced in order to meet an overall speedup goal given a particular enhancement. C. Explain how Amdahl's Law applies to parallel systems. D. Give the two formulae for MIPS. E. Calculate MIPS or MFLOPS. Example questions 1. Explain the difference between the measures of MIPS and MFLOPS. 2. If you knew a computer system was rated as 1000 MIPS, what would you estimate to be the MFLOPS and why? 3. Give Amdahl's Law (that is, the overall speedup in terms of f and s). 4. Consider an enhancement to a computer system that provides a three times speedup when in use. What is the overall speedup if the enhancement can only be used for one half of a program's normal execution time? 5. a) Find the execution time for a program that executes 1 billion instructions on a processor with an avg. CPI of 3.0 and a clock cycle time of 33.3 nsec. b) What is the MIPS value for the processor and program in part (a)? 6. For the following instruction set workload and cycle values, find the average CPI. If the clock rate is 800 MHz, what is the MIPS value? type | freq cycles -------+-------------- alu | 0.5 2 branch | 0.2 6 ld/st | 0.3 6 7. Which of the following loops is "embarrassingly parallel"? Explain your answer. (a) for( i = 0; i < N; i++ ){ a[ i ] = b[ i ] + c [ i ]; } (b) a[ 0 ] = c [ 0 ]; for( i = 1; i < N; i++ ){ a[ i ] = a[ i-1 ] + c [ i ]; } 8. Explain what the term "false sharing" means and how it can arise in a cache-coherent multiprocessor system. What can be done to eliminate false sharing? 9. Consider the parallel loop for( i = 0; i < 100; i++ ){ a[ i ] = b[ i ] + c [ i ]; } If you had a two-processor SMP with an invalidation-based cache coherency protocol, would you assign (a) one processor i = 0-49 and the other i = 50-99, or (b) one processor the even values of i and the other the odd values, or does it matter? Explain your answer. 10. Explain how weak memory consistency in a shared-memory multiprocessor can allow the following program to print 0 and 0. initially, a = 0 and b = 0 thread 1 thread 2 --------- --------- store a=1 store b=1 print b print a 11. MMX and SSE instruction set extensions are called SIMD. Define the term and explain what it means for current Intel and AMD processors. 12. Explain how a cluster differs from an SMP in terms of ld/st memory accesses and processor interconnections. Be prepared to work problems as given in homework 6.