CPSC 330 - Fall 2008 Project 2 - Matrix Multiplication Optimization Paper due at class time on November 14. You may work individually or in teams of up to three. If you discover that a teammate is not pulling his weight, fire him and inform me. You should prepare a five page (plus or minus) analysis report for a case study of optimizing dgemm.c or similar matrix multiply code for C = A*B. You should pick a single platform and test different optimizations for double-precision square matrix multiplies from size 100 to 1000 in steps of 100. You can ask other faculty, use papers or software from the web, etc., in preparing your report. ANY RESOURCE YOU USE MUST BE CITED. (I am aware of web-available student reports that you will discover when you search the web, e.g., http://wwwcsif.cs.ucdavis.edu/~koike/ecs289k/projecta/, which you can use if you cite. Any uncited use of resources such as these constitutes plagiarism.) Your paper should be structured in the following manner: 1. Introduction stores, fp adds, fp muls, etc.> 2. Platform Used for Experiments 3. Performance of Unoptimized Matrix Multiply 4. Performance of Optimizations 5. Conclusion References This is an open-ended project, and you will be graded on both the number of optimizations you try but also on the how you use the available resources to guide your selection and testing of optimizations. Blindly applying numerous optimizations will receive less credit than a careful discussion and judicious selection of fewer optimizations. Possible references - this is only a sample, and you are not required to use any of these AMD Core Math Library (ACML) http://developer.amd.com/cpu/libraries/acml/Pages/default.aspx ACML provides a free set of thoroughly optimized and threaded math routines for HPC, scientific, engineering and related compute-intensive applications. ACML is ideal for weather modeling, computational fluid dynamics, financial analysis, oil and gas applications and more. ACML consists of the following main components: * A full implementation of Level 1, 2 and 3 Basic Linear Algebra Subroutines (BLAS), with key routines optimized for high performance on AMD Opteron processors. Automatically Tuned Linear Algebra Software (ATLAS) http://math-atlas.sourceforge.net/ The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK. B. Greer and G. Henry "High Performance Software on Intel Pentium Pro Processors or Micro-Ops to TeraFLOPS," Supercomputing, Nov. 1997. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1592627] Abstract: We give a technical discussion of the Intel Pentium Pro processor and optimization strategies used to achieve high performance on scientific applications. We demonstrate these optimizations by characterizing matrix multiplication (DGEMM). We give insight and a model into our efforts on obtaining the world's first TeraFLOP MP LINPACK run (on the Intel ASCI Option Red Supercomputer), based on Pentium Pro processor technology. The importance is carried by the increasing trend of commodity parts in the supercomputing arena. John McCalpin and Mark Smotherman, "Automatic benchmark generation for cache optimization of matrix operations," ACM Southeast Regional Conference, March 1995. [paper: http://portal.acm.org/citation.cfm?id=1122054 software: http://www.cs.clemson.edu/~mark/464/mmgen4.c] Abstract: Computationally intensive algorithms must usually be restructured to make the best use of cache memory in current high- performance, hierarchical memory computers. Unfortunately, cache conscious algorithms are sensitive to object sizes and addresses as well as the details of the cache and translation lookaside buffer geometries, and this sensitivity makes both automatic restructuring and hand-turning difficult tasks. An optimization approach is presented in this paper that automatically generates and executes a benchmark program from a concise specification of the algorithm's structure. This technique provides the performance data needed for verification of code generation heuristics or search among the various restructuring options. Matrix transpose and matrix multiplication are examined using this approach for several workstations with restructuring options of loop order, tiling (blocking), and unrolling. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply http://www.icsi.berkeley.edu/~bilmes/phipac/ Abstract: BLAS3 matrix-matrix operations usually have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and/or compiler to achieve near peak performance. We have developed a methodology whereby near-peak performance on such routines can be achieved automatically. First, rather than code by hand, we produce parameterized code generators whose parameters are germane to the resulting machine performance. Second, the generated code follows the PHiPAC (Portable High Performance Ansi C) coding suggestions that include manual loop unrolling, explicit removal of unnecessary dependencies in code blocks (if not removed, C semantics would prohibit many optimizations), and use of machine sympathetic C constructs. Third, we develop search scripts that, for a given code generator, find the best set of parameters for a given architecture/compiler. We have developed a BLAS-GEMM compatible multi-level cache-blocked matrix-matrix multiply code generator that has achieved performance around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, SGI Octane R10k, and 80% on the SGI Indigo R4k. On the IBM, HP, SGI R4k, and the Sun Ultra-170, the resulting DGEMM is, in fact, faster than the GEMM in the vendor-optimized BLAS GEMM. Other generators, search scripts, and performance results are under development.