Clemson University
CPSC 464/664 Lecture Notes
Fall 2003
Mark Smotherman


Chapter 2 Lecture Outline

  1. ISA - Instruction Set Architecture - visible to compiler and low-level programmer

  2. ISA measures
    1. static, e.g., program size
    2. dynamic, e.g., number of instructions executed
    3. measures depend on compiler and program
    4. dynamic measures estimated by hand or captured by simulator or hardware performance counters

  3. Instruction length
    1. fixed-length - one word per instruction (RISC)
    2. slightly variable - 1, 2, or 3 words
    3. variable-length - some number of bytes
    4. highly-encoded and variable-length instructions (CISC) save memory fetch traffic but complicate decode and execution (thus some implementations will predecode instructions as they come from memory and store them in a decoded form in the instruction cache)

  4. Instruction fields
    1. opcode - operation code (e.g., add, subtract)
    2. number of operands - usually fixed but the opcode can indicate the number of explicit operand fields
    3. operand location - can be implicit in opcode (e.g., stack), explicit register name, or explicit address phrase
    4. operand type specification - usually part of the opcode (but may reside with data in the form of a type tag)
    5. operand length specification (for multiple elements, e.g., character strings) - given in an instruction field (direct) or in a register (indirect)

  5. Operand storage in CPU (faster access and implicit or short addressing)
    1. stack (e.g., Burroughs mainframes, HP3000, x86 FP, recently Harris RTX)
    2. accumulator (e.g., IAS, IBM 7094, early microprocessors - requires minimum hardware)
    3. set of general purpose registers (e.g., IBM S/360, PDP-11, VAX, RISC)

  6. Number of explicit operands in instruction
    1. none => stack architecture; stack is source for both operands as well as destination
    2. one => accumulator-based architecture; ACC is one source as well as destination
    3. two => either gpr or mem-mem; one operand is both a source and the destination
    4. three => either gpr (RISC) or mem-mem

  7. Number of memory operands in ALU operation instruction
    1. none => load/store architecture, reg-reg operations (RISC)
    2. one => 1 1/2 address architecture, reg-mem operations
    3. two or three => mem-mem operations

  8. Memory addressing issues
    1. address space - any addressable region or name space
    2. names are integers => ordering and distance (operation on one yields another, e.g., indexing)
    3. usually linear structure (note: paging is linear, while segmentation is two-dimensional)
    4. resolution (bit/byte/word as addressable unit) - tradeoff range for resolution
    5. byte order within multibyte word or structure
      1. big-endian (e.g., IBM S/360, MC68000) - L-to-R, top-to-bottom
      2. little-endian (e.g., VAX, Intel 80x86) - in Intel's case, due to legacy of serial arithmetic in 4004 (i.e., access least significant digits first in order to develop carries)
      3. example program that shows the difference
    6. operand alignment
      1. language issues: COBOL require fields be contiguous within a record due to REDEFINES statement; FORTRAN similar has a problem with COMMON blocks
      2. hardware support for unaligned access may require two memory access per operand and an alignment network; increases cycle time for all instructions
      3. RISC approach is to invoke software trap handler on unaligned access
      4. example program that shows the difference

  9. Address phrase
    1. addressing mode specification may be an explicit field (e.g., VAX)
    2. address modifiers (e.g., base register, index register, indirection, auto-increment, scaling)
    3. address composition by specifying active regions of memory
      1. PC, base register, or segment register defines base of region (e.g., IBM S/360, Intel 8086)
      2. register contents either added to displacement or used as high bits of address
      3. one active region per access type (e.g., Intel 8086) or many (e.g., IBM S/360)

  10. Addressing modes
    1. fixed set of addition, scaling, and indirection steps taken at execution time
    2. derived from an underlying model of a data structure
    3. Clark and Emer reported 4 modes (register, displacement, immediate, and register indirect) accounted for 93% of all data accesses on the VAX
    4. special purpose DSP modes - circular buffer, bit reverse (FFT)

  11. Operation types (movement, arithmetic, logic, etc.)

  12. SIMD instruction set extensions
    1. SIMD originally defined in 1960s as category of multiprocessor with one control unit and multiple processing elements - each instruction is executed by all processing elements on different data streams, e.g., Illiac IV
    2. today is used to describe partitionable ALUs in which multiple operands can fit in a fixed-width register and are acted upon in parallel
      • actually Illiac IV also provided this and called it "dual arithmetic"
      • earliest use may be in Lincoln Labs TX-2 in late 1950s:
              The structure of the arithmetic element can be altered under program
              control.  Each instruction specifies a particular form of machine in
              which to operate, ranging from a full 36-bit computer to four 9-bit
              computers with many variations.  Not only is such a scheme able to
              make more efficient use of the memory in storing data of various word
              lengths, but it also can be expected to result in greater over-all
              machine speed because of the increased parallelism of operation.
        
              Peak operating rates must then be referred to particular configurations.
              For addition and multiplication, these peak rates are given in the
              following table:
        
                         PEAK OPERATING SPEEDS OF TX-2
                 Word Lengths      Additions    Multiplications
                  (in bits)        per second     per second
                      36             150,000        80,000
                      18             300,000       240,000
                       9             600,000       600,000
              
    3. Intel i860 (ca. 1989) added three packed graphics data types (e.g., eight 1-byte pixel values per 64-bit word) and a special graphics function unit (e.g., z-buffer interpolation)
    4. Motorola 88110 (ca. 1991) included six graphics data types and performed saturating arithmetic
    5. Intel - MMX, SSE, SSE2

  13. Control flow operations
    1. conditional branches ( see also)
      1. decision
        1. condition code - set by compare or as side effect of ALU op, kept in PSW or special reg.
        2. compare and branch - single instruction but may need multiple operand specifiers as well as a branch target address phrase; frequently compare is equal/not-equal 0
        3. condition set in general register - avoids serialized resource of single CC
        4. 80% of compares use immediate value; 64% use 0; 50% use ==0 or !=0
      2. target address - direct addressing, register indirect, or small displacement relative to PC (which makes code easily relocatable, and 94% of displacements fit within 8 bits)
    2. unconditional jumps
    3. jump and link (merely save return address in register with no memory traffic => RISC)
    4. procedure call (memory traffic with stack and frame pointer updates => CISC) ( see also)
    5. procedure return
    6. interrupt / fault / trap / exception (can be thought of as a special, unplanned procedure call)
    7. return from interrupt / fault / trap / exception (restore PC and PSW)
    8. meta-instructions - repeat, execute

  14. Operating system support
    1. need to protect
      1. CPU - timer prevents infinite loops
      2. memory - bounds registers or access permissions in VM mapping table entries
      3. I/O - special I/O insts. or memory-mapped device registers protected as in (2)
    2. need multiple execution modes (mode bit in PSW)
      1. access allowed to timer, memory mapping, and I/O only in OS mode
      2. instructions can be privileged and/or memory areas can be OS-access only
    3. request to OS made by intentional interrupt - allows entry into OS in a restricted manner ( see also)
    4. interrupt vector contains addresses of trap handlers and other OS entry points
    5. synchronization instructions for multiprocessors - test-and-set, compare-and-swap, load-linked and store-conditional, etc.

  15. Impact of compilers
    1. optimizations
      1. high-level - procedure inlining, loop unrolling
      2. local - within basic block - CSE, constant propagation
      3. global - across branches - CSE, copy propagation, loop invariants, array accessing in loops
      4. machine-dependent - strength reduction, pipeline scheduling
    2. register allocation (text says this is the most important optimization)
    3. storage allocation
      1. stack (activation records) - primarily scalars, register allocation very effective
      2. global - statically allocated objects, including arrays
      3. heap - dynamically allocated structures accessed via pointers
      4. alias problems
    4. code generation
    5. principle: ISA should provide primitives not solutions
      1. high level constructs often produce "semantic clash" with multiple languages, so that too much content in one instruction leads to infrequent usage
      2. breaking up constructs into smaller units increases the chance of reuse (e.g., operand or CSE remains in register) and the ability of the compiler to optimally reorder the units

  16. Case study: IBM S/360 instruction set (ca. 1964)
    1. "360 degrees of data processing" - scientific and commercial family of computers that became dominant mainframe architecture
    2. see "Architecture of the IBM System/360," Amdahl, Blaauw, and Brooks, 1964
    3. three instruction lengths - RR 16 bits, RX/RS/SI 32 bits, SS 48 bits
    4. two-operand formats - RR two registers, SS two memory operands, etc.
    5. sixteen 32-bit general registers, four 64-bit floating-point registers
    6. memory addressing uses base register contents plus 12-bit displacement
    7. load address instruction; no memory indirect addressing mode
    8. condition codes in PSW for branching (encoded in two bits)
    9. fast loop closing branches: bxle, bxh
    10. subroutine calls save return address in a register
    11. two execution modes
    12. see Appendix F - The IBM 360/370 Architecture for Mainframe Computers (PDF)
    13. 360/370 (1964) - 24-bit addresses; 370/XA (1983) - 31-bit addresses; zSeries (2000) - 64-bit addresses

  17. Case study: PDP-11 instruction set (ca. 1970)
    1. extremely popular 16-bit minicomputer family
    2. see "A New Architecture for Minicomputers - The DEC PDP-11," C. Gordon Bell, et al., 1970
    3. instruction lengths are one, two, or three 16-bit words
    4. eight 16-bit general registers; R6 is SP and R7 is PC
    5. two-operand formats
    6. multiple addressing modes, including autoincrement and autodecrement (which provide support for stack push and pop, and for array indexing)
    7. condition codes in PSW for branching (NZVC)
    8. fast loop closing branch: subtract one and branch (on nonzero)
    9. subroutine calls save return address on a memory stack
    10. two execution modes
    11. "What we learned from the PDP-11,", C. Gordon Bell and William Strecker, 1975
    12. "Retrospective on the PDP-11," William Strecker, 1995 (contains additional retrospectives about VAX and Alpha)

  18. Case study: DEC VAX instruction set (ca. 1978)
    1. 32-bit follow-up to PDP-11, became extremely popular superminicomputer
      1. protypical CISC
      2. single instructions exist for polynomial evaluation, case/switch, queue insert/delete, etc.
    2. see "VAX-11/780: A Virtual Address Extension to the DEC PDP-11 Family," William Strecker, 1978
    3. data type compatibility with PDP-11 but virtual address extension to 32 bits
    4. highly-variable instruction format with one to six operands
    5. two-operand and three-operand formats
    6. sixteen 32-bit general registers; R12 is AP, R13 is FP, R14 is SP, R15 is PC
    7. sixteen addressing modes similar to PDP-11, plus a prefix byte for index value scaling
    8. condition codes in PSW for branching (NZVC)
    9. multiple fast loop closing branches
    10. subroutine calls save return address on a memory stack
    11. four execution modes
    12. see Appendix E - Another Alternative to RISC: The VAX Architecture (PDF)

  19. Case study: Intel x86 instruction set (ca. 1978)
    1. family of architectures/extensions
      1. 8086 in 1978, 286 in 1982, 386 in 1985
      2. 486, Pentium, Pentium Pro/II/III, and Pentium 4 same basic inst. set
      3. MMX, SSE, SSE2 are SIMD inst. set extensions
    2. variable-length instructions (by byte)
    3. 8086 - extended accumulator architecture; 386 - more of eight general register architecture; FPU adds an eight-entry floating-point register stack; SSE2 extension also adds eight 128-bit registers
    4. two-operand formats
    5. multiple addressing modes (more regular in 386)
    6. condition flags for branching
    7. fast loop closing branches
    8. subroutine calls save return address on a memory stack
    9. 8086 - no protection; 286 - four execution modes
    10. see Appendix D - An Alternative to RISC: The Intel 80x86 (PDF)

  20. Case study: Motorola 68000 instruction set (ca. 1980)
    1. popular 16-bit microcomputer
    2. variable length instructions (by 16-bit word)
    3. eight 32-bit data registers and eight 32-bit address registers; FPU adds eight floating-point registers
    4. two-operand formats
    5. multiple addressing modes, includes some very complicated addressing modes
    6. condition codes in PSW for branching (NZVC)
    7. fast loop closing branches - decrement and branch
    8. subroutine calls save return address on a memory stack
    9. two execution modes
    10. ColdFire - derivative family of embedded processors - implements only streamlined subset

  21. Appendix C - A Survey of RISC Architectures for Desktop, Server, and Embedded Computers (PDF)

  22. Case study: MIPS instruction set (ca. 1986) - see section 2.12
    1. prototypical RISC
    2. fixed-length instructions
    3. load/store architecture, three-register format for ALU operations
    4. 32 integer registers and 16 floating-point registers (r0 always contains a 0)
    5. one addressing mode - base register plus signed 16-bit offset
    6. compare and branch (no condition codes)
    7. subroutine calls save return address in a register
    8. two execution modes

  23. Case study: SPARC instruction set (ca. 1986)
    1. Scalable Processor Architecture (descendant of Berkeley RISC)
    2. see "The Design and Development of SPARC," David Patterson and Wayne Rosing, 1989
    3. fixed-length instructions
    4. load/store architecture, three-register format for ALU operations
    5. 32 integer registers visible at one time, using register windows
      1. 8 global, 8 ins, 8 local, and 8 outs (%g0 always contains a 0)
      2. switch windows by explicit save and restore instructions, trap into OS on underflow or overflow
      3. out registers of previous window overlap in registers of current window in order to pass parameters
    6. 32 floating-point registers
    7. two addressing modes - register plus register, and register plus signed 13-bit offset
    8. condition codes in PSW for branching (NZVC)
    9. subroutine calls use register windows
    10. two execution modes

  24. Case study: Instruction set extensions in SPARC v.9 (ca. 1992)
    1. see "SPARC v9: Adding 64-bit Addressing & Robustness to an Existing RISC Architecture," David Ditzel, 1992
    2. see also two webcasts by Bill Joy: first part and second part
    3. integer registers and virtual addresses extended to 64 bits
    4. code compatible with Version 8 (low 32 bits always same as V8 result)
    5. integer register windows are more flexible
      1. the kernel can allocate one window per thread (for no-overhead context switch)
      2. clean registers are guaranteed to be zero on first use - provides security
      3. an additional register set is used as globals for trap handling
    6. floating point registers are extended to 64 bits, with lock bits to reduce register save/restore on context switch (an idea from IBM RS/6000)
    7. 128-bit quad precision floating point format is supported
    8. dual set of integer condition codes (one for 32-bit results, one for 64-bit); four sets of floating point condition codes
    9. branch on register value (to eliminate condition code setting and testing)
    10. either endian
    11. unaligned loads for better FORTRAN support
    12. speculative, non-faulting loads -- for code such as if(p!=NULL) x = *p; the compiler can move the load of *p prior to the condition test; the load will return 0 if there is an addressing error
    13. conditional move combined with pointer alias detection: the difference of two pointers can be stored in an integer register and FMOVRZ will conditionally move a floating point register based on the result; this allows the compiler to move loads up above stores and then check for an alias and correct
    14. relaxed memory order (RMO) and new barrier instruction for forcing order
    15. compare and swap for synchronization support
    16. prefetch instruction for data and for instructions
    17. hint bits in branch opcodes so compiler can help predict branches

  25. Case study: Instruction set extensions in UltraSPARC-I (ca. 1995)
    1. adds VIS (visual instruction set) for multimedia
    2. block load/store
    3. additional register sets to be used as globals for TLB miss, interrupts
    4. four page sizes (8K, 16K, 512K, 4M) to reduce TLB misses
    5. performance counters (like Pentium)
    6. shutdown to reduce power (30W to 20mW)

  26. Case study: Alpha instruction set (ca. 1992)
    1. probably cleanest RISC, designed as 64-bit architecture from start
    2. fixed-length instructions
    3. load/store architecture, three-register format for ALU operations
    4. 32 integer registers and 32 floating-point registers (r31 and f31 always contain 0)
    5. one addressing mode - base register plus signed 16-bit offset
    6. compare and branch (no condition codes)
    7. subroutine calls save return address in a register
    8. see "Alpha Architecture," Dick Sites and Dirk Meyer, 1992
    9. two execution modes

  27. Case study: PowerPC instruction set (ca. 1992)
    1. derived from IBM POWER = "Performance Optimized With Enhanced Risc"
    2. fixed-length instructions
    3. load/store architecture, three-register format for ALU operations (but includes load/store multiple)
    4. 32 integer registers, 32 floating-point registers, and link and count registers (used by branch unit), AltiVec adds 32 vector registers
    5. four addressing modes - two include register update (autoincrement)
    6. condition register holds eight 4-bit condition code fields
    7. subroutine calls save return address in a register
    8. see "Evolution of Power PC Architecture," Richard Oehler, 1992
    9. extended with AltiVec instruction set

  28. Case study: PA-RISC instruction set (v1.0 1987 / v2.0 1992)
    1. see "H-P Precision Architecture," Michael Mahon, 1987
    2. see "PA-RISC Design Issues," Michael Mahon, 1992

  29. Case study: Itanium instruction set (ca. 2001)
    1. (see section 4.7)
    2. EPIC architecture (explicitly parallel instruction computing)
    3. three 41-bit instruction syllables packaged together in a 128-bit bundle, along with 5-bit template; implicit and explicit stops to indicate dependency boundaries
    4. load/store architecture, three-register format for ALU operations
    5. 128 integer registers, 128 floating-point registers, 64 predicate registers, 8 branch registers, 128 application registers (special purpose and control)
    6. memory addressing uses contents of single register, with optional update by another register or immediate value
    7. predicate registers used to make an unconditional branch into a conditional branch
    8. subroutine calls use register stack
    9. two execution modes
    10. Itanium tutorial page
    11. Intel software tutorials (Itanium and P4)
    12. HP page

  30. Commercial workload characteristics
    1. large number of concurrent users with high process switch rate
    2. large proportion of instructions executed within OS
    3. fewer loop iterations and larger proportion of non-looping branches

  31. Object-oriented workload characteristics
    1. Calder, Grunwald, and Zorn reported that C++ programs were 3 times as call intensive as C
    2. Flynn estimates 10 instructions are executed for a call for local stack adjustment, parameter passing, and register saves/restores


Key Points


[Course home page] [Mark's homepage] [CPSC homepage] [Clemson Univ. homepage]

mark@cs.clemson.edu