Clemson University -- CPSC 231 -- Fall 2009

UltraSPARC Visual Instruction Set (VIS)


SIMD - single-instruction multiple data

  SIMD was originally defined in the 1960s as a category of multiprocessor
  with one control unit and multiple processing elements - each instruction
  is executed by all processing elements on different data streams.

  Today the term is used to describe partitionable ALUs in which multiple
  operands can fit in a fixed-width register and are acted upon in parallel.
  (other terms include subword parallelism and short vector extensions)

  Subword parallelism's earliest use may be in the Lincoln Labs TX-2 in
  the late 1950s:

    "The structure of the arithmetic element can be altered under program
    control. Each instruction specifies a particular form of machine in
    which to operate, ranging from a full 36-bit computer to four 9-bit
    computers with many variations. Not only is such a scheme able to make
    more efficient use of the memory in storing data of various word
    lengths, but it also can be expected to result in greater over-all
    machine speed because of the increased parallelism of operation."

  graphics support was added to general-purpose microprocessors starting
  in the late 1980s

    Intel i860 (1989) added three packed graphics data types (e.g., eight
      1-byte pixel values per 64-bit word) and a special graphics function
      unit (e.g., z-buffer interpolation)

    Motorola 88110 (1991) included six graphics data types and performed
      saturating arithmetic

  large number of instruction set extensions in mid- to late-1990s
    * HP MAX (1994) - Media Acceleration Extensions (in PA 7100LC)
    * SPARC VIS (1995) - Visual Instruction Set
    * Alpha MVI (1996) - Motion Video Instructions
    * Intel MMX (1996) - Multimedia Extension
    * MIPS MDMX (1996) - MIPS Digital Media Extensions
    * AMD 3DNow! (1998)
    * PowerPC AltiVec (1998)
    * Intel SSE (1999) - Streaming SIMD Extension
    * Intel SSE2 (2001)
    * Intel SSE3 (2004)

  also multimedia processors in 1990s
    * Microunity Mediaprocessor
    * Chromatic Mpact
    * Philips Trimedia

  now GPUs (sometimes called stream processors)
    * Nvidia
    * ATI (now owned by AMD)


SPARC

  VIS 1.0 (1995)
    * 64-bit datapath, partitionable to 8x8-bits, 4x16-bits, or 2x32-bits
      (note that SSE and AltiVec use 128-bit datapaths)
    * uses the floating-point registers, but the operations are
      integer/fixed-point
    * 80+ new instructions

  VIS 2.0 (2000)
    * more data shuffling instructions

  VIS 3.0 (will ship on SPARC Rock)
    * supports packed floating-point

  Solaris mediaLib routines use VIS and are callable from C, C++, and Java


example

  representation of a pixel - pack RGB components into a 32-bit word

    <8 bits> <8 bits> <8 bits> <8 bits>
    00000000 rrrrrrrr gggggggg bbbbbbbb
     (pad)     red     green     blue

  C code to add two pixels

    int component_add( int a, int b, int offset ){
      int a_color, b_color, c_color;
      a_color = ( a >> offset ) & 0xff;
      b_color = ( b >> offset ) & 0xff;
      c_color = a_color + b_color;
      if( c_color > 255 ) c_color = 255;   /* clamp */
      return( c_color << offset );
    }

    int pixel_add( int a, int b ){
      return( component_add(a,b,16)    /* red   */
            | component_add(a,b, 8)    /* green */
            | component_add(a,b, 0)    /* blue  */
            );
    }

  VIS assembly code to do same (two versions)

    /* vis_add1 - pass parameters by value */

            .global vis_add1
    vis_add1:
            save    %sp, -104, %sp
            wr      %g0, 24, %gsr      /* set GSR scale factor field to 3    */
                                       /*   (24 = 3 << 3), used by fpack16   */
            st      %i0, [%fp-4]       /* SPARC requires memory shuffle for  */
            ld      [%fp-4], %f2       /*   int register to fp register      */
            fexpand %f2, %f4           /* f2 -> four 16-bit fields in f4,f5  */
            st      %i1, [%fp-8]       /* SPARC requires memory shuffle for  */
            ld      [%fp-8], %f3       /*   int register to fp register      */
            fexpand %f3, %f6           /* f3 -> four 16-bit fields in f6,f7  */
            fpadd16 %f4, %f6, %f8      /* f8,9 <- f4,5 + f6,7                */
            fpack16 %f8, %f2           /* f8,9 -> four 8-bit fields in f2    */
            st      %f2, [%fp-4]       /* SPARC requires memory shuffle for  */
            ld      [%fp-4], %i0       /*   fp register to int register      */
            ret
            restore

    /* vis_add2 - pass parameters by reference */

            .global vis_add2
    vis_add2:
            wr      %g0, 24, %gsr      /* set GSR scale factor field to 3    */
                                       /*   (24 = 3 << 3), used by fpack16   */
            ld      [%o0], %f2
            fexpand %f2, %f4           /* f2 -> four 16-bit fields in f4,f5  */
            ld      [%o1], %f3
            fexpand %f3, %f6           /* f3 -> four 16-bit fields in f6,f7  */
            fpadd16 %f4, %f6, %f8      /* f8,9 <- f4,5 + f6,7                */
            fpack16 %f8, %f2           /* f8,9 -> four 8-bit fields in f2    */
            st      %f2, [%o2]         /*   with each clamped to 255         */
            retl
            nop

  driver program

    #include <stdio.h>

    int pixel_add( int, int );
    int vis_add1( int, int );
    void vis_add2( int *, int *, int * );  /* src1,src2,dest by ref */

    int main(){
      int a = ( 1 << 16 ) | ( 2 << 8 ) | 200;
      int b = ( 4 << 16 ) | ( 5 << 8 ) | 200;
      printf("addend a is 0x%08x\n",a);
      printf("addend b is 0x%08x\n",b);
      printf(" ----------\n");
      printf("pixel_add returns 0x%08x\n",pixel_add(a,b));
      printf("vis_add1 returns 0x%08x\n",vis_add1(a,b));
      vis_add2(&a,&b,&a);
      printf("vis_add2 returns 0x%08x\n",a);
      return 0;
    }

  output

    addend a is 0x000102c8
    addend b is 0x000405c8
     ----------
    pixel_add returns 0x000507ff
    vis_add1 returns 0x000507ff
    vis_add2 returns 0x000507ff

  note: compile vis programs using gcc -mcpu=ultrasparc


fexpand - each of four 8-bit components in a given fp register is
  left-shifted by four bits and inserted into a separate 16-bit field
  in an fp register pair to allow for computations

              7      0
             +--------+
             | 8-bit  |
             +--------+
            /        /
    +------------------+
    |0000|        |0000|
    +------------------+

fpadd16 - performs four 16-bit adds in parallel using fp register pairs

fpack16 - field in GSR contains scale factor
  - value 3 (as set by writing 24 to %gsr above; the field starts at
    bit 3) means no change to values produced by fexpand
  - each of four 16-bit components in fp register pair is left-shifted
    by the scale factor, then converted to an 8-bit component using
    bits 7 to 14, clamped to the range 0-255; the four resulting 8-bit
    components are packed into another fp register

         15              0
        +----------------+
        |16-bit component|
        +----------------+
           /            /
    +------------------------+
    |sgnext|  |      |  |00|
    +---------^------^-------+
             14      7
           extract byte
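
portable C model of the three instructions

  The fexpand / fpadd16 / fpack16 sequence described above can be sketched
  in plain C to check the arithmetic by hand. This is only a software model
  written for these notes, not VIS code and not a mediaLib API: the names
  sim_fexpand, sim_fpadd16, sim_fpack16, and sim_pixel_add are hypothetical,
  and the model assumes a scale factor of 3 (the value placed in the GSR
  field by writing 24 to %gsr) and that byte extraction uses bits 7 to 14.

```c
#include <stdint.h>

/* Model of fexpand: split a 32-bit word into four 8-bit fields and
   place each one, shifted left by 4, into its own 16-bit field.
   Field 0 is the most significant byte (the pad byte of a pixel). */
static void sim_fexpand(uint32_t src, uint16_t dst[4])
{
    for (int i = 0; i < 4; i++) {
        uint8_t byte = (uint8_t)(src >> (24 - 8 * i));
        dst[i] = (uint16_t)(byte << 4);
    }
}

/* Model of fpadd16: four independent 16-bit adds (wrapping modulo 2^16). */
static void sim_fpadd16(const uint16_t a[4], const uint16_t b[4],
                        uint16_t sum[4])
{
    for (int i = 0; i < 4; i++)
        sum[i] = (uint16_t)(a[i] + b[i]);
}

/* Model of fpack16: treat each 16-bit field as signed, shift it left by
   the scale factor without losing the overflow, take the integer that
   starts at bit 7, and clamp it to 0..255. */
static uint32_t sim_fpack16(const uint16_t src[4], unsigned scale)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        /* sign-interpret the 16-bit field using arithmetic only */
        int32_t v = (src[i] & 0x8000) ? (int32_t)src[i] - 65536
                                      : (int32_t)src[i];
        int32_t scaled = v * (int32_t)(1 << scale);
        int32_t byte = (scaled < 0) ? 0 : (scaled >> 7);
        if (byte > 255)
            byte = 255;                      /* clamp */
        out = (out << 8) | (uint32_t)byte;
    }
    return out;
}

/* Same job as pixel_add / vis_add1, using the model above.
   Assumption: scale factor 3, matching wr %g0, 24, %gsr. */
uint32_t sim_pixel_add(uint32_t a, uint32_t b)
{
    uint16_t fa[4], fb[4], fsum[4];
    sim_fexpand(a, fa);
    sim_fexpand(b, fb);
    sim_fpadd16(fa, fb, fsum);
    return sim_fpack16(fsum, 3);
}
```

  Tracing the blue component of the driver's addends: 200 expands to 3200,
  the sum is 6400, shifting left by 3 gives 51200, and 51200 >> 7 is 400,
  which clamps to 255. Calling sim_pixel_add(0x000102c8, 0x000405c8)
  therefore gives 0x000507ff, the same value the driver prints. Note that
  the pad byte is added and packed as well, unlike in the C pixel_add.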