Clemson University -- CPSC 231 -- Fall 2009 Real Numbers range -- need to deal with larger range [not just -2^(n-1) to +2^(n-1)-1] - large numbers - small numbers (fractions) von Neumann recommended using fixed point with the programmer doing the scaling (see /home/mark/231/lecture.notes/scale.txt for examples of writing programs using scaling and the preplanning required) fixed point = integer part fraction part use positional representation to convert a binary fraction to a decimal value . d_-1 d_-2 d_-3 ... = sum over i [ d_i * 2^i ] .011 (base 2) = 0 * 2^(-1) + 1 * 2^(-2) + 1 * 2^(-3) = 0 * 1/2 + 1 * 1/4 + 1 * 1/8 = 0 * 0.5 + 1 * 0.25 + 1 * 0.125 = 0 + 0.25 + 0.125 = 0.375 converting a decimal fraction to a binary fraction repeatedly multiply decimal fraction by two, taking the integer part 0 or 1 as bits of the fraction from left to right, stop when fraction is zero e.g., 0.375 (base 10) 0.375 x 2 = 0.75 integer parts are 0, 1, 1 0.75 x 2 = 1.5 binary fraction is .011 0.5 x 2 = 1.0 ^stop note that a terminating decimal fraction may be non-terminating when represented in binary, e.g., 0.1 base 10 = 0.00011001100... base 2 Floating Point automatic scaling by hardware like scientific notation, automatic scaling by hardware requires an exponent field (i.e., the scale factor) and a mantissa (i.e., 1.fraction) - one sign bit - range is governed by the number of bits in the exponent - precision is governed by the number of bits in the mantissa normal form - number represented as 1.fraction times an appropriate exponent +---+-----+----------+ | s | exp | fraction | value = (-1)^s x 1.fraction x 2^exp +---+-----+----------+ exponent uses bias notation so that negative exponents have leading 0s (historical reason is that integer compare operations could be used on this type of floating-point representation) consider a 2-bit exponent using bias 2 (binary 10) zero exp means zero 01 exp means (01 base 2 - bias 10 base 2) = 2^(-1) = 1/2 10 exp means (10 base 2 - bias 10 base 2) = 2^0 = 1 11 exp means (11 base 2 - bias 10 base 2) = 2^1 = 2 and a 2-bit fraction w/ leading 1. hidden for sign=0 (positive numbers): 0 00 00 = 0.0 0 01 00 = 1.00(base 2) x 2^(01 - 10 base 2) = 1.00 x 2^(-1) = 0.500 0 01 01 = 1.01(base 2) x 2^(01 - 10 base 2) = 1.25 x 2^(-1) = 0.625 0 01 10 = 1.10(base 2) x 2^(01 - 10 base 2) = 1.50 x 2^(-1) = 0.750 0 01 11 = 1.11(base 2) x 2^(01 - 10 base 2) = 1.75 x 2^(-1) = 0.875 0 10 00 = 1.00(base 2) x 2^(10-10 base 2) = 1.00 x 2^0 = 1.00 0 10 01 = 1.01(base 2) x 2^(10-10 base 2) = 1.25 x 2^0 = 1.25 0 10 10 = 1.10(base 2) x 2^(10-10 base 2) = 1.50 x 2^0 = 1.50 0 10 11 = 1.11(base 2) x 2^(10-10 base 2) = 1.75 x 2^0 = 1.75 0 11 00 = 1.00(base 2) x 2^(11 - 10 base 2) = 1.00 x 2^1 = 2.0 0 11 01 = 1.01(base 2) x 2^(11 - 10 base 2) = 1.25 x 2^1 = 2.5 0 11 10 = 1.10(base 2) x 2^(11 - 10 base 2) = 1.50 x 2^1 = 3.0 0 11 11 = 1.11(base 2) x 2^(11 - 10 base 2) = 1.75 x 2^1 = 3.5 representing these points on the number line: 2^(-1) 2^0 2^1 _____ ___________ ______________________ / \ / \ / \ 0-------|-|-|-|-*---*---*---*---#-------#-------#-------#----... .5 .75 1 1.25 1.75 2 2.5 3 3.5 .625 .875 1.5 notice the geometric increase in spacing between representable numbers (0.125, then 0.25, then 0.5) precision - how many digits are used to represent a number accuracy - how correct the number is; e.g., 4.56789 is a number with six- digit precision but rather inaccurate as an approximation to pi, while 3 is less precise but more accurate rounding - choose nearest representable neighbor while absolute error in representation increases as the numbers get larger, relative error remains the same, e.g., consider representing numbers one fourth of the distance between the "pickets" ine the "picket fence" above A B 0-------|-|-|-|-*.--*---*---*---#-.-----#-------#-------#----... .5 .75 1 1.25 1.75 2 2.5 3 3.5 A = 1.0625 will be represented by 1.0, so absolute error = 0.0625, relative error = 0.0625/1.0625 = 0.0588 B = 2.125 will be represented by 2.0, so absolute error = 0.125, relative error = 0.125/2.125 = 0.0588 the maximum relative error in nonzero representations will occur for a number that is halfway between pickets, thus for a binary mantissa the relative error will be bounded by 2^(- (number of bits in fraction + 1)) for a number that gets truncated or rounded to zero, the maximum relative error can reach a factor of 1, e.g., if 0.1 is represented as 0 denormal numbers are used for gradual underflow rather than just truncating values less than 2^(minimum exponent) to zero in example above, use 00 exp for smallest exp value (2^(-1)) but with no hidden bit; thus, assign 0 00 01=0.125, 0 00 10=0.250, and 0 00 11=0.375 .5 1 1.5 2 2.5 3 3.5 0-.-.-.-|-|-|-|-*---*---*---*---#-------#-------#-------# ^ ^ ^ | | .375 | .25 .125 a five-bit floating-point format with denormals yields 32 patterns 0 00 00 = zero 1 00 00 = -zero 0 00 01 = 0.125 1 00 01 = -0.125 0 00 10 = 0.25 1 00 10 = -0.25 0 00 11 = 0.375 1 00 11 = -0.375 0 01 00 = 0.5 1 01 00 = -0.5 0 01 01 = 0.625 1 01 01 = -0.625 0 01 10 = 0.75 1 01 10 = -0.75 0 01 11 = 0.875 1 01 11 = -0.875 0 10 00 = 1.0 1 10 00 = -1.0 0 10 01 = 1.25 1 10 01 = -1.25 0 10 10 = 1.5 1 10 10 = -1.5 0 10 11 = 1.75 1 10 11 = -1.75 0 11 00 = 2.0 1 11 00 = -2.0 0 11 01 = 2.5 1 11 01 = -2.5 0 11 10 = 3.0 1 11 10 = -3.0 0 11 11 = 3.5 1 11 11 = -3.5 relative error bound of 1/2^(number of bits in fraction + 1) = 0.125 holds down to the smallest normalized number 2^min_exp = 0.5 This table also shows the encoding of the 2-bit exponent and 2-bit fraction fraction 4 3 2 1 0 00 01 10 11 +---+------+------+ exponent +----------------------------- | s | exp | frac | 00 | 0.0 0.125 0.25 0.375 +---+------+------+ 2^(-1) 01 | 0.5 0.625 0.75 0.875 2^0 10 | 1.0 1.25 1.5 1.75 2^(+1) 11 | 2.0 2.5 3.0 3.5 Operations addition -- shift smaller number to right until exponents are equal, then add 1.5 = 0 10 10 = 1.10x2^0 = 1.10x2^0 +0.875 = + 0 01 11 = + 1.11x2^(-1) = + 0.11x2^0 ------ ---------- 2.375 10.01x2^0 = 1.00x2^1 = 2.0 = 0 11 00 note loss of bits when shifting to match exponents and when truncating to fit into 5-bit format (absolute error of 0.375, relative error of 15.8%) error magnification in case of effective subtraction of two approximately equal numbers [that is, a-b or a+(-b), where a and b are approx. equal]; this is called catastrophic cancellation; the leading (most significant) bits cancel out and all that is left are the trailing (least significant) bits, which are most affected by representation errors. 2.00 = 0 11 00 = 1.00x2^1 = 1.00x2^1 -1.75 = - 0 10 11 = - 1.11x2^0 = - 0.11x2^1 ----- ---------- 0.25 0.01x2^1 = 1.00x2^(-1) = 0.5 = 0 01 00 (absolute error of 0.25, relative error of 100%) to maintain accuracy in effective subtraction, most FP hardware adds three bits: guard bit, round bit, and sticky bit e.g., error in above subtraction eliminated with addition of guard bit g 2.00 = 0 11 00 = 1.00x2^1 = 1.000x2^1 -1.75 = - 0 10 11 = - 1.11x2^0 = - 0.111x2^1 ----- ---------- 0.25 0.001x2^1 = 0.10x2^(-1) = 0.25 = 0 00 10 (these extra bits bits would also help in the first addition above to produce a result of 2.5, which would then have less absolute error, 0.125, and less relative error, 5.3%) the extra bits also help to implement round to even fraction bits | g | r | s | x x ... x x x 0 0 1 \ ... | round down (truncate) x x ... x x x 0 1 1 | x x ... x x 0 1 0 0 / (0100 round to even -> truncate) x x ... x x 1 1 0 0 \ (1100 round to even -> add one to last x x ... x x x 1 0 1 | bit in fraction) ... | round up x x ... x x x 1 1 1 / multiplication -- multiply 1.fraction parts, add exponents 1.5 = 0 10 10 = 1.10x2^0 *1.5 = * 0 10 10 = * 1.10x2^0 ----- ----------- 2.25 10.01x2^0 = 1.00x2^1 = 2.0 = 0 11 00 or rounding up 10.01x2^0 = 1.001x2^1 = 1.01x2^1 = 2.5 = 0 11 01 using extra bit first floating point support was on IBM 704 in 1950s; each manufacturer had its own FP format until standardization by IEEE in 1980s IEEE formats single precision - 32-bit format = 1-bit sign, 8-bit exp, 23-bit fraction double precision - 64-bit format = " , 11-bit exp, 52-bit fraction also extended precision - 80-bit format quad precision - 128-bit format special codes for NaN - Not a Number (propagates itself through any operation) infinity - (also propagates in most cases) denormal numbers IEEE single-precision floating-point encoding (exponent is bias-127) 23-bit fraction 8-bit 000...00 000...01 ... 111...11 exponent +---------------------------------------- 000...00 | 0 2^(-149) almost 2^(-126) (zero and denormals) 000...01 | 2^(-126) ... almost 2^(-125) ... | ... ... ... 011...10 | 0.5 ... ... 011...11 | 1.0 ... ... 100...00 | 2.0 ... ... ... | ... ... ... 111...10 | 2^(+127) ... almost 2^(+128) 111.1111 | inf NaN ... NaN (infinity and NaNs) "almost" means that instead of exactly two, the significand = 2 - 2^(-23) Examples consider this C program int main(void){ float a=2.5, b=0.1, c; int *pa=(int *)&a,*pb=(int *)&b,*pc=(int *)&c; c = a + b; printf("a = %f (0x%x)\n",a,*pa); printf("b = %f (0x%x)\n",b,*pb); printf("c = %f (0x%x)\n",c,*pc); } it prints the float and hex values (hex requires dereferencing an int pointer) a = 2.500000 (0x40200000) b = 0.100000 (0x3dcccccd) c = 2.600000 (0x40266666) the initialization of values and code for the assignment statement are ... .uaword 0x40200000 ! ~2.50000000000000000000e0 .uaword 0x3dcccccd ! ~1.00000001490116119385e-1 ... ld [%fp-20], %f2 ld [%fp-24], %f3 fadds %f2, %f3, %f2 st %f2, [%fp-28] ... next, consider this C program int main(void){ int i; float sum; sum = 0.0; for(i=1; i<=10000000; i++){ sum = sum + 1.0/((float)i); } printf("decreasing order: %f\n",sum); sum = 0.0; for(i=10000000; i>0; i--){ sum = sum + 1.0/((float)i); } printf("increasing order: %f\n",sum); } when run, it prints decreasing order: 15.403683 increasing order: 16.686031 can you explain why? what is the correct answer? next, this program int main(void){ double hd, xd, yd; float hs, xs, ys; int i; hs = 0.1; xs = 0.0; for(i=0;i<10;i++){ xs += hs; } ys = 1.0 - xs; hd = 0.1; xd = 0.0; for(i=0;i<10;i++){ xd += hd; } yd = 1.0 - xd; printf("left-overs: %g %g\n",ys,yd); return 0; } prints "left-overs: -1.19209e-07 1.11022e-16" -- note the difference in the signs. Can you explain why? You cannot directly move values between the integer register set and the floating-point register set on SPARC. This has been criticized as a design mistake for SPARC, but the rationale is that these register sets were in different chips on early implementations. Thus you have to store the value in the integer register into the stack (or other data area) and reload the floating-point register from that location. Additionally, you have to convert that value into the appropriate floating- point format. You can see this happening in double sub(int a){ register double x; x = a; return(x); } which, if you compile (only) with "gcc -O2 -S", gives (edited) in the .s file: sub: add %sp, -120, %sp ! add sufficient area to the stack for an ! exception frame and local variable space st %o0, [%sp+100] ! store 32-bit integer value in %o0 into stack ld [%sp+100], %f2 ! reload that 32-bit word into %f2 fitod %f2, %f0 ! convert the 32-bit integer value in %f2 into ! a 64-bit (dbl. prec.) flt. pt. value in ! the register pair %f0 and %f1 sub %sp, -120, %sp ! restore the sp retl nop