Floating point arithmetic
Floating point arithmetic is a way to represent and handle a large range of real numbers in a binary form: The C64's built-in BASIC interpreter contains a set of subroutines which perform various tasks on numbers in floating point format, allowing BASIC to use real numbers. These routines may also be called from the user's own machine code programs, to handle real numbers in the range ±2.93873588·10^{−39} to ±1.70141183·10^{38}.
How it works[edit | edit source]
A real number T in the floating point format consists of a mantissa m and an integer exponent E, which are "selected" so that
- T = m · 2^{E}
The mantissa is normalized, which means it is always a number in the range from 0.5 to 1, so that 0.5 ≤ m < 1, and it's stored as a fixed-decimal binary real; a number that begins with a one right after the decimal point, followed by several binary decimals (31 of them, in the case of the 64's BASIC routines).
The exponent is a 8-bit integer with some special provisions for handling negative exponents (i.e. floating point real numbers less than 0.5 or greater or equal than 1): The interpreter stores the exponent as the number E + 128, so that an exponent of 2 is stored as 130 (128 + 2), and an exponent of −2 as 126 (128 − 2). Exponent $80 is reserved for representing the zero. The advantage of this so called excess-128 representation is that the floats could be easily compared in a lexical way without doing a full subtract operation.
Value | Resulting exponent | Float value |
---|---|---|
$00 | --- | 0 |
$01 | -127 | 2↑(-127) * mantissa |
. . . | ||
$80 | 0 | 2↑0 * mantissa |
. . . | ||
$FF | +127 | 2↑127 * mantissa |
Besides the mantissa and exponent, a separate sign bit indicates whether the entire floating point number is to be perceived as positive or negative. Together, the three parts thus "cover" any real number (within the aforementioned range) except zero. To "indicate" that a floating point number equals zero, the C-64 reserves one exponent value, 0 (which would otherwise indicate an exponent of −128), to "flag" that the whole floating point number is 0, regardless of the value of the accompanying mantissa.
Comparison with IEEE 754 single precision[edit | edit source]
IEEE 754 single precision is the type used for the "float" data type in C programs on a PC. The C64 type is different from IEEE 754 single precision as follows:
- C64 has 32 bits in the mantissa; IEEE 754 has 23.
- IEEE 754 places the sign bit to the left of the exponent. C64 places the sign bit between the exponent and the mantissa.
- IEEE 754 biases the exponent differently from C64. For C64, the numeric value 1.0 has the exponent $81; for IEEE 754, the exponent is $7F.
- IEEE 754 supports subnormal values: if the exponent is zero, but the mantissa is not, the mantissa is treated as if the exponent were $01, except that the implied leading 1 becomes a 0. C64 does not support subnormals; if the exponent is zero, the value is treated as zero.
- IEEE 754 distinguishes negative zero from positive for certain purposes. C64 does not.
- IEEE 754 has infinities and NaNs. Infinities have exponent $FF and mantissa equal to zero; they stand for results of numeric overflow or operations such as division by zero. NaNs have exponent $FF and mantissa not equal to zero, and stand for results of erroneous operations such as the square root of a negative number. For the C64 BASIC, any operation that would produce an infinity or a NaN is an error, and exponent $FF denotes just another range of finite values.
Representation in the C-64[edit | edit source]
There are basically two different layouts how floats are stored:
- for calculations with float registers in a some kind expanded layout of 6 or 7 bytes.
- for variables compacted into 5 bytes.
Float Registers[edit | edit source]
Two regions in zeropage are allocated for working with floating point numbers:
- One is called FAC, for Floating Point Accumulator:
- Address 97/$61 is the exponent byte
- Addresses 98–101/$62–$65 hold the four-byte (32 bit) mantissa
- Address 102/$66 stores the sign in its most significant bit; 0 for positive, $FF (-1) for negative.
- Address 112/$70 contains rounding bits for intermediate calculations.
- The other is called ARG, for Floating Point ARGument. It's arranged in the same way as FAC, only eight bytes further up:
- Address 105/$69 holds the exponent byte
- Addresses 106–109/$6A–$6D hold the four-byte mantissa
- Address 110/$6E holds the sign in its most significant bit; 0 for positive, $FF (-1) for negative.
A float in FAC uses 7 bytes, in ARG needs 6 bytes.
Variables[edit | edit source]
Note that this amounts to at least six bytes per floating point number, but the routines provided for moving numbers between FAC, ARG and arbitrary RAM addresses use a compression "trick" so that floating point numbers stored in RAM only take up five bytes: Since the mantissa is always in the 0.5-to-1 range, the first binary digit will always be a "1" — no need to store that. When storing a number in RAM, that "invariant 1" is replaced by the sign bit, and when reading numbers from RAM, the sign bit is moved to the separate sign byte in FAC or ARG, and the invariant first mantissa digit is restored to "1".
Float parts: | exp | sign/ma4 | ma3 | ma2 | ma1 | |||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Byte: | 0 | 1 | 2 | 3 | 4 | |||||||||||||||||||||||||||||||||||
Bit: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
E | E | E | E | E | E | E | E | S | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M |
Typical values in variable layout:
Exponent (hex.) |
Mantissa (hex.) |
Value (dec.) |
---|---|---|
00 | xx xx xx xx | 0 (mantissa ignored) |
01 | 00 00 00 00 | 2.93873588E-39 (smallest possible value) |
80 | 80 00 00 00 | 0.5 |
81 | 00 00 00 00 | 1 |
81 | 80 00 00 00 | -1 |
FF | 7F FF FF FF | 1.70141183e+38 (greatest possible value, positive signed) |
FF | FF FF FF FF | -1.70141183e+38 (greatest possible value, negative signed) |
Conversion example[edit | edit source]
- Exponent: exp-128
- Mantissa: (m4 >= 128 ? -1 : +1) * ((m4 | 0x80) >> 8 + m3 >> 16 + m2 >> 24 + m1 >> 32)
as a C-language like expression, with "x >> y" as "float multiply x by 2↑(-y)" (right-bit-shift operation)
exp m4 m3 m2 m1 98 35 44 7A 00 - some constant in hex 152 53 68 122 0 - same in dec 10011000 00110101 01000100 01111010 00000000 - same in bin ^ sign bit (reused, because the mantissa bit is always 1) In this case: Exponent = 152 - 128 = 24 ; dec Mantissa = 0.10110101010001000111101000000000 ; bin Mantissa = +1 * ((128+53) >> 8 + 68 >> 16 + 122 >> 24 + 0 >> 32) ; dec = 1 * 2^-1 + 0 * 2^-2 + 1 * 2^-3 + 1 * 2^-4 + 0 * 2^-5 + ... = 0.70807611942291259765 So the number is... 0.70807611942291259765 * 2^24 = 11879546.0
Routines[edit | edit source]
Fast Ceil routine[edit | edit source]
In C code:
float fceil(float fp) { typedef union { unsigned int value; BYTE bytes[4 - 1]; } mantissa; union { float fp; BYTE exponent; mantissa mantissa; } v; v.fp = fp; if (v.exponent == 0x00 || v.exponent == 0x80 || v.exponent == 0x81) return v.fp; int fractional_bits = 32 - (v.exponent - 128); if (fractional_bits <= 0) return v.fp; unsigned int integral_mask = 0xffffffff << fractional_bits; unsigned int fmantissa = v.mantissa.value & integral_mask; v.mantissa.value = fmantissa; if (v.fp > 0 && v.fp != fp) v.fp++; return v.fp; }
Fast Floor routine[edit | edit source]
In C code:
float ffloor(float fp) { typedef union { unsigned int value; BYTE bytes[4 - 1]; } mantissa; union { float fp; BYTE exponent; mantissa mantissa; } v; v.fp = fp; if (v.exponent == 0x00 || v.exponent == 0x80 || v.exponent == 0x81) return v.fp; int fractional_bits = 32 - (v.exponent - 128); if (fractional_bits <= 0) return v.fp; unsigned int integral_mask = 0xffffffff << fractional_bits; unsigned int fmantissa = v.mantissa.value & integral_mask; v.mantissa.value = fmantissa; return v.fp; }
Round to decimalcount[edit | edit source]
In C code:
float fround(float fp, int decimalcount) { typedef union { unsigned int value; BYTE bytes[4 - 1]; } mantissa; union { float fp; BYTE exponent; mantissa mantissa; } v; float tmpfp = fp * powf(10, decimalcount+1); v.fp = tmpfp; int fractional_bits = 32 - (v.exponent - 128); if (fractional_bits <= 0) return v.fp; unsigned int integral_mask = 0xffffffff << fractional_bits; unsigned int fmantissa = v.mantissa.value & integral_mask; v.mantissa.value = fmantissa; v.fp = v.fp / 10; tmpfp = v.fp; fractional_bits = 32 - (v.exponent - 128); if (fractional_bits <= 0) return v.fp; integral_mask = 0xffffffff << fractional_bits; fmantissa = v.mantissa.value & integral_mask; v.mantissa.value = fmantissa; float diff = tmpfp - v.fp; if (diff > 0.5f) v.fp++; v.fp = v.fp / powf(10, decimalcount); return v.fp; }
Floats in Basic[edit | edit source]
All CBM BASIC variants evaluate expressions by using floating point arithmetics even if simpler data types like integer variables are involved.
Just apart from the common operators there are several functions exposing special aspects of float values.
- ABS - get the absolute value of a float (remove the sign)
- INT - turn the value to the smallest integer (fraction part will be removed)
- SGN - sign of the float
- STR$ - converts a float to a string
- VAL - converts a string to a float
Using floating point routines[edit | edit source]
Just like the CPU's accumulator plays a central role in much of what the machine does, the FAC and ARG are the "hubs" of floating point calculations: Numbers to be processed are stored in FAC and ARG, and after calling the relevant routine with a JSR the result is "delivered" in FAC.
Where other RAM locations must be specified, the A/Y register combination is used, wherein the low-byte of the memory address is stored in A and the high byte is stored in Y. Similarly, when converting to and from absolute, 16-bit, signed integer values, the A/Y combination is used.
Finally, the QINT routine indicated below stores the 32 bit signed value in FAC+1 through FAC+4, with the highest order byte starting in FAC+1 and the lowest order byte in FAC+4.
Routines for moving (copying) numbers[edit | edit source]
Label | Address dec. |
Address hex. |
Description |
---|---|---|---|
CONUPK | 47756 | BA8C | Fetch a number from a RAM location to ARG (A=Addr.LB, Y=Addr.HB) |
MOVEF | 48124 | BBFC | Copy a number currently in ARG, over into FAC |
MOVFA | 48143 | BC0F | Copy a number currently in FAC, over into ARG |
MOVFM | 48034 | BBA2 | Fetch a number from a RAM location to FAC (A=Addr.LB, Y=Addr.HB) |
MOVMF | 48084 | BBD4 | Store the number currently in FAC, to a RAM location. Uses X and Y rather than A and Y to point to RAM. (X=Addr.LB, Y=Addr.HB) |
Routines for converting between floating point and other formats[edit | edit source]
Label | Address dec. |
Address hex. |
Description |
---|---|---|---|
FACINX | 45482 | B1AA | Convert number in FAC to 16-bit signed integer (Y=LB, A=HB). |
FIN | 48371 | BCF3 | Convert number expressed as a zero-terminated PETSCII string, to floating point number in FAC. Expects string-address in $7a/$7b, and to make it work either call CHRGOT ($0079) beforehand or load the accumulator with the first char of the string and clear the carry-flag manually. |
STRVAL | 47029 | B7B5 | Convert numerical PETSCII-string to floating point number in FAC. Expects string-address in $22/$23 and length of string in accumulator. |
FOUT | 48605 | BDDD | Convert number in FAC to a zero-terminated PETSCII string (starting at $0100, address in A, Y too). Direct output of FAC also via $AABC/43708 possible. |
GIVAYF | 45969 | B391 | Convert 16-bit signed integer to floating point number in FAC. Expects lowbyte in Y- and highbyte in A-register. |
QINT | 48283 | BC9B | Convert number in FAC to 32-bit signed integer ($62-$65, big-endian order). |
Routines for performing calculations[edit | edit source]
Label | Address dec. |
Address hex. |
Description |
---|---|---|---|
ABS (ROM Routine) | 48216 | BC58 | Performs the ABS function on the number in FAC |
ATN (ROM Routine) | 58126 | E30E | Performs the ATN function on the number in FAC |
COS (ROM Routine) | 57956 | E264 | Performs the COS function on the number in FAC |
MUL10 | 47842 | BAE2 | Multiply the number held in FAC by 10. |
DIV10 | 47870 | BAFE | Divide the number held in FAC by 10. Ignores the sign of the number in FAC, the result is always positive. |
EXP (ROM Routine) | 49133 | BFED | Performs the EXP function on the number in FAC |
FADD | 47207 | B867 | Adds the number in FAC with one stored in RAM (A=Addr.LB, Y=Addr.HB) |
FADDT | 47210 | B86A | Adds the numbers in FAC and ARG |
FDIV | 47887 | BB0F | Divides a number stored in RAM by the number in FAC (A=Addr.LB, Y=Addr.HB) |
FDIVT | 47890 | BB12 | Divides the number in ARG by the number in FAC. Sign comparison is not performed and ARISGN byte at $6f is not set, which has to be accounted for when using this entry point. Sign errors may occur otherwise. |
FMULT | 47656 | BA28 | Multiplies a number from RAM and FAC (clobbers ARG, A=Addr.LB, Y=Addr.HB) |
FPWR | 49016 | BF78 | Raises a number stored ín RAM to the power in FAC (A=Addr.LB, Y=Addr.HB) |
FPWRT | 49019 | BF7B | Raises the number in ARG to the power in FAC |
FSUB | 47184 | B850 | Subtracts the number in FAC from one stored in RAM (A=Addr.LB, Y=Addr.HB) |
FSUBT | 47187 | B853 | Subtracts the number in FAC from the number in ARG |
INT (ROM Routine) | 48332 | BCCC | Performs the INT function on the number in FAC |
LOG (ROM Routine) | 47594 | B9EA | Performs the LOG function on the number in FAC |
NEGOP | 49076 | BFB4 | Switches sign on the number in FAC, if non-zero |
POLY1 | 57411 | E043 | Evaluates a polynomial with odd powers only, for the value given in FAC |
POLY2 | 57433 | E059 | Evaluates a polynomial with odd and even powers, for the value given in FAC |
SIN (ROM Routine) | 57963 | E26B | Performs the SIN function on the number in FAC |
SGN (ROM Routine) | 48185 | BC39 | Performs the SGN function on the number in FAC |
SQR (ROM Routine) | 49009 | BF71 | Performs the SQR function on the number in FAC |
TAN (ROM Routine) | 58036 | E2B4 | Performs the TAN function on the number in FAC |
Routines for comparing numbers[edit | edit source]
Label | Address dec. |
Address hex. |
Description |
---|---|---|---|
FCOMP | 48219 | BC5B | Compares the number in FAC against one stored in RAM (A=Addr.LB, Y=Addr.HB). The result of the comparison is stored in A: Zero (0) indicates the values were equal. One (1) indicates FAC was greater than RAM and negative one (-1 or $FF) indicates FAC was less than RAM. Also sets processor flags (N,Z) depending on whether the number in FAC is zero, positive or negative |
SIGN | 48171 | BC2B | Sets processor flags (N,Z) depending on whether the number in FAC is zero, positive or negative |
Bugs[edit | edit source]
All Commodore BASICs, except BASIC 7.0 for the Commodore 128, have a bug in the multiply routine which can produce incorrect results for some pairs of operands.
Links[edit | edit source]
Wikipedia: Floating point |
Wikipedia: Offset binary (Excess code) |
- Microsoft BASIC for 6502: float routines, Source Repository on GitHub