Floating point arithmetic

From C64-Wiki
(Redirected from Float)
Jump to navigationJump to search

Floating point arithmetic is a way to represent and handle a large range of real numbers in a binary form: The C64's built-in BASIC interpreter contains a set of subroutines which perform various tasks on numbers in floating point format, allowing BASIC to use real numbers. These routines may also be called from the user's own machine code programs, to handle real numbers in the range ±2.93873588·10−39 to ±1.70141183·1038.

How it works[edit | edit source]

A real number T in the floating point format consists of a mantissa m and an integer exponent E, which are "selected" so that

T = m · 2E

The mantissa is normalized, which means it is always a number in the range from 0.5 to 1, so that 0.5 ≤ m < 1, and it's stored as a fixed-decimal binary real; a number that begins with a one right after the decimal point, followed by several binary decimals (31 of them, in the case of the 64's BASIC routines).

The exponent is a 8-bit integer with some special provisions for handling negative exponents (i.e. floating point real numbers less than 0.5 or greater or equal than 1): The interpreter stores the exponent as the number E + 128, so that an exponent of 2 is stored as 130 (128 + 2), and an exponent of −2 as 126 (128 − 2). Exponent $80 is reserved for representing the zero. The advantage of this so called excess-128 representation is that the floats could be easily compared in a lexical way without doing a full subtract operation.

Value Resulting exponent Float value
$00 --- 0
$01 -127 2↑(-127) * mantissa
. . .
$80 0 2↑0 * mantissa
. . .
$FF +127 2↑127 * mantissa

Besides the mantissa and exponent, a separate sign bit indicates whether the entire floating point number is to be perceived as positive or negative. Together, the three parts thus "cover" any real number (within the aforementioned range) except zero. To "indicate" that a floating point number equals zero, the C-64 reserves one exponent value, 0 (which would otherwise indicate an exponent of −128), to "flag" that the whole floating point number is 0, regardless of the value of the accompanying mantissa.

Comparison with IEEE 754 single precision[edit | edit source]

IEEE 754 single precision is the type used for the "float" data type in C programs on a PC. The C64 type is different from IEEE 754 single precision as follows:

  • C64 has 32 bits in the mantissa; IEEE 754 has 23.
  • IEEE 754 places the sign bit to the left of the exponent. C64 places the sign bit between the exponent and the mantissa.
  • IEEE 754 biases the exponent differently from C64. For C64, the numeric value 1.0 has the exponent $81; for IEEE 754, the exponent is $7F.
  • IEEE 754 supports subnormal values: if the exponent is zero, but the mantissa is not, the mantissa is treated as if the exponent were $01, except that the implied leading 1 becomes a 0. C64 does not support subnormals; if the exponent is zero, the value is treated as zero.
  • IEEE 754 distinguishes negative zero from positive for certain purposes. C64 does not.
  • IEEE 754 has infinities and NaNs. Infinities have exponent $FF and mantissa equal to zero; they stand for results of numeric overflow or operations such as division by zero. NaNs have exponent $FF and mantissa not equal to zero, and stand for results of erroneous operations such as the square root of a negative number. For the C64 BASIC, any operation that would produce an infinity or a NaN is an error, and exponent $FF denotes just another range of finite values.

Representation in the C-64[edit | edit source]

There are basically two different layouts how floats are stored:

  1. for calculations with float registers in a some kind expanded layout of 6 or 7 bytes.
  2. for variables compacted into 5 bytes.

Float Registers[edit | edit source]

Two regions in zeropage are allocated for working with floating point numbers:

  1. One is called FAC, for Floating Point Accumulator:
    • Address 97/$61 is the exponent byte
    • Addresses 98–101/$62–$65 hold the four-byte (32 bit) mantissa
    • Address 102/$66 stores the sign in its most significant bit; 0 for positive, $FF (-1) for negative.
    • Address 112/$70 contains rounding bits for intermediate calculations.
  2. The other is called ARG, for Floating Point ARGument. It's arranged in the same way as FAC, only eight bytes further up:
    • Address 105/$69 holds the exponent byte
    • Addresses 106–109/$6A–$6D hold the four-byte mantissa
    • Address 110/$6E holds the sign in its most significant bit; 0 for positive, $FF (-1) for negative.

A float in FAC uses 7 bytes, in ARG needs 6 bytes.

Variables[edit | edit source]

Note that this amounts to at least six bytes per floating point number, but the routines provided for moving numbers between FAC, ARG and arbitrary RAM addresses use a compression "trick" so that floating point numbers stored in RAM only take up five bytes: Since the mantissa is always in the 0.5-to-1 range, the first binary digit will always be a "1" — no need to store that. When storing a number in RAM, that "invariant 1" is replaced by the sign bit, and when reading numbers from RAM, the sign bit is moved to the separate sign byte in FAC or ARG, and the invariant first mantissa digit is restored to "1".

Float parts: exp sign/ma4 ma3 ma2 ma1
Byte: 0 1 2 3 4
Bit: 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
E E E E E E E E S M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M M

Typical values in variable layout:

(the MSB of mantissa which is occupied by the sign bit is always 1 has to be virtually taken in to account to calculate the real value of 2^E * M, it corresponds to the decimal value 0.5)

Exponent
(hex.)
Mantissa
(hex.)
Value
(dec.)
00 xx xx xx xx 0
(mantissa ignored)
01 00 00 00 00 2.93873588E-39
(smallest possible value)
80 00 00 00 00 0.5
81 00 00 00 00 1
81 80 00 00 00 -1
FF 7F FF FF FF 1.70141183e+38
(greatest possible value, positive signed)
FF FF FF FF FF -1.70141183e+38
(greatest possible value, negative signed)

Conversion example[edit | edit source]

  • Exponent: exp-128
  • Mantissa: (m4 >= 128 ? -1 : +1) * ((m4 | 0x80) >> 8 + m3 >> 16 + m2 >> 24 + m1 >> 32)
    as a C-language like expression, with "x >> y" as "float multiply x by 2↑(-y)" (right-bit-shift operation)
     exp       m4       m3       m2       m1
      98       35       44       7A       00      - some constant in hex
     152       53       68      122        0      - same in dec
10011000 00110101 01000100 01111010 00000000      - same in bin
         ^ sign bit (reused, because the mantissa bit is always 1)

In this case:
Exponent = 152 - 128 = 24                                          ; dec
Mantissa = 0.10110101010001000111101000000000                      ; bin
Mantissa = +1 * ((128+53) >> 8 + 68 >> 16 + 122 >> 24 + 0 >> 32)   ; dec
         = 1 * 2^-1 + 0 * 2^-2 + 1 * 2^-3 + 1 * 2^-4 + 0 * 2^-5 + ...
         = 0.70807611942291259765

So the number is...
0.70807611942291259765 * 2^24 = 11879546.0

Routines[edit | edit source]

Fast Ceil routine[edit | edit source]

In C code:

float fceil(float fp)
{
    typedef union {
        unsigned int    value;
        BYTE            bytes[4 - 1];
    } mantissa;
    union {
        float       fp;
        BYTE        exponent;
        mantissa    mantissa;
    } v;
    v.fp = fp;
    if (v.exponent == 0x00 || v.exponent == 0x80 || v.exponent == 0x81)
       return v.fp;

    int fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    unsigned int integral_mask = 0xffffffff << fractional_bits;
    unsigned int fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;
    
    if (v.fp > 0 && v.fp != fp) v.fp++;
    return v.fp;
}

Fast Floor routine[edit | edit source]

In C code:

float ffloor(float fp)
{
    typedef union {
        unsigned int    value;
        BYTE            bytes[4 - 1];
    } mantissa;
    union {
        float       fp;
        BYTE        exponent;
        mantissa    mantissa;
    } v;
    v.fp = fp;
    if (v.exponent == 0x00 || v.exponent == 0x80 || v.exponent == 0x81)
       return v.fp;

    int fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    unsigned int integral_mask = 0xffffffff << fractional_bits;
    unsigned int fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;

    return v.fp;
}

Round to decimalcount[edit | edit source]

In C code:

float fround(float fp, int decimalcount)
{
    typedef union {
        unsigned int    value;
        BYTE            bytes[4 - 1];
    } mantissa;
    union {
        float       fp;
        BYTE        exponent;
        mantissa    mantissa;
    } v;
    float tmpfp = fp * powf(10, decimalcount+1);

    v.fp = tmpfp;

    int fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    unsigned int integral_mask = 0xffffffff << fractional_bits;
    unsigned int fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;

    v.fp = v.fp / 10;
    tmpfp = v.fp;

    fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    integral_mask = 0xffffffff << fractional_bits;
    fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;

    float diff = tmpfp - v.fp;
    if (diff > 0.5f) v.fp++;
 
    v.fp = v.fp / powf(10, decimalcount);

    return v.fp;
}

Floats in Basic[edit | edit source]

All CBM BASIC variants evaluate expressions by using floating point arithmetics even if simpler data types like integer variables are involved.

Just apart from the common operators there are several functions exposing special aspects of float values.

  • ABS - get the absolute value of a float (remove the sign)
  • INT - turn the value to the smallest integer (fraction part will be removed)
  • SGN - sign of the float
  • STR$ - converts a float to a string
  • VAL - converts a string to a float


Using floating point routines[edit | edit source]

Just like the CPU's accumulator plays a central role in much of what the machine does, the FAC and ARG are the "hubs" of floating point calculations: Numbers to be processed are stored in FAC and ARG, and after calling the relevant routine with a JSR the result is "delivered" in FAC.

Where other RAM locations must be specified, the A/Y register combination is used, wherein the low-byte of the memory address is stored in A and the high byte is stored in Y. Similarly, when converting to and from absolute, 16-bit, signed integer values, the A/Y combination is used.

Finally, the QINT routine indicated below stores the 32 bit signed value in FAC+1 through FAC+4, with the highest order byte starting in FAC+1 and the lowest order byte in FAC+4.

Routines for moving (copying) numbers[edit | edit source]

Label Address
dec.
Address
hex.
Description
CONUPK 47756 BA8C Fetch a number from a RAM location to ARG (A=Addr.LB, Y=Addr.HB)
MOVEF 48124 BBFC Copy a number currently in ARG, over into FAC
MOVFA 48143 BC0F Copy a number currently in FAC, over into ARG
MOVFM 48034 BBA2 Fetch a number from a RAM location to FAC (A=Addr.LB, Y=Addr.HB)
MOVMF 48084 BBD4 Store the number currently in FAC, to a RAM location. Uses X and Y rather than A and Y to point to RAM. (X=Addr.LB, Y=Addr.HB)

Routines for converting between floating point and other formats[edit | edit source]

Label Address
dec.
Address
hex.
Description
FACINX 45482 B1AA Convert number in FAC to 16-bit signed integer (Y=LB, A=HB).
FIN 48371 BCF3 Convert number expressed as a zero-terminated PETSCII string, to floating point number in FAC. Expects string-address in $7a/$7b, and to make it work either call CHRGOT ($0079) beforehand or load the accumulator with the first char of the string and clear the carry-flag manually.
STRVAL 47029 B7B5 Convert numerical PETSCII-string to floating point number in FAC. Expects string-address in $22/$23 and length of string in accumulator.
FOUT 48605 BDDD Convert number in FAC to a zero-terminated PETSCII string (starting at $0100, address in A, Y too). Direct output of FAC also via $AABC/43708 possible.
GIVAYF 45969 B391 Convert 16-bit signed integer to floating point number in FAC. Expects lowbyte in Y- and highbyte in A-register.
QINT 48283 BC9B Convert number in FAC to 32-bit signed integer ($62-$65, big-endian order).

Routines for performing calculations[edit | edit source]

Label Address
dec.
Address
hex.
Description
ABS (ROM Routine) 48216 BC58 Performs the ABS function on the number in FAC
ATN (ROM Routine) 58126 E30E Performs the ATN function on the number in FAC
COS (ROM Routine) 57956 E264 Performs the COS function on the number in FAC
MUL10 47842 BAE2 Multiply the number held in FAC by 10.
DIV10 47870 BAFE Divide the number held in FAC by 10. Ignores the sign of the number in FAC, the result is always positive.
EXP (ROM Routine) 49133 BFED Performs the EXP function on the number in FAC
FADD 47207 B867 Adds the number in FAC with one stored in RAM (A=Addr.LB, Y=Addr.HB)
FADDT 47210 B86A Adds the numbers in FAC and ARG
FDIV 47887 BB0F Divides a number stored in RAM by the number in FAC (A=Addr.LB, Y=Addr.HB)
FDIVT 47890 BB12 Divides the number in ARG by the number in FAC. Sign comparison is not performed and ARISGN byte at $6f is not set, which has to be accounted for when using this entry point. Sign errors may occur otherwise.
FMULT 47656 BA28 Multiplies a number from RAM and FAC (clobbers ARG, A=Addr.LB, Y=Addr.HB)
FPWR 49016 BF78 Raises a number stored ín RAM to the power in FAC (A=Addr.LB, Y=Addr.HB)
FPWRT 49019 BF7B Raises the number in ARG to the power in FAC
FSUB 47184 B850 Subtracts the number in FAC from one stored in RAM (A=Addr.LB, Y=Addr.HB)
FSUBT 47187 B853 Subtracts the number in FAC from the number in ARG
INT (ROM Routine) 48332 BCCC Performs the INT function on the number in FAC
LOG (ROM Routine) 47594 B9EA Performs the LOG function on the number in FAC
NEGOP 49076 BFB4 Switches sign on the number in FAC, if non-zero
POLY1 57411 E043 Evaluates a polynomial with odd powers only, for the value given in FAC
POLY2 57433 E059 Evaluates a polynomial with odd and even powers, for the value given in FAC
SIN (ROM Routine) 57963 E26B Performs the SIN function on the number in FAC
SGN (ROM Routine) 48185 BC39 Performs the SGN function on the number in FAC
SQR (ROM Routine) 49009 BF71 Performs the SQR function on the number in FAC
TAN (ROM Routine) 58036 E2B4 Performs the TAN function on the number in FAC

Routines for comparing numbers[edit | edit source]

Label Address
dec.
Address
hex.
Description
FCOMP 48219 BC5B Compares the number in FAC against one stored in RAM (A=Addr.LB, Y=Addr.HB). The result of the comparison is stored in A: Zero (0) indicates the values were equal.
One (1) indicates FAC was greater than RAM and
negative one (-1 or $FF) indicates FAC was less than RAM.
Also sets processor flags (N,Z) depending on whether the number in FAC is zero, positive or negative
SIGN 48171 BC2B Sets processor flags (N,Z) depending on whether the number in FAC is zero, positive or negative

Bugs[edit | edit source]

All Commodore BASICs, except BASIC 7.0 for the Commodore 128, have a bug in the multiply routine which can produce incorrect results for some pairs of operands.

Links[edit | edit source]

WP-W11.png Wikipedia: Floating point
WP-W11.png Wikipedia: Offset binary (Excess code)