Floating point arithmetic

Floating point arithmetic is a way to represent and handle a large range of real numbers in a binary form: The C64's built-in BASIC interpreter contains a set of subroutines which perform various tasks on numbers in floating point format, allowing BASIC to use real numbers. These routines may also be called from the user's own machine code programs, to handle real numbers in the range ±2.93873588·10⁻³⁹ to ±1.70141183·10³⁸.

How it works[edit | edit source]

A real number T in the floating point format consists of a mantissa m and an integer exponent E, which are "selected" so that

T = m · 2^E

The mantissa is normalized, which means it is always a number in the range from 0.5 to 1, so that 0.5 ≤ m < 1, and it's stored as a fixed-decimal binary real; a number that begins with a one right after the decimal point, followed by several binary decimals (31 of them, in the case of the 64's BASIC routines).

The exponent is a 8-bit integer with some special provisions for handling negative exponents (i.e. floating point real numbers less than 0.5 or greater or equal than 1): The interpreter stores the exponent as the number E + 128, so that an exponent of 2 is stored as 130 (128 + 2), and an exponent of −2 as 126 (128 − 2). Exponent $80 is reserved for representing the zero. The advantage of this so called excess-128 representation is that the floats could be easily compared in a lexical way without doing a full subtract operation.

Value	Resulting exponent	Float value
$00	---	0
$01	-127	2↑(-127) * mantissa
. . .
$80	0	2↑0 * mantissa
. . .
$FF	+127	2↑127 * mantissa

Besides the mantissa and exponent, a separate sign bit indicates whether the entire floating point number is to be perceived as positive or negative. Together, the three parts thus "cover" any real number (within the aforementioned range) except zero. To "indicate" that a floating point number equals zero, the C-64 reserves one exponent value, 0 (which would otherwise indicate an exponent of −128), to "flag" that the whole floating point number is 0, regardless of the value of the accompanying mantissa.

Comparison with IEEE 754 single precision[edit | edit source]

IEEE 754 single precision is the type used for the "float" data type in C programs on a PC. The C64 type is different from IEEE 754 single precision as follows:

C64 has 32 bits in the mantissa; IEEE 754 has 23.
IEEE 754 places the sign bit to the left of the exponent. C64 places the sign bit between the exponent and the mantissa.
IEEE 754 biases the exponent differently from C64. For C64, the numeric value 1.0 has the exponent $81; for IEEE 754, the exponent is $7F.
IEEE 754 supports subnormal values: if the exponent is zero, but the mantissa is not, the mantissa is treated as if the exponent were $01, except that the implied leading 1 becomes a 0. C64 does not support subnormals; if the exponent is zero, the value is treated as zero.
IEEE 754 distinguishes negative zero from positive for certain purposes. C64 does not.
IEEE 754 has infinities and NaNs. Infinities have exponent $FF and mantissa equal to zero; they stand for results of numeric overflow or operations such as division by zero. NaNs have exponent $FF and mantissa not equal to zero, and stand for results of erroneous operations such as the square root of a negative number. For the C64 BASIC, any operation that would produce an infinity or a NaN is an error, and exponent $FF denotes just another range of finite values.

Representation in the C-64[edit | edit source]

There are basically two different layouts how floats are stored:

for calculations with float registers in a some kind expanded layout of 6 or 7 bytes.
for variables compacted into 5 bytes.

Float Registers[edit | edit source]

Two regions in zeropage are allocated for working with floating point numbers:

One is called FAC, for Floating Point Accumulator:
- Address 97/$61 is the exponent byte
- Addresses 98–101/$62–$65 hold the four-byte (32 bit) mantissa
- Address 102/$66 stores the sign in its most significant bit; 0 for positive, $FF (-1) for negative.
- Address 112/$70 contains rounding bits for intermediate calculations.
The other is called ARG, for Floating Point ARGument. It's arranged in the same way as FAC, only eight bytes further up:
- Address 105/$69 holds the exponent byte
- Addresses 106–109/$6A–$6D hold the four-byte mantissa
- Address 110/$6E holds the sign in its most significant bit; 0 for positive, $FF (-1) for negative.

A float in FAC uses 7 bytes, in ARG needs 6 bytes.

Variables[edit | edit source]

Note that this amounts to at least six bytes per floating point number, but the routines provided for moving numbers between FAC, ARG and arbitrary RAM addresses use a compression "trick" so that floating point numbers stored in RAM only take up five bytes: Since the mantissa is always in the 0.5-to-1 range, the first binary digit will always be a "1" — no need to store that. When storing a number in RAM, that "invariant 1" is replaced by the sign bit, and when reading numbers from RAM, the sign bit is moved to the separate sign byte in FAC or ARG, and the invariant first mantissa digit is restored to "1".

Float parts:	exp								sign/ma4								ma3								ma2								ma1
Byte:	0								1								2								3								4
Bit:	7	6	5	4	3	2	1	0	7	6	5	4	3	2	1	0	7	6	5	4	3	2	1	0	7	6	5	4	3	2	1	0	7	6	5	4	3	2	1	0
	E	E	E	E	E	E	E	E	S	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M	M

Typical values in variable layout:

(the MSB of mantissa which is occupied by the sign bit is always 1 has to be virtually taken in to account to calculate the real value of 2^E * M, it corresponds to the decimal value 0.5)

Exponent (hex.)	Mantissa (hex.)	Value (dec.)
00	xx xx xx xx	0 (mantissa ignored)
01	00 00 00 00	2.93873588E-39 (smallest possible value)
80	00 00 00 00	0.5
81	00 00 00 00	1
81	80 00 00 00	-1
FF	7F FF FF FF	1.70141183e+38 (greatest possible value, positive signed)
FF	FF FF FF FF	-1.70141183e+38 (greatest possible value, negative signed)

Conversion example[edit | edit source]

Exponent: exp-128
Mantissa: (m4 >= 128 ? -1 : +1) * ((m4 | 0x80) >> 8 + m3 >> 16 + m2 >> 24 + m1 >> 32)
as a C-language like expression, with "x >> y" as "float multiply x by 2↑(-y)" (right-bit-shift operation)

     exp       m4       m3       m2       m1
      98       35       44       7A       00      - some constant in hex
     152       53       68      122        0      - same in dec
10011000 00110101 01000100 01111010 00000000      - same in bin
         ^ sign bit (reused, because the mantissa bit is always 1)

In this case:
Exponent = 152 - 128 = 24                                          ; dec
Mantissa = 0.10110101010001000111101000000000                      ; bin
Mantissa = +1 * ((128+53) >> 8 + 68 >> 16 + 122 >> 24 + 0 >> 32)   ; dec
         = 1 * 2^-1 + 0 * 2^-2 + 1 * 2^-3 + 1 * 2^-4 + 0 * 2^-5 + ...
         = 0.70807611942291259765

So the number is...
0.70807611942291259765 * 2^24 = 11879546.0

Routines[edit | edit source]

Fast Ceil routine[edit | edit source]

In C code:

float fceil(float fp)
{
    typedef union {
        unsigned int    value;
        BYTE            bytes[4 - 1];
    } mantissa;
    union {
        float       fp;
        BYTE        exponent;
        mantissa    mantissa;
    } v;
    v.fp = fp;
    if (v.exponent == 0x00 || v.exponent == 0x80 || v.exponent == 0x81)
       return v.fp;

    int fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    unsigned int integral_mask = 0xffffffff << fractional_bits;
    unsigned int fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;
    
    if (v.fp > 0 && v.fp != fp) v.fp++;
    return v.fp;
}

Fast Floor routine[edit | edit source]

In C code:

float ffloor(float fp)
{
    typedef union {
        unsigned int    value;
        BYTE            bytes[4 - 1];
    } mantissa;
    union {
        float       fp;
        BYTE        exponent;
        mantissa    mantissa;
    } v;
    v.fp = fp;
    if (v.exponent == 0x00 || v.exponent == 0x80 || v.exponent == 0x81)
       return v.fp;

    int fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    unsigned int integral_mask = 0xffffffff << fractional_bits;
    unsigned int fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;

    return v.fp;
}

Round to decimalcount[edit | edit source]

In C code:

float fround(float fp, int decimalcount)
{
    typedef union {
        unsigned int    value;
        BYTE            bytes[4 - 1];
    } mantissa;
    union {
        float       fp;
        BYTE        exponent;
        mantissa    mantissa;
    } v;
    float tmpfp = fp * powf(10, decimalcount+1);

    v.fp = tmpfp;

    int fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    unsigned int integral_mask = 0xffffffff << fractional_bits;
    unsigned int fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;

    v.fp = v.fp / 10;
    tmpfp = v.fp;

    fractional_bits = 32 - (v.exponent - 128);
    if (fractional_bits <= 0) return v.fp;
    
    integral_mask = 0xffffffff << fractional_bits;
    fmantissa = v.mantissa.value & integral_mask;
    v.mantissa.value = fmantissa;

    float diff = tmpfp - v.fp;
    if (diff > 0.5f) v.fp++;
 
    v.fp = v.fp / powf(10, decimalcount);

    return v.fp;
}

Floats in Basic[edit | edit source]

All CBM BASIC variants evaluate expressions by using floating point arithmetics even if simpler data types like integer variables are involved.

Just apart from the common operators there are several functions exposing special aspects of float values.

ABS - get the absolute value of a float (remove the sign)
INT - turn the value to the smallest integer (fraction part will be removed)
SGN - sign of the float
STR$ - converts a float to a string
VAL - converts a string to a float

Using floating point routines[edit | edit source]

Just like the CPU's accumulator plays a central role in much of what the machine does, the FAC and ARG are the "hubs" of floating point calculations: Numbers to be processed are stored in FAC and ARG, and after calling the relevant routine with a JSR the result is "delivered" in FAC.

Where other RAM locations must be specified, the A/Y register combination is used, wherein the low-byte of the memory address is stored in A and the high byte is stored in Y. Similarly, when converting to and from absolute, 16-bit, signed integer values, the A/Y combination is used.

Finally, the QINT routine indicated below stores the 32 bit signed value in FAC+1 through FAC+4, with the highest order byte starting in FAC+1 and the lowest order byte in FAC+4.

Routines for moving (copying) numbers[edit | edit source]

Label	Address dec.	Address hex.	Description
CONUPK	47756	BA8C	Fill ARG with number from memory (A=Adr.LB, Y=Adr.HB). Then, in preparation for subsequent operations, compares the signs of ARG and FAC and writes the result to address $6F ($00: if signs are the same, $80: if signs are different), and loads the exponent from FAC to A (sets zero flag when FAC equals zero). The routines FADDT , FDIVT , FMULTT and FPWRT require this preparation.
MOVEF	48124	BBFC	Copy a number currently in ARG, over into FAC
MOVFA	48143	BC0F	Copy a number currently in FAC, over into ARG
MOVFM	48034	BBA2	Fetch a number from a RAM location to FAC (A=Addr.LB, Y=Addr.HB)
MOVMF	48084	BBD4	Store the number currently in FAC, to a RAM location. Uses X and Y rather than A and Y to point to RAM. (X=Addr.LB, Y=Addr.HB)

Routines for converting between floating point and other formats[edit | edit source]

Label	Address dec.	Address hex.	Description
FACINX	45482	B1AA	Convert number in FAC to 16-bit signed integer (Y=LB, A=HB).
FIN	48371	BCF3	Convert number expressed as a zero-terminated PETSCII string, to floating point number in FAC. Expects string-address in $7a/$7b, and to make it work either call CHRGOT ($0079) beforehand or load the accumulator with the first char of the string and clear the carry-flag manually if first char is a digit or set it otherwise.
STRVAL	47029	B7B5	Convert numerical PETSCII-string to floating point number in FAC. Expects string-address in $22/$23 and length of string in accumulator.
FOUT	48605	BDDD	Convert number in FAC to a zero-terminated PETSCII string (starting at $0100, address in A, Y too). Direct output of FAC also via $AABC/43708 possible.
GIVAYF	45969	B391	Convert 16-bit signed integer to floating point number in FAC. Expects lowbyte in Y- and highbyte in A-register.
QINT	48283	BC9B	Convert number in FAC to 32-bit signed integer ($62-$65, big-endian order).

Routines for performing calculations[edit | edit source]

Label	Address dec.	Address hex.	Description
ABS (ROM Routine)	48216	BC58	Performs the ABS function on the number in FAC
ATN (ROM Routine)	58126	E30E	Performs the ATN function on the number in FAC
COS (ROM Routine)	57956	E264	Performs the COS function on the number in FAC
MUL10	47842	BAE2	Multiply the number held in FAC by 10.
DIV10	47870	BAFE	Divide the number held in FAC by 10. Ignores the sign of the number in FAC, the result is always positive.
EXP (ROM Routine)	49133	BFED	Performs the EXP function on the number in FAC
FADD	47207	B867	Adds the number in FAC with one stored in RAM (A=Addr.LB, Y=Addr.HB)
FADDT	47210	B86A	Adds the numbers in FAC and ARG
FDIV	47887	BB0F	Divides a number stored in RAM by the number in FAC (A=Addr.LB, Y=Addr.HB)
FDIVT	47890	BB12	Divides the number in ARG by the number in FAC. Sign comparison is not performed and ARISGN byte at $6f is not set, which has to be accounted for when using this entry point. Sign errors may occur otherwise.
FMULT	47656	BA28	Multiplies a number from RAM and FAC (clobbers ARG, A=Addr.LB, Y=Addr.HB)
FPWR	49016	BF78	Raises a number stored ín RAM to the power in FAC (A=Addr.LB, Y=Addr.HB)
FPWRT	49019	BF7B	Raises the number in ARG to the power in FAC
FSUB	47184	B850	Subtracts the number in FAC from one stored in RAM (A=Addr.LB, Y=Addr.HB)
FSUBT	47187	B853	Subtracts the number in FAC from the number in ARG
INT (ROM Routine)	48332	BCCC	Performs the INT function on the number in FAC
LOG (ROM Routine)	47594	B9EA	Performs the LOG function on the number in FAC
NEGOP	49076	BFB4	Switches sign on the number in FAC, if non-zero
POLY1	57411	E043	Evaluates a polynomial with odd powers only, for the value given in FAC
POLY2	57433	E059	Evaluates a polynomial with odd and even powers, for the value given in FAC
SIN (ROM Routine)	57963	E26B	Performs the SIN function on the number in FAC
SGN (ROM Routine)	48185	BC39	Performs the SGN function on the number in FAC
SQR (ROM Routine)	49009	BF71	Performs the SQR function on the number in FAC
TAN (ROM Routine)	58036	E2B4	Performs the TAN function on the number in FAC

Routines for comparing numbers[edit | edit source]

Label	Address dec.	Address hex.	Description
FCOMP	48219	BC5B	Compares the number in FAC against one stored in RAM (A=Addr.LB, Y=Addr.HB). The result of the comparison is stored in A: Zero (0) indicates the values were equal. One (1) indicates FAC was greater than RAM and negative one (-1 or $FF) indicates FAC was less than RAM. Also sets processor flags (N,Z) depending on whether the number in FAC is zero, positive or negative
SIGN	48171	BC2B	Sets processor flags (N,Z) depending on whether the number in FAC is zero, positive or negative

Bugs[edit | edit source]

All Commodore BASICs, except BASIC 7.0 for the Commodore 128, have a bug in the multiply routine which can produce incorrect results for some pairs of operands.

Links[edit | edit source]

Wikipedia: Floating point

Wikipedia: Offset binary (Excess code)

Microsoft BASIC for 6502: float routines, Source Repository on GitHub

Floating point arithmetic

Contents

How it works[edit | edit source]

Comparison with IEEE 754 single precision[edit | edit source]

Representation in the C-64[edit | edit source]

Float Registers[edit | edit source]

Variables[edit | edit source]

Conversion example[edit | edit source]

Routines[edit | edit source]

Fast Ceil routine[edit | edit source]

Fast Floor routine[edit | edit source]

Round to decimalcount[edit | edit source]

Floats in Basic[edit | edit source]

Using floating point routines[edit | edit source]

Routines for moving (copying) numbers[edit | edit source]

Routines for converting between floating point and other formats[edit | edit source]

Routines for performing calculations[edit | edit source]

Routines for comparing numbers[edit | edit source]

Bugs[edit | edit source]

Links[edit | edit source]

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Help and Feedback

C64-Games

Portals

Tools

In other languages