Floating point numbers with 16 bits of precision are used mostly in computer graphics. They are also called half precision floating point numbers (as having half the bits of single precision 32bit floats). There's one sign bit, five bit exponent, and ten bits for mantissa. Half floats are not really meant to be used for arithmetic computations due to the limited precision (and no support in common CPUs/FPUs).
Half floats first appeared in early 2000s as samples in images and textures. Floats provide higher dynamic range than what is available with regular 8bit or 16bit integer samples. On the other hand, commonly used single and double precision floats have much higher memory cost per pixel. Half floats have more reasonable memory requirements and their precision is adequate for many usages in imaging.
16bit float formats have been supported by ATI and NVidia GPUs for many years. I'm not sure about other IHVs but at least Direct3D 10 capable GPUs should all support it.
Read on if your interested how to convert between half and single precision floats (Object Pascal code).
Half/Single Conversion code
Finally, here's the code for converting to half floats and back to single precision float. It's based on C++ code from OpenEXR library (half class).
First some types and constants
Note that THalfFloat type is just an alias for Word.
type THalfFloat = type Word; const HalfMin: Single = 5.96046448e-08; // Smallest positive half HalfMinNorm: Single = 6.10351562e-05; // Smallest positive normalized half HalfMax: Single = 65504.0; // Largest positive half // Smallest positive e for which half (1.0 + e) != half (1.0) HalfEpsilon: Single = 0.00097656; HalfNaN: THalfFloat = 65535; HalfPosInf: THalfFloat = 31744; HalfNegInf: THalfFloat = 64512;
Single precision float to half
function FloatToHalf(Float: Single): THalfFloat;
var
Src: LongWord;
Sign, Exp, Mantissa: LongInt;
begin
Src := PLongWord(@Float)^;
// Extract sign, exponent, and mantissa from Single number
Sign := Src shr 31;
Exp := LongInt((Src and $7F800000) shr 23) - 127 + 15;
Mantissa := Src and $007FFFFF;
if (Exp > 0) and (Exp < 30) then
begin
// Simple case - round the significand and combine it with the sign and exponent
Result := (Sign shl 15) or (Exp shl 10) or ((Mantissa + $00001000) shr 13);
end
else if Src = 0 then
begin
// Input float is zero - return zero
Result := 0;
end
else
begin
// Difficult case - lengthy conversion
if Exp <= 0 then
begin
if Exp < -10 then
begin
// Input float's value is less than HalfMin, return zero
Result := 0;
end
else
begin
// Float is a normalized Single whose magnitude is less than HalfNormMin.
// We convert it to denormalized half.
Mantissa := (Mantissa or $00800000) shr (1 - Exp);
// Round to nearest
if (Mantissa and $00001000) > 0 then
Mantissa := Mantissa + $00002000;
// Assemble Sign and Mantissa (Exp is zero to get denormalized number)
Result := (Sign shl 15) or (Mantissa shr 13);
end;
end
else if Exp = 255 - 127 + 15 then
begin
if Mantissa = 0 then
begin
// Input float is infinity, create infinity half with original sign
Result := (Sign shl 15) or $7C00;
end
else
begin
// Input float is NaN, create half NaN with original sign and mantissa
Result := (Sign shl 15) or $7C00 or (Mantissa shr 13);
end;
end
else
begin
// Exp is > 0 so input float is normalized Single
// Round to nearest
if (Mantissa and $00001000) > 0 then
begin
Mantissa := Mantissa + $00002000;
if (Mantissa and $00800000) > 0 then
begin
Mantissa := 0;
Exp := Exp + 1;
end;
end;
if Exp > 30 then
begin
// Exponent overflow - return infinity half
Result := (Sign shl 15) or $7C00;
end
else
// Assemble normalized half
Result := (Sign shl 15) or (Exp shl 10) or (Mantissa shr 13);
end;
end;
end;
Half to single precision float
function HalfToFloat(Half: THalfFloat): Single;
var
Dst, Sign, Mantissa: LongWord;
Exp: LongInt;
begin
// Extract sign, exponent, and mantissa from half number
Sign := Half shr 15;
Exp := (Half and $7C00) shr 10;
Mantissa := Half and 1023;
if (Exp > 0) and (Exp < 31) then
begin
// Common normalized number
Exp := Exp + (127 - 15);
Mantissa := Mantissa shl 13;
Dst := (Sign shl 31) or (LongWord(Exp) shl 23) or Mantissa;
// Result := Power(-1, Sign) * Power(2, Exp - 15) * (1 + Mantissa / 1024);
end
else if (Exp = 0) and (Mantissa = 0) then
begin
// Zero - preserve sign
Dst := Sign shl 31;
end
else if (Exp = 0) and (Mantissa <> 0) then
begin
// Denormalized number - renormalize it
while (Mantissa and $00000400) = 0 do
begin
Mantissa := Mantissa shl 1;
Dec(Exp);
end;
Inc(Exp);
Mantissa := Mantissa and not $00000400;
// Now assemble normalized number
Exp := Exp + (127 - 15);
Mantissa := Mantissa shl 13;
Dst := (Sign shl 31) or (LongWord(Exp) shl 23) or Mantissa;
// Result := Power(-1, Sign) * Power(2, -14) * (Mantissa / 1024);
end
else if (Exp = 31) and (Mantissa = 0) then
begin
// +/- infinity
Dst := (Sign shl 31) or $7F800000;
end
else //if (Exp = 31) and (Mantisa <> 0) then
begin
// Not a number - preserve sign and mantissa
Dst := (Sign shl 31) or $7F800000 or (Mantissa shl 13);
end;
// Reinterpret LongWord as Single
Result := PSingle(@Dst)^;
end;
Very interesting, but I could not get it to work with my Delphi7.
program Half_test;
{$APPTYPE CONSOLE}
{$Include HalfFloat.pas}
var
X, Z : single;
Y : THalfFloat;
begin { TODO -oUser -cConsole Main : Insert code here }
decimalseparator := ‘.’;
x := 1439.156;
y := FloatToHalf(X);
z := HalfToFloat(y);
writeln(x:10:3);
writeln(z:10:3);
readln;
end.
I was expecting
1439.156 for both x and z, but I got the value 1440.000 for the z value.
What could be wrong?
Regards Frank (Norway)
Seems like some rounding issue maybe. Value of Z should be 1439.0 (checked with several reference C implementations).