16-bit floating point numbers are used mostly in computer graphics. They are also called half precision floating point numbers, or half floats, because they take half the bits of a 32-bit single precision float. There is one sign bit, a five-bit exponent, and ten bits of mantissa. Half floats are not really meant for arithmetic computations because of their limited precision (and the lack of support in common CPUs/FPUs).
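
As a compact summary of this layout, the value $x$ encoded by a half can be written as follows. The symbols $s$ (sign bit), $e$ (5-bit exponent field) and $m$ (10-bit mantissa field) are names chosen for this note; the exponent bias of 15 is the one used by the conversion code in this post.

$$
x =
\begin{cases}
(-1)^{s} \cdot 2^{e-15} \cdot \left(1 + \dfrac{m}{1024}\right) & 1 \le e \le 30 \quad \text{(normalized)} \\[6pt]
(-1)^{s} \cdot 2^{-14} \cdot \dfrac{m}{1024} & e = 0 \quad \text{(zero or denormalized)}
\end{cases}
$$

The remaining exponent value, $e = 31$, encodes infinity when $m = 0$ and NaN otherwise.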

Half floats first appeared in the early 2000s as a sample format for images and textures. Floats provide a higher dynamic range than regular 8-bit or 16-bit integer samples. On the other hand, the commonly used single and double precision floats have a much higher memory cost per pixel. Half floats have more reasonable memory requirements, and their precision is adequate for many uses in imaging.

16-bit float formats have been supported by ATI and NVidia GPUs for many years. I'm not sure about other IHVs, but at least all Direct3D 10 capable GPUs should support them.

Read on if you're interested in how to convert between half and single precision floats (Object Pascal code).

### Half/Single Conversion code

Finally, here's the code for converting to half floats and back to single precision floats. It's based on C++ code from the OpenEXR library (half class).

#### First some types and constants

Note that THalfFloat is just a distinct type based on Word, so a half is stored as a plain 16-bit unsigned integer.

    type
      THalfFloat = type Word;

    const
      HalfMin:     Single = 5.96046448e-08; // Smallest positive half
      HalfMinNorm: Single = 6.10351562e-05; // Smallest positive normalized half
      HalfMax:     Single = 65504.0;        // Largest positive half
      // Smallest positive e for which half (1.0 + e) != half (1.0)
      HalfEpsilon: Single = 0.00097656;
      HalfNaN:     THalfFloat = 65535;
      HalfPosInf:  THalfFloat = 31744;
      HalfNegInf:  THalfFloat = 64512;
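
For reference, the numeric constants follow directly from the bit layout. This is a short derivation rather than anything taken from OpenEXR; $e$ and $m$ are names chosen here for the 5-bit exponent field and the 10-bit mantissa field, with an exponent bias of 15.

$$
\begin{aligned}
\text{HalfMin} &= 2^{-14} \cdot \tfrac{1}{1024} = 2^{-24} \approx 5.96046448 \times 10^{-8} && (e = 0,\ m = 1) \\
\text{HalfMinNorm} &= 2^{1-15} = 2^{-14} \approx 6.10351562 \times 10^{-5} && (e = 1,\ m = 0) \\
\text{HalfMax} &= 2^{30-15} \cdot \left(1 + \tfrac{1023}{1024}\right) = 65504 && (e = 30,\ m = 1023) \\
\text{HalfEpsilon} &= 2^{-10} = 0.0009765625 && \text{(one mantissa step at } 1.0\text{)}
\end{aligned}
$$

The three THalfFloat constants are raw bit patterns: HalfPosInf is 31744 = $7C00 (exponent 31, mantissa 0), HalfNegInf is 64512 = $FC00 (the same with the sign bit set), and HalfNaN is 65535 = $FFFF (all bits set).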

#### Single precision float to half

    function FloatToHalf(Float: Single): THalfFloat;
    var
      Src: LongWord;
      Sign, Exp, Mantissa: LongInt;
    begin
      Src := PLongWord(@Float)^;
      // Extract sign, exponent, and mantissa from Single number
      Sign := Src shr 31;
      Exp := LongInt((Src and $7F800000) shr 23) - 127 + 15;
      Mantissa := Src and $007FFFFF;

      if (Exp > 0) and (Exp < 30) then
      begin
        // Simple case - round the significand and combine it with the sign and exponent.
        // The rounded mantissa is added (not or-ed) to the exponent so that a carry
        // out of its 10 bits increments the exponent.
        Result := (Sign shl 15) or ((Exp shl 10) + ((Mantissa + $00001000) shr 13));
      end
      else if Src = 0 then
      begin
        // Input float is zero - return zero
        Result := 0;
      end
      else
      begin
        // Difficult case - lengthy conversion
        if Exp <= 0 then
        begin
          if Exp < -10 then
          begin
            // Input float's value is less than HalfMin, return zero
            Result := 0;
          end
          else
          begin
            // Float is a normalized Single whose magnitude is less than HalfMinNorm.
            // We convert it to denormalized half.
            Mantissa := (Mantissa or $00800000) shr (1 - Exp);
            // Round to nearest
            if (Mantissa and $00001000) > 0 then
              Mantissa := Mantissa + $00002000;
            // Assemble Sign and Mantissa (Exp is zero to get denormalized number)
            Result := (Sign shl 15) or (Mantissa shr 13);
          end;
        end
        else if Exp = 255 - 127 + 15 then
        begin
          if Mantissa = 0 then
          begin
            // Input float is infinity, create infinity half with original sign
            Result := (Sign shl 15) or $7C00;
          end
          else
          begin
            // Input float is NaN, create half NaN with original sign and mantissa
            Mantissa := Mantissa shr 13;
            // Keep at least one mantissa bit set, otherwise the result would be infinity
            if Mantissa = 0 then
              Mantissa := 1;
            Result := (Sign shl 15) or $7C00 or Mantissa;
          end;
        end
        else
        begin
          // Exp is > 0 so input float is normalized Single
          // Round to nearest
          if (Mantissa and $00001000) > 0 then
          begin
            Mantissa := Mantissa + $00002000;
            if (Mantissa and $00800000) > 0 then
            begin
              Mantissa := 0;
              Exp := Exp + 1;
            end;
          end;

          if Exp > 30 then
          begin
            // Exponent overflow - return infinity half
            Result := (Sign shl 15) or $7C00;
          end
          else
            // Assemble normalized half
            Result := (Sign shl 15) or (Exp shl 10) or (Mantissa shr 13);
        end;
      end;
    end;
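
The rounding step in the simple case is easy to overlook, so here is a minimal sketch that isolates it. `PackHalf` is a hypothetical helper written for this illustration (it is not part of the unit or of OpenEXR), and it only covers biased half exponents 1 to 29. Rounding can push the mantissa to 1024, one past its 10-bit range; when that happens the carry has to increment the exponent, which is why the sketch combines the two fields with addition.

```pascal
program HalfRounding;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper, not part of the original unit: packs a biased half
// exponent (1..29) and a 23-bit Single mantissa into the low 15 bits of a half.
function PackHalf(Exp, Mantissa: LongWord): Word;
begin
  // Adding $1000 (half of the 13 discarded bits) before the shift rounds to
  // nearest. Using + rather than or lets a mantissa carry bump the exponent.
  Result := Word((Exp shl 10) + ((Mantissa + $00001000) shr 13));
end;

begin
  // Mantissa bits of 1439.156 stored as a Single; exponent field 25 means 2^10
  WriteLn(IntToHex(PackHalf(25, $33E4FE), 4)); // 659F: exponent 25, mantissa 415
  // Mantissa that rounds up past 10 bits: the carry moves into the exponent
  WriteLn(IntToHex(PackHalf(25, $7FF000), 4)); // 6800: exponent 26, mantissa 0
end.
```

The second call corresponds to the input 2047.5, which sits halfway between the halfs 2047 and 2048 and is rounded up to 2048 by this scheme.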

#### Half to single precision float

    function HalfToFloat(Half: THalfFloat): Single;
    var
      Dst, Sign, Mantissa: LongWord;
      Exp: LongInt;
    begin
      // Extract sign, exponent, and mantissa from half number
      Sign := Half shr 15;
      Exp := (Half and $7C00) shr 10;
      Mantissa := Half and 1023;

      if (Exp > 0) and (Exp < 31) then
      begin
        // Common normalized number
        Exp := Exp + (127 - 15);
        Mantissa := Mantissa shl 13;
        Dst := (Sign shl 31) or (LongWord(Exp) shl 23) or Mantissa;
        // Result := Power(-1, Sign) * Power(2, Exp - 15) * (1 + Mantissa / 1024);
      end
      else if (Exp = 0) and (Mantissa = 0) then
      begin
        // Zero - preserve sign
        Dst := Sign shl 31;
      end
      else if (Exp = 0) and (Mantissa <> 0) then
      begin
        // Denormalized number - renormalize it
        while (Mantissa and $00000400) = 0 do
        begin
          Mantissa := Mantissa shl 1;
          Dec(Exp);
        end;
        Inc(Exp);
        Mantissa := Mantissa and not $00000400;
        // Now assemble normalized number
        Exp := Exp + (127 - 15);
        Mantissa := Mantissa shl 13;
        Dst := (Sign shl 31) or (LongWord(Exp) shl 23) or Mantissa;
        // Result := Power(-1, Sign) * Power(2, -14) * (Mantissa / 1024);
      end
      else if (Exp = 31) and (Mantissa = 0) then
      begin
        // +/- infinity
        Dst := (Sign shl 31) or $7F800000;
      end
      else //if (Exp = 31) and (Mantissa <> 0) then
      begin
        // Not a number - preserve sign and mantissa
        Dst := (Sign shl 31) or $7F800000 or (Mantissa shl 13);
      end;

      // Reinterpret LongWord as Single
      Result := PSingle(@Dst)^;
    end;
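
The commented-out `Power` formulas in HalfToFloat can also be used on their own as a slow but simple reference decoder, which is handy for cross-checking the bit manipulation. The following is a minimal sketch under that assumption: `HalfBitsToValue` is a hypothetical name, it takes the raw 16 bits as a Word, and it deliberately ignores infinity and NaN (exponent 31).

```pascal
program HalfReference;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Math;

// Hypothetical helper, not part of the original unit: decodes the bits of a
// half arithmetically. Infinity and NaN (exponent field 31) are not handled.
function HalfBitsToValue(Bits: Word): Double;
var
  Sign, Exponent, Mantissa: Integer;
begin
  Sign := Bits shr 15;
  Exponent := (Bits shr 10) and $1F;
  Mantissa := Bits and $3FF;
  if Exponent = 0 then
    // Zero or denormalized: no implicit leading one, fixed exponent of -14
    Result := Mantissa / 1024 * IntPower(2, -14)
  else
    // Normalized: implicit leading one, exponent bias of 15
    Result := (1 + Mantissa / 1024) * IntPower(2, Exponent - 15);
  if Sign = 1 then
    Result := -Result;
end;

begin
  WriteLn(HalfBitsToValue($3C00):0:1); // 1.0     (exponent 15, mantissa 0)
  WriteLn(HalfBitsToValue($7BFF):0:1); // 65504.0 (largest finite half)
  WriteLn(HalfBitsToValue($659F):0:1); // 1439.0  (exponent 25, mantissa 415)
end.
```

For any finite half, the result should match `HalfToFloat` applied to the same bits.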

Very interesting, but I could not get it to work with my Delphi7.

    program Half_test;

    {$APPTYPE CONSOLE}

    {$Include HalfFloat.pas}

    var
      X, Z : single;
      Y : THalfFloat;

    begin
      { TODO -oUser -cConsole Main : Insert code here }
      decimalseparator := '.';
      x := 1439.156;
      y := FloatToHalf(X);
      z := HalfToFloat(y);
      writeln(x:10:3);
      writeln(z:10:3);
      readln;
    end.

I was expecting 1439.156 for both x and z, but I got the value 1440.000 for z.

What could be wrong?

Regards Frank (Norway)

Seems like some rounding issue maybe. Value of Z should be 1439.0 (checked with several reference C implementations).
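
To see why 1439.0 is the expected result, note that with 10 mantissa bits the representable halfs between 1024 and 2048 are exactly 1.0 apart, so 1439.156 cannot survive the round trip unchanged. The sketch below illustrates this arithmetic only; it does not explain where the 1440.000 came from. `NearestHalfValue` is a hypothetical helper that assumes a positive input inside the normalized half range (HalfMinNorm to HalfMax).

```pascal
program HalfSpacing;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Math;

// Hypothetical helper: rounds a positive value in the normalized half range
// to the nearest value representable with a 10-bit mantissa.
function NearestHalfValue(X: Double): Double;
var
  Step: Double;
begin
  // Neighbouring halfs within [2^e, 2^(e+1)) are 2^(e-10) apart
  Step := IntPower(2, Floor(Log2(X)) - 10);
  Result := Round(X / Step) * Step;
end;

begin
  WriteLn(NearestHalfValue(1439.156):0:3); // 1439.000
end.
```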