FP8
The 8-bit floating point (FP8) format is a compact numerical representation of real numbers. At half the width of FP16 and a quarter of FP32, it substantially reduces memory usage and increases effective memory bandwidth and throughput in deep learning systems.
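To make the savings concrete, here is a quick back-of-the-envelope sketch; the 7-billion-parameter model size is an illustrative assumption, and per-tensor scaling metadata is ignored.

```python
# Bytes needed to store the weights of a hypothetical 7-billion-parameter model
# at different precisions (scaling factors and other metadata ignored).
num_params = 7_000_000_000

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("FP8", 1)]:
    gib = num_params * bytes_per_param / 2**30
    print(f"{name}: {gib:.1f} GiB")

# FP32: 26.1 GiB, FP16: 13.0 GiB, FP8: 6.5 GiB
```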
Anatomy of an FP8
Unlike FP32, FP8 is not defined by the IEEE 754 standard, but two common variants exist. Every FP8 format uses 1 bit for the sign; how the remaining 7 bits are divided between the exponent and the mantissa depends on the context in which the FP8 values are used.
E4M3
Uses 4 bits for the exponent and 3 bits for the mantissa (significand). The extra mantissa bit gives E4M3 higher precision at the cost of a narrower dynamic range, which makes it the more common choice for inference.
E5M2
Uses 5 bits for the exponent and 2 bits for the mantissa (significand), trading precision for a wider dynamic range; this makes it a common choice for gradients during training.
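The bit layouts can be made concrete with a small decoder. The sketch below handles normal and subnormal encodings only and skips the NaN/Inf special cases; the exponent biases (7 for E4M3, 15 for E5M2) follow the usual convention for these formats.

```python
def decode_fp8(byte, exp_bits, mant_bits, bias):
    """Decode one FP8 byte into a Python float (normals and subnormals only)."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> mant_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << mant_bits) - 1)

    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (mantissa / (1 << mant_bits)) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1, biased exponent.
    return sign * (1.0 + mantissa / (1 << mant_bits)) * 2.0 ** (exponent - bias)


# E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7.
print(decode_fp8(0b0_0111_000, exp_bits=4, mant_bits=3, bias=7))   # 1.0
print(decode_fp8(0b0_1111_110, exp_bits=4, mant_bits=3, bias=7))   # 448.0 (E4M3 max)

# E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits, bias 15.
print(decode_fp8(0b0_01111_00, exp_bits=5, mant_bits=2, bias=15))  # 1.0
print(decode_fp8(0b0_11110_11, exp_bits=5, mant_bits=2, bias=15))  # 57344.0 (E5M2 max)
```

The largest representable magnitudes (448 for E4M3 versus 57344 for E5M2) illustrate the range-versus-precision trade-off between the two variants.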
NVIDIA
NVIDIA's Hopper and Blackwell GPU architectures support FP8 natively through their Tensor Cores. NVIDIA's Transformer Engine, TensorRT, and cuBLAS libraries expose FP8 to machine learning applications.
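A minimal sketch of how FP8 is enabled through Transformer Engine's PyTorch API, loosely following its quickstart; the names used here (te.Linear, te.fp8_autocast, recipe.DelayedScaling) come from that library, and running it assumes a Hopper-class or newer GPU.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A single FP8-capable linear layer; dimensions are arbitrary examples.
model = te.Linear(768, 768, bias=True).cuda()
inp = torch.randn(16, 768, device="cuda")

# HYBRID uses E4M3 in the forward pass and E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

# Layers inside this context run their matrix multiplies in FP8 on the Tensor Cores.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
out.sum().backward()
```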