• GomaEspumaRegional@alien.topB
    link
    fedilink
    English
    arrow-up
    1
    ·
    1 year ago

    Standard float16 uses 1bit sign + 5bit exponent + 10bit fraction.

    bfloat16 uses 1bit sign + 8bit exponent + 7bit fraction.

    bfloat16 basically gives the same exponent precision as a standard float32. But most neural networks don’t require a huge fraction range. So bfloat16 gives you the possibility of executing 2x 8bit NP FLOPs vs using a float32 to do the same 1x8bit NP FLOP.

    Having the ALU support this format allows for the scheduler to pack 4xbfloat16 that can be executed in parallel in a standard 64bit ALU. So basically you double or quadruple the 8bit NP FLOPs that you would get from using traditional float16/32 representations.