The Lab

Mobile GPU Floating Point Accuracy Variances    April 4, 2013

When coding GPU shaders for multiple target platforms it is important to consider the differences between hardware implementations and their impact. This is especially true when creating Natural User Interfaces (NUIs), where the use of shaders for visual enhancement or experience augmentation is crucial and the results cannot vary between OS platforms or devices.

One of the key differentiators between mobile GPU families is the capability of the computational units. These differences normally show up in the handling of complex code or in visual artifacts created by the rendering scheme, especially on tile-based systems. They can sometimes be overcome with simpler shader algorithms or creative approaches to the geometry constructs being used.

However, the more significant contributor to the quality of the shader output lies in the accuracy of the floating point calculations within the GPU. This contrasts greatly with CPU computational accuracy, and variances are common between the different mobile GPU implementations from ARM Mali, Imagination Technologies, Vivante and others.

Being able to compare the accuracy of various GPU models allows us to prepare for the lowest-accuracy units, ensuring the shader output is still acceptable, while optimizing for impressive visual effects on the better-performing hardware. Our uSwish NUI platform makes direct use of this information to ensure a consistent look and feel, a key differentiator in the User Interface market.

In a perfect world only one reference implementation would be needed, but this is simply not viable with today’s hardware. At worst, we may need several implementations, targeting the various accuracy levels, to ensure a common visual effect and consistent user experience. If calculation errors occur outside the usable range of the floating point units, we must account for that to prevent undesirable effects.
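One way to drive that choice at runtime, sketched in C below, is to ask the driver how much fragment-shader precision is actually available and pick a shader variant accordingly. This is an illustration only; the helper name is ours, not the actual uSwish implementation:

#include <GLES2/gl2.h>

/* Illustrative helper: query the precision of highp floats in the
   fragment stage. Per the OpenGL ES 2.0 spec, if highp is unsupported
   there, glGetShaderPrecisionFormat reports zero bits of precision. */
static const char *pick_fragment_precision_header(void)
{
    GLint range[2];  /* log2 of min/max representable magnitudes */
    GLint precision; /* log2 of the relative precision */

    glGetShaderPrecisionFormat(GL_FRAGMENT_SHADER, GL_HIGH_FLOAT,
                               range, &precision);

    /* Prepend the result to the shader source; a mediump-only device
       could also be given a simpler fallback effect. */
    return (precision > 0) ? "precision highp float;\n"
                           : "precision mediump float;\n";
}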

Let’s compare some current mobile devices using some simple fragment shader code:

precision highp float;
uniform vec2 resolution;

void main( void )
{
    // Normalized x, flipped so the value runs 1.0 -> 0.0 across the screen
    float x = 1.0 - ( gl_FragCoord.x / resolution.x );

    // Split the screen vertically into 26 bands
    float y = ( gl_FragCoord.y / resolution.y ) * 26.0;

    // Each band uses one power of two: yp = 2^0 .. 2^25
    float yp = pow( 2.0, floor( y ) );

    // Adding the large integer yp discards low-order bits of fract(x)
    float fade = fract( yp + fract( x ) );

    // Draw the fade across 90% of each band, with a black separator
    if ( fract( y ) < 0.9 )
        gl_FragColor = vec4( vec3( fade ), 1.0 );
    else
        gl_FragColor = vec4( 0.0 );
}

This example calculates a varying fade level from bright white down to black over 26 bands on the screen. The further down the screen the smoothly blended line extends, the more precision we have in the floating point unit.
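As a rough sanity check, the same arithmetic can be modeled on the CPU. The C sketch below is our own construction, assuming round-to-zero and a given number of fractional significand bits p; it predicts that the fade collapses to black once the band index reaches p:

#include <math.h>
#include <stdio.h>

/* Model fade = fract(yp + fract(x)) for yp = 2^band on hardware that
   keeps p fractional significand bits: near 2^band the representable
   values are 2^(band - p) apart, so fract(x) is truncated to that
   spacing (round-to-zero assumed). */
static double model_fade(double fx, int band, int p)
{
    double spacing = ldexp(1.0, band - p);  /* 2^(band - p) */
    return floor(fx / spacing) * spacing;
}

int main(void)
{
    /* p = 10 roughly models a mediump-only fragment unit */
    for (int band = 0; band < 26; band++)
        printf("band %2d: fade(0.7) ~ %g\n", band, model_fade(0.7, band, 10));
    return 0;
}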

For reference we will use a desktop rendering of the shader; for our purposes this sample from a laptop nVidia GeForce GT 630M is more than enough:

[Figure: YOUi Labs Shader Comparison]

After comparing the output across different GPU chipsets we immediately see the difference in the usable range of the floating point units. It is important to note that this is not related to device performance, or even to GPU implementation differences between manufacturers; it is simply the computation range of the GPU itself. Most comparisons are done through tests of pure performance: triangles per second or texel fill rate. Although these numbers are valuable, they do not tell the full story of a GPU’s true capability. When applied to Natural User Interfaces these computational differences are even more important since, unlike games, there is no tolerance for visual artifacts.

To see the effect of these computational differences, or to examine the behavior on your own device, check out the following:

1) YouTube video: The Importance of Shaders, showing the result of these calculation errors

2) YOUi Labs Shader Effect Test, an Android application for viewing the shaders for comparison on a device

  • pip010 (http://twitter.com/ppetrovdotnet)

    What differences (hardware/software) are affecting the results for mobile GPUs?

    Are all those GPUs conformant with IEEE 754?

    • StuartRussell

      The GPU manufacturers must comply with the Khronos Group OpenGL ES 2.0 specification, which gives accuracy and range requirements in section 2.1.1:

      “We do not specify how floating-point numbers are to be represented or how operations on them are to be performed. We require simply that numbers’ floating-point parts contain enough bits and that their exponent fields are large enough so that individual results of floating-point operations are accurate to about 1 part in 10^5. The maximum representable magnitude of a floating-point number used to represent positional, normal, or texture coordinates must be at least 2^32; the maximum representable magnitude for colors must be at least 2^10. The maximum representable magnitude for all other floating-point values must be at least 2^32. x · 0 = 0 · x = 0 for any non-infinite and non-NaN x. 1 · x = x · 1 = x. x + 0 = 0 + x = x. 0^0 = 1. (Occasionally further requirements will be specified.)

      Most single-precision floating-point formats meet these requirements”

      • pip010

        ahhh so compliant does not imply ‘the same’, I keep ignoring that part about “standards” :(

  • Neil Trevett (http://twitter.com/neilt3d)

    Hopefully this is a short-term issue. OpenCL defines strict Numerical Compliance that essentially mirrors IEEE 754. See section 7 of the OpenCL spec: http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf

    Additionally, that compliance is exhaustively tested in the OpenCL conformance tests.

    Once mobile GPUs are built to support conformant OpenCL (that industry transition is underway and will be largely complete this year), this issue will disappear.

    It is an interesting question whether OpenGL should formally raise its precision requirements – at least for compute shaders.

    • pip010

      for games !? NO!

  • PurpleTallest

    The “drift error” seems to just be different default rounding modes (RZ vs RNE) in these GPUs, not an actual error.

  • Sandeep K

    Is this test rendering at the device’s native resolution? If so, it is not a very robust way to compare the FP accuracy of different GPUs. The test would end up testing a larger range of values on devices with high-resolution screens (due to “gl_FragCoord.x / resolution.x”), and FP accuracy is non-uniform across the range.

    One fix would be to render to an offscreen FBO that is the same size on all devices; the result could then be blitted onto the screen.

    (Apologies if this is a duplicate comment)

    • StuartRussell

      Yes, the rendering is to an FBO of fixed size on all platforms. Resolution is constant. The only variable is the floating point accuracy and algorithm being used.
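      For anyone reproducing the setup, a minimal ES 2.0 sketch of such a fixed-size FBO follows (the size and names are illustrative, not the exact test harness):

      #include <GLES2/gl2.h>

      enum { TEST_W = 512, TEST_H = 512 };  /* arbitrary fixed size */

      GLuint make_test_fbo(GLuint *out_tex)
      {
          GLuint fbo, tex;

          /* Color texture the shader renders into */
          glGenTextures(1, &tex);
          glBindTexture(GL_TEXTURE_2D, tex);
          glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEST_W, TEST_H, 0,
                       GL_RGBA, GL_UNSIGNED_BYTE, NULL);
          glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
          glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

          glGenFramebuffers(1, &fbo);
          glBindFramebuffer(GL_FRAMEBUFFER, fbo);
          glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                                 GL_TEXTURE_2D, tex, 0);

          /* Render the test shader with resolution = (TEST_W, TEST_H),
             then draw the texture as a screen-sized quad to display it;
             ES 2.0 has no glBlitFramebuffer. */
          *out_tex = tex;
          return fbo;
      }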

  • Tom Olson

    Interesting post – this study nicely points out the amount of variation there is in conformant ES 2.0 implementations. There are actually several things going on in these results – different aspects of the images reveal different things about the GPUs in question.

    The precision of the output is dominated by this line in the shader:

    float fade = fract(yp + fract(x));

    This expression discards low-order bits of fract(x) (causing it to lose precision), adds the result to an integer (with rounding), and then throws away the integer part of the sum. The number of bits of fract(x) discarded increases with increasing y; for example, with a 24-bit significand and yp = 2^20, only the top three bits of fract(x) survive the addition. Differences in the result for different GPUs are (mostly) due to:
    a) differences in what kind of rounding the implementation does, and
    b) differences in the number of bits in the floating point significand.

    Rounding Modes

    For the test shader, implementations that do standard IEEE-754 rounding (round-to-nearest-even) will show the symmetrical U-shaped pattern that we see in the results for the nVidia GT 630M, Mali-400, and Mali-T604. It looks like the SGX-544 is trying to do the same, but something goes wrong at large values of yp – perhaps a bug in the pow() implementation? The “drift from the left edge” that you refer to is NOT an error in the calculation; it is a result of the fact that these implementations are rounding (yp + fract(x)) to the nearest representable floating point value. In the limit, when fract(x) is just slightly less than one, the nearest representable value is the integer yp + 1. Since the final fract() operation discards the integer part, the result is zero and we get black along the left edge. Implementations that don’t show “drift from the left edge”, such as Tegra 3, GC4000, and Adreno 225, are using round-to-zero or possibly round-to-minus-infinity. Round-to-zero is easier to implement, and it may be what you want in some applications, but it results in both a systematic bias and a larger average absolute error than round-to-nearest. That’s why round-to-nearest-even is the default rounding mode for IEEE-754 (the standard for floating-point arithmetic).
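    This can be reproduced on a desktop CPU. The C sketch below is a rough analogue (our own construction; it assumes the C runtime honors fesetround, which may require #pragma STDC FENV_ACCESS ON or a flag such as -frounding-math on some compilers):

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* yp = 2^22, where one single-precision ulp is 0.5, plus a
           fract(x) just below one. volatile keeps the compiler from
           folding the sum at compile time in the default mode. */
        volatile float yp = 4194304.0f;
        volatile float fx = 0.9999f;

        fesetround(FE_TONEAREST);  /* round-to-nearest-even (IEEE default) */
        float rne = yp + fx;       /* rounds up to yp + 1.0 */

        fesetround(FE_TOWARDZERO); /* round-to-zero */
        float rtz = yp + fx;       /* rounds down to yp + 0.5 */

        printf("RNE fract: %g\n", rne - floorf(rne)); /* 0.0: black edge */
        printf("RTZ fract: %g\n", rtz - floorf(rtz)); /* 0.5: no drift  */
        return 0;
    }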

    Significand Precision

    As StuartRussell commented, the OpenGL ES 2.0 API defines floating point fairly loosely. To help deal with this, the GLSL ES 1.0 shading language has precision qualifiers that allow shader programs to specify how much precision they want for a floating point variable. The first line of the test shader sets the default precision to “highp”, which corresponds to a minimum of 16 bits for the significand. But in OpenGL ES 2.0, support for highp is optional in fragment shaders, and in fact Tegra 3 and Mali-400 don’t support it. So, on those GPUs, you are actually getting mediump (minimum 10-bit significand). That’s the reason both of those devices start to underflow sooner than the other cores in the test. (Mali-T604, of course, does support highp – in fact, it is the only mobile device on your list to meet the frankly scary precision requirements of OpenCL Full Profile.) Ideally, your application should check for the “OES_fragment_precision_high” extension, which tells you whether the GPU supports highp in fragment shaders, before trying to compile a shader that uses highp.
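    As a concrete sketch of that check (ours, not from the post; inside the shader itself the predefined GL_FRAGMENT_PRECISION_HIGH macro serves the same purpose):

    #include <string.h>
    #include <GLES2/gl2.h>

    /* Returns nonzero when the fragment stage advertises highp support. */
    static int has_fragment_highp(void)
    {
        const char *ext = (const char *)glGetString(GL_EXTENSIONS);
        return ext && strstr(ext, "GL_OES_fragment_precision_high") != NULL;
    }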

    Having said that – it’s great that you are doing this work and looking hard at the behavior of arithmetic on various mobile GPUs. If you’d like to discuss, or want help in designing future tests, please get in touch – we’d be very interested in working on them with you.

    regards,

    Tom Olson
    Director, Graphics Research, ARM
    Chair, OpenGL ES Working Group

  • libv

    You seem to have missed something rather crucial here…

    The program you are running is a fragment shader. The Mali-400 is not a unified shader design. It has a standalone fragment shader (four of them in the case of the Exynos, hence the MP4 bit). And… this fragment shader is half-float only. That’s right, only 16-bit precision!

    Now though… your fragment shader program starts with “precision highp float;” and, guess what, this is simply not accepted by the Mali shader compiler. You probably saw it come back with an error and simply removed the line without thinking anything further of it. But the Mali shader compiler refuses that line _for a reason_: it is half-float only, per design.

    This really should be clearly stated when doing comparisons like this. Mali-400 is less precise in this test, per design, and your test methodology only tests fragment shader precision, not vertex shader precision.

    Now that these two issues are in the open, does the Mali-400 really perform that badly for your use case(s)? Or is the lack of precision only relevant to your “NUI” when it occurs in the vertex shader?

    I am amused by the PVR result, although not too surprised. It never has been very reliable or deterministic.

    – The author of the reverse engineered, open source graphics driver for the ARM Mali.

    • StuartRussell

      Thank you for your candid response. You are correct that the Mali-400 is a non-unified design, but that is a design choice, and it is not relevant to user expectations of accuracy within calculations. The test was done mainly to determine the cause of visual artifacts and suitability for GPGPU operations prior to OpenCL being commonly available. Since several complex graphics effects rely on complex math, they provide immediate visual feedback on the capabilities of the different devices.

      Such is also the case with the design choice of 16-bit precision. The Khronos specification allows for both medium and high precision modes, if the design supports them. This was intended to allow high performance in simpler shader calculations while providing higher accuracy when required. ARM’s choice not to implement a high precision mode is a design limitation compared to other products on the market. You are 100% correct: it is by design. That said, an argument for comparing at lower precision cannot be made, since the user expectation in this case is to use the highest available precision the specification allows, which means requesting highp mode. In all cases each device is given the chance, through the methods the OpenGL ES 2.0 specification provides, to perform at its highest precision for this test.

      For most usage scenarios the vertex shader is used for projection calculations and for preparing fragment shader inputs. We have done other measurements comparing vertex shader accuracy as well. Its ability to interpolate linearly for the fragment shader is definitely a source of potential visual impact, but a minor one compared to the fragment shader resolution. Errors encountered in the vertex shader can also sometimes be overcome by offloading operations to the CPU and providing pre-calculated uniforms, as sketched below. The bulk of the complex operations are usually done in the fragment shader, especially when using the GPU for GPGPU implementations; the accuracy of the vertex shader is irrelevant in that case.
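      As a small illustration of that offloading (the program handle and uniform name here are assumptions for illustration, not our actual code):

      #include <math.h>
      #include <GLES2/gl2.h>

      /* Compute in double precision on the CPU and pass the result
         down as a uniform rather than deriving it per-vertex.
         Assumes `prog` is the currently bound program. */
      void upload_scale(GLuint prog, double t)
      {
          double scale = pow(2.0, floor(t));  /* full FP64 accuracy */
          glUniform1f(glGetUniformLocation(prog, "u_scale"),
                      (float)scale);
      }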

      Either way, the results are interesting, and they show the hurdles these designs face in adopting the OpenCL specifications. Hopefully in a few years there will be no variance in this area, once all the implementations must meet the IEEE floating point specification. Until then, we need to find creative ways to showcase the benefits of each design and work around known weaknesses, and tests like this give great insight into what is causing visual errors or other artifacts.

  • Tom Olson

    This topic turned out to be very interesting! I’ve started a series of blogs over at ARM.com talking in more detail about how this shader works and what it means, and what else we can learn from shaders of this type. The first one is up now at http://blogs.arm.com/multimedia/965-benchmarking-floating-point-precision-in-mobile-gpus/, and there are two more in the pipeline. Thanks for getting the ball rolling – I look forward to discussing it with you.

    –Tom Olson

    • StuartRussell

      One of the key parts of this discussion is bringing the design world (the users of the technology) together with the engineering side (the makers) to appreciate what is there and why it does what it does. One of the reasons this blog has had such an impact is that it was intended to distill the effects of these engineering choices into a visual format that users can relate to and reference. We engineers all understand the implementations and the causes of the visual distortions – it is, after all, introductory computer principles – but simply explaining it does not fix the issues or concerns of the users: the artists who create using these tools. I look forward to your future posts delving into the choices that were made, the reasons and arguments for those choices, and suggestions for artists to maximize their capabilities.

  • david moloney

    Our choice at Movidius since the outset in 2005 has been to implement all 4 IEEE rounding modes, even for fp16 (half) precision.

    The choice not to implement RNE has important implications for DSP and CV algorithms. For instance, the designers of IBM’s Cell SPE also made the decision to implement RTZ rather than RNE, with important consequences for FFT numerical performance:

    http://www.fftw.org/cell/

    “The SPEs are fully IEEE-754 compliant in double precision. In single precision, they only implement round-towards-zero as opposed to the standard round-to-even mode. (The PPE is fully IEEE-754 compliant like all other PowerPC implementations.) Because of the rounding mode, FFTW is less accurate when running on the SPEs than on the PPE. The accuracy loss is hard to quantify in general, but as a rough guideline, the L2 norm of the relative roundoff error for random inputs is 4-8 times larger than the corresponding calculation in round-to-even arithmetic. In other words, expect to lose 2 to 3 bits of accuracy.

    FFTW currently does not use any algorithm that degrades accuracy to gain performance on the SPE. One implication of this choice is that large 1D transforms run slower than they would if we were willing to sacrifice another bit or so of accuracy.”

    • StuartRussell

      Wow, that is great to hear. Considering how quickly the requirements and expectations are evolving in today’s market, it is crucial for companies to make solid long-term engineering choices that will benefit the users of the technology well into the future. Since the requirements of those users are not fixed, and can become rather creative, design choices of this nature are sure to pay off in the long term as they eventually become the baseline standard, forcing others to catch up.