The Mali G76 µarch - Fine Tuning It

Section by Ryan Smith

While the biggest change in the G76 is by far Arm’s vastly wider cores, it’s not the only change to come to the Bifrost architecture. The company has also undertaken a few smaller changes to further optimize the architecture and improve performance and efficiency.

First off, Arm has added support for Int8 dot products within their ALUs. These operations are becoming increasingly important in machine learning inference, as the dot product is a critical operation in processing neural networks, and despite the limited precision, Int8 is still accurate enough for basic inference in a number of cases. To be sure, even the original Bifrost already natively supported Int8 data types, including packing 4 of them into a single lane, but the G76 becomes the first to be able to use them in a dot product in a single cycle.
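To put the operation in concrete terms, below is a minimal C sketch of what a 4-way Int8 dot product with 32-bit accumulate computes, similar in spirit to the dp4a-style instructions found on other GPU architectures. The function names are illustrative; this scalar model stands in for what G76’s hardware retires in a single cycle on four Int8 values packed into one 32-bit lane.

```c
#include <stdint.h>

/* Scalar model of a 4-way Int8 dot product with 32-bit accumulate.
 * On Mali-G76 the four int8 values are packed into a single 32-bit
 * lane and the whole operation completes in one cycle; earlier
 * Bifrost GPUs needed multiple operations for the same result. */
static int32_t dot4_i8(const int8_t a[4], const int8_t b[4], int32_t acc)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Inference kernels reduce to long chains of such dot products,
 * e.g. one output activation of a quantized fully-connected layer
 * (n assumed to be a multiple of 4 for simplicity): */
static int32_t fc_output(const int8_t *weights, const int8_t *inputs, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 4)
        acc = dot4_i8(&weights[i], &inputs[i], acc);
    return acc;
}
```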

As a result, Arm is touting a 2.7x increase in machine learning performance. This will of course depend on the workload – particularly the framework and model used – so it’s just a high-level approximation. But Arm is betting big on machine learning, so significantly speeding up GPU machine learning inference gives Arm’s customers another option for efficiently processing these neural networks.

Meanwhile, in part as a consequence of the better scalability of Mali-G76’s core design, Arm has also taken a look at other aspects of GPU scalability to improve performance. Their research found that another potential scaling bottleneck is the tiler, which could block the rest of the GPU if it stalled during a polygon writeback. As a result, Arm has moved from an in-order writeback mechanism to an out-of-order one, allowing polygons to be written back with more flexibility by bypassing those writeback stalls. Unfortunately, Arm is being somewhat mum on how this was implemented – generally, converting an in-order process to an out-of-order one is not a simple task – so we don’t have much further information on the matter.
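While Arm hasn’t disclosed the mechanism, the concept itself can be sketched: an in-order queue may only retire its head entry, so one stalled polygon blocks everything behind it, whereas an out-of-order queue may retire any ready entry. The following C sketch is purely a conceptual model, not Arm’s implementation:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool ready;   /* polygon data ready to be written back */
    /* ... polygon payload ... */
} WritebackSlot;

/* In-order: only the head of the queue may retire, so one stalled
 * polygon backs up every polygon behind it. Returns the index to
 * write back, or -1 if the tiler must stall this cycle. */
static int retire_in_order(const WritebackSlot *q, size_t head, size_t count)
{
    if (count > 0 && q[head].ready)
        return (int)head;
    return -1;   /* head not ready: everything behind it waits */
}

/* Out-of-order: any ready polygon may retire, bypassing the stall. */
static int retire_out_of_order(const WritebackSlot *q, size_t head,
                               size_t count, size_t capacity)
{
    for (size_t i = 0; i < count; i++) {
        size_t idx = (head + i) % capacity;  /* ring-buffer scan */
        if (q[idx].ready)
            return (int)idx;
    }
    return -1;   /* nothing ready at all */
}
```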

Arm has also made a subtle but important change to how their tile buffers can be used, in an effort to keep more traffic local to the GPU core. In certain cases, it’s now possible for applications that run out of color tile buffer space to spill over into the depth tile buffer. Arm specifically cites workloads involving heavy use of multiple render targets without MSAA as driving this change; the lack of MSAA means that the depth tile buffer is used only sparingly, while the multiple render targets rapidly chew through the color tile buffer. The net result is that it cuts down on the number of trips that need to be made to main memory, which is a rather expensive operation.
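For context, the workload Arm cites maps to something like the following OpenGL ES 3.0 setup: four single-sampled (non-MSAA) color render targets plus a depth attachment, as you’d find in a deferred-shading G-buffer. The attachment count and formats here are illustrative assumptions:

```c
#include <GLES3/gl3.h>

/* Four single-sampled color render targets plus depth. The four
 * color attachments are heavy users of the color tile buffer,
 * while without MSAA the depth tile buffer sits mostly idle. */
static GLuint make_gbuffer(GLsizei w, GLsizei h)
{
    GLuint fbo, color[4], depth;
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    glGenTextures(4, color);
    for (int i = 0; i < 4; i++) {
        glBindTexture(GL_TEXTURE_2D, color[i]);
        glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, w, h);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                               GL_TEXTURE_2D, color[i], 0);
    }

    glGenTextures(1, &depth);
    glBindTexture(GL_TEXTURE_2D, depth);
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH_COMPONENT24, w, h);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                           GL_TEXTURE_2D, depth, 0);

    /* Render to all four color targets at once. */
    const GLenum bufs[4] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1,
                             GL_COLOR_ATTACHMENT2, GL_COLOR_ATTACHMENT3 };
    glDrawBuffers(4, bufs);
    return fbo;
}
```

With a setup like this, each tile carries four times the color data of a single-target pass, which is exactly the case where borrowing the mostly idle depth tile buffer pays off.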

Speaking of spilling, the G76’s thread local storage mechanism has also been optimized for how it handles register spills. Now the GPU will attempt to group data chunks from spills together so that they can be more easily fetched later. This is in contrast to earlier GPUs, where register spills were scattered based on which SIMD lane the data ultimately belonged to.
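Arm hasn’t published the actual memory layout, but the C sketch below is one illustrative model of the difference, assuming an 8-wide warp and 32-bit registers: with per-lane scattering, refilling one spilled register for a whole warp touches eight widely separated addresses, whereas grouping the spills by slot turns it into a single contiguous read.

```c
#include <stddef.h>

#define WARP_SIZE  8   /* illustrative: G76 executes 8-wide warps */
#define SLOT_BYTES 4   /* one 32-bit register per spill slot */

/* Older layout (illustrative): each SIMD lane keeps its own private
 * spill area, so one spilled register's data ends up scattered
 * across WARP_SIZE widely separated addresses. */
static size_t spill_addr_scattered(size_t lane, size_t slot,
                                   size_t slots_per_lane)
{
    return (lane * slots_per_lane + slot) * SLOT_BYTES;
}

/* G76-style grouping (illustrative): the spills from all lanes of a
 * warp are grouped by slot, so refilling one spilled register for
 * the whole warp reads one contiguous WARP_SIZE * SLOT_BYTES block. */
static size_t spill_addr_grouped(size_t lane, size_t slot)
{
    return (slot * WARP_SIZE + lane) * SLOT_BYTES;
}
```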

Comments

  • eastcoast_pete - Friday, June 1, 2018

    I expect some headwind for this, but bear with me. Firstly, it's great that ARM keeps pushing forward on the graphics front; this does sound promising. Here's my crazy (?) question: would a Mali G76-based graphics card for a PC (laptop or desktop) be (a) feasible and (b) better/faster than Intel integrated graphics? Like many users, I have gotten frustrated with the crypto-craze-induced price explosion for NVIDIA and AMD dedicated graphics, and Intel seems to have thrown in the towel on making anything close to those when it comes to graphics. So, if one can run Win 10 on ARM chips, would a graphics setup with, let's say, 20 Mali G76 cores be worthwhile to have? How would it compare to lower-end dedicated graphics from the current duopoly? Is any company out there ambitious and daring enough to try?
  • eastcoast_pete - Friday, June 1, 2018

    Just to clarify: I mean a dedicated multi-core (20, please) PCIe-connected Mali graphics card in a PC with a grown-up Intel or AMD Ryzen CPU - hence "crazy", but maybe not. I know there will be some sort of ARM-CPU-based Win 10 laptops, but those target the market currently served by Celerons etc.
  • Alurian - Friday, June 1, 2018

    Arguably, Mali might one day be powerful enough to do interesting things with, should ARM choose to take that direction. But comparing Mali to the dedicated graphics units that AMD and NVIDIA have been working on for decades... certainly not in the short term. If it were that easy, Intel would have popped out a competitor chip by now.
  • Valantar - Friday, June 1, 2018

    I'd say it depends on your use case. For desktop usage and multimedia, it'd probably be decent, although it would use significantly more power than any iGPU simply due to being a PCIe device.

    On the other hand, for 3D and gaming, drivers are key (and far more complex), and ARM would have to play catch-up here with a decade or more of driver development from their competitors. It would not go well.
  • duploxxx - Friday, June 1, 2018

    "like many users, I have gotten frustrated with ... and Intel seems to have thrown in the towel..."

    How does that sound, you think?

    Easy solution: buy a Ryzen APU. More than enough CPU power to run Win 10 and a decent GPU, and if you think the Intel CPUs are better, then ******
  • eastcoast_pete - Friday, June 1, 2018

    Have you tried to buy a graphics card recently? I actually like the Ryzen chips, but once you add the (required) dedicated graphics card, it gets expensive fast. There is a place for a good, cheap graphics solution that still beats Intel integrated but doesn't break the bank. Think HTPC setups. My comment on Intel having thrown in the towel refers to them now fusing AMD dedicated graphics to their CPUs in recent months; they have clearly abandoned the idea of increasing the performance of their own designs (Iris, much?), and that is bad for the competitive situation in the lower-end graphics space.
  • jimjamjamie - Friday, June 1, 2018

    Ryzen 3 2200G
    Ryzen 5 2400G

    Thank me later.
  • dromoxen - Sunday, June 3, 2018

    2200GE
    2400GE
    Intel is far from having given up on graphics; in fact, they are plunging into it with both feet. They are demonstrating their own discrete card and hope for availability sometime in 2019. The AMD-powered Hades Canyon is just a stopgap, maybe even a little technology demonstrator, if you will. The promise of APU-accelerated processing is finally arriving, most especially for AI apps.
  • Ryan Smith - Friday, June 1, 2018

    Truthfully I don't have a good answer for you. A Mali-G76MP20 works out to 480 ALUs (8*3*20), which isn't a lot by PC standards. However ALUs alone aren't everything, as we can clearly see comparing NVIDIA to AMD, AMD to Intel, etc.

    At a high level, high core count Malis are meant to offer laptop-class performance. So I'd expect them to be performance competitive with an Intel GT2 configuration, if not ahead of them in some cases. (Note that Mali is only for iGPUs as part of an SoC; it's lacking a bunch of important bits necessary to be used discretely)

    If Arm gets their way, perhaps one day we'll get to see this. With Windows-on-ARM, there's no reason you couldn't eventually build an A76+G76 SoC for a Windows machine.
  • eastcoast_pete - Friday, June 1, 2018

    Thanks Ryan! I wouldn't expect Mali graphics in a PC to challenge the high end of dedicated graphics, but if they came close to an NVIDIA GT 1030 card while being significantly cheaper, I would be game to try. That being said, I realize that going from an SoC to an actual stand-alone part would require some heavy lifting. But then, there is an untapped market waiting to be served. Lastly, this must have occurred to the people at ARM graphics (the Mali team), and I wonder if any of them has ever speculated on how their newest & hottest would stack up against GT2, or entry-level NVIDIA and AMD solutions. Any off-the-record remarks?
