Huge Memory Bandwidth, but not for every Block

One highly intriguing aspect of the M1 Max, maybe less so for the M1 Pro, is the massive memory bandwidth that is available for the SoC.

Apple was keen to market their 400GB/s figure during the launch, but the number is so far beyond the norm that it leaves a lot of open questions about how the chip can actually take advantage of this kind of bandwidth, so it’s one of the first things to investigate.

Starting off with our memory latency tests, the new M1 Max changes system memory behaviour quite significantly compared to what we’ve seen on the M1. On the core and L2 side of things, there haven’t been any changes and we consequently don’t see many differences in the results – it’s still a 3.2GHz peak core with 128KB of L1D at a 3-cycle load-to-load latency, and a 12MB L2 cache.

Where things are quite different is when we enter the system cache: instead of 8MB, on the M1 Max it’s now 48MB, and it is also a lot more noticeable in the latency graph. While much larger, it’s also evidently slower than the M1 SLC – the exact figures here depend on access pattern, but even the linear chain access shows that data has to travel a longer distance than on the M1 and corresponding A-series chips.

DRAM latency goes up this generation, even though the M1 Max’s memory is faster on paper in terms of frequency and bandwidth. At a comparable 128MB test depth, the new chip is roughly 15ns slower. The larger SLCs, the more complex chip fabric, and possibly worse timings on the new LPDDR5 memory could all add to the regression we’re seeing here. In practical terms, because the SLC is so much bigger this generation, workload latencies should still be lower for the M1 Max due to the higher cache hit rates, so performance shouldn’t regress.
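For readers unfamiliar with how a latency test of this kind works, the usual approach is a pointer chase: every load depends on the result of the previous one, so the time per step approximates load-to-load latency at a given test depth. The sketch below is an illustration only, not the tool used for the graphs here. The names `build_chain` and `chase` are hypothetical, and because it runs in Python the measured time is dominated by interpreter overhead rather than by the memory system. A real test would be written in C or assembly.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm produces a single cycle through all n slots, so the
    walk cannot get trapped in a short loop that fits in a small cache.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees one full cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` dependent loads.

    Returns (final index, nanoseconds per step).
    """
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]  # the next address depends on this load
    elapsed = time.perf_counter() - start
    return idx, elapsed / steps * 1e9


# Test depth is set by the chain length; sweep it to trace a latency curve.
chain = build_chain(1 << 16)
end, ns_per_step = chase(chain, len(chain))
print(f"back at index {end} after one lap, {ns_per_step:.1f} ns/step (interpreter-bound)")
```

Sweeping the chain length from a few kilobytes up past the cache sizes is what produces the steps in a latency graph. A sequential chain instead of a shuffled one corresponds to the friendlier linear access pattern.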

A lot of people in the HPC audience were extremely intrigued to see a chip with such massive bandwidth – not because they care about the GPU or other offload engines of the SoC, but because of the possibility of the CPUs having access to such immense bandwidth, something that otherwise is only possible on larger server-class CPUs that cost a multiple of what the new MacBook Pros are sold at. It was also one of the first things I tested out – to see exactly how much bandwidth the CPU cores have access to.

Unfortunately, the news here isn’t the best-case scenario that we hoped for, as the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side.

From a single-core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric up to 102GB/s. This outperforms any other design in the industry by multiple factors. We had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.

The little hump between 12MB and 64MB should be the 48MB SLC. The reduction in bandwidth at the 12MB mark signals that the core is somehow limited in bandwidth when evicting cache lines back to the upper memory system. Our test here consists of reading, modifying, and writing back cache lines, with a 1:1 R/W ratio.
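A read-modify-write pass with a 1:1 R/W ratio can be sketched as follows. This is a rough illustration rather than the actual test: the 64-byte line size, the function names, and the choice to count read and write traffic together are all assumptions made for the example, and in Python the loop is bound by the interpreter, not by memory.

```python
import time

CACHE_LINE_BYTES = 64  # assumed line size for this sketch; not stated in the text


def rmw_pass(buf, line=CACHE_LINE_BYTES):
    """Read, modify and write back one byte in every cache line of buf."""
    lines = 0
    for offset in range(0, len(buf), line):
        buf[offset] = (buf[offset] + 1) & 0xFF  # one load and one store per line
        lines += 1
    return lines


def traffic_gb_per_s(lines, seconds, line=CACHE_LINE_BYTES):
    """Bandwidth with reads and writes both counted (1:1 R/W ratio)."""
    bytes_moved = 2 * lines * line  # each line is read once and written once
    return bytes_moved / seconds / 1e9


# The buffer size is the test depth; 4MB here is an arbitrary example value.
buf = bytearray(4 * 1024 * 1024)
start = time.perf_counter()
touched = rmw_pass(buf)
elapsed = time.perf_counter() - start
print(f"{touched} lines, {traffic_gb_per_s(touched, elapsed):.3f} GB/s (interpreter-bound)")
```

Because every modified line eventually has to be written back, a pass like this exercises eviction bandwidth as well as fill bandwidth, which is why a limit on evictions shows up as a dip once the working set exceeds the L2.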

Going from 1 core/thread to 2, what the system is actually doing is spreading the workload across the two performance clusters of the SoC, so each thread is on its own cluster and has full access to that cluster’s 12MB of L2. The “hump” after 12MB reduces in size, ending earlier now at +24MB, which makes sense as the 48MB SLC is now shared between two cores. Bandwidth here increases to 186GB/s.

Adding a third thread introduces a bit of an imbalance across the clusters, and DRAM bandwidth goes to 204GB/s. A fourth thread lands us at 224GB/s, and this appears to be the limit on the SoC fabric that the P-cores are able to achieve, as adding further cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in that the bandwidth is able to jump up again, to a maximum of 243GB/s.
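The diminishing returns are easier to see as per-thread increments. The short sketch below only restates the figures quoted above; the helper name `incremental_gains` is made up for the example.

```python
# DRAM bandwidth in GB/s by number of P-core threads, as quoted in the text.
p_core_bandwidth = {1: 102, 2: 186, 3: 204, 4: 224}
with_e_cores = 243  # peak once the E-core cluster is added


def incremental_gains(series):
    """Return {threads: extra GB/s gained over the previous thread count}."""
    counts = sorted(series)
    return {n: series[n] - series[p] for p, n in zip(counts, counts[1:])}


gains = incremental_gains(p_core_bandwidth)
e_core_gain = with_e_cores - p_core_bandwidth[4]

for threads, gain in gains.items():
    print(f"thread {threads}: +{gain} GB/s")
print(f"E-core cluster: +{e_core_gain} GB/s")
```

The second thread, which lands on the other performance cluster, adds 84GB/s, while the third and fourth add only 18GB/s and 20GB/s. That pattern is consistent with a per-cluster limit rather than a per-core one.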

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of. More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.
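The theoretical figures follow from the width and speed of the memory interface. The sketch below assumes a 512-bit LPDDR5-6400 interface for the M1 Max and a 256-bit one for the M1 Pro; those widths are not stated in this section and are treated as assumptions here, as is the helper name `peak_bandwidth_gbs`.

```python
def peak_bandwidth_gbs(bus_width_bits, transfers_per_second):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


# Assumed configurations (not stated in this section): LPDDR5 at 6400 MT/s.
m1_max_peak = peak_bandwidth_gbs(512, 6400e6)
m1_pro_peak = peak_bandwidth_gbs(256, 6400e6)

cpu_peak_measured = 243  # GB/s, the CPU-side maximum quoted in the text
cpu_share_of_max = cpu_peak_measured / m1_max_peak

print(f"M1 Max peak: {m1_max_peak:.1f} GB/s, M1 Pro peak: {m1_pro_peak:.1f} GB/s")
print(f"CPU cores reach {cpu_share_of_max:.0%} of the M1 Max peak")
```

Under these assumptions the CPU cores reach a little under 60% of the M1 Max’s theoretical peak, and the measured 243GB/s sits only about 19% above the M1 Pro’s theoretical ceiling.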

That raises the question: why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind; however, in my testing, I’ve had extreme trouble finding workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify one yet.

That leaves everything else on the SoC: the media engines, the NPU, and workloads that simply stress all parts of the chip at the same time. The new media engines on the M1 Pro and Max are now able to decode and encode ProRes RAW formats. The above clip is a 5K 12-bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real time, it’s able to do so at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frame rates. The SoC DRAM bandwidth while seeking around was roughly 40-50GB/s – I imagine that workloads that stress the CPU, GPU, and media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.


493 Comments


  • Ryan Smith - Monday, October 25, 2021 - link

    Thanks. Fixed!
  • 5j3rul3 - Monday, October 25, 2021 - link

    It's amazing.

    Is there any analysis for ProMotion, M1 Max GPU ray tracing...?
  • dada_dave - Monday, October 25, 2021 - link

    Ray tracing is in Metal, but as of yet there is no GPU hardware-accelerated ray tracing
  • Kangal - Monday, October 25, 2021 - link

    Really impressive chip.
    I noted my satisfaction/dissatisfaction a whole year ago with the original Apple M1. I suggested that Apple should release a family of chipsets for their devices. It was mainly for being more competitive and having better product segmentation. This didn’t happen, and it looks like it's only somewhat happening. Not to mention, they could've done this transition even earlier, like a year or two ago. Also they could update their “chipset-family” with the subsequent architectural improvements per generation. For instance:

    Apple M1, ~1W, only small cores, 1cu GPU... for 2in watch, wearables
    Apple M10, ~3W, 2 large cores, 4cu GPU... for 5in phones, iPods
    Apple M20, ~5W, 3 large cores, 4cu GPU... for 7in phablets or Mini iPad
    Apple M30, ~7W, 4 large cores, 8cu GPU… for 9in tablet, ultra thin, fanless
    Apple M40, ~10W, 8 large cores, 8cu GPU… for 11in laptop, ultra thin, fanless
    Apple M50, ~15W, 8 large cores, 16cu GPU… for 14in laptop, thin, active cooled
    Apple M60, ~25W, 8 large cores, 32cu GPU… for 17in laptop, thick, active cooled
    Apple M70, ~45W, 16 large cores, 32cu GPU… for 23in iMac, thick, AC power
    Apple M80, ~85W, 16 large cores, 64cu GPU... for 31in iMac+, thicker, AC power
    Apple M90, ~115W, 32 large cores, 64cu GPU…. for Mac Pro, desktop, strong cooling

    …and after 1.5 years, they can move onto the next refined architecture/node, and repeat the cycle every 18 months. The naming could be pretty simple as well, for example: in 2020 it was M50, then in 2021 their new model is the M51, then it is M52, then M53, then M54, etc. This was the lineup that I had hoped for. Kinda bummed they didn't rush out the gate with such a strong lineup, and they possibly may not in the future.
  • rmullns08 - Monday, October 25, 2021 - link

    With how many SKUs Apple already has with just the M1 Pro/Max configurations, it would likely be a supply chain nightmare to try to manage 10 CPUs as well.
  • gobaers - Monday, October 25, 2021 - link

    Page 4 should be "put succinctly" not "succulently." Even if we do appreciate water efficiency in our chip manufacturing process ;)
  • paulraphael - Monday, October 25, 2021 - link

    "Put succulently, the new M1 SoCs prove that Apple ...."
    A rare case of autocorrect improving an idea.
  • Hifihedgehog - Monday, October 25, 2021 - link

    Mmmunchy Krunchy Dee-licious.
  • GC2:CS - Monday, October 25, 2021 - link

    So hardware is on one hand much upgraded like in terms of memory architecture but on the other hand it is still a year old Firestorm icestorm GPU and NPU.
    I wonder if LPDDR5 is simply not suited for iPhones but seems strange the A15 gives some upgrade to everything while sticking with DDR4.

    24 and 48 MB system caches were shown by apple. For brief moment they labeled their M1Pro area with 24 little parts as system cache. Max doubles the same part. I just was not sure one little part of SM equals 1 MB.

    So making two independent clusters with their own L2 helps compared to a single 8C/24 MB cluster ?
    After all firestorm is about 5 W per core so it is probably easy to fit many of them in, let's say, an upcoming 300 W desktop Apple silicon. The question is, is there a space for an even larger core than firestorm ? If a 5 W core is fast why not make a 20W core (even if less efficient) and put two of them into a desktop, along with a few dozen firestorms ? Like make firestorm the little core in the desktop.
    Honestly I could not take the idea that next year we will have a desktop PC with less powerful main cores than in a phone (how could I flex on my friends then ?)

    While M1’s are fast we have the A15 with supposed large gains in CPU and GPU efficiency, better NPU and 32 MB system cache already shipping. Seems like a good omen for those M2 generations ?
    So M2/Pro/Max will get up to 10/20/40 GPU cores 32/48/64 MB system cache 18/36/36 MB of L2 ?!?

    Apple Silicon lineup is getting confusing - not liking it much. A4-A15 is the best naming scheme for any piece of silicon I have ever seen. (Would be better if they started at A1).
  • StinkyPinky - Monday, October 25, 2021 - link

    Thanks for being the only place that actually did real world benchmarks. Some of these reviews around the web are god awful.

    Any chance you can do Civ 6? That always seems a good test of both CPU and GPU.
