Conclusions & Thoughts on Dense Compute

Truth be told, when I was in discussions with Supermicro about reviewing one of its Ice Lake systems, I wasn't sure what to expect. I asked my contact at the company for a system expected to be a popular all-around enterprise offering, one that could serve many of Supermicro's markets, and the SYS-120U-TNR fits that bill – with the understanding that companies are also asking for denser environments.

The desire to move from previously standard 2U designs to 1U designs, even for generic dual-socket systems, seems to be a feature of this next generation of enterprise deployments. Data centers and colocation centers have built infrastructure to support high-powered racks for AI – enterprises that run super-dense AI workloads now invest in 5U systems consuming 5kW+, enough that you can't even fill a 42U rack without exceeding standard rack power limits unless you have high-power infrastructure in place. The knock-on effect of better colo and enterprise infrastructure is that customers using generic all-round off-the-shelf systems can cut those racks of 2U infrastructure in half. This can also be combined with any benefit of moving from an older generation of processor to the new one.

This causes a small issue for those of us that review servers every now and then: a modern dual socket server in a home rack with some good CPUs can no longer be tested without ear protection. Normally it would be tested in a lower-than-peak fan mode, without additional thermal assistance; however, these systems require either fans at full speed or additional HVAC just to run standard tests. A modern datacenter lets these systems run as loud as they need, and the cooling environment is optimized for performance density regardless of fan speed. Enterprise customers are taking advantage of this at scale, and that's why companies like Supermicro are designing systems like the SYS-120U-TNR to meet those needs.

Dense Thoughts on Compute

What I think Supermicro is trying to do with the SYS-120U-TNR is cater to the biggest portion of demand across a variety of use cases. This system could be a single-CPU caching tier or a multi-tiered database processing hub; it could be used for AI acceleration in both training and inference; add a dual-slot NVIDIA GPU with a virtualization license and you could run several remote workers with CUDA access; or with multiple FPGAs it could be a hub for SmartNIC offload or development. I applaud the fact that Supermicro has quite capably built an all-round machine that can be configured to cater to so many markets.

One slight drawback from my perspective is the lack of a default network interface – even a simple gigabit connection – without an add-in card. Supermicro won't ship the system without an add-in NIC anyway; however, users will either have to add their own PCIe solution (taking up a slot) or rely on one of Supermicro's Ultra Riser networking cards drawing PCIe lanes from the processor. We could argue that Supermicro's decision allows for better flexibility, especially when space at the rear of a system is limited, but I'm still of the opinion that at least something should be there, hanging off the chipset.

On the CPU side of things, as we noted in our Intel 3rd Generation Xeon Scalable Ice Lake review, the processors themselves offer an interesting increase in generational performance, as well as key optimization points for things like AVX-512, SGX enclaves, and Optane DC Persistent Memory. The move up to PCIe 4.0, eight channels of DDR4-3200 memory, and a focus on an optimized software stack are plus points for the product, but if your workload falls outside those optimizable use cases, AMD's equivalent offerings seem to deliver more performance for the same cost, or in some instances at lower cost and lower power.

The Xeon Gold 6330s we are testing today are the updates to the 28-core Xeon Gold 6258R from the previous generation, running at the same power but at half the cost and much lower frequencies. There's a trade-off there: the Xeon 6330s aren't as fast, yet consume the same power – by charging half as much for the processors, Intel is trying to bring the TCO equation back to where it needs to be for its customers. The Ice Lake Xeon Gold 6348 is closer in frequency to the 6258R (2.6 GHz base vs 2.7 GHz base) and closer in price ($3072 list vs $3950), but despite the lower frequency it is rated at a higher TDP (235 W vs 205 W). In our Ice Lake review, the new 8380 beat the older 8280 because the power budget was higher, there were more cores, and we saw an uplift in IPC. The question is now more in the mid-range: while Intel struggles to get its new CPUs to match the old without changing list pricing, AMD allows customers to move from dual socket to single socket, all while increasing performance and reducing power costs.

This is somewhat inconsequential for today's review, in that Supermicro's system caters to the customers that require Intel for their enterprise infrastructure, regardless of processor performance. The SYS-120U-TNR is versatile and configurable for a lot of markets, ripe for an 'off-the-shelf' system deployment.

Comments
  • Elstar - Saturday, July 24, 2021 - link

    > All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.

    AVX-512, as an instruction set, was a huge leap forward compared to AVX/AVX2. So much so that Intel created the AVX-512VL extension that allows one to use AVX-512 instructions on vectors smaller than 512-bits. As a vector programmer, here are the things I like about AVX-512:

    1) Dedicated mask registers, and every instruction can take an optional mask for zeroing/merging results
    2) AVX-512 instructions can broadcast from memory without requiring a separate instruction
    3) More registers (not just wider ones)

    Also, and this is kind of hard to explain, but AVX/AVX2 as an instruction set is really annoying because it acts like two SSE units. So for example, you can't permute (or "shuffle" in Intel parlance) the contents of an AVX2 register as a whole. You can only permute the two 128-bit halves as if they were two SSE registers fused together. AVX-512 doesn't repeat this half-assed design approach.
  • mode_13h - Sunday, July 25, 2021 - link

    > 1) Dedicated mask registers and every instruction can take an optional
    > mask for zeroing/merging results

    This seems like the only major win. The rest are just chipping at the margins.

    More registers is a win for cases like fitting a larger convolution kernel or matrix row/column in registers, but I think it's really the GP registers that are under the most pressure.

    AVX-512 is not without its downsides, which have been well-documented.
  • Spunjji - Monday, July 26, 2021 - link

    @Elstar - Interesting info. Just makes me more curious as to how many of these things might be benefiting the 3DPM workload specifically. Another good reason for more people to get eyes on the code!
  • Dolda2000 - Saturday, July 24, 2021 - link

    >All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.
    I don't remember where it was posted any longer (it was in the comment section of some article over a year ago), but apparently 3DPM makes heavy use of wide (I don't recall exactly how wide) integer multiplications, which were made available in vectorized form in AVX-512.
  • dwbogardus - Saturday, July 24, 2021 - link

    Performance optimization is converged upon from two different directions: 1) the code users run to perform a task, and 2) the compute hardware upon which the code is intended to run. As an Intel engineer, for some time I was in a performance evaluation group. We ran many thousands of simulations of all kinds to quantify the performance of our processor and chipset designs before they ever went to silicon. This was in addition to our standard pre-silicon validation; pre-silicon performance validation was to demonstrate that the expected performance was being delivered. You may rest assured that every major architectural revision or addition to the silicon, and its power consumption, was justified by demonstrated performance improvements. Once the hardware is optimized, the coders dive into optimizing their code to take best advantage of the improved hardware. It is sort of like "double-bounded successive approximation" toward a higher performance target from both the HW and SW directions. No surprise that benchmarks are optimized to the latest, highest-performing hardware.
  • GeoffreyA - Sunday, July 25, 2021 - link

    Fair enough. But what if the legacy code path, in this case AVX2, were suboptimal?
  • mode_13h - Sunday, July 25, 2021 - link

    > You may rest assured that every major silicon architectural revision
    > or addition to the silicon and power consumption was justified
    > by demonstrated performance improvements.

    Well, it looks like you folks failed on AVX-512 -- at least, in Skylake/Cascade Lake:

    https://blog.cloudflare.com/on-the-dangers-of-inte...

    I experienced this firsthand, when we had performance problems with Intel's own OpenVINO framework. When we reported this to Intel, they confirmed that performance would be improved by disabling AVX-512. We applied *their* patch, effectively reverting it to AVX2, and our performance improved substantially.

    I know AVX-512 helps in some cases, but it's demonstrably false to suggest that AVX-512 is *only* an improvement.

    However, that was never the point in contention. The question was: how well does 3DPM perform with an AVX2 codepath that's optimized to the same degree as the AVX-512 path? I fully expect AVX-512 would still be faster, but probably more in line with what we've seen in other benchmarks. I'd guess probably less than 2x.
  • mode_13h - Thursday, July 22, 2021 - link

    > a modern dual socket server in a home rack with some good CPUs
    > can no longer be tested without ear protection.

    When I saw the title of this review, that was my first thought. I feel for you, and sure wouldn't like to work in a room with these machines!
  • sjkpublic@gmail.com - Thursday, July 22, 2021 - link

    Why is this still relevant? You can buy CPU 'cards' and stick them in a chassis that uses less power and costs as much or less.
  • mode_13h - Friday, July 23, 2021 - link

    Are you referring to blade servers? But they don't have the ability to host PCIe cards or a dozen SSDs like this thing does. I'm also not sure how their power budget compares, nor how much RAM they can have.

    Anyway, if all you needed was naked CPU power, without storage or peripherals, then I think OCP has some solutions for even higher density. However, not everyone is just looking to scale massive amounts of raw compute.
