Conclusions & Thoughts on Dense Compute

Truth be told, when I was in discussions with Supermicro about reviewing one of its Ice Lake systems, I wasn’t sure what to expect. I asked my contact at the company to send a system expected to be a popular all-around enterprise platform, one that could serve many of Supermicro’s markets, and the SYS-120U-TNR fits that bill, with the understanding that companies are also requesting denser deployments.

The desire to move from the previously standard 2U designs down to 1U, even for generic dual-socket systems, seems to be a feature of this next generation of enterprise deployments. Data centers and colocation facilities have built out infrastructure to support high-power racks for AI: the enterprises running super-dense AI workloads now invest in 5U systems consuming 5 kW or more, and eight of those in a 42U rack already draw upwards of 40 kW, beyond standard rack power limits unless high-power infrastructure is in place. The knock-on effect of that better colo and enterprise infrastructure is that customers running generic all-round off-the-shelf systems can cut their racks of 2U infrastructure in half, on top of whatever benefit comes from moving from an older generation of processor to the new one.

This causes a small issue for those of us who review servers every now and then: a modern dual-socket server with some good CPUs can no longer be tested in a home rack without ear protection. Normally such a system would be tested in a lower-than-peak fan mode, without additional thermal assistance; these systems, however, need either fans at full speed or additional HVAC just to run standard tests. A modern datacenter lets these systems run as loud as they need to, with a cooling environment optimized for performance density regardless of fan speed. Enterprise customers are taking advantage of this at scale, and that’s why companies like Supermicro are designing systems like the SYS-120U-TNR to meet those needs.

Dense Thoughts on Compute

What I think Supermicro is trying to do with the SYS-120U-TNR is cater to the biggest portion of demand across a variety of use cases. The system could serve as a single-CPU caching tier or a multi-tiered database processing hub; it could be used for AI acceleration in both training and inference; add a dual-slot NVIDIA GPU with a virtualization license and it could host several remote workers with CUDA access; fit multiple FPGAs and it becomes a hub for SmartNIC offload or development. I applaud the fact that Supermicro has quite capably built an all-round machine that can be configured to cater to so many markets.

One slight drawback from my perspective is the lack of a default network interface, even a simple gigabit connection, without an add-in card. Supermicro won’t ship the system without an add-in NIC anyway; however, users will either have to add their own PCIe solution (taking up a slot) or rely on one of Supermicro’s Ultra Riser networking cards, which draw PCIe lanes from the processor. One could argue that Supermicro’s decision allows for better flexibility, especially when space at the rear of the system is limited, but I’m still of the opinion that at least something should be there, hanging off the chipset.

On the CPU side of things, as we noted in our Intel 3rd Generation Xeon Scalable Ice Lake review, the processors themselves offer an interesting increase in generational performance, as well as key optimization points for things like AVX-512, SGX enclaves, and Optane DC Persistent Memory. The move up to PCIe 4.0, eight channels of DDR4-3200 memory, and a focus on an optimized software stack are all plus points for the product, but if your workload falls outside of those optimizable use cases, AMD’s equivalent offerings seem to deliver more performance for the same cost, or in some instances lower cost and lower power.

The Xeon Gold 6330s we are testing today are the updates to the previous generation’s 28-core Xeon Gold 6258R, running at the same power but at half the cost and much lower frequencies. There’s a trade-off there: the 6330 isn’t as fast yet consumes the same power, so by charging half as much for the processor Intel is trying to move the TCO equation to where it needs to be for its customers. The Ice Lake Xeon Gold 6348 is closer in frequency to the 6258R (2.6 GHz base vs 2.7 GHz base) and closer in price ($3072 list vs $3950), but despite the lower frequency it is rated at a higher TDP (235 W vs 205 W). In our Ice Lake review, the new 8380 won out against the older 8280 because the power budget was higher, there were more cores, and we saw an uplift in IPC. The question is now more in the mid-range: while Intel struggles to get its new CPUs to match the old ones without changing list pricing, AMD allows customers to move from dual socket to single socket, all while increasing performance and reducing power costs.

This is somewhat inconsequential for today’s review, in that Supermicro’s system caters to the customers that require Intel for their enterprise infrastructure, regardless of the processor performance.  The SYS-120U-TNR is versatile and configurable for a lot of markets, ripe for an ‘off-the-shelf’ system deployment.

Comments

  • mode_13h - Friday, July 23, 2021 - link

    > It's a real-world workload

    Except it's not. It started out that way, but then he gave it to Intel to optimize the AVX-512 path. So, the AVX-512 path is optimized by "a world expert, according to Jim Keller" (to paraphrase Ian). And yet, the AVX-512 results are put up against the AVX2 results on AMD CPUs, as if they're both optimized to the same degree and that just happens to be the *actual* difference in performance.

    As an excuse for this, Ian points out that he gave AMD the same opportunity, but they haven't taken him up on it. Well, that still doesn't make it a fair representation of AVX2 vs. AVX-512 performance.

    > I'm not sure the point should be to microoptimize it to the ends of the world,
    > or it wouldn't be a realistic workload any longer.

    A lot of workloads are heavily optimized. This includes kernels in HPC programs, many games, and the most popular video compression engines. Probably a lot of stuff in SPEC Bench has been optimized to a high degree. And let's not even start on AI frameworks.

    All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.

    Plus, there's my point about optimizing it for ARM NEON and SVE, so it could be used in a somewhat apples-to-apples comparison with ARM processors.
  • GeoffreyA - Friday, July 23, 2021 - link

    I agree it's unfair. On the "non-AVX" test, the Ryzens go to the top. On one hand, the test shows how much faster an AVX512 processor can be. On the other hand, it's unfair that some are running the AVX2 path and some the AVX512, and the results are put together. (Reminiscent of the Athlon XP's SSE not being used in some benchmarks.)

    Others, I don't know, but in a thing like HEVC encoding, the gains aren't all that much for these instructions. It leads me to feel the 3DPM test is gaining disproportionately from AVX512, in a narrow sort of way, and that's being magnified. The result shows, "Look at how fast these AVX512 CPUs are, leaving their AMD counterparts in the dust."

    https://networkbuilders.intel.com/docs/acceleratin...

    https://software.intel.com/content/www/us/en/devel...
  • mode_13h - Saturday, July 24, 2021 - link

    > it's unfair that some are running the AVX2 path and some the AVX512,
    > and the results are put together.

    That's a reasonable position, but I'm not even going that far. I'm okay with putting up AVX2 against AVX-512, but I think they need to be optimized somewhat comparably. That way, the difference you see only shows the true difference in hardware capability, and not also the (unknown) difference in the level of code optimization.

    > "Look at how fast these AVX512 CPUs are, leaving their AMD counterparts in the dust."

    It does have a few specialized instructions that have no AVX2 counterpart (a sketch of one such pattern appears after the comments). And if you're doing something they were specifically designed to accelerate, then you can get a legit order of magnitude speedup. And it's not impossible 3DPM hit one of those cases. But, in order to know, Ian really needs to disclose the code.
  • GeoffreyA - Saturday, July 24, 2021 - link

    "it's not impossible 3DPM hit one of those cases"

    Possible, even likely. And if so, it's a bit of an unbalanced picture. It will be interesting to see what happens when AMD adds support.
  • mode_13h - Sunday, July 25, 2021 - link

    > Possible, even likely.

    We don't know, so don't presume. There are some obvious things you can get wrong that sabotage performance. Cache thrashing, pointer aliasing, and false sharing, just to name a few (see the second sketch after the comments). Probably a lot of the speedup, in the AVX-512 case, was fixing just such things.
  • Spunjji - Monday, July 26, 2021 - link

    @GeoffreyA - I would argue that it wouldn't necessarily be unbalanced if the benchmark benefits particularly heavily from AVX-512, simply because there are going to be workloads like that out there, and the people who have them are probably going to be aware of that to some extent.

    With comparable optimisation between the AVX2 and AVX-512 code paths, it could still be a helpful example of a best-case for the feature, for those few people for whom it's going to work out like that.

    For everyone else, we could definitely do with more generalised real-world examples (like x264) where the AVX-512 part of the workload isn't necessarily dominant.
  • GeoffreyA - Wednesday, July 28, 2021 - link

    That's a good way of looking at it, Spunjji. You're right. Hopefully we can get those balanced, real-world examples in addition.
  • GeoffreyA - Saturday, July 24, 2021 - link

    And for a best AVX2 vs. best AVX512, I think we probably need some bigger test, something like encoding I would think. I could be wrong, but remember reading that x264 had AVX512 support. I doubt whether it's been optimised to the fullest, though. And most of the critical work on x264 was done a long time ago.
  • GeoffreyA - Sunday, July 25, 2021 - link

    My mistake. x265.
  • mode_13h - Sunday, July 25, 2021 - link

    Yeah, some of the rendering and encoding benchmarks use it.
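
Editor’s note on the discussion above: the comments mention AVX-512 instructions with no direct AVX2 equivalent. As a purely illustrative sketch (this is not 3DPM’s code; the function name keep_positive_avx512 and the choice of predicate are made up for the example), the mask-register compare plus compress-store pattern below is one such case: packing the elements that pass a test into a dense output is a single vcompressps on AVX-512, whereas AVX2 has no equivalent instruction.

    // Minimal AVX-512F sketch: keep only the positive floats from src, packed into dst.
    // Assumes n is a multiple of 16 to keep the example short.
    // Compile with e.g. gcc -O2 -mavx512f -mpopcnt
    #include <immintrin.h>
    #include <stddef.h>

    size_t keep_positive_avx512(const float *src, float *dst, size_t n) {
        size_t kept = 0;
        for (size_t i = 0; i < n; i += 16) {
            __m512 v = _mm512_loadu_ps(src + i);
            // The per-lane predicate lands in a k mask register, not a vector of 0/-1.
            __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
            // One instruction (vcompressps) packs the selected lanes contiguously into dst.
            _mm512_mask_compressstoreu_ps(dst + kept, m, v);
            kept += (size_t)_mm_popcnt_u32((unsigned)m);
        }
        return kept;
    }

On AVX2 the same operation typically needs a lookup table of permutation indices fed into _mm256_permutevar8x32_ps, or a scalar fallback, which is part of why a carefully tuned AVX-512 path can open a gap much larger than the nominal doubling of vector width.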
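
And as a sketch of the kind of non-vector pitfall mode_13h lists (again illustrative only, not taken from 3DPM, and assuming 64-byte cache lines): false sharing, where two threads hammer counters that happen to sit on the same cache line, so the line ping-pongs between cores. Padding each counter to its own cache line, as below, is the usual fix; deleting the pad member reproduces the slow, falsely shared version.

    // Minimal false-sharing sketch: two threads each increment their own counter.
    // With the padding, the counters live on separate cache lines; without it,
    // they share a line and the updates serialize on cache-coherency traffic.
    // Compile with: gcc -O2 -pthread false_sharing.c
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 100000000UL

    struct counter {
        volatile unsigned long value;           // volatile keeps the per-iteration stores
        char pad[64 - sizeof(unsigned long)];   // assume 64-byte cache lines
    };

    static struct counter counters[2];

    static void *worker(void *arg) {
        struct counter *c = arg;
        for (unsigned long i = 0; i < ITERS; i++)
            c->value++;                         // each thread touches only its own slot
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%lu %lu\n", counters[0].value, counters[1].value);
        return 0;
    }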
