System Results and Benchmarks

When it comes to system tests, the most obvious place to start is power consumption. While running through our benchmark suite, the IPMI does a good job of logging power consumption every few minutes. We saw a 747 W peak listed, although the graph we captured while wrapping up this review shows something north of 750 W.

750 W for a fully loaded dual 28-core, 2x205 W system sounds quite high. The power supply in this system peaks at 1200 W, which leaves around 500 W for an AI accelerator and anything additional. That means a good GPU and a dozen high-power NVMe drives are about your limit; luckily, that is all you can fit into the system. Users who want 270 W processors in this system might have to cut back on some of the extras.

This is one reason to test the system at full power: looking at processor power consumption, we see about 205 W per processor (the rated TDP) during turbo.

Of this power, idle draw appears to be around 100 W, split between cores and DRAM (we assume IO is counted under DRAM). When loaded, the extra budget goes into the processors. We see the same thing in CineBench, except that there seems to be less stress on the DRAM/IO in that test.

Benchmarks

While we don't have a series of server-specific tests, we are able to probe the capability of the system as delivered through a mix of our enterprise and workstation testing. LLVM compile and SPEC are Linux-based, while the rest are Windows-based, chosen for personal familiarity and our back catalog of comparison data. It is worth noting that some software has difficulty scaling beyond 64 threads in Windows due to processor groups; this comes down to the way the software is compiled and run. All the tests here were able to sidestep this limitation except LinX LINPACK, which has a 64-thread limit (and is limited to Intel).

LLVM Compile

(4-1) Blender 2.83 Custom Render Test

(8-5) LinX 0.9.5 LINPACK

SPECint2017 Base Rate-N

SPECfp2017 Base Rate-N

(4-5) V-Ray Renderer

(4-2) Corona 1.3 Benchmark

(2-2) 3D Particle Movement v2.1 (Peak AVX)

(1-1) Agisoft Photoscan 1.3, Complex Test

(4-7b) CineBench R23 Multi-Thread

In almost all cases, the dual-socket 28-core SYS-120U-TNR sits behind the single-socket 64-core option from AMD. Against the dual 8280 or dual 6258R, we can see a generational uplift, however the system still struggles against AMD's previous-generation top-tier processor. That said, AMD's processor costs $6950, whereas two of these 6330s are around $3800. There is always a balance between price, total cost of ownership, and the benefits and complexities of a dual-socket system versus a single-socket one. The benchmarks where the SYS-120U-TNR did best were our AVX tests, such as 3DPM and y-cruncher, where these processors could use AVX-512. As Intel's Lisa Spelman stated in our recent interview, "70% of those deal wins, the reason listed by our salesforce for that win was AVX-512; optimization is real".

BIOS, Software, BMC Thoughts and Conclusions

  • Elstar - Saturday, July 24, 2021 - link

    > All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.

    AVX-512, as an instruction set, was a huge leap forward compared to AVX/AVX2. So much so that Intel created the AVX-512VL extension, which allows one to use AVX-512 instructions on vectors smaller than 512 bits. As a vector programmer, here are the things I like about AVX-512:

    1) Dedicated mask registers, and every instruction can take an optional mask for zeroing/merging results.
    2) AVX-512 instructions can broadcast from memory without requiring a separate instruction.
    3) More registers (not just wider).

    Also, and this is kind of hard to explain, but AVX/AVX2 as an instruction set is really annoying because it acts like two SSE units. So for example, you can't permute (or "shuffle" in Intel parlance) the contents of an AVX2 register as a whole. You can only permute the two 128-bit halves as if they were two SSE registers fused together. AVX-512 doesn't repeat this half-assed design approach.
    Reply
  • mode_13h - Sunday, July 25, 2021 - link

    > 1) Dedicated mask registers and every instruction can take an optional
    > mask for zeroing/merging results

    This seems like the only major win. The rest are just chipping at the margins.

    More registers is a win for cases like fitting a larger convolution kernel or matrix row/column in registers, but I think it's really the GP registers that are under the most pressure.

    AVX-512 is not without its downsides, which have been well-documented.
    Reply
  • Spunjji - Monday, July 26, 2021 - link

    @Elstar - Interesting info. Just makes me more curious as to how many of these things might be benefiting the 3DPM workload specifically. Another good reason for more people to get eyes on the code! Reply
  • Dolda2000 - Saturday, July 24, 2021 - link

    >All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.
    I don't remember where it was posted any longer (it was in the comment section of some article over a year ago), but apparently 3DPM makes heavy use of wide (I don't recall exactly how wide) integer multiplications, which were made available in vectorized form in AVX-512.
    Reply
  • dwbogardus - Saturday, July 24, 2021 - link

    Performance optimization is converged upon from two different directions: 1) the code users run to perform a task, and 2) the compute hardware upon which the code is intended to run. As an Intel engineer, for some time I was in a performance evaluation group. We ran many thousands of simulations of all kinds to quantify the performance of our processor and chipset designs before they ever went to silicon. This was in addition to our standard pre-silicon validation. Pre-silicon performance validation was to demonstrate that the expected performance was being delivered. You may rest assured that every major silicon architectural revision or addition to the silicon and power consumption was justified by demonstrated performance improvements. Once the hardware is optimized, then the coders dive into optimizing their code to take best advantage of the improved hardware. It is sort of like "double-bounded successive approximation" toward a higher performance target from both HW and SW directions. No surprise that benchmarks are optimized to the latest and highest performant hardware. Reply
  • GeoffreyA - Sunday, July 25, 2021 - link

    Fair enough. But what if the legacy code path, in this case AVX2, were suboptimal? Reply
  • mode_13h - Sunday, July 25, 2021 - link

    > You may rest assured that every major silicon architectural revision
    > or addition to the silicon and power consumption was justified
    > by demonstrated performance improvements.

    Well, it looks like you folks failed on AVX-512 -- at least, in Skylake/Cascade Lake:

    https://blog.cloudflare.com/on-the-dangers-of-inte...

    I experienced this firsthand, when we had performance problems with Intel's own OpenVINO framework. When we reported this to Intel, they confirmed that performance would be improved by disabling AVX-512. We applied *their* patch, effectively reverting it to AVX2, and our performance improved substantially.

    I know AVX-512 helps in some cases, but it's demonstrably false to suggest that AVX-512 is *only* an improvement.

    However, that was never the point in contention. The question was: how well does 3DPM perform with an AVX2 codepath that's optimized to the same degree as the AVX-512 path? I fully expect AVX-512 would still be faster, but probably more in line with what we've seen in other benchmarks. I'd guess probably less than 2x.
    Reply
  • mode_13h - Thursday, July 22, 2021 - link

    > a modern dual socket server in a home rack with some good CPUs
    > can no longer be tested without ear protection.

    When I saw the title of this review, that was my first thought. I feel for you, and sure wouldn't like to work in a room with these machines!
    Reply
  • sjkpublic@gmail.com - Thursday, July 22, 2021 - link

    Why is this still relevant? You can buy CPU 'cards' and stick them in a chassis, using less power and costing as much or less. Reply
  • mode_13h - Friday, July 23, 2021 - link

    Are you referring to blade servers? But they don't have the ability to host PCIe cards or a dozen SSDs like this thing does. I'm also not sure how their power budget compares, nor how much RAM they can have.

    Anyway, if all you needed was naked CPU power, without storage or peripherals, then I think OCP has some solutions for even higher density. However, not everyone is just looking to scale massive amounts of raw compute.
    Reply
