Better Core in Zen 2

Just in case you missed it, Ian has explained in great detail in our microarchitecture analysis article why AMD claims that its new Zen 2 is a significantly better architecture than Zen 1:

  • a different second-stage branch predictor, known as a TAGE predictor
  • doubling of the micro-op cache
  • doubling of the L3 cache
  • increase in integer resources
  • increase in load/store resources
  • support for two 256-bit AVX2 instructions per cycle, instead of having to combine two 128-bit units (see the sketch below).
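
To make the 256-bit point concrete, here is a minimal C sketch using AVX2/FMA intrinsics. It is purely illustrative (the function name and loop are ours, not AMD's): a single 256-bit FMA instruction operates on eight single-precision floats, and Zen 2 can execute such instructions at full width rather than splitting them into 128-bit halves as Zen 1 did.

    // Build with: gcc -O2 -mavx2 -mfma saxpy.c
    #include <immintrin.h>
    #include <stddef.h>

    // y[i] = a * x[i] + y[i], eight floats per 256-bit FMA.
    // Zen 1 cracked each 256-bit op into two 128-bit micro-ops;
    // Zen 2's full-width units execute it natively.
    void saxpy_avx2(float a, const float *x, float *y, size_t n)
    {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_fmadd_ps(va, vx, vy);   // one 256-bit FMA instruction
            _mm256_storeu_ps(y + i, vy);
        }
        for (; i < n; i++)                      // scalar tail
            y[i] = a * x[i] + y[i];
    }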

All of these on-paper improvements show that AMD is attacking its key markets in both consumer and enterprise performance. With the extra compute and promised efficiency, we can surmise that AMD has the ambition to take the high-performance market back too. Unlike Intel's Xeons, the 2nd Gen EPYC does not advertise lower clocks when running AVX2 code - instead it relies on a power-aware scheduler that supplies as much frequency as possible within the power constraints of the platform.

Users might question, especially with Intel so embedded in high performance computing and machine learning, why AMD hasn't gone with an AVX-512 design. As a snap back to the incumbent market leader, AMD has stated that not all 'routines can be parallelized to that degree', as well as giving a very clear signal that 'it is not a good use of our silicon budget'. I do believe that we may require pistols at dawn. Nonetheless, it will be interesting to see how each company approaches vector parallelisation as new generations of hardware come out. But as it stands, AMD is pumping up its FP performance without going full-on AVX-512.

AMD claims an overall 15% IPC increase for Zen 2, and we saw this claim borne out in our analysis of the Zen 2 consumer processor line, which launched last month: Andrei checked and found that it is indeed 15-17% faster. Along with the performance improvements, there have also been security hardening updates, improved virtualization support, and new but proprietary instructions for cache and memory bandwidth Quality of Service (QoS). (The QoS features seem very similar to what Intel introduced in Broadwell / Xeon E5 v4 and Skylake - AMD is catching up in that area.)
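
On the software side, AMD's QoS features, like Intel's RDT, are generally consumed through the operating system rather than directly by applications; on Linux both vendors' implementations sit behind the resctrl filesystem and a set of CPU feature flags. Purely as a hedged illustration - the flag names below are how recent Linux kernels label these capabilities, not something taken from AMD's documentation - a quick capability check might look like this:

    // Build with: gcc -O2 qos_flags.c
    // Scans /proc/cpuinfo for the flags Linux uses for cache allocation,
    // cache occupancy monitoring and memory bandwidth allocation.
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *flags[] = { "cat_l3", "cqm_llc", "cqm_mbm_total", "mba" };
        char line[4096];
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("cpuinfo"); return 1; }

        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "flags", 5) != 0)
                continue;
            // Crude substring match, good enough for a quick check.
            for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++)
                printf("%-14s %s\n", flags[i],
                       strstr(line, flags[i]) ? "present" : "absent");
            break;          // flags are identical for every core
        }
        fclose(f);
        return 0;
    }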

Rome Layout: Simple Makes It a Lot Easier

When we analyzed AMD's first generation of EPYC, one of the big disadvantages was the complexity. AMD had built its 32-core Naples processors by enabling four 8-core silicon dies, and attaching each one to two memory channels, resulting in a non-uniform memory access (NUMA) architecture. Due to this 'quad NUMA' layout, a number of applications saw quite a few NUMA balancing issues. This happened in almost every OS, and in some cases we saw reports that system administrators and others had to do quite a bit of optimization work to get the best performance out of the EPYC 7001 series.

The new 2nd Gen EPYC, Rome, solves this. The CPU design implements a central I/O hub through which all off-chip communications occur. The full design uses eight core chiplets, called Core Complex Dies (CCDs), and one central die for I/O, called the I/O Die (IOD). All of the CCDs communicate with this central I/O hub through dedicated high-speed Infinity Fabric (IF) links, and through it the cores can reach the DRAM and PCIe lanes attached to the IOD, as well as cores on other CCDs.

Each CCD consists of two four-core Core CompleXes (1 CCD = 2 CCX). Each CCX contains four cores and 16 MB of L3 cache; these are at the heart of Rome. The top 64-core Rome processors therefore have 16 CCXes in total, and those CCXes can only communicate with each other over the central I/O die - there is no direct CCD-to-CCD communication.
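
As a back-of-the-envelope illustration of this hierarchy, the C sketch below maps a core index to its CCX and CCD for a 64-core part, assuming a simple linear enumeration (cores 0-3 in CCX 0, cores 4-7 in CCX 1, and so on). Real operating systems may enumerate cores and SMT threads differently, so treat this only as a model of the 4-cores-per-CCX, 2-CCX-per-CCD layout, not as a description of any particular OS.

    #include <stdio.h>

    // Hypothetical linear enumeration: 4 cores per CCX, 2 CCX per CCD,
    // 8 CCDs per 64-core Rome package. Each CCX has its own 16 MB L3 slice.
    struct location { int ccd; int ccx; int core_in_ccx; };

    static struct location locate(int core)
    {
        struct location loc;
        loc.core_in_ccx = core % 4;   // position within the CCX
        loc.ccx         = core / 4;   // 16 CCXes in a 64-core part
        loc.ccd         = core / 8;   // 8 CCDs, 2 CCXes each
        return loc;
    }

    int main(void)
    {
        for (int core = 0; core < 64; core += 13) {   // a few sample cores
            struct location loc = locate(core);
            printf("core %2d -> CCD %d, CCX %2d, slot %d (shares L3 with cores %d-%d)\n",
                   core, loc.ccd, loc.ccx, loc.core_in_ccx,
                   loc.ccx * 4, loc.ccx * 4 + 3);
        }
        return 0;
    }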

This is what the diagram shows. On the left we have Naples, the first generation EPYC, which uses four Zeppelin dies, each connected to the others with IF links. On the right is Rome, with eight CCDs in green around the outside and a centralized I/O die in the middle carrying the DDR and PCIe interfaces.

As Ian reported, the CCDs are made at TSMC using its latest 7 nm process technology, while the I/O die is built on GlobalFoundries' 14 nm process. Since I/O circuitry, especially compared to cache and logic circuitry, is notoriously hard to scale down to smaller process nodes, AMD is being clever here: using a very mature process technology for the IOD helps improve time to market, which definitely has its advantages.

This topology is clearly visible when you look under the hood.

AMD Rome chip

The main advantage is that the 2nd Gen 'EPYC 7002' family is much easier to understand and optimize for, especially from a software point of view, compared to Naples. Each processor now presents a single memory latency environment, as every core has the same latency to all eight memory channels - compared to the first generation Naples, which had four NUMA regions per CPU due to the memory directly attached to each die.

As seen in the image below, this means that in a dual socket setup, a Rome system acts like the traditional two-node NUMA environment that most software engineers are familiar with.

Ultimately the only other way to do this is with a large monolithic die, which for smaller process nodes is becoming less palatable when it comes to yields and pricing. In that respect, AMD has a significant advantage: small 7 nm chiplets can be manufactured with high yields, and they also give a substantial advantage when it comes to binning for frequency.

How a system sees the new NUMA environment is quite interesting. For the Naples EPYC 7001 CPUs, this was rather complicated in a dual socket setup: 

Here each number shows the 'weighting' given to the delay in accessing each of the other NUMA domains. Within the same domain the weighting is light at only 10, a NUMA domain on the same chip is given 16, and jumping to the other socket bumps this up to 32.
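
These weightings are the standard ACPI SLIT distances, and they can be queried programmatically. As a hedged sketch - numa_distance() is a real libnuma call, but the values printed obviously depend on the machine it runs on - the following prints the full node distance matrix, which on a dual-socket Naples system would show the 10/16/32 pattern described above:

    // Build with: gcc -O2 numa_matrix.c -lnuma   (requires libnuma headers)
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        int max = numa_max_node();
        printf("node ");
        for (int j = 0; j <= max; j++)
            printf("%4d", j);
        printf("\n");

        // numa_distance() returns the SLIT weighting: 10 for the local node,
        // larger values (e.g. 16 or 32 on Naples) for more distant nodes.
        for (int i = 0; i <= max; i++) {
            printf("%4d ", i);
            for (int j = 0; j <= max; j++)
                printf("%4d", numa_distance(i, j));
            printf("\n");
        }
        return 0;
    }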

This changed significantly on Rome EPYC 7002: 

Although there are situations where the EPYC 7001 CPUs could communicate faster, the fact that the topology is much simpler from a software point of view is worth a lot. It makes getting good performance out of the chip much easier for everyone who has to use it, which will save a lot of money in enterprise deployments and also help accelerate adoption.

Comments

  • sing_electric - Thursday, August 8, 2019 - link

    Not just Netburst - remember, Intel's plans were ORIGINALLY for Itanium to migrate down through the stack, ending up in consumer machines. Two massively costly mistakes when it came to planning the future of CPUs. Honestly, I hope Intel properly compensated the team behind the P6, an architecture so good that it was essentially brought back after release after those 2 failures.

    OTOH, it's kind of amazing that AMD survived the Bulldozer years, since their margin for error is much smaller than Intel's. Good thing they bought ATI, since I'm not sure the company survives without the money they made from graphics cards and consoles...
  • JohanAnandtech - Thursday, August 8, 2019 - link

    Thank you for the kudos and sympathy. It was indeed hot! At 39°C/102°F, the server was off.

    I agree - I too admire the no-nonsense leadership of Lisa Su. Focus, careful execution and customer centric.
  • WaltC - Thursday, August 8, 2019 - link

    AMD has proven once again that Intel can be beaten, and soundly, too...;) The myth of the indestructible Intel is forever shattered, and Intel's CPU architectures are so old they creak and are riddled with holes, imo. Where would Intel have put us, if there'd been no AMD? You like Rdram, you like Itanium, just for starters? You like paying through the nose? That's where Intel wanted to go in its never-ending quest to monopolize the market! AMD stopped all of that by offering an alternative path the market was happy to take--a path that didn't involve emulators and tossing out your software library just to give Intel a closed bus! Intel licensed AMD's x86-64, among other things--and they flourished when AMD dropped the ball. I chalk all that up to AMD going through a succession of horrible CEOs--people who literally had no clue! Remember the guy who ran AMD for awhile who concluded it made sense for AMD to sell Intel servers...!? Man, I thought AMD was probably done! There's just no substitute for first-class management at the top--Su was the beginning of the AMD renaissance! Finally! As a chip manufacturer, Intel will either learn how to exist in a competitive market or the company over time will simply fade away. I often get the feeling that Intel these days is more interested in the financial services markets than in the computer hardware markets. While Intel was busy milking its older architectures and raking in the dough, AMD was busy becoming a real competitor once again! What a difference the vision at the top, or the lack of it, makes.
  • aryonoco - Thursday, August 8, 2019 - link

    That dude was Rory Read, and while the SeaMicro acquisition didn't work out, he did some great work and restructured AMD and in many ways saved the company while dealing with the Bulldozer disaster.

    Rory stabilized the finances of the company by lowering costs by over 30%, and created the semi-custom division that enabled them to win the contracts for both the Xbox and PS4, creating a stable stream of revenue. Of course Rory's greatest accomplishment was hiring Lisa Su and then grooming her to become the CEO.

    Rory was a transitional CEO and he did exactly what was required of him. If there is a CEO that should be blamed for AMD's woes, it's Dirk Meyer.
  • aryonoco - Thursday, August 8, 2019 - link

    Forgot to mention, Rory also hired Jim Keller to design K12, and in effect he started the project that would later on become Zen.

    Of course Lisa deserves all the glory from then on. She has been an exceptional leader, bringing focus and excelling at execution, things that AMD always traditionally lacked.
  • tamalero - Sunday, August 11, 2019 - link

    I'd blame Hector Ruiz first.
    It was his crown to lose during the Athlon 64 era, and he simply didn't have anything to show, making the Athlon 64 core arch a one-hit wonder for more than a decade.
  • MarcusTaz - Wednesday, August 7, 2019 - link

    Another site's article that starts with an F stated that Rome runs hot and uses 1.4 volts, above TSMC's recommended 1.3 volts. Did you need to run 1.4 volts for these tests?
  • evernessince - Wednesday, August 7, 2019 - link

    Well 1st, that 1.3v figure is from TSMC's mobile focused 7nm LPP node. Zen 2 is made on the high performance 7nm node, not the mobile focused LPP. Whatever publication you read didn't do their homework. TSMC has not published information on their high performance node and I think it rather arrogant to give AMD an F based on an assumption. As if AMD engineers are stupid enough to put dangerous voltages through their CPUs that would result in a company sinking lawsuit. It makes zero sense.

    FYI all AMD 3000 series processors go up to 1.4v stock. Given that these are server processors, they will run hot. After all, more cores = more heat. It's the exact same situation for Intel server processors. The only difference here is that AMD is providing 50 - 100% more performance in the same or less power consumption at 40% less cost.
  • DigitalFreak - Thursday, August 8, 2019 - link

    You reading Fudzilla?
  • Kevin G - Wednesday, August 7, 2019 - link

    AMD is back. They have the performance crown again and have decided to lap the competition with what can be described as an embarrassing price/performance comparison to Intel. The only thing they need to do is be able to meet demand.

    One thing I wish they would have done is added quad socket support. Due to the topology necessary, intersocket bandwidth would be a concern at higher core counts but if you just need lots of memory, those low end 8 core chips would have been fine (think memcache or bulk NVMe storage).

    With the topology improvements, I also would have liked AMD to try something creative: a quad chip + low clocked/low voltage Vega 20 in the same package, all linked together via Infinity Fabric. That would be something stunning for HPC compute. I do see AMD releasing some GPU in a server socket at some point for this market, as things have been aligning in this direction for some time.

    Supporting something like CCIX or OpenCAPI also would have been nice. A nod toward my previous point, even enabling Infinity Fabric to Vega 20 compute cards instead of PCIe 4.0 would have been yet another big step for AMD as that'd permit full coherency between the two chips without additional overhead.

    I think it would be foolish to ignore AVX-512 for Zen 3, even if the hardware they run it on continues to use 256-bit wide SIMD units. ISA parity is important even if they don't inherently show much of a performance gain (though considering the clock speed drops seen in Skylake-SP, if AMD could support AVX-512 at the clocks they're able to sustain at AVX2 on Zen 2, they might pull off an overall throughput win).

    With regards to Intel, they have Cooper Lake due later this year. If Intel was wise, they'd use that as a means to realign their pricing structure and ditch the memory capacity premium. Everything else Intel can do in the short term is flex their strong packaging techniques and push integrated accelerators: on-package fabric, FPGA, Optane DIMMs etc. Intel can occupy several lucrative niches in important, growing fields with what they have in-house right now, but they need to get them to market and at competitive prices. Otherwise it is AMD's game for the next 12 to 15 months until Ice Lake-SP arrives to bring back the competitive landscape. It isn't even certain that Intel can score a clean win either, as Zen 3 based chips may start to arrive in the same time frame.
