As an industry, we are slowly moving into an era where how we package the small pieces of silicon together is just as important as the silicon itself. New ways to connect silicon include placing dies side by side, stacking them on top of each other, and all sorts of fancy connections that keep the benefits of chiplet designs while building further on them. Today, AMD is showcasing its next packaging uplift: stacked L3 cache on its Zen 3 chiplets, bumping each chiplet from 32 MiB to 96 MiB. This announcement, however, targets its large EPYC enterprise processors.

AMD’s current offering in this market is its third generation EPYC 7003 processor line, also known as Milan, which offers up to 64 Zen 3 cores across eight TSMC 7nm chiplets, co-packaged with a central IO die built on GlobalFoundries 14nm. The IO die provides eight DDR4-3200 memory channels and 128 lanes of PCIe 4.0, along with other functions such as security. Today’s announcement, or reveal (or acknowledgement?), is that AMD is going to launch Milan-X in Q1.

Milan-X is an upgraded version of Milan using the stacked L3 cache packaging technology. A 64-core version of Milan today, with eight 8-core chiplets, has 256 MiB of total L3 cache; the Milan-X version adds L3 cache to each of those chiplets, creating a processor with a total of 768 MiB of L3 cache, unrivalled by anything else in the industry. This extra L3 cache die is built on a cache-density-optimized variant of TSMC N7, measures 36 mm², and puts the added 64 MiB on top of the 32 MiB that is already there. The rest of the chiplet has a shim built around it to help with thermal transfer.
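The cache totals follow from simple per-chiplet arithmetic. As a minimal sketch in Python, using only the figures quoted in this article (the constant and function names are ours, purely for illustration):

```python
# Figures quoted in the article; all names here are illustrative only.
CHIPLETS = 8          # eight 8-core Zen 3 chiplets in a 64-core part
BASE_L3_MIB = 32      # L3 cache already present on each chiplet
STACKED_L3_MIB = 64   # L3 cache added by the stacked die on each chiplet


def total_l3_mib(chiplets: int, per_chiplet_mib: int) -> int:
    """Total L3 across the package, assuming every chiplet is identical."""
    return chiplets * per_chiplet_mib


milan_l3 = total_l3_mib(CHIPLETS, BASE_L3_MIB)
milan_x_l3 = total_l3_mib(CHIPLETS, BASE_L3_MIB + STACKED_L3_MIB)

print(f"Milan:   {milan_l3} MiB")
print(f"Milan-X: {milan_x_l3} MiB")
```

This assumes all eight chiplets carry the stacked cache, as in the 64-core configuration described above; AMD has not detailed lower core count configurations.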

Given AMD’s disclosures about its stacked cache technology back in June at Computex, we had already been expecting consumer and enterprise variants to come to market at some point. AMD promised the technology would come to Zen 3 and be put into production by the end of 2021, and today's announcement confirms that timeline. When it will come to the consumer product line is still unannounced. That being said, today’s announcement is still lacking in explicit details.

AMD confirms that Milan-X will be socket compatible with current Milan processors (that’s the SP3 socket), but hasn’t listed any details about power, frequency, or pricing. We are expecting the extra L3 cache to consume some additional power, so if we are working to a 280 W limit, that would imply some small frequency loss. Beyond that, using an effective +45% of 7nm silicon per chiplet (36 mm² for the top cache die, 80.7 mm² for the bottom core die) should theoretically increase the price by +45% if AMD is wafer limited at TSMC and wants to keep the same cost per unit of silicon area. Milan-X represents a unique offering in the x86 market with so much L3 cache on offer per chiplet, so you can imagine that AMD could charge a healthy premium over regular Milan.
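The +45% figure is just the ratio of the two die areas. A rough sketch in Python, using the areas quoted above and assuming (as the paragraph does) that cost scales linearly with 7nm silicon area; the names are illustrative, and this is not a pricing model:

```python
# Die areas quoted in the article; names are illustrative only.
CACHE_DIE_MM2 = 36.0   # stacked L3 cache die
CORE_DIE_MM2 = 80.7    # Zen 3 core chiplet underneath


def extra_silicon_fraction(added_mm2: float, base_mm2: float) -> float:
    """Added silicon area expressed as a fraction of the base die area."""
    return added_mm2 / base_mm2


extra = extra_silicon_fraction(CACHE_DIE_MM2, CORE_DIE_MM2)

# Assumption: cost per chiplet tracks area one-to-one, so the same fraction
# applies to cost. Real pricing depends on much more (yield, packaging, margin).
cost_multiplier = 1.0 + extra

print(f"Extra 7nm silicon per chiplet: {extra:.1%}")
print(f"Implied cost multiplier:       {cost_multiplier:.2f}x")
```

The ratio comes out just under 45%, which is where the rounded figure in the text comes from.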

We are told those details will come closer to launch in the first three months of next year (Q1 2022). However, AMD is keen to point out that the increased cache puts less bandwidth pressure on main memory, allowing for a 66% speedup in certain workloads (EDA-based RTL verification on Synopsys VCS) when comparing 16-core Milan with 16-core Milan-X, although the exact chiplet configuration was not disclosed.

AMD also went on to say that Microsoft will be announcing a public preview of its Azure HBv3-series VMs with Milan-X today alongside AMD’s event, although it didn’t talk about availability. Beyond that, there was the usual talk about expecting major OEM partners (Dell, Lenovo, HPE, Supermicro, Cisco) to adopt the new hardware in their portfolios at the full launch.

  • LightningNZ - Monday, November 8, 2021 - link

    Exactly. You've also got a process mismatch which may make TSVs more difficult to match across dies, and the IO die runs at a lower clock speed which also lessens any cache advantage. It would have to be a huge cache for it to offset the additional latency and still provide a win over DRAM.
  • E. Gadsby - Monday, November 8, 2021 - link

    Bandwidth was 2TBps for the original stacking announcement. Would be surprising if that were also not the case here.
  • Wilco1 - Tuesday, November 9, 2021 - link

    I meant design one 8-core chiplet with 96MB of cache and then create 2 sets of masks, one for the full cache and one with the extra cache cut off. Or do a single chiplet with 64MB L3 and use it both for desktop and servers. This should have lower latency than adding an SRAM die on top.

    I can't help but feel that adding a tiny SRAM die on top of a tiny chiplet is overkill. It's a great technology, but it seems like a solution looking for a problem...
  • E. Gadsby - Wednesday, November 10, 2021 - link

    I think it gives them optionality. A separate mask set is likely far more expensive and this is unlikely to be a high volume product even on desktop Threadripper. The SRAM chiplet is cheaper to make. The main CCD is already high volume so that cost is amortized and volume of this stacked part can be adjusted as per demand. Those who need the big cache will pay the extra cost. Risk to company is lower.
