In the computing industry, we’ve lived with PCIe as a standard for a long time. It is used to add any additional features to a system: graphics, storage, USB ports, more storage, networking, add-in cards, storage, sound cards, Wi-Fi, oh did I mention storage? Well, the one thing that we haven’t been able to put into a PCIe slot is DRAM. I don’t mean DRAM as a storage device, but memory that is actually added to the system as usable DRAM. Back in 2019, a new standard called CXL (Compute Express Link) was introduced, which uses a PCIe 5.0 link as the physical interface. Part of that standard is CXL.memory: the ability to add DRAM into a system through a CXL/PCIe slot. Today Samsung is unveiling the first DRAM module specifically designed in this way.

CXL: A Refresher

The original CXL standard started off as a research project inside Intel to create an interface that can support accelerators, IO, cache, and memory. It subsequently spun out into its own consortium, with more than 50 members and support from key players in the industry: Intel, AMD, Arm, IBM, Broadcom, Marvell, NVIDIA, Samsung, SK Hynix, WD, and others. The latest standard is CXL 2.0, finalized in November 2020.

The CXL 1.1 standard covers three protocols, known as CXL.io, CXL.memory and CXL.cache. These allow for deeper control over the connected devices, as well as an expansion of what is possible. The CXL consortium sees three main device types for this:

The first type is a cache/accelerator, such as an offload engine or a SmartNIC (a smart network controller). With the CXL.io and CXL.cache protocols, this would allow the network controller to sort incoming data, analyze it, and filter what is needed directly into the main processor’s memory.

The second type is an accelerator with its own memory, where the processor gets direct access to the HBM on the accelerator (and the accelerator gets access to system DRAM). The idea is a pseudo-heterogeneous compute design allowing for simpler but dense computational solvers.

The third type is perhaps the one we’re most interested in today: memory buffers. Using CXL.memory, a memory buffer can be installed over a CXL link and the attached memory can be directly pooled with the system memory. This allows for either increased memory bandwidth or increased memory capacity, on the order of thousands of gigabytes.
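
As a quick way to keep the three device types straight, here is a minimal Python sketch of the type-to-protocol mapping. The text above only spells out the protocols for the first and third types; the full sets used here (all three protocols for the second type, and CXL.io alongside CXL.memory for the third) reflect my understanding of the specification, and the names `DEVICE_TYPES` and `can_expand_host_memory` are made up for illustration.

```python
# Illustrative mapping of CXL device types to the protocols they use.
# The dictionary and helper names are hypothetical, not from any CXL library.
DEVICE_TYPES = {
    1: {"role": "cache/accelerator (e.g. a SmartNIC)",
        "protocols": ("CXL.io", "CXL.cache")},
    2: {"role": "accelerator with its own memory (e.g. HBM)",
        "protocols": ("CXL.io", "CXL.cache", "CXL.memory")},
    3: {"role": "memory buffer / memory expander",
        "protocols": ("CXL.io", "CXL.memory")},
}


def can_expand_host_memory(device_type: int) -> bool:
    """Return True if the device type exposes memory to the host via CXL.memory."""
    return "CXL.memory" in DEVICE_TYPES[device_type]["protocols"]


if __name__ == "__main__":
    for number, info in DEVICE_TYPES.items():
        print(f"Type {number}: {info['role']} -> {', '.join(info['protocols'])}")
```

A DRAM module like the one Samsung is announcing would fall under type 3 in this scheme.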

CXL 2.0 also introduces CXL.security, support for persistent memory, and switching capabilities.

It should be noted that CXL uses the same electrical interface as PCIe. That means any CXL device will have what looks like a PCIe physical connector. Beyond that, CXL uses PCIe in its startup process, so currently any CXL-supporting device also has to support a PCIe-to-PCIe link, making any CXL controller also a PCIe controller by default.

One of the common questions I’ve seen is what would happen if a CXL-only CPU were made. Because CXL and PCIe are intertwined, a CPU can’t be CXL-only; it would have to support PCIe connections as well. From the other direction, if we see CXL-based graphics cards, for example, they would also have to at least initialize over PCIe, although full working modes might not be possible if CXL isn’t initialized.
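
The fallback behavior described above can be captured in a toy Python model. This is only a sketch of the logic, not of real link training: the function `negotiate_link` and its parameters are hypothetical, and it assumes the link always comes up as PCIe first and only switches to CXL when both ends support it.

```python
def negotiate_link(host_supports_cxl: bool,
                   device_supports_cxl: bool,
                   device_needs_cxl: bool = False) -> tuple[str, bool]:
    """Toy model of CXL-over-PCIe startup (hypothetical, heavily simplified).

    Returns (mode, fully_functional):
      mode             -- "CXL" if both ends support it, otherwise "PCIe"
      fully_functional -- False when a device that depends on CXL features
                          ends up on a plain PCIe link
    """
    # Every link starts out as PCIe; CXL is only selected if both sides agree.
    mode = "CXL" if (host_supports_cxl and device_supports_cxl) else "PCIe"
    fully_functional = (mode == "CXL") or not device_needs_cxl
    return mode, fully_functional


if __name__ == "__main__":
    # A CXL-dependent card in a PCIe-only host initializes, but in a reduced mode.
    print(negotiate_link(False, True, device_needs_cxl=True))
    # A CXL host with a CXL device gets the full link.
    print(negotiate_link(True, True, device_needs_cxl=True))
```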

Intel is set to introduce CXL 1.1 over PCIe 5.0 with its Sapphire Rapids processors. Microchip has announced PCIe 5.0 and CXL-based retimers for motherboard trace extensions. Samsung’s announcement today is the third for CXL-supporting devices. IBM has a similar technology called OMI (OpenCAPI Memory Interface), however that hasn’t seen wide adoption outside of IBM’s own processors.

Samsung’s CXL Memory Module

Modern processors rely on memory controllers for attached DRAM access. The top-line x86 processors have eight channels of DDR4, while a number of accelerators have gone down the HBM route. One of the limiting factors in scaling up memory bandwidth is the number of controllers, which can also limit capacity. Beyond that, memory needs to be validated and trained to work with a system. Most systems are not built to simply add or remove memory the same way you might do with a storage device.

Enter CXL, and the ability to add memory like a storage device. Samsung’s unveiling today is of a CXL-attached module packed to the max with DDR5. It uses a full PCIe 5.0 x16 link, which runs at 32 GT/s per lane in each direction (a theoretical peak of roughly 63 GB/s each way), but with multiple TB of memory behind a buffer controller. In much the same way that companies like Samsung pack NAND into a U.2-sized form factor, with sufficient cooling, Samsung does the same here but with DRAM.
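
To put the link speed into more familiar units, here is a back-of-the-envelope Python calculation of the theoretical peak for a PCIe 5.0 x16 link. It assumes 32 GT/s per lane and 128b/130b line encoding, and it ignores packet and protocol overhead, so real-world throughput will be lower. The function name `pcie_peak_gbytes_per_s` is just for illustration.

```python
def pcie_peak_gbytes_per_s(gt_per_s: float = 32.0,
                           lanes: int = 16,
                           encoding_efficiency: float = 128 / 130) -> float:
    """Theoretical peak bandwidth in GB/s for one direction of a PCIe link.

    Each transfer carries one bit per lane; 128b/130b encoding means 128 of
    every 130 bits on the wire are payload. Protocol overhead is ignored.
    """
    gbits_per_s = gt_per_s * lanes * encoding_efficiency
    return gbits_per_s / 8  # bits -> bytes


if __name__ == "__main__":
    per_direction = pcie_peak_gbytes_per_s()
    print(f"Per direction:   {per_direction:.1f} GB/s")
    print(f"Both directions: {2 * per_direction:.1f} GB/s")
```

For comparison, that per-direction figure is in the same ballpark as a few channels of modern DRAM, which is why a CXL module is better thought of as extra capacity and some extra bandwidth rather than a replacement for direct-attached memory.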

The DRAM is still a volatile memory, and data is lost if power is lost. (I doubt it is hot swappable either, but weirder things have happened). Persistent memory can be used, but only with CXL 2.0. Samsung hasn't stated if their device supports CXL 2.0, but it should be at least CXL 1.1 as they state it currently is being tested with Intel's Sapphire Rapids platform.

It should be noted that a modern DRAM slot is usually rated for a maximum of ~18 W. The only modules in that power window are Intel’s Optane DCPMM, but a 256 GB DDR4 module would be in the ~10+ W range. For a 2 TB add-in CXL module like this, I suspect we are looking at around 70-80 W, and so adding that amount of DRAM through the CXL interface would likely require active cooling as well as the big heatsink that these images suggest.
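
The power guess above is simple linear scaling, which can be written out as a short Python sketch. It assumes power scales in proportion to capacity from a 256 GB module at roughly 10 W, and it ignores the buffer controller as well as any efficiency difference between DDR4 and DDR5, so treat the result as a rough upper-end estimate. The function name and defaults are my own.

```python
def estimate_module_power_w(capacity_gb: float,
                            ref_capacity_gb: float = 256.0,
                            ref_power_w: float = 10.0) -> float:
    """Scale a reference module's power linearly with capacity (rough estimate)."""
    return capacity_gb / ref_capacity_gb * ref_power_w


if __name__ == "__main__":
    power = estimate_module_power_w(2048)  # a hypothetical 2 TB module
    slot_limit_w = 18.0                    # typical DRAM slot rating quoted above
    print(f"Estimated power: {power:.0f} W")
    print(f"Equivalent to about {power / slot_limit_w:.1f} DRAM slots' worth of power")
```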

Samsung doesn’t give any details about the module they are unveiling, except that it is CXL based and has DDR5 in it. Not only that, but the ‘photos’ provided look a lot like renders, so it’s hard to state if they have an aesthetic unit available for photography, or if there’s simply a working controller in a bring-up lab somewhere that has been validated on a system. Update: Samsung has confirmed these are live shots, not renders.

As part of the announcement Samsung quoted AMD and Intel, indicating which partners they are more closely working with, and what they have today is being validated on Intel next-gen servers. Intel’s next-gen servers, Sapphire Rapids, are due to launch at the end of the year, in line with the Aurora supercomputing contract set to be initially shipped by year end.

47 Comments

  • Tomatotech - Tuesday, May 11, 2021 - link

    A wild wafer-munching wizard appears! He answers your questions here:

    https://www.anandtech.com/show/16227/compute-expre...
  • mode_13h - Tuesday, May 11, 2021 - link

    Uh, that's a good background read, but it doesn't specifically address the issue of latency (other than in regards to the optional point-to-point encryption feature).

    That's not fully-accurate. There's a throw-away line:

    "CXL 2.0 is still built upon the same PCIe 5.0 physical standard, which means that there aren’t any updates in bandwidth or latency ..."

    However, my understanding is that the design of the CXL protocol does actually reduce latency vs. PCIe, even though they share the same PHY layer.

    I think it's still going to be bad enough that you wouldn't forego direct-attached DRAM. I see these plug-in modules as being useful for caching and specifically as a memory pool shared between multiple CPUs (and other accelerators).
  • mode_13h - Tuesday, May 11, 2021 - link

    > a memory pool shared between multiple CPUs (and other accelerators).

    I mean specifically for holding shared data.
  • back2future - Tuesday, May 11, 2021 - link

    it was called RAMdrive or i-RAM https://en.wikipedia.org/wiki/RAM_drive
  • pjcamp - Tuesday, May 11, 2021 - link

    Wow! A memory expansion board! That takes me back.

    https://en.wikipedia.org/wiki/Expanded_memory
    https://en.wikipedia.org/wiki/Extended_memory

    I did my dissertation on a PC AT with one of these and Word for Windows before the Windows it ran on even existed. Word was bundled with a runtime version. I still had to save the document every 4 or 5 pages since the disk swapping became intolerable.
  • Toadster - Tuesday, May 11, 2021 - link

    DEVICE=C:\Windows\HIMEM.SYS
    DOS=HIGH,UMB
    DEVICE=C:\Windows\EMM386.EXE NOEMS
  • mode_13h - Tuesday, May 11, 2021 - link

    LOL.

    IIRC, HIMEM was all about unlocking that extra 64k at the end of the 20-bit address range. EMM was a paging-based hack to access > 1 MB addresses without having to run in full-blown 32-bit mode (of which there were 2, IIRC). And configuring your AUTOEXEC.BAT and CONFIG.SYS to play the latest DOS game eventually became a black art.

    Let's not even get started on IRQ and DMA channel conflicts...
  • MrEcho - Tuesday, May 11, 2021 - link

    This would be great for VM servers. RAM is always an issue when running a lot of VMs.
  • Kamen Rider Blade - Tuesday, May 11, 2021 - link

    Why can't the 3x major associations get along and work together?

    CXL (Compute Express Link)
    https://en.wikipedia.org/wiki/Compute_Express_Link

    Gen-Z
    https://en.wikipedia.org/wiki/Gen-Z

    OpenCAPI
    https://en.wikipedia.org/wiki/Coherent_Accelerator...

    Serialized Memory Interface could do a world of good for the future of computing.
    It could replace the current parallel interface to DRAM.
  • Billy Tallis - Tuesday, May 11, 2021 - link

    Last I heard, CXL and Gen-Z were working together, with the aim that CXL would be used more for direct-attached stuff like this within a single system, and Gen-Z would be more for interconnects between systems within the same rack or between racks.

    OpenCAPI seems to still be at odds with the other two, especially CXL. CAPI has been around a lot longer, but CXL seems to have garnered the support of everyone but IBM.
