Hot Chips 2020 Live Blog: IBM z15, a 5.2 GHz Mainframe CPU (11:00am PT)
by Dr. Ian Cutress on August 17, 2020 2:00 PM EST- Posted in
- CPUs
- Enterprise CPUs
- IBM
- Live Blog
- z15
- Hot Chips 32
02:04PM EDT - Final talk of this session is IBM z15. This is big iron mainframe stuff. Prepare to be shocked about these chips, and wonder why you don't have them.
02:04PM EDT - Mainframes are still releveant - 220+ billion lines of COBOL still in deployment today. 70% of all business transactions still use COBOL
02:05PM EDT - Programs built in 1964 on IBM mainframes still work today
02:05PM EDT - 70k searches on google per second, vs. 1.3 million transactions per second on mainframes
02:06PM EDT - Deep pipeline high frequency z-series
02:07PM EDT - z13 introduced SMT, z14 introduced pervasive encryption
02:07PM EDT - These CPUs are ground up built by IBM and not used by anyone else
02:07PM EDT - Two bits of silicon - Storage Controller with 960 MB L4 cache, four Complte Chips
02:08PM EDT - 5 drawers are fully connected through the SC chips
02:08PM EDT - Large 14nm SOI designs
02:08PM EDT - 700mm2+ each
02:08PM EDT - ~700mm2
02:08PM EDT - Each SC chip has 12 cores. 8 MB L2 cache
02:08PM EDT - Max config supports 240 cores. 190 cores are customer available, others are for management or recovery
02:09PM EDT - 60 PCIe 4 x16 connections
02:09PM EDT - 40 TB RAIM memory supported
02:09PM EDT - Two CP chips create a single logical cluster
02:10PM EDT - CP chip
02:10PM EDT - 12 cores, 5.2 GHz
02:10PM EDT - 9.1B transistors
02:10PM EDT - 128 KB L1-I cache, 128 KB L1-D cache
02:10PM EDT - L2 4MB cache private
02:10PM EDT - 256 MB L3 cache shared
02:11PM EDT - Secure exectuion - 38 new instructions of vector performance
02:11PM EDT - speed up of common instructions in a smart way
02:11PM EDT - on-chip accelerators such as gzip, elliptic curve crypto, on-core sort/merge
02:11PM EDT - Here are the comparisons to z14
02:11PM EDT - 14% ST perf over z14
02:12PM EDT - Deep pipeline, CISC architecture
02:12PM EDT - branch is async
02:12PM EDT - two copies of almost everything shown
02:12PM EDT - recovering unit for when errors are detected - processor rolls back
02:13PM EDT - This allows for transient recovery from hardware errors
02:13PM EDT - Known good state can be transferred to a new core if non-transient error happens
02:13PM EDT - The goal of these cores is to be recoverable, even when blasted with high-energy proton beams
02:15PM EDT - NXU is syncrhonous and runs in real time
02:16PM EDT - Two main ways for compression - IBM uses both depending on the size to get the best results
02:17PM EDT - Elliptic curve cryptography acceleration unit in each core, along with enhanced modulo unit which it relies on
02:17PM EDT - 'MA unit' has its own instruction set and 'core'
02:17PM EDT - sign and verify is implemented as firmware and hardware
02:17PM EDT - Acts as a blueprint for future accelerators
02:17PM EDT - Attached to back-end of the pipeline
02:18PM EDT - All execution is in order and non-speculative
02:18PM EDT - No pipeline pain points
02:18PM EDT - Results are passed to the core
02:18PM EDT - physically these accelerators could be placed far away from core logic as needed
02:18PM EDT - dozens to hundreds of modulo ops with a couple of doubleword instructions
02:19PM EDT - Core is called millicode ?
02:19PM EDT - Here are the internal instruction set
02:19PM EDT - speed ups vs external attached PCIe accelerator on z14
02:19PM EDT - Secure Execution for z15 with vertical isolation
02:20PM EDT - specialized mode in the CPU, IO, and memory subsystem
02:20PM EDT - ultravisor sits between the hypervisor and OS
02:21PM EDT - integrity hash and input/output counts to stop malicious guests
02:22PM EDT - controlled code environment
02:23PM EDT - 5.2 GHz water cooled
02:23PM EDT - 4 CP and 1 SC chip per drawer
02:23PM EDT - 14% ST and 25% cap vs z14
02:24PM EDT - Q&A time
02:25PM EDT - Q: Pipeline depth? A: It's long! Long front end and back end and extends with the recovery
02:25PM EDT - Around 30
02:25PM EDT - Q: L1 / L2 load to use latency? A: L1 4-cycle loop, 8-cycle for L1 miss/L2 hit
02:26PM EDT - Q: For secure execution, what is the isolation from firmware? A: Validated trusted firmware / ultravisor. That's an integral part of our secure guest security
02:27PM EDT - Q: Async branch prediction? What happens if it's behind i-fetch A: It's lossless, and i-fetch is in-order. After a pipeline restart, if i-fetch is ahead, the pipeline will react and throw away as required. There are syncs - there's a hard sync at dispatch, so no predictions are dropped
02:27PM EDT - Q: Core IPC vs Power10? A: ask power!
02:28PM EDT - AES-256 for page encryption, integrity hash is SHA-512
02:29PM EDT - Q: 5.2 GHz? How? A: Deep pipelining and focus on gate design. A lot of work. Deep pipelining is table steaks, but a lot fo other things are needed
02:30PM EDT - Q: Is power/eff an important focus? A: It does consume less power than z14 that's similar configured. From a chip perspective, the focus wasn't to reduce overall power - the focus was on performance and throughput. That was done to put two more cores in and double caches - we burned the power budget to add in more performance. This is the sort of product this is. We stuff more hardware and acceleration.
02:30PM EDT - That's a wrap for the first session. Come back in 30 minutes for the next session, where we start on AMD's 4000-series Renoir
02:31PM EDT - .
20 Comments
View All Comments
Ian Cutress - Monday, August 17, 2020 - link
They eventually said 5.2 GHz base in the Q&Aikjadoon - Monday, August 17, 2020 - link
>The goal of these cores is to be recoverable, even when blasted with high-energy proton beamsWell, I personally now feel underprepared for any incoming high-energy proton beams.
Mdarrish - Monday, August 17, 2020 - link
They or similar high energy radiation come in all the time - cosmic rays switch the state of gates in processors. The smaller the transistors, the easier it is to switch the state. Modern CPUs have redundant circuits to ‘vote’ on results when that happens.octavus - Monday, August 17, 2020 - link
Too bad that 940GB L4 cache chip can't be an option for X86 systems.Lord of the Bored - Monday, August 17, 2020 - link
"02:05PM EDT - Programs built in 1964 on IBM mainframes still work today"I am torn. On the one hand, I strongly believe that sort of backwards-compatibility is something to be applauded.
On the other hand, the surrounding context makes me concerned that they know this because some ancient program that hasn't been maintained in fifty years is a critical part of everyday financial transactions.
Mdarrish - Monday, August 17, 2020 - link
They were updated for Y2K, so only 20 years. Seriously, there are usually updates along the way. The point is that critical business systems don’t have to be rewritten or need emulators.Zizy - Tuesday, August 18, 2020 - link
If it works, why break it?While I doubt there is any meaningfully large piece of software from that era still in use, some core algorithms likely have been written back then and are still used today completely unmodified in over 50 years.
quadibloc - Tuesday, August 18, 2020 - link
The z15 has a very deep pipeline. So it's not surprising that it can have a 5.2 GHz clock. It's like a Pentium 4 (or even a Bulldozer); and it's kept cool by a fancy water-cooling system. So IBM hasn't worked miracles, it's just attained 5.2 GHz in ways that Intel and AMD reject for good reason.melgross - Tuesday, August 18, 2020 - link
AMD sand Intel can’t do this because this is a very large system. The cost to do it likely costs more than the totality of most server rooms.Dolda2000 - Wednesday, August 19, 2020 - link
To be fair though, a lot of the pipeline depth seems to be contributed by the retire/commit side of things, with all the extra stages for checkpointing and verification, which shouldn't leave any effect on branch mispredict penalties.Granted, the front-end pipeline isn't exactly short, but it doesn't look completely out of line compared to Intel designs. If anything, it surprises me a bit that they haven't implemented a micro-op cache to bypass all those myriad of steps they seem to have for decoding. It looks like something the design would benefit quite a lot from.