The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity

by Dr. Ian Cutress & Andrei Frumusanu on November 4, 2021 9:00 AM EST

Instruction Changes

Both of the processor cores inside Alder Lake are brand new – they build on the previous generation Core and Atom designs in multiple ways. As always, Intel gives us a high level overview of the microarchitecture changes, as we’ve written in an article from Architecture Day:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow for more operations to execute at once, enabling higher IPC and ILP from workloads that can take advantage. Usually a wider decode consumes a lot more power, but Intel says that its micro-op cache (now 4K) and front-end are improved enough that the decode engine spends 80% of its time power gated.

For the E-core, similarly it also has a 6-wide decode, although split to 2x3-wide. It has 17 execution ports, buffered by double the load/store support of the previous generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.

As part of our analysis into new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is literally a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

For any of the following entries listed as A|B, A is the latency (in clocks) and B is the reciprocal throughput (clocks per instruction).
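To make the A|B notation concrete, here is a minimal sketch that parses entries of the form used in the lists below and converts reciprocal throughput into instructions per clock. The `Timing` class and the `parse_timing` / `parse_change` helpers are hypothetical names introduced for illustration only; they are not part of any tool used for this analysis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Timing:
    latency: float           # clocks until the result is available
    recip_throughput: float  # clocks per instruction for independent instructions

    @property
    def per_clock(self) -> float:
        """Instructions per clock, the inverse of reciprocal throughput."""
        return 1.0 / self.recip_throughput


def parse_timing(text: str) -> Timing:
    """Parse a single 'A|B' entry such as '3.3|0.5'."""
    latency, recip = text.strip().split("|")
    return Timing(float(latency), float(recip))


def parse_change(text: str) -> Tuple[Timing, Timing]:
    """Parse an 'A|B -> C|D' entry into (before, after)."""
    before, after = text.split("->")
    return parse_timing(before), parse_timing(after)


# Example: the 'CLMUL xmm 6|1->3|1' entry from the P-core SIMD list.
old, new = parse_change("6|1->3|1")
print(old.latency, new.latency, new.per_clock)
```

Read this way, an entry such as 4|1->3.3|0.5 means latency dropped from 4 to 3.3 clocks while sustained throughput doubled from one to two instructions per clock.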

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

6-wide decoder with 32B (byte) window: code size is much less important, e.g. 3 MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was just a 3->4 jump)
Triple load: (almost) universal
- every GPR, SSE, VEX, EVEX load gains (only MMX load unsupported)
- BROADCAST*, GATHER*, PREFETCH* also gains
Decoupled double FADD units
- every single and double SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) has latency gains
- Another ADD/SUB means 4->2 clks
- Another MUL means 4->3 clks
- AVX512 support: 512b ADD/SUB rec. throughput 0.5, as in server!
- exception: half precision ADD/SUB handled by FMAs
- exception: x87 FADD remained 3 clks
Some forms of GPR (general purpose register) immediate additions are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" step)
- LEA r64, [r64+imm8]
- ADD r64, imm8
- ADD r64, imm32
- INC r64
- Is this just for 64b addition GPRs?
eliminated instructions:
- MOV r32/r64
- (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
- 0-5 0x66 NOP
- LNOP3-7
- CLC/STC
zeroing idioms:
- (V)XORPS/PD, (V)PXOR xmm, ymm
- (V)PSUB(U)B/W/D/Q xmm
- (V)PCMPGTB/W/D/Q xmm
- (V)PXOR xmm
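The "3 MOV imm64 per clock" figure in the decoder entry above follows from simple arithmetic on the decode window, sketched below. A MOV of a 64-bit immediate into a 64-bit register encodes to 10 bytes. The 16-byte window used for the older core is an assumption for comparison (the list above only gives the new 32-byte figure), and the model deliberately ignores alignment, the micro-op cache, and mixed instruction lengths.

```python
# Illustrative upper-bound model only; real front ends also depend on
# alignment, the micro-op cache, and the instruction mix.

MOV_IMM64_BYTES = 10  # REX.W prefix + opcode + 8-byte immediate


def max_decoded_per_clock(instr_bytes: int, window_bytes: int, decode_width: int) -> int:
    """Upper bound on equally sized instructions decoded per clock."""
    return min(decode_width, window_bytes // instr_bytes)


# 6-wide decode with a 32-byte window, as listed above.
golden_cove = max_decoded_per_clock(MOV_IMM64_BYTES, window_bytes=32, decode_width=6)

# Assumed comparison point: 4-wide decode with a 16-byte window.
older_core = max_decoded_per_clock(MOV_IMM64_BYTES, window_bytes=16, decode_width=4)

print(golden_cove, older_core)  # prints "3 1"
```

Under these assumptions the window size, not the decoder count, is the limit for long instructions, which is why the wider window makes code size less of a concern.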

Faster GPR instructions (vs Cypress Cove):

LOCK latency 20->18 clks
LEA with scale throughput 2->3/clk
(I)MUL r8 latency 4->3 clks
LAHF latency 3->1 clks
CMPS* latency 5->4 clks
REP CMPSB 1->3.7 Bytes/clock
REP SCASB 0.5->1.85 Bytes/clock
REP MOVS* 115->122 Bytes/clock
CMPXCHG16B 20|20 -> 16|14
PREFETCH* throughput 1->3/clk
ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
SHA1RNDS4 latency 6->4
SHA1MSG2 throughput 0.2->0.25/clock
SHA256MSG2 11|5->6|2
ADC/SBB (r/e)ax 2|2 -> 1|1
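As a rough illustration of what the bytes/clock figures for the REP string instructions mean in practice, the sketch below estimates the clocks needed to process a 4 KiB buffer before and after the change. It assumes steady-state rates only and ignores the fixed startup cost of REP instructions, so the numbers are best read as relative, not absolute. The `RATES` table and helper names are made up for this example; the values are copied from the list above.

```python
# (old, new) sustained rates in bytes per clock, taken from the list above.
RATES = {
    "REP CMPSB": (1.0, 3.7),
    "REP SCASB": (0.5, 1.85),
    "REP MOVS*": (115.0, 122.0),
}


def clocks_for_buffer(num_bytes: int, bytes_per_clock: float) -> float:
    """Steady-state clocks to process num_bytes (startup overhead ignored)."""
    return num_bytes / bytes_per_clock


def speedup(old_rate: float, new_rate: float) -> float:
    """Relative speedup between two bytes-per-clock rates."""
    return new_rate / old_rate


BUFFER_BYTES = 4096  # illustrative buffer size

for name, (old_rate, new_rate) in RATES.items():
    before = clocks_for_buffer(BUFFER_BYTES, old_rate)
    after = clocks_for_buffer(BUFFER_BYTES, new_rate)
    print(f"{name}: {before:.0f} -> {after:.0f} clocks ({speedup(old_rate, new_rate):.2f}x)")
```

The compare and scan forms gain the most (3.7x each), while REP MOVS*, already fast, moves by about 6%.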

Faster SIMD instructions (vs Cypress Cove):

*FADD xmm/ymm latency 4->3 clks (after MUL)
*FADD xmm/ymm latency 4->2 clks (after ADD)
* means (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD) affected
VADD/SUB/PS/PD zmm 4|1->3.3|0.5
CLMUL xmm 6|1->3|1
CLMUL ymm, zmm 8|2->3|1
VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock

Regressions, Slower instructions (vs Cypress Cove):

Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
PAUSE latency 140->160 clocks
LEA with scale latency 2->3 clocks
(I)DIV r8 latency 15->17 clocks
FXCH throughput 2->1/clock
LFENCE latency 6->12 clocks
VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
(V)AESKEYGEN latency 12->13 clocks
VCVTPS2PH/PH2PS latency 5->6 clocks
BZHI throughput 2->1/clock
VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

E-core: Gracemont vs Tremont

Microarchitecture Changes:

Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
Zen2-like memory renaming with GPRs
New zeroing idioms
- SUB r32, r32
- SUB r64, r64
- CDQ, CQO
- (V)PSUBB/W/D/Q/SB/SW/USB/USW
- (V)PCMPGTB/W/D/Q
New ones idiom: (V)PCMPEQB/W/D/Q
MOV elimination: MOV; MOVZX; MOVSX r32, r64
NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5

Faster GPR instructions (vs Tremont)

PAUSE latency 158->62 clocks
MOVSX; SHL/R r, 1; SHL/R r,imm8 tp 1->0.25
ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
CMOV* throughput 1->2 per clock
RCR r, 1 10|10 -> 2|2
RCR/RCL r, imm/cl 13|13->11|11
SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
MOVBE latency 1->0.5 clocks
(I)MUL r32 3|1 -> 3|0.5
(I)MUL r64 5|2 -> 5|0.5
REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 byte/clock -> 15/15/15/15 bytes/clock
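One practical consequence of the PAUSE figures (160 clocks on the P-core, 62 clocks on the E-core) is that the same spin-wait loop lasts a different amount of wall time depending on which core type it runs on. The sketch below works through that arithmetic. The clock frequencies are round, illustrative values chosen for the example, not measurements, and `pause_loop_ns` is a hypothetical helper.

```python
def pause_loop_ns(iterations: int, pause_clocks: float, freq_ghz: float) -> float:
    """Wall time in nanoseconds for a loop of back-to-back PAUSE instructions.

    Clocks divided by GHz gives nanoseconds; loop overhead is ignored.
    """
    return iterations * pause_clocks / freq_ghz


ITERATIONS = 1000  # illustrative spin count

# PAUSE latencies from the lists above; frequencies are assumed round numbers.
p_core_ns = pause_loop_ns(ITERATIONS, pause_clocks=160, freq_ghz=5.0)
e_core_ns = pause_loop_ns(ITERATIONS, pause_clocks=62, freq_ghz=4.0)

print(f"P-core: {p_core_ns:.0f} ns, E-core: {e_core_ns:.0f} ns")
```

With these assumed frequencies, a fixed-count spin loop waits roughly twice as long on a P-core as on an E-core, so code that tunes spin counts by iteration rather than by elapsed time may behave differently across the two core types.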

Faster SIMD instructions (vs Tremont)

A lot of xmm SIMD throughput is 4/clock instead of the theoretical maximum(?) of 3/clock; we are not sure how this is possible
MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
PADDB/W/D; PSUBB/W/D PAVGB/PAVGW 1|0.5 -> 1|.33
PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|.33
PShift (x)mm, (x)mm 2|1 -> 1|.33
PMUL*, PSADBW mm, xmm 4|1 -> 3|1
ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
MULPS/PD 4|1 -> 4|0.5
CVT*, ROUND xmm, xmm 4|1 -> 3|1
BLENDV* xmm, xmm 3|2 -> 3|0.88
AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1
SHA256RNDS2 5|2 -> 4|1
PHADD/PHSUB* 6|6 -> 5|5

Regressions, Slower (vs Tremont):

m8, m16 load latency 4->5 clocks
ADD/MOVBE load latency 4->5 clocks
LOCK ADD 16|16->18|18
XCHG mem 17|17->18|18
(I)DIV +1 clock
DPPS 10|1.5 -> 18|6
DPPD 6|1 -> 10|3.5
FSIN/FCOS 12% slower

474 Comments

mode_13h - Saturday, November 6, 2021 - link
> A competent CEO wouldn’t have allowed this situation to occur.

And you'd know this because... ?

> This is an outstanding example of how the claim that only engineers
> make good CEOs for tech companies is suspect.

You're making way too much of this.

I don't know who says "only engineers make good CEOs for tech companies". That's an absolutist statement I doubt anyone reasonable and with suitable expertise to make such proclamations ever would. There are plenty of examples where engineers in the CEO's chair have functioned poorly, but also many where they've done well. The balance of that particular assessment doesn't hang on Gelsinger, and especially not on this one issue.

Also, your liberal arts degree is showing. I'm not casting any aspersions on liberal arts, but when you jump to attack engineers for stepping outside their box, it does look like you've got a big chip on your shoulder.
Oxford Guy - Sunday, November 7, 2021 - link
‘Also, your liberal arts degree is showing. I'm not casting any aspersions on liberal arts’

Your assumption about my degrees is due to the fact that I understand leadership and integrity?
mode_13h - Sunday, November 7, 2021 - link
> Your assumption about my degrees is due to the fact that I understand leadership and integrity?

Yes, exactly. It's exactly your grasp of leadership and integrity that I credit for your attack on engineers stepping outside their box. Such a keen observation. /s

(and that's how you "sarcasm", on the Internet.)
kwohlt - Sunday, November 7, 2021 - link
I'm not sure how familiar you are with CPU design, but Alder Lake was taped in before Gelsinger took over. The design was finalized, and there was no changing it without massive delays. For the minuscule amount of the market that insists on AVX-512 for the consumer line, it can be implemented after disabling E-cores. AVX-512 just doesn't work on Gracemont, so you can't have both Gracemont and AVX-512 simultaneously. CPU designs take 4 years. You'll see the true impact of Gelsinger's leadership in a few years.
SystemsBuilder - Saturday, November 6, 2021 - link
MS and Intel tried to sync their plans to launch Windows 11 and Alder Lake at (roughly) the same time. Intel might have been rushed to lock their POR to hit the Windows 11 launch. There may even be a contractual relationship between Intel and Microsoft to make sure Windows 11 runs best on Intel's Alder Lake - Intel pays MS to optimize the scheduler for Alder Lake and in return Intel has to lock the Alder Lake POR maybe even up to a year ago... because MS was not going to move the Windows 11 launch date.

Speculation from my side of course, but I don't think I am too far off...
Oxford Guy - Saturday, November 6, 2021 - link
Such excuses don’t work.

The current situation is inexcusable.
SystemsBuilder - Saturday, November 6, 2021 - link
yes it is inexcusable BUT Pat might not have had a choice because he does not control Microsoft.
Satya N. would just tell Pat we have a contract - fulfill it!
We are not going to delay Windows 11 it's shipping October 2021 so we will stick with the POR you gave us in 2020!
Satya is running a $2.52 trillion market cap company current #1 in the world
Pat is running a $206.58 billion market cap company
so guess who's calling the shots.
Pat says "ok... but maybe we can enable it for the 22H1 version of win 11, please Satya help me out here..."
in the end I think MS will do the right thing and get it to work but it might get delayed a bit.
Again, my speculation. And again, I don't think I am far off...
Oxford Guy - Saturday, November 6, 2021 - link
The solution was not to create this incompetent partial ‘have faith’ AVX-512 situation. Faith is for religion, not tech products.

The solution was to be clear and consistent. For instance, if Windows is the problem then that should have been made clear. Gelsinger should have said MS doesn’t yet have its software at full compatibility with Alder Lake. He should have said it will be officially supported when Windows is ready for it.

He should have had a software utility for power users to disable the small cores in order to have AVX-512 support, or at least a BIOS option — mandated for all Alder Lake boards — that disables them as an option for those who need AVX-512.

The current situation cannot be blamed on Microsoft. Intel has the ability to be clear, consistent, and competent about its products.

Claiming that Intel isn’t a large enough entity to tell the truth also doesn’t pass muster. Even if it’s inconvenient for Microsoft to be exposed for releasing Windows 11 prematurely and even if it’s inconvenient for Intel to be exposed for releasing Alder Lake prematurely — saving face isn’t an adequate excuse for creating a situation this untenable.

Consumers deserve non-broken products that aren’t sold via smoke and mirrors tactics.
SystemsBuilder - Saturday, November 6, 2021 - link
a couple of points:
- yes it would have been better to communicate to the market that AVX-512 will be enabled with 22H1 (or whatever - speculating) of Windows 11, but what about making it work with Windows 10, and when... I mean the whole situation is a cluster. I do agree that the current marketing decisions under Pat, about what and how to communicate to the market regarding Alder Lake, AVX-512 and Windows 10/11, could have been handled much, much better. The way they have done it is a disaster. It's like, is it in or out, I mean wtf. Is it strategic or not. This market communication, the related decisions, and whatever new agreements they need to strike with Microsoft to make the whole thing make sense are on Pat - firmly!
- i am not blaming Microsoft at all. I am mostly blaming the old marketing and the old CEO - pure incompetence for getting Intel into this situation in the first place. I don't have all the insights into Intel's internals but from an outside perspective it looks like that to me.
Oxford Guy - Saturday, November 6, 2021 - link
Gelsinger’s responsibility is to lead, not blame previous leadership.

Alder Lake came out on his watch. The AVX-512 debacle, communications and lack of mandated minimum specs (official partial support for the lifetime of AL in 100% of AL boards via BIOS switch to disable small cores) happened while he was CEO.

The lie about fusing off happened under his leadership.

We have been lied to and spacetime can’t be warped to erase the stain on his tenure.
