Apple's Cyclone Microarchitecture Detailed
by Anand Lal Shimpi on March 31, 2014 2:10 AM EST

The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time, I assumed Apple simply addressed some low-hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had more knowledge of what was underneath the hood:
As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.
With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.
Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.
Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).
**Apple Custom CPU Core Comparison**

| | Apple A6 | Apple A7 |
|---|---|---|
| CPU Codename | Swift | Cyclone |
| ARM ISA | ARMv7-A (32-bit) | ARMv8-A (32/64-bit) |
| Issue Width | 3 micro-ops | 6 micro-ops |
| Reorder Buffer Size | 45 micro-ops | 192 micro-ops |
| Branch Mispredict Penalty | 14 cycles | 16 cycles (14 - 19) |
| Integer ALUs | 2 | 4 |
| Load/Store Units | 1 | 2 |
| Load Latency | 3 cycles | 4 cycles |
| Branch Units | 1 | 2 |
| Indirect Branch Units | 0 | 1 |
| FP/NEON ALUs | ? | 3 |
| L1 Cache | 32KB I$ + 32KB D$ | 64KB I$ + 64KB D$ |
| L2 Cache | 1MB | 1MB |
| L3 Cache | - | 4MB |
As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.
I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192 entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.
On the execution side Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including 3 FP/NEON adds), but because the third FP/NEON pipe is reserved for div and sqrt operations, the machine can only execute two FP/NEON muls in parallel.
I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:
Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor, it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.
Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.
It wasn't until I wrote this piece that Apple's codenames started to make sense. Swift was quick, but Cyclone really does stir everything up. The earlier than expected introduction of a consumer 64-bit ARMv8 SoC caught pretty much everyone off guard (e.g. Qualcomm's shift to vanilla ARM cores for more of its product stack).
The real question is where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious example would be through frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air); assuming Apple moves to a 20nm process later this year, it should be possible to gain some performance simply by scaling up clock speed without a power penalty. I suspect Apple has more tricks up its sleeve than that however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).
Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.
182 Comments
twotwotwo - Monday, March 31, 2014 - link
I think the iPad gradually sneaking into desktop use cases is more likely to be Apple's hope. Then Apple gets the ecosystem control and 30% share they have on iOS, doesn't have to get legacy software ported or meet desktop expectations, etc. But I, too, think anything is possible. :)

teiglin - Monday, March 31, 2014 - link
I hope this is a kick in the pants to Qualcomm, actually--the fact is there is no real competition for 8974 at the top of Android products; A15s just don't seem to fare as well in practice. I hope that the successor to 8974 is as aggressive Cyclone, and that Android OEMs are willing to pay for a larger, more-powerful SoC.

It's frustrating to me as someone who genuinely prefers Android to iOS--Cyclone actually lured me to the 5s for a few months, and I didn't even mind the small size so much; still, I found that I'm more comfortable in Android. Apple's vertical integration really does lead to good stuff on so many levels, but software flexibility is not one of them.
teiglin - Monday, March 31, 2014 - link
ugh, where is my edit button? I hope 8974's successor is as aggressive AS Cyclone. Words are hard, I guess.jeffkibuule - Monday, March 31, 2014 - link
Except as much as we may think, Qualcomm does not compete with Apple for CPUs. There's no pitch meeting to Apple to use Snapdragon, certainly not after they've invested hundreds of millions of dollars to create their own custom CPU architectures. And the sad thing is that the companies Qualcomm does or did compete with (nvidia and TI respectively), they've effectively shut them out due to shipping a well integrated SoC with a very modern cellular baseband attached.

Mondozai - Monday, March 31, 2014 - link
Nvidia is down but not out.

Qualcomm is in a way competing with Apple. If they and other android fabless companies flub the SoC, then Apple sina marketshare.
Remember, Apple is competing on everything, from SoC, hardware to software. This may seem obvious, but apparently it needs to be pointed out.
Mondozai - Monday, March 31, 2014 - link
Sina = gains. Not sure how autocorrect did that one.

jeffkibuule - Monday, March 31, 2014 - link
I would still say they aren't. Qualcomm needs to be profitable per chip, Apple only per phone. They can reliably depend on charging $100 for $8 worth of flash storage to pay for their R&D and larger silicon without worry. And keep in mind there are only so many high-end phones, otherwise a chip like the Tegra K1 becomes too expensive. Nvidia fell behind because they also don't offer the wide portfolio of updated chips and multiple performance/cost points Qualcomm does.

WaltFrench - Monday, March 31, 2014 - link
Methinks that in terms of performance per watt or per dollar, Qualcomm only needs to compete with other shops that sell independently (mostly, to Android OEMs).

If Apple somehow tripled its benchmark performance while cutting power in half, they wouldn't sell THAT many more phones (unless they matched that feat with similarly unlikely price or feature-set changes).
twotwotwo - Monday, March 31, 2014 - link
Just geeking out, a bunch of truly irresponsible speculations about what they could do, besides freq scaling, with the extra room provided by 20nm:

- True L3. Earlier, Peter Greenhalgh referred to what Apple has now as a write buffer, not a true L3 cache (which would include clean data evicted from L2 and prefetched data, I guess). They could change that.
- RAM interface. 128-bit could come back. I guess LPDDR4 is coming someday, too.
- More special-purpose silicon. Add something for compression (which AMD advertises as part of their Seattle ARM SoC plan). They could use that for compressed RAM a la Mavericks, to get more out of the physical RAM they have.
- GPU, etc. Imagination's announced Wizard. Way out there, if Apple saw something cool they could do by getting into the GPU IP business themselves, they could: expensive, but probably less so than their ARM design odyssey.
- Little cores. Could cut the base CPU power draw with screen on. Could do it to increase battery life, or to help cancel out any power-draw increase from higher peak clocks or whatever. Sounds like a sane idea, but relatively few SoCs doing it now, so hrm.
There would be many good reasons not to attack things like that next round, either to separate process and arch changes like Intel, or if their current big push is on something else entirely, little chips for watches or SoCs for TVs or whatever. But it seems plausible there are a lot of ways to improve the SoC other than a wider core.
coder543 - Monday, March 31, 2014 - link
If Apple doesn't already use Mavericks-style RAM compression in iOS, that is some *seriously* low-hanging fruit they need to grab, since they're so heavily opposed to including more RAM in their devices. The Linux kernel has had experimental RAM compression for ages now (zram, since at least 2009) and I wish Android were using it. It seems to be an option in Android 4.4, but I can't find any mention of it actually being used on a real-world device.

From a simply pragmatic viewpoint, Apple could just make the A8 a slightly-tweaked A7 that also happens to be quad-core, with 2GB+ of RAM. That would be more than enough to excite people, and with a proper hot-plugging algorithm, they could get all of the benefits of a dual-core chip and all of the benefits of a quad-core chip, depending on the software load it encounters.