Qualcomm Snapdragon S4 (Krait) Performance Preview - 1.5 GHz MSM8960 MDP and Adreno 225 Benchmarks
by Brian Klug & Anand Lal Shimpi on February 21, 2012 3:01 AM EST
Posted in: Smartphones, Snapdragon, Qualcomm, Adreno, Krait, Mobile
We won't go too deep into Krait's CPU architecture, because we've already done so in an earlier piece. What we can provide, however, is a quick recap. Architecturally, Krait isn't a design of tradeoffs; rather, it's a significant step forward along almost all vectors. Each core can fetch, decode and execute more instructions in parallel than its predecessor (Scorpion, used in Snapdragon S1/S2/S3).
Qualcomm Architecture Comparison
Feature | Scorpion | Krait
Pipeline Depth | 10 stages | 11 stages
Decode | 2-wide | 3-wide
Issue Width | 3-wide? | 4-wide
Execution Ports | 3 | 7
L2 Cache (dual-core) | 512KB | 1MB
Core Configurations | 1, 2 | 1, 2, 4
Even if you're not comparing to Qualcomm's previous architecture, Krait maintains the same low level advantage over any ARM Cortex A9 based design (NVIDIA Tegra 2/3, TI OMAP 4, Apple A5). Clock speeds are up with only a small increase in pipeline depth. Wider execution and higher clocks together should result in significant performance improvements for even single threaded applications. If you want to abstract by one more level: Krait will be faster regardless of application, regardless of usage model. You're looking at a generational gap in architecture here, not simply a clock bump.
Architecture Comparison
Feature | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait
Decode | single-issue | 2-wide | 2-wide | 2-wide | 3-wide
Pipeline Depth | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages
Out of Order Execution | N | N | Y | Partial | Y
FPU | VFP11 (pipelined) | VFPv3 (not-pipelined) | Optional VFPv3 (pipelined) | VFPv3 (pipelined) | VFPv4 (pipelined)
NEON | N/A | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) | Y (128-bit wide)
Process Technology | 90nm | 65nm/45nm | 40nm | 40nm | 28nm
Typical Clock Speeds | 412MHz | 600MHz/1GHz | 1.2GHz | 1GHz | 1.5GHz
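To put the decode width and clock speed rows side by side, here is a rough back-of-the-envelope sketch. It multiplies decode width by typical clock speed to get a theoretical decode ceiling for each core. This is our own simplification, not a performance prediction: real throughput depends on issue width, out-of-order depth, caches and memory, and the name `peak_decode_rate` is purely illustrative.

```python
# Decode width (instructions per clock) and typical clock (GHz),
# copied from the architecture comparison table above.
cores = {
    "ARM Cortex A9": (2, 1.2),
    "Qualcomm Scorpion": (2, 1.0),
    "Qualcomm Krait": (3, 1.5),
}


def peak_decode_rate(width, clock_ghz):
    """Theoretical ceiling in billions of instructions decoded per second.

    Assumption: every cycle decodes the full width, which no real
    workload sustains. Treat the result as an upper bound only.
    """
    return width * clock_ghz


peak = {name: peak_decode_rate(w, c) for name, (w, c) in cores.items()}
krait_vs_a9 = peak["Qualcomm Krait"] / peak["ARM Cortex A9"]

for name, rate in peak.items():
    print(f"{name}: {rate:.1f} G instructions/s (ceiling)")
print(f"Krait vs Cortex A9 ceiling ratio: {krait_vs_a9:.3f}")
```

The ratio only says the front end has more headroom; the benchmark results are what show how much of that headroom turns into delivered performance.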
The chip's memory interface has been improved tremendously. At a high level, the MSM8960 is Qualcomm's first SoC to feature PoP support for two LPDDR2 memory channels. We suspect there are lower level improvements to the memory interface as well; however, we don't have more details from Qualcomm, and the current state of memory latency/bandwidth testing on Android is pretty abysmal.
Quantifying the Krait performance advantage requires a mixture of synthetic and application level tests. We'll start with Linpack, a Java port of the classic memory bandwidth/FPU test:
Occasionally we'll see performance numbers that just make us laugh at their absurdity. Krait's Linpack performance is no exception. The performance advantage here is insane. The MSM8960 is able to deliver more than twice the performance of any currently shipping SoC. The gains are likely due in no small part to improvements in Krait's cache/memory controller. Krait can also issue multiple FP instructions in parallel, whereas A9 class architectures can apparently only dual-issue integer instructions.
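For readers unfamiliar with what a Linpack-style MFLOPS figure measures, the following is a minimal sketch of the idea: time the solution of a dense linear system and divide the conventional operation count (2/3·n³ + 2·n²) by the elapsed time. It is not the code of the Android Linpack app; the function names `solve` and `linpack_mflops` and the default problem size are our own assumptions, and pure Python numbers are not comparable to the results discussed here.

```python
import random
import time


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Modifies a and b in place and returns the solution vector x.
    """
    n = len(b)
    for k in range(n):
        # Pick the row with the largest pivot to keep elimination stable.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
            b[i] -= f * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_mflops(n=100, seed=0):
    """Return (MFLOPS, max error) for one n-by-n solve."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    # Right-hand side chosen so that the exact solution is all ones.
    b = [sum(row) for row in a]
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # conventional Linpack count
    return flops / elapsed / 1e6, max(abs(v - 1.0) for v in x)


if __name__ == "__main__":
    mflops, err = linpack_mflops()
    print(f"{mflops:.1f} MFLOPS, max error {err:.2e}")
```

Because the inner loop is dominated by floating point multiply-adds over a matrix that has to be streamed from cache or memory, a test of this shape rewards both FP issue width and memory system improvements, which is consistent with the explanation above.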
Moving on we have our standard JavaScript benchmarks: Sunspider and Browsermark. Both of these tests show significant performance improvements, although understandably not by the margins we saw above in Linpack:
Krait and the MSM8960 are 20 - 35% faster than the dual-core Cortex A9s used in Samsung's Galaxy Nexus. For a look at how overall web page loading is impacted we loaded AnandTech.com three times and averaged the results. We presented results with the browser cache cleared after each run as well as results after all assets were cached:
AnandTech.com Page Loading Comparison (Stock ICS Browser)
Device | Browser Cache Cleared | Cache In Use
Qualcomm MDP MSM8960 (Krait) | 5.5 seconds | 3.0 seconds
Samsung Galaxy Nexus (ARM Cortex A9) | 5.8 seconds | 4.4 seconds
There's hardly any advantage when you're network bound, which is to be expected. However, whenever the device can pull assets from a local cache (something that is quite common, as images, CSS and even many page elements remain static between loads) the advantage grows considerably. Here we're seeing a 46% advantage from Krait over the Cortex A9 in the Galaxy Nexus (4.4 seconds versus 3.0 seconds).
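The arithmetic behind that figure can be sketched as follows, using only the load times from the table. The convention assumed here is "how much longer the slower device takes relative to the faster one"; the helper name `advantage_pct` is our own.

```python
# Page-load times in seconds, copied from the page loading table above.
load_times = {
    "cache_cleared": {"krait": 5.5, "cortex_a9": 5.8},
    "cache_in_use": {"krait": 3.0, "cortex_a9": 4.4},
}


def advantage_pct(fast_s, slow_s):
    """Percent by which the slower load time exceeds the faster one."""
    return (slow_s / fast_s - 1.0) * 100.0


page_load_advantage = {
    scenario: advantage_pct(t["krait"], t["cortex_a9"])
    for scenario, t in load_times.items()
}

for scenario, pct in page_load_advantage.items():
    print(f"{scenario}: {pct:.1f}%")
```

With the cache cleared the gap is only about 5%, while with assets cached it grows to roughly 46 to 47%, which matches the number quoted above. Expressed the other way around, as time saved on the Krait device, the cached case is a reduction of about 32%.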
We turn to Qualcomm's own Vellamo as a system/CPU/browser performance test:
Again, we're showing a huge performance advantage here thanks to Krait. Seeing as how Vellamo is a Qualcomm benchmark, don't get too attached to the advantage here, but it does echo some of what we've seen earlier.
Finally we have Rightware's Basemark OS 1.1 RC, which is fast becoming an impressively polished system benchmark, one which will hopefully eventually take the place of the likes of Quadrant.
Basemark OS - System
Test | HTC Rezound | Galaxy Nexus | MDP MSM8960
System Overall Score | 658 | 538 | 907
Simple Java 1 | 298 loops/s | 210 loops/s | 375 loops/s
Simple Java 2 | 7.28 loops/s | 8.61 loops/s | 10.8 loops/s
SMP Test | 35.3 loops/s | 49.2 loops/s | 64.4 loops/s
100K File (eMMC->SD) | 6.49 MB/s | 9.52 MB/s | 8.64 MB/s
100K File (SD->eMMC) | 33.0 MB/s | 17.8 MB/s | 39.8 MB/s
100K File (eMMC->eMMC) | 37.8 MB/s | 34.5 MB/s | 48.9 MB/s
100K File (SD->SD) | 8.47 MB/s | 8.30 MB/s | 12.7 MB/s
Database Operation | 10.0 ops/s | 5.73 ops/s | 19.4 ops/s
Zip Compression | 0.509 s | 0.848 s | 0.561 s
Zip Decompression | 0.097 s | 0.206 s | 0.073 s
On the CPU centric tests, Basemark OS is showing anywhere from a 20% to 80% increase in performance over the 1.5 GHz APQ8060 based HTC Rezound. IO performance is also tangibly improved, although that could be a function of NAND performance rather than the SoC specifically.
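As a quick check of that range, the per-test gains over the HTC Rezound can be computed directly from the table. Which rows count as "CPU centric" is our assumption here (the overall score, the two Simple Java tests and the SMP test); the helper name `gain_pct` is illustrative.

```python
# (HTC Rezound, MDP MSM8960) scores copied from the Basemark OS table.
# All four are "higher is better" results.
cpu_tests = {
    "System Overall Score": (658, 907),
    "Simple Java 1": (298, 375),
    "Simple Java 2": (7.28, 10.8),
    "SMP Test": (35.3, 64.4),
}


def gain_pct(rezound, mdp):
    """Percent increase of the MDP MSM8960 score over the Rezound score."""
    return (mdp / rezound - 1.0) * 100.0


basemark_gains = {name: gain_pct(r, m) for name, (r, m) in cpu_tests.items()}

for name, pct in basemark_gains.items():
    print(f"{name}: +{pct:.1f}%")
```

Under this selection the gains run from roughly 26% (Simple Java 1) to roughly 82% (SMP Test), in line with the range quoted above.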
These results as a whole simply quantify what we've felt during our use of the MSM8960 MDP: this is the absolute smoothest we've ever seen Ice Cream Sandwich run.
86 Comments
bhspencer - Tuesday, February 21, 2012
Does anyone know if Linpak is using the hardware or software floating point calculations for the MFLOPS number.
metafor - Wednesday, February 22, 2012
Hardware. But it's run on the JIT instead of native code. According to CF-Bench, Java FP performance is around 1/3 of native. Neither actually use NEON but instead uses the older VFP instructions.
vision33r - Tuesday, February 21, 2012
The Tegra 3 is actually a big disappointment from a performance standpoint. It actually has 5 CPU cores and the GPU performance isn't much better than the Tegra 2. The Adreno 225 is a much bigger upgrade but I'm afraid that it's another marginal upgrade.
The A5 in the iPad2/iPhone 4S are over 1 year old by March. In that time, Nvidia's Tegra 2/3 has not dominated and the MSM8960 is finally a true contender for the fastest SOC on the market. By the time this thing is out in volume, Apple has the A6 ready and most likely another 4-8x performance increase over the A5.
This SOC will probably be forgotten when the A6 is out.
LetsGo - Wednesday, February 22, 2012
Yeah your right looking at my Asus Transformer Prime running GTA 3. /S
A lot of graphical optimisations can be done on the CPU cores before data is offloaded to the GPU.
The moral of the story is that Benchmarks are only a rough guide at best.
tipoo - Wednesday, February 22, 2012
Unless the rumors are true and its A5X, not A6, with just faster dual cores rather than quads on a newer architecture. I would not be surprised, its like the 3G-3GS was an architecture change, then the 4 was just a faster chip on a similar architecture. The iPad 2 was an architecture change, the 3 might just be a faster version of the same thing, hopefully with improvements in the GPU. I'd be fine with that, as long as the GPU kept up with the new resolution.
Stormkroe - Tuesday, February 21, 2012
I was just plotting out what little resolution scaling info there is here and noticed something very odd. Both the iphone 4s and galaxy s2 actually score MUCH higher when the resolution is raised to 720p offscreen. I can see that in the 4s' case it could be explained with fps caps, but the S2 is definitely not hitting a cap at 34.6 fps @ 800x480, yet it hits 42.5 fps @ 1280x720. All other phones predictably step down in speed. Anyone else notice this?
Alexstarfire - Tuesday, February 21, 2012
Yes I did. It was actually the reason I was going to post. I was curious to know if the iPhone had VSync or not because it made no sense that it would get better performance at a higher resolution. Neither of the results make any sense to me since they are counter-intuitive.
If the "offscreen" tests force VSync off then that could explain it for the iPhone but not really for the SGSII unless some parts of the test go way past the 60FPS cap with VSync turned on.
alter.eg00 - Wednesday, February 22, 2012
Shut up & take my money
Denithor - Wednesday, February 22, 2012
Seconded!!
I'm still carrying a first generation HTC Incredible (yep, one of the original ones!), been out of contract for a few months, was waiting to hear more about the 28nm SoC update. These look really, really good, seriously looking forward to them hitting the market now!
tipoo - Wednesday, February 22, 2012
I wonder how many apps scale beyond two cores. For the time being, I doubt its many, and since you're still not doing any true multitasking I think a faster dual core like this will trump a slower quad like the Tegra 3 most of the time.