andrebrait - Tuesday, August 12, 2014 - link
It's a good year for Intel. First the USB 3.0 sleep bug (which persists in Intel's mobile 8-series chipsets), now this.
TiGr1982 - Tuesday, August 12, 2014 - link
No, that sleep bug was already discovered last year (2013).
TiGr1982 - Tuesday, August 12, 2014 - link
And don't forget the P67 SATA bug in 2011.
Samus - Wednesday, August 13, 2014 - link
They've been lacking some QA since the X58 chipset and Bloomfield/Lynnfield CPUs, their last really solid products. People who bought into those platforms 5-6 years ago still have competitive systems TODAY. Sure, they lack SATA 6Gbps and USB 3.0, but like the SSD 320 (also from that era) they are virtually flawless.
Since the 6-series chipset and the introduction of Sandy Bridge, there has been an unprecedented surge in errata. Most of it is irrelevant to the general consumer (even P67 wasn't an issue unless using ports other than 0 and 1) but it shows the lapse in quality control at Intel. They're better than this!
StevoLincolnite - Wednesday, August 13, 2014 - link
I would think, logically, the amount of errata in a processor design would increase as the design's complexity increases with each successive release.
For instance, it's going to be far more difficult to debug a design when you have in excess of 2 billion transistors than one that has 100 million transistors.
Laststop311 - Monday, August 18, 2014 - link
Got a 4-year-old i7-980X X58 system and it still kicks major ass. 4.2 GHz OC on all 6 cores, 24 GB triple-channel RAM. The only reason I want to upgrade to X99 is the lack of SATA III and USB 3; both of those features would greatly help me. I could easily wait till Skylake-E and X99's successor (X119????) and not feel bad at all, though.
nandnandnand - Tuesday, August 12, 2014 - link
1. Introduce cutting-edge new feature years before widespread adoption.
2. Flub the implementation in 2+ years of chips.
Good going, Intel.
nicmonson - Tuesday, August 12, 2014 - link
If it took this long to catch the issue, it sounds like it was not an easy bug to find. You make it sound like it is easy to be the first one to implement some big new thing. The fact that there has only been one issue so far sounds pretty impressive to me.
tipoo - Wednesday, August 13, 2014 - link
That "only one issue" also required them to disable the whole thing...Devfarce - Tuesday, August 12, 2014 - link
This is interesting. I bought a base model Fall 2013 rMBP 15" with the 2.0 GHz i7-4750HQ and the processor doesn't have TSX enabled. I was kinda mad about this; I wanted it in hopes that some software might implement it. But I suppose now it's a non-issue.
otherwise - Wednesday, August 13, 2014 - link
Before Devil's Canyon only Xeons had TSX enabled, and this was clearly laid out by Intel. I don't know what you were expecting.
TiGr1982 - Wednesday, August 13, 2014 - link
No, that's not the case. Devil's Canyon i7 and i5 are the first UNLOCKED SKUs to feature TSX.
Besides Devil's Canyon, a lot of other locked Core i7 and i5 parts have supported TSX right from Haswell's introduction in June 2013 (more than a year ago), e.g. the Core i7-4770 (without K) and so on.
TiGr1982 - Tuesday, August 12, 2014 - link
This is reminiscent of the infamous Pentium FDIV bug in the sense that it was made public by a non-Intel person much later than the relevant products were released to the market.
20 years have passed since the FDIV bug, but Intel still drops the ball from time to time - they still can't do their testing right and in advance...
coburn_c - Tuesday, August 12, 2014 - link
Their release schedule is very aggressive and they put a lot of investor relations into it. I wouldn't rule out that they would overlook such things to stick to it.
TiGr1982 - Tuesday, August 12, 2014 - link
Yes, but this does not excuse them much.
nicmonson - Tuesday, August 12, 2014 - link
Yes it does. Every chip company takes risks because there is only so much validation you can do before your competitor beats you to the punch. The article even mentions that there have been a bunch of errata before, and their competitors have a bunch too.
madmilk - Wednesday, August 13, 2014 - link
It is for precisely this reason that the -E/EP CPUs are one generation behind and -EX CPUs are two generations behind the latest and greatest. More time for validation.
Assimilator87 - Wednesday, August 13, 2014 - link
Funny you should say that. Remember the VT-d errata in the C1 stepping of Sandy Bridge-E/EP?
psyq321 - Wednesday, August 13, 2014 - link
Fortunately for Intel, most SNB-EP Xeons ended up being C2 stepping, since the bug was discovered while SNB-EP was still in the C0/C1 qualification stage (SNB-E was not that lucky, since it launched several months earlier).
This time, it does not seem that the staggered launch saved EP. EX, on the other hand, yes.
But in any case, the strategy Intel is employing is smart. There is a year between the first consumer and first EP SKUs, and two years between consumer and EX parts. During that time they do manage to kill lots of issues, which is especially important for the EX line, which is tailored for mission-critical operations.
lmcd - Wednesday, August 13, 2014 - link
Hilariously enough, that was actually fixed in BIOS for SNB-E -- I have a C1 stepping SNB-E with VT-d enabled. So now that you mention it, this probably will get fixed for most affected parties.
psyq321 - Wednesday, August 13, 2014 - link
Well, considering that Haswell EP is in mass production already, and final qualification-stage samples have been out for months, the timing is bad.
I doubt Intel is going to stop the imminent release of HSW-EP, which is only weeks away anyway.
The SNB-EP situation was lucky, because the VT-d bug affected only the C0/C1 steppings, which were mass-produced only for the HEDT (consumer) parts, while most of the production-grade Xeons were C2 stepping.
This time, the situation is different - if this bug affects the latest stepping of HSW-EP, the first Xeons sold to the public are also going to be affected.
Not good, since TSX is exactly the kind of feature you'd see in heavy use in the market for EP SKUs.
The only upside for some people might be that the eBay market may soon be loaded with diverted HSW-EP engineering samples, since all OEMs will probably move rapidly to qualification of the next stepping (with TSX fixes).
lmcd - Wednesday, August 13, 2014 - link
SNB-E VT-d got fixed. I have a C1 SNB-E, and an updated motherboard that fixed it.
MrSpadge - Wednesday, August 13, 2014 - link
Don't think that others don't make such mistakes - they simply don't get as much attention due to less relevance when shipping smaller numbers. Or worse, they might not publish the errors at all.
jrg - Thursday, September 4, 2014 - link
Or maybe the others don't see enough use that anyone ever discovers the bug....
romrunning - Wednesday, August 13, 2014 - link
" they still can't do their testing right and in advance..."Well, I'm sure you can do it better, so please apply right away to Intel. </s>
All snark aside, c'mon, you have humans coming up with tests to try to confirm compatibility, and none of them are perfect. So if you think they can catch every esoteric situation, then I guess you'll have to adjust your viewpoint.
I'm amazed they catch as many as they do!
TiGr1982 - Wednesday, August 13, 2014 - link
Well, in the past I happened to work in testing (software testing, however) for 5 years, full time and part time, and personally found more than 1500 bugs during that time, so I know what testing is, and that's why I'm so critical regarding bugs. OK, hardware testing is not the same thing, but nevertheless...
nbtech - Thursday, August 14, 2014 - link
You make it seem like they don't know what they're doing, but your experience with software testing is much different. I have experience with hardware verification, and I can assure you that Intel does quite a bit of it. It's worth it for them to prevent these types of bugs. As you can imagine, a widespread issue like this could end up costing billions. Luckily it's just TSX, which probably doesn't affect too many people, and it's possible that less verification effort was devoted to that piece and they launched it anyway, planning a fix for the next stepping. It sucks, but it's not a showstopper.
jhh - Friday, August 15, 2014 - link
The more interesting question is how many bugs you missed which were found downstream from you. If you found 1500 bugs, assuming that you were 90% effective, which is an unreasonably high percentage, then there were 1666 bugs originally, and you missed 166 of them. Per Capers Jones, the average test efficiency is 30%, while 70% of the bugs are missed.
Most hardware has some errata. I was part of a team which was fighting a bug for a couple of months, which turned out to be caused by an erratum (not Intel in this case). While the erratum was known, its applicability to our software was not immediately apparent. The problem occurred when there was a new batch of hardware, and the problem was originally thought to be caused by the new batch. The real reason was that enough hardware had been produced that the high-order bit in one byte of the (sequentially assigned) MAC address was set, and the way the MAC address was used exercised the erratum. It's incredibly expensive to fix an IC after production, even more so when the IC has been installed on a circuit board, and the worst case is when the circuit board has been shipped to many locations and is in use. Because of this, errata are usually addressed by a workaround, not by replacing the part, as long as most of the chip works.
TiGr1982 - Friday, August 15, 2014 - link
Yes, I agree with you and I understand this; so, hardware errors are better found and fixed at the design and ES verification and testing stages. Of course, easier said than done.
dylan522p - Monday, August 18, 2014 - link
Hardware is SOOOO much more complicated than software. You have no idea.
nbtech - Thursday, August 14, 2014 - link
I had a similar reaction when I first read the comment.
The tools for functional verification have improved over the past 10 years (with the emergence of OVM/UVM), so we'd expect to see a decrease in the occurrence of this type of issue, but the fact is that reaching 100% coverage is difficult given limited time and compute resources, and highly dependent upon writing good testbenches. It's not an easy thing to do.
aamartin - Thursday, August 14, 2014 - link
I agree with nbtech. I would just like to add that debugging and fixing a bug found in silicon is reeeally hard. Narrowing down the sequence of events to make the failure repeatable is an art. Remember, a 3GHz CPU is launching instructions roughly at the rate of 3 billion per second (not even counting multi-core and multiple issue). Software-based simulators and even hardware-based emulators run orders of magnitude slower; if you can't cause the failure in a couple of seconds, you have to debug on the silicon itself, which has limited visibility of the internal state.
yuhong - Tuesday, August 12, 2014 - link
Personally, I really hope there will be a new stepping of Haswell CPUs with the TSX errata fixed.
Gigaplex - Wednesday, August 13, 2014 - link
Haswell? Unlikely. Broadwell isn't far away.
psyq321 - Wednesday, August 13, 2014 - link
This feature is predominantly intended for the server market.
It is highly likely that Haswell EP 2S will get a new stepping (C2 for the high-core-count versions). I suppose 4S models will ship with the fixed stepping.
The question is, will Intel also update the 1S Haswells, which are pretty much identical to the desktop versions but with ECC support enabled.
For desktop/mobile Haswells, I do not think they'd bother, but if they update the 1S server SKUs, I see no reason for Intel not to silently roll out updated steppings for desktop SKUs as well, since it is the same silicon as the 1S server SKUs.
TerdFerguson - Tuesday, August 12, 2014 - link
Intel's response here is atrocious. TSX was a selling point for the chip, so they need to make good on it or offer refunds.
Gondalf - Wednesday, August 13, 2014 - link
Really??? Do you utilize TSX??? come on!
r3loaded - Wednesday, August 13, 2014 - link
If you develop or use software that actually utilises TSX, you're probably in the market for the big iron Xeon EX processors. In that case, it's not an issue as Haswell-EX will have working TSX.
barleyguy - Wednesday, August 13, 2014 - link
TSX is essentially a convenience feature, to allow lock-free code without as much work from a developer. The same thing can be accomplished by rewriting the code using compare-and-set instructions instead of blocking locks. So much like many new features, TSX doesn't enable any magic that couldn't be accomplished before, it just saves time on the development end.
From that perspective, I'm betting nobody will scream about it being missing, especially since the slightest indication that it's unreliable would keep people from using it anyway.
When it's done and reliable, then release it. In the meantime, possibly allow an "at your own risk" feature toggle for people that live on the bleeding edge.
$.02
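For readers who haven't written this kind of code, here is a minimal sketch of the two approaches the comment above contrasts: a blocking mutex versus a lock-free update built on a compare-and-set (compare-and-swap) loop. The function and variable names are illustrative, not from any particular codebase.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Blocking approach: simple, but every thread serializes on the mutex. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter_locked = 0;

void add_locked(long delta) {
    pthread_mutex_lock(&lock);
    counter_locked += delta;
    pthread_mutex_unlock(&lock);
}

/* Lock-free approach: retry a compare-and-swap until no other thread interfered. */
static _Atomic long counter_cas = 0;

void add_cas(long delta) {
    long old = atomic_load(&counter_cas);
    /* On failure, compare_exchange_weak reloads 'old' with the current value
       and the loop simply tries again with a fresh sum. */
    while (!atomic_compare_exchange_weak(&counter_cas, &old, old + delta))
        ;
}
```

For a single counter the CAS version is easy; for multi-word data structures the lock-free rewrite gets much harder, which is the gap hardware transactional memory was meant to close.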
Senti - Wednesday, August 13, 2014 - link
Let's take some parallels: "hardware AES is just a convenience feature; it just saves developers the work of implementing it in software". Sure!
barleyguy - Thursday, August 14, 2014 - link
That's not an equal analogy, because hardware AES is a huge performance gain, and software AES isn't difficult (just use a peer-reviewed open source library at the 128-bit block level and wrap it in some multiplexing code). There is no "hard" way to do AES in software that's as fast as the hardware instructions, AFAIK.
TSX is also a huge performance gain over the "easy" method of using blocking locks, but likely has comparable performance to the "hard" way of compare and set. So in that sense, it's less of a real gain, assuming of course that you're not the person paying the developers.
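As a rough illustration of what "the hardware instructions" means here, this is roughly what one AES-128 block encryption looks like with the AES-NI intrinsics, assuming the eleven round keys have already been expanded (the key schedule and any mode-of-operation code are omitted; names are illustrative).

```c
#include <wmmintrin.h>  /* AES-NI intrinsics; compile with -maes */

/* Encrypt one 16-byte block with AES-128.
   rk[0..10] are the expanded round keys (key expansion not shown). */
static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);         /* initial AddRoundKey */
    for (int i = 1; i < 10; ++i)
        block = _mm_aesenc_si128(block, rk[i]);  /* rounds 1-9 */
    return _mm_aesenclast_si128(block, rk[10]);  /* final round */
}
```

Each `_mm_aesenc_si128` performs a full AES round in a single instruction, which is where the large speedup over a table-based software implementation comes from.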
psyq321 - Wednesday, August 13, 2014 - link
It is not a convenience feature. In fact, it is >easier< to write the code without it.
Have potentially contended data (between threads)? Just lock the sucker, that's the easiest way.
But that is not the most efficient way. Basically, what TSX does is rely on CPU smarts to speed up multi-threaded code >without< having to resort to even more complex (and error-prone) lock-free programming. In that way, one can call TSX a "convenience", but in reality it does require additional work, just not that much additional work.
TSX is roughly comparable to, say, SSE instructions (but it is not nearly as useful in terms of potential applications). In order to use SSE with some decent speedup, you have to put a bit more effort into your code, so it is not really a "convenience", as it requires the developer to do more work in order to achieve faster code execution.
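To give a feel for that "additional work", here is a minimal sketch of the usual RTM pattern built from Intel's documented `_xbegin`/`_xend`/`_xabort` intrinsics: try the update as a hardware transaction, and fall back to an ordinary lock if it aborts. The fallback lock and retry policy are deliberately simplified; production lock elision needs more care than this.

```c
#include <immintrin.h>   /* RTM intrinsics; compile with -mrtm */
#include <stdatomic.h>

static atomic_int fallback_lock = 0;     /* 0 = free, 1 = held */
static long shared_value = 0;

static void lock_fallback(void)   { while (atomic_exchange(&fallback_lock, 1)) ; }
static void unlock_fallback(void) { atomic_store(&fallback_lock, 0); }

void update_shared(long delta)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Reading the lock puts it in the transaction's read-set, so a
           concurrent lock holder forces this transaction to abort. */
        if (atomic_load(&fallback_lock))
            _xabort(0xff);
        shared_value += delta;           /* speculative update, no lock taken */
        _xend();                         /* commit */
        return;
    }
    /* Transaction aborted (conflict, capacity, interrupt, lock held): take the lock. */
    lock_fallback();
    shared_value += delta;
    unlock_fallback();
}
```

The extra work over a plain mutex is modest (mostly the fallback path), which is the sense in which TSX asks a bit more of the developer than "just lock the sucker" but much less than a full lock-free redesign.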
eachus - Tuesday, August 19, 2014 - link
The issue of how to do multi-processor locks is complex. Until a decade ago, you used a semaphore or mutex, and took the timing hit of marking the lock as uncacheable. Hundreds of clocks even if you weren't modifying the lock. (Usually you would use an RMW read-modify-write instruction to put the identity of your thread in the lock and check that the lock was not reserved by comparing the returned value to zero, or -1, or whatever.) When you release the lock, you first have to check if it still contains the id that you put in, then either release the lock or turn control over to a waiting thread. (I'm simplifying a lot, so don't shoot me.) Anyway, three main memory reads, or RMW cycles in the best case.
Then along came Opteron. Opterons, with all IO passing through a CPU chip, and cache-coherency connections to all other CPU chips, meant that requests for uncacheable memory could be ignored. The locks worked as before, but now you could have a fast (possibly 3 CPU clock latency) read or RMW cycle. I never measured 100x performance improvements unless thrashing was involved, but that tells the real story. It was no longer about how slow uncached memory was, but about how much more work your database or other application could do before it hit thrashing. Many times I could wind the CPUs up to 100% load for minutes at a time without starting to thrash, which was a very good thing. (Some other CISC and RISC CPUs also support/supported IOMMUs. Most that didn't are now dead.)
When Intel added IOMMU support to their x86/x64 CPUs they didn't duplicate what AMD had done. (And it now works on all AMD CPUs, including single-socket desktop chips, and ARM64 chips.) This is because AMD uses the MOESI protocol and Intel uses MESIF. Again, way too much detail for here, but it means that in certain locking cases you get a (relatively) slow ping-pong effect when two threads are sequentially accessing a lock. (Think producer/consumer.) By treating the transaction as speculative, and never touching the lock if there is no conflict, overall transactions are sped up.
AMD proposed a similar instruction set extension (ASF) in 2009, but AFAIK the best locking code was not significantly improved (on AMD CPUs) and the proposal has languished. Will this bug kill TSX? Probably not. There is an awful lot of ancient history embedded in the x86 ISA; this will just be a bit more. But I expect the effect on potential users to be the same as AMD's ASF: leading programmers to better lock implementations.
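A toy version of the old-style lock described at the top of that comment: one RMW per attempt installs the calling thread's id in the lock word, and release checks that the word still carries that id. This is a simplified sketch (no backoff, fairness, or careful memory-order tuning) with illustrative names.

```c
#include <stdatomic.h>

#define LOCK_FREE 0L                      /* 0 means nobody holds the lock */

static _Atomic long lock_word = LOCK_FREE;

/* One RMW per attempt: install our thread id only if the lock word was free. */
void acquire(long my_id)
{
    long expected = LOCK_FREE;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, my_id))
        expected = LOCK_FREE;             /* CAS left the current owner here; reset and retry */
}

/* Release only if the word still holds our id, as described above. */
void release(long my_id)
{
    long expected = my_id;
    atomic_compare_exchange_strong(&lock_word, &expected, LOCK_FREE);
}
```

On a modern core the RMW itself is cheap when the line is cached locally; the expense the comment describes came from forcing the lock uncacheable, or from the coherency ping-pong when two cores keep stealing the line from each other.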
ABR - Wednesday, August 13, 2014 - link
How do they "push a microcode update" to a CPU that's already out in the wild?ObstinateMuon - Wednesday, August 13, 2014 - link
The same way they did last time. Via the universal backdoor in Windows.
Alexvrb - Wednesday, August 13, 2014 - link
BIOS updates dude. :-/
ObstinateMuon - Wednesday, August 13, 2014 - link
You're right. It doesn't matter which OS you run. They can instead choose to do it remotely via AMT. See http://www.fsf.org/blogs/community/active-manageme...
The_Assimilator - Wednesday, August 13, 2014 - link
Please stop drinking the Stallman kool-aid, it just makes it obvious that you're an idiot.
maxpwr - Wednesday, August 13, 2014 - link
There is a reason why Russia is designing a new chip to replace all Intel CPUs.
psyq321 - Wednesday, August 13, 2014 - link
No, the reason Russia is designing a new chip is that they want to protect their market (protect, as in economic protection, not security).
Russia is not designing a new CPU architecture anyway; they will use the ARM architecture. Now, if you think somebody can spot a deliberate security flaw in a CPU design consisting of hundreds of millions of blocks, yeah... good luck with that. The only way to be 100% sure is to design it yourself from scratch, and even that does not guarantee it won't have flaws that can be silently exploited.
In any case, Intel's microcode has nothing whatsoever to do with this. You can prevent any theoretical possibility that somebody uses Intel AMT against you by simply not giving it access to the public Internet.
Not to mention that microcode updates are not persistent, and have to be applied after every power-on. If you do not allow BIOS upgrades, control the OS, and do not allow public Internet access to the system firmware, there is simply no way somebody can exploit your CPU remotely.
ObstinateMuon - Wednesday, August 13, 2014 - link
It's unrealistic to never connect to the internet. There needs to be a way for the user to control AMT.
jhh - Friday, August 15, 2014 - link
Linux comes with CPU microcode, so when Intel updates the microcode in Linux and the new Linux kernel is installed, the new microcode is there. I expect Windows does the same thing. If you never patch your operating system or BIOS, TSX is likely to stay enabled.
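For anyone who wants to check whether a given kernel/microcode combination still reports TSX, the feature bits live in CPUID leaf 7 (on Linux they also show up as the `hle` and `rtm` flags in /proc/cpuinfo). A small sketch using GCC's cpuid.h helper, assuming the documented bit positions:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* CPUID leaf 7, sub-leaf 0: structured extended feature flags. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("HLE: %s\n", (ebx & (1u << 4))  ? "yes" : "no");   /* EBX bit 4  */
    printf("RTM: %s\n", (ebx & (1u << 11)) ? "yes" : "no");   /* EBX bit 11 */
    return 0;
}
```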
beginner99 - Wednesday, August 13, 2014 - link
Got to suck for the ones that bought a 4770 instead of a K specifically because of this feature. But then who cares? It wasn't available in most Intel CPUs anyway, so almost no consumer software will use it for the next decade. And for servers, they get replaced in shorter cycles anyway.
willis936 - Wednesday, August 13, 2014 - link
But hey they got a virtualization extension!
nevertell - Wednesday, August 13, 2014 - link
" But then how cares? ... so almost no consumer software will use it..."The people who pay the most care.
TheJian - Wednesday, August 13, 2014 - link
Will we be looking at a recall here at some point? I mean, if you sell me something that's supposed to do X, and then X gets removed, what then? That isn't "as advertised" anymore, is it?
psyq321 - Wednesday, August 13, 2014 - link
Depends on the legal situation.
I doubt Intel would initiate a recall just because the problem was found. In this case, it would be extremely expensive, since Intel would need to replace one year of production, now including early batches of Xeon EPs as well.
However, if the legal pressure mounts (lawsuits filed, etc.), they might do it. But I am sure Intel would try to fight this, or limit the exposure only to certain SKUs for which it can be demonstrated that TSX is in use.
In any case, unlike the FDIV bug which, basically, ruined calculation results and could affect pretty much anybody who was using a Pentium CPU, this bug is less critical since it requires running software that uses TSX (not very common yet, at least not on the desktop/mobile side where the biggest volume of Haswells has been sold so far) and very specific conditions which are, presumably, hard to reproduce.
Gigaplex - Wednesday, August 13, 2014 - link
If Sony can get away with removing OtherOS, then Intel shouldn't have too many issues dealing with TSX.
name99 - Wednesday, August 13, 2014 - link
Damn! I think we have an answer as to why Broadwell desktops and laptops are delayed...
I'm guessing they believe (probably correctly) that they can ship Broadwell Y without TSX and no-one will much care. It's still not clear why there's a gap between the laptop and quad-core chips' ship dates --- maybe other reasons, or maybe they have reason to believe that the problem is more easily fixed on dual-core chips.
Personally I'd score this (if this explanation is correct) as +1 for my earlier explanation for the delay of Broadwell --- a consequence of the insane complexity of x86 becoming unsustainable and causing Intel real harm. Intel's only official comment is ‘a complex set of internal timing conditions and system events … may result in unpredictable system behavior’ and, yes, that COULD be a problem on any CPU --- but it's a whole lot more likely to occur (IMHO) on x86.
It also adds fuel to my argument that Apple is probably losing patience with Intel. As I've said, the Broadwell delays have screwed up a year of their product plans; if TSX on Haswell is broken, that also delays by at least a year the plans I believe they have to introduce an innovative set of parallel programming constructs into Swift which require HW TM.
name99 - Wednesday, August 13, 2014 - link
OK, so on reading further, I see that (a) this likely does not affect Apple because (near as I can understand Intel's maze of feature differentiation details) the relevant parts do not ship in any Apple products. So much for that theory. (On the other hand, well done Intel --- clearly the way to get developers to support a feature you expect to charge for/be a differentiator in future is to limit it to a tiny fraction of your chips...)
(b) in turn this suggests that my theory for this delaying Broadwell is nonsense. Unless Broadwell WAS supposed to have TSX across the laptop and desktop and Intel are still hoping they can get there by delaying a few months, with a plan B of, if necessary, simply launching without the feature?
dylan522p - Monday, August 18, 2014 - link
They've been sampling for the last half year. I doubt they can change it yet.
cowcreekgeek - Saturday, September 6, 2014 - link
I've a system beside my bed for development, a laptop for convenience, and a new build for implementation, w/ the fun little Anniversary Edition serving as a 'place holder' for the Devil's Canyon that I no longer plan to buy.
SoOo ... to the pages of folks that consider the loss of TSX to be no big deal to consumers, or as having no potential effect beyond servers/workstations?
It is a big deal, and affects us all.
We're talking increases to transactional throughput of not less than three times, and in excess of five times, ultimately with little (or possibly, at some point, no) effort on the developer's part.
To see this trivialized in reports/forums frustrates nearly as much as Intel's disabling of this feature, which I believe (without even a second's doubt) was absolutely necessary:
Claiming Devil's Canyon would easily/consistently overclock beyond 4.x GHz on air? Now, that *was* a marketing ploy. But, as for this scenario? I think everyone can safely remove the tin hats.
As for me? I reckon I'll try 'n cling to the hope that it may be enabled in some soon-to-be-released CPU that fits (or I've wasted even more of my limited resources/time )-;~
urbanman2004 - Sunday, November 17, 2019 - link
Nothing but lies from Intel