I actually had high hopes for Sun's Rock architecture, which had a rather elegant hardware-scout/speculative-threading system to hide memory latencies, and instead of a reorder buffer it had a neat checkpoint table that simultaneously gave you out-of-order retirement as well as hardware support for transactional memory.
Alas, it looked good on paper, but died in practice, either because the theory was flawed (but academic simulations seemed to suggest it would be a win), or because Sun didn't have the resources to invest in it properly and Oracle killed it.
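For anyone unfamiliar with the hardware-scout idea, here's a toy sketch of the concept (my own illustration, not Sun's actual design): when a load misses, checkpoint architectural state and keep executing ahead purely to discover future load addresses and warm the cache, then roll back to the checkpoint once the miss resolves.

```python
# Toy sketch of hardware scouting (illustrative only, not Rock's real
# microarchitecture). A "cache" is just a set of resident addresses.

def run_with_scout(instrs, cache):
    """instrs: list of ('load', addr) or ('op', None) tuples.
    Returns the addresses prefetched by scout execution."""
    pc = 0
    prefetched = []
    while pc < len(instrs):
        kind, addr = instrs[pc]
        if kind == 'load' and addr not in cache:
            checkpoint = pc                  # snapshot state at the miss
            for k2, a2 in instrs[pc + 1:]:   # scout ahead under the miss
                if k2 == 'load' and a2 not in cache:
                    cache.add(a2)            # issue prefetch for future miss
                    prefetched.append(a2)
            cache.add(addr)                  # original miss finally returns
            pc = checkpoint                  # discard scout results, resume
        pc += 1                              # (the resolved load now hits)
    return prefetched

cache = {0x10}
trace = [('load', 0x10), ('op', None), ('load', 0x20), ('load', 0x30)]
run_with_scout(trace, cache)  # scout under the 0x20 miss prefetches 0x30
```

The win, when it works, is that the second miss (0x30) is overlapped with the first instead of being serialized behind it, which is the memory-level parallelism the academic simulations were presumably measuring.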
Claiming a breakthrough in VLIW static scheduling that yields 2.3x seems interesting, but the reality may be different, not to mention what kinds of workloads would get these speedups. If you compare the way Nvidia's and AMD's GPUs work, in particular AMD's, they rely heavily on static analysis, but in the end, extracting max performance is highly dependent on structuring your workload to deal with the way the underlying architecture executes kernels.
If it turns out you have to actually restructure your code to get this 2.3x performance, rather than gcc-recompile with a different architecture, then it's not really an apples-to-apples speedup.
Having been at Sun and having been (too) intimately involved with the microprocessor side of the house for way too damn long, I can tell you that when it came to microprocessors, Sun was all vision and no execution. The theme that was repeated over several microprocessors: a new, big idea that made all of the DEs horny, but that proved annoyingly tricky to implement. Sacrifices would then be made elsewhere in order to make a tape out date and/or power or die budget. But these sacrifices would be made without a real understanding of the consequences -- and the chip would arrive severely compromised. (Or wouldn't arrive at all.) Examples abound but include Viking, Cheetah, UltraJava/NanoJava/PicoJava, MAJC, Millennium (cancelled), Niagara (shared FPU!) and ROC (originally "Regatta-on-a-chip", but became "Rock" only when it was clear that it was going to be so late that it wasn't going to be meaningfully competing with IBM's Regatta after all). The only microprocessor that Sun really got unequivocally right (on time, on budget, leading performance, basically worked) was Spitfire -- but even then, on the subsequent shrinks (Blackbird and beyond) the grievous e-cache design flaws basically killed it.
Point is: in microprocessors, execution isn't everything -- it's the only thing.
Really? Ha, that is funny! I guess Sun got the codenames and the fact that it was an MCM full of GPs, but apparently didn't notice why it was an MCM, or that there were 4 MCMs in the full Regatta config.
I mean, like, did Sun expect to make a wafer-scale chip?
It's good to know the envy went both directions; I remember a lot of talk about Sun's E10k...
And "debacle" is really the only word for Viking. A major rite of passage in kernel development in the 1990s was finding your first Viking bug; I found mine within a month of joining in 1996 (a logic bug whereby psr.pil was not honored for the three "settling" nops following wrpsr, allowing a low priority interrupt to tunnel in -- affecting all sun4m/sun4d CPUs). Bonwick's was still the king of the hill, though: he was the one who discovered that the i-cache wasn't grounded out properly, causing instructions with enough zeros in them to flip a bit (!!). The story of tracking that one down (branches would go to the wrong place) was our equivalent of the Norse sagas, an oral tradition handed down from engineer to engineer over the generations. Good times!
>>Alas, it looked good on paper, but died in practice, either because the theory was flawed (but academic simulations seemed to suggest it would be a win), or because Sun didn't have the resources to invest in it properly and Oracle killed it.
I heard this never actually worked at all, and that they added the ability to turn off the hardware scout entirely before canceling it. I'm not really sure how the scout was supposed to help performance. If the algorithm is indirect-heavy, then speculatively running ahead won't help you; on the other hand, if it isn't, you might as well rely on conventional prefetch. Do you have a link to those studies?
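To spell out the indirect-heavy case (my own illustration, not from any Rock study): in a pointer chase, each load's address lives inside the cache line that just missed, so a scout thread stalls on exactly the same miss the main thread did, whereas with predictable addresses a scout (or ordinary prefetch) can run arbitrarily far ahead.

```python
# Two access patterns, illustrating where scouting/prefetch can and
# cannot help. (Toy Python code; the point is the address dependence.)

# Sequential/strided: a[i+k]'s address is computable before a[i]'s data
# arrives, so hardware prefetch, software prefetch, or a scout thread
# can all overlap the misses.
def sum_array(a):
    total = 0
    for i in range(len(a)):      # next address known in advance
        total += a[i]
    return total

# Pointer chasing: the next address is the *data* of the current load,
# so nothing can be fetched until the current miss resolves -- the
# misses serialize no matter how far ahead you try to run.
def sum_list(node):
    total = 0
    while node is not None:
        total += node['val']     # must wait for this line to arrive...
        node = node['next']      # ...before the next address is known
    return total
```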
>> If it turns out you have to actually restructure your code to get this 2.3x performance, rather than gcc-recompile with a different architecture, then it's not really an apples-to-apples speedup.
Right, I would only add that the algorithm itself has to be amenable to that architecture in the first place. Most general purpose code isn't and won't be able to take advantage of a large number of parallel execution resources.