What differentiates Mill from Itanium? Also, what are the 2.3x power/performance...

gnoway · on March 27, 2014

I'm actually wondering where the 2.3x number he cites is coming from. I don't believe the Mill team is claiming 2.3x performance advantage over Haswell while using 2.3x less power, which is how I read that comment.

I watched the replay of the Execution talk here:

http://millcomputing.com/docs/execution/

I'd recommend watching all of the talks if you have the time.

In this talk, maybe 2/3 to 3/4 of the way through, Godard made a claim about performance relative to OOO, 'like a Haswell' or Haswell specifically. I can't remember which, and I can't go through the video again right now. He said something to the effect that they would approach performance for {OOO|~Haswell|Haswell} using less power. It was a very general statement, which I took to mean that a Mill family member intended for GP PC desktop use could approach - not match or exceed - performance of a typical GP PC desktop processor while using less power. That is certainly a claim we've heard before. And I think the statement is coming from theoretical calculation.

As for the difference from Itanium: I don't know anything about processor design, but I am pretty certain the belt concept central to the Mill is not used in Itanium/EPIC. I think it's likely that the Mill is intended to support more operations per instruction than Itanium. The other thing is that there is no single 'Mill Processor' - it's more of a design scheme and ISA.

igodard · on March 27, 2014

It would be nice if there were a single number that could be justified by measurement, but there's no hardware yet to measure and there would not be a single number even if the hardware existed. That's because there's not just one "Mill", it's a family.

What we can say is that for equivalent computation capacity (i.e. number of functional units) the Mill will give somewhat better performance at much better power. Internally, the Mill's power budget is essentially the same as that of a DSP with the same function capacity, because they work in much the same way. DSPs have been around for a long time, and the power/performance comparisons with OOO have long been published. For equal process and equal MIPS capacity the power difference for the core is 8-12x better than OOO, and we expect to do at least as well.

That's for equal compute capacity. Every architecture has a cap on scaling compute capacity. The cap seems to be around 8 pipelines in OOO machines; try to add more and you just slow down everything more than you gain from the extra pipes.

The Mill has caps too. We don't know yet where the diminishing returns point will be in detail, but our sims and engineering expertise suggest that it will be somewhere in the 30-40 pipes region. Such a high-end Mill would swap a good deal - but not all - of its power advantage for more horsepower.

You have the inverse story at the low end of the family: the lowest Mill has only five pipes, and no floating point at all. Not barn-burning performance, but much lower power even than existing non-OOO offerings.

So there's no one number, and no hard measurements anyway. If you doubt our projections then you are entitled to your opinion; in fact there's a fair amount of disagreement even within the Mill team as to what we will see in the actual chip. But the team includes quite a few who have been doing this for years, and in several cases were involved in the creation of the chips that you would compare the Mill against, so their considered opinion should not be rejected out of hand.

Brashman · on March 27, 2014

From what I've seen and heard in the (academic) computer architecture community, performance and power gains often diminish when moving from theory to simulation to RTL and into silicon (It seems the Mill team is aware of this too). Thus, I tend to be skeptical about large performance/power gains. On the other hand, it's not entirely unreasonable that VLIW could see these gains. I'll be curious to see what happens with Mill. It seems to me the biggest challenges with VLIW architectures are on the compiler side and the need to recompile legacy code.

termain · on March 28, 2014

"You have the inverse story at the low end of the family: the lowest Mill has only five pipes, and no floating point at all. Not barn-burning performance, but much lower power even than existing non-OOO offerings."

Does the Mill even need an FP unit? Or rather, couldn't a VLIW architecture emulate floating point in a way that is nearly as fast, and/or more flexible in precision, and/or more optimizable for certain values?

Minimally, if you break down the FP op into its constituent integer operations, you could put all of those in flight at the same time or schedule them to hide latencies of other operations, I would think.
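
To make "constituent integer operations" concrete, here is a minimal sketch of a single-precision multiply done with integer operations only. It is not Mill code and not taken from any of the talks. The function names (`soft_mul_f32`, `f32_bits`, `bits_f32`) are made up for illustration, and it assumes normal, finite, nonzero inputs whose product is also a normal number (no zeros, subnormals, infinities, NaNs, overflow or underflow).

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Return the float whose single-precision bit pattern is b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def soft_mul_f32(a_bits: int, b_bits: int) -> int:
    """Multiply two float32 bit patterns using only integer operations.

    Assumption: both inputs and the result are normal numbers.
    Rounding is round-to-nearest, ties-to-even.
    """
    # Sign: one XOR and one mask.
    sign = (a_bits ^ b_bits) & 0x80000000

    # Unpack biased exponents and 24-bit significands (implicit leading 1).
    exp_a = (a_bits >> 23) & 0xFF
    exp_b = (b_bits >> 23) & 0xFF
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000

    # Exponents add (remove one bias); significands multiply.
    exp = exp_a + exp_b - 127
    prod = man_a * man_b  # 48-bit product in [2^46, 2^48)

    # Normalize: the leading 1 is at bit 47 or bit 46.
    if prod & (1 << 47):
        exp += 1
        shift = 24
    else:
        shift = 23

    man = prod >> shift               # 24 bits including the leading 1
    rem = prod & ((1 << shift) - 1)   # bits shifted out
    half = 1 << (shift - 1)

    # Round to nearest, ties to even.
    if rem > half or (rem == half and (man & 1)):
        man += 1
        if man == (1 << 24):          # rounding carried out of the top bit
            man >>= 1
            exp += 1

    # Repack: sign, biased exponent, 23 stored significand bits.
    return sign | (exp << 23) | (man & 0x7FFFFF)


if __name__ == "__main__":
    result = bits_f32(soft_mul_f32(f32_bits(1.5), f32_bits(2.5)))
    print(result)
```

One thing the sketch makes visible: the sign, exponent and significand work at the top are independent of each other and could be issued together, but the normalize, round and repack steps each depend on the previous result. So a wide machine would likely gain more from interleaving several independent FP operations than from spreading out a single one.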

gnoway · on March 27, 2014

Thanks for your reply.

I have no reason to doubt your projections. I mainly took issue with the 2.3x number in the parent blog post because I remembered you saying something different in your talk. That's all.