Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads (anandtech.com)
127 points by jseliger on March 31, 2016 | hide | past | favorite | 78 comments


I made a table of certain metrics of the various models.

This is useful for me when comparing ranges of cores, power, price, and maybe it will be for you too.

I'll filter to my range of options, then make decisions on $/W, $/Ghz, etc..

https://docs.google.com/spreadsheets/d/1PcjgdtSV-2JLJXDpktjg...


Here's a fork with an added column that considers total system cost, which shifts the sweet spot up from 8 to 14-16 cores: https://docs.google.com/spreadsheets/d/1LOLcD0gbSlukcWAFtLYu...


I like it!!


That's awesome - thanks. Intel should do this ;)


It would be helpful to also have the Xeon-D CPUs in that table for comparison.


What's very interesting is that in most cases, with either modern PCIe-attached SSDs or multiple SSD drives, database performance has become CPU limited once again in operations that require multiple transactions. These chips surely help with their higher core counts.


We still need to wait until Skylake for CLWB and PCOMMIT, the x86 NVM aka "SSD instructions".

http://danluu.com/clwb-pcommit/

Maybe we need even more cores soon...


Wouldn't SSDs - and even better, system-bus-connected storage contraptions - also drive the use of different algorithms? I can see how merely cutting disk latencies by two (or more) orders of magnitude leaves you starved for computational resources, but at that point many of the old workarounds stop being necessary. Data packing can be different (could even be CPU-matched now?), access patterns can be more random, etc.


That's an interesting question, but I expect many of the algorithms will stay the same. Think of a computer as having a multi-level cache: network -> disk -> ram -> L2 cache -> L1 cache -> register, each one smaller than the last. We're seeing a big change in disk latency, but that's only one part of the chain. For example, we're still going to be optimizing data structures to fit in L1 cache lines.


Interestingly, that may be the new ordering if the disks are SSDs, but the typical seek latency on a spinning disk (~5 ms) is definitely higher than the latency to read data from another machine's memory across ethernet (a few hundred us), and even the bandwidths are comparable (~150 MB/s).

So, now it has jumped from (disk -> network -> memory -> ...) to (network -> disk -> memory -> ...), which is a big change.
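The reordering above is easy to check with rough, commonly-cited latency figures (ballpark numbers for illustration, not measurements from any particular system):

```python
# Approximate access latencies in microseconds (illustrative ballpark
# figures only; real values vary widely by hardware and network).
latency_us = {
    "spinning disk seek":       5000,   # ~5 ms
    "remote RAM over ethernet":  300,   # a few hundred us round trip
    "NVMe SSD read":             100,   # on the order of 100 us
    "local DRAM":                  0.1, # ~100 ns
}

# Sorting by latency shows the SSD slotting in below the network hop,
# which is what flips the old disk -> network ordering.
for name, us in sorted(latency_us.items(), key=lambda kv: kv[1]):
    print(f"{name:25s} {us:>8.1f} us")
```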


> network -> disk -> ram -> L2 cache -> L1 cache -> register

It's more and more like network/disk -> L3 cache -> L2 cache. DRAM is pretty slow.

Because the PCIe controller is on the same chip as the L3 cache anyway, there's no reason to send the data on a long trip to DRAM and back. Until, of course, the cache line gets evicted for one reason or another.


Having said that, I do wonder to what degree various database engines are aware of the underlying disk platforms.

I've definitely noticed that when optimizing a query and deciding between a high number of seeks vs. a table scan, older versions of MSSQL will tend to be pessimistic about drive latencies and just go with the full scan (potentially incorrectly / prematurely). In an uncached scenario on an SSD, this is probably sub-optimal. My guess would be that instead of looking at actual seek latency, the optimizer was using reasonable guesses for spinning disks. I'm guessing newer versions are more SSD-aware, though.
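The seek-vs-scan decision can be sketched with a toy cost model. The per-operation costs below are made-up illustrative constants, not MSSQL's actual planner numbers; the point is only that the random/sequential cost ratio flips the decision:

```python
# Toy planner cost model: choose between index seeks and a full scan.
# seek_cost and page_read_cost are hypothetical constants, not MSSQL's.
def pick_plan(rows_needed, table_pages, seek_cost, page_read_cost):
    seeks = rows_needed * seek_cost       # one random seek per row fetched
    scan = table_pages * page_read_cost   # sequential read of every page
    return "seeks" if seeks < scan else "scan"

# Spinning-disk assumption: a random seek costs ~100x a sequential page
# read, so fetching 2,000 rows from a 100,000-page table favors the scan.
print(pick_plan(2000, 100_000, seek_cost=10.0, page_read_cost=0.1))   # scan
# SSD assumption: random reads cost nearly the same as sequential ones,
# and the same query is better served by seeks.
print(pick_plan(2000, 100_000, seek_cost=0.15, page_read_cost=0.1))   # seeks
```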


But it looks like you're better off sharding your DB, instead of letting it use all the cores at once.
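A minimal sketch of what "sharding instead of one big instance" can look like on the client side, assuming hash-based routing across several single-core-friendly DB instances (the instance addresses are made up):

```python
import hashlib

# Client-side sharding sketch: route each key to one of N small DB
# instances so every core runs its own process. Addresses are made up.
SHARDS = ["db0:6379", "db1:6379", "db2:6379", "db3:6379"]

def shard_for(key: str) -> str:
    # Stable hash so the same key always routes to the same instance.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for("user:1001"))  # same key, same shard, every time
```

The tradeoff is that cross-shard operations (multi-key transactions, joins) become the application's problem, which is exactly why it's workload-dependent.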


Entry level is really nice considering the price per core:

Intel Xeon E5-2630 v4   10/20  2.2 GHz  25 MB  85W  $667 US

Intel Xeon E5-2630L v4  10/20  1.8 GHz  25 MB  55W  $612 US

Intel Xeon E5-2623 v4    4/8   2.6 GHz   5 MB  85W  $444 US

Intel Xeon E5-2620 v4    8/16  2.1 GHz  20 MB  85W  $417 US

Intel Xeon E5-2609 v4    8/8   1.7 GHz  20 MB  85W  $306 US

Intel Xeon E5-2603 v4    6/6   1.7 GHz  10 MB  85W  $213 US


Nice. :)

A better calculation would be per GHz-core, because a core at 2.2 GHz is not the same thing as a core at 1.7 GHz.
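Using the list prices and base clocks quoted above, the $/GHz-core calculation might look like this (base clocks only; it ignores turbo, IPC, and cache differences):

```python
# $ per GHz-core for the entry-level E5 v4 parts listed above.
# (name, cores, base GHz, list price in USD) - from Intel's list prices.
cpus = [
    ("E5-2630 v4",  10, 2.2, 667),
    ("E5-2630L v4", 10, 1.8, 612),
    ("E5-2623 v4",   4, 2.6, 444),
    ("E5-2620 v4",   8, 2.1, 417),
    ("E5-2609 v4",   8, 1.7, 306),
    ("E5-2603 v4",   6, 1.7, 213),
]

# Cheapest aggregate clock throughput first.
for name, cores, ghz, price in sorted(cpus, key=lambda c: c[3] / (c[1] * c[2])):
    print(f"{name:12s} ${price / (cores * ghz):6.2f} per GHz-core")
```

By this metric the E5-2603 v4 comes out cheapest and the low-core-count E5-2623 v4 most expensive, which matches the "entry level is nice" observation.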


Don't forget the rest of the machine. At these comparatively low CPU price points, buying three machines instead of two might not be a good decision.

(Of course, licensees of per-core-licensed software are screwed again.)


except, open source :-)


In all honesty, cache should be in here too. Getting a daxpy to fit into cache is the most satisfying feeling in the world.


Compared to the 5820K i7, these don't seem much better... Is there something I'm missing?

http://ark.intel.com/products/82932/Intel-Core-i7-5820K-Proc...


ECC RAM and multi-processor support. But yeah, I see what you're saying and I agree.


Isn't ECC just a gimmick nowadays?


Huh? ECC is necessary for anything but toy applications.


Is it? Or is that an old wives' tale? Many very non-toy applications started on consumer hardware (including Google, and they turned out alright).

http://blog.codinghorror.com/to-ecc-or-not-to-ecc/


You'll note that Google adopted ECC as soon as they started billing people real money. This is no coincidence.



They all seem to be an extremely incremental (to be kind) iteration on what was already out there as far as price goes. I have no idea why there are so many people that seem excited by them. The chips might be good, but the prices negate any of the gains.

There is no way Intel would be charging over $4000.00 for a chip if they had any competition in that space.


The Xeons support much more RAM.


Ah yes, but we would need to see how much faster an 8/16 is compared to an 8/8, and I bet that'd be workload dependent...


Cool, might finally see an update to Apple's Mac Pro - it's been waiting on the E5 for quite a while considering the current Mac Pro ships with Intel's Ivy Bridge architecture!


One of those announcements that makes me stop everything and reminisce... This is simply incredible! I could actually run 100 production Windows servers on this with my Data-center license... the ROI is insane... thank you Intel!


Be aware that Microsoft is moving to a core-based licensing model for Windows Server 2016. Microsoft would like a piece of this "ROI", too.


Ah, I see - good point, and it makes sense. I actually prefer they keep pushing up pricing on the Server and SQL solutions so that people like me will finally start seriously considering open source solutions.


I work for a Microsoft partner and I am involved with their sales processes. The discussion is never around competing with open source. It's not on their radar. The majority of the discussion is about replacing Oracle, which is still dramatically more expensive than SQL Server Enterprise Edition.


It's just funny to see MS charging by power of the processor. That's something they said they would not do, and used to mock Oracle for.


Meet the new boss, same as the old boss.


Maybe that's true for list price, but it's definitely not true for my organization for negotiated price. We have ELAs in place for both Oracle and Microsoft and the upshot of it all is Oracle RDBMS is significantly cheaper for us than SQL Server. Enterprise licensing and contract negotiation is definitely a case of where YMMV.


I'm not intimate with Oracle pricing, but Microsoft doesn't want to sell just the RDBMS, but the entire SQL Server ecosystem. The enterprise license includes all of SSAS, SSRS, SSIS, integrated R in the DB, integrated SQL over big data sources, and a whole bunch of other stuff. My understanding is that the fully loaded cost of the entire data pipeline from source to business insight is where the TCO argument comes from.

I'm not trying to shill and hope I don't come across that way. I just want to share a semi-insider perspective to help others understand where MS is coming from. Do with that understanding what you will. I don't have a horse in this race.



We used the E5-2620 series since v1 and now it gets two additional cores for the same price.

However, when it comes to CPUs the prices have been really good for a while now; hopefully the prices for SAS drives will soon be as good as CPUs as well. I mean, computing power is probably cheaper than storage at the moment.


2620's are the ultimate CPU per $ for pretty much all workloads except DB (where I like 2690's :)


It depends on your DB workload ;) Actually we sell the 2620 for shared nothing architectures to all kinds of small to midsize customers.

If you don't need to run fast queries against your analytics or keep logs for something like 20 years, you really won't accumulate too much data. Especially when the only things you index are the contents of business documents.


Looks like that one could be used in a gaming/dev rig as well.


I think I still like the Xeon-D better. See this http://www.cpu-world.com/Compare/253/Intel_Xeon_D_D-1537_vs_...


Throw that kit in a SuperMicro MicroBlade Chassis with Dual Node Blades, kerplow. 56 nodes with 64G RAM, 4TB SSD, and 8 cores per node for ~140W/node. Banging.


The powerful marketing is the multi core vs clock speed.

Sort of like: buy one get one free (of slower clock speed).

I am impressed (by the effective marketing deflecting attention from the importance of clock speed).


The rated clock speed is what you're guaranteed to get when you run all cores. If you run a subset of cores, you'll get significantly more speed via turbo.


I'm sorry, but pure single core performance ended circa 2007. This is the end of the road.

It is shameful that in 2016 we still don't have, say, parallel rendering in browsers. All hope is for Servo.


That is definitely false - single core performance still matters a lot in many ML applications, where utilizing many (>8) cores is still difficult.


If by ML, you mean machine learning (and not, e.g., Ocaml et al.), I thought those people were actually into GPUs.


GPUs are useful for deep learning and a few other easily parallelizable algos, but the majority of open source ML software is still stuck in the CPU.


Redis is mostly single thread.


This is not really a great thing about Redis.


It would be a more powerful tool if it were multi-threaded. I'm not an expert, but my impression is that it was able to be more powerful in other ways (stability, features) because of the choice to make it mostly single-threaded (given labor and complexity constraints).

Is there an alternative RAM database that you like better that is multi-threaded?


As much as Redis is an incredibly potent tool and the quality of craftsmanship on it is very high, there are some incredibly peculiar design decisions that have been made.

Single-threading is one of those. There are times when having more than one thread to help process things would come in very handy, though I recognize that the cost of adding this can be very high.

It's something that will have to be addressed eventually for a single Redis process to take advantage of newer hardware with very low ceilings on CPU power, but huge numbers of cores.


ML on CPUs?


That seems unrealistic. Single core performance still matters for algorithms that can't be efficiently parallelised. Moreover, writing efficient, parallel versions of a lot of algorithms is hard, and often introduces significant overheads of its own that must be outweighed by the better scalability that the parallelisation brings.


> Single core performance still matters for algorithms that can't be efficiently parallelised.

Any real world examples? Especially considering that at this point, sacrificing cores to boost the remaining ones seems to be a really bad deal with current silicon. Core power requirements appear to decrease faster than their actual computational speed does if you go low-power. Even if you lose 40% of performance due to overhead, if the same-TDP CPU package is twice as fast with more cores, you still win. (And who's to say that your implementation can't be improved in the future?)


I honestly don't know how to answer that. Are you suggesting that you know how to parallelise an arbitrary expensive algorithm? Because if you've beaten Amdahl's law, a lot of people would like to make you very, very rich.
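Amdahl's law can be stated in a few lines; the numbers below are just a worked example, not a claim about any particular workload:

```python
# Amdahl's law: speedup on n cores when a fraction p of the work
# parallelises perfectly and the rest stays serial.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallel, 18 cores deliver well under 18x:
print(f"{amdahl_speedup(0.95, 18):.1f}x")  # ~9.7x
# ...and the limit as n -> infinity is only 1 / (1 - p) = 20x.
```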


Sounds like an ambiguous question. Most algorithms are "arbitrarily expensive". It generally depends on some measure of the data you're putting in. But in case you mean "an arbitrary algorithm", then no, nobody knows how to do that. But it appears that the most useful things people actually want to do lie somewhere in the middle: not trivial to parallelize but also not exactly impossible.


Just to be clear, I meant what I wrote: an arbitrary expensive algorithm, i.e., solving the general case.

As for "most useful things people actually want to do", it seems to me that a lot of relatively computationally expensive software still isn't using lots of cores where they're available in practice today.

One significant example is computer games. Since the advent of GPUs with effectively hundreds or thousands of parallel computations available, rendering hasn't been the bottleneck it once was. Today the bottleneck might instead be the game control logic that runs on a CPU, and is often still either single-core or divided among at most a small, fixed number of cores doing different tasks.

Another common real world example is graphics and image processing software. You'd think there might be a lot of natural data parallelism to exploit, but software in this area has made relatively little use of algorithms that scale to arbitrary numbers of cores so far.

A third example would be real-time processing, say operations on high speed network traffic. In this case you can sometimes dispatch different packets to different cores to process them in parallel, but the amount of processing you can do on any given packet might well be limited by the speed of a single core, because the overheads for cache misses or inter-core communications are prohibitive. If your processing needs to consider more than one packet at once, so you can't just spray packets at different cores as they arrive, then this can become a very significant real world bottleneck.

This isn't to say that none of these problems will ever be solved as we develop more understanding and better tools, but even in 2016 the state of the art is far from using as many cores as we have available efficiently for a lot of real world use cases. Manual parallelisation often has architecture-level implications and few development teams have the experience and foresight to get it right consistently with today's programming tools. Automatic optimisation to exploit data parallelism is an interesting research field but still in its infancy, and many mainstream programming languages have far from ideal semantics for such optimisations because of aliasing issues and the like. Either or probably both of these areas will have to advance considerably before we can assume that scaling out into more cores is generally going to give better performance than scaling up with faster CPUs and related hardware architecture.


If someone doesn't value his life, then sure - single core performance ended even in 1988. Every millisecond matters.


This article gives higher clock speeds than the chart on Ars Technica does. http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-rev... The Xeon E5-2699 v4 can run at 2.8 GHz with all cores busy, and is capable of boosting up to 3.6 GHz. But the one on Ars says the E5-2699 v4 is only 2.2 GHz.


Clock speeds depend on many things, like thermal limits, power draw, what instructions are used, how many cores are active, whether the integrated GPU is running (for OpenCL) if it's enabled, …

2.2 GHz is guaranteed without AVX; with AVX you only get 1.8. 2.8 GHz is possible, assuming there are thermal reserves, power consumption is not hitting a limit, etc.


Ok, so all cores could run at 2.6 GHz momentarily but it would probably start thermal throttling pretty fast. And maybe not at all if the cores were using power-intensive units.


Ars is correct.


Clock speed is the last thing you should look at on a processor. There are other factors which boost performance incredibly, like the L1-L3 caches. And on servers you probably watch the TDP, too.

Also, a lower clock speed doesn't mean a processor is slower; there are processors with lower clock speeds that still have higher IPS.


Fantastic news that they have been released. Means I can get official quotes for them now!


I have a Skylake E3 Xeon. Why didn't they use the Skylake cores for these E5 CPUs?


Because it takes ~2 years to wrap the EP uncore around a new core and validate the whole thing. Expect Skylake-EP around a year from now.


First time I heard the term "uncore", thanks!

https://en.wikipedia.org/wiki/Uncore


Are the cores not the same as the Skylake Xeon cores that was released back in Fall?

(are E3 cores different from E5 cores?)


It's more or less the same core (modulo AVX512), but E5 has a much more complex uncore than E3.


Can you elaborate? Maybe I am misunderstanding EP here? There are Haswell E/EPs already.


Haswell is the model prior to Broadwell, and was launched in 2013. https://en.wikipedia.org/wiki/Haswell_(microarchitecture)


Understood that Broadwell is the "tick" to Haswell's "tock"; I was questioning what the previous poster meant by the EP uncore. I thought all Xeon E5-2600s were "efficient performance" (EP).


AnandTech's version is probably better for the HN audience: http://www.anandtech.com/show/10158/the-intel-xeon-e5-v4-rev...


I left Ars when Jon Stokes did. Learned more about CPUs from his articles than I did in most of my college courses combined.


Ok, we'll change to that for now. Original URL was http://arstechnica.com/gadgets/2016/03/intels-new-broadwell-....



