For those interested in RAM timings, cache effects, and other low-level memory topics explained exceptionally well, check out "What Every Programmer Should Know About Memory"[1].
That book is detailed, but that's also the problem with it. The relevant signal is all but drowned in the noise of 114 pages of details of CAS/RAS, circuit diagrams for SRAM and DRAM cells, bus protocols for hardware transactional memory, descriptions of Linux APIs for configuring NUMA support, etc.; frankly, things that need not be known unless dealing with those issues specifically.
The article this post links to, on the other hand, graphically displays a handful of key characteristics of the cache hierarchy using minimal code snippets and attractive graphs. I believe programmers would generally be better served by the intuitions gained from the blog post than daunted by the exhaustive detail and verbose presentation of Ulrich's book.
IOW, keep in mind simple things: measure; be aware of the cheapness of adjacent accesses, of pathological cache conflicts at certain (power-of-two) strides, of the cost of following pointers, of the pitfalls of concurrent access to adjacent memory locations (false sharing), and even of serial access to the same memory location (dependency chains limiting ILP). Concentrate on building a mental model that provides intuitions about cache behaviour at the appropriate level of abstraction, but always be prepared to measure to verify those intuitions: the hardware changes all the time, and different architectures can give rise to surprisingly different results.
Then perhaps you'd be better served by the linked article, which is kinda the "dumbed down" version of the Drepper whitepaper. But I don't see why detail is bad. The Drepper paper is pretty much the only place, short of JEDEC specs and manufacturer datasheets, where you can get an exposition of how modern DRAM works.
(FWIW: his similar whitepapers on ELF shared libraries and glibc TLS are equally great, and also pretty much the canonical review documents on those subjects.)
Too much detail is bad because it hides the relevant knowledge. That is, there are certain paragraphs / certain images that the programmer will remember. Those are the important components. Brevity is good.
Too much detail is bad in the same way that an intro biology course is bad when it floods you with chemical equations for bodily functions (when chemistry is not a prerequisite), literally hundreds of new terms, many of them fairly isolated, and a hundred moderately rare diseases. What do you end up remembering?
"Body-stuff is hard."
And 90% of what you read falls out of your head within a year, taking 90% of the useful stuff with it. If the material is trimmed down, you may lose only 50%, and the 50% of useful stuff you keep can cover everything of general use in the massive 10-kilo textbook, which covers more than even a biochem grad student remembers.
Fewer relatively-unrelated (or seen-as-useless) facts also make it easier to form logical connections with the remaining facts, which help you remember and apply what you've learned.
Has anyone run similar tests on the JVM after warmup?
In theory the JVM should be moving memory around to negate as much of the cache penalty as possible, and I've definitely noticed this in some algorithms (it's how people come up with code where Java outperforms almost identical code in pure C; I've seen 3:1 speed differences in favor of the JVM), but I'd be curious to see the graphs.
Maybe I'll do it later today if I get a few minutes...
That seems very unlikely. Effects like the ones in the article (which is great, btw) can't be "fixed" simply by moving the data. In the first example, there's simply no way for any runtime environment to "know" at creation time that you would be walking the array in 16 byte strides. And if it tried to reorder the data dynamically, it would obviously have to touch all that memory, thus invalidating whatever caching could be done anyway.
Where JVMs beat naive C, it's almost always in the locking and synchronization code. See the Sun paper on Biased Locking for an example of the kind of tricks you can do by (for example) "guessing" that newly locked objects are only ever going to be referenced by one thread.
But memory is memory. The JVM's view of it lacks a raw pointer, but is otherwise pretty much the same as C's.
On that note, I'm a fan of the speculative lock elision in the Azul Systems machines. When the JVM sees a synchronized section, it can create a snapshot of the current register and memory state and execute the critical section speculatively, checking for memory access conflicts with other threads, and roll back the changes if there's a conflict. Actual locks are a fall-back option.
It's a special case of hardware transactional memory. (Specifically, bounded best-effort HTM; it can only track a limited number of memory changes, and aborts if you push it too far.) It can give small but non-trivial scaling improvements on multithreaded code.
For example, suppose you have several threads that want to change a hash table. They could lock the whole hash table first -- or they could go ahead without locking, and abort if there was an actual memory conflict. Essentially it makes the locking finer-grained, automatically.
Nice summary. His "example 6", about false cache sharing, bit me once when I was allocating data structures for my threads and they ended up close enough in memory that threads shared cache lines. It took a while before I figured it out, and probably the only reason I did was that it didn't happen all the time, so I wondered why the code sometimes ran slower. If it had been consistent I would just have thought the code was slow. While there are profilers that report cache hit rates, it's usually at the granularity of whole threads, which makes it very difficult to find which code is bad. You also need enough experience to know what a "normal" hit rate is, which I'm not sure I have.
Yes, but be careful with this if your algorithms knowledge comes from a traditional algorithms course. These are often taught as if asymptotic time complexity were the only concern, when in practice an algorithm's memory usage and access patterns are often just as much of an issue.
So while the choice of algorithm is a large issue, one must be careful to select an algorithm that makes the appropriate space vs. computation tradeoff. That knowledge is often guided by an awareness of where the system's bottlenecks are.
Once you've chosen the correct algorithm and you're down to optimizing code, which is what this article discusses, understanding cache effects is critical, and this article provides a nice view of the basics. The usual rule about not optimizing unless one needs to applies, but I don't see why that always merits an algorithms junkie popping out of the woodwork to say what everyone always says: premature optimization is dumb.
Man, you are reading way way too far into what I said. Aren't introverts like us supposed to always be taking statements at face value? And isn't what I said true? Attacks like "algorithms junkie", and completely shoving words into my mouth just make me not want to be here anymore.
For the record, I write software rasterizers for fun. I optimize code using SSE2. I code mainly in C++. I am the target audience for this article. Maybe you were trying to illustrate a contrasting point of view, which is fine, but it is stupid to make it so personal.
I don't think that "algorithms junkie" is an insult; I'd happily call myself an algorithms junkie, for example. Algorithms are awesome, and so are people who write software rasterizers.
(And you know what? Figuring out algorithms that work well on modern hardware is loads of fun.)
But choice of algorithm can often mean cache-awareness, too - using a random access algorithm where a sequential one could be used instead can be a 10x performance drain, and a 10x constant multiplier is pretty substantial.
For a concrete example, a lot of the most important optimizations in physics simulations involve satisfying the simultaneous goals of using cache-friendly and parallel algorithms. In some cases people are even willing to accept worse big O behavior because the "worse" algorithms use a lot of sequential access and parallelize easier than the sophisticated ones, more than making up the difference for the object counts that come up in practice.
For example, the algorithmic purist may insist on using an O(1) or at least O(log n) container for maps or sets, where in practice, if the container will never hold more than a handful of items, a linear search through an array may be faster - because of cache effects.
Similarly, an overemphasis on pure OO design can result in lots of fine-grained objects stored in the heap, with lots of edges needing to be traversed to accomplish productive work. If one isn't aware of how expensive following pointers can be, owing to cache effects, one can create designs that lose substantial chunks of performance in a way that isn't obvious with, e.g., a simple code-based profiler.
This seems to be how Qt4 is. It's a great API to work with, but unfortunately results in 50-call-deep stacks (not an exaggeration -- I have to set valgrind to --num-callers=50 to get reasonable stack traces) and a lot of latency. That plus bugs in the new QGraphicsScene API are making me consider switching to something lower level, like Clutter or my old friend OpenGL.
You're explaining that poorly, I think. As written, that's only true if 99% of all software is written using poor algorithm choices. Maybe that's true in some regimes, but there is still a bunch of performance-sensitive software written in C/C++ by developers who make good algorithm and data structure choices.
And so when these people get to looking at performance details, they're looking at constant-factor improvements. And as described in the article and elsewhere, the constant factors due to cache effects are often on the order of, say, 6 or 10. The penalty for cache-unfriendly design can be huge, and it's important to know how caches work to avoid it.
gcc has __attribute__((packed)), but we may be talking about different things - that's mostly used for on-the-wire protocols. For most structs, laying out the members from large to small works fine, and avoids superfluous shifting on every read and write.
[1] http://people.redhat.com/drepper/cpumemory.pdf