I’ve had plenty of servers with faulty ECC DIMMs that never triggered a fault, and would only show errors under actual memory testing. I had a hard time convincing some of our admins the first time (‘no ECC faults, you can’t be right’) but I won the bet.
Edit: there’s a very old paper by Google on these topics. My issues were probably 6-7 years ago.
That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.
The ECC information is stored in separate DRAM devices on the DIMM. This is responsible for some of the increased cost of DIMMs with ECC at a given size. When marketed, the extra memory for ECC is typically not included in the stated size, so 32GB DIMMs with and without ECC will have differing numbers of total DRAM devices.
I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC didn’t need extra space at all? I wasn’t suggesting that - just that they aren’t like a checksum that is stored elsewhere or something that can be ignored - the whole 72 bits are needed to decode the 64 bits of data and the 64 bits of data cannot be read independently.
If we're talking about standard server RDIMMs with ECC (or the prosumer stuff), the CPU-visible ECC (excluding DDR5's on-die ECC) is typically implemented as a sideband value you could ignore if you disabled the correction logic.
I suppose what winds up where is up to the memory controller, but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data and 8 bits of ECC (per sub-channel). Those ECC bits are usually called check bits CB[7:0], and they accompany the data bits DQ[31:0].
If you're talking about transactions for LPDDR, things are a bit different there, though, as the ECC has to be transmitted in-band with your data.
We are talking about errors happening in user space applications with ECC operating normally and what the application ultimately sees.
My point is that when writing an app you wouldn’t be able to “not use” ECC accidentally or easily if it’s there. It’s just seamless. I’m not talking about special test modes or accessing stuff differently on purpose.
Interesting that DDR5 is different from DDR4. 8 bits per 32 is double the ratio of 8 per 64, so it must have been warranted.
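To make the "8 check bits per 64 data bits" ratio concrete, here is a minimal SECDED (single-error-correct / double-error-detect) sketch in the classic DDR4-style 72/64 shape. Real memory controllers use their own (often undocumented) codes, so treat this purely as an illustration of why 8 extra bits per 64 are enough:

```python
# Minimal Hamming SECDED over a 64-bit word with 8 check bits:
# 7 Hamming parity bits (positions 1,2,4,...,64) plus 1 overall parity bit.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # power-of-two slots
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # the 64 data slots

def encode(data):
    """Return a 72-bit codeword as a bit list; index 0 is the overall parity."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:  # each check bit covers positions whose index has bit p set
        word[p] = sum(word[pos] for pos in range(1, 72) if pos & p) & 1
    word[0] = sum(word[1:]) & 1                              # overall parity
    return word

def decode(word):
    """Return (data, status); data is None on an uncorrectable error."""
    word = list(word)
    syndrome = 0
    for p in PARITY_POS:
        if sum(word[pos] for pos in range(1, 72) if pos & p) & 1:
            syndrome |= p            # failing checks spell out the flipped position
    overall = sum(word) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd number of flips: assume one, fix it
        word[syndrome if syndrome else 0] ^= 1
        status = "corrected"
    else:                            # even flips with a bad syndrome: uncorrectable
        return None, "double-bit error detected"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= word[pos] << i
    return data, status
```

Flipping any single bit of the codeword decodes back to the original data with status `"corrected"`; flipping two bits is flagged as a detected but uncorrectable error, which is exactly the correctable/uncorrectable split discussed below.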
I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely, and if they were happening you'd see a much higher incidence of detected unrecoverable errors, and a much higher incidence still of repaired errors. I just don't buy the argument of "invisible errors".
EDIT: I took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes smaller and smaller, and while ECC would indeed not reduce it to _zero_, it would greatly, greatly reduce it.
Ok, I am sure there is _some_ amount of unrepairable errors.
But the initial discussion was whether ECC RAM makes it go away, and your point was that it doesn't. And the vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. Only about 1 in 400 errors is non-repairable. That's a huge improvement! If you had ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!
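A back-of-the-envelope check of that arithmetic, taking the thread's rough 1-in-400 uncorrectable ratio at face value (the exact figure depends on the fleet and the paper's measurement window, so this is only a sketch):

```python
# If 10% of sessions see a memory error without ECC, and ECC repairs all but
# roughly 1 in 400 of those errors, the residual failure rate is:
errors_without_ecc = 0.10          # 10% incidence
uncorrectable_fraction = 1 / 400   # errors ECC cannot repair
residual = errors_without_ecc * uncorrectable_fraction
print(f"{residual:.4%}")  # 0.0250%
```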
Even better: 2-bit errors you would now be informed of! You would _know_ what is wrong.
You could have 3(!)-bit errors, and those you might not see, but they'd be several orders of magnitude rarer still.
So yes, it would not go away 100%, but 99.9% of it would. That's... making it go away, in my book.
And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors! You said _undetectable_ errors. I'm sure they happen, but I would be surprised if you saw any meaningful incidence of them, even at terabytes of data. It's probably on the order of 0.000625 of the errors you get (but if you want I can do more solid math).
I think we diverge on ‘making it go away in my book’.
When you’re the one having to debug all these bizarre things (there was real money involved, so these things mattered), over millions of jobs every day, rare events with low probability don’t disappear - they just happen and take time to diagnose and fix.
So in my book ECC improves the situation, but I still had to deal with bad DIMMs, and ECC wasn’t enough. We didn’t use to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.
I fully agree that there are lots of other cases where this doesn’t matter and ECC is good enough.
Oh, I get this point. If you have a sufficiently large amount of data and you monitor the errors, then as your software gets better and better, even low-probability cases will happen and will stand out.
But this is sort of the march of nines.
My knee-jerk reaction to blaming ECC is "naaah". Mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug that happened multiple times on "cosmic rays". You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!
Anyways, I'm sorry if my tone sounded abrasive, I, too, have appreciated the discussion.
No you were not abrasive at all - I’ve learned to assume good faith in forum conversations.
In retrospect I should have started by giving the context (‘march of nines’ is a good description), which would have made everything a lot clearer for everyone.
You're thinking in terms of independent errors. I would think that this assumption often doesn't hold, so 3 errors right next to each other are comparatively likely to happen (far more than 3 independent errors). This would explain such 'strange' occurrences with ECC memory.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA topology. The default design of our software effectively routed all memory accesses to the same NUMA zone, and performance went down the drain.
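A minimal sketch of that kind of per-node pinning on Linux. The sysfs path is standard; everything else (one worker per node, first-touch allocation then placing each worker's memory locally) is an assumption about the setup, and for strict memory binding you'd still reach for `numactl --membind` or libnuma:

```python
import os

def parse_cpulist(text):
    """Parse a sysfs cpulist like '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def pin_to_node(node):
    """Pin the current process to the CPUs of one NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))  # pid 0 = this process
```

Note that `sched_setaffinity` only pins CPUs; memory locality then relies on the kernel's default first-touch policy rather than a hard binding.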
Modern AMD processors are basically a bunch of smaller processors (chiplets) glued together with an interconnect. So yes, single-socket nodes can have many NUMA zones.
Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind then the big picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).
Given current trends I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM and that's in addition to the increasing number of NUMA nodes.
The kernel tries to guess as well as it can, though - many years ago I hit a fun bug in the kernel scheduler that was triggered by NUMA process migration, i.e. the kernel moving processes to the cores closest to their RAM. In some cases the migrated processes never got scheduled and got stuck forever.
Disabling NUMA migration removed the problem. I figured out the issue thanks to the excellent ‘A Decade of Wasted Cores’ paper, which essentially said that on ‘big’ machines like ours funky things could happen scheduling-wise, so I started looking at scheduling settings.
The main NUMA-pinning performance issue I was describing was different though, and like you said came from us needing to change the way the code was written to account for the distance to the RAM. Modern servers will usually let you choose anything from fully managed (hope and pray, single zone) to many zones, and then, depending on what you’ve chosen to expose, use it in your code. As always, benchmark, benchmark.
Guessing this is especially hard to automate with peripherals involved. I once had a workload slow down severely because it was running on a NUMA node that didn't share memory with the NIC.
Isn't high-grade SSD storage pretty much a memory layer as well these days, as the difference is no longer several orders of magnitude in access time and throughput, but only one or two (compared to the last layer of memory)?
Optane was supposed to fill the gap, but Intel never found a market for it.
Flash, even modern flash, is still extremely slow compared to RAM, especially in a world where RAM itself is already slow and your CPU keeps waiting for it.
That being said, you should consider RAM/flash/spinning disk to all be part of a storage hierarchy with different constants and tradeoffs (volatile or not, big or small, fast or slow, etc.), and knowing these tradeoffs will help you design simpler and better systems.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
My kids make fun of me because I know the shopkeepers around me by first name, along with the details of their businesses, and because shopping takes forever since I talk to everyone, customers included.
I just love it, it’s easy and I get a lot in return - from perks to incredible encounters. At work it’s been very helpful.
I developed that skill while traveling alone for a year, and it boils down to practice and to reading whether the person you’re talking to is OK with the conversation or not.
And now, because I know them, I go there where I can buy my stuff but also spend five minutes chatting, and that makes grocery shopping a real joy. And because I go there and chat, they do nice things: give me a couple of tomatoes, or “you’ve got to try this cake”, or the wine shop where I automatically get a 15% discount, or the butcher who lets me in when they’re already closed because they know I’ve come over specially.
And some of those people have become real friends, like go and have dinner together friends. We have very different lives but we get on because we get on. I think everyone benefits from reaching out of their bubble a bit.
If I’m feeling a bit glum I’ll go out to buy bread or something because I know just seeing the people I see regularly will lift me up.
It's interesting, because while having that skill is helpful, I think part of the issue a lot of people have is an overtuned sense for it - they worry they are being judged for wasting their counterpart's time.
It's good to have, but don't let not having it (yet) stop you!
Cold approaches worked better before social media and smartphones. Now your awkward encounters can live forever online and cause humiliation for years to come, or some stranger looking for clout may step in. This has become so common now because everyone wants to be a hero.
Imagine Lolita with a future-seer twist. The adolescent girl knows she will be lovers with the adult male main character in a future time and teases him by bathing with him, among other interactions, while she is still an adolescent. It's teased at in the third and fourth books until finally it's revealed to be a love story with a power, à la The Stars My Destination.
The intonation is different, and there are harsher sounds, but there are diphthongs everywhere in Dutch, and to me this is what makes it sound like English. French, Spanish, German, etc. don’t have diphthongs (or they’re quite rare).
A couple of days ago a colleague of mine was talking about very old RTS games he still liked to play, and mentioned Red Alert. It turned out he had never heard of Dune 2 or Warcraft 1 and 2!
My favourite one (it still happens) is having to mute then unmute at the beginning of the conversation, otherwise nobody can hear me. It was so common, with people fiddling with their headsets, calling again, etc., that I eventually asked everyone exhibiting audio issues to start with this.
Another interesting one is that if you’re not connected properly, you can send messages but are never notified that they never left, nor that you’re not connected.
It’s also a resource hog and will eat your machine for breakfast.
https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf