I’ve had plenty of servers with faulty ECC DIMMs that never triggered a fault, and would only show errors under actual memory testing. I had a hard time convincing some of our admins the first time (‘no ECC faults, you can’t be right’) but I won the bet.
Edit: there’s a very old paper by Google on these topics. My issues were probably 6-7 years ago.
That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.
The ECC information is stored in separate DRAM devices on the DIMM. This is responsible for some of the increased cost of DIMMs with ECC at a given size. When marketed, the extra memory for ECC is typically not included in the stated size, so 32GB DIMMs with and without ECC will have differing numbers of total DRAM devices.
I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC didn’t need extra space at all? I wasn’t suggesting that - just that they aren’t like a checksum that is stored elsewhere or something that can be ignored - the whole 72 bits are needed to decode the 64 bits of data and the 64 bits of data cannot be read independently.
If we're talking about standard server RDIMMs with ECC (or the prosumer stuff), the CPU-visible ECC (excluding DDR5's on-die ECC) is typically implemented as a sideband value you could ignore if you disabled the correction logic.
I suppose what winds up where is up to the memory controller, but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data and 8 bits of ECC (per sub-channel). Those ECC bits are usually called check bits CB[7:0], and they accompany the data bits DQ[31:0].
If you're talking about transactions for LPDDR, things are a bit different there, though, as the ECC has to be transmitted in-band with your data.
We are talking about errors happening in user space applications with ECC operating normally and what the application ultimately sees.
My point is that when writing an app you wouldn’t be able to “not use” ECC accidentally or easily if it’s there. It’s just seamless. I’m not talking about special test modes or accessing stuff differently on purpose.
Interesting that DDR5 is different from DDR4. 8 bits per 32 is double the ratio of 8 per 64, so it must have been warranted.
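To make the "8 check bits per 64 data bits" ratio concrete, here is a minimal SECDED (single-error-correct / double-error-detect) sketch in the classic DDR4-style 72/64 shape. Real memory controllers use their own (often undocumented) codes, so treat this purely as an illustration of why 8 extra bits per 64 are enough:

```python
# Minimal Hamming SECDED over a 64-bit word with 8 check bits:
# 7 Hamming parity bits (positions 1,2,4,...,64) plus 1 overall parity bit.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # power-of-two slots
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # the 64 data slots

def encode(data):
    """Return a 72-bit codeword as a bit list; index 0 is the overall parity."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:  # each check bit covers positions whose index has bit p set
        word[p] = sum(word[pos] for pos in range(1, 72) if pos & p) & 1
    word[0] = sum(word[1:]) & 1                              # overall parity
    return word

def decode(word):
    """Return (data, status); data is None on an uncorrectable error."""
    word = list(word)
    syndrome = 0
    for p in PARITY_POS:
        if sum(word[pos] for pos in range(1, 72) if pos & p) & 1:
            syndrome |= p            # failing checks spell out the flipped position
    overall = sum(word) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd number of flips: assume one, fix it
        word[syndrome if syndrome else 0] ^= 1
        status = "corrected"
    else:                            # even flips with a bad syndrome: uncorrectable
        return None, "double-bit error detected"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= word[pos] << i
    return data, status
```

Flipping any single bit of the codeword decodes back to the original data with status `"corrected"`; flipping two bits is flagged as a detected but uncorrectable error, which is exactly the correctable/uncorrectable split discussed below.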
I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely, and if they were happening you'd see a much higher incidence of detected unrecoverable errors, and a much higher incidence still of repaired errors. I just don't buy the argument of "invisible errors".
EDIT: I took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes smaller and smaller, and while ECC would indeed not reduce it to _zero_, it would greatly, greatly reduce it.
Ok, I am sure there is _some_ amount of unrepairable errors.
But the initial discussion was whether ECC RAM makes it go away, and your point was that it doesn't. And the vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. Only about 1 in 400 errors is non-repairable. That's a huge improvement! If you had ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!
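A back-of-the-envelope check of that arithmetic, taking the thread's rough 1-in-400 uncorrectable ratio at face value (the exact figure depends on the fleet and the paper's measurement window, so this is only a sketch):

```python
# If 10% of sessions see a memory error without ECC, and ECC repairs all but
# roughly 1 in 400 of those errors, the residual failure rate is:
errors_without_ecc = 0.10          # 10% incidence
uncorrectable_fraction = 1 / 400   # errors ECC cannot repair
residual = errors_without_ecc * uncorrectable_fraction
print(f"{residual:.4%}")  # 0.0250%
```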
Even better: 2-bit errors you would now be informed of! You would _know_ what is wrong.
You could have 3(!)-bit errors, and those you might not see, but they'd be several orders of magnitude rarer still.
So yes, it would not go away 100%, but 99.9% of it would. That's... making it go away, in my book.
And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors! You said _undetectable_ errors. I'm sure they happen, but I would be surprised if you saw any meaningful incidence of them, even at terabytes of data. It's probably on the order of 0.000625 of the errors you get (but if you want I can do more solid math).
I think we diverge on ‘making it go away in my book’.
When you’re the one having to debug all these bizarre things (there was real money involved, so these things mattered), over millions of jobs every day, rare events with low probability don’t disappear - they just happen and take time to diagnose and fix.
So in my book ECC improves the situation, but I still had to deal with bad DIMMs, and ECC wasn’t enough. We didn’t use to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.
I fully agree that there are lots of other cases where this doesn’t matter and ECC is good enough.
Oh, I get this point. If you have a sufficiently large amount of data and you monitor the errors, then as your software gets better and better, even low-probability cases will happen and will stand out.
But this is sort of the march of nines.
My knee-jerk reaction to blaming ECC is "naaah". Mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug that happened multiple times on "cosmic rays". You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!
Anyways, I'm sorry if my tone sounded abrasive, I, too, have appreciated the discussion.
No you were not abrasive at all - I’ve learned to assume good faith in forum conversations.
In retrospect I should have started by giving the context (‘march of nines’ is a good description), which would have made everything a lot clearer for everyone.
You're thinking in terms of independent errors. I would think that this assumption often doesn't hold, so 3 errors right next to each other are comparatively likely to happen (far more than 3 independent errors). This would explain such 'strange' occurrences with ECC memory.
Yes, there are scheduling issues, NUMA problems, etc. caused by the cluster-in-a-box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the NUMA topology. The default design of our software effectively routed all memory accesses to the same NUMA zone, and performance went down the drain.
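A minimal sketch of that kind of per-node pinning on Linux. The sysfs path is standard; everything else (one worker per node, first-touch allocation then placing each worker's memory locally) is an assumption about the setup, and for strict memory binding you'd still reach for `numactl --membind` or libnuma:

```python
import os

def parse_cpulist(text):
    """Parse a sysfs cpulist like '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def pin_to_node(node):
    """Pin the current process to the CPUs of one NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))  # pid 0 = this process
```

Note that `sched_setaffinity` only pins CPUs; memory locality then relies on the kernel's default first-touch policy rather than a hard binding.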
Modern AMD processors are basically a bunch of smaller processors (chiplets) glued together with an interconnect. So yes, single-socket nodes can have many NUMA zones.
Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind then the big picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).
Given current trends I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM and that's in addition to the increasing number of NUMA nodes.
The kernel tries to guess as well as it can, though - many years ago I hit a fun bug in the kernel scheduler that was triggered by NUMA process migration, i.e. the kernel moving processes to the cores closest to their RAM. In some cases the migrated processes never got scheduled and got stuck forever.
Disabling NUMA migration removed the problem. I figured out the issue thanks to the excellent ‘A Decade of Wasted Cores’ paper, which essentially said that on ‘big’ machines like ours funky things could happen scheduling-wise, so I started looking at scheduling settings.
The main NUMA-pinning performance issue I was describing was different though, and like you said came from us needing to change the way the code was written to account for the distance to the RAM. Modern servers will usually let you choose anything from fully managed (hope and pray, single zone) to many zones, and then, depending on what you’ve chosen to expose, use it in your code. As always, benchmark, benchmark.
Guessing this is especially hard to automate with peripherals involved. I once had a workload slow down severely because it was running on a NUMA node that didn't share memory with the NIC.
Isn't high-grade SSD storage pretty much a memory layer as well these days, as the difference is no longer several orders of magnitude in access time and throughput, but only one or two (compared to the last layer of memory)?
Optane was supposed to fill the gap, but Intel never found a market for it.
Flash, even modern flash, is still extremely slow compared to RAM, especially in a world where RAM itself is already slow and your CPU keeps waiting for it.
That being said, you should consider RAM/flash/spinning disk to all be part of a storage hierarchy with different constants and tradeoffs (volatile or not, big or small, fast or slow, etc.), and knowing these tradeoffs will help you design simpler and better systems.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
My kids make fun of me because I know the shopkeepers around me by first name, along with the details of their businesses, and because shopping takes forever since I talk to everyone, customers included.
I just love it, it’s easy and I get a lot in return - from perks to incredible encounters. At work it’s been very helpful.
I developed that skill while traveling alone for a year, and it boils down to practice and to reading whether the person you’re talking to is OK with the conversation or not.
And now, because I know them, I go there where I can buy my stuff but also spend five minutes chatting, and that makes grocery shopping a real joy. And because I go there and chat, they do nice things: give me a couple of tomatoes, or “you’ve got to try this cake”, or the wine shop where I automatically get a 15% discount, or the butcher who lets me in when they’re already closed because they know I’ve come over specially.
And some of those people have become real friends, like go and have dinner together friends. We have very different lives but we get on because we get on. I think everyone benefits from reaching out of their bubble a bit.
If I’m feeling a bit glum I’ll go out to buy bread or something because I know just seeing the people I see regularly will lift me up.
It's interesting, because while having that skill is helpful, I think part of the issue a lot of people have is an overtuned sense for it - they worry they are being judged for wasting their counterpart's time.
It's good to have, but don't let not having it (yet) stop you!
Cold approaches worked better before social media and smartphones. Now your awkward encounters can live forever online and cause humiliation for years to come, or some stranger looking for clout may step in. This has become so common now because everyone wants to be a hero.
Imagine Lolita with a future-seer twist. The adolescent girl knows she will be lovers with the adult male main character in a future time and teases him by bathing with him, among other interactions, while she is still an adolescent. It's teased at in the third and fourth books until finally it's revealed to be a love story with a power, à la The Stars My Destination.
The intonation is different, and there are harsher sounds, but there are diphthongs everywhere in Dutch, and to me this is what makes it sound like English. French, Spanish, German, etc. don’t have diphthongs (or they’re quite rare).
A couple of days ago a colleague of mine was talking about very old RTS games he still liked to play, and mentioned Red Alert. It turned out he had never heard of Dune 2 or Warcraft 1 and 2!
My favourite one (it still happens) is having to mute then unmute at the beginning of the conversation, otherwise nobody can hear me. It was so common, with people fiddling with their headsets, calling again, etc., that I eventually asked everyone exhibiting audio issues to start with this.
Another interesting one is that if you’re not connected properly, you can send messages but are never notified that they never left, nor that you’re not connected.
It’s also a resource hog and will eat your machine for breakfast.
https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf