Yes. The Cray supercomputers from the '80s were crazy good matmul machines in particular. The quad-CPU Cray X-MP (1984) could sustain 800 MFLOPS to 1 GFLOPS, and with a 1 GB SSD it had enough compute power and bandwidth to train a 7-10M-parameter language model in about six months, and to infer at 18-25 tok/sec.
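A quick sanity check on the six-month figure, using the standard ~6 × params × tokens FLOP estimate for transformer training. All the numbers below are my assumptions (mid-range of the figures above), not measurements from real X-MP runs:

```python
# Back-of-the-envelope: how many tokens could a sustained-800-MFLOPS
# machine train on in six months, using the rough 6*N*D FLOP rule?
# Every constant here is an assumption, not a benchmark.

params = 8e6            # mid-range of the 7-10M figure above
flops_per_sec = 0.8e9   # sustained 800 MFLOPS
seconds = 182 * 86400   # roughly six months

budget = flops_per_sec * seconds
tokens = budget / (6 * params)
print(f"{tokens:.2e} tokens trainable")  # a few hundred million
```

A few hundred million tokens is a plausible corpus for a model that small, so the claim at least passes the arithmetic test.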
A mid-90s Cray T3E could have handled GPT-2 124M, 24 years before OpenAI.
I also had a punch-card computer from 1965 learn XOR with backpropagation.
The hardware was never the bottleneck, the ideas were.
Post-quantum crypto is a good example of this. Lattice-based schemes were theorized in the 90s, but they took decades to actually reach production. The math existed, the hardware existed, but the ideas for making it work in practice were just not there yet.
I am a bit surprised, but I guess everything eventually wears out.
In the 1980s I worked as a field engineer who supported a lot of PDP-11s. They were very reliable for the time; tape drives and disks were the #1 maintenance items. Actually having to open up the processor and change a board was not a regular activity.
Other machines of that era, like those from Gould, Perkin-Elmer, or DG, gave regular practice in the art of repairing processors.
Guess I expect them to work forever. Like a Toyota.
I encounter two main failure modes. First, the bipolar PROMs degrade at the atomic level: the metal ions in the fuses tend to migrate or 'regrow' over decades, causing bit rot.
Second, the backplanes suffer from mechanical fatigue. After forty years of thermal expansion and structural flexing, especially when inserting boards, the traces and solder joints develop stress cracks. Both are a pain to repair.
XENIX's second target processor was an 11/34 with a Programmer's Workbench. That nightmare took 3~4 years... Microsoft years, while they used the PDP-11/70 for development.
Thanks for reposting! I'm the author of ATTN-11. Happy to answer any questions about the fixed-point arithmetic, the PDP-11 hardware, or the training process.
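For readers wondering what "fixed-point arithmetic" means in this context, here is a generic sketch of the technique in Python. To be clear, this is my illustration of Q-format arithmetic in general; the actual format, rounding, and saturation choices in ATTN-11 may differ:

```python
# Illustrative 16-bit Q4.12 fixed-point arithmetic (4 integer bits,
# 12 fractional bits). This is a generic sketch, NOT the ATTN-11 code;
# the author's actual format and rounding may differ.

FRAC_BITS = 12
SCALE = 1 << FRAC_BITS

def to_fix(x: float) -> int:
    """Encode a float as a two's-complement Q4.12 value, saturating to int16."""
    v = int(round(x * SCALE))
    return max(-32768, min(32767, v))

def fix_mul(a: int, b: int) -> int:
    """Multiply two Q4.12 values: take the 32-bit product (as EIS MUL
    yields a 32-bit result on the PDP-11), round, shift back, saturate."""
    prod = a * b
    return max(-32768, min(32767, (prod + (1 << (FRAC_BITS - 1))) >> FRAC_BITS))

def to_float(v: int) -> float:
    return v / SCALE

a, b = to_fix(1.5), to_fix(-0.25)
print(to_float(fix_mul(a, b)))  # -0.375
```

The interesting engineering questions are all in the details this sketch glosses over: where the intermediate activations overflow, and how much precision the softmax needs.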
Incredible work! Fitting a transformer into 32KB of RAM is crazy.
For those reading about this project who aren't familiar with the PDP-11, it can be hard to appreciate just how difficult working within these memory limits is.
Here is a visual guide to the PDP-11 architecture: https://vectree.io/c/pdp-11-hardware-architecture
The PDP-11 was the most fun minicomputer of the late 1970s, in my opinion. Growing up in NH, about an hour north of Digital's HQ, all sorts of schools from primary to secondary, as well as museums, had PDP-8, PDP-10, PDP-11, and later VAX machines.
The PDP-11 had a timesharing OS called RSTS/E which could give maybe 10 people a BASIC programming experience a little better than an Apple ][. If you were messing with 8-bit microcomputers in 1981 you might think a 16-bit future would look like the PDP-11, but the 1970 design was long in the tooth by 1980 -- like the 8-bit micros, it was limited to a 64 KB logical address space. Virtual memory let it offer 64 KB environments to more users, but it couldn't give a single user a bigger environment.
Fun stuff! At one point I wondered about building something similar. But I lack the AI chops, and have too many other projects going on anyway.
I'm curious as to the type of memory in the 11/34. I also have a working PDP-11, an 11/05 with 32KW of actual core. I wonder what performance would be like with EIS emulation grafted in. Stunningly slow, I imagine.
I also have a working design for a small Transformer on the original Game Boy. It has around 4000 parameters fitting in the 8 KB cartridge SRAM, where the "saved game" is the trained model. A TI-82 with its 32 KB of RAM would be even more comfortable.
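The parameter budget works out neatly at 16-bit weights. The layer sizes below are my guesses for illustration, not the actual Game Boy design; the point is just that roughly 4000 int16 parameters is exactly an 8 KB save file:

```python
# Back-of-the-envelope parameter budget for a cartridge-SRAM transformer.
# These dimensions are hypothetical, chosen only to show the arithmetic;
# n_heads is unused because the head dim is folded into the projections.

def transformer_params(d_model, n_heads, d_ff, n_layers, vocab, ctx):
    embed = vocab * d_model + ctx * d_model  # token + position tables
    attn = 4 * d_model * d_model             # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                 # up and down projections
    return embed + n_layers * (attn + ffn)

n = transformer_params(d_model=16, n_heads=2, d_ff=16, n_layers=2,
                       vocab=32, ctx=32)
print(n, "params =", n * 2, "bytes as int16")  # 4096 params = 8192 bytes
```

With these made-up dimensions the model lands on 4096 parameters, filling the 8 KB SRAM exactly at two bytes per weight.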
Around the same time (1984), there was also another very cool piece of technology that often gets overlooked: the CMU WARP. It wasn't as flashy as the Crays and the Connection Machine, but it was the first systolic array accelerator (the ancestor of what we'd now call TPUs). It packed as many MFLOPS as a Cray-1.
It's also the computer that powered the Chevrolet Navlab self-driving car in 1986.
I've been building a functional language for differentiable programming that compiles to JAX. The core idea is homoiconicity applied to ML: models are data structures that can inspect and transform themselves.
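The flavor of "models as data structures" can be sketched in a few lines of plain Python. This is a hypothetical toy to illustrate the idea, not my actual language or its JAX backend: the model is a nested tuple, ordinary code walks and rewrites it, and a tiny interpreter stands in for the compiler:

```python
# Toy illustration of homoiconic models (hypothetical, not the actual
# language): the model IS a data structure, so any code can inspect it,
# transform it, and then evaluate it. A real system would emit JAX here.

import math

model = ("chain",
         ("scale", 2.0),
         ("activation", "relu"))

def transform(expr, old, new):
    """Rewrite the model tree, e.g. swapping one activation for another."""
    if isinstance(expr, tuple):
        return tuple(transform(e, old, new) for e in expr)
    return new if expr == old else expr

def run(expr, x):
    """Tiny interpreter standing in for compilation to JAX."""
    op = expr[0]
    if op == "chain":
        for sub in expr[1:]:
            x = run(sub, x)
        return x
    if op == "scale":
        return x * expr[1]
    if op == "activation":
        return max(0.0, x) if expr[1] == "relu" else math.tanh(x)
    raise ValueError(op)

tanh_model = transform(model, "relu", "tanh")  # the model rewrote itself
print(run(model, -1.5), run(tanh_model, -1.5))
```

Because the model and the code that manipulates it share one representation, passes like quantization or operator fusion become ordinary tree rewrites rather than framework-specific graph surgery.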
For those interested, this guy is revamping the Emacs widget library with something more modern and platform agnostic, based on SDL: https://appetrosyan.github.io/posts/
Interesting, thanks for sharing! I've had thoughts about making vui.el backend-agnostic so it could target different widget implementations (like xwidgets or even native-GUI). An SDL-based widget library could potentially be one of those backends. Need to dig into appetrosyan's work before I can say anything intelligent about it though. And of course, it was an idea and I am unlikely to dive deep without practical need (time is limited, sadly).
My only complaint regarding the Zed editor is the inability to display two panes of the sidebar one below the other. Not only is it impossible to display them together, but switching between them requires clicking a tiny button in the status bar. To make matters worse, performing a search hides the symbols and the tree view.
> I think all of ML being in Python is a colossal mistake that we'll pay for for years.
Market pressure. Early ML frameworks were in Lisp, then eventually Lua with Torch, but demand dictated the choice of Python because "it's simple" even if the result is cobbled together.
Lisp is arguably still the most suitable language for neural networks for a lot of reasons beyond the scope of this post, but the tooling is missing. I’m developing such a framework right now, though I have no illusions that many will adopt it. Python may not be elegant or efficient, but it's simple, and that's what people want.
Gee, I wonder why the tooling for ML in Lisp is missing even though the early ML frameworks were in Lisp. Perhaps there is something about the language that stifles truly wide collaboration?
I doubt it considering there are massive Clojure codebases with large teams collaborating on them every day. The lack of Lisp tooling and the prevalence of Python are more a result of inertia, low barrier to entry and ecosystem lock-in.