Chumsky: A Tutorial

marginalia_nu · on April 12, 2022

A few years back I ported Haskell style monadic parsing to Java, basically as a shit post entry to an event at work. I sadly don't have the code since it was a work thing, but I got it to work.

... at tremendous cost to my sanity. I've later been told I may have inadvertently broken two of the seven seals.

giraffe_lady · on April 12, 2022

haha timely comment for me. I'm currently working on a project where I have no choice but to target a limited form of lua 5.1, but have total freedom within that constraint.

Was considering what it would take to write a small ML that compiles to lua. "Almost certainly more than I'm willing to put in" was my tentative unresearched conclusion but this validates it a bit.

shpongled · on April 12, 2022

micro ML/lisp to C isn't too involved, really!

1. Convert to continuation-passing style (or ANF)

2. Perform closure conversion

3. Hoisting nested functions/lambda lifting

4. Generate C! (and then write a runtime/GC).

Targeting lua should be much easier

throw10920 · on April 12, 2022

Where can one read about these steps in an easily-digestible format?

shpongled · on April 12, 2022

Unfortunately, I haven't found a ton of "easily-digestible" and, at the same time, comprehensive guides on compiling functional languages. Generally you'll find a mix of blog posts/class notes/papers covering a single step.

Some resources I like:

- Andrew Kennedy's 2007 paper Compiling with Continuations, Continued [1]. This one is the most clear IMO

- Andrew Appel's Compiling with Continuations book (a bit outdated though... assembly code is for VAX)

- Matt Might's series [2]

- MLton's source and documentation [3]

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...

[2] https://matt.might.net/articles/closure-conversion/

[3] http://mlton.org/

xigoi · on April 12, 2022

Take your pick: https://github.com/hengestone/lua-languages

giraffe_lady · on April 12, 2022

I spent a few hours with this list a few weeks ago and nah. They're all abandoned, or undocumented, or incomplete, or target luajit, or intensely nonstandard, or surface the underlying lua behavior I'm trying to avoid. Usually a mix of some of those.

The lisp ones look solid and I love lisp, but for this specific thing a good type system would be most valuable.

sitkack · on April 12, 2022

A pretty-printer for your tiny ML could emit Lua. I mean you even get proper tail calls! It would be a real ML.

kaz-inc · on April 12, 2022

There's a lot more to writing a compiler/interpreter than parsing, and I would almost posit that parsing is the _least_ interesting part of writing a compiler.

mamcx · on April 12, 2022

That is what I thought before.

But what this library and some others are aiming for is the NEXT level of parsing difficulty: Parsers that work AND also provide high-fidelity output for SEVERAL consumers (IDE, editors, linters, ...) AND provide high-quality errors (aka: diagnostics) AND provide suggestion to fix that errors.

That is another level!

MUCH HARDER!

jjice · on April 12, 2022

I agree that it's the least interesting part (at least for me), but a necessary part to be able to get to the juicy stuff. Getting a quick parser ready helps keep me wanting to code since I don't have to trudge though to get to the fun part.

xiphias2 · on April 12, 2022

Can you tell any other part of the compiler ecosystem that helps you write a new programming language in 1 hour?

This is supposed to be a fun article for experimentation in programming languages, not a new advancement in a garbage collection algorithm for increasing the efficiency of a datacenter by 2%.

My favourite new programming language is Svelte for example (just a parser that emits javascript), it's just a little different from HTML/CSS/JS, but just enough to make web development fun again for me.

DylanSp · on April 12, 2022

Rust already has Nom for parser combinators, though. This library looks fine (at first glance) for a parser library, but it doesn't seem like it helps with building languages beyond just parsing (which is what I expected from the title).

zesterer · on April 13, 2022

The main points of focus for chumsky, beyond nom, are:

- Ease of use + ergonomics

- Backtracking 'by default'

- High quality errors and error recovery

Nom is very specifically designed for parsing input not intended to be read by humans, and that focus is evident in the things is prioritises.

charcircuit · on April 12, 2022

tekknolagi · on April 12, 2022

The article implements eval() for a couple stages in the language. It's not just parsing.

DylanSp · on April 12, 2022

I mostly agree, this just seems like a case of a weird description in the HN submission. As it is... it's a parser library, the Github project calls it a parser library, and that's fine.

zesterer · on April 13, 2022

If it helps, I also have tutorials in the works for type inference & type-checking, monomorphisation, optimisation, and possibly codegen too.

junon · on April 12, 2022

Yep. Getting the initial AST/CST is easy. Going from that to whatever your target is is the meat of the entire project.

postalrat · on April 12, 2022

No where in the title or article did I read "compiler".

gnulinux · on April 12, 2022

A programming language implementation is either a compiler or interpreter. Title says "write your own programming language..." which implies an implementation. But that's not the case. As other commenters point out parser is a tiny, usually inconsequential part of a lang implementation.

zokier · on April 12, 2022

The article contains full interpreter?

gnulinux · on April 13, 2022

I wouldn't call it a full interpreter, it just walks expressions' AST. No way to create class/abstractions.

mhh__ · on April 12, 2022

Nearly all mature language implementations are compilers to some extent

jakearmitage · on April 12, 2022

Now do indentation-based. All parser libraries tend to trip on this small, yet widely relevant, feature. To be honest, most parser libraries should at least include a Python sample as some sort of acid test.

codethief · on April 12, 2022

I have never used Chumsky before but the tutorial has the following to say about whitespace:

> Whitespace is not automatically ignored. Chumsky is a general-purpose parsing library, and some languages care very much about the structure of whitespace, so Chumsky does too

zesterer · on April 13, 2022

I have some thoughts about this: https://github.com/zesterer/chumsky/issues/20#issuecomment-9...

TL;DR: The best approach, in my view, is to detect whitespace in the lexer and emit fake delimiter tokens for the parser so that the parser-level syntax is still context-free.

brundolf · on April 12, 2022

Wow, Chumsky (and Ariadne!) looks really powerful. Got me wondering if I should port my language off of my hand-written parser; I haven't even begun to wrestle with parsing-error recovery.

zesterer · on April 13, 2022

I've been developing both throughout the development of Tao, my own hobby language. It's since developed quite an extensive syntax (see here for an example: https://github.com/zesterer/tao/blob/master/examples/99.tao), so you can count it as evidence that chumsky scales to complex grammars: https://github.com/zesterer/tao

ponow · on April 12, 2022

Apparently, expertise with Rust is a pre-requisite.

jonpalmisc · on April 12, 2022

Likely controversial opinion: Parser combinators are fun but feel a little too magic for me.

Maybe I’m being stubborn, but I prefer to write lexers and parsers by hand. Although, I would like to get around to writing a “common lexer and parser utility library” one day to abstract away some common patterns while simultaneously not forcing design decisions. Perhaps I’ll end up with a parser combinator library by the time I’m done and I’ll have to eat my words.

foldr · on April 12, 2022

I tend to agree. If you're using an imperative language then the basic primitives of structured programming (loops, conditionals and functions) are all you need to write a recursive descent parser. Combinators like Parsec's 'sepBy' can make your parser more concise, but at the cost of making it more difficult to reason about performance and more difficult to read for anyone who hasn't memorized all the combinators.

Most parser combinator libraries give you lots of help with the easy stuff (e.g. parse a list of integers separate by commas), but not so much help with the hard stuff like generating good error messages or avoiding unbounded backtracking.

mhh__ · on April 12, 2022

I think lexers shouldn't be written by hand but parsers often have to be to at least some extent.

zesterer · on April 13, 2022

My advice has always been: 'Write at least one parser by hand before using parser combinators. Then, never write parsers by hand ever again'.

formerly_proven · on April 12, 2022

Naming something after Chomsky, a difficult choice. Though I understand his reception has been largely uncritical in the US for some reason...

samvher · on April 12, 2022

My impression is that he’s pretty controversial politically but that his contributions to linguistics/theory of grammars are generally seen as very impactful. And I imagine that in this case the reference is more related to the latter (?)

Curious to find out more about critiques on his work though (in all spheres)

thomassmith65 · on April 12, 2022

One reason is that his fame among culture warriors on youtube is related to his third act (or arguably fourth). His first act in Linguistics, back in the 1950s, is relevant to Computer Science (eg: "Chomsky Normal Form")

zesterer · on April 13, 2022

As the author of the library, I'd like to note that I consider some of Chomsky's views to be... questionable. The choice of name was largely down to his work in the world of linguistics and parser theory.

marco_yolo · on April 12, 2022

anarchy and libertarian socialism is particularly popular among the American left. Though I think OP is from UK. Either way as a individualist and capitalist, I rolled my eyes.

jesushax · on April 12, 2022

Even among the right, I think anarchism and libertarian socialism are viewed more sympathetically than authoritarian leftist tendencies. Much like how the left is less vitriolic (in principle, at least) towards capitalist libertarians than they are to fascists.

leobg · on April 12, 2022

Chumsky is a funny name for a tool to make a language. The old man would like it, I think.

bqmjjx0kac · on April 12, 2022

To be clear, Noam Chomsky is still alive at 93 years old.

bitwize · on April 12, 2022

He was featured on Hackernews recently, opining about the Russia-Ukraine situation.

leobg · on April 13, 2022

I know. Though he presumably doesn’t know about that tool.

standardly · on April 12, 2022

numb chumsky

DylanSp · on April 12, 2022

The title here on HN seems kinda misleading; this is a parser library (and describes itself as that in the readme and github description), which isn't the only thing needed for writing a language.

lolinder · on April 12, 2022

Yes, the library is only a parser, but the tutorial does cover using the parsed structure for interpretation as well. I think it's mostly a fair title but could probably use the word "tiny" in it (as in, "write your own tiny programming...").

emteycz · on April 12, 2022

Don't let that mislead you: This is much more than a simple parser library. It's really a big chunk of what a compiler does, as the title advertises. Yeah, you need to handle the actual runtime part, of course.

DylanSp · on April 13, 2022

I'm looking at the API docs and just seeing parser combinators and parsing-related stuff, like reporting/handling/recovering from parse errors. That's very handy to have, certainly, but I feel like I'm missing something if there's supposed to be support for implementing type-checking, nontrivial semantics, optimization, codegen, etc.