Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Chumsky: A Tutorial (github.com/zesterer)
78 points by surrTurr on April 12, 2022 | hide | past | favorite | 49 comments


A few years back I ported Haskell style monadic parsing to Java, basically as a shit post entry to an event at work. I sadly don't have the code since it was a work thing, but I got it to work.

... at tremendous cost to my sanity. I've later been told I may have inadvertently broken two of the seven seals.


haha timely comment for me. I'm currently working on a project where I have no choice but to target a limited form of lua 5.1, but have total freedom within that constraint.

Was considering what it would take to write a small ML that compiles to lua. "Almost certainly more than I'm willing to put in" was my tentative unresearched conclusion but this validates it a bit.


micro ML/lisp to C isn't too involved, really!

1. Convert to continuation-passing style (or ANF)

2. Perform closure conversion

3. Hoisting nested functions/lambda lifting

4. Generate C! (and then write a runtime/GC).

Targeting lua should be much easier


Where can one read about these steps in an easily-digestible format?


Unfortunately, I haven't found a ton of "easily-digestible" and, at the same time, comprehensive guides on compiling functional languages. Generally you'll find a mix of blog posts/class notes/papers covering a single step.

Some resources I like:

- Andrew Kennedy's 2007 paper Compiling with Continuations, Continued [1]. This one is the most clear IMO

- Andrew Appel's Compiling with Continuations book (a bit outdated though... assembly code is for VAX)

- Matt Might's series [2]

- MLton's source and documentation [3]

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...

[2] https://matt.might.net/articles/closure-conversion/

[3] http://mlton.org/



I spent a few hours with this list a few weeks ago and nah. They're all abandoned, or undocumented, or incomplete, or target luajit, or intensely nonstandard, or surface the underlying lua behavior I'm trying to avoid. Usually a mix of some of those.

The lisp ones look solid and I love lisp, but for this specific thing a good type system would be most valuable.


A pretty-printer for your tiny ML could emit Lua. I mean you even get proper tail calls! It would be a real ML.


There's a lot more to writing a compiler/interpreter than parsing, and I would almost posit that parsing is the _least_ interesting part of writing a compiler.


That is what I thought before.

But what this library and some others are aiming for is the NEXT level of parsing difficulty: Parsers that work AND also provide high-fidelity output for SEVERAL consumers (IDE, editors, linters, ...) AND provide high-quality errors (aka: diagnostics) AND provide suggestion to fix that errors.

That is another level!

MUCH HARDER!


I agree that it's the least interesting part (at least for me), but a necessary part to be able to get to the juicy stuff. Getting a quick parser ready helps keep me wanting to code since I don't have to trudge though to get to the fun part.


Can you tell any other part of the compiler ecosystem that helps you write a new programming language in 1 hour?

This is supposed to be a fun article for experimentation in programming languages, not a new advancement in a garbage collection algorithm for increasing the efficiency of a datacenter by 2%.

My favourite new programming language is Svelte for example (just a parser that emits javascript), it's just a little different from HTML/CSS/JS, but just enough to make web development fun again for me.


Rust already has Nom for parser combinators, though. This library looks fine (at first glance) for a parser library, but it doesn't seem like it helps with building languages beyond just parsing (which is what I expected from the title).


The main points of focus for chumsky, beyond nom, are:

- Ease of use + ergonomics

- Backtracking 'by default'

- High quality errors and error recovery

Nom is very specifically designed for parsing input not intended to be read by humans, and that focus is evident in the things is prioritises.


LLVM


The article implements eval() for a couple stages in the language. It's not just parsing.


I mostly agree, this just seems like a case of a weird description in the HN submission. As it is... it's a parser library, the Github project calls it a parser library, and that's fine.


If it helps, I also have tutorials in the works for type inference & type-checking, monomorphisation, optimisation, and possibly codegen too.


Yep. Getting the initial AST/CST is easy. Going from that to whatever your target is is the meat of the entire project.


No where in the title or article did I read "compiler".


A programming language implementation is either a compiler or interpreter. Title says "write your own programming language..." which implies an implementation. But that's not the case. As other commenters point out parser is a tiny, usually inconsequential part of a lang implementation.


The article contains full interpreter?


I wouldn't call it a full interpreter, it just walks expressions' AST. No way to create class/abstractions.


Nearly all mature language implementations are compilers to some extent


Now do indentation-based. All parser libraries tend to trip on this small, yet widely relevant, feature. To be honest, most parser libraries should at least include a Python sample as some sort of acid test.


I have never used Chumsky before but the tutorial has the following to say about whitespace:

> Whitespace is not automatically ignored. Chumsky is a general-purpose parsing library, and some languages care very much about the structure of whitespace, so Chumsky does too


I have some thoughts about this: https://github.com/zesterer/chumsky/issues/20#issuecomment-9...

TL;DR: The best approach, in my view, is to detect whitespace in the lexer and emit fake delimiter tokens for the parser so that the parser-level syntax is still context-free.


Wow, Chumsky (and Ariadne!) looks really powerful. Got me wondering if I should port my language off of my hand-written parser; I haven't even begun to wrestle with parsing-error recovery.


I've been developing both throughout the development of Tao, my own hobby language. It's since developed quite an extensive syntax (see here for an example: https://github.com/zesterer/tao/blob/master/examples/99.tao), so you can count it as evidence that chumsky scales to complex grammars: https://github.com/zesterer/tao


Apparently, expertise with Rust is a pre-requisite.


Likely controversial opinion: Parser combinators are fun but feel a little too magic for me.

Maybe I’m being stubborn, but I prefer to write lexers and parsers by hand. Although, I would like to get around to writing a “common lexer and parser utility library” one day to abstract away some common patterns while simultaneously not forcing design decisions. Perhaps I’ll end up with a parser combinator library by the time I’m done and I’ll have to eat my words.


I tend to agree. If you're using an imperative language then the basic primitives of structured programming (loops, conditionals and functions) are all you need to write a recursive descent parser. Combinators like Parsec's 'sepBy' can make your parser more concise, but at the cost of making it more difficult to reason about performance and more difficult to read for anyone who hasn't memorized all the combinators.

Most parser combinator libraries give you lots of help with the easy stuff (e.g. parse a list of integers separate by commas), but not so much help with the hard stuff like generating good error messages or avoiding unbounded backtracking.


I think lexers shouldn't be written by hand but parsers often have to be to at least some extent.


My advice has always been: 'Write at least one parser by hand before using parser combinators. Then, never write parsers by hand ever again'.


Naming something after Chomsky, a difficult choice. Though I understand his reception has been largely uncritical in the US for some reason...


My impression is that he’s pretty controversial politically but that his contributions to linguistics/theory of grammars are generally seen as very impactful. And I imagine that in this case the reference is more related to the latter (?)

Curious to find out more about critiques on his work though (in all spheres)


One reason is that his fame among culture warriors on youtube is related to his third act (or arguably fourth). His first act in Linguistics, back in the 1950s, is relevant to Computer Science (eg: "Chomsky Normal Form")


As the author of the library, I'd like to note that I consider some of Chomsky's views to be... questionable. The choice of name was largely down to his work in the world of linguistics and parser theory.


anarchy and libertarian socialism is particularly popular among the American left. Though I think OP is from UK. Either way as a individualist and capitalist, I rolled my eyes.


Even among the right, I think anarchism and libertarian socialism are viewed more sympathetically than authoritarian leftist tendencies. Much like how the left is less vitriolic (in principle, at least) towards capitalist libertarians than they are to fascists.


Chumsky is a funny name for a tool to make a language. The old man would like it, I think.


To be clear, Noam Chomsky is still alive at 93 years old.


He was featured on Hackernews recently, opining about the Russia-Ukraine situation.


I know. Though he presumably doesn’t know about that tool.


numb chumsky


The title here on HN seems kinda misleading; this is a parser library (and describes itself as that in the readme and github description), which isn't the only thing needed for writing a language.


Yes, the library is only a parser, but the tutorial does cover using the parsed structure for interpretation as well. I think it's mostly a fair title but could probably use the word "tiny" in it (as in, "write your own tiny programming...").


Don't let that mislead you: This is much more than a simple parser library. It's really a big chunk of what a compiler does, as the title advertises. Yeah, you need to handle the actual runtime part, of course.


I'm looking at the API docs and just seeing parser combinators and parsing-related stuff, like reporting/handling/recovering from parse errors. That's very handy to have, certainly, but I feel like I'm missing something if there's supposed to be support for implementing type-checking, nontrivial semantics, optimization, codegen, etc.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: