
Right, but pretty.c doesn’t seem to explicitly support those.


How so? It’s just char*.


What is len() of a char* vs a Unicode string?


char* is just raw bytes.

At the language level, C historically hasn't offered much support for working with specific character sets and their encodings. C11 added the u"...", U"...", and u8"..." string literals, and C23 added the char8_t type, but there's still little or no built-in tooling for text processing.

For text processing, work with char* whose bytes are some encoding of Unicode, e.g. UTF-8; then you can use a C library such as libunistring or ICU.

However the bytes of a char* could instead be an encoding of a non-Unicode character set, e.g. GB2312 encoded as EUC-CN.

So char* is character set and encoding agnostic. And C-the-language doesn't even try to offer you tools for working with different sets and encodings. Instead, you can use a library or write your own code for that purpose.

A number of languages make the same decision, keeping the string type set/encoding agnostic, with libraries taking up the slack.

In Nim, for example, the string type is essentially raw bytes (string literals in .nim sources are UTF-8). If you're doing Unicode text processing then you'd use facilities from the std/unicode module:

https://nim-lang.org/docs/unicode.html

Same story with Zig:

https://ziglang.org/documentation/0.8.0/std/#std;unicode

Lua too, and you'll probably use a 3rd party library such as luaut8 for working with Unicode/UTF-8

https://github.com/starwing/luautf8

Returning to the matter of pretty.c, since it's just sugar for C, it makes sense (to me) that the string type is just an alias for the set/encoding agnostic char*. It's up to the programmer to know and decide what the bytes represent and choose a library accordingly.



