How many hand crafted lexers dealing with lf vs. cr-lf encodings do exist? My gu...

hnlmorg · 2025-07-09T06:26:27 1752042387

I’ve written a fair few lexers in my time. My general approach for CR is to simply ignore the character entirely.

If CR is used correctly in windows, then its behaviour is already covered by the LF case (as required for POSIX systems) and if CR is used incorrectly then you end up with all kinds of weird edge cases. So you’re much better off just jumping over that character entirely.

ninjaoxygen · 2025-07-09T09:34:01 1752053641

Ignoring CR is often how two systems end up parsing the same file differently, one as two lines the other as a single line.

If the format is not sensitive to additional empty lines then converting them all CR to LF in-place is likely a safer approach, or a tokenizer that coalesces all sequential CR/LF characters into a single EOL token.

I write a lot of software that parses control protocols, the differences between the firmware from a single manufacturer on different devices is astonishing! I find it shocking the number that actually have no delimiters or packet length.

hnlmorg · 2025-07-09T11:18:08 1752059888

Why would ignoring CR lead to problems? It has nothing to do with line feeding on any system released in the last quarter of a century.

If you’re targeting iMacs or the Commodore 64, then sure, it’s something to be mindful of. But I’d wager you’d have bigger compatibility problems before you even get to line endings.

Is there some other edge cases regarding CR that I’ve missed? Or are you thinking ultra defensively (from a security standpoint)?

That said, I do like your suggestion of treating CR like LF where the schema isn’t sensitive to line numbering. Unfortunately for my use case, line numbering does matter somewhat. So would be good to understand if I have a ticking time bomb

astrobe_ · 2025-07-09T17:24:49 1752081889

Best option is to treat anything with ASCII code < 0x20 (space) as (white)space, but one doesn't have the chance often enough, unfortunately.

grodriguez100 · 2025-07-09T07:03:36 1752044616

Old Macs (pre-OS X I think) used CR only as line terminators.

hnlmorg · 2025-07-09T11:22:02 1752060122

Yes, you’re right. I completely forgot about them.

I can’t imagine anyone targeting macOS 9 (or earlier) systems these days but you’re right that it’s an edge case people should be aware of.

ummonk · 2025-07-09T16:59:54 1752080394

I guess at this point it's okay to deprecate Mac OS 9 files?

hnlmorg · 2025-07-09T21:50:07 1752097807

It depends what you’re writing the parser for. In my cases, it’s compilers and similar tooling that wouldn’t support Mac OS 9 anyway. So you’d never get Mac OS 9 generated files to begin with.

layer8 · 2025-07-09T10:51:46 1752058306

More generally, any textual file format where whitespace is significant at the end of a line is calling for trouble.

hnlmorg · 2025-07-09T11:20:45 1752060045

Maybe. But expecting people to remember a ; (or similar) at the end of lines is going to cause more frequent problems from a UX performance.

So you’re better off accepting the edge cases problems that white space introduces considering the benefits outweighs the pain.

layer8 · 2025-07-09T11:39:54 1752061194

That’s not what I meant. It’s okay for the line break itself to be significant. But whitespace immediately preceding the line break shouldn’t be significant, due to its general invisibility.

hnlmorg · 2025-07-09T11:46:47 1752061607

Is CR considered whitespace? I always thought that was classed as a non-printable control character. But maybe I’m wrong?

Or are you talking about SP preceding CR and/or LF?

layer8 · 2025-07-09T13:06:51 1752066411

Line breaks are considered whitespace, hence CR is considered whitespace. It is also a control character. This is similar to TAB, or indeed LF.

See here for example: https://en.cppreference.com/w/c/string/byte/isspace

Or here for Unicode: https://en.wikipedia.org/wiki/Whitespace_character#Unicode

hnlmorg · 2025-07-09T21:51:18 1752097878

Ahh that makes sense then.

Thanks for the responses