These are less JavaScript problems than utf-16 problems. The whole one character...

fauigerzigerk · on Feb 12, 2014

> These are less JavaScript problems than utf-16 problems

The issues related to combining marks are not UTF-16 problems and are not solved by converting to codepoints.
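To make that concrete, here is a minimal sketch. It assumes an ES2015+ runtime, where `Array.from` iterates a string by code point; the variable names are only for illustration. Even with code points in hand, the combining accent is still a separate element, so a naive reversal detaches it from its base letter.

```javascript
// "é" written as a base letter followed by a combining acute accent (U+0301).
const decomposed = 'e\u0301';

// Array.from walks the string by code point (assumes ES2015+), so surrogate
// pairs would stay intact, but the combining mark is still its own code point.
const codePointCount = Array.from(decomposed).length; // 2, not 1

// Reversing by code point moves the accent in front of the "e",
// where it no longer combines with it.
const reversedWord = Array.from('cafe\u0301').reverse().join('');
// reversedWord is '\u0301efac'
```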

Also, Java, like many other UTF-16-based languages, has much better Unicode support than JavaScript (such as access to code points and Unicode character classes in regular expressions).

As always, if something can be done in a sloppy, broken way, JavaScript will take advantage of it to the fullest.

slashdev · on Feb 12, 2014

They're typically solved by normalization, something JavaScript currently does not do well.
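A rough sketch of what normalization buys you, assuming a runtime that provides `String.prototype.normalize` (added in ES2015, so not something to count on in older engines). The variable names are hypothetical.

```javascript
// Two ways to write "é": base letter + combining accent, or one precomposed code point.
const combiningForm = 'e\u0301';
const singleForm = '\u00E9';

// Compared as raw strings, they differ.
const equalRaw = combiningForm === singleForm; // false

// After NFC normalization both become the precomposed form (assumes ES2015+).
const equalNfc = combiningForm.normalize('NFC') === singleForm.normalize('NFC'); // true
const nfcLength = combiningForm.normalize('NFC').length; // 1
```

This only helps where a precomposed form exists, which fits "typically" rather than "always".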

ygra · on Feb 12, 2014

The article makes the point that taking substrings and reversing text (when does anyone ever do that, actually, apart from coding exercises?) don't work naïvely even if you consider UTF-32. Yes, the code unit / code point dichotomy is annoying and plenty of people don't know about it, but there are many more pitfalls in Unicode when you don't know what you're doing.

If you have an easy solution for deprecating UTF-16 everywhere it's used (without breaking anything that currently works), I'm all ears. Unicode is a pragmatic, not a perfect, standard, and there are historical mistakes. But for better or for worse they exist and will probably stay.

masklinn · on Feb 12, 2014

Technically, they're not even UTF-16 problems; they're extended UCS-2 problems (aka UTF-16-treated-as-UCS-2-with-surrogate-pairs). Logically, a UTF-16 interface wouldn't expose code units first and foremost.
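As an illustration of code units being what the string API exposes first, here is a small sketch. `codePointAt` and `Array.from` are ES2015 additions and are assumed to be available; the names are only for illustration.

```javascript
// U+1F4A9 lies outside the BMP, so it is stored as a surrogate pair.
const pile = '\uD83D\uDCA9';

// length and charCodeAt count and return UTF-16 code units.
const unitCount = pile.length;        // 2
const firstUnit = pile.charCodeAt(0); // 0xD83D, a lone high surrogate

// The code-point-aware alternatives (assumes ES2015+).
const firstCodePoint = pile.codePointAt(0);  // 0x1F4A9
const codePointTotal = Array.from(pile).length; // 1
```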

mathias · on Feb 12, 2014

Exactly. I wrote a separate post on that: http://mathiasbynens.be/notes/javascript-encoding

webreac · on Feb 12, 2014

Yes! There is a "goto considered harmful", but UTF-16 is a lot more harmful.