
These are less JavaScript problems than UTF-16 problems. The whole "one character is not one code point" problem is common to Java, .NET, basically all of Windows, and anything else that uses UTF-16 strings. The solution is easy: if you need a one-to-one mapping of code points to characters, convert to UTF-32 first. UTF-8 has the same problems; the only difference is that there people know characters and code points don't match up. Whereas with UTF-16 there's a bunch of people who are either new or should never have been programmers to begin with and who are clueless about it. Sadly this number is so large that just about any program that uses UTF-16 strings is broken for inputs where code points != characters. This is partly the fault of the languages and libraries, which give you functions like substring, reverse, etc. on UTF-16 strings, where they basically have no consistent meaning. It should have been a storage format, not a manipulation format.
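A rough sketch of the mismatch described above, assuming an ES2015+ runtime (for the `\u{...}` escape and string iteration). The helper names `reverseByCodeUnit` and `reverseByCodePoint` are made up for illustration and are not part of any library.

```javascript
// U+1F4A9 lies outside the Basic Multilingual Plane,
// so UTF-16 needs two code units (a surrogate pair) to store it.
const pile = '\u{1F4A9}';

const codeUnitCount = pile.length;        // .length counts UTF-16 code units
const codePointCount = [...pile].length;  // the string iterator walks code points

// Hypothetical helper: reverses code unit by code unit,
// which swaps the two surrogates and leaves an ill-formed string.
const reverseByCodeUnit = (s) => s.split('').reverse().join('');

// Hypothetical helper: reverses code point by code point,
// which keeps each surrogate pair intact.
const reverseByCodePoint = (s) => [...s].reverse().join('');

const naive = reverseByCodeUnit('a' + pile);
const better = reverseByCodePoint('a' + pile);

console.log(codeUnitCount, codePointCount); // 2 1
console.log(better === pile + 'a');         // true
console.log(naive === pile + 'a');          // false
```

Working per code point is roughly what "convert to UTF-32 first" buys you. It only addresses the surrogate pair issue, nothing beyond that.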


">These are less JavaScript problems than utf-16 problems"

The issues related to combining marks are not UTF-16 problems and are not solved by converting to codepoints.

Also, Java and many other UTF-16-based languages have much better Unicode support than JavaScript (such as access to code points, and Unicode character classes in regular expressions).

As always, if something can be done in a sloppy broken way JavaScript will take advantage of it to the fullest.
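To make the combining-mark point concrete, here is a small sketch, assuming an ES2015+ runtime. The helper `reverseCodePoints` is a hypothetical name. Every code point in the sample string is in the BMP, so there are no surrogate pairs involved at all, and reversing by code point still gives the wrong result.

```javascript
// 'n' followed by U+0303 COMBINING TILDE displays as a single "ñ",
// but it is two code points (and two UTF-16 code units).
const decomposed = 'man\u0303ana';

// Hypothetical helper: reverse code point by code point.
const reverseCodePoints = (s) => [...s].reverse().join('');

const reversedMarks = reverseCodePoints(decomposed);

// The tilde has moved: it now follows an 'a' instead of the 'n'.
console.log([...decomposed].length);           // 7
console.log(reversedMarks === 'ana\u0303nam'); // true
console.log(reversedMarks === 'anan\u0303am'); // false (what a reader would expect)
```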


They're typically solved by normalization, something JavaScript currently does not do well.
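For what it's worth, a sketch of what normalization does and doesn't fix. This assumes a runtime with `String.prototype.normalize`, which was standardized in ES2015, so it may not have been available when this comment was written. The variable names are only for illustration.

```javascript
const composed = '\u00F1';      // "ñ" as one precomposed code point
const decomposedN = 'n\u0303';  // "n" + U+0303 COMBINING TILDE

// Same text to a reader, different code point sequences.
const equalRaw = composed === decomposedN;
const equalNfc = composed === decomposedN.normalize('NFC');
const equalNfd = composed.normalize('NFD') === decomposedN;

// Normalization only helps when a precomposed form exists.
// There is no precomposed "q with tilde", so NFC leaves two code points.
const qTilde = 'q\u0303'.normalize('NFC');

console.log(equalRaw, equalNfc, equalNfd); // false true true
console.log([...qTilde].length);           // 2
```

So normalizing to NFC makes comparison and many per-code-point operations behave, but it is not a general fix for combining sequences.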


The article makes the point that taking substrings and reversing text (when does anyone ever do that, actually, apart from coding exercises?) don't work naïvely even in UTF-32. Yes, the code unit / code point dichotomy is annoying and plenty of people don't know about it, but there are many more pitfalls in Unicode when you don't know what you're doing.

If you have an easy way of deprecating UTF-16 everywhere it's used (without breaking anything that currently works), I'm all ears. Unicode is a pragmatic standard, not a perfect one, and it contains historical mistakes. But for better or for worse they exist and will probably stay.


Technically, they're not even UTF-16 problems; they're extended UCS-2 problems (aka UTF-16-treated-as-UCS-2-with-surrogate-pairs). Logically, a UTF-16 interface wouldn't expose code units first and foremost.
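A quick sketch of what "exposing code units first" looks like in practice: indexing and `charCodeAt` hand back the individual surrogates, and you have to combine them yourself using the standard UTF-16 decoding arithmetic. `codePointAt` (ES2015) is used here only as a cross-check, and the variable names are illustrative.

```javascript
const str = '\u{1F4A9}';

// The code-unit view: two separate 16-bit values.
const hi = str.charCodeAt(0); // high (lead) surrogate
const lo = str.charCodeAt(1); // low (trail) surrogate

// Standard UTF-16 decoding: each surrogate carries 10 bits of the
// code point's offset above U+10000.
const decoded = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;

console.log(hi.toString(16), lo.toString(16)); // d83d dca9
console.log(decoded.toString(16));             // 1f4a9
console.log(decoded === str.codePointAt(0));   // true
```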


Exactly. I wrote a separate post on that: http://mathiasbynens.be/notes/javascript-encoding


Yes! There is a "goto considered harmful", but UTF-16 is a lot more harmful.



