Why Do Code Points Between U+D800 And U+DBFF Generate A One-Length String In ECMAScript 6?

September 27, 2023

I'm getting too confused. Why do code points from U+D800 to U+DBFF encode as a single (2 bytes) String element, when using the ECMAScript 6 native Unicode helpers? I'm not asking h

Solution 1:

I think your confusion is about how Unicode encodings work in general, so let me try to explain.

Unicode itself just specifies a list of characters, called "code points", in a particular order. It doesn't tell you how to convert those to bits, it just gives them all a number between 0 and 1114111 (in hexadecimal, 0x10FFFF). There are several different ways these numbers from U+0 to U+10FFFF can be represented as bits.

In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.

Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory, and it's a number between 0xD800 and 0xDBFF (a "high surrogate"), then you need to look at the next 16 bits of memory as well", where it expects to find a number between 0xDC00 and 0xDFFF (a "low surrogate"). Any piece of code which is performing this extra check is processing its data as UTF-16, and not UCS-2.
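To make the pairing rule concrete, here is a minimal sketch of the encoding direction: turning a code point into one or two 16-bit units. The function name toUtf16CodeUnits is made up for illustration and is not a built-in.

```javascript
// Hypothetical helper: split a code point into its UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0xFFFF) {
    // Fits in 16 bits, so it is stored as-is. Note that this includes the
    // reserved range 0xD800-0xDFFF, which yields a lone surrogate.
    return [codePoint];
  }
  var offset = codePoint - 0x10000;     // 20 bits remain
  var high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

console.log(toUtf16CodeUnits(0x41).join(', '));    // 65
console.log(toUtf16CodeUnits(0x1F4A9).join(', ')); // 55357, 56489
```

Reading the data back as UTF-16 is the same arithmetic in reverse, which is the formula shown further down for 💩.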

It's important to understand that the memory itself doesn't "know" which encoding it's in, the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.

Now, onto JavaScript...

JavaScript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous 💩 character, which can't be stored in one 16-bit piece of memory.

The problem is that most of the built-in string functions actually handle the data as UCS-2 - that is, they look at 16 bits at a time, and don't care if what they see is a special "surrogate". The function you used, charCodeAt(), is an example of this: it reads 16 bits out of memory, and gives them to you as a number between 0 and 65535. If you feed it 💩, it will just give you back the first 16 bits (a "high surrogate", between 0xD800 and 0xDBFF); ask it for the next "character" after, and it will give you the second 16 bits (which will be a "low surrogate", between 0xDC00 and 0xDFFF).

In ECMAScript 6 (2015), a new function was added: codePointAt(). Instead of just looking at 16 bits and giving them to you, this function checks if they represent one of the UTF-16 surrogate code units, and if so, looks for the "other half" - so it gives you a number between 0 and 1114111. If you feed it 💩, it will correctly give you 128169.

var poop = '💩';
console.log('Treat it as UCS-2, two 16-bit numbers: ' + poop.charCodeAt(0) + ' and ' + poop.charCodeAt(1));
console.log('Treat it as UTF-16, one value cleverly encoded in 32 bits: ' + poop.codePointAt(0));
// The surrogates are 55357 and 56489, which encode 128169 as follows:
// 0x010000 + ((55357 - 0xD800) << 10) + (56489 - 0xDC00) = 128169

Your edited question now asks this:

I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string?

The hexadecimal value D800 is 55296 in decimal, which is less than 65536, so given everything I've said above, this fits fine in 16 bits of memory. So if we ask charCodeAt to read 16 bits of memory, and it finds that number there, it's not going to have a problem.

Similarly, the .length property measures how many sets of 16 bits there are in the string. Since this string is stored in 16 bits of memory, there is no reason to expect any length other than 1.
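As a quick check of this, the sketch below compares a lone U+D800 with a complete surrogate pair. The variable names are only for illustration.

```javascript
// A lone high surrogate occupies a single 16-bit unit.
var loneHalf = String.fromCodePoint(0xD800);
console.log(loneHalf.length);          // 1
console.log(loneHalf.charCodeAt(0));   // 55296
// With no second half to combine with, codePointAt returns the unit itself.
console.log(loneHalf.codePointAt(0));  // 55296

// A code point above 0xFFFF needs two 16-bit units.
var fullPair = String.fromCodePoint(0x1F4A9);
console.log(fullPair.length);              // 2
// The string iterator walks by code point, so it sees one item.
console.log(Array.from(fullPair).length);  // 1
```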

The only unusual thing about this number is that in Unicode, that value is reserved - there isn't, and never will be, a character U+D800. That's because it's one of the magic numbers that tells a UTF-16 algorithm "this is only half a character". So a possible behaviour would be for any attempt to create this string to simply be an error - like opening a pair of brackets that you never close, it's unbalanced, incomplete.

The only way you could end up with a string of length 2 is if the engine somehow guessed what the second half should be; but how would it know? There are 1024 possibilities, from 0xDC00 to 0xDFFF, which could be plugged into the formula I show above. So it doesn't guess, and since it doesn't error, the string you get is 16 bits long.

Of course, you can supply the matching halves, and codePointAt will interpret them for you.

// Set up two 16-bit pieces of memory
var high = String.fromCharCode(55357), low = String.fromCharCode(56489);
// Note: String.fromCodePoint will give the same answer
// Glue them together (this + is string concatenation, not number addition)
var poop = high + low;
// Read out the memory as UTF-16
console.log(poop);
console.log(poop.codePointAt(0));

Solution 2:

Well, it does this because the specification says it has to:

Together these two say that if an argument is < 0 or > 0x10FFFF, a RangeError is thrown, but otherwise any codepoint <= 65535 is incorporated into the result string as-is.
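That behaviour is easy to observe directly. In this small sketch, throwsRangeError is a hypothetical helper written for the demonstration, not part of the language.

```javascript
// Hypothetical helper: report whether String.fromCodePoint rejects a value.
function throwsRangeError(value) {
  try {
    String.fromCodePoint(value);
    return false;
  } catch (e) {
    return e instanceof RangeError;
  }
}

console.log(throwsRangeError(-1));        // true
console.log(throwsRangeError(0x110000));  // true
console.log(throwsRangeError(0xD800));    // false

// Anything up to 65535 goes into the string as a single unit, unchanged.
var asIs = String.fromCodePoint(0xD800);
console.log(asIs.length);                    // 1
console.log(asIs.charCodeAt(0) === 0xD800);  // true
```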

As for why things are specified this way, I don't know. It seems like JavaScript doesn't really support Unicode, only UCS-2.

Unicode.org has the following to say on the matter:

http://www.unicode.org/faq/utf_bom.html#utf16-2
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.
http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.

Therefore the result of String.fromCodePoint is not always valid UTF-16 because it can emit unpaired surrogates.
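If you need to detect such strings, one possible approach is to scan the 16-bit units and apply the pairing rule from the FAQ quoted above. The function name hasLoneSurrogate is an assumption for this sketch, not a standard API.

```javascript
// Hypothetical helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be followed by a low surrogate.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the second half of a valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // A low surrogate that was not consumed by a preceding high surrogate.
      return true;
    }
  }
  return false;
}

console.log(hasLoneSurrogate(String.fromCodePoint(0xD800)));   // true
console.log(hasLoneSurrogate(String.fromCodePoint(0x1F4A9)));  // false
console.log(hasLoneSurrogate('abc'));                          // false
```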
