Convert Large Array Of Integers To Unicode String And Then Back To Array Of Integers In Node.js

January 27, 2023

I have some data which is represented as an array of integers and can be up to 200 000 elements. The integer value can vary from 0 to 200 000. To emulate this data (for debugging p

Solution 1:

That's because code points above 0xFFFF are encoded as two 16-bit code units (a surrogate pair), as can be seen in this snippet:

var s = String.fromCodePoint(0x2F804)
console.log(s);  // Shows one character
console.log('length = ', s.length); // 2, because encoding is \uD87E\uDC04

var i = s.codePointAt(0);
console.log('CodePoint value at 0: ', i); // correct

var i = s.codePointAt(1); // Should not do this, it starts in the middle of a sequence!
console.log('CodePoint value at 1: ', i); // misleading

In your code, things go wrong at the split call: it splits the string at every 16-bit code unit, ignoring the fact that some pairs of units are meant to combine into a single character.

ES6 offers a fix for this: the spread syntax iterates over a string by code point, so it keeps such pairs together:

let dataBack = [...dataAsText].map((e, i) => {
   // etc.

Now your counts will be the same.

Example:

// (Only 20 instead of 200000)
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.map(e => String.fromCodePoint(e)).join("");

console.log("String length: " + dataAsText.length);

let dataBack = [...dataAsText].map(e => e.codePointAt(0));

console.log(dataBack);

Surrogates

Be aware that within 0 ... 65535 the range 0xD800 ... 0xDFFF is reserved for so-called surrogates, which only represent a character when combined with another value. You should not iterate over those expecting that these values represent a character on their own. So in your original code, this will be another source of error.

To fix this, you should really skip over those values:

for (let i = 0; i < len; i++) {
    if (i < 0xd800 || i > 0xdfff) data.push(i);
}

In fact, there are many other code points that do not represent a character.
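To see why surrogate values break the round trip, here is a minimal sketch. The two input values are chosen purely for illustration: each is a lone surrogate, but once joined they form a valid pair, so reading the string back by code point yields one value instead of two.

```javascript
// Two integers that happen to fall in the surrogate range.
const data = [0xd83d, 0xde00];

// Each value becomes a lone surrogate; joined, they form a valid pair.
const text = data.map(n => String.fromCodePoint(n)).join("");

// Iterating by code point now sees ONE character, not two.
const back = [...text].map(ch => ch.codePointAt(0));

console.log(data.length);          // 2
console.log(back.length);          // 1
console.log(back[0].toString(16)); // 1f600
```

This is why skipping the range 0xd800 ... 0xdfff matters: those integers cannot be recovered reliably once they sit next to each other in a string.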

Solution 2:

I suspect split doesn't handle Unicode values above 65535 correctly: a quick test shows that such strings become double the length after splitting.

Perhaps look at this post and answers, as they ask a similar question.
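The quick test can be reproduced with a sketch like the following. The two code points are arbitrary values above 0xFFFF, picked only for illustration. It compares split("") with the spread syntax and Array.from, both of which iterate by code point.

```javascript
// Two code points above 0xFFFF, each stored as a surrogate pair.
const s = String.fromCodePoint(0x2f804, 0x2f805);

const bySplit = s.split("");       // splits at every 16-bit code unit
const bySpread = [...s];           // iterates by code point
const byArrayFrom = Array.from(s); // also iterates by code point

console.log(bySplit.length);     // 4
console.log(bySpread.length);    // 2
console.log(byArrayFrom.length); // 2
```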

Solution 3:

I don't think you want codePointAt (or charCodeAt) at all. To convert a number to a string, just use String; to have a single delimited string with all the values, use a delimiter (like ,); to convert it back to a number, use the appropriate one of Number, the unary +, parseInt, or parseFloat (in your case, probably Number or +):

// Only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.join(",");

console.log(dataAsText);

let dataBack = dataAsText.split(",").map(Number);

console.log(dataBack);

If your goal with codePointAt is to keep the dataAsText string short, then you can do that, but you can't use split to recreate the array because JavaScript strings are UTF-16 (effectively) and split("") will split at each 16-bit code unit rather than keeping code points together.

A delimiter would help there too:

// Again, only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
    data.push(i);
}

let dataAsText = data.map(e => String.fromCodePoint(e)).join(",");

console.log("String length: " + dataAsText.length);

let dataBack = dataAsText.split(",").map(e => e.codePointAt(0));

console.log(dataBack);

Solution 4:

If you're looking for a way to encode a list of integers so that you can safely transmit it over a network, node Buffers with base64 encoding might be a better option:

let data = [];
for (let i = 0; i < 200000; i++) {
    data.push(i);
}

// encoding

var ta = new Uint32Array(data); // unsigned, to match the Uint32Array used when decoding
var buf = Buffer.from(ta.buffer);
var encoded = buf.toString('base64');

// decoding

var buf = Buffer.from(encoded, 'base64');
var ta = new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2);
var decoded = Array.from(ta);

// same?

console.log(decoded.join() == data.join())

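Your original approach won't work because not every integer maps to a standalone character in Unicode: values in the range 0xD800 ... 0xDFFF are surrogates, and String.fromCodePoint throws a RangeError for values above 0x10FFFF.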
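The Buffer round trip can also be wrapped in two small helpers. This is only a sketch under some assumptions: the names encodeInts and decodeInts are made up for this example, and every value is assumed to be a non-negative integer below 2^32. It uses Buffer's writeUInt32LE and readUInt32LE instead of typed arrays, so the byte order is explicit and no 4-byte alignment of the underlying ArrayBuffer is needed.

```javascript
// Hypothetical helper: pack each integer into 4 little-endian bytes, then base64.
function encodeInts(ints) {
    const buf = Buffer.alloc(ints.length * 4);
    ints.forEach((n, i) => buf.writeUInt32LE(n, i * 4));
    return buf.toString("base64");
}

// Hypothetical helper: reverse of encodeInts.
function decodeInts(text) {
    const buf = Buffer.from(text, "base64");
    const out = [];
    for (let offset = 0; offset < buf.length; offset += 4) {
        out.push(buf.readUInt32LE(offset));
    }
    return out;
}

// Includes a value from the surrogate range, which is harmless here.
const sample = [0, 55296, 65535, 199999];
const roundTripped = decodeInts(encodeInts(sample));
console.log(roundTripped); // [ 0, 55296, 65535, 199999 ]
```

Because the integers are stored as raw bytes rather than characters, surrogate values need no special handling.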

UPD: if the data doesn't have to pass through a text-only channel, there is no need for base64; just store the buffer as is:

// saving

const fs = require('fs');

var ta = new Uint32Array(data); // unsigned, to match the Uint32Array used when loading
fs.writeFileSync('whatever', Buffer.from(ta.buffer));

// loading

var buf = fs.readFileSync('whatever');
var loadedData = Array.from(new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2));

// same?

console.log(loadedData.join() == data.join())
