[JavaScript] Use unicode-substring for Extracting Parts of Strings Containing Unicode
This post walks through a case, with sample code, showing why you should use unicode-substring instead of substring when extracting parts of strings that contain characters such as emoji in JavaScript and Node.js.
String.prototype.substring() works on UTF-16 code units, so it can cut a surrogate pair in half. If you pass the extracted string to encodeURI and it contains a lone high surrogate (a high surrogate with no matching low surrogate), you’ll get a URIError: malformed URI sequence error.
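To make the failure concrete, here is a minimal sketch that reproduces it. The tryEncode helper is a hypothetical name introduced only for this illustration, and it reports the error name rather than the message, since the message text differs between JavaScript engines.

```javascript
const text = "😄😄Emoji😄😄";

// substring(0, 3) keeps the first emoji (2 code units) plus only the
// high surrogate of the second emoji.
const cut = text.substring(0, 3);

// Hypothetical helper for illustration: returns the encoded string,
// or the error name if encodeURI rejects the input.
function tryEncode(value) {
  try {
    return encodeURI(value);
  } catch (err) {
    return err.name;
  }
}

console.log(tryEncode(text.substring(0, 2))); // %F0%9F%98%84
console.log(tryEncode(cut)); // URIError
```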
A senior engineer colleague advised me: “Unicode specifications are complex, so it’s better not to handle them with String.prototype.substring().”
Reference: FAQ - UTF-8, UTF-16, UTF-32 & BOM
Since unicode-substring was recommended as a replacement for String.prototype.substring(), I tried it out immediately.
const unicodeSubstring = require('unicode-substring');
const string = "😄😄Emoji😄😄";
// str.substring(indexStart[, indexEnd])
console.log(string.substring(0, 3)); // 😄�
// unicodeSubstring(string, start, end)
console.log(unicodeSubstring(string, 0, 3)); // 😄😄E
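The difference comes from what each function counts. As a rough illustration (the variable name sample is just for this sketch), the snippet below compares the number of UTF-16 code units with the number of code points, and shows that index 2 is where the second emoji's surrogate pair begins, so an end index of 3 lands in the middle of it.

```javascript
const sample = "😄😄Emoji😄😄";

// length counts UTF-16 code units; each emoji here takes two.
console.log(sample.length); // 13

// The string iterator yields whole code points.
console.log([...sample].length); // 9

// Index 2 is the high surrogate of the second emoji, index 3 its low surrogate.
console.log(sample.charCodeAt(2).toString(16)); // d83d
console.log(sample.charCodeAt(3).toString(16)); // de04
```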
The results of form submission from http://localhost:3000/unicode using the code in the above pull request are as follows:
GET /unicode
POST /unicode/substring
Execution Result - Response JSON
Reference: unicode-substring - npm
{
  body: {
    substring: "😄😄Emoji😄😄",
    unicodeSubstring: "😄😄Emoji😄😄",
    start: "0",
    end: "3"
  },
  formatted: {
    substring: "😄�",
    unicodeSubstring: "😄😄E"
  }
}
unicode-substring/index.js is short, so if you have time, reading through it may deepen your understanding of Unicode.
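For comparison while reading, here is a minimal sketch of one way a code-point-aware substring could be written using only built-in features. It is not the actual unicode-substring source, and codePointSubstring is a hypothetical name. It also relies on Array.prototype.slice, so negative or swapped indices behave differently from String.prototype.substring().

```javascript
// Hypothetical helper, not the unicode-substring implementation.
function codePointSubstring(str, start, end) {
  // Array.from uses the string iterator, which yields whole code points,
  // so surrogate pairs are never split.
  const codePoints = Array.from(str);
  return codePoints.slice(start, end).join("");
}

console.log(codePointSubstring("😄😄Emoji😄😄", 0, 3)); // 😄😄E
```

Building an array of every code point is simple but does extra work on long strings, which is one reason to look at how the library handles it.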
That’s all from the Gemba.