[JavaScript] Use unicode-substring for Extracting Parts of Strings Containing Unicode

Tadashi Shigeoka ·  Thu, November 7, 2019

This post shows, with a failing case and sample code, why you should use unicode-substring instead of String.prototype.substring() when extracting part of a string that contains Unicode characters such as emoji in JavaScript and Node.js.


Background: substring + encodeURI = URIError: malformed URI sequence

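String.prototype.substring() indexes a string by UTF-16 code unit, so it can cut a surrogate pair in half. If the extracted string ends up containing a lone (unpaired) surrogate and you pass it to encodeURI, you’ll get a URIError: malformed URI sequence error (Node.js words the same error as URIError: URI malformed).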
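The failure can be reproduced without any form input. The following is a minimal sketch, assuming a Node.js runtime; the helper name tryEncode is hypothetical and exists only for this illustration.

```javascript
// Minimal reproduction: substring() cuts the second emoji in half,
// leaving a lone high surrogate that encodeURI cannot encode.
const text = "😄😄Emoji😄😄";
const cut = text.substring(0, 3); // "😄" followed by the lone high surrogate \uD83D

// tryEncode is a hypothetical helper used only for this illustration.
function tryEncode(value) {
  try {
    return { ok: true, result: encodeURI(value) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

console.log(tryEncode(cut));                  // ok is false, error is "URIError"
console.log(tryEncode(text.substring(0, 4))); // ok is true, both emoji are intact
```

Cutting at index 4 instead of 3 happens to work only because it lands on a surrogate pair boundary, which is not something you can rely on with arbitrary user input.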

Use unicode-substring Instead of substring

A senior engineer colleague advised me: “Unicode specifications are complex, so it’s better not to handle them with String.prototype.substring().”

Reference: FAQ - UTF-8, UTF-16, UTF-32 & BOM

unicode-substring

Since unicode-substring was recommended as a replacement for String.prototype.substring(), I tried it out immediately.

Node.js REPL Execution Example: substring vs unicode-substring

const unicodeSubstring = require('unicode-substring');
const string = "😄😄Emoji😄😄";

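// str.substring(indexStart[, indexEnd]) counts UTF-16 code units,
// so index 3 falls in the middle of the second emoji.
console.log(string.substring(0, 3)); // 😄�

// unicodeSubstring(string, start, end) treats each emoji as one character.
console.log(unicodeSubstring(string, 0, 3)); // 😄😄E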
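The difference comes down to what each index counts. The following sketch uses only built-in string methods (no unicode-substring required) to show how the same string looks in UTF-16 code units and in code points; the variable names are chosen for illustration.

```javascript
const sample = "😄😄Emoji😄😄";

// length counts UTF-16 code units: each 😄 takes two of them.
console.log(sample.length); // 13

// Spreading a string iterates by code point, so each emoji is one element.
const codePoints = [...sample];
console.log(codePoints.length); // 9

// Code unit index 2 is the high surrogate of the second emoji.
console.log(sample.charCodeAt(2).toString(16));  // d83d
// codePointAt reads the full pair starting at that index.
console.log(sample.codePointAt(2).toString(16)); // 1f604
```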

substring vs unicode-substring Sample Code

Compare String.prototype.substring() vs unicode-substring (sample codes) · Pull Request #5 · codenote-net/expressjs-sandbox

Submitting the form at http://localhost:3000/unicode with the code from the pull request above produces the following:

Figure: GET /unicode (unicode-substring input form)

Figure: POST /unicode/substring (unicode-substring response JSON)

Execution Result - Response JSON

{
    body: {
        substring: "😄😄Emoji😄😄",
        unicodeSubstring: "😄😄Emoji😄😄",
        start: "0",
        end: "3"
    },
    formatted: {
        substring: "😄�",
        unicodeSubstring: "😄😄E"
    }
}

Code Reading: unicode-substring

unicode-substring/index.js is short, so if you have time, reading through it may deepen your understanding of how Unicode strings are handled.
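As a warm-up before reading the library, here is a minimal sketch of the general idea of extracting by code point. This is not the unicode-substring implementation; codePointSubstring is a hypothetical helper, and it follows Array.prototype.slice semantics for its indices rather than those of substring.

```javascript
// Hypothetical helper, NOT the unicode-substring implementation.
// Array.from splits a string by code point, so a surrogate pair stays together.
function codePointSubstring(input, start, end) {
  const points = Array.from(input);
  // slice semantics: end is exclusive; omitting it means "to the end".
  return points.slice(start, end).join("");
}

console.log(codePointSubstring("😄😄Emoji😄😄", 0, 3)); // 😄😄E
console.log(codePointSubstring("😄😄Emoji😄😄", 2));    // Emoji😄😄

// The result contains no lone surrogate, so encodeURI accepts it.
console.log(encodeURI(codePointSubstring("😄😄Emoji😄😄", 0, 3)));
```

Note that this approach works on code points, not on user-perceived characters: emoji built from several code points (for example, sequences joined with a zero width joiner) can still be split.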

That’s all from the Gemba.