Bogado.net

Using unicode on Java

Unicode strings are notorious problematic. Java strings are encoded with UTF-16 for simpler usage in most cases, and yes that means that a char or Character in java occupy 16bits or bytes and that most of the Strings in java are almost double than the counterpart in C or C++. The advantage is that you don’t need O(n) to find the n<sup>th</sup> valid position on a String.

Reversing a string

The problem of reversing a string is a good on to investigate some of the techniques of iterating over an string. We will start with a couple of "bad implementations" and also a few good ones. The effects of selecting the wrong API range from really bad, that starts breaking with something as similar to English as Spanish or french to something that requires a less common languages and unfortunately also with the very popular emoji characters.

Using bytes, the bad approach

If you come from a C or C++ background you could easily fall into the trap of thinking that bytes are as good as char when iterating a string, but if you read before you can start to see how this will break. One could even imagine that this would break spectacularly, if char have two bytes then one possibility, when iterating through a string, is to get zeros interleaving the normal characters.

But the APIs in java are designed to hide the internal representation from the user, so all the methods on the String class that inspect the internal in bytes will do some kind of conversion. But what conversion? If you don’t specify a charset, java will use the default charset, defined by the locale of the underling OS. The end result is that a reverse string using bytes will sort of work, but break as soon as you leave ascii land, for most modern OSes uses UTF-8 as a default charset.


    // Bad implementation
    private String reverseStringBad(String s) {
        byte[] result = s.getBytes();
        for (int i = 0; i < result.length/2; i++) {
            int endPos     = result.length - 1 - i;
            byte tmp = result[i];
            result[i] = result[endPos];
            result[endPos] = tmp;
        }
        return new String(result);
    }

with the folloing results:

    Reverse bad: "abc"="cba" |10000× mean time 41ms
    Reverse bad: "a"="a" |10000× mean time 3ms
    Reverse bad: "Simple"="elpmiS" |10000× mean time 5ms
    Reverse bad: "áéíóú"="�óíéá�" |10000× mean time 42ms
    Reverse bad: "😀😁🙀"="�����"…"�����" len(12) |10000× mean time 77ms
    Reverse bad: "abcAA"…"AAabc" len(4102)="cbaAA"…"AAcba" len(4102) |10000× mean time 341ms
    Reverse bad: "😀😁🙀🙀🙀"…"🙀🙀😀😁🙀" len(4108)="�����"…"�����" len(8216) |10000× mean time 2180ms

Using chars, the naive approach

I consider this the most natural way of iterating through a string, and luckly even though this is still broken it will resist more inputs, resulting correct values to all the unicode characters that require just one 16bit word. This makes that all the test cases above work but the last.


    // This implementation works better than the verification
    // But it still incorrect when you take unicode in consideration.
    private String reverseStringNaive(String s) {
        StringBuilder b = new StringBuilder();
        for (char c: s.toCharArray()) {
            b.insert(0, c);
        }
        return b.toString();
    }

This give us the following results:

  Reverse naive: "abc"="cba" |10000× mean time 6ms
  Reverse naive: "a"="a" |10000× mean time 1ms
  Reverse naive: "Simple"="elpmiS" |10000× mean time 12ms
  Reverse naive: "áéíóú"="úóíéá" |10000× mean time 4ms
  Reverse naive: "😀😁🙀"="?😁😀?" |10000× mean time 3ms
  Reverse naive: "abcAA"…"AAabc" len(4102)="cbaAA"…"AAcba" len(4102) |10000× mean time 10251ms
  Reverse naive: "😀😁🙀🙀🙀"…"🙀🙀😀😁🙀" len(4108)="?😁😀🙀🙀"…"🙀🙀😁😀?" len(4108) |10000× mean time 10176ms

Using codePoints, the good approach

Java gives you some tools to query positions and sizes of different characters. This is the correct way of dealing with unicode strings. It may not be the most straight forward way, since you will have to remember to take into account that some of the codePoints will advance more than one index in your iteration.

To do that we need to use the Character static API charCount. Also for some reason there’s no insertCodePoint in the StringBuilder class, but there’s an appendCodePoint. So to insert code points into arbritary positions you need to build a char array by using another Character static method, toChars.


    // This is the simplest I could get with fun.
    private String reverseStringGood(String s) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < s.length();) {
            int codePoint = s.codePointAt(i);
            result.insert(0, Character.toChars(codePoint));
            i += Character.charCount(codePoint);
        }
        return result.toString();
    }

This give us the following results:

   Reverse Good: "abc"="cba" |10000× mean time 9ms
   Reverse Good: "a"="a" |10000× mean time 2ms
   Reverse Good: "Simple"="elpmiS" |10000× mean time 7ms
   Reverse Good: "áéíóú"="úóíéá" |10000× mean time 6ms
   Reverse Good: "😀😁🙀"="🙀😁😀" |10000× mean time 6ms
   Reverse Good: "abcAA"…"AAabc" len(4102)="cbaAA"…"AAcba" len(4102) |10000× mean time 10984ms
   Reverse Good: "😀😁🙀🙀🙀"…"🙀🙀😀😁🙀" len(4108)="🙀😁😀🙀🙀"…"🙀🙀🙀😁😀" len(4108) |10000× mean time 5611ms

Using StringBuilder, the no fun approach

Turns out that the StringBuilder class can reverse the internal representation. This is probably the best possible way of doing this particular challenge, since it is implemented by the default library it should be correct, otherwise file a bug report with Oracle. It should also be optimized.

The takeaway here is, unless you’re trying to amuse yourself aways use the provided APIs. It will create simpler code and will minize the possibility of introducing bugs.


    // This is correct and faster but less fun. ;-)
    private String reverseStringNoFun(String s) {
        return new StringBuffer(s).reverse().toString();
    }

This give us the following results:

 Reverse No fun: "abc"="cba" |10000× mean time 4ms
 Reverse No fun: "a"="a" |10000× mean time 1ms
 Reverse No fun: "Simple"="elpmiS" |10000× mean time 1ms
 Reverse No fun: "áéíóú"="úóíéá" |10000× mean time 1ms
 Reverse No fun: "😀😁🙀"="🙀😁😀" |10000× mean time 4ms
 Reverse No fun: "abcAA"…"AAabc" len(4102)="cbaAA"…"AAcba" len(4102) |10000× mean time 110ms
 Reverse No fun: "😀😁🙀🙀🙀"…"🙀🙀😀😁🙀" len(4108)="🙀😁😀🙀🙀"…"🙀🙀🙀😁😀" len(4108) |10000× mean time 133ms

Other considerations

Some other stuff that can bite you when using unicode.

length

The length method will return the number of chars on the string, this is different than the number of unicode code points. If you need to figure out the number of codepoints you will need to use codePointCount(start, end) instead. Notice that this operation is O(n) and that the indexes might cut code points in the middle, those will count as '1'.

indexing

The naive indexing will also fail you as the index parameters will never be in codepoints. To solve this you can use the method offsetByCodePoint(start_index, codePointOffset). The documentation on that method was a little bit confusing for on the first time I read it, but the easy way to think about this method is that it gives you the index offset, from the starting point, of codePointOffset code points. This is also a O(n) operation.

For example suppose you want to get the 10 last code points of a string. If you know for sure that the string has at least 20 characters you would do the following calculation : str.offsetByCodePoint(str.length() - 20, codePointCount(str.length() - 20, str.length()) - 10). In english: count the number of code points on the 20 last characters in the string, with that get the offset of that count less 10 code points, starting in the same position. Since you need 10 they will certainlly be on the last 20 characters of the string, as each code point takes up to 2 characters.

Other problems

I didn’t cover some other problems that unicode can bring you. There are characters that have more than one representation, and characters that get combined with the next one. So even all of those implementations are also naive. If the strings being manipulated are sensitive in any way, you need to take care of that also.