UTF-8

UTF-8 is self-synchronizing (self-segregating) by design, which makes it robust against occasional errors. In an encoding such as GB18030, a single missing byte can shift the character boundaries, so that the rest of the string becomes broken and unreadable. In UTF-8, a bad or missing byte damages only the character it belongs to: lead bytes and continuation bytes use disjoint value ranges, so a decoder can pick up again at the next byte that starts a character.
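As a rough illustration of that recovery step, here is a minimal sketch in C. The helper names `utf8_is_continuation` and `utf8_resync` are made up for this example and are not part of any standard library; the only fact relied on is that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: starting at pos, skip any continuation bytes and
   return the index of the next byte that can begin a character, or len
   if the end of the buffer is reached first. */
static size_t utf8_resync(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    while (pos < len && utf8_is_continuation(s[pos]))
        pos++;
    return pos;
}
```

For instance, if the lead byte 0xC3 of "é" is lost from "héllo", the remaining bytes are 68 A9 6C 6C 6F. A decoder that hits the stray A9 at index 1 can call `utf8_resync` and continue at index 2, so only that one character is lost.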

For programmers, self-synchronization can mean more than just robustness, for example:

We know that, generally speaking, strstr cannot be used safely on strings in multi-byte encodings, because the final byte of one character and the first byte of the next can together happen to match the needle. We have to either convert the strings to wchar_t and then use wcsstr, or use a more complicated substring search that takes care of multi-byte characters (Microsoft’s _mbsstr, for example).
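A closely related failure of the same kind is easy to reproduce: a match that begins in the middle of a character. The sketch below uses Shift_JIS rather than GB18030, on the assumption that Shift_JIS encodes the kanji U+8868 as the two bytes 0x95 0x5C, where 0x5C is also the ASCII backslash. The helper `byte_offset_of` and the array `sjis_text` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the first match of needle in
   haystack according to strstr, or -1 if there is none. */
static long byte_offset_of(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}

/* Assumption: one Shift_JIS character (U+8868), encoded as 0x95 0x5C. */
static const char sjis_text[] = { (char)0x95, 0x5C, 0 };
```

Here `byte_offset_of(sjis_text, "\\")` reports a match at offset 1, even though the text contains no backslash character at all: strstr has matched the trail byte of a two-byte character.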

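However, for UTF-8 strings, strstr is safe and works as expected, so long as both arguments are valid UTF-8. The reason is not difficult to see: a valid needle begins with an ASCII byte or a lead byte, and those values never occur as continuation bytes, so a match can only begin at a character boundary in the haystack. It also ends on one, because the lead byte of the last matched character fixes that character's length.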
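A small sketch of this in C, with the non-ASCII characters written as explicit byte escapes so the example does not depend on the source file's encoding. The wrapper name `utf8_find` is made up for illustration; it does nothing beyond calling strstr, and it assumes without checking that both arguments are valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: byte offset of needle in haystack, or -1.
   Assumes (and does not verify) that both strings are valid UTF-8. */
static long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}

/* "naïve café": ï is C3 AF, é is C3 A9 (same lead byte, different tail). */
static const char utf8_text[] = "na" "\xC3\xAF" "ve caf" "\xC3\xA9";
```

With this text, `utf8_find(utf8_text, "\xC3\xA9")` returns 10, the byte offset where "é" starts, and it is not fooled by "ï" at offset 2 even though both characters share the lead byte C3. Note that the result is a byte offset, not a character index.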
