UTF-8 is self-synchronizing (self-segregating) by design, which makes it very robust against occasional errors. If a single byte goes missing from a string encoded in GB18030, the whole rest of the string can become broken and unreadable. In UTF-8, by contrast, a bad or missing byte generally damages only the character it belongs to, because a decoder can pick up again at the next lead byte.
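As a rough illustration of that recovery step, here is a minimal sketch in C. The helper names utf8_is_continuation and utf8_resync are made up for this example and are not part of any standard library. It relies only on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so a decoder that lands in the middle of a character can skip forward to the next boundary.

```c
#include <stddef.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: starting at byte offset pos in buf (len bytes long),
 * skip any continuation bytes and return the offset of the next byte that
 * can start a character (an ASCII byte or a lead byte), or len if none.
 */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && utf8_is_continuation(buf[pos]))
        pos++;
    return pos;
}
```

For the bytes E4 B8 AD 61 (the character U+4E2D followed by "a"), calling utf8_resync from offset 1 skips the two continuation bytes and returns offset 3, the start of "a".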
For programmers, self-synchronization can mean more than just robustness, for example:
We know that, generally speaking, strstr cannot be used for strings in multi-byte encodings (the final byte of one character and the first byte of the next can happen to match the needle). We have to either convert them to wchar_t strings and then use wcsstr, or use a more complicated substring search algorithm that takes care of multi-byte characters (Microsoft's _mbsstr, for example).
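However, for UTF-8 strings, strstr is safe and works as expected, so long as both arguments are valid UTF-8. The reason is not difficult to figure out: lead bytes and continuation bytes occupy disjoint ranges, so a valid UTF-8 needle, which always begins with an ASCII byte or a lead byte, can only match at a character boundary in the haystack.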
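A small sketch of what this looks like in practice, assuming both strings are valid, NUL-terminated UTF-8. The wrapper name utf8_find is hypothetical; the only real library call is the standard strstr.

```c
#include <string.h>

/*
 * Hypothetical wrapper around strstr: returns the byte offset of needle
 * in haystack, or -1 if it does not occur. Both arguments are assumed to
 * be valid, NUL-terminated UTF-8; no validation is done here.
 */
static long utf8_find(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

With the haystack "na\xC3\xAFve caf\xC3\xA9" (the UTF-8 bytes of "naïve café"), searching for "caf\xC3\xA9" gives byte offset 7, and searching for "\xC3\xA8" ("è", which does not occur) gives -1. Note that the result is a byte offset, not a character index.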