<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>chys&#039;s random notes &#187; Unicode</title>
	<atom:link href="http://en.chys.info/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://en.chys.info</link>
	<description>Study more problems; Talk less of isms.</description>
	<lastBuildDate>Thu, 06 Sep 2012 12:32:20 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>UTF-8</title>
		<link>http://en.chys.info/2009/03/utf-8/</link>
		<comments>http://en.chys.info/2009/03/utf-8/#comments</comments>
		<pubDate>Thu, 05 Mar 2009 10:49:27 +0000</pubDate>
		<dc:creator>chys</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://blog.chys.info/?p=299</guid>
		<description><![CDATA[UTF-8 is known for being self-synchronizing (self-segregating) by design. Therefore it is very robust against occasional errors. If one byte is accidentally missing in a string encoded in GB18030, it can happen that the whole string becomes broken and unreadable. However, for UTF-8, any bad byte breaks only one character. For programmers, self-synchronization can mean [...]<hr/>
Related posts:<ol>
<li><a href='http://en.chys.info/2009/06/wprintfs/' rel='bookmark' title='wprintf(&#8220;%s&#8221;,&#8230;)'>wprintf(&#8220;%s&#8221;,&#8230;)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8</a> is self-synchronizing (self-segregating) by design, which makes it very robust against occasional errors. If one byte accidentally goes missing from a string encoded in <a href="http://en.wikipedia.org/wiki/GB_18030">GB18030</a>, everything after it can be decoded at the wrong byte offsets, so the whole string may become broken and unreadable. In UTF-8, however, the damage from a bad or missing byte stays confined to the affected character: a decoder can tell lead bytes from continuation bytes and resume at the next character boundary.</p>
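<p>The resynchronization step can be sketched in a few lines of C. This is only an illustration: <code>utf8_is_continuation</code> and <code>utf8_resync</code> are hypothetical helper names, not part of any standard library, and the sketch assumes the damaged text sits in a plain byte buffer.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: starting at offset pos in buf (len bytes long),
 * skip forward over continuation bytes and return the offset of the
 * next byte that can begin a character, or len if there is none.
 */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL);
    while (pos < len && utf8_is_continuation(buf[pos]))
        pos++;
    return pos;
}
```

<p>If the lead byte of a two-byte character is lost, the stray continuation byte it leaves behind is skipped and decoding resumes at the very next character.</p>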
<p>For programmers, self-synchronization can mean more than just robustness, for example:</p>
<p>Generally speaking, <code><a href="http://www.cplusplus.com/reference/clibrary/cstring/strstr.html">strstr</a></code> cannot be used on strings in multi-byte encodings, because the final byte of one character and the first byte of the next can happen to match the needle. We have to either convert the strings to <code>wchar_t</code> strings and use <code>wcsstr</code>, or use a more complicated substring search algorithm that takes care of multi-byte characters (Microsoft&#8217;s <code>_mbsstr</code>, for example).</p>
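<p>A small sketch of such a false match, using GB2312/GBK bytes written as hex escapes so that the source file's own encoding does not matter. The helper <code>byte_offset_of</code> and the two sample constants are made up for this illustration.</p>

```c
#include <assert.h>
#include <string.h>

/* Byte offset of the first strstr() match, or -1 if there is none. */
static long byte_offset_of(const char *haystack, const char *needle)
{
    const char *hit;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}

/* Two copies of the character U+554A, which is B0 A1 in GB2312/GBK. */
static const char GBK_HAYSTACK[] = "\xB0\xA1\xB0\xA1";

/* The left double quotation mark U+201C, which is A1 B0 in GB2312/GBK. */
static const char GBK_NEEDLE[] = "\xA1\xB0";
```

<p>Here <code>byte_offset_of(GBK_HAYSTACK, GBK_NEEDLE)</code> reports a match at byte offset 1, which is the second byte of the first character and the first byte of the second one, even though the haystack contains no quotation mark.</p>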
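<p>For UTF-8 strings, however, <code>strstr</code> is safe and works as expected, so long as both arguments are valid UTF-8. The reason is not difficult to figure out: lead bytes and continuation bytes occupy disjoint ranges, so a needle that begins with an ASCII or lead byte can only match at a character boundary in the haystack, and because the lead byte fixes the length of its character, the match also ends on a boundary.</p>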
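<p>The same kind of search on UTF-8 input can be sketched as follows. The names <code>utf8_starts_character</code>, <code>utf8_count</code> and <code>UTF8_SAMPLE</code> are hypothetical, and the code assumes that both strings are valid UTF-8 and that the needle is not empty.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if b can begin a UTF-8 character (it is not 10xxxxxx). */
static int utf8_starts_character(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/*
 * Counts non-overlapping occurrences of needle in haystack with plain
 * strstr(). Assumes both are valid UTF-8 and needle is non-empty.
 */
static size_t utf8_count(const char *haystack, const char *needle)
{
    size_t count = 0;
    size_t needle_len;
    const char *p;

    assert(haystack != NULL && needle != NULL && needle[0] != '\0');
    needle_len = strlen(needle);
    for (p = strstr(haystack, needle); p != NULL;
         p = strstr(p + needle_len, needle)) {
        /* With valid UTF-8 inputs every hit begins a character. */
        assert(utf8_starts_character((unsigned char)*p));
        count++;
    }
    return count;
}

/* "naive cafe" with U+00EF (C3 AF) and U+00E9 (C3 A9), as hex escapes. */
static const char UTF8_SAMPLE[] = "na\xC3\xAFve caf\xC3\xA9";
```

<p>Although both accented letters share the lead byte <code>C3</code>, searching <code>UTF8_SAMPLE</code> for <code>"\xC3\xA9"</code> finds only the final character, and the boundary assertion inside the loop never fires.</p>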
<hr/><p>Related posts:<ol>
<li><a href='http://en.chys.info/2009/06/wprintfs/' rel='bookmark' title='wprintf(&#8220;%s&#8221;,&#8230;)'>wprintf(&#8220;%s&#8221;,&#8230;)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://en.chys.info/2009/03/utf-8/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
