Posts Tagged ‘C/C++’

wprintf(“%s”,…)

Microsoft and GNU interpret %s differently in the wide-string versions of the printf-family functions (wprintf, etc.).

Microsoft: “when used with wprintf functions, specifies a wide-character string.”

C99 and GNU: “If no l modifier is present: The const char * argument is expected to be a pointer to an array of character type (pointer to a string).”

Fortunately, both accept “%ls” for wide strings.

Unfortunately, the only format specifier that C99 supports for multi-byte (narrow) strings is “%s”, which Microsoft interprets differently.

Fortunately, the specifier that Microsoft recommends for multi-byte strings, “%hs”, is also accepted by many other C libraries, though undocumented. Such acceptance is very reasonable: the unknown prefix h is simply ignored. (I tested it with the GNU and Solaris C libraries.) It seems such acceptance is necessary in order to strictly conform to the wording of C99.

Specifier   Microsoft wprintf   GNU wprintf             C99
%s          Wide                Narrow                  Narrow
%S          Narrow              Wide (deprecated)       -
%hs         Narrow              Narrow (undocumented)   -
%ls         Wide                Wide                    Wide

To draw a conclusion:

  1. Everybody agrees that, in wprintf, “%ls” specifies a wide string. (I’m not sure whether VC6 supports it.)
  2. There is no consensus on the specifier for multi-byte strings. The best practical choice is “%hs”.

This table and conclusion also apply to the “%c” family.
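A minimal sketch of the two "safe" choices from the conclusion, written against swprintf so the result can be inspected. The function names show_wide and show_narrow are made up for illustration. Since “%hs” is not guaranteed by C99, show_narrow treats a failed conversion as a possible outcome and returns an empty string in that case.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Formats a wide string with "%ls", which every library in the table accepts.
std::wstring show_wide(const wchar_t *s)
{
    wchar_t buf[128];
    int n = std::swprintf(buf, sizeof buf / sizeof buf[0], L"[%ls]", s);
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}

// Formats a narrow string with "%hs". This is the Microsoft-recommended
// specifier; other libraries accept it only as an undocumented extension,
// so an empty result (conversion rejected) has to be tolerated.
std::wstring show_narrow(const char *s)
{
    wchar_t buf[128];
    int n = std::swprintf(buf, sizeof buf / sizeof buf[0], L"[%hs]", s);
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}
```

Calling show_wide(L"Hello") should give L"[Hello]" everywhere; show_narrow("world") gives L"[world]" on the libraries that accept “%hs”.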


I hate the “c…” headers

What’s the reason for using <cstdio> instead of <stdio.h>? Merely to appear more standard-compliant?

Framers of the C++ standard probably wished to “clean” the global namespace by pulling everything into std. Unfortunately, many implementations (Microsoft, GNU, etc.) instead put all those symbols in both the global and std namespaces, rendering this argument invalid in practice.

Even more unfortunately, a few other well-known implementations (e.g. Solaris) actually followed the standards.

I actually lost some points in a course for exactly this reason: the TA failed to compile on Solaris a program of mine that compiled fine on Linux. In that program I included <cstdio> but forgot to prepend std:: to two printf calls. Since then, I have always used <name.h> rather than <cname>, even though the former is “deprecated.”

To write strictly conforming programs, we need to remember which symbols are macros and which are not. The C++ standard lists the symbols that are macros:

[Note: the names defined as macros in C include the following: assert, errno, offsetof, setjmp, va_arg, va_end, and va_start. -end note]

They listed the rarely used setjmp here, but omitted three very important ones that the C standard says should be macros: stdin, stdout and stderr. Let’s look into the header stdio.h provided by glibc:

/* Standard streams. */
extern struct _IO_FILE *stdin; /* Standard input stream. */
extern struct _IO_FILE *stdout; /* Standard output stream. */
extern struct _IO_FILE *stderr; /* Standard error output stream. */
/* C89/C99 say they're macros. Make them happy. */
#define stdin stdin
#define stdout stdout
#define stderr stderr

They’re not in std either.
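A small sketch of the style this implies when <cstdio> is used anyway: functions are qualified with std::, while stdout is left unqualified because it is a macro. The function name print_sum is made up for illustration.

```cpp
#include <cstdio>

// Prints a + b followed by a newline and returns what std::fprintf returns:
// the number of characters written, or a negative value on error.
int print_sum(int a, int b)
{
    // std::fprintf is a function, so it lives in namespace std.
    // stdout is a macro per the C standard, so std::stdout would be wrong.
    return std::fprintf(stdout, "%d\n", a + b);
}
```

This form should compile both on implementations that also leak the names into the global namespace and on those that do not.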


C++0x in ICC 11

Intel is advancing faster than GNU in supporting the upcoming C++0x. The two features I look forward to most, auto declarations and lambda functions, are both supported satisfactorily. Lambda functions that use variables defined outside (output_and_sum in the following demo) are supported as well as simpler ones (add in the following demo).

#include <cstdio>
#include <algorithm>
using namespace std;
 
int main()
{
    static const int myvec[] = { 1, 2, 3, 4, 5 };
 
    int sum = 0;
    auto add = [] (int a, int b) -> int { return a+b; };
    auto output_and_sum = [&sum,add](int x) { sum  = add(sum,x); printf ("%d\n", x); };
    for_each (myvec, myvec+5, output_and_sum);
    printf ("sum = %d\n", sum);
}

(I use printf instead of cout merely to make the assembly list more readable.)

ICC compiles this program (the option -std=c++0x is necessary, of course) and gives the correct results:

1
2
3
4
5
sum = 15

Also the two lambda functions are well inlined and optimized – variable sum, which is used by the lambda function by reference, is completely stored in a register, though it is easy to find several useless instructions in the assembly dump.

Some other features that are less important in my view (e.g. initializer_list) are not implemented yet. (BTW, I really had a hard time finding the “request noncommercial free license” link on Intel’s website, while I could see the link to “buy a commercial license” everywhere.)

Hopefully C++0x will be more successful than C99, which is filled with ugly and nobody-wants-to-implement features.


Every C programmer should learn some assembly

I am more convinced of this now.

One of the most frequently asked questions in C is the difference between a pointer and an array. A newbie in C often finds it “mission impossible” to differentiate between the following four variable types:
char p1[][8] = { "Hello", "world" };
char *p2[8] = { "Hello", "world" };
char (*p3)[8] = p1;
char **p4 = p2;

And it really is difficult to explain clearly in a few words. However, if one knows some assembly, one can check the assembly listing generated by a compiler, and at least the difference between p1 and p2 should be straightforward:

p1:
    .string "Hello"
    .zero 2
    .string "world"
    .zero 2
.LC0:
    .string "Hello"
.LC1:
    .string "world"
p2:
    .long .LC0
    .long .LC1
p3:
    .long p1
p4:
    .long p2

(I prefer the AT&T-style assembly)
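The same difference can also be observed from within the language, without reading assembly. The sketch below repeats the four declarations (with const added so that they are valid C++) and measures how many bytes each object occupies and how far one increment of p3 and p4 moves. The helper names stride_p3 and stride_p4 are made up for illustration.

```cpp
#include <cstddef>

char p1[][8] = { "Hello", "world" };       // 2 arrays of 8 chars; the text is stored in p1 itself
const char *p2[8] = { "Hello", "world" };  // 8 pointers; the text is stored elsewhere
char (*p3)[8] = p1;                        // one pointer to an array of 8 chars
const char **p4 = p2;                      // one pointer to a pointer

static_assert(sizeof p1 == 2 * 8, "p1 holds the characters themselves");
static_assert(sizeof p2 == 8 * sizeof(const char *), "p2 holds only pointers");

// Bytes skipped by p3 + 1: a whole 8-char row of p1.
std::size_t stride_p3()
{
    return static_cast<std::size_t>(
        reinterpret_cast<char *>(p3 + 1) - reinterpret_cast<char *>(p3));
}

// Bytes skipped by p4 + 1: a single pointer slot of p2.
std::size_t stride_p4()
{
    return static_cast<std::size_t>(
        reinterpret_cast<const char *>(p4 + 1) - reinterpret_cast<const char *>(p4));
}
```

stride_p3() is 8 on any platform, while stride_p4() equals sizeof(const char *), matching the .string and .long entries in the listing above.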

I feel so lucky that I had learned some of the assembly used on the NES before starting C. So for me a “pointer” has always been a very natural concept, and clearly different from an array. Many poor freshmen had to begin with C++ without any knowledge of assembly or C or even any other language. I would have gone crazy had I been in such a situation.


UTF-8

UTF-8 is known for being self-synchronizing (self-segregating) by design. Therefore it is very robust against occasional errors. If one byte is accidentally missing in a string encoded in GB18030, it can happen that the whole string becomes broken and unreadable. However, for UTF-8, any bad byte breaks only one character.

For programmers, self-synchronization can mean more than just robustness, for example:

We know that, generally speaking, strstr cannot be used for strings in multi-byte encodings (the final byte of one character and the first byte of the next can happen to match the needle) – we have to either convert them to wchar_t‘s and then use wcsstr, or use a more complicated substring search algorithm that takes care of multi-byte characters (Microsoft’s _mbsstr, for example).

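However, for UTF-8 strings, strstr is absolutely safe and works as expected, so long as the two parameters are both valid UTF-8. It is not difficult to see why: lead bytes and continuation bytes use disjoint value ranges, so a valid needle can only match at a character boundary.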
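A small sketch of both cases, using hand-written byte sequences so that it does not depend on the source file encoding. The GBK haystack consists of two two-byte characters, B0 C4 and E3 B0, and the needle is C4 E3 (“你”); the haystack does not contain that character, yet its bytes appear across the character boundary. The helper name find_bytes and the array names are made up for illustration.

```cpp
#include <cstring>

// Returns the byte offset of needle inside haystack, or -1 if it is absent.
long find_bytes(const char *haystack, const char *needle)
{
    const char *p = std::strstr(haystack, needle);
    return p ? static_cast<long>(p - haystack) : -1;
}

// GBK: two characters (B0 C4)(E3 B0); the needle C4 E3 straddles them.
const char gbk_haystack[] = "\xB0\xC4\xE3\xB0";
const char gbk_needle[]   = "\xC4\xE3";

// UTF-8: U+597D (E5 A5 BD) followed by U+4F60 (E4 BD A0); the needle is U+4F60.
const char utf8_haystack[] = "\xE5\xA5\xBD\xE4\xBD\xA0";
const char utf8_needle[]   = "\xE4\xBD\xA0";
```

find_bytes(gbk_haystack, gbk_needle) reports offset 1, a false match in the middle of a character, whereas find_bytes(utf8_haystack, utf8_needle) reports offset 3, exactly where the second character starts.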


Leap year bug crashes Zune

Microsoft’s 30GB Zune players fail to work today (Dec 31).

The problem has been identified: a bug in the Freescale firmware leads to an infinite loop on the last day of a leap year.

year = ORIGINYEAR; /* = 1980 */

while (days > 365)
{
   if (IsLeapYear(year))
   {
      if (days > 366)
      {
         days -= 366;
         year += 1;
      }
   }
   else
   {
      days -= 365;
      year += 1;
   }
}

If such poor code were found in an airplane or a medical device, the consequences would be terrible.
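The loop above hangs because, when days is exactly 366 in a leap year, neither branch changes days or year. One possible repair is sketched below. It is not the actual firmware fix: it assumes days is a 1-based day count starting at January 1, 1980, and the names is_leap_year, YearDay and split_days are made up for illustration.

```cpp
// Hypothetical stand-in for the firmware's IsLeapYear.
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

struct YearDay {
    int year;  // calendar year
    int day;   // 1-based day within that year
};

// Splits a 1-based day count since 1980-01-01 into (year, day of year).
YearDay split_days(int days)
{
    int year = 1980;
    for (;;) {
        const int length = is_leap_year(year) ? 366 : 365;
        if (days <= length)
            break;            // the remaining days fit in this year: done
        days -= length;       // otherwise every pass makes progress
        year += 1;
    }
    YearDay result = { year, days };
    return result;
}
```

With this version, day 10593 (December 31, 2008 under the assumption above) comes out as day 366 of 2008 instead of looping forever.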


GCC #pragma pack bug

#pragma pack is accepted by many C/C++ compilers as the de facto standard syntax for controlling the alignment of structure members.

However, there is an old bug in GCC that has been reported many times, first in 2002, and is still not fixed.

#include <cstdio>
using namespace std;

#pragma pack(1)
template <typename T>
struct A {
    char a;
    int b;
};
A<int> x;
#pragma pack(2)
A<char> y;
#pragma pack(4)
A<short> z;

int main()
{
    printf ("%d %d %d\n", (int) sizeof x, (int) sizeof y, (int) sizeof z);
    return 0;
}

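This gives 5 6 8 instead of the 5 5 5 we would expect: GCC apparently applies the packing in effect where the template is instantiated rather than where it is defined. (VC++ and ICC both give the more reasonable 5 5 5.)

This example is not too bad. What is worse, this bug can break programs that use the STL. Here is an example: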

a.cpp:

#include <cstdio>
#include <map>
using namespace std;

#pragma pack(1)
void foo (const map<short,const char*> &x)
{
    for (map<short,const char*>::const_iterator it=x.begin();
            it!=x.end(); ++it)
        printf ("%d %s\n", it->first, it->second);
}

b.cpp:

#include <map>
using namespace std;

void foo (const map<short,const char *> &);

int main()
{
    map<short, const char *> x;
    x[0] = "Hello";
    x[1] = "World";
    foo (x);
}

Compile a.cpp and b.cpp separately and link them together. This program segfaults if compiled with GCC, but works well with ICC or VC++.

In conclusion, for better portability and/or reliability, never use #pragma pack unless absolutely necessary. If really unavoidable, always push and pop immediately before and after the structure definition. (If the program is intended to be compiled by GCC/ICC only, it is better to use the more reliable GCC-specific __attribute__((__packed__)).)
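A minimal sketch of the push/pop pattern recommended above. The struct names PackedHeader and NormalHeader are made up for illustration, and the size remarks assume the usual case where int is aligned more strictly than char.

```cpp
// Packing is switched on immediately before the definition...
#pragma pack(push, 1)
struct PackedHeader {
    char tag;
    int  length;   // stored right after tag, with no padding in between
};
#pragma pack(pop)
// ...and restored immediately after it, so nothing else is affected.

struct NormalHeader {
    char tag;
    int  length;   // the compiler may insert padding before this member
};

static_assert(sizeof(PackedHeader) == sizeof(char) + sizeof(int),
              "packed struct has no padding");
```

Because the pragma is scoped to a single non-template struct, the packing in effect when some later template is instantiated no longer matters.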

PS. It seems Sun CC (shipped with Solaris) also has this bug. It fails for the first example here, but the second works well. I don’t know how it manages to align pair<short,const char *> correctly…


The Uncertainty Principle

If we define a global constant in C++:
const int x = 2;
The typical behavior of an optimizing compiler is to allocate memory for x only if its address is explicitly taken with an & operator or it is bound to a reference. (Otherwise constant propagation is sufficient.)

Now let’s assume we can only change the code, compile it and run it. Disassembly is not allowed. (That’s like physics: we can only do experiments and observe the results.) Then we are unable to find out whether x actually has an address. The only way we can detect it is to take its address and see whether there is a compile error. But this process creates an address for it. (In physics, when we are measuring the position and momentum of a particle, we are changing its position and/or momentum.)
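The two kinds of use can be sketched as follows. The function names twice_x and address_of_x are made up for illustration, and whether storage is really omitted in the first case depends on the compiler and its optimization settings.

```cpp
const int x = 2;

// Uses only the value of x: the compiler is free to fold it into the
// constant 2 and never allocate any memory for x.
int twice_x()
{
    return 2 * x;
}

// Takes the address of x: this is the "measurement". Once this function
// exists in the program, x must have storage for the pointer to refer to.
const int *address_of_x()
{
    return &x;
}
```

Either way, the observable behavior is the same: the value read through the pointer is 2, which is why the question cannot be settled from inside the program.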

The same goes for the default constructor, copy constructor, copy-assignment operator and destructor. Some books say that a class always has them (semantically); others say compilers create them on demand (as an implementation matter). Both are right: we cannot tell the difference from within a program.

Time (if any) before the Big Bang is irrelevant to our universe, so for a simpler model we can assert that time did not exist before the Big Bang. Likewise, for a simpler model we can assert (semantically) that a constant always has an address and that a class always has those four members.


C++0x draft finished

Only bug fixes, no new features, will be added to the standard from now on.

Good. It is almost certain it will be called C++ 2009.

As I see it, many new features will bring great convenience to programming. But… C++ will appear far more complicated and difficult, and perhaps unreasonable, to a newbie. It’s already complicated enough now…

Ten years ago C9x became C99, and now C++0x is becoming C++09. But there will be one certain difference: no major compiler completely implements C99 even today, whereas every actively developed C++ compiler seems to be planning to add C++09 support as soon as possible.

Reference
[1] September 2008 ISO C++ Standards Meeting: The Draft Has Landed, and a New Convener


SEEK_* Problem with MPICH2

This is a frequently encountered error when writing MPI programs in C++.

/usr/include/mpicxx.h:26:2: error: #error "SEEK_SET is #defined but must not be for the C++ binding of MPI"
/usr/include/mpicxx.h:30:2: error: #error "SEEK_CUR is #defined but must not be for the C++ binding of MPI"
/usr/include/mpicxx.h:35:2: error: #error "SEEK_END is #defined but must not be for the C++ binding of MPI"

As far as I remember (I am not very sure), it was MPICH2 that introduced this problem; it was not present in MPICH1.

This happens if <mpi.h> is included AFTER some system headers like <stdio.h> and <iostream>. The workaround is simple: always include <mpi.h> first in any C/C++ source that uses MPI APIs. (Another workaround is to pass -DMPICH_IGNORE_CXX_SEEK to mpicxx, but it seems this has other drawbacks. Do this only if you are unable to modify the sources.)

This certainly should be considered a bug in MPICH2: SEEK_SET, SEEK_CUR, and SEEK_END are macros required by the ANSI C standard, but MPICH2 has global variables with exactly the same names. MPICH2 has not changed this, although its developers have apparently been aware of the problem for a long time. (Otherwise they would not have introduced MPICH_IGNORE_CXX_SEEK. Perhaps they have binary compatibility concerns?)
