Unspecified Behaviour in C and C++

This article covers the various aspects of unspecified behaviour in the C and C++ programming languages. It is a sequel to earlier articles on undefined behaviour, published in OSFY.

Every programmer aims for code that compiles without errors. But code that is free of syntax errors can still fail at runtime, and even after such errors are fixed and the program is shipped to end users, bugs may remain that can be triggered at any time. Such bugs have many causes, including unrealistic schedules and incorrect designs. One cause is code that exhibits undefined behaviour. A programmer should also know about unspecified behaviour in order to avoid non-portable code. Unspecified behaviour in a programming language is behaviour that may vary between different implementations of that language. In this article, we will discuss unspecified behaviour in C and C++. We will also briefly discuss implementation-defined behaviour in C and C++, which is slightly different from unspecified behaviour.

But before proceeding any further, let us distinguish between unspecified behaviour and undefined behaviour. The latter has already been covered in earlier issues of OSFY. The C11 standard defines undefined behaviour as “…behaviour, upon the use of a non-portable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.” This definition is itself a mild warning against the use of constructs that might lead to undefined behaviour. Because the standard imposes no requirements, such code may fail to compile, may crash or misbehave at runtime, or may, by luck, even do what the programmer expected. So it is dangerous to have code with undefined behaviour in your programs. Integer division by zero and modifying an object more than once between two sequence points are examples of undefined behaviour.

Figure 1: Output of gcc, Clang, and tcc compilers

The C11 standard defines unspecified behaviour as “…behaviour where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance.” So, unlike code exhibiting undefined behaviour, which should be avoided at any cost, code exhibiting unspecified behaviour is permitted in a correct program. But the output of such code may not be the same for all compilers and all systems. As we proceed, we will see examples in which the same program on the same system gives different outputs when compiled with different compilers. A programmer writing code for a single system and a single compiler has less reason to worry about unspecified behaviour, although the phrase “in any instance” means that even one compiler is free to choose differently in different places. A programmer who wants portable code should worry about both undefined behaviour and unspecified behaviour.

Unlike code exhibiting undefined behaviour, code that exhibits unspecified behaviour is found in many C and C++ programs. Neither the programming language standard nor the compiler implementation manual is required to tell you which of the possible behaviours you will get. Still, code exhibiting unspecified behaviour is not banned outright. According to the C11 standard, the order of evaluation of sub-expressions, the order of evaluation of function arguments, the values of any padding bits in integer representations, and the order in which # and ## operations are evaluated during macro substitution are examples of unspecified behaviour in C.

For the C11 standard, I have referred to document N1570, the draft of ISO/IEC 9899:201x dated April 12, 2011, and for the C++14 standard, to document N4140 dated October 7, 2014. I encourage you to refer to these documents for clarifications and further details. Finally, regarding unspecified behaviour, the moral of the story is: “Not all programmers need to be afraid of unspecified behaviour; only those who want their programs to be portable need to worry.” Unfortunately, programmers who write code for just one compiler and one system are often considered not worth their salt. So it is better to think about portability even if you are writing a program just for your own amusement.

Figure 2: Differences between gcc and Clang compiler outputs

Now, let us work with some C programs to better understand unspecified behaviour. I compiled programs containing code fragments with unspecified behaviour using a number of compilers to check whether the output is the same across compilers. The results were mixed. For some programs, all the compilers produced the same output; for others, the output depended on the compiler even on a single system. I used three C compilers and two C++ compilers on Linux. The three C compilers are gcc, tcc, and Clang. The two C++ compilers are g++ and Clang++. The GNU Compiler Collection (GCC) provides the C compiler gcc and the C++ compiler g++. The LLVM compiler infrastructure provides the C compiler Clang and the C++ compiler Clang++. The Tiny C Compiler (tcc) is a C compiler created by Fabrice Bellard.

The first unspecified behaviour we are going to explore in detail is the order of evaluation of sub-expressions. What does the evaluation order of sub-expressions mean? Consider the C statement ‘a=b+c;’ where a, b, and c are integer variables. Neither the C nor the C++ standard specifies whether the value of b or the value of c is fetched first to perform the addition. In this particular case, the result does not depend on the evaluation order, but this is not always so. Consider the program pgm1.c shown below, in which a change in the evaluation order changes the output. The program pgm1.c and all the other programs discussed in this article can be downloaded from opensourceforu.com/article_source_code/Feb19unspecifiedbehaviourinC.zip.

#include<stdio.h>

int x=100;

int left( )
{
    x++;
    return x;
}

int right( )
{
    x=1;
    return x;
}

int main( )
{
    int y=left( )+right( );
    printf("\nSum = %d\n\n",y);
}

In the program pgm1.c, the line of code causing unspecified behaviour is int y=left( )+right( ); because the evaluation order of the two sub-expressions is unspecified. Here, the output depends on the evaluation order because both left( ) and right( ) modify the same global variable x. If left( ) is called first, the statement x++; changes x from 100 to 101, so left( ) returns 101. When right( ) is called afterwards, the statement x=1; sets x to 1, so right( ) returns 1. Thus, in this case, the output will be 102.

Now suppose right( ) is called first. The statement x=1; sets x to 1, so right( ) returns 1. When left( ) is called afterwards, the statement x++; changes x to 2, so left( ) returns 2. Thus, in this second case, the output will be 3. So, in this program, the output depends on the evaluation order. But when I tested the program with the three C compilers mentioned earlier, all of them called left( ) first and right( ) second, producing the output 102. Figure 1 shows the output from the gcc, Clang, and tcc compilers when the program pgm1.c is compiled.

I also renamed the program pgm1.c as pgm1.cc and tested it with the C++ compilers g++ and Clang++. The output was again 102, because left( ) was called first and right( ) second. But we have to keep two things in mind regarding the behaviour of this program. First, even though all the compilers tested here gave the same output, the program may behave differently with another, untested compiler. Second, since this is unspecified behaviour, the compiler implementation manual is not required to document the order either.

Now let us discuss code in which the order of evaluation of function arguments causes unspecified behaviour. Consider the program pgm2.c given below.

#include<stdio.h>

void fun(int i, int j)
{
    /* Empty Function */
}

int left( )
{
    printf("Man");
    return 0;
}

int right( )
{
    printf("Bat");
    return 0;
}

int main( )
{
    fun(left( ), right( ));
    printf("\n\n");
}

In this program, the line of code causing the unspecified behaviour is fun(left( ), right( )); because the evaluation order of the arguments to the function fun( ) is unspecified. Here the arguments to fun( ) are themselves calls to two functions, left( ) and right( ). If left( ) is called first and right( ) second, the message ManBat is printed on the screen; if right( ) is called first and left( ) second, the message BatMan is printed. Those who are familiar with DC Comics will know the difference between the two messages, because BatMan is a superhero and ManBat is a supervillain. Unlike pgm1.c, where all the compilers showed the same behaviour, in the case of pgm2.c the behaviour of the compilers was indeed different. Figure 2 shows the difference in the outputs of the gcc and Clang compilers.

Figure 3: Different numbers as output

From Figure 2 it is clear that the order of evaluation of function arguments is different for the gcc and Clang compilers, even on the same system. When the program pgm2.c is compiled with tcc, the behaviour is the same as that of Clang, with the message ‘ManBat’ printed on the screen. This program was also renamed as pgm2.cc and tested as a C++ program with g++ and Clang++. Here again the outputs were different. The output of g++ was the same as that of gcc, whereas the output of Clang++ was the same as that of Clang.

Now let us look at a more practical example where the evaluation order of function arguments again comes into play. Consider the program pgm3.c given below.

#include<stdio.h>

int x=333;

int fun(int i, int j)
{
    return i+j;
}

int left( )
{
    x=100;
    return x;
}

int right( )
{
    x++;
    return x;
}

int main( )
{
    printf("\nSum = %d\n", fun(left( ), right( )));
    return 0;
}

In this program, the sum printed is different when the program is compiled with gcc and with Clang. There are two reasons for this. First, the evaluation order of the function arguments in the line printf("\nSum = %d\n", fun(left( ), right( ))); is different for the two compilers. Second, both left( ) and right( ) modify the same global variable x. As an exercise, try to work out how the two compilers obtained their respective outputs by referring to the previous examples. Figure 3 shows the exact output obtained with the gcc and Clang compilers, but look at it only if you fail to work out the output for both compilers.

The program pgm3.c was also tested with the tcc compiler and the behaviour was similar to that of the Clang compiler. This program was also renamed into a C++ program as pgm3.cc and tested with the compilers g++ and Clang++. Just like the previous program, the output of g++ was the same as that of gcc and the output of Clang++ was similar to that of Clang.

Even though the title of the article says ‘Unspecified behaviour in C and C++’, you should not assume that C and C++ classify every construct in the same way. To better understand this point, consider the program pgm4.cc shown below, in which the same line of code int test = &x > &y; causes unspecified behaviour in C++ and undefined behaviour in C.

#include <stdio.h>

int main()
{
    int x;
    int y;
    int test = &x > &y;

    if(test==0)
    {
        printf("\ng++\n\n");
    }
    else
    {
        printf("\nclang++\n\n");
    }
}

In C11, a relational comparison (<, >, <=, >=) of pointers to objects is defined only if the pointers point to members of the same object or to elements of the same array; otherwise the behaviour is undefined. In C++14, the result of such a comparison between unrelated objects is merely unspecified. The program pgm4.cc was tested with the C++ compilers g++ and Clang++, and the output is shown in Figure 4.

Figure 4: A C++ example

As you can observe from Figure 4, the g++ and Clang++ compilers behave differently again. Based on this program, we must understand two important points. First, similar programming language constructs in C and C++ need not be defined similarly in the two standards. The example above shows a behaviour that is undefined in C and unspecified in C++. If you go through the C and C++ standards carefully, you will be able to identify many such differences. Second, behaviour that is undefined in one version of a language standard might be strictly defined in a later version. A careful comparison of the C99 and C11 standards will help you find many examples of such changes.

Implementation-defined behaviour

There is a sub-category of unspecified behaviour called implementation-defined behaviour. The C11 standard defines implementation-defined behaviour as “unspecified behaviour, where each implementation documents how the choice is made.” In both cases the language standard leaves the choice to the implementation; the difference is that, for implementation-defined behaviour, the compiler implementation manual must document the choice that was made. So the programmer has to work out unspecified behaviour unaided, but can consult the compiler implementation manual for implementation-defined behaviour. According to the C11 standard, the alternative manner in which the main function may be defined, the number of significant initial characters in an identifier, the number of bits in a byte, the value of a string literal containing a multi-byte character not represented in the execution character set, the accuracy of the floating-point operations, and the result of converting a pointer to an integer or vice versa are examples of implementation-defined behaviour.

I have mentioned only a few examples in each category discussed in this article; the lists are not exhaustive. There is a wide range of undefined, unspecified, and implementation-defined behaviour in C and C++. For the reasons stated earlier, you should avoid undefined behaviour altogether, and avoid depending on one particular outcome of unspecified or implementation-defined behaviour in code that is meant to be portable. My advice is that if your code looks crazier than usual, you should refer to the programming language standard to check whether it exhibits undefined, unspecified, or implementation-defined behaviour.