Understanding Undefined Behaviour in C

In computer programming, undefined behaviour is the result of executing code whose behaviour is not prescribed by the specification of the programming language in which it is written. This article will help you understand this behaviour with the help of a few case studies.

Many of you might have come across the concept of ‘undefined behaviour’ in the C language. At runtime, we sometimes see strange results instead of the expected output. Let us dig deep into the ‘C’ ocean and see what undefined behaviour really is, what causes it and how it can be taken care of.

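The C99 standard (section 3.4.3) defines undefined behaviour as: “behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements.” Among the cases of undefined behaviour that the standard lists (Annex J.2) is the one this article starts with: “Between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored.”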

In C FAQs this behaviour is defined as: “Anything at all can happen; the standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.”

In very simple words, modifying an object more than once between two sequence points is one common way to end up with undefined behaviour. For a C programmer, understanding undefined behaviour is very important for better coding and for programs that behave predictably and perform well, especially when it comes to embedded C coding in embedded system design.

Let us understand what a sequence point is in the C programming language with the help of an example.

Figure 1: Output of the code shown in program 1, compiled using gcc compiler
Figure 1a: Output of the code shown in program 1, run under an Intel based Linux platform

Sequence points in C

According to the C99 standard (section 6.5): “Between the previous and next sequence point an object shall have its stored value modified at most once by the evaluation of an expression. Furthermore, the prior value shall be read only to determine the value to be stored.”

Let us consider the example shown in code snippet-1 given below to understand the meaning of the above statement:

    int a = 10;
    a = a++ * a++;

In the second statement above, the variable ‘a’ is modified more than once (twice by the ++ operators and once by the assignment) with no sequence point in between. The nearest sequence points are at the ends of the full expressions, marked here by the statement terminators (;): one at the end of the declaration and the other at the end of the assignment statement.

Let us consider one more example, code snippet-2, to understand the meaning of sequence points:

    int i = 2;
    a[i++] = i++;

In the second statement of code snippet-2 given above (where ‘a’ is assumed to be an int array declared elsewhere), the variable ‘i’ is modified twice with no sequence point in between. Here too, the only sequence points are at the semicolons that end the two statements: one at the end of the declaration and the other at the end of the assignment.

After understanding the meaning of the sequence points through examples, let us understand the meaning of undefined behaviour with the help of case studies.

Figure 2: Assembly code generated for program 1, compiled using gcc compiler
Figure 3: Output of the code shown in program 1, compiled using the Clang compiler

Case study 1

Consider code snippet-1, for which the complete code is given in program-1 below:

    #include <stdio.h>

    int main(void)
    {
        int a = 10;
        a = a++ * a++;
        printf("The value of a: %d\n", a);

        return 0;
    }

In this program, the statement ‘a = a++ * a++;’ clearly modifies the value of ‘a’ more than once between the previous and the next sequence points. For experimental purposes, I compiled and ran the program with the following two compilers:

gcc compiler, version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)

Clang compiler, version 3.8.0-2ubuntu3~trusty4 (tags/RELEASE_380/final)

As Figure 1 shows, the gcc compiler generates a warning when the code is compiled with the -Wall option. The output obtained when the program is executed is shown in Figure 1a, and the generated assembly code is shown in Figure 2.

When the same code is compiled using the Clang compiler, the output obtained is shown in Figure 3 and the assembly code is shown in Figure 4.

Figure 4: Assembly code generated for the program 1, compiled using Clang compiler

As can be seen from Figures 1 and 3, both compilers generate warnings. Comparing Figures 2 and 4 also shows that the assembly code generated by the two compilers is different.

Now consider a variation, shown in program-2 below:

    #include <stdio.h>

    int main(void)
    {
        int a = 10;
        a = a++ + a++ + a++;
        printf("The value of a: %d\n", a);

        return 0;
    }

In program-2 above, the statement ‘a = a++ + a++ + a++;’ modifies the variable ‘a’ more than once between two sequence points, which leads to undefined behaviour. Compile the code with gcc using the -Wall option and see the warning generated by the compiler. Then compile the same code with different compilers, run it, compare the outputs and draw your own conclusions. This is left as an exercise for the reader.

Figure 5: Output of program 3
Figure 6: Assembly code generated for program 3, compiled using gcc compiler

Case study 2

Let us consider the example shown in program-3 below, which attempts to modify a string literal and therefore leads to undefined behaviour.

    #include <stdio.h>

    int main(void)
    {
        /* Initialize the pointer to the base address of the string */
        char *p = "string constant";

        /* Attempt to modify the character at 0th index */
        *(p + 0) = 'S';

        return 0;
    }

In program-3 above, we try to modify the string literal pointed to by the pointer ‘p’. The behaviour is undefined if a program attempts to modify any character in a string literal, so the standard imposes absolutely no requirements on what happens when this non-compliant code is compiled and run, and the compiler can do anything. In practice, modifying a string literal frequently results in an access violation because literals are typically stored in read-only memory.

The output obtained when compiled with gcc is shown in Figure 5. Some compilers issue a warning for such code, while others compile it silently, as gcc does here even with the -Wall option enabled.

Any attempt to modify a C string literal will lead to undefined behaviour.

Case study 3

Instance 1: According to the C standards, signed integer overflow is undefined behaviour. Some compilers may trap the overflow condition when compiled with some trap handling options while some simply ignore the overflow conditions, assuming that the overflow will never happen, and generate the code accordingly.

    #include <stdio.h>
    #include <limits.h>   /* for INT_MAX */

    int main(void)
    {
        /* Assigning the INT_MAX value to the 'x' variable */
        int x = INT_MAX;

        /* Checking if wraps around or not */
        if (x + 1 > x)
        {
            x++;
        }
        else
        {
            /* Error handling code */
        }

        return 0;
    }

In program-4 given above, we check whether adding the constant 1 to ‘x’, which holds INT_MAX, results in overflow. Since the behaviour in this case is undefined by the C standard, the compiler may optimise the ‘if’ condition away: it may simply replace the condition (x + 1 > x) with TRUE, assuming that (x + 1) is always greater than ‘x’. Alternatively, it may generate code that actually performs the addition and the comparison.

With the gcc compiler, the value simply wraps around and becomes negative when x = INT_MAX. The assembly code generated by the gcc compiler for program-4 is shown in Figure 6.

In Figure 6, Lines 129 to 133 make it clear that the gcc compiler generates code which wraps around and goes to the negative side.

Let us compile program-4 using the gcc compiler with the options -fstrict-overflow and -fno-strict-overflow, and see how the compiler generates the assembly codes.


Note 1: How to compile the code using options and generate the assembly code

1. To compile the code with -fno-strict-overflow and save the assembly code (each compilation overwrites a.out, so save it before compiling again):

gcc -g -fno-strict-overflow program_4.c

objdump -S a.out > 1_no_strict_overflow

2. To compile the code with -fstrict-overflow and save the assembly code:

gcc -g -fstrict-overflow program_4.c

objdump -S a.out > 1_strict_overflow

3. To compare both the files using the vimdiff command:

vimdiff 1_no_strict_overflow 1_strict_overflow

The output will be what’s shown in Figure 7.

Note 2: The -fstrict-overflow option allows the compiler to assume strict signed overflow rules, depending on the language being compiled. For C, this means that when doing arithmetic with signed numbers, the overflow is undefined by the C standards, so the compiler may assume that it does not happen. This permits various optimisations. For example, the compiler may assume that an expression like ‘x + 2 > x’ is always true for signed ‘x’. This assumption is only valid if signed overflow is undefined, as the expression is false if ‘x + 2’ overflows when using two’s complement arithmetic. When this option is in effect, any attempt to determine whether an operation on signed numbers overflows must be written carefully to not actually involve overflow. The -fno-strict-overflow option allows the compiler not to assume strict signed overflow rules.

Figure 7: Assembly code generated for program 4, compiled using the gcc compiler with the options -fno-strict-overflow and -fstrict-overflow

In Figure 7, we can see the difference between the assembly code generated with the options -fno-strict-overflow and -fstrict-overflow. The left side shows the code generated with -fno-strict-overflow, where gcc does not assume that signed overflow cannot happen and proceeds with normal code generation for the comparison. The right side shows the code generated with -fstrict-overflow, where the compiler applies the strict overflow rules, that is, it assumes that signed overflow never occurs and may optimise accordingly.

Instance 2: Similar to Instance 1, here is one more example, in which adding two numbers may result in overflow and hence in undefined behaviour.

    #include <stdio.h>

    /* Function prototype, so that main() can call add() */
    int add(int a, int b);

    int main(void)
    {
        /* Declaring two int variables */
        int a;
        int b;

        /* Code to read the values of a and b */

        /* Calling add function */
        int sum = add(a, b);

        /* Some code here */

        return 0;
    }

    /* Function definition */
    int add(int a, int b)
    {
        int sum;

        sum = a + b;

        return sum;
    }

In program-5 above, the statement ‘sum = a + b;’ in add() may overflow, since both variables are of type int, which is signed, and signed overflow leads to undefined behaviour.

To sum up, modifying the value of a variable more than once between two sequence points results in undefined behaviour. Modifying a string literal through a pointer also leads to undefined behaviour, since the standard imposes absolutely no requirements and the compiler can do anything. According to the C standard, signed integer overflow is undefined behaviour too. Some compilers can trap the overflow condition when given trap-handling options, while others simply assume that the overflow will never happen and generate the code accordingly.
