Helgrind: Detecting Synchronisation Issues in Multithreaded Programs

Let’s explore how Helgrind can be used to detect and debug multithreading issues with the help of a multithreaded C program.

Multithreading is a programming paradigm that allows concurrent execution of multiple threads within a single process, and is popular in modern software development because it can improve the performance and responsiveness of applications. However, it brings inherent thread synchronisation challenges such as race conditions and deadlocks. These can be very difficult to detect and debug for the following reasons.

Non-determinism: Thread execution order and timing are unpredictable. These can vary between runs, even on the same hardware, making it difficult to reproduce bugs.

Heisenbug effect: Using a debugger to pause and examine threads can potentially change their execution order, making the issue disappear.

Overview of Helgrind

Helgrind is a tool within the Valgrind suite that is designed to help developers identify thread synchronisation issues. It focuses specifically on programs that use the POSIX pthread APIs, and can be used with C, C++ and Fortran programs. It observes the behaviour of threads and thread API calls while the program executes, and reports potential synchronisation errors.

Helgrind can detect three categories of errors.

1. Incorrect use of POSIX pthread APIs: It can detect misuse of POSIX pthread APIs related to mutexes, semaphores, spinlocks, condition variables, barriers, reader-writer locks, etc. For example, it can detect issues like:

a. Unbalanced lock operations: Helgrind detects cases where a thread attempts to unlock a mutex that has not been locked previously.

b. Misuse of condition variables: It detects incorrect use of condition variables, such as waiting on a condition variable without holding the associated mutex.

2. Race conditions: Race conditions occur when two or more threads access shared data or resources concurrently without adequate synchronisation, leading to unexpected or incorrect behaviour. Helgrind keeps track of which threads access which memory locations and flags a potential data race when multiple threads access the same memory location, at least one of the accesses is a write, and no lock or other synchronisation orders the accesses.

3. Potential deadlock conditions: Deadlocks happen when two or more threads are blocked forever, each waiting for another to release a resource that it needs. Helgrind builds a graph of lock acquisitions and releases, and checks for cyclic dependencies among locks. A cyclic dependency indicates a potential deadlock.
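To make the first and third categories more concrete, here is a minimal sketch of two helper functions that Helgrind should complain about. All names (lock_a, lock_b, unlock_without_lock, lock_order_inversion) are hypothetical and do not come from the example program discussed later. The first function unlocks a mutex that was never locked; it uses an error-checking mutex so that the call fails with EPERM instead of invoking undefined behaviour. The second takes two locks in opposite orders from two threads that run one after the other, so the program never actually deadlocks, yet the inconsistent lock order is still there to be found.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical locks used only for this illustration. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Category 1: unlock a mutex that is not locked.
 * With PTHREAD_MUTEX_ERRORCHECK the call returns EPERM rather than
 * being undefined, so the misuse is safe to demonstrate. */
int unlock_without_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_unlock(&m);   /* misuse: m was never locked */

    pthread_mutex_destroy(&m);
    return rc;
}

/* Category 3: two threads acquire the same locks in opposite orders. */
static void *take_a_then_b(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *take_b_then_a(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

/* The threads run strictly one after the other, so no deadlock can
 * occur in this run; only the lock ordering is inconsistent. */
int lock_order_inversion(void) {
    pthread_t t;
    int rc;

    rc = pthread_create(&t, NULL, take_a_then_b, NULL);
    assert(rc == 0);
    pthread_join(t, NULL);

    rc = pthread_create(&t, NULL, take_b_then_a, NULL);
    assert(rc == 0);
    pthread_join(t, NULL);
    return 0;
}
```

Calling both functions from a small main() and running the binary under valgrind --tool=helgrind should produce a report about unlocking a lock that is not locked and a lock order violation. The exact wording of the messages depends on the Valgrind version.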

Using Helgrind to detect threading issues

Let us take an example of a multithreaded C program which uses POSIX pthread APIs for synchronisation. The program compiles and runs without any warnings/errors. We will analyse the program using Helgrind to detect any thread synchronisation issues.

1  /* Program race.c */
2
3  #include <stdio.h>
4  #include <pthread.h>
5
6  int shared_counter = 0;
7  pthread_mutex_t mutex;
8
9  void *inc_counter(void *arg) {
10     int current_value;
11     int i;
12     for (i = 0; i < 10000; i++) {
13
14         pthread_mutex_lock(&mutex);
15         // Read the shared variable
16         current_value = shared_counter;
17         pthread_mutex_unlock(&mutex);
18
19         // Modify the local copy
20         current_value++;
21
22         // Write back to the shared variable
23         shared_counter = current_value;
24     }
25     return NULL;
26 }
27
28 int main() {
29     pthread_t thread1, thread2;
30
31     pthread_mutex_init(&mutex, NULL);
32
33     pthread_create(&thread1, NULL, inc_counter, NULL);
34     pthread_create(&thread2, NULL, inc_counter, NULL);
35
36     pthread_join(thread1, NULL);
37     pthread_join(thread2, NULL);
38
39     pthread_mutex_destroy(&mutex);
40
41     printf("Final value of shared counter: %d\n", shared_counter);
42     return 0;
43 }

This program (race.c) runs three threads: the main() thread, plus thread1 and thread2, which main() creates. Thread1 and thread2 each increment the counter (the shared_counter variable) 10,000 times. We will use Helgrind to detect any thread synchronisation issues.

First, compile the program as follows:

$ gcc -g -o race race.c -lpthread

Then run the program using Helgrind:

$ valgrind --tool=helgrind ./race
==16932== Helgrind, a thread error detector
==16932== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==16932== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==16932== Command: ./race
==16932==
==16932== ---Thread-Announcement------------------------------------------
==16932==
==16932== Thread #3 was created
==16932== at 0x514D1DE: clone (in /usr/lib64/libc-2.17.so)
<Snipped>
==16932== by 0x514D21C: clone (in /usr/lib64/libc-2.17.so)
==16932== Address 0x601084 is 0 bytes inside data symbol "shared_counter"
==16932==
Final value of shared counter: 20000
==16932==
==16932== For counts of detected and suppressed errors, rerun with: -v
==16932== Use --history-level=approx or =none to gain increased speed, at
==16932== the cost of reduced accuracy of conflicting-access information
==16932== ERROR SUMMARY: 20000 errors from 2 contexts (suppressed: 30005 from 8)

The output of Helgrind suggests some race conditions. Let’s interpret it.

1. In the Helgrind output, the two threads are indicated by Thread #2 and Thread #3 (main() is the first thread).

2. The output suggests two race scenarios.

3. The first race situation is:

==16932== Possible data race during read of size 4 at 0x601084 by thread #3
==16932== Locks held: 1, at address 0x6010A0
==16932== at 0x40080F: inc_counter (race.c:16)
==16932== by 0x4C2F881: ??? (in /usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==16932== by 0x4E42DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==16932== by 0x514D21C: clone (in /usr/lib64/libc-2.17.so)
==16932==
==16932== This conflicts with a previous write of size 4 by thread #2
==16932== Locks held: none
==16932== at 0x400829: inc_counter (race.c:23)
==16932== by 0x4C2F881: ??? (in /usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==16932== by 0x4E42DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==16932== by 0x514D21C: clone (in /usr/lib64/libc-2.17.so)
==16932== Address 0x601084 is 0 bytes inside data symbol "shared_counter"

Here Helgrind reports a conflict between [thread #3, line 16] and [thread #2, line 23]. Checking these lines in the source shows how the race arises:

a. Thread #3 acquires the mutex lock and reads the shared variable. This is evident from line 16 of the source.

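b. Thread #2, however, writes back to the shared variable at line 23 without holding any lock ('Locks held: none'). Nothing orders this write relative to the read by Thread #3, so the two accesses can happen at the same time. This is a race situation.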

4. Now, we come to the second race situation:

==16932== Possible data race during write of size 4 at 0x601084 by thread #3
==16932== Locks held: none
==16932== at 0x400829: inc_counter (race.c:23)
==16932== by 0x4C2F881: ??? (in /usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==16932== by 0x4E42DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==16932== by 0x514D21C: clone (in /usr/lib64/libc-2.17.so)
==16932==
==16932== This conflicts with a previous write of size 4 by thread #2
==16932== Locks held: none
==16932== at 0x400829: inc_counter (race.c:23)
==16932== by 0x4C2F881: ??? (in /usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==16932== by 0x4E42DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==16932== by 0x514D21C: clone (in /usr/lib64/libc-2.17.so)
==16932== Address 0x601084 is 0 bytes inside data symbol "shared_counter"

5. Here it indicates a conflict between [thread #3, line 23] and [thread #2, line 23]. The source shows that both Thread #2 and Thread #3 can write to the shared variable at the same time, because neither holds a lock while writing.

6. To fix both race scenarios, the mutex should be acquired both before reading and before writing the shared counter. In the source we can see that the mutex is acquired before reading but not before writing the shared counter. The following change in the source fixes the race condition:

pthread_mutex_lock(&mutex);
// Write back to the shared variable
shared_counter = current_value;
pthread_mutex_unlock(&mutex);

7. After applying the fix, we run the program with Helgrind again. Helgrind no longer reports any errors, which suggests that the race condition has been fixed.

$ valgrind --tool=helgrind ./race_fixed
==17034== Helgrind, a thread error detector
==17034== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==17034== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==17034== Command: ./race_fixed
==17034==
Final value of shared counter: 20000
==17034==
==17034== For counts of detected and suppressed errors, rerun with: -v
==17034== Use --history-level=approx or =none to gain increased speed, at
==17034== the cost of reduced accuracy of conflicting-access information
==17034== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 60005 from 11)
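
One caveat is worth spelling out. Locking the write separately removes the data race that Helgrind reports, but the mutex is still released between the read and the write. Two threads can therefore read the same value, and one increment can be lost, even though every individual access is now protected. Helgrind looks for unsynchronised memory accesses, not for this kind of atomicity problem, so a clean report does not by itself guarantee a final count of 20000. The sketch below is one possible way to close that gap, assuming the whole read-modify-write can sit inside a single critical section. The function name run_counter and the ITERATIONS macro are hypothetical; the rest mirrors race.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 10000   /* same loop count as race.c */

static int shared_counter = 0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *inc_counter(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        /* Hold the mutex across the read, the increment and the
         * write, so that no other thread can interleave. */
        pthread_mutex_lock(&mutex);
        shared_counter++;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

/* Hypothetical helper: runs two incrementing threads and returns the
 * final counter value. */
int run_counter(void) {
    pthread_t thread1, thread2;
    int rc;

    shared_counter = 0;

    rc = pthread_create(&thread1, NULL, inc_counter, NULL);
    assert(rc == 0);
    rc = pthread_create(&thread2, NULL, inc_counter, NULL);
    assert(rc == 0);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return shared_counter;
}
```

With this version every increment happens entirely under the lock, so run_counter() returns 20000 on every run, and Helgrind should still report no errors.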

Helgrind is a valuable tool for detecting race conditions, potential deadlocks and pthread API misuse in a multithreaded project. Running it early in the development process helps developers catch synchronisation bugs that would otherwise be hard to reproduce and debug, leading to more reliable and robust multithreaded applications.
