Thread Affinity in Parallel Programming Using TPL

Thread affinity refers to the assignment of threads to specific processors or cores on a multi-core CPU. When a thread is bound to a specific processor or core, it is said to have affinity to that processor or core. This is in contrast to thread migration, where a thread can move between different processors or cores during its execution.

Thread affinity can have a significant impact on the performance of parallel programs, particularly when the program is using shared resources such as shared memory. By assigning threads to specific processors or cores, it is possible to reduce contention for shared resources and improve cache locality. Cache locality refers to the idea that data accessed frequently by a thread should be stored close to the processor or core where that thread is executing, to reduce the time spent waiting for data to be fetched from main memory.
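To make the cache-locality point concrete, consider false sharing: two threads writing to variables that happen to sit on the same cache line will repeatedly invalidate each other's caches even though they never touch the same variable. The sketch below is illustrative only (the class name FalseSharingDemo is mine, and the exact timings depend on your hardware); it contrasts two counters sitting one slot apart with two counters padded sixteen slots (128 bytes) apart:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class FalseSharingDemo
{
    const int Iterations = 50_000_000;

    // Two threads each increment their own counter, 'stride' slots apart.
    // Returns the combined count (for sanity checking) and prints the time.
    public static long Run(int stride)
    {
        long[] counters = new long[2 * stride];
        var sw = Stopwatch.StartNew();
        Parallel.For(0, 2, t =>
        {
            int slot = t * stride;
            for (int i = 0; i < Iterations; i++)
            {
                counters[slot]++;
            }
        });
        sw.Stop();
        Console.WriteLine($"stride {stride}: {sw.ElapsedMilliseconds} ms");
        return counters[0] + counters[stride];
    }

    static void Main()
    {
        Run(1);   // counters likely share a 64-byte cache line (false sharing)
        Run(16);  // counters 128 bytes apart, on separate cache lines
    }
}
```

On a typical multi-core machine the padded version runs noticeably faster, because each counter lives on its own cache line and the cores stop fighting over it.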

In this post I’m going to talk about the strategies we can implement in .NET to achieve optimal parallel performance when an application needs to run efficiently on multiple threads.

Why Is Thread Affinity Important?

Thread affinity can have a significant impact on parallel performance because if threads are constantly being moved between cores, it can lead to cache thrashing and other performance issues.

To illustrate this, let’s consider an example. Suppose you have a .NET application that performs a lot of parallel processing on a large data set. If you use the default scheduling policy, which is to let the .NET runtime decide how to schedule threads, it may result in threads being moved between cores frequently. This can lead to a significant slowdown in performance because of the overhead involved in moving data between different caches.

Implicit vs explicit thread binding

Implicit thread binding allows the operating system’s scheduler to determine which processor or core to assign a thread to. The operating system uses various algorithms and policies to determine the best processor or core to assign a thread to based on factors such as processor utilization, cache locality, and load balancing. Implicit binding is typically the default behavior in most programming environments.

On the other hand, explicit thread binding involves using system calls or library functions to bind threads to specific processors or cores. This can be useful in situations where you want to control thread placement for performance optimization or to ensure deterministic behavior. Explicit binding can be achieved using system-specific calls or libraries, such as pthread_setaffinity_np on POSIX-compliant systems or the ProcessThread.ProcessorAffinity property in .NET.
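On the .NET side, note that per-thread affinity is exposed through ProcessThread (an OS thread), not the managed Thread class, and there is no direct mapping from a managed thread to a ProcessThread. The sketch below (class name ThreadAffinityDemo is mine) therefore pins every OS thread of the current process; setting ProcessThread.ProcessorAffinity is, to my knowledge, Windows-only, so the code guards for that:

```csharp
using System;
using System.Diagnostics;

static class ThreadAffinityDemo
{
    // Pin every OS thread of the current process to core 1 (bitmask 0b10).
    // Returns false when the platform does not support per-thread affinity.
    public static bool Run()
    {
        if (!OperatingSystem.IsWindows())
        {
            return false;
        }

        foreach (ProcessThread pt in Process.GetCurrentProcess().Threads)
        {
            // ProcessorAffinity is a bitmask: bit n allows the thread on core n.
            pt.ProcessorAffinity = (IntPtr)0b10;
        }
        return true;
    }

    static void Main() =>
        Console.WriteLine(Run() ? "pinned" : "not supported on this OS");
}
```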

The main difference between implicit and explicit thread binding is the level of control over thread placement. Implicit binding leaves thread placement decisions to the operating system, while explicit binding allows the programmer to specify the exact processor or core that a thread should be bound to. However, explicit binding can also introduce additional complexity and overhead, and may not always result in better performance compared to implicit binding.

When using thread affinity, it is important to use appropriate scheduling policies to ensure that threads are assigned to processors or cores in a way that maximizes parallel performance. For example, a round-robin scheduling policy may be appropriate when the workload is evenly distributed across all processors or cores, while a more sophisticated scheduling policy may be needed when the workload is unevenly distributed.
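To make the round-robin idea concrete, here is a minimal sketch (the class name RoundRobinDemo and its shape are my own) that statically distributes work items across a fixed set of worker tasks in round-robin order, which spreads the load evenly when all items cost roughly the same:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class RoundRobinDemo
{
    // Assign item i to worker i % workerCount, then sum every bucket in parallel.
    public static long Run(int workerCount, int itemCount)
    {
        var buckets = new List<int>[workerCount];
        for (int w = 0; w < workerCount; w++)
        {
            buckets[w] = new List<int>();
        }

        // Round-robin distribution: an even spread for uniformly sized items.
        for (int i = 0; i < itemCount; i++)
        {
            buckets[i % workerCount].Add(i);
        }

        // One task per bucket; each worker touches only its own slice of the data.
        Task<long>[] tasks = buckets
            .Select(b => Task.Run(() =>
            {
                long sum = 0;
                foreach (int item in b) sum += item;
                return sum;
            }))
            .ToArray();

        Task.WaitAll(tasks);
        return tasks.Sum(t => t.Result);
    }

    static void Main() => Console.WriteLine(Run(4, 12)); // 0+1+...+11 = 66
}
```

For unevenly sized items this static split breaks down, which is where dynamic schemes such as work stealing (discussed below) earn their keep.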

In what scenarios do we use explicit vs implicit thread binding?

Explicit thread binding involves manually assigning threads to specific processing resources using low-level operating system calls or specialized programming interfaces. This approach is typically used when fine-grained control over thread placement is required to optimize performance. Explicit thread binding can be useful in scenarios where there are multiple threads competing for shared resources or when the workload is highly sensitive to cache performance.

Implicit thread binding, on the other hand, allows the operating system or runtime environment to automatically assign threads to processing resources based on the current system load and other factors. This approach is typically used in scenarios where the workload is less sensitive to cache performance or where the overhead of explicit thread binding outweighs the potential performance benefits.

In general, explicit thread binding is more appropriate in situations where there are a small number of threads that are heavily contending for shared resources, while implicit thread binding is more appropriate in scenarios where there are many threads with diverse resource requirements and where the system load varies widely over time.

Thread Affinity in .NET and TPL

Explicit thread binding: This involves using system calls or library functions to bind threads to specific processors or cores. For example, the Process.ProcessorAffinity property can be used in .NET to specify the processor cores on which the threads of a process can be scheduled.

using System;
using System.Diagnostics;

Process proc = Process.GetCurrentProcess();
// ProcessorAffinity is a bitmask: bit n allows the process to run on core n.
// Setting it to 1 (0b0001) restricts the process to the first core (core 0).
proc.ProcessorAffinity = (IntPtr)1;

Note that even though the Process.ProcessorAffinity property allows you to specify the processors on which the threads of a process can be scheduled, it doesn’t allow you to specify on which processor a specific thread should run. The actual scheduling of the threads is still managed by the operating system.
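If you genuinely need to pin the calling thread (rather than the whole process) to a core on Windows, one option is to P/Invoke SetThreadAffinityMask from kernel32; the BCL does not expose this directly, so treat the sketch below (class name NativeAffinity is mine) as an assumption-laden illustration, and remember that without Thread.BeginThreadAffinity the CLR host is in principle free to move managed code between OS threads:

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeAffinity
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

    // Pin the calling OS thread to the given core; returns false if
    // unsupported on this platform or if the native call failed.
    public static bool PinCurrentThreadToCore(int core)
    {
        if (!OperatingSystem.IsWindows())
        {
            return false;
        }

        // Returns the previous affinity mask, or zero on failure.
        UIntPtr previous = SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
        return previous != UIntPtr.Zero;
    }

    static void Main() =>
        Console.WriteLine(PinCurrentThreadToCore(0) ? "pinned" : "not pinned");
}
```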

Setting thread affinity is usually not recommended unless you have a specific reason to do so, such as optimizing performance or ensuring deterministic execution on a specific processor. In general, the operating system’s thread scheduler is best left to manage thread placement based on system load and available resources. All in all, changing the ProcessorAffinity of a process can have significant performance implications and should be done with caution.

Different mechanisms available to assign thread affinity in .NET TPL

  • A custom thread-affinity scheduler – This is not a class that ships with the framework; it is a TaskScheduler you implement yourself that dispatches tasks onto a fixed set of dedicated threads. By ensuring that tasks always run on the same threads, you can improve cache locality and reduce cache thrashing.
  • TaskScheduler.FromCurrentSynchronizationContext – This method creates a scheduler that queues tasks back to the synchronization context that was current when the method was called. This can be used to control where tasks execute, for example marshalling work onto a UI thread.
  • TaskScheduler.Default – This is the default scheduler in the TPL, and it executes tasks on the .NET thread pool. This is a good general-purpose policy that works well for most scenarios, particularly when tasks are short-running.
  • ConcurrentExclusiveSchedulerPair – This scheduling policy provides two separate task queues, one for concurrent tasks and one for exclusive tasks. This can be useful in scenarios where you need to execute both concurrent and exclusive tasks, and you want to avoid contention between them.
  • WorkStealingTaskScheduler – Available in Microsoft’s ParallelExtensionsExtras samples rather than the framework itself, this scheduler uses a work-stealing algorithm to balance the workload between threads. This can be useful in scenarios where the workload is highly variable and you want to maximize CPU utilization.
  • LimitedConcurrencyLevelTaskScheduler – Also from the ParallelExtensionsExtras samples, this scheduler limits the number of tasks that can execute concurrently. This can be useful in scenarios where you need to limit resource usage or avoid contention for shared resources.
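Of these, ConcurrentExclusiveSchedulerPair is built into the BCL. A quick usage sketch (the class name SchedulerPairDemo and the writer workload are just illustrative): tasks queued to the exclusive scheduler are guaranteed never to run concurrently, which the code below checks with a counter.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class SchedulerPairDemo
{
    // Queue several tasks to the exclusive scheduler and verify that
    // no two of them ever run at the same time.
    public static bool Run()
    {
        var pair = new ConcurrentExclusiveSchedulerPair();
        int running = 0;
        bool overlapped = false;

        Task[] writers = new Task[4];
        for (int i = 0; i < writers.Length; i++)
        {
            writers[i] = Task.Factory.StartNew(() =>
            {
                if (Interlocked.Increment(ref running) != 1)
                {
                    overlapped = true; // exclusive guarantee violated
                }
                Thread.Sleep(10); // simulate work
                Interlocked.Decrement(ref running);
            }, CancellationToken.None, TaskCreationOptions.None, pair.ExclusiveScheduler);
        }

        Task.WaitAll(writers);
        return !overlapped;
    }

    static void Main() => Console.WriteLine(Run() ? "no overlap" : "overlap!");
}
```

Tasks queued to pair.ConcurrentScheduler, by contrast, may run in parallel with each other, but never alongside an exclusive task.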

These are just a few examples of the scheduling policies available for the TPL. In general, you should choose a scheduling policy that is appropriate for the workload and resources available in your specific scenario. You may need to experiment with different policies and settings to find the optimal approach for your program. Here’s an example of a custom task scheduler that runs tasks on its own dedicated set of threads.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Create a custom task scheduler that runs tasks on a fixed set of dedicated threads
public class CustomTaskScheduler : TaskScheduler {
  private readonly BlockingCollection<Task> _tasks = new BlockingCollection<Task>();
  private readonly Thread[] _threads;
  public CustomTaskScheduler(int numThreads) {
    _threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      // Each worker blocks until a task arrives, instead of busy-waiting
      // with TryDequeue and Thread.Sleep.
      _threads[i] = new Thread(() => {
        foreach (Task task in _tasks.GetConsumingEnumerable()) {
          TryExecuteTask(task);
        }
      });
      _threads[i].IsBackground = true;
      _threads[i].Start();
    }
  }
  protected override IEnumerable<Task> GetScheduledTasks() {
    return _tasks.ToArray();
  }
  protected override void QueueTask(Task task) {
    _tasks.Add(task);
  }
  // Never inline: tasks must always run on this scheduler's own threads.
  protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) {
    return false;
  }
}

Usage:

public void RunWithThreadAffinity() {

  // Fill the data array with some values
  int[] data = new int[1000];
  for (int i = 0; i < data.Length; i++) {
    data[i] = i;
  }

  CustomTaskScheduler scheduler = new CustomTaskScheduler(4);

  // Create some tasks, each processing its own subset of the data array
  int chunkSize = data.Length / 10;
  Task[] tasks = new Task[10];
  for (int i = 0; i < 10; i++) {
    int start = i * chunkSize; // per-iteration copy, safe to capture in the lambda
    tasks[i] = new Task(() => {
      long sum = 0;
      for (int j = start; j < start + chunkSize; j++) {
        sum += data[j];
      }
    });
    tasks[i].Start(scheduler);
  }
  Task.WaitAll(tasks);
}

Improving the performance of a parallel program using thread affinity

Suppose we have a parallel program that performs a large number of matrix multiplications using C#. We want to improve the performance of the program by using thread affinity to reduce contention for shared memory and improve cache locality.

To implement thread affinity, we can use the Thread.BeginThreadAffinity and Thread.EndThreadAffinity methods from System.Threading. Note that these methods do not bind a thread to a particular processor or core; they notify the host that the managed thread must stay on the same physical OS thread between the two calls. Actual core binding still requires an OS-level mechanism such as ProcessThread.ProcessorAffinity. We can then use an appropriate scheduling policy, such as static partitioning of the rows, to ensure that the workload is evenly distributed across all processors or cores.

By using thread affinity and an appropriate scheduling policy, we can significantly improve the performance of the program, particularly when running on a multi-core CPU with shared memory. Here’s an example implementation of a parallel program that performs a large number of matrix multiplications using TPL and thread affinity in C#.

using System;
using System.Threading;
using System.Threading.Tasks;

class MatrixMultiplication
{
    static void Main(string[] args)
    {
        int[,] matrix1 = InitializeMatrix(1000, 1000);
        int[,] matrix2 = InitializeMatrix(1000, 1000);
        int[,] resultMatrix = new int[1000, 1000];

        MultiplyMatrixParallel(matrix1, matrix2, resultMatrix);
    }

    static int[,] InitializeMatrix(int rows, int cols)
    {
        Random random = new Random();
        int[,] matrix = new int[rows, cols];
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                matrix[i, j] = random.Next(1, 10);
            }
        }

        return matrix;
    }

    static void MultiplyMatrixParallel(int[,] matrix1, int[,] matrix2, int[,] resultMatrix)
    {
        int rows = matrix1.GetLength(0);
        int cols = matrix2.GetLength(1);
        int numTasks = Environment.ProcessorCount;

        // Create and start tasks
        Task[] tasks = new Task[numTasks];
        for (int task = 0; task < numTasks; task++)
        {
            int startRow = task * (rows / numTasks);
            int endRow = (task == numTasks - 1) ? rows : (task + 1) * (rows / numTasks);

            tasks[task] = Task.Factory.StartNew(() =>
            {
                // Keep this managed thread on the same OS thread while we work.
                // Note: this does not pin the thread to a particular core.
                Thread.BeginThreadAffinity();
                try
                {
                    MultiplyMatrix(matrix1, matrix2, resultMatrix, startRow, endRow);
                }
                finally
                {
                    Thread.EndThreadAffinity();
                }
            });
        }

        // Wait for all tasks to complete
        Task.WaitAll(tasks);
    }

    static void MultiplyMatrix(int[,] matrix1, int[,] matrix2, int[,] resultMatrix, int startRow, int endRow)
    {
        // Inner dimension: the columns of matrix1 must equal the rows of matrix2.
        int inner = matrix1.GetLength(1);
        int cols = matrix2.GetLength(1);

        // Perform matrix multiplication for the specified rows
        for (int i = startRow; i < endRow; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                for (int k = 0; k < inner; k++)
                {
                    resultMatrix[i, j] += matrix1[i, k] * matrix2[k, j];
                }
            }
        }
    }
}

 

Summary

Thread affinity is the assignment of threads to specific processors or cores on a multi-core CPU, which can improve the performance of parallel programs that use shared resources such as shared memory. Thread migration, on the other hand, refers to a thread moving between different processors or cores during its execution. Explicit thread binding involves using system calls or library functions to bind threads to specific processors or cores, while implicit thread binding allows the operating system’s scheduler to determine which processor or core to assign a thread to. The main difference between them is the level of control over thread placement, with explicit binding being useful when fine-grained control is required to optimize performance.

Explicit thread binding is typically used when a small number of threads are heavily contending for shared resources, while implicit thread binding is more appropriate in scenarios where there are many threads with diverse resource requirements and the system load varies widely over time. In .NET and the TPL, different mechanisms are available to influence thread affinity, such as custom TaskScheduler implementations, the TaskScheduler.FromCurrentSynchronizationContext method, and the default thread-pool scheduler. However, setting thread affinity is usually not recommended unless there is a specific reason to do so, such as optimizing performance or ensuring deterministic execution on a specific processor.

