What Is TPL Dataflow in .NET and When Should We Use It

There are various ways to write a concurrent or parallel program in .NET, but they often lack the flexibility and robustness we need. That’s where TPL Dataflow comes in: it helps us build more robust concurrent programs and can remove a lot of complexity. For example, when we use other paradigms to make a program concurrent, we have to watch out for shared state. The synchronization this requires can make our code much more complicated, and Dataflow makes it easier to manage.

With Dataflow we create blocks, and each block keeps its own local data and, by default, processes one message at a time. So most of the time we don’t need a lock or semaphore for synchronization. Another way Dataflow can make our lives easier is data processing. Imagine we have a stream of data and we want to process it as it becomes available. We can create a pipeline in which one block processes the data, gets it ready, and sends it to the next block. Sending data to the next block is another feature of Dataflow: communication between blocks.

In this post we take a look at what TPL Dataflow is and when to use it. I’ll bring a couple of examples in this post and go more in depth in future posts.

What is TPL Dataflow

At its core, Dataflow is a set of constructs built on top of the Task Parallel Library that help us create more robust concurrent programs. A dataflow consists of one or more blocks, and each block can be connected to other blocks to form a pipeline. A block can also be used on its own to solve concurrency issues. When we use blocks as a pipeline, data moves from a block that sends it, called the source, to a block that receives it, called the target. A source can be linked to zero or more targets, and a target can be linked from zero or more sources.

A block can also be both a receiver and a sender of data; this kind of block is called a propagator. Blocks don’t own a dedicated thread, but by default each block processes one message at a time, which means synchronization issues inside a block are rare. I say rare because we can raise a block’s degree of parallelism (the MaxDegreeOfParallelism option) so that it handles several messages at once, but if a block does only one thing, for example produce data for the next block, that shouldn’t be an issue.

So basically we connect a series of blocks, and each block has a specific job. Most of the time a block does some processing and passes the data down through the pipeline. We can think of these connected blocks as a network that is responsible for one overall task.
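As a rough sketch of how a block's parallelism is configured, the snippet below uses a single ActionBlock and raises its MaxDegreeOfParallelism. The workload (squaring numbers) and the names squareBlock and results are made up for illustration, and it assumes the System.Threading.Tasks.Dataflow library is available to the project. Because several messages may now be handled at once, the results go into a thread-safe collection.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical workload: square each number and collect the results.
// ConcurrentBag is used because the delegate may run for several messages at once.
var results = new ConcurrentBag<int>();

var squareBlock = new ActionBlock<int>(
    n => results.Add(n * n),
    new ExecutionDataflowBlockOptions
    {
        // The default is 1, meaning one message at a time.
        MaxDegreeOfParallelism = 4
    });

for (var i = 1; i <= 10; i++)
{
    squareBlock.Post(i);
}

// Signal that no more input is coming, then wait for the buffered messages to drain.
squareBlock.Complete();
await squareBlock.Completion;

Console.WriteLine(results.Count); // prints 10
```

With the default setting of 1, a plain List would have been safe here; the thread-safe collection is the price of the extra parallelism.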
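To illustrate a small network where one source feeds more than one target, here is a minimal sketch that routes numbers from a BufferBlock to two ActionBlocks using the LinkTo overload that takes a predicate. The even/odd routing and all variable names are assumptions chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var evens = new List<int>();
var odds = new List<int>();

// One source and two targets; each target runs one message at a time by default,
// so a plain List per target is safe here.
var source = new BufferBlock<int>();
var evenTarget = new ActionBlock<int>(n => evens.Add(n));
var oddTarget = new ActionBlock<int>(n => odds.Add(n));

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

// A message goes to the first linked target whose predicate accepts it.
source.LinkTo(evenTarget, linkOptions, n => n % 2 == 0);
source.LinkTo(oddTarget, linkOptions, n => n % 2 != 0);

for (var i = 1; i <= 6; i++)
{
    source.Post(i);
}

source.Complete();
await Task.WhenAll(evenTarget.Completion, oddTarget.Completion);

Console.WriteLine("Even: " + string.Join(", ", evens)); // Even: 2, 4, 6
Console.WriteLine("Odd: " + string.Join(", ", odds));   // Odd: 1, 3, 5
```

One thing to watch with predicates: a message that no linked target accepts stays in the source and holds up everything behind it, so the predicates should cover every possible message.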

When to use TPL Dataflow

Now the question arises: when should we use these blocks? A couple of circumstances come to mind.

When we need to create a processing pipeline
When we want to stream data, because one block can produce data for another and each block has a buffer that holds data until it can be processed
When we have multi-user concurrency problems
When we want to avoid the issues that arise in applications with shared state, such as thread-safety issues
When we need batch processing and want to break our application up into sections. Doing that allows us to assign a different degree of parallelism to each block that really needs it.
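For the batch processing case, a minimal sketch could use BatchBlock, which groups incoming items into arrays of a fixed size. The batch size of 3, the seven input numbers, and the variable names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Groups incoming items into arrays of up to 3 elements.
var batcher = new BatchBlock<int>(3);

var batchSizes = new List<int>();

// Receives whole batches; runs one batch at a time by default.
var printBatch = new ActionBlock<int[]>(batch =>
{
    batchSizes.Add(batch.Length);
    Console.WriteLine(string.Join(", ", batch));
});

batcher.LinkTo(printBatch, new DataflowLinkOptions { PropagateCompletion = true });

for (var i = 1; i <= 7; i++)
{
    batcher.Post(i);
}

// Completing the batcher flushes the last, smaller batch.
batcher.Complete();
await printBatch.Completion;

// Output:
// 1, 2, 3
// 4, 5, 6
// 7
```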

Note that blocks themselves carry some complexity and overhead, so we should carefully weigh the trade-offs before we decide to use them.

When Not to Use TPL Dataflow

There are some situations where TPL Dataflow is not suitable. I’ll briefly go through them and explain why that could be the case.

Your application is not concurrent, duh
Performance is critical, because TPL Dataflow is an abstraction over tasks and threads and adds some overhead
You need fine-grained control over what a thread does and how it does it
There’s no shared mutable state, in other words our system doesn’t have synchronization issues

A lot of these points can be discussed further, and some of them can be refuted. For example, TPL Dataflow is not slow, but it is still an abstraction over lower-level .NET constructs. The important takeaway is that we should examine the pros and cons and make the decision on a case-by-case basis.

TPL Dataflow Example in .NET

Here we take a look at a combination of TransformBlock and ActionBlock. A TransformBlock receives an input and transforms it into a different output. An ActionBlock, on the other hand, only receives input and produces no output. So naturally the ActionBlock sits at the end of our pipeline, and we use it to print the final result. You can see this example with more explanations here.

Each TransformBlock takes an input and generates an output. But on their own the blocks are not very interesting; we need to connect them to form a network.

First we create a DataflowLinkOptions and set PropagateCompletion to true. This causes a source block to pass its completion on to the linked target, whether it finished successfully or faulted. Then we link the blocks using the LinkTo method, which is called on a source and connects it to a target. After that we post the URL of the book The Iliad of Homer to the head of the Dataflow pipeline. Note that we use the synchronous Post method for this. There’s also an asynchronous version, SendAsync, which we can use instead.

After that we call the Complete method to mark the head of the pipeline as completed. The head propagates its completion after it has processed all buffered messages. We should call Complete on the head of our pipeline once we’re sure there is no more input for it. Finally we wait for the tail block to finish processing by calling printReversedWords.Completion.Wait();. Because completion propagates down the links, the whole pipeline is finished once its tail block finishes, which is why we wait on the last block.

Note that in real code we should await the Completion task instead of using Wait(). The MSDN example uses Wait() because it’s a console app and correct async code is not the point there.
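The steps above can be sketched in a condensed form. This is not the MSDN sample itself: to stay self-contained it posts a short hard-coded line of text instead of downloading the book, and the block names (createWordList, filterWordList, printWords) and the filtering rule are assumptions chosen for illustration. It also awaits the tail block's Completion rather than calling Wait().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Head of the pipeline: splits a piece of text into words.
var createWordList = new TransformBlock<string, string[]>(text =>
    text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));

// Middle: keeps distinct words longer than three characters.
var filterWordList = new TransformBlock<string[], string[]>(words =>
    words.Where(w => w.Length > 3).Distinct().ToArray());

// Tail: an ActionBlock only consumes input, so it ends the pipeline.
var printed = new List<string>();
var printWords = new ActionBlock<string[]>(words =>
{
    foreach (var word in words)
    {
        printed.Add(word);
        Console.WriteLine(word);
    }
});

// Pass completion (or a fault) from each source on to its target.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
createWordList.LinkTo(filterWordList, linkOptions);
filterWordList.LinkTo(printWords, linkOptions);

// Hypothetical input standing in for the downloaded book text.
createWordList.Post("Sing O goddess the anger of Achilles son of Peleus");

// No more input: mark the head as complete and wait for the tail to finish.
createWordList.Complete();
await printWords.Completion;

// Output:
// Sing
// goddess
// anger
// Achilles
// Peleus
```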

Summary

In this post, I discussed what TPL Dataflow is, when it should be used, and when it should be avoided. We also saw an example of Dataflow using ActionBlock and TransformBlock, in which the pipeline received a text and each block did some processing on it before passing it down the pipeline.

Luís Barbosa says:

August 7, 2018 at 9:18 am

TPL Dataflow is an in-process actor library abstracting away the hard work of async and/or concurrent programming. Pipelining tasks is of course one of the many use cases TPL Dataflow applies to; however, I think it’s reductive to say that TPL Dataflow is a set of constructs to build pipelines. The ActionBlock alone covers a large number of use cases. Here are some examples:

https://stackoverflow.com/questions/24966019/async-with-huge-data-streams/24966167#24966167
https://stackoverflow.com/questions/34843224/task-startnew-parallel-foreach-doesnt-await/34843290#34843290
https://stackoverflow.com/questions/34360772/batch-processing-in-mvc5-website/34361999#34361999
https://stackoverflow.com/questions/27841790/how-to-use-threads-for-processing-many-tasks/27842076#27842076
https://stackoverflow.com/questions/26009333/how-to-use-task-parallel-library-tpl-with-load-balancing-and-limited-degree-of/26009467#26009467
http://blog.i3arnon.com/2016/05/23/tpl-dataflow/

1. Hamid Mosalla says:
  
  August 7, 2018 at 9:58 am
  
  Thank you for the response Luís. I’m going to read your links and update the article accordingly.
  
  1. Luís Barbosa says:
    
    August 7, 2018 at 5:46 pm
    
    Hi Hamid, dig into Jeremy Miller’s Jasper source code for more use cases: https://github.com/JasperFx/jasper
    For example, he uses TPL Dataflow for sending outgoing messages. I’ve successfully used TPL Dataflow for the same use case. A little bit different from Jeremy’s one, but to achieve the same result.
    
    Keep up the good work with your posts.
    
Hamid Mosalla

Programming Adventures