How to perform an Iterative Search Depth First using asynchronous / parallel processing?


Here is a method that performs a depth-first search and returns a list of all items, given a top-level item id. How could I modify it to take advantage of parallel processing? Currently the call to get the sub-items is made one at a time for each item in the stack. It would be nice if I could fetch the sub-items for multiple items in the stack at the same time and populate my return list faster. How could I do this (using async/await, the TPL, or anything else) in a thread-safe manner?

private async Task<IList<Item>> GetItemsAsync(string topItemId)
{
    var items = new List<Item>();
    var topItem = await GetItemAsync(topItemId);

    Stack<Item> stack = new Stack<Item>();
    stack.Push(topItem);
    while (stack.Count > 0)
    {
        var item = stack.Pop();
        items.Add(item);                   

        var subItems = await GetSubItemsAsync(item.SubId);

        foreach (var subItem in subItems)
        {
            stack.Push(subItem);
        }
    }

    return items;
}

EDIT: I was thinking of something along these lines, but it's not coming together:

var tasks = stack.Select(async item =>
{
    items.Add(item);
    var subItems = await GetSubItemsAsync(item.SubId);

    foreach (var subItem in subItems)
    {
        stack.Push(subItem);
    }
}).ToList();

if (tasks.Any())
    await Task.WhenAll(tasks);

UPDATE: If I wanted to batch up the tasks, would something like this work?

foreach (var batch in items.BatchesOf(100))
{
    var tasks = batch.Select(async item =>
    {
        await DoSomething(item);
    }).ToList();

    if (tasks.Any())
    {
        await Task.WhenAll(tasks);
    }
}
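(For reference, `BatchesOf` is not a BCL method; it's assumed to be an extension that chunks a sequence. A minimal sketch of such an extension, inferred from the call site above rather than from any actual library, might look like this:)

```csharp
using System;
using System.Collections.Generic;

public static class EnumerableExtensions
{
    // Splits a sequence into consecutive batches of at most `size` elements.
    // The final batch may be smaller if the source doesn't divide evenly.
    public static IEnumerable<IReadOnlyList<T>> BatchesOf<T>(
        this IEnumerable<T> source, int size)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        if (batch.Count > 0)
            yield return batch;
    }
}
```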

The language I'm using is C#.


Here's a method that you can use to traverse a tree, asynchronously, and in parallel:

public static async Task<IEnumerable<T>> TraverseAsync<T>(
    this IEnumerable<T> source,
    Func<T, Task<IEnumerable<T>>> childSelector)
{
    // ConcurrentBag is safe to Add to from many tasks at once.
    var results = new ConcurrentBag<T>();

    // A recursive async delegate: record the node, fetch its children,
    // then traverse all of the children in parallel.
    Func<T, Task> traverse = null;
    traverse = async next =>
    {
        results.Add(next);
        var children = await childSelector(next);
        await Task.WhenAll(children.Select(child => traverse(child)));
    };
    await Task.WhenAll(source.Select(node => traverse(node)));
    return results;
}

The method requires a delegate that asynchronously gets the children of each node, which you already have (GetSubItemsAsync). It doesn't special-case producing the root node(s), so use the method you already have to fetch them outside the scope of this method and pass them in as its source sequence.

The calling code may look something like this:

var allNodes = await new[]{await GetItemAsync(topItemId)}
    .TraverseAsync(item => GetSubItemsAsync(item.SubId));

The method fetches the children of each node in parallel, asynchronously, marking itself as complete when they have all finished. Each node then recursively calculates all of its children in parallel.

You've mentioned that you're concerned about using recursion because of the stack space it would consume, but that's not an issue here, because the methods are asynchronous. Each time the traversal moves one level deeper in the recursion, it isn't going one level deeper on the call stack; it merely schedules the recursive calls to run at a later point in time, so each level always starts at a fixed point on the stack.
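A standalone sketch (not from the answer above) illustrating the point: the recursion below goes far deeper than a thread's stack could support, yet it completes, because after each `await Task.Yield()` the continuation is queued to the thread pool and resumes at a fixed stack depth.

```csharp
using System;
using System.Threading.Tasks;

class AsyncRecursionDemo
{
    // The recursion state lives on the heap as async state machines,
    // not as frames on the call stack.
    static async Task<int> CountDownAsync(int n)
    {
        if (n == 0) return 0;
        await Task.Yield(); // resume via the scheduler at a fixed stack depth
        return await CountDownAsync(n - 1);
    }

    static async Task Main()
    {
        // No StackOverflowException, despite 100,000 levels of recursion.
        Console.WriteLine(await CountDownAsync(100_000));
    }
}
```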


If you're looking for a way to limit the degree of parallelism, for fear that there will simply be too much, I'd first ask you to try it out. If you're directing all of these calls to the thread pool, the pool itself is likely to impose an upper bound on the amount of parallelism based on what it determines is likely to perform best: past a certain point it stops creating more threads and keeps pending items in a queue, and the thread pool is far more likely to have an effective algorithm for determining the appropriate degree of parallelism than you are.

That said, if you have a compelling need to artificially limit the amount of parallelism beyond what the thread pool does, there are certainly ways. One option is to create your own synchronization context that caps the number of pending operations at some fixed number:

public class FixedDegreeSynchronizationContext : SynchronizationContext
{
    private readonly SemaphoreSlim semaphore;

    public FixedDegreeSynchronizationContext(int maxDegreeOfParallelism)
    {
        semaphore = new SemaphoreSlim(maxDegreeOfParallelism,
            maxDegreeOfParallelism);
    }

    // Asynchronous dispatch: wait (without blocking) for a free slot,
    // then run the callback. async void is acceptable here because Post
    // is inherently fire-and-forget.
    public override async void Post(SendOrPostCallback d, object state)
    {
        await semaphore.WaitAsync().ConfigureAwait(false);
        try
        {
            base.Send(d, state);
        }
        finally
        {
            semaphore.Release();
        }
    }

    // Synchronous dispatch: block until a slot is free, then run the callback.
    public override void Send(SendOrPostCallback d, object state)
    {
        semaphore.Wait();
        try
        {
            base.Send(d, state);
        }
        finally
        {
            semaphore.Release();
        }
    }
}

You can create an instance of a context such as this and set it as the current context before calling TraverseAsync, or create another overload that accepts a maxDegreeOfParallelism and sets the context inside the method.
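As a rough sketch of the first option (assuming the FixedDegreeSynchronizationContext class above is in scope, along with the question's GetItemAsync/GetSubItemsAsync), install the context before the traversal and restore the previous one afterwards so the throttling doesn't leak into unrelated code:

```csharp
// Hypothetical usage sketch; a degree of 8 is an arbitrary example value.
var previous = SynchronizationContext.Current;
SynchronizationContext.SetSynchronizationContext(
    new FixedDegreeSynchronizationContext(maxDegreeOfParallelism: 8));
try
{
    var allNodes = await new[] { await GetItemAsync(topItemId) }
        .TraverseAsync(item => GetSubItemsAsync(item.SubId));
}
finally
{
    // Restore whatever context was there before.
    SynchronizationContext.SetSynchronizationContext(previous);
}
```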

Another variation would be to limit only the number of calls to, say, your child selector, without putting any limitation on the other asynchronous operations going on here. (None of the others should be particularly expensive, so I wouldn't expect it to matter much either way, but it's certainly worth experimenting with.) To do this we can create a task queue that processes the items given to it with a fixed degree of parallelism, while leaving anything not passed through the queue unthrottled. The queue itself is simple enough, as a straightforward variation of the sync context:

public class FixedParallelismQueue
{
    private readonly SemaphoreSlim semaphore;

    public FixedParallelismQueue(int maxDegreeOfParallelism)
    {
        semaphore = new SemaphoreSlim(maxDegreeOfParallelism,
            maxDegreeOfParallelism);
    }

    // Starts the generated task once a slot is free, so at most
    // maxDegreeOfParallelism tasks produced through this queue run at once.
    public async Task<T> Enqueue<T>(Func<Task<T>> taskGenerator)
    {
        await semaphore.WaitAsync();
        try
        {
            return await taskGenerator();
        }
        finally
        {
            semaphore.Release();
        }
    }

    public async Task Enqueue(Func<Task> taskGenerator)
    {
        await semaphore.WaitAsync();
        try
        {
            await taskGenerator();
        }
        finally
        {
            semaphore.Release();
        }
    }
}

Then, when calling TraverseAsync, you can use this queue as part of your child selector:

var taskQueue = new FixedParallelismQueue(degreesOfParallelism);
var allNodes = await new[]{await GetItemAsync(topItemId)}
    .TraverseAsync(item =>
        taskQueue.Enqueue(() => GetSubItemsAsync(item.SubId)));