Flo: A Simple C++ Library for Parallel Max Flow Computations

A small library for easily creating network flow graphs, and performing max-flow computations on multi-core CPUs and NVIDIA GPUs.

by Joshua Korn

Background

Network flows are everywhere, across mathematics, computer science, and operations research. From finding bipartite matchings to modeling traffic to image segmentation, many distinct problems can be modeled as network flows. Commonly, we wish to find the maximum flow through a given network. There are many different algorithms to do so, but most of them are inherently sequential. As networks grow large, as is the case in many real-life applications, these algorithms can become slow, and as such, I wish to investigate how much speedup can be obtained from parallelizing them.

Proposal

Parallelizing max-flow algorithms is inherently challenging: as with other graph algorithms, there are many dependencies between nodes, and the most common max-flow algorithms are sequential. There is some, but not much, research on parallelizing these algorithms. The main challenge will be figuring out how to partition a given graph over workers for optimal parallelization, while minimizing the need for atomicity and communication as much as possible.

The only starter code I may utilize is my parallel BFS and some of the test graphs from Assignment 3. I plan on developing on the GHC machines, which have 8 cores that support hyperthreading (16 logical cores) and a GTX 1080. The GHC machines are perfect, since they are easily accessible, and allow me to test both my parallel CPU and GPU implementations. I will be referencing algorithms that I learned in 15-451, as well as some papers on parallel flow algorithms that I will cite as they are used.

For this project, I wish to take advantage of the parallelism available in modern CPUs and GPUs to obtain a significant speedup in certain max flow algorithms. In particular, I plan on implementing three max flow algorithms in three ways (sequential, parallel for CPUs, and parallel for GPUs using NVIDIA CUDA): Edmonds-Karp, Dinic's, and Push-relabel (Goldberg-Tarjan). The first of these, Edmonds-Karp, is parallelizable within each iteration, since it uses BFS to find shortest augmenting paths. Dinic's algorithm uses BFS as a subroutine as well, and uses the concept of a layered graph to obtain a better asymptotic runtime than Edmonds-Karp. Finally, the Push-relabel algorithm is inherently parallel over vertices, similar to bottom-up BFS, and is already considered one of the most efficient max-flow algorithms, so a large speedup is expected. After implementing all of this, I wish to benchmark the performance of these algorithms on various test graphs, looking both at how the algorithms compare against each other on the same platform and at the speedups obtained from sequential to CPU to GPU. Finally, I want to package these algorithms into a simple C++ library that allows users to easily create graphs and run these algorithms on them (with options to run on CPU or GPU).

If I have time afterwards, I would like to provide some example use cases of this library, such as finding bipartite matchings and doing image segmentation. If possible, I'd also like to have the library dynamically figure out which algorithm would be better to use based on the input graph's properties, instead of having the user choose the algorithm.

My demonstration will show the various speedup graphs described earlier, both on test graphs and, time permitting, on the practical examples I will create. The goal is to obtain significant speedups over the baseline sequential implementations for all of these algorithms, especially on larger graphs.

Schedule

End Date: Goals
Tue, 4/25 (Checkpoint): Determine graph representation, create testing and timing harnesses with graph generation, write sequential Edmonds-Karp and test correctness.
Tue, 5/2: Write sequential versions of the other two algorithms and test for correctness; benchmark times on the same graphs.
Fri, 5/5: Finish all CPU parallel implementations; benchmark against sequential code.
Tue, 5/9: Finish all GPU implementations and benchmarking.
Thu, 5/11: Create all speedup plots for the writeup, and create the C++ library.
Fri, 5/12: Extras (special examples, dynamic algorithm choosing), time permitting. Final writeup.

Checkpoint

The updated schedule is above, and includes everything I have completed for this checkpoint. In particular, I have created a basic testing framework and a system to randomly generate connected graphs for testing, and I have decided to go with an adjacency matrix representation for my graphs, where entry (i, j) gives the capacity of that edge. This takes a bit more memory, but makes accesses much quicker and more organized than my original plan of using an adjacency list representation. Finally, I have a sequential Edmonds-Karp implementation that I will use as a baseline correctness test for my other algorithms. I believe I am still on track to produce all of my deliverables, and my goals remain the same. If I stay on track, I will only have a day or two to implement the extras, so I am not sure whether those will happen. What I would want to show at the competition is still the same, namely the different speedup plots obtained. Because I have only set up the foundation of the project so far, I don't have any preliminary results. Currently I don't have any issues that are cause for concern.

Final

Summary

I implemented three max flow algorithms - Edmonds-Karp, Dinic's, and Push-relabel - both sequentially and in parallel using OpenMP, and analyzed the speedups obtained. In particular, I focused on a lock-free Push-relabel algorithm, and was able to obtain up to 6x speedup on tested graphs. I also parallelized this algorithm on GPUs with a single CUDA thread per vertex to benchmark performance, and observed poor performance, for reasons discussed later. Finally, I wrote a small class-based library to easily represent graphs and run max flow algorithms on them. Tests were run on GHC35, which has an 8-core hyperthreaded CPU and a GTX 1080 GPU. All tests used fixed, randomly generated connected graphs with a maximum edge capacity of 50, and results were averaged over multiple trials. Code can be found on my GitHub.

Example usage of the library:

FlowGraph graph(6);
std::vector<std::tuple<int, int, int>> edges;
edges.push_back(std::make_tuple(0, 1, 4)); // (from, to, capacity)
edges.push_back(std::make_tuple(0, 2, 2));
edges.push_back(std::make_tuple(1, 3, 3));
edges.push_back(std::make_tuple(2, 4, 3));
edges.push_back(std::make_tuple(3, 5, 2));
edges.push_back(std::make_tuple(4, 5, 4));
edges.push_back(std::make_tuple(3, 0, 5));
graph.AddEdges(&edges);
graph.DeleteEdge(3, 0);
auto result = graph.FlowEdmondsKarp(); // max flow value plus the edges carrying positive flow

Background

See the main "Background" section for motivation. All of these algorithms operate on a graph with specified edge capacities, and output the value of the max flow along with all edges that have positive flow pushed on them. The main challenge is that most flow algorithms are not inherently parallelizable. Many algorithms use the concept of a residual graph, which is essentially the graph of remaining capacities after flow is pushed during an iteration. Therefore, synchronization is needed between iterations - flow cannot be pushed on the residual graph if the residual graph has not yet been created. Algorithms that don't use residual graphs in this way, such as Push-relabel, instead require massive amounts of system-wide synchronization, as explained later. In addition, for general graphs there is not much data locality. This is especially true for the Push-relabel algorithm, which keeps track of height and excess flow values for each node that need to be constantly accessed by all of its neighbors. For all of these reasons, there is not much research on parallel max flow. Moreover, existing sequential algorithms are, for most practical uses, extremely fast.

The three algorithms I investigated were the following:

Edmonds-Karp

O(nm²), where n is the number of vertices and m the number of edges. Uses BFS as a subroutine to pick a shortest augmenting path in the residual graph at each iteration. Not parallelizable between iterations; only the BFS within each iteration can be parallelized.
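
For reference, here is a minimal sequential sketch, assuming the capacity-matrix representation described in the checkpoint (function and variable names are illustrative, not the library's internals):

#include <algorithm>
#include <climits>
#include <queue>
#include <vector>

// Illustrative sketch, not the library's implementation. cap[u][v] holds the
// residual capacity of edge (u, v); s is the source and t is the sink.
int MaxFlowEdmondsKarp(std::vector<std::vector<int>> cap, int s, int t) {
    int n = (int)cap.size();
    int total = 0;
    while (true) {
        // BFS on the residual graph to find a shortest augmenting path.
        std::vector<int> parent(n, -1);
        parent[s] = s;
        std::queue<int> q;
        q.push(s);
        while (!q.empty() && parent[t] == -1) {
            int u = q.front();
            q.pop();
            for (int v = 0; v < n; v++) {
                if (cap[u][v] > 0 && parent[v] == -1) {
                    parent[v] = u;
                    q.push(v);
                }
            }
        }
        if (parent[t] == -1) break;  // no augmenting path left: flow is maximum
        // Find the bottleneck capacity along the path, then push flow along it.
        int bottleneck = INT_MAX;
        for (int v = t; v != s; v = parent[v])
            bottleneck = std::min(bottleneck, cap[parent[v]][v]);
        for (int v = t; v != s; v = parent[v]) {
            cap[parent[v]][v] -= bottleneck;  // forward residual capacity shrinks
            cap[v][parent[v]] += bottleneck;  // reverse residual capacity grows
        }
        total += bottleneck;
    }
    return total;
}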

Dinic's

O(n²m). Uses BFS at each iteration to construct a layered graph (edges only go from depth d to depth d+1, measured from the source vertex). Then the algorithm uses DFS to find a blocking flow, i.e. a flow such that every source-to-sink path in the layered graph has a saturated edge (one with no residual capacity). Only the layered-graph construction is parallelizable.
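
A hedged sequential sketch of the two phases, again on a capacity matrix with illustrative names (the forward-the-minimum DFS trick follows reference 4 below):

#include <algorithm>
#include <climits>
#include <queue>
#include <vector>

// Illustrative sketch of Dinic's algorithm, not the library's implementation.
struct DinicSketch {
    int n;
    std::vector<std::vector<int>> cap;  // cap[u][v]: residual capacity of (u, v)
    std::vector<int> level, next_edge;

    // Phase 1 (BFS): build the layered graph; level[v] is the BFS depth of v
    // from the source in the current residual graph.
    bool BuildLevels(int s, int t) {
        level.assign(n, -1);
        level[s] = 0;
        std::queue<int> q;
        q.push(s);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (int v = 0; v < n; v++)
                if (cap[u][v] > 0 && level[v] == -1) {
                    level[v] = level[u] + 1;
                    q.push(v);
                }
        }
        return level[t] != -1;
    }

    // Phase 2 (DFS): push flow only along edges that go exactly one level
    // deeper, carrying the minimum residual capacity forward instead of
    // backtracking to compute it.
    int Augment(int u, int t, int pushed) {
        if (u == t) return pushed;
        for (int& v = next_edge[u]; v < n; v++) {
            if (cap[u][v] > 0 && level[v] == level[u] + 1) {
                int got = Augment(v, t, std::min(pushed, cap[u][v]));
                if (got > 0) {
                    cap[u][v] -= got;
                    cap[v][u] += got;
                    return got;
                }
            }
        }
        return 0;
    }

    int MaxFlow(int s, int t) {
        int total = 0;
        while (BuildLevels(s, t)) {  // one layered graph per phase
            next_edge.assign(n, 0);
            while (int pushed = Augment(s, t, INT_MAX)) total += pushed;
        }
        return total;
    }
};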

Push-relabel

O(n²m). Uses the idea of height labels and excess flow: if a node has excess flow, it tries to push it to a lower-height neighbor, or else tries to increase its own height. This algorithm is much more parallelizable (it can be parallelized over vertices), and as a result was the main focus of this project. Because of excess flows, the algorithm does not necessarily maintain a valid flow until the very end. A useful analogy for how it operates is water flowing through pipes.
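
A hedged sketch of the two core operations in the sequential version (names are hypothetical, not the library's):

#include <algorithm>
#include <vector>

// Illustrative sketch, not the library's implementation.
struct PushRelabelState {
    std::vector<std::vector<int>> cap;  // residual capacities
    std::vector<int> excess;            // excess flow stored at each vertex
    std::vector<int> height;            // height labels
};

// Push: move excess from u "downhill" along a residual edge to a neighbor v
// with height[v] == height[u] - 1, limited by the edge's residual capacity.
void Push(PushRelabelState& g, int u, int v) {
    int delta = std::min(g.excess[u], g.cap[u][v]);
    g.cap[u][v] -= delta;
    g.cap[v][u] += delta;
    g.excess[u] -= delta;
    g.excess[v] += delta;
}

// Relabel: if u has excess but no downhill residual edge, raise its height to
// just above its lowest residual neighbor so that it can push again.
void Relabel(PushRelabelState& g, int u) {
    int n = (int)g.cap.size();
    int min_height = 2 * n;
    for (int v = 0; v < n; v++)
        if (g.cap[u][v] > 0) min_height = std::min(min_height, g.height[v]);
    if (min_height < 2 * n) g.height[u] = min_height + 1;
}

// The sequential algorithm repeatedly picks any vertex (other than the source
// and sink) with positive excess and applies Push if possible, else Relabel,
// until no such vertex remains.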

Approach

All code was written in C++/CUDA, with CPU parallelism done through OpenMP. For both Edmonds-Karp and Dinic's, the only really parallelizable part of the algorithm is the BFS subroutine. Therefore, I used modified versions of my OpenMP parallel top-down BFS implementation from Assignment 3 in these algorithms to obtain speedups. Just as in the assignment, threads build local frontiers that are then merged together with fetch-and-add operations. Parallelizing these two algorithms was mainly done for analysis (see the "Results" subsection), rather than as a focus of the implementation itself. I originally intended to parallelize these algorithms on GPUs as well, but realized it would not be effective: the only parallelizable routine is BFS, which, due to irregular access patterns in general graphs and low arithmetic intensity, would not benefit from GPU parallelization (and would most likely see slowdowns).
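
As a rough illustration of that pattern, here is a hedged sketch of a single parallel top-down BFS step over the capacity matrix (names are my own for this writeup, not the assignment or library code):

#include <omp.h>
#include <vector>

// Sketch of one top-down BFS step on the residual graph: each thread expands
// part of the frontier into a private buffer, claims each discovered vertex
// with a compare-and-swap on its parent entry, then reserves a slot range in
// the shared next frontier with a fetch-and-add and copies its buffer in.
void BfsStep(const std::vector<std::vector<int>>& cap,
             const std::vector<int>& frontier,
             std::vector<int>& next_frontier,
             std::vector<int>& parent) {
    int n = (int)cap.size();
    int next_count = 0;
    next_frontier.resize(n);
    #pragma omp parallel
    {
        std::vector<int> local;  // per-thread local frontier
        #pragma omp for
        for (int i = 0; i < (int)frontier.size(); i++) {
            int u = frontier[i];
            for (int v = 0; v < n; v++) {
                if (cap[u][v] > 0 && parent[v] == -1 &&
                    __sync_bool_compare_and_swap(&parent[v], -1, u)) {
                    local.push_back(v);
                }
            }
        }
        // Reserve space in the shared frontier with a single fetch-and-add.
        int offset = __sync_fetch_and_add(&next_count, (int)local.size());
        for (size_t j = 0; j < local.size(); j++)
            next_frontier[offset + j] = local[j];
    }
    next_frontier.resize(next_count);
}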

My main focus was on parallelizing the Push-relabel algorithm. This algorithm can be parallelized across vertices. However, all vertices need to see global residual capacity and excess flow updates, as well as height relabelings. Each vertex also needs to know when to terminate – local termination (excess flow is 0) is not good enough, since a node could gain more excess flow later on. Therefore, most attempts to parallelize Push-relabel require a lot of global barriers and locks, which severely limits parallelism.

After some investigation, I came across research on a lock-free parallel Push-relabel algorithm. In particular, I based my implementation on a research paper by Bo Hong, cited below. The key insight is that instead of pushing flow to any neighbor with lower height, you only push flow to your lowest-height neighbor. With this small modification (and a few others), Hong proved correctness of asynchronous push and relabel operations. Therefore, in an idealized setting, you can have one thread per vertex that repeatedly tries to perform push and relabel operations as long as the node has excess flow. Since many global values still need to be updated, avoiding locks requires quite a few fetch-and-add operations. To determine global termination without the barriers that the original paper suggests (which would make the algorithm not truly lock-free), the key observation is that Push-relabel starts by pushing all possible flow out of the source node (the preflow stage), while the sink vertex initially has no incoming flow. As the algorithm proceeds, the net flow out of the source is monotonically decreasing, and the net flow into the sink is monotonically increasing, so a sufficient termination condition is that these two values are equal. Two global counters, updated with fetch-and-add operations, track these quantities, and each thread terminates only once the two counters are equal.
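
To make this concrete, below is a hedged sketch of one lock-free push/relabel attempt for a single vertex, following the description above. The struct, fields, and function names are hypothetical, the code uses GCC-style __sync builtins, and the ordering subtleties that Hong's correctness proof relies on are omitted:

#include <algorithm>
#include <vector>

// Illustrative sketch, not the exact implementation or Hong's pseudocode.
struct LockFreeState {
    int n, source, sink;
    std::vector<std::vector<int>> cap;  // residual capacities
    std::vector<int> excess;            // excess flow per vertex
    std::vector<int> height;            // height labels
    long long source_out;  // net flow out of the source (set by the preflow stage)
    long long sink_in;     // net flow into the sink (starts at 0)
};

void TryPushRelabel(LockFreeState& g, int u) {
    if (u == g.source || u == g.sink || g.excess[u] <= 0) return;
    // Find the lowest-height neighbor reachable through a residual edge.
    int best = -1, best_height = 2 * g.n;
    for (int v = 0; v < g.n; v++)
        if (g.cap[u][v] > 0 && g.height[v] < best_height) {
            best = v;
            best_height = g.height[v];
        }
    if (best == -1) return;
    if (g.height[u] > best_height) {
        // Push only to the lowest neighbor; shared updates use fetch-and-add.
        int delta = std::min(g.excess[u], g.cap[u][best]);
        __sync_fetch_and_add(&g.cap[u][best], -delta);
        __sync_fetch_and_add(&g.cap[best][u], delta);
        __sync_fetch_and_add(&g.excess[u], -delta);
        __sync_fetch_and_add(&g.excess[best], delta);
        // Keep the two monotonic termination counters up to date.
        if (best == g.sink) __sync_fetch_and_add(&g.sink_in, (long long)delta);
        if (best == g.source) __sync_fetch_and_add(&g.source_out, -(long long)delta);
    } else {
        // Relabel: lift u just above its lowest residual neighbor.
        g.height[u] = best_height + 1;
    }
}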

On the CPU, threads were mapped to chunks of vertices and were responsible for polling all of their vertices for non-zero excess flow. On the GPU, I experimented with actually having one thread per vertex.
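
The CPU driver then looks roughly like the following busy-polling loop (again a sketch using the hypothetical names above; in practice the two counter reads would also be atomic):

#include <algorithm>
#include <omp.h>

// Each OpenMP thread owns a static chunk of vertices and polls them for excess
// flow until the two monotonic counters meet (the global termination condition).
void RunLockFreePushRelabel(LockFreeState& g) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int chunk = (g.n + nthreads - 1) / nthreads;
        int begin = tid * chunk;
        int end = std::min(g.n, begin + chunk);
        while (g.source_out != g.sink_in) {
            for (int u = begin; u < end; u++)
                if (g.excess[u] > 0) TryPushRelabel(g, u);
        }
    }
}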

Results

Edmonds-Karp is more parallelizable than Dinic's

The numbers of nodes and edges of the ten fixed, randomly generated graphs used for testing are shown in the first row of plots. Larger graphs weren't used due to memory constraints, and to keep individual test runs from taking longer than a few minutes. These results made it evident that parallelizing BFS would likely provide a larger speedup for Edmonds-Karp than for Dinic's, since almost all of Edmonds-Karp's time is spent in BFS. These observations match the observed speedups (measured against the sequential baseline):

Edmonds-Karp achieved speedups of over 6x, generally increasing with graph size. Atomic fetch-and-add operations in BFS, as well as low arithmetic intensity, prevent much further speedup. Dinic's, however, did not exceed a 2x speedup on any tested graph. This is for a few reasons (besides the lower overall fraction of time spent in BFS). First, each time the BFS subroutine is run in Dinic's, the depth of the layered graph increases by at least 1, and this depth can be at most the number of vertices at termination. The larger the initial source-to-sink depth of the graph, however, the fewer times the layered graph has to be reconstructed; for instance, if the graph initially has an s-t depth of n-1, then only one layered graph is ever created. Therefore, the speedup obtained by parallelizing the BFS subroutine is limited by the initial s-t depth of the graph, which does not depend much on graph size. Besides this, sequential Dinic's is already one of the fastest flow algorithms for practical use, so squeezing out even more performance is difficult given the overhead associated with threads. In fact, Dinic's greatly outperformed both Edmonds-Karp and Push-relabel on all tested graphs.

Lock-free parallel Push-relabel: promising speedups

The speedups observed are promising, reaching up to 6x over sequential on the tested graphs. There is a fair amount of contention due to the large number of atomic fetch-and-add operations required, which limits further speedup. Some interesting issues arise with the Push-relabel algorithm, however. First, parallelism, and even sequential performance, varies greatly depending on the number of active (excess flow > 0) nodes at a given step, which depends not only on the initial structure of the graph but also on the ordering of push and relabel operations performed. As a result, speedup is not very linear in graph size (especially due to randomness), but still follows a general trend. Also note that smaller graph instances were tested here, since the algorithm generally took quite a while to run, especially on larger instances, which made data collection over multiple trials take too long. There are optimizations that I did not have time to investigate (FIFO orderings, dynamic trees, etc.) that improve the practical performance of Push-relabel.

Lock-free parallel Push-relabel in CUDA

In the spirit of Bo Hong's paper, I decided to see what would happen if you really did have one thread per vertex, instead of a single thread being responsible for a chunk of vertices. I did this in CUDA, and observed that on most graphs tested, GPU performance was at least 10x worse than sequential! There are quite a few reasons for this. First, you can't take advantage of shared memory for general graphs, since there are no regular access patterns and locality is generally quite poor. Second, I needed to use global memory to update heights and flows, since nodes outside of the thread block need this information. Third, and possibly worst of all, there is a ton of contention: |V| threads all trying to fetch-and-add on multiple values! Realistically, CUDA would work well on grid-structured image graphs for segmentation, and there is some research in this area. Shared memory can be used, and thread blocks can do wave flow pushes, with each thread first pushing flow column-wise and then row-wise within a block. Communication is limited as well, since only flow that is pushed across the border of a block needs to be communicated.
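
For reference, a hedged sketch of the one-thread-per-vertex CUDA kernel described above (the flattened array layout, names, and host-side relaunch loop are illustrative assumptions, not the exact code). Every shared update is an atomic on global memory, which is where the contention comes from:

// Illustrative sketch: one CUDA thread per vertex, all state in global memory.
__global__ void PushRelabelKernel(int n, int source, int sink,
                                  int* cap,            // n*n residual capacities
                                  int* excess, int* height,
                                  int* sink_in, int* source_out) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n || u == source || u == sink || excess[u] <= 0) return;

    // Find the lowest-height neighbor with residual capacity (global-memory reads).
    int best = -1, best_height = 2 * n;
    for (int v = 0; v < n; v++)
        if (cap[u * n + v] > 0 && height[v] < best_height) {
            best = v;
            best_height = height[v];
        }
    if (best == -1) return;

    if (height[u] > best_height) {
        // Push to the lowest neighbor; every update contends on a global atomic.
        int delta = min(excess[u], cap[u * n + best]);
        atomicSub(&cap[u * n + best], delta);
        atomicAdd(&cap[best * n + u], delta);
        atomicSub(&excess[u], delta);
        atomicAdd(&excess[best], delta);
        if (best == sink) atomicAdd(sink_in, delta);
        if (best == source) atomicSub(source_out, delta);
    } else {
        height[u] = best_height + 1;  // relabel
    }
}

// The host relaunches the kernel repeatedly and stops once *source_out == *sink_in
// (the counters are copied back and compared on the CPU between launches).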

References

1. 15-451 notes.

2. Wikipedia articles on Edmonds-Karp, Dinic's, and Push-relabel for sequential algorithm descriptions/information.

3. Geeksforgeeks.org for sequential algorithm descriptions/information.

4. This website for Dinic's sequential reference (in particular, the idea of passing forward the minimum flow as an argument in the DFS routine instead of backtracking to compute it).

5. A Lock-free Multi-threaded Algorithm for the Maximum Flow Problem, Bo Hong.