|
| ||||||||||||||||
| Abstract: | ||||||||||||||||
|
The largest and fastest supercomputers in the top500 list deploy a scalable, three-dimensional torus or mesh interconnect. For these machines, the number of links (hops) traversed by a message has a direct effect on the time required to reach the destination. This is especially true in presence of bandwidth congestion, when multiple messages share links from source to destination. For large parallel machines with a significant diameter, this can become a serious performance bottleneck. Traditionally, application developers have neglected this fact because of the advantages of virtual cut-through and wormhole routing for most message sizes on small machines. This might not be true any longer due to the large diameters of machines. This research will demonstrate the effect of network contention on message latencies and propose and evaluate techniques to minimize communication traffic and hence, bandwidth congestion on the network. This will be achieved by topology-aware mapping of tasks in an application. By placing communication tasks on processors which are in physical proximity on the network, communication can be restricted to near neighbors. This reduces link-sharing among messages and leads to a better utilization of the available bandwidth. Our aim is to minimize hop-bytes, which is a weighted sum of the number of hops between the source and destination for all messages, the weights being the message sizes. This can minimize the communication time and hence, lead to significant speed-ups for parallel applications and in certain cases, also remove scaling bottlenecks. The research will involve developing a general automatic topology-aware mapping framework which takes the task graph and processor graph as input, and outputs near-optimal mapping solutions. | ||||||||||||||||
| Problem and Motivation: | ||||||||||||||||
|
The network topology of the largest and most scalable supercomputers today is a three-dimensional (3D) torus. Some examples are Cray's XT family (XT3, XT4 and XT5) and IBM's Blue Gene family (Blue Gene/L and Blue Gene/P). For large installations of such machines, the diameter of the network can be large (somewhere between 20 to 60 hops for Blue Gene/P and XT4.) This can have a significant effect on message latencies. When multiple messages start sharing network resources, this effect becomes more pronounced due to network congestion. Thus, it becomes necessary to consider the topology of the machine while mapping tasks to processors. This research will demonstrate that it is not wise to assume that message latencies are independent of the distance a message travels. For a number of years, this assumption has been supported by the advantages of virtual cut-through and wormhole routing, suggesting that the message latency is independent of the distance in absence of blocking [McKinley93]. When virtual cut-through or wormhole routing is deployed, message latency is modeled by the equation, where
Lf is the length of each flit, B is the link bandwidth, D is
the number of links (hops) traversed and L is the length of the message.
In absence of blocking, for sufficiently large messages (where
Lf << L), the first term is very small compared to the
second. But with large diameters of very large supercomputers, this is no
longer true for small to medium-sized messages.
Equation (1)
models message latencies in the absence of contention. In the case where
multiple messages share the same network links, the situation becomes more
complex. The bandwidth available per link per message is reduced since it
is now being shared. The phenomenon of network resource sharing leading to
contention can be explained with a simple example. Let us consider a 3D
torus network of size 8 X 8 X 8. The total number of uni-directional links
on the system is 512 X 6 = 3072. The diameter of this network is 4 + 4 + 4
= 12 and hence, if messages travel from one random node to another, they
will traverse 6 hops on average. Now, if we have four cores per node and
every processor sends a message at the same time, all these messages
require 512 X 4 X 6 = 12288 links in total and hence, every link will be
used for four messages on average. This leads to contention for each link
and therefore increases message latencies. Describing this scenario in
terms of bandwidth requirements, in order to operate at minimum latency,
we need four times the total raw bandwidth available. However, this is not
the case and thus the delivered bandwidth is one-fourth of the peak. The first part of
this research involves using simple MPI benchmarks for an extensive study
of message latencies and their dependence on distance (hops) on several
machines -- Cray's XT family, IBM's Blue Gene family and clusters like
Ranger. As we shall see in the Results section, in presence of contention,
the dependence of message latencies on hops becomes quite significant,
especially for large-sized messages. Hence, it is important to consider
topology of the machine, especially of 3D torus/mesh interconnects, to
obtain the best performance. We will also
demonstrate, with simple benchmarks and production codes, that topology
mapping can significantly improve performance and scaling of
communication-bound applications. The metric we will employ to assess the
success of topology-aware schemes will be hop-bytes. Hop-bytes are
the weighted sum of the number of hops between the source and destination
for all messages, with the weights being the message size. This metric
gives an indication of the total communication traffic on the network due
to an application. Reducing the total hop-bytes reduces link sharing and
contention, thereby keeping message latencies close to the ideal. Building on our
successes at mapping individual applications, we aim to develop an
automatic, topology-aware mapping framework. Given the communication
information of an application and topology of a machine, this framework
will produce intelligent mappings to minimize communication traffic. The
mapping problem can be reduced to the graph-embedding problem, which is
NP-complete. Hence, we will develop a set of heuristics, each dealing with
a different scenario, which together will be able to deal with most
parallel applications and their communication requirements. In an effort
towards making mapping techniques scalable to very large machines, we will
also discuss strategies for completely distributed and hybrid mapping
strategies.
| ||||||||||||||||
| Background and Related Work: | ||||||||||||||||
|
The problem of topology-aware mapping has been studied extensively and proved to be NP-complete [Bokhari81, Ercal87]. Pioneering work in this area was done by Bokhari in 1981, where he used pairwise exchanges between nodes to arrive at good solutions [Bokhari81]. Most of the techniques developed in the 80s took a long time to arrive at the solution and hence, cannot be used for a time-efficient mapping during runtime. They are almost never used in practice. Heuristic techniques such as pairwise exchanges are theoretical studies with no results on real machines. Also, most of these techniques (heuristics techniques especially) were developed specifically for hypercubes, shuffle-exchange networks or array processors. Recent emergence of very large parallel machines has led to the necessity of topology mapping again. Most work from the 80s cannot be used in the present context because of unscalable techniques and different topologies from the ones being used today. As mentioned earlier, increasing effect of the number of hops on message latencies has fuelled such studies again. To the best of our knowledge, there has been no published research reporting network contention or quantifying it for the Cray XT family. This research is pioneering work on topology-aware research on the Cray machines. One contribution of this work is an API for providing topology information on Cray machines and demonstrating that topology-aware mapping can lead to improvements on Cray machines also. Contrary to Cray, IBM systems like Blue Gene/L and Blue Gene/P acknowledge the dependence of message latencies on distance and encourage application developers to use topology of these machines to their advantage. On Blue Gene/L, there is a 89 nanoseconds per hop latency attributed to the torus logic and wire delays. This fact has been used both by system developers [Walkup04, Petrini04] and application developers to improve performance on Blue Gene/L [Bhanot05, Bhatele08a]. Bhanot et. al have developed a framework for optimizing task layout on BG/L. Since it uses simulated annealing, it is quite slow and the solution is developed offline. | ||||||||||||||||
| Uniqueness of the Approach: | ||||||||||||||||
|
This work is among the first to discuss the effects of contention on Cray and IBM machines and to compare across multiple architectures. We believe that the set of benchmarks we have developed for quantifying message latencies would be useful for the HPC community to assess latencies on a supercomputer and to determine the message sizes for which number of hops makes a significant difference. The effective bandwidth benchmark in the HPC Challenge benchmark suite measures the total bandwidth available on a system but does not analyze the effects of distance or contention on message latencies. Our experience in developing mapping algorithms for production codes and insights discussed in this work will be useful to individual application writers trying to scale their codes to large supercomputers. We believe that the proposed automatic mapping framework will be applicable to a wide variety of communication scenarios and will relieve the application writers from the burden of finding correct mapping solutions for their codes. Unlike most of the previous work, this research handles both cardinality and topological variations in the graphs. It also handles various architectures and different kinds of communication scenarios (through the use of heuristics). Therefore, it will be useful to a large body of applications running on large parallel machines. | ||||||||||||||||
| Results and Contributions: | ||||||||||||||||
|
Contention
Studies: We wrote a MPI benchmark called WICON (With Contention) to
quantify message latencies in presence of contention, which is a regime
not handled by the basic model of wormhole routing discussed earlier. In
this benchmark, all MPI tasks are grouped into pairs and the smaller rank
in the pair sends messages of size B bytes to its partner and awaits a
reply. All pairs do this communication simultaneously. The average time
for the message sends is recorded for different message sizes. To quantify
the effect of hops on message latencies this benchmark is run in two
modes:
Figure 1 shows
the results of running WICON in the NN and RND modes on Blue Gene/P and
XT3. The first plot shows the results of WICON on 4,096 cores of BG/P. It
is clear that the random-processor (RND) latencies are more than the
near-neighbor (NN) latencies (by a factor of 1.75 for large messages.)
This is expected based on the assertion that hops have a significant
impact on the message latencies in the presence of contention, which
increases with larger messages because of a proportional increase in
packets on the network. Similar experiments were repeated on XT3 to
understand the effects of contention on Cray XT machines. The second plot
in Figure 1 presents the results for WICON benchmark on 2,048 cores of
XT3. We see a significant difference between the NN and RND lines (a
factor of 2.25 at 1 MB messages which is greater than that on BG/P.) This
is not unexpected and a quantum chemistry code has shown huge benefits (up
to 40%) from topology-aware mapping on XT3 [Bohm07]. The
benchmark in the previous section injects random contention on the
network. To quantify the effects of contention under controlled
conditions, WICON was modified to conduct a controlled experiment.
Again, all ranks are divided into pairs but now the pairs are chosen
such that they are a fixed number of hops, say n, away from
each other. All pairs send messages simultaneously and the average
time for message sends of different sizes for varying hops is
recorded. Pairs are chosen only along one dimension of the torus, in
this case, the Z dimension. Figure 2
shows the results of running the WICON2 benchmark on Blue Gene/P. On
each plot there are several lines, one each for a specific pairing
which is n hops away. The tests were done on a torus of
dimensions 8 X 8 X 16. Since messages are sent along Z, maximum
number of hops possible is 8 and hence there are 8 lines on the
plot. The Blue Gene/P plot on the right shows that the message
latencies for large messages for the 1 hop and 8 hops case can
differ by a factor of 8! Figure 2.
Plots showing the results of WICON2 on BG/P As all
messages travel more hops, links are shared by a greater number of
messages, increasing the contention on the network and decreasing
the available effective bandwidth. This is what applications have to
deal with during communication in practice. This huge difference
between message latencies indicates that it is very important to
keep communicating tasks close by and minimize contention on the
network [Bhatele09a]. This is especially true for communication
bound applications. Next, we look at mapping successes for two
applications which have different communication
characterstics. Dynamic
Irregular Communication: NAMD is a production Molecular Dynamics (MD)
application used for simulation of bio-molecules [Bhatele08b]. NAMD is
parallelized by use of Charm++ objects called patches and computes. The
simulation box is spatially divided into smaller cells called patches. The
force calculation for every pair of patches is assigned to a different
compute. Thus, communication in NAMD consists of section multicasts from
patches to computes and back. Every patch multicasts its atom data to
multiple computes, whereas each compute receives data from only two
patches. Patches are statically assigned to a few processors during
start-up and computes are distributed evenly by a load balancer. Let us
now see the deployment of topology-aware techniques in the static
placement of patches and the load balancers. Topology
placement of patches: Since patches form a geometric decomposition of
the simulation space, they constitute a 3D group of objects which can be
mapped nicely onto the 3D torus of machines. An ORB (Orthogonal Recursive
Bisection) of the torus is used to obtain partitions equal in number to
the patches and then, a one-to-one mapping of the patches to the processor
partitions is done. Topology-aware
Load Balancers: Once patches have been statically assigned onto the
processor torus, computes which interact with these patches should be
placed around them. Consider Figure 3(a), which shows the entire 3D torus
on which the job is running. When placing a compute, it should be placed
topologically close to the two processors that house the patches it
interacts with. The two patches define a smaller brick within the 3D torus
(shown in dark grey in the figure). The sum of distances from any
processor within this brick to the two patches is minimum. Figure 3(b) shows
the hop-bytes for all messages per iteration when running NAMD on
Blue Gene/P on different sized partitions. A standard benchmark used in
the MD community was used for the runs: 92,227-atom ApoLipoprotein-A1
(ApoA1). As we would expect, hop-bytes consistently increase as we go from
a smaller partition to a larger one. The three strategies compared are:
topology oblivious mapping of patches and computes (Topology Oblivious),
topology-aware static placement of patches (TopoPlace Patches) and
topology-aware placement for both patches and load balancing for computes
(TopoAware LDBs). Topology aware
schemes for the placement of patches and the load balancer help in
reducing the hop-bytes for all processor counts. Also, the decrease in
hop-bytes becomes more significant as we go to larger-sized partitions.
This is due to the fact that the average distance traveled by each message
increases as we increase the partition size in the case of default
mapping; however, it becomes controlled when we do a topology-aware
mapping. Since the actual performance of the load balancers depends on
several metrics, the question remains as to whether the reduction in
hop-bytes leads to an actual improvement in performance. As it turns out,
we also see a reduction in the number of proxies and in the
max-to-average ratio for topology-aware load balancers, which is
reflected in the overall performance of NAMD on Blue Gene/P (see table
above). The topology oblivious scheme stops scaling around 4,096 cores and
hence, we did not obtain numbers for it beyond that. We see an improvement
of 10% at 16,384 cores with the use of topology-aware load balancers. For
further details, please refer to [Bhatele09b]. Static Regular
Communication: OpenAtom is a fine-grained parallelization of the
CPAIMD method to understand dynamics of atoms at a quantum scale [Bohm07].
Computation in OpenAtom is divided into a large number of objects,
enabling scaling to tens of thousands of processors. Calculating the
electrostatic energy involves computing several terms. Hence, CPAIMD
computations involve a large number of phases with high inter-processor
communication. These phases are discretized into a large number of
objects, which generate a lot of communication, but ensures efficient
interleaving of work. The entire computation is divided into ten phases,
which are parallelized by decomposing the physical system into fifteen
chare arrays. Since multiple
chare arrays interact among one another, the communication dependencies
are complex and mapping is a challenging task. OpenAtom provides us with a
scenario where the load on each object is static (under the CPAIMD method)
and the communication is regular and clearly understood. Hence, it should
be possible to intelligently map the arrays in this application to
minimize inter-processor communication and maintain load balance. Because
of space limitations and a fairly involved mapping acheme, we will not go
into the details of how the mapping is done but present performance
results. We studied the
strong scaling (fixed problem size) performance of OpenAtom with and
without topology aware mapping. Two benchmarks commonly used in the CPMD
community: the minimization of WATER_32M_70Ry and WATER_256M_70Ry were
used. As shown in the table on the left, performance improvements from
topology-aware mapping for Blue Gene/P (BG/P) can be quite significant. As
the number of cores and likewise, the diameter of the torus grows, the
performance impact increases until there is 40% improvement for
WATER_32M_70Ry at 4096 and 50% for WATER_256M_70Ry at 8192 cores. The
improvements from topological awareness on Cray XT3, presented in the
table on the right, are comparable to those on BG/P. There is an
improvement of 20% on XT3 for WATER_256_70Ry at 1024 cores, compared to
the improvement of 38% on BG/P at 1024 cores. Contributions: Significant work was done in the 80s on
topology-aware mapping but it was directed towards interconnect topologies
like hypercubes and shuffle-exchange networks which are not used in
practice today. Work done in those years was directed towards machines
consisting of a few hundred processors. The proposed work is highly
relevant for the machines of the petascale era, such as Blue Gene and XT.
Apart from MPI applications, it is directed towards applications with
virtualization (having multiple objects per physical processor and
multiple object graphs), which has not been explored previously. As
opposed to earlier research in this area, this work is directed towards
high scalability and fast runtime solutions. This research
demonstrates that topology-aware mapping is important for communication
bound applications. Results quantifying message latencies in presence of
contention will be useful to application writers trying to optimize their
codes. Application-specific techniques discussed here are being used in
production codes, NAMD and OpenAtom, and have helped scientists in getting
their results faster. The TopoManager API which obtains the topology
information about machines at runtime is already being used by other
application groups (such as the US Lattice QCD collaboration). It is
especially useful because information on Cray machines is not readily
available and this API provides a single wrapper for different machines.
We hope that this API will be eventually used to implement MPI_Cart_create
and other virtual topology functions on Cray machines. The final goal of
this research is to utilize the experience from application-specific
mapping, in developing an automatic framework which can develop
topology-aware mapping solutions. The work on the automatic mapping
framework will relieve the application writers of doing the mapping
themselves. We also foresee acceptance of the idea that applications
running on Cray XT machines will benefit as much as those on Blue Gene
machines. This might even led to modification of the batch schedulers on
these machines to allocate contiguous 3D mesh partitions for
jobs. | ||||||||||||||||
| References: | ||||||||||||||||
|
[McKinley93] Lionel M. Ni and Philip K. McKinley, A Survey
of Wormhole Routing Techniques in Direct Networks, IEEE Computer,
26(2): 62-76, 1993. | ||||||||||||||||