A systematic Characterization of Application Sensitivity to Network Performance


2.1 Experiment Design Philosophy
In the space of experimental design, this work uses application-centric metrics combined
with an emulation methodology. The basic approach is to determine application sensitivity to
machine communication characteristics by running a benchmark suite on a large system in which
the communication layer has been modified to allow the latency, overhead, per-message bandwidth
and per-byte bandwidth to be adjusted independently. This four-parameter characterization of
communication performance is based on the LogGP model [2, 29], the framework for our systematic
investigation of the communication design space. By adjusting these parameters, we can observe
changes in the execution time or throughput of applications on a spectrum of systems ranging
from current high-performance clusters to conventional LAN-based clusters.
We validate the emulation experiments with analytic models, which range from simple
frequency-cost pairs to simple queuing networks. A side benefit is that we can compare the
accuracy of the models against live systems. The absolute accuracy can serve as a guide for
future designers as to the applicability of analytic models to their situations.
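The text does not spell out the form of these models. As a minimal sketch of one plausible reading of a "frequency-cost pair" model, the predicted run time is the baseline run time plus, for each kind of event, its frequency multiplied by its added cost. The function name, argument names, and numbers below are illustrative assumptions, not taken from the source.

```python
def predicted_time(base_time, frequency_cost_pairs):
    """Predict run time as baseline plus the sum of frequency * added cost.

    base_time            -- measured run time on the unmodified system (seconds)
    frequency_cost_pairs -- iterable of (event count, added cost per event in seconds)

    This is an assumed form of a frequency-cost model, not the author's exact model.
    """
    return base_time + sum(freq * cost for freq, cost in frequency_cost_pairs)


# Made-up example: a 10 s run that sends 1,000,000 messages,
# each slowed by an extra 5 microseconds.
estimate = predicted_time(10.0, [(1_000_000, 5e-6)])
print(f"predicted run time: {estimate:.1f} s")
```

Comparing such a prediction against the run time measured under emulation is the kind of cross-check the paragraph describes.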
To both demonstrate the soundness of the methodology and draw general conclusions about
application behavior, we must have a representative application suite. While no suite can
possibly capture all application behavior, a diverse suite may capture the relevant structures
of a broad class of programs. Our suite includes a variety of parallel programs written in the
Split-C programming language, a subset of the NAS Parallel Benchmarks, and the SPECsfs benchmark.
2.2 LogGP Network Model
When investigating trade-offs in communication architectures, it is important to recognize
that the time per communication operation breaks down into portions that involve different machine
resources: the processor, the network interface, and the actual network. However, it is also
important that the communication cost model not be too deeply wedded to a specific machine
implementation. The LogGP model [2, 29] provides an ideal abstraction by characterizing the
performance of the key resources, but not their structure. A distributed-memory environment in
which processors physically communicate by point-to-point messages is characterized by four
parameters (illustrated in Figure 2.1).
$L$: the latency, or delay, incurred in communicating a message containing a small number of words
from its source processor/memory module to its target.
$o$: the overhead, defined as the length of time that a processor is engaged in the transmission or
reception of each message; during this time, the processor cannot perform other operations.
$g$: the gap, defined as the minimum time interval between consecutive message transmissions or
consecutive message receptions at a module; this is the time it takes for a message to cross
through the bandwidth bottleneck in the system.
$P$: the number of processor/memory modules.

[Figure: $P$ processor/memory modules (P M) attached to an interconnection network, annotated with
$L$ (latency), $o$ (overhead), $g$ (gap), and limited capacity ($L/g$ to or from a processor).]
Figure 2.1: LogGP Abstract Machine
The LogGP model describes an abstract configuration in terms of five performance parameters:
$L$, the latency experienced in each communication event, $o$, the overhead experienced by the
sending and receiving processors, $g$, the gap between successive sends or successive receives by
a processor, $G$, the cost-per-byte for long transfers, and $P$, the number of processors/memory
modules.
$L$, $o$, and $g$ are specified in units of time. It is assumed that the network has a finite
capacity, such that at most $\lceil L/g \rceil$ messages can be in transit from any processor or
to any processor at any time. If a processor attempts to transmit a message that would exceed
this limit, it stalls until the message can be sent without exceeding the capacity limit.
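The capacity rule can be illustrated with a small sketch. The class name, field names, and the parameter values below are hypothetical and chosen only to make the arithmetic easy to follow; the only thing taken from the text is the limit of ceil(L/g) messages in transit to or from a processor.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGPParams:
    """Short-message LogGP parameters (illustrative container; times in microseconds)."""
    L: float  # latency
    o: float  # overhead per send or receive
    g: float  # gap between consecutive messages at one module
    P: int    # number of processor/memory modules

    def capacity(self) -> int:
        """Maximum number of messages in transit to or from one processor."""
        return math.ceil(self.L / self.g)


def can_send(params: LogGPParams, in_transit: int) -> bool:
    """True if one more message stays within the capacity limit; otherwise the sender stalls."""
    return in_transit < params.capacity()


# Made-up values: L = 20 us and g = 5 us give a capacity of 4 messages.
example = LogGPParams(L=20.0, o=2.0, g=5.0, P=32)
print(example.capacity(), can_send(example, 3), can_send(example, 4))
```

With three messages already in flight the fourth may be sent; with four in flight the processor would stall until one is delivered.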
The simplest communication operation, sending a single packet from one machine to another,
requires a time of $L + 2o$. Thus, the latency may include the time spent in the network
interfaces and the actual transit time through the network, which are indistinguishable to the
processor. A request-response operation, such as a read or blocking write, takes time $2L + 4o$.
The processor issuing the request and the one serving the response are both involved for time
$2o$. The remainder of the time can be overlapped with computation or sending additional messages.

Platform         o (μs)   g (μs)   L (μs)   MB/s (1/G)
Berkeley NOW      2.9      5.8      5.0        38
Intel Paragon     1.8      7.6      6.5       141
Meiko CS-2        1.7     13.6      7.5        47
Table 2.1: Baseline LogGP Parameters.
This table shows the performance of the hardware platform used, the Berkeley NOW. Two popular
parallel computers, the Intel Paragon and the Meiko CS-2, are included for comparison.
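As a worked check of the one-way and request-response costs, the sketch below plugs the Berkeley NOW latency and overhead from Table 2.1 (5.0 μs and 2.9 μs) into the two expressions. The function names are illustrative, not part of the original text.

```python
def one_way_time(L, o):
    """Time to send a single packet: send overhead + latency + receive overhead."""
    return L + 2 * o


def request_response_time(L, o):
    """Time for a read or blocking write: two one-way messages back to back."""
    return 2 * L + 4 * o


def processor_busy_time(o):
    """Time each processor is occupied during a request-response: one send plus one receive."""
    return 2 * o


# Berkeley NOW values from Table 2.1, in microseconds.
L_now, o_now = 5.0, 2.9

round_trip = request_response_time(L_now, o_now)
busy = processor_busy_time(o_now)
print(f"one-way:          {one_way_time(L_now, o_now):.1f} us")  # about 10.8 us
print(f"request-response: {round_trip:.1f} us")                  # about 21.6 us
print(f"overlappable:     {round_trip - busy:.1f} us")           # about 15.8 us
```

Of the roughly 21.6 μs round trip, only about 5.8 μs occupies the requesting processor; the rest is available for computation or further sends.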
The available per-processor message bandwidth, or communication rate (messages per
unit time), is $1/g$. Depending on the machine, this limit might be imposed by the available
network bandwidth or by other facets of the design. In many machines, the limit is imposed by
the message processing rate of the network interface, rather than the network itself. Because
many machines have separate mechanisms for long messages, e.g., DMA, it is useful to extend the
model with an additional gap parameter, $G$, which specifies the time-per-byte, or the reciprocal
of the bulk transfer bandwidth [2]. In our machine, $G$ is determined by the DMA rate to or from
the network interface, rather than the network link bandwidth.
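The two reciprocals can be made concrete with the Berkeley NOW figures from Table 2.1 (a gap of 5.8 μs and a bulk bandwidth of 38 MB/s). This is a minimal sketch with illustrative function names, and it assumes 1 MB means 10^6 bytes, which the text does not state.

```python
def message_rate(g_us):
    """Peak messages per second per processor, the reciprocal of the gap g (given in us)."""
    return 1e6 / g_us


def time_per_byte(bandwidth_mb_per_s):
    """Per-byte gap G in microseconds per byte, the reciprocal of the bulk bandwidth.

    Assumes 1 MB = 10**6 bytes, so 1 MB/s equals 1 byte per microsecond.
    """
    return 1.0 / bandwidth_mb_per_s


# Berkeley NOW values from Table 2.1.
print(f"message rate: {message_rate(5.8):,.0f} messages/s")  # roughly 172,000
print(f"G: {time_per_byte(38):.4f} us/byte")                 # roughly 0.026
```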
The LogGP characteristics for the Active Message layer are summarized in Table 2.1. For
reference, we also provide measured LogGP characteristics for two tightly integrated parallel
processors, the Intel Paragon and Meiko CS-2 [31].
2.3 Apparatuses
In this section we describe the various apparatuses used. The experimental apparatus
consists of commercially available hardware and system software, augmented with publicly
available research software that has been modified to conduct the experiment. There are three
distinct variations of the same basic apparatus. All use an Active Messages [104] variant called
Generic Active Messages (GAM) [30]. The primary differentiator between the apparatuses is the
high-level transport layered on top of this basic messaging substrate: Split-C [28], MPI [75] or
TCP/IP. Each apparatus is used for one application suite: Split-C/AM for the Split-C
applications, MPI-GAM for the NAS Parallel Benchmarks, and TCP-GAM for the SPECsfs benchmark.
The Split-C and MPI apparatuses use the identical GAM Active Message layer. The Active Message
Layer for the TCP/IP
