A systematic Characterization of Application Sensitivity to Network Performance

Yüklə 0,74 Mb.

Pdf görüntüsü

səhifə	22/51
tarix	15.10.2018
ölçüsü	0,74 Mb.
	#74178

1 ... 18 19 20 21 22 23 24 25 ... 51

3.1.2 Characteristics
3.2.2 gap
3.2.3 Latency

52
Å
Radb: This version of the radix sort [2] was restructured to use bulk messages. After the
global histogram phase, all keys are sent to their destination processor in one bulk message.
Depending on network characteristics, use of these bulk messages can speed up the perfor-
mance of the sort relative to the standard radix sort.
3.1.2
Characteristics
As summarized in Table 3.1, the applications represent a broad spectrum of problem do-
mains and communication/computation characteristics. To quantify the differences among our target
applications, we instrumented our communication layer to record baseline characteristics for each
program (with unmodiﬁed LogGP parameters) on 32 nodes. Table 3.2 shows the average number of
messages, maximum number of messages per node (as an indication of communication imbalance),
the message frequency expressed in the average number of messages per processor per millisecond,
the average message interval in microseconds, and the average interval between barriers as a mea-
sure of how often processors synchronize. Table 3.2 also shows the percentage of the messages using
the Active Message bulk transfer mechanism, the percentage of the total messages which are a read
request or reply, the average bandwidth per processor for bulk messages, and the average bandwidth
per processor for small messages. Note that the reported bandwidth is for bytes transmitted through
the communication layer as opposed to bandwidth delivered to the application, e.g., it includes head-
ers.
Table 3.2 shows that the communication frequency of our applications varies by more than
two orders of magnitude, and yet none of them are “embarrassingly parallel.” This disparity suggests
that it is quite difﬁcult to talk about typical communication behavior or sensitivity. Most of the ap-
plications have balanced communication overall, whereas others (Sample, P-Ray) have signiﬁcant
imbalances. Barnes and EM3D(write) are bulk synchronous applications employing barriers rela-
tively frequently. Barnes, Mur
Æ
, P-Ray, Radb and NOW-sort utilize bulk messages while the other
applications send only short messages. Finally, EM3D(read), Barnes, P-Ray, and Connect do mostly
reads, while the other applications are entirely write based. Applications doing reads are likely to be
dependent on network round trip times, and thus sensitive to latency, while write based applications
are more likely to be tolerant of network latency. Most of the applications demonstrate regular com-
munication patterns. However, Connect and P-Ray are more irregular and contain a number of hot
spots. While these applications do not constitute a workload, their architectural requirements vary
across large classes of parallel applications.

53
3.2
Analytic Models
3.2.1
Overhead
To develop insight into our experimental results, we develop a simple analytical model of
application sensitivity to added overhead. The model is based on the fact that added overhead is
incurred each time a processor sends or receive a message. Thus, given an processor’s base runtime,
ÇeÈ½ÉËÊ#Ì
, the added overhead,
ÍÏÎ
, and
Ð
, the number of communication events for each processor, we
expect runtime,
ÇÒÑÓÉ»ÔÖÕ
, to be:
Ç×ÑÓÉ»ÔÖÕÙØÚÇeÈ½ÉËÊÛÌÝÜPÞ
ÐßÍÏÎ
The factor of two arises because, for Split-C programs, all communication events are one
of a request/response pair. For each request sent, the processor will incur an overhead penalty receiv-
ing the corresponding response in addition to the overhead for the sent request. If the processor is
sending a response, it must have incurred an overhead penalty when it received the request message.
Given this model for the overhead sensitivity of individual processors, we extrapolate to
predicting overall application runtime by making the following simplifying assumptions. First, ap-
plications run at the speed of the slowest processor, and second, the slowest processor is the processor
that sends the most messages. Thus, by replacing
Ð
in the equation with the maximum number of
messages sent by a processor from Table 3.2, we derive a simple model for predicting application
sensitivity to added overhead as a function of the maximum number of messages sent by any pro-
cessor during execution.
The simple linear model presented above does not capture serial dependencies in the appli-
cation. Our overhead model will thus tend to under predict run time due to serialization effects. For
example, imagine a pipelined shift performed by all processors: processor zero sends to processor
one, which waits for the message then forwards it to processor two, etc. The maximum number of
messages sent by any processor is one, but the entire time of the operation is proportional to
à
. As
we inﬂate overhead by
ÍáÎ
, the total time for this operation will increase by
àpÍÏÎ
, not
ÍÏÎ
as predicted
by the model. Serial dependencies such as the one outlined above are highly application speciﬁc and
so we choose to use the a simple linear model. Section 3.3.1 quantiﬁes the model’s under prediction.
for all the applications.

54
3.2.2
gap
Developing a model for application sensitivity to gap presents more difﬁculties than de-
veloping the model for sensitivity to overhead. A processor is not affected unless it attempts to send
messages more frequently than the gap. At this point, the processor must stall for a period waiting
for the network interface to become available. Without more precise knowledge of inter-message
intervals, the model for gap sensitivity depends on assumptions made about these intervals. At one
extreme, the uniform model assumes that all messages are sent at the application’s average message
interval,
â
, from Table 3.2. In this case, the predicted runtime,
ã¡äæåeç
èÓé»êÖë
, can be predicted as a function
of total gap,
ì
, the average message interval,
â
, the base runtime,
ãeí8îÒï
ê
, and the maximum number
of messages sent by any node,
ð
:
ã
äñåeç
èÓé»ê½ëÝò
ó
ô
õ
ã
í8îÒï
ê÷ö
ðùøúìrûùâ%ü
if
ìþýGâ
ãeí8îÒï
ê
otherwise
At the other extreme, the burst model assumes that all messages are sent in discreet com-
munication phases where the application attempts to send as fast as the communication layer will
allow. Under this model, the added gap,
ÿì
, is incurred for each communication event. This second
model again assumes that the applications runs at the speed of the processor sending
ð
messages,
the maximum number of messages per processor from Figure 3.2, and would predict runtime,
ã
ä
í
ç
èÓé»ê½ë
,
as:
ã¡ä
í
ç
èÓé»êÖë
ò
ã
í8îÓï
ê
ö
ðßÿì
Application communication patterns determine which of the two models more closely pre-
dicts actual application runtime. The uniform model predicts that applications ignore increased gap
until reaching a threshold equaling the application’s average message interval. At this threshold, the
applications should slowdown linearly with increasing gap. The burst model predicts a linear slow-
down independent of average message interval.
3.2.3
Latency
Our model for latency is very simple. We simply model each synchronous Split-C read as
the cost of an round trip, i.e.
¡£¢
ö¥¤§¦
. Such a model, however, fails to account for any higher-level
dependencies, such as synchronization via split-phased operations. Many of the programs have care-
fully orchestrated communication phases and these use a variety of synchronization mechanisms, in-
cluding split phased reads, writes and barriers. The only application which performs many blocking

Yüklə 0,74 Mb.

Dostları ilə paylaş:

1 ... 18 19 20 21 22 23 24 25 ... 51