A systematic Characterization of Application Sensitivity to Network Performance

Yüklə 0,74 Mb.

Pdf görüntüsü

səhifə	17/51
tarix	15.10.2018
ölçüsü	0,74 Mb.
	#74178

1 ... 13 14 15 16 17 18 19 20 ... 51

37
would be useful as a validation of the LogGP model.
The most important class of factor-level combinations missing from this work are that of
real machines. For example, we could easily emulate a machine running a TCP/IP stack on 100 Mb
Ethernet. This class of machines will be quite important in the future, as it represents a very cost-
effective point in the design space. Likewise, our apparatus could emulate a 622Mb ATM cluster
of workstations. Running experiments on these common designs would not only provide additional
validations of our experiments, it would allow a near direct comparison of the cost-performance of
these technologies (e.g. ATM vs. Ethernet). However, although such point-comparisons are quite
useful, many other studies already have performed such comparisons [1, 49, 61]. For the purposes
of this thesis, we view the uniqueness of isolating the impact of each parameter as more important
that making point-wise comparisons of different technologies.
2.5
Other Models
The LogGP model is not the only model available to describe program behavior. In this
section, we explore other models. We ﬁrst describe the models, as well as relate their purposes,
strengths and weaknesses in the context of this thesis. We also describe why this thesis uses the
LogGP model.
Most of the models presented here were developed in the context of parallel program de-
sign, must be considered in light of that purpose. Previous to the invention of the models presented
here, many parallel programs were written using the Parallel Random Access Machine (PRAM)
model. The problem with the PRAM model is that it ignores real machine characteristics and thus
encourages algorithm designers to exploit characteristics of the PRAM machine that result in poor
performance on real machines.
Although a reaction the unrealistic assumptions of the PRAM model, many of the models
explored in this section do not have parameters that correspond well to a variety of real machine com-
ponents. Out of a fear that additional parameters will make the models cumbersome to use, many of
the models attempt to capture the limitations of real machines in only 1 or 2 parameters. The pur-
pose of the parameters is to encourage algorithm designers to creating algorithms that run well on
real machines; the parameters are not intended to model the machines per say. The lack of archi-
tectural focus is not a fault of the models, but limits their applicability for understanding the role of
machine architecture on application performance.

38
2.5.1
Bulk Synchronous Parallel
The Bulk Synchronous Parallel (BSP) model [101] attempts to bridge theory and practice
with an very simple model. Computation is divided into a number of “supersteps” that are separated
by global barriers. Each superstep must be at least
¤
units long, and messages from one step cannot
be used until the next step. Like the LogGP model, BSP does not model any speciﬁc topology. In
addition to the restriction that a superstep must last at least
¤
time units, messages can only be sent
or received at an interval of
¥
during a given step.
Although proven useful for algorithm design, the BSP’s restricted architectural focus would
greatly limit this thesis. The architectural parameters are only
¤
and
¥
. In a real programs
¦
and
§
are often limitations that would be missed by the BSP model.
In addition, it would be difﬁcult to categorize all parallel programs into the BSP frame-
work. Most challenging are the supersteps. For example, several of the applications in this thesis
use task-queue models that do not ﬁt well into the BSP framework. The task-queue programs have
only a few long phases where communication is overlapped with computation; in some programs
there is only a single phase [97]! Thus, although a program written in the BSP framework will make
efﬁcient use of machine resources, the model will not reveal much about application behavior or
architectural characteristics.
2.5.2
Queue Shared Memory
Like BSP, the Queue Shared Memory (QSM) model is also designed for parallel program-
ming [46]. This model extends the PRAM model to include queuing effects at shared memory loca-
tions. The model divides memory into local and global portions. Operations on the global memory
take unit time, unless they operate on the same location, in which case they are queued and serviced
at a rate of
¥
. Thus, the only two parameters in this model are
¥
and
¨
, and the only network param-
eter is
¥
.
A major ﬂaw from an architectural perspective with this model is that the
¥
parameter is on
a per-word basis. The QSM developers correctly note that a machine with many memory banks per
processor can adequately approximate such a model, and use the Cray vector supercomputers as an
examples. However, as technology trends move more memory capacity per chip [50], it will become
increasing difﬁcult at an architectural level to sustain many memory banks. This ﬂaw in the QSM
model could be easily corrected by modifying the model to a point where
¥
is sustained for every
global memory operation, which would encourage algorithm designers to avoid remote memory as

39
much as possible.
2.5.3
LoPC and LoGPC
The LoPC and LoGPC models [43, 78] extend the basic LogP model with contention pa-
rameters. The original paper, [43] added the
©
parameter to model contention effects at the end-
points. In large parallel machines with all-to-all patterns, these contention effects can result com-
munication times that are 50% greater than predicted by a straightforward LogGP model [39]. Note
that these contention effects get worse with increasing overhead.
The LoGPC model extends the contention effects into the network. However, in order to
model these effects, the work makes certain assumptions about the network. Although the model
captures a large class of networks by abstracting k-ary d-cubes, other topologies, such as certain
expander networks, are not modeled. Still, for low-order dimension-routed networks, network con-
gestion can be a serious limitation.
Of all the models presented here, LoPC is perhaps the most interesting for the purposes of
this thesis. Contention effects can become a signiﬁcant fraction of the time on large machines. This
observation is not new [84], but the recent work on LoPC and LoGPC better quantify this effect.
However, as was shown in [39], as well as in the LoPC model itself, for the range of ma-
chines of interest to this study, (16-32 nodes) the endpoint contention effects are minimal. For ex-
ample, the predicted vs. actual communication cost of the radix sort program on a 32 node CM-5
differed by only 9%. Only at larger machines sizes (256 and 512 nodes) do contention effects dom-
inate, resulting in predictions that are up to 50% inaccurate. However, the standard LogGP model is
quite suitable for this study because of the machine sizes used.
2.5.4
Queuing Theory
In the world of LAN and WAN networks, queuing theoretic approaches dominate [57, 63,
65]. The problem with using this class of models for parallel programs is that the basic assumptions
about program behavior do not mesh well with what many parallel programs do. For example, queu-
ing theoretic models assume that the output of a program can be modeled by some stochastic process.
That is, network data is generated as some random function of time. Often, poisson processes are
used because the mathematics remain tractable while constructing “realistic” models. In addition,
much of queuing theory is built on the assumption that the system is in steady-state operation. This
allows the modeler to close the resulting system of equations. The nature of many parallel programs,

Yüklə 0,74 Mb.

Dostları ilə paylaş:

1 ... 13 14 15 16 17 18 19 20 ... 51