List of Figures

2.1   The MapReduce programming model . . . . . . . . . . . . . . . . . 7
2.2   Skandium's internal architecture. Figure taken from [12] . . . . . 10
2.3   The Skandium Map skeleton . . . . . . . . . . . . . . . . . . . . 11
2.4   Example of a color histogram. Picture taken from [2] . . . . . . . 13
2.5   Example of KMeans clustering in the 2-d space. Picture taken from [1] . . . 14
3.1   A high-level view of MapReduce for Skandium . . . . . . . . . . . 19
4.1   Storing and partitioning the intermediate pairs: an initial approach . . . 27
4.2   The parallel store muscle . . . . . . . . . . . . . . . . . . . . . 28
4.3   A caching technique to resolve collisions . . . . . . . . . . . . . 30
4.4   The hashArray that is used by each store thread to store the intermediate pairs . . . 31
4.5   Phoenix's storing/partitioning scheme . . . . . . . . . . . . . . . 31
4.6   The merge phase in the case of 4 rows; the merge for a single column of the hash matrix is illustrated . . . 32
5.1   The efficiency of each of the four implementation schemes for System1 (left) and System2 (right) . . . 36
5.2   The speedup with increasing number of workers for System1 . . . . 38
5.3   The speedup with increasing number of workers for System2 . . . . 38
5.4   MapReduce vs. manual threading for word count on System1 (left) and System2 (right) . . . 40
5.5   MapReduce vs. manual threading for matrix multiply on System1 (left) and System2 (right) . . . 41
5.6   MapReduce vs. manual threading for inverted index on System1 (left) and System2 (right) . . . 42
5.7   MapReduce vs. manual threading for KMeans on System1 (left) and System2 (right) . . . 43
5.8   The garbage collector's impact on the performance of the four MapReduce applications . . . 45
6.1   The speedup with varying hash array size for word count on System1 (left) and System2 (right) . . . 47
6.2   The speedup with varying hash array size for matrix multiply on System1 (left) and System2 (right) . . . 48
6.3   The speedup with varying hash array size for inverted index on System1 (left) and System2 (right) . . . 48
6.4   The speedup with varying hash array size for KMeans on System1 (left) and System2 (right) . . . 49
6.5   The word count speedup with varying hash array size for a "hash+tree" and a "hash+sorted list" data structure . . . 50
6.6   The speedup as the heap size decreases, with and without the key pool, for word count (left) and inverted index (right) . . . 52
6.7   The speedup as the heap size decreases, with and without the key pool, for matrix multiplication . . . 53
6.8   The speedup as the heap size decreases, using a larger new generation, for word count (left) and inverted index (right) . . . 54
6.9   The speedup as the heap size decreases, using the parallel GC, for the word count application . . . 54
6.10  High-level view of the auto-tuning mechanism for MapReduce . . . . 57
6.11  The auto-tuning benefits compared to the unoptimized version . . . 58
List of Tables

5.1   The characteristics of the two systems that were used for the skeleton's evaluation . . . 34
5.2   The characteristics of the five applications that were used for the skeleton's evaluation . . . 35
Chapter 1
Introduction
Parallel microprocessors promise to fuel performance gains at a time when there are
diminishing returns on finding more instruction-level parallelism and the processor’s
clock speed is limited by factors like power consumption, heat and current leakage.
However, parallel computing is still not widely used because of the great complexity
of programming parallel machines. The programmer has to make decisions about the
distribution of processes over the processors of the system, the synchronization
between them, the communication patterns, and so on. All of these decisions have
a serious impact on the program's complexity and reliability.
Research in parallel architectures is expected to grow fast, and future parallel
machines will become more and more complex. Classic parallel programming techniques
will fail to address this complexity, which further stresses the need for structured,
high-level parallel programming models that abstract away the complexity of the
parallel machine and provide the user with a simple, machine-independent
programming paradigm.
Parallel programming skeletons [6] are a high-level programming model for parallel
and distributed computing. In this model, the programmer is presented with a
selection of algorithmic skeletons: higher-order functions or program templates,
each of which describes the outermost function in a program or the overall
structure of a computation, i.e. the meaning of an algorithm. Once the programmer
selects the skeleton that best describes the problem to be solved, he can plug in
the low-level, problem-specific code that turns the generic description of the
problem into the final program. The problem decomposition is implicitly defined by
the structure of each skeleton, and the runtime system automatically distributes the
user code segments across the nodes of the parallel machine. Moreover, the communication and
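The skeleton idea described above can be sketched in a few lines of Java. The following is a hypothetical illustration, not Skandium's actual API: a generic map skeleton whose partitioning and thread management are fixed by the template, while the programmer plugs in only the sequential "muscle" function `f`. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative map skeleton: the structure (fork one task per element,
// join the results in order) is generic; only the muscle f is user code.
public class MapSkeleton {

    public static <T, R> List<R> map(List<T> input, Function<T, R> f, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // The skeleton decomposes the problem: one task per input element.
            List<Future<R>> futures = new ArrayList<>();
            for (T item : input) {
                futures.add(pool.submit(() -> f.apply(item))); // plugged-in muscle
            }
            // Collect results, preserving input order.
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // The user supplies only the sequential function x -> x * x.
        List<Integer> squares = map(List.of(1, 2, 3, 4), x -> x * x, 2);
        System.out.println(squares); // [1, 4, 9, 16]
    }
}
```

The point of the sketch is the division of labor: the user never touches threads, synchronization, or work distribution; those decisions live inside the skeleton, mirroring how a skeleton library like Skandium separates structure from muscle code.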