Chapter 3. MapReduce for Skandium: The programming model
20
We expect that the integrated into Skandium, the skeleton will perform worse than a
system like Phoenix [5] or MRJ [9] which solely focus on the MapReduce skeleton. We
will discuss this problem in chapter 5 where we evaluate the skeleton’s performance.
3.2
The Programming Interface
The remainder of this section presents the muscles that are used by the programmer
to interact with the library and an example of using MapReduce to implement a word
count application.
3.2.1
The Splitter
The splitter is a muscle that implements the Skandium’s split interface and it is respon-
sible for splitting the input into multiple chunks, one chunk per mapper. The muscle
falls in the category of the partially generic muscles as it is not entirely application-
independent: For a word count application the split algorithm should ensure that the
original text is split at a word boundary, while for a histogram application, it should
ensure a 3-byte boundary split for bitmaps with 3 bytes per pixel. However, there are
many map reduce applications that share the same algorithmic pattern for the split op-
eration i.e. splitting at word boundary for word count and inverted index, splitting at
matrix row boundary for matrix multiplication and principal component analysis.
To take advantage of these common splitting patterns we have added a set of split
muscles in the library: The textSplit muscle that splits the input at word boundaries,
the byteSplit muscle that splits the input at N-byte boundaries and the matrixSplit that
splits a matrix at row boundaries. The TextSplit and the ByteSplit muscles take as input
an array of bytes while the matrixSplit muscle takes as input two, two-dimensional
arrays.
In each case, the output of the predefined splitters is a set of InputInterval objects
which are a new addition to the library to support the generic splitting of the input.
The InputInterval objects hold a reference to the original input and the start and end
indexes which indicate which part should be processed by each mapper. Passing the
pointer to the actual input to the mappers we avoid the expensive data copying between
the split and the map phase. As the splitter is a partially generic muscle, the user can
still provide a split muscle if the application requires an input split that is not covered
by the library, or if the input is other than those defined by the predefined splitters.
Chapter 3. MapReduce for Skandium: The programming model
21
3.2.2
The Mapper
The mapper is a muscle that implements Skandium’s execute Interface and defines the
functionality of the map function. Each mapper takes as input a split of the original
input and its output is a Collection of Key Value Pair objects. The Key Value Pair is a
class that we have added in the Skandium library and represents the key-value pair of
the MapReduce model. A Key Value Pair holds references to a key and a data which
can be arbitrary objects.
A collection is a Java interface that represents a group of elements. This interface
is implemented by several data structures in Java like the ArrayList, the LinkedList,
the Vector etc. Using this object at the mapper-library interface rather than a concrete
implementation increases flexibility. The user is free to use the most convenient or the
best performing implementing class for the problem in hand and pass that structure to
the system with no restrictions.
The listing below illustrates the mapper of a word count application. The muscle
iterates over the input interval (line 11) it makes words out of the characters of the
input (lines 12-17) and creates a new Key Value Pair passing the word as the key and
an Integer object with the value 1 as the value at the constructor (line 19). The mapper
adds each key to an ArrayList and when the computation ends, it returns the ArrayList
to the system (line 28).
1
p u b l i c
c l a s s Mapper i m p l e m e n t s
E x e c u t e ,
C o l l e c t i o n > {
3
p u b l i c
A r r a y L i s t
e x e c u t e (
5
I n p u t I n t e r v a l param ) t h r o w s E x c e p t i o n {
7
S t r i n g B u f f e r
b u f f e r =new S t r i n g B u f f e r ( ) ;
S t r i n g word= n u l l ;
9
A r r a y L i s t p a i r s =
new A r r a y L i s t () ;
11
f o r ( i n t
i = param . s t a r t ; i
c h a r ch = ( c h a r ) param . i n p u t 1 [ i ] ;
13
i f
( C h a r a c t e r . i s L e t t e r O r D i g i t ( ch ) ) {
b u f f e r . a p p e n d ( ch ) ;
15
} e l s e {
i f
( b u f f e r . l e n g t h ( ) >0) {
17
word = new S t r i n g ( b u f f e r ) ;
b u f f e r . d e l e t e ( 0 ,
b u f f e r . l e n g t h ( ) ) ;
Chapter 3. MapReduce for Skandium: The programming model
22
19
p a i r s . add ( new K e y V a l u e P a i r ( word , 1 ) ) ;
}
21
}
}
23
i f
( b u f f e r . l e n g t h ( ) >0) {
word = new S t r i n g ( b u f f e r ) ;
25
b u f f e r . d e l e t e ( 0 ,
b u f f e r . l e n g t h ( ) ) ;
p a i r s . add ( new K e y V a l u e P a i r ( word , 1 ) ) ;
27
}
r e t u r n p a i r s ;
29
}
}
3.2.3
The Reducer
The reducer is an execute muscle that implements Skandium’s execute interface and
defines the functionality of the reduce function. Each reducer takes as input a Collec-
tion of Combiners. The Combiner class has been added to the Skandium library to
represent a set of values grouped under their common key. The combiner object con-
sists of a key and a ArrayList of the emitted values associated with the same key. In
order to reduce the memory consumed by the combiner object, a list to hold the values
is created only when more than one values have been added to the combiner. If the
combiner holds only one value, it simply keeps the reference to that value and the list
reference is null. The reducer typically applies the reduction operation on the values
of each key.
The listing below illustrates the reducer of a word count application. The muscle
iterates over the collection of combiners (line 8) and for each combiner, if its list of
values is not null, it sets the counter to be equal to the size of the list (line 12). In
this way, it counts the total occurrences of a word as a value of 1 is emitted from the
mappers for each word occurrence. If the list of values is null, the the combiner must
hold only a single value and the counter of occurrences is simply set to one (line 14).
In the general case, the output of the reducer is an arbitrary object. However, if the
library-defined final merge muscle is also used, the output of the reducer should be a
collection of Key Value Pair objects as in our example. In this case, each emitted from
the reducer pair consists of a key and the reduced set values associated with the key.