uint lid [[thread_position_in_threadgroup]],
uint lsize [[threads_per_threadgroup]],
uint simd_size [[threads_per_simdgroup]],
uint simd_lane_id [[thread_index_in_simdgroup]],
uint simd_group_id [[simdgroup_index_in_threadgroup]])
{
// perform first level of reduction
// read from device memory, write to threadgroup memory
int val = input[gid] + input[gid + lsize];
for (uint s=lsize/simd_size; s>simd_size; s/=simd_size)
{
// perform per-SIMD partial reduction
for (uint offset=simd_size/2; offset>0; offset/=2)
val += simd_shuffle_down(val, offset);
// write per-SIMD partial reduction value
// to threadgroup memory
if (simd_lane_id == 0)
ldata[simd_group_id] = val;
// wait for all partial reductions to complete
threadgroup_barrier(mem_flags::mem_threadgroup);
val = (lid < s) ? ldata[lid] : 0;
}
// perform final per-SIMD partial reduction to
// calculate the threadgroup partial reduction result
for (uint offset=simd_size/2; offset>0; offset/=2)
val += simd_shuffle_down(val, offset);
// atomically update the reduction result
if (lid == 0)
atomic_fetch_add_explicit(output, val, memory_order_relaxed);
}
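The shuffle-down reduction loops above can be sketched on the CPU. The following is an illustrative C++ simulation (not Metal): the hypothetical helper shuffle_down_add applies the effect of val += simd_shuffle_down(val, offset) to every lane at once, and simd_reduce folds a SIMD-group's values down so that lane 0 holds the sum.

```cpp
#include <cstddef>
#include <vector>

// Simulates `val += simd_shuffle_down(val, offset)` across all lanes at once.
// Lanes that would read past the end of the SIMD-group add 0; in Metal that
// data is undefined, but the reduction only consumes lane 0, so it never
// affects the result.
std::vector<int> shuffle_down_add(std::vector<int> val, unsigned offset) {
    std::vector<int> out(val.size());
    for (std::size_t lane = 0; lane < val.size(); ++lane) {
        int other = (lane + offset < val.size()) ? val[lane + offset] : 0;
        out[lane] = val[lane] + other;
    }
    return out;
}

// Folds one SIMD-group of values: halve the stride each step, as in the
// per-SIMD reduction loop of the kernel above.
int simd_reduce(std::vector<int> val) {
    unsigned simd_size = static_cast<unsigned>(val.size());
    for (unsigned offset = simd_size / 2; offset > 0; offset /= 2)
        val = shuffle_down_add(val, offset);
    return val[0]; // lane 0 holds the SIMD-group sum
}
```

After log2(simd_size) steps every partial sum has been folded into lane 0, which is why only simd_lane_id == 0 writes the per-SIMD result to threadgroup memory.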
5.14 Quad-group Functions
A quad-group function is a SIMD-group function (see section 5.13) with an execution width of
4. Quad-group functions (listed in Table 35) are supported by kernel and fragment functions.
2017-9-12 | Copyright © 2017 Apple Inc. All Rights Reserved.
T is one of the scalar or vector integer or floating-point types.
In a kernel function, quads divide across the SIMD-group. In a fragment function, the lane id
represents the fragment location in a 2 x 2 quad as follows:
• lane id 0: upper-left pixel
• lane id 1: upper-right pixel
• lane id 2: lower-left pixel
• lane id 3: lower-right pixel
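Assuming quads are aligned to even window coordinates, the 2 x 2 layout above maps a fragment at pixel (x, y) to lane (y & 1) * 2 + (x & 1). The following helper is illustrative only (not a Metal API):

```cpp
// Hypothetical helper: derive the quad lane ID for a fragment at window
// coordinates (x, y) under the 2 x 2 layout above, assuming quads are
// aligned to even pixel coordinates.
unsigned quad_lane_id(unsigned x, unsigned y) {
    // The low bit of y selects the row (upper/lower); the low bit of x
    // selects the column (left/right).
    return (y & 1u) * 2u + (x & 1u);
}
```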
Let's take a look at examples that start with the following quad-group:
quad_shuffle_up() shifts each quad-group up by the delta number of threads. If delta is 2, the resulting computed quad lane IDs are shifted down by 2, as seen below. Negative values for computed quad lane IDs indicate invalid IDs. The computed quad lane IDs do not wrap around, so the data for the lower invalid quad lane IDs remains unchanged.
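This behavior can be sketched as a CPU simulation; the function below is illustrative, not a Metal API. Each lane reads from the computed quad lane ID (lane - delta), and lanes whose computed ID is negative keep their original data, matching the description above.

```cpp
#include <array>

// Illustrative CPU simulation of quad_shuffle_up(): each lane reads from
// the computed quad lane ID (lane - delta). Negative computed IDs are
// invalid, so those lanes keep their original data.
std::array<char, 4> quad_shuffle_up_sim(std::array<char, 4> data, int delta) {
    std::array<char, 4> out = data; // invalid lanes stay unchanged
    for (int lane = 0; lane < 4; ++lane) {
        int src = lane - delta; // computed quad lane ID
        if (src >= 0)
            out[lane] = data[src];
    }
    return out;
}
```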
Similarly, quad_shuffle_down() shifts each quad-group down by the delta number of threads. Starting from the original quad-group, if delta is 2, the resulting computed quad lane IDs are shifted up by 2, as seen below. Computed quad lane IDs greater than or equal to the quad-group size indicate invalid IDs. The computed quad lane IDs do not wrap around, so the data for the upper invalid quad lane IDs remains unchanged.
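The mirror-image simulation for quad_shuffle_down() is sketched below (again illustrative, not a Metal API): each lane reads from lane + delta, and lanes whose computed ID falls past the end of the quad-group keep their original data.

```cpp
#include <array>

// Illustrative CPU simulation of quad_shuffle_down(): each lane reads from
// the computed quad lane ID (lane + delta). Computed IDs >= 4 are invalid,
// so those lanes keep their original data.
std::array<char, 4> quad_shuffle_down_sim(std::array<char, 4> data, int delta) {
    std::array<char, 4> out = data; // invalid lanes stay unchanged
    for (int lane = 0; lane < 4; ++lane) {
        int src = lane + delta; // computed quad lane ID
        if (src < 4)
            out[lane] = data[src];
    }
    return out;
}
```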
T quad_shuffle_xor(T data,
ushort mask)
Returns the value of data specified by the
thread whose quad lane ID is computed by
performing a bitwise XOR of the caller's quad
lane ID and mask. The value of mask must be
the same for all threads in a quad-group;
otherwise the behavior is undefined.
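Because the computed quad lane ID is lane ^ mask, every mask in the range 0..3 yields a valid ID, and the operation pairs lanes symmetrically (mask 1 swaps horizontal neighbors, mask 2 swaps vertical neighbors). A short illustrative CPU simulation, not a Metal API:

```cpp
#include <array>

// Illustrative CPU simulation of quad_shuffle_xor(): each lane reads from
// the quad lane ID formed by XORing its own lane ID with mask. For mask in
// 0..3 the computed ID is always a valid quad lane.
std::array<char, 4> quad_shuffle_xor_sim(std::array<char, 4> data, unsigned mask) {
    std::array<char, 4> out{};
    for (unsigned lane = 0; lane < 4; ++lane)
        out[lane] = data[lane ^ mask]; // XOR of caller's lane ID and mask
    return out;
}
```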
Original quad-group:

Quad Lane ID             0    1    2    3
data                     a    b    c    d

After quad_shuffle_up() with delta = 2:

Computed Quad Lane ID   -2   -1    0    1
valid                    0    0    1    1
data                     a    b    a    b

After quad_shuffle_down() with delta = 2:

Computed Quad Lane ID    2    3    4    5
valid                    1    1    0    0
data                     c    d    c    d