Macros for kernels
In addition to the CUDA/OpenCL portability macros in port.mako,
katsdpsigproc also provides a number of macros that help to write more advanced
kernels.
These macros make use of the more advanced parts of Mako syntax, and you should probably read its manual alongside this one.
Transposition
Transposition can be done with the Transpose operation without
writing any kernel code; but one may wish to combine the transposition with
other code to save memory bandwidth. The code to do this is contained in
transpose_base.mako, and can be loaded with
<%namespace name="transpose" file="transpose_base.mako"/>
You will need to choose three parameters: block, vtx and vty. The kernel will run with block × block work groups, and each work group will transpose a tiley × tilex region of the source to a tilex × tiley region of the destination, where tilex is block · vtx and tiley is block · vty. Typical block values are 8–32 while vtx and vty are small — at most 4, and 1 is not unreasonable (the built-in transpose operation uses Autotuning to select these parameters). The examples in this section all assume that these parameters are passed as variables to the Mako template.
A structure is needed to hold some internal coordinates. Declare it like this:
<%transpose:transpose_coords_class class_name="transpose_coords" block="${block}" vtx="${vtx}" vty="${vty}"/>
This creates a struct named transpose_coords; you could call it
whatever you want. Declare an instance of it in your kernel function, and
initialize it using the generated function
transpose_coords_init_simple() (the name is generated from the
class_name):
transpose_coords coords;
transpose_coords_init_simple(&coords);
As the name suggests, there is a more general version
transpose_coords_init() which lets you provide your own information
about which coordinates the work item will work on, but it’s beyond the scope
of this tutorial.
Your kernel will consist of two main phases:
Load data, transform it if needed, and store it in local memory.
Read the data from local memory (with a transposed access pattern), transform it further if needed, and write it out.
Thus, you’ll need one or more local memory buffers to hold the intermediate results. Declare the type (at global scope) like this:
<%transpose:transpose_data_class class_name="transpose_values" type="${ctype}" block="${block}" vtx="${vtx}" vty="${vty}"/>
where ctype is the C type of each element. It’s advisable not to declare
your own struct for the ctype as it can hurt efficiency; rather use multiple
data classes (or multiple instances of the same one). When creating an
instance of the struct, remember to declare it as LOCAL_DECL
e.g.
LOCAL_DECL transpose_values values;
Now we’re ready to load a chunk of data. Here’s a simple example (based on the
Transpose operation) that simply copies data into the local memory:
<%transpose:transpose_load coords="coords" block="${block}" vtx="${vtx}" vty="${vty}" args="r, c, lr, lc">
values.arr[${lr}][${lc}] = in[${r} * in_stride + ${c}];
</%transpose:transpose_load>
BARRIER();
The args determines the names of extra Mako variables that can be used inside the block. With the names in the example, (r, c) are the coordinates in the input data, and (lr, lc) are the coordinates to use in the intermediate storage. Note that in and in_stride are not magical: they’re just arguments that were passed to this kernel. You could get the input from multiple arrays or even synthesize it mathematically if there was a good reason to.
The call to BARRIER separates the load and store phases, and is
critical to ensure synchronization of the local memory.
The store phase (after the barrier) is very similar:
<%transpose:transpose_store coords="coords" block="${block}" vtx="${vtx}" vty="${vty}" args="r, c, lr, lc">
out[${r} * out_stride + ${c}] = values.arr[${lr}][${lc}];
</%transpose:transpose_store>
Note that once again (r, c) are the coordinates for external storage and (lr, lc) are the coordinates for internal storage; don’t swap anything around as it’s handled for you.
The examples above are all based on the built-in Transpose operation
(but slightly simplified), whose source is in transpose.mako. It is
recommended that you look at the source (both the Python and kernel code) as
an example.
Reductions
Reductions are operations like sum or max that repeatedly apply a binary
operator to turn an array of values into a single value. The namespace
wg_reduce.mako provides macros for performing reductions within a
single work group. Larger-scale reductions are left to the user to implement
e.g. by recursively applying work-group reductions or by transferring the
results of a single round of reduction to the CPU.
For efficiency, the implementation assumes that the binary operator is associative and commutative, and hence that the order of operations can be juggled. An example of a non-commutative operator is matrix multiplication. Floating-point addition is commutative, but not quite associative due to rounding errors; thus, summation may return a different result to a CPU summation, or to summation on a different GPU. However, for a specific system it is deterministic i.e., is not affected by random timing.
This section covers the basics of defining a reduction across a whole work
group with 1D shape. It is also possible to partition a work group into
equally-sized pieces, each of which performs an independent reduction, but
that is beyond the scope of this tutorial. Refer to the comments in
wg_reduce.mako for details.
Here’s a kernel where each work group just sums a section of the input array corresponding to the work group’s global IDs. A wgs variable must be passed to indicate the work group size.
<%include file="port.mako"/>
<%namespace name="wg_reduce" file="wg_reduce.mako"/>
${wg_reduce.define_scratch('int', wgs, 'scratch_t')}
${wg_reduce.define_function('int', wgs, 'reduce_helper', 'scratch_t', wg_reduce.op_plus)}
KERNEL void reduce(const GLOBAL int *in, GLOBAL int *out)
{
LOCAL_DECL scratch_t scratch;
int lid = get_local_id(0);
int value = in[get_global_id(0)];
int sum = reduce_helper(value, lid, &scratch);
if (lid == 0)
out[get_group_id(0)] = sum;
}
Similar to the transposition helpers, we need to define a struct to hold
scratch data, with define_scratch. It takes the type of data being
reduced, the number of items in the reduction, and the name of the struct to
define. We then define the function that does the reduction, using
define_function. It takes those same arguments, as well as the
function name and a macro defining the binary operator. There are a few common
operators provided (op_plus, op_min,
op_max, op_fmin, op_fmax), but you can
easily define your own. The macro takes two values and the type and generates
code for the result. For example, here is the definition of
op_plus [1].
<%def name="op_plus(a, b, type)">((${a}) + (${b}))</%def>
Synchronization
The function defined by define_function contains barriers. It’s
thus critical that all work items in the work group call the function
collectively.
Disabling broadcasting
By default, the return value is valid in all items in the work group. However,
frequently only one work item actually uses the value, as seen in the example
above. In this case one can pass broadcast=False to
define_function, and only the first work item will receive the sum
(others will receive an undefined value). This improves performance as the
implementation naturally obtains the sum in the first work item and normally
requires an extra step to broadcast it to the others.
Shuffle optimization
In many cases the internals can be optimized using special CUDA instructions,
but it is not done by default because it is only safe under certain
conditions. The comments in wg_reduce.mako have the full details, but
if you use a 1D kernel and pass get_local_id(0) as the index (as is done
in this example) then it is safe to enable this optimization. To do so, pass
shuffle=True to both define_scratch and
define_function.