Macros for kernels ================== In addition to the CUDA/OpenCL portability macros in :file:`port.mako`, katsdpsigproc also provides a number of macros that help to write more advanced kernels. These macros make use of the more advanced parts of `Mako`_ syntax, and you should probably read its manual alongside this one. .. _Mako: https://docs.makotemplates.org Transposition ------------- Transposition can be done with the :class:`.Transpose` operation without writing any kernel code; but one may wish to combine the transposition with other code to save memory bandwidth. The code to do this is contained in :file:`transpose_base.mako`, and can be loaded with .. code:: <%namespace name="transpose" file="transpose_base.mako"/> You will need to choose three parameters: `block`, `vtx` and `vty`. The kernel will run with `block` × `block` work groups, and each work group will transpose a `tiley` × `tilex` region of the source to a `tilex × tiley` region of the destination, where `tilex` is `block` · `vtx` and `tiley` is `block` · `vty`. Typical `block` values are 8–32 while `vtx` and `vty` are small — at most 4, and 1 is not unreasonable (the built-in transpose operation uses :doc:`autotune` to select these parameters). The examples in this section all assume that these parameters are passed as variables to the Mako template. A structure is needed to hold some internal coordinates. Declare it like this: .. code:: <%transpose:transpose_coords_class class_name="transpose_coords" block="${block}" vtx="${vtx}" vty="${vty}"/> This creates a struct named :c:struct:`!transpose_coords`; you could call it whatever you want. Declare an instance of it in your kernel function, and initialize it using the generated function :c:func:`!transpose_coords_init_simple` (the name is generated from the `class_name`): .. code:: transpose_coords coords; transpose_coords_init_simple(&coords); As the name suggests, there is a more general version :c:func:`!transpose_coords_init` which lets you provide your own information about which coordinates the work item will work on, but it's beyond the scope of this tutorial. Your kernel will consist of two main phases: 1. Load data, transform it if needed, and store it in local memory. 2. Read the data from local memory (with a transposed access pattern), transform it further if needed, and write it out. Thus, you'll need one or more local memory buffers to hold the intermediate results. Declare the type (at global scope) like this: .. code:: <%transpose:transpose_data_class class_name="transpose_values" type="${ctype}" block="${block}" vtx="${vtx}" vty="${vty}"/> where `ctype` is the C type of each element. It's advisable not to declare your own struct for the `ctype` as it can hurt efficiency; rather use multiple data classes (or multiple instances of the same one). When creating an instance of the struct, remember to declare it as :c:macro:`!LOCAL_DECL` e.g. .. code:: LOCAL_DECL transpose_values values; Now we're ready to load a chunk of data. Here's a simple example (based on the :class:`.Transpose` operation) that simply copies data into the local memory: .. code:: <%transpose:transpose_load coords="coords" block="${block}" vtx="${vtx}" vty="${vty}" args="r, c, lr, lc"> values.arr[${lr}][${lc}] = in[${r} * in_stride + ${c}]; BARRIER(); The `args` determines the names of extra Mako variables that can be used inside the block. With the names in the example, `(r, c)` are the coordinates in the input data, and `(lr, lc)` are the coordinates to use in the intermediate storage. Note that `in` and `in_stride` are not magical: they're just arguments that were passed to this kernel. You could get the input from multiple arrays or even synthesize it mathematically if there was a good reason to. The call to :c:macro:`!BARRIER` separates the load and store phases, and is critical to ensure synchronization of the local memory. The store phase (after the barrier) is very similar: .. code:: <%transpose:transpose_store coords="coords" block="${block}" vtx="${vtx}" vty="${vty}" args="r, c, lr, lc"> out[${r} * out_stride + ${c}] = values.arr[${lr}][${lc}]; Note that once again `(r, c)` are the coordinates for external storage and `(lr, lc)` are the coordinates for internal storage; don't swap anything around as it's handled for you. The examples above are all based on the built-in :class:`.Transpose` operation (but slightly simplified), whose source is in :file:`transpose.mako`. It is recommended that you look at the source (both the Python and kernel code) as an example. Reductions ---------- Reductions are operations like sum or max that repeatedly apply a binary operator to turn an array of values into a single value. The namespace :file:`wg_reduce.mako` provides macros for performing reductions within a single work group. Larger-scale reductions are left to the user to implement e.g. by recursively applying work-group reductions or by transferring the results of a single round of reduction to the CPU. For efficiency, the implementation assumes that the binary operator is associative and commutative, and hence that the order of operations can be juggled. An example of a non-commutative operator is matrix multiplication. Floating-point addition is commutative, but not quite associative due to rounding errors; thus, summation may return a different result to a CPU summation, or to summation on a different GPU. However, for a specific system it is deterministic i.e., is not affected by random timing. This section covers the basics of defining a reduction across a whole work group with 1D shape. It is also possible to partition a work group into equally-sized pieces, each of which performs an independent reduction, but that is beyond the scope of this tutorial. Refer to the comments in :file:`wg_reduce.mako` for details. Here's a kernel where each work group just sums a section of the input array corresponding to the work group's global IDs. A `wgs` variable must be passed to indicate the work group size. .. literalinclude:: examples/sum.mako Similar to the transposition helpers, we need to define a struct to hold scratch data, with :c:macro:`!define_scratch`. It takes the type of data being reduced, the number of items in the reduction, and the name of the struct to define. We then define the function that does the reduction, using :c:macro:`!define_function`. It takes those same arguments, as well as the function name and a macro defining the binary operator. There are a few common operators provided (:c:macro:`!op_plus`, :c:macro:`!op_min`, :c:macro:`!op_max`, :c:macro:`!op_fmin`, :c:macro:`!op_fmax`), but you can easily define your own. The macro takes two values and the type and generates code for the result. For example, here is the definition of :c:macro:`!op_plus` [1]_. .. code:: <%def name="op_plus(a, b, type)">((${a}) + (${b})) .. [1] As with C preprocessor macros, parentheses should be used to ensure that operations are ordered correctly if the arguments are themselves expressions or if the macro result is used in a larger expression. Synchronization _______________ The function defined by :c:macro:`!define_function` contains barriers. It's thus critical that all work items in the work group call the function collectively. Disabling broadcasting ______________________ By default, the return value is valid in all items in the work group. However, frequently only one work item actually uses the value, as seen in the example above. In this case one can pass ``broadcast=False`` to :c:macro:`!define_function`, and only the first work item will receive the sum (others will receive an undefined value). This improves performance as the implementation naturally obtains the sum in the first work item and normally requires an extra step to broadcast it to the others. Shuffle optimization ____________________ In many cases the internals can be optimized using special CUDA instructions, but it is not done by default because it is only safe under certain conditions. The comments in :file:`wg_reduce.mako` have the full details, but if you use a 1D kernel and pass ``get_local_id(0)`` as the index (as is done in this example) then it is safe to enable this optimization. To do so, pass ``shuffle=True`` to *both* :c:macro:`!define_scratch` and :c:macro:`!define_function`.