Autotuning

The kernels shown so far all have a fixed work-group size. However, it’s not easy to know in advance what the best work-group size is for a specific piece of hardware, and even harder to know what it should be in code that may run on multiple generations of hardware. Furthermore, work-group size is only the most common tuning parameter; there may be others.

A common approach to this problem is to use “autotuning”: benchmark different options on the fly and then use the best. There are two variants of this approach, which I’ll call “on-line” and “off-line”. In on-line autotuning, each time a kernel is invoked by the user, a different set of parameters is tested; the benchmarking is a side effect of doing useful work, but the useful work will have variable performance until the tuner has converged. In off-line autotuning, benchmarking is done on a synthetic workload and the optimal parameters are selected before doing any real work.

On-line autotuning has the advantage that it does not require a long wait before useful work can be done, and also that the workload being benchmarked will be representative. However, katsdpsigproc was originally developed for use in real-time pipelines where highly variable performance during the tuning phase is unacceptable. It thus only supports off-line autotuning.

Despite being off-line, the autotuning is transparent and automatic for users of operations. If an operation template is constructed in a configuration that has not been tuned before, the autotuning is run and the result is saved in a SQLite database so that it does not need to be run again. Results are indexed by the device and driver version, so that old results are not reused on new hardware for which they might not be appropriate.
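The caching idea can be sketched with nothing but the standard library. This is a minimal illustration, not katsdpsigproc's implementation: the table layout, the cached_tune() helper and the fake_tune() stand-in are all hypothetical, and the real database stores more columns than shown here.

```python
import sqlite3


def cached_tune(conn, device, driver, tune_fn):
    """Return tuning parameters for (device, driver), calling tune_fn only on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tuning ("
        "device TEXT, driver TEXT, wgs INTEGER, "
        "PRIMARY KEY (device, driver))"
    )
    row = conn.execute(
        "SELECT wgs FROM tuning WHERE device = ? AND driver = ?",
        (device, driver),
    ).fetchone()
    if row is not None:
        return {"wgs": row[0]}  # cache hit: no benchmarking needed
    result = tune_fn()  # cache miss: run the (slow) benchmark
    conn.execute(
        "INSERT INTO tuning (device, driver, wgs) VALUES (?, ?, ?)",
        (device, driver, result["wgs"]),
    )
    conn.commit()
    return result


calls = []


def fake_tune():
    # Hypothetical stand-in for a real benchmark; records each invocation.
    calls.append(1)
    return {"wgs": 128}


conn = sqlite3.connect(":memory:")
first = cached_tune(conn, "GPU A", "driver 1.0", fake_tune)
second = cached_tune(conn, "GPU A", "driver 1.0", fake_tune)  # served from the cache
third = cached_tune(conn, "GPU A", "driver 2.0", fake_tune)   # new driver: tuned again
```

Because the driver version is part of the key, the third call runs the benchmark again even though the device name is unchanged.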

To take advantage of autotuning, authors of operations need to add code to their operation templates to specify the parameter space to search and the code to benchmark. Let’s see how that might look for our MultiplyTemplate class (see Operation templates), so that it automatically determines the work-group size.

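import katsdpsigproc.tune
from katsdpsigproc.accel import build

# SOURCE and Multiply are defined as in the Operation templates section.


class MultiplyTemplate:
    def __init__(self, context, tuning=None):
        if tuning is None:
            # No explicit parameters: use cached results or run the autotuner
            tuning = self.autotune(context)
        self.wgs = tuning['wgs']
        self.program = build(context, '', source=SOURCE)

    @classmethod
    @katsdpsigproc.tune.autotuner(test={'wgs': 32})
    def autotune(cls, context):
        queue = context.create_tuning_command_queue()
        size = 1048576  # large enough to reasonably exercise a GPU

        def generate(wgs):
            # Build and instantiate the operation with an explicit work-group size
            fn = cls(context, {'wgs': wgs}).instantiate(queue, size, 1)
            fn.ensure_all_bound()
            fn.buffer('data').zero(queue)
            return katsdpsigproc.tune.make_measure(queue, fn)

        # Benchmark every candidate value and return the best as a dictionary
        return katsdpsigproc.tune.autotune(generate, wgs=[32, 64, 128, 256])

    def instantiate(self, queue, size, scale):
        return Multiply(self, queue, size, scale)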

There is a fair amount of convention and boiler-plate here, so let’s go through it a step at a time.

The constructor takes an extra tuning argument, defaulting to None. Tuning parameters can be explicitly provided as a dictionary with string keys; in this case the only key used is 'wgs' (short for work-group size). Users can thus override the tuning parameters, but this argument is really intended for internal use.
A class method (autotune()) computes the optimal parameters for a given configuration. It has a decorator that tells the autotuning system to cache the result, similar to functools.lru_cache(). For our simple class there is no configuration, but if this function took additional arguments, they would form part of the database key so that different configurations are tuned separately. These arguments need to be simple types such as numbers and strings that can be serialized by sqlite3, although enums and numpy dtypes are also supported. We’ll come back to the test= part in the section on Testing.
The function uses katsdpsigproc.tune.autotune() to do the actual autotuning. It is passed a function that describes how to benchmark a specific set of parameters, and one keyword argument for each parameter to tune, giving a list of values to try. The keyword names must match the argument names of generate(). The autotuner is not particularly clever: given multiple parameters to tune, it tries all combinations, so if there are many parameters you need to be careful not to cause a combinatorial explosion that will take forever to test.
The generate() function sets up the benchmark for a specific value of wgs by constructing an instance of the class with the explicitly-provided tuning parameters. It also instantiates it (giving an instance of Multiply) with a size chosen to be large enough to reasonably exercise a GPU, and allocates buffers. It would be more efficient to allocate a single buffer once outside generate() to be used for all possible values of wgs, but one needs to be careful that such a buffer is suitably padded for all cases.

It then uses katsdpsigproc.tune.make_measure() to construct a benchmark function, which will return the performance of this configuration each time it is called. You could build your own benchmark function, but make_measure() takes care of inserting markers into a command queue on either side of your operation and querying them to get the elapsed GPU time. The autotuning system will call the benchmark function multiple times to get an estimate of performance.
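To make the "tries all combinations" behaviour concrete, here is a toy exhaustive search written from scratch. It is a sketch under stated assumptions rather than the library's code: exhaustive_search(), the unroll parameter and the cost formula are hypothetical, averaging the repeated measurements is an assumption, and the measure function returns a made-up number instead of an elapsed GPU time.

```python
import itertools


def exhaustive_search(generate, repeats=3, **candidates):
    """Try every combination of candidate values and return the fastest one."""
    names = list(candidates)
    best_params, best_time = None, float("inf")
    for values in itertools.product(*(candidates[n] for n in names)):
        params = dict(zip(names, values))
        measure = generate(**params)  # set up the benchmark once
        # Call the benchmark several times and average the results
        elapsed = sum(measure() for _ in range(repeats)) / repeats
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params


def generate(wgs, unroll):
    # Hypothetical cost model standing in for a real GPU timing
    def measure():
        return abs(wgs - 128) + 10.0 / unroll

    return measure


best = exhaustive_search(generate, wgs=[32, 64, 128, 256], unroll=[1, 2, 4])
```

With four values for one parameter and three for the other, generate() is called 12 times; each extra parameter multiplies that count again.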

And that’s it! The only change to the rest of the code is that the Multiply kernel now needs to use template.wgs instead of template.WGS because it’s no longer a Python constant.

Most of my autotuning functions look broadly similar to the above, but the only part that really does any magical introspection is the katsdpsigproc.tune.autotuner() decorator, and you can write the body of your functions in completely different ways if you so choose.

Skipping combinations

As mentioned, when multiple parameters are being tuned together, the tuner will try all combinations, which can take an excessive amount of time. To test only a subset of combinations, one can return None from the generate() function to skip a combination. Each skipped combination still costs a Python function call, so one should avoid starting with a space containing billions of combinations.

Some combinations might also lead to compiler errors, for example, because they use too many registers. The autotuning system will gracefully skip combinations that cause exceptions, so it is not necessary to catch and deal with the compiler errors yourself. Not catching exceptions also means you’ll get a more useful error if you introduce a bug that causes all combinations to fail.
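Both kinds of skipping can be illustrated with a toy tuner. Everything here is hypothetical (the search_with_skips() helper, the wgsx and wgsy parameters, the simulated compiler error); it only mimics the behaviour described above and is not how katsdpsigproc is implemented.

```python
import itertools


def search_with_skips(generate, **candidates):
    """Exhaustive search that tolerates skipped and failing combinations."""
    names = list(candidates)
    best_params, best_time = None, float("inf")
    last_error = None
    for values in itertools.product(*(candidates[n] for n in names)):
        params = dict(zip(names, values))
        try:
            measure = generate(**params)
            if measure is None:
                continue  # combination explicitly skipped by generate()
            elapsed = measure()
        except Exception as exc:
            last_error = exc  # e.g. a compile failure: try the next combination
            continue
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    if best_params is None:
        # Nothing could be benchmarked: surface the last underlying error
        raise RuntimeError("no combination could be benchmarked") from last_error
    return best_params


def generate(wgsx, wgsy):
    if wgsx * wgsy > 256:
        return None  # too many work-items per group: do not test
    if wgsx == 64:
        raise RuntimeError("simulated compiler error: too many registers")
    return lambda: 1.0 / (wgsx * wgsy)  # made-up cost: bigger groups are faster


best = search_with_skips(generate, wgsx=[8, 16, 64], wgsy=[4, 8, 16])
```

Note that generate() does not catch its own exception. If a bug made every combination fail, the final error would carry the last underlying exception with it, which is easier to diagnose than a silently empty result.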

Versioning

Autotuning results are inserted into a SQL table whose name is based on the fully-qualified name of the autotuning function. The table has columns for the device, platform, driver version, the arguments to the autotuning function, and the dictionary keys in the result. This presents a problem if you want to change the arguments or return keys of the function, because users who have already run autotuning will get database errors when the columns don’t match. Furthermore, even if you don’t change the interface, you might change the implementation to such an extent that old autotuning results are no longer appropriate.

To solve these issues, the table name also includes a version number. It defaults to zero, but can be overridden by defining a class constant autotune_version. Old results will not be removed from the database, and might even still be used if the user downgrades back to the previous version.
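The effect of the version number can be sketched as follows. The table_name() helper and the exact naming scheme are hypothetical and chosen only for illustration; the point is that bumping autotune_version sends lookups to a different table, leaving the old one untouched.

```python
def table_name(cls, fn_name):
    """Build a table name from the qualified function name and the class's version."""
    version = getattr(cls, "autotune_version", 0)  # defaults to zero
    qualified = f"{cls.__module__}.{cls.__qualname__}.{fn_name}"
    return qualified.replace(".", "_") + f"__v{version}"


class MultiplyTemplate:
    pass


old_name = table_name(MultiplyTemplate, "autotune")

# Normally this would be written in the class body as `autotune_version = 1`
MultiplyTemplate.autotune_version = 1
new_name = table_name(MultiplyTemplate, "autotune")
```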

Overriding autotuning

The default behaviour of katsdpsigproc’s autotuning machinery is to run autotuning unless the results stored in the autotuning SQL table contain an exact match for the GPU detected at runtime.

It is possible to request an inexact match in the autotuning lookup by setting an environment variable, KATSDPSIGPROC_TUNE_MATCH. If KATSDPSIGPROC_TUNE_MATCH is set to “nearest”, the nearest match to the current GPU in the autotuning SQL table will be returned, by ignoring in turn the device driver, then platform, then device name. If no match is found, autotuning will proceed. If KATSDPSIGPROC_TUNE_MATCH is set to “exact” (or anything else), default behaviour will proceed.
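The progressive relaxation can be sketched in plain Python. This is an illustration only: the lookup() helper and the record layout are hypothetical, and which record is returned when several are equally near is an assumption (here, simply the first one found).

```python
def lookup(records, device, platform, driver, match="exact"):
    """Return a stored result, or None if the caller should autotune."""
    keys = ["device", "platform", "driver"]
    wanted = {"device": device, "platform": platform, "driver": driver}
    # "exact" compares all keys; "nearest" then ignores the driver,
    # then the platform, then the device name.
    levels = len(keys) + 1 if match == "nearest" else 1
    for level in range(levels):
        active = keys[: len(keys) - level]
        for rec in records:
            if all(rec[k] == wanted[k] for k in active):
                return rec["result"]
    return None


records = [
    {"device": "GPU A", "platform": "CUDA", "driver": "1.0", "result": {"wgs": 64}},
    {"device": "GPU B", "platform": "CUDA", "driver": "1.0", "result": {"wgs": 256}},
]

exact = lookup(records, "GPU B", "CUDA", "2.0")                     # no exact match
nearest = lookup(records, "GPU B", "CUDA", "2.0", match="nearest")  # driver ignored
empty = lookup([], "GPU B", "CUDA", "2.0", match="nearest")         # nothing stored
```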

It is also possible to override the location of the tuning database by setting the environment variable KATSDPSIGPROC_TUNE_DB.
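Both variables can be set in the shell, or from Python before the first operation template is constructed. The database path below is a hypothetical example, and setting the variables this early is a precaution rather than a documented requirement.

```python
import os

# Accept the nearest stored result instead of re-tuning on an inexact match
os.environ["KATSDPSIGPROC_TUNE_MATCH"] = "nearest"

# Hypothetical location for the tuning database
os.environ["KATSDPSIGPROC_TUNE_DB"] = os.path.join(os.getcwd(), "tuning.db")
```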

Testing

Testing in general is addressed in Testing, but it is worth noting that autotuning causes some additional challenges:

One wants tests to be reproducible, but if different developers end up with different autotuning results, they will end up running different tests.
The autotuning code should itself be tested, but once it has run, the result is cached and it will not run again.

To address these issues, the context fixture disables autotuning. Instead, your autotune() function will return the result specified with the test keyword argument to the autotuner() decorator. You should choose a value that is likely to work across a range of devices.

To test the autotuning code itself, use the force_autotune mark. It overrides the behavior described above so that the autotuning function always runs with no caching.
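The interaction between caching, the test value and forced tuning can be modelled with a toy decorator. This is not katsdpsigproc's autotuner(): the toy_autotuner() name and the testing and force keyword flags are hypothetical stand-ins for what the context fixture and the force_autotune mark arrange behind the scenes.

```python
import functools


def toy_autotuner(test):
    """Toy decorator: cache results, but return `test` while testing."""

    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args, testing=False, force=False):
            if force:
                return fn(*args)  # always run, bypassing the cache
            if testing:
                return test  # canned, reproducible result
            if args not in cache:
                cache[args] = fn(*args)  # first call: really tune
            return cache[args]

        return wrapper

    return decorator


runs = []


@toy_autotuner(test={"wgs": 32})
def autotune(device):
    runs.append(device)  # record that the tuning code actually ran
    return {"wgs": 256}


in_test = autotune("GPU A", testing=True)
runs_after_test = len(runs)  # tuning code not executed in test mode

autotune("GPU A")
autotune("GPU A")  # second call is served from the cache
runs_after_normal = len(runs)

forced = autotune("GPU A", testing=True, force=True)
autotune("GPU A", testing=True, force=True)  # runs again: no caching
runs_after_forced = len(runs)
```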