Autotuning
==========

The kernels shown so far all have a fixed work-group size. However, it's not
easy to know in advance what the best work-group size is for a specific piece
of hardware, and even harder to know what it should be in code that may run on
multiple generations of hardware. Furthermore, work-group size is just the most
common tuning parameter, but there might be others.

A common approach to this problem is to use "autotuning": benchmark different
options on the fly and then use the best. There are two variants of this
approach, which I'll call "on-line" and "off-line". In on-line autotuning, each
time a kernel is invoked by the user, a different set of parameters is tested;
the benchmarking is a side effect of doing useful work, but the useful work
will have variable performance until the tuner has converged. In off-line
autotuning, benchmarking is done on a synthetic workload and the optimal
parameters are selected before doing any real work.

On-line autotuning has the advantage that it does not require a long wait
before useful work can be done, and also that the workload being benchmarked
will be representative. However, katsdpsigproc was originally developed for use
in real-time pipelines where highly variable performance during the tuning
phase is unacceptable. It thus only supports off-line autotuning.

Despite being off-line, the autotuning is transparent and automatic to users of
operations. If an operation template is constructed in a new configuration, the
autotuning will be run. The result is saved in a sqlite database so that it
will not need to be run again. Results are indexed by the device and driver
version so that old results will not be reused for new hardware for which they
might not be appropriate.

To take advantage of autotuning, authors of operations will need to add code to
their operation templates to specify the parameter space to search and the code
to benchmark.
Let's see how that might look for our :class:`!MultiplyTemplate` class (see
:ref:`operation-templates`), to make it automatically determine the work group
size.

.. code:: python

    class MultiplyTemplate:
        def __init__(self, context, tuning=None):
            if tuning is None:
                tuning = self.autotune(context)
            self.wgs = tuning['wgs']
            self.program = build(context, '', source=SOURCE)

        @classmethod
        @katsdpsigproc.tune.autotuner(test={'wgs': 32})
        def autotune(cls, context):
            queue = context.create_tuning_command_queue()
            size = 1048576

            def generate(wgs):
                fn = cls(context, {'wgs': wgs}).instantiate(queue, size, 1)
                fn.ensure_all_bound()
                fn.buffer('data').zero(queue)
                return katsdpsigproc.tune.make_measure(queue, fn)

            return katsdpsigproc.tune.autotune(generate, wgs=[32, 64, 128, 256])

        def instantiate(self, queue, size, scale):
            return Multiply(self, queue, size, scale)

There is a fair amount of convention and boiler-plate here, so let's go through
it a step at a time.

- The constructor takes an extra `tuning` argument, defaulting to ``None``.
  Tuning parameters can be explicitly provided as a dictionary with string
  keys; in this case the only key used is ``'wgs'`` (short for work-group
  size). Users can thus override the tuning parameters, but this argument is
  really intended for internal use.

- A class method (:meth:`!autotune`) computes the optimal parameters for a
  given configuration. It has a decorator that tells the autotuning system to
  cache the result, similar to :func:`functools.lru_cache`. For our simple
  class there is no configuration, but if this function takes additional
  arguments they form part of the database key so that different configurations
  are tuned separately. These types need to be simple types like numbers and
  strings that can be serialized by :mod:`!sqlite3`, but there is support for
  enums and numpy dtypes. We'll come back to the ``test=`` part in the section
  on :ref:`autotune-testing`.

- The function uses :func:`katsdpsigproc.tune.autotune` to do the actual
  autotuning. It is passed a function to describe how to benchmark a specific
  set of parameters, and a keyword argument for each parameter to tune (whose
  names must match the argument names to :func:`!generate`) with a list of
  values to try. The autotuner is not particularly clever: given multiple
  parameters to tune, it tries all combinations, so if there are many
  parameters you need to be careful not to cause a combinatorial explosion that
  will take forever to test.

- The :func:`!generate` function sets up the benchmark for a specific value of
  `wgs` by constructing an instance of the class with the explicitly-provided
  tuning parameters. It also instantiates it (giving an instance of
  :class:`!Multiply`) with a size chosen to be large enough to reasonably
  exercise a GPU, and allocates buffers. It would be more efficient to allocate
  a single buffer once outside :func:`!generate` to be used for all possible
  values of `wgs`, but one needs to be careful that such a buffer is suitably
  padded for all cases. It then uses :func:`katsdpsigproc.tune.make_measure` to
  construct a benchmark function, which will return the performance of this
  configuration each time it is called. You could build your own benchmark
  function, but :func:`~katsdpsigproc.tune.make_measure` takes care of
  inserting markers into a command queue on either side of your operation and
  querying them to get the elapsed GPU time. The autotuning system will call
  the benchmark function multiple times to get an estimate of performance.

And that's it! The only change to the rest of the code is that the
:class:`!Multiply` kernel now needs to use ``template.wgs`` instead of
``template.WGS`` because it's no longer a Python constant.

Most of my autotuning functions look broadly similar to the above, but the only
part that really does any magical introspection is the
:func:`katsdpsigproc.tune.autotuner` decorator, and you can write the body of
your functions in completely different ways if you so choose.
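
To make the "tries all combinations" behaviour concrete, here is a minimal
pure-Python sketch of an exhaustive search. It is not the katsdpsigproc
implementation: ``toy_autotune``, ``fake_generate``, the ``repeats`` argument
and the second tuning parameter ``unroll`` are all hypothetical, and a made-up
cost model stands in for GPU timing. It also assumes that combinations for
which the generator returns ``None`` or raises are simply skipped, which is the
behaviour described in the next section.

```python
import itertools


def toy_autotune(generate, repeats=3, **space):
    """Try every combination in `space` and return the fastest.

    Hypothetical stand-in for illustration only; it is not the
    katsdpsigproc implementation.
    """
    names = list(space)
    best_params, best_time = None, float("inf")
    last_error = None
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        try:
            measure = generate(**params)
        except Exception as exc:  # e.g. a kernel that fails to compile
            last_error = exc
            continue
        if measure is None:  # generate asked to skip this combination
            continue
        # Call the benchmark function several times and average.
        elapsed = sum(measure() for _ in range(repeats)) / repeats
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    if best_params is None:
        # Nothing worked: surface the last failure instead of hiding it.
        raise last_error or ValueError("no valid combinations")
    return best_params


def fake_generate(wgs, unroll):
    """Pretend benchmark: an invented cost model instead of GPU timing."""
    if wgs * unroll > 512:
        return None  # combination we do not want to test
    return lambda: abs(wgs - 128) + 10.0 / unroll


best = toy_autotune(fake_generate, wgs=[32, 64, 128, 256], unroll=[1, 2, 4])
print(best)
```

With two parameters of four and three values, the loop visits twelve
combinations; adding a third parameter multiplies that again, which is where
the combinatorial explosion comes from.
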

Skipping combinations
---------------------
As mentioned, when multiple parameters are being tuned together, the tuner will
try all combinations, which can take an excessive amount of time. To test only
a smaller subset of combinations, one can return ``None`` from the
:func:`!generate` function to skip testing of that combination. This still
costs a Python function call, so one should still avoid starting with a space
containing billions of combinations.

Some combinations might also lead to compiler errors, for example, because they
use too many registers. The autotuning system will gracefully skip combinations
that cause exceptions, so it is not necessary to catch and deal with the
compiler errors yourself. Not catching exceptions also means you'll get a more
useful error if you introduce a bug that causes *all* combinations to fail.

Versioning
----------
Autotuning results are inserted into a SQL table whose name is based on the
fully-qualified name of the autotuning function, and has columns for the
device, platform, driver version, the arguments to the autotuning function, and
the dictionary keys in the result. This presents a problem if you want to
change the arguments or return keys from the function, because users who have
already run autotuning will get database errors when the columns don't match.
Furthermore, even if you don't change the interface, you might change the
implementation to such an extent that old autotuning results might no longer be
appropriate.

To solve these issues, the table name also includes a version number. It
defaults to zero, but can be overridden by defining a class constant
:const:`!autotune_version`. Old results will *not* be removed from the
database, and might even still be used if the user downgrades back to the
previous version.
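
The effect of putting the version in the table name can be sketched with
:mod:`!sqlite3` directly. This is only an illustration under assumptions: the
naming scheme in ``table_name``, the ``value_`` column prefix, the reduced set
of columns (device and driver only) and the ``unroll`` key are all invented
here and are not taken from katsdpsigproc.

```python
import sqlite3


def table_name(func_qualname, version):
    """Build a table name from the tuner's qualified name and version.

    The naming scheme is hypothetical.
    """
    safe = func_qualname.replace(".", "__")
    return f"{safe}__v{version}"


def save(conn, func_qualname, version, device, driver, result):
    """Store one tuning result, creating the versioned table on demand."""
    table = table_name(func_qualname, version)
    columns = ", ".join(f'"value_{key}"' for key in result)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        f'("device" TEXT, "driver" TEXT, {columns})'
    )
    placeholders = ", ".join("?" for _ in range(2 + len(result)))
    conn.execute(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [device, driver, *result.values()],
    )


conn = sqlite3.connect(":memory:")
name = "mymodule.MultiplyTemplate.autotune"
# Version 0 returns only 'wgs'; version 1 adds a hypothetical 'unroll' key.
save(conn, name, 0, "GPU A", "1.0", {"wgs": 128})
save(conn, name, 1, "GPU A", "1.0", {"wgs": 64, "unroll": 2})

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Because each version gets its own table, the new column never collides with
the old schema, and the version 0 row is still there if the user downgrades.
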

Overriding autotuning
---------------------
By default, katsdpsigproc's autotuning machinery requires an exact match
between the GPU detected at runtime and the results stored in the autotuning
SQL table; if there is no exact match, autotuning is run. It is possible to
request an inexact match in the autotuning lookup by setting an environment
variable, :envvar:`KATSDPSIGPROC_TUNE_MATCH`.

If :envvar:`KATSDPSIGPROC_TUNE_MATCH` is set to "nearest", the nearest match to
the current GPU in the autotuning SQL table will be returned, by ignoring in
turn the device driver, then platform, then device name. If no match is found,
autotuning will proceed.

If :envvar:`KATSDPSIGPROC_TUNE_MATCH` is set to "exact" (or anything else), the
default behaviour applies.

It is also possible to override the location of the tuning database by setting
the environment variable :envvar:`KATSDPSIGPROC_TUNE_DB`.

.. _autotune-testing:

Testing
-------
Testing in general is addressed in :doc:`testing`, but it is worth noting that
autotuning causes some additional challenges in testing:

- One wants tests to be reproducible, but if different developers end up with
  different autotuning results, they will end up running different tests.

- The autotuning code should itself be tested, but after it has run once the
  result will be cached and it will not run again.

To address these issues, the :ref:`context ` fixture disables autotuning.
Instead, your :func:`!autotune` function will return the result specified with
the `test` keyword argument to the :func:`.autotuner` decorator. You should use
an argument that is likely to work across a range of devices.

To test the autotuning code itself, use the :ref:`force_autotune ` mark. It
overrides the behavior described above so that the autotuning function always
runs with no caching.
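
The three modes described in this section (fixed test value, cached tuning,
forced tuning) can be sketched with a toy decorator. This is an illustration
under assumptions, not how katsdpsigproc does it: ``toy_autotuner`` is
hypothetical, contexts are plain dictionaries, the ``autotune_disabled`` and
``force_autotune`` keys are invented stand-ins for the fixture and the mark,
and an in-memory dictionary replaces the sqlite database.

```python
import functools


def toy_autotuner(test):
    """Decorator factory: cache results, or return `test` when tuning is off.

    Hypothetical illustration only.
    """
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(context, *args):
            if context.get("autotune_disabled"):
                return dict(test)  # fixed, reproducible parameters
            if context.get("force_autotune"):
                return func(context, *args)  # always run, bypass the cache
            key = (context["device"], args)
            if key not in cache:
                cache[key] = func(context, *args)
            return cache[key]
        return wrapper
    return decorator


calls = []


@toy_autotuner(test={"wgs": 32})
def autotune(context):
    calls.append(context["device"])  # record that real tuning ran
    return {"wgs": 256}


fixed = autotune({"device": "GPU A", "autotune_disabled": True})
first = autotune({"device": "GPU A"})
second = autotune({"device": "GPU A"})  # served from the cache
forced = autotune({"device": "GPU A", "force_autotune": True})
print(fixed, first, second, forced, len(calls))
```

Only the ``first`` and ``forced`` calls reach the decorated function, which is
why a forced mode is needed to exercise the tuning code once a result has been
cached.
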