Buffers
=======

Device buffers
--------------
The :class:`~katsdpsigproc.accel.DeviceArray` class represents GPU memory. It
cannot be read or written directly (data is moved with explicit copy
commands), but otherwise tries to provide an interface similar to numpy:
instances have a shape, dtype, strides and so on. However, currently only
arrays with C-order layout can be created.

In the simplest case, construction is also similar to numpy, except that the
context must be passed as the first argument:

.. code:: python

  buf = katsdpsigproc.accel.DeviceArray(ctx, (3, 5), np.float32)

This creates a 3×5 buffer with uninitialized content.
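
For comparison, the sketch below shows the same metadata on a plain numpy
array of the same shape and dtype. It uses only numpy (no GPU or context is
needed) and is meant as a reminder of what C-order layout implies for the
strides; it makes no claim about how the device array is stored internally.

.. code:: python

  import numpy as np

  # Host-side array with the same shape and dtype as the device buffer above.
  host = np.empty((3, 5), np.float32)   # content is uninitialized

  print(host.shape)                  # (3, 5)
  print(host.dtype)                  # float32
  # C order: the last axis is contiguous, so stepping one row skips
  # 5 elements * 4 bytes = 20 bytes.
  print(host.strides)                # (20, 4)
  print(host.flags['C_CONTIGUOUS'])  # True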

Padding
_______
Because GPUs operate on fixed-size work groups, it is often necessary or just
convenient to include padding in buffers so that the GPU code doesn't need to
include special-case handling for the boundaries. The constructor takes an
additional `padded_shape` argument which specifies the actual size of the
underlying memory allocation, and must be at least as big as the `shape` in
every dimension. The "usable" part of the array is effectively a slice of the
top-left corner of the full allocation. It is safe to read and write the
padding elements, but their values should be considered undefined. For
example, commands that copy host data to the device might or might not
overwrite the padding elements.

.. tikz::

    [>=latex]
    \draw[fill=gray!20!white] (0, 0) rectangle (8, 6);
    \draw[fill=white] (0.1, 2) rectangle (4, 5.9);
    \draw[<->] (0, -0.2) -- node[auto, swap] {padded\_shape[1]} (8, -0.2);
    \draw[<->] (8.2, 0) -- node[auto, swap] {padded\_shape[0]} (8.2, 6);
    \draw[<->] (0.1, 1.8) -- node[auto, swap] {shape[1]} (4, 1.8);
    \draw[<->] (4.2, 2) -- node[auto, swap] {shape[0]} (4.2, 5.9);
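
The relationship in the figure can be mimicked on the host with plain numpy.
The following is a minimal sketch, not the library's implementation: the
helper ``round_up`` and the work-group size ``wg`` are hypothetical names
chosen for illustration, and the padded shape is simply each dimension rounded
up to a multiple of the work-group size.

.. code:: python

  import numpy as np

  def round_up(n, multiple):
      """Return the smallest multiple of `multiple` that is >= n."""
      return -(-n // multiple) * multiple

  shape = (3, 5)
  wg = (4, 8)   # hypothetical work-group size per dimension
  padded_shape = tuple(round_up(s, w) for s, w in zip(shape, wg))

  # The full allocation, including padding.
  storage = np.zeros(padded_shape, np.float32)
  # The "usable" part is a view of the top-left corner of the allocation.
  usable = storage[:shape[0], :shape[1]]
  usable[:] = 1   # writes through to storage; padding is left untouched

  print(padded_shape)     # (4, 8)
  print(usable.strides)   # (32, 4): rows are padded_shape[1] elements apart
  print(storage.sum())    # 15.0

Note that the row stride of the usable view is determined by
``padded_shape[1]`` rather than ``shape[1]``, which is why two arrays with the
same shape but different padding do not share a memory layout.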

Host buffers
------------
Typically there will be some device buffers that need to be copied to and/or
from the host. Regular numpy arrays can be used for this, but doing so is
inefficient and may involve an extra copy, because GPU drivers generally
require the host memory used in a copy to be allocated in a particular way to
allow for optimal transfers ("page-locked memory" in CUDA parlance). Instead,
use the :class:`.HostArray` class, which is a subclass of
:class:`numpy.ndarray`.

Some care is still needed: it is quite possible to end up with an instance of
:class:`.HostArray` that is nevertheless not on the fast
path. The simplest approach is to start with a device array and call
:meth:`.DeviceArray.empty_like`, which returns a matching host array.

Alternatively, one can use the constructor:

.. code:: python

  host = katsdpsigproc.accel.HostArray(shape, dtype, padded_shape, context=ctx)

For efficient copies, the shape, dtype *and* padded_shape must all match the
device array used in the copy.
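
When debugging a slow transfer it can help to check these three properties
explicitly. The sketch below is a hypothetical helper, not part of the
library: ``matches_for_fast_copy`` is an invented name, and it assumes that
both objects expose ``shape``, ``dtype`` and ``padded_shape`` attributes. It
checks only the conditions named above, so a ``True`` result does not by
itself guarantee the fast path. Metadata-only stand-ins are used so that the
example runs without a GPU.

.. code:: python

  from types import SimpleNamespace

  import numpy as np

  def matches_for_fast_copy(host, dev):
      """Check that shape, dtype and padded_shape all agree (hypothetical helper)."""
      return (
          tuple(host.shape) == tuple(dev.shape)
          and np.dtype(host.dtype) == np.dtype(dev.dtype)
          and tuple(host.padded_shape) == tuple(dev.padded_shape)
      )

  # Stand-ins that carry only metadata; real code would pass the actual arrays.
  dev = SimpleNamespace(shape=(3, 5), dtype=np.float32, padded_shape=(4, 8))
  good = SimpleNamespace(shape=(3, 5), dtype=np.float32, padded_shape=(4, 8))
  bad = SimpleNamespace(shape=(3, 5), dtype=np.float32, padded_shape=(3, 5))

  print(matches_for_fast_copy(good, dev))   # True
  print(matches_for_fast_copy(bad, dev))    # False: padding differs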

Copying and filling
-------------------
The simplest way to move data from host to device is with :meth:`.DeviceArray.set`.
This is a *synchronous* command: it only returns once the transfer is
complete, and you can immediately start changing the host array. It requires a
command queue (see :doc:`init`). For example, here is a way to fill a device
array with ones.

.. literalinclude:: examples/set_ones.py

Of course, this is a very inefficient way to fill GPU memory with a constant,
because we're first filling memory on the host, then copying it across a
narrow bus. Later on we'll see a utility module for filling device memory
with an arbitrary constant; but for zero-filling there is
:meth:`.DeviceArray.zero`.

To copy device memory back to the host, use :meth:`.DeviceArray.get`.
Continuing from the example above:

.. code:: python

  dev.get(queue, host)

It is also possible to omit the second argument, in which case a new
:class:`.HostArray` will be allocated and returned. However, memory allocation
is expensive, so if the transfer will be done many times it is better to
allocate the memory once.

There are also :meth:`.DeviceArray.get_async` and
:meth:`.DeviceArray.set_async` that perform asynchronous transfers: that is,
the function call will return immediately but the transfer will only occur
later. You will need to use :doc:`synchronization functions <sync>`
to determine when it is safe to reuse the memory.

It is also possible to copy sub-regions between host and device buffers or from
one device buffer to another. See :meth:`.DeviceArray.get_region`,
:meth:`.DeviceArray.set_region` and :meth:`.DeviceArray.copy_region`.
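
To illustrate what a sub-region copy means, the sketch below performs the
equivalent operation between two plain numpy arrays, with each region
described by a tuple of slices. It runs without a GPU and is only an analogy:
the exact form in which the methods above expect their region arguments is an
assumption here, so check their documentation before relying on it.

.. code:: python

  import numpy as np

  src = np.arange(20, dtype=np.float32).reshape(4, 5)
  dst = np.zeros((6, 6), np.float32)

  # Source and destination regions must have the same shape (here 2x3).
  src_region = np.s_[1:3, 2:5]
  dst_region = np.s_[0:2, 3:6]

  # Copy only the selected rectangle; the rest of dst is unchanged.
  dst[dst_region] = src[src_region]

  print(dst[0, 3:6])   # [7. 8. 9.]
  print(dst[1, 3:6])   # [12. 13. 14.]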