katsdpsigproc package

Subpackages

Submodules

katsdpsigproc.abc module

Abstract base classes for opencl and cuda.

class katsdpsigproc.abc.AbstractCommandQueue(*args, **kwds)[source]

Bases: ABC, Generic[_B, _C, _E, _K]

Abstraction of a command queue.

context: _C

abstract enqueue_copy_buffer_rect(src_buffer: _B, dest_buffer: _B, src_origin: int, dest_origin: int, shape: Sequence[int], src_strides: Sequence[int], dest_strides: Sequence[int]) → None[source]

Copy a subregion of one buffer to another.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use copy_region() for a high-level interface.

Parameters:

src_buffer – Source and destination buffers
dest_buffer – Source and destination buffers
src_origin – Offsets for the start of the copy, in bytes
dest_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
src_strides – Strides for the source and destination memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
dest_strides – Strides for the source and destination memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.

abstract enqueue_kernel(kernel: _K, args: Sequence[Any], global_size: Tuple[int, ...], local_size: Tuple[int, ...]) → None[source]

Enqueue a kernel to the command queue.

Warning

It is not thread-safe to call this function in two threads on the same kernel at the same time.

Parameters:

kernel – Kernel to run
args – Arguments to pass to the kernel. Refer to the PyOpenCL/CUDA documentation for details. Additionally, this function allows a low-level device array to be passed.
global_size – Number of work-items in each global dimension
local_size – Number of work-items in each local dimension. Must divide exactly into global_size.

abstract enqueue_marker() → AbstractEvent[source]: Create an event at this point in the command queue.

abstract enqueue_read_buffer(buffer: _B, data: Any, blocking: bool = True) → None[source]

Copy data from the device to the host.

Only whole-buffer copies are supported, and the shape and type must match. In general, one should use the convenience functions in accel.DeviceArray.

Parameters:

buffer – Source
data – Target
blocking – If true (default) the call blocks until the copy is complete

abstract enqueue_read_buffer_rect(buffer: _B, data: Any, buffer_origin: int, data_origin: int, shape: Sequence[int], buffer_strides: Sequence[int], data_strides: Sequence[int], blocking: bool = True) → None[source]

Copy a region of a buffer to host memory.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use set_region() for a high-level interface.

Parameters:

buffer – Source
data (array-like) – Target
buffer_origin – Offsets for the start of the copy, in bytes
data_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
buffer_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
data_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
blocking – If true, block until the transfer is complete.

abstract enqueue_wait_for_events(events: Sequence[_E]) → None[source]: Enqueue a barrier to wait for all events in events.

abstract enqueue_write_buffer(buffer: _B, data: Any, blocking=True) → None[source]

Copy data from the host to the device.

Only whole-buffer copies are supported, and the shape and type must match. In general, one should use the convenience functions in accel.DeviceArray.

Parameters:

buffer – Target
data (array-like) – Source
blocking – If true (default), the call blocks until the source has been fully read (it has not necessarily reached the device).

abstract enqueue_write_buffer_rect(buffer: _B, data: Any, buffer_origin: int, data_origin: int, shape: Sequence[int], buffer_strides: Sequence[int], data_strides: Sequence[int], blocking: bool = True) → None[source]

Copy a region of host memory to a buffer.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use set_region() for a high-level interface.

Parameters:

buffer – Target
data (array-like) – Source
buffer_origin – Offsets for the start of the copy, in bytes
data_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
buffer_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
data_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
blocking – If true, block until the transfer is complete.

abstract enqueue_zero_buffer(buffer: _B) → None[source]: Fill a buffer with zero bytes.

abstract finish() → None[source]: Block until all enqueued work has completed.

abstract flush() → None[source]: Start enqueued work running, but do not wait for it to complete.

class katsdpsigproc.abc.AbstractContext(*args, **kwds)[source]

Bases: ABC, Generic[_B, _RB, _RS, _D, _P, _Q, _TQ]

Abstraction of an OpenCL/CUDA context.

Create a typed buffer on the device.

Parameters:

shape – Shape for the array
dtype – Type for the data
raw – Memory backing the array (automatically allocated if None)

Create a buffer in host memory that can be efficiently copied to and from the device.

Parameters:

shape – Shape for the array
dtype – Type for the data

abstract allocate_raw(n_bytes: int) → _RB[source]: Create an untyped buffer on the device.

abstract allocate_svm(shape: Tuple[int, ...], dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], raw: _RS | None = None) → ndarray[source]: Allocate shared virtual memory.

abstract allocate_svm_raw(n_bytes: int) → _RS[source]: Allocate raw storage that can be passed to allocate_svm().

abstract compile(source: str, extra_flags: List[str] | None = None) → AbstractProgram[source]

Build a program object from source.

Parameters:

source – Source code
extra_flags – Extra parameters to pass to the compiler

abstract create_command_queue(profile: bool = False) → AbstractCommandQueue[source]

Create a new command queue associated with this context.

Parameters:: profile – If true, the command queue will support timing kernels

abstract create_tuning_command_queue() → AbstractTuningCommandQueue[source]: Create a new command queue for doing autotuning.

abstract property device: AbstractDevice: Return the device associated with the context (or the first device, if multiple).

class katsdpsigproc.abc.AbstractDevice(*args, **kwds)[source]

Bases: ABC, Generic[_C]

Abstraction of a device.

abstract property driver_version: str: Return human-readable name for the driver version.

abstract classmethod get_devices() → Sequence[_D][source]: Return a list of all devices on all platforms.

abstract classmethod get_devices_by_platform() → Sequence[Sequence[_D]][source]: Return a list of all devices, with a sub-list per platform.

abstract property is_accelerator: bool: Whether device is an accelerator (as defined by OpenCL device types).

abstract property is_cpu: bool: Whether the device is a CPU.

abstract property is_cuda: bool: Whether the device is a CUDA device.

abstract property is_gpu: bool: Whether device is a GPU.

abstract make_context() → AbstractContext[source]: Create a new context associated with this device.

abstract property name: str: Return human-readable name for the device.

abstract property platform_name: str: Return human-readable name for the platform owning the device.

abstract property simd_group_size: int

Return the number of workitems that run in lock-step.

This must only be used to tune performance parameters; there are no guarantees about memory coherency, forward progress etc.

class katsdpsigproc.abc.AbstractEvent[source]

Bases: ABC

Abstraction of an event.

This is more akin to a CUDA event than an OpenCL event, in that it is a marker in a command queue rather than associated with a specific command.

abstract time_since(prior_event: _E) → float[source]

Return the time in seconds from prior_event to self.

Unlike the PyCUDA method of the same name, this will wait for the events to complete if they have not already.

abstract time_till(next_event: _E) → float[source]

Return the time in seconds from this event to next_event.

See time_since().

abstract wait() → None[source]: Block until the event has completed.

class katsdpsigproc.abc.AbstractKernel(program: _P, name: str)[source]

Bases: ABC, Generic[_P]

Abstraction of a kernel object.

The object can be enqueued using AbstractCommandQueue.enqueue_kernel().

The recommended way to create this object is via AbstractProgram.get_kernel().

class katsdpsigproc.abc.AbstractProgram(*args, **kwds)[source]

Bases: ABC, Generic[_K]

Abstraction of a program object.

abstract get_kernel(name: str) → AbstractKernel[source]

Create a new kernel.

Parameters:: name – Name of the kernel function

class katsdpsigproc.abc.AbstractTuningCommandQueue(*args, **kwds)[source]

Bases: AbstractCommandQueue[_B, _C, _E, _K]

Command queue with extra facilities for autotuning.

It keeps track of kernels that are enqueued since the last call to start_tuning(), and reports the total time they consume when stop_tuning() is called.

context: _C

abstract start_tuning() → None[source]

abstract stop_tuning() → float[source]

katsdpsigproc.accel module

Utilities for interfacing with accelerator hardware.

Both OpenCL and CUDA are supported, as well as multiple devices.

The modules cuda and opencl provide the abstraction layer, but most code will not import these directly (and it might not be possible to import them). Instead, use create_some_context() to set up a context on whatever device is available.

katsdpsigproc.accel.have_cuda

True if PyCUDA could be imported (does not guarantee any CUDA devices)

Type:: bool

katsdpsigproc.accel.have_opencl

True if PyOpenCL could be imported (does not guarantee any OpenCL devices)

Type:: bool

class katsdpsigproc.accel.AbstractAllocator(*args, **kwds)[source]

Bases: ABC, Generic[_RB]

Interface for allocating device memory.

abstract allocate(shape: Tuple[int, ...], dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], padded_shape: Tuple[int, ...] | None = None, raw: _RB | None = None) → DeviceArray[source]

abstract allocate_raw(n_bytes: int) → _RB[source]

context: AbstractContext

class katsdpsigproc.accel.AliasIOSlot(children: Iterable[IOSlotBase])[source]

Bases: IOSlotBase

Slot that allocates a single low-level buffer to back multiple children.

The child slots need not have the same type or shape. This is typically used when logically distinct slots can share memory because they do not contain live data at the same time.

Parameters:: children –
Raises:: ValueError – If children is empty or contains non-attachable elements

is_bound() → bool[source]: Return whether storage is currently attached to this slot.

required_bytes() → int[source]: Return number of bytes of device storage required.

class katsdpsigproc.accel.CompoundIOSlot(children: Iterable[IOSlot])[source]

Bases: IOSlot

IO slot that owns multiple child slots, and presents the combined requirement.

The children must all have the same type and shape. This is used for connecting a single buffer to multiple operations.

Parameters:

children – Child slots

Raises:

ValueError – If children is empty, or if elements have inconsistent shapes
ValueError – If any child is not attachable
TypeError – If children contains inconsistent data types

class katsdpsigproc.accel.DeviceAllocator(context: AbstractContext[_B, _RB, _RS, _D, _P, _Q, _TQ])[source]

Bases: Generic[_B, _RB, _RS, _D, _P, _Q, _TQ], AbstractAllocator[_RB]

Allocate DeviceArray objects from a context.

allocate(shape: Tuple[int, ...], dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], padded_shape: Tuple[int, ...] | None = None, raw: _RB | None = None) → DeviceArray[source]

allocate_raw(n_bytes: int) → _RB[source]

context: AbstractContext

Bases: object

A light-weight array-like wrapper around a device buffer.

It that handles padding better than PyCUDA (which has very poor support).

It only supports C-order arrays where the inner-most dimension is contiguous. Transfers are designed to use an HostArray of the same shape and padding, but fall back to using a copy when necessary.

Parameters:

context – Context in which to allocate the memory
shape – Shape for the usable data
dtype – Data type
padded_shape – Shape for memory allocation (defaults to shape)
raw – If specified, provides the backing memory. It must be created from the allocate_raw method of an allocator.

asarray_like(ary: ndarray) → HostArray[source]: Return an array with the same content as ary, but the same memory layout as self.

Perform a device-to-device copy of a subregion of self to dest.

If the source and destination memory overlap, the result is undefined.

The regions to copy are specified using a subset of numpy array indexing syntax. The following are supported:

slices with positive strides
integers
np.newaxis
If fewer indices than axes are specified, all elements on the remaining axes are used.

Ellipses are not yet supported, but it would be straightforward to add support.

Parameters:

command_queue – Command queue for the asynchronous operation.
dest – Target of the copy
src_region – Index expressions constructed by np.s_ or np.index_exp.
dest_region – Index expressions constructed by np.s_ or np.index_exp.

Raises:

TypeError – if the source and destination do not have the same dtype
ValueError – if the source and destination regions select different shapes
IndexError – if the source or destination regions are unsupported or out-of-range

property dtype: dtype

empty_like() → HostArray[source]: Return an array-like object that can be efficiently copied.

get(command_queue: AbstractCommandQueue, ary: ndarray | None = None) → ndarray[source]

Copy synchronously from self to ary.

If ary is None, or if it is not suitable as a target, the copy is to a newly-allocated HostArray. The actual target is returned.

get_async(command_queue: AbstractCommandQueue, ary: ndarray | None = None) → ndarray[source]: Copy asynchronously from self to ary (see get()).

Perform a device-to-host copy of a subregion of self to ary.

See copy_region() for a description of how regions are specified.

Parameters:

command_queue – Command queue for the operation.
ary – Target of the copy
device_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
ary_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
blocking – If false, the operation will be asynchronous.

Raises:

TypeError – if the source and destination do not have the same dtype
ValueError – if the source and destination regions select different shapes
ValueError – if ary is not a HostArray or is a view of one
IndexError – if the source or destination regions are unsupported or out-of-range

property ndim: int: Return number of dimensions.

set(command_queue: AbstractCommandQueue, ary: ndarray) → None[source]: Copy synchronously from ary to self.

set_async(command_queue: AbstractCommandQueue, ary: ndarray) → None[source]: Copy asynchronously from ary to self.

Perform a host-to-device copy of a subregion ary to self.

See copy_region() for a description of how regions are specified.

Parameters:

command_queue – Command queue for the operation.
ary – Source of the copy
device_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
ary_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
blocking – If false, the operation will be asynchronous.

Raises:

TypeError – if the source and destination do not have the same dtype
ValueError – if the source and destination regions select different shapes
IndexError – if the source or destination regions are unsupported or out-of-range

property shape: Tuple[int, ...]

property strides: Tuple[int, ...]: Return strides, as in numpy.

zero(command_queue: AbstractCommandQueue) → None[source]: Memset with zeros (asynchronously).

Bases: object

A single dimension of an IOSlot.

It represents padding and alignment requirements. Instances can be linked together, with all linked instances (whether directly or indirectly linked) exposing the intersection of their requirements. Internally this is represented using a union-find tree with path compression.

Note that min_padded_round and alignment provide similar but different requirements. The former is used to determine a minimum amount of padding, but any larger amount is acceptable, even if not a multiple of min_padded_round. The latter is stricter, always requiring a multiple of this value. The latter is not expected to be frequently used except when type-punning.

Dimensions can also be frozen, which is done by IOSlot once a buffer is bound. This prevents the requirements from changing later (via linking or otherwise) and hence invalidating the bound buffer, or allowing two buffers with a shared dimension but different strides to be used.

Parameters:

size – Size of the actual data, before padding
min_padded_round – The min_padded_size will be computed by rounding size up to a multiple of this.
min_padded_size – Minimum size of the padded allocation, overriding min_padded_round.
alignment – Padded size is required to be a multiple of this. This must be a power of 2.
align_dtype – If specified, it is a hint that this data type is the fastest-varying axis of a multidimensional array. The padded size may be chosen to be such that the stride is a multiple of ALIGN_BYTES, to ensure efficient access by GPUs. The hint will be ignored if exact is true.
exact – If true, padding is forbidden

ALIGN_BYTES = 128

Add an alignment hint.

Indicates that this will be used with an array whose fastest-varying dimension is of type dtype. If the size is not a power of 2, it is ignored.

property alignment: int

property alignment_hint: int

property exact: bool

freeze() → None[source]: Prevent further modifications.

property frozen: bool

link(other: Dimension) → None[source]

Make both self and other reference a single shared requirement.

Raises:: ValueError – If the resulting requirement is exact and unsatisfiable

property min_padded_size: int

required_padded_size() → int[source]: Return padded size required to satisfy this dimension.

property size: int

valid(padded_size: int) → bool[source]: Return whether size is a valid padded size.

Bases: ndarray

A restricted array class that can be used to initialise a DeviceArray.

It uses C ordering and allows padding, which is always in units of the dtype. It optionally uses pinned memory to allow fast transfer to and from device memory.

Because of limitations in numpy and PyCUDA (which do not support non-contiguous memory very well), it works by taking a slice from the origin of contiguous storage, and using that contiguous storage in host-device copies. While one can create views, those views cannot be used in fast copies because there is no way to know whether the view is anchored at the origin.

See the numpy documentation on subclassing for an explanation of why it is written the way it is.

Parameters:

shape – Shape for the array
dtype – Data type for the array
padded_shape – Total size of memory allocation (defaults to shape)
context – If specified, the memory will be allocated in a way that allows efficient copies to and from this context.

classmethod padded_view(obj: HostArray) → ndarray[source]

classmethod padded_view(obj: ndarray) → ndarray | None

Retrieve the view of the full memory without padding.

Returns None if cls.safe(obj) is False.

classmethod safe(obj: ndarray) → bool[source]: Determine whether self can be copied to/from a device directly.

Bases: IOSlotBase

An input/output slot with type and shape information.

It contains a reference to a buffer (initially None) and the shape, type, padding and alignment requirements for that buffer. It can allocate the buffer on request. Each dimension is represesented by a Dimension.

Slots are arranged in a tree, and only the root can be manipulated directly. A slot cannot be reattached to a new parent.

Parameters:

dimensions (tuple of int or Dimension) – The dimensions of the slots. Integers are converted to Dimension with no further requirements.
dtype (numpy dtype) – Type of the data elements

allocate(allocator: AbstractAllocator[_RB], raw: _RB | None = None, *, bind: bool = True) → DeviceArray[source]

Allocate and optionally bind a buffer satisfying the requirements.

Warning

When raw is provided, there is no check that the storage is large enough.

Parameters:

allocator – Memory allocator from which to obtain the memory
raw (PyCUDA DeviceAllocation or PyOpenCL Buffer, optional) – Backing storage for the allocation
bind – If true (the default), the allocated buffer is immediately bound. If false, the buffer is returned and not bound, and can be bound later.

Raises:

ValueError – If this is not a root slot

bind(buffer: DeviceArray | None) → None[source]

Set the internal buffer reference.

Always use this function rather than writing it directly.

If the buffer is not None, it is validated (see validate()).

Parameters:: buffer – Buffer to store
Raises:: ValueError – If this is not a root slot

is_bound() → bool[source]: Return whether storage is currently attached to this slot.

required_bytes() → int[source]: Return number of bytes of device storage required.

required_padded_shape() → Tuple[int, ...][source]: Return padded shape required to satisfy only this slot.

validate(buffer: DeviceArray) → None[source]

Check that buffer is suitable for binding.

Parameters:

buffer – Buffer to validate (must not be None)

Raises:

TypeError – If the data type does not match
ValueError – If the dimensions or shape do not match
ValueError – If a padded size does not match Dimension.required_padded_size()

class katsdpsigproc.accel.IOSlotBase[source]

Bases: ABC

An input/output slot of an operation.

A slot can be bound to storage, or can allocate storage itself. This base class is untyped and unshaped, so in most cases one will use IOSlot instead.

Slots are arranged in a tree, and only the root can be manipulated directly. The entire tree shares the same storage. A slot cannot be reattached to a new parent.

allocate(allocator: AbstractAllocator[_RB], raw: _RB | None = None, *, bind: bool = True) → Any[source]

Allocate and optionally bind a buffer satisfying the requirements.

Warning

When raw is provided, there is no check that the storage is large enough.

Parameters:

allocator – Memory allocator from which to obtain the memory
raw (PyCUDA DeviceAllocation or PyOpenCL Buffer, optional) – Backing storage for the allocation
bind – If true (the default), the allocated buffer is immediately bound. If false, the buffer is returned and not bound, and can be bound later.

Raises:

ValueError – If this is not a root slot

allocate_host(context: AbstractContext) → HostArray[source]: Allocate a HostArray compatible with this slot.

attachable() → bool[source]: Return whether this slot can be attached as a child to another.

check_root() → None[source]: Check whether this slot is a root slot, and raise an exception if not.

abstract is_bound()[source]: Return whether storage is currently attached to this slot.

abstract required_bytes() → int[source]: Return number of bytes of device storage required.

class katsdpsigproc.accel.LinenoLexer(*args, **kw)[source]

Bases: Lexer

Insert #line directives into the source code.

It is used by passing lexer_cls to the mako template constructor.

class katsdpsigproc.accel.Operation(command_queue: AbstractCommandQueue, allocator: AbstractAllocator | None = None)[source]

Bases: ABC

An instance of a device operation.

Typically one first creates a template (which contains the program code, and is expensive to create) and then instantiates it for use with specific buffers.

An instance of this class contains slots, which are named instances of IOSlot. The user binds specific buffers to these slots to specify the memory used in the operation.

This class is only useful when subclassed. The subclass will populate the slots. Subclasses also provide a _run function that handles the implementation of __call__.

Operations are arranged in a tree, with internal nodes subclassing OperationSequence. Internal nodes provide slots that proxy for the slots of their children; the children’s slots should not be manipulated directly. When a parent ceases to exist, its children should no longer be used i.e., they cannot be reattached to a new parent.

Parameters:

command_queue – Command queue for the operation
allocator – Allocator used to allocate unbound slots

slots

Maps names to slot instances

Type:: dictionary

hidden_slots

Extra slots which are not root slots, and hence cannot have buffers bound to them, but whose buffers can still be referenced. This is generally used by OperationSequence when several slots are aliased to the same memory.

Type:: dictionary

is_root

True if this operation is not part of an OperationSequence.

Type:: boolean

bind(**kwargs) → None[source]

Bind buffers to slots by keyword.

Each keyword argument name specifies a slot name.

Raises:

KeyError – if a named slots does not exist
TypeError – if a named slot if an alias slot

buffer(name: str) → DeviceArray[source]

Retrieve the buffer bound to a slot.

It will consult both slots and hidden_slots.

Parameters:

name (str) – Name of the slot to access

Returns:

Buffer bound to slot name.

Return type:

DeviceArray

Raises:

KeyError – If no slot with this name exists
TypeError – If the slot exists but it is an alias slot
ValueError – If the slot exists but does not yet have a buffer bound

ensure_all_bound() → None[source]: Make sure that all slots have a buffer bound, allocating if necessary.

ensure_bound(name: str) → None[source]: Make sure that a specific slot has a buffer bound, allocating if necessary.

parameters() → Mapping[str, Any][source]: Return dictionary of configuration options for this operation.

required_bytes() → int[source]: Return number of bytes of device storage required.

class katsdpsigproc.accel.OperationSequence(command_queue: AbstractCommandQueue, operations: Iterable[Tuple[str, Operation]], compounds: Mapping[str, Iterable[str]] | None = None, aliases: Mapping[str, Iterable[str]] | None = None, allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Convenience class for an operation that is built up of smaller named operations.

Slots are mapped to share data between these smaller operations. Initially, each slot named slot in a child named op is remapped to a parent slot named op:slot. After this, each set provided in compounds is removed from the slots and combined into a single compound slot. Finally, each set in aliases is removed and combined into an alias slot.

For both compounds and aliases, if a child does not exist, it is skipped, and if none of the children exist, no action is taken.

Parameters:

command_queue – Command queue for the operation
operations (sequence of 2-tuples) – Name, operation pairs to add. Calling the operation executes them in order
compounds (mapping of str to sequence of str, optional) – Names for compound slots, mapped to the original slot names that are replaced
aliases (mapping of str to sequence of str, optional) – Names for alias slots, mapped to the original slot names that are replaced
allocator (DeviceAllocator or SVMAllocator, optional) – Allocator used to allocate unbound slots

class katsdpsigproc.accel.SVMAllocator(context: AbstractContext[_B, _RB, _RS, _D, _P, _Q, _TQ])[source]

Bases: Generic[_B, _RB, _RS, _D, _P, _Q, _TQ], AbstractAllocator[_RS]

Allocate SVMArray objects from a context.

allocate(shape: Tuple[int, ...], dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], padded_shape: Tuple[int, ...] | None = None, raw: _RS | None = None) → SVMArray[source]

allocate_raw(n_bytes: int) → _RS[source]

context: AbstractContext

Bases: HostArray, DeviceArray

An array that uses shared virtual memory to be accessible from both the host and the device.

Shared virtual memory is also known as managed memory (in CUDA).

It should not be used as a source or target for copies to the device, since it already resides on the device.

Due to limitations in PyOpenCL, this is currently only available for CUDA, and CUDA’s restrictions on managed memory apply. Only the base array (not views) can be passed to kernels.

Parameters:

context – Context in which to allocate the memory
shape – Shape for the array
dtype – Data type for the array
padded_shape – Total size of memory allocation (defaults to shape)
raw (PyCUDA DeviceAllocation or PyOpenCL Buffer, optional) – If specified, provides the backing memory

property buffer: ndarray

get(command_queue: AbstractCommandQueue, ary: ndarray | None = None) → ndarray[source]

Copy synchronously copy from self to ary.

If ary is None, or if it is not suitable as a target, the copy is to a newly allocated HostArray. The actual target is returned. For SVMArray, this is a CPU copy.

get_async(command_queue: AbstractCommandQueue, ary: ndarray | None = None) → ndarray[source]

Copy asynchronously from self to ary (see get).

This is implemented synchronously for SVMArray, but exists for compatibility.

Perform a device-to-host copy of a subregion of self to ary.

See copy_region() for a description of how regions are specified.

Parameters:

command_queue – Command queue for the operation.
ary – Target of the copy
device_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
ary_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
blocking – If false, the operation will be asynchronous.

Raises:

TypeError – if the source and destination do not have the same dtype
ValueError – if the source and destination regions select different shapes
ValueError – if ary is not a HostArray or is a view of one
IndexError – if the source or destination regions are unsupported or out-of-range

set(command_queue: AbstractCommandQueue, ary: ndarray) → None[source]

Copy synchronously from ary to self.

For SVMArray, this is a CPU copy.

set_async(command_queue: AbstractCommandQueue, ary: ndarray) → None[source]

Copy asynchronously from ary to self.

This is implemented synchronously for SVMArray, but exists for compatibility.

Perform a host-to-device copy of a subregion ary to self.

See copy_region() for a description of how regions are specified.

Parameters:

command_queue – Command queue for the operation.
ary – Source of the copy
device_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
ary_region – Index expressions constructed by np.s_ or np.index_exp, to specify the source and target regions.
blocking – If false, the operation will be asynchronous.

Raises:

TypeError – if the source and destination do not have the same dtype
ValueError – if the source and destination regions select different shapes
IndexError – if the source or destination regions are unsupported or out-of-range

katsdpsigproc.accel.all_devices() → List[AbstractDevice][source]: Return a list of all discovered devices.

katsdpsigproc.accel.build(context: AbstractContext, name: str, render_kws: Mapping[str, Any] | None = None, extra_dirs: List[str | PathLike] | None = None, extra_flags: List[str] | None = None, source: str | None = None) → AbstractProgram[source]

Build a source module from a mako template.

Parameters:

context – Context for which to compile the code
name – Source file name, relative to the katsdpsigproc module
render_kws – Keyword arguments to pass to mako
extra_dirs – Extra directories to search for source code
extra_flags – Flags to pass to the compiler
source – If specified, provides the source, overriding name

Returns:

Compiled module

Return type:

program

katsdpsigproc.accel.candidate_devices(device_filter: Callable[[AbstractDevice], bool] | None = None) → Sequence[AbstractDevice][source]

Get devices that are considered for create_some_context().

Refer to create_some_context() for documentation of how this list is affected by device_filter and environment variables. If no matching devices are found, returns an empty list. If an environment variable is out of range, raises RuntimeError.

katsdpsigproc.accel.create_some_context(interactive: bool = True, device_filter: Callable[[AbstractDevice], bool] | None = None) → AbstractContext[source]

Create a single-device context, selecting a device automatically.

This is similar to pyopencl.create_some_context. A number of environment variables can be set to limit the choice to a single device:

KATSDPSIGPROC_DEVICE: device number from amongst all devices

CUDA_DEVICE: CUDA device number (compatible with PyCUDA)

PYOPENCL_CTX: OpenCL platform and optionally device number (compatible with PyOpenCL)

The first of these that is encountered takes effect. If it does not exist, an exception is thrown.

Parameters:

interactive – If true, and sys.stdin.isatty() is true, and there are multiple choices, it will prompt the user. Otherwise, it will choose the first available device, favouring CUDA over OpenCL, then GPU over accelerators over other OpenCL devices.
device_filter – If specified, each device in turn is passed to it, and it must return True to keep the device as a candidate or False to reject it.

Raises:

RuntimeError – If no device could be found or the user made an invalid selection

katsdpsigproc.accel.divup(x: int, y: int) → int[source]: Divide x by y and round the result upwards.

katsdpsigproc.accel.roundup(x: int, y: int) → int[source]: Round x up to the next multiple of y.

katsdpsigproc.accel.visualize_operation(operation: Operation, filename: str) → None[source]

Write a visualization of an Operation to file.

This requires the graphviz package to be installed.

Parameters:

operation – Operation or operation sequence to visualize
filename – Base filename to write. It should end in .gv, and a rendered PDF will automatically be produced with a filename ending in .gv.pdf.

katsdpsigproc.cuda module

Abstraction layer over PyCUDA.

It implements the abstract interfaces defined in katsdpsigproc.abc.

class katsdpsigproc.cuda.CommandQueue(context: Context, pycuda_stream: pycuda.driver.Stream | None = None, profile: bool = False)[source]

Bases: AbstractCommandQueue[GPUArray, Context, Event, Kernel]

context: _C

enqueue_copy_buffer_rect(src_buffer: pycuda.gpuarray.GPUArray | ndarray, dest_buffer: pycuda.gpuarray.GPUArray | ndarray, src_origin: int, dest_origin: int, shape: Sequence[int], src_strides: Sequence[int], dest_strides: Sequence[int]) → None[source]

Copy a subregion of one buffer to another.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use copy_region() for a high-level interface.

Parameters:

src_buffer – Source and destination buffers
dest_buffer – Source and destination buffers
src_origin – Offsets for the start of the copy, in bytes
dest_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
src_strides – Strides for the source and destination memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
dest_strides – Strides for the source and destination memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.

enqueue_kernel(kernel: Kernel, args: Sequence[Any], global_size: Tuple[int, ...], local_size: Tuple[int, ...]) → None[source]

Enqueue a kernel to the command queue.

Warning

It is not thread-safe to call this function in two threads on the same kernel at the same time.

Parameters:

kernel – Kernel to run
args – Arguments to pass to the kernel. Refer to the PyOpenCL/CUDA documentation for details. Additionally, this function allows a low-level device array to be passed.
global_size – Number of work-items in each global dimension
local_size – Number of work-items in each local dimension. Must divide exactly into global_size.

enqueue_marker() → Event[source]: Create an event at this point in the command queue.

enqueue_read_buffer(buffer: pycuda.gpuarray.GPUArray, data: Any, blocking: bool = True) → None[source]

Copy data from the device to the host.

Only whole-buffer copies are supported, and the shape and type must match. In general, one should use the convenience functions in accel.DeviceArray.

Parameters:

buffer – Source
data – Target
blocking – If true (default) the call blocks until the copy is complete

enqueue_read_buffer_rect(buffer: pycuda.gpuarray.GPUArray | ndarray, data: Any, buffer_origin: int, data_origin: int, shape: Sequence[int], buffer_strides: Sequence[int], data_strides: Sequence[int], blocking: bool = True) → None[source]

Copy a region of a buffer to host memory.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use set_region() for a high-level interface.

Parameters:

buffer – Source
data (array-like) – Target
buffer_origin – Offsets for the start of the copy, in bytes
data_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
buffer_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
data_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
blocking – If true, block until the transfer is complete.

enqueue_wait_for_events(events: Sequence[Event]) → None[source]: Enqueue a barrier to wait for all events in events.

enqueue_write_buffer(buffer: pycuda.gpuarray.GPUArray, data: Any, blocking: bool = True) → None[source]

Copy data from the host to the device.

Only whole-buffer copies are supported, and the shape and type must match. In general, one should use the convenience functions in accel.DeviceArray.

Parameters:

buffer – Target
data (array-like) – Source
blocking – If true (default), the call blocks until the source has been fully read (it has not necessarily reached the device).

enqueue_write_buffer_rect(buffer: pycuda.gpuarray.GPUArray | ndarray, data: Any, buffer_origin: int, data_origin: int, shape: Sequence[int], buffer_strides: Sequence[int], data_strides: Sequence[int], blocking: bool = True) → None[source]

Copy a region of host memory to a buffer.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use set_region() for a high-level interface.

Parameters:

buffer – Target
data (array-like) – Source
buffer_origin – Offsets for the start of the copy, in bytes
data_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
buffer_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
data_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
blocking – If true, block until the transfer is complete.

enqueue_zero_buffer(buffer: pycuda.gpuarray.GPUArray | ndarray) → None[source]: Fill a buffer with zero bytes.

finish() → None[source]: Block until all enqueued work has completed.

flush() → None[source]: Start enqueued work running, but do not wait for it to complete.

class katsdpsigproc.cuda.Context(pycuda_context: pycuda.driver.Context)[source]

Bases: AbstractContext[GPUArray, DeviceAllocation, _RawManaged, Device, Program, CommandQueue, TuningCommandQueue]

Create a typed buffer on the device.

Parameters:

shape – Shape for the array
dtype – Type for the data
raw – Memory backing the array (automatically allocated if None)

Create a buffer in host memory that can be efficiently copied to and from the device.

Parameters:

shape – Shape for the array
dtype – Type for the data

allocate_raw(n_bytes: int) → pycuda.driver.DeviceAllocation[source]: Create an untyped buffer on the device.

allocate_svm(shape: Tuple[int, ...], dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], raw: _RawManaged | None = None) → ndarray[source]: Allocate shared virtual memory.

allocate_svm_raw(n_bytes: int) → _RawManaged[source]: Allocate raw storage that can be passed to allocate_svm().

compile(source: str, extra_flags: List[str] | None = None) → Program[source]

Build a program object from source.

Parameters:

source – Source code
extra_flags – Extra parameters to pass to the compiler

create_command_queue(profile: bool = False) → CommandQueue[source]

Create a new command queue associated with this context.

Parameters:: profile – If true, the command queue will support timing kernels

create_tuning_command_queue() → TuningCommandQueue[source]: Create a new command queue for doing autotuning.

property device: Device: Return the device associated with the context (or the first device, if multiple).

class katsdpsigproc.cuda.Device(pycuda_device: pycuda.driver.Device)[source]

Bases: AbstractDevice[Context]

property compute_capability: Tuple[int, int]

property driver_version: str: Return human-readable name for the driver version.

classmethod get_devices() → Sequence[_D][source]: Return a list of all devices on all platforms.

classmethod get_devices_by_platform() → Sequence[Sequence[_D]][source]: Return a list of all devices, with a sub-list per platform.

property is_accelerator: bool: Whether device is an accelerator (as defined by OpenCL device types).

property is_cpu: bool: Whether the device is a CPU.

property is_cuda: bool: Whether the device is a CUDA device.

property is_gpu: bool: Whether device is a GPU.

make_context() → Context[source]: Create a new context associated with this device.

property name: str: Return human-readable name for the device.

property platform_name: str: Return human-readable name for the platform owning the device.

property simd_group_size: int

Return the number of workitems that run in lock-step.

This must only be used to tune performance parameters; there are no guarantees about memory coherency, forward progress etc.

class katsdpsigproc.cuda.Event(pycuda_event: pycuda.driver.Event)[source]

Bases: AbstractEvent

time_since(prior_event: Event) → float[source]

Return the time in seconds from prior_event to self.

Unlike the PyCUDA method of the same name, this will wait for the events to complete if they have not already.

time_till(next_event: Event) → float[source]

Return the time in seconds from this event to next_event.

See time_since().

wait() → None[source]: Block until the event has completed.

class katsdpsigproc.cuda.Kernel(program: Program, name: str)[source]: Bases: AbstractKernel[Program]

class katsdpsigproc.cuda.Program(pycuda_module: pycuda.driver.Module)[source]

Bases: AbstractProgram[Kernel]

get_kernel(name: str) → Kernel[source]

Create a new kernel.

Parameters:: name – Name of the kernel function

class katsdpsigproc.cuda.TuningCommandQueue(*args, **kwargs)[source]

Bases: CommandQueue, AbstractTuningCommandQueue[GPUArray, Context, Event, Kernel]

context: _C

start_tuning() → None[source]

stop_tuning() → float[source]

katsdpsigproc.fft module

Fast Fourier Transforms.

Currently this module only supports CUDA, using cuFFT.

It does not use scikit-cuda, because that insists on creating a cublas context to obtain a version number, which then permanently consumes GPU memory on some arbitrary GPU. It also seems the maintainer no longer has time.

Only Linux is supported. Windows and MacOS support would require changing the code for locating the library.

class katsdpsigproc.fft.Fft(template: FftTemplate, command_queue: AbstractCommandQueue, mode: FftMode, allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Forward or inverse Fourier transformation.

Slots

src: Input data
dest: Output data
work_area: Scratch area for work. The contents should not be used; it is made available so that it can be aliased with other scratch areas. If the implementation does not need any device memory for scratch space, this slot will not exist.

Parameters:

template – Operation template
command_queue (katsdpsigproc.cuda.CommandQueue) – Command queue for the operation
mode – FFT direction
allocator – Allocator used to allocate unbound slots

command_queue: CommandQueue

class katsdpsigproc.fft.FftMode(value)[source]

Bases: Enum

An enumeration.

FORWARD = 0

INVERSE = 1

Bases: object

Operation template for a forward or reverse FFT.

The transformation is done over the last N dimensions, with the remaining dimensions for batching multiple arrays to be transformed. Dimensions before the first N must not be padded.

This template bakes in more information than most (data shapes), which is due to constraints in CUFFT.

The template can specify real->complex, complex->real, or complex->complex. In the last case, the same template can be used to instantiate forward or inverse transforms. Otherwise, real->complex transforms must be forward, and complex->real transforms must be inverse.

For real<->complex transforms, the final dimension of the padded shape need only be \(\lfloor\frac{L}{2}\rfloor + 1\), where \(L\) is the last element of shape.

The transform is unnormalised: performing a forward followed by a reverse transform will scale the result by the number of elements.

Parameters:

context – Context for the operation
N – Number of dimensions for the transform
shape – Shape of the data (N or more dimensions). For real->complex or complex->real transformation, this is the size of the real side of the transform.
dtype_src ({np.float32, np.float64, np.complex64, np.complex128}) – Data type for input
dtype_dest ({np.float32, np.float64, np.complex64, np.complex128}) – Data type for output
padded_shape_src – Padded shape of the input
padded_shape_dest – Padded shape of the output
tuning – Tuning parameters (currently unused)

instantiate(*args, **kwargs)[source]

katsdpsigproc.fill module

Fill device array with a constant value.

class katsdpsigproc.fill.Fill(template: FillTemplate, command_queue: AbstractCommandQueue, shape: Tuple[int, ...], allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Concrete instance of FillTemplate.

Slots

data: Array to be filled (padding will be filled too)

Parameters:

template – Operation template
command_queue – Command queue for the operation
shape – Shape for the data slot
allocator – Allocator used to allocate unbound slots

parameters() → Mapping[str, Any][source]: Return dictionary of configuration options for this operation.

set_value(value: Any) → None[source]

Bases: object

Fill a device array with a constant value.

The pad elements are also filled with this value.

Note

To fill with zeros, use DeviceArray.zero()

Parameters:

context – Context for which kernels will be compiled
dtype – Type of data elements
ctype – Type (in C/CUDA, not numpy) of data elements
tuning –
Kernel tuning parameters; if omitted, will autotune. The possible parameters are
- wgs: number of workitems per workgroup

classmethod autotune(context: AbstractContext, dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], ctype: str) → Mapping[str, Any][source]

instantiate(command_queue: AbstractCommandQueue, shape: Tuple[int, ...], allocator: AbstractAllocator | None = None) → Fill[source]

katsdpsigproc.maskedsum module

Perform on-device percentile calculation of 2D arrays.

class katsdpsigproc.maskedsum.MaskedSum(template: MaskedSumTemplate, command_queue: AbstractCommandQueue, shape: Tuple[int, int], allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Concrete instance of MaskedSumTemplate.

Slots

src: Input type complex64 Shape is number of rows by number of columns, masked sum is calculated along the rows, independently per column.
mask: Input type float32 Shape is (number of rows of input).
dest: Output type complex64 Shape is (number of columns of input)

parameters() → Mapping[str, Any][source]: Return dictionary of configuration options for this operation.

class katsdpsigproc.maskedsum.MaskedSumTemplate(context: AbstractContext, use_amplitudes: bool = False, tuning: _TuningDict | None = None)[source]

Bases: object

Kernel for calculating masked sums of a 2D array of data.

Masked sums are calculated per column (along rows, independently per column).

Parameters:

context – Context for which kernels will be compiled
use_amplitudes – If true, the amplitudes of the inputs rather than the inputs themselves will be summed.
tuning –
Kernel tuning parameters; if omitted, will autotune. The possible parameters are
- size: number of workitems per workgroup

classmethod autotune(context: AbstractContext, use_amplitudes: bool) → _TuningDict[source]

autotune_version = 1

instantiate(command_queue: AbstractCommandQueue, shape: Tuple[int, int], allocator: AbstractAllocator | None = None) → MaskedSum[source]

katsdpsigproc.opencl module

Abstraction layer over PyOpenCL.

It implements the abstract interfaces defined by katsdpsigproc.abc.

class katsdpsigproc.opencl.CommandQueue(context: Context, pyopencl_command_queue: pyopencl.CommandQueue | None = None, profile: bool = False)[source]

Bases: AbstractCommandQueue[Array, Context, Event, Kernel]

Abstraction of a command queue.

If no existing command queue is passed to the constructor, a new one is created.

Parameters:

context (katsdpsigproc.abc._C) – Context owning the queue
pyopencl_command_queue – Existing command queue to wrap
profile – If true and pyopencl_command_queue is omitted, enabling profiling (timing) on the queue.

context: _C

enqueue_copy_buffer(src_buffer: _DummyArray | pyopencl.array.Array, dest_buffer: _DummyArray | pyopencl.array.Array) → None[source]

enqueue_copy_buffer_rect(src_buffer: _DummyArray | pyopencl.array.Array, dest_buffer: _DummyArray | pyopencl.array.Array, src_origin: int, dest_origin: int, shape: Sequence[int], src_strides: Sequence[int], dest_strides: Sequence[int]) → None[source]

Copy a subregion of one buffer to another.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use copy_region() for a high-level interface.

Parameters:

src_buffer – Source and destination buffers
dest_buffer – Source and destination buffers
src_origin – Offsets for the start of the copy, in bytes
dest_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
src_strides – Strides for the source and destination memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
dest_strides – Strides for the source and destination memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.

enqueue_kernel(kernel: Kernel, args: Sequence[Any], global_size: Tuple[int, ...], local_size: Tuple[int, ...]) → None[source]

Enqueue a kernel to the command queue.

Warning

It is not thread-safe to call this function in two threads on the same kernel at the same time.

Parameters:

kernel – Kernel to run
args – Arguments to pass to the kernel. Refer to the PyOpenCL/CUDA documentation for details. Additionally, this function allows a low-level device array to be passed.
global_size – Number of work-items in each global dimension
local_size – Number of work-items in each local dimension. Must divide exactly into global_size.

enqueue_marker() → Event[source]: Create an event at this point in the command queue.

enqueue_read_buffer(buffer: pyopencl.array.Array, data: Any, blocking: bool = True) → None[source]

Copy data from the device to the host.

Only whole-buffer copies are supported, and the shape and type must match. In general, one should use the convenience functions in accel.DeviceArray.

Parameters:

buffer – Source
data – Target
blocking – If true (default) the call blocks until the copy is complete

enqueue_read_buffer_rect(buffer: pyopencl.array.Array, data: Any, buffer_origin: int, data_origin: int, shape: Sequence[int], buffer_strides: Sequence[int], data_strides: Sequence[int], blocking: bool = True) → None[source]

Copy a region of a buffer to host memory.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use set_region() for a high-level interface.

Parameters:

buffer – Source
data (array-like) – Target
buffer_origin – Offsets for the start of the copy, in bytes
data_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
buffer_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
data_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
blocking – If true, block until the transfer is complete.

enqueue_wait_for_events(events: Sequence[Event]) → None[source]: Enqueue a barrier to wait for all events in events.

enqueue_write_buffer(buffer: pyopencl.array.Array, data: Any, blocking: bool = True) → None[source]

Copy data from the host to the device.

Only whole-buffer copies are supported, and the shape and type must match. In general, one should use the convenience functions in accel.DeviceArray.

Parameters:

buffer – Target
data (array-like) – Source
blocking – If true (default), the call blocks until the source has been fully read (it has not necessarily reached the device).

enqueue_write_buffer_rect(buffer: pyopencl.array.Array, data: Any, buffer_origin: int, data_origin: int, shape: Sequence[int], buffer_strides: Sequence[int], data_strides: Sequence[int], blocking: bool = True) → None[source]

Copy a region of host memory to a buffer.

This is a low-level interface that ignores the shape, strides etc of the buffers, and treats them as byte arrays. It also only supports 3 or fewer dimensions. Use set_region() for a high-level interface.

Parameters:

buffer – Target
data (array-like) – Source
buffer_origin – Offsets for the start of the copy, in bytes
data_origin – Offsets for the start of the copy, in bytes
shape – Shape of the region to copy (1-3 elements). The first dimension is a byte count.
buffer_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
data_strides – Strides for the destination and source memory layout, with the same length as shape. The first element of each must be 1, and each element must be a factor of the next element.
blocking – If true, block until the transfer is complete.

enqueue_zero_buffer(buffer: pyopencl.array.Array) → None[source]: Fill a buffer with zero bytes.

finish() → None[source]: Block until all enqueued work has completed.

flush() → None[source]: Start enqueued work running, but do not wait for it to complete.

class katsdpsigproc.opencl.Context(pyopencl_context: pyopencl.Context)[source]

Bases: AbstractContext[Array, Buffer, None, Device, Program, CommandQueue, TuningCommandQueue]

Abstraction of an OpenCL context.

Create a typed buffer on the device.

Parameters:

shape – Shape for the array
dtype – Type for the data
raw – Memory backing the array (automatically allocated if None)

Create a buffer in host memory that can be efficiently copied to and from the device.

Parameters:

shape – Shape for the array
dtype – Type for the data

allocate_raw(n_bytes: int) → pyopencl.Buffer[source]: Create an untyped buffer on the device.

allocate_svm(shape: Tuple[int, ...], dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], raw: None = None) → ndarray[source]: Allocate shared virtual memory.

allocate_svm_raw(n_bytes: int) → None[source]: Allocate raw storage that can be passed to allocate_svm().

compile(source: str, extra_flags: List[str] | None = None) → Program[source]

Build a program object from source.

Parameters:

source – Source code
extra_flags – Extra parameters to pass to the compiler

create_command_queue(profile: bool = False) → CommandQueue[source]

Create a new command queue associated with this context.

Parameters:: profile – If true, the command queue will support timing kernels

create_tuning_command_queue() → TuningCommandQueue[source]: Create a new command queue for doing autotuning.

property device: Device: Return the device associated with the context (or the first device, if multiple).

class katsdpsigproc.opencl.Device(pyopencl_device: pyopencl.Device)[source]

Bases: AbstractDevice[Context]

Abstraction of an OpenCL device.

property driver_version: str: Return human-readable name for the driver version.

classmethod get_devices() → Sequence[_D][source]: Return a list of all devices on all platforms.

classmethod get_devices_by_platform() → Sequence[Sequence[_D]][source]: Return a list of all devices, with a sub-list per platform.

property is_accelerator: bool: Whether device is an accelerator (as defined by OpenCL device types).

property is_cpu: bool: Whether the device is a CPU.

property is_cuda: bool: Whether the device is a CUDA device.

property is_gpu: bool: Whether device is a GPU.

make_context() → Context[source]: Create a new context associated with this device.

property name: str: Return human-readable name for the device.

property platform_name: str: Return human-readable name for the platform owning the device.

property simd_group_size: int

Return the number of workitems that run in lock-step.

This must only be used to tune performance parameters; there are no guarantees about memory coherency, forward progress etc.

class katsdpsigproc.opencl.Event(pyopencl_event: pyopencl.Event)[source]

Bases: AbstractEvent

time_since(prior_event: Event) → float[source]

Return the time in seconds from prior_event to self.

Unlike the PyCUDA method of the same name, this will wait for the events to complete if they have not already.

time_till(next_event: Event) → float[source]

Return the time in seconds from this event to next_event.

See time_since().

wait() → None[source]: Block until the event has completed.

class katsdpsigproc.opencl.Kernel(program: Program, name: str)[source]: Bases: AbstractKernel[Program]

class katsdpsigproc.opencl.Program(pyopencl_program: pyopencl.Program)[source]

Bases: AbstractProgram[Kernel]

get_kernel(name: str) → Kernel[source]

Create a new kernel.

Parameters:: name – Name of the kernel function

class katsdpsigproc.opencl.TuningCommandQueue(*args, **kwargs)[source]

Bases: CommandQueue, AbstractTuningCommandQueue[Array, Context, Event, Kernel]

Command queue with extra facilities for autotuning.

It keeps track of kernels that are enqueued since the last call to start_tuning(), and reports the total time they consume when stop_tuning() is called.

context: _C

enqueue_kernel(kernel: Kernel, args: Sequence[Any], global_size: Tuple[int, ...], local_size: Tuple[int, ...]) → None[source]

Enqueue a kernel to the command queue.

Warning

It is not thread-safe to call this function in two threads on the same kernel at the same time.

Parameters:

kernel – Kernel to run
args – Arguments to pass to the kernel. Refer to the PyOpenCL/CUDA documentation for details. Additionally, this function allows a low-level device array to be passed.
global_size – Number of work-items in each global dimension
local_size – Number of work-items in each local dimension. Must divide exactly into global_size.

start_tuning() → None[source]

stop_tuning() → float[source]

katsdpsigproc.percentile module

Perform on-device percentile calculation of 2D arrays.

class katsdpsigproc.percentile.Percentile5(template: Percentile5Template, command_queue: AbstractCommandQueue, shape: Tuple[int, int], column_range: Tuple[int, int] | None, allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Concrete instance of PercentileTemplate.

Warning

Assumes all values are positive when template.is_amplitude is True.

Slots

src: Input type float32 or complex64. Shape is number of rows by number of columns, where 5 percentiles are computed along the columns, per row.
dest: Output type float32. Shape is (5, number of rows of input)

Parameters:

template – Operation template
command_queue – Command queue for the operation
shape – Shape of the source data
column_range – Half-open interval of columns that will be processed. If not specified, all columns are processed.
allocator – Allocator used to allocate unbound slots

parameters() → Mapping[str, Any][source]: Return dictionary of configuration options for this operation.

class katsdpsigproc.percentile.Percentile5Template(context: AbstractContext, max_columns: int, is_amplitude: bool = True, tuning: _TuningDict | None = None)[source]

Bases: object

Kernel for calculating percentiles of a 2D array of data.

5 percentiles [0,100,25,75,50] are calculated per row (along columns, independently per row). The lower percentile element, rather than a linear interpolation is chosen. WARNING: assumes all values are positive.

Parameters:

context – Context for which kernels will be compiled
max_columns – Maximum number of columns
is_amplitude – If true, the inputs are scalar amplitudes; if false, they are complex numbers and the answers are computed on the absolute values
tuning –
Kernel tuning parameters; if omitted, will autotune. The possible parameters are
- size: number of workitems per workgroup along each row
- wgsy: number of workitems per workgroup along each column

classmethod autotune(context: AbstractContext, max_columns: int, is_amplitude: bool) → _TuningDict[source]

autotune_version = 8

instantiate(command_queue: AbstractCommandQueue, shape: Tuple[int, int], column_range: Tuple[int, int] | None = None, allocator: AbstractAllocator | None = None) → Percentile5[source]

katsdpsigproc.pytest_plugin module

Plugin for testing with pytest.

See the pytest section of the documentation for details.

katsdpsigproc.pytest_plugin.command_queue(context: AbstractContext) → AbstractCommandQueue[source]

katsdpsigproc.pytest_plugin.context(device: AbstractDevice, patch_autotune) → Generator[AbstractContext, None, None][source]

katsdpsigproc.pytest_plugin.patch_autotune(request, monkeypatch) → None[source]

katsdpsigproc.pytest_plugin.pytest_addoption(parser)[source]

katsdpsigproc.pytest_plugin.pytest_configure(config) → None[source]

katsdpsigproc.pytest_plugin.pytest_generate_tests(metafunc) → None[source]

katsdpsigproc.reduce module

Reduction algorithms.

class katsdpsigproc.reduce.HReduce(template: HReduceTemplate, command_queue: AbstractCommandQueue, shape: Tuple[int, int], column_range: Tuple[int, int] | None = None, allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Concrete instance of HReduceTemplate.

In each row, the elements in the specified column range are reduced using the reduction operator supplied to the template.

Slots

srcrows × columns: Input data
destrows: Output reductions

Parameters:

template – Operation template
command_queue – Command queue for the operation
shape – Shape for the source slot
column_range – Half-open range of columns to reduce (defaults to the entire array)
allocator – Allocator used to allocate unbound slots

parameters() → Mapping[str, Any][source]: Return dictionary of configuration options for this operation.

Bases: object

Performs reduction along rows in a 2D array.

Only commutative reduction operators are supported.

Parameters:

context – Context for which kernels will be compiled
dtype – Type of data elements
ctype – Type (in C/CUDA, not numpy) of data elements
op – C expression to combine two variables, a and b
identity – C expression for an identity value for op
extra_code – Arbitrary C code to paste in (for use by op or identity)
tuning –
Kernel tuning parameters; if omitted, will autotune. The possible parameters are
- wgsx: number of workitems per workgroup per row
- wgsy: number of rows to handle per workgroup

classmethod autotune(context: AbstractContext, dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], ctype: str, op: str, identity: str, extra_code: str) → _TuningDict[source]

instantiate(command_queue: AbstractCommandQueue, shape: Tuple[int, int], column_range: Tuple[int, int] | None = None, allocator: AbstractAllocator | None = None) → HReduce[source]

katsdpsigproc.resource module

Utilities for scheduling device operations with asyncio.

class katsdpsigproc.resource.JobQueue[source]

Bases: object

Maintain a list of in-flight asynchronous jobs.

add(job: Awaitable) → None[source]

Append a job to the list.

If job is a coroutine, it is automatically wrapped in a task.

clean() → None[source]: Remove completed jobs from the front of the queue.

async finish(max_remaining: int = 0) → None[source]

Wait for jobs to finish until there are at most max_remaining in the queue.

This is a coroutine.

class katsdpsigproc.resource.Resource(value: _T, loop: AbstractEventLoop | None = None)[source]

Bases: Generic[_T]

Abstraction of a contended resource, which may exist on a device.

Passing of ownership is done via futures. Acquiring a resource is a non-blocking operation that returns two futures: a future to wait for before use, and a future to be signalled with a result when done. The value of each of these futures is a (possibly empty) list of device events which must be waited on before more device work is scheduled.

Parameters:

value (object) – Underlying resource to manage
loop – Event loop used for asynchronous operations. If not specified, defaults to asyncio.get_event_loop().

value

Underlying resource passed to the constructor

Type:: object

acquire() → ResourceAllocation[_T][source]

Indicate intent to acquire the resource.

This does not actually acquire the resource, but instead returns a handle that can be used to acquire and release it later. Acquisitions always occur in the order in which calls to acquire() are made.

See ResourceAllocation for further details.

class katsdpsigproc.resource.ResourceAllocation(start: asyncio.Future[List[AbstractEvent]], end: asyncio.Future[List[AbstractEvent]], value: _T, loop: AbstractEventLoop)[source]

Bases: Generic[_T]

A handle representing a future acquisition of a resource.

There are two ways to make the acquisition current:

Call wait(), which returns a future. The result of this future is a list of device events that must complete before it is safe to use the resource; they can either be waited for on the host (for example, using run_in_executor), or by enqueuing a device wait before device operations on the resource.

The above steps (with a blocking host wait) can be combined using wait_events().

Instances of this class should never be constructed directly. Instead, use Resource.acquire().

This class implements the context manager protocol, providing the underlying resource as the return value. This handles some cleanup if the method using the resource raises an exception without releasing the resource cleanly.

ready(events: List[AbstractEvent] | None = None) → None[source]

Indicate that we are done with the resource, and that subsequent acquirers may use it.

Note that even if the caller decides that it doesn’t need to use the resource, it must not call this until the resource is ready.

Parameters:: events – Device events that must be waited on before the resource is ready

wait() → asyncio.Future[List[AbstractEvent]][source]: Return a future that will be set to a list of device events to waited for.

async wait_events() → None[source]

Wait for previous use of the resource to be complete on the host.

This is a coroutine.

async katsdpsigproc.resource.async_wait_for_events(events: Iterable[AbstractEvent], loop: AbstractEventLoop | None = None) → None[source]: Coroutine that waits for a list of device events.

async katsdpsigproc.resource.wait_until(future: Awaitable[_T], when: float, loop: AbstractEventLoop | None = None) → _T[source]: Like asyncio.wait_for(), but with an absolute timeout.

katsdpsigproc.transpose module

Transpose 2D arrays on a device.

class katsdpsigproc.transpose.Transpose(template: TransposeTemplate, command_queue: AbstractCommandQueue, shape: Tuple[int, int], allocator: AbstractAllocator | None = None)[source]

Bases: Operation

Concrete instance of TransposeTemplate.

Slots

src: Input
dest: Output

parameters() → Mapping[str, Any][source]: Return dictionary of configuration options for this operation.

Bases: object

Kernel for transposing a 2D array of data.

Parameters:

context – Context for which kernels will be compiled
dtype – Type of data elements
ctype – Type (in C/CUDA, not numpy) of data elements
tuning –
Kernel tuning parameters; if omitted, will autotune. The possible parameters are
- block: number of workitems per workgroup in each dimension
- vtx, vty: elements per workitem in each dimension

classmethod autotune(context: AbstractContext, dtype: dtype | None | type | _SupportsDType | str | Tuple[Any, int] | Tuple[Any, int | Sequence[int]] | List[Any] | _DTypeDict | Tuple[Any, Any], ctype: str) → _TuningDict[source]

autotune_version = 1

instantiate(command_queue: AbstractCommandQueue, shape: Tuple[int, int], allocator: AbstractAllocator | None = None) → Transpose[source]

katsdpsigproc.tune module

Tools for autotuning algorithm parameters and caching the results.

Note that this is intended to apply only to parameters that affect performance, such as work-group sizes, and not the outputs.

The design is that each computation class that implements autotuning will provide a class method (typically called autotune) that takes a context and user-supplied parameters (related to the problem rather than the implementation), and returns a dictionary of values. A decorator is applied to the autotuning method to cause the results to be cached.

The class may define an autotune_version class attribute to version the table.

At present, caching is implemented in an sqlite3 database. There is a table corresponding to each autotuning method. The table has the following columns:

device_name: string, name for the compute device
device_platform: string, name for the compute device’s platform
device_version: version string for the driver
arg_*: for each parameter passed to the autotuner
value_*: for each value returned from the autotuner

The database is stored in the user cache directory.

katsdpsigproc.tune.adapt_value(value: Any) → Any[source]

Convert value to a type that can be used in sqlite3.

This is not done through the sqlite3 adapter interface, because that is global rather than per-connection. This also only applies to lookup keys, not results, because it is not a symmetric relationship.

katsdpsigproc.tune.autotune(generate: Callable[[...], Callable[[int], float] | None], time_limit: float = 0.1, threads: int | None = None, **kwargs: Any) → Mapping[str, Any][source]

Run a number of tuning experiments and find the optimal combination of parameters.

Each argument is a iterable. The generate function is passed each element of the Cartesian product (by keyword), and returns a callable. This callable is passed an iteration count, and returns a score: lower is better. If either generate or the function it returns raises an exception, it is suppressed. Returns a dictionary with the best combination of values.

Instead of a callable, the generate function may return None to skip a configuration that it knows will be unsuitable. Throwing an exception has essentially the same effect (and is used in code written before returning None was allowed), but returning None is preferred since an exception just clutters the log.

The scoring function should not do a warmup pass: that is handled by this function.

Parameters:

generate – function that creates a scoring function
time_limit – amount of time to spend testing each configuration (excluding setup time)
threads – number of parallel calls to generate to make at a time (defaults to CPU count)

Raises:

Exception – if every combination throws an exception, the last exception is re-raised.
ValueError – if the Cartesian product is empty

katsdpsigproc.tune.autotuner(test: Mapping[str, Any]) → Callable[[_T], _T][source]

Decorate a function to make it an autotuning function that caches the result.

The function must take a class and a context as the first two arguments. The remaining arguments form a cache key, along with properties of the device and the name of the function.

Every argument to the function must have a name, which implies that the *args construct may not be used.

Parameters:: test (dictionary) – A value that will be returned by stub_autotuner().

katsdpsigproc.tune.autotuner_impl(test: Mapping[str, Any], fn: _TuningFunc, *args: Any, **kwargs: Any) → Mapping[str, Any][source]

Implement autotuner().

It is split into a separate function so that mocks can patch it.

katsdpsigproc.tune.force_autotuner(test: Mapping[str, Any], fn: _TuningFunc, *args: Any, **kwargs: Any) → Mapping[str, Any][source]

Drop-in replacement for autotuner_impl() that does not do any caching.

It is intended to be used with a mocking framework.

katsdpsigproc.tune.make_measure(queue: AbstractTuningCommandQueue, function: Callable[[], None]) → Callable[[int], float][source]

Generate a measurement function.

The result can be returned by the function passed to autotune(). It calls function (with no arguments) the appropriate number of times and returns the averaged elapsed time as measured by queue.

katsdpsigproc.tune.stub_autotuner(test: Mapping[str, Any], fn: _TuningFunc, *args: Any, **kwargs: Any) → Mapping[str, Any][source]

Drop-in replacement for autotuner_impl() that does not do any tuning.

It instead returns the provided value. It is intended to be used with a mocking framework.

Module contents

Karoo Array Telescope accelerated signal processing tools.