Synchronization

Kernel executions are asynchronous, yet so far we haven’t needed any explicit synchronization. That’s because a command queue is in-order, so there is no need to explicitly synchronize between sequential commands issued to the same command queue; and because, to get the results, we’ve ended with DeviceArray.get(), which is synchronous.

However, we’ve alluded to the possibility of overlapping data transfers with computation, which needs more sophisticated techniques involving multiple command queues. We’ve seen DeviceArray.get_async() and DeviceArray.set_async() to do asynchronous transfers, but how do we tell when they’re complete?

The simplest form of synchronization is AbstractCommandQueue.finish(), which blocks until all commands issued on the command queue have completed. A related function is AbstractCommandQueue.flush(), which ensures that work has been submitted to the device (rather than being buffered up somewhere); this is useful if you want to get on with some CPU work at the same time as the device does its thing, as otherwise the device might not even start until you call finish().
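As a minimal sketch of that flush-then-finish pattern: the helper below is hypothetical (it is not part of the library) and assumes only that `queue` is an AbstractCommandQueue on which the caller has already enqueued some device work, and that `cpu_work` is any zero-argument callable.

```python
def overlap_cpu_work(queue, cpu_work):
    """Run cpu_work() on the host while already-enqueued device work executes.

    Hypothetical helper: `queue` is assumed to provide the flush() and
    finish() methods described above; `cpu_work` is any zero-argument callable.
    """
    # Submit any buffered commands so the device can start right away.
    queue.flush()
    # Host-side work proceeds while the device is (potentially) busy.
    result = cpu_work()
    # Block until every command issued on the queue has completed.
    queue.finish()
    return result
```

Without the call to flush(), the sketch would still be correct, but the device might sit idle until finish() is reached.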

Sometimes we don’t want the CPU to block on all work in a command queue. A common paradigm is to use one command queue for transferring data to the device and a second for doing computation on the device, with a double buffer. Once a block of data has been transferred, we want the device to be able to start working on it immediately, without the CPU needing to be involved. In other words, we want one command queue to block on some event from another command queue.

To do this, we use events. These behave like CUDA events. In OpenCL they would be called markers, and because function names are based on OpenCL, the relevant method is AbstractCommandQueue.enqueue_marker(), which returns an instance of AbstractEvent. An event is an item in a command queue that does no work itself, but which can be waited for.

To wait for an event on the host, use AbstractEvent.wait(). To make a command queue wait for an event before proceeding with subsequent commands, use AbstractCommandQueue.enqueue_wait_for_events().
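The two-queue pattern described earlier might look roughly like the sketch below. The function and its `enqueue_transfer` / `enqueue_compute` callables are hypothetical placeholders for whatever copy and kernel launches you need, and it assumes that enqueue_wait_for_events() accepts a list of events.

```python
def chain_transfer_to_compute(transfer_queue, compute_queue,
                              enqueue_transfer, enqueue_compute):
    """Make compute_queue wait for a transfer issued on transfer_queue.

    Hypothetical helper: enqueue_transfer and enqueue_compute are callables
    that take a command queue and enqueue work on it.
    """
    # Enqueue the copy on the transfer queue.
    enqueue_transfer(transfer_queue)
    # The marker completes once everything before it on transfer_queue is done.
    transfer_done = transfer_queue.enqueue_marker()
    # The compute queue waits for the transfer; the host does not block here.
    # Assumption: enqueue_wait_for_events() takes a list of events.
    compute_queue.enqueue_wait_for_events([transfer_done])
    enqueue_compute(compute_queue)
    # A second marker gives the caller something to wait on for the result.
    return compute_queue.enqueue_marker()
```

The host can later call wait() on the returned event, or pass it to another queue's enqueue_wait_for_events(), for example to stop the transfer queue from overwriting a buffer that is still being read.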

Profiling

If the command queue was created with profiling enabled, you can also use events for simple profiling, namely measuring the time elapsed (on the GPU) between two events. Note that if you are trying to tune your code, there may be vendor tools that give far more insight.

To create a command queue that supports profiling, use AbstractContext.create_tuning_command_queue() (instead of the usual create_command_queue()). Then use AbstractEvent.time_since() to get the time difference between events.
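A minimal sketch of timing a piece of device work this way is shown below. The helper and its `enqueue_work` callable are hypothetical. It assumes that `queue` was created with profiling enabled, and that time_since() is called on the later event with the earlier event as its argument.

```python
def time_on_device(queue, enqueue_work):
    """Return the device time elapsed while running the enqueued work.

    Hypothetical helper: `queue` is assumed to support profiling, and
    enqueue_work is a callable that takes the queue and enqueues commands.
    """
    # Bracket the work with two markers on the same in-order queue.
    start = queue.enqueue_marker()
    enqueue_work(queue)
    end = queue.enqueue_marker()
    # Wait on the host until the end marker has actually been reached.
    end.wait()
    # Assumption: time_since() takes the earlier event as its argument.
    return end.time_since(start)
```

Because the queue is in-order, the interval between the two markers covers exactly the commands enqueued between them.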