OpenCL Runtime: Programs and Kernels¶

Program¶

PYOPENCL_NO_CACHE¶

By default, PyOpenCL caches on disk the “binaries” returned by the OpenCL runtime and reuses them when Program.build() is called on a program constructed from source. (How much compilation work this saves depends on the ICD in use.) Setting the environment variable PYOPENCL_NO_CACHE to any string that pytools.strtobool() evaluates as True suppresses this caching. No additional in-memory caching is performed. To retain the compiled version of a kernel in memory, simply retain the Program and/or Kernel objects.

PyOpenCL will also cache “invokers”, which are short snippets of Python that are generated to accelerate passing arguments to and enqueuing a kernel.

Added in version 2013.1.

PYOPENCL_COMPILER_OUTPUT¶

If the environment variable PYOPENCL_COMPILER_OUTPUT is set to any string that pytools.strtobool() evaluates as True, PyOpenCL shows the compiler messages emitted during program build.

PYOPENCL_BUILD_OPTIONS¶

Any options found in the environment variable PYOPENCL_BUILD_OPTIONS will be appended to options in Program.build().

Added in version 2013.1.
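
As a minimal sketch, the three environment variables above might be set from Python rather than from the shell. The values "1" and "-cl-fast-relaxed-math" are illustrative choices. Setting the variables before pyopencl is imported is a conservative assumption about when they are read, not something this page states.

```python
import os

# Illustrative values. "1" is assumed to be a string that
# pytools.strtobool() evaluates as True.
os.environ["PYOPENCL_NO_CACHE"] = "1"         # suppress the on-disk binary cache
os.environ["PYOPENCL_COMPILER_OUTPUT"] = "1"  # show compiler messages during build

# Appended to the options passed to Program.build().
os.environ["PYOPENCL_BUILD_OPTIONS"] = "-cl-fast-relaxed-math"

# Import pyopencl only after this point (assumption: conservative ordering).
```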

class pyopencl.Program(context, src)[source]¶
class pyopencl.Program(context, devices, binaries)

binaries must contain one binary for each entry in devices. If src is a bytes object starting with a valid SPIR-V magic number, it will be handed off to the OpenCL implementation as such, rather than as OpenCL C source code. (SPIR-V support requires OpenCL 2.1.)

Changed in version 2016.2: Add support for SPIR-V.

info¶

Lower case versions of the program_info constants may be used as attributes on instances of this class to directly query info attributes.

get_info(param)[source]¶

See program_info for values of param.

get_build_info(device, param)[source]¶

See program_build_info for values of param.

build(options=[], devices=None, cache_dir=None)[source]¶

options is a string of compiler flags. Returns self.

If cache_dir is not None, built binaries are cached in an on-disk cache at the given path. If cache_dir is None but the context of this program was created with a cache_dir that is not None, that directory is used as the cache directory. If cache_dir is None and the context was also created with cache_dir set to None, built binaries are cached in an on-disk cache called pyopencl-compiler-cache-vN-uidNAME-pyVERSION in the directory returned by tempfile.gettempdir().

See also PYOPENCL_NO_CACHE, PYOPENCL_BUILD_OPTIONS.

Changed in version 2011.1: options may now also be a list of str.
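
The cache_dir precedence described for build() can be summarized in a few lines. This is a sketch of the documented rule only, not PyOpenCL's own code, and resolve_cache_location is a hypothetical helper name introduced for illustration.

```python
import tempfile


def resolve_cache_location(build_cache_dir, context_cache_dir):
    """Hypothetical helper mirroring the documented cache_dir precedence.

    In the first two cases the returned path is the cache directory itself.
    In the fallback case it is the temporary directory that will contain the
    pyopencl-compiler-cache-vN-uidNAME-pyVERSION cache directory.
    """
    if build_cache_dir is not None:
        # An explicit cache_dir passed to build() wins.
        return build_cache_dir
    if context_cache_dir is not None:
        # Otherwise fall back to the cache_dir the context was created with.
        return context_cache_dir
    # Neither was given: the cache lives under the system temp directory.
    return tempfile.gettempdir()


print(resolve_cache_location("build-cache", "context-cache"))
print(resolve_cache_location(None, "context-cache"))
print(resolve_cache_location(None, None))
```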

compile(self, options=[], devices=None, headers=[])[source]¶
Parameters:

headers – a list of tuples (name, program).

Only available with CL 1.2.

Added in version 2011.2.

kernel_name¶

You may use program.kernel_name to obtain a Kernel object from a program. Note that every lookup of this type produces a new kernel object, so the following will not work, because the arguments are set on one kernel object and a different one is enqueued:

prg.sum.set_args(a_g, b_g, res_g)
ev = cl.enqueue_nd_range_kernel(queue, prg.sum, a_np.shape, None)

Instead, either use the (recommended, stateless) calling interface:

sum_knl = prg.sum
sum_knl(queue, a_np.shape, None, a_g, b_g, res_g)

or the long, stateful way around, if you prefer:

sum_knl.set_args(a_g, b_g, res_g)
ev = cl.enqueue_nd_range_kernel(queue, sum_knl, a_np.shape, None)

The following also works. Note, however, that a number of caches that are important for efficient kernel enqueue are attached to the Kernel object, and these caches are ineffective in this usage pattern:

prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)

Note that the Program has to be built (see build()) in order for this to work simply by attribute lookup.

Note

The program_info attributes live in the same name space and take precedence over Kernel names.

Note

If you need to retrieve a kernel whose name includes non-identifier characters, retrieving it as an attribute of Program will not work, for obvious reasons. In that case, you can use the Kernel constructor directly.
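
Why state set through program.kernel_name is lost can be seen in a toy model. The classes below are stand-ins written for illustration (ToyProgram and ToyKernel are hypothetical names, not PyOpenCL classes). They only reproduce the documented behavior that each attribute lookup yields a new object.

```python
class ToyKernel:
    """Stand-in for a kernel object that stores its arguments as state."""

    def __init__(self, name):
        self.name = name
        self.args = None

    def set_args(self, *args):
        self.args = args


class ToyProgram:
    """Stand-in for a program whose attribute lookups create new kernels."""

    def __getattr__(self, name):
        # Called only for attributes not found normally;
        # returns a fresh object on every lookup.
        return ToyKernel(name)


prg = ToyProgram()

# Arguments are set on one object, then a second lookup creates another.
prg.sum.set_args(1, 2, 3)
lost = prg.sum.args

# Holding on to a single object keeps the state.
sum_knl = prg.sum
sum_knl.set_args(1, 2, 3)
kept = sum_knl.args

print(lost, kept)
```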

all_kernels()[source]¶

Returns a list of all Kernel objects in the Program.

set_specialization_constant(spec_id, buffer)¶

Only available with CL 2.2 and newer.

Added in version 2020.3.

static from_int_ptr(int_ptr_value, retain=True)[source]¶
int_ptr¶

Return an integer corresponding to the pointer value of the underlying cl_program. Use from_int_ptr() to turn back into a Python object.

Added in version 2013.2.

Instances of this class are hashable, and two instances of this class may be compared using “==” and “!=”. (Hashability was added in version 2011.2.) Two objects are considered the same if the underlying OpenCL object is the same, as established by C pointer equality.

pyopencl.create_program_with_built_in_kernels(context, devices, kernel_names)[source]¶

Only available with CL 1.2.

Added in version 2011.2.

pyopencl.unload_platform_compiler(platform)¶

Only available with CL 1.2.

Added in version 2011.2.

Kernel¶

class pyopencl.Kernel(program, name)¶
info¶

Lower case versions of the kernel_info constants may be used as attributes on instances of this class to directly query info attributes.

clone()¶

Only available with CL 2.1.

Added in version 2020.3.

get_info(param)[source]¶

See kernel_info for values of param.

get_work_group_info(param, device)[source]¶

See kernel_work_group_info for values of param.

get_arg_info(arg_index, param)¶

See kernel_arg_info for values of param.

Only available in OpenCL 1.2 and newer.

get_sub_group_info(self, device, param, input_value=None)¶

When the OpenCL spec requires input_value to be of type size_t, it may be passed directly as a number. When the spec requires input_value to be of type size_t *, a tuple of integers may be passed.

Only available in OpenCL 2.1 and newer.

Added in version 2020.3.

set_arg(self, index, arg)¶

arg may be

set_args(self, *args)¶

Invoke set_arg() on each element of args in turn.

Added in version 0.92.

set_scalar_arg_dtypes(arg_dtypes)[source]¶

Inform the wrapper about the sized types of scalar Kernel arguments. For each argument, arg_dtypes contains an entry. For non-scalars, this must be None. For scalars, it must be an object acceptable to the numpy.dtype constructor, indicating that the corresponding scalar argument is of that type.

After invoking this function with the proper information, most suitable number types will automatically be cast to the right type for kernel invocation.
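
A plain Python number carries no C width, which is why the wrapper needs to be told the sized types. The following standard-library sketch does not use PyOpenCL or numpy. It only shows that the same value produces byte strings of different length depending on the C type chosen, here using struct format codes as a stand-in for dtypes.

```python
import struct

value = 2.5
as_float32 = struct.pack("<f", value)  # C float, 4 bytes
as_float64 = struct.pack("<d", value)  # C double, 8 bytes

count = 7
as_int32 = struct.pack("<i", count)    # 32-bit signed integer
as_int64 = struct.pack("<q", count)    # 64-bit signed integer

# Byte lengths of the four encodings, in the order packed above.
sizes = (len(as_float32), len(as_float64), len(as_int32), len(as_int64))
print(sizes)
```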

Note

The information set by this method is attached to a single kernel instance. A new kernel instance is created every time you use program.kernel attribute access. The following will therefore not work:

prg = cl.Program(...).build()
prg.kernel.set_scalar_arg_dtypes(...)
prg.kernel(queue, n_globals, None, args)

__call__(queue, global_size, local_size, *args, global_offset=None, wait_for=None, g_times_l=False, allow_empty_ndrange=False)¶

Uses enqueue_nd_range_kernel() to enqueue a kernel execution, after using set_args() to set each argument in turn. See the documentation for set_arg() to see what argument types are allowed.

global_size and local_size are tuples of identical length, with between one and three entries. global_size specifies the overall size of the computational grid: one work item will be launched for every integer point in the grid. local_size specifies the workgroup size, which must evenly divide the global_size in a dimension-by-dimension manner. None may be passed for local_size, in which case the implementation will use an implementation-defined workgroup size. If g_times_l is True, the global size will be multiplied by the local size (which makes the behavior more like Nvidia CUDA). In this case, global_size and local_size also do not have to have the same number of entries.
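
Because local_size must evenly divide global_size, a common approach is to round each grid dimension up to the next multiple of the workgroup size. The helpers below are a sketch with hypothetical names (round_up, padded_global_size, scaled_global_size). They are not part of PyOpenCL, and the g_times_l helper only covers the case where both tuples have the same length.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_global_size(shape, local_size):
    """Round each dimension of shape up to a multiple of local_size."""
    if len(shape) != len(local_size):
        raise ValueError("shape and local_size must have the same length")
    return tuple(round_up(n, l) for n, l in zip(shape, local_size))


def scaled_global_size(group_counts, local_size):
    """Global size that results when g_times_l is True (equal-length case)."""
    return tuple(g * l for g, l in zip(group_counts, local_size))


print(padded_global_size((1000, 30), (64, 8)))  # (1024, 32)
print(scaled_global_size((16, 4), (64, 8)))     # (1024, 32)
```

When the global size is padded this way, the kernel itself has to skip the extra work items that fall outside the real data.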

allow_empty_ndrange is a bool indicating how an empty NDRange is to be treated, where “empty” means that one or more entries of global_size or local_size are zero. OpenCL itself does not allow enqueueing kernels over empty NDRanges. Setting this flag to True enqueues a marker with a wait list (clEnqueueMarkerWithWaitList) to obtain the synchronization effects that would have resulted from the kernel enqueue. Setting allow_empty_ndrange to True requires OpenCL 1.2 or newer.
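
The definition of “empty” given above can be written as a small predicate. is_empty_ndrange is a hypothetical helper for illustration and is not a PyOpenCL function.

```python
def is_empty_ndrange(global_size, local_size=None):
    """True if any entry of global_size or local_size is zero."""
    sizes = tuple(global_size)
    if local_size is not None:
        sizes += tuple(local_size)
    return any(entry == 0 for entry in sizes)


print(is_empty_ndrange((0, 4)))          # True
print(is_empty_ndrange((8, 4), (2, 2)))  # False
```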

Returns a new pyopencl.Event. wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.

Note

__call__() is not thread-safe. It sets the arguments using set_args() and then runs enqueue_nd_range_kernel(). Another thread could race it in doing the same things, with undefined outcome. This issue is inherited from the C-level OpenCL API. The recommended solution is to make a kernel (i.e. access prg.kernel_name, which corresponds to making a new kernel) for every thread that may enqueue calls to the kernel.

A solution involving implicit locks was discussed and decided against on the mailing list in October 2012.
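
One way to follow the one-kernel-per-thread recommendation is to keep each thread's kernel in thread-local storage. The sketch below uses only the standard library. kernel_for_this_thread is a hypothetical helper, and the demo passes object as a stand-in factory where real code might pass something like lambda: prg.kernel_name.

```python
import threading

_per_thread = threading.local()


def kernel_for_this_thread(make_kernel):
    """Return this thread's own kernel, creating it on first use."""
    knl = getattr(_per_thread, "kernel", None)
    if knl is None:
        knl = make_kernel()
        _per_thread.kernel = knl
    return knl


# Demo with a stand-in factory (object) instead of a real kernel.
seen = []


def worker():
    first = kernel_for_this_thread(object)
    second = kernel_for_this_thread(object)
    # Record the object and whether both lookups returned the same one.
    seen.append((first, first is second))


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread reuses its own object across calls, while different threads never share one.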

Changed in version 0.92: local_size was promoted to third positional argument from being a keyword argument. The old keyword argument usage will continue to be accepted with a warning throughout the 0.92 release cycle. This is a backward-compatible change (just barely!) because local_size as third positional argument can only be a tuple or None. tuple instances are never valid Kernel arguments, and None is valid as an argument, but its treatment in the wrapper had a bug (now fixed) that prevented it from working.

Changed in version 2011.1: Added the g_times_l keyword arg.

Changed in version 2020.2: Added the allow_empty_ndrange keyword argument.

capture_call(output_file, queue, global_size, local_size, *args, global_offset=None, wait_for=None, g_times_l=False)[source]¶

This method supports the exact same interface as __call__(), but instead of invoking the kernel, it writes a self-contained PyOpenCL program to output_file that reproduces this invocation. Data and kernel source code will be packaged up in the generated program’s source code.

This is mainly intended as a debugging aid. For example, it can be used to automate the task of creating a small, self-contained test case for an observed problem. It can also help separate a misbehaving kernel from a potentially large or time-consuming outer code.

Parameters:

output_file – a filename or a file-like object to which the generated code is to be written.

To use, simply change:

evt = my_kernel(queue, gsize, lsize, arg1, arg2, ...)

to:

evt = my_kernel.capture_call("bug.py", queue, gsize, lsize, arg1, arg2, ...)

Added in version 2013.1.

from_int_ptr(int_ptr_value: int, retain: bool = True) pyopencl._cl.Kernel¶

(static method) Return a new Python object referencing the C-level cl_kernel object at the location pointed to by int_ptr_value. The relevant clRetain* function will be called if retain is True. If the previous owner of the object will not release the reference, retain should be set to False, to effectively transfer ownership to pyopencl.

Added in version 2013.2.

Changed in version 2016.1: retain added.

int_ptr¶

Return an integer corresponding to the pointer value of the underlying cl_kernel. Use from_int_ptr() to turn back into a Python object.

Added in version 2013.2.

Instances of this class are hashable, and two instances of this class may be compared using “==” and “!=”. (Hashability was added in version 2011.2.) Two objects are considered the same if the underlying OpenCL object is the same, as established by C pointer equality.

class pyopencl.LocalMemory(size)¶

A helper class to pass __local memory arguments to kernels.

Added in version 0.91.2.

size¶

The size, in bytes, of the local buffer to be provided.

pyopencl.enqueue_nd_range_kernel(queue, kernel, global_work_size, local_work_size, global_work_offset=None, wait_for=None, g_times_l=False, allow_empty_ndrange=False)¶

global_work_size and local_work_size are tuples of identical length, with between one and three entries. global_work_size specifies the overall size of the computational grid: one work item will be launched for every integer point in the grid. local_work_size specifies the workgroup size, which must evenly divide global_work_size in a dimension-by-dimension manner. None may be passed for local_work_size, in which case the implementation will use an implementation-defined workgroup size. If g_times_l is True, the global size will be multiplied by the local size (which makes the behavior more like Nvidia CUDA). In this case, global_work_size and local_work_size also do not have to have the same number of entries.

allow_empty_ndrange is a bool indicating how an empty NDRange is to be treated, where “empty” means that one or more entries of global_work_size or local_work_size are zero. OpenCL itself does not allow enqueueing kernels over empty NDRanges. Setting this flag to True enqueues a marker with a wait list (clEnqueueMarkerWithWaitList) to obtain the synchronization effects that would have resulted from the kernel enqueue. Setting allow_empty_ndrange to True requires OpenCL 1.2 or newer.

Returns a new pyopencl.Event. wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.

Changed in version 2011.1: Added the g_times_l keyword arg.

Changed in version 2020.2: Added the allow_empty_ndrange keyword argument.