OpenCL Runtime: Programs and Kernels
Program
- PYOPENCL_NO_CACHE
By default, PyOpenCL will use cached (on disk) “binaries” returned by the OpenCL runtime when calling Program.build() on a program constructed with source. (How much compilation work this saves depends on the ICD in use.) By setting the environment variable PYOPENCL_NO_CACHE to any string that pytools.strtobool() evaluates as True, this caching is suppressed. No additional in-memory caching is performed. To retain the compiled version of a kernel in memory, simply retain the Program and/or Kernel objects.
PyOpenCL will also cache “invokers”, which are short snippets of Python that are generated to accelerate passing arguments to and enqueuing a kernel.
Added in version 2013.1.
- PYOPENCL_COMPILER_OUTPUT
When the environment variable PYOPENCL_COMPILER_OUTPUT is set to any string that pytools.strtobool() evaluates as True, PyOpenCL will show compiler messages emitted during program build.
- PYOPENCL_BUILD_OPTIONS
Any options found in the environment variable PYOPENCL_BUILD_OPTIONS will be appended to options in Program.build().
Added in version 2013.1.
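As a minimal sketch of how the three environment variables above might be set from Python: the value "1" is assumed to be among the strings that pytools.strtobool() treats as true, and -cl-fast-relaxed-math is only an illustrative OpenCL compiler flag. Setting the variables before pyopencl is imported is a conservative assumption, not a documented requirement.

```python
import os

# Assumption: set these before the first Program.build() call; doing so
# before importing pyopencl is the conservative choice.
os.environ["PYOPENCL_NO_CACHE"] = "1"          # suppress the on-disk binary cache
os.environ["PYOPENCL_COMPILER_OUTPUT"] = "1"   # show compiler messages during build
# Illustrative flag only; it is appended to the options given to Program.build().
os.environ["PYOPENCL_BUILD_OPTIONS"] = "-cl-fast-relaxed-math"

# Collect the PyOpenCL-related settings now visible to this process.
pyopencl_settings = {
    name: value
    for name, value in os.environ.items()
    if name.startswith("PYOPENCL_")
}
```

The same variables can of course be exported in the shell that launches the Python process instead.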
- class pyopencl.Program(context, src)
- class pyopencl.Program(context, devices, binaries)
binaries must contain one binary for each entry in devices. If src is a bytes object starting with a valid SPIR-V magic number, it will be handed off to the OpenCL implementation as such, rather than as OpenCL C source code. (SPIR-V support requires OpenCL 2.1.)
Changed in version 2016.2: Add support for SPIR-V.
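To make the SPIR-V rule concrete, here is a small sketch of what "starts with a valid SPIR-V magic number" can mean. The helper looks_like_spirv is hypothetical and is not PyOpenCL's own check; the constant 0x07230203 is the magic number defined by the SPIR-V specification, and the sketch assumes a module may be stored in either byte order.

```python
import struct

SPIRV_MAGIC = 0x07230203  # magic number defined by the SPIR-V specification


def looks_like_spirv(src):
    """Return True if *src* is a bytes-like object whose first word is the SPIR-V magic.

    Hypothetical helper for illustration; not part of PyOpenCL.
    """
    if not isinstance(src, (bytes, bytearray)) or len(src) < 4:
        return False
    first_word = bytes(src[:4])
    # Accept both little-endian and big-endian encodings of the magic word.
    return first_word in (
        struct.pack("<I", SPIRV_MAGIC),
        struct.pack(">I", SPIRV_MAGIC),
    )
```

OpenCL C source passed as a Python string, or bytes that do not begin with the magic word, would fall through to the ordinary source path under this reading.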
- info
Lower case versions of the program_info constants may be used as attributes on instances of this class to directly query info attributes.
- get_info(param)
See program_info for values of param.
- get_build_info(device, param)
See program_build_info for values of param.
- build(options=[], devices=None, cache_dir=None)
options is a string of compiler flags. Returns self.
If cache_dir is not None, built binaries are cached in an on-disk cache at the given path. If cache_dir is None but the context of this program was created with a cache_dir that is not None, that directory is used as the cache directory. If cache_dir is None and the context was also created with a cache_dir of None, built binaries are cached in an on-disk cache called pyopencl-compiler-cache-vN-uidNAME-pyVERSION in the directory returned by tempfile.gettempdir().
See also PYOPENCL_NO_CACHE, PYOPENCL_BUILD_OPTIONS.
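The default cache location can be inspected with the standard library alone. The sketch below only searches the directory returned by tempfile.gettempdir() for entries whose names begin with pyopencl-compiler-cache-; the helper name find_default_compiler_caches is hypothetical, and the exact suffix of the directory name is left to the glob pattern rather than guessed.

```python
import glob
import os
import tempfile


def find_default_compiler_caches():
    """List directories in the temp dir that look like PyOpenCL compiler caches.

    Hypothetical helper for illustration; not part of PyOpenCL.
    """
    # Escape the temp dir so that special characters in it are not globbed.
    tmp = glob.escape(tempfile.gettempdir())
    pattern = os.path.join(tmp, "pyopencl-compiler-cache-*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


cache_dirs = find_default_compiler_caches()
```

An empty list simply means that no program has been built with the default cache location on this machine, or that caching was suppressed.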
- compile(self, options=[], devices=None, headers=[])
- Parameters:
headers – a list of tuples (name, program).
Only available with CL 1.2.
Added in version 2011.2.
- kernel_name
You may use program.kernel_name to obtain a Kernel object from a program. Note that every lookup of this type produces a new kernel object, so that this won’t work:
    prg.sum.set_args(a_g, b_g, res_g)
    ev = cl.enqueue_nd_range_kernel(queue, prg.sum, a_np.shape, None)
Instead, either use the (recommended, stateless) calling interface:
    sum_knl = prg.sum
    sum_knl(queue, a_np.shape, None, a_g, b_g, res_g)
or the long, stateful way around, if you prefer:
    sum_knl.set_args(a_g, b_g, res_g)
    ev = cl.enqueue_nd_range_kernel(queue, sum_knl, a_np.shape, None)
The following will also work, however note that a number of caches that are important for efficient kernel enqueue are attached to the Kernel object, and these caches will be ineffective in this usage pattern:
    prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)
Note that the Program has to be built (see build()) in order for this to work simply by attribute lookup.
Note
The program_info attributes live in the same name space and take precedence over Kernel names.
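Why state set on a freshly looked-up kernel is lost can be illustrated without an OpenCL device. The sketch below is a pure-Python stand-in: FakeProgram and FakeKernel are hypothetical classes that only mimic the "new object on every attribute lookup" behavior described above, and they are not PyOpenCL classes.

```python
class FakeKernel:
    """Stand-in for a kernel: remembers the arguments set on it."""

    def __init__(self, name):
        self.name = name
        self.args = None

    def set_args(self, *args):
        self.args = args


class FakeProgram:
    """Stand-in for a program: every attribute lookup builds a new kernel."""

    def __getattr__(self, name):
        return FakeKernel(name)


prg = FakeProgram()

# Arguments set on a temporary kernel are lost, because the next lookup
# of prg.sum returns a different object.
prg.sum.set_args(1, 2, 3)
lost_args = prg.sum.args

# Retaining one kernel object keeps its state (and, in PyOpenCL, its caches).
sum_knl = prg.sum
sum_knl.set_args(1, 2, 3)
kept_args = sum_knl.args
```

The same reasoning explains the advice to look the kernel up once and keep the resulting object for repeated calls.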
- set_specialization_constant(spec_id, buffer)
Only available with CL 2.2 and newer.
Added in version 2020.3.
- int_ptr
Return an integer corresponding to the pointer value of the underlying cl_program. Use from_int_ptr() to turn back into a Python object.
Added in version 2013.2.
Instances of this class are hashable, and two instances of this class may be compared using “==” and “!=”. (Hashability was added in version 2011.2.) Two objects are considered the same if the underlying OpenCL object is the same, as established by C pointer equality.
- pyopencl.create_program_with_built_in_kernels(context, devices, kernel_names)
Only available with CL 1.2.
Added in version 2011.2.
- pyopencl.link_program(context, programs, options=[], devices=None)
Only available with CL 1.2.
Added in version 2011.2.
- pyopencl.unload_platform_compiler(platform)
Only available with CL 1.2.
Added in version 2011.2.
Kernel
- class pyopencl.Kernel(program, name)
- info
Lower case versions of the kernel_info constants may be used as attributes on instances of this class to directly query info attributes.
- clone()
Only available with CL 2.1.
Added in version 2020.3.
- get_info(param)
See kernel_info for values of param.
- get_work_group_info(param, device)
See kernel_work_group_info for values of param.
- get_arg_info(arg_index, param)
See kernel_arg_info for values of param.
Only available in OpenCL 1.2 and newer.
- get_sub_group_info(self, device, param, input_value=None)
When the OpenCL spec requests input_value to be of type size_t, it may be passed directly as a number. When it requests input_value to be of type size_t *, a tuple of integers may be passed.
Only available in OpenCL 2.1 and newer.
Added in version 2020.3.
- set_arg(self, index, arg)
arg may be one of the following:
  - None: This may be passed for __global memory references to pass a NULL pointer to the kernel.
  - Anything that satisfies the Python buffer interface, in particular numpy.ndarray, bytes, or numpy’s sized scalars, such as numpy.int32 or numpy.float64.
    Note
    Python’s own int or float objects will not work out of the box. See Kernel.set_scalar_arg_dtypes() for a way to make them work. Alternatively, the standard library module struct can be used to convert Python’s native number types to binary data in a bytes object.
  - An instance of MemoryObject (e.g. Buffer, Image, etc.).
  - An instance of LocalMemory.
  - An instance of Sampler.
  - An instance of CommandQueue (CL 2.0 and higher only).
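The struct route mentioned in the note above can be sketched as follows. The variable names and the kernel signature in the trailing comment are hypothetical; the format codes are standard struct codes, and on Python 3 struct.pack returns bytes, which satisfies the buffer interface.

```python
import struct

# "=" selects native byte order with standard sizes and no padding:
# "i" is a 4-byte int, "f" a 4-byte float, "d" an 8-byte double.
n_as_int32 = struct.pack("=i", 1000)
scale_as_float32 = struct.pack("=f", 0.5)
eps_as_float64 = struct.pack("=d", 1e-9)

# Hypothetical use, for a kernel whose arguments 3 and 4 are declared as
# `int n` and `float scale`:
#     knl.set_arg(3, n_as_int32)
#     knl.set_arg(4, scale_as_float32)
```

The packed size must match the size of the kernel's scalar parameter, which is why the sized format codes matter here.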
- set_args(self, *args)
Invoke set_arg() on each element of args in turn.
Added in version 0.92.
- set_scalar_arg_dtypes(arg_dtypes)
Inform the wrapper about the sized types of scalar Kernel arguments. For each argument, arg_dtypes contains an entry. For non-scalars, this must be None. For scalars, it must be an object acceptable to the numpy.dtype constructor, indicating that the corresponding scalar argument is of that type.
After invoking this function with the proper information, most suitable number types will automatically be cast to the right type for kernel invocation.
Note
The information set by this method is attached to a single kernel instance. A new kernel instance is created every time you use program.kernel attribute access. The following will therefore not work:
    prg = cl.Program(...).build()
    prg.kernel.set_scalar_arg_dtypes(...)
    prg.kernel(queue, n_globals, None, args)
- __call__(queue, global_size, local_size, *args, global_offset=None, wait_for=None, g_times_l=False, allow_empty_ndrange=False)
Use enqueue_nd_range_kernel() to enqueue a kernel execution, after using set_args() to set each argument in turn. See the documentation for set_arg() to see what argument types are allowed.
global_size and local_size are tuples of identical length, with between one and three entries. global_size specifies the overall size of the computational grid: one work item will be launched for every integer point in the grid. local_size specifies the workgroup size, which must evenly divide the global_size in a dimension-by-dimension manner. None may be passed for local_size, in which case the implementation will use an implementation-defined workgroup size. If g_times_l is True, the global size will be multiplied by the local size (which makes the behavior more like Nvidia CUDA). In this case, global_size and local_size also do not have to have the same number of entries.
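Because local_size must evenly divide global_size in every dimension, problem sizes are often rounded up before enqueueing. The following is a small sketch of that arithmetic; both helper names are hypothetical and neither is part of PyOpenCL. The second helper only mirrors the multiplication that g_times_l=True is described as performing, for tuples of equal length.

```python
def round_up_global_size(n_items, local_size):
    """Round each entry of n_items up to a multiple of the matching local size."""
    if len(n_items) != len(local_size):
        raise ValueError("n_items and local_size must have the same length")
    # -(-n // l) is ceiling division for positive integers.
    return tuple(-(-n // l) * l for n, l in zip(n_items, local_size))


def global_size_from_groups(n_groups, local_size):
    """Global size obtained when global_size counts work-groups (the g_times_l idea)."""
    return tuple(g * l for g, l in zip(n_groups, local_size))


global_size = round_up_global_size((1000, 30), (64, 4))
same_size = global_size_from_groups((16, 8), (64, 4))
```

When the grid is rounded up this way, the kernel itself has to skip work items whose indices fall outside the real problem size.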
allow_empty_ndrange is a bool indicating how an empty NDRange is to be treated, where “empty” means that one or more entries of global_size or local_size are zero. OpenCL itself does not allow enqueueing kernels over empty NDRanges. Setting this flag to True enqueues a marker with a wait list (clEnqueueMarkerWithWaitList) to obtain the synchronization effects that would have resulted from the kernel enqueue. Setting allow_empty_ndrange to True requires OpenCL 1.2 or newer.
Returns a new pyopencl.Event. wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.
Note
__call__() is not thread-safe. It sets the arguments using set_args() and then runs enqueue_nd_range_kernel(). Another thread could race it in doing the same things, with undefined outcome. This issue is inherited from the C-level OpenCL API. The recommended solution is to make a kernel (i.e. access prg.kernel_name, which corresponds to making a new kernel) for every thread that may enqueue calls to the kernel.
A solution involving implicit locks was discussed and decided against on the mailing list in October 2012.
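One way to follow the one-kernel-per-thread recommendation is to keep kernel objects in thread-local storage. This is only a sketch under stated assumptions: kernel_for_current_thread and the make_kernel callback are hypothetical names, and in real code the callback would be something like lambda: prg.kernel_name.

```python
import threading

_per_thread = threading.local()


def kernel_for_current_thread(make_kernel):
    """Return this thread's own kernel object, creating it on first use.

    make_kernel is a zero-argument callable that produces a new kernel.
    Hypothetical helper for illustration; not part of PyOpenCL.
    """
    knl = getattr(_per_thread, "kernel", None)
    if knl is None:
        knl = _per_thread.kernel = make_kernel()
    return knl
```

Each thread then calls the helper before enqueueing, so no two threads ever set arguments on the same kernel object. A program that uses several different kernels would need one slot per kernel name rather than the single slot shown here.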
Changed in version 0.92: local_size was promoted to third positional argument from being a keyword argument. The old keyword argument usage will continue to be accepted with a warning throughout the 0.92 release cycle. This is a backward-compatible change (just barely!) because local_size as third positional argument can only be a tuple or None. tuple instances are never valid Kernel arguments, and None is valid as an argument, but its treatment in the wrapper had a bug (now fixed) that prevented it from working.
Changed in version 2011.1: Added the g_times_l keyword arg.
Changed in version 2020.2: Added the allow_empty_ndrange keyword argument.
- capture_call(output_file, queue, global_size, local_size, *args, global_offset=None, wait_for=None, g_times_l=False)
This method supports the exact same interface as __call__(), but instead of invoking the kernel, it writes a self-contained PyOpenCL program to output_file that reproduces this invocation. Data and kernel source code will be packaged up in the source code written to output_file.
This is mainly intended as a debugging aid. For example, it can be used to automate the task of creating a small, self-contained test case for an observed problem. It can also help separate a misbehaving kernel from a potentially large or time-consuming outer code.
- Parameters:
output_file – a filename or a file-like object to which the generated code is to be written.
To use, simply change:
    evt = my_kernel(queue, gsize, lsize, arg1, arg2, ...)
to:
    evt = my_kernel.capture_call("bug.py", queue, gsize, lsize, arg1, arg2, ...)
Added in version 2013.1.
- from_int_ptr(int_ptr_value: int, retain: bool = True) -> pyopencl._cl.Kernel
(static method) Return a new Python object referencing the C-level cl_kernel object at the location pointed to by int_ptr_value. The relevant clRetain* function will be called if retain is True.
If the previous owner of the object will not release the reference, retain should be set to False, to effectively transfer ownership to pyopencl.
Added in version 2013.2.
Changed in version 2016.1: retain added.
- int_ptr
Return an integer corresponding to the pointer value of the underlying cl_kernel. Use from_int_ptr() to turn back into a Python object.
Added in version 2013.2.
Instances of this class are hashable, and two instances of this class may be compared using “==” and “!=”. (Hashability was added in version 2011.2.) Two objects are considered the same if the underlying OpenCL object is the same, as established by C pointer equality.
- class pyopencl.LocalMemory(size)
A helper class to pass __local memory arguments to kernels.
Added in version 0.91.2.
- size
The size of the local buffer, in bytes, to be provided.
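Since size is given in bytes rather than in elements, a short calculation is usually needed before constructing a LocalMemory argument. The sketch below uses only the standard library; local_buffer_bytes is a hypothetical helper, and the kernel call in the trailing comment is illustrative rather than taken from the text above.

```python
import struct


def local_buffer_bytes(n_elements, struct_format="f"):
    """Bytes needed for n_elements items of the given struct format code.

    "f" is a 4-byte float and "d" an 8-byte double when standard sizes are used.
    Hypothetical helper for illustration; not part of PyOpenCL.
    """
    return n_elements * struct.calcsize("=" + struct_format)


scratch_bytes = local_buffer_bytes(256)  # room for 256 single-precision floats

# Hypothetical use with a kernel that takes a `__local float *scratch` argument:
#     knl(queue, (1024,), (256,), a_g, cl.LocalMemory(scratch_bytes))
```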
- pyopencl.enqueue_nd_range_kernel(queue, kernel, global_work_size, local_work_size, global_work_offset=None, wait_for=None, g_times_l=False, allow_empty_ndrange=False)
global_work_size and local_work_size are tuples of identical length, with between one and three entries. global_work_size specifies the overall size of the computational grid: one work item will be launched for every integer point in the grid. local_work_size specifies the workgroup size, which must evenly divide global_work_size in a dimension-by-dimension manner. None may be passed for local_work_size, in which case the implementation will use an implementation-defined workgroup size. If g_times_l is True, the global size will be multiplied by the local size (which makes the behavior more like Nvidia CUDA). In this case, global_work_size and local_work_size also do not have to have the same number of entries.
allow_empty_ndrange is a bool indicating how an empty NDRange is to be treated, where “empty” means that one or more entries of global_work_size or local_work_size are zero. OpenCL itself does not allow enqueueing kernels over empty NDRanges. Setting this flag to True enqueues a marker with a wait list (clEnqueueMarkerWithWaitList) to obtain the synchronization effects that would have resulted from the kernel enqueue. Setting allow_empty_ndrange to True requires OpenCL 1.2 or newer.
Returns a new pyopencl.Event. wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.
Changed in version 2011.1: Added the g_times_l keyword arg.
Changed in version 2020.2: Added the allow_empty_ndrange keyword argument.