Device Interface¶

Version Queries¶

pycuda.VERSION¶

Gives the numeric version of PyCUDA as a variable-length tuple of integers. Enables easy version checks such as VERSION >= (0, 93).

Added in PyCUDA 0.93.

pycuda.VERSION_STATUS¶: A text string such as “rc4” or “beta” qualifying the status of the release.

Added in version 0.93.

pycuda.VERSION_TEXT¶: The full release name (such as “0.93rc4”) in string form.

Added in version 0.93.

Error Reporting¶

exception pycuda.driver.Error¶: Base class of all PyCuda errors.

exception pycuda.driver.CompileError¶

Thrown when pycuda.compiler.SourceModule compilation fails.

msg¶: Added in version 0.94.

stdout¶: Added in version 0.94.

stderr¶: Added in version 0.94.

command_line¶: Added in version 0.94.

exception pycuda.driver.MemoryError¶: Thrown when mem_alloc() or related functionality fails.

exception pycuda.driver.LogicError¶

Thrown when PyCuda was confronted with a situation where it is likely that the programmer has made a mistake. LogicErrors do not depend on outer circumstances defined by the run-time environment.

Example: CUDA was used before it was initialized.

exception pycuda.driver.LaunchError¶: Thrown when kernel invocation has failed. (Note that this will often be reported by the next call after the actual kernel invocation.)

exception pycuda.driver.RuntimeError¶

Thrown when a unforeseen run-time failure is encountered that is not likely due to programmer error.

Example: A file was not found.

Constants¶

class pycuda.driver.ctx_flags¶

Flags for Device.make_context(). CUDA 2.0 and above only.

SCHED_AUTO¶: If there are more contexts than processors, yield, otherwise spin while waiting for CUDA calls to complete.

SCHED_SPIN¶: Spin while waiting for CUDA calls to complete.

SCHED_YIELD¶: Yield to other threads while waiting for CUDA calls to complete.

SCHED_MASK¶: Mask of valid scheduling flags in this bitfield.

SCHED_BLOCKING_SYNC¶: Use blocking synchronization. CUDA 2.2 and newer.

MAP_HOST¶: Support mapped pinned allocations. CUDA 2.2 and newer.

LMEM_RESIZE_TO_MAX¶: Keep local memory allocation after launch. CUDA 3.2 and newer. Rumored to decrease Fermi launch overhead?

Added in version 2011.1.

FLAGS_MASK¶: Mask of valid flags in this bitfield.

class pycuda.driver.event_flags¶

Flags for Event. CUDA 2.2 and newer.

DEFAULT¶

BLOCKING_SYNC¶

DISABLE_TIMING¶: CUDA 3.2 and newer.

Added in version 0.94.

INTERPROCESS¶: CUDA 4.1 and newer.

Added in version 2011.2.

class pycuda.driver.device_attribute¶

MAX_THREADS_PER_BLOCK¶

MAX_BLOCK_DIM_X¶

MAX_BLOCK_DIM_Y¶

MAX_BLOCK_DIM_Z¶

MAX_GRID_DIM_X¶

MAX_GRID_DIM_Y¶

MAX_GRID_DIM_Z¶

TOTAL_CONSTANT_MEMORY¶

WARP_SIZE¶

MAX_PITCH¶

CLOCK_RATE¶

TEXTURE_ALIGNMENT¶

GPU_OVERLAP¶

MULTIPROCESSOR_COUNT¶: CUDA 2.0 and above only.

SHARED_MEMORY_PER_BLOCK¶: Deprecated as of CUDA 2.0. See below for replacement.

MAX_SHARED_MEMORY_PER_BLOCK¶: CUDA 2.0 and above only.

REGISTERS_PER_BLOCK¶: Deprecated as of CUDA 2.0. See below for replacement.

MAX_REGISTERS_PER_BLOCK¶: CUDA 2.0 and above.

KERNEL_EXEC_TIMEOUT¶: CUDA 2.2 and above.

INTEGRATED¶: CUDA 2.2 and above.

CAN_MAP_HOST_MEMORY¶: CUDA 2.2 and above.

COMPUTE_MODE¶: CUDA 2.2 and above. See compute_mode.

MAXIMUM_TEXTURE1D_WIDTH¶
MAXIMUM_TEXTURE2D_WIDTH¶
MAXIMUM_TEXTURE2D_HEIGHT¶
MAXIMUM_TEXTURE3D_WIDTH¶
MAXIMUM_TEXTURE3D_HEIGHT¶
MAXIMUM_TEXTURE3D_DEPTH¶
MAXIMUM_TEXTURE2D_ARRAY_WIDTH¶
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT¶
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES¶: CUDA 3.0 and above.

Added in version 0.94.

MAXIMUM_TEXTURE2D_LAYERED_WIDTH¶
MAXIMUM_TEXTURE2D_LAYERED_HEIGHT¶
MAXIMUM_TEXTURE2D_LAYERED_LAYERS¶
MAXIMUM_TEXTURE1D_LAYERED_WIDTH¶
MAXIMUM_TEXTURE1D_LAYERED_LAYERS¶: CUDA 4.0 and above.

Added in version 2011.1.

SURFACE_ALIGNMENT¶: CUDA 3.0 (post-beta) and above.

Added in version 0.94.

CONCURRENT_KERNELS¶: CUDA 3.0 (post-beta) and above.

Added in version 0.94.

ECC_ENABLED¶: CUDA 3.0 (post-beta) and above.

Added in version 0.94.

PCI_BUS_ID¶: CUDA 3.2 and above.

Added in version 0.94.

PCI_DEVICE_ID¶: CUDA 3.2 and above.

Added in version 0.94.

TCC_DRIVER¶: CUDA 3.2 and above.

Added in version 0.94.

MEMORY_CLOCK_RATE¶
GLOBAL_MEMORY_BUS_WIDTH¶
L2_CACHE_SIZE¶
MAX_THREADS_PER_MULTIPROCESSOR¶
ASYNC_ENGINE_COUNT¶
UNIFIED_ADDRESSING¶: CUDA 4.0 and above.

Added in version 2011.1.

MAXIMUM_TEXTURE2D_GATHER_WIDTH¶
MAXIMUM_TEXTURE2D_GATHER_HEIGHT¶
MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE¶
MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE¶
MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE¶
PCI_DOMAIN_ID¶
TEXTURE_PITCH_ALIGNMENT¶
MAXIMUM_TEXTURECUBEMAP_WIDTH¶
MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH¶
MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS¶
MAXIMUM_SURFACE1D_WIDTH¶
MAXIMUM_SURFACE2D_WIDTH¶
MAXIMUM_SURFACE2D_HEIGHT¶
MAXIMUM_SURFACE3D_WIDTH¶
MAXIMUM_SURFACE3D_HEIGHT¶
MAXIMUM_SURFACE3D_DEPTH¶
MAXIMUM_SURFACE1D_LAYERED_WIDTH¶
MAXIMUM_SURFACE1D_LAYERED_LAYERS¶
MAXIMUM_SURFACE2D_LAYERED_WIDTH¶
MAXIMUM_SURFACE2D_LAYERED_HEIGHT¶
MAXIMUM_SURFACE2D_LAYERED_LAYERS¶
MAXIMUM_SURFACECUBEMAP_WIDTH¶
MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH¶
MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS¶
MAXIMUM_TEXTURE1D_LINEAR_WIDTH¶
MAXIMUM_TEXTURE2D_LINEAR_WIDTH¶
MAXIMUM_TEXTURE2D_LINEAR_HEIGHT¶
MAXIMUM_TEXTURE2D_LINEAR_PITCH¶: CUDA 4.1 and above.

Added in version 2011.2.

MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH¶
MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT¶
COMPUTE_CAPABILITY_MAJOR¶
COMPUTE_CAPABILITY_MINOR¶
MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH¶: CUDA 5.0 and above.

Added in version 2014.1.

STREAM_PRIORITIES_SUPPORTED¶: CUDA 5.5 and above.

Added in version 2014.1.

GLOBAL_L1_CACHE_SUPPORTED¶
LOCAL_L1_CACHE_SUPPORTED¶
MAX_SHARED_MEMORY_PER_MULTIPROCESSOR¶
MAX_REGISTERS_PER_MULTIPROCESSOR¶
MANAGED_MEMORY¶
MULTI_GPU_BOARD¶
MULTI_GPU_BOARD_GROUP_ID¶: CUDA 6.0 and above.

Added in version 2014.1.

HOST_NATIVE_ATOMIC_SUPPORTED¶
SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO¶
PAGEABLE_MEMORY_ACCESS¶
CONCURRENT_MANAGED_ACCESS¶
COMPUTE_PREEMPTION_SUPPORTED¶
CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM¶: CUDA 8.0 and above.

MAX_SHARED_MEMORY_PER_BLOCK_OPTIN¶: CUDA 9.0 and above.

PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES¶
DIRECT_MANAGED_MEM_ACCESS_FROM_HOST¶: CUDA 9.2 and above.

HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED¶
HANDLE_TYPE_WIN32_HANDLE_SUPPORTED¶
HANDLE_TYPE_WIN32_KMT_HANDLE_SUPPORTED¶: CUDA 10.2 and above.

MAX_PERSISTING_L2_CACHE_SIZE¶
MAX_BLOCKS_PER_MULTIPROCESSOR¶
GENERIC_COMPRESSION_SUPPORTED¶
RESERVED_SHARED_MEMORY_PER_BLOCK¶: CUDA 11.0 and above.

READ_ONLY_HOST_REGISTER_SUPPORTED¶
MEMORY_POOLS_SUPPORTED¶: CUDA 11.2 and above.

class pycuda.driver.pointer_attribute¶

CONTEXT¶
MEMORY_TYPE¶
DEVICE_POINTER¶
HOST_POINTER¶

CUDA 4.0 and above.

Added in version 2011.1.

class pycuda.driver.profiler_output_mode¶

KEY_VALUE_PAIR¶
CSV¶

CUDA 4.0 and above.

Added in version 2011.1.

class pycuda.driver.function_attribute¶

Flags for Function.get_attribute(). CUDA 2.2 and newer.

MAX_THREADS_PER_BLOCK¶

SHARED_SIZE_BYTES¶

CONST_SIZE_BYTES¶

LOCAL_SIZE_BYTES¶

NUM_REGS¶

PTX_VERSION¶: CUDA 3.0 (post-beta) and above.

Added in version 0.94.

BINARY_VERSION¶: CUDA 3.0 (post-beta) and above.

Added in version 0.94.

CACHE_MODE_CA¶: Added in version 2022.1.

MAX_DYNAMIC_SHARED_SIZE_BYTES¶: Added in version 2022.1.

PREFERRED_SHARED_MEMORY_CARVEOUT¶: Added in version 2022.1.

MAX¶

class pycuda.driver.func_cache¶

See Function.set_cache_config(). CUDA 3.0 (post-beta) and above.

Added in version 0.94.

PREFER_NONE¶

PREFER_SHARED¶

PREFER_L1¶

PREFER_EQUAL¶: CUDA 4.1 and above.

Added in version 2011.2.

class pycuda.driver.shared_config¶

See Function.set_shared_config(). CUDA 4.2 and above.

DEFAULT_BANK_SIZE¶

FOUR_BYTE_BANK_SIZE¶

EIGHT_BYTE_BANK_SIZE¶

class pycuda.driver.array_format¶

UNSIGNED_INT8¶

UNSIGNED_INT16¶

UNSIGNED_INT32¶

SIGNED_INT8¶

SIGNED_INT16¶

SIGNED_INT32¶

HALF¶

FLOAT¶

class pycuda.driver.array3d_flags¶

2DARRAY¶: CUDA 3.0 and above. Deprecated–use LAYERED.

Added in version 0.94.

LAYERED¶: CUDA 4.0 and above.

Added in version 2011.1.

SURFACE_LDST¶: CUDA 3.1 and above.

Added in version 0.94.

CUBEMAP TEXTURE_GATHER: CUDA 4.1 and above.

Added in version 2011.2.

class pycuda.driver.address_mode¶

WRAP¶

CLAMP¶

MIRROR¶

BORDER¶: CUDA 3.2 and above.

Added in version 0.94.

class pycuda.driver.filter_mode¶

POINT¶

LINEAR¶

class pycuda.driver.memory_type¶

HOST¶

DEVICE¶

ARRAY¶

class pycuda.driver.compute_mode¶

CUDA 2.2 and newer.

DEFAULT¶

PROHIBITED¶

EXCLUSIVE_PROCESS¶: CUDA 4.0 and above.

Added in version 2011.1.

class pycuda.driver.jit_option¶

CUDA 2.1 and newer.

MAX_REGISTERS¶

THREADS_PER_BLOCK¶

WALL_TIME¶

INFO_LOG_BUFFER¶

INFO_LOG_BUFFER_SIZE_BYTES¶

ERROR_LOG_BUFFER¶

ERROR_LOG_BUFFER_SIZE_BYTES¶

OPTIMIZATION_LEVEL¶

TARGET_FROM_CUCONTEXT¶

TARGET¶

FALLBACK_STRATEGY¶

class pycuda.driver.jit_target¶

CUDA 2.1 and newer.

COMPUTE_10¶

COMPUTE_11¶

COMPUTE_12¶

COMPUTE_13¶

COMPUTE_20¶: CUDA 3.0 and above.

Added in version 0.94.

COMPUTE_21¶: CUDA 3.2 and above.

Added in version 0.94.

class pycuda.driver.jit_fallback¶

CUDA 2.1 and newer.

PREFER_PTX¶

PREFER_BINARY¶

class pycuda.driver.host_alloc_flags¶

Flags to be used to allocate Pagelocked Host Memory.

PORTABLE¶

DEVICEMAP¶

WRITECOMBINED¶

class pycuda.driver.mem_attach_flags¶

Flags to be used to allocate Managed Memory.

..versionadded:: 2014.1

GLOBAL¶

HOST¶

SINGLE¶

class pycuda.driver.mem_host_register_flags¶

PORTABLE¶

DEVICEMAP¶

CUDA 4.0 and newer.

Added in version 2011.1.

class pycuda.driver.limit¶

Limit values for Context.get_limit() and Context.set_limit().

CUDA 3.1 and newer.

Added in version 0.94.

STACK_SIZE¶

PRINTF_FIFO_SIZE¶

MALLOC_HEAP_SIZE¶: CUDA 3.2 and above.

class pycuda.driver.ipc_mem_flags¶

LAZY_ENABLE_PEER_ACCESS¶

Devices and Contexts¶

pycuda.driver.get_version()¶: Obtain the version of CUDA against which PyCuda was compiled. Returns a 3-tuple of integers as (major, minor, revision).

pycuda.driver.get_driver_version()¶: Obtain the version of the CUDA driver on top of which PyCUDA is running. Returns an integer version number.

pycuda.driver.init(flags=0)¶

Initialize CUDA.

Warning

This must be called before any other function in this module.

Concurrency and Streams¶

class pycuda.driver.Stream(flags=0)¶

A handle for a queue of operations that will be carried out in order.

synchronize()¶: Wait for all activity on this stream to cease, then return.

is_done()¶: Return True iff all queued operations have completed.

wait_for_event(evt)¶

Enqueues a wait for the given Event instance.

CUDA 3.2 and above.

Added in version 2011.1.

class pycuda.driver.Event(flags=0)¶

An event is a temporal ‘marker’ in a Stream that allows taking the time between two events–such as the time required to execute a kernel. An event’s time is recorded when the Stream has finished all tasks enqueued before the record() call.

See event_flags for values for the flags parameter.

record(stream=None)¶: Insert a recording point for self into the Stream stream. Return self.

synchronize()¶: Wait until the device execution stream reaches this event. Return self.

query()¶: Return True if the device execution stream has reached this event.

time_since(event)¶: Return the time in milliseconds that has passed between self and event. Use this method as end.time_since(start). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.

time_till(event)¶: Return the time in milliseconds that has passed between event and self. Use this method as start.time_till(end). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.

ipc_handle()¶: Return a bytes object representing an IPC handle to this event. Requires Python 2.6 and CUDA 4.1.

static from_ipc_handle(handle)¶: Requires Python 2.6 and CUDA 4.1.

Memory¶

Global Device Memory¶

pycuda.driver.mem_get_info()¶: Return a tuple (free, total) indicating the free and total memory in the current context, in bytes.

pycuda.driver.mem_alloc(bytes)¶: Return a DeviceAllocation object representing a linear piece of device memory.

pycuda.driver.to_device(buffer)¶: Allocate enough device memory for buffer, which adheres to the Python Buffer Protocol interface. Copy the contents of buffer onto the device. Return a DeviceAllocation object representing the newly-allocated memory.

pycuda.driver.from_device(devptr, shape, dtype, order='C')¶: Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them using shape, dtype and order.

pycuda.driver.from_device_like(devptr, other_ary)¶: Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them as having the same shape, dtype and order as other_ary.

pycuda.driver.mem_alloc_pitch(width, height, access_size)¶

Allocates a linear piece of device memory at least width bytes wide and height rows high that an be accessed using a data type of size access_size in a coalesced fashion.

Returns a tuple (dev_alloc, actual_pitch) giving a DeviceAllocation and the actual width of each row in bytes.

class pycuda.driver.DeviceAllocation¶

An object representing an allocation of linear device memory. Once this object is deleted, its associated device memory is freed.

Objects of this type can be cast to int to obtain a linear index into this Context’s memory.

free()¶: Release the held device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.

as_buffer(size, offset=0)¶: Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset.

Added in version 2014.1.

pycuda.driver.mem_get_ipc_handle(devptr)¶

Return an opaque bytes object representing an IPC handle to the device pointer devptr.

Added in version 2011.2.

Requires CUDA 4.1 and Python 2.6.

class pycuda.driver.IPCMemoryHandle(ipc_handle, flags=ipc_mem_flags.LAZY_ENABLE_PEER_ACCESS)¶

Added in version 2011.2.

Requires CUDA 4.1 and Python 2.6.

Objects of this type can be used in the same ways as a DeviceAllocation.

close()¶

class pycuda.driver.PointerHolderBase¶

A base class that facilitates casting to pointers within PyCUDA. This allows the user to construct custom pointer types that may have been allocated by facilities outside of PyCUDA proper, but still need to be objects to facilitate RAII. The user needs to supply one method to facilitate the pointer cast:

get_pointer()¶: Return the pointer encapsulated by self.

as_buffer(size, offset=0)¶: Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset.

Added in version 2014.1.

Note

If your subclass provides its own __init__(), it must call the base class __init__(). Failure to do so will lead to boost.Python.ArgumentError being raised when it is used.

Pagelocked Host Memory¶

Pagelocked Allocation¶

pycuda.driver.pagelocked_empty(shape, dtype, order='C', mem_flags=0)¶

Allocate a pagelocked numpy.ndarray of shape, dtype and order.

mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.

For the meaning of the other parameters, please refer to the numpy documentation.

pycuda.driver.pagelocked_zeros(shape, dtype, order='C', mem_flags=0)¶: Like pagelocked_empty(), but initialized to zero.

pycuda.driver.pagelocked_empty_like(array, mem_flags=0)¶

pycuda.driver.pagelocked_zeros_like(array, mem_flags=0)¶

The numpy.ndarray instances returned by these functions have an attribute base that references an object of type

class pycuda.driver.PagelockedHostAllocation¶

Inherits from HostPointer.

An object representing an allocation of pagelocked host memory. Once this object is deleted, its associated device memory is freed.

free()¶: Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.

get_flags()¶

Return a bit field of values from host_alloc_flags.

Only available on CUDA 3.2 and newer.

Added in version 0.94.

class pycuda.driver.HostAllocation¶: A deprecated name for PagelockedHostAllocation.

Aligned Host Memory¶

pycuda.driver.aligned_empty(shape, dtype, order='C', alignment=4096)¶

Allocate an numpy.ndarray of shape, dtype and order, with data aligned to alignment bytes.

For the meaning of the other parameters, please refer to the numpy documentation.

Added in version 2011.1.

pycuda.driver.aligned_zeros(shape, dtype, order='C', alignment=4096)¶: Like aligned_empty(), but with initialization to zero.

Added in version 2011.1.

pycuda.driver.aligned_empty_like(array, alignment=4096)¶: Added in version 2011.1.

pycuda.driver.aligned_zeros_like(array, alignment=4096)¶: Added in version 2011.1.

The numpy.ndarray instances returned by these functions have an attribute base that references an object of type

class pycuda.driver.AlignedHostAllocation¶

Inherits from HostPointer.

An object representing an allocation of aligned host memory.

free()¶: Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.

Post-Allocation Pagelocking¶

pycuda.driver.register_host_memory(ary, flags=0)¶

Returns a numpy.ndarray which shares memory with ary. This memory will be page-locked as long as the return value of this function is alive.

The returned array’s base attribute contains a RegisteredHostMemory instance, whose base attribute in turn contains ary.

CUDA 4.0 and newer.

ary’s data address and size must be page-aligned. One way to achieve this is to use the functions in Aligned Host Memory.

Added in version 2011.1.

class pycuda.driver.RegisteredHostMemory¶

Inherits from HostPointer.

CUDA 4.0 and newer.

Added in version 2011.1.

unregister()¶: Unregister the page-lock on the host memory held by this instance. Note that this does not free the memory, it only frees the page-lock.

base¶: Contains the Python object from which this instance was constructed.

class pycuda.driver.HostPointer¶

Represents a page-locked host pointer.

get_device_pointer()¶

Return a device pointer that indicates the address at which this memory is mapped into the device’s address space.

Only available on CUDA 2.2 and newer.

Managed Memory¶

CUDA 6.0 adds support for a “Unified Memory” model, which creates a managed virtual memory space that is visible to both CPUs and GPUs. The OS will migrate the physical pages associated with managed memory between the CPU and GPU as needed. This allows a numpy array on the host to be passed to kernels without first creating a DeviceAllocation and manually copying the host data to and from the device.

Note

Managed memory is only available for some combinations of CUDA device, operating system, and host compiler target architecture. Check the CUDA C Programming Guide and CUDA release notes for details.

Warning

This interface to managed memory should be considered experimental. It is provided as a preview, but for now the same interface stability guarantees as for the rest of PyCUDA do not apply.

Managed Memory Allocation¶

pycuda.driver.managed_empty(shape, dtype, order='C', mem_flags=0)¶

Allocate a managed numpy.ndarray of shape, dtype and order.

mem_flags may be one of the values in mem_attach_flags.

For the meaning of the other parameters, please refer to the numpy documentation.

Only available on CUDA 6.0 and newer.

Added in version 2014.1.

pycuda.driver.managed_zeros(shape, dtype, order='C', mem_flags=0)¶

Like managed_empty(), but initialized to zero.

Only available on CUDA 6.0 and newer.

Added in version 2014.1.

pycuda.driver.managed_empty_like(array, mem_flags=0)¶: Only available on CUDA 6.0 and newer.

Added in version 2014.1.

pycuda.driver.managed_zeros_like(array, mem_flags=0)¶: Only available on CUDA 6.0 and newer.

Added in version 2014.1.

The numpy.ndarray instances returned by these functions have an attribute base that references an object of type

class pycuda.driver.ManagedAllocation¶

An object representing an allocation of managed host memory. Once this object is deleted, its associated CUDA managed memory is freed.

free()¶: Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.

get_device_pointer()¶: Return a device pointer that indicates the address at which this memory is mapped into the device’s address space. For managed memory, this is also the host pointer.

attach(mem_flags, stream=None)¶: Alter the visibility of the managed allocation to be one of the values in mem_attach_flags. A managed array can be made visible to the host CPU and the entire CUDA context with mem_attach_flags.GLOBAL, or limited to the CPU only with mem_attach_flags.HOST. If mem_attach_flags.SINGLE is selected, then the array will only be visible to CPU and the provided instance of Stream.

Managed Memory Usage¶

A managed numpy array is constructed and used on the host in a similar manner to a pagelocked array:

from pycuda.autoinit import context
import pycuda.driver as cuda
import numpy as np

a = cuda.managed_empty(shape=10, dtype=np.float32, mem_flags=cuda.mem_attach_flags.GLOBAL)
a[:] = np.linspace(0, 9, len(a)) # Fill array on host

It can be passed to a GPU kernel, and used again on the host without an explicit copy:

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void doublify(float *a)
{
    a[threadIdx.x] *= 2;
}
""")
doublify = mod.get_function("doublify")

doublify(a, grid=(1,1), block=(len(a),1,1))
context.synchronize() # Wait for kernel completion before host access

median = np.median(a) # Computed on host!

Warning

The CUDA Unified Memory model has very specific rules regarding concurrent access of managed memory allocations. Host access to any managed array is not allowed while the GPU is executing a kernel, regardless of whether the array is in use by the running kernel. Failure to follow the concurrency rules will generate a segmentation fault, causing the Python interpreter to terminate immediately.

Users of managed numpy arrays should read the “Unified Memory Programming” appendix of the CUDA C Programming Guide for further details on the concurrency restrictions.

If you are encountering interpreter terminations due to concurrency issues, the faulthandler <http://pypi.python.org/pypi/faulthandler> module may be helpful in locating the location in your Python program where the faulty access is occurring.

Arrays and Textures¶

class pycuda.driver.ArrayDescriptor¶

width¶

height¶

format¶: A value of type array_format.

num_channels¶

class pycuda.driver.ArrayDescriptor3D¶

width¶

height¶

depth¶

format¶: A value of type array_format. CUDA 2.0 and above only.

num_channels¶

class pycuda.driver.Array(descriptor)¶

A 2D or 3D memory block that can only be accessed via texture references.

descriptor can be of type ArrayDescriptor or ArrayDescriptor3D.

free()¶: Release the array and its device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.

get_descriptor()¶: Return a ArrayDescriptor object for this 2D array, like the one that was used to create it.

get_descriptor_3d()¶: Return a ArrayDescriptor3D object for this 3D array, like the one that was used to create it. CUDA 2.0 and above only.

handle¶: Return an int representing the address in device memory where this array resides.

class pycuda.driver.SurfaceReference¶

Note

Instances of this class can only be constructed through Module.get_surfref().

CUDA 3.1 and above.

Added in version 0.94.

set_array(array, flags=0)¶

Bind self to the Array array.

As long as array remains bound to this texture reference, it will not be freed–the texture reference keeps a reference to the array.

get_array()¶: Get back the Array to which self is bound.

Note

This will be a different object than the one passed to set_array(), but it will compare equal.

class pycuda.driver.TextureReference¶

A handle to a binding of either linear memory or an Array to a texture unit.

set_array(array)¶

Bind self to the Array array.

As long as array remains bound to this texture reference, it will not be freed–the texture reference keeps a reference to the array.

set_address(devptr, bytes, allow_offset=False)¶

Bind self to the a chunk of linear memory starting at the integer address devptr, encompassing a number of bytes. Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in bytes. If allow_offset is False, a nonzero value of this offset will cause an exception to be raised.

Unlike for Array objects, no life support is provided for linear memory bound to texture references.

set_address_2d(devptr, descr, pitch)¶: Bind self as a 2-dimensional texture to a chunk of global memory at devptr. The line-to-line offset in bytes is given by pitch. Width, height and format are given in the ArrayDescriptor descr. set_format() need not and should not be called in addition to this method.

set_format(fmt, num_components)¶: Set the texture to have array_format fmt and to have num_components channels.

set_address_mode(dim, am)¶: Set the address mode of dimension dim to am, which must be one of the address_mode values.

set_flags(flags)¶: Set the flags to a combination of the TRSF_XXX values.

get_array()¶: Get back the Array to which self is bound.

Note

This will be a different object than the one passed to set_array(), but it will compare equal.

get_address_mode(dim)¶

get_filter_mode()¶

get_format()¶

Return a tuple (fmt, num_components), where fmt is of type array_format, and num_components is the number of channels in this texture.

(Version 2.0 and above only.)

get_flags()¶

pycuda.driver.TRSA_OVERRIDE_FORMAT¶

pycuda.driver.TRSF_READ_AS_INTEGER¶

pycuda.driver.TRSF_NORMALIZED_COORDINATES¶

pycuda.driver.TR_DEFAULT¶

pycuda.driver.matrix_to_array(matrix, order)¶: Turn the two-dimensional numpy.ndarray object matrix into an Array. The order argument can be either “C” or “F”. If it is “C”, then tex2D(x,y) is going to fetch matrix[y,x], and vice versa for for “F”.

pycuda.driver.np_to_array(nparray, order, allowSurfaceBind=False)¶

Turn a numpy.ndarray with 2D or 3D structure, into an Array. The order argument can be either “C” or “F”. If allowSurfaceBind is passed as True the returned Array can be read and write with SurfaceReference in addition of reads by TextureReference. Function automatically detect dtype and adjust channels to supported array_format. Also add direct support for np.float64, np.complex64 and np.complex128 formats.

Example of use:

#include <pycuda-helpers.hpp>

texture<fp_tex_double, 3, cudaReadModeElementType> my_tex; // complex128: fp_tex_cdouble
                                                           // complex64 : fp_tex_cfloat
                                                           // float64   : fp_tex_double
surface<void, 3, cudaReadModeElementType> my_surf;         // Surfaces in 2D needs 'cudaSurfaceType2DLayered'

__global__ void f()
{
  ...
  fp_tex3D(my_tex, i, j, k);
  fp_surf3Dwrite(myvar, my_surf, i, j, k, cudaBoundaryModeClamp); // fp extensions don't need width in bytes
  fp_surf3Dread(&myvar, my_surf, i, j, k, cudaBoundaryModeClamp);
  ...
}

Added in version 2015.1.

pycuda.driver.gpuarray_to_array(gpuparray, order, allowSurfaceBind=False)¶: Turn a GPUArray with 2D or 3D structure, into an Array. Same structure and use of np_to_array()

Added in version 2015.1.

pycuda.driver.make_multichannel_2d_array(matrix, order)¶

Turn the three-dimensional numpy.ndarray object matrix into an 2D Array with multiple channels.

Depending on order, the matrix’s shape is interpreted as

height, width, num_channels for order == “C”,
num_channels, width, height for order == “F”.

Note

This function assumes that matrix has been created with the memory order order. If that is not the case, the copied data will likely not be what you expect.

Initializing Device Memory¶

pycuda.driver.memset_d8(dest, data, count)¶

pycuda.driver.memset_d16(dest, data, count)¶

pycuda.driver.memset_d32(dest, data, count)¶: Fill array with data.

Note

count is the number of elements, not bytes.

pycuda.driver.memset_d2d8(dest, pitch, data, width, height)¶

pycuda.driver.memset_d2d16(dest, pitch, data, width, height)¶

pycuda.driver.memset_d2d32(dest, pitch, data, width, height)¶: Fill a two-dimensional array with data.

pycuda.driver.memset_d8_async(dest, data, count, stream=None)¶

pycuda.driver.memset_d16_async(dest, data, count, stream=None)¶

pycuda.driver.memset_d32_async(dest, data, count, stream=None)¶: Fill array with data asynchronously, optionally serialized via stream.

Added in version 2015.1.

pycuda.driver.memset_d2d8_async(dest, pitch, data, width, height, stream=None)¶

pycuda.driver.memset_d2d16_async(dest, pitch, data, width, height, stream=None)¶

pycuda.driver.memset_d2d32_async(dest, pitch, data, width, height, stream=None)¶: Fill a two-dimensional array with data asynchronously, optionally serialized via stream.

Added in version 2015.1.

Unstructured Memory Transfers¶

pycuda.driver.memcpy_htod(dest, src)¶: Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation). The size of the copy is determined by the size of the buffer.

pycuda.driver.memcpy_htod_async(dest, src, stream=None)¶

Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation) asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.

src must be page-locked memory, see, e.g. pagelocked_empty().

New in 0.93.

pycuda.driver.memcpy_dtoh(dest, src)¶: Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest. The size of the copy is determined by the size of the buffer.

pycuda.driver.memcpy_dtoh_async(dest, src, stream=None)¶

Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.

dest must be page-locked memory, see, e.g. pagelocked_empty().

New in 0.93.

pycuda.driver.memcpy_dtod(dest, src, size)¶

pycuda.driver.memcpy_dtod_async(dest, src, size, stream=None)¶: CUDA 3.0 and above.

Added in version 0.94.

pycuda.driver.memcpy_peer(dest, src, size, dest_context=None, src_context=None)¶

pycuda.driver.memcpy_peer_async(dest, src, size, dest_context=None, src_context=None, stream=None)¶: CUDA 4.0 and above.

Added in version 2011.1.

pycuda.driver.memcpy_dtoa(ary, index, src, len)¶

pycuda.driver.memcpy_atod(dest, ary, index, len)¶

pycuda.driver.memcpy_htoa(ary, index, src)¶

pycuda.driver.memcpy_atoh(dest, ary, index)¶

pycuda.driver.memcpy_atoa(dest, dest_index, src, src_index, len)¶

Structured Memory Transfers¶

class pycuda.driver.Memcpy2D¶

src_x_in_bytes¶: X Offset of the origin of the copy. (initialized to 0)

src_y¶: Y offset of the origin of the copy. (initialized to 0)

src_pitch¶: Size of a row in bytes at the origin of the copy.

set_src_host(buffer)¶: Set the buffer, which must be a Python object adhering to the buffer interface, to be the origin of the copy.

set_src_array(array)¶: Set the Array array to be the origin of the copy.

set_src_device(devptr)¶: Set the device address devptr (an int or a DeviceAllocation) as the origin of the copy.

set_src_unified(buffer)¶

Same as set_src_host(), except that buffer may also correspond to device memory.

CUDA 4.0 and above. Requires unified addressing.

Added in version 2011.1.

dst_x_in_bytes¶: X offset of the destination of the copy. (initialized to 0)

dst_y¶: Y offset of the destination of the copy. (initialized to 0)

dst_pitch¶: Size of a row in bytes at the destination of the copy.

set_dst_host(buffer)¶: Set the buffer, which must be a Python object adhering to the buffer interface, to be the destination of the copy.

set_dst_array(array)¶: Set the Array array to be the destination of the copy.

set_dst_device(devptr)¶: Set the device address devptr (an int or a DeviceAllocation) as the destination of the copy.

set_dst_unified(buffer)¶

Same as set_dst_host(), except that buffer may also correspond to device memory.

CUDA 4.0 and above. Requires unified addressing.

Added in version 2011.1.

width_in_bytes¶: Number of bytes to copy for each row in the transfer.

height¶: Number of rows to copy.

__call__([aligned=True])¶: Perform the specified memory copy, waiting for it to finish. If aligned is False, tolerate device-side misalignment for device-to-device copies that may lead to loss of copy bandwidth.

__call__(stream): Perform the memory copy asynchronously, serialized via the Stream stream. Any host memory involved in the transfer must be page-locked.

class pycuda.driver.Memcpy3D¶

Memcpy3D has the same members as Memcpy2D, and additionally all of the following:

src_height¶: Ignored when source is an Array. May be 0 if Depth==1.

src_z¶: Z offset of the origin of the copy. (initialized to 0)

dst_height¶: Ignored when destination is an Array. May be 0 if Depth==1.

dst_z¶: Z offset of the destination of the copy. (initialized to 0)

depth¶

Memcpy3D is supported on CUDA 2.0 and above only.

class pycuda.driver.Memcpy3DPeer¶

Memcpy3DPeer has the same members as Memcpy3D, and additionally all of the following:

set_src_context(ctx)¶

set_dst_context(ctx)¶

CUDA 4.0 and newer.

Added in version 2011.1.

Code on the Device: Modules and Functions¶

class pycuda.driver.Module¶

Handle to a CUBIN module loaded onto the device. Can be created with module_from_file() and module_from_buffer().

get_function(name)¶: Return the Function name in this module.

Warning

While you can obtain different handles to the same function using this method, these handles all share the same state that is set through the set_XXX methods of Function. This means that you can’t obtain two different handles to the same function and Function.prepare() them in two different ways.

get_global(name)¶

Return a tuple (device_ptr, size_in_bytes) giving the device address and size of the global name.

The main use of this method is to find the address of pre-declared __constant__ arrays so they can be filled from the host before kernel invocation.

get_texref(name)¶: Return the TextureReference name from this module.

get_surfref(name)¶

Return the SurfaceReference name from this module.

CUDA 3.1 and above.

Added in version 0.94.

pycuda.driver.module_from_file(filename)¶: Create a Module by loading the CUBIN file filename.

pycuda.driver.module_from_buffer(buffer, options=[], message_handler=None)¶

Create a Module by loading a PTX or CUBIN module from buffer, which must support the Python buffer interface. (For example, str and numpy.ndarray do.)

Parameters:

options – A list of tuples (jit_option, value).
message_handler – A callable that is called with a arguments of (compile_success_bool, info_str, error_str) which allows the user to process error and warning messages from the PTX compiler.

Loading PTX modules as well as non-default values of options and message_handler are only allowed on CUDA 2.1 and newer.

class pycuda.driver.Function¶

Handle to a __global__ function in a Module. Create using Module.get_function().

__call__(arg1, ..., argn, block=block_size[, grid=(1, 1)[, stream=None[, shared=0[, texrefs=[][, time_kernel=False]]]]])¶

Launch self, with a thread block size of block. block must be a 3-tuple of integers.

arg1 through argn are the positional C arguments to the kernel. See param_set() for details. See especially the warnings there.

grid specifies, as a tuple of up to three integer entries, the number of thread blocks to launch, as a multi-dimensional grid. stream, if specified, is a Stream instance serializing the copying of input arguments (if any), execution, and the copying of output arguments (again, if any). shared gives the number of bytes available to the kernel in extern __shared__ arrays. texrefs is a list of TextureReference instances that the function will have access to.

The function returns either None or the number of seconds spent executing the kernel, depending on whether time_kernel is True.

This is a convenience interface that can be used instead of the param_*` and launch_*` methods below. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().

arg1 through argn are allowed to be of the following types:

Subclasses of numpy.number. These are sized number types such as numpy.uint32 or numpy.float32.
DeviceAllocation instances, which will become a device pointer to the allocated memory.
Instances of ArgumentHandler subclasses. These can be used to automatically transfer numpy arrays onto and off of the device.
Objects supporting the Python Buffer Protocol interface. These chunks of bytes will be copied into the parameter space verbatim.
GPUArray instances.

Warning

You cannot pass values of Python’s native int or float types to param_set. Since there is no unambiguous way to guess the size of these integers or floats, it complains with a TypeError.

Note

This method has to guess the types of the arguments passed to it, which can make it somewhat slow. For a kernel that is invoked often, this can be inconvenient. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().

Note

grid with more than two dimensions requires CUDA 4.0 or newer.

param_set_texref(texref)¶: Make the TextureReference texref available to the function.

prepare(arg_types, shared=None, texrefs=[])¶

Prepare the invocation of this function by

setting up the argument types as arg_types. arg_types is expected to be an iterable containing type characters understood by the struct module or numpy.dtype objects.

(In addition, PyCUDA understands ‘F’ and ‘D’ for single- and double precision floating point numbers.)
Registering the texture references texrefs for use with this functions. The TextureReference objects in texrefs will be retained, and whatever these references are bound to at invocation time will be available through the corresponding texture references within the kernel.

Return self.

prepared_call(grid, block, *args, shared_size=0)¶: Invoke self using launch_grid(), with args a grid size of grid, and a block size of block. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Changed in version 2012.1: shared_size was added.

prepared_timed_call(grid, block, *args, shared_size=0)¶

Invoke self using launch_grid(), with args, a grid size of grid, and a block size of block. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Return a 0-ary callable that can be used to query the GPU time consumed by the call, in seconds. Once called, this callable will block until completion of the invocation.

Changed in version 2012.1: shared_size was added.

prepared_async_call(grid, block, stream, *args, shared_size=0)¶: Invoke self using launch_grid_async(), with args, a grid size of grid, and a block size of block, serialized into the pycuda.driver.Stream stream. If stream is None, do the same as prepared_call(). Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Changed in version 2012.1: shared_size was added.

get_attribute(attr)¶

Return one of the attributes given by the function_attribute value attr.

All function_attribute values may also be directly read as (lower-case) attributes on the Function object itself, e.g. func.num_regs.

CUDA 2.2 and newer.

Added in version 0.93.

set_attribute(attr, value)¶: Set one of the (settable) attributes given by the function_attribute value attr to value.

Added in version 2022.1.

set_cache_config(fc)¶

See func_cache for possible values of fc.

CUDA 3.0 (post-beta) and newer.

Added in version 0.94.

set_shared_config(sc)¶

See shared_config for possible values of sc.

CUDA 4.2 and newer.

Added in version 2013.1.

local_size_bytes¶

The number of bytes of local memory used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute lmem.

shared_size_bytes¶

The number of bytes of shared memory used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute smem.

num_regs¶

The number of 32-bit registers used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute registers.

set_shared_size(bytes)¶: Set shared to be the number of bytes available to the kernel in extern __shared__ arrays.

Warning

Deprecated as of version 2011.1.

set_block_shape(x, y, z)¶: Set the thread block shape for this function.

Warning

Deprecated as of version 2011.1.

param_set(arg1, ... argn)¶: Set the thread block shape for this function.

Warning

Deprecated as of version 2011.1.

param_set_size(bytes)¶: Size the parameter space to bytes.

Warning

Deprecated as of version 2011.1.

param_seti(offset, value)¶: Set the integer at offset in the parameter space to value.

Warning

Deprecated as of version 2011.1.

param_setf(offset, value)¶: Set the float at offset in the parameter space to value.

Warning

Deprecated as of version 2011.1.

launch()¶: Launch a single thread block of self.

Warning

Deprecated as of version 2011.1.

launch_grid(width, height)¶: Launch a width*height grid of thread blocks of self.

Warning

Deprecated as of version 2011.1.

launch_grid_async(width, height, stream)¶: Launch a width*height grid of thread blocks of self, sequenced by the Stream stream.

Warning

Deprecated as of version 2011.1.

class pycuda.driver.ArgumentHandler(array)¶

class pycuda.driver.In(array)¶: Inherits from ArgumentHandler. Indicates that Buffer Protocol array should be copied to the compute device before invoking the kernel.

class pycuda.driver.Out(array)¶: Inherits from ArgumentHandler. Indicates that Buffer Protocol array should be copied off the compute device after invoking the kernel.

class pycuda.driver.InOut(array)¶: Inherits from ArgumentHandler. Indicates that Buffer Protocol array should be copied both onto the compute device before invoking the kernel, and off it afterwards.

Profiler Control¶

CUDA 4.0 and newer.

pycuda.driver.initialize_profiler(config_file, output_file, output_mode)¶: output_mode is one of the attributes of profiler_output_mode.

Added in version 2011.1.

pycuda.driver.start_profiler()¶: Added in version 2011.1.

pycuda.driver.stop_profiler()¶: Added in version 2011.1.

Just-in-time Compilation¶

pycuda.compiler.DEFAULT_NVCC_FLAGS¶

Added in version 2011.1.

If no options are given in the calls below, the value of this list-type variable is used instead. This may be useful for injecting necessary flags into the compilation of automatically compiled kernels, such as those used by the module pycuda.gpuarray.

The initial value of this variable is taken from the environment variable PYCUDA_DEFAULT_NVCC_FLAGS.

If you modify this variable in your code, please be aware that this is a globally shared variable that may be modified by multiple packages. Please exercise caution in such modifications–you risk breaking other people’s code.

class pycuda.compiler.SourceModule(source, nvcc='nvcc', options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])¶

Create a pycuda.driver.Module from the CUDA source code source. The Nvidia compiler nvcc is assumed to be on the PATH if no path to it is specified, and is invoked with options to compile the code. If keep is True, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.

Unless no_extern_c is True, the given source code is wrapped in extern “C” { … } to prevent C++ name mangling.

arch and code specify the values to be passed for the -arch and -code options on the nvcc command line. If arch is None, it defaults to the current context’s device’s compute capability. If code is None, it will not be specified.

cache_dir gives the directory used for compiler caching. If None then cache_dir is taken to be PYCUDA_CACHE_DIR if set or a sensible per-user default. If passed as False, caching is disabled.

If the environment variable PYCUDA_DISABLE_CACHE is set to any value then caching is disabled. This preference overrides any value of cache_dir and can be used to disable caching globally.

This class exhibits the same public interface as pycuda.driver.Module, but does not inherit from it.

Change note: SourceModule was moved from pycuda.driver to pycuda.compiler in version 0.93.

compile(source, nvcc="nvcc", options=None, keep=False,
no_extern_c=False, arch=None, code=None, cache_dir=None,
include_dirs=[]): Perform the same compilation as the corresponding SourceModule constructor, but only return resulting cubin file as a string. In particular, do not upload the code to the GPU.

Device Interface¶

Version Queries¶

Error Reporting¶

Constants¶

Graphics-related constants¶

Devices and Contexts¶

Concurrency and Streams¶

Memory¶

Global Device Memory¶

Pagelocked Host Memory¶

Pagelocked Allocation¶

Aligned Host Memory¶

Post-Allocation Pagelocking¶

Managed Memory¶

Managed Memory Allocation¶

Managed Memory Usage¶

Arrays and Textures¶

Initializing Device Memory¶

Unstructured Memory Transfers¶

Structured Memory Transfers¶

Code on the Device: Modules and Functions¶

Profiler Control¶

Just-in-time Compilation¶