Device Interface

Version Queries

pycuda.VERSION

Gives the numeric version of PyCUDA as a variable-length tuple of integers. Enables easy version checks such as VERSION >= (0, 93).

Added in version 0.93.

pycuda.VERSION_STATUS

A text string such as “rc4” or “beta” qualifying the status of the release.

Added in version 0.93.

pycuda.VERSION_TEXT

The full release name (such as “0.93rc4”) in string form.

Added in version 0.93.
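
A minimal sketch of a version check using these attributes (the version threshold is arbitrary):

import pycuda

if pycuda.VERSION >= (0, 93):
    print("Running PyCUDA", pycuda.VERSION_TEXT)
else:
    raise RuntimeError("PyCUDA too old: " + pycuda.VERSION_TEXT)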

Error Reporting

exception pycuda.driver.Error

Base class of all PyCUDA errors.

exception pycuda.driver.CompileError

Thrown when pycuda.compiler.SourceModule compilation fails.

msg

Added in version 0.94.

stdout

Added in version 0.94.

stderr

Added in version 0.94.

command_line

Added in version 0.94.
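
A sketch of catching a compilation failure and inspecting these attributes; the broken kernel source is deliberately invalid for illustration:

import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

try:
    SourceModule("__global__ void broken(float *a) { a[0] = ; }")
except cuda.CompileError as err:
    print("message:", err.msg)
    print("compiler stderr:", err.stderr)
    print("command line:", err.command_line)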

exception pycuda.driver.MemoryError

Thrown when mem_alloc() or related functionality fails.

exception pycuda.driver.LogicError

Thrown when PyCUDA is confronted with a situation that is most likely the result of a programmer error. LogicErrors do not depend on outer circumstances defined by the run-time environment.

Example: CUDA was used before it was initialized.

exception pycuda.driver.LaunchError

Thrown when kernel invocation has failed. (Note that this will often be reported by the next call after the actual kernel invocation.)

exception pycuda.driver.RuntimeError

Thrown when an unforeseen run-time failure is encountered that is not likely due to programmer error.

Example: A file was not found.

Constants

class pycuda.driver.ctx_flags

Flags for Device.make_context(). CUDA 2.0 and above only.

SCHED_AUTO

If there are more contexts than processors, yield; otherwise, spin while waiting for CUDA calls to complete.

SCHED_SPIN

Spin while waiting for CUDA calls to complete.

SCHED_YIELD

Yield to other threads while waiting for CUDA calls to complete.

SCHED_MASK

Mask of valid scheduling flags in this bitfield.

SCHED_BLOCKING_SYNC

Use blocking synchronization. CUDA 2.2 and newer.

MAP_HOST

Support mapped pinned allocations. CUDA 2.2 and newer.

LMEM_RESIZE_TO_MAX

Keep local memory allocation after launch. CUDA 3.2 and newer. Rumored to decrease launch overhead on Fermi-class devices.

Added in version 2011.1.

FLAGS_MASK

Mask of valid flags in this bitfield.

class pycuda.driver.event_flags

Flags for Event. CUDA 2.2 and newer.

DEFAULT
BLOCKING_SYNC
DISABLE_TIMING

CUDA 3.2 and newer.

Added in version 0.94.

INTERPROCESS

CUDA 4.1 and newer.

Added in version 2011.2.

class pycuda.driver.device_attribute
MAX_THREADS_PER_BLOCK
MAX_BLOCK_DIM_X
MAX_BLOCK_DIM_Y
MAX_BLOCK_DIM_Z
MAX_GRID_DIM_X
MAX_GRID_DIM_Y
MAX_GRID_DIM_Z
TOTAL_CONSTANT_MEMORY
WARP_SIZE
MAX_PITCH
CLOCK_RATE
TEXTURE_ALIGNMENT
GPU_OVERLAP
MULTIPROCESSOR_COUNT

CUDA 2.0 and above only.

SHARED_MEMORY_PER_BLOCK

Deprecated as of CUDA 2.0. See below for replacement.

MAX_SHARED_MEMORY_PER_BLOCK

CUDA 2.0 and above only.

REGISTERS_PER_BLOCK

Deprecated as of CUDA 2.0. See below for replacement.

MAX_REGISTERS_PER_BLOCK

CUDA 2.0 and above.

KERNEL_EXEC_TIMEOUT

CUDA 2.2 and above.

INTEGRATED

CUDA 2.2 and above.

CAN_MAP_HOST_MEMORY

CUDA 2.2 and above.

COMPUTE_MODE

CUDA 2.2 and above. See compute_mode.

MAXIMUM_TEXTURE1D_WIDTH
MAXIMUM_TEXTURE2D_WIDTH
MAXIMUM_TEXTURE2D_HEIGHT
MAXIMUM_TEXTURE3D_WIDTH
MAXIMUM_TEXTURE3D_HEIGHT
MAXIMUM_TEXTURE3D_DEPTH
MAXIMUM_TEXTURE2D_ARRAY_WIDTH
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES

CUDA 3.0 and above.

Added in version 0.94.

MAXIMUM_TEXTURE2D_LAYERED_WIDTH
MAXIMUM_TEXTURE2D_LAYERED_HEIGHT
MAXIMUM_TEXTURE2D_LAYERED_LAYERS
MAXIMUM_TEXTURE1D_LAYERED_WIDTH
MAXIMUM_TEXTURE1D_LAYERED_LAYERS

CUDA 4.0 and above.

Added in version 2011.1.

SURFACE_ALIGNMENT

CUDA 3.0 (post-beta) and above.

Added in version 0.94.

CONCURRENT_KERNELS

CUDA 3.0 (post-beta) and above.

Added in version 0.94.

ECC_ENABLED

CUDA 3.0 (post-beta) and above.

Added in version 0.94.

PCI_BUS_ID

CUDA 3.2 and above.

Added in version 0.94.

PCI_DEVICE_ID

CUDA 3.2 and above.

Added in version 0.94.

TCC_DRIVER

CUDA 3.2 and above.

Added in version 0.94.

MEMORY_CLOCK_RATE
GLOBAL_MEMORY_BUS_WIDTH
L2_CACHE_SIZE
MAX_THREADS_PER_MULTIPROCESSOR
ASYNC_ENGINE_COUNT
UNIFIED_ADDRESSING

CUDA 4.0 and above.

Added in version 2011.1.

MAXIMUM_TEXTURE2D_GATHER_WIDTH
MAXIMUM_TEXTURE2D_GATHER_HEIGHT
MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE
MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE
MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE
PCI_DOMAIN_ID
TEXTURE_PITCH_ALIGNMENT
MAXIMUM_TEXTURECUBEMAP_WIDTH
MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH
MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS
MAXIMUM_SURFACE1D_WIDTH
MAXIMUM_SURFACE2D_WIDTH
MAXIMUM_SURFACE2D_HEIGHT
MAXIMUM_SURFACE3D_WIDTH
MAXIMUM_SURFACE3D_HEIGHT
MAXIMUM_SURFACE3D_DEPTH
MAXIMUM_SURFACE1D_LAYERED_WIDTH
MAXIMUM_SURFACE1D_LAYERED_LAYERS
MAXIMUM_SURFACE2D_LAYERED_WIDTH
MAXIMUM_SURFACE2D_LAYERED_HEIGHT
MAXIMUM_SURFACE2D_LAYERED_LAYERS
MAXIMUM_SURFACECUBEMAP_WIDTH
MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH
MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS
MAXIMUM_TEXTURE1D_LINEAR_WIDTH
MAXIMUM_TEXTURE2D_LINEAR_WIDTH
MAXIMUM_TEXTURE2D_LINEAR_HEIGHT
MAXIMUM_TEXTURE2D_LINEAR_PITCH

CUDA 4.1 and above.

Added in version 2011.2.

MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH
MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT
COMPUTE_CAPABILITY_MAJOR
COMPUTE_CAPABILITY_MINOR
MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH

CUDA 5.0 and above.

Added in version 2014.1.

STREAM_PRIORITIES_SUPPORTED

CUDA 5.5 and above.

Added in version 2014.1.

GLOBAL_L1_CACHE_SUPPORTED
LOCAL_L1_CACHE_SUPPORTED
MAX_SHARED_MEMORY_PER_MULTIPROCESSOR
MAX_REGISTERS_PER_MULTIPROCESSOR
MANAGED_MEMORY
MULTI_GPU_BOARD
MULTI_GPU_BOARD_GROUP_ID

CUDA 6.0 and above.

Added in version 2014.1.

HOST_NATIVE_ATOMIC_SUPPORTED
SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO
PAGEABLE_MEMORY_ACCESS
CONCURRENT_MANAGED_ACCESS
COMPUTE_PREEMPTION_SUPPORTED
CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM

CUDA 8.0 and above.

MAX_SHARED_MEMORY_PER_BLOCK_OPTIN

CUDA 9.0 and above.

PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES
DIRECT_MANAGED_MEM_ACCESS_FROM_HOST

CUDA 9.2 and above.

HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED
HANDLE_TYPE_WIN32_HANDLE_SUPPORTED
HANDLE_TYPE_WIN32_KMT_HANDLE_SUPPORTED

CUDA 10.2 and above.

MAX_PERSISTING_L2_CACHE_SIZE
MAX_BLOCKS_PER_MULTIPROCESSOR
GENERIC_COMPRESSION_SUPPORTED
RESERVED_SHARED_MEMORY_PER_BLOCK

CUDA 11.0 and above.

READ_ONLY_HOST_REGISTER_SUPPORTED
MEMORY_POOLS_SUPPORTED

CUDA 11.2 and above.

class pycuda.driver.pointer_attribute
CONTEXT
MEMORY_TYPE
DEVICE_POINTER
HOST_POINTER

CUDA 4.0 and above.

Added in version 2011.1.

class pycuda.driver.profiler_output_mode
KEY_VALUE_PAIR
CSV

CUDA 4.0 and above.

Added in version 2011.1.

class pycuda.driver.function_attribute

Flags for Function.get_attribute(). CUDA 2.2 and newer.

MAX_THREADS_PER_BLOCK
SHARED_SIZE_BYTES
CONST_SIZE_BYTES
LOCAL_SIZE_BYTES
NUM_REGS
PTX_VERSION

CUDA 3.0 (post-beta) and above.

Added in version 0.94.

BINARY_VERSION

CUDA 3.0 (post-beta) and above.

Added in version 0.94.

CACHE_MODE_CA

Added in version 2022.1.

MAX_DYNAMIC_SHARED_SIZE_BYTES

Added in version 2022.1.

PREFERRED_SHARED_MEMORY_CARVEOUT

Added in version 2022.1.

MAX
class pycuda.driver.func_cache

See Function.set_cache_config(). CUDA 3.0 (post-beta) and above.

Added in version 0.94.

PREFER_NONE
PREFER_SHARED
PREFER_L1
PREFER_EQUAL

CUDA 4.1 and above.

Added in version 2011.2.

class pycuda.driver.shared_config

See Function.set_shared_config(). CUDA 4.2 and above.

DEFAULT_BANK_SIZE
FOUR_BYTE_BANK_SIZE
EIGHT_BYTE_BANK_SIZE
class pycuda.driver.array_format
UNSIGNED_INT8
UNSIGNED_INT16
UNSIGNED_INT32
SIGNED_INT8
SIGNED_INT16
SIGNED_INT32
HALF
FLOAT
class pycuda.driver.array3d_flags
2DARRAY

CUDA 3.0 and above. Deprecated; use LAYERED instead.

Added in version 0.94.

LAYERED

CUDA 4.0 and above.

Added in version 2011.1.

SURFACE_LDST

CUDA 3.1 and above.

Added in version 0.94.

CUBEMAP
TEXTURE_GATHER

CUDA 4.1 and above.

Added in version 2011.2.

class pycuda.driver.address_mode
WRAP
CLAMP
MIRROR
BORDER

CUDA 3.2 and above.

Added in version 0.94.

class pycuda.driver.filter_mode
POINT
LINEAR
class pycuda.driver.memory_type
HOST
DEVICE
ARRAY
class pycuda.driver.compute_mode

CUDA 2.2 and newer.

DEFAULT
PROHIBITED
EXCLUSIVE_PROCESS

CUDA 4.0 and above.

Added in version 2011.1.

class pycuda.driver.jit_option

CUDA 2.1 and newer.

MAX_REGISTERS
THREADS_PER_BLOCK
WALL_TIME
INFO_LOG_BUFFER
INFO_LOG_BUFFER_SIZE_BYTES
ERROR_LOG_BUFFER
ERROR_LOG_BUFFER_SIZE_BYTES
OPTIMIZATION_LEVEL
TARGET_FROM_CUCONTEXT
TARGET
FALLBACK_STRATEGY
class pycuda.driver.jit_target

CUDA 2.1 and newer.

COMPUTE_10
COMPUTE_11
COMPUTE_12
COMPUTE_13
COMPUTE_20

CUDA 3.0 and above.

Added in version 0.94.

COMPUTE_21

CUDA 3.2 and above.

Added in version 0.94.

class pycuda.driver.jit_fallback

CUDA 2.1 and newer.

PREFER_PTX
PREFER_BINARY
class pycuda.driver.host_alloc_flags

Flags to be used to allocate Pagelocked Host Memory.

PORTABLE
DEVICEMAP
WRITECOMBINED
class pycuda.driver.mem_attach_flags

Flags to be used to allocate Managed Memory.

Added in version 2014.1.

GLOBAL
HOST
SINGLE
class pycuda.driver.mem_host_register_flags
PORTABLE
DEVICEMAP

CUDA 4.0 and newer.

Added in version 2011.1.

class pycuda.driver.limit

Limit values for Context.get_limit() and Context.set_limit().

CUDA 3.1 and newer.

Added in version 0.94.

STACK_SIZE
PRINTF_FIFO_SIZE
MALLOC_HEAP_SIZE

CUDA 3.2 and above.

class pycuda.driver.ipc_mem_flags
LAZY_ENABLE_PEER_ACCESS

Devices and Contexts

pycuda.driver.get_version()

Obtain the version of CUDA against which PyCUDA was compiled. Returns a 3-tuple of integers as (major, minor, revision).

pycuda.driver.get_driver_version()

Obtain the version of the CUDA driver on top of which PyCUDA is running. Returns an integer version number.

pycuda.driver.init(flags=0)

Initialize CUDA.

Warning

This must be called before any other function in this module.

See also pycuda.autoinit.
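
For instance, a sketch of manual initialization and device enumeration (pycuda.autoinit performs equivalent steps automatically):

import pycuda.driver as cuda

cuda.init()  # must precede any other call into this module
print("driver version:", cuda.get_driver_version())
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print(i, dev.name(), "compute capability %d.%d" % dev.compute_capability())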

class pycuda.driver.Device(number)
class pycuda.driver.Device(pci_bus_id)

A handle to the number’th CUDA device. See also pycuda.autoinit.

Changed in version 2011.2: The pci_bus_id version of the constructor is new in CUDA 4.1.

static count()

Return the number of CUDA devices found.

name()
pci_bus_id()

CUDA 4.1 and newer.

Added in version 2011.2.

compute_capability()

Return a 2-tuple indicating the compute capability version of this device.

total_memory()

Return the total amount of memory on the device in bytes.

get_attribute(attr)

Return the (numeric) value of the attribute attr, which may be one of the device_attribute values.

All device_attribute values may also be directly read as (lower-case) attributes on the Device object itself, e.g. dev.clock_rate.

get_attributes()

Return all device attributes in a dict, with keys from device_attribute.
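
A sketch of the three attribute access styles described above:

import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)

# Explicit query through get_attribute()
mp_count = dev.get_attribute(cuda.device_attribute.MULTIPROCESSOR_COUNT)

# Equivalent shorthand: lower-case attribute on the Device object
assert mp_count == dev.multiprocessor_count

# All attributes at once, as a dict keyed by device_attribute values
attrs = dev.get_attributes()
print(attrs[cuda.device_attribute.MAX_THREADS_PER_BLOCK])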

make_context(flags=ctx_flags.SCHED_AUTO)

Create a Context on this device, with flags taken from the ctx_flags values.

Also make the newly-created context the current context.
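
A common create/tear-down pattern, as a sketch:

import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # created and made current
try:
    free_bytes, total_bytes = cuda.mem_get_info()
    print(free_bytes, "of", total_bytes, "bytes free")
finally:
    ctx.pop()  # deactivate the context when done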

retain_primary_context()

Return the Context obtained by retaining the device’s primary context, which is the one used by the CUDA runtime API. Unlike make_context(), the retained context is not made current.

CUDA 7.0 and newer.

Added in version 2020.1.

can_access_peer(dev)

CUDA 4.0 and newer.

Added in version 2011.1.

__hash__()
__eq__()
__ne__()
class pycuda.driver.Context

An equivalent of a UNIX process on the compute device. Create instances of this class using Device.make_context(). See also pycuda.autoinit.

detach()

Decrease the reference count on this context. If the reference count hits zero, the context is deleted.

push()

Make self the active context, pushing it on top of the context stack. CUDA 2.0 and above only.

static pop()

Remove any context from the top of the context stack, deactivating it. CUDA 2.0 and above only.

static get_device()

Return the device that the current context is working on.

static synchronize()

Wait for all activity in the current context to cease, then return.

static set_limit(limit, value)

See limit for possible values of limit.

CUDA 3.1 and above.

Added in version 0.94.

static get_limit(limit)

See limit for possible values of limit.

CUDA 3.1 and above.

Added in version 0.94.
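
For example, a sketch that enlarges the printf FIFO before launching kernels that print heavily; the chosen size is arbitrary:

import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

cuda.Context.set_limit(cuda.limit.PRINTF_FIFO_SIZE, 16 * 1024 * 1024)
print(cuda.Context.get_limit(cuda.limit.PRINTF_FIFO_SIZE))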

static set_cache_config(cc)

See func_cache for possible values of cc.

CUDA 3.2 and above.

Added in version 0.94.

static get_cache_config()

Return a value from func_cache.

CUDA 3.2 and above.

Added in version 0.94.

static set_shared_config(sc)

See shared_config for possible values of sc.

CUDA 4.2 and above.

Added in version 2013.1.

static get_shared_config()

Return a value from shared_config.

CUDA 4.2 and above.

Added in version 2013.1.

get_api_version()

Return an integer API version number.

CUDA 3.2 and above.

Added in version 0.94.

enable_peer_access(peer, flags=0)

CUDA 4.0 and above.

Added in version 2011.1.

disable_peer_access(peer, flags=0)

CUDA 4.0 and above.

Added in version 2011.1.

Concurrency and Streams

class pycuda.driver.Stream(flags=0)

A handle for a queue of operations that will be carried out in order.

synchronize()

Wait for all activity on this stream to cease, then return.

is_done()

Return True iff all queued operations have completed.

wait_for_event(evt)

Enqueues a wait for the given Event instance.

CUDA 3.2 and above.

Added in version 2011.1.

class pycuda.driver.Event(flags=0)

An event is a temporal ‘marker’ in a Stream that allows taking the time between two events, such as the time required to execute a kernel. An event’s time is recorded when the Stream has finished all tasks enqueued before the record() call.

See event_flags for values for the flags parameter.

record(stream=None)

Insert a recording point for self into the Stream stream. Return self.

synchronize()

Wait until the device execution stream reaches this event. Return self.

query()

Return True if the device execution stream has reached this event.

time_since(event)

Return the time in milliseconds that has passed between self and event. Use this method as end.time_since(start). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.

time_till(event)

Return the time in milliseconds that has passed between event and self. Use this method as start.time_till(end). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use synchronize() to ensure that the event has been reached.
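
The usual timing pattern, sketched with a placeholder for the work being measured:

import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

start, end = cuda.Event(), cuda.Event()
start.record()
# ... enqueue kernels or copies here ...
end.record()
end.synchronize()  # ensure 'end' has been reached before querying
print("elapsed: %g ms" % start.time_till(end))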

ipc_handle()

Return a bytes object representing an IPC handle to this event. Requires Python 2.6 and CUDA 4.1.

static from_ipc_handle(handle)

Requires Python 2.6 and CUDA 4.1.

Memory

Global Device Memory

pycuda.driver.mem_get_info()

Return a tuple (free, total) indicating the free and total memory in the current context, in bytes.

pycuda.driver.mem_alloc(bytes)

Return a DeviceAllocation object representing a linear piece of device memory.

pycuda.driver.to_device(buffer)

Allocate enough device memory for buffer, which adheres to the Python Buffer Protocol interface. Copy the contents of buffer onto the device. Return a DeviceAllocation object representing the newly-allocated memory.

pycuda.driver.from_device(devptr, shape, dtype, order='C')

Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them using shape, dtype and order.

pycuda.driver.from_device_like(devptr, other_ary)

Make a new numpy.ndarray from the data at devptr on the GPU, interpreting them as having the same shape, dtype and order as other_ary.

pycuda.driver.mem_alloc_pitch(width, height, access_size)

Allocates a linear piece of device memory at least width bytes wide and height rows high that can be accessed using a data type of size access_size in a coalesced fashion.

Returns a tuple (dev_alloc, actual_pitch) giving a DeviceAllocation and the actual width of each row in bytes.
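
For instance, a sketch allocating a pitched 2D buffer of 64x64 float32 values:

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

width, height = 64, 64
item = np.dtype(np.float32).itemsize
dev_alloc, pitch = cuda.mem_alloc_pitch(
    width * item,  # row width in bytes
    height,
    item)          # access size, for coalescing
print("row pitch in bytes:", pitch)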

class pycuda.driver.DeviceAllocation

An object representing an allocation of linear device memory. Once this object is deleted, its associated device memory is freed.

Objects of this type can be cast to int to obtain a linear index into this Context’s memory.

free()

Release the held device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.

as_buffer(size, offset=0)

Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset.

Added in version 2014.1.

pycuda.driver.mem_get_ipc_handle(devptr)

Return an opaque bytes object representing an IPC handle to the device pointer devptr.

Added in version 2011.2.

Requires CUDA 4.1 and Python 2.6.

class pycuda.driver.IPCMemoryHandle(ipc_handle, flags=ipc_mem_flags.LAZY_ENABLE_PEER_ACCESS)

Added in version 2011.2.

Requires CUDA 4.1 and Python 2.6.

Objects of this type can be used in the same ways as a DeviceAllocation.

close()
class pycuda.driver.PointerHolderBase

A base class that facilitates casting to pointers within PyCUDA. This allows the user to construct custom pointer types that may have been allocated by facilities outside of PyCUDA proper, but still need to be objects to facilitate RAII. The user needs to supply one method to facilitate the pointer cast:

get_pointer()

Return the pointer encapsulated by self.

as_buffer(size, offset=0)

Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset.

Added in version 2014.1.

Note

If your subclass provides its own __init__(), it must call the base class __init__(). Failure to do so will lead to boost.Python.ArgumentError being raised when it is used.
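
A minimal sketch of a conforming subclass, wrapping an externally obtained device address (the integer ptr stands in for memory allocated outside PyCUDA):

import pycuda.driver as cuda

class ForeignPointer(cuda.PointerHolderBase):
    def __init__(self, ptr):
        super().__init__()  # required: call the base class __init__
        self._ptr = ptr

    def get_pointer(self):
        return self._ptr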

Pagelocked Host Memory

Pagelocked Allocation

pycuda.driver.pagelocked_empty(shape, dtype, order='C', mem_flags=0)

Allocate a pagelocked numpy.ndarray of shape, dtype and order.

mem_flags may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.

For the meaning of the other parameters, please refer to the numpy documentation.
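
For example, a sketch of a page-locked buffer feeding an asynchronous copy (memcpy_htod_async() is documented under Unstructured Memory Transfers below; the sizes are arbitrary):

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

stream = cuda.Stream()
a = cuda.pagelocked_empty((1024,), np.float32)
a[:] = 42.0

a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod_async(a_gpu, a, stream)  # legal: a is page-locked
stream.synchronize()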

pycuda.driver.pagelocked_zeros(shape, dtype, order='C', mem_flags=0)

Like pagelocked_empty(), but initialized to zero.

pycuda.driver.pagelocked_empty_like(array, mem_flags=0)
pycuda.driver.pagelocked_zeros_like(array, mem_flags=0)

The numpy.ndarray instances returned by these functions have an attribute base that references an object of type

class pycuda.driver.PagelockedHostAllocation

Inherits from HostPointer.

An object representing an allocation of pagelocked host memory. Once this object is deleted, its associated host memory is freed.

free()

Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.

get_flags()

Return a bit field of values from host_alloc_flags.

Only available on CUDA 3.2 and newer.

Added in version 0.94.

class pycuda.driver.HostAllocation

A deprecated name for PagelockedHostAllocation.

Aligned Host Memory

pycuda.driver.aligned_empty(shape, dtype, order='C', alignment=4096)

Allocate a numpy.ndarray of shape, dtype and order, with data aligned to alignment bytes.

For the meaning of the other parameters, please refer to the numpy documentation.

Added in version 2011.1.

pycuda.driver.aligned_zeros(shape, dtype, order='C', alignment=4096)

Like aligned_empty(), but with initialization to zero.

Added in version 2011.1.

pycuda.driver.aligned_empty_like(array, alignment=4096)

Added in version 2011.1.

pycuda.driver.aligned_zeros_like(array, alignment=4096)

Added in version 2011.1.

The numpy.ndarray instances returned by these functions have an attribute base that references an object of type

class pycuda.driver.AlignedHostAllocation

Inherits from HostPointer.

An object representing an allocation of aligned host memory.

free()

Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.

Post-Allocation Pagelocking

pycuda.driver.register_host_memory(ary, flags=0)

Returns a numpy.ndarray which shares memory with ary. This memory will be page-locked as long as the return value of this function is alive.

The returned array’s base attribute contains a RegisteredHostMemory instance, whose base attribute in turn contains ary.

CUDA 4.0 and newer.

ary’s data address and size must be page-aligned. One way to achieve this is to use the functions in Aligned Host Memory.

Added in version 2011.1.
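
A sketch combining this with aligned allocation so the page-alignment requirement is met, assuming a page size of 4096 bytes:

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

# 4096 float32 values = 16384 bytes, a whole number of 4096-byte pages
ary = cuda.aligned_empty((4096,), np.float32, alignment=4096)
pinned = cuda.register_host_memory(ary)  # page-locked view of ary
assert pinned.base.base is ary  # base chain as described above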

class pycuda.driver.RegisteredHostMemory

Inherits from HostPointer.

CUDA 4.0 and newer.

Added in version 2011.1.

unregister()

Unregister the page-lock on the host memory held by this instance. Note that this does not free the memory, it only frees the page-lock.

base

Contains the Python object from which this instance was constructed.

class pycuda.driver.HostPointer

Represents a page-locked host pointer.

get_device_pointer()

Return a device pointer that indicates the address at which this memory is mapped into the device’s address space.

Only available on CUDA 2.2 and newer.

Managed Memory

CUDA 6.0 adds support for a “Unified Memory” model, which creates a managed virtual memory space that is visible to both CPUs and GPUs. The OS will migrate the physical pages associated with managed memory between the CPU and GPU as needed. This allows a numpy array on the host to be passed to kernels without first creating a DeviceAllocation and manually copying the host data to and from the device.

Note

Managed memory is only available for some combinations of CUDA device, operating system, and host compiler target architecture. Check the CUDA C Programming Guide and CUDA release notes for details.

Warning

This interface to managed memory should be considered experimental. It is provided as a preview, but for now the same interface stability guarantees as for the rest of PyCUDA do not apply.

Managed Memory Allocation

pycuda.driver.managed_empty(shape, dtype, order='C', mem_flags=0)

Allocate a managed numpy.ndarray of shape, dtype and order.

mem_flags may be one of the values in mem_attach_flags.

For the meaning of the other parameters, please refer to the numpy documentation.

Only available on CUDA 6.0 and newer.

Added in version 2014.1.

pycuda.driver.managed_zeros(shape, dtype, order='C', mem_flags=0)

Like managed_empty(), but initialized to zero.

Only available on CUDA 6.0 and newer.

Added in version 2014.1.

pycuda.driver.managed_empty_like(array, mem_flags=0)

Only available on CUDA 6.0 and newer.

Added in version 2014.1.

pycuda.driver.managed_zeros_like(array, mem_flags=0)

Only available on CUDA 6.0 and newer.

Added in version 2014.1.

The numpy.ndarray instances returned by these functions have an attribute base that references an object of type

class pycuda.driver.ManagedAllocation

An object representing an allocation of managed host memory. Once this object is deleted, its associated CUDA managed memory is freed.

free()

Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated numpy array) is an error and will lead to undefined behavior.

get_device_pointer()

Return a device pointer that indicates the address at which this memory is mapped into the device’s address space. For managed memory, this is also the host pointer.

attach(mem_flags, stream=None)

Alter the visibility of the managed allocation to be one of the values in mem_attach_flags. A managed array can be made visible to the host CPU and the entire CUDA context with mem_attach_flags.GLOBAL, or limited to the CPU only with mem_attach_flags.HOST. If mem_attach_flags.SINGLE is selected, the array will only be visible to the CPU and to the provided Stream instance.

Managed Memory Usage

A managed numpy array is constructed and used on the host in a similar manner to a pagelocked array:

from pycuda.autoinit import context
import pycuda.driver as cuda
import numpy as np

a = cuda.managed_empty(shape=10, dtype=np.float32, mem_flags=cuda.mem_attach_flags.GLOBAL)
a[:] = np.linspace(0, 9, len(a)) # Fill array on host

It can be passed to a GPU kernel, and used again on the host without an explicit copy:

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void doublify(float *a)
{
    a[threadIdx.x] *= 2;
}
""")
doublify = mod.get_function("doublify")

doublify(a, grid=(1,1), block=(len(a),1,1))
context.synchronize() # Wait for kernel completion before host access

median = np.median(a) # Computed on host!

Warning

The CUDA Unified Memory model has very specific rules regarding concurrent access of managed memory allocations. Host access to any managed array is not allowed while the GPU is executing a kernel, regardless of whether the array is in use by the running kernel. Failure to follow the concurrency rules will generate a segmentation fault, causing the Python interpreter to terminate immediately.

Users of managed numpy arrays should read the “Unified Memory Programming” appendix of the CUDA C Programming Guide for further details on the concurrency restrictions.

If you are encountering interpreter terminations due to concurrency issues, the faulthandler module (http://pypi.python.org/pypi/faulthandler) may be helpful in locating the place in your Python program where the faulty access occurs.

Arrays and Textures

class pycuda.driver.ArrayDescriptor
width
height
format

A value of type array_format.

num_channels
class pycuda.driver.ArrayDescriptor3D
width
height
depth
format

A value of type array_format. CUDA 2.0 and above only.

num_channels
class pycuda.driver.Array(descriptor)

A 2D or 3D memory block that can only be accessed via texture references.

descriptor can be of type ArrayDescriptor or ArrayDescriptor3D.

free()

Release the array and its device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior.

get_descriptor()

Return an ArrayDescriptor object for this 2D array, like the one that was used to create it.

get_descriptor_3d()

Return an ArrayDescriptor3D object for this 3D array, like the one that was used to create it. CUDA 2.0 and above only.

handle

Return an int representing the address in device memory where this array resides.

class pycuda.driver.SurfaceReference

Note

Instances of this class can only be constructed through Module.get_surfref().

CUDA 3.1 and above.

Added in version 0.94.

set_array(array, flags=0)

Bind self to the Array array.

As long as array remains bound to this surface reference, it will not be freed: the surface reference keeps a reference to the array.

get_array()

Get back the Array to which self is bound.

Note

This will be a different object than the one passed to set_array(), but it will compare equal.

class pycuda.driver.TextureReference

A handle to a binding of either linear memory or an Array to a texture unit.

set_array(array)

Bind self to the Array array.

As long as array remains bound to this texture reference, it will not be freed: the texture reference keeps a reference to the array.

set_address(devptr, bytes, allow_offset=False)

Bind self to a chunk of linear memory starting at the integer address devptr, encompassing a number of bytes. Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in bytes. If allow_offset is False, a nonzero value of this offset will cause an exception to be raised.

Unlike for Array objects, no life support is provided for linear memory bound to texture references.

set_address_2d(devptr, descr, pitch)

Bind self as a 2-dimensional texture to a chunk of global memory at devptr. The line-to-line offset in bytes is given by pitch. Width, height and format are given in the ArrayDescriptor descr. set_format() need not and should not be called in addition to this method.

set_format(fmt, num_components)

Set the texture to have array_format fmt and to have num_components channels.

set_address_mode(dim, am)

Set the address mode of dimension dim to am, which must be one of the address_mode values.

set_flags(flags)

Set the flags to a combination of the TRSF_XXX values.

get_array()

Get back the Array to which self is bound.

Note

This will be a different object than the one passed to set_array(), but it will compare equal.

get_address_mode(dim)
get_filter_mode()
get_format()

Return a tuple (fmt, num_components), where fmt is of type array_format, and num_components is the number of channels in this texture.

CUDA 2.0 and above only.

get_flags()
pycuda.driver.TRSA_OVERRIDE_FORMAT
pycuda.driver.TRSF_READ_AS_INTEGER
pycuda.driver.TRSF_NORMALIZED_COORDINATES
pycuda.driver.TR_DEFAULT
pycuda.driver.matrix_to_array(matrix, order)

Turn the two-dimensional numpy.ndarray object matrix into an Array. The order argument can be either “C” or “F”. If it is “C”, then tex2D(x,y) is going to fetch matrix[y,x], and vice versa for “F”.

pycuda.driver.np_to_array(nparray, order, allowSurfaceBind=False)

Turn a numpy.ndarray with a 2D or 3D structure into an Array. The order argument can be either “C” or “F”. If allowSurfaceBind is passed as True, the returned Array can be read and written through a SurfaceReference, in addition to being read through a TextureReference. The function automatically detects the dtype and adjusts the number of channels to a supported array_format. It also adds direct support for the np.float64, np.complex64 and np.complex128 formats.

Example of use:

#include <pycuda-helpers.hpp>

texture<fp_tex_double, 3, cudaReadModeElementType> my_tex; // complex128: fp_tex_cdouble
                                                           // complex64 : fp_tex_cfloat
                                                           // float64   : fp_tex_double
surface<void, 3, cudaReadModeElementType> my_surf;         // Surfaces in 2D needs 'cudaSurfaceType2DLayered'

__global__ void f()
{
  ...
  fp_tex3D(my_tex, i, j, k);
  fp_surf3Dwrite(myvar, my_surf, i, j, k, cudaBoundaryModeClamp); // fp extensions don't need width in bytes
  fp_surf3Dread(&myvar, my_surf, i, j, k, cudaBoundaryModeClamp);
  ...
}

Added in version 2015.1.

pycuda.driver.gpuarray_to_array(gpuparray, order, allowSurfaceBind=False)

Turn a GPUArray with a 2D or 3D structure into an Array. Same structure and usage as np_to_array().

Added in version 2015.1.

pycuda.driver.make_multichannel_2d_array(matrix, order)

Turn the three-dimensional numpy.ndarray object matrix into a 2D Array with multiple channels.

Depending on order, the matrix’s shape is interpreted as

  • height, width, num_channels for order == “C”,

  • num_channels, width, height for order == “F”.

Note

This function assumes that matrix has been created with the memory order order. If that is not the case, the copied data will likely not be what you expect.

Initializing Device Memory

pycuda.driver.memset_d8(dest, data, count)
pycuda.driver.memset_d16(dest, data, count)
pycuda.driver.memset_d32(dest, data, count)

Fill array with data.

Note

count is the number of elements, not bytes.
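
For instance, a sketch zeroing a freshly allocated buffer of 256 float32 values:

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

n = 256
buf = cuda.mem_alloc(n * 4)
cuda.memset_d32(buf, 0, n)  # count is in 32-bit elements, not bytes

out = np.empty(n, dtype=np.float32)
cuda.memcpy_dtoh(out, buf)
assert (out == 0).all()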

pycuda.driver.memset_d2d8(dest, pitch, data, width, height)
pycuda.driver.memset_d2d16(dest, pitch, data, width, height)
pycuda.driver.memset_d2d32(dest, pitch, data, width, height)

Fill a two-dimensional array with data.

pycuda.driver.memset_d8_async(dest, data, count, stream=None)
pycuda.driver.memset_d16_async(dest, data, count, stream=None)
pycuda.driver.memset_d32_async(dest, data, count, stream=None)

Fill array with data asynchronously, optionally serialized via stream.

Added in version 2015.1.

pycuda.driver.memset_d2d8_async(dest, pitch, data, width, height, stream=None)
pycuda.driver.memset_d2d16_async(dest, pitch, data, width, height, stream=None)
pycuda.driver.memset_d2d32_async(dest, pitch, data, width, height, stream=None)

Fill a two-dimensional array with data asynchronously, optionally serialized via stream.

Added in version 2015.1.

Unstructured Memory Transfers

pycuda.driver.memcpy_htod(dest, src)

Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation). The size of the copy is determined by the size of the buffer.

pycuda.driver.memcpy_htod_async(dest, src, stream=None)

Copy from the Python buffer src to the device pointer dest (an int or a DeviceAllocation) asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.

src must be page-locked memory; see, e.g., pagelocked_empty().

Added in version 0.93.

pycuda.driver.memcpy_dtoh(dest, src)

Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest. The size of the copy is determined by the size of the buffer.
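
For example, a sketch of a host-to-device-to-host round trip:

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

a = np.random.randn(400).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

result = np.empty_like(a)
cuda.memcpy_dtoh(result, a_gpu)
assert np.array_equal(a, result)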

pycuda.driver.memcpy_dtoh_async(dest, src, stream=None)

Copy from the device pointer src (an int or a DeviceAllocation) to the Python buffer dest asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.

dest must be page-locked memory; see, e.g., pagelocked_empty().

Added in version 0.93.

pycuda.driver.memcpy_dtod(dest, src, size)
pycuda.driver.memcpy_dtod_async(dest, src, size, stream=None)

CUDA 3.0 and above.

Added in version 0.94.

pycuda.driver.memcpy_peer(dest, src, size, dest_context=None, src_context=None)
pycuda.driver.memcpy_peer_async(dest, src, size, dest_context=None, src_context=None, stream=None)

CUDA 4.0 and above.

Added in version 2011.1.

pycuda.driver.memcpy_dtoa(ary, index, src, len)
pycuda.driver.memcpy_atod(dest, ary, index, len)
pycuda.driver.memcpy_htoa(ary, index, src)
pycuda.driver.memcpy_atoh(dest, ary, index)
pycuda.driver.memcpy_atoa(dest, dest_index, src, src_index, len)

Structured Memory Transfers

class pycuda.driver.Memcpy2D
src_x_in_bytes

X offset of the origin of the copy. (initialized to 0)

src_y

Y offset of the origin of the copy. (initialized to 0)

src_pitch

Size of a row in bytes at the origin of the copy.

set_src_host(buffer)

Set the buffer, which must be a Python object adhering to the buffer interface, to be the origin of the copy.

set_src_array(array)

Set the Array array to be the origin of the copy.

set_src_device(devptr)

Set the device address devptr (an int or a DeviceAllocation) as the origin of the copy.

set_src_unified(buffer)

Same as set_src_host(), except that buffer may also correspond to device memory.

CUDA 4.0 and above. Requires unified addressing.

Added in version 2011.1.

dst_x_in_bytes

X offset of the destination of the copy. (initialized to 0)

dst_y

Y offset of the destination of the copy. (initialized to 0)

dst_pitch

Size of a row in bytes at the destination of the copy.

set_dst_host(buffer)

Set the buffer, which must be a Python object adhering to the buffer interface, to be the destination of the copy.

set_dst_array(array)

Set the Array array to be the destination of the copy.

set_dst_device(devptr)

Set the device address devptr (an int or a DeviceAllocation) as the destination of the copy.

set_dst_unified(buffer)

Same as set_dst_host(), except that buffer may also correspond to device memory.

CUDA 4.0 and above. Requires unified addressing.

Added in version 2011.1.

width_in_bytes

Number of bytes to copy for each row in the transfer.

height

Number of rows to copy.

__call__([aligned=True])

Perform the specified memory copy, waiting for it to finish. If aligned is False, tolerate device-side misalignment for device-to-device copies that may lead to loss of copy bandwidth.

__call__(stream)

Perform the memory copy asynchronously, serialized via the Stream stream. Any host memory involved in the transfer must be page-locked.
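
For example, a sketch copying a C-ordered numpy array into a pitched device allocation; the shapes are arbitrary:

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda

h, w = 64, 100
a = np.arange(h * w, dtype=np.float32).reshape(h, w)
row_bytes = w * a.dtype.itemsize

dev_alloc, pitch = cuda.mem_alloc_pitch(row_bytes, h, a.dtype.itemsize)

copy = cuda.Memcpy2D()
copy.set_src_host(a)
copy.src_pitch = row_bytes      # tightly packed source rows
copy.set_dst_device(dev_alloc)
copy.dst_pitch = pitch          # pitched destination rows
copy.width_in_bytes = row_bytes
copy.height = h
copy(aligned=True)              # synchronous copy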

class pycuda.driver.Memcpy3D

Memcpy3D has the same members as Memcpy2D, and additionally all of the following:

src_height

Ignored when source is an Array. May be 0 if Depth==1.

src_z

Z offset of the origin of the copy. (initialized to 0)

dst_height

Ignored when destination is an Array. May be 0 if Depth==1.

dst_z

Z offset of the destination of the copy. (initialized to 0)

depth

Memcpy3D is supported on CUDA 2.0 and above only.

class pycuda.driver.Memcpy3DPeer

Memcpy3DPeer has the same members as Memcpy3D, and additionally all of the following:

set_src_context(ctx)
set_dst_context(ctx)

CUDA 4.0 and newer.

Added in version 2011.1.

Code on the Device: Modules and Functions

class pycuda.driver.Module

Handle to a CUBIN module loaded onto the device. Can be created with module_from_file() and module_from_buffer().

get_function(name)

Return the Function name in this module.

Warning

While you can obtain different handles to the same function using this method, these handles all share the same state that is set through the set_XXX methods of Function. This means that you can’t obtain two different handles to the same function and Function.prepare() them in two different ways.

get_global(name)

Return a tuple (device_ptr, size_in_bytes) giving the device address and size of the global name.

The main use of this method is to find the address of pre-declared __constant__ arrays so they can be filled from the host before kernel invocation.
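
A sketch of that use, with a made-up kernel for illustration:

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__constant__ float coeffs[4];
__global__ void scale(float *a)
{ a[threadIdx.x] *= coeffs[threadIdx.x % 4]; }
""")

ptr, size_in_bytes = mod.get_global("coeffs")
cuda.memcpy_htod(ptr, np.array([1, 2, 3, 4], dtype=np.float32))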

get_texref(name)

Return the TextureReference name from this module.

get_surfref(name)

Return the SurfaceReference name from this module.

CUDA 3.1 and above.

Added in version 0.94.

pycuda.driver.module_from_file(filename)

Create a Module by loading the CUBIN file filename.

pycuda.driver.module_from_buffer(buffer, options=[], message_handler=None)

Create a Module by loading a PTX or CUBIN module from buffer, which must support the Python buffer interface. (For example, str and numpy.ndarray do.)

Parameters:
  • options – A list of tuples (jit_option, value).

  • message_handler – A callable that is called with arguments (compile_success_bool, info_str, error_str), allowing the user to process error and warning messages from the PTX compiler.

Loading PTX modules as well as non-default values of options and message_handler are only allowed on CUDA 2.1 and newer.

class pycuda.driver.Function

Handle to a __global__ function in a Module. Create using Module.get_function().

__call__(arg1, ..., argn, block=block_size[, grid=(1, 1)[, stream=None[, shared=0[, texrefs=[][, time_kernel=False]]]]])

Launch self, with a thread block size of block. block must be a 3-tuple of integers.

arg1 through argn are the positional C arguments to the kernel. See param_set() for details. See especially the warnings there.

grid specifies, as a tuple of up to three integer entries, the number of thread blocks to launch, as a multi-dimensional grid. stream, if specified, is a Stream instance serializing the copying of input arguments (if any), execution, and the copying of output arguments (again, if any). shared gives the number of bytes available to the kernel in extern __shared__ arrays. texrefs is a list of TextureReference instances that the function will have access to.

The function returns either None or the number of seconds spent executing the kernel, depending on whether time_kernel is True.

This is a convenience interface that can be used instead of the param_* and launch_* methods below. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().

arg1 through argn are allowed to be of the following types:

  • Subclasses of numpy.number. These are sized number types such as numpy.uint32 or numpy.float32.

  • DeviceAllocation instances, which will become a device pointer to the allocated memory.

  • Instances of ArgumentHandler subclasses. These can be used to automatically transfer numpy arrays onto and off of the device.

  • Objects supporting the Python Buffer Protocol interface. These chunks of bytes will be copied into the parameter space verbatim.

  • GPUArray instances.

Warning

You cannot pass values of Python’s native int or float types to param_set. Since there is no unambiguous way to guess the size of these integers or floats, a TypeError is raised.

Note

This method has to guess the types of the arguments passed to it, which can make it somewhat slow. For a kernel that is invoked often, this can be inconvenient. For a faster (but mildly less convenient) way of invoking kernels, see prepare() and prepared_call().

Note

grid with more than two dimensions requires CUDA 4.0 or newer.

param_set_texref(texref)

Make the TextureReference texref available to the function.

prepare(arg_types, shared=None, texrefs=[])

Prepare the invocation of this function by

  • setting up the argument types as arg_types. arg_types is expected to be an iterable containing type characters understood by the struct module or numpy.dtype objects.

    (In addition, PyCUDA understands ‘F’ and ‘D’ for single- and double precision floating point numbers.)

  • Registering the texture references texrefs for use with this function. The TextureReference objects in texrefs will be retained, and whatever these references are bound to at invocation time will be available through the corresponding texture references within the kernel.

Return self.

prepared_call(grid, block, *args, shared_size=0)

Invoke self using launch_grid(), with args, a grid size of grid, and a block size of block. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Changed in version 2012.1: shared_size was added.
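
A sketch of the prepare()/prepared_call() pattern for a simple axpb-style kernel (the kernel itself is made up for illustration):

import numpy as np
import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void axpb(float *x, float a, float b)
{ x[threadIdx.x] = a * x[threadIdx.x] + b; }
""")
axpb = mod.get_function("axpb").prepare("Pff")  # pointer, float, float

x = np.arange(16, dtype=np.float32)
x_gpu = cuda.mem_alloc(x.nbytes)
cuda.memcpy_htod(x_gpu, x)

# Plain Python floats are fine here: prepare() fixed the argument types
axpb.prepared_call((1, 1), (16, 1, 1), x_gpu, 2.0, 1.0)
cuda.memcpy_dtoh(x, x_gpu)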

prepared_timed_call(grid, block, *args, shared_size=0)

Invoke self using launch_grid(), with args, a grid size of grid, and a block size of block. Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Return a 0-ary callable that can be used to query the GPU time consumed by the call, in seconds. Once called, this callable will block until completion of the invocation.

Changed in version 2012.1: shared_size was added.

prepared_async_call(grid, block, stream, *args, shared_size=0)

Invoke self using launch_grid_async(), with args, a grid size of grid, and a block size of block, serialized into the pycuda.driver.Stream stream. If stream is None, do the same as prepared_call(). Assumes that prepare() was called on self. The texture references given to prepare() are set up as parameters, as well.

Changed in version 2012.1: shared_size was added.

get_attribute(attr)

Return one of the attributes given by the function_attribute value attr.

All function_attribute values may also be directly read as (lower-case) attributes on the Function object itself, e.g. func.num_regs.

CUDA 2.2 and newer.

Added in version 0.93.

set_attribute(attr, value)

Set one of the (settable) attributes given by the function_attribute value attr to value.

Added in version 2022.1.

set_cache_config(fc)

See func_cache for possible values of fc.

CUDA 3.0 (post-beta) and newer.

Added in version 0.94.

set_shared_config(sc)

See shared_config for possible values of sc.

CUDA 4.2 and newer.

Added in version 2013.1.

local_size_bytes

The number of bytes of local memory used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute lmem.

shared_size_bytes

The number of bytes of shared memory used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute smem.

num_regs

The number of 32-bit registers used by this function.

On CUDA 2.1 and below, this is only available if this function is part of a pycuda.compiler.SourceModule. It replaces the now-deprecated attribute registers.

set_shared_size(bytes)

Set shared to be the number of bytes available to the kernel in extern __shared__ arrays.

Warning

Deprecated as of version 2011.1.

set_block_shape(x, y, z)

Set the thread block shape for this function.

Warning

Deprecated as of version 2011.1.

param_set(arg1, ... argn)

Set arg1 through argn as the positional arguments of this function.

Warning

Deprecated as of version 2011.1.

param_set_size(bytes)

Size the parameter space to bytes.

Warning

Deprecated as of version 2011.1.

param_seti(offset, value)

Set the integer at offset in the parameter space to value.

Warning

Deprecated as of version 2011.1.

param_setf(offset, value)

Set the float at offset in the parameter space to value.

Warning

Deprecated as of version 2011.1.

launch()

Launch a single thread block of self.

Warning

Deprecated as of version 2011.1.

launch_grid(width, height)

Launch a width*height grid of thread blocks of self.

Warning

Deprecated as of version 2011.1.

launch_grid_async(width, height, stream)

Launch a width*height grid of thread blocks of self, sequenced by the Stream stream.

Warning

Deprecated as of version 2011.1.

class pycuda.driver.ArgumentHandler(array)
class pycuda.driver.In(array)

Inherits from ArgumentHandler. Indicates that Buffer Protocol array should be copied to the compute device before invoking the kernel.

class pycuda.driver.Out(array)

Inherits from ArgumentHandler. Indicates that Buffer Protocol array should be copied off the compute device after invoking the kernel.

class pycuda.driver.InOut(array)

Inherits from ArgumentHandler. Indicates that Buffer Protocol array should be copied both onto the compute device before invoking the kernel, and off it afterwards.

Profiler Control

CUDA 4.0 and newer.

pycuda.driver.initialize_profiler(config_file, output_file, output_mode)

output_mode is one of the attributes of profiler_output_mode.

Added in version 2011.1.

pycuda.driver.start_profiler()

Added in version 2011.1.

pycuda.driver.stop_profiler()

Added in version 2011.1.

Just-in-time Compilation

pycuda.compiler.DEFAULT_NVCC_FLAGS

Added in version 2011.1.

If no options are given in the calls below, the value of this list-type variable is used instead. This may be useful for injecting necessary flags into the compilation of automatically compiled kernels, such as those used by the module pycuda.gpuarray.

The initial value of this variable is taken from the environment variable PYCUDA_DEFAULT_NVCC_FLAGS.

If you modify this variable in your code, please be aware that this is a globally shared variable that may be modified by multiple packages. Please exercise caution in such modifications; you risk breaking other people’s code.

class pycuda.compiler.SourceModule(source, nvcc='nvcc', options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])

Create a pycuda.driver.Module from the CUDA source code source. The Nvidia compiler nvcc is assumed to be on the PATH if no path to it is specified, and is invoked with options to compile the code. If keep is True, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.

Unless no_extern_c is True, the given source code is wrapped in extern “C” { … } to prevent C++ name mangling.

arch and code specify the values to be passed for the -arch and -code options on the nvcc command line. If arch is None, it defaults to the current context’s device’s compute capability. If code is None, it will not be specified.

cache_dir gives the directory used for compiler caching. If None then cache_dir is taken to be PYCUDA_CACHE_DIR if set or a sensible per-user default. If passed as False, caching is disabled.

If the environment variable PYCUDA_DISABLE_CACHE is set to any value then caching is disabled. This preference overrides any value of cache_dir and can be used to disable caching globally.

This class exhibits the same public interface as pycuda.driver.Module, but does not inherit from it.

Change note: SourceModule was moved from pycuda.driver to pycuda.compiler in version 0.93.

compile(source, nvcc="nvcc", options=None, keep=False,
no_extern_c=False, arch=None, code=None, cache_dir=None,
include_dirs=[])

Perform the same compilation as the corresponding SourceModule constructor, but only return the resulting cubin file as a string. In particular, do not upload the code to the GPU.
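
For example, a sketch pairing compile() with module_from_buffer(); the no-op kernel is made up for illustration:

import pycuda.autoinit  # sets up a default context
import pycuda.driver as cuda
from pycuda.compiler import compile

cubin = compile("""
__global__ void noop() { }
""")
mod = cuda.module_from_buffer(cubin)  # upload happens here
noop = mod.get_function("noop")
noop(block=(1, 1, 1), grid=(1, 1))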