Device Interface¶
Version Queries¶
- pycuda.VERSION¶
- Gives the numeric version of PyCUDA as a variable-length tuple of integers. Enables easy version checks such as VERSION >= (0, 93). - Added in PyCUDA 0.93. 
- pycuda.VERSION_STATUS¶
- A text string such as “rc4” or “beta” qualifying the status of the release. - Added in version 0.93. 
- pycuda.VERSION_TEXT¶
- The full release name (such as “0.93rc4”) in string form. - Added in version 0.93. 
Error Reporting¶
- exception pycuda.driver.Error¶
- Base class of all PyCuda errors. 
- exception pycuda.driver.CompileError¶
- Thrown when - pycuda.compiler.SourceModulecompilation fails.- msg¶
- Added in version 0.94. 
 - stdout¶
- Added in version 0.94. 
 - stderr¶
- Added in version 0.94. 
 - command_line¶
- Added in version 0.94. 
 
- exception pycuda.driver.MemoryError¶
- Thrown when - mem_alloc()or related functionality fails.
- exception pycuda.driver.LogicError¶
- Thrown when PyCuda was confronted with a situation where it is likely that the programmer has made a mistake. - LogicErrors do not depend on outer circumstances defined by the run-time environment.- Example: CUDA was used before it was initialized. 
- exception pycuda.driver.LaunchError¶
- Thrown when kernel invocation has failed. (Note that this will often be reported by the next call after the actual kernel invocation.) 
- exception pycuda.driver.RuntimeError¶
- Thrown when a unforeseen run-time failure is encountered that is not likely due to programmer error. - Example: A file was not found. 
Constants¶
- class pycuda.driver.ctx_flags¶
- Flags for - Device.make_context(). CUDA 2.0 and above only.- SCHED_AUTO¶
- If there are more contexts than processors, yield, otherwise spin while waiting for CUDA calls to complete. 
 - SCHED_SPIN¶
- Spin while waiting for CUDA calls to complete. 
 - SCHED_YIELD¶
- Yield to other threads while waiting for CUDA calls to complete. 
 - SCHED_MASK¶
- Mask of valid scheduling flags in this bitfield. 
 - SCHED_BLOCKING_SYNC¶
- Use blocking synchronization. CUDA 2.2 and newer. 
 - MAP_HOST¶
- Support mapped pinned allocations. CUDA 2.2 and newer. 
 - LMEM_RESIZE_TO_MAX¶
- Keep local memory allocation after launch. CUDA 3.2 and newer. Rumored to decrease Fermi launch overhead? - Added in version 2011.1. 
 - FLAGS_MASK¶
- Mask of valid flags in this bitfield. 
 
- class pycuda.driver.event_flags¶
- Flags for - Event. CUDA 2.2 and newer.- DEFAULT¶
 - BLOCKING_SYNC¶
 - DISABLE_TIMING¶
- CUDA 3.2 and newer. - Added in version 0.94. 
 - INTERPROCESS¶
- CUDA 4.1 and newer. - Added in version 2011.2. 
 
- class pycuda.driver.device_attribute¶
- MAX_THREADS_PER_BLOCK¶
 - MAX_BLOCK_DIM_X¶
 - MAX_BLOCK_DIM_Y¶
 - MAX_BLOCK_DIM_Z¶
 - MAX_GRID_DIM_X¶
 - MAX_GRID_DIM_Y¶
 - MAX_GRID_DIM_Z¶
 - TOTAL_CONSTANT_MEMORY¶
 - WARP_SIZE¶
 - MAX_PITCH¶
 - CLOCK_RATE¶
 - TEXTURE_ALIGNMENT¶
 - GPU_OVERLAP¶
 - MULTIPROCESSOR_COUNT¶
- CUDA 2.0 and above only. 
 - SHARED_MEMORY_PER_BLOCK¶
- Deprecated as of CUDA 2.0. See below for replacement. 
 - MAX_SHARED_MEMORY_PER_BLOCK¶
- CUDA 2.0 and above only. 
 - REGISTERS_PER_BLOCK¶
- Deprecated as of CUDA 2.0. See below for replacement. 
 - MAX_REGISTERS_PER_BLOCK¶
- CUDA 2.0 and above. 
 - KERNEL_EXEC_TIMEOUT¶
- CUDA 2.2 and above. 
 - INTEGRATED¶
- CUDA 2.2 and above. 
 - CAN_MAP_HOST_MEMORY¶
- CUDA 2.2 and above. 
 - COMPUTE_MODE¶
- CUDA 2.2 and above. See - compute_mode.
 - MAXIMUM_TEXTURE1D_WIDTH¶
- MAXIMUM_TEXTURE2D_WIDTH¶
- MAXIMUM_TEXTURE2D_HEIGHT¶
- MAXIMUM_TEXTURE3D_WIDTH¶
- MAXIMUM_TEXTURE3D_HEIGHT¶
- MAXIMUM_TEXTURE3D_DEPTH¶
- MAXIMUM_TEXTURE2D_ARRAY_WIDTH¶
- MAXIMUM_TEXTURE2D_ARRAY_HEIGHT¶
- MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES¶
- CUDA 3.0 and above. - Added in version 0.94. 
 - MAXIMUM_TEXTURE2D_LAYERED_WIDTH¶
- MAXIMUM_TEXTURE2D_LAYERED_HEIGHT¶
- MAXIMUM_TEXTURE2D_LAYERED_LAYERS¶
- MAXIMUM_TEXTURE1D_LAYERED_WIDTH¶
- MAXIMUM_TEXTURE1D_LAYERED_LAYERS¶
- CUDA 4.0 and above. - Added in version 2011.1. 
 - SURFACE_ALIGNMENT¶
- CUDA 3.0 (post-beta) and above. - Added in version 0.94. 
 - CONCURRENT_KERNELS¶
- CUDA 3.0 (post-beta) and above. - Added in version 0.94. 
 - ECC_ENABLED¶
- CUDA 3.0 (post-beta) and above. - Added in version 0.94. 
 - PCI_BUS_ID¶
- CUDA 3.2 and above. - Added in version 0.94. 
 - PCI_DEVICE_ID¶
- CUDA 3.2 and above. - Added in version 0.94. 
 - TCC_DRIVER¶
- CUDA 3.2 and above. - Added in version 0.94. 
 - MEMORY_CLOCK_RATE¶
- GLOBAL_MEMORY_BUS_WIDTH¶
- L2_CACHE_SIZE¶
- MAX_THREADS_PER_MULTIPROCESSOR¶
- ASYNC_ENGINE_COUNT¶
- UNIFIED_ADDRESSING¶
- CUDA 4.0 and above. - Added in version 2011.1. 
 - MAXIMUM_TEXTURE2D_GATHER_WIDTH¶
- MAXIMUM_TEXTURE2D_GATHER_HEIGHT¶
- MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE¶
- MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE¶
- MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE¶
- PCI_DOMAIN_ID¶
- TEXTURE_PITCH_ALIGNMENT¶
- MAXIMUM_TEXTURECUBEMAP_WIDTH¶
- MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH¶
- MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS¶
- MAXIMUM_SURFACE1D_WIDTH¶
- MAXIMUM_SURFACE2D_WIDTH¶
- MAXIMUM_SURFACE2D_HEIGHT¶
- MAXIMUM_SURFACE3D_WIDTH¶
- MAXIMUM_SURFACE3D_HEIGHT¶
- MAXIMUM_SURFACE3D_DEPTH¶
- MAXIMUM_SURFACE1D_LAYERED_WIDTH¶
- MAXIMUM_SURFACE1D_LAYERED_LAYERS¶
- MAXIMUM_SURFACE2D_LAYERED_WIDTH¶
- MAXIMUM_SURFACE2D_LAYERED_HEIGHT¶
- MAXIMUM_SURFACE2D_LAYERED_LAYERS¶
- MAXIMUM_SURFACECUBEMAP_WIDTH¶
- MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH¶
- MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS¶
- MAXIMUM_TEXTURE1D_LINEAR_WIDTH¶
- MAXIMUM_TEXTURE2D_LINEAR_WIDTH¶
- MAXIMUM_TEXTURE2D_LINEAR_HEIGHT¶
- MAXIMUM_TEXTURE2D_LINEAR_PITCH¶
- CUDA 4.1 and above. - Added in version 2011.2. 
 - MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH¶
- MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT¶
- COMPUTE_CAPABILITY_MAJOR¶
- COMPUTE_CAPABILITY_MINOR¶
- MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH¶
- CUDA 5.0 and above. - Added in version 2014.1. 
 - STREAM_PRIORITIES_SUPPORTED¶
- CUDA 5.5 and above. - Added in version 2014.1. 
 - GLOBAL_L1_CACHE_SUPPORTED¶
- LOCAL_L1_CACHE_SUPPORTED¶
- MAX_SHARED_MEMORY_PER_MULTIPROCESSOR¶
- MAX_REGISTERS_PER_MULTIPROCESSOR¶
- MANAGED_MEMORY¶
- MULTI_GPU_BOARD¶
- MULTI_GPU_BOARD_GROUP_ID¶
- CUDA 6.0 and above. - Added in version 2014.1. 
 - HOST_NATIVE_ATOMIC_SUPPORTED¶
- SINGLE_TO_DOUBLE_PRECISION_PERF_RATIO¶
- PAGEABLE_MEMORY_ACCESS¶
- CONCURRENT_MANAGED_ACCESS¶
- COMPUTE_PREEMPTION_SUPPORTED¶
- CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM¶
- CUDA 8.0 and above. 
 - MAX_SHARED_MEMORY_PER_BLOCK_OPTIN¶
- CUDA 9.0 and above. 
 - PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES¶
- DIRECT_MANAGED_MEM_ACCESS_FROM_HOST¶
- CUDA 9.2 and above. 
 - HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED¶
- HANDLE_TYPE_WIN32_HANDLE_SUPPORTED¶
- HANDLE_TYPE_WIN32_KMT_HANDLE_SUPPORTED¶
- CUDA 10.2 and above. 
 
- class pycuda.driver.pointer_attribute¶
- 
CUDA 4.0 and above. Added in version 2011.1. 
- class pycuda.driver.profiler_output_mode¶
- 
CUDA 4.0 and above. Added in version 2011.1. 
- class pycuda.driver.function_attribute¶
- Flags for - Function.get_attribute(). CUDA 2.2 and newer.- MAX_THREADS_PER_BLOCK¶
 - SHARED_SIZE_BYTES¶
 - CONST_SIZE_BYTES¶
 - LOCAL_SIZE_BYTES¶
 - NUM_REGS¶
 - PTX_VERSION¶
- CUDA 3.0 (post-beta) and above. - Added in version 0.94. 
 - BINARY_VERSION¶
- CUDA 3.0 (post-beta) and above. - Added in version 0.94. 
 - CACHE_MODE_CA¶
- Added in version 2022.1. 
 - MAX_DYNAMIC_SHARED_SIZE_BYTES¶
- Added in version 2022.1. 
 - PREFERRED_SHARED_MEMORY_CARVEOUT¶
- Added in version 2022.1. 
 - MAX¶
 
- class pycuda.driver.func_cache¶
- See - Function.set_cache_config(). CUDA 3.0 (post-beta) and above.- Added in version 0.94. - PREFER_NONE¶
 - PREFER_SHARED¶
 - PREFER_L1¶
 - PREFER_EQUAL¶
- CUDA 4.1 and above. - Added in version 2011.2. 
 
- See - Function.set_shared_config(). CUDA 4.2 and above.
- class pycuda.driver.array_format¶
- UNSIGNED_INT8¶
 - UNSIGNED_INT16¶
 - UNSIGNED_INT32¶
 - SIGNED_INT8¶
 - SIGNED_INT16¶
 - SIGNED_INT32¶
 - HALF¶
 - FLOAT¶
 
- class pycuda.driver.array3d_flags¶
- 
- LAYERED¶
- CUDA 4.0 and above. - Added in version 2011.1. 
 - SURFACE_LDST¶
- CUDA 3.1 and above. - Added in version 0.94. 
 - CUBEMAP TEXTURE_GATHER
- CUDA 4.1 and above. - Added in version 2011.2. 
 
- class pycuda.driver.address_mode¶
- WRAP¶
 - CLAMP¶
 - MIRROR¶
 - BORDER¶
- CUDA 3.2 and above. - Added in version 0.94. 
 
- class pycuda.driver.compute_mode¶
- CUDA 2.2 and newer. - DEFAULT¶
 - PROHIBITED¶
 - EXCLUSIVE_PROCESS¶
- CUDA 4.0 and above. - Added in version 2011.1. 
 
- class pycuda.driver.jit_option¶
- CUDA 2.1 and newer. - MAX_REGISTERS¶
 - THREADS_PER_BLOCK¶
 - WALL_TIME¶
 - INFO_LOG_BUFFER¶
 - INFO_LOG_BUFFER_SIZE_BYTES¶
 - ERROR_LOG_BUFFER¶
 - ERROR_LOG_BUFFER_SIZE_BYTES¶
 - OPTIMIZATION_LEVEL¶
 - TARGET_FROM_CUCONTEXT¶
 - TARGET¶
 - FALLBACK_STRATEGY¶
 
- class pycuda.driver.jit_target¶
- CUDA 2.1 and newer. - COMPUTE_10¶
 - COMPUTE_11¶
 - COMPUTE_12¶
 - COMPUTE_13¶
 - COMPUTE_20¶
- CUDA 3.0 and above. - Added in version 0.94. 
 - COMPUTE_21¶
- CUDA 3.2 and above. - Added in version 0.94. 
 
- class pycuda.driver.host_alloc_flags¶
- Flags to be used to allocate Pagelocked Host Memory. - PORTABLE¶
 - DEVICEMAP¶
 - WRITECOMBINED¶
 
- class pycuda.driver.mem_attach_flags¶
- Flags to be used to allocate Managed Memory. - ..versionadded:: 2014.1 - GLOBAL¶
 - HOST¶
 - SINGLE¶
 
- class pycuda.driver.mem_host_register_flags¶
- PORTABLE¶
 - DEVICEMAP¶
 - CUDA 4.0 and newer. - Added in version 2011.1. 
- class pycuda.driver.limit¶
- Limit values for - Context.get_limit()and- Context.set_limit().- CUDA 3.1 and newer. - Added in version 0.94. - STACK_SIZE¶
 - PRINTF_FIFO_SIZE¶
 - MALLOC_HEAP_SIZE¶
- CUDA 3.2 and above. 
 
Devices and Contexts¶
- pycuda.driver.get_version()¶
- Obtain the version of CUDA against which PyCuda was compiled. Returns a 3-tuple of integers as (major, minor, revision). 
- pycuda.driver.get_driver_version()¶
- Obtain the version of the CUDA driver on top of which PyCUDA is running. Returns an integer version number. 
- pycuda.driver.init(flags=0)¶
- Initialize CUDA. - Warning - This must be called before any other function in this module. - See also - pycuda.autoinit.
- class pycuda.driver.Device(number)¶
- class pycuda.driver.Device(pci_bus_id)
- A handle to the number’th CUDA device. See also - pycuda.autoinit.- Changed in version 2011.2: The pci_bus_id version of the constructor is new in CUDA 4.1. - static count()¶
- Return the number of CUDA devices found. 
 - name()¶
 - pci_bus_id()¶
- CUDA 4.1 and newer. - Added in version 2011.2. 
 - compute_capability()¶
- Return a 2-tuple indicating the compute capability version of this device. 
 - total_memory()¶
- Return the total amount of memory on the device in bytes. 
 - get_attribute(attr)¶
- Return the (numeric) value of the attribute attr, which may be one of the - device_attributevalues.- All - device_attributevalues may also be directly read as (lower-case) attributes on the- Deviceobject itself, e.g. dev.clock_rate.
 - get_attributes()¶
- Return all device attributes in a - dict, with keys from- device_attribute.
 - make_context(flags=ctx_flags.SCHED_AUTO)¶
- Create a - Contexton this device, with flags taken from the- ctx_flagsvalues.- Also make the newly-created context the current context. 
 - retain_primary_context()¶
- Return the - Contextobtained by retaining the device’s primary context, which is the one used by the CUDA runtime API. Unlike- make_context(), the newly-created context is not made current.- CUDA 7.0 and newer. - Added in version 2020.1. 
 - can_access_peer(dev)¶
- CUDA 4.0 and newer. - Added in version 2011.1. 
 - __hash__()¶
 - __eq__()¶
 - __ne__()¶
 
- class pycuda.driver.Context¶
- An equivalent of a UNIX process on the compute device. Create instances of this class using - Device.make_context(). See also- pycuda.autoinit.- detach()¶
- Decrease the reference count on this context. If the reference count hits zero, the context is deleted. 
 - push()¶
- Make self the active context, pushing it on top of the context stack. CUDA 2.0 and above only. 
 - static pop()¶
- Remove any context from the top of the context stack, deactivating it. CUDA 2.0 and above only. 
 - static get_device()¶
- Return the device that the current context is working on. 
 - static synchronize()¶
- Wait for all activity in the current context to cease, then return. 
 - static set_limit(limit, value)¶
- See - limitfor possible values of limit.- CUDA 3.1 and above. - Added in version 0.94. 
 - static get_limit(limit)¶
- See - limitfor possible values of limit.- CUDA 3.1 and above. - Added in version 0.94. 
 - static set_cache_config(cc)¶
- See - func_cachefor possible values of cc.- CUDA 3.2 and above. - Added in version 0.94. 
 - static get_cache_config()¶
- Return a value from - func_cache.- CUDA 3.2 and above. - Added in version 0.94. 
 - See - shared_configfor possible values of sc.- CUDA 4.2 and above. - Added in version 2013.1. 
 - Return a value from - shared_config.- CUDA 4.2 and above. - Added in version 2013.1. 
 - get_api_version()¶
- Return an integer API version number. - CUDA 3.2 and above. - Added in version 0.94. 
 - enable_peer_access(peer, flags=0)¶
- CUDA 4.0 and above. - Added in version 2011.1. 
 - disable_peer_access(peer, flags=0)¶
- CUDA 4.0 and above. - Added in version 2011.1. 
 
Concurrency and Streams¶
- class pycuda.driver.Stream(flags=0)¶
- A handle for a queue of operations that will be carried out in order. - synchronize()¶
- Wait for all activity on this stream to cease, then return. 
 - is_done()¶
- Return True iff all queued operations have completed. 
 
- class pycuda.driver.Event(flags=0)¶
- An event is a temporal ‘marker’ in a - Streamthat allows taking the time between two events–such as the time required to execute a kernel. An event’s time is recorded when the- Streamhas finished all tasks enqueued before the- record()call.- See - event_flagsfor values for the flags parameter.- synchronize()¶
- Wait until the device execution stream reaches this event. Return self. 
 - query()¶
- Return True if the device execution stream has reached this event. 
 - time_since(event)¶
- Return the time in milliseconds that has passed between self and event. Use this method as end.time_since(start). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use - synchronize()to ensure that the event has been reached.
 - time_till(event)¶
- Return the time in milliseconds that has passed between event and self. Use this method as start.time_till(end). Note that this method will fail with an “invalid value” error if either of the events has not been reached yet. Use - synchronize()to ensure that the event has been reached.
 - ipc_handle()¶
- Return a - bytesobject representing an IPC handle to this event. Requires Python 2.6 and CUDA 4.1.
 - static from_ipc_handle(handle)¶
- Requires Python 2.6 and CUDA 4.1. 
 
Memory¶
Global Device Memory¶
- pycuda.driver.mem_get_info()¶
- Return a tuple (free, total) indicating the free and total memory in the current context, in bytes. 
- pycuda.driver.mem_alloc(bytes)¶
- Return a - DeviceAllocationobject representing a linear piece of device memory.
- pycuda.driver.to_device(buffer)¶
- Allocate enough device memory for buffer, which adheres to the Python Buffer Protocol interface. Copy the contents of buffer onto the device. Return a - DeviceAllocationobject representing the newly-allocated memory.
- pycuda.driver.from_device(devptr, shape, dtype, order='C')¶
- Make a new - numpy.ndarrayfrom the data at devptr on the GPU, interpreting them using shape, dtype and order.
- pycuda.driver.from_device_like(devptr, other_ary)¶
- Make a new - numpy.ndarrayfrom the data at devptr on the GPU, interpreting them as having the same shape, dtype and order as other_ary.
- pycuda.driver.mem_alloc_pitch(width, height, access_size)¶
- Allocates a linear piece of device memory at least width bytes wide and height rows high that an be accessed using a data type of size access_size in a coalesced fashion. - Returns a tuple (dev_alloc, actual_pitch) giving a - DeviceAllocationand the actual width of each row in bytes.
- class pycuda.driver.DeviceAllocation¶
- An object representing an allocation of linear device memory. Once this object is deleted, its associated device memory is freed. - Objects of this type can be cast to - intto obtain a linear index into this- Context’s memory.- free()¶
- Release the held device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior. 
 - as_buffer(size, offset=0)¶
- Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset. - Added in version 2014.1. 
 
- pycuda.driver.mem_get_ipc_handle(devptr)¶
- Return an opaque - bytesobject representing an IPC handle to the device pointer devptr.- Added in version 2011.2. - Requires CUDA 4.1 and Python 2.6. 
- class pycuda.driver.IPCMemoryHandle(ipc_handle, flags=ipc_mem_flags.LAZY_ENABLE_PEER_ACCESS)¶
- Added in version 2011.2. - Requires CUDA 4.1 and Python 2.6. - Objects of this type can be used in the same ways as a - DeviceAllocation.- close()¶
 
- class pycuda.driver.PointerHolderBase¶
- A base class that facilitates casting to pointers within PyCUDA. This allows the user to construct custom pointer types that may have been allocated by facilities outside of PyCUDA proper, but still need to be objects to facilitate RAII. The user needs to supply one method to facilitate the pointer cast: - get_pointer()¶
- Return the pointer encapsulated by self. 
 - as_buffer(size, offset=0)¶
- Return the pointer encapsulated by self as a Python buffer object, with the given size and, optionally, offset. - Added in version 2014.1. 
 - Note - If your subclass provides its own - __init__(), it must call the base class- __init__(). Failure to do so will lead to- boost.Python.ArgumentErrorbeing raised when it is used.
Pagelocked Host Memory¶
Pagelocked Allocation¶
- pycuda.driver.pagelocked_empty(shape, dtype, order='C', mem_flags=0)¶
- Allocate a pagelocked - numpy.ndarrayof shape, dtype and order.- mem_flags may be one of the values in - host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.- For the meaning of the other parameters, please refer to the - numpydocumentation.
- pycuda.driver.pagelocked_zeros(shape, dtype, order='C', mem_flags=0)¶
- Like - pagelocked_empty(), but initialized to zero.
- pycuda.driver.pagelocked_empty_like(array, mem_flags=0)¶
- pycuda.driver.pagelocked_zeros_like(array, mem_flags=0)¶
The numpy.ndarray instances returned by these functions
have an attribute base that references an object of type
- class pycuda.driver.PagelockedHostAllocation¶
- Inherits from - HostPointer.- An object representing an allocation of pagelocked host memory. Once this object is deleted, its associated device memory is freed. - free()¶
- Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated - numpyarray) is an error and will lead to undefined behavior.
 - get_flags()¶
- Return a bit field of values from - host_alloc_flags.- Only available on CUDA 3.2 and newer. - Added in version 0.94. 
 
- class pycuda.driver.HostAllocation¶
- A deprecated name for - PagelockedHostAllocation.
Aligned Host Memory¶
- pycuda.driver.aligned_empty(shape, dtype, order='C', alignment=4096)¶
- Allocate an - numpy.ndarrayof shape, dtype and order, with data aligned to alignment bytes.- For the meaning of the other parameters, please refer to the - numpydocumentation.- Added in version 2011.1. 
- pycuda.driver.aligned_zeros(shape, dtype, order='C', alignment=4096)¶
- Like - aligned_empty(), but with initialization to zero.- Added in version 2011.1. 
- pycuda.driver.aligned_empty_like(array, alignment=4096)¶
- Added in version 2011.1. 
- pycuda.driver.aligned_zeros_like(array, alignment=4096)¶
- Added in version 2011.1. 
The numpy.ndarray instances returned by these functions
have an attribute base that references an object of type
- class pycuda.driver.AlignedHostAllocation¶
- Inherits from - HostPointer.- An object representing an allocation of aligned host memory. 
Post-Allocation Pagelocking¶
- pycuda.driver.register_host_memory(ary, flags=0)¶
- Returns a - numpy.ndarraywhich shares memory with ary. This memory will be page-locked as long as the return value of this function is alive.- The returned array’s base attribute contains a - RegisteredHostMemoryinstance, whose base attribute in turn contains ary.- CUDA 4.0 and newer. - ary’s data address and size must be page-aligned. One way to achieve this is to use the functions in Aligned Host Memory. - Added in version 2011.1. 
- class pycuda.driver.RegisteredHostMemory¶
- Inherits from - HostPointer.- CUDA 4.0 and newer. - Added in version 2011.1. - unregister()¶
- Unregister the page-lock on the host memory held by this instance. Note that this does not free the memory, it only frees the page-lock. 
 - base¶
- Contains the Python object from which this instance was constructed. 
 
Managed Memory¶
CUDA 6.0 adds support for a “Unified Memory” model, which creates a managed virtual memory space that is visible to both CPUs and GPUs. The OS will migrate the physical pages associated with managed memory between the CPU and GPU as needed. This allows a numpy array on the host to be passed to kernels without first creating a DeviceAllocation and manually copying the host data to and from the device.
Note
Managed memory is only available for some combinations of CUDA device, operating system, and host compiler target architecture. Check the CUDA C Programming Guide and CUDA release notes for details.
Warning
This interface to managed memory should be considered experimental. It is provided as a preview, but for now the same interface stability guarantees as for the rest of PyCUDA do not apply.
Managed Memory Allocation¶
- pycuda.driver.managed_empty(shape, dtype, order='C', mem_flags=0)¶
- Allocate a managed - numpy.ndarrayof shape, dtype and order.- mem_flags may be one of the values in - mem_attach_flags.- For the meaning of the other parameters, please refer to the - numpydocumentation.- Only available on CUDA 6.0 and newer. - Added in version 2014.1. 
- pycuda.driver.managed_zeros(shape, dtype, order='C', mem_flags=0)¶
- Like - managed_empty(), but initialized to zero.- Only available on CUDA 6.0 and newer. - Added in version 2014.1. 
- pycuda.driver.managed_empty_like(array, mem_flags=0)¶
- Only available on CUDA 6.0 and newer. - Added in version 2014.1. 
- pycuda.driver.managed_zeros_like(array, mem_flags=0)¶
- Only available on CUDA 6.0 and newer. - Added in version 2014.1. 
The numpy.ndarray instances returned by these functions
have an attribute base that references an object of type
- class pycuda.driver.ManagedAllocation¶
- An object representing an allocation of managed host memory. Once this object is deleted, its associated CUDA managed memory is freed. - free()¶
- Release the held memory now instead of when this object becomes unreachable. Any further use of the object (or its associated - numpyarray) is an error and will lead to undefined behavior.
 - get_device_pointer()¶
- Return a device pointer that indicates the address at which this memory is mapped into the device’s address space. For managed memory, this is also the host pointer. 
 - attach(mem_flags, stream=None)¶
- Alter the visibility of the managed allocation to be one of the values in - mem_attach_flags. A managed array can be made visible to the host CPU and the entire CUDA context with mem_attach_flags.GLOBAL, or limited to the CPU only with mem_attach_flags.HOST. If mem_attach_flags.SINGLE is selected, then the array will only be visible to CPU and the provided instance of- Stream.
 
Managed Memory Usage¶
A managed numpy array is constructed and used on the host in a similar manner to a pagelocked array:
from pycuda.autoinit import context
import pycuda.driver as cuda
import numpy as np
a = cuda.managed_empty(shape=10, dtype=np.float32, mem_flags=cuda.mem_attach_flags.GLOBAL)
a[:] = np.linspace(0, 9, len(a)) # Fill array on host
It can be passed to a GPU kernel, and used again on the host without an explicit copy:
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void doublify(float *a)
{
    a[threadIdx.x] *= 2;
}
""")
doublify = mod.get_function("doublify")
doublify(a, grid=(1,1), block=(len(a),1,1))
context.synchronize() # Wait for kernel completion before host access
median = np.median(a) # Computed on host!
Warning
The CUDA Unified Memory model has very specific rules regarding concurrent access of managed memory allocations. Host access to any managed array is not allowed while the GPU is executing a kernel, regardless of whether the array is in use by the running kernel. Failure to follow the concurrency rules will generate a segmentation fault, causing the Python interpreter to terminate immediately.
Users of managed numpy arrays should read the “Unified Memory Programming” appendix of the CUDA C Programming Guide for further details on the concurrency restrictions.
If you are encountering interpreter terminations due to concurrency issues, the faulthandler <http://pypi.python.org/pypi/faulthandler> module may be helpful in locating the location in your Python program where the faulty access is occurring.
Arrays and Textures¶
- class pycuda.driver.ArrayDescriptor¶
- width¶
 - height¶
 - format¶
- A value of type - array_format.
 - num_channels¶
 
- class pycuda.driver.ArrayDescriptor3D¶
- width¶
 - height¶
 - depth¶
 - format¶
- A value of type - array_format. CUDA 2.0 and above only.
 - num_channels¶
 
- class pycuda.driver.Array(descriptor)¶
- A 2D or 3D memory block that can only be accessed via texture references. - descriptor can be of type - ArrayDescriptoror- ArrayDescriptor3D.- free()¶
- Release the array and its device memory now instead of when this object becomes unreachable. Any further use of the object is an error and will lead to undefined behavior. 
 - get_descriptor()¶
- Return a - ArrayDescriptorobject for this 2D array, like the one that was used to create it.
 - get_descriptor_3d()¶
- Return a - ArrayDescriptor3Dobject for this 3D array, like the one that was used to create it. CUDA 2.0 and above only.
 
- class pycuda.driver.SurfaceReference¶
- Note - Instances of this class can only be constructed through - Module.get_surfref().- CUDA 3.1 and above. - Added in version 0.94. - set_array(array, flags=0)¶
- Bind self to the - Arrayarray.- As long as array remains bound to this texture reference, it will not be freed–the texture reference keeps a reference to the array. 
 - get_array()¶
- Get back the - Arrayto which self is bound.- Note - This will be a different object than the one passed to - set_array(), but it will compare equal.
 
- class pycuda.driver.TextureReference¶
- A handle to a binding of either linear memory or an - Arrayto a texture unit.- set_array(array)¶
- Bind self to the - Arrayarray.- As long as array remains bound to this texture reference, it will not be freed–the texture reference keeps a reference to the array. 
 - set_address(devptr, bytes, allow_offset=False)¶
- Bind self to the a chunk of linear memory starting at the integer address devptr, encompassing a number of bytes. Due to alignment requirements, the effective texture bind address may be different from the requested one by an offset. This method returns this offset in bytes. If allow_offset is - False, a nonzero value of this offset will cause an exception to be raised.- Unlike for - Arrayobjects, no life support is provided for linear memory bound to texture references.
 - set_address_2d(devptr, descr, pitch)¶
- Bind self as a 2-dimensional texture to a chunk of global memory at devptr. The line-to-line offset in bytes is given by pitch. Width, height and format are given in the - ArrayDescriptordescr.- set_format()need not and should not be called in addition to this method.
 - set_format(fmt, num_components)¶
- Set the texture to have - array_formatfmt and to have num_components channels.
 - set_address_mode(dim, am)¶
- Set the address mode of dimension dim to am, which must be one of the - address_modevalues.
 - set_flags(flags)¶
- Set the flags to a combination of the TRSF_XXX values. 
 - get_array()¶
- Get back the - Arrayto which self is bound.- Note - This will be a different object than the one passed to - set_array(), but it will compare equal.
 - get_address_mode(dim)¶
 - get_filter_mode()¶
 - get_format()¶
- Return a tuple (fmt, num_components), where fmt is of type - array_format, and num_components is the number of channels in this texture.- (Version 2.0 and above only.) 
 - get_flags()¶
 
- pycuda.driver.TRSA_OVERRIDE_FORMAT¶
- pycuda.driver.TRSF_READ_AS_INTEGER¶
- pycuda.driver.TRSF_NORMALIZED_COORDINATES¶
- pycuda.driver.TR_DEFAULT¶
- pycuda.driver.matrix_to_array(matrix, order)¶
- Turn the two-dimensional - numpy.ndarrayobject matrix into an- Array. The order argument can be either “C” or “F”. If it is “C”, then tex2D(x,y) is going to fetch matrix[y,x], and vice versa for for “F”.
- pycuda.driver.np_to_array(nparray, order, allowSurfaceBind=False)¶
- Turn a - numpy.ndarraywith 2D or 3D structure, into an- Array. The order argument can be either “C” or “F”. If allowSurfaceBind is passed as True the returned- Arraycan be read and write with- SurfaceReferencein addition of reads by- TextureReference. Function automatically detect dtype and adjust channels to supported- array_format. Also add direct support for np.float64, np.complex64 and np.complex128 formats.- Example of use: - #include <pycuda-helpers.hpp> texture<fp_tex_double, 3, cudaReadModeElementType> my_tex; // complex128: fp_tex_cdouble // complex64 : fp_tex_cfloat // float64 : fp_tex_double surface<void, 3, cudaReadModeElementType> my_surf; // Surfaces in 2D needs 'cudaSurfaceType2DLayered' __global__ void f() { ... fp_tex3D(my_tex, i, j, k); fp_surf3Dwrite(myvar, my_surf, i, j, k, cudaBoundaryModeClamp); // fp extensions don't need width in bytes fp_surf3Dread(&myvar, my_surf, i, j, k, cudaBoundaryModeClamp); ... } - Added in version 2015.1. 
- pycuda.driver.gpuarray_to_array(gpuparray, order, allowSurfaceBind=False)¶
- Turn a - GPUArraywith 2D or 3D structure, into an- Array. Same structure and use of- np_to_array()- Added in version 2015.1. 
- pycuda.driver.make_multichannel_2d_array(matrix, order)¶
- Turn the three-dimensional - numpy.ndarrayobject matrix into an 2D- Arraywith multiple channels.- Depending on order, the matrix’s shape is interpreted as - height, width, num_channels for order == “C”, 
- num_channels, width, height for order == “F”. 
 - Note - This function assumes that matrix has been created with the memory order order. If that is not the case, the copied data will likely not be what you expect. 
Initializing Device Memory¶
- pycuda.driver.memset_d8(dest, data, count)¶
- pycuda.driver.memset_d16(dest, data, count)¶
- pycuda.driver.memset_d32(dest, data, count)¶
- Fill array with data. - Note - count is the number of elements, not bytes. 
- pycuda.driver.memset_d2d8(dest, pitch, data, width, height)¶
- pycuda.driver.memset_d2d16(dest, pitch, data, width, height)¶
- pycuda.driver.memset_d2d32(dest, pitch, data, width, height)¶
- Fill a two-dimensional array with data. 
- pycuda.driver.memset_d8_async(dest, data, count, stream=None)¶
- pycuda.driver.memset_d16_async(dest, data, count, stream=None)¶
- pycuda.driver.memset_d32_async(dest, data, count, stream=None)¶
- Fill array with data asynchronously, optionally serialized via stream. - Added in version 2015.1. 
- pycuda.driver.memset_d2d8_async(dest, pitch, data, width, height, stream=None)¶
- pycuda.driver.memset_d2d16_async(dest, pitch, data, width, height, stream=None)¶
- pycuda.driver.memset_d2d32_async(dest, pitch, data, width, height, stream=None)¶
- Fill a two-dimensional array with data asynchronously, optionally serialized via stream. - Added in version 2015.1. 
Unstructured Memory Transfers¶
- pycuda.driver.memcpy_htod(dest, src)¶
- Copy from the Python buffer src to the device pointer dest (an - intor a- DeviceAllocation). The size of the copy is determined by the size of the buffer.
- pycuda.driver.memcpy_htod_async(dest, src, stream=None)¶
- Copy from the Python buffer src to the device pointer dest (an - intor a- DeviceAllocation) asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.- src must be page-locked memory, see, e.g. - pagelocked_empty().- New in 0.93. 
- pycuda.driver.memcpy_dtoh(dest, src)¶
- Copy from the device pointer src (an - intor a- DeviceAllocation) to the Python buffer dest. The size of the copy is determined by the size of the buffer.
- pycuda.driver.memcpy_dtoh_async(dest, src, stream=None)¶
- Copy from the device pointer src (an - intor a- DeviceAllocation) to the Python buffer dest asynchronously, optionally serialized via stream. The size of the copy is determined by the size of the buffer.- dest must be page-locked memory, see, e.g. - pagelocked_empty().- New in 0.93. 
- pycuda.driver.memcpy_dtod(dest, src, size)¶
- pycuda.driver.memcpy_dtod_async(dest, src, size, stream=None)¶
- CUDA 3.0 and above. - Added in version 0.94. 
- pycuda.driver.memcpy_peer(dest, src, size, dest_context=None, src_context=None)¶
- pycuda.driver.memcpy_peer_async(dest, src, size, dest_context=None, src_context=None, stream=None)¶
- CUDA 4.0 and above. - Added in version 2011.1. 
- pycuda.driver.memcpy_dtoa(ary, index, src, len)¶
- pycuda.driver.memcpy_atod(dest, ary, index, len)¶
- pycuda.driver.memcpy_htoa(ary, index, src)¶
- pycuda.driver.memcpy_atoh(dest, ary, index)¶
- pycuda.driver.memcpy_atoa(dest, dest_index, src, src_index, len)¶
Structured Memory Transfers¶
- class pycuda.driver.Memcpy2D¶
- src_x_in_bytes¶
- X Offset of the origin of the copy. (initialized to 0) 
 - src_y¶
- Y offset of the origin of the copy. (initialized to 0) 
 - src_pitch¶
- Size of a row in bytes at the origin of the copy. 
 - set_src_host(buffer)¶
- Set the buffer, which must be a Python object adhering to the buffer interface, to be the origin of the copy. 
 - set_src_device(devptr)¶
- Set the device address devptr (an - intor a- DeviceAllocation) as the origin of the copy.
 - set_src_unified(buffer)¶
- Same as - set_src_host(), except that buffer may also correspond to device memory.- CUDA 4.0 and above. Requires unified addressing. - Added in version 2011.1. 
 - dst_x_in_bytes¶
- X offset of the destination of the copy. (initialized to 0) 
 - dst_y¶
- Y offset of the destination of the copy. (initialized to 0) 
 - dst_pitch¶
- Size of a row in bytes at the destination of the copy. 
 - set_dst_host(buffer)¶
- Set the buffer, which must be a Python object adhering to the buffer interface, to be the destination of the copy. 
 - set_dst_device(devptr)¶
- Set the device address devptr (an - intor a- DeviceAllocation) as the destination of the copy.
 - set_dst_unified(buffer)¶
- Same as - set_dst_host(), except that buffer may also correspond to device memory.- CUDA 4.0 and above. Requires unified addressing. - Added in version 2011.1. 
 - width_in_bytes¶
- Number of bytes to copy for each row in the transfer. 
 - height¶
- Number of rows to copy. 
 - __call__([aligned=True])¶
- Perform the specified memory copy, waiting for it to finish. If aligned is False, tolerate device-side misalignment for device-to-device copies that may lead to loss of copy bandwidth. 
 - __call__(stream)
- Perform the memory copy asynchronously, serialized via the - Streamstream. Any host memory involved in the transfer must be page-locked.
 
- class pycuda.driver.Memcpy3D¶
- Memcpy3Dhas the same members as- Memcpy2D, and additionally all of the following:- src_z¶
- Z offset of the origin of the copy. (initialized to 0) 
 - dst_z¶
- Z offset of the destination of the copy. (initialized to 0) 
 - depth¶
 - Memcpy3Dis supported on CUDA 2.0 and above only.
- class pycuda.driver.Memcpy3DPeer¶
- Memcpy3DPeerhas the same members as- Memcpy3D, and additionally all of the following:- set_src_context(ctx)¶
 - set_dst_context(ctx)¶
 - CUDA 4.0 and newer. - Added in version 2011.1. 
Code on the Device: Modules and Functions¶
- class pycuda.driver.Module¶
- Handle to a CUBIN module loaded onto the device. Can be created with - module_from_file()and- module_from_buffer().- get_function(name)¶
- Return the - Functionname in this module.- Warning - While you can obtain different handles to the same function using this method, these handles all share the same state that is set through the - set_XXXmethods of- Function. This means that you can’t obtain two different handles to the same function and- Function.prepare()them in two different ways.
 - get_global(name)¶
- Return a tuple (device_ptr, size_in_bytes) giving the device address and size of the global name. - The main use of this method is to find the address of pre-declared __constant__ arrays so they can be filled from the host before kernel invocation. 
 - get_texref(name)¶
- Return the - TextureReferencename from this module.
 - get_surfref(name)¶
- Return the - SurfaceReferencename from this module.- CUDA 3.1 and above. - Added in version 0.94. 
 
- pycuda.driver.module_from_buffer(buffer, options=[], message_handler=None)¶
- Create a - Moduleby loading a PTX or CUBIN module from buffer, which must support the Python buffer interface. (For example,- strand- numpy.ndarraydo.)- Parameters:
- options – A list of tuples ( - jit_option, value).
- message_handler – A callable that is called with a arguments of - (compile_success_bool, info_str, error_str)which allows the user to process error and warning messages from the PTX compiler.
 
 - Loading PTX modules as well as non-default values of options and message_handler are only allowed on CUDA 2.1 and newer. 
- class pycuda.driver.Function¶
- Handle to a __global__ function in a - Module. Create using- Module.get_function().- __call__(arg1, ..., argn, block=block_size[, grid=(1, 1)[, stream=None[, shared=0[, texrefs=[][, time_kernel=False]]]]])¶
- Launch self, with a thread block size of block. block must be a 3-tuple of integers. - arg1 through argn are the positional C arguments to the kernel. See - param_set()for details. See especially the warnings there.- grid specifies, as a tuple of up to three integer entries, the number of thread blocks to launch, as a multi-dimensional grid. stream, if specified, is a - Streaminstance serializing the copying of input arguments (if any), execution, and the copying of output arguments (again, if any). shared gives the number of bytes available to the kernel in extern __shared__ arrays. texrefs is a- listof- TextureReferenceinstances that the function will have access to.- The function returns either None or the number of seconds spent executing the kernel, depending on whether time_kernel is True. - This is a convenience interface that can be used instead of the - param_*`and- launch_*`methods below. For a faster (but mildly less convenient) way of invoking kernels, see- prepare()and- prepared_call().- arg1 through argn are allowed to be of the following types: - Subclasses of - numpy.number. These are sized number types such as- numpy.uint32or- numpy.float32.
- DeviceAllocationinstances, which will become a device pointer to the allocated memory.
- Instances of - ArgumentHandlersubclasses. These can be used to automatically transfer- numpyarrays onto and off of the device.
- Objects supporting the Python Buffer Protocol interface. These chunks of bytes will be copied into the parameter space verbatim. 
- GPUArrayinstances.
 - Warning - You cannot pass values of Python’s native - intor- floattypes to param_set. Since there is no unambiguous way to guess the size of these integers or floats, it complains with a- TypeError.- Note - This method has to guess the types of the arguments passed to it, which can make it somewhat slow. For a kernel that is invoked often, this can be inconvenient. For a faster (but mildly less convenient) way of invoking kernels, see - prepare()and- prepared_call().- Note - grid with more than two dimensions requires CUDA 4.0 or newer. 
 - param_set_texref(texref)¶
- Make the - TextureReferencetexref available to the function.
 - prepare(arg_types, shared=None, texrefs=[])¶
- Prepare the invocation of this function by - setting up the argument types as arg_types. arg_types is expected to be an iterable containing type characters understood by the - structmodule or- numpy.dtypeobjects.- (In addition, PyCUDA understands ‘F’ and ‘D’ for single- and double precision floating point numbers.) 
- Registering the texture references texrefs for use with this functions. The - TextureReferenceobjects in texrefs will be retained, and whatever these references are bound to at invocation time will be available through the corresponding texture references within the kernel.
 - Return self. 
 - prepared_call(grid, block, *args, shared_size=0)¶
- Invoke self using - launch_grid(), with args a grid size of grid, and a block size of block. Assumes that- prepare()was called on self. The texture references given to- prepare()are set up as parameters, as well.- Changed in version 2012.1: shared_size was added. 
 - prepared_timed_call(grid, block, *args, shared_size=0)¶
- Invoke self using - launch_grid(), with args, a grid size of grid, and a block size of block. Assumes that- prepare()was called on self. The texture references given to- prepare()are set up as parameters, as well.- Return a 0-ary callable that can be used to query the GPU time consumed by the call, in seconds. Once called, this callable will block until completion of the invocation. - Changed in version 2012.1: shared_size was added. 
 - prepared_async_call(grid, block, stream, *args, shared_size=0)¶
- Invoke self using - launch_grid_async(), with args, a grid size of grid, and a block size of block, serialized into the- pycuda.driver.Streamstream. If stream is None, do the same as- prepared_call(). Assumes that- prepare()was called on self. The texture references given to- prepare()are set up as parameters, as well.- Changed in version 2012.1: shared_size was added. 
 - get_attribute(attr)¶
- Return one of the attributes given by the - function_attributevalue attr.- All - function_attributevalues may also be directly read as (lower-case) attributes on the- Functionobject itself, e.g. func.num_regs.- CUDA 2.2 and newer. - Added in version 0.93. 
 - set_attribute(attr, value)¶
- Set one of the (settable) attributes given by the - function_attributevalue attr to value.- Added in version 2022.1. 
 - set_cache_config(fc)¶
- See - func_cachefor possible values of fc.- CUDA 3.0 (post-beta) and newer. - Added in version 0.94. 
 - See - shared_configfor possible values of sc.- CUDA 4.2 and newer. - Added in version 2013.1. 
 - local_size_bytes¶
- The number of bytes of local memory used by this function. - On CUDA 2.1 and below, this is only available if this function is part of a - pycuda.compiler.SourceModule. It replaces the now-deprecated attribute lmem.
 - The number of bytes of shared memory used by this function. - On CUDA 2.1 and below, this is only available if this function is part of a - pycuda.compiler.SourceModule. It replaces the now-deprecated attribute smem.
 - num_regs¶
- The number of 32-bit registers used by this function. - On CUDA 2.1 and below, this is only available if this function is part of a - pycuda.compiler.SourceModule. It replaces the now-deprecated attribute registers.
 - Set shared to be the number of bytes available to the kernel in extern __shared__ arrays. - Warning - Deprecated as of version 2011.1. 
 - set_block_shape(x, y, z)¶
- Set the thread block shape for this function. - Warning - Deprecated as of version 2011.1. 
 - param_set(arg1, ... argn)¶
- Set the thread block shape for this function. - Warning - Deprecated as of version 2011.1. 
 - param_set_size(bytes)¶
- Size the parameter space to bytes. - Warning - Deprecated as of version 2011.1. 
 - param_seti(offset, value)¶
- Set the integer at offset in the parameter space to value. - Warning - Deprecated as of version 2011.1. 
 - param_setf(offset, value)¶
- Set the float at offset in the parameter space to value. - Warning - Deprecated as of version 2011.1. 
 - launch()¶
- Launch a single thread block of self. - Warning - Deprecated as of version 2011.1. 
 - launch_grid(width, height)¶
- Launch a width*height grid of thread blocks of self. - Warning - Deprecated as of version 2011.1. 
 
- class pycuda.driver.ArgumentHandler(array)¶
- class pycuda.driver.In(array)¶
- Inherits from - ArgumentHandler. Indicates that Buffer Protocol array should be copied to the compute device before invoking the kernel.
- class pycuda.driver.Out(array)¶
- Inherits from - ArgumentHandler. Indicates that Buffer Protocol array should be copied off the compute device after invoking the kernel.
- class pycuda.driver.InOut(array)¶
- Inherits from - ArgumentHandler. Indicates that Buffer Protocol array should be copied both onto the compute device before invoking the kernel, and off it afterwards.
Profiler Control¶
CUDA 4.0 and newer.
- pycuda.driver.initialize_profiler(config_file, output_file, output_mode)¶
- output_mode is one of the attributes of - profiler_output_mode.- Added in version 2011.1. 
- pycuda.driver.start_profiler()¶
- Added in version 2011.1. 
- pycuda.driver.stop_profiler()¶
- Added in version 2011.1. 
Just-in-time Compilation¶
- pycuda.compiler.DEFAULT_NVCC_FLAGS¶
- Added in version 2011.1. - If no options are given in the calls below, the value of this list-type variable is used instead. This may be useful for injecting necessary flags into the compilation of automatically compiled kernels, such as those used by the module - pycuda.gpuarray.- The initial value of this variable is taken from the environment variable - PYCUDA_DEFAULT_NVCC_FLAGS.- If you modify this variable in your code, please be aware that this is a globally shared variable that may be modified by multiple packages. Please exercise caution in such modifications–you risk breaking other people’s code. 
- class pycuda.compiler.SourceModule(source, nvcc='nvcc', options=None, keep=False, no_extern_c=False, arch=None, code=None, cache_dir=None, include_dirs=[])¶
- Create a - pycuda.driver.Modulefrom the CUDA source code source. The Nvidia compiler nvcc is assumed to be on the- PATHif no path to it is specified, and is invoked with options to compile the code. If keep is True, the compiler output directory is kept, and a line indicating its location in the file system is printed for debugging purposes.- Unless no_extern_c is True, the given source code is wrapped in extern “C” { … } to prevent C++ name mangling. - arch and code specify the values to be passed for the - -archand- -codeoptions on the nvcc command line. If arch is None, it defaults to the current context’s device’s compute capability. If code is None, it will not be specified.- cache_dir gives the directory used for compiler caching. If None then cache_dir is taken to be - PYCUDA_CACHE_DIRif set or a sensible per-user default. If passed as False, caching is disabled.- If the environment variable - PYCUDA_DISABLE_CACHEis set to any value then caching is disabled. This preference overrides any value of cache_dir and can be used to disable caching globally.- This class exhibits the same public interface as - pycuda.driver.Module, but does not inherit from it.- Change note: - SourceModulewas moved from- pycuda.driverto- pycuda.compilerin version 0.93.
- compile(source, nvcc="nvcc", options=None, keep=False,
- no_extern_c=False, arch=None, code=None, cache_dir=None,
- include_dirs=[])
- Perform the same compilation as the corresponding - SourceModuleconstructor, but only return resulting cubin file as a string. In particular, do not upload the code to the GPU.