Reference: Other Functionality¶
Auxiliary Data Types¶
- loopy.typing.Expression¶
alias of
int|integer|float|complex|inexact|bool|bool|ExpressionNode|tuple[Expression, …]
- loopy.typing.ShapeType¶
alias of
tuple[int|integer|float|complex|inexact|ExpressionNode, …]
- loopy.typing.InameStr: TypeAlias = <class 'str'>¶
- loopy.typing.InameStrSet¶
A frozenset (an immutable, unordered collection of unique elements) of InameStr.
- loopy.typing.SymbolMangler: TypeAlias = 'Callable[[LoopKernel, str], tuple[LoopyType, str] | None]'
- class loopy.typing.SymbolMangler¶
See above.
- loopy.typing.PreambleGenerator: TypeAlias = 'Callable[\n [PreambleInfo],\n Iterator[tuple[str, str]]]'
- class loopy.typing.PreambleGenerator¶
See above.
Obtaining Kernel Performance Statistics¶
- class loopy.ToCountMap(count_map=None)[source]¶
A map from work descriptors like Op and MemAccess to any arithmetic type.
- filter_by(**kwargs)[source]¶
Remove items without specified key fields.
- Parameters:
kwargs – Keyword arguments matching fields in the keys of the ToCountMap, each given a list of allowable values for that key field.
- Returns:
A ToCountMap containing the subset of the items in the original ToCountMap that match the field values passed.
Example usage:
    # (first create loopy kernel and specify array data types)

    params = {"n": 512, "m": 256, "l": 128}
    mem_map = lp.get_mem_access_map(knl)
    filtered_map = mem_map.filter_by(direction=["load"],
                                     variable=["a", "g"])
    tot_loads_a_g = filtered_map.eval_and_sum(params)

    # (now use these counts to, e.g., predict performance)
- filter_by_func(func)[source]¶
Keep items that pass a test.
- Parameters:
func – A function that takes a map key as a parameter and returns a bool.
- Returns:
A ToCountMap containing the subset of the items in the original ToCountMap for which func(key) is true.
Example usage:
    # (first create loopy kernel and specify array data types)

    params = {"n": 512, "m": 256, "l": 128}
    mem_map = lp.get_mem_access_map(knl)

    def filter_func(key):
        # keep accesses whose stride along local id 0 is in (1, 4]
        return key.lid_strides[0] > 1 and key.lid_strides[0] <= 4

    filtered_map = mem_map.filter_by_func(filter_func)
    tot = filtered_map.eval_and_sum(params)

    # (now use these counts to, e.g., predict performance)
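The filtering semantics described above can be tried out without loopy. The following is a minimal sketch, not loopy's implementation: `Access` is a hypothetical stand-in for `MemAccess` with only three fields, plain integers replace the symbolic counts, and the two functions are illustrative free functions rather than `ToCountMap` methods.

```python
from collections import namedtuple

# Hypothetical stand-in for loopy.MemAccess with only three fields.
Access = namedtuple("Access", ["direction", "variable", "stride"])

# Plain integers stand in for the symbolic counts of a real ToCountMap.
counts = {
    Access("load", "a", 1): 100,
    Access("load", "g", 2): 40,
    Access("store", "a", 1): 60,
    Access("load", "b", 4): 25,
}


def filter_by(count_map, **allowed):
    """Keep items whose key fields all take one of the allowed values."""
    return {
        key: count
        for key, count in count_map.items()
        if all(getattr(key, field) in values
               for field, values in allowed.items())
    }


def filter_by_func(count_map, func):
    """Keep items for which func(key) is true."""
    return {key: count for key, count in count_map.items() if func(key)}


loads_a_g = filter_by(counts, direction=["load"], variable=["a", "g"])
strided = filter_by_func(counts, lambda key: 1 < key.stride <= 4)

print(sum(loads_a_g.values()))  # 140
print(sum(strided.values()))    # 65
```

Keyword filtering is convenient for exact matches on a few fields, while a predicate function covers range tests such as the stride check.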
- group_by(*args)[source]¶
Group map items together, distinguishing by only the key fields passed in args.
- Parameters:
args – Zero or more str fields of map keys.
- Returns:
A ToCountMap containing the same total counts grouped together by new keys that only contain the fields specified in the arguments passed.
Example usage:
    # (first create loopy kernel and specify array data types)

    params = {"n": 512, "m": 256, "l": 128}
    mem_map = get_mem_access_map(knl)
    grouped_map = mem_map.group_by("mtype", "dtype", "direction")

    f32_global_ld = grouped_map[MemAccess(mtype="global",
                                          dtype=np.float32,
                                          direction="load")
                               ].eval_with_dict(params)
    f32_global_st = grouped_map[MemAccess(mtype="global",
                                          dtype=np.float32,
                                          direction="store")
                               ].eval_with_dict(params)
    f32_local_ld = grouped_map[MemAccess(mtype="local",
                                         dtype=np.float32,
                                         direction="load")
                              ].eval_with_dict(params)
    f32_local_st = grouped_map[MemAccess(mtype="local",
                                         dtype=np.float32,
                                         direction="store")
                              ].eval_with_dict(params)

    op_map = get_op_map(knl)
    ops_dtype = op_map.group_by("dtype")

    f32ops = ops_dtype[Op(dtype=np.float32)].eval_with_dict(params)
    f64ops = ops_dtype[Op(dtype=np.float64)].eval_with_dict(params)
    i32ops = ops_dtype[Op(dtype=np.int32)].eval_with_dict(params)

    # (now use these counts to, e.g., predict performance)
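To make the grouping behavior concrete, here is a small loopy-free sketch. It assumes a hypothetical `Access` key type whose unspecified fields default to None, and uses integers in place of symbolic counts; it illustrates the idea of collapsing keys, not the library's code.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for loopy.MemAccess; unset fields default to None.
Access = namedtuple("Access", ["mtype", "direction", "variable"],
                    defaults=(None, None, None))

counts = {
    Access("global", "load", "a"): 100,
    Access("global", "load", "b"): 50,
    Access("global", "store", "c"): 30,
    Access("local", "load", "x"): 20,
}


def group_by(count_map, *fields):
    """Sum counts over keys that agree on the named fields."""
    grouped = defaultdict(int)
    for key, count in count_map.items():
        # The new key keeps only the requested fields.
        new_key = Access(**{field: getattr(key, field) for field in fields})
        grouped[new_key] += count
    return dict(grouped)


grouped = group_by(counts, "mtype", "direction")
global_loads = grouped[Access(mtype="global", direction="load")]
print(global_loads)  # 150
```

The total over all keys is unchanged by grouping; only the level of detail in the keys is reduced.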
- to_bytes()[source]¶
Convert counts to bytes using data type in map key.
- Returns:
A ToCountMap mapping each original key to an islpy.PwQPolynomial with counts in bytes rather than instances.
Example usage:
    # (first create loopy kernel and specify array data types)

    bytes_map = get_mem_access_map(knl).to_bytes()
    params = {"n": 512, "m": 256, "l": 128}

    s1_g_ld_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 1},
        direction=["load"]).eval_and_sum(params)
    s2_g_ld_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 2},
        direction=["load"]).eval_and_sum(params)
    s1_g_st_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 1},
        direction=["store"]).eval_and_sum(params)
    s2_g_st_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 2},
        direction=["store"]).eval_and_sum(params)

    # (now use these counts to, e.g., predict performance)
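The conversion from access counts to bytes amounts to scaling each count by the size of the data type in its key. A rough sketch under simplifying assumptions: `Access` and the `ITEMSIZE` table are hypothetical, data types are named by strings, and counts are plain integers.

```python
from collections import namedtuple

# Hypothetical stand-in for loopy.MemAccess; dtype is given by name.
Access = namedtuple("Access", ["dtype", "direction"])

# Assumed byte sizes for the data types used in this sketch.
ITEMSIZE = {"float32": 4, "float64": 8}

counts = {
    Access("float32", "load"): 1000,
    Access("float64", "load"): 500,
    Access("float32", "store"): 250,
}


def to_bytes(count_map):
    """Scale each count by the byte size of the data type in its key."""
    return {key: count * ITEMSIZE[key.dtype]
            for key, count in count_map.items()}


bytes_map = to_bytes(counts)
print(bytes_map[Access("float64", "load")])  # 4000
```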
- class loopy.ToCountPolynomialMap(space, count_map=None)[source]¶
Maps any type of key to an islpy.PwQPolynomial or a GuardedPwQPolynomial.
- eval_and_sum(params=None)[source]¶
Add all counts and evaluate with the provided parameter dict params.
- Returns:
An int containing the sum of all counts evaluated with the parameters provided.
Example usage:
    # (first create loopy kernel and specify array data types)

    params = {"n": 512, "m": 256, "l": 128}
    mem_map = lp.get_mem_access_map(knl)
    filtered_map = mem_map.filter_by(direction=["load"],
                                     variable=["a", "g"])
    tot_loads_a_g = filtered_map.eval_and_sum(params)

    # (now use these counts to, e.g., predict performance)
- class loopy.CountGranularity[source]¶
Strings specifying whether an operation should be counted once per work-item, sub-group, or work-group.
- class loopy.Op(dtype=None, name=None, count_granularity=None, kernel_name=None)[source]¶
A descriptor for a type of arithmetic operation.
- dtype¶
A loopy.types.LoopyType or numpy.dtype that specifies the data type operated on.
- name¶
A str that specifies the kind of arithmetic operation as add, mul, div, pow, shift, bw (bitwise), etc.
- count_granularity¶
A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think “thread”), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.
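As a rough illustration of what counting at different granularities means, the sketch below shows how one event executed by every work-item would be tallied per work-item, per sub-group, and per work-group. The string values, the function name, and the default group sizes are assumptions made for this example; it is not loopy's counting code.

```python
# Assumed string values; in loopy they are accessed through
# CountGranularity (e.g. CountGranularity.WORKITEM).
WORKITEM, SUBGROUP, WORKGROUP = "workitem", "subgroup", "workgroup"


def count_at_granularity(n_workitems, granularity,
                         subgroup_size=32, workgroup_size=256):
    """Number of times one event per work-item is counted."""
    group_size = {
        WORKITEM: 1,
        SUBGROUP: subgroup_size,
        WORKGROUP: workgroup_size,
    }[granularity]
    # Ceiling division: a partially filled group still counts once.
    return -(-n_workitems // group_size)


n = 4096
per_workitem = count_at_granularity(n, WORKITEM)
per_subgroup = count_at_granularity(n, SUBGROUP)
per_workgroup = count_at_granularity(n, WORKGROUP)
print(per_workitem, per_subgroup, per_workgroup)  # 4096 128 16
```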
- class loopy.MemAccess(mtype=None, dtype=None, lid_strides=None, gid_strides=None, direction=None, variable=None, *, variable_tags=None, count_granularity=None, kernel_name=None)[source]¶
A descriptor for a type of memory access.
- dtype¶
A loopy.types.LoopyType or numpy.dtype that specifies the data type accessed.
- lid_strides¶
A dict of {int: Expression or int} that specifies local strides for each local id in the memory access index. Local ids not found will not be present in lid_strides.keys(). Uniform access (i.e. work-items within a sub-group access the same item) is indicated by setting lid_strides[0]=0, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.
- gid_strides¶
A dict of {int: Expression or int} that specifies global strides for each global id in the memory access index. Global ids not found will not be present in gid_strides.keys().
- count_granularity¶
A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think “thread”), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.
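The stride dictionaries can be pictured as the change in the flat memory index when a single local or global id grows by one. The sketch below derives them by finite differences for a made-up access pattern; `flat_index` and `stride` are hypothetical helpers, and the approach only makes sense for indices that are affine in the ids.

```python
def flat_index(lid0, lid1, gid0):
    # Hypothetical access a[256*gid0 + 16*lid1 + lid0] into a flat array.
    return 256 * gid0 + 16 * lid1 + lid0


def stride(index_func, axis, base=(0, 0, 0)):
    """Change in the flat index when the id at position axis grows by one."""
    bumped = list(base)
    bumped[axis] += 1
    return index_func(*bumped) - index_func(*base)


# Positions 0 and 1 are local ids 0 and 1; position 2 is global id 0.
lid_strides = {0: stride(flat_index, 0), 1: stride(flat_index, 1)}
gid_strides = {0: stride(flat_index, 2)}

print(lid_strides)  # {0: 1, 1: 16}
print(gid_strides)  # {0: 256}
```

A stride of 1 along local id 0 means that neighboring work-items touch neighboring array entries, which is the pattern the `lid_strides={0: 1}` filters in the examples on this page select.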
- loopy.get_op_map(program, count_redundant_work=False, count_within_subscripts=True, subgroup_size=None, entrypoint=None, within: ToMatchConvertible = None)[source]¶
Count the number of operations in a loopy kernel.
- Parameters:
knl – A loopy.LoopKernel whose operations are to be counted.
count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
count_within_subscripts – A bool specifying whether to count operations inside array indices.
subgroup_size – (currently unused) An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt will be made to find the sub-group size using the device; if this fails, an error will be raised. If a str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.
- Returns:
A ToCountMap of {Op: islpy.PwQPolynomial}.
The Op specifies the characteristics of the arithmetic operation.
The islpy.PwQPolynomial holds the number of operations of the kind specified in the key (in terms of the loopy.LoopKernel parameter inames).
Example usage:
    # (first create loopy kernel and specify array data types)

    op_map = get_op_map(knl)
    params = {"n": 512, "m": 256, "l": 128}
    f32add = op_map[Op(np.float32,
                       "add",
                       count_granularity=CountGranularity.WORKITEM)
                   ].eval_with_dict(params)
    f32mul = op_map[Op(np.float32,
                       "mul",
                       count_granularity=CountGranularity.WORKITEM)
                   ].eval_with_dict(params)

    # (now use these counts to, e.g., predict performance)
- loopy.get_mem_access_map(program, count_redundant_work=False, subgroup_size=None, entrypoint=None, within: ToMatchConvertible = None)[source]¶
Count the number of memory accesses in a loopy kernel.
- Parameters:
knl – A loopy.LoopKernel whose memory accesses are to be counted.
count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
subgroup_size – An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt will be made to find the sub-group size using the device; if this fails, an error will be raised. If a str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.
- Returns:
A ToCountMap of {MemAccess: islpy.PwQPolynomial}.
The MemAccess specifies the characteristics of the memory access.
The islpy.PwQPolynomial holds the number of memory accesses with the characteristics specified in the key (in terms of the loopy.LoopKernel inames).
Example usage:
    # (first create loopy kernel and specify array data types)

    params = {"n": 512, "m": 256, "l": 128}
    mem_map = get_mem_access_map(knl)

    f32_s1_g_ld_a = mem_map[MemAccess(
                                mtype="global",
                                dtype=np.float32,
                                lid_strides={0: 1},
                                gid_strides={0: 256},
                                direction="load",
                                variable="a",
                                count_granularity=CountGranularity.WORKITEM)
                           ].eval_with_dict(params)
    f32_s1_g_st_a = mem_map[MemAccess(
                                mtype="global",
                                dtype=np.float32,
                                lid_strides={0: 1},
                                gid_strides={0: 256},
                                direction="store",
                                variable="a",
                                count_granularity=CountGranularity.WORKITEM)
                           ].eval_with_dict(params)
    f32_s1_l_ld_x = mem_map[MemAccess(
                                mtype="local",
                                dtype=np.float32,
                                lid_strides={0: 1},
                                gid_strides={0: 256},
                                direction="load",
                                variable="x",
                                count_granularity=CountGranularity.WORKITEM)
                           ].eval_with_dict(params)
    f32_s1_l_st_x = mem_map[MemAccess(
                                mtype="local",
                                dtype=np.float32,
                                lid_strides={0: 1},
                                gid_strides={0: 256},
                                direction="store",
                                variable="x",
                                count_granularity=CountGranularity.WORKITEM)
                           ].eval_with_dict(params)

    # (now use these counts to, e.g., predict performance)
- loopy.get_synchronization_map(program, subgroup_size=None, entrypoint=None)[source]¶
Count the number of synchronization events each work-item encounters in a loopy kernel.
- Parameters:
knl – A loopy.LoopKernel whose barriers are to be counted.
subgroup_size – (currently unused) An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt will be made to find the sub-group size using the device; if this fails, an error will be raised. If a str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
- Returns:
A dictionary mapping each type of synchronization event to an islpy.PwQPolynomial holding the number of events per work-item.
Possible keys include barrier_local, barrier_global (if supported by the target) and kernel_launch.
Example usage:
    # (first create loopy kernel and specify array data types)

    sync_map = get_synchronization_map(knl)
    params = {"n": 512, "m": 256, "l": 128}
    barrier_ct = sync_map["barrier_local"].eval_with_dict(params)

    # (now use this count to, e.g., predict performance)
- loopy.gather_access_footprints(program, ignore_uncountable=False, entrypoint=None)[source]¶
Return a dictionary mapping (var_name, direction) to islpy.Set instances capturing which indices of each array var_name are read/written (where direction is either read or write).
- Parameters:
ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices).
- loopy.gather_access_footprint_bytes(program, ignore_uncountable=False)[source]¶
Return a dictionary mapping (var_name, direction) to islpy.PwQPolynomial instances capturing the number of bytes read/written on array var_name (where direction is either read or write).
- Parameters:
ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices).
Controlling caching¶
- LOOPY_NO_CACHE¶
- CG_NO_CACHE¶
By default, loopy will cache (on disk) the result of various stages of code generation to speed up future code generation of the same kernel. Setting the environment variable LOOPY_NO_CACHE or CG_NO_CACHE to any string that pytools.strtobool() evaluates as True suppresses this caching.
- LOOPY_ABORT_ON_CACHE_MISS¶
If set to a string that pytools.strtobool() evaluates as True, loopy will raise an exception if a cache miss occurs. This can be useful for debugging cache-related issues. For example, it can be used to automatically test whether caching is successful for a particular code, by setting this variable to True and re-running the code.
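These variables can also be set from Python. The sketch below sets LOOPY_NO_CACHE before loopy would be imported, which is assumed here to be the safe ordering, and uses a hypothetical helper `looks_true` to illustrate the kind of strings a strtobool-style parser usually accepts. The exact set accepted by pytools.strtobool() is an assumption; consult pytools for the authoritative list.

```python
import os

# Set the variable before importing loopy so that it is already in
# place whenever loopy reads it (assumed to be the safe ordering).
os.environ["LOOPY_NO_CACHE"] = "1"

# Hypothetical helper following the common strtobool convention; the
# strings accepted by pytools.strtobool() itself are assumed to be similar.
_TRUE = {"y", "yes", "t", "true", "on", "1"}
_FALSE = {"n", "no", "f", "false", "off", "0"}


def looks_true(value):
    """Interpret a string as a boolean, rejecting unknown spellings."""
    lowered = value.strip().lower()
    if lowered in _TRUE:
        return True
    if lowered in _FALSE:
        return False
    raise ValueError(f"invalid truth value: {value!r}")


caching_disabled = looks_true(os.environ.get("LOOPY_NO_CACHE", "0"))
print(caching_disabled)  # True
```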
Running Kernels¶
Use TranslationUnit.executor to bind a translation unit
to execution resources, and then use ExecutorBase.__call__
to invoke the kernel.
- class loopy.ExecutorBase(t_unit: TranslationUnit, entrypoint: str)[source]¶
An object allowing the execution of an entrypoint of a TranslationUnit. Create these objects using loopy.TranslationUnit.executor().
Automatic Testing¶
- loopy.auto_test_vs_ref(ref_prog, ctx, test_prog=None, op_count=(), op_label=(), parameters=None, print_ref_code=False, print_code=True, warmup_rounds=2, dump_binary=False, fills_entire_output=None, do_check=True, check_result=None, max_test_kernel_count=1, quiet=False, blacklist_ref_vendors=(), ref_entrypoint=None, test_entrypoint=None)[source]¶
Compare results of ref_knl to the kernels generated by scheduling test_knl.
- Parameters:
check_result – a callable with numpy.ndarray arguments (result, reference_result) returning a tuple (bool, message) indicating correctness/acceptability of the result.
max_test_kernel_count – Stop testing after this many kernels generated from test_knl.
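A check_result callable of the shape described above might look like the following sketch. To stay free of dependencies it compares flat Python sequences, which stand in for the numpy.ndarray arguments that would be passed in practice; the tolerance values and message wording are arbitrary choices for this example.

```python
import math


def check_result(result, reference_result, rel_tol=1e-6):
    """Return (ok, message) comparing a result against a reference.

    Flat sequences of numbers stand in for numpy arrays here.
    """
    if len(result) != len(reference_result):
        return False, (f"length mismatch: {len(result)} "
                       f"vs {len(reference_result)}")
    for i, (got, want) in enumerate(zip(result, reference_result)):
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-12):
            return False, f"entry {i}: got {got}, expected {want}"
    return True, "results match"


ok, message = check_result([1.0, 2.0], [1.0, 2.0 + 1e-9])
bad, bad_message = check_result([1.0], [2.0])
print(ok, message)        # True results match
print(bad, bad_message)   # False entry 0: got 1.0, expected 2.0
```

Returning a message alongside the boolean lets a failing comparison report where the first mismatch occurred.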
Troubleshooting¶
Printing LoopKernel objects¶
If you’re confused about things loopy is referring to in an error message or
about the current state of the LoopKernel you are transforming, the
following always works:
print(kernel)
(And it yields a human-readable–albeit terse–representation of kernel.)
- loopy.get_dot_dependency_graph(kernel, callables_table, iname_cluster=True, use_insn_id=False)[source]¶
Return a string in the dot language depicting dependencies among kernel instructions.
- loopy.show_dependency_graph(*args, **kwargs)[source]¶
Show the dependency graph generated by get_dot_dependency_graph() in a browser. Accepts the same arguments as that function.
- loopy.t_unit_to_python(t_unit: TranslationUnit, *, var_name: str = 't_unit', return_preamble_and_body_separately: Literal[False] = False) -> str[source]¶
- loopy.t_unit_to_python(t_unit: TranslationUnit, *, var_name: str = 't_unit', return_preamble_and_body_separately: Literal[True]) -> tuple[str, str]
Returns a str of Python code that instantiates kernel.
- Parameters:
kernel – An instance of loopy.LoopKernel.
var_name – A str of the kernel variable name in the generated Python script.
return_preamble_and_body_separately – A bool. If True, returns (preamble, body), where preamble includes the import statements and body includes the kernel and translation unit instantiation code.
Note
The implementation is only partially complete, and an AssertionError is raised if the returned Python script does not exactly reproduce kernel. Contributions to fill in the gaps are welcome.