Reference: Other Functionality¶
Auxiliary Data Types¶
Obtaining Kernel Performance Statistics¶
- class loopy.ToCountMap(count_map=None)[source]¶
A map from work descriptors like Op and MemAccess to any arithmetic type.
- filter_by(**kwargs)[source]¶
Remove items without specified key fields.
- Parameters:
kwargs – Keyword arguments matching fields in the keys of the ToCountMap, each given a list of allowable values for that key field.
- Returns:
A ToCountMap containing the subset of the items in the original ToCountMap that match the field values passed.
Example usage:
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"], variable=["a", "g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
- filter_by_func(func)[source]¶
Keep items that pass a test.
- Parameters:
func – A function that takes a map key as a parameter and returns a bool.
- Returns:
A ToCountMap containing the subset of the items in the original ToCountMap for which func(key) is true.
Example usage:
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)

def filter_func(key):
    return key.lid_strides[0] > 1 and key.lid_strides[0] <= 4

filtered_map = mem_map.filter_by_func(filter_func)
tot = filtered_map.eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
- group_by(*args)[source]¶
Group map items together, distinguishing by only the key fields passed in args.
- Parameters:
args – Zero or more str fields of map keys.
- Returns:
A ToCountMap containing the same total counts grouped together by new keys that only contain the fields specified in the arguments passed.
Example usage:
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)
grouped_map = mem_map.group_by("mtype", "dtype", "direction")

f32_global_ld = grouped_map[
        MemAccess(mtype="global", dtype=np.float32, direction="load")
        ].eval_with_dict(params)
f32_global_st = grouped_map[
        MemAccess(mtype="global", dtype=np.float32, direction="store")
        ].eval_with_dict(params)
f32_local_ld = grouped_map[
        MemAccess(mtype="local", dtype=np.float32, direction="load")
        ].eval_with_dict(params)
f32_local_st = grouped_map[
        MemAccess(mtype="local", dtype=np.float32, direction="store")
        ].eval_with_dict(params)

op_map = get_op_map(knl)
ops_dtype = op_map.group_by("dtype")

f32ops = ops_dtype[Op(dtype=np.float32)].eval_with_dict(params)
f64ops = ops_dtype[Op(dtype=np.float64)].eval_with_dict(params)
i32ops = ops_dtype[Op(dtype=np.int32)].eval_with_dict(params)
# (now use these counts to, e.g., predict performance)
- to_bytes()[source]¶
Convert counts to bytes using data type in map key.
- Returns:
A ToCountMap mapping each original key to an islpy.PwQPolynomial with counts in bytes rather than instances.
Example usage:
# (first create loopy kernel and specify array data types)
bytes_map = get_mem_access_map(knl).to_bytes()
params = {"n": 512, "m": 256, "l": 128}

s1_g_ld_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 1},
        direction=["load"]).eval_and_sum(params)
s2_g_ld_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 2},
        direction=["load"]).eval_and_sum(params)
s1_g_st_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 1},
        direction=["store"]).eval_and_sum(params)
s2_g_st_bytes = bytes_map.filter_by(
        mtype=["global"], lid_strides={0: 2},
        direction=["store"]).eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
- class loopy.ToCountPolynomialMap(space, count_map=None)[source]¶
Maps any type of key to an islpy.PwQPolynomial or a GuardedPwQPolynomial.
- eval_and_sum(params=None)[source]¶
Add all counts and evaluate with provided parameter dict params
- Returns:
An int containing the sum of all counts evaluated with the parameters provided.
Example usage:
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"], variable=["a", "g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
- class loopy.CountGranularity[source]¶
Strings specifying whether an operation should be counted once per work-item, sub-group, or work-group.
- class loopy.Op(dtype=None, name=None, count_granularity=None, kernel_name=None)[source]¶
A descriptor for a type of arithmetic operation.
- dtype¶
A loopy.types.LoopyType or numpy.dtype that specifies the data type operated on.
- name¶
A str that specifies the kind of arithmetic operation as add, mul, div, pow, shift, bw (bitwise), etc.
- count_granularity¶
A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think "thread"), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.
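For illustration, a short sketch (the kernel knl, the np.float32 data type, and the parameter values are assumptions) of how these Op fields can be used to slice an operation count map obtained from get_op_map():
import numpy as np
import loopy as lp

# (first create loopy kernel "knl" and specify array data types)
op_map = lp.get_op_map(knl)
params = {"n": 512, "m": 256, "l": 128}

# Keep only 32-bit float adds and multiplies counted at work-item
# granularity, then evaluate and sum the remaining counts.
f32_muladd = op_map.filter_by(
        dtype=[np.float32], name=["add", "mul"],
        count_granularity=[lp.CountGranularity.WORKITEM],
        ).eval_and_sum(params)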
- class loopy.MemAccess(mtype=None, dtype=None, lid_strides=None, gid_strides=None, direction=None, variable=None, *, variable_tags=None, count_granularity=None, kernel_name=None)[source]¶
A descriptor for a type of memory access.
- dtype¶
A loopy.types.LoopyType or numpy.dtype that specifies the data type accessed.
- lid_strides¶
A dict of {int: pymbolic.primitives.Expression or int} that specifies local strides for each local id in the memory access index. Local ids not found will not be present in lid_strides.keys(). Uniform access (i.e. work-items within a sub-group access the same item) is indicated by setting lid_strides[0]=0, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.
- gid_strides¶
A dict of {int: pymbolic.primitives.Expression or int} that specifies global strides for each global id in the memory access index. Global ids not found will not be present in gid_strides.keys().
- count_granularity¶
A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think "thread"), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.
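As a hedged illustration of the lid_strides semantics above (the kernel knl, the sub-group size of 32, and the parameter values are assumptions), unit-stride and uniform global loads can be separated like this:
import loopy as lp

# (first create loopy kernel "knl" and specify array data types)
mem_map = lp.get_mem_access_map(knl, subgroup_size=32)  # 32 is an assumed sub-group size
params = {"n": 512, "m": 256, "l": 128}

# Work-items access consecutive items: lid_strides[0] == 1.
unit_stride_loads = mem_map.filter_by(
        mtype=["global"], direction=["load"], lid_strides={0: 1},
        ).eval_and_sum(params)

# Uniform access (all work-items in a sub-group read the same item):
# lid_strides[0] == 0.
uniform_loads = mem_map.filter_by(
        mtype=["global"], direction=["load"], lid_strides={0: 0},
        ).eval_and_sum(params)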
- loopy.get_op_map(program, count_redundant_work=False, count_within_subscripts=True, subgroup_size=None, entrypoint=None, within=None)[source]¶
Count the number of operations in a loopy kernel.
- Parameters:
knl – A loopy.LoopKernel whose operations are to be counted.
count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
count_within_subscripts – A bool specifying whether to count operations inside array indices.
subgroup_size – (currently unused) An int, the str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt to find the sub-group size using the device will be made; if this fails, an error will be raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.
- Returns:
A ToCountMap of {Op: islpy.PwQPolynomial}. The Op specifies the characteristics of the arithmetic operation. The islpy.PwQPolynomial holds the number of operations of the kind specified in the key (in terms of the loopy.LoopKernel parameter inames).
Example usage:
# (first create loopy kernel and specify array data types)
op_map = get_op_map(knl)
params = {"n": 512, "m": 256, "l": 128}

f32add = op_map[Op(np.float32, "add",
                   count_granularity=CountGranularity.WORKITEM)
                ].eval_with_dict(params)
f32mul = op_map[Op(np.float32, "mul",
                   count_granularity=CountGranularity.WORKITEM)
                ].eval_with_dict(params)
# (now use these counts to, e.g., predict performance)
- loopy.get_mem_access_map(program, count_redundant_work=False, subgroup_size=None, entrypoint=None, within=None)[source]¶
Count the number of memory accesses in a loopy kernel.
- Parameters:
knl – A loopy.LoopKernel whose memory accesses are to be counted.
count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
subgroup_size – An int, the str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt to find the sub-group size using the device will be made; if this fails, an error will be raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.
- Returns:
A ToCountMap of {MemAccess: islpy.PwQPolynomial}. The MemAccess specifies the characteristics of the memory access. The islpy.PwQPolynomial holds the number of memory accesses with the characteristics specified in the key (in terms of the loopy.LoopKernel inames).
Example usage:
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)

f32_s1_g_ld_a = mem_map[MemAccess(
        mtype="global", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="load", variable="a",
        count_granularity=CountGranularity.WORKITEM)
        ].eval_with_dict(params)
f32_s1_g_st_a = mem_map[MemAccess(
        mtype="global", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="store", variable="a",
        count_granularity=CountGranularity.WORKITEM)
        ].eval_with_dict(params)
f32_s1_l_ld_x = mem_map[MemAccess(
        mtype="local", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="load", variable="x",
        count_granularity=CountGranularity.WORKITEM)
        ].eval_with_dict(params)
f32_s1_l_st_x = mem_map[MemAccess(
        mtype="local", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="store", variable="x",
        count_granularity=CountGranularity.WORKITEM)
        ].eval_with_dict(params)
# (now use these counts to, e.g., predict performance)
- loopy.get_synchronization_map(program, subgroup_size=None, entrypoint=None)[source]¶
Count the number of synchronization events each work-item encounters in a loopy kernel.
- Parameters:
knl – A loopy.LoopKernel whose barriers are to be counted.
subgroup_size – (currently unused) An int, the str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt to find the sub-group size using the device will be made; if this fails, an error will be raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
- Returns:
A dictionary mapping each type of synchronization event to an islpy.PwQPolynomial holding the number of events per work-item. Possible keys include barrier_local, barrier_global (if supported by the target) and kernel_launch.
Example usage:
# (first create loopy kernel and specify array data types)
sync_map = get_synchronization_map(knl)
params = {"n": 512, "m": 256, "l": 128}
barrier_ct = sync_map["barrier_local"].eval_with_dict(params)
# (now use this count to, e.g., predict performance)
- loopy.gather_access_footprints(program, ignore_uncountable=False, entrypoint=None)[source]¶
Return a dictionary mapping (var_name, direction) to islpy.Set instances capturing which indices of the array var_name are read/written (where direction is either read or write).
- Parameters:
ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices)
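A brief sketch (the kernel knl and the array name "a" are assumptions); the returned sets can be inspected or counted further:
import loopy as lp

# (first create loopy kernel "knl" and specify array data types)
footprints = lp.gather_access_footprints(knl)

# Keys are (variable name, direction) tuples, values are islpy.Set instances.
a_read_set = footprints["a", "read"]
print(a_read_set)  # the set of indices of "a" that are read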
- loopy.gather_access_footprint_bytes(program, ignore_uncountable=False)[source]¶
Return a dictionary mapping (var_name, direction) to islpy.PwQPolynomial instances capturing the number of bytes read/written (where direction is either read or write) on the array var_name.
- Parameters:
ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices)
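A brief sketch along the same lines (knl, the array name "a", and the parameter values are assumptions); the returned polynomials can be evaluated like the counts above:
import loopy as lp

# (first create loopy kernel "knl" and specify array data types)
byte_footprints = lp.gather_access_footprint_bytes(knl)
params = {"n": 512, "m": 256, "l": 128}

# Number of bytes of "a" read, as a function of the kernel parameters.
a_read_bytes = byte_footprints["a", "read"].eval_with_dict(params)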
Controlling caching¶
- LOOPY_NO_CACHE¶
- CG_NO_CACHE¶
By default, loopy will cache (on disk) the result of various stages of code generation to speed up future code generation of the same kernel. By setting the environment variables LOOPY_NO_CACHE or CG_NO_CACHE to any string that pytools.strtobool() evaluates as True, this caching is suppressed.
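For example, a sketch of disabling the cache from within Python; exporting the variables in the shell (e.g. export LOOPY_NO_CACHE=1) works just as well. Depending on the loopy version these variables may be read at import time, so set them before importing loopy:
import os

# Disable loopy's on-disk caching; any value that pytools.strtobool()
# treats as true (e.g. "1", "true", "yes") works.
os.environ["LOOPY_NO_CACHE"] = "1"
os.environ["CG_NO_CACHE"] = "1"

import loopy as lp  # import after setting the variables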
Running Kernels¶
Use TranslationUnit.executor
to bind a translation unit
to execution resources, and then use ExecutorBase.__call__
to invoke the kernel.
- class loopy.ExecutorBase(t_unit: TranslationUnit, entrypoint: str)[source]¶
An object allowing the execution of an entrypoint of a TranslationUnit. Create these objects using loopy.TranslationUnit.executor().
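A hedged sketch of this workflow on the PyOpenCL target (the translation unit t_unit, its argument a, and the array size are assumptions for illustration):
import numpy as np
import pyopencl as cl
import loopy as lp

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(512).astype(np.float32)

# Bind the translation unit to the OpenCL context once...
knl_ex = t_unit.executor(ctx)

# ...then invoke it (ExecutorBase.__call__) as often as needed.
evt, (out,) = knl_ex(queue, a=a)  # "out" is the kernel's (assumed) output argument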
Automatic Testing¶
- loopy.auto_test_vs_ref(ref_prog, ctx, test_prog=None, op_count=(), op_label=(), parameters=None, print_ref_code=False, print_code=True, warmup_rounds=2, dump_binary=False, fills_entire_output=None, do_check=True, check_result=None, max_test_kernel_count=1, quiet=False, blacklist_ref_vendors=(), ref_entrypoint=None, test_entrypoint=None)[source]¶
Compare results of ref_knl to the kernels generated by scheduling test_knl.
- Parameters:
check_result – A callable with numpy.ndarray arguments (result, reference_result) returning a tuple (bool, message) indicating correctness/acceptability of the result.
max_test_kernel_count – Stop testing after this many test_knl.
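A minimal sketch (ref_knl, knl, ctx, and the parameter values are assumptions; ref_knl would typically be the untransformed kernel and knl a transformed version of it):
import loopy as lp

lp.auto_test_vs_ref(
        ref_knl, ctx, knl,
        parameters={"n": 512, "m": 256, "l": 128})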
Troubleshooting¶
Printing LoopKernel objects¶
If you’re confused about things loopy is referring to in an error message or
about the current state of the LoopKernel
you are transforming, the
following always works:
print(kernel)
(And it yields a human-readable–albeit terse–representation of kernel.)
- loopy.get_dot_dependency_graph(kernel, callables_table, iname_cluster=True, use_insn_id=False)[source]¶
Return a string in the dot language depicting dependencies among kernel instructions.
- loopy.show_dependency_graph(*args, **kwargs)[source]¶
Show the dependency graph generated by
get_dot_dependency_graph()
in a browser. Accepts the same arguments as that function.
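A hedged sketch (t_unit is an assumed translation unit; the attributes default_entrypoint and callables_table are used to supply the kernel and callables table arguments). The dot source can be written to a file and rendered with Graphviz, or the graph can be opened directly in a browser:
import loopy as lp

knl = t_unit.default_entrypoint  # the kernel whose instructions we inspect
dot_src = lp.get_dot_dependency_graph(knl, t_unit.callables_table)

with open("deps.dot", "w") as outf:
    outf.write(dot_src)  # render later with: dot -Tsvg deps.dot

# Alternatively, open the rendered graph in a browser:
# lp.show_dependency_graph(knl, t_unit.callables_table)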
- loopy.t_unit_to_python(t_unit, var_name='t_unit', return_preamble_and_body_separately=False)[source]¶
Returns a str of Python code that instantiates the kernel.
- Parameters:
kernel – An instance of loopy.LoopKernel.
var_name – A str of the kernel variable name in the generated Python script.
return_preamble_and_body_separately – A bool. If True, returns (preamble, body), where preamble includes the import statements and body includes the kernel/translation unit instantiation code.
Note
The implementation is partially complete and an AssertionError is raised if the returned Python script does not exactly reproduce the kernel. Contributions are welcome to fill in the missing voids.
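A brief sketch (t_unit is an assumed translation unit):
import loopy as lp

# Generate Python source that, when executed, rebuilds t_unit under the
# variable name "t_unit".
code = lp.t_unit_to_python(t_unit, var_name="t_unit")
print(code)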