Reference: Other Functionality#
Obtaining Kernel Performance Statistics#
- class loopy.ToCountMap(count_map=None)[source]#
A map from work descriptors like Op and MemAccess to any arithmetic type.
- filter_by(**kwargs)[source]#
Remove items without specified key fields.
- Parameters:
kwargs – Keyword arguments matching fields in the keys of the ToCountMap, each given a list of allowable values for that key field.
- Returns:
A ToCountMap containing the subset of the items in the original ToCountMap that match the field values passed.
Example usage:
```python
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"], variable=["a", "g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
```
- filter_by_func(func)[source]#
Keep items that pass a test.
- Parameters:
func – A function that takes a map key as a parameter and returns a bool.
- Returns:
A ToCountMap containing the subset of the items in the original ToCountMap for which func(key) is true.
Example usage:
```python
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)

def filter_func(key):
    return key.lid_strides[0] > 1 and key.lid_strides[0] <= 4

filtered_map = mem_map.filter_by_func(filter_func)
tot = filtered_map.eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
```
- group_by(*args)[source]#
Group map items together, distinguishing by only the key fields passed in args.
- Parameters:
args – Zero or more str fields of map keys.
- Returns:
A ToCountMap containing the same total counts grouped together by new keys that only contain the fields specified in the arguments passed.
Example usage:
```python
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)
grouped_map = mem_map.group_by("mtype", "dtype", "direction")

f32_global_ld = grouped_map[MemAccess(mtype="global", dtype=np.float32,
                                      direction="load")
                            ].eval_with_dict(params)
f32_global_st = grouped_map[MemAccess(mtype="global", dtype=np.float32,
                                      direction="store")
                            ].eval_with_dict(params)
f32_local_ld = grouped_map[MemAccess(mtype="local", dtype=np.float32,
                                     direction="load")
                           ].eval_with_dict(params)
f32_local_st = grouped_map[MemAccess(mtype="local", dtype=np.float32,
                                     direction="store")
                           ].eval_with_dict(params)

op_map = get_op_map(knl)
ops_dtype = op_map.group_by("dtype")

f32ops = ops_dtype[Op(dtype=np.float32)].eval_with_dict(params)
f64ops = ops_dtype[Op(dtype=np.float64)].eval_with_dict(params)
i32ops = ops_dtype[Op(dtype=np.int32)].eval_with_dict(params)
# (now use these counts to, e.g., predict performance)
```
- to_bytes()[source]#
Convert counts to bytes using data type in map key.
- Returns:
A ToCountMap mapping each original key to an islpy.PwQPolynomial with counts in bytes rather than instances.
Example usage:
```python
# (first create loopy kernel and specify array data types)
bytes_map = get_mem_access_map(knl).to_bytes()
params = {"n": 512, "m": 256, "l": 128}

s1_g_ld_byt = bytes_map.filter_by(
    mtype=["global"], lid_strides={0: 1},
    direction=["load"]).eval_and_sum(params)
s2_g_ld_byt = bytes_map.filter_by(
    mtype=["global"], lid_strides={0: 2},
    direction=["load"]).eval_and_sum(params)
s1_g_st_byt = bytes_map.filter_by(
    mtype=["global"], lid_strides={0: 1},
    direction=["store"]).eval_and_sum(params)
s2_g_st_byt = bytes_map.filter_by(
    mtype=["global"], lid_strides={0: 2},
    direction=["store"]).eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
```
- class loopy.ToCountPolynomialMap(space, count_map=None)[source]#
Maps any type of key to an islpy.PwQPolynomial or a GuardedPwQPolynomial.
- eval_and_sum(params=None)[source]#
Add all counts and evaluate with provided parameter dict params.
- Returns:
An int containing the sum of all counts evaluated with the parameters provided.
Example usage:
```python
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"], variable=["a", "g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)
# (now use these counts to, e.g., predict performance)
```
- class loopy.CountGranularity[source]#
Strings specifying whether an operation should be counted once per work-item, sub-group, or work-group.
- class loopy.Op(dtype=None, name=None, count_granularity=None, kernel_name=None)[source]#
A descriptor for a type of arithmetic operation.
- dtype#
A loopy.types.LoopyType or numpy.dtype that specifies the data type operated on.
- name#
A str that specifies the kind of arithmetic operation, such as add, mul, div, pow, shift, or bw (bitwise).
- count_granularity#
A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The allowed granularities can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think "thread"), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.
- class loopy.MemAccess(mtype=None, dtype=None, lid_strides=None, gid_strides=None, direction=None, variable=None, *, variable_tags=None, count_granularity=None, kernel_name=None)[source]#
A descriptor for a type of memory access.
- dtype#
A loopy.types.LoopyType or numpy.dtype that specifies the data type accessed.
- lid_strides#
A dict of {int: pymbolic.primitives.Expression or int} that specifies local strides for each local id in the memory access index. Local ids not found will not be present in lid_strides.keys(). Uniform access (i.e., work-items within a sub-group access the same item) is indicated by setting lid_strides[0]=0, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.
- gid_strides#
A dict of {int: pymbolic.primitives.Expression or int} that specifies global strides for each global id in the memory access index. Global ids not found will not be present in gid_strides.keys().
- count_granularity#
A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The allowed granularities can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think "thread"), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.
- loopy.get_op_map(program, count_redundant_work=False, count_within_subscripts=True, subgroup_size=None, entrypoint=None, within=None)[source]#
Count the number of operations in a loopy kernel.
- Parameters:
program – A loopy.LoopKernel whose operations are to be counted.
count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
count_within_subscripts – A bool specifying whether to count operations inside array indices.
subgroup_size – (currently unused) An int, the str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt will be made to find the sub-group size using the device; if this fails, an error will be raised. If the str "guess" is passed as the subgroup_size, an attempt will be made to find the sub-group size using the device and, if unsuccessful, a wild guess will be made.
within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.
- Returns:
A ToCountMap of {Op: islpy.PwQPolynomial}.
The Op specifies the characteristics of the arithmetic operation.
The islpy.PwQPolynomial holds the number of operations of the kind specified in the key (in terms of the loopy.LoopKernel parameters).
Example usage:
```python
# (first create loopy kernel and specify array data types)
op_map = get_op_map(knl)
params = {"n": 512, "m": 256, "l": 128}

f32add = op_map[Op(np.float32, "add",
                   count_granularity=CountGranularity.WORKITEM)
                ].eval_with_dict(params)
f32mul = op_map[Op(np.float32, "mul",
                   count_granularity=CountGranularity.WORKITEM)
                ].eval_with_dict(params)
# (now use these counts to, e.g., predict performance)
```
- loopy.get_mem_access_map(program, count_redundant_work=False, subgroup_size=None, entrypoint=None, within=None)[source]#
Count the number of memory accesses in a loopy kernel.
- Parameters:
program – A loopy.LoopKernel whose memory accesses are to be counted.
count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
subgroup_size – An int, the str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt will be made to find the sub-group size using the device; if this fails, an error will be raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.
- Returns:
A ToCountMap of {MemAccess: islpy.PwQPolynomial}.
The MemAccess specifies the characteristics of the memory access.
The islpy.PwQPolynomial holds the number of memory accesses with the characteristics specified in the key (in terms of the loopy.LoopKernel inames).
Example usage:
```python
# (first create loopy kernel and specify array data types)
params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)

f32_s1_g_ld_a = mem_map[MemAccess(
        mtype="global", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="load", variable="a",
        count_granularity=CountGranularity.WORKITEM)
    ].eval_with_dict(params)
f32_s1_g_st_a = mem_map[MemAccess(
        mtype="global", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="store", variable="a",
        count_granularity=CountGranularity.WORKITEM)
    ].eval_with_dict(params)
f32_s1_l_ld_x = mem_map[MemAccess(
        mtype="local", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="load", variable="x",
        count_granularity=CountGranularity.WORKITEM)
    ].eval_with_dict(params)
f32_s1_l_st_x = mem_map[MemAccess(
        mtype="local", dtype=np.float32,
        lid_strides={0: 1}, gid_strides={0: 256},
        direction="store", variable="x",
        count_granularity=CountGranularity.WORKITEM)
    ].eval_with_dict(params)
# (now use these counts to, e.g., predict performance)
```
- loopy.get_synchronization_map(program, subgroup_size=None, entrypoint=None)[source]#
Count the number of synchronization events each work-item encounters in a loopy kernel.
- Parameters:
program – A loopy.LoopKernel whose barriers are to be counted.
subgroup_size – (currently unused) An int, the str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt will be made to find the sub-group size using the device; if this fails, an error will be raised. If the str "guess" is passed as the subgroup_size, an attempt will be made to find the sub-group size using the device and, if unsuccessful, a wild guess will be made.
- Returns:
A dictionary mapping each type of synchronization event to an islpy.PwQPolynomial holding the number of events per work-item.
Possible keys include barrier_local, barrier_global (if supported by the target) and kernel_launch.
Example usage:
```python
# (first create loopy kernel and specify array data types)
sync_map = get_synchronization_map(knl)
params = {"n": 512, "m": 256, "l": 128}
barrier_ct = sync_map["barrier_local"].eval_with_dict(params)
# (now use this count to, e.g., predict performance)
```
- loopy.gather_access_footprints(program, ignore_uncountable=False, entrypoint=None)[source]#
Return a dictionary mapping (var_name, direction) to islpy.Set instances capturing which indices of the array var_name are read/written (where direction is either read or write).
- Parameters:
ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices)
- loopy.gather_access_footprint_bytes(program, ignore_uncountable=False)[source]#
Return a dictionary mapping (var_name, direction) to islpy.PwQPolynomial instances capturing the number of bytes read/written (where direction is either read or write) on array var_name.
- Parameters:
ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices)
Controlling caching#
Running Kernels#
Use TranslationUnit.executor
to bind a translation unit
to execution resources, and then use ExecutorBase.__call__
to invoke the kernel.
- class loopy.ExecutorBase(t_unit: TranslationUnit, entrypoint: str)[source]#
An object allowing the execution of an entrypoint of a TranslationUnit. Create these objects using loopy.TranslationUnit.executor().
Automatic Testing#
- loopy.auto_test_vs_ref(ref_prog, ctx, test_prog=None, op_count=(), op_label=(), parameters=None, print_ref_code=False, print_code=True, warmup_rounds=2, dump_binary=False, fills_entire_output=None, do_check=True, check_result=None, max_test_kernel_count=1, quiet=False, blacklist_ref_vendors=(), ref_entrypoint=None, test_entrypoint=None)[source]#
Compare results of ref_prog to the kernels generated by scheduling test_prog.
- Parameters:
check_result – a callable with numpy.ndarray arguments (result, reference_result) returning a tuple (bool, message) indicating correctness/acceptability of the result
max_test_kernel_count – Stop testing after this many test kernels
Troubleshooting#
Printing LoopKernel
objects#
If you’re confused about things loopy is referring to in an error message or
about the current state of the LoopKernel
you are transforming, the
following always works:
```python
print(kernel)
```
(And it yields a human-readable, albeit terse, representation of kernel.)
- loopy.get_dot_dependency_graph(kernel, callables_table, iname_cluster=True, use_insn_id=False)[source]#
Return a string in the dot language depicting dependencies among kernel instructions.
- loopy.show_dependency_graph(*args, **kwargs)[source]#
Show the dependency graph generated by
get_dot_dependency_graph()
in a browser. Accepts the same arguments as that function.
- loopy.t_unit_to_python(t_unit, var_name='t_unit', return_preamble_and_body_separately=False)[source]#
Return a str of Python code that instantiates the kernel.
- Parameters:
t_unit – An instance of loopy.TranslationUnit
var_name – A str of the kernel variable name in the generated python script.
return_preamble_and_body_separately – A bool. If True, returns (preamble, body), where preamble includes the import statements and body includes the translation unit instantiation code.
Note
The implementation is partially complete and an AssertionError is raised if the returned python script does not exactly reproduce kernel. Contributions are welcome to fill in the gaps.