Reference: Other Functionality#

Obtaining Kernel Performance Statistics#

class loopy.ToCountMap(count_map=None)[source]#

A map from work descriptors like Op and MemAccess to any arithmetic type.

__getitem__(index)[source]#
__str__()[source]#

Return str(self).

__repr__()[source]#

Return repr(self).

__len__()[source]#
get(key, default=None)[source]#
items()[source]#
keys()[source]#
values()[source]#
copy(count_map=None)[source]#
with_set_attributes(**kwargs)[source]#
filter_by(**kwargs)[source]#

Remove items without specified key fields.

Parameters:

kwargs – Keyword arguments matching fields in the keys of the ToCountMap, each given a list of allowable values for that key field.

Returns:

A ToCountMap containing the subset of the items in the original ToCountMap that match the field values passed.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"],
                                 variable=["a","g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
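
The filtering rule described above can be pictured with a small stand-alone model. This sketch does not use loopy: Access is a hypothetical stand-in for MemAccess with only three fields, the counts are made-up integers rather than polynomials, and the free function filter_by only mimics the documented behavior.

```python
from collections import namedtuple

# Hypothetical stand-in for loopy.MemAccess (illustration only).
Access = namedtuple("Access", ["direction", "variable", "mtype"])


def filter_by(count_map, **allowed):
    """Keep entries whose key fields all appear in the allowed lists."""
    return {
        key: count
        for key, count in count_map.items()
        if all(getattr(key, field) in values
               for field, values in allowed.items())
    }


# Made-up counts; a real ToCountMap would hold polynomials.
counts = {
    Access("load", "a", "global"): 100,
    Access("load", "b", "global"): 50,
    Access("store", "a", "global"): 25,
    Access("load", "g", "local"): 10,
}

loads_a_g = filter_by(counts, direction=["load"], variable=["a", "g"])
total = sum(loads_a_g.values())
```

Every keyword acts as an additional constraint, so an entry survives only if it matches one allowed value for each field named.
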
filter_by_func(func)[source]#

Keep items that pass a test.

Parameters:

func – A function that takes a map key as a parameter and returns a bool.

Returns:

A ToCountMap containing the subset of the items in the original ToCountMap for which func(key) is true.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
def filter_func(key):
    return key.lid_strides[0] > 1 and key.lid_strides[0] <= 4

filtered_map = mem_map.filter_by_func(filter_func)
tot = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
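
Since the 0 key may be absent from lid_strides (see MemAccess.lid_strides), a predicate that indexes key.lid_strides[0] directly can raise KeyError for some keys. A more defensive predicate might look like the sketch below. The helper name is hypothetical, SimpleNamespace objects stand in for real map keys, and symbolic strides are simply rejected, which is an assumption made for illustration.

```python
from types import SimpleNamespace


def has_small_lid0_stride(key):
    """True if the stride for local id 0 is an int in the range (1, 4]."""
    strides = getattr(key, "lid_strides", None) or {}
    stride = strides.get(0)  # None when local id 0 is not in the index
    return isinstance(stride, int) and 1 < stride <= 4


# Stand-ins for map keys (illustration only, not real MemAccess objects).
keys = [
    SimpleNamespace(lid_strides={0: 2}),
    SimpleNamespace(lid_strides={0: 1}),
    SimpleNamespace(lid_strides={1: 4}),
    SimpleNamespace(lid_strides=None),
]
matching = [key for key in keys if has_small_lid0_stride(key)]
```

A function like this could be passed to filter_by_func in place of filter_func.
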
group_by(*args)[source]#

Group map items together, distinguishing by only the key fields passed in args.

Parameters:

args – Zero or more str fields of map keys.

Returns:

A ToCountMap containing the same total counts grouped together by new keys that only contain the fields specified in the arguments passed.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)
grouped_map = mem_map.group_by("mtype", "dtype", "direction")

f32_global_ld = grouped_map[MemAccess(mtype="global",
                                      dtype=np.float32,
                                      direction="load")
                           ].eval_with_dict(params)
f32_global_st = grouped_map[MemAccess(mtype="global",
                                      dtype=np.float32,
                                      direction="store")
                           ].eval_with_dict(params)
f32_local_ld = grouped_map[MemAccess(mtype="local",
                                     dtype=np.float32,
                                     direction="load")
                          ].eval_with_dict(params)
f32_local_st = grouped_map[MemAccess(mtype="local",
                                     dtype=np.float32,
                                     direction="store")
                          ].eval_with_dict(params)

op_map = get_op_map(knl)
ops_dtype = op_map.group_by("dtype")

f32ops = ops_dtype[Op(dtype=np.float32)].eval_with_dict(params)
f64ops = ops_dtype[Op(dtype=np.float64)].eval_with_dict(params)
i32ops = ops_dtype[Op(dtype=np.int32)].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
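
To see what grouping does to the counts, consider the following toy model. It does not use loopy: Access is a hypothetical stand-in for MemAccess, the counts are made-up integers, and the reduced keys are plain tuples rather than new descriptor objects.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for loopy.MemAccess (illustration only).
Access = namedtuple("Access", ["mtype", "dtype", "direction", "variable"])


def group_by(count_map, *fields):
    """Sum counts over all keys that agree on the given fields."""
    grouped = Counter()
    for key, count in count_map.items():
        reduced_key = tuple(getattr(key, field) for field in fields)
        grouped[reduced_key] += count
    return dict(grouped)


counts = {
    Access("global", "float32", "load", "a"): 100,
    Access("global", "float32", "load", "b"): 50,
    Access("global", "float32", "store", "a"): 25,
    Access("local", "float32", "load", "x"): 10,
}

by_mtype_dir = group_by(counts, "mtype", "direction")
```

The two global loads collapse into one entry, and the sum over all entries is unchanged by the grouping.
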
to_bytes()[source]#

Convert counts to bytes using data type in map key.

Returns:

A ToCountMap mapping each original key to an islpy.PwQPolynomial with counts in bytes rather than instances.

Example usage:

# (first create loopy kernel and specify array data types)

bytes_map = get_mem_access_map(knl).to_bytes()
params = {"n": 512, "m": 256, "l": 128}

s1_g_ld_byt = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 1},
                    direction=["load"]).eval_and_sum(params)
s2_g_ld_byt = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 2},
                    direction=["load"]).eval_and_sum(params)
s1_g_st_byt = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 1},
                    direction=["store"]).eval_and_sum(params)
s2_g_st_byt = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 2},
                    direction=["store"]).eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
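
The conversion amounts to scaling each count by the size of the data type in its key. A minimal sketch, assuming a hard-coded table of item sizes and made-up integer counts (loopy itself takes the size from the dtype in the key and works on polynomials):

```python
from collections import namedtuple

# Hypothetical stand-in for loopy.MemAccess (illustration only).
Access = namedtuple("Access", ["dtype", "direction", "variable"])

# Assumed item sizes in bytes for the dtypes used below.
ITEMSIZE = {"float32": 4, "float64": 8, "int32": 4}


def to_bytes(count_map):
    """Scale each access count by the item size of its dtype."""
    return {key: count * ITEMSIZE[key.dtype]
            for key, count in count_map.items()}


counts = {
    Access("float32", "load", "a"): 100,
    Access("float64", "load", "b"): 50,
}
byte_counts = to_bytes(counts)
```
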
sum()[source]#

Returns:

A sum of the values of the dictionary.

class loopy.ToCountPolynomialMap(space, count_map=None)[source]#

Maps any type of key to an islpy.PwQPolynomial or a GuardedPwQPolynomial.

eval_and_sum(params=None)[source]#

Add all counts and evaluate with the provided parameter dict params.

Returns:

An int containing the sum of all counts evaluated with the parameters provided.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"],
                                 variable=["a", "g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
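
Conceptually, each value in the map is a count that depends on the kernel parameters, and eval_and_sum evaluates all of them at one parameter point and adds the results. The sketch below models the polynomials as ordinary Python functions; the keys, the count formulas and the free function eval_and_sum are assumptions made for illustration.

```python
def eval_and_sum(count_map, params):
    """Evaluate every parametric count at params and add the results."""
    return sum(count(**params) for count in count_map.values())


# Made-up parametric counts standing in for islpy.PwQPolynomial values.
counts = {
    ("load", "a"): lambda n, m, l: n * m * l,
    ("load", "g"): lambda n, m, l: n * m,
}

params = {"n": 512, "m": 256, "l": 128}
tot_loads_a_g = eval_and_sum(counts, params)
```
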
class loopy.CountGranularity[source]#

Strings specifying whether an operation should be counted once per work-item, sub-group, or work-group.

WORKITEM#

A str that specifies that an operation should be counted once per work-item.

SUBGROUP#

A str that specifies that an operation should be counted once per sub-group.

WORKGROUP#

A str that specifies that an operation should be counted once per work-group.
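
The effect of the three granularities on a count can be illustrated with a little arithmetic. This sketch is not loopy code: the function name and the launch sizes are hypothetical, and the lowercase strings are assumed stand-ins for the CountGranularity values.

```python
import math


def times_counted(n_workitems, workgroup_size, subgroup_size, granularity):
    """How often one operation is counted under a given granularity."""
    n_workgroups = math.ceil(n_workitems / workgroup_size)
    if granularity == "workitem":
        return n_workitems
    if granularity == "subgroup":
        # Each work-group is split into sub-groups of subgroup_size items.
        return n_workgroups * math.ceil(workgroup_size / subgroup_size)
    if granularity == "workgroup":
        return n_workgroups
    raise ValueError(f"unknown granularity: {granularity!r}")


# Hypothetical launch configuration.
launch = dict(n_workitems=1024, workgroup_size=128, subgroup_size=32)
per_item = times_counted(granularity="workitem", **launch)
per_subgroup = times_counted(granularity="subgroup", **launch)
per_group = times_counted(granularity="workgroup", **launch)
```
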

class loopy.Op(dtype=None, name=None, count_granularity=None, kernel_name=None)[source]#

A descriptor for a type of arithmetic operation.

dtype#

A loopy.types.LoopyType or numpy.dtype that specifies the data type operated on.

name#

A str that specifies the kind of arithmetic operation as add, mul, div, pow, shift, bw (bitwise), etc.

count_granularity#

A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think “thread”), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.

kernel_name#

A str representing the kernel name where the operation occurred.

class loopy.MemAccess(mtype=None, dtype=None, lid_strides=None, gid_strides=None, direction=None, variable=None, *, variable_tags=None, count_granularity=None, kernel_name=None)[source]#

A descriptor for a type of memory access.

mtype#

A str that specifies the memory type accessed as global or local.

dtype#

A loopy.types.LoopyType or numpy.dtype that specifies the data type accessed.

lid_strides#

A dict of { int : pymbolic.primitives.Expression or int } that specifies local strides for each local id in the memory access index. Local ids not found will not be present in lid_strides.keys(). Uniform access (i.e. work-items within a sub-group access the same item) is indicated by setting lid_strides[0]=0, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.
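
The uniform-access rule above can be written as a small check on a lid_strides dictionary. This is a sketch with a hypothetical helper name; it treats every missing 0 key as uniform access, which simplifies the "may also occur" wording above.

```python
def is_uniform_access(lid_strides):
    """True if the stride for local id 0 is zero or the 0 key is missing."""
    strides = lid_strides or {}
    return 0 not in strides or strides[0] == 0


examples = {
    "explicit zero stride": {0: 0},
    "local id 0 not in index": {1: 16},
    "unit stride": {0: 1},
}
uniform = {name: is_uniform_access(s) for name, s in examples.items()}
```
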

gid_strides#

A dict of { int : pymbolic.primitives.Expression or int } that specifies global strides for each global id in the memory access index. Global ids not found will not be present in gid_strides.keys().

direction#

A str that specifies the direction of memory access as load or store.

variable#

A str that specifies the variable name of the data accessed.

variable_tags#

A frozenset of subclasses of Tag that reflects tags of an accessed variable.

count_granularity#

A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think “thread”), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.

kernel_name#

A str representing the kernel name where the operation occurred.

loopy.get_op_map(program, count_redundant_work=False, count_within_subscripts=True, subgroup_size=None, entrypoint=None, within=None)[source]#

Count the number of operations in a loopy kernel.

Parameters:
  • program – A loopy.LoopKernel whose operations are to be counted.

  • count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)

  • count_within_subscripts – A bool specifying whether to count operations inside array indices.

  • subgroup_size – (currently unused) An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt is made to find the sub-group size using the device; if this fails, an error is raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.

  • within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.

Returns:

A ToCountMap of { Op : islpy.PwQPolynomial }.

  • The Op specifies the characteristics of the arithmetic operation.

  • The islpy.PwQPolynomial holds the number of operations of the kind specified in the key (in terms of the loopy.LoopKernel parameter inames).

Example usage:

# (first create loopy kernel and specify array data types)

op_map = get_op_map(knl)
params = {"n": 512, "m": 256, "l": 128}
f32add = op_map[Op(np.float32,
                   "add",
                   count_granularity=CountGranularity.WORKITEM)
               ].eval_with_dict(params)
f32mul = op_map[Op(np.float32,
                   "mul",
                   count_granularity=CountGranularity.WORKITEM)
               ].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
loopy.get_mem_access_map(program, count_redundant_work=False, subgroup_size=None, entrypoint=None, within=None)[source]#

Count the number of memory accesses in a loopy kernel.

Parameters:
  • program – A loopy.LoopKernel whose memory accesses are to be counted.

  • count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)

  • subgroup_size – An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt is made to find the sub-group size using the device; if this fails, an error is raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.

  • within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.

Returns:

A ToCountMap of { MemAccess : islpy.PwQPolynomial }.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)

f32_s1_g_ld_a = mem_map[MemAccess(
                            mtype="global",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="load",
                            variable="a",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_g_st_a = mem_map[MemAccess(
                            mtype="global",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="store",
                            variable="a",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_l_ld_x = mem_map[MemAccess(
                            mtype="local",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="load",
                            variable="x",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_l_st_x = mem_map[MemAccess(
                            mtype="local",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="store",
                            variable="x",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
loopy.get_synchronization_map(program, subgroup_size=None, entrypoint=None)[source]#

Count the number of synchronization events each work-item encounters in a loopy kernel.

Parameters:
  • program – A loopy.LoopKernel whose barriers are to be counted.

  • subgroup_size – (currently unused) An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt is made to find the sub-group size using the device; if this fails, an error is raised. If the str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.

Returns:

A dictionary mapping each type of synchronization event to an islpy.PwQPolynomial holding the number of events per work-item.

Possible keys include barrier_local, barrier_global (if supported by the target) and kernel_launch.

Example usage:

# (first create loopy kernel and specify array data types)

sync_map = get_synchronization_map(knl)
params = {"n": 512, "m": 256, "l": 128}
barrier_ct = sync_map["barrier_local"].eval_with_dict(params)

# (now use this count to, e.g., predict performance)
loopy.gather_access_footprints(program, ignore_uncountable=False, entrypoint=None)[source]#

Return a dictionary mapping (var_name, direction) to islpy.Set instances capturing which indices of the array var_name are read or written (where direction is either read or write).

Parameters:

ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices).

loopy.gather_access_footprint_bytes(program, ignore_uncountable=False)[source]#

Return a dictionary mapping (var_name, direction) to islpy.PwQPolynomial instances capturing the number of bytes read or written on array var_name (where direction is either read or write).

Parameters:

ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices).
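
A footprint differs from an access count: it records which distinct entries are touched, no matter how often. The toy model below enumerates the footprints of a simple loop with plain Python sets instead of islpy objects. The loop, the function name and the assumed 4-byte item size are all hypothetical.

```python
def footprints(n):
    """Footprints of the loop: for i in range(n): out[i] = a[i] + a[i + 1]."""
    fp = {("a", "read"): set(), ("out", "write"): set()}
    for i in range(n):
        fp[("a", "read")].update({i, i + 1})  # two reads per iteration
        fp[("out", "write")].add(i)
    return fp


ITEMSIZE = 4  # assumed size in bytes of one array entry (e.g. float32)

fp = footprints(8)
footprint_bytes = {key: len(indices) * ITEMSIZE
                   for key, indices in fp.items()}
```

Here the loop performs 16 reads of a, but only 9 distinct entries are touched, so the read footprint of a is 9 entries.
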

class loopy.statistics.GuardedPwQPolynomial(pwqpolynomial, valid_domain)[source]#

Controlling caching#

loopy.set_caching_enabled(flag)[source]#

Set whether loopy is allowed to use disk caching for its various code generation stages.

class loopy.CacheMode(new_flag)[source]#

A context manager for setting whether loopy is allowed to use disk caches.

Running Kernels#

Use TranslationUnit.executor to bind a translation unit to execution resources, and then use ExecutorBase.__call__ to invoke the kernel.

class loopy.ExecutorBase(t_unit: TranslationUnit, entrypoint: str)[source]#

An object allowing the execution of an entrypoint of a TranslationUnit. Create these objects using loopy.TranslationUnit.executor().

__call__(queue, **kwargs)[source]#

Call self as a function.

Automatic Testing#

loopy.auto_test_vs_ref(ref_prog, ctx, test_prog=None, op_count=(), op_label=(), parameters=None, print_ref_code=False, print_code=True, warmup_rounds=2, dump_binary=False, fills_entire_output=None, do_check=True, check_result=None, max_test_kernel_count=1, quiet=False, blacklist_ref_vendors=(), ref_entrypoint=None, test_entrypoint=None)[source]#

Compare results of ref_prog to the kernels generated by scheduling test_prog.

Parameters:
  • check_result – a callable with numpy.ndarray arguments (result, reference_result) returning a tuple (bool, message) indicating correctness/acceptability of the result.

  • max_test_kernel_count – Stop testing after this many test kernels.
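
A check_result callable with the shape described above might look like the following sketch. The tolerance parameters rtol and atol are additions for illustration, and the comparison assumes real-valued data. It uses only the standard library, so it accepts numpy arrays (through their flat attribute) as well as plain sequences.

```python
import math


def check_result(result, reference_result, rtol=1e-5, atol=1e-8):
    """Return (ok, message) after an entrywise closeness comparison."""
    # numpy arrays expose .flat; plain sequences are iterated directly.
    got = list(getattr(result, "flat", result))
    ref = list(getattr(reference_result, "flat", reference_result))
    if len(got) != len(ref):
        return False, f"size mismatch: {len(got)} vs {len(ref)}"
    for i, (g, r) in enumerate(zip(got, ref)):
        if not math.isclose(g, r, rel_tol=rtol, abs_tol=atol):
            return False, f"entry {i} differs: {g} vs {r}"
    return True, "results agree within tolerance"


ok, message = check_result([1.0, 2.0], [1.0, 2.0 + 1e-9])
bad, bad_message = check_result([1.0, 2.0], [1.0, 2.5])
```
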

Troubleshooting#

Printing LoopKernel objects#

If you’re confused about things loopy is referring to in an error message or about the current state of the LoopKernel you are transforming, the following always works:

print(kernel)

(And it yields a human-readable, albeit terse, representation of kernel.)

loopy.get_dot_dependency_graph(kernel, callables_table, iname_cluster=True, use_insn_id=False)[source]#

Return a string in the dot language depicting dependencies among kernel instructions.

loopy.show_dependency_graph(*args, **kwargs)[source]#

Show the dependency graph generated by get_dot_dependency_graph() in a browser. Accepts the same arguments as that function.
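
For readers unfamiliar with the dot language, the sketch below builds a dot string from a hand-written dependency table. It is not loopy's implementation: the function name, the instruction ids and the edge direction (from an instruction to what it depends on) are assumptions, and the string that get_dot_dependency_graph() produces may be laid out differently.

```python
def dependencies_to_dot(depends_on):
    """Render {instruction id: [ids it depends on]} as a dot digraph."""
    lines = ["digraph dependencies {"]
    for insn in sorted(depends_on):
        lines.append(f'    "{insn}";')
        for dep in sorted(depends_on[insn]):
            # Edge from an instruction to an instruction it depends on.
            lines.append(f'    "{insn}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical instruction ids.
dot = dependencies_to_dot({
    "load_a": [],
    "mul": ["load_a"],
    "store": ["mul"],
})
```

A string of this form can be rendered with Graphviz, for example by saving it to a file and running the dot command-line tool on it.
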

loopy.t_unit_to_python(t_unit, var_name='t_unit', return_preamble_and_body_separately=False)[source]#

Returns a str of Python code that instantiates kernel.

Parameters:
  • kernel – An instance of loopy.LoopKernel

  • var_name – A str of the kernel variable name in the generated python script.

  • return_preamble_and_body_separately – A bool. If True returns (preamble, body), where preamble includes the import statements and body includes the kernel, translation unit instantiation code.

Note

The implementation is only partially complete, and an AssertionError is raised if the returned Python script does not exactly reproduce kernel. Contributions to fill in the gaps are welcome.