Reference: Other Functionality

Auxiliary Data Types

loopy.typing.ExpressionT

alias of int | integer | float | complex | inexact | bool | Expression | tuple[ExpressionT, …]

loopy.typing.ShapeType

alias of Tuple[int | integer | float | complex | inexact | Expression, …]

class loopy.typing.auto[source]

A generic placeholder object for something that should be automatically determined. See, for example, the shape or strides argument of ArrayArg.
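
Example usage (a minimal sketch; assumes a kernel in which loopy can infer the extent of a from its use, and uses the shape argument of an array argument as the placeholder site):

import numpy as np
import loopy as lp

knl = lp.make_kernel(
        "{[i]: 0<=i<n}",
        "out[i] = 2*a[i]",
        [lp.GlobalArg("a", np.float32, shape=lp.auto),  # shape determined automatically
         ...])                                          # let loopy infer remaining arguments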

Obtaining Kernel Performance Statistics

class loopy.ToCountMap(count_map=None)[source]

A map from work descriptors like Op and MemAccess to any arithmetic type.

__getitem__(index)[source]
__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

__len__()[source]
get(key, default=None)[source]
items()[source]
keys()[source]
values()[source]
copy(count_map=None)[source]
with_set_attributes(**kwargs)[source]
filter_by(**kwargs)[source]

Remove items without specified key fields.

Parameters:

kwargs – Keyword arguments matching fields in the keys of the ToCountMap, each given a list of allowable values for that key field.

Returns:

A ToCountMap containing the subset of the items in the original ToCountMap that match the field values passed.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"],
                                 variable=["a","g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
filter_by_func(func)[source]

Keep items that pass a test.

Parameters:

func – A function that takes a map key as a parameter and returns a bool.

Returns:

A ToCountMap containing the subset of the items in the original ToCountMap for which func(key) is true.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
def filter_func(key):
    return key.lid_strides[0] > 1 and key.lid_strides[0] <= 4

filtered_map = mem_map.filter_by_func(filter_func)
tot = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
group_by(*args)[source]

Group map items together, distinguishing by only the key fields passed in args.

Parameters:

args – Zero or more str fields of map keys.

Returns:

A ToCountMap containing the same total counts grouped together by new keys that only contain the fields specified in the arguments passed.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)
grouped_map = mem_map.group_by("mtype", "dtype", "direction")

f32_global_ld = grouped_map[MemAccess(mtype="global",
                                      dtype=np.float32,
                                      direction="load")
                           ].eval_with_dict(params)
f32_global_st = grouped_map[MemAccess(mtype="global",
                                      dtype=np.float32,
                                      direction="store")
                           ].eval_with_dict(params)
f32_local_ld = grouped_map[MemAccess(mtype="local",
                                     dtype=np.float32,
                                     direction="load")
                          ].eval_with_dict(params)
f32_local_st = grouped_map[MemAccess(mtype="local",
                                     dtype=np.float32,
                                     direction="store")
                          ].eval_with_dict(params)

op_map = get_op_map(knl)
ops_dtype = op_map.group_by("dtype")

f32ops = ops_dtype[Op(dtype=np.float32)].eval_with_dict(params)
f64ops = ops_dtype[Op(dtype=np.float64)].eval_with_dict(params)
i32ops = ops_dtype[Op(dtype=np.int32)].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
to_bytes()[source]

Convert counts to bytes using data type in map key.

Returns:

A ToCountMap mapping each original key to an islpy.PwQPolynomial with counts in bytes rather than instances.

Example usage:

# (first create loopy kernel and specify array data types)

bytes_map = get_mem_access_map(knl).to_bytes()
params = {"n": 512, "m": 256, "l": 128}

s1_g_ld_bytes = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 1},
                    direction=["load"]).eval_and_sum(params)
s2_g_ld_bytes = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 2},
                    direction=["load"]).eval_and_sum(params)
s1_g_st_bytes = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 1},
                    direction=["store"]).eval_and_sum(params)
s2_g_st_bytes = bytes_map.filter_by(
                    mtype=["global"], lid_strides={0: 2},
                    direction=["store"]).eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
sum()[source]
Returns:

A sum of the values of the dictionary.
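
Example usage (a minimal sketch; assumes mem_map was obtained from lp.get_mem_access_map as in the examples above, so that the summed value supports eval_with_dict — an assumption here):

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)

# sum all counts in the map, then evaluate with the parameter dict;
# comparable in effect to mem_map.eval_and_sum(params)
total_accesses = mem_map.sum().eval_with_dict(params)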

class loopy.ToCountPolynomialMap(space, count_map=None)[source]

Maps any type of key to an islpy.PwQPolynomial or a GuardedPwQPolynomial.

eval_and_sum(params=None)[source]

Add all counts and evaluate them with the provided parameter dict params.

Returns:

An int containing the sum of all counts evaluated with the parameters provided.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=["load"],
                                 variable=["a", "g"])
tot_loads_a_g = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
class loopy.CountGranularity[source]

Strings specifying whether an operation should be counted once per work-item, sub-group, or work-group.

WORKITEM

A str that specifies that an operation should be counted once per work-item.

SUBGROUP

A str that specifies that an operation should be counted once per sub-group.

WORKGROUP

A str that specifies that an operation should be counted once per work-group.
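
Example usage (a minimal sketch; assumes op_map was obtained from lp.get_op_map, as shown further below, and uses the fact that count_granularity is a key field of Op):

# keep only operations counted once per work-item
wi_op_map = op_map.filter_by(
        count_granularity=[lp.CountGranularity.WORKITEM])

# keep only operations counted once per sub-group
sg_op_map = op_map.filter_by(
        count_granularity=[lp.CountGranularity.SUBGROUP])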

class loopy.Op(dtype=None, name=None, count_granularity=None, kernel_name=None)[source]

A descriptor for a type of arithmetic operation.

dtype

A loopy.types.LoopyType or numpy.dtype that specifies the data type operated on.

name

A str that specifies the kind of arithmetic operation as add, mul, div, pow, shift, bw (bitwise), etc.

count_granularity

A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think “thread”), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.

kernel_name

A str representing the kernel name where the operation occurred.

class loopy.MemAccess(mtype=None, dtype=None, lid_strides=None, gid_strides=None, direction=None, variable=None, *, variable_tags=None, count_granularity=None, kernel_name=None)[source]

A descriptor for a type of memory access.

mtype

A str that specifies the memory type accessed as global or local.

dtype

A loopy.types.LoopyType or numpy.dtype that specifies the data type accessed.

lid_strides

A dict of { int : pymbolic.primitives.Expression or int } that specifies local strides for each local id in the memory access index. Local ids not found will not be present in lid_strides.keys(). Uniform access (i.e. work-items within a sub-group access the same item) is indicated by setting lid_strides[0]=0, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.

gid_strides

A dict of { int : pymbolic.primitives.Expression or int } that specifies global strides for each global id in the memory access index. Global ids not found will not be present in gid_strides.keys().

direction

A str that specifies the direction of memory access as load or store.

variable

A str that specifies the variable name of the data accessed.

variable_tags

A frozenset of subclasses of Tag that reflects tags of an accessed variable.

count_granularity

A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think “thread”), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.

kernel_name

A str representing the kernel name where the operation occurred.

loopy.get_op_map(program, count_redundant_work=False, count_within_subscripts=True, subgroup_size=None, entrypoint=None, within=None)[source]

Count the number of operations in a loopy kernel.

Parameters:
  • program – A loopy.LoopKernel whose operations are to be counted.

  • count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)

  • count_within_subscripts – A bool specifying whether to count operations inside array indices.

  • subgroup_size – (currently unused) An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None an attempt to find the sub-group size using the device will be made, if this fails an error will be raised. If a str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.

  • within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.

Returns:

A ToCountMap of { Op : islpy.PwQPolynomial }.

  • The Op specifies the characteristics of the arithmetic operation.

  • The islpy.PwQPolynomial holds the number of operations of the kind specified in the key (in terms of the loopy.LoopKernel parameter inames).

Example usage:

# (first create loopy kernel and specify array data types)

op_map = get_op_map(knl)
params = {"n": 512, "m": 256, "l": 128}
f32add = op_map[Op(np.float32,
                   "add",
                   count_granularity=CountGranularity.WORKITEM)
               ].eval_with_dict(params)
f32mul = op_map[Op(np.float32,
                   "mul",
                   count_granularity=CountGranularity.WORKITEM)
               ].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
loopy.get_mem_access_map(program, count_redundant_work=False, subgroup_size=None, entrypoint=None, within=None)[source]

Count the number of memory accesses in a loopy kernel.

Parameters:
  • program – A loopy.LoopKernel whose memory accesses are to be counted.

  • count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)

  • subgroup_size – An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None an attempt to find the sub-group size using the device will be made, if this fails an error will be raised. If a str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.

  • within – If not None, limit the result to matching contexts. See loopy.match.parse_match() for syntax.

Returns:

A ToCountMap of { MemAccess : islpy.PwQPolynomial }.

Example usage:

# (first create loopy kernel and specify array data types)

params = {"n": 512, "m": 256, "l": 128}
mem_map = get_mem_access_map(knl)

f32_s1_g_ld_a = mem_map[MemAccess(
                            mtype="global",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="load",
                            variable="a",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_g_st_a = mem_map[MemAccess(
                            mtype="global",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="store",
                            variable="a",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_l_ld_x = mem_map[MemAccess(
                            mtype="local",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="load",
                            variable="x",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_l_st_x = mem_map[MemAccess(
                            mtype="local",
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction="store",
                            variable="x",
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
loopy.get_synchronization_map(program, subgroup_size=None, entrypoint=None)[source]

Count the number of synchronization events each work-item encounters in a loopy kernel.

Parameters:
  • program – A loopy.LoopKernel whose barriers are to be counted.

  • subgroup_size – (currently unused) An int, str "guess", or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None an attempt to find the sub-group size using the device will be made, if this fails an error will be raised. If a str "guess" is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.

Returns:

A dictionary mapping each type of synchronization event to an islpy.PwQPolynomial holding the number of events per work-item.

Possible keys include barrier_local, barrier_global (if supported by the target) and kernel_launch.

Example usage:

# (first create loopy kernel and specify array data types)

sync_map = get_synchronization_map(knl)
params = {"n": 512, "m": 256, "l": 128}
barrier_ct = sync_map["barrier_local"].eval_with_dict(params)

# (now use this count to, e.g., predict performance)
loopy.gather_access_footprints(program, ignore_uncountable=False, entrypoint=None)[source]

Return a dictionary mapping (var_name, direction) to islpy.Set instances capturing which indices of the array var_name are read or written (where direction is either read or write).

Parameters:

ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices).
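
Example usage (a minimal sketch; the translation unit prog and the variable name "a" are illustrative assumptions):

footprints = lp.gather_access_footprints(prog)

# an islpy.Set describing which indices of "a" are read
a_read_footprint = footprints[("a", "read")]
print(a_read_footprint)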

loopy.gather_access_footprint_bytes(program, ignore_uncountable=False)[source]

Return a dictionary mapping (var_name, direction) to islpy.PwQPolynomial instances capturing the number of bytes read or written (where direction is either read or write) for the array var_name.

Parameters:

ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices).
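
Example usage (a minimal sketch mirroring the example above; the key ("a", "write") is an illustrative assumption):

byte_footprints = lp.gather_access_footprint_bytes(prog)

# an islpy.PwQPolynomial (in the kernel parameters) counting bytes written to "a"
print(byte_footprints[("a", "write")])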

class loopy.statistics.GuardedPwQPolynomial(pwqpolynomial, valid_domain)[source]

Controlling caching

LOOPY_NO_CACHE
CG_NO_CACHE

By default, loopy will cache (on disk) the result of various stages of code generation to speed up future code generation of the same kernel. By setting the environment variables LOOPY_NO_CACHE or CG_NO_CACHE to any string that pytools.strtobool() evaluates as True, this caching is suppressed.

loopy.set_caching_enabled(flag)[source]

Set whether loopy is allowed to use disk caching for its various code generation stages.

class loopy.CacheMode(new_flag)[source]

A context manager for setting whether loopy is allowed to use disk caches.
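
A minimal sketch of these caching controls (assumes an existing kernel knl; generate_code_v2 is used only to trigger code generation):

import loopy as lp

# disable disk caching globally, then re-enable it
lp.set_caching_enabled(False)
lp.set_caching_enabled(True)

# or scope the change to a block using the context manager
with lp.CacheMode(False):
    device_code = lp.generate_code_v2(knl).device_code()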

Running Kernels

Use TranslationUnit.executor to bind a translation unit to execution resources, and then use ExecutorBase.__call__ to invoke the kernel.

class loopy.ExecutorBase(t_unit: TranslationUnit, entrypoint: str)[source]

An object allowing the execution of an entrypoint of a TranslationUnit. Create these objects using loopy.TranslationUnit.executor().

__call__(queue, **kwargs)[source]

Call self as a function.
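
A minimal sketch, assuming the PyOpenCL target (where executor is typically passed a pyopencl.Context, __call__ takes a command queue plus kernel arguments, and the call returns a tuple (evt, outputs)); the argument name a and the single output out are illustrative assumptions:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# bind the translation unit to execution resources...
exec_knl = t_unit.executor(ctx)

# ...and invoke its entrypoint
a = np.random.rand(512).astype(np.float32)
evt, (out,) = exec_knl(queue, a=a)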

Automatic Testing

loopy.auto_test_vs_ref(ref_prog, ctx, test_prog=None, op_count=(), op_label=(), parameters=None, print_ref_code=False, print_code=True, warmup_rounds=2, dump_binary=False, fills_entire_output=None, do_check=True, check_result=None, max_test_kernel_count=1, quiet=False, blacklist_ref_vendors=(), ref_entrypoint=None, test_entrypoint=None)[source]

Compare results of ref_prog to the kernels generated by scheduling test_prog.

Parameters:
  • check_result – a callable with numpy.ndarray arguments (result, reference_result) returning a tuple (bool, message) indicating correctness/acceptability of the result

  • max_test_kernel_count – Stop testing after this many test kernels have been generated.
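
Example usage (a minimal sketch; assumes a reference kernel ref_knl and a transformed kernel knl that both take a parameter n):

import pyopencl as cl

ctx = cl.create_some_context()
lp.auto_test_vs_ref(ref_knl, ctx, knl, parameters={"n": 16384})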

Troubleshooting

Printing LoopKernel objects

If you’re confused about things loopy is referring to in an error message or about the current state of the LoopKernel you are transforming, the following always works:

print(kernel)

(This yields a human-readable, albeit terse, representation of kernel.)

loopy.get_dot_dependency_graph(kernel, callables_table, iname_cluster=True, use_insn_id=False)[source]

Return a string in the dot language depicting dependencies among kernel instructions.

loopy.show_dependency_graph(*args, **kwargs)[source]

Show the dependency graph generated by get_dot_dependency_graph() in a browser. Accepts the same arguments as that function.

loopy.t_unit_to_python(t_unit, var_name='t_unit', return_preamble_and_body_separately=False)[source]

Return a str of Python code that instantiates kernel.

Parameters:
  • t_unit – An instance of loopy.TranslationUnit.

  • var_name – A str of the kernel variable name in the generated python script.

  • return_preamble_and_body_separately – A bool. If True, returns (preamble, body), where preamble includes the import statements and body includes the translation unit instantiation code.

Note

The implementation is only partially complete, and an AssertionError is raised if the returned Python script does not exactly reproduce kernel. Contributions to fill in the missing pieces are welcome.
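
Example usage (a minimal sketch; assumes an existing translation unit t_unit):

python_src = lp.t_unit_to_python(t_unit, var_name="t_unit")
print(python_src)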