Reference: Other Functionality

Obtaining Kernel Performance Statistics

class loopy.ToCountMap(init_dict=None, val_type=<class 'loopy.statistics.GuardedPwQPolynomial'>)

Maps any type of key to an arithmetic type.

filter_by(**kwargs)

Remove items without specified key fields.

Parameters:kwargs – Keyword arguments matching fields in the keys of the ToCountMap, each given a list of allowable values for that key field.
Returns:A ToCountMap containing the subset of the items in the original ToCountMap that match the field values passed.

Example usage:

# (first create loopy kernel and specify array data types)

params = {'n': 512, 'm': 256, 'l': 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=['load'],
                                 variable=['a','g'])
tot_loads_a_g = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
filter_by_func(func)

Keep items that pass a test.

Parameters:func – A function that takes a map key as a parameter and returns a bool.
Returns:A ToCountMap containing the subset of the items in the original ToCountMap for which func(key) is true.

Example usage:

# (first create loopy kernel and specify array data types)

params = {'n': 512, 'm': 256, 'l': 128}
mem_map = lp.get_mem_access_map(knl)
def filter_func(key):
    return key.lid_strides[0] > 1 and key.lid_strides[0] <= 4

filtered_map = mem_map.filter_by_func(filter_func)
tot = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
group_by(*args)

Group map items together, distinguishing by only the key fields passed in args.

Parameters:args – Zero or more str fields of map keys.
Returns:A ToCountMap containing the same total counts grouped together by new keys that only contain the fields specified in the arguments passed.

Example usage:

# (first create loopy kernel and specify array data types)

params = {'n': 512, 'm': 256, 'l': 128}
mem_map = get_mem_access_map(knl)
grouped_map = mem_map.group_by('mtype', 'dtype', 'direction')

f32_global_ld = grouped_map[MemAccess(mtype='global',
                                      dtype=np.float32,
                                      direction='load')
                           ].eval_with_dict(params)
f32_global_st = grouped_map[MemAccess(mtype='global',
                                      dtype=np.float32,
                                      direction='store')
                           ].eval_with_dict(params)
f32_local_ld = grouped_map[MemAccess(mtype='local',
                                     dtype=np.float32,
                                     direction='load')
                          ].eval_with_dict(params)
f32_local_st = grouped_map[MemAccess(mtype='local',
                                     dtype=np.float32,
                                     direction='store')
                          ].eval_with_dict(params)

op_map = get_op_map(knl)
ops_dtype = op_map.group_by('dtype')

f32ops = ops_dtype[Op(dtype=np.float32)].eval_with_dict(params)
f64ops = ops_dtype[Op(dtype=np.float64)].eval_with_dict(params)
i32ops = ops_dtype[Op(dtype=np.int32)].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
to_bytes()

Convert counts to bytes using data type in map key.

Returns:A ToCountMap mapping each original key to an islpy.PwQPolynomial with counts in bytes rather than instances.

Example usage:

# (first create loopy kernel and specify array data types)

bytes_map = get_mem_access_map(knl).to_bytes()
params = {'n': 512, 'm': 256, 'l': 128}

s1_g_ld_byt = bytes_map.filter_by(
                    mtype=['global'], lid_strides={0: 1},
                    direction=['load']).eval_and_sum(params)
s2_g_ld_byt = bytes_map.filter_by(
                    mtype=['global'], lid_strides={0: 2},
                    direction=['load']).eval_and_sum(params)
s1_g_st_byt = bytes_map.filter_by(
                    mtype=['global'], lid_strides={0: 1},
                    direction=['store']).eval_and_sum(params)
s2_g_st_byt = bytes_map.filter_by(
                    mtype=['global'], lid_strides={0: 2},
                    direction=['store']).eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
sum()

Add all counts in ToCountMap.

Returns:An islpy.PwQPolynomial or int containing the sum of counts.
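
Example usage (a minimal sketch, assuming mem_map and params were created as in the examples above):

tot_poly = mem_map.sum()                     # an islpy.PwQPolynomial (or int) summing all counts
tot_accesses = mem_map.eval_and_sum(params)  # the same sum, evaluated to an int

# (now use this count to, e.g., predict performance)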
eval_and_sum(params)

Add all counts in ToCountMap and evaluate with provided parameter dict.

Returns:An int containing the sum of all counts in the ToCountMap evaluated with the parameters provided.

Example usage:

# (first create loopy kernel and specify array data types)

params = {'n': 512, 'm': 256, 'l': 128}
mem_map = lp.get_mem_access_map(knl)
filtered_map = mem_map.filter_by(direction=['load'],
                                 variable=['a', 'g'])
tot_loads_a_g = filtered_map.eval_and_sum(params)

# (now use these counts to, e.g., predict performance)
class loopy.CountGranularity

Strings specifying whether an operation should be counted once per work-item, sub-group, or work-group.

WORKITEM

A str that specifies that an operation should be counted once per work-item.

SUBGROUP

A str that specifies that an operation should be counted once per sub-group.

WORKGROUP

A str that specifies that an operation should be counted once per work-group.
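
These granularity strings appear in the count_granularity field of Op and MemAccess keys, so they can be used to filter count maps. A brief sketch (assuming mem_map was obtained via lp.get_mem_access_map(knl) as in the examples above):

# keep only accesses counted once per work-item / once per sub-group
per_workitem = mem_map.filter_by(
        count_granularity=[lp.CountGranularity.WORKITEM])
per_subgroup = mem_map.filter_by(
        count_granularity=[lp.CountGranularity.SUBGROUP])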

class loopy.Op(dtype=None, name=None, count_granularity=None)

A descriptor for a type of arithmetic operation.

dtype

A loopy.LoopyType or numpy.dtype that specifies the data type operated on.

name

A str that specifies the kind of arithmetic operation as add, mul, div, pow, shift, bw (bitwise), etc.

count_granularity

A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think ‘thread’), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.

class loopy.MemAccess(mtype=None, dtype=None, lid_strides=None, gid_strides=None, direction=None, variable=None, variable_tag=None, count_granularity=None)

A descriptor for a type of memory access.

mtype

A str that specifies the memory type accessed as global or local

dtype

A loopy.LoopyType or numpy.dtype that specifies the data type accessed.

lid_strides

A dict of { int : pymbolic.primitives.Expression or int } that specifies local strides for each local id in the memory access index. Local ids not found will not be present in lid_strides.keys(). Uniform access (i.e. work-items within a sub-group access the same item) is indicated by setting lid_strides[0]=0, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.

gid_strides

A dict of { int : pymbolic.primitives.Expression or int } that specifies global strides for each global id in the memory access index. Global ids not found will not be present in gid_strides.keys().

direction

A str that specifies the direction of memory access as load or store.

variable

A str that specifies the variable name of the data accessed.

variable_tag

A str that specifies the variable tag of a pymbolic.primitives.TaggedVariable.

count_granularity

A str that specifies whether this operation should be counted once per work-item, sub-group, or work-group. The granularities allowed can be found in CountGranularity, and may be accessed, e.g., as CountGranularity.WORKITEM. A work-item is a single instance of computation executing on a single processor (think ‘thread’), a collection of which may be grouped together into a work-group. Each work-group executes on a single compute unit with all work-items within the work-group sharing local memory. A sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp.

loopy.get_op_map(knl, numpy_types=True, count_redundant_work=False, count_within_subscripts=True, subgroup_size=None)

Count the number of operations in a loopy kernel.

Parameters:
  • knl – A loopy.LoopKernel whose operations are to be counted.
  • numpy_types – A bool specifying whether the types in the returned mapping should be numpy types instead of loopy.LoopyType.
  • count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
  • count_within_subscripts – A bool specifying whether to count operations inside array indices.
  • subgroup_size – (currently unused) An int, str 'guess', or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt to find the sub-group size using the device will be made; if this fails, an error will be raised. If a str 'guess' is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
Returns:

A ToCountMap of { Op : islpy.PwQPolynomial }.

  • The Op specifies the characteristics of the arithmetic operation.
  • The islpy.PwQPolynomial holds the number of operations of the kind specified in the key (in terms of the loopy.LoopKernel parameter inames).

Example usage:

# (first create loopy kernel and specify array data types)

op_map = get_op_map(knl)
params = {'n': 512, 'm': 256, 'l': 128}
f32add = op_map[Op(np.float32,
                   'add',
                   count_granularity=CountGranularity.WORKITEM)
               ].eval_with_dict(params)
f32mul = op_map[Op(np.float32,
                   'mul',
                   count_granularity=CountGranularity.WORKITEM)
               ].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
loopy.get_mem_access_map(knl, numpy_types=True, count_redundant_work=False, subgroup_size=None)

Count the number of memory accesses in a loopy kernel.

Parameters:
  • knl – A loopy.LoopKernel whose memory accesses are to be counted.
  • numpy_types – A bool specifying whether the types in the returned mapping should be numpy types instead of loopy.LoopyType.
  • count_redundant_work – Based on usage of hardware axes or other specifics, a kernel may perform work redundantly. This bool flag indicates whether this work should be included in the count. (Likely desirable for performance modeling, but undesirable for code optimization.)
  • subgroup_size – An int, str 'guess', or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt to find the sub-group size using the device will be made; if this fails, an error will be raised. If a str 'guess' is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
Returns:

A ToCountMap of { MemAccess : islpy.PwQPolynomial }.

Example usage:

# (first create loopy kernel and specify array data types)

params = {'n': 512, 'm': 256, 'l': 128}
mem_map = get_mem_access_map(knl)

f32_s1_g_ld_a = mem_map[MemAccess(
                            mtype='global',
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction='load',
                            variable='a',
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_g_st_a = mem_map[MemAccess(
                            mtype='global',
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction='store',
                            variable='a',
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_l_ld_x = mem_map[MemAccess(
                            mtype='local',
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction='load',
                            variable='x',
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)
f32_s1_l_st_x = mem_map[MemAccess(
                            mtype='local',
                            dtype=np.float32,
                            lid_strides={0: 1},
                            gid_strides={0: 256},
                            direction='store',
                            variable='x',
                            count_granularity=CountGranularity.WORKITEM)
                       ].eval_with_dict(params)

# (now use these counts to, e.g., predict performance)
loopy.get_synchronization_map(knl, subgroup_size=None)

Count the number of synchronization events each work-item encounters in a loopy kernel.

Parameters:
  • knl – A loopy.LoopKernel whose barriers are to be counted.
  • subgroup_size – (currently unused) An int, str 'guess', or None that specifies the sub-group size. An OpenCL sub-group is an implementation-dependent grouping of work-items within a work-group, analogous to an NVIDIA CUDA warp. subgroup_size is used, e.g., when counting a MemAccess whose count_granularity specifies that it should only be counted once per sub-group. If set to None, an attempt to find the sub-group size using the device will be made; if this fails, an error will be raised. If a str 'guess' is passed as the subgroup_size, get_mem_access_map will attempt to find the sub-group size using the device and, if unsuccessful, will make a wild guess.
Returns:

A dictionary mapping each type of synchronization event to an islpy.PwQPolynomial holding the number of events per work-item.

Possible keys include barrier_local, barrier_global (if supported by the target) and kernel_launch.

Example usage:

# (first create loopy kernel and specify array data types)

sync_map = get_synchronization_map(knl)
params = {'n': 512, 'm': 256, 'l': 128}
barrier_ct = sync_map['barrier_local'].eval_with_dict(params)

# (now use this count to, e.g., predict performance)
loopy.gather_access_footprints(kernel, ignore_uncountable=False)

Return a dictionary mapping (var_name, direction) to islpy.Set instances capturing which indices of the array var_name are read/written (where direction is either read or write).

Parameters:ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices)
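
Example usage (a brief sketch, assuming knl is a LoopKernel that reads an array a):

footprints = lp.gather_access_footprints(knl)
a_read_set = footprints[('a', 'read')]   # an islpy.Set of indices of 'a' that are read
print(a_read_set)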
loopy.gather_access_footprint_bytes(kernel, ignore_uncountable=False)

Return a dictionary mapping (var_name, direction) to islpy.PwQPolynomial instances capturing the number of bytes read/written (where direction is either read or write) on the array var_name.

Parameters:ignore_uncountable – If False, an error will be raised for accesses on which the footprint cannot be determined (e.g. data-dependent or nonlinear indices)
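
Example usage (a brief sketch under the same assumptions as above, with the array a also being written):

byte_map = lp.gather_access_footprint_bytes(knl)
a_bytes_written = byte_map[('a', 'write')]   # an islpy.PwQPolynomial counting bytes written to 'a'
print(a_bytes_written)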
class loopy.statistics.GuardedPwQPolynomial(pwqpolynomial, valid_domain)

Controlling caching

loopy.set_caching_enabled(flag)

Set whether loopy is allowed to use disk caching for its various code generation stages.

class loopy.CacheMode(new_flag)

A context manager for setting whether loopy is allowed to use disk caches.
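
Example usage (a short sketch of both controls):

import loopy as lp

lp.set_caching_enabled(False)   # globally disable loopy's disk caches

with lp.CacheMode(True):        # re-enable caching only within this block
    # ... run code generation here ...
    pass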

Running Kernels

In addition to simply calling kernels using LoopKernel.__call__, the following underlying functionality may be used:

class loopy.CompiledKernel(context, kernel)
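
For illustration, a minimal sketch of the LoopKernel.__call__ path mentioned above (assuming knl is a LoopKernel expecting a single float32 array argument a, and that PyOpenCL is available):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.rand(512).astype(np.float32)
evt, (out,) = knl(queue, a=a)   # arguments passed by name; outputs returned alongside the event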

Automatic Testing

loopy.auto_test_vs_ref(ref_knl, ctx, test_knl=None, op_count=[], op_label=[], parameters={}, print_ref_code=False, print_code=True, warmup_rounds=2, dump_binary=False, fills_entire_output=None, do_check=True, check_result=None, max_test_kernel_count=1, quiet=False, blacklist_ref_vendors=[])

Compare results of ref_knl to the kernels generated by scheduling test_knl.

Parameters:
  • check_result – a callable with numpy.ndarray arguments (result, reference_result) returning a tuple (bool, message) indicating correctness/acceptability of the result
  • max_test_kernel_count – Stop testing after this many test kernels generated from test_knl
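
Example usage (a hedged sketch, assuming ref_knl, test_knl, and a PyOpenCL context ctx already exist):

lp.auto_test_vs_ref(ref_knl, ctx, test_knl,
                    parameters={'n': 512, 'm': 256, 'l': 128})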

Troubleshooting

Printing LoopKernel objects

If you’re confused about things loopy is referring to in an error message or about the current state of the LoopKernel you are transforming, the following always works:

print(kernel)

(And it yields a human-readable, albeit terse, representation of kernel.)

loopy.get_dot_dependency_graph(kernel, iname_cluster=True, use_insn_id=False)

Return a string in the dot language depicting dependencies among kernel instructions.

loopy.show_dependency_graph(*args, **kwargs)

Show the dependency graph generated by get_dot_dependency_graph() in a browser. Accepts the same arguments as that function.
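
Example usage (a short sketch, assuming knl is an existing LoopKernel):

dot_source = lp.get_dot_dependency_graph(knl)   # dependency graph in dot syntax
with open("deps.dot", "w") as f:
    f.write(dot_source)

lp.show_dependency_graph(knl)   # render the same graph in a browser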