# Tutorial¶

This guide provides a gentle introduction into what loopy is, how it works, and what it can do. There’s also the NumPy Reference that aims to unambiguously define all aspects of loopy.

## Preparation¶

`loopy`

currently requires on `pyopencl`

to be installed. We
import a few modules and set up a `pyopencl.Context`

and a
`pyopencl.CommandQueue`

:

```
>>> import numpy as np
>>> import pyopencl as cl
>>> import pyopencl.array
>>> import pyopencl.clrandom
>>> import loopy as lp
>>> lp.set_caching_enabled(False)
>>> from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2
>>> from warnings import filterwarnings, catch_warnings
>>> filterwarnings('error', category=lp.LoopyWarning)
>>> ctx = cl.create_some_context(interactive=False)
>>> queue = cl.CommandQueue(ctx)
```

We also create some data on the device that we’ll use throughout our examples:

```
>>> n = 16*16
>>> x_vec_dev = cl.clrandom.rand(queue, n, dtype=np.float32)
>>> y_vec_dev = cl.clrandom.rand(queue, n, dtype=np.float32)
>>> z_vec_dev = cl.clrandom.rand(queue, n, dtype=np.float32)
>>> a_mat_dev = cl.clrandom.rand(queue, (n, n), dtype=np.float32)
>>> b_mat_dev = cl.clrandom.rand(queue, (n, n), dtype=np.float32)
```

And some data on the host:

```
>>> x_vec_host = np.random.randn(n).astype(np.float32)
>>> y_vec_host = np.random.randn(n).astype(np.float32)
```

We’ll also disable console syntax highlighting because it confuses doctest:

```
>>> # not a documented interface
>>> import loopy.options
>>> loopy.options.ALLOW_TERMINAL_COLORS = False
```

## Getting started¶

We’ll start by taking a closer look at a very simple kernel that reads in one vector, doubles it, and writes it to another.

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... "out[i] = 2*a[i]")
```

The parts that you see here are the two main components of a loopy kernel:

The

**loop domain**:`{ [i]: 0<=i<n }`

. This tells loopy the values that you would like your loop variables to assume. It is written in ISL syntax. Loopy calls the loop variables**inames**. These are the identifiers that occur in between the brackets at the beginning of the loop domain.Note that

*n*is not an iname in the example. It is a parameter that is passed to the kernel by the user that, in this case, determines the length of the vector being multiplied.The

**instructions**to be executed. These are generally scalar assignments between array elements, consisting of a left hand side and a right hand side. See Assignment objects for the full syntax of an assignment.Reductions are allowed, too, and are given as, for example:

sum(k, a[i,k]*b[k,j])

See Expressions for a full list of allowed constructs in the left- and right-hand side expression of an assignment.

As you create and transform kernels, it’s useful to know that you can always see loopy’s view of a kernel by printing it.

```
>>> knl = lp.set_options(knl, allow_terminal_colors=False)
>>> print(knl)
---------------------------------------------------------------------------
KERNEL: loopy_kernel
---------------------------------------------------------------------------
ARGUMENTS:
a: type: <auto/runtime>, shape: (n), dim_tags: (N0:stride:1) aspace: global
n: ValueArg, type: <auto/runtime>
out: type: <auto/runtime>, shape: (n), dim_tags: (N0:stride:1) aspace: global
---------------------------------------------------------------------------
DOMAINS:
[n] -> { [i] : 0 <= i < n }
---------------------------------------------------------------------------
INAME IMPLEMENTATION TAGS:
i: None
---------------------------------------------------------------------------
INSTRUCTIONS:
for i
out[i] = 2*a[i] {id=insn}
end i
---------------------------------------------------------------------------
```

You’ll likely have noticed that there’s quite a bit more information here than there was in the input. Most of this comes from default values that loopy assumes to cover common use cases. These defaults can all be overridden.

We’ve seen the domain and the instructions above, and we’ll discuss the ‘iname-to-tag-map’ in Implementing Loop Axes (“Inames”). The remaining big chunk of added information is in the ‘arguments’ section, where we observe the following:

`a`

and`out`

have been classified as pass-by-reference (i.e. pointer) arguments in global device memory. Any referenced array variable will default to global unless otherwise specified.Loopy has also examined our access to

`a`

and`out`

and determined the bounds of the array from the values we are accessing. This is shown after**shape:**. Like`numpy`

, loopy works on multi-dimensional arrays. Loopy’s idea of arrays is very similar to that of`numpy`

, including the*shape*attribute.Sometimes, loopy will be unable to make this determination. It will tell you so–for example when the array indices consist of data read from memory. Other times, arrays are larger than the accessed footprint. In either case, you will want to specify the kernel arguments explicitly. See Specifying arguments.

Loopy has not determined the type of

`a`

and`out`

. The data type is given as`<auto/runtime>`

, which means that these types will be determined by the data passed in when the kernel is invoked. Loopy generates (and caches!) a copy of the kernel for each combination of types passed in.In addition, each array axis has a ‘dimension tag’. This is shown above as

`(stride:1)`

. We will see more on this in Implementing Array Axes.

## Running a kernel¶

Running the kernel that we’ve just created is easy. Let’s check the result for good measure.

```
>>> evt, (out,) = knl(queue, a=x_vec_dev)
>>> assert (out.get() == (2*x_vec_dev).get()).all()
```

We can have loopy print the OpenCL kernel it generated
by passing `loopy.Options.write_cl`

.

```
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float const *__restrict__ a, int const n, __global float *__restrict__ out)
{
for (int i = 0; i <= -1 + n; ++i)
out[i] = 2.0f * a[i];
}
```

As promised, loopy has used the type of *x_vec_dev* to specialize the
kernel. If a variable is written as part of the kernel code, loopy will
automatically return it in the second element of the result of a kernel
call (the first being the `pyopencl.Event`

associated with the
execution of the kernel). (If the ordering of the output tuple is not
clear, it can be specified or turned into a `dict`

. See the
*kernel_data* argument of `loopy.make_kernel()`

and
`loopy.Options.return_dict`

.)

For convenience, loopy kernels also directly accept `numpy`

arrays:

```
>>> evt, (out,) = knl(queue, a=x_vec_host)
>>> assert (out == (2*x_vec_host)).all()
```

Notice how both *out* and *a* are `numpy`

arrays, but neither needed
to be transferred to or from the device. Checking for numpy arrays and
transferring them if needed comes at a potential performance cost. If you
would like to make sure that you avoid this cost, pass
`loopy.Options.no_numpy`

.

Further notice how *n*, while technically being an argument, did not need
to be passed, as loopy is able to find *n* from the shape of the input
argument *a*.

For efficiency, loopy generates Python code that handles kernel invocation.
If you are suspecting that this code is causing you an issue, you can
inspect that code, too, using `loopy.Options.write_wrapper`

:

```
>>> knl = lp.set_options(knl, write_wrapper=True, write_cl=False)
>>> evt, (out,) = knl(queue, a=x_vec_host)
from __future__ import division
...
def invoke_loopy_kernel_loopy_kernel(_lpy_cl_kernels, queue, allocator=None, wait_for=None, out_host=None, a=None, n=None, out=None):
if allocator is None:
allocator = _lpy_cl_tools.DeferredAllocator(queue.context)
# {{{ find integer arguments from shapes
if n is None:
if a is not None:
n = a.shape[0]
elif out is not None:
n = out.shape[0]
# }}}
...
```

### Generating code¶

Instead of using loopy to run the code it generates, you can also just use
loopy as a code generator and take care of executing the generated kernels
yourself. In this case, make sure loopy knows about all types, and then
call `loopy.generate_code()`

:

```
>>> typed_knl = lp.add_dtypes(knl, dict(a=np.float32))
>>> code, _ = lp.generate_code(typed_knl)
>>> print(code)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float const *__restrict__ a, int const n, __global float *__restrict__ out)
{
for (int i = 0; i <= -1 + n; ++i)
out[i] = 2.0f * a[i];
}
```

Additionally, for C-based languages, header definitions can be obtained via
the `loopy.generate_header()`

:

```
>>> header = str(lp.generate_header(typed_knl)[0])
>>> print(header)
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float const *__restrict__ a, int const n, __global float *__restrict__ out);
```

## Ordering¶

Next, we’ll change our kernel a bit. Our goal will be to transpose a matrix and double its entries, and we will do this in two steps for the sake of argument:

```
>>> # WARNING: Incorrect.
>>> knl = lp.make_kernel(
... "{ [i,j]: 0<=i,j<n }",
... """
... out[j,i] = a[i,j]
... out[i,j] = 2*out[i,j]
... """)
```

loopy’s programming model is completely *unordered* by default. This means
that:

There is no guarantee about the order in which the loop domain is traversed.

`i==3`

could be reached before`i==0`

but also before`i==17`

. Your program is only correct if it produces a valid result irrespective of this ordering.In addition, there is (by default) no ordering between instructions either. In other words, loopy is free to execute the instructions above in any order whatsoever.

Reading the above two rules, you’ll notice that our transpose-and-multiply kernel is incorrect, because it only computes the desired result if the first instruction completes before the second one. To fix this, we declare an explicit dependency:

```
>>> # WARNING: Incorrect.
>>> knl = lp.make_kernel(
... "{ [i,j]: 0<=i,j<n }",
... """
... out[j,i] = a[i,j] {id=transpose}
... out[i,j] = 2*out[i,j] {dep=transpose}
... """)
```

`{id=transpose}`

assigns the identifier *transpose* to the first
instruction, and `{dep=transpose}`

declares a dependency of the second
instruction on the first. Looking at loopy’s view of this kernel, we see
that these dependencies show up there, too:

```
>>> print(knl.stringify(with_dependencies=True))
---------------------------------------------------------------------------
KERNEL: loopy_kernel
---------------------------------------------------------------------------
...
---------------------------------------------------------------------------
DEPENDENCIES: (use loopy.show_dependency_graph to visualize)
insn : transpose
---------------------------------------------------------------------------
```

These dependencies are in a `dependent : prerequisite`

format that should
be familiar if you have previously dealt with Makefiles. For larger
kernels, these dependency lists can become quite verbose, and there is an
increasing risk that required dependencies are missed. To help catch these,
loopy can also show an instruction dependency graph, using
`loopy.show_dependency_graph()`

:

Dependencies are shown as arrows from prerequisite to dependent in the graph. This functionality requires the open-source graphviz graph drawing tools to be installed. The generated graph will open in a browser window.

Since manually notating lots of dependencies is cumbersome, loopy has a heuristic:

If a variable is written by exactly one instruction, then all instructions reading that variable will automatically depend on the writing instruction.

The intent of this heuristic is to cover the common case of a
precomputed result being stored and used many times. Generally, these
dependencies are *in addition* to any manual dependencies added via
`{dep=...}`

. It is possible (but rare) that the heuristic adds undesired
dependencies. In this case, `{dep=*...}`

(i.e. a leading asterisk) to
prevent the heuristic from adding dependencies for this instruction.

### Loops and dependencies¶

Next, it is important to understand how loops and dependencies interact. Let us take a look at the generated code for the above kernel:

```
>>> knl = lp.set_options(knl, "write_cl")
>>> knl = lp.prioritize_loops(knl, "i,j")
>>> evt, (out,) = knl(queue, a=a_mat_dev)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float const *__restrict__ a, int const n, __global float *__restrict__ out)
{
for (int i = 0; i <= -1 + n; ++i)
for (int j = 0; j <= -1 + n; ++j)
{
out[n * j + i] = a[n * i + j];
out[n * i + j] = 2.0f * out[n * i + j];
}
}
```

While our requested instruction ordering has been obeyed, something is still not right:

```
>>> print((out.get() == a_mat_dev.get().T*2).all())
False
```

For the kernel to perform the desired computation, *all
instances* (loop iterations) of the first instruction need to be completed,
not just the one for the current values of *(i, j)*.

Dependencies in loopy act

withinthe largest common set of shared inames.

As a result, our example above realizes the dependency *within* the *i* and *j*
loops. To fix our example, we simply create a new pair of loops *ii* and *jj*
with identical bounds, for the use of the transpose:

```
>>> knl = lp.make_kernel(
... "{ [i,j,ii,jj]: 0<=i,j,ii,jj<n }",
... """
... out[j,i] = a[i,j] {id=transpose}
... out[ii,jj] = 2*out[ii,jj] {dep=transpose}
... """)
>>> knl = lp.prioritize_loops(knl, "i,j,ii,jj")
```

`loopy.duplicate_inames()`

can be used to achieve the same goal.
Now the intended code is generated and our test passes.

```
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=a_mat_dev)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float const *__restrict__ a, int const n, __global float *__restrict__ out)
{
for (int i = 0; i <= -1 + n; ++i)
for (int j = 0; j <= -1 + n; ++j)
out[n * j + i] = a[n * i + j];
for (int ii = 0; ii <= -1 + n; ++ii)
for (int jj = 0; jj <= -1 + n; ++jj)
out[n * ii + jj] = 2.0f * out[n * ii + jj];
}
>>> assert (out.get() == a_mat_dev.get().T*2).all()
```

Also notice how the changed loop structure is reflected in the dependency graph:

### Loop nesting¶

One last aspect of ordering over which we have thus far not exerted any
control is the nesting of loops. For example, should the *i* loop be nested
around the *j* loop, or the other way around, in the following simple
zero-fill kernel?

It turns out that Loopy will choose a loop nesting for us, but it might be ambiguous. Consider the following code:

```
>>> knl = lp.make_kernel(
... "{ [i,j]: 0<=i,j<n }",
... """
... a[i,j] = 0
... """)
```

Both nestings of the inames i and j result in a correct kernel. This ambiguity can be resolved:

```
>>> knl = lp.prioritize_loops(knl, "j,i")
```

`loopy.prioritize_loops()`

indicates the textual order in which loops
should be entered in the kernel code. Note that this priority has an
advisory role only. If the kernel logically requires a different nesting,
loop priority is ignored. Priority is only considered if loop nesting is
ambiguous.

```
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=a_mat_dev)
#define lid(N) ((int) get_local_id(N))
...
for (int j = 0; j <= -1 + n; ++j)
for (int i = 0; i <= -1 + n; ++i)
a[n * i + j] = 0.0f;
...
```

No more warnings! Loop nesting is also reflected in the dependency graph:

## Introduction to Kernel Transformations¶

What we have covered thus far puts you in a position to describe many kinds
of computations to loopy–in the sense that loopy will generate code that
carries out the correct operation. That’s nice, but it’s natural to also
want control over *how* a program is executed. Loopy’s way of capturing
this information is by way of *transformations*. These have the following
general shape:

```
new_kernel = lp.do_something(old_knl, arguments...)
```

These transformations always return a *copy* of the old kernel with the
requested change applied. Typically, the variable holding the old kernel
is overwritten with the new kernel:

```
knl = lp.do_something(knl, arguments...)
```

We’ve already seen an example of a transformation above:
For instance, `prioritize_loops()`

fit the pattern.

`loopy.split_iname()`

is another fundamental (and useful) transformation. It
turns one existing iname (recall that this is loopy’s word for a ‘loop
variable’, roughly) into two new ones, an ‘inner’ and an ‘outer’ one,
where the ‘inner’ loop is of a fixed, specified length, and the ‘outer’
loop runs over these fixed-length ‘chunks’. The three inames have the
following relationship to one another:

```
OLD = INNER + GROUP_SIZE * OUTER
```

Consider this example:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... "a[i] = 0", assumptions="n>=1")
>>> knl = lp.split_iname(knl, "i", 16)
>>> knl = lp.prioritize_loops(knl, "i_outer,i_inner")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
for (int i_outer = 0; i_outer <= -1 + ((15 + n) / 16); ++i_outer)
for (int i_inner = 0; i_inner <= (-16 + n + -16 * i_outer >= 0 ? 15 : -1 + n + -16 * i_outer); ++i_inner)
a[16 * i_outer + i_inner] = 0.0f;
...
```

By default, the new, split inames are named *OLD_outer* and *OLD_inner*,
where *OLD* is the name of the previous iname. Upon exit from
`loopy.split_iname()`

, *OLD* is removed from the kernel and replaced by
*OLD_inner* and *OLD_outer*.

Also take note of the *assumptions* argument. This makes it possible to
communicate assumptions about loop domain parameters. (but *not* about
data) In this case, assuming non-negativity helps loopy generate more
efficient code for division in the loop bound for *i_outer*. See below
on how to communicate divisibility assumptions.

Note that the words ‘inner’ and ‘outer’ here have no implied meaning in
relation to loop nesting. For example, it’s perfectly possible to request
*i_inner* to be nested outside *i_outer*:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... "a[i] = 0", assumptions="n>=1")
>>> knl = lp.split_iname(knl, "i", 16)
>>> knl = lp.prioritize_loops(knl, "i_inner,i_outer")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
for (int i_inner = 0; i_inner <= (-17 + n >= 0 ? 15 : -1 + n); ++i_inner)
for (int i_outer = 0; i_outer <= -1 + -1 * i_inner + ((15 + n + 15 * i_inner) / 16); ++i_outer)
a[16 * i_outer + i_inner] = 0.0f;
...
```

Notice how loopy has automatically generated guard conditionals to make sure the bounds on the old iname are obeyed.

The combination of `loopy.split_iname()`

and
`loopy.prioritize_loops()`

is already good enough to implement what is
commonly called ‘loop tiling’:

```
>>> knl = lp.make_kernel(
... "{ [i,j]: 0<=i,j<n }",
... "out[i,j] = a[j,i]",
... assumptions="n mod 16 = 0 and n >= 1")
>>> knl = lp.split_iname(knl, "i", 16)
>>> knl = lp.split_iname(knl, "j", 16)
>>> knl = lp.prioritize_loops(knl, "i_outer,j_outer,i_inner")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=a_mat_dev)
#define lid(N) ((int) get_local_id(N))
...
for (int i_outer = 0; i_outer <= ((-16 + n) / 16); ++i_outer)
for (int j_outer = 0; j_outer <= ((-16 + n) / 16); ++j_outer)
for (int i_inner = 0; i_inner <= 15; ++i_inner)
for (int j_inner = 0; j_inner <= 15; ++j_inner)
out[n * (16 * i_outer + i_inner) + 16 * j_outer + j_inner] = a[n * (16 * j_outer + j_inner) + 16 * i_outer + i_inner];
...
```

## Implementing Loop Axes (“Inames”)¶

So far, all the loops we have seen loopy implement were `for`

loops. Each
iname in loopy carries a so-called ‘implementation tag’. Iname Implementation Tags shows
all possible choices for iname implementation tags. The important ones are
explained below.

### Unrolling¶

Our first example of an ‘implementation tag’ is `"unr"`

, which performs
loop unrolling. Let us split the main loop of a vector fill kernel into
chunks of 4 and unroll the fixed-length inner loop by setting the inner
loop’s tag to `"unr"`

:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... "a[i] = 0", assumptions="n>=0 and n mod 4 = 0")
>>> orig_knl = knl
>>> knl = lp.split_iname(knl, "i", 4)
>>> knl = lp.tag_inames(knl, dict(i_inner="unr"))
>>> knl = lp.prioritize_loops(knl, "i_outer,i_inner")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#define int_floor_div_pos_b(a,b) ( ( (a) - ( ((a)<0) ? ((b)-1) : 0 ) ) / (b) )
...
for (int i_outer = 0; i_outer <= int_floor_div_pos_b(-4 + n, 4); ++i_outer)
{
a[4 * i_outer] = 0.0f;
a[1 + 4 * i_outer] = 0.0f;
a[2 + 4 * i_outer] = 0.0f;
a[3 + 4 * i_outer] = 0.0f;
}
...
```

`loopy.tag_inames()`

is a new transformation that assigns
implementation tags to kernels. `"unr"`

is the first tag we’ve
explicitly learned about. Technically, though, it is the second–`"for"`

(or, equivalently, *None*), which is the default, instructs loopy to
implement an iname using a for loop.

Unrolling obviously only works for inames with a fixed maximum number of
values, since only a finite amount of code can be generated. Unrolling the
entire *i* loop in the kernel above would not work.

### Split-and-tag¶

Since split-and-tag is such a common combination, `loopy.split_iname()`

provides a shortcut:

```
>>> knl = orig_knl
>>> knl = lp.split_iname(knl, "i", 4, inner_tag="unr")
```

The *outer_tag* keyword argument exists, too, and works just like you would
expect.

### Printing¶

Iname implementation tags are also printed along with the entire kernel:

```
>>> print(knl)
---------------------------------------------------------------------------
...
INAME IMPLEMENTATION TAGS:
i_inner: unr
i_outer: None
---------------------------------------------------------------------------
...
```

### Parallelization¶

Loops are also parallelized in loopy by assigning them parallelizing
implementation tags. In OpenCL, this means that the loop variable
corresponds to either a local ID or a workgroup ID. The implementation tags
for local IDs are `"l.0"`

, `"l.1"`

, `"l.2"`

, and so on. The
corresponding tags for group IDs are `"g.0"`

, `"g.1"`

, `"g.2"`

, and
so on.

Let’s try this out on our vector fill kernel by creating workgroups of size 128:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... "a[i] = 0", assumptions="n>=0")
>>> knl = lp.split_iname(knl, "i", 128,
... outer_tag="g.0", inner_tag="l.0")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
__kernel void __attribute__ ((reqd_work_group_size(128, 1, 1))) loopy_kernel(__global float *__restrict__ a, int const n)
{
if (-1 + -128 * gid(0) + -1 * lid(0) + n >= 0)
a[128 * gid(0) + lid(0)] = 0.0f;
}
```

Loopy requires that workgroup sizes are fixed and constant at compile time. By comparison, the overall execution ND-range size (i.e. the number of workgroups) is allowed to be runtime-variable.

Note how there was no need to specify group or range sizes. Loopy computes those for us:

```
>>> glob, loc = knl.get_grid_size_upper_bounds()
>>> print(glob)
(Aff("[n] -> { [(floor((127 + n)/128))] }"),)
>>> print(loc)
(Aff("[n] -> { [(128)] }"),)
```

Note that this functionality returns internal objects and is not really intended for end users.

### Avoiding Conditionals¶

You may have observed above that we have used a divisibility assumption on
*n* in the kernels above. Without this assumption, loopy would generate
conditional code to make sure no out-of-bounds loop instances are executed.
This here is the original unrolling example without the divisibility
assumption:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... "a[i] = 0", assumptions="n>=0")
>>> orig_knl = knl
>>> knl = lp.split_iname(knl, "i", 4)
>>> knl = lp.tag_inames(knl, dict(i_inner="unr"))
>>> knl = lp.prioritize_loops(knl, "i_outer,i_inner")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
for (int i_outer = 0; i_outer <= -1 + ((3 + n) / 4); ++i_outer)
{
a[4 * i_outer] = 0.0f;
if (-2 + -4 * i_outer + n >= 0)
a[1 + 4 * i_outer] = 0.0f;
if (-3 + -4 * i_outer + n >= 0)
a[2 + 4 * i_outer] = 0.0f;
if (-4 + -4 * i_outer + n >= 0)
a[3 + 4 * i_outer] = 0.0f;
}
...
```

While these conditionals enable the generated code to deal with arbitrary
*n*, they come at a performance cost. Loopy allows generating separate code
for the last iteration of the *i_outer* loop, by using the *slabs* keyword
argument to `split_iname()`

. Since this last iteration of *i_outer* is
the only iteration for which `i_inner + 4*i_outer`

can become larger than
*n*, only the (now separate) code for that iteration contains conditionals,
enabling some cost savings:

```
>>> knl = orig_knl
>>> knl = lp.split_iname(knl, "i", 4, slabs=(0, 1), inner_tag="unr")
>>> knl = lp.set_options(knl, "write_cl")
>>> knl = lp.prioritize_loops(knl, "i_outer,i_inner")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
/* bulk slab for 'i_outer' */
for (int i_outer = 0; i_outer <= -2 + ((3 + n) / 4); ++i_outer)
{
a[4 * i_outer] = 0.0f;
a[1 + 4 * i_outer] = 0.0f;
a[2 + 4 * i_outer] = 0.0f;
a[3 + 4 * i_outer] = 0.0f;
}
/* final slab for 'i_outer' */
{
int const i_outer = -1 + n + -1 * (3 * n / 4);
if (-1 + n >= 0)
{
a[4 * i_outer] = 0.0f;
if (-2 + -4 * i_outer + n >= 0)
a[1 + 4 * i_outer] = 0.0f;
if (-3 + -4 * i_outer + n >= 0)
a[2 + 4 * i_outer] = 0.0f;
if (4 + 4 * i_outer + -1 * n == 0)
a[3 + 4 * i_outer] = 0.0f;
}
}
...
```

## Specifying arguments¶

Kinds: global, constant, value

Types

### Argument shapes¶

Shapes (and automatic finding thereof)

### Implementing Array Axes¶

## Precomputation, Storage, and Temporary Variables¶

The loopy kernels we have seen thus far have consisted only of assignments from one global-memory storage location to another. Sometimes, computation results obviously get reused, so that recomputing them or even just re-fetching them from global memory becomes uneconomical. Loopy has a number of different ways of addressing this need.

### Explicit private temporaries¶

The simplest of these ways is the creation of an explicit temporary variable, as one might do in C or another programming language:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... """
... <float32> a_temp = sin(a[i])
... out1[i] = a_temp {id=out1}
... out2[i] = sqrt(1-a_temp*a_temp) {dep=out1}
... """)
```

The angle brackets `<>`

denote the creation of a temporary. The name of
the temporary may not override inames, argument names, or other names in
the kernel. The name in between the angle brackets is a typename as
understood by the type registry `pyopencl.array`

. To first order,
the conventional `numpy`

scalar types (`numpy.int16`

,
`numpy.complex128`

) will work. (Yes, `loopy`

supports and
generates correct code for complex arithmetic.)

(If you’re wondering, the dependencies above were added to make the doctest produce predictable output.)

The generated code places this variable into what OpenCL calls ‘private’ memory, local to each work item.

```
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out1, out2) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
{
float a_temp;
for (int i = 0; i <= -1 + n; ++i)
{
a_temp = sin(a[i]);
out1[i] = a_temp;
out2[i] = sqrt(1.0f + -1.0f * a_temp * a_temp);
}
}
```

### Type inference for temporaries¶

Most `loopy`

code can be written so as to be type-generic (with types
determined by parameters passed at run time). The same is true for
temporary variables–specifying a type for the variable is optional. As you
can see in the code below, angle brackets alone denote that a temporary
should be created, and the type of the variable will be deduced from the
expression being assigned.

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... """
... <> a_temp = sin(a[i])
... out1[i] = a_temp
... out2[i] = sqrt(1-a_temp*a_temp)
... """)
>>> evt, (out1, out2) = knl(queue, a=x_vec_dev)
```

### Temporaries in local memory¶

In most situations, `loopy`

will automatically deduce whether a given
temporary should be placed into local or private storage. If the variable
is ever written to in parallel and indexed by expressions containing local
IDs, then it is marked as residing in local memory. If this heuristic is
insufficient, `loopy.TemporaryVariable`

instances can be marked
local manually.

Consider the following example:

```
>>> knl = lp.make_kernel(
... "{ [i_outer,i_inner, k]: "
... "0<= 16*i_outer + i_inner <n and 0<= i_inner,k <16}",
... """
... <> a_temp[i_inner] = a[16*i_outer + i_inner] {priority=10}
... out[16*i_outer + i_inner] = sum(k, a_temp[k])
... """)
>>> knl = lp.tag_inames(knl, dict(i_outer="g.0", i_inner="l.0"))
>>> knl = lp.set_temporary_scope(knl, "a_temp", "local")
>>> knl = lp.set_options(knl, "write_cl")
>>> evt, (out,) = knl(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
{
__local float a_temp[16];
float acc_k;
if (-1 + -16 * gid(0) + -1 * lid(0) + n >= 0)
{
a_temp[lid(0)] = a[16 * gid(0) + lid(0)];
acc_k = 0.0f;
}
barrier(CLK_LOCAL_MEM_FENCE) /* for a_temp (insn_0_k_update depends on insn) */;
if (-1 + -16 * gid(0) + -1 * lid(0) + n >= 0)
{
for (int k = 0; k <= 15; ++k)
acc_k = acc_k + a_temp[k];
out[16 * gid(0) + lid(0)] = acc_k;
}
}
```

Observe that *a_temp* was automatically placed in local memory, because
it is written in parallel across values of the group-local iname
*i_inner*. In addition, `loopy`

has emitted a barrier instruction to
achieve the Ordering specified by the instruction dependencies.

(The `priority=10`

attribute was added to make the output of the test
deterministic.)

Note

It is worth noting that it was not necessary to provide a size for the
temporary `a_temp`

. `loopy`

deduced the size to be allocated (16
entries in this case) from the indices being accessed. This works just
as well for 2D and 3D temporaries.

The mechanism for finding accessed indices is the same as described in Argument shapes.

If the size-finding heuristic fails or is impractical to use, the of
the temporary can be specified by explicitly creating a
`loopy.TemporaryVariable`

.

Note that the size of local temporaries must, for now, be a constant at compile time.

### Prefetching¶

The above code example may have struck you as ‘un-loopy-ish’ in the sense
that whether the contents of *a* is loaded into an temporary should be
considered an implementation detail that is taken care of by a
transformation rather than being baked into the code. Indeed, such a
transformation exists in `loopy.add_prefetch()`

:

```
>>> knl = lp.make_kernel(
... "{ [i_outer,i_inner, k]: "
... "0<= 16*i_outer + i_inner <n and 0<= i_inner,k <16}",
... """
... out[16*i_outer + i_inner] = sum(k, a[16*i_outer + i_inner])
... """)
>>> knl = lp.tag_inames(knl, dict(i_outer="g.0", i_inner="l.0"))
>>> knl = lp.set_options(knl, "write_cl")
>>> knl_pf = lp.add_prefetch(knl, "a")
>>> evt, (out,) = knl_pf(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
acc_k = 0.0f;
a_fetch = a[16 * gid(0) + lid(0)];
for (int k = 0; k <= 15; ++k)
acc_k = acc_k + a_fetch;
out[16 * gid(0) + lid(0)] = acc_k;
...
```

This is not the same as our previous code and, in this scenario, a little
bit useless, because each entry of *a* is ‘pre-fetched’, used, and then
thrown away. (But realize that this could perhaps be useful in other
situations when the same entry of *a* is accessed multiple times.)

What’s missing is that we need to tell `loopy`

that we would like to
fetch the *access footprint* of an entire loop–in this case, of *i_inner*,
as the second argument of `loopy.add_prefetch()`

. We thus arrive back
at the same code with a temporary in local memory that we had generated
earlier:

```
>>> knl_pf = lp.add_prefetch(knl, "a", ["i_inner"], default_tag="l.0")
>>> evt, (out,) = knl_pf(queue, a=x_vec_dev)
#define lid(N) ((int) get_local_id(N))
...
if (-1 + -16 * gid(0) + -1 * lid(0) + n >= 0)
acc_k = 0.0f;
if (-1 + -16 * gid(0) + -1 * lid(0) + n >= 0)
a_fetch[lid(0)] = a[16 * gid(0) + lid(0)];
barrier(CLK_LOCAL_MEM_FENCE) /* for a_fetch (insn_k_update depends on a_fetch_rule) */;
if (-1 + -16 * gid(0) + -1 * lid(0) + n >= 0)
{
for (int k = 0; k <= 15; ++k)
acc_k = acc_k + a_fetch[lid(0)];
out[16 * gid(0) + lid(0)] = acc_k;
}
...
```

### Tagged prefetching¶

### Temporaries in global memory¶

`loopy`

supports using temporaries with global storage duration. As with
local and private temporaries, the runtime allocates storage for global
temporaries when the kernel gets executed. The user must explicitly specify that
a temporary is global. To specify that a temporary is global, use
`loopy.set_temporary_scope()`

.

### Substitution rules¶

### Generic Precomputation¶

## Synchronization¶

When one work item executing with others writes to a memory location, OpenCL does not guarantee that other work items will immediately be able to read the memory location and get back the same thing that was written. In order to ensure that memory is consistent across work items, some sort of synchronization operation is used.

`loopy`

supports synchronization in the form of *barriers* or *atomic
operations*.

### Barriers¶

Prior to code generation, `loopy`

performs a check to see that every memory
access is free of dependencies requiring a barrier. The following kinds of
memory access dependencies require a barrier when they involve more than one
work item:

read-after-write

write-after-read

write-after-write.

`loopy`

supports two kinds of barriers:

*Local barriers*ensure consistency of memory accesses to items within*the same*work group. This synchronizes with all instructions in the work group. The type of memory (local or global) may be specified by the`loopy.instruction.BarrierInstruction.mem_kind`

*Global barriers*ensure consistency of memory accesses across*all*work groups, i.e. it synchronizes with every work item executing the kernel. Note that there is no exact equivalent for this kind of barrier in OpenCL. 1

Once a work item has reached a barrier, it waits for everyone that it synchronizes with to reach the barrier before continuing. This means that unless all work items reach the same barrier, the kernel will hang during execution.

### Barrier insertion¶

By default, `loopy`

inserts local barriers between two instructions when it
detects that a dependency involving local memory may occur across work items. To
see this in action, take a look at the section on Temporaries in local memory.

In contrast, `loopy`

will *not* insert global barriers automatically and
instead will report an error if it detects the need for a global barrier. As an
example, consider the following kernel, which attempts to rotate its input to
the right by 1 in parallel:

```
>>> knl = lp.make_kernel(
... "[n] -> {[i] : 0<=i<n}",
... """
... for i
... <>tmp = arr[i] {id=maketmp,dep=*}
... arr[(i + 1) % n] = tmp {id=rotate,dep=*maketmp}
... end
... """,
... [
... lp.GlobalArg("arr", shape=("n",), dtype=np.int32),
... "...",
... ],
... name="rotate_v1",
... assumptions="n mod 16 = 0")
>>> knl = lp.split_iname(knl, "i", 16, inner_tag="l.0", outer_tag="g.0")
```

Note the presence of the write-after-read dependency in global memory. Due to
this, `loopy`

will complain that global barrier needs to be inserted:

```
>>> cgr = lp.generate_code_v2(knl)
Traceback (most recent call last):
...
loopy.diagnostic.MissingBarrierError: Dependency 'rotate depends on maketmp' (for variable 'arr') requires synchronization by a global barrier (add a 'no_sync_with' instruction option to state that no synchronization is needed)
```

The syntax for a inserting a global barrier instruction is
`... gbarrier`

. `loopy`

also supports manually inserting local
barriers. The syntax for a local barrier instruction is `... lbarrier`

.

### Saving temporaries across global barriers¶

For some platforms (currently only PyOpenCL), `loopy`

implements global
barriers by splitting the kernel into a host side kernel and multiple
device-side kernels. On such platforms, it will be necessary to save non-global
temporaries that are live across kernel calls. This section presents an example
of how to use `loopy.save_and_reload_temporaries()`

which is helpful for
that purpose.

Let us start with an example. Consider the kernel from above with a
`... gbarrier`

instruction that has already been inserted.

```
>>> knl = lp.make_kernel(
... "[n] -> {[i] : 0<=i<n}",
... """
... for i
... <>tmp = arr[i] {id=maketmp,dep=*}
... ... gbarrier {id=bar,dep=*maketmp}
... arr[(i + 1) % n] = tmp {id=rotate,dep=*bar}
... end
... """,
... [
... lp.GlobalArg("arr", shape=("n",), dtype=np.int32),
... "...",
... ],
... name="rotate_v2",
... assumptions="n mod 16 = 0")
>>> knl = lp.split_iname(knl, "i", 16, inner_tag="l.0", outer_tag="g.0")
```

Here is what happens when we try to generate code for the kernel:

```
>>> cgr = lp.generate_code_v2(knl)
Traceback (most recent call last):
...
loopy.diagnostic.MissingDefinitionError: temporary variable 'tmp' gets used in subkernel 'rotate_v2_0' without a definition (maybe you forgot to call loopy.save_and_reload_temporaries?)
```

This happens due to the kernel splitting done by `loopy`

. The splitting
happens when the instruction schedule is generated. To see the schedule, we
should call `loopy.get_one_scheduled_kernel()`

:

```
>>> knl = lp.get_one_scheduled_kernel(lp.preprocess_kernel(knl))
>>> print(knl)
---------------------------------------------------------------------------
KERNEL: rotate_v2
---------------------------------------------------------------------------
...
---------------------------------------------------------------------------
SCHEDULE:
0: CALL KERNEL rotate_v2(extra_args=[], extra_inames=[])
1: tmp = arr[i_inner + i_outer*16] {id=maketmp}
2: RETURN FROM KERNEL rotate_v2
3: ... gbarrier
4: CALL KERNEL rotate_v2_0(extra_args=[], extra_inames=[])
5: arr[(1 + i_inner + i_outer*16) % n] = tmp {id=rotate}
6: RETURN FROM KERNEL rotate_v2_0
---------------------------------------------------------------------------
```

As the error message suggests, taking a look at the generated instruction
schedule will show that while `tmp`

is assigned in the first kernel, the
assignment to `tmp`

is not seen by the second kernel. Because the temporary is
in private memory, it does not persist across calls to device kernels (the same
goes for local temporaries).

`loopy`

provides a function called
`loopy.save_and_reload_temporaries()`

for the purpose of handling the
task of saving and restoring temporary values across global barriers. This
function adds instructions to the kernel without scheduling them. That means
that `loopy.get_one_scheduled_kernel()`

needs to be called one more time to
put those instructions into the schedule.

```
>>> knl = lp.get_one_scheduled_kernel(lp.preprocess_kernel(knl))
>>> knl = lp.save_and_reload_temporaries(knl)
>>> knl = lp.get_one_scheduled_kernel(knl) # Schedule added instructions
>>> print(knl)
---------------------------------------------------------------------------
KERNEL: rotate_v2
---------------------------------------------------------------------------
...
---------------------------------------------------------------------------
TEMPORARIES:
tmp: type: np:dtype('int32'), shape: () scope:private
tmp_save_slot: type: np:dtype('int32'), shape: (n // 16, 16), dim_tags: (N1:stride:16, N0:stride:1) scope:global
---------------------------------------------------------------------------
...
---------------------------------------------------------------------------
SCHEDULE:
0: CALL KERNEL rotate_v2(extra_args=['tmp_save_slot'], extra_inames=[])
1: tmp = arr[i_inner + i_outer*16] {id=maketmp}
2: tmp_save_slot[tmp_save_hw_dim_0_rotate_v2, tmp_save_hw_dim_1_rotate_v2] = tmp {id=tmp.save}
3: RETURN FROM KERNEL rotate_v2
4: ... gbarrier
5: CALL KERNEL rotate_v2_0(extra_args=['tmp_save_slot'], extra_inames=[])
6: tmp = tmp_save_slot[tmp_reload_hw_dim_0_rotate_v2_0, tmp_reload_hw_dim_1_rotate_v2_0] {id=tmp.reload}
7: arr[(1 + i_inner + i_outer*16) % n] = tmp {id=rotate}
8: RETURN FROM KERNEL rotate_v2_0
---------------------------------------------------------------------------
```

Here’s an overview of what `loopy.save_and_reload_temporaries()`

actually
does in more detail:

`loopy`

first uses liveness analysis to determine which temporary variables’ live ranges cross a global barrier.For each temporary,

`loopy`

creates a storage slot for the temporary in global memory (see Temporaries in global memory).`loopy`

saves the temporary into its global storage slot whenever it detects the temporary is live-out from a kernel, and reloads the temporary from its global storage slot when it detects that it needs to do so.

The kernel translates into two OpenCL kernels.

```
>>> cgr = lp.generate_code_v2(knl)
>>> print(cgr.device_code())
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1))) rotate_v2(__global int *__restrict__ arr, int const n, __global int *__restrict__ tmp_save_slot)
{
int tmp;
tmp = arr[16 * gid(0) + lid(0)];
tmp_save_slot[16 * gid(0) + lid(0)] = tmp;
}
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1))) rotate_v2_0(__global int *__restrict__ arr, int const n, __global int *__restrict__ tmp_save_slot)
{
int tmp;
tmp = tmp_save_slot[16 * gid(0) + lid(0)];
arr[((1 + lid(0) + gid(0) * 16) % n)] = tmp;
}
```

Now we can execute the kernel.

```
>>> arr = cl.array.arange(queue, 16, dtype=np.int32)
>>> print(arr)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
>>> evt, (out,) = knl(queue, arr=arr)
>>> print(arr)
[15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
```

### Atomic operations¶

`loopy`

supports atomic operations. To use them, both the data on which the
atomic operations work as well as the operations themselves must be suitably
tagged, as in the following example:

```
knl = lp.make_kernel(
"{ [i]: 0<=i<n }",
"out[i%20] = out[i%20] + 2*a[i] {atomic}",
[
lp.GlobalArg("out", dtype, shape=lp.auto, for_atomic=True),
lp.GlobalArg("a", dtype, shape=lp.auto),
"..."
],
assumptions="n>0")
```

- 1
In particular, this is

*not*the same as a call to`barrier(CLK_GLOBAL_MEM_FENCE)`

.

## More complicated programs¶

SCOP

### External Functions¶

Loopy currently supports calls to several commonly used mathematical functions, e.g. exp/log, min/max, sin/cos/tan, sinh/cosh, abs, etc. They may be used in a loopy kernel by simply calling them, e.g.:

```
knl = lp.make_kernel(
"{ [i]: 0<=i<n }",
"""
for i
a[i] = sqrt(i)
end
""")
```

Additionally, all functions of one variable are currently recognized during
code-generation however additional implementation may be required for custom
functions. The full lists of available functions may be found in a the
`TargetBase`

implementation (e.g. `CudaTarget`

)

Custom user functions may be represented using the method described in Functions

### Data-dependent control flow¶

### Conditionals¶

### Snippets of C¶

## Common Problems¶

### A static maximum was not found¶

Attempting to create this kernel results in an error:

```
>>> lp.make_kernel(
... "{ [i]: 0<=i<n }",
... """
... out[i] = 5
... out[0] = 6
... """)
... # Loopy prints the following before this exception:
... # While trying to find shape axis 0 of argument 'out', the following exception occurred:
Traceback (most recent call last):
...
loopy.diagnostic.StaticValueFindingError: a static maximum was not found for PwAff '[n] -> { [(1)] : n <= 1; [(n)] : n >= 2 }'
```

The problem is that loopy cannot find a simple, universally valid expression
for the length of *out* in this case. Notice how the kernel accesses both the
*i*-th and the first element of out. The set notation at the end of the error
message summarizes its best attempt:

If n=1, then out has size 1.

If n>=2, then out has size n.

If n<=0, then out has size 1.

Sure, some of these cases could be coalesced, but that’s beside the point.
Loopy does not know that non-positive values of *n* make no sense. It needs to
be told in order for the error to disappear–note the *assumptions* argument:

```
>>> knl = lp.make_kernel(
... "{ [i]: 0<=i<n }",
... """
... out[i] = 5
... out[0] = 6
... """, assumptions="n>=1")
```

Other situations where this error message can occur include:

Finding size of prefetch/precompute arrays

Finding sizes of argument arrays

Finding workgroup sizes

### Write races¶

This kernel performs a simple transposition of an input matrix:

```
>>> knl = lp.make_kernel(
... "{ [i,j]: 0<=i,j<n }",
... """
... out[j,i] = a[i,j]
... """, assumptions="n>=1", name="transpose")
```

To get it ready for execution on a GPU, we split the *i* and *j* loops into
groups of 16.

```
>>> knl = lp.split_iname(knl, "j", 16, inner_tag="l.1", outer_tag="g.0")
>>> knl = lp.split_iname(knl, "i", 16, inner_tag="l.0", outer_tag="g.1")
```

We’ll also request a prefetch–but suppose we only do so across the
*i_inner* iname:

```
>>> knl = lp.add_prefetch(knl, "a", "i_inner")
```

When we try to run our code, we get the following warning from loopy as a first sign that something is amiss:

```
>>> evt, (out,) = knl(queue, a=a_mat_dev)
Traceback (most recent call last):
...
loopy.diagnostic.WriteRaceConditionWarning: in kernel transpose: instruction 'a_fetch_rule' looks invalid: it assigns to indices based on local IDs, but its temporary 'a_fetch' cannot be made local because a write race across the iname(s) 'j_inner' would emerge. (Do you need to add an extra iname to your prefetch?) (add 'write_race_local(a_fetch_rule)' to silenced_warnings kernel attribute to disable)
```

When we ask to see the code, the issue becomes apparent:

```
>>> knl = lp.set_options(knl, "write_cl")
>>> from warnings import catch_warnings
>>> with catch_warnings():
... filterwarnings("always", category=lp.LoopyWarning)
... evt, (out,) = knl(queue, a=a_mat_dev)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(16, 16, 1))) transpose(__global float const *__restrict__ a, int const n, __global float *__restrict__ out)
{
float a_fetch[16];
...
a_fetch[lid(0)] = a[n * (16 * gid(1) + lid(0)) + 16 * gid(0) + lid(1)];
...
out[n * (16 * gid(0) + lid(1)) + 16 * gid(1) + lid(0)] = a_fetch[lid(0)];
...
}
```

Loopy has a 2D workgroup to use for prefetching of a 1D array. When it
considers making *a_fetch* `local`

(in the OpenCL memory sense of the word)
to make use of parallelism in prefetching, it discovers that a write race
across the remaining axis of the workgroup would emerge.

### Barriers¶

`loopy`

may infer the need for a barrier when it is not necessary. The
`no_sync_with`

instruction attribute can be used to resolve this.

See also `loopy.add_nosync()`

.

TODO

## Obtaining Performance Statistics¶

Arithmetic operations, array accesses, and synchronization operations can all
be counted, which may facilitate performance prediction and optimization of a
`loopy`

kernel.

Note

The functions used in the following examples may produce warnings. If you have already made the filterwarnings and catch_warnings calls used in the examples above, you may want to reset these before continuing. We will temporarily suppress warnings to keep the output clean:

```
>>> from warnings import resetwarnings, filterwarnings
>>> resetwarnings()
>>> filterwarnings('ignore', category=Warning)
```

### Counting operations¶

`loopy.get_op_map()`

provides information on the characteristics and
quantity of arithmetic operations being performed in a kernel. To demonstrate
this, we’ll create an example kernel that performs several operations on arrays
containing different types of data:

```
>>> knl = lp.make_kernel(
... "[n,m,l] -> {[i,k,j]: 0<=i<n and 0<=k<m and 0<=j<l}",
... """
... c[i, j, k] = a[i,j,k]*b[i,j,k]/3.0+a[i,j,k]
... e[i, k] = g[i,k]*(2+h[i,k+1])
... """)
>>> knl = lp.add_and_infer_dtypes(knl,
... dict(a=np.float32, b=np.float32, g=np.float64, h=np.float64))
```

Note that loopy will infer the data types for arrays `c`

and `e`

from the
information provided. Now we will count the operations:

```
>>> op_map = lp.get_op_map(knl, subgroup_size=32)
>>> print(lp.stringify_stats_mapping(op_map))
Op(np:dtype('float32'), add, subgroup) : ...
```

Each line of output will look roughly like:

```
Op(np:dtype('float32'), add, subgroup) : [l, m, n] -> { l * m * n : l > 0 and m > 0 and n > 0 }
```

`loopy.get_op_map()`

returns a `loopy.ToCountMap`

of **{**
`loopy.Op`

**:** `islpy.PwQPolynomial`

**}**. A
`loopy.ToCountMap`

holds a dictionary mapping any type of key to an
arithmetic type. In this case, the `islpy.PwQPolynomial`

holds the
number of operations matching the characteristics of the `loopy.Op`

specified in the key (in terms of the `loopy.LoopKernel`

*inames*). `loopy.Op`

attributes include:

dtype: A

`loopy.LoopyType`

or`numpy.dtype`

that specifies the data type operated on.name: A

`str`

that specifies the kind of arithmetic operation as*add*,*sub*,*mul*,*div*,*pow*,*shift*,*bw*(bitwise), etc.

One way to evaluate these polynomials is with `islpy.eval_with_dict()`

:

```
>>> param_dict = {'n': 256, 'm': 256, 'l': 8}
>>> from loopy.statistics import CountGranularity as CG
>>> f32add = op_map[lp.Op(np.float32, 'add', CG.SUBGROUP)].eval_with_dict(param_dict)
>>> f32div = op_map[lp.Op(np.float32, 'div', CG.SUBGROUP)].eval_with_dict(param_dict)
>>> f32mul = op_map[lp.Op(np.float32, 'mul', CG.SUBGROUP)].eval_with_dict(param_dict)
>>> f64add = op_map[lp.Op(np.float64, 'add', CG.SUBGROUP)].eval_with_dict(param_dict)
>>> f64mul = op_map[lp.Op(np.float64, 'mul', CG.SUBGROUP)].eval_with_dict(param_dict)
>>> i32add = op_map[lp.Op(np.int32, 'add', CG.SUBGROUP)].eval_with_dict(param_dict)
>>> print("%i\n%i\n%i\n%i\n%i\n%i" %
... (f32add, f32div, f32mul, f64add, f64mul, i32add))
524288
524288
524288
65536
65536
65536
```

`loopy.ToCountMap`

provides member functions that facilitate filtering,
grouping, and evaluating subsets of the counts. Suppose we want to know the
total number of 32-bit operations of any kind. We can easily count these
using functions `loopy.ToCountMap.filter_by()`

and
`loopy.ToCountMap.eval_and_sum()`

:

```
>>> filtered_op_map = op_map.filter_by(dtype=[np.float32])
>>> f32op_count = filtered_op_map.eval_and_sum(param_dict)
>>> print(f32op_count)
1572864
```

We could accomplish the same goal using `loopy.ToCountMap.group_by()`

,
which produces a `loopy.ToCountMap`

that contains the same counts grouped
together into keys containing only the specified fields:

```
>>> op_map_dtype = op_map.group_by('dtype')
>>> print(lp.stringify_stats_mapping(op_map_dtype))
Op(np:dtype('float32'), None, None) : ...
>>> f32op_count = op_map_dtype[lp.Op(dtype=np.float32)
... ].eval_with_dict(param_dict)
>>> print(f32op_count)
1572864
```

The lines of output above might look like:

```
Op(np:dtype('float32'), None, None) : [m, l, n] -> { 3 * m * l * n : m > 0 and l > 0 and n > 0 }
Op(np:dtype('float64'), None, None) : [m, l, n] -> { 2 * m * n : m > 0 and l > 0 and n > 0 }
```

See the reference page for `loopy.ToCountMap`

and `loopy.Op`

for
more information on these functions.

### Counting memory accesses¶

`loopy.get_mem_access_map()`

provides information on the number and
characteristics of memory accesses performed in a kernel. To demonstrate this,
we’ll continue using the kernel from the previous example:

```
>>> mem_map = lp.get_mem_access_map(knl, subgroup_size=32)
>>> print(lp.stringify_stats_mapping(mem_map))
MemAccess(global, np:dtype('float32'), {}, {}, load, a, None, subgroup) : ...
```

Each line of output will look roughly like:

```
MemAccess(global, np:dtype('float32'), {}, {}, load, a, None, subgroup) : [m, l, n] -> { 2 * m * l * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float32'), {}, {}, load, b, None, subgroup) : [m, l, n] -> { m * l * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float32'), {}, {}, store, c, None, subgroup) : [m, l, n] -> { m * l * n : m > 0 and l > 0 and n > 0 }
```

`loopy.get_mem_access_map()`

returns a `loopy.ToCountMap`

of **{**
`loopy.MemAccess`

**:** `islpy.PwQPolynomial`

**}**.
`loopy.MemAccess`

attributes include:

mtype: A

`str`

that specifies the memory type accessed as**global**or**local**dtype: A

`loopy.LoopyType`

or`numpy.dtype`

that specifies the data type accessed.lid_strides: A

`dict`

of**{**`int`

**:**`pymbolic.primitives.Expression`

or`int`

**}**that specifies local strides for each local id in the memory access index. Local ids not found will not be present in`lid_strides.keys()`

. Uniform access (i.e. work-items within a sub-group access the same item) is indicated by setting`lid_strides[0]=0`

, but may also occur when no local id 0 is found, in which case the 0 key will not be present in lid_strides.gid_strides: A

`dict`

of**{**`int`

**:**`pymbolic.primitives.Expression`

or`int`

**}**that specifies global strides for each global id in the memory access index. Global ids not found will not be present in`gid_strides.keys()`

.direction: A

`str`

that specifies the direction of memory access as**load**or**store**.variable: A

`str`

that specifies the variable name of the data accessed.

We can evaluate these polynomials using `islpy.eval_with_dict()`

:

```
>>> f64ld_g = mem_map[lp.MemAccess('global', np.float64, {}, {}, 'load', 'g', None, CG.SUBGROUP)
... ].eval_with_dict(param_dict)
>>> f64st_e = mem_map[lp.MemAccess('global', np.float64, {}, {}, 'store', 'e', None, CG.SUBGROUP)
... ].eval_with_dict(param_dict)
>>> f32ld_a = mem_map[lp.MemAccess('global', np.float32, {}, {}, 'load', 'a', None, CG.SUBGROUP)
... ].eval_with_dict(param_dict)
>>> f32st_c = mem_map[lp.MemAccess('global', np.float32, {}, {}, 'store', 'c', None, CG.SUBGROUP)
... ].eval_with_dict(param_dict)
>>> print("f32 ld a: %i\nf32 st c: %i\nf64 ld g: %i\nf64 st e: %i" %
... (f32ld_a, f32st_c, f64ld_g, f64st_e))
f32 ld a: 1048576
f32 st c: 524288
f64 ld g: 65536
f64 st e: 65536
```

`loopy.ToCountMap`

also makes it easy to determine the total amount
of data moved in bytes. Suppose we want to know the total amount of global
memory data loaded and stored. We can produce a map with just this information
using `loopy.ToCountMap.to_bytes()`

and `loopy.ToCountMap.group_by()`

:

```
>>> bytes_map = mem_map.to_bytes()
>>> print(lp.stringify_stats_mapping(bytes_map))
MemAccess(global, np:dtype('float32'), {}, {}, load, a, None, subgroup) : ...
>>> global_ld_st_bytes = bytes_map.filter_by(mtype=['global']
... ).group_by('direction')
>>> print(lp.stringify_stats_mapping(global_ld_st_bytes))
MemAccess(None, None, None, None, load, None, None, None) : ...
MemAccess(None, None, None, None, store, None, None, None) : ...
>>> loaded = global_ld_st_bytes[lp.MemAccess(direction='load')
... ].eval_with_dict(param_dict)
>>> stored = global_ld_st_bytes[lp.MemAccess(direction='store')
... ].eval_with_dict(param_dict)
>>> print("bytes loaded: %s\nbytes stored: %s" % (loaded, stored))
bytes loaded: 7340032
bytes stored: 2621440
```

The lines of output above might look like:

```
MemAccess(global, np:dtype('float32'), {}, {}, load, a, None, subgroup) : [m, l, n] -> { 8 * m * l * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float32'), {}, {}, load, b, None, subgroup) : [m, l, n] -> { 4 * m * l * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float32'), {}, {}, store, c, None, subgroup) : [m, l, n] -> { 4 * m * l * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float64'), {}, {}, load, g, None, subgroup) : [m, l, n] -> { 8 * m * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float64'), {}, {}, load, h, None, subgroup) : [m, l, n] -> { 8 * m * n : m > 0 and l > 0 and n > 0 }
MemAccess(global, np:dtype('float64'), {}, {}, store, e, None, subgroup) : [m, l, n] -> { 8 * m * n : m > 0 and l > 0 and n > 0 }
```

One can see how these functions might be useful in computing, for example, achieved memory bandwidth in byte/sec or performance in FLOP/sec.

Since we have not tagged any of the inames or parallelized the kernel across
work-items (which would have produced iname tags), `loopy.get_mem_access_map()`

finds no local or global id strides, leaving `lid_strides`

and `gid_strides`

empty for each memory access. Now we’ll parallelize the kernel and count the array
accesses again. The resulting `islpy.PwQPolynomial`

will be more complicated
this time.

```
>>> knl_consec = lp.split_iname(knl, "k", 128,
... outer_tag="l.1", inner_tag="l.0")
>>> mem_map = lp.get_mem_access_map(knl_consec, subgroup_size=32)
>>> print(lp.stringify_stats_mapping(mem_map))
MemAccess(global, np:dtype('float32'), {0: 1, 1: 128}, {}, load, a, None, workitem) : ...
MemAccess(global, np:dtype('float32'), {0: 1, 1: 128}, {}, load, b, None, workitem) : ...
MemAccess(global, np:dtype('float32'), {0: 1, 1: 128}, {}, store, c, None, workitem) : ...
MemAccess(global, np:dtype('float64'), {0: 1, 1: 128}, {}, load, g, None, workitem) : ...
MemAccess(global, np:dtype('float64'), {0: 1, 1: 128}, {}, load, h, None, workitem) : ...
MemAccess(global, np:dtype('float64'), {0: 1, 1: 128}, {}, store, e, None, workitem) : ...
```

With this parallelization, consecutive work-items will access consecutive array elements in memory. The polynomials are a bit more complicated now due to the parallelization, but when we evaluate them, we see that the total number of array accesses has not changed:

```
>>> f64ld_g = mem_map[lp.MemAccess('global', np.float64, {0: 1, 1: 128}, {}, 'load', 'g', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> f64st_e = mem_map[lp.MemAccess('global', np.float64, {0: 1, 1: 128}, {}, 'store', 'e', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> f32ld_a = mem_map[lp.MemAccess('global', np.float32, {0: 1, 1: 128}, {}, 'load', 'a', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> f32st_c = mem_map[lp.MemAccess('global', np.float32, {0: 1, 1: 128}, {}, 'store', 'c', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> print("f32 ld a: %i\nf32 st c: %i\nf64 ld g: %i\nf64 st e: %i" %
... (f32ld_a, f32st_c, f64ld_g, f64st_e))
f32 ld a: 1048576
f32 st c: 524288
f64 ld g: 65536
f64 st e: 65536
```

To produce *nonconsecutive* array accesses with local id 0 stride greater than 1,
we’ll switch the inner and outer tags in our parallelization of the kernel:

```
>>> knl_nonconsec = lp.split_iname(knl, "k", 128,
... outer_tag="l.0", inner_tag="l.1")
>>> mem_map = lp.get_mem_access_map(knl_nonconsec, subgroup_size=32)
>>> print(lp.stringify_stats_mapping(mem_map))
MemAccess(global, np:dtype('float32'), {0: 128, 1: 1}, {}, load, a, None, workitem) : ...
MemAccess(global, np:dtype('float32'), {0: 128, 1: 1}, {}, load, b, None, workitem) : ...
MemAccess(global, np:dtype('float32'), {0: 128, 1: 1}, {}, store, c, None, workitem) : ...
MemAccess(global, np:dtype('float64'), {0: 128, 1: 1}, {}, load, g, None, workitem) : ...
MemAccess(global, np:dtype('float64'), {0: 128, 1: 1}, {}, load, h, None, workitem) : ...
MemAccess(global, np:dtype('float64'), {0: 128, 1: 1}, {}, store, e, None, workitem) : ...
```

With this parallelization, consecutive work-items will access *nonconsecutive*
array elements in memory. The total number of array accesses still has not
changed:

```
>>> f64ld_g = mem_map[lp.MemAccess('global', np.float64, {0: 128, 1: 1}, {}, 'load', 'g', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> f64st_e = mem_map[lp.MemAccess('global', np.float64, {0: 128, 1: 1}, {}, 'store', 'e', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> f32ld_a = mem_map[lp.MemAccess('global', np.float32, {0: 128, 1: 1}, {}, 'load', 'a', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> f32st_c = mem_map[lp.MemAccess('global', np.float32, {0: 128, 1: 1}, {}, 'store', 'c', None, CG.WORKITEM)
... ].eval_with_dict(param_dict)
>>> print("f32 ld a: %i\nf32 st c: %i\nf64 ld g: %i\nf64 st e: %i" %
... (f32ld_a, f32st_c, f64ld_g, f64st_e))
f32 ld a: 1048576
f32 st c: 524288
f64 ld g: 65536
f64 st e: 65536
```

We can also filter using an arbitrary test function using
`loopy.ToCountMap.filter_by_func()`

. This is useful when the filter
criteria are more complicated than a simple list of allowable values:

```
>>> def f(key):
... from loopy.types import to_loopy_type
... return key.dtype == to_loopy_type(np.float32) and \
... key.lid_strides[0] > 1
>>> count = mem_map.filter_by_func(f).eval_and_sum(param_dict)
>>> print(count)
2097152
```

### Counting synchronization events¶

`loopy.get_synchronization_map()`

counts the number of synchronization
events per **work-item** in a kernel. First, we’ll call this function on the
kernel from the previous example:

```
>>> sync_map = lp.get_synchronization_map(knl)
>>> print(lp.stringify_stats_mapping(sync_map))
kernel_launch : { 1 }
```

We can evaluate this polynomial using `islpy.eval_with_dict()`

:

```
>>> launch_count = sync_map["kernel_launch"].eval_with_dict(param_dict)
>>> print("Kernel launch count: %s" % launch_count)
Kernel launch count: 1
```

Now to make things more interesting, we’ll create a kernel with barriers:

```
>>> knl = lp.make_kernel(
... "[] -> {[i,k,j]: 0<=i<50 and 1<=k<98 and 0<=j<10}",
... [
... """
... c[i,j,k] = 2*a[i,j,k]
... e[i,j,k] = c[i,j,k+1]+c[i,j,k-1]
... """
... ], [
... lp.TemporaryVariable("c", dtype=None, shape=(50, 10, 99)),
... "..."
... ])
>>> knl = lp.add_and_infer_dtypes(knl, dict(a=np.int32))
>>> knl = lp.split_iname(knl, "k", 128, inner_tag="l.0")
>>> code, _ = lp.generate_code(lp.preprocess_kernel(knl))
>>> print(code)
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
__kernel void __attribute__ ((reqd_work_group_size(97, 1, 1))) loopy_kernel(__global int const *__restrict__ a, __global int *__restrict__ e)
{
__local int c[50 * 10 * 99];
{
int const k_outer = 0;
for (int j = 0; j <= 9; ++j)
for (int i = 0; i <= 49; ++i)
{
barrier(CLK_LOCAL_MEM_FENCE) /* for c (insn rev-depends on insn_0) */;
c[990 * i + 99 * j + lid(0) + 1] = 2 * a[980 * i + 98 * j + lid(0) + 1];
barrier(CLK_LOCAL_MEM_FENCE) /* for c (insn_0 depends on insn) */;
e[980 * i + 98 * j + lid(0) + 1] = c[990 * i + 99 * j + 1 + lid(0) + 1] + c[990 * i + 99 * j + -1 + lid(0) + 1];
}
}
}
```

In this kernel, when a work-item performs the second instruction it uses data
produced by *different* work-items during the first instruction. Because of this,
barriers are required for correct execution, so loopy inserts them. Now we’ll
count the barriers using `loopy.get_synchronization_map()`

:

```
>>> sync_map = lp.get_synchronization_map(knl)
>>> print(lp.stringify_stats_mapping(sync_map))
barrier_local : { 1000 }
kernel_launch : { 1 }
```

Based on the kernel code printed above, we would expect each work-item to
encounter 50x10x2 barriers, which matches the result from
`loopy.get_synchronization_map()`

. In this case, the number of barriers
does not depend on any inames, so we can pass an empty dictionary to
`islpy.eval_with_dict()`

.