Installation¶
Option 0: Static Binary¶
If you would just like to experiment with loopy's code transformation abilities, the easiest way to get loopy is to download a statically-linked Linux binary.
See Places on the web related to Loopy for details.
Option 1: From Source, no PyOpenCL integration¶
This command should install loopy:
pip install loopy
You may need to run this with sudo. If you don’t already have pip, run this beforehand:
curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
For a more manual installation, download the source, unpack it, and say:
python setup.py install
You may also clone its git repository:
git clone --recursive https://github.com/inducer/loopy.git
Option 2: From Conda Forge, with PyOpenCL integration¶
This set of instructions is intended for 64-bit Linux and macOS computers:
Make sure your system has the basics to build software.
On Debian derivatives (Ubuntu and many more), installing build-essential should do the trick. Everywhere else, just making sure you have the g++ package should be enough.
Install miniforge.
export CONDA=/WHERE/YOU/INSTALLED/miniforge3
If you accepted the default location, this should work:
export CONDA=$HOME/miniforge3
$CONDA/bin/conda create -n dev
source $CONDA/bin/activate dev
conda install git pip pocl islpy pyopencl (Linux)
or
conda install osx-pocl-opencl git pip pocl islpy pyopencl (OS X)
Type the following command:
pip install git+https://github.com/inducer/loopy
Next time you want to use loopy, just run the following command:
source /WHERE/YOU/INSTALLED/miniforge3/bin/activate dev
You may also like to add this to a startup file (like $HOME/.bashrc) or create an alias for it.
See the PyOpenCL installation instructions for options regarding OpenCL drivers.
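As an optional sanity check (not part of the official instructions), you can verify from the activated environment that both loopy and PyOpenCL import correctly:
python -c "import loopy; import pyopencl; print(loopy.__file__)"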
User-visible Changes¶
See also Loopy Language Versioning.
Version 2018.1¶
Note
This version is currently under development. You can get snapshots from loopy's git repository.
Version 2016.1.1¶
Add loopy.chunk_iname().
Add unused:l, unused:g, and like:INAME iname tag notation.
Release automatically built, self-contained Linux binary.
Many fixes and improvements
Docs improvements
Version 2016.1¶
Initial release.
Licensing¶
Loopy is licensed to you under the MIT/X Consortium license:
Copyright (c) 2009-17 Andreas Klöckner and Contributors.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Frequently Asked Questions¶
Is Loopy specific to OpenCL?¶
No, absolutely not. You can switch to a different code generation target (subclasses of loopy.TargetBase) by using (say):
knl = knl.copy(target=loopy.CudaTarget())
Also see Targets. (Py)OpenCL right now has the best support for running kernels directly out of the box, but that could easily be expanded. Open an issue to discuss what you need.
In the meantime, you can generate code simply by saying:
cg_result = loopy.generate_code_v2(knl)
print(cg_result.host_code())
print(cg_result.device_code())
Additionally, for C-based languages, header definitions are available via:
loopy.generate_header(knl)
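For instance, here is a short, self-contained sketch (the kernel is made up for illustration) that retargets a trivial kernel to plain C via loopy.CTarget and prints the generated code:
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# Build a trivial kernel and target plain C instead of OpenCL.
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "out[i] = 2*a[i]",
    target=lp.CTarget())
knl = lp.add_and_infer_dtypes(knl, {"a": np.float64})

cg_result = lp.generate_code_v2(knl)
print(cg_result.device_code())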
For what types of codes does loopy work well?¶
Any array-based/number-crunching code whose control flow is not too data dependent should be expressible. For example:
Sparse matrix-vector multiplies, despite data-dependent control flow (varying row lengths, say), are easy and natural to express (a rough sketch follows below).
Looping until convergence, on the other hand, is an example of something that can't be expressed easily. Such checks would have to be performed outside of loopy code.
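As a rough sketch of the sparse matrix-vector case mentioned above (array names such as rowstarts, colindices, and values are made up for illustration), the inner loop's trip count is read from data via a nested domain. The snippet only constructs and prints the kernel; generating and running code would additionally need shape information for the data-dependent accesses.
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# CSR-style row loop: the length of the j loop depends on data read at run time.
knl = lp.make_kernel(
    ["{[i]: 0<=i<nrows}",
     "{[j]: 0<=j<row_len}"],
    """
    <> row_len = rowstarts[i+1] - rowstarts[i]
    y[i] = sum(j, values[rowstarts[i] + j] * x[colindices[rowstarts[i] + j]])
    """,
    assumptions="nrows>=1")
knl = lp.add_and_infer_dtypes(knl, {
    "rowstarts,colindices": np.int32,
    "values,x": np.float64})

print(knl)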
Can I see some examples?¶
Loopy has a ton of tests, and right now, those (in the test/ subdirectory of the source distribution) are probably the best source of examples.
Here's a more complicated example of loopy code:
import numpy as np
import pyopencl as cl
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2 # noqa: F401
cl_ctx = cl.create_some_context()
knl = lp.make_kernel(
    "{[ictr,itgt,idim]: "
    "0<=itgt<ntargets "
    "and 0<=ictr<ncenters "
    "and 0<=idim<ambient_dim}",
    """
    for itgt
        for ictr
            <> dist_sq = sum(idim,
                (tgt[idim,itgt] - center[idim,ictr])**2)
            <> in_disk = dist_sq < (radius[ictr]*1.05)**2
            <> matches = (
                (in_disk
                    and qbx_forced_limit == 0)
                or (in_disk
                    and qbx_forced_limit != 0
                    and qbx_forced_limit * center_side[ictr] > 0)
                )
            <> post_dist_sq = dist_sq if matches else HUGE
        end
        <> min_dist_sq, <> min_ictr = argmin(ictr, ictr, post_dist_sq)
        tgt_to_qbx_center[itgt] = min_ictr if min_dist_sq < HUGE else -1
    end
    """)

knl = lp.fix_parameters(knl, ambient_dim=2)
knl = lp.add_and_infer_dtypes(knl, {
    "tgt,center,radius,HUGE": np.float32,
    "center_side,qbx_forced_limit": np.int32,
    })

lp.auto_test_vs_ref(knl, cl_ctx, knl, parameters={
    "HUGE": 1e20, "ncenters": 200, "ntargets": 300,
    "qbx_forced_limit": 1})
This example is included in the loopy distribution as examples/python/find-centers.py.
What this does is find nearby “centers” satisfying some criteria for an array of points (“targets”).
Specifying dependencies for groups of instructions is cumbersome. Help?¶
You can now specify instruction ID prefixes and dependencies for groups of instructions, like this:
with {id_prefix=init_m}
    <> m[0] = ...
    m[1] = ...
    m[2] = ...
end

with {id_prefix=update_m,dep=init_m*}
    m[0] = m[0] + ...
    m[1] = m[1] + ...
    m[2] = m[2] * ...
end

with {dep=update_m*}
    output[i, j, 0] = 0.25*m[0]
    output[i, j, 1] = 0.25*m[1]
    output[i, j, 2] = 0.25*m[2]
end
Added in version 2016.2.1: (There was a bug in prior versions that kept this from working.)
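The snippet above is a fragment. Here is a hedged, concrete variant (the kernel and all names are made up for illustration) that can be run end to end:
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# Grouped instruction IDs via id_prefix, with wildcard dependencies (dep=init_m*).
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    """
    for i
        with {id_prefix=init_m}
            <> m0 = a[i]
            <> m1 = 2*a[i]
        end
        with {id_prefix=update_m,dep=init_m*}
            m0 = m0 + 1
            m1 = m1 + 1
        end
        with {dep=update_m*}
            out[i] = m0 + m1
        end
    end
    """)
knl = lp.add_and_infer_dtypes(knl, {"a": np.float32})

print(lp.generate_code_v2(knl).device_code())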
What types of transformations can I do?¶
This list is always growing, but here are a few pointers:
Unroll
Use loopy.tag_inames() with the "unr" tag. Unrolled loops must have a fixed size. (See either loopy.split_iname() or loopy.fix_parameters().)
Stride changes (Row/column/something major)
Use loopy.tag_array_axes() with (e.g.) stride:17 or N1,N2,N0 to determine how each axis of an array is realized.
Prefetch
Use loopy.add_prefetch().
Reorder loops
Precompute subexpressions:
Use a Substitution Rule to assign a name to a subexpression, using perhaps loopy.assignment_to_subst() or loopy.extract_subst(). Then use loopy.precompute() to create an (array or scalar) temporary with precomputed values.
Tile:
Use loopy.split_iname() to produce enough loops, then use loopy.prioritize_loops() to set the ordering.
Fix constants
Parallelize (across cores)
Use loopy.tag_inames() with the "g.0", "g.1" (and so on) tags.
Parallelize (across vector lanes)
Use loopy.tag_inames() with the "l.0", "l.1" (and so on) tags.
Affinely map loop domains
Texture-based data access
Use loopy.change_arg_to_image() to use texture memory for an argument.
Kernel Fusion
Use loopy.fuse_kernels().
Explicit-SIMD Vectorization
Use loopy.tag_inames() with the "vec" iname tag. Note that the corresponding axis of an array must also be tagged using the "vec" array axis tag (using loopy.tag_array_axes()) in order for vector code to be generated. Vectorized loops (and array axes) must have a fixed size. (See either loopy.split_iname() or loopy.fix_parameters(), along with loopy.split_array_axis().)
Reuse of Temporary Storage
Use loopy.alias_temporaries() to reduce the size of intermediate storage.
SoA ↔ AoS
Use loopy.tag_array_axes() with the "sep" array axis tag to generate separate arrays for each entry of a short, fixed-length array axis. Separated array axes must have a fixed size. (See loopy.split_array_axis().)
Realization of Instruction-level parallelism
Use loopy.tag_inames() with the "ilp" tag. ILP loops must have a fixed size. (See either loopy.split_iname() or loopy.fix_parameters().)
Type inference
Convey assumptions:
Use loopy.assume() to say, e.g. loopy.assume(knl, "N mod 4 = 0") or loopy.assume(knl, "N > 0").
Perform batch computations
Use loopy.to_batched().
Interface with your own library functions
See Function Interface for details.
Loop collapse
Use loopy.join_inames().
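To make the list above a bit more concrete, here is a hedged sketch (the kernel and tile size are made up for illustration) that chains several of these transformations: splitting inames, tagging them for parallelization, and staging data through a prefetch.
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# A simple transpose kernel to transform.
knl = lp.make_kernel(
    "{[i,j]: 0<=i,j<n}",
    "out[j, i] = a[i, j]")
knl = lp.add_and_infer_dtypes(knl, {"a": np.float32})

# Tile the loops 16x16 and map the tiles onto work groups/items...
knl = lp.split_iname(knl, "i", 16, outer_tag="g.0", inner_tag="l.0")
knl = lp.split_iname(knl, "j", 16, outer_tag="g.1", inner_tag="l.1")
# ...and stage the accessed block of "a" through a prefetch temporary.
knl = lp.add_prefetch(knl, "a", ["i_inner", "j_inner"], default_tag="l.auto")

print(lp.generate_code_v2(knl).device_code())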
In what sense does Loopy support vectorization?¶
There are really two ways in which the OpenCL/CUDA model of computation exposes vectorization:
“SIMT”: The user writes scalar program instances and either the compiler or the hardware joins the individual program instances into vectors of a hardware-given length for execution.
“Short vectors”: This type of vectorization is based on vector types, e.g. float4, which support arithmetic with implicit vector semantics as well as a number of ‘intrinsic’ functions.
Loopy supports both. The first one, SIMT, is accessible by tagging inames with, e.g., "l.0". Accessing the second one requires using both execution- and data-reshaping capabilities in loopy. To start with, you need an array that has an axis with the length of the desired vector. If that's not yet available, you may use loopy.split_array_axis() to produce one. Similarly, you need an iname whose bounds match those of the desired vector length. Again, if you don't already have one, loopy.split_iname() will easily produce one. Lastly, both the array axis and the iname need the implementation tag "vec".
Here is an example of this machinery in action:
import numpy as np
import pyopencl as cl
import pyopencl.array
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2 # noqa: F401
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
n = 15 * 10**6
a = cl.array.arange(queue, n, dtype=np.float32)
knl = lp.make_kernel(
    "{ [i]: 0<=i<n }",
    "out[i] = 2*a[i]")
knl = lp.set_options(knl, write_code=True)
knl = lp.split_iname(knl, "i", 4, slabs=(0, 1), inner_tag="vec")
knl = lp.split_array_axis(knl, "a,out", axis_nr=0, count=4)
knl = lp.tag_array_axes(knl, "a,out", "C,vec")
knl(queue, a=a.reshape(-1, 4), n=n)
Note how the example slices off the last ‘slab’ of iterations to ensure that the bulk of the iteration does not require conditionals which would prevent successful vectorization. This generates the following code:
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#define int_floor_div_pos_b(a,b) ( ( (a) - ( ((a)<0) ? ((b)-1) : 0 ) ) / (b) )
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float4 const *__restrict__ a, int const n, __global float4 *__restrict__ out)
{
  /* bulk slab for 'i_outer' */
  for (int i_outer = 0; i_outer <= -2 + int_floor_div_pos_b(3 + n, 4); ++i_outer)
    out[i_outer] = 2.0f * a[i_outer];
  /* final slab for 'i_outer' */
  {
    int const i_outer = -1 + n + -1 * int_floor_div_pos_b(3 * n, 4);
    if (-1 + n >= 0)
    {
      if (-1 + -4 * i_outer + n >= 0)
        out[i_outer].s0 = 2.0f * a[i_outer].s0;
      if (-1 + -4 * i_outer + -1 + n >= 0)
        out[i_outer].s1 = 2.0f * a[i_outer].s1;
      if (-1 + -4 * i_outer + -1 * 2 + n >= 0)
        out[i_outer].s2 = 2.0f * a[i_outer].s2;
      if (-1 + -4 * i_outer + -1 * 3 + n >= 0)
        out[i_outer].s3 = 2.0f * a[i_outer].s3;
    }
  }
}
What is the story with language versioning?¶
The idea is to keep supporting multiple versions at a time. There's a tension in loopy between the need to build code that keeps working unchanged for some number of years, and the need for the language to evolve, not just as a research vehicle, but also to respond to emerging needs in applications and hardware.
The idea is not to support all versions indefinitely, merely to allow users to upgrade on their own schedule on the scale of a couple of years. Warnings about needing to upgrade would get noisier as a version nears deprecation. In a way, it is intended to be a version of Python's __future__ flags, which, in my opinion, have served the language tremendously well.
One can also obtain the current language version programmatically: loopy.MOST_RECENT_LANGUAGE_VERSION.
But pinning your code to that would mean choosing to not use the
potentially valuable guarantee to keep existing code working unchanged
for a while. Instead, it might be wiser to just grab the version of the
language current at the time of writing the code.
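As a brief illustration (the kernel below is made up), a language version can be pinned either with the module-scope import used in the examples above or per kernel via the lang_version argument to loopy.make_kernel():
import loopy as lp

# Option 1: module-scope flag, as in the examples elsewhere in this document.
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# Option 2: pin the language version for a single kernel.
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "out[i] = a[i]",
    lang_version=(2018, 2))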
Uh-oh. I got a scheduling error. Any hints?¶
Make sure that dependencies between instructions are as you intend. Use loopy.show_dependency_graph() to check.
There's a heuristic that tries to help find dependencies. If there's only a single write to a variable, then it adds dependencies from all readers to the writer. In your case, that's actually counterproductive, because it creates a circular dependency, hence the scheduling issue. So you'll have to turn that off, like so:
knl = lp.make_kernel(
    "{ [t]: 0 <= t < T}",
    """
    <> xt = x[t] {id=fetch,dep=*}
    x[t + 1] = xt * 0.1 {dep=fetch}
    """)
Make sure that your loops are correctly nested.
Print the kernel to make sure all instructions are within the set of inames you intend them to be in.
One iname is one for loop. For sequential loops, one iname corresponds to exactly one for loop in generated code. Loopy will not generate multiple loops from one iname.
Make sure that your loops are correctly nested. The scheduler will try to be as helpful as it can in telling you where it got stuck.
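For completeness, here is a small hedged sketch (reusing the time-stepping fragment above) of the two checks mentioned: printing the kernel and viewing the dependency graph. Note that loopy.show_dependency_graph() relies on Graphviz being installed and opens an external viewer.
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

knl = lp.make_kernel(
    "{ [t]: 0 <= t < T}",
    """
    <> xt = x[t] {id=fetch,dep=*}
    x[t + 1] = xt * 0.1 {dep=fetch}
    """)
knl = lp.add_and_infer_dtypes(knl, {"x": np.float64})

# Check instruction inames and dependencies in textual form.
print(knl)

# Visualize the instruction dependency graph.
lp.show_dependency_graph(knl)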
Citing Loopy¶
If you use loopy for your work and find its approach helpful, please consider citing the following article.
A. Klöckner. Loo.py: transformation-based code generation for GPUs and CPUs. Proceedings of ARRAY ‘14: ACM SIGPLAN Workshop on Libraries, Languages, and Compilers for Array Programming. Edinburgh, Scotland.
Here’s a BibTeX entry for your convenience:
@inproceedings{kloeckner_loopy_2014,
author = {{Kl{\"o}ckner}, Andreas},
title = "{Loo.py: transformation-based code~generation for GPUs and CPUs}",
booktitle = "{Proceedings of ARRAY `14: ACM SIGPLAN Workshop
on Libraries, Languages, and Compilers for Array Programming}",
year = 2014,
publisher = "{Association for Computing Machinery}",
address = "{Edinburgh, Scotland.}",
doi = "{10.1145/2627373.2627387}",
}
Getting help¶
Email the friendly folks on the loopy mailing list.
Acknowledgments¶
Work on loopy was supported in part by
the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0003963,
the US Navy ONR, under grant number N00014-14-1-0117, and
the US National Science Foundation under grant numbers DMS-1418961, CCF-1524433, DMS-1654756, SHF-1911019, and OAC-1931577.
AK also gratefully acknowledges a hardware gift from Nvidia Corporation.
The views and opinions expressed herein do not necessarily reflect those of the funding agencies.
Cross-References to Other Documentation¶
- class numpy.int16¶
See numpy.generic.
- class numpy.complex128¶
See numpy.generic.