Installation¶
Option 0: Static Binary¶
If you would just like to experiment with loopy's code transformation abilities, the easiest way to get loopy is to download a statically-linked Linux binary.
See Places on the web related to Loopy for details.
Option 1: From Source, no PyOpenCL integration¶
This command should install loopy:
pip install loopy
You may need to run this with sudo. If you don’t already have pip, run this beforehand:
curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
For a more manual installation, download the source, unpack it, and say:
python setup.py install
You may also clone its git repository:
git clone --recursive https://github.com/inducer/loopy.git
Option 2: From Conda Forge, with PyOpenCL integration¶
This set of instructions is intended for 64-bit Linux and macOS computers:
Make sure your system has the basics to build software.
On Debian derivatives (Ubuntu and many more), installing build-essential should do the trick. Everywhere else, just making sure you have the g++ package should be enough.
Install miniforge.
export CONDA=/WHERE/YOU/INSTALLED/miniforge3
If you accepted the default location, this should work:
export CONDA=$HOME/miniforge3
$CONDA/bin/conda create -n dev
source $CONDA/bin/activate dev
conda install git pip pocl islpy pyopencl (Linux)
or
conda install osx-pocl-opencl git pip pocl islpy pyopencl (OS X)
Type the following command:
pip install git+https://github.com/inducer/loopy
Next time you want to use loopy, just run the following command:
source /WHERE/YOU/INSTALLED/miniforge3/bin/activate dev
You may also like to add this to a startup file (like $HOME/.bashrc) or create an alias for it.
See the PyOpenCL installation instructions for options regarding OpenCL drivers.
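As an optional sanity check (not part of the official instructions), you can verify from the activated environment that both loopy and PyOpenCL import correctly:
python -c "import loopy; import pyopencl; print(loopy.__file__)"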
User-visible Changes¶
See also Loopy Language Versioning.
Version 2018.1¶
Note
This version is currently under development. You can get snapshots from loopy's git repository.
Version 2016.1.1¶
Add loopy.chunk_iname().
Add unused:l, unused:g, and like:INAME iname tag notation.
Release automatically built, self-contained Linux binary.
Many fixes and improvements
Docs improvements
Version 2016.1¶
Initial release.
Licensing¶
Loopy is licensed to you under the MIT/X Consortium license:
Copyright (c) 2009-17 Andreas Klöckner and Contributors.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Frequently Asked Questions¶
Is Loopy specific to OpenCL?¶
No, absolutely not. You can switch to a different code generation target (subclasses of loopy.TargetBase) by using (say):
knl = knl.copy(target=loopy.CudaTarget())
Also see Targets. (Py)OpenCL right now has the best support for running kernels directly out of the box, but that could easily be expanded. Open an issue to discuss what you need.
In the meantime, you can generate code simply by saying:
cg_result = loopy.generate_code_v2(knl)
print(cg_result.host_code())
print(cg_result.device_code())
Additionally, for C-based languages, header definitions are available via:
loopy.generate_header(knl)
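For instance, here is a short, self-contained sketch (the kernel is made up for illustration) that retargets a trivial kernel to plain C via loopy.CTarget and prints the generated code:
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# Build a trivial kernel and target plain C instead of OpenCL.
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "out[i] = 2*a[i]",
    target=lp.CTarget())
knl = lp.add_and_infer_dtypes(knl, {"a": np.float64})

cg_result = lp.generate_code_v2(knl)
print(cg_result.device_code())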
For what types of codes does loopy work well?¶
Any array-based/number-crunching code whose control flow is not too data dependent should be expressible. For example:
Sparse matrix-vector multiplies, despite data-dependent control flow (varying row lengths, say), are easy and natural to express (a rough sketch follows below).
Looping until convergence, on the other hand, is an example of something that can't be expressed easily. Such checks would have to be performed outside of loopy code.
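As a rough sketch of the sparse matrix-vector case mentioned above (array names such as rowstarts, colindices, and values are made up for illustration), the inner loop's trip count is read from data via a nested domain. The snippet only constructs and prints the kernel; generating and running code would additionally need shape information for the data-dependent accesses.
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# CSR-style row loop: the length of the j loop depends on data read at run time.
knl = lp.make_kernel(
    ["{[i]: 0<=i<nrows}",
     "{[j]: 0<=j<row_len}"],
    """
    <> row_len = rowstarts[i+1] - rowstarts[i]
    y[i] = sum(j, values[rowstarts[i] + j] * x[colindices[rowstarts[i] + j]])
    """,
    assumptions="nrows>=1")
knl = lp.add_and_infer_dtypes(knl, {
    "rowstarts,colindices": np.int32,
    "values,x": np.float64})

print(knl)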
Can I see some examples?¶
Loopy has a ton of tests, and right now, those (in the test/ subdirectory of the source distribution) are probably the best source of examples.
Here's a more complicated example of loopy code:
import numpy as np
import pyopencl as cl
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2 # noqa: F401
cl_ctx = cl.create_some_context()
knl = lp.make_kernel(
    "{[ictr,itgt,idim]: "
    "0<=itgt<ntargets "
    "and 0<=ictr<ncenters "
    "and 0<=idim<ambient_dim}",
    """
    for itgt
        for ictr
            <> dist_sq = sum(idim,
                (tgt[idim,itgt] - center[idim,ictr])**2)
            <> in_disk = dist_sq < (radius[ictr]*1.05)**2
            <> matches = (
                (in_disk
                    and qbx_forced_limit == 0)
                or (in_disk
                    and qbx_forced_limit != 0
                    and qbx_forced_limit * center_side[ictr] > 0)
                )
            <> post_dist_sq = dist_sq if matches else HUGE
        end
        <> min_dist_sq, <> min_ictr = argmin(ictr, ictr, post_dist_sq)
        tgt_to_qbx_center[itgt] = min_ictr if min_dist_sq < HUGE else -1
    end
    """)

knl = lp.fix_parameters(knl, ambient_dim=2)
knl = lp.add_and_infer_dtypes(knl, {
    "tgt,center,radius,HUGE": np.float32,
    "center_side,qbx_forced_limit": np.int32,
    })

lp.auto_test_vs_ref(knl, cl_ctx, knl, parameters={
    "HUGE": 1e20, "ncenters": 200, "ntargets": 300,
    "qbx_forced_limit": 1})
This example is included in the loopy distribution as examples/python/find-centers.py.
What this does is find nearby “centers” satisfying some criteria for an array of points (“targets”).
Specifying dependencies for groups of instructions is cumbersome. Help?¶
You can now specify instruction ID prefixes and dependencies for groups of instructions, like this:
with {id_prefix=init_m}
    <> m[0] = ...
    m[1] = ...
    m[2] = ...
end

with {id_prefix=update_m,dep=init_m*}
    m[0] = m[0] + ...
    m[1] = m[1] + ...
    m[2] = m[2] * ...
end

with {dep=update_m*}
    output[i, j, 0] = 0.25*m[0]
    output[i, j, 1] = 0.25*m[1]
    output[i, j, 2] = 0.25*m[2]
end
Added in version 2016.2.1: (There was a bug in prior versions that kept this from working.)
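The snippet above is a fragment. Here is a hedged, concrete variant (the kernel and all names are made up for illustration) that can be run end to end:
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# Grouped instruction IDs via id_prefix, with wildcard dependencies (dep=init_m*).
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    """
    for i
        with {id_prefix=init_m}
            <> m0 = a[i]
            <> m1 = 2*a[i]
        end
        with {id_prefix=update_m,dep=init_m*}
            m0 = m0 + 1
            m1 = m1 + 1
        end
        with {dep=update_m*}
            out[i] = m0 + m1
        end
    end
    """)
knl = lp.add_and_infer_dtypes(knl, {"a": np.float32})

print(lp.generate_code_v2(knl).device_code())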
What types of transformations can I do?¶
This list is always growing, but here are a few pointers:
Unroll
Use loopy.tag_inames() with the "unr" tag. Unrolled loops must have a fixed size. (See either loopy.split_iname() or loopy.fix_parameters().)
Stride changes (Row/column/something major)
Use loopy.tag_array_axes() with (e.g.) stride:17 or N1,N2,N0 to determine how each axis of an array is realized.
Prefetch
Use loopy.add_prefetch().
Reorder loops
Precompute subexpressions:
Use a Substitution Rule to assign a name to a subexpression, using perhaps loopy.assignment_to_subst() or loopy.extract_subst(). Then use loopy.precompute() to create an (array or scalar) temporary with precomputed values.
Tile:
Use loopy.split_iname() to produce enough loops, then use loopy.prioritize_loops() to set the ordering.
Fix constants
Parallelize (across cores)
Use loopy.tag_inames() with the "g.0", "g.1" (and so on) tags.
Parallelize (across vector lanes)
Use loopy.tag_inames() with the "l.0", "l.1" (and so on) tags.
Affinely map loop domains
Texture-based data access
Use loopy.change_arg_to_image() to use texture memory for an argument.
Kernel Fusion
Use loopy.fuse_kernels().
Explicit-SIMD Vectorization
Use loopy.tag_inames() with the "vec" iname tag. Note that the corresponding axis of an array must also be tagged using the "vec" array axis tag (using loopy.tag_array_axes()) in order for vector code to be generated. Vectorized loops (and array axes) must have a fixed size. (See either loopy.split_iname() or loopy.fix_parameters(), along with loopy.split_array_axis().)
Reuse of Temporary Storage
Use loopy.alias_temporaries() to reduce the size of intermediate storage.
SoA ↔ AoS
Use loopy.tag_array_axes() with the "sep" array axis tag to generate separate arrays for each entry of a short, fixed-length array axis. Separated array axes must have a fixed size. (See loopy.split_array_axis().)
Realization of Instruction-level parallelism
Use loopy.tag_inames() with the "ilp" tag. ILP loops must have a fixed size. (See either loopy.split_iname() or loopy.fix_parameters().)
Type inference
Convey assumptions:
Use loopy.assume() to say, e.g. loopy.assume(knl, "N mod 4 = 0") or loopy.assume(knl, "N > 0").
Perform batch computations
Use loopy.to_batched().
Interface with your own library functions
See Function Interface for details.
Loop collapse
Use loopy.join_inames().
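To make the list above a bit more concrete, here is a hedged sketch (the kernel and tile size are made up for illustration) that chains several of these transformations: splitting inames, tagging them for parallelization, and staging data through a prefetch.
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# A simple transpose kernel to transform.
knl = lp.make_kernel(
    "{[i,j]: 0<=i,j<n}",
    "out[j, i] = a[i, j]")
knl = lp.add_and_infer_dtypes(knl, {"a": np.float32})

# Tile the loops 16x16 and map the tiles onto work groups/items...
knl = lp.split_iname(knl, "i", 16, outer_tag="g.0", inner_tag="l.0")
knl = lp.split_iname(knl, "j", 16, outer_tag="g.1", inner_tag="l.1")
# ...and stage the accessed block of "a" through a prefetch temporary.
knl = lp.add_prefetch(knl, "a", ["i_inner", "j_inner"], default_tag="l.auto")

print(lp.generate_code_v2(knl).device_code())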
In what sense does Loopy support vectorization?¶
There are really two ways in which the OpenCL/CUDA model of computation exposes vectorization:
“SIMT”: The user writes scalar program instances and either the compiler or the hardware joins the individual program instances into vectors of a hardware-given length for execution.
“Short vectors”: This type of vectorization is based on vector types, e.g. float4, which support arithmetic with implicit vector semantics as well as a number of ‘intrinsic’ functions.
Loopy supports both. The first one, SIMT, is accessible by tagging inames with, e.g., "l.0". Accessing the second one requires using both execution- and data-reshaping capabilities in loopy. To start with, you need an array that has an axis with the length of the desired vector. If that's not yet available, you may use loopy.split_array_axis() to produce one. Similarly, you need an iname whose bounds match those of the desired vector length. Again, if you don't already have one, loopy.split_iname() will easily produce one. Lastly, both the array axis and the iname need the implementation tag "vec".
Here is an example of this machinery in action:
import numpy as np
import pyopencl as cl
import pyopencl.array
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2 # noqa: F401
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
n = 15 * 10**6
a = cl.array.arange(queue, n, dtype=np.float32)
knl = lp.make_kernel(
    "{ [i]: 0<=i<n }",
    "out[i] = 2*a[i]")
knl = lp.set_options(knl, write_code=True)
knl = lp.split_iname(knl, "i", 4, slabs=(0, 1), inner_tag="vec")
knl = lp.split_array_axis(knl, "a,out", axis_nr=0, count=4)
knl = lp.tag_array_axes(knl, "a,out", "C,vec")
knl(queue, a=a.reshape(-1, 4), n=n)
Note how the example slices off the last ‘slab’ of iterations to ensure that the bulk of the iteration does not require conditionals which would prevent successful vectorization. This generates the following code:
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#define int_floor_div_pos_b(a,b) ( ( (a) - ( ((a)<0) ? ((b)-1) : 0 ) ) / (b) )
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global float4 const *__restrict__ a, int const n, __global float4 *__restrict__ out)
{
  /* bulk slab for 'i_outer' */
  for (int i_outer = 0; i_outer <= -2 + int_floor_div_pos_b(3 + n, 4); ++i_outer)
    out[i_outer] = 2.0f * a[i_outer];
  /* final slab for 'i_outer' */
  {
    int const i_outer = -1 + n + -1 * int_floor_div_pos_b(3 * n, 4);
    if (-1 + n >= 0)
    {
      if (-1 + -4 * i_outer + n >= 0)
        out[i_outer].s0 = 2.0f * a[i_outer].s0;
      if (-1 + -4 * i_outer + -1 + n >= 0)
        out[i_outer].s1 = 2.0f * a[i_outer].s1;
      if (-1 + -4 * i_outer + -1 * 2 + n >= 0)
        out[i_outer].s2 = 2.0f * a[i_outer].s2;
      if (-1 + -4 * i_outer + -1 * 3 + n >= 0)
        out[i_outer].s3 = 2.0f * a[i_outer].s3;
    }
  }
}
What is the story with language versioning?¶
The idea is to keep supporting multiple versions at a time. There's a tension in loopy between the need to build code that keeps working unchanged for some number of years, and the need for the language to evolve, not just as a research vehicle, but also to respond to emerging needs in applications and hardware.
The idea is not to support all versions indefinitely, merely to allow users to upgrade on their own schedule on the scale of a couple of years. Warnings about needing to upgrade would get noisier as a version nears deprecation. In a way, it is intended to be a version of Python's __future__ flags, which, in my opinion, have served the language tremendously well.
One can also obtain the current language version programmatically: loopy.MOST_RECENT_LANGUAGE_VERSION.
But pinning your code to that would mean choosing to not use the
potentially valuable guarantee to keep existing code working unchanged
for a while. Instead, it might be wiser to just grab the version of the
language current at the time of writing the code.
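As a brief illustration (the kernel below is made up), a language version can be pinned either with the module-scope import used in the examples above or per kernel via the lang_version argument to loopy.make_kernel():
import loopy as lp

# Option 1: module-scope flag, as in the examples elsewhere in this document.
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

# Option 2: pin the language version for a single kernel.
knl = lp.make_kernel(
    "{[i]: 0<=i<n}",
    "out[i] = a[i]",
    lang_version=(2018, 2))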
Uh-oh. I got a scheduling error. Any hints?¶
Make sure that dependencies between instructions are as you intend. Use loopy.show_dependency_graph() to check.
There's a heuristic that tries to help find dependencies. If there's only a single write to a variable, then it adds dependencies from all readers to the writer. In your case, that's actually counterproductive, because it creates a circular dependency, hence the scheduling issue. So you'll have to turn that off, like so:
knl = lp.make_kernel(
    "{ [t]: 0 <= t < T}",
    """
    <> xt = x[t] {id=fetch,dep=*}
    x[t + 1] = xt * 0.1 {dep=fetch}
    """)
Make sure that your loops are correctly nested.
Print the kernel to make sure all instructions are within the set of inames you intend them to be in.
One iname is one for loop. For sequential loops, one iname corresponds to exactly one for loop in generated code. Loopy will not generate multiple loops from one iname.
Make sure that your loops are correctly nested. The scheduler will try to be as helpful as it can in telling you where it got stuck.
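For completeness, here is a small hedged sketch (reusing the time-stepping fragment above) of the two checks mentioned: printing the kernel and viewing the dependency graph. Note that loopy.show_dependency_graph() relies on Graphviz being installed and opens an external viewer.
import numpy as np
import loopy as lp
from loopy.version import LOOPY_USE_LANGUAGE_VERSION_2018_2  # noqa: F401

knl = lp.make_kernel(
    "{ [t]: 0 <= t < T}",
    """
    <> xt = x[t] {id=fetch,dep=*}
    x[t + 1] = xt * 0.1 {dep=fetch}
    """)
knl = lp.add_and_infer_dtypes(knl, {"x": np.float64})

# Check instruction inames and dependencies in textual form.
print(knl)

# Visualize the instruction dependency graph.
lp.show_dependency_graph(knl)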
Citing Loopy¶
If you use loopy for your work and find its approach helpful, please consider citing the following article.
A. Klöckner. Loo.py: transformation-based code generation for GPUs and CPUs. Proceedings of ARRAY ‘14: ACM SIGPLAN Workshop on Libraries, Languages, and Compilers for Array Programming. Edinburgh, Scotland.
Here’s a BibTeX entry for your convenience:
@inproceedings{kloeckner_loopy_2014,
author = {{Kl{\"o}ckner}, Andreas},
title = "{Loo.py: transformation-based code~generation for GPUs and CPUs}",
booktitle = "{Proceedings of ARRAY `14: ACM SIGPLAN Workshop
on Libraries, Languages, and Compilers for Array Programming}",
year = 2014,
publisher = "{Association for Computing Machinery}",
address = "{Edinburgh, Scotland.}",
doi = "{10.1145/2627373.2627387}",
}
Getting help¶
Email the friendly folks on the loopy mailing list.
Acknowledgments¶
Work on loopy was supported in part by
the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0003963,
the US Navy ONR, under grant number N00014-14-1-0117, and
the US National Science Foundation under grant numbers DMS-1418961, CCF-1524433, DMS-1654756, SHF-1911019, and OAC-1931577.
AK also gratefully acknowledges a hardware gift from Nvidia Corporation.
The views and opinions expressed herein do not necessarily reflect those of the funding agencies.
Cross-References to Other Documentation¶
- class numpy.int16¶
See numpy.generic.
- class numpy.complex128¶
See numpy.generic.