Pre-built binaries vs. performance

Ludovic Courtès — January 31, 2018

Guix follows a transparent source/binary deployment model: it downloads pre-built binaries when they’re available (like apt-get or yum) and otherwise falls back to building from source. Most of the time the project’s build farm provides binaries so that users don’t have to spend resources building from source. Pre-built binaries may be missing when you’re installing a custom package, or when the build farm hasn’t caught up yet. However, deployment of binaries is often seen as incompatible with high-performance requirements: binaries are “generic”, so how can they take advantage of cutting-edge HPC hardware? In this post, we explore the issue and solutions.

Building portable binaries

CPU architectures are a moving target. The x86_64 instruction set architecture (ISA), for instance, has a whole family of extensions, AVX and AVX2 being the most obvious ones. These extensions are often critical for the performance of computational programs. For example, fused multiply-add (FMA), which can have a significant impact on some applications, was only introduced in some relatively recent AMD and Intel processors, and new versions of these extensions are being deployed. Each x86_64 machine typically supports a subset of these extensions.

Package distributions that provide pre-built binaries—Guix, but also Debian, Fedora, CentOS, and so on—have one important constraint: they must provide binaries that work on all the computers for the target architecture. Therefore, those binaries should target the common denominator of that architecture. For x86_64, that means not using instructions from AVX & co. Put this way, pre-built binaries look unattractive from an HPC viewpoint.

Run-time selection

In Guix land, this has been the topic of lengthy discussions over the years. Actually, distro developers know that this issue is not new, and that this concern is not specific to HPC. Many pieces of software, from video players to the C library, can (and do!) greatly benefit from some of these ISA extensions. How do they address this dilemma of providing portable binaries without compromising on performance?

The solution is to select the most appropriate implementation of “hot” code at run time. Video players like MPlayer and number-crunching software like the GNU multiprecision library have used this “trick” since their inception: using the cpuid instruction, they can determine at run time which ISA extensions are available and branch to routines optimized for the available extensions. Many other applications include similar ad-hoc mechanisms.
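As a rough illustration of such ad-hoc dispatch, here is a minimal sketch assuming GCC or Clang on x86. The names sum, sum_generic, and sum_avx2 are made up for this example and are not taken from MPlayer or GMP. It uses the compiler's __builtin_cpu_supports built-in, which queries cpuid, and simply keeps the generic loop on other platforms.

```c
#include <stddef.h>

/* Portable baseline: runs on every CPU of the target architecture. */
static double sum_generic(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same loop, but the compiler may emit AVX2 instructions for this
   function only; the rest of the file stays portable. */
__attribute__((target("avx2")))
static double sum_avx2(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}
#endif

/* Hypothetical entry point: picks an implementation on first use. */
double sum(const double *v, size_t n)
{
    static double (*impl)(const double *, size_t) = NULL;

    if (impl == NULL) {
        impl = sum_generic;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        /* __builtin_cpu_supports relies on cpuid under the hood. */
        if (__builtin_cpu_supports("avx2"))
            impl = sum_avx2;
#endif
    }
    return impl(v, n);
}
```

Callers only ever see sum(); the choice of routine is hidden behind a function pointer. Note that the lazy initialization in this sketch is not synchronized, which a real library would have to handle.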

GNU, which runs on 100% of the Top 500 supercomputers, now provides generic mechanisms for this in the toolchain. First, the GNU C Library (glibc) has always had vendor-provided optimized implementations of its string and math routines, selected at run time.

The underlying mechanisms have been generalized in glibc in the form of indirect functions or “IFUNCs”, which work along these lines:

  • Application developers provide libc with a resolver. A resolver is a function that selects the “best” optimized implementation for the CPU at hand and returns it. As an example, glibc’s resolver for memcmp looks like this.
  • Resolvers are called at load time by the run-time linker, ld.so, once and for all, so selection happens only once.
  • To simplify the use of IFUNCs, GCC provides an ifunc attribute to decorate functions that have an associated resolver.
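The steps above can be sketched as follows, assuming GCC (or a compatible compiler) with glibc on x86. The function count_zeros and its resolver are hypothetical names chosen for illustration; this is not glibc's own code. On other platforms the sketch falls back to a plain function.

```c
#include <stdlib.h>  /* size_t; on glibc systems this also defines __GLIBC__ */

typedef size_t count_fn(const unsigned char *, size_t);

/* Portable baseline implementation. */
static size_t count_zeros_generic(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        count += (buf[i] == 0);
    return count;
}

#if defined(__GLIBC__) && defined(__GNUC__) \
    && (defined(__x86_64__) || defined(__i386__))

/* Variant that the compiler is allowed to vectorize with AVX2. */
__attribute__((target("avx2")))
static size_t count_zeros_avx2(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        count += (buf[i] == 0);
    return count;
}

/* Resolver: invoked by ld.so, returns the implementation to bind to.
   It may run before constructors, hence the explicit
   __builtin_cpu_init call. */
static count_fn *resolve_count_zeros(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2")
        ? count_zeros_avx2
        : count_zeros_generic;
}

/* The public symbol has no body: the ifunc attribute names its resolver. */
size_t count_zeros(const unsigned char *buf, size_t len)
    __attribute__((ifunc("resolve_count_zeros")));

#else

/* Fallback for toolchains without IFUNC support. */
size_t count_zeros(const unsigned char *buf, size_t len)
{
    return count_zeros_generic(buf, len);
}

#endif
```

Compared with a hand-written function pointer, the indirection is resolved by the linker, so callers pay no per-call check and need no initialization code of their own.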

IFUNCs are starting to be used outside glibc proper, for instance by the Nettle cryptographic library (code), though there are currently restrictions to be aware of.

Better yet, since version 6, GCC supports automatic function multi-versioning (FMV): the target_clones function attribute allows users to instruct GCC to generate several optimized variants of a function and to generate a resolver to select the right one based on the CPUID.
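A minimal sketch of what this looks like in source code, assuming GCC 6 or later with glibc on x86. The saxpy function and the MULTIVERSIONED macro are illustrative names, and the choice of "avx2" as the only extra target is arbitrary. Elsewhere the macro expands to nothing and a single generic version is built.

```c
#include <stdlib.h>  /* size_t; on glibc systems this also defines __GLIBC__ */

#if defined(__GLIBC__) && defined(__GNUC__) && (__GNUC__ >= 6) \
    && (defined(__x86_64__) || defined(__i386__))
/* GCC emits one clone per listed target, plus a resolver that picks
   among them when the program is loaded. */
# define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
# define MULTIVERSIONED
#endif

/* The body is written once; no hand-written variants or resolver. */
MULTIVERSIONED
void saxpy(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The "default" entry keeps a portable version around, so the resulting binary still runs on CPUs without AVX2.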

This LWN article nicely shows how code can benefit from FMV. The article links to this script to automatically annotate FMV candidates with target_clones; there’s even a tutorial!

Problem solved?

When upstream software lacks run-time selection

It turns out that not all software packages, especially scientific software, use these techniques. Some do—for instance, OpenBLAS supports run-time selection when compiled with DYNAMIC_ARCH=1—but many don’t. For example, FFTW insists on being compiled with -mtune=native and provides configuration options to statically select CPU optimizations; ATLAS optimizes itself for the CPU it is being built on. We can always say that the “right” solution would be to “fix” these packages upstream so that they use run-time selection, but how do we handle these today in Guix?

Depending on the situation, we have so far resorted to different solutions. ATLAS so heavily depends on configure-time tuning that we simply don’t distribute pre-built binaries for it. Instead, running guix package -i atlas unconditionally builds it locally, as upstream authors intended.

For FFTW, BLIS, and other packages where optimizations are selected at configure-time, we simply build the generic version, like Debian and others do. This is the most unsatisfactory situation: we have portable binaries at the cost of degraded performance.

However, we also programmatically provide optimized package variants for these. For BLIS, we have a make-blis function that we use to generate a blis-haswell package optimized for Intel Haswell CPUs, a blis-knl package, and so on. Likewise, for FFTW, we have an fftw-avx package that uses AVX2-specific optimizations. We don’t provide binaries for these optimized packages, but users can install the variant that corresponds to their machine.

Dependency graph rewriting

Having optimized package variants is nice, but how can users take advantage of them? For instance, the julia and octave packages depend on the generic (unoptimized) fftw package—this allows us to distribute pre-built binaries. What if you want Octave to use the AVX2-optimized FFTW?

One option is to rewrite the dependency graph of Octave, so that occurrences of the generic fftw package are replaced by fftw-avx. This can be done from the command line using the --with-input option:

guix package -i octave --with-input=fftw=fftw-avx

The above command does that graph rewriting. Consequently, it ends up building from source the part of the Octave dependency graph that depends on fftw. Not ideal because rebuilding can take a while, but readily applicable.

When the library and its replacement (fftw and fftw-avx here) are known to have the same application binary interface (ABI), as is the case here, another option is to simply let the run-time linker pick up the optimized version instead of the unoptimized one. This can be done by setting the LD_LIBRARY_PATH environment variable:

LD_LIBRARY_PATH=`guix build fftw-avx`/lib octave

Here Octave will pick the optimized libfftw.so. (/etc/ld.so.conf would be another possibility but the glibc package in Guix currently ignores that file since that could lead to loading binary-incompatible .so files when using Guix on a distro other than GuixSD.)

Where to go from here?

As we have seen, Guix does not sacrifice performance. In the worst case, it requires users to explicitly install optimized package variants, which get built from source. This is not as simple as we would like though, so people have been looking for ways to improve the situation.

The first option is to work with upstream software developers to introduce run-time selection—an option that benefits everyone. Of course, that’s something we can always do in the background, but it takes time. It does work in the long run though; for instance, BLIS has recently introduced support for run-time selection. Like Clear Linux, we can also start applying function multi-versioning based on compiler feedback in key packages and use that as a starting point when discussing with upstream.

Some have proposed making CPU features a first-class concept in Guix. That way, one could install with, say, --cpu-features=avx2 and end up downloading binaries or building binaries optimized for AVX2. The downsides are that this would be a big change, and that it’s not clear how to tell package build systems to enable such or such optimizations in a generic way.

Another option on the table, inspired by Fedora and Debian, is to provide a mechanism that makes it easy for users to switch between implementations of an interface without needing recompilation. This could work for BLAS implementations or MPI implementations that are known to have the same ABI. Similarly, having support for something similar to ld.so.conf would help—though it would have to be per-user rather than be limited to root, to retain the freedom that Guix provides to users. Such dynamic software composition could work against the reproducibility mantra of Guix though, since software behavior would depend on site-specific configuration not under Guix control.

With its transparent source/binary deployment model, Guix offers both the advantages of pre-built binaries à la apt-get and, when needed, those of built-from-source, optimized software à la EasyBuild or Spack. The challenge ahead will be to streamline that experience.
