CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"

By: CppCon


Uploaded on 09/27/2015


A primary use case for C++ is low latency, low overhead, high performance code. But C++ does not give you these things for free, it gives you the tools to control these things and achieve them where needed. How do you realize this potential of the language? How do you tune your C++ code and achieve the necessary performance metrics?

This talk will walk through the process of tuning C++ code from benchmarking to performance analysis. It will focus on small scale performance problems ranging from loop kernels to data structures and algorithms. It will show you how to write benchmarks that effectively measure different aspects of performance even in the face of advanced compiler optimizations and bedeviling modern CPUs. It will also show how to analyze the performance of your benchmark, understand its behavior as well as the CPU's behavior, and use a wide array of tools available to isolate and pinpoint performance problems. The tools and some processor details will be Linux and x86 specific, but the techniques and concepts should be broadly applicable.
Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
Videos Filmed & Edited by Bash Films: http://www.BashFilms.com

Comments (13):

By anonymous    2017-09-20

There is no golden rule for this. Unfortunately, the performance of code like this is notoriously hard to predict. The most important thing to take away from that is

Measure everything!

Now to what's going on in your code: As others noted correctly, we can observe that isAllowed gets compiled to a function using branches, while isAllowed2 ends up being branchless.

Branches are interesting when talking about performance: They are somewhere between literally free and ridiculously expensive, inclusively. This is due to a CPU component called the branch predictor. It tries to predict which branch your control flow will take and makes the CPU speculatively execute it. If it guesses right, the branch is free. If it guesses wrong, the branch is expensive. A great and detailed explanation of that concept, including some numbers, can be found in this answer.

So now we need to decide whether we want the branching or the branchless version. In general, neither need be faster than the other! It really depends on how well your target CPUs can predict the branches, which of course depends on the actual input. (Choosing whether to compile a function to a branching or a branchless result is thus a hard problem for compilers as they don't know what CPUs the binary will be run on, nor what kind of input data to expect. See for example this blogpost.)

So if your benchmark was actually correct, we have determined that on your CPU the branches are too hard to predict to beat the relatively cheap integer arithmetic. This may also be due to the small number of test cases: the branch predictor cannot learn a pattern from so few invocations. But again, we cannot just call it one way or the other; we have to look at the actual performance in the specific case.

As noted in the comments, the execution time is somewhat short for a good measurement, I see huge deviations on my machine. For information about micro benchmarking you can have a look at this talk, it's harder than one might think.

Also, as Martin Bonner helpfully noticed, your two functions don't do the same thing, you'd have to fix that for a correct benchmark of course.

Original Thread

By anonymous    2017-09-20

There is test::black_box() (link to old docs) which is still unstable (as is the whole test crate). This function takes a value of an arbitrary type and returns the same value again. So it is basically the identity function. "Oh well, now that's very useful, isn't it?" you might ask ironically.

But there is something special: the value which is passed through is hidden from LLVM (the thing doing nearly all optimizations in Rust right now)! It's truly a black box, as LLVM doesn't know anything about this piece of code. And without knowing anything, LLVM can't prove that optimizations won't change the program's behavior. Thus: no optimizations.

How does it do that? Let's look at the definition:

pub fn black_box<T>(dummy: T) -> T {
    // we need to "use" the argument in some way LLVM can't
    // introspect.
    unsafe { asm!("" : : "r"(&dummy)) }
    dummy
}

I'd be lying if I were to pretend I understand this piece of code completely, but it goes something like that: we insert empty inline assembly (not a single instruction) but tell Rust (which tells LLVM) that this piece of assembly uses the variable dummy. This makes it impossible for the optimizer to reason about the variable. Stupid compiler, so easy to deceive, muhahahaha! If you want another explanation, Chandler Carruth explained the dark magic at CppCon 2015.

So how do you use it now? Just use it for some kind of value... anything that goes through black_box() needs to be calculated. The original snippet is missing here, but given the next paragraph it was presumably along these lines:

    test::black_box(my_function());

The return value of my_function() needs to be calculated, because the compiler can't prove it's useless! So the function call won't be removed. Note, however, that you have to use unstable features (either the test crate, or inline asm to write the function yourself) or use FFI. I wouldn't ship this kind of code in a production library, but it's certainly useful for testing purposes!

Original Thread

By anonymous    2017-09-20

It seems a bad idea to use the if there, to me.

You are right. Whether or not idx >= idx_max, it will be under idx_max after idx %= idx_max. If idx < idx_max, it will be unchanged, whether the if is followed or not.

While you might think branching around the modulo could save time, the real culprit, I'd say, is that when branches are mispredicted, modern pipelined CPUs have to flush their pipeline, and that costs a relatively large amount of time. Better to avoid the branch altogether than to save an integer modulo, which costs roughly as much time as an integer division.

EDIT: It turns out that the modulus is pretty slow vs. the branch, as suggested by others here. Here's a guy examining this exact same question: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" (suggested in another SO question linked to in another answer to this question).

This guy writes compilers, and thought it would be faster without the branch; but his benchmarks proved him wrong. Even when the branch was taken only 20% of the time, it tested faster.

Another reason not to have the if: One less line of code to maintain, and for someone else to puzzle out what it means. The guy in the above link actually created a "faster modulus" macro. IMHO, this or an inline function is the way to go for performance-critical applications, because your code will be ever so much more understandable without the branch, but will execute as fast.

Finally, the guy in the above video is planning to make this optimization known to compiler writers. Thus, the if may eventually be added for you by the compiler even when it isn't in your code. Once that happens, the mod alone will do.

Original Thread

By anonymous    2018-02-12

Assume that you have to implement a static_vector<T, N> class, which is a fixed capacity container that entirely lives on the stack and never allocates, and exposes an std::vector-like interface. (Boost provides boost::static_vector.)

Considering that we must have uninitialized storage for maximum N instances of T, there are multiple choices that can be made when designing the internal data layout:

  • Single-member union:

    union U { T _x; };
    std::array<U, N> _data;
  • Single std::aligned_storage_t:

    std::aligned_storage_t<sizeof(T) * N, alignof(T)> _data;
  • Array of std::aligned_storage_t:

    using storage = std::aligned_storage_t<sizeof(T), alignof(T)>;
    std::array<storage, N> _data;

Regardless of the choice, creating the members will require the use of "placement new" and accessing them will require something along the lines of reinterpret_cast.

Now assume that we have two very minimal implementations of static_vector<T, N>:

  • with_union: implemented using the "single-member union" approach;

  • with_storage: implemented using the "single std::aligned_storage_t" approach.

Let's perform the following benchmark using both g++ and clang++ with -O3. I used quick-bench.com for this task:

void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }
void clobber()       { asm volatile("" : :        : "memory"); }

template <typename Vector>
void test()
{
    for(std::size_t j = 0; j < 10; ++j)
    {
        Vector v;
        for(int i = 0; i < 123456; ++i) v.emplace_back(i);
    }
}

(escape and clobber are taken from Chandler Carruth's CppCon 2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!")

g++ results

clang++ results

As you can see from the results, g++ seems to be able to aggressively optimize (vectorize) the implementation that uses the "single std::aligned_storage_t" approach, but not the implementation using the union.

My questions are:

  • Is there anything in the Standard that prevents the implementation using union from being aggressively optimized? (I.e. does the Standard grant more freedom to the compiler when using std::aligned_storage_t - if so, why?)

  • Is this purely a "quality of implementation" issue?

Original Thread

By anonymous    2018-02-26

That looks like a cumulative total, so the top-level parent of (almost?) everything gets (almost) all the CPU time for itself + children. Related: see Chandler Carruth's CppCon 2015 talk: ["Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"](https://www.youtube.com/watch?v=nXaxk27zwlk) for some tips/tricks on using `perf`. Some of it should be applicable to Java (like the parts about interpreting the output, moreso than the parts about creating source that compiles the way you want without optimizing away your microbenchmark, or optimizing between iterations.)

Original Thread

By anonymous    2018-03-05

For more about constructing microbenchmarks that optimize properly without optimizing away, see **[CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"](https://www.youtube.com/watch?v=nXaxk27zwlk)**. He uses an empty inline asm statement with appropriate constraints (GNU syntax) to require the compiler to have a variable value in a register at some point. Chandler is a clang developer. (edit: this is the same talk linked in the other answer.)

Original Thread

By anonymous    2018-03-05

Benchmarking code is not easy. What I found most useful was Google's benchmark library. Even if you are not planning to use it, it is worth reading some of its examples. It has many options for parametrizing tests, outputting results to a file, and even reporting the big-O complexity of your algorithm (to name just a few). If you are at all familiar with the Google Test framework, I would recommend using it. It also keeps compiler optimizations under control, so you can be sure your code wasn't optimized away.

There is also a great talk about benchmarking code from CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". It offers many insights into possible mistakes you might make (it also uses Google Benchmark).

Original Thread

By anonymous    2018-03-18

Chandler Carruth has a really good talk on how to benchmark: https://www.youtube.com/watch?v=nXaxk27zwlk

Original Thread

By anonymous    2018-03-18

IIRC, Chandler Carruth mentions compiling with frame pointers enabled (`-fno-omit-frame-pointer`) to let perf efficiently collect stack backtraces in his CppCon 2015 talk about `perf`: https://www.youtube.com/watch?v=nXaxk27zwlk. But I forget what perf options he then uses to tell `perf` it can use frame pointers, and to get it to even collect parent callers. It's a very good video, worth watching.

Original Thread

By anonymous    2018-04-02

Summing up the results into an `unsigned tmp` which you print at the end can stop the compiler from optimizing away, or store to a `volatile int dummy` (but don't make a key part of the code under test use `volatile`!) or use more advanced things like inline `asm` statements that the compiler treats as a black box. See [CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"](https://www.youtube.com/watch?v=nXaxk27zwlk) for an `escape` function that requires the compiler to produce a value in a register, but doesn't add extra asm instructions.

Original Thread
