CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"

By: CppCon

758   13   66985

Uploaded on 09/27/2015

http://www.Cppcon.org

A primary use case for C++ is low latency, low overhead, high performance code. But C++ does not give you these things for free, it gives you the tools to control these things and achieve them where needed. How do you realize this potential of the language? How do you tune your C++ code and achieve the necessary performance metrics?

This talk will walk through the process of tuning C++ code from benchmarking to performance analysis. It will focus on small scale performance problems ranging from loop kernels to data structures and algorithms. It will show you how to write benchmarks that effectively measure different aspects of performance even in the face of advanced compiler optimizations and bedeviling modern CPUs. It will also show how to analyze the performance of your benchmark, understand its behavior as well as the CPUs behavior, and use a wide array of tools available to isolate and pinpoint performance problems. The tools and some processor details will be Linux and x86 specific, but the techniques and concepts should be broadly applicable.
--
Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
--
Videos Filmed & Edited by Bash Films: http://www.BashFilms.com

Comments (6):

By anonymous    2017-09-20

There is no golden rule for this. Unfortunately, the performance of code like this is notoriously hard to predict. The most important thing to take away from that is

Measure everything!

Now to what's going on in your code: As others noted correctly, we can observe that isAllowed gets compiled to a function using branches, while isAllowed2 ends up being branchless.

Branches are interesting when talking about performance: They are somewhere between literally free and ridiculously expensive, inclusively. This is due to a CPU component called the branch predictor. It tries to predict which branch your control flow will take and makes the CPU speculatively execute it. If it guesses right, the branch is free. If it guesses wrong, the branch is expensive. A great and detailed explanation of that concept, including some numbers, can be found in this answer.

So now we need to decide whether we want the branching or the branchless version. In general, neither need be faster than the other! It really depends on how well your target CPUs can predict the branches, which of course depends on the actual input. (Choosing whether to compile a function to a branching or a branchless result is thus a hard problem for compilers as they don't know what CPUs the binary will be run on, nor what kind of input data to expect. See for example this blogpost.)

So if your benchmark was actually correct, we have determined that on your CPU the branches are too hard to predict to beat the relatively cheap integer arithmetic. This may also be due to the tiny amount of test cases, the branch predictor cannot learn a pattern from such few invocations. But again, we cannot just call that one way or the other, we have to look at the actual performance in the specific case.


As noted in the comments, the execution time is somewhat short for a good measurement, I see huge deviations on my machine. For information about micro benchmarking you can have a look at this talk, it's harder than one might think.

Also, as Martin Bonner helpfully noticed, your two functions don't do the same thing, you'd have to fix that for a correct benchmark of course.

Original Thread

By anonymous    2017-09-20

There is test::black_box() (link to old docs) which is still unstable (as is the whole test crate). This function takes a value of an arbitrary type and returns the same value again. So it is basically the identity function. "Oh well, now that's very useful, isn't it?" you might ask ironically.

But there is something special: the value which is passed through is hidden from LLVM (the thing doing nearly all optimizations in Rust right now)! It's truly a black box as LLVM doesn't know anything about a piece of code. And without knowing anything LLVM can't prove that optimizations won't be changing the program's behavior. Thus: no optimizations.

How does it do that? Let's look at the definition:

pub fn black_box<T>(dummy: T) -> T {
    // we need to "use" the argument in some way LLVM can't
    // introspect.
    unsafe { asm!("" : : "r"(&dummy)) }
    dummy
}

I'd be lying if I were to pretend I understand this piece of code completely, but it goes something like that: we insert empty inline assembly (not a single instruction) but tell Rust (which tells LLVM) that this piece of assembly uses the variable dummy. This makes it impossible for the optimizer to reason about the variable. Stupid compiler, so easy to deceive, muhahahaha! If you want another explanation, Chandler Carruth explained the dark magic at CppCon 2015.

So how do you use it now? Just use it for some kind of value... anything that goes through black_box() needs to be calculated. How about something like this?

black_box(my_function());

The return value of my_function() needs to be calculated, because the compiler can't prove it's useless! So the function call won't be removed. Note however, that you have to use unstable features (either the test crate or inline asm to write the function yourself) or use FFI. I certainly wouldn't ship this kind of code in a production library, but it's certainly useful for testing purposes!

Original Thread

By anonymous    2017-09-20

It seems a bad idea to use the if there, to me.

You are right. Whether or not idx >= idx_max, it will be under idx_max after idx %= idx_max. If idx < idx_max, it will be unchanged, whether the if is followed or not.

While you might think branching around the modulo might save time, real culprit, I'd say, is that when branches are followed, pipelining modern CPU's have to reset their pipeline, and that costs a relative lot of time. Better not to have to follow a branch, than do an integer modulo, which costs roughly as much time as an integer division.

EDIT: It turns out that the modulus is pretty slow vs. the branch, as suggested by others here. Here's a guy examining this exact same question: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" (suggested in another SO question linked to in another answer to this question).

This guy writes compilers, and thought it would be faster without the branch; but his benchmarks proved him wrong. Even when the branch was taken only 20% of the time, it tested faster.

Another reason not to have the if: One less line of code to maintain, and for someone else to puzzle out what it means. The guy in the above link actually created a "faster modulus" macro. IMHO, this or an inline function is the way to go for performance-critical applications, because your code will be ever so much more understandable without the branch, but will execute as fast.

Finally, the guy in the above video is planning to make this optimization known to compiler writers. Thus, the if will probably be added for you, if not in the code. Hence, just the mod alone will do, when this comes about.

Original Thread

Recommended Books

    Popular Videos 317

    Submit Your Video

    If you have some great dev videos to share, please fill out this form.