Vectorization (SIMD) and Scaling | James Reinders, Intel Corporation
43 1 4952
The slide deck for this presentation can be viewed here: http://extremecomputingtraining.anl.gov/files/2014/08/20140804-1030-1115-ATPESC-Argonne-Reinders.2.pdf
Presented at the Argonne Training Program on Extreme-Scale Computing, Summer 2014.
For more information, visit: http://extremecomputingtraining.anl.gov/
By anonymous 2017-09-20
I agree with Patrick Burns' view that it is rather loop hiding and not code vectorisation. Here's why:
C code snippet:
for (int i=0; i<n; i++) c[i] = a[i] + b[i]
What we would like to do is quite clear. But how the task is performed or how it could be performed isn't really. A for-loop by default is a serial construct. It doesn't inform if or how things can be done in parallel.
The most obvious way is that the code is run in a sequential manner. Load
b[i] on to registers, add them, store the result in
c[i], and do this for each
However, modern processors have vector or SIMD instruction set which is capable of operating on a vector of data during the same instruction when performing the same operation (e.g., adding two vectors as shown above). Depending on the processor/architecture, it might be possible to add, say, four numbers from
b under the same instruction, instead of one at a time.
We would like to exploit the Single Instruction Multiple Data and perform data level parallelism, i.e., load 4 things at a time, add 4 things at time, store 4 things at a time, for example. And this is code vectorisation.
Note that this is different from code parallelisation -- where multiple computations are performed concurrently.
It'd be great if the compiler identifies such blocks of code and automatically vectorises them, which is a difficult task. Automatic code vectorisation is a challenging research topic in Computer Science. But over time, compilers have gotten better at it. You can check the auto vectorisation capabilities of
GNU-gcc here. Similarly for
LLVM-clang here. And you can also find some benchmarks in the last link compared against
ICC (Intel C++ compiler).
gcc (I'm on
v4.9) for example doesn't vectorise code automatically at
-O2 level optimisation. So if we were to execute the code shown above, it'd be run sequentially. Here's the timing for adding two integer vectors of length 500 million.
We either need to add the flag
-ftree-vectorize or change optimisation to level
-O3. (Note that
-O3 performs other additional optimisations as well). The flag
-fopt-info-vec is useful as it informs when a loop was successfully vectorised).
# compiling with -O2, -ftree-vectorize and -fopt-info-vec # test.c:32:5: note: loop vectorized # test.c:32:5: note: loop versioned for vectorization because of possible aliasing # test.c:32:5: note: loop peeled for vectorization to enhance alignment
This tells us that the function is vectorised. Here are the timings comparing both non-vectorised and vectorised versions on integer vectors of length 500 million:
x = sample(100L, 500e6L, TRUE) y = sample(100L, 500e6L, TRUE) z = vector("integer", 500e6L) # result vector # non-vectorised, -O2 system.time(.Call("Csum", x, y, z)) # user system elapsed # 1.830 0.009 1.852 # vectorised using flags shown above at -O2 system.time(.Call("Csum", x, y, z)) # user system elapsed # 0.361 0.001 0.362 # both results are checked for identicalness, returns TRUE
This part can be safely skipped without losing continuity.
Compilers will not always have sufficient information to vectorise. We could use OpenMP specification for parallel programming, which also provides a simd compiler directive to instruct compilers to vectorise the code. It is essential to ensure that there are no memory overlaps, race conditions etc.. when vectorising code manually, else it'll result in incorrect results.
#pragma omp simd for (i=0; i<n; i++) c[i] = a[i] + b[i]
By doing this, we specifically ask the compiler to vectorise it no matter what. We'll need to activate OpenMP extensions by using compile time flag
-fopenmp. By doing that:
# timing with -O2 + OpenMP with simd x = sample(100L, 500e6L, TRUE) y = sample(100L, 500e6L, TRUE) z = vector("integer", 500e6L) # result vector system.time(.Call("Cvecsum", x, y, z)) # user system elapsed # 0.360 0.001 0.360
which is great! This was tested with gcc v6.2.0 and llvm clang v3.9.0 (both installed via homebrew, MacOS 10.12.3), both of which support OpenMP 4.0.
In this sense, even though Wikipedia page on Array Programming mentions that languages that operate on entire arrays usually call that as vectorised operations, it really is loop hiding IMO (unless it is actually vectorised).
In case of R, even
colSums() code in C don't exploit code vectorisation IIUC; it is just a loop in C. Same goes for
lapply(). In case of
apply(), it's in R. All of these are therefore loop hiding.
In short, wrapping an R function by:
just writing a for-loop in
C!= vectorising your code.
just writing a for-loop in
R!= vectorising your code.
Intel Math Kernel Library (MKL) for example implements vectorised forms of functions.
- Talk by James Reinders, Intel (this answer is mostly an attempt to summarise this excellent talk)