8x8 Matrix Multiplication, Simple Algorithm
0 0 40
By anonymous 2017-09-20
Note that there is not one single implementation of BLAS - it is tuned for particular processor architectures. Therefore you cannot be absolutely certain which algorithm is being used on your machine without finding out which version of BLAS is in use.
The specification for BLAS specifies the inputs and output of each subroutine, and provides acceptable error bounds for the output of each subroutine. Implementations are free to use whatever algorithm they like, as long they follows the specification.
The reference implementation of BLAS uses a block matrix multiplication algorithm in
DGEMM that has time complexity O(n^3) for multiplying two n x n matrices. I think it's reasonable to assume that most implementations of BLAS will more or less follow the reference implementation.
Note that it doesn't use the naive matrix multiplication algorithm
for i = 1:N for j = 1:N for k = 1:N c(i,j) = c(i,j) + a(i,k) * b(k,j); end end end
This is because, typically, the entire matrix will not fit in local memory. If data is constantly being shifted into and out of local memory, the algorithm will slow down. The block matrix algorithm breaks the operation into small blocks, such that each block is small enough to fit into local memory, reducing the number of shifts into and out of memory.
There exist asymptotically faster matrix multiplication algorithms, eg the Strassen algorithm or the Coppersmith-Winograd algorithm which have a slightly faster rate than O(n^3). However, they are generally not cache aware and ignore locality - meaning that data continually needs to be shunted around in memory, so for most modern architectures the overall algorithm is actually slower than an optimized block matrix multiplication algorithm.
Wikipedia notes that the Strassen algorithm may provide speedups on a single core CPU for matrix sizes greater than several thousand, however the speedup is likely to be around 10% or so, and the developers of BLAS probably don't consider it worthwhile for this rare case (saying that, this paper from 1996 claims a speed increase of around 10% over
DGEMM for n above about 200 -- though I don't know how out of date that is). The Coppersmith-Winograd algorithm, on the other hand, "only provides an advantage for matrices so large that they cannot be processed by modern hardware".
So the answer is that Matlab uses a naive, but efficient and cache-aware algorithm to get its blazing fast matrix multiplication.
I updated this answer by creating some videos that demonstrate the locality of the block matrix multiplication algorithm, compared to the naive algorithm.
In each of the following videos, we are visualizing the multiplication of two 8x8 matrices A and B to create the product C = A x B. The yellow highlight indicates which element in each of the matrices A, B and C is being processed at each step of the algorithm. You can see how the block matrix multiplication only works on small blocks of the matrix at a time, and re-uses each of those blocks multiple times, so that the number of times that data must be shifted in and out of local memory is minimized.
- Simple matrix multiplication
- Block matrix multiplication, block size = 2
- Block matrix multiplication, block size = 4