Cache access patterms for matrix multiply assume 16x16 row-major matrices and four array elements per cache line (e.g., a cache line labeled a[0,0-3] contains a[0][0] through a[0][3]) normal matrix multiply (ijk) for ( i = 0; i < N; i++ ){ for ( j = 0; j < N; j++ ){ temp = 0.0; for ( k = 0; k < N; k++ ){ temp = temp + a[i][k] * b[k][j]; } c[i][j] = temp; } } access patterns: a[ 0, 0- 3]:1111 1111 1111 ... a[ 0, 4- 7]: 1111 1111 1111 a[ 0, 8-11]: 1111 1111 1111 a[ 0,12-15]: 1111 1111 1111 a[ 1, 0- 3]: ... b[ 0, 0- 3]:1 1 1 ... b[ 0, 4- 7]: b[ 0, 8-11]: b[ 0,12-15]: b[ 1, 0- 3]: 1 1 1 b[ 1, 4- 7]: b[ 1, 8-11]: b[ 1,12-15]: b[ 2, 0- 3]: 1 1 1 b[ 2, 4- 7]: b[ 2, 8-11]: b[ 2,12-15]: b[ 3, 0- 3]: 1 1 1 b[ 3, 4- 7]: b[ 3, 8-11]: b[ 3,12-15]: b[ 4, 0- 3]: 1 1 1 ... c[ 0, 0- 3]: 1 1 1 c[ 0, 4- 7]: \______________/ one complete inner loop => 33 references to 21 different cache lines access to A is row-oriented, up to one miss per four elements access to B is column-oriented, up to one miss per element access to C is fixed at one element during inner loop (in general, there are 2*N + 1 references to 1.25*N + 1 cache lines during the inner loop, with approx. 1.25 cache misses per iteration of the inner loop, with cache lines of four entries each) interchanged loops matrix multiply (ikj) for ( i = 0; i < N; i++ ){ for ( k = 0; k < N; k++ ){ temp = a[i][k]; for ( j = 0; j < N; j++ ){ c[i][j] = c[i][j] + temp * b[k][j]; } } } access patterns: a[ 0, 0- 3]:1 1 1 1 ... b[ 0, 0- 3]:1111 b[ 0, 4- 7]: 1111 b[ 0, 8-11]: 1111 b[ 0,12-15]: 1111 b[ 1, 0- 3]: 1111 b[ 1, 4- 7]: 1111 b[ 1, 8-11]: 1111 b[ 1,12-15]: 1111 b[ 2, 0- 3]: 1111 b[ 2, 4- 7]: 1111 b[ 2, 8-11]: 1111 b[ 2,12-15]: 1111 ... c[ 0, 0- 3]:2222 2222 2222 c[ 0, 4- 7]: 2222 2222 2222 c[ 0, 8-11]: 2222 2222 2222 c[ 0,12-15]: 2222 2222 2222 c[ 1, 0- 3]: \______________/ one complete inner loop => 49 references to 9 different cache lines access to A is row-oriented, up to one miss four elements access to B is column-oriented, but constrained to access only a limited number of different lines, up to one miss four elements access to C is fixed at one element during inner loop (in general, there are 3*N + 1 references to 0.5*N + 1 cache lines during the inner loop, with approx. 0.5 cache misses per iteration of the inner loop, with cache lines of four entries each) tiling with tile size of 4x4 (also called blocking and block size; note that you can apply tiling to the inner two loops of any ordering and that tile sizes can differ, i.e., the tiles can be rectangular rather than square) (the code below is i4e4fjk in terms of the generator) for ( i = 0; i < N; i++ ){ for ( jj = 0; jj < N; jj+=4 ){ for ( kk = 0; kk < N; kk+=4 ){ for ( j = jj; j < min(jj+4,N); j++ ){ temp = c[i][j]; for ( k = kk; k < min(kk+4,N); k++ ){ temp = temp + a[i][k] * b[k][j]; } c[i][j] = temp; } } } } access patterns: a[ 0, 0- 3]:1111111111111111 a[ 0, 4- 7]: 1111111111111111 a[ 0, 8-11]: 1111111111111111 ... b[ 0, 0- 3]:1 1 1 1 b[ 0, 4- 7]: 1 1 1 1 b[ 0, 8-11]: 1 1 1 1 b[ 0,12-15]: b[ 1, 0- 3]: 1 1 1 1 b[ 1, 4- 7]: 1 1 1 1 b[ 1, 8-11]: 1 1 1 1 b[ 1,12-15]: b[ 2, 0- 3]: 1 1 1 1 b[ 2, 4- 7]: 1 1 1 1 b[ 2, 8-11]: 1 1 1 1 b[ 2,12-15]: b[ 3, 0- 3]: 1 1 1 1 b[ 3, 4- 7]: 1 1 1 1 b[ 3, 8-11]: 1 1 1 1 b[ 1,12-15]: ... c[ 0, 0- 3]: 1 1 1 \______________/ one complete inner loop => 33 references to 6 different cache lines access to A is row-oriented, but constrained to access only a limited number of different elements, result is that A is limited to one cache lines access to B is column-oriented, but constrained to access only four different lines access to C is fixed at one element during inner loop (in general, there are 2*N + 1 references to a limited number of cache lines during the inner loop, with no subsequent cache misses for the inner loop once the limited number of cache lines are brought in)