notes on the memory hierarchy and caches locality of reference (p. 452) - property of program behavior - can be exploited by hardware - leads to idea of memory hierarchy two types 1) temporal locality 2) spatial locality memory hierarchy type of traffic level access time between levels unit agent ----- ----------- --------------- ---- ----- registers 100's picoseconds ^ | load/store word programmer v cache 1-10's nanoseconds ^ | cache miss/repl. line hardware v main memory 100's nanoseconds ^ | | page fault/repl. page OS | v disk 5-10's milliseconds note the "access gap" between main memory and disk (a 100,000 times speed difference) caches level 1 - typically split into icache and dcache for pipelining level 2 - may be on backside bus in order to run faster level 3 - may be shared by two processors average memory access time (AMAT) (pp. 478-479) if memory access is started in parallel with cache access T_avg = h*(T_cache) + (1-h)*(T_memory) although, note that if the memory access is started only after the cache miss is recognized, the calculation becomes T_avg = h*(T_cache) + (1-h)*(T_cache + T_memory) = T_cache + (1-h)*T_memory [this is form given in the text] example (second form of AMAT calculation) hit ratio = 0.9 cache access time = 10 ns memory access time = 300 ns T_avg = 10 ns + 0.1*(300 ns) = 40 ns speedup = 300 ns / 40 ns = 7.5 can be extended to deal with multiple cache levels memory hierarchy policies 1. fetch - when to get item a) demand fetch b) prefetch 2. placement - where to put item (mapping) a) direct mapped b) set associative c) fully associative 3. replacement - what old item to evict to make room for new item a) FIFO b) LRU etc. 4. write - how copies of an item in different levels are updated a) write-through b) write-back (requires dirty bit, a.k.a. changed, modified) 5. write-miss - whether to fetch item or not a) write-allocate - usually paired with write-back b) write-no-allocate cache to main memory demand fetch upon cache miss, hardware waits placement direct-mapped (one RAM bank, one comparator) set associative (multiple banks, multiple comparators - one per bank) fully associative (content-addressable memory - one comparator per line) (see Figures 5.13 and 5.14) replacement FIFO LRU (but hard to implement for more than two lines) pseudo-LRU random L1 may be write-through L2 may be write-back terminology of cache-related subfields of a memory address +-------------------------------------------------------------+ | memory address | +-------------------------------------------------------------+ . . . . +------------------------------+------------------------------+ | memory block number | byte number within block | +------------------------------+------------------------------+ . . . . . . +-------------+----------------+------------------------------+ | tag | group (or set) | block offset | +-------------+----------------+------------------------------+ . . . . . . . . as +-------------+----------------+------------------------------+ used | tag | index | offset | below +-------------+----------------+------------------------------+ placement policy simple (low-associativity) caches can be built using one bank per degree of associativity fully associativity caches are typically built using some type of content-addressable memory (CAM) with B total blocks (lines) in cache policy # sets blocks per set # banks ----------------------+----------+------------------+-------- direct mapped | B sets | 1 block per set | 1 bank 2-way set associative | B/2 sets | 2 blocks per set | 2 banks ... | ... | ... | ... n-way set associative | B/n sets | n blocks per set | n banks ... | ... | ... | ... fully associative | 1 set | B blocks per set | (CAM) one comparator per bank for simple (low-associativity) schemes cache access steps 1) lookup - use index bits of address to read block(s) out of bank(s) 2) tag check - compare tag of block(s) just read with tag of address 3) routing - on a hit, move the block from the appropriate bank (or CAM entry) to the selection logic 4) selection - select a byte or word from the block using offset bits note: fully associative - no lookup (CAM does tag matching in each entry) direct mapped - no routing needed since only one bank; selection and forwarding to register or ALU can be started in parallel with tag match, and thus direct mapped hit time can be 1.2 to 1.5 times faster than that for other mapping policies) (There are several choices on how to deal with caches and virtually- addressed memory; for now, simply assume we are using a physically- addressed cache and that the virtual-to-physical address translation is done prior to lookup.) write buffer (p. 467) prefetch (p. 547) non-blocking cache (p. 541) also called lockup-free cache allows access while a miss is being handled hit-under-miss or even miss-under-miss formulae for caches total # lines in cache = cache size / line size # cache banks = degree of set associativity ( direct mapped = one bank ) ( fully associative is equivalent to one bank per line ) # lines in a bank of cache = cache size / ( line size * # cache banks ) # bits in a main memory address = log_base_2( main memory size ) # line offset bits = log_base_2( line size ) # index bits = log_base_2( total # lines in cache / # cache banks ) = log_base_2( # lines in a bank of cache ) ( note: no index bits are used in a fully associative cache ) # tag bits = # bits in a main memory address - ( # index bits + # line offset bits ) example - 16 MB main memory, 32 KB direct-mapped cache, 8 bytes per line # memory address bits = log_base_2( 16 M ) = 24 # lines = 32 K / 8 = 4 K # index bits = log_base_2( 4 K ) = 12 # line offset bits = log_base_2( 8 ) = 3 # tag bits = 24 - ( 12 + 3 ) = 9 23 15 14 3 2 1 0 +------------------+------------------------+------+ | 9-bit tag | 12-bit index | 3-bit| offset +------------------+------------------------+------+ characterizing cache misses (3 c's) compulsory (cold-start) - first access to a line capacity - when a line has been cache-resident prior to a miss, but it had been replaced because the working set of cache lines is too large to all fit in cache conflict (collision) - when a line has been cache-resident prior to a miss, but it had been replaced because another cache line mapped to the same cache location (4th "c" for multiprocessors - coherency - when a line has been cache- resident prior to a miss, but it had been invalidated because another processor wrote to the line) direct-mapped cache example (see also the example on pp. 459-461) 1. Assume a 256-byte memory with a four-line direct-mapped cache with two bytes per line. What are the fields in an address? main memory address is log_base_2( 256 ) = 8 bits long index field is log_base_2( 4 ) = 2 offset field is log_base_2( 2 ) = 1 tag index offset +----------+----+--+ | | | | +----------+----+--+ 5 bits 2 1 cache organization and consequent memory block mapping (memory is divided into 2-byte blocks == line size) v tag contents memory byte addresses (given in decimal) +-+---+---------+ index = 0 | | | __ / __ | 0,1 8,9 16,17 24,25 ... index = 1 | | | __ / __ | 2,3 10,11 18,19 26,27 ... index = 2 | | | __ / __ | 4,5 12,13 20,21 28,29 ... index = 3 | | | __ / __ | 6,7 14,15 22,23 30,31 ... +-+---+---------+ tag=0 tag=1 tag=2 tag=3 ... for for for for these these these these addrs addrs addrs addrs because of the use of the index bit field, each index bit value defines one of 2^(# index bits) different memory partitions (shown as rows above). only one line per partition (row) can be in cache at any one time; lines within a partition "conflict" with each other for cache residence. cache actions: 1) lookup - the index bits are used to access one row (i = address.index) 2) tag match - the tag field of the address is compared with the tag from the stored cache line (if the stored cache line is valid, v==1) hit occurs when (cache_line[i].v==1) && (cache_line[i].tag==address.tag) 2. The cache is initially empty. For the byte address reference stream (given as decimal addresses; consider them all to be reads) 0, 1, 12, 2, 3, 4, 12, 5 how many of these eight references are hits? note: since the above addresses are all less than 16, the addresses will be given as 4-bit values and divided into three fields, with the fields separated by a period start with empty cache v tag contents +---+---+-------------------+ | 0 | | | | 0 | | | | 0 | | | | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ 0 => 0.00.0 => miss => refill | 1 | 0 | mem[ 0] / mem[ 1] | index = 0 | 0 | | | | 0 | | | | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ 1 => 0.00.1 => hit | 1 | 0 | mem[ 0] / mem[ 1] | index = 0 | 0 | | | | 0 | | | | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ | 1 | 0 | mem[ 0] / mem[ 1] | | 0 | | | 12 => 1.10.0 => miss => refill | 1 | 1 | mem[12] / mem[13] | index = 2 | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ | 1 | 0 | mem[ 0] / mem[ 1] | 2 => 0.01.0 => miss => refill | 1 | 0 | mem[ 2] / mem[ 3] | index = 1 | 1 | 1 | mem[12] / mem[13] | | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ | 1 | 0 | mem[ 0] / mem[ 1] | 3 => 0.01.1 => hit | 1 | 0 | mem[ 2] / mem[ 3] | index = 1 | 1 | 1 | mem[12] / mem[13] | | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ | 1 | 0 | mem[ 0] / mem[ 1] | | 1 | 0 | mem[ 2] / mem[ 3] | 4 => 0.10.0 => miss => refill | 1 | 0 | mem[ 4] / mem[ 5] | index = 2 (conflicts with 12,13) | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ | 1 | 0 | mem[ 0] / mem[ 1] | | 1 | 0 | mem[ 2] / mem[ 3] | 12 => 1.10.0 => miss => refill | 1 | 1 | mem[12] / mem[13] | index = 2 (conflict miss) | 0 | | | +---+---+-------------------+ v tag contents +---+---+-------------------+ | 1 | 0 | mem[ 0] / mem[ 1] | | 1 | 0 | mem[ 2] / mem[ 3] | 5 => 0.10.1 => miss => refill | 1 | 0 | mem[ 4] / mem[ 5] | index = 2 (conflict miss) | 0 | | | +---+---+-------------------+ only two of the eight references are hits fully associative cache 3. Assume a 256-byte memory with a four-line fully-associative cache with two bytes per line. What are the fields in an address? main memory address is log_base_2( 256 ) = 8 bits long offset field is log_base_2( 2 ) = 1 tag offset +---------------+--+ | | | +---------------+--+ 7 bits 1 cache organization v tag contents v tag contents v tag contents v tag contents +-+---+---------+ +-+---+---------+ +-+---+---------+ +-+---+---------+ | | | __ / __ | | | | __ / __ | | | | __ / __ | | | | __ / __ | +-+---+---------+ +-+---+---------+ +-+---+---------+ +-+---+---------+ there is no index field so the tag field in each cache line is compared with the tag field of the address in parallel; there is no partitioning of the memory, so there can be no conflicts; any four memory blocks can be in cache at the same time. the cost is that instead of a lookup and tag match with a single cache line tag field as in the direct mapped case, the hardware must instead make multiple tag matches in parallel and then route the contents from a match to a central location. this makes fully-associative cache more expensive and somewhat slower than direct-mapped caches. compromise: set-associative cache 4. Assume a 256-byte memory with a four-line 2-way set associative cache with two bytes per line. What are the fields in an address? main memory address is log_base_2( 256 ) = 8 bits long index field is log_base_2( 2 ) = 1 offset field is log_base_2( 2 ) = 1 tag index offset +------------+--+--+ | | | | +------------+--+--+ 6 bits 1 1 cache organization and consequent memory block mapping (memory is divided into 2-byte blocks == line size) v tag contents v tag contents memory byte addresses (decimal) +-+---+---------+ +-+---+---------+ 0 | | | __ / __ | | | | __ / __ | 0,1 4,5 8,9 ... 1 | | | __ / __ | | | | __ / __ | 2,3 6,7 10,11 ... +-+---+---------+ +-+---+---------+ tag=0 tag=1 tag=2 ... for for for these these these addrs addrs addrs the memory is again partitioned; however, rather than one memory block per partition (i.e., per index value), you can have up to N blocks per partition be cache-resident in an N-way set-associative cache. the hardware performs a lookup, just like a direct-mapped cache, but uses a smaller index field. lookup occurs in each of the N banks; in fact; you could consider an N-way set-associative cache as N direct- mapped caches working in parallel. after lookup, the hardware needs N tag matches to determine a hit and must route the contents associated with a match to a central location. note that the set-associative idea is actually the general case: - a direct-mapped cache is a one-way set-associative cache - a fully-associative cache of B lines is a B-way set-associative cache