Clemson University -- CPSC 231 -- Fall 2009 memory hierarchy programs exhibit locality of reference - non-uniform reference patterns temporal locality - a program that references a memory location once is likely to re-reference it in the future, e.g., multiple accesses to the same variable, revisit instructions in a loop spatial locality - a program that references a memory location once is likely to reference a near-by memory location in the future, e.g., sequentially traversing an array, sequentially fetching instructions since you cannot afford a huge amount of the fastest storage, instead use several levels of increasingly larger, cheaper storage and exploit locality by keeping copies of recently used memory contents in the higher levels registers on-chip 32-128 registers 250 picoseconds L1 cache on-chip 16 KB - 64 KB 1 nanosecond L2 cache SRAM 1 MB - 8 MB 5 nanoseconds main memory DRAM 64 MB - 512 MB 100 nanoseconds <<< "access gap" 10^4 - 10^5 >>> virtual memory disk tens of GB per disk 5 milliseconds key idea - only keep active pieces of program in memory and in caches OS control over main memory / virtual memory transfers hardware control over cache / main memory transfers compiler control over register / main memory transfers called a "hit" if the object to access is found in the current level, or "miss" if you have to obtain from lower level if hit rates in top levels of hierarchy are high enough, then the average memory access time corresponds to the speed of the caches memory hierarchy policies fetch policy - when to bring into higher level demand fetch - bring in when necessary prefetch placement policy - where to put into higher level replacement policy - when full, what to replace/evict out of higher level write policy - when to update lower level write-through - update lower level with all changes to the higher level write-back - update lower level only when evicted from higher level virtual memory paging - divide up program and data into fixed-length pieces (e.g., 4K bytes) - invisible to software other than OS segmentation - divide up program and data into variable-length logical sections; logical segment names will be produced by compiler and will be visible to assembly lang. programmer, compiler, linker, and loader OS manages page and segment misses since so slow to retrieve from disk hardware turns miss into "fault", that is, an interrupt that invokes the OS and causes process switch (i.e., run another process while missing page or segment is read in from disk) address translation for paging and/or segmentation through tables - controls allocation in memory, name mapping, and protection, usually multiple levels of tables are defined (e.g., SPARC has four types of tables, as shown in Figure 13.12 on p. 354) each process has its own page and/or segment tables - each can run in its own "virtual address space" and not be aware that physical memory is shared PTE - page table entry - contains: presence bit, main memory page frame number, protection bits STE - segment table entry - contains: presence bit, disk address, main memory address, length, protection bits for write-back policy, a modified (a.k.a. dirty, changed) bit is used in the table entry to indicate if block must be rewritten to disk on replacement often a referenced (a.k.a. accessed) bit is included in the table entry to indicate if the block has been recently referenced TLB - translation lookaside buffer - small address cache containing recently- used PTEs/STEs, since it would be very slow to have to access the (multiple level) page or segment tables in memory each time before accessing an instruction or data word e.g., on SPARC v7 +------------------------------------------------+-+-+-+-----+---+ | . . . . . . physical page number. . . . . . . .|C|M|R| ACC |ET | +------------------------------------------------+-+-+-+-----+---+ 31 29 27 25 23 21 19 17 15 13 11 9 8 7 6 5 4 3 2 1 0 C - cacheable - copies of this part of memory can be placed in the caches (e.g., you have to mark memory-mapped I/O regions as uncacheable) M - modified R - referenced ACC - access permission bits ET - entry type - 0=invalid, 1=page table pointer, 2=page table entry SPARC calls its TLB equivalent the page descriptor cache (PDC, p. 355) cache like paging (fixed-length) but smaller blocks (e.g., 32 bytes) miss is handled by hardware since on cache miss the missing block can be obtained from main memory very quickly separate L1 caches for instruction and data so that pipeline can access them at the same time L2 usually unified - holds both instructions and data context switch save enough information (or "state") about the currently running process to be able to later resume it - involves CPU registers and memory mapping SPARC v7 specifics on pp. 357-358 see register save/restore macros in /usr/include/v7/sys/privregs.h segmentation example < change to paging > Assume the loader sets up a segment table and loads the main program segment for a given process. It will also build a process control block (PCB) for the program, recording the segment table location in a segment table base register (STBR) field and the initial program counter value in a program counter (PC) field. The PCB is placed into the OS dispatcher's ready queue, and the loader then branches to the OS dispatcher. presence | disk mem ready_queue--> PCB memory | addr len prot addr +---------------+ / 2500: seg0: 0/t1s0/1000/--x-/---- | PC: <2,0> | segment | 2501: seg1: 0/t2s0/ 400/--x-/---- | STBR: 2500 | table | 2502: seg2: 1/t2s4/ 400/--x-/4300 | status: ready | | 2503: seg3: 0/t3s0/1000/r---/---- +---------------+ \ 2504: seg4: 0/t4s0/ 100/rw-a/---- ... 4300: main: inst0 4301: inst1 4302: inst2 reads <4,5> When the dispatcher chooses this process for execution, the fields in the PCB are loaded into the corresponding registers in the CPU and the TLB is flushed. This causes a transfer of control from the dispatcher to the process, and the process begins execution. PC: <2,0> STBR: 2500 TLB: empty For instruction fetch, the virtual address <2,0> in the PC is translated: the TLB is checked first and no entry is found; the segment number is added to the STBR yielding 2502; the table entry is fetched to obtain the base address of 4300 (and as a byproduct, this STE is placed in the TLB); the offset is checked against the length; the protection bits are checked to determine if instruction fetch (i.e., execute) is allowed; and, the base address is added to the offset, yielding 4300. The instruction at 4300 is fetched and executed; the PC is updated to <2,1>. PC: <2,1> STBR: 2500 len prot base TLB: under the tag of seg 2: 400/--x-/4300 For instruction fetch, the virtual address <2,1> in the PC is translated: the TLB is checked first and an entry is found with the correct segment tag; the offset is checked against the length; the protection bits are checked; and, the base address of 4300 is added to the offset, yielding 4301. The instruction at 4301 is fetched and executed; the PC is updated to <2,2>. For instruction fetch, the virtual address <2,2> in the PC is translated: the TLB is checked first and an entry is found with the correct segment tag; the offset is checked against the length; the protection bits are checked; and, the base address of 4300 is added to the offset, yielding 4302. The instruction at 4302 is fetched and found to refer to the virtual address <4,5> as a read. For data access, the virtual address <4,3> in the PC is translated: the TLB is checked first and no entry is found; the segment number is added to the STBR yielding 2504; the table entry is fetched but the presence bit is off. A segment fault occurs, causing an entry into the operating system. The entry into the OS saves the PC and STBR. Once the OS determines that the interrupt was a segment fault, these values are placed back into the PCB, and the status of the PCB is marked as "blocked". The PCB is removed from the ready queue, and disk I/O is initiated to bring the missing segment into main memory wherever a large enough free block can be found. The OS returns through the dispatcher, which finds another ready process. This cause the loading of new PC and STBR register values and the flushing of the TLB. Thus other processes execute while the segment-faulting process waits on its missing segment to be brought in. At some later time, the disk read finishes and the OS is invoked by an I/O completion interrupt. The OS can now update the fields in the segment table of the segment-faulting process by turning on the presence bit for the previously-missing segment and setting its base address in memory. The OS can then mark the PCB status as "ready" and return the PCB to the ready queue. At some point, the process is chosen for execution by the dispatcher and execution resumes by loading of the PC and STBR register values and the flushing of the TLB. The instruction that caused the segment fault is restarted. presence | disk mem ready_queue--> PCB memory | addr len prot addr +---------------+ / 2500: seg0: 0/t1s0/1000/--x-/---- | PC: <2,0> | segment | 2501: seg1: 0/t2s0/ 400/--x-/---- | STBR: 2500 | table | 2502: seg2: 1/t2s4/ 400/--x-/4300 | status: ready | | 2503: seg3: 0/t3s0/1000/r---/---- +---------------+ \ 2504: seg4: 1/t4s0/ 100/rw-a/4700 ... PC: <2,2> 4300: main: inst0 STBR: 2500 4301: inst1 TLB: empty 4302: inst2 reads <4,5> ... 4705: data: value For instruction fetch, the virtual address <2,2> in the PC is translated: the TLB is checked first and no entry is found; the segment number is added to the STBR yielding 2502; the table entry is fetched to obtain the base address of 4300 (and as a byproduct, this STE is placed in the TLB); the offset is checked against the length; the protection bits are checked; and, the base address is added to the offset, yielding 4302. The instruction at 4302 is fetched and found to refer to the virtual address <4,5> as a read. For data access, the virtual address <4,5> in the PC is translated: the TLB is checked first and no entry is found; the segment number is added to the STBR yielding 2504; the table entry is fetched to obtain the base address of 4700 (and as a byproduct, this STE is placed in the TLB); the offset is checked against the length; the protection bits are checked to determine if read access is allowed; and, the base address is added to the offset, yielding 4705. The data value at 4302 is read and the instruction completes execution. The PC is updated to <2,3>. PC: <2,3> STBR: 2500 len prot base TLB: under the tag of seg 2: 400/--x-/4300 under the tag of seg 4: 100/rw-a/4700 To consider - what if the instruction at <2,3> - tries to read from <4,20>? - tries to write into <4,20>? - tries to read from <3,20>? - tries to write into <3,20>? - tries to read from <2,20>? - tries to write into <2,20>? - tries to write into <0,20>? - tries to read into <0,20>? - tries to call a subroutine at <0,500>? - tries to call a subroutine at <1,500>? - tries to call a subroutine at <2,500>? - tries to call a subroutine at <3,500>?