CpSc 360
Chapter 6 (continued)
· Indexes that are too large to hold in memory
· Problems with binary
searching of indexes on disk
· Binary searching of an index
on disk requires several seeks and is not substantially faster than binary searching
of the data itself.
· Index maintenance
(rearrangement) on disk is millions of times more expensive than the equivalent
operation in memory
· Possible solutions
· Use a hashed organization if direct access is a top priority
· Use a tree-structured index (such as a B-tree) if both direct and
sequential access are important
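To put a rough number on "several seeks": a binary search over N sorted entries examines at most ceil(log2(N + 1)) of them, and when the index lives on disk each probe may cost a seek. The Python sketch below only counts probes. The function name and the sample sizes are illustrative assumptions, not part of the course material.

```python
def worst_case_probes(n_entries: int) -> int:
    """Most entries a binary search can examine among n_entries sorted entries."""
    # For n >= 1, n.bit_length() == floor(log2(n)) + 1 == ceil(log2(n + 1)).
    return n_entries.bit_length()


for n in (1_000, 1_000_000):
    print(n, "entries ->", worst_case_probes(n), "probes, each a potential seek")
```

The count grows only logarithmically, but on disk every one of those probes is paid for at seek speed rather than memory speed.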
· Indexing to provide access by Multiple Keys
· Files often use multiple
indexes to access data
· A library card catalog
enables access by author name, title, and subject indexing in addition to the
normal indexing by call number
· Books (data records) are
sorted (stored on the shelves) according to the primary key index of call
number
· Note that the primary key is
normally unique but secondary keys may have duplicate entries
· Operations on indexed files
· Record addition
· Entries must be made to all
index files (primary and secondary)
· Care must be taken to use a
standard canonical form for all index entries
· Record deletion
· The data must be removed
from the data file and the corresponding primary index file
· The indexes may need to be rearranged
after deletion to recover the space left open
· Deletion from secondary
indexes is mandatory if the RBO (or RRN) of each data record is stored in the
secondary index, since the records to which they point will be gone
· Deletion from secondary
indexes can be expensive
· Deletion can be avoided if
the primary (unique) key is stored in the secondary index rather than RBO (or
RRN).
· An extra seek is required to
access the primary index
· If the primary index doesn’t
contain the key then the record was deleted earlier
· This may be a good idea if there
are numerous secondary indexes with frequent deletions
· Record updating
· Changes to data records may
require modifications to primary and secondary index entries since the key
values stored in the indexes may change
· If the physical location of
a data record changes (due to a change that is implemented as a deletion
followed by an addition) then all index entries containing that record's RBO
must be changed
· If the secondary indexes
contain primary key values (rather than RBO) then they need not be changed if
the primary key remains constant
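The addition and deletion rules above can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the course's implementation: a list position stands in for the RRN, `canonical`, `IndexedFile`, and `by_name` are hypothetical names, and the secondary index stores primary keys rather than RBO/RRN so that deletion can leave it alone.

```python
def canonical(key: str) -> str:
    """Normalize a key to one canonical form (assumed here: trimmed, upper case)."""
    return " ".join(key.split()).upper()


class IndexedFile:
    """Toy model: data records, a primary index, and one secondary index."""

    def __init__(self):
        self.records = []   # entry-sequenced data "file"; list position stands in for RRN
        self.primary = {}   # primary key -> RRN
        self.by_name = {}   # secondary key -> set of primary keys (not RRNs)

    def add(self, student_id: str, name: str, major: str) -> None:
        pk = canonical(student_id)
        if pk in self.primary:
            raise KeyError(f"duplicate primary key {pk}")
        self.records.append((pk, name, major))
        # Addition touches every index, always using the canonical key form.
        self.primary[pk] = len(self.records) - 1
        self.by_name.setdefault(canonical(name), set()).add(pk)

    def delete(self, student_id: str) -> None:
        # Only the data file and the primary index are updated;
        # the secondary index keeps a stale primary key.
        rrn = self.primary.pop(canonical(student_id))
        self.records[rrn] = None

    def find_by_name(self, name: str) -> list:
        found = []
        for pk in sorted(self.by_name.get(canonical(name), ())):
            rrn = self.primary.get(pk)  # the extra lookup through the primary index
            if rrn is not None:         # a missing key means the record was deleted earlier
                found.append(self.records[rrn])
        return found
```

After `add("s1", "Smith", "CPSC")`, `add("s2", " smith ", "MATH")`, and `delete("S1")`, a call to `find_by_name("SMITH")` returns only the second record, even though the secondary index still lists `S1`.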
· Retrieval using combinations of secondary keys
· If a query is posed that requires
information from multiple secondary indexes, then simple Boolean AND or OR operations on the
sets resulting from the queries can be performed
· Example: Find all data records with student_name = “SMITH” and student_major = “CPSC”
The execution of this query might first yield the set of primary keys whose
student_name is “SMITH” and then the set of primary keys whose student_major is
“CPSC”. The intersection of these two sets yields the desired records.
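A minimal Python sketch of this kind of combined query, assuming each secondary index maps a secondary key to a set of primary keys. The index contents and the `lookup` helper are made up for illustration.

```python
# Hypothetical secondary indexes: secondary key -> set of primary keys.
name_index = {"SMITH": {"S1", "S4", "S7"}, "JONES": {"S2"}}
major_index = {"CPSC": {"S1", "S2", "S7"}, "MATH": {"S4"}}


def lookup(index: dict, key: str) -> set:
    """Return the primary keys filed under key, or an empty set if there are none."""
    return index.get(key, set())


# Boolean AND is set intersection; Boolean OR is set union.
both = lookup(name_index, "SMITH") & lookup(major_index, "CPSC")
either = lookup(name_index, "SMITH") | lookup(major_index, "CPSC")
```

Only the primary keys in the final set need to be looked up in the primary index, so the data file is read once per matching record.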
· Inverted Lists
· Problems with indexes thus
far
· The index files must be
rearranged each time a new record is added to the file
· If duplicate secondary keys
exist, the secondary key value is repeated for each entry, thus wasting space
· Solutions
· Associate an array of primary
key references with each secondary key
· This eliminates the need to
add new index entries if duplicate entries already exist.
· This avoids wasting space
on duplicated secondary keys since each secondary key value is stored only once
· This solution may also waste
space if the array is fixed length, since unused slots result in internal fragmentation
· Use a linked list rather
than an array (an inverted list)
· The entries in the secondary
index file are list heads that point (by RBO) into an entry sequenced file of the
primary key values formerly stored in the array.
· The entries in the entry
sequenced file are linked together via a stack data structure
· Entries in the secondary
index file need to be rearranged only when a new secondary key is added
· Entries in the associated primary
key (entry sequenced) reference file never require sorting (rearranging). Deleted space can be reused with the methods
described in Chapter 5.
· Problems
· Adding entries into the
secondary index still requires sorting (ugh!)
· Traversal of the entry
sequenced file may require numerous seeks
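The inverted-list layout can be sketched in Python as follows. This is an illustration under assumptions rather than the textbook's file format: list positions stand in for RBOs, `-1` marks the end of a list, new entries are pushed at the head of their list, and the class and method names are hypothetical.

```python
import bisect


class InvertedList:
    """Sorted secondary index of list heads plus an entry-sequenced reference file."""

    def __init__(self):
        self.keys = []     # secondary keys, kept sorted
        self.heads = []    # parallel to keys: position of the first entry in each list
        self.entries = []  # entry-sequenced "file" of (primary_key, next_position)

    def add(self, secondary_key: str, primary_key: str) -> None:
        i = bisect.bisect_left(self.keys, secondary_key)
        if i == len(self.keys) or self.keys[i] != secondary_key:
            # Only a brand-new secondary key rearranges the sorted index.
            self.keys.insert(i, secondary_key)
            self.heads.insert(i, -1)
        # Append to the reference file (never re-sorted) and link at the head.
        self.entries.append((primary_key, self.heads[i]))
        self.heads[i] = len(self.entries) - 1

    def primary_keys(self, secondary_key: str) -> list:
        i = bisect.bisect_left(self.keys, secondary_key)
        if i == len(self.keys) or self.keys[i] != secondary_key:
            return []
        found = []
        pos = self.heads[i]
        while pos != -1:  # on disk, each hop here could be a separate seek
            pk, pos = self.entries[pos]
            found.append(pk)
        return found
```

Adding `("CPSC", "S1")`, `("MATH", "S2")`, and `("CPSC", "S3")` leaves two entries in the sorted index and three in the reference file; `primary_keys("CPSC")` then walks the chain and returns `["S3", "S1"]`.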
· Selective Indexes
· A secondary index can be
partitioned into multiple secondary indexes, using an equivalence relation that
groups similar index entries together.
· Search requests will contain
data that selects the appropriate index
· Searching may be faster
since each index will be smaller
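One way to picture a selective index in Python, assuming the equivalence relation is "same major" and that records are dictionaries with `id`, `name`, and `major` fields. All of these names are hypothetical choices made for the sketch.

```python
from collections import defaultdict


def partition_of(record: dict) -> str:
    """Assumed equivalence relation: records with the same major belong together."""
    return record["major"]


class SelectiveIndexes:
    """One small name index per partition instead of a single large name index."""

    def __init__(self):
        # partition -> {student name -> set of primary keys}
        self.indexes = defaultdict(dict)

    def add(self, record: dict) -> None:
        index = self.indexes[partition_of(record)]
        index.setdefault(record["name"], set()).add(record["id"])

    def search(self, major: str, name: str) -> set:
        # The search request carries the data (major) that selects the index.
        return self.indexes.get(major, {}).get(name, set())
```

A search for `("CPSC", "SMITH")` consults only the CPSC index, so it never scans entries for students in other majors.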
· Binding
· When is a key bound to its associated record?
· At what point in time does a key point to the physical address of its associated
record?
· Primary keys – at the time the primary key is written to its index entry (store time)
· Secondary keys – at the time they are actually used (access time)
· Early (tight) binding results in faster access
· May require more reorganization - expensive
· Late binding results in greater flexibility
· Can check against problems with referential integrity
· May hurt performance
· May eliminate the need for reorganization
· In general it is better to make changes in only one place, but performance
considerations may require compromises in design