CpSc 360

Lecture 11

 

Chapter 6 (continued)

 

·      Indexes that are too large to hold in Memory

·       Problems with binary searching of indexes on disk

·       Binary searching of an index on disk requires several seeks and is not substantially faster than binary searching of the data itself.

·       Index maintenance (rearrangement) on disk can be millions of times more expensive than the equivalent operation in memory

·       Possible solutions

·       Use a hashed organization if direct access is a top priority

·       Use a tree-structured index (such as a B-tree) if both direct and sequential access are important
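
To make the cost of binary searching concrete, the sketch below counts the probes a binary search makes. The function names and the 1000-entry index are illustrative assumptions; the point is only that each probe would be a separate seek if the index lived on disk.

```python
def max_probes(num_entries: int) -> int:
    """Worst-case probes for a binary search over num_entries sorted keys."""
    # For n >= 1, n.bit_length() equals floor(log2(n)) + 1.
    return num_entries.bit_length()


def binary_search_probes(keys, target):
    """Return (position or -1, number of probes made)."""
    lo, hi = 0, len(keys) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # on disk, each probe of keys[mid] would cost a seek
        if keys[mid] == target:
            return mid, probes
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


# Hypothetical index of 1000 sorted keys.
keys = list(range(0, 2000, 2))
position, probes = binary_search_probes(keys, 1)  # 1 is not in the index
```

With 1000 entries the worst case is 10 probes. That is cheap in memory, but on disk it means up to 10 seeks per lookup.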

 

·      Indexing to provide access by Multiple Keys

·       Files often use multiple indexes to access data

·       A library card catalog provides access by author, title, and subject, in addition to the normal access by call number

·       Books (data records) are sorted (stored on the shelves) according to the primary key index of call number

·       Note that the primary key is normally unique but secondary keys may have duplicate entries

·       Operations on indexed files

·       Record addition

·       Entries must be made to all index files (primary and secondary)

·       Keys must be converted to a standard canonical form before they are entered into any index

·       Record deletion

·       The record must be removed from the data file and its entry removed from the primary index file

·       Indexes must possibly be rearranged after deletion to recover the space left open

·       Deletion from secondary indexes is mandatory if the RBO (or RRN) of data records is stored in the secondary index, since the records to which they point will be gone

·       Deletion from secondary indexes can be expensive

·       Deletion from secondary indexes can be avoided if the primary (unique) key is stored in the secondary index rather than the RBO (or RRN).

·       An extra seek is required to access the primary index

·       If the primary index doesn’t contain the key then the record was deleted earlier

·       May be a good idea if there are numerous secondary indexes with frequent deletions

·       Record updating

·       Changes to data records may require modifications to primary and secondary index entries, since the key values stored in the indexes may change

·       If the physical location of a data record changes (due to a change that is implemented as a deletion followed by an addition) then the contents of all index entries containing RBO must be changed

·       If the secondary indexes contain primary key values (rather than RBO) then they need not be changed if the primary key remains constant
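
The addition and deletion rules above can be sketched as follows. This is a minimal in-memory illustration, not a disk implementation: the class name, the field layout (id, name, major), and the use of Python dictionaries in place of sorted index files are all assumptions. The secondary index stores primary keys rather than RBO/RRN, so deletion touches only the data file and the primary index.

```python
class IndexedFile:
    """Hypothetical student file with one primary and one secondary index."""

    def __init__(self):
        self.records = []    # entry-sequenced data file; list position acts as RRN
        self.primary = {}    # primary key -> RRN
        self.by_major = {}   # secondary key -> list of primary keys

    @staticmethod
    def canonical(key):
        # One canonical form for every index entry: trimmed, upper case.
        return key.strip().upper()

    def add(self, student_id, name, major):
        pk = self.canonical(student_id)
        if pk in self.primary:
            raise KeyError("duplicate primary key: " + pk)
        self.records.append((pk, name, self.canonical(major)))
        self.primary[pk] = len(self.records) - 1
        # An entry is made in every index, primary and secondary.
        self.by_major.setdefault(self.canonical(major), []).append(pk)

    def delete(self, student_id):
        pk = self.canonical(student_id)
        rrn = self.primary.pop(pk)   # remove the primary index entry
        self.records[rrn] = None     # mark the data slot as free
        # The secondary index is deliberately left untouched.

    def find_by_major(self, major):
        key = self.canonical(major)
        found = []
        for pk in self.by_major.get(key, []):
            rrn = self.primary.get(pk)   # the extra lookup in the primary index
            if rrn is None:
                continue                 # key missing: record was deleted earlier
            record = self.records[rrn]
            if record[2] == key:         # guard against a reused primary key
                found.append(record)
        return found
```

After a deletion, a stale primary key remains in the secondary index but is filtered out at lookup time, which is the trade of one extra primary-index access for cheaper deletes.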

·      Retrieval using combinations of secondary keys

·       If a query is posed that requires information from multiple secondary indexes, then simple Boolean AND or OR operations on the sets resulting from the queries can be performed

·       Example:

 

Find all data records with student_name = “SMITH” and student_major = “CPSC”

 

Executing this query might first yield the set of primary keys whose student_name is “SMITH” and then the set of primary keys whose student_major is “CPSC”.  The intersection of these two sets identifies the desired records.
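
The example query can be sketched with Python sets. The index contents and student numbers below are made-up illustration data, and each secondary index is assumed to map a key to a set of primary keys.

```python
# Hypothetical secondary indexes: secondary key -> set of primary keys.
name_index = {
    "SMITH": {"1001", "1004", "1009"},
    "JONES": {"1002"},
}
major_index = {
    "CPSC": {"1001", "1002", "1009"},
    "MATH": {"1004"},
}


def lookup(index, key):
    """Return the set of primary keys for a secondary key (empty if absent)."""
    return index.get(key, set())


# AND: records that satisfy both conditions (set intersection).
smith_and_cpsc = lookup(name_index, "SMITH") & lookup(major_index, "CPSC")

# OR: records that satisfy either condition (set union).
smith_or_cpsc = lookup(name_index, "SMITH") | lookup(major_index, "CPSC")
```

Only the primary keys in the resulting set need to be looked up in the primary index, so the data file is read once per matching record.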

 

·      Inverted Lists

·       Problems with indexes thus far

·       The index files must be rearranged each time a new record is added to the file

·       If duplicate secondary keys exist, the secondary key value is repeated for each entry, thus wasting space

·       Solutions

·       Associate an array of primary key references for each secondary key

·       This eliminates the need to add new index entries if duplicate entries already exist.

·       This avoids wasting space on duplicated secondary keys, since each secondary key value is stored only once

·       This solution may also waste space if the array is fixed length and thus results in internal fragmentation

·       Use a linked list rather than an array (an inverted list)

·       The entries in the secondary index file are list heads: each holds the RBO of the first entry in an entry-sequenced file containing the primary key values formerly stored in the array.

·       The entries in the entry sequenced file are linked together via a stack data structure

·       Entries in the secondary index file need to be rearranged only when a new secondary key is added

·       Entries in the associated primary key (entry sequenced) reference file never require sorting (rearranging).  Deleted space can be reused with the methods described in Chapter 5.

·       Problems

·  Adding entries into the secondary index still requires sorting (ugh!)

·  Traversal of the entry sequenced file may require numerous seeks
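
The linked-list organization can be sketched as below. This is an in-memory illustration under stated assumptions: the class and attribute names are hypothetical, a dictionary stands in for the sorted secondary index file, and list positions stand in for RBO values. New references are pushed at the head of each list, matching the stack-style linking described above.

```python
NIL = -1  # marks the end of a linked list


class InvertedList:
    def __init__(self):
        self.heads = {}  # secondary key -> position of the first list entry
        self.refs = []   # entry-sequenced file of [primary key, next position]

    def add(self, secondary_key, primary_key):
        old_head = self.heads.get(secondary_key, NIL)
        # Append to the end of the reference file; nothing is ever re-sorted.
        self.refs.append([primary_key, old_head])
        # Only the list head in the secondary index changes.
        self.heads[secondary_key] = len(self.refs) - 1

    def primary_keys(self, secondary_key):
        """Follow the links; on disk each step could cost a seek."""
        position = self.heads.get(secondary_key, NIL)
        keys = []
        while position != NIL:
            primary_key, position = self.refs[position]
            keys.append(primary_key)
        return keys


inv = InvertedList()
inv.add("CPSC", "1001")
inv.add("MATH", "1004")
inv.add("CPSC", "1009")
```

Entries for one secondary key end up scattered through the reference file, which is why a traversal may require numerous seeks.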

 

·      Selective Indexes

·       A secondary index can be partitioned into multiple secondary indexes, using an equivalence relation that groups similar index entries together.

·       Search requests will contain data that selects the appropriate index

·       Searching may be faster since each index will be smaller
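
One possible reading of a selective index is sketched below. The class name, the record fields, and the choice of class level as the partitioning rule are assumptions made for illustration only; any equivalence relation on the records would serve.

```python
from collections import defaultdict


class SelectiveIndex:
    """A secondary index split into one smaller index per partition."""

    def __init__(self, selector):
        self.selector = selector  # maps a record to its partition label
        # partition label -> {secondary key -> [primary keys]}
        self.partitions = defaultdict(dict)

    def add(self, record, secondary_key, primary_key):
        part = self.selector(record)
        self.partitions[part].setdefault(secondary_key, []).append(primary_key)

    def search(self, partition, secondary_key):
        # The request names the partition, so only that index is searched.
        return self.partitions.get(partition, {}).get(secondary_key, [])


# Hypothetical records partitioned by class level.
students = [
    {"id": "1001", "name": "SMITH", "level": "undergrad"},
    {"id": "1004", "name": "SMITH", "level": "grad"},
    {"id": "1002", "name": "JONES", "level": "undergrad"},
]
name_index = SelectiveIndex(lambda record: record["level"])
for student in students:
    name_index.add(student, student["name"], student["id"])
```

A search for graduate students named SMITH consults only the "grad" partition and never examines the undergraduate entries.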

 

·      Binding

·                 When is a key bound to its associated record?

·                           At what point in time does a key point to the physical address of its associated record?

·                           Primary keys – at the time the primary key is written to its index entry (store time)

·                           Secondary keys – at the time they are actually used (access time)

·                 Early (tight) binding results in faster access

·                           May require more reorganization - expensive

·                 Late binding results in greater flexibility

·                           Can check against problems with referential integrity

·                           May hurt performance

·                           May eliminate the need for reorganization

·                 In general, it is better to make changes in only one place, but performance considerations may require compromises in design
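
The trade-off between early and late binding can be sketched by relocating a record. All names and data here are hypothetical: one secondary index stores physical addresses (bound at store time) and the other stores primary keys (bound at access time), and the relocation updates only the primary index.

```python
records = {0: ("1001", "SMITH")}   # physical address -> record
primary = {"1001": 0}              # primary key -> physical address
early_bound = {"SMITH": [0]}       # secondary key -> physical addresses
late_bound = {"SMITH": ["1001"]}   # secondary key -> primary keys


def relocate(primary_key, new_address):
    """Move a record and update the primary index only."""
    old_address = primary[primary_key]
    records[new_address] = records.pop(old_address)
    primary[primary_key] = new_address


def fetch_early(name):
    # Direct access with no extra lookup, but addresses may be stale.
    return [records.get(address) for address in early_bound.get(name, [])]


def fetch_late(name):
    # One extra primary-index lookup per key, resolved at access time.
    return [records[primary[pk]]
            for pk in late_bound.get(name, []) if pk in primary]


relocate("1001", 7)
```

After the move, the late-bound index still finds the record, while the early-bound index points at an empty address until every such index is reorganized.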