CpSc 360

Lecture 13

 

Chapter 7 (continued)

 

·      Merging as a Way of Sorting Large Files on Disk

 

·      Opportunities for improvement

·      Hardware improvements

·      Increase the amount of RAM

·      Increase the number of disk drives

·      Increase the number of I/O channels

·      Decrease the number of seeks using multiple-step merges

·      Merge runs together in groups rather than all together

·      Merge results of groups of runs, etc.

·      Construct a merge tree with a dispersion factor of 2^n, where 2^n is the number of files merged together in each merge step

·      E.g., if n = 2 (a 4-way merge), then starting with the final pass and working backwards, the passes would handle:

 

4, 16, 64, …   runs

 

·      Would some base other than 2 work better?
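
The multiple-step merge above can be sketched in a few lines of Python. This is only an illustration, not the textbook's code: the runs are in-memory lists rather than disk files, and the names multistep_merge and estimated_read_seeks are made up for this example. The seek estimate rests on a simplifying assumption that is not stated in these notes: every initial run is exactly one memory-load long, memory is divided evenly among the input buffers of a k-way merge, and each buffer refill costs one seek. Under that assumption a pass over N memory-loads of data with a k-way merge costs about N*k read seeks.

```python
import heapq


def multistep_merge(runs, order):
    """Merge sorted runs in groups of `order` until a single run remains.

    Returns the merged run and the number of passes that were needed.
    """
    if order < 2:
        raise ValueError("merge order must be at least 2")
    passes = 0
    while len(runs) > 1:
        # One pass: merge each group of `order` adjacent runs into one run.
        runs = [list(heapq.merge(*runs[i:i + order]))
                for i in range(0, len(runs), order)]
        passes += 1
    return (runs[0] if runs else []), passes


def estimated_read_seeks(num_runs, order):
    """Rough read-seek count under the assumptions in the lead-in.

    Each pass reads all `num_runs` memory-loads of data through buffers
    that are 1/k of memory each, so it needs about num_runs * k seeks.
    """
    total = 0
    runs_left = num_runs
    while runs_left > 1:
        k = min(order, runs_left)       # the last pass may merge fewer runs
        total += num_runs * k
        runs_left = -(-runs_left // k)  # ceiling division
    return total


# Example: 64 initial runs.
# estimated_read_seeks(64, 64) -> 4096  (one 64-way merge)
# estimated_read_seeks(64, 4)  -> 768   (three passes of 4-way merges)
```

The lower-order merge reads the data three times instead of once, which is the extra transmission time, but under this model it needs far fewer seeks. Trying other values of order is one way to explore the question of whether some base other than 2 works better.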

 

·      Algorithmically increase the lengths of the initial sorted runs

·      Longer runs → fewer total runs

·      Fewer total runs → a lower-order merge

·      Lower-order merge → bigger buffers

·      Bigger buffers → fewer seeks

·      But how do we get bigger buffers without buying more memory?

·      Use a replacement selection algorithm

·      Read as many records as memory allows and arrange them into a heap ordered on the key, as in heapsort.  This is the primary heap.

·      Don’t write out the whole heap but rather write out only the record in the heap with the smallest key value

·      Bring in a new record and compare its key value with the key value of the record just written out

·      If the new key value >= the key value just written, insert the new record into its proper place in the primary heap and again output the record with the smallest key value

·      If the new key value < the key value just written, the record cannot be part of the current run, so place it in a secondary heap of records with key values lower than those already written out.

·      Continue until there are no more records in the primary heap or the secondary heap reaches capacity.  If the secondary heap reaches capacity, output the remainder of the primary heap, then swap the roles of the two heaps (the secondary heap becomes the primary heap for the next run) and continue the process.

·      On randomly ordered input, replacement selection will output runs of approximately 2*P records on average, where P is the number of records that fit in memory.

·      Since buffering of input and output is required (to reduce seek and rotational latency times) and a secondary heap is required, the output runs will in practice be shorter than 2*P.

·      If the number of records is sufficiently large, multi-step merging techniques provide far more benefit than replacement selection.

·      Both techniques used together will generally be advantageous.
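
The replacement selection steps above can be sketched as follows. This is a minimal illustration, assuming records are bare keys held in Python lists and that capacity stands for P, the number of records that fit in memory; the function name replacement_selection is made up for this example. Because the primary and secondary heaps share the same P slots, the secondary heap is full exactly when the primary heap becomes empty, so the sketch swaps the heaps at that point.

```python
import heapq
from itertools import islice

_END = object()  # marks exhausted input


def replacement_selection(records, capacity):
    """Yield sorted runs built from `records` using `capacity` memory slots."""
    source = iter(records)
    primary = list(islice(source, capacity))  # fill memory with records
    heapq.heapify(primary)                    # primary heap, smallest on top
    secondary = []                            # records held for the next run
    run = []
    while primary:
        smallest = heapq.heappop(primary)     # output the smallest key
        run.append(smallest)
        new = next(source, _END)              # bring in a new record
        if new is not _END:
            if new >= smallest:
                heapq.heappush(primary, new)    # still fits the current run
            else:
                heapq.heappush(secondary, new)  # must wait for the next run
        if not primary:
            # Current run is finished: emit it and swap the two heaps.
            yield run
            run = []
            primary, secondary = secondary, []


# Example with room for 3 records in memory:
# list(replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6], 3))
# -> [[3, 5, 8, 9], [1, 2, 4, 6, 7]]
```

In the example the runs hold 4 and 5 records even though only 3 fit in memory at once, and input that is already sorted comes out as a single run regardless of capacity.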

·      Find ways to overlap I/O operations

·      Use multiple disk drives

·      Use one drive for input and the other for output

·      I/O can be overlapped, so transmission time can be reduced by as much as 50%

·      Seeking is virtually eliminated

·      Use double buffering for both input and output

·      Use additional disks (3,4,…)

·      Use additional processors for the sort phase

·      Multiprogramming is likely to reduce the performance of an external sort (and of any other job in the system), but overall system performance may increase
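
Double buffering on the input side can be sketched with a background reader thread. This is an illustration under assumptions, not a tuned implementation: stream is any binary file-like object that supports readinto, the name read_double_buffered is made up for this example, and error handling is omitted. Two buffers circulate between the reader and the consumer, so one can be filled while the other is being processed.

```python
import queue
import threading


def read_double_buffered(stream, block_size):
    """Yield blocks from `stream` while a second buffer is filled behind.

    Each yielded memoryview is valid only until the next block is
    requested, because its buffer is then handed back to the reader.
    """
    free = queue.Queue()    # buffers ready to be filled
    filled = queue.Queue()  # buffers ready to be processed
    for _ in range(2):      # exactly two buffers: double buffering
        free.put(bytearray(block_size))

    def reader():
        while True:
            buf = free.get()
            count = stream.readinto(buf)  # fill while the consumer works
            filled.put((buf, count))
            if not count:                 # end of input
                return

    threading.Thread(target=reader, daemon=True).start()

    while True:
        buf, count = filled.get()
        if not count:
            return
        yield memoryview(buf)[:count]     # consumer processes this block
        free.put(buf)                     # hand the buffer back for refilling
```

A consumer would loop over read_double_buffered(f, block_size) for a file f opened in binary mode and finish with each block before asking for the next one. How much overlap this gives in practice depends on the operating system and the drives involved.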

·      General rules for external sorts:

·      For in-RAM sorting, use heapsort for forming the original list of sorted elements in a run.  With it and double buffering, we can overlap input and output with internal processing.

·      Use as much RAM as possible.  It makes the runs longer and provides bigger and/or more buffers during the merge phase.

·      If the number of initial runs is so large that total seek and rotation time is much greater than total transmission time, use a multistep merge.  It increases the amount of transmission time but can decrease the number of seeks enormously.

·      Consider using replacement selection for initial run formation, especially if there is a possibility that the runs will be partially ordered.

·      Use more than one disk drive and I/O channel so reading and writing can overlap.  This is especially effective if there are no other users on the system.

·      Keep in mind the fundamental elements of external sorting and their relative costs, and look for ways to take advantage of new architectures and systems, such as parallel processing and high-speed local area networks.