CpSc 360

Lecture 12

 

Chapter 7

 

·      Sections 7.1 – 7.4 were already covered in earlier class notes

·      Merging as a Way of Sorting Large Files on Disk

·      Sorting data records or even keys in RAM is often infeasible with large files

·      Sort as large a set of records as can fit in available RAM and write the results to a temporary disk file (a run)

·      Repeat the partial sorts in the above step as many times as is necessary, each time creating a temporary run on disk

·      Merge the resulting runs to produce a completely sorted file
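The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function names (`write_runs`, `merge_runs`, `external_sort`) and the `max_in_ram` parameter are hypothetical, records are assumed to be newline-free strings, and the standard library's `heapq.merge` stands in for the K-way merge.

```python
import heapq
import os
import tempfile


def write_runs(records, max_in_ram):
    """Sort chunks of at most max_in_ram records; write each chunk to its own run file."""
    run_paths = []
    chunk = []

    def flush():
        chunk.sort()  # the in-RAM sort of one run
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as run:
            run.writelines(record + "\n" for record in chunk)
        run_paths.append(path)
        chunk.clear()

    for record in records:
        chunk.append(record)
        if len(chunk) == max_in_ram:
            flush()
    if chunk:  # final, possibly shorter, run
        flush()
    return run_paths


def merge_runs(run_paths):
    """K-way merge of the sorted run files; yields records in sorted order."""
    files = [open(path) for path in run_paths]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)  # runs are temporary


def external_sort(records, max_in_ram):
    """Collect the merged output in a list (for demonstration only)."""
    return list(merge_runs(write_runs(records, max_in_ram)))


if __name__ == "__main__":
    print(external_sort(["pear", "apple", "fig", "kiwi", "date"], max_in_ram=2))
```

In a real setting the merged records would be written straight to the output file rather than collected in a list, since the point of the technique is that the whole file does not fit in RAM.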

·      Advantages of this technique:

·      Arbitrarily large files can be sorted (subject to available disk space)

·      Reading the input data file is sequential and hence fast.

·      Reading the sorted runs and writing the final sorted file are both sequential and hence fast. Seeks are required only when switching between runs during the merge operation.

·      The in-RAM sort can be performed during the I/O operations associated with writing sorted runs to disk (double buffering of sort runs).

·      Tapes can be used for input and output operations since all I/O is sequential

·      Performance considerations

·      Reading records into RAM for sorting and forming runs

·      Writing sorted runs out to multiple disk files

·      Reading sorted runs into RAM for merging

·      Writing sorted file out to disk

·      Analysis

·      The number of seeks increases as the number of files containing runs increases

·      Large input files will create large numbers of files containing runs

·      The merge operation will likely seek to a different file containing a run after each output operation to the sorted file

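·      The number of seeks required for a K-way merge of K runs (where each run is as large as the available RAM) is K^2: RAM is divided into K input buffers, so each buffer holds only 1/K of a run, and each of the K runs therefore needs K seeks to read in full. The order of complexity is thus O(K^2).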

·      Large files begin to suffer extreme performance problems!
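To make the quadratic growth concrete, the short sketch below estimates the seek count for a single-step merge. The function name and the file and RAM sizes are made up for illustration, and the model counts only the seeks needed to refill input buffers during the merge.

```python
import math


def single_step_merge_seeks(file_size, ram_size):
    """Return (K, estimated seeks) for merging all runs in one K-way merge."""
    k = math.ceil(file_size / ram_size)  # number of runs, each as large as RAM
    # RAM is split into k input buffers, so each buffer holds 1/k of a run
    # and each run needs k buffer refills (one seek each): k * k in total.
    return k, k * k


if __name__ == "__main__":
    # Hypothetical sizes in MB, with 10 MB of RAM available for sorting.
    for file_mb in (100, 1_000, 10_000):
        k, seeks = single_step_merge_seeks(file_mb, ram_size=10)
        print(f"{file_mb:>6} MB file: K = {k:>4} runs, about {seeks:>9,} seeks")
```

Under this model, a tenfold increase in file size produces a hundredfold increase in seeks, which is why large files suffer so badly.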

 

·      Opportunities for improvement

·      Hardware improvements

·      Increase the amount of RAM

·      Increase the number of disk drives

·      Increase the number of I/O channels

·      Decrease the number of seeks using multiple-step merges

·      Algorithmically increase the lengths of the initial sorted runs

·      Find ways to overlap I/O operations
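The effect of a multiple-step merge on the seek count can be estimated with the same buffer-refill model used in the analysis. The sketch below is an approximation under stated assumptions: the function names and the run counts are hypothetical, every run is as large as RAM, the group size divides the number of runs evenly, and only input seeks are counted.

```python
def one_step_seeks(k):
    """Seeks for a single k-way merge of k RAM-sized runs."""
    return k * k


def two_step_seeks(k, group_size):
    """Seeks when k runs are first merged in groups of group_size,
    and the resulting intermediate runs are then merged together."""
    if k % group_size != 0:
        raise ValueError("this sketch assumes group_size divides k evenly")
    groups = k // group_size
    # Step 1: each group is a group_size-way merge of RAM-sized runs,
    # costing group_size * group_size seeks per group.
    step1 = groups * group_size * group_size
    # Step 2: a groups-way merge. Each buffer is 1/groups of RAM and each
    # intermediate run is group_size RAM-loads long, so one run needs
    # group_size * groups = k refills.
    step2 = groups * k
    return step1 + step2


if __name__ == "__main__":
    k, g = 800, 32  # hypothetical: 800 runs merged 32 at a time
    print("one step :", one_step_seeks(k))
    print("two steps:", two_step_seeks(k, g))
```

The reduction in seeks is not free: in a two-step merge every record is read and written one extra time, so the saving in seek time has to outweigh the added transfer time.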