CpSc 360
Chapter 7
· Sections 7.1 – 7.4 already covered in earlier class notes
· Merging as a Way of Sorting Large Files on Disk
· Sorting data records or even keys in RAM is often infeasible with large files
· Sort as large a set of records as can fit in available RAM and write the results to a temporary disk file (a run)
· Repeat the partial sorts in the above step as many times as is necessary, each time creating a temporary run on disk
· Merge the resulting runs to produce a completely sorted file
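
The sort, repeat, and merge steps above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation. It assumes one text record per line, and the names external_sort and max_records are hypothetical, with max_records standing in for "as many records as fit in RAM".

```python
import heapq
import itertools
import os
import tempfile


def external_sort(input_path, output_path, max_records):
    """Sort a line-oriented text file using sorted runs plus one K-way merge.

    Returns the number of runs that were created.
    """
    with tempfile.TemporaryDirectory() as run_dir:
        run_paths = []

        # Steps 1 and 2: read up to max_records records, sort them in RAM,
        # and write each sorted batch to its own temporary run file.
        with open(input_path) as src:
            while True:
                chunk = [line.rstrip("\n")
                         for line in itertools.islice(src, max_records)]
                if not chunk:
                    break
                chunk.sort()
                run_path = os.path.join(run_dir, f"run_{len(run_paths)}.txt")
                with open(run_path, "w") as run:
                    run.writelines(rec + "\n" for rec in chunk)
                run_paths.append(run_path)

        # Step 3: merge all runs. heapq.merge reads each run lazily, so only
        # one pending record per run is held in RAM at any moment.
        run_files = [open(path) for path in run_paths]
        try:
            streams = [(line.rstrip("\n") for line in f) for f in run_files]
            with open(output_path, "w") as out:
                out.writelines(rec + "\n" for rec in heapq.merge(*streams))
        finally:
            for f in run_files:
                f.close()

        return len(run_paths)
```

Calling external_sort("data.txt", "sorted.txt", max_records=100_000) on a hypothetical data.txt would create one run per 100,000 input lines and then merge them all in a single pass. Records are compared as plain strings here; a real file would need a key-extraction step.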
· Advantages of this technique:
  · Arbitrarily large files can be sorted (subject to available disk space)
  · Reading the input data file is sequential and hence fast
  · Reading each sorted run and writing the final sorted file are both sequential and hence fast. Seeks are required only when switching between runs during the merge operation
  · The in-RAM sort can be overlapped with the I/O operations that write sorted runs to disk (double buffering of sort runs)
  · Tapes can be used for input and output operations since all I/O is sequential
· Performance considerations
  · Reading records into RAM for sorting and forming runs
  · Writing sorted runs out to multiple disk files
  · Reading sorted runs into RAM for merging
  · Writing sorted file out to disk
· Analysis
  · The number of seeks increases as the number of files containing runs increases
  · Large input files will create large numbers of files containing runs
  · The merge operation will likely seek to a different file containing a run after each output operation to the sorted file
  · The number of seeks required for a K-way merge of K runs (where each run is as large as the RAM space available) is K^2: RAM must be divided into K input buffers, so each buffer holds only 1/K of a run and each of the K runs takes K seeks to read. The order of complexity is thus O(K^2)
  · Because the seek count grows quadratically with the number of runs, large files suffer severe performance problems
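
The K^2 figure in the analysis above can be checked with a short calculation. The sketch below is only a model under stated assumptions: every run is exactly RAM-sized, RAM is split evenly into K input buffers, each buffer refill costs one seek, and output seeks are ignored. The function names and the 100,000-record RAM capacity are illustrative, not taken from the notes.

```python
import math


def runs_needed(file_records, ram_records):
    """Number of RAM-sized runs (K) produced when sorting the whole file."""
    return math.ceil(file_records / ram_records)


def single_step_merge_seeks(num_runs):
    """Input seeks for one K-way merge of K RAM-sized runs (model above)."""
    # Each of the K buffers holds 1/K of a run, so reading a whole run
    # takes K buffer refills, i.e. K seeks per run.
    seeks_per_run = num_runs
    return num_runs * seeks_per_run


# Assumed RAM capacity of 100,000 records; only the ratio matters.
for file_records in (1_000_000, 10_000_000, 100_000_000):
    k = runs_needed(file_records, ram_records=100_000)
    print(f"{file_records:,} records: K = {k}, "
          f"about {single_step_merge_seeks(k):,} seeks")
```

Under this model a tenfold increase in file size produces a hundredfold increase in seeks, which is the quadratic growth the analysis warns about.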
· Opportunities for improvement
  · Hardware improvements
    · Increase the amount of RAM
    · Increase the number of disk drives
    · Increase the number of I/O channels
  · Decrease the number of seeks using multiple-step merges
  · Algorithmically increase the lengths of the initial sorted runs
  · Find ways to overlap I/O operations
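
The notes do not name an algorithm for lengthening the initial runs. One classic choice is replacement selection, and the sketch below assumes that technique purely for illustration. It keeps a min-heap of at most capacity records (a stand-in for available RAM), writes out the smallest record, and reads a replacement. A replacement that is too small for the current run is held back for the next one. The function name and in-memory lists are hypothetical simplifications; a real version would stream records to and from disk.

```python
import heapq
import itertools

_END = object()  # sentinel marking exhausted input


def replacement_selection(records, capacity):
    """Split records into sorted runs while holding at most capacity in RAM."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    source = iter(records)
    heap = list(itertools.islice(source, capacity))  # initial fill of RAM
    heapq.heapify(heap)

    runs = []
    current = []   # the run currently being written
    pending = []   # records that must wait for the next run

    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)

        incoming = next(source, _END)
        if incoming is not _END:
            if incoming >= smallest:
                # Still fits in the current run, in sorted order.
                heapq.heappush(heap, incoming)
            else:
                # Too small for this run; hold it for the next one.
                pending.append(incoming)

        if not heap:
            # Current run is complete; start the next from held-back records.
            runs.append(current)
            current = []
            heap = pending
            pending = []
            heapq.heapify(heap)

    return runs


keys = [5, 47, 16, 12, 67, 21, 7, 17, 14, 58, 24, 18, 33]
print(replacement_selection(keys, capacity=3))
```

With room for only 3 records, these 13 keys come out as 2 runs, whereas sorting RAM-sized batches of 3 would give 5 runs. Fewer, longer runs mean a smaller K in the merge and therefore fewer seeks.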